Distributed clustering of categorical data using the information bottleneck framework

Details

Serval ID
serval:BIB_36A59507582F
Type
Article: article from journal or magazine.
Collection
Publications
Institution
Title
Distributed clustering of categorical data using the information bottleneck framework
Journal
Information Systems
Author(s)
Tagasovska N., Andritsos P.
ISSN
0306-4379
Publication state
Published
Issued date
12/2017
Peer-reviewed
Yes
Volume
72
Pages
161-178
Language
English
Abstract
We perform clustering of categorical data using the Information Bottleneck (IB) framework at large scale. We examine the performance of existing solutions using multiple machine architectures. The IB method uses information theory to recast database relations as probability distributions and the proximity of their tuples as the loss of information incurred when they are considered together. More precisely, we study the Agglomerative Information Bottleneck, the Sequential Information Bottleneck and LIMBO, a newer approach that uses summaries of the original data. First, we evaluate the performance and limitations of these algorithms when confronted with large datasets on a single, powerful machine. We then propose new implementations that take advantage of distributed environments. Finally, using real and large synthetic datasets tens of gigabytes in size, we evaluate their effectiveness and efficiency.
Keywords
Hardware and Architecture, Software, Information Systems
Creation date
29/11/2017 15:23
Last modification date
20/08/2019 14:24