Distributed clustering of categorical data using the information bottleneck framework

Details

Serval ID
serval:BIB_36A59507582F
Type
Article: article in a journal or magazine.
Collection
Publications
Institution
Title
Distributed clustering of categorical data using the information bottleneck framework
Journal
Information Systems
Authors
Tagasovska N., Andritsos P.
ISSN
0306-4379
Publication status
Published
Publication date
12/2017
Peer-reviewed
Yes
Volume
72
Pages
161-178
Language
English
Abstract
We perform clustering of categorical data at large scale using the Information Bottleneck (IB) framework. We examine the performance of existing solutions on multiple machine architectures. The IB method uses information theory to recast database relations as probability distributions, and the proximity of their tuples as the information lost when they are considered together. More precisely, we study the Agglomerative Information Bottleneck, the Sequential Information Bottleneck, and LIMBO, a newer approach that uses summaries of the original data. First, we evaluate the performance and limitations of these algorithms when confronted with large datasets on a single, powerful machine. We then propose new implementations that take advantage of distributed environments. Finally, using real and large synthetic datasets tens of gigabytes in size, we evaluate their effectiveness and efficiency.
Keywords
Hardware and Architecture, Software, Information Systems
Record created
29/11/2017 15:23
Record last modified
20/08/2019 14:24