Collecting and de-identifying half a million WhatsApp messages

Details

Serval ID
serval:BIB_2B279EDD8929
Type
Conference proceedings (part): original contribution to the scientific literature, published on the occasion of a scientific conference, in a proceedings volume, or in a special issue of a recognized journal.
Collection
Publications
Institution
Title
Collecting and de-identifying half a million WhatsApp messages
Conference title
Proceedings of the 10th International Conference on CMC and Social Media Corpora for the Humanities (CMC-Corpora 2023), 14–15 September 2023, University of Mannheim, Germany
Authors
Gupta Prakhar, Doudot Lliana, Loup Romain, Xanthos Aris
Publisher
Leibniz-Institut für Deutsche Sprache (IDS)
Publication status
Published
Publication date
7 September 2023
Peer-reviewed
Yes
Language
English
Notes
https://ids-pub.bsz-bw.de/frontdoor/deliver/index/docId/12095/file/CMC_Corpora_2023_Proceedings_2023.pdf
Abstract
Instant messaging (IM) applications, especially WhatsApp, have become ubiquitous in contemporary computer-mediated communication practices. IM data have the potential to constitute a rich source of research material for corpus linguistics and cultural analytics, owing to their similarities with face-to-face conversations as well as their private nature. In this work, we outline the creation process of a large curated dataset of WhatsApp messages in French. The paper covers the protocol for collecting these messages as well as the de-identification process for removing sensitive information liable to identify the users in these messages. The de-identified dataset will ultimately be made available to researchers on request.
Keywords
WhatsApp, chats, instant messaging, IM, de-identification, corpus, French
Open Access
Yes
Record created
19 September 2023, 11:18
Record last modified
20 September 2023, 6:55