Collecting and de-identifying half a million WhatsApp messages

Details

Serval ID
serval:BIB_2B279EDD8929
Type
Inproceedings: an article in a conference proceedings.
Collection
Publications
Institution
Title
Collecting and de-identifying half a million WhatsApp messages
Title of the conference
Proceedings of the 10th International Conference on CMC and Social Media Corpora for the Humanities (CMC-Corpora 2023), 14–15 September 2023, University of Mannheim, Germany
Author(s)
Gupta Prakhar, Doudot Lliana, Loup Romain, Xanthos Aris
Publisher
Leibniz-Institut für Deutsche Sprache (IDS)
Publication state
Published
Issued date
07/09/2023
Peer-reviewed
Yes
Language
English
Notes
https://ids-pub.bsz-bw.de/frontdoor/deliver/index/docId/12095/file/CMC_Corpora_2023_Proceedings_2023.pdf
Abstract
Instant messaging (IM) applications, especially WhatsApp, have become ubiquitous in contemporary computer-mediated communication practices. IM data have the potential to constitute a rich source of research material for corpus linguistics and cultural analytics, owing to their similarities with face-to-face conversations as well as their private nature. In this work, we outline the creation process of a large curated dataset of WhatsApp messages in French. The paper covers the protocol for collecting these messages as well as the de-identification process for removing sensitive information liable to identify the users in these messages. The de-identified dataset will ultimately be made available to researchers on request.
Keywords
WhatsApp, chats, instant messaging, IM, de-identification, corpus, French
Open Access
Yes
Create date
19/09/2023 11:18
Last modification date
20/09/2023 6:55