Identifying document similarity using a fast estimation of the Levenshtein Distance based on compression and signatures

Details

Resource 1 Download: Levenshtein.pdf (383.89 KB)
Status: Public
Version: Final published version
License: Not specified
Serval ID
serval:BIB_5958E6DD6D8F
Type
Conference proceedings (part): original contribution to the scientific literature, published on the occasion of scientific conferences, in a proceedings volume or in a special issue of a recognized journal.
Collection
Publications
Institution
Title
Identifying document similarity using a fast estimation of the Levenshtein Distance based on compression and signatures
Conference title
Proceedings of the Digital Forensics Research Conference Europe (DFRWS EU)
Authors
Coates Peter, Breitinger Frank
Editorial status
Published
Publication date
31/03/2022
Peer-reviewed
Yes
Language
English
Abstract
Identifying document similarity has many applications, e.g., source code analysis or plagiarism detection. However, identifying similarities is not trivial and can be computationally expensive. For instance, the Levenshtein Distance is a common metric to define the similarity between two documents, but it has quadratic runtime, which makes it impractical for large documents, where "large" starts at a few hundred kilobytes. In this paper, we present a novel concept that allows estimating the Levenshtein Distance: the algorithm first compresses documents to signatures (similar to hash values) using a user-defined compression ratio. Signatures can then be compared against each other (some constraints apply), and the outcome is the estimated Levenshtein Distance. Our evaluation shows promising results in terms of runtime efficiency and accuracy. In addition, we introduce a significance score allowing examiners to set a threshold and identify related documents.
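For reference, the quadratic runtime mentioned in the abstract comes from the textbook dynamic-programming formulation of the Levenshtein Distance, which fills one table cell per pair of characters. The sketch below is that standard exact algorithm, not the estimation method proposed in the paper; the function name `levenshtein` is chosen here for illustration only.

```python
def levenshtein(a: str, b: str) -> int:
    """Exact edit distance via the classic dynamic-programming recurrence.

    Runs in O(len(a) * len(b)) time and keeps only two rows in memory.
    """
    if len(a) < len(b):
        a, b = b, a  # iterate over the longer string, keep rows short

    # previous[j] = distance between the empty prefix of a and b[:j]
    previous = list(range(len(b) + 1))
    for i, ca in enumerate(a, start=1):
        current = [i]  # distance between a[:i] and the empty string
        for j, cb in enumerate(b, start=1):
            cost = 0 if ca == cb else 1
            current.append(min(
                previous[j] + 1,         # delete ca from a
                current[j - 1] + 1,      # insert cb into a
                previous[j - 1] + cost,  # substitute ca by cb, or match
            ))
        previous = current
    return previous[-1]


if __name__ == "__main__":
    print(levenshtein("kitten", "sitting"))
```

Because every character of one input is compared with every character of the other, two inputs of 300,000 characters each require about 9 × 10^10 cell updates, which illustrates why the abstract treats documents of a few hundred kilobytes as already large.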
Keywords
Levenshtein distance, edit distance, estimation, document similarity, approximate string matching, fingerprint, digest
Record created
18/05/2022 8:33
Record last modified
15/01/2024 7:16