Identifying document similarity using a fast estimation of the Levenshtein Distance based on compression and signatures

Coates, Peter; Breitinger, Frank

Details

Download: Levenshtein.pdf (383.89 KB)
State: Public
Version: Final published version
License: Not specified

Serval ID

serval:BIB_5958E6DD6D8F

Type

Inproceedings: an article in a conference proceedings.

Collection

Publications

Institution

UNIL/CHUV

Title

Identifying document similarity using a fast estimation of the Levenshtein Distance based on compression and signatures

Title of the conference

Proceedings of the Digital Forensics Research Conference Europe (DFRWS EU)

Author(s)

Coates Peter, Breitinger Frank

Publication state

Published

Issued date

31/03/2022

Peer-reviewed

Yes

Language

English

Abstract

Identifying document similarity has many applications, e.g., source code analysis or plagiarism detection. However, identifying similarities is not trivial and can be computationally expensive. For instance, the Levenshtein Distance is a common metric for the similarity between two documents, but its quadratic runtime makes it impractical for large documents, where "large" starts at a few hundred kilobytes. In this paper, we present a novel concept for estimating the Levenshtein Distance: the algorithm first compresses documents to signatures (similar to hash values) using a user-defined compression ratio. Signatures can then be compared against each other (some constraints apply), and the outcome is the estimated Levenshtein Distance. Our evaluation shows promising results in terms of runtime efficiency and accuracy. In addition, we introduce a significance score that allows examiners to set a threshold and identify related documents.
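
The quadratic runtime mentioned in the abstract comes from the standard dynamic-programming formulation of the Levenshtein Distance, which fills one table cell for every pair of positions in the two inputs. The sketch below shows that textbook baseline only; it is not the signature-based estimation proposed in the paper, and the function name `levenshtein` is chosen here purely for illustration.

```python
def levenshtein(a: str, b: str) -> int:
    """Exact edit distance (insert, delete, substitute), textbook DP version."""
    # Keep the shorter string in the inner loop so each row stays small.
    if len(a) < len(b):
        a, b = b, a

    # previous[j] = distance between the processed prefix of a and b[:j].
    previous = list(range(len(b) + 1))

    for i, ch_a in enumerate(a, start=1):
        current = [i]  # turning a[:i] into the empty string takes i deletions
        for j, ch_b in enumerate(b, start=1):
            cost = 0 if ch_a == ch_b else 1
            current.append(
                min(
                    previous[j] + 1,         # delete ch_a
                    current[j - 1] + 1,      # insert ch_b
                    previous[j - 1] + cost,  # substitute, or match at no cost
                )
            )
        previous = current

    return previous[-1]


if __name__ == "__main__":
    print(levenshtein("kitten", "sitting"))
```

The nested loops perform len(a) × len(b) cell updates, so comparing two documents of about 300 KB each needs roughly 9 × 10^10 updates. This is the cost that makes the exact metric impractical at the document sizes the abstract describes.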

Keywords

Levenshtein distance, edit distance, estimation, document similarity, approximate string matching, fingerprint, digest

URN

urn:nbn:ch:serval-BIB_5958E6DD6D8F0

OAI-PMH

oai:serval.unil.ch:BIB_5958E6DD6D8F

Publisher's website

https://www.researchgate.net/publication/359961968_Identifying_document_similarity_using_a_fast_estimation_of_the_Levenshtein_Distance_based_on_compression_and_signatures

Create date

18/05/2022 9:33

Last modification date

15/01/2024 8:16
