Similarity Hashing Based on Levenshtein Distances

Breitinger, Frank; Ziroff, Georg; Lange, Steffen; Baier, Harald

doi:10.1007/978-3-662-44952-3_10

Similarity Hashing Based on Levenshtein Distances

Détails

Demande d'une copie

ID Serval

serval:BIB_689E756093D0

Type

Actes de conférence (partie): contribution originale à la littérature scientifique, publiée à l'occasion de conférences scientifiques, dans un ouvrage de compte-rendu (proceedings), ou dans l'édition spéciale d'un journal reconnu (conference proceedings).

Collection

Publications

Institution

Production externe

Titre

Similarity Hashing Based on Levenshtein Distances

Titre de la conférence

Progress in Pattern Recognition, Image Analysis, Computer Vision, and Applications

Auteur⸱e⸱s

Breitinger Frank, Ziroff Georg, Lange Steffen, Baier Harald

Editeur

Springer International Publishing

ISBN

9783319125671
9783319125688

ISSN

0302-9743
1611-3349

Statut éditorial

Publié

Date de publication

2014

Editeur⸱rice scientifique

Peterson Gilbert, Shenoi Sujeet

Volume

433

Pages

133-147

Langue

anglais

Résumé

It is increasingly common in forensic investigations to use automated pre-processing techniques to reduce the massive volumes of data that are encountered. This is typically accomplished by comparing fingerprints (typically cryptographic hashes) of files against existing databases. In addition to finding exact matches of cryptographic hashes, it is necessary to find approximate matches corresponding to similar files, such as different versions of a given file.
This paper presents a new stand-alone similarity hashing approach called saHash, which has a modular design and operates in linear time. saHash is almost as fast as SHA-1 and more efficient than other approaches for approximate matching. The similarity hashing algorithm uses four sub-hash functions, each producing its own hash value. The four sub-hashes are concatenated to produce the final hash value. This modularity enables sub-hash functions to be added or removed, e.g., if an exploit for a sub-hash function is discovered. Given the hash values of two byte sequences, saHash returns a lower bound on the number of Levenshtein operations between the two byte sequences as their similarity score. The robustness of saHash is verified by comparing it with other approximate matching approaches such as sdhash.

Mots-clé

Fuzzy hashing, similarity digest, Levenshtein distance

DOI

10.1007/978-3-662-44952-3_10

Site de l'éditeur

http://dx.doi.org/10.1007/978-3-662-44952-3_10

Création de la notice

06/05/2021 12:01

Dernière modification de la notice

06/05/2021 12:34

Données d'usage

SERVAL

serveur académique lausannois

Similarity Hashing Based on Levenshtein Distances

Détails