Properties of a similarity preserving hash function and their realization in sdhash

Details

Serval ID
serval:BIB_4B0DF3C87990
Type
Inproceedings: an article in a conference proceedings.
Collection
Publications
Title
Properties of a similarity preserving hash function and their realization in sdhash
Title of the conference
2012 Information Security for South Africa
Author(s)
Breitinger Frank, Baier Harald
Publisher
IEEE
ISBN
9781467321594
9781467321600
9781467321587
Publication state
Published
Issued date
08/2012
Language
English
Abstract
Finding similarities between byte sequences is a complex task and necessary in many areas of computer science, e.g., to identify malicious files or spam. Instead of comparing files against each other, one may apply a similarity preserving compression function (hash function) first and do the comparison for the hashes. Although we have different approaches, there is no clear definition / specification or needed properties of such algorithms available. This paper presents four basic properties for similarity preserving hash functions that are partly related to the properties of cryptographic hash functions. Compression and ease of computation are borrowed from traditional hash functions and define the hash value length and the performance. As every byte is expected to influence the hash value, we introduce coverage. Similarity score describes the need for a comparison function for hash values. We shortly discuss these properties with respect to three existing approaches and finally have a detailed view on the promising approach sdhash. However, we uncovered some bugs and other peculiarities of the implementation of sdhash. Finally we conclude that sdhash has the potential to be a robust similarity preserving digest algorithm, but there are some points that need to be improved.
Keywords
cryptography, file organisation, unsolicited e-mail, byte sequences, hash function, malicious files, sdhash, Digital forensics, fuzzy hashing, properties of similarity preserving hashing, similarity preserving hashing
Create date
06/05/2021 12:01
Last modification date
06/05/2021 12:24