Towards Syntactic Approximate Matching - A Pre-Processing Experiment

Jeong, Doowon; Kang, Hari; Lee, Sangjin

doi:10.15394/jdfsl.2016.1381


Serval ID

serval:BIB_D22FC78B7B52

Type

Article: article from journal or magazine.

Collection

Publications

Institution

External production

Title

Towards Syntactic Approximate Matching - A Pre-Processing Experiment

Journal

Journal of Digital Forensics, Security and Law

Author(s)

Jeong Doowon, Kang Hari, Lee Sangjin

ISSN

1558-7223

Publication state

Published

Issued date

2016

Volume

Number

Language

English

Abstract

Over the past few years, the popularity of approximate matching algorithms (a.k.a. fuzzy hashing) has increased. Especially within the area of bytewise approximate matching, several algorithms have been published, tested, and improved. These algorithms have been shown to be powerful; however, they are sometimes too precise for real-world investigations. That is, even very small commonalities (e.g., in the header of a file) can cause a match. While this is a desired property, it may also lead to unwanted results. In this paper, we show that simple pre-processing can significantly influence the outcome. Although our test set is based on text-based file types (because they are easy to process), this technique can be used for other well-documented types as well. Our results show that, depending on the use case, it can be beneficial to focus on the content of files only. Additionally, we present a small, self-created dataset that can be used in the future for approximate matching algorithms, since it is labeled (we know which files are similar and how).

Keywords

Bytewise Approximate Matching, Pre-processing, Syntactic Similarity, Digital Forensics

DOI

10.15394/jdfsl.2016.1381

Publisher's website

https://doi.org/10.15394/jdfsl.2016.1381

Open Access

Yes

Create date

06/05/2021 12:01

Last modification date

06/05/2021 12:40
