An Efficient Type-Agnostic Approach for Finding Sub-sequences in Data

Chapuis, B.; Garbinato, B.; Andritsos, P.

doi:10.1109/hpcc-smartcity-dss.2017.35

Serval ID

serval:BIB_7EF8C37FC0BF

Type

Inproceedings: an article in a conference proceedings.

Collection

Publications

Institution

UNIL/CHUV

Title

An Efficient Type-Agnostic Approach for Finding Sub-sequences in Data

Title of the conference

2017 IEEE 19th International Conference on High Performance Computing and Communications; IEEE 15th International Conference on Smart City; IEEE 3rd International Conference on Data Science and Systems (HPCC/SmartCity/DSS)

Author(s)

Chapuis B., Garbinato B., Andritsos P.

Publisher

IEEE

ISBN

9781538625880

Publication state

Published

Issued date

12/2017

Peer-reviewed

Yes

Abstract

In this paper, we present an efficient type-agnostic approach for finding sub-sequences in data, such as text documents or GPS trajectories. Our approach relies on data deduplication for creating an inverted index. In contrast with existing data deduplication techniques that split raw sequences of characters arbitrarily, our approach preserves the semantics of the original sequence via the notion of token and can be used to index normalized data. When compared to indexing methods that preserve the semantics and operate on normalized data, our method increases the relevance of the inverted index, reduces its size, and improves its performance. As data normalization is generally not used beyond the scope of textual data, we introduce a framework that helps identify the extent to which data should be normalized regardless of its type. On this basis, we demonstrate with a dataset made of GPS trajectories that our method can be used agnostically: it can be used to index and query data of a completely different type. Finally, we show that the resulting spatial index discriminates better than classic spatial-indexing approaches.
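
The general idea of a token-based inverted index over normalized data can be illustrated with a minimal Python sketch. This is not the authors' algorithm: it does not implement their deduplication-based chunking, and it substitutes plain fixed-length token n-grams as index keys. All names here (normalize_text, normalize_gps, build_index, find_subsequence), the n-gram length, and the grid cell size are hypothetical choices made for illustration only.

```python
from collections import defaultdict


def normalize_text(text):
    """Assumed normalization for text: lowercase, whitespace-separated words."""
    return text.lower().split()


def normalize_gps(points, cell=0.01):
    """Assumed normalization for GPS: snap (lat, lon) pairs to grid cells
    and drop consecutive duplicates, so a trajectory becomes a token list."""
    tokens = []
    for lat, lon in points:
        token = (round(lat / cell), round(lon / cell))
        if not tokens or tokens[-1] != token:
            tokens.append(token)
    return tokens


def ngrams(tokens, n):
    """Yield (position, n-gram) pairs over a token sequence."""
    for i in range(len(tokens) - n + 1):
        yield i, tuple(tokens[i:i + n])


def build_index(sequences, n=2):
    """Map each token n-gram to the set of (sequence id, position) pairs
    where it occurs. The index never inspects the token type."""
    index = defaultdict(set)
    for seq_id, tokens in sequences.items():
        for pos, gram in ngrams(tokens, n):
            index[gram].add((seq_id, pos))
    return index


def find_subsequence(index, query, n=2):
    """Return ids of sequences containing `query` as a contiguous token run.
    Queries shorter than n tokens yield an empty result in this sketch."""
    grams = list(ngrams(query, n))
    if not grams:
        return set()
    _, first = grams[0]
    hits = set()
    # Every occurrence of the first n-gram is a candidate start; the
    # remaining n-grams must follow at the matching offsets.
    for seq_id, start in index.get(first, ()):
        if all((seq_id, start + off) in index.get(gram, ())
               for off, gram in grams[1:]):
            hits.add(seq_id)
    return hits


text_index = build_index({
    "a": normalize_text("The quick brown fox"),
    "b": normalize_text("A quick red fox"),
})
text_hits = find_subsequence(text_index, normalize_text("Quick Brown Fox"))

gps_index = build_index({
    "t1": normalize_gps([(46.52, 6.63), (46.53, 6.63), (46.54, 6.63)]),
})
gps_hits = find_subsequence(
    gps_index, normalize_gps([(46.53, 6.63), (46.54, 6.63)])
)
```

Because the index only sees hashable tokens, the same build_index and find_subsequence functions serve both the text and the trajectory example; only the normalization step differs per data type.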

OAI-PMH

oai:serval.unil.ch:BIB_7EF8C37FC0BF

DOI

10.1109/hpcc-smartcity-dss.2017.35

Create date

28/02/2018 10:51

Last modification date

21/08/2019 5:13
