An Efficient Type-Agnostic Approach for Finding Sub-sequences in Data

Details

Serval ID
serval:BIB_7EF8C37FC0BF
Type
Inproceedings: an article in a conference proceedings.
Collection
Publications
Institution
Title
An Efficient Type-Agnostic Approach for Finding Sub-sequences in Data
Title of the conference
2017 IEEE 19th International Conference on High Performance Computing and Communications; IEEE 15th International Conference on Smart City; IEEE 3rd International Conference on Data Science and Systems (HPCC/SmartCity/DSS)
Author(s)
Chapuis B., Garbinato B., Andritsos P.
Publisher
IEEE
ISBN
9781538625880
Publication state
Published
Issued date
12/2017
Peer-reviewed
Yes
Abstract
In this paper, we present an efficient type-agnostic approach for finding sub-sequences in data, such as text documents or GPS trajectories. Our approach relies on data deduplication for creating an inverted index. In contrast with existing data deduplication techniques that split raw sequences of characters arbitrarily, our approach preserves the semantics of the original sequence via the notion of a token and can be used to index normalized data. Compared with indexing methods that preserve the semantics and operate on normalized data, our method increases the relevance of the inverted index, reduces its size and improves its performance. As data normalization is generally not used beyond the scope of textual data, we introduce a framework that helps identify the extent to which data should be normalized regardless of its type. On this basis, we demonstrate with a dataset made of GPS trajectories that our method can be used agnostically: it can be used to index and query data of a completely different type. Finally, we show that the resulting spatial index is characterized by better discrimination than classic spatial-indexing approaches.
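As a rough illustration of the general idea of a token-based inverted index over normalized data, the sketch below indexes fixed-length runs of tokens and uses them to locate sub-sequences. It is not the deduplication-based method of the paper, whose details are not given in this record. All names (`normalize_point`, `build_index`, `find_subsequence`), the n-gram length, and the rounding precision are assumptions chosen for the example.

```python
from collections import defaultdict


def normalize_point(lat, lon, precision=3):
    """Hypothetical normalization: snap a GPS fix to a grid cell by rounding,
    so that nearby fixes map to the same token."""
    return (round(lat, precision), round(lon, precision))


def build_index(sequences, n=2):
    """Map every run of n consecutive tokens to the (sequence id, position)
    pairs where it occurs. Tokens only need to be hashable."""
    index = defaultdict(set)
    for seq_id, tokens in sequences.items():
        for i in range(len(tokens) - n + 1):
            index[tuple(tokens[i:i + n])].add((seq_id, i))
    return index


def find_subsequence(index, sequences, query, n=2):
    """Return sorted (sequence id, start) pairs where `query` occurs
    contiguously. The index supplies candidates; each is then verified."""
    query = list(query)
    if len(query) < n:
        raise ValueError("query must contain at least n tokens")
    candidates = index.get(tuple(query[:n]), set())
    return sorted(
        (seq_id, pos)
        for seq_id, pos in candidates
        if sequences[seq_id][pos:pos + len(query)] == query
    )


# Text: tokens are words.
docs = {
    "a": "the quick brown fox".split(),
    "b": "a quick brown dog".split(),
}
doc_index = build_index(docs)
text_hits = find_subsequence(doc_index, docs, ["quick", "brown"])

# GPS: tokens are normalized coordinates (made-up sample values).
trips = {
    "t1": [normalize_point(46.5191, 6.6323), normalize_point(46.5202, 6.6331),
           normalize_point(46.5214, 6.6342)],
}
trip_index = build_index(trips)
gps_query = [normalize_point(46.5194, 6.6321), normalize_point(46.5204, 6.6329)]
gps_hits = find_subsequence(trip_index, trips, gps_query)
```

The same two functions serve both data types because they treat tokens as opaque hashable values; only the normalization step is type-specific, which is where the choice of how much to normalize would matter.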
Create date
28/02/2018 11:51
Last modification date
21/08/2019 6:13