INDREX: In-database relation extraction

Details

Serval ID
serval:BIB_3244F12BF848
Type
Article: article from journal or magazine.
Collection
Publications
Title
INDREX: In-database relation extraction
Journal
Information Systems
Author(s)
Kilias T., Löser A., Andritsos P.
ISSN
0306-4379
Publication state
Published
Issued date
10/2015
Peer-reviewed
Yes
Volume
53
Pages
124-144
Language
English
Abstract
The management of text data has a long history. A particularly common task is extracting relations from text. Typically, the user performs this task with two separate systems: a relation extraction system and an SQL-based query engine for analytical tasks. During this iterative analytical workflow, the user must frequently ship data between these systems. Worse, the user must learn to manage both systems. Therefore, end users often desire a single system for both analytical and relation extraction tasks.
We propose INDREX, a system that provides a single and comprehensive view of the whole process, combining relation extraction with later exploitation through SQL. The system permits a data-warehouse-style extract-transform-load of generic relations extracted from text documents and can support additional text mining analysis libraries or systems. Once generic relations are loaded, the user can define SQL queries on the extracted relations to discover higher-level semantics or to join them with other relational data.
To execute this task, our system extends the SQL-based analytical capabilities of a columnar, massively parallel query processing engine with a broad set of user-defined functions and a data model that supports this task. Our white-box approach permits INDREX to benefit from the built-in query optimization and indexing techniques of the underlying query execution engine.
Applications that support both text mining and analytical workflows leverage new analytical platforms based on the MapReduce framework and its open source Hadoop implementation. We compare our system against this baseline. We measure execution times for common workflows and demonstrate orders-of-magnitude improvements in execution time using INDREX.
Keywords
Iterative text mining in an RDBMS, Ad-hoc reports from text data, Information extraction
Create date
22/08/2017 9:55
Last modification date
21/08/2019 5:16