Queries over unstructured data : probabilistic methods to the rescue (Keynote)
DSpace at IIT Bombay
Field | Value | |
Title |
Queries over unstructured data : probabilistic methods to the rescue (Keynote)
|
|
Creator |
SARAWAGI, S
|
|
Subject |
imprecise data models; information extraction; duplicate elimination; conditional random fields |
|
Description |
Unstructured data such as emails, addresses, invoices, call transcripts, reviews, and press releases are now an integral part of any large enterprise. A challenge for modern business intelligence applications is analyzing and querying data seamlessly across structured and unstructured sources. This requires automated techniques for extracting structured records from text sources and for resolving entity mentions in data from various sources. The success of any automated method for extraction and integration depends on how effectively it unifies diverse clues in the unstructured source and in existing structured databases. We argue that statistical learning techniques like Conditional Random Fields (CRFs) provide an accurate, elegant, and principled framework for tackling these tasks. Given the inherent noise in real-world sources, it is important to capture the uncertainty of these operations via imprecise data models. CRFs provide a sound probability distribution over extractions, but this distribution is not easy to represent and query in a relational framework. We present methods for approximating this distribution with query-friendly row and column uncertainty models. Finally, we present models for representing the uncertainty of de-duplication and algorithms for various Top-K count queries on imprecise duplicates.
|
|
Publisher |
SPRINGER-VERLAG BERLIN
|
|
Date |
2011-10-22T08:05:25Z
2011-12-15T09:10:49Z 2010 |
|
Type |
Proceedings Paper
|
|
Identifier |
ENABLING REAL-TIME BUSINESS INTELLIGENCE,41,1-13
978-3-642-14558-2 1865-1348 http://dspace.library.iitb.ac.in/xmlui/handle/10054/14847 http://hdl.handle.net/100/1676 |
|
Source |
3rd International Workshop on Business Intelligence for the Real-Time Enterprise, Lyon, FRANCE, AUG 24, 2009
|
|
Language |
English
|
|