Record Details

DSpace at IIT Bombay

View Archive Info
 

Metadata

 
Field Value
 
Title Automatic segmentation of text into structured records
 
Names BORKAR, V
DESHMUKH, K
SARAWAGI, S
Date Issued 2001 (iso8601)
Abstract In this paper we present a method for automatically segmenting unformatted text records into structured elements. Several useful data sources today are human-generated as continuous text whereas convenient usage requires the data to be organized as structured records. A prime motivation is the warehouse address cleaning problem of transforming dirty addresses stored in large corporate databases as a single text field into subfields like "City" and "Street". Existing tools rely on hand-tuned, domain-specific rule-based systems. We describe a tool DATAMOLD that learns to automatically extract structure when seeded with a small number of training examples. The tool enhances on Hidden Markov Models (HMM) to build a powerful probabilistic model that corroborates multiple sources of information including, the sequence of elements, their length distribution, distinguishing words from the vocabulary and an optional external data dictionary.; Experiments on real-life datasets yielded accuracy of 90% on Asian addresses and 99% on US addresses. In contrast, existing information extraction methods based on rule-learning techniques yielded considerably lower accuracy.
Genre Article; Proceedings Paper
Identifier SIGMOD RECORD,30(2)175-186