IJCAI Conference 2009 Conference Paper
- Matthew Michelson
- Craig A. Knoblock
Previous work on information extraction from unstructured, ungrammatical text (e.g., classified ads) showed that exploiting a set of background knowledge, called a “reference set,” greatly improves the precision and recall of the extractions. However, finding a source for this reference set is often difficult, if not impossible. Further, even if a source is found, it might not overlap well with the text for extraction. In this paper we present an approach to building the reference set directly from the text itself. Our approach eliminates the need to find a source for the reference set and ensures better overlap between the text and the reference set. Starting with a small amount of background knowledge, our technique constructs tuples representing the entities in the text to form a reference set. Our results show that our method outperforms manually constructed reference sets, since hand-built reference sets may not overlap with the entities in the unstructured, ungrammatical text. We also ran experiments comparing our method to the supervised approach of Conditional Random Fields (CRFs) using simple, generic features. These results show that our method achieves an improvement in F1-measure for 6 of 9 attributes and is competitive on the others, and it does so without training data.

∗ This research is based upon work supported in part by the National Science Foundation under award number IIS-0324955; in part by the Air Force Office of Scientific Research under grant number FA9550-07-1-0416; and in part by the Defense Advanced Research Projects Agency (DARPA) under Contract No. FA8750-07-D-0185/0004. The United States Government is authorized to reproduce and distribute reports for Governmental purposes notwithstanding any copyright annotation thereon. The views and conclusions contained herein are those of the authors and should not be interpreted as necessarily representing the official policies or endorsements, either expressed or implied, of any of the above organizations or any person connected with them.

† Work done while at USC Information Sciences Institute.