Arrow Research search

Author name cluster

Matthew Michelson

Possible papers associated with this exact author name in Arrow. This page groups case-insensitive exact name matches and is not a full identity disambiguation profile.

4 papers
1 author row

Possible papers

IJCAI Conference 2009 Conference Paper

  • Matthew Michelson
  • Craig A. Knoblock

Previous work on information extraction from unstructured, ungrammatical text (e.g., classified ads) showed that exploiting a set of background knowledge, called a “reference set,” greatly improves the precision and recall of the extractions. However, finding a source for this reference set is often difficult, if not impossible. Further, even if a source is found, it might not overlap well with the text for extraction. In this paper we present an approach to building the reference set directly from the text itself. Our approach eliminates the need to find the source for the reference set, and ensures better overlap between the text and reference set. Starting with a small amount of background knowledge, our technique constructs tuples representing the entities in the text to form a reference set. Our results show that our method outperforms manually constructed reference sets, since hand-built reference sets may not overlap with the entities in the unstructured, ungrammatical text. We also ran experiments comparing our method to the supervised approach of Conditional Random Fields (CRFs) using simple, generic features. These results show our method achieves an improvement in F1-measure for 6/9 attributes and is competitive in performance on the others, and this is without training data.

∗ This research is based upon work supported in part by the National Science Foundation under award number IIS-0324955; in part by the Air Force Office of Scientific Research under grant number FA9550-07-1-0416; and in part by the Defense Advanced Research Projects Agency (DARPA) under Contract No. FA8750-07-D-0185/0004. The United States Government is authorized to reproduce and distribute reports for Governmental purposes notwithstanding any copyright annotation thereon. The views and conclusions contained herein are those of the authors and should not be interpreted as necessarily representing the official policies or endorsements, either expressed or implied, of any of the above organizations or any person connected with them.

† Work done while at USC Information Sciences Institute.

AAAI Conference 2006 Conference Paper

Learning Blocking Schemes for Record Linkage

  • Matthew Michelson
  • Craig A. Knoblock

Record linkage is the process of matching records across data sets that refer to the same entity. One issue within record linkage is determining which record pairs to consider, since a detailed comparison between all of the records is impractical. Blocking addresses this issue by generating candidate matches as a preprocessing step for record linkage. For example, in a person matching problem, blocking might return all people with the same last name as candidate matches. Two main problems in blocking are the selection of attributes for generating the candidate matches and deciding which methods to use to compare the selected attributes. These attribute and method choices constitute a blocking scheme. Previous approaches to record linkage address the blocking issue in a largely ad-hoc fashion. This paper presents a machine learning approach to automatically learn effective blocking schemes. We validate our approach with experiments that show our learned blocking schemes outperform the ad-hoc blocking schemes of non-experts and perform comparably to those manually built by a domain expert.
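The last-name example in the abstract above can be sketched in a few lines. This is only an illustration of a single hand-written blocking scheme (one attribute, one comparison method), not the learning approach the paper presents. The `block_by_key` helper and the sample records are hypothetical.

```python
from collections import defaultdict
from itertools import combinations


def block_by_key(records, key_fn):
    """Return candidate pairs of record ids that share a blocking key.

    records: dict mapping record id -> record (a dict of attributes).
    key_fn:  function mapping a record to its blocking key (hypothetical
             helper; here the "scheme" is one attribute plus one method).
    """
    blocks = defaultdict(list)
    for rec_id, rec in records.items():
        blocks[key_fn(rec)].append(rec_id)

    pairs = set()
    for ids in blocks.values():
        # Only records inside the same block are compared in detail later.
        pairs.update(combinations(sorted(ids), 2))
    return pairs


# Made-up sample data for illustration only.
records = {
    1: {"first": "Matt", "last": "Michelson"},
    2: {"first": "Matthew", "last": "michelson"},
    3: {"first": "Craig", "last": "Knoblock"},
    4: {"first": "C.", "last": "Knoblock"},
    5: {"first": "Ann", "last": "Smith"},
}

# Blocking scheme: attribute = last name, method = case-insensitive equality.
candidates = block_by_key(records, lambda r: r["last"].lower())
print(sorted(candidates))
```

With five records there are ten possible pairs. This scheme passes only the pairs that share a last name on to the detailed comparison step, which is the saving blocking is meant to provide.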

AAAI Conference 2006 System Paper

Phoebus: A System for Extracting and Integrating Data from Unstructured and Ungrammatical Sources

  • Matthew Michelson

With the proliferation of online classifieds and auctions comes a new need to meaningfully search and organize the items for sale. However, since the seller’s item descriptions are not structured and do not conform to a standard set of values (think “Chevy” versus “Chevrolet”), searching and organizing this data is difficult. This paper describes a working demonstration of the Phoebus system which uses both record linkage and information extraction to parse out the meaningful attributes of an item description and assign them standard values. This allows the data to be sorted, searched and linked to other data sources where standard values for the attributes are required to link the sources together.

IJCAI Conference 2005 Conference Paper

Semantic annotation of unstructured and ungrammatical text

  • Matthew Michelson
  • Craig A

There are vast amounts of free text on the internet that are neither grammatical nor formally structured, such as item descriptions on eBay or internet classifieds like Craigslist. These sources of data, called “posts,” are full of useful information for agents scouring the Semantic Web, but they lack the semantic annotation to make them searchable. Annotating these posts is difficult since the text generally exhibits little formal grammar and the structure of the posts varies. However, by leveraging collections of known entities and their common attributes, called “reference sets,” we can annotate these posts despite their lack of grammar and structure. To use this reference data, we align a post to a member of the reference set, and then exploit this matched member during information extraction. We compare this extraction approach to more traditional information extraction methods that rely on structural and grammatical characteristics, and we show that our approach outperforms traditional methods on this type of data.
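The align-then-extract idea in the abstract above can be illustrated with a small sketch. The abstract does not say how the alignment is computed, so plain token overlap (Jaccard similarity) is swapped in here as an assumption. The `best_match` helper, the reference set, and the post are all made up for illustration.

```python
def tokens(text):
    """Lowercased whitespace tokens; real posts would need sturdier tokenizing."""
    return set(text.lower().split())


def best_match(post, reference_set):
    """Return the reference-set member with the highest token overlap.

    Jaccard similarity is an assumption made for this sketch, not the
    matching method described in the paper.
    """
    post_tokens = tokens(post)

    def score(member):
        member_tokens = tokens(" ".join(member.values()))
        union = post_tokens | member_tokens
        return len(post_tokens & member_tokens) / len(union) if union else 0.0

    return max(reference_set, key=score)


# Hypothetical reference set of known entities and their attributes.
reference_set = [
    {"make": "Honda", "model": "Civic"},
    {"make": "Honda", "model": "Accord"},
    {"make": "Toyota", "model": "Camry"},
]

post = "93 civic 5speed runs great obo"
match = best_match(post, reference_set)
print(match)
```

The post never mentions the make, yet the matched member supplies a standard value for it. That is the sense in which the matched member can be exploited during extraction.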