Arrow Research search

Author name cluster

Oren Etzioni

Possible papers associated with this exact author name in Arrow. This page groups case-insensitive exact name matches and is not a full identity disambiguation profile.

28 papers
2 author rows

Possible papers

28

AAAI Conference 2016 Conference Paper

Combining Retrieval, Statistics, and Inference to Answer Elementary Science Questions

  • Peter Clark
  • Oren Etzioni
  • Tushar Khot
  • Ashish Sabharwal
  • Oyvind Tafjord
  • Peter Turney
  • Daniel Khashabi

What capabilities are required for an AI system to pass standard 4th Grade Science Tests? Previous work has examined the use of Markov Logic Networks (MLNs) to represent the requisite background knowledge and interpret test questions, but did not improve upon an information retrieval (IR) baseline. In this paper, we describe an alternative approach that operates at three levels of representation and reasoning: information retrieval, corpus statistics, and simple inference over a semi-automatically constructed knowledge base, to achieve substantially improved results. We evaluate the methods on six years of unseen, unedited exam questions from the NY Regents Science Exam (using only non-diagram, multiple choice questions), and show that our overall system’s score is 71.3%, an improvement of 23.8% (absolute) over the MLN-based method described in previous work. We conclude with a detailed analysis, illustrating the complementary strengths of each method in the ensemble. Our datasets are being released to enable further research.

IJCAI Conference 2016 Conference Paper

Question Answering via Integer Programming over Semi-Structured Knowledge

  • Daniel Khashabi
  • Tushar Khot
  • Ashish Sabharwal
  • Peter Clark
  • Oren Etzioni
  • Dan Roth

Answering science questions posed in natural language is an important AI challenge. Answering such questions often requires non-trivial inference and knowledge that goes beyond factoid retrieval. Yet, most systems for this task are based on relatively shallow Information Retrieval (IR) and statistical correlation techniques operating on large unstructured corpora. We propose a structured inference system for this task, formulated as an Integer Linear Program (ILP), that answers natural language questions using a semi-structured knowledge base derived from text, including questions requiring multi-step inference and a combination of multiple facts. On a dataset of real, unseen science questions, our system significantly outperforms (+14%) the best previous attempt at structured reasoning for this task, which used Markov Logic Networks (MLNs). It also improves upon a previous ILP formulation by 17.7%. When combined with unstructured inference methods, the ILP system significantly boosts overall performance (+10%). Finally, we show our approach is substantially more robust to a simple answer perturbation compared to statistical correlation methods.
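
The paper's actual ILP formulation is far richer, but the core idea it describes, selecting a mutually supporting set of facts that links question terms to a candidate answer while maximizing total support, can be illustrated with a tiny brute-force stand-in. All facts, terms, and weights below are hypothetical.

```python
from itertools import combinations

# Toy knowledge base: each fact is (set of terms it connects, weight).
# These entries are illustrative, not from the paper's knowledge tables.
FACTS = [
    ({"metal", "conductor"}, 0.9),
    ({"conductor", "electricity"}, 0.8),
    ({"wood", "insulator"}, 0.7),
]

def best_chain(question_term, answer_term, facts, max_facts=2):
    """Brute-force stand-in for the ILP objective: maximize total fact
    weight subject to the chosen facts covering both the question term
    and the candidate answer term (a simplified connectivity check)."""
    best, best_score = None, float("-inf")
    for k in range(1, max_facts + 1):
        for subset in combinations(facts, k):
            terms = set().union(*(t for t, _ in subset))
            if question_term in terms and answer_term in terms:
                score = sum(w for _, w in subset)
                if score > best_score:
                    best, best_score = subset, score
    return best, best_score

chain, score = best_chain("metal", "electricity", FACTS)
```

A real solver replaces the exhaustive loop with integer-programming machinery, which is what makes the approach scale past toy sizes.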

AAAI Conference 2014 Conference Paper

Diagram Understanding in Geometry Questions

  • Min Joon Seo
  • Hannaneh Hajishirzi
  • Ali Farhadi
  • Oren Etzioni

Automatically solving geometry questions is a longstanding AI problem. A geometry question typically includes a textual description accompanied by a diagram. The first step in solving geometry questions is diagram understanding, which consists of identifying visual elements in the diagram, their locations, their geometric properties, and aligning them to corresponding textual descriptions. In this paper, we present a method for diagram understanding that identifies visual elements in a diagram while maximizing agreement between textual and visual data. We show that the method’s objective function is submodular; thus we are able to introduce an efficient method for diagram understanding that is close to optimal. To empirically evaluate our method, we compile a new dataset of geometry questions (textual descriptions and diagrams) and compare with baselines that utilize standard vision techniques. Our experimental evaluation shows an F1 boost of more than 17% in identifying visual elements and 25% in aligning visual elements with their textual descriptions.
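
The submodularity claim matters because monotone submodular objectives admit a simple greedy algorithm with the classic (1 - 1/e) approximation guarantee. A minimal sketch on a coverage objective (my illustration, not the paper's actual diagram objective):

```python
def greedy_max_coverage(candidate_sets, k):
    """Greedy maximization of a monotone submodular coverage objective:
    repeatedly pick the set with the largest marginal gain. For such
    objectives the greedy result is within (1 - 1/e) of optimal."""
    chosen, covered = [], set()
    for _ in range(k):
        best_i, best_gain = None, 0
        for i, s in enumerate(candidate_sets):
            if i in chosen:
                continue
            gain = len(s - covered)  # marginal gain of adding set i
            if gain > best_gain:
                best_i, best_gain = i, gain
        if best_i is None:
            break
        chosen.append(best_i)
        covered |= candidate_sets[best_i]
    return chosen, covered
```

In the paper's setting the "elements" being covered would be agreements between visual primitives and textual mentions rather than raw set members.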

IJCAI Conference 2011 Conference Paper

Open Information Extraction: The Second Generation

  • Oren Etzioni
  • Anthony Fader
  • Janara Christensen
  • Stephen Soderland
  • Mausam

How do we scale information extraction to the massive size and unprecedented heterogeneity of the Web corpus? Beginning in 2003, our KnowItAll project has sought to extract high-quality knowledge from the Web. In 2007, we introduced the Open Information Extraction (Open IE) paradigm which eschews handlabeled training examples, and avoids domain-specific verbs and nouns, to develop unlexicalized, domain-independent extractors that scale to the Web corpus. Open IE systems have extracted billions of assertions as the basis for both common-sense knowledge and novel question-answering systems. This paper describes the second generation of Open IE systems, which rely on a novel model of how relations and their arguments are expressed in English sentences to double precision/recall compared with previous systems such as TEXTRUNNER and WOE.

AAAI Conference 2010 Conference Paper

Panlingual Lexical Translation via Probabilistic Inference

  • Mausam
  • Stephen Soderland
  • Oren Etzioni

The bare minimum lexical resource required to translate between a pair of languages is a translation dictionary. Unfortunately, dictionaries exist only between a tiny fraction of the 49 million possible language pairs, making machine translation virtually impossible between most of the languages. This paper summarizes the last four years of our research motivated by the vision of panlingual communication. Our research comprises three key steps. First, we compile over 630 freely available dictionaries from the Web and convert this data into a single representation – the translation graph. Second, we build several inference algorithms that infer translations between word pairs even when no dictionary lists them as translations. Finally, we run our inference procedure offline to construct PANDICTIONARY – a sense-distinguished, massively multilingual dictionary that has translations in more than 1000 languages. Our experiments assess the quality of this dictionary and find that we have 4 times as many translations at a high precision of 0.9 compared to the English Wiktionary, which is the lexical resource closest to PANDICTIONARY.
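
The translation-graph idea can be sketched in a few lines: words are nodes, dictionary entries are edges, and a new translation is inferred when multiple independent paths support it. This is a crude proxy for the paper's probabilistic inference; the mini-graph below is hypothetical.

```python
from collections import defaultdict

# Hypothetical mini translation graph: each edge is a dictionary entry
# linking two (language, word) nodes.
EDGES = [
    (("en", "water"), ("es", "agua")),
    (("en", "water"), ("fr", "eau")),
    (("de", "wasser"), ("es", "agua")),
    (("de", "wasser"), ("fr", "eau")),
]

def build_graph(edges):
    g = defaultdict(set)
    for a, b in edges:
        g[a].add(b)
        g[b].add(a)
    return g

def inferred_translations(graph, node, min_pivots=2):
    """Infer (language, word) pairs reachable through at least
    min_pivots distinct intermediate words: multiple independent
    two-hop paths serve as evidence the translation is real."""
    votes = defaultdict(set)
    for pivot in graph[node]:
        for target in graph[pivot]:
            if target != node and target not in graph[node]:
                votes[target].add(pivot)
    return {t for t, pivots in votes.items() if len(pivots) >= min_pivots}
```

Requiring two or more distinct pivots guards against sense drift through a single polysemous intermediate word, which is the main failure mode of naive transitive translation.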

NeurIPS Conference 2008 Conference Paper

Look Ma, No Hands: Analyzing the Monotonic Feature Abstraction for Text Classification

  • Doug Downey
  • Oren Etzioni

Is accurate classification possible in the absence of hand-labeled data? This paper introduces the Monotonic Feature (MF) abstraction--where the probability of class membership increases monotonically with the MF's value. The paper proves that when an MF is given, PAC learning is possible with no hand-labeled data under certain assumptions. We argue that MFs arise naturally in a broad range of textual classification applications. On the classic "20 Newsgroups" data set, a learner given an MF and unlabeled data achieves classification accuracy equal to that of a state-of-the-art semi-supervised learner relying on 160 hand-labeled examples. Even when MFs are not given as input, their presence or absence can be determined from a small amount of hand-labeled data, which yields a new semi-supervised learning method that reduces error by 15% on the 20 Newsgroups data.
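
One way to see how an MF removes the need for hand labels: because class probability rises monotonically with the feature's value, examples at the extremes can be auto-labeled and fed to any supervised learner. A toy sketch under that assumption (the feature, thresholds, and documents below are all illustrative):

```python
def pseudo_label(docs, mf, low, high):
    """Self-labeling via a Monotonic Feature: since P(class | mf)
    increases with mf, examples with extreme feature values can be
    labeled automatically; mid-range examples are left unlabeled."""
    labeled = []
    for d in docs:
        v = mf(d)
        if v >= high:
            labeled.append((d, 1))  # confident positive
        elif v <= low:
            labeled.append((d, 0))  # confident negative
    return labeled  # feed to any supervised learner

# Hypothetical MF: occurrences of a seed word for the target class.
mf = lambda d: d.lower().split().count("hockey")
docs = ["hockey hockey playoffs hockey", "tax forms due", "hockey tonight"]
```

The interesting part of the paper is the PAC analysis of when this kind of bootstrapping provably works, which the sketch does not capture.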

IJCAI Conference 2007 Conference Paper

Locating Complex Named Entities in Web Text

  • Doug Downey
  • Matthew Broadhead
  • Oren Etzioni

Named Entity Recognition (NER) is the task of locating and classifying names in text. In previous work, NER was limited to a small number of pre-defined entity classes (e.g., people, locations, and organizations). However, NER on the Web is a far more challenging problem. Complex names (e.g., film or book titles) can be very difficult to pick out precisely from text. Further, the Web contains a wide variety of entity classes, which are not known in advance. Thus, hand-tagging examples of each entity class is impractical. This paper investigates a novel approach to the first step in Web NER: locating complex named entities in Web text. Our key observation is that named entities can be viewed as a species of multi-word units, which can be detected by accumulating n-gram statistics over the Web corpus. We show that this statistical method's F1 score is 50% higher than that of supervised techniques including Conditional Random Fields (CRFs) and Conditional Markov Models (CMMs) when applied to complex names. The method also outperforms CMMs and CRFs by 117% on entity classes absent from the training data. Finally, our method outperforms a semi-supervised CRF by 73%.
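
A standard way to detect multi-word units from n-gram statistics is pointwise mutual information: a word pair that co-occurs far more often than chance behaves as a single lexical unit. A minimal sketch of that statistic (a stand-in for the paper's Web-scale method; counts and threshold below are illustrative):

```python
import math

def pmi(count_xy, count_x, count_y, n):
    """Pointwise mutual information of a bigram, from corpus counts
    (count_xy bigram occurrences, unigram counts, n total tokens)."""
    p_xy = count_xy / n
    p_x, p_y = count_x / n, count_y / n
    return math.log2(p_xy / (p_x * p_y))

def is_multiword_unit(count_xy, count_x, count_y, n, threshold=3.0):
    """Flag a bigram as a lexical unit when its words co-occur far
    more often than independence would predict."""
    return pmi(count_xy, count_x, count_y, n) >= threshold
```

For a bigram whose words always appear together, PMI is large and positive; for words paired purely by chance, it is near zero.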

IJCAI Conference 2007 Conference Paper

Open Information Extraction from the Web

  • Michele Banko
  • Michael J Cafarella
  • Stephen Soderland
  • Matt Broadhead
  • Oren Etzioni

Traditionally, Information Extraction (IE) has focused on satisfying precise, narrow, pre-specified requests from small homogeneous corpora (e.g., extract the location and time of seminars from a set of announcements). Shifting to a new domain requires the user to name the target relations and to manually create new extraction rules or hand-tag new training examples. This manual labor scales linearly with the number of target relations. This paper introduces Open IE (OIE), a new extraction paradigm where the system makes a single data-driven pass over its corpus and extracts a large set of relational tuples without requiring any human input. The paper also introduces TextRunner, a fully implemented, highly scalable OIE system where the tuples are assigned a probability and indexed to support efficient extraction and exploration via user queries. We report on experiments over a 9,000,000 Web page corpus that compare TextRunner with KnowItAll, a state-of-the-art Web IE system. TextRunner achieves an error reduction of 33% on a comparable set of extractions. Furthermore, in the amount of time it takes KnowItAll to perform extraction for a handful of pre-specified relations, TextRunner extracts a far broader set of facts reflecting orders of magnitude more relations, discovered on the fly. We report statistics on TextRunner's 11,000,000 highest probability tuples, and show that they contain over 1,000,000 concrete facts and over 6,500,000 more abstract assertions.

AAAI Conference 2006 Conference Paper

Machine Reading

  • Oren Etzioni

The time is ripe for the AI community to set its sights on Machine Reading—the automatic, unsupervised understanding of text. In this paper, we place the notion of Machine Reading in context, describe progress towards this goal by the KnowItAll research group at the University of Washington, and highlight several central research questions.

IJCAI Conference 2005 Conference Paper

A Probabilistic Model of Redundancy in Information Extraction

  • Doug Downey
  • Oren Etzioni
  • Stephen Soderland

Unsupervised Information Extraction (UIE) is the task of extracting knowledge from text without using hand-tagged training examples. A fundamental problem for both UIE and supervised IE is assessing the probability that extracted information is correct. In massive corpora such as the Web, the same extraction is found repeatedly in different documents. How does this redundancy impact the probability of correctness? This paper introduces a combinatorial “balls-and-urns” model that computes the impact of sample size, redundancy, and corroboration from multiple distinct extraction rules on the probability that an extraction is correct. We describe methods for estimating the model’s parameters in practice and demonstrate experimentally that for UIE the model’s log likelihoods are 15 times better, on average, than those obtained by Pointwise Mutual Information (PMI) and the noisy-or model used in previous work. For supervised IE, the model’s performance is comparable to that of Support Vector Machines and Logistic Regression.
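
The flavor of the model can be shown with a heavily simplified single-urn version: an extraction seen k times in n draws is scored by comparing two binomial likelihoods, one assuming it is a correct label (drawn often) and one assuming it is an error (drawn rarely). The per-draw rates and prior below are illustrative, not the paper's estimates.

```python
from math import comb

def urns_posterior(k, n, p_c, p_e, prior=0.5):
    """Toy balls-and-urns sketch: posterior that an extraction is
    correct, given it appeared k times in n draws. p_c / p_e are the
    per-draw probabilities of drawing this label if it is correct /
    erroneous (correct labels repeat more, so p_c > p_e)."""
    like_c = comb(n, k) * p_c**k * (1 - p_c) ** (n - k)
    like_e = comb(n, k) * p_e**k * (1 - p_e) ** (n - k)
    return prior * like_c / (prior * like_c + (1 - prior) * like_e)
```

The key behavior, which PMI and noisy-or handle poorly, is that the posterior depends on sample size n and not just the raw repetition count k.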

AAAI Conference 2004 Conference Paper

Methods for Domain-Independent Information Extraction from the Web: An Experimental Comparison

  • Oren Etzioni
  • Doug Downey
  • Tal Shaked
  • Daniel S. Weld

Our KNOWITALL system aims to automate the tedious process of extracting large collections of facts (e.g., names of scientists or politicians) from the Web in an autonomous, domain-independent, and scalable manner. In its first major run, KNOWITALL extracted over 50,000 facts with high precision, but suggested a challenge: How can we improve KNOWITALL’s recall and extraction rate without sacrificing precision? This paper presents three distinct ways to address this challenge and evaluates their performance. Rule Learning learns domain-specific extraction rules. Subclass Extraction automatically identifies sub-classes in order to boost recall. List Extraction locates lists of class instances, learns a “wrapper” for each list, and extracts elements of each list. Since each method bootstraps from KNOWITALL’s domain-independent methods, no hand-labeled training examples are required. Experiments show the relative coverage of each method and demonstrate their synergy. In concert, our methods gave KNOWITALL a 4-fold to 19-fold increase in recall, while maintaining high precision, and discovered 10,300 cities missing from the Tipster Gazetteer.
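
The domain-independent extraction rules that everything above bootstraps from are in the spirit of generic class-instance patterns such as "<class> such as <instance>, <instance>". A minimal regex sketch of one such rule (my illustration, not KNOWITALL's actual rule language):

```python
import re

# Generic class-instance pattern: "<plural class> such as X, Y".
PATTERN = re.compile(r"(\w+) such as ((?:\w+(?:, )?)+)")

def extract_instances(text, target_class):
    """Collect candidate instances of target_class from sentences
    matching the generic pattern."""
    instances = set()
    for cls, tail in PATTERN.findall(text):
        if cls.lower() == target_class:
            instances.update(i.strip() for i in tail.split(","))
    return instances

text = "He visited cities such as Seattle, Boston and scientists such as Curie"
```

Real systems then assess each candidate statistically rather than trusting single pattern hits, which is where the precision numbers above come from.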

AAAI Conference 2004 System Paper

PRECISE on ATIS: Semantic Tractability and Experimental Results

  • Ana-Maria Popescu
  • Oren Etzioni

The need for Natural Language Interfaces to databases (NLIs) has become increasingly acute as more and more people access information through their web browsers, PDAs, and cell phones. Yet NLIs are only usable if they map natural language questions to SQL queries correctly — people are unwilling to trade reliable and predictable user interfaces for intelligent but unreliable ones. We describe a reliable NLI, PRECISE, that incorporates a modern statistical parser and a semantic module. PRECISE provably handles a large class of natural language questions correctly. On the benchmark ATIS data set, PRECISE achieves 93.8% accuracy.

IJCAI Conference 2003 Conference Paper

Automatically Personalizing User Interfaces

  • Daniel S. Weld
  • Corin Anderson
  • Pedro Domingos
  • Oren Etzioni
  • Krzysztof Gajos
  • Tessa Lau
  • Steve Wolfinan

Today's computer interfaces are one-size-fits-all. Users with little programming experience have very limited opportunities to customize an interface to their task and work habits. Furthermore, the overhead induced by generic interfaces will be proportionately greater on small form-factor PDAs, embedded applications and wearable devices. Automatic personalization may greatly enhance user productivity, but it requires advances in customization (explicit, user-initiated change) and adaptation (interface-initiated change in response to routine user behavior). In order to improve customization, we must make it easier for users to direct these changes. In order to improve adaptation, we must better predict user behavior and navigate the inherent tension between the dynamism of automatic adaptation and the stability required in order for the user to predict the computer's behavior and maintain control. This paper surveys a decade's work on customization and adaptation at the University of Washington, distilling the lessons we have learned.

IJCAI Conference 1997 Conference Paper

Adaptive Web Sites: an AI Challenge

  • Mike Perkowitz
  • Oren Etzioni

The creation of a complex web site is a thorny problem in user interface design. First, different visitors have distinct goals. Second, even a single visitor may have different needs at different times. Much of the information at the site may also be dynamic or time-dependent. Third, as the site grows and evolves, its original design may no longer be appropriate. Finally, a site may be designed for a particular purpose but used in unexpected ways. Web servers record data about user interactions and accumulate this data over time. We believe that AI techniques can be used to examine user access logs in order to automatically improve the site. We challenge the AI community to create adaptive web sites: sites that automatically improve their organization and presentation based on user access data. Several unrelated research projects in plan recognition, machine learning, knowledge representation, and user modeling have begun to explore aspects of this problem. We hope that posing this challenge explicitly will bring these projects together and stimulate fundamental AI research. Success would have a broad and highly visible impact on the web and the AI community.

FOCS Conference 1996 Conference Paper

Efficient Information Gathering on the Internet (extended abstract)

  • Oren Etzioni
  • Steve Hanks
  • Tao Jiang 0001
  • Richard M. Karp
  • Omid Madani
  • Orli Waarts

The Internet offers unprecedented access to information. At present most of this information is free, but information providers are likely to start charging for their services in the near future. With that in mind this paper introduces the following information access problem: given a collection of n information sources, each of which has a known time delay, dollar cost and probability of providing the needed information, find an optimal schedule for querying the information sources. We study several variants of the problem which differ in the definition of an optimal schedule. We first consider a cost model in which the problem is to minimize the expected total cost (monetary and time) of the schedule, subject to the requirement that the schedule may terminate only when the query has been answered or all sources have been queried unsuccessfully. We develop an approximation algorithm for this problem and for an extension of the problem in which more than a single item of information is being sought. We then develop approximation algorithms for a reward model in which a constant reward is earned if the information is successfully provided, and we seek the schedule with the maximum expected difference between the reward and a measure of cost. The monetary and time costs may either appear in the cost measure or be constrained not to exceed a fixed upper bound; these options give rise to four different variants of the reward model.
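
For one simple special case of the sequential-query model, independent sources with a single scalar cost each, queried one at a time until the first success, a standard exchange argument shows that ordering sources by decreasing probability-to-cost ratio minimizes expected cost. This is my sketch of that special case, not the paper's approximation algorithms; the source values are made up.

```python
def schedule(sources):
    """Order independent sources by success-probability-to-cost ratio
    (descending); optimal for the query-until-first-success model by
    a standard adjacent-exchange argument."""
    return sorted(sources, key=lambda s: s["p"] / s["cost"], reverse=True)

def expected_cost(order):
    """Expected total cost: a source's cost is paid only if every
    earlier source in the schedule failed."""
    total, p_all_fail = 0.0, 1.0
    for s in order:
        total += p_all_fail * s["cost"]
        p_all_fail *= 1 - s["p"]
    return total

sources = [
    {"name": "A", "p": 0.9, "cost": 10.0},
    {"name": "B", "p": 0.5, "cost": 1.0},
    {"name": "C", "p": 0.2, "cost": 2.0},
]
```

Note that the cheap, moderately reliable source B is queried before the expensive, highly reliable A; raw success probability alone is the wrong ordering key.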

AAAI Conference 1996 Conference Paper

Moving Up the Information Food Chain: Deploying Softbots on the World Wide Web

  • Oren Etzioni

I view the World Wide Web as an information food chain (figure 1). The maze of pages and hyperlinks that comprise the Web are at the very bottom of the chain. The WebCrawlers and Alta Vistas of the world are information herbivores; they graze on Web pages and regurgitate them as searchable indices. Today, most Web users feed near the bottom of the information food chain, but the time is ripe to move up. Since 1991, we have been building information carnivores, which intelligently hunt and feast on herbivores in Unix, on the Internet and on the Web.

IJCAI Conference 1995 Conference Paper

Category Translation: Learning to understand information on the Internet

  • Mike Perkowitz
  • Oren Etzioni

This paper investigates the problem of automatically learning declarative models of information sources available on the Internet. We report on ILA, a domain-independent program that learns the meaning of external information by explaining it in terms of internal categories. In our experiments, ILA starts with knowledge of local faculty members, and is able to learn models of the Internet service whois and of the personnel directories available at Berkeley, Brown, Caltech, Cornell, Rice, Rutgers, and UCI, averaging fewer than 40 queries per information source. ILA's hypothesis language is compositions of first-order predicates, and its bias is compactly encoded as a determination. We analyze ILA's sample complexity both within the Valiant model, and using a probabilistic model specifically tailored to ILA.

AAAI Conference 1991 Conference Paper

STATIC: A Problem-Space Compiler for PRODIGY

  • Oren Etzioni

Explanation-Based Learning (EBL) can be used to significantly speed up problem solving. Is there sufficient structure in the definition of a problem space to enable a static analyzer, using EBL-style optimizations, to speed up problem solving without utilizing training examples? If so, will such an analyzer run in reasonable time? This paper demonstrates that for a wide range of problem spaces the answer to both questions is “yes.” The STATIC program speeds up problem solving for the PRODIGY problem solver without utilizing training examples. In Minton’s problem spaces [1988], STATIC acquires control knowledge from twenty-six to seventy-seven times faster, and speeds up PRODIGY up to three times as much as PRODIGY/EBL. This paper presents STATIC’s algorithms, derives a condition under which STATIC is guaranteed to achieve polynomial-time problem solving, and contrasts STATIC with PRODIGY/EBL.