Arrow Research search

Author name cluster

Ani Nenkova

Possible papers associated with this exact author name in Arrow. This page groups case-insensitive exact name matches and is not a full identity disambiguation profile.

8 papers
2 author rows

Possible papers

ICLR Conference 2024 Conference Paper

ADOPD: A Large-Scale Document Page Decomposition Dataset

  • Jiuxiang Gu
  • Xiangxi Shi
  • Jason Kuen
  • Lu Qi
  • Ruiyi Zhang 0002
  • Anqi Liu 0001
  • Ani Nenkova
  • Tong Sun 0005

Research in document image understanding is hindered by limited high-quality document data. To address this, we introduce ADOPD, a comprehensive dataset for document page decomposition. ADOPD stands out with its data-driven approach for document taxonomy discovery during data collection, complemented by dense annotations. Our approach integrates large-scale pretrained models with a human-in-the-loop process to guarantee diversity and balance in the resulting data collection. Leveraging our data-driven document taxonomy, we collect and densely annotate document images, addressing four document image understanding tasks: Doc2Mask, Doc2Box, Doc2Tag, and Doc2Seq. Specifically, for each image, the annotations include human-labeled entity masks, text bounding boxes, as well as automatically generated tags and captions that have been manually cleaned. We conduct comprehensive experimental analyses to validate our data and assess the four tasks using various models. We envision ADOPD as a foundational dataset with the potential to drive future research in document understanding.

ICLR Conference 2024 Conference Paper

SOHES: Self-supervised Open-world Hierarchical Entity Segmentation

  • Shengcao Cao
  • Jiuxiang Gu
  • Jason Kuen
  • Hao Tan 0002
  • Ruiyi Zhang 0002
  • Handong Zhao
  • Ani Nenkova
  • Liangyan Gui

Open-world entity segmentation, as an emerging computer vision task, aims at segmenting entities in images without being restricted by pre-defined classes, offering impressive generalization capabilities on unseen images and concepts. Despite its promise, existing entity segmentation methods like the Segment Anything Model (SAM) rely heavily on costly expert annotators. This work presents Self-supervised Open-world Hierarchical Entity Segmentation (SOHES), a novel approach that eliminates the need for human annotations. SOHES operates in three phases: self-exploration, self-instruction, and self-correction. Given a pre-trained self-supervised representation, we produce abundant high-quality pseudo-labels through visual feature clustering. Then, we train a segmentation model on the pseudo-labels and rectify the noise in the pseudo-labels via a teacher-student mutual-learning procedure. Beyond segmenting entities, SOHES also captures their constituent parts, providing a hierarchical understanding of visual entities. Using raw images as the sole training data, our method achieves unprecedented performance in self-supervised open-world segmentation, marking a significant milestone towards high-quality open-world entity segmentation in the absence of human-annotated masks. Project page: https://SOHES.github.io.

NeurIPS Conference 2021 Conference Paper

UniDoc: Unified Pretraining Framework for Document Understanding

  • Jiuxiang Gu
  • Jason Kuen
  • Vlad I Morariu
  • Handong Zhao
  • Rajiv Jain
  • Nikolaos Barmpalios
  • Ani Nenkova
  • Tong Sun

Document intelligence automates the extraction of information from documents and supports many business applications. Recent self-supervised learning methods on large-scale unlabeled document datasets have opened up promising directions towards reducing annotation efforts by training models with self-supervised objectives. However, most of the existing document pretraining methods are still language-dominated. We present UDoc, a new unified pretraining framework for document understanding. UDoc is designed to support most document understanding tasks, extending the Transformer to take multimodal embeddings as input. Each input element is composed of words and visual features from a semantic region of the input document image. An important feature of UDoc is that it learns a generic representation by making use of three self-supervised losses, encouraging the representation to model sentences, learn similarities, and align modalities. Extensive empirical analysis demonstrates that the pretraining procedure learns better joint representations and leads to improvements in downstream tasks.

JAIR Journal 2017 Journal Article

Combining Lexical and Syntactic Features for Detecting Content-Dense Texts in News

  • Yinfei Yang
  • Ani Nenkova

Content-dense news reports important factual information about an event in a direct, succinct manner. Information-seeking applications such as information extraction, question answering and summarization normally assume that all text they deal with is content-dense. Here we empirically test this assumption on news articles from the business, U.S. international relations, sports and science journalism domains. Our findings clearly indicate that about half of the news texts in our study are in fact not content-dense and motivate the development of a supervised content-density detector. We heuristically label a large training corpus for the task and train a two-layer classification model based on lexical and unlexicalized syntactic features. On manually annotated data, we compare the performance of domain-specific classifiers, trained on data only from a given news domain, and a general classifier in which data from all four domains is pooled together. Our annotation and prediction experiments demonstrate that the concept of content density varies depending on the domain and that naive annotators provide judgements biased toward the stereotypical domain label. Domain-specific classifiers are more accurate for domains in which content-dense texts are typically fewer. Domain-independent classifiers better reproduce naive crowdsourced judgements. Classification accuracy is high across all conditions, around 80%.

AAAI Conference 2015 Conference Paper

Fast and Accurate Prediction of Sentence Specificity

  • Junyi Li
  • Ani Nenkova

Recent studies have demonstrated that specificity is an important characterization of texts, potentially beneficial for a range of applications such as multi-document news summarization and analysis of science journalism. The feasibility of automatically predicting sentence specificity from a rich set of features has also been confirmed in prior work. In this paper we present a practical system for predicting sentence specificity which exploits only features that require minimal processing and is trained in a semi-supervised manner. Our system outperforms the state-of-the-art method for predicting sentence specificity and does not require part-of-speech tagging or syntactic parsing as the prior methods did. With the tool that we developed, SPECITELLER, we study the role of specificity in sentence simplification. We show that specificity is a useful indicator for finding sentences that need to be simplified and a useful objective for simplification, descriptive of the differences between original and simplified sentences.

AAAI Conference 2014 Conference Paper

Detecting Information-Dense Texts in Multiple News Domains

  • Yinfei Yang
  • Ani Nenkova

We introduce the task of identifying information-dense texts, which report important factual information in a direct, succinct manner. We describe a procedure that allows us to automatically label a large training corpus of New York Times texts. We train a classifier based on lexical, discourse and unlexicalized syntactic features and test its performance on a set of manually annotated articles from the business, U.S. international relations, sports and science domains. Our results indicate that the task is feasible and that both syntactic and lexical features are highly predictive for the distinction. We observe considerable variation of prediction accuracy across domains and find that domain-specific models are more accurate.

AAAI Conference 2005 Conference Paper

Automatic Text Summarization of Newswire: Lessons Learned from the Document Understanding Conference

  • Ani Nenkova

Since 2001, the Document Understanding Conferences have been the forum for researchers in automatic text summarization to compare methods and results on common test sets. Over the years, several types of summarization tasks have been addressed: single-document summarization, multi-document summarization, summarization focused by question, and headline generation. This paper is an overview of the results achieved in the different types of summarization tasks. We compare the broader classes of baselines, systems and humans, as well as individual pairs of summarizers (both human and automatic). An analysis of variance model is fitted, with summarizer and input set as independent variables and the coverage score as the dependent variable, and simulation-based multiple comparisons are performed. The results document the progress in the field as a whole, rather than focusing on a single system, and thus can serve as a future reference on the work done to date, as well as a starting point in the formulation of future tasks. Results also indicate that most progress in the field has been achieved in generic multi-document summarization and that the most challenging task is that of producing a focused summary in answer to a question/topic.
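
As a rough sketch, the analysis of variance setup described in this abstract can be written in the standard two-factor additive form. The symbols below and the absence of an interaction term are assumptions made for illustration; the abstract does not give the exact specification.

```latex
% y_{ij}           : coverage score of summarizer i on input set j (dependent variable)
% \mu              : grand mean
% \alpha_i         : effect of summarizer i
% \beta_j          : effect of input set j
% \varepsilon_{ij} : residual error
y_{ij} = \mu + \alpha_i + \beta_j + \varepsilon_{ij}
```

Under this form, comparing two summarizers i and i' amounts to testing whether the difference between their effects, \alpha_i - \alpha_{i'}, is zero, which is the kind of question that multiple-comparison procedures address.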

AAAI Conference 2005 Short Paper

Discourse Factors in Multi-Document Summarization

  • Ani Nenkova

The over-abundance of information today, especially on-line, has established the need for natural language technologies that can help the user find relevant information; multi-document summarization (MDS) and question answering (QA) are two examples. The requirement in MDS and open-ended QA to produce multi-sentential answers imposes the extra demand that the output of such systems be a coherent discourse. The problem of generating appropriate referring expressions to entities in these texts is non-trivial, because different sentences are taken from their original context and put together to form a text. The new context of the summary often requires changes in the surface realization of the references, demanding the inclusion of additional information or the removal of redundant information. Such changes can be implemented by gathering a collection of possible references to an entity from the input documents and then rewriting the references in the sentences selected for inclusion in the summary. The question then arises of how to determine which attributes or descriptions of the referent would be appropriate for the context of the summary.
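
The gather-then-rewrite idea in this abstract can be illustrated with a minimal sketch. It is not the paper's method: the function name `rewrite_references`, the policy of using the longest description for the first mention and the shortest afterwards, and the sample data are all hypothetical. Which description actually suits a given context is the open question the abstract raises.

```python
import re


def rewrite_references(sentences, candidates):
    """Rewrite mentions of one entity across the selected summary sentences.

    Assumed policy (illustrative only): the first mention in the summary
    receives the longest candidate description, and every later mention
    receives the shortest one.
    """
    longest = max(candidates, key=len)
    shortest = min(candidates, key=len)
    # Try longer surface forms first so that a short name embedded in a
    # longer description is not matched on its own.
    ordered = sorted(candidates, key=len, reverse=True)
    pattern = re.compile("|".join(re.escape(c) for c in ordered))

    seen = False
    rewritten = []
    for sentence in sentences:
        def replace(match):
            nonlocal seen
            replacement = shortest if seen else longest
            seen = True
            return replacement

        rewritten.append(pattern.sub(replace, sentence))
    return rewritten


# Hypothetical references gathered from the input documents.
candidates = ["Smith", "John Smith", "Senator John Smith of Ohio"]
# Hypothetical sentences selected for the summary, taken out of context.
summary = [
    "Smith proposed the bill.",
    "The bill was defended by Senator John Smith of Ohio.",
]
print(rewrite_references(summary, candidates))
```

In this toy case the full description moves to the first sentence and the second sentence keeps only the short name, which mirrors the abstract's point that a new context can call for adding or removing information.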