Arrow Research search

Author name cluster

Benno Stein

Possible papers associated with this exact author name in Arrow. This page groups case-insensitive exact name matches and is not a full identity disambiguation profile.

9 papers
1 author row

Possible papers

9

IJCAI Conference 2024 Conference Paper

The Information Retrieval Experiment Platform (Extended Abstract)

  • Maik Fröbe
  • Jan Heinrich Reimer
  • Sean MacAvaney
  • Niklas Deckers
  • Simon Reich
  • Janek Bevendorff
  • Benno Stein
  • Matthias Hagen

We have built TIREx, the information retrieval experiment platform, to promote standardized, reproducible, scalable, and blinded retrieval experiments. Standardization is achieved through integration with PyTerrier's interfaces and compatibility with ir_datasets and ir_measures. Reproducibility and scalability are based on the underlying TIRA framework, which runs dockerized software in a cloud-native execution environment. Using Docker images of 50 standard retrieval approaches, we evaluated all of them on 32 tasks (i. e. , 1, 600 runs) in less than a week on a midsize cluster (1, 620 CPU cores and 24 GPUs), demonstrating multi-task scalability. Importantly, TIRA also enables blind evaluation of AI experiments, as the test data can be hidden from public access and the tested approaches run in a sandbox that prevents data leaks. Keeping the test data hidden from public access ensures that it cannot be used by third parties for LLM training, preventing future training-test leaks.

TMLR Journal 2023 Journal Article

Beyond the Imitation Game: Quantifying and extrapolating the capabilities of language models

  • Aarohi Srivastava
  • Abhinav Rastogi
  • Abhishek Rao
  • Abu Awal Md Shoeb
  • Abubakar Abid
  • Adam Fisch
  • Adam R. Brown
  • Adam Santoro

Language models demonstrate both quantitative improvement and new qualitative capabilities with increasing scale. Despite their potentially transformative impact, these new capabilities are as yet poorly characterized. In order to inform future research, prepare for disruptive new model capabilities, and ameliorate socially harmful effects, it is vital that we understand the present and near-future capabilities and limitations of language models. To address this challenge, we introduce the Beyond the Imitation Game benchmark (BIG- bench). BIG-bench currently consists of 204 tasks, contributed by 450 authors across 132 institutions. Task topics are diverse, drawing problems from linguistics, childhood develop- ment, math, common-sense reasoning, biology, physics, social bias, software development, and beyond. BIG-bench focuses on tasks that are believed to be beyond the capabilities of current language models. We evaluate the behavior of OpenAI's GPT models, Google- internal dense transformer architectures, and Switch-style sparse transformers on BIG-bench, across model sizes spanning millions to hundreds of billions of parameters. In addition, a team of human expert raters performed all tasks in order to provide a strong baseline. Findings include: model performance and calibration both improve with scale, but are poor in absolute terms (and when compared with rater performance); performance is remarkably similar across model classes, though with benefits from sparsity; tasks that improve gradually and predictably commonly involve a large knowledge or memorization component, whereas tasks that exhibit "breakthrough" behavior at a critical scale often involve multiple steps or components, or brittle metrics; social bias typically increases with scale in settings with ambiguous context, but this can be improved with prompting.

AAAI Conference 2020 Conference Paper

End-to-End Argumentation Knowledge Graph Construction

  • Khalid Al-Khatib
  • Yufang Hou
  • Henning Wachsmuth
  • Charles Jochim
  • Francesca Bonin
  • Benno Stein

This paper studies the end-to-end construction of an argumentation knowledge graph that is intended to support argument synthesis, argumentative question answering, or fake news detection, among others. The study is motivated by the proven effectiveness of knowledge graphs for interpretable and controllable text generation and exploratory search. Original in our work is that we propose a model of the knowledge encapsulated in arguments. Based on this model, we build a new corpus that comprises about 16k manual annotations of 4740 claims with instances of the model’s elements, and we develop an end-to-end framework that automatically identi- fies all modeled types of instances. The results of experiments show the potential of the framework for building a web-based argumentation graph that is of high quality and large scale.

AAAI Conference 2019 Conference Paper

Model-Based Diagnosis for Cyber-Physical Production Systems Based on Machine Learning and Residual-Based Diagnosis Models

  • Andreas Bunte
  • Benno Stein
  • Oliver Niggemann

This paper introduces a novel approach to Model-Based Diagnosis (MBD) for hybrid technical systems. Unlike existing approaches which normally rely on qualitative diagnosis models expressed in logic, our approach applies a learned quantitative model that is used to derive residuals. Based on these residuals a diagnosis model is generated and used for a root cause identification. The new solution has several advantages such as the easy integration of new machine learning algorithms into MBD, a seamless integration of qualitative models, and a significant speed-up of the diagnosis runtime. The paper at hand formally defines the new approach, outlines its advantages and drawbacks, and presents an evaluation with real-world use cases.

TIST Journal 2013 Journal Article

Paraphrase acquisition via crowdsourcing and machine learning

  • Steven Burrows
  • Martin Potthast
  • Benno Stein

To paraphrase means to rewrite content while preserving the original meaning. Paraphrasing is important in fields such as text reuse in journalism, anonymizing work, and improving the quality of customer-written reviews. This article contributes to paraphrase acquisition and focuses on two aspects that are not addressed by current research: (1) acquisition via crowdsourcing, and (2) acquisition of passage-level samples. The challenge of the first aspect is automatic quality assurance; without such a means the crowdsourcing paradigm is not effective, and without crowdsourcing the creation of test corpora is unacceptably expensive for realistic order of magnitudes. The second aspect addresses the deficit that most of the previous work in generating and evaluating paraphrases has been conducted using sentence-level paraphrases or shorter; these short-sample analyses are limited in terms of application to plagiarism detection, for example. We present the Webis Crowd Paraphrase Corpus 2011 (Webis-CPC-11), which recently formed part of the PAN 2010 international plagiarism detection competition. This corpus comprises passage-level paraphrases with 4067 positive samples and 3792 negative samples that failed our criteria, using Amazon's Mechanical Turk for crowdsourcing. In this article, we review the lessons learned at PAN 2010, and explain in detail the method used to construct the corpus. The empirical contributions include machine learning experiments to explore if passage-level paraphrases can be identified in a two-class classification problem using paraphrase similarity features, and we find that a k-nearest-neighbor classifier can correctly distinguish between paraphrased and nonparaphrased samples with 0.980 precision at 0.523 recall. This result implies that just under half of our samples must be discarded (remaining 0.477 fraction), but our cost analysis shows that the automation we introduce results in a 18% financial saving and over 100 hours of time returned to the researchers when repeating a similar corpus design. On the other hand, when building an unrelated corpus requiring, say, 25% training data for the automated component, we show that the financial outcome is cost neutral, while still returning over 70 hours of time to the researchers. The work presented here is the first to join the paraphrasing and plagiarism communities.

TIST Journal 2012 Journal Article

Information Retrieval in the Commentsphere

  • Martin Potthast
  • Benno Stein
  • Fabian Loose
  • Steffen Becker

This article studies information retrieval tasks related to Web comments. Prerequisite of such a study and a main contribution of the article is a unifying survey of the research field. We identify the most important retrieval tasks related to comments, namely filtering, ranking, and summarization. Within these tasks, we distinguish two paradigms according to which comments are utilized and which we designate as comment-targeting and comment-exploiting. Within the first paradigm, the comments themselves form the retrieval targets. Within the second paradigm, the commented items form the retrieval targets (i.e., comments are used as an additional information source to improve the retrieval performance for the commented items). We report on four case studies to demonstrate the exploration of the commentsphere under information retrieval aspects: comment filtering, comment ranking, comment summarization and cross-media retrieval. The first three studies deal primarily with comment-targeting retrieval, while the last one deals with comment-exploiting retrieval. Throughout the article, connections to information retrieval research are pointed out.

AAAI Conference 2012 Conference Paper

Learning Behavior Models for Hybrid Timed Systems

  • Oliver Niggemann
  • Benno Stein
  • Asmir Vodencarevic
  • Alexander Maier
  • Hans Kleine Büning

A tailored model of a system is the prerequisite for various analysis tasks, such as anomaly detection, fault identification, or quality assurance. This paper deals with the algorithmic learning of a system’s behavior model given a sample of observations. In particular, we consider real-world production plants where the learned model must capture timing behavior, dependencies between system variables, as well as mode switches—in short: hybrid system’s characteristics. Usually, such model formation tasks are solved by human engineers, entailing the well-known bunch of problems including knowledge acquisition, development cost, or lack of experience. Our contributions to the outlined field are as follows. (1) We present a taxonomy of learning problems related to model formation tasks. As a result, an important open learning problem for the domain of production system is identified: The learning of hybrid timed automata. (2) For this class of models, the learning algorithm HyBUTLA is presented. This algorithm is the first of its kind to solve the underlying model formation problem at scalable precision. (3) We present two case studies that illustrate the usability of this approach in realistic settings. (4) We give a proof for the learning and runtime properties of HyBUTLA.

TIST Journal 2011 Journal Article

Cross-Lingual Adaptation Using Structural Correspondence Learning

  • Peter Prettenhofer
  • Benno Stein

Cross-lingual adaptation is a special case of domain adaptation and refers to the transfer of classification knowledge between two languages. In this article we describe an extension of Structural Correspondence Learning (SCL), a recently proposed algorithm for domain adaptation, for cross-lingual adaptation in the context of text classification. The proposed method uses unlabeled documents from both languages, along with a word translation oracle, to induce a cross-lingual representation that enables the transfer of classification knowledge from the source to the target language. The main advantages of this method over existing methods are resource efficiency and task specificity. We conduct experiments in the area of cross-language topic and sentiment classification involving English as source language and German, French, and Japanese as target languages. The results show a significant improvement of the proposed method over a machine translation baseline, reducing the relative error due to cross-lingual adaptation by an average of 30% (topic classification) and 59% (sentiment classification). We further report on empirical analyses that reveal insights into the use of unlabeled data, the sensitivity with respect to important hyperparameters, and the nature of the induced cross-lingual word correspondences.