Arrow Research

Author name cluster

Ivan Stelmakh

Possible papers associated with this exact author name in Arrow. This page groups case-insensitive exact name matches and is not a full identity disambiguation profile.

9 papers

Possible papers

TMLR Journal 2025 Journal Article

A Gold Standard Dataset for the Reviewer Assignment Problem

  • Ivan Stelmakh
  • John Frederick Wieting
  • Yang Xi
  • Graham Neubig
  • Nihar B Shah

Many peer-review venues use algorithms to assign submissions to reviewers. The crux of such automated approaches is the notion of the “similarity score”, a numerical estimate of a reviewer's expertise in reviewing a paper, and many algorithms have been proposed to compute these scores. However, these algorithms have not been subjected to a principled comparison, making it difficult for stakeholders to choose among them in an evidence-based manner. The key challenge in comparing existing algorithms and developing better ones is the lack of publicly available gold-standard data needed for reproducible research. We address this challenge by collecting a novel dataset of similarity scores that we release to the research community. Our dataset consists of 477 self-reported expertise scores provided by 58 researchers who evaluated their expertise in reviewing papers they had read previously. We use this data to compare several popular algorithms currently employed in computer science conferences and derive recommendations for stakeholders. Our four main findings are:

  • All algorithms make a non-trivial amount of error. For the task of ordering two papers in terms of their relevance for a reviewer, error rates range from 12%-30% in easy cases to 36%-43% in hard cases, highlighting the vital need for more research on the similarity-computation problem.
  • Most specialized algorithms are designed to work with titles and abstracts of papers, and in this regime the Specter2 algorithm performs best.
  • The classical TF-IDF algorithm, which can use the full texts of papers, is on par with Specter2, which uses only titles and abstracts.
  • The performance of off-the-shelf LLMs is worse than that of the specialized algorithms.

We encourage researchers to participate in our survey and contribute their data to the dataset here: https://forms.gle/SP1Rh8eivGz54xR37
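
To make the comparison concrete, here is a minimal sketch of a TF-IDF similarity score and the pairwise ordering evaluation described above; the function names, reviewer-profile averaging, and scoring details are illustrative assumptions, not the paper's implementation.

```python
# A minimal sketch of a TF-IDF similarity baseline and a pairwise ordering
# check, in the spirit of the comparison in the paper; details are assumptions.
import numpy as np
from sklearn.feature_extraction.text import TfidfVectorizer
from sklearn.metrics.pairwise import cosine_similarity

def similarity_scores(reviewer_papers, submissions):
    """Score each submission against a reviewer's past papers via TF-IDF."""
    vectorizer = TfidfVectorizer(stop_words="english")
    tfidf = vectorizer.fit_transform(reviewer_papers + submissions)
    # average the reviewer's papers into a single TF-IDF profile vector
    profile = np.asarray(tfidf[: len(reviewer_papers)].mean(axis=0))
    return cosine_similarity(profile, tfidf[len(reviewer_papers):])[0]

def pairwise_error(scores, self_reported):
    """Fraction of paper pairs whose algorithmic ordering contradicts the
    reviewer's self-reported expertise ordering (tied pairs are skipped)."""
    errors, total = 0, 0
    for i in range(len(scores)):
        for j in range(i + 1, len(scores)):
            if self_reported[i] == self_reported[j]:
                continue
            total += 1
            errors += (scores[i] - scores[j]) * (self_reported[i] - self_reported[j]) < 0
    return errors / total if total else 0.0
```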

JMLR Journal 2024 Journal Article

Debiasing Evaluations That Are Biased by Evaluations

  • Jingyan Wang
  • Ivan Stelmakh
  • Yuting Wei
  • Nihar Shah

It is common to evaluate a set of items by soliciting people to rate them. For example, universities ask students to rate the teaching quality of their instructors, and conference organizers ask authors of submissions to evaluate the quality of the reviews. However, in these applications, students often give a higher rating to a course if they receive higher grades in that course, and authors often give a higher rating to the reviews if their papers are accepted to the conference. In this work, we call these external factors the "outcome" experienced by people, and consider the problem of mitigating these outcome-induced biases in the given ratings when some information about the outcome is available. We formulate the information about the outcome as a known partial ordering on the bias. We propose a debiasing method by solving a regularized optimization problem under this ordering constraint, and also provide a carefully designed cross-validation method that adaptively chooses the appropriate amount of regularization. We provide theoretical guarantees on the performance of our algorithm, as well as experimental evaluations.
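
The sketch below illustrates the regularized, order-constrained optimization idea under strong simplifying assumptions: each item is rated once by a rater from each outcome group, bias depends only on the rater's group, and the known partial ordering is a total order over two groups. It is an illustrative convex program, not the paper's exact estimator, and the fixed lam stands in for the cross-validated choice.

```python
# Illustrative sketch of outcome-aware debiasing; NOT the paper's estimator.
import cvxpy as cp
import numpy as np

rng = np.random.default_rng(0)
n_items, n_groups = 30, 2
true_x = rng.normal(size=n_items)        # true item qualities
true_b = np.array([-0.7, 0.4])           # e.g., "rejected" raters are harsher

# ratings[i, g] = quality of item i + bias of outcome group g + noise
ratings = true_x[:, None] + true_b[None, :] + 0.1 * rng.normal(size=(n_items, n_groups))

x = cp.Variable(n_items)                 # debiased item scores
b = cp.Variable(n_groups)                # per-group outcome bias
lam = 0.1                                # chosen by cross-validation in the paper

X = cp.reshape(x, (n_items, 1), order="C") @ np.ones((1, n_groups))  # x across groups
B = np.ones((n_items, 1)) @ cp.reshape(b, (1, n_groups), order="C")  # b across items
problem = cp.Problem(
    cp.Minimize(cp.sum_squares(ratings - X - B) + lam * cp.sum_squares(b)),
    [b[0] <= b[1]],                      # the known ordering on the bias
)
problem.solve()
print("recovered bias gap:", float(b.value[1] - b.value[0]))
```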

AAAI Conference 2021 Conference Paper

A Novice-Reviewer Experiment to Address Scarcity of Qualified Reviewers in Large Conferences

  • Ivan Stelmakh
  • Nihar B. Shah
  • Aarti Singh
  • Hal Daumé III

Conference peer review constitutes a human-computation process whose importance cannot be overstated: not only does it identify the best submissions for acceptance, but, ultimately, it impacts the future of the whole research area by promoting some ideas and restraining others. A surge in the number of submissions received by leading AI conferences has challenged the sustainability of the review process by increasing the burden on the pool of qualified reviewers, which is growing at a much slower rate. In this work, we consider the problem of reviewer recruiting with a focus on the scarcity of qualified reviewers in large conferences. Specifically, we design a procedure for (i) recruiting reviewers from the population not typically covered by major conferences and (ii) guiding them through the reviewing pipeline. In conjunction with ICML 2020, a large, top-tier machine learning conference, we recruit a small set of reviewers through our procedure and compare their performance with the general population of ICML reviewers. Our experiment reveals that a combination of the recruiting and guiding mechanisms allows for a principled enhancement of the reviewer pool and results in reviews of superior quality compared to reviews from the conventional pool, as evaluated by senior members of the program committee (meta-reviewers).

AAAI Conference 2021 Conference Paper

Catch Me if I Can: Detecting Strategic Behaviour in Peer Assessment

  • Ivan Stelmakh
  • Nihar B. Shah
  • Aarti Singh

We consider the issue of strategic behaviour in various peer-assessment tasks, including peer grading of exams or homeworks and peer review in hiring or promotions. When a peer-assessment task is competitive (e.g., when students are graded on a curve), agents may be incentivized to misreport evaluations in order to improve their own final standing. Our focus is on designing methods for the detection of such manipulations. Specifically, we consider a setting in which agents evaluate a subset of their peers and output rankings that are later aggregated to form a final ordering. In this paper, we investigate a statistical framework for this problem and design a principled test for detecting strategic behaviour. We prove that our test has strong false alarm guarantees and evaluate its detection ability in practical settings. For this, we design and conduct an experiment that elicits strategic behaviour from subjects and release a dataset of patterns of strategic behaviour that may be of independent interest. We use this data to run a series of real and semi-synthetic evaluations that reveal the strong detection power of our test.
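
The general recipe can be illustrated with a permutation test on a reciprocity-style statistic over the evaluation graph; the statistic, data layout, and shuffling scheme below are illustrative assumptions, not the paper's actual test.

```python
# Illustrative permutation test for strategic behaviour; NOT the paper's test.
# Assumed layout: rank[i][j] is the position agent i assigns to peer j (lower is
# better); evaluates[j] is the set of peers whose work agent j evaluates.
import numpy as np

def reciprocity_stat(rank, evaluates):
    """Mean rank agents give to the peers who evaluate their own work."""
    vals = [r_ij for i in rank for j, r_ij in rank[i].items()
            if i in evaluates.get(j, set())]
    return float(np.mean(vals))

def permutation_test(rank, evaluates, n_perm=10_000, seed=0):
    """One-sided p-value: do agents rank their evaluators better than chance?"""
    rng = np.random.default_rng(seed)
    observed = reciprocity_stat(rank, evaluates)
    hits = 0
    for _ in range(n_perm):
        # under the null, each agent's ranking is exchangeable across peers
        shuffled = {i: dict(zip(r, rng.permutation(list(r.values()))))
                    for i, r in rank.items()}
        hits += reciprocity_stat(shuffled, evaluates) <= observed
    return (1 + hits) / (1 + n_perm)
```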

NeurIPS Conference 2021 Conference Paper

CrowdSpeech and VoxDIY: Benchmark Dataset for Crowdsourced Audio Transcription

  • Nikita Pavlichenko
  • Ivan Stelmakh
  • Dmitry Ustalov

Domain-specific data is the crux of the successful transfer of machine learning systems from benchmarks to real life. In simple problems such as image classification, crowdsourcing has become one of the standard tools for cheap and time-efficient data collection, thanks in large part to advances in research on aggregation methods. However, the applicability of crowdsourcing to more complex tasks (e.g., speech recognition) remains limited due to the lack of principled aggregation methods for these modalities. The main obstacle towards designing aggregation methods for more advanced applications is the absence of training data, and in this work, we focus on bridging this gap in speech recognition. For this, we collect and release CrowdSpeech --- the first publicly available large-scale dataset of crowdsourced audio transcriptions. Evaluation of existing and novel aggregation methods on our data shows substantial room for improvement, suggesting that our work may enable the design of better algorithms. At a higher level, we also contribute to the more general challenge of developing a methodology for reliable data collection via crowdsourcing. To that end, we design a principled pipeline for constructing datasets of crowdsourced audio transcriptions in any novel domain. We show its applicability on an under-resourced language by constructing VoxDIY --- a counterpart of CrowdSpeech for the Russian language. We also release the code that allows a full replication of our data collection pipeline and share various insights on best practices of data collection via crowdsourcing.
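
A simple aggregation baseline helps make the task concrete: pick the "medoid" transcription, i.e. the worker answer with the smallest total word-level edit distance to the other answers for the same recording. This generic baseline is an illustration only and is not one of the methods evaluated in the paper.

```python
# Illustrative medoid baseline for aggregating crowdsourced transcriptions;
# not one of the paper's aggregation methods.

def word_edit_distance(a, b):
    """Levenshtein distance between two transcriptions, computed over words."""
    a, b = a.split(), b.split()
    prev = list(range(len(b) + 1))
    for i, wa in enumerate(a, 1):
        cur = [i]
        for j, wb in enumerate(b, 1):
            cur.append(min(prev[j] + 1,                 # deletion
                           cur[j - 1] + 1,              # insertion
                           prev[j - 1] + (wa != wb)))   # substitution
        prev = cur
    return prev[-1]

def aggregate_medoid(transcriptions):
    """Return the transcription with minimal total distance to the others."""
    return min(transcriptions,
               key=lambda t: sum(word_edit_distance(t, u) for u in transcriptions))

print(aggregate_medoid(["the cat sat", "the cat sat down", "a cat sat"]))
```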

AAAI Conference 2021 Conference Paper

Debiasing Evaluations That Are Biased by Evaluations

  • Jingyan Wang
  • Ivan Stelmakh
  • Yuting Wei
  • Nihar B. Shah

It is common to evaluate a set of items by soliciting people to rate them. For example, universities ask students to rate the teaching quality of their instructors, and conference organizers ask authors of submissions to evaluate the quality of the reviews. However, in these applications, students often give a higher rating to a course if they receive higher grades in that course, and authors often give a higher rating to the reviews if their papers are accepted to the conference. In this work, we call these external factors the “outcome” experienced by people, and consider the problem of mitigating these outcome-induced biases in the given ratings when some information about the outcome is available. We formulate the information about the outcome as a known partial ordering on the bias. We propose a debiasing method by solving a regularized optimization problem under this ordering constraint, and also provide a carefully designed cross-validation method that adaptively chooses the appropriate amount of regularization. We provide theoretical guarantees on the performance of our algorithm, as well as experimental evaluations.

JMLR Journal 2021 Journal Article

PeerReview4All: Fair and Accurate Reviewer Assignment in Peer Review

  • Ivan Stelmakh
  • Nihar Shah
  • Aarti Singh

We consider the problem of automated assignment of papers to reviewers in conference peer review, with a focus on fairness and statistical accuracy. Our fairness objective is to maximize the review quality of the most disadvantaged paper, in contrast to the commonly used objective of maximizing the total quality over all papers. We design an assignment algorithm based on an incremental max-flow procedure that we prove is near-optimally fair. Our statistical accuracy objective is to ensure correct recovery of the papers that should be accepted. We provide a sharp minimax analysis of the accuracy of the peer-review process for a popular objective-score model as well as for a novel subjective-score model that we propose in the paper. Our analysis proves that our proposed assignment algorithm also leads to near-optimal statistical accuracy. Finally, we design a novel experiment that allows for an objective comparison of various assignment algorithms, and overcomes the inherent difficulty posed by the absence of a ground truth in experiments on peer review. The results of this experiment, as well as of other experiments on synthetic and real data, corroborate the theoretical guarantees of our algorithm.
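
The max-min flavour of the fairness objective can be sketched with a simplified "bottleneck" variant: binary-search for the largest similarity threshold at which a max-flow feasibility check can still fully assign every paper. The setup below (a networkx flow network with assumed per_paper and max_load parameters) is illustrative and is not the paper's incremental max-flow algorithm.

```python
# Illustrative "bottleneck" fair-assignment sketch; NOT the paper's algorithm.
# Find the largest similarity threshold t such that every paper can still get
# its required number of reviewers using only (paper, reviewer) pairs with
# similarity >= t.
import networkx as nx
import numpy as np

def feasible(sim, t, per_paper=2, max_load=3):
    """Max-flow check: can all papers be fully assigned using edges >= t?"""
    n_papers, n_revs = sim.shape
    G = nx.DiGraph()
    for p in range(n_papers):
        G.add_edge("s", f"p{p}", capacity=per_paper)
        for r in range(n_revs):
            if sim[p, r] >= t:
                G.add_edge(f"p{p}", f"r{r}", capacity=1)
    for r in range(n_revs):
        G.add_edge(f"r{r}", "t", capacity=max_load)
    flow, _ = nx.maximum_flow(G, "s", "t")
    return flow == n_papers * per_paper

def bottleneck_threshold(sim, **kw):
    """Binary search over sorted similarity values for the fairest threshold.
    Assumes a full assignment is feasible at the smallest similarity value."""
    vals = np.unique(sim)
    lo, hi = 0, len(vals) - 1
    while lo < hi:
        mid = (lo + hi + 1) // 2
        if feasible(sim, vals[mid], **kw):
            lo = mid
        else:
            hi = mid - 1
    return vals[lo]
```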

AAAI Conference 2021 Short Paper

Towards Fair, Equitable, and Efficient Peer Review

  • Ivan Stelmakh

Peer review is the backbone of academia. The rapid growth in the number of submissions to leading publication venues has created a need to automate parts of the peer-review pipeline, and human referees are now required to interact with various interfaces and technologies throughout the process. However, there exists evidence that if such interactions are not carefully designed, they can exacerbate various problems related to the fairness and efficiency of the process. In my research, I aim to design a Human-AI collaboration pipeline in peer review to mitigate these issues and ensure that science progresses in a fair, equitable, and efficient manner.

NeurIPS Conference 2019 Conference Paper

On Testing for Biases in Peer Review

  • Ivan Stelmakh
  • Nihar Shah
  • Aarti Singh

We consider the issue of biases in scholarly research, specifically in peer review. There is a long-standing debate on whether exposing author identities to reviewers induces biases against certain groups, and our focus is on designing tests to detect the presence of such biases. Our starting point is a remarkable recent work by Tomkins, Zhang and Heavlin, which conducted a controlled, large-scale experiment to investigate the existence of biases in the peer reviewing of the WSDM conference. We present two sets of results in this paper. The first set of results is negative, and pertains to the statistical tests and the experimental setup used in the work of Tomkins et al. We show that the test employed therein does not guarantee control over the false alarm probability: under correlations between relevant variables, coupled with any of the following conditions, it can, with high probability, declare the presence of bias when bias is in fact absent: (a) measurement error, (b) model mismatch, (c) reviewer calibration. Moreover, we show that the setup of their experiment may itself inflate the false alarm probability if (d) bidding is performed in a non-blind manner or (e) a popular reviewer assignment procedure is employed. Our second set of results is positive: we present a general framework for testing for biases in (single vs. double blind) peer review. We then present a hypothesis test with guaranteed control over the false alarm probability and non-trivial power even under conditions (a)--(c). Conditions (d) and (e) are more fundamental problems that are tied to the experimental setup and are not necessarily related to the test.
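
The flavour of such a test can be sketched as a simple paired sign test on papers scored under both single-blind and double-blind conditions; the data layout is an illustrative assumption, and this statistic is far cruder than the test proposed in the paper (in particular, it does not address conditions (a)-(c)).

```python
# Illustrative sign test for bias against a group; NOT the paper's actual test.
# Assumed setup: each paper from the group of interest receives one score under
# single-blind and one under double-blind review; under the no-bias null, the
# single-blind score is equally likely to be above or below its paired score.
from scipy.stats import binomtest

def sign_test(single_blind, double_blind):
    pairs = [(s, d) for s, d in zip(single_blind, double_blind) if s != d]
    wins = sum(s > d for s, d in pairs)   # single-blind scored the paper higher
    return binomtest(wins, n=len(pairs), p=0.5).pvalue

# toy usage with hypothetical review scores
print(sign_test([6, 5, 7, 4, 6, 3], [5, 5, 6, 5, 4, 3]))
```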