Arrow Research search

Author name cluster

Tammy Stark

Possible papers associated with this exact author name in Arrow. This page groups case-insensitive exact name matches and is not a full identity disambiguation profile.

2 papers
1 author row

Possible papers (2)

NeurIPS 2025 Conference Paper

PerceptionLM: Open-Access Data and Models for Detailed Visual Understanding

  • Jang Hyun Cho
  • Andrea Madotto
  • Effrosyni Mavroudi
  • Triantafyllos Afouras
  • Tushar Nagarajan
  • Muhammad Maaz
  • Yale Song
  • Tengyu Ma

Vision-language models are integral to computer vision research, yet many high-performing models remain closed-source, obscuring their data, design, and training recipe. The research community has responded by using distillation from black-box models to label training data, achieving strong benchmark results, at the cost of measurable scientific progress. However, without knowing the details of the teacher model and its data sources, scientific progress remains difficult to measure. In this paper, we study building a Perception Language Model (PLM) in a fully open and reproducible framework for transparent research in image and video understanding. We analyze standard training pipelines without distillation from proprietary models and explore large-scale synthetic data to identify critical data gaps, particularly in detailed video understanding. To bridge these gaps, we release 2.8M human-labeled instances of fine-grained video question-answer pairs and spatio-temporally grounded video captions. Additionally, we introduce PLM-VideoBench, a suite for evaluating challenging video understanding tasks focusing on the ability to reason about "what", "where", "when", and "how" of a video. We make our work fully reproducible by providing data, training recipes, code, and models.
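
As a rough illustration of the two released annotation types described in the abstract (fine-grained video question-answer pairs and spatio-temporally grounded video captions), a minimal sketch of plausible record layouts is given below. All class and field names are hypothetical illustrations, not the actual PLM release schema.

    from dataclasses import dataclass
    from typing import List

    # Hypothetical record layouts; field names are illustrative only and do
    # not reflect the actual PLM data format.

    @dataclass
    class GroundedCaption:
        video_id: str        # source video identifier
        start_sec: float     # start of the captioned segment
        end_sec: float       # end of the captioned segment
        bbox: List[float]    # [x1, y1, x2, y2] region the caption describes
        caption: str         # fine-grained description of that region/segment

    @dataclass
    class VideoQAPair:
        video_id: str
        question: str        # "what"/"where"/"when"/"how"-style question
        answer: str          # human-labeled reference answer

    # Example instances (contents invented for illustration)
    cap = GroundedCaption("vid_0001", 3.0, 7.5, [0.12, 0.30, 0.55, 0.88],
                          "A person picks up a red mug from the counter.")
    qa = VideoQAPair("vid_0001", "What does the person pick up?", "A red mug")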

NeurIPS 2025 Conference Paper

WearVQA: A Visual Question Answering Benchmark for Wearables in Egocentric Authentic Real-world scenarios

  • Eun Chang
  • Zhuangqun Huang
  • Yiwei Liao
  • Sagar Bhavsar
  • Amogh Param
  • Tammy Stark
  • Adel Ahmadyan
  • Xiao Yang

We introduce WearVQA, the first benchmark specifically designed to evaluate the visual question answering (VQA) capabilities of multi-modal AI assistants on wearable devices like smart glasses. Unlike prior benchmarks that focus on high-quality, third-person imagery, WearVQA reflects the unique challenges of egocentric interaction, where visual inputs may be occluded, poorly lit, unzoomed, or blurry, and questions are grounded in realistic wearable use cases. The benchmark comprises 2,500 carefully curated image-question-answer triplets, spanning 7 diverse image domains including both text-centric and general scenes, 10 cognitive task types ranging from basic recognition to various forms of reasoning, and 6 common wearables-specific image quality issues. All questions are designed to be answerable using only the visual input and common sense. WearVQA is paired with a rigorous LLM-as-a-judge evaluation framework with 96% labeling accuracy. Open-source and proprietary multi-modal LLMs achieved a QA accuracy as low as 24–52% on WearVQA, with substantial drops on lower-quality images and reasoning-heavy tasks. These observations position WearVQA as a comprehensive and challenging benchmark for guiding technical advancement towards robust, real-world multi-modal wearables AI systems.
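
To make the LLM-as-a-judge evaluation mentioned in the abstract concrete, here is a minimal sketch of how QA accuracy over image-question-answer triplets could be computed. The key names and the model_answer and judge callables are assumptions for illustration; they are not the paper's actual framework, prompts, or data format.

    from typing import Callable, Dict, List

    # Hypothetical triplet layout; key names are illustrative only.
    Triplet = Dict[str, str]  # keys: "image_path", "question", "reference_answer"

    def evaluate(triplets: List[Triplet],
                 model_answer: Callable[[str, str], str],
                 judge: Callable[[str, str, str], bool]) -> float:
        """Return QA accuracy under an LLM-as-a-judge style comparison.

        model_answer(image_path, question) -> predicted answer string
        judge(question, reference, prediction) -> True if the judge (e.g. an
        LLM prompted to compare the two answers) deems the prediction correct.
        """
        correct = 0
        for t in triplets:
            prediction = model_answer(t["image_path"], t["question"])
            if judge(t["question"], t["reference_answer"], prediction):
                correct += 1
        return correct / len(triplets) if triplets else 0.0

    # Usage sketch with trivial stand-ins for the model and the judge:
    data = [{"image_path": "img_001.jpg",
             "question": "What time does the receipt show?",
             "reference_answer": "3:45 pm"}]
    accuracy = evaluate(data,
                        model_answer=lambda img, q: "3:45 pm",
                        judge=lambda q, ref, pred: ref.strip().lower() == pred.strip().lower())
    print(f"accuracy: {accuracy:.2%}")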