Arrow Research search

Author name cluster

Yang Shi

Possible papers associated with this exact author name in Arrow. This page groups case-insensitive exact name matches and is not a full identity disambiguation profile.

7 papers
1 author row

Possible papers (7)

AAAI Conference 2026 Conference Paper

Detecting Unobserved Confounders: A Kernelized Regression Approach

  • Yikai Chen
  • Yunxin Mao
  • Chunyuan Zheng
  • Hao Zou
  • Shanzhi Gu
  • Shixuan Liu
  • Yang Shi
  • Wenjing Yang

Detecting unobserved confounders is crucial for reliable causal inference in observational studies. Existing methods require either linearity assumptions or multiple heterogeneous environments, which limits their applicability in nonlinear, single-environment settings. To bridge this gap, we propose Kernel Regression Confounder Detection (KRCD), a novel method for detecting unobserved confounding in nonlinear observational data under single-environment conditions. KRCD leverages reproducing kernel Hilbert spaces to model complex dependencies. By comparing standard and higher-order kernel regressions, we derive a test statistic whose significant deviation from zero indicates unobserved confounding. Theoretically, we prove two key results: first, in the infinite-sample limit, the two regression coefficients coincide if and only if no unobserved confounders exist; second, the finite-sample difference converges to a zero-mean Gaussian distribution with tractable variance. Extensive experiments on synthetic benchmarks and the Twins dataset demonstrate that KRCD not only outperforms existing baselines but also achieves superior computational efficiency.
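
The abstract describes the detector only at a high level: fit a standard and a "higher-order" kernel regression, take the difference as a test statistic, and flag confounding when it deviates significantly from zero. The sketch below is a hypothetical illustration of that pattern with scikit-learn's KernelRidge and a bootstrap interval; the second fit and the statistic are placeholder choices, not the paper's KRCD construction.

```python
# Hypothetical illustration of a confounder-detection test built on kernel ridge
# regression. This is NOT the paper's KRCD estimator; it only shows the general
# pattern of comparing two regression fits and testing whether their difference
# deviates from zero.
import numpy as np
from sklearn.kernel_ridge import KernelRidge

def regression_difference(x, y, gamma=1.0, alpha=1e-2):
    """Mean gap between a standard kernel regression of y on x and a second fit
    that stands in for the 'higher-order' regression (illustrative choice)."""
    x2d = x.reshape(-1, 1)
    f1 = KernelRidge(kernel="rbf", gamma=gamma, alpha=alpha).fit(x2d, y)
    f2 = KernelRidge(kernel="rbf", gamma=2 * gamma, alpha=alpha).fit(x2d, y)
    return float(np.mean(f1.predict(x2d) - f2.predict(x2d)))

def detect_confounding(x, y, n_boot=200, level=0.05, seed=0):
    """Bootstrap the statistic; an interval that excludes zero suggests
    unobserved confounding under this illustrative test."""
    rng = np.random.default_rng(seed)
    stat = regression_difference(x, y)
    boot = []
    for _ in range(n_boot):
        idx = rng.integers(0, len(x), len(x))
        boot.append(regression_difference(x[idx], y[idx]))
    lo, hi = np.quantile(boot, [level / 2, 1 - level / 2])
    return not (lo <= 0.0 <= hi), stat
```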

AAAI Conference 2026 Conference Paper

Uncovering Pretraining Code in LLMs: A Syntax-Aware Attribution Approach

  • Yuanheng Li
  • Zhuoyang Chen
  • Xiaoyun Liu
  • Yuhao Wang
  • Mingwei Liu
  • Yang Shi
  • Kaifeng Huang
  • Shengjie Zhao

As large language models (LLMs) become increasingly capable, concerns over the unauthorized use of copyrighted and licensed content in their training data have grown, especially in the context of code. Open-source code, often protected by open-source licenses (e.g., GPL), poses legal and ethical challenges when used in pretraining. Detecting whether specific code samples were included in LLM training data is thus critical for transparency, accountability, and copyright compliance. We propose SynPrune, a syntax-pruned membership inference attack (MIA) method tailored for code. Unlike prior MIA approaches that treat code as plain text, SynPrune leverages the structured and rule-governed nature of programming languages. Specifically, it identifies tokens that are syntactically required and not reflective of authorship, and excludes them from attribution when computing membership scores. Experimental results show that SynPrune consistently outperforms state-of-the-art baselines. Our method is also robust across varying function lengths and syntax categories.
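
The mechanism lends itself to a short illustration: score membership from per-token log-likelihoods, but drop tokens that the grammar forces. The sketch below is a hypothetical rendering of that idea with a generic Hugging Face causal LM; the pruning rule (punctuation and Python keywords) and the use of gpt2 as a stand-in model are assumptions, not SynPrune's actual implementation.

```python
# Hypothetical syntax-pruned membership score for code, in the spirit of the
# abstract. The pruning set and threshold-free scoring are illustrative only.
import keyword
import torch
from transformers import AutoModelForCausalLM, AutoTokenizer

# Tokens assumed to carry little authorship signal (illustrative pruning rule).
SYNTACTIC = set("(){}[]:,;=") | set(keyword.kwlist)

def syntax_pruned_score(model, tokenizer, code: str) -> float:
    enc = tokenizer(code, return_tensors="pt")
    with torch.no_grad():
        logits = model(**enc).logits
    # Log-probability of each token given its prefix (shift targets by one).
    logprobs = torch.log_softmax(logits[:, :-1], dim=-1)
    target = enc["input_ids"][:, 1:]
    token_lp = logprobs.gather(-1, target.unsqueeze(-1)).squeeze(-1)[0]
    keep = [
        i for i, tok_id in enumerate(target[0].tolist())
        if tokenizer.decode([tok_id]).strip() not in SYNTACTIC
    ]
    # Higher average log-likelihood over non-syntactic tokens -> more "member-like".
    return token_lp[keep].mean().item()

if __name__ == "__main__":
    tok = AutoTokenizer.from_pretrained("gpt2")        # stand-in model
    lm = AutoModelForCausalLM.from_pretrained("gpt2")
    print(syntax_pruned_score(lm, tok, "def add(a, b):\n    return a + b\n"))
```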

AAAI Conference 2025 Conference Paper

Generalized Debiased Semi-Supervised Hashing for Large-Scale Image Retrieval

  • Xingbo Liu
  • Xuening Zhang
  • Xiushan Nie
  • Yang Shi
  • Yilong Yin

Semi-supervised hashing has shown promising efficacy in large-scale image retrieval, learning similarity-preserving codes from both labeled and unlabeled data. To enable the use of advanced supervised hashing techniques, pseudo labels are widely applied. However, existing methods typically suffer from a biased learning issue due to pseudo-label noise, which can be further aggravated during optimization. Although such bias can adversely affect hashing accuracy, it has not been investigated sufficiently. In view of this, we present a comprehensive discussion of the potential causes of bias across pseudo-labeling, hash learning, and optimization. Accordingly, a novel Generalized Debiased Semi-supervised Hashing (GDSH) method is proposed as a unified solution to mitigate these biases. Specifically, reliable pseudo labels are first predicted via a robust label completion strategy. Second, a debiased hash learning module is designed by combining label denoising and similarity updating, which not only refines the supervision but also yields hash codes that are semantically debiased at both the category and sample levels. Finally, a discrete semi-supervised hashing algorithm is proposed to alleviate the bias arising from optimization. Experimental results on three single-label and three multi-label image benchmarks demonstrate that GDSH remarkably outperforms state-of-the-art methods in different semi-supervised settings.
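
As a rough illustration of the kind of debiasing the abstract describes, the snippet below confidence-weights a standard pairwise similarity-preserving hashing loss so that noisy pseudo labels contribute less. This is a generic semi-supervised hashing sketch under assumed inputs, not GDSH's actual formulation.

```python
# Illustrative confidence-weighted pairwise hashing loss for pseudo-labeled data.
import torch

def pairwise_hash_loss(outputs, labels, confidence):
    """outputs: (n, b) real-valued network outputs; labels: (n,) pseudo labels;
    confidence: (n,) in [0, 1], low values down-weight noisy pseudo labels."""
    b = outputs.shape[1]
    relaxed = torch.tanh(outputs)                                # push toward {-1, +1}
    sim = (labels[:, None] == labels[None, :]).float() * 2 - 1   # +1 similar, -1 dissimilar
    inner = relaxed @ relaxed.t() / b                            # normalized to [-1, 1]
    weight = confidence[:, None] * confidence[None, :]
    return (weight * (inner - sim) ** 2).mean()

# Usage: loss = pairwise_hash_loss(net(images), pseudo_labels, pseudo_confidence)
```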

NeurIPS Conference 2025 Conference Paper

MME-VideoOCR: Evaluating OCR-Based Capabilities of Multimodal LLMs in Video Scenarios

  • Yang Shi
  • Huanqian Wang
  • Xie Xie
  • Huanyao Zhang
  • Lijie Zhao
  • Yifan Zhang
  • Xinfeng Li
  • Chaoyou Fu

Multimodal Large Language Models (MLLMs) have achieved considerable accuracy in Optical Character Recognition (OCR) from static images. However, their efficacy in video OCR is significantly diminished due to factors such as motion blur, temporal variations, and visual effects inherent in video content. To provide clearer guidance for training practical MLLMs, we introduce the MME-VideoOCR benchmark, which encompasses a comprehensive range of video OCR application scenarios. MME-VideoOCR features 10 task categories comprising 25 individual tasks and spans 44 diverse scenarios. These tasks extend beyond text recognition to incorporate deeper comprehension and reasoning about textual content within videos. The benchmark consists of 1,464 videos with varying resolutions, aspect ratios, and durations, along with 2,000 meticulously curated, manually annotated question-answer pairs. We evaluate 18 state-of-the-art MLLMs on MME-VideoOCR, revealing that even the best-performing model (Gemini-2.5 Pro) achieves an accuracy of only 73.7%. Fine-grained analysis indicates that while existing MLLMs demonstrate strong performance on tasks where the relevant text is contained within a single frame or a few frames, they exhibit limited capability in tasks that demand holistic video comprehension. These limitations are especially evident in scenarios that require spatio-temporal reasoning, cross-frame information integration, or resistance to language prior bias. Our findings also highlight the importance of high-resolution visual input and sufficient temporal coverage for reliable OCR in dynamic video scenarios.

AAAI Conference 2020 Short Paper

Focusing on Detail: Deep Hashing Based on Multiple Region Details (Student Abstract)

  • Quan Zhou
  • Xiushan Nie
  • Yang Shi
  • Xingbo Liu
  • Yilong Yin

Hashing, which converts multimedia data into a set of short binary codes while preserving the similarity of the original data, has been widely studied in recent years for its fast retrieval efficiency and high performance. The majority of existing deep supervised hashing methods utilize only the semantics of the whole image when learning hash codes and ignore local image details, which are important in hash learning. To fully utilize this detailed information, we propose a novel deep multi-region hashing (DMRH) method, which learns hash codes from local regions and obtains the final hash codes of the image by fusing the local hash codes corresponding to those regions. In addition, we propose a self-similarity loss term to address the imbalance problem (i.e., the number of dissimilar pairs is significantly larger than the number of similar ones) in methods based on pairwise similarity.
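
A minimal sketch of the region-fusion idea follows. The abstract does not specify the fusion operator or the exact self-similarity term, so the average-then-project fusion and the term pulling each region code toward its image's fused code are guesses made only to show where such pieces would sit.

```python
# Illustrative region-code fusion for multi-region hashing; not DMRH's network.
import torch
import torch.nn as nn

class RegionHashFusion(nn.Module):
    def __init__(self, feat_dim: int, code_bits: int):
        super().__init__()
        self.region_head = nn.Linear(feat_dim, code_bits)  # shared head over regions
        self.fuse = nn.Linear(code_bits, code_bits)

    def forward(self, region_feats):                        # (batch, n_regions, feat_dim)
        region_codes = torch.tanh(self.region_head(region_feats))
        fused = torch.tanh(self.fuse(region_codes.mean(dim=1)))  # average, then project
        return fused, region_codes

def self_similarity_loss(fused, region_codes):
    """Assumed form: each region code is pulled toward its image's fused code."""
    return ((region_codes - fused.unsqueeze(1)) ** 2).mean()
```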

AAAI Conference 2019 Conference Paper

AI-Sketcher: A Deep Generative Model for Producing High-Quality Sketches

  • Nan Cao
  • Xin Yan
  • Yang Shi
  • Chaoran Chen

Sketch drawings have played an important role in assisting human communication and creative design since ancient times. This has motivated the development of artificial intelligence (AI) techniques for automatically generating sketches from user input. Sketch-RNN, a sequence-to-sequence variational autoencoder (VAE) model, was developed for this purpose and is regarded as a state-of-the-art technique. However, it suffers from limitations, including low-quality results and an inability to support multi-class generation. To address these issues, we introduce AI-Sketcher, a deep generative model for generating high-quality multi-class sketches. Our model improves drawing quality by employing a CNN-based autoencoder to capture the positional information of each stroke at the pixel level. It also introduces an influence layer that more precisely guides the generation of each stroke by directly referring to the training data. To support multi-class sketch generation, we provide a conditional vector that helps differentiate sketches across classes. The proposed technique was evaluated on two large-scale sketch datasets, and the results demonstrate its power in generating high-quality sketches.
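
To make the conditional-vector idea concrete, here is a toy decoder stub that concatenates a class one-hot vector with the latent code at every decoding step. It is a generic conditional sequence decoder over the stroke-5 format used by Sketch-RNN, not AI-Sketcher's actual architecture (which also includes a CNN autoencoder and an influence layer).

```python
# Toy conditional stroke decoder; illustrative only.
import torch
import torch.nn as nn

class ConditionalStrokeDecoder(nn.Module):
    def __init__(self, latent_dim: int, n_classes: int, hidden: int = 256, stroke_dim: int = 5):
        super().__init__()
        # Stroke-5 format (dx, dy, pen_down, pen_up, end) is the usual Sketch-RNN encoding.
        self.rnn = nn.GRU(latent_dim + n_classes + stroke_dim, hidden, batch_first=True)
        self.out = nn.Linear(hidden, stroke_dim)

    def forward(self, z, class_onehot, strokes):
        # Broadcast the latent code and class vector to every time step, then decode.
        steps = strokes.shape[1]
        cond = torch.cat([z, class_onehot], dim=-1).unsqueeze(1).expand(-1, steps, -1)
        h, _ = self.rnn(torch.cat([cond, strokes], dim=-1))
        return self.out(h)
```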

AAAI Conference 2004 Short Paper

Evaluating Consistency Algorithms for Temporal Metric Constraints

  • Yang Shi

We study the performance of some known algorithms for solving the Simple Temporal Problem (STP) and the Temporal Constraint Satisfaction Problem (TCSP). In particular, we empirically compare the Bellman-Ford (BF) algorithm and its incremental version (incBF) by (Cesta & Oddi 1996) to the △STP of (Xu & Choueiry 2003a). Among the tested algorithms, we show that △STP is the most efficient for determining the consistency of an STP, and that incBF combined with the heuristics of (Xu & Choueiry 2003b) is the most efficient for solving the TCSP. We plan to improve △STP by exploiting incrementality as in incBF and other new incremental algorithms.
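
For context, the Bellman-Ford baseline compared here reduces STP consistency checking to negative-cycle detection in the distance graph: an STP is consistent if and only if its distance graph has no negative cycle. A minimal, non-incremental version is sketched below; incBF and △STP improve on this basic loop.

```python
# STP consistency via Bellman-Ford negative-cycle detection (minimal sketch).
def stp_consistent(n, edges):
    """n: number of time points; edges: list of (u, v, w) meaning X_v - X_u <= w."""
    dist = [0] * n                      # acts as a virtual source at distance 0 to every node
    for _ in range(n - 1):
        for u, v, w in edges:
            if dist[u] + w < dist[v]:
                dist[v] = dist[u] + w
    # One more relaxation pass: any remaining improvement implies a negative cycle.
    return all(dist[u] + w >= dist[v] for u, v, w in edges)

# Example: X1 - X0 <= 5 and X0 - X1 <= -10 is inconsistent (cycle weight -5).
assert stp_consistent(2, [(0, 1, 5), (1, 0, -10)]) is False
```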