Author name cluster

Xinyi Yang

Possible papers associated with this exact author name in Arrow. This page groups case-insensitive exact name matches and is not a full identity disambiguation profile.

10 papers

1 author row

TMLR Journal 2026 Journal Article

VLM2Vec-V2: Advancing Multimodal Embedding for Videos, Images, and Visual Documents

Rui Meng
Ziyan Jiang
Ye Liu
Mingyi Su
Xinyi Yang
Yuepeng Fu
Can Qin
Raghuveer Thirukovalluru

Multimodal embedding models have been crucial in enabling various downstream tasks such as semantic similarity, information retrieval, and clustering over different modalities. However, existing multimodal embeddings like VLM2Vec, E5-V, GME are predominantly focused on natural images, with limited support for other visual forms such as videos and visual documents. This restricts their applicability in real-world scenarios, including AI agents, retrieval-augmented generation (RAG) systems, and recommendation. To close this gap, we propose VLM2Vec-V2, a unified framework for learning embeddings across diverse visual forms. First, we introduce MMEB-V2, a comprehensive benchmark that extends MMEB with five new task types: visual document retrieval, video retrieval, temporal grounding, video classification and video question answering -- spanning text, image, video, and visual document inputs. Next, we train VLM2Vec-V2, a general-purpose embedding model that supports text, image, video, and visual document inputs. Extensive experiments show that VLM2Vec-V2 achieves strong performance not only on the newly introduced video and document retrieval tasks, but also improves over prior baselines on the original image benchmarks. Through extensive evaluation, our study offers insights into the generalizability of various multimodal embedding models and highlights effective strategies for unified embedding learning, laying the groundwork for more scalable and adaptable representation learning in both research and real-world settings.

PDF Details

NeurIPS Conference 2025 Conference Paper

Are Large Reasoning Models Good Translation Evaluators? Analysis and Performance Boost

Runzhe Zhan
Zhihong Huang
Xinyi Yang
Lidia Chao
Min Yang
Derek Wong

Recent advancements in large reasoning models (LRMs) have introduced an intermediate "thinking" process prior to generating final answers, improving their reasoning capabilities on complex downstream tasks. However, the potential of LRMs as evaluators for machine translation (MT) quality remains underexplored. We provide the first systematic analysis of LRM-as-a-judge in MT evaluation. We identify key challenges, revealing that LRMs require tailored evaluation materials, tend to "overthink" simpler instances, and have issues with scoring mechanisms that lead to overestimation. To address these, we propose to calibrate LRM thinking by training them on synthetic, human-like thinking trajectories. Our experiments on WMT24 Metrics benchmarks demonstrate that this approach largely reduces thinking budgets by ~35x while concurrently improving evaluation performance across different LRM scales from 7B to 32B (e.g., R1-Distill-Qwen-7B achieves a +8.7 correlation point improvement). These findings highlight the potential of efficiently calibrated LRMs to advance fine-grained automatic MT evaluation.

PDF Details

JBHI Journal 2025 Journal Article

Decoding Arm Movement Direction Using Ultra-High-Density EEG

Zhen Ma
Xinyi Yang
Jiayuan Meng
Kun Wang
Minpeng Xu
Dong Ming

Detecting arm movement direction is significant for individuals with upper-limb motor disabilities to restore independent self-care abilities. It involves accurately decoding the fine movement patterns of the arm, which has become feasible using invasive brain-computer interfaces (BCIs). However, it is still a significant challenge for traditional electroencephalography (EEG) based BCIs to decode multi-directional arm movements effectively. This study designed an ultra-high-density (UHD) EEG system to decode multi-directional arm movements. The system contains 200 electrodes with an interval of about 4 mm. We analyzed the patterns of the UHD EEG signals induced by arm movements in different directions. To extract discriminative features from UHD EEG, we proposed a spatial filtering method combining principal component analysis (PCA) and discriminative spatial pattern (DSP). We collected EEG signals from five healthy subjects (two left-handed and three right-handed) to verify the system's feasibility. The movement-related cortical potentials (MRCPs) showed a certain degree of separability both in waveforms and spatial patterns for arm movements in different directions. This study achieved an average classification accuracy of 63.15 (8.71)% for both arms (eight-class task) with a peak accuracy of 77.24%. For the dominant arm (four-class task), we obtained an average accuracy of 75.31 (9.21)% with a peak accuracy of 85.00%. For the first time, this study simultaneously decodes multi-directional movements of both arms using UHD EEG. This study provides a promising approach for detecting information about arm movement directions, which is significant for the development of BCIs.

Details DOI

NeurIPS Conference 2025 Conference Paper

Heterogeneous Adversarial Play in Interactive Environments

Manjie Xu
Xinyi Yang
Jiayu Zhan
Wei Liang
Chi Zhang
Yixin Zhu

Self-play constitutes a fundamental paradigm for autonomous skill acquisition, whereby agents iteratively enhance their capabilities through self-directed environmental exploration. Conventional self-play frameworks exploit agent symmetry within zero-sum competitive settings, yet this approach proves inadequate for open-ended learning scenarios characterized by inherent asymmetry. Human pedagogical systems exemplify asymmetric instructional frameworks wherein educators systematically construct challenges calibrated to individual learners' developmental trajectories. The principal challenge resides in operationalizing these asymmetric, adaptive pedagogical mechanisms within artificial systems capable of autonomously synthesizing appropriate curricula without predetermined task hierarchies. Here we present Heterogeneous Adversarial Play (HAP), an adversarial Automatic Curriculum Learning framework that formalizes teacher-student interactions as a minimax optimization wherein task-generating instructor and problem-solving learner co-evolve through adversarial dynamics. In contrast to prevailing automatic curriculum learning methodologies that employ static curricula or unidirectional task selection mechanisms, HAP establishes a bidirectional feedback system wherein instructors continuously recalibrate task complexity in response to real-time learner performance metrics. Experimental validation across multi-task learning domains demonstrates that our framework achieves performance parity with SOTA baselines while generating curricula that enhance learning efficacy in both artificial agents and human subjects.

PDF Details

AAAI Conference 2025 Conference Paper

Transfer Learning of Real Image Features with Soft Contrastive Loss for Fake Image Detection

Ziyou Liang
Weifeng Liu
Run Wang
Mengjie Wu
Boheng Li
Yuyang Zhang
Lina Wang
Xinyi Yang

In the last few years, the artifact patterns in fake images synthesized by different generative models have been inconsistent, leading to the failure of previous research that relied on spotting subtle differences between real and fake. In our preliminary experiments, we find that the artifacts in fake images always change with the development of the generative model, while natural images exhibit stable statistical properties. In this paper, we employ natural traces shared only by real images as an additional target for a classifier. Specifically, we introduce a self-supervised feature mapping process for natural trace extraction and develop a transfer learning based on soft contrastive loss to bring them closer to real images and further away from fake ones. This motivates the detector to make decisions based on the proximity of images to the natural traces. To conduct a comprehensive experiment, we built a high-quality and diverse dataset that includes generative models comprising GANs and diffusion models, to evaluate the effectiveness in generalizing to unknown forgery techniques and robustness in surviving different transformations. Experimental results show that our proposed method achieves 96.2% mAP, significantly outperforming the baselines. Extensive experiments conducted on the widely recognized platform Midjourney reveal that our proposed method achieves an accuracy exceeding 78.4%, underscoring its practicality for real-world application deployment.

PDF Details DOI

NeurIPS Conference 2024 Conference Paper

DetectRL: Benchmarking LLM-Generated Text Detection in Real-World Scenarios

Junchao Wu
Runzhe Zhan
Derek F. Wong
Shu Yang
Xinyi Yang
Yulin Yuan
Lidia S. Chao

Detecting text generated by large language models (LLMs) is of great recent interest. With zero-shot methods like DetectGPT, detection capabilities have reached impressive levels. However, the reliability of existing detectors in real-world applications remains underexplored. In this study, we present a new benchmark, DetectRL, highlighting that even state-of-the-art (SOTA) detection techniques still underperformed in this task. We collected human-written datasets from domains where LLMs are particularly prone to misuse. Using popular LLMs, we generated data that better aligns with real-world applications. Unlike previous studies, we employed heuristic rules to create adversarial LLM-generated text, simulating advanced prompt usages, human revisions like word substitutions, and writing errors. Our development of DetectRL reveals the strengths and limitations of current SOTA detectors. More importantly, we analyzed the potential impact of writing styles, model types, attack methods, the text lengths, and real-world human writing factors on different types of detectors. We believe DetectRL could serve as an effective benchmark for assessing detectors in real-world scenarios, evolving with advanced attack methods, thus providing more stressful evaluation to drive the development of more efficient detectors. Data and code are publicly available at: https://github.com/NLP2CT/DetectRL.

PDF Details DOI

AAMAS Conference 2023 Conference Paper

Asynchronous Multi-Agent Reinforcement Learning for Efficient Real-Time Multi-Robot Cooperative Exploration

Chao Yu
Xinyi Yang
Jiaxuan Gao
Jiayu Chen
Yunfei Li
Jijia Liu
Yunfei Xiang
Ruixin Huang

We consider the problem of cooperative exploration where multiple robots need to cooperatively explore an unknown region as fast as possible. Multi-agent reinforcement learning (MARL) has recently become a trending paradigm for solving this challenge. However, existing MARL-based methods adopt action-making steps as the metric for exploration efficiency by assuming all the agents are acting in a fully synchronous manner: i.e., every single agent produces an action simultaneously and every single action is executed instantaneously at each time step. Despite its mathematical simplicity, such a synchronous MARL formulation can be problematic for real-world robotic applications. It can be typical that different robots may take slightly different wall-clock times to accomplish an atomic action or even periodically get lost due to hardware issues. Simply waiting for every robot to be ready for the next action can be particularly time-inefficient. Therefore, we propose an asynchronous MARL solution, Asynchronous Coordination Explorer (ACE), to tackle this real-world challenge. We first extend a classical MARL algorithm, multi-agent PPO (MAPPO), to the asynchronous setting and additionally apply action-delay randomization to enforce the learned policy to generalize better to varying action delays in the real world. Moreover, each navigation agent is represented as a team-size-invariant CNN-based policy, which greatly benefits real-robot deployment by handling possible robot loss and allows bandwidth-efficient intra-agent communication through low-dimensional CNN features. We first validate our approach in a grid-based scenario. Both simulation and real-robot results show that ACE reduces actual exploration time by over 10% compared with classical approaches. We also apply our framework to a high-fidelity visual-based environment, Habitat, achieving 28% improvement in exploration efficiency.

Proc. of the 22nd International Conference on Autonomous Agents and Multiagent Systems (AAMAS 2023), A. Ricci, W. Yeoh, N. Agmon, B. An (eds.), May 29 – June 2, 2023, London, United Kingdom. © 2023 International Foundation for Autonomous Agents and Multiagent Systems (www.ifaamas.org). All rights reserved.

PDF

AAMAS Conference 2023 Conference Paper

Learning Graph-Enhanced Commander-Executor for Multi-Agent Navigation

Xinyi Yang
Shiyu Huang
Yiwen Sun
Yuxiang Yang
Chao Yu
Wei-Wei Tu
Huazhong Yang
Yu Wang

This paper investigates the multi-agent navigation problem, which requires multiple agents to reach the target goals in a limited time. Multi-agent reinforcement learning (MARL) has shown promising results for solving this issue. However, it is inefficient for MARL to directly explore the (nearly) optimal policy in the large search space, which is exacerbated as the agent number increases (e.g., 10+ agents) or the environment is more complex (e.g., 3D simulator). Goal-conditioned hierarchical reinforcement learning (HRL) provides a promising direction to tackle this challenge by introducing a hierarchical structure to decompose the search space, where the low-level policy predicts primitive actions under the guidance of the goals derived from the high-level policy. In this paper, we propose Multi-Agent Graph-Enhanced Commander-EXecutor (MAGE-X), a graph-based goal-conditioned hierarchical method for multi-agent navigation tasks. MAGE-X comprises a high-level Goal Commander and a low-level Action Executor. The Goal Commander predicts the probability distribution of the goals and leverages them to assign the most appropriate final target to each agent. The Action Executor utilizes graph neural networks (GNN) to construct a subgraph for each agent that only contains its crucial partners to improve cooperation. Additionally, the Goal Encoder in the Action Executor captures the relationship between the agent and the designated goal to encourage the agent to reach the final target. The results show that MAGE-X outperforms the state-of-the-art MARL baselines with a 100% success rate with only 3 million training steps in multi-agent particle environments (MPE) with 50 agents, and at least a 12% higher success rate and 2× higher data efficiency in a more complicated quadrotor 3D navigation task.

PDF

NeurIPS Conference 2023 Conference Paper

UniControl: A Unified Diffusion Model for Controllable Visual Generation In the Wild

Can Qin
Shu Zhang
Ning Yu
Yihao Feng
Xinyi Yang
Yingbo Zhou
Huan Wang
Juan Carlos Niebles

Achieving machine autonomy and human control often represent divergent objectives in the design of interactive AI systems. Visual generative foundation models such as Stable Diffusion show promise in navigating these goals, especially when prompted with arbitrary languages. However, they often fall short in generating images with spatial, structural, or geometric controls. The integration of such controls, which can accommodate various visual conditions in a single unified model, remains an unaddressed challenge. In response, we introduce UniControl, a new generative foundation model that consolidates a wide array of controllable condition-to-image (C2I) tasks within a singular framework, while still allowing for arbitrary language prompts. UniControl enables pixel-level-precise image generation, where visual conditions primarily influence the generated structures and language prompts guide the style and context. To equip UniControl with the capacity to handle diverse visual conditions, we augment pretrained text-to-image diffusion models and introduce a task-aware HyperNet to modulate the diffusion models, enabling the adaptation to different C2I tasks simultaneously. Trained on nine unique C2I tasks, UniControl demonstrates impressive zero-shot generation abilities with unseen visual conditions. Experimental results show that UniControl often surpasses the performance of single-task-controlled methods of comparable model sizes. This control versatility positions UniControl as a significant advancement in the realm of controllable visual generation.

PDF Details

YNICL Journal 2021 Journal Article

Aberrant rich club organization in patients with obsessive-compulsive disorder and their unaffected first-degree relatives

Ziwen Peng
Xinyi Yang
Chuanyong Xu
Xiangshu Wu
Qiong Yang
Zhen Wei
Zihan Zhou
Tom Verguts

Recent studies suggested that the rich club organization, which promotes global brain communication and integration of information, may be abnormally increased in obsessive-compulsive disorder (OCD). However, the structural and functional basis of this organization is still not very clear. Given the heritability of OCD, as suggested by previous family-based studies, we hypothesize that aberrant rich club organization may be a trait marker for OCD. In the present study, 32 patients with OCD, 30 unaffected first-degree relatives (FDR) and 32 healthy controls (HC) underwent diffusion tensor imaging (DTI) and functional magnetic resonance imaging (fMRI). We examined the structural rich club organization and its interrelationship with functional coupling. Our results showed that rich club and peripheral connection strength in patients with OCD was lower than in HC, while it was intermediate in FDR. Finally, the coupling between structural and functional connections of the rich club was decreased in FDR but not in OCD relative to HC, which suggests a buffering mechanism of brain functions in FDR. Overall, our findings suggest that alteration of the rich club organization may reflect a vulnerability biomarker for OCD, possibly buffered by structural and functional coupling of the rich club.

Details DOI