Arrow Research search

Author name cluster

Yue Wu

Possible papers associated with this exact author name in Arrow. This page groups case-insensitive exact name matches and is not a full identity disambiguation profile.

60 papers
2 author rows

Possible papers

60

AAAI Conference 2026 Conference Paper

Adaptive Graph Attention Based Discrete Hashing for Incomplete Cross-modal Retrieval

  • Shuang Zhang
  • Yue Wu
  • Lei Shi
  • Huilong Jin
  • Feifei Kou
  • Pengfei Zhang
  • Mingying Xu
  • Pengtao Lv

Cross-modal hashing has emerged as a pivotal solution for efficient retrieval across diverse modalities, such as images and texts, by mapping them into compact binary hash spaces. However, in real-world scenarios, modality data are often missing or misaligned. Most existing methods rely on fully paired training data and ignore missing or misaligned modality data, resulting in semantic inconsistencies. To address these challenges, we propose an Adaptive Graph Attention-Based Discrete Hashing (AGADH) method, which consists of three parts. First, to solve the problem of missing modalities, AGADH employs a masked completion strategy to reconstruct missing modalities. Second, to mitigate semantic misalignment, AGADH leverages a Graph Attention Network (GAT) encoder-decoder architecture with an alignment module to construct features from different modalities. Additionally, to enhance fusion performance, an adaptive fusion module is proposed that dynamically adjusts the contributions of the image and text modalities with learnable weighting coefficients. Extensive experiments on three benchmark datasets, MS-COCO, NUS-WIDE, and MIRFlickr-25K, demonstrate that AGADH outperforms state-of-the-art methods in both fully paired and incompletely paired scenarios, showing its robustness and effectiveness in cross-modal retrieval tasks.

AAAI Conference 2026 Conference Paper

DcSplat: Dual-Constraint Human Gaussian Splatting with Latent Multi-View Consistency

  • Tengfei Xiao
  • Yue Wu
  • Zhigang Gao
  • Yongzhe Yuan
  • Can Qin
  • Hao Li
  • Mingyang Zhang

Human Novel View Synthesis (HNVS) aims to synthesize photorealistic human images from novel viewpoints given observations from known views. Despite significant advances achieved by existing methods such as NeRF, diffusion models, and 3DGS, they still face substantial challenges in achieving stable modeling from a single image. In this paper, we introduce Dual-Constraint Human Gaussian Splatting (DcSplat), a novel, simple, and efficient 3D Gaussian-based framework for single-view 3D human reconstruction. To address occlusion-induced texture missing and depth ambiguities, we introduce two key components: a Latent Multi-View Consistency Constraint Mechanism and a Geometric Constraint Module. The former employs a Latent-space Appearance Transformer (LatentFormer) to learn semantically coherent, view-consistent appearance priors via SMPL-guided pseudo-view fusion. The latter refines noisy SMPL-based depth through a U-Net-like structure conditioned on latent appearance features. These two modules are jointly optimized to generate high-quality Gaussian parameters in a unified latent space. Extensive experiments demonstrate that DcSplat outperforms existing SOTA methods in both geometry and texture quality, while achieving fast inference and lower computational cost.

AAAI Conference 2026 Conference Paper

Flora: Effortless Context Construction to Arbitrary Length and Scale

  • Tianxiang Chen
  • Zhentao Tan
  • Xiaofan Bo
  • Yue Wu
  • Tao Gong
  • Qi Chu
  • Jieping Ye

Effectively handling long contexts is challenging for Large Language Models (LLMs) due to the rarity of long texts, high computational demands, and substantial forgetting of short-context abilities. Recent approaches have attempted to construct long contexts for instruction tuning, but these methods often require LLMs or human intervention, which is both costly and limited in length and diversity. Moreover, present long-context LLMs still show a significant drop in short-context performance. In this paper, we introduce Flora, an effortless (human/LLM-free) long-context construction strategy. Flora can markedly enhance the long-context performance of LLMs by arbitrarily assembling short instructions based on categories and instructing LLMs to generate responses based on long-context meta-instructions. This enables Flora to produce contexts of arbitrary length and scale with rich diversity, while only slightly compromising short-context performance. Experiments on Llama3-8B-Instruct and QwQ-32B show that LLMs enhanced by Flora excel in three long-context benchmarks while maintaining strong performance in short-context tasks.

AAAI Conference 2026 Conference Paper

Hybrid Vector-Occupancy Field for Robust Implicit 3D Surface Reconstruction

  • Yue Wu
  • Zhigang Gao
  • Tengfei Xiao
  • Can Qin
  • Yongzhe Yuan
  • Hao Li
  • Kaiyuan Feng
  • Wenping Ma

We introduce the Hybrid Vector-Occupancy Field (HVOF), a new implicit 3D representation for reconstructing both open and closed surfaces from sparse point clouds. Existing approaches face severe limitations: occupancy fields and signed distance fields struggle with open surfaces, while unsigned distance fields and neural vector fields exhibit directional instability in complex topologies and ridge regions. HVOF addresses these challenges by incorporating a smoothly decaying occupancy field around the surface, while capturing precise local geometry using truncated displacement vectors, naturally mitigating direction-field ambiguities near ridge regions. This unified design forms a robust hybrid representation that leverages both occupancy and vector fields. To realize it, we design a Hybrid Field variational autoencoder comprising a hierarchical cross-attention encoder and a dual-branch decoder that jointly learn occupancy and vector fields through continuous weighting. Extensive experiments demonstrate that HVOF consistently outperforms state-of-the-art methods across the ShapeNet, ABC, and MGN datasets, accurately reconstructing both open and closed surfaces while preserving fine geometric details in complex regions.

ICLR Conference 2025 Conference Paper

A Common Pitfall of Margin-based Language Model Alignment: Gradient Entanglement

  • Hui Yuan 0002
  • Yifan Zeng
  • Yue Wu
  • Huazheng Wang
  • Mengdi Wang 0001
  • Liu Leqi

Reinforcement Learning from Human Feedback (RLHF) has become the predominant approach for aligning language models (LMs) to be more helpful and less harmful. At its core, RLHF uses a margin-based loss for preference optimization, which specifies the ideal LM behavior only in terms of the difference between preferred and dispreferred responses. In this paper, we identify a common pitfall of margin-based methods---the under-specification of ideal LM behavior on preferred and dispreferred responses individually, which results in two unintended consequences as the margin increases: (1) The probability of dispreferred (e.g., unsafe) responses may increase, resulting in potential safety alignment failures. (2) The probability of preferred responses may decrease, even when those responses are ideal. We demystify the reasons behind these problematic behaviors: margin-based losses couple the change in the preferred probability with the gradient of the dispreferred one, and vice versa, often preventing the preferred probability from increasing while the dispreferred one decreases, and thus causing a synchronized increase or decrease in both probabilities. We term this effect, inherent in margin-based objectives, gradient entanglement. Formally, we derive conditions for general margin-based alignment objectives under which gradient entanglement becomes concerning: the inner product between the gradient of preferred log-probability and the gradient of dispreferred log-probability is large relative to the individual gradient norms. Furthermore, we theoretically investigate why such inner products can be large when aligning language models and empirically validate our findings. Empirical implications of our framework further extend to explaining important differences in the training dynamics of various preference optimization algorithms and suggesting future directions for improvement.
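The entanglement condition described above, a large inner product between the gradients of the preferred and dispreferred log-probabilities relative to their individual norms, can be sketched numerically. The toy gradient vectors below are assumed purely for illustration; they are not from the paper:

```python
import numpy as np

# Hypothetical gradients of the preferred and dispreferred log-probabilities
# with respect to model parameters (toy 4-D vectors, chosen for illustration).
grad_preferred = np.array([1.0, 0.5, 0.2, 0.1])
grad_dispreferred = np.array([0.9, 0.6, 0.1, 0.2])

inner = float(grad_preferred @ grad_dispreferred)
norm_p = float(np.linalg.norm(grad_preferred))
norm_d = float(np.linalg.norm(grad_dispreferred))

# Normalized entanglement indicator: values near 1 mean the two gradients
# point in nearly the same direction, so a margin-based update tends to move
# both log-probabilities up or down together.
entanglement = inner / (norm_p * norm_d)
print(entanglement)
```

Here the two toy gradients are nearly parallel, so the indicator is close to 1 and the margin objective cannot raise the preferred probability while lowering the dispreferred one.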

AAAI Conference 2025 Conference Paper

AdvDisplay: Adversarial Display Assembled by Thermoelectric Cooler for Fooling Thermal Infrared Detectors

  • Hao Li
  • Fanggao Wan
  • Yue Su
  • Yue Wu
  • Mingyang Zhang
  • Maoguo Gong

When current physical adversarial patches fail to deceive thermal infrared detectors, existing techniques must mount adversarial attacks from scratch, including digital patch generation, material production, and physical deployment. Besides, it is difficult to finely regulate infrared radiation. To address these issues, this paper designs an adversarial thermal display (AdvDisplay) by assembling thermoelectric coolers (TECs) into an array. Specifically, to reduce the gap between patches in the physical and digital worlds and decrease the power consumption of the AdvDisplay device, a heat transfer loss and an electric power loss are designed to guide the patch optimization. In addition, a precise temperature control scheme for AdvDisplay is proposed based on proportional-integral-derivative (PID) control. Thanks to the accurate temperature regulation and the reusability of AdvDisplay, our method improves both the attack success rate and the efficiency of physical deployment. Extensive experimental results indicate that the proposed method possesses superior adversarial effectiveness compared to other methods and demonstrates strong robustness in physical attacks.

ICML Conference 2025 Conference Paper

Beyond Bradley-Terry Models: A General Preference Model for Language Model Alignment

  • Yifan Zhang
  • Ge Zhang
  • Yue Wu
  • Kangping Xu
  • Quanquan Gu

Modeling human preferences is crucial for aligning foundation models with human values. Traditional reward modeling methods, such as the Bradley-Terry (BT) reward model, fall short in expressiveness, particularly in addressing intransitive preferences. In this paper, we introduce preference embedding, an approach that embeds responses into a latent space to capture intricate preference structures efficiently, achieving linear query complexity. Additionally, we propose preference score-based General Preference Optimization (GPO), which generalizes reward-based reinforcement learning from human feedback (RLHF). Experimental results show that our General Preference embedding Model (GPM) consistently outperforms the BT reward model on the RewardBench benchmark and effectively models cyclic preferences where any BT reward model behaves like a random guess. Furthermore, evaluations on downstream tasks such as AlpacaEval 2.0, following the language model post-training with GPO and our general preference model, reveal performance improvements over BT models. These findings indicate that our method may enhance the alignment of foundation models with nuanced human values. The code is available at https://github.com/general-preference/general-preference-model.
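Why a scalar (BT-style) reward cannot model cyclic preferences, while a latent preference embedding can, is easy to see in a toy setup. The 2-D embeddings and the skew-symmetric operator R below are assumed for demonstration and are not the paper's trained model:

```python
import numpy as np

# Skew-symmetric operator: score(y1 ≻ y2) = v1ᵀ R v2, so swapping the pair
# flips the sign. A toy stand-in for a preference-embedding score.
R = np.array([[0.0, -1.0], [1.0, 0.0]])

def embed(theta_deg):
    """Place a response on the unit circle (hypothetical 2-D embedding)."""
    t = np.radians(theta_deg)
    return np.array([np.cos(t), np.sin(t)])

def score(v1, v2):
    return float(v1 @ R @ v2)

# Three responses 120° apart.
a, b, c = embed(240), embed(120), embed(0)

# Cyclic preference a ≻ b ≻ c ≻ a: no scalar reward r(·) can satisfy
# r(a) > r(b) > r(c) > r(a), but all three scores below are positive.
print(score(a, b), score(b, c), score(c, a))
```

Each score equals sin(120°) ≈ 0.866, so the cycle is represented exactly, whereas any scalar reward would have to break it.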

AAAI Conference 2025 Conference Paper

Both Supply and Precision: Sample Debias and Ranking Consistency Joint Learning for Large Scale Pre-Ranking System

  • Feng Gao
  • Xin Zhou
  • Yinning Shao
  • Yue Wu
  • Jiahua Gao
  • Yujian Ren
  • Fengyang Qi
  • Ruochen Deng

Cascade ranking architecture, composed of matching, pre-ranking, ranking and re-ranking stages, is usually adopted to balance efficiency and effectiveness in real-world recommendation systems (RS). As the middle stage of an RS, pre-ranking aims to quickly filter out the low-quality items selected at the matching stage and forward high-quality items to the ranking stage. Existing pre-ranking approaches mainly suffer from two problems: 1) the Sample Selection Bias (SSB) problem, which heavily limits the performance of filtering out low-quality items because the data flow between stages is ignored; and 2) the Ranking Consistency (RC) problem, which may cause the ranked lists of the ranking stage and the preceding pre-ranking stage to be inconsistent. As a result, competitive items with high scores at the ranking stage may not be selected because of low scores at the pre-ranking stage. Both problems can cause sub-optimal performance, but previous works usually focus on only one of them. In this paper, we propose a novel Sample Debias and Ranking Consistency Joint Learning Framework (SDCL) to jointly alleviate the SSB and RC problems. SDCL consists of two main modules: 1) a Multi-Task Distillation Module (MTD), which enhances the ability to identify high-quality items by distilling knowledge across all tasks simultaneously from a more complex ranking model jointly trained with the pre-ranking model; and 2) an Adaptive Negative Sample Learning Module (ANSL), which improves the performance of filtering out low-quality items by adaptively adjusting the learning weights of negative samples based on the current performance of the model. SDCL seamlessly integrates the two modules in an end-to-end multi-task learning framework. Evaluations on both real-world large-scale traffic logs and an online A/B test demonstrate the efficacy and superiority of SDCL.

NeurIPS Conference 2025 Conference Paper

DISC: Dynamic Decomposition Improves LLM Inference Scaling

  • Jonathan Li
  • Wei Cheng
  • Benjamin Riviere
  • Yue Wu
  • Masafumi Oyamada
  • Mengdi Wang
  • Yisong Yue
  • Santiago Paternain

Inference scaling methods for LLMs often rely on decomposing problems into steps (or groups of tokens), followed by sampling and selecting the best next steps. However, these steps and their sizes are often predetermined or manually designed based on domain knowledge. We propose dynamic decomposition, a method that adaptively and automatically partitions solution and reasoning traces into manageable steps during inference. By more effectively allocating compute -- particularly through subdividing challenging steps and prioritizing their sampling -- dynamic decomposition significantly improves inference efficiency. Experiments on benchmarks such as APPS, MATH, and LiveCodeBench demonstrate that dynamic decomposition outperforms static approaches, including token-level, sentence-level, and single-step decompositions, reducing the pass@10 error rate by 5.0%, 6.7%, and 10.5% respectively. These findings highlight the potential of dynamic decomposition to improve a wide range of inference scaling techniques.
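The core idea, subdividing only the steps that look hard so sampling budget concentrates there, can be sketched with a recursive splitter. The `decompose` function, the token-based difficulty proxy, and the threshold are all assumed for illustration and are not the paper's implementation:

```python
# Toy sketch of dynamic decomposition: recursively split a reasoning trace,
# subdividing only spans whose estimated difficulty exceeds a threshold,
# so that hard spans end up as small, individually sampled steps.
def decompose(tokens, difficulty, threshold, min_len=2):
    """Return a list of steps (token spans); hard spans are split in half."""
    if len(tokens) <= min_len or difficulty(tokens) <= threshold:
        return [tokens]
    mid = len(tokens) // 2
    return (decompose(tokens[:mid], difficulty, threshold, min_len)
            + decompose(tokens[mid:], difficulty, threshold, min_len))

# Hypothetical difficulty proxy: fraction of "hard" tokens in the span.
hard = {"solve", "prove"}
diff = lambda span: sum(t in hard for t in span) / len(span)

trace = ["read", "plan", "solve", "prove", "check", "done"]
steps = decompose(trace, diff, threshold=0.25)
print(steps)
```

In a real system the difficulty estimate would come from model signals (e.g. sampled success rates) rather than a keyword set, but the allocation pattern, finer steps around hard spans, is the same.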

NeurIPS Conference 2025 Conference Paper

Dynamic Masking and Auxiliary Hash Learning for Enhanced Cross-Modal Retrieval

  • Shuang Zhang
  • Yue Wu
  • Lei Shi
  • Yingxue Zhang
  • Feifei Kou
  • Huilong Jin
  • Pengfei Zhang
  • Meiyu Liang

The demand for multimodal data processing drives the development of information technology. Cross-modal hash retrieval has attracted much attention because it overcomes modal differences and achieves efficient retrieval, showing great application potential in many practical scenarios. Existing cross-modal hashing methods have difficulty fully capturing the semantic information of different modal data, which leads to a significant semantic gap between modalities. Moreover, these methods often ignore differences in channel importance, and the limitation of a single objective also degrades the matching between hash codes to a certain extent. To address these issues, we propose a Dynamic Masking and Auxiliary Hash Learning (AHLR) method for enhanced cross-modal retrieval. By jointly leveraging the dynamic masking and auxiliary hash learning mechanisms, our approach effectively resolves the problems of channel information imbalance and insufficient key information capture, thereby significantly improving retrieval accuracy. Specifically, we introduce a dynamic masking mechanism that automatically screens and weights the key information in images and texts during training, enhancing the accuracy of feature matching. We further construct an auxiliary hash layer to adaptively balance the weights of features across each channel, compensating for the deficiencies of traditional methods in key information capture and channel processing. In addition, we design a contrastive loss function to optimize the generation of hash codes and enhance their discriminative power, further improving the performance of cross-modal retrieval. Comprehensive experimental results on the NUS-WIDE, MIRFlickr-25K and MS-COCO benchmark datasets show that the proposed AHLR algorithm outperforms several existing algorithms.

NeurIPS Conference 2025 Conference Paper

FedQS: Optimizing Gradient and Model Aggregation for Semi-Asynchronous Federated Learning

  • Yunbo Li
  • Jiaping Gui
  • Zhihang Deng
  • Fanchao Meng
  • Yue Wu

Federated learning (FL) enables collaborative model training across multiple parties without sharing raw data, with semi-asynchronous FL (SAFL) emerging as a balanced approach between synchronous and asynchronous FL. However, SAFL faces significant challenges in optimizing both gradient-based (e.g., FedSGD) and model-based (e.g., FedAvg) aggregation strategies, which exhibit distinct trade-offs in accuracy, convergence speed, and stability. While gradient aggregation achieves faster convergence and higher accuracy, it suffers from pronounced fluctuations, whereas model aggregation offers greater stability but slower convergence and suboptimal accuracy. This paper presents FedQS, the first framework to theoretically analyze and address these disparities in SAFL. FedQS introduces a divide-and-conquer strategy to handle client heterogeneity by classifying clients into four distinct types and adaptively optimizing their local training based on data distribution characteristics and available computational resources. Extensive experiments on computer vision, natural language processing, and real-world tasks demonstrate that FedQS achieves the highest accuracy, attains the lowest loss, and ranks among the fastest in convergence speed, outperforming state-of-the-art baselines. Our work bridges the gap between aggregation strategies in SAFL, offering a unified solution for stable, accurate, and efficient federated learning. The code and datasets are available at https://github.com/bkjod/FedQS_.
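The contrast between the two aggregation styles can be shown in one toy round. The weights, gradients, and step counts below are assumed for illustration (and with a single local step per client the two strategies would coincide):

```python
import numpy as np

# Toy server round contrasting gradient vs. model aggregation
# (illustrative values only; not the FedQS algorithm itself).
w_global = np.array([1.0, -2.0])
lr = 0.1
client_grads = [np.array([0.2, 0.4]), np.array([0.6, 0.0])]

# Gradient aggregation (FedSGD-style): average client gradients,
# then take one server step.
w_fedsgd = w_global - lr * np.mean(client_grads, axis=0)

# Model aggregation (FedAvg-style): each client runs several local steps
# (constant local gradient here for simplicity), then the server
# averages the resulting client models.
local_steps = 2
client_models = [w_global - lr * local_steps * g for g in client_grads]
w_fedavg = np.mean(client_models, axis=0)

print(w_fedsgd, w_fedavg)  # the two strategies land on different models
```

With multiple local steps the averaged models drift away from the single-step gradient update, which is one source of the accuracy/stability trade-off the abstract describes.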

AAAI Conference 2025 Conference Paper

Infer Human’s Intentions Before Following Natural Language Instructions

  • Yanming Wan
  • Yue Wu
  • Yiping Wang
  • Jiayuan Mao
  • Natasha Jaques

For AI agents to be helpful to humans, they should be able to follow natural language instructions to complete everyday cooperative tasks in human environments. However, real human instructions inherently possess ambiguity, because the human speakers assume sufficient prior knowledge about their hidden goals and intentions. Standard language grounding and planning methods fail to address such ambiguities because they do not model human internal goals as additional partially observable factors in the environment. We propose a new framework, Follow Instructions with Social and Embodied Reasoning (FISER), aiming for better natural language instruction following in collaborative embodied tasks. Our framework makes explicit inferences about human goals and intentions as intermediate reasoning steps. We implement a set of Transformer-based models and evaluate them over a challenging benchmark, HandMeThat. We empirically demonstrate that using social reasoning to explicitly infer human intentions before making action plans surpasses purely end-to-end approaches. We also compare our implementation with strong baselines, including Chain of Thought prompting on the largest available pre-trained language models, and find that FISER provides better performance on the embodied social reasoning tasks under investigation, reaching the state-of-the-art on HandMeThat.

NeurIPS Conference 2025 Conference Paper

Knowledge-based Visual Question Answer with Multimodal Processing, Retrieval and Filtering

  • Yuyang Hong
  • Jiaqi Gu
  • Yang Qi
  • Lubin Fan
  • Yue Wu
  • Ying Wang
  • Kun Ding
  • Shiming Xiang

The task of Knowledge-Based Visual Question Answering (KB-VQA) requires the model to understand visual features and retrieve external knowledge. Retrieval-Augmented Generation (RAG) has been employed to address this problem through knowledge base querying. However, existing work demonstrates two limitations: insufficient interactivity during knowledge retrieval and ineffective organization of retrieved information for the Vision-Language Model (VLM). To address these challenges, we propose a three-stage vision-language model with a Process, Retrieve and Filter (VLM-PRF) framework. For interactive retrieval, VLM-PRF uses reinforcement learning (RL) to guide the model to strategically process information via tool-driven operations. For knowledge filtering, our method trains the VLM to transform the raw retrieved information into task-specific knowledge. With a dual reward as the supervisory signal, VLM-PRF enables the model to optimize retrieval strategies and answer generation capabilities simultaneously. Experiments on two datasets demonstrate the effectiveness of our framework.

ICLR Conference 2025 Conference Paper

Learning Fine-Grained Representations through Textual Token Disentanglement in Composed Video Retrieval

  • Yue Wu
  • Zhaobo Qi
  • Yiling Wu
  • Junshu Sun
  • Yaowei Wang 0001
  • Shuhui Wang

With the explosive growth of video data, finding videos that meet detailed requirements in large datasets has become a challenge. To address this, the composed video retrieval task has been introduced, enabling users to retrieve videos using complex queries that involve both visual and textual information. However, the inherent heterogeneity between the modalities poses significant challenges. Textual data are highly abstract, while video content contains substantial redundancy. The modality gap in information representation makes existing methods struggle with the modality fusion and alignment required for fine-grained composed retrieval. To overcome these challenges, we first introduce FineCVR-1M, a fine-grained composed video retrieval dataset containing 1,010,071 video-text triplets with detailed textual descriptions. This dataset is constructed through an automated process that identifies key concept changes between video pairs to generate textual descriptions for both static and action concepts. For fine-grained retrieval methods, the key challenge lies in understanding the detailed requirements. Text descriptions serve as clear expressions of intent, but they require models to distinguish subtle differences in the description of video semantics. Therefore, we propose a textual Feature Disentanglement and Cross-modal Alignment framework (FDCA) that disentangles features at both the sentence and token levels. At the sequence level, we separate text features into retained and injected features. At the token level, an Auxiliary Token Disentangling mechanism is proposed to disentangle texts into retained, injected, and excluded tokens. The disentanglement at both levels extracts fine-grained features, which are aligned and fused with the reference video to extract global representations for video retrieval. Experiments on the FineCVR-1M dataset demonstrate the superior performance of FDCA. Our code and dataset are available at: https://may2333.github.io/FineCVR/.

ICML Conference 2025 Conference Paper

MATH-Perturb: Benchmarking LLMs' Math Reasoning Abilities against Hard Perturbations

  • Kaixuan Huang
  • Jiacheng Guo
  • Zihao Li
  • Xiang Ji
  • Jiawei Ge 0003
  • Wenzhe Li
  • Yingqing Guo
  • Tianle Cai

Large language models have demonstrated impressive performance on challenging mathematical reasoning tasks, which has triggered the discussion of whether the performance is achieved by true reasoning capability or memorization. To investigate this question, prior work has constructed mathematical benchmarks when questions undergo simple perturbations – modifications that still preserve the underlying reasoning patterns of the solutions. However, no work has explored hard perturbations, which fundamentally change the nature of the problem so that the original solution steps do not apply. To bridge the gap, we construct MATH-P-Simple and MATH-P-Hard via simple perturbation and hard perturbation, respectively. Each consists of 279 perturbed math problems derived from level-5 (hardest) problems in the MATH dataset (Hendrycks et al., 2021). We observe significant performance drops on MATH-P-Hard across various models, including o1-mini (-16.49%) and gemini-2.0-flash-thinking (-12.9%). We also raise concerns about a novel form of memorization where models blindly apply learned problem-solving skills without assessing their applicability to modified contexts. This issue is amplified when using original problems for in-context learning. We call for research efforts to address this challenge, which is critical for developing more robust and reliable reasoning models. The project is available at https://math-perturb.github.io/.

AAAI Conference 2025 Conference Paper

MUCD: Unsupervised Point Cloud Change Detection via Masked Consistency

  • Yue Wu
  • Zhipeng Wang
  • Yongzhe Yuan
  • Maoguo Gong
  • Hao Li
  • Mingyang Zhang
  • Wenping Ma
  • Qiguang Miao

3D Change Detection (3DCD) has gradually become another research hotspot after image change detection. Recent works focus on using artificial labels for supervised or weakly-supervised training of siamese networks to segment changed points. However, labeling every point of multi-temporal point clouds is very expensive and time-consuming. In addition, these works lack effective self-supervised signals, and existing self-supervised signals often fail to capture sufficiently rich change information. To solve this problem, we assume that a powerful representation of 3D objects should model the consistency information of unchanged regions and distinguish different objects. Based on this assumption, we propose a new unsupervised framework called MUCD that learns change information of multi-temporal point clouds through bidirectional optimization of a change segmentor and a feature extractor. The training of the network is divided into two stages. We first design a foreknowledge point contrastive loss based on the characteristics of the 3DCD task to initialize the feature extractor, and then propose a masked consistency loss to further learn the shared geometric information of unchanged regions in the multi-temporal point clouds, utilizing it as a free and powerful supervised signal to train the change segmentor. In the inference stage, only the segmentor is used: it takes multi-temporal point clouds as input and produces the change segmentation result. Extensive experiments are conducted on SLPCCD and Urb3DCD, two real-world datasets of streets and urban buildings, verifying that our proposed unsupervised method is highly competitive and even outperforms supervised methods in scenes where semantic information changes occur, exhibiting better generalization ability and robustness.

AAAI Conference 2025 Conference Paper

Partial Point Cloud Registration with Multi-view 2D Image Learning

  • Yue Zhang
  • Yue Wu
  • Wenping Ma
  • Maoguo Gong
  • Hao Li
  • Biao Hou

Learning representations from abundant 2D image data has shown promising performance, yet very few works apply these representations to point cloud registration. In this paper, we explore how to leverage 2D information to assist point cloud registration, and propose IAPReg, an Image-Assisted Partial 3D point cloud Registration framework based on multi-view images generated from the input point cloud. It is expected to enrich 3D information with 2D knowledge, and to leverage that knowledge to assist point cloud registration. Specifically, we create multi-view depth maps by projecting the input point cloud from several specific views, and then extract 2D and 3D features using well-established models. To fuse the information learned from the 2D and 3D modalities, an inter-modality multi-view learning module is proposed to enhance geometric information and complement semantic information. Weighted SVD is a common method to reduce the impact of inaccurate correspondences on registration; however, determining the correspondence weights is not trivial. Therefore, we design a 2D-weighted SVD method, where the 2D knowledge is employed to provide weight information for correspondences. Extensive experiments show that our method outperforms the state-of-the-art methods without additional 2D training data.
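Weighted SVD itself is the classic weighted Kabsch solver: down-weighted correspondences barely influence the estimated rotation and translation. A minimal sketch follows; the synthetic points, the injected outlier, and the hand-picked weights stand in for the 2D-derived weights the paper proposes:

```python
import numpy as np

def weighted_svd(src, tgt, w):
    """Weighted Kabsch: rigid (R, t) minimizing the weighted alignment error."""
    w = w / w.sum()
    mu_s = (w[:, None] * src).sum(0)          # weighted centroids
    mu_t = (w[:, None] * tgt).sum(0)
    H = (src - mu_s).T @ np.diag(w) @ (tgt - mu_t)   # weighted cross-covariance
    U, _, Vt = np.linalg.svd(H)
    d = np.sign(np.linalg.det(Vt.T @ U.T))    # guard against reflections
    R = Vt.T @ np.diag([1.0, 1.0, d]) @ U.T
    t = mu_t - R @ mu_s
    return R, t

rng = np.random.default_rng(0)
src = rng.normal(size=(10, 3))
R_true = np.array([[0.0, -1.0, 0.0], [1.0, 0.0, 0.0], [0.0, 0.0, 1.0]])
t_true = np.array([0.5, -0.2, 0.1])
tgt = src @ R_true.T + t_true
tgt[0] += 5.0                      # one gross outlier correspondence...
w = np.ones(10); w[0] = 1e-6       # ...suppressed by a near-zero weight
R, t = weighted_svd(src, tgt, w)
print(np.allclose(R, R_true, atol=1e-4), np.allclose(t, t_true, atol=1e-4))
```

With uniform weights the outlier would corrupt the estimate, which is exactly why choosing good weights, here from 2D knowledge, matters.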

NeurIPS Conference 2025 Conference Paper

PointTruss: K-Truss for Point Cloud Registration

  • Yue Wu
  • Jun Jiang
  • Yongzhe Yuan
  • Maoguo Gong
  • Qiguang Miao
  • Hao Li
  • Mingyang Zhang
  • Wenping Ma

Point cloud registration is a fundamental task in 3D computer vision. Recent advances have shown that graph-based methods are effective for outlier rejection in this context. However, existing clique-based methods impose overly strict constraints and are NP-hard, making it difficult to achieve both robustness and efficiency. The k-core reduces computational complexity but considers only node degree and ignores higher-order topological structures such as triangles, limiting its effectiveness in complex scenarios. To overcome these limitations, we introduce the $k$-truss from graph theory into point cloud registration, leveraging triangle support as a constraint for inlier selection. We further propose a consensus voting-based low-scale sampling strategy to efficiently extract the structural skeleton of the point cloud prior to $k$-truss decomposition. Additionally, we design a spatial distribution score that balances coverage and uniformity of inliers, preventing selections that concentrate on sparse local clusters. Extensive experiments on KITTI, 3DMatch, and 3DLoMatch demonstrate that our method consistently outperforms both traditional and learning-based approaches in various indoor and outdoor scenarios, achieving state-of-the-art results.
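The $k$-truss constraint itself is simple: every surviving edge must be supported by at least $k-2$ triangles. A minimal peeling sketch of the decomposition follows (an assumed simplification for illustration, not the paper's full pipeline with sampling and scoring):

```python
# Minimal k-truss peeling: repeatedly delete edges whose triangle support
# (number of common neighbors of the two endpoints) is below k - 2.
def k_truss(edges, k):
    edges = {tuple(sorted(e)) for e in edges}
    while True:
        adj = {}
        for u, v in edges:
            adj.setdefault(u, set()).add(v)
            adj.setdefault(v, set()).add(u)
        weak = {(u, v) for u, v in edges if len(adj[u] & adj[v]) < k - 2}
        if not weak:
            return edges
        edges -= weak

# A 4-clique on {1,2,3,4} plus a pendant edge (4,5): for k = 3 every edge
# must lie in at least one triangle, so the pendant edge is peeled away.
truss = k_truss([(1, 2), (1, 3), (1, 4), (2, 3), (2, 4), (3, 4), (4, 5)], k=3)
print(sorted(truss))
```

In the registration setting, nodes would be candidate correspondences and edges compatibility relations, so the surviving truss keeps mutually consistent inliers without the NP-hard cost of maximum-clique search.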

ICML Conference 2025 Conference Paper

Ranking with Multiple Oracles: From Weak to Strong Stochastic Transitivity

  • Tao Jin 0002
  • Yue Wu
  • Quanquan Gu
  • Farzad Farnoud

We study the problem of efficiently aggregating the preferences of items from multiple information sources (oracles) and inferring the ranking under both the weak stochastic transitivity (WST) and the strong stochastic transitivity (SST) conditions. When the underlying preference model satisfies the WST condition, we propose an algorithm named RMO-WST, which has a bi-level design: at the higher level, it actively allocates comparison budgets to all undetermined pairs until the full ranking is recovered; at the lower level, it attempts to compare the pair of items and selects the more accurate oracles simultaneously. We prove that the sample complexity of RMO-WST is $\tilde O( N\sum_{i=2}^{N}H_{\sigma^{-1}(i), {\sigma^{-1}(i-1)}} )$, where $N$ is the number of items to rank, $H$ is a problem-dependent hardness factor, and $\sigma^{-1}(i)$ represents the $i$-th best item. We also provide a tight lower bound that matches the upper bound of approximate ranking under the WST condition, answering a previously open problem. In addition, when the SST condition is satisfied, we propose an algorithm named RMO-SST, which can achieve an $\tilde{O}(\sum_{i=1}^{N} H_i \log(N))$ sample complexity. This outperforms the best-known sample complexity by a factor of $\log(N)$. The theoretical advantages of our algorithms are verified by empirical experiments in a simulated environment.

ICML Conference 2025 Conference Paper

ROPO: Robust Preference Optimization for Large Language Models

  • Xize Liang
  • Chao Chen 0026
  • Shuang Qiu
  • Jie Wang 0005
  • Yue Wu
  • Zhihang Fu
  • Hanzhu Chen
  • Feng Wu 0001

The prevalent noise in the preference data unavoidably poses significant challenges to the preference alignment of large language models (LLMs). Existing efforts for this problem either marginally alleviate the impact of noise without noise reduction, or rely on external LLMs that incur substantial computational costs. To address these challenges, we propose Robust Preference Optimization (ROPO), an iterative alignment approach that integrates noise-tolerance and noise filtering without the aid of external models. Specifically, ROPO first formulates the training process with adaptive noise reduction as an optimization problem, which can be efficiently solved in an iterative paradigm. Then, to equip this solving process with noise-tolerance and noise-identification capabilities, we derive a robust loss that suppresses the gradients from samples with high uncertainty. We demonstrate both empirically and theoretically that the derived loss is key to the noise-tolerance and effective filtering of noisy samples. The derived loss further inspires a robustness-guided rejection sampling technique to compensate for the potentially important information in discarded queries. Extensive experiments on several widely-used datasets and model architectures demonstrate that ROPO significantly outperforms all baselines under four practical noise settings and the random symmetric noise, with its advantage increasing as the noise rate increases.

NeurIPS Conference 2025 Conference Paper

SD-VLM: Spatial Measuring and Understanding with Depth-Encoded Vision-Language Models

  • Pingyi Chen
  • Yujing Lou
  • Shen Cao
  • Jinhui Guo
  • Lubin Fan
  • Yue Wu
  • Lin Yang
  • Lizhuang Ma

While vision language models (VLMs) excel in 2D semantic visual understanding, their ability to quantitatively reason about 3D spatial relationships remains underexplored due to the deficiency of spatial representation ability of 2D images. In this paper, we analyze the problem hindering VLMs’ spatial understanding abilities and propose SD-VLM, a novel framework that significantly enhances fundamental spatial perception abilities of VLMs through two key contributions: (1) propose Massive Spatial Measuring and Understanding (MSMU) dataset with precise spatial annotations, and (2) introduce a simple depth positional encoding method strengthening VLMs’ spatial awareness. MSMU dataset includes massive quantitative spatial tasks with 700K QA pairs, 2.5M physical numerical annotations, and 10K chain-of-thought augmented samples. We have trained SD-VLM, a strong generalist VLM which shows superior quantitative spatial measuring and understanding capability. SD-VLM not only achieves state-of-the-art performance on our proposed MSMU-Bench, but also shows spatial generalization abilities on other spatial understanding benchmarks including Q-Spatial and SpatialRGPTBench. Extensive experiments demonstrate that SD-VLM outperforms GPT-4o and Intern-VL3-78B by 26.91% and 25.56% respectively on MSMU-Bench. Code and models are released at https://github.com/cpystan/SD-VLM.

ICLR Conference 2025 Conference Paper

Self-Play Preference Optimization for Language Model Alignment

  • Yue Wu
  • Zhiqing Sun
  • Huizhuo Yuan
  • Kaixuan Ji
  • Yiming Yang 0002
  • Quanquan Gu

Standard reinforcement learning from human feedback (RLHF) approaches relying on parametric models like the Bradley-Terry model fall short in capturing the intransitivity and irrationality in human preferences. Recent advancements suggest that directly working with preference probabilities can yield a more accurate reflection of human preferences, enabling more flexible and accurate language model alignment. In this paper, we propose a self-play-based method for language model alignment, which treats the problem as a constant-sum two-player game aimed at identifying the Nash equilibrium policy. Our approach, dubbed *Self-Play Preference Optimization* (SPPO), utilizes iterative policy updates to provably approximate the Nash equilibrium. Additionally, we propose a new SPPO objective which is both strongly motivated by theory and is simple and effective in practice. In our experiments, using only 60k prompts (without responses) from the UltraFeedback dataset and without any prompt augmentation, by leveraging a pre-trained preference model PairRM with only 0.4B parameters, SPPO can obtain a model from fine-tuning Mistral-7B-Instruct-v0.2 that achieves the state-of-the-art length-controlled win-rate of 28.53\% against GPT-4-Turbo on AlpacaEval 2.0. It also outperforms the (iterative) DPO and IPO on MT-Bench, Arena-Hard, and the Open LLM Leaderboard. Starting from a stronger base model Llama-3-8B-Instruct, we are able to achieve a length-controlled win rate of 38.77\%. Notably, the strong performance of SPPO is achieved without additional external supervision (e.g., responses, preferences, etc.) from GPT-4 or other stronger language models.

ICLR Conference 2025 Conference Paper

SFS: Smarter Code Space Search improves LLM Inference Scaling

  • Jonathan Light
  • Yue Wu
  • Yiyou Sun
  • Wenchao Yu
  • Yanchi Liu
  • Xujiang Zhao
  • Ziniu Hu
  • Haifeng Chen

We frame code generation as a black-box optimization problem within the code space and demonstrate how optimization-inspired techniques can enhance inference scaling over text. Based on this perspective, we propose **SCATTERED FOREST SEARCH (SFS)**, a novel approach that improves solution diversity during evolutionary search, thereby avoiding local optima. Our theoretical analysis illustrates how these methods improve exploration and enhance efficiency. Extensive experiments on *HumanEval, MBPP, APPS, CodeContests,* and *Leetcode* reveal significant performance gains. For instance, our method achieves a **pass@1 rate of 67.1% on HumanEval+** and **87.2% on HumanEval with GPT-3.5**, marking improvements of **8.6%** and **4.3%** over the state-of-the-art, while also halving the iterations needed to find the correct solution. Furthermore, our approach scales more efficiently than existing search techniques, including **tree search, line search,** and **repeated sampling (Best of N)**.

AAAI Conference 2025 Conference Paper

Where Precision Meets Efficiency: Transformation Diffusion Model for Point Cloud Registration

  • Yongzhe Yuan
  • Yue Wu
  • Xiaolong Fan
  • Maoguo Gong
  • Qiguang Miao
  • Wenping Ma

We propose a transformation diffusion model for point cloud registration to balance precision and efficiency. Our method formulates point cloud registration as a denoising diffusion process from noisy transformation to object transformation, which is represented by quaternion and translation. Specifically, in the training stage, object transformation diffuses from ground-truth transformation to random distribution, and the model learns to reverse this noising process. In the sampling stage, the model refines randomly generated transformation to the optimal transformation in a progressive way. We derive the variational bound in closed form for training and provide instantiation of the model. Our diffusion model maps transformation into latent space, and splits the transformation into two components (rotation and translation) based on the fact that they belong to different solution spaces. In addition, our work provides the following crucial findings: (i) Point cloud registration, one of the representative discriminative tasks, can be solved in a generative way and mapped into latent space to obtain a new unified probabilistic formulation. (ii) Our model, Transformation Diffusion Model (TDM), can be a plug-and-play agent for point cloud registration, making our method applicable to different deep registration networks. Experimental results on synthetic and real-world datasets demonstrate that, in correspondence-free and correspondence-based scenarios, TDM can achieve over 60% performance improvements and higher efficiency simultaneously.

ICLR Conference 2024 Conference Paper

Boosting Vanilla Lightweight Vision Transformers via Re-parameterization

  • Zhentao Tan
  • Xiaodan Li
  • Yue Wu
  • Qi Chu 0001
  • Le Lu 0001
  • Nenghai Yu
  • Jieping Ye

Large-scale Vision Transformers have achieved promising performance on downstream tasks through feature pre-training. However, the performance of vanilla lightweight Vision Transformers (ViTs) is still far from satisfactory compared to that of recent lightweight CNNs or hybrid networks. In this paper, we aim to unlock the potential of vanilla lightweight ViTs by exploring the adaptation of the widely-used re-parameterization technology to ViTs for improving learning ability during training without increasing the inference cost. The main challenge comes from the fact that re-parameterization pairs naturally with the convolution and batch normalization in CNNs, while vanilla Transformer architectures are mainly comprised of linear and layer normalization layers. We propose to incorporate the nonlinear ensemble into linear layers by expanding the depth of the linear layers with batch normalization and fusing multiple linear features with hierarchical representation ability through a pyramid structure. We also discover and solve a new transformer-specific distribution rectification problem caused by multi-branch re-parameterization. Finally, we propose our Two-Dimensional Re-parameterized Linear module (TDRL) for ViTs. Under the popular self-supervised pre-training and supervised fine-tuning strategy, our TDRL can be used in these two stages to enhance both generic and task-specific representation. Experiments demonstrate that our proposed method not only boosts the performance of vanilla ViT-Tiny on various vision tasks to new state-of-the-art (SOTA) but also shows promising generalization ability on other networks. Code will be available.

ICML Conference 2024 Conference Paper

Borda Regret Minimization for Generalized Linear Dueling Bandits

  • Yue Wu
  • Tao Jin 0002
  • Qiwei Di
  • Hao Lou
  • Farzad Farnoud
  • Quanquan Gu

Dueling bandits are widely used to model preferential feedback prevalent in many applications such as recommendation systems and ranking. In this paper, we study the Borda regret minimization problem for dueling bandits, which aims to identify the item with the highest Borda score while minimizing the cumulative regret. We propose a rich class of generalized linear dueling bandit models, which cover many existing models. We first prove a regret lower bound of order $\Omega(d^{2/3} T^{2/3})$ for the Borda regret minimization problem, where $d$ is the dimension of contextual vectors and $T$ is the time horizon. To attain this lower bound, we propose an explore-then-commit type algorithm for the stochastic setting, which has a nearly matching regret upper bound $\tilde{O}(d^{2/3} T^{2/3})$. We also propose an EXP3-type algorithm for the adversarial linear setting, where the underlying model parameter can change in each round. Our algorithm achieves an $\tilde{O}(d^{2/3} T^{2/3})$ regret, which is also optimal. Empirical evaluations on both synthetic data and a simulated real-world environment are conducted to corroborate our theoretical analysis.
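The target statistic in this abstract, the Borda score, has a direct definition: the Borda score of item $i$ is its probability of beating another item chosen uniformly at random. A minimal sketch of that definition given a known pairwise preference matrix (illustrative only; the paper's algorithms estimate these scores from bandit feedback under a generalized linear model, which is not shown here):

```python
import numpy as np

def borda_scores(pref):
    """Borda score of item i = average of P(i beats j) over all j != i.
    `pref[i, j]` holds the probability that item i wins a duel against j."""
    n = pref.shape[0]
    off_diag = pref * (1 - np.eye(n))   # zero out self-comparisons
    return off_diag.sum(axis=1) / (n - 1)
```

The Borda winner is then simply `borda_scores(pref).argmax()`.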

NeurIPS Conference 2024 Conference Paper

Delving into the Reversal Curse: How Far Can Large Language Models Generalize?

  • Zhengkai Lin
  • Zhihang Fu
  • Kai Liu
  • Liang Xie
  • Binbin Lin
  • Wenxiao Wang
  • Deng Cai
  • Yue Wu

While large language models (LLMs) showcase unprecedented capabilities, they also exhibit certain inherent limitations when facing seemingly trivial tasks. A prime example is the recently debated "reversal curse", which surfaces when models, having been trained on the fact "A is B", struggle to generalize this knowledge to infer that "B is A". In this paper, we examine the manifestation of the reversal curse across various tasks and delve into both the generalization abilities and the problem-solving mechanisms of LLMs. This investigation leads to a series of significant insights: (1) LLMs are able to generalize to "B is A" when both A and B are presented in the context as in the case of a multiple-choice question. (2) This generalization ability is highly correlated to the structure of the fact "A is B" in the training documents. For example, this generalization only applies to biographies structured in "[Name] is [Description]" but not to "[Description] is [Name]". (3) We propose and verify the hypothesis that LLMs possess an inherent bias in fact recalling during knowledge application, which explains and underscores the importance of the document structure to successful learning. (4) The negative impact of this bias on the downstream performance of LLMs can hardly be mitigated through training alone. Based on these intriguing findings, our work not only presents a novel perspective for interpreting LLMs' generalization abilities from their intrinsic working mechanism but also provides new insights for the development of more effective learning methods for LLMs.

ICLR Conference 2024 Conference Paper

DNA-GPT: Divergent N-Gram Analysis for Training-Free Detection of GPT-Generated Text

  • Xianjun Yang
  • Wei Cheng 0002
  • Yue Wu
  • Linda Ruth Petzold
  • William Yang Wang
  • Haifeng Chen

Large language models (LLMs) have notably enhanced the fluency and diversity of machine-generated text. However, this progress also presents a significant challenge in detecting the origin of a given text, and current research on detection methods lags behind the rapid evolution of LLMs. Conventional training-based methods have limitations in flexibility, particularly when adapting to new domains, and they often lack explanatory power. To address this gap, we propose a novel training-free detection strategy called Divergent N-Gram Analysis (DNA-GPT). Given a text, we first truncate it in the middle and then use only the preceding portion as input to the LLMs to regenerate the new remaining parts. By analyzing the differences between the original and new remaining parts through N-gram analysis in black-box or probability divergence in white-box, we can clearly illustrate significant discrepancies between machine-generated and human-written text. We conducted extensive experiments on the most advanced LLMs from OpenAI, including text-davinci-003, GPT-3.5-turbo, and GPT-4, as well as open-source models such as GPT-NeoX-20B and LLaMa-13B. Results show that our zero-shot approach exhibits state-of-the-art performance in distinguishing between human and GPT-generated text on four English and one German dataset, outperforming OpenAI's own classifier, which is trained on millions of text. Additionally, our methods provide reasonable explanations and evidence to support our claim, which is a unique feature of explainable detection. Our method is also robust under the revised text attack and can additionally solve model sourcing.
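The black-box side of the idea above can be sketched concretely: truncate a text, have the LLM regenerate the remainder from the prefix, and measure n-gram overlap between the original and regenerated continuations. A toy scoring function under that reading (the exact weighting and n-gram range in DNA-GPT may differ; this is an assumption-laden illustration, with the LLM regeneration step left out):

```python
def ngrams(tokens, n):
    """Set of word n-grams of a token list."""
    return {tuple(tokens[i:i + n]) for i in range(len(tokens) - n + 1)}

def overlap_score(original_rest, regenerated_rest, n_range=(2, 5)):
    """Average n-gram set overlap between the original continuation and an
    LLM-regenerated one; higher overlap suggests the original text was
    itself machine-generated."""
    a, b = original_rest.split(), regenerated_rest.split()
    scores = []
    for n in range(n_range[0], n_range[1] + 1):
        ga, gb = ngrams(a, n), ngrams(b, n)
        if not ga or not gb:
            continue
        scores.append(len(ga & gb) / min(len(ga), len(gb)))
    return sum(scores) / len(scores) if scores else 0.0
```

Identical continuations score 1.0 and fully disjoint ones score 0.0; real machine-generated text tends to sit closer to the former when regenerated by the same model.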

NeurIPS Conference 2024 Conference Paper

Enhancing LLM’s Cognition via Structurization

  • Kai Liu
  • Zhihang Fu
  • Chao Chen
  • Wei Zhang
  • Rongxin Jiang
  • Fan Zhou
  • Yaowu Chen
  • Yue Wu

When reading long-form text, human cognition is complex and structurized. While large language models (LLMs) process input contexts through a causal and sequential perspective, this approach can potentially limit their ability to handle intricate and complex inputs effectively. To enhance LLM’s cognition capability, this paper presents a novel concept of context structurization. Specifically, we transform the plain, unordered contextual sentences into well-ordered and hierarchically structurized elements. By doing so, LLMs can better grasp intricate and extended contexts through precise attention and information-seeking along the organized structures. Extensive evaluations are conducted across various model architectures and sizes (including a series of auto-regressive LLMs as well as BERT-like masking models) on a diverse set of NLP tasks (e.g., context-based question-answering, exhaustive hallucination evaluation, and passage-level dense retrieval). Empirical results show consistent and significant performance gains afforded by a single-round structurization. In particular, we boost the open-sourced LLaMA2-70B model to achieve comparable performance against GPT-3.5-Turbo as the hallucination evaluator. Besides, we show the feasibility of distilling advanced LLMs’ language processing abilities to a smaller yet effective StruXGPT-7B to execute structurization, addressing the practicality of our approach. Code is available at https://github.com/alibaba/struxgpt.

ICLR Conference 2024 Conference Paper

INSIDE: LLMs' Internal States Retain the Power of Hallucination Detection

  • Chao Chen 0026
  • Kai Liu 0023
  • Ze Chen 0001
  • Yi Gu
  • Yue Wu
  • Mingyuan Tao
  • Zhihang Fu
  • Jieping Ye

Knowledge hallucinations have raised widespread concerns for the security and reliability of deployed LLMs. Previous efforts in detecting hallucinations have been employed at logit-level uncertainty estimation or language-level self-consistency evaluation, where the semantic information is inevitably lost during the token-decoding procedure. Thus, we propose to explore the dense semantic information retained within LLMs' \textbf{IN}ternal \textbf{S}tates for halluc\textbf{I}nation \textbf{DE}tection (\textbf{INSIDE}). In particular, a simple yet effective \textbf{EigenScore} metric is proposed to better evaluate responses' self-consistency, which exploits the eigenvalues of responses' covariance matrix to measure the semantic consistency/diversity in the dense embedding space. Furthermore, from the perspective of self-consistent hallucination detection, a test time feature clipping approach is explored to truncate extreme activations in the internal states, which reduces overconfident generations and potentially benefits the detection of overconfident hallucinations. Extensive experiments and ablation studies are performed on several popular LLMs and question-answering (QA) benchmarks, showing the effectiveness of our proposal.
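The eigenvalue idea behind EigenScore can be sketched in a few lines: embed K sampled responses, form the covariance (Gram) matrix of the centered embeddings, and summarize its log-eigenvalues; larger values mean more semantic spread across responses. A toy sketch, not the paper's exact formula (the regularization constant and the mean-log aggregation are assumptions):

```python
import numpy as np

def eigenscore(embeddings, alpha=1e-3):
    """Mean log-eigenvalue of the regularized covariance of K response
    embeddings (rows). Higher = more semantic diversity across responses,
    which INSIDE links to a higher chance of hallucination."""
    z = np.asarray(embeddings, dtype=float)
    z = z - z.mean(axis=0, keepdims=True)   # center across responses
    k = z.shape[0]
    cov = z @ z.T / k                       # K x K covariance/Gram matrix
    cov += alpha * np.eye(k)                # regularize so log is defined
    eigvals = np.linalg.eigvalsh(cov)
    return float(np.mean(np.log(eigvals)))
```

Near-identical responses yield a near-degenerate covariance (eigenvalues collapse toward the regularizer), so consistent answers score lower than diverse ones.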

AAAI Conference 2024 Conference Paper

M3SOT: Multi-Frame, Multi-Field, Multi-Space 3D Single Object Tracking

  • Jiaming Liu
  • Yue Wu
  • Maoguo Gong
  • Qiguang Miao
  • Wenping Ma
  • Cai Xu
  • Can Qin

3D Single Object Tracking (SOT) stands as a forefront task of computer vision, proving essential for applications like autonomous driving. Sparse and occluded data in scene point clouds introduce variations in the appearance of tracked objects, adding complexity to the task. In this research, we unveil M3SOT, a novel 3D SOT framework, which synergizes multiple input frames (template sets), multiple receptive fields (continuous contexts), and multiple solution spaces (distinct tasks) in ONE model. Remarkably, M3SOT pioneers in modeling temporality, contexts, and tasks directly from point clouds, revisiting a perspective on the key factors influencing SOT. To this end, we design a transformer-based network centered on point cloud targets in the search area, aggregating diverse contextual representations and propagating target cues by employing historical frames. As M3SOT spans varied processing perspectives, we've streamlined the network—trimming its depth and optimizing its structure—to ensure a lightweight and efficient deployment for SOT applications. We posit that, backed by practical construction, M3SOT sidesteps the need for complex frameworks and auxiliary components to deliver sterling results. Extensive experiments on benchmarks such as KITTI, nuScenes, and Waymo Open Dataset demonstrate that M3SOT achieves state-of-the-art performance at 38 FPS. Our code and models are available at https://github.com/ywu0912/TeamCode.git.

TMLR Journal 2024 Journal Article

MoMA: Model-based Mirror Ascent for Offline Reinforcement Learning

  • Mao Hong
  • Zhiyue Zhang
  • Yue Wu
  • Yanxun Xu

Model-based offline reinforcement learning methods (RL) have achieved state-of-the-art performance in many decision-making problems thanks to their sample efficiency and generalizability. Despite these advancements, existing model-based offline RL approaches either focus on theoretical studies without developing practical algorithms or rely on a restricted parametric policy space, thus not fully leveraging the advantages of an unrestricted policy space inherent to model-based methods. To address this limitation, we develop MoMA, a model-based mirror ascent algorithm with general function approximations under partial coverage of offline data. MoMA distinguishes itself from existing literature by employing an unrestricted policy class. In each iteration, MoMA conservatively estimates the value function by a minimization procedure within a confidence set of transition models in the policy evaluation step, then updates the policy with general function approximations instead of commonly-used parametric policy classes in the policy improvement step. Under some mild assumptions, we establish theoretical guarantees for MoMA by proving an upper bound on the suboptimality of the returned policy. We also provide a practically implementable, approximate version of the algorithm. The effectiveness of MoMA is demonstrated via numerical studies.

AAAI Conference 2024 Conference Paper

Neural Gaussian Similarity Modeling for Differential Graph Structure Learning

  • Xiaolong Fan
  • Maoguo Gong
  • Yue Wu
  • Zedong Tang
  • Jieyi Liu

Graph Structure Learning (GSL) has demonstrated considerable potential in the analysis of graph-unknown non-Euclidean data across a wide range of domains. However, constructing an end-to-end graph structure learning model poses a challenge due to the impediment of gradient flow caused by the nearest neighbor sampling strategy. In this paper, we construct a differential graph structure learning model by replacing the non-differentiable nearest neighbor sampling with a differentiable sampling using the reparameterization trick. Under this framework, we argue that the act of sampling nearest neighbors may not invariably be essential, particularly in instances where node features exhibit a significant degree of similarity. To alleviate this issue, the bell-shaped Gaussian Similarity (GauSim) modeling is proposed to sample non-nearest neighbors. To adaptively model the similarity, we further propose Neural Gaussian Similarity (NeuralGauSim) with learnable parameters featuring flexible sampling behaviors. In addition, we develop a scalable method by transferring the large-scale graph to the transition graph to significantly reduce the complexity. Experimental results demonstrate the effectiveness of the proposed methods.
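Two ingredients of this abstract are concrete enough to sketch: a bell-shaped Gaussian similarity that peaks at a nonzero distance (so non-nearest neighbors can be favored), and differentiable neighbor sampling via a reparameterization trick. The sketch below uses Gumbel-softmax as the reparameterization; whether the paper uses exactly this relaxation, and the parameter names `mu`, `sigma`, `tau`, are assumptions:

```python
import numpy as np

def gaussian_similarity(dist, mu, sigma):
    """Bell-shaped similarity over pairwise distances: maximal at distance
    mu rather than at 0, unlike a monotone nearest-neighbor kernel."""
    return np.exp(-((dist - mu) ** 2) / (2 * sigma ** 2))

def sample_edges_gumbel(sim, tau=0.5, seed=0):
    """Forward pass of Gumbel-softmax neighbor sampling: perturb log
    similarities with Gumbel noise, then take a temperature-controlled
    softmax, giving a differentiable relaxation of discrete sampling."""
    rng = np.random.default_rng(seed)
    logits = np.log(sim + 1e-9)
    gumbel = -np.log(-np.log(rng.uniform(size=sim.shape)))
    y = (logits + gumbel) / tau
    y = np.exp(y - y.max(axis=-1, keepdims=True))   # stable softmax
    return y / y.sum(axis=-1, keepdims=True)
```

Each row of the output is a probability distribution over candidate neighbors; as `tau` shrinks it approaches a hard one-hot sample.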

ICML Conference 2024 Conference Paper

Protein Conformation Generation via Force-Guided SE(3) Diffusion Models

  • Yan Wang
  • Lihao Wang
  • Yuning Shen
  • Yiqun Wang
  • Huizhuo Yuan
  • Yue Wu
  • Quanquan Gu

The conformational landscape of proteins is crucial to understanding their functionality in complex biological processes. Traditional physics-based computational methods, such as molecular dynamics (MD) simulations, suffer from rare event sampling and long equilibration time problems, hindering their applications in general protein systems. Recently, deep generative modeling techniques, especially diffusion models, have been employed to generate novel protein conformations. However, existing score-based diffusion methods cannot properly incorporate important physical prior knowledge to guide the generation process, causing large deviations in the sampled protein conformations from the equilibrium distribution. In this paper, to overcome these limitations, we propose a force-guided $\mathrm{SE}(3)$ diffusion model, ConfDiff, for protein conformation generation. By incorporating a force-guided network with a mixture of data-based score models, ConfDiff can generate protein conformations with rich diversity while preserving high fidelity. Experiments on a variety of protein conformation prediction tasks, including 12 fast-folding proteins and the Bovine Pancreatic Trypsin Inhibitor (BPTI), demonstrate that our method surpasses the state-of-the-art method.

AAAI Conference 2024 Conference Paper

Reliable Conflictive Multi-View Learning

  • Cai Xu
  • Jiajun Si
  • Ziyu Guan
  • Wei Zhao
  • Yue Wu
  • Xiyue Gao

Multi-view learning aims to combine multiple features to achieve more comprehensive descriptions of data. Most previous works assume that multiple views are strictly aligned. However, real-world multi-view data may contain low-quality conflictive instances, which show conflictive information in different views. Previous methods for this problem mainly focus on eliminating the conflictive data instances by removing them or replacing conflictive views. Nevertheless, real-world applications usually require making decisions for conflictive instances rather than only eliminating them. To solve this, we point out a new Reliable Conflictive Multi-view Learning (RCML) problem, which requires the model to provide decision results and attached reliabilities for conflictive multi-view data. We develop an Evidential Conflictive Multi-view Learning (ECML) method for this problem. ECML first learns view-specific evidence, which could be termed as the amount of support to each category collected from data. Then, we can construct view-specific opinions consisting of decision results and reliability. In the multi-view fusion stage, we propose a conflictive opinion aggregation strategy and theoretically prove this strategy can exactly model the relation of multi-view common and view-specific reliabilities. Experiments performed on 6 datasets verify the effectiveness of ECML. The code is released at https://github.com/jiajunsi/RCML.

AAAI Conference 2024 Conference Paper

TCI-Former: Thermal Conduction-Inspired Transformer for Infrared Small Target Detection

  • Tianxiang Chen
  • Zhentao Tan
  • Qi Chu
  • Yue Wu
  • Bin Liu
  • Nenghai Yu

Infrared small target detection (ISTD) is critical to national security and has been extensively applied in military areas. ISTD aims to segment small target pixels from background. Most ISTD networks focus on designing feature extraction blocks or feature fusion modules, but rarely describe the ISTD process from the feature map evolution perspective. In the ISTD process, the network attention gradually shifts towards target areas. We abstract this process as the directional movement of feature map pixels to target areas through convolution, pooling and interactions with surrounding pixels, which can be analogous to the movement of thermal particles constrained by surrounding variables and particles. In light of this analogy, we propose the Thermal Conduction-Inspired Transformer (TCI-Former) based on the theoretical principles of thermal conduction. According to the thermal conduction differential equation in heat dynamics, we derive the pixel movement differential equation (PMDE) in the image domain and further develop two modules: Thermal Conduction-Inspired Attention (TCIA) and Thermal Conduction Boundary Module (TCBM). TCIA incorporates the finite difference method with PMDE to reach a numerical approximation so that target body features can be extracted. To further remove errors in boundary areas, TCBM is designed and supervised by boundary masks to refine target body features with fine boundary details. Experiments on IRSTD-1k and NUAA-SIRST demonstrate the superiority of our method.

ICLR Conference 2024 Conference Paper

Variance-aware Regret Bounds for Stochastic Contextual Dueling Bandits

  • Qiwei Di
  • Tao Jin 0002
  • Yue Wu
  • Heyang Zhao
  • Farzad Farnoud
  • Quanquan Gu

Dueling bandits is a prominent framework for decision-making involving preferential feedback, a valuable feature that fits various applications involving human interaction, such as ranking, information retrieval, and recommendation systems. While substantial efforts have been made to minimize the cumulative regret in dueling bandits, a notable gap in the current research is the absence of regret bounds that account for the inherent uncertainty in pairwise comparisons between the dueling arms. Intuitively, greater uncertainty suggests a higher level of difficulty in the problem. To bridge this gap, this paper studies the problem of contextual dueling bandits, where the binary comparison of dueling arms is generated from a generalized linear model (GLM). We propose a new SupLinUCB-type algorithm that enjoys computational efficiency and a variance-aware regret bound $\tilde O\big(d\sqrt{\sum_{t=1}^T\sigma_t^2} + d\big)$, where $\sigma_t$ is the variance of the pairwise comparison at round $t$, $d$ is the dimension of the context vectors, and $T$ is the time horizon. Our regret bound naturally aligns with the intuitive expectation — in scenarios where the comparison is deterministic, the algorithm only suffers from an $\tilde O(d)$ regret. We perform empirical experiments on synthetic data to confirm the advantage of our method over previous variance-agnostic algorithms.

ICLR Conference 2023 Conference Paper

Avoiding spurious correlations via logit correction

  • Sheng Liu
  • Xu Zhang
  • Nitesh Sekhar
  • Yue Wu
  • Prateek Singhal
  • Carlos Fernandez-Granda

Empirical studies suggest that machine learning models trained with empirical risk minimization (ERM) often rely on attributes that may be spuriously correlated with the class labels. Such models typically lead to poor performance during inference for data lacking such correlations. In this work, we explicitly consider a situation where potential spurious correlations are present in the majority of training data. In contrast with existing approaches, which use the ERM model outputs to detect the samples without spurious correlations and either heuristically upweight or upsample those samples, we propose the logit correction (LC) loss, a simple yet effective improvement on the softmax cross-entropy loss, to correct the sample logit. We demonstrate that minimizing the LC loss is equivalent to maximizing the group-balanced accuracy, so the proposed LC could mitigate the negative impacts of spurious correlations. Our extensive experimental results further reveal that the proposed LC loss outperforms state-of-the-art solutions on multiple popular benchmarks by a large margin, an average 5.5% absolute improvement, without access to spurious attribute labels. LC is also competitive with oracle methods that make use of the attribute labels.
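The core move described above, correcting the sample logit before the softmax cross-entropy, can be illustrated in the style of logit adjustment: shift each logit by the log of its group/class prior so that predicting the spuriously correlated majority becomes "cheaper" and gradients concentrate on the minority. This is a sketch of that family of losses, not necessarily the paper's exact LC formulation (the use of class priors as the correction term is an assumption):

```python
import numpy as np

def logit_corrected_ce(logits, labels, class_priors):
    """Softmax cross-entropy with logits shifted by log class priors.
    A larger prior on the label's class lowers its loss contribution,
    down-weighting easy majority-group samples."""
    adjusted = logits + np.log(class_priors)         # the correction term
    adjusted = adjusted - adjusted.max(axis=1, keepdims=True)  # stability
    log_probs = adjusted - np.log(np.exp(adjusted).sum(axis=1, keepdims=True))
    return -log_probs[np.arange(len(labels)), labels].mean()
```

With uniform priors the shift cancels and this reduces to plain cross-entropy, which makes it easy to sanity-check.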

ICML Conference 2023 Conference Paper

Personalized Federated Learning under Mixture of Distributions

  • Yue Wu
  • Shuaicheng Zhang
  • Wenchao Yu
  • Yanchi Liu
  • Quanquan Gu
  • Dawei Zhou 0003
  • Haifeng Chen
  • Wei Cheng 0002

The recent trend towards Personalized Federated Learning (PFL) has garnered significant attention as it allows for the training of models that are tailored to each client while maintaining data privacy. However, current PFL techniques primarily focus on modeling the conditional distribution heterogeneity (i.e., concept shift), which can result in suboptimal performance when the distribution of input data across clients diverges (i.e., covariate shift). Additionally, these techniques often lack the ability to adapt to unseen data, further limiting their effectiveness in real-world scenarios. To address these limitations, we propose a novel approach, FedGMM, which utilizes Gaussian mixture models (GMM) to effectively fit the input data distributions across diverse clients. The model parameters are estimated by maximum likelihood estimation utilizing a federated Expectation-Maximization algorithm, which is solved in closed form and does not assume gradient similarity. Furthermore, FedGMM possesses an additional advantage of adapting to new clients with minimal overhead, and it also enables uncertainty quantification. Empirical evaluations on synthetic and benchmark datasets demonstrate the superior performance of our method in both PFL classification and novel sample detection.

NeurIPS Conference 2023 Conference Paper

Read and Reap the Rewards: Learning to Play Atari with the Help of Instruction Manuals

  • Yue Wu
  • Yewen Fan
  • Paul Pu Liang
  • Amos Azaria
  • Yuanzhi Li
  • Tom M. Mitchell

High sample complexity has long been a challenge for RL. On the other hand, humans learn to perform tasks not only from interaction or demonstrations, but also by reading unstructured text documents, e.g., instruction manuals. Instruction manuals and wiki pages are among the most abundant data that could inform agents of valuable features and policies or task-specific environmental dynamics and reward structures. Therefore, we hypothesize that the ability to utilize human-written instruction manuals to assist learning policies for specific tasks should lead to a more efficient and better-performing agent. We propose the Read and Reward framework. Read and Reward speeds up RL algorithms on Atari games by reading manuals released by the Atari game developers. Our framework consists of a QA Extraction module that extracts and summarizes relevant information from the manual and a Reasoning module that evaluates object-agent interactions based on information from the manual. An auxiliary reward is then provided to a standard A2C RL agent, when interaction is detected. Experimentally, various RL algorithms obtain significant improvement in performance and training speed when assisted by our design. Code at github.com/Holmeswww/RnR

NeurIPS Conference 2023 Conference Paper

SPRING: Studying Papers and Reasoning to play Games

  • Yue Wu
  • So Yeon Min
  • Shrimai Prabhumoye
  • Yonatan Bisk
  • Russ R. Salakhutdinov
  • Amos Azaria
  • Tom M. Mitchell
  • Yuanzhi Li

Open-world survival games pose significant challenges for AI algorithms due to their multi-tasking, deep exploration, and goal prioritization requirements. Despite reinforcement learning (RL) being popular for solving games, its high sample complexity limits its effectiveness in complex open-world games like Crafter or Minecraft. We propose a novel approach, SPRING, to read Crafter's original academic paper and use the knowledge learned to reason and play the game through a large language model (LLM). Prompted with the LaTeX source as game context and a description of the agent's current observation, our SPRING framework employs a directed acyclic graph (DAG) with game-related questions as nodes and dependencies as edges. We identify the optimal action to take in the environment by traversing the DAG and calculating LLM responses for each node in topological order, with the LLM's answer to the final node directly translating to environment actions. In our experiments, we study the quality of in-context "reasoning" induced by different forms of prompts under the setting of the Crafter environment. Our experiments suggest that LLMs, when prompted with consistent chain-of-thought, have great potential in completing sophisticated high-level trajectories. Quantitatively, SPRING with GPT-4 outperforms all state-of-the-art RL baselines, trained for 1M steps, without any training. Finally, we show the potential of Crafter as a test bed for LLMs. Code at github.com/holmeswww/SPRING
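The DAG traversal described above can be sketched with Python's standard `graphlib`. The question graph and the `answer` stand-in below are hypothetical; in SPRING each call would prompt an LLM with the paper's LaTeX source, the observation, and the parents' answers:

```python
from graphlib import TopologicalSorter

# Hypothetical question DAG: each node depends on its parents' answers.
deps = {
    "top_action": {"requirements_met", "best_subgoal"},
    "best_subgoal": {"current_state"},
    "requirements_met": {"current_state"},
    "current_state": set(),
}

def answer(question, parent_answers):
    """Stand-in for an LLM call; the real system would build a prompt
    from the game context plus the parents' answers."""
    return f"answer({question} | {sorted(parent_answers)})"

answers = {}
for q in TopologicalSorter(deps).static_order():   # parents before children
    answers[q] = answer(q, {p: answers[p] for p in deps[q]})

action = answers["top_action"]  # the final node's answer maps to an env action
```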

AAMAS Conference 2023 Conference Paper

Sybil-Proof Diffusion Auction in Social Networks

  • Hongyin Chen
  • Xiaotie Deng
  • Ying Wang
  • Yue Wu
  • Dengji Zhao

A diffusion auction is a market to sell commodities over a social network, where the challenge is to incentivize existing buyers to invite their neighbors in the network to join the market. Existing mechanisms have been designed to solve the challenge in various settings, aiming at desirable properties such as non-deficiency, incentive compatibility and social welfare maximization. Since the mechanisms are employed in dynamic networks with ever-changing structures, buyers could easily generate fake nodes in the network to manipulate the mechanisms for their own benefits, which is commonly known as the Sybil attack. We observe that strategic agents may gain an unfair advantage in existing mechanisms through such attacks. To resist this potential attack, we propose two diffusion auction mechanisms, the Sybil tax mechanism (STM) and the Sybil cluster mechanism (SCM), to achieve both Sybil-proofness and incentive compatibility in the single-item setting. Our proposal provides the first mechanisms to protect the interests of buyers against Sybil attacks with a mild sacrifice of social welfare and revenue.

UAI Conference 2023 Conference Paper

Uniform-PAC Guarantees for Model-Based RL with Bounded Eluder Dimension

  • Yue Wu
  • Jiafan He
  • Quanquan Gu

Recently, there has been remarkable progress in reinforcement learning (RL) with general function approximation. However, all these works only provide regret or sample complexity guarantees. It is still an open question if one can achieve stronger performance guarantees, i.e., the uniform probably approximately correct (Uniform-PAC) guarantee that can imply both a sub-linear regret bound and a polynomial sample complexity for any target learning accuracy. We study this problem by proposing algorithms for both nonlinear bandits and model-based episodic RL using the general function class with a bounded eluder dimension. The key idea of the proposed algorithms is to assign each action to different levels according to its width with respect to the confidence set. The achieved Uniform-PAC sample complexity is tight in the sense that it matches the state-of-the-art regret bounds or sample complexity guarantees when reduced to the linear case. To the best of our knowledge, this is the first work for Uniform-PAC guarantees on bandit and RL that goes beyond linear cases.

AAAI Conference 2023 Conference Paper

User-Controllable Arbitrary Style Transfer via Entropy Regularization

  • Jiaxin Cheng
  • Yue Wu
  • Ayush Jaiswal
  • Xu Zhang
  • Pradeep Natarajan
  • Prem Natarajan

Ensuring the overall end-user experience is a challenging task in arbitrary style transfer (AST) due to the subjective nature of style transfer quality. A good practice is to provide users with many AST results instead of one. However, existing approaches require running multiple AST models or running inference with a diversified AST (DAST) solution multiple times, and thus they are either slow in speed or limited in diversity. In this paper, we propose a novel solution ensuring both efficiency and diversity for generating multiple user-controllable AST results by systematically modulating AST behavior at run-time. We begin with reformulating three prominent AST methods into a unified assign-and-mix problem and discover that the entropies of their assignment matrices exhibit a large variance. We then solve the unified problem in an optimal transport framework using the Sinkhorn-Knopp algorithm with a user input ε to control the said entropy and thus modulate stylization. Empirical results demonstrate the superiority of the proposed solution, with speed and stylization quality comparable to or better than existing AST and significantly more diverse than previous DAST works. Code is available at https://github.com/cplusx/eps-Assign-and-Mix.
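The entropy-controlled assignment can be illustrated with a generic Sinkhorn-Knopp iteration. The cost matrix below is a toy stand-in for the AST-specific one, and `eps` plays the role of the user input ε: a smaller value yields a sharper (lower-entropy) assignment, a larger value a smoother one.

```python
import numpy as np

def sinkhorn(cost, eps, n_iters=200):
    """Entropy-regularized assignment via Sinkhorn-Knopp (generic sketch)."""
    n, m = cost.shape
    K = np.exp(-cost / eps)           # Gibbs kernel; eps sets the entropy level
    r = np.full(n, 1.0 / n)           # uniform row marginals
    c = np.full(m, 1.0 / m)           # uniform column marginals
    u, v = np.ones(n), np.ones(m)
    for _ in range(n_iters):          # alternate row/column scaling
        u = r / (K @ v)
        v = c / (K.T @ u)
    return (u[:, None] * K) * v[None, :]

cost = np.array([[0.0, 1.0], [1.0, 0.0]])   # toy stand-in cost matrix
P_sharp = sinkhorn(cost, eps=0.05)          # near-hard assignment
P_smooth = sinkhorn(cost, eps=5.0)          # near-uniform mixing
```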

NeurIPS Conference 2022 Conference Paper

Active Ranking without Strong Stochastic Transitivity

  • Hao Lou
  • Tao Jin
  • Yue Wu
  • Pan Xu
  • Quanquan Gu
  • Farzad Farnoud

Ranking from noisy comparisons is of great practical interest in machine learning. In this paper, we consider the problem of recovering the exact full ranking for a list of items under ranking models that do *not* assume the Strong Stochastic Transitivity property. We propose a $\delta$-correct algorithm, Probe-Rank, that actively learns the ranking of the items from noisy pairwise comparisons. We prove a sample complexity upper bound for Probe-Rank, which only depends on the preference probabilities between items that are adjacent in the true ranking. This improves upon existing sample complexity results that depend on the preference probabilities for all pairs of items. Probe-Rank thus outperforms existing methods over a large collection of instances that do not satisfy Strong Stochastic Transitivity. Thorough numerical experiments in various settings are conducted, demonstrating that Probe-Rank is significantly more sample-efficient than the state-of-the-art active ranking method.

NeurIPS Conference 2022 Conference Paper

AniFaceGAN: Animatable 3D-Aware Face Image Generation for Video Avatars

  • Yue Wu
  • Yu Deng
  • Jiaolong Yang
  • Fangyun Wei
  • Qifeng Chen
  • Xin Tong

Although 2D generative models have made great progress in face image generation and animation, they often suffer from undesirable artifacts such as 3D inconsistency when rendering images from different camera viewpoints. This prevents them from synthesizing video animations indistinguishable from real ones. Recently, 3D-aware GANs extend 2D GANs for explicit disentanglement of camera pose by leveraging 3D scene representations. These methods can well preserve the 3D consistency of the generated images across different views, yet they cannot achieve fine-grained control over other attributes, among which facial expression control is arguably the most useful and desirable for face animation. In this paper, we propose an animatable 3D-aware GAN for multiview consistent face animation generation. The key idea is to decompose the 3D representation of the 3D-aware GAN into a template field and a deformation field, where the former represents different identities with a canonical expression, and the latter characterizes expression variations of each identity. To achieve meaningful control over facial expressions via deformation, we propose a 3D-level imitative learning scheme between the generator and a parametric 3D face model during adversarial training of the 3D-aware GAN. This helps our method achieve high-quality animatable face image generation with strong visual 3D consistency, even though trained with only unstructured 2D images. Extensive experiments demonstrate our superior performance over prior works. Project page: https://yuewuhkust.github.io/AniFaceGAN/

NeurIPS Conference 2022 Conference Paper

Towards Understanding the Mixture-of-Experts Layer in Deep Learning

  • Zixiang Chen
  • Yihe Deng
  • Yue Wu
  • Quanquan Gu
  • Yuanzhi Li

The Mixture-of-Experts (MoE) layer, a sparsely-activated model controlled by a router, has achieved great success in deep learning. However, the understanding of such architecture remains elusive. In this paper, we formally study how the MoE layer improves the performance of neural network learning and why the mixture model will not collapse into a single model. Our empirical results suggest that the cluster structure of the underlying problem and the non-linearity of the expert are pivotal to the success of MoE. This motivates us to consider a challenging classification problem with intrinsic cluster structures. Theoretically, we prove that this problem is hard to solve by a single expert such as a two-layer convolutional neural network (CNN). Yet with the MoE layer with each expert being a two-layer CNN, the problem can be solved successfully. In particular, our theory shows that the router can learn the cluster-center features, which helps divide the input complex problem into simpler classification sub-problems that individual experts can conquer. To our knowledge, this is the first theoretical result toward formally understanding the mechanism of the MoE layer for deep learning.
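A minimal sparsely-activated MoE forward pass, with a softmax router selecting the top-1 expert per input, can be sketched as follows. This is an illustrative toy (random linear router, small tanh experts), not the paper's construction:

```python
import numpy as np

def moe_forward(x, router_w, experts, top_k=1):
    """Route each input to its top-k experts and mix their outputs by
    the softmax-normalized router scores (illustrative sketch)."""
    scores = x @ router_w                               # (batch, n_experts)
    top = np.argsort(scores, axis=1)[:, -top_k:]        # top-k expert indices
    out_dim = experts[0](x[:1]).shape[1]
    out = np.zeros((x.shape[0], out_dim))
    for i in range(x.shape[0]):
        sel = top[i]
        w = np.exp(scores[i, sel] - scores[i, sel].max())
        w /= w.sum()                                    # softmax over selected
        for j, wt in zip(sel, w):
            out[i] += wt * experts[j](x[i:i + 1])[0]    # only k experts run
    return out

rng = np.random.default_rng(0)
d, n_experts = 4, 3
router_w = rng.normal(size=(d, n_experts))
expert_ws = [rng.normal(size=(d, 2)) for _ in range(n_experts)]
experts = [lambda z, W=W: np.tanh(z @ W) for W in expert_ws]  # non-linear experts
x = rng.normal(size=(5, d))
y = moe_forward(x, router_w, experts)
```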

IJCAI Conference 2021 Conference Paper

Towards Understanding the Spectral Bias of Deep Learning

  • Yuan Cao
  • Zhiying Fang
  • Yue Wu
  • Ding-Xuan Zhou
  • Quanquan Gu

An intriguing phenomenon observed during training neural networks is the spectral bias, which states that neural networks are biased towards learning less complex functions. The priority of learning functions with low complexity might be at the core of explaining the generalization ability of neural networks, and certain efforts have been made to provide a theoretical explanation for spectral bias. However, there is still no satisfying theoretical result justifying the underlying mechanism of spectral bias. In this paper, we give a comprehensive and rigorous explanation for spectral bias and relate it with the neural tangent kernel function proposed in recent work. We prove that the training process of neural networks can be decomposed along different directions defined by the eigenfunctions of the neural tangent kernel, where each direction has its own convergence rate and the rate is determined by the corresponding eigenvalue. We then provide a case study when the input data is uniformly distributed over the unit sphere, and show that lower degree spherical harmonics are easier to be learned by over-parameterized neural networks. Finally, we provide numerical experiments to demonstrate the correctness of our theory. Our experimental results also show that our theory can tolerate certain model misspecification in terms of the input data distribution.
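The decomposition described above can be written compactly. In a standard NTK analysis (a generic form, not quoted from the paper), the training residual projected onto each eigenfunction decays at a rate set by the corresponding eigenvalue:

$$f_t - y \approx \sum_i e^{-\eta \lambda_i t}\, \langle f_0 - y, \phi_i \rangle\, \phi_i,$$

where $\lambda_i$ and $\phi_i$ are the eigenvalues and eigenfunctions of the neural tangent kernel and $\eta$ is the learning rate: directions with large $\lambda_i$, corresponding to low-complexity components, converge first.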

NeurIPS Conference 2020 Conference Paper

Improving GAN Training with Probability Ratio Clipping and Sample Reweighting

  • Yue Wu
  • Pan Zhou
  • Andrew G. Wilson
  • Eric Xing
  • Zhiting Hu

Despite success on a wide range of problems related to vision, generative adversarial networks (GANs) often suffer from inferior performance due to unstable training, especially for text generation. To solve this issue, we propose a new variational GAN training framework which enjoys superior training stability. Our approach is inspired by a connection of GANs and reinforcement learning under a variational perspective. The connection leads to (1) probability ratio clipping that regularizes generator training to prevent excessively large updates, and (2) a sample re-weighting mechanism that improves discriminator training by downplaying bad-quality fake samples. Moreover, our variational GAN framework can provably overcome the training issue in many GANs that an optimal discriminator cannot provide any informative gradient to the generator. By plugging the training approach into diverse state-of-the-art GAN architectures, we obtain significantly improved performance over a range of tasks, including text generation, text style transfer, and image generation.
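Probability ratio clipping has the same shape as PPO's clipped surrogate objective; the sketch below shows that generic form (the paper's exact variational objective differs). Ratios far from 1 are clipped, so a single update cannot move the generator too far from its previous distribution.

```python
import numpy as np

def clipped_objective(ratio, advantage, clip=0.2):
    """PPO-style clipped surrogate (generic form, not the paper's exact loss).

    `ratio` is p_new/p_old per sample; clipping it to [1-clip, 1+clip] and
    taking the pessimistic minimum bounds the size of each update."""
    clipped = np.clip(ratio, 1.0 - clip, 1.0 + clip)
    return np.minimum(ratio * advantage, clipped * advantage).mean()

ratio = np.array([0.5, 1.0, 3.0])   # hypothetical per-sample ratios
advantage = np.ones(3)              # hypothetical per-sample scores
obj = clipped_objective(ratio, advantage)
```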

IJCAI Conference 2018 Conference Paper

Feature Hashing for Network Representation Learning

  • Qixiang Wang
  • Shanfeng Wang
  • Maoguo Gong
  • Yue Wu

The goal of network representation learning is to embed nodes so as to encode the proximity structures of a graph into a continuous low-dimensional feature space. In this paper, we propose a novel algorithm called node2hash based on feature hashing for generating node embeddings. This approach follows the encoder-decoder framework. There are two main mapping functions in this framework. The first is an encoder to map each node into high-dimensional vectors. The second is a decoder to hash these vectors into a lower dimensional feature space. More specifically, we first derive a proximity measure called expected distance as the target, which combines position distribution and co-occurrence statistics of nodes over random walks so as to build a proximity matrix, then introduce a set of T different hash functions into feature hashing to generate uniformly distributed vector representations of nodes from the proximity matrix. Compared with the existing state-of-the-art network representation learning approaches, node2hash shows a competitive performance on multi-class node classification and link prediction tasks on three real-world networks from various domains.
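The hashing step, mapping a high-dimensional sparse vector into a low-dimensional space with T hash functions, can be sketched as below. The node keys and values are hypothetical, and the expected-distance proximity matrix the paper derives is assumed to be computed already:

```python
import hashlib
import numpy as np

def feature_hash(sparse_vec, dim, n_hashes=2):
    """Hash a sparse high-dimensional vector into `dim` buckets using
    `n_hashes` (the paper's T) hash functions; signed hashing reduces
    the bias introduced by bucket collisions."""
    out = np.zeros(dim)
    for key, val in sparse_vec.items():
        for t in range(n_hashes):
            h = int(hashlib.md5(f"{t}:{key}".encode()).hexdigest(), 16)
            sign = 1.0 if (h >> 1) % 2 == 0 else -1.0
            out[h % dim] += sign * val / np.sqrt(n_hashes)
    return out

# hypothetical proximity-row entries for one node
emb = feature_hash({"node_3": 0.7, "node_9": 1.2}, dim=8)
```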

NeurIPS Conference 2018 Conference Paper

Towards Understanding Learning Representations: To What Extent Do Different Neural Networks Learn the Same Representation

  • Liwei Wang
  • Lunjia Hu
  • Jiayuan Gu
  • Zhiqiang Hu
  • Yue Wu
  • Kun He
  • John Hopcroft

It is widely believed that learning good representations is one of the main reasons for the success of deep neural networks. Although highly intuitive, there is a lack of theory and systematic approach quantitatively characterizing what representations deep neural networks learn. In this work, we move a tiny step towards a theory and better understanding of the representations. Specifically, we study a simpler problem: how similar are the representations learned by two networks with identical architecture but trained from different initializations? We develop a rigorous theory based on the neuron activation subspace match model. The theory gives a complete characterization of the structure of neuron activation subspace matches, where the core concepts are maximum match and simple match which describe the overall and the finest similarity between sets of neurons in two networks respectively. We also propose efficient algorithms to find the maximum match and simple matches. Finally, we conduct extensive experiments using our algorithms. Experimental results suggest that, surprisingly, representations learned by the same convolutional layers of networks trained from different initializations are not as similar as prevalently expected, at least in terms of subspace match.

IROS Conference 2015 Conference Paper

Identification and reconstruction of complex weld geometry based on modified entropy

  • Soheil Keshmiri
  • Yan Zhi Tan
  • Xin Zheng
  • Syeda Mariam Ahmed
  • Yue Wu
  • Wen Feng Lu
  • Chee-Meng Chew
  • Chee Khiang Pang

In this paper, a modified entropy-based algorithm is proposed for identification and reconstruction of a complex weld geometry. The edge of the weld geometry is identified based on minimizing a modified entropy-type cost function, and the weld geometry is reconstructed based on the detected edge. In addition, the volume of the weld geometry is computed using the point cloud samples of the identified weld geometry, and the effects of Gaussian noise are also considered. Our simulation results using the proposed reconstruction algorithm demonstrate efficient identification and reconstruction of a complex weld geometry in the presence of Gaussian noise.

NeurIPS Conference 2014 Conference Paper

Gaussian Process Volatility Model

  • Yue Wu
  • José Miguel Hernández-Lobato
  • Zoubin Ghahramani

The prediction of time-changing variances is an important task in the modeling of financial data. Standard econometric models are often limited as they assume rigid functional relationships for the evolution of the variance. Moreover, functional parameters are usually learned by maximum likelihood, which can lead to overfitting. To address these problems we introduce GP-Vol, a novel non-parametric model for time-changing variances based on Gaussian Processes. This new model can capture highly flexible functional relationships for the variances. Furthermore, we introduce a new online algorithm for fast inference in GP-Vol. This method is much faster than current offline inference procedures and it avoids overfitting problems by following a fully Bayesian approach. Experiments with financial data show that GP-Vol performs significantly better than current standard alternatives.

ICML Conference 2013 Conference Paper

Dynamic Covariance Models for Multivariate Financial Time Series

  • Yue Wu
  • José Miguel Hernández-Lobato
  • Zoubin Ghahramani

The accurate prediction of time-changing covariances is an important problem in the modeling of multivariate financial data. However, some of the most popular models suffer from a) overfitting problems and multiple local optima, b) failure to capture shifts in market conditions and c) large computational costs. To address these problems we introduce a novel dynamic model for time-changing covariances. Over-fitting and local optima are avoided by following a Bayesian approach instead of computing point estimates. Changes in market conditions are captured by assuming a diffusion process in parameter values, and finally computationally efficient and scalable inference is performed using particle filters. Experiments with financial data show excellent performance of the proposed method with respect to current standard models.