Arrow Research search

Author name cluster

Xin Xu

Possible papers associated with this exact author name in Arrow. This page groups case-insensitive exact name matches and is not a full identity disambiguation profile.

36 papers
2 author rows

Possible papers (36)

AAAI Conference 2026 Conference Paper

Domain-Aware Suppression and Aggregation for Federated DG ReID

  • Zhixi Yu
  • Wei Liu
  • Wenke Huang
  • Bin Yang
  • Qian Bie
  • Guancheng Wan
  • Xin Xu

Federated domain generalization in person re-identification (FedDG-ReID) aims to learn a privacy-preserving server model from decentralized client source domains that generalizes to unseen domains. Existing approaches enhance the generalizability of the server model by increasing the diversity of client person data. However, these methods overlook that ReID model parameters are easily biased by client-specific data distributions, leading to the capture of excessive domain-specific identity information. Such identity information (e.g., clothing style) conflicts with identity information in unseen domains, thereby hindering the generalization ability of the server model. To address this, we propose FedSupWA, a novel FedDG-ReID framework that mainly consists of Domain-aware Parameter Suppression (DPS) and Domain-invariant Weighted Aggregation (DWA). Specifically, DPS adaptively attenuates the update magnitude of the parameters based on how closely they fit the client's domain, encouraging the model to focus on more generalized, domain-independent identity information, such as pedestrian contours and other cues that are consistent across domains. DWA enhances the server model's generalization by evaluating how well each client model maintains the consistency of pedestrian identities, using this as a measure of the importance of the learned domain-independent identity information, and assigning greater aggregation weights to clients that contribute more generalized information. Extensive experiments demonstrate the effectiveness of FedSupWA, showing that it achieves state-of-the-art performance.

AAAI Conference 2026 Conference Paper

FedARKS: Federated Aggregation via Robust and Discriminative Knowledge Selection and Integration for Person Re-identification

  • Xin Xu
  • Binchang Ma
  • Zhixi Yu
  • Wei Liu

The application of federated domain generalization in person re-identification (FedDG-ReID) aims to enhance the model's generalization ability in unseen domains while protecting client data privacy. However, existing mainstream methods typically rely on global feature representations and simple averaging operations for model aggregation, leading to two limitations in domain generalization: (1) Using only global features makes it difficult to capture subtle, domain-invariant local details (such as accessories or textures); (2) Uniform parameter averaging treats all clients as equivalent, ignoring their differences in robust feature extraction capabilities, thereby diluting the contributions of high-quality clients. To address these issues, we propose a novel federated learning framework, Federated Aggregation via Robust and Discriminative Knowledge Selection and Integration (FedARKS), comprising two mechanisms: RK (Robust Knowledge) and KS (Knowledge Selection). In our design, each client employs a dual-branch RK network: the Global Feature Processing Branch serves as the primary component, extracting overall representations for model aggregation and server-side updates, while the Body Part Processing Branch acts as an auxiliary component, focusing on extracting domain-invariant local details to supplement and guide the local training process during global feature learning. Additionally, our KS mechanism adaptively assigns corresponding aggregation weights to clients based on their ability to extract domain-invariant knowledge, enabling the server to better integrate cross-domain invariant knowledge extracted by clients. Extensive experiments validate that FedARKS achieves state-of-the-art generalization results on the FedDG-ReID benchmark, demonstrating that learning subtle body part features can effectively assist and reinforce global representations, thereby enabling robust cross-domain person ReID capabilities.

AAAI Conference 2026 Conference Paper

ICLR: Inter-Chrominance and Luminance Interaction for Natural Color Restoration in Low-Light Image Enhancement

  • Xin Xu
  • Hao Liu
  • Wei Liu
  • Wei Wang
  • Jiayi Wu
  • Kui Jiang

The Low-Light Image Enhancement (LLIE) task aims to improve contrast while restoring details and textures for images captured in low-light conditions. The HVI color space has enabled significant progress in this task by precisely decoupling chrominance and luminance. However, in the interaction between the chrominance and luminance branches, substantial distributional differences between the two branches, prevalent in natural images, limit complementary feature extraction, and luminance errors are propagated to the chrominance channels through the nonlinear parameter. Furthermore, in the interaction between different chrominance branches, images with large homogeneous-color regions usually exhibit weak correlation between chrominance branches due to concentrated distributions. Traditional pixel-wise losses assume strong inter-branch correlations for co-optimization, causing gradient conflicts in weakly correlated regions. Therefore, we propose an Inter-Chrominance and Luminance Interaction (ICLR) framework comprising a Dual-stream Interaction Enhancement Module (DIEM) and a Covariance Correction Loss (CCL). The DIEM improves the extraction of complementary information along two dimensions, fusion and enhancement. The CCL utilizes luminance residual statistics to penalize chrominance errors and balances gradient conflicts by constraining the covariance of the chrominance branches. Experimental results on multiple datasets show that the proposed ICLR framework outperforms state-of-the-art methods.

AAAI Conference 2026 Conference Paper

MambaOVSR: Multiscale Fusion with Global Motion Modeling for Chinese Opera Video Super-Resolution

  • Hua Chang
  • Xin Xu
  • Wei Liu
  • Wei Wang
  • Xin Yuan
  • Kui Jiang

Chinese opera is celebrated for preserving classical art. However, early filming equipment limitations have degraded videos of last-century performances by renowned artists (e.g., low frame rates and resolution), hindering archival efforts. Although space-time video super-resolution (STVSR) has advanced significantly, applying it directly to opera videos remains challenging. The scarcity of datasets impedes the recovery of high-frequency details, and existing STVSR methods lack global modeling capabilities, compromising visual quality when handling opera's characteristic large motions. To address these challenges, we pioneer a large-scale Chinese Opera Video Clip (COVC) dataset and propose the Mamba-based multiscale fusion network for space-time Opera Video Super-Resolution (MambaOVSR). Specifically, MambaOVSR involves three novel components: the Global Fusion Module (GFM), which models motion through a multiscale alternating scanning mechanism; the Multiscale Synergistic Mamba Module (MSMM), which aligns features across different sequence lengths; and the MambaVR block, which resolves feature artifacts and positional information loss during alignment. Experimental results on the COVC dataset show that MambaOVSR significantly outperforms the SOTA STVSR method by an average of 1.86 dB in terms of PSNR.

AAAI Conference 2026 Conference Paper

Robust Pedestrian Detection with Uncertain Modality

  • Qian Bie
  • Xiao Wang
  • Bin Yang
  • Zhixi Yu
  • Jun Chen
  • Xin Xu

Existing cross-modal pedestrian detection (CMPD) employs complementary information from RGB and thermal-infrared (TIR) modalities to detect pedestrians in 24h-surveillance systems. RGB captures rich pedestrian details under daylight, while TIR excels at night. However, TIR focuses primarily on the person's silhouette, neglecting critical texture details essential for detection. Near-infrared (NIR) imaging, in contrast, captures texture under low-light conditions, effectively alleviating the performance issues of RGB and the detail loss of TIR, thereby reducing missed detections. To this end, we construct a new Triplet RGB–NIR–TIR (TRNT) dataset, comprising 8,281 pixel-aligned image triplets, establishing a comprehensive foundation for algorithmic research. However, due to the variable nature of real-world scenarios, imaging devices may not always capture all three modalities simultaneously. This results in input data with unpredictable combinations of modal types, which challenge existing CMPD methods that fail to extract robust pedestrian information under arbitrary input combinations, leading to significant performance degradation. To address these challenges, we propose the Adaptive Uncertainty-aware Network (AUNet) for accurately discriminating modal availability and fully utilizing the available information under uncertain inputs. Specifically, we introduce Unified Modality Validation Refinement (UMVR), which includes an uncertainty-aware router to validate modal availability and a semantic refinement to ensure the reliability of information within the modality. Furthermore, we design a Modality-Aware Interaction (MAI) module to adaptively activate or deactivate its internal interaction mechanisms per UMVR output, enabling effective complementary information fusion from available modalities. AUNet enables accurate modality validation and robust inference without fixed modality pairings, facilitating the effective fusion of RGB, NIR, and TIR information across diverse inputs.

AAAI Conference 2026 Conference Paper

WenetSpeech-Yue: A Large-Scale Cantonese Speech Corpus with Multi-dimensional Annotation

  • Longhao Li
  • Zhao Guo
  • Hongjie Chen
  • Yuhang Dai
  • Ziyu Zhang
  • Hongfei Xue
  • Tianlun Zuo
  • Chengyou Wang

The development of speech understanding and generation has been significantly accelerated by the availability of large-scale, high-quality speech datasets. Among these tasks, ASR and TTS are regarded as the most established and fundamental. However, for Cantonese (Yue Chinese), spoken by approximately 84.9 million native speakers worldwide, limited annotated resources have hindered progress and resulted in suboptimal ASR and TTS performance. To address this challenge, we propose WenetSpeech-Pipe, an integrated pipeline for building large-scale speech corpora with multi-dimensional annotation tailored for speech understanding and generation. Based on this pipeline, we release WenetSpeech-Yue, the first large-scale Cantonese speech corpus with multi-dimensional annotation for ASR and TTS, covering 21,800 hours across 10 domains with annotations including ASR transcription, text confidence, speaker identity, age, gender, and speech quality scores, among others. We also release WSYue-eval, a comprehensive Cantonese benchmark with two components: WSYue-ASR-eval, a manually annotated set for evaluating ASR on short and long utterances, code-switching, and diverse acoustic conditions, and WSYue-TTS-eval, with base and coverage subsets for standard and generalization testing. Experimental results show that models trained on WenetSpeech-Yue achieve competitive results against state-of-the-art (SOTA) Cantonese ASR and TTS systems, including commercial and LLM-based models, highlighting the value of our dataset and pipeline.

ICLR Conference 2025 Conference Paper

Aligning Generative Denoising with Discriminative Objectives Unleashes Diffusion for Visual Perception

  • Ziqi Pang
  • Xin Xu
  • Yu-Xiong Wang

With success in image generation, generative diffusion models are increasingly adopted for discriminative scenarios because generating pixels is a unified and natural perception interface. Although directly re-purposing their generative denoising process has established promising progress in specialist (e.g., depth estimation) and generalist models, the inherent gaps between a generative process and discriminative objectives are rarely investigated. For instance, generative models can tolerate deviations at intermediate sampling steps as long as the final distribution is reasonable, while discriminative tasks with rigorous ground truth for evaluation are sensitive to such errors. Without mitigating such gaps, diffusion for perception still struggles on tasks represented by multi-modal understanding (e.g., referring image segmentation). Motivated by these challenges, we analyze and improve the alignment between the generative diffusion process and perception objectives centering around the key observation: how perception quality evolves with the denoising process. (1) Notably, earlier denoising steps contribute more than later steps, necessitating a tailored learning objective for training: loss functions should reflect the varied contributions of timesteps for each perception task. (2) Perception quality drops unexpectedly at later denoising steps, revealing the sensitivity of perception to the training-denoising distribution shift. We introduce diffusion-tailored data augmentation to simulate such shift in the training data. (3) We suggest a novel answer, interactivity, to the long-standing question of why a generative process should be useful for discriminative tasks. The denoising process can be leveraged as a controllable user interface adapting to users' correctional prompts and conducting multi-round interaction in an agentic workflow. Collectively, our insights enhance multiple generative diffusion-based perception models without architectural changes: a state-of-the-art diffusion-based depth estimator, previously underplayed referring image segmentation models, and perception generalists. Our code is available at https://github.com/ziqipang/ADDP.
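
The first insight above, that timesteps contribute unevenly and the loss should reflect this, can be sketched as a simple reweighted training objective. The decreasing power schedule and the `gamma` exponent below are illustrative assumptions, not the paper's actual weighting:

```python
import numpy as np

def timestep_weighted_loss(step_losses, gamma=2.0):
    """Combine per-denoising-step losses with weights that decay over the
    denoising trajectory, so early steps (index 0 = first denoising step),
    which contribute more to perception quality, dominate training.
    The (1 - t/T)**gamma schedule is a placeholder choice."""
    T = len(step_losses)
    t = np.arange(T)
    w = (1.0 - t / T) ** gamma   # monotonically decreasing in step index
    w /= w.sum()                 # normalize so weights sum to 1
    return float(np.dot(w, np.asarray(step_losses, dtype=float)))
```

With equal per-step losses the reweighting is a no-op, while a loss concentrated on the first step is penalized more than the same loss on the last step.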

IJCAI Conference 2025 Conference Paper

Can We Verify Step by Step for Incorrect Answer Detection?

  • Xin Xu
  • Shizhe Diao
  • Can Yang
  • Yang Wang

Chain-of-Thought (CoT) prompting has marked a significant advancement in enhancing the reasoning capabilities of large language models (LLMs). Previous studies have developed various extensions of CoT, which focus primarily on enhancing end-task performance. In addition, there has been research on assessing the quality of reasoning chains in CoT. This raises an intriguing question: Is it possible to predict the accuracy of LLM outputs by scrutinizing the reasoning chains they generate? To answer this research question, we introduce a benchmark, R2PE, designed specifically to explore the relationship between reasoning chains and performance in various reasoning tasks spanning five different domains. This benchmark aims to measure the falsehood of the final output of LLMs based on the reasoning steps. To make full use of information in multiple reasoning chains, we propose the process discernibility score (PDS) framework, which beats the answer-checking baseline by a large margin. Concretely, this yields an average increase of 5.1% in F1 score and 2.97% in AUC-PR across all 45 subsets within R2PE. We further demonstrate PDS's efficacy in advancing open-domain QA accuracy. Codes and data are available at https://github.com/XinXU-USTC/R2PE.git. For further details on the appendix, please refer to https://arxiv.org/abs/2402.10528.

NeurIPS Conference 2025 Conference Paper

GPAS: Accelerating Convergence of LLM Pretraining via Gradient-Preserving Activation Scaling

  • Tianhao Chen
  • Xin Xu
  • Zijing Liu
  • Pengxiang Li
  • Xinyuan Song
  • Ajay Jaiswal
  • Fan Zhang
  • Jishan Hu

Modern Large Language Models, such as the LLaMA, Qwen and DeepSeek series, predominantly adopt the Pre-LayerNorm (Pre-LN) Transformer architecture. While stable during pretraining and scalable to large model sizes, Pre-LN suffers from an exponential growth in activation variance across layers, causing the shortcut to dominate over sub-layer outputs in the residual connection and limiting the learning capacity of deeper layers. To mitigate this issue, we propose Gradient-Preserving Activation Scaling (GPAS), a simple technique that can be used in combination with existing approaches. GPAS works by scaling down the intermediate activations while keeping their gradients unchanged. This leaves information in the activations intact, and avoids the gradient vanishing problem associated with gradient downscaling. Extensive experiments across various model sizes from 71M to 1B show that GPAS achieves consistent performance gains. Beyond enhancing Pre-LN Transformers, GPAS also shows promise in improving alternative architectures such as Sandwich-LN and DeepNorm, demonstrating its versatility and potential for improving training dynamics in a wide range of settings. Our code is available at https://github.com/dandingsky/GPAS.
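
The scale-forward, preserve-backward behavior the abstract describes is commonly obtained with a stop-gradient trick; the sketch below illustrates the idea in plain Python. The per-layer placement and how the scale `alpha` is chosen are not specified here and are assumptions:

```python
def stop_gradient(x):
    # In an autograd framework (e.g. torch.Tensor.detach or
    # jax.lax.stop_gradient), this returns x's value while contributing
    # zero gradient. In plain Python we can only mirror the forward value.
    return x

def gpas(x, alpha):
    """Forward: x + sg(alpha*x - x) evaluates to alpha * x (scaled down).
    Backward: the sg(...) term is a constant to autograd, so d(out)/dx = 1
    and the upstream gradient flows through unchanged, i.e.
    "gradient-preserving" scaling."""
    return x + stop_gradient(alpha * x - x)
```

For example, `gpas(2.0, 0.5)` evaluates to `1.0` in the forward pass, while in an autograd framework the gradient with respect to `x` would remain 1 rather than 0.5.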

JMLR Journal 2025 Journal Article

Multiple Instance Verification

  • Xin Xu
  • Eibe Frank
  • Geoffrey Holmes

We explore multiple instance verification, a problem setting in which a query instance is verified against a bag of target instances with heterogeneous, unknown relevancy. We show that naive adaptations of attention-based multiple instance learning (MIL) methods and standard verification methods like Siamese neural networks are unsuitable for this setting: directly combining state-of-the-art (SOTA) MIL methods and Siamese networks is shown to be no better, and sometimes significantly worse, than a simple baseline model. Postulating that this may be caused by the failure of the representation of the target bag to incorporate the query instance, we introduce a new pooling approach named “cross-attention pooling” (CAP). Under the CAP framework, we propose two novel attention functions to address the challenge of distinguishing between highly similar instances in a target bag. Through empirical studies on three different verification tasks, we demonstrate that CAP outperforms adaptations of SOTA MIL methods and the baseline by substantial margins, in terms of both classification accuracy and the ability to detect key instances. The superior ability to identify key instances is attributed to the new attention functions by ablation studies.
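
A minimal numerical sketch of query-conditioned ("cross-attention") pooling, in which the bag representation depends on the query instance rather than on the bag alone. The scaled dot-product scoring below is a generic choice, not necessarily one of the attention functions the paper proposes:

```python
import numpy as np

def cross_attention_pool(query, bag, tau=1.0):
    """Pool a bag of target-instance embeddings conditioned on the query.
    query: (d,) embedding of the query instance.
    bag:   (n, d) embeddings of the n target instances.
    Returns a (d,) bag representation weighted by query-instance similarity."""
    d = query.shape[0]
    scores = bag @ query / (np.sqrt(d) * tau)  # query-conditioned scores
    w = np.exp(scores - scores.max())          # numerically stable softmax
    w /= w.sum()
    return w @ bag                             # convex combination of instances
```

Because the weights form a softmax over query-instance similarities, instances most relevant to the query dominate the pooled representation, which is the failure mode of query-agnostic MIL pooling that CAP targets.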

IROS Conference 2025 Conference Paper

Offline Reinforcement Learning with Koopman Operators for Control of Soft Robots

  • Yue Jiang
  • Cong Li
  • Yihe Yang
  • Wenyu Cao
  • Xin Xu
  • Jinze Liu
  • Wei Jiang
  • Xinglong Zhang

Soft robots are promising to offer flexibility in environmental interaction tasks through compliant deformations. However, the infinite degrees of freedom and highly nonlinear dynamics of soft robots pose significant challenges for dynamic modeling and control. While online reinforcement learning (RL) is promising for designing policies directly from data, the black-box policy learning process suffers from data inefficiency and the sim-to-real gap, limiting its applications in soft robots. To address these challenges, we propose a novel offline RL with Koopman operators (KORL) framework to generate control policies for soft robots without using physical simulators or real-world interactions. In particular, we first utilize a deep neural network to map the dynamics of soft robots to a lifted Koopman observable space, which is inherently linear. Then, an offline RL algorithm with a control-informed actor is designed to learn the robotic policy in the linear observable space. This is significantly different from the black-box policy design in existing offline RL paradigms. The designed Koopman observable enables efficient model-free policy learning with linear control theory, improving control performance while preserving interpretability in policy learning. The effectiveness of our KORL framework is validated in a real-world soft robotic system. Comparative experimental results demonstrate that our method outperforms state-of-the-art methods in target-reaching and trajectory-tracking tasks.

AAAI Conference 2025 Conference Paper

S^3cMath: Spontaneous Step-Level Self-Correction Makes Large Language Models Better Mathematical Reasoners

  • Yuchen Yan
  • Jin Jiang
  • Yang Liu
  • Yixin Cao
  • Xin Xu
  • Mengdi Zhang
  • Xunliang Cai
  • Jian Shao

Self-correction is a novel method that can stimulate the potential reasoning abilities of large language models (LLMs). It involves detecting and correcting errors during the inference process when LLMs solve reasoning problems. However, recent works do not regard self-correction as a spontaneous and intrinsic capability of LLMs. Instead, such correction is achieved through post-hoc generation, external knowledge introduction, multi-model collaboration, and similar techniques. In this paper, we propose a series of mathematical LLMs called S^3cMath, which are able to perform Spontaneous Step-level Self-correction for Mathematical reasoning. This capability helps LLMs to recognize whether their ongoing inference tends to contain errors and simultaneously correct these errors to produce a more reliable response. We propose a step-level sampling method to construct step-wise self-correction data for achieving this ability. Additionally, we implement a training strategy that uses the constructed data to equip LLMs with spontaneous step-level self-correction capacities. Our data and methods have been demonstrated to be effective across various foundation LLMs, consistently showing significant progress in evaluations on GSM8K, MATH, and other mathematical benchmarks. To the best of our knowledge, we are the first to introduce the spontaneous step-level self-correction ability of LLMs in mathematical reasoning.

AAAI Conference 2025 Conference Paper

The Parables of the Mustard Seed and the Yeast: Extremely Low-Budget, High-Performance Nighttime Semantic Segmentation

  • Shiqin Wang
  • Xin Xu
  • Haoyang Chen
  • Kui Jiang
  • Zheng Wang

Nighttime Semantic Segmentation (NSS) is essential to many cutting-edge vision applications. However, existing technologies overly rely on massive labeled data, whose annotation is time-consuming and laborious. In this paper, we pioneer a new task focusing on exploring the potential of training strategy and framework design with limited annotation to achieve high-performance NSS. Insufficient information at very low labeling budgets can easily lead to under-optimization or overfitting of the model. Our solution comprises two main components: i) a novel region-based active sampling strategy called Contextual-Aware Region Query (CARQ), which identifies highly informative target nighttime regions for labeling; and ii) an innovative Fragmentation Synergy Active Domain Adaptation framework (FS-ADA), which progressively broadcasts the limited annotation to the unlabeled regions, achieving high performance with a minimal annotation budget. Extensive experiments demonstrate that our method outperforms state-of-the-art UDA-NSS and ADA-SS methods across four day-to-nighttime benchmarks, and generalizes well to foggy, rainy, and snowy scenes. In particular, with only 1% of target nighttime data annotated, our method is on par with mainstream fully-supervised methods on the BDD100K-Night val dataset.

AAAI Conference 2025 Conference Paper

TokenMatcher: Diverse Tokens Matching for Unsupervised Visible-Infrared Person Re-Identification

  • Xiao Wang
  • Lekai Liu
  • Bin Yang
  • Mang Ye
  • Zheng Wang
  • Xin Xu

Unsupervised visible-infrared person re-identification (US-VI-ReID) seeks to match infrared and visible images of the same individual without the use of annotations. Current methods typically derive cross-modal correspondences through a single global feature matching process for generating pseudo labels and learning modality-invariant features. However, this matching approach is hindered by both intra-modality and inter-modality discrepancies, which result in imprecise measurements. As a consequence, the clustering of individuals with a single global feature is often incomplete and unreliable, leading to suboptimal performance in cross-modal clustering tasks. To address these challenges and to extract cross-modality discriminative identity information, we propose TokenMatcher, which encompasses three key components: Diverse Tokens Matching (DTM), Diverse Tokens Neighbor Learning (DTNL), and the Homogeneous Fusion (HF) Module. DTM utilizes multiple class tokens within the visual transformer framework to capture diverse embedding representations, thereby facilitating the integration of fine-grained information essential for reliable cross-modality correspondences. DTNL enhances the intra-modality and inter-modality consistency among diverse tokens by refining neighborhood sets with insights from neighboring tokens and camera information, promoting robust neighborhood learning and fostering discriminative identity information. Additionally, the HF module consolidates clusters of the same identity while effectively separating those of different identities. Extensive experiments conducted on the publicly available SYSU-MM01 and RegDB datasets demonstrate the efficacy of the proposed method.

ICLR Conference 2025 Conference Paper

UGMathBench: A Diverse and Dynamic Benchmark for Undergraduate-Level Mathematical Reasoning with Large Language Models

  • Xin Xu
  • Jiaxin Zhang
  • Tianhao Chen
  • Zitong Chao
  • Jishan Hu
  • Can Yang

Large Language Models (LLMs) have made significant strides in mathematical reasoning, underscoring the need for a comprehensive and fair evaluation of their capabilities. However, existing benchmarks often fall short, either lacking extensive coverage of undergraduate-level mathematical problems or potentially suffering from test-set contamination. To address these issues, we introduce UGMathBench, a diverse and dynamic benchmark specifically designed for evaluating undergraduate-level mathematical reasoning with LLMs. UGMathBench comprises 5,062 problems across 16 subjects and 111 topics, featuring 10 distinct answer types. Each problem includes three randomized versions, with additional versions planned for release as the leading open-source LLMs become saturated on UGMathBench. Furthermore, we propose two key metrics: effective accuracy (EAcc), which measures the percentage of problems correctly solved across all three versions, and reasoning gap ($\Delta$), which assesses reasoning robustness by calculating the difference between the average accuracy across all versions and EAcc. Our extensive evaluation of 23 leading LLMs reveals that the highest EAcc achieved is 56.3% by OpenAI-o1-mini, with large $\Delta$ values observed across different models. This highlights the need for future research aimed at developing "large reasoning models" with high EAcc and $\Delta = 0$. We anticipate that the release of UGMathBench, along with its detailed evaluation codes, will serve as a valuable resource to advance the development of LLMs in solving mathematical problems.
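
The two metrics can be computed directly from per-version correctness records. The following is a small illustrative implementation based on the definitions in the abstract; the input layout (one boolean list per randomized version) is an assumption:

```python
def eacc_and_gap(version_results):
    """version_results: one boolean list per randomized version (equal
    lengths), where version_results[v][i] is True iff version v of
    problem i was solved. Returns (EAcc, reasoning gap Delta)."""
    n = len(version_results[0])
    k = len(version_results)
    # EAcc: fraction of problems solved in every version.
    eacc = sum(all(v[i] for v in version_results) for i in range(n)) / n
    # Delta: average per-version accuracy minus EAcc (0 means the model
    # is equally reliable across randomized versions).
    avg_acc = sum(sum(v) / n for v in version_results) / k
    return eacc, avg_acc - eacc
```

By construction Delta is non-negative, and a model that solves the same problems in every version attains Delta = 0, which is the robustness target the abstract sets out.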

AAAI Conference 2024 Conference Paper

FMRNet: Image Deraining via Frequency Mutual Revision

  • Kui Jiang
  • Junjun Jiang
  • Xianming Liu
  • Xin Xu
  • Xianzheng Ma

The wavelet transform has emerged as a powerful tool for deciphering structural information within images, and recent research suggests that combining it with neural networks can yield strong image deraining results by harnessing the strengths of both the spatial domain and the frequency space. However, a comprehensive framework that accounts for the intrinsic frequency property and the correlation between rain residue and background has yet to be fully explored. In this work, we investigate the potential relationships among rain-free and residue components in the frequency domain, forming a frequency mutual revision network (FMRNet) for image deraining. Specifically, we explore the mutual representation of rain residue and background components in the frequency domain, so as to better separate the rain layer from the clean background while preserving the structural textures of the degraded images. Meanwhile, the rain distribution predicted from the low-frequency coefficients, which can be seen as a degradation prior, is used to refine the separation of rain residue and background components. Inversely, the updated rain residue is used to improve the low-frequency rain distribution prediction, forming multi-layer mutual learning. Extensive experiments demonstrate that our proposed FMRNet delivers significant performance gains across seven image deraining datasets, surpassing the state-of-the-art method ELFormer by 1.14 dB in PSNR on the Rain100L dataset at similar computation cost. Code and retrained models are available at https://github.com/kuijiang94/FMRNet.

TMLR Journal 2024 Journal Article

In-context Learning with Retrieved Demonstrations for Language Models: A Survey

  • Man Luo
  • Xin Xu
  • Yue Liu
  • Panupong Pasupat
  • Mehran Kazemi

Large language models have demonstrated remarkable few-shot in-context learning (ICL) capabilities, adapting to new tasks with a few demonstrations. However, the efficacy of ICL is highly dependent on the selection of these demonstrations. Recent developments have introduced retrieval-based in-context learning (RetICL), which dynamically retrieves demonstrations tailored to each input query. This approach leverages existing databases and retrieval systems, enhancing efficiency and scalability while mitigating biases inherent in manual example selection. Given the promising results and growing interest in RetICL, we present a comprehensive survey of this field. Our review encompasses design choices for ICL demonstration retrieval models, retrieval training procedures, inference strategies, and current applications of RetICL. In the end, we explore future directions for this emerging technology.

ECAI Conference 2024 Conference Paper

MixCon: A Hybrid Architecture for Efficient and Adaptive Sequence Modeling

  • Xin Xu
  • Zhouchen Lin

Sequence modeling is a critical task in various domains such as natural language processing, speech recognition, and time series analysis. The existing models still face challenges in capturing long-range dependencies and efficiently modeling sequences. This paper proposes a novel hybrid sequence modeling architecture called MixCon to address these challenges. The MixCon (Mixture of Conba) architecture combines a Transformer layer based on the attention mechanism, a Conba layer, and a Mixture of Experts (MoE) module. We apply this idea to the design of the attention mechanism, achieving significant improvements in computational efficiency. Additionally, the MixCon architecture integrates feedback and adaptive control mechanisms inspired by control theory, providing a new perspective and approach to sequence modeling. The experimental results demonstrate MixCon's superior throughput, outperforming Mixtral by 4.5 times and Jamba by 1.5 times when processing lengthy sequences of up to 128K tokens on a single A800 80GB GPU. Moreover, MixCon achieves top-tier scores on academic benchmarks, exemplified by its outstanding performance with a score of 87.9% on HellaSwag and 83.4% on WinoGrande, showcasing its capability to excel in complex sequence modeling tasks.

NeurIPS Conference 2024 Conference Paper

RealMAN: A Real-Recorded and Annotated Microphone Array Dataset for Dynamic Speech Enhancement and Localization

  • Bing Yang
  • Changsheng Quan
  • Yabo Wang
  • Pengyu Wang
  • Yujie Yang
  • Ying Fang
  • Nian Shao
  • Hui Bu

The training of deep learning-based multichannel speech enhancement and source localization systems relies heavily on the simulation of room impulse response and multichannel diffuse noise, due to the lack of large-scale real-recorded datasets. However, the acoustic mismatch between simulated and real-world data can degrade model performance when applied in real-world scenarios. To bridge this simulation-to-real gap, this paper presents a new relatively large-scale Real-recorded and annotated Microphone Array speech&Noise (RealMAN) dataset. The proposed dataset is valuable in two aspects: 1) benchmarking speech enhancement and localization algorithms in real scenarios; 2) offering a substantial amount of real-world training data for potentially improving the performance of real-world applications. Specifically, a 32-channel array with high-fidelity microphones is used for recording. A loudspeaker is used for playing source speech signals (about 35 hours of Mandarin speech). A total of 83.7 hours of speech signals (about 48.3 hours for a static speaker and 35.4 hours for a moving speaker) are recorded in 32 different scenes, and 144.5 hours of background noise are recorded in 31 different scenes. Both speech and noise recording scenes cover various common indoor, outdoor, semi-outdoor and transportation environments, which enables the training of general-purpose speech enhancement and source localization networks. To obtain the task-specific annotations, speaker location is annotated with an omni-directional fisheye camera by automatically detecting the loudspeaker. The direct-path signal is set as the target clean speech for speech enhancement, which is obtained by filtering the source speech signal with an estimated direct-path propagation filter. Baseline experiments demonstrate that i) compared to using simulated data, the proposed dataset is indeed able to train better speech enhancement and source localization networks; ii) using various sub-arrays of the proposed 32-channel microphone array can successfully train variable-array networks that can be directly used on unseen arrays.

IROS Conference 2024 Conference Paper

Similarity Distance-Based Label Assignment for Tiny Object Detection

  • Shuohao Shi
  • Qiang Fang
  • Xin Xu
  • Tong Zhao

Tiny object detection is becoming one of the most challenging tasks in computer vision because of the limited object size and lack of information. The label assignment strategy is a key factor affecting the accuracy of object detection. Although there are some effective label assignment strategies for tiny objects, most of them focus on reducing the sensitivity to bounding boxes in order to increase the number of positive samples, and rely on fixed hyperparameters that need to be set. However, more positive samples do not necessarily lead to better detection results; in fact, excessive positive samples may lead to more false positives. In this paper, we introduce a simple but effective strategy named the Similarity Distance (SimD) to evaluate the similarity between bounding boxes. This proposed strategy not only considers both location and shape similarity but also learns hyperparameters adaptively, ensuring that it can adapt to different datasets and various object sizes within a dataset. Our approach can simply replace the IoU in common anchor-based detectors for label assignment and Non-Maximum Suppression (NMS). Extensive experiments on four mainstream tiny object detection datasets demonstrate the superior performance of our method; in particular, it surpasses the state-of-the-art competitors by 1.8 AP points overall and 4.1 AP points on very tiny objects on AI-TOD. Code is available at: https://github.com/cszzshi/simd.

AAAI Conference 2024 Conference Paper

TaskLAMA: Probing the Complex Task Understanding of Language Models

  • Quan Yuan
  • Mehran Kazemi
  • Xin Xu
  • Isaac Noble
  • Vaiva Imbrasaite
  • Deepak Ramachandran

Structured Complex Task Decomposition (SCTD) is the problem of breaking down a complex real-world task (such as planning a wedding) into a directed acyclic graph over individual steps that contribute to achieving the task, with edges specifying temporal dependencies between steps. SCTD is an important component of assistive planning tools, and a challenge for commonsense reasoning systems. We probe how accurately SCTD can be done with the knowledge extracted from pre-trained Large Language Models (LLMs). We introduce a new high-quality human-annotated dataset for this problem and novel metrics to fairly assess performance of LLMs against several baselines. Our experiments reveal that LLMs are able to decompose complex tasks into individual steps effectively, with a relative improvement of 15% to 280% over the best baseline. We also propose a number of approaches to further improve their performance, with a relative improvement of 7% to 37%. However, we find that LLMs still struggle to predict pairwise temporal dependencies, which reveals a gap in their understanding of complex tasks.

NeurIPS Conference 2023 Conference Paper

BoardgameQA: A Dataset for Natural Language Reasoning with Contradictory Information

  • Mehran Kazemi
  • Quan Yuan
  • Deepti Bhatia
  • Najoung Kim
  • Xin Xu
  • Vaiva Imbrasaite
  • Deepak Ramachandran

Automated reasoning with unstructured natural text is a key requirement for many potential applications of NLP and for developing robust AI systems. Recently, Language Models (LMs) have demonstrated complex reasoning capacities even without any finetuning. However, existing evaluation for automated reasoning assumes access to a consistent and coherent set of information over which models reason. When reasoning in the real-world, the available information is frequently inconsistent or contradictory, and therefore models need to be equipped with a strategy to resolve such conflicts when they arise. One widely-applicable way of resolving conflicts is to impose preferences over information sources (e.g., based on source credibility or information recency) and adopt the source with higher preference. In this paper, we formulate the problem of reasoning with contradictory information guided by preferences over sources as the classical problem of defeasible reasoning, and develop a dataset called BoardgameQA for measuring the reasoning capacity of LMs in this setting. BoardgameQA also incorporates reasoning with implicit background knowledge, to better reflect reasoning problems in downstream applications. We benchmark various LMs on BoardgameQA and the results reveal a significant gap in the reasoning capacity of state-of-the-art LMs on this problem, showing that reasoning with conflicting information does not surface out-of-the-box in LMs. While performance can be improved with finetuning, it nevertheless remains poor.

IJCAI Conference 2023 Conference Paper

From Generation to Suppression: Towards Effective Irregular Glow Removal for Nighttime Visibility Enhancement

  • Wanyu Wu
  • Wei Wang
  • Zheng Wang
  • Kui Jiang
  • Xin Xu

Most existing Low-Light Image Enhancement (LLIE) methods are primarily designed to improve brightness in dark regions, which suffer from severe degradation in nighttime images. However, these methods offer limited exploration of another major cause of visibility damage: glow effects in real night scenes. Glow effects are inevitable in the presence of artificial light sources and cause further diffused blurring when directly enhanced. To address this issue, we innovatively treat the glow suppression task as learning physical glow generation via multiple scattering estimation according to the Atmospheric Point Spread Function (APSF). In response to the challenges posed by uneven glow intensity and varying source shapes, an APSF-based Nighttime Imaging Model with Near-field Light Sources (NIM-NLS) is specifically derived to design a scalable Light-aware Blind Deconvolution Network (LBDN). The glow-suppressed result is then brightened via a Retinex-based Enhancement Module (REM). Remarkably, the proposed glow suppression method is based on zero-shot learning and does not rely on any paired or unpaired training data. Empirical evaluations demonstrate the effectiveness of the proposed method in both glow suppression and low-light enhancement tasks.

ICRA Conference 2023 Conference Paper

StereoVAE: A lightweight stereo-matching system using embedded GPUs

  • Qiong Chang
  • Xiang Li 0110
  • Xin Xu
  • Xin Liu 0020
  • Yun Li 0015
  • Jun Miyazaki

We propose a lightweight system for stereo matching using embedded graphics processing units (GPUs). The proposed system overcomes the trade-off between accuracy and processing speed in stereo matching, further improving the matching accuracy while ensuring real-time processing. The basic idea is to construct a tiny neural network based on a variational autoencoder (VAE) to upscale and refine a small-sized coarse disparity map, which is initially generated using a traditional matching method. The proposed hybrid structure maintains the advantage of low computational complexity found in traditional methods, while improving matching accuracy with the help of a neural network. Extensive experiments on the KITTI 2015 benchmark dataset demonstrate that our tiny system exhibits high robustness in improving the accuracy of coarse disparity maps generated by different algorithms, while running in real time on embedded GPUs.

IJCAI Conference 2022 Conference Paper

HEA-D: A Hybrid Evolutionary Algorithm for Diversified Top-k Weight Clique Search Problem

  • Jun Wu
  • Chu-Min Li
  • Yupeng Zhou
  • Minghao Yin
  • Xin Xu
  • Dangdang Niu

The diversified top-k weight clique (DTKWC) search problem is an important generalization of the diversified top-k clique (DTKC) search problem with extensive applications, extending the DTKC search problem by taking into account the weight of vertices. In this paper, we formulate the DTKWC search problem using mixed integer linear programming constraints and propose an efficient hybrid evolutionary algorithm (HEA-D) that combines a clique-based crossover operator with an effective simulated annealing-based local optimization procedure to find high-quality local optima. The experimental results show that HEA-D performs much better than the existing methods on two representative real-world benchmarks.

ICRA Conference 2021 Conference Paper

Two-stream 2D/3D Residual Networks for Learning Robot Manipulations from Human Demonstration Videos

  • Xin Xu
  • Kun Qian 0005
  • Bo Zhou 0017
  • Shenghao Chen
  • Yitong Li

Learning manipulation skills by observing human demonstration videos is a promising direction for intelligent robotic systems. Recent advances in video-to-command methods provide an end-to-end approach to translate a video into robot plans. However, general video captioning methods focus more on understanding the full frame and lack consideration of the spatio-temporal features in videos. In this paper, we propose two-stream 2D/3D residual networks for robots to learn manipulation tasks from human demonstration videos. We integrate spatial features from a 2D residual network and temporal features from a 3D residual network as inputs for RNN layers. An encoder-decoder architecture is then used to encode the spatio-temporal features and sequentially generate the command words. Experimental results on an extended manipulation dataset show that our approach outperforms the state-of-the-art methods. Real-world experiments on a Baxter robotic arm indicate that our method can produce more accurate commands from video demonstrations.

AAAI Conference 2020 Short Paper

Multi-View Deep Attention Network for Reinforcement Learning (Student Abstract)

  • Yueyue Hu
  • Shiliang Sun
  • Xin Xu
  • Jing Zhao

The representation approximated by a single deep network is usually limited for reinforcement learning agents. We propose a novel multi-view deep attention network (MvDAN), which introduces multi-view representation learning into the reinforcement learning task for the first time. The proposed model approximates a set of strategies from multiple representations and combines these strategies based on attention mechanisms to provide a comprehensive strategy for a single agent. Experimental results on eight Atari video games show that the MvDAN achieves more effective and competitive performance than single-view reinforcement learning methods.