Arrow Research search

Author name cluster

Xi Chen

Possible papers associated with this exact author name in Arrow. This page groups case-insensitive exact name matches and is not a full identity disambiguation profile.

147 papers
2 author rows

Possible papers

147

AAAI Conference 2026 Conference Paper

ARCHE: A Novel Task to Evaluate LLMs on Latent Reasoning Chain Extraction

  • Pengze Li
  • Jiaqi Liu
  • Junchi Yu
  • Lihao Liu
  • Mingyu Ding
  • Wanli Ouyang
  • Shixiang Tang
  • Xi Chen

Large language models (LLMs) are increasingly used in scientific domains. While they can produce reasoning-like content via methods such as chain-of-thought prompting, these outputs are typically unstructured and informal, obscuring whether models truly understand the fundamental reasoning paradigms that underpin scientific inference. To address this, we introduce a novel task named Latent Reasoning Chain Extraction (ARCHE), in which models must decompose complex reasoning arguments into combinations of standard reasoning paradigms in the form of a Reasoning Logic Tree (RLT). In an RLT, all reasoning steps are explicitly categorized as one of three variants of Peirce’s fundamental inference modes: deduction, induction, or abduction. To facilitate this task, we release ARCHE Bench, a new benchmark derived from 70 Nature Communications articles, including more than 1,900 references and 38,000 viewpoints. We propose two logic-aware evaluation metrics: Entity Coverage (EC) for content completeness and Reasoning Edge Accuracy (REA) for step-by-step logical validity. Evaluations on 10 leading LLMs on ARCHE Bench reveal that models exhibit a trade-off between REA and EC, and none are yet able to extract a complete and standard reasoning chain. These findings highlight a substantial gap between the abilities of current reasoning models and the rigor required for scientific argumentation.
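The Reasoning Logic Tree (RLT) described above can be pictured as a small recursive structure in which every node carries one of Peirce's three inference modes. The sketch below is illustrative only: the field names and the toy claims are assumptions based on the abstract, not ARCHE Bench's actual schema.

```python
from dataclasses import dataclass, field

MODES = {"deduction", "induction", "abduction"}  # Peirce's three inference modes

@dataclass
class RLTNode:
    """One reasoning step: a claim derived from child steps via one inference mode."""
    claim: str
    mode: str = "deduction"
    children: list = field(default_factory=list)

    def __post_init__(self):
        if self.mode not in MODES:
            raise ValueError(f"unknown inference mode: {self.mode}")

    def count_steps(self):
        return 1 + sum(c.count_steps() for c in self.children)

# A toy tree: an abductive conclusion supported by a deduction and an induction.
tree = RLTNode(
    "protein X regulates pathway Y",
    mode="abduction",
    children=[
        RLTNode("knockout of X removes the signal", mode="deduction"),
        RLTNode("signal correlates with X level across samples", mode="induction"),
    ],
)
```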

AAAI Conference 2026 Conference Paper

How Does Chain of Thought Think? Mechanistic Interpretability of Chain-of-Thought Reasoning with Sparse Autoencoding

  • Xi Chen
  • Aske Plaat
  • Niki van Stein

Chain-of-thought (CoT) prompting boosts the accuracy of Large Language Models on multi-step tasks, yet whether the generated "thoughts" reflect the true internal reasoning process is unresolved. We present the first feature-level causal study of CoT faithfulness. Combining sparse autoencoders with activation patching, we extract monosemantic features from Pythia-70M and Pythia-2.8B while they tackle GSM8K math problems under CoT and plain (noCoT) prompting. Swapping a small set of CoT-reasoning features into a noCoT run raises answer log-probabilities significantly in the 2.8B model, but has no reliable effect in 70M, revealing a clear contrast between these two scales. CoT also leads to significantly higher activation sparsity and feature interpretability scores in the larger model, signalling more modular internal computation. For example, the model's confidence in generating correct answers improves from 1.2 to 4.3. We introduce patch-curves and random-feature patching baselines, showing that useful CoT information is not only present in the top-K patches but widely distributed. Overall, our results indicate that CoT can induce more interpretable internal structures in high-capacity LLMs, validating its role as a structured prompting method.
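The feature-swapping procedure can be sketched with a toy sparse autoencoder. The dimensions and random weights below are placeholders for a trained SAE, and the top-K selection mirrors the patching the abstract describes only schematically.

```python
import numpy as np

rng = np.random.default_rng(0)

d_model, d_sae = 16, 64  # toy sizes; real SAEs are far larger
W_enc = rng.normal(size=(d_model, d_sae))
W_dec = rng.normal(size=(d_sae, d_model))

def sae_encode(h):
    # ReLU yields sparse, non-negative feature activations
    return np.maximum(h @ W_enc, 0.0)

def sae_decode(f):
    return f @ W_dec

def patch_features(h_nocot, h_cot, k):
    """Swap the k most active CoT features into the noCoT activation."""
    f_nocot, f_cot = sae_encode(h_nocot), sae_encode(h_cot)
    top_k = np.argsort(f_cot)[-k:]          # indices of the strongest CoT features
    f_patched = f_nocot.copy()
    f_patched[top_k] = f_cot[top_k]
    return sae_decode(f_patched)

h_cot, h_nocot = rng.normal(size=d_model), rng.normal(size=d_model)
h_patched = patch_features(h_nocot, h_cot, k=8)
```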

AAAI Conference 2026 Conference Paper

SASST: Leveraging Syntax-Aware Chunking and LLMs for Simultaneous Speech Translation

  • Zeyu Yang
  • Lai Wei
  • Roman Koshkin
  • Xi Chen
  • Satoshi Nakamura

This work proposes a grammar-based chunking strategy that segments input streams into semantically complete units by parsing dependency relations (e.g., noun phrase boundaries, verb-object structures) and punctuation features. The method ensures chunk coherence and minimizes semantic fragmentation. Building on this mechanism, we present SASST (Syntax-Aware Simultaneous Translation), an end-to-end framework integrating a frozen Whisper encoder and a decoder-only LLM. The unified architecture dynamically outputs translation tokens or symbols to jointly optimize translation timing and content, with target-side reordering addressing word-order divergence. Experiments on the CoVoST2 multilingual corpus (En to De/Zh/Ja) demonstrate significant translation quality improvements across languages, validating the effectiveness of syntactic structures in LLM-driven SimulST systems.
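A minimal sketch of chunking a token stream at complete-unit boundaries. Punctuation alone stands in here for the full dependency-based boundary test described above; a real implementation would consult a dependency parse.

```python
def chunk(tokens, is_boundary):
    """Split a token stream into chunks, closing a chunk at each boundary token."""
    chunks, current = [], []
    for tok in tokens:
        current.append(tok)
        if is_boundary(tok):
            chunks.append(current)
            current = []
    if current:                      # flush any trailing partial chunk
        chunks.append(current)
    return chunks

# Punctuation-only stand-in for the dependency-aware boundary decision.
is_boundary = lambda t: t in {",", ".", ";"}
segments = chunk("the cat sat on the mat , then left .".split(), is_boundary)
```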

AAAI Conference 2026 Conference Paper

Seeing Is Believing: Grounding Long-Video Understanding in Spatio-Temporal Visual Evidence

  • Zhaoyang Wei
  • Guoliang Wang
  • Guohua Gao
  • Yanchao Hao
  • Mingda Li
  • Wenchao Ding
  • Xi Chen
  • Shizhu He

Although Vision Language Models (VLMs) have excelled at image and video understanding, applying them to hour-long videos is held back by two interrelated challenges: exorbitant computational expense and a qualitative breakdown in long-term temporal reasoning. Models thus tend to generate answers based on speculation instead of solid visual facts, producing hallucinations that are plausible yet factually incorrect. This problem is compounded by current benchmarks that, by emphasizing only final answers, lack an effective mechanism to check whether reasoning is substantiated by specific visual evidence. This makes it hard to differentiate true understanding from pretend comprehension, inhibiting targeted model refinement. To address these interrelated challenges of model fragility and evaluation weakness, we adopt a twofold strategy. First, we present EV²-Bench, a large-scale benchmark that breaks new ground with an evaluation paradigm built upon spatio-temporal visual evidence, forcing models to justify answers with checkable hints. Second, we put forward DynamicSelect, an adaptive token compression system that efficiently condenses salient information via a dynamic semantic selector and a hierarchical compression strategy. Comprehensive experiments demonstrate that DynamicSelect significantly outperforms the baselines on EV²-Bench as well as other public benchmarks. Our study offers not only a more effective approach to long-video understanding but also a more stringent evaluation paradigm, pointing the way toward more robust models.

TMLR Journal 2026 Journal Article

Stepwise Guided Policy Optimization: Coloring Your Incorrect Reasoning in GRPO

  • Peter Chen
  • Xiaopeng Li
  • Ziniu Li
  • Xi Chen
  • Tianyi Lin

Reinforcement learning (RL) has proven effective in strengthening the reasoning capabilities of large language models (LLMs). A widely adopted method, Group Relative Policy Optimization (GRPO) (Shao et al., 2024), has shown strong empirical results in training recent reasoning models (Guo et al., 2025), but it fails to update the policy when all responses within a group are incorrect (i.e., all-negative-sample groups). This limitation highlights a gap between artificial and human intelligence: unlike humans, who can learn from mistakes, GRPO discards these failure signals. We introduce a simple framework to mitigate the all-negative-sample issue by incorporating response diversity within groups using a step-wise judge model, which can be trained directly or adapted from existing LLMs. In a simplified setting, we prove that this diversification accelerates GRPO's learning dynamics. We then empirically validate Stepwise Guided Policy Optimization (SGPO) across model sizes (7B, 14B, 32B) in both offline and online training on nine reasoning benchmarks (including base and distilled variants). Overall, SGPO improves average performance and is effective in early and mid-training when all-negative groups are prevalent, while improvements are not uniform across every benchmark and depend on the structure and informativeness of negative samples. Finally, SGPO does not require the judge model to generate correct solutions, distinguishing it from knowledge distillation methods.
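The all-negative-group failure mode follows directly from GRPO's group-relative advantage: when every response in a group receives the same (zero) reward, all advantages vanish and the group contributes no gradient. The sketch below uses hypothetical step-wise judge scores to show how diversification restores a learning signal; it is not the paper's exact formulation.

```python
import numpy as np

def group_advantages(rewards, eps=1e-8):
    """GRPO-style group-relative advantages: reward minus group mean, over group std."""
    r = np.asarray(rewards, dtype=float)
    return (r - r.mean()) / (r.std() + eps)

# An all-negative group: every response scores 0, so every advantage is 0
# and the policy receives no update from this group.
flat = group_advantages([0.0, 0.0, 0.0, 0.0])

# Adding hypothetical step-wise judge scores in [0, 1] restores diversity,
# so the group again yields a non-trivial update direction.
judge_scores = (0.2, 0.7, 0.1, 0.5)
shaped = group_advantages([0.0 + s for s in judge_scores])
```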

AAAI Conference 2026 Conference Paper

Topological Federated Clustering via Gravitational Potential Fields Under Local Differential Privacy

  • Yunbo Long
  • Jiaquan Zhang
  • Xi Chen
  • Alexandra Brintrup

Clustering non-independent and identically distributed (non-IID) data under local differential privacy (LDP) in federated settings presents a critical challenge: preserving privacy while maintaining accuracy without iterative communication. Existing one-shot methods rely on unstable pairwise centroid distances or neighborhood rankings, degrading severely under strong LDP noise and data heterogeneity. We present Gravitational Federated Clustering (GFC), a novel approach to privacy-preserving federated clustering that overcomes the limitations of distance-based methods under varying LDP. Addressing the critical challenge of clustering non-IID data with diverse privacy guarantees, GFC transforms privatized client centroids into a global gravitational potential field where true cluster centers emerge as topologically persistent singularities. Our framework introduces two key innovations: (1) a client-side compactness-aware perturbation mechanism that encodes local cluster geometry as "mass" values, and (2) a server-side topological aggregation phase that extracts stable centroids through persistent homology analysis of the potential field's superlevel sets. Theoretically, we establish a closed-form bound between the privacy budget ε and centroid estimation error, proving the potential field's Lipschitz smoothing properties exponentially suppress noise in high-density regions. Empirically, GFC outperforms state-of-the-art methods on ten benchmarks, especially under strong LDP constraints (ε < 1), while maintaining comparable performance at lower privacy budgets. By reformulating federated clustering as a topological persistence problem in a synthetic physics-inspired space, GFC achieves unprecedented privacy-accuracy trade-offs without iterative communication, providing a new perspective for privacy-preserving distributed learning.
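A one-dimensional sketch of the potential-field idea: privatized client centroids superpose a softened 1/distance potential, and true cluster centers show up as the field's strongest peaks. Uniform masses and the simple kernel below are assumptions for illustration; the actual method uses compactness-aware masses and persistent homology over superlevel sets.

```python
import numpy as np

rng = np.random.default_rng(0)

def potential_field(grid, centroids, masses, soft=0.2):
    """Superpose a gravitational-style potential from noisy client centroids.
    `soft` regularizes the singularity at zero distance."""
    field = np.zeros(len(grid))
    for c, m in zip(centroids, masses):
        field += m / (np.abs(grid - c) + soft)
    return field

# Privatized 1-D centroids from many clients, noisy around true centers 0 and 5.
true_centers = [0.0, 5.0]
centroids = np.concatenate([t + rng.normal(scale=0.3, size=20) for t in true_centers])
masses = np.ones_like(centroids)   # compactness-aware masses in the real method

grid = np.linspace(-2.0, 7.0, 901)
field = potential_field(grid, centroids, masses)
peak = grid[np.argmax(field)]      # strongest singularity of the field
```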

JBHI Journal 2025 Journal Article

AI-Assisted in Silico Trial for the Optimization of Osmotherapy After Ischaemic Stroke

  • Xi Chen
  • Lei Lu
  • Tamás I. Józsa
  • Jiandong Zhou
  • David A. Clifton
  • Stephen J. Payne

Over the past few decades, osmotherapy has commonly been employed to reduce intracranial pressure in post-stroke oedema. However, evaluating the effectiveness of osmotherapy has been challenging due to the difficulties in clinical intracranial pressure measurement. As a result, there are no established guidelines regarding the selection of administration protocol parameters. Considering that the infusion of osmotic agents can also give rise to various side effects, the effectiveness of osmotherapy has remained a subject of debate. In previous studies, we proposed the first mathematical model for the investigation of osmotherapy and validated the model with clinical intracranial pressure data. The physiological parameters vary among patients and such variations can result in the failure of osmotherapy. Here, we propose an AI-assisted in silico trial for further investigation of the optimisation of administration protocols. The proposed deep neural network predicts intracranial pressure evolution over osmotherapy episodes. The effects of the parameters and the choice of dose of osmotic agents are investigated using the model. In addition, clinical stratifications of patients are related to a brain model for the first time for the optimisation of treatment of different patient groups. This provides an alternative approach to tackle clinical challenges with in silico trials supported by both mathematical/physical laws and patient-specific biomedical information.

AAAI Conference 2025 Conference Paper

Asynchronous Federated Clustering with Unknown Number of Clusters

  • Yunfan Zhang
  • Yiqun Zhang
  • Yang Lu
  • Mengke Li
  • Xi Chen
  • Yiu-ming Cheung

Federated Clustering (FC) is crucial to mining knowledge from unlabeled non-Independent Identically Distributed (non-IID) data provided by multiple clients while preserving their privacy. Most existing attempts learn cluster distributions at local clients, then securely pass the desensitized information to the server for aggregation. However, some tricky but common FC problems are still relatively unexplored, including heterogeneity in clients' communication capacity and the unknown number of proper clusters. To further bridge the gap between FC and real application scenarios, this paper first shows that the clients' communication asynchrony and unknown proper cluster numbers are complex coupling problems, and then proposes an Asynchronous Federated Cluster Learning (AFCL) method accordingly. It spreads an excess of seed points to clients as a learning medium and coordinates them across clients to form a consensus. To alleviate the distribution imbalance accumulated due to unforeseen asynchronous uploading from heterogeneous clients, we also design a balancing mechanism for seed updating. As a result, the seeds gradually adapt to each other to reveal a proper number of clusters. Extensive experiments demonstrate the efficacy of AFCL.

IJCAI Conference 2025 Conference Paper

AttentionDrag: Exploiting Latent Correlation Knowledge in Pre-trained Diffusion Models for Image Editing

  • Biao Yang
  • Muqi Huang
  • Yuhui Zhang
  • Yun Xiong
  • Kun Zhou
  • Xi Chen
  • Shiyang Zhou
  • Huishuai Bao

Traditional point-based image editing methods rely on iterative latent optimization or geometric transformations, which are either inefficient in their processing or fail to capture the semantic relationships within the image. These methods often overlook the powerful yet underutilized image editing capabilities inherent in pre-trained diffusion models. In this work, we propose a novel one-step point-based image editing method, named AttentionDrag, which leverages the inherent latent knowledge and feature correlations within pre-trained diffusion models for image editing tasks. This framework enables semantic consistency and high-quality manipulation without the need for extensive re-optimization or retraining. Specifically, we reutilize the latent correlation knowledge learned by the self-attention mechanism in the U-Net module during the DDIM inversion process to automatically identify and adjust relevant image regions, ensuring semantic validity and consistency. Additionally, AttentionDrag adaptively generates masks to guide the editing process, enabling precise and context-aware modifications with friendly interaction. Our results demonstrate a performance that surpasses most state-of-the-art methods with significantly faster speeds, showing a more efficient and semantically coherent solution for point-based image editing tasks. Code is released at: https://github.com/GPlaying/AttentionDrag.

NeurIPS Conference 2025 Conference Paper

ComPO: Preference Alignment via Comparison Oracles

  • Peter Chen
  • Xi Chen
  • Wotao Yin
  • Tianyi Lin

Direct alignment methods are increasingly used for aligning large language models (LLMs) with human preferences. However, these methods suffer from the issues of verbosity and likelihood displacement, which can be driven by noisy preference pairs that induce similar likelihood for preferred and dispreferred responses. The contributions of this paper are two-fold. First, we propose a new preference alignment method based on zeroth-order, comparison-based optimization via comparison oracles and provide convergence guarantees for its basic scheme. Second, we improve our method with several heuristics and conduct experiments to demonstrate the flexibility and compatibility of the practical scheme in improving the performance of LLMs using noisy preference pairs. Evaluations are conducted across multiple base and instruction-tuned models (Mistral-7B, Llama-3-8B and Gemma-2-9B) with benchmarks (AlpacaEval 2, MT-Bench and Arena-Hard). Experimental results show the effectiveness of our method as an alternative for addressing the limitations of existing direct alignment methods. A highlight of our work is that we provide evidence for the importance of designing specialized methods for preference pairs with distinct likelihood margins, which complements the recent findings of Razin et al. (2025).
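The comparison-oracle idea can be sketched as a sign-based zeroth-order update: probe a random direction and move toward whichever side the oracle prefers, using no gradients at all. The toy distance-based oracle below is a stand-in for real preference comparisons, not the paper's scheme.

```python
import numpy as np

def comparison_step(theta, oracle, rng, delta=0.1, lr=0.05):
    """One zeroth-order update from a pairwise comparison oracle:
    probe a random unit direction and step toward the preferred side."""
    u = rng.normal(size=theta.shape)
    u /= np.linalg.norm(u)
    s = oracle(theta + delta * u, theta - delta * u)  # +1 or -1
    return theta + lr * s * u

# Toy stand-in for a preference signal: prefer parameters nearer a target.
target = np.array([1.0, -2.0])
def oracle(a, b):
    return 1.0 if np.linalg.norm(a - target) < np.linalg.norm(b - target) else -1.0

rng = np.random.default_rng(0)
theta = np.zeros(2)
for _ in range(500):
    theta = comparison_step(theta, oracle, rng)
```

After a few hundred comparisons, theta drifts close to the target even though the optimizer never sees a gradient or a numeric objective value, only binary preferences.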

IJCAI Conference 2025 Conference Paper

Connector-S: A Survey of Connectors in Multi-modal Large Language Models

  • Xun Zhu
  • Zheng Zhang
  • Xi Chen
  • Yiming Shi
  • Miao Li
  • Ji Wu

With the rapid advancements in multi-modal large language models (MLLMs), connectors play a pivotal role in bridging diverse modalities and enhancing model performance. However, the design and evolution of connectors have not been comprehensively analyzed, leaving gaps in understanding how these components function and hindering the development of more powerful connectors. In this survey, we systematically review the current progress of connectors in MLLMs and present a structured taxonomy that categorizes connectors into atomic operations (mapping, compression, mixture of experts) and holistic designs (multi-layer, multi-encoder, multi-modal scenarios), highlighting their technical contributions and advancements. Furthermore, we discuss several promising research frontiers and challenges, including high-resolution input, dynamic compression, guide information selection, combination strategy, and interpretability. This survey is intended to serve as a foundational reference and a clear roadmap for researchers, providing valuable insights into the design and optimization of next-generation connectors to enhance the performance and adaptability of MLLMs.

ICLR Conference 2025 Conference Paper

CryoGEN: Generative Energy-based Models for Cryogenic Electron Tomography Reconstruction

  • Yunfei Teng
  • Yuxuan Ren
  • Kai Chen
  • Xi Chen
  • Zhaoming Chen
  • Qiwei Ye

Cryogenic electron tomography (Cryo-ET) is a powerful technique for visualizing subcellular structures in their native states. Nonetheless, its effectiveness is compromised by anisotropic resolution artifacts caused by the missing-wedge effect. To address this, IsoNet, a deep learning-based method, proposes iteratively reconstructing the missing-wedge information. While successful, IsoNet's dependence on recursive prediction updates often leads to training instability and model divergence. In this study, we introduce CryoGEN, an energy-based probabilistic model that not only mitigates resolution anisotropy but also removes the need for recursive subtomogram averaging, delivering an approximate 10× speedup for training. Evaluations across various biological datasets, including immature HIV-1 virions and ribosomes, demonstrate that CryoGEN significantly enhances structural completeness and interpretability of the reconstructed samples.

AAAI Conference 2025 Conference Paper

Decoupling Metacognition from Cognition: A Framework for Quantifying Metacognitive Ability in LLMs

  • Guoqing Wang
  • Wen Wu
  • Guangze Ye
  • Zhenxiao Cheng
  • Xi Chen
  • Hong Zheng

Large Language Models (LLMs) are known to hallucinate facts and make non-factual statements which can undermine trust in their output. The essence of hallucination lies in the absence of metacognition in LLMs, namely the understanding of their own cognitive processes. However, there has been limited research on quantitatively measuring metacognition within LLMs. Drawing inspiration from cognitive psychology theories, we first quantify the metacognitive ability of LLMs as their ability to evaluate the correctness of responses through confidence. Subsequently, we introduce a general framework called DMC designed to decouple metacognitive ability and cognitive ability. This framework tackles the challenge of noisy quantification caused by the coupling of metacognition and cognition in current research, such as calibration-based metrics. Specifically, the DMC framework comprises two key steps. Initially, the framework tasks the LLM with failure prediction, aiming to evaluate the model's performance in predicting failures, a performance jointly determined by both the cognitive and metacognitive abilities of the LLM. Following this, the framework disentangles metacognitive ability and cognitive ability based on the failure prediction performance, providing a quantification of the LLM's metacognitive ability independent of cognitive influences. Experiments conducted on eight datasets across five domains reveal that (1) our proposed DMC framework effectively separates the metacognition and cognition of LLMs; (2) various confidence elicitation methods impact the quantification of metacognitive ability differently; (3) stronger metacognitive ability is exhibited by LLMs with better overall performance; and (4) enhancing metacognition holds promise for alleviating hallucination issues.

AAAI Conference 2025 Conference Paper

Disentangled Modeling of Preferences and Social Influence for Group Recommendation

  • Guangze Ye
  • Wen Wu
  • Guoqing Wang
  • Xi Chen
  • Hong Zheng
  • Liang He

Group recommendation (GR) aims to suggest items for a group of users in social networks. Existing work typically considers individual preferences as the sole factor in aggregating group preferences. In practice, social influence is also an important factor in modeling users' contributions to the final group decision. However, existing methods either neglect the social influence of individual members or bundle preferences and social influence together as a unified representation. As a result, these models emphasize the preferences of the majority within the group rather than the actual interaction items, which we refer to as the preference bias issue in GR. Moreover, the self-supervised learning (SSL) strategies they designed to address the issue of group data sparsity fail to account for users' contextual social weights when regulating group representations, leading to suboptimal results. To tackle these issues, we propose a novel model based on Disentangled Modeling of Preferences and Social Influence for Group Recommendation (DisRec). Concretely, we first design a user-level disentangling network to disentangle the preferences and social influence of group members with separate embedding propagation schemes based on (hyper)graph convolution networks. We then introduce a social-based contrastive learning strategy, selectively excluding user nodes based on their social importance to enhance group representations and alleviate the group-level data sparsity issue. The experimental results demonstrate that our model significantly outperforms state-of-the-art methods on two real-world datasets.

JBHI Journal 2025 Journal Article

Fall Detection Method Based on a Human Electrostatic Field and VMD-ECANet Architecture

  • Xi Chen
  • Jiaao Yan
  • Sichao Qin
  • Pengfei Li
  • Shuangqian Ning
  • Yuting Liu

Falls are one of the most serious health risks faced by older adults worldwide, and they can have a significant impact on their physical and mental well-being as well as their quality of life. Detecting falls promptly and accurately and providing assistance can effectively reduce the harm caused by falls to older adults. This paper proposes a noncontact fall detection method based on the human electrostatic field and a VMD-ECANet framework. An electrostatic measurement system was used to measure the electrostatic signals of four types of falling postures and five types of daily actions. The signals were randomly divided in proportion and by individuals to construct a training set and test set. A fall detection model based on the VMD-ECA network was proposed that decomposes electrostatic signals into modal component signals using the variational mode decomposition (VMD) technique. These signals were then fed into a multichannel convolutional neural network for feature extraction. Information fusion was achieved through the efficient channel attention network (ECANet) module. Finally, the extracted features were input into a classifier to obtain the output results. The constructed model achieved an accuracy of 96.44%. The proposed fall detection solution has several advantages, including being noncontact, cost-effective, and privacy friendly. It is suitable for detecting indoor falls by older individuals living alone and helps to reduce the harm caused by falls.
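The channel-attention fusion step can be sketched as follows. The uniform cross-channel filter is a stand-in for ECANet's learned 1-D convolution, and the random matrices stand in for VMD modal components of an electrostatic signal; this is a toy illustration of the gating pattern, not the paper's trained module.

```python
import numpy as np

def sigmoid(x):
    return 1.0 / (1.0 + np.exp(-x))

def eca_gate(features, k=3):
    """ECA-style channel attention over features shaped (channels, time):
    global average pool per channel, a cheap 1-D cross-channel filter
    (uniform weights stand in for the learned conv), then a sigmoid gate."""
    gap = features.mean(axis=1)                       # (C,) pooled descriptor
    pad = k // 2
    padded = np.pad(gap, pad, mode="edge")
    mixed = np.array([padded[i:i + k].mean() for i in range(gap.size)])
    gate = sigmoid(mixed)                             # per-channel weight in (0, 1)
    return features * gate[:, None]                   # rescale each channel

# Stand-ins for VMD modal components: 4 modes, 100 time samples.
rng = np.random.default_rng(0)
modes = rng.normal(size=(4, 100))
fused = eca_gate(modes)
```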

NeurIPS Conference 2025 Conference Paper

Fira: Can We Achieve Full-rank Training of LLMs Under Low-rank Constraint?

  • Xi Chen
  • Kaituo Feng
  • Changsheng Li
  • Xunhao Lai
  • Xiangyu Yue
  • Ye Yuan
  • Guoren Wang

Low-rank training has emerged as a promising approach for reducing memory usage in training Large Language Models (LLMs). Previous methods either rely on decomposing weight matrices (e.g., LoRA), or seek to decompose gradient matrices (e.g., GaLore) to ensure reduced memory consumption. However, both of them constrain the training in a low-rank subspace, thus inevitably leading to sub-optimal performance. To resolve this, we propose a new plug-and-play training framework for LLMs called Fira, as the first attempt to consistently preserve the low-rank constraint for memory efficiency, while achieving full-rank training (i.e., training with full-rank gradients of full-rank weights) to avoid inferior outcomes. First, we observe an interesting phenomenon during LLM training: the scaling impact of adaptive optimizers (e.g., Adam) on the gradient norm remains similar from low-rank to full-rank training. In light of this, we propose a norm-based scaling method, which utilizes the scaling impact of low-rank optimizers as substitutes for that of original full-rank optimizers to achieve this goal. Moreover, we find that there are potential loss spikes during training. To address this, we further put forward a norm-growth limiter to smooth the gradient. Extensive experiments on the pre-training and fine-tuning of LLMs show that Fira outperforms both LoRA and GaLore. Notably, for pre-training LLaMA 7B, our Fira uses 8× less optimizer-state memory than GaLore, yet outperforms it by a large margin.
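The norm-growth limiter can be sketched as a simple cap on step-to-step gradient-norm growth, which is how a spike gets smoothed away. The growth factor and the rule below are illustrative assumptions, not the paper's exact formula.

```python
import numpy as np

def norm_growth_limiter(grad, prev_norm, gamma=1.01):
    """Cap how fast the gradient norm may grow between steps.
    If the new norm exceeds gamma * prev_norm, rescale the gradient
    (keeping its direction) so the norm grows by at most gamma."""
    norm = np.linalg.norm(grad)
    if prev_norm is not None and norm > gamma * prev_norm:
        grad = grad * (gamma * prev_norm / norm)
        norm = gamma * prev_norm
    return grad, norm

# A sudden 20x spike in gradient norm is clipped back to 1% growth.
spiky = np.ones(4) * 10.0                 # norm 20, vs. previous norm 1
limited, new_norm = norm_growth_limiter(spiky, prev_norm=1.0)
```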

TMLR Journal 2025 Journal Article

Foundation Models Meet Federated Learning: A One-shot Feature-sharing Method with Privacy and Performance Guarantees

  • Mahdi Beitollahi
  • Alex Bie
  • Sobhan Hemati
  • Leo Maxime Brunswic
  • Xu Li
  • Xi Chen
  • Guojun Zhang

Adapting foundation models for downstream tasks via Federated Learning (FL) is a promising strategy for protecting privacy while leveraging the capability of foundation models. However, FL's iterative training and model transmission result in high communication costs and GPU memory demands, making large foundation models impractical for FL. This paper introduces a one-shot FL method with a server-side performance bound to enable foundation models by reducing communication costs and GPU memory requirements. Our approach, FedPFT (FL with Parametric Feature Transfer), involves clients learning and transferring parametric models for features extracted from frozen foundation models in a single round. Parametric models are then used to generate synthetic features at the server to train a classifier head. We evaluate FedPFT across eight vision datasets using three vision foundation models. Our findings demonstrate that FedPFT is agnostic to data heterogeneity and network topology, and it improves the communication-accuracy frontier by up to 7.8%. Finally, we show FedPFT's compatibility with differential privacy and its resilience against reconstruction attacks. Our work highlights the capability of private, feature-sharing methods for one-shot knowledge transfer using foundation models.
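The parametric feature transfer can be sketched as fitting a per-class distribution over client-side features and sampling synthetic features at the server. The Gaussian family and toy dimensions below are illustrative assumptions; the raw features never leave the client, only the fitted parameters do.

```python
import numpy as np

rng = np.random.default_rng(0)

# Client side: summarize frozen-backbone features of one class as a Gaussian
# (one possible parametric family; the method's actual choice may differ).
def fit_parametric(features):
    return features.mean(axis=0), np.cov(features, rowvar=False)

# Server side: regenerate synthetic features from the transferred parameters,
# then use them to train a classifier head.
def sample_synthetic(mean, cov, n):
    return rng.multivariate_normal(mean, cov, size=n)

client_features = rng.normal(loc=3.0, scale=1.0, size=(200, 4))  # stand-in features
mu, cov = fit_parametric(client_features)
synthetic = sample_synthetic(mu, cov, n=500)
```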

AAAI Conference 2025 Conference Paper

HFF-Tracker: A Hierarchical Fine-grained Fusion Tracker for Referring Multi-Object Tracking

  • Zeyong Zhao
  • Yanchao Hao
  • Minghao Zhang
  • Qingbin Liu
  • Bo Li
  • Dianbo Sui
  • Shizhu He
  • Xi Chen

Referring Multi-Object Tracking (RMOT) aims to track multiple objects based on a provided language expression. Although prior studies have sought to accomplish this by integrating a textual module into the multi-object tracker, these methods combine text and image features in a basic way, neglecting the importance of text features. In this study, we propose a Hierarchical Fine-grained text-image Fusion tracker, named HFF-Tracker, which can perform fine-grained fusion of pixel-level visual features and text features across various semantic levels. Specifically, we have devised a Hierarchical Multi-Modal Fusion (HMMF) module to merge text and image features at an early stage in a hierarchical and detailed manner. The Text-Guided Decoder (TGD) is designed to provide the query with prior semantic information during the decoding process. Additionally, we have crafted a Text-Guided Prediction Head (TGPH) that utilizes text information to enhance the performance of the prediction head. Furthermore, we have implemented an adaptive Look-Back training strategy to maximize the utilization of valuable labeled data. Extensive experiments on the Refer-KITTI dataset and the Refer-KITTI-V2 dataset demonstrate that our proposed HFF-Tracker outperforms other state-of-the-art methods by remarkable margins.

ICML Conference 2025 Conference Paper

iDPA: Instance Decoupled Prompt Attention for Incremental Medical Object Detection

  • Huahui Yi
  • Wei Xu 0046
  • Ziyuan Qin 0001
  • Xi Chen
  • Xiaohu Wu
  • Kang Li 0004
  • Qicheng Lao

Existing prompt-based approaches have demonstrated impressive performance in continual learning, leveraging pre-trained large-scale models for classification tasks; however, the tight coupling between foreground-background information and the coupled attention between prompts and image-text tokens present significant challenges in incremental medical object detection tasks, due to the conceptual gap between medical and natural domains. To overcome these challenges, we introduce the iDPA framework, which comprises two main components: 1) Instance-level Prompt Generation (IPG), which decouples fine-grained instance-level knowledge from images and generates prompts that focus on dense predictions, and 2) Decoupled Prompt Attention (DPA), which decouples the original prompt attention, enabling a more direct and efficient transfer of prompt information while reducing memory usage and mitigating catastrophic forgetting. We collect 13 clinical, cross-modal, multi-organ, and multi-category datasets, referred to as ODinM-13, and experiments demonstrate that iDPA outperforms existing SOTA methods, with FAP improvements of 5.44%, 4.83%, 12.88%, and 4.59% in full data, 1-shot, 10-shot, and 50-shot settings, respectively.

NeurIPS Conference 2025 Conference Paper

LongVPO: From Anchored Cues to Self-Reasoning for Long-Form Video Preference Optimization

  • Zhenpeng Huang
  • Jiaqi Li
  • Zihan Jia
  • Xinhao Li
  • Desen Meng
  • Lingxue Song
  • Xi Chen
  • Liang Li

We present LongVPO, a novel two-stage Direct Preference Optimization framework that enables short-context vision-language models to robustly understand ultra-long videos without any long-video annotations. In Stage 1, we synthesize preference triples by anchoring questions to individual short clips, interleaving them with distractors, and applying visual-similarity and question-specificity filtering to mitigate positional bias and ensure unambiguous supervision. We also approximate the reference model's scoring over long contexts by evaluating only the anchor clip, reducing computational overhead. In Stage 2, we employ a recursive captioning pipeline on long videos to generate scene-level metadata, and then use a large language model to craft multi-segment reasoning queries and dispreferred responses, aligning the model's preferences through multi-segment reasoning tasks. With only 16K synthetic examples and no costly human labels, LongVPO outperforms state-of-the-art open-source models on multiple long-video benchmarks, while maintaining strong short-video performance (e.g., on MVBench), offering a scalable paradigm for efficient long-form video understanding.

NeurIPS Conference 2025 Conference Paper

MiCo: Multi-image Contrast for Reinforcement Visual Reasoning

  • Xi Chen
  • Mingkang Zhu
  • Shaoteng Liu
  • Xiaoyang Wu
  • Xiaogang Xu
  • Yu Liu
  • Xiang Bai
  • Hengshuang Zhao

This work explores enabling Chain-of-Thought (CoT) reasoning to link visual cues across multiple images. A straightforward solution is to adapt rule-based reinforcement learning for Vision-Language Models (VLMs). However, such methods typically rely on manually curated question-answer pairs, which can be particularly challenging when dealing with fine-grained visual details and complex logic across images. Inspired by self-supervised visual representation learning, we observe that images contain inherent constraints that can serve as supervision. Based on this insight, we construct image triplets comprising two augmented views of the same image and a third, similar but distinct image. During training, the model is prompted to generate a reasoning process to compare these images (i.e., determine same or different). Then we optimize the model with rule-based reinforcement learning. Due to the high visual similarity and the presence of augmentations, the model must attend to subtle visual cues and perform logical reasoning to succeed. Experimental results demonstrate that, although trained solely on visual comparison tasks, the learned reasoning ability generalizes effectively to a wide range of questions. Without relying on any human-annotated question-answer pairs, our method achieves significant improvements on multi-image reasoning benchmarks and shows strong performance on general vision tasks.

NeurIPS Conference 2025 Conference Paper

OmniVCus: Feedforward Subject-driven Video Customization with Multimodal Control Conditions

  • Yuanhao Cai
  • He Zhang
  • Xi Chen
  • Jinbo Xing
  • Yiwei Hu
  • Yuqian Zhou
  • Kai Zhang
  • Zhifei Zhang

Existing feedforward subject-driven video customization methods mainly study single-subject scenarios due to the difficulty of constructing multi-subject training data pairs. Another challenging problem, how to use signals such as depth, mask, camera, and text prompts to control and edit the subject in the customized video, is still less explored. In this paper, we first propose a data construction pipeline, VideoCus-Factory, to produce training data pairs for multi-subject customization from raw videos without labels and control signals such as depth-to-video and mask-to-video pairs. Based on our constructed data, we develop an Image-Video Transfer Mixed (IVTM) training with image editing data to enable instructive editing for the subject in the customized video. Then we propose a diffusion Transformer framework, OmniVCus, with two embedding mechanisms, Lottery Embedding (LE) and Temporally Aligned Embedding (TAE). LE enables inference with more subjects by using the training subjects to activate more frame embeddings. TAE encourages the generation process to extract guidance from temporally aligned control signals by assigning the same frame embeddings to the control and noise tokens. Experiments demonstrate that our method significantly surpasses state-of-the-art methods in both quantitative and qualitative evaluations. Project page is at https://caiyuanhao1998.github.io/project/OmniVCus/

NeurIPS Conference 2025 Conference Paper

PlayerOne: Egocentric World Simulator

  • Yuanpeng Tu
  • Hao Luo
  • Xi Chen
  • Xiang Bai
  • Fan Wang
  • Hengshuang Zhao

We introduce PlayerOne, the first egocentric realistic world simulator, facilitating immersive and unrestricted exploration within vividly dynamic environments. Given an egocentric scene image from the user, PlayerOne can accurately construct the corresponding world and generate egocentric videos that are strictly aligned with the real-scene human motion of the user captured by an exocentric camera. PlayerOne is trained in a coarse-to-fine pipeline that first performs pretraining on large-scale egocentric text-video pairs for coarse-level egocentric understanding, followed by finetuning on synchronous motion-video data extracted from egocentric-exocentric video datasets with our automatic construction pipeline. Besides, considering the varying importance of different components, we design a part-disentangled motion injection scheme, enabling precise control of part-level movements. In addition, we devise a joint reconstruction framework that progressively models both the 4D scene and video frames, ensuring scene consistency in the long-form video generation. Experimental results demonstrate its great generalization ability in precise control of varying human movements and world-consistent modeling of diverse scenarios. It marks the first endeavor into egocentric real-world simulation and can pave the way for the community to delve into fresh frontiers of world modeling and its diverse applications.

AAAI Conference 2025 Conference Paper

Reverse Region-to-Entity Annotation for Pixel-Level Visual Entity Linking

  • Zhengfei Xu
  • Sijia Zhao
  • Yanchao Hao
  • Xiaolong Liu
  • Lili Li
  • Yuyang Yin
  • Bo Li
  • Xi Chen

Visual Entity Linking (VEL) is a crucial task for achieving fine-grained visual understanding, matching objects within images (visual mentions) to entities in a knowledge base. Previous VEL tasks rely on textual inputs, but writing queries for complex scenes can be challenging. Visual inputs like clicks or bounding boxes offer a more convenient alternative. Therefore, we propose a new task, Pixel-Level Visual Entity Linking (PL-VEL), which uses pixel masks from visual inputs to refer to objects, supplementing reference methods for VEL. To facilitate research on this task, we have constructed the MaskOVEN-Wiki dataset through an entirely automatic reverse region-entity annotation framework. This dataset contains over 5 million annotations aligning pixel-level regions with entity-level labels, which will advance visual understanding towards a fine-grained level. Moreover, as pixel masks correspond to semantic regions in an image, we enhance previous patch-interacted attention with region-interacted attention by a visual semantic tokenization approach. Manual evaluation results indicate that the reverse annotation framework achieved a 94.8% annotation success rate. Experimental results show that models trained on this dataset improved accuracy by 18 points compared to zero-shot models. Additionally, the semantic tokenization method achieved a 5-point accuracy improvement over the trained baseline.

NeurIPS Conference 2025 Conference Paper

ROSE: Remove Objects with Side Effects in Videos

  • Chenxuan Miao
  • Yutong Feng
  • Jianshu Zeng
  • Zixiang Gao
  • Hantang Liu
  • Yunfeng Yan
  • Donglian Qi
  • Xi Chen

Video object removal has achieved advanced performance due to the recent success of video generative models. However, when addressing the side effects of objects, e.g., their shadows and reflections, existing works struggle to eliminate these effects due to the scarcity of paired video data as supervision. This paper presents ROSE (Remove Objects with Side Effects), a framework that systematically studies the object's effects on the environment, which can be categorized into five common cases: shadows, reflections, light, translucency and mirror. Given the challenges of curating paired videos exhibiting the aforementioned effects, we leverage a 3D rendering engine for synthetic data generation. We carefully construct a fully-automatic pipeline for data preparation, which simulates a large-scale paired dataset with diverse scenes, objects, shooting angles, and camera trajectories. ROSE is implemented as a video inpainting model built on a diffusion transformer. To localize all object-correlated areas, the entire video is fed into the model for reference-based erasing. Moreover, additional supervision is introduced to explicitly predict the areas affected by side effects, which can be revealed through the differential mask between the paired videos. To fully investigate the model's performance on various kinds of side-effect removal, we present a new benchmark, dubbed ROSE-Bench, incorporating both common scenarios and the five special side effects for comprehensive evaluation. Experimental results demonstrate that ROSE achieves superior performance compared to existing video object erasing models and generalizes well to real-world video scenarios.

JBHI Journal 2025 Journal Article

RPD: Regional Prior Distillation for Breast Cancer Diagnosis in Ultrasound Images

  • Yi Lin
  • Haosen Wang
  • Yingnan Zhao
  • Dan Lu
  • Yanchen Xu
  • Jiexiao Xue
  • Xi Chen
  • Jingchi Jiang

Breast cancer is the leading cause of death among women worldwide. Ultrasound imaging is an important means for the early detection of breast cancer, improving the survival rate. Due to the shortage of experienced sonographers, computer-aided systems for breast cancer recognition become particularly important. Some recent studies analyze tumor types in lesion regions but rely on predefined ROIs. Some other studies recognize cancer in the whole ultrasound image, but always suffer from the extremely variable proportion, location and quantity of the tumor lesions. In this paper, we propose a regional prior distillation (RPD) framework for breast cancer diagnosis in ultrasound images. To enhance the analysis of the tumor region, we propose an Image-Cross Attention (ICA) to fuse the predefined ROI prior information with ultrasound images and train a prior-fused model. To remove the constraint of predefined ROIs, we propose a Distribution Distillation Learning (DDL) to distill the prior-fused sample distribution from the prior-fused model into a diagnostic model, which analyzes the disease from only ultrasound images, based on the knowledge distillation paradigm of the teacher-student framework. Comprehensive experiments are conducted on multi-institutional datasets to validate the proposed RPD framework. The results demonstrate the following points. The ICA fuses regional prior information adequately, leading to a high-performance prior-fused model. The DDL distills the prior information effectively, enhancing the diagnostic model to focus on the tumor lesions. The performance of the diagnostic model surpasses that of current SOTA methods by 1.66% in accuracy and 0.64% in AUC. In addition, the diagnostic model is robust to slight perturbations and achieves good generalization performance.

NeurIPS Conference 2025 Conference Paper

Seg-VAR: Image Segmentation with Visual Autoregressive Modeling

  • Rongkun Zheng
  • Lu Qi
  • Xi Chen
  • Yi Wang
  • Kun Wang
  • Hengshuang Zhao

While visual autoregressive modeling (VAR) strategies have shed light on image generation with the autoregressive models, their potential for segmentation, a task that requires precise low-level spatial perception, remains unexplored. Inspired by the multi-scale modeling of classic Mask2Former-based models, we propose Seg-VAR, a novel framework that rethinks segmentation as a conditional autoregressive mask generation problem. This is achieved by replacing the discriminative learning with the latent learning process. Specifically, our method incorporates three core components: (1) an image encoder generating latent priors from input images, (2) a spatial-aware seglat (a latent expression of segmentation mask) encoder that maps segmentation masks into discrete latent tokens using a location-sensitive color mapping to distinguish instances, and (3) a decoder reconstructing masks from these latents. A multi-stage training strategy is introduced: first learning seglat representations via image-seglat joint training, then refining latent transformations, and finally aligning image-encoder-derived latents with seglat distributions. Experiments show Seg-VAR outperforms previous discriminative and generative methods on various segmentation tasks and validation benchmarks. By framing segmentation as a sequential hierarchical prediction task, Seg-VAR opens new avenues for integrating autoregressive reasoning into spatial-aware vision systems.

AAAI Conference 2025 Conference Paper

TC-LLaVA: Rethinking the Transfer of LLava from Image to Video Understanding with Temporal Considerations

  • Mingze Gao
  • Jingyu Liu
  • Mingda Li
  • Jiangtao Xie
  • Qingbin Liu
  • Kevin Zhao
  • Xi Chen
  • Hui Xiong

Multimodal Large Language Models (MLLMs) have significantly improved performance across various image-language applications. Recently, there has been a growing interest in adapting image pre-trained MLLMs for video-related tasks. However, most efforts concentrate on enhancing the vision encoder and projector components, while the core part, Large Language Models (LLMs), remains comparatively under-explored. In this paper, we propose two strategies to enhance the model's capability in video understanding tasks by improving inter-layer attention computation in LLMs. Specifically, the first approach focuses on the enhancement of Rotary Position Embedding (RoPE) with Temporal-Aware Dual RoPE, which introduces temporal position information to strengthen the MLLM's temporal modeling capabilities while preserving the relative position relationships of both visual and text tokens. The second approach involves enhancing the Attention Mask with the Frame-wise Block Causal Attention Mask, a simple yet effective method that broadens visual token interactions within and across video frames while maintaining the causal inference mechanism. Based on these proposed methods, we adapt LLaVA for video understanding tasks, naming it Temporal-Considered LLaVA (TC-LLaVA). Our TC-LLaVA achieves new state-of-the-art performance across various video understanding benchmarks with only supervised fine-tuning (SFT) on video-related datasets.

AAAI Conference 2025 Conference Paper

The Distributional Reward Critic Framework for Reinforcement Learning Under Perturbed Rewards

  • Xi Chen
  • Zhihui Zhu
  • Andrew Perrault

The reward signal plays a central role in defining the desired behaviors of agents in reinforcement learning (RL). Rewards collected from realistic environments could be perturbed, corrupted, or noisy due to an adversary, sensor error, or because they come from subjective human feedback. Thus, it is important to construct agents that can learn under such rewards. Existing methodologies for this problem make strong assumptions, including that the perturbation is known in advance, clean rewards are accessible, or that the perturbation preserves the optimal policy. We study a new, more general, class of unknown perturbations, and introduce a distributional reward critic framework for estimating reward distributions and perturbations during training. Our proposed methods are compatible with any RL algorithm. Despite their increased generality, we show that they achieve comparable or better rewards than existing methods in a variety of environments, including those with clean rewards. Under the challenging and generalized perturbations we study, we win/tie the highest return in 44/48 tested settings (compared to 11/48 for the best baseline). Our results broaden and deepen our ability to perform RL in reward-perturbed environments.

RLC Conference 2025 Conference Paper

Understanding Learned Representations and Action Collapse in Visual Reinforcement Learning

  • Xi Chen
  • Zhihui Zhu
  • Andrew Perrault

In contrast to deep learning models trained with supervised data, visual reinforcement learning (VRL) models learn to represent their environment implicitly via the process of seeking higher rewards. However, there has been little research on the specific representations VRL models learn. Using linear probing, we study the extent to which VRL models learn to linearly represent the ground truth vectorized state of an environment, on which layers these representations are most accessible, and how this relates to the reward achieved by the final model. We observe that poorly performing agents differ substantially from well-performing ones in the representation learned in their later MLP layers, but not their earlier CNN layers. When an agent is initialized by reusing the later layers of a poorly performing agent, the result is always poor. These poorly performing agents end up with no entropy in their actor network output, a phenomenon we call action collapse. Based on these observations, we propose a simple rule to prevent action collapse during training, leading to better performance on tasks with image observations with no additional computational cost. Code is available at: https://github.com/cx441000319/action-collapse.

RLJ Journal 2025 Journal Article

Understanding Learned Representations and Action Collapse in Visual Reinforcement Learning

  • Xi Chen
  • Zhihui Zhu
  • Andrew Perrault

In contrast to deep learning models trained with supervised data, visual reinforcement learning (VRL) models learn to represent their environment implicitly via the process of seeking higher rewards. However, there has been little research on the specific representations VRL models learn. Using linear probing, we study the extent to which VRL models learn to linearly represent the ground truth vectorized state of an environment, on which layers these representations are most accessible, and how this relates to the reward achieved by the final model. We observe that poorly performing agents differ substantially from well-performing ones in the representation learned in their later MLP layers, but not their earlier CNN layers. When an agent is initialized by reusing the later layers of a poorly performing agent, the result is always poor. These poorly performing agents end up with no entropy in their actor network output, a phenomenon we call action collapse. Based on these observations, we propose a simple rule to prevent action collapse during training, leading to better performance on tasks with image observations with no additional computational cost. Code is available at: https://github.com/cx441000319/action-collapse.

TMLR Journal 2025 Journal Article

Uniform Noise Distribution and Compact Clusters: Unveiling the Success of Self-Supervised Learning in Label Noise

  • Pengcheng Xu
  • Li Yi
  • Gezheng Xu
  • Xi Chen
  • Ian McLeod
  • Charles Ling
  • Boyu Wang

Label noise is ubiquitous in real-world datasets, posing significant challenges to machine learning models. While self-supervised learning (SSL) algorithms have empirically demonstrated effectiveness in learning noisy labels, the theoretical understanding of their effectiveness remains underexplored. In this paper, we present a theoretical framework to understand how SSL methods enhance learning with noisy labels, especially for the instance-dependent label noise. We reveal that the uniform and compact cluster structures induced by contrastive SSL play a crucial role in mitigating the adverse effects of label noise. Specifically, we theoretically show that a classifier trained on SSL-learned representations significantly outperforms one trained using traditional supervised learning methods. This results from two key merits of SSL representations over label noise: 1. Uniform Noise Distribution: Label noise becomes uniformly distributed over SSL representations with respect to the true class labels, rather than the noisy ones, leading to an easier learning task. 2. Enhanced Cluster Structure: SSL enhances the formation of well-separated and compact categorical clusters, increasing inter-class distances while tightening intra-class clusters. We further theoretically justify the benefits of training a classifier on such structured representations, demonstrating that it encourages the classifier trained on noisy data to be aligned with the optimal classifier. Extensive experiments validate the robustness of SSL representations in combating label noise, confirming the practical values of our theoretical findings.

NeurIPS Conference 2025 Conference Paper

Unifying Text Semantics and Graph Structures for Temporal Text-attributed Graphs with Large Language Models

  • Siwei Zhang
  • Yun Xiong
  • Yateng Tang
  • Jiarong Xu
  • Xi Chen
  • Zehao Gu
  • Xuehao Zheng
  • Zi'an Jia

Temporal graph neural networks (TGNNs) have shown remarkable performance in temporal graph modeling. However, real-world temporal graphs often possess rich textual information, giving rise to temporal text-attributed graphs (TTAGs). Such combination of dynamic text semantics and evolving graph structures introduces heightened complexity. Existing TGNNs embed texts statically and rely heavily on encoding mechanisms that biasedly prioritize structural information, overlooking the temporal evolution of text semantics and the essential interplay between semantics and structures for synergistic reinforcement. To tackle these issues, we present CROSS, a flexible framework that seamlessly extends existing TGNNs for TTAG modeling. CROSS is designed by decomposing the TTAG modeling process into two phases: (i) temporal semantics extraction; and (ii) semantic-structural information unification. The key idea is to advance the large language models (LLMs) to dynamically extract the temporal semantics in text space and then generate cohesive representations unifying both semantics and structures. Specifically, we propose a Temporal Semantics Extractor in the CROSS framework, which empowers LLMs to offer the temporal semantic understanding of node's evolving contexts of textual neighborhoods, facilitating semantic dynamics. Subsequently, we introduce the Semantic-structural Co-encoder, which collaborates with the above Extractor for synthesizing illuminating representations by jointly considering both semantic and structural information while encouraging their mutual reinforcement. Extensive experiments show that CROSS achieves state-of-the-art results on four public datasets and one industrial dataset, with 24.7% absolute MRR gain on average in temporal link prediction and 3.7% AUC gain in node classification of industrial application.

NeurIPS Conference 2025 Conference Paper

Unleashing Diffusion Transformers for Visual Correspondence by Modulating Massive Activations

  • Chaofan Gan
  • Yuanpeng Tu
  • Xi Chen
  • Tieyuan Chen
  • Yuxi Li
  • Mehrtash Harandi
  • Weiyao Lin

Pre-trained stable diffusion models (SD) have shown great advances in visual correspondence. In this paper, we investigate the capabilities of Diffusion Transformers (DiTs) for accurate dense correspondence. Distinct from SD, DiTs exhibit a critical phenomenon in which very few feature activations exhibit significantly larger values than others, known as massive activations, leading to uninformative representations and significant performance degradation for DiTs. The massive activations consistently concentrate at very few fixed dimensions across all image patch tokens, holding little local information. We analyze these dimension-concentrated massive activations and uncover that their concentration is inherently linked to the Adaptive Layer Normalization (AdaLN) in DiTs. Building on these findings, we propose the Diffusion Transformer Feature (DiTF), a training-free AdaLN-based framework that extracts semantically discriminative features from DiTs. Specifically, DiTF leverages AdaLN to adaptively localize and normalize massive activations through channel-wise modulation. Furthermore, a channel discard strategy is introduced to mitigate the adverse effects of massive activations. Experimental results demonstrate that our DiTF outperforms both DINO and SD-based models and establishes a new state-of-the-art performance for DiTs in different visual correspondence tasks (e.g., with +9.4% on SPair-71k and +4.4% on AP-10K-C.S.).

AAAI Conference 2025 Conference Paper

VTG-LLM: Integrating Timestamp Knowledge into Video LLMs for Enhanced Video Temporal Grounding

  • Yongxin Guo
  • Jingyu Liu
  • Mingda Li
  • Dingxin Cheng
  • Xiaoying Tang
  • Dianbo Sui
  • Qingbin Liu
  • Xi Chen

Video Temporal Grounding (VTG) strives to accurately pinpoint event timestamps in a specific video using linguistic queries, significantly impacting downstream tasks like video browsing and editing. Unlike traditional task-specific models, Video Large Language Models (video LLMs) can handle multiple tasks concurrently in a zero-shot manner. Consequently, exploring the application of video LLMs for VTG tasks has become a burgeoning research area. However, despite considerable advancements in video content understanding, video LLMs often struggle to accurately pinpoint timestamps within videos, limiting their effectiveness in VTG tasks. To address this, we introduce VTG-LLM, a model designed to enhance video LLMs' timestamp localization abilities. Our approach includes: (1) effectively integrating timestamp knowledge into visual tokens; (2) incorporating absolute-time tokens to manage timestamp knowledge without concept shifts; and (3) introducing a lightweight, high-performance, slot-based token compression technique designed to accommodate the demands of a large number of frames to be sampled for VTG tasks. Additionally, we present VTG-IT-120K, a collection of publicly available VTG datasets that we have re-annotated to improve upon low-quality annotations. Our comprehensive experiments demonstrate the superior performance of VTG-LLM in comparison to other video LLM methods across a variety of VTG tasks.

NeurIPS Conference 2025 Conference Paper

Zero-shot Denoising via Neural Compression: Theoretical and algorithmic framework

  • Ali Zafari
  • Xi Chen
  • Shirin Jalali

Zero-shot denoising aims to denoise observations without access to training samples or clean reference images. This setting is particularly relevant in practical imaging scenarios involving specialized domains such as medical imaging or biology. In this work, we propose the Zero-Shot Neural Compression Denoiser (ZS-NCD), a novel denoising framework based on neural compression. ZS-NCD treats a neural compression network as an untrained model, optimized directly on patches extracted from a single noisy image. The final reconstruction is then obtained by aggregating the outputs of the trained model over overlapping patches. Thanks to the built-in entropy constraints of compression architectures, our method naturally avoids overfitting and does not require manual regularization or early stopping. Through extensive experiments, we show that ZS-NCD achieves state-of-the-art performance among zero-shot denoisers for both Gaussian and Poisson noise, and generalizes well to both natural and non-natural images. Additionally, we provide new finite-sample theoretical results that characterize upper bounds on the achievable reconstruction error of general maximum-likelihood compression-based denoisers. These results further establish the theoretical foundations of compression-based denoising. Our code is available at: https://github.com/Computational-Imaging-RU/ZS-NCDenoiser.

TMLR Journal 2024 Journal Article

3D Molecular Generation via Virtual Dynamics

  • Shuqi Lu
  • Lin Yao
  • Xi Chen
  • Hang Zheng
  • Di He
  • Guolin Ke

Structure-based drug design, a critical aspect of drug discovery, aims to identify high-affinity molecules for target protein pockets. Traditional virtual screening methods, which involve exhaustive searches within large molecular databases, are inefficient and limited in discovering novel molecules. The pocket-based 3D molecular generation model offers a promising alternative by directly generating molecules with 3D structures and binding positions in the pocket. In this paper, we present VD-Gen, a novel pocket-based 3D molecular generation pipeline. VD-Gen features a series of carefully designed stages to generate fine-grained 3D molecules with binding positions in the pocket cavity end-to-end. Rather than directly generating or sampling atoms with 3D positions in the pocket, VD-Gen randomly initializes multiple virtual particles within the pocket and learns to iteratively move them to approximate the distribution of molecular atoms in 3D space. After the iterative movement, a 3D molecule is extracted and further refined through additional iterative movement, yielding a high-quality 3D molecule with a confidence score. Comprehensive experimental results on pocket-based molecular generation demonstrate that VD-Gen can generate novel 3D molecules that fill the target pocket cavity with high binding affinities, significantly outperforming previous baselines.

ICML Conference 2024 Conference Paper

Bagged Deep Image Prior for Recovering Images in the Presence of Speckle Noise

  • Xi Chen
  • Zhewen Hou
  • Christopher A. Metzler
  • Arian Maleki
  • Shirin Jalali

We investigate both the theoretical and algorithmic aspects of likelihood-based methods for recovering a complex-valued signal from multiple sets of measurements, referred to as looks, affected by speckle (multiplicative) noise. Our theoretical contributions include establishing the first existing theoretical upper bound on the Mean Squared Error (MSE) of the maximum likelihood estimator under the deep image prior hypothesis. Our theoretical results capture the dependence of MSE upon the number of parameters in the deep image prior, the number of looks, the signal dimension, and the number of measurements per look. On the algorithmic side, we introduce the concept of bagged Deep Image Priors (Bagged-DIP) and integrate them with projected gradient descent. Furthermore, we show how employing the Newton-Schulz algorithm for calculating matrix inverses within the iterations of PGD reduces the computational complexity of the algorithm. We show that this method achieves state-of-the-art performance.
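The Newton-Schulz trick mentioned in the abstract replaces an explicit matrix inverse inside each PGD iteration with a short multiplication-only fixed-point iteration. As a minimal illustrative sketch (the seeding and iteration count below are standard textbook choices, not details taken from the paper):

```python
def newton_schulz_inverse(a, iters=30):
    """Approximate the inverse of a square matrix A via Newton-Schulz:
        X_{k+1} = X_k (2I - A X_k),
    which converges quadratically when the spectral radius of (I - A X_0)
    is below 1. The seed X_0 = A^T / (||A||_1 * ||A||_inf) is a standard
    choice that guarantees convergence for nonsingular A.
    """
    n = len(a)

    def matmul(x, y):
        return [[sum(x[i][k] * y[k][j] for k in range(n)) for j in range(n)]
                for i in range(n)]

    # Matrix 1-norm (max column sum) and inf-norm (max row sum) for the seed.
    norm1 = max(sum(abs(a[i][j]) for i in range(n)) for j in range(n))
    norminf = max(sum(abs(a[i][j]) for j in range(n)) for i in range(n))
    x = [[a[j][i] / (norm1 * norminf) for j in range(n)] for i in range(n)]

    for _ in range(iters):
        ax = matmul(a, x)
        # Form (2I - A X) and update X <- X (2I - A X).
        residual = [[(2.0 if i == j else 0.0) - ax[i][j] for j in range(n)]
                    for i in range(n)]
        x = matmul(x, residual)
    return x
```

Because the update uses only matrix multiplications, it maps well onto GPU-friendly iterative solvers, which is the kind of cost saving the abstract alludes to.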

TMLR Journal 2024 Journal Article

Beyond Loss Functions: Exploring Data-Centric Approaches with Diffusion Model for Domain Generalization

  • Sobhan Hemati
  • Mahdi Beitollahi
  • Amir Hossein Estiri
  • Bassel Al Omari
  • Soufiane Lamghari
  • Yasser H. Khalil
  • Xi Chen
  • Guojun Zhang

There has been a huge effort to tackle the Domain Generalization (DG) problem with a focus on developing new loss functions. Inspired by the image generation capabilities of the diffusion models, we pose a pivotal question: Can diffusion models function as data augmentation tools to address DG from a data-centric perspective, rather than relying on the loss functions? Our findings reveal that trivial cross-domain data augmentation (CDGA) along with the vanilla ERM using readily available diffusion models without additional finetuning outperforms state-of-the-art (SOTA) training algorithms. This paper delves into the exploration of why and how this rudimentary data generation can outperform complicated DG algorithms. With the help of domain shift quantification tools, we empirically show that CDGA reduces the domain shift between domains. We empirically reveal connections between the loss landscape, adversarial robustness, and data generation, illustrating that CDGA reduces loss sharpness and improves robustness against adversarial shifts in data. Additionally, we discuss our intuitions that CDGA along with ERM can be considered as a way to replace the pointwise kernel estimates in ERM with new density estimates in the vicinity of domain pairs which can diminish the true data estimation error of ERM under domain shift scenario. These insights advocate for further investigation into the potential of data-centric approaches in DG.

AAAI Conference 2024 Conference Paper

Calibrated One Round Federated Learning with Bayesian Inference in the Predictive Space

  • Mohsin Hasan
  • Guojun Zhang
  • Kaiyang Guo
  • Xi Chen
  • Pascal Poupart

Federated Learning (FL) involves training a model over a dataset distributed among clients, with the constraint that each client’s dataset is localized and possibly heterogeneous. In FL, small and noisy datasets are common, highlighting the need for well-calibrated models that represent the uncertainty of predictions. The closest FL techniques to achieving such goals are the Bayesian FL methods which collect parameter samples from local posteriors, and aggregate them to approximate the global posterior. To improve scalability for larger models, one common Bayesian approach is to approximate the global predictive posterior by multiplying local predictive posteriors. In this work, we demonstrate that this method gives systematically overconfident predictions, and we remedy this by proposing β-Predictive Bayes, a Bayesian FL algorithm that interpolates between a mixture and product of the predictive posteriors, using a tunable parameter β. This parameter is tuned to improve the global ensemble’s calibration, before it is distilled to a single model. Our method is evaluated on a variety of regression and classification datasets to demonstrate its superiority in calibration to other baselines, even as data heterogeneity increases. Code available at https://github.com/hasanmohsin/betaPredBayesFL. Our paper's full version is at https://arxiv.org/abs/2312.09817.
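The abstract's central idea, interpolating between a mixture and a product of the clients' predictive posteriors with a tunable β, can be sketched for categorical predictions. The log-space interpolation below is one plausible reading for illustration only; the paper defines the exact rule and how β is tuned for calibration:

```python
import math

def beta_interpolated_posterior(local_preds, beta):
    """Combine per-client categorical predictive distributions.

    beta = 1 recovers the mixture (average) of the local posteriors,
    which tends to be diffuse; beta = 0 recovers their normalized
    product, which the abstract notes is systematically overconfident.
    Intermediate beta interpolates between the two in log space.
    Hypothetical sketch; not the paper's exact formulation.
    """
    n_classes = len(local_preds[0])
    # Mixture: class-wise average of the local distributions.
    mix = [sum(p[c] for p in local_preds) / len(local_preds)
           for c in range(n_classes)]
    # Product: class-wise product, renormalized to a distribution.
    prod = [math.prod(p[c] for p in local_preds) for c in range(n_classes)]
    z = sum(prod)
    prod = [v / z for v in prod]
    # Geometric interpolation between mixture and product, renormalized.
    log_comb = [beta * math.log(m) + (1 - beta) * math.log(q)
                for m, q in zip(mix, prod)]
    comb = [math.exp(v) for v in log_comb]
    z = sum(comb)
    return [v / z for v in comb]
```

For example, with two clients predicting [0.7, 0.3] and [0.6, 0.4], β = 1 returns the mixture [0.65, 0.35], while β = 0 returns the sharper normalized product.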

TMLR Journal 2024 Journal Article

DFML: Decentralized Federated Mutual Learning

  • Yasser H. Khalil
  • Amir Hossein Estiri
  • Mahdi Beitollahi
  • Nader Asadi
  • Sobhan Hemati
  • Xu Li
  • Guojun Zhang
  • Xi Chen

In the realm of real-world devices, centralized servers in Federated Learning (FL) present challenges including communication bottlenecks and susceptibility to a single point of failure. Additionally, contemporary devices inherently exhibit model and data heterogeneity. Existing work lacks a Decentralized FL (DFL) framework capable of accommodating such heterogeneity without imposing architectural restrictions or assuming the availability of additional data. To address these issues, we propose a Decentralized Federated Mutual Learning (DFML) framework that is serverless, supports nonrestrictive heterogeneous models, and avoids reliance on additional data. DFML effectively handles model and data heterogeneity through mutual learning, which distills knowledge between clients, and cyclically varying the amount of supervision and distillation signals. Extensive experimental results demonstrate consistent effectiveness of DFML in both convergence speed and global accuracy, outperforming prevalent baselines under various conditions. For example, with the CIFAR-100 dataset and 50 clients, DFML achieves a substantial increase of +17.20% and +19.95% in global accuracy under Independent and Identically Distributed (IID) and non-IID data shifts, respectively.

AAAI Conference 2024 Conference Paper

Dual-Window Multiscale Transformer for Hyperspectral Snapshot Compressive Imaging

  • Fulin Luo
  • Xi Chen
  • Xiuwen Gong
  • Weiwen Wu
  • Tan Guo

The coded aperture snapshot spectral imaging (CASSI) system is an effective approach to hyperspectral snapshot compressive imaging. The core issue of CASSI is to solve the inverse problem of reconstructing the hyperspectral image (HSI). In recent years, Transformer-based methods have achieved promising performance in HSI reconstruction. However, capturing both long-range dependencies and local information while ensuring reasonable computational costs remains a challenging problem. In this paper, we propose a Transformer-based HSI reconstruction method called dual-window multiscale Transformer (DWMT), which is a coarse-to-fine process, reconstructing the global properties of HSI with long-range dependencies. In our method, we propose a novel U-Net architecture using a dual-branch encoder to refine pixel information and full-scale skip connections to fuse different features, enhancing the extraction of fine-grained features. Meanwhile, we design a novel self-attention mechanism called dual-window multiscale multi-head self-attention (DWM-MSA), which utilizes two windows of different sizes to compute self-attention, capturing long-range dependencies in a local region at different scales to improve the reconstruction performance. We also propose a novel position embedding method for Transformer, named con-abs position embedding (CAPE), which effectively enhances the positional information of the HSIs. Extensive experiments on both simulated and real data are conducted to demonstrate the superior performance, stability, and generalization ability of our DWMT. Code of this project is at https://github.com/chenx2000/DWMT.

AAAI Conference 2024 Conference Paper

Editing Language Model-Based Knowledge Graph Embeddings

  • Siyuan Cheng
  • Ningyu Zhang
  • Bozhong Tian
  • Xi Chen
  • Qingbin Liu
  • Huajun Chen

Recent decades have witnessed the empirical success of framing Knowledge Graph (KG) embeddings via language models. However, language model-based KG embeddings are usually deployed as static artifacts, making them difficult to modify after deployment without re-training. To address this issue, we propose a new task of editing language model-based KG embeddings in this paper. This task is designed to facilitate rapid, data-efficient updates to KG embeddings without compromising the performance of other aspects. We build four new datasets: E-FB15k237, A-FB15k237, E-WN18RR, and A-WN18RR, and evaluate several knowledge editing baselines, demonstrating the limited ability of previous models to handle the proposed challenging task. We further propose a simple yet strong baseline dubbed KGEditor, which utilizes additional parametric layers of the hypernetwork to edit/add facts. Our comprehensive experimental results reveal that KGEditor excels in updating specific facts without impacting the overall performance, even when faced with limited training resources. Code and datasets will be available at https://github.com/AnonymousForPapers/DeltaKG.

AAAI Conference 2024 Conference Paper

Exploiting Symmetric Temporally Sparse BPTT for Efficient RNN Training

  • Xi Chen
  • Chang Gao
  • Zuowen Wang
  • Longbiao Cheng
  • Sheng Zhou
  • Shih-Chii Liu
  • Tobi Delbruck

Recurrent Neural Networks (RNNs) are useful in temporal sequence tasks. However, training RNNs involves dense matrix multiplications which require hardware that can support a large number of arithmetic operations and memory accesses. Implementing online training of RNNs on the edge calls for optimized algorithms for an efficient deployment on hardware. Inspired by the spiking neuron model, the Delta RNN exploits temporal sparsity during inference by skipping over the update of hidden states from those inactivated neurons whose change of activation across two timesteps is below a defined threshold. This work describes a training algorithm for Delta RNNs that exploits temporal sparsity in the backward propagation phase to reduce computational requirements for training on the edge. Due to the symmetric computation graphs of forward and backward propagation during training, the gradient computation of inactivated neurons can be skipped. Results show a reduction of ∼80% in matrix operations for training a 56k parameter Delta LSTM on the Fluent Speech Commands dataset with negligible accuracy loss. Logic simulations of a hardware accelerator designed for the training algorithm show 2-10X speedup in matrix computations for an activation sparsity range of 50%-90%. Additionally, we show that the proposed Delta RNN training will be useful for online incremental learning on edge devices with limited computing resources.

IS Journal 2024 Journal Article

Exploring alterations of brain networks of AD patients using WTC method

  • Li Yapeng
  • Yuanyuan Qin
  • Xi Chen
  • Wei Li

Objective: To explore the influence of different frequency bands on the preprocessing of resting-state fMRI datasets used by the Wavelet Transform Coherence (WTC) method, and to study changes in the functional brain networks of AD patients. Method: Resting-state fMRI datasets of 10 AD patients and 11 healthy controls were collected in this study, and time series of 90 brain regions defined by AAL (Automated Anatomical Labeling) were extracted after preprocessing. Wavelet transformation was performed on each time series, and a functional brain network was established at different frequencies (0.125 Hz, 0.0625 Hz) using the WTC method. The topology parameters of the networks, including global efficiency, clustering coefficient, average shortest path length, and small-world property, were calculated and averaged within each group. Result: The results imply that there are significant differences in the topology parameters of networks at different frequencies. Likewise, statistical analysis of the topology parameters of AD and HC (Healthy Controls) shows that the global efficiency, clustering coefficient, and small-world properties of AD all decreased by varying degrees, while the average shortest path length of AD remained longer. Conclusion: Our research provides a theoretical basis for the choice of filter bands for data preprocessing in functional magnetic resonance imaging. The findings may serve as indicators for early diagnosis of AD patients.

TMLR Journal 2024 Journal Article

GOPlan: Goal-conditioned Offline Reinforcement Learning by Planning with Learned Models

  • Mianchu Wang
  • Rui Yang
  • Xi Chen
  • Hao Sun
  • Meng Fang
  • Giovanni Montana

Offline Goal-Conditioned RL (GCRL) offers a feasible paradigm for learning general-purpose policies from diverse and multi-task offline datasets. Despite notable recent progress, the predominant offline GCRL methods, mainly model-free, face constraints in handling limited data and generalizing to unseen goals. In this work, we propose Goal-conditioned Offline Planning (GOPlan), a novel model-based framework that contains two key phases: (1) pretraining a prior policy capable of capturing the multi-modal action distribution within the multi-goal dataset; (2) employing the reanalysis method with planning to generate imagined trajectories for finetuning policies. Specifically, we base the prior policy on an advantage-weighted conditioned generative adversarial network, which facilitates distinct mode separation, mitigating the pitfalls of out-of-distribution (OOD) actions. For further policy optimization, the reanalysis method generates high-quality imagined data by planning with learned models for both intra-trajectory and inter-trajectory goals. With thorough experimental evaluations, we demonstrate that GOPlan achieves state-of-the-art performance on various offline multi-goal navigation and manipulation tasks. Moreover, our results highlight the superior ability of GOPlan to handle small data budgets and generalize to OOD goals.

IJCAI Conference 2024 Conference Paper

InstructEdit: Instruction-Based Knowledge Editing for Large Language Models

  • Ningyu Zhang
  • Bozhong Tian
  • Siyuan Cheng
  • Xiaozhuan Liang
  • Yi Hu
  • Kouying Xue
  • Yanjie Gou
  • Xi Chen

Knowledge editing for large language models can offer an efficient solution to alter a model’s behavior without negatively impacting the overall performance. However, current approaches encounter issues with limited generalizability across tasks, necessitating one distinct editor for each task, which significantly hinders broader applications. To address this, we take the first step to analyze the multi-task generalization issue in knowledge editing. Specifically, we develop an instruction-based editing technique, termed InstructEdit, which facilitates the editor's adaptation to various tasks simultaneously using simple instructions. With only one unified editor for each LLM, we empirically demonstrate that InstructEdit can improve the editor's control, leading to an average 14.86% increase in Reliability in the multi-task editing setting. Furthermore, experiments involving holdout unseen tasks illustrate that InstructEdit consistently surpasses previous strong baselines. To further investigate the underlying mechanisms of instruction-based knowledge editing, we analyze the principal components of the editing gradient directions, which unveils that instructions can help control the optimization direction with stronger OOD generalization.

NeurIPS Conference 2024 Conference Paper

Job-SDF: A Multi-Granularity Dataset for Job Skill Demand Forecasting and Benchmarking

  • Xi Chen
  • Chuan Qin
  • Chuyu Fang
  • Chao Wang
  • Chen Zhu
  • Fuzhen Zhuang
  • Hengshu Zhu
  • Hui Xiong

In a rapidly evolving job market, skill demand forecasting is crucial as it enables policymakers and businesses to anticipate and adapt to changes, ensuring that workforce skills align with market needs, thereby enhancing productivity and competitiveness. Additionally, by identifying emerging skill requirements, it directs individuals towards relevant training and education opportunities, promoting continuous self-learning and development. However, the absence of comprehensive datasets presents a significant challenge, impeding research and the advancement of this field. To bridge this gap, we present Job-SDF, a dataset designed to train and benchmark job-skill demand forecasting models. Based on millions of public job advertisements collected from online recruitment platforms, this dataset encompasses monthly recruitment demand. Our dataset uniquely enables evaluating skill demand forecasting models at various granularities, including occupation, company, and regional levels. We benchmark a range of models on this dataset, evaluating their performance in standard scenarios, in predictions focused on lower value ranges, and in the presence of structural breaks, providing new insights for further research. Our code and dataset are publicly accessible via https://github.com/Job-SDF/benchmark.

ICAPS Conference 2024 Conference Paper

More Flexible Proximity Wildcards Path Planning with Compressed Path Databases

  • Xi Chen
  • Yue Zhang
  • Yonggang Zhang

Grid-based path planning is one of the classic problems in AI, and a popular topic in application areas such as computer games and robotics. Compressed Path Databases (CPDs) are recognized as a state-of-the-art method for grid-based path planning: they can find an optimal path extremely fast without state-space search. In recent years, researchers have tended to focus on improving CPDs by reducing CPD size or improving search performance. Among various methods, proximity wildcards are one of the most proven improvements for reducing the size of CPDs. However, the proximity area is heavily restricted by complex terrain, which significantly affects pathfinding efficiency and causes additional costs. In this paper, we enhance CPDs from the perspective of improving search efficiency and reducing search costs. Our work focuses on using more flexible methods to obtain larger proximity areas, so that more heuristic information can be used to improve search performance. Experiments conducted on the Grid-Based Path Planning Competition (GPPC) benchmarks demonstrate that the two proposed methods can effectively improve search efficiency and reduce search costs by up to 3 orders of magnitude. Remarkably, our methods can further reduce the storage cost and improve the compression capability of CPDs simultaneously.

ICRA Conference 2024 Conference Paper

Open X-Embodiment: Robotic Learning Datasets and RT-X Models: Open X-Embodiment Collaboration

  • Abby O'Neill
  • Abdul Rehman
  • Abhiram Maddukuri
  • Abhishek Gupta 0004
  • Abhishek Padalkar
  • Abraham Lee
  • Acorn Pooley
  • Agrim Gupta

Large, high-capacity models trained on diverse datasets have shown remarkable successes in efficiently tackling downstream applications. In domains from NLP to Computer Vision, this has led to a consolidation of pretrained models, with general pretrained backbones serving as a starting point for many applications. Can such a consolidation happen in robotics? Conventionally, robotic learning methods train a separate model for every application, every robot, and even every environment. Can we instead train a "generalist" X-robot policy that can be adapted efficiently to new robots, tasks, and environments? In this paper, we provide datasets in standardized data formats and models to make it possible to explore this possibility in the context of robotic manipulation, alongside experimental results that provide an example of effective X-robot policies. We assemble a dataset from 22 different robots collected through a collaboration between 21 institutions, demonstrating 527 skills (160266 tasks). We show that a high-capacity model trained on this data, which we call RT-X, exhibits positive transfer and improves the capabilities of multiple robots by leveraging experience from other platforms. The project website is robotics-transformer-x.github.io.

ICLR Conference 2024 Conference Paper

PolyVoice: Language Models for Speech to Speech Translation

  • Qianqian Dong
  • Zhiying Huang
  • Qi Tian 0001
  • Chen Xu 0008
  • Tom Ko
  • Yunlong Zhao 0004
  • Siyuan Feng
  • Tang Li 0001

With the huge success of GPT models in natural language processing, there is a growing interest in applying language modeling approaches to speech tasks. Currently, the dominant architecture in speech-to-speech translation (S2ST) remains the encoder-decoder paradigm, creating a need to investigate the impact of language modeling approaches in this area. In this study, we introduce PolyVoice, a language model-based framework designed for S2ST systems. Our framework comprises three decoder-only language models: a translation language model, a duration language model, and a speech synthesis language model. These language models employ different types of prompts to extract learned information effectively. By utilizing unsupervised semantic units, our framework can transfer semantic information across these models, making it applicable even to unwritten languages. We evaluate our system on Chinese→English and English→Spanish language pairs. Experimental results demonstrate that PolyVoice outperforms the state-of-the-art encoder-decoder model, producing voice-cloned speech with high translation and audio quality. Speech samples are available at https://polyvoice.github.io.

IJCAI Conference 2024 Conference Paper

Pre-DyGAE: Pre-training Enhanced Dynamic Graph Autoencoder for Occupational Skill Demand Forecasting

  • Xi Chen
  • Chuan Qin
  • Zhigaoyuan Wang
  • Yihang Cheng
  • Chao Wang
  • Hengshu Zhu
  • Hui Xiong

Occupational skill demand (OSD) forecasting seeks to predict dynamic skill demand specific to occupations, which is beneficial for employees and employers to grasp occupational nature and maintain a competitive edge in the rapidly evolving labor market. Although recent research has proposed data-driven techniques for forecasting skill demand, the focus has remained predominantly on overall trends rather than occupational granularity. In this paper, we propose a novel Pre-training Enhanced Dynamic Graph Autoencoder (Pre-DyGAE), forecasting skill demand from an occupational perspective. Specifically, we aggregate job descriptions (JDs) by occupation and segment them into several timestamps. Subsequently, in the initial timestamps, we pre-train a graph autoencoder (GAE), consisting of a semantically-aware cross-attention enhanced uncertainty-aware encoder and decoders for link prediction and edge regression, to achieve graph reconstruction. In particular, we utilize contrastive learning on skill co-occurrence clusters to address data sparsity, and a unified Tweedie and ranking loss for predicting the imbalanced distribution. Afterward, we incorporate an adaptive temporal encoding unit and a temporal shift module into the GAE to achieve a dynamic GAE (DyGAE). Furthermore, we fine-tune the DyGAE with a two-stage optimization strategy and infer future representations. Extensive experiments on four real-world datasets validate the effectiveness of Pre-DyGAE compared with state-of-the-art baselines.

ICML Conference 2024 Conference Paper

Rethinking Generative Large Language Model Evaluation for Semantic Comprehension

  • Fangyun Wei
  • Xi Chen
  • Lin Luo

Despite their sophisticated capabilities, large language models (LLMs) encounter a major hurdle in effective assessment. This paper first revisits the prevalent evaluation method, multiple choice question answering (MCQA), which allows for straightforward accuracy measurement. Through a comprehensive evaluation of 24 models across 11 benchmarks, we highlight several potential drawbacks of MCQA, for instance, the inconsistency between the MCQA evaluation and the generation of open-ended responses in practical scenarios. In response, we introduce an RWQ-Elo rating system, engaging 24 LLMs such as GPT-4, GPT-3.5, Google-Gemini-Pro and LLaMA-1/-2, in a two-player competitive format, with GPT-4 serving as the judge. Each LLM receives an Elo rating thereafter. This system is designed to mirror real-world usage, and for this purpose, we have compiled a new benchmark called “Real-world questions” (RWQ), comprising 20,772 authentic user inquiries. Additionally, we thoroughly analyze the characteristics of our system and compare it with prior leaderboards like Alpaca Eval and MT-Bench. Our analysis reveals the stability of our RWQ-Elo system, the feasibility of registering new models, and its potential to reshape LLM leaderboards.

NeurIPS Conference 2024 Conference Paper

Slicing Vision Transformer for Flexible Inference

  • Yitian Zhang
  • Huseyin Coskun
  • Xu Ma
  • Huan Wang
  • Ke Ma
  • Xi Chen
  • Derek H. Hu
  • Yun Fu

Vision Transformers (ViTs) are known for their scalability. In this work, we aim to scale down a ViT to fit in an environment with dynamically changing resource constraints. We observe that smaller ViTs are intrinsically the sub-networks of a larger ViT with different widths. Thus, we propose a general framework, named Scala, to enable a single network to represent multiple smaller ViTs with flexible inference capability, which aligns with the inherent design of ViT to vary in width. Concretely, Scala activates several subnets during training, introduces Isolated Activation to disentangle the smallest sub-network from other subnets, and leverages Scale Coordination to ensure each sub-network receives simplified, steady, and accurate learning objectives. Comprehensive empirical validations on different tasks demonstrate that with only one-shot training, Scala learns slimmable representation without modifying the original ViT structure and matches the performance of Separate Training. Compared with the prior art, Scala achieves an average improvement of 1.6% on ImageNet-1K with fewer parameters.

ICLR Conference 2024 Conference Paper

Stylized Offline Reinforcement Learning: Extracting Diverse High-Quality Behaviors from Heterogeneous Datasets

  • Yihuan Mao
  • Chengjie Wu
  • Xi Chen
  • Hao Hu 0006
  • Ji Jiang
  • Tianze Zhou
  • Tangjie Lv
  • Changjie Fan

Previous literature on policy diversity in reinforcement learning (RL) either focuses on the online setting or ignores the policy performance. In contrast, offline RL, which aims to learn high-quality policies from batched data, has yet to fully leverage the intrinsic diversity of the offline dataset. Addressing this dichotomy and aiming to balance quality and diversity poses a significant challenge to extant methodologies. This paper introduces a novel approach, termed Stylized Offline RL (SORL), which is designed to extract high-performing, stylistically diverse policies from a dataset characterized by distinct behavioral patterns. Drawing inspiration from the venerable Expectation-Maximization (EM) algorithm, SORL innovatively alternates between policy learning and trajectory clustering, a mechanism that promotes policy diversification. To further augment policy performance, we introduce advantage-weighted style learning into the SORL framework. Experimental evaluations across multiple environments demonstrate the significant superiority of SORL over previous methods in extracting high-quality policies with diverse behaviors. A case in point is that SORL successfully learns strong policies with markedly distinct playing patterns from a real-world human dataset of a popular basketball video game "Dunk City Dynasty."

NeurIPS Conference 2024 Conference Paper

SyncVIS: Synchronized Video Instance Segmentation

  • Rongkun Zheng
  • Lu Qi
  • Xi Chen
  • Yi Wang
  • Kun Wang
  • Yu Qiao
  • Hengshuang Zhao

Recent DETR-based methods have advanced the development of Video Instance Segmentation (VIS) through transformers' efficiency and capability in modeling spatial and temporal information. Despite harvesting remarkable progress, existing works follow asynchronous designs, which model video sequences via either video-level queries only or adopting query-sensitive cascade structures, resulting in difficulties when handling complex and challenging video scenarios. In this work, we analyze the cause of this phenomenon and the limitations of the current solutions, and propose to conduct synchronized modeling via a new framework named SyncVIS. Specifically, SyncVIS explicitly introduces video-level query embeddings and designs two key modules to synchronize video-level query with frame-level query embeddings: a synchronized video-frame modeling paradigm and a synchronized embedding optimization strategy. The former attempts to promote the mutual learning of frame- and video-level embeddings with each other and the latter divides large video sequences into small clips for easier optimization. Extensive experimental evaluations are conducted on the challenging YouTube-VIS 2019 & 2021 & 2022, and OVIS benchmarks, and SyncVIS achieves state-of-the-art results, which demonstrates the effectiveness and generality of the proposed approach. The code is available at https://github.com/rkzheng99/SyncVIS.

IS Journal 2024 Journal Article

UCRI: A Unified Conversational Recommender System Based on Item-Guided Conditional Generation

  • Xi Chen
  • Yuehai Wang
  • Jianyi Yang

In recent years, great efforts have been made to develop a conversational recommender system (CRS). However, existing works always ignore the incorporation of the recommended items and the generated replies. This causes the performance of the recommendation to degrade in the conversations. To solve this problem, we propose a novel framework called unified conversational recommender system based on item-guided conditional generation (UCRI) to fuse the recommender module and the dialogue module seamlessly. UCRI captures the semantic similarity between the recommended items and the candidate words to realize the item-guided conditional generation. Besides, we further design the weight control mechanism and the recommender gating mechanism to make accurate recommendations in the conversations. Our approach can explicitly generate the recommended items in the replies and encourage the model to generate the related context for the items. Extensive experiments on the benchmark dataset REcommendations through DIALog show that our model achieves the best performance on both item recommendation and reply generation tasks.

TMLR Journal 2024 Journal Article

Understanding the Role of Layer Normalization in Label-Skewed Federated Learning

  • Guojun Zhang
  • Mahdi Beitollahi
  • Alex Bie
  • Xi Chen

Layer normalization (LN) is a widely adopted deep learning technique especially in the era of foundation models. Recently, LN has been shown to be surprisingly effective in federated learning (FL) with non-i.i.d. data. However, exactly why and how it works remains mysterious. In this work, we reveal the profound connection between layer normalization and the label shift problem in federated learning. To understand layer normalization better in FL, we identify the key contributing mechanism of normalization methods in FL, called feature normalization (FN), which applies normalization to the latent feature representation before the classifier head. Although LN and FN do not improve expressive power, they control feature collapse and local overfitting to heavily skewed datasets, and thus accelerates global training. Empirically, we show that normalization leads to drastic improvements on standard benchmarks under extreme label shift. Moreover, we conduct extensive ablation studies to understand the critical factors of layer normalization in FL. Our results verify that FN is an essential ingredient inside LN to significantly improve the convergence of FL while remaining robust to learning rate choices, especially under extreme label shift where each client has access to few classes.

ICML Conference 2024 Conference Paper

Understanding the Training Speedup from Sampling with Approximate Losses

  • Rudrajit Das
  • Xi Chen
  • Bertram Ieong
  • Parikshit Bansal
  • Sujay Sanghavi

It is well known that selecting samples with large losses/gradients can significantly reduce the number of training steps. However, the selection overhead is often too high to yield any meaningful gains in terms of overall training time. In this work, we focus on the greedy approach of selecting samples with large approximate losses instead of exact losses in order to reduce the selection overhead. For smooth convex losses, we show that such a greedy strategy can converge to a constant factor of the minimum value of the average loss in fewer iterations than the standard approach of random selection. We also theoretically quantify the effect of the approximation level. We then develop SIFT, which uses early exiting to obtain approximate losses with an intermediate layer’s representations for sample selection. We evaluate SIFT on the task of training a 110M-parameter, 12-layer BERT base model, and show significant gains (in terms of training hours and number of backpropagation steps) without any optimized implementation over vanilla training. For example, to reach 64% validation accuracy, SIFT with exit at the first layer takes ~43 hours compared to ~57 hours of vanilla training.

NeurIPS Conference 2024 Conference Paper

Untrained Neural Nets for Snapshot Compressive Imaging: Theory and Algorithms

  • Mengyu Zhao
  • Xi Chen
  • Xin Yuan
  • Shirin Jalali

Snapshot compressive imaging (SCI) recovers high-dimensional (3D) data cubes from a single 2D measurement, enabling diverse applications like video and hyperspectral imaging to go beyond standard techniques in terms of acquisition speed and efficiency. In this paper, we focus on SCI recovery algorithms that employ untrained neural networks (UNNs), such as deep image prior (DIP), to model source structure. Such UNN-based methods are appealing as they have the potential of avoiding the computationally intensive retraining required for different source models and different measurement scenarios. We first develop a theoretical framework for characterizing the performance of such UNN-based methods. The theoretical framework, on the one hand, enables us to optimize the parameters of data-modulating masks, and on the other hand, provides a fundamental connection between the number of data frames that can be recovered from a single measurement to the parameters of the untrained NN. We also employ the recently proposed bagged-deep-image-prior (bagged-DIP) idea to develop SCI Bagged Deep Video Prior (SCI-BDVP) algorithms that address the common challenges faced by standard UNN solutions. Our experimental results show that in video SCI our proposed solution achieves state-of-the-art among UNN methods, and in the case of noisy measurements, it even outperforms supervised solutions. Code is publicly available at https://github.com/Computational-Imaging-RU/SCI-BDVP.

AAAI Conference 2024 Conference Paper

Wavelet-Driven Spatiotemporal Predictive Learning: Bridging Frequency and Time Variations

  • Xuesong Nie
  • Yunfeng Yan
  • Siyuan Li
  • Cheng Tan
  • Xi Chen
  • Haoyuan Jin
  • Zhihang Zhu
  • Stan Z. Li

Spatiotemporal predictive learning is a paradigm that empowers models to learn spatial and temporal patterns by predicting future frames from past frames in an unsupervised manner. This method typically uses recurrent units to capture long-term dependencies, but these units often come with high computational costs and limited performance in real-world scenes. This paper presents an innovative Wavelet-based SpatioTemporal (WaST) framework, which extracts and adaptively controls both low and high-frequency components at image and feature levels via 3D discrete wavelet transform for faster processing while maintaining high-quality predictions. We propose a Time-Frequency Aware Translator uniquely crafted to efficiently learn short- and long-range spatiotemporal information by individually modeling spatial frequency and temporal variations. Meanwhile, we design a wavelet-domain High-Frequency Focal Loss that effectively supervises high-frequency variations. Extensive experiments across various real-world scenarios, such as driving scene prediction, traffic flow prediction, human motion capture, and weather forecasting, demonstrate that our proposed WaST achieves state-of-the-art performance over various spatiotemporal prediction methods.

NeurIPS Conference 2024 Conference Paper

Zero-shot Image Editing with Reference Imitation

  • Xi Chen
  • Yutong Feng
  • Mengting Chen
  • Yiyang Wang
  • Shilong Zhang
  • Yu Liu
  • Yujun Shen
  • Hengshuang Zhao

Image editing serves as a practical yet challenging task considering the diverse demands from users, where one of the hardest parts is to precisely describe how the edited image should look. In this work, we present a new form of editing, termed imitative editing, to help users exercise their creativity more conveniently. Concretely, to edit an image region of interest, users are free to directly draw inspiration from some in-the-wild references (e.g., related pictures they come across online), without having to cope with the fit between the reference and the source. Such a design requires the system to automatically figure out what to expect from the reference to perform the editing. For this purpose, we propose a generative training framework, dubbed MimicBrush, which randomly selects two frames from a video clip, masks some regions of one frame, and learns to recover the masked regions using the information from the other frame. That way, our model, developed from a diffusion prior, is able to capture the semantic correspondence between separate images in a self-supervised manner. We experimentally show the effectiveness of our method under various test cases as well as its superiority over existing alternatives. We also construct a benchmark to facilitate further research.

ICML Conference 2023 Conference Paper

2D-Shapley: A Framework for Fragmented Data Valuation

  • Zhihong Liu
  • Hoang Anh Just
  • Xiangyu Chang
  • Xi Chen
  • Ruoxi Jia 0001

Data valuation—quantifying the contribution of individual data sources to certain predictive behaviors of a model—is of great importance to enhancing the transparency of machine learning and designing incentive systems for data sharing. Existing work has focused on evaluating data sources with a shared feature or sample space. How to valuate fragmented data sources, each of which contains only partial features and samples, remains an open question. We start by presenting a method to calculate the counterfactual of removing a fragment from the aggregated data matrix. Based on the counterfactual calculation, we further propose 2D-Shapley, a theoretical framework for fragmented data valuation that uniquely satisfies some appealing axioms in the fragmented data context. 2D-Shapley empowers a range of new use cases, such as selecting useful data fragments, providing interpretation for sample-wise data values, and fine-grained data issue diagnosis.
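
As background for the fragmented 2-D setting, the classic (1-D) data Shapley value averages each source's marginal contribution over all subsets. The sketch below computes it by exact enumeration with a hypothetical toy utility function; it is not the paper's 2D-Shapley construction, which additionally handles fragments spanning both the sample and feature axes:

```python
# Exact Shapley values over data sources by subset enumeration.
# `utility` is a toy stand-in for model performance on a subset of sources.
from itertools import combinations
from math import factorial

def shapley_values(sources, utility):
    n = len(sources)
    values = {s: 0.0 for s in sources}
    for s in sources:
        others = [t for t in sources if t != s]
        for k in range(n):
            for subset in combinations(others, k):
                # Weight of a size-k coalition in the Shapley formula.
                weight = factorial(k) * factorial(n - k - 1) / factorial(n)
                values[s] += weight * (utility(set(subset) | {s}) - utility(set(subset)))
    return values

# Toy utility: each source contributes its own value; "a" and "b" synergize.
def utility(subset):
    base = {"a": 3.0, "b": 1.0, "c": 2.0}
    bonus = 2.0 if {"a", "b"} <= subset else 0.0
    return sum(base[s] for s in subset) + bonus

vals = shapley_values(["a", "b", "c"], utility)
# Efficiency axiom: the values sum to the utility of the full dataset.
assert abs(sum(vals.values()) - utility({"a", "b", "c"})) < 1e-9
```

Source "c" participates in no synergy, so its value equals its standalone contribution of 2.0, while the bonus is split between "a" and "b"; the 2-D extension must additionally define what "removing" a fragment of a matrix even means.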

JMLR Journal 2023 Journal Article

Boosting Multi-agent Reinforcement Learning via Contextual Prompting

  • Yue Deng
  • Zirui Wang
  • Xi Chen
  • Yin Zhang

Multi-agent reinforcement learning (MARL) has gained increasing attention due to its ability to enable multiple agents to learn policies simultaneously. However, the bootstrapping error arises from the difference between the estimated Q value and the real discounted return and accumulates backward through dynamic programming iterations. This error can become even larger as the number of agents increases, due to the exponential growth of agent interactions, resulting in infeasible learning time and incorrect actions during early training steps. To address this challenge, we observe that previously collected trajectories are useful contexts, model them using a contextual predictor to yield the next action and observation, and use the contextual predictor to replace the Q value function or utility function during the early training phase. Furthermore, we employ a joint-action sampling mechanism to restrict the action space and dynamically select policies from the vanilla utility network and those from the contextual trajectory predictor to perform rollout processes. By reasonably constraining the action space and rollout process, we can significantly accelerate the algorithm training process. Our framework applies to various value-based MARL methods in both centralized training decentralized execution (CTDE) and non-CTDE scenarios where agents are accessible (non-accessible) to global states during the training process. Experimental results on three tasks, Spread, Tag, and Reference, from the Particle World Environment (PWE) show that our framework significantly accelerates the training process of existing state-of-the-art CTDE and non-CTDE MARL methods, while also competing with or outperforming their original versions.

AAAI Conference 2023 Conference Paper

FreeEnricher: Enriching Face Landmarks without Additional Cost

  • Yangyu Huang
  • Xi Chen
  • Jongyoo Kim
  • Hao Yang
  • Chong Li
  • Jiaolong Yang
  • Dong Chen

Recent years have witnessed significant growth in face alignment. Though dense facial landmarks are in high demand in various scenarios, e.g., cosmetic medicine and facial beautification, most works only consider sparse face alignment. To address this problem, we present a framework that can enrich landmark density using existing sparse landmark datasets, e.g., 300W with 68 points and WFLW with 98 points. Firstly, we observe that the local patches along each semantic contour are highly similar in appearance. Then, we propose a weakly-supervised idea of learning the refinement ability on original sparse landmarks and adapting this ability to enriched dense landmarks. Meanwhile, several operators are devised and organized together to implement the idea. Finally, the trained model is applied as a plug-and-play module to existing face alignment networks. To evaluate our method, we manually label the dense landmarks on the 300W testset. Our method yields state-of-the-art accuracy not only on the newly-constructed dense 300W testset but also on the original sparse 300W and WFLW testsets without additional cost.

FOCS Conference 2023 Conference Paper

Memory-Query Tradeoffs for Randomized Convex Optimization

  • Xi Chen
  • Binghui Peng

We show that any randomized first-order algorithm which minimizes a d-dimensional, 1-Lipschitz convex function over the unit ball must either use $\Omega\left(d^{2-\delta}\right)$ bits of memory or make $\Omega\left(d^{1+\delta / 6-o(1)}\right)$ queries, for any constant $\delta \in(0, 1)$ and when the precision $\epsilon$ is quasipolynomially small in d. Our result implies that cutting plane methods, which use $\tilde{O}\left(d^{2}\right)$ bits of memory and $\tilde{O}(d)$ queries, are Pareto-optimal among randomized first-order algorithms, and quadratic memory is required to achieve optimal query complexity for convex optimization.

FOCS Conference 2023 Conference Paper

New Lower Bounds for Adaptive Tolerant Junta Testing

  • Xi Chen
  • Shyamal Patel

We prove a $k^{-\Omega\left(\log \left(\varepsilon_{2}-\varepsilon_{1}\right)\right)}$ lower bound for adaptively testing whether a Boolean function is $\varepsilon_{1}$-close to or $\varepsilon_{2}$-far from k-juntas. Our results provide the first superpolynomial separation between tolerant and non-tolerant testing for a natural property of Boolean functions under the adaptive setting. Furthermore, our techniques generalize to show that adaptively testing whether a function is $\varepsilon_{1}$-close to a k-junta or $\varepsilon_{2}$-far from $(k+o(k))$-juntas cannot be done with $\mathrm{poly}(k, (\varepsilon_{2}-\varepsilon_{1})^{-1})$ queries. This is in contrast to an algorithm by Iyer, Tal and Whitmeyer [CCC 2021] which uses $\mathrm{poly}(k, (\varepsilon_{2}-\varepsilon_{1})^{-1})$ queries to test whether a function is $\varepsilon_{1}$-close to a k-junta or $\varepsilon_{2}$-far from $O(k /(\varepsilon_{2}-\varepsilon_{1})^{2})$-juntas.

TMLR Journal 2023 Journal Article

Proportional Fairness in Federated Learning

  • Guojun Zhang
  • Saber Malekmohammadi
  • Xi Chen
  • Yaoliang Yu

With the increasingly broad deployment of federated learning (FL) systems in the real world, it is critical but challenging to ensure fairness in FL, i.e. reasonably satisfactory performances for each of the numerous diverse clients. In this work, we introduce and study a new fairness notion in FL, called proportional fairness (PF), which is based on the relative change of each client's performance. From its connection with the bargaining games, we propose PropFair, a novel and easy-to-implement algorithm for finding proportionally fair solutions in FL, and study its convergence properties. Through extensive experiments on vision and language datasets, we demonstrate that PropFair can approximately find PF solutions, and it achieves a good balance between the average performances of all clients and of the worst 10% clients.

AAAI Conference 2023 Conference Paper

Supervised Contrastive Few-Shot Learning for High-Frequency Time Series

  • Xi Chen
  • Cheng Ge
  • Ming Wang
  • Jin Wang

Significant progress has been made in representation learning, especially with the recent success of self-supervised contrastive learning. However, for time series with less intuitive or semantic meaning, sampling bias may be inevitably encountered in unsupervised approaches. Although supervised contrastive learning has shown superior performance by leveraging label information, it may also suffer from class collapse. In this study, we consider a realistic scenario in industry with limited annotation information available. A supervised contrastive framework is developed for high-frequency time series representation and classification, wherein a novel variant of the supervised contrastive loss is proposed to include multiple augmentations while inducing spread within each class. Experiments on four mainstream public datasets as well as a series of sensitivity and ablation analyses demonstrate that the learned representations are effective and robust compared with direct supervised learning and self-supervised learning, notably under the minimal few-shot situation.

NeurIPS Conference 2023 Conference Paper

TMT-VIS: Taxonomy-aware Multi-dataset Joint Training for Video Instance Segmentation

  • Rongkun Zheng
  • Lu Qi
  • Xi Chen
  • Yi Wang
  • Kun Wang
  • Yu Qiao
  • Hengshuang Zhao

Training on large-scale datasets can boost the performance of video instance segmentation, but the annotated datasets for VIS are hard to scale up due to the high labor cost. What we possess are numerous isolated field-specific datasets; thus, it is appealing to jointly train models across the aggregation of datasets to enhance data volume and diversity. However, due to the heterogeneity in category space, as mask precision increases with the data volume, simply utilizing multiple datasets will dilute the attention of models across different taxonomies. Thus, increasing the data scale and enriching the taxonomy space while improving classification precision is important. In this work, we analyze that providing extra taxonomy information can help models concentrate on specific taxonomy, and propose our model named Taxonomy-aware Multi-dataset Joint Training for Video Instance Segmentation (TMT-VIS) to address this vital challenge. Specifically, we design a two-stage taxonomy aggregation module that first compiles taxonomy information from input videos and then aggregates these taxonomy priors into instance queries before the transformer decoder. We conduct extensive experimental evaluations on four popular and challenging benchmarks, including YouTube-VIS 2019, YouTube-VIS 2021, OVIS, and UVO. Our model shows significant improvement over the baseline solutions, and sets new state-of-the-art records on all these benchmarks. These appealing and encouraging results demonstrate the effectiveness and generality of our proposed approach. The code and trained models will be publicly available.

NeurIPS Conference 2023 Conference Paper

Uni3DETR: Unified 3D Detection Transformer

  • Zhenyu Wang
  • Ya-Li Li
  • Xi Chen
  • Hengshuang Zhao
  • Shengjin Wang

Existing point cloud based 3D detectors are designed for particular scenes, either indoor or outdoor ones. Because of the substantial differences in object distribution and point density within point clouds collected from various environments, coupled with the intricate nature of 3D metrics, there is still a lack of a unified network architecture that can accommodate diverse scenes. In this paper, we propose Uni3DETR, a unified 3D detector that addresses indoor and outdoor 3D detection within the same framework. Specifically, we employ the detection transformer with point-voxel interaction for object prediction, which leverages voxel features and points for cross-attention and is resistant to discrepancies in the data. We then propose the mixture of query points, which sufficiently exploits global information for dense small-range indoor scenes and local information for large-range sparse outdoor ones. Furthermore, our proposed decoupled IoU provides an easy-to-optimize training target for localization by disentangling the $xy$ and $z$ space. Extensive experiments validate that Uni3DETR exhibits excellent performance consistently on both indoor and outdoor 3D detection. In contrast to previous specialized detectors, which may perform well on some particular datasets but suffer a substantial degradation on different scenes, Uni3DETR demonstrates strong generalization ability under heterogeneous conditions (Fig. 1).

JMLR Journal 2022 Journal Article

Accelerating Adaptive Cubic Regularization of Newton's Method via Random Sampling

  • Xi Chen
  • Bo Jiang
  • Tianyi Lin
  • Shuzhong Zhang

In this paper, we consider an unconstrained optimization model where the objective is a sum of a large number of possibly nonconvex functions, though overall the objective is assumed to be smooth and convex. Our approach to solving this model uses the framework of cubic regularization of Newton's method. As is well known, the crux in cubic regularization is its utilization of the Hessian information, which may be computationally expensive for large-scale problems. To tackle this, we resort to approximating the Hessian matrix via sub-sampling. In particular, we propose to compute an approximated Hessian matrix by either uniformly or non-uniformly sub-sampling the components of the objective. Based upon such a sampling strategy, we develop accelerated adaptive cubic regularization approaches and provide theoretical guarantees on global iteration complexity of $O(\epsilon^{-1/3})$ with high probability, which matches that of the original accelerated cubic regularization methods of Jiang et al. (2020) using the full Hessian information. Interestingly, we also show that in the worst-case scenario our algorithm still achieves an $O(\epsilon^{-5/6}\log(\epsilon^{-1}))$ iteration complexity bound. The proof techniques are new to our knowledge and can be of independent interest. Experimental results on regularized logistic regression problems demonstrate a clear effect of acceleration on several real data sets.
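
The sub-sampling idea itself is simple to illustrate. Below is a minimal 1-D toy (hypothetical quadratic components, not the paper's algorithm): when the objective is a sum of n components, the rescaled second derivative of a uniform sub-sample approximates the full second derivative:

```python
# Hessian sub-sampling in 1-D. Each component f_i(x) = 0.5 * a_i * x^2
# has constant second derivative a_i, so the full Hessian (a scalar here)
# is sum(a_i), and a uniform sub-sample of size m, rescaled by n/m, is an
# unbiased estimate of it.
import random

random.seed(0)
n = 10_000
a = [random.uniform(0.5, 1.5) for _ in range(n)]
full_hessian = sum(a)

m = 500  # sub-sample size
sample = random.sample(range(n), m)
estimate = (n / m) * sum(a[i] for i in sample)

# The rescaled sub-sampled Hessian is close to the full one.
assert abs(estimate - full_hessian) / full_hessian < 0.1
```

In the matrix case the same estimator is used inside each cubic-regularized Newton step, trading an O(n) Hessian evaluation for an O(m) one at the cost of a controllable approximation error.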

SODA Conference 2022 Conference Paper

Distribution-free Testing for Halfspaces (Almost) Requires PAC Learning

  • Xi Chen
  • Shyamal Patel

It is well known that halfspaces over ℝⁿ and {0, 1}ⁿ are PAC-learnable with Θ(n) samples. Recently Blais et al. [4] showed that even the easier task of distribution-free sample-based testing requires Ω(n/log n) samples for halfspaces. In this work we study the distribution-free testing of halfspaces with queries, for which we show that the complexity remains to be. Indeed we prove the following stronger tradeoff result: any distribution-free testing algorithm for halfspaces over {0, 1}ⁿ that receives k samples must make queries on the input function, when k satisfies n^{0.99} ≤ k ≤ O(n/log³ n). For halfspaces over ℝⁿ we show that any algorithm that makes a finite number of queries must draw Ω(n/log n) samples.

IJCAI Conference 2022 Conference Paper

Dynamic Car Dispatching and Pricing: Revenue and Fairness for Ridesharing Platforms

  • Zishuo Zhao
  • Xi Chen
  • Xuefeng Zhang
  • Yuan Zhou

A major challenge for ridesharing platforms is to guarantee profit and fairness simultaneously, especially in the presence of misaligned incentives of drivers and riders. We focus on the dispatching-pricing problem to maximize the total revenue while keeping both drivers and riders satisfied. We study the computational complexity of the problem, provide a novel two-phased pricing solution with revenue and fairness guarantees, extend it to stochastic settings, and develop a dynamic (a.k.a. learning-while-doing) algorithm that actively collects data to learn the demand distribution during the scheduling process. We also conduct extensive experiments to demonstrate the effectiveness of our algorithms.

AILAW Journal 2022 Journal Article

How to justify a backing’s eligibility for a warrant: the justification of a legal interpretation in a hard case

  • Shiyang Yu
  • Xi Chen

The Toulmin model has been proved useful in law and argumentation theory. This model describes the basic process in justifying a claim, which comprises six elements, i.e., claim (C), data (D), warrant (W), backing (B), qualifier (Q), and rebuttal (R). Specifically, in justifying a claim, one must put forward 'data' and a 'warrant', whereas the latter is authorized by 'backing'. The force of the 'claim' being justified is represented by the 'qualifier', and the condition under which the claim cannot be justified is represented as the 'rebuttal'. To further improve the model, (Goodnight, Informal Logic 15: 41–52, 1993) points out that the selection of a backing needs justification, which he calls legitimation justification. However, how such justification is constituted has not yet been clarified. To identify legitimation justification, we separate it into two parts. One justifies a backing's eligibility (legitimation justification 1; LJ1); the other justifies its superiority over other eligible backings (legitimation justification 2; LJ2). In this paper, we focus on LJ1 and apply it to the legal justification (of judgements) in hard cases for illustration purposes. We submit that LJ1 refers to the justification of the legal interpretation of a norm by its backing, which can be further separated into several orderable subjustifications. Taking the subjustification of a norm's existence as an example, we show how it would be influenced by different positions in the philosophy of law. Taking the position of the theory of natural law, such subjustification is presented and evaluated. This paper aims not only to inform ongoing theoretical efforts to apply the Toulmin model in the legal field, but also to clarify the process of justifying legal judgments in hard cases. It also offers background information for the possible construction of related AI systems. In our future work, LJ2 and other subjustifications of LJ1 will be discussed.

TMLR Journal 2022 Journal Article

Interpretable Node Representation with Attribute Decoding

  • Xiaohui Chen
  • Xi Chen
  • Liping Liu

Variational Graph Autoencoders (VGAEs) are powerful models for unsupervised learning of node representations from graph data. In this work, we make a systematic analysis of modeling node attributes in VGAEs and show that attribute decoding is important for node representation learning. We further propose a new learning model, interpretable NOde Representation with Attribute Decoding (NORAD). The model encodes node representations in an interpretable way: node representations capture community structures in the graph and the relationship between communities and node attributes. We further propose a rectifying procedure to refine the representations of isolated nodes, which improves the quality of the representations of these nodes. Our empirical results demonstrate the advantage of the proposed model when learning graph data in an interpretable manner.

NeurIPS Conference 2022 Conference Paper

LAPO: Latent-Variable Advantage-Weighted Policy Optimization for Offline Reinforcement Learning

  • Xi Chen
  • Ali Ghadirzadeh
  • Tianhe Yu
  • Jianhao Wang
  • Alex Yuan Gao
  • Wenzhe Li
  • Liang Bin
  • Chelsea Finn

Offline reinforcement learning methods hold the promise of learning policies from pre-collected datasets without the need to query the environment for new samples. This setting is particularly well-suited for continuous control robotic applications for which online data collection based on trial-and-error is costly and potentially unsafe. In practice, offline datasets are often heterogeneous, i.e., collected in a variety of scenarios, such as data from several human demonstrators or from policies that act with different purposes. Unfortunately, such datasets often contain action distributions with multiple modes and, in some cases, lack a sufficient number of high-reward trajectories, which renders offline policy training inefficient. To address this challenge, we propose to leverage a latent-variable generative model to represent high-advantage state-action pairs, leading to better adherence to the data distributions that contribute to solving the task, while maximizing reward via a policy over the latent variable. As we empirically show on a range of simulated locomotion, navigation, and manipulation tasks, our method, referred to as latent-variable advantage-weighted policy optimization (LAPO), improves the average performance of the next best-performing offline reinforcement learning methods by 49% on heterogeneous datasets, and by 8% on datasets with narrow and biased distributions.

JMLR Journal 2022 Journal Article

No Weighted-Regret Learning in Adversarial Bandits with Delays

  • Ilai Bistritz
  • Zhengyuan Zhou
  • Xi Chen
  • Nicholas Bambos
  • Jose Blanchet

Consider a scenario where a player chooses an action in each round $t$ out of $T$ rounds and observes the incurred cost after a delay of $d_{t}$ rounds. The cost functions and the delay sequence are chosen by an adversary. We show that in a non-cooperative game, the expected weighted ergodic distribution of play converges to the set of coarse correlated equilibria if players use algorithms that have “no weighted-regret” in the above scenario, even if they have linear regret due to too large delays. For a two-player zero-sum game, we show that no weighted-regret is sufficient for the weighted ergodic average of play to converge to the set of Nash equilibria. We prove that the FKM algorithm with $n$ dimensions achieves an expected regret of $O\left(nT^{\frac{3}{4}}+\sqrt{n}T^{\frac{1}{3}}D^{\frac{1}{3}}\right)$ and the EXP3 algorithm with $K$ arms achieves an expected regret of $O\left(\sqrt{\log K\left(KT+D\right)}\right)$ even when $D=\sum_{t=1}^{T}d_{t}$ and $T$ are unknown. These bounds use a novel doubling trick that, under mild assumptions, provably retains the regret bound for when $D$ and $T$ are known. Using these bounds, we show that FKM and EXP3 have no weighted-regret even for $d_{t}=O\left(t\log t\right)$. Therefore, algorithms with no weighted-regret can be used to approximate a CCE of a finite or convex unknown game that can only be simulated with bandit feedback, even if the simulation involves significant delays.
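
To make the delayed-feedback setting concrete, here is a toy EXP3 loop with a fixed delay d (hypothetical parameters; a simplified sketch without the exploration mixing or the paper's doubling trick, and with i.i.d. rather than adversarial losses): the importance-weighted loss of round t only updates the weights at round t + d.

```python
# EXP3 with delayed bandit feedback (toy illustration).
import math, random

random.seed(1)
K, T, d, eta = 3, 3000, 10, 0.05
mean_loss = [0.2, 0.5, 0.6]  # arm 0 has the lowest expected loss
weights = [1.0] * K
pending = []  # (round at which feedback arrives, arm, importance-weighted loss)

def probs():
    z = sum(weights)
    return [w / z for w in weights]

for t in range(T):
    p = probs()
    arm = random.choices(range(K), weights=p)[0]
    loss = 1.0 if random.random() < mean_loss[arm] else 0.0
    # Feedback for this round is only revealed d rounds later.
    pending.append((t + d, arm, loss / p[arm]))
    # Apply every pending update whose delay has elapsed.
    while pending and pending[0][0] <= t:
        _, a, est = pending.pop(0)
        weights[a] *= math.exp(-eta * est)

assert probs()[0] > 0.9  # play still concentrates on the best arm
```

The delay only postpones the multiplicative updates; since the loss estimates remain unbiased, the weight of the best arm eventually dominates, which is the intuition behind regret bounds that degrade gracefully in the total delay $D$.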

JBHI Journal 2022 Journal Article

Understanding Patient Query With Weak Supervision From Doctor Response

  • Xiaoming Shi
  • Sendong Zhao
  • Yuxuan Wang
  • Xi Chen
  • Ziheng Zhang
  • Yefeng Zheng
  • Wanxiang Che

Currently, the need for high-quality dialogue systems that assist users in conducting self-diagnosis is rapidly increasing. Slot filling for automatic diagnosis, which converts medical queries into structured representations, plays an important role in diagnostic dialogue systems. However, the lack of high-quality datasets limits the performance of slot filling. Meanwhile, medical communities like AskAPatient usually contain multiple rounds of diagnostic dialogue with colloquial input and professional responses from doctors. Therefore, the diagnostic dialogue data in medical communities can be utilized to address the main challenges in slot filling. This paper proposes a two-step training framework to make full use of these unlabeled dialogue data in medical communities. To promote further research, we provide a Chinese dataset with 2,652 annotated samples and a large amount of unlabeled samples. Experimental results on the dataset demonstrate the effectiveness of the proposed method, with an increase of 6.32% in Micro F1 and 8.20% in Macro F1 on average over strong baselines.

ICML Conference 2021 Conference Paper

Adversarial Combinatorial Bandits with General Non-linear Reward Functions

  • Yanjun Han
  • Yining Wang
  • Xi Chen

In this paper we study the adversarial combinatorial bandit with a known non-linear reward function, extending existing work on the adversarial linear combinatorial bandit. The adversarial combinatorial bandit with general non-linear reward is an important open problem in the bandit literature, and it is still unclear whether there is a significant gap from the case of linear reward, stochastic bandit, or semi-bandit feedback. We show that, with $N$ arms and subsets of $K$ arms being chosen at each of $T$ time periods, the minimax optimal regret is $\widetilde\Theta_{d}(\sqrt{N^d T})$ if the reward function is a $d$-degree polynomial with $d< K$, and $\Theta_K(\sqrt{N^K T})$ if the reward function is not a low-degree polynomial. Both bounds are significantly different from the bound $O(\sqrt{\mathrm{poly}(N, K)T})$ for the linear case, which suggests that there is a fundamental gap between the linear and non-linear reward structures. Our result also finds applications to the adversarial assortment optimization problem in online recommendation. We show that in the worst case of the adversarial assortment problem, the optimal algorithm must treat each of the $\binom{N}{K}$ assortments as independent.

NeurIPS Conference 2021 Conference Paper

Bridging the Gap Between Practice and PAC-Bayes Theory in Few-Shot Meta-Learning

  • Nan Ding
  • Xi Chen
  • Tomer Levinboim
  • Sebastian Goodman
  • Radu Soricut

Despite recent advances in its theoretical understanding, there still remains a significant gap in the ability of existing PAC-Bayesian theories on meta-learning to explain performance improvements in the few-shot learning setting, where the number of training examples in the target tasks is severely limited. This gap originates from an assumption in the existing theories which supposes that the number of training examples in the observed tasks and the number of training examples in the target tasks follow the same distribution, an assumption that rarely holds in practice. By relaxing this assumption, we develop two PAC-Bayesian bounds tailored for the few-shot learning setting and show that two existing meta-learning algorithms (MAML and Reptile) can be derived from our bounds, thereby bridging the gap between practice and PAC-Bayesian theories. Furthermore, we derive a new computationally-efficient PACMAML algorithm, and show it outperforms existing meta-learning algorithms on several few-shot benchmark datasets.

IJCAI Conference 2021 Conference Paper

Drop Redundant, Shrink Irrelevant: Selective Knowledge Injection for Language Pretraining

  • Ningyu Zhang
  • Shumin Deng
  • Xu Cheng
  • Xi Chen
  • Yichi Zhang
  • Wei Zhang
  • Huajun Chen

Previous research has demonstrated the power of leveraging prior knowledge to improve the performance of deep models in natural language processing. However, traditional methods neglect the fact that redundant and irrelevant knowledge exists in external knowledge bases. In this study, we launched an in-depth empirical investigation into downstream tasks and found that knowledge-enhanced approaches do not always exhibit satisfactory improvements. To this end, we investigate the fundamental reasons for ineffective knowledge infusion and present selective injection for language pretraining, which constitutes a model-agnostic method and is readily pluggable into previous approaches. Experimental results on benchmark datasets demonstrate that our approach can enhance state-of-the-art knowledge injection methods.

NeurIPS Conference 2021 Conference Paper

Generalized DataWeighting via Class-Level Gradient Manipulation

  • Can Chen
  • Shuhao Zheng
  • Xi Chen
  • Erqun Dong
  • Xue (Steve) Liu
  • Hao Liu
  • Dejing Dou

Label noise and class imbalance are two major issues coexisting in real-world datasets. To alleviate the two issues, state-of-the-art methods reweight each instance by leveraging a small amount of clean and unbiased data. Yet, these methods overlook class-level information within each instance, which can be further utilized to improve performance. To this end, in this paper, we propose Generalized Data Weighting (GDW) to simultaneously mitigate label noise and class imbalance by manipulating gradients at the class level. To be specific, GDW unrolls the loss gradient to class-level gradients by the chain rule and reweights the flow of each gradient separately. In this way, GDW achieves remarkable performance improvement on both issues. Aside from the performance gain, GDW efficiently obtains class-level weights without introducing any extra computational cost compared with instance weighting methods. Specifically, GDW performs a gradient descent step on class-level weights, which only relies on intermediate gradients. Extensive experiments in various settings verify the effectiveness of GDW. For example, GDW outperforms state-of-the-art methods by $2.56\%$ under the $60\%$ uniform noise setting in CIFAR10. Our code is available at https://github.com/GGchen1997/GDW-NIPS2021.
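
The "unroll the loss gradient to class-level gradients" idea can be illustrated on a single example with softmax cross-entropy, whose gradient w.r.t. the logits decomposes per class as (p_c - y_c). The sketch below uses hypothetical fixed class weights rather than GDW's learned ones, and scales each class component separately:

```python
# Class-level gradient reweighting (toy, single example).
import math

def softmax(logits):
    m = max(logits)
    exps = [math.exp(z - m) for z in logits]
    s = sum(exps)
    return [e / s for e in exps]

def class_weighted_grad(logits, target, class_weights):
    p = softmax(logits)
    y = [1.0 if c == target else 0.0 for c in range(len(logits))]
    # Standard cross-entropy gradient per class is (p_c - y_c);
    # scale each class's component of the gradient flow by its weight.
    return [w * (pc - yc) for w, pc, yc in zip(class_weights, p, y)]

logits, target = [2.0, 0.5, 0.1], 0
uniform = class_weighted_grad(logits, target, [1.0, 1.0, 1.0])
downweighted = class_weighted_grad(logits, target, [1.0, 0.1, 1.0])
# Down-weighting class 1 shrinks only its gradient component.
assert abs(downweighted[1]) < abs(uniform[1])
assert downweighted[0] == uniform[0] and downweighted[2] == uniform[2]
```

Instance weighting would scale all three components by one scalar; the class-level decomposition is what lets a noisy or over-represented class be suppressed without dampening the signal for the other classes.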

JMLR Journal 2021 Journal Article

Shape-Enforcing Operators for Generic Point and Interval Estimators of Functions

  • Xi Chen
  • Victor Chernozhukov
  • Ivan Fernandez-Val
  • Scott Kostyshak
  • Ye Luo

A common problem in econometrics, statistics, and machine learning is to estimate and make inference on functions that satisfy shape restrictions. For example, distribution functions are nondecreasing and range between zero and one, height growth charts are nondecreasing in age, and production functions are nondecreasing and quasi-concave in input quantities. We propose a method to enforce these restrictions ex post on generic unconstrained point and interval estimates of the target function by applying functional operators. The interval estimates could be either frequentist confidence bands or Bayesian credible regions. If an operator has reshaping, invariance, order-preserving, and distance-reducing properties, the shape-enforced point estimates are closer to the target function than the original point estimates and the shape-enforced interval estimates have greater coverage and shorter length than the original interval estimates. We show that these properties hold for six different operators that cover commonly used shape restrictions in practice: range, convexity, monotonicity, monotone convexity, quasi-convexity, and monotone quasi-convexity, with the latter two restrictions being of paramount importance. The main attractive property of the post-processing approach is that it works in conjunction with any generic initial point or interval estimate, obtained using any of parametric, semi-parametric or nonparametric learning methods, including recent methods that are able to exploit either smoothness, sparsity, or other forms of structured parsimony of target functions. The post-processed point and interval estimates automatically inherit and provably improve these properties in finite samples, while also enforcing qualitative shape restrictions brought by scientific reasoning. We illustrate the results with two empirical applications to the estimation of a height growth chart for infants in India and a production function for chemical firms in China. 
© JMLR 2021.

IJCAI Conference 2021 Conference Paper

Unsupervised Knowledge Graph Alignment by Probabilistic Reasoning and Semantic Embedding

  • Zhiyuan Qi
  • Ziheng Zhang
  • Jiaoyan Chen
  • Xi Chen
  • Yuejia Xiang
  • Ningyu Zhang
  • Yefeng Zheng

Knowledge Graph (KG) alignment aims to discover the mappings (i.e., equivalent entities, relations, and others) between two KGs. Existing methods can be divided into embedding-based models and conventional systems based on reasoning and lexical matching. The former compute the similarity of entities via their cross-KG embeddings, but they usually rely on an ideal supervised learning setting for good performance and lack appropriate reasoning to avoid logically wrong mappings; the latter address the reasoning issue but are poor at exploiting KG graph structures and entity contexts. In this study, we aim to combine the two solutions and propose an iterative framework named PRASE, based on probabilistic reasoning and semantic embedding. It learns KG embeddings via entity mappings from a probabilistic reasoning system named PARIS, and feeds the resulting entity mappings and embeddings back into PARIS for augmentation. The PRASE framework is compatible with different embedding-based models, and our experiments on multiple datasets demonstrate its state-of-the-art performance.

JMLR Journal 2021 Journal Article

Variance Reduced Median-of-Means Estimator for Byzantine-Robust Distributed Inference

  • Jiyuan Tu
  • Weidong Liu
  • Xiaojun Mao
  • Xi Chen

This paper develops an efficient distributed inference algorithm that is robust against a moderate fraction of Byzantine nodes, namely arbitrary and possibly adversarial machines in a distributed learning system. In robust statistics, the median-of-means (MOM) has been a popular approach to hedge against Byzantine failures due to its ease of implementation and computational efficiency. However, the MOM estimator falls short in statistical efficiency. The first main contribution of the paper is to propose a variance-reduced median-of-means (VRMOM) estimator, which improves statistical efficiency over the vanilla MOM estimator while remaining computationally as efficient as the MOM. Based on the proposed VRMOM estimator, we develop a general distributed inference algorithm that is robust against Byzantine failures. Theoretically, our distributed algorithm achieves a fast convergence rate with only a constant number of rounds of communication. We also provide an asymptotic normality result for the purpose of statistical inference. To the best of our knowledge, this is the first normality result in the setting of Byzantine-robust distributed learning. Simulation results are presented to illustrate the effectiveness of our method.
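For reference, the vanilla median-of-means estimator that VRMOM improves upon can be sketched in a few lines (a toy single-machine illustration, not the paper's distributed algorithm):

```python
import numpy as np

def median_of_means(x, n_blocks):
    """Vanilla MOM: split the samples into blocks, average each block,
    and take the median of the block means. A minority of corrupted
    (Byzantine) blocks cannot move the median far."""
    blocks = np.array_split(np.asarray(x, dtype=float), n_blocks)
    return float(np.median([b.mean() for b in blocks]))

rng = np.random.default_rng(0)
x = rng.normal(loc=1.0, scale=1.0, size=1000)
x[:20] = 1e6                                # adversarially corrupted samples
robust = median_of_means(x, n_blocks=10)    # close to the true mean 1.0
naive = x.mean()                            # ruined by the corruption
```

The statistical inefficiency the paper targets is visible here too: each block mean uses only n/k samples, which is the slack VRMOM recovers via variance reduction.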

IJCAI Conference 2020 Conference Paper

Bayesian Decision Process for Budget-efficient Crowdsourced Clustering

  • Xiaozhou Wang
  • Xi Chen
  • Qihang Lin
  • Weidong Liu

The performance of clustering depends on an appropriately defined similarity between two items. When similarity is measured based on human perception, human workers are often employed to estimate similarity scores between items to support clustering, a procedure called crowdsourced clustering. Assuming a monetary reward is paid to a worker for each similarity score, and that the similarities between pairs and the reliability of workers vary widely, it is critical under a limited budget to wisely assign pairs of items to different workers to optimize the clustering result. We model this budget allocation problem as a Markov decision process in which item pairs are dynamically assigned to workers based on the historical similarity scores they provided. We propose an optimistic knowledge gradient policy in which the assignment of items at each stage is based on the minimum-weight K-cut defined on a similarity graph. We provide simulation studies and real data analysis to demonstrate the performance of the proposed method.

JMLR Journal 2020 Journal Article

Distributed High-dimensional Regression Under a Quantile Loss Function

  • Xi Chen
  • Weidong Liu
  • Xiaojun Mao
  • Zhuoyi Yang

This paper studies distributed estimation and support recovery for high-dimensional linear regression models with heavy-tailed noise. To deal with heavy-tailed noise whose variance can be infinite, we adopt the quantile regression loss function instead of the commonly used squared loss. However, the non-smooth quantile loss poses new challenges to high-dimensional distributed estimation in both computation and theoretical development. To address the challenge, we transform the response variable and establish a new connection between quantile regression and ordinary linear regression. We then provide a distributed estimator that is both computationally and communication-efficient, where only the gradient information is communicated at each iteration. Theoretically, we show that, after a constant number of iterations, the proposed estimator achieves a near-oracle convergence rate without any restriction on the number of machines. Moreover, we establish a theoretical guarantee for support recovery. Simulation analysis is provided to demonstrate the effectiveness of our method.

JMLR Journal 2020 Journal Article

Dynamic Assortment Optimization with Changing Contextual Information

  • Xi Chen
  • Yining Wang
  • Yuan Zhou

In this paper, we study the dynamic assortment optimization problem over a finite selling season of length $T$. At each time period, the seller offers an arriving customer an assortment of substitutable products under a cardinality constraint, and the customer makes the purchase among offered products according to a discrete choice model. Most existing work associates each product with a real-valued fixed mean utility and assumes a multinomial logit (MNL) choice model. In many practical applications, feature/contextual information of products is readily available. In this paper, we incorporate the feature information by assuming a linear relationship between the mean utility and the feature. In addition, we allow the feature information of products to change over time so that the underlying choice model can also be non-stationary. To solve the dynamic assortment optimization under this changing contextual MNL model, we need to simultaneously learn the underlying unknown coefficient and make the decision on the assortment. To this end, we develop an upper confidence bound (UCB) based policy and establish a regret bound on the order of $\tilde{O}(d\sqrt{T})$, where $d$ is the dimension of the feature and $\tilde{O}$ suppresses logarithmic dependence. We further establish a lower bound $\Omega(d\sqrt{T}/K)$, where $K$ is the cardinality constraint of an offered assortment, which is usually small. When $K$ is a constant, our policy is optimal up to logarithmic factors. In the exploitation phase of the UCB algorithm, we need to solve a combinatorial optimization problem for assortment optimization based on the learned information. We further develop an approximation algorithm and an efficient greedy heuristic. The effectiveness of the proposed policy is further demonstrated by our numerical studies.

NeurIPS Conference 2020 Conference Paper

Fixed-Support Wasserstein Barycenters: Computational Hardness and Fast Algorithm

  • Tianyi Lin
  • Nhat Ho
  • Xi Chen
  • Marco Cuturi
  • Michael Jordan

We study the fixed-support Wasserstein barycenter problem (FS-WBP), which consists in computing the Wasserstein barycenter of $m$ discrete probability measures supported on a finite metric space of size $n$. We show first that the constraint matrix arising from the standard linear programming (LP) representation of the FS-WBP is not totally unimodular when $m \geq 3$ and $n \geq 3$. This result resolves an open question pertaining to the relationship between the FS-WBP and the minimum-cost flow (MCF) problem, since it proves that the FS-WBP in the standard LP form is not an MCF problem when $m \geq 3$ and $n \geq 3$. We also develop a provably fast deterministic variant of the celebrated iterative Bregman projection (IBP) algorithm, named FastIBP, with a complexity bound of $\tilde{O}(mn^{7/3}\varepsilon^{-4/3})$, where $\varepsilon \in (0, 1)$ is the desired tolerance. This complexity bound is better than the best known bound of $\tilde{O}(mn^2\varepsilon^{-2})$ for the IBP algorithm in terms of $\varepsilon$, and better than the bound of $\tilde{O}(mn^{5/2}\varepsilon^{-1})$ from the accelerated alternating minimization or accelerated primal-dual adaptive gradient algorithms in terms of $n$. Finally, we conduct extensive experiments with both synthetic data and real images and demonstrate the favorable performance of the FastIBP algorithm in practice.
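The plain IBP scheme that FastIBP accelerates fits in a few lines; below is a minimal sketch of vanilla IBP for the entropically regularized fixed-support barycenter with equal weights (the paper's deterministic acceleration step is not shown):

```python
import numpy as np

def ibp_barycenter(ps, C, eps=0.05, iters=300):
    """Vanilla iterative Bregman projection for the entropically
    regularized fixed-support barycenter of the rows of ps (equal
    weights), with shared cost matrix C on the common support."""
    m, n = ps.shape
    K = np.exp(-C / eps)                      # Gibbs kernel
    v = np.ones((m, n))
    for _ in range(iters):
        u = ps / (K @ v.T).T                  # match each input marginal
        phi = (K.T @ u.T).T
        q = np.exp(np.log(phi).mean(axis=0))  # geometric-mean projection
        v = q[None, :] / phi                  # match the common barycenter
    return q / q.sum()

# Barycenter of two point masses at the ends of [0, 1]: the mass
# concentrates around the midpoint of the grid.
grid = np.linspace(0.0, 1.0, 11)
C = (grid[:, None] - grid[None, :]) ** 2
ps = np.zeros((2, 11))
ps[0, 0] = 1.0
ps[1, -1] = 1.0
q = ibp_barycenter(ps, C)
```

Each iteration alternates Bregman projections onto the two marginal constraint sets, with the geometric mean coupling the $m$ transport problems through the shared barycenter.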

NeurIPS Conference 2020 Conference Paper

Hedging in games: Faster convergence of external and swap regrets

  • Xi Chen
  • Binghui Peng

We consider the setting where players run the Hedge algorithm or its optimistic variant \cite{syrgkanis2015fast} to play an $n$-action game repeatedly for $T$ rounds. 1) For two-player games, we show that the regret of optimistic Hedge decays at $\tilde{O}(1/T^{5/6})$, improving the previous bound $O(1/T^{3/4})$ of \cite{syrgkanis2015fast}. 2) In contrast, we show that the convergence rate of vanilla Hedge is no better than $\tilde{\Omega}(1/\sqrt{T})$, addressing an open question posed in \cite{syrgkanis2015fast}. For general $m$-player games, we show that the swap regret of each player decays at rate $\tilde{O}(m^{1/2}(n/T)^{3/4})$ when they combine optimistic Hedge with the classical external-to-internal reduction of Blum and Mansour \cite{blum2007external}. The algorithm can also be modified to achieve the same rate against itself and a rate of $\tilde{O}(\sqrt{n/T})$ against adversaries. Via standard connections, our upper bounds also imply faster convergence to coarse correlated equilibria in two-player games and to correlated equilibria in multiplayer games.
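A minimal sketch of the vanilla Hedge update analyzed above (the optimistic variant additionally uses the previous round's loss vector as a prediction, which is omitted here):

```python
import numpy as np

def hedge_regret(losses, eta=0.1):
    """Run Hedge (multiplicative weights) over the rows of `losses`
    (one loss vector in [0, 1]^n per round) and return the external
    regret against the best fixed action in hindsight."""
    T, n = losses.shape
    w = np.ones(n)
    cum = 0.0
    for t in range(T):
        p = w / w.sum()                # play the normalized weights
        cum += float(p @ losses[t])    # expected loss this round
        w *= np.exp(-eta * losses[t])  # multiplicative update
    return cum - float(losses.sum(axis=0).min())

# Against a fixed adversary favoring action 0, regret stays far
# below the T/2 incurred by uniform play.
losses = np.tile(np.array([0.0, 1.0]), (500, 1))
reg = hedge_regret(losses)
```

In the game setting of the abstract, each player runs this update with the loss vector induced by the opponents' mixed strategies in that round.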

NeurIPS Conference 2020 Conference Paper

Information Theoretic Counterfactual Learning from Missing-Not-At-Random Feedback

  • Zifeng Wang
  • Xi Chen
  • Rui Wen
  • Shao-Lun Huang
  • Ercan Kuruoglu
  • Yefeng Zheng

Counterfactual learning for dealing with missing-not-at-random (MNAR) data is an intriguing topic in the recommendation literature, since MNAR data are ubiquitous in modern recommender systems. In contrast, missing-at-random (MAR) data, namely randomized controlled trials (RCTs), are usually required by most previous counterfactual learning methods. However, the execution of RCTs is extraordinarily expensive in practice. To circumvent the use of RCTs, we build an information-theoretic counterfactual variational information bottleneck (CVIB) as an alternative for debiased learning without RCTs. By separating the task-aware mutual information term in the original information bottleneck Lagrangian into factual and counterfactual parts, we derive a contrastive information loss and an additional output confidence penalty, which facilitate balanced learning between the factual and counterfactual domains. Empirical evaluation on real-world datasets shows that our CVIB significantly enhances both shallow and deep models, shedding light on counterfactual learning in recommendation beyond RCTs.

JMLR Journal 2020 Journal Article

On Stationary-Point Hitting Time and Ergodicity of Stochastic Gradient Langevin Dynamics

  • Xi Chen
  • Simon S. Du
  • Xin T. Tong

Stochastic gradient Langevin dynamics (SGLD) is a fundamental algorithm in stochastic optimization. Recent work by Zhang et al. (2017) presents an analysis of the hitting time of SGLD for first- and second-order stationary points. The proof in Zhang et al. (2017) is a two-stage procedure through bounding the Cheeger constant, which is rather complicated and leads to loose bounds. In this paper, using intuitions from stochastic differential equations, we provide a direct analysis of the hitting times of SGLD to first- and second-order stationary points. Our analysis is straightforward, relying only on basic tools from linear algebra and probability theory. It also leads to tighter bounds than Zhang et al. (2017) and shows the explicit dependence of the hitting time on different factors, including dimensionality, smoothness, noise strength, and step size. Under suitable conditions, we show that the hitting time of SGLD to first-order stationary points can be dimension-independent. Moreover, we apply our analysis to study several important online estimation problems in machine learning, including linear regression, matrix factorization, and online PCA.
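The SGLD iteration whose hitting time is analyzed above is simply a gradient step plus injected Gaussian noise; a minimal sketch on a toy quadratic (illustrative only, with an assumed constant temperature, not the paper's analysis):

```python
import numpy as np

def sgld(grad, x0, step=0.01, temperature=0.01, iters=2000, seed=0):
    """Stochastic gradient Langevin dynamics:
    x <- x - step * grad(x) + sqrt(2 * step * temperature) * N(0, I).
    The injected noise lets the iterate escape saddle points and
    explore the landscape while drifting toward stationary points."""
    rng = np.random.default_rng(seed)
    x = np.asarray(x0, dtype=float)
    scale = np.sqrt(2.0 * step * temperature)
    for _ in range(iters):
        x = x - step * grad(x) + scale * rng.standard_normal(x.shape)
    return x

# Quadratic f(x) = ||x||^2 / 2 with gradient x: SGLD hits a
# neighborhood of the stationary point 0 and hovers there.
x_final = sgld(lambda x: x, x0=np.array([5.0, -5.0]))
```

The hitting-time question of the paper is how many of these iterations are needed before the gradient at the iterate first becomes small.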

AAAI Conference 2020 Short Paper

Optimizing the Feature Selection Process for Better Accuracy in Datasets with a Large Number of Features (Student Abstract)

  • Xi Chen
  • Afsaneh Doryab

Most feature selection methods only perform well on datasets with a relatively small set of features. In the case of large feature sets and a small number of data points, almost none of the existing feature selection methods help in achieving high accuracy. This paper proposes a novel approach that optimizes the feature selection process through the Frequent Pattern Growth algorithm, finding sets of features that appear frequently among the top features selected by the main feature selection methods. Our experimental evaluation on two datasets, containing a small and a very large number of features respectively, shows that our approach significantly improves the accuracy results on the dataset with a very large number of features.

IJCAI Conference 2020 Conference Paper

SiamBOMB: A Real-time AI-based System for Home-cage Animal Tracking, Segmentation and Behavioral Analysis

  • Xi Chen
  • Hao Zhai
  • Danqian Liu
  • Weifu Li
  • Chaoyue Ding
  • Qiwei Xie
  • Hua Han

Biologists often need to handle numerous video-based home-cage animal behavior analysis tasks that require massive workloads. Therefore, we develop an AI-based multi-species tracking and segmentation system, SiamBOMB, for real-time and automatic home-cage animal behavioral analysis. In this system, a background-enhanced Siamese-based network with replaceable modular design ensures the flexibility and generalizability of the system, and a user-friendly interface makes it convenient to use for biologists. This real-time AI system will effectively reduce the burden on biologists.

IJCAI Conference 2020 Conference Paper

Variational Learning of Bayesian Neural Networks via Bayesian Dark Knowledge

  • Gehui Shen
  • Xi Chen
  • Zhihong Deng

Bayesian neural networks (BNNs) have received more and more attention because they can model epistemic uncertainty, which is hard for conventional neural networks. Markov chain Monte Carlo (MCMC) methods and variational inference (VI) are the two mainstream approaches to Bayesian deep learning. The former is effective, but its storage cost is prohibitive since it has to save many samples of the neural network parameters. The latter is more time- and space-efficient, but the approximate variational posterior limits its performance. In this paper, we aim to combine the advantages of the two methods by distilling MCMC samples into an approximate variational posterior. On the basis of an existing distillation technique, we first propose the variational Bayesian dark knowledge method. Moreover, we propose Bayesian dark prior knowledge, a novel distillation method that treats the MCMC posterior as the prior of a variational BNN. Both proposed methods not only reduce the space overhead of the teacher model, making them scalable, but also maintain a distilled posterior distribution capable of modeling epistemic uncertainty. Experimental results show that our methods outperform the existing distillation method in predictive accuracy and uncertainty modeling.

AAAI Conference 2019 Conference Paper

Deep Cascade Multi-Task Learning for Slot Filling in Online Shopping Assistant

  • Yu Gong
  • Xusheng Luo
  • Yu Zhu
  • Wenwu Ou
  • Zhao Li
  • Muhua Zhu
  • Kenny Q. Zhu
  • Lu Duan

Slot filling is a critical task in natural language understanding (NLU) for dialog systems. State-of-the-art approaches treat it as a sequence labeling problem and adopt models such as BiLSTM-CRF. While these models work relatively well on standard benchmark datasets, they face challenges in the context of E-commerce, where the slot labels are more informative and carry richer expressions. In this work, inspired by the unique structure of E-commerce knowledge bases, we propose a novel multi-task model with cascade and residual connections, which jointly learns segment tagging, named entity tagging, and slot filling. Experiments show the effectiveness of the proposed cascade and residual structures. Our model has a 14.6% advantage in F1 score over strong baseline methods on a new Chinese E-commerce shopping assistant dataset, while achieving competitive accuracy on a standard dataset. Furthermore, an online test deployed on this dominant E-commerce platform shows a 130% improvement in the accuracy of understanding user utterances. Our model has already gone into production on the E-commerce platform.

JBHI Journal 2019 Journal Article

Detecting Alzheimer's Disease on Small Dataset: A Knowledge Transfer Perspective

  • Wei Li
  • Yifei Zhao
  • Xi Chen
  • Yang Xiao
  • Yuanyuan Qin

Computer-aided diagnosis (CAD) is an attractive topic in Alzheimer's disease (AD) research. Many algorithms are based on a relatively large training dataset. However, small hospitals are usually unable to collect sufficient training samples for robust classification. Although data sharing is expanding in scientific research, it is unclear whether a model based on one dataset is well suited to other data sources. Using a small dataset from a local hospital and a large shared dataset from the AD Neuroimaging Initiative, we conducted a heterogeneity analysis and found that different functional magnetic resonance imaging data sources show different sample distributions in feature space. In addition, we propose an effective knowledge transfer method to diminish the disparity among different datasets and improve classification accuracy on datasets with insufficient training samples. The accuracy increased by approximately 20% compared with that of a model based only on the original small dataset. The results demonstrate that the proposed approach is a novel and effective method for CAD in hospitals with only small training datasets. It addresses the challenge of limited sample size in the detection of AD, a common issue that has received inadequate attention. Furthermore, this paper sheds new light on the effective use of multi-source data for neurological disease diagnosis.

JMLR Journal 2019 Journal Article

Distributed Inference for Linear Support Vector Machine

  • Xiaozhou Wang
  • Zhuoyi Yang
  • Xi Chen
  • Weidong Liu

The growing size of modern data brings many new challenges to existing statistical inference methodologies and theories, and calls for the development of distributed inferential approaches. This paper studies distributed inference for the linear support vector machine (SVM) for the binary classification task. Despite a vast literature on SVM, much less is known about its inferential properties, especially in a distributed setting. In this paper, we propose a multi-round distributed linear-type (MDL) estimator for conducting inference for linear SVM. The proposed estimator is computationally efficient. In particular, it only requires an initial SVM estimator and then successively refines the estimator by solving a simple weighted least squares problem. Theoretically, we establish the Bahadur representation of the estimator. Based on the representation, the asymptotic normality is further derived, which shows that the MDL estimator achieves the optimal statistical efficiency, i.e., the same efficiency as the classical linear SVM applied to the entire data set in a single-machine setup. Moreover, our asymptotic result avoids the condition on the number of machines or data batches commonly assumed in the distributed estimation literature, and allows the case of diverging dimension. We provide simulation studies to demonstrate the performance of the proposed MDL estimator.

ICRA Conference 2019 Conference Paper

Learning From Demonstration in the Wild

  • Feryal M. P. Behbahani
  • Kyriacos Shiarlis
  • Xi Chen
  • Vitaly Kurin
  • Sudhanshu Kasewa
  • Ciprian Stirbu
  • João Gomes
  • Supratik Paul

Learning from demonstration (LfD) is useful in settings where hand-coding behaviour or a reward function is impractical. It has succeeded in a wide range of problems but typically relies on manually generated demonstrations or specially deployed sensors and has not generally been able to leverage the copious demonstrations available in the wild: those that capture behaviours that were occurring anyway using sensors that were already deployed for another purpose, e.g., traffic camera footage capturing demonstrations of natural behaviour of vehicles, cyclists, and pedestrians. We propose video to behaviour (ViBe), a new approach to learn models of behaviour from unlabelled raw video data of a traffic scene collected from a single, monocular, initially uncalibrated camera with ordinary resolution. Our approach calibrates the camera, detects relevant objects, tracks them through time, and uses the resulting trajectories to perform LfD, yielding models of naturalistic behaviour. We apply ViBe to raw videos of a traffic intersection and show that it can learn purely from videos, without additional expert knowledge.

NeurIPS Conference 2019 Conference Paper

Online EXP3 Learning in Adversarial Bandits with Delayed Feedback

  • Ilai Bistritz
  • Zhengyuan Zhou
  • Xi Chen
  • Nicholas Bambos
  • Jose Blanchet

Consider a player that in each of $T$ rounds chooses one of $K$ arms. An adversary chooses the cost of each arm in a bounded interval, and a sequence of feedback delays $\{d_t\}$ that are unknown to the player. After picking arm $a_t$ at round $t$, the player receives the cost of playing this arm $d_t$ rounds later. In cases where $t + d_t > T$, this feedback is simply missing. We prove that the EXP3 algorithm (that uses the delayed feedback upon its arrival) achieves a regret of $O\left(\sqrt{\ln K\left(KT+\sum_{t=1}^{T}d_{t}\right)}\right)$. For the case where $\sum_{t=1}^{T}d_{t}$ and $T$ are unknown, we propose a novel doubling trick for online learning with delays and prove that this adaptive EXP3 achieves a regret of $O\left(\sqrt{\ln K\left(K^{2}T+\sum_{t=1}^{T}d_{t}\right)}\right)$. We then consider a two-player zero-sum game where players experience asynchronous delays. We show that even when the delays are large enough that players no longer enjoy the "no-regret property" (e.g., where $d_t = O(t\log t)$), the ergodic average of the strategy profile still converges to the set of Nash equilibria of the game. The result is made possible by choosing an adaptive step size $\eta_t$ that is not summable but is square summable, and proving a "weighted regret bound" for this general case.
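A minimal sketch of EXP3 that applies each importance-weighted cost estimate when its delayed feedback arrives (a toy illustration with assumed constant delays; the paper's adaptive doubling trick is not shown):

```python
import numpy as np

def delayed_exp3(costs, delays, eta=0.05, seed=0):
    """EXP3 with delayed feedback: play from the exponential weights,
    buffer each observed cost until its arrival round, then apply the
    importance-weighted update with the probability used at play time."""
    rng = np.random.default_rng(seed)
    T, K = costs.shape
    w = np.ones(K)
    pending = {}            # arrival round -> [(arm, cost, play prob)]
    total = 0.0
    for t in range(T):
        p = w / w.sum()
        arm = int(rng.choice(K, p=p))
        total += costs[t, arm]
        arrival = t + int(delays[t])
        if arrival < T:     # feedback past the horizon is simply lost
            pending.setdefault(arrival, []).append((arm, costs[t, arm], p[arm]))
        for a, c, pa in pending.pop(t, []):
            w[a] *= np.exp(-eta * c / pa)    # importance-weighted cost
    return total - float(costs.sum(axis=0).min())  # regret vs. best arm

# Two arms with constant costs 0.1 and 0.9, every feedback delayed
# by 5 rounds: regret stays sublinear in T = 1000.
T = 1000
costs = np.tile(np.array([0.1, 0.9]), (T, 1))
reg = delayed_exp3(costs, delays=np.full(T, 5))
```

Storing the play-time probability with each buffered observation keeps the cost estimate unbiased even though the weights have moved on by the time the feedback arrives.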

AAAI Conference 2018 Conference Paper

HodgeRank With Information Maximization for Crowdsourced Pairwise Ranking Aggregation

  • Qianqian Xu
  • Jiechao Xiong
  • Xi Chen
  • Qingming Huang
  • Yuan Yao

Recently, crowdsourcing has emerged as an effective paradigm for human-powered large-scale problem solving in various domains. However, a task requester usually has a limited budget, so it is desirable to have a policy that wisely allocates the budget to achieve better quality. In this paper, we study the principle of information maximization for active sampling strategies in the framework of HodgeRank, an approach based on the Hodge decomposition of pairwise ranking data with multiple workers. The principle exhibits two scenarios of active sampling: Fisher information maximization, which leads to unsupervised sampling based on sequential maximization of graph algebraic connectivity without considering labels; and Bayesian information maximization, which selects samples with the largest information gain from prior to posterior, giving a supervised sampling scheme involving the collected labels. Experiments show that the proposed methods boost sampling efficiency compared to traditional sampling schemes and are thus valuable for practical crowdsourcing experiments.

NeurIPS Conference 2018 Conference Paper

Near-Optimal Policies for Dynamic Multinomial Logit Assortment Selection Models

  • Yining Wang
  • Xi Chen
  • Yuan Zhou

In this paper we consider the dynamic assortment selection problem under an uncapacitated multinomial logit (MNL) model. By carefully analyzing a revenue potential function, we show that a trisection-based algorithm achieves an item-independent regret bound of $O(\sqrt{T \log\log T})$, which matches information-theoretic lower bounds up to iterated logarithmic terms. Our proof technique draws tools from the unimodal/convex bandit literature as well as adaptive confidence parameters in minimax multi-armed bandit problems.
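The trisection idea at the core of the algorithm is ordinary ternary search on a unimodal function (a generic sketch, not the paper's revenue-potential version, which must additionally handle noisy evaluations via confidence bounds):

```python
def trisect_max(f, lo, hi, iters=100):
    """Ternary search: locate the maximizer of a unimodal function on
    [lo, hi] by repeatedly discarding the third of the interval that
    cannot contain the peak."""
    for _ in range(iters):
        m1 = lo + (hi - lo) / 3.0
        m2 = hi - (hi - lo) / 3.0
        if f(m1) < f(m2):
            lo = m1    # peak lies to the right of m1
        else:
            hi = m2    # peak lies to the left of m2
    return (lo + hi) / 2.0

# Unimodal toy objective peaking at 0.3
peak = trisect_max(lambda x: -(x - 0.3) ** 2, 0.0, 1.0)
```

Each comparison shrinks the search interval by a factor of 2/3, which is what yields the iterated-logarithmic overhead once noise is handled adaptively.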

JMLR Journal 2016 Journal Article

Bayesian Decision Process for Cost-Efficient Dynamic Ranking via Crowdsourcing

  • Xi Chen
  • Kevin Jiao
  • Qihang Lin

Rank aggregation based on pairwise comparisons over a set of items has a wide range of applications. Although considerable research has been devoted to the development of rank aggregation algorithms, one basic question is how to efficiently collect a large amount of high-quality pairwise comparisons for the ranking purpose. Because of the advent of many crowdsourcing services, a crowd of workers are often hired to conduct pairwise comparisons with a small monetary reward for each pair they compare. Since different workers have different levels of reliability and different pairs have different levels of ambiguity, it is desirable to wisely allocate the limited budget for comparisons among the pairs of items and workers so that the global ranking can be accurately inferred from the comparison results. To this end, we model the active sampling problem in crowdsourced ranking as a Bayesian Markov decision process, which dynamically selects item pairs and workers to improve the ranking accuracy under a budget constraint. We further develop a computationally efficient sampling policy based on knowledge gradient as well as a moment matching technique for posterior approximation. Experimental evaluations on both synthetic and real data show that the proposed policy achieves high ranking accuracy with a lower labeling cost.

NeurIPS Conference 2016 Conference Paper

Improved Techniques for Training GANs

  • Tim Salimans
  • Ian Goodfellow
  • Wojciech Zaremba
  • Vicki Cheung
  • Alec Radford
  • Xi Chen

We present a variety of new architectural features and training procedures that we apply to the generative adversarial networks (GANs) framework. Using our new techniques, we achieve state-of-the-art results in semi-supervised classification on MNIST, CIFAR-10 and SVHN. The generated images are of high quality as confirmed by a visual Turing test: Our model generates MNIST samples that humans cannot distinguish from real data, and CIFAR-10 samples that yield a human error rate of 21.3%. We also present ImageNet samples with unprecedented resolution and show that our methods enable the model to learn recognizable features of ImageNet classes.

NeurIPS Conference 2016 Conference Paper

Improved Variational Inference with Inverse Autoregressive Flow

  • Durk Kingma
  • Tim Salimans
  • Rafal Jozefowicz
  • Xi Chen
  • Ilya Sutskever
  • Max Welling

The framework of normalizing flows provides a general strategy for flexible variational inference of posteriors over latent variables. We propose a new type of normalizing flow, inverse autoregressive flow (IAF), that, in contrast to earlier published flows, scales well to high-dimensional latent spaces. The proposed flow consists of a chain of invertible transformations, where each transformation is based on an autoregressive neural network. In experiments, we show that IAF significantly improves upon diagonal Gaussian approximate posteriors. In addition, we demonstrate that a novel type of variational autoencoder, coupled with IAF, is competitive with neural autoregressive models in terms of attained log-likelihood on natural images, while allowing significantly faster synthesis.

NeurIPS Conference 2016 Conference Paper

InfoGAN: Interpretable Representation Learning by Information Maximizing Generative Adversarial Nets

  • Xi Chen
  • Yan Duan
  • Rein Houthooft
  • John Schulman
  • Ilya Sutskever
  • Pieter Abbeel

This paper describes InfoGAN, an information-theoretic extension to the Generative Adversarial Network that is able to learn disentangled representations in a completely unsupervised manner. InfoGAN is a generative adversarial network that also maximizes the mutual information between a small subset of the latent variables and the observation. We derive a lower bound to the mutual information objective that can be optimized efficiently, and show that our training procedure can be interpreted as a variation of the Wake-Sleep algorithm. Specifically, InfoGAN successfully disentangles writing styles from digit shapes on the MNIST dataset, pose from lighting of 3D rendered images, and background digits from the central digit on the SVHN dataset. It also discovers visual concepts that include hair styles, presence/absence of eyeglasses, and emotions on the CelebA face dataset. Experiments show that InfoGAN learns interpretable representations that are competitive with representations learned by existing fully supervised methods.

JMLR Journal 2016 Journal Article

On Bayes Risk Lower Bounds

  • Xi Chen
  • Adityanand Guntuboyina
  • Yuchen Zhang

This paper provides a general technique for lower bounding the Bayes risk of statistical estimation, applicable to arbitrary loss functions and arbitrary prior distributions. A lower bound on the Bayes risk not only serves as a lower bound on the minimax risk, but also characterizes the fundamental limit of any estimator given the prior knowledge. Our bounds are based on the notion of $f$-informativity (Csiszár, 1972), which is a function of the underlying class of probability measures and the prior. Application of our bounds requires upper bounds on the $f$-informativity, thus we derive new upper bounds on $f$-informativity which often lead to tight Bayes risk lower bounds. Our technique leads to generalizations of a variety of classical minimax bounds (e.g., generalized Fano's inequality). Our Bayes risk lower bounds can be directly applied to several concrete estimation problems, including Gaussian location models, generalized linear models, and principal component analysis for spiked covariance models. To further demonstrate the applications of our Bayes risk lower bounds to machine learning problems, we present two new theoretical results: (1) a precise characterization of the minimax risk of learning spherical Gaussian mixture models under the smoothed analysis framework, and (2) lower bounds for the Bayes risk under a natural prior for both the prediction and estimation errors for high-dimensional sparse linear regression under an improper learning setting.

NeurIPS Conference 2016 Conference Paper

On the Recursive Teaching Dimension of VC Classes

  • Xi Chen
  • Yu Cheng
  • Bo Tang

The recursive teaching dimension (RTD) of a concept class $C \subseteq \{0, 1\}^n$, introduced by Zilles et al. [ZLHZ11], is a complexity parameter measured by the worst-case number of labeled examples needed to learn any target concept of $C$ in the recursive teaching model. In this paper, we study the quantitative relation between RTD and the well-known learning complexity measure VC dimension (VCD), and improve the best known upper and (worst-case) lower bounds on the recursive teaching dimension with respect to the VC dimension. Given a concept class $C \subseteq \{0, 1\}^n$ with $VCD(C) = d$, we first show that $RTD(C)$ is at most $d 2^{d+1}$. This is the first upper bound for $RTD(C)$ that depends only on $VCD(C)$, independent of the size of the concept class $|C|$ and its domain size $n$. Before our work, the best known upper bound for $RTD(C)$ is $O(d 2^d \log \log |C|)$, obtained by Moran et al. [MSWY15]. We remove the $\log \log |C|$ factor. We also improve the lower bound on the worst-case ratio of $RTD(C)$ to $VCD(C)$. We present a family of classes $\{ C_k \}_{k \ge 1}$ with $VCD(C_k) = 3k$ and $RTD(C_k)=5k$, which implies that the ratio of $RTD(C)$ to $VCD(C)$ in the worst case can be as large as $5/3$. Before our work, the largest ratio known was $3/2$ as obtained by Kuhlmann [Kuh99]. Since then, no finite concept class $C$ has been known to satisfy $RTD(C) > (3/2) VCD(C)$.
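The VC dimension of a small concept class can be computed by brute force, which makes the quantities in the abstract concrete (RTD needs more machinery and is omitted). A sketch, with concepts represented as bit-tuples over a finite domain:

```python
from itertools import combinations

def vc_dimension(concepts, domain_size):
    """Brute-force VC dimension of a concept class given as bit-tuples."""
    concept_set = set(concepts)

    def shattered(points):
        # every labeling of `points` must be realized by some concept
        patterns = {tuple(c[p] for p in points) for c in concept_set}
        return len(patterns) == 2 ** len(points)

    d = 0
    for k in range(1, domain_size + 1):
        if any(shattered(s) for s in combinations(range(domain_size), k)):
            d = k
        else:
            break  # shattering is monotone, so larger sets cannot work
    return d
```

For example, the class of intervals on a 4-point domain has VC dimension 2: any pair of points can be labeled in all four ways by intervals, but the labeling (1, 0, 1) of three ordered points is unrealizable.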

JMLR Journal 2016 Journal Article

Spectral Methods Meet EM: A Provably Optimal Algorithm for Crowdsourcing

  • Yuchen Zhang
  • Xi Chen
  • Dengyong Zhou
  • Michael I. Jordan

Crowdsourcing is a popular paradigm for effectively collecting labels at low cost. The Dawid-Skene estimator has been widely used for inferring the true labels from the noisy labels provided by non-expert crowdsourcing workers. However, since the estimator maximizes a non-convex log-likelihood function, it is hard to theoretically justify its performance. In this paper, we propose a two-stage efficient algorithm for multi-class crowd labeling problems. The first stage uses the spectral method to obtain an initial estimate of parameters. Then the second stage refines the estimation by optimizing the objective function of the Dawid-Skene estimator via the EM algorithm. We show that our algorithm achieves the optimal convergence rate up to a logarithmic factor. We conduct extensive experiments on synthetic and real datasets. Experimental results demonstrate that the proposed algorithm is comparable to the most accurate empirical approach, while outperforming several other recently proposed methods.

NeurIPS Conference 2016 Conference Paper

VIME: Variational Information Maximizing Exploration

  • Rein Houthooft
  • Xi Chen
  • Yan Duan
  • John Schulman
  • Filip De Turck
  • Pieter Abbeel

Scalable and effective exploration remains a key challenge in reinforcement learning (RL). While there are methods with optimality guarantees in the setting of discrete state and action spaces, these methods cannot be applied in high-dimensional deep RL scenarios. As such, most contemporary RL relies on simple heuristics such as epsilon-greedy exploration or adding Gaussian noise to the controls. This paper introduces Variational Information Maximizing Exploration (VIME), an exploration strategy based on maximization of information gain about the agent's belief of environment dynamics. We propose a practical implementation, using variational inference in Bayesian neural networks which efficiently handles continuous state and action spaces. VIME modifies the MDP reward function, and can be applied with several different underlying RL algorithms. We demonstrate that VIME achieves significantly better performance compared to heuristic exploration methods across a variety of continuous control tasks and algorithms, including tasks with very sparse rewards.

IJCAI Conference 2015 Conference Paper

Clustering Dynamic Spatio-Temporal Patterns in The Presence of Noise and Missing Data

  • Xi Chen
  • James H. Faghmous
  • Ankush Khandelwal
  • Vipin Kumar

Clustering has gained widespread use, especially for static data. However, the rapid growth of spatio-temporal data from numerous instruments, such as earth-orbiting satellites, has created a need for spatio-temporal clustering methods to extract and monitor dynamic clusters. Dynamic spatio-temporal clustering faces two major challenges: first, the clusters are dynamic and may change in size, shape, and statistical properties over time; second, numerous spatio-temporal data are incomplete, noisy, heterogeneous, and highly variable (over space and time). We propose a new spatio-temporal data mining paradigm to autonomously identify dynamic spatio-temporal clusters in the presence of noise and missing data. Our proposed approach is more robust than traditional clustering and image segmentation techniques in the presence of dynamic patterns, non-stationarity, heterogeneity, and missing data. We demonstrate our method's performance on a real-world application of monitoring in-land water bodies on a global scale.

JMLR Journal 2015 Journal Article

Statistical Decision Making for Optimal Budget Allocation in Crowd Labeling

  • Xi Chen
  • Qihang Lin
  • Dengyong Zhou

It has become increasingly popular to obtain machine learning labels through commercial crowdsourcing services. The crowdsourcing workers or annotators are paid for each label they provide, but the task requester usually has only a limited amount of the budget. Since the data instances have different levels of labeling difficulty and the workers have different reliability for the labeling task, it is desirable to wisely allocate the budget among all the instances and workers such that the overall labeling quality is maximized. In this paper, we formulate the budget allocation problem as a Bayesian Markov decision process (MDP), which simultaneously conducts learning and decision making. The optimal allocation policy can be obtained by using the dynamic programming (DP) recurrence. However, DP quickly becomes computationally intractable when the size of the problem increases. To solve this challenge, we propose a computationally efficient approximate policy which is called optimistic knowledge gradient. Our method applies to both pull crowdsourcing marketplaces with homogeneous workers and push marketplaces with heterogeneous workers. It can also incorporate the contextual information of instances when they are available. The experiments on both simulated and real data show that our policy achieves a higher labeling quality than other existing policies at the same budget level.

IS Journal 2014 Journal Article

A Network Evolution Model for Chinese Traditional Acquaintance Networks

  • Xi Chen
  • Lan Zhang
  • Wei Li

The evolution model of Chinese traditional acquaintance relationship networks described in this article emphasizes individual heterogeneity and social culture. The model incorporates three distinct mechanisms that affect acquaintance network evolution and formation: heredity linking, variation linking, and similarity-based disconnection. The authors found that the degree distribution of Chinese traditional acquaintance networks is manifested in a piecewise approximation that combines a power-law form with an exponential cutoff and an exponential distribution. Numerical results indicate that individuals maintaining a medium amount of connections far outweigh others, reflecting the characteristics of Guanxi-centered society. The formation of acquaintance relationship networks is greatly affected by the special Chinese kinship culture. The authors' findings are supported by sociological statistical conclusions and offer a rational explanation for the nature of Chinese kinship networks. Their work provides an adequate framework for further research on dynamic human complex behaviors such as epidemic spreading and rumor propagation.

ICRA Conference 2014 Conference Paper

An inertial-based human motion tracking system with twists and exponential maps

  • Xi Chen
  • Jie Zhang 0074
  • William R. Hamel
  • Jindong Tan

Wearable inertial tracking is well accepted due to its convenience for free-style motion tracking with high accuracy. Traditionally, complicated high-order calculations for human kinematic modeling and inaccurate estimation of sensor placement interfere with the efficiency of real-time tracking. To tackle these challenges, a wearable human motion tracking system is developed by applying twists and exponential maps. When the body segments are articulated by products of exponential maps, joint positions are continuously updated and their rotational angles are represented individually within the global frame, making it more efficient to achieve real-time motion tracking with low-order calculations. Meanwhile, a well-designed calibration procedure makes it convenient to estimate a sensor's position and orientation without prior knowledge of its placement. This paper presents our approach and exemplifies the assessment of the proposed motion tracking system through several tests of limb and full-body motion tracking. Comparisons with the Vicon and OptiTrack motion capture systems verify satisfactorily high accuracy.
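The twist-and-exponential-map machinery the abstract refers to boils down to Rodrigues' formula plus a translation term. A numpy sketch following the product-of-exponentials convention of Murray, Li, and Sastry (an illustration of the mathematics, not the authors' implementation):

```python
import numpy as np

def hat(w):
    """so(3) hat operator: 3-vector -> skew-symmetric matrix."""
    return np.array([[0.0, -w[2], w[1]],
                     [w[2], 0.0, -w[0]],
                     [-w[1], w[0], 0.0]])

def exp_twist(w, v, theta):
    """Exponential map of a unit twist (w, v), |w| = 1, as a 4x4 transform.

    Rodrigues' formula gives the rotation; the standard translation term
    is (I - R)(w x v) + w w^T v theta.
    """
    W = hat(w)
    R = np.eye(3) + np.sin(theta) * W + (1 - np.cos(theta)) * (W @ W)
    p = (np.eye(3) - R) @ (W @ v) + np.outer(w, w) @ v * theta
    g = np.eye(4)
    g[:3, :3] = R
    g[:3, 3] = p
    return g
```

Chaining such transforms, one per joint, gives the low-order forward-kinematics updates the abstract describes.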

NeurIPS Conference 2014 Conference Paper

Spectral Methods meet EM: A Provably Optimal Algorithm for Crowdsourcing

  • Yuchen Zhang
  • Xi Chen
  • Dengyong Zhou
  • Michael Jordan

The Dawid-Skene estimator has been widely used for inferring the true labels from the noisy labels provided by non-expert crowdsourcing workers. However, since the estimator maximizes a non-convex log-likelihood function, it is hard to theoretically justify its performance. In this paper, we propose a two-stage efficient algorithm for multi-class crowd labeling problems. The first stage uses the spectral method to obtain an initial estimate of parameters. Then the second stage refines the estimation by optimizing the objective function of the Dawid-Skene estimator via the EM algorithm. We show that our algorithm achieves the optimal convergence rate up to a logarithmic factor. We conduct extensive experiments on synthetic and real datasets. Experimental results demonstrate that the proposed algorithm is comparable to the most accurate empirical approach, while outperforming several other recently proposed methods.

IROS Conference 2013 Conference Paper

Case studies of a robot enhanced walker for training of children with cerebral palsy

  • Sunil K. Agrawal
  • Jiyeon Kang
  • Xi Chen
  • Mi Jung Kim
  • Youngmyung Lee
  • Sang Won Kong
  • Gyung-Jin Park

Cerebral palsy (CP) is a disorder of movement and posture in children caused by non-progressive insult of the immature brain. The characteristic features are weakness, spasticity, muscle contractures, and poor motor coordination. The gait patterns of children with CP are slow, uncoordinated, and unstable. Our hypothesis is that these impaired children will benefit from robot enhanced walkers to improve their balance, coordination, and speed during gait. In addition, this experience will also impact their clinical scores that relate to their functional performance and caregiver assistance. In this study, we used a specially-designed robotic walker which children used to perform a series of walking tasks, in increasing order of difficulty. This study was performed in 30 training sessions over a period of 3 months. Each training session lasted for 20 minutes. The outcome measures were variables recorded by the robot such as travel distance, average speed, and clinical measured variables that characterize their disability profiles.

UAI Conference 2013 Conference Paper

Evaluating computational models of explanation using human judgments

  • Michael Pacer
  • Joseph Jay Williams
  • Xi Chen
  • Tania Lombrozo
  • Thomas L. Griffiths 0001

We evaluate four computational models of explanation in Bayesian networks by comparing model predictions to human judgments. In two experiments, we present human participants with causal structures for which the models make divergent predictions and either solicit the best explanation for an observed event (Experiment 1) or have participants rate provided explanations for an observed event (Experiment 2). Across two versions of two causal structures and across both experiments, we find that the Causal Explanation Tree and Most Relevant Explanation models provide better fits to human data than either Most Probable Explanation or Explanation Tree models. We identify strengths and shortcomings of these models and what they can reveal about human explanation. We conclude by suggesting the value of pursuing computational and psychological investigations of explanation in parallel.

ICML Conference 2013 Conference Paper

Optimistic Knowledge Gradient Policy for Optimal Budget Allocation in Crowdsourcing

  • Xi Chen
  • Qihang Lin
  • Dengyong Zhou

In real crowdsourcing applications, each label from a crowd usually comes with a certain cost. Given a prefixed amount of budget, since different tasks have different ambiguities and different workers have different levels of expertise, we want to find an optimal way to allocate the budget among instance-worker pairs such that the overall label quality can be maximized. To address this issue, we start from the simplest setting in which all workers are assumed to be perfect. We formulate the problem as a Bayesian Markov Decision Process (MDP). Using the dynamic programming (DP) algorithm, one can obtain the optimal allocation policy for a given budget. However, DP is computationally intractable. To solve the computational challenge, we propose a novel approximate policy which is called optimistic knowledge gradient. It is practically efficient, and its consistency can be guaranteed theoretically. We then extend the MDP framework to deal with inhomogeneous workers and tasks with contextual information available. The experiments on both simulated and real data demonstrate the superiority of our method.
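The defining idea of the optimistic knowledge gradient, scoring each instance by its best-case (rather than expected) one-label improvement, can be sketched for binary labeling with Beta posteriors. The stage reward h(a, b) = max(a, b) / (a + b) below is an illustrative stand-in for the paper's exact reward function.

```python
import numpy as np

def okg_pick(a, b):
    """Optimistic knowledge gradient: pick the instance whose best-case
    one-label improvement is largest.

    a, b: per-instance Beta posterior counts (e.g. prior Beta(1, 1) plus
    observed positive/negative labels).
    """
    a, b = np.asarray(a, float), np.asarray(b, float)

    def h(a, b):
        # illustrative stage reward: accuracy of the majority-label decision
        return np.maximum(a, b) / (a + b)

    gain_pos = h(a + 1, b) - h(a, b)        # improvement if next label is 1
    gain_neg = h(a, b + 1) - h(a, b)        # improvement if next label is 0
    index = np.maximum(gain_pos, gain_neg)  # optimistic, not expected, gain
    return int(np.argmax(index))
```

Under this index an ambiguous instance (counts 5/5) is queried before a confident one (counts 9/1), which matches the intended budget-allocation behavior.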

NeurIPS Conference 2013 Conference Paper

Variance Reduction for Stochastic Gradient Optimization

  • Chong Wang
  • Xi Chen
  • Alexander Smola
  • Eric Xing

Stochastic gradient optimization is a class of widely used algorithms for training machine learning models. To optimize an objective, it uses the noisy gradient computed from the random data samples instead of the true gradient computed from the entire dataset. However, when the variance of the noisy gradient is large, the algorithm might spend much time bouncing around, leading to slower convergence and worse performance. In this paper, we develop a general approach of using control variate for variance reduction in stochastic gradient. Data statistics such as low-order moments (pre-computed or estimated online) are used to form the control variate. We demonstrate how to construct the control variate for two practical problems using stochastic gradient optimization. One is convex---the MAP estimation for logistic regression, and the other is non-convex---stochastic variational inference for latent Dirichlet allocation. On both problems, our approach shows faster convergence and better performance than the classical approach.
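The control-variate recipe is easy to demonstrate on a toy estimator: subtract a correlated quantity with a known low-order moment, keeping the mean unchanged while shrinking the variance. The integrand, control variate, and distribution below are illustrative assumptions, not the paper's experiments.

```python
import numpy as np

rng = np.random.default_rng(0)

# Toy setting: estimate E[g(X)] for g(x) = x**2 with X ~ N(1, 1).
# Control variate h(x) = x has known mean E[h] = 1 (a low-order moment,
# as in the paper's recipe) and is positively correlated with g.
x = rng.normal(1.0, 1.0, 100_000)
g = x ** 2
h = x
a = np.cov(g, h)[0, 1] / np.var(h)  # optimal coefficient a* = Cov(g,h)/Var(h)
g_cv = g - a * (h - 1.0)            # same expectation, lower variance
```

Analytically, Var(g) = 6 and the control variate removes Cov(g, h)^2 / Var(h) = 4 of it, so the corrected estimator has roughly one third of the original variance.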

NeurIPS Conference 2012 Conference Paper

Clustering by Nonnegative Matrix Factorization Using Graph Random Walk

  • Zhirong Yang
  • Tele Hao
  • Onur Dikmen
  • Xi Chen
  • Erkki Oja

Nonnegative Matrix Factorization (NMF) is a promising relaxation technique for clustering analysis. However, conventional NMF methods that directly approximate the pairwise similarities using the least square error often yield mediocre performance for data in curved manifolds because they can capture only the immediate similarities between data samples. Here we propose a new NMF clustering method which replaces the approximated matrix with its smoothed version using random walk. Our method can thus accommodate farther relationships between data samples. Furthermore, we introduce a novel regularization in the proposed objective function in order to improve over spectral clustering. The new learning objective is optimized by a multiplicative Majorization-Minimization algorithm with a scalable implementation for learning the factorizing matrix. Extensive experimental results on real-world datasets show that our method has strong performance in terms of cluster purity.
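The "smoothed version using random walk" can be illustrated with a discounted-walk kernel: replace one-step similarities by a geometric sum of multi-step transition probabilities. The particular kernel below, (1 - alpha)(I - alpha P)^{-1}, is an illustrative personalized-PageRank-style choice, not necessarily the exact smoothing used in the paper.

```python
import numpy as np

def random_walk_smooth(A, alpha=0.8):
    """Smooth a similarity matrix by discounted random-walk reachability.

    P is the row-stochastic transition matrix of A; the geometric series
    (1 - alpha) * sum_t alpha^t P^t = (1 - alpha) * inv(I - alpha P)
    lets similarity flow along the manifold instead of only between
    immediate neighbours.
    """
    d = A.sum(axis=1)
    P = A / d[:, None]
    n = A.shape[0]
    return (1 - alpha) * np.linalg.inv(np.eye(n) - alpha * P)
```

On a 3-node chain graph, the smoothed matrix assigns positive similarity to the two endpoints even though they share no edge, which is exactly the "farther relationships" the abstract mentions; each row remains a probability distribution.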

NeurIPS Conference 2012 Conference Paper

Optimal Regularized Dual Averaging Methods for Stochastic Optimization

  • Xi Chen
  • Qihang Lin
  • Javier Pena

This paper considers a wide spectrum of regularized stochastic optimization problems where both the loss function and regularizer can be non-smooth. We develop a novel algorithm based on the regularized dual averaging (RDA) method, that can simultaneously achieve the optimal convergence rates for both convex and strongly convex loss. In particular, for strongly convex loss, it achieves the optimal rate of $O(\frac{1}{N}+\frac{1}{N^2})$ for $N$ iterations, which improves the best known rate $O(\frac{\log N }{N})$ of previous stochastic dual averaging algorithms. In addition, our method constructs the final solution directly from the proximal mapping instead of averaging of all previous iterates. For widely used sparsity-inducing regularizers (e.g., $\ell_1$-norm), it has the advantage of encouraging sparser solutions. We further develop a multi-stage extension using the proposed algorithm as a subroutine, which achieves the uniformly-optimal rate $O(\frac{1}{N}+\exp\{-N\})$ for strongly convex loss.
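The base method this paper builds on, $\ell_1$-regularized dual averaging (due to Xiao), has a simple closed-form update that shows why dual averaging encourages exact sparsity. A minimal sketch for the plain convex case (not the paper's strongly convex or multi-stage variants):

```python
import numpy as np

def rda_l1(grad_fn, dim, n_iter=500, lam=0.1, gamma=1.0):
    """l1-regularized dual averaging (minimal sketch, convex case).

    Each step solves
      w_{t+1} = argmin_w <gbar_t, w> + lam*||w||_1 + (gamma/sqrt(t))*||w||^2/2
              = -(sqrt(t)/gamma) * soft_threshold(gbar_t, lam),
    where gbar_t is the running average of all past (stochastic) gradients.
    Coordinates whose average gradient stays below lam are exactly zero.
    """
    w = np.zeros(dim)
    gbar = np.zeros(dim)
    for t in range(1, n_iter + 1):
        g = grad_fn(w)
        gbar += (g - gbar) / t                 # running gradient average
        shrunk = np.sign(gbar) * np.maximum(np.abs(gbar) - lam, 0.0)
        w = -(np.sqrt(t) / gamma) * shrunk     # closed-form proximal step
    return w
```

On the deterministic toy loss f(w) = ||w - c||^2 / 2 with c = (2, 0.01), the weak coordinate stays exactly zero while the strong one approaches the soft-thresholded optimum 1.9, illustrating the "sparser solutions from the proximal mapping" point in the abstract.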

ICRA Conference 2011 Conference Paper

A magnetic thin film microrobot with two operating modes

  • Wuming Jing
  • Xi Chen
  • Sean Lyttle
  • Zhenbo Fu
  • Yong Shi
  • David J. Cappelleri

Magnetic principles have proved successful for untethered submillimeter microrobotics, although challenges still exist in areas of propulsion and control. This paper presents the design, analysis, and performance results for a bimorph thin film magnetic microrobot utilizing the magnetostrictive principle as a secondary oscillating operation mode. The microrobot is no larger than 580 μm in its planar dimension and its total thickness is less than 5 μm. As a robot with magnetic material, it can be operated in a pushing/pulling mode in orthogonal directions for movement in a plane, while it is powered by an external magnetic field as low as 1 mT. For the secondary oscillating operation mode utilizing the magnetostrictive principle, in-plane strain is induced, resulting in bending and blocking forces on the robot. These forces are theoretically calculated to prove that enough drive force can be generated in this mode. The design is further abstracted and translated into a piezoelectric cantilever FEM model to confirm the theoretical results. Microrobot fabrication and test-bed development based on this analysis are shown, which enabled us to participate in the final competition of the 2010 NIST Mobile Microrobot Challenge, with good performance in the dash and freestyle events. Finally, we discuss the testing results in various dry and fluid environments along with recommendations for future investigation and improvements. Keywords: microrobot, magnetostrictive, bimorph.

ICRA Conference 2011 Conference Paper

Pedestrian positioning with physical activity classification for indoors

  • Xi Chen
  • Sheng Hu
  • Zhenzhou Shao
  • Jindong Tan

This paper presents a wearable Inertial Measurement Unit pedestrian positioning system for indoors. A Hidden Markov Model (HMM) is introduced to pre-process the sensor data and classify common activities; the HMM is complemented by local minimum angular-rate values for capturing the onset/end of each step. A ZUPT algorithm is implemented to correct the walking velocity during the stance phase of each step when errors exist. A novel acceleration-based approach combined with gyroscope data is developed to achieve better heading estimation. The proposed method reduces drift errors from gyroscopes and avoids electromagnetic disturbance to magnetometers when estimating the subject's position. Experimental results show that the positioning system achieves approximately 99% accuracy.
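The ZUPT correction itself fits in a few lines: integrate acceleration to velocity and reset the accumulated drift whenever the foot is detected in stance. The threshold-based stance detector below is a simplified stand-in for the HMM-based detection described in the abstract, and the 1-D setup is a toy assumption.

```python
import numpy as np

def zupt_velocity(accel, gyro_norm, dt=0.01, gyro_thresh=0.5):
    """Zero-velocity update (ZUPT) on a 1-D velocity estimate.

    accel:     (n,) forward acceleration samples (gravity already removed)
    gyro_norm: (n,) angular-rate magnitude; below `gyro_thresh` ~ stance
    """
    v = np.zeros(len(accel))
    for t in range(1, len(accel)):
        v[t] = v[t - 1] + accel[t] * dt  # dead-reckoning integration
        if gyro_norm[t] < gyro_thresh:   # stance detected: reset the drift
            v[t] = 0.0
    return v
```

With a constant accelerometer bias, the velocity drifts during swing and is clamped back to zero during stance, which is exactly the error-correction mechanism the abstract describes.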

NeurIPS Conference 2010 Conference Paper

Graph-Valued Regression

  • Han Liu
  • Xi Chen
  • Larry Wasserman
  • John Lafferty

Undirected graphical models encode in a graph $G$ the dependency structure of a random vector $Y$. In many applications, it is of interest to model $Y$ given another random vector $X$ as input. We refer to the problem of estimating the graph $G(x)$ of $Y$ conditioned on $X=x$ as "graph-valued regression". In this paper, we propose a semiparametric method for estimating $G(x)$ that builds a tree on the $X$ space just as in CART (classification and regression trees), but at each leaf of the tree estimates a graph. We call the method "Graph-optimized CART", or Go-CART. We study the theoretical properties of Go-CART using dyadic partitioning trees, establishing oracle inequalities on risk minimization and tree partition consistency. We also demonstrate the application of Go-CART to a meteorological dataset, showing how graph-valued regression can provide a useful tool for analyzing complex data.

AAAI Conference 2010 Conference Paper

Learning Spatial-Temporal Varying Graphs with Applications to Climate Data Analysis

  • Xi Chen
  • Yan Liu
  • Han Liu
  • Jaime Carbonell

An important challenge in understanding climate change is to uncover the dependency relationships between various climate observations and forcing factors. Graphical lasso, a recently proposed $\ell_1$-penalty-based structure learning algorithm, has been proven successful for learning underlying dependency structures for data drawn from a multivariate Gaussian distribution. However, climatological data often turn out to be non-Gaussian, e.g., cloud cover, precipitation, etc. In this paper, we examine nonparametric learning methods to address this challenge. In particular, we develop a methodology to learn dynamic graph structures from spatial-temporal data so that the graph structures at adjacent times or locations are similar. Experimental results demonstrate that our method not only recovers the underlying graph well but also captures the smooth variation properties on both synthetic data and climate data.

NeurIPS Conference 2010 Conference Paper

Multivariate Dyadic Regression Trees for Sparse Learning Problems

  • Han Liu
  • Xi Chen

We propose a new nonparametric learning method based on multivariate dyadic regression trees (MDRTs). Unlike traditional dyadic decision trees (DDTs) or classification and regression trees (CARTs), MDRTs are constructed using penalized empirical risk minimization with a novel sparsity-inducing penalty. Theoretically, we show that MDRTs can simultaneously adapt to the unknown sparsity and smoothness of the true regression functions, and achieve the nearly optimal rates of convergence (in a minimax sense) for the class of $(\alpha, C)$-smooth functions. Empirically, MDRTs can simultaneously conduct function estimation and variable selection in high dimensions. To make MDRTs applicable for large-scale learning problems, we propose a greedy heuristic. The superior performance of MDRTs is demonstrated on both synthetic and real datasets.

ICRA Conference 2010 Conference Paper

Training special needs infants to drive mobile robots using force-feedback joystick

  • Sunil K. Agrawal
  • Xi Chen
  • James C. Galloway

In typically developing infants, the onset of crawling and walking is associated with changes across developmental domains such as cognition and perception ([1], [2]). Currently, infants born with significant mobility impairments do not use powered wheelchairs until three years of age [3]. This potentially limits their development in the early growth years. The goal of this research is to train infants with impairments to safely and purposefully drive a mobile robot indoors while being seated on it. We anticipate that these impaired infants will benefit from early mobility in their early years, similar to their healthy peers.

IROS Conference 2009 Conference Paper

An adaptive mobile robots tethering algorithm in constrained environments

  • Xi Chen
  • Jindong Tan

This paper presents an adaptive and decentralized robotic cooperation algorithm for controlling mobile sensors to form a chained network and maintain the communication links. Single-layer and double-layer chain tethering algorithms are developed for exploring open and constrained environments with mobile robots. A comprehensive metric for finding the optimal communication range is introduced; with these measurements, mobile robots can be organized into an optimal chained form for tethering. The tethering algorithm can detect failed nodes and reconfigure the system, offering an adaptive solution to broken communication links.

NeurIPS Conference 2009 Conference Paper

Nonparametric Greedy Algorithms for the Sparse Learning Problem

  • Han Liu
  • Xi Chen

This paper studies the forward greedy strategy in sparse nonparametric regression. For additive models, we propose an algorithm called additive forward regression; for general multivariate regression, we propose an algorithm called generalized forward regression. Both of them simultaneously conduct estimation and variable selection in nonparametric settings for the high dimensional sparse learning problem. Our main emphasis is empirical: on both simulated and real data, these two simple greedy methods can clearly outperform several state-of-the-art competitors, including the LASSO, a nonparametric version of the LASSO called the sparse additive model (SpAM) and a recently proposed adaptive parametric forward-backward algorithm called FoBa. Some theoretical justifications are also provided.
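Forward greedy selection, in its simplest parametric form, can be sketched as follows; each step fits ordinary least squares rather than the paper's per-variable nonparametric smoothers, so this is an illustrative parametric analogue of additive forward regression, not the paper's algorithm.

```python
import numpy as np

def forward_regression(X, y, k):
    """Forward greedy variable selection (parametric sketch).

    At each step, add the variable that most reduces the residual sum of
    squares of a least-squares refit on the selected set.
    """
    n, p = X.shape
    selected = []
    for _ in range(k):
        best_j, best_rss = None, np.inf
        for j in range(p):
            if j in selected:
                continue
            cols = selected + [j]
            beta, *_ = np.linalg.lstsq(X[:, cols], y, rcond=None)
            rss = np.sum((y - X[:, cols] @ beta) ** 2)
            if rss < best_rss:
                best_j, best_rss = j, rss
        selected.append(best_j)
    return selected
```

On well-conditioned synthetic data with a sparse linear signal, two greedy steps recover the true support.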