Arrow Research search

Author name cluster

Fan Yang

Possible papers associated with this exact author name in Arrow. This page groups case-insensitive exact name matches and is not a full identity disambiguation profile.

112 papers
2 author rows

Possible papers (112)

AAAI Conference 2026 Conference Paper

Beyond Euclidean Assumptions: Geometry-Aware Adaptive Routing for Remote Sensing Segmentation

  • Jie Qiu
  • Dizuo Cao
  • Linwei Dai
  • Xin Li
  • Fan Yang
  • Dong Yu
  • Changying Wang
  • Zongheng Wen

Remote sensing imagery poses a distinct challenge for semantic segmentation due to its inherent fractal complexity and the diversity of geometric structures present in real-world geospatial scenes. Euclidean-based models typically assume spatial uniformity; however, such assumptions often break down when confronted with objects exhibiting markedly different structural characteristics—such as roads versus vegetation—thereby complicating the feature representation process. Hyperbolic space offers a theoretically grounded alternative for modeling such hierarchical and heterogeneous patterns, yet fully replacing Euclidean geometry incurs significant computational overhead. We therefore introduce Geometry-Aware Adaptive Routing (GAAR), a novel module that facilitates geometry-aware routing by dynamically allocating high-level features to either Euclidean or Hyperbolic subspaces through a learnable binary gating mechanism, informed by structural priors learned during training. To further promote routing stability and geometric consistency, we introduce Geometry-Aware Deterministic Regularization (GADR), a regularization strategy that encourages confident, structure-aligned assignments. GAAR is plug-and-play and integrates seamlessly into existing segmentation architectures. Experiments on three challenging Remote Sensing Image Semantic Segmentation (RSISS) benchmarks demonstrate that our approach consistently outperforms state-of-the-art (SOTA) methods, particularly in geometrically complex regions, offering a scalable and effective solution to the limitations of purely Euclidean modeling.

AAAI Conference 2026 Conference Paper

Breaking the Stealth-Potency Trade-off in Clean-Image Backdoors with Generative Trigger Optimization

  • Binyan Xu
  • Fan Yang
  • Di Tang
  • Xilin Dai
  • Kehuan Zhang

Clean-image backdoor attacks, which use only label manipulation in training datasets to compromise deep neural networks, pose a significant threat to security-critical applications. A critical flaw in existing methods is that the poison rate required for a successful attack induces a proportional, and thus noticeable, drop in Clean Accuracy (CA), undermining their stealthiness. This paper presents a new paradigm for clean-image attacks that minimizes this accuracy degradation by optimizing the trigger itself. We introduce Generative Clean-Image Backdoors (GCB), a framework that uses a conditional InfoGAN to identify naturally occurring image features that can serve as potent and stealthy triggers. By ensuring these triggers are easily separable from benign task-related features, GCB enables a victim model to learn the backdoor from an extremely small set of poisoned examples, resulting in a CA drop of less than 1%. Our experiments demonstrate GCB's remarkable versatility, successfully adapting to six datasets, five architectures, and four tasks, including the first demonstration of clean-image backdoors in regression and segmentation. GCB also exhibits resilience against most of the existing backdoor defenses.

AAAI Conference 2026 Conference Paper

Catastrophic Forgetting in Kolmogorov-Arnold Networks

  • Mohammad Marufur Rahman
  • Guanchu Wang
  • Kaixiong Zhou
  • Minghan Chen
  • Fan Yang

Catastrophic forgetting is a longstanding challenge in continual learning, where models lose knowledge from earlier tasks when learning new ones. While various mitigation strategies have been proposed for Multi-Layer Perceptrons (MLPs), recent architectural advances like Kolmogorov-Arnold Networks (KANs) have been suggested to offer intrinsic resistance to forgetting by leveraging localized spline-based activations. However, the practical behavior of KANs under continual learning remains unclear, and their limitations are not well understood. To address this, we present a comprehensive study of catastrophic forgetting in KANs and develop a theoretical framework that links forgetting to activation support overlap and intrinsic data dimension. We validate these analyses through systematic experiments on synthetic and vision tasks, measuring forgetting dynamics under varying model configurations and data complexity. Further, we introduce KAN-LoRA, a novel adapter design for parameter-efficient continual fine-tuning of language models, and evaluate its effectiveness in knowledge editing tasks. Our findings reveal that while KANs exhibit promising retention in low-dimensional algorithmic settings, they remain vulnerable to forgetting in high-dimensional domains such as image classification and language modeling. These results advance the understanding of KANs’ strengths and limitations, offering practical insights for continual learning system design.

AAAI Conference 2026 Conference Paper

FilmSceneDesigner: Chaining Set Design for Procedural Film Scene Generation

  • Zhifeng Xie
  • Keyi Zhang
  • Yiye Yan
  • Yuling Guo
  • Fan Yang
  • Jiting Zhou
  • Mengtian Li

Film set design plays a pivotal role in cinematic storytelling and shaping the visual atmosphere. However, the traditional process depends on expert-driven manual modeling, which is labor-intensive and time-consuming. To address this issue, we introduce FilmSceneDesigner, an automated scene generation system that emulates the professional film set design workflow. Given a natural language description, including scene type, historical period, and style, we design an agent-based chaining framework to generate structured parameters aligned with the film set design workflow, guided by prompt strategies that ensure parameter accuracy and coherence. In parallel, we propose a procedural generation pipeline which executes a series of dedicated functions with the structured parameters for floorplan and structure generation, material assignment, door and window placement, and object retrieval and layout, ultimately constructing a complete film scene from scratch. Moreover, to enhance cinematic realism and asset diversity, we construct SetDepot-Pro, a curated dataset of 6,862 film-specific 3D assets and 733 materials. Experimental results and human evaluations demonstrate that our system produces structurally sound scenes with strong cinematic fidelity, supporting downstream tasks such as virtual previs, construction drawing and mood board creation.

AAAI Conference 2026 Conference Paper

GeM-VG: Towards Generalized Multi-image Visual Grounding with Multimodal Large Language Models

  • Shurong Zheng
  • Yousong Zhu
  • Hongyin Zhao
  • Fan Yang
  • Yufei Zhan
  • Ming Tang
  • Jinqiao Wang

Multimodal Large Language Models (MLLMs) have demonstrated impressive progress in single-image grounding and general multi-image understanding. Recently, some methods begin to address multi-image grounding. However, they are constrained by single-target localization and limited types of practical tasks, due to the lack of unified modeling for generalized grounding tasks. Therefore, we propose GeM-VG, an MLLM capable of Generalized Multi-image Visual Grounding. To support this, we systematically categorize and organize existing multi-image grounding tasks according to cognitive demands and introduce the MG-Data-240K dataset, addressing the limitations of existing datasets regarding target quantity and image relation. To tackle the challenges of robustly handling diverse multi-image grounding tasks, we further propose a hybrid reinforcement finetuning strategy that integrates chain-of-thought (CoT) reasoning and direct answering, considering their complementary strengths. This strategy adopts an R1-like algorithm guided by a carefully designed rule-based reward, effectively enhancing the model’s overall perception and reasoning capabilities. Extensive experiments demonstrate the superior generalized grounding capabilities of our model. For multi-image grounding, it outperforms the previous leading MLLMs by 2.0% and 9.7% on MIG-Bench and MC-Bench, respectively. In single-image grounding, it achieves a 9.1% improvement over the base model on ODINW. Furthermore, our model retains strong capabilities in general multi-image understanding.

AAAI Conference 2026 System Paper

KnowThyself: An Agentic Assistant for LLM Interpretability

  • Suraj Prasai
  • Mengnan Du
  • Ying Zhang
  • Fan Yang

We develop KnowThyself, an agentic assistant that advances large language model (LLM) interpretability. Existing tools provide useful insights but remain fragmented and code-intensive. KnowThyself consolidates these capabilities into a chat-based interface, where users can upload models, pose natural language questions, and obtain interactive visualizations with guided explanations. At its core, an orchestrator LLM first reformulates user queries, an agent router further directs them to specialized modules, and the outputs are finally contextualized into coherent explanations. This design lowers technical barriers and provides an extensible platform for LLM inspection. By embedding the whole process into a conversational workflow, KnowThyself offers a robust foundation for accessible LLM interpretability.

JBHI Journal 2026 Journal Article

MoACNN-XGNet: Interpretable Multi-Omics Convolutional Network for Breast Cancer Subtyping and Prognostic Genes Identification

  • Qian Li
  • Lei Liu
  • Qing Zhang
  • Xiaobin Zhang
  • Na Li
  • Yaoyao Zhao
  • Jiayi Teng
  • Fuzhong Xue

Breast cancer, a highly heterogeneous disease at both the phenotypic and molecular levels, presents significant challenges for prognosis and treatment. Accurate subtyping of breast cancer is critical due to its complex biological characteristics, which directly influence disease progression and therapeutic outcomes. In this study, we integrate multi-omics data, including copy number variation, RNA sequencing, and DNA methylation, to generate two-dimensional representations of each sample using Uniform Manifold Approximation and Projection. This transformation enhances data interpretability and supports subsequent learning tasks. Traditional convolutional neural networks have demonstrated potential in medical image analysis but often struggle with high-dimensional omics data. To address this limitation, we propose MoACNN-XGNet, an attention-based convolutional neural network framework that prioritizes key features within image-transformed multi-omics data. Our method significantly improves the precision of subtype classification and effectively overcomes the challenges posed by the high dimensionality and structural complexity of multi-omics data. Furthermore, we employ the Guided Grad-CAM method to enhance model interpretability, enabling the identification of subtype-specific explainable genes. Subsequent enrichment and survival analyses of these genes reveal critical biological pathways and potential therapeutic targets. This study offers a novel approach to refining breast cancer subtyping and highlights the potential for personalized treatment strategies, ultimately aiming to improve patient survival outcomes.

AAAI Conference 2026 Conference Paper

TIME: Temporal-Sensitive Multi-Dimensional Instruction Tuning and Robust Benchmarking for Video-LLMs

  • Yunxiao Wang
  • Meng Liu
  • Wenqi Liu
  • Xuemeng Song
  • Bin Wen
  • Fan Yang
  • Tingting Gao
  • Di Zhang

Video large language models have achieved remarkable performance in tasks such as video question answering; however, their temporal understanding remains suboptimal. To address this limitation, we curate a dedicated instruction fine-tuning dataset that focuses on enhancing temporal comprehension across five key dimensions. In order to reduce reliance on costly temporal annotations, we introduce a multi-task prompt fine-tuning approach that seamlessly integrates temporal-sensitive tasks into existing instruction datasets without requiring additional annotations. Furthermore, we develop a novel benchmark for temporal-sensitive video understanding that not only fills the gaps in dimension coverage left by existing benchmarks but also rigorously filters out potential shortcuts, ensuring a more accurate evaluation. Extensive experimental results demonstrate that our approach significantly enhances the temporal understanding of video-LLMs while avoiding reliance on shortcuts.

AAAI Conference 2026 Conference Paper

UDCH: Unsupervised Dynamic Weighted Cluster-cooperative Hashing for Cross-modal Retrieval

  • Yuanzhi Zhao
  • Fan Yang
  • Yudong Zhao
  • Xiaoyu Li

In cross-modal retrieval tasks, unsupervised hash code learning still faces key challenges, including the difficulty of modeling shared semantic structures across modalities and the inability to adaptively balance multiple supervision objectives during optimization. To address these issues, we propose a novel Unsupervised Dynamic Weighted Cluster-Cooperative Hashing (UDCH) framework, which jointly models feature-level alignment and cluster-level semantic structure to guide consistency learning across modalities under label-free conditions. Specifically, we design an instance-level contrastive loss in the feature branch to align the embedding spaces of images and texts, while employing K-Means clustering to generate pseudo-labels and construct a cluster-center contrast mechanism that captures semantic grouping information. Furthermore, we integrate cross-modal feature similarity to construct a high-order structure matrix, enabling fine-grained structural supervision. To enhance the synergy of multi-objective optimization, we introduce a dynamic weighting strategy that adaptively adjusts the contributions of the feature and cluster branches based on the degree of modal alignment and semantic compactness. Extensive experiments on multiple cross-modal retrieval benchmarks demonstrate that UDCH achieves superior semantic alignment and retrieval performance under unsupervised settings, validating the effectiveness of multi-level semantic modeling and adaptive collaboration mechanisms in unsupervised hashing tasks.

AAAI Conference 2025 Conference Paper

3DHumanEdit: Multi-modal Body Part-aware Conditioning Information Integration for 3D Human Manipulation

  • FeiFan Xu
  • Tianyi Chen
  • Fan Yang
  • Yunfei Zhang
  • Si Wu

The rapid advancement of 3D Generative Adversarial Networks (GANs) has significantly enhanced the diversity and quality of generated 3D images. Despite these breakthroughs, the manipulation capabilities of 3D GANs remain unexplored, presenting substantial challenges for practical applications where user interaction and modification are essential. Current manipulation methods often lack the precision needed for fine-grained attribute manipulation, and struggle to maintain multi-view consistency during the editing process. To address these limitations, we propose 3DHumanEdit, a novel approach for 3D human body part-aware manipulation. 3DHumanEdit leverages multi-modal feature fusion and body part-aware feature alignment to achieve precise manipulation of individual body parts based on detailed text inputs and segmentation images. By exploring 3D prior for accurate editing and enforcing correspondence in latent space, 3DHumanEdit ensures coherence across multiple views. Experiments demonstrate that 3DHumanEdit outperforms existing methods in both editing fidelity and multi-view consistency, offering a robust solution for fine-grained 3D manipulation.

AAAI Conference 2025 Conference Paper

Contrasting Adversarial Perturbations: The Space of Harmless Perturbations

  • Lu Chen
  • Shaofeng Li
  • Benhao Huang
  • Fan Yang
  • Zheng Li
  • Jie Li
  • Yuan Luo

Existing works have extensively studied adversarial examples, which are minimal perturbations that can mislead the output of deep neural networks (DNNs) while remaining imperceptible to humans. However, in this work, we reveal the existence of a harmless perturbation space, in which perturbations drawn from this space, regardless of their magnitudes, leave the network output unchanged when applied to inputs. Essentially, the harmless perturbation space emerges from the usage of non-injective functions (linear or non-linear layers) within DNNs, enabling multiple distinct inputs to be mapped to the same output. For linear layers with input dimensions exceeding output dimensions, any linear combination of the orthogonal bases of the nullspace of the parameter consistently yields no change in their output. For non-linear layers, the harmless perturbation space may expand, depending on the properties of the layers and input samples. Inspired by this property of DNNs, we solve for a family of general perturbation spaces that are redundant for the DNN's decision, and can be used to hide sensitive data and serve as a means of model identification. Our work highlights the distinctive robustness of DNNs (i.e., consistency under large magnitude perturbations) in contrast to adversarial examples (vulnerability for small noises).

NeurIPS Conference 2025 Conference Paper

Decoupling Contrastive Decoding: Robust Hallucination Mitigation in Multimodal Large Language Models

  • Wei Chen
  • Xin Yan
  • Bin Wen
  • Fan Yang
  • Tingting Gao
  • Di Zhang
  • Long Chen

Although multimodal large language models (MLLMs) exhibit remarkable reasoning capabilities on complex multimodal understanding tasks, they still suffer from the notorious 'hallucination' issue: generating outputs misaligned with obvious visual or factual evidence. Currently, training-based solutions, like direct preference optimization (DPO), leverage paired preference data to suppress hallucinations. However, they risk sacrificing general reasoning capabilities due to likelihood displacement. Meanwhile, training-free solutions, like contrastive decoding, achieve this goal by subtracting the estimated hallucination pattern from a distorted input. Yet, these handcrafted perturbations (e.g., adding noise to images) may poorly capture authentic hallucination patterns. To avoid these weaknesses of existing methods, and realize "robust" hallucination mitigation (i.e., maintaining general reasoning performance), we propose a novel framework: Decoupling Contrastive Decoding (DCD). Specifically, DCD decouples the learning of positive and negative samples in preference datasets, and trains separate positive and negative image projections within the MLLM. The negative projection implicitly models real hallucination patterns, which enables vision-aware negative images in the contrastive decoding inference stage. Our DCD alleviates likelihood displacement by avoiding pairwise optimization and generalizes robustly without handcrafted degradation. Extensive ablations across hallucination benchmarks and general reasoning tasks demonstrate the effectiveness of DCD, i.e., it matches DPO's hallucination suppression while preserving general capabilities and outperforms handcrafted contrastive decoding methods.

IJCAI Conference 2025 Conference Paper

Detection and Geographic Localization of Natural Objects in the Wild: A Case Study on Palms

  • Kangning Cui
  • Rongkun Zhu
  • Manqi Wang
  • Wei Tang
  • Gregory D. Larsen
  • Victor P. Pauca
  • Sarra Alqahtani
  • Fan Yang

Palms are ecologically and economically important indicators of tropical forest health, biodiversity, and human impact that support local economies and global forest product supply chains. While palm detection in plantations is well-studied, efforts to map naturally occurring palms in dense forests remain limited by overlapping crowns, uneven shading, and heterogeneous landscapes. We develop PRISM (Processing, Inference, Segmentation, and Mapping), a flexible pipeline for detecting and localizing palms in dense tropical forests using large orthomosaic images. Orthomosaics are created from thousands of aerial images and can span several to hundreds of gigabytes. Our contributions are threefold. First, we construct a large UAV-derived orthomosaic dataset collected across 21 ecologically diverse sites in western Ecuador, annotated with 8,830 bounding boxes and 5,026 palm center points. Second, we evaluate multiple state-of-the-art object detectors based on efficiency and performance, integrating zero-shot SAM 2 as the segmentation backbone, and refining the results for precise geographic mapping. Third, we apply calibration methods to align confidence scores with IoU and explore saliency maps for feature explainability. Though optimized for palms, PRISM is adaptable for identifying other natural objects, such as eastern white pines. Future work will explore transfer learning for lower-resolution datasets (0.5–1 m). Data and code can be found at github.com/Zippppo/PRISM.

IJCAI Conference 2025 Conference Paper

Dynamic Multiple High-order Correlations Fusion with Noise Filtering for Incomplete Multi-view Noisy-label Learning

  • Kaixiang Wang
  • Xiaojian Ding
  • Fan Yang

Multi-view multi-label data often suffers from incomplete feature views and label noise. This paper is the first to address both challenges simultaneously, rectifying critical deficiencies in existing methodologies that inadequately extract and fuse high-order structural correlations across views while lacking robust solutions to mitigate label noise. We introduce a dynamic multiple high-order correlations fusion with noise filtering, specifically designed for incomplete multi-view noisy-label learning. By capitalizing on a dynamic multi-hypergraph neural network, inspired by the principles of ensemble learning, we adeptly capture and integrate high-order correlations among samples from different views. The model's capability is further augmented through an innovative hypergraph fusion technique based on random walk theory, which empowers it to seamlessly amalgamate both structural and feature information. Moreover, we propose sophisticated noise-filtering matrices that are tightly embedded within the hypergraph neural network, devised to counteract the detrimental impact of label noise. Recognizing that label noise perturbs the data distribution in the label space, these filtering matrices exploit the distributional disparities between feature and label spaces. The high-order structural information derived from both domains underpins the learning and efficacy of the noise-filtering matrices. Empirical evaluations on benchmark datasets unequivocally demonstrate that our method significantly outperforms contemporary state-of-the-art techniques.

JBHI Journal 2025 Journal Article

Effectiveness Evaluation for Clinical Depression Detection Using Deep Learning Based Synthetic House-Tree-Person Test

  • Zhuolong Chen
  • Xiaoqing Yin
  • Fan Yang
  • Xiaofan Li
  • Zixuan Zhao
  • Xueying Li
  • Jianghu Liu
  • Yubin Zhao

Depression is one of the most common mood disorders, and the number of patients has increased significantly in recent years. Due to the lack of biomarkers, conversation between patients and psychiatrists is still the main clinical diagnostic method, which is easily influenced by the subjectivity of both patients and psychiatrists. The Synthetic House-Tree-Person test (S-HTP), a convenient and efficient mental assessment tool, minimizes subjective influences from patients, but its effectiveness is limited by the professional ability of the analyst. Here we introduce DeHTP, a flexible and convenient deep learning model for depression detection based on the S-HTP that requires no interaction between people. Experimental results demonstrate that DeHTP achieves 0.963 AUC and 0.9 accuracy, outperforming conventional manual analysis of the S-HTP conducted according to a guideline of 50 depression-related conclusions from previous studies. In addition, from the perspective of our proposed model, it reveals 22 depression-correlated drawing features aligned with those conclusions. Leveraging the advantages of deep learning and the S-HTP, this approach has the potential for widespread adoption as a tool for daily mental self-monitoring, as well as a promising auxiliary diagnostic method in clinical practice.

JBHI Journal 2025 Journal Article

EGA-Ploc: An Efficient Global-Local Attention Model for Multi-Label Protein Subcellular Localization Prediction on the Immunohistochemistry Images

  • Boyang Wan
  • Xiaoyang Huang
  • Yang Qiao
  • Jiajie Peng
  • Fan Yang

Protein subcellular localization (PSL) is central to unraveling protein functions and disease mechanisms in bioinformatics. Immunohistochemistry (IHC) images serve as rich sources of high-resolution visual cues for PSL prediction. However, conventional deep learning approaches face critical limitations: whole-image models suffer irreversible fine-grained detail loss during downsampling, while patch-based methods lack effective global context integration. Additionally, the long-tailed class distribution in PSL datasets exacerbates performance degradation for underrepresented classes. To address these challenges, we present EGA-Ploc, a framework employing a linear attention mechanism optimized for high-resolution IHC images. This mechanism enables efficient global and local feature modeling with near-linear computational complexity, facilitating end-to-end processing of original images without resolution loss. Moreover, we propose an adaptive multi-label loss function that integrates zero-bounded log-sum-exp constraints with dynamic class-weighted compensation to mitigate dataset imbalance. Consequently, our EGA-Ploc achieves competitive performance across multiple PSL benchmarks while maintaining computational efficiency superior to existing methods. Through extensive visualization analysis, we further investigate the generalizability of off-the-shelf computer vision models in PSL, uncovering interpretable insights into their subcellular localization mechanisms.

NeurIPS Conference 2025 Conference Paper

FOCUS: Unified Vision-Language Modeling for Interactive Editing Driven by Referential Segmentation

  • Fan Yang
  • Yousong Zhu
  • Xin Li
  • Yufei Zhan
  • Hongyin Zhao
  • Shurong Zheng
  • Yaowei Wang
  • Ming Tang

Recent Large Vision Language Models (LVLMs) demonstrate promising capabilities in unifying visual understanding and generative modeling, enabling both accurate content understanding and flexible editing. However, current approaches treat "what to see" and "how to edit" separately: they either perform isolated object segmentation or utilize segmentation masks merely as conditional prompts for local edit generation tasks, often relying on multiple disjointed models. To bridge these gaps, we introduce FOCUS, a unified LVLM that integrates segmentation-aware perception and controllable object-centric generation within an end-to-end framework. FOCUS employs a dual-branch visual encoder to simultaneously capture global semantic context and fine-grained spatial details. In addition, we leverage a MoVQGAN-based visual tokenizer to produce discrete visual tokens that enhance generation quality. To enable accurate and controllable image editing, we propose a progressive multi-stage training pipeline, where segmentation masks are jointly optimized and used as spatial condition prompts to guide the diffusion decoder. This strategy aligns visual encoding, segmentation, and generation modules, effectively bridging segmentation-aware perception with fine-grained visual synthesis. Extensive experiments across three core tasks, including multimodal understanding, referring segmentation accuracy, and controllable image generation, demonstrate that FOCUS achieves strong performance by jointly optimizing visual perception and generative capabilities.

ICLR Conference 2025 Conference Paper

In vivo cell-type and brain region classification via multimodal contrastive learning

  • Han Yu
  • Hanrui Lyu
  • YiXun Xu
  • Charlie Windolf
  • Eric Kenji Lee
  • Fan Yang
  • Andrew M. Shelton
  • Olivier Winter

Current electrophysiological approaches can track the activity of many neurons, yet it is usually unknown which cell-types or brain areas are being recorded without further molecular or histological analysis. Developing accurate and scalable algorithms for identifying the cell-type and brain region of recorded neurons is thus crucial for improving our understanding of neural computation. In this work, we develop a multimodal contrastive learning approach for neural data that can be fine-tuned for different downstream tasks, including inference of cell-type and brain location. We utilize multimodal contrastive learning to jointly embed the activity autocorrelations and extracellular waveforms of individual neurons. We demonstrate that our embedding approach, Neuronal Embeddings via MultimOdal Contrastive Learning (NEMO), paired with supervised fine-tuning, achieves state-of-the-art cell-type classification for two opto-tagged datasets and brain region classification for the public International Brain Laboratory Brain-wide Map dataset. Our method represents a promising step towards accurate cell-type and brain region classification from electrophysiological recordings.

AAAI Conference 2025 Conference Paper

Language Ranker: A Metric for Quantifying LLM Performance Across High and Low-Resource Languages

  • Zihao Li
  • Yucheng Shi
  • Zirui Liu
  • Fan Yang
  • Ali Payani
  • Ninghao Liu
  • Mengnan Du

The development of Large Language Models (LLMs) relies on extensive text corpora, which are often unevenly distributed across languages. This imbalance results in LLMs performing significantly better on high-resource languages like English, German, and French, while their capabilities in low-resource languages remain inadequate. Currently, there is a lack of quantitative methods to evaluate the performance of LLMs in these low-resource languages. To address this gap, we propose the Language Ranker, an intrinsic metric designed to benchmark and rank languages based on LLM performance using internal representations. By comparing the LLM's internal representation of various languages against a baseline derived from English, we can assess the model's multilingual capabilities in a robust and language-agnostic manner. Our analysis reveals that high-resource languages exhibit higher similarity scores with English, demonstrating superior performance, while low-resource languages show lower similarity scores, underscoring the effectiveness of our metric in assessing language-specific capabilities. Besides, the experiments show that there is a strong correlation between the LLM’s performance in different languages and the proportion of those languages in its pre-training corpus. These insights underscore the efficacy of the Language Ranker as a tool for evaluating LLM performance across different languages, particularly those with limited resources.

NeurIPS Conference 2025 Conference Paper

LiveStar: Live Streaming Assistant for Real-World Online Video Understanding

  • Zhenyu Yang
  • Kairui Zhang
  • Yuhang Hu
  • Bing Wang
  • Shengsheng Qian
  • Bin Wen
  • Fan Yang
  • Tingting Gao

Despite significant progress in Video Large Language Models (Video-LLMs) for offline video understanding, existing online Video-LLMs typically struggle to simultaneously process continuous frame-by-frame inputs and determine optimal response timing, often compromising real-time responsiveness and narrative coherence. To address these limitations, we introduce LiveStar, a pioneering live streaming assistant that achieves always-on proactive responses through adaptive streaming decoding. Specifically, LiveStar incorporates: (1) a training strategy enabling incremental video-language alignment for variable-length video streams, preserving temporal consistency across dynamically evolving frame sequences; (2) a response-silence decoding framework that determines optimal proactive response timing via a single forward pass verification; (3) memory-aware acceleration via peak-end memory compression for online inference on 10+ minute videos, combined with a streaming key-value cache to achieve 1.53× faster inference. We also construct OmniStar, a comprehensive dataset for training and benchmarking that encompasses 15 diverse real-world scenarios and 5 evaluation tasks for online video understanding. Extensive experiments across three benchmarks demonstrate LiveStar's state-of-the-art performance, achieving an average 19.5% improvement in semantic correctness with 18.1% reduced timing difference compared to existing online Video-LLMs, while improving FPS by 12.0% across all five OmniStar tasks. Our model and dataset can be accessed at https://github.com/yzy-bupt/LiveStar.

JBHI Journal 2025 Journal Article

MADRNet: Morphology-Aware Dual-Path Reversible Network for Sperm Classification

  • Fan Yang
  • Jingzhang Sun
  • Honglan Huang
  • Liang Zhang
  • Jiheng Zhang

Sperm morphology analysis plays a crucial role in the clinical diagnosis of male infertility. However, manual evaluation is inherently subjective, and inconsistencies in diagnostic criteria may compromise accuracy. Existing sperm image classification models have been introduced but often require manual intervention, and most do not consider the alignment between computational classification and WHO sperm morphology standards. To address these challenges, we propose an innovative morphology-aware dual-path reversible network (MADRNet). We integrate key biomarkers, such as head aspect ratio and acrosomal integrity, both of which are crucial for clinical sperm assessment, into the network. In particular, the network utilizes a dual-path attention mechanism, incorporating both parallel spatial and channel attention, while embedding the acrosome anatomical constraint within the channel attention. To further enhance the alignment of our model with the WHO standards, we develop a dynamic loss function incorporating a head aspect ratio constraint. Furthermore, we employ a reversible architecture that preserves more microscopic detail while reducing GPU memory consumption. Experiments on the HuSHeM dataset demonstrate that the model achieves an accuracy of 96.3% and an F1 score of 96.8%. Meanwhile, the model maintains a real-time processing speed of 32 ms per image, providing a precise and efficient solution for clinical sperm screening. The implementation source code and the underlying dataset are available at https://github.com/fanyangZK/MADRNet.
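A loss with a head aspect ratio constraint, as described above, can be illustrated as a classification loss plus a morphology penalty. The bounds `lo`/`hi` and the weight `lam` below are illustrative assumptions, not the paper's actual constants or loss form.

```python
import numpy as np

def aspect_ratio_penalty(ratios, lo=1.3, hi=1.8):
    """Hinge penalty for head aspect ratios outside an assumed
    WHO-style normal range [lo, hi] (illustrative bounds)."""
    r = np.asarray(ratios, dtype=float)
    return float(np.mean(np.maximum(0.0, lo - r) + np.maximum(0.0, r - hi)))

def total_loss(ce_loss, ratios, lam=0.1):
    # Combined objective: classification loss + morphology constraint.
    return ce_loss + lam * aspect_ratio_penalty(ratios)

# A ratio inside the range adds no penalty; ratios outside it do.
print(total_loss(0.5, [1.5, 2.0, 1.1]))
```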

AAAI Conference 2025 Conference Paper

NaFV-Net: An Adversarial Four-view Network for Mammogram Classification

  • Feng Lu
  • Yuxiang Hou
  • Wei Li
  • Xiangying Yang
  • Haibo Zheng
  • Wenxi Luo
  • Leqing Chen
  • Yuyang Cao

Breast cancer remains a leading cause of mortality among women, with millions of new cases diagnosed annually. Early detection through screening is crucial. Using neural networks to improve the accuracy of breast cancer screening has become increasingly important. In accordance with radiologists' practices, we proposed using images from the unaffected side to create adversarial samples with critical medical implications in our adversarial learning process. By introducing beneficial perturbations, this method aims to reduce overconfidence and improve the precision and robustness of breast cancer classification. Our proposed framework is an adversarial quadruple-view classification network (NaFV-Net) incorporating images from both affected and unaffected perspectives. By comprehensively capturing local and global information and implementing adversarial learning from four mammography views, this framework allows for the fusion of features and the integration of medical principles and radiologist evaluation techniques, thus facilitating the accurate identification and characterization of breast tissues. Extensive experiments have shown the high effectiveness of our model in accurately distinguishing between benign and malignant findings, demonstrating state-of-the-art classification performance on both internal and public datasets.

JMLR Journal 2025 Journal Article

Precise High-Dimensional Asymptotics for Quantifying Heterogeneous Transfers

  • Fan Yang
  • Hongyang R. Zhang
  • Sen Wu
  • Christopher Re
  • Weijie J. Su

The problem of learning one task using samples from another task is central to transfer learning. In this paper, we focus on answering the following question: when does combining the samples from two related tasks perform better than learning with one target task alone? This question is motivated by an empirical phenomenon known as negative transfer, often observed in transfer learning practice. While the transfer effect from one task to another depends on factors such as their sample sizes and the spectrum of their covariance matrices, precisely quantifying this dependence has remained a challenging problem. In order to compare a transfer learning estimator to single-task learning, one needs to compare the risks of the two estimators precisely. Further, the comparison depends on the distribution shifts between the two tasks. This paper applies recent developments in random matrix theory to tackle this challenge in a high-dimensional linear regression setting with two tasks. We provide precise high-dimensional asymptotics for the bias and variance of a classical hard parameter sharing (HPS) estimator in the proportional limit, where the sample sizes of both tasks increase proportionally with dimension at fixed ratios. The precise asymptotics apply to various types of distribution shifts, including covariate shifts, model shifts, and combinations of both. We illustrate these results in a random-effects model to mathematically prove a phase transition from positive to negative transfer as the number of source task samples increases. One insight from the analysis is that a rebalanced HPS estimator, which downsizes the source task when the model shift is high, achieves the minimax optimal rate. The finding regarding the phase transition also applies to multiple tasks when feature covariates are shared across all tasks. Simulations validate the accuracy of the high-dimensional asymptotics for finite dimensions.
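Under an assumed two-task linear regression setup with designs $X_1, X_2$ and responses $y_1, y_2$ (target task first), the hard parameter sharing estimator analyzed above is standardly written as the pooled least-squares fit; this is a common formulation, not necessarily the paper's exact notation:

```latex
\hat{\beta}_{\mathrm{HPS}}
  = \arg\min_{\beta}\; \|y_1 - X_1\beta\|_2^2 + \|y_2 - X_2\beta\|_2^2
  = \left(X_1^{\top}X_1 + X_2^{\top}X_2\right)^{-1}
    \left(X_1^{\top}y_1 + X_2^{\top}y_2\right),
\qquad
\hat{\beta}_{\lambda}
  = \arg\min_{\beta}\; \|y_1 - X_1\beta\|_2^2 + \lambda\,\|y_2 - X_2\beta\|_2^2 .
```

The rebalanced variant $\hat{\beta}_{\lambda}$ with $\lambda \in (0,1]$ downweights the source task, matching the abstract's remark that downsizing the source task when the model shift is large achieves the minimax optimal rate.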

ICLR Conference 2025 Conference Paper

Proving Olympiad Inequalities by Synergizing LLMs and Symbolic Reasoning

  • Zenan Li
  • Zhaoyu Li
  • Wen Tang
  • Xian Zhang
  • Yuan Yao 0001
  • Xujie Si
  • Fan Yang
  • Kaiyu Yang

Large language models (LLMs) can prove mathematical theorems formally by generating proof steps (a.k.a. tactics) within a proof system. However, the space of possible tactics is vast and complex, while the available training data for formal proofs is limited, posing a significant challenge to LLM-based tactic generation. To address this, we introduce a neuro-symbolic tactic generator that synergizes the mathematical intuition learned by LLMs with domain-specific insights encoded by symbolic methods. The key aspect of this integration is identifying which parts of mathematical reasoning are best suited to LLMs and which to symbolic methods. While the high-level idea of neuro-symbolic integration is broadly applicable to various mathematical problems, in this paper, we focus specifically on Olympiad inequalities (Figure 1). We analyze how humans solve these problems and distill the techniques into two types of tactics: (1) scaling, handled by symbolic methods, and (2) rewriting, handled by LLMs. In addition, we combine symbolic tools with LLMs to prune and rank the proof goals for efficient proof search. We evaluate our framework on 161 challenging inequalities from multiple mathematics competitions, achieving state-of-the-art performance and significantly outperforming existing LLM and symbolic approaches without requiring additional training data.

NeurIPS Conference 2025 Conference Paper

Puppeteer: Rig and Animate Your 3D Models

  • Chaoyue Song
  • Xiu Li
  • Fan Yang
  • Zhongcong XU
  • Jiacheng Wei
  • Fayao Liu
  • Jiashi Feng
  • Guosheng Lin

Modern interactive applications increasingly demand dynamic 3D content, yet the transformation of static 3D models into animated assets constitutes a significant bottleneck in content creation pipelines. While recent advances in generative AI have revolutionized static 3D model creation, rigging and animation continue to depend heavily on expert intervention. We present Puppeteer, a comprehensive framework that addresses both automatic rigging and animation for diverse 3D objects. Our system first predicts plausible skeletal structures via an auto-regressive transformer that introduces a joint-based tokenization strategy for compact representation and a hierarchical ordering methodology with stochastic perturbation that enhances bidirectional learning capabilities. It then infers skinning weights via an attention-based architecture incorporating topology-aware joint attention that explicitly encodes inter-joint relationships based on skeletal graph distances. Finally, we complement these rigging advances with a differentiable optimization-based animation pipeline that generates stable, high-fidelity animations while being computationally more efficient than existing approaches. Extensive evaluations across multiple benchmarks demonstrate that our method significantly outperforms state-of-the-art techniques in both skeletal prediction accuracy and skinning quality. The system robustly processes diverse 3D content, ranging from professionally designed game assets to AI-generated shapes, producing temporally coherent animations that eliminate the jittering issues common in existing methods.

NeurIPS Conference 2025 Conference Paper

ReSearch: Learning to Reason with Search for LLMs via Reinforcement Learning

  • Mingyang Chen
  • Linzhuang Sun
  • Tianpeng Li
  • Haoze Sun
  • Chenzheng Zhu
  • Haofen Wang
  • Jeff Pan
  • Wen Zhang

Large Language Models (LLMs) have shown remarkable capabilities in reasoning, exemplified by the success of OpenAI-o1 and DeepSeek-R1. However, integrating reasoning with external search processes remains challenging, especially for complex multi-hop questions requiring multiple retrieval steps. We propose ReSearch, a novel framework that trains LLMs to Reason with Search via reinforcement learning without using any supervised data on reasoning steps. Our approach treats search operations as integral components of the reasoning chain, where when and how to perform searches is guided by text-based thinking, and search results subsequently influence further reasoning. We train ReSearch on Qwen2.5-7B(-Instruct) and Qwen2.5-32B(-Instruct) models and conduct extensive experiments. Despite being trained on only one dataset, our models demonstrate strong generalizability across various benchmarks. Analysis reveals that ReSearch naturally elicits advanced reasoning capabilities such as reflection and self-correction during the reinforcement learning process.

NeurIPS Conference 2025 Conference Paper

RetrievalAttention: Accelerating Long-Context LLM Inference via Vector Retrieval

  • Di Liu
  • Meng Chen
  • Baotong Lu
  • Huiqiang Jiang
  • Zhenhua Han
  • Qianxi Zhang
  • Qi Chen
  • Chengruidong Zhang

Transformer-based Large Language Models (LLMs) have become increasingly important. However, scaling LLMs to longer contexts incurs slow inference speed and high GPU memory consumption for caching key-value (KV) vectors. This paper presents RetrievalAttention, a training-free approach that both accelerates the decoding phase and reduces GPU memory consumption by pre-building KV vector indexes for fixed contexts and maintaining them in CPU memory for efficient retrieval. Unlike conventional KV cache methods, RetrievalAttention integrates approximate nearest neighbor search (ANNS) indexes into the attention computation. We observe that off-the-shelf ANNS techniques often fail due to the out-of-distribution (OOD) nature of query and key vectors in attention mechanisms. RetrievalAttention overcomes this with an attention-aware vector index. Our evaluation shows that RetrievalAttention achieves near-full-attention accuracy while accessing only 1-3% of the data, significantly reducing inference costs. Remarkably, RetrievalAttention enables LLMs with 8B parameters to handle 128K tokens on a single NVIDIA RTX 4090 (24GB), achieving a decoding speed of 0.107 seconds per token.
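The retrieval idea can be sketched in a few lines: score the cached keys, keep only the nearest few, and run exact attention on that subset. This toy version uses a brute-force `argsort` where RetrievalAttention would query its attention-aware ANNS index, so it illustrates the computation pattern, not the paper's system.

```python
import numpy as np

def retrieval_attention(q, K, V, k=8):
    """Sparse attention via retrieval: score every cached key, keep the
    top-k, and softmax only over that subset. A real system replaces
    the argsort with an ANNS index lookup."""
    scores = K @ q / np.sqrt(q.shape[0])
    top = np.argsort(scores)[-k:]            # stand-in for index retrieval
    w = np.exp(scores[top] - scores[top].max())
    w /= w.sum()
    return w @ V[top]

rng = np.random.default_rng(0)
K = rng.normal(size=(1024, 64))              # cached key vectors
V = rng.normal(size=(1024, 64))              # cached value vectors
q = 8.0 * K[123]                             # query strongly aligned with one key
out = retrieval_attention(q, K, V)
print(out.shape)
```

Because the query is nearly parallel to key 123, the retrieved-subset attention concentrates almost all weight on that entry while touching only 8 of the 1024 cached vectors.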

NeurIPS Conference 2025 Conference Paper

Reviving DSP for Advanced Theorem Proving in the Era of Reasoning Models

  • Chenrui Cao
  • Liangcheng Song
  • Zenan Li
  • Xinyi Le
  • Xian Zhang
  • Hui Xue
  • Fan Yang

Recent advancements, such as DeepSeek-Prover-V2-671B and Kimina-Prover-Preview-72B, demonstrate a prevailing trend in leveraging reinforcement learning (RL)-based large-scale training for automated theorem proving. Surprisingly, we discover that even without any training, careful neuro-symbolic coordination of existing off-the-shelf reasoning models and tactic step provers can achieve comparable performance. This paper introduces DSP+, an improved version of the Draft, Sketch, and Prove framework, featuring a fine-grained and integrated neuro-symbolic enhancement for each phase: (1) In the draft phase, we prompt reasoning models to generate concise natural-language subgoals to benefit the sketch phase, removing thinking tokens and references to human-written proofs; (2) In the sketch phase, subgoals are autoformalized with hypotheses to benefit the proving phase, and sketch lines containing syntactic errors are masked according to predefined rules; (3) In the proving phase, we tightly integrate symbolic search methods like Aesop with step provers to establish proofs for the sketch subgoals. Experimental results show that, without any additional model training or fine-tuning, DSP+ solves 80.7% of miniF2F, 32.8% of ProofNet, and 24 out of 644 PutnamBench problems, while requiring a smaller budget than state-of-the-art methods. DSP+ proves IMO 2019 P1, an IMO problem in miniF2F that is not solved by any prior work. Additionally, DSP+ generates proof patterns comprehensible to human experts, facilitating the identification of formalization errors; for example, eight incorrectly formalized statements in miniF2F are discovered. Our results highlight the potential of classical reasoning patterns beyond RL-based training. All components will be open-sourced.

NeurIPS Conference 2025 Conference Paper

rStar-Coder: Scaling Competitive Code Reasoning with a Large-Scale Verified Dataset

  • Yifei Liu
  • Li Lyna Zhang
  • Yi Zhu
  • Bingcheng Dong
  • Xudong Zhou
  • Ning Shang
  • Fan Yang
  • Cheng Li

Advancing code reasoning in large language models (LLMs) is fundamentally limited by the scarcity of high-difficulty datasets, especially those with verifiable input-output test cases necessary for rigorous solution validation at scale. We introduce rStar-Coder, which significantly improves LLM code reasoning capabilities by constructing a large-scale, verified dataset of 418K competition-level code problems and 580K long-reasoning solutions, along with rich test cases of varying difficulty. This is achieved through three core contributions: (1) we curate competitive programming code problems and solutions to synthesize new, solvable problems; (2) we introduce a reliable input-output test case synthesis pipeline that decouples the generation into a three-step input generation method and a mutual verification mechanism for effective output labeling; (3) we augment problems with high-quality, test-case-verified long-reasoning solutions. Extensive experiments on Qwen models (1.5B-14B) across various code reasoning benchmarks demonstrate the superiority of the rStar-Coder dataset, achieving performance comparable to frontier reasoning LLMs with significantly smaller model sizes. On LiveCodeBench, rStar-Coder improves Qwen2.5-7B from 17.4% to an impressive 57.3%, and Qwen2.5-14B from 23.3% to 62.5%, surpassing o3-mini (low) by 3.1%. On the more challenging USA Computing Olympiad, our 7B model achieves an average pass@1 accuracy of 16.15%, outperforming the frontier-level QWQ-32B. The rStar-Coder dataset is publicly available at https://huggingface.co/datasets/microsoft/rStar-Coder.
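A mutual verification step for output labeling, as mentioned above, might look like the following majority-vote sketch; the function name and agreement threshold are hypothetical, and the paper's exact mechanism may differ.

```python
from collections import Counter

def mutual_verify(outputs, min_agree=2):
    """Mutual verification for test-case labeling: run several independent
    candidate solutions on the same generated input and accept the
    majority output as the label only if enough of them agree."""
    if not outputs:
        return None
    value, count = Counter(outputs).most_common(1)[0]
    return value if count >= min_agree else None

# Hypothetical outputs of three candidate solutions on one generated input.
print(mutual_verify(["42", "42", "41"]))  # consensus reached
print(mutual_verify(["1", "2", "3"]))     # no consensus, input discarded
```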

NeurIPS Conference 2025 Conference Paper

SeerAttention: Self-distilled Attention Gating for Efficient Long-context Prefilling

  • Yizhao Gao
  • Zhichen Zeng
  • DaYou Du
  • Shijie Cao
  • Peiyuan Zhou
  • Jiaxing Qi
  • Junjie Lai
  • Hayden So

Attention is the cornerstone of modern Large Language Models (LLMs). Yet its quadratic complexity hinders efficiency and scalability, especially for long-context processing. A promising approach is to leverage sparsity in attention. However, existing sparsity-based solutions predominantly rely on predefined patterns or heuristics at the attention head level, struggling to adapt dynamically to different contexts efficiently. We propose SeerAttention, a simple yet effective attention mechanism that directly learns the block-level attention sparsity from the LLM itself. Inspired by the gating mechanism in Mixture of Experts (MoE), SeerAttention augments the conventional attention with a learnable gate that selectively activates important blocks within the attention map. Specifically, the gate first pools the query (Q) and key (K) tensors along the sequence dimension and processes them through learnable linear layers. The resulting matrices are then multiplied together to produce the gating scores, which are used to predict block-level attention sparsity. Combined with our block-sparse FlashAttention kernel, SeerAttention can achieve significant speedup on GPUs. When applied to pre-trained LLMs, SeerAttention only requires training the gate parameters in a lightweight self-distillation manner, allowing rapid convergence. Our evaluation results demonstrate that SeerAttention achieves better model accuracy and lower latency for long-context pre-filling compared to prior methods. Code is available at: https://github.com/microsoft/SeerAttention.
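The gating computation described above (pool Q and K along the sequence dimension, map through learnable linear layers, multiply to get block-level scores) can be sketched with NumPy. The random weights and the top-half threshold below are stand-ins for the learned gate and the kernel's actual block selection.

```python
import numpy as np

rng = np.random.default_rng(0)
seq, d, block = 16, 8, 4
n_blocks = seq // block

Q = rng.normal(size=(seq, d))
K = rng.normal(size=(seq, d))
Wq = rng.normal(size=(d, d)) * 0.1   # "learnable" gate weights (random here)
Wk = rng.normal(size=(d, d)) * 0.1

# Pool Q and K along the sequence dimension: one vector per block,
# then project through the gate's linear layers.
Qp = Q.reshape(n_blocks, block, d).mean(axis=1) @ Wq
Kp = K.reshape(n_blocks, block, d).mean(axis=1) @ Wk

gate = Qp @ Kp.T                       # block-level gating scores
keep = gate >= np.quantile(gate, 0.5)  # activate the top half of blocks
print(gate.shape, keep.sum())
```

A block-sparse kernel would then compute attention only for the (query-block, key-block) pairs marked in `keep`, skipping the rest of the attention map.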

AAMAS Conference 2025 Conference Paper

Self-Interpretable Reinforcement Learning via Rule Ensembles

  • Yue Yang
  • Fan Yang
  • Yu Bai
  • Hao Wang

Current reinforcement learning (RL) models, often functioning as complex 'black boxes', obscure their decision-making processes. This lack of transparency limits their applicability in critical real-world applications where clear reasoning behind algorithmic choices is crucial. To tackle this issue, we suggest moving from neural network or tabular approaches to a rule ensemble model, which improves decision-making clarity and adapts dynamically to environmental interactions. Specifically, our method constructs additive rule ensembles to approximate the Q-value in reinforcement learning using orthogonal gradient boosting (OGB) combined with a post-processing rule replacement technique. This method enables the model to provide inherent explanations through the use of rules. Our study sets a theoretical foundation for rule ensembles within the reinforcement learning framework, emphasizing their capacity to boost interpretability and facilitate the analysis of rule impacts. Experimental results from seven classic environments demonstrate that our proposed rule ensembles match or exceed the performance of representative RL models such as DQN, A2C, and PPO, while also providing self-interpretability and transparency.
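An additive rule ensemble Q-function of the kind described can be sketched as a sum of weighted rules whose conditions the state satisfies. The rules and weights below are hypothetical illustrations, not ones learned by OGB.

```python
def q_value(state, rules):
    """Additive rule ensemble: Q(s, a) is the sum of the weights of all
    rules whose conditions the state satisfies."""
    return sum(w for cond, w in rules if cond(state))

# Hypothetical rules for the action "push right" in a cart-pole-like
# environment with state {"angle": ..., "vel": ...}.
rules_push_right = [
    (lambda s: s["angle"] > 0.0, 1.5),                  # pole leaning right
    (lambda s: s["angle"] > 0.1 and s["vel"] > 0, 0.8), # leaning and moving right
    (lambda s: True, 0.2),                              # base value
]
print(q_value({"angle": 0.2, "vel": 0.5}, rules_push_right))   # all three rules fire
print(q_value({"angle": -0.1, "vel": 0.0}, rules_push_right))  # only the base rule fires
```

Each prediction decomposes into the list of fired rules, which is exactly the inherent explanation the abstract refers to.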

ICLR Conference 2025 Conference Paper

Solving Token Gradient Conflict in Mixture-of-Experts for Large Vision-Language Model

  • Longrong Yang
  • Dong Shen
  • Chaoxiang Cai
  • Fan Yang
  • Tingting Gao
  • Di Zhang
  • Xi Li

The Mixture-of-Experts (MoE) approach has gained increasing attention in studying Large Vision-Language Models (LVLMs). It uses a sparse model to replace the dense model, achieving comparable performance while activating fewer parameters during inference, thus significantly reducing the inference cost. Existing MoE methods in LVLMs encourage different experts to specialize in different tokens, and they usually employ a router to predict the routing of each token. However, the router is not optimized with respect to the distinct parameter optimization directions generated by tokens within an expert. This may lead to severe interference between tokens within an expert. To address this problem, we propose Solving Token Gradient Conflict (STGC), which uses token-level gradient analysis. Specifically, we first use token-level gradients to identify conflicting tokens in experts. After that, we add a tailored regularization loss that encourages conflicting tokens to route from their current experts to other experts, reducing interference between tokens within an expert. Our method can serve as a plug-in for diverse LVLM methods, and extensive experimental results demonstrate its effectiveness. The code will be publicly available at https://github.com/longrongyang/STGC.
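A minimal sketch of the conflict-detection step, assuming a token counts as conflicting when its gradient has negative cosine similarity with the expert's mean gradient (one plausible reading of the abstract, not necessarily the paper's exact criterion):

```python
import numpy as np

def conflicting_tokens(token_grads, thresh=0.0):
    """Flag tokens whose gradient points against the expert's average
    gradient direction (cosine similarity below `thresh` = conflict)."""
    g = np.asarray(token_grads, dtype=float)
    mean = g.mean(axis=0)
    cos = g @ mean / (np.linalg.norm(g, axis=1) * np.linalg.norm(mean) + 1e-12)
    return np.where(cos < thresh)[0]

# Toy per-token gradients inside one expert; the third token pulls the
# expert's parameters in the opposite direction of the other two.
grads = np.array([[1.0, 0.0],
                  [0.9, 0.1],
                  [-1.0, 0.05]])
print(conflicting_tokens(grads))
```

The flagged indices would then receive the regularization loss that pushes their routing toward other experts.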

NeurIPS Conference 2025 Conference Paper

Who You Are Matters: Bridging Interests and Social Roles via LLM-Enhanced Logic Recommendation

  • Qing Yu
  • Xiaobei Wang
  • Shuchang Liu
  • Xiaoyu Yang
  • Xueliang Wang
  • Chang Meng
  • Shanshan Wu
  • Bin Wen

Recommender systems filter contents/items valuable to users by inferring preferences from user features and historical behaviors. Mainstream approaches follow the learning-to-rank paradigm, which focuses on discovering and modeling item topics (e.g., categories) and capturing user preferences on these topics based on historical interactions. However, this paradigm often neglects the modeling of user characteristics and their social roles, which are logical confounders influencing the correlated interest and user preference transition. To bridge this gap, we introduce the user role identification task and the behavioral logic modeling task, which aim to explicitly model user roles and learn the logical relations between item topics and user social roles. We show that it is possible to explicitly solve these tasks through an efficient integration framework of Large Language Models (LLMs) and recommendation systems, for which we propose TagCF. On the one hand, TagCF exploits the (multi-modal) LLM's world knowledge and logic inference ability to extract realistic tag-based virtual logic graphs that reveal dynamic and expressive knowledge of users, refining our understanding of user behaviors. On the other hand, TagCF presents empirically effective integration modules that take advantage of the extracted tag-logic information, augmenting the recommendation performance. We conduct both online and offline experiments with industrial and public datasets to verify TagCF's effectiveness, and we empirically show that the user role modeling strategy is potentially a better choice than the modeling of item topics. Additionally, we provide evidence that the extracted logic graphs are empirically general and transferable knowledge that can benefit a wide range of recommendation tasks. Our code is available at https://github.com/Code2Q/TagCF.

AAAI Conference 2024 Conference Paper

An Effective Augmented Lagrangian Method for Fine-Grained Multi-View Optimization

  • Yuze Tan
  • Hecheng Cai
  • Shudong Huang
  • Shuping Wei
  • Fan Yang
  • Jiancheng Lv

The significance of multi-view learning in effectively mitigating the intricacies entrenched within heterogeneous data has garnered substantial attention in recent years. Notwithstanding the favorable achievements showcased by recent strides in this area, a confluence of noteworthy challenges endures. To be specific, a majority of extant methodologies unceremoniously assign weights to data points view-wisely. This ineluctably disregards the intrinsic reality that disparate views confer diverse contributions to each individual sample, consequently neglecting the rich wellspring of sample-level structural insights harbored within the dataset. In this paper, we propose an effective Augmented Lagrangian MethOd for fiNe-graineD (ALMOND) multi-view optimization. This innovative approach scrutinizes the interplay among multiple views at the granularity of individual samples, thereby fostering the enhanced preservation of local structural coherence. The Augmented Lagrangian Method (ALM) is elaborately incorporated into our framework, which enables us to achieve an optimal solution without involving an inexplicable intermediate variable as previous methods do. Empirical experiments on multi-view clustering tasks across heterogeneous datasets incontrovertibly showcase the effectiveness of our proposed methodology, corroborating its preeminence over incumbent state-of-the-art alternatives.

NeurIPS Conference 2024 Conference Paper

Autoformalize Mathematical Statements by Symbolic Equivalence and Semantic Consistency

  • Zenan Li
  • Yifan Wu
  • Zhaoyu Li
  • Xinming Wei
  • Fan Yang
  • Xian Zhang
  • Xiaoxing Ma

Autoformalization, the task of automatically translating natural language descriptions into a formal language, poses a significant challenge across various domains, especially in mathematics. Recent advancements in large language models (LLMs) have unveiled their promising capabilities to formalize even competition-level math problems. However, we observe a considerable discrepancy between pass@1 and pass@k accuracies in LLM-generated formalizations. To address this gap, we introduce a novel framework that scores and selects the best result from k autoformalization candidates based on two complementary self-consistency methods: symbolic equivalence and semantic consistency. Specifically, symbolic equivalence identifies the logical homogeneity among autoformalization candidates using automated theorem provers, and semantic consistency evaluates the preservation of the original meaning by informalizing the candidates and computing the similarity between the embeddings of the original and informalized texts. Our extensive experiments on the MATH and miniF2F datasets demonstrate that our approach significantly enhances autoformalization accuracy, achieving 0.22-1.35x relative improvements across various LLMs and baseline methods.
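The scoring-and-selection step can be sketched as below, with string equality standing in for the theorem prover and a lookup table for the embedder; the 50/50 weighting `alpha` and the candidate names are assumptions for illustration.

```python
import numpy as np

def cosine(u, v):
    return float(np.dot(u, v) / (np.linalg.norm(u) * np.linalg.norm(v)))

def select_best(candidates, equiv, embed, original_emb, alpha=0.5):
    """Score each formalization candidate by (1) the fraction of peers a
    prover certifies as symbolically equivalent and (2) the embedding
    similarity between the original statement and the candidate's
    informalization; return the top-scoring candidate."""
    n = len(candidates)
    best, best_score = None, -np.inf
    for i, c in enumerate(candidates):
        sym = sum(equiv(c, candidates[j]) for j in range(n) if j != i) / max(n - 1, 1)
        sem = cosine(embed(c), original_emb)
        score = alpha * sym + (1 - alpha) * sem
        if score > best_score:
            best, best_score = c, score
    return best

# Toy stand-ins: string equality as the "prover", a lookup table as the "embedder".
equiv = lambda a, b: a == b
emb_table = {"thm_P": np.array([1.0, 0.0]), "thm_Q": np.array([0.0, 1.0])}
embed = lambda c: emb_table[c]
original_emb = np.array([1.0, 0.1])   # embedding of the original statement

print(select_best(["thm_P", "thm_P", "thm_Q"], equiv, embed, original_emb))
```

The majority candidate wins on both criteria here, which is the self-consistency effect the framework exploits.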

AAAI Conference 2024 Conference Paper

Causal-Driven Skill Prerequisite Structure Discovery

  • Shenbao Yu
  • Yifeng Zeng
  • Fan Yang
  • Yinghui Pan

Knowing a prerequisite structure among skills in a subject domain effectively enables several educational applications, including intelligent tutoring systems and curriculum planning. Traditionally, educators or domain experts use intuition to determine the skills' prerequisite relationships, which is time-consuming and prone to blind spots. In this paper, we focus on inferring the prerequisite structure given access to students' performance on exercises in a subject. Nevertheless, this is challenging since students' mastery of skills cannot be directly observed, but can only be estimated, i.e., it is latent in nature. To tackle this problem, we propose a causal-driven skill prerequisite structure discovery (CSPS) method in a two-stage learning framework. In the first stage, we learn the skills' correlation relationships presented in the covariance matrix from the student performance data, while, through the predicted covariance matrix in the second stage, we consider a heuristic method based on conditional independence tests and standardized partial variance to discover the prerequisite structure. We demonstrate the performance of the new approach with both simulated and real-world data. The experimental results show the effectiveness of the proposed model for identifying the skills' prerequisite structure.

TIST Journal 2024 Journal Article

Demand-driven Urban Facility Visit Prediction

  • Yunke Zhang
  • Tong Li
  • Yuan Yuan
  • Fengli Xu
  • Fan Yang
  • Funing Sun
  • Yong Li

Predicting citizens’ visiting behaviors to urban facilities is instrumental for city governors and planners to detect inequalities in urban opportunities and optimize the distribution of facilities and resources. Previous works predict facility visits simply using observed visit behavior, yet citizens’ intrinsic demands for facilities are not characterized explicitly, causing potentially incorrect learned relations in the prediction results. In this article, to make up for this deficiency, we present a demand-driven urban facility visit prediction method that decomposes citizens’ visits to facilities into their unobservable demands and their capability to fulfill them. Demands are expressed as a function of regional demographic attributes by a neural network, and the fulfillment capability is determined by the urban region’s spatial accessibility to facilities. Extensive evaluations on datasets from three large cities confirm the efficiency and rationality of our model. Our method outperforms the best state-of-the-art model by 8.28% on average in facility visit prediction tasks. Further analyses demonstrate the reasonableness of recovered facility demands and their relationship with citizen demographics. For instance, senior citizens tend to have higher medical demands but lower shopping demands. Meanwhile, estimated capabilities and accessibilities provide deeper insights into the decaying accessibility with respect to spatial distance and facilities’ diverse functions in the urban environment. Our findings shed light on demand-driven urban data mining and demand-based urban facility planning.
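The decomposition above can be sketched as visits = demand(demographics) × accessibility; the weights and numbers below are illustrative assumptions, not the paper's learned neural network or data.

```python
import numpy as np

def predicted_visits(demographics, demand_weights, accessibility):
    """Decompose visits into an unobserved demand term driven by regional
    demographics and a spatial-accessibility (fulfillment) term."""
    demand = np.maximum(0.0, demographics @ demand_weights)  # demand is non-negative
    return demand * accessibility

# Two regions x demographic shares [share_senior, share_young];
# one facility type (medical), with seniors weighting medical demand more.
demo = np.array([[0.4, 0.1],
                 [0.1, 0.5]])
w_medical = np.array([2.0, 0.3])   # hypothetical demand weights
access = np.array([0.8, 0.8])      # equal accessibility for both regions
print(predicted_visits(demo, w_medical, access))
```

With equal accessibility, the region with more seniors gets the higher predicted medical visits, matching the abstract's observation that senior citizens tend to have higher medical demands.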

IROS Conference 2024 Conference Paper

EMBOSR: Embodied Spatial Reasoning for Enhanced Situated Question Answering in 3D Scenes

  • Yu Hao
  • Fan Yang
  • Nicholas Fang
  • Yu-Shen Liu

3D Embodied Spatial Reasoning, emphasizing an agent’s interaction with its surroundings for spatial information inference, is adeptly facilitated by the process of Situated Question Answering in 3D Scenes (SQA3D). SQA3D requires an agent to comprehend its position and orientation within a 3D scene based on a textual situation and then utilize this understanding to answer questions about the surrounding environment in that context. Previous methods in this field face substantial challenges, including a dependency on constant retraining on limited datasets, which leads to poor performance in unseen scenarios, limited expandability, and inadequate generalization. To address these challenges, we present a new embodied spatial reasoning paradigm for enhanced SQA3D, fusing the capabilities of foundation models with the chain of thought methodology. This approach is designed to elevate adaptability and scalability in a wide array of 3D environments. A new aspect of our model is the integration of a chain of thought reasoning process, which significantly augments the model’s capability for spatial reasoning and complex query handling in diverse 3D environments. In our structured experiments, we compare our approach against other methods with varying architectures, demonstrating its efficacy in multiple tasks including SQA3D and 3D captioning. We also assess the informativeness contained in the generated answers for complex queries. Ablation studies further delineate the individual contributions of our method to its overall performance. The results consistently affirm our proposed method’s effectiveness and efficiency.

NeurIPS Conference 2024 Conference Paper

Empowering and Assessing the Utility of Large Language Models in Crop Science

  • Hang Zhang
  • Jiawei Sun
  • Renqi Chen
  • Wei Liu
  • Zhonghang Yuan
  • Xinzhe Zheng
  • Zhefan Wang
  • Zhiyuan Yang

Large language models (LLMs) have demonstrated remarkable efficacy across knowledge-intensive tasks. Nevertheless, their untapped potential in crop science presents an opportunity for advancement. To narrow this gap, we introduce CROP, which includes a novel instruction tuning dataset specifically designed to enhance LLMs’ professional capabilities in the crop science sector, along with a benchmark that serves as a comprehensive evaluation of LLMs’ understanding of the domain knowledge. The CROP dataset is curated through a task-oriented and LLM-human integrated pipeline, comprising 210,038 single-turn and 1,871 multi-turn dialogues related to crop science scenarios. The CROP benchmark includes 5,045 multiple-choice questions covering three difficulty levels. Our experiments based on the CROP benchmark demonstrate notable enhancements in crop science-related tasks when LLMs are fine-tuned with the CROP dataset. To the best of our knowledge, the CROP dataset is the first-ever instruction tuning dataset in the crop science domain. We anticipate that CROP will accelerate the adoption of LLMs in the domain of crop science, ultimately contributing to global food production.

TIST Journal 2024 Journal Article

Explainability for Large Language Models: A Survey

  • Haiyan Zhao
  • Hanjie Chen
  • Fan Yang
  • Ninghao Liu
  • Huiqi Deng
  • Hengyi Cai
  • Shuaiqiang Wang
  • Dawei Yin

Large language models (LLMs) have demonstrated impressive capabilities in natural language processing. However, their internal mechanisms are still unclear and this lack of transparency poses unwanted risks for downstream applications. Therefore, understanding and explaining these models is crucial for elucidating their behaviors, limitations, and social impacts. In this article, we introduce a taxonomy of explainability techniques and provide a structured overview of methods for explaining Transformer-based language models. We categorize techniques based on the training paradigms of LLMs: traditional fine-tuning-based paradigm and prompting-based paradigm. For each paradigm, we summarize the goals and dominant approaches for generating local explanations of individual predictions and global explanations of overall model knowledge. We also discuss metrics for evaluating generated explanations and discuss how explanations can be leveraged to debug models and improve performance. Lastly, we examine key challenges and emerging opportunities for explanation techniques in the era of LLMs in comparison to conventional deep learning models.

AAAI Conference 2024 Conference Paper

Geometry-Guided Domain Generalization for Monocular 3D Object Detection

  • Fan Yang
  • Hui Chen
  • Yuwei He
  • Sicheng Zhao
  • Chenghao Zhang
  • Kai Ni
  • Guiguang Ding

Monocular 3D object detection (M3OD) is important for autonomous driving. However, existing deep learning-based methods easily suffer from performance degradation in real-world scenarios due to the substantial domain gap between training and testing. M3OD's domain gaps are complex, including camera intrinsic parameters, extrinsic parameters, image appearance, etc. Existing works primarily focus on the domain gaps of camera intrinsic parameters, ignoring other key factors. Moreover, at the feature level, conventional domain invariant learning methods generally cause the negative transfer issue, due to the ignorance of dependency between geometry tasks and domains. To tackle these issues, in this paper, we propose MonoGDG, a geometry-guided domain generalization framework for M3OD, which effectively addresses the domain gap at both camera and feature levels. Specifically, MonoGDG consists of two major components. One is geometry-based image reprojection, which mitigates the impact of camera discrepancy by unifying intrinsic parameters, randomizing camera orientations, and unifying the field of view range. The other is geometry-dependent feature disentanglement, which overcomes the negative transfer problems by incorporating domain-shared and domain-specific features. Additionally, we leverage a depth-disentangled domain discriminator and a domain-aware geometry regression attention mechanism to account for the geometry-domain dependency. Extensive experiments on multiple autonomous driving benchmarks demonstrate that our method achieves state-of-the-art performance in domain generalization for M3OD.

AAAI Conference 2024 Conference Paper

Implicit Modeling of Non-rigid Objects with Cross-Category Signals

  • Yuchun Liu
  • Benjamin Planche
  • Meng Zheng
  • Zhongpai Gao
  • Pierre Sibut-Bourde
  • Fan Yang
  • Terrence Chen
  • Ziyan Wu

Deep implicit functions (DIFs) have emerged as a potent and articulate means of representing 3D shapes. However, methods modeling object categories or non-rigid entities have mainly focused on single-object scenarios. In this work, we propose MODIF, a multi-object deep implicit function that jointly learns the deformation fields and instance-specific latent codes for multiple objects at once. Our emphasis is on non-rigid, non-interpenetrating entities such as organs. To effectively capture the interrelation between these entities and ensure precise, collision-free representations, our approach facilitates signaling between category-specific fields to adequately rectify shapes. We also introduce novel inter-object supervision: an attraction-repulsion loss is formulated to refine contact regions between objects. Our approach is demonstrated on various medical benchmarks, involving modeling different groups of intricate anatomical entities. Experimental results illustrate that our model can proficiently learn the shape representation of each organ and their relations to others, to the point that shapes missing from unseen instances can be consistently recovered by our method. Finally, MODIF can also propagate semantic information throughout the population via accurate point correspondences.

AAAI Conference 2024 Conference Paper

Multi-Modal Disordered Representation Learning Network for Description-Based Person Search

  • Fan Yang
  • Wei Li
  • Menglong Yang
  • Binbin Liang
  • Jianwei Zhang

Description-based person search aims to retrieve images of the target identity via textual descriptions. One of the challenges for this task is to extract discriminative representations from images and descriptions. Most existing methods apply part-based split methods or external models to explore the fine-grained details of local features, which ignores the global relationship between partial information and causes network instability. To overcome these issues, we propose a Multi-modal Disordered Representation Learning Network (MDRL) for description-based person search to fully extract the visual and textual representations. Specifically, we design a Cross-modality Global Feature Learning Architecture to learn the global features from the two modalities and meet the demands of the task. Based on our global network, we introduce a Disorder Local Learning Module to explore local features by a disordered reorganization strategy from both visual and textual aspects and enhance the robustness of the whole network. Besides, we introduce a Cross-modality Interaction Module to guide the two streams to extract visual or textual representations considering the correlation between modalities. Extensive experiments are conducted on two public datasets, and the results show that our method outperforms the state-of-the-art methods on the CUHK-PEDES and ICFG-PEDES datasets.

AAAI Conference 2024 Conference Paper

Multi-View Randomized Kernel Classification via Nonconvex Optimization

  • Xiaojian Ding
  • Fan Yang

Multiple kernel learning (MKL) is a representative supervised multi-view learning method widely applied in multi-modal and multi-view applications. MKL aims to classify data by integrating complementary information from predefined kernels. Although existing MKL methods achieve promising performance, they fail to consider the tradeoff between diversity and classification accuracy of kernels, preventing further improvement of classification performance. In this paper, we tackle this problem by generating a number of high-quality base learning kernels and selecting a kernel subset with maximum pairwise diversity and minimum generalization errors. We first formulate this idea as a nonconvex quadratic integer programming problem. Then we transform this nonconvex problem into a convex optimization problem and prove it is equivalent to a semidefinite relaxation problem, which a semidefinite-based branch-and-bound algorithm can quickly solve. Experimental results on real-world datasets demonstrate the superiority of the proposed method. The results also show that our method works for the support vector machine (SVM) classifier and other state-of-the-art kernel classifiers.

NeurIPS Conference 2024 Conference Paper

Neuro-Symbolic Data Generation for Math Reasoning

  • Zenan Li
  • Zhi Zhou
  • Yuan Yao
  • Yu-Feng Li
  • Chun Cao
  • Fan Yang
  • Xian Zhang
  • Xiaoxing Ma

A critical question about Large Language Models (LLMs) is whether their apparent deficiency in mathematical reasoning is inherent, or merely a result of insufficient exposure to high-quality mathematical data. To explore this, we developed an automated method for generating high-quality, supervised mathematical datasets. The method carefully mutates existing math problems, ensuring both diversity and validity of the newly generated problems. This is achieved by a neuro-symbolic data generation framework combining the intuitive informalization strengths of LLMs, and the precise symbolic reasoning of math solvers along with projected Markov chain Monte Carlo sampling in the highly-irregular symbolic space. Empirical experiments demonstrate the high quality of data generated by the proposed method, and that the LLMs, specifically LLaMA-2 and Mistral, when realigned with the generated data, surpass their state-of-the-art counterparts.

NeurIPS Conference 2024 Conference Paper

Once Read is Enough: Domain-specific Pretraining-free Language Models with Cluster-guided Sparse Experts for Long-tail Domain Knowledge

  • Fang Dong
  • Mengyi Chen
  • Jixian Zhou
  • Yubin Shi
  • Yixuan Chen
  • Mingzhi Dong
  • Yujiang Wang
  • Dongsheng Li

Language models (LMs) only pretrained on a general and massive corpus usually cannot attain satisfying performance on domain-specific downstream tasks, and hence, applying domain-specific pretraining to LMs is a common and indispensable practice. However, domain-specific pretraining can be costly and time-consuming, hindering LMs' deployment in real-world applications. In this work, we identify the inability to memorize domain-specific knowledge, which appears in the general corpus only rarely and with long-tail distributions, as the leading cause of pretrained LMs' inferior downstream performance. Analysis of Neural Tangent Kernels (NTKs) reveals that those long-tail data are commonly overlooked in the model's gradient updates and, consequently, are not effectively memorized, leading to poor domain-specific downstream performance. Based on the intuition that data with similar semantic meaning are closer in the embedding space, we devise a Cluster-guided Sparse Expert (CSE) layer to actively learn long-tail domain knowledge typically neglected in previous pretrained LMs. During pretraining, a CSE layer efficiently clusters domain knowledge together and assigns long-tail knowledge to designated extra experts. CSE is also a lightweight structure that only needs to be incorporated in several deep layers. With our training strategy, we found that during pretraining, data of long-tail knowledge gradually form isolated, outlier clusters in an LM's representation spaces, especially in deeper layers. Our experimental results show that pretraining CSE-based LMs alone is enough to achieve performance superior to regularly pretrained-and-finetuned LMs on various downstream tasks, implying the prospects of domain-specific-pretraining-free language models.
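The cluster-then-route idea behind CSE can be illustrated standalone. The sketch below is a toy illustration only, not the paper's architecture: the embeddings, cluster count, and expert mapping are all made up. It clusters token embeddings and sends members of the rare (long-tail) cluster to a designated extra expert, keeping the rest on the shared expert.

```python
import numpy as np
from sklearn.cluster import KMeans

rng = np.random.default_rng(0)

# Toy token embeddings: a large "head" cluster and a small "long-tail" one.
head = rng.normal(loc=0.0, scale=0.5, size=(95, 8))
tail = rng.normal(loc=5.0, scale=0.5, size=(5, 8))
emb = np.vstack([head, tail])

# Cluster the embedding space, then treat the smallest cluster as long-tail.
km = KMeans(n_clusters=2, n_init=10, random_state=0).fit(emb)
sizes = np.bincount(km.labels_, minlength=2)
rare = np.argmin(sizes)

# Route: expert 1 is the designated extra expert for the rare cluster,
# expert 0 is the shared expert for everything else.
expert_id = np.where(km.labels_ == rare, 1, 0)
```

In the actual CSE layer the clustering is learned online during pretraining rather than run once with k-means; this sketch only shows the routing principle.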

AAAI Conference 2024 Conference Paper

Sparse Bayesian Deep Learning for Cross Domain Medical Image Reconstruction

  • Jiaxin Huang
  • Qi Wu
  • Yazhou Ren
  • Fan Yang
  • Aodi Yang
  • Qianqian Yang
  • Xiaorong Pu

Cross domain medical image reconstruction aims to address the issue that deep learning models trained solely on one source dataset might not generalize effectively to unseen target datasets from different hospitals. Some recent methods achieve satisfactory reconstruction performance, but often at the expense of extensive parameters and time consumption. To strike a balance between cross domain image reconstruction quality and model computational efficiency, we propose a lightweight sparse Bayesian deep learning method. Notably, we apply a fixed-form variational Bayes (FFVB) approach to quantify pixel-wise uncertainty priors derived from degradation distribution of the source domain. Furthermore, by integrating the uncertainty prior into the posterior sampled through stochastic gradient Langevin dynamics (SGLD), we develop a training strategy that dynamically generates and optimizes the prior distribution on the network weights for each unseen domain. This strategy enhances generalizability and ensures robust reconstruction performance. When evaluated on medical image reconstruction tasks, our proposed approach demonstrates impressive performance across various previously unseen domains.

JBHI Journal 2024 Journal Article

Spatio-Temporal Classification of Lung Ventilation Patterns Using 3D EIT Images: A General Approach for Individualized Lung Function Evaluation

  • Shuzhe Chen
  • Li Li
  • Zhichao Lin
  • Ke Zhang
  • Ying Gong
  • Lu Wang
  • Xu Wu
  • Maokun Li

The Pulmonary Function Test (PFT) is a widely utilized and rigorous classification test for evaluating lung function, serving as a comprehensive diagnostic tool for lung conditions. Meanwhile, Electrical Impedance Tomography (EIT) is a rapidly advancing clinical technique that visualizes conductivity distribution induced by ventilation. EIT provides additional spatial and temporal information on lung ventilation beyond traditional PFT. However, relying solely on conventional isolated interpretations of PFT results and EIT images overlooks the continuous dynamic aspects of lung ventilation. This study aims to classify lung ventilation patterns by extracting spatial and temporal features from the 3D EIT image series. The study uses a Variational Autoencoder (VAE) with a MultiRes block to compress the spatial distribution in a 3D image into a one-dimensional vector. These vectors are then stacked to create a feature map for the exhibition of temporal features. A simple convolutional neural network is used for classification. Data from 137 subjects were utilized for the training phase. Initially, the model underwent validation through a leave-one-out cross-validation process. During this validation, the model achieved an accuracy and sensitivity of 0.96 and 1.00, respectively, with an f1-score of 0.98 when identifying the normal subjects. To assess pipeline reliability and feasibility, we tested it on 9 newly recruited subjects, with accurate ventilation mode predictions for 8 out of 9. In addition, we included 2D EIT results for comparison and conducted ablation experiments to validate the effectiveness of the VAE. The study demonstrates the potential of using image series for lung ventilation mode classification, providing a feasible method for patient prescreening and presenting an alternative form of PFT.

AAAI Conference 2023 Conference Paper

Exploring Stochastic Autoregressive Image Modeling for Visual Representation

  • Yu Qi
  • Fan Yang
  • Yousong Zhu
  • Yufei Liu
  • Liwei Wu
  • Rui Zhao
  • Wei Li

Autoregressive language modeling (ALM) has been successfully used in self-supervised pre-training in Natural language processing (NLP). However, this paradigm has not achieved comparable results with other self-supervised approaches in computer vision (e.g., contrastive learning, masked image modeling). In this paper, we try to find the reason why autoregressive modeling does not work well on vision tasks. To tackle this problem, we fully analyze the limitations of visual autoregressive methods and propose a novel stochastic autoregressive image modeling method (named SAIM) built on two simple designs. First, we serialize the image into patches. Second, we employ a stochastic permutation strategy to generate an effective and robust image context, which is critical for vision tasks. To realize this task, we create a parallel encoder-decoder training process in which the encoder serves a similar role to the standard vision transformer, focusing on learning the whole contextual information, while the decoder predicts the content of the current position, so that the encoder and decoder can reinforce each other. Our method significantly improves the performance of autoregressive image modeling and achieves the best accuracy (83.9%) on the vanilla ViT-Base model among methods using only ImageNet-1K data. Transfer performance in downstream tasks also shows that our model achieves competitive performance. Code is available at https://github.com/qiy20/SAIM.
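The stochastic permutation strategy can be pictured as a permuted causal attention mask over patches. The toy NumPy sketch below is a simplified illustration of that factorization only; the paper's parallel encoder-decoder details are omitted, and the grid size is made up.

```python
import numpy as np

rng = np.random.default_rng(0)
num_patches = 6                       # e.g. patches of a tiny 2x3 image grid
order = rng.permutation(num_patches)  # one stochastic prediction order

# Under this permutation, patch order[i] may attend only to itself and to
# the patches that precede it in the sampled order (a permuted causal mask).
mask = np.zeros((num_patches, num_patches), dtype=bool)
for i, q in enumerate(order):
    mask[q, order[:i + 1]] = True

print(order)
print(mask.astype(int))
```

Resampling `order` for every training image is what makes the context "stochastic": each image is modeled autoregressively under a different patch ordering.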

IJCAI Conference 2023 Conference Paper

Learning 3D Photography Videos via Self-supervised Diffusion on Single Images

  • Xiaodong Wang
  • Chenfei Wu
  • Shengming Yin
  • Minheng Ni
  • Jianfeng Wang
  • Linjie Li
  • Zhengyuan Yang
  • Fan Yang

3D photography renders a static image into a video with appealing 3D visual effects. Existing approaches typically first conduct monocular depth estimation, then render the input frame to subsequent frames with various viewpoints, and finally use an inpainting model to fill those missing/occluded regions. The inpainting model plays a crucial role in rendering quality, but it is normally trained on out-of-domain data. To reduce the training and inference gap, we propose a novel self-supervised diffusion model as the inpainting module. Given a single input image, we automatically construct a training pair of the masked occluded image and the ground-truth image with random cycle rendering. The constructed training samples are closely aligned to the testing instances, without the need for data annotation. To make full use of the masked images, we designed a Masked Enhanced Block (MEB), which can be easily plugged into the UNet and enhance the semantic conditions. Towards real-world animation, we present a novel task: out-animation, which extends the space and time of input objects. Extensive experiments on real datasets show that our method achieves competitive results with existing SOTA methods.

NeurIPS Conference 2023 Conference Paper

Model-enhanced Vector Index

  • Hailin Zhang
  • Yujing Wang
  • Qi Chen
  • Ruiheng Chang
  • Ting Zhang
  • Ziming Miao
  • Yingyan Hou
  • Yang Ding

Embedding-based retrieval methods construct vector indices to search for document representations that are most similar to the query representations. They are widely used in document retrieval due to low latency and decent recall performance. Recent research indicates that deep retrieval solutions offer better model quality, but are hindered by unacceptable serving latency and the inability to support document updates. In this paper, we aim to enhance the vector index with end-to-end deep generative models, leveraging the differentiable advantages of deep retrieval models while maintaining desirable serving efficiency. We propose Model-enhanced Vector Index (MEVI), a differentiable model-enhanced index empowered by a twin-tower representation model. MEVI leverages a Residual Quantization (RQ) codebook to bridge the sequence-to-sequence deep retrieval and embedding-based models. To substantially reduce the inference time, instead of decoding the unique document ids in long sequential steps, we first generate some semantic virtual cluster ids of candidate documents in a small number of steps, and then leverage the well-adapted embedding vectors to further perform a fine-grained search for the relevant documents in the candidate virtual clusters. We empirically show that our model achieves better performance on the commonly used academic benchmarks MSMARCO Passage and Natural Questions, with comparable serving latency to dense retrieval solutions.

NeurIPS Conference 2023 Conference Paper

Train Faster, Perform Better: Modular Adaptive Training in Over-Parameterized Models

  • Yubin Shi
  • Yixuan Chen
  • Mingzhi Dong
  • Xiaochen Yang
  • Dongsheng Li
  • Yujiang Wang
  • Robert Dick
  • Qin Lv

Despite their prevalence in deep-learning communities, over-parameterized models impose high computational costs for proper training. This work studies the fine-grained, modular-level learning dynamics of over-parameterized models to attain a more efficient and fruitful training strategy. Empirical evidence reveals that when scaling down into network modules, such as heads in self-attention models, we can observe varying learning patterns implicitly associated with each module's trainability. To describe such modular-level learning capabilities, we introduce a novel concept dubbed modular neural tangent kernel (mNTK), and we demonstrate that the quality of a module's learning is tightly associated with its mNTK's principal eigenvalue $\lambda_{\max}$. A large $\lambda_{\max}$ indicates that the module learns features with better convergence, while modules with small values may impact generalization negatively. Inspired by this discovery, we propose a novel training strategy termed Modular Adaptive Training (MAT) that selectively updates those modules whose $\lambda_{\max}$ exceeds a dynamic threshold, concentrating the model on learning common features and ignoring inconsistent ones. Unlike most existing training schemes with a complete BP cycle across all network modules, MAT can significantly save computations by its partially-updating strategy and can further improve performance. Experiments show that MAT nearly halves the computational cost of model training and outperforms the accuracy of baselines.

AAAI Conference 2022 Conference Paper

Confidence Calibration for Intent Detection via Hyperspherical Space and Rebalanced Accuracy-Uncertainty Loss

  • Yantao Gong
  • Cao Liu
  • Fan Yang
  • Xunliang Cai
  • Guanglu Wan
  • Jiansong Chen
  • Weipeng Zhang
  • Houfeng Wang

Data-driven methods have achieved notable performance on intent detection, which is a task to comprehend user queries. Nonetheless, they have been criticized for over-confident predictions. In some scenarios, users care not only about the accuracy but also about the confidence of the model. Unfortunately, mainstream neural networks are poorly calibrated, with a large gap between accuracy and confidence. To handle this problem, defined as confidence calibration, we propose a model using the hyperspherical space and a rebalanced accuracy-uncertainty loss. Specifically, we project the label vector onto hyperspherical space uniformly to generate a dense label representation matrix, which mitigates over-confident predictions caused by overfitting the sparse one-hot label matrix. Besides, we rebalance samples of different accuracy and uncertainty to better guide model training. Experiments on the open datasets verify that our model outperforms the existing calibration methods and achieves a significant improvement on the calibration metric.

AAAI Conference 2022 Conference Paper

DeTarNet: Decoupling Translation and Rotation by Siamese Network for Point Cloud Registration

  • Zhi Chen
  • Fan Yang
  • Wenbing Tao

Point cloud registration is a fundamental step for many tasks. In this paper, we propose a neural network named DetarNet to decouple the translation t and rotation R, so as to overcome the performance degradation due to their mutual interference in point cloud registration. First, a Siamese Network based Progressive and Coherent Feature Drift (PCFD) module is proposed to align the source and target points in high-dimensional feature space, and accurately recover translation from the alignment process. Then we propose a Consensus Encoding Unit (CEU) to construct more distinguishable features for a set of putative correspondences. After that, a Spatial and Channel Attention (SCA) block is adopted to build a classification network for finding good correspondences. Finally, the rotation is obtained by Singular Value Decomposition (SVD). In this way, the proposed network decouples the estimation of translation and rotation, resulting in better performance for both of them. Experimental results demonstrate that the proposed DetarNet improves registration performance on both indoor and outdoor scenes. Our code will be available at https://github.com/ZhiChen902/DetarNet.
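The final SVD step is the classic Kabsch/Procrustes solution for the rotation once correspondences are fixed. A minimal NumPy sketch of that step alone (independent of the paper's network; the data and helper name are illustrative):

```python
import numpy as np

def kabsch_rotation(src, dst):
    """Least-squares rotation R with R @ src_i ≈ dst_i (Kabsch algorithm)."""
    src_c = src - src.mean(axis=0)          # remove translation first,
    dst_c = dst - dst.mean(axis=0)          # mirroring the t/R decoupling
    H = src_c.T @ dst_c                     # 3x3 cross-covariance
    U, _, Vt = np.linalg.svd(H)
    d = np.sign(np.linalg.det(Vt.T @ U.T))  # guard against reflections
    return Vt.T @ np.diag([1.0, 1.0, d]) @ U.T

# Sanity check: recover a known rotation about the z-axis plus a translation.
rng = np.random.default_rng(0)
pts = rng.normal(size=(100, 3))
theta = 0.3
R_true = np.array([[np.cos(theta), -np.sin(theta), 0.0],
                   [np.sin(theta),  np.cos(theta), 0.0],
                   [0.0, 0.0, 1.0]])
moved = pts @ R_true.T + np.array([1.0, -2.0, 0.5])
R_est = kabsch_rotation(pts, moved)
assert np.allclose(R_est, R_true, atol=1e-6)
```

In the paper the correspondences fed to this step come from the learned classification network rather than being given exactly, but the closed-form rotation recovery is the same.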

NeurIPS Conference 2022 Conference Paper

Forecasting Human Trajectory from Scene History

  • Mancheng Meng
  • Ziyan Wu
  • Terrence Chen
  • Xiran Cai
  • Xiang Zhou
  • Fan Yang
  • Dinggang Shen

Predicting the future trajectory of a person remains a challenging problem, due to randomness and subjectivity. However, the moving patterns of humans in a constrained scenario typically conform, to a certain extent, to a limited number of regularities, because of the scenario restrictions (e.g., floor plan, roads, and obstacles) and person-person or person-object interactivity. Thus, an individual person in this scenario should follow one of the regularities as well. In other words, a person's subsequent trajectory has likely been traveled by others. Based on this hypothesis, we propose to forecast a person's future trajectory by learning from the implicit scene regularities. We call these regularities, inherently derived from the past dynamics of the people and the environment in the scene, the scene history. We categorize scene history information into two types: historical group trajectories and individual-surroundings interaction. To exploit this information for trajectory prediction, we propose a novel framework, Scene History Excavating Network (SHENet), where the scene history is leveraged in a simple yet effective approach. In particular, we design two components: a group trajectory bank module that extracts representative group trajectories as candidates for the future path, and a cross-modal interaction module that models the interaction between an individual's past trajectory and its surroundings for trajectory refinement. In addition, to mitigate the uncertainty in the evaluation, caused by the aforementioned randomness and subjectivity, we propose to include smoothness in the evaluation metrics. We conduct extensive evaluations to validate the efficacy of the proposed framework on ETH, UCY, as well as a new, challenging benchmark dataset PAV, demonstrating superior performance compared to state-of-the-art methods.

AAAI Conference 2022 Conference Paper

Learning Optical Flow with Adaptive Graph Reasoning

  • Ao Luo
  • Fan Yang
  • Kunming Luo
  • Xin Li
  • Haoqiang Fan
  • Shuaicheng Liu

Estimating per-pixel motion between video frames, known as optical flow, is a long-standing problem in video understanding and analysis. Most contemporary optical flow techniques largely focus on addressing the cross-image matching with feature similarity, with few methods considering how to explicitly reason over the given scene for achieving a holistic motion understanding. In this work, taking a fresh perspective, we introduce a novel graph-based approach, called adaptive graph reasoning for optical flow (AGFlow), to emphasize the value of scene/context information in optical flow. Our key idea is to decouple the context reasoning from the matching procedure, and exploit scene information to effectively assist motion estimation by learning to reason over the adaptive graph. The proposed AGFlow can effectively exploit the context information and incorporate it within the matching procedure, producing more robust and accurate results. On both Sintel clean and final passes, our AGFlow achieves the best accuracy with EPE of 1.43 and 2.47 pixels, outperforming state-of-the-art approaches by 11.2% and 13.6%, respectively. Code is publicly available at https://github.com/megvii-research/AGFlow.

NeurIPS Conference 2022 Conference Paper

Obj2Seq: Formatting Objects as Sequences with Class Prompt for Visual Tasks

  • Zhiyang Chen
  • Yousong Zhu
  • Zhaowen Li
  • Fan Yang
  • Wei Li
  • Haixin Wang
  • Chaoyang Zhao
  • Liwei Wu

Visual tasks vary widely in their output formats and the contents they concern; therefore, it is hard to process them with an identical structure. One main obstacle lies in the high-dimensional outputs in object-level visual tasks. In this paper, we propose an object-centric vision framework, Obj2Seq. Obj2Seq takes objects as basic units, and regards most object-level visual tasks as sequence generation problems of objects. Therefore, these visual tasks can be decoupled into two steps: first, recognize objects of given categories; then, generate a sequence for each of these objects. The definition of the output sequences varies for different tasks, and the model is supervised by matching these sequences with ground-truth targets. Obj2Seq is able to flexibly determine input categories to satisfy customized requirements, and can be easily extended to different visual tasks. When experimenting on MS COCO, Obj2Seq achieves 45.7% AP on object detection, 89.0% AP on multi-label classification and 65.0% AP on human pose estimation. These results demonstrate its potential to be generally applied to different visual tasks. Code has been made available at: https://github.com/CASIA-IVA-Lab/Obj2Seq.

NeurIPS Conference 2022 Conference Paper

One-Inlier is First: Towards Efficient Position Encoding for Point Cloud Registration

  • Fan Yang
  • Lin Guo
  • Zhi Chen
  • Wenbing Tao

Transformer architecture has shown great potential for many visual tasks, including point cloud registration. As an order-aware module, position encoding plays an important role in Transformer architecture applied to point cloud registration task. In this paper, we propose OIF-PCR, a one-inlier based position encoding method for point cloud registration network. Specifically, we first find one correspondence by a differentiable optimal transport layer, and use it to normalize each point for position encoding. It can eliminate the challenges brought by the different reference frames of two point clouds, and mitigate the feature ambiguity by learning the spatial consistency. Then, we propose a joint approach for establishing correspondence and position encoding, presenting an iterative optimization process. Finally, we design a progressive way for point cloud alignment and feature learning to gradually optimize the rigid transformation. The proposed position encoding is very efficient, requiring only a small addition of memory and computing overhead. Extensive experiments demonstrate the proposed method can achieve competitive performance with the state-of-the-art methods in both indoor and outdoor scenes.

NeurIPS Conference 2022 Conference Paper

UMIX: Improving Importance Weighting for Subpopulation Shift via Uncertainty-Aware Mixup

  • Zongbo Han
  • Zhipeng Liang
  • Fan Yang
  • Liu Liu
  • Lanqing Li
  • Yatao Bian
  • Peilin Zhao
  • Bingzhe Wu

Subpopulation shift widely exists in many real-world machine learning applications, referring to the training and test distributions containing the same subpopulation groups but varying in subpopulation frequencies. Importance reweighting is a common way to handle the subpopulation shift issue by imposing constant or adaptive sampling weights on each sample in the training dataset. However, some recent studies have recognized that most of these approaches fail to improve the performance over empirical risk minimization, especially when applied to over-parameterized neural networks. In this work, we propose a simple yet practical framework, called uncertainty-aware mixup (UMIX), to mitigate the overfitting issue in over-parameterized models by reweighting the ''mixed'' samples according to the sample uncertainty. UMIX is equipped with training-trajectory-based uncertainty estimation for each sample to flexibly characterize the subpopulation distribution. We also provide insightful theoretical analysis to verify that UMIX achieves better generalization bounds over prior works. Further, we conduct extensive empirical studies across a wide range of tasks to validate the effectiveness of our method both qualitatively and quantitatively. Code is available at https://github.com/TencentAILabHealthcare/UMIX.
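The reweighted-mixup idea can be sketched in a few lines of NumPy. This is a loose illustration, not the paper's exact scheme: here `uncertainty` is just an arbitrary per-sample score in [0, 1], whereas the paper derives it from training trajectories, and the weight normalization is an assumption for the sketch.

```python
import numpy as np

rng = np.random.default_rng(0)

def umix_batch(x, y, uncertainty, alpha=1.0):
    """Toy sketch: mixup a batch, then weight each mixed sample by the
    mixed uncertainty so that uncertain (likely minority-group) samples
    contribute more to the loss."""
    n = len(x)
    lam = rng.beta(alpha, alpha)      # mixup coefficient
    perm = rng.permutation(n)         # random pairing within the batch
    x_mix = lam * x + (1 - lam) * x[perm]
    y_mix = lam * y + (1 - lam) * y[perm]          # soft labels
    w = lam * uncertainty + (1 - lam) * uncertainty[perm]
    w = w * n / w.sum()               # normalize weights to mean 1
    return x_mix, y_mix, w

x = rng.normal(size=(8, 4))
y = np.eye(2)[rng.integers(0, 2, size=8)]
u = rng.uniform(size=8)               # stand-in uncertainty scores
x_mix, y_mix, w = umix_batch(x, y, u)
```

The weights `w` would then multiply the per-sample losses of `(x_mix, y_mix)` during training.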

AAAI Conference 2021 Conference Paper

Cascade Network with Guided Loss and Hybrid Attention for Finding Good Correspondences

  • Zhi Chen
  • Fan Yang
  • Wenbing Tao

Finding good correspondences is a critical prerequisite in many feature based tasks. Given a putative correspondence set of an image pair, we propose a neural network which finds correct correspondences via a binary classifier and estimates relative pose through the classified correspondences. We first observe that, due to the imbalance between correct and wrong correspondences, the loss function has a great impact on the classification results. Thus, we propose a new Guided Loss that can directly use an evaluation criterion (Fn-measure) as guidance to dynamically adjust the objective function during training. We theoretically prove a perfect negative correlation between the Guided Loss and the Fn-measure, so that the network is always trained in the direction of increasing Fn-measure to maximize it. We then propose a hybrid attention block to extract features, which integrates Bayesian attentive context normalization (BACN) and channel-wise attention (CA). BACN can mine prior information to better exploit global context, and CA can capture complex channel context to enhance the channel awareness of the network. Finally, based on our Guided Loss and hybrid attention block, a cascade network is designed to gradually optimize the result for more superior performance. Experiments have shown that our network achieves state-of-the-art performance on benchmark datasets. Our code will be available at https://github.com/wenbingtao/GLHA.

ICRA Conference 2021 Conference Paper

Chance Constrained Simultaneous Path Planning and Task Assignment with Bottleneck Objective

  • Fan Yang
  • Nilanjan Chakraborty

We present a novel algorithm for combined task assignment and path planning on a roadmap with stochastic costs. In this problem, the initially unassigned robots and tasks are located at known positions in a roadmap. We want to assign a unique task to each robot and compute a path for the robot to go to the task location. Given the means and variances of travel cost, our goal is to develop algorithms that guarantee that, for each robot, with high probability, the total travel cost is below a minimum value in any realization of the stochastic travel costs. We prove that the solution can be obtained by solving (a) chance-constrained shortest path problems for all robot-task pairs and (b) a linear bottleneck assignment problem in which the cost of an assignment equals the optimal objective value of the former problem. We propose algorithms for solving the chance-constrained shortest path problem either optimally or approximately by solving a number of deterministic shortest path problems that minimize linear combinations of the means and variances of edge costs. We present simulation results on randomly generated networks and data to demonstrate that our algorithm is scalable with the number of robots (or tasks) and the size of the network.
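The second step, the linear bottleneck assignment, minimizes the worst robot-task cost rather than the sum. A brute-force sketch for a toy instance (the cost matrix below is invented and stands in for precomputed chance-constrained shortest-path costs):

```python
from itertools import permutations

def bottleneck_assignment(cost):
    """Linear bottleneck assignment by brute force: choose the
    permutation minimizing the maximum robot-task cost (toy-sized only;
    real instances use polynomial-time threshold algorithms)."""
    n = len(cost)
    best = min(permutations(range(n)),
               key=lambda p: max(cost[i][p[i]] for i in range(n)))
    return list(best)

# Entry [i][j]: assumed chance-constrained shortest-path cost for
# robot i to reach task j.
cost = [[4.0, 9.0, 7.0],
        [8.0, 5.0, 3.0],
        [6.0, 2.0, 8.0]]
perm = bottleneck_assignment(cost)   # perm[i] = task assigned to robot i
```

Here the bottleneck objective picks the assignment whose most expensive robot-task pair is as cheap as possible, matching the "with high probability, for each robot" guarantee in the abstract.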

UAI Conference 2021 Conference Paper

Defending SVMs against poisoning attacks: the hardness and DBSCAN approach

  • Hu Ding 0003
  • Fan Yang
  • Jiawei Huang 0009

Adversarial machine learning has attracted a great amount of attention in recent years. Due to the great importance of support vector machines (SVM) in machine learning, we consider defending SVM against poisoning attacks in this paper. We study two commonly used defense strategies: designing robust SVM algorithms and data sanitization. Though several robust SVM algorithms have been proposed before, most of them either lack adversarial resilience or rely on strong assumptions about the data distribution or the attacker's behavior. Moreover, research on the hardness of designing a quality-guaranteed, adversarially resilient SVM algorithm is still quite limited. We are the first, to the best of our knowledge, to prove that even the simplest hard-margin one-class SVM with adversarial outliers problem is NP-complete and admits no fully polynomial-time approximation scheme unless P=NP. For data sanitization, we explain the effectiveness of DBSCAN (as a density-based outlier removal method) for defending against poisoning attacks. In particular, we link it to intrinsic dimensionality by proving a sampling theorem in doubling metrics. In our empirical experiments, we systematically compare several defenses, including DBSCAN and robust SVM methods, and investigate how the intrinsic dimensionality and the poisoned fraction influence their performance.

JBHI Journal 2021 Journal Article

Disease Prediction via Graph Neural Networks

  • Zhenchao Sun
  • Hongzhi Yin
  • Hongxu Chen
  • Tong Chen
  • Lizhen Cui
  • Fan Yang

With the increasingly available electronic medical records (EMRs), disease prediction has recently gained immense research attention, where an accurate classifier needs to be trained to map the input prediction signals (e.g., symptoms, patient demographics, etc.) to the estimated diseases for each patient. However, existing machine learning-based solutions heavily rely on abundant manually labeled EMR training data to ensure satisfactory prediction results, impeding their performance in the presence of rare diseases that are subject to severe data scarcity. For each rare disease, the limited EMR data can hardly offer sufficient information for a model to correctly distinguish its identity from other diseases with similar clinical symptoms. Furthermore, most existing disease prediction approaches are based on the sequential EMRs collected for every patient and are unable to handle new patients without historical EMRs, reducing their real-life practicality. In this paper, we introduce an innovative model based on Graph Neural Networks (GNNs) for disease prediction, which utilizes external knowledge bases to augment the insufficient EMR data, and learns highly representative node embeddings for patients, diseases and symptoms from the medical concept graph and patient record graph respectively constructed from the medical knowledge base and EMRs. By aggregating information from directly connected neighbor nodes, the proposed neural graph encoder can effectively generate embeddings that capture knowledge from both data sources, and is able to inductively infer the embeddings for a new patient based on the symptoms reported in her/his EMRs to allow for accurate prediction on both general diseases and rare diseases. Extensive experiments on a real-world EMR dataset have demonstrated the state-of-the-art performance of our proposed model.

IS Journal 2021 Journal Article

Fairness in Deep Learning: A Computational Perspective

  • Mengnan Du
  • Fan Yang
  • Na Zou
  • Xia Hu

Fairness in deep learning has attracted tremendous attention recently, as deep learning is increasingly being used in high-stakes decision-making applications that affect individual lives. We provide a review covering recent progress in tackling algorithmic fairness problems of deep learning from the computational perspective. Specifically, we show that interpretability can serve as a useful ingredient to diagnose the reasons that lead to algorithmic discrimination. We also discuss fairness mitigation approaches categorized according to three stages of the deep learning life-cycle, aiming to push forward the area of fairness in deep learning and build genuinely fair and reliable deep learning systems.

NeurIPS Conference 2021 Conference Paper

Learning Interpretable Decision Rule Sets: A Submodular Optimization Approach

  • Fan Yang
  • Kai He
  • Linxiao Yang
  • Hongxia Du
  • Jingbang Yang
  • Bo Yang
  • Liang Sun

Rule sets are highly interpretable logical models in which the predicates for decision are expressed in disjunctive normal form (DNF, OR-of-ANDs), or, equivalently, the overall model comprises an unordered collection of if-then decision rules. In this paper, we consider a submodular optimization based approach for learning rule sets. The learning problem is framed as a subset selection task in which a subset of all possible rules needs to be selected to form an accurate and interpretable rule set. We employ an objective function that exhibits submodularity and thus is amenable to submodular optimization techniques. To overcome the difficulty arising from the exponential-sized ground set of rules, the subproblem of searching for a rule is cast as another subset selection task that asks for a subset of features. We show it is possible to write the induced objective function for the subproblem as a difference of two submodular (DS) functions, making it approximately solvable by DS optimization algorithms. Overall, the proposed approach is simple, scalable, and likely to benefit from further research on submodular optimization. Experiments on real datasets demonstrate the effectiveness of our method.
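The subset-selection framing can be illustrated with the classic greedy algorithm for monotone submodular maximization. Plain example coverage stands in for the paper's actual accuracy/interpretability objective, and the toy rules below are invented for illustration:

```python
def greedy_select(candidates, covered_by, k):
    """Greedily pick up to k rules maximizing example coverage.

    Coverage is a monotone submodular set function, so this greedy
    enjoys the standard (1 - 1/e) approximation guarantee.
    """
    chosen, covered = [], set()
    for _ in range(k):
        best, best_gain = None, 0
        for r in candidates:
            if r in chosen:
                continue
            gain = len(covered_by[r] - covered)  # marginal gain of adding r
            if gain > best_gain:
                best, best_gain = r, gain
        if best is None:          # no remaining rule adds new coverage
            break
        chosen.append(best)
        covered |= covered_by[best]
    return chosen, covered

# Toy ground set: each candidate rule covers some example ids.
covered_by = {
    "r1": {1, 2, 3},
    "r2": {3, 4},
    "r3": {4, 5, 6, 7},
    "r4": {1, 7},
}
rules, covered = greedy_select(list(covered_by), covered_by, k=2)
```

The paper's contribution is making this tractable when the ground set of rules is exponentially large, by searching for each rule via a second (DS-optimized) subset-selection problem over features.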

NeurIPS Conference 2021 Conference Paper

MST: Masked Self-Supervised Transformer for Visual Representation

  • Zhaowen Li
  • Zhiyang Chen
  • Fan Yang
  • Wei Li
  • Yousong Zhu
  • Chaoyang Zhao
  • Rui Deng
  • Liwei Wu

Transformer has been widely used for self-supervised pre-training in Natural Language Processing (NLP) and achieved great success. However, it has not been fully explored in visual self-supervised learning. Meanwhile, previous methods only consider high-level features and learn representations from a global perspective, which may fail to transfer to downstream dense prediction tasks focusing on local features. In this paper, we present a novel Masked Self-supervised Transformer approach named MST, which can explicitly capture the local context of an image while preserving the global semantic information. Specifically, inspired by Masked Language Modeling (MLM) in NLP, we propose a masked token strategy based on the multi-head self-attention map, which dynamically masks some tokens of local patches without damaging the crucial structure for self-supervised learning. More importantly, the masked tokens together with the remaining tokens are further recovered by a global image decoder, which preserves the spatial information of the image and is more friendly to the downstream dense prediction tasks. Experiments on multiple datasets demonstrate the effectiveness and generality of the proposed method. For instance, MST achieves a Top-1 accuracy of 76.9% with DeiT-S using only 300-epoch pre-training by linear evaluation, which outperforms supervised methods with the same epochs by 0.4% and its comparable variant DINO by 1.0%. For dense prediction tasks, MST also achieves 42.7% mAP on MS COCO object detection and 74.04% mIoU on Cityscapes segmentation with only 100-epoch pre-training.

IJCAI Conference 2021 Conference Paper

Time Series Data Augmentation for Deep Learning: A Survey

  • Qingsong Wen
  • Liang Sun
  • Fan Yang
  • Xiaomin Song
  • Jingkun Gao
  • Xue Wang
  • Huan Xu

Deep learning performs remarkably well on many time series analysis tasks recently. The superior performance of deep neural networks relies heavily on a large number of training data to avoid overfitting. However, the labeled data of many real-world time series applications may be limited such as classification in medical time series and anomaly detection in AIOps. As an effective way to enhance the size and quality of the training data, data augmentation is crucial to the successful application of deep learning models on time series data. In this paper, we systematically review different data augmentation methods for time series. We propose a taxonomy for the reviewed methods, and then provide a structured review for these methods by highlighting their strengths and limitations. We also empirically compare different data augmentation methods for different tasks including time series classification, anomaly detection, and forecasting. Finally, we discuss and highlight five future directions to provide useful research guidance.

IROS Conference 2020 Conference Paper

Algorithm for Multi-Robot Chance-Constrained Generalized Assignment Problem with Stochastic Resource Consumption

  • Fan Yang
  • Nilanjan Chakraborty

We present a novel algorithm for the multi-robot generalized assignment problem (GAP) with stochastic resource consumption. In this problem, each robot has a resource (e.g., battery life) constraint and it consumes a certain amount of resource to perform a task. In practice, the resource consumed in performing a task can be uncertain. Therefore, we assume that the resource consumption is a random variable with known mean and variance. The objective is to find an assignment of the robots to tasks that maximizes the team payoff. Each task is assigned to at most one robot, and the resource constraint for each robot has to be satisfied with very high probability. We formulate the problem as a chance-constrained combinatorial optimization problem and call it the chance-constrained generalized assignment problem (CC-GAP). This problem is an extension of the deterministic generalized assignment problem, which is NP-hard. We design an iterative algorithm for solving CC-GAP in which each robot maximizes its own objective by solving a chance-constrained knapsack problem in an iterative manner. The approximation ratio of our algorithm is (1+α), assuming that the deterministic knapsack problem is solved by an α-approximation algorithm. We present simulation results to demonstrate that our algorithm is scalable with the number of robots and tasks.

NeurIPS Conference 2020 Conference Paper

Bayesian Multi-type Mean Field Multi-agent Imitation Learning

  • Fan Yang
  • Alina Vereshchaka
  • Changyou Chen
  • Wen Dong

Multi-agent imitation learning (MAIL) refers to the problem in which agents learn to perform a task interactively in a multi-agent system by observing and mimicking expert demonstrations, without any knowledge of a reward function from the environment. MAIL has received a lot of attention due to promising results achieved on synthesized tasks, with the potential to be applied to complex real-world multi-agent tasks. Key challenges for MAIL include sample efficiency and scalability. In this paper, we propose Bayesian multi-type mean field multi-agent imitation learning (BM3IL). Our method improves sample efficiency by establishing a Bayesian formulation for MAIL, and enhances scalability by introducing a new multi-type mean field approximation. We demonstrate the performance of our algorithm by benchmarking against three state-of-the-art multi-agent imitation learning algorithms on several tasks, including solving a multi-agent traffic optimization problem in a real-world transportation network. Experimental results indicate that our algorithm significantly outperforms all other algorithms in all scenarios.

ICRA Conference 2020 Conference Paper

Chance Constrained Simultaneous Path Planning and Task Assignment for Multiple Robots with Stochastic Path Costs

  • Fan Yang
  • Nilanjan Chakraborty

We present a novel algorithm for simultaneous task assignment and path planning on a graph (or roadmap) with stochastic edge costs. In this problem, the initially unassigned robots and tasks are located at known positions in a roadmap. We want to assign a unique task to each robot and compute a path for the robot to go to its assigned task location. Given the mean and variance of the travel cost of each edge, our goal is to develop algorithms that guarantee that, with high probability, the total path cost of the robot team is below a minimum value in any realization of the stochastic travel costs. We formulate the problem as a chance-constrained simultaneous task assignment and path planning problem (CC-STAP). We prove that the optimal solution of CC-STAP can be obtained by solving a sequence of deterministic simultaneous task assignment and path planning problems in which the travel cost is a linear combination of the mean and variance of the edge cost. We show that the deterministic problem can be solved in two steps: in the first step, the robots compute shortest paths to the task locations, and in the second step, the robots solve a linear assignment problem with the costs obtained in the first step. We also propose a distributed algorithm that solves CC-STAP near-optimally. We present simulation results on randomly generated networks and data to demonstrate that our algorithm is scalable with the number of robots (or tasks) and the size of the network.
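A minimal sketch of the two-step deterministic subproblem: Dijkstra over effective edge costs mean + λ·variance, then an assignment over the resulting robot-task cost matrix. The tiny roadmap, the value of λ, and the brute-force assignment are illustrative simplifications, not the paper's algorithm:

```python
import heapq
from itertools import permutations

def shortest_path_cost(adj, src, dst, lam):
    """Dijkstra on effective edge cost mean + lam * variance, one member
    of the family of deterministic problems the abstract describes."""
    dist = {src: 0.0}
    pq = [(0.0, src)]
    while pq:
        d, u = heapq.heappop(pq)
        if u == dst:
            return d
        if d > dist.get(u, float("inf")):
            continue
        for v, mean, var in adj.get(u, []):
            nd = d + mean + lam * var
            if nd < dist.get(v, float("inf")):
                dist[v] = nd
                heapq.heappush(pq, (nd, v))
    return float("inf")

def assign(cost):
    """Brute-force linear (sum) assignment, fine for a toy instance."""
    n = len(cost)
    best = min(permutations(range(n)),
               key=lambda p: sum(cost[i][p[i]] for i in range(n)))
    return list(best)

# Toy roadmap: adj[u] = [(v, mean, variance), ...]; robots at a, b; tasks at c, d.
adj = {
    "a": [("c", 2.0, 1.0), ("d", 5.0, 0.1)],
    "b": [("c", 4.0, 0.5), ("d", 1.0, 2.0)],
}
lam = 0.5
cost = [[shortest_path_cost(adj, r, t, lam) for t in ("c", "d")]
        for r in ("a", "b")]
assignment = assign(cost)   # assignment[i] = task index for robot i
```

Sweeping λ and re-solving, as the abstract describes, traces out the candidate solutions among which the chance-constrained optimum lies.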

NeurIPS Conference 2020 Conference Paper

EvolveGraph: Multi-Agent Trajectory Prediction with Dynamic Relational Reasoning

  • Jiachen Li
  • Fan Yang
  • Masayoshi Tomizuka
  • Chiho Choi

Multi-agent interacting systems are prevalent in the world, from purely physical systems to complicated social dynamic systems. In many applications, effective understanding of the situation and accurate trajectory prediction of interactive agents play a significant role in downstream tasks, such as decision making and planning. In this paper, we propose a generic trajectory forecasting framework (named EvolveGraph) with explicit relational structure recognition and prediction via latent interaction graphs among multiple heterogeneous, interactive agents. Considering the uncertainty of future behaviors, the model is designed to provide multi-modal prediction hypotheses. Since the underlying interactions may evolve even with abrupt changes, and different modalities of evolution may lead to different outcomes, we address the necessity of dynamic relational reasoning and adaptively evolving the interaction graphs. We also introduce a double-stage training pipeline which not only improves training efficiency and accelerates convergence, but also enhances model performance. The proposed framework is evaluated on both synthetic physics simulations and multiple real-world benchmark datasets in various areas. The experimental results illustrate that our approach achieves state-of-the-art performance in terms of prediction accuracy.

AAAI Conference 2020 Conference Paper

Hybrid Graph Neural Networks for Crowd Counting

  • Ao Luo
  • Fan Yang
  • Xin Li
  • Dong Nie
  • Zhicheng Jiao
  • Shangchen Zhou
  • Hong Cheng

Crowd counting is an important yet challenging task due to the large scale and density variation. Recent investigations have shown that distilling rich relations among multi-scale features and exploiting useful information from the auxiliary task, i.e., localization, are vital for this task. Nevertheless, how to comprehensively leverage these relations within a unified network architecture is still a challenging problem. In this paper, we present a novel network structure called Hybrid Graph Neural Network (HyGnn) which aims to relieve the problem by interweaving the multi-scale features for crowd density and its auxiliary task (localization) together and performing joint reasoning over a graph. Specifically, HyGnn integrates a hybrid graph to jointly represent the task-specific feature maps of different scales as nodes, and two types of relations as edges: (i) multi-scale relations capturing the feature dependencies across scales and (ii) mutually beneficial relations building bridges for the cooperation between counting and localization. Thus, through message passing, HyGnn can capture and distill richer relations between nodes to obtain more powerful representations, providing robust and accurate results. HyGnn performs strongly on four challenging datasets: ShanghaiTech Part A, ShanghaiTech Part B, UCF_CC_50 and UCF-QNRF, outperforming state-of-the-art algorithms by a large margin.

AAAI Conference 2020 Conference Paper

Mining on Heterogeneous Manifolds for Zero-Shot Cross-Modal Image Retrieval

  • Fan Yang
  • Zheng Wang
  • Jing Xiao
  • Shin'ichi Satoh

Most recent approaches for zero-shot cross-modal image retrieval map images from different modalities into a uniform feature space to exploit their relevance using a pre-trained model. Based on the observation that manifolds of zero-shot images are usually deformed and incomplete, we argue that the manifolds of unseen classes are inevitably distorted during the training of a two-stream model that simply maps images from different modalities into a uniform space. This issue directly leads to poor cross-modal retrieval performance. We propose a bi-directional random walk scheme to mine more reliable relationships between images by traversing heterogeneous manifolds in the feature space of each modality. Our proposed method benefits from intra-modal distributions to alleviate the interference caused by noisy similarities in the cross-modal feature space. As a result, we achieve a significant improvement in the performance of the thermal vs. visible image retrieval task. The code of this paper: https://github.com/fyang93/cross-modal-retrieval

IJCAI Conference 2020 Conference Paper

On Metric DBSCAN with Low Doubling Dimension

  • Hu Ding
  • Fan Yang
  • Mingyue Wang

The density-based clustering method Density-Based Spatial Clustering of Applications with Noise (DBSCAN) is a popular method for outlier recognition and has received tremendous attention from many different areas. A major issue of the original DBSCAN is that its time complexity can be as large as quadratic. Most existing DBSCAN algorithms focus on developing efficient index structures to speed up the procedure in low-dimensional Euclidean space. However, research on DBSCAN in high-dimensional Euclidean space or general metric spaces is still quite limited, to the best of our knowledge. In this paper, we consider the metric DBSCAN problem under the assumption that the inliers (excluding the outliers) have a low doubling dimension. We apply a novel randomized k-center clustering idea to reduce the complexity of the range query, which is the most time-consuming step in the whole DBSCAN procedure. Our proposed algorithms do not need to build any complicated data structures and are easy to implement in practice. The experimental results show that our algorithms can significantly outperform existing DBSCAN algorithms in terms of running time.

ICLR Conference 2020 Conference Paper

Relational State-Space Model for Stochastic Multi-Object Systems

  • Fan Yang
  • Ling Chen 0001
  • Fan Zhou 0012
  • Yusong Gao
  • Wei Cao 0006

Real-world dynamical systems often consist of multiple stochastic subsystems that interact with each other. Modeling and forecasting the behavior of such dynamics are generally not easy, due to the inherent hardness in understanding the complicated interactions and evolutions of their constituents. This paper introduces the relational state-space model (R-SSM), a sequential hierarchical latent variable model that makes use of graph neural networks (GNNs) to simulate the joint state transitions of multiple correlated objects. By letting GNNs cooperate with SSM, R-SSM provides a flexible way to incorporate relational information into the modeling of multi-object dynamics. We further suggest augmenting the model with normalizing flows instantiated for vertex-indexed random variables and propose two auxiliary contrastive objectives to facilitate the learning. The utility of R-SSM is empirically evaluated on synthetic and real time series datasets.

JAIR Journal 2020 Journal Article

TensorLog: A Probabilistic Database Implemented Using Deep-Learning Infrastructure

  • William Cohen
  • Fan Yang
  • Kathryn Rivard Mazaitis

We present an implementation of a probabilistic first-order logic called TensorLog, in which classes of logical queries are compiled into differentiable functions in a neural-network infrastructure such as TensorFlow or Theano. This leads to a close integration of probabilistic logical reasoning with deep-learning infrastructure: in particular, it enables high-performance deep learning frameworks to be used for tuning the parameters of a probabilistic logic. The integration with these frameworks enables use of GPU-based parallel processors for inference and learning, making TensorLog the first highly parallelizable probabilistic logic. Experimental results show that TensorLog scales to problems involving hundreds of thousands of knowledge-base triples and tens of thousands of examples.

AAAI Conference 2020 Conference Paper

Variational Adversarial Kernel Learned Imitation Learning

  • Fan Yang
  • Alina Vereshchaka
  • Yufan Zhou
  • Changyou Chen
  • Wen Dong

Imitation learning refers to the problem where an agent learns to perform a task through observing and mimicking expert demonstrations, without knowledge of the cost function. State-of-the-art imitation learning algorithms reduce imitation learning to distribution-matching problems by minimizing some distance measures. However, the distance measure may not always provide informative signals for a policy update. To this end, we propose the variational adversarial kernel learned imitation learning (VAKLIL), which measures the distance using the maximum mean discrepancy with variational kernel learning. Our method optimizes over a large cost-function space and is sample efficient and robust to overfitting. We demonstrate the performance of our algorithm through benchmarking with four state-of-the-art imitation learning algorithms over five high-dimensional control tasks, and a complex transportation control task. Experimental results indicate that our algorithm significantly outperforms related algorithms in all scenarios.

IJCAI Conference 2019 Conference Paper

Decoding EEG by Visual-guided Deep Neural Networks

  • Zhicheng Jiao
  • Haoxuan You
  • Fan Yang
  • Xin Li
  • Han Zhang
  • Dinggang Shen

Decoding visual stimuli from brain activities is an interdisciplinary study of neuroscience and computer vision. With the emergence of Human-AI Collaboration and Human-Computer Interaction, and the development of advanced machine learning models, brain decoding based on deep learning is attracting more attention. Electroencephalogram (EEG) is a widely used neurophysiology tool. Inspired by the success of deep learning on image representation and neural decoding, we propose a visual-guided EEG decoding method that contains a classification stage and a generation stage. In the classification stage, we design a visual-guided convolutional neural network (CNN) to obtain more discriminative representations from EEG, which are applied to achieve the classification results. In the generation stage, the visual-guided EEG features are input to our improved deep generative model with a visual consistence module to generate the corresponding visual stimuli. With the help of our visual-guided strategies, the proposed method outperforms traditional machine learning methods and deep learning models in the EEG decoding task.

AAAI Conference 2019 Conference Paper

Efficient Image Retrieval via Decoupling Diffusion into Online and Offline Processing

  • Fan Yang
  • Ryota Hinami
  • Yusuke Matsui
  • Steven Ly
  • Shin’ichi Satoh

Diffusion is commonly used as a ranking or re-ranking method in retrieval tasks to achieve higher retrieval performance, and has attracted much attention in recent years. A downside of diffusion is that it is slow compared to the naive k-NN search, incurring a non-trivial online computational cost on large datasets. To overcome this weakness, we propose a novel diffusion technique in this paper. In our work, instead of applying diffusion to the query, we precompute the diffusion results of each element in the database, making the online search a simple linear combination on top of the k-NN search process. Our proposed method becomes ∼10 times faster in terms of online search speed. Moreover, we propose to use late truncation instead of the early truncation of previous works to achieve better retrieval performance.
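The offline/online split can be sketched as follows. The simple random-walk iteration on a dense affinity matrix below is a generic diffusion stand-in, not the paper's exact scheme, and all data are synthetic:

```python
import numpy as np

def diffuse_offline(A, alpha=0.85, iters=30):
    """Offline: propagate each database element's identity over the
    affinity graph A. Row i of the result is the precomputed diffusion
    of database element i."""
    n = A.shape[0]
    S = A / A.sum(axis=1, keepdims=True)   # row-stochastic transition matrix
    F = np.eye(n)
    for _ in range(iters):
        F = alpha * S @ F + (1 - alpha) * np.eye(n)
    return F

def search_online(F, knn_ids, knn_sims):
    """Online: a linear combination of the precomputed diffusion rows of
    the query's k nearest neighbours -- no iteration at query time."""
    w = knn_sims / knn_sims.sum()
    return w @ F[knn_ids]

rng = np.random.default_rng(1)
X = rng.normal(size=(6, 3))
A = np.exp(-((X[:, None] - X[None]) ** 2).sum(-1))   # dense Gaussian affinities
F = diffuse_offline(A)
# Suppose k-NN returned database items 0 and 2 with these similarities:
scores = search_online(F, np.array([0, 2]), np.array([0.7, 0.3]))
```

The expensive iterative part runs entirely offline; each query costs one k-NN lookup plus a weighted sum of precomputed rows, which is the source of the speedup the abstract claims.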

NeurIPS Conference 2019 Conference Paper

Game Design for Eliciting Distinguishable Behavior

  • Fan Yang
  • Liu Leqi
  • Yifan Wu
  • Zachary Lipton
  • Pradeep Ravikumar
  • Tom Mitchell
  • William Cohen

The ability to infer latent psychological traits from human behavior is key to developing personalized human-interacting machine learning systems. Approaches to inferring such traits range from surveys to manually constructed experiments and games. However, these traditional games are limited because they are typically designed based on heuristics. In this paper, we formulate the task of designing behavior-diagnostic games that elicit distinguishable behavior as a mutual information maximization problem, which can be solved by optimizing a variational lower bound. Our framework is instantiated by using prospect theory to model varying player traits, and Markov Decision Processes to parameterize the games. We validate our approach empirically, showing that our designed games can successfully distinguish among players with different traits, outperforming manually designed ones by a large margin.

AAAI Conference 2019 Conference Paper

Large-Scale Heterogeneous Feature Embedding

  • Xiao Huang
  • Qingquan Song
  • Fan Yang
  • Xia Hu

Feature embedding aims to learn a low-dimensional vector representation for each instance to preserve the information in its features. These representations can benefit various off-the-shelf learning algorithms. While embedding models for a single type of features have been well-studied, real-world instances often contain multiple types of correlated features or even information within a different modality such as networks. Existing studies such as multiview learning show that it is promising to learn unified vector representations from all sources. However, high computational costs of incorporating heterogeneous information limit the applications of existing algorithms. The number of instances and dimensions of features in practice are often large. To bridge the gap, we propose a scalable framework FeatWalk, which can model and incorporate instance similarities in terms of different types of features into a unified embedding representation. To enable the scalability, FeatWalk does not directly calculate any similarity measure, but provides an alternative way to simulate the similarity-based random walks among instances to extract the local instance proximity and preserve it in a set of instance index sequences. These sequences are homogeneous with each other. A scalable word embedding algorithm is applied to them to learn a joint embedding representation of instances. Experiments on four real-world datasets demonstrate the efficiency and effectiveness of FeatWalk.

AAMAS Conference 2019 Conference Paper

Optimal Control of Complex Systems through Variational Inference with a Discrete Event Decision Process

  • Fan Yang
  • Bo Liu
  • Wen Dong

Complex social systems are composed of interconnected individuals whose interactions result in group behaviors. Optimal control of a real-world complex system has many applications, including road traffic management, epidemic prevention, and information dissemination. However, such real-world complex system control is difficult to achieve because of high-dimensional and non-linear system dynamics, and the exploding state and action spaces for the decision maker. Prior methods can be divided into two categories: simulation-based and analytical approaches. Existing simulation approaches have high variance in Monte Carlo integration, and the analytical approaches suffer from modeling inaccuracy. We adopt simulation modeling to specify the complex dynamics of a complex system, and develop analytical solutions for searching for optimal strategies in a complex network with a high-dimensional state-action space. To capture the complex system dynamics, we formulate the complex social network decision-making problem as a discrete event decision process. To address the curse of dimensionality and search in high-dimensional state-action spaces in complex systems, we reduce control of a complex system to variational inference and parameter learning, introduce a Bethe entropy approximation, and develop an expectation propagation algorithm. Our proposed algorithm leads to higher expected system rewards, faster convergence, and lower variance of the value function in a real-world transportation scenario than state-of-the-art analytical and sampling approaches.

AAAI Conference 2019 Conference Paper

Understanding Pictograph with Facial Features: End-to-End Sentence-Level Lip Reading of Chinese

  • Xiaobing Zhang
  • Haigang Gong
  • Xili Dai
  • Fan Yang
  • Nianbo Liu
  • Ming Liu

With the breakthrough of deep learning, lip reading technologies are under extraordinarily rapid progress. It is well known that Chinese is the most widely spoken language in the world. Unlike alphabetic languages, it involves more than 1,000 pronunciations as Pinyin and nearly 90,000 pictographic characters as Hanzi, which makes lip reading of Chinese very challenging. In this paper, we implement visual-only Chinese lip reading of unconstrained sentences in a two-step end-to-end architecture (LipCH-Net), in which two deep neural network models are employed to perform the recognition of Picture-to-Pinyin (mouth motion pictures to pronunciations) and the recognition of Pinyin-to-Hanzi (pronunciations to texts) respectively, followed by a joint optimization to improve the overall performance. In addition, two modules in the Pinyin-to-Hanzi model are pre-trained separately with large auxiliary data in advance of sequence-to-sequence training to make the best of long sequence matches for avoiding ambiguity. We collect 6 months of daily news broadcasts from the China Central Television (CCTV) website, and semi-automatically label them into a 20.95 GB dataset with 20,495 natural Chinese sentences. When trained on the CCTV dataset, the LipCH-Net model outperforms all state-of-the-art lip reading frameworks. According to the results, our scheme not only accelerates training and reduces overfitting, but also overcomes the syntactic ambiguity of Chinese, which provides a baseline for future relevant work.

ICRA Conference 2018 Conference Paper

Algorithm for Optimal Chance Constrained Knapsack Problem with Applications to Multi-Robot Teaming

  • Fan Yang
  • Nilanjan Chakraborty

Motivated by applications in multi-robot team selection, in this paper we present a novel algorithm for computing an optimal solution of the chance-constrained 0-1 knapsack problem. In this variation of the knapsack problem, the objective function is deterministic but the weights of the items are stochastic, and therefore the knapsack constraint is stochastic. We convert the chance-constrained knapsack problem to a two-dimensional discrete optimization problem on the variance-mean plane, where each point on the plane can be identified with an assignment of items to the knapsack. By exploiting the geometry of the non-convex feasible region of the chance-constrained knapsack problem in the variance-mean plane, we present a novel deterministic technique that finds an optimal solution by solving a sequence of deterministic knapsack problems (called the risk-averse knapsack problem). We apply our algorithm to a multi-robot team selection problem for covering a given route, where the length of the route is much larger than the distance each individual robot can fly, and the distance an individual robot can fly is a random variable (with known mean and variance). We present simulation results on randomly generated data to demonstrate that our approach scales with both the number of robots and increasing uncertainty in the distance an individual robot can travel.

IJCAI Conference 2018 Conference Paper

Cascaded SR-GAN for Scale-Adaptive Low Resolution Person Re-identification

  • Zheng Wang
  • Mang Ye
  • Fan Yang
  • Xiang Bai
  • Shin'ichi Satoh

Person re-identification (REID) is an important task in video surveillance and forensics applications. Most previous approaches are based on the key assumption that all person images have uniform and sufficiently high resolutions. In practice, various low resolutions and scale mismatches are common in open-world REID. We name this problem Scale-Adaptive Low Resolution Person Re-identification (SALR-REID). The most intuitive way to address it is to upscale various low resolutions (not only low, but also at different scales) to a uniform high resolution. SR-GAN is one of the most competitive image super-resolution deep networks, but it is designed with a fixed upscaling factor and is therefore not suitable for the SALR-REID task, which requires a network that not only synthesizes high-resolution images with different upscaling factors, but also extracts discriminative image features for judging a person’s identity. (1) To promote the ability of scale-adaptive upscaling, we cascade multiple SR-GANs in series. (2) To supplement the ability of image feature representation, we plug in a re-identification network. With a unified formulation, a Cascaded Super-Resolution GAN (CSR-GAN) framework is proposed. Extensive evaluations on two simulated datasets and one public dataset demonstrate the advantages of our method over related state-of-the-art methods.

AAAI Conference 2018 Conference Paper

Multi-Scale Bidirectional FCN for Object Skeleton Extraction

  • Fan Yang
  • Xin Li
  • Hong Cheng
  • Yuxiao Guo
  • Leiting Chen
  • Jianping Li

Object skeleton detection is a challenging problem with wide applications. Recently, deep Convolutional Neural Networks (CNNs) have substantially improved the state-of-the-art performance on this task. However, most existing CNN-based methods rely on a skip-layer structure in which low-level and high-level features are combined and learned so as to gather multi-level contextual information. Because shallow features are noisy and lack semantic knowledge, they may cause errors and inaccuracy. We therefore propose a novel network architecture, the Multi-Scale Bidirectional Fully Convolutional Network (MSB-FCN), to better capture and consolidate multi-scale high-level contextual information for object skeleton detection. Our network uses only deep features to build multi-scale feature representations, and employs a bidirectional structure to collect contextual knowledge. Hence the proposed MSB-FCN has the ability to learn semantic-level information from different sub-regions. Furthermore, we introduce dense connections into the bidirectional structure of our MSB-FCN so that the learning process at each scale can directly encode information from all other scales. Extensive experiments on various commonly used benchmarks demonstrate that the proposed MSB-FCN achieves significant improvements over state-of-the-art algorithms.

ICRA Conference 2017 Conference Paper

Algorithm for optimal chance constrained linear assignment

  • Fan Yang
  • Nilanjan Chakraborty

In this paper, we design provably-good algorithms for task allocation in multi-robot systems in the presence of payoff uncertainty. We consider a group of robots that has to perform a given set of tasks, where each robot performs at most one task. The payoffs of the robots doing the tasks are assumed to be Gaussian random variables with known means and variances. The total payoff of the robots is the sum of the individual payoffs of all the robots. The goal is to find an assignment with maximum payoff that can be achieved with a specified probability irrespective of the realization of the random variables. This problem can be formulated as a chance-constrained combinatorial optimization problem. We develop a novel deterministic technique to solve this chance-constrained optimization problem that ensures the chance constraints are always satisfied. Adopting the notion of risk-aversion from the economics literature, we formulate a risk-averse task allocation problem, which is a deterministic integer optimization problem. We prove that by repeatedly solving the risk-averse task allocation problem using a one-dimensional search on the risk-aversion parameter, we find a solution to the chance-constrained optimization formulation of the linear assignment problem with uncertain payoffs. We provide simulation results on randomly generated data to demonstrate our approach and also compare our method to existing approaches.

NeurIPS Conference 2017 Conference Paper

Differentiable Learning of Logical Rules for Knowledge Base Reasoning

  • Fan Yang
  • Zhilin Yang
  • William Cohen

We study the problem of learning probabilistic first-order logical rules for knowledge base reasoning. This learning problem is difficult because it requires learning the parameters in a continuous space as well as the structure in a discrete space. We propose a framework, Neural Logic Programming, that combines the parameter and structure learning of first-order logical rules in an end-to-end differentiable model. This approach is inspired by a recently-developed differentiable logic called TensorLog [5], where inference tasks can be compiled into sequences of differentiable operations. We design a neural controller system that learns to compose these operations. Empirically, our method outperforms prior work on multiple knowledge base benchmark datasets, including Freebase and WikiMovies.

NeurIPS Conference 2017 Conference Paper

Expectation Propagation with Stochastic Kinetic Model in Complex Interaction Systems

  • Le Fang
  • Fan Yang
  • Wen Dong
  • Tong Guan
  • Chunming Qiao

Technological breakthroughs allow us to collect data with increasing spatio-temporal resolution from complex interaction systems. The combination of high-resolution observations, expressive dynamic models, and efficient machine learning algorithms can lead to crucial insights into complex interaction dynamics and the functions of these systems. In this paper, we formulate the dynamics of a complex interacting network as a stochastic process driven by a sequence of events, and develop expectation propagation algorithms to make inferences from noisy observations. To avoid getting stuck at a local optimum, we formulate the problem of minimizing Bethe free energy as a constrained primal problem and take advantage of the concavity of the dual problem in the feasible domain of the dual variables, guaranteed by the duality theorem. Our expectation propagation algorithms demonstrate better performance in inferring the interaction dynamics in complex transportation networks than competing models such as the particle filter, the extended Kalman filter, and deep neural networks.

NeurIPS Conference 2017 Conference Paper

Good Semi-supervised Learning That Requires a Bad GAN

  • Zihang Dai
  • Zhilin Yang
  • Fan Yang
  • William Cohen
  • Russ Salakhutdinov

Semi-supervised learning methods based on generative adversarial networks (GANs) have obtained strong empirical results, but it is not clear 1) how the discriminator benefits from joint training with a generator, and 2) why good semi-supervised classification performance and a good generator cannot be obtained at the same time. Theoretically, we show that given the discriminator objective, good semi-supervised learning indeed requires a bad generator, and we propose the definition of a preferred generator. Empirically, we derive a novel formulation based on our analysis that substantially improves over feature matching GANs, obtaining state-of-the-art results on multiple benchmark datasets.

ECAI Conference 2016 Conference Paper

On Stochastic Primal-Dual Hybrid Gradient Approach for Compositely Regularized Minimization

  • Linbo Qiao
  • Tianyi Lin
  • Yu-Gang Jiang 0001
  • Fan Yang
  • Wei Liu 0005
  • Xicheng Lu

We consider a wide spectrum of regularized stochastic minimization problems, where the regularization term is composite with a linear function. Examples of this formulation include graph-guided regularized minimization, the generalized Lasso, and a class of ℓ1-regularized problems. The computational challenge is that the closed-form solution of the proximal mapping associated with the regularization term is not available due to the imposed linear composition. Fortunately, the structure of the regularization term allows us to reformulate it as a new convex-concave saddle point problem, which can be solved using the Primal-Dual Hybrid Gradient (PDHG) approach. However, this approach may be inefficient in realistic applications, as computing the full gradient of the expected objective function can be very expensive when the number of input data samples is considerably large. To address this issue, we propose a Stochastic PDHG (SPDHG) algorithm with either uniformly or non-uniformly averaged iterates. With uniformly averaged iterates, the SPDHG algorithm converges in expectation at an O(1/√t) rate for general convex objectives and an O(log(t)/t) rate for strongly convex objectives, respectively. With non-uniformly averaged iterates, the SPDHG algorithm is expected to converge at an O(1/t) rate for strongly convex objectives. Numerical experiments on different genres of datasets demonstrate that our proposed algorithm outperforms competing algorithms.

IJCAI Conference 2016 Conference Paper

Saliency Transfer: An Example-Based Method for Salient Object Detection

  • Xin Li
  • Fan Yang
  • Leiting Chen
  • Hongbin Cai

Over the past decades, numerous theories and studies have demonstrated that salient objects in different scenes often share properties that make them visually stand out from their surroundings and thus allow them to be processed in finer detail. In this paper, we propose a novel method for salient object detection that transfers the annotations from an existing example onto an input image. Our method, which is based on the low-level saliency features of each pixel, estimates dense pixel-wise correspondences between the input image and an example image, and then integrates high-level concepts to produce an initial saliency map. Finally, a coarse-to-fine optimization framework is proposed to generate uniformly highlighted salient objects. Qualitative and quantitative experiments on six popular benchmark datasets validate that our approach greatly outperforms state-of-the-art algorithms and recently published works.

NeurIPS Conference 2016 Conference Paper

Selective inference for group-sparse linear models

  • Fan Yang
  • Rina Foygel Barber
  • Prateek Jain
  • John Lafferty

We develop tools for selective inference in the setting of group sparsity, including the construction of confidence intervals and p-values for testing selected groups of variables. Our main technical result gives the precise distribution of the magnitude of the projection of the data onto a given subspace, and enables us to develop inference procedures for a broad class of group-sparse selection methods, including the group lasso, iterative hard thresholding, and forward stepwise regression. We give numerical results to illustrate these tools on simulated data and on health record data.

ICRA Conference 2005 Conference Paper

Achieving Desired Contact State Transitions of Polyhedral Parts with Compliant Motions

  • Fan Yang
  • Michael M. Marefat

A new approach to motion planning for achieving contact state transitions in robotic assembly of polyhedral parts is presented. The contact state of a pair of spatial polyhedra is represented by qualitative contact models, for which a Feature Interaction Matrix (FIM) representation is adopted. Given the desired contact transition (i.e., the current contact state and the next desired contact state) and the current configuration of the moving object, we want to generate the compliant motion parameters for the robot system to guide the workpiece to the next desired contact state. In this work, an optimization method is used to derive the compliant motion parameters. Four motion control conditions are defined to provide constraints, and a cost function representing the moving distance is also defined. By minimizing this cost function, the compliant motion parameters can be generated. The method is demonstrated with both translation and rotation examples.