Author name cluster

Bo Jiang

Possible papers associated with this exact author name in Arrow. This page groups case-insensitive exact name matches and is not a full identity disambiguation profile.

38 papers

2 author rows

AAAI Conference 2026 Conference Paper

GARNET: GoT-Based Alert Reduction and Narrative Event Tracing

Yiru Gong
Song Liu
Changzhi Zhao
Junrong Liu
Tian Tian
Xiaobo Yang
Bo Jiang
Zhigang Lu

Alerts generated by Security Operations Centers (SOCs) are often numerous and scattered, requiring significant effort from security analysts to manage, which severely slows response times. While recent alert correlation graph methods can effectively reduce alert volume, these graphs are often too complex for analysts to understand. As a result, analysts are increasingly seeking ways to automatically correlate alerts and generate concise, human-readable attack path summaries. Recently, Large Language Models (LLMs) have demonstrated superior performance due to their advanced capabilities in knowledge reserve and reasoning. In this work, we propose GARNET, a framework that uses LLMs for reasoning on alert correlation graphs. GARNET addresses three key technical challenges: 1) modality alignment between alert graphs and logs; 2) semantic alignment between alert graphs and logs; 3) enabling LLM reasoning along graph paths. Specifically, we first project the embeddings of the graph and logs into the same vector space using contrastive learning. Then, we design self-supervised graph-log instructions to bridge the semantic gap between the graph and logs by training a novel LLM. Finally, GARNET uses a novel Graph-of-Thought (GoT)-based interaction reasoning approach to guide LLM reasoning along graph paths, ultimately generating structured, concise, and human-readable attack path summaries. Experimental results across six attack scenarios show that GARNET reduces false positives by an average of 80%, lowering the false positive rate to below 0.0037. It outperforms the latest approaches and provides more explainable attribution.
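
The contrastive alignment step mentioned in this abstract can be illustrated with a generic InfoNCE-style loss. This is only a sketch under stated assumptions: the abstract does not say which contrastive objective GARNET uses, and the function name `info_nce`, the temperature value, and the toy embeddings are all hypothetical.

```python
import math

def dot(u, v):
    return sum(a * b for a, b in zip(u, v))

def normalize(v):
    norm = math.sqrt(dot(v, v))
    return [x / norm for x in v]

def info_nce(graph_embs, log_embs, temperature=0.1):
    """Mean InfoNCE loss; graph_embs[i] and log_embs[i] form a matched pair.

    Assumption: a generic contrastive objective, not necessarily GARNET's.
    """
    g = [normalize(v) for v in graph_embs]
    l = [normalize(v) for v in log_embs]
    total = 0.0
    for i, gi in enumerate(g):
        # Similarity of graph embedding i to every log embedding.
        logits = [dot(gi, lj) / temperature for lj in l]
        # Numerically stable log-sum-exp for the softmax denominator.
        m = max(logits)
        log_denominator = m + math.log(sum(math.exp(z - m) for z in logits))
        # Cross-entropy with the matched pair as the correct class.
        total += log_denominator - logits[i]
    return total / len(g)

# Toy 2-D embeddings (hypothetical): matched pairs point in similar directions.
aligned = info_nce([[1.0, 0.0], [0.0, 1.0]], [[0.9, 0.1], [0.1, 0.9]])
shuffled = info_nce([[1.0, 0.0], [0.0, 1.0]], [[0.1, 0.9], [0.9, 0.1]])
```

The loss is small when each graph embedding is closest to its own log embedding and large when the pairs are mismatched, which is the property that pulls the two modalities into one vector space.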


TIST Journal 2026 Journal Article

Interpretable Structure Learning for Knowledge Components in Education

Yuang Wei
Yuan-Hao Jiang
Changyong Qi
Wei Zhang
Bo Jiang

Structural relationships among Knowledge Components (KCs) are essential for adaptive learning systems, as they support accurate cognitive diagnosis, personalized path planning, and targeted resource recommendation. However, existing approaches frequently capture correlations instead of reliable directional dependency signals and tend to converge prematurely or become inefficient as graph dimensionality grows. These limitations weaken the reliable modeling of KC-level structure, which in turn reduces interpretability and limits downstream benefits for diagnosis, planning, and recommendation. To this end, we propose a novel structure learning framework that integrates psychometric modeling with structural search. First, we design the Item Response Theory (IRT)-based Information Criterion (IRIC), an interpretable scoring function that combines information entropy with causal effect estimation grounded in IRT, jointly capturing statistical associations and directionality-sensitive signals under latent ability control. Second, we develop Co-Evolutionary Optimization for Structural Search (CEO-SS), a multi-population evolutionary algorithm with a game-inspired co-evolution mechanism that balances exploration and exploitation, avoiding premature convergence and showing robust search behavior as graph dimensionality increases within the evaluated benchmarks. Extensive experiments on three types of datasets (benchmark causal discovery datasets, the public educational dataset, and real-world classroom data) demonstrate that our framework consistently outperforms strong baselines in accuracy and stability, with especially clear gains in adjacency recovery and more modest improvements in edge-direction recovery. In addition, expert evaluation suggests that the learned structures are more diagnostically useful, more actionable for remediation, and more pedagogically plausible than those produced by alternative scoring methods. Overall, the proposed framework provides an interpretable and practically valuable approach to learning KC structures for adaptive learning.


JBHI Journal 2026 Journal Article

R2GenCSR: Mining Contextual and Residual Information for LLMs-based Radiology Report Generation

Xiao Wang
Yuehang Li
Fuling Wang
Shiao Wang
Chuanfu Li
Bo Jiang

Inspired by the tremendous success of Large Language Models (LLMs), existing radiology report generation methods attempt to leverage large models to achieve better performance. They usually adopt a Transformer to extract the visual features of a given X-ray image and then feed them into the LLM for text generation. How to extract more effective information for the LLMs to help them improve final results is an urgent problem that needs to be solved. Additionally, the use of visual Transformer models also brings high computational complexity. To address these issues, this paper proposes a novel context-guided efficient radiology report generation framework. Specifically, we introduce the Mamba as the vision backbone with linear complexity, and the performance obtained is comparable to that of the strong Transformer model. More importantly, we perform context retrieval from the training set for samples within each mini-batch during the training phase, utilizing both positively and negatively related samples to enhance feature representation and discriminative learning. Subsequently, we feed the vision tokens, context information, and prompt statements to invoke the LLM for generating high-quality medical reports. Extensive experiments on three X-ray report generation datasets (i.e., IU X-Ray, MIMIC-CXR, CheXpert Plus) fully validated the effectiveness of our proposed model. The source code is available at https://github.com/Event-AHU/Medical_Image_Analysis.


AAAI Conference 2026 Conference Paper

Sentient: Detecting APTs via Capturing Indirect Dependencies and Behavioral Logic

Wenhao Yan
Ning An
Wei Qiao
Weiheng Wu
Zhigang Lu
Bo Jiang
Baoxu Liu
Junrong Liu

Advanced Persistent Threats (APTs) are difficult to detect due to their complexity and stealthiness. To mitigate such attacks, many approaches model entities and their relationships using provenance graphs to detect the stealthy and persistent characteristics of APTs. However, existing detection methods suffer from the flaws of missing indirect dependencies, noisy complex scenarios, and missing behavioral logical associations, which make it difficult to detect complex scenarios and effectively identify stealthy threats. In this paper, we propose Sentient, an APT detection method that combines pre-training and intent analysis. It employs a graph transformer to learn structural and semantic information from provenance graphs to avoid missing indirect dependencies. We mitigate scenario noise by combining global and local information. Additionally, we design an Intent Analysis Module (IAM) to associate logical relationships between behaviors. Sentient is trained solely on easily obtainable benign data to detect malicious behaviors that deviate from benign behavioral patterns. We evaluated Sentient on three widely-used datasets covering real-world attacks and simulated attacks. Notably, compared to six state-of-the-art methods, Sentient achieved an average reduction of 44% in false positive rate (FPR) for detection.


AAAI Conference 2026 Conference Paper

STEP-Nav: Spatial-Temporal Efficient Visual Token Pruning for Vision-and-Language Navigation with Large Language Models

Yantao Lu
Shiqi Sun
Ning Liu
Bo Jiang
Ying Zhang
Jinchao Chen
Chenglie Du

Vision-and-Language Navigation (VLN) plays a critical role in tasks of embodied AI, particularly in unseen environments following natural language instructions. Recent advancements leverage large language models (LLMs) to improve the accuracy and generalizability of VLN systems by encoding image sequences as dense token representations. However, this tokenization approach incurs substantial computational overhead due to two key inefficiencies: 1) ego-centric camera views often include navigation-irrelevant regions (e.g., sky or distant backgrounds), and 2) high-frame-rate image sequences introduce temporal redundancy. To address these challenges, we propose Spatial-Temporal Efficient Visual Token Pruning (STEP-Nav), a unified framework that simultaneously prunes redundant visual tokens and fine-tunes VLN models to preserve navigation performance. In particular, STEP-Nav incorporates a distance- and content-aware token evaluation mechanism to remove irrelevant tokens at the spatial level, along with temporal level similarity-based filtering to reduce redundancy across sequential frames. To ensure pruning does not harm task performance, we introduce a distortion-aware fine-tuning strategy that aligns pruned-token representations with their full-token counterparts while maintaining navigation accuracy. Experiments on the R2R and RxR benchmarks using Navid-CE and NavGPT-2 as base models demonstrate that STEP-Nav preserves over 95% of the performance while reducing 66.7% of tokens, outperforming existing token pruning baselines.
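
The temporal, similarity-based filtering idea can be sketched as follows. This is a minimal illustration rather than the STEP-Nav implementation: the position-wise comparison, the cosine-similarity criterion, the 0.95 threshold, and the name `prune_temporal` are all assumptions made for the example.

```python
import math

def cosine(u, v):
    dot = sum(a * b for a, b in zip(u, v))
    nu = math.sqrt(sum(a * a for a in u))
    nv = math.sqrt(sum(b * b for b in v))
    return dot / (nu * nv) if nu and nv else 0.0

def prune_temporal(prev_tokens, curr_tokens, threshold=0.95):
    """Return indices of current-frame tokens to keep.

    A token is dropped when it is nearly identical (cosine similarity at or
    above the threshold) to the token at the same position in the previous
    frame. The threshold is a hypothetical value.
    """
    kept = []
    for idx, (p, c) in enumerate(zip(prev_tokens, curr_tokens)):
        if cosine(p, c) < threshold:
            kept.append(idx)
    return kept

# Toy frames with three 2-D tokens each (hypothetical values).
prev_frame = [[1.0, 0.0], [0.0, 1.0], [1.0, 1.0]]
curr_frame = [[1.0, 0.01], [1.0, 0.0], [1.0, 1.0]]
kept = prune_temporal(prev_frame, curr_frame)
```

Here only the second token changes meaningfully between frames, so it is the only one retained; the other two are treated as temporally redundant.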


AAAI Conference 2026 Conference Paper

When Person Re-Identification Meets Event Camera: A Benchmark Dataset and an Attribute-Guided Re-Identification Framework

Xiao Wang
Qian Zhu
Shujuan Wu
Bo Jiang
Shiliang Zhang

Researchers have recently proposed using event cameras for person re-identification (ReID) because of their promising performance and better balance in terms of privacy protection, and event camera-based person ReID has attracted significant attention. Currently, mainstream event-based person ReID algorithms primarily focus on fusing visible light and event streams, as well as preserving privacy. Although significant progress has been made, these methods are typically trained and evaluated on small-scale or simulated event camera datasets, making it difficult to assess their real identification performance and generalization ability. To address the issue of data scarcity, this paper introduces a large-scale RGB-event based person ReID dataset, called EvReID. The dataset contains 118,988 image pairs and covers 1200 pedestrian identities, with data collected across multiple seasons, scenes, and lighting conditions. We also evaluate 15 state-of-the-art person ReID algorithms, laying a solid foundation for future research in terms of both data and benchmarking. Based on our newly constructed dataset, this paper further proposes a pedestrian attribute-guided contrastive learning framework to enhance feature learning for person re-identification, termed TriPro-ReID. This framework not only effectively explores the visual features from both RGB frames and event streams, but also fully utilizes pedestrian attributes as mid-level semantic features. Extensive experiments on the EvReID and MARS datasets fully validated the effectiveness of our proposed RGB-Event person ReID framework.


EAAI Journal 2025 Journal Article

A modeled study of driver visual attention driven by driving tasks

Chuan Xu
Bo Jiang
Yukun Wang
Yan Su

Visual attention is an indispensable component of driving, enabling drivers to swiftly identify critical objects within complex and dynamic traffic environments. Despite its significance, existing visual attention models predominantly focus on static or idealized driving scenarios, limiting their ability to capture attention distribution patterns in real-world, dynamic environments. Furthermore, most of these models rely heavily on data-driven approaches, extracting features exclusively from visual image data, while neglecting the profound influence of “the driver, the vehicle, and the road environment”. Consequently, these models frequently fail to effectively address the intricacies of practical driving scenarios. To bridge these gaps, this study introduces a driver visual attention prediction model that comprehensively incorporates the driving task, driver experience, and the impact of dynamic visual scenes. The proposed model leverages the advanced learning capabilities of Convolutional Neural Networks (CNN) and Vision Transformer (ViT), coupled with sequence modeling mechanisms, to effectively capture the nuanced attention allocation patterns of drivers in complex driving contexts. The model is meticulously designed to adapt to dynamically evolving driving task requirements. Experimental results demonstrate that the proposed model outperforms state-of-the-art (SOTA) visual attention prediction models across multiple benchmark evaluation metrics on the DR(eye)VE dataset, particularly excelling in dynamic driving conditions. Moreover, generalization experiments conducted on the BDD-A and TDV datasets validate the model’s robustness and applicability across varied driving tasks and dynamic conditions.


NeurIPS Conference 2025 Conference Paper

CroPe: Cross-Modal Semantic Compensation Adaptation for All Adverse Scene Understanding

Qin Xu
Qihang Wu
Lu Hongtao
Xiaoxia Cheng
Bo Jiang

Scene understanding in adverse conditions, such as fog, snow, and night, is challenging due to the visual appearance degeneration. In this context, we propose a Cross-modal Semantic Compensation Adaptation method (CroPe) for scene understanding. Distinct from the existing methods, which only use the visual information to learn the domain-invariant features, CroPe establishes a visual-textual paradigm which provides textual semantic compensation for visual features, enabling the model to learn more consistent representations. We propose the Complementary Perceptual Text Generation (CPTG) module which generates a set of multi-level complementary-perceptive text embeddings incorporating both generalization and domain awareness. To achieve cross-modal semantic compensation, the Reverse Chain Text-Visual Fusion (RCTVF) module is developed. By the unified attention and reverse decoding chain, compensation information is successively fused to the visual features from the deep (semantic dense) to shallow (semantic sparse) features, maximizing compensation gain. CroPe yields competitive results under all adverse conditions and significantly improves the state-of-the-art performance by 6.5 mIoU for ACDC-Night dataset and 1.2 mIoU for ACDC-All dataset, respectively.


ICML Conference 2025 Conference Paper

Improved Discretization Complexity Analysis of Consistency Models: Variance Exploding Forward Process and Decay Discretization Scheme

Ruofeng Yang
Bo Jiang
Cheng Chen 0015
Shuai Li 0010

Consistency models, a new class of one-step generative models, have shown competitive performance with multi-step diffusion models. The most challenging part of consistency models is the training process, which discretizes the continuous diffusion process into $K$ steps and trains a one-step mapping function on these discretized timepoints. Despite the empirical success, only a few works focus on the discretization complexity $K$, and their setting is far from that of empirical works. More specifically, the current theoretical works analyze the variance preserving (VP) diffusion process with a uniform stepsize, while empirical works adopt a variance exploding (VE) process with a decay discretization stepsize. As a result, these works suffer from large discretization complexity and fail to explain the empirical success of consistency models. To close the gap between theory and application, we analyze consistency models with (1) VE process and (2) decay stepsize and prove the state-of-the-art discretization complexity for consistency models. This result is competitive with the results of diffusion models and shows the potential of consistency models. To balance the computation and performance, previous empirical work further proposes a $2$-step consistency algorithm. In this work, we also analyze the role of $2$-step sampling and show that it improves the discretization complexity compared with one-step generation.
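
The contrast between a uniform stepsize and a decaying one can be made concrete with a small sketch. The abstract does not give the exact discretization; the formula below is the polynomial schedule of Karras et al. that empirical consistency-model work commonly adopts, and the parameter values (`t_min`, `t_max`, `rho`) and the name `decay_timesteps` are assumptions for illustration.

```python
def decay_timesteps(num_steps, t_min=0.002, t_max=80.0, rho=7.0):
    """Return num_steps increasing time points from t_min to t_max.

    Points are uniform in t**(1/rho), so the gaps between consecutive points
    shrink toward t_min. Parameter values are assumed defaults, not taken
    from the paper summarized above.
    """
    if num_steps < 2:
        raise ValueError("num_steps must be at least 2")
    lo, hi = t_min ** (1 / rho), t_max ** (1 / rho)
    return [(lo + i / (num_steps - 1) * (hi - lo)) ** rho for i in range(num_steps)]

ts = decay_timesteps(18)
# Step sizes between consecutive time points, ordered from small t to large t.
gaps = [b - a for a, b in zip(ts, ts[1:])]
```

With `rho` greater than 1 the gaps grow with `t`, so most discretization points concentrate near the low-noise end, whereas a uniform schedule would make every entry of `gaps` equal.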


EAAI Journal 2025 Journal Article

Joint depth-segmentation learning with segment priors for non-contact seedling height and stem thickness estimation

Lei Song
Bo Jiang
Huaibo Song

To achieve precise and rapid computation of seedling height and stem diameter, key phenotypic traits for monitoring seedling growth and selecting superior varieties, this study proposes a SAM-Integrated Adaptive Fusion Depth Network (SAFD-Net). SAFD-Net integrates segmentation masks generated by Segment Anything Model (SAM) with an Adaptive Prior Extraction (APE) module to produce priors focused on individual seedling characteristics, and it fuses these priors with deep features through an Adaptive Attention Fusion (AAF) module. A Local Depth Generation (LDG) module refines depth details to improve estimation accuracy, and an Adaptive Multi-scale Fusion (AMF) module merges LDG outputs at different scales to produce high-precision depth maps. From these maps, seedling region depth, pixel height, and pixel stem diameter are extracted to compute actual seedling height and stem diameter. Comparisons with various depth estimation networks demonstrate that SAFD-Net outperforms existing models in both depth estimation and seedling measurement. Experimental evaluations on seedlings from three crops with distinct phenotypic characteristics further show that the method maintains high accuracy under varying shooting distances, lighting conditions, multiple targets, and tilt angles, offering a novel approach for phenotypic monitoring during seedling cultivation. Code is released at https://github.com/Songlei7664/SAFD-Net.


AAAI Conference 2025 Conference Paper

LiON: Learning Point-Wise Abstaining Penalty for LiDAR Outlier DetectioN Using Diverse Synthetic Data

Shaocong Xu
Pengfei Li
Qianpu Sun
Xinyu Liu
Yang Li
Shihui Guo
Zhen Wang
Bo Jiang

LiDAR-based semantic scene understanding is an important module in the modern autonomous driving perception stack. However, identifying outlier points in a LiDAR point cloud is challenging as LiDAR point clouds lack semantically-rich information. While former SOTA methods adopt heuristic architectures, we revisit this problem from the perspective of Selective Classification, which introduces a selective function into the standard closed-set classification setup. Our solution is built upon the basic idea of abstaining from choosing any inlier categories but learns a point-wise abstaining penalty with a margin-based loss. Apart from learning paradigms, synthesizing outliers to approximate unlimited real outliers is also critical, so we propose a strong synthesis pipeline that generates outliers originating from various factors: object categories, sampling patterns and sizes. We demonstrate that learning different abstaining penalties, apart from point-wise penalty, for different types of (synthesized) outliers can further improve the performance. We benchmark our method on SemanticKITTI and nuScenes and achieve SOTA results.


NeurIPS Conference 2025 Conference Paper

RAD: Training an End-to-End Driving Policy via Large-Scale 3DGS-based Reinforcement Learning

Hao Gao
Shaoyu Chen
Bo Jiang
Bencheng Liao
Yiang Shi
Xiaoyang Guo
Yuechuan Pu
Haoran Yin

Existing end-to-end autonomous driving (AD) algorithms typically follow the Imitation Learning (IL) paradigm, which faces challenges such as causal confusion and an open-loop gap. In this work, we propose RAD, a 3DGS-based closed-loop Reinforcement Learning (RL) framework for end-to-end Autonomous Driving. By leveraging 3DGS techniques, we construct a photorealistic digital replica of the real physical world, enabling the AD policy to extensively explore the state space and learn to handle out-of-distribution scenarios through large-scale trial and error. To enhance safety, we design specialized rewards to guide the policy in effectively responding to safety-critical events and understanding real-world causal relationships. To better align with human driving behavior, we incorporate IL into RL training as a regularization term. We introduce a closed-loop evaluation benchmark consisting of diverse, previously unseen 3DGS environments. Compared to IL-based methods, RAD achieves stronger performance in most closed-loop metrics, particularly exhibiting a 3× lower collision rate. Abundant closed-loop results are presented in the supplementary material. Code is available at https://github.com/hustvl/RAD for facilitating future research.


EAAI Journal 2025 Journal Article

Reading comprehension powered semantic fusion network for identification of N-ary drug combinations

Hua Zhang
Peiqian Zhan
Cheng Yang
Yongjian Yan
Zijing Cai
Guogen Shan
Bo Jiang
Bi Chen

The concurrent use of multiple medications to treat one or more diseases is prevalent. Identifying N-ary drug combinations from biomedical texts aids in uncovering significant pharmacological effects triggered by drug-drug interactions. Previous methods for this emerging task have primarily concentrated on representing drug entities using pre-trained language models, overlooking the comprehensive extraction of contextual and task-specific semantic information. To address these limitations, we develop a semantic fusion method grounded in a machine reading comprehension (MRC) framework. Our model, termed Reading Comprehension powered semantic Fusion network for Identification of N-ary Drug combinations (RCFIND), first constructs relevant contexts and queries for each individual drug combination. Then, diverse information sources, including task-specific semantics, drug entity representations and contextual details, are fused by using a simplified Capsule network as well as incorporating contrastive learning. We assess RCFIND, achieving F1 scores ranging from 72.0% to 83.3% across four types of evaluations. Experimental results demonstrate significant performance enhancements over existing baselines, with at least a 5% F1 score improvement. Ablation studies and further analysis confirm the efficacy of the MRC framework and contrastive learning in accurately identifying N-ary drug combinations.


IJCAI Conference 2025 Conference Paper

TreeKV: Smooth Key-Value Cache Compression with Tree Structures

Ziwei He
Jian Yuan
Haoli Bai
Jingwen Leng
Bo Jiang

Efficient key-value (KV) cache compression is critical for scaling transformer-based Large Language Models (LLMs) in long sequences and resource-limited settings. Existing methods evict tokens based on their positions or importance, but position-based strategies can miss crucial information outside predefined regions, while those relying on global importance scores result in strong regional biases, limiting the KV cache's overall context retention and potentially impairing the performance of LLMs on complex tasks. Our wavelet analysis reveals that as tokens approach the end of sequence, their contributions to generation gradually increase and tend to diverge more from neighboring tokens, indicating a smooth transition with increasing complexity and variability from distant to nearby context. Motivated by this observation, we propose TreeKV, an intuitive, training-free method that employs a tree structure for smooth cache compression. TreeKV maintains a fixed cache size, allowing LLMs to deliver high-quality output in long text scenarios and is applicable during both the generation and prefilling stages. TreeKV consistently surpasses all baseline models in language modeling tasks on PG19 and OpenWebText2, allowing LLMs trained with short context window to generalize to longer window with a 16x cache reduction. On the Longbench benchmark, TreeKV achieves the best performance with only 6% of the budget at optimal efficiency.


NeurIPS Conference 2025 Conference Paper

UGG-ReID: Uncertainty-Guided Graph Model for Multi-Modal Object Re-Identification

Xixi Wan
Aihua Zheng
Bo Jiang
Beibei Wang
Chenglong Li
Jin Tang

Multi-modal object Re-IDentification (ReID) has gained considerable attention with the goal of retrieving specific targets across cameras using heterogeneous visual data sources. At present, multi-modal object ReID faces two core challenges: (1) learning robust features under fine-grained local noise caused by occlusion, frame loss, and other disruptions; and (2) effectively integrating heterogeneous modalities to enhance multi-modal representation. To address the above challenges, we propose a robust approach named Uncertainty-Guided Graph model for multi-modal object ReID (UGG-ReID). UGG-ReID is designed to mitigate noise interference and facilitate effective multi-modal fusion by estimating both local and sample-level aleatoric uncertainty and explicitly modeling their dependencies. Specifically, we first propose the Gaussian patch-graph representation model that leverages uncertainty to quantify fine-grained local cues and capture their structural relationships. This process boosts the expressiveness of modal-specific information, ensuring that the generated embeddings are both more informative and robust. Subsequently, we design an uncertainty-guided mixture of experts strategy that dynamically routes samples to experts exhibiting low uncertainty. This strategy effectively suppresses noise-induced instability, leading to enhanced robustness. Meanwhile, we design an uncertainty-guided routing to strengthen the multi-modal interaction, improving the performance. UGG-ReID is comprehensively evaluated on five representative multi-modal object ReID datasets, encompassing diverse spectral modalities. Experimental results show that the proposed method achieves excellent performance on all datasets and is significantly better than current methods in terms of noise immunity. Our code is available at https://github.com/wanxixi11/UGG-ReID.
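
Routing samples to experts that exhibit low uncertainty can be sketched in a few lines. This is a hedged illustration only: the abstract does not describe how UGG-ReID computes or uses the routing weights, so the top-k selection, the softmax over negative uncertainty, and the name `route_by_uncertainty` are assumptions.

```python
import math

def route_by_uncertainty(uncertainties, top_k=2, temperature=1.0):
    """Return {expert_index: weight} for the top_k least-uncertain experts.

    Weights are a softmax over negative uncertainty, so lower uncertainty
    yields a larger weight. This rule is hypothetical, not the paper's.
    """
    ranked = sorted(range(len(uncertainties)), key=lambda i: uncertainties[i])[:top_k]
    scores = [math.exp(-uncertainties[i] / temperature) for i in ranked]
    total = sum(scores)
    return {i: s / total for i, s in zip(ranked, scores)}

# Per-expert uncertainty estimates for one sample (hypothetical values).
weights = route_by_uncertainty([0.9, 0.1, 0.4, 2.0])
```

For these values the sample is routed to experts 1 and 2, with expert 1 receiving the larger weight because its uncertainty is lowest.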


NeurIPS Conference 2024 Conference Paper

AlterMOMA: Fusion Redundancy Pruning for Camera-LiDAR Fusion Models with Alternative Modality Masking

Shiqi Sun
Yantao Lu
Ning Liu
Bo Jiang
Jinchao Chen
Ying Zhang

Camera-LiDAR fusion models significantly enhance perception performance in autonomous driving. The fusion mechanism leverages the strengths of each modality while minimizing their weaknesses. Moreover, in practice, camera-LiDAR fusion models utilize pre-trained backbones for efficient training. However, we argue that directly loading single-modal pre-trained camera and LiDAR backbones into camera-LiDAR fusion models introduces similar feature redundancy across modalities due to the nature of the fusion mechanism. Unfortunately, existing pruning methods are developed explicitly for single-modal models, and thus, they struggle to effectively identify these specific redundant parameters in camera-LiDAR fusion models. In this paper, to address the issue above on camera-LiDAR fusion models, we propose a novel pruning framework, Alternative Modality Masking Pruning (AlterMOMA), which employs alternative masking on each modality and identifies the redundant parameters. Specifically, when one modality's parameters are masked (deactivated), the absence of features from the masked backbone compels the model to reactivate previously redundant features of the other modality's backbone. Therefore, these redundant features and relevant redundant parameters can be identified via the reactivation process. The redundant parameters can be pruned by our proposed importance score evaluation function, Alternative Evaluation (AlterEva), which is based on the observation of the loss changes when certain modality parameters are activated and deactivated. Extensive experiments on the nuScenes and KITTI datasets encompassing diverse tasks, baseline models, and pruning algorithms showcase that AlterMOMA outperforms existing pruning methods, attaining state-of-the-art performance.


NeurIPS Conference 2024 Conference Paper

CE-NAS: An End-to-End Carbon-Efficient Neural Architecture Search Framework

Yiyang Zhao
Yunzhuo Liu
Bo Jiang
Tian Guo

This work presents a novel approach to neural architecture search (NAS) that aims to increase carbon efficiency for the model design process. The proposed framework CE-NAS addresses the key challenge of high carbon cost associated with NAS by exploring the carbon emission variations of energy and energy differences of different NAS algorithms. At the high level, CE-NAS leverages a reinforcement-learning agent to dynamically adjust GPU resources based on carbon intensity, predicted by a time-series transformer, to balance energy-efficient sampling and energy-intensive evaluation tasks. Furthermore, CE-NAS leverages a recently proposed multi-objective optimizer to effectively reduce the NAS search space. We demonstrate the efficacy of CE-NAS in lowering carbon emissions while achieving SOTA results for both NAS datasets and open-domain NAS tasks. For example, on the HW-NasBench dataset, CE-NAS reduces carbon emissions by up to 7.22X while maintaining a search efficiency comparable to vanilla NAS. For open-domain NAS tasks, CE-NAS achieves SOTA results with 97.35% top-1 accuracy on CIFAR-10 with only 1.68M parameters and a carbon consumption of 38.53 lbs of CO2. On ImageNet, our searched model achieves 80.6% top-1 accuracy with a 0.78 ms TensorRT latency using FP16 on NVIDIA V100, consuming only 909.86 lbs of CO2, making it comparable to other one-shot-based NAS baselines. Our code is available at https://github.com/cake-lab/CE-NAS.


NeurIPS Conference 2024 Conference Paper

Few-Shot Diffusion Models Escape the Curse of Dimensionality

Ruofeng Yang
Bo Jiang
Cheng Chen
Ruinan Jin
Baoxiang Wang
Shuai Li

While diffusion models have demonstrated impressive performance, there is a growing need for generating samples tailored to specific user-defined concepts. The customized requirements promote the development of few-shot diffusion models, which use limited $n_{ta}$ target samples to fine-tune a pre-trained diffusion model trained on $n_s$ source samples. Despite the empirical success, no theoretical work specifically analyzes few-shot diffusion models. Moreover, the existing results for diffusion models without a fine-tuning phase can not explain why few-shot models generate great samples due to the curse of dimensionality. In this work, we analyze few-shot diffusion models under a linear structure distribution with a latent dimension $d$. From the approximation perspective, we prove that few-shot models have a $\widetilde{O}(n_s^{-2/d}+n_{ta}^{-1/2})$ bound to approximate the target score function, which is better than $n_{ta}^{-2/d}$ results. From the optimization perspective, we consider a latent Gaussian special case and prove that the optimization problem has a closed-form minimizer. This means few-shot models can directly obtain an approximated minimizer without a complex optimization process. Furthermore, we also provide the accuracy bound $\widetilde{O}(1/n_{ta}+1/\sqrt{n_s})$ for the empirical solution, which still has better dependence on $n_{ta}$ compared to $n_s$. The results of the real-world experiments also show that the models obtained by only fine-tuning the encoder and decoder specific to the target distribution can produce novel images with the target feature, which supports our theoretical results.
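
To see why the stated approximation bound sidesteps the curse of dimensionality, the two rates can be compared numerically. The sketch below drops constants and logarithmic factors (the tilde in the bound), and the sample sizes and latent dimension are hypothetical values chosen only for illustration.

```python
def few_shot_rate(n_s, n_ta, d):
    """n_s^(-2/d) + n_ta^(-1/2), ignoring constants and log factors."""
    return n_s ** (-2 / d) + n_ta ** (-0.5)

def target_only_rate(n_ta, d):
    """n_ta^(-2/d), the rate the abstract compares against."""
    return n_ta ** (-2 / d)

# Hypothetical setting: many source samples, few target samples, latent dim 16.
n_s, n_ta, d = 10**9, 100, 16
few_shot = few_shot_rate(n_s, n_ta, d)
target_only = target_only_rate(n_ta, d)
```

In this setting the few-shot rate is roughly 0.17 against roughly 0.56 for the target-only rate: the dimension-dependent exponent is paid on the large source set, while the small target set enters only through the dimension-free term.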


AAAI Conference 2024 Conference Paper

HARDVS: Revisiting Human Activity Recognition with Dynamic Vision Sensors

Xiao Wang
Zongzhen Wu
Bo Jiang
Zhimin Bao
Lin Zhu
Guoqi Li
Yaowei Wang
Yonghong Tian

Mainstream human activity recognition (HAR) algorithms are developed based on RGB cameras, which usually suffer from illumination, fast motion, privacy preservation, and large energy consumption. Meanwhile, the biologically inspired event cameras attracted great interest due to their unique features, such as high dynamic range, dense temporal but sparse spatial resolution, low latency, low power, etc. As it is a newly arising sensor, there is not yet a realistic large-scale dataset for HAR. Considering its great practical value, in this paper, we propose a large-scale benchmark dataset to bridge this gap, termed HARDVS, which contains 300 categories and more than 100K event sequences. We evaluate and report the performance of multiple popular HAR algorithms, which provide extensive baselines for future works to compare. More importantly, we propose a novel spatial-temporal feature learning and fusion framework, termed ESTF, for event stream based human activity recognition. It first projects the event streams into spatial and temporal embeddings using StemNet, then encodes and fuses the dual-view representations using Transformer networks. Finally, the dual features are concatenated and fed into a classification head for activity prediction. Extensive experiments on multiple datasets fully validated the effectiveness of our model. Both the dataset and source code will be released at https://github.com/Event-AHU/HARDVS.

PDF Details DOI

IJCAI Conference 2024 Conference Paper

Implicit Prompt Learning for Image Denoising

Yao Lu
Bo Jiang
Guangming Lu
Bob Zhang

Recently, various deep denoising methods have been proposed to solve the insufficient feature problem in image denoising. These methods can be mainly classified into two categories: (1) Injecting learnable tensors into denoising backbone to supplement feature, which is effective to some extent but may cause serious over-fitting. (2) Using diverse natural images from large image datasets to synthesize noisy images and pre-train denoising models, which can bring model generalization but require large model size and expensive training costs. To address these issues, this paper proposes Implicit Prompt Learning for Image Denoising (IPLID) method to flexibly generate adaptive prompts without meticulously designing them. Specifically, we first introduce an efficient Linear Prompt (LP) block with ultra-few parameters to produce dynamic prompts for both different stages and samples in denoising procedure. We further propose an efficient Compact Feature Fusion (CFF) block to process previous multi-level prompted denoising feature to reconstruct the denoising images. Finally, to further efficiently and effectively produce satisfactory prompt and denoising performance, a Gradient Accumulation (GA) learning scheme is proposed. Experiments on multiple benchmarks showed that the proposed IPLID achieves competitive results with only 1 percent of pre-trained backbone parameters, outperforming classical denoising methods in both efficiency and quality of restored images.

PDF Details DOI

ICML Conference 2024 Conference Paper

In-context Learning on Function Classes Unveiled for Transformers

Zhijie Wang
Bo Jiang
Shuai Li 0010

Transformer-based neural sequence models exhibit a remarkable ability to perform in-context learning. Given some training examples, a pre-trained model can make accurate predictions on an unseen input. This paper studies why transformers can learn different types of function classes in-context. We first show by construction that there exists a family of transformers (with different activation functions) that implement approximate gradient descent on the parameters of neural networks, and we provide an upper bound for the number of heads, hidden dimensions, and layers of the transformer. We also show that a transformer can learn linear functions, the indicator function of a unit ball, and smooth functions in-context by learning neural networks that approximate them. The above instances mainly focus on a transformer pre-trained on single tasks. We also prove that when pre-trained on two tasks: linear regression and classification, a transformer can make accurate predictions on both tasks simultaneously. Our results move beyond linearity in terms of in-context learning instances and provide a comprehensive understanding of why transformers can learn many types of function classes through the bridge of neural networks.

Details

NeurIPS Conference 2024 Conference Paper

Leveraging Drift to Improve Sample Complexity of Variance Exploding Diffusion Models

Ruofeng Yang
Zhijie Wang
Bo Jiang
Shuai Li

Variance exploding (VE) based diffusion models, an important class of diffusion models, have shown state-of-the-art (SOTA) performance. However, only a few theoretical works analyze VE-based models, and those works suffer from a worse forward convergence rate $1/\text{poly}(T)$ than the $\exp{(-T)}$ of variance preserving (VP) based models, where $T$ is the forward diffusion time and the rate measures the distance between forward marginal distribution $q_T$ and pure Gaussian noise. The slow rate is due to the Brownian Motion without a drift term. In this work, we design a new drifted VESDE forward process, which allows a faster $\exp{(-T)}$ forward convergence rate. With this process, we achieve the first efficient polynomial sample complexity for a series of VE-based models with reverse SDE under the manifold hypothesis. Furthermore, unlike previous works, we allow the diffusion coefficient to be unbounded instead of a constant, which is closer to the SOTA models. Besides the reverse SDE, the other common reverse process is the probability flow ODE (PFODE) process, which is deterministic and enjoys faster sample speed. To deepen the understanding of VE-based models, we consider a more general setting considering reverse SDE and PFODE simultaneously, propose a unified tangent-based analysis framework, and prove the first quantitative convergence guarantee for SOTA VE-based models with reverse PFODE. We also show that the drifted VESDE can balance different error terms and improve generated samples without training through synthetic and real-world experiments.

PDF Details DOI

IJCAI Conference 2024 Conference Paper

QFormer: An Efficient Quaternion Transformer for Image Denoising

Bo Jiang
Yao Lu
Guangming Lu
Bob Zhang

Since Deep Convolutional Neural Networks (DCNNs) and Vision Transformer perform well in learning generalizable image priors from large-scale data, these models have been widely used in image denoising tasks. However, vanilla DCNNs and Transformer suffer from two problems. First, the vanilla DCNNs and Transformer only accumulate the output along the channel axis, ignoring the internal relationship among channels. This results in the severely inadequate color structure representation retrieved from color images. Secondly, the DCNNs or Transformer-based image denoising models usually have a large number of parameters, high computational complexity, and slow inference speed. To resolve these issues, this paper proposes a highly-efficient Quaternion Transformer (QFormer) for image denoising. Specifically, the proposed Quaternion Transformer Block (QTB) simplifies the typical Transformer from a multi-branch structure to an elaborately sequential structure mainly with quaternion transformations, to alternately capture both long-range dependencies and local contextual features with color structure information. Furthermore, the proposed QTB can also avoid considerable element-wise multiplications of computing the self-attention matrices. Thus, our QTB can significantly reduce the computational complexity and its sequential structure can further improve the practical inference speed. Comprehensive experiments demonstrate that the proposed QFormer produces state-of-the-art results in both denoising performance and efficiency. We hope that our work will encourage further research to explore the Quaternion Transformer architecture for image denoising tasks.

PDF Details DOI

EAAI Journal 2024 Journal Article

Query-induced multi-task decomposition and enhanced learning for aspect-based sentiment quadruple prediction

Hua Zhang
Xiawen Song
Xiaohui Jia
Cheng Yang
Zeqi Chen
Bi Chen
Bo Jiang
Ye Wang

A complete sentiment analysis of product and service reviews has attracted growing concerns from merchants to enhance personalized marketing activities. Aspect sentiment quadruple prediction (ASQP) is a demanding and challenging task with the objective to predict four sentiment elements from given reviews. Existing methods for ASQP face certain issues, with pipeline-based non-generative approaches prone to error propagation and generative models at the potential risk of producing unexpected outputs or longer inference times. To avoid these shortcomings, we develop a novel end-to-end non-generative model for ASQP involving multi-task decomposition within machine reading comprehension (MRC) framework. Specifically, the ASQP task is decomposed into six query-induced subtasks by introducing task-specific question templates. The proposed model, named MRC-CLRI, is trained with multi-task joint learning. It also incorporates contrastive learning for category identification and sentiment classification to enhance the correlation of the six subtasks. To further promote the quadruple prediction, we present a refined inference algorithm in a bidirectional multi-turn inference procedure to effectively match aspect and opinion terms and optimize two inference hyperparameters: distance threshold and probability threshold. Experimental results exhibit superior performance compared to existing two non-generative and seven generative baselines. Our proposed MRC-CLRI, as a novel non-generative model, outperforms the best existing generative method by an average F1 score improvement of 1.69% and the best previous non-generative method by an average F1 score improvement of 15.77% across four review datasets. Ablation experiments further validate the efficacy of the designed contrastive learning and the refined inference algorithm.

Details DOI

JMLR Journal 2024 Journal Article

Volterra Neural Networks (VNNs)

Siddharth Roheda
Hamid Krim
Bo Jiang

The importance of inference in Machine Learning (ML) has led to an explosive number of different proposals, particularly in Deep Learning. In an attempt to reduce the complexity of Convolutional Neural Networks, we propose a Volterra filter-inspired Network architecture. This architecture introduces controlled non-linearities in the form of interactions between the delayed input samples of data. We propose a cascaded implementation of Volterra Filtering so as to significantly reduce the number of parameters required to carry out the same classification task as that of a conventional Neural Network. We demonstrate an efficient parallel implementation of this Volterra Neural Network (VNN), along with its remarkable performance while retaining a relatively simpler and potentially more tractable structure. Furthermore, we show a rather sophisticated adaptation of this network to nonlinearly fuse the RGB (spatial) information and the Optical Flow (temporal) information of a video sequence for action recognition. The proposed approach is evaluated on UCF-101 and HMDB-51 datasets for action recognition, and is shown to outperform state of the art CNN approaches.

PDF Details

NeurIPS Conference 2023 Conference Paper

A Riemannian Exponential Augmented Lagrangian Method for Computing the Projection Robust Wasserstein Distance

Bo Jiang
Ya-Feng Liu

Projection robust Wasserstein (PRW) distance is recently proposed to efficiently mitigate the curse of dimensionality in the classical Wasserstein distance. In this paper, by equivalently reformulating the computation of the PRW distance as an optimization problem over the Cartesian product of the Stiefel manifold and the Euclidean space with additional nonlinear inequality constraints, we propose a Riemannian exponential augmented Lagrangian method (REALM) for solving this problem. Compared with the existing Riemannian exponential penalty-based approaches, REALM can potentially avoid too small penalty parameters and exhibit more stable numerical performance. To solve the subproblems in REALM efficiently, we design an inexact Riemannian Barzilai-Borwein method with Sinkhorn iteration (iRBBS), which selects the stepsizes adaptively rather than tuning the stepsizes in efforts as done in the existing methods. We show that iRBBS can return an $\epsilon$-stationary point of the original PRW distance problem within $\mathcal{O}(\epsilon^{-3})$ iterations, which matches the best known iteration complexity result. Extensive numerical results demonstrate that our proposed methods outperform the state-of-the-art solvers for computing the PRW distance.

PDF Details

TIST Journal 2023 Journal Article

Human Pose Transfer with Augmented Disentangled Feature Consistency

Kun Wu
Chengxiang Yin
Zhengping Che
Bo Jiang
Jian Tang
Zheng Guan
Gangyi Ding

Deep generative models have made great progress in synthesizing images with arbitrary human poses and transferring the poses of one person to others. Though many different methods have been proposed to generate images with high visual fidelity, the main challenge remains and comes from two fundamental issues: pose ambiguity and appearance inconsistency. To alleviate the current limitations and improve the quality of the synthesized images, we propose a pose transfer network with augmented Disentangled Feature Consistency (DFC-Net) to facilitate human pose transfer. Given a pair of images containing the source and target person, DFC-Net extracts pose and static information from the source and target respectively, then synthesizes an image of the target person with the desired pose from the source. Moreover, DFC-Net leverages disentangled feature consistency losses in the adversarial training to strengthen the transfer coherence and integrates a keypoint amplifier to enhance the pose feature extraction. With the help of the disentangled feature consistency losses, we further propose a novel data augmentation scheme that introduces unpaired support data with the augmented consistency constraints to improve the generality and robustness of DFC-Net. Extensive experimental results on Mixamo-Pose and EDN-10k have demonstrated DFC-Net achieves state-of-the-art performance on pose transfer.

Details DOI

IJCAI Conference 2023 Conference Paper

Prediction with Incomplete Data under Agnostic Mask Distribution Shift

Yichen Zhu
Jian Yuan
Bo Jiang
Tao Lin
Haiming Jin
Xinbing Wang
Chenghu Zhou

Data with missing values is ubiquitous in many applications. Recent years have witnessed increasing attention on prediction with only incomplete data consisting of observed features and a mask that indicates the missing pattern. Existing methods assume that the training and testing distributions are the same, which may be violated in real-world scenarios. In this paper, we consider prediction with incomplete data in the presence of distribution shift. We focus on the case where the underlying joint distribution of complete features and label is invariant, but the missing pattern, i.e., mask distribution may shift agnostically between training and testing. To achieve generalization, we leverage the observation that for each mask, there is an invariant optimal predictor. To avoid the exponential explosion when learning them separately, we approximate the optimal predictors jointly using a double parameterization technique. This has the undesirable side effect of allowing the learned predictors to rely on the intra-mask correlation and that between features and mask. We perform decorrelation to minimize this effect. Combining the techniques above, we propose a novel prediction method called StableMiss. Extensive experiments on both synthetic and real-world datasets show that StableMiss is robust and outperforms state-of-the-art methods under agnostic mask distribution shift.

PDF Details DOI

AAAI Conference 2023 Conference Paper

Understanding Representation Learnability of Nonlinear Self-Supervised Learning

Ruofeng Yang
Xiangyuan Li
Bo Jiang
Shuai Li

Self-supervised learning (SSL) has empirically shown its data representation learnability in many downstream tasks. There are only a few theoretical works on data representation learnability, and many of those focus on final data representation, treating the nonlinear neural network as a "black box". However, the accurate learning results of neural networks are crucial for describing the data distribution features learned by SSL models. Our paper is the first to analyze the learning results of the nonlinear SSL model accurately. We consider a toy data distribution that contains two features: the label-related feature and the hidden feature. Unlike previous linear setting work that depends on closed-form solutions, we use the gradient descent algorithm to train a 1-layer nonlinear SSL model with a certain initialization region and prove that the model converges to a local minimum. Furthermore, different from the complex iterative analysis, we propose a new analysis process which uses the exact version of Inverse Function Theorem to accurately describe the features learned by the local minimum. With this local minimum, we prove that the nonlinear SSL model can capture the label-related feature and hidden feature at the same time. In contrast, the nonlinear supervised learning (SL) model can only learn the label-related feature. We also present the learning processes and results of the nonlinear SSL and SL model via simulation experiments.

PDF Details DOI

JMLR Journal 2022 Journal Article

Accelerating Adaptive Cubic Regularization of Newton's Method via Random Sampling

Xi Chen
Bo Jiang
Tianyi Lin
Shuzhong Zhang

In this paper, we consider an unconstrained optimization model where the objective is a sum of a large number of possibly nonconvex functions, though overall the objective is assumed to be smooth and convex. Our bid to solving such model uses the framework of cubic regularization of Newton's method. As is well known, the crux in cubic regularization is its utilization of the Hessian information, which may be computationally expensive for large-scale problems. To tackle this, we resort to approximating the Hessian matrix via sub-sampling. In particular, we propose to compute an approximated Hessian matrix by either uniformly or non-uniformly sub-sampling the components of the objective. Based upon such sampling strategy, we develop accelerated adaptive cubic regularization approaches and provide theoretical guarantees on global iteration complexity of $O(\epsilon^{-1/3})$ with high probability, which matches that of the original accelerated cubic regularization methods (Jiang et al., 2020) using the full Hessian information. Interestingly, we also show that in the worst case scenario our algorithm still achieves an $O(\epsilon^{-5/6}\log(\epsilon^{-1}))$ iteration complexity bound. The proof techniques are new to our knowledge and can be of independent interest. Experimental results on the regularized logistic regression problems demonstrate a clear effect of acceleration on several real data sets.

PDF Details

AAAI Conference 2020 Conference Paper

Generative Attention Networks for Multi-Agent Behavioral Modeling

Guangyu Li
Bo Jiang
Hao Zhu
Zhengping Che
Yan Liu

Understanding and modeling behavior of multi-agent systems is a central step for artificial intelligence. Here we present a deep generative model which captures behavior generating process of multi-agent systems, supports accurate predictions and inference, infers how agents interact in a complex system, as well as identifies agent groups and interaction types. Built upon advances in deep generative models and a novel attention mechanism, our model can learn interactions in highly heterogeneous systems with linear complexity in the number of agents. We apply this model to three multi-agent systems in different domains and evaluate performance on a diverse set of tasks including behavior prediction, interaction analysis and system identification. Experimental results demonstrate its ability to model multi-agent systems, yielding improved performance over competitive baselines. We also show the model can successfully identify agent groups and interaction types in these systems. Our model offers new opportunities to predict complex multi-agent behaviors and takes a step forward in understanding interactions in multi-agent systems.

PDF Details

AAAI Conference 2019 Short Paper

DSINE: Deep Structural Influence Learning via Network Embedding

Jianjun Wu
Ying Sha
Bo Jiang
Jianlong Tan

Structural representations of user social influence are critical for a variety of applications such as viral marketing and recommendation products. However, existing studies only focus on capturing and preserving the structure of relations, and ignore the diversity of influence relations patterns among users. To this end, we propose a deep structural influence learning model to learn social influence structure via mining rich features of each user, and fuse information from the aligned selfnetwork component for preserving global and local structure of the influence relations among users. Experiments on two real-world datasets demonstrate that the proposed model outperforms the state-of-the-art algorithms for learning rich representations in multi-label classification task.

PDF Details

NeurIPS Conference 2017 Conference Paper

Graph Matching via Multiplicative Update Algorithm

Bo Jiang
Jin Tang
Chris Ding
Yihong Gong
Bin Luo

As a fundamental problem in computer vision, graph matching problem can usually be formulated as a Quadratic Programming (QP) problem with doubly stochastic and discrete (integer) constraints. Since it is NP-hard, approximate algorithms are required. In this paper, we present a new algorithm, called Multiplicative Update Graph Matching (MPGM), that develops a multiplicative update technique to solve the QP matching problem. MPGM has three main benefits: (1) Theoretically, MPGM solves the general QP problem with doubly stochastic constraint naturally whose convergence and KKT optimality are guaranteed. (2) Empirically, MPGM generally returns a sparse solution and thus can also incorporate the discrete constraint approximately. (3) It is efficient and simple to implement. Experimental results show the benefits of MPGM algorithm.

PDF Details

AAAI Conference 2017 Conference Paper

Nonnegative Orthogonal Graph Matching

Bo Jiang
Jin Tang
Chris Ding
Bin Luo

Graph matching problem that incorporates pair-wise constraints can be formulated as Quadratic Assignment Problem (QAP). The optimal solution of QAP is discrete and combinational, which makes QAP problem NP-hard. Thus, many algorithms have been proposed to find approximate solutions. In this paper, we propose a new algorithm, called Nonnegative Orthogonal Graph Matching (NOGM), for QAP matching problem. NOGM is motivated by our new observation that the discrete mapping constraint of QAP can be equivalently encoded by a nonnegative orthogonal constraint which is much easier to implement computationally. Based on this observation, we develop an effective multiplicative update algorithm to solve NOGM and thus can find an effective approximate solution for QAP problem. Comparing with many traditional continuous methods which usually obtain continuous solutions and should be further discretized, NOGM can obtain a sparse solution and thus incorporates the desirable discrete constraint naturally in its optimization. Promising experimental results demonstrate benefits of NOGM algorithm.

PDF Details

IJCAI Conference 2016 Conference Paper

Robust Out-of-Sample Data Recovery

Bo Jiang
Chris Ding
Bin Luo

Trace norm based rank regularization techniques have been successfully applied to learn a low-rank recovery for high-dimensional noise data. In many applications, it is desirable to add new samples to previously recovered data which is known as out of sample data recovery problem. However, traditional trace norm based regularization methods can not naturally cope with new samples and thus fail to deal with out-of-sample data recovery. In this paper, we propose a new robust out-of-sample data recovery (ROSR) model for trace norm based regularization methods. An effective iterative algorithm, with the proof of convergence, is presented to find the optimal solution of ROSR problem. As an application, we apply our ROSR to image classification task. Experimental results on six image datasets demonstrate the effectiveness and benefits of the proposed ROSR method.

PDF Details

AAAI Conference 2015 Conference Paper

A Local Sparse Model for Matching Problem

Bo Jiang
Jin Tang
Chris Ding
Bin Luo

PDF Details

TCS Journal 2014 Journal Article

Minimization of the maximum distance between the two guards patrolling a polygonal region

Xuehou Tan
Bo Jiang

The two-guard problem asks whether two guards can walk to detect an unpredictable, moving target in a polygonal region P, no matter how fast the target moves, and if so, construct a walk schedule of the guards. For safety, two guards are required to always be mutually visible, and thus they move on the polygon boundary. In particular, a straight walk requires both guards to monotonically move on the boundary of P from beginning to end, one clockwise and the other counterclockwise. The objective of this paper is to find an optimum straight walk such that the maximum distance between the two guards is minimized. We present an $O(n^2)$ time algorithm for optimizing this metric, where n is the number of vertices of the polygon P. Our result is obtained by investigating a number of new properties of the min–max walks and converting the problem of finding an optimum walk in the min–max metric into that of finding a shortest path between two nodes in a graph. This answers an open question posed by Icking and Klein.

Details DOI

JMLR Journal 2008 Journal Article

Estimating the Confidence Interval for Prediction Errors of Support Vector Machine Classifiers

Bo Jiang
Xuegong Zhang
Tianxi Cai

Support vector machine (SVM) is one of the most popular and promising classification algorithms. After a classification rule is constructed via the SVM, it is essential to evaluate its prediction accuracy. In this paper, we develop procedures for obtaining both point and interval estimators for the prediction error. Under mild regularity conditions, we derive the consistency and asymptotic normality of the prediction error estimators for SVM with finite-dimensional kernels. A perturbation-resampling procedure is proposed to obtain interval estimates for the prediction error in practice. With numerical studies on simulated data and a benchmark repository, we recommend the use of interval estimates centered at the cross-validated point estimates for the prediction error. Further applications of the proposed procedure in model evaluation and feature selection are illustrated with two examples.

PDF Details