Author name cluster

Bohan Li

Possible papers associated with this exact author name in Arrow. This page groups case-insensitive exact name matches and is not a full identity disambiguation profile.

15 papers
1 author row

Possible papers

AAAI Conference 2026 Conference Paper

AHAMask: Reliable Task Specification for Large Audio Language Models Without Instructions

  • Yiwei Guo
  • Bohan Li
  • Hankun Wang
  • Zhihan Li
  • Shuai Wang
  • Xie Chen
  • Kai Yu

Although current large audio language models (LALMs) extend text large language models (LLMs) with generic acoustic understanding abilities, they usually suffer from prompt sensitivity, where different instructions of the same intention can yield drastically different outcomes. In this work, we propose AHAMask, where we simply mask some of the attention heads in the decoder-only LLM backbone of LALMs, to trigger specific acoustic task functionalities without instructions. These masks are efficiently obtained by training on an LALM, with the number of trainable parameters equal to the attention head count in its LLM backbone. We show by experiments that applying such selective attention head masks achieves comparable or even better performance than using instructions, either on single or composite tasks. Besides achieving reliable acoustic task specification for LALMs, this also reveals that LALMs exhibit certain "functional pathways" in their attention heads.
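As a rough illustration of the masking idea in this abstract, the sketch below applies one binary value per attention head to that head's output. It is not the authors' implementation: the function name, the toy head outputs, and the fixed mask are hypothetical, and the training procedure that produces the mask is not shown.

```python
from typing import List


def apply_head_mask(head_outputs: List[List[float]], mask: List[int]) -> List[float]:
    """Concatenate per-head outputs, zeroing every head whose mask bit is 0."""
    if len(head_outputs) != len(mask):
        raise ValueError("one mask bit per attention head is required")
    merged: List[float] = []
    for out, keep in zip(head_outputs, mask):
        # keep == 1 passes the head through, keep == 0 silences it
        merged.extend(x * keep for x in out)
    return merged


# Hypothetical toy values: three heads, each producing a 2-dimensional output.
heads = [[0.5, -1.0], [2.0, 3.0], [4.0, 0.25]]
task_mask = [1, 0, 1]  # assumed mask for some task; one entry per head
masked = apply_head_mask(heads, task_mask)
print(masked)
```

The mask has exactly as many entries as there are heads, which matches the abstract's statement that the number of trainable parameters equals the attention head count.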

NeurIPS Conference 2025 Conference Paper

Communication-Efficient Diffusion Denoising Parallelization via Reuse-then-Predict Mechanism

  • Kunyun Wang
  • Bohan Li
  • Kai Yu
  • Minyi Guo
  • Jieru Zhao

Diffusion models have emerged as a powerful class of generative models across various modalities, including image, video, and audio synthesis. However, their deployment is often limited by significant inference latency, primarily due to the inherently sequential nature of the denoising process. While existing parallelization strategies attempt to accelerate inference by distributing computation across multiple devices, they typically incur high communication overhead, hindering deployment on commercial hardware. To address this challenge, we propose ParaStep, a novel parallelization method based on a reuse-then-predict mechanism that parallelizes diffusion inference by exploiting similarity between adjacent denoising steps. Unlike prior approaches that rely on layer-wise or stage-wise communication, ParaStep employs lightweight, step-wise communication, substantially reducing overhead. ParaStep achieves end-to-end speedups of up to 3.88× on SVD, 2.43× on CogVideoX-2b, and 6.56× on AudioLDM2-large, while maintaining generation quality. These results highlight ParaStep as a scalable and communication-efficient solution for accelerating diffusion inference, particularly in bandwidth-constrained environments.

TIST Journal 2025 Journal Article

Knowledge Enhancement and Temporal Aware for Multi-Behavior Contrastive Recommendation

  • Hongrui Xuan
  • Bohan Li
  • Wenlong Wu
  • Yi Liu
  • Hongzhi Yin

A well-designed recommender system can accurately learn the embeddings of users and items, reflecting the unique preferences of users. Traditional recommendation techniques usually focus on modeling the singular type of behaviors between users and items. However, in many practical recommendation scenarios (e.g., social media, e-commerce), there exist multi-typed interactive behaviors in user–item relationships, such as click, tag-as-favorite, and purchase in online shopping platforms. Thus, how to make full use of multi-behavior information for recommendation is of great importance to the existing system, which presents challenges in two aspects that need to be explored: (1) Utilizing users’ personalized preferences to capture multi-behavioral dependencies; (2) Dealing with the insufficient recommendation caused by sparse supervision signal for target behavior. In this work, we propose the Knowledge Enhancement Multi-Behavior Contrastive Learning (KMCL) framework, including two Contrastive Learning tasks and three functional modules to tackle the above challenges, respectively. In particular, we design the multi-behavior learning module to extract users’ personalized behavior information for user-embedding enhancement and utilize knowledge graph in the knowledge enhancement module to derive more robust knowledge-aware representations for items. In addition, in the optimization stage, we also model the coarse-grained commonalities and the fine-grained differences between multi-behavior of users to further improve the recommendation effect and propose a joint training paradigm to enhance the learning effect of KMCLR in the joint learning module. Besides, we also considered how to make full use of temporal signals to enhance the effectiveness of multi-behavior recommendations in scenarios with time information and designed a novel encoder to address this issue. 
Extensive experiments and ablation tests on the three real-world datasets indicate that our KMCLR outperforms various state-of-the-art recommendation methods and verify the effectiveness of our method.

NeurIPS Conference 2025 Conference Paper

Unbalanced Optimal Total Variation Transport: A Theoretical Approach to Spatial Resource Allocation Problems

  • Nhan-Phu Chung
  • Jinhui Han
  • Bohan Li
  • Zehao Li

We propose and analyze a new class of unbalanced weak optimal transport (OT) problems with total variation penalties, motivated by spatial resource allocation tasks. Unlike classical OT, our framework accommodates general unbalanced nonnegative measures and incorporates cost objectives that directly capture operational trade-offs between transport cost and supply–demand mismatch. In the general setting, we establish the existence of optimal solutions and a dual formulation. We then focus on the semi-discrete setting, where one measure is discrete and the other is absolutely continuous, a structure relevant to applications such as service area partitioning for facilities like schools or medical stations. Exploiting a tessellation-based structure, we derive the corresponding explicit optimality conditions. We further address a quantization problem that jointly optimizes the locations and weights of discrete support points, applicable to facility location tasks such as the cost-efficient deployment of battery swap stations or e-commerce warehouses, informed by demand-side data. The dual-tessellation structure also yields explicit gradient expressions, enabling efficient numerical optimization in finite dimensions.

NeurIPS Conference 2025 Conference Paper

Unifying Appearance Codes and Bilateral Grids for Driving Scene Gaussian Splatting

  • Nan Wang
  • Lixing Xiao
  • Yuantao Chen
  • Weiqing Xiao
  • Pierre Merriaux
  • Lei Lei
  • Ziyang Yan
  • Saining Zhang

Neural rendering techniques, including NeRF and Gaussian Splatting (GS), rely on photometric consistency to produce high-quality reconstructions. However, in real-world driving scenarios, it is challenging to guarantee perfect photometric consistency in acquired images. Appearance codes have been widely used to address this issue, but their modeling capability is limited, as a single code is applied to the entire image. Recently, the bilateral grid was introduced to perform pixel-wise color mapping, but it is difficult to optimize and constrain effectively. In this paper, we propose a novel multi-scale bilateral grid that unifies appearance codes and bilateral grids. We demonstrate that this approach significantly improves geometric accuracy in dynamic, decoupled autonomous driving scene reconstruction, outperforming both appearance codes and bilateral grids. This is crucial for autonomous driving, where accurate geometry is important for obstacle avoidance and control. Our method shows strong results across four datasets: Waymo, NuScenes, Argoverse, and PandaSet. We further demonstrate that the improvement in geometry is driven by the multi-scale bilateral grid, which effectively reduces floaters caused by photometric inconsistency.

IJCAI Conference 2024 Conference Paper

Bridging Stereo Geometry and BEV Representation with Reliable Mutual Interaction for Semantic Scene Completion

  • Bohan Li
  • Yasheng Sun
  • Zhujin Liang
  • Dalong Du
  • Zhuanghui Zhang
  • Xiaofeng Wang
  • Yunnan Wang
  • Xin Jin

3D semantic scene completion (SSC) is an ill-posed perception task that requires inferring a dense 3D scene from limited observations. Previous camera-based methods struggle to predict accurate semantic scenes due to inherent geometric ambiguity and incomplete observations. In this paper, we resort to stereo matching technique and bird’s-eye-view (BEV) representation learning to address such issues in SSC. Complementary to each other, stereo matching mitigates geometric ambiguity with epipolar constraint while BEV representation enhances the hallucination ability for invisible regions with global semantic context. However, due to the inherent representation gap between stereo geometry and BEV features, it is non-trivial to bridge them for dense prediction task of SSC. Therefore, we further develop a unified occupancy-based framework dubbed BRGScene, which effectively bridges these two representations with dense 3D volumes for reliable semantic scene completion. Specifically, we design a novel Mutual Interactive Ensemble (MIE) block for pixel-level reliable aggregation of stereo geometry and BEV features. Within the MIE block, a Bi-directional Reliable Interaction (BRI) module, enhanced with confidence re-weighting, is employed to encourage fine-grained interaction through mutual guidance. Besides, a Dual Volume Ensemble (DVE) module is introduced to facilitate complementary aggregation through channel-wise recalibration and multi-group voting. Our method outperforms all published camera-based methods on SemanticKITTI for semantic scene completion. Our code is available on https://github.com/Arlo0o/StereoScene.

JBHI Journal 2024 Journal Article

Explainable Federated Medical Image Analysis Through Causal Learning and Blockchain

  • Junsheng Mu
  • Michel Kadoch
  • Tongtong Yuan
  • Wenzhe Lv
  • Qiang Liu
  • Bohan Li

Federated learning (FL) enables collaborative training of machine learning models across distributed medical data sources without compromising privacy. However, applying FL to medical image analysis presents challenges like high communication overhead and data heterogeneity. This paper proposes novel FL techniques using explainable artificial intelligence (XAI) for efficient, accurate, and trustworthy analysis. A heterogeneity-aware causal learning approach selectively sparsifies model weights based on their causal contributions, significantly reducing communication requirements while retaining performance and improving interpretability. Furthermore, blockchain provides decentralized quality assessment of client datasets. The assessment scores adjust aggregation weights so higher-quality data has more influence during training, improving model generalization. Comprehensive experiments show our XAI-integrated FL framework enhances efficiency, accuracy and interpretability. The causal learning method decreases communication overhead while maintaining segmentation accuracy. The blockchain-based data valuation mitigates issues from low-quality local datasets. Our framework provides essential model explanations and trust mechanisms, making FL viable for clinical adoption in medical image analysis.
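The abstract states that assessment scores adjust aggregation weights so that higher-quality data has more influence. A minimal sketch of one way to do this is a score-weighted average of client parameters. The normalization by the sum of scores, the function name, and the toy numbers are assumptions; the paper's actual aggregation rule and its blockchain-based scoring are not reproduced here.

```python
from typing import List


def aggregate(client_weights: List[List[float]], quality_scores: List[float]) -> List[float]:
    """Average client parameter vectors, weighting each client by its quality score."""
    if len(client_weights) != len(quality_scores):
        raise ValueError("one quality score per client is required")
    total = sum(quality_scores)
    if total <= 0:
        raise ValueError("quality scores must sum to a positive value")
    dim = len(client_weights[0])
    return [
        sum(w[i] * s for w, s in zip(client_weights, quality_scores)) / total
        for i in range(dim)
    ]


# Hypothetical toy values: two clients, the second scored three times higher.
global_weights = aggregate([[1.0, 2.0], [3.0, 4.0]], [1.0, 3.0])
print(global_weights)
```

With equal scores this reduces to a plain average of the client vectors.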

JBHI Journal 2024 Journal Article

Multi-Domain Based Dynamic Graph Representation Learning for EEG Emotion Recognition

  • Hao Tang
  • Songyun Xie
  • Xinzhou Xie
  • Yujie Cui
  • Bohan Li
  • Dalu Zheng
  • Yu Hao
  • Xiangming Wang

Graph neural networks (GNNs) have demonstrated efficient processing of graph-structured data, making them a promising method for electroencephalogram (EEG) emotion recognition. However, due to dynamic functional connectivity and nonlinear relationships between brain regions, representing EEG as graph data remains a great challenge. To solve this problem, we proposed a multi-domain based graph representation learning (MD²GRL) framework to model EEG signals as graph data. Specifically, MD²GRL leverages gated recurrent units (GRU) and power spectral density (PSD) to construct node features of two subgraphs. Subsequently, the self-attention mechanism is adopted to learn the similarity matrix between nodes and fuse it with the intrinsic spatial matrix of EEG to compute the corresponding adjacency matrix. In addition, we introduced a learnable soft thresholding operator to sparsify the adjacency matrix to reduce noise in the graph structure. In the downstream task, we designed a dual-branch GNN and incorporated spatial asymmetry for graph coarsening. We conducted experiments using the publicly available datasets SEED and DEAP, separately for subject-dependent and subject-independent, to evaluate the performance of our model in emotion classification. Experimental results demonstrated that our method achieved state-of-the-art (SOTA) classification performance in both subject-dependent and subject-independent experiments. Furthermore, the visualization analysis of the learned graph structure reveals EEG channel connections that are significantly related to emotion and suppress irrelevant noise. These findings are consistent with established neuroscience research and demonstrate the potential of our approach in comprehending the neural underpinnings of emotion.
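The soft thresholding step mentioned in this abstract can be illustrated with the standard operator sign(x) * max(|x| - tau, 0) applied entry-wise to an adjacency matrix. In the sketch below the threshold tau is a fixed constant, whereas the abstract describes it as learnable; the function names and matrix values are hypothetical.

```python
from typing import List


def soft_threshold(x: float, tau: float) -> float:
    """Shrink x toward zero by tau; entries with |x| <= tau become exactly 0."""
    if x > tau:
        return x - tau
    if x < -tau:
        return x + tau
    return 0.0


def sparsify(adj: List[List[float]], tau: float) -> List[List[float]]:
    """Apply soft thresholding to every entry of an adjacency matrix."""
    return [[soft_threshold(v, tau) for v in row] for row in adj]


# Hypothetical 2x2 adjacency matrix and a fixed threshold (assumed values).
sparse_adj = sparsify([[1.0, 0.25], [-0.75, 0.5]], tau=0.5)
print(sparse_adj)
```

Weak connections fall to zero while strong ones are kept with reduced magnitude, which is the sparsifying effect the abstract refers to.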

AAAI Conference 2024 Conference Paper

One at a Time: Progressive Multi-Step Volumetric Probability Learning for Reliable 3D Scene Perception

  • Bohan Li
  • Yasheng Sun
  • Jingxin Dong
  • Zheng Zhu
  • Jinming Liu
  • Xin Jin
  • Wenjun Zeng

Numerous studies have investigated the pivotal role of reliable 3D volume representation in scene perception tasks, such as multi-view stereo (MVS) and semantic scene completion (SSC). They typically construct 3D probability volumes directly with geometric correspondence, attempting to fully address the scene perception tasks in a single forward pass. However, such a single-step solution makes it hard to learn accurate and convincing volumetric probability, especially in challenging regions like unexpected occlusions and complicated light reflections. Therefore, this paper proposes to decompose the complicated 3D volume representation learning into a sequence of generative steps to facilitate fine and reliable scene perception. Considering the recent advances achieved by strong generative diffusion models, we introduce a multi-step learning framework, dubbed as VPD, dedicated to progressively refining the Volumetric Probability in a Diffusion process. Specifically, we first build a coarse probability volume from input images with the off-the-shelf scene perception baselines, which is then conditioned as the basic geometry prior before being fed into a 3D diffusion UNet, to progressively achieve accurate probability distribution modeling. To handle the corner cases in challenging areas, a Confidence-Aware Contextual Collaboration (CACC) module is developed to correct the uncertain regions for reliable volumetric learning based on multi-scale contextual contents. Moreover, an Online Filtering (OF) strategy is designed to maintain representation consistency for stable diffusion sampling. Extensive experiments are conducted on scene perception tasks including multi-view stereo (MVS) and semantic scene completion (SSC), to validate the efficacy of our method in learning reliable volumetric representations. Notably, for the SSC task, our work stands out as the first to surpass LiDAR-based methods on the SemanticKITTI dataset.

AAAI Conference 2024 Conference Paper

Semantic-Guided Generative Image Augmentation Method with Diffusion Models for Image Classification

  • Bohan Li
  • Xiao Xu
  • Xinghao Wang
  • Yutai Hou
  • Yunlong Feng
  • Feng Wang
  • Xuanliang Zhang
  • Qingfu Zhu

Existing image augmentation methods consist of two categories: perturbation-based methods and generative methods. Perturbation-based methods apply pre-defined perturbations to augment an original image, but only locally vary the image, thus lacking image diversity. In contrast, generative methods bring more image diversity in the augmented images but may not preserve semantic consistency, thus may incorrectly change the essential semantics of the original image. To balance image diversity and semantic consistency in augmented images, we propose SGID, a Semantic-guided Generative Image augmentation method with Diffusion models for image classification. Specifically, SGID employs diffusion models to generate augmented images with good image diversity. More importantly, SGID takes image labels and captions as guidance to maintain semantic consistency between the augmented and original images. Experimental results show that SGID outperforms the best augmentation baseline by 1.72% on ResNet-50 (from scratch), 0.33% on ViT (ImageNet-21k), and 0.14% on CLIP-ViT (LAION-2B). Moreover, SGID can be combined with other image augmentation baselines and further improves the overall performance. We demonstrate the semantic consistency and image diversity of SGID through quantitative human and automated evaluations, as well as qualitative case studies.

NeurIPS Conference 2024 Conference Paper

TAPTRv2: Attention-based Position Update Improves Tracking Any Point

  • Hongyang Li
  • Hao Zhang
  • Shilong Liu
  • Zhaoyang Zeng
  • Feng Li
  • Tianhe Ren
  • Bohan Li
  • Lei Zhang

In this paper, we present TAPTRv2, a Transformer-based approach built upon TAPTR for solving the Tracking Any Point (TAP) task. TAPTR borrows designs from DEtection TRansformer (DETR) and formulates each tracking point as a point query, making it possible to leverage well-studied operations in DETR-like algorithms. TAPTRv2 improves TAPTR by addressing a critical issue regarding its reliance on cost-volume, which contaminates the point query’s content feature and negatively impacts both visibility prediction and cost-volume computation. In TAPTRv2, we propose a novel attention-based position update (APU) operation and use key-aware deformable attention to realize it. For each query, this operation uses key-aware attention weights to combine their corresponding deformable sampling positions to predict a new query position. This design is based on the observation that local attention is essentially the same as cost-volume, both of which are computed by the dot product between a query and its surrounding features. By introducing this new operation, TAPTRv2 not only removes the extra burden of cost-volume computation, but also leads to a substantial performance improvement. TAPTRv2 surpasses TAPTR and achieves state-of-the-art performance on many challenging datasets, demonstrating the effectiveness of our approach.

AAAI Conference 2023 Conference Paper

VideoDubber: Machine Translation with Speech-Aware Length Control for Video Dubbing

  • Yihan Wu
  • Junliang Guo
  • Xu Tan
  • Chen Zhang
  • Bohan Li
  • Ruihua Song
  • Lei He
  • Sheng Zhao

Video dubbing aims to translate the original speech in a film or television program into the speech in a target language, which can be achieved with a cascaded system consisting of speech recognition, machine translation and speech synthesis. To ensure the translated speech to be well aligned with the corresponding video, the length/duration of the translated speech should be as close as possible to that of the original speech, which requires strict length control. Previous works usually control the number of words or characters generated by the machine translation model to be similar to the source sentence, without considering the isochronicity of speech as the speech duration of words/characters in different languages varies. In this paper, we propose VideoDubber, a machine translation system tailored for the task of video dubbing, which directly considers the speech duration of each token in translation, to match the length of source and target speech. Specifically, we control the speech length of generated sentence by guiding the prediction of each word with the duration information, including the speech duration of itself as well as how much duration is left for the remaining words. We design experiments on four language directions (German -> English, Spanish -> English, Chinese <-> English), and the results show that VideoDubber achieves better length control ability on the generated speech than baseline methods. To make up the lack of real-world datasets, we also construct a real-world test set collected from films to provide comprehensive evaluations on the video dubbing task.

JAIR Journal 2021 Journal Article

Efficient Local Search based on Dynamic Connectivity Maintenance for Minimum Connected Dominating Set

  • Xindi Zhang
  • Bohan Li
  • Shaowei Cai
  • Yiyuan Wang

The minimum connected dominating set (MCDS) problem is an important extension of the minimum dominating set problem, with wide applications, especially in wireless networks. Most previous works focused on solving MCDS problem in graphs with relatively small size, mainly due to the complexity of maintaining connectivity. This paper explores techniques for solving MCDS problem in massive real-world graphs with wide practical importance. Firstly, we propose a local greedy construction method with reasoning rule called 1hopReason. Secondly and most importantly, a hybrid dynamic connectivity maintenance method (HDC+) is designed to switch alternately between a novel fast connectivity maintenance method based on spanning tree and its previous counterpart. Thirdly, we adopt a two-level vertex selection heuristic with a newly proposed scoring function called chronosafety to make the algorithm more considerate when selecting vertices. We design a new local search algorithm called FastCDS based on the three ideas. Experiments show that FastCDS significantly outperforms five state-of-the-art MCDS algorithms on both massive graphs and classic benchmarks.
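To make the problem definition concrete, the sketch below checks whether a vertex set is a connected dominating set: every vertex outside the set must have a neighbor inside it, and the set must induce a connected subgraph. This is only a feasibility check written for illustration, not FastCDS or any part of it; the function name and the example graph are hypothetical.

```python
from collections import deque
from typing import Dict, Iterable, List


def is_connected_dominating_set(adj: Dict[int, List[int]], candidate: Iterable[int]) -> bool:
    """Return True if `candidate` dominates the graph and induces a connected subgraph."""
    cds = set(candidate)
    if not cds:
        return False
    # Domination: each vertex outside the set needs at least one neighbor inside it.
    for v, nbrs in adj.items():
        if v not in cds and not cds.intersection(nbrs):
            return False
    # Connectivity: breadth-first search restricted to vertices of the set.
    start = next(iter(cds))
    seen = {start}
    queue = deque([start])
    while queue:
        u = queue.popleft()
        for w in adj[u]:
            if w in cds and w not in seen:
                seen.add(w)
                queue.append(w)
    return seen == cds


# Hypothetical example: the path graph 0 - 1 - 2 - 3 - 4.
path = {0: [1], 1: [0, 2], 2: [1, 3], 3: [2, 4], 4: [3]}
print(is_connected_dominating_set(path, {1, 2, 3}))
print(is_connected_dominating_set(path, {1, 3}))
```

On the path graph, {1, 3} dominates every vertex but is not connected, which shows why maintaining connectivity is the extra constraint that separates MCDS from the plain dominating set problem.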

TIST Journal 2021 Journal Article

PARP: A Parallel Traffic Condition Driven Route Planning Model on Dynamic Road Networks

  • Tianlun Dai
  • Bohan Li
  • Ziqiang Yu
  • Xiangrong Tong
  • Meng Chen
  • Gang Chen

The problem of route planning on road network is essential to many Location-Based Services (LBSs). Road networks are dynamic in the sense that the weights of the edges in the corresponding graph constantly change over time, representing evolving traffic conditions. Thus, a practical route planning strategy is required to supply the continuous route optimization considering the historic, current, and future traffic condition. However, few existing works comprehensively take into account these various traffic conditions during the route planning. Moreover, the LBSs usually suffer from extensive concurrent route planning requests in rush hours, which imposes a pressing need to handle numerous queries in parallel for reducing the response time of each query. However, this issue is also not involved by most existing solutions. We therefore investigate a parallel traffic condition driven route planning model on a cluster of processors. To embed the future traffic condition into the route planning, we employ a GCN model to periodically predict the travel costs of roads within a specified time period, which facilitates the robustness of the route planning model against the varying traffic condition. To reduce the response time, a Dual-Level Path (DLP) index is proposed to support a parallel route planning algorithm with the filter-and-refine principle. The bottom level of DLP partitions the entire graph into different subgraphs, and the top level is a skeleton graph that consists of all border vertices in all subgraphs. The filter step identifies a global directional path for a given query based on the skeleton graph. In the refine step, the overall route planning for this query is decomposed into multiple sub-optimizations in the subgraphs passed through by the directional path. Since the subgraphs are independently maintained by different processors, the sub-optimizations of extensive queries can be operated in parallel. 
Finally, extensive evaluations are conducted to confirm the effectiveness and superiority of the proposal.
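The two-level index described in this abstract relies on the border vertices of each subgraph. As a minimal sketch, and assuming that a border vertex is one with at least one edge into a different subgraph (the abstract does not spell out the definition), the vertices of the skeleton graph could be collected as follows. The function name, the edge list, and the partition labels are hypothetical.

```python
from typing import Dict, Hashable, Iterable, Set, Tuple


def border_vertices(edges: Iterable[Tuple[int, int]], part: Dict[int, Hashable]) -> Set[int]:
    """Collect vertices that have an edge crossing into another partition."""
    border: Set[int] = set()
    for u, v in edges:
        if part[u] != part[v]:
            # The edge crosses a partition boundary, so both endpoints are border vertices.
            border.update((u, v))
    return border


# Hypothetical road network: a chain of six vertices split into subgraphs A and B.
edges = [(0, 1), (1, 2), (2, 3), (3, 4), (4, 5)]
part = {0: "A", 1: "A", 2: "A", 3: "B", 4: "B", 5: "B"}
skeleton_nodes = border_vertices(edges, part)
print(sorted(skeleton_nodes))
```

How the skeleton graph's edges and weights are built from these vertices is not stated in the abstract and is left out of the sketch.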

IJCAI Conference 2020 Conference Paper

NuCDS: An Efficient Local Search Algorithm for Minimum Connected Dominating Set

  • Bohan Li
  • Xindi Zhang
  • Shaowei Cai
  • Jinkun Lin
  • Yiyuan Wang
  • Christian Blum

The minimum connected dominating set (MCDS) problem is an important extension of the minimum dominating set problem, with wide applications, especially in wireless networks. Despite its practical importance, there are few works on solving MCDS for massive graphs, mainly due to the complexity of maintaining connectivity. In this paper, we propose two novel ideas, and develop a new local search algorithm for MCDS called NuCDS. First, a hybrid dynamic connectivity maintenance method is designed to switch alternately between a novel fast connectivity maintenance method based on spanning tree and its previous counterpart. Second, we define a new vertex property called safety to make the algorithm more considerate when selecting vertices. Experiments show that NuCDS significantly outperforms the state-of-the-art MCDS algorithms on both massive graphs and classic benchmarks.