Arrow Research

Author name cluster

Bing Li

Possible papers associated with this exact author name in Arrow. This page groups case-insensitive exact name matches and is not a full identity disambiguation profile.

74 papers
2 author rows

Possible papers

74

AAAI Conference 2026 Conference Paper

Ev-iCRF: Self-supervised Event-guided iCRF Estimation for HDR Image Reconstruction

  • Xucheng Guo
  • Bing Li
  • Lin Wang
  • Yiran Shen

In this paper, we present Ev-iCRF, a novel self-supervised pipeline for high dynamic range (HDR) image reconstruction from a single-exposure low dynamic range (LDR) image, guided by asynchronous event streams generated by a bio-inspired event camera. The highlight of Ev-iCRF lies in its formulation of the inverse camera response function (iCRF) based on Event-LDR Correspondence. By leveraging the HDR properties of event data, the method enables direct iCRF estimation, offering a new perspective for event-guided HDR imaging. The pipeline is trained in a self-supervised manner using formulation-driven iCRF estimation loss and refinement loss, without the need for synchronized HDR supervision. Ev-iCRF adopts a two-stage coarse-to-fine reconstruction pipeline, allowing effective fusion of features from both LDR image and event data. The event information is used to optimize the iCRF, enabling accurate HDR reconstruction from LDR inputs. We evaluate Ev-iCRF on real-world datasets, and results show that it outperforms state-of-the-art methods in HDR reconstruction accuracy. Moreover, the reconstructed images demonstrate improved texture fidelity and structural detail.
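
To make the iCRF notion concrete, here is a minimal sketch that applies a toy inverse response function to an LDR image; the gamma parameterization and the function name are our illustrative assumptions, not the paper's learned iCRF.

```python
import numpy as np

def apply_icrf(ldr, gamma=2.2):
    """Map LDR pixel values in [0, 1] back toward linear irradiance using a toy
    inverse camera response function (a plain gamma curve). Ev-iCRF instead
    estimates a flexible iCRF from event-LDR correspondence; this is only a
    placeholder for the concept."""
    return np.clip(ldr, 0.0, 1.0) ** gamma

linear = apply_icrf(np.random.rand(4, 4))  # stand-in for a single-exposure LDR input
```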

AAAI Conference 2026 Conference Paper

Exploiting Geometric Structures for Modeling Multi-Agent Behaviors: A New Thinking

  • Bohao Qu
  • Xiaofeng Cao
  • Bing Li
  • Menglin Zhang
  • Tuan-Anh Vu
  • Di Lin
  • Qing Guo

In this paper, we rethink the modeling of agent behaviors from a geometric structure perspective in multi-agent reinforcement learning. Modeling agent behaviors is essential for understanding how agents interact and for facilitating effective decisions. The key lies in capturing the dependencies and sequential relationships among agent decisions. Since each decision influences subsequent choices, the decisions form a hierarchical, nested, tree-like structure of interdependencies. Modeling tree-like data in Euclidean spaces, however, can cause distortion, resulting in a loss of agent decision structure information. Motivated by this, we reconsider modeling agent behaviors in hyperbolic space and propose the Hyperbolic Multi-Agent Representations (HMAR) method, which projects the agent behaviors into a Poincaré ball and leverages hyperbolic neural networks to learn agent policy representations. Additionally, we design a contrastive loss function to train this network, minimizing the distance in feature space between different representations of the same agent while maximizing the distance between representations of distinct agents. Experimental results provide empirical evidence for the effectiveness of the HMAR method in cooperative and competitive environments, demonstrating the potential of hyperbolic agent representations for effective decision-making in multi-agent environments.
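
A minimal numpy sketch of the two hyperbolic primitives the abstract relies on: the exponential map that projects Euclidean features into the Poincaré ball, and the geodesic distance a contrastive loss would use. The unit-curvature assumption and the margin-free loss form are our simplifications, not the paper's exact construction.

```python
import numpy as np

def exp_map_zero(v, eps=1e-9):
    """Exponential map at the origin of the unit-curvature Poincare ball:
    projects a Euclidean feature vector into the open unit ball."""
    norm = np.linalg.norm(v) + eps
    return np.tanh(norm) * v / norm

def poincare_dist(x, y, eps=1e-9):
    """Geodesic distance between two points inside the unit Poincare ball."""
    sq = np.sum((x - y) ** 2)
    denom = (1 - np.sum(x ** 2)) * (1 - np.sum(y ** 2)) + eps
    return np.arccosh(1 + 2 * sq / denom)

# Schematic contrastive signal: pull two views of the same agent together,
# push a different agent away, measured in hyperbolic distance.
z_a1, z_a2, z_b = (exp_map_zero(np.random.randn(8)) for _ in range(3))
loss = poincare_dist(z_a1, z_a2) - poincare_dist(z_a1, z_b)
```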

AAAI Conference 2026 Conference Paper

Federated Context-Aware Personalized Recommendation

  • Zhihao Wang
  • Xiaoying Liao
  • Wenke Huang
  • Bingqian Liu
  • Tian Chen
  • Jian Wang
  • Bing Li

Federated recommender systems are emerging as a new paradigm for providing personalized services while preserving user data privacy. Most existing personalized federated recommender systems predict the user's next item by training user and item embeddings discretely. However, this training approach overlooks the user's behavioral patterns, suffers from low interpretability, and requires a substantial amount of data and meticulous fine-tuning to achieve stable and accurate embeddings. To address these limitations, we propose Federated Context-Aware Personalized Recommendation (FedCAR), a novel framework that leverages users’ recent interactions as behavioral context to guide prediction. Instead of static user embeddings, FedCAR dynamically constructs context representations by aggregating and weighting recently interacted item embeddings. Additionally, we incorporate a contrastive learning strategy that enables the model to capture shared behavioral structures across clients while maintaining personalized preferences, enhancing both generalization and robustness in heterogeneous settings. Experiments on 5 benchmark datasets show that FedCAR consistently outperforms state-of-the-art methods and provides interpretable recommendations by explicitly modeling context dependencies.
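
One simple way to instantiate "aggregating and weighting recently interacted item embeddings" is a softmax-attention average; the function name and the choice of attention are our guess at a minimal sketch, not FedCAR's actual aggregator.

```python
import numpy as np

def context_representation(recent_item_embs, query):
    """Aggregate a user's recently interacted item embeddings (n x d) into a
    single context vector via softmax attention against a query vector."""
    scores = recent_item_embs @ query            # (n,) similarity scores
    weights = np.exp(scores - scores.max())      # numerically stable softmax
    weights /= weights.sum()
    return weights @ recent_item_embs            # (d,) weighted context vector

items = np.random.randn(5, 16)                   # last 5 interactions, 16-dim embeddings
ctx = context_representation(items, query=items[-1])
```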

AAAI Conference 2026 Conference Paper

Light but Sharp: SlimSTAD for Real-Time Action Detection from Sensor Data

  • Wei Cui
  • Lukai Fan
  • Zhenghua Chen
  • Min Wu
  • Shili Xiang
  • Haixia Wang
  • Bing Li

Sensory Temporal Action Detection (STAD) aims to localize and classify human actions within long, untrimmed sequences captured by non-visual sensors such as WiFi or inertial measurement units (IMUs). Unlike video-based TAD, STAD poses unique challenges due to the low-dimensional, noisy, and heterogeneous nature of sensory data, as well as the real-time and resource constraints on edge devices. While recent STAD models have improved detection performance, their high computational cost hampers practical deployment. In this paper, we propose SlimSTAD, a simple yet effective framework that achieves both high accuracy and low latency for STAD. SlimSTAD features a novel Decoupled Channel Modeling (DCM) encoder, which preserves modality-specific temporal features and enables efficient inter-channel aggregation via lightweight graph attention. An anchor-free cascade predictor then refines action boundaries and class predictions in a two-stage design without dense proposals. Experiments on two real-world datasets demonstrate that SlimSTAD outperforms strong video-derived and sensory baselines by an average of 2.1 mAP, while significantly reducing GFLOPs, parameters, and latency, validating its effectiveness for real-world, edge-aware STAD deployment.

AAAI Conference 2026 Short Paper

Misclassification-Aware Robust Learning from Multiple Human Labelers (Student Abstract)

  • Zuoyuehe Wang
  • Chicheng Ma
  • Pengpeng Chen
  • Lei Chai
  • Yongqiang Yang
  • Zhijun Chen
  • Jingzheng Li
  • Bing Li

Adversarial training is an effective technique for enhancing the robustness of deep neural networks (DNNs). Prior research shows that misclassified examples influence final adversarial robustness much more than correctly classified examples. Ignoring this difference during training can hurt model performance. In crowdsourcing, varying annotator expertise causes noisy, inconsistent labels. As a result, it is hard to distinguish misclassified and correctly classified examples using only provided annotations. Thus, how to use the reliability and discrepancy between these example types to improve robustness within adversarial learning remains a critical but underexplored issue. In this work, we first explore how misclassified and correctly classified examples affect learning from crowds (LFC) in adversarial environments. Then, we formulate the problem of misclassification-aware robust learning from multiple human labelers as a bilevel min-max problem. After that, we introduce MALC, a new approach to make classifiers more robust to adversarial examples via iterative adversarial example generation and parameter estimation. We conduct an extensive evaluation of the proposed MALC, showing that MALC can outperform the state-of-the-art LFC methods in both white-box and black-box settings.
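
The abstract does not spell out MALC's inner loop, but its "iterative adversarial example generation" step builds on standard projected gradient descent (PGD); a generic PGD sketch in PyTorch, with hyperparameters chosen for illustration, is shown below.

```python
import torch
import torch.nn.functional as F

def pgd_examples(model, x, y, eps=8 / 255, alpha=2 / 255, steps=10):
    """Generic PGD-style adversarial example generation. MALC alternates a
    generation step like this with parameter (annotator-reliability)
    estimation; the exact inner loop is not specified in the abstract."""
    x_adv = x + torch.empty_like(x).uniform_(-eps, eps)  # random start in the eps-ball
    for _ in range(steps):
        x_adv = x_adv.detach().requires_grad_(True)
        loss = F.cross_entropy(model(x_adv), y)
        grad, = torch.autograd.grad(loss, x_adv)
        x_adv = x_adv + alpha * grad.sign()              # ascend the loss
        x_adv = torch.min(torch.max(x_adv, x - eps), x + eps).clamp(0.0, 1.0)
    return x_adv.detach()
```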

AAAI Conference 2026 Conference Paper

MMhops-R1: Multimodal Multi-hop Reasoning

  • Tao Zhang
  • Ziqi Zhang
  • Zongyang Ma
  • Yuxin Chen
  • Bing Li
  • Chunfeng Yuan
  • Guangting Wang
  • Fengyun Rao

The ability to perform multi-modal multi-hop reasoning by iteratively integrating information across various modalities and external knowledge is critical for addressing complex real-world challenges. However, existing Multi-modal Large Language Models (MLLMs) are predominantly limited to single-step reasoning, as existing benchmarks lack the complexity needed to evaluate and drive multi-hop abilities. To bridge this gap, we introduce MMhops, a novel, large-scale benchmark designed to systematically evaluate and foster multi-modal multi-hop reasoning. The MMhops dataset comprises two challenging task formats, Bridging and Comparison, which necessitate that models dynamically construct complex reasoning chains by integrating external knowledge. To tackle the challenges posed by MMhops, we propose MMhops-R1, a novel multi-modal Retrieval-Augmented Generation (mRAG) framework for dynamic reasoning. Our framework utilizes reinforcement learning to optimize the model for autonomously planning reasoning paths, formulating targeted queries, and synthesizing multi-level information. Comprehensive experiments demonstrate that MMhops-R1 significantly outperforms strong baselines on MMhops, highlighting that dynamic planning and multi-modal knowledge integration are crucial for complex reasoning. Moreover, MMhops-R1 demonstrates strong generalization to tasks requiring fixed-hop reasoning, underscoring the robustness of our dynamic planning approach.

JMLR Journal 2026 Journal Article

Nonlinear function-on-function regression by RKHS

  • Peijun Sang
  • Bing Li

We propose a nonlinear function-on-function regression model where both the covariate and the response are random functions. The nonlinear regression is carried out in two steps: we first construct Hilbert spaces to accommodate the functional covariate and the functional response, and then build a second-layer Hilbert space for the covariate to capture nonlinearity. The second-layer space is assumed to be a reproducing kernel Hilbert space, which is generated by a positive definite kernel determined by the inner product of the first-layer Hilbert space for $X$; this structure is known as nested Hilbert spaces. We develop estimation procedures to implement the proposed method, which allow the functional data to be observed at different time points for different subjects. Furthermore, we establish the convergence rate of our estimator as well as the weak convergence of the predicted response in the Hilbert space. Numerical studies, including both simulations and a data application, are conducted to investigate the performance of our estimator in finite samples.
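
As one concrete instance of such a second-layer kernel (our example; the paper only requires a positive definite kernel determined by the first-layer inner product), a Gaussian-type choice would be

```latex
\kappa(f, g) = \exp\!\big(-\gamma \,\| f - g \|_{\mathcal{H}_X}^{2}\big),
\qquad
\| f - g \|_{\mathcal{H}_X}^{2}
  = \langle f, f \rangle_{\mathcal{H}_X}
  - 2\,\langle f, g \rangle_{\mathcal{H}_X}
  + \langle g, g \rangle_{\mathcal{H}_X},
```

so that the second-layer RKHS is built entirely from inner products computed in the first-layer space.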

IJCAI Conference 2025 Conference Paper

An Empirical Study of Federated Prompt Learning for Vision Language Model

  • Zhihao Wang
  • Wenke Huang
  • Tian Chen
  • Zekun Shi
  • Guancheng Wan
  • Yu Qiao
  • Bin Yang
  • Jian Wang

The Vision Language Model (VLM) excels in aligning vision and language representations, and prompt learning has emerged as a key technique for adapting such models to downstream tasks. However, the application of prompt learning with VLM in federated learning (FL) scenarios remains underexplored. This paper systematically investigates the behavioral differences between language prompt learning (LPT) and vision prompt learning (VPT) under data heterogeneity challenges, including label skew and domain shift. We conduct extensive experiments to evaluate the impact of various FL and prompt configurations, such as client scale, aggregation strategies, and prompt length, to assess the robustness of Federated Prompt Learning (FPL). Furthermore, we explore strategies for enhancing prompt learning in complex scenarios where label skew and domain shift coexist, including leveraging both prompt types when computational resources allow. Our findings offer practical insights into optimizing prompt learning in federated settings, contributing to the broader deployment of VLMs in privacy-preserving environments.

IROS Conference 2025 Conference Paper

An Inflatable Deployable Origami Grasper for Adaptive and High-Load Grasping

  • Peng Yan
  • Guang Liang
  • Sen Wang
  • Hailin Huang
  • Wei Wang
  • Xu Li
  • Bing Li

Robotic graspers are essential for enhancing the efficiency and versatility of robots in grasping tasks. In this paper, we propose a novel inflatable deployable origami grasper with a rigid-flexible coupling structure. The proposed grasper can achieve multiple deployment configurations under a single pneumatic actuation, enabling both deployment and grasping operations while also allowing for passive self-folding during deflation. The design and fabrication of the grasper are presented. Then, the stiffness model for the inflatable deployable origami unit is developed based on the equivalent truss method. Experimental results show that the grasper successfully grasps objects of various shapes and sizes in both enveloping and fingertip grasping modes, using either two or four fingers. With its simple mechanical system and high deploy/fold ratio, the proposed grasper holds significant potential for applications in industrial automation and space exploration.

IROS Conference 2025 Conference Paper

Cockroach's Turning Strategy Enhanced Hexapod Robot with Flexible Torso

  • Yiming Li
  • Xingyu Li
  • Jie Zhou
  • Chenfeng Xie
  • Yao Li
  • Bing Li

The design and control of hexapod robots have become an active research field due to the ability to achieve adaptive and stable multi-terrain locomotion. However, existing hexapod robots focus on the integration of flexible pitch joints to enhance their obstacle-crossing and slope-climbing abilities, and few biological observations have been made to gain insight into the agile steering mechanisms of hexapod insects. Herein, we observed the steering movements of Madagascar cockroaches. Observations showed that cockroaches exhibited specific phase relationships in addition to the regular tripod gait pattern during steering. Moreover, we also found that a smaller steering radius resulted in a larger lateral bending angle of the thoracic segments. Inspired by this, a hexapod robot with a flexible torso (F-RHex) was designed and fabricated. Bio-inspired gait patterns were abstracted and simplified into two steering strategies: gait-based and mix-based. Compared to the purely gait-based strategy, the F-RHex testing results demonstrated a ~27.4% reduction in turning radius and a ~40% enhancement in steering velocity, implying that the mix-based strategy offers superior steering capability.

ICLR Conference 2025 Conference Paper

DeepTAGE: Deep Temporal-Aligned Gradient Enhancement for Optimizing Spiking Neural Networks

  • Wei Liu
  • Li Yang
  • Mingxuan Zhao
  • Shuxun Wang
  • Jin Gao
  • Wenjuan Li
  • Bing Li
  • Weiming Hu

Spiking Neural Networks (SNNs), with their biologically inspired spatio-temporal dynamics and spike-driven processing, are emerging as a promising low-power alternative to traditional Artificial Neural Networks (ANNs). However, the complex neuronal dynamics and non-differentiable spike communication mechanisms in SNNs present substantial challenges for efficient training. By analyzing the membrane potentials in spiking neurons, we found that their distributions can increasingly deviate from the firing threshold as time progresses, which tends to cause diminished backpropagation gradients and unbalanced optimization. To address these challenges, we propose Deep Temporal-Aligned Gradient Enhancement (DeepTAGE), a novel approach that improves optimization gradients in SNNs from both internal surrogate gradient functions and external supervision methods. Our DeepTAGE dynamically adjusts surrogate gradients in accordance with the membrane potential distribution across different time steps, enhancing their respective gradients in a temporal-aligned manner that promotes balanced training. Moreover, to mitigate issues of gradient vanishing or deviating during backpropagation, DeepTAGE incorporates deep supervision at both spatial (network stages) and temporal (time steps) levels to ensure more effective and robust network optimization. Importantly, our method can be seamlessly integrated into existing SNN architectures without imposing additional inference costs or requiring extra control modules. We validate the efficacy of DeepTAGE through extensive experiments on static benchmarks (CIFAR10, CIFAR100, and ImageNet-1k) and a neuromorphic dataset (DVS-CIFAR10), demonstrating significant performance improvements.
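
The "surrogate gradient" the abstract adjusts can be made concrete with a custom autograd function: a hard threshold in the forward pass and a smooth kernel in the backward pass. The specific kernel and the idea of treating its width `alpha` as the per-time-step knob are common choices and our placeholder, not DeepTAGE's exact schedule.

```python
import torch

class SurrogateSpike(torch.autograd.Function):
    """Hard-threshold spike forward; smooth surrogate gradient backward."""

    @staticmethod
    def forward(ctx, v, threshold, alpha):
        ctx.save_for_backward(v)
        ctx.threshold, ctx.alpha = threshold, alpha
        return (v >= threshold).float()

    @staticmethod
    def backward(ctx, grad_out):
        (v,) = ctx.saved_tensors
        # The non-differentiable step is replaced by a peaked, differentiable
        # kernel centered at the firing threshold; alpha controls its width.
        sg = ctx.alpha / (2 * (1 + (ctx.alpha * (v - ctx.threshold)).abs()) ** 2)
        return grad_out * sg, None, None

v = torch.randn(4, requires_grad=True)
spikes = SurrogateSpike.apply(v, 1.0, 2.0)
spikes.sum().backward()  # v.grad now holds the surrogate gradients
```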

AAAI Conference 2025 Conference Paper

Federated Recommendation with Explicitly Encoding Item Bias

  • Zhihao Wang
  • He Bai
  • Wenke Huang
  • Duantengchuan Li
  • Jian Wang
  • Bing Li

With the development of federated learning techniques and the increased need for user privacy protection, the federated recommendation has become a new recommendation paradigm. However, most existing works focus on user-level federated recommendation, leaving platform-level federated recommendation largely unexplored. A significant challenge in platform-level federated recommendation scenarios is severe label skew. Users behave in various ways on different platforms, bringing up the rating and item bias problem. In this work, we propose FREIB (Federated Recommendation with Explicitly Encoding Item Bias). The core idea is explicitly encoding item bias during federated learning, addressing the problem of fuzzy item bias, and achieving consistent representation in label skew scenarios. We achieve this by utilizing global knowledge guidance to model common rating patterns and by aligning feature prototypes to enhance item encoding at the same rating level. Extensive experiments conducted on three public datasets demonstrate the superiority of our method over several state-of-the-art approaches.

JMLR Journal 2025 Journal Article

Learning causal graphs via nonlinear sufficient dimension reduction

  • Eftychia Solea
  • Bing Li
  • Kyongwon Kim

We introduce a new nonparametric methodology for estimating a directed acyclic graph (DAG) from observational data. Our method is nonparametric in nature: it does not impose any specific form on the joint distribution of the underlying DAG. Instead, it relies on a linear operator on reproducing kernel Hilbert spaces to evaluate conditional independence. However, a fully nonparametric approach would involve conditioning on a large number of random variables, subjecting it to the curse of dimensionality. To solve this problem, we apply nonlinear sufficient dimension reduction to reduce the number of variables before evaluating the conditional independence. We develop an estimator for the DAG, based on a linear operator that characterizes conditional independence, and establish the consistency and convergence rates of this estimator, as well as the uniform consistency of the estimated Markov equivalence class. We introduce a modified PC-algorithm to implement the estimating procedure efficiently such that the complexity depends on the sparseness of the underlying true DAG. We demonstrate the effectiveness of our methodology through simulations and a real data analysis.

NeurIPS Conference 2025 Conference Paper

MI-TRQR: Mutual Information-Based Temporal Redundancy Quantification and Reduction for Energy-Efficient Spiking Neural Networks

  • Dengfeng Xue
  • Wenjuan Li
  • Yifan Lu
  • Chunfeng Yuan
  • Yufan Liu
  • Wei Liu
  • Man Yao
  • Li Yang

Brain-inspired spiking neural networks (SNNs) provide energy-efficient computation through event-driven processing. However, the shared weights across multiple timesteps lead to serious temporal feature redundancy, limiting both efficiency and performance. This issue is further aggravated when processing static images due to the duplicated input. To mitigate this problem, we propose a parameter-free and plug-and-play module named Mutual Information-based Temporal Redundancy Quantification and Reduction (MI-TRQR), constructing energy-efficient SNNs. Specifically, Mutual Information (MI) is properly introduced to quantify redundancy between discrete spike features at different timesteps on two spatial scales: pixel (local) and the entire spatial features (global). Based on the multi-scale redundancy quantification, we apply a probabilistic masking strategy to remove redundant spikes. The final representation is subsequently recalibrated to account for the spike removal. Extensive experimental results demonstrate that our MI-TRQR achieves sparser spiking firing, higher energy efficiency, and better performance concurrently with different SNN architectures in tasks of neuromorphic data classification, static data classification, and time-series forecasting. Notably, MI-TRQR increases accuracy by 1.7% on CIFAR10-DVS with 4 timesteps while reducing energy cost by 37.5%. Our codes are available at https://github.com/dfxue/MI-TRQR.
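
A minimal sketch of quantifying redundancy between two timesteps' binary spike maps with a plug-in mutual information estimate; treating each spatial location as a Bernoulli sample is our simplification of the paper's local/global scheme.

```python
import numpy as np

def spike_mutual_info(s_t, s_u, eps=1e-12):
    """Plug-in mutual information (nats) between two binary spike maps, e.g.
    the same layer's spikes at timesteps t and u, over spatial locations."""
    s_t, s_u = s_t.ravel().astype(int), s_u.ravel().astype(int)
    mi = 0.0
    for a in (0, 1):
        for b in (0, 1):
            p_ab = np.mean((s_t == a) & (s_u == b)) + eps   # joint probability
            p_a = np.mean(s_t == a) + eps                   # marginals
            p_b = np.mean(s_u == b) + eps
            mi += p_ab * np.log(p_ab / (p_a * p_b))
    return mi

redundancy = spike_mutual_info(np.random.rand(16, 16) > 0.7,
                               np.random.rand(16, 16) > 0.7)
```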

NeurIPS Conference 2025 Conference Paper

OmniResponse: Online Multimodal Conversational Response Generation in Dyadic Interactions

  • Cheng Luo
  • Jianghui Wang
  • Bing Li
  • Siyang Song
  • Bernard Ghanem

In this paper, we introduce Online Multimodal Conversational Response Generation (OMCRG), a novel task designed to produce synchronized verbal and non-verbal listener feedback online, based on the speaker's multimodal inputs. OMCRG captures natural dyadic interactions and introduces new challenges in aligning generated audio with listeners' facial responses. To tackle these challenges, we incorporate text as an intermediate modality to connect audio and facial responses. We propose OmniResponse, a Multimodal Large Language Model (MLLM) that autoregressively generates accurate multimodal listener responses. OmniResponse leverages a pretrained LLM enhanced with two core components: Chrono-Text Markup, which precisely timestamps generated text tokens, and TempoVoice, a controllable online text-to-speech (TTS) module that outputs speech synchronized with facial responses. To advance OMCRG research, we offer ResponseNet, a dataset of 696 detailed dyadic interactions featuring synchronized split-screen videos, multichannel audio, transcripts, and annotated facial behaviors. Comprehensive evaluations on ResponseNet demonstrate that OmniResponse outperforms baseline models in terms of semantic speech content, audio-visual synchronization, and generation quality. Our dataset, code, and models are publicly available at https://omniresponse.github.io/.

NeurIPS Conference 2025 Conference Paper

One Head to Rule Them All: Amplifying LVLM Safety through a Single Critical Attention Head

  • Junhao Xia
  • Haotian Zhu
  • Shuchao Pang
  • Zhigang Lu
  • Bing Li
  • Yongbin Zhou
  • Minhui Xue

Large Vision-Language Models (LVLMs) have demonstrated impressive capabilities in tasks requiring multimodal understanding. However, recent studies indicate that LVLMs are more vulnerable than LLMs to unsafe inputs and prone to generating harmful content. Existing defense strategies primarily include fine-tuning, input sanitization, and output intervention. Although these approaches provide a certain level of protection, they tend to be resource-intensive and struggle to effectively counter sophisticated attack techniques. To tackle such issues, we propose One-head Defense (Oh Defense), a novel yet simple approach utilizing LVLMs' internal safety capabilities. Through systematic analysis of the attention mechanisms, we discover that LVLMs' safety capabilities are concentrated within specific attention heads that respond differently to safe or unsafe inputs. Further exploration reveals that a single critical attention head can effectively serve as a safety guard, providing a strong discriminative signal that amplifies the model's inherent safety capabilities. Hence, the Oh Defense requires no additional training or external modules, making it computationally efficient while effectively reactivating suppressed safety mechanisms. Extensive experiments across diverse LVLM architectures and unsafe datasets validate our approach, i.e., the Oh Defense achieves near-perfect defense success rates (>98%) for unsafe inputs while maintaining low false positive rates (<5%) for safe content. The source code is available at https://github.com/AIASLab/Oh-Defense.

AAAI Conference 2025 Conference Paper

Similar Modality Enhancement and Action Consistency Learning for Weakly Supervised Temporal Action Localization

  • Maodong Li
  • Chao Zheng
  • Jian Wang
  • Bing Li

Weakly-supervised temporal action localization (WTAL) aims to identify and localize action instances in untrimmed videos using only video-level labels. Existing methods typically rely on original features from frozen pre-trained encoders designed for trimmed action classification (TAC) tasks, which inevitably introduces task discrepancy. Additionally, these methods often overlook the importance of considering action consistency from multiple perspectives, specifically the consistency in action processes and action semantics, both of which are crucial for the model's understanding of actions. To address these issues, we propose a novel WTAL method based on similar modality enhancement and action consistency learning (SEAL). First, we construct global descriptors for each action category, and use the pseudo-labels generated based on these descriptors to guide the model in learning more consistent representations, thereby mitigating task discrepancy. Second, we design two types of losses to achieve action consistency learning: process consistency loss, which penalizes candidate proposals that deviate from the action center to ensure the completeness of the action process, and semantic consistency loss, which employs local descriptors to help proposals of the same action category (especially those with apparent semantic confusion) learn similar feature distributions. Extensive experiments on the THUMOS14 and ActivityNet datasets demonstrate the superior performance of the proposed method compared to state-of-the-art methods.

IJCAI Conference 2025 Conference Paper

SSTrack: Sample-interval Scheduling for Lightweight Visual Object Tracking

  • Yutong Kou
  • Shubo Lin
  • Liang Li
  • Bing Li
  • Weiming Hu
  • Jin Gao

In recent years, CPU real-time object tracking has gained significant attention due to its broad applications such as UAV tracking. To maintain computational efficiency, most existing CPU real-time object trackers rely on lightweight backbones and employ a single initial template image without intermediate online templates. Although the appearance variance between the template and the search region is larger under this single-template setting, the representation ability of lightweight backbones is weaker, which poses a challenge when training lightweight object trackers. To address this issue, we propose SSTrack, a new easier-to-harder training schedule for lightweight object trackers. From the data perspective, we design a success-aware sample scheduler that gradually increases difficult training samples with longer template-search time intervals and reduces the number of easier samples, so that the training cost remains unchanged. From the optimization perspective, we utilize a gradient scaling strategy that retains the original training objective of easier samples despite the reduction in their quantity. With the collective effort from both perspectives, our method achieves state-of-the-art CPU real-time accuracy on 5 UAV-tracking benchmarks and 5 general object tracking benchmarks. Codes and models will be available at https://github.com/Kou-99/SSTrack.
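
A toy sketch of a success-aware interval scheduler in the easier-to-harder spirit described above; the doubling rule, thresholds, and class name are our illustrative assumptions, not SSTrack's actual scheduler.

```python
import random

class IntervalScheduler:
    """Curriculum over template-search frame intervals: when the recent
    training success rate is high enough, allow longer (harder) intervals."""

    def __init__(self, start=30, max_interval=300, grow_at=0.7):
        self.cur_max = start
        self.max_interval = max_interval
        self.grow_at = grow_at

    def update(self, success_rate):
        # Grow the sampling range only once training handles current samples well.
        if success_rate > self.grow_at:
            self.cur_max = min(self.cur_max * 2, self.max_interval)

    def sample_interval(self):
        return random.randint(1, self.cur_max)

sched = IntervalScheduler()
sched.update(success_rate=0.8)        # e.g., measured over the last epoch
interval = sched.sample_interval()    # template-search gap for the next pair
```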

NeurIPS Conference 2025 Conference Paper

SynCL: A Synergistic Training Strategy with Instance-Aware Contrastive Learning for End-to-End Multi-Camera 3D Tracking

  • Shubo Lin
  • Yutong Kou
  • Zirui Wu
  • Shaoru Wang
  • Bing Li
  • Weiming Hu
  • Jin Gao

While existing query-based 3D end-to-end visual trackers integrate detection and tracking via the tracking-by-attention paradigm, these two chicken-and-egg tasks encounter optimization difficulties when sharing the same parameters. Our findings reveal that these difficulties arise due to two inherent constraints on the self-attention mechanism, i.e., over-deduplication for object queries and self-centric attention for track queries. In contrast, removing the self-attention mechanism not only minimally impacts regression predictions of the tracker, but also tends to generate more latent candidate boxes. Based on these analyses, we present SynCL, a novel plug-and-play synergistic training strategy designed to co-facilitate multi-task learning for detection and tracking. Specifically, we propose a Task-specific Hybrid Matching module for a weight-shared cross-attention-based decoder that matches the targets of track queries with multiple object queries to exploit promising candidates overlooked by the self-attention mechanism and the bipartite matching. To flexibly select optimal candidates for the one-to-many matching, we also design a Dynamic Query Filtering module controlled by model training status. Moreover, we introduce Instance-aware Contrastive Learning to break through the barrier of self-centric attention for track queries, effectively bridging the gap between detection and tracking. Without additional inference costs, SynCL consistently delivers improvements in various benchmarks and achieves state-of-the-art performance with 58.9% AMOTA on the nuScenes dataset. Code and raw results are available at.

AAAI Conference 2025 Conference Paper

Towards More Discriminative Feature Learning in SNNs with Temporal-Self-Erasing Supervision

  • Wei Liu
  • Li Yang
  • Mingxuan Zhao
  • Dengfeng Xue
  • Shuxun Wang
  • Boyu Cai
  • Jin Gao
  • Wenjuan Li

Spiking Neural Networks (SNNs) are biologically inspired models that process visual inputs over multiple time steps. However, they often struggle with limited feature discrimination along the temporal dimension due to inherent spatiotemporal invariance. This limitation arises from the redundant activation of certain regions and shared supervision for multiple time steps, constraining the network’s ability to adapt and learn diverse features. To address this challenge, we propose a novel Temporal-Self-Erasing (TSE) supervision method that dynamically adapts the learning regions of interest for different time steps. The TSE method operates by identifying highly activated regions from predictions across multiple time steps and adaptively suppressing them during model training, thereby encouraging the network to focus on less activated yet potentially informative regions. This approach not only enhances the feature discrimination capability of SNNs but also facilitates more effective multi-time-step inference by exploiting more semantic information. Experimental results on benchmark datasets demonstrate that our TSE method significantly improves the classification accuracy and robustness of SNNs.
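
A minimal sketch of the erasing idea: aggregate activations over time steps, then mask the most strongly activated regions so later supervision attends elsewhere. The max-over-time aggregation and quantile threshold are our simplifications of the TSE procedure.

```python
import torch

def tse_mask(act_maps, quantile=0.8):
    """Given activation maps of shape (T, H, W) over T time steps, return a
    (H, W) mask that zeroes out the most highly activated regions so the
    network is pushed toward less activated but informative areas."""
    agg = act_maps.max(dim=0).values                  # peak activation over time
    thr = torch.quantile(agg.flatten(), quantile)     # erase the top (1-q) share
    return (agg < thr).float()                        # 1 keeps a region, 0 erases it

acts = torch.rand(4, 8, 8)                 # toy activations for 4 time steps
masked = acts * tse_mask(acts)             # supervision would use the masked maps
```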

AAAI Conference 2025 Conference Paper

Union Is Strength! Unite the Power of LLMs and MLLMs for Chart Question Answering

  • Jiapeng Liu
  • Liang Li
  • Shihao Rao
  • Xiyan Gao
  • Weixin Guan
  • Bing Li
  • Can Ma

Chart Question Answering (CQA) requires models to perform chart perception and reasoning. Recent studies driven by Large Language Models (LLMs) have dominated CQA. These include employing more cognitively capable LLMs for indirectly reasoning over transformed charts, i.e., tables, and directly perceiving charts utilizing Multimodal Large Language Models (MLLMs) with a wider perceptual range. Yet, they often encounter bottlenecks due to the limitation of the receptive field of LLMs and the fragility of the complex reasoning of some MLLMs. To unite the strengths of LLMs and MLLMs to complement each other's limitations, we propose Synergy, a framework that unites the power of both models for CQA. Synergy first unites the chart with a table as the augmented perceptual signal. Next, it unites LLMs and MLLMs, scheduling the former to decompose a question into subquestions and the latter to answer these by perceiving the chart. Lastly, it operates LLMs to summarize the subquestion-answer pairs to refine the final answer. Extensive experimental results on popular ChartQA and PlotQA benchmarks reveal that, with the power of union, Synergy outperforms strong competitors and achieves superior boosts over naive MLLMs by uniting them with a smaller LLM.

AAAI Conference 2025 Conference Paper

WiFi CSI Based Temporal Activity Detection via Dual Pyramid Network

  • Zhendong Liu
  • Le Zhang
  • Bing Li
  • Yingjie Zhou
  • Zhenghua Chen
  • Ce Zhu

We address the challenge of WiFi-based temporal activity detection and propose an efficient Dual Pyramid Network that integrates Temporal Signal Semantic Encoders and Local Sensitive Response Encoders. The Temporal Signal Semantic Encoder splits feature learning into high and low-frequency components, using a novel Signed Mask-Attention mechanism to emphasize important areas and downplay unimportant ones, with the features fused using ContraNorm. The Local Sensitive Response Encoder captures fluctuations without learning. These feature pyramids are then combined using a new cross-attention fusion mechanism. We also introduce a dataset with over 2,114 activity segments across 553 WiFi CSI samples, each lasting around 85 seconds. Extensive experiments show our method outperforms challenging baselines.

JMLR Journal 2024 Journal Article

Functional Directed Acyclic Graphs

  • Kuang-Yao Lee
  • Lexin Li
  • Bing Li

In this article, we introduce a new method to estimate a directed acyclic graph (DAG) from multivariate functional data. We build on the notion of faithfulness that relates a DAG with a set of conditional independences among the random functions. We develop two linear operators, the conditional covariance operator and the partial correlation operator, to characterize and evaluate the conditional independence. Based on these operators, we adapt and extend the PC-algorithm to estimate the functional directed graph, so that the computation time depends on the sparsity rather than the full size of the graph. We study the asymptotic properties of the two operators, derive their uniform convergence rates, and establish the uniform consistency of the estimated graph, all of which are obtained while allowing the graph size to diverge to infinity with the sample size. We demonstrate the efficacy of our method through both simulations and an application to a time-course proteomic dataset.

AAAI Conference 2024 Conference Paper

Invisible Backdoor Attack against 3D Point Cloud Classifier in Graph Spectral Domain

  • Linkun Fan
  • Fazhi He
  • Tongzhen Si
  • Wei Tang
  • Bing Li

3D point clouds have been widely used in security-critical domains, such as self-driving and 3D face recognition. Backdoor attacks are a serious threat that typically compromise Deep Neural Networks (DNNs) during the training stage. Although a few 3D backdoor attacks achieve guaranteed attack efficiency, the deformations they introduce can alert human inspection. To obtain an invisible backdoored point cloud, this paper proposes a novel 3D backdoor attack, named IBAPC, which generates the backdoor trigger in the graph spectral domain. Its effectiveness is grounded in an advantage of the graph spectral signal: it distributes the induced spatial-domain deformation across both the global structure and local points. In detail, a new backdoor-implanting function is proposed that transforms the point cloud into a graph spectral signal in which the backdoor trigger is embedded. Then, we design a backdoor training procedure that alternately updates the parameters of the backdoor-implanting function and the victim 3D DNN. Finally, the backdoored 3D DNN and its associated backdoor-implanting function are obtained upon completion of the backdoor training procedure. Experimental results suggest that IBAPC achieves state-of-the-art attack stealthiness in three respects: objective distance measurement, subjective human evaluation, and graph spectral signal residual. At the same time, it obtains competitive attack efficiency. The code is available at https://github.com/f-lk/IBAPC.
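
A minimal numpy sketch of operating on a point cloud in the graph spectral domain: build a kNN graph Laplacian, transform coordinates into its eigenbasis, nudge a band of spectral coefficients, and transform back. The band, magnitude, and function name are illustrative; IBAPC learns its trigger rather than fixing it this way.

```python
import numpy as np

def spectral_trigger(points, k=10, band=slice(5, 15), scale=0.01):
    """Perturb an (N, 3) point cloud in the graph spectral domain."""
    n = len(points)
    d2 = ((points[:, None] - points[None]) ** 2).sum(-1)   # pairwise squared distances
    idx = np.argsort(d2, axis=1)[:, 1:k + 1]               # k nearest neighbors (skip self)
    w = np.zeros((n, n))
    w[np.repeat(np.arange(n), k), idx.ravel()] = 1.0
    w = np.maximum(w, w.T)                                  # symmetrize adjacency
    lap = np.diag(w.sum(1)) - w                             # combinatorial Laplacian
    _, basis = np.linalg.eigh(lap)                          # graph Fourier basis
    coeffs = basis.T @ points                               # spectral signal (N x 3)
    coeffs[band] += scale                                   # plant the perturbation
    return basis @ coeffs                                   # back to the spatial domain

poisoned = spectral_trigger(np.random.rand(128, 3))
```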

JMLR Journal 2024 Journal Article

On Sufficient Graphical Models

  • Bing Li
  • Kyongwon Kim

We introduce a sufficient graphical model by applying the recently developed nonlinear sufficient dimension reduction techniques to the evaluation of conditional independence. The graphical model is nonparametric in nature, as it does not make distributional assumptions such as the Gaussian or copula Gaussian assumptions. However, unlike a fully nonparametric graphical model, which relies on the high-dimensional kernel to characterize conditional independence, our graphical model is based on conditional independence given a set of sufficient predictors with a substantially reduced dimension. In this way we avoid the curse of dimensionality that comes with a high-dimensional kernel. We develop the population-level properties, convergence rate, and variable selection consistency of our estimate. By simulation comparisons and an analysis of the DREAM 4 Challenge data set, we demonstrate that our method outperforms the existing methods when the Gaussian or copula Gaussian assumptions are violated, and its performance remains excellent in the high-dimensional setting.

AAAI Conference 2024 Conference Paper

RL-SeqISP: Reinforcement Learning-Based Sequential Optimization for Image Signal Processing

  • Xinyu Sun
  • Zhikun Zhao
  • Lili Wei
  • Congyan Lang
  • Mingxuan Cai
  • Longfei Han
  • Juan Wang
  • Bing Li

Hardware image signal processing (ISP), aiming at converting RAW inputs to RGB images, consists of a series of processing blocks, each with multiple parameters. Traditionally, ISP parameters are manually tuned in isolation by imaging experts according to application-specific quality and performance metrics, which is time-consuming and biased towards human perception due to complex interaction with the output image. Since the relationship between any single parameter’s variation and the output performance metric is a complex, non-linear function, optimizing such a large number of ISP parameters is challenging. To address this challenge, we propose a novel Sequential ISP parameter optimization model, called the RL-SeqISP model, which utilizes deep reinforcement learning to jointly optimize all ISP parameters for a variety of imaging applications. Concretely, inspired by the sequential tuning process of human experts, the proposed model can progressively enhance image quality by seamlessly integrating information from both the image feature space and the parameter space. Furthermore, a dynamic parameter optimization module is introduced to avoid ISP parameters getting stuck in local optima, which more effectively guarantees the optimal parameters resulting from the sequential learning strategy. These merits of the RL-SeqISP model as well as its high efficiency are substantiated by comprehensive experiments on a wide range of downstream tasks, including two visual analysis tasks (instance segmentation and object detection), and image quality assessment (IQA), as compared with representative methods both quantitatively and qualitatively. In particular, even using only 10% of the training data, our model outperforms other SOTA methods by an average of 7% mAP on two visual analysis tasks.

NeurIPS Conference 2024 Conference Paper

SAM-Guided Masked Token Prediction for 3D Scene Understanding

  • Zhimin Chen
  • Liang Yang
  • Yingwei Li
  • Longlong Jing
  • Bing Li

Foundation models have significantly enhanced 2D task performance, and recent works like Bridge3D have successfully applied these models to improve 3D scene understanding through knowledge distillation, marking considerable advancements. Nonetheless, challenges such as the misalignment between 2D and 3D representations and the persistent long-tail distribution in 3D datasets still restrict the effectiveness of knowledge distillation from 2D to 3D using foundation models. To tackle these issues, we introduce a novel SAM-guided tokenization method that seamlessly aligns 3D transformer structures with region-level knowledge distillation, replacing the traditional KNN-based tokenization techniques. Additionally, we implement a group-balanced re-weighting strategy to effectively address the long-tail problem in knowledge distillation. Furthermore, inspired by the recent success of masked feature prediction, our framework incorporates a two-stage masked token prediction process in which the student model predicts both the global embeddings and token-wise local embeddings derived from the teacher models trained in the first stage. Our methodology has been validated across multiple datasets, including SUN RGB-D, ScanNet, and S3DIS, for tasks like 3D object detection and semantic segmentation. The results demonstrate significant improvements over current state-of-the-art self-supervised methods, establishing new benchmarks in this field.

AAAI Conference 2024 Conference Paper

Set Prediction Guided by Semantic Concepts for Diverse Video Captioning

  • Yifan Lu
  • Ziqi Zhang
  • Chunfeng Yuan
  • Peng Li
  • Yan Wang
  • Bing Li
  • Weiming Hu

Diverse video captioning aims to generate a set of sentences to describe the given video in various aspects. Mainstream methods are trained with independent pairs of a video and a caption from its ground-truth set without exploiting the intra-set relationship, resulting in low diversity of generated captions. Different from them, we formulate diverse captioning into a semantic-concept-guided set prediction (SCG-SP) problem by fitting the predicted caption set to the ground-truth set, where the set-level relationship is fully captured. Specifically, our set prediction consists of two synergistic tasks, i.e., caption generation and an auxiliary task of concept combination prediction providing extra semantic supervision. Each caption in the set is attached to a concept combination indicating the primary semantic content of the caption and facilitating element alignment in set prediction. Furthermore, we apply a diversity regularization term on concepts to encourage the model to generate semantically diverse captions with various concept combinations. These two tasks share multiple semantics-specific encodings as input, which are obtained by iterative interaction between visual features and conceptual queries. The correspondence between the generated captions and specific concept combinations further guarantees the interpretability of our model. Extensive experiments on benchmark datasets show that the proposed SCG-SP achieves state-of-the-art (SOTA) performance under both relevance and diversity metrics.

JMLR Journal 2024 Journal Article

Spectral Regularized Kernel Goodness-of-Fit Tests

  • Omar Hagrass
  • Bharath K. Sriperumbudur
  • Bing Li

Maximum mean discrepancy (MMD) has enjoyed a lot of success in many machine learning and statistical applications, including non-parametric hypothesis testing, because of its ability to handle non-Euclidean data. Recently, it has been demonstrated in Balasubramanian et al. (2021) that the goodness-of-fit test based on MMD is not minimax optimal while a Tikhonov regularized version of it is, for an appropriate choice of the regularization parameter. However, the results in Balasubramanian et al. (2021) are obtained under the restrictive assumptions of the mean element being zero, and the uniform boundedness condition on the eigenfunctions of the integral operator. Moreover, the test proposed in Balasubramanian et al. (2021) is not practical as it is not computable for many kernels. In this paper, we address these shortcomings and extend the results to general spectral regularizers that include Tikhonov regularization.
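
For reference, the squared MMD between distributions $P$ and $Q$ under a kernel $k$, the quantity whose regularized test statistics are at issue here, is the standard

```latex
\mathrm{MMD}^2(P, Q)
  = \mathbb{E}\,k(X, X') - 2\,\mathbb{E}\,k(X, Y) + \mathbb{E}\,k(Y, Y'),
\qquad X, X' \overset{iid}{\sim} P, \quad Y, Y' \overset{iid}{\sim} Q.
```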

NeurIPS Conference 2024 Conference Paper

Vivid-ZOO: Multi-View Video Generation with Diffusion Model

  • Bing Li
  • Cheng Zheng
  • Wenxuan Zhu
  • Jinjie Mai
  • Biao Zhang
  • Peter Wonka
  • Bernard Ghanem

While diffusion models have shown impressive performance in 2D image/video generation, diffusion-based Text-to-Multi-view-Video (T2MVid) generation remains underexplored. The new challenges posed by T2MVid generation lie in the lack of massive captioned multi-view videos and the complexity of modeling such multi-dimensional distribution. To this end, we propose a novel diffusion-based pipeline that generates high-quality multi-view videos centered around a dynamic 3D object from text. Specifically, we factor the T2MVid problem into viewpoint-space and time components. Such factorization allows us to combine and reuse layers of advanced pre-trained multi-view image and 2D video diffusion models to ensure multi-view consistency as well as temporal coherence for the generated multi-view videos, largely reducing the training cost. We further introduce alignment modules to align the latent spaces of layers from the pre-trained multi-view and the 2D video diffusion models, addressing the reused layers' incompatibility that arises from the domain gap between 2D and multi-view data. In support of this and future research, we further contribute a captioned multi-view video dataset. Experimental results demonstrate that our method generates high-quality multi-view videos, exhibiting vivid motions, temporal coherence, and multi-view consistency, given a variety of text prompts.

NeurIPS Conference 2024 Conference Paper

VQ-Map: Bird's-Eye-View Map Layout Estimation in Tokenized Discrete Space via Vector Quantization

  • Yiwei Zhang
  • Jin Gao
  • Fudong Ge
  • Guan Luo
  • Bing Li
  • Zhaoxiang Zhang
  • Haibin Ling
  • Weiming Hu

Bird's-eye-view (BEV) map layout estimation requires an accurate and full understanding of the semantics for the environmental elements around the ego car to make the results coherent and realistic. Due to the challenges posed by occlusion, unfavourable imaging conditions and low resolution, generating the BEV semantic maps corresponding to corrupted or invalid areas in the perspective view (PV) has become appealing very recently. The question is how to align the PV features with the generative models to facilitate the map estimation. In this paper, we propose to utilize a generative model similar to the Vector Quantized-Variational AutoEncoder (VQ-VAE) to acquire prior knowledge for the high-level BEV semantics in the tokenized discrete space. Thanks to the obtained BEV tokens accompanied with a codebook embedding encapsulating the semantics for different BEV elements in the groundtruth maps, we are able to directly align the sparse backbone image features with the obtained BEV tokens from the discrete representation learning based on a specialized token decoder module, and finally generate high-quality BEV maps with the BEV codebook embedding serving as a bridge between PV and BEV. We evaluate the BEV map layout estimation performance of our model, termed VQ-Map, on both the nuScenes and Argoverse benchmarks, achieving 62.2/47.6 mean IoU for surround-view/monocular evaluation on nuScenes, as well as 73.4 IoU for monocular evaluation on Argoverse, which all set a new record for this map layout estimation task. The code and models are available at https://github.com/Z1zyw/VQ-Map.
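
The tokenization step at the heart of VQ-VAE-style models is a nearest-neighbor codebook lookup; a generic sketch follows (function name and shapes are ours, and this is the standard VQ step rather than VQ-Map's full token decoder).

```python
import numpy as np

def vector_quantize(features, codebook):
    """Map each feature vector (n x d) to the index of its nearest codebook
    embedding (K x d), returning discrete tokens and the quantized vectors."""
    d2 = ((features[:, None] - codebook[None]) ** 2).sum(-1)   # (n, K) distances
    tokens = d2.argmin(1)                                      # discrete token ids
    return tokens, codebook[tokens]                            # ids + quantized vectors

feats = np.random.randn(6, 32)
book = np.random.randn(512, 32)                                # 512-entry codebook
tokens, quantized = vector_quantize(feats, book)
```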

NeurIPS Conference 2023 Conference Paper

Bridging the Domain Gap: Self-Supervised 3D Scene Understanding with Foundation Models

  • Zhimin Chen
  • Longlong Jing
  • Yingwei Li
  • Bing Li

Foundation models have achieved remarkable results in 2D and language tasks like image segmentation, object detection, and visual-language understanding. However, their potential to enrich 3D scene representation learning is largely untapped due to the existence of the domain gap. In this work, we propose an innovative methodology called Bridge3D to address this gap by pre-training 3D models using features, semantic masks, and captions sourced from foundation models. Specifically, our method employs semantic masks from foundation models to guide the masking and reconstruction process for the masked autoencoder, enabling more focused attention on foreground representations. Moreover, we bridge the 3D-text gap at the scene level using image captioning foundation models, thereby facilitating scene-level knowledge distillation. We further extend this bridging effort by introducing an innovative object-level knowledge distillation method that harnesses highly accurate object-level masks and semantic text data from foundation models. Our methodology significantly surpasses the performance of existing state-of-the-art methods in 3D object detection and semantic segmentation tasks. For instance, on the ScanNet dataset, Bridge3D improves the baseline by a notable margin of 6.3%. Code will be available at: https://github.com/Zhimin-C/Bridge3D

AAAI Conference 2023 Conference Paper

Combating Mode Collapse via Offline Manifold Entropy Estimation

  • Haozhe Liu
  • Bing Li
  • Haoqian Wu
  • Hanbang Liang
  • Yawen Huang
  • Yuexiang Li
  • Bernard Ghanem
  • Yefeng Zheng

Generative Adversarial Networks (GANs) have shown compelling results in various tasks and applications in recent years. However, mode collapse remains a critical problem in GANs. In this paper, we propose a novel training pipeline to address the mode collapse issue of GANs. Different from existing methods, we propose to generalize the discriminator as feature embedding and maximize the entropy of distributions in the embedding space learned by the discriminator. Specifically, two regularization terms, i.e., Deep Local Linear Embedding (DLLE) and Deep Isometric feature Mapping (DIsoMap), are introduced to encourage the discriminator to learn the structural information embedded in the data, such that the embedding space learned by the discriminator can be well-formed. Based on the well-learned embedding space supported by the discriminator, a non-parametric entropy estimator is designed to efficiently maximize the entropy of embedding vectors, serving as an approximation of maximizing the entropy of the generated distribution. By improving the discriminator and maximizing the distance of the most similar samples in the embedding space, our pipeline effectively reduces the mode collapse without sacrificing the quality of generated samples. Extensive experimental results show the effectiveness of our method, which outperforms the GAN baseline MaF-GAN on CelebA (9.13 vs. 12.43 in FID) and surpasses the recent state-of-the-art energy-based model on the ANIMEFACE dataset (2.80 vs. 2.26 in Inception score).
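
The abstract does not name its non-parametric entropy estimator; a classical kNN (Kozachenko-Leonenko) estimator illustrates the general idea of estimating entropy from pairwise distances among embedding vectors, and is our stand-in, not necessarily the paper's estimator.

```python
import numpy as np
from scipy.special import digamma, gammaln

def knn_entropy(z, k=3):
    """Kozachenko-Leonenko kNN entropy estimate (in nats) for embedding
    vectors z of shape (n, d)."""
    n, d = z.shape
    d2 = ((z[:, None] - z[None]) ** 2).sum(-1)          # pairwise squared distances
    np.fill_diagonal(d2, np.inf)                        # exclude self-distances
    eps = np.sqrt(np.sort(d2, axis=1)[:, k - 1])        # distance to k-th neighbor
    log_ball = (d / 2) * np.log(np.pi) - gammaln(d / 2 + 1)  # log volume of unit d-ball
    return digamma(n) - digamma(k) + log_ball + d * np.mean(np.log(eps + 1e-12))

h = knn_entropy(np.random.randn(200, 16))  # maximizing h spreads the embeddings out
```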

NeurIPS Conference 2023 Conference Paper

Compressed Video Prompt Tuning

  • Bing Li
  • Jiaxin Chen
  • Xiuguo Bao
  • Di Huang

Compressed videos offer a compelling alternative to raw videos, showing the possibility to significantly reduce the on-line computational and storage cost. However, current approaches to compressed video processing generally follow the resource-consuming pre-training and fine-tuning paradigm, which does not fully take advantage of such properties, making them not favorable enough for widespread applications. Inspired by recent successes of prompt tuning techniques in computer vision, this paper presents the first attempt to build a prompt based representation learning framework, which enables effective and efficient adaptation of pre-trained raw video models to compressed video understanding tasks. To this end, we propose a novel prompt tuning approach, namely Compressed Video Prompt Tuning (CVPT), emphatically dealing with the challenging issue caused by the inconsistency between pre-training and downstream data modalities. Specifically, CVPT replaces the learnable prompts with compressed modalities (e.g., Motion Vectors and Residuals) by re-parameterizing them into conditional prompts followed by layer-wise refinement. The conditional prompts exhibit improved adaptability and generalizability to instances compared to conventional individual learnable ones, and the Residual prompts enhance the noisy motion cues in the Motion Vector prompts for further fusion with the visual cues from I-frames. Additionally, we design Selective Cross-modal Complementary Prompt (SCCP) blocks. After inserting them into the backbone, SCCP blocks leverage semantic relations across diverse levels and modalities to improve cross-modal interactions between prompts and input flows. Extensive evaluations on HMDB-51, UCF-101 and Something-Something v2 demonstrate that CVPT remarkably outperforms the state-of-the-art counterparts, delivering a much better balance between accuracy and efficiency.

NeurIPS Conference 2023 Conference Paper

Dynamically Masked Discriminator for GANs

  • Wentian Zhang
  • Haozhe Liu
  • Bing Li
  • Jinheng Xie
  • Yawen Huang
  • Yuexiang Li
  • Yefeng Zheng
  • Bernard Ghanem

Training Generative Adversarial Networks (GANs) remains a challenging problem. The discriminator trains the generator by learning the distribution of real/generated data. However, the distribution of generated data changes throughout the training process, which is difficult for the discriminator to learn. In this paper, we propose a novel method for GANs from the viewpoint of online continual learning. We observe that the discriminator model, trained on historically generated data, often slows down its adaptation to the changes in newly arriving generated data, which accordingly decreases the quality of generated results. By treating the generated data in training as a stream, we propose to detect whether the discriminator slows down the learning of new knowledge in generated data. Therefore, we can explicitly enforce the discriminator to learn new knowledge fast. Particularly, we propose a new discriminator, which automatically detects its retardation and then dynamically masks its features, such that the discriminator can adaptively learn the temporally-varying distribution of generated data. Experimental results show our method outperforms the state-of-the-art approaches.

NeurIPS Conference 2023 Conference Paper

Exploiting Contextual Objects and Relations for 3D Visual Grounding

  • Li Yang
  • Chunfeng Yuan
  • Ziqi Zhang
  • Zhongang Qi
  • Yan Xu
  • Wei Liu
  • Ying Shan
  • Bing Li

3D visual grounding, the task of identifying visual objects in 3D scenes based on natural language inputs, plays a critical role in enabling machines to understand and engage with the real-world environment. However, this task is challenging due to the necessity to capture 3D contextual information to distinguish target objects from complex 3D scenes. The absence of annotations for contextual objects and relations further exacerbates the difficulties. In this paper, we propose a novel model, CORE-3DVG, to address these challenges by explicitly learning about contextual objects and relations. Our method accomplishes 3D visual grounding via three sequential modular networks, including a text-guided object detection network, a relation matching network, and a target identification network. During training, we introduce a pseudo-label self-generation strategy and a weakly-supervised method to facilitate the learning of contextual objects and relations, respectively. The proposed techniques allow the networks to focus more effectively on referred objects within 3D scenes by understanding their context better. We validate our model on the challenging Nr3D, Sr3D, and ScanRefer datasets and demonstrate state-of-the-art performance. Our code will be public at https://github.com/yangli18/CORE-3DVG.

NeurIPS Conference 2023 Conference Paper

ZoomTrack: Target-aware Non-uniform Resizing for Efficient Visual Tracking

  • Yutong Kou
  • Jin Gao
  • Bing Li
  • Gang Wang
  • Weiming Hu
  • Yizheng Wang
  • Liang Li

Recently, the transformer has enabled the speed-oriented trackers to approach state-of-the-art (SOTA) performance at high speed thanks to the smaller input size or the lighter feature extraction backbone, though they still substantially lag behind their corresponding performance-oriented versions. In this paper, we demonstrate that it is possible to narrow or even close this gap while achieving high tracking speed based on the smaller input size. To this end, we non-uniformly resize the cropped image to have a smaller input size while the resolution of the area where the target is more likely to appear is higher and vice versa. This enables us to solve the dilemma of attending to a larger visual field while retaining more raw information for the target despite a smaller input size. Our formulation for the non-uniform resizing can be efficiently solved through quadratic programming (QP) and naturally integrated into most of the crop-based local trackers. Comprehensive experiments on five challenging datasets based on two kinds of transformer trackers, i.e., OSTrack and TransT, demonstrate consistent improvements over them. In particular, applying our method to the speed-oriented version of OSTrack even outperforms its performance-oriented counterpart by 0.6% AUC on TNL2K, while running 50% faster and saving over 55% MACs. Codes and models are available at https://github.com/Kou-99/ZoomTrack.

JBHI Journal 2022 Journal Article

3DCANN: A Spatio-Temporal Convolution Attention Neural Network for EEG Emotion Recognition

  • Shuaiqi Liu
  • Xu Wang
  • Ling Zhao
  • Bing Li
  • Weiming Hu
  • Jie Yu
  • Yu-Dong Zhang

Since electroencephalogram (EEG) signals can truly reflect human emotional states, emotion recognition based on EEG has become a critical branch of artificial intelligence. Aiming at the disparity of EEG signals across emotional states, we propose a new deep learning model named the three-dimensional convolution attention neural network (3DCANN) for EEG emotion recognition. The 3DCANN model is composed of a spatio-temporal feature extraction module and an EEG channel attention weight learning module, which together capture the dynamic relations among multi-channel EEG signals as well as their internal spatial relations over continuous time periods. In this model, the spatio-temporal features are fused with the weights from dual attention learning, and the fused features are fed into a softmax classifier for emotion classification. We use the SJTU Emotion EEG Dataset (SEED) to assess the feasibility and effectiveness of the proposed algorithm. Experimental results show that 3DCANN outperforms state-of-the-art models in EEG emotion recognition.

IJCAI Conference 2022 Conference Paper

Learning Target-aware Representation for Visual Tracking via Informative Interactions

  • Mingzhe Guo
  • Zhipeng Zhang
  • Heng Fan
  • Liping Jing
  • Yilin Lyu
  • Bing Li
  • Weiming Hu

We introduce a novel backbone architecture that improves the target-perception ability of feature representations for tracking. We observe that de facto frameworks perform feature matching simply on the backbone outputs for target localization, so there is no direct feedback from the matching module to the backbone network, especially its shallow layers. Concretely, only the matching module can directly access the target information, while the representation learning of the candidate frame is blind to the reference target. As a result, accumulated target-irrelevant interference in shallow stages may degrade the feature quality of deeper layers. In this paper, we approach the problem by conducting multiple branch-wise interactions inside the Siamese-like backbone network (InBN). The core of InBN is a general interaction modeler (GIM) that injects target information into different stages of the backbone network, leading to better target perception in the candidate feature representation at negligible computational cost. The proposed GIM module and InBN mechanism are general and applicable to different backbone types, including CNNs and Transformers, as evidenced on multiple benchmarks. In particular, the CNN version improves the baseline with 3.2/6.9 absolute SUC gains on LaSOT/TNL2K. The Transformer version obtains SUC scores of 65.7/52.0 on LaSOT/TNL2K, on par with recent SOTAs.
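The abstract says GIM injects target information at each backbone stage; a minimal sketch of one plausible realization (ours, not the paper's module) is cross-attention from candidate tokens to template tokens, applied after a stage with a residual connection.

```python
# Hypothetical per-stage interaction module; names and sizes are our own.
import torch
import torch.nn as nn

class StageInteraction(nn.Module):
    def __init__(self, dim=256, heads=4):
        super().__init__()
        self.attn = nn.MultiheadAttention(dim, heads, batch_first=True)
        self.norm = nn.LayerNorm(dim)

    def forward(self, cand, templ):
        # cand: (B, N_cand, dim) candidate tokens; templ: (B, N_templ, dim)
        out, _ = self.attn(query=cand, key=templ, value=templ)
        return self.norm(cand + out)   # residual injection of target info

# Used once per backbone stage, e.g.:
#   feats = stage_k(feats)
#   feats = interaction_k(feats, template_feats)
```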

IJCAI Conference 2022 Conference Paper

Long-Short Term Cross-Transformer in Compressed Domain for Few-Shot Video Classification

  • Wenyang Luo
  • Yufan Liu
  • Bing Li
  • Weiming Hu
  • Yanan Miao
  • Yangxi Li

Compared with image few-shot learning, most existing few-shot video classification methods perform worse at feature matching because they fail to sufficiently exploit temporal information and relations. Specifically, frames are usually evenly sampled, which may miss important frames. Moreover, heuristic models simply encode the equally treated frames in sequence, lacking both long-term and short-term temporal modeling and interaction. To alleviate these limitations, we take advantage of compressed-domain knowledge and propose a long-short term Cross-Transformer (LSTC) for few-shot video classification. For short terms, the motion vector (MV) contains temporal cues and reflects the importance of each frame. For long terms, a video can be natively divided into a sequence of GOPs (Groups of Pictures). Using this compressed-domain knowledge helps obtain a more accurate spatial-temporal feature space. Accordingly, we design a long-short term selection module, a short-term module, and a long-term module to comprise the LSTC. Long-short term selection picks out informative compressed-domain data, and the long/short-term modules sufficiently exploit temporal information so that query and support can be well matched by cross-attention. Experimental results show the superiority of our method on various datasets.

AAAI Conference 2022 Conference Paper

One More Check: Making “Fake Background” Be Tracked Again

  • Chao Liang
  • Zhipeng Zhang
  • Xue Zhou
  • Bing Li
  • Weiming Hu

One-shot multi-object tracking, which integrates object detection and ID embedding extraction into a unified network, has achieved groundbreaking results in recent years. However, current one-shot trackers rely solely on single-frame detections to predict candidate bounding boxes, which may be unreliable under severe visual degradation, e.g., motion blur and occlusions. Once a target bounding box is mistakenly classified as background by the detector, the temporal consistency of its corresponding tracklet is no longer maintained. In this paper, we set out to restore bounding boxes misclassified as “fake background” by proposing a re-check network. The re-check network innovatively expands the role of ID embeddings from data association to motion forecasting by effectively propagating previous tracklets to the current frame with small overhead. Note that the propagation results are yielded by an independent and efficient embedding search, preventing the model from over-relying on detection results. Eventually, it helps to reload the “fake background” and repair the broken tracklets. Building on the strong baseline CSTrack, we construct a new one-shot tracker and achieve favorable MOTA gains of 70.7 → 76.4 and 70.6 → 76.3 on MOT16 and MOT17, respectively. It also reaches new state-of-the-art MOTA and IDF1 performance. Code is released at https://github.com/JudasDie/SOTS.

IJCAI Conference 2022 Conference Paper

Representation Learning for Compressed Video Action Recognition via Attentive Cross-modal Interaction with Motion Enhancement

  • Bing Li
  • Jiaxin Chen
  • Dongming Zhang
  • Xiuguo Bao
  • Di Huang

Compressed video action recognition has recently drawn growing attention, since it remarkably reduces the storage and computational cost via replacing raw videos by sparsely sampled RGB frames and compressed motion cues (e.g., motion vectors and residuals). However, this task severely suffers from the coarse and noisy dynamics and the insufficient fusion of the heterogeneous RGB and motion modalities. To address the two issues above, this paper proposes a novel framework, namely Attentive Cross-modal Interaction Network with Motion Enhancement (MEACI-Net). It follows the two-stream architecture, i.e., one for the RGB modality and the other for the motion modality. Particularly, the motion stream employs a multi-scale block embedded with a denoising module to enhance representation learning. The interaction between the two streams is then strengthened by introducing the Selective Motion Complement (SMC) and Cross-Modality Augment (CMA) modules, where SMC complements the RGB modality with spatio-temporally attentive local motion features and CMA further combines the two modalities with selective feature augmentation. Extensive experiments on the UCF-101, HMDB-51 and Kinetics-400 benchmarks demonstrate the effectiveness and efficiency of MEACI-Net.

AAAI Conference 2022 Conference Paper

SCTN: Sparse Convolution-Transformer Network for Scene Flow Estimation

  • Bing Li
  • Cheng Zheng
  • Silvio Giancola
  • Bernard Ghanem

We propose a novel scene flow estimation approach to capture and infer 3D motions from point clouds. Estimating 3D motions for point clouds is challenging, since a point cloud is unordered and its density is significantly non-uniform. Such unstructured data poses difficulties in matching corresponding points between point clouds, leading to inaccurate flow estimation. We propose a novel architecture named Sparse Convolution-Transformer Network (SCTN) that equips the sparse convolution with the transformer. Specifically, by leveraging sparse convolution, SCTN transforms the irregular point cloud into locally consistent flow features, enabling the estimation of continuous and consistent motions within an object or local object part. Unlike existing methods, we further propose to explicitly learn point relations using a point transformer module. We show that the learned relation-based contextual information is rich and helpful for matching corresponding points, benefiting scene flow estimation. In addition, a novel loss function is proposed to adaptively encourage flow consistency according to feature similarity. Extensive experiments demonstrate that our approach achieves a new state of the art in scene flow estimation, with errors of 0.038 and 0.037 (EPE3D) on FlyingThings3D and KITTI Scene Flow respectively, outperforming previous methods by large margins.

AAAI Conference 2021 Conference Paper

DPFPS: Dynamic and Progressive Filter Pruning for Compressing Convolutional Neural Networks from Scratch

  • Xiaofeng Ruan
  • Yufan Liu
  • Bing Li
  • Chunfeng Yuan
  • Weiming Hu

Filter pruning is a commonly used method for compressing Convolutional Neural Networks (ConvNets) due to its hardware friendliness and flexibility. However, existing methods mostly require a cumbersome procedure that brings many extra hyper-parameters and training epochs, because separate sparsity and pruning stages alone cannot reach satisfying performance. Besides, many works do not consider that the pruning ratio should differ across layers. To overcome these limitations, we propose a novel dynamic and progressive filter pruning (DPFPS) scheme that directly learns a structured sparse network from scratch. In particular, DPFPS imposes a new structured sparsity-inducing regularization specifically upon the expected pruning parameters in a dynamic-sparsity manner. The dynamic sparsity scheme determines the sparsity allocation ratios of different layers, and a Taylor-series-based channel sensitivity criterion is presented to identify the expected pruning parameters. Moreover, we increase the structured sparsity-inducing penalty progressively, which helps the model become sparse gradually instead of being forced to be sparse from the beginning. Our method solves the pruning-ratio-based optimization problem with an iterative soft-thresholding algorithm (ISTA) under dynamic sparsity. At the end of training, we only need to remove the redundant parameters, without further stages such as fine-tuning. Extensive experimental results show that the proposed method is competitive with 11 state-of-the-art methods on both small-scale and large-scale datasets (i.e., CIFAR and ImageNet). Specifically, on ImageNet we achieve a 44.97% FLOPs pruning ratio by compressing ResNet-101, even with a 0.12% increase in Top-5 accuracy. Our pruned models and code are released at https://github.com/taoxvzi/DPFPS.
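The ISTA-with-progressive-penalty core of this scheme is easy to illustrate; the sketch below (ours, much simpler than the released DPFPS code, which also allocates ratios dynamically via a Taylor criterion) applies a soft-threshold to BatchNorm scale factors after each optimizer step, with a penalty that grows over training.

```python
# Sketch of ISTA-style structured sparsity with a progressive penalty;
# lam_max and lr are hypothetical hyper-parameters.
import torch.nn as nn

def progressive_soft_threshold(model, step, total_steps, lam_max=1e-4, lr=0.1):
    lam = lam_max * step / total_steps            # penalty ramps up gradually
    thr = lr * lam                                # ISTA proximal step size
    for m in model.modules():
        if isinstance(m, nn.BatchNorm2d):         # channel scale factors
            g = m.weight.data
            g.copy_(g.sign() * (g.abs() - thr).clamp(min=0.0))  # soft-threshold
```

Channels whose scale factors are driven to exactly zero can simply be removed at the end of training, which is why no fine-tuning stage is needed.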

AAAI Conference 2021 Conference Paper

Improving the Efficiency and Effectiveness for BERT-based Entity Resolution

  • Bing Li
  • Yukai Miao
  • Yaoshu Wang
  • Yifang Sun
  • Wei Wang

BERT has set a new state-of-the-art performance on the entity resolution (ER) task, largely owing to fine-tuned pretrained language models and deep pair-wise interaction. Albeit remarkably effective, it comes with a steep increase in computational cost, as the deep interaction requires exhaustively computing every tuple pair to search for co-references. For the ER task, this is often prohibitively expensive due to the large cardinality to be matched. To tackle this, we introduce a siamese network structure that independently encodes tuples using BERT but delays the pair-wise interaction via an enhanced alignment network. This siamese structure enables a dedicated blocking module to quickly filter out obviously dissimilar tuple pairs, drastically reducing the cardinality of fine-grained matching. Further, blocking and entity matching are integrated into a multi-task learning framework that benefits both tasks. Extensive experiments on multiple datasets demonstrate that our model significantly outperforms state-of-the-art models (including BERT) in both efficiency and effectiveness.
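The "encode once, block cheaply" idea generalizes beyond this paper; here is a minimal sketch under our own assumptions (off-the-shelf bert-base-uncased, [CLS] pooling, a fixed cosine threshold), not the paper's trained blocking module.

```python
# Hedged sketch: embed tuples independently, filter pairs by cosine
# similarity, and pass only survivors to a fine-grained matcher.
import torch
from transformers import AutoTokenizer, AutoModel

tok = AutoTokenizer.from_pretrained("bert-base-uncased")
enc = AutoModel.from_pretrained("bert-base-uncased")
enc.eval()

def embed(tuples):                       # tuples: list of serialized records
    batch = tok(tuples, padding=True, truncation=True, return_tensors="pt")
    with torch.no_grad():
        out = enc(**batch).last_hidden_state[:, 0]   # [CLS] embeddings
    return torch.nn.functional.normalize(out, dim=-1)

def block(left, right, threshold=0.8):
    sim = embed(left) @ embed(right).T               # all-pairs cosine
    return (sim > threshold).nonzero().tolist()      # candidate pairs only
```

Because each tuple is encoded once, blocking costs one forward pass per record plus a matrix product, instead of one BERT pass per pair.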

AAAI Conference 2021 Conference Paper

Two-Stream Convolution Augmented Transformer for Human Activity Recognition

  • Bing Li
  • Wei Cui
  • Wei Wang
  • Le Zhang
  • Zhenghua Chen
  • Min Wu

Recognition of human activities is an important task due to far-reaching applications such as healthcare systems, context-aware applications, and security monitoring. Recently, WiFi-based human activity recognition (HAR) has become popular thanks to its non-invasiveness. Existing WiFi-based HAR methods regard WiFi signals as a temporal sequence of channel state information (CSI) and employ deep sequential models (e.g., RNN, LSTM) to automatically capture channel-over-time features. Although remarkably effective, they suffer from two major drawbacks. First, the granularity of a single temporal point is too elementary to represent meaningful CSI patterns. Second, time-over-channel features are also important and offer a natural form of data augmentation. To address these drawbacks, we propose a novel Two-stream Convolution Augmented Human Activity Transformer (THAT) model. Our model uses a two-stream structure to capture both time-over-channel and channel-over-time features, and a multi-scale convolution augmented transformer to capture range-based patterns. Extensive experiments on four real-world datasets demonstrate that our model outperforms state-of-the-art models in terms of both effectiveness and efficiency.
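To make the two-stream framing concrete, here is a deliberately reduced sketch (ours; the actual THAT model uses convolution-augmented transformers, not the plain linear encoders below): one stream reads CSI as (time, channel), the other reads its transpose, and the two summaries are fused.

```python
# Toy two-stream CSI classifier; all layer sizes are hypothetical.
import torch
import torch.nn as nn

class TwoStreamHAR(nn.Module):
    def __init__(self, n_time=500, n_chan=90, n_cls=6, dim=128):
        super().__init__()
        self.time_stream = nn.Sequential(nn.Linear(n_chan, dim), nn.ReLU())
        self.chan_stream = nn.Sequential(nn.Linear(n_time, dim), nn.ReLU())
        self.cls = nn.Linear(2 * dim, n_cls)

    def forward(self, csi):                        # csi: (B, n_time, n_chan)
        t = self.time_stream(csi).mean(dim=1)      # channel-over-time summary
        c = self.chan_stream(csi.transpose(1, 2)).mean(dim=1)  # time-over-channel
        return self.cls(torch.cat([t, c], dim=-1))
```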

AAAI Conference 2020 Conference Paper

Fine-Grained Named Entity Typing over Distantly Supervised Data Based on Refined Representations

  • Muhammad Asif Ali
  • Yifang Sun
  • Bing Li
  • Wei Wang

Fine-Grained Named Entity Typing (FG-NET) is a key component in Natural Language Processing (NLP). It aims to classify an entity mention into a wide range of entity types. Owing to the large number of entity types, distant supervision is used to collect training data for this task, which noisily assigns type labels to entity mentions irrespective of context. To alleviate the noisy labels, existing FG-NET approaches analyze entity mentions entirely independently of each other and assign type labels solely based on the mention's sentence-specific context. This is inadequate for highly overlapping and/or noisy type labels, as it hinders information passing across sentence boundaries. We therefore propose an edge-weighted attentive graph convolution network that refines the noisy mention representations by attending over corpus-level contextual clues prior to the final classification. Experimental evaluation shows that the proposed model outperforms existing research by relative scores of up to 10.2% for macro-F1 and 8.3% for micro-F1.

AAAI Conference 2020 Conference Paper

GraphER: Token-Centric Entity Resolution with Graph Convolutional Neural Networks

  • Bing Li
  • Wei Wang
  • Yifang Sun
  • Linhan Zhang
  • Muhammad Asif Ali
  • Yi Wang

Entity resolution (ER) aims to identify entity records that refer to the same real-world entity, a critical problem in data cleaning and integration. Most existing models are attribute-centric, matching entity pairs by comparing similarities of pre-aligned attributes; they require the schemas of records to be identical and are too coarse-grained to capture subtle key information within a single attribute. In this paper, we propose a novel graph-based ER model, GraphER. Our model is token-centric: the final matching results are generated by directly aggregating token-level comparison features, in which both semantic and structural information has been softly embedded into token embeddings by training an Entity Record Graph Convolutional Network (ER-GCN). To the best of our knowledge, this is the first effort to perform token-centric entity resolution with GCNs. Extensive experiments on two real-world datasets demonstrate that our model stably outperforms state-of-the-art models.

AAAI Conference 2020 Conference Paper

HAMNER: Headword Amplified Multi-Span Distantly Supervised Method for Domain Specific Named Entity Recognition

  • Shifeng Liu
  • Yifang Sun
  • Bing Li
  • Wei Wang
  • Xiang Zhao

To tackle Named Entity Recognition (NER) tasks, supervised methods need sufficient cleanly annotated data, which is labor- and time-consuming to obtain. Distantly supervised methods instead acquire automatically annotated data using dictionaries to alleviate this requirement. Unfortunately, dictionaries hinder the effectiveness of distantly supervised NER due to their limited coverage, especially in specific domains. In this paper, we address the limitations of dictionary usage and mention boundary detection. We generalize distant supervision by extending the dictionary with headword-based non-exact matching, and apply a function to better weight the matched entity mentions. We propose a span-level model that classifies all possible spans and then infers the selected spans with a proposed dynamic programming algorithm. Experiments on three benchmark datasets demonstrate that our method outperforms previous state-of-the-art distantly supervised methods.
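Headword-based non-exact matching is simple to demonstrate; the toy function below is our own illustration (the paper's weighting function is more elaborate), assuming crudely that the last token of a mention is its headword.

```python
# Hypothetical headword-based dictionary matching with an overlap weight.
def headword(tokens):
    return tokens[-1].lower()            # crude assumption: last token is the head

def match_weight(mention, entry):
    m, e = mention.split(), entry.split()
    if headword(m) != headword(e):
        return 0.0                       # headwords must agree
    overlap = len(set(t.lower() for t in m) & set(t.lower() for t in e))
    return overlap / max(len(m), len(e)) # 1.0 for exact match, less otherwise

print(match_weight("chronic kidney disease", "kidney disease"))  # ~0.67
```

A weight like this lets a dictionary entry such as "kidney disease" supervise the longer mention "chronic kidney disease", extending coverage beyond exact matches.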

JMLR Journal 2020 Journal Article

Learning Causal Networks via Additive Faithfulness

  • Kuang-Yao Lee
  • Tianqi Liu
  • Bing Li
  • Hongyu Zhao

In this paper we introduce a statistical model, called additively faithful directed acyclic graph (AFDAG), for causal learning from observational data. Our approach is based on additive conditional independence (ACI), a recently proposed three-way statistical relation that shares many similarities with conditional independence but without resorting to multi-dimensional kernels. This distinct feature strikes a balance between a parametric model and a fully nonparametric model, which makes the proposed model attractive for handling large networks. We develop an estimator for AFDAG based on a linear operator that characterizes ACI, and establish the consistency and convergence rates of this estimator, as well as the uniform consistency of the estimated DAG. Moreover, we introduce a modified PC-algorithm to implement the estimating procedure efficiently, so that its complexity is determined by the level of sparseness rather than the dimension of the network. Through simulation studies we show that our method outperforms existing methods when commonly assumed conditions such as Gaussian or Gaussian copula distributions do not hold. Finally, the usefulness of the AFDAG formulation is demonstrated through an application to a proteomics data set.
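The modified PC-algorithm keeps the standard skeleton search and swaps in an ACI-based conditional-independence test. Below is a generic, simplified skeleton-search sketch with a pluggable test (our own simplification, not the paper's estimator; a real PC implementation also orients edges and handles conditioning-set sizes more carefully).

```python
# Generic PC skeleton search; ci_test(x, y, S) -> bool is a plug-in point
# where an ACI-based test would go.
from itertools import combinations

def pc_skeleton(nodes, ci_test, max_cond=3):
    adj = {v: set(nodes) - {v} for v in nodes}   # start from a complete graph
    for k in range(max_cond + 1):                # grow conditioning-set size
        for x in list(nodes):
            for y in list(adj[x]):
                others = adj[x] - {y}
                for S in combinations(sorted(others), k):
                    if ci_test(x, y, S):         # x independent of y given S
                        adj[x].discard(y)        # -> remove the edge
                        adj[y].discard(x)
                        break
    return adj
```

Because edges are pruned as soon as any separating set is found, the search cost is governed by graph sparsity rather than the number of nodes, matching the complexity claim in the abstract.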

AAAI Conference 2020 Conference Paper

Recursively Binary Modification Model for Nested Named Entity Recognition

  • Bing Li
  • Shifeng Liu
  • Yifang Sun
  • Wei Wang
  • Xiang Zhao

Recently, there has been increasing interest in identifying named entities with nested structures. Existing models make independent typing decisions on the entire entity span while ignoring strong modification relations between sub-entity types. In this paper, we present a novel Recursively Binary Modification model for nested named entity recognition. Our model utilizes the modification relations among sub-entity types to infer the head component on top of a Bayesian framework, and uses the entity head as strong evidence to determine the type of the entity span. The process is recursive, allowing lower-level entities to help better model those at the outer level. To the best of our knowledge, this is the first effort to use modification relations in the nested NER task. Extensive experiments on four benchmark datasets demonstrate that our model outperforms state-of-the-art models on nested NER tasks and delivers competitive results with state-of-the-art models on flat NER, without relying on any extra annotations or NLP tools.

IJCAI Conference 2018 Conference Paper

An Adaptive Hierarchical Compositional Model for Phrase Embedding

  • Bing Li
  • Xiaochun Yang
  • Bin Wang
  • Wei Wang
  • Wei Cui
  • Xianchao Zhang

Phrase embedding aims at representing phrases in a vector space and is important for the performance of many NLP tasks. Existing models regard a phrase as either fully compositional or non-compositional, ignoring the hybrid compositionality that widely exists, especially in long phrases. This drawback prevents them from gaining deeper insight into the semantic structure of long phrases and, as a consequence, weakens the accuracy of the embeddings. In this paper, we present a novel method for jointly learning compositionality and phrase embedding by adaptively weighting different compositions using an implicit hierarchical structure. Our model can adaptively adjust among different compositions without entailing too much model complexity or time cost. To the best of our knowledge, this is the first effort to consider hybrid compositionality in phrase embedding. The experimental evaluation demonstrates that our model outperforms state-of-the-art methods on both similarity tasks and analogy tasks.
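The core contrast between compositional and non-compositional treatment can be shown in a few lines; the toy below (our own, with random stand-in vectors and a hand-set gate rather than the paper's learned hierarchical weighting) gates between a composed vector and a dedicated phrase vector.

```python
# Toy hybrid-compositionality gate; g plays the role of a learned
# compositionality score.
import numpy as np

rng = np.random.default_rng(0)
word_vecs = {w: rng.standard_normal(50) for w in ["hot", "dog"]}
phrase_vec = rng.standard_normal(50)     # dedicated vector for "hot dog"

def embed_phrase(words, g):
    composed = np.mean([word_vecs[w] for w in words], axis=0)
    return g * composed + (1 - g) * phrase_vec   # g~1: compositional; g~0: not

v = embed_phrase(["hot", "dog"], g=0.2)  # idiomatic -> lean non-compositional
```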

AAAI Conference 2018 Conference Paper

Hierarchical Nonlinear Orthogonal Adaptive-Subspace Self-Organizing Map Based Feature Extraction for Human Action Recognition

  • Yang Du
  • Chunfeng Yuan
  • Bing Li
  • Weiming Hu
  • Hao Yang
  • Zhikang Fu
  • Lili Zhao

Feature extraction is a critical step in action recognition. Hand-crafted features are often limited by their fixed forms, while deep learning features are more effective but need large-scale labeled data for training. In this paper, we propose a new hierarchical Nonlinear Orthogonal Adaptive-Subspace Self-Organizing Map (NOASSOM) to adaptively learn effective features from data without supervision. NOASSOM extends the Adaptive-Subspace Self-Organizing Map (ASSOM), which only handles linear data and is trained with supervision on labeled data. First, by adding a nonlinear orthogonal map layer, NOASSOM can handle nonlinear input data, and it avoids defining a specific form of the nonlinear orthogonal map through a kernel trick. Second, we modify the loss function of ASSOM so that every input sample is used to train the model individually; in this way, NOASSOM effectively learns statistical patterns from data without supervision. Third, we propose a hierarchical NOASSOM to extract more representative features. Finally, we apply the hierarchical NOASSOM to efficiently describe the appearance and motion information around trajectories for action recognition. Experimental results on widely used datasets show that our method outperforms many state-of-the-art methods based on hand-crafted and deep learning features.

AAAI Conference 2017 Conference Paper

Efficiently Mining High Quality Phrases from Texts

  • Bing Li
  • Xiaochun Yang
  • Bin Wang
  • Wei Cui

Phrase mining is a key research problem for semantic analysis and text-based information retrieval. Existing approaches based on NLP, frequency, and statistics cannot extract high-quality phrases, and their processing is time-consuming, making them unsuitable for dynamic online applications. In this paper, we propose an efficient high-quality phrase mining approach (EQPM). To the best of our knowledge, this is the first effort to consider both intra-cohesion and inter-isolation in mining phrases, which guarantees appropriateness. We also propose a strategy that eliminates order sensitivity and ensures the completeness of phrases, and we design efficient algorithms to make the proposed model and strategy feasible. Empirical evaluations on four real data sets demonstrate that our approach achieves a considerable quality improvement while processing 2.3× to 29× faster than state-of-the-art works.
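The two signals named in the abstract, intra-cohesion and inter-isolation, can be sketched with simple corpus statistics; the scoring below is our hypothetical illustration (counts, extension counts, and the combination rule are our assumptions, not EQPM's actual model).

```python
# Toy phrase score: cohesive if its words co-occur far more than chance
# (PMI-style), isolated if it rarely extends into longer phrases.
import math

def phrase_score(counts, total, phrase, left_ext, right_ext):
    # counts: occurrence counts for the phrase and its individual words
    p_phrase = counts[phrase] / total
    p_indep = math.prod(counts[w] / total for w in phrase.split())
    cohesion = math.log(p_phrase / p_indep)            # intra-cohesion
    isolation = counts[phrase] / (counts[phrase] + left_ext + right_ext)
    return cohesion * isolation                        # inter-isolation weight

counts = {"deep learning": 900, "deep": 5000, "learning": 8000}
print(phrase_score(counts, 1_000_000, "deep learning", left_ext=50, right_ext=80))
```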

IJCAI Conference 2016 Conference Paper

Demo: Assisting Visually Impaired People Navigate Indoors

  • J. Pablo Muñoz
  • Bing Li
  • Xuejian Rong
  • Jizhong Xiao
  • Yingli Tian
  • Aries Arditi

Research in Artificial Intelligence, Robotics, and Computer Vision has recently made great strides in improving indoor localization. Publicly available technology now allows for indoor localization with very small margins of error. In this demo, we show a system that uses state-of-the-art technology to assist visually impaired people in navigating indoors. Our system takes advantage of spatial representations from CAD files or floor plan images to extract valuable information that can later be used to improve navigation and human-computer interaction. Using depth information, our system is capable of detecting obstacles and guiding the user to avoid them.

AAAI Conference 2013 Conference Paper

Salient Object Detection via Low-Rank and Structured Sparse Matrix Decomposition

  • Houwen Peng
  • Bing Li
  • Rongrong Ji
  • Weiming Hu
  • Weihua Xiong
  • Congyan Lang

Salient object detection provides an alternative solution for various image semantic understanding tasks such as object recognition, adaptive compression, and image retrieval. Recently, low-rank matrix recovery (LR) theory has been introduced into saliency detection and achieves impressive results. However, existing LR-based models neglect the underlying structure of images, which inevitably degrades performance. In this paper, we propose a Low-rank and Structured sparse Matrix Decomposition (LSMD) model for salient object detection. In the model, a tree-structured sparsity-inducing norm regularization is first introduced to provide a hierarchical description of the image structure, ensuring the completeness of the extracted salient object. The similarity of saliency values within the salient object is then guaranteed by the ℓ∞-norm. Finally, high-level priors are integrated to guide the matrix decomposition and enhance saliency detection. Experimental results on the largest public benchmark database show that our model outperforms existing LR-based approaches and other state-of-the-art methods, which verifies the effectiveness and robustness of the structure cues in our model.
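The abstract pins down the shape of the objective; written schematically in our own notation (the exact weights, groups, and prior terms are the paper's), the decomposition reads:

```latex
% F: feature matrix; L: low-rank background; S: structured-sparse salient
% part; \mathcal{T}: a tree of image-region groups G with weights w_G.
\min_{L,\,S}\ \|L\|_{*}
  \;+\; \lambda \sum_{G \in \mathcal{T}} w_G \,\| S_G \|_{\infty}
\qquad \text{s.t.}\quad F = L + S
```

The nuclear norm pushes the background toward low rank, the tree of groups enforces that salient support follows the image hierarchy, and the ℓ∞-norm within each group evens out saliency values inside the salient object.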

AAAI Conference 2012 Conference Paper

Visual Saliency Map from Tensor Analysis

  • Bing Li
  • Weihua Xiong
  • Weiming Hu

Modeling the visual saliency map of an image provides important information for image semantic understanding in many applications. Most existing computational visual saliency models follow a bottom-up framework that generates an independent saliency map in each selected visual feature space and combines them in a proper way. Two big challenges that must be addressed explicitly in these methods are (1) which features should be extracted for all pixels of the input image, and (2) how to dynamically determine the importance of the saliency map generated in each feature space. To address these problems, we present a novel saliency map computational model based on tensor decomposition and reconstruction. The tensor representation and analysis not only explicitly represent an image's color values but also capture two important relationships inherent to color images: one reflects the spatial correlations between pixels, and the other represents the interplay between color channels. Therefore, a saliency map generator based on the proposed model can adaptively find the most suitable features and their combination coefficients for each pixel. Experiments on a synthetic image set and a real image set show that our method is superior or comparable to other prevailing saliency map models.

AAAI Conference 2010 Conference Paper

Automated Program Debugging Via Multiple Predicate Switching

  • Yongmei Liu
  • Bing Li

In a previous paper, Liu argued for the importance of establishing a precise theoretical foundation for program debugging from first principles. In this paper, we present a first step towards a theoretical exploration of program debugging algorithms. The starting point of our work is the recent debugging approach based on predicate switching. The idea is to switch the outcome of an instance of a predicate to bring the program execution to a successful completion and then identify the fault by examining the switched predicate. However, no theoretical analysis of the approach is available. In this paper, we generalize the above idea, and propose the bounded debugging via multiple predicate switching (BMPS) algorithm, which locates faults through switching the outcomes of instances of multiple predicates to get a successful execution where each loop is executed for a bounded number of times. Clearly, BMPS can be implemented by resorting to a SAT solver. We focus attention on RHS faults, that is, faults that occur in the control predicates and right-hand-sides of assignment statements. We prove that for conditional programs, BMPS is quasi-complete for RHS faults in the sense that some part of any true diagnosis will be returned by BMPS; and for iterative programs, when the bound is sufficiently large, BMPS is also quasi-complete for RHS faults. Initial experimentation with debugging small C programs showed that BMPS can quickly and effectively locate the faults.
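Predicate switching is concrete enough to demonstrate end to end. The toy below is our own illustration, not BMPS itself (BMPS encodes the search as SAT and bounds loop iterations); it brute-forces small sets of predicate-instance flips on an instrumented toy program and reports the flips that make a failing run pass.

```python
# Toy multiple predicate switching: find predicate-instance flips that turn
# a failing execution into a passing one; the flipped predicates are the
# fault suspects. We assume the unflipped run fails.
from itertools import combinations

def run(x, flips):
    hits = []                                    # log of predicate evaluations
    def pred(pid, val):
        hits.append(pid)
        inst = (pid, hits.count(pid))            # (predicate id, occurrence)
        return (not val) if inst in flips else val
    # buggy toy program: should compute abs(x), but the test is inverted
    if pred("p1", x > 0):                        # fault: should be x < 0
        x = -x
    return x, hits

def switch_debug(x, expected, max_flips=2):
    _, hits = run(x, set())                      # collect predicate instances
    insts = {(pid, k) for pid in set(hits) for k in range(1, hits.count(pid) + 1)}
    for r in range(1, max_flips + 1):            # smallest switch sets first
        for flips in combinations(sorted(insts), r):
            if run(x, set(flips))[0] == expected:
                return flips
    return None

print(switch_debug(3, 3))                        # -> (('p1', 1),): the faulty test
```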