Arrow Research search

Author name cluster

Bo Du

Possible papers associated with this exact author name in Arrow. This page groups case-insensitive exact name matches and is not a full identity disambiguation profile.

74 papers
2 author rows

Possible papers

74

JBHI Journal 2026 Journal Article

MsGA: Gestational Age Estimation with Multi-plane Unified Measurements Driven by Anatomic Segmentation

  • Mingjun Huang
  • Junbo Zhang
  • Wei Hu
  • Chao Sun
  • Xiantao Cai
  • Bo Du

An accurate estimation of gestational age is critical for prenatal care and clinical decision-making. Existing ultrasound-based gestational age estimation methods are limited by the insufficient information representation capacity of conventional medical segmentation models, noise interference in ultrasound images, and inter-observer variability in traditional geometry-based measurement methods. To address these challenges, we propose the MsGA model to estimate gestational age with multi-plane unified measurements driven by anatomic segmentation. In the anatomic segmentation stage, a lightweight and high-performance LGF-UNet module is proposed, which utilizes the Deep Patch Embedding module to expand the receptive field, the Local-Global Fusion Transformer block to enhance local-global feature fusion, and the Focusing Attention Bottleneck module to suppress ultrasound noise via an adaptive threshold. In the measurement stage, a Point Regression module is introduced to refine biometric landmark localization. Furthermore, we create a fully annotated ultrasound plane dataset for the estimation of gestational age across various gestational stages. Extensive experiments on the dataset have demonstrated the effectiveness of the whole model and each module. Our MsGA model is superior to existing models with fewer parameters and achieves state-of-the-art performance on the Gestational Age Estimation task.

AAAI Conference 2026 Conference Paper

Rethinking Visual Token Reduction in LVLMs Under Cross-Modal Misalignment

  • Rui Xu
  • Yunke Wang
  • Yong Luo
  • Bo Du

Large Vision-Language Models (LVLMs) encode visual inputs as dense sequences of patch-level tokens to capture fine-grained semantics. These visual tokens often outnumber their textual counterparts by a large margin, leading to substantial computational overhead and limiting the scalability of LVLMs in practice. Previous efforts have explored visual token reduction either prior to or within the large language models (LLMs). However, most in-LLM reduction approaches rely on text-conditioned interactions, implicitly assuming that textual tokens can reliably capture the importance of visual tokens. In this work, we revisit this assumption and reveal causal, semantic, and spatial forms of cross-modal misalignment. These misalignments undermine the effectiveness of text-guided visual token reduction. To address this, we introduce VisionDrop, a training-free, visual-only pruning framework that selects informative visual tokens based on intra-modal (visual-to-visual) attention, without relying on textual signals. To further suppress redundancy throughout the model hierarchy, we treat the visual encoder and the LLM as a unified system and design a progressive pruning pipeline. Our method performs dominant token selection and lightweight contextual merging at multiple stages, enabling fine-grained visual information to be retained even under aggressive token budgets. Extensive experiments across diverse benchmarks show that VisionDrop achieves consistent improvements over existing approaches, despite requiring no additional training or complex modifications. Notably, when integrated with LLaVA-NeXT-7B, VisionDrop achieves a 2.7x reduction in inference latency and 6x in FLOPs, while retaining 95.71% of the original performance.
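As a rough, hypothetical illustration of the visual-only pruning idea above (not the paper's implementation), patch tokens can be scored by the attention they receive from other visual tokens and only the top-k retained:

```python
import numpy as np

def prune_visual_tokens(attn, tokens, keep):
    """Keep the `keep` tokens with the highest intra-modal attention.

    attn:   (N, N) visual-to-visual attention matrix (illustrative).
    tokens: (N, D) visual token embeddings.
    """
    scores = attn.sum(axis=0)                 # attention received per token
    keep_idx = np.argsort(scores)[::-1][:keep]
    keep_idx.sort()                           # preserve spatial order
    return tokens[keep_idx], keep_idx
```

The actual method additionally merges discarded tokens' context and repeats this at multiple stages of the encoder-LLM pipeline.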

NeurIPS Conference 2025 Conference Paper

AiDE-Q: Synthetic Labeled Datasets Can Enhance Learning Models for Quantum Property Estimation

  • Xinbiao Wang
  • Yuxuan Du
  • Zihan Lou
  • Yang Qian
  • Kaining Zhang
  • Yong Luo
  • Bo Du
  • Dacheng Tao

Quantum many-body problems are central to various scientific disciplines, yet their ground-state properties are intrinsically challenging to estimate. Recent advances in deep learning (DL) offer potential solutions in this field, complementing prior purely classical and quantum approaches. However, existing DL-based models typically assume access to a large-scale and noiseless labeled dataset collected by infinite sampling. This idealization raises fundamental concerns about their practical utility, especially given the limited availability of quantum hardware in the near term. To unleash the power of these DL-based models, we propose AiDE-Q (automatic data engine for quantum property estimation), an effective framework that addresses this challenge by iteratively generating high-quality synthetic labeled datasets. Specifically, AiDE-Q utilizes a confidence-check method to assess the quality of synthetic labels and continuously improves the employed DL models with the identified high-quality synthetic dataset. To verify the effectiveness of AiDE-Q, we conduct extensive numerical simulations on a diverse set of quantum many-body and molecular systems, with up to 50 qubits. The results show that AiDE-Q enhances prediction performance for various reference learning models, with improvements of up to 14.2%. Moreover, we show that a basic supervised learning model integrated with AiDE-Q outperforms advanced reference models, highlighting the importance of a synthetic dataset. Our work paves the way for more efficient and practical applications of DL for quantum property estimation.
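The confidence-check idea in the abstract above resembles the generic self-training recipe of keeping only high-confidence synthetic labels. A minimal sketch of that generic recipe (function name and threshold are illustrative assumptions, not AiDE-Q's actual interface):

```python
import numpy as np

def filter_synthetic_labels(probs, labels, thresh=0.9):
    """Keep synthetic labels whose predictive confidence passes a check.

    probs:  (N, C) model output probabilities for N synthetic samples.
    labels: (N,) candidate synthetic labels.
    thresh: illustrative confidence threshold (hypothetical value).
    """
    conf = probs.max(axis=1)   # confidence = top-class probability
    mask = conf >= thresh      # the confidence check
    return labels[mask], mask
```

Surviving samples would then be folded back into the training set for the next iteration of the loop the abstract describes.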

NeurIPS Conference 2025 Conference Paper

Backdoor Cleaning without External Guidance in MLLM Fine-tuning

  • Xuankun Rong
  • Wenke Huang
  • Jian Liang
  • Jinhe Bi
  • Xun Xiao
  • Yiming Li
  • Bo Du
  • Mang Ye

Multimodal Large Language Models (MLLMs) are increasingly deployed in fine-tuning-as-a-service (FTaaS) settings, where user-submitted datasets adapt general-purpose models to downstream tasks. This flexibility, however, introduces serious security risks, as malicious fine-tuning can implant backdoors into MLLMs with minimal effort. In this paper, we observe that backdoor triggers systematically disrupt cross-modal processing by causing abnormal attention concentration on non-semantic regions—a phenomenon we term attention collapse. Based on this insight, we propose Believe Your Eyes (BYE), a data filtering framework that leverages attention entropy patterns as self-supervised signals to identify and filter backdoor samples. BYE operates via a three-stage pipeline: (1) extracting attention maps using the fine-tuned model, (2) computing entropy scores and profiling sensitive layers via bimodal separation, and (3) performing unsupervised clustering to remove suspicious samples. Unlike prior defenses, BYE requires no clean supervision, auxiliary labels, or model modifications. Extensive experiments across various datasets, models, and diverse trigger types validate BYE's effectiveness: it achieves near-zero attack success rates while maintaining clean-task performance, offering a robust and generalizable solution against backdoor threats in MLLMs.

NeurIPS Conference 2025 Conference Paper

DGSolver: Diffusion Generalist Solver with Universal Posterior Sampling for Image Restoration

  • Hebaixu Wang
  • Jing Zhang
  • Haonan Guo
  • Di Wang
  • Jiayi Ma
  • Bo Du

Diffusion models have achieved remarkable progress in universal image restoration. However, existing methods perform naive inference in the reverse process, which leads to cumulative errors under limited sampling steps and large step intervals. Moreover, they struggle to balance the commonality of degradation representations with restoration quality, often depending on complex compensation mechanisms that enhance fidelity at the expense of efficiency. To address these challenges, we introduce **DGSolver**, a diffusion generalist solver with universal posterior sampling. We first derive the exact ordinary differential equations for generalist diffusion models to unify degradation representations and design tailored high-order solvers with a queue-based accelerated sampling strategy to improve both accuracy and efficiency. We then integrate universal posterior sampling to better approximate manifold-constrained gradients, yielding a more accurate noise estimation and correcting errors in inverse inference. Extensive experiments demonstrate that DGSolver outperforms state-of-the-art methods in restoration accuracy, stability, and scalability, both qualitatively and quantitatively. Code and models are publicly available at https://github.com/MiliLab/DGSolver.

IJCAI Conference 2025 Conference Paper

Exploiting Text Semantics for Few and Zero Shot Node Classification on Text-attributed Graph

  • Yuxiang Wang
  • Xiao Yan
  • Shiyu Jin
  • Quanqing Xu
  • Chuang Hu
  • Yuanyuan Zhu
  • Bo Du
  • Jia Wu

Text-attributed graph (TAG) provides a text description for each graph node, and few- and zero-shot node classification on TAGs has many applications in fields such as academia and social networks. Existing work utilizes various graph-based augmentation techniques to train the node and text embeddings, while text-based augmentations are largely unexplored. In this paper, we propose Text Semantics Augmentation (TSA) to improve accuracy by introducing more text semantic supervision signals. Specifically, we design two augmentation techniques, i.e., positive semantics matching and negative semantics contrast, to provide more reference texts for each graph node or text description. Positive semantics matching retrieves texts with similar embeddings to match with a graph node. Negative semantics contrast adds a negative prompt to construct a text description with the opposite semantics, which is contrasted with the original node and text. We evaluate TSA on 5 datasets and compare it with 13 state-of-the-art baselines. The results show that TSA consistently outperforms all baselines, and its accuracy improvements over the best-performing baseline are usually over 5%. The code is at https://github.com/wyx11112/TSA.

NeurIPS Conference 2025 Conference Paper

ForensicHub: A Unified Benchmark & Codebase for All-Domain Fake Image Detection and Localization

  • Bo Du
  • Xuekang Zhu
  • Xiaochen Ma
  • Chenfan Qu
  • Kaiwen Feng
  • Zhe Yang
  • Chi-Man Pun
  • Jian Liu

The field of Fake Image Detection and Localization (FIDL) is highly fragmented, encompassing four domains: deepfake detection (Deepfake), image manipulation detection and localization (IMDL), artificial intelligence-generated image detection (AIGC), and document image manipulation localization (Doc). Although individual benchmarks exist in some domains, a unified benchmark covering all domains in FIDL remains absent. This absence results in significant domain silos, where each domain independently constructs its datasets, models, and evaluation protocols without interoperability, preventing cross-domain comparisons and hindering the development of the entire FIDL field. To break down the domain silos, we propose ForensicHub, the first unified benchmark & codebase for all-domain fake image detection and localization. Considering drastic variations in dataset, model, and evaluation configurations across all domains, as well as the scarcity of open-sourced baseline models and the lack of individual benchmarks in some domains, ForensicHub: i) proposes a modular and configuration-driven architecture that decomposes forensic pipelines into interchangeable components across datasets, transforms, models, and evaluators, allowing flexible composition across all domains; ii) fully implements 10 baseline models (3 of which are reproduced from scratch), 6 backbones, 2 new benchmarks for AIGC and Doc, and integrates 2 existing benchmarks, DeepfakeBench and IMDLBenCo, through an adapter-based design; iii) establishes an image forensic fusion protocol evaluation mechanism that supports unified training and testing of diverse forensic models across tasks; iv) conducts in-depth analysis based on ForensicHub, offering 8 key actionable insights into FIDL model architecture, dataset characteristics, and evaluation standards. Specifically, ForensicHub includes 4 forensic tasks, 23 datasets, 42 baseline models, 6 backbones, 11 GPU-accelerated pixel- and image-level evaluation metrics, and realizes 16 kinds of cross-domain evaluations. ForensicHub represents a significant leap forward in breaking the domain silos in the FIDL field and inspiring future breakthroughs. Code is available at: https://github.com/scu-zjz/ForensicHub.

JMLR Journal 2025 Journal Article

FusionBench: A Unified Library and Comprehensive Benchmark for Deep Model Fusion

  • Anke Tang
  • Li Shen
  • Yong Luo
  • Enneng Yang
  • Han Hu
  • Lefei Zhang
  • Bo Du
  • Dacheng Tao

Deep model fusion is an emerging technique that unifies the predictions or parameters of several deep neural networks into a single better-performing model in a cost-effective and data-efficient manner. Although a variety of deep model fusion techniques have been introduced, their evaluations tend to be inconsistent and often inadequate to validate their effectiveness and robustness. We present FusionBench, the first benchmark and a unified library designed specifically for deep model fusion. Our benchmark consists of multiple tasks, each with different settings of models and datasets. This variety allows us to compare fusion methods across different scenarios and model scales. Additionally, FusionBench serves as a unified library for easy implementation and testing of new fusion techniques. FusionBench is open source and actively maintained, with community contributions encouraged.

NeurIPS Conference 2025 Conference Paper

GeoLLaVA-8K: Scaling Remote-Sensing Multimodal Large Language Models to 8K Resolution

  • Fengxiang Wang
  • Mingshuo Chen
  • Yueying Li
  • Di Wang
  • Haotian Wang
  • Zonghao Guo
  • Zefan Wang
  • Shan Boqi

Ultra-high-resolution (UHR) remote sensing (RS) imagery offers valuable data for Earth observation but poses challenges for existing multimodal foundation models due to two key bottlenecks: (1) limited availability of UHR training data, and (2) token explosion caused by the large image size. To address data scarcity, we introduce **SuperRS-VQA** (avg. 8,376$\times$8,376) and **HighRS-VQA** (avg. 2,000$\times$1,912), the highest-resolution vision-language datasets in RS to date, covering 22 real-world dialogue tasks. To mitigate token explosion, our pilot studies reveal significant redundancy in RS images: crucial information is concentrated in a small subset of object-centric tokens, while pruning background tokens (e.g., ocean or forest) can even improve performance. Motivated by these findings, we propose two strategies: *Background Token Pruning* and *Anchored Token Selection*, to reduce the memory footprint while preserving key semantics. Integrating these techniques, we introduce **GeoLLaVA-8K**, the first RS-focused multimodal large language model capable of handling inputs up to 8K$\times$8K resolution, built on the LLaVA framework. Trained on SuperRS-VQA and HighRS-VQA, GeoLLaVA-8K sets a new state-of-the-art on the XLRS-Bench. Datasets and code are released at https://github.com/MiliLab/GeoLLaVA-8K.

NeurIPS Conference 2025 Conference Paper

HYPERION: Fine-Grained Hypersphere Alignment for Robust Federated Graph Learning

  • Frank Wan
  • Xiaoran Shang
  • Yuxin Wu
  • Guibin Zhang
  • Jinhe Bi
  • Liangtao Zheng
  • Xin Lin
  • Yue Liu

Robust Federated Graph Learning (FGL) provides an effective decentralized framework for training Graph Neural Networks (GNNs) in noisy-label environments. However, the subtlety of noise during training presents formidable obstacles for developing robust FGL systems. Previous robust FL approaches neither adequately constrain edge-mediated error propagation nor account for intra-class topological differences. At the client level, we innovatively demonstrate that hyperspherical embedding can effectively capture graph structures in a fine-grained manner. Correspondingly, our method effectively addresses the aforementioned issues through fine-grained hypersphere alignment. Moreover, we uncover undetected noise arising from localized perspective constraints and propose the geometric-aware hyperspherical purification module at the server level. Combining the strategies at both levels, we present our robust FGL framework, **HYPERION**, which operates all components within a unified hyperspherical space. **HYPERION** demonstrates remarkable robustness across multiple datasets, for instance, achieving a 29.7\% $\uparrow$ F1-macro score with 50\%-pair noise on Cora. The code is available for anonymous access at \url{https://anonymous.4open.science/r/Hyperion-NeurIPS/}.

AAAI Conference 2025 Conference Paper

Improving Complex Reasoning over Knowledge Graph with Logic-Aware Curriculum Tuning

  • Tianle Xia
  • Liang Ding
  • Guojia Wan
  • Yibing Zhan
  • Bo Du
  • Dacheng Tao

Answering complex queries over incomplete knowledge graphs (KGs) is a challenging task. Most previous works have focused on learning entity/relation embeddings and simulating first-order logic operators with various neural networks. However, they are bottlenecked by the inability to share world knowledge to improve logical reasoning, thus resulting in suboptimal performance. In this paper, we propose a complex reasoning schema over KGs built upon large language models (LLMs), containing a curriculum-based logic-aware instruction tuning framework, named LACT. Specifically, we augment arbitrary first-order logical queries via binary tree decomposition, to stimulate the reasoning capability of LLMs. To address the difficulty gap among different types of complex queries, we design a simple and flexible logic-aware curriculum learning framework. Experiments across widely used datasets demonstrate that LACT achieves substantial improvements (an average +5.5% MRR gain) over advanced methods, achieving the new state-of-the-art.

IROS Conference 2025 Conference Paper

Learning to Exploit Leg Odometry Enables Terrain-Aware Quadrupedal Locomotion

  • Yong Zhou
  • Jiawei Jiang
  • Bo Du
  • Zengmao Wang

The geometry of terrain is crucial for developing terrain-aware locomotion policies. Recent advancements in quadrupedal locomotion based on learning rely on depth information obtained from LiDARs and depth cameras. Despite the capabilities of these locomotion policies on terrains, they pose challenges in processing high-dimensional data in real time with onboard hardware. In this study, we develop a lightweight framework that utilizes only the intrinsic sensors of a quadrupedal robot to facilitate terrain-aware locomotion. We introduce a learning-based leg odometry, integrated with a locomotion policy trained through reinforcement learning. Utilizing blind localization from leg odometry alongside a pre-constructed height map enables the robot to navigate steps and stairs without incident. We assess the efficacy of our framework through simulations, where our results indicate that the robot achieves up to a 17% improvement in successful traversal rates and requires fewer point samples. By compensating for slippage during locomotion, our learning-based leg odometry surpasses traditional inertial-leg odometry. Lastly, we validate the practical applicability of our models on a real robot, confirming their effectiveness in real-world settings.

IS Journal 2025 Journal Article

Machine Learning Approaches for Micromobility User Behavior Analysis

  • Cheng Zhang
  • Bo Du
  • Qiuyun Luan
  • Jun Shen

With widespread adoption globally, micromobility modes such as bikes, e-scooters, and e-bikes have attracted increasing attention due to their ability to complement existing transportation modes and promote sustainable transportation. Understanding micromobility user behaviors in urban areas is essential for improving safety and comfort, as well as for informing infrastructure development and policy. Prior investigations on micromobility user behaviors primarily relied on statistical and kinematic modeling approaches. Although these methods have proven effective in characterizing user behaviors at both macroscopic and microscopic levels, the advent of artificial intelligence (AI)-powered data analytics and behavioral modeling is revolutionizing the field. Recently, advanced machine learning models, such as gradient boosting decision trees, graph convolutional networks, and inverse reinforcement learning, have introduced new momentum into micromobility user behavior research. This article explores recent developments, research opportunities, and future directions in this field, leveraging the power of more generic AI approaches.

NeurIPS Conference 2025 Conference Paper

Merging on the Fly Without Retraining: A Sequential Approach to Scalable Continual Model Merging

  • Anke Tang
  • Enneng Yang
  • Li Shen
  • Yong Luo
  • Han Hu
  • Lefei Zhang
  • Bo Du
  • Dacheng Tao

Deep model merging represents an emerging research direction that combines multiple fine-tuned models to harness their specialized capabilities across different tasks and domains. Current model merging techniques focus on merging all available models simultaneously, with weight interpolation-based methods being the predominant approach. However, these conventional approaches are not well-suited for scenarios where models become available sequentially, and they often suffer from high memory requirements and potential interference between tasks. In this study, we propose a training-free projection-based continual merging method that processes models sequentially through orthogonal projections of weight matrices and adaptive scaling mechanisms. Our method operates by projecting new parameter updates onto subspaces orthogonal to existing merged parameter updates while using an adaptive scaling mechanism to maintain stable parameter distances, enabling efficient sequential integration of task-specific knowledge. Our approach maintains constant memory complexity with respect to the number of models, minimizes interference between tasks through orthogonal projections, and retains the performance of previously merged models through adaptive task vector scaling. Extensive experiments on CLIP-ViT models demonstrate that our method achieves a 5-8% average accuracy improvement while maintaining robust performance in different task orderings. Code is publicly available at https://github.com/tanganke/opcm.
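The orthogonal-projection step described above can be illustrated with a toy single-matrix sketch (generic linear algebra under assumed shapes, not the paper's exact per-layer procedure or its adaptive scaling): the new task's update is decomposed against the column space of the already-merged update, and only the orthogonal component is accumulated.

```python
import numpy as np

def merge_sequential(merged_update, new_update):
    """Accumulate only the component of new_update orthogonal to
    the column space of the existing merged update (toy sketch)."""
    # Orthonormal basis for the span of the merged update
    U, _, _ = np.linalg.svd(merged_update, full_matrices=False)
    proj = U @ (U.T @ new_update)   # component inside existing subspace
    orth = new_update - proj        # interference-free component
    return merged_update + orth
```

Because only the orthogonal component is added, the previously merged parameters are disturbed as little as possible, which is the intuition behind the interference-minimization claim.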

AAAI Conference 2025 Conference Paper

Mesoscopic Insights: Orchestrating Multi-Scale & Hybrid Architecture for Image Manipulation Localization

  • Xuekang Zhu
  • Xiaochen Ma
  • Lei Su
  • Zhuohang Jiang
  • Bo Du
  • Xiwen Wang
  • Zeyu Lei
  • Wentao Feng

The mesoscopic level serves as a bridge between the macroscopic and microscopic worlds, addressing gaps overlooked by both. Image manipulation localization (IML), a crucial technique to pursue truth from fake images, has long relied on low-level (microscopic-level) traces. However, in practice, most tampering aims to deceive the audience by altering image semantics. As a result, manipulation commonly occurs at the object level (macroscopic level), which is equally important as microscopic traces. Therefore, integrating these two levels into the mesoscopic level presents a new perspective for IML research. Inspired by this, our paper explores how to simultaneously construct mesoscopic representations of micro and macro information for IML and introduces the Mesorch architecture to orchestrate both. Specifically, this architecture i) combines Transformers and CNNs in parallel, with Transformers extracting macro information and CNNs capturing micro details, and ii) explores across different scales, assessing micro and macro information seamlessly. Additionally, based on the Mesorch architecture, the paper introduces two baseline models aimed at solving IML tasks through mesoscopic representation. Extensive experiments across four datasets have demonstrated that our models surpass the current state-of-the-art in terms of performance, computational complexity, and robustness.

NeurIPS Conference 2025 Conference Paper

MOTION: Multi-Sculpt Evolutionary Coarsening for Federated Continual Graph Learning

  • Frank Wan
  • Fengyuan Ran
  • Ruikang Zhang
  • Wenke Huang
  • Xuankun Rong
  • Guibin Zhang
  • Yuxin Wu
  • Bo Du

Graph neural networks (GNNs) have achieved remarkable success in various domains but typically rely on centralized, static graphs, which limits their applicability in distributed, evolving environments. To address this limitation, we define the task of Federated Continual Graph Learning (FCGL), a paradigm for incremental learning on dynamic graphs distributed across decentralized clients. Existing methods, however, neither preserve graph topology during task transitions nor mitigate parameter conflicts in server-side aggregation. To overcome these challenges, we introduce **MOTION**, a generalizable FCGL framework that integrates two complementary modules: the Graph Topology-preserving Multi-Sculpt Coarsening (G-TMSC) module, which maintains the structural integrity of past graphs through a multi-expert, similarity-guided fusion process, and the Graph-Aware Evolving Parameter Adaptive Engine (G-EPAE) module, which refines global model updates by leveraging a topology-sensitive compatibility matrix. Extensive experiments on real-world datasets show that our approach improves average accuracy (AA) by an average of 30\% $\uparrow$ over the FedAvg baseline across five datasets while maintaining a negative $\downarrow$ average forgetting (AF) rate, significantly enhancing generalization and robustness under FCGL settings. The code is available for anonymous access at https://anonymous.4open.science/r/MOTION.

NeurIPS Conference 2025 Conference Paper

Multi-order Orchestrated Curriculum Distillation for Model-Heterogeneous Federated Graph Learning

  • Frank Wan
  • Xu Cheng
  • Run Liu
  • Wenke Huang
  • Zitong Shi
  • Pinyi Jin
  • Guibin Zhang
  • Bo Du

Federated Graph Learning (FGL) has been shown to be particularly effective in enabling collaborative training of Graph Neural Networks (GNNs) in decentralized settings. Model-heterogeneous FGL further enhances practical applicability by accommodating client preferences for diverse model architectures. However, existing model-heterogeneous approaches primarily target Euclidean data and fail to account for a crucial aspect of graph-structured data: topological relationships. To address this limitation, we propose **TRUST**, a novel knowledge distillation-based **model-heterogeneous FGL** framework. Specifically, we propose Progressive Curriculum Node Scheduler to progressively introduce challenging nodes based on learning difficulty. In Adaptive Curriculum Distillation Modulator, we propose an adaptive temperature modulator that dynamically adjusts knowledge distillation temperature to accommodate varying client capabilities and graph complexity. Moreover, we leverage Wasserstein-Driven Affinity Distillation to enable models to capture cross-class structural relationships through optimal transport. Extensive experiments on multiple graph benchmarks and model-heterogeneous settings show that **TRUST** outperforms existing methods, achieving an average 3.6\% $\uparrow$ performance gain, particularly under moderate heterogeneity conditions. The code is available for anonymous access at https://anonymous.4open.science/r/TRUST-NeurIPS2025.

NeurIPS Conference 2025 Conference Paper

OASIS: One-Shot Federated Graph Learning via Wasserstein Assisted Knowledge Integration

  • Frank Wan
  • Jiaru Qian
  • Wenke Huang
  • Qilin Xu
  • Xianda Guo
  • Boheng Li
  • Guibin Zhang
  • Bo Du

Federated Graph Learning (FGL) offers a promising framework for collaboratively training Graph Neural Networks (GNNs) while preserving data privacy. In resource-constrained environments, One-shot Federated Learning (OFL) emerges as an effective solution by limiting communication to a single round. Current OFL approaches employing generative models have attracted considerable attention; however, they face unresolved challenges: these methods are primarily designed for traditional image data and fail to capture the fine-grained structural information of local graph data. Consequently, they struggle to integrate the intricate correlations necessary and transfer subtle structural insights from each client to the global model. To address these issues, we introduce OASIS, an innovative one-shot FGL framework. In OASIS, we propose a Synergy Graph Synthesizer designed to generate informative synthetic graphs and introduce a Topological Codebook to construct a structural latent space. Moreover, we propose the Wasserstein-Enhanced Semantic Affinity Distillation (WESAD) to incorporate rich inter-class relationships and the Wasserstein-Driven Structural Relation Distillation (WDSRD) to facilitate the effective transfer of structural knowledge from the Topological Codebook. Extensive experiments on real-world tasks demonstrate the superior performance and generalization capability of OASIS. The code is available for anonymous access at https://anonymous.4open.science/r/OASIS-NeurIPS25.

IJCAI Conference 2025 Conference Paper

Pixel-wise Divide and Conquer for Federated Vessel Segmentation

  • Tian Chen
  • Wenke Huang
  • Zhihao Wang
  • Zekun Shi
  • He Li
  • Wenhui Dong
  • Mang Ye
  • Bo Du

Accurate vessel segmentation is essential for diagnosing and managing vascular and ophthalmic diseases. Traditional learning-based vessel segmentation methods heavily rely on high-quality, pixel-level annotated datasets. However, segmentation performance suffers significantly when applied in federated learning settings due to vessel morphology inconsistency and vessel-background imbalance. The former limits the ability of models to capture fine-grained vessels, while the latter overemphasizes background pixels and biases the model towards them. To address these challenges, we propose a novel method named Federated Vessel-Aware Calibration (FVAC), which leverages global uncertainty to provide differentiated guidance for clients, focusing on pixels of various morphologies that are difficult to distinguish. Furthermore, we introduce a foreground-background decoupling alignment strategy that utilizes more stable and balanced global features to mitigate semantic drift caused by vessel-background imbalance in local clients. Comprehensive experiments confirm the effectiveness of our method.

NeurIPS Conference 2025 Conference Paper

Self-Evolving Pseudo-Rehearsal for Catastrophic Forgetting with Task Similarity in LLMs

  • Jun Wang
  • Liang Ding
  • Shuai Wang
  • Hongyu Li
  • Yong Luo
  • Huangxuan Zhao
  • Han Hu
  • Bo Du

Continual learning for large language models (LLMs) demands a precise balance between **plasticity** - the ability to absorb new tasks - and **stability** - the preservation of previously learned knowledge. Conventional rehearsal methods, which replay stored examples, are limited by long-term data inaccessibility; earlier pseudo-rehearsal methods require additional generation modules, while self-synthesis approaches often generate samples that poorly align with real tasks, suffer from unstable outputs, and ignore task relationships. We present **Self-Evolving Pseudo-Rehearsal for Catastrophic Forgetting with Task Similarity (SERS)**, a lightweight framework that 1) decouples pseudo-input synthesis from label creation, using semantic masking and template guidance to produce diverse, task-relevant prompts without extra modules; 2) applies label self-evolution, blending base-model priors with fine-tuned outputs to prevent over-specialization; and 3) introduces a dynamic regularizer driven by the Wasserstein distance between task distributions, automatically relaxing or strengthening constraints in proportion to task similarity. Experiments across diverse tasks on different LLMs show that our SERS reduces forgetting by over 2 percentage points against strong pseudo-rehearsal baselines, through efficient data utilization and judicious knowledge transfer. The code will be released at https://github.com/JerryWangJun/LLM_CL_SERS/.

IJCAI Conference 2025 Conference Paper

Spotlighting Partially Visible Cinematic Language for Video-to-Audio Generation via Self-distillation

  • Feizhen Huang
  • Yu Wu
  • Yutian Lin
  • Bo Du

Video-to-Audio (V2A) generation has achieved significant progress and plays a crucial role in film and video post-production. However, current methods overlook cinematic language, a critical component of artistic expression in filmmaking. As a result, their performance deteriorates in scenarios where Foley targets are only partially visible. To address this challenge, we propose a simple self-distillation approach to extend V2A models to cinematic language scenarios. By simulating cinematic language variations, the student model learns to align the video features of training pairs with the same audio-visual correspondences, enabling it to effectively capture the associations between sounds and partial visual information. Our method not only achieves impressive improvements under partial visibility across all evaluation metrics, but also enhances performance on the large-scale V2A dataset, VGGSound.

ICRA Conference 2025 Conference Paper

Tracking Everything in Robotic-Assisted Surgery

  • Bohan Zhan
  • Wang Zhao
  • Yi Fang 0006
  • Bo Du
  • Francisco Vasconcelos 0001
  • Danail Stoyanov
  • Daniel S. Elson
  • Baoru Huang

Accurate tracking of tissues and instruments in videos is crucial for Robotic-Assisted Minimally Invasive Surgery (RAMIS), as it enables the robot to comprehend the surgical scene with precise locations and interactions of tissues and tools. Traditional keypoint-based sparse tracking is limited by featured points, while flow-based dense two-view matching suffers from long-term drifts. Recently, the Tracking Any Point (TAP) algorithm was proposed to overcome these limitations and achieve dense accurate long-term tracking. However, its efficacy in surgical scenarios remains untested, largely due to the lack of a comprehensive surgical tracking dataset for evaluation. To address this gap, we introduce a new annotated surgical tracking dataset for benchmarking tracking methods in surgical scenarios, comprising real-world surgical videos with complex tissue and instrument motions. We extensively evaluate state-of-the-art (SOTA) TAP-based algorithms on this dataset and reveal their limitations in challenging surgical scenarios, including fast instrument motion, severe occlusions, and motion blur. Furthermore, we propose a new tracking method, namely SurgMotion, to solve these challenges and further improve the tracking performance. Our proposed method outperforms most TAP-based algorithms in surgical instrument tracking, and especially demonstrates significant improvements over baselines in challenging medical videos. Our code and dataset are available at https://github.com/zhanbh1019/SurgicalMotion.

NeurIPS Conference 2025 Conference Paper

Value-Guided Decision Transformer: A Unified Reinforcement Learning Framework for Online and Offline Settings

  • Hongling Zheng
  • Li Shen
  • Yong Luo
  • Deheng Ye
  • Shuhan Xu
  • Bo Du
  • Jialie Shen
  • Dacheng Tao

The Conditional Sequence Modeling (CSM) paradigm, benefiting from the transformer's powerful distribution modeling capabilities, has demonstrated considerable promise in Reinforcement Learning (RL) tasks. However, much of the work has focused on applying CSM to single online or offline settings, with the general architecture rarely explored. Additionally, existing methods primarily focus on deterministic trajectory modeling, overlooking the randomness of state transitions and the diversity of future trajectory distributions. Fortunately, value-based methods offer a viable solution for CSM, further bridging the potential gap between offline and online RL. In this paper, we propose Value-Guided Decision Transformer (VDT), which leverages value functions to perform advantage-weighting and behavior regularization on the Decision Transformer (DT), guiding the policy toward upper-bound optimal decisions during the offline training phase. In the online tuning phase, VDT further integrates value-based policy improvement with behavior cloning under the CSM architecture through limited interaction and data collection, achieving performance improvement within minimal timesteps. The predictive capability of value functions for future returns is also incorporated into the sampling process. Our method achieves competitive performance on various standard RL benchmarks, providing a feasible solution for developing CSM architectures in general scenarios. Code is available here.

AAAI Conference 2025 Conference Paper

Vox-UDA: Voxel-wise Unsupervised Domain Adaptation for Cryo-Electron Subtomogram Segmentation with Denoised Pseudo-Labeling

  • Haoran Li
  • Xingjian Li
  • Jiahua Shi
  • Huaming Chen
  • Bo Du
  • Daisuke Kihara
  • Johan Barthelemy
  • Jun Shen

Cryo-Electron Tomography (cryo-ET) is a 3D imaging technology that facilitates the study of macromolecular structures at near-atomic resolution. Recent volumetric segmentation approaches on cryo-ET images have drawn widespread interest in the biological sector. However, existing methods heavily rely on manually labeled data, which requires highly professional skills, thereby hindering the adoption of fully-supervised approaches for cryo-ET images. Some unsupervised domain adaptation (UDA) approaches have been designed to enhance segmentation network performance using unlabeled data. However, applying these methods directly to cryo-ET image segmentation tasks remains challenging due to two main issues: 1) the source dataset, usually obtained through simulation, contains a fixed level of noise, while the target dataset, directly collected as raw data from real-world scenarios, has unpredictable noise levels; 2) the source data used for training typically consists of known macromolecules. In contrast, the target domain data are often unknown, causing the model to be biased towards those known macromolecules, leading to a domain shift problem. To address such challenges, in this work, we introduce a voxel-wise unsupervised domain adaptation approach, termed Vox-UDA, specifically for cryo-ET subtomogram segmentation. Vox-UDA incorporates a noise generation module to simulate target-like noise in the source dataset for cross-noise-level adaptation. Additionally, we propose a denoised pseudo-labeling strategy based on the improved Bilateral Filter to alleviate the domain shift problem. More importantly, we construct the first UDA cryo-ET subtomogram segmentation benchmark on three experimental datasets. Extensive experimental results on multiple benchmarks and newly curated real-world datasets demonstrate the superiority of our proposed approach compared to state-of-the-art UDA methods.

AAAI Conference 2025 Conference Paper

WaterDiffusion: Learning a Prior-involved Unrolling Diffusion for Joint Underwater Saliency Detection and Visual Restoration

  • Laibin Chang
  • Yunke Wang
  • Longxiang Deng
  • Bo Du
  • Chang Xu

Underwater salient object detection (USOD) plays a pivotal role in various vision-based marine exploration tasks. However, existing USOD techniques face the dilemma of object mislocalization and imprecise boundaries due to the complex underwater environment. The quality degradation of raw underwater images (caused by selective absorption and medium scattering) makes it challenging to perform instance detection directly. One conceivable approach involves initially removing visual disturbances through underwater image enhancement (UIE), followed by saliency detection. However, this two-stage approach neglects the potential positive impact of the restoration procedure on saliency detection, because the two stages execute in cascade. Based on this insight, we propose a generalized prior-involved diffusion model, called WaterDiffusion, for collaborative underwater saliency detection and visual restoration. Specifically, we first propose a revised self-attention joint diffusion, which embeds dynamic saliency masks into the diffusive network as latent features. By extending the underwater degradation prior into the multi-scale decoder, we innovatively exploit optical transmission maps to aid in localizing underwater salient objects. Then, we further design a gate-guided binary indicator to select either normalized or raw channels for improving feature generalization. Finally, Half-quadratic Splitting is introduced into the unfolding sampling to refine saliency masks iteratively. Comprehensive experiments demonstrate the superior performance of WaterDiffusion over state-of-the-art methods in both quantitative and qualitative evaluations.

AAAI Conference 2024 Conference Paper

Cycle Self-Refinement for Multi-Source Domain Adaptation

  • Chaoyang Zhou
  • Zengmao Wang
  • Bo Du
  • Yong Luo

Multi-source domain adaptation (MSDA) aims to transfer knowledge from multiple source domains to the unlabeled target domain. In this paper, we propose a cycle self-refinement domain adaptation method, which progressively attempts to learn the dominant transferable knowledge in each source domain in a cycle manner. Specifically, several source-specific networks and a domain-ensemble network are adopted in the proposed method. The source-specific networks are adopted to provide the dominant transferable knowledge in each source domain for instance-level ensemble on predictions of the samples in target domain. Then these samples with high-confidence ensemble predictions are adopted to refine the domain-ensemble network. Meanwhile, to guide each source-specific network to learn more dominant transferable knowledge, we force the features of the target domain from the domain-ensemble network and the features of each source domain from the corresponding source-specific network to be aligned with their predictions from the corresponding networks. Thus the adaptation ability of source-specific networks and the domain-ensemble network can be improved progressively. Extensive experiments on Office-31, Office-Home and DomainNet show that the proposed method outperforms the state-of-the-art methods for most tasks.

AAMAS Conference 2024 Conference Paper

Extended Abstract: Price of Anarchy of Traffic Assignment with Exponential Cost Functions

  • Jianglin Qiao
  • Dave de Jonge
  • Dongmo Zhang
  • Simeon Simoff
  • Carles Sierra
  • Bo Du

This paper is an extended abstract version of "Price of Anarchy of Traffic Assignment with Exponential Cost Functions [5]". We study a routing game where vehicles, selfish agents, independently choose routes to minimize travel delays from road congestion. We focus on exponential latency functions, unlike prior research using polynomial functions like BPR. We calculate a tight upper bound for the price of anarchy and compare it with the BPR function. Results indicate that the exponential function has a lower upper bound for traffic volumes below road capacity than the BPR function. Numerical analysis using real-world data shows that the exponential function closely approximates road latency with even tighter parameters, resulting in a relatively lower upper bound.

NeurIPS Conference 2024 Conference Paper

GoMatching: A Simple Baseline for Video Text Spotting via Long and Short Term Matching

  • Haibin He
  • Maoyuan Ye
  • Jing Zhang
  • Juhua Liu
  • Bo Du
  • Dacheng Tao

Beyond the text detection and recognition tasks in image text spotting, video text spotting presents an augmented challenge with the inclusion of tracking. While advanced end-to-end trainable methods have shown commendable performance, the pursuit of multi-task optimization may pose the risk of producing sub-optimal outcomes for individual tasks. In this paper, we identify a main bottleneck in the state-of-the-art video text spotter: the limited recognition capability. In response to this issue, we propose to efficiently turn an off-the-shelf query-based image text spotter into a specialist on video and present a simple baseline termed GoMatching, which focuses the training efforts on tracking while maintaining strong recognition performance. To adapt the image text spotter to video datasets, we add a rescoring head to rescore each detected instance's confidence via efficient tuning, leading to a better tracking candidate pool. Additionally, we design a long-short term matching module, termed LST-Matcher, to enhance the spotter's tracking capability by integrating both long- and short-term matching results via Transformer. Based on the above simple designs, GoMatching delivers new records on ICDAR15-video, DSText, BOVText, and our proposed novel test set with arbitrary-shaped text termed ArTVideo, which demonstrates GoMatching's capability to accommodate general, dense, small, arbitrary-shaped, Chinese and English text scenarios while saving considerable training budgets. The code will be released.

NeurIPS Conference 2024 Conference Paper

IMDL-BenCo: A Comprehensive Benchmark and Codebase for Image Manipulation Detection & Localization

  • Xiaochen Ma
  • Xuekang Zhu
  • Lei Su
  • Bo Du
  • Zhuohang Jiang
  • Bingkui Tong
  • Zeyu Lei
  • Xinyu Yang

A comprehensive benchmark is yet to be established in the Image Manipulation Detection & Localization (IMDL) field. The absence of such a benchmark leads to insufficient and misleading model evaluations, severely undermining the development of this field. However, the scarcity of open-sourced baseline models and inconsistent training and evaluation protocols make conducting rigorous experiments and faithful comparisons among IMDL models challenging. To address these challenges, we introduce IMDL-BenCo, the first comprehensive IMDL benchmark and modular codebase. IMDL-BenCo: i) decomposes the IMDL framework into standardized, reusable components and revises the model construction pipeline, improving coding efficiency and customization flexibility; ii) fully implements or incorporates training code for state-of-the-art models to establish a comprehensive IMDL benchmark; and iii) conducts deep analysis based on the established benchmark and codebase, offering new insights into IMDL model architecture, dataset characteristics, and evaluation standards. Specifically, IMDL-BenCo includes common processing algorithms, 8 state-of-the-art IMDL models (1 of which is reproduced from scratch), 2 sets of standard training and evaluation protocols, 15 GPU-accelerated evaluation metrics, and 3 kinds of robustness evaluation. This benchmark and codebase represent a significant leap forward in calibrating the current progress in the IMDL field and inspiring future breakthroughs. Code is available at: https://github.com/scu-zjz/IMDLBenCo

AAAI Conference 2024 Conference Paper

Joint Learning Neuronal Skeleton and Brain Circuit Topology with Permutation Invariant Encoders for Neuron Classification

  • Minghui Liao
  • Guojia Wan
  • Bo Du

Determining the types of neurons within a nervous system plays a significant role in the analysis of brain connectomics and the investigation of neurological diseases. However, the efficiency of utilizing anatomical, physiological, or molecular characteristics of neurons is relatively low and costly. With the advancements in electron microscopy imaging and analysis techniques for brain tissue, we are able to obtain whole-brain connectomes consisting of neuronal high-resolution morphology and connectivity information. However, few models are built on such data for automated neuron classification. In this paper, we propose NeuNet, a framework that combines morphological information of neurons obtained from the skeleton and topological information between neurons obtained from the neural circuit. Specifically, NeuNet consists of three components, namely the Skeleton Encoder, the Connectome Encoder, and the Readout Layer. The Skeleton Encoder integrates the local information of neurons in a bottom-up manner, with a one-dimensional convolution on the neural skeleton's point data; the Connectome Encoder uses a graph neural network to capture the topological information of the neural circuit; finally, the Readout Layer fuses the above two kinds of information and outputs classification results. We reprocess and release two new datasets for the neuron classification task from volume electron microscopy (VEM) images of human brain cortex and Drosophila brain. Experiments on these two datasets demonstrate the effectiveness of our model with accuracies of 0.9169 and 0.9363, respectively. Code and data are available at: https://github.com/WHUminghui/NeuNet.

IJCAI Conference 2024 Conference Paper

LeMeViT: Efficient Vision Transformer with Learnable Meta Tokens for Remote Sensing Image Interpretation

  • Wentao Jiang
  • Jing Zhang
  • Di Wang
  • Qiming Zhang
  • Zengmao Wang
  • Bo Du

Due to spatial redundancy in remote sensing images, sparse tokens containing rich information are usually involved in self-attention (SA) to reduce the overall token numbers within the calculation, avoiding the high computational cost issue in Vision Transformers. However, such methods usually obtain sparse tokens by hand-crafted or parallel-unfriendly designs, posing a challenge to reach a better balance between efficiency and performance. Different from them, this paper proposes to use learnable meta tokens to formulate sparse tokens, which effectively learn key information meanwhile improving the inference speed. Technically, the meta tokens are first initialized from image tokens via cross-attention. Then, we propose Dual Cross-Attention (DCA) to promote information exchange between image tokens and meta tokens, where they serve as query and key (value) tokens alternatively in a dual-branch structure, significantly reducing the computational complexity compared to self-attention. By employing DCA in the early stages with dense visual tokens, we obtain the hierarchical architecture LeMeViT with various sizes. Experimental results in classification and dense prediction tasks show that LeMeViT has a significant 1.7× speedup, fewer parameters, and competitive performance compared to the baseline models, and achieves a better trade-off between efficiency and performance. The code is released at https://github.com/ViTAE-Transformer/LeMeViT.

NeurIPS Conference 2024 Conference Paper

MMSite: A Multi-modal Framework for the Identification of Active Sites in Proteins

  • Song Ouyang
  • Huiyu Cai
  • Yong Luo
  • Kehua Su
  • Lefei Zhang
  • Bo Du

The accurate identification of active sites in proteins is essential for the advancement of life sciences and pharmaceutical development, as these sites are of critical importance for enzyme activity and drug design. Recent advancements in protein language models (PLMs), trained on extensive datasets of amino acid sequences, have significantly improved our understanding of proteins. However, compared to the abundant protein sequence data, functional annotations, especially precise per-residue annotations, are scarce, which limits the performance of PLMs. On the other hand, textual descriptions of proteins, which could be annotated by human experts or a pretrained protein sequence-to-text model, provide meaningful context that could assist in the functional annotations, such as the localization of active sites. This motivates us to construct a ProTein-Attribute text Dataset (ProTAD), comprising over 570,000 pairs of protein sequences and multi-attribute textual descriptions. Based on this dataset, we propose MMSite, a multi-modal framework that improves the performance of PLMs to identify active sites by leveraging biomedical language models (BLMs). In particular, we incorporate manual prompting and design a MACross module to deal with the multi-attribute characteristics of textual descriptions. MMSite is a two-stage ("First Align, Then Fuse") framework: it first aligns the textual modality with the sequential modality through soft-label alignment, and then identifies active sites via multi-modal fusion. Experimental results demonstrate that MMSite achieves state-of-the-art performance compared to existing protein representation learning methods. The dataset and code implementation are available at https://github.com/Gift-OYS/MMSite.

NeurIPS Conference 2024 Conference Paper

Parameter Disparities Dissection for Backdoor Defense in Heterogeneous Federated Learning

  • Wenke Huang
  • Mang Ye
  • Zekun Shi
  • Guancheng Wan
  • He Li
  • Bo Du

Backdoor attacks pose a serious threat to federated systems, where malicious clients optimize on the triggered distribution to mislead the global model towards a predefined target. Existing backdoor defense methods typically require either homogeneous assumption, validation datasets, or client optimization conflicts. In our work, we observe that benign heterogeneous distributions and malicious triggered distributions exhibit distinct parameter importance degrees. We introduce the Fisher Discrepancy Cluster and Rescale (FDCR) method, which utilizes Fisher Information to calculate the degree of parameter importance for local distributions. This allows us to reweight client parameter updates and identify those with large discrepancies as backdoor attackers. Furthermore, we prioritize rescaling important parameters to expedite adaptation to the target distribution, encouraging significant elements to contribute more while diminishing the influence of trivial ones. This approach enables FDCR to handle backdoor attacks in heterogeneous federated learning environments. Empirical results on various heterogeneous federated scenarios under backdoor attacks demonstrate the effectiveness of our method.

NeurIPS Conference 2024 Conference Paper

Toward Real Ultra Image Segmentation: Leveraging Surrounding Context to Cultivate General Segmentation Model

  • Sai Wang
  • Yutian Lin
  • Yu Wu
  • Bo Du

Existing ultra image segmentation methods suffer from two major challenges, namely the scalability issue (i.e., they lack the stability and generality of standard segmentation models, as they are tailored to specific datasets) and the architectural issue (i.e., they are incompatible with real-world ultra image scenes, as they compromise between image size and computing resources). To tackle these issues, we revisit the classic sliding inference framework, upon which we propose a Surrounding Guided Segmentation framework (SGNet) for ultra image segmentation. SGNet leverages a larger area around each image patch to refine the general segmentation results of local patches. Specifically, we propose a surrounding context integration module to absorb surrounding context information and extract specific features that are beneficial to local patches. Note that SGNet can be seamlessly integrated into any general segmentation model. Extensive experiments on five datasets demonstrate that SGNet achieves competitive performance and consistent improvements across a variety of general segmentation models, surpassing traditional ultra image segmentation methods by a large margin.

NeurIPS Conference 2024 Conference Paper

What If the Input is Expanded in OOD Detection?

  • Boxuan Zhang
  • Jianing Zhu
  • Zengmao Wang
  • Tongliang Liu
  • Bo Du
  • Bo Han

Out-of-distribution (OOD) detection aims to identify OOD inputs from unknown classes, which is important for the reliable deployment of machine learning models in the open world. Various scoring functions have been proposed to distinguish OOD inputs from in-distribution (ID) data. However, existing methods generally focus on excavating the discriminative information from a single input, which implicitly limits its representation dimension. In this work, we introduce a novel perspective, i.e., employing different common corruptions on the input space, to expand it. We reveal an interesting phenomenon termed confidence mutation, where the confidence of OOD data can decrease significantly under the corruptions, while the ID data shows a higher confidence expectation considering the resistance of semantic features. Based on that, we formalize a new scoring method, namely, Confidence aVerage (CoVer), which can capture the dynamic differences by simply averaging the scores obtained from different corrupted inputs and the original ones, making the OOD and ID distributions more separable in detection tasks. Extensive experiments and analyses have been conducted to understand and verify the effectiveness of CoVer.

NeurIPS Conference 2023 Conference Paper

AIMS: All-Inclusive Multi-Level Segmentation for Anything

  • Lu Qi
  • Jason Kuen
  • Weidong Guo
  • Jiuxiang Gu
  • Zhe Lin
  • Bo Du
  • Yu Xu
  • Ming-Hsuan Yang

Despite the progress of image segmentation for accurate visual entity segmentation, completing the diverse requirements of image editing applications for different-level region-of-interest selections remains unsolved. In this paper, we propose a new task, All-Inclusive Multi-Level Segmentation (AIMS), which segments visual regions into three levels: part, entity, and relation (two entities with some semantic relationships). We also build a unified AIMS model through multi-dataset multi-task training to address the two major challenges of annotation inconsistency and task correlation. Specifically, we propose task complementarity, association, and prompt mask encoder for three-level predictions. Extensive experiments demonstrate the effectiveness and generalization capacity of our method compared to other state-of-the-art methods on a single dataset or the concurrent work on segment anything. We will make our code and training model publicly available.

AAAI Conference 2023 Conference Paper

DPText-DETR: Towards Better Scene Text Detection with Dynamic Points in Transformer

  • Maoyuan Ye
  • Jing Zhang
  • Shanshan Zhao
  • Juhua Liu
  • Bo Du
  • Dacheng Tao

Recently, Transformer-based methods, which predict polygon points or Bezier curve control points for localizing texts, are popular in scene text detection. However, these methods built upon detection transformer framework might achieve sub-optimal training efficiency and performance due to coarse positional query modeling. In addition, the point label form exploited in previous works implies the reading order of humans, which impedes the detection robustness from our observation. To address these challenges, this paper proposes a concise Dynamic Point Text DEtection TRansformer network, termed DPText-DETR. In detail, DPText-DETR directly leverages explicit point coordinates to generate position queries and dynamically updates them in a progressive way. Moreover, to improve the spatial inductive bias of non-local self-attention in Transformer, we present an Enhanced Factorized Self-Attention module which provides point queries within each instance with circular shape guidance. Furthermore, we design a simple yet effective positional label form to tackle the side effect of the previous form. To further evaluate the impact of different label forms on the detection robustness in real-world scenario, we establish an Inverse-Text test set containing 500 manually labeled images. Extensive experiments prove the high training efficiency, robustness, and state-of-the-art performance of our method on popular benchmarks. The code and the Inverse-Text test set are available at https://github.com/ymy-k/DPText-DETR.

IJCAI Conference 2023 Conference Paper

Federated Graph Semantic and Structural Learning

  • Wenke Huang
  • Guancheng Wan
  • Mang Ye
  • Bo Du

Federated graph learning collaboratively learns a global graph neural network with distributed graphs, where the non-independent and identically distributed property is one of the major challenges. Most related works focus on traditional distributed tasks like images and voices and are incapable of handling graph structures. This paper first reveals that local client distortion is brought by both node-level semantics and graph-level structure. First, for node-level semantics, we find that contrasting nodes from distinct classes is beneficial to provide well-performing discrimination. We pull the local node towards the global node of the same class and push it away from the global nodes of different classes. Second, we postulate that a well-structured graph neural network possesses similarity among neighbors due to the inherent adjacency relationships. However, aligning each node with adjacent nodes hinders discrimination due to the potential class inconsistency. We transform the adjacency relationships into a similarity distribution and leverage the global model to distill the relation knowledge into the local model, which preserves the structural information and discriminability of the local model. Empirical results on three graph datasets manifest the superiority of the proposed method over counterparts.

IJCAI Conference 2023 Conference Paper

Graph Pooling for Graph Neural Networks: Progress, Challenges, and Opportunities

  • Chuang Liu
  • Yibing Zhan
  • Jia Wu
  • Chang Li
  • Bo Du
  • Wenbin Hu
  • Tongliang Liu
  • Dacheng Tao

Graph neural networks have emerged as a leading architecture for many graph-level tasks, such as graph classification and graph generation. As an essential component of the architecture, graph pooling is indispensable for obtaining a holistic graph-level representation of the whole graph. Although a great variety of methods have been proposed in this promising and fast-developing research field, to the best of our knowledge, little effort has been made to systematically summarize these works. To set the stage for the development of future works, in this paper, we attempt to fill this gap by providing a broad review of recent methods for graph pooling. Specifically, 1) we first propose a taxonomy of existing graph pooling methods with a mathematical summary for each category; 2) then, we provide an overview of the libraries related to graph pooling, including the commonly used datasets, model architectures for downstream tasks, and open-source implementations; 3) next, we further outline the applications that incorporate the idea of graph pooling in a variety of domains; 4) finally, we discuss certain critical challenges facing current studies and share our insights on future potential directions for research on the improvement of graph pooling.

IJCAI Conference 2023 Conference Paper

Improving Heterogeneous Model Reuse by Density Estimation

  • Anke Tang
  • Yong Luo
  • Han Hu
  • Fengxiang He
  • Kehua Su
  • Bo Du
  • Yixin Chen
  • Dacheng Tao

This paper studies multiparty learning, aiming to learn a model using the private data of different participants. Model reuse is a promising solution for multiparty learning, assuming that a local model has been trained for each party. Considering the potential sample selection bias among different parties, some heterogeneous model reuse approaches have been developed. However, although pre-trained local classifiers are utilized in these approaches, the characteristics of the local data are not well exploited. This motivates us to estimate the density of local data and design an auxiliary model together with the local classifiers for reuse. To address the scenarios where some local models are not well pre-trained, we further design a multiparty cross-entropy loss for calibration. Building upon existing works, we address a challenging problem of heterogeneous model reuse from a decision theory perspective and take advantage of recent advances in density estimation. Experimental results on both synthetic and benchmark data demonstrate the superiority of the proposed method.

JAAMAS Journal 2023 Journal Article

Price of anarchy of traffic assignment with exponential cost functions

  • Jianglin Qiao
  • Dave de Jonge
  • Bo Du

The rapid evolution of technology in connected automated and autonomous vehicles offers immense potential for revolutionizing future intelligent traffic control and management. This potential is exemplified by the diverse range of control paradigms, ranging from self-routing to centralized control. However, the selection among these paradigms is not merely a technical consideration but a delicate balance between autonomous decision-making and holistic system optimization. A pivotal quantitative parameter in navigating this balance is the concept of the "price of anarchy" (PoA) inherent in autonomous decision frameworks. This paper analyses the price of anarchy for road networks with CAV traffic. We model a traffic network as a routing game in which vehicles are selfish agents who choose routes to travel autonomously to minimize travel delays caused by road congestion. Unlike existing research in which the latency function of road congestion was based on polynomial functions like the well-known BPR function, we focus on routing games where an exponential function can specify the latency of road traffic. We first calculate a tight upper bound for the price of anarchy for this class of games and then compare this result with the tight upper bound of the PoA for routing games with the BPR latency function. The comparison shows that as long as the traffic volume is lower than the road capacity, the tight upper bound of the PoA of the games with the exponential function is lower than the corresponding value with the BPR function. Finally, numerical results based on real-world traffic data demonstrate that the exponential function can approximate road latency as closely as the BPR function with even tighter exponential parameters, which results in a relatively lower upper bound.

NeurIPS Conference 2023 Conference Paper

Revisit Weakly-Supervised Audio-Visual Video Parsing from the Language Perspective

  • Yingying Fan
  • Yu Wu
  • Bo Du
  • Yutian Lin

We focus on the weakly-supervised audio-visual video parsing task (AVVP), which aims to identify and locate all the events in audio/visual modalities. Previous works only concentrate on video-level overall label denoising across modalities, but overlook the segment-level label noise, where adjacent video segments (i.e., 1-second video clips) may contain different events. However, recognizing events on the segment is challenging because its label could be any combination of events that occur in the video. To address this issue, we consider tackling AVVP from the language perspective, since language could freely describe how various events appear in each segment beyond fixed labels. Specifically, we design language prompts to describe all cases of event appearance for each video. Then, the similarity between language prompts and segments is calculated, where the event of the most similar prompt is regarded as the segment-level label. In addition, to deal with the mislabeled segments, we propose to perform dynamic re-weighting on the unreliable segments to adjust their labels. Experiments show that our simple yet effective approach outperforms state-of-the-art methods by a large margin.
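
The prompt-matching step described above can be sketched as a nearest-prompt assignment; the use of cosine similarity and the variable names here are assumptions for illustration, not the paper's exact implementation:

```python
import math

def cosine(a, b):
    # Cosine similarity between two feature vectors.
    dot = sum(x * y for x, y in zip(a, b))
    na = math.sqrt(sum(x * x for x in a))
    nb = math.sqrt(sum(y * y for y in b))
    return dot / (na * nb)

def label_segments(segment_feats, prompt_embs):
    # Assign each segment the index of its most similar language prompt,
    # which then serves as the segment-level label.
    labels = []
    for seg in segment_feats:
        sims = [cosine(seg, p) for p in prompt_embs]
        labels.append(max(range(len(sims)), key=sims.__getitem__))
    return labels
```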

NeurIPS Conference 2023 Conference Paper

SAMRS: Scaling-up Remote Sensing Segmentation Dataset with Segment Anything Model

  • Di Wang
  • Jing Zhang
  • Bo Du
  • Minqiang Xu
  • Lin Liu
  • Dacheng Tao
  • Liangpei Zhang

The success of the Segment Anything Model (SAM) demonstrates the significance of data-centric machine learning. However, due to the difficulties and high costs associated with annotating Remote Sensing (RS) images, a large amount of valuable RS data remains unlabeled, particularly at the pixel level. In this study, we leverage SAM and existing RS object detection datasets to develop an efficient pipeline for generating a large-scale RS segmentation dataset, dubbed SAMRS. In total, SAMRS contains 105,090 images and 1,668,241 instances, surpassing existing high-resolution RS segmentation datasets in size by several orders of magnitude. It provides object category, location, and instance information that can be used for semantic segmentation, instance segmentation, and object detection, either individually or in combination. We also provide a comprehensive analysis of SAMRS from various aspects. Moreover, preliminary experiments highlight the importance of conducting segmentation pre-training with SAMRS to address task discrepancies and alleviate the limitations posed by limited training data during fine-tuning. The code and dataset will be available at https://github.com/ViTAE-Transformer/SAMRS

NeurIPS Conference 2023 Conference Paper

Stability and Generalization of the Decentralized Stochastic Gradient Descent Ascent Algorithm

  • Miaoxi Zhu
  • Li Shen
  • Bo Du
  • Dacheng Tao

The growing size of available data has attracted increasing interest in solving minimax problems in a decentralized manner for various machine learning tasks. Previous theoretical research has primarily focused on the convergence rate and communication complexity of decentralized minimax algorithms, with little attention given to their generalization. In this paper, we investigate the primal-dual generalization bound of the decentralized stochastic gradient descent ascent (D-SGDA) algorithm using the approach of algorithmic stability under both convex-concave and nonconvex-nonconcave settings. Our theory refines the algorithmic stability in a decentralized manner and demonstrates that the decentralized structure does not destroy the stability and generalization of D-SGDA, implying that it can generalize as well as the vanilla SGDA in certain situations. Our results analyze the impact of different topologies on the generalization bound of the D-SGDA algorithm beyond trivial factors such as sample sizes, learning rates, and iterations. We also evaluate the optimization error and balance it with the generalization gap to obtain the optimal population risk of D-SGDA in the convex-concave setting. Additionally, we perform several numerical experiments which validate our theoretical findings.

AAAI Conference 2023 Conference Paper

Unlabeled Imperfect Demonstrations in Adversarial Imitation Learning

  • Yunke Wang
  • Bo Du
  • Chang Xu

Adversarial imitation learning has become a widely used imitation learning framework. The discriminator is often trained by taking expert demonstrations and policy trajectories as examples respectively from two categories (positive vs. negative), and the policy is then expected to produce trajectories that are indistinguishable from the expert demonstrations. But in the real world, the collected expert demonstrations are more likely to be imperfect, where only an unknown fraction of the demonstrations are optimal. Instead of treating imperfect expert demonstrations as absolutely positive or negative, we investigate unlabeled imperfect expert demonstrations as they are. A positive-unlabeled adversarial imitation learning algorithm is developed to dynamically sample expert demonstrations that can well match the trajectories from the constantly optimized agent policy. The trajectories of an initial agent policy could be closer to those non-optimal expert demonstrations, but within the framework of adversarial imitation learning, the agent policy will be optimized to cheat the discriminator and produce trajectories that are similar to the optimal expert demonstrations. Theoretical analysis shows that our method learns from the imperfect demonstrations in a self-paced manner. Experimental results on the MuJoCo and RoboSuite platforms demonstrate the effectiveness of our method from different aspects.

AAAI Conference 2022 Conference Paper

Resistance Training Using Prior Bias: Toward Unbiased Scene Graph Generation

  • Chao Chen
  • Yibing Zhan
  • Baosheng Yu
  • Liu Liu
  • Yong Luo
  • Bo Du

Scene Graph Generation (SGG) aims to build a structured representation of a scene using objects and pairwise relationships, which benefits downstream tasks. However, current SGG methods usually suffer from sub-optimal scene graph generation because of the long-tailed distribution of training data. To address this problem, we propose Resistance Training using Prior Bias (RTPB) for scene graph generation. Specifically, RTPB uses a distribution-based prior bias to improve the model’s ability to detect less frequent relationships during training, thus improving model generalizability on tail categories. In addition, to further explore the contextual information of objects and relationships, we design a contextual encoding backbone network, termed Dual Transformer (DTrans). We perform extensive experiments on a very popular benchmark, VG150, to demonstrate the effectiveness of our method for unbiased scene graph generation. Specifically, our RTPB achieves an improvement of over 10% in mean recall when applied to current SGG methods. Furthermore, DTrans with RTPB outperforms nearly all state-of-the-art methods by a large margin. Code is available at https://github.com/ChCh1999/RTPB

IJCAI Conference 2022 Conference Paper

Self-paced Supervision for Multi-source Domain Adaptation

  • Zengmao Wang
  • Chaoyang Zhou
  • Bo Du
  • Fengxiang He

Multi-source domain adaptation has attracted great attention in the machine learning community. Most of these methods focus on weighting the predictions produced by the adaptation networks of different domains. Thus the domain shifts between certain domains and the target domain are not effectively relieved, so these domains are not fully exploited and may even have a negative influence on the multi-source domain adaptation task. To address this challenge, we propose a multi-source domain adaptation method that gradually improves the adaptation ability of each source domain by producing more high-confidence pseudo-labels with self-paced learning for conditional distribution alignment. The proposed method first trains several separate domain branch networks with single domains and an ensemble branch network with all domains. Then we obtain some high-confidence pseudo-labels with the branch networks and learn the branch-specific pseudo-labels with self-paced learning. Each branch network reduces the domain gap by aligning the conditional distribution with its branch-specific pseudo-labels and the pseudo-labels provided by all branch networks. Experiments on Office31, Office-Home and DomainNet show that the proposed method outperforms the state-of-the-art methods.

NeurIPS Conference 2022 Conference Paper

VF-PS: How to Select Important Participants in Vertical Federated Learning, Efficiently and Securely?

  • Jiawei Jiang
  • Lukas Burkhalter
  • Fangcheng Fu
  • Bolin Ding
  • Bo Du
  • Anwar Hithnawi
  • Bo Li
  • Ce Zhang

Vertical Federated Learning (VFL), which trains federated models over vertically partitioned data, has emerged as an important learning paradigm. However, existing VFL methods are facing two challenges: (1) scalability when # participants grows to even modest scale and (2) diminishing return w.r.t. # participants: not all participants are equally important and many will not introduce quality improvement in a large consortium. Inspired by these two challenges, in this paper, we ask: How can we select l out of m participants, where l ≪ m, that are most important? We call this problem Vertically Federated Participant Selection, and model it with a principled mutual information-based view. Our first technical contribution is VF-MINE—a Vertically Federated Mutual INformation Estimator—that uses one of the most celebrated algorithms in database theory, Fagin’s algorithm, as a building block. Our second contribution is to further optimize VF-MINE to enable VF-PS, a group testing-based participant selection framework. We empirically show that vertically federated participant selection can be orders of magnitude faster than training a full-fledged VFL model, while being able to identify the most important subset of participants that often lead to a VFL model of similar quality.

AAAI Conference 2022 Conference Paper

Visual Semantics Allow for Textual Reasoning Better in Scene Text Recognition

  • Yue He
  • Chen Chen
  • Jing Zhang
  • Juhua Liu
  • Fengxiang He
  • Chaoyue Wang
  • Bo Du

Existing Scene Text Recognition (STR) methods typically use a language model to optimize the joint probability of the 1D character sequence predicted by a visual recognition (VR) model, ignoring the 2D spatial context of visual semantics within and between character instances, so they do not generalize well to arbitrary-shape scene text. To address this issue, we make the first attempt to perform textual reasoning based on visual semantics in this paper. Technically, given the character segmentation maps predicted by a VR model, we construct a subgraph for each instance, where nodes represent the pixels in it and edges are added between nodes based on their spatial similarity. Then, these subgraphs are sequentially connected by their root nodes and merged into a complete graph. Based on this graph, we devise a graph convolutional network for textual reasoning (GTR) by supervising it with a cross-entropy loss. GTR can be easily plugged into representative STR models to improve their performance owing to better textual reasoning. Specifically, we construct our model, namely S-GTR, by paralleling GTR with the language model in a segmentation-based STR baseline, which can effectively exploit the visual-linguistic complementarity via mutual learning. S-GTR sets a new state-of-the-art on six challenging STR benchmarks and generalizes well to multilingual datasets. Code is available at https://github.com/adeline-cs/GTR.

AAAI Conference 2021 Conference Paper

GaussianPath: A Bayesian Multi-Hop Reasoning Framework for Knowledge Graph Reasoning

  • Guojia Wan
  • Bo Du

Recently, multi-hop reasoning over incomplete Knowledge Graphs (KGs) has attracted wide attention due to its desirable interpretability for downstream tasks, such as question answering and knowledge graph completion. Multi-hop reasoning is a typical sequential decision problem, which can be formulated as a Markov decision process (MDP). Subsequently, some reinforcement learning (RL) based approaches have been proposed and proven effective for training an agent that reasons over paths sequentially until reaching the target answer. However, these approaches assume that an entity/relation representation follows a one-point distribution. In fact, different entities and relations may carry different certainties. On the other hand, since REINFORCE, used for updating the policy in these approaches, is a biased policy gradient method, the agent is prone to getting stuck in high-reward paths rather than exploring broad reasoning paths, which leads to premature and suboptimal exploitation. In this paper, we consider a Bayesian reinforcement learning paradigm to harness uncertainty in multi-hop reasoning. By incorporating uncertainty into the representation layer, an RL-trained agent that has uncertainty in a region of the state space becomes more efficient in exploring unknown or less known parts of the KG. In our approach, we build a Bayesian Q-learning architecture as a state-action value function for estimating the expected long-term reward. Initialized with a Gaussian prior or a pre-trained prior distribution, the representation layer carries uncertainty that allows regularizing the training. We conducted extensive experiments on multiple KGs. Experimental results show superior performance over other baselines, with especially significant improvements on the automatically extracted KG.

IJCAI Conference 2021 Conference Paper

Learning Visual Words for Weakly-Supervised Semantic Segmentation

  • Lixiang Ru
  • Bo Du
  • Chen Wu

Current weakly-supervised semantic segmentation (WSSS) methods with image-level labels mainly adopt class activation maps (CAM) to generate the initial pseudo labels. However, CAM usually only identifies the most discriminative object extents, which is attributed to the fact that the network doesn't need to discover the integral object to recognize image-level labels. In this work, to tackle this problem, we proposed to simultaneously learn the image-level labels and local visual word labels. Specifically, in each forward propagation, the feature maps of the input image will be encoded to visual words with a learnable codebook. By enforcing the network to classify the encoded fine-grained visual words, the generated CAM could cover more semantic regions. Besides, we also proposed a hybrid spatial pyramid pooling module that could preserve local maximum and global average values of feature maps, so that more object details and less background were considered. Based on the proposed methods, we conducted experiments on the PASCAL VOC 2012 dataset. Our proposed method achieved 67.2% mIoU on the val set and 67.3% mIoU on the test set, which outperformed recent state-of-the-art methods.

JBHI Journal 2021 Journal Article

Multi-Task Learning for Registering Images With Large Deformation

  • Bo Du
  • Jiandong Liao
  • Baris Turkbey
  • Pingkun Yan

Accurate registration of prostate magnetic resonance imaging (MRI) images of the same subject acquired at different time points helps diagnose cancer and monitor tumor progression. However, it is very challenging, especially when one image was acquired with the use of an endorectal coil (ERC) but the other was not, which causes significant deformation. Classical iterative image registration methods are also computationally intensive. Deep learning based registration frameworks have recently been developed and demonstrated promising performance. However, the lack of proper constraints often results in unrealistic registration. In this paper, we propose a multi-task learning based registration network with anatomical constraint to address these issues. The proposed approach uses a cycle constraint loss to achieve forward/backward registration and an inverse constraint loss to encourage diffeomorphic registration. In addition, an adaptive anatomical constraint aiming at regularizing the registration network with the use of anatomical labels is introduced through weak supervision. Our experiments on registering prostate MR images of the same subject obtained at different time points with and without ERC show that the proposed method achieves very promising performance under different measures in dealing with the large deformation. Compared with other existing methods, our approach works more efficiently, with an average running time of less than a second, and is able to obtain more visually realistic results.

AAAI Conference 2021 Conference Paper

RNA Secondary Structure Representation Network for RNA-proteins Binding Prediction

  • Ziyi Liu
  • Fulin Luo
  • Bo Du

RNA-binding proteins (RBPs) play a significant part in several biological processes in the living cell, such as gene regulation and mRNA localization. Several deep learning methods, especially models based on convolutional neural networks (CNN), have been used to predict the binding sites. However, previous methods fail to represent RNA secondary structure features. Traditional deep learning methods generally transform the RNA secondary structure to a regular matrix that cannot reveal the topological structure information of RNA. To effectively extract the structure features of RNA, we propose an RNA secondary structure representation network (RNASSR-Net) based on a graph convolutional neural network (GCN) and a convolutional neural network (CNN) for RBP binding prediction. RNASSR-Net constructs a graph model derived from the RNA secondary structure to learn the topological properties of RNA. Then, it obtains the spatial importance of each base in RNA with a CNN to guide the representation of the RNA secondary structure. Finally, RNASSR-Net combines the structure and sequence features to predict the binding sites. Experimental results demonstrate that the proposed method outperforms several state-of-the-art methods on the benchmark datasets and achieves larger improvements on small datasets. Besides, the proposed RNASSR-Net can also detect accurate motifs consistent with experimentally verified ones, revealing binding region locations and RNA structure interpretations that may offer biological guidance in the future.

IJCAI Conference 2021 Conference Paper

Robust Adversarial Imitation Learning via Adaptively-Selected Demonstrations

  • Yunke Wang
  • Chang Xu
  • Bo Du

The agent in imitation learning (IL) is expected to mimic the behavior of the expert. Its performance relies highly on the quality of the given expert demonstrations. However, the assumption that collected demonstrations are optimal cannot always hold in real-world tasks, which can seriously influence the performance of the learned agent. In this paper, we propose a robust method within the framework of Generative Adversarial Imitation Learning (GAIL) to address the imperfect demonstration issue, in which good demonstrations can be adaptively selected for training while bad demonstrations are abandoned. Specifically, a binary weight is assigned to each expert demonstration to indicate whether to select it for training. The reward function in GAIL is employed to determine this weight (i.e., a higher reward results in a higher weight). Compared to existing solutions that require auxiliary information about this weight, we set up the connection between the weight and the model so that we can jointly optimize GAIL and learn the latent weight. Besides hard binary weighting, we also propose a soft weighting scheme. Experiments on MuJoCo demonstrate that the proposed method outperforms other GAIL-based methods when dealing with imperfect demonstrations.

AAAI Conference 2020 Conference Paper

Compressed Self-Attention for Deep Metric Learning

  • Ziye Chen
  • Mingming Gong
  • Yanwu Xu
  • Chaohui Wang
  • Kun Zhang
  • Bo Du

In this paper, we aim to enhance the self-attention (SA) mechanism for deep metric learning in visual perception, by capturing richer contextual dependencies in visual data. To this end, we propose a novel module, named compressed self-attention (CSA), which significantly reduces the computation and memory cost with a negligible decrease in accuracy with respect to the original SA mechanism, thanks to the following two characteristics: i) it only needs to compute a small number of base attention maps for a small number of base feature vectors; and ii) the output at each spatial location can be simply obtained by an adaptive weighted average of the outputs calculated from the base attention maps. The high computational efficiency of CSA enables the application to high-resolution shallow layers in convolutional neural networks with little additional cost. In addition, CSA makes it practical to further partition the feature maps into groups along the channel dimension and compute attention maps for features in each group separately, thus increasing the diversity of long-range dependencies and accordingly boosting the accuracy. We evaluate the performance of CSA via extensive experiments on two metric learning tasks: person re-identification and local descriptor learning. Qualitative and quantitative comparisons with the latest methods demonstrate the significance of CSA in this topic.

IJCAI Conference 2020 Conference Paper

Compressed Self-Attention for Deep Metric Learning with Low-Rank Approximation

  • Ziye Chen
  • Mingming Gong
  • Lingjuan Ge
  • Bo Du

In this paper, we apply the self-attention (SA) mechanism to boost the performance of deep metric learning. However, due to the pairwise similarity measurement, the cost of storing and manipulating the complete attention maps makes it infeasible for large inputs. To solve this problem, we propose a compressed self-attention with low-rank approximation (CSALR) module, which significantly reduces the computation and memory costs without sacrificing the accuracy. In CSALR, the original attention map is decomposed into a landmark attention map and a combination coefficient map with a small number of landmark feature vectors sampled from the input feature map by average pooling. Thanks to the efficiency of CSALR, we can apply CSALR to high-resolution shallow convolutional layers and implement a multi-head form of CSALR, which further boosts the performance. We evaluate the proposed CSALR on person re-identification, which is a typical metric learning task. Extensive experiments show the effectiveness and efficiency of CSALR in deep metric learning and its superiority over the baselines.
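
A rough NumPy sketch of landmark-based low-rank attention in the spirit of CSALR, where the full n × n attention map is never materialized; landmark sampling by average pooling follows the abstract, but the exact factorization here is an assumption:

```python
import numpy as np

def softmax(x, axis=-1):
    e = np.exp(x - x.max(axis=axis, keepdims=True))
    return e / e.sum(axis=axis, keepdims=True)

def landmark_attention(X, num_landmarks):
    # X: (n, d) feature vectors; n must be divisible by num_landmarks.
    # Landmarks are average-pooled groups of consecutive input vectors,
    # standing in for the paper's sampling scheme.
    n, d = X.shape
    L = X.reshape(num_landmarks, n // num_landmarks, d).mean(axis=1)
    coeff = softmax(X @ L.T / np.sqrt(d))      # (n, m) combination coefficient map
    land_attn = softmax(L @ X.T / np.sqrt(d))  # (m, n) landmark attention map
    # Output costs O(n*m*d) instead of the O(n^2*d) of full self-attention.
    return coeff @ (land_attn @ X)
```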

IJCAI Conference 2020 Conference Paper

Multichannel Color Image Denoising via Weighted Schatten p-norm Minimization

  • Xinjian Huang
  • Bo Du
  • Weiwei Liu

The R, G and B channels of a color image generally have different noise statistical properties or noise strengths. It is thus problematic to apply grayscale image denoising algorithms to color image denoising. In this paper, based on the non-local self-similarity of an image and the different noise strength of each channel, we propose a MultiChannel Weighted Schatten p-Norm Minimization (MCWSNM) model for RGB color image denoising. More specifically, considering a small local RGB patch in a noisy image, we first find its nonlocal similar cubic patches in a search window with an appropriate size. These similar cubic patches are then vectorized and grouped to construct a noisy low-rank matrix, which can be recovered using the Schatten p-norm minimization framework. Moreover, a weight matrix is introduced to balance each channel’s contribution to the final denoising results. The proposed MCWSNM can be solved via the alternating direction method of multipliers. The convergence property of the proposed method is also theoretically analyzed. Experiments conducted on both synthetic and real noisy color image datasets demonstrate highly competitive denoising performance, outperforming comparison algorithms, including several methods based on neural networks.
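
For the special case p = 1, the weighted Schatten p-norm proximal step reduces to weighted singular value thresholding, sketched below; the full MCWSNM model (per-channel weight matrix, ADMM solver) is more involved:

```python
import numpy as np

def weighted_svt(Y, weights):
    # Weighted singular value thresholding: the proximal operator of the
    # weighted nuclear norm (Schatten p-norm with p = 1, for non-descending
    # weights). Singular values are shrunk by their per-value weights.
    U, s, Vt = np.linalg.svd(Y, full_matrices=False)
    s_shrunk = np.maximum(s - weights, 0.0)
    # (U * s_shrunk) scales each column of U by the shrunk singular value.
    return (U * s_shrunk) @ Vt
```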

IJCAI Conference 2020 Conference Paper

Positive Unlabeled Learning with Class-prior Approximation

  • Shizhen Chang
  • Bo Du
  • Liangpei Zhang

Positive unlabeled (PU) learning aims to train a binary classifier from a set of positive labeled samples and other unlabeled samples. Much research has been done on this special branch of weakly supervised classification problems. Since only part of the positive class is labeled, the classical PU model trains the classifier assuming the class prior is known. However, the true class prior is usually difficult to obtain and must be learned from the given data, so traditional methods may not work. In this paper, we formulate a convex formulation to jointly solve the class-prior unknown problem and train an accurate classifier without any class-prior assumptions or additional negative samples. The class prior is estimated by pursuing the optimal solution of gradient thresholding, and the classifier is simultaneously trained by minimizing the empirical unbiased risk. The detailed derivation and theoretical analysis of the proposed model are outlined, and experimental comparisons with other representative methods demonstrate the superiority of our method.
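
For context, the classical unbiased PU risk with a known class prior rewrites the negative-class risk using positive and unlabeled data; the paper's contribution is to avoid assuming this prior, which the sketch below still takes as given:

```python
def hinge(z, y):
    # Hinge surrogate loss for score z and label y in {+1, -1}.
    return max(0.0, 1.0 - y * z)

def unbiased_pu_risk(scores_p, scores_u, prior, loss=hinge):
    # Classical unbiased PU risk with a known class prior:
    # R = pi * R_p^+ + (R_u^- - pi * R_p^-).
    # On finite samples it can go negative, one motivation for
    # later non-negative corrections.
    rp_pos = sum(loss(z, +1) for z in scores_p) / len(scores_p)
    rp_neg = sum(loss(z, -1) for z in scores_p) / len(scores_p)
    ru_neg = sum(loss(z, -1) for z in scores_u) / len(scores_u)
    return prior * rp_pos + (ru_neg - prior * rp_neg)
```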

AAAI Conference 2020 Conference Paper

Reinforcement Learning Based Meta-Path Discovery in Large-Scale Heterogeneous Information Networks

  • Guojia Wan
  • Bo Du
  • Shirui Pan
  • Gholamreza Haffari

Meta-paths are important tools for a wide variety of data mining and network analysis tasks in Heterogeneous Information Networks (HINs), due to their flexibility and interpretability to capture the complex semantic relations among objects. To date, most HIN analysis still relies on handcrafting meta-paths, which requires rich domain knowledge that is extremely difficult to obtain in complex, large-scale, and schema-rich HINs. In this work, we present a novel framework, Meta-path Discovery with Reinforcement Learning (MPDRL), to identify informative meta-paths from complex and large-scale HINs. To capture different semantic information between objects, we propose a novel multi-hop reasoning strategy in a reinforcement learning framework which aims to infer the next promising relation that links a source entity to a target entity. Moreover, to improve efficiency, we develop a type-context representation embedding approach to scale the RL framework to handle million-scale HINs. As multi-hop reasoning generates rich meta-paths of various lengths, we further perform a meta-path induction step to summarize the important meta-paths using the Lowest Common Ancestor principle. Experimental results on two large-scale HINs, Yago and NELL, validate our approach and demonstrate that our algorithm not only achieves superior performance in the link prediction task, but also identifies useful meta-paths that would have been ignored by human experts.

AAAI Conference 2020 Conference Paper

Temporal Network Embedding with High-Order Nonlinear Information

  • Zhenyu Qiu
  • Wenbin Hu
  • Jia Wu
  • Weiwei Liu
  • Bo Du
  • Xiaohua Jia

Temporal network embedding, which aims to learn low-dimensional representations of nodes in temporal networks that capture and preserve the network structure and evolution pattern, has attracted much attention from the scientific community. However, existing methods suffer from two main disadvantages: 1) they cannot preserve node temporal proximity, which captures important properties of the network structure; and 2) they cannot represent the nonlinear structure of temporal networks. In this paper, we propose a high-order nonlinear information preserving (HNIP) embedding method to address these issues. Specifically, we define three orders of temporal proximity by exploring network historical information with a time exponential decay model to quantify the temporal proximity between nodes. Then, we propose a novel deep guided auto-encoder to capture the highly nonlinear structure. Meanwhile, the training set of the guided auto-encoder is generated by the temporal random walk (TRW) algorithm. By training the proposed deep guided auto-encoder with a specific mini-batch stochastic gradient descent algorithm, HNIP can efficiently preserve the temporal proximities and highly nonlinear structure of temporal networks. Experimental results on four real-world networks demonstrate the effectiveness of the proposed method.
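
The time exponential decay model for temporal proximity can be sketched as follows; the decay rate θ and the aggregation over historical interaction timestamps are illustrative assumptions, not the paper's exact definitions:

```python
import math

def decay(t_event, t_now, theta):
    # Exponential time decay: recent interactions receive larger weight.
    return math.exp(-theta * (t_now - t_event))

def first_order_proximity(timestamps, t_now, theta=0.5):
    # Temporal proximity of a node pair as the sum of decayed weights
    # over their historical interaction timestamps.
    return sum(decay(t, t_now, theta) for t in timestamps)
```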

IJCAI Conference 2020 Conference Paper

TextFuseNet: Scene Text Detection with Richer Fused Features

  • Jian Ye
  • Zhe Chen
  • Juhua Liu
  • Bo Du

Arbitrary shape text detection in natural scenes is an extremely challenging task. Unlike existing text detection approaches that only perceive texts based on limited feature representations, we propose a novel framework, namely TextFuseNet, to exploit richer fused features for text detection. More specifically, we propose to perceive texts from three levels of feature representations, i.e., character-, word- and global-level, and then introduce a novel text representation fusion technique to help achieve robust arbitrary text detection. The multi-level feature representation can adequately describe texts by dissecting them into individual characters while still maintaining their general semantics. TextFuseNet then collects and merges the texts’ features from different levels using a multi-path fusion architecture which can effectively align and fuse different representations. In practice, our proposed TextFuseNet can learn a more adequate description of arbitrary-shape texts, suppressing false positives and producing more accurate detection results. Our proposed framework can also be trained with weak supervision for those datasets that lack character-level annotations. Experiments on several datasets show that the proposed TextFuseNet achieves state-of-the-art performance. Specifically, we achieve an F-measure of 94.3% on ICDAR2013, 92.1% on ICDAR2015, 87.1% on Total-Text and 86.6% on CTW-1500, respectively.

IJCAI Conference 2019 Conference Paper

Accelerated Inference Framework of Sparse Neural Network Based on Nested Bitmask Structure

  • Yipeng Zhang
  • Bo Du
  • Lefei Zhang
  • Rongchun Li
  • Yong Dou

In order to satisfy the ever-growing demand for high-performance processors for neural networks, state-of-the-art processing units tend to use application-oriented circuits to replace the Processing Engines (PEs) on the GPU under circumstances where low-power solutions are required. The application-oriented PE is fully optimized in terms of the circuit architecture and eliminates incorrect data dependency and instruction redundancy. In this paper, we propose a novel encoding approach for a sparse neural network after pruning. We partition the weight matrix into numerous blocks and use a low-rank binary map to represent the validity of these blocks. Furthermore, the elements in each nonzero block are also encoded into two submatrices: one is the binary stream discriminating the zero/nonzero positions, while the other is the pure nonzero elements stored in a FIFO. In the experimental part, we implement a well pre-trained sparse neural network on the Xilinx FPGA VC707. Experimental results show that our algorithm outperforms the other benchmarks. Our approach successfully optimizes the throughput and the energy efficiency of processing a single frame. Accordingly, we contend that the Nested Bitmask Neural Network (NBNN) is an efficient neural network structure with only minor accuracy loss on the SoC system.
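
The nested bitmask encoding described above (a block-level validity bitmap, a per-block zero/nonzero bitmask, and a FIFO of nonzero values) can be illustrated on a 1-D weight vector; the paper targets 2-D weight blocks in hardware, so this sketch only shows the encoding idea:

```python
def nbnn_encode(weights, block):
    # Encode a sparse 1-D weight vector as
    # (block validity bitmap, element bitmask, nonzero FIFO).
    block_map, bitmask, fifo = [], [], []
    for i in range(0, len(weights), block):
        blk = weights[i:i + block]
        if any(w != 0 for w in blk):
            block_map.append(1)
            for w in blk:
                bitmask.append(1 if w != 0 else 0)
                if w != 0:
                    fifo.append(w)
        else:
            block_map.append(0)  # all-zero block: no bitmask/FIFO entries
    return block_map, bitmask, fifo

def nbnn_decode(block_map, bitmask, fifo, block):
    # Reconstruct the dense weight vector from the nested encoding.
    out, b, f = [], 0, 0
    for valid in block_map:
        if not valid:
            out.extend([0] * block)
            continue
        for _ in range(block):
            out.append(fifo[f] if bitmask[b] else 0)
            f += bitmask[b]
            b += 1
    return out
```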

IJCAI Conference 2019 Conference Paper

MUSICAL: Multi-Scale Image Contextual Attention Learning for Inpainting

  • Ning Wang
  • Jingyuan Li
  • Lefei Zhang
  • Bo Du

We study the task of image inpainting, where an image with a missing region is recovered with plausible content. Recent approaches based on deep neural networks have shown the potential to produce elegant detail and are able to take advantage of background information, which provides texture cues for the missing region. These methods often perform pixel- or patch-level replacement on the deep feature maps of the missing region, enabling the generated content to share the texture of the background region. However, this kind of replacement is a local strategy and often performs poorly when the background information is misleading. To this end, we propose a multi-scale image contextual attention learning (MUSICAL) strategy that flexibly handles richer background information while avoiding its misuse. Such a strategy alone, however, may not generate content of a reasonable style. To address this issue, both a style loss and a perceptual loss are introduced into the proposed method to achieve style consistency in the generated image. Furthermore, we observe that replacing some of the downsampling layers in the baseline network with stride-1 dilated convolution layers is beneficial for producing sharper and more fine-detailed results. Experiments on the Paris Street View, Places, and CelebA datasets indicate the superior performance of our approach compared to the state of the art.

IJCAI Conference 2019 Conference Paper

Pseudo Supervised Matrix Factorization in Discriminative Subspace

  • Jiaqi Ma
  • Yipeng Zhang
  • Lefei Zhang
  • Bo Du
  • Dapeng Tao

Non-negative Matrix Factorization (NMF) and spectral clustering have proven efficient and effective for data clustering tasks and have been applied to various real-world scenarios. However, traditional methods still have drawbacks: (1) most existing algorithms operate on the high-dimensional data directly while neglecting the intrinsic data structure in the low-dimensional subspace; and (2) the pseudo-information obtained during optimization is not exploited by most spectral clustering and manifold regularization methods. In this paper, a novel unsupervised matrix factorization method, Pseudo Supervised Matrix Factorization (PSMF), is proposed for data clustering. The main contributions are threefold: (1) to cluster in the discriminant subspace, Linear Discriminant Analysis (LDA) is combined with NMF in a unified framework; (2) we propose a pseudo-supervised manifold regularization term which uses the pseudo-information to guide the regularizer toward a subspace that discriminates between classes; and (3) an efficient optimization algorithm with proven convergence is designed to solve the proposed problem. Extensive experiments on multiple benchmark datasets illustrate that the proposed model outperforms other state-of-the-art clustering algorithms.

AAAI Conference 2019 Conference Paper

Self-Ensembling Attention Networks: Addressing Domain Shift for Semantic Segmentation

  • Yonghao Xu
  • Bo Du
  • Lefei Zhang
  • Qian Zhang
  • Guoli Wang
  • Liangpei Zhang

Recent years have witnessed the great success of deep learning models in semantic segmentation. Nevertheless, these models may not generalize well to unseen image domains due to the phenomenon of domain shift. Since pixel-level annotations are laborious to collect, developing algorithms that can adapt labeled data from a source domain to a target domain is of great significance. To this end, we propose self-ensembling attention networks to reduce the domain gap between different datasets. To the best of our knowledge, the proposed method is the first attempt to introduce a self-ensembling model to domain adaptation for semantic segmentation, which provides a different view on how to learn domain-invariant features. Moreover, since different regions of an image usually correspond to different levels of domain gap, we introduce an attention mechanism into the framework to generate attention-aware features, which in turn guide the calculation of the consistency loss in the target domain. Experiments on two benchmark datasets demonstrate that the proposed framework yields competitive performance compared with state-of-the-art methods.
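Self-ensembling is commonly instantiated as a mean-teacher scheme: the teacher's weights are an exponential moving average (EMA) of the student's, and a consistency loss penalizes disagreement between their predictions on target-domain inputs. The sketch below assumes that instantiation and the attention weighting of the consistency loss described above; all names and the EMA rate are chosen for illustration, not taken from the paper.

```python
import numpy as np

def ema_update(teacher, student, alpha=0.99):
    """Move teacher weights toward the student by exponential moving average."""
    return {k: alpha * teacher[k] + (1 - alpha) * student[k] for k in teacher}

def attention_weighted_consistency(student_probs, teacher_probs, attention):
    """Mean-squared consistency between student and teacher predictions,
    weighted per pixel by an attention map of shape (H, W)."""
    sq = (student_probs - teacher_probs) ** 2   # (H, W, C)
    per_pixel = sq.mean(axis=-1)                # (H, W)
    return float((attention * per_pixel).sum() / attention.sum())
```

Pixels with a large attention weight (a large estimated domain gap) contribute more to the loss, steering the adaptation toward the regions that need it most.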

IJCAI Conference 2018 Conference Paper

Matrix completion with Preference Ranking for Top-N Recommendation

  • Zengmao Wang
  • Yuhong Guo
  • Bo Du

Matrix completion has become a popular method for top-N recommendation due to the low-rank nature of sparse rating matrices. However, many existing methods produce top-N recommendations by recovering the user-item matrix solely through a low-rank function or its relaxations, ignoring other intrinsic characteristics of the top-N recommendation task such as preference ranking over the items. In this paper, we propose a novel matrix completion method that integrates the low-rank and preference-ranking characteristics of the recommendation matrix under a self-recovery model for top-N recommendation. The proposed method is formulated as a joint minimization problem and solved using an ADMM algorithm. We conduct experiments on e-commerce datasets, and the results show that the proposed approach outperforms several state-of-the-art methods.
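Inside an ADMM solver, the low-rank part of such a joint objective typically reduces to singular value thresholding (SVT), the proximal operator of the nuclear norm. The toy loop below shows only that generic step, not the paper's ranking term or self-recovery model; `tau` and the iteration count are illustrative.

```python
import numpy as np

def svt(X, tau):
    """Singular value thresholding: shrink singular values by tau.
    This is the proximal operator of the nuclear norm."""
    U, s, Vt = np.linalg.svd(X, full_matrices=False)
    s = np.maximum(s - tau, 0.0)
    return (U * s) @ Vt

def complete(M, mask, tau=0.1, iters=200):
    """Toy completion loop: alternately enforce low rank via SVT and
    agreement with the observed entries (mask == True)."""
    X = np.where(mask, M, 0.0)
    for _ in range(iters):
        X = svt(X, tau)
        X[mask] = M[mask]   # keep observed ratings fixed
    return X
```

In a full ADMM formulation the SVT step would be interleaved with updates for the ranking-related variables and the dual variables.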

AAAI Conference 2018 Conference Paper

Nonlocal Patch Based t-SVD for Image Inpainting: Algorithm and Error Analysis

  • Liangchen Song
  • Bo Du
  • Lefei Zhang
  • Liangpei Zhang
  • Jia Wu
  • Xuelong Li

In this paper, we propose a novel image inpainting framework consisting of an interpolation step and a low-rank tensor completion step. More specifically, we first initialize the image with triangulation-based linear interpolation, and then find similar patches for each patch centered on a missing entry. Treating a group of patch matrices as a tensor, we employ the recently proposed t-SVD tensor completion algorithm with a warm-start strategy to inpaint it. We observe that the interpolation step is such a rough initialization that the similar patches we find may not exactly match the reference; we name this problem Patch Mismatch and analyze the error it causes thoroughly. Our theoretical analysis shows that the error caused by Patch Mismatch can be decomposed into two components: one can be bounded under a reasonable assumption named local patch similarity, while the other is smaller than the corresponding error of matrix-based methods. Experiments on real images verify our method's superiority over state-of-the-art inpainting methods.
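t-SVD methods operate in the Fourier domain along the tensor's third mode: each frontal slice of the FFT-transformed tensor is handled with an ordinary matrix SVD. Below is a minimal sketch of the tensor singular value thresholding step, assuming that standard construction; `tau` is illustrative, and the paper's full completion loop with warm starts is not reproduced.

```python
import numpy as np

def tsvd_threshold(T, tau):
    """Tensor singular value thresholding in the t-SVD sense:
    FFT along mode 3, per-slice matrix SVT, inverse FFT."""
    F = np.fft.fft(T, axis=2)
    out = np.empty_like(F)
    for k in range(T.shape[2]):
        U, s, Vt = np.linalg.svd(F[:, :, k], full_matrices=False)
        s = np.maximum(s - tau, 0.0)         # shrink tubal singular values
        out[:, :, k] = (U * s) @ Vt
    return np.real(np.fft.ifft(out, axis=2))
```

A completion algorithm would alternate this thresholding with re-imposing the observed entries, much like matrix SVT but on the patch-group tensor.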

IJCAI Conference 2018 Conference Paper

R-SVM+: Robust Learning with Privileged Information

  • Xue Li
  • Bo Du
  • Chang Xu
  • Yipeng Zhang
  • Lefei Zhang
  • Dacheng Tao

In practice, the assumption that training and test data are clean is not always satisfied. The performance of existing methods in the learning using privileged information (LUPI) paradigm may be seriously challenged by the lack of clear strategies for handling potential noise in the data. This paper proposes a novel Robust SVM+ (R-SVM+) algorithm based on a rigorous theoretical analysis. Under the SVM+ framework of the LUPI paradigm, we study the lower bound of perturbations, of both the example feature data and the privileged feature data, that will mislead the model into wrong decisions. By maximizing this lower bound, the learned model's tolerance to perturbations is increased. Accordingly, a novel regularization function is introduced to upgrade a variant form of SVM+. The objective function of R-SVM+ is transformed into a quadratic programming problem, which can be efficiently optimized with off-the-shelf solvers. Experiments on real-world datasets demonstrate the necessity of studying robust SVM+ and the effectiveness of the proposed algorithm.

IJCAI Conference 2017 Conference Paper

Adaptive Manifold Regularized Matrix Factorization for Data Clustering

  • Lefei Zhang
  • Qian Zhang
  • Bo Du
  • Jane You
  • Dacheng Tao

Data clustering is the task of grouping data samples into clusters based on the relationships among samples and the structures hidden in the data; it is a fundamental and important topic in data mining and machine learning. In the literature, spectral clustering is one of the most popular approaches and has many recent variants. However, the performance of spectral clustering is determined by the affinity matrix, which is usually computed by a predefined model (e.g., a Gaussian kernel function) with a carefully tuned parameter combination, and may be far from optimal in practice. In this paper, we propose to cast the observed data clustering as a robust matrix factorization problem, while simultaneously learning an affinity matrix to regularize the factorization. The solution of the proposed adaptive manifold regularized matrix factorization (AMRMF) is obtained by a novel Augmented Lagrangian Multiplier (ALM) based algorithm. Experimental results on standard clustering datasets demonstrate superior performance over existing alternatives.

IJCAI Conference 2017 Conference Paper

On Gleaning Knowledge from Multiple Domains for Active Learning

  • Zengmao Wang
  • Bo Du
  • Lefei Zhang
  • Liangpei Zhang
  • Ruimin Hu
  • Dacheng Tao

How can a doctor diagnose new diseases, which emerge over time, with little historical knowledge? Active learning is a promising way to address the problem by querying the most informative samples. Since the diagnosed cases of a new disease are very limited, gleaning knowledge from other domains (classical prescriptions) to prevent the bias of active learning is vital for accurate diagnosis. In this paper, we propose a framework that gleans knowledge from multiple domains for active learning, in a single unified formulation, by querying the most uncertain and representative samples from the target domain and calculating importance weights for re-weighting the source data. The weights are optimized by both a supervised classifier and distribution matching between the source and target domains with maximum mean discrepancy. In addition, a multi-domain active learning method is designed as an instance of the proposed framework. The proposed method is verified on newsgroup and handwritten digit recognition tasks, where it outperforms state-of-the-art methods.
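The distribution-matching term can be illustrated with a plain (biased) estimator of the squared maximum mean discrepancy (MMD) under an RBF kernel. The kernel choice, `gamma`, and the estimator variant are assumptions for this sketch, not the paper's exact formulation.

```python
import numpy as np

def mmd2(X, Y, gamma=1.0):
    """Biased estimator of the squared MMD between samples X and Y
    (rows are samples) under an RBF kernel exp(-gamma * ||a - b||^2)."""
    def k(A, B):
        d = ((A[:, None, :] - B[None, :, :]) ** 2).sum(-1)
        return np.exp(-gamma * d)
    return k(X, X).mean() + k(Y, Y).mean() - 2.0 * k(X, Y).mean()
```

A small MMD indicates the re-weighted source distribution matches the target distribution, which is exactly what the importance-weight optimization pushes toward.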

AAAI Conference 2017 Conference Paper

Robust Manifold Matrix Factorization for Joint Clustering and Feature Extraction

  • Lefei Zhang
  • Qian Zhang
  • Bo Du
  • Dacheng Tao
  • Jane You

Low-rank matrix approximation has been widely used for data subspace clustering and feature representation in many computer vision and pattern recognition applications. However, to enhance discriminability, most matrix approximation based feature extraction algorithms first generate cluster labels with some clustering algorithm (e.g., k-means) and then perform the matrix approximation guided by that label information. In addition, noises and outliers in the dataset with large reconstruction errors easily dominate the objective function under conventional ℓ2-norm based squared-residue minimization. In this paper, we propose a novel clustering and feature extraction algorithm based on a unified low-rank matrix factorization framework, which suggests that the observed data matrix can be approximated by the product of a projection matrix and a low-dimensional representation, where the low-dimensional representation is in turn approximated by a cluster indicator and a latent feature matrix simultaneously. Furthermore, we propose using the ℓ2,1-norm and integrating manifold regularization to further improve the model. A novel Augmented Lagrangian Method (ALM) based procedure is designed to effectively and efficiently seek the optimal solution. Experimental results from both clustering and feature extraction perspectives demonstrate the superior performance of the proposed method.
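The robustness argument for the ℓ2,1-norm is concrete: each row's residual enters the objective unsquared, so an outlier row contributes linearly rather than quadratically. A small sketch of the norm and its row-wise shrinkage proximal operator, the step an ALM-style solver would apply to the residual matrix; names and the threshold value are illustrative.

```python
import numpy as np

def l21_norm(E):
    """l2,1-norm: sum of row-wise l2 norms of E."""
    return float(np.sqrt((E ** 2).sum(axis=1)).sum())

def l21_prox(E, tau):
    """Proximal operator of tau * ||.||_{2,1}: shrink each row's norm by
    tau, zeroing rows whose norm falls below tau."""
    norms = np.sqrt((E ** 2).sum(axis=1, keepdims=True))
    scale = np.maximum(1.0 - tau / np.maximum(norms, 1e-12), 0.0)
    return E * scale
```

Rows with small residuals are driven exactly to zero, which is why the ℓ2,1 penalty identifies and discounts outlier samples rather than letting them dominate.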