Arrow Research search

Author name cluster

Mingli Song

Possible papers associated with this exact author name in Arrow. This page groups case-insensitive exact name matches and is not a full identity disambiguation profile.

82 papers
2 author rows

Possible papers (82)

AAAI Conference 2026 Conference Paper

D3-RSMDE: 40× Faster and High-Fidelity Remote Sensing Monocular Depth Estimation

  • Ruizhi Wang
  • Weihan Li
  • Zunlei Feng
  • Haofei Zhang
  • Mingli Song
  • Jiayu Wang
  • Jie Song
  • Li Sun

Real-time, high-fidelity monocular depth estimation from remote sensing imagery is crucial for numerous applications, yet existing methods face a stark trade-off between accuracy and efficiency. Vision Transformer (ViT) backbones make dense prediction fast but often exhibit poor perceptual quality; conversely, diffusion models offer high fidelity at a prohibitive computational cost. To overcome these limitations, we propose Depth Detail Diffusion for Remote Sensing Monocular Depth Estimation (D³-RSMDE), an efficient framework designed to achieve an optimal balance between speed and quality. Our framework first leverages a ViT-based module to rapidly generate a high-quality preliminary depth map, which serves as a structural prior, effectively replacing the time-consuming initial structure-generation stage of diffusion models. Based on this prior, we propose a Progressive Linear Blending Refinement (PLBR) strategy, which uses a lightweight U-Net to refine the details in only a few iterations. The entire refinement step operates efficiently in a compact latent space supported by a Variational Autoencoder (VAE). Extensive experiments demonstrate that D³-RSMDE achieves a notable 11.85% reduction in the Learned Perceptual Image Patch Similarity (LPIPS) perceptual metric over leading models like Marigold, while also achieving over a 40× speedup in inference and maintaining VRAM usage comparable to lightweight ViT models.

AAAI Conference 2026 Conference Paper

Neural Graph Navigation for Intelligent Subgraph Matching

  • Yuchen Ying
  • Yiyang Dai
  • Wenda Li
  • Wenjie Huang
  • Rui Wang
  • Tongya Zheng
  • Yu Wang
  • Hanyang Yuan

Subgraph matching, a cornerstone of relational pattern detection in domains ranging from biochemical systems to social network analysis, faces significant computational challenges due to the dramatically growing search space. Existing methods address this problem within a filtering-ordering-enumeration framework, in which the enumeration stage recursively matches the query graph against the candidate subgraphs of the data graph. However, the lack of awareness of subgraph structural patterns leads to a costly brute-force enumeration, thereby critically motivating the need for intelligent navigation in subgraph matching. To address this challenge, we propose Neural Graph Navigation (NeuGN), a neuro-heuristic framework that transforms brute-force enumeration into neural-guided search by integrating neural navigation mechanisms into the core enumeration process. By preserving heuristic-based completeness guarantees while incorporating neural intelligence, NeuGN significantly reduces the First Match Steps by up to 98.2% compared to state-of-the-art methods across six real-world datasets.
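
To ground the enumeration stage that NeuGN replaces with neural guidance, here is a minimal brute-force backtracking enumerator for (non-induced) subgraph matching; the adjacency-matrix encoding and names are illustrative, not from the paper:

```python
def enumerate_matches(query_adj, data_adj):
    """Backtracking enumeration: injectively map each query vertex to a
    data vertex so that every query edge is preserved (non-induced)."""
    q_n, d_n = len(query_adj), len(data_adj)
    matches = []

    def extend(mapping):
        u = len(mapping)                    # next query vertex to place
        if u == q_n:
            matches.append(tuple(mapping))
            return
        for v in range(d_n):
            if v in mapping:                # keep the mapping injective
                continue
            # every edge from u back to an already-placed query vertex
            # must exist between their images in the data graph
            if all(data_adj[v][mapping[w]] for w in range(u) if query_adj[u][w]):
                extend(mapping + [v])

    extend([])
    return matches

triangle = [[0, 1, 1], [1, 0, 1], [1, 1, 0]]
k4 = [[int(i != j) for j in range(4)] for i in range(4)]
matches = enumerate_matches(triangle, k4)   # 24 embeddings of a triangle in K4
```

The recursion tries every candidate vertex in a fixed order; a neural navigator in the spirit of NeuGN would instead rank the candidates `v` at each step so that a first match is reached far sooner.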

AAAI Conference 2025 Conference Paper

Agent-Aware Training for Agent-Agnostic Action Advising in Deep Reinforcement Learning

  • Yaoquan Wei
  • Shunyu Liu
  • Jie Song
  • Tongya Zheng
  • Kaixuan Chen
  • Mingli Song

Action advising endeavors to leverage supplementary guidance from expert teachers to alleviate the issue of sampling inefficiency in Deep Reinforcement Learning (DRL). Previous agent-specific action advising methods are hindered by imperfections in the agent itself, while agent-agnostic approaches exhibit limited adaptability to the learning agent. In this study, we propose a novel framework called Agent-Aware trAining yet Agent-Agnostic Action Advising (A7) to strike a balance between the two. The underlying concept of A7 revolves around utilizing the similarity of state features as an indicator for soliciting advice. However, unlike prior methodologies, the measurement of state feature similarity is performed by neither the error-prone learning agent nor the agent-agnostic advisor. Instead, we employ a proxy model to extract state features that are both discriminative (adaptive to the agent) and generally applicable (robust to agent noise). Furthermore, we utilize behavior cloning to train a model for reusing advice and introduce an intrinsic reward for the advised samples to incentivize the utilization of expert guidance. Experiments are conducted on the GridWorld, LunarLander, and six prominent scenarios from Atari games. The results demonstrate that A7 significantly accelerates the learning process and surpasses existing methods (both agent-specific and agent-agnostic) by a substantial margin. Our code will be made publicly available.

ICML Conference 2025 Conference Paper

Assessing Safety Risks and Quantization-aware Safety Patching for Quantized Large Language Models

  • Kejia Chen 0007
  • Jiawen Zhang 0005
  • Jiacong Hu
  • Yu Wang 0176
  • Jian Lou 0001
  • Zunlei Feng
  • Mingli Song

Quantized large language models (LLMs) have gained increasing attention and significance for enabling deployment in resource-constrained environments. However, emerging studies on a few calibration dataset-free quantization methods suggest that quantization may compromise the safety capabilities of LLMs, underscoring the urgent need for systematic safety evaluations and effective mitigation strategies. In this paper, we present comprehensive safety evaluations across various mainstream quantization techniques and diverse calibration datasets, utilizing widely accepted safety benchmarks. To address the identified safety vulnerabilities, we propose a quantization-aware safety patching framework, Q-resafe, to efficiently restore the safety capabilities of quantized LLMs while minimizing any adverse impact on utility. Extensive experiment results demonstrate that Q-resafe successfully re-aligns the safety of quantized LLMs with their pre-quantization counterparts, even under challenging evaluation scenarios. Project page: https://github.com/Thecommonirin/Qresafe.
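
For background, the weight quantization such safety evaluations target can be illustrated with a generic round-to-nearest scheme; this is a toy sketch under our own naming, not the calibration-free methods or the Q-resafe patching studied in the paper:

```python
import numpy as np

def quantize_rtn(w, bits=4):
    """Symmetric per-row round-to-nearest quantize, then dequantize."""
    qmax = 2 ** (bits - 1) - 1                 # e.g. 7 for 4-bit
    scale = np.abs(w).max(axis=1, keepdims=True) / qmax
    scale[scale == 0] = 1.0                    # guard all-zero rows
    q = np.clip(np.round(w / scale), -qmax - 1, qmax)
    return q * scale                           # dequantized weights

rng = np.random.default_rng(0)
w = rng.normal(size=(8, 16)).astype(np.float32)
w_q = quantize_rtn(w, bits=4)
max_err = np.abs(w - w_q).max()                # bounded by half a scale step
```

Each row loses at most half a quantization step of precision; the paper's point is that even such small, structured perturbations can measurably erode safety alignment.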

AAAI Conference 2025 Conference Paper

Association Pattern-enhanced Molecular Representation Learning

  • Lingxiang Jia
  • Yuchen Ying
  • Tian Qiu
  • Shaolun Yao
  • Liang Xue
  • Jie Lei
  • Jie Song
  • Mingli Song

The applicability of drug molecules in various clinical scenarios is significantly influenced by a diverse range of molecular properties. By leveraging self-supervised conditions such as atom attributes and interatomic bonds, existing advanced molecular foundation models can generate expressive representations of these molecules. However, such models often overlook the fixed association patterns within molecules that influence physiological or chemical properties. In this paper, we introduce a novel association pattern-aware message passing method, which can serve as an effective yet general plug-and-play plugin, thereby enhancing the atom representations generated by molecular foundation models without requiring additional pretraining. Additionally, molecular property-specific pattern libraries are constructed to collect the generated interpretable common patterns that bind to these properties. Extensive experiments conducted on 11 benchmark molecular property prediction tasks across 8 advanced molecular foundation models demonstrate significant superiority of the proposed method, with performance improvements of up to approximately 20%. Furthermore, a property-specific pattern library is tailored for blood-brain barrier penetration, which has undergone corresponding mechanistic validation.

NeurIPS Conference 2025 Conference Paper

Association-Focused Path Aggregation for Graph Fraud Detection

  • Tian Qiu
  • Wenda Li
  • Zunlei Feng
  • Jie Lei
  • Tao Wang
  • Yi Gao
  • Mingli Song
  • Yang Gao

Fraudulent activities have caused substantial negative social impacts and are exhibiting emerging characteristics such as intelligence and industrialization, posing challenges of high-order interactions, intricate dependencies, and the sparse yet concealed nature of fraudulent entities. Existing graph fraud detectors are limited by their narrow "receptive fields", as they focus only on the relations between an entity and its neighbors while neglecting longer-range structural associations hidden between entities. To address this issue, we propose a novel fraud detector based on Graph Path Aggregation (GPA). It operates through variable-length path sampling, semantic-associated path encoding, path interaction and aggregation, and aggregation-enhanced fraud detection. To further facilitate interpretable association analysis, we synthesize G-Internet, the first benchmark dataset in the field of internet fraud detection. Extensive experiments across datasets in multiple fraud scenarios demonstrate that the proposed GPA outperforms mainstream fraud detectors by up to +15% in Average Precision (AP). Additionally, GPA exhibits enhanced robustness to noisy labels and provides excellent interpretability by uncovering implicit fraudulent patterns across broader contexts. Code is available at https://github.com/horrible-dong/GPA.

ECAI Conference 2025 Conference Paper

Bi-Level Mean Field: Dynamic Grouping for Large-Scale MARL

  • Yuxuan Zheng
  • Yihe Zhou
  • Feiyang Xu
  • Mingli Song
  • Shunyu Liu 0001

Large-scale Multi-Agent Reinforcement Learning (MARL) often suffers from the curse of dimensionality, as the exponential growth in agent interactions significantly increases computational complexity and impedes learning efficiency. To mitigate this, existing efforts that rely on Mean Field (MF) simplify the interaction landscape by approximating neighboring agents as a single mean agent, thus reducing overall complexity to pairwise interactions. However, these MF methods inevitably fail to account for individual differences, leading to aggregation noise caused by inaccurate iterative updates during MF learning. In this paper, we propose a Bi-level Mean Field (BMF) method to capture agent diversity with dynamic grouping in large-scale MARL, which can alleviate aggregation noise via bi-level interaction. Specifically, BMF introduces a dynamic group assignment module, which employs a Variational AutoEncoder (VAE) to learn the representations of agents, facilitating their dynamic grouping over time. Furthermore, we propose a bi-level interaction module to model both inter- and intra-group interactions for effective neighboring aggregation. Experiments across various tasks demonstrate that the proposed BMF yields results superior to the state-of-the-art methods. Our code is available at https://github.com/Chreer/BMF.
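
The group-restricted mean-field reduction at the heart of this idea can be sketched in a few lines; fixed group labels stand in for the VAE-learned dynamic grouping, and all names here are ours rather than BMF's:

```python
import numpy as np

def group_mean_field(actions_onehot, groups):
    """Replace each agent's view of its neighbors with the mean action of
    its own group (intra-group mean field) instead of one global mean.
    actions_onehot: (n_agents, n_actions); groups: (n_agents,) int labels."""
    out = np.zeros(actions_onehot.shape, dtype=float)
    for g in np.unique(groups):
        mask = groups == g
        out[mask] = actions_onehot[mask].mean(axis=0)   # group mean action
    return out

acts = np.eye(3)[[0, 0, 2, 1]]                 # 4 agents, 3 discrete actions
mf = group_mean_field(acts, np.array([0, 0, 1, 1]))
# agents 0-1 see [1, 0, 0]; agents 2-3 see [0, 0.5, 0.5]
```

Classic mean-field MARL collapses all neighbors into one mean agent; grouping first, as above, keeps heterogeneous behaviors from being averaged away.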

AAMAS Conference 2025 Conference Paper

CADP: Towards Better Centralized Learning for Decentralized Execution in MARL

  • Yihe Zhou
  • Shunyu Liu
  • Yunpeng Qing
  • Tongya Zheng
  • Kaixuan Chen
  • Jie Song
  • Mingli Song

Centralized Training with Decentralized Execution (CTDE) has recently emerged as a popular framework for cooperative Multi-Agent Reinforcement Learning (MARL), where agents can use additional global state information to guide training in a centralized way and make their own decisions only based on decentralized local policies. Despite the encouraging results achieved, CTDE makes an independence assumption on agent policies, which limits agents from adopting global cooperative information from each other during centralized training. Therefore, we argue that the existing CTDE framework cannot fully utilize global information for training, leading to an inefficient joint exploration and perception, which can degrade the final performance. In this paper, we introduce a novel Centralized Advising and Decentralized Pruning (CADP) framework for MARL, that not only enables an efficacious message exchange among agents during training but also guarantees independent policies for decentralized execution.

IJCAI Conference 2025 Conference Paper

CADP: Towards Better Centralized Learning for Decentralized Execution in MARL

  • Yihe Zhou
  • Shunyu Liu
  • Yunpeng Qing
  • Tongya Zheng
  • Kaixuan Chen
  • Jie Song
  • Mingli Song

Centralized Training with Decentralized Execution (CTDE) has recently emerged as a popular framework for cooperative Multi-Agent Reinforcement Learning (MARL), where agents can use additional global state information to guide training in a centralized way and make their own decisions only based on decentralized local policies. Despite the encouraging results achieved, CTDE makes an independence assumption on agent policies, which limits agents from adopting global cooperative information from each other during centralized training. Therefore, we argue that the existing CTDE framework cannot fully utilize global information for training, leading to an inefficient joint exploration and perception, which can degrade the final performance. In this paper, we introduce a novel Centralized Advising and Decentralized Pruning (CADP) framework for MARL, that not only enables an efficacious message exchange among agents during training but also guarantees the independent policies for decentralized execution. Firstly, CADP endows agents the explicit communication channel to seek and take advice from different agents for more centralized training. To further ensure the decentralized execution, we propose a smooth model pruning mechanism to progressively constrain the agent communication into a closed one without degradation in agent cooperation capability. Empirical evaluations on different benchmarks and across various MARL backbones demonstrate that the proposed framework achieves superior performance compared with the state-of-the-art counterparts. Our code is available at https://github.com/zyh1999/CADP.

NeurIPS Conference 2025 Conference Paper

Consistent Paths Lead to Truth: Self-Rewarding Reinforcement Learning for LLM Reasoning

  • Kongcheng Zhang
  • Qi Yao
  • Shunyu Liu
  • Yingjie Wang
  • Baisheng Lai
  • Jieping Ye
  • Mingli Song
  • Dacheng Tao

Recent advances in Reinforcement Learning (RL) have highlighted its potential in complex reasoning tasks, yet effective training often relies on external supervision, which limits the broader applicability. In this work, we propose a novel self-rewarding reinforcement learning framework to enhance Large Language Model (LLM) reasoning by leveraging the consistency of intermediate reasoning states across different reasoning trajectories. Our key insight is that correct responses often exhibit consistent trajectory patterns in terms of model likelihood: their intermediate reasoning states tend to converge toward their own final answers (high consistency) with minimal deviation toward other candidates (low volatility). Inspired by this observation, we introduce CoVo, an intrinsic reward mechanism that integrates Consistency and Volatility via a robust vector-space aggregation strategy, complemented by a curiosity bonus to promote diverse exploration. CoVo enables LLMs to perform RL in a self-rewarding manner, offering a scalable pathway for learning to reason without external supervision. Extensive experiments on diverse reasoning benchmarks show that CoVo achieves performance comparable to or even surpassing supervised RL. Our code is available at https://github.com/sastpg/CoVo.

AAAI Conference 2025 Conference Paper

Cooperative Policy Agreement: Learning Diverse Policy for Offline MARL

  • Yihe Zhou
  • Yuxuan Zheng
  • Yue Hu
  • Kaixuan Chen
  • Tongya Zheng
  • Jie Song
  • Mingli Song
  • Shunyu Liu

Offline Multi-Agent Reinforcement Learning (MARL) aims to learn optimal joint policies from pre-collected datasets without further interaction with the environment. Despite the encouraging results achieved so far, we identify the policy mismatch problem that arises from employing diverse offline MARL datasets, a highly important ingredient for cooperative generalization yet largely overlooked by existing literature. Specifically, in the case that offline datasets exhibit various optimal joint policies, policy mismatch often occurs when individual actions from different optimal joint actions are combined in a way that results in a suboptimal joint action. In this paper, we introduce a novel Cooperative Policy Agreement (CPA) method, that not only mitigates the policy mismatch problem but also learns to generate diverse joint policies. CPA firstly introduces an autoregressive decision-making mechanism among agents during offline training. This mechanism enables agents to access the actions previously taken by other agents, thereby facilitating effective joint policy matching. Moreover, diverse joint policies can be directly obtained through sequential action sampling from the autoregressive model. Then we further incorporate a policy agreement mechanism to convert these autoregressive joint policies into decentralized policies with a non-autoregressive form, while still ensuring the diversity of the generated policies. This mechanism guarantees that the proposed CPA adheres to the Centralized Training with Decentralized Execution (CTDE) constraint. Experiments conducted on various benchmarks demonstrate that CPA yields superior performance to state-of-the-art competitors.

AAAI Conference 2025 Conference Paper

D^2-DPM: Dual Denoising for Quantized Diffusion Probabilistic Models

  • Qian Zeng
  • Jie Song
  • Han Zheng
  • Hao Jiang
  • Mingli Song

Diffusion models have achieved cutting-edge performance in image generation. However, their lengthy denoising process and computationally intensive score estimation network impede their scalability in low-latency and resource-constrained scenarios. Post-training quantization (PTQ) compresses and accelerates diffusion models without retraining, but it inevitably introduces additional quantization noise, resulting in mean and variance deviations. In this work, we propose D^2-DPM, a dual denoising mechanism aimed at precisely mitigating the adverse effects of quantization noise on the noise estimation network. Specifically, we first unravel the impact of quantization noise on the sampling equation into two components: the mean deviation and the variance deviation. The mean deviation alters the drift coefficient of the sampling equation, influencing the trajectory trend, while the variance deviation magnifies the diffusion coefficient, impacting the convergence of the sampling trajectory. The proposed D^2-DPM is thus devised to denoise the quantization noise at each time step, and then denoise the noisy sample through the inverse diffusion iterations. Experimental results demonstrate that D^2-DPM achieves superior generation quality, yielding a 1.42 lower FID than the full-precision model while achieving 3.99x compression and 11.67x bit-operation acceleration.
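
The mean/variance split of quantization noise described above can be reproduced in a toy numpy experiment with a deliberately biased quantizer; this illustrates only the decomposition, not the D^2-DPM correction itself:

```python
import numpy as np

# A floor quantizer always rounds down, so its error carries both a
# systematic bias (a mean deviation, drift-like) and a spread (a variance
# deviation, diffusion-like), mirroring the two components discussed above.
rng = np.random.default_rng(0)
x = rng.normal(size=100_000)          # stand-in for score-network outputs
step = 0.1
x_q = step * np.floor(x / step)       # biased quantization
err = x_q - x                         # error lies in (-step, 0]
mean_dev = err.mean()                 # ~ -step / 2
var_dev = err.var()                   # ~ step**2 / 12
```

Subtracting the measured bias and accounting for the extra variance at each sampling step is the flavor of correction a dual-denoising scheme applies, shown here only at the level of raw statistics.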

ICLR Conference 2025 Conference Paper

Dataset Ownership Verification in Contrastive Pre-trained Models

  • Yuechen Xie
  • Jie Song
  • Mengqi Xue
  • Haofei Zhang
  • Xingen Wang
  • Bingde Hu
  • Genlang Chen
  • Mingli Song

High-quality open-source datasets, which necessitate substantial efforts for curation, have become the primary catalyst for the swift progress of deep learning. Concurrently, protecting these datasets is paramount for the well-being of the data owner. Dataset ownership verification emerges as a crucial method in this domain, but existing approaches are often limited to supervised models and cannot be directly extended to increasingly popular unsupervised pre-trained models. In this work, we propose the first dataset ownership verification method tailored specifically for self-supervised models pre-trained by contrastive learning. Its primary objective is to ascertain whether a suspicious black-box backbone has been pre-trained on a specific unlabeled dataset, aiding dataset owners in upholding their rights. The proposed approach is motivated by our empirical insights that when models are trained with the target dataset, the unary and binary instance relationships within the embedding space exhibit significant variations compared to models trained without the target dataset. We validate the efficacy of this approach across multiple contrastive pre-trained models including SimCLR, BYOL, SimSiam, MOCO v3, and DINO. The results demonstrate that our method rejects the null hypothesis with a $p$-value markedly below $0.05$, surpassing all previous methodologies. Our code is available at https://github.com/xieyc99/DOV4CL.
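
The hypothesis-testing step can be illustrated with a generic two-sample permutation test on embedding-similarity scores; the toy numbers and names below are ours, whereas the paper's actual statistic is built on the unary and binary instance relationships it describes:

```python
import numpy as np

def permutation_pvalue(a, b, n_perm=2000, seed=0):
    """Two-sample permutation test on the difference of means."""
    rng = np.random.default_rng(seed)
    observed = abs(a.mean() - b.mean())
    pooled = np.concatenate([a, b])
    hits = 0
    for _ in range(n_perm):
        rng.shuffle(pooled)
        diff = abs(pooled[:a.size].mean() - pooled[a.size:].mean())
        hits += diff >= observed
    return (hits + 1) / (n_perm + 1)           # add-one to avoid p = 0

rng = np.random.default_rng(1)
# Hypothetical similarity scores: higher on the suspected training set
target_sims = rng.normal(0.8, 0.05, 200)       # suspect model on target data
control_sims = rng.normal(0.6, 0.05, 200)      # suspect model on control data
p = permutation_pvalue(target_sims, control_sims)
```

A `p` below 0.05 rejects the null hypothesis that the two similarity populations coincide, i.e. it is evidence that the black-box backbone has seen the target dataset.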

IJCAI Conference 2025 Conference Paper

DenseSAM: Semantic Enhance SAM for Efficient Dense Object Segmentation

  • Linyun Zhou
  • Jiacong Hu
  • Shengxuming Zhang
  • Xiangtong Du
  • Mingli Song
  • Xiuming Zhang
  • Zunlei Feng

Dense object segmentation is essential for various applications, particularly in pathology image and remote sensing image analysis. However, distinguishing numerous similar and densely packed objects in this task presents significant challenges. Several methods, including CNN- and ViT-based approaches, have been proposed to tackle these issues. Yet, models trained on limited datasets exhibit limited generalization ability. The Segment Anything Model (SAM) has recently achieved significant progress in zero-shot segmentation but relies heavily on precise positional guidance. However, providing numerous accurate location prompts in dense scenarios is time-consuming. To overcome this limitation, we conducted an in-depth exploration of the SAM mechanism and found that its strong generalization ability stems from the encoder’s edge detection capability, which is semantically independent, making location prompts essential for segmentation. This insight inspired the development of DenseSAM, which replaces location prompts with semantic guidance for automatic segmentation in dense scenarios. Specifically, it uses local details to weaken the edges of background objects, leverages global context to enhance intra-class feature similarity, while further increasing contrast with the background, and integrates a dual-head decoding process to enable lightweight automatic semantic segmentation. Extensive experiments on pathology images demonstrate that DenseSAM delivers remarkable performance with minimal training parameters, providing a cost-effective and efficient solution. Moreover, experiments on remote sensing images further validate its excellent scalability, making DenseSAM suitable for various dense object segmentation domains. The code is available at https://github.com/imAzhou/DenseSAM.

AAAI Conference 2025 Conference Paper

Disentangled Table-Graph Representation for Interpretable Transmission Line Fault Location

  • Na Yu
  • Yutong Deng
  • Shunyu Liu
  • Kaixuan Chen
  • Tongya Zheng
  • Mingli Song

The fault location task in power grids is crucial for maintaining social order and ensuring public safety. However, existing methods that rely on tabular state records often neglect the intrinsic topological influences of transmission lines, resulting in a segmented approach to fault location that consists of multiple stages. In this paper, we propose a Disentangled Table-Graph representation framework, termed DTG, which integrates fault location tasks at coarse-grained line levels and fine-grained point levels within an end-to-end learning paradigm. Our innovative disentanglement strategy produces interpretable attribution coefficients that connect tabular records and transmission line topology, thereby facilitating fault location at both line and point levels. The joint prediction tasks designed around our disentangled table-graph representation promote mutual information exchange between features and topology of transmission lines in an interpretable manner. Experimental results on the 7-bus system, the 36-bus system, and a realistic 325-bus system in China demonstrate that the proposed method adapts to different topological structures and handles different types of faults. Compared to traditional methods, DTG achieves high accuracy in locating both fault lines and fault points.

IJCAI Conference 2025 Conference Paper

Efficient Dynamic Graphs Learning with Refined Batch Parallel Training

  • Zhengzhao Feng
  • Rui Wang
  • Longjiao Zhang
  • Tongya Zheng
  • Ziqi Huang
  • Mingli Song

Memory-based temporal graph neural networks (MTGNN) use node memory to store historical information, enabling efficient processing of large dynamic graphs through batch parallel training, with larger batch sizes leading to increased training efficiency. However, this approach overlooks the interdependency among edges within the same batch, leading to outdated memory states and reduced training accuracy. Previous studies have attempted to mitigate this issue through methods such as measuring memory loss, overlap training, and additional compensation modules. Despite these efforts, challenges persist, including imprecise coarse-grained memory loss measurement and ineffective compensation modules. To address these challenges, we propose the Refined Batch parallel Training (RBT) framework, which accurately evaluates intra-batch information loss and optimizes batch partitioning to minimize loss, enhancing the training process's effectiveness and efficiency. RBT also includes a precise and efficient memory compensation algorithm. Experimental results demonstrate RBT's superior performance compared to existing MTGNN frameworks like TGL, ETC, and PRES in terms of training efficiency and accuracy across various dynamic graph datasets. Our code is made publicly available at https://github.com/fengwudi/RBT.

ICLR Conference 2025 Conference Paper

From GNNs to Trees: Multi-Granular Interpretability for Graph Neural Networks

  • Jie Yang
  • Yuwen Wang
  • Kaixuan Chen 0004
  • Tongya Zheng
  • Yihe Zhou
  • Zhenbang Xiao
  • Ji Cao 0001
  • Mingli Song

Interpretable Graph Neural Networks (GNNs) aim to reveal the underlying reasoning behind model predictions, attributing their decisions to specific subgraphs that are informative. However, existing subgraph-based interpretable methods suffer from an overemphasis on local structure, potentially overlooking long-range dependencies within the entire graphs. Although recent efforts that rely on graph coarsening have proven beneficial for global interpretability, they inevitably reduce the graphs to a fixed granularity. Such an inflexible way can only capture graph connectivity at a specific level, whereas real-world graph tasks often exhibit relationships at varying granularities (e.g., relevant interactions in proteins span from functional groups, to amino acids, and up to protein domains). In this paper, we introduce a novel Tree-like Interpretable Framework (TIF) for graph classification, where plain GNNs are transformed into hierarchical trees, with each level featuring coarsened graphs of different granularity as tree nodes. Specifically, TIF iteratively adopts a graph coarsening module to compress original graphs (i.e., root nodes of trees) into increasingly coarser ones (i.e., child nodes of trees), while preserving diversity among tree nodes within different branches through a dedicated graph perturbation module. Finally, we propose an adaptive routing module to identify the most informative root-to-leaf paths, providing not only the final prediction but also the multi-granular interpretability for the decision-making process. Extensive experiments on the graph classification benchmarks with both synthetic and real-world datasets demonstrate the superiority of TIF in interpretability, while also delivering a competitive prediction performance akin to the state-of-the-art counterparts.

AAAI Conference 2025 Conference Paper

Global Attribute-Association Pattern Aggregation for Graph Fraud Detection

  • Mingjiang Duan
  • Da He
  • Tongya Zheng
  • Lingxiang Jia
  • Mingli Song
  • Xinyu Wang
  • Zunlei Feng

Fraud is increasingly prevalent, and its patterns are frequently changing, posing challenges for fraud detection methods such as random forests and Graph Neural Networks (GNNs), which rely on bin-based features and mixed features, respectively. The former may lose crucial graph-associated features, while the latter face incorrect feature fusion. To overcome these limitations, we propose an approach based on attribute-association patterns that leverages the distinct attribute and association patterns differentiating fraudulent from benign behaviors to enhance fraud detection capabilities. Attribute features are adaptively split into separate bins to eliminate incorrect attribute fusion and combined with association patterns through graph neighbor message passing, thereby deriving attribute-association pattern features. Using the learned attribute-association patterns, the fraud patterns between a single pattern and the patterns across the entire graph are globally aggregated. Extensive experiments comparing our approach with 24 methods on 7 datasets demonstrate that the proposed method achieves state-of-the-art performance.

AAAI Conference 2025 Conference Paper

Holistic Semantic Representation for Navigational Trajectory Generation

  • Ji Cao
  • Tongya Zheng
  • Qinghong Guo
  • Yu Wang
  • Junshu Dai
  • Shunyu Liu
  • Jie Yang
  • Jie Song

Trajectory generation has garnered significant attention from researchers in the field of spatio-temporal analysis, as it can generate substantial synthesized human mobility trajectories that enhance user privacy and alleviate data scarcity. However, existing trajectory generation methods often focus on improving trajectory generation quality from a singular perspective, lacking a comprehensive semantic understanding across various scales. Consequently, we are inspired to develop a HOlistic SEmantic Representation (HOSER) framework for navigational trajectory generation. Given an origin-and-destination (OD) pair and the starting time point of a latent trajectory, we first propose a Road Network Encoder to expand the receptive field of road- and zone-level semantics. Second, we design a Multi-Granularity Trajectory Encoder to integrate the spatio-temporal semantics of the generated trajectory at both the point and trajectory levels. Finally, we employ a Destination-Oriented Navigator to seamlessly integrate destination-oriented guidance. Extensive experiments on three real-world datasets demonstrate that HOSER outperforms state-of-the-art baselines by a significant margin. Moreover, the model's performance in few-shot learning and zero-shot learning scenarios further verifies the effectiveness of our holistic semantic representation.

ICML Conference 2025 Conference Paper

L-Diffusion: Laplace Diffusion for Efficient Pathology Image Segmentation

  • Weihan Li
  • Linyun Zhou
  • Yang Jian
  • Shengxuming Zhang
  • Xiangtong Du
  • Xiuming Zhang
  • Jing Zhang 0120
  • Chaoqing Xu

Pathology image segmentation plays a pivotal role in artificial digital pathology diagnosis and treatment. Existing approaches to pathology image segmentation are hindered by labor-intensive annotation processes and limited accuracy in tail-class identification, primarily due to the long-tail distribution inherent in gigapixel pathology images. In this work, we introduce the Laplace Diffusion Model, referred to as L-Diffusion, an innovative framework tailored for efficient pathology image segmentation. L-Diffusion utilizes multiple Laplace distributions, as opposed to Gaussian distributions, to model distinct components—a methodology supported by theoretical analysis that significantly enhances the decomposition of features within the feature space. A sequence of feature maps is initially generated through a series of diffusion steps. Following this, contrastive learning is employed to refine the pixel-wise vectors derived from the feature map sequence. By utilizing these highly discriminative pixel-wise vectors, the segmentation module achieves a harmonious balance of precision and robustness with remarkable efficiency. Extensive experimental evaluations demonstrate that L-Diffusion attains improvements of up to 7.16%, 26.74%, 16.52%, and 3.55% on tissue segmentation datasets, and 20.09%, 10.67%, 14.42%, and 10.41% on cell segmentation datasets, as quantified by DICE, MPA, mIoU, and FwIoU metrics. The source code is available at https://github.com/Lweihan/LDiffusion.

IJCAI Conference 2025 Conference Paper

Odyssey: Empowering Minecraft Agents with Open-World Skills

  • Shunyu Liu
  • Yaoru Li
  • Kongcheng Zhang
  • Zhenyu Cui
  • Wenkai Fang
  • Yuxuan Zheng
  • Tongya Zheng
  • Mingli Song

Recent studies have delved into constructing generalist agents for open-world environments like Minecraft. Despite the encouraging results, existing efforts mainly focus on solving basic programmatic tasks, e.g., material collection and tool-crafting following the Minecraft tech-tree, treating the ObtainDiamond task as the ultimate goal. This limitation stems from the narrowly defined set of actions available to agents, requiring them to learn effective long-horizon strategies from scratch. Consequently, discovering diverse gameplay opportunities in the open world becomes challenging. In this work, we introduce Odyssey, a new framework that empowers Large Language Model (LLM)-based agents with open-world skills to explore the vast Minecraft world. Odyssey comprises three key parts: (1) An interactive agent with an open-world skill library that consists of 40 primitive skills and 183 compositional skills. (2) A fine-tuned LLaMA-3 model trained on a large question-answering dataset with 390k+ instruction entries derived from the Minecraft Wiki. (3) A new agent capability benchmark that includes the long-term planning task, the dynamic-immediate planning task, and the autonomous exploration task. Extensive experiments demonstrate that the proposed Odyssey framework can effectively evaluate different capabilities of LLM-based agents. All datasets, model weights, and code are publicly available to motivate future research on more advanced autonomous agent solutions.

JMLR Journal 2025 Journal Article

Optimal and Efficient Algorithms for Decentralized Online Convex Optimization

  • Yuanyu Wan
  • Tong Wei
  • Bo Xue
  • Mingli Song
  • Lijun Zhang

We investigate decentralized online convex optimization (D-OCO), in which a set of local learners are required to minimize a sequence of global loss functions using only local computations and communications. Previous studies have established $O(n^{5/4}\rho^{-1/2}\sqrt{T})$ and ${O}(n^{3/2}\rho^{-1}\log T)$ regret bounds for convex and strongly convex functions respectively, where $n$ is the number of local learners and $\rho$ is the spectral gap of the communication matrix.

ICML Conference 2025 Conference Paper

Revisiting Differentially Private Algorithms for Decentralized Online Learning

  • Xiaoyu Wang
  • Wenhao Yang
  • Chang Yao 0001
  • Mingli Song
  • Yuanyu Wan

Although the differential privacy (DP) of decentralized online learning has garnered considerable attention recently, existing algorithms are unsatisfactory due to their inability to achieve $(\epsilon, 0)$-DP over all $T$ rounds, recover the optimal regret in the non-private case, and maintain the lightweight computation under complex constraints. To address these issues, we first propose a new decentralized online learning algorithm satisfying $(\epsilon, 0)$-DP over $T$ rounds, and show that it can achieve $\widetilde{O}(n(\rho^{-1/4}+\epsilon^{-1}\rho^{1/4})\sqrt{T})$ and $\widetilde{O}(n(\rho^{-1/2}+\epsilon^{-1}))$ regret bounds for convex and strongly convex functions respectively, where $n$ is the number of local learners and $\rho$ is the spectral gap of the communication matrix. As long as $\epsilon=\Omega(\sqrt{\rho})$, these bounds nearly match existing lower bounds in the non-private case, which implies that $(\epsilon, 0)$-DP of decentralized online learning may be ensured nearly for free. Our key idea is to design a block-decoupled accelerated gossip strategy that can be incorporated with the classical tree-based private aggregation, and also enjoys a faster average consensus among local learners. Furthermore, we develop a projection-free variant of our algorithm to keep the efficiency under complex constraints. As a trade-off, the above regret bounds degrade to $\widetilde{O}(n(T^{3/4}+\epsilon^{-1}T^{1/4}))$ and $\widetilde{O}(n(T^{2/3}+\epsilon^{-1}))$ respectively, which nevertheless remain better than those of the existing private centralized projection-free online algorithm.
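The classical tree-based private aggregation the authors build on can be sketched as follows: partial sums over dyadic intervals are cached at the nodes of a binary tree, each perturbed once with Laplace noise, so any released prefix sum touches only logarithmically many noisy nodes. This is the generic binary mechanism, not the paper's block-decoupled accelerated-gossip variant; function and variable names are assumptions.

```python
import math
import numpy as np

def tree_private_prefix_sums(values, epsilon, rng):
    """Classical tree-based private aggregation (binary mechanism).

    Each dyadic partial sum is computed once and perturbed with Laplace
    noise of scale levels/epsilon; a prefix sum is then assembled from at
    most `levels` noisy nodes, keeping per-release noise polylogarithmic
    in the stream length T.
    """
    T = len(values)
    levels = max(1, math.ceil(math.log2(T + 1)))
    scale = levels / epsilon
    noisy = {}                         # (level, start) -> noisy dyadic sum
    out = []
    for t in range(1, T + 1):          # release the prefix sum of values[:t]
        total, start = 0.0, 0
        for level in reversed(range(levels)):
            width = 1 << level
            if t - start >= width:     # greedy dyadic decomposition of [0, t)
                key = (level, start)
                if key not in noisy:
                    noisy[key] = (sum(values[start:start + width])
                                  + rng.laplace(0.0, scale))
                total += noisy[key]
                start += width
        out.append(total)
    return out

rng = np.random.default_rng(0)
released = tree_private_prefix_sums([1.0, 2.0, 0.5, 3.0], epsilon=1.0, rng=rng)
```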

NeurIPS Conference 2025 Conference Paper

SALoM: Structure Aware Temporal Graph Networks with Long-Short Memory Updater

  • Hanwen Liu
  • Longjiao Zhang
  • Rui Wang
  • Tongya Zheng
  • Sai Wu
  • Chang Yao
  • Mingli Song

Dynamic graph learning is crucial for accurately modeling complex systems by integrating topological structure and temporal information within graphs. While memory-based methods are commonly used and excel at capturing short-range temporal correlations, they struggle with modeling long-range dependencies, harmonizing long-range and short-range correlations, and integrating structural information effectively. To address these challenges, we present SALoM: Structure Aware Temporal Graph Networks with Long-Short Memory Updater. SALoM features a memory module that addresses gradient vanishing and information forgetting, enabling the capture of long-term dependencies across various time scales. Additionally, SALoM utilizes a long-short memory updater (LSMU) to dynamically balance long-range and short-range temporal correlations, preventing over-generalization. By integrating co-occurrence encoding and LSMU through information bottleneck-based fusion, SALoM effectively captures both the structural and temporal information within graphs. Experimental results across various graph datasets demonstrate SALoM's superior performance, achieving state-of-the-art results in dynamic graph link prediction. Our code is openly accessible at https://github.com/wave5418/SALoM.

NeurIPS Conference 2025 Conference Paper

SeRL: Self-play Reinforcement Learning for Large Language Models with Limited Data

  • Wenkai Fang
  • Shunyu Liu
  • Yang Zhou
  • Kongcheng Zhang
  • Tongya Zheng
  • Kaixuan Chen
  • Mingli Song
  • Dacheng Tao

Recent advances have demonstrated the effectiveness of Reinforcement Learning (RL) in improving the reasoning capabilities of Large Language Models (LLMs). However, existing works inevitably rely on high-quality instructions and verifiable rewards for effective training, both of which are often difficult to obtain in specialized domains. In this paper, we propose Self-play Reinforcement Learning (SeRL) to bootstrap LLM training with limited initial data. Specifically, SeRL comprises two complementary modules: self-instruction and self-rewarding. The former module generates additional instructions based on the available data at each training step, employing comprehensive online filtering strategies to ensure instruction quality, diversity, and difficulty. The latter module introduces a simple yet effective majority-voting mechanism to estimate response rewards for additional instructions, eliminating the need for external annotations. Finally, SeRL performs conventional RL based on the generated data, facilitating iterative self-play learning. Extensive experiments on various reasoning benchmarks and across different LLM backbones demonstrate that the proposed SeRL yields results superior to its counterparts and achieves performance on par with those obtained by high-quality data with verifiable rewards. Our code is available at https://github.com/wantbook-book/SeRL.
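The self-rewarding module's majority-voting idea is simple enough to sketch directly: sample several responses per instruction, take the most common extracted answer as pseudo-ground-truth, and reward agreement with it. A minimal sketch; the answer extractor and the 0/1 reward values are assumptions, not details from the paper.

```python
from collections import Counter

def majority_vote_reward(responses, extract_answer):
    """Self-rewarding by majority voting, as described in the abstract:
    the most common extracted answer serves as pseudo-ground-truth, and
    agreement with it earns reward 1.0 (0.0 otherwise)."""
    answers = [extract_answer(r) for r in responses]
    majority, _ = Counter(answers).most_common(1)[0]
    return [1.0 if a == majority else 0.0 for a in answers], majority

# Toy example with a hypothetical "last token" answer extractor.
responses = ["... so the answer is 42",
             "... so the answer is 41",
             "... so the answer is 42"]
rewards, consensus = majority_vote_reward(responses, lambda r: r.split()[-1])
# rewards == [1.0, 0.0, 1.0], consensus == "42"
```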

ICML Conference 2025 Conference Paper

STD-FD: Spatio-Temporal Distribution Fitting Deviation for AIGC Forgery Identification

  • Hengrui Lou
  • Zunlei Feng
  • Jinsong Geng
  • Erteng Liu
  • Jie Lei 0002
  • Lechao Cheng
  • Jie Song 0011
  • Mingli Song

With the rise of AIGC technologies, particularly diffusion models, generating highly realistic fake images that can deceive human visual perception has become feasible. Consequently, various forgery detection methods have emerged. However, existing methods treat the generation process of fake images as either a black-box or an auxiliary tool, offering limited insights into its underlying mechanisms. In this paper, we propose Spatio-Temporal Distribution Fitting Deviation (STD-FD) for AIGC forgery detection, which explores the generative process in detail. By decomposing and reconstructing data within generative diffusion models, initial experiments reveal temporal distribution fitting deviations during the image reconstruction process. These deviations are captured through reconstruction noise maps for each spatial semantic unit, derived via a super-resolution algorithm. Critical discriminative patterns, termed DFactors, are identified through statistical modeling of these deviations. Extensive experiments show that STD-FD effectively captures distribution patterns in AIGC-generated data, demonstrating strong robustness and generalizability while outperforming state-of-the-art (SOTA) methods on major datasets. The source code is available at this link.

ECAI Conference 2025 Conference Paper

TED-DTMoA: Tri-Comparison Expertise Decision for Drug-Target Mechanism of Action

  • Lingxiang Jia
  • Zipeng Zhong
  • Shaolun Yao
  • Jie Song 0011
  • Mingli Song
  • Zunlei Feng

Machine-learned interactions between drugs and human protein targets play a crucial role in efficient and accurate drug discovery. However, the drug-target mechanism of action (DTMoA) prediction is actually a multi-class classification problem, which follows a long-tailed class distribution. Existing methods simply address whether the drugs and targets can interact and rarely consider these deep mechanisms. In this paper, we introduce TED-DTMoA, a novel DTMoA prediction framework that incorporates the divide-and-conquer strategy with tri-comparison options. Specifically, to reduce the learning difficulty of tail classes, we propose an expertise-based divide-and-conquer decision approach that combines the results of multiple independent expertise models for sub-tasks decomposed from the original prediction task. In addition, to enhance the discrimination of similar mechanism classes, we devise a tri-comparison learning strategy that defines the sub-task as the classification of triple options, such as expanding the classification task for classes A and B to include an extra “Neither of them” option. Extensive experiments conducted on various DTMoA datasets quantitatively demonstrate that the proposed method achieves an approximately 13% performance improvement compared with advanced baselines. Moreover, our method exhibits an obvious superiority on the tail classes. Further analysis of the evolvability and generalization reveals the significant potential to be deployed in real-world scenarios.

NeurIPS Conference 2025 Conference Paper

Tree of Preferences for Diversified Recommendation

  • Hanyang Yuan
  • Ning Tang
  • Tongya Zheng
  • Jiarong Xu
  • Xintong Hu
  • Renhong Huang
  • Shunyu Liu
  • Jiacong Hu

Diversified recommendation has attracted increasing attention from both researchers and practitioners, which can effectively address the homogeneity of recommended items. Existing approaches predominantly aim to infer the diversity of user preferences from observed user feedback. Nonetheless, due to inherent data biases, the observed data may not fully reflect user interests, where underexplored preferences can be overwhelmed or remain unmanifested. Failing to capture these preferences can lead to suboptimal diversity in recommendations. To fill this gap, this work aims to study diversified recommendation from a data-bias perspective. Inspired by the outstanding performance of large language models (LLMs) in zero-shot inference leveraging world knowledge, we propose a novel approach that utilizes LLMs' expertise to uncover underexplored user preferences from observed behavior, ultimately providing diverse and relevant recommendations. To achieve this, we first introduce Tree of Preferences (ToP), an innovative structure constructed to model user preferences from coarse to fine. ToP enables LLMs to systematically reason over the user's rationale behind their behavior, thereby uncovering their underexplored preferences. To guide diversified recommendations using uncovered preferences, we adopt a data-centric approach, identifying candidate items that match user preferences and generating synthetic interactions that reflect underexplored preferences. These interactions are integrated to train a general recommender for diversification. Moreover, we scale up overall efficiency by dynamically selecting influential users during optimization. Extensive evaluations of both diversity and relevance show that our approach outperforms existing methods in most cases and achieves near-optimal performance in others, with reasonable inference latency.

IJCAI Conference 2025 Conference Paper

VQCounter: Designing Visual Prompt Queue for Accurate Open-World Counting

  • Fanfan Ye
  • Yiqi Fan
  • Qiaoyong Zhong
  • Shicai Yang
  • Di Xie
  • Jie Song
  • Mingli Song

Class-agnostic counting enables enumerating arbitrary object classes beyond those seen during training. Recent studies have attempted to exploit the potential of visual foundation models such as GroundingDINO. Despite the considerable progress, we observe certain shortcomings, including the limited diversity of visual prompts and a suboptimal training regimen. To address these issues, we introduce VQCounter, which incorporates a visual prompt queue mechanism designed to enrich the diversity of visual prompts. A random modality switching strategy is proposed during training to strengthen both textual and visual modalities. Besides, in light of weak point supervision, a Voronoi diagram-based cost (VoronoiCost) is designed to improve Hungarian matching, leading to more stable and faster convergence. Building upon the Voronoi diagram, we also propose a novel set of more stringent evaluation metrics, which take point localization into account. Extensive experiments on the FSC-147 and CARPK datasets demonstrate that VQCounter achieves state-of-the-art performance in both zero-shot and few-shot settings, significantly outperforming existing methods across nearly all evaluations.
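The matching step that VoronoiCost improves can be sketched with standard Hungarian assignment between predicted and annotated points under a plain L2 cost (here via SciPy); the paper replaces this cost with a Voronoi-diagram-based one, which this sketch does not reproduce.

```python
import numpy as np
from scipy.optimize import linear_sum_assignment

def match_points(pred, gt):
    """Hungarian matching between predicted and annotated points under a
    plain L2 cost. VQCounter's VoronoiCost substitutes a Voronoi
    diagram-based cost here; only the generic matching step is shown."""
    # Pairwise Euclidean distances, shape (num_pred, num_gt).
    cost = np.linalg.norm(pred[:, None, :] - gt[None, :, :], axis=-1)
    rows, cols = linear_sum_assignment(cost)
    return list(zip(rows.tolist(), cols.tolist()))

pred = np.array([[0.0, 0.0], [5.0, 5.0]])
gt = np.array([[5.1, 5.0], [0.1, 0.0]])
# match_points(pred, gt) -> [(0, 1), (1, 0)]
```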

NeurIPS Conference 2024 Conference Paper

A2PO: Towards Effective Offline Reinforcement Learning from an Advantage-aware Perspective

  • Yunpeng Qing
  • Shunyu Liu
  • Jingyuan Cong
  • Kaixuan Chen
  • Yihe Zhou
  • Mingli Song

Offline reinforcement learning endeavors to leverage offline datasets to craft an effective agent policy without online interaction, which imposes proper conservative constraints with the support of behavior policies to tackle the out-of-distribution problem. However, existing works often suffer from the constraint conflict issue when offline datasets are collected from multiple behavior policies, i.e., different behavior policies may exhibit inconsistent actions with distinct returns across the state space. To remedy this issue, recent advantage-weighted methods prioritize samples with high advantage values for agent training while inevitably ignoring the diversity of behavior policy. In this paper, we introduce a novel Advantage-Aware Policy Optimization (A2PO) method to explicitly construct advantage-aware policy constraints for offline learning under mixed-quality datasets. Specifically, A2PO employs a conditional variational auto-encoder to disentangle the action distributions of intertwined behavior policies by modeling the advantage values of all training data as conditional variables. Then the agent can follow such disentangled action distribution constraints to optimize the advantage-aware policy towards high advantage values. Extensive experiments conducted on both the single-quality and mixed-quality datasets of the D4RL benchmark demonstrate that A2PO yields results superior to its counterparts. Our code is available at https://github.com/Plankson/A2PO.

AAAI Conference 2024 Conference Paper

Angle Robustness Unmanned Aerial Vehicle Navigation in GNSS-Denied Scenarios

  • Yuxin Wang
  • Zunlei Feng
  • Haofei Zhang
  • Yang Gao
  • Jie Lei
  • Li Sun
  • Mingli Song

Due to the inability to receive signals from the Global Navigation Satellite System (GNSS) in extreme conditions, achieving accurate and robust navigation for Unmanned Aerial Vehicles (UAVs) is a challenging task. Recently emerged, vision-based navigation has been a promising and feasible alternative to GNSS-based navigation. However, existing vision-based techniques are inadequate in addressing flight deviation caused by environmental disturbances and inaccurate position predictions in practical settings. In this paper, we present a novel angle robustness navigation paradigm to deal with flight deviation in point-to-point navigation tasks. Additionally, we propose a model that includes the Adaptive Feature Enhance Module, Cross-knowledge Attention-guided Module and Robust Task-oriented Head Module to accurately predict direction angles for high-precision navigation. To evaluate the vision-based navigation methods, we collect a new dataset termed UAV_AR368. Furthermore, we design the Simulation Flight Testing Instrument (SFTI) using Google Earth to simulate different flight environments, thereby reducing the expenses associated with real flight testing. Experiment results demonstrate that the proposed model outperforms the state-of-the-art by achieving improvements of 26.0% and 45.6% in the success rate of arrival under ideal and disturbed circumstances, respectively.

NeurIPS Conference 2024 Conference Paper

Association Pattern-aware Fusion for Biological Entity Relationship Prediction

  • Lingxiang Jia
  • Yuchen Ying
  • Zunlei Feng
  • Zipeng Zhong
  • Shaolun Yao
  • Jiacong Hu
  • Mingjiang Duan
  • Xingen Wang

Deep learning-based methods significantly advance the exploration of associations among triple-wise biological entities (e.g., drug-target protein-adverse reaction), thereby facilitating drug discovery and safeguarding human health. However, existing research only focuses on entity-centric information mapping and aggregation, neglecting the crucial role of potential association patterns among different entities. To address the above limitation, we propose a novel association pattern-aware fusion method for biological entity relationship prediction, which effectively integrates the related association pattern information into entity representation learning. Additionally, to enhance the missing information of the low-order message passing, we devise a bind-relation module that considers the strong bind of low-order entity associations. Extensive experiments conducted on three biological datasets quantitatively demonstrate that the proposed method achieves about 4%-23% hit@1 improvements compared with state-of-the-art baselines. Furthermore, the interpretability of association patterns is elucidated in detail, thus revealing the intrinsic biological mechanisms and promoting its deployment in real-world scenarios. Our data and code are available at https://github.com/hry98kki/PatternBERP.

NeurIPS Conference 2024 Conference Paper

Can Graph Neural Networks Expose Training Data Properties? An Efficient Risk Assessment Approach

  • Hanyang Yuan
  • Jiarong Xu
  • Renhong Huang
  • Mingli Song
  • Chunping Wang
  • Yang Yang

Graph neural networks (GNNs) have attracted considerable attention due to their diverse applications. However, the scarcity and quality limitations of graph data present challenges to their training process in practical settings. To facilitate the development of effective GNNs, companies and researchers often seek external collaboration. Yet, directly sharing data raises privacy concerns, motivating data owners to train GNNs on their private graphs and share the trained models. Unfortunately, these models may still inadvertently disclose sensitive properties of their training graphs (e.g., average default rate in a transaction network), leading to severe consequences for data owners. In this work, we study graph property inference attack to identify the risk of sensitive property information leakage from shared models. Existing approaches typically train numerous shadow models to develop such an attack, which is computationally intensive and impractical. To address this issue, we propose an efficient graph property inference attack by leveraging model approximation techniques. Our method only requires training a small set of models on graphs, while generating a sufficient number of approximated shadow models for attacks. To enhance diversity while reducing errors in the approximated models, we apply edit distance to quantify the diversity within a group of approximated models and introduce a theoretically guaranteed criterion to evaluate each model's error. Subsequently, we propose a novel selection mechanism to ensure that the retained approximated models achieve high diversity and low error. Extensive experiments across six real-world scenarios demonstrate our method's substantial improvement, with average increases of 2.7% in attack accuracy and 4.1% in ROC-AUC, while being 6.5× faster compared to the best baseline.

ICLR Conference 2024 Conference Paper

Chain-of-Experts: When LLMs Meet Complex Operations Research Problems

  • Ziyang Xiao
  • Dongxiang Zhang
  • Yangjun Wu
  • Lilin Xu
  • Yuan Jessica Wang
  • Xiongwei Han
  • Xiaojin Fu
  • Tao Zhong 0004

Large language models (LLMs) have emerged as powerful techniques for various NLP tasks, such as mathematical reasoning and plan generation. In this paper, we study automatic modeling and programming for complex operation research (OR) problems, so as to alleviate the heavy dependence on domain experts and benefit a spectrum of industry sectors. We present the first LLM-based solution, namely Chain-of-Experts (CoE), a novel multi-agent cooperative framework to enhance reasoning capabilities. Specifically, each agent is assigned a specific role and endowed with domain knowledge related to OR. We also introduce a conductor to orchestrate these agents via forward thought construction and backward reflection mechanism. Furthermore, we release a benchmark dataset (ComplexOR) of complex OR problems to facilitate OR research and community development. Experimental results show that CoE significantly outperforms the state-of-the-art LLM-based approaches both on LPWP and ComplexOR.

NeurIPS Conference 2024 Conference Paper

Dual-Perspective Activation: Efficient Channel Denoising via Joint Forward-Backward Criterion for Artificial Neural Networks

  • Tian Qiu
  • Chenchao Gao
  • Zunlei Feng
  • Jie Lei
  • Bingde Hu
  • Xingen Wang
  • Yi Gao
  • Mingli Song

The design of Artificial Neural Network (ANN) is inspired by the working patterns of the human brain. Connections in biological neural networks are sparse, as they only exist between few neurons. Meanwhile, the sparse representation in ANNs has been shown to possess significant advantages. Activation responses of ANNs are typically expected to promote sparse representations, where key signals get activated while irrelevant/redundant signals are suppressed. It can be observed that samples of each category are only correlated with sparse and specific channels in ANNs. However, existing activation mechanisms often struggle to suppress signals from other irrelevant channels entirely, and these signals have been verified to be detrimental to the network's final decision. To address the issue of channel noise interference in ANNs, a novel end-to-end trainable Dual-Perspective Activation (DPA) mechanism is proposed. DPA efficiently identifies irrelevant channels and applies channel denoising under the guidance of a joint criterion established online from both forward and backward propagation perspectives while preserving activation responses from relevant channels. Extensive experiments demonstrate that DPA successfully denoises channels and facilitates sparser neural representations. Moreover, DPA is parameter-free, fast, applicable to many mainstream ANN architectures, and achieves remarkable performance compared to other existing activation counterparts across multiple tasks and domains. Code is available at https://github.com/horrible-dong/DPA.
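The general shape of a joint forward-backward channel criterion can be sketched as follows: score each channel by combining its forward activation magnitude with its backward gradient magnitude, and zero out low-scoring channels. The product combination rule and the keep-ratio below are assumptions standing in for the paper's actual criterion.

```python
import numpy as np

def joint_channel_mask(activations, grads, keep_ratio=0.5):
    """Zero out channels judged irrelevant from both directions.

    activations: (C, H, W) forward feature maps;
    grads: (C, H, W) gradients of the loss w.r.t. those maps.
    The per-channel product of mean |activation| and mean |gradient| is an
    assumed stand-in for the paper's joint forward-backward criterion.
    """
    fwd = np.abs(activations).mean(axis=(1, 2))   # forward evidence
    bwd = np.abs(grads).mean(axis=(1, 2))         # backward evidence
    score = fwd * bwd
    k = max(1, int(keep_ratio * len(score)))
    keep = np.argsort(score)[-k:]                 # top-k channels survive
    mask = np.zeros_like(score)
    mask[keep] = 1.0
    return activations * mask[:, None, None]
```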

ICLR Conference 2024 Conference Paper

Dynamic Neural Response Tuning

  • Tian Qiu
  • Wenxiang Xu
  • Lin Chen
  • Linyun Zhou
  • Zunlei Feng
  • Mingli Song

Artificial Neural Networks (ANNs) have gained widespread applications across various areas in recent years. The ANN design was initially inspired by principles of biology. The biological neural network's fundamental response process comprises information transmission and aggregation. The information transmission in biological neurons is often achieved by triggering action potentials that propagate through axons. ANNs utilize activation mechanisms to simulate such biological behavior. However, previous studies have only considered static response conditions, while the biological neuron's response conditions are typically dynamic, depending on multiple factors such as neuronal properties and the real-time environment. Therefore, the dynamic response conditions of biological neurons could help improve the static ones of existing activations in ANNs. Additionally, the biological neuron's aggregated response exhibits high specificity for different categories, allowing the nervous system to differentiate and identify objects. Inspired by these biological patterns, we propose a novel Dynamic Neural Response Tuning (DNRT) mechanism, which aligns the response patterns of ANNs with those of biological neurons. DNRT comprises Response-Adaptive Activation (RAA) and Aggregated Response Regularization (ARR), mimicking the biological neuron's information transmission and aggregation behaviors. RAA dynamically adjusts the response condition based on the characteristics and strength of the input signal. ARR is devised to enhance the network's ability to learn category specificity by imposing constraints on the network's response distribution. Extensive experimental studies indicate that the proposed DNRT is highly interpretable, applicable to various mainstream network architectures, and can achieve remarkable performance compared with existing neural response mechanisms in multiple tasks and domains. Code is available at https://github.com/horrible-dong/DNRT.

IJCAI Conference 2024 Conference Paper

Hundredfold Accelerating for Pathological Images Diagnosis and Prognosis through Self-reform Critical Region Focusing

  • XiaoTian Yu
  • Haoming Luo
  • Jiacong Hu
  • Xiuming Zhang
  • Yuexuan Wang
  • Wenjie Liang
  • Yijun Bei
  • Mingli Song

Pathological slides are commonly gigapixel images with abundant information and are therefore significant for clinical diagnosis. However, the ultra-large size makes both training and evaluation extremely time-consuming. Most existing methods need to crop the slide into patches, which also leads to large memory requirements. In this paper, we propose the Self-reform Multilayer Transformer (SMT) to accelerate pathological image diagnosis and prognosis. Inspired by the pathologists' diagnostic procedure, SMT is designed to achieve layer-by-layer focus on critical regions. In the forward process, the first layer takes thumbnails as inputs and measures the significance of each patch that deserves focusing. Images from focused regions are cropped with a higher magnification and used as the input of the next layer. By analogy, the third layer's inputs are focused images from the second layer, which contain abundant cellular features. In addition to the forward focusing, the backward reform strategy is proposed to improve the precision of former layers. This cyclic process achieves iterative interactions for better performance on both classification and focusing. In this way, only a small part of critical patches are required in SMT for diagnosis and prognosis. Extensive experiments demonstrate that SMT achieves hundreds of times faster speed, while achieving comparable accuracy and less storage compared with existing SOTA methods.

NeurIPS Conference 2024 Conference Paper

Improved Regret for Bandit Convex Optimization with Delayed Feedback

  • Yuanyu Wan
  • Chang Yao
  • Mingli Song
  • Lijun Zhang

We investigate bandit convex optimization (BCO) with delayed feedback, where only the loss value of the action is revealed under an arbitrary delay. Let $n, T, \bar{d}$ denote the dimensionality, time horizon, and average delay, respectively. Previous studies have achieved an $O(\sqrt{n}T^{3/4}+(n\bar{d})^{1/3}T^{2/3})$ regret bound for this problem, whose delay-independent part matches the regret of the classical non-delayed bandit gradient descent algorithm. However, there is a large gap between its delay-dependent part, i.e., $O((n\bar{d})^{1/3}T^{2/3})$, and an existing $\Omega(\sqrt{\bar{d}T})$ lower bound. In this paper, we illustrate that this gap can be filled in the worst case, where $\bar{d}$ is very close to the maximum delay $d$. Specifically, we first develop a novel algorithm, and prove that it enjoys a regret bound of $O(\sqrt{n}T^{3/4}+\sqrt{dT})$ in general. Compared with the previous result, our regret bound is better for $d=O((n\bar{d})^{2/3}T^{1/3})$, and the delay-dependent part is tight in the worst case. The primary idea is to decouple the joint effect of the delays and the bandit feedback on the regret by carefully incorporating the delayed bandit feedback with a blocking update mechanism. Furthermore, we show that the proposed algorithm can improve the regret bound to $O((nT)^{2/3}\log^{1/3}T+d\log T)$ for strongly convex functions. Finally, if the action sets are unconstrained, we demonstrate that it can be simply extended to achieve an $O(n\sqrt{T\log T}+d\log T)$ regret bound for strongly convex and smooth functions.
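The blocking update mechanism can be sketched in full-information form: freeze the action within each block, buffer feedback as it arrives under arbitrary delays, and apply it only at block boundaries. A simplified sketch under stated assumptions; the paper works with bandit (loss-value-only) feedback and a gradient estimator rather than the exact gradients used here.

```python
import numpy as np

def blocked_delayed_ogd(grad_fn, delays, T, dim, block, eta):
    """Online gradient descent with a blocking update mechanism.

    The action is frozen within each block; feedback revealed after an
    arbitrary delay is buffered and applied only at block boundaries,
    decoupling the delay pattern from the update schedule. Feedback
    arriving after round T is simply discarded.
    """
    x = np.zeros(dim)
    pending = {}      # arrival round -> gradients first revealed then
    buffer = []       # received but not yet applied
    actions = []
    for t in range(T):
        actions.append(x.copy())
        g = grad_fn(x, t)                                # generated now ...
        pending.setdefault(t + delays[t], []).append(g)  # ... seen later
        buffer.extend(pending.pop(t, []))
        if (t + 1) % block == 0:                         # block boundary
            for g_late in buffer:
                x = x - eta * g_late
            buffer.clear()
    return actions

# Minimizing f(x) = (x - 1)^2 / 2 with unit delays and block size 2:
acts = blocked_delayed_ogd(lambda x, t: x - 1.0, [1] * 200, 200, 1, 2, 0.1)
```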

IJCAI Conference 2024 Conference Paper

Improving Adversarial Robustness via Feature Pattern Consistency Constraint

  • Jiacong Hu
  • Jingwen Ye
  • Zunlei Feng
  • Jiazhen Yang
  • Shunyu Liu
  • XiaoTian Yu
  • Lingxiang Jia
  • Mingli Song

Convolutional Neural Networks (CNNs) are well-known for their vulnerability to adversarial attacks, posing significant security concerns. In response to these threats, various defense methods have emerged to bolster the model's robustness. However, most existing methods either focus on learning from adversarial perturbations, leading to overfitting to the adversarial examples, or aim to eliminate such perturbations during inference, inevitably increasing computational burdens. Conversely, clean training, which strengthens the model's robustness by relying solely on clean examples, can address the aforementioned issues. In this paper, we align with this methodological stream and enhance its generalizability to unknown adversarial examples. This enhancement is achieved by scrutinizing the behavior of latent features within the network. Recognizing that a correct prediction relies on the correctness of the latent feature's pattern, we introduce a novel and effective Feature Pattern Consistency Constraint (FPCC) method to reinforce the latent feature's capacity to maintain the correct feature pattern. Specifically, we propose Spatial-wise Feature Modification and Channel-wise Feature Selection to enhance latent features. Subsequently, we employ the Pattern Consistency Loss to constrain the similarity between the feature pattern of the latent features and the correct feature pattern. Our experiments demonstrate that the FPCC method empowers latent features to uphold correct feature patterns even in the face of adversarial examples, resulting in inherent adversarial robustness surpassing state-of-the-art models.

ECAI Conference 2024 Conference Paper

Learning a Mini-Batch Graph Transformer via Two-Stage Interaction Augmentation

  • Wenda Li 0003
  • Kaixuan Chen 0004
  • Shunyu Liu 0001
  • Tongya Zheng
  • Wenjie Huang
  • Mingli Song

Mini-batch Graph Transformer (MGT), as an emerging graph learning model, has demonstrated significant advantages in semi-supervised node prediction tasks with improved computational efficiency and enhanced model robustness. However, existing methods for processing local information either rely on sampling or simple aggregation, which respectively result in the loss and squashing of critical neighbor information. Moreover, the limited number of nodes in each mini-batch restricts the model’s capacity to capture the global characteristic of the graph. In this paper, we propose LGMformer, a novel MGT model that employs a two-stage augmented interaction strategy, transitioning from local to global perspectives, to address the aforementioned bottlenecks. The local interaction augmentation (LIA) presents a neighbor-target interaction Transformer (NTIformer) to acquire an insightful understanding of the co-interaction patterns between neighbors and the target node, resulting in a locally effective token list that serves as input for the MGT. In contrast, global interaction augmentation (GIA) adopts a cross-attention mechanism to incorporate entire graph prototypes into the target node representation, thereby compensating for the global graph information to ensure a more comprehensive perception. To this end, LGMformer achieves the enhancement of node representations under the MGT paradigm. Experimental results related to node classification on the ten benchmark datasets demonstrate the effectiveness of the proposed method. Our code is available at https://github.com/l-wd/LGMformer.

NeurIPS Conference 2024 Conference Paper

LG-CAV: Train Any Concept Activation Vector with Language Guidance

  • Qihan Huang
  • Jie Song
  • Mengqi Xue
  • Haofei Zhang
  • Bingde Hu
  • Huiqiong Wang
  • Hao Jiang
  • Xingen Wang

Concept activation vector (CAV) has attracted broad research interest in explainable AI, by elegantly attributing model predictions to specific concepts. However, the training of CAV often necessitates a large number of high-quality images, which are expensive to curate and thus limited to a predefined set of concepts. To address this issue, we propose Language-Guided CAV (LG-CAV) to harness the abundant concept knowledge within certain pre-trained vision-language models (e.g., CLIP). This method allows training any CAV without labeled data, by utilizing the corresponding concept descriptions as guidance. To bridge the gap between the vision-language model and the target model, we calculate the activation values of concept descriptions on a common pool of images (probe images) with the vision-language model and utilize them as language guidance to train the LG-CAV. Furthermore, after training high-quality LG-CAVs related to all the predicted classes in the target model, we propose the activation sample reweighting (ASR), serving as a model correction technique, to improve the performance of the target model in return. Experiments on four datasets across nine architectures demonstrate that LG-CAV achieves significantly superior quality to previous CAV methods given any concept, and our model correction method achieves state-of-the-art performance compared to existing concept-based methods. Our code is available at https://github.com/hqhQAQ/LG-CAV.

NeurIPS Conference 2024 Conference Paper

Model LEGO: Creating Models Like Disassembling and Assembling Building Blocks

  • Jiacong Hu
  • Jing Gao
  • Jingwen Ye
  • Yang Gao
  • Xingen Wang
  • Zunlei Feng
  • Mingli Song

With the rapid development of deep learning, the increasing complexity and scale of parameters make training a new model increasingly resource-intensive. In this paper, we start from the classic convolutional neural network (CNN) and explore a paradigm that does not require training to obtain new models. Similar to the birth of CNN inspired by receptive fields in the biological visual system, we draw inspiration from the information subsystem pathways in the biological visual system and propose Model Disassembling and Assembling (MDA). During model disassembling, we introduce the concept of relative contribution and propose a component locating technique to extract task-aware components from trained CNN classifiers. For model assembling, we present the alignment padding strategy and parameter scaling strategy to construct a new model tailored for a specific task, utilizing the disassembled task-aware components. The entire process is akin to playing with LEGO bricks, enabling arbitrary assembly of new models, and providing a novel perspective for model creation and reuse. Extensive experiments showcase that task-aware components disassembled from CNN classifiers or new models assembled using these components closely match or even surpass the performance of the baseline, demonstrating its promising results for model reuse. Furthermore, MDA exhibits diverse potential applications, with comprehensive experiments exploring model decision route analysis, model compression, knowledge distillation, and more.

ICML Conference 2024 Conference Paper

Non-stationary Online Convex Optimization with Arbitrary Delays

  • Yuanyu Wan
  • Chang Yao 0001
  • Mingli Song
  • Lijun Zhang 0005

Online convex optimization (OCO) with arbitrary delays, in which gradients or other information of functions could be arbitrarily delayed, has received increasing attention recently. Different from previous studies that focus on stationary environments, this paper investigates the delayed OCO in non-stationary environments, and aims to minimize the dynamic regret with respect to any sequence of comparators. To this end, we first propose a simple algorithm, namely DOGD, which performs a gradient descent step for each delayed gradient according to their arrival order. Despite its simplicity, our novel analysis shows that the dynamic regret of DOGD can be automatically bounded by $O(\sqrt{\bar{d}T}(P_T+1))$ under mild assumptions, and $O(\sqrt{dT}(P_T+1))$ in the worst case, where $\bar{d}$ and $d$ denote the average and maximum delay respectively, $T$ is the time horizon, and $P_T$ is the path-length of comparators. Furthermore, we develop an improved algorithm, which reduces those dynamic regret bounds achieved by DOGD to $O(\sqrt{\bar{d}T(P_T+1)})$ and $O(\sqrt{dT(P_T+1)})$, respectively. The key idea is to run multiple DOGD with different learning rates, and utilize a meta-algorithm to track the best one based on their delayed performance. Finally, we demonstrate that our improved algorithm is optimal in the worst case by deriving a matching lower bound.
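The basic DOGD update described above, one descent step per delayed gradient applied in arrival order, can be sketched as follows. This is a minimal illustrative version in plain NumPy; the function name `dogd` and its signature are hypothetical, and the paper's actual algorithm additionally tunes the step size against the path-length $P_T$:

```python
import numpy as np

def dogd(grad_fn, x0, delays, T, eta=0.1):
    """Sketch of DOGD for delayed online convex optimization.

    At round t we play x_t and query its gradient; the gradient queried
    at round s becomes available at round s + delays[s]. Each round, we
    apply one gradient descent step per newly arrived gradient, in
    arrival order (illustrative only, not the paper's exact tuning).
    """
    x = np.asarray(x0, dtype=float)
    pending = {}                                # round -> queried gradient
    for t in range(T):
        pending[t] = grad_fn(x, t)              # gradient of f_t at x_t
        arrived = [s for s in list(pending) if s + delays[s] <= t]
        for s in arrived:                       # one step per arrived gradient
            x = x - eta * pending.pop(s)
    return x
```

On a simple quadratic stream with random delays, the iterate still converges toward the minimizer, matching the intuition that the arrival-order steps only reorder (and slightly postpone) the descent.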

AAAI Conference 2024 Conference Paper

On the Concept Trustworthiness in Concept Bottleneck Models

  • Qihan Huang
  • Jie Song
  • Jingwen Hu
  • Haofei Zhang
  • Yong Wang
  • Mingli Song

Concept Bottleneck Models (CBMs), which break down the reasoning process into the input-to-concept mapping and the concept-to-label prediction, have garnered significant attention due to their remarkable interpretability achieved by the interpretable concept bottleneck. However, despite the transparency of the concept-to-label prediction, the mapping from the input to the intermediate concept remains a black box, giving rise to concerns about the trustworthiness of the learned concepts (i.e., these concepts may be predicted based on spurious cues). The issue of concept untrustworthiness greatly hampers the interpretability of CBMs, thereby hindering their further advancement. To conduct a comprehensive analysis on this issue, in this study we establish a benchmark to assess the trustworthiness of concepts in CBMs. A pioneering metric, referred to as concept trustworthiness score, is proposed to gauge whether the concepts are derived from relevant regions. Additionally, an enhanced CBM is introduced, enabling concept predictions to be made specifically from distinct parts of the feature map, thereby facilitating the exploration of their related regions. Besides, we introduce three modules, namely the cross-layer alignment (CLA) module, the cross-image alignment (CIA) module, and the prediction alignment (PA) module, to further enhance the concept trustworthiness within the elaborated CBM. The experiments on five datasets across ten architectures demonstrate that without using any concept localization annotations during training, our model improves the concept trustworthiness by a large margin, meanwhile achieving superior accuracy to the state-of-the-arts. Our code is available at https://github.com/hqhQAQ/ProtoCBM.

AAAI Conference 2024 Conference Paper

Progressive Feature Self-Reinforcement for Weakly Supervised Semantic Segmentation

  • Jingxuan He
  • Lechao Cheng
  • Chaowei Fang
  • Zunlei Feng
  • Tingting Mu
  • Mingli Song

Compared to conventional semantic segmentation with pixel-level supervision, weakly supervised semantic segmentation (WSSS) with image-level labels poses the challenge that it commonly focuses on the most discriminative regions, resulting in a disparity between weakly and fully supervised scenarios. A typical manifestation is the diminished precision on object boundaries, leading to deteriorated accuracy of WSSS. To alleviate this issue, we propose to adaptively partition the image content into certain regions (e.g., confident foreground and background) and uncertain regions (e.g., object boundaries and misclassified categories) for separate processing. For uncertain cues, we propose an adaptive masking strategy and seek to recover the local information with self-distilled knowledge. We further assume that confident regions should be robust enough to preserve the global semantics, and introduce a complementary self-distillation method that constrains semantic consistency between confident regions and an augmented view with the same class labels. Extensive experiments conducted on PASCAL VOC 2012 and MS COCO 2014 demonstrate that our proposed single-stage approach for WSSS not only outperforms state-of-the-art counterparts but also surpasses multi-stage methods that trade complexity for accuracy.

IJCAI Conference 2024 Conference Paper

ProtoPFormer: Concentrating on Prototypical Parts in Vision Transformers for Interpretable Image Recognition

  • Mengqi Xue
  • Qihan Huang
  • Haofei Zhang
  • Jingwen Hu
  • Jie Song
  • Mingli Song
  • Canghong Jin

Prototypical part network (ProtoPNet) and its variants have drawn wide attention and been applied to various tasks due to their inherent self-explanatory property. Previous ProtoPNets are primarily built upon convolutional neural networks (CNNs). Therefore, it is natural to investigate whether these explainable methods can be advantageous for the recently emerged Vision Transformers (ViTs). However, directly utilizing ViT-backed models as backbones can lead to prototypes paying excessive attention to background positions rather than foreground objects (i.e., the “distraction” problem). To address the problem, this paper proposes prototypical part Transformer (ProtoPFormer) for interpretable image recognition. Based on the architectural characteristics of ViTs, we modify the original ProtoPNet by creating separate global and local branches, each accompanied by corresponding prototypes that can capture and highlight representative holistic and partial features. Specifically, the global prototypes can guide local prototypes to concentrate on the foreground and effectively suppress the background influence. Subsequently, local prototypes are explicitly supervised to concentrate on different discriminative visual parts. Finally, the two branches mutually correct each other and jointly make the final decisions. Moreover, extensive experiments demonstrate that ProtoPFormer can consistently achieve superior performance on accuracy, visualization results, and quantitative interpretability evaluation over the state-of-the-art (SOTA) baselines. Our code has been released at https://github.com/zju-vipa/ProtoPFormer.

AAAI Conference 2024 Conference Paper

Sampling-Resilient Multi-Object Tracking

  • Zepeng Li
  • Dongxiang Zhang
  • Sai Wu
  • Mingli Song
  • Gang Chen

Multi-Object Tracking (MOT) is a cornerstone operator for video surveillance applications. To enable real-time processing of large-scale live video streams, we study an interesting scenario called down-sampled MOT, which performs object tracking only on a small subset of video frames. The problem is challenging for state-of-the-art MOT methods, which exhibit significant performance degradation under high frame reduction ratios. In this paper, we devise a sampling-resilient tracker with a novel sparse-observation Kalman filter (SOKF). It integrates an LSTM network to capture non-linear and dynamic motion patterns caused by sparse observations. Since the LSTM-based state transition is not compatible with the original noise estimation mechanism, we propose new estimation strategies based on Bayesian neural networks and derive the optimal Kalman gain for SOKF. To associate the detected bounding boxes robustly, we also propose a comprehensive similarity metric that systematically integrates multiple spatial matching signals. Experiments on three benchmark datasets show that our proposed tracker achieves the best trade-off between efficiency and accuracy. With the same tracking accuracy, we reduce the total processing time of ByteTrack by 2× in MOT17 and 3× in DanceTrack.
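For context, the down-sampled setting can be illustrated with a standard constant-velocity Kalman filter run only on every k-th frame, the kind of linear-transition baseline whose degradation under sparse observations motivates SOKF. This is a sketch only: the function name and noise values are hypothetical, and SOKF itself replaces the linear transition with an LSTM and estimates the noise with Bayesian neural networks.

```python
import numpy as np

def kalman_downsampled(zs, keep_every, dt=1.0, q=1e-3, r=1e-2):
    """Constant-velocity Kalman filter on a down-sampled 1-D track.

    Only every `keep_every`-th measurement is observed, so the effective
    time step between predict/update cycles grows with the sampling gap.
    Returns the filtered position estimates at the kept frames.
    """
    step = dt * keep_every
    F = np.array([[1.0, step], [0.0, 1.0]])   # position-velocity transition
    H = np.array([[1.0, 0.0]])                # observe position only
    Q = q * np.eye(2)
    R = np.array([[r]])
    x = np.array([zs[0], 0.0])                # initial state: [pos, vel]
    P = np.eye(2)
    out = []
    for z in zs[::keep_every]:
        x = F @ x                             # predict across the gap
        P = F @ P @ F.T + Q
        y = z - H @ x                         # innovation
        S = H @ P @ H.T + R
        K = P @ H.T @ np.linalg.inv(S)        # Kalman gain
        x = x + K @ y
        P = (np.eye(2) - K @ H) @ P
        out.append(x[0])
    return out
```

On a linearly moving target this linear model still tracks well; the hard cases the paper targets are non-linear, dynamic motion patterns, where a fixed transition matrix breaks down under sparse observations.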

ECAI Conference 2024 Conference Paper

SecPE: Secure Prompt Ensembling for Private and Robust Large Language Models

  • Jiawen Zhang 0005
  • Kejia Chen 0007
  • Zunlei Feng
  • Jian Lou 0001
  • Mingli Song

With the growing popularity of LLMs among the general public users, privacy-preserving and adversarial robustness have become two pressing demands for LLM-based services, which have largely been pursued separately but rarely jointly. In this paper, to the best of our knowledge, we are among the first attempts towards robust and private LLM inference by tightly integrating two disconnected fields: private inference and prompt ensembling. The former protects users’ privacy by encrypting inference data transmitted and processed by LLMs, while the latter enhances adversarial robustness by yielding an aggregated output from multiple prompted LLM responses. Although each is widely recognized as effective individually, combining private inference with prompt ensembling entails new challenges that render the naive combination of existing techniques inefficient. To overcome the hurdles, we propose SecPE, which designs efficient fully homomorphic encryption (FHE) counterparts for the core algorithmic building blocks of prompt ensembling. We conduct extensive experiments on 8 tasks to evaluate the accuracy, robustness, and efficiency of SecPE. The results show that SecPE maintains high clean accuracy and offers better robustness at the expense of merely 2.5% efficiency overhead compared to baseline private inference methods, indicating a satisfactory “accuracy-robustness-efficiency” tradeoff. For the efficiency of the encrypted Argmax operation that incurs major slowdown for prompt ensembling, SecPE is 35.4 times faster than the state-of-the-art peers, which can be of independent interest beyond this work.

NeurIPS Conference 2024 Conference Paper

Transformer Doctor: Diagnosing and Treating Vision Transformers

  • Jiacong Hu
  • Hao Chen
  • Kejia Chen
  • Yang Gao
  • Jingwen Ye
  • Xingen Wang
  • Mingli Song
  • Zunlei Feng

Due to their powerful representational capabilities, Transformers have gradually become the mainstream model in the field of machine vision. However, the vast and complex parameters of Transformers impede researchers from gaining a deep understanding of their internal mechanisms, especially error mechanisms. Existing methods for interpreting Transformers mainly focus on understanding them from the perspectives of the importance of input tokens or internal modules, as well as the formation and meaning of features. In contrast, inspired by research on information integration mechanisms and conjunctive errors in the biological visual system, this paper conducts an in-depth exploration of the internal error mechanisms of Transformers. We first propose an information integration hypothesis for Transformers in the machine vision domain and provide substantial experimental evidence to support this hypothesis. This includes the dynamic integration of information among tokens and the static integration of information within tokens in Transformers, as well as the presence of conjunctive errors therein. Addressing these errors, we further propose heuristic dynamic integration constraint methods and rule-based static integration constraint methods to rectify errors and ultimately improve model performance. The entire methodology framework is termed as Transformer Doctor, designed for diagnosing and treating internal errors within transformers. Through a plethora of quantitative and qualitative experiments, it has been demonstrated that Transformer Doctor can effectively address internal errors in transformers, thereby enhancing model performance.

NeurIPS Conference 2024 Conference Paper

Vision Mamba Mender

  • Jiacong Hu
  • Anda Cao
  • Zunlei Feng
  • Shengxuming Zhang
  • Yi Wang
  • Lingxiang Jia
  • Mingli Song

Mamba, a state-space model with selective mechanisms and hardware-aware architecture, has demonstrated outstanding performance in long sequence modeling tasks, particularly garnering widespread exploration and application in the field of computer vision. While existing works have mixed opinions of its application in visual tasks, the exploration of its internal workings and the optimization of its performance remain urgent and worthy research questions given its status as a novel model. Existing optimizations of the Mamba model, especially when applied in the visual domain, have primarily relied on predefined methods such as improving scanning mechanisms or integrating other architectures, often requiring strong priors and extensive trial and error. In contrast to these approaches, this paper proposes the Vision Mamba Mender, a systematic approach for understanding the workings of Mamba, identifying flaws within, and subsequently optimizing model performance. Specifically, we present methods for predictive correlation analysis of Mamba's hidden states from both internal and external perspectives, along with corresponding definitions of correlation scores, aimed at understanding the workings of Mamba in visual recognition tasks and identifying flaws therein. Additionally, tailored repair methods are proposed for identified external and internal state flaws to eliminate them and optimize model performance. Extensive experiments validate the efficacy of the proposed methods on prevalent Mamba architectures, significantly enhancing Mamba's performance.

ECAI Conference 2023 Conference Paper

Adversarial Erasing with Pruned Elements: Towards Better Graph Lottery Tickets

  • Yuwen Wang
  • Shunyu Liu 0001
  • Kaixuan Chen 0004
  • Tongtian Zhu
  • Ji Qiao
  • Mengjie Shi
  • Yuanyu Wan
  • Mingli Song

Graph Lottery Ticket (GLT), a combination of core subgraph and sparse subnetwork, has been proposed to mitigate the computational cost of deep Graph Neural Networks (GNNs) on large input graphs while preserving original performance. However, the winning GLTs in existing studies are obtained by applying iterative magnitude-based pruning (IMP) without re-evaluating and re-considering the pruned information, which disregards the dynamic changes in the significance of edges/weights during graph/model structure pruning, and thus limits the appeal of the winning tickets. In this paper, we formulate a conjecture, i.e., there exists overlooked valuable information in the pruned graph connections and model parameters that can be re-grouped into the GLT to enhance the final performance. Specifically, we propose an adversarial complementary erasing (ACE) framework to explore the valuable information from the pruned components, thereby developing a more powerful GLT, referred to as the ACE-GLT. The main idea is to mine valuable information from pruned edges/weights after each round of IMP, and employ the ACE technique to refine the GLT processing. Finally, experimental results demonstrate that our ACE-GLT outperforms existing methods for searching GLT in diverse tasks. Our code is available at https://github.com/Wangyuwen0627/ACE-GLT.

AAAI Conference 2023 Conference Paper

Contrastive Identity-Aware Learning for Multi-Agent Value Decomposition

  • Shunyu Liu
  • Yihe Zhou
  • Jie Song
  • Tongya Zheng
  • Kaixuan Chen
  • Tongtian Zhu
  • Zunlei Feng
  • Mingli Song

Value Decomposition (VD) aims to deduce the contributions of agents for decentralized policies in the presence of only global rewards, and has recently emerged as a powerful credit assignment paradigm for tackling cooperative Multi-Agent Reinforcement Learning (MARL) problems. One of the main challenges in VD is to promote diverse behaviors among agents, while existing methods directly encourage the diversity of learned agent networks with various strategies. However, we argue that these dedicated designs for agent networks are still limited by the indistinguishable VD network, leading to homogeneous agent behaviors and thus downgrading the cooperation capability. In this paper, we propose a novel Contrastive Identity-Aware learning (CIA) method, explicitly boosting the credit-level distinguishability of the VD network to break the bottleneck of multi-agent diversity. Specifically, our approach leverages contrastive learning to maximize the mutual information between the temporal credits and identity representations of different agents, encouraging the full expressiveness of credit assignment and further the emergence of individualities. The implementation of the proposed CIA module is simple yet effective and can be readily incorporated into various VD architectures. Experiments on the SMAC benchmarks and across different VD backbones demonstrate that the proposed method yields results superior to the state-of-the-art counterparts. Our code is available at https://github.com/liushunyu/CIA.

ICML Conference 2023 Conference Paper

Decentralized SGD and Average-direction SAM are Asymptotically Equivalent

  • Tongtian Zhu
  • Fengxiang He
  • Kaixuan Chen 0004
  • Mingli Song
  • Dacheng Tao

Decentralized stochastic gradient descent (D-SGD) allows collaborative learning on massive devices simultaneously without the control of a central server. However, existing theories claim that decentralization invariably undermines generalization. In this paper, we challenge the conventional belief and present a completely new perspective for understanding decentralized learning. We prove that D-SGD implicitly minimizes the loss function of an average-direction Sharpness-aware minimization (SAM) algorithm under general non-convex non-$\beta$-smooth settings. This surprising asymptotic equivalence reveals an intrinsic regularization-optimization trade-off and three advantages of decentralization: (1) there exists a free uncertainty evaluation mechanism in D-SGD to improve posterior estimation; (2) D-SGD exhibits a gradient smoothing effect; and (3) the sharpness regularization effect of D-SGD does not decrease as total batch size increases, which justifies the potential generalization benefit of D-SGD over centralized SGD (C-SGD) in large-batch scenarios.

NeurIPS Conference 2023 Conference Paper

Lookaround Optimizer: $k$ steps around, 1 step average

  • Jiangtao Zhang
  • Shunyu Liu
  • Jie Song
  • Tongtian Zhu
  • Zhengqi Xu
  • Mingli Song

Weight Average (WA) is an active research topic due to its simplicity in ensembling deep networks and the effectiveness in promoting generalization. Existing weight average approaches, however, are often carried out along only one training trajectory in a post-hoc manner (i.e., the weights are averaged after the entire training process is finished), which significantly degrades the diversity between networks and thus impairs the effectiveness. In this paper, inspired by weight average, we propose Lookaround, a straightforward yet effective SGD-based optimizer leading to flatter minima with better generalization. Specifically, Lookaround iterates two steps during the whole training period: the around step and the average step. In each iteration, 1) the around step starts from a common point and trains multiple networks simultaneously, each on transformed data by a different data augmentation, and 2) the average step averages these trained networks to get the averaged network, which serves as the starting point for the next iteration. The around step improves the functionality diversity while the average step guarantees the weight locality of these networks during the whole training, which is essential for WA to work. We theoretically explain the superiority of Lookaround by convergence analysis, and conduct extensive experiments to evaluate Lookaround on popular benchmarks including CIFAR and ImageNet with both CNNs and ViTs, demonstrating clear superiority over state-of-the-arts. Our code is available at https://github.com/Ardcy/Lookaround.
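The two-step loop of the around step and the average step can be sketched as follows. This is a minimal illustrative version on a toy objective: `lookaround_step` and the quadratic losses standing in for differently augmented views are hypothetical, not the authors' implementation.

```python
import numpy as np

def lookaround_step(w, data_views, grad_fn, k=3, lr=0.1):
    """One Lookaround iteration (illustrative sketch).

    Around step: from the common point `w`, run k SGD steps on each
    transformed view of the data independently.
    Average step: average the resulting weights; the mean becomes the
    starting point of the next iteration, preserving weight locality.
    """
    branches = []
    for view in data_views:                 # one branch per augmentation
        wi = w.copy()
        for _ in range(k):
            wi = wi - lr * grad_fn(wi, view)  # k SGD steps around w
        branches.append(wi)
    return np.mean(branches, axis=0)        # weight average
```

Because every branch restarts from the freshly averaged point, the networks never drift far apart, which is the locality property the abstract argues is essential for weight averaging to work.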

ECAI Conference 2023 Conference Paper

LoSS: Local Structural Separation Hypergraph Convolutional Neural Network

  • Bingde Hu
  • Yang Gao 0001
  • Zunlei Feng
  • Mingli Song
  • Xinyu Wang 0001
  • Ying Li 0097

Graph classification is a classic problem with practical applications in many real-life scenarios. Existing graph neural networks, including GCN, GAT, and GIN, are proposed to extract useful features from complex graph structures. However, most existing methods’ feature extraction and aggregation inevitably mix the useful and redundant features, which will disturb the final classification performance. In this paper, to handle the above drawback, we put forward the Local Structural Separation Hypergraph Convolutional Neural Network (LoSS) based on two discoveries: most graph classification tasks only focus on a few groups of adjacent nodes, and different categories have their specific high response bits in graph embeddings. In LoSS, we first decouple the original graph into different hypergraphs and aggregate the features in each substructure, which aims to find useful features for the final classification. Next, the low-correlation feature suppression strategy is devised to suppress the irrelevant node-level and bit-level features in the forward inference process, effectively reducing the disturbance of redundant features. Experiments on five datasets show that the proposed LoSS can effectively locate and aggregate useful hypergraph features and achieve SOTA performance compared with existing methods.

AAAI Conference 2023 Conference Paper

Neural TSP Solver with Progressive Distillation

  • Dongxiang Zhang
  • Ziyang Xiao
  • Yuan Wang
  • Mingli Song
  • Gang Chen

Travelling salesman problem (TSP) is NP-Hard with exponential search space. Recently, the adoption of encoder-decoder models as neural TSP solvers has emerged as an attractive topic because they can instantly obtain near-optimal results for small-scale instances. Nevertheless, their training efficiency and solution quality degrade dramatically when dealing with large-scale problems. To address the issue, we propose a novel progressive distillation framework, by adopting curriculum learning to train TSP samples in increasing order of their problem size and progressively distilling high-level knowledge from small models to large models via a distillation loss. In other words, the trained small models are used as the teacher network to guide action selection when training large models. To accelerate training speed, we also propose a Delaunay-graph based action mask and a new attention-based decoder to reduce decoding cost. Experimental results show that our approach establishes clear advantages over existing encoder-decoder models in terms of training effectiveness and solution quality. In addition, we validate its usefulness as an initial solution generator for the state-of-the-art TSP solvers, whose probability of obtaining the optimal solution can be further improved in such a hybrid manner.

ICLR Conference 2023 Conference Paper

Schema Inference for Interpretable Image Classification

  • Haofei Zhang
  • Mengqi Xue
  • Xiaokang Liu
  • Kaixuan Chen 0004
  • Jie Song 0011
  • Mingli Song

In this paper, we study a novel inference paradigm, termed as schema inference, that learns to deductively infer the explainable predictions by rebuilding the prior deep neural network (DNN) forwarding scheme, guided by the prevalent philosophical cognitive concept of schema. We strive to reformulate the conventional model inference pipeline into a graph matching policy that associates the extracted visual concepts of an image with the pre-computed scene impression, by analogy with human reasoning mechanism via impression matching. To this end, we devise an elaborated architecture, termed as SchemaNet, as a dedicated instantiation of the proposed schema inference concept, that models both the visual semantics of input instances and the learned abstract imaginations of target categories as topological relational graphs. Meanwhile, to capture and leverage the compositional contributions of visual semantics in a global view, we also introduce a universal Feat2Graph scheme in SchemaNet to establish the relational graphs that contain abundant interaction information. Both the theoretical analysis and the experimental results on several benchmarks demonstrate that the proposed schema inference achieves encouraging performance and meanwhile yields a clear picture of the deductive process leading to the predictions. Our code is available at https://github.com/zhfeing/SchemaNet-PyTorch.

IJCAI Conference 2023 Conference Paper

Temporal Constrained Feasible Subspace Learning for Human Pose Forecasting

  • Gaoang Wang
  • Mingli Song

Human pose forecasting is a sequential modeling task that aims to predict future poses from historical motions. Most existing approaches focus on the spatial-temporal neural network model design for learning movement patterns to reduce prediction errors. However, they usually do not strictly follow the temporal constraints in the inference stage. Even though a small Mean Per Joint Position Error (MPJPE) is achieved, some of the predicted poses are not temporal feasible solutions, which disobeys the continuity of the body movement. In this paper, we consider the temporal constrained feasible solutions for human pose forecasting, where the predicted poses of input historical poses are guaranteed to obey the temporal constraints strictly in the inference stage. Rather than direct supervision of the prediction in the original pose space, a temporal constrained subspace is explicitly learned and then followed by an inverse transformation to obtain the final predictions. We evaluate the proposed method on large-scale benchmarks, including Human3.6M, AMASS, and 3DPW. State-of-the-art performance has been achieved with the temporal constrained feasible solutions.

IJCAI Conference 2022 Conference Paper

Comparison Knowledge Translation for Generalizable Image Classification

  • Zunlei Feng
  • Tian Qiu
  • Sai Wu
  • Xiaotuan Jin
  • Zengliang He
  • Mingli Song
  • Huiqiong Wang

Deep learning has recently achieved remarkable performance in image classification tasks, which depends heavily on massive annotation. However, the classification mechanism of existing deep learning models seems to contrast with humans' recognition mechanism. With only a glance at an image of an object, even of an unknown type, humans can quickly and precisely find other same category objects from massive images, which benefits from daily recognition of various objects. In this paper, we attempt to build a generalizable framework that emulates the humans' recognition mechanism in the image classification task, hoping to improve the classification performance on unseen categories with the support of annotations of other categories. Specifically, we investigate a new task termed Comparison Knowledge Translation (CKT). Given a set of fully labeled categories, CKT aims to translate the comparison knowledge learned from the labeled categories to a set of novel categories. To this end, we put forward a Comparison Classification Translation Network (CCT-Net), which comprises a comparison classifier and a matching discriminator. The comparison classifier is devised to classify whether two images belong to the same category or not, while the matching discriminator works together in an adversarial manner to ensure whether classified results match the truth. Exhaustive experiments show that CCT-Net achieves surprising generalization ability on unseen categories and SOTA performance on target categories.

AAAI Conference 2022 Conference Paper

Model Doctor: A Simple Gradient Aggregation Strategy for Diagnosing and Treating CNN Classifiers

  • Zunlei Feng
  • Jiacong Hu
  • Sai Wu
  • XiaoTian Yu
  • Jie Song
  • Mingli Song

Recently, Convolutional Neural Network (CNN) has achieved excellent performance in the classification task. It is widely known that CNN is deemed a ‘black box’, which makes it hard to understand its prediction mechanism and to debug wrong predictions. Some model debugging and explanation works are developed for solving the above drawbacks. However, those methods focus on explaining and diagnosing possible causes of model predictions, leaving researchers to carry out the subsequent model optimization manually. In this paper, we propose the first completely automatic model diagnosing and treating tool, termed as Model Doctor. Based on two discoveries that 1) each category is only correlated with sparse and specific convolution kernels, and 2) adversarial samples are isolated while normal samples are successive in the feature space, a simple aggregate gradient constraint is devised for effectively diagnosing and optimizing CNN classifiers. The aggregate gradient strategy is a versatile module for mainstream CNN classifiers. Extensive experiments demonstrate that the proposed Model Doctor applies to all existing CNN classifiers, and improves the accuracy of 16 mainstream CNN classifiers by 1% ∼ 5%.

AAAI Conference 2022 Conference Paper

Safe Distillation Box

  • Jingwen Ye
  • Yining Mao
  • Jie Song
  • Xinchao Wang
  • Cheng Jin
  • Mingli Song

Knowledge distillation (KD) has recently emerged as a powerful strategy to transfer knowledge from a pre-trained teacher model to a lightweight student, and has demonstrated its unprecedented success over a wide spectrum of applications. In spite of the encouraging results, the KD process per se poses a potential threat to network ownership protection, since the knowledge contained in the network can be effortlessly distilled and hence exposed to a malicious user. In this paper, we propose a novel framework, termed Safe Distillation Box (SDB), that allows us to wrap a pre-trained model in a virtual box for intellectual property protection. Specifically, SDB preserves the inference capability of the wrapped model to all users, but precludes KD from unauthorized users. For authorized users, on the other hand, SDB carries out a knowledge augmentation scheme to strengthen the KD performance and the results of the student model. In other words, all users may employ a model in SDB for inference, but only authorized users get access to KD from the model. The proposed SDB imposes no constraints over the model architecture, and may readily serve as a plug-and-play solution to protect the ownership of a pre-trained network. Experiments across various datasets and architectures demonstrate that, with SDB, the performance of an unauthorized KD drops significantly while that of an authorized one is enhanced, demonstrating the effectiveness of SDB.

ICML Conference 2022 Conference Paper

Topology-aware Generalization of Decentralized SGD

  • Tongtian Zhu
  • Fengxiang He
  • Lan Zhang 0002
  • Zhengyang Niu
  • Mingli Song
  • Dacheng Tao

This paper studies the algorithmic stability and generalizability of decentralized stochastic gradient descent (D-SGD). We prove that the consensus model learned by D-SGD is $\mathcal{O}(m/N + 1/m + \lambda^2)$-stable in expectation in the non-convex non-smooth setting, where $N$ is the total sample size of the whole system, $m$ is the worker number, and $1-\lambda$ is the spectral gap that measures the connectivity of the communication topology. These results then deliver an $\mathcal{O}\big(1/N + \big((m^{-1}\lambda^2)^{\frac{\alpha}{2}} + m^{-\alpha}\big)/N^{1-\frac{\alpha}{2}}\big)$ in-average generalization bound, which is non-vacuous even when $\lambda$ is close to $1$, in contrast to the vacuous bounds suggested by existing literature on the projected version of D-SGD. Our theory indicates that the generalizability of D-SGD has a positive correlation with the spectral gap, and can explain why consensus control in the initial training phase can ensure better generalization. Experiments with VGG-11 and ResNet-18 on CIFAR-10, CIFAR-100 and Tiny-ImageNet justify our theory. To the best of our knowledge, this is the first work on the topology-aware generalization of vanilla D-SGD. Code is available at \url{https://github.com/Raiden-Zhu/Generalization-of-DSGD}.
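The spectral gap $1-\lambda$ in the bound can be made concrete with a toy D-SGD round: each worker takes a local gradient step and then gossip-averages with its neighbors through a doubly stochastic mixing matrix $W$, whose second-largest eigenvalue magnitude is $\lambda$. A minimal numpy sketch under an assumed ring topology (function names are illustrative, not from the paper's code):

```python
import numpy as np

def ring_mixing_matrix(m):
    """Doubly stochastic gossip matrix for m workers on a ring:
    each worker averages itself with its two neighbors."""
    W = np.zeros((m, m))
    for i in range(m):
        W[i, i] = 1 / 3
        W[i, (i - 1) % m] = 1 / 3
        W[i, (i + 1) % m] = 1 / 3
    return W

def spectral_gap(W):
    """1 - lambda, where lambda is the second-largest eigenvalue magnitude."""
    eig = np.sort(np.abs(np.linalg.eigvals(W)))[::-1]
    return 1.0 - eig[1]

def dsgd_step(X, grads, W, lr):
    """One D-SGD round: local SGD step on each worker's row, then gossip
    averaging with neighbors via the mixing matrix."""
    return W @ (X - lr * grads)
```

With zero gradients, repeated `dsgd_step` calls drive all workers to the consensus (mean) model, at a rate governed by $\lambda$; a larger spectral gap means faster consensus, matching the positive correlation the theory predicts.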

AAAI Conference 2022 Conference Paper

Up to 100x Faster Data-Free Knowledge Distillation

  • Gongfan Fang
  • Kanya Mo
  • Xinchao Wang
  • Jie Song
  • Shitao Bei
  • Haofei Zhang
  • Mingli Song

Data-free knowledge distillation (DFKD) has recently been attracting increasing attention from research communities, attributed to its capability to compress a model using only synthetic data. Despite the encouraging results achieved, state-of-the-art DFKD methods still suffer from the inefficiency of data synthesis, making the data-free training process extremely time-consuming and thus inapplicable for large-scale tasks. In this work, we introduce an efficacious scheme, termed FastDFKD, that allows us to accelerate DFKD by orders of magnitude. At the heart of our approach is a novel strategy to reuse the shared common features in training data so as to synthesize different data instances. Unlike prior methods that optimize a set of data independently, we propose to learn a meta-synthesizer that seeks common features as the initialization for fast data synthesis. As a result, FastDFKD achieves data synthesis within only a few steps, significantly enhancing the efficiency of data-free training. Experiments over CIFAR, NYUv2, and ImageNet demonstrate that the proposed FastDFKD achieves 10× and even 100× acceleration while preserving performance on par with the state of the art. Code is available at https://github.com/zju-vipa/Fast-Datafree.
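The meta-synthesizer idea, learning a shared initialization so that each instance needs only a few synthesis steps, can be sketched as a Reptile-style meta-update over a toy objective. This is not FastDFKD's actual training loop; a quadratic loss stands in for the model-inversion loss, and both function names are hypothetical:

```python
import numpy as np

def synthesize(z0, target, lr=0.5, steps=5):
    """Few-step inner-loop synthesis starting from the shared init z0.
    Toy objective 0.5 * ||z - target||^2 stands in for the inversion loss."""
    z = z0.copy()
    for _ in range(steps):
        z -= lr * (z - target)        # gradient of the toy objective
    return z

def fast_dfkd_toy(targets, meta_lr=0.3, rounds=20):
    """Reptile-style meta-synthesizer: after each few-step synthesis, pull the
    shared init toward the solved instance, so later syntheses start near the
    features common to all targets."""
    z0 = np.zeros_like(targets[0])
    for _ in range(rounds):
        for t in targets:
            z = synthesize(z0, t)
            z0 += meta_lr * (z - z0)  # meta-update toward the solved instance
    return z0
```

Starting from the meta-learned init, a few inner steps suffice to reach any single target, which is the source of the reported speedup over optimizing each instance from scratch.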

IJCAI Conference 2021 Conference Paper

Boundary Knowledge Translation based Reference Semantic Segmentation

  • Lechao Cheng
  • Zunlei Feng
  • Xinchao Wang
  • Ya Jie Liu
  • Jie Lei
  • Mingli Song

Given a reference object of an unknown type in an image, human observers can effortlessly find the objects of the same category in another image and precisely tell their visual boundaries. Such visual cognition capability of humans seems absent from the current research spectrum of computer vision. Existing segmentation networks, for example, rely on a humongous amount of labeled data, which is laborious and costly to collect and annotate; besides, the performance of segmentation networks tends to degrade as the number of categories increases. In this paper, we introduce a novel Reference semantic segmentation Network (Ref-Net) to conduct visual boundary knowledge translation. Ref-Net contains a Reference Segmentation Module (RSM) and a Boundary Knowledge Translation Module (BKTM). Inspired by the human recognition mechanism, RSM is devised only to segment the same category objects based on the features of the reference objects. BKTM, on the other hand, introduces two boundary discriminator branches to conduct inner and outer boundary segmentation of the target object in an adversarial manner, and translates the annotated boundary knowledge of open-source datasets into the segmentation network. Exhaustive experiments demonstrate that, with tens of finely-grained annotated samples as guidance, Ref-Net achieves results on par with fully supervised methods on six datasets. Our code can be found in the supplementary material.

IJCAI Conference 2021 Conference Paper

Contrastive Model Inversion for Data-Free Knowledge Distillation

  • Gongfan Fang
  • Jie Song
  • Xinchao Wang
  • Chengchao Shen
  • Xingen Wang
  • Mingli Song

Model inversion, whose goal is to recover training data from a pre-trained model, has recently been proven feasible. However, existing inversion methods usually suffer from the mode collapse problem, where the synthesized instances are highly similar to each other and thus show limited effectiveness for downstream tasks, such as knowledge distillation. In this paper, we propose Contrastive Model Inversion (CMI), where the data diversity is explicitly modeled as an optimizable objective, to alleviate the mode collapse issue. Our main observation is that, under the constraint of the same amount of data, higher data diversity usually indicates stronger instance discrimination. To this end, we introduce in CMI a contrastive learning objective that encourages the instances being synthesized to be distinguishable from the already synthesized ones in previous batches. Experiments of pre-trained models on CIFAR-10, CIFAR-100, and Tiny-ImageNet demonstrate that CMI not only generates more visually plausible instances than the state of the arts, but also achieves significantly superior performance when the generated data are used for knowledge distillation. Code is available at https://github.com/zju-vipa/DataFree.
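The contrastive objective described, pushing each newly synthesized instance away from a memory bank of earlier batches, resembles an InfoNCE loss. A simplified numpy sketch (the positive pair is collapsed to the instance's own normalized view, which is a deliberate simplification, not CMI's exact formulation):

```python
import numpy as np

def info_nce(new_feats, bank_feats, tau=0.1):
    """InfoNCE-style loss: each new instance's normalized feature should be
    far from the memory bank of previously synthesized instances.
    Lower loss = stronger instance discrimination = higher diversity."""
    def normalize(x):
        return x / np.linalg.norm(x, axis=1, keepdims=True)
    q, bank = normalize(new_feats), normalize(bank_feats)
    pos = np.exp(np.sum(q * q, axis=1) / tau)      # similarity to own view
    neg = np.exp(q @ bank.T / tau).sum(axis=1)     # similarity to the bank
    return float(-np.log(pos / (pos + neg)).mean())
```

Minimizing this term while synthesizing drives new instances into unoccupied regions of the feature space, which is exactly the anti-mode-collapse pressure the abstract describes.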

AAAI Conference 2021 Conference Paper

Edge-competing Pathological Liver Vessel Segmentation with Limited Labels

  • Zunlei Feng
  • Zhonghua Wang
  • Xinchao Wang
  • Xiuming Zhang
  • Lechao Cheng
  • Jie Lei
  • Yuexuan Wang
  • Mingli Song

The microvascular invasion (MVI) is a major prognostic factor in hepatocellular carcinoma, which is one of the malignant tumors with the highest mortality rate. Diagnosing MVI requires discovering the vessels that contain hepatocellular carcinoma cells and counting their number in each vessel, a process that depends heavily on the doctor's experience and is largely subjective and time-consuming. However, no algorithm has yet been tailored for MVI detection from pathological images. This paper collects the first pathological liver image dataset, containing 522 whole slide images with labels of vessels, MVI, and hepatocellular carcinoma grades. The first and essential step for the automatic diagnosis of MVI is the accurate segmentation of vessels. The unique characteristics of pathological liver images, such as super-large size, multi-scale vessels, and blurred vessel edges, make accurate vessel segmentation challenging. Based on the collected dataset, we propose an Edge-competing Vessel Segmentation Network (EVS-Net), which contains a segmentation network and two edge segmentation discriminators. The segmentation network, combined with an edge-aware self-supervision mechanism, is devised to conduct vessel segmentation with limited labeled patches. Meanwhile, two discriminators are introduced to distinguish whether the segmented vessel and background contain residual features in an adversarial manner. In the training stage, the two discriminators are devised to compete for the predicted position of edges. Exhaustive experiments demonstrate that, with only limited labeled patches, EVS-Net achieves performance close to that of fully supervised methods, which provides a convenient tool for pathological liver vessel segmentation. Code is publicly available at https://github.com/zju-vipa/EVS-Net.

IJCAI Conference 2021 Conference Paper

KDExplainer: A Task-oriented Attention Model for Explaining Knowledge Distillation

  • Mengqi Xue
  • Jie Song
  • Xinchao Wang
  • Ying Chen
  • Xingen Wang
  • Mingli Song

Knowledge distillation (KD) has recently emerged as an efficacious scheme for learning compact deep neural networks (DNNs). Despite the promising results achieved, the rationale that interprets the behavior of KD has remained largely understudied. In this paper, we introduce a novel task-oriented attention model, termed KDExplainer, to shed light on the working mechanism underlying vanilla KD. At the heart of KDExplainer is a Hierarchical Mixture of Experts (HME), in which a multi-class classification is reformulated as a multi-task binary one. Through distilling knowledge from a free-form pre-trained DNN to KDExplainer, we observe that KD implicitly modulates the knowledge conflicts between different subtasks, and in reality has much more to offer than label smoothing. Based on such findings, we further introduce a portable tool, dubbed virtual attention module (VAM), that can be seamlessly integrated with various DNNs to enhance their performance under KD. Experimental results demonstrate that, with a negligible additional cost, student models equipped with VAM consistently outperform their non-VAM counterparts across different benchmarks. Furthermore, when combined with other KD methods, VAM remains competent in promoting results, even though it is only motivated by vanilla KD. The code is available at https://github.com/zju-vipa/KDExplainer.

NeurIPS Conference 2021 Conference Paper

Mosaicking to Distill: Knowledge Distillation from Out-of-Domain Data

  • Gongfan Fang
  • Yifan Bao
  • Jie Song
  • Xinchao Wang
  • Donglin Xie
  • Chengchao Shen
  • Mingli Song

Knowledge distillation~(KD) aims to craft a compact student model that imitates the behavior of a pre-trained teacher in a target domain. Prior KD approaches, despite their gratifying results, have largely relied on the premise that \emph{in-domain} data is available to carry out the knowledge transfer. Such an assumption, unfortunately, in many cases violates the practical setting, since the original training data or even the data domain is often unreachable due to privacy or copyright reasons. In this paper, we attempt to tackle an ambitious task, termed as \emph{out-of-domain} knowledge distillation~(OOD-KD), which allows us to conduct KD using only OOD data that can be readily obtained at a very low cost. Admittedly, OOD-KD is by nature a highly challenging task due to the agnostic domain gap. To this end, we introduce a handy yet surprisingly efficacious approach, dubbed as~\textit{MosaicKD}. The key insight behind MosaicKD lies in that samples from various domains share common local patterns, even though their global semantics may vary significantly; these shared local patterns, in turn, can be re-assembled analogous to mosaic tiling, to approximate the in-domain data and to further alleviate the domain discrepancy. In MosaicKD, this is achieved through a four-player min-max game, in which a generator, a discriminator, and a student network are collectively trained in an adversarial manner, partially under the guidance of a pre-trained teacher. We validate MosaicKD over classification and semantic segmentation tasks across various benchmarks, and demonstrate that it yields results much superior to the state-of-the-art counterparts on OOD data. Our code is available at \url{https://github.com/zju-vipa/MosaicKD}.

AAAI Conference 2021 Conference Paper

Progressive Network Grafting for Few-Shot Knowledge Distillation

  • Chengchao Shen
  • Xinchao Wang
  • Youtan Yin
  • Jie Song
  • Sihui Luo
  • Mingli Song

Knowledge distillation has demonstrated encouraging performances in deep model compression. Most existing approaches, however, require massive labeled data to accomplish the knowledge transfer, making the model compression a cumbersome and costly process. In this paper, we investigate the practical few-shot knowledge distillation scenario, where we assume only a few samples without human annotations are available for each category. To this end, we introduce a principled dual-stage distillation scheme tailored for few-shot data. In the first step, we graft the student blocks one by one onto the teacher, and learn the parameters of the grafted block intertwined with those of the other teacher blocks. In the second step, the trained student blocks are progressively connected and then together grafted onto the teacher network, allowing the learned student blocks to adapt themselves to each other and eventually replace the teacher network. Experiments demonstrate that our approach, with only a few unlabeled samples, achieves gratifying results on CIFAR10, CIFAR100, and ILSVRC-2012. On CIFAR10 and CIFAR100, our performances are even on par with those of knowledge distillation schemes that utilize the full datasets. The source code is available at https://github.com/zju-vipa/NetGraft.

AAAI Conference 2021 Conference Paper

Visual Boundary Knowledge Translation for Foreground Segmentation

  • Zunlei Feng
  • Lechao Cheng
  • Xinchao Wang
  • Xiang Wang
  • Ya Jie Liu
  • Xiangtong Du
  • Mingli Song

When confronted with objects of unknown types in an image, humans can effortlessly and precisely tell their visual boundaries. This recognition mechanism and underlying generalization capability seem to contrast with state-of-the-art image segmentation networks that rely on large-scale category-aware annotated training samples. In this paper, we make an attempt towards building models that explicitly account for visual boundary knowledge, in hope to reduce the training effort on segmenting unseen categories. Specifically, we investigate a new task termed as Boundary Knowledge Translation (BKT). Given a set of fully labeled categories, BKT aims to translate the visual boundary knowledge learned from the labeled categories to a set of novel categories, each of which is provided only a few labeled samples. To this end, we propose a Translation Segmentation Network (Trans-Net), which comprises a segmentation network and two boundary discriminators. The segmentation network, combined with a boundary-aware self-supervised mechanism, is devised to conduct foreground segmentation, while the two discriminators work together in an adversarial manner to ensure an accurate segmentation of the novel categories under light supervision. Exhaustive experiments demonstrate that, with only tens of labeled samples as guidance, Trans-Net achieves results on par with fully supervised methods.

AAAI Conference 2020 Conference Paper

Dynamic Instance Normalization for Arbitrary Style Transfer

  • Yongcheng Jing
  • Xiao Liu
  • Yukang Ding
  • Xinchao Wang
  • Errui Ding
  • Mingli Song
  • Shilei Wen

Prior normalization methods rely on affine transformations to produce arbitrary image style transfers, of which the parameters are computed in a pre-defined way. Such a manually-defined nature eventually results in high-cost and shared encoders for both style and content encoding, making style transfer systems cumbersome to deploy in resource-constrained environments such as mobile devices. In this paper, we propose a new and generalized normalization module, termed Dynamic Instance Normalization (DIN), that allows for flexible and more efficient arbitrary style transfers. Comprising an instance normalization and a dynamic convolution, DIN encodes a style image into learnable convolution parameters, upon which the content image is stylized. Unlike conventional methods that use shared complex encoders to encode content and style, the proposed DIN introduces a sophisticated style encoder, yet comes with a compact and lightweight content encoder for fast inference. Experimental results demonstrate that the proposed approach yields very encouraging results on challenging style patterns and, to the best of our knowledge, for the first time enables arbitrary style transfer using a MobileNet-based lightweight architecture, leading to a reduction factor of more than twenty in computational cost as compared to existing approaches. Furthermore, the proposed DIN provides flexible support for state-of-the-art convolutional operations, and thus triggers novel functionalities, such as uniform-stroke placement for non-natural images and automatic spatial-stroke control.

NeurIPS Conference 2020 Conference Paper

Factorizable Graph Convolutional Networks

  • Yiding Yang
  • Zunlei Feng
  • Mingli Song
  • Xinchao Wang

Graphs have been widely adopted to denote structural connections between entities. The relations are in many cases heterogeneous, but entangled together and denoted merely as a single edge between a pair of nodes. For example, in a social network graph, users in different latent relationships like friends and colleagues, are usually connected via a bare edge that conceals such intrinsic connections. In this paper, we introduce a novel graph convolutional network (GCN), termed as factorizable graph convolutional network (FactorGCN), that explicitly disentangles such intertwined relations encoded in a graph. FactorGCN takes a simple graph as input, and disentangles it into several factorized graphs, each of which represents a latent and disentangled relation among nodes. The features of the nodes are then aggregated separately in each factorized latent space to produce disentangled features, which further leads to better performances for downstream tasks. We evaluate the proposed FactorGCN both qualitatively and quantitatively on the synthetic and real-world datasets, and demonstrate that it yields truly encouraging results in terms of both disentangling and feature aggregation. Code is publicly available at https://github.com/ihollywhy/FactorGCN.PyTorch.

AAAI Conference 2020 Conference Paper

Hearing Lips: Improving Lip Reading by Distilling Speech Recognizers

  • Ya Zhao
  • Rui Xu
  • Xinchao Wang
  • Peng Hou
  • Haihong Tang
  • Mingli Song

Lip reading has witnessed unparalleled development in recent years thanks to deep learning and the availability of large-scale datasets. Despite the encouraging results achieved, the performance of lip reading, unfortunately, remains inferior to that of its counterpart, speech recognition, due to the ambiguous nature of its actuations that makes it challenging to extract discriminant features from the lip movement videos. In this paper, we propose a new method, termed Lip by Speech (LIBS), of which the goal is to strengthen lip reading by learning from speech recognizers. The rationale behind our approach is that the features extracted from speech recognizers may provide complementary and discriminant clues, which are formidable to obtain from the subtle movements of the lips, and consequently facilitate the training of lip readers. This is achieved, specifically, by distilling multi-granularity knowledge from speech recognizers to lip readers. To conduct this cross-modal knowledge distillation, we utilize an efficacious alignment scheme to handle the inconsistent lengths of the audios and videos, as well as an innovative filtering strategy to refine the speech recognizer's prediction. The proposed method achieves the new state-of-the-art performance on the CMLR and LRS2 datasets, outperforming the baseline by a margin of 7.66% and 2.75% in character error rate, respectively.

NeurIPS Conference 2020 Conference Paper

One-sample Guided Object Representation Disassembling

  • Zunlei Feng
  • Yongming He
  • Xinchao Wang
  • Xin Gao
  • Jie Lei
  • Cheng Jin
  • Mingli Song

The ability to disassemble the features of objects and background is crucial for many machine learning tasks, including image classification, image editing, visual concepts learning, and so on. However, existing (semi-)supervised methods all need a large number of annotated samples, while unsupervised methods can't handle real-world images with complicated backgrounds. In this paper, we introduce the One-sample Guided Object Representation Disassembling (One-GORD) method, which only requires one annotated sample for each object category to learn disassembled object representations from unannotated images. For the annotated one-sample, we first adopt data augmentation strategies to generate synthetic samples, which can guide the disassembling of the object features and background features. For the unannotated images, two self-supervised mechanisms, dual-swapping and fuzzy classification, are introduced to disassemble object features from the background with the guidance of the annotated one-sample. Moreover, we devise two metrics to evaluate the disassembling performance from the perspective of representation and image, respectively. Experiments demonstrate that One-GORD achieves competitive disassembling performance and can handle natural scenes with complicated backgrounds.

IJCAI Conference 2019 Conference Paper

Amalgamating Filtered Knowledge: Learning Task-customized Student from Multi-task Teachers

  • Jingwen Ye
  • Xinchao Wang
  • Yixin Ji
  • Kairi Ou
  • Mingli Song

Many well-trained Convolutional Neural Network~(CNN) models have now been released online by developers for the sake of effortless reproduction. In this paper, we treat such pre-trained networks as teachers and explore how to learn a target student network for customized tasks, using multiple teachers that handle different tasks. We assume no human-labelled annotations are available, and each teacher model can be either a single- or multi-task network, where the former is a degenerated case of the latter. The student model, depending on the customized tasks, learns the related knowledge filtered from the multiple teachers, and eventually masters the complete or a subset of expertise from all teachers. To this end, we adopt a layer-wise training strategy, which entangles the student's network block to be learned with the corresponding teachers. As demonstrated on several benchmarks, the learned student network achieves very promising results, even outperforming the teachers on the customized tasks.

AAAI Conference 2019 Conference Paper

Amalgamating Knowledge towards Comprehensive Classification

  • Chengchao Shen
  • Xinchao Wang
  • Jie Song
  • Li Sun
  • Mingli Song

With the rapid development of deep learning, an unprecedentedly large number of trained deep network models have become available online. Reusing such trained models can significantly reduce the cost of training new models from scratch, which may even be infeasible, as the annotations used for training the original networks are often unavailable to the public. We propose in this paper to study a new model-reusing task, which we term knowledge amalgamation. Given multiple trained teacher networks, each of which specializes in a different classification problem, the goal of knowledge amalgamation is to learn a lightweight student model capable of handling the comprehensive classification. We assume no other annotations except the outputs from the teacher models are available, and thus focus on extracting and amalgamating knowledge from the multiple teachers. To this end, we propose a pilot two-step strategy to tackle the knowledge amalgamation task, by learning first the compact feature representations from teachers and then the network parameters in a layer-wise manner so as to build the student model. We apply this approach to four public datasets and obtain very encouraging results: even without any human annotation, the obtained student model is competent to handle the comprehensive classification task and in most cases outperforms the teachers in individual sub-tasks.

IJCAI Conference 2019 Conference Paper

An Online Intelligent Visual Interaction System

  • Anxiang Zeng
  • Han Yu
  • Xin Gao
  • Kairi Ou
  • Zhenchuan Huang
  • Peng Hou
  • Mingli Song
  • Jingshu Zhang

This paper proposes an Online Intelligent Visual Interactive System (OIVIS), which can be applied to various live video broadcast and short video scenes to provide an interactive user experience. In a live video broadcast, the anchor can issue various commands using pre-defined gestures, and can trigger real-time background replacement to create an immersive atmosphere. To support such dynamic interactivity, we implemented algorithms including real-time gesture recognition and real-time video portrait segmentation, and developed a deep network inference framework and a real-time rendering framework, AI Gender, at the front end to create a complete set of visual interaction solutions for use in resource-constrained mobile environments.

NeurIPS Conference 2019 Conference Paper

Deep Model Transferability from Attribution Maps

  • Jie Song
  • Yixin Chen
  • Xinchao Wang
  • Chengchao Shen
  • Mingli Song

Exploring the transferability between heterogeneous tasks sheds light on their intrinsic interconnections, and consequently enables knowledge transfer from one task to another so as to reduce the training effort of the latter. In this paper, we propose an embarrassingly simple yet very efficacious approach to estimating the transferability of deep networks, especially those handling vision tasks. Unlike the seminal work of \emph{taskonomy} that relies on a large number of annotations as supervision and is thus computationally cumbersome, the proposed approach requires no human annotations and imposes no constraints on the architectures of the networks. This is achieved, specifically, via projecting deep networks into a \emph{model space}, wherein each network is treated as a point and the distances between two points are measured by deviations of their produced attribution maps. The proposed approach is several orders of magnitude faster than taskonomy, and meanwhile preserves a task-wise topological structure highly similar to the one obtained by taskonomy. Code is available at \url{https://github.com/zju-vipa/TransferbilityFromAttributionMaps}.

IJCAI Conference 2019 Conference Paper

Knowledge Amalgamation from Heterogeneous Networks by Common Feature Learning

  • Sihui Luo
  • Xinchao Wang
  • Gongfan Fang
  • Yao Hu
  • Dapeng Tao
  • Mingli Song

An increasing number of well-trained deep networks have been released online by researchers and developers, enabling the community to reuse them in a plug-and-play way without accessing the training annotations. However, due to the large number of network variants, such publicly available trained models are often of different architectures, each of which is tailored for a specific task or dataset. In this paper, we study a deep-model reusing task, where we are given as input pre-trained networks of heterogeneous architectures specializing in distinct tasks, as teacher models. We aim to learn a multitalented and lightweight student model that is able to grasp the integrated knowledge from all such heterogeneous-structure teachers, again without accessing any human annotation. To this end, we propose a common feature learning scheme, in which the features of all teachers are transformed into a common space and the student is enforced to imitate them all so as to amalgamate the intact knowledge. We test the proposed approach on a list of benchmarks and demonstrate that the learned student is able to achieve very promising performance, superior to those of the teachers in their specialized tasks.

IJCAI Conference 2019 Conference Paper

SPAGAN: Shortest Path Graph Attention Network

  • Yiding Yang
  • Xinchao Wang
  • Mingli Song
  • Junsong Yuan
  • Dacheng Tao

Graph convolutional networks (GCN) have recently demonstrated their potential in analyzing non-grid structure data that can be represented as graphs. The core idea is to encode the local topology of a graph, via convolutions, into the feature of a center node. In this paper, we propose a novel GCN model, which we term as Shortest Path Graph Attention Network (SPAGAN). Unlike conventional GCN models that carry out node-based attentions, on either first-order neighbors or random higher-order ones, the proposed SPAGAN conducts path-based attention that explicitly accounts for the influence of a sequence of nodes yielding the minimum cost, or shortest path, between the center node and its higher-order neighbors. SPAGAN therefore allows for a more informative and intact exploration of the graph structure and further the more effective aggregation of information from distant neighbors, as compared to node-based GCN methods. We test SPAGAN for the downstream classification task on several standard datasets, and achieve performances superior to the state of the art.
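The path-based attention described, scoring entire shortest paths between the center node and its higher-order neighbors rather than individual nodes, can be sketched in a few lines. A minimal numpy toy on an unweighted graph; `bfs_paths` and `spagan_like_update` are illustrative stand-ins, not SPAGAN's implementation:

```python
import numpy as np

def bfs_paths(adj, src, max_hops):
    """Shortest paths (as node sequences) from src to every node
    reachable within max_hops, via breadth-first search."""
    paths = {src: [src]}
    frontier = [src]
    for _ in range(max_hops):
        nxt = []
        for u in frontier:
            for v in np.flatnonzero(adj[u]):
                if v not in paths:
                    paths[v] = paths[u] + [int(v)]
                    nxt.append(int(v))
        frontier = nxt
    return paths

def spagan_like_update(adj, feats, center, max_hops=2):
    """Path-based attention: represent each shortest path by the mean of its
    node features, score it against the center feature, then
    softmax-aggregate the path representations."""
    paths = bfs_paths(adj, center, max_hops)
    path_feats = np.stack([feats[p].mean(axis=0) for p in paths.values()])
    scores = path_feats @ feats[center]
    attn = np.exp(scores - scores.max())
    attn /= attn.sum()
    return attn @ path_feats
```

Because a path carries the features of every node along it, distant neighbors contribute through the context of the route that reaches them, which is the "more informative and intact exploration" the abstract contrasts with node-based attention.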

NeurIPS Conference 2018 Conference Paper

Dual Swap Disentangling

  • Zunlei Feng
  • Xinchao Wang
  • Chenglong Ke
  • An-Xiang Zeng
  • Dacheng Tao
  • Mingli Song

Learning interpretable disentangled representations is a crucial yet challenging task. In this paper, we propose a weakly semi-supervised method, termed Dual Swap Disentangling (DSD), for disentangling using both labeled and unlabeled data. Unlike conventional weakly supervised methods that rely on full annotations on the group of samples, we require only limited annotations on paired samples that indicate their shared attribute, like the color. Our model takes the form of a dual autoencoder structure. To achieve disentangling using the labeled pairs, we follow an "encoding-swap-decoding" process, where we first swap the parts of their encodings corresponding to the shared attribute, and then decode the obtained hybrid codes to reconstruct the original input pairs. For unlabeled pairs, we follow the "encoding-swap-decoding" process twice on designated encoding parts and enforce the final outputs to approximate the input pairs. By isolating parts of the encoding and swapping them back and forth, we impose dimension-wise modularity and portability on the encodings of the unlabeled samples, which implicitly encourages disentangling under the guidance of labeled pairs. This dual swap mechanism, tailored for the semi-supervised setting, turns out to be very effective. Experiments on image datasets from a wide range of domains show that our model yields state-of-the-art disentangling performances.
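The swap step at the core of the method, exchanging the designated shared-attribute dimensions of two latent codes, can be sketched directly; swapping twice recovers the originals, which is what lets the unlabeled dual-swap pass be supervised by plain reconstruction. A minimal numpy sketch, with the dimension split as an illustrative assumption:

```python
import numpy as np

def swap_shared(code_a, code_b, shared_dims):
    """Exchange the encoding parts corresponding to the annotated shared
    attribute (e.g. color); all other dimensions are left untouched."""
    a, b = code_a.copy(), code_b.copy()
    a[shared_dims], b[shared_dims] = code_b[shared_dims], code_a[shared_dims]
    return a, b
```

For a labeled pair that truly shares the attribute, the swap should be semantically a no-op, so decoding the hybrid codes must reconstruct the inputs; for unlabeled pairs, applying the swap twice is an exact involution, giving a free reconstruction target.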

TIST Journal 2015 Journal Article

Where2Stand

  • Yinting Wang
  • Mingli Song
  • Dacheng Tao
  • Yong Rui
  • Jiajun Bu
  • Ah Chung Tsoi
  • Shaojie Zhuo
  • Ping Tan

People often take photographs at tourist sites, and these pictures usually have two main elements: a person in the foreground and scenery in the background. This type of “souvenir photo” is one of the most common photos taken by tourists. Although algorithms that aid a user-photographer in taking a well-composed picture of a scene exist [Ni et al. 2013], few studies have addressed the issue of properly positioning human subjects in photographs. In photography, common guidelines exist for composing portrait images. However, these rules usually do not consider the background scene. Therefore, in this article, we investigate human-scenery positional relationships and construct a photographic assistance system to optimize the position of human subjects in a given background scene, thereby assisting the user in capturing high-quality souvenir photos. We collect thousands of well-composed portrait photographs to learn human-scenery aesthetic composition rules. In addition, we define a set of negative rules to exclude undesirable compositions. Recommendation results are achieved by combining the learned positive rules with our proposed negative rules. We implement the proposed system on the Android platform on a smartphone. The system demonstrates its efficacy by producing well-composed souvenir photos.