Arrow Research search

Author name cluster

Shuhui Wang

Possible papers associated with this exact author name in Arrow. This page groups case-insensitive exact name matches and is not a full identity disambiguation profile.

30 papers
2 author rows

Possible papers

30

JBHI Journal 2025 Journal Article

Consistency Conditioned Memory Augmented Dynamic Diagnosis Model for Medical Visual Question Answering

  • Ting Yu
  • Binhui Ge
  • Shuhui Wang
  • Yan Yang
  • Qingming Huang
  • Jun Yu

Medical Visual Question Answering (Med-VQA) holds immense promise as an invaluable medical assistance aid, offering timely diagnostic outcomes based on medical images and accompanying questions, thereby supporting medical professionals in making accurate clinical decisions. However, Med-VQA is still in its infancy, with existing solutions falling short in imitating human diagnostic processes and ensuring result consistency. To address these challenges, we propose a Consistency Conditioned Memory augmented Dynamic diagnosis model (CoCoMeD), incorporating two core components: a dynamic memory diagnosis engine and a consistency-conditioned enforcer. The dynamic memory diagnosis engine enables intricate diagnostic interactions by retaining vital visual cues from medical images and iteratively updating pertinent memories. This dynamic reasoning capability mirrors the cognitive processes observed in skilled medical diagnosticians, thus effectively enhancing the model's ability to reason over diverse medical visual facts and patient-specific questions. Moreover, to strengthen diagnostic coherence, the consistency-conditioned enforcer imposes coherence constraints linking interrelated questions with identical medical facts, ensuring the credibility and reliability of its diagnostic outcomes. Additionally, we present C-SLAKE, an extended Med-VQA dataset encompassing diverse medical image types and categorized diagnostic question-answer pairs for consistent Med-VQA evaluation on rich medical sources. Comprehensive experiments on DME and C-SLAKE showcase CoCoMeD's superior performance and potential to advance trustworthy multi-source medical question answering.
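
To make the memory mechanism concrete, here is a toy sketch of an iterative memory update of the kind the abstract describes, where slots most relevant to the current question state absorb new visual cues. The gating rule, tensor shapes, and names are illustrative assumptions, not CoCoMeD's actual design.

```python
import torch

# A toy sketch of the iterative memory update the abstract describes: slots
# most relevant to the current question state absorb new visual cues. The
# gating rule, shapes, and names are illustrative assumptions, not CoCoMeD's
# actual design.
def update_memory(memory, visual_cues, question_state):
    relevance = torch.softmax(memory @ question_state, dim=0)  # (slots,)
    gate = relevance.unsqueeze(1)                 # write more to relevant slots
    return (1 - gate) * memory + gate * visual_cues

memory = torch.randn(8, 256)                      # 8 memory slots
cues = torch.randn(8, 256)                        # candidate visual cues
state = torch.randn(256)                          # current question state
print(update_memory(memory, cues, state).shape)   # torch.Size([8, 256])
```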

AAAI Conference 2025 Conference Paper

Dis²Booth: Learning Image Distribution with Disentangled Features for Text-to-Image Diffusion Models

  • Guanqi Ding
  • Chengyu Yang
  • Shuhui Wang
  • Xincheng Li
  • Jinzhe Zhang
  • Xin Jin
  • Qingming Huang

Personalized image generation enables customized content creation based on text-to-image diffusion models. However, existing personalization methods focus on fine-tuning generative models to learn to generate a specific single individual or concept, such as an image of a specific Corgi, but are unable to generate data for multiple individuals or concepts with common characteristics, such as images of multiple different Corgis. In this work, we focus on personalizing a diffusion model to generate varied data that usually contains multiple subjects and has a more diverse and complex data distribution. Our basic assumption is that the varied data distribution is composed of the common features shared among all samples, as well as the reasonable variations within it. Accordingly, we can decompose the learning process of complex data distributions into two simpler sub-tasks, employing a divide-and-conquer approach. To this end, we propose Dis2Booth, a framework that learns complex image Distributions by Disentangling the data distribution in an unsupervised manner. Specifically, Dis2Booth contains two modules, Anchor LoRA and Delta LoRA, that are tasked with learning the common features and the variational features, constrained by a Contextual Loss and a Delta Loss without supervision. In addition, an Asynchronous Optimization Strategy is proposed to ensure the collaborative training of the two modules. Extensive experiments suggest that Dis2Booth is able to learn data distributions with higher diversity and complexity while maintaining the same level of flexibility as LoRA.
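
As a rough illustration of the two-module decomposition, the sketch below pairs an "anchor" low-rank adapter for shared features with a "delta" adapter for variation and alternates which one receives gradients. The class name, sizes, and alternation schedule are assumptions; Dis2Booth's actual losses and schedule are not reproduced here.

```python
import torch
import torch.nn as nn

# A rough illustration of the two-module decomposition: an "anchor" low-rank
# adapter for features shared by all samples plus a "delta" adapter for the
# variation, trained in alternating (asynchronous) steps. Names, sizes, and the
# alternation schedule are assumptions; the paper's losses are not reproduced.
class TwoPartLoRA(nn.Module):
    def __init__(self, dim=768, rank=4):
        super().__init__()
        self.anchor = nn.Sequential(nn.Linear(dim, rank, bias=False),
                                    nn.Linear(rank, dim, bias=False))
        self.delta = nn.Sequential(nn.Linear(dim, rank, bias=False),
                                   nn.Linear(rank, dim, bias=False))

    def forward(self, h, use_delta=True):
        out = h + self.anchor(h)                  # common features of the set
        return out + self.delta(h) if use_delta else out

lora = TwoPartLoRA()
for step in range(4):                             # alternate which module trains
    lora.anchor.requires_grad_(step % 2 == 0)
    lora.delta.requires_grad_(step % 2 == 1)
```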

AAAI Conference 2025 Conference Paper

Divide-and-Conquer: Tree-structured Strategy with Answer Distribution Estimator for Goal-Oriented Visual Dialogue

  • Shuo Cai
  • Xinzhe Han
  • Shuhui Wang

Goal-oriented visual dialogue involves multi-round interaction between artificial agents and has attracted remarkable attention due to its wide applications. Given a visual scene, this task occurs when a Questioner asks an action-oriented question and an Answerer responds with the intent of letting the Questioner know the correct action to take. The quality of the questions affects the accuracy and efficiency of the target search progress. However, existing methods lack a clear strategy to guide question generation, resulting in randomness in the search process and non-convergent results. We propose a Tree-Structured Strategy with Answer Distribution Estimator (TSADE) which guides question generation by excluding half of the current candidate objects in each round. This process is implemented by maximizing a binary reward inspired by the "divide-and-conquer" paradigm. We further design a candidate-minimization reward which encourages the model to narrow down the scope of candidate objects toward the end of the dialogue. We experimentally demonstrate that our method enables the agents to achieve high task-oriented accuracy with fewer repeated questions and rounds compared to traditional ergodic question generation approaches. Qualitative results further show that TSADE facilitates agents to generate higher-quality questions.
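
The halving strategy can be summarized with a simple reward shape: maximal when a round eliminates half of the remaining candidates, decaying as the split becomes lopsided. This is a hypothetical sketch of such a binary-split reward, not TSADE's exact formulation.

```python
# A hypothetical sketch of the binary "divide-and-conquer" reward described
# above: maximal when a round eliminates half of the remaining candidate
# objects, decaying as the split becomes lopsided. Function and variable names
# are illustrative, not TSADE's exact formulation.
def binary_split_reward(candidates_before: int, candidates_after: int) -> float:
    """Reward 1.0 for an even halving, decaying toward 0 for lopsided splits."""
    ideal = candidates_before / 2
    deviation = abs(candidates_after - ideal) / ideal
    return max(0.0, 1.0 - deviation)

# 8 candidates cut to 4 gives reward 1.0; cut to only 7 gives reward 0.25.
print(binary_split_reward(8, 4), binary_split_reward(8, 7))
```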

NeurIPS Conference 2025 Conference Paper

Edit Less, Achieve More: Dynamic Sparse Neuron Masking for Lifelong Knowledge Editing in LLMs

  • Jinzhe Liu
  • Junshu Sun
  • Shufan Shen
  • Chenxue Yang
  • Shuhui Wang

Lifelong knowledge editing enables continuous, precise updates to outdated knowledge in large language models (LLMs) without computationally expensive full retraining. However, existing methods often accumulate errors throughout the editing process, causing a gradual decline in both editing accuracy and generalization. To tackle this problem, we propose Neuron-Specific Masked Knowledge Editing (NMKE), a novel fine-grained editing framework that combines neuron-level attribution with dynamic sparse masking. Leveraging neuron functional attribution, we identify two key types of knowledge neurons: knowledge-general neurons that activate consistently across prompts, and knowledge-specific neurons that activate only for specific prompts. NMKE further introduces an entropy-guided dynamic sparse mask that locates the neurons relevant to the target knowledge. This strategy enables precise neuron-level knowledge editing with fewer parameter modifications. Experimental results from thousands of sequential edits demonstrate that NMKE outperforms existing methods in maintaining high editing success rates and preserving the model's general capabilities in lifelong editing.
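
The following sketch illustrates one plausible reading of an entropy-guided dynamic sparse mask: the entropy of neuron attributions sets how many neurons are kept for editing. The threshold rule and shapes are assumptions for illustration; NMKE's actual attribution procedure is more involved.

```python
import torch

# One plausible reading of an entropy-guided dynamic sparse mask: the entropy
# of the neuron-attribution distribution sets how many neurons are kept for
# editing. The threshold rule and tensor shapes are assumptions; NMKE's actual
# attribution procedure is more involved.
def entropy_guided_mask(attributions: torch.Tensor) -> torch.Tensor:
    probs = torch.softmax(attributions, dim=-1)
    entropy = -(probs * probs.clamp_min(1e-12).log()).sum()
    max_entropy = torch.log(torch.tensor(float(attributions.numel())))
    keep_ratio = (entropy / max_entropy).item()            # in (0, 1]
    k = max(1, int(keep_ratio * attributions.numel()))
    mask = torch.zeros_like(attributions, dtype=torch.bool)
    mask[attributions.topk(k).indices] = True              # flatter attribution -> denser mask
    return mask

mask = entropy_guided_mask(torch.randn(4096))
print(int(mask.sum()), "neurons selected for editing")
```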

ICLR Conference 2025 Conference Paper

Enhancing Pre-trained Representation Classifiability can Boost its Interpretability

  • Shufan Shen
  • Zhaobo Qi
  • Junshu Sun
  • Qingming Huang
  • Qi Tian 0001
  • Shuhui Wang

The visual representation of a pre-trained model prioritizes the classifiability on downstream tasks, while the widespread applications for pre-trained visual models have posed new requirements for representation interpretability. However, it remains unclear whether the pre-trained representations can achieve high interpretability and classifiability simultaneously. To answer this question, we quantify the representation interpretability by leveraging its correlation with the ratio of interpretable semantics within the representations. Given the pre-trained representations, only the interpretable semantics can be captured by interpretations, whereas the uninterpretable part leads to information loss. Based on this fact, we propose the Inherent Interpretability Score (IIS) that evaluates the information loss, measures the ratio of interpretable semantics, and quantifies the representation interpretability. In the evaluation of the representation interpretability with different classifiability, we surprisingly discover that the interpretability and classifiability are positively correlated, i.e., representations with higher classifiability provide more interpretable semantics that can be captured in the interpretations. This observation further supports two benefits to the pre-trained representations. First, the classifiability of representations can be further improved by fine-tuning with interpretability maximization. Second, with the classifiability improvement for the representations, we obtain predictions based on their interpretations with less accuracy degradation. The discovered positive correlation and corresponding applications show that practitioners can unify the improvements in interpretability and classifiability for pre-trained vision models. Codes are available at https://github.com/ssfgunner/IIS.

AAAI Conference 2025 Conference Paper

Image-to-video Adaptation with Outlier Modeling and Robust Self-learning

  • Junbao Zhuo
  • Shuhui Wang
  • Zhenghan Chen
  • Li Shen
  • Qingming Huang
  • Huimin Ma

The image-to-video adaptation task seeks to harness both labeled images and unlabeled videos for effective video recognition. The gap between the image and video modalities and the domain discrepancy across the two domains are the two essential challenges in this task. Existing methods reduce the domain discrepancy via closed-set domain adaptation techniques, resulting in inaccurate domain alignment because outlier target frames exist. To tackle this issue, we extend the vanilla classifier with outlier classes, where each outlier class is responsible for capturing outlier frames for a specific class via a batch nuclear norm maximization loss. We further propose a new loss that treats the source images apart from class c as instances of the outlier class specific to c. As for the modality gap, existing methods usually utilize the pseudo labels obtained from an image-level adapted model to learn a video-level model, and rare efforts are dedicated to handling the noise in these pseudo labels. We propose a new metric based on label propagation consistency to select samples for training a better video-level model. Experiments on three benchmarks validate the effectiveness of our method.
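
Batch nuclear-norm maximization is a known trick for encouraging both discriminability and diversity of batch predictions; a minimal generic version of the loss term mentioned above might look like this (shapes and sign convention assumed):

```python
import torch

# A minimal, generic batch nuclear-norm maximization (BNM) term of the kind
# the abstract mentions. Minimizing the negative nuclear norm of the batch
# prediction matrix encourages both discriminability and class diversity.
# Shapes and the sign convention are assumptions for illustration.
def bnm_loss(logits: torch.Tensor) -> torch.Tensor:
    probs = torch.softmax(logits, dim=1)              # (batch, num_classes)
    nuclear_norm = torch.linalg.svdvals(probs).sum()  # sum of singular values
    return -nuclear_norm / logits.shape[0]

# e.g., 12 original classes extended with 12 outlier classes, as described above.
loss = bnm_loss(torch.randn(32, 24))
print(loss.item())
```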

ICLR Conference 2025 Conference Paper

Learning Fine-Grained Representations through Textual Token Disentanglement in Composed Video Retrieval

  • Yue Wu
  • Zhaobo Qi
  • Yiling Wu
  • Junshu Sun
  • Yaowei Wang 0001
  • Shuhui Wang

With the explosive growth of video data, finding videos that meet detailed requirements in large datasets has become a challenge. To address this, the composed video retrieval task has been introduced, enabling users to retrieve videos using complex queries that involve both visual and textual information. However, the inherent heterogeneity between the modalities poses significant challenges. Textual data are highly abstract, while video content contains substantial redundancy. This modality gap in information representation makes it difficult for existing methods to perform the modality fusion and alignment required for fine-grained composed retrieval. To overcome these challenges, we first introduce FineCVR-1M, a fine-grained composed video retrieval dataset containing 1,010,071 video-text triplets with detailed textual descriptions. The dataset is constructed through an automated process that identifies key concept changes between video pairs to generate textual descriptions for both static and action concepts. For fine-grained retrieval methods, the key challenge lies in understanding the detailed requirements. Text descriptions serve as clear expressions of intent, but they require models to distinguish subtle differences in the description of video semantics. Therefore, we propose a textual Feature Disentanglement and Cross-modal Alignment framework (FDCA) that disentangles features at both the sentence and token levels. At the sentence level, we separate text features into retained and injected features. At the token level, an Auxiliary Token Disentangling mechanism is proposed to disentangle texts into retained, injected, and excluded tokens. The disentanglement at both levels extracts fine-grained features, which are aligned and fused with the reference video to extract global representations for video retrieval. Experiments on the FineCVR-1M dataset demonstrate the superior performance of FDCA. Our code and dataset are available at: https://may2333.github.io/FineCVR/.

ICLR Conference 2025 Conference Paper

Masked Temporal Interpolation Diffusion for Procedure Planning in Instructional Videos

  • Yufan Zhou
  • Zhaobo Qi
  • Lingshuai Lin
  • Junqi Jing
  • Tingting Chai
  • Beichen Zhang
  • Shuhui Wang
  • Weigang Zhang

In this paper, we address the challenge of procedure planning in instructional videos, aiming to generate coherent and task-aligned action sequences from start and end visual observations. Previous work has mainly relied on text-level supervision to bridge the gap between observed states and unobserved actions, but it struggles with capturing intricate temporal relationships among actions. Building on these efforts, we propose the Masked Temporal Interpolation Diffusion (MTID) model that introduces a latent space temporal interpolation module within the diffusion model. This module leverages a learnable interpolation matrix to generate intermediate latent features, thereby augmenting visual supervision with richer mid-state details. By integrating this enriched supervision into the model, we enable end-to-end training tailored to task-specific requirements, significantly enhancing the model's capacity to predict temporally coherent action sequences. Additionally, we introduce an action-aware mask projection mechanism to restrict the action generation space, combined with a task-adaptive masked proximity loss to prioritize more accurate reasoning results close to the given start and end states over those in intermediate steps. Simultaneously, it filters out task-irrelevant action predictions, leading to contextually aware action sequences. Experimental results across three widely used benchmark datasets demonstrate that our MTID achieves promising action planning performance on most metrics.
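
A toy version of the latent temporal interpolation idea: a learnable matrix mixes the start and end latent features into a sequence of intermediate latents that can supervise mid-state prediction. Shapes, the softmax normalization, and the module name are illustrative assumptions, not MTID's implementation.

```python
import torch
import torch.nn as nn

# A toy version of latent temporal interpolation: a learnable matrix mixes the
# start and end latent features into T intermediate latents that can supervise
# mid-state prediction. Shapes, the softmax normalization, and the module name
# are illustrative assumptions, not MTID's implementation.
class TemporalInterpolator(nn.Module):
    def __init__(self, num_steps=5):
        super().__init__()
        self.mix = nn.Parameter(torch.randn(num_steps, 2))  # weights over (start, end)

    def forward(self, z_start, z_end):
        weights = torch.softmax(self.mix, dim=1)            # each row sums to 1
        endpoints = torch.stack([z_start, z_end], dim=0)    # (2, dim)
        return weights @ endpoints                          # (num_steps, dim)

interp = TemporalInterpolator()
print(interp(torch.randn(128), torch.randn(128)).shape)     # torch.Size([5, 128])
```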

AAAI Conference 2025 Conference Paper

MSR: A Multifaceted Self-Retrieval Framework for Microscopic Cascade Prediction

  • Dongsheng Hong
  • Chao Chen
  • Xujia Li
  • Shuhui Wang
  • Wen Lin
  • Xiangwen Liao

The microscopic cascade prediction task has wide applications in downstream areas like rumor detection. Its goal is to forecast the diffusion routes of an information cascade within networks. Existing works typically formulate it as a classification task, which does not align well with the Social Homophily assumption, as they use only the features of "infected" users while neglecting those of "uninfected" users in representation learning. Moreover, these methods focus primarily on social relationships, thereby dismissing other vital dimensions such as users' historical behavior and the underlying preferences behind it. To address these challenges, we introduce the MSR (Multifaceted Self-Retrieval) framework. During encoding, in addition to the existing social graph, we construct a preference graph to represent behavioral preferences and further propose a modified multi-channel GRAU for multi-view analysis of the cascade phenomenon. For decoding, our approach diverges from classification-based methods by reformulating the task as an information retrieval problem that predicts the target user with similarity measures. Empirical evaluations on public datasets demonstrate that this framework significantly outperforms baselines on Hits@κ and MAP@κ, affirming its enhanced predictive ability.

NeurIPS Conference 2025 Conference Paper

Relieving the Over-Aggregating Effect in Graph Transformers

  • Junshu Sun
  • Wanxing Chang
  • Chenxue Yang
  • Qingming Huang
  • Shuhui Wang

Graph attention has demonstrated superior performance in graph learning tasks. However, learning from global interactions can be challenging due to the large number of nodes. In this paper, we discover a new phenomenon termed over-aggregating. Over-aggregating arises when a large volume of messages is aggregated into a single node with little discrimination, leading to the dilution of the key messages and potential information loss. To address this, we propose Wideformer, a plug-and-play method for graph attention. Wideformer divides the aggregation over all nodes into parallel processes and guides the model to focus on specific subsets of these processes. The division limits the input volume per aggregation, avoiding message dilution and reducing information loss. The guiding step sorts and weights the aggregation outputs, prioritizing the informative messages. Evaluations show that Wideformer effectively mitigates over-aggregating: the backbone methods can focus on the informative messages and achieve superior performance compared to baseline methods.
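
The divide-then-guide idea can be sketched as chunked attention: neighbors are aggregated in fixed-size chunks, and chunk outputs are re-weighted by the attention mass they drew, so informative chunks dominate. This illustrates the concept only and is not the Wideformer implementation.

```python
import torch

# A toy version of "divide, then sort-and-weight": neighbors are aggregated in
# fixed-size chunks, and chunk outputs are re-weighted by the attention mass
# they drew so informative chunks dominate. This illustrates the concept only
# and is not the Wideformer implementation.
def chunked_attention(q, k, v, chunk_size=64):
    outputs, masses = [], []
    for start in range(0, k.shape[0], chunk_size):
        kc, vc = k[start:start + chunk_size], v[start:start + chunk_size]
        logits = q @ kc.T / kc.shape[-1] ** 0.5             # (1, chunk)
        outputs.append(torch.softmax(logits, dim=-1) @ vc)  # per-chunk aggregation
        masses.append(torch.logsumexp(logits, dim=-1))      # chunk's attention mass
    weights = torch.softmax(torch.cat(masses), dim=0)       # prioritize informative chunks
    return sum(w * o for w, o in zip(weights, outputs))

q, k, v = torch.randn(1, 32), torch.randn(1000, 32), torch.randn(1000, 32)
print(chunked_attention(q, k, v).shape)                     # torch.Size([1, 32])
```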

NeurIPS Conference 2025 Conference Paper

VL-SAE: Interpreting and Enhancing Vision-Language Alignment with a Unified Concept Set

  • Shufan Shen
  • Junshu Sun
  • Qingming Huang
  • Shuhui Wang

The alignment of vision-language representations endows current Vision-Language Models (VLMs) with strong multi-modal reasoning capabilities. However, the interpretability of the alignment component remains uninvestigated due to the difficulty of mapping the semantics of multi-modal representations into a unified concept set. To address this problem, we propose VL-SAE, a sparse autoencoder that encodes vision-language representations into its hidden activations. Each neuron in its hidden layer correlates with a concept represented by semantically similar images and texts, thereby interpreting these representations with a unified concept set. To establish the neuron-concept correlation, we encourage semantically similar representations to exhibit consistent neuron activations during self-supervised training. First, to measure the semantic similarity of multi-modal representations, we perform their alignment in an explicit form based on cosine similarity. Second, we construct the VL-SAE with a distance-based encoder and two modality-specific decoders to ensure the activation consistency of semantically similar representations. Experiments across multiple VLMs (e.g., CLIP, LLaVA) demonstrate the superior capability of VL-SAE in interpreting and enhancing vision-language alignment. For interpretation, the alignment between vision and language representations can be understood by comparing their semantics with concepts. For enhancement, the alignment can be strengthened by aligning vision-language representations at the concept level, contributing to performance improvements in downstream tasks, including zero-shot image classification and hallucination elimination. Codes are provided in the supplementary material and will be released on GitHub.
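
Below is a minimal sparse autoencoder with one shared concept layer and two modality-specific decoders, mirroring the design summarized above. The plain linear encoder, TopK sparsity, and layer sizes are assumptions; the paper uses a distance-based encoder.

```python
import torch
import torch.nn as nn

# A minimal sparse autoencoder with one shared concept layer and two
# modality-specific decoders, mirroring the design summarized above. The plain
# linear encoder, TopK sparsity, and layer sizes are assumptions; the paper
# uses a distance-based encoder.
class TwoDecoderSAE(nn.Module):
    def __init__(self, dim=512, hidden=4096, k=32):
        super().__init__()
        self.k = k
        self.encoder = nn.Linear(dim, hidden)
        self.decode_vision = nn.Linear(hidden, dim)
        self.decode_text = nn.Linear(hidden, dim)

    def concepts(self, x):
        acts = torch.relu(self.encoder(x))
        topk = acts.topk(self.k, dim=-1)          # keep the k most active concepts
        return torch.zeros_like(acts).scatter_(-1, topk.indices, topk.values)

    def forward(self, x, modality):
        z = self.concepts(x)
        decoder = self.decode_vision if modality == "vision" else self.decode_text
        return decoder(z), z

sae = TwoDecoderSAE()
recon, z = sae(torch.randn(8, 512), "vision")
print(recon.shape, int((z > 0).sum(dim=-1)[0]))   # reconstruction + active concepts
```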

AAAI Conference 2024 Conference Paper

Bias-Conflict Sample Synthesis and Adversarial Removal Debias Strategy for Temporal Sentence Grounding in Video

  • Zhaobo Qi
  • Yibo Yuan
  • Xiaowen Ruan
  • Shuhui Wang
  • Weigang Zhang
  • Qingming Huang

Temporal Sentence Grounding in Video (TSGV) is troubled by the dataset bias issue, which is caused by the uneven temporal distribution of the target moments for samples with similar semantic components in the input videos or query texts. Existing methods resort to utilizing prior knowledge about the bias to artificially break this uneven distribution, which removes only a limited number of significant language biases. In this work, we propose the bias-conflict sample synthesis and adversarial removal debias strategy (BSSARD), which dynamically generates bias-conflict samples by explicitly leveraging potentially spurious correlations between single-modality features and the temporal positions of the target moments. Through adversarial training, its bias generators continuously introduce biases and generate bias-conflict samples to deceive its grounding model. Meanwhile, the grounding model continuously eliminates the introduced biases, which requires it to model multi-modality alignment information. BSSARD covers most kinds of coupling relationships and disrupts language and visual biases simultaneously. Extensive experiments on Charades-CD and ActivityNet-CD demonstrate the promising debiasing capability of BSSARD. Source codes are available at https://github.com/qzhb/BSSARD.

AAAI Conference 2024 Conference Paper

Confusing Pair Correction Based on Category Prototype for Domain Adaptation under Noisy Environments

  • Churan Zhi
  • Junbao Zhuo
  • Shuhui Wang

In this paper, we address unsupervised domain adaptation under noisy environments, which is more challenging and practical than traditional domain adaptation. In this scenario, the model is prone to overfitting noisy labels, resulting in a more pronounced domain shift and a notable decline in overall model performance. Previous methods employed prototype methods for domain adaptation on robust feature spaces. However, these approaches struggle to effectively classify classes with similar features under noisy environments. To address this issue, we propose a new method to detect and correct confusing class pairs. We first divide classes into easy and hard classes based on the small-loss criterion. We then leverage the top-2 predictions for each sample, after aligning the source and target domains, to find the confusing pairs among the hard classes. We apply label correction to the noisy samples within each confusing pair. With the proposed label correction method, we can train our model with more accurate labels. Extensive experiments confirm the effectiveness of our method and demonstrate its favorable performance compared with existing state-of-the-art methods. Our codes are publicly available at https://github.com/Hehxcf/CPC/.
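
The top-2 co-occurrence idea can be sketched directly: the pair of hard classes that most often appears together in samples' top-2 predictions is flagged as the confusing pair. The counting rule and names below are illustrative, not the paper's exact procedure.

```python
import torch
from collections import Counter

# A minimal sketch of confusing-pair detection from top-2 predictions: the
# pair of hard classes that most often co-occurs in samples' top-2 predictions
# is flagged for label correction. The counting rule is illustrative.
def find_confusing_pair(logits, hard_classes):
    counts = Counter()
    for a, b in logits.topk(2, dim=1).indices.tolist():
        if a in hard_classes and b in hard_classes:
            counts[tuple(sorted((a, b)))] += 1
    return counts.most_common(1)[0] if counts else None

logits = torch.randn(1000, 31)                    # e.g., 31-class predictions
print(find_confusing_pair(logits, hard_classes={3, 7, 12}))
```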

ICML Conference 2024 Conference Paper

Data-free Neural Representation Compression with Riemannian Neural Dynamics

  • Zhengqi Pei
  • Anran Zhang
  • Shuhui Wang
  • Xiangyang Ji
  • Qingming Huang

Neural models are equivalent to dynamic systems from a physics-inspired view, implying that computation on neural networks can be interpreted as the dynamical interactions between neurons. However, existing work models neuronal interaction as a weight-based linear transformation, where the nonlinearity comes only from the nonlinear activation functions, leading to limited nonlinearity and data-fitting ability of the whole neural model. Inspired by Riemannian geometry, we interpret neural structures by projecting neurons onto the Riemannian neuronal state space and model neuronal interaction with a Riemannian metric (${\it RieM}$), which provides a neural representation with higher parameter efficiency. With ${\it RieM}$, we further design a novel data-free neural compression mechanism that does not require additional fine-tuning with real data. Using backbones like ResNet and Vision Transformer, we conduct extensive experiments on datasets such as MNIST, CIFAR-100, ImageNet-1k, and COCO object detection. Empirical results show that, under equal compression rates and computational complexity, models compressed with ${\it RieM}$ achieve superior inference accuracy compared to existing data-free compression methods.

NeurIPS Conference 2024 Conference Paper

Expanding Sparse Tuning for Low Memory Usage

  • Shufan Shen
  • Junshu Sun
  • Xiangyang Ji
  • Qingming Huang
  • Shuhui Wang

Parameter-efficient fine-tuning (PEFT) is an effective method for adapting pre-trained vision models to downstream tasks by tuning a small subset of parameters. Among PEFT methods, sparse tuning achieves superior performance by adjusting only the weights most relevant to downstream tasks, rather than densely tuning the whole weight matrix. However, this performance improvement has been accompanied by increases in memory usage, which stem from two factors, i.e., the storage of the whole weight matrix as learnable parameters in the optimizer and the additional storage of tunable weight indexes. In this paper, we propose a method named SNELL (Sparse tuning with kerNELized LoRA) for sparse tuning with low memory usage. To achieve low memory usage, SNELL decomposes the tunable matrix for sparsification into two learnable low-rank matrices, avoiding the costly storage of the whole original matrix. A competition-based sparsification mechanism is further proposed to avoid the storage of tunable weight indexes. To maintain the effectiveness of sparse tuning with low-rank matrices, we extend the low-rank decomposition by applying nonlinear kernel functions to the whole-matrix merging. Consequently, we gain an increase in the rank of the merged matrix, enhancing the ability of SNELL to adapt pre-trained models to downstream tasks. Extensive experiments on multiple downstream tasks show that SNELL achieves state-of-the-art performance with low memory usage, extending PEFT with sparse tuning to large-scale models. Codes are available at https://github.com/ssfgunner/SNELL.
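
The three ingredients (low-rank factors, a nonlinear kernel merge, and competition-based sparsification) can be sketched in a few lines. The kernel choice (tanh) and keep ratio are illustrative assumptions rather than SNELL's actual configuration.

```python
import torch

# A few-line sketch of the ingredients named above: the update is stored as two
# low-rank factors, merged through a nonlinear kernel to raise its rank, and
# sparsified by keeping only the largest-magnitude entries ("competition").
# The tanh kernel and 5% keep ratio are illustrative assumptions.
def kernelized_sparse_update(A, B, keep_ratio=0.05):
    delta = torch.tanh(A @ B)                     # nonlinear merge of the factors
    k = max(1, int(keep_ratio * delta.numel()))
    kth = delta.numel() - k + 1                   # k-th largest magnitude overall
    threshold = delta.abs().flatten().kthvalue(kth).values
    return delta * (delta.abs() >= threshold)     # only winning weights are tuned

d, r = 768, 8
A, B = torch.randn(d, r) * 0.02, torch.randn(r, d) * 0.02
update = kernelized_sparse_update(A, B)
print((update != 0).float().mean().item())        # roughly 0.05 of entries nonzero
```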

ICML Conference 2024 Conference Paper

Modeling Language Tokens as Functionals of Semantic Fields

  • Zhengqi Pei
  • Anran Zhang
  • Shuhui Wang
  • Qingming Huang

Recent advances in natural language processing have relied heavily on Transformer-based language models. However, Transformers often require large parameter sizes and model depth. Existing Transformer-free approaches using state-space models demonstrate superiority over Transformers, yet they still lack a neuro-biological connection to the human brain. This paper proposes ${\it LasF}$, representing ${\bf L}$anguage tokens ${\bf as}$ ${\bf F}$unctionals of semantic fields, to simulate neuronal behaviors for better language modeling. The ${\it LasF}$ module is equivalent to a nonlinear approximator tailored for sequential data. By replacing the final layers of pre-trained language models with the ${\it LasF}$ module, we obtain ${\it LasF}$-based models. Experiments conducted on standard reading comprehension and question-answering tasks demonstrate that the ${\it LasF}$-based models consistently improve accuracy with fewer parameters. Besides, we use CommonsenseQA's blind test set to evaluate a full-parameter-tuned ${\it LasF}$-based model, which outperforms the prior best ensemble and single models by 0.4% and 3.1%, respectively. Furthermore, our ${\it LasF}$-only language model trained from scratch outperforms existing parameter-efficient language models on standard datasets such as WikiText-103 and Penn Treebank.

ICLR Conference 2024 Conference Paper

R&B: Region and Boundary Aware Zero-shot Grounded Text-to-image Generation

  • Jiayu Xiao
  • Henglei Lv
  • Liang Li 0003
  • Shuhui Wang
  • Qingming Huang

Recent text-to-image (T2I) diffusion models have achieved remarkable progress in generating high-quality images given text prompts as input. However, these models fail to convey the appropriate spatial composition specified by a layout instruction. In this work, we probe into zero-shot grounded T2I generation with diffusion models, that is, generating images corresponding to the input layout information without training auxiliary modules or fine-tuning diffusion models. We propose a Region and Boundary (R&B) aware cross-attention guidance approach that gradually modulates the attention maps of the diffusion model during the generative process, and assists the model in synthesizing images (1) with high fidelity, (2) highly compatible with the textual input, and (3) interpreting layout instructions accurately. Specifically, we leverage discrete sampling to bridge the gap between consecutive attention maps and discrete layout constraints, and design a region-aware loss to refine the generative layout during the diffusion process. We further propose a boundary-aware loss to strengthen object discriminability within the corresponding regions. Experimental results show that our method outperforms existing state-of-the-art zero-shot grounded T2I generation methods by a large margin both qualitatively and quantitatively on several benchmarks. Project page: https://sagileo.github.io/Region-and-Boundary.

NeurIPS Conference 2024 Conference Paper

Towards Dynamic Message Passing on Graphs

  • Junshu Sun
  • Chenxue Yang
  • Xiangyang Ji
  • Qingming Huang
  • Shuhui Wang

Message passing plays a vital role in graph neural networks (GNNs) for effective feature learning. However, the over-reliance on input topology diminishes the efficacy of message passing and restricts the ability of GNNs. Despite efforts to mitigate this reliance, existing studies encounter message-passing bottlenecks or high computational expense, which calls for flexible message passing with low complexity. In this paper, we propose a novel dynamic message-passing mechanism for GNNs. It projects graph nodes and learnable pseudo nodes into a common space with measurable spatial relations between them. As nodes move in this space, their evolving relations facilitate flexible pathway construction for a dynamic message-passing process. By associating pseudo nodes with input graphs through their measured relations, graph nodes can communicate with each other through pseudo nodes as intermediaries under linear complexity. We further develop a GNN model named $\mathtt{N^2}$ based on our dynamic message-passing mechanism. $\mathtt{N^2}$ employs a single recurrent layer to recursively generate the displacements of nodes and construct optimal dynamic pathways. Evaluation on eighteen benchmarks demonstrates the superior performance of $\mathtt{N^2}$ over popular GNNs. $\mathtt{N^2}$ successfully scales to large-scale benchmarks and requires significantly fewer parameters for graph classification thanks to the shared recurrent layer.
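
A toy sketch of linear-complexity message passing through pseudo nodes: each graph node writes to and reads from a small set of pseudo nodes weighted by spatial proximity, so no node-to-node pairwise computation is needed. The distance-based weighting and shapes are assumptions, not the $\mathtt{N^2}$ implementation.

```python
import torch

# A toy sketch of linear-complexity message passing through pseudo nodes:
# every graph node writes to and reads from a small set of pseudo nodes,
# weighted by spatial proximity, so no pairwise node-to-node computation is
# needed. The distance-based weighting and shapes are illustrative assumptions.
def pseudo_node_round(x, pseudo):
    dist2 = torch.cdist(x, pseudo) ** 2           # (n_nodes, n_pseudo) relations
    to_pseudo = torch.softmax(-dist2, dim=0)      # node -> pseudo write weights
    from_pseudo = torch.softmax(-dist2, dim=1)    # pseudo -> node read weights
    pseudo_msg = to_pseudo.T @ x                  # gather messages: (n_pseudo, dim)
    return from_pseudo @ pseudo_msg               # relay back: (n_nodes, dim)

x = torch.randn(10_000, 64)                       # graph node features
pseudo = torch.randn(16, 64)                      # learnable pseudo-node positions
print(pseudo_node_round(x, pseudo).shape)         # cost O(n_nodes * n_pseudo)
```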

ICML Conference 2023 Conference Paper

All in a Row: Compressed Convolution Networks for Graphs

  • Junshu Sun
  • Shuhui Wang
  • Xinzhe Han
  • Zhe Xue
  • Qingming Huang

Compared to Euclidean convolution, existing graph convolution methods generally fail to learn diverse convolution operators under limited parameter scales and depend on additional treatments for multi-scale feature extraction. The challenges of generalizing Euclidean convolution to graphs arise from the irregular structure of graphs. To bridge the gap between Euclidean space and graph space, we propose a differentiable method for regularization on graphs that applies permutations to the input graphs. The permutations constrain all nodes in a row regardless of their input order and therefore enable the flexible generalization of Euclidean convolution. Based on this regularization of graphs, we propose the Compressed Convolution Network (CoCN) for hierarchical graph representation learning. CoCN follows the local feature learning and global parameter sharing mechanisms of Convolutional Neural Networks. The whole model can be trained end-to-end and is able to learn both individual node features and the corresponding structure features. We validate CoCN on several node classification and graph classification benchmarks. CoCN achieves superior performance over competitive convolutional GNNs and graph pooling models. Codes are available at https://github.com/sunjss/CoCN.

ICML Conference 2023 Conference Paper

Dynamics-inspired Neuromorphic Visual Representation Learning

  • Zhengqi Pei
  • Shuhui Wang

This paper investigates the dynamics-inspired neuromorphic architecture for visual representation learning following Hamilton’s principle. Our method converts weight-based neural structure to its dynamics-based form that consists of finite sub-models, whose mutual relations measured by computing path integrals amongst their dynamical states are equivalent to the typical neural weights. Based on the entropy reduction process derived from the Euler-Lagrange equations, the feedback signals interpreted as stress forces amongst sub-models push them to move. We first train a dynamics-based neural model from scratch and observe that this model outperforms traditional neural models on MNIST. We then convert several pre-trained neural structures into dynamics-based forms, followed by fine-tuning via entropy reduction to obtain the stabilized dynamical states. We observe consistent improvements in these transformed models over their weight-based counterparts on ImageNet and WebVision in terms of computational complexity, parameter size, testing accuracy, and robustness. Besides, we show the correlation between model performance and structural entropy, providing deeper insight into weight-free neuromorphic learning.

AAAI Conference 2022 Conference Paper

Unsupervised Coherent Video Cartoonization with Perceptual Motion Consistency

  • Zhenhuan Liu
  • Liang Li
  • Huajie Jiang
  • Xin Jin
  • Dandan Tu
  • Shuhui Wang
  • Zheng-Jun Zha

In recent years, creative content generation such as style transfer and neural photo editing has attracted increasing attention. Among these, cartoonization of real-world scenes has promising applications in entertainment and industry. Different from image translation, which focuses on improving the style effect of generated images, video cartoonization has the additional requirement of temporal consistency. In this paper, we propose a spatially-adaptive semantic alignment framework with perceptual motion consistency for coherent video cartoonization in an unsupervised manner. The semantic alignment module is designed to restore the deformation of semantic structure caused by spatial information lost in the encoder-decoder architecture. Furthermore, we devise the spatio-temporal correlative map as a style-independent, global-aware regularization on the perceptual motion consistency. Derived from similarity measurements of high-level features in photo and cartoon frames, it captures global semantic information beyond raw pixel values in optical flow. Besides, the similarity measurement disentangles temporal relationships from domain-specific style properties, which helps regularize temporal consistency without hurting the style effects of cartoon images. Qualitative and quantitative experiments demonstrate that our method is able to generate highly stylistic and temporally consistent cartoon videos.

AAAI Conference 2021 Conference Paper

Composite Adversarial Attacks

  • Xiaofeng Mao
  • Yuefeng Chen
  • Shuhui Wang
  • Hang Su
  • Yuan He
  • Hui Xue

Adversarial attack is a technique for deceiving Machine Learning (ML) models and provides a way to evaluate adversarial robustness. In practice, attack algorithms are artificially selected and tuned by human experts to break an ML system. However, manual selection of attackers tends to be sub-optimal, leading to a mistaken assessment of model security. In this paper, a new procedure called Composite Adversarial Attack (CAA) is proposed for automatically searching the best combination of attack algorithms and their hyperparameters from a candidate pool of 32 base attackers. We design a search space where an attack policy is represented as an attacking sequence, i.e., the output of the previous attacker is used as the initialization input for its successors. The multi-objective NSGA-II genetic algorithm is adopted for finding the strongest attack policy with minimum complexity. The experimental results show that CAA beats 10 top attackers on 11 diverse defenses with less elapsed time (6× faster than AutoAttack), and achieves a new state-of-the-art on l∞, l2 and unrestricted adversarial attacks.
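
The attack-policy representation is straightforward to sketch: a composite attack is a sequence of base attackers, each initialized from its predecessor's output. The attacker callables below are placeholders, not real attack implementations.

```python
from typing import Callable, Sequence

# A minimal sketch of the attack-policy representation described above: a
# composite attack is a sequence of base attackers, each initialized from the
# previous attacker's output. The Attack callables are placeholders, not real
# attack implementations.
Attack = Callable[[object, object], object]  # (model, x_init) -> x_adv

def run_composite_attack(model, x, policy: Sequence[Attack]):
    """Apply each attacker in turn, chaining adversarial examples."""
    for attack in policy:
        x = attack(model, x)  # successor starts from the predecessor's output
    return x

# Identity placeholders standing in for base attackers such as FGSM or PGD.
noop: Attack = lambda model, x: x
print(run_composite_attack(model=None, x=[0.0], policy=[noop, noop]))
```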

IJCAI Conference 2020 Conference Paper

A Structured Latent Variable Recurrent Network With Stochastic Attention For Generating Weibo Comments

  • Shijie Yang
  • Liang Li
  • Shuhui Wang
  • Weigang Zhang
  • Qingming Huang
  • Qi Tian

Building intelligent agents to generate realistic Weibo comments is challenging. For realistic Weibo comments, the key criterion is improving diversity while maintaining coherency. Considering that the variability of linguistic comments arises from multi-level sources, including both discourse-level properties and word-level selections, we improve comment diversity by leveraging this inherent hierarchy. In this paper, we propose a structured latent variable recurrent network, which exploits hierarchically structured latent variables with stochastic attention to model the variations of comments. First, we endow both discourse-level and word-level latent variables with hierarchical and temporal dependencies to construct a multi-level hierarchy. Second, we introduce a stochastic attention mechanism to infer the keywords of interest in the input post. As a result, diverse comments can be generated with both discourse-level properties and local word selections. Experiments on open-domain Weibo data show that our model generates more diverse and realistic comments.

AAAI Conference 2020 Conference Paper

F³Net: Fusion, Feedback and Focus for Salient Object Detection

  • Jun Wei
  • Shuhui Wang
  • Qingming Huang

Most existing salient object detection models have achieved great progress by aggregating multi-level features extracted from convolutional neural networks. However, because of the different receptive fields of different convolutional layers, there exist big differences between the features generated by these layers. Common feature fusion strategies (addition or concatenation) ignore these differences and may cause suboptimal solutions. In this paper, we propose F³Net to solve the above problem, which mainly consists of a cross feature module (CFM) and a cascaded feedback decoder (CFD) trained by minimizing a new pixel position aware loss (PPA). Specifically, CFM aims to selectively aggregate multi-level features. Different from addition and concatenation, CFM adaptively selects complementary components from input features before fusion, which can effectively avoid introducing too much redundant information that may destroy the original features. Besides, CFD adopts a multi-stage feedback mechanism, where features close to supervision are introduced into the outputs of previous layers to supplement them and eliminate the differences between features. These refined features go through multiple similar iterations before generating the final saliency maps. Furthermore, different from binary cross entropy, the proposed PPA loss does not treat pixels equally; it synthesizes the local structure information of a pixel to guide the network to focus more on local details. Hard pixels from boundaries or error-prone parts are given more attention to emphasize their importance. F³Net is able to segment salient object regions accurately and provide clear local details. Comprehensive experiments on five benchmark datasets demonstrate that F³Net outperforms state-of-the-art approaches on six evaluation metrics. Code will be released at https://github.com/weijun88/F3Net.
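
A hedged sketch of a pixel position-aware loss in the spirit described above: each pixel's BCE and IoU terms are weighted by how much it differs from its local neighborhood, emphasizing boundaries and hard regions. The kernel size and weighting constant are assumptions, not necessarily the paper's values.

```python
import torch
import torch.nn.functional as F

# A hedged sketch of a pixel position-aware loss: each pixel's BCE and IoU
# terms are weighted by how much the ground truth differs from its local
# neighborhood, so boundary and error-prone pixels count more. The kernel
# size and the weighting constant 5 are assumptions, not confirmed values.
def ppa_style_loss(pred_logits, mask, kernel=31):
    local_mean = F.avg_pool2d(mask, kernel, stride=1, padding=kernel // 2)
    weight = 1 + 5 * (local_mean - mask).abs()        # peaks near structure boundaries
    bce = F.binary_cross_entropy_with_logits(pred_logits, mask, reduction="none")
    wbce = (weight * bce).sum(dim=(2, 3)) / weight.sum(dim=(2, 3))
    pred = torch.sigmoid(pred_logits)
    inter = (weight * pred * mask).sum(dim=(2, 3))
    union = (weight * (pred + mask)).sum(dim=(2, 3))
    wiou = 1 - (inter + 1) / (union - inter + 1)      # weighted IoU with smoothing
    return (wbce + wiou).mean()

loss = ppa_style_loss(torch.randn(2, 1, 64, 64),
                      torch.randint(0, 2, (2, 1, 64, 64)).float())
print(loss.item())
```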

NeurIPS Conference 2020 Conference Paper

Heuristic Domain Adaptation

  • Shuhao Cui
  • Xuan Jin
  • Shuhui Wang
  • Yuan He
  • Qingming Huang

In visual domain adaptation (DA), separating the domain-specific characteristics from the domain-invariant representations is an ill-posed problem. Existing methods apply different kinds of priors or directly minimize the domain discrepancy to address this problem, which lacks flexibility in handling real-world situations. Another research pipeline expresses the domain-specific information as a gradual transferring process, which tends to be suboptimal in accurately removing the domain-specific properties. In this paper, we address the modeling of domain-invariant and domain-specific information from the heuristic search perspective. We identify the characteristics in existing representations that lead to larger domain discrepancy as heuristic representations. With the guidance of heuristic representations, we formulate a principled framework of Heuristic Domain Adaptation (HDA) with well-founded theoretical guarantees. To perform HDA, the cosine similarity scores and independence measurements between domain-invariant and domain-specific representations are cast into constraints at the initial and final states during the learning procedure. Similar to the final condition of heuristic search, we further derive a constraint enforcing the final range of the heuristic network output to be small. Accordingly, we propose the Heuristic Domain Adaptation Network (HDAN), which explicitly learns the domain-invariant and domain-specific representations with the above-mentioned constraints. Extensive experiments show that HDAN exceeds the state-of-the-art on unsupervised DA, multi-source DA and semi-supervised DA. The code is available at https://github.com/cuishuhao/HDA.

TIST Journal 2017 Journal Article

Location-Based Parallel Tag Completion for Geo-Tagged Social Image Retrieval

  • Jiaming Zhang
  • Shuhui Wang
  • Qingming Huang

Having benefited from the tremendous growth of user-generated content, social annotated tags have gained higher importance in the organization and retrieval of large-scale image databases on Online Sharing Websites (OSW). To obtain high-quality tags from existing community-contributed tags with missing information and noise, tag-based annotation or recommendation methods have been proposed to improve tag prediction. While images from OSW contain rich social attributes, existing methods have not taken full advantage of these attributes and the auxiliary information associated with social images to construct global information completion models. In this article, beyond the image-tag relation, we take full advantage of ubiquitous GPS locations and the image-user relationship to enhance the accuracy of tag prediction and improve computational efficiency. For GPS locations, we define the popular geo-locations where people tend to take more images as Points of Interest (POI), which are discovered by the mean shift approach. For the image-user relationship, we integrate a localized prior constraint, expecting the completed tag sub-matrix in each POI to maintain consistency with users' tagging behaviors. Based on these two key issues, we propose a unified tag matrix completion framework, which learns the image-tag relation within each POI. To solve the optimization problem, an efficient proximal sub-gradient descent algorithm is designed. The model optimization can be easily parallelized and distributed to learn the tag sub-matrix for each POI. Extensive experimental results reveal that the learned tag sub-matrix of each POI reflects the major trend of users' tagging results with respect to different POIs and users, and the parallel learning process provides strong support for processing large-scale online image databases. To fit the response time requirements and storage limitations of Tag-based Image Retrieval (TBIR) on mobile devices, we introduce Asymmetric Locality Sensitive Hashing (ALSH) to reduce the time cost and meanwhile improve the efficiency of retrieval.
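
The per-POI completion step can be sketched as generic proximal-gradient matrix completion: a gradient step on the observed tag entries followed by singular-value soft-thresholding, the proximal operator of the nuclear norm. All parameters are illustrative assumptions; the paper's localized prior constraint is omitted.

```python
import numpy as np

# A generic proximal-gradient sketch for completing one POI's tag sub-matrix:
# gradient step on the observed image-tag entries, then singular-value
# soft-thresholding, the proximal operator of the nuclear norm. Step size,
# regularization, and iteration count are illustrative assumptions.
def complete_tag_matrix(T_obs, observed, lam=0.5, step=1.0, iters=100):
    X = np.zeros_like(T_obs)
    for _ in range(iters):
        grad = observed * (X - T_obs)                 # loss only on observed tags
        U, s, Vt = np.linalg.svd(X - step * grad, full_matrices=False)
        s = np.maximum(s - step * lam, 0.0)           # shrink singular values
        X = (U * s) @ Vt                              # low-rank reconstruction
    return X

rng = np.random.default_rng(0)
T = (rng.random((50, 20)) < 0.1).astype(float)        # sparse image-tag matrix
mask = (rng.random(T.shape) < 0.7).astype(float)      # 70% of entries observed
print(complete_tag_matrix(T * mask, mask).shape)
```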

IROS Conference 2011 Conference Paper

Shape and location design of supporting legs for a new Water Strider Robot

  • Licheng Wu
  • Shuhui Wang
  • Marco Ceccarelli
  • Haiwen Yuan
  • Guosheng Yang

In this paper, the problems of shape design and position arrangement for the Water Strider Robot's supporting legs are discussed. A supporting leg is modeled as an Euler-Bernoulli elastic curved beam, and a method for designing its optimal shape is proposed by analysing elastic deformation and stress-strain behavior. The objective of the proposed optimization method is to attain the maximum lift force in leg operation. The effectiveness and validity of the design results are verified through simulations and lab experiments. A method for properly locating the supporting legs on the robot body is proposed by analysing the influence of leg location on lift force and the relationship between the supporting legs and the robot's roll-resistance capability. A layout scheme for the Water Dancer II-a prototype with ten supporting legs is presented, and its successful operation is demonstrated.