Arrow Research search

Author name cluster

Shuo Wang

Possible papers associated with this exact author name in Arrow. This page groups case-insensitive exact name matches and is not a full identity disambiguation profile.

57 papers
2 author rows

Possible papers

57

AAAI Conference 2026 Conference Paper

Accelerating Controllable Generation via Hybrid-grained Cache

  • Lin Liu
  • Huixia Ben
  • Shuo Wang
  • Jinda Lu
  • Junxiang Qiu
  • Shengeng Tang
  • Yanbin Hao

Controllable generative models have been widely used to improve the realism of synthetic visual content. However, such models must handle the computational demands of both control conditions and content generation, resulting in generally low generation efficiency. To address this issue, we propose a Hybrid-Grained Cache (HGC) approach that reduces computational overhead by adopting cache strategies with different granularities at different computational stages. Specifically, (1) we use a coarse-grained cache (block-level) based on feature reuse to dynamically bypass redundant computations in encoder-decoder blocks between consecutive inference steps. (2) We design a fine-grained cache (prompt-level) that acts within a module, reusing cross-attention maps across consecutive inference steps and extending them to the corresponding module computations of adjacent steps. These caches of different granularities can be seamlessly integrated into each computational link of the controllable generation process. We verify the effectiveness of HGC on four benchmark datasets, especially its advantages in balancing generation efficiency and visual quality. For example, on the COCO-Stuff segmentation benchmark, HGC significantly reduces the computational cost (MACs) by 63% (from 18.22T to 6.70T), while keeping the loss of semantic fidelity (quantified performance degradation) within 1.5%.
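
A minimal sketch of the coarse-grained (block-level) idea described above, assuming a diffusion-style loop in which a block's input changes slowly between adjacent steps; the cache class, drift test, and threshold are illustrative, not the paper's implementation:

```python
import torch

class BlockCache:
    """Coarse-grained (block-level) cache: reuse a block's output across
    adjacent denoising steps when its input has barely changed."""

    def __init__(self, rel_tol: float = 0.05):
        self.rel_tol = rel_tol          # illustrative drift threshold
        self.last_input = None
        self.last_output = None

    def __call__(self, block, x: torch.Tensor) -> torch.Tensor:
        if self.last_input is not None and self.last_input.shape == x.shape:
            # Relative change of the block input between consecutive steps.
            drift = (x - self.last_input).norm() / (self.last_input.norm() + 1e-8)
            if drift < self.rel_tol:
                return self.last_output  # bypass the block's computation
        out = block(x)
        self.last_input, self.last_output = x.detach(), out.detach()
        return out

# Usage: wrap each encoder/decoder block once, then call it at every step.
cached_block = BlockCache()
# out = cached_block(unet_block, hidden_states)   # inside the sampling loop
```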

AAAI Conference 2026 Conference Paper

CompTrack: Information Bottleneck-Guided Low-Rank Dynamic Token Compression for Point Cloud Tracking

  • Sifan Zhou
  • Yichao Cao
  • Jiahao Nie
  • Yuqian Fu
  • Ziyu Zhao
  • Xiaobo Lu
  • Shuo Wang

3D single object tracking (SOT) in LiDAR point clouds is a critical task in computer vision and autonomous driving. Despite great progress, the inherent sparsity of point clouds introduces a dual-redundancy challenge that limits existing trackers: (1) vast spatial redundancy from background noise impairs accuracy, and (2) informational redundancy within the foreground hinders efficiency. To tackle these issues, we propose CompTrack, a novel end-to-end framework that systematically eliminates both forms of redundancy in point clouds. First, CompTrack incorporates a Spatial Foreground Predictor (SFP) module to filter out irrelevant background noise based on information entropy, addressing spatial redundancy. Subsequently, its core Information Bottleneck-guided Dynamic Token Compression (IB-DTC) module eliminates the informational redundancy within the foreground. Theoretically grounded in low-rank approximation, this module leverages an online SVD analysis to adaptively compress the redundant foreground into a compact and highly informative set of proxy tokens. Extensive experiments on the KITTI, nuScenes and Waymo datasets demonstrate that CompTrack achieves state-of-the-art tracking performance with superior efficiency, running in real time at 90 FPS on a single RTX 3090 GPU.
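
The IB-DTC step lends itself to a compact illustration. A hedged sketch of SVD-based token compression, where a spectral-energy cutoff stands in for the paper's information-bottleneck criterion (the function name and threshold are assumptions):

```python
import torch

def compress_tokens(tokens: torch.Tensor, energy: float = 0.95) -> torch.Tensor:
    """Compress (N, D) foreground tokens into r proxy tokens via truncated SVD,
    keeping enough singular values to cover `energy` of the squared spectrum."""
    U, S, Vh = torch.linalg.svd(tokens, full_matrices=False)
    cum = torch.cumsum(S ** 2, dim=0) / (S ** 2).sum()
    r = min(int(torch.searchsorted(cum, torch.tensor(energy))) + 1, S.numel())
    # Each proxy is a singular-value-scaled principal direction in feature space.
    return S[:r].unsqueeze(1) * Vh[:r]            # (r, D) proxy token set

proxies = compress_tokens(torch.randn(256, 128))  # e.g. 256 tokens -> r proxies
```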

AAAI Conference 2026 Conference Paper

Dual-Horizon Interest Model for Unified Search and Recommendation

  • Wenhao Zhu
  • Yuxin Li
  • Shuo Wang
  • Hao Wang

Search and recommendation are pivotal for information access and are increasingly unified to exploit shared user-item interactions. Both tasks suffer from data sparsity, which joint modeling can mitigate by integrating behavioral data with or without explicit queries. However, existing unified frameworks rarely distinguish between users' long- and short-term interests, despite their divergent temporal dynamics in search and recommendation. In this work, we propose a novel model, DHIM, which explicitly disentangles and integrates users' long- and short-term interests across both the search and recommendation scenarios. First, long- and short-term interests are independently extracted from search and recommendation using a unified extraction strategy. These interests are then adaptively integrated via a cross-scenario fusion module. A self-supervised contrastive loss supervises the learning of both interest types within and across scenarios. The resulting representations are fed into downstream search and recommendation models for prediction. Extensive experiments on two public benchmarks demonstrate that our approach consistently outperforms single-scenario and state-of-the-art joint models, achieving superior accuracy and generalizability. To our knowledge, this is the first work to incorporate explicit dual-horizon interest modeling into a unified search and recommendation framework with self-supervised contrastive learning.

AIJ Journal 2026 Journal Article

Environment promoted invariant information learning for graph out-of-distribution generalization

  • Shuo Wang
  • Mingchen Sun
  • Qiang Huang
  • Ying Wang

Graph out-of-distribution generalization is an important task in graph data mining, which has received extensive attention in many practical applications. In recent years, an increasing number of studies have focused on applying invariant learning and causal learning to enhance the model's cross-environment generalization capability. However, existing methods often neglect subgraph estimation bias during invariant information extraction, which impacts generalization performance. To address this issue, we construct the Causality Inspired Environment Promoted Graph Generalization Framework (CEPG), which dynamically corrects subgraph estimation biases and learns the target invariant distribution by integrating multiple constraints. Specifically, we first leverage a subgraph generation module to explicitly obtain invariant and environmental subgraphs by evaluating link reliability. Then, we design a dedicated environmental information extraction module to prevent bias propagation from environmental subgraphs and capture domain-specific knowledge. Finally, we construct the environment-promoted invariant information learning module, which aligns the estimated invariant distribution with the target distribution through environment-promoted and reflection-mechanism guidance constraints. Extensive experiments demonstrate that our approach effectively enhances generalization across various types of distribution shifts and outperforms state-of-the-art methods on both synthetic and real-world graph OOD generalization benchmarks.

AAAI Conference 2026 Conference Paper

Graph Domain Adaptation via Homophily-Agnostic Reconstructing Structure

  • Ruiyi Fang
  • Shuo Wang
  • Ruizhi Pu
  • Qiuhao Zeng
  • Hao Zheng
  • Ziyan Wang
  • Jiale Cai
  • Zhimin Mei

Graph Domain Adaptation (GDA) transfers knowledge from labeled source graphs to unlabeled target graphs, addressing the challenge of label scarcity. However, existing GDA methods typically assume that both source and target graphs exhibit homophily, causing them to perform poorly when heterophily is present. Furthermore, the lack of labels in the target graph makes it impossible to assess its homophily level beforehand. To address this challenge, we propose a novel homophily-agnostic approach that effectively transfers knowledge between graphs with varying degrees of homophily. Specifically, we adopt a divide-and-conquer strategy that first separately reconstructs highly homophilic and heterophilic variants of both the source and target graphs, and then performs knowledge alignment separately between corresponding graph variants. Extensive experiments conducted on five benchmark datasets demonstrate the superior performance of our approach, particularly highlighting its substantial advantages on heterophilic graphs.

AAAI Conference 2026 Conference Paper

HEV Generative Sandbox: A Framework for Assessing Domain-Specific Social Risks Through Human-LLM Simulation

  • Yiran Liu
  • Zhiyi Hou
  • Xiaoang Xu
  • Shuo Wang
  • Huijia Wu
  • Kaicheng Yu
  • Yang Yu
  • ChengXiang Zhai

Deploying Large Language Models (LLMs) in specialized domains introduces significant societal and compliance risks, including bias amplification, misinformation propagation, and privacy violations. These risks predominantly emerge from the dynamic interactions between LLMs and humans in specific contexts. Different domains face unique distributions of hazards, and varying interaction modalities introduce distinct levels of exposure and vulnerability. However, current risk assessment frameworks lack a systematic methodology to capture this dynamic interplay. In this work, we introduce the HEV Generative Sandbox, a novel risk evaluation framework that simulates human-LLM behavior to quantify domain-contextual risks across three interdependent dimensions: 1) Hazard (H): Domain-specific threats inherent to a given context; 2) Exposure (E): The extent to which the LLM and its users are subjected to hazardous scenarios; 3) Vulnerability (V): The susceptibility of the system to risk due to human interaction or model weaknesses. Our approach pioneers "domain-rooted scenario generation", wherein we sample contextual distributions from domain-specific corpora and simulate diverse inputs. By unifying dynamic scenario simulation, causal risk decomposition, and closed-loop evaluation, the HEV Generative Sandbox provides a scalable, domain-sensitive methodology for responsible LLM deployment. This work contributes to advancing the safe deployment of LLMs by providing a comprehensive and automated risk evaluation framework.

AAAI Conference 2026 Conference Paper

Hierarchical Semantic Alignment for Image Clustering

  • Xingyu Zhu
  • Beier Zhu
  • Yunfan Li
  • Junfeng Fang
  • Shuo Wang
  • Kesen Zhao
  • Hanwang Zhang

Image clustering is a classic problem in computer vision, which categorizes images into different groups. Recent studies utilize nouns as external semantic knowledge to improve clustering performance. However, these methods often overlook the inherent ambiguity of nouns, which can distort semantic representations and degrade clustering quality. To address this issue, we propose a hierarChical semAntic alignmEnt method for image clustering, dubbed CAE, which improves clustering performance in a training-free manner. In our approach, we incorporate two complementary types of textual semantics: caption-level descriptions, which convey fine-grained attributes of image content, and noun-level concepts, which represent high-level object categories. We first select relevant nouns from WordNet and descriptions from caption datasets to construct a semantic space aligned with image features. Then, we design a residual attention mechanism to further enhance the discriminability of this space. Finally, we combine the enhanced semantic and image features to perform clustering. Extensive experiments across 8 datasets demonstrate the effectiveness of our method, notably surpassing the state-of-the-art training-free approach with a 4.2% improvement in accuracy and a 2.9% improvement in adjusted rand index (ARI) on the ImageNet-1K dataset.
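
A rough training-free sketch of the overall recipe: fuse image features with attention-weighted text semantics (selected nouns plus captions) before clustering. The temperature, blending weight `alpha`, and plain softmax attention are illustrative stand-ins for the paper's residual attention mechanism:

```python
import numpy as np
from sklearn.cluster import KMeans

def cluster_with_semantics(img_feats, text_feats, n_clusters, alpha=0.5):
    """Fuse L2-normalized image features with attention-weighted text
    semantics, then run k-means on the fused features."""
    img = img_feats / np.linalg.norm(img_feats, axis=1, keepdims=True)
    txt = text_feats / np.linalg.norm(text_feats, axis=1, keepdims=True)
    attn = np.exp(img @ txt.T * 100.0)      # temperature-sharpened similarities
    attn /= attn.sum(axis=1, keepdims=True)
    semantic = attn @ txt                   # images re-expressed in text space
    fused = (1.0 - alpha) * img + alpha * semantic
    return KMeans(n_clusters=n_clusters, n_init=10).fit_predict(fused)
```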

AAAI Conference 2026 Conference Paper

Mem4D: Decoupling Static and Dynamic Memory for Dynamic Scene Reconstruction

  • Xudong Cai
  • Shuo Wang
  • Peng Wang
  • Yongcai Wang
  • Zhaoxin Fan
  • Wanting Li
  • Tianbao Zhang
  • Jianrong Tao

Reconstructing dense geometry for dynamic scenes from a monocular video is a critical yet challenging task. Recent memory-based methods enable efficient online reconstruction, but they fundamentally suffer from a Memory Demand Dilemma: The memory representation faces an inherent conflict between the long-term stability required for static structures and the rapid, high-fidelity detail retention needed for dynamic motion. This conflict forces existing methods into a compromise, leading to either geometric drift in static structures or blurred, inaccurate reconstructions of dynamic objects. To address this dilemma, we propose Mem4D, a novel framework that decouples the modeling of static geometry and dynamic motion. Guided by this insight, we design a dual-memory architecture: 1) The Transient Dynamics Memory (TDM) focuses on capturing high-frequency motion details from recent frames, enabling accurate and fine-grained modeling of dynamic content; 2) The Persistent Structure Memory (PSM) compresses and preserves long-term spatial information, ensuring global consistency and drift-free reconstruction for static elements. By alternating queries to these specialized memories, Mem4D simultaneously maintains static geometry with global consistency and reconstructs dynamic elements with high fidelity. Experiments on challenging benchmarks demonstrate that our method achieves state-of-the-art or competitive performance while maintaining high efficiency.
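
A toy sketch of the dual-memory split described above, assuming per-frame feature tensors; the FIFO and running-average compression are illustrative stand-ins for the learned TDM/PSM modules:

```python
from collections import deque
import torch

class DualMemory:
    """Transient Dynamics Memory (short FIFO of recent frame features) plus a
    Persistent Structure Memory (slow running summary of static structure)."""

    def __init__(self, tdm_len: int = 8, momentum: float = 0.99):
        self.tdm = deque(maxlen=tdm_len)   # high-frequency, recent-frame detail
        self.psm = None                    # long-term, drift-resistant summary
        self.momentum = momentum

    def update(self, frame_feat: torch.Tensor) -> None:
        self.tdm.append(frame_feat)
        # Toy compression: a running average stands in for learned compression.
        self.psm = frame_feat.clone() if self.psm is None else \
            self.momentum * self.psm + (1 - self.momentum) * frame_feat

    def query(self):
        # Dynamic readout from recent frames; static readout from the PSM.
        return torch.stack(list(self.tdm)).mean(dim=0), self.psm
```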

AAAI Conference 2026 Conference Paper

MonoDream: Monocular Vision-Language Navigation with Panoramic Dreaming

  • Shuo Wang
  • Yongcai Wang
  • Zhaoxin Fan
  • Yucheng Wang
  • Maiyue Chen
  • Kaihui Wang
  • Zhizhong Su
  • Wanting Li

Vision-Language Navigation (VLN) tasks often leverage panoramic RGB and depth inputs to provide rich spatial cues for action planning, but these sensors can be costly or less accessible in real-world deployments. Recent approaches based on Vision-Language Action (VLA) models achieve strong results with monocular input, yet they still lag behind methods using panoramic RGB-D information. We present MonoDream, a lightweight VLA framework that enables monocular agents to learn a Unified Navigation Representation (UNR). This shared feature representation jointly aligns navigation-relevant visual semantics (e.g., global layout, depth, and future cues) and language-grounded action intent, enabling more reliable action prediction. MonoDream further introduces Latent Panoramic Dreaming (LPD) tasks to supervise the UNR, which train the model to predict latent features of panoramic RGB and depth observations at both current and future steps based on only monocular input. Experiments on multiple VLN benchmarks show that MonoDream consistently improves monocular navigation performance and significantly narrows the gap with panoramic-based agents.

AAAI Conference 2026 Conference Paper

Run, Ruminate, and Regulate: A Dual-process Thinking System for Vision-and-Language Navigation

  • Yu Zhong
  • Zihao Zhang
  • Rui Zhang
  • Lingdong Huang
  • Haihan Gao
  • Shuo Wang
  • Da Li
  • Ruijian Han

Vision-and-Language Navigation (VLN) requires an agent to dynamically explore complex 3D environments following human instructions. Recent research underscores the potential of harnessing large language models (LLMs) for VLN, given their commonsense knowledge and general reasoning capabilities. Despite their strengths, a substantial gap in task completion performance persists between LLM-based approaches and domain experts, as LLMs inherently struggle to comprehend real-world spatial correlations precisely; additionally, LLM inference can make the decision-making process considerably inefficient. To address these issues, we propose a novel dual-process thinking framework dubbed R3, integrating LLMs' generalization capabilities with VLN-specific expertise in a zero-shot manner. The framework comprises three core modules: Runner, Ruminator, and Regulator. The Runner is a lightweight transformer-based expert model that ensures efficient and accurate navigation under regular circumstances. The Ruminator employs a multimodal LLM as the backbone and adopts chain-of-thought (CoT) prompting to elicit structured reasoning from the LLM. The Regulator monitors the navigation progress and selects the appropriate thinking mode according to three criteria, integrating the Runner and Ruminator harmoniously. Experimental results illustrate that R3 significantly outperforms other state-of-the-art methods, exceeding them by 3.28% and 3.30% in SPL and RGSPL, respectively, on the REVERIE benchmark, highlighting the effectiveness of our method in handling challenging VLN tasks.
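
A minimal control-loop sketch of the dual-process design, with hypothetical runner/ruminator/regulator interfaces and a generic environment API:

```python
def navigate(env, instruction, runner, ruminator, regulator, max_steps=30):
    """Fast expert (Runner) acts by default; the Regulator escalates to the
    slow multimodal-LLM Ruminator only when its criteria fire."""
    obs, trajectory = env.reset(), []
    for _ in range(max_steps):
        if regulator.should_ruminate(obs, trajectory):   # e.g. low confidence, loops
            action = ruminator.reason_and_act(obs, instruction, trajectory)  # CoT path
        else:
            action = runner.act(obs, instruction)        # lightweight expert path
        trajectory.append(action)
        if action == "STOP":
            break
        obs = env.step(action)
    return trajectory
```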

NeurIPS Conference 2025 Conference Paper

A*-Thought: Efficient Reasoning via Bidirectional Compression for Low-Resource Settings

  • Xiaoang Xu
  • Shuo Wang
  • Xu Han
  • Zhenghao Liu
  • Huijia Wu
  • Peipei Li
  • Zhiyuan Liu
  • Maosong Sun

Large Reasoning Models (LRMs) achieve superior performance by extending the length of their thought. However, a lengthy thinking trajectory leads to reduced efficiency. Most existing methods start from the assumption of overthinking and attempt to reason efficiently by compressing the Chain-of-Thought, but this often leads to performance degradation. To address this problem, we introduce A*-Thought, an efficient tree search-based unified framework designed to identify and isolate the most essential thoughts from the extensive reasoning chains produced by these models. It formulates the reasoning process of LRMs as a search tree, where each node represents a reasoning span in the vast reasoning space. By combining the A* search algorithm with a cost function specific to the reasoning path, it can efficiently compress the chain of thought and determine a reasoning path with high information density and low cost. In addition, we propose a bidirectional importance estimation mechanism, which further refines this search process and enhances its efficiency beyond uniform sampling. Extensive experiments on several advanced math tasks show that A*-Thought effectively balances performance and efficiency over a huge search space. Specifically, A*-Thought can improve the performance of QwQ-32B by 2.39× in low-budget settings and reduce output token length by nearly 50% in high-budget settings. The proposed method is also compatible with several other LRMs, demonstrating its generalization capability. The code can be accessed at: https://github.com/AI9Stars/AStar-Thought.
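
A hedged sketch of the search formulation: best-first search over a chain of reasoning spans with an additive token-cost model and a hypothetical `score()` information-density estimator. With the trivial zero heuristic used here it degenerates to uniform-cost search; adding a non-trivial admissible heuristic to `f` would make it a proper A*:

```python
import heapq

def compress_cot(spans, score, budget):
    """Select a short but informative subsequence of reasoning spans.
    State: (f, tokens_used, next_index, chosen); f = cost - information kept."""
    heap = [(0.0, 0, 0, ())]
    while heap:
        f, used, i, chosen = heapq.heappop(heap)
        if i == len(spans):
            return list(chosen)                         # cheapest complete path
        cost = len(spans[i].split())                    # token cost of this span
        if used + cost <= budget:                       # branch 1: keep the span
            heapq.heappush(heap, (f + cost - score(spans[i]),
                                  used + cost, i + 1, chosen + (spans[i],)))
        heapq.heappush(heap, (f, used, i + 1, chosen))  # branch 2: skip it
    return []
```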

TIST Journal 2025 Journal Article

Aspect-Enhanced Explainable Recommendation with Multi-modal Contrastive Learning

  • Hao Liao
  • Shuo Wang
  • Hao Cheng
  • Wei Zhang
  • Jiwei Zhang
  • Mingyang Zhou
  • Kezhong Lu
  • Rui Mao

Explainable recommender systems (ERS) aim to enhance users' trust in the systems by offering personalized recommendations with transparent explanations. This transparency provides users with a clear understanding of the rationale behind the recommendations, fostering a sense of confidence and reliability in the system's outputs. Generally, the explanations are presented in a familiar and intuitive way, in the form of natural language, thus enhancing their accessibility to users. Recently, there has been an increasing focus on leveraging reviews as a valuable source of rich information for both modeling user-item preferences and generating textual interpretations, which can be performed simultaneously in a multi-task framework. Despite the progress made in these review-based recommendation systems, the integration of implicit feedback derived from user-item interactions and user-written text reviews has yet to be fully explored. To fill this gap, we propose a model named SERMON (Aspect-enhanced Explainable Recommendation with Multi-modal Contrastive Learning). Our model explores the application of multimodal contrastive learning to facilitate reciprocal learning across two modalities, thereby enhancing the modeling of user preferences. Moreover, our model incorporates aspect information extracted from the reviews, which provides two significant enhancements to our tasks. Firstly, the quality of the generated explanations is improved by incorporating aspect characteristics into the explanations generated by a pre-trained model with controlled text generation ability. Secondly, the commonly used user-item interactions are transformed into user-item-aspect interactions, which we refer to as interaction triples, resulting in a more nuanced representation of user preference. To validate the effectiveness of our model, we conduct extensive experiments on three real-world datasets. The experimental results show that our model outperforms state-of-the-art baselines, with a 2.0% improvement in prediction accuracy and a substantial 24.5% enhancement in explanation quality on the TripAdvisor dataset.

NeurIPS Conference 2025 Conference Paper

Aux-Think: Exploring Reasoning Strategies for Data-Efficient Vision-Language Navigation

  • Shuo Wang
  • Yongcai Wang
  • Wanting Li
  • Xudong Cai
  • Yucheng Wang
  • Maiyue Chen
  • Zhizhong Su
  • Deying Li

Vision-Language Navigation (VLN) is a critical task for developing embodied agents that can follow natural language instructions to navigate in complex real-world environments. Recent advances in finetuning large pretrained models have significantly improved generalization and instruction grounding compared to traditional approaches. However, the role of reasoning strategies in navigation (an action-centric, long-horizon task) remains underexplored, despite Chain-of-Thought reasoning's demonstrated success in static tasks like question answering and visual reasoning. To address this gap, we conduct the first systematic evaluation of reasoning strategies for VLN, including No-Think (direct action prediction), Pre-Think (reason before action), and Post-Think (reason after action). Surprisingly, our findings reveal the Inference-time Reasoning Collapse issue, where inference-time reasoning degrades navigation accuracy, highlighting the challenges of integrating reasoning into VLN. Based on this insight, we propose Aux-Think, a framework that trains models to internalize structured reasoning patterns through CoT supervision during training, while preserving No-Think inference for efficient action prediction. To support this framework, we release R2R-CoT-320k, a large-scale Chain-of-Thought annotated dataset. Empirically, Aux-Think significantly reduces training effort without compromising performance.

AAAI Conference 2025 Conference Paper

CoDe: Communication Delay-Tolerant Multi-Agent Collaboration via Dual Alignment of Intent and Timeliness

  • Shoucheng Song
  • Youfang Lin
  • Sheng Han
  • Chang Yao
  • Hao Wu
  • Shuo Wang
  • Kai Lv

Communication has been widely employed to enhance multi-agent collaboration. Previous research has typically assumed delay-free communication, a strong assumption that is challenging to meet in practice. However, real-world agents suffer from channel delays, receiving messages sent at different time points, termed Asynchronous Communication, leading to cognitive biases and breakdowns in collaboration. This paper first defines two communication delay settings in MARL and emphasizes their harm to collaboration. To handle the above delays, this paper proposes a novel framework, Communication Delay-Tolerant Multi-Agent Collaboration (CoDe). First, CoDe learns an intent representation as messages through future action inference, reflecting the stable future behavioral trends of the agents. Then, CoDe devises a dual alignment mechanism of intent and timeliness to strengthen the fusion process of asynchronous messages. In this way, agents can extract the long-term intent of others, even from delayed messages, and selectively utilize the most recent messages that are relevant to their intent. Experimental results demonstrate that CoDe outperforms baseline algorithms in three MARL benchmarks without delay and exhibits robustness under fixed and time-varying delays.

ICML Conference 2025 Conference Paper

Cooperation of Experts: Fusing Heterogeneous Information with Large Margin

  • Shuo Wang
  • Shunyang Huang
  • Jinghui Yuan
  • Zhixiang Shen
  • Zhao Kang 0001

Fusing heterogeneous information remains a persistent challenge in modern data analysis. While significant progress has been made, existing approaches often fail to account for the inherent heterogeneity of object patterns across different semantic spaces. To address this limitation, we propose the Cooperation of Experts (CoE) framework, which encodes multi-typed information into unified heterogeneous multiplex networks. By transcending modality and connection differences, CoE provides a powerful and flexible model for capturing the intricate structures of real-world complex data. In our framework, dedicated encoders act as domain-specific experts, each specializing in learning distinct relational patterns in specific semantic spaces. To enhance robustness and extract complementary knowledge, these experts collaborate through a novel large margin mechanism supported by a tailored optimization strategy. Rigorous theoretical analyses guarantee the framework’s feasibility and stability, while extensive experiments across diverse benchmarks demonstrate its superior performance and broad applicability.

NeurIPS Conference 2025 Conference Paper

DCAD-2000: A Multilingual Dataset across 2000+ Languages with Data Cleaning as Anomaly Detection

  • Yingli Shen
  • Wen Lai
  • Shuo Wang
  • Xueren Zhang
  • Kangyang Luo
  • Alexander Fraser
  • Maosong Sun

The rapid development of multilingual large language models (LLMs) highlights the need for high-quality, diverse, and well-curated multilingual datasets. In this paper, we introduce DCAD-2000 (Data Cleaning as Anomaly Detection), a large-scale multilingual corpus constructed from newly extracted Common Crawl data and existing multilingual sources. DCAD-2000 covers 2,282 languages, 46.72TB of text, and 8.63 billion documents, spanning 155 high- and medium-resource languages and 159 writing scripts. To overcome the limitations of existing data cleaning approaches, which rely on manually designed heuristic thresholds, we reframe data cleaning as an anomaly detection problem. This dynamic filtering paradigm substantially improves data quality by automatically identifying and removing noisy or anomalous content. By fine-tuning LLMs on DCAD-2000, we demonstrate notable improvements in data quality, robustness of the cleaning pipeline, and downstream performance, particularly for low-resource languages across multiple multilingual benchmarks.
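
The cleaning-as-anomaly-detection idea maps naturally onto an off-the-shelf detector. A minimal sketch using scikit-learn's IsolationForest; the three quality features are illustrative, not the paper's feature set:

```python
import numpy as np
from sklearn.ensemble import IsolationForest

def featurize(doc: str):
    words = doc.split()
    return [
        len(words),                                        # document length
        sum(c.isalpha() for c in doc) / max(len(doc), 1),  # alphabetic ratio
        len(set(words)) / max(len(words), 1),              # lexical diversity
    ]

def clean_corpus(docs):
    """Keep documents the detector labels as inliers; no hand-tuned thresholds."""
    X = np.array([featurize(d) for d in docs])
    keep = IsolationForest(contamination="auto", random_state=0).fit_predict(X)
    return [d for d, k in zip(docs, keep) if k == 1]
```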

NeurIPS Conference 2025 Conference Paper

Enhancing CLIP Robustness via Cross-Modality Alignment

  • Xingyu Zhu
  • Beier Zhu
  • Shuo Wang
  • Kesen Zhao
  • Hanwang Zhang

Vision-language models (VLMs) such as CLIP demonstrate strong generalization in zero-shot classification but remain highly vulnerable to adversarial perturbations. Existing methods primarily focus on adversarial fine-tuning or prompt optimization but often overlook the gap in CLIP's encoded features, where text and image features lie far apart from each other. This misalignment is significantly amplified under adversarial perturbations, leading to severe degradation in classification performance. To address this problem, we propose CrOss-modaLity Alignment, dubbed COLA, an optimal transport-based framework that explicitly addresses adversarial misalignment by restoring both global image-text alignment and local structural consistency in the feature space. (1) COLA first projects adversarial image embeddings onto a subspace spanned by class text features, effectively filtering out non-semantic distortions while preserving discriminative information. (2) It then models images and texts as discrete distributions over multiple augmented views and refines their alignment via OT, with the subspace projection seamlessly integrated into the cost computation. This design ensures stable cross-modal alignment even under adversarial conditions. COLA is training-free and compatible with existing fine-tuned models. Extensive evaluations across 14 zero-shot classification benchmarks demonstrate the effectiveness of COLA, especially with an average improvement of 6.7% on ImageNet and its variants under PGD adversarial attacks, while maintaining high accuracy on clean samples.
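
Step (1), the subspace projection, is easy to sketch: a least-squares orthogonal projection onto the span of the class text features, assuming one embedding per class; the OT refinement of step (2) is omitted here:

```python
import torch

def project_to_text_subspace(img_emb: torch.Tensor,
                             text_feats: torch.Tensor) -> torch.Tensor:
    """Orthogonally project a (possibly adversarial) image embedding onto
    span{class text features}, discarding off-subspace distortions."""
    T = text_feats.T                                  # (D, C): one column per class
    coeffs = torch.linalg.lstsq(T, img_emb.unsqueeze(1)).solution
    proj = (T @ coeffs).squeeze(1)                    # back in embedding space
    return proj / proj.norm()

# Usage with CLIP-like features: img_emb (512,), text_feats (num_classes, 512).
```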

EAAI Journal 2025 Journal Article

Enhancing rotation you only look once version 8 for accurate detection of arbitrarily-oriented and multi-scale construction material in complex environment

  • Yujie Lu
  • Yuanjun Nong
  • Dayu Zhu
  • Shuo Wang

Construction material detection is integral to effective material management. While deep learning-based methods have advanced automatic detection, challenges such as arbitrarily oriented objects, similar features, large object scale variations, and noise interference still limit accuracy. This study proposes an enhanced detection method based on improved rotation you only look once version 8, incorporating three key innovations. First, the large selective kernel network employs a spatial selection mechanism to dynamically adjust the network's receptive field, capturing key features and contextual information of various materials in complex construction scenarios. This enhances object detection performance in feature-similar environments. Second, the poly kernel inception network combines non-dilated multi-scale convolutions to extract dense texture features from differently sized objects, while using contextual anchor attention to capture long-range information for small objects. These components work together to improve multi-scale object detection. Third, the rectangular self-calibration module uses horizontal and vertical pooling to model rectangular attention regions, capturing axial global context. Its shape self-calibration function adjusts these regions to better fit arbitrarily oriented construction materials, enhancing focus on objects while suppressing noise interference. Experimental results show a mean average precision of 0.871, which is 2.7% higher than the baseline model and surpasses other state-of-the-art methods, while maintaining a real-time detection speed of 28.3 frames per second. The proposed method improves material detection accuracy in complex construction environments, enabling refined material management on-site. This supports material entry and exit tracking, on-site usage monitoring, inventory management, and procurement planning, while also strengthening lean control of construction progress and costs.

EAAI Journal 2025 Journal Article

Fault diagnosis method of mining vibrating screen mesh based on an improved algorithm

  • Fusheng Niu
  • Jiahui Wu
  • Jinxia Zhang
  • ZhiHeng Nie
  • Guang Song
  • Xiongsheng Zhu
  • Shuo Wang

Artificial intelligence fault diagnosis technology based on machine vision, owing to its low cost and high efficiency, has become an indispensable part of production processes across various industries. Compared to traditional fault diagnosis methods, artificial intelligence diagnosis of common mechanical failures, such as ‘clogging’, ‘wear’, and ‘breakage’ in vibrating screen meshes within the mining screening sector, improves detection efficiency, accuracy, and sustainability. Small target faults in large screening areas are challenging to detect through manual diagnosis, which reduces screening efficiency and shortens equipment lifespan, negatively impacting safe and efficient production at mining enterprises. A fault diagnosis model with a better speed-precision trade-off is proposed to improve detection precision, based on the You Only Look Once version 5 single-stage object detection algorithm. This model is optimized in feature extraction and fusion by integrating autocode masking, re-parameterization, and omni-dimensional attention. The model's performance is primarily evaluated using precision, recall, balanced score, and mean average precision. The improved algorithm achieves a precision of 97.2%, a recall of 93.3%, a balanced score of 95.21%, and a mean average precision of 97.0%. Experimental results demonstrate that the improved algorithm increases the mean average precision by 3.1% compared to the original model. The results show that the improved algorithm is more effective than the original in fault diagnosis, with enhanced screen mesh detection precision. Thus, it ensures production safety and stable screening efficiency. Moreover, the proposed algorithm provides a reference for advancing intelligent and efficient fault diagnosis technology in the mining screening field.

IJCAI Conference 2025 Conference Paper

Fine-Grained and Efficient Self-Unlearning with Layered Iteration

  • Hongyi Lyu
  • Xuyun Zhang
  • Hongsheng Hu
  • Shuo Wang
  • Chaoxiang He
  • Lianyong Qi

As machine learning models become widely deployed in data-driven applications, ensuring compliance with the 'right to be forgotten' as required by many privacy regulations is vital for safeguarding user privacy. To forget the given data, existing re-labeling based unlearning methods employ a single-step adjustment scheme that revises the decision boundaries in one re-labeling phase. However, such single-step approaches lead to coarse-grained changes in decision boundaries among the remaining classes and impose adverse effects on the model utility. To address these limitations, we propose Self-Unlearning with Layered Iteration (SULI), a novel unlearning approach that introduces a layered iteration strategy to re-label the forgetting data iteratively and refine the decision boundaries progressively. We further develop a Selective Probability Adjustment (SPA) technique, which uses a soft-label mechanism to promote smoother decision-boundary transitions. Comprehensive experiments on three benchmark datasets demonstrate that SULI achieves superior performance in effectiveness, efficiency, and privacy compared to the state-of-the-art baselines in both class-wise and instance-wise unlearning scenarios. The source code is released at https://github.com/Hongyi-Lyu-MQ/SULI.

AAAI Conference 2025 Conference Paper

Infer the Whole from a Glimpse of a Part: Keypoint-Based Knowledge Graph for Vehicle Re-Identification

  • Kai Lv
  • Yunlong Li
  • Zhuo Chen
  • Shuo Wang
  • Sheng Han
  • Youfang Lin

Vehicle re-identification aims to match vehicles across non-overlapping camera views. Many existing methods extract features from one specific image, and such methods lack view-invariance when comparing vehicles of different orientations. As a result, discriminative parts obscured by viewpoint changes cannot contribute effectively to matching. This work presents a novel keypoint-based framework for vehicle Re-ID. We propose to explicitly model the intrinsic structural relationships between vehicle components via a knowledge graph. By establishing connections between keypoints, our approach leverages this prior to match vehicles even when some parts are not directly comparable due to orientation inconsistencies. Specifically, given query and gallery images, we first detect visible keypoints. Then, a transformer-based model infers features for non-overlapping keypoints by conditioning on visible correspondences defined in the knowledge graph. The final representation integrates visible and inferred features. Extensive experiments demonstrate our method outperforms state-of-the-art methods on standard benchmarks under cross-view matching scenarios. To our knowledge, this is the first work introducing structural priors via keypoint knowledge graphs for view-invariant vehicle re-identification.

AAAI Conference 2025 Conference Paper

Medical Manifestation-Aware De-Identification

  • Yuan Tian
  • Shuo Wang
  • Guangtao Zhai

Face de-identification (DeID) has been widely studied for common scenes, but remains under-researched for medical scenes, mostly due to the lack of large-scale patient face datasets. In this paper, we release MeMa, consisting of over 40,000 photo-realistic patient faces. MeMa is re-generated from massive real patient photos. By carefully modulating the generation and data-filtering procedures, MeMa avoids breaching real patient privacy, while ensuring rich and plausible medical manifestations. We recruit expert clinicians to annotate MeMa with both coarse- and fine-grained labels, building the first medical-scene DeID benchmark. Additionally, we propose a baseline approach for this new medical-aware DeID task, by integrating data-driven medical semantic priors into the DeID procedure. Despite its conciseness and simplicity, our approach substantially outperforms previous ones.

AAAI Conference 2025 Conference Paper

MoRe: Class Patch Attention Needs Regularization for Weakly Supervised Semantic Segmentation

  • Zhiwei Yang
  • Yucong Meng
  • Kexue Fu
  • Shuo Wang
  • Zhijian Song

Weakly Supervised Semantic Segmentation (WSSS) with image-level labels typically uses Class Activation Maps (CAM) to achieve dense predictions. Recently, Vision Transformer (ViT) has provided an alternative to generate localization maps from class-patch attention. However, due to insufficient constraints on modeling such attention, we observe that the Localization Attention Maps (LAM) often struggle with the artifact issue, i.e., patch regions with minimal semantic relevance are falsely activated by class tokens. In this work, we propose MoRe to address this issue and further explore the potential of LAM. Our findings suggest that imposing additional regularization on class-patch attention is necessary. To this end, we first view the attention as a novel directed graph and propose the Graph Category Representation module to implicitly regularize the interaction among class-patch entities. It ensures that class tokens dynamically condense the related patch information and suppress unrelated artifacts at a graph level. Second, motivated by the observation that CAM from classification weights maintains smooth localization of objects, we devise the Localization-informed Regularization module to explicitly regularize the class-patch attention. It directly mines the token relations from CAM and further supervises the consistency between class and patch tokens in a learnable manner. Extensive experiments on PASCAL VOC and MS COCO validate that MoRe effectively addresses the artifact issue and achieves state-of-the-art performance, surpassing recent single-stage and even multi-stage methods.

ICML Conference 2025 Conference Paper

Multi-Domain Graph Foundation Models: Robust Knowledge Transfer via Topology Alignment

  • Shuo Wang
  • Bokui Wang
  • Zhixiang Shen
  • Boyan Deng
  • Zhao Kang 0001

Recent advances in CV and NLP have inspired researchers to develop general-purpose graph foundation models through pre-training across diverse domains. However, a fundamental challenge arises from the substantial differences in graph topologies across domains. Additionally, real-world graphs are often sparse and prone to noisy connections and adversarial attacks. To address these issues, we propose the Multi-Domain Graph Foundation Model (MDGFM), a unified framework that aligns and leverages cross-domain topological information to facilitate robust knowledge transfer. MDGFM bridges different domains by adaptively balancing features and topology while refining original graphs to eliminate noise and align topological structures. To further enhance knowledge transfer, we introduce an efficient prompt-tuning approach. By aligning topologies, MDGFM not only improves multi-domain pre-training but also enables robust knowledge transfer to unseen domains. Theoretical analyses provide guarantees of MDGFM’s effectiveness and domain generalization capabilities. Extensive experiments on both homophilic and heterophilic graph datasets validate the robustness and efficacy of our method.

FOCS Conference 2025 Conference Paper

Multi-Pass Streaming Lower Bounds for Approximating Max-Cut

  • Yumou Fei
  • Dor Minzer
  • Shuo Wang

In the Max-Cut problem in the streaming model, an algorithm is given the edges of an unknown graph $G=(V, E)$ in some fixed order, and its goal is to approximate the size of the largest cut in G. Improving upon an earlier result of Kapralov, Khanna and Sudan, it was shown by Kapralov and Krachun that for all $\varepsilon\gt 0$, no $o(n)$ memory streaming algorithm can achieve a $(1/2+\varepsilon)$-approximation for Max-Cut. Their result holds for single-pass streams, i.e., the setting in which the algorithm only views the stream once, and it was open whether multi-pass access may help. The state-of-the-art result along these lines, due to Assadi and Vishvajeet N, rules out arbitrarily good approximation algorithms with constantly many passes and $n^{1-\delta}$ space for any $\delta\gt 0$. We improve upon this state-of-the-art result, showing that any non-trivial approximation algorithm for Max-Cut requires either polynomially many passes or polynomially large space. More specifically, we show that for all $\varepsilon\gt 0$, a k-pass streaming $(1/2+\varepsilon)$-approximation algorithm for Max-Cut requires $\Omega_{\varepsilon}\left(n^{1/3}/k\right)$ space. This result leads to a similar lower bound for the Maximum Directed Cut problem, showing the near optimality of the algorithm of [Saxena, Singer, Sudan, Velusamy, SODA 2025]. Our lower bounds proceed by showing a communication complexity lower bound for the Distributional Implicit Hidden Partition (DIHP) Problem, introduced by Kapralov and Krachun. While a naive application of the discrepancy method fails, we identify a property of protocols called “globalness”, and show that (1) any protocol for DIHP can be turned into a global protocol, and (2) the discrepancy of a global protocol must be small. The second step is the more technically involved part of the argument, and therein we use global hypercontractive inequalities, and more specifically strong quantitative versions of the level-$d$ inequality for global functions.
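
Spelling out how the stated space bound yields the "polynomially many passes or polynomially large space" claim:

```latex
% A k-pass (1/2+\varepsilon)-approximation needs s = \Omega_\varepsilon(n^{1/3}/k)
% bits of space, i.e. s \cdot k = \Omega_\varepsilon(n^{1/3}). Hence for any
% fixed \delta < 1/3:
s \le n^{\delta}
\quad\Longrightarrow\quad
k = \Omega_{\varepsilon}\!\left(n^{1/3-\delta}\right),
% so any algorithm with sub-polynomial space must make polynomially many passes.
```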

ECAI Conference 2025 Conference Paper

Optimizing Semantic Consistency Modeling: Task-Specific Tensor Fusion, Multi-Task Multi-Scale Joint Training, and Uncertainty-Aware Distillation

  • Zeyu Wei
  • Shuo Wang
  • Xuemin Liu
  • Xiaohui Rong

Semantic consistency evaluation faces two critical challenges: Bi-Encoder models, while efficient, struggle to capture fine-grained interactions between sentence pairs, limiting accuracy; meanwhile, Cross-Encoders and large language models (LLMs), despite superior performance, incur substantial computational costs, hindering practical deployment. This paper proposes a Multi-dimensional Consistency Evaluation Model (MCEM), designed to balance performance and efficiency, enabling precise modeling of sentence pairs and efficient inference across multiple consistency dimensions, including semantic, emotional, and logical aspects. The core innovation of MCEM lies in its integration of gated memory networks with a Multi-Task Mixture-of-Experts (MT-MoE) architecture, which enables fine-grained decomposition and dynamic recombination of tensors to disentangle shared and task-specific features. Furthermore, a multi-granular feature extraction module enhances the model’s ability to capture semantic information at the character, phrase, and sentence levels. Additionally, an uncertainty-aware knowledge distillation mechanism effectively transfers high-confidence knowledge from the Cross-Encoder branch to the lightweight path, significantly improving inference efficiency while maintaining high performance. Experimental results demonstrate that MCEM achieves state-of-the-art (SOTA) performance across multiple consistency benchmarks and exhibits strong generalization capability under both adversarial perturbations and previously unseen data structures.

JBHI Journal 2025 Journal Article

Personalized Lumbar Vertebrae Modeling for Dynamic Assessment of Idiopathic Scoliosis

  • Chengyin Wang
  • Jianfeng Li
  • Shuo Wang
  • Yuxuan Wang
  • Jianguo Zhang
  • Mingjie Dong
  • Bin Fang
  • Qianyu Zhuang

Clinical assessment of idiopathic scoliosis (IS) patients primarily relies on static imaging techniques. A dynamic digital human (DDH) can provide comprehensive spatio-temporal information for dynamic assessment of the deformed spine in IS patients, compared with static imaging techniques such as X-ray for general assessment and computed tomography (CT) for surgical planning. The lumbar vertebrae exhibit greater morphological variability than the thoracic region when subjected to different postures and mechanical loads, making them particularly important for dynamic assessment. Therefore, a personalized lumbar vertebrae model (PLVM) is proposed in this work to simulate lumbar vertebrae motion for IS patients; furthermore, an individualized DDH (i-DDH) is proposed by embedding the PLVM into the DDH to capture the spatio-temporal information. First, we use a bone primitive generation method to construct the DDH by incorporating Neural Radiance Fields (NeRF) and three-dimensional (3D) Gaussian splatting methods. Next, we develop the PLVM generation method to simulate lumbar vertebrae motion under different loads and postures. Finally, the bone primitives and PLVM are merged to generate the i-DDH for dynamic assessment. We validated the i-DDH using multi-posture radiographs from eight IS patients awaiting surgery. The results demonstrate high accuracy compared to state-of-the-art (SOTA) models, with a mean angular error of $0.96^\circ$ and a maximum error of $3.6^\circ$ relative to radiographs. The proposed i-DDH framework is able to capture spinal posture and conduct dynamic assessment of IS patients rather than being limited to fixed positions. It overcomes the soft tissue artifact (STA) problem of motion capture systems and the failure of computer-vision methods trained on healthy subjects to generate the 3D spinal deformity of IS patients. It also shows great clinical significance for preoperative planning and clinical assessment by providing dynamic spinal posture that cannot be achieved with static imaging.

NeurIPS Conference 2025 Conference Paper

QiMeng-MuPa: Mutual-Supervised Learning for Sequential-to-Parallel Code Translation

  • Changxin Ke
  • Rui Zhang
  • Shuo Wang
  • Li Ding
  • Guangli Li
  • Yuanbo Wen
  • Shuoming Zhang
  • Ruiyuan Xu

The rise of GPU-based high-performance computing (HPC) has driven the widespread adoption of parallel programming models such as CUDA. Yet, the inherent complexity of parallel programming creates a demand for automated sequential-to-parallel approaches. However, data scarcity poses a significant challenge for machine learning-based sequential-to-parallel code translation. Although recent back-translation methods show promise, they still fail to ensure functional equivalence in the translated code. In this paper, we propose QiMeng-MuPa, a novel Mutual-Supervised Learning framework for Sequential-to-Parallel code translation, to address the functional equivalence issue. QiMeng-MuPa consists of two models, a Translator and a Tester. Through an iterative loop consisting of Co-verify and Co-evolve steps, the Translator and the Tester mutually generate data for each other and improve collectively. The Tester generates unit tests to verify and filter functionally equivalent translated code, thereby evolving the Translator, while the Translator generates translated code as augmented input to evolve the Tester. Experimental results demonstrate that QiMeng-MuPa significantly enhances the performance of the base models: when applied to Qwen2.5-Coder, it not only improves Pass@1 by up to 28.91% and boosts Tester performance by 68.90%, but also outperforms the previous state-of-the-art method CodeRosetta by 1.56 and 6.92 in BLEU and CodeBLEU scores, while achieving performance comparable to DeepSeek-R1 and GPT-4.1. Our code is available at https://github.com/kcxain/mupa.
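
A schematic sketch of one Co-verify round as described above; all four callables are hypothetical interfaces, not the released API:

```python
def co_verify(translator, tester, seq_codes, run_tests):
    """One Co-verify round: keep (sequential, parallel) pairs only when the
    parallel candidate passes the same generated unit tests as the original."""
    verified = []
    for seq in seq_codes:
        par = translator(seq)       # candidate parallel translation (e.g. CUDA)
        tests = tester(seq)         # unit tests generated for this function
        if run_tests(seq, tests) and run_tests(par, tests):
            verified.append((seq, par, tests))   # functionally-equivalent pair
    return verified                 # feeds the Co-evolve fine-tuning step
```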

NeurIPS Conference 2025 Conference Paper

RoMA: Scaling up Mamba-based Foundation Models for Remote Sensing

  • Fengxiang Wang
  • Yulin Wang
  • Mingshuo Chen
  • Haotian Wang
  • Hongzhen Wang
  • Haiyan Zhao
  • Yangang Sun
  • Shuo Wang

Recent advances in self-supervised learning for Vision Transformers (ViTs) have fueled breakthroughs in remote sensing (RS) foundation models. However, the quadratic complexity of self-attention poses a significant barrier to scalability, particularly for large models and high-resolution images. While the linear-complexity Mamba architecture offers a promising alternative, existing RS applications of Mamba remain limited to supervised tasks on small, domain-specific datasets. To address these challenges, we propose RoMA, a framework that enables scalable self-supervised pretraining of Mamba-based RS foundation models using large-scale, diverse, unlabeled data. RoMA enhances scalability for high-resolution images through a tailored auto-regressive learning strategy, incorporating two key innovations: 1) a rotation-aware pretraining mechanism combining adaptive cropping with angular embeddings to handle sparsely distributed objects with arbitrary orientations, and 2) multi-scale token prediction objectives that address the extreme variations in object scales inherent to RS imagery. Systematic empirical studies validate that Mamba adheres to RS data and parameter scaling laws, with performance scaling reliably as model and data size increase. Furthermore, experiments across scene classification, object detection, and semantic segmentation tasks demonstrate that RoMA-pretrained Mamba models consistently outperform ViT-based counterparts in both accuracy and computational efficiency. The source code and pretrained models have been released at https://github.com/MiliLab/RoMA.

NeurIPS Conference 2025 Conference Paper

WaveAR: Wavelet-Aware Continuous Autoregressive Diffusion for Accurate Human Motion Prediction

  • Shengchuan Gao
  • Shuo Wang
  • Yabiao Wang
  • Ran Yi

This work tackles a challenging problem: stochastic human motion prediction (SHMP), which aims to forecast diverse and physically plausible future pose sequences based on a short history of observed motion. While autoregressive sequence models have excelled in related generation tasks, their reliance on vector-quantized tokenization limits motion fidelity and training stability. To overcome these drawbacks, we introduce WaveAR, a novel AR-based framework which, to the best of our knowledge, is the first successful application of a continuous autoregressive generation paradigm to SHMP. WaveAR consists of two stages. In the first stage, a lightweight Spatio-Temporal VAE (ST-VAE) compresses the raw 3D-joint sequence into a downsampled latent token stream, providing a compact yet expressive foundation. In the second stage, we apply masked autoregressive prediction directly in this continuous latent space, conditioning on both unmasked latents and multi-scale spectral cues extracted via a 2D discrete wavelet transform. A fusion module consisting of alternating cross-attention and self-attention layers adaptively fuses temporal context with low- and high-frequency wavelet subbands, and a small MLP-based diffusion head predicts per-token noise residuals under a denoising loss. By avoiding vector quantization and integrating localized frequency information, WaveAR preserves fine-grained motion details while maintaining fast inference speed. Extensive experiments on standard benchmarks demonstrate that our approach delivers more accurate and computationally efficient predictions than prior state-of-the-art methods.
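
The spectral-cue extraction is straightforward to sketch with PyWavelets; the 'haar' wavelet, single decomposition level, and latent shape are illustrative choices:

```python
import numpy as np
import pywt

def wavelet_subbands(latents: np.ndarray):
    """One-level 2D DWT over the (time, latent-dim) grid: a low-frequency
    approximation plus three high-frequency detail subbands."""
    ll, (lh, hl, hh) = pywt.dwt2(latents, "haar")
    return ll, lh, hl, hh

latents = np.random.randn(32, 64)           # (downsampled time, latent dim)
ll, lh, hl, hh = wavelet_subbands(latents)  # each (16, 32); ll gives coarse
# motion structure, while (lh, hl, hh) carry the fine high-frequency dynamics
# that the fusion module attends to.
```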

IJCAI Conference 2025 Conference Paper

Where Does This Data Come From? Enhanced Source Inference Attacks in Federated Learning

  • Haiyang Chen
  • Xiaolong Xu
  • Xiang Zhu
  • Xiaokang Zhou
  • Fei Dai
  • Yansong Gao
  • Xiao Chen
  • Shuo Wang

Federated learning (FL) enables collaborative model training without exposing raw data, offering a privacy-aware alternative to centralized learning. However, FL remains vulnerable to various privacy attacks that exploit shared model updates, including membership inference, property inference, and gradient inversion. Source inference attacks further threaten FL by identifying which client contributed a specific training sample, posing severe risks to user and institutional privacy. Existing source inference attacks mainly assume passive adversaries and overlook more realistic scenarios where the server actively manipulates the training process. In this paper, we present an enhanced source inference attack that demonstrates how a malicious server can amplify behavioral differences between clients to more accurately infer data origin. Our approach introduces active training manipulation and data augmentation to expose client-specific patterns. Experimental results across five representative FL algorithms and multiple datasets show that our method significantly outperforms prior passive attacks. These findings reveal a deeper level of privacy vulnerability in FL and call for stronger defense mechanisms under active threat models.
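
For context, a sketch of the classic passive baseline this attack strengthens: score the target sample under each client's local update and name the lowest-loss client as the source. The paper's active manipulation and augmentation steps are omitted, and the interfaces are hypothetical:

```python
import torch
import torch.nn.functional as F

def infer_source(sample, label, client_models):
    """Name the client whose local update gives the target sample the lowest
    loss; training members typically incur lower loss than non-members."""
    with torch.no_grad():
        losses = [F.cross_entropy(m(sample.unsqueeze(0)), label.unsqueeze(0)).item()
                  for m in client_models]
    return min(range(len(losses)), key=losses.__getitem__)  # predicted source
```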

STOC Conference 2024 Conference Paper

A New Information Complexity Measure for Multi-pass Streaming with Applications

  • Mark Braverman
  • Sumegha Garg
  • Qian Li 0012
  • Shuo Wang
  • David P. Woodruff
  • Jiapeng Zhang

We introduce a new notion of information complexity for multi-pass streaming problems and use it to resolve several important questions in data streams. In the coin problem, one sees a stream of n i.i.d. uniformly random bits and one would like to compute the majority with constant advantage. We show that any constant-pass algorithm must use Ω(log n) bits of memory, significantly extending an earlier Ω(log n) bit lower bound for single-pass algorithms of Braverman-Garg-Woodruff (FOCS, 2020). This also gives the first Ω(log n) bit lower bound for the problem of approximating a counter up to a constant factor in worst-case turnstile streams for more than one pass. In the needle problem, one either sees a stream of n i.i.d. uniform samples from a domain [t], or there is a randomly chosen needle α ∈ [t] for which each item independently is chosen to equal α with probability p, and is otherwise uniformly random in [t]. The problem of distinguishing these two cases is central to understanding the space complexity of the frequency moment estimation problem in random order streams. We show tight multi-pass space bounds for this problem for every p < 1/√(n log³ n), resolving an open question of Lovett and Zhang (FOCS, 2023); even for 1-pass our bounds are new. To show optimality, we improve both lower and upper bounds from existing results. Our information complexity framework significantly extends the toolkit for proving multi-pass streaming lower bounds, and we give a wide range of additional streaming applications of our lower bound techniques, including multi-pass lower bounds for ℓ_p-norm estimation, ℓ_p-point query and heavy hitters, and compressed sensing problems.

EAAI Journal 2024 Journal Article

An approach to ship target detection based on combined optimization model of dehazing and detection

  • Tao Liu
  • Zhao Zhang
  • Zhengling Lei
  • Yuchi Huo
  • Shuo Wang
  • Jiansen Zhao
  • Jinfeng Zhang
  • Xin Jin

The design of a ship detection model that can adapt to both foggy and clear images faces significant challenges. Existing methods are either not accurate enough or have a large number of model parameters, making them difficult to deploy on lightweight front-ends. To address these issues, a lightweight deep learning model based on combined optimization of dehazing and detection is proposed, focusing on self-adaptive ship detection. Firstly, a self-adaptive image dehazing module is designed and placed ahead of the detection network, including a dehazing parameter predictor and an improved dehazing method. Subsequently, a lightweight-improved object detection deep learning model integrated with the dehazing module is devised to detect ships in foggy images. Experimental results demonstrate the effectiveness of this approach in enabling efficient and accurate ship detection under foggy conditions. Through the joint optimization of the dehazing module and the detection module, the experiments show that our Dehazing + Detection model has the highest detection accuracy and performs well in terms of detection speed, parameter count, and weight file size. The detection accuracy reaches 97.1%, which is better than that of the other three dehazing + detection models.

NeurIPS Conference 2024 Conference Paper

Beyond Redundancy: Information-aware Unsupervised Multiplex Graph Structure Learning

  • Zhixiang Shen
  • Shuo Wang
  • Zhao Kang

Unsupervised Multiplex Graph Learning (UMGL) aims to learn node representations on various edge types without manual labeling. However, existing research overlooks a key factor: the reliability of the graph structure. Real-world data often exhibit a complex nature and contain abundant task-irrelevant noise, severely compromising UMGL's performance. Moreover, existing methods primarily rely on contrastive learning to maximize mutual information across different graphs, limiting them to redundancy-dominated multiplex scenarios and failing to capture view-unique task-relevant information. In this paper, we focus on a more realistic and challenging task: to unsupervisedly learn a fused graph from multiple graphs that preserves sufficient task-relevant information while removing task-irrelevant noise. Specifically, our proposed Information-aware Unsupervised Multiplex Graph Fusion framework (InfoMGF) uses graph structure refinement to eliminate irrelevant noise and simultaneously maximizes view-shared and view-unique task-relevant information, thereby tackling the frontier of non-redundant multiplex graphs. Theoretical analyses further guarantee the effectiveness of InfoMGF. Comprehensive experiments against various baselines on different downstream tasks demonstrate its superior performance and robustness. Surprisingly, our unsupervised method even beats sophisticated supervised approaches. The source code and datasets are available at https://github.com/zxlearningdeep/InfoMGF.

AAAI Conference 2024 Conference Paper

Boosting Few-Shot Learning via Attentive Feature Regularization

  • Xingyu Zhu
  • Shuo Wang
  • Jinda Lu
  • Yanbin Hao
  • Haifeng Liu
  • Xiangnan He

Few-shot learning (FSL) based on manifold regularization aims to improve the recognition capacity of novel objects with limited training samples by mixing two samples from different categories with a blending factor. However, this mixing operation weakens the feature representation due to the linear interpolation and the overlooking of the importance of specific channels. To solve these issues, this paper proposes attentive feature regularization (AFR) which aims to improve the feature representativeness and discriminability. In our approach, we first calculate the relations between different categories of semantic labels to pick out the related features used for regularization. Then, we design two attention-based calculations at both the instance and channel levels. These calculations enable the regularization procedure to focus on two crucial aspects: the feature complementarity through adaptive interpolation in related categories and the emphasis on specific feature channels. Finally, we combine these regularization strategies to significantly improve the classifier performance. Empirical studies on several popular FSL benchmarks demonstrate the effectiveness of AFR, which improves the recognition accuracy of novel categories without the need to retrain any feature extractor, especially in the 1-shot setting. Furthermore, the proposed AFR can seamlessly integrate into other FSL methods to improve classification performance.
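
The two attention-based calculations can be pictured with a toy version of attention-weighted interpolation: instance-level weights pick how much to borrow from related-category features, and a channel score decides which channels keep the original feature. Everything below (the weighting and channel rules) is our simplification for illustration, not the authors' AFR.

```python
import numpy as np

def afr_regularize(feat, related_feats, sim):
    """Toy attentive feature regularization.

    feat:          (d,) feature of a novel-class sample
    related_feats: (k, d) features from semantically related categories
    sim:           (k,) semantic similarity of each related category
    """
    w = np.exp(sim) / np.exp(sim).sum()                  # instance-level attention
    mix = w @ related_feats                              # adaptive interpolation
    chan = np.abs(feat) / (np.abs(feat).max() + 1e-12)   # channel-level emphasis
    return chan * feat + (1 - chan) * mix                # regularized feature

f = np.random.randn(64)
out = afr_regularize(f, np.random.randn(3, 64), np.array([0.9, 0.5, 0.1]))
print(out.shape)  # (64,)
```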

NeurIPS Conference 2024 Conference Paper

Delta-CoMe: Training-Free Delta-Compression with Mixed-Precision for Large Language Models

  • Bowen Ping
  • Shuo Wang
  • Hanqing Wang
  • Xu Han
  • Yuzhuang Xu
  • Yukun Yan
  • Yun Chen
  • Baobao Chang

Fine-tuning is a crucial process for adapting large language models (LLMs) to diverse applications. In certain scenarios, such as multi-tenant serving, deploying multiple LLMs becomes necessary to meet complex demands. Recent studies suggest decomposing a fine-tuned LLM into a base model and corresponding delta weights, which are then compressed using low-rank or low-bit approaches to reduce costs. In this work, we observe that existing low-rank and low-bit compression methods can significantly harm the model performance for task-specific fine-tuned LLMs (e.g., WizardMath for math problems). Motivated by the long-tail distribution of singular values in the delta weights, we propose a delta quantization approach using mixed-precision. This method employs higher-bit representation for singular vectors corresponding to larger singular values. We evaluate our approach on various fine-tuned LLMs, including math LLMs, code LLMs, chat LLMs, and even VLMs. Experimental results demonstrate that our approach performs comparably to full fine-tuned LLMs, surpassing both low-rank and low-bit baselines by a considerable margin. Additionally, we show that our method is compatible with various backbone LLMs, such as Llama-2, Llama-3, and Mistral, highlighting its generalizability.
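
The idea of spending more bits on singular directions with larger singular values can be sketched as follows; the split sizes, the uniform fake-quantization, and the function names are our illustrative assumptions, not the paper's exact scheme.

```python
import numpy as np

def mixed_precision_delta(w_base, w_ft, k_hi=8, k_lo=32, lo_bits=3):
    """Sketch of mixed-precision delta compression: keep the top-k_hi
    singular directions of the delta in full precision, coarsely
    quantize the next k_lo, and drop the rest."""
    delta = w_ft - w_base
    U, s, Vt = np.linalg.svd(delta, full_matrices=False)

    def fake_quant(x, bits):
        # uniform symmetric quantization to simulate low-bit storage
        scale = np.abs(x).max() / (2 ** (bits - 1) - 1) + 1e-12
        return np.round(x / scale) * scale

    hi = U[:, :k_hi] * s[:k_hi] @ Vt[:k_hi]              # full precision
    lo_U = fake_quant(U[:, k_hi:k_hi + k_lo], lo_bits)   # low-bit tail
    lo_V = fake_quant(Vt[k_hi:k_hi + k_lo], lo_bits)
    lo = lo_U * s[k_hi:k_hi + k_lo] @ lo_V
    return w_base + hi + lo

w0 = np.random.randn(64, 64)
w1 = w0 + 0.1 * np.random.randn(64, 64)   # toy "fine-tuned" weights
w_hat = mixed_precision_delta(w0, w1)
print(np.linalg.norm(w1 - w_hat) / np.linalg.norm(w1 - w0))  # residual ratio
```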

NeurIPS Conference 2024 Conference Paper

Enhancing Zero-Shot Vision Models by Label-Free Prompt Distribution Learning and Bias Correcting

  • Xingyu Zhu
  • Beier Zhu
  • Yi Tan
  • Shuo Wang
  • Yanbin Hao
  • Hanwang Zhang

Vision-language models, such as CLIP, have shown impressive generalization capacities when using appropriate text descriptions. While optimizing prompts on downstream labeled data has proven effective in improving performance, these methods entail labor costs for annotations and are limited by their quality. Additionally, since CLIP is pre-trained on highly imbalanced Web-scale data, it suffers from inherent label bias that leads to suboptimal performance. To tackle the above challenges, we propose a label-Free prompt distribution learning and bias correction framework, dubbed Frolic, which boosts zero-shot performance without the need for labeled data. Specifically, our Frolic learns distributions over prompt prototypes to capture diverse visual representations and adaptively fuses these with the original CLIP through confidence matching. This fused model is further enhanced by correcting label bias via a label-free logit adjustment. Notably, our method is not only training-free but also circumvents the necessity for hyper-parameter tuning. Extensive experimental results across 16 datasets demonstrate the efficacy of our approach, particularly outperforming the state-of-the-art by an average of 2.6% on 10 datasets with CLIP ViT-B/16 and achieving an average margin of 1.5% on ImageNet and its five distribution shifts with CLIP ViT-B/16. Code is available at https://github.com/zhuhsingyuu/Frolic.
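
The label-free bias correction can be illustrated in isolation: estimate the model's class prior from its own average predictions on unlabeled data and subtract it in log space. This is our minimal reading of logit adjustment under that assumption, not Frolic's exact procedure.

```python
import numpy as np

def softmax(z, axis=-1):
    z = z - z.max(axis=axis, keepdims=True)
    e = np.exp(z)
    return e / e.sum(axis=axis, keepdims=True)

def label_free_adjust(logits_unlabeled):
    """Estimate the model's label bias from its own average predictions
    on unlabeled data, then remove it in log space."""
    prior = softmax(logits_unlabeled).mean(axis=0)   # estimated class prior
    return logits_unlabeled - np.log(prior + 1e-12)  # debiased logits

# toy logits with a built-in bias toward class 0
logits = np.random.randn(1000, 10) + np.log([0.3] + [0.7 / 9] * 9)
balanced = label_free_adjust(logits)
print(softmax(balanced).mean(axis=0))  # roughly uniform after correction
```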

NeurIPS Conference 2024 Conference Paper

FAST: A Dual-tier Few-Shot Learning Paradigm for Whole Slide Image Classification

  • Kexue Fu
  • Xiaoyuan Luo
  • Linhao Qu
  • Shuo Wang
  • Ying Xiong
  • Ilias Maglogiannis
  • Longxiang Gao
  • Manning Wang

The expensive fine-grained annotation and data scarcity have become the primary obstacles for the widespread adoption of deep learning-based Whole Slide Image (WSI) classification algorithms in clinical practice. Unlike few-shot learning methods in natural images that can leverage the labels of each image, existing few-shot WSI classification methods only utilize a small number of fine-grained labels or weakly supervised slide labels for training in order to avoid expensive fine-grained annotation. They lack sufficient mining of the available WSIs, severely limiting WSI classification performance. To address the above issues, we propose a novel and efficient dual-tier few-shot learning paradigm for WSI classification, named FAST. FAST consists of a dual-level annotation strategy and a dual-branch classification framework. Firstly, to avoid expensive fine-grained annotation, we collect a very small number of WSIs at the slide level and annotate an extremely small number of patches. Then, to fully mine the available WSIs, we use all the patches and available patch labels to build a cache branch, which uses the labeled patches to infer the labels of unlabeled patches and performs patch classification through knowledge retrieval. In addition to the cache branch, we also construct a prior branch that includes learnable prompt vectors, using the text encoder of vision-language models for patch classification. Finally, we integrate the results from both branches to achieve WSI classification. Extensive experiments on binary and multi-class datasets demonstrate that our proposed method significantly surpasses existing few-shot classification methods and approaches the accuracy of fully supervised methods with only 0.22% of the annotation cost. All codes and models will be publicly available at https://github.com/fukexue/FAST.
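
The cache branch is retrieval-style classification over labeled patch features, in the spirit of training-free cache models such as Tip-Adapter; the similarity kernel and hyper-parameter below are our assumptions, not the paper's exact design.

```python
import numpy as np

def cache_branch_predict(query_feats, cache_feats, cache_labels, n_cls, beta=5.0):
    """Classify patches by similarity-weighted retrieval of labeled patches."""
    q = query_feats / np.linalg.norm(query_feats, axis=1, keepdims=True)
    c = cache_feats / np.linalg.norm(cache_feats, axis=1, keepdims=True)
    affinity = np.exp(-beta * (1.0 - q @ c.T))   # similarity kernel (queries x cache)
    onehot = np.eye(n_cls)[cache_labels]         # cached patch labels
    return affinity @ onehot                     # per-class retrieval scores

scores = cache_branch_predict(np.random.randn(4, 128),   # query patch features
                              np.random.randn(50, 128),  # cached labeled patches
                              np.random.randint(0, 2, 50),
                              n_cls=2)
print(scores.argmax(axis=1))  # predicted patch classes
```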

IJCAI Conference 2024 Conference Paper

How to Learn Domain-Invariant Representations for Visual Reinforcement Learning: An Information-Theoretical Perspective

  • Shuo Wang
  • Zhihao Wu
  • Jinwen Wang
  • Xiaobo Hu
  • Youfang Lin
  • Kai Lv

Despite the impressive success in visual control challenges, Visual Reinforcement Learning (VRL) policies have struggled to generalize to other scenarios. Existing works attempt to improve the generalization capability empirically, lacking theoretical support. In this work, we explore how to learn domain-invariant representations for VRL from an information-theoretical perspective. Specifically, we identify three Mutual Information (MI) terms. These terms highlight that a robust representation should preserve domain-invariant information (return and dynamic transition) under significant observation perturbation. Furthermore, we relax the MI terms to derive three components for implementing a practical Mutual Information-based Invariant Representation (MIIR) algorithm for VRL. Extensive experiments demonstrate that MIIR achieves state-of-the-art generalization performance and the best sample efficiency in the DeepMind Control Suite, Robotic Manipulation, and CARLA.

NeurIPS Conference 2024 Conference Paper

OneBit: Towards Extremely Low-bit Large Language Models

  • Yuzhuang Xu
  • Xu Han
  • Zonghan Yang
  • Shuo Wang
  • Qingfu Zhu
  • Zhiyuan Liu
  • Weidong Liu
  • Wanxiang Che

Model quantization uses low bit-width values to represent the weight matrices of existing models to be quantized, which is a promising approach to reduce both the storage and computational overheads of deploying highly anticipated LLMs. However, current quantization methods suffer severe performance degradation when the bit-width is extremely reduced, and thus focus on utilizing 4-bit or 8-bit values to quantize models. This paper boldly quantizes the weight matrices of LLMs to 1 bit, paving the way for extremely low bit-width deployment of LLMs. For this target, we introduce a 1-bit model compression framework named OneBit, including a novel 1-bit parameter representation method to better quantize LLMs, as well as an effective parameter initialization method based on matrix decomposition to improve the convergence speed of the quantization framework. Extensive experimental results indicate that OneBit achieves good performance (at least 81% of the non-quantized performance on LLaMA models) with robust training processes when only using 1-bit weight matrices.
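
One way to picture the matrix-decomposition initialization is a sign/value-independent decomposition: keep sign(W) as the 1-bit matrix and approximate |W| by a rank-1 factor a bᵀ, so W ≈ sign(W) ⊙ (a bᵀ). The sketch below is our minimal rendering of that idea, not the released implementation.

```python
import numpy as np

def svid(W):
    """Sign/value-independent decomposition sketch:
    W ≈ sign(W) ⊙ (a bᵀ), where a bᵀ is the best rank-1
    approximation of |W| (computed here via SVD)."""
    sign = np.sign(W)                                 # the 1-bit matrix
    U, s, Vt = np.linalg.svd(np.abs(W), full_matrices=False)
    a = U[:, 0] * np.sqrt(s[0])                       # FP value vector (rows)
    b = Vt[0] * np.sqrt(s[0])                         # FP value vector (cols)
    return sign, a, b

W = np.random.randn(128, 256)
sign, a, b = svid(W)
W_hat = sign * np.outer(a, b)
print(np.linalg.norm(W - W_hat) / np.linalg.norm(W))  # relative error
```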

NeurIPS Conference 2024 Conference Paper

PGN: The RNN's New Successor is Effective for Long-Range Time Series Forecasting

  • Yuxin Jia
  • Youfang Lin
  • Jing Yu
  • Shuo Wang
  • Tianhao Liu
  • Huaiyu Wan

Due to the recurrent structure of RNNs, the long information propagation path limits their ability to capture long-term dependencies and leads to gradient explosion/vanishing and inefficient sequential execution. Based on this, we propose a novel paradigm called Parallel Gated Network (PGN) as the new successor to RNN. PGN directly captures information from previous time steps through the designed Historical Information Extraction (HIE) layer and leverages gated mechanisms to select and fuse it with the current time step information. This reduces the information propagation path to O(1), effectively addressing the limitations of RNN. To enhance PGN's performance in long-range time series forecasting tasks, we propose a novel temporal modeling framework called Temporal PGN (TPGN). TPGN incorporates two branches to comprehensively capture the semantic information of time series. One branch utilizes PGN to capture long-term periodic patterns while preserving their local characteristics. The other branch employs patches to capture short-term information and aggregate the global representation of the series. TPGN achieves a theoretical complexity of O(√L), ensuring efficiency in its operations. Experimental results on five benchmark datasets demonstrate the state-of-the-art (SOTA) performance and high efficiency of TPGN, further confirming the effectiveness of PGN as the new successor to RNN in long-range time series forecasting. The code is available at: https://github.com/Water2sea/TPGN.
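
The O(1) propagation path can be illustrated with a toy HIE layer: every step aggregates all previous steps directly (here, a causal mean computed in parallel rather than by recurrence), and a gate fuses that history with the current input. This is our simplification of the PGN idea, not the released code.

```python
import numpy as np

def sigmoid(x):
    return 1.0 / (1.0 + np.exp(-x))

def pgn_layer(x, W_h, W_g, b_g):
    """Toy PGN layer. x: (T, d) sequence.

    HIE: each step sees a direct aggregate of all previous steps
    (causal mean via cumsum), so no step-by-step recurrence is needed.
    """
    T, _ = x.shape
    csum = np.cumsum(x, axis=0)
    hist = np.zeros_like(x)
    hist[1:] = csum[:-1] / np.arange(1, T)[:, None]   # mean of strict past
    h = np.tanh(hist @ W_h)                           # historical features
    g = sigmoid(np.concatenate([x, h], axis=1) @ W_g + b_g)
    return g * h + (1 - g) * x                        # gated fusion, O(1) path

d, T = 8, 16
out = pgn_layer(np.random.randn(T, d), np.random.randn(d, d),
                np.random.randn(2 * d, d), np.zeros(d))
print(out.shape)  # (16, 8)
```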

IJCAI Conference 2024 Conference Paper

Shadow-Free Membership Inference Attacks: Recommender Systems Are More Vulnerable Than You Thought

  • Xiaoxiao Chi
  • Xuyun Zhang
  • Yan Wang
  • Lianyong Qi
  • Amin Beheshti
  • Xiaolong Xu
  • Kim-Kwang Raymond Choo
  • Shuo Wang

Recommender systems have been successfully applied in many applications. Nonetheless, recent studies demonstrate that recommender systems are vulnerable to membership inference attacks (MIAs), leading to the leakage of users' membership privacy. However, existing MIAs relying on shadow training suffer a large performance drop when the attacker lacks knowledge of the training data distribution and the model architecture of the target recommender system. To better understand the privacy risks of recommender systems, we propose shadow-free MIAs that directly leverage a user's recommendations for membership inference. Without shadow training, the proposed attack can conduct MIAs efficiently and effectively under a practical scenario where the attacker is given only black-box access to the target recommender system. The proposed attack leverages the intuition that a recommender system personalizes a user's recommendations if that user's historical interactions were used in training. Thus, an attacker can infer membership by determining whether the recommendations are more similar to the user's interactions or to generally popular items. We conduct extensive experiments on benchmark datasets across various recommender systems. Remarkably, our attack achieves far better attack accuracy with low false positive rates than baselines, at a much lower computational cost.
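
The attack's core test can be sketched in a few lines: compare how similar the recommendations are to the user's own interactions versus to generally popular items. The toy item embeddings and centroid-cosine test below are our assumptions for illustration.

```python
import numpy as np

def infer_membership(rec_vecs, interact_vecs, popular_vecs):
    """Predict membership: True if the recommendations look more like the
    user's own interactions than like generally popular items."""
    def centroid_sim(a, b):
        ca, cb = a.mean(axis=0), b.mean(axis=0)
        return ca @ cb / (np.linalg.norm(ca) * np.linalg.norm(cb) + 1e-12)
    return centroid_sim(rec_vecs, interact_vecs) > centroid_sim(rec_vecs, popular_vecs)

rng = np.random.default_rng(0)
items = rng.normal(size=(100, 16))          # toy item embeddings
recs = items[:10] + 0.01                    # recommendations ~ user's interactions
print(infer_membership(recs, items[:10], items[50:60]))  # True => member
```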

AAAI Conference 2024 Conference Paper

What Effects the Generalization in Visual Reinforcement Learning: Policy Consistency with Truncated Return Prediction

  • Shuo Wang
  • Zhihao Wu
  • Xiaobo Hu
  • Jinwen Wang
  • Youfang Lin
  • Kai Lv

In visual Reinforcement Learning (RL), the challenge of generalization to new environments is paramount. This study pioneers a theoretical analysis of visual RL generalization, establishing an upper bound on the generalization objective, encompassing policy divergence and Bellman error components. Motivated by this analysis, we propose maintaining the cross-domain consistency for each policy in the policy space, which can reduce the divergence of the learned policy during the test. In practice, we introduce the Truncated Return Prediction (TRP) task, promoting cross-domain policy consistency by predicting truncated returns of historical trajectories. Moreover, we also propose a Transformer-based predictor for this auxiliary task. Extensive experiments on DeepMind Control Suite and Robotic Manipulation tasks demonstrate that TRP achieves state-of-the-art generalization performance. We further demonstrate that TRP outperforms previous methods in terms of sample efficiency during training.

YNIMG Journal 2023 Journal Article

Cognitive and neural bases of visual-context-guided decision-making

  • Sai Sun
  • Hongbo Yu
  • Shuo Wang
  • Rongjun Yu

Humans adjust their behavioral strategies based on feedback, a process that may depend on intrinsic preferences and contextual factors such as visual salience. In this study, we hypothesized that decision-making based on visual salience is influenced by habitual and goal-directed processes, which can be evidenced by changes in attention and subjective valuation systems. To test this hypothesis, we conducted a series of studies to investigate the behavioral and neural mechanisms underlying visual salience-driven decision-making. We first established the baseline behavioral strategy without salience in Experiment 1 (n = 21). We then highlighted the utility or performance dimension of the chosen outcome using colors in Experiment 2 (n = 30). We demonstrated that the difference in staying frequency increased along the salient dimension, confirming a salience effect. Furthermore, the salience effect was abolished when directional information was removed in Experiment 3 (n = 28), suggesting that the salience effect is feedback-specific. To generalize our findings, we replicated the feedback-specific salience effects using eye-tracking and text emphasis. The fixation differences between the chosen and unchosen values were enhanced along the feedback-specific salient dimension in Experiment 4 (n = 48) but unchanged after removing feedback-specific information in Experiment 5 (n = 32). Moreover, the staying frequency was correlated with fixation properties, confirming that salience guides attention deployment. Lastly, our neuroimaging study (Experiment 6, n = 25) showed that the striatum subregions encoded salience-based outcome evaluation, while the vmPFC encoded salience-based behavioral adjustments. The connectivity of the vmPFC-ventral striatum accounted for individual differences in utility-driven, whereas the vmPFC-dmPFC for performance-driven behavioral adjustments. Together, our results provide a neurocognitive account of how task-irrelevant visual salience drives decision-making by involving attention and the frontal-striatal valuation systems.

PUBLIC SIGNIFICANCE STATEMENT: Humans may use the current outcome to make behavior adjustments. How this occurs may depend on stable individual preferences and contextual factors, such as visual salience. Under the hypothesis that visual salience determines attention and subsequently modulates subjective valuation, we investigated the underlying behavioral and neural bases of visual-context-guided outcome evaluation and behavioral adjustments. Our findings suggest that the reward system is orchestrated by visual context and highlight the critical role of attention and the frontal-striatal neural circuit in visual-context-guided decision-making that may involve habitual and goal-directed processes.

AAAI Conference 2023 Conference Paper

High-Resolution Iterative Feedback Network for Camouflaged Object Detection

  • Xiaobin Hu
  • Shuo Wang
  • Xuebin Qin
  • Hang Dai
  • Wenqi Ren
  • Donghao Luo
  • Ying Tai
  • Ling Shao

Spotting camouflaged objects that are visually assimilated into the background is tricky for both object detection algorithms and humans, who are often confused or deceived by the intrinsic similarities between the foreground objects and the background surroundings. To tackle this challenge, we aim to extract high-resolution texture details to avoid the detail degradation that causes blurred vision at edges and boundaries. We introduce a novel HitNet to refine the low-resolution representations with high-resolution features in an iterative feedback manner, essentially a global loop-based connection among the multi-scale resolutions. To design a better feedback feature flow and avoid the feature corruption caused by the recurrent path, an iterative feedback strategy is proposed to impose more constraints on each feedback connection. Extensive experiments on four challenging datasets demonstrate that our HitNet breaks the performance bottleneck and achieves significant improvements compared with 29 state-of-the-art methods. In addition, to address the data scarcity in camouflaged scenarios, we provide an application example that converts salient objects to camouflaged objects, thereby generating more camouflaged training samples from the diverse salient object datasets. Code will be made publicly available.

AAAI Conference 2023 Conference Paper

Memory-Aided Contrastive Consensus Learning for Co-salient Object Detection

  • Peng Zheng
  • Jie Qin
  • Shuo Wang
  • Tian-Zhu Xiang
  • Huan Xiong

Co-salient object detection (CoSOD) aims at detecting common salient objects within a group of relevant source images. Most of the latest works employ the attention mechanism for finding common objects. To achieve accurate CoSOD results with high-quality maps and high efficiency, we propose a novel Memory-aided Contrastive Consensus Learning (MCCL) framework, which is capable of effectively detecting co-salient objects in real time (∼150 fps). To learn better group consensus, we propose the Group Consensus Aggregation Module (GCAM) to abstract the common features of each image group; meanwhile, to make the consensus representation more discriminative, we introduce the Memory-based Contrastive Module (MCM), which saves and updates the consensus of images from different groups in a queue of memories. Finally, to improve the quality and integrity of the predicted maps, we develop an Adversarial Integrity Learning (AIL) strategy to make the segmented regions more likely composed of complete objects with less surrounding noise. Extensive experiments on all the latest CoSOD benchmarks demonstrate that our lite MCCL outperforms 13 cutting-edge models, achieving the new state of the art (∼5.9% and ∼6.2% improvement in S-measure on CoSOD3k and CoSal2015, respectively). Our source codes, saliency maps, and online demos are publicly available at https://github.com/ZhengPeng7/MCCL.

IJCAI Conference 2022 Conference Paper

Boundary-Guided Camouflaged Object Detection

  • Yujia Sun
  • Shuo Wang
  • Chenglizhao Chen
  • Tian-Zhu Xiang

Camouflaged object detection (COD), segmenting objects that are elegantly blended into their surroundings, is a valuable yet challenging task. Existing deep-learning methods often have difficulty accurately identifying the camouflaged object with a complete and fine object structure. To this end, in this paper, we propose a novel boundary-guided network (BGNet) for camouflaged object detection. Our method explores valuable, extra object-related edge semantics to guide representation learning for COD, which forces the model to generate features that highlight object structure, thereby promoting camouflaged object detection with accurate boundary localization. Extensive experiments on three challenging benchmark datasets demonstrate that our BGNet significantly outperforms the existing 18 state-of-the-art methods under four widely used evaluation metrics. Our code is publicly available at: https://github.com/thograce/BGNet.

IJCAI Conference 2022 Conference Paper

Long-term Spatio-Temporal Forecasting via Dynamic Multiple-Graph Attention

  • Wei Shao
  • Zhiling Jin
  • Shuo Wang
  • Yufan Kang
  • Xiao Xiao
  • Hamid Menouar
  • Zhaofeng Zhang
  • Junshan Zhang

Many real-world ubiquitous applications, such as parking recommendations and air pollution monitoring, benefit significantly from accurate long-term spatio-temporal forecasting (LSTF). LSTF makes use of long-term dependency structure between the spatial and temporal domains, as well as the contextual information. Recent studies have revealed the potential of multi-graph neural networks (MGNNs) to improve prediction performance. However, existing MGNN methods do not work well when applied to LSTF due to several issues: the low level of generality, insufficient use of contextual information, and the imbalanced graph fusion approach. To address these issues, we construct new graph models to represent the contextual information of each node and exploit the long-term spatio-temporal data dependency structure. To aggregate the information across multiple graphs, we propose a new dynamic multi-graph fusion module to characterize the correlations of nodes within a graph and the nodes across graphs via the spatial attention and graph attention mechanisms. Furthermore, we introduce a trainable weight tensor to indicate the importance of each node in different graphs. Extensive experiments on two large-scale datasets demonstrate that our proposed approaches significantly improve the performance of existing graph neural network models in LSTF prediction tasks.

YNICL Journal 2022 Journal Article

Predicting prognosis of primary pontine hemorrhage using CT image and deep learning

  • Shuo Wang
  • Feng Chen
  • Mingyu Zhang
  • Xiaolin Zhao
  • Linghua Wen
  • Wenyuan Wu
  • Shina Wu
  • Zhe Li

Prognosis of primary pontine hemorrhage (PPH) is important for treatment planning and patient management. However, only a few clinical factors have been reported to have prognostic value for PPH. Here, we propose a deep learning (DL) model that mines high-dimensional prognostic information from computed tomography (CT) images and combines it with clinical factors to predict the individualized prognosis of PPH. We propose a multi-task DL model that learns high-dimensional CT features of hematoma and perihematomal areas to predict the risk of 30-day mortality, 90-day mortality, and 90-day functional outcome of PPH simultaneously. We further explore the combination of the DL model and clinical factors by building a combined model. All models were trained in a training cohort (n = 219) and tested in an independent testing cohort (n = 35). The DL model achieved areas under the curve (AUC) of 0.886, 0.886, and 0.759 in predicting 30-day mortality, 90-day mortality, and 90-day functional outcome of PPH in the independent testing cohort, improving over the previously reported new PPH score and the clinical model. When combining the DL model and clinical factors, the combined model achieved improved performance (AUC = 0.920, 0.941, and 0.894), indicating that the DL model mines CT information that complements clinical factors. Through a DL visualization technique, we found that the internal structure of the hematoma and its expansion to perihematomal regions are important for predicting the prognosis of PPH. This DL model provides an easy-to-use way to predict the individualized prognosis of PPH by mining high-dimensional information from CT images, and it shows improvement over clinical factors and existing methods.

ICRA Conference 2021 Conference Paper

Towards Adjoint Sensing and Acting Schemes and Interleaving Task Planning for Robust Robot Plan

  • Shuo Yang 0005
  • Xinjun Mao
  • Shuo Wang
  • Huaiyu Xiao
  • Yuanzhou Xue

Robots operating in open environments are expected to have robust plans to achieve tasks successfully under environment uncertainties. However, both the partial observability and the dynamics of environment states significantly decrease the robustness of task achievement, making robot task planning much more challenging. Partially observable states require the robot to obtain observations in order to act optimally toward the task goal, while state dynamics require the robot to continuously observe its surroundings in order to act safely. Both challenges demand purposeful and tight interaction between the robot's state-changing actuating actions and its sensor-based observation actions. This paper proposes a novel model of Adjoint Sensing and Acting (ASA) that explicitly defines two interaction schemes, parallel and sequential, between actuating and observation actions, as well as an extended Behavior Tree for a concrete implementation of the above schemes. We further propose an interleaving task planning approach for producing ASA-style plans, which integrates a deliberative POMDP planner for pursuing task goals and a reactive Behavior Tree executive for fast response to unexpected events. We experimentally demonstrate that the ASA interaction schemes are practical and applicable for modeling and planning open-environment robot tasks. The plans from the interleaving task planning approach are both reactive in run-time response and efficient in task achievement.
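
The two ASA interaction schemes map naturally onto Behavior Tree composites: a Sequence node realizes "observe, then actuate", while a Parallel node keeps observing while acting. The toy tree below is our illustration of that mapping, not the paper's extended Behavior Tree.

```python
from dataclasses import dataclass
from typing import Callable, List

SUCCESS, FAILURE, RUNNING = "success", "failure", "running"

@dataclass
class Action:
    fn: Callable[[], str]
    def tick(self) -> str:
        return self.fn()

@dataclass
class Sequence:  # sequential ASA scheme: observe first, then actuate
    children: List
    def tick(self) -> str:
        for child in self.children:
            status = child.tick()
            if status != SUCCESS:
                return status
        return SUCCESS

@dataclass
class Parallel:  # parallel ASA scheme: keep observing while acting
    children: List
    def tick(self) -> str:
        states = [child.tick() for child in self.children]
        if FAILURE in states:
            return FAILURE
        return RUNNING if RUNNING in states else SUCCESS

world = {"localized": False}
sense = Action(lambda: (world.update(localized=True), SUCCESS)[1])  # observation action
act = Action(lambda: SUCCESS if world["localized"] else FAILURE)    # actuating action

print(Sequence([sense, act]).tick())  # "success": sense enables safe acting
print(Parallel([sense, act]).tick())  # "success": monitoring runs alongside acting
```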

ICML Conference 2020 Conference Paper

Loss Function Search for Face Recognition

  • Xiaobo Wang 0001
  • Shuo Wang
  • Cheng Chi 0003
  • Shifeng Zhang
  • Tao Mei 0001

In face recognition, designing margin-based (e.g., angular, additive, additive angular margin) softmax loss functions plays an important role in learning discriminative features. However, these hand-crafted heuristic methods may be sub-optimal because they require much effort to explore the large design space. Recently, an AutoML loss function search method, AM-LFS, has been derived, which leverages reinforcement learning to search loss functions during the training process. But its search space is complex and unstable, which hinders its effectiveness. In this paper, we first analyze that the key to enhancing feature discrimination is actually how to reduce the softmax probability. We then design a unified formulation for the current margin-based softmax losses. Accordingly, we define a novel search space and develop a reward-guided search method to automatically obtain the best candidate. Experimental results on a variety of face recognition benchmarks have demonstrated the effectiveness of our method over the state-of-the-art alternatives.
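
The unified view the search builds on is that the classic margin-based losses differ only in the transform applied to the target logit, and every variant reduces the target's softmax probability, which is the key observation above. A minimal sketch of that family (ours, with the SphereFace case simplified):

```python
import numpy as np

def margin_softmax_loss(cos_theta, y, kind="arc", s=30.0, m=0.5):
    """Margin-based softmax family: only f(m, theta_y) differs per variant.

    cos_theta: (batch, n_cls) cosine similarities; y: (batch,) labels.
    """
    idx = np.arange(len(y))
    theta_y = np.arccos(np.clip(cos_theta[idx, y], -1.0, 1.0))
    if kind == "arc":        # additive angular margin (ArcFace-style)
        target = np.cos(theta_y + m)
    elif kind == "cos":      # additive margin (CosFace-style)
        target = np.cos(theta_y) - m
    elif kind == "sphere":   # multiplicative angular margin (SphereFace-style;
        target = np.cos(m * theta_y)  # the monotonicity fix is omitted here)
    logits = s * cos_theta.copy()
    logits[idx, y] = s * target       # margin lowers the target logit
    logits -= logits.max(axis=1, keepdims=True)
    p = np.exp(logits) / np.exp(logits).sum(axis=1, keepdims=True)
    return -np.log(p[idx, y] + 1e-12).mean()

cos = np.clip(np.random.randn(4, 10) * 0.3, -1, 1)
print(margin_softmax_loss(cos, np.array([0, 1, 2, 3]), kind="cos", m=0.35))
```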

AAAI Conference 2020 Conference Paper

Mis-Classified Vector Guided Softmax Loss for Face Recognition

  • Xiaobo Wang
  • Shifeng Zhang
  • Shuo Wang
  • Tianyu Fu
  • Hailin Shi
  • Tao Mei

Face recognition has witnessed significant progress due to the advances of deep convolutional neural networks (CNNs), the central task of which is how to improve feature discrimination. To this end, several margin-based (e.g., angular, additive and additive angular margin) softmax loss functions have been proposed to increase the feature margin between different classes. However, despite the great achievements made, they mainly suffer from three issues: 1) they ignore the importance of mining informative features for discriminative learning; 2) they encourage the feature margin only from the ground-truth class, without realizing the discriminability from other non-ground-truth classes; 3) the feature margin between different classes is set to be the same and fixed, which may not adapt to all situations well. To cope with these issues, this paper develops a novel loss function, which adaptively emphasizes the mis-classified feature vectors to guide discriminative feature learning. Thus we can address all the above issues and achieve more discriminative face features. To the best of our knowledge, this is the first attempt to inherit the advantages of feature margin and feature mining into a unified loss function. Experimental results on several benchmarks have demonstrated the effectiveness of our method over state-of-the-art alternatives. Our code is available at http://www.cbsr.ia.ac.cn/users/xiaobowang/.

IJCAI Conference 2019 Conference Paper

Dense Temporal Convolution Network for Sign Language Translation

  • Dan Guo
  • Shuo Wang
  • Qi Tian
  • Meng Wang

The sign language translation (SLT) task, which aims at translating a sign language video into natural language, is weakly supervised, given that there is no exact mapping relationship between visual actions and textual words in a sentence label. To align the sign language actions and translate them into the respective words automatically, this paper proposes a dense temporal convolution network, termed DenseTCN, which captures the actions in hierarchical views. Within this network, a temporal convolution (TC) is designed to learn the short-term correlation among adjacent features and is further extended to a dense hierarchical structure. In the k-th TC layer, we integrate the outputs of all preceding layers together: (1) the TC in a deeper layer essentially has a larger receptive field, which captures long-term temporal context by the hierarchical content transition; (2) the integration addresses the SLT problem from different views, including embedded short-term and extended long-term sequential learning. Finally, we adopt the CTC loss and a fusion strategy to learn the feature-wise classification and generate the translated sentence. The experimental results on two popular sign language benchmarks, i.e., PHOENIX and USTCConSents, demonstrate the effectiveness of our proposed method in terms of various measurements.
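
The dense hierarchy can be sketched as temporal convolutions where layer k consumes the concatenation of the input and all preceding layers' outputs, so deeper layers see ever-larger receptive fields. The sketch below is our reading of the design, not the authors' implementation.

```python
import torch
import torch.nn as nn

class DenseTC(nn.Module):
    """Dense temporal-convolution stack: layer k takes the concatenation
    of the input and all preceding layer outputs along channels."""
    def __init__(self, dim, n_layers=3, kernel=3):
        super().__init__()
        self.layers = nn.ModuleList(
            [nn.Conv1d(dim * (k + 1), dim, kernel, padding=kernel // 2)
             for k in range(n_layers)])

    def forward(self, x):                  # x: (batch, dim, time)
        feats = [x]
        for conv in self.layers:
            # dense connection: concatenate everything seen so far
            feats.append(torch.relu(conv(torch.cat(feats, dim=1))))
        return feats[-1]

out = DenseTC(dim=64)(torch.randn(2, 64, 100))
print(out.shape)  # torch.Size([2, 64, 100])
```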

IJCAI Conference 2019 Conference Paper

Equally-Guided Discriminative Hashing for Cross-modal Retrieval

  • Yufeng Shi
  • Xinge You
  • Feng Zheng
  • Shuo Wang
  • Qinmu Peng

Cross-modal hashing intends to project data from two modalities into a common Hamming space to perform cross-modal retrieval efficiently. Despite satisfactory performance achieved on real applications, existing methods are incapable of simultaneously preserving the semantic structure, so as to maintain inter-class relationships, and improving discriminability, so as to make intra-class samples aggregated, which limits retrieval performance. To handle this problem, we propose Equally-Guided Discriminative Hashing (EGDH), which jointly takes into consideration semantic structure and discriminability. Specifically, we discover the connection between semantic-structure-preserving and discriminative methods. Based on it, we directly encode multi-label annotations, which act as high-level semantic features, to build a common semantic-structure-preserving classifier. With the common classifier guiding the learning of the different modal hash functions equally, hash codes of samples become intra-class aggregated and inter-class relationship preserving. Experimental results on two benchmark datasets demonstrate the superiority of EGDH compared with the state of the art.

YNIMG Journal 2017 Journal Article

A framework for designing dynamic lp-ntPET studies to maximize the sensitivity to transient neurotransmitter responses to drugs: Application to dopamine and smoking

  • Shuo Wang
  • Sujin Kim
  • Kelly P. Cosgrove
  • Evan D. Morris

The “linear parametric neurotransmitter PET” (lp-ntPET) model was introduced to capture the time course of transient endogenous neurotransmitter response to drug stimulus from dynamic PET data. We previously used this novel analysis tool to probe the short-lived dopamine (DA) response induced by cigarette smoking in the PET scanner. It allowed us to find a sex difference in the DA signature of cigarette smoking. To make the best use of this tool to characterize neurotransmitter response to drug stimulus, the sensitivity of lp-ntPET to detect such responses must be maximized. We designed a series of simulation studies to examine the impact of the following factors on the sensitivity of lp-ntPET, using smoking-induced DA release as an example application: tracer delivery protocol, pre-processing for image denoising, timing of the smoking task, duration of the PET scan, and dose of the radiotracer. Our results suggest that a Bolus paradigm could replace a more difficult B/I paradigm without sacrificing the sensitivity of the method. Pre-processing the PET data with the de-noising algorithm HYPR could improve the sensitivity. The optimal timing to start the smoking task is 45min in a 90min scan and 35min in a 75min scan. A mild shortening of the scan time from 90min to 75min should be acceptable without loss of sensitivity. We suggest a lower dose limit of a bolus injection at 16mCi to limit underestimation of DA activation. This study established a framework to optimize the experimental design for reaching the full potential of lp-ntPET to detect neurotransmitter responses to drugs or even behavioral tasks.

YNIMG Journal 2017 Journal Article

Decision ambiguity is mediated by a late positive potential originating from cingulate cortex

  • Sai Sun
  • Shanshan Zhen
  • Zhongzheng Fu
  • Daw-An Wu
  • Shinsuke Shimojo
  • Ralph Adolphs
  • Rongjun Yu
  • Shuo Wang

People often make decisions in the face of ambiguous information, but it remains unclear how ambiguity is represented in the brain. We used three types of ambiguous stimuli and combined EEG and fMRI to examine the neural representation of perceptual decisions under ambiguity. We identified a late positive potential, the LPP, which differentiated levels of ambiguity, and which was specifically associated with behavioral judgments about choices that were ambiguous, rather than passive perception of ambiguous stimuli. Mediation analyses together with two further control experiments confirmed that the LPP was generated only when decisions are made (not during mere perception of ambiguous stimuli), and only when those decisions involved choices on a dimension that is ambiguous. A further control experiment showed that a stronger LPP arose in the presence of ambiguous stimuli compared to when only unambiguous stimuli were present. Source modeling suggested that the LPP originated from multiple loci in cingulate cortex, a finding we further confirmed using fMRI and fMRI-guided ERP source prediction. Taken together, our findings argue for a role of an LPP originating from cingulate cortex in encoding decisions based on task-relevant perceptual ambiguity, a process that may in turn influence confidence judgment, response conflict, and error correction.

IJCAI Conference 2016 Conference Paper

Dealing with Multiple Classes in Online Class Imbalance Learning

  • Shuo Wang
  • Leandro L. Minku
  • Xin Yao

Online class imbalance learning deals with data streams having very skewed class distributions in a timely fashion. Although a few methods have been proposed to handle such problems, most of them focus on two-class cases. Multi-class imbalance imposes additional challenges in learning. This paper studies the combined challenges posed by multi-class imbalance and online learning, and aims at a more effective and adaptive solution. First, we introduce two resampling-based ensemble methods, called MOOB and MUOB, which can process multi-class data directly and strictly online with an adaptive sampling rate. Then, we look into the impact of multi-minority and multi-majority cases on MOOB and MUOB in comparison to other methods under stationary and dynamic scenarios. Both multi-minority and multi-majority make a negative impact. MOOB shows the best and most stable G-mean in most stationary and dynamic cases.
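
The MOOB-style update can be sketched with online bagging: each arriving example trains every ensemble member k ~ Poisson(λ) times, where λ grows for minority classes so they are adaptively oversampled. The base learner and exact rate rule below are our simplifications, not the paper's specification.

```python
import numpy as np

class OnlinePrototype:
    """Tiny online base learner: running per-class feature means."""
    def __init__(self, n_cls, d):
        self.mu = np.zeros((n_cls, d))
        self.n = np.zeros(n_cls)
    def partial_fit(self, x, y):
        self.n[y] += 1
        self.mu[y] += (x - self.mu[y]) / self.n[y]   # incremental mean
    def predict(self, x):
        return int(np.argmin(np.linalg.norm(self.mu - x, axis=1)))

def moob_update(ensemble, counts, x, y, rng):
    """Online oversampling bagging: minority classes get larger Poisson rates."""
    counts[y] += 1
    lam = counts.max() / counts[y]                   # adaptive sampling rate
    for model in ensemble:
        for _ in range(rng.poisson(lam)):            # k ~ Poisson(lambda)
            model.partial_fit(x, y)

rng = np.random.default_rng(0)
ensemble = [OnlinePrototype(3, 4) for _ in range(5)]
counts = np.ones(3)                                  # smoothed class counts
for _ in range(200):                                 # skewed 3-class stream
    y = rng.choice(3, p=[0.8, 0.15, 0.05])
    x = rng.normal(loc=y, size=4)
    moob_update(ensemble, counts, x, y, rng)
votes = [m.predict(np.full(4, 2.0)) for m in ensemble]
print(max(set(votes), key=votes.count))              # majority vote, expect 2
```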