Arrow Research search

Author name cluster

Dong Wang

Possible papers associated with this exact author name in Arrow. This page groups case-insensitive exact name matches and is not a full identity disambiguation profile.

77 papers
2 author rows

Possible papers

77

EAAI Journal 2026 Journal Article

A communication-efficient federated learning method for traffic flow prediction

  • Kaiju Li
  • Qiang Xu
  • Dong Wang
  • Xiang Nie
  • Hao Wang

Federated learning is increasingly adopted for traffic flow prediction (TFP) to enable privacy-preserving collaboration across distributed sensors. However, real-world deployments are highly heterogeneous in computational capability, causing stragglers that dominate per-round latency and severely slow down model updates. Most existing approaches mitigate stragglers by suppressing or discarding slow clients, which reduces data representativeness and introduces training bias. This is a harmful trade-off for TFP, where broad spatial coverage is crucial for accuracy. We propose a communication-efficient logical clustering federated learning framework (LCFed) that mitigates stragglers by logically balancing effective training time while preserving full client participation. LCFed combines a coarse-grained logical dynamic clustering algorithm (LoDynClust), which balances computational resources across clusters and reduces synchronization delays, with a fine-grained intra-cluster adaptive collaborative training mechanism (ICACT), which regulates aggregation intervals and mitigates training bias. We further provide a convergence analysis. Extensive experiments on three real-world traffic datasets show that LCFed significantly reduces the training latency caused by stragglers while maintaining competitive prediction accuracy compared with state-of-the-art baselines.
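The coarse-grained grouping idea can be illustrated with a naive static sketch: sort clients by measured compute speed and partition them so slow clients synchronize with each other rather than with fast ones. The paper's LoDynClust is dynamic and more elaborate; the function name and speed values below are hypothetical.

```python
def group_by_speed(client_speeds, n_clusters):
    """Partition clients into speed-homogeneous clusters (static sketch).

    client_speeds: dict mapping client id -> relative compute speed.
    Slow clients end up together, so no cluster's round time is dominated
    by a single straggler.
    """
    # Sort client ids from slowest to fastest.
    order = sorted(client_speeds, key=client_speeds.get)
    size = -(-len(order) // n_clusters)  # ceiling division
    return [order[i:i + size] for i in range(0, len(order), size)]
```

With two slow and two fast clients, `group_by_speed({"a": 1.0, "b": 1.1, "c": 5.0, "d": 5.2}, 2)` places the slow pair in one cluster and the fast pair in the other.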

AAAI Conference 2026 Conference Paper

CADTrack: Learning Contextual Aggregation with Deformable Alignment for Robust RGBT Tracking

  • Hao Li
  • Yuhao Wang
  • Xiantao Hu
  • Wenning Hao
  • Pingping Zhang
  • Dong Wang
  • Huchuan Lu

RGB-Thermal (RGBT) tracking aims to exploit visible and thermal infrared modalities for robust all-weather object tracking. However, existing RGBT trackers struggle to resolve modality discrepancies, which poses great challenges for robust feature representation. This limitation hinders effective cross-modal information propagation and fusion, significantly reducing tracking accuracy. To address it, we propose a novel Contextual Aggregation with Deformable Alignment framework, called CADTrack, for RGBT tracking. Specifically, we first deploy a Mamba-based Feature Interaction (MFI) module that establishes efficient feature interaction via state space models. This interaction module operates with linear complexity, reducing computational cost and improving feature discrimination. Then, we propose the Contextual Aggregation Module (CAM), which dynamically activates backbone layers through sparse gating based on a Mixture-of-Experts (MoE) and encodes complementary contextual information from cross-layer features. Finally, we propose the Deformable Alignment Module (DAM), which integrates deformable sampling and temporal propagation to mitigate spatial misalignment and localization drift. With these components, CADTrack achieves robust and accurate tracking in complex scenarios. Extensive experiments on five RGBT tracking benchmarks verify the effectiveness of our proposed method.

JBHI Journal 2026 Journal Article

DUR-Net+: Semi-Supervised Abdominal CT Pheochromocytoma Segmentation Via Dynamic Uncertainty Rectified and Prior Knowledge From SAM-Med3D

  • Chuanbo Qin
  • Zhuyuan Chen
  • Dong Wang
  • Bin Zheng
  • Jun Luo
  • Junying Zeng
  • Xudong Jia
  • Jin Wen

Pheochromocytoma is a rare urological tumor of the adrenal gland. Automated segmentation of pheochromocytomas from computed tomography (CT) is essential for diagnosis and treatment. However, this task is challenging due to blurred boundaries, irregular shapes, variations in location and size, and the scarcity of annotated images for training. To address these issues, we propose a semi-supervised framework for pheochromocytoma segmentation that primarily consists of a dynamic uncertainty rectification mechanism and a supervision strategy based on SAM-Med3D prior knowledge. First, we design a semi-supervised segmentation model comprising a shared encoder and multiple independent decoders that dynamically select pseudo labels from the different decoder outputs. To mitigate the risk of unreliable predictions caused by sparse annotations during training, we introduce uncertainty estimation to prioritize reliable outputs. Additionally, an Attentional Convolution Block (ACB) is designed in the encoding stage to fully utilize both global and local features, improving tumor recognition. Furthermore, SAM-Med3D prior knowledge is incorporated into the framework as supplementary supervisory information, aiding the model in learning from limited labeled data. To eliminate the labor-intensive requirement for manual prompts in SAM-Med3D, we leverage pseudo labels to generate high-quality mask prompts, streamlining the clinical workflow. Experiments on two pheochromocytoma datasets from different centers demonstrate that our proposed method achieves competitive performance.

AAAI Conference 2026 Conference Paper

FreeGaussian: Annotation-free Control of Articulated Objects via 3D Gaussian Splats with Flow Derivatives

  • Qizhi Chen
  • Delin Qu
  • Junli Liu
  • Yiwen Tang
  • Haoming Song
  • Dong Wang
  • Yuan Yuan
  • Bin Zhao

Reconstructing controllable Gaussian splats for articulated objects from monocular video is especially challenging due to its inherently insufficient constraints. Existing methods address this by relying on dense masks and manually defined control signals, limiting their real-world applications. In this paper, we propose an annotation-free method, FreeGaussian, which mathematically disentangles camera egomotion and articulated movements via flow derivatives. By establishing a connection between 2D flows and 3D Gaussian dynamic flow, our method enables optimization and continuity of dynamic Gaussian motions from flow priors without any control signals. Furthermore, we introduce a 3D spherical vector controlling scheme, which represents the state as a 3D Gaussian trajectory, thereby eliminating the need for complex 1D control signal calculations and simplifying controllable Gaussian modeling. Extensive experiments on articulated objects demonstrate the state-of-the-art visual performance and precise, part-aware controllability of our method.

EAAI Journal 2026 Journal Article

Predictive analysis of carbon monoxide concentration fluctuations due to coal spontaneous combustion in goaf zones disturbed by mining activities

  • Weihu Cao
  • Xiaoxing Zhong
  • Dong Wang
  • Yongle Feng
  • Yongli Dong
  • Kun Zhou

Carbon monoxide (CO) is a key indicator for monitoring and early warning of coal spontaneous combustion. However, due to interference from mining operations, the CO concentration monitored at the return corner of the goaf is highly nonlinear and uncertain, often leading to false alarms and missed warnings. To address these issues, a fixed value z was introduced into the Kalman filter (KF), yielding an improved Kalman filter method (imKF). This method was then combined with time-series prediction techniques to construct CO concentration filtering and prediction models using various algorithm combinations. The results indicated that, compared with the Recurrent Neural Network (RNN), Long Short-Term Memory (LSTM), Autoregressive Integrated Moving Average (ARIMA) and Gated Recurrent Unit (GRU) models, as well as a prediction model using the unmodified Kalman filter, the LSTM-imKF-RNN combination achieved the highest CO prediction accuracy, with Mean Squared Errors (MSE) of 0.0059 and 3.3420 on the two datasets. Through synchronized monitoring of CO concentration, blasting operations, and roof pressure data, the LSTM-imKF-RNN model based on the improved Kalman filter can effectively address the significant impact of roof pressure and blasting operations on underground CO prediction. Compared to the unimproved Kalman filter, the average percentage error was reduced from 16.640% to 0.247%.
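For readers unfamiliar with the baseline the imKF builds on, here is the textbook scalar Kalman filter applied to a noisy sensor series. The paper's improvement (the fixed value z) is not reproduced here, and all parameter defaults below are hypothetical.

```python
def kalman_smooth(measurements, q=1e-4, r=0.25, x0=0.0, p0=1.0):
    """Filter a 1-D sensor series with a constant-state Kalman filter.

    q: process noise variance, r: measurement noise variance.
    """
    x, p = x0, p0
    out = []
    for z in measurements:
        # Predict: constant-state model; uncertainty grows by q.
        p = p + q
        # Update: blend prediction and measurement by the Kalman gain.
        k = p / (p + r)
        x = x + k * (z - x)
        p = (1.0 - k) * p
        out.append(x)
    return out
```

Feeding a noisy series such as `[1.0, 1.2, 0.9, 1.1, 1.0]` yields a smoothed estimate that settles near the underlying level of 1.0.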

EAAI Journal 2026 Journal Article

Pseudo-central feature matching: An adaptive semisupervised fault diagnosis method for knowledge transfer under variable working conditions

  • Changqing Shen
  • Hangqi Ge
  • Hao Yang
  • Juanjuan Shi
  • Dong Wang
  • Zhongkui Zhu

In recent years, with the continuous advancement of industrialization, fault diagnosis technology for industrial equipment has developed rapidly. Semisupervised domain adaptation (SSDA) can improve the generalization ability of models by utilizing a small portion of labeled data alongside a large quantity of unlabeled data, achieving cross-domain fault diagnosis, and is therefore widely applied. However, previous SSDA methods often misalign target data with labeled source data of the wrong class during spatial mapping, leading to erroneous classifications. To solve this problem, this study draws inspiration from noisy label learning and provides an adaptive semisupervised fault diagnosis method, pseudo-central feature matching (PCFM), for knowledge transfer under variable working conditions. First, a novel semisupervised adaptive correction framework is proposed, which treats the labeled source-domain data as noisy approximations of the target-domain data and adaptively refines them using feedback from the target domain, effectively mitigating the misalignment problem of traditional SSDA methods. A prototype network with dynamically updated pseudo-centers is then introduced to guide feature alignment between the source and target domains, enhancing robustness in cross-domain scenarios and improving label reliability. Finally, the effectiveness of PCFM is validated through experiments. Results show that this approach not only aligns data between the source and target domains but also significantly enhances the accuracy of existing advanced methods, yielding more than a 6% performance gain over MME and DANN.

EAAI Journal 2025 Journal Article

A comprehensive review of non-Latin natural scene text detection and recognition techniques

  • Elham Eli
  • Dong Wang
  • Wenting Xu
  • Hornisa Mamat
  • Alimjan Aysa
  • Kurban Ubul

This paper provides a comprehensive review of non-Latin scene text recognition (STR), with a focus on deep learning methods. It discusses the challenges presented by non-Latin scripts and how current methodologies address these challenges in both text detection and recognition. We explore traditional and deep learning-based approaches, with particular emphasis on transformer-based models, and analyze language-specific challenges across different scripts, such as Arabic, Chinese, and Indic. The review covers both text detection, where we examine the role of deep learning in identifying text regions, and scene text recognition, highlighting the integration of attention mechanisms and cross-lingual transfer learning. We also propose promising future research directions, including data augmentation, better multi-language adaptability, and the fusion of cross-modal information. Finally, we introduce our recent research on Uyghur scene text recognition as a case study. This paper offers valuable insights into advancing STR technologies and overcoming the unique challenges in non-Latin script recognition.

ICLR Conference 2025 Conference Paper

A Non-Contrastive Learning Framework for Sequential Recommendation with Preference-Preserving Profile Generation

  • Huimin Zeng
  • Xiaojie Wang 0003
  • Anoop Jain
  • Zhicheng Dou
  • Dong Wang

Contrastive Learning (CL) has proven effective for learning generalizable user representations in Sequential Recommendation (SR), but it suffers from high computational costs due to its reliance on negative samples. To overcome this limitation, we propose the first Non-Contrastive Learning (NCL) framework for SR, which eliminates the computational overhead of identifying and generating negative samples. However, without negative samples, it is challenging to learn uniform representations from only positive samples alone, which is prone to representation collapse. Furthermore, the alignment of the learned representations may be substantially compromised because existing ad-hoc augmentations can produce positive samples with inconsistent user preferences. To tackle these challenges, we design a novel preference-preserving profile generation method to produce high-quality positive samples for non-contrastive training. Inspired by differential privacy, our approach creates augmented user profiles that exhibit high diversity while provably retaining consistent user preferences. With the greater diversity and consistency of the positive samples, our NCL framework significantly enhances the alignment and uniformity of the learned representations, which contributes to better generalization. Experimental results on various benchmark datasets and model architectures demonstrate the effectiveness of the proposed method. Finally, our investigations reveal that both uniformity and alignment play a vital role in improving generalization for SR. Interestingly, in our data-sparse setting, alignment is usually more important than uniformity.
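The alignment and uniformity properties discussed above have standard quantitative definitions (distance between positive pairs, and log average Gaussian-kernel similarity over all pairs, respectively); a minimal sketch of those metrics on L2-normalized embeddings follows. The abstract does not specify which variant the paper uses, so treat these as the commonly used forms.

```python
import numpy as np

def alignment(x, y, alpha=2):
    """Average distance between positive pairs of normalized embeddings
    (lower = better aligned)."""
    return (np.linalg.norm(x - y, axis=1) ** alpha).mean()

def uniformity(x, t=2.0):
    """Log average Gaussian-kernel similarity over all distinct pairs
    (lower = more uniformly spread on the hypersphere)."""
    d2 = ((x[:, None, :] - x[None, :, :]) ** 2).sum(-1)
    iu = np.triu_indices(len(x), k=1)  # distinct pairs only
    return np.log(np.exp(-t * d2[iu]).mean())
```

Identical positive pairs give an alignment of 0, and a set of orthogonal embeddings scores strictly lower (better) uniformity than a fully collapsed set.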

IJCAI Conference 2025 Conference Paper

Bidirectional Human–AI Collaboration for Equitable Student Performance Prediction via Deep Uncertainty Learning

  • Ruohan Zong
  • Yang Zhang
  • Lanyu Shang
  • Frank Stinar
  • Nigel Bosch
  • Dong Wang

This paper studies a bidirectional human-AI collaborative student performance prediction problem to enhance equitable online education, aligning with the United Nations' Sustainable Development Goal (SDG) of ensuring inclusive and equitable quality education for all. The goal is to leverage collaborative intelligence to generate accurate and fair student outcome predictions from behavioral data, ensuring equitable estimation for underrepresented populations. Current fair AI solutions often fail to mitigate demographic bias in the absence of student demographic data, while human-AI collaborative approaches frequently overlook human cognitive biases, leading to inaccurate predictions. We develop CollabDebias, a novel bidirectional human-AI collaborative framework that utilizes the complementary strengths of AI and humans to mitigate both AI demographic bias and human cognitive bias. To address AI demographic bias, we propose an uncertainty learning-based bias identification method and a reliability-aware human-AI integration approach. To reduce human cognitive bias, we design an uncertainty-aware visualization of the AI decision area and an attention mechanism. Experimental results on an online course demonstrate CollabDebias's effectiveness in improving student performance prediction accuracy and fairness.

AAAI Conference 2025 Conference Paper

CLIP-driven View-aware Prompt Learning for Unsupervised Vehicle Re-identification

  • Jiyang Xu
  • Qi Wang
  • Xin Xiong
  • Di Gai
  • Ruihua Zhou
  • Dong Wang

With the emergence of vision-language pre-trained models such as CLIP, textual prompts have recently been introduced into re-identification (Re-ID) tasks to obtain considerably more robust multimodal information. However, most textual descriptions for vehicle Re-ID tasks contain only identity index words, without specific words describing vehicle view information, making them difficult to apply widely to vehicle Re-ID tasks with view variations. This inspires us to propose a CLIP-driven view-aware prompt learning framework for unsupervised vehicle Re-ID. We first design a learnable textual prompt template called view-aware context optimization (ViewCoOp), based on dynamic multi-view word embeddings, which fully captures the proportion and position encoding of each view within the whole vehicle body region. Subsequently, a cross-modal mutual graph is constructed to explore inter-modal and intra-modal connections. Each sample is treated as a graph node, with textual features extracted via ViewCoOp and visual features extracted from images. The inter-cluster and intra-cluster correlations in the bimodal clustering results then determine the connectivity between graph node pairs. Lastly, the proposed cross-modal mutual graph method utilizes supervisory information from the bimodal gap to directly fine-tune the image encoder of CLIP for downstream unsupervised vehicle Re-ID tasks. Extensive experiments verify that the proposed method effectively obtains cross-modal description ability from multiple views.

NeurIPS Conference 2025 Conference Paper

Collaborative Reasoner: Self-Improving Social Agents with Synthetic Conversations

  • Ansong Ni
  • Ruta Desai
  • Yang Li
  • Xinjie Lei
  • Dong Wang
  • Jiemin Zhang
  • Jane Yu
  • Ramya Raghavendra

With increasingly powerful large language models (LLMs) and LLM-based agents tackling an ever-growing list of tasks, we envision a future where numerous LLM agents work seamlessly with other AI agents and humans to solve complex problems and enhance daily life. To achieve these goals, LLM agents must develop collaborative skills such as effective persuasion, assertion and disagreement, which are often overlooked in the prevalent single-turn training and evaluation of LLMs. In this work, we present Collaborative Reasoner (Coral), a framework to evaluate and improve the collaborative reasoning abilities of language models. In particular, tasks and metrics in Coral necessitate agents to disagree with incorrect solutions, convince their partners of a correct solution, and ultimately agree as a team to commit to a final solution, all through a natural multi-turn conversation. Through comprehensive evaluation on six collaborative reasoning tasks covering domains of coding, math, scientific QA and social reasoning, we show that current models cannot effectively collaborate due to undesirable social behaviors, collapsing even on problems that they can solve singlehandedly. To improve the collaborative reasoning capabilities of LLMs, we propose a self-play method to generate synthetic multi-turn preference data and further train the language models to be better collaborators. Experiments with Llama-3.1, Ministral and Qwen-2.5 models show that our proposed self-improvement approach consistently outperforms finetuned chain-of-thought performance of the same base model, yielding gains up to 16.7% absolute. Human evaluations show that the models exhibit more effective disagreement and produce more natural conversations after training on our synthetic interaction data.

NeurIPS Conference 2025 Conference Paper

Equilibrium Policy Generalization: A Reinforcement Learning Framework for Cross-Graph Zero-Shot Generalization in Pursuit-Evasion Games

  • Runyu Lu
  • Peng Zhang
  • Ruochuan Shi
  • Yuanheng Zhu
  • Dongbin Zhao
  • Yang Liu
  • Dong Wang
  • Cesare Alippi

Equilibrium learning in adversarial games is an important topic widely examined in the fields of game theory and reinforcement learning (RL). Pursuit-evasion game (PEG), as an important class of real-world games from the fields of robotics and security, requires exponential time to be accurately solved. When the underlying graph structure varies, even the state-of-the-art RL methods require recomputation or at least fine-tuning, which can be time-consuming and impair real-time applicability. This paper proposes an Equilibrium Policy Generalization (EPG) framework to effectively learn a generalized policy with robust cross-graph zero-shot performance. In the context of PEGs, our framework is generally applicable to both pursuer and evader sides in both no-exit and multi-exit scenarios. These two generalizability properties, to our knowledge, are the first to appear in this domain. The core idea of the EPG framework is to train an RL policy across different graph structures against the equilibrium policy for each single graph. To construct an equilibrium oracle for single-graph policies, we present a dynamic programming (DP) algorithm that provably generates pure-strategy Nash equilibrium with near-optimal time complexity. To guarantee scalability with respect to pursuer number, we further extend DP and RL by designing a grouping mechanism and a sequence model for joint policy decomposition, respectively. Experimental results show that, using equilibrium guidance and a distance feature proposed for cross-graph PEG training, the EPG framework guarantees desirable zero-shot performance in various unseen real-world graphs. Besides, when trained under an equilibrium heuristic proposed for the graphs with exits, our generalized pursuer policy can even match the performance of the fine-tuned policies from the state-of-the-art PEG methods.

AAAI Conference 2025 Conference Paper

Exploring Enhanced Contextual Information for Video-Level Object Tracking

  • Ben Kang
  • Xin Chen
  • Simiao Lai
  • Yang Liu
  • Yi Liu
  • Dong Wang

Contextual information at the video level has become increasingly crucial for visual object tracking. However, existing methods typically use only a few tokens to convey this information, which can lead to information loss and limit their ability to fully capture the context. To address this issue, we propose a new video-level visual object tracking framework called MCITrack. It leverages Mamba's hidden states to continuously record and transmit extensive contextual information throughout the video stream, resulting in more robust object tracking. The core component of MCITrack is the Contextual Information Fusion module, which consists of a Mamba layer and a cross-attention layer. The Mamba layer stores historical contextual information, while the cross-attention layer integrates this information into the current visual features of each backbone block. This module enhances the model's ability to capture and utilize contextual information at multiple levels through deep integration with the backbone. Experiments demonstrate that MCITrack achieves competitive performance across numerous benchmarks. For instance, it achieves 76.6% AUC on LaSOT and 80.0% AO on GOT-10k, establishing new state-of-the-art performance.
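The fusion step described above, where current visual features attend to stored contextual states, follows the generic cross-attention pattern. A minimal single-head sketch without learned projections (MCITrack's actual layer will differ) is:

```python
import numpy as np

def cross_attention(queries, context, scale=None):
    """Single-head cross-attention without learned projections: each query
    row becomes a softmax-weighted combination of the context vectors."""
    if scale is None:
        scale = queries.shape[-1] ** 0.5
    scores = queries @ context.T / scale
    scores -= scores.max(axis=-1, keepdims=True)  # numerical stability
    weights = np.exp(scores)
    weights /= weights.sum(axis=-1, keepdims=True)
    return weights @ context
```

A query strongly correlated with one stored state retrieves essentially that state: with context rows `[1, 0]` and `[0, 1]`, the query `[10, 0]` returns a vector very close to `[1, 0]`.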

EAAI Journal 2025 Journal Article

Fast and intelligent measurement of the ventilation resistance coefficient for the whole mine based on sparse measurement points

  • Dong Wang
  • Jian Liu
  • Lijun Deng
  • Peng Cao
  • Li Liu

Artificial intelligence is playing an important role in mine ventilation engineering, especially in ensuring safe mine production. The mine ventilation resistance coefficient (MVRC) is the core, basic parameter of a mine ventilation system, and quickly and accurately obtaining the ventilation resistance coefficient (VRC) of the whole mine is crucial for its scientific, safe, and intelligent management. Traditional mine ventilation resistance measurement is time-consuming and laborious; to address this, we propose a fast, intelligent measurement method that obtains the whole mine's VRC using an artificial-intelligence differential evolution algorithm and sparse measurement points. The VRC was experimentally measured to verify the validity of the intelligent measurement method and the reliability of the model; the relative error of the air volume at the observation points of the solved results is less than 6%. A fast intelligent measurement of the MVRC of the Longshou mine was then carried out. The results were applied to develop an emergency plan for addressing the insufficient air supply in the ventilation system caused by the collapse of the mine's main blind return shaft and were validated through engineering practice. After field practice, the relative error between the predicted and tested return air volume of the 10-row inclined shaft was 3.23%. This verifies that the results obtained using this method can solve mine ventilation system problems with relatively high accuracy, significantly reducing both the testing workload and the time required for mine ventilation resistance measurements.
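The differential evolution algorithm at the heart of the method is a standard population-based optimizer; a textbook DE/rand/1/bin sketch follows. The objective here is a generic test function, not the paper's ventilation residual, and all hyperparameters are illustrative.

```python
import random

def differential_evolution(f, bounds, pop_size=20, f_w=0.8, cr=0.9,
                           iters=150, seed=0):
    """Minimize f over box bounds with the classic DE/rand/1/bin scheme."""
    rnd = random.Random(seed)
    dim = len(bounds)
    pop = [[rnd.uniform(lo, hi) for lo, hi in bounds] for _ in range(pop_size)]
    costs = [f(x) for x in pop]
    for _ in range(iters):
        for i in range(pop_size):
            # Mutate: combine three distinct population members.
            a, b, c = rnd.sample([j for j in range(pop_size) if j != i], 3)
            jr = rnd.randrange(dim)  # forced crossover index
            trial = []
            for j in range(dim):
                if rnd.random() < cr or j == jr:
                    v = pop[a][j] + f_w * (pop[b][j] - pop[c][j])
                    lo, hi = bounds[j]
                    v = min(max(v, lo), hi)  # clip to bounds
                else:
                    v = pop[i][j]
                trial.append(v)
            # Greedy selection: keep the trial if it is no worse.
            tc = f(trial)
            if tc <= costs[i]:
                pop[i], costs[i] = trial, tc
    best = min(range(pop_size), key=costs.__getitem__)
    return pop[best], costs[best]
```

On a simple 2-D sphere objective the population converges rapidly toward the global minimum at the origin.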

ICLR Conference 2025 Conference Paper

Forget the Data and Fine-Tuning! Just Fold the Network to Compress

  • Dong Wang
  • Haris Sikic
  • Lothar Thiele
  • Olga Saukh

We introduce model folding, a novel data-free model compression technique that merges structurally similar neurons across layers, significantly reducing the model size without the need for fine-tuning or access to training data. Unlike existing methods, model folding preserves data statistics during compression by leveraging k-means clustering, and using novel data-free techniques to prevent variance collapse or explosion. Our theoretical framework and experiments across standard benchmarks, including ResNet18 and LLaMA-7B, demonstrate that model folding achieves comparable performance to data-driven compression techniques and outperforms recently proposed data-free methods, especially at high sparsity levels. This approach is particularly effective for compressing large-scale models, making it suitable for deployment in resource-constrained environments.
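The core merge-and-compensate step can be sketched for a pair of linear layers: cluster similar neurons (rows of the first weight matrix) with k-means, replace each cluster by its mean, and sum the matching input columns of the next layer so the composed map is approximately preserved. The paper's additional machinery for preserving data statistics and preventing variance collapse is omitted, and the deterministic initialization below is purely illustrative.

```python
import numpy as np

def fold_layer(w1, w2, k, iters=10):
    """Fold layer w1 from len(w1) neurons down to k by k-means merging,
    compensating in the following layer w2 (linear layers assumed)."""
    centers = w1[:k].copy()  # simplistic deterministic init for illustration
    labels = np.zeros(len(w1), dtype=int)
    for _ in range(iters):
        # Assign each neuron to its nearest center, then recompute means.
        d2 = ((w1[:, None, :] - centers[None, :, :]) ** 2).sum(-1)
        labels = d2.argmin(axis=1)
        for j in range(k):
            if np.any(labels == j):
                centers[j] = w1[labels == j].mean(axis=0)
    # Each merged neuron now feeds w2 once, so its incoming columns are summed.
    folded_w2 = np.stack([w2[:, labels == j].sum(axis=1) for j in range(k)],
                         axis=1)
    return centers, folded_w2
```

When clusters contain exactly duplicated neurons, the folded network computes the same function as the original with half the hidden width.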

NeurIPS Conference 2025 Conference Paper

Hybrid Latent Reasoning via Reinforcement Learning

  • Zhenrui Yue
  • Bowen Jin
  • Huimin Zeng
  • Honglei Zhuang
  • Zhen Qin
  • Jinsung Yoon
  • Lanyu Shang
  • Jiawei Han

Recent advances in large language models (LLMs) have introduced latent reasoning as a promising alternative to autoregressive reasoning. By performing internal computation with hidden states from previous steps, latent reasoning benefits from more informative features than sampling a discrete chain-of-thought (CoT) path. Yet latent reasoning approaches are often incompatible with LLMs, as their continuous paradigm conflicts with the discrete nature of autoregressive generation. Moreover, these methods rely on CoT traces for training and thus fail to exploit the inherent reasoning patterns of LLMs. In this work, we explore latent reasoning by leveraging the intrinsic capabilities of LLMs via reinforcement learning (RL). To this end, we introduce hybrid reasoning policy optimization (HRPO), an RL-based hybrid latent reasoning approach that (1) integrates prior hidden states into sampled tokens with a learnable gating mechanism, and (2) initializes training with predominantly token embeddings while progressively incorporating more hidden features. This design maintains LLMs' generative capabilities and incentivizes hybrid reasoning using both discrete and continuous representations. In addition, HRPO introduces stochasticity into latent reasoning via token sampling, thereby enabling RL-based optimization without requiring CoT trajectories. Extensive evaluations across diverse benchmarks show that HRPO outperforms prior methods on both knowledge- and reasoning-intensive tasks. Furthermore, HRPO-trained LLMs remain interpretable and exhibit intriguing behaviors such as cross-lingual patterns and shorter completion lengths, highlighting the potential of our RL-based approach and offering insights for future work in latent reasoning.
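The gating idea in point (1) can be sketched as a simple convex blend of the token embedding and the previous hidden state. HRPO's actual parameterization is not specified in the abstract, so the scalar sigmoid gate below is an assumption for illustration only.

```python
import numpy as np

def gated_hybrid_input(token_emb, prev_hidden, gate_logit):
    """Blend a sampled token's embedding with the previous step's hidden
    state. A large positive gate_logit keeps mostly the discrete embedding
    (the training starting point); lowering it mixes in more latent features."""
    g = 1.0 / (1.0 + np.exp(-gate_logit))  # gate in (0, 1)
    return g * token_emb + (1.0 - g) * prev_hidden
```

Starting training with a strongly positive gate reproduces ordinary token-embedding inputs, and annealing the gate downward progressively incorporates hidden features, matching the schedule described in point (2).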

AAAI Conference 2025 Conference Paper

Knowledge Graph Completion with Relation-Aware Anchor Enhancement

  • Duanyang Yuan
  • Sihang Zhou
  • Xiaoshu Chen
  • Dong Wang
  • Ke Liang
  • Xinwang Liu
  • Jian Huang

Text-based knowledge graph completion methods take advantage of pre-trained language models (PLMs) to enhance the intrinsic semantic connections of raw triples with detailed text descriptions. Typical methods in this branch map an input query (the textual descriptions associated with an entity and a relation) and its candidate entities into feature vectors, then maximize the probability of valid triples. These methods achieve promising performance and are attracting increasing attention amid the rapid development of large language models. A property of language models is that the more related and specific the context information an input query provides, the more discriminative the resulting embedding will be. In this paper, through observation and validation, we find a neglected fact: the relation-aware neighbors of the head entities in queries can act as effective contexts for more precise link prediction. Driven by this finding, we propose a relation-aware anchor enhanced knowledge graph completion method (RAA-KGC). Specifically, to provide a reference for what the target entity might be like, we first generate anchor entities within the relation-aware neighborhood of the head entity. Then, by pulling the query embedding towards the neighborhoods of the anchors, it is tuned to be more discriminative for target entity matching. Our extensive experiments not only validate the efficacy of RAA-KGC but also reveal that integrating our relation-aware anchor enhancement strategy notably enhances the performance of current leading methods without substantial modifications.

AAAI Conference 2025 Conference Paper

Learning 2D Invariant Affordance Knowledge for 3D Affordance Grounding

  • Xianqiang Gao
  • Pingrui Zhang
  • Delin Qu
  • Dong Wang
  • Zhigang Wang
  • Yan Ding
  • Bin Zhao

3D Object Affordance Grounding aims to predict the functional regions on a 3D object and has laid the foundation for a wide range of applications in robotics. Recent advances tackle this problem via learning a mapping between 3D regions and a single human-object interaction image. However, the geometric structure of the 3D object and the object in the human-object interaction image are not always consistent, leading to poor generalization. To address this issue, we propose to learn generalizable invariant affordance knowledge from multiple human-object interaction images within the same affordance category. Specifically, we introduce the Multi-Image Guided Invariant-Feature-Aware 3D Affordance Grounding (MIFAG) framework. It grounds 3D object affordance regions by identifying common interaction patterns across multiple human-object interaction images. First, the Invariant Affordance Knowledge Extraction Module (IAM) utilizes an iterative updating strategy to gradually extract aligned affordance knowledge from multiple images and integrate it into an affordance dictionary. Then, the Affordance Dictionary Adaptive Fusion Module (ADM) learns comprehensive point cloud representations that consider all affordance candidates in multiple images. In addition, we construct the Multi-Image and Point Affordance (MIPA) benchmark, on which our method outperforms existing state-of-the-art methods across various experimental comparisons.

NeurIPS Conference 2025 Conference Paper

LOMIA: Label-Only Membership Inference Attacks against Pre-trained Large Vision-Language Models

  • Yihao Liu
  • Xinqi Lyu
  • Dong Wang
  • Yanjie Li
  • Bin Xiao

Large vision-language models (VLLMs) have driven significant progress in multi-modal systems, enabling a wide range of applications across domains such as healthcare, education, and content generation. Despite the success, the large-scale datasets used to train these models often contain sensitive or personally identifiable information, raising serious privacy concerns. To audit and better understand such risks, membership inference attacks (MIAs) have become a key tool. However, existing MIAs against VLLMs predominantly assume access to full-model logits, which are typically unavailable in many practical deployments. To facilitate MIAs in a more realistic and restrictive setting, we propose a novel framework: label-only membership inference attacks (LOMIA) targeting pre-trained VLLMs where only the model’s top-1 prediction is available. Within this framework, we propose three effective attack methods, all of which exploit the intuition that training samples are more likely to be memorized by the VLLMs, resulting in outputs that exhibit higher semantic alignment and lower perplexity. Our experiments show that our framework surpasses existing label-only attack adaptations for different VLLMs and competes with state-of-the-art logits-based attacks across all metrics on three widely used open-source VLLMs and GPT-4o.

NeurIPS Conference 2025 Conference Paper

Meta CLIP 2: A Worldwide Scaling Recipe

  • Yung-Sung Chuang
  • Yang Li
  • Dong Wang
  • Ching-Feng Yeh
  • Kehan Lyu
  • Ramya Raghavendra
  • Jim Glass
  • Lifei Huang

Contrastive Language-Image Pretraining (CLIP) is a popular foundation model, supporting tasks from zero-shot classification and retrieval to serving as an encoder for multimodal large language models (MLLMs). Although CLIP is successfully trained on billion-scale image-text pairs from the English world, scaling CLIP's training further to learn from worldwide web data is still challenging: (1) no curation method is available to handle data points from the non-English world; (2) the English performance of existing multilingual CLIP is worse than its English-only counterpart, i.e., the "curse of multilinguality" that is common in LLMs. Here, we present Meta CLIP 2, the first recipe training CLIP from scratch on worldwide web-scale image-text pairs. To generalize our findings, we conduct rigorous ablations with minimal changes that are necessary to address the above challenges and present a recipe enabling mutual benefits from English and non-English world data. In zero-shot ImageNet classification, Meta CLIP 2 ViT-H/14 surpasses its English-only counterpart by 0.8% and mSigLIP by 0.7%, and surprisingly sets a new state of the art without system-level confounding factors (e.g., translation, bespoke architecture changes) on multilingual benchmarks, such as CVQA with 57.4%, Babel-ImageNet with 50.2%, and XM3600 with 64.3% on image-to-text retrieval. Code and model are available at https://github.com/facebookresearch/MetaCLIP.

IJCAI Conference 2025 Conference Paper

MPPQ: Enhancing Post-Training Quantization for LLMs via Mixed Supervision, Proxy Rounding, and Pre-Searching

  • Mingrun Wei
  • Yeyu Yan
  • Dong Wang

Recently, post-training quantization (PTQ) methods for large language models (LLMs) have primarily focused on tackling the challenges caused by outliers. Scaling transformation has proven effective, yet how to enhance the performance of extremely low-bitwidth (e.g., 2-bit) PTQ under it remains largely unexplored. In this work, a new PTQ framework, namely MPPQ, is established. Specifically, MPPQ first proposes an enhanced reconstruction loss based on Mixed-metric supervision to mitigate the distribution inconsistency caused by quantization while providing strong regularization for learnable parameters. Secondly, we introduce a Proxy-based adaptive rounding scheme for weight quantization, which replaces the round-to-nearest (RTN) function to minimize the overall quantization error through element-wise scaling. Furthermore, a factor coarse Pre-searching mechanism is presented to ensure proper coordination between quantization and clipping patterns while achieving optimal initialization of clipping factors before training. Extensive experiments show that MPPQ consistently outperforms state-of-the-art methods in low-bit quantization settings. For instance, the perplexity on WikiText2 can be dramatically reduced to 8.85 (3.9 ↓ vs. 12.75 of the latest method, LRQuant) for the LLaMA-2-7B model quantized with W4A4.
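To make the rounding idea above concrete, here is a minimal NumPy sketch contrasting plain round-to-nearest (RTN) quantization with an element-wise-scaled ("proxy") rounding variant. The function names and the fixed `gamma` array are our illustrative assumptions; in MPPQ the per-element scaling would be learned to minimize reconstruction error, which this sketch does not do.

```python
import numpy as np

def rtn_quantize(w, n_bits=2):
    """Symmetric round-to-nearest quantization (the baseline MPPQ replaces)."""
    qmax = 2 ** (n_bits - 1) - 1
    scale = np.abs(w).max() / qmax
    # Quantize to integers, clip to the representable range, then dequantize.
    return np.round(w / scale).clip(-qmax - 1, qmax) * scale

def proxy_round_quantize(w, gamma, n_bits=2):
    """Element-wise scaled rounding: gamma nudges each weight before rounding
    so the overall quantization error could be reduced. Here gamma is given;
    in the paper it would be optimized."""
    qmax = 2 ** (n_bits - 1) - 1
    scale = np.abs(w).max() / qmax
    return np.round(w * gamma / scale).clip(-qmax - 1, qmax) * scale
```

With `gamma` set to all ones the proxy scheme reduces exactly to RTN, which makes the relationship between the two easy to check.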

IROS Conference 2025 Conference Paper

Multi-Material 3D-Printed Magnetic Millirobot for Quadrupedal Locomotion in Endoluminal Spaces

  • Ruichen Wang
  • Jinqiang Wang
  • Dong Wang

Quadrupedal locomotion has the advantages of a low center of gravity, a broad support base, and four-legged coordination, enabling outstanding stability on complex terrain. Drawing inspiration from this, researchers have developed robots that emulate such locomotion through multi-actuation control and structural reconfiguration. However, complex control sequences and slow tethered actuation limit their locomotion in confined endoluminal spaces. Here, we present a quadrupedal magnetic millirobot (beamrobot) fabricated via multi-material direct ink writing (DIW) for medical applications in complex endoluminal spaces. The millirobot combines a soft body with magnetic feet, enabling controlled shape-morphing and locomotion under external magnetic fields. The printing parameters are optimized, and numerical simulations and experiments validate the static deformation and dynamic locomotion modes. Experimental results demonstrate the versatility of the beamrobots, including following "U"-shaped trajectories, moving within a branch-vessel model, and clearing an obstruction in a vessel model. The proposed quadrupedal magnetic millirobot and hard-magnetic actuation approach open up new possibilities for medical applications.

NeurIPS Conference 2025 Conference Paper

NaturalReasoning: Reasoning in the Wild with 2.8M Challenging Questions

  • Weizhe Yuan
  • Jane Yu
  • Song Jiang
  • Karthik Padthe
  • Yang Li
  • Dong Wang
  • Ilia Kulikov
  • Kyunghyun Cho

Scaling reasoning capabilities beyond traditional domains such as math and coding is hindered by the lack of diverse and high-quality questions. To overcome this limitation, we introduce a scalable approach for generating diverse and challenging reasoning questions, accompanied by reference answers. We present NaturalReasoning, a comprehensive dataset comprising 2.8 million questions that span multiple domains, including STEM fields (e.g., Physics, Computer Science), Economics, Social Sciences, and more. We demonstrate the utility of the questions in NaturalReasoning through knowledge distillation experiments which show that NaturalReasoning can effectively elicit and transfer reasoning capabilities from a strong teacher model. Furthermore, we demonstrate that NaturalReasoning is also effective for unsupervised self-training using external reward models or self-rewarding.

AAAI Conference 2025 Conference Paper

SIDE: Socially Informed Drought Estimation Toward Understanding Societal Impact Dynamics of Environmental Crisis

  • Lanyu Shang
  • Bozhang Chen
  • Shiwei Liu
  • Yang Zhang
  • Ruohan Zong
  • Anav Vora
  • Ximing Cai
  • Na Wei

Drought has become a critical global threat with significant societal impact. Existing drought monitoring solutions primarily focus on assessing drought severity using quantitative measurements, overlooking the diverse societal impact of drought from human-centric perspectives. Motivated by the collective intelligence on social media and the computational power of AI, this paper studies a novel problem of socially informed AI-driven drought estimation that aims to leverage social and news media information to jointly estimate drought severity and its societal impact. Two technical challenges exist: 1) How to model the implicit temporal dynamics of drought societal impact. 2) How to capture the social-physical interdependence between the physical drought condition and its societal impact. To address these challenges, we develop SIDE, a socially informed AI-driven drought estimation framework that explicitly quantifies the societal impact of drought and effectively models the social-physical interdependency for joint severity-impact estimation. Experiments on real-world datasets from California and Texas demonstrate SIDE's superior performance compared to state-of-the-art baselines in accurately estimating drought severity and its societal impact. SIDE offers valuable insights for developing human-centric drought mitigation strategies to foster sustainable and resilient communities.

AAAI Conference 2025 Conference Paper

SUTrack: Towards Simple and Unified Single Object Tracking

  • Xin Chen
  • Ben Kang
  • Wanting Geng
  • Jiawen Zhu
  • Yi Liu
  • Dong Wang
  • Huchuan Lu

In this paper, we propose a simple yet unified single object tracking (SOT) framework, dubbed SUTrack. It consolidates five SOT tasks (RGB-based, RGB-Depth, RGB-Thermal, RGB-Event, RGB-Language Tracking) into a unified model trained in a single session. Due to the distinct nature of the data, current methods typically design individual architectures and train separate models for each task. This fragmentation results in redundant training processes, repetitive technological innovations, and limited cross-modal knowledge sharing. In contrast, SUTrack demonstrates that a single model with a unified input representation can effectively handle various SOT tasks, eliminating the need for task-specific designs and separate training sessions. Additionally, we introduce a task-recognition training strategy and a soft token type embedding to further enhance SUTrack's performance with minimal overhead. Experiments show that SUTrack outperforms previous task-specific counterparts across 11 datasets spanning five SOT tasks. Moreover, we provide a range of models catering to edge devices as well as high-performance GPUs, striking a good trade-off between speed and accuracy. We hope SUTrack could serve as a strong foundation for further compelling research into unified tracking models.

AAAI Conference 2025 Conference Paper

Two-stream Beats One-stream: Asymmetric Siamese Network for Efficient Visual Tracking

  • Jiawen Zhu
  • Huayi Tang
  • Xin Chen
  • Xinying Wang
  • Dong Wang
  • Huchuan Lu

Efficient tracking has garnered attention for its ability to operate on resource-constrained platforms for real-world deployment beyond desktop GPUs. Current efficient trackers mainly follow precision-oriented trackers, adopting a one-stream framework with lightweight modules. However, blindly adhering to the one-stream paradigm may not be optimal, as incorporating template computation in every frame leads to redundancy, and pervasive semantic interaction between template and search region places stress on edge devices. In this work, we propose a novel asymmetric Siamese tracker named AsymTrack for efficient tracking. AsymTrack disentangles template and search streams into separate branches, with the template computed only once during initialization to generate modulation signals. Building on this architecture, we devise an efficient template modulation mechanism to unidirectionally inject crucial cues into the search features, and design an object perception enhancement module that integrates abstract semantics and local details to overcome the limited representation of lightweight trackers. Extensive experiments demonstrate that AsymTrack offers superior speed-precision trade-offs across different platforms compared to the current state of the art. For instance, AsymTrack-T achieves 60.8% AUC on LaSOT and 224/81/84 FPS on GPU/CPU/AGX, surpassing HiT-Tiny by 6.0% AUC with higher speeds.
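The compute-once-then-modulate idea can be illustrated with a tiny NumPy sketch: the template branch runs once at initialization to produce a channel-wise signal, which is then injected into every frame's search features at negligible cost. The pooling and channel-scaling choices here are our stand-ins for AsymTrack's learned modulation mechanism, not the paper's actual operators.

```python
import numpy as np

def compute_modulation(template_feat):
    """Run once at track initialization: pool the (H, W, C) template feature
    map into a channel-wise modulation signal of shape (C,)."""
    return template_feat.mean(axis=(0, 1))

def modulate_search(search_feat, signal):
    """Run per frame: unidirectionally inject template cues into the search
    features via cheap channel-wise scaling."""
    return search_feat * signal[None, None, :]
```

The asymmetry is the point: the per-frame cost is a single broadcasted multiply, whereas a one-stream tracker would re-process the template jointly with every search region.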

JBHI Journal 2024 Journal Article

Automatically Detecting Anchor Cells and Clustering for scRNA-Seq Data Using scTSNN

  • Qiaoming Liu
  • Dandan Zhang
  • Dong Wang
  • Guohua Wang
  • Yadong Wang

Advances in single-cell RNA sequencing techniques enhance the resolution of cell-heterogeneity studies. Density-based unsupervised clustering has the potential to detect representative anchor points and the number of clusters automatically. Meanwhile, discovering the true cell types in scRNA-seq data in the unsupervised scenario is still challenging. To this end, we propose a tensor shared nearest neighbor anchor clustering method for scRNA-seq data, named scTSNN, which first uses a tensor affinity learning module to mine the local-global balanced topological structures among cells, then designs a density-based shared-nearest-neighbor measurement to automatically detect anchor cells, and finally partitions the non-anchor cells to obtain the clustering results. Validated on synthetic and scRNA-seq datasets, scTSNN not only detects complicated structures exactly but also outperforms state-of-the-art methods in accuracy and robustness. Moreover, case studies on mammalian cells and cervical cancer tumor cells demonstrate that the anchor cells selected by scTSNN benefit cell pseudotime inference and rare-cell identification, showing the good application and research value of scTSNN.
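The shared-nearest-neighbor measurement at the core of the method can be sketched generically: two cells are similar when their k-nearest-neighbor sets overlap. This is the textbook SNN count on raw Euclidean distances; scTSNN's tensor affinity learning and density-based anchor detection are not reproduced here, and the function names are ours.

```python
import numpy as np

def knn_sets(X, k):
    """k-nearest-neighbor index set for each row of X (Euclidean distance)."""
    d = np.linalg.norm(X[:, None, :] - X[None, :, :], axis=-1)
    np.fill_diagonal(d, np.inf)  # a cell is not its own neighbor
    return [set(np.argsort(row)[:k]) for row in d]

def snn_similarity(X, k=3):
    """Shared-nearest-neighbor similarity: |kNN(i) ∩ kNN(j)| for each pair.
    High counts indicate cells embedded in the same local neighborhood."""
    nbrs = knn_sets(X, k)
    n = len(nbrs)
    S = np.zeros((n, n), dtype=int)
    for i in range(n):
        for j in range(n):
            S[i, j] = len(nbrs[i] & nbrs[j])
    return S
```

On two well-separated point groups, within-group pairs share neighbors while cross-group pairs share none, which is what makes SNN robust for detecting dense anchor regions.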

AAAI Conference 2024 Conference Paper

Box2Poly: Memory-Efficient Polygon Prediction of Arbitrarily Shaped and Rotated Text

  • Xuyang Chen
  • Dong Wang
  • Konrad Schindler
  • Mingwei Sun
  • Yongliang Wang
  • Nicolo Savioli
  • Liqiu Meng

Recently, Transformer-based text detection techniques have sought to predict polygons by encoding the coordinates of individual boundary vertices using distinct query features. However, this approach incurs a significant memory overhead and struggles to effectively capture the intricate relationships between vertices belonging to the same instance. Consequently, irregular text layouts often lead to the prediction of outlier vertices, diminishing the quality of results. To address these challenges, we present an innovative approach rooted in Sparse R-CNN: a cascade decoding pipeline for polygon prediction. Our method ensures precision by iteratively refining polygon predictions, considering both the scale and location of preceding results. Leveraging this stabilized regression pipeline, even employing just a single feature vector to guide polygon instance regression yields promising detection results. Simultaneously, the use of instance-level feature proposals substantially enhances memory efficiency (>50% less vs. the SOTA method DPText-DETR) and reduces inference time (>40% less vs. DPText-DETR) with comparable performance on benchmarks. The code is available at https://github.com/Albertchen98/Box2Poly.git.

AAAI Conference 2024 Conference Paper

Color Event Enhanced Single-Exposure HDR Imaging

  • Mengyao Cui
  • Zhigang Wang
  • Dong Wang
  • Bin Zhao
  • Xuelong Li

Single-exposure high dynamic range (HDR) imaging aims to reconstruct the wide-range intensities of a scene by using its single low dynamic range (LDR) image, thus providing significant efficiency. Existing methods pay high attention to restoring the luminance by inversing the tone-mapping process, while the color in the over-/under-exposed area cannot be well restored due to the information loss of the single LDR image. To address this issue, we introduce color events into the imaging pipeline, which record asynchronous pixel-wise color changes in a high dynamic range, enabling edge-like scene perception under challenging lighting conditions. Specifically, we propose a joint framework that incorporates color events and a single LDR image to restore both content and color of an HDR image, where an exposure-aware transformer (EaT) module is designed to propagate the informative hints, provided by the normal-exposed LDR regions and the event streams, to the missing areas. In this module, an exposure-aware mask is estimated to suppress distractive information and strengthen the restoration of the over-/under-exposed regions. To our knowledge, we are the first to use color events to enhance single-exposure HDR imaging. We also contribute corresponding datasets, consisting of synthesized datasets and a real-world dataset collected by a DAVIS346-color camera. The datasets can be found at https://www.kaggle.com/datasets/mengyaocui/ce-hdr. Extensive experiments demonstrate the effectiveness of the proposed method.

EAAI Journal 2024 Journal Article

Dynamic time scales ensemble framework for similarity-based remaining useful life prediction under multiple failure modes

  • Yuhui Xu
  • Tangbin Xia
  • Dong Wang
  • Zhen Chen
  • Ershun Pan
  • Lifeng Xi

In modern industry, the stochastic degradation of mechanical equipment typically involves multiple failure modes, which heavily affects the reliability of remaining useful life (RUL) prediction. Similarity-based methods have been widely deployed in RUL prediction due to their flexibility, but it is still challenging to accurately identify similar degradation trajectories under varying failure modes. The obstacles lie in the interference of reference trajectories under different degradation states and the insufficiency of measuring trajectory trends. Therefore, this paper proposes a dynamic scales ensemble method based on the mean removal Canberra distance with failure identification (FI-MRC-DSE) for similarity-based prognosis. Firstly, a gated recurrent unit autoencoder network is employed to adaptively extract failure features from multi-dimensional monitoring data to support the targeted selection of reference trajectories. Then, the similarity matching is performed based on the proposed MRC distance instead of the commonly used Euclidean distance, enhancing the perception of degradation trends. Finally, the matching results across multiple time scales, which are dynamically determined by the instance's degradation state, are integrated to obtain the predicted RUL. This effectively overcomes the insufficient utilization of trajectories caused by a single time scale. In the experiments, the superiority of our developed similarity-based FI-MRC-DSE method is demonstrated by comparison with state-of-the-art similarity-based methods. The effectiveness analyses and the ablation study show that all three key components contribute to accurate prognosis under multiple failure modes.

IROS Conference 2024 Conference Paper

Flexible and Topological Consistent Local Replanning for Multirotors

  • Dong Wang
  • Hongkai Ye
  • Neng Pan
  • Jinxin Huang
  • Bangyan Zhang
  • Yinian Mao
  • Guoquan Huang 0001
  • Chao Xu 0001

In many situations such as city delivery and wild inspection, quadrotors are often required to follow a predefined reference trajectory. However, these reference trajectories cannot be perfectly safe, resulting in conflicts between tracking the reference precisely, flying safely, and finishing the mission timely. This paper proposes to solve the above problem by introducing a replanning framework that first generates a topologically consistent collision-free initial path and then flexibly optimizes the rejoin point and trajectory duration to generate a smooth and safe local rejoining trajectory. To avoid local trajectory switching in different directions during high-frequency replanning, we propose a topology-preserving path search algorithm based on kinodynamic RRT*. To satisfy dynamic constraints, avoid delays, and achieve a smooth rejoin of the reference trajectory, we propose an optimization-based approach to refine the initial trajectory. The simulation results confirm that our proposed topological-consistency and flexible-optimization methods can reduce the risk of local trajectory switching and decrease the obstacle-avoidance delay when tracking the reference trajectory. We also conduct real-world experiments in challenging environments and verify the effectiveness of our method.

NeurIPS Conference 2024 Conference Paper

GraphMorph: Tubular Structure Extraction by Morphing Predicted Graphs

  • Zhao Zhang
  • Ziwei Zhao
  • Dong Wang
  • Liwei Wang

Accurately restoring topology is both challenging and crucial in tubular structure extraction tasks, such as blood vessel segmentation and road network extraction. Diverging from traditional approaches based on pixel-level classification, our proposed method, named GraphMorph, focuses on branch-level features of tubular structures to achieve more topologically accurate predictions. GraphMorph comprises two main components: a Graph Decoder and a Morph Module. Utilizing multi-scale features extracted from an image patch by the segmentation network, the Graph Decoder facilitates the learning of branch-level features and generates a graph that accurately represents the tubular structure in this patch. The Morph Module processes two primary inputs: the graph and the centerline probability map, provided by the Graph Decoder and the segmentation network, respectively. Employing a novel SkeletonDijkstra algorithm, the Morph Module produces a centerline mask that aligns with the predicted graph. Furthermore, we observe that employing centerline masks predicted by GraphMorph significantly reduces false positives in the segmentation task, which is achieved by a simple yet effective post-processing strategy. The efficacy of our method in the centerline extraction and segmentation tasks has been substantiated through experimental evaluations across various datasets. Source code will be released soon.
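The Morph Module's idea of tracing a centerline through a probability map can be illustrated with plain Dijkstra on a pixel grid, where each step costs 1 − p(pixel) so the path hugs high-probability centerline pixels. This is a generic stand-in under our assumptions; GraphMorph's actual SkeletonDijkstra additionally constrains the search to align with the predicted graph, which this sketch omits.

```python
import heapq
import numpy as np

def centerline_path(prob, start, goal):
    """Dijkstra over a centerline probability map (2D array in [0, 1]).
    Step cost is 1 - p(pixel), so high-probability pixels are nearly free.
    Returns the list of (row, col) pixels from start to goal."""
    h, w = prob.shape
    dist = np.full((h, w), np.inf)
    prev = {}
    dist[start] = 0.0
    pq = [(0.0, start)]
    while pq:
        d, (r, c) = heapq.heappop(pq)
        if (r, c) == goal:
            break
        if d > dist[r, c]:
            continue  # stale queue entry
        for dr, dc in ((1, 0), (-1, 0), (0, 1), (0, -1)):
            nr, nc = r + dr, c + dc
            if 0 <= nr < h and 0 <= nc < w:
                nd = d + (1.0 - prob[nr, nc])
                if nd < dist[nr, nc]:
                    dist[nr, nc] = nd
                    prev[(nr, nc)] = (r, c)
                    heapq.heappush(pq, (nd, (nr, nc)))
    # Reconstruct the path by walking predecessors back from the goal.
    path, node = [], goal
    while node != start:
        path.append(node)
        node = prev[node]
    path.append(start)
    return path[::-1]
```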

AAAI Conference 2024 Conference Paper

Hybrid-SORT: Weak Cues Matter for Online Multi-Object Tracking

  • Mingzhan Yang
  • Guangxin Han
  • Bin Yan
  • Wenhua Zhang
  • Jinqing Qi
  • Huchuan Lu
  • Dong Wang

Multi-Object Tracking (MOT) aims to detect and associate all desired objects across frames. Most methods accomplish the task by explicitly or implicitly leveraging strong cues (i.e., spatial and appearance information), which exhibit powerful instance-level discrimination. However, when object occlusion and clustering occur, spatial and appearance information will become ambiguous simultaneously due to the high overlap among objects. In this paper, we demonstrate this long-standing challenge in MOT can be efficiently and effectively resolved by incorporating weak cues to compensate for strong cues. Along with velocity direction, we introduce the confidence and height state as potential weak cues. With superior performance, our method still maintains Simple, Online and Real-Time (SORT) characteristics. Also, our method shows strong generalization for diverse trackers and scenarios in a plug-and-play and training-free manner. Significant and consistent improvements are observed when applying our method to 5 different representative trackers. Further, with both strong and weak cues, our method Hybrid-SORT achieves superior performance on diverse benchmarks, including MOT17, MOT20, and especially DanceTrack where interaction and severe occlusion frequently happen with complex motions. The code and models are available at https://github.com/ymzis69/HybridSORT.
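The cue-mixing idea above can be sketched as an association cost that adds weak-cue penalties to the usual IoU term, so that when boxes overlap heavily the confidence and height gaps break ties. The weights and the exact form of the weak-cue terms below are our illustrative assumptions, not Hybrid-SORT's actual formulation.

```python
import numpy as np

def iou(a, b):
    """IoU of two boxes given as (x1, y1, x2, y2)."""
    x1, y1 = max(a[0], b[0]), max(a[1], b[1])
    x2, y2 = min(a[2], b[2]), min(a[3], b[3])
    inter = max(0.0, x2 - x1) * max(0.0, y2 - y1)
    area = lambda bx: (bx[2] - bx[0]) * (bx[3] - bx[1])
    return inter / (area(a) + area(b) - inter + 1e-9)

def association_cost(tracks, dets, w_conf=0.1, w_h=0.1):
    """Cost matrix mixing a strong cue (1 - IoU) with weak cues
    (confidence gap, relative height gap). tracks/dets are lists of
    dicts with 'box' (x1, y1, x2, y2) and 'conf'."""
    C = np.zeros((len(tracks), len(dets)))
    for i, t in enumerate(tracks):
        for j, d in enumerate(dets):
            strong = 1.0 - iou(t["box"], d["box"])
            conf_gap = abs(t["conf"] - d["conf"])
            th = t["box"][3] - t["box"][1]
            dh = d["box"][3] - d["box"][1]
            h_gap = abs(th - dh) / max(th, dh)
            C[i, j] = strong + w_conf * conf_gap + w_h * h_gap
    return C
```

A standard assignment solver (e.g., Hungarian matching) would then be run on `C`; the weak-cue terms matter exactly when several detections have near-identical IoU with a track.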

NeurIPS Conference 2024 Conference Paper

LiveScene: Language Embedding Interactive Radiance Fields for Physical Scene Control and Rendering

  • Delin Qu
  • Qizhi Chen
  • Pingrui Zhang
  • Xianqiang Gao
  • Bin Zhao
  • Zhigang Wang
  • Dong Wang
  • Xuelong Li

This paper scales object-level reconstruction to complex scenes, advancing interactive scene reconstruction. We introduce two datasets, OmniSim and InterReal, featuring 28 scenes with multiple interactive objects. To tackle the challenge of inaccurate interactive motion recovery in complex scenes, we propose LiveScene, a scene-level language-embedded interactive radiance field that efficiently reconstructs and controls multiple objects. By decomposing the interactive scene into local deformable fields, LiveScene enables separate reconstruction of individual object motions, reducing memory consumption. Additionally, our interaction-aware language embedding localizes individual interactive objects, allowing for arbitrary control using natural language. Our approach demonstrates significant superiority in novel view synthesis, interactive scene control, and language grounding performance through extensive experiments. Project page: https://livescenes.github.io.

NeurIPS Conference 2024 Conference Paper

LLMs Can Evolve Continually on Modality for X-Modal Reasoning

  • Jiazuo Yu
  • Haomiao Xiong
  • Lu Zhang
  • Haiwen Diao
  • Yunzhi Zhuge
  • Lanqing Hong
  • Dong Wang
  • Huchuan Lu

Multimodal Large Language Models (MLLMs) have gained significant attention due to their impressive capabilities in multimodal understanding. However, existing methods rely heavily on extensive modal-specific pretraining and joint-modal tuning, leading to significant computational burdens when expanding to new modalities. In this paper, we propose PathWeave, a flexible and scalable framework with modal-path switching and expansion abilities that enables MLLMs to continually evolve on modalities for X-modal reasoning. We leverage the concept of continual learning and develop an incremental training strategy atop pre-trained MLLMs, enabling their expansion to new modalities using uni-modal data, without executing joint-modal pretraining. In detail, a novel Adapter-in-Adapter (AnA) framework is introduced, in which uni-modal and cross-modal adapters are seamlessly integrated to facilitate efficient modality alignment and collaboration. Additionally, an MoE-based gating module is applied between the two types of adapters to further enhance the multimodal interaction. To investigate the proposed method, we establish a challenging benchmark called Continual Learning of Modality (MCL), which consists of high-quality QA data from five distinct modalities: image, video, audio, depth, and point cloud. Extensive experiments demonstrate the effectiveness of the proposed AnA framework on learning plasticity and memory stability during continual learning. Furthermore, PathWeave performs comparably to state-of-the-art MLLMs while reducing parameter-training burdens by 98.73%. Our code is available at https://github.com/JiazuoYu/PathWeave.

IROS Conference 2024 Conference Paper

Multi-Fov-Constrained Trajectory Planning for Multirotor Safe Landing

  • Dong Wang
  • Jingping Wang
  • Suqin He
  • Jinxin Huang
  • Bangyan Zhang
  • Yinian Mao
  • Guoquan Huang 0003
  • Chao Xu 0001

In recent years, multirotors have become increasingly widely used, such as in aerial photography and delivery. Ensuring a safe landing in emergencies is the most basic requirement, and it is important to make full use of all the sensors of the multirotor. To improve the safety of UAV landing in unknown unstructured scenes, this paper proposes a multi-FOV-constrained trajectory planning algorithm. Due to the discontinuity of multi-FOV constraints and the nonlinearity of UAV dynamics, the entire trajectory planning problem is a nonlinear optimization problem with non-convex constraints. To address this problem, our algorithm contains two stages: a multi-FOV-constrained path search algorithm and a safe landing trajectory optimization algorithm. The multi-FOV-constrained path search algorithm is used to generate a safe initial path that satisfies the FOV constraint. Then, the safe landing trajectory optimization algorithm generates a safe trajectory, which considers FOV constraints, dynamics, smoothness, and obstacle avoidance. We conducted simulation experiments and real-world experiments to verify the robustness and effectiveness of our algorithm.

AAAI Conference 2024 Conference Paper

Point-PEFT: Parameter-Efficient Fine-Tuning for 3D Pre-trained Models

  • Yiwen Tang
  • Ray Zhang
  • Zoey Guo
  • Xianzheng Ma
  • Bin Zhao
  • Zhigang Wang
  • Dong Wang
  • Xuelong Li

The popularity of pre-trained large models has revolutionized downstream tasks across diverse fields, such as language, vision, and multi-modality. To minimize the adaptation cost for downstream tasks, many Parameter-Efficient Fine-Tuning (PEFT) techniques are proposed for language and 2D image pre-trained models. However, the specialized PEFT method for 3D pre-trained models is still under-explored. To this end, we introduce Point-PEFT, a novel framework for adapting point cloud pre-trained models with minimal learnable parameters. Specifically, for a pre-trained 3D model, we freeze most of its parameters, and only tune the newly added PEFT modules on downstream tasks, which consist of a Point-prior Prompt and a Geometry-aware Adapter. The Point-prior Prompt adopts a set of learnable prompt tokens, for which we propose to construct a memory bank with domain-specific knowledge, and utilize a parameter-free attention to enhance the prompt tokens. The Geometry-aware Adapter aims to aggregate point cloud features within spatial neighborhoods to capture fine-grained geometric information through local interactions. Extensive experiments indicate that our Point-PEFT can achieve better performance than the full fine-tuning on various downstream tasks, while using only 5% of the trainable parameters, demonstrating the efficiency and effectiveness of our approach. Code is released at https://github.com/Ivan-Tang-3D/Point-PEFT.

EAAI Journal 2024 Journal Article

Relation between fault characteristic frequencies and local interpretability shapley additive explanations for continuous machine health monitoring

  • Tongtong Yan
  • Xueqi Xing
  • Tangbin Xia
  • Dong Wang

Recently, Shapley additive explanations models have been extensively studied to enhance the explainability of artificial intelligence algorithms, yet most work simply uses Shapley additive explanations to rank or measure the importance of different features. In this study, a novel methodology is proposed that studies the relation between fault characteristic frequencies and the Shapley values generated by local interpretability Shapley additive explanations for machine health monitoring. Firstly, a simulation model is introduced to generate vibration signals under different health conditions, and their spectral amplitudes obtained from the Fourier transform are used to investigate the relationship between fault characteristic frequencies and local interpretability Shapley values. Interestingly, it is found that Shapley values can be used to locate fault characteristic frequencies; moreover, most of them are negative in the normal stage and positive in the abnormal stage. Based on this finding and Shapley additive explanations, a health indicator construction methodology is proposed to continuously monitor incipient machine faults. Subsequently, an automatic signal filtering method is proposed to remove burrs and noise in the Shapley values so that fault characteristic frequencies can be clearly revealed for physical fault diagnosis. Two run-to-failure cases demonstrate the effectiveness of the proposed methodology, and its superiority is shown by comparison with existing methods for health indicator construction and fault diagnosis, including sparsity parameters, Hjorth parameters, and the fast Kurtogram. Comparison results show that the proposed health indicator is more sensitive to the time of incipient fault initiation and that interpretable fault diagnosis based on Shapley values performs robustly. This study first sheds light on the relationship between fault characteristic frequencies and Shapley values in the scenario of continuous machine health monitoring and guides practitioners toward Shapley-additive-explanations-based incipient fault detection and diagnosis.
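The sign-based finding above suggests a very simple health indicator: sum only the positive Shapley values across spectral bins, so the indicator stays near zero while the machine is healthy and rises once fault frequencies turn positive. The function below is our minimal sketch of that idea; the paper's actual construction (including the signal filtering step) is not reproduced.

```python
import numpy as np

def health_indicator(shap_values):
    """Per-sample health indicator: sum of positive Shapley values across
    spectral bins. Negative values (typical of the normal stage) contribute
    nothing; positive values (typical of the abnormal stage) accumulate."""
    sv = np.asarray(shap_values)
    return np.clip(sv, 0.0, None).sum(axis=-1)
```

Monitoring then reduces to tracking this scalar over time and flagging the point where it departs from its near-zero baseline.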

AAAI Conference 2024 Conference Paper

X4D-SceneFormer: Enhanced Scene Understanding on 4D Point Cloud Videos through Cross-Modal Knowledge Transfer

  • Linglin Jing
  • Ying Xue
  • Xu Yan
  • Chaoda Zheng
  • Dong Wang
  • Ruimao Zhang
  • Zhigang Wang
  • Hui Fang

The field of 4D point cloud understanding is rapidly developing with the goal of analyzing dynamic 3D point cloud sequences. However, it remains a challenging task due to the sparsity and lack of texture in point clouds. Moreover, the irregularity of point clouds poses a difficulty in aligning temporal information within video sequences. To address these issues, we propose a novel cross-modal knowledge transfer framework, called X4D-SceneFormer. This framework enhances 4D-scene understanding by transferring texture priors from RGB sequences using a Transformer architecture with temporal relationship mining. Specifically, the framework is designed with a dual-branch architecture, consisting of a 4D point cloud transformer and a Gradient-aware Image Transformer (GIT). The GIT combines visual texture and temporal correlation features to offer rich semantics and dynamics for better point cloud representation. During training, we employ multiple knowledge transfer techniques, including temporal consistency losses and masked self-attention, to strengthen the knowledge transfer between modalities. This leads to enhanced performance during inference using single-modal 4D point cloud inputs. Extensive experiments demonstrate the superior performance of our framework on various 4D point cloud video understanding tasks, including action recognition, action segmentation and semantic segmentation. The results achieve 1st places, i.e., 85.3% (+7.9%) accuracy and 47.3% (+5.0%) mIoU for 4D action segmentation and semantic segmentation, on the HOI4D challenge, outperforming previous state-of-the-art by a large margin. We release the code at https://github.com/jinglinglingling/X4D.

AAAI Conference 2023 Conference Paper

A Crowd-AI Collaborative Duo Relational Graph Learning Framework towards Social Impact Aware Photo Classification

  • Yang Zhang
  • Ziyi Kou
  • Lanyu Shang
  • Huimin Zeng
  • Zhenrui Yue
  • Dong Wang

In artificial intelligence (AI), negative social impact (NSI) refers to the negative effect on society resulting from mistakes made by AI agents. While the photo classification problem has been widely studied in the AI community, the NSI caused by photo misclassification is largely ignored due to the lack of quantitative measurements of NSI and effective approaches to reduce it. In this paper, we focus on an NSI-aware photo classification problem where the goal is to develop a novel crowd-AI collaborative learning framework that leverages online crowd workers to quantitatively estimate and effectively reduce the NSI of misclassified photos. Our problem is motivated by the limitations of current NSI-aware photo classification approaches, which either 1) cannot accurately estimate NSI because they simply model it as the semantic difference between the true and misclassified categories or 2) require costly human annotations to estimate the NSI of pairwise class categories. To address these limitations, we develop SocialCrowd, a crowdsourcing-based NSI-aware photo classification framework that explicitly reduces the NSI of photo misclassification by designing a duo relational NSI-aware graph with the NSI estimated by online crowd workers. The evaluation results on two large-scale image datasets show that SocialCrowd not only reduces the NSI of photo misclassification but also improves the classification accuracy on both datasets.

NeurIPS Conference 2023 Conference Paper

Cross-Domain Policy Adaptation via Value-Guided Data Filtering

  • Kang Xu
  • Chenjia Bai
  • Xiaoteng Ma
  • Dong Wang
  • Bin Zhao
  • Zhen Wang
  • Xuelong Li
  • Wei Li

Generalizing policies across different domains with dynamics mismatch poses a significant challenge in reinforcement learning. For example, a robot may learn its policy in a simulator, but when it is deployed in the real world, the dynamics of the environment may differ. Given source and target domains with dynamics mismatch, we consider the online dynamics adaptation problem, in which the agent can access sufficient source-domain data while online interactions with the target domain are limited. Existing research has attempted to solve the problem from the dynamics discrepancy perspective. In this work, we reveal the limitations of these methods and explore the problem from the value difference perspective via a novel insight on value consistency across domains. Specifically, we present the Value-Guided Data Filtering (VGDF) algorithm, which selectively shares transitions from the source domain based on the proximity of paired value targets across the two domains. Empirical results on various environments with kinematic and morphology shifts demonstrate that our method achieves superior performance compared to prior approaches.
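The filtering rule at the heart of VGDF can be illustrated with a minimal sketch (hypothetical function and variable names; the paper's actual method uses value ensembles and an adaptive sharing criterion rather than a fixed threshold):

```python
import numpy as np

def value_guided_filter(source_batch, v_target_src, v_target_tgt, threshold):
    """Keep source-domain transitions whose paired value targets are close.

    source_batch : list of transitions sampled from the source domain
    v_target_src : np.ndarray, value targets computed under source dynamics
    v_target_tgt : np.ndarray, value targets computed under target dynamics
    threshold    : max allowed |difference| for a transition to be shared
    """
    gap = np.abs(v_target_src - v_target_tgt)
    keep = gap <= threshold
    return [t for t, k in zip(source_batch, keep) if k]

# Toy usage: three transitions; the middle one has inconsistent value targets.
batch = ["t0", "t1", "t2"]
shared = value_guided_filter(batch,
                             np.array([1.0, 5.0, 2.0]),
                             np.array([1.1, 9.0, 2.2]),
                             threshold=0.5)
```

Transitions whose value targets disagree strongly across domains (here `t1`) are withheld from the shared training data.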

AAAI Conference 2023 Conference Paper

Decision-Making Context Interaction Network for Click-Through Rate Prediction

  • Xiang Li
  • Shuwei Chen
  • Jian Dong
  • Jin Zhang
  • Yongkang Wang
  • Xingxing Wang
  • Dong Wang

Click-through rate (CTR) prediction is crucial in recommendation and online advertising systems. Existing methods usually model user behaviors while ignoring the informative context that influences the user's click decision, e.g., click pages and pre-ranking candidates that inform inferences about user interests, leading to suboptimal performance. In this paper, we propose a Decision-Making Context Interaction Network (DCIN), which deploys a carefully designed Context Interaction Unit (CIU) to learn decision-making contexts and thus benefit CTR prediction. In addition, the relationship between different decision-making context sources is explored by the proposed Adaptive Interest Aggregation Unit (AIAU) to further improve CTR prediction. In experiments on public and industrial datasets, DCIN significantly outperforms state-of-the-art methods. Notably, the model obtained improvements of +2.9% CTR, +2.1% CPM, and +1.5% GMV in online A/B testing and now serves the main traffic of the Meituan Waimai advertising system.

NeurIPS Conference 2023 Conference Paper

Diffusion Model is an Effective Planner and Data Synthesizer for Multi-Task Reinforcement Learning

  • Haoran He
  • Chenjia Bai
  • Kang Xu
  • Zhuoran Yang
  • Weinan Zhang
  • Dong Wang
  • Bin Zhao
  • Xuelong Li

Diffusion models have demonstrated highly expressive generative capabilities in vision and NLP. Recent studies in reinforcement learning (RL) have shown that diffusion models are also powerful in modeling complex policies or trajectories in offline datasets. However, these works have been limited to single-task settings where a generalist agent capable of addressing multi-task predicaments is absent. In this paper, we aim to investigate the effectiveness of a single diffusion model in modeling large-scale multi-task offline data, which can be challenging due to diverse and multimodal data distribution. Specifically, we propose the Multi-Task Diffusion Model (MTDiff), a diffusion-based method that incorporates Transformer backbones and prompt learning for generative planning and data synthesis in multi-task offline settings. MTDiff leverages vast amounts of knowledge available in multi-task data and performs implicit knowledge sharing among tasks. For generative planning, we find MTDiff outperforms state-of-the-art algorithms across 50 tasks on Meta-World and 8 maps on Maze2D. For data synthesis, MTDiff generates high-quality data for testing tasks given a single demonstration as a prompt, which enhances the low-quality datasets for even unseen tasks.

AAAI Conference 2023 Conference Paper

Direct Heterogeneous Causal Learning for Resource Allocation Problems in Marketing

  • Hao Zhou
  • Shaoming Li
  • Guibin Jiang
  • Jiaqi Zheng
  • Dong Wang

Marketing is an important mechanism to increase user engagement and improve platform revenue, and heterogeneous causal learning can help develop more effective strategies. Most decision-making problems in marketing can be formulated as resource allocation problems and have been studied for decades. Existing works usually divide the solution procedure into two fully decoupled stages, i.e., machine learning (ML) and operations research (OR): the first stage predicts the model parameters, which are then fed to the optimization in the second stage. However, the error of the predicted parameters in ML is not accounted for in OR, and the series of complex mathematical operations in OR leads to increased accumulated errors. Essentially, improved precision of the predicted parameters may not correlate positively with the quality of the final solution, due to this side effect of the decoupled design. In this paper, we propose a novel approach for solving resource allocation problems that mitigates these side effects. Our key intuition is to introduce a decision factor that establishes a bridge between ML and OR, such that the solution can be obtained directly in OR by performing only sorting or comparison operations on the decision factor. Furthermore, we design a customized loss function that conducts direct heterogeneous causal learning on the decision factor, an unbiased estimate of which is guaranteed when the loss converges. As a case study, we apply our approach to two crucial problems in marketing: the binary treatment assignment problem and the budget allocation problem with multiple treatments. Both large-scale simulations and online A/B tests demonstrate that our approach achieves significant improvements compared with the state of the art.
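The abstract's key idea, obtaining the OR solution by merely sorting on a learned decision factor, can be sketched as follows (illustrative names; the paper covers richer settings such as multiple treatments per candidate):

```python
def allocate_by_decision_factor(factors, budget_k):
    """Fund the top-k candidates ranked by their learned decision factor.

    Because the decision factor already encodes the causal effect of
    treatment, the OR stage reduces to a sort instead of a full
    optimization solve.
    """
    order = sorted(range(len(factors)), key=lambda i: factors[i], reverse=True)
    return sorted(order[:budget_k])  # indices of the funded candidates

# Four candidates, budget for two treatments
chosen = allocate_by_decision_factor([0.2, 0.9, 0.5, 0.7], budget_k=2)
```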

AAAI Conference 2023 Conference Paper

Dual Memory Aggregation Network for Event-Based Object Detection with Learnable Representation

  • Dongsheng Wang
  • Xu Jia
  • Yang Zhang
  • Xinyu Zhang
  • Yaoyuan Wang
  • Ziyang Zhang
  • Dong Wang
  • Huchuan Lu

Event-based cameras are bio-inspired sensors that capture the brightness change of every pixel in an asynchronous manner. Compared with frame-based sensors, event cameras have microsecond-level latency and high dynamic range, hence showing great potential for object detection under high-speed motion and poor illumination conditions. Due to the sparse and asynchronous nature of event streams, most existing approaches resort to hand-crafted methods to convert event data into a 2D grid representation. However, these methods are sub-optimal in aggregating information from the event stream for object detection. In this work, we propose to learn an event representation optimized for event-based object detection. Specifically, event streams are divided into grids in the x-y-t coordinates for both positive and negative polarity, producing a set of pillars as a 3D tensor representation. To fully exploit the information in event streams for object detection, a dual-memory aggregation network (DMANet) is proposed to leverage both long and short memory along event streams to aggregate effective information. Long memory is encoded in the hidden state of adaptive convLSTMs, while short memory is modeled by computing the spatial-temporal correlation between event pillars at neighboring time intervals. Extensive experiments on the recently released event-based automotive detection dataset demonstrate the effectiveness of the proposed method.
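The x-y-t pillar representation described above can be sketched as a simple per-polarity histogram (hypothetical function name and normalized event coordinates; the paper learns the representation rather than using fixed event counts):

```python
import numpy as np

def events_to_pillars(events, grid=(4, 4, 2)):
    """Histogram an event stream into an x-y-t grid, one channel per polarity.

    events : array of (x, y, t, p) rows with x, y, t normalized to [0, 1)
             and polarity p in {0, 1}; grid gives the bins per axis.
    Returns a tensor of shape (2, gx, gy, gt) of event counts.
    """
    gx, gy, gt = grid
    pillars = np.zeros((2, gx, gy, gt), dtype=np.int64)
    for x, y, t, p in events:
        ix = min(int(x * gx), gx - 1)
        iy = min(int(y * gy), gy - 1)
        it = min(int(t * gt), gt - 1)
        pillars[int(p), ix, iy, it] += 1
    return pillars

# Two positive events in one cell, one negative event elsewhere
evts = [(0.1, 0.2, 0.05, 1), (0.1, 0.2, 0.05, 1), (0.9, 0.9, 0.95, 0)]
vox = events_to_pillars(np.array(evts))
```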

EAAI Journal 2023 Journal Article

Interpretable federated learning for machine condition monitoring: Interpretable average global model as a fault feature library

  • Xiao Feng
  • Dong Wang
  • Bingchang Hou
  • Tongtong Yan

Federated learning (FL) is an emerging technique used to address the two contradictory problems of data silos and data privacy. Different from centralized learning, FL makes it possible to learn a global model while private data are stored locally. Nevertheless, statistical heterogeneity is a major challenge that has not been well addressed in the literature, and the interpretability of the model is often ignored. In this paper, an interpretable FL framework is constructed for machine condition monitoring and fault diagnosis. Since fault characteristic frequencies (FCFs) and their harmonics are closely connected with specific machine fault types, an interpretable local client model is designed to identify the FCFs and their harmonics of different clients. A theoretical investigation of the additivity of local model parameters proves that the local parameters learned in the frequency domain are in fact FCFs and their harmonics, and that their additivity makes it possible to construct a fault feature library, which helps provide different fault information and quickly diagnose fault types. The effectiveness of the method is demonstrated by experiments on two independent bearing run-to-failure datasets.
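The additivity argument, averaging clients' frequency-domain parameters so that the global model doubles as a fault feature library, can be sketched as follows (illustrative names; real FCFs are continuous frequencies rather than spectrum bins):

```python
import numpy as np

def fault_feature_library(client_spectra, top_k=3):
    """Average clients' learned frequency-domain parameters (FL aggregation)
    and return the dominant frequency bins as a fault feature library.

    client_spectra : list of 1-D arrays, one learned spectrum per client.
    """
    global_spectrum = np.mean(np.stack(client_spectra), axis=0)
    peaks = np.argsort(global_spectrum)[::-1][:top_k]
    return global_spectrum, sorted(peaks.tolist())

# Client A's model has learned a peak at bin 2, client B's at bin 5;
# averaging preserves both peaks in the global model.
client_a = np.zeros(8); client_a[2] = 1.0
client_b = np.zeros(8); client_b[5] = 1.0
library, peaks = fault_feature_library([client_a, client_b], top_k=2)
```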

IJCAI Conference 2023 Conference Paper

On Adversarial Robustness of Demographic Fairness in Face Attribute Recognition

  • Huimin Zeng
  • Zhenrui Yue
  • Lanyu Shang
  • Yang Zhang
  • Dong Wang

Demographic fairness has become a critical objective when developing modern visual models for identity-sensitive applications, such as face attribute recognition (FAR). While great efforts have been made to improve the fairness of these models, the adversarial robustness of that fairness (e.g., whether the fairness of the models can still be maintained under potential malicious fairness attacks) is largely unexplored. Therefore, this paper investigates the adversarial robustness of demographic fairness in FAR applications from both attacking and defending perspectives. In particular, we first present a novel fairness attack, which aims at corrupting the demographic fairness of face attribute classifiers. Next, to mitigate the effect of the fairness attack, we design an efficient defense algorithm called robust-fair training. With this defense, face attribute classifiers learn how to combat the bias introduced by the fairness attack. As such, the face attribute classifiers are not only trained to be fair, but the fairness is also robust. Our extensive experimental results show the effectiveness of both our proposed attack and defense methods across various model architectures and FAR applications. We believe our work can serve as a strong baseline for future work on robust and fair AI models.

IJCAI Conference 2023 Conference Paper

On Optimizing Model Generality in AI-based Disaster Damage Assessment: A Subjective Logic-driven Crowd-AI Hybrid Learning Approach

  • Yang Zhang
  • Ruohan Zong
  • Lanyu Shang
  • Huimin Zeng
  • Zhenrui Yue
  • Na Wei
  • Dong Wang

This paper focuses on the AI-based damage assessment (ADA) applications that leverage state-of-the-art AI techniques to automatically assess the disaster damage severity using online social media imagery data, which aligns well with the ''disaster risk reduction'' target under United Nations' Sustainable Development Goals (UN SDGs). This paper studies an ADA model generality problem where the objective is to address the limitation of current ADA solutions that are often optimized only for a single disaster event and lack the generality to provide accurate performance across different disaster events. To address this limitation, we work with domain experts and local community stakeholders in disaster response to develop CollabGeneral, a subjective logic-driven crowd-AI collaborative learning framework that integrates AI and crowdsourced human intelligence into a principled learning framework to address the ADA model generality problem. Extensive experiments on four real-world ADA datasets demonstrate that CollabGeneral consistently outperforms the state-of-the-art baselines by significantly improving the ADA model generality across different disasters.

IROS Conference 2023 Conference Paper

Polynomial-Based Online Planning for Autonomous Drone Racing in Dynamic Environments

  • Qianhao Wang
  • Dong Wang
  • Chao Xu 0001
  • Alan Gao
  • Fei Gao 0011

In recent years, there has been noteworthy advancement in autonomous drone racing. However, the primary focus has been on attaining fast execution times, while scant attention is given to the challenges of dynamic environments. The high-speed nature of racing scenarios, coupled with the potential for unforeseeable environmental alterations, presents stringent requirements for online replanning and its timeliness. For racing in dynamic environments, we propose an online replanning framework with an efficient polynomial trajectory representation. We trade off between aggressive speed and flexible obstacle avoidance based on an optimization approach. Additionally, to ensure safety and precision when crossing intermediate racing waypoints, we formulate these demands as hard constraints during planning. For dynamic obstacles, parallel multi-topology trajectory planning is designed based on engineering considerations to prevent racing time loss due to local optima. The framework was integrated into a quadrotor system and successfully demonstrated at the DJI Robomaster Intelligent UAV Championship, where it completed the racing track and placed first, finishing in less than half the time of the second-place team (https://pro-robomasters-hz-n5i3.oss-cn-hangzhou.aliyuncs.com/sass/event-list.html).

JBHI Journal 2023 Journal Article

Predicting Drug-Disease Associations Through Similarity Network Fusion and Multi-View Feature Projection Representation

  • Shiming Wang
  • Jie Li
  • Dong Wang
  • Dechen Xu
  • Jiahuan Jin
  • Yadong Wang

Predicting drug-disease associations (DDAs) through computational methods has become a prevalent trend in drug development because of their high efficiency and low cost. Existing methods usually focus on constructing heterogeneous networks by collecting multiple data resources to improve prediction ability. However, potential association possibilities of numerous unconfirmed drug-related or disease-related pairs are not sufficiently considered. In this article, we propose a novel computational model to predict new DDAs. First, a heterogeneous network is constructed, including four types of nodes (drugs, targets, cell lines, diseases) and three types of edges (associations, association scores, similarities). Second, an updating and merging-based similarity network fusion method, termed UM-SF, is presented to fuse various similarity networks with diverse weights. Finally, an intermediate layer-mediated multi-view feature projection representation method, termed IM-FP, is proposed to calculate the predicted DDA scores. This method uses multiple association scores to construct multi-view drug features, then projects them into disease space through the intermediate layer, where an intermediate layer similarity constraint is designed to learn the projection matrices. Results of comparative experiments reveal the effectiveness of our innovations. Comparisons with other state-of-the-art models by the 10-fold cross-validation experiment indicate our model's advantage on AUROC and AUPR metrics. Moreover, our proposed model successfully predicted 107 novel high-ranked DDAs.

IJCAI Conference 2022 Conference Paper

Crowd, Expert & AI: A Human-AI Interactive Approach Towards Natural Language Explanation Based COVID-19 Misinformation Detection

  • Ziyi Kou
  • Lanyu Shang
  • Yang Zhang
  • Zhenrui Yue
  • Huimin Zeng
  • Dong Wang

In this paper, we study an explainable COVID-19 misinformation detection problem where the goal is to accurately identify COVID-19 misleading posts on social media and explain the posts with natural language explanations (NLEs). Our problem is motivated by the limitations of current explainable misinformation detection approaches that cannot provide NLEs for COVID-19 posts due to the lack of sufficient professional COVID-19 knowledge for supervision. To address such a limitation, we develop CEA-COVID, a crowd-expert-AI framework that jointly exploits the common logical reasoning ability of online crowd workers and the professional knowledge of COVID-19 experts to effectively generate NLEs for detecting and explaining COVID-19 misinformation. We evaluate CEA-COVID using two public COVID-19 misinformation datasets on social media. Results demonstrate that CEA-COVID outperforms existing explainable misinformation detection models in terms of both explainability and detection accuracy.

IJCAI Conference 2022 Conference Paper

D-DPCC: Deep Dynamic Point Cloud Compression via 3D Motion Prediction

  • Tingyu Fan
  • Linyao Gao
  • Yiling Xu
  • Zhu Li
  • Dong Wang

The non-uniformly distributed nature of the 3D Dynamic Point Cloud (DPC) brings significant challenges to its efficient inter-frame compression. This paper proposes a novel 3D sparse convolution-based Deep Dynamic Point Cloud Compression (D-DPCC) network to compensate and compress the DPC geometry with 3D motion estimation and motion compensation in the feature space. In the proposed D-DPCC network, we design a Multi-scale Motion Fusion (MMF) module to accurately estimate the 3D optical flow between the feature representations of adjacent point cloud frames. Specifically, we utilize a 3D sparse convolution-based encoder to obtain the latent representation for motion estimation in the feature space and introduce the proposed MMF module for fused 3D motion embedding. Besides, for motion compensation, we propose a 3D Adaptively Weighted Interpolation (3DAWI) algorithm with a penalty coefficient to adaptively decrease the impact of distant neighbours. We compress the motion embedding and the residual with a lossy autoencoder-based network. To our knowledge, this is the first work to propose an end-to-end deep dynamic point cloud compression framework. The experimental results show that the proposed D-DPCC framework achieves an average 76% BD-Rate (Bjontegaard Delta Rate) gain against state-of-the-art Video-based Point Cloud Compression (V-PCC) v13 in inter mode.
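The spirit of the 3DAWI motion-compensation step, inverse-distance interpolation with a penalty term that suppresses the contribution of distant neighbour sets, might be sketched as follows (a loose simplification with hypothetical names, not the paper's exact formulation):

```python
import numpy as np

def adaptively_weighted_interp(query, neighbors, feats, penalty=1.0):
    """Inverse-distance feature interpolation that down-weights distant neighbours.

    query     : (3,) query point coordinates
    neighbors : (k, 3) neighbour coordinates
    feats     : (k, d) neighbour feature vectors
    penalty   : added to the normalizer so that far-away neighbour sets
                contribute less overall (a simplification of 3DAWI's
                penalty coefficient).
    """
    d = np.linalg.norm(neighbors - query, axis=1)
    w = 1.0 / (d + 1e-8)                       # inverse-distance weights
    return (w[:, None] * feats).sum(axis=0) / (w.sum() + penalty)
```

With `penalty=0` this reduces to plain inverse-distance weighting; a positive penalty shrinks the interpolated feature toward zero as neighbours get farther away.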

IJCAI Conference 2022 Conference Paper

Feature Dense Relevance Network for Single Image Dehazing

  • Yun Liang
  • Enze Huang
  • Zifeng Zhang
  • Zhuo Su
  • Dong Wang

Existing learning-based dehazing methods do not make full use of non-local information, which makes the restoration of seriously degraded regions very difficult. We propose a novel dehazing network by defining the Feature Dense Relevance module (FDR) and the Shallow Feature Mapping module (SFM). The FDR is built on multi-head attention to construct the dense relationship between different local features in the whole image. It enables the network to restore degraded local regions using non-local information in complex scenes. In addition, the raw distant skip-connection easily leads to artifacts and cannot deal with shallow features effectively. Therefore, we define the SFM by combining the atmospheric scattering model and the distant skip-connection to effectively handle shallow features at different scales. It not only maps degraded textures into clear textures via distant dependence, but also reduces artifacts and color distortions effectively. We introduce contrastive loss and focal frequency loss in the network to obtain a realistic and clear image. Extensive experiments on several synthetic and real-world datasets demonstrate that our network surpasses most state-of-the-art methods.

IJCAI Conference 2022 Conference Paper

MultiQuant: Training Once for Multi-bit Quantization of Neural Networks

  • Ke Xu
  • Qiantai Feng
  • Xingyi Zhang
  • Dong Wang

Quantization has become a popular technique to compress deep neural networks (DNNs) and reduce computational costs, but most prior work focuses on training DNNs at each individual fixed bit-width and accuracy trade-off point. How to produce a model with flexible precision is largely unexplored. This work proposes a multi-bit quantization framework (MultiQuant) that makes the learned DNNs robust to different precision configurations during inference by adopting a Lowest-Random-Highest bit-width co-training method. Meanwhile, we propose an online adaptive label generation strategy to alleviate the vicious competition across different precisions caused by one-hot labels in supernet training. The trained supernet model can be flexibly set to different bit-widths to support dynamic speed and accuracy trade-offs. Furthermore, we adopt a Monte Carlo sampling-based genetic algorithm search strategy, with a quantization-aware accuracy predictor as the evaluation criterion, to incorporate mixed-precision technology into our framework. Experimental results on the ImageNet dataset demonstrate that MultiQuant attains quantization results under different bit-widths comparable to quantization-aware training, without retraining.
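The Lowest-Random-Highest co-training schedule can be sketched as below (hypothetical names; the actual framework couples this sampling with adaptive labels and a shared supernet):

```python
import numpy as np

def quantize_uniform(w, bits):
    """Symmetric uniform quantization of a weight tensor to `bits` bits."""
    qmax = 2 ** (bits - 1) - 1
    scale = np.abs(w).max() / qmax
    return np.round(w / scale) * scale

def sample_bitwidths(lowest=2, highest=8, n_random=2, rng=None):
    """Lowest-Random-Highest sampling: each training step covers the lowest
    and highest bit-widths plus a few random intermediate ones, so the
    supernet stays accurate across the whole precision range."""
    if rng is None:
        rng = np.random.default_rng()
    middle = rng.integers(lowest + 1, highest, size=n_random).tolist()
    return [lowest] + middle + [highest]

bw = sample_bitwidths(rng=np.random.default_rng(0))
q = quantize_uniform(np.array([0.5, -1.0]), bits=8)
```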

IJCAI Conference 2022 Conference Paper

On Attacking Out-Domain Uncertainty Estimation in Deep Neural Networks

  • Huimin Zeng
  • Zhenrui Yue
  • Yang Zhang
  • Ziyi Kou
  • Lanyu Shang
  • Dong Wang

In many applications with real-world consequences, it is crucial to develop reliable uncertainty estimation for the predictions made by AI decision systems. Targeting the goal of estimating uncertainty, various deep neural network (DNN) based uncertainty estimation algorithms have been proposed. However, the robustness of the uncertainty returned by these algorithms has not been systematically explored. In this work, to raise the awareness of the research community on robust uncertainty estimation, we show that state-of-the-art uncertainty estimation algorithms can fail catastrophically under our proposed adversarial attack despite their impressive performance on uncertainty estimation. In particular, we aim at attacking out-domain uncertainty estimation: under our attack, the uncertainty model is fooled into making high-confidence predictions for out-domain data that it originally would have rejected. Extensive experimental results on various benchmark image datasets show that the uncertainty estimated by state-of-the-art methods can be easily corrupted by our attack.

NeurIPS Conference 2022 Conference Paper

Point-M2AE: Multi-scale Masked Autoencoders for Hierarchical Point Cloud Pre-training

  • Renrui Zhang
  • Ziyu Guo
  • Peng Gao
  • Rongyao Fang
  • Bin Zhao
  • Dong Wang
  • Yu Qiao
  • Hongsheng Li

Masked Autoencoders (MAE) have shown great potential in self-supervised pre-training for language and 2D image transformers. However, it remains an open question how to exploit masked autoencoding for learning 3D representations of irregular point clouds. In this paper, we propose Point-M2AE, a strong Multi-scale MAE pre-training framework for hierarchical self-supervised learning of 3D point clouds. Unlike the standard transformer in MAE, we modify the encoder and decoder into pyramid architectures to progressively model spatial geometries and capture both fine-grained and high-level semantics of 3D shapes. For the encoder that downsamples point tokens by stages, we design a multi-scale masking strategy to generate consistent visible regions across scales, and adopt a local spatial self-attention mechanism during fine-tuning to focus on neighboring patterns. By multi-scale token propagation, the lightweight decoder gradually upsamples point tokens with complementary skip connections from the encoder, which further promotes reconstruction from a global-to-local perspective. Extensive experiments demonstrate the state-of-the-art performance of Point-M2AE for 3D representation learning. With a frozen encoder after pre-training, Point-M2AE achieves 92.9% accuracy for a linear SVM on ModelNet40, even surpassing some fully trained methods. By fine-tuning on downstream tasks, Point-M2AE achieves 86.43% accuracy on ScanObjectNN, +3.36% over the second-best, and largely benefits few-shot classification, part segmentation and 3D object detection with the hierarchical pre-training scheme. Code is available at https://github.com/ZrrSkywalker/Point-M2AE.

NeurIPS Conference 2022 Conference Paper

Tight Mutual Information Estimation With Contrastive Fenchel-Legendre Optimization

  • Qing Guo
  • Junya Chen
  • Dong Wang
  • Yuewei Yang
  • Xinwei Deng
  • Jing Huang
  • Larry Carin
  • Fan Li

Successful applications of InfoNCE (Information Noise-Contrastive Estimation) and its variants have popularized the use of contrastive variational mutual information (MI) estimators in machine learning. While featuring superior stability, these estimators crucially depend on costly large-batch training, and they sacrifice bound tightness for variance reduction. To overcome these limitations, we revisit the mathematics of popular variational MI bounds from the lens of unnormalized statistical modeling and convex optimization. Our investigation yields a new unified theoretical framework encompassing popular variational MI bounds, and leads to a novel, simple, and powerful contrastive MI estimator we name FLO. Theoretically, we show that the FLO estimator is tight, and it converges under stochastic gradient descent. Empirically, the proposed FLO estimator overcomes the limitations of its predecessors and learns more efficiently. The utility of FLO is verified using extensive benchmarks, and we further inspire the community with novel applications in meta-learning. Our presentation underscores the foundational importance of variational MI estimation in data-efficient learning.
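The large-batch limitation mentioned above is visible in the plain InfoNCE estimator, whose value saturates at log K for a batch of K sample pairs; a minimal sketch (hypothetical function name, illustrating InfoNCE itself rather than the FLO estimator):

```python
import numpy as np

def infonce_bound(scores):
    """InfoNCE lower bound on mutual information from a K x K critic matrix.

    scores[i, j] = f(x_i, y_j); diagonal entries score the paired samples.
    The bound is mean_i[ s_ii - logsumexp_j(s_ij) ] + log K, and it can
    never exceed log K -- the batch-size ceiling that motivates tighter
    estimators such as FLO.
    """
    k = scores.shape[0]
    logsumexp = np.log(np.exp(scores).sum(axis=1))
    return float(np.mean(np.diag(scores) - logsumexp) + np.log(k))

# Even a perfect critic (huge diagonal scores) cannot push past log K
b = infonce_bound(10.0 * np.eye(4))
```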

IJCAI Conference 2021 Conference Paper

A Streaming End-to-End Framework For Spoken Language Understanding

  • Nihal Potdar
  • Anderson Raymundo Avila
  • Chao Xing
  • Dong Wang
  • Yiran Cao
  • Xiao Chen

End-to-end spoken language understanding (SLU) has recently attracted increasing interest. Compared to the conventional tandem-based approach that combines speech recognition and language understanding as separate modules, the new approach extracts users' intentions directly from the speech signals, resulting in joint optimization and low latency. Such an approach, however, is typically designed to process one intent at a time, which forces users to take multiple rounds to fulfill their requirements while interacting with a dialogue system. In this paper, we propose a streaming end-to-end framework that can process multiple intentions in an online and incremental way. The backbone of our framework is a unidirectional RNN trained with the connectionist temporal classification (CTC) criterion. By this design, an intention can be identified as soon as sufficient evidence has been accumulated, and multiple intentions are identified sequentially. We evaluate our solution on the Fluent Speech Commands (FSC) dataset, and the detection accuracy is about 97% in all multi-intent settings. This result is comparable to the performance of state-of-the-art non-streaming models, but is achieved in an online and incremental way. We also apply our model to a keyword spotting task using the Google Speech Commands dataset, and the results are also highly promising.
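The sequential intent identification enabled by CTC can be illustrated with the standard greedy collapse rule (a generic sketch of CTC decoding, not the paper's actual decoder):

```python
def ctc_greedy_collapse(frame_labels, blank=0):
    """Collapse a stream of per-frame argmax labels CTC-style: merge
    consecutive repeats and drop blanks. Because the rule is causal,
    each intent can be emitted online as soon as its run of non-blank
    labels ends, giving the incremental behavior described above."""
    out, prev = [], None
    for label in frame_labels:
        if label != prev and label != blank:
            out.append(label)
        prev = label
    return out

# Two occurrences of intent 3 and one of intent 5, detected sequentially
intents = ctc_greedy_collapse([0, 3, 3, 0, 3, 5, 5, 0])
```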

AAAI Conference 2021 Conference Paper

eTREE: Learning Tree-structured Embeddings

  • Faisal M. Almutairi
  • Yunlong Wang
  • Dong Wang
  • Emily Zhao
  • Nicholas D. Sidiropoulos

Matrix factorization (MF) plays an important role in a wide range of machine learning and data mining models. MF is commonly used to obtain item embeddings and feature representations due to its ability to capture correlations and higher-order statistical dependencies across dimensions. In many applications, the categories of items exhibit a hierarchical tree structure. For instance, human diseases can be divided into coarse categories, e.g., bacterial and viral. These categories can be further divided into finer categories, e.g., viral infections can be respiratory, gastrointestinal, and exanthematous viral diseases. In e-commerce, products, movies, books, etc., are grouped into hierarchical categories, e.g., clothing items are divided by gender, then by type (formal, casual, etc.). While the tree structure and the categories of the different items may be known in some applications, in many others they have to be learned together with the embeddings. In this work, we propose eTREE, a model that incorporates the (usually ignored) tree structure to enhance the quality of the embeddings. We leverage the special uniqueness properties of Nonnegative MF (NMF) to prove identifiability of eTREE. The proposed model not only exploits the tree structure prior, but also learns the hierarchical clustering in an unsupervised data-driven fashion. We derive an efficient algorithmic solution and a scalable implementation of eTREE that exploits parallel computing, computation caching, and warm start strategies. We showcase the effectiveness of eTREE on real data from various application domains: healthcare, recommender systems, and education. We also demonstrate the meaningfulness of the tree obtained from eTREE through domain experts' interpretation.

IROS Conference 2021 Conference Paper

FAST-Dynamic-Vision: Detection and Tracking Dynamic Objects with Event and Depth Sensing

  • Botao He
  • Haojia Li
  • Siyuan Wu
  • Dong Wang
  • Zhiwei Zhang 0032
  • Qianli Dong
  • Chao Xu 0001
  • Fei Gao 0011

The development of aerial autonomy has enabled aerial robots to fly agilely in complex environments. However, dodging fast-moving objects in flight remains a challenge, limiting the further application of unmanned aerial vehicles (UAVs). The bottleneck in solving this problem is the accurate perception of rapidly moving objects. Recently, event cameras have shown great potential in this area. This paper presents a complete perception system, including ego-motion compensation, object detection, and trajectory prediction for fast-moving dynamic objects, with low latency and high precision. First, we propose an accurate ego-motion compensation algorithm that considers both rotational and translational motion for more robust object detection. Then, for dynamic object detection, an event camera-based efficient regression algorithm is designed. Finally, we propose an optimization-based approach that asynchronously fuses event and depth cameras for trajectory prediction. Extensive real-world experiments and benchmarks are performed to validate our framework. Moreover, our code will be released to benefit related research.

NeurIPS Conference 2021 Conference Paper

KeSpeech: An Open Source Speech Dataset of Mandarin and Its Eight Subdialects

  • Zhiyuan Tang
  • Dong Wang
  • Yanguang Xu
  • Jianwei Sun
  • XiaoNing Lei
  • Shuaijiang Zhao
  • Cheng Wen
  • Xingjun Tan

This paper introduces an open source speech dataset, KeSpeech, which involves 1,542 hours of speech signals recorded by 27,237 speakers in 34 cities in China; the pronunciation includes standard Mandarin and its 8 subdialects. The new dataset possesses several properties. Firstly, the dataset provides multiple labels including content transcription, speaker identity and subdialect, hence supporting a variety of speech processing tasks, such as speech recognition, speaker recognition, and subdialect identification, as well as other advanced techniques like multi-task learning and conditional learning. Secondly, some of the text samples were recorded in parallel in both standard Mandarin and a particular subdialect, allowing for new applications such as subdialect style conversion. Thirdly, the number of speakers is much larger than in other open-source datasets, making it suitable for tasks that require training data from a large number of speakers. Finally, the speech signals were recorded in two phases, which opens the opportunity for the study of the time-variance property of human speech. We present the design principle of the KeSpeech dataset and four baseline systems based on the new data resource: speech recognition, speaker verification, subdialect identification and voice conversion. The dataset is free for all academic usage.

AAAI Conference 2021 Conference Paper

Temporal Relational Modeling with Self-Supervision for Action Segmentation

  • Dong Wang
  • Di Hu
  • Xingjian Li
  • Dejing Dou

Temporal relational modeling in video is essential for human action understanding, such as action recognition and action segmentation. Although Graph Convolution Networks (GCNs) have shown promising advantages in relation reasoning on many tasks, it is still a challenge to apply graph convolution networks to long video sequences effectively. The main reason is that the large number of nodes (i.e., video frames) makes it hard for GCNs to capture and model temporal relations in videos. To tackle this problem, in this paper, we introduce an effective GCN module, the Dilated Temporal Graph Reasoning Module (DTGRM), designed to model temporal relations and dependencies between video frames at various time spans. In particular, we capture and model temporal relations by constructing multi-level dilated temporal graphs where the nodes represent frames from different moments in the video. Moreover, to enhance the temporal reasoning ability of the proposed model, an auxiliary self-supervised task is proposed to encourage the dilated temporal graph reasoning module to find and correct wrong temporal relations in videos. Our DTGRM model outperforms state-of-the-art action segmentation models on three challenging datasets: 50Salads, Georgia Tech Egocentric Activities (GTEA), and the Breakfast dataset. The code is available at https://github.com/redwang/DTGRM.

IJCAI Conference 2021 Conference Paper

User Retention: A Causal Approach with Triple Task Modeling

  • Yang Zhang
  • Dong Wang
  • Qiang Li
  • Yue Shen
  • Ziqi Liu
  • Xiaodong Zeng
  • Zhiqiang Zhang
  • Jinjie Gu

For many Internet companies, improving user retention rate has been an important focus. To achieve this goal, we need to recommend proper services in order to meet the demands of users. Unlike conventional click-through rate (CTR) estimation, there is a lot of noise in the collected data when modeling retention, caused by two major issues: 1) implicit impression-revisit effect: users could revisit the app even if they do not explicitly interact with the recommender system; 2) selection bias: the recommender system suffers from selection bias caused by users' self-selection. To address the above challenges, we propose a novel method named UR-IPW (User Retention Modeling with Inverse Propensity Weighting), which 1) makes full use of both explicit and implicit interactions in the observed data, and 2) models revisit-rate estimation from a causal perspective, accounting for the selection bias problem. Experiments in both offline and online environments from different scenarios demonstrate the superiority of UR-IPW over previous methods. To the best of our knowledge, this is the first work to model user retention by estimating the revisit rate from a causal perspective.

AAAI Conference 2020 Conference Paper

Crowd-Assisted Disaster Scene Assessment with Human-AI Interactive Attention

  • Daniel (Yue) Zhang
  • Yifeng Huang
  • Yang Zhang
  • Dong Wang

The recent advances of mobile sensing and artificial intelligence (AI) have brought new revolutions in disaster response applications. One example is disaster scene assessment (DSA), which leverages computer vision techniques to assess the level of damage severity of disaster events from images provided by eyewitnesses on social media. The assessment results are critical in prioritizing the rescue operations of the response teams. While AI algorithms can significantly reduce the detection time and manual labeling cost in such applications, their performance often falls short of the desired accuracy. Our work is motivated by the emergence of crowdsourcing platforms (e.g., Amazon Mechanical Turk, Waze) that provide unprecedented opportunities for acquiring human intelligence for AI applications. In this paper, we develop an interactive Disaster Scene Assessment (iDSA) scheme that allows AI algorithms to directly interact with humans to identify the salient regions of the disaster images in DSA applications. We also develop new incentive designs and active learning techniques to ensure reliable, timely, and cost-efficient responses from the crowdsourcing platforms. Our evaluation results on real-world case studies during the Nepal and Ecuador earthquake events demonstrate that iDSA can significantly outperform state-of-the-art baselines in accurately assessing the damage of disaster scenes.

AAAI Conference 2020 Conference Paper

Knowledge and Cross-Pair Pattern Guided Semantic Matching for Question Answering

  • Zihan Xu
  • Hai-Tao Zheng
  • Shaopeng Zhai
  • Dong Wang

Semantic matching is a basic problem in natural language processing, but it is far from solved because of the differences between the pairs for matching. In question answering (QA), answer selection (AS) is a popular semantic matching task, usually reformulated as a paraphrase identification (PI) problem. However, QA is different from PI because the question and the answer are not synonymous sentences and not strictly comparable. In this work, a novel knowledge and cross-pair pattern guided semantic matching system (KCG) is proposed, which considers both knowledge and pattern conditions for QA. We apply explicit cross-pair matching based on Graph Convolutional Network (GCN) to help KCG recognize general domain-independent Q-to-A patterns better. And with the incorporation of domain-specific information from knowledge bases (KB), KCG is able to capture and explore various relations within Q-A pairs. Experiments show that KCG is robust against the diversity of Q-A pairs and outperforms the state-of-the-art systems on different answer selection tasks.

IJCAI Conference 2019 Conference Paper

CFM: Convolutional Factorization Machines for Context-Aware Recommendation

  • Xin Xin
  • Bo Chen
  • Xiangnan He
  • Dong Wang
  • Yue Ding
  • Joemon Jose

Factorization Machine (FM) is an effective solution for context-aware recommender systems (CARS) which models second-order feature interactions by inner product. However, it is insufficient to capture high-order and nonlinear interaction signals. While several recent efforts have enhanced FM with neural networks, they assume the embedding dimensions are independent from each other and model high-order interactions in a rather implicit manner. In this paper, we propose Convolutional Factorization Machine (CFM) to address above limitations. Specifically, CFM models second-order interactions with outer product, resulting in ''images'' which capture correlations between embedding dimensions. Then all generated ''images'' are stacked, forming an interaction cube. 3D convolution is applied above it to learn high-order interaction signals in an explicit approach. Besides, we also leverage a self-attention mechanism to perform the pooling of features to reduce time complexity. We conduct extensive experiments on three real-world datasets, demonstrating significant improvement of CFM over competing methods for context-aware top-k recommendation.
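The outer-product "interaction images" this abstract describes can be sketched in a few lines. This is an illustrative toy only: the field count and embedding size are made up, and the real CFM applies 3D convolution and self-attention pooling on top of the cube:

```python
import numpy as np

rng = np.random.default_rng(0)

n_fields, dim = 4, 8                       # e.g. user, item, time, location
E = rng.standard_normal((n_fields, dim))   # one embedding per feature field

# Second-order interaction via outer product: each pair of field
# embeddings yields a dim x dim "image" of dimension-wise correlations,
# instead of the single scalar an inner product would give.
images = [np.outer(E[i], E[j])
          for i in range(n_fields)
          for j in range(i + 1, n_fields)]

# Stack all pairwise images into the interaction cube that a
# 3D convolution would consume.
cube = np.stack(images)                    # shape: (n_pairs, dim, dim)
print(cube.shape)                          # (6, 8, 8) for 4 fields
```

The key contrast with a plain FM is visible here: the inner product `E[i] @ E[j]` would collapse each pair to one number, while the outer product keeps all dimension-pair correlations for the convolution to exploit.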

IJCAI Conference 2019 Conference Paper

Exploiting Persona Information for Diverse Generation of Conversational Responses

  • Haoyu Song
  • Wei-Nan Zhang
  • Yiming Cui
  • Dong Wang
  • Ting Liu

In human conversations, because people keep their personalities in mind, they can easily carry out and maintain a conversation. Given conversational context with persona information, how a chatbot should exploit that information to generate diverse and sustainable conversations is still a non-trivial task. Previous work on persona-based conversational models successfully makes use of predefined persona information and has shown great promise in delivering more realistic responses. However, such models learn with the assumption that, given a source input, there is only one target response, whereas in human conversations there are many appropriate responses to a given input message. In this paper, we propose a memory-augmented architecture to exploit persona information from context, combined with a conditional variational autoencoder model, to generate diverse and sustainable conversations. We evaluate the proposed model on a benchmark persona-chat dataset. Both automatic and human evaluations show that our model can deliver more diverse and more engaging persona-based responses than baseline approaches.

IJCAI Conference 2019 Conference Paper

Latent Distribution Preserving Deep Subspace Clustering

  • Lei Zhou
  • Xiao Bai
  • Dong Wang
  • Xianglong Liu
  • Jun Zhou
  • Edwin Hancock

Subspace clustering is a useful technique for many computer vision applications in which the intrinsic dimension of high-dimensional data is smaller than the ambient dimension. Traditional subspace clustering methods often rely on the self-expressiveness property, which has proven effective for linear subspace clustering. However, they perform unsatisfactorily on real data with complex nonlinear subspaces. More recently, deep autoencoder based subspace clustering methods have achieved success owing to the more powerful representations extracted by the autoencoder network. Unfortunately, these methods, which only consider the reconstruction of the original input data, can hardly guarantee that the latent representation preserves the subspace distribution of the data, which inevitably limits performance in practice. In this paper, we propose a novel deep subspace clustering method based on a latent distribution-preserving autoencoder, which introduces a distribution consistency loss to guide the learning of a distribution-preserving latent representation, and consequently enables a strong capacity for characterizing real-world data for subspace clustering. Experimental results on several public databases show that our method achieves significant improvement compared with state-of-the-art subspace clustering methods.

AAAI Conference 2019 Conference Paper

Memory-Augmented Temporal Dynamic Learning for Action Recognition

  • Yuan Yuan
  • Dong Wang
  • Qi Wang

Human actions captured in video sequences contain two crucial factors for action recognition, i.e., visual appearance and motion dynamics. To model these two aspects, Convolutional and Recurrent Neural Networks (CNNs and RNNs) are adopted in most existing successful methods for recognizing actions. However, CNN based methods are limited in modeling long-term motion dynamics. RNNs are able to learn temporal motion dynamics but lack effective ways to tackle unsteady dynamics in long-duration motion. In this work, we propose a memory-augmented temporal dynamic learning network, which learns to write the most evident information into an external memory module and ignore irrelevant information. In particular, we present a differentiable memory controller that makes a discrete decision on whether the external memory module should be updated with the current feature. The discrete memory controller takes the memory history, context embedding and current feature as inputs and controls the information flow into the external memory module. Additionally, we train this discrete memory controller using a straight-through estimator. We evaluate this end-to-end system on benchmark datasets (UCF101 and HMDB51) for human action recognition. The experimental results show consistent improvements on both datasets over prior works and our baselines.

NeurIPS Conference 2019 Conference Paper

On Fenchel Mini-Max Learning

  • Chenyang Tao
  • Liqun Chen
  • Shuyang Dai
  • Junya Chen
  • Ke Bai
  • Dong Wang
  • Jianfeng Feng
  • Wenlian Lu

Inference, estimation, sampling and likelihood evaluation are four primary goals of probabilistic modeling. Practical considerations often force modeling approaches to make compromises between these objectives. We present a novel probabilistic learning framework, called Fenchel Mini-Max Learning (FML), that accommodates all four desiderata in a flexible and scalable manner. Our derivation is rooted in classical maximum likelihood estimation, and it overcomes a longstanding challenge that prevents unbiased estimation of unnormalized statistical models. By reformulating MLE as a mini-max game, FML enjoys an unbiased training objective that (i) does not explicitly involve the intractable normalizing constant and (ii) is directly amenable to stochastic gradient descent optimization. To demonstrate the utility of the proposed approach, we consider learning unnormalized statistical models, nonparametric density estimation and training generative models, with encouraging empirical results presented.

NeurIPS Conference 2018 Conference Paper

BRITS: Bidirectional Recurrent Imputation for Time Series

  • Wei Cao
  • Dong Wang
  • Jian Li
  • Hao Zhou
  • Lei Li
  • Yitan Li

Time series are widely used as signals in many classification/regression tasks. It is ubiquitous that time series contain many missing values. Given multiple correlated time series, how can we fill in the missing values and predict their class labels? Existing imputation methods often impose strong assumptions on the underlying data generating process, such as linear dynamics in the state space. In this paper, we propose BRITS, a novel method based on recurrent neural networks for missing value imputation in time series data. Our proposed method directly learns the missing values in a bidirectional recurrent dynamical system, without any specific assumption. The imputed values are treated as variables of the RNN graph and can be effectively updated during backpropagation. BRITS has three advantages: (a) it can handle multiple correlated missing values in time series; (b) it generalizes to time series with underlying nonlinear dynamics; (c) it provides a data-driven imputation procedure and applies to general settings with missing data. We evaluate our model on three real-world datasets, including an air quality dataset, a health-care dataset, and a localization dataset for human activity. Experiments show that our model outperforms the state-of-the-art methods in both imputation and classification/regression accuracy.
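The bidirectional idea in BRITS can be illustrated with a much simpler stand-in: estimate each gap once from the past and once from the future, then combine the two. This toy sketch uses last/next-observation carry instead of the learned recurrent dynamics in the actual model:

```python
def fill(xs):
    """Carry the last observed value forward over None gaps."""
    out, last = [], None
    for v in xs:
        if v is not None:
            last = v
        out.append(last)
    return out

x = [1.0, None, None, 4.0, None, 6.0]

fwd = fill(x)                    # estimate each gap from the past
bwd = fill(x[::-1])[::-1]        # estimate each gap from the future

# Combine the two directional estimates, analogous to how BRITS
# merges its forward and backward RNN states.
imputed = [v if v is not None else (f + b) / 2
           for v, f, b in zip(x, fwd, bwd)]
# imputed == [1.0, 2.5, 2.5, 4.0, 5.0, 6.0]
```

BRITS replaces the naive carry with learned recurrent estimates and backpropagates through the imputed values themselves, but the two-direction-then-merge structure is the same.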

AAAI Conference 2018 Conference Paper

When Will You Arrive? Estimating Travel Time Based on Deep Neural Networks

  • Dong Wang
  • Junbo Zhang
  • Wei Cao
  • Jian Li
  • Yu Zheng

Estimating the travel time of any path (denoted by a sequence of connected road segments) in a city is of great importance to traffic monitoring, route planning, ridesharing, taxi/Uber dispatching, etc. However, it is a very challenging problem, affected by diverse complex factors, including spatial correlations, temporal dependencies, and external conditions (e.g., weather, traffic lights). Prior work usually focuses on estimating the travel times of individual road segments or sub-paths and then summing up these times, which leads to an inaccurate estimation because such approaches do not consider road intersections/traffic lights, and local errors may accumulate. To address these issues, we propose an end-to-end Deep learning framework for Travel Time Estimation (called DeepTTE) that estimates the travel time of the whole path directly. More specifically, we present a geo-convolution operation that integrates geographic information into the classical convolution, capable of capturing spatial correlations. By stacking a recurrent unit on the geo-convolution layer, our DeepTTE can capture the temporal dependencies as well. A multi-task learning component is added on top of DeepTTE, which learns to estimate the travel time of both the entire path and each local path simultaneously during the training phase. Extensive experiments on two trajectory datasets show that our DeepTTE significantly outperforms the state-of-the-art methods.
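The geo-convolution idea can be sketched roughly as: attach a local geographic-distance feature to each GPS point, then slide a filter along the point sequence. This is an illustrative toy with made-up sizes and a Euclidean distance stand-in, not DeepTTE's actual operator, which maps coordinates through a learned nonlinearity first:

```python
import numpy as np

rng = np.random.default_rng(0)

# A toy path of 10 GPS points (lat, lng), generated as a random walk.
path = np.cumsum(rng.standard_normal((10, 2)) * 0.01, axis=0)

# Distance between consecutive points (Euclidean stand-in for the
# geographic distance a real implementation would compute).
dist = np.linalg.norm(np.diff(path, axis=0), axis=1)
dist = np.append(dist, 0.0)                # pad the last point

# Each point's feature vector: its coordinates plus the local distance.
feats = np.hstack([path, dist[:, None]])   # shape (10, 3)

# "Geo-convolution" sketch: a 1D convolution of kernel size 3 over the
# point sequence, mixing coordinates with local distance at each window.
k = 3
W = rng.standard_normal((k, feats.shape[1]))
conv = np.array([np.sum(W * feats[i:i + k])
                 for i in range(len(feats) - k + 1)])
print(conv.shape)                          # one response per window: (8,)
```

In DeepTTE the resulting sequence of local spatial features is then fed to a recurrent layer to capture temporal dependencies along the path.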

TIST Journal 2017 Journal Article

An Unsupervised Approach to Inferring the Localness of People Using Incomplete Geotemporal Online Check-In Data

  • Chao Huang
  • Dong Wang
  • Jun Tao

Inferring the localness of people is to classify people who are local residents of a city versus people who visit the city, by analyzing online check-in points that are contributed by online users. This information is critical for urban planning, user profiling, and localized recommendation systems. Supervised learning approaches have been developed to infer the localness of people in a city by assuming the availability of high-quality training datasets with complete geotemporal information. In this article, we develop an unsupervised model to accurately identify local people in a city by using the incomplete online check-in data that are publicly available. In particular, we develop an incomplete geotemporal expectation maximization (IGT-EM) scheme, which incorporates a set of hidden variables to represent the localness of people and a set of estimation parameters to represent the likelihood of venues to attract local and nonlocal people, respectively. Our solution can accurately classify local people from nonlocal ones without requiring any training data. We also implement a parallel IGT-EM algorithm by leveraging the computing power of a graphics processing unit (GPU) that consists of 2,496 cores. In the evaluation, we compare our new approach with existing solutions through four real-world case studies using data from New York City, Chicago, Boston, and Washington, DC. The results show that our approach can identify local people and significantly outperform the compared baselines in estimation accuracy and execution time.

IJCAI Conference 2016 Conference Paper

Chinese Song Iambics Generation with Neural Attention-Based Model

  • Qixin Wang
  • Tianyi Luo
  • Dong Wang
  • Chao Xing

Learning and generating Chinese poems is a charming yet challenging task. Traditional approaches involve various language modeling and machine translation techniques; however, they do not perform as well when generating poems with complex pattern constraints, for example Song iambics, a famous type of poem that involves variable-length sentences and strict rhythmic patterns. This paper applies the attention-based sequence-to-sequence model to generate Chinese Song iambics. Specifically, we encode the cue sentences with a bi-directional Long Short-Term Memory (LSTM) model and then predict the entire iambic from the information provided by the encoder, in the form of an attention-based LSTM that can regularize the generation process by the fine structure of the input cues. Several techniques are investigated to improve the model, including global context integration, hybrid style training, character vector initialization and adaptation. Both the automatic and subjective evaluation results show that our model can indeed learn the complex structural and rhythmic patterns of Song iambics, and the generation is rather successful.

AAAI Conference 2014 Conference Paper

Robust Distance Metric Learning in the Presence of Label Noise

  • Dong Wang
  • Xiaoyang Tan

Many distance metric learning algorithms have been developed in recent years. However, few of them consider the case in which the class labels of the training data are noisy, which may lead to serious performance deterioration. In this paper, we present a robust distance learning method in the presence of label noise, by extending a previous non-parametric discriminative distance learning algorithm, i.e., Neighbourhood Components Analysis (NCA). In particular, we analyze the effect of label noise on the derivative of the likelihood with respect to the transformation matrix, and propose to model the conditional probability of the true label of each point so as to reduce that effect. The model is then optimized within the EM framework, with additional regularization used to avoid overfitting. Our experiments on several UCI datasets and a real dataset with unknown noise patterns show that the proposed RNCA is more tolerant to class label noise than the original NCA method.

EAAI Journal 2011 Journal Article

Effective recognition of MCCs in mammograms using an improved neural classifier

  • Jinchang Ren
  • Dong Wang
  • Jianmin Jiang

Computer-aided diagnosis is one of the most important engineering applications of artificial intelligence. In this paper, early detection of breast cancer through classification of microcalcification clusters from mammograms is emphasized. Although the artificial neural network (ANN) has been widely applied in this area, the average accuracy achieved is only around 80% in terms of the area under the receiver operating characteristic curve, A_z. This performance may become much worse when the training samples are imbalanced. As a result, an improved neural classifier is proposed, in which balanced learning and optimized decision making are introduced to enable effective learning from imbalanced samples. When the proposed learning strategy is applied to individual classifiers, the results on the DDSM database demonstrate that the performance has been significantly improved. An average improvement of more than 10% in the F1 score and A_z measurements fully validates the effectiveness of our proposed method for the successful classification of clustered microcalcifications.

SAT Conference 2003 Conference Paper

SAT Based Predicate Abstraction for Hardware Verification

  • Edmund M. Clarke
  • Muralidhar Talupur
  • Helmut Veith
  • Dong Wang

Predicate abstraction is an important technique for extracting compact finite state models from large or infinite state systems. Predicate abstraction uses decision procedures to compute a model which is amenable to model checking, and has been used successfully for software verification. Little work, however, has been done on applying predicate abstraction to large scale finite state systems, most notably hardware, where the decision procedures are SAT solvers. We consider predicate abstraction for hardware in the framework of Counterexample-Guided Abstraction Refinement, where in the course of verification the abstract model has to be repeatedly refined. The goal of the refinement is to eliminate spurious behavior in the abstract model which is not present in the original model and gives rise to false negatives (spurious counterexamples). In this paper, we present two efficient SAT-based algorithms to refine abstract hardware models, dealing with spurious transitions and spurious counterexamples respectively. Both algorithms make use of the conflict graphs generated by SAT solvers. The first algorithm extracts constraints from the conflict graphs which are used to make the abstract model more accurate. Once an abstract transition is determined to be spurious, our algorithm does not need to make any additional calls to the SAT solver. Our second algorithm generates a compact predicate which eliminates a spurious counterexample. This algorithm uses the conflict graphs to identify the important concrete variables that render the counterexample spurious, creates an additional predicate over these concrete variables, and adds it to the abstract model. Experiments over hardware designs with several thousand registers demonstrate the effectiveness of our methods.