Arrow Research search

Author name cluster

He Wang

Possible papers associated with this exact author name in Arrow. This page groups case-insensitive exact name matches and is not a full identity disambiguation profile.

43 papers
2 author rows

Possible papers

43

JBHI Journal 2026 Journal Article

Federated Spatial Prior-Based Source-Free Domain Adaptation for White Matter Hyperintensities Segmentation

  • Yu Cheng
  • Yuxiang Dai
  • Rencheng Zheng
  • Beini Fei
  • Hui Zhang
  • Xinran Wu
  • Boyu Zhang
  • Haoran Peng

White matter hyperintensities (WMH) are important imaging biomarkers for cerebral small vessel disease, and their automatic segmentation across data with different distributions is crucial for assessing brain health and supporting diagnosis. However, cross-domain WMH segmentation remains challenging in privacy-sensitive and label-scarce clinical settings. Existing methods either relied on source domain data, violating privacy constraints, or lacked spatial guidance, which resulted in poor generalization, such as low sensitivity to small lesions. To address these challenges, we developed a source-free domain adaptation (SFDA) framework enhanced by federated spatial prior modeling. Our method used a dual-path pseudo-label generator that leveraged spatial priors to improve boundary accuracy and enhance the detection of small lesions. These priors were optimized via federated learning across multiple sites without sharing raw data, boosting model generalization while preserving privacy. The model was then fine-tuned using refined pseudo-labels. Experimental results demonstrated that our method consistently outperforms state-of-the-art UDA and SFDA methods, achieving 3–10% DSC improvement in most sites across 3 public and 7 private datasets. It also showed superior performance in small lesion detection and boundary delineation. Our method offered a robust, privacy-preserving solution for WMH segmentation and provided valuable support for early diagnosis and risk assessment of cerebrovascular diseases.

AAAI Conference 2026 Conference Paper

PC-Flow: Preference Alignment in Flow Matching via Classifier

  • Shaomeng Wang
  • He Wang
  • Longquan Dai
  • Jinhui Tang

Flow Matching (FM) is an efficient generative modeling framework, but aligning it with human preferences remains underexplored. Although applying Direct Preference Optimization (DPO) to diffusion models has yielded improvements, directly extending DPO-like methods to FM poses three challenges: 1) Incompatibility with ODE-based models, 2) Heavy computational cost from full model fine-tuning, and 3) Reliance on reference model quality. To address these limitations, we propose Preference Classifier for Flow Matching (PC-Flow), a novel reference-free preference alignment framework. Specifically, we reinterpret FM’s deterministic ODE as an equivalent SDE to enable DPO-style learning. Then, we introduce a lightweight classifier to model relative preferences exclusively. This approach decouples alignment from the generative model, eliminating the need for costly fine-tuning or a reference model. Theoretically, PC-Flow guarantees consistent preference-guided distribution evolution, achieves a DPO-equivalent objective without a reference model, and progressively steers generation toward preferred outputs. Experiments show that PC-Flow achieves DPO-level alignment with significantly lower training costs.

IJCAI Conference 2025 Conference Paper

AccCtr: Accelerating Training-Free Conditional Control For Diffusion Models

  • Longquan Dai
  • He Wang
  • Yiming Zhang
  • Shaomeng Wang
  • Jinhui Tang

In current training-free Conditional Diffusion Models (CDM), the sampling process is steered by the gradient, which measures the discrepancy between the guidance and the condition extracted by a pre-trained condition extraction network. These methods necessitate small guidance steps, resulting in longer sampling times. To address the issue of slow sampling, we introduce AccCtr, a method that simplifies the conditional sampling algorithm by maximizing the sum of two objectives. The local maximum set of one objective is contained within the local maximum set of the other. Leveraging this relationship, we decompose the joint optimization into two parts, alternately maximizing each objective. By analyzing the steps involved in optimizing these objectives, we identify the most time-consuming steps and recommend retraining the condition extraction network—a relatively simple task—to reduce its computational cost. Integrating AccCtr into current CDMs is a seamless task that does not impose a significant computational burden. Extensive testing has demonstrated that AccCtr offers superior sample quality and faster generation times.

JBHI Journal 2025 Journal Article

Aleatoric-Uncertainty-Aware Maximum Intensity Projection-Based GAN for 7T-Like Generation From 3T TOF-MRA

  • Wei Tang
  • Yuxiang Dai
  • Boyu Zhang
  • Zhang Shi
  • Ying-Hua Chu
  • Peixian Zhuang
  • Dinggang Shen
  • Chengyan Wang

Time-of-flight magnetic resonance angiography (TOF-MRA) is a prevalent vascular imaging technique for assessing cerebrovascular diseases. Compared to routine 3T TOF-MRA, 7T TOF-MRA provides vascular structures with a higher signal-to-noise ratio (SNR) and better vessel contrast, revealing greater vascular details. However, the inaccessibility of 7T scanners and specific physiological and technical concerns limit its clinical application. Therefore, we aimed to generate high-quality 7T-like TOF-MRA from 3T TOF-MRA. Considering the spatial sparsity of vessel signals, the visibility discrepancy of distal and small vessels between 3T and 7T images, and the subtle spatial misalignment between paired data, we proposed a novel aleatoric-uncertainty-aware maximum intensity projection-based generative adversarial network (AU-MIPGAN). In our method, we employed a knowledge distillation (KD) framework to incorporate multi-directional MIP information into the 3T-to-7T learning process to strengthen the learning of vessels and provide three-dimensional (3D) vascular morphological knowledge for the student model, facilitating accurate generation of vascular structures. Furthermore, we exploited AU modeling to compensate for the spatial misalignment between paired 3T and 7T images during the training procedure, which helped the model concentrate more on learning the intrinsic gap between 3T and 7T images. Qualitative and quantitative results demonstrated that the proposed AU-MIPGAN can achieve promising performance for 7T-like TOF-MRA generation.

ICRA Conference 2025 Conference Paper

BODex: Scalable and Efficient Robotic Dexterous Grasp Synthesis Using Bilevel Optimization

  • Jiayi Chen
  • Yubin Ke
  • He Wang

Robotic dexterous grasping is important for interacting with the environment. To unleash the potential of data-driven models for dexterous grasping, a large-scale, high-quality dataset is essential. While gradient-based optimization offers a promising way for constructing such datasets, previous works suffer from limitations, such as inefficiency, strong assumptions in the grasp quality energy, or limited object sets for experiments. Moreover, the lack of a standard benchmark for comparing different methods and datasets hinders progress in this field. To address these challenges, we develop a highly efficient synthesis system and a comprehensive benchmark with MuJoCo for dexterous grasping. We formulate grasp synthesis as a bilevel optimization problem, combining a novel lower-level quadratic programming (QP) with an upper-level gradient descent process. By leveraging recent advances in CUDA-accelerated robotic libraries and GPU-based QP solvers, our system can parallelize thousands of grasps and synthesize over 49 grasps per second on a single 3090 GPU. Our synthesized grasps for Shadow, Allegro, and Leap hands all achieve a success rate above 75% in simulation, with a penetration depth under 1 mm, outperforming existing baselines on nearly all metrics. Compared to the previous large-scale dataset, DexGraspNet, our dataset significantly improves the performance of learning models, with a success rate from around 40% to 80% in simulation. Real-world testing of the trained model on the Shadow Hand achieves an 81% success rate across 20 diverse objects. The codes and datasets are released on our project page: https://pku-epic.github.io/BODex.

NeurIPS Conference 2025 Conference Paper

DreamVLA: A Vision-Language-Action Model Dreamed with Comprehensive World Knowledge

  • Wenyao Zhang
  • Hongsi Liu
  • Zekun Qi
  • Yunnan Wang
  • XinQiang Yu
  • Jiazhao Zhang
  • Runpei Dong
  • Jiawei He

Recent advances in vision-language-action (VLA) models have shown promise in integrating image generation with action prediction to improve generalization and reasoning in robot manipulation. However, existing methods are limited to challenging image-based forecasting, which suffers from redundant information and lacks comprehensive and critical world knowledge, including dynamic, spatial and semantic information. To address these limitations, we propose DreamVLA, a novel VLA framework that integrates comprehensive world knowledge forecasting to enable inverse dynamics modeling, thereby establishing a perception-prediction-action loop for manipulation tasks. Specifically, DreamVLA introduces a dynamic-region-guided world knowledge prediction, integrated with spatial and semantic cues, which provides compact yet comprehensive representations for action planning. This design aligns with how humans interact with the world by first forming abstract multimodal reasoning chains before acting. To mitigate interference among the dynamic, spatial and semantic information during training, we adopt a block-wise structured attention mechanism that masks their mutual attention, preventing information leakage and keeping each representation clean and disentangled. Moreover, to model the conditional distribution over future actions, we employ a diffusion-based transformer that disentangles action representations from shared latent features. Extensive experiments in both real-world and simulation environments demonstrate that DreamVLA achieves a 76.7% success rate on real robot tasks and a 4.44 average length on the CALVIN ABC-D benchmark.

AAAI Conference 2025 Conference Paper

Effects of Momentum in Implicit Bias of Gradient Flow for Diagonal Linear Networks

  • Bochen Lyu
  • He Wang
  • Zheng Wang
  • Zhanxing Zhu

This paper targets the regularization effect of momentum-based methods in regression settings and analyzes the popular diagonal linear networks to precisely characterize the implicit bias of continuous versions of heavy-ball (HB) and Nesterov's method of accelerated gradients (NAG). We show that HB and NAG exhibit a different implicit bias from GD for diagonal linear networks, in contrast to the classic linear regression problem, where momentum-based methods share the same implicit bias as GD. Specifically, the role of momentum in the implicit bias of GD is twofold: (a) HB and NAG induce extra initialization mitigation effects similar to SGD that are beneficial for generalization of sparse regression; (b) the implicit regularization effects of HB and NAG also depend on the initialization of gradients explicitly, which may not be benign for generalization. As a result, whether HB and NAG have better generalization properties than GD jointly depends on the aforementioned twofold effects determined by various parameters such as learning rate, momentum factor, and integral of gradients. Our findings highlight the potential beneficial role of momentum and can help understand its advantages in practice, such as when it will lead to better generalization performance.

AAAI Conference 2025 Conference Paper

EMControl: Adding Conditional Control to Text-to-Image Diffusion Models via Expectation-Maximization

  • He Wang
  • Longquan Dai
  • Jinhui Tang

Recent advances in diffusion models focus on efficiently handling conditional generative tasks without extra training. The process involves decomposing the result into two components: 1. the unconditional sample, generated in the absence of conditions; 2. the condition correction, which adjusts the unconditional sample to include the guidance image. This adjustment is quantified by a pixel-level measure, where the latent is decoded back into a pixel image, and the forward operator translates the noisy image into the guidance domain for comparison with the guidance image. To enhance the fidelity of condition correction, we propose a learnable latent forward operator, focusing on latent-space consistency with the expectation that this latent-space consistency approximates the pixel-level fidelity measure. The encoder translates the guidance image into the latent space, and a correctional operator is proposed to rectify model mismatching in the latent guidance model. The determination of the condition term and the correction estimation is akin to solving a blind inverse problem. Our EMControl employs the Expectation-Maximization (EM) algorithm to solve the blind inverse problem during the reverse sampling process. This technique ensures that samples, once consistent with the guidance, are accurately mapped back onto the noisy data manifold, adhering to the data's inherent distribution. EMControl has proven its effectiveness by delivering superior performance in conditional diffusion generation tasks compared to previous approaches. Moreover, its application to multiple-condition scenarios underscores its versatility and robustness across a range of generative tasks.

AAAI Conference 2025 Conference Paper

GenesisTex2: Stable, Consistent and High-Quality Text-to-Texture Generation

  • Jiawei Lu
  • YingPeng Zhang
  • Zengjun Zhao
  • He Wang
  • Kun Zhou
  • Tianjia Shao

Large-scale text-guided image diffusion models have demonstrated remarkable results in text-to-image (T2I) generation. However, applying these models to synthesize textures for 3D geometries remains challenging due to the domain gap between 2D images and textures on a 3D surface. Early works that used a projecting-inpainting approach managed to preserve generation diversity, but often resulted in noticeable artifacts and style inconsistencies. While recent methods have attempted to address these inconsistencies, they often introduce other issues, such as blurring, over-saturation, or over-smoothing. To overcome these challenges, we propose a novel text-to-texture synthesis framework that takes advantage of pre-trained diffusion models. We introduce a local attention reweighing mechanism in the self-attention layers to guide the model in focusing on spatial-correlated patches across different views, thereby enhancing local details while preserving cross-view consistency. Additionally, we propose a novel latent space merge pipeline, which further ensures consistency across different viewpoints without sacrificing too much diversity. Our method significantly outperforms existing state-of-the-art techniques in terms of texture consistency and visual quality, while delivering results much faster than distillation-based methods. Importantly, our framework does not require additional training or fine-tuning, making it highly adaptable to a wide range of models available on public platforms.

NeurIPS Conference 2025 Conference Paper

Heavy-Ball Momentum Method in Continuous Time and Discretization Error Analysis

  • Bochen Lyu
  • Xiaojing Zhang
  • Fangyi Zheng
  • He Wang
  • Zheng Wang
  • Zhanxing Zhu

This paper establishes a continuous time approximation, a piece-wise continuous differential equation, for the discrete Heavy-Ball (HB) momentum method with explicit discretization error. Investigating continuous differential equations has been a promising approach for studying discrete optimization methods. Despite the crucial role of momentum in gradient-based optimization methods, the gap between the original dynamics and the continuous time approximations due to the discretization error has not been comprehensively bridged yet. In this work, we study the HB momentum method in continuous time while putting more focus on the discretization error to provide additional theoretical tools to this area. In particular, we design a first-order piece-wise continuous differential equation, where we add a number of counter terms to account for the discretization error explicitly. As a result, we provide a continuous time model for the HB momentum method that allows the control of discretization error to arbitrary order of the learning rate. As an application, we leverage it to find a new implicit regularization of the directional smoothness and investigate the implicit bias of HB for diagonal linear networks, indicating how our results can be used in deep learning. Our theoretical findings are further supported by numerical experiments.

NeurIPS Conference 2025 Conference Paper

SoFar: Language-Grounded Orientation Bridges Spatial Reasoning and Object Manipulation

  • Zekun Qi
  • Wenyao Zhang
  • Yufei Ding
  • Runpei Dong
  • XinQiang Yu
  • Jingwen Li
  • Lingyun Xu
  • Baoyu Li

While spatial reasoning has made progress in object localization relationships, it often overlooks object orientation—a key factor in 6-DoF fine-grained manipulation. Traditional pose representations rely on pre-defined frames or templates, limiting generalization and semantic grounding. In this paper, we introduce the concept of semantic orientation, which defines object orientations using natural language in a reference-frame-free manner (e.g., the "plug-in" direction of a USB or the "handle" direction of a cup). To support this, we construct OrienText300K, a large-scale dataset of 3D objects annotated with semantic orientations, and develop PointSO, a general model for zero-shot semantic orientation prediction. By integrating semantic orientation into VLM agents, our SoFar framework enables 6-DoF spatial reasoning and generates robotic actions. Extensive experiments demonstrate the effectiveness and generalization of our SoFar, e.g., a zero-shot 48.7% success rate on Open6DOR and a zero-shot 74.9% success rate on SIMPLER-Env.

JBHI Journal 2025 Journal Article

Unsupervised Domain Adaptation With Synchronized Self-Training for Cross-Domain Motor Imagery Recognition

  • Peiyin Chen
  • Xiaofeng Liu
  • Chao Ma
  • He Wang
  • Xiong Yang
  • Celso Grebogi
  • Xiao Gu
  • Zhongke Gao

Robust decoding performance is essential for the practical deployment of brain-computer interface (BCI) systems. Existing EEG decoding models often rely on large amounts of annotated data collected through specific experimental setups, which fail to address the heterogeneity of data distributions across different domains. This limitation hinders BCI systems from effectively managing the complexity and variability of real-world data. To overcome these challenges, we propose Synchronized Self-Training Domain Adaptation (SSTDA) for cross-domain motor imagery classification. Specifically, SSTDA leverages labeled signals from a source domain and applies self-training to unlabeled signals from a target domain, enabling the simultaneous training of a more robust classifier. The raw EEG signals are mapped into a latent space by a feature extractor for discriminative representation learning. A domain-shared latent space is then learned by optimizing the feature extractor with both source and target samples, using an easy-to-hard self-training process. We validate the method with extensive experiments on two public motor imagery datasets: Dataset IIa of BCI Competition IV and the High Gamma dataset. In the inter-subject task, our method achieves classification accuracies of 64.43% and 80.40%, respectively. It also outperforms existing methods in the inter-session task. Moreover, we develop a new six-class motor imagery dataset and achieve test accuracies of 77.09% and 80.18% across different datasets. All experimental results demonstrate that our SSTDA outperforms existing algorithms in inter-session, inter-subject, and inter-dataset validation protocols, highlighting its capability to learn discriminative, domain-invariant representations that enhance EEG decoding performance.

JBHI Journal 2025 Journal Article

WiRe-Breath: A Sustainable WiFi-Based Real-Time Respiratory Monitoring Solution

  • Yunpeng Ge
  • He Wang
  • Ivan Wang-Hei Ho

Respiratory monitoring, including respiratory rate monitoring and apnea detection, plays an essential role in daily healthcare, especially for older adults. Existing WiFi-based respiratory detection models lack a systematic device deployment strategy and face limitations in real-time monitoring in complex environments. This paper presents WiRe-Breath, a non-intrusive and real-time respiratory monitoring system based on off-the-shelf WiFi devices. Unlike previous approaches, only periodic beacons from routers are utilized for respiratory analysis in the proposed system, significantly increasing sustainability and reliability for long-term monitoring. To ensure high accuracy, we propose a respiratory sensing model and determine the optimal device deployment strategy. Specifically, we enhance respiratory rate features through effective motion enhancement and respiratory rate extraction approaches. Additionally, an apnea detection algorithm is designed for disease surveillance. A web application is also implemented for real-time respiratory status monitoring. Our experimental results indicate that the proposed system achieves an average accuracy of 98.86% for respiratory rate detection in quiet environments and 97.72% in environments with interference. WiRe-Breath also demonstrates effective apnea detection, fulfills real-time monitoring requirements, and delivers state-of-the-art performance.

ICRA Conference 2024 Conference Paper

ASGrasp: Generalizable Transparent Object Reconstruction and 6-DoF Grasp Detection from RGB-D Active Stereo Camera

  • Jun Shi 0008
  • Yong A
  • Yixiang Jin
  • Dingzhe Li
  • Haoyu Niu 0001
  • Zhezhu Jin
  • He Wang

In this paper, we tackle the problem of grasping transparent and specular objects. This problem is important, yet remains unsolved in robotics, because depth cameras fail to recover the accurate geometry of such objects. For the first time, we propose ASGrasp, a 6-DoF grasp detection network that uses an RGB-D active stereo camera. ASGrasp utilizes a two-layer learning-based stereo network for the purpose of transparent object reconstruction, enabling material-agnostic object grasping in cluttered environments. In contrast to existing RGB-D based grasp detection methods, which heavily depend on depth restoration networks and the quality of depth maps generated by depth cameras, our system distinguishes itself by its ability to directly utilize raw IR and RGB images for transparent object geometry reconstruction. We create an extensive synthetic dataset through domain randomization, which is based on GraspNet-1Billion. Our experiments demonstrate that ASGrasp can achieve over a 90% success rate for generalizable transparent object grasping in both simulation and the real world via seamless sim-to-real transfer. Our method significantly outperforms SOTA networks and even surpasses the performance upper bound set by perfect visible point cloud inputs. Project page: https://pku-epic.github.io/ASGrasp

IROS Conference 2024 Conference Paper

Contrastive Mask Denoising Transformer for 3D Instance Segmentation

  • He Wang
  • Minshen Lin
  • Guofeng Zhang 0001

In transformer-based methods for point cloud instance segmentation, bipartite matching is used to establish one-to-one correspondences between predictions and ground truths. However, in early training stages, matches can be unstable and inconsistent between epochs, requiring the model to frequently adjust its learning path, thus reducing the quality of model convergence. To address this challenge, we propose the contrastive mask denoising transformer for 3D instance segmentation, which utilizes a mask denoising module to guide the model towards a more stable optimization path in early training stages. Furthermore, we introduce a multi-pattern-aware query selection module to help the model learn multiple patterns at one position so that clustered objects can be discerned. In addition, the proposed modules are "plug and play" and can easily be integrated into transformer-based architectures. Experimental results on the ScanNetv2 dataset show that the proposed modules improve the performance of multiple pipelines, notably achieving +1.0 mAP on the main pipeline.

NeurIPS Conference 2024 Conference Paper

Noisy Ostracods: A Fine-Grained, Imbalanced Real-World Dataset for Benchmarking Robust Machine Learning and Label Correction Methods

  • Jiamian Hu
  • Yuanyuan Hong
  • Yihua Chen
  • He Wang
  • Moriaki Yasuhara

We present the Noisy Ostracods, a noisy dataset for genus and species classification of crustacean ostracods with specialists’ annotations. Of the 71466 specimens collected, 5.58% are estimated to be noisy (possibly problematic) at the genus level. The dataset is created to address a real-world challenge: creating a clean fine-grained taxonomy dataset. The Noisy Ostracods dataset has diverse noises from multiple sources. Firstly, the noise is open-set, including new classes discovered during curation that were not part of the original annotation. The dataset has pseudo-classes, where annotators misclassified samples that should belong to an existing class into a new pseudo-class. The Noisy Ostracods dataset is highly imbalanced, with an imbalance factor ρ = 22429. This presents a unique challenge for robust machine learning methods, as existing approaches have not been extensively evaluated on fine-grained classification tasks with such diverse real-world noise. Initial experiments using current robust learning techniques have not yielded significant performance improvements on the Noisy Ostracods dataset compared to cross-entropy training on the raw, noisy data. On the other hand, noise detection methods have underperformed in error hit rate compared to naive cross-validation ensembling for identifying problematic labels. These findings suggest that the fine-grained, imbalanced nature and complex noise characteristics of the dataset present considerable challenges for existing noise-robust algorithms. By openly releasing the Noisy Ostracods dataset, our goal is to encourage further research into the development of noise-resilient machine learning methods capable of effectively handling diverse, real-world noise in fine-grained classification tasks. The dataset, along with its evaluation protocols, can be accessed at https://github.com/H-Jamieu/Noisy_ostracods.

RLJ Journal 2024 Journal Article

Sample Complexity of Offline Distributionally Robust Linear Markov Decision Processes

  • He Wang
  • Laixi Shi
  • Yuejie Chi

In offline reinforcement learning (RL), the absence of active exploration calls for attention on the model robustness to tackle the sim-to-real gap, where the discrepancy between the simulated and deployed environments can significantly undermine the performance of the learned policy. To endow the learned policy with robustness in a sample-efficient manner in the presence of high-dimensional state-action space, this paper considers the sample complexity of distributionally robust linear Markov decision processes (MDPs) with an uncertainty set characterized by the total variation distance using offline data. We develop a pessimistic model-based algorithm and establish its sample complexity bound under minimal data coverage assumptions, which outperforms prior art by at least $\widetilde{O}(d)$, where $d$ is the feature dimension. We further improve the performance guarantee of the proposed algorithm by incorporating a carefully-designed variance estimator.

RLC Conference 2024 Conference Paper

Sample Complexity of Offline Distributionally Robust Linear Markov Decision Processes

  • He Wang
  • Laixi Shi
  • Yuejie Chi

In offline reinforcement learning (RL), the absence of active exploration calls for attention on the model robustness to tackle the sim-to-real gap, where the discrepancy between the simulated and deployed environments can significantly undermine the performance of the learned policy. To endow the learned policy with robustness in a sample-efficient manner in the presence of high-dimensional state-action space, this paper considers the sample complexity of distributionally robust linear Markov decision processes (MDPs) with an uncertainty set characterized by the total variation distance using offline data. We develop a pessimistic model-based algorithm and establish its sample complexity bound under minimal data coverage assumptions, which outperforms prior art by at least $\widetilde{O}(d)$, where $d$ is the feature dimension. We further improve the performance guarantee of the proposed algorithm by incorporating a carefully-designed variance estimator.

AAAI Conference 2024 Conference Paper

SocialCVAE: Predicting Pedestrian Trajectory via Interaction Conditioned Latents

  • Wei Xiang
  • Haoteng YIN
  • He Wang
  • Xiaogang Jin

Pedestrian trajectory prediction is the key technology in many applications for providing insights into human behavior and anticipating human future motions. Most existing empirical models are explicitly formulated by observed human behaviors using explicable mathematical terms with deterministic nature, while recent work has focused on developing hybrid models combined with learning-based techniques for powerful expressiveness while maintaining explainability. However, the deterministic nature of the learned steering behaviors from the empirical models limits the models' practical performance. To address this issue, this work proposes the social conditional variational autoencoder (SocialCVAE) for predicting pedestrian trajectories, which employs a CVAE to explore behavioral uncertainty in human motion decisions. SocialCVAE learns socially reasonable motion randomness by utilizing a socially explainable interaction energy map as the CVAE's condition, which illustrates the future occupancy of each pedestrian's local neighborhood area. The energy map is generated using an energy-based interaction model, which anticipates the energy cost (i.e., repulsion intensity) of pedestrians' interactions with neighbors. Experimental results on two public benchmarks including 25 scenes demonstrate that SocialCVAE significantly improves prediction accuracy compared with the state-of-the-art methods, with up to 16.85% improvement in Average Displacement Error (ADE) and 69.18% improvement in Final Displacement Error (FDE). Code is available at: https://github.com/ViviXiang/SocialCVAE.

ICRA Conference 2024 Conference Paper

STOPNet: Multiview-based 6-DoF Suction Detection for Transparent Objects on Production Lines

  • Yuxuan Kuang
  • Qin Han
  • Danshi Li
  • Qiyu Dai
  • Lian Ding
  • Dong Sun
  • Hanlin Zhao
  • He Wang

In this work, we present STOPNet, a framework for 6-DoF object suction detection on production lines, with a focus on but not limited to transparent objects, which is an important and challenging problem in robotic systems and modern industry. Current methods requiring depth input fail on transparent objects due to depth cameras’ deficiency in sensing their geometry, while we propose a novel framework to reconstruct the scene on the production line from RGB input alone, based on multiview stereo. Compared to existing works, our method not only reconstructs the whole 3D scene in order to obtain high-quality 6-DoF suction poses in real time but also generalizes to novel environments, novel arrangements and novel objects, including challenging transparent objects, both in simulation and the real world. Extensive experiments in simulation and the real world show that our method significantly surpasses the baselines and has better generalizability, which caters to practical industrial needs.

AAAI Conference 2024 Conference Paper

Successive POI Recommendation via Brain-Inspired Spatiotemporal Aware Representation

  • Gehua Ma
  • He Wang
  • Jingyuan Zhao
  • Rui Yan
  • Huajin Tang

Existing approaches usually perform spatiotemporal representation in the spatial and temporal dimensions, respectively, which isolates the spatial and temporal natures of the target and leads to sub-optimal embeddings. Neuroscience research has shown that the mammalian brain entorhinal-hippocampal system provides efficient graph representations for general knowledge. Moreover, entorhinal grid cells present concise spatial representations, while hippocampal place cells represent perception conjunctions effectively. Thus, the entorhinal-hippocampal system provides a novel angle for spatiotemporal representation, which inspires us to propose the SpatioTemporal aware Embedding framework (STE) and apply it to POIs (STEP). STEP considers two types of POI-specific representations: sequential representation and spatiotemporal conjunctive representation, learned using sparse unlabeled data based on the proposed graph-building policies. Notably, STEP jointly represents the spatiotemporal natures of POIs using both observations and contextual information from integrated spatiotemporal dimensions by constructing a spatiotemporal context graph. Furthermore, we introduce a successive POI recommendation method using STEP, which achieves state-of-the-art performance on two benchmarks. In addition, we demonstrate the excellent performance of the STE representation approach in other spatiotemporal representation-centered tasks through a case study of the traffic flow prediction problem. Therefore, this work provides a novel solution to spatiotemporal representation and paves a new way for spatiotemporal modeling-related tasks.

AAAI Conference 2023 Conference Paper

Defending Black-Box Skeleton-Based Human Activity Classifiers

  • He Wang
  • Yunfeng Diao
  • Zichang Tan
  • Guodong Guo

Skeletal motions have been heavily relied upon for human activity recognition (HAR). Recently, a universal vulnerability of skeleton-based HAR has been identified across a variety of classifiers and data, calling for mitigation. To this end, we propose what is, to the best of our knowledge, the first black-box defense method for skeleton-based HAR. Our method features full Bayesian treatments of the clean data, the adversaries and the classifier, leading to (1) a new Bayesian Energy-based formulation of robust discriminative classifiers, (2) a new adversary sampling scheme based on natural motion manifolds, and (3) a new post-train Bayesian strategy for black-box defense. We name our framework Bayesian Energy-based Adversarial Training, or BEAT. BEAT is straightforward yet elegant: it turns vulnerable black-box classifiers into robust ones without sacrificing accuracy. It demonstrates surprising and universal effectiveness across a wide range of skeletal HAR classifiers and datasets, under various attacks. Appendix and code are available.

AAAI Conference 2023 Conference Paper

LeNo: Adversarial Robust Salient Object Detection Networks with Learnable Noise

  • He Wang
  • Lin Wan
  • He Tang

Pixel-wise prediction with deep neural networks has become an effective paradigm for salient object detection (SOD) and achieved remarkable performance. However, very few SOD models are robust against adversarial attacks that are visually imperceptible to human visual attention. The previous work, robust saliency (ROSA), shuffles the pre-segmented superpixels and then refines the coarse saliency map with a densely connected conditional random field (CRF). Different from ROSA, which relies on various pre- and post-processing steps, this paper proposes a light-weight Learnable Noise (LeNo) to defend SOD models against adversarial attacks. LeNo preserves the accuracy of SOD models on both adversarial and clean images, as well as inference speed. In general, LeNo consists of a simple shallow noise and a noise estimation, embedded in the encoder and decoder of an arbitrary SOD network respectively. Inspired by the center prior of the human visual attention mechanism, we initialize the shallow noise with a cross-shaped Gaussian distribution for better defense against adversarial attacks. Instead of adding network components for post-processing, the proposed noise estimation modifies only one channel of the decoder. With deeply-supervised noise-decoupled training on state-of-the-art RGB and RGB-D SOD networks, LeNo outperforms previous works not only on adversarial images but also on clean images, contributing stronger robustness for SOD. Our code is available at https://github.com/ssecv/LeNo.

AAAI Conference 2023 Conference Paper

Tracking and Reconstructing Hand Object Interactions from Point Cloud Sequences in the Wild

  • Jiayi Chen
  • Mi Yan
  • Jiazhao Zhang
  • Yinzhen Xu
  • Xiaolong Li
  • Yijia Weng
  • Li Yi
  • Shuran Song

In this work, we tackle the challenging task of jointly tracking hand and object poses and reconstructing their shapes from depth point cloud sequences in the wild, given the initial poses at frame 0. We propose, for the first time, a point cloud-based hand joint tracking network, HandTrackNet, to estimate the inter-frame hand joint motion. HandTrackNet introduces a novel hand pose canonicalization module to ease the tracking task, yielding accurate and robust hand joint tracking. Our pipeline then reconstructs the full hand by converting the predicted hand joints into a MANO hand. For object tracking, we devise a simple yet effective module that estimates the object SDF from the first frame and performs optimization-based tracking. Finally, a joint optimization step performs joint hand and object reasoning, which alleviates the occlusion-induced ambiguity and further refines the hand pose. During training, the whole pipeline sees only synthetic data, synthesized with sufficient variations and with depth simulation for ease of generalization. The whole pipeline is thus robust to the generalization gap and directly transferable to real in-the-wild data. We evaluate our method on two real hand-object interaction datasets, i.e., HO3D and DexYCB, without any fine-tuning. Our experiments demonstrate that the proposed method significantly outperforms the previous state-of-the-art depth-based hand and object pose estimation and tracking methods, running at a frame rate of 9 FPS. We have released our code at https://github.com/PKU-EPIC/HOTrack.

JBHI Journal 2022 Journal Article

Multiple B-Value Model-Based Residual Network (MORN) for Accelerated High-Resolution Diffusion-Weighted Imaging

  • Fanwen Wang
  • Hui Zhang
  • Fei Dai
  • Weibo Chen
  • Shuai Xu
  • Zidong Yang
  • Dinggang Shen
  • Chengyan Wang

Single-Shot Echo Planar Imaging (SSEPI)-based Diffusion Weighted Imaging (DWI) has shortcomings such as low resolution and severe distortions. In contrast, Multi-Shot EPI (MSEPI) provides optimal spatial resolution but increases scan time. This study proposed a Multiple b-value mOdel-based Residual Network (MORN) to reconstruct high-resolution DWI at multiple b-values simultaneously from undersampled k-space data. We incorporated Parallel Imaging (PI) into a residual U-net to reconstruct multi-coil data at multiple b-values, supervised by MUltiplexed Sensitivity-Encoding (MUSE)-reconstructed Multi-Shot DWI (MSDWI). Moreover, asymmetric concatenations among different b-values and back-propagation of a combined loss helped feature transfer. After training and validating the MORN on a dataset of 32 healthy cases, additional assessments were performed on 6 patients with different tumor types. The experimental results demonstrated that the MORN model outperformed conventional PI reconstruction (i.e., SENSE) and two state-of-the-art deep learning methods (SENSE-GAN and VSNet) in terms of PSNR (Peak Signal-to-Noise Ratio), SSIM (Structural SIMilarity) and apparent diffusion coefficient maps. In addition, using the model pre-trained on DWI, the MORN achieved consistent fractional anisotropy and mean diffusivity maps reconstructed from multiple diffusion directions. Hence, the proposed method shows potential for clinical application, according to the observations on tumor patients as well as on images of multiple diffusion directions.

AAAI Conference 2022 Conference Paper

Pose Guided Image Generation from Misaligned Sources via Residual Flow Based Correction

  • Jiawei Lu
  • He Wang
  • Tianjia Shao
  • Yin Yang
  • Kun Zhou

Generating new images with desired properties (e.g., new views/poses) from source images has been enthusiastically pursued recently, due to its wide range of potential applications. One way to ensure high-quality generation is to use multiple sources with complementary information, such as different views of the same object. However, as source images are often misaligned due to large disparities among the camera settings, strong assumptions have been made in the past with respect to the camera(s) and/or the object of interest, limiting the application of such techniques. Therefore, we propose a new general approach that models multiple types of variation among sources, such as view angles, poses and facial expressions, in a unified framework, so that it can be employed on datasets of vastly different nature. We verify our approach on a variety of data, including human bodies, faces, city scenes and 3D objects. Both the qualitative and quantitative results demonstrate that our method outperforms the state of the art.

ICRA Conference 2022 Conference Paper

Stable and Efficient Shapley Value-Based Reward Reallocation for Multi-Agent Reinforcement Learning of Autonomous Vehicles

  • Songyang Han
  • He Wang
  • Sanbao Su
  • Yuanyuan Shi
  • Fei Miao

With the development of sensing and communication technologies in networked cyber-physical systems (CPSs), multi-agent reinforcement learning (MARL)-based methodologies are integrated into the control process of physical systems and demonstrate prominent performance in a wide array of CPS domains, such as connected autonomous vehicles (CAVs). However, it remains challenging to mathematically characterize the improvement in CAV performance brought by communication and cooperation capabilities. Since each individual autonomous vehicle is originally self-interested, we cannot assume that all agents would cooperate naturally during the training process. In this work, we propose to reallocate the system's total reward efficiently to motivate stable cooperation among autonomous vehicles. We formally define and quantify how to reallocate the system's total reward to each agent under the proposed transferable utility game, such that communication-based cooperation among agents increases the system's total reward. We prove that the Shapley value-based reward reallocation of MARL lies in the core if the transferable utility game is a convex game. Hence, the cooperation is stable and efficient, and the agents have an incentive to stay in the coalition, or cooperating group. We then propose a cooperative policy learning algorithm with Shapley value reward reallocation. In experiments, we show that our proposed algorithm improves the mean episode system reward of CAV systems compared with several algorithms from the literature.
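As a toy illustration of the Shapley value that this abstract's reward reallocation builds on (not the paper's MARL algorithm), the sketch below computes exact Shapley values for a small transferable utility game by averaging marginal contributions over all player orderings. The player names and the convex characteristic function `v(S) = |S|^2` are hypothetical.

```python
from itertools import permutations

def shapley_values(players, value):
    """Exact Shapley values: average each player's marginal
    contribution over all orderings (feasible for small games)."""
    phi = {p: 0.0 for p in players}
    orderings = list(permutations(players))
    for order in orderings:
        coalition = set()
        for p in order:
            before = value(frozenset(coalition))
            coalition.add(p)
            phi[p] += value(frozenset(coalition)) - before
    return {p: phi[p] / len(orderings) for p in players}

# Hypothetical 3-vehicle convex game: v(S) = |S|^2 is supermodular,
# so (as the abstract's result suggests) the Shapley allocation
# lies in the core, giving each agent an incentive to cooperate.
v = lambda S: len(S) ** 2
alloc = shapley_values(["cav1", "cav2", "cav3"], v)
```

By symmetry each vehicle receives v(N)/3 = 3 here, and the allocation is efficient (it sums to v of the grand coalition).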

AAAI Conference 2021 Conference Paper

In-game Residential Home Planning via Visual Context-aware Global Relation Learning

  • Lijuan Liu
  • Yin Yang
  • Yi Yuan
  • Tianjia Shao
  • He Wang
  • Kun Zhou

In this paper, we propose an effective global relation learning algorithm to recommend an appropriate location for a building unit during in-game customization of a residential home complex. Given a construction layout, we propose a visual context-aware graph generation network that learns the implicit global relations among the scene components and infers the location of a new building unit. The proposed network takes as input the scene graph and the corresponding top-view depth image. It provides location recommendations for a newly-added building unit by learning an auto-regressive edge distribution conditioned on existing scenes. We also introduce a global graph-image matching loss to enhance awareness of the essential geometric semantics of the site. Qualitative and quantitative experiments demonstrate that the recommended locations well reflect the implicit spatial rules of components in residential estates, and that the method is instructive and practical for locating building units in the 3D scene of a complex construction.

NeurIPS Conference 2021 Conference Paper

Leveraging SE(3) Equivariance for Self-supervised Category-Level Object Pose Estimation from Point Clouds

  • Xiaolong Li
  • Yijia Weng
  • Li Yi
  • Leonidas J. Guibas
  • A. Abbott
  • Shuran Song
  • He Wang

Category-level object pose estimation aims to find 6D object poses of previously unseen object instances from known categories without access to object CAD models. To reduce the huge amount of pose annotations needed for category-level learning, we propose for the first time a self-supervised learning framework to estimate category-level 6D object pose from single 3D point clouds. During training, our method assumes no ground-truth pose annotations, no CAD models, and no multi-view supervision. The key to our method is to disentangle shape and pose through an invariant shape reconstruction module and an equivariant pose estimation module, empowered by SE(3) equivariant point cloud networks. The invariant shape reconstruction module learns to perform aligned reconstructions, yielding a category-level reference frame without using any annotations. In addition, the equivariant pose estimation module achieves category-level pose estimation accuracy that is comparable to some fully supervised methods. Extensive experiments demonstrate the effectiveness of our approach on both complete and partial depth point clouds from the ModelNet40 benchmark, and on real depth point clouds from the NOCS-REAL 275 dataset. The project page with code and visualizations can be found at: dragonlong.github.io/equi-pose.
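The invariance/equivariance distinction this abstract relies on can be checked numerically on toy point-cloud features. The sketch below is a conceptual illustration only, not the paper's SE(3)-equivariant networks: sorted pairwise distances are rotation-invariant (like a shape descriptor), while the centroid is rotation-equivariant (it transforms with the input, like a pose estimate).

```python
import numpy as np

def random_rotation(seed=0):
    """A random 3D rotation matrix via QR decomposition."""
    rng = np.random.default_rng(seed)
    q, _ = np.linalg.qr(rng.normal(size=(3, 3)))
    if np.linalg.det(q) < 0:
        q[:, 0] *= -1.0  # force det = +1 (proper rotation)
    return q

def invariant_feature(points):
    # Sorted pairwise distances: unchanged by any rigid rotation.
    d = np.linalg.norm(points[:, None] - points[None, :], axis=-1)
    return np.sort(d[np.triu_indices(len(points), k=1)])

def equivariant_feature(points):
    # Centroid: rotates exactly as the input point cloud does.
    return points.mean(axis=0)

pts = np.random.default_rng(1).normal(size=(16, 3))
R = random_rotation()
rotated = pts @ R.T  # rotate every point by R
```

Here `invariant_feature(rotated)` matches `invariant_feature(pts)`, while `equivariant_feature(rotated)` equals `equivariant_feature(pts) @ R.T`, which is the kind of structural property SE(3)-equivariant networks enforce by construction.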

ICML Conference 2019 Conference Paper

Competing Against Nash Equilibria in Adversarially Changing Zero-Sum Games

  • Adrian Rivera Cardoso
  • Jacob D. Abernethy
  • He Wang
  • Huan Xu

We study the problem of repeated play in a zero-sum game in which the payoff matrix may change, in a possibly adversarial fashion, on each round; we call these Online Matrix Games. Finding the Nash Equilibrium (NE) of a two-player zero-sum game is core to many problems in statistics, optimization, and economics, and for a fixed game matrix this can be easily reduced to solving a linear program. But when the payoff matrix evolves over time, our goal is to find a sequential algorithm that can compete with, in a certain sense, the NE of the long-term-averaged payoff matrix. We design an algorithm with small NE regret; that is, we ensure that the long-term payoff of both players is close to the minimax optimum in hindsight. Our algorithm achieves near-optimal dependence with respect to the number of rounds and depends poly-logarithmically on the number of available actions of the players. Additionally, we show that the naive reduction, where each player simply minimizes its own regret, fails to achieve the stated objective regardless of which algorithm is used. Lastly, we consider the so-called bandit setting, where the feedback is significantly limited, and we provide an algorithm with small NE regret using one-point estimates of each payoff matrix.
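For the fixed-matrix baseline this abstract contrasts with, a standard way to approximate the NE is multiplicative-weights self-play: each player's time-averaged strategy converges to equilibrium by a no-regret argument. The sketch below illustrates that classical baseline only (not the paper's algorithm for adversarially changing payoffs); the matrix, step size, and starting point are hypothetical.

```python
import numpy as np

def mw_selfplay(A, T=2000, eta=0.1, x0=None, y0=None):
    """Both players run multiplicative weights on a fixed zero-sum
    payoff matrix A (row player maximizes x^T A y). The time-averaged
    strategies approximate a Nash equilibrium."""
    m, n = A.shape
    x = np.full(m, 1.0 / m) if x0 is None else np.asarray(x0, float)
    y = np.full(n, 1.0 / n) if y0 is None else np.asarray(y0, float)
    x_sum, y_sum = np.zeros(m), np.zeros(n)
    for _ in range(T):
        x_sum += x
        y_sum += y
        gx = A @ y       # row player's payoff for each pure row
        gy = A.T @ x     # column player's loss for each pure column
        x = x * np.exp(eta * gx); x /= x.sum()    # reward rows
        y = y * np.exp(-eta * gy); y /= y.sum()   # penalize columns
    return x_sum / T, y_sum / T

# Matching pennies: unique NE is uniform play with game value 0.
# Start the row player off-equilibrium so the dynamics are nontrivial.
A = np.array([[1.0, -1.0], [-1.0, 1.0]])
x_bar, y_bar = mw_selfplay(A, x0=[0.6, 0.4])
```

The last iterates cycle around the equilibrium, but the averages `x_bar`, `y_bar` land near the uniform NE with game value near 0, which is exactly what fails under the naive per-player reduction once the matrix changes adversarially.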

NeurIPS Conference 2019 Conference Paper

Large Scale Markov Decision Processes with Changing Rewards

  • Adrian Rivera Cardoso
  • He Wang
  • Huan Xu

We consider Markov Decision Processes (MDPs) where the rewards are unknown and may change in an adversarial manner. We provide an algorithm that achieves a regret bound of $O( \sqrt{\tau (\ln|S|+\ln|A|)T}\ln(T))$, where $S$ is the state space, $A$ is the action space, $\tau$ is the mixing time of the MDP, and $T$ is the number of periods. The algorithm's computational complexity is polynomial in $|S|$ and $|A|$. We then consider a setting often encountered in practice, where the state space of the MDP is too large to allow for exact solutions. By approximating the state-action occupancy measures with a linear architecture of dimension $d\ll|S|$, we propose a modified algorithm with a computational complexity polynomial in $d$ and independent of $|S|$. We also prove a regret bound for this modified algorithm, which to the best of our knowledge, is the first $\tilde{O}(\sqrt{T})$ regret bound in the large-scale MDP setting with adversarially changing rewards.

NeurIPS Conference 2016 Conference Paper

“Congruent” and “Opposite” Neurons: Sisters for Multisensory Integration and Segregation

  • Wen-Hao Zhang
  • He Wang
  • K. Y. Michael Wong
  • Si Wu

Experiments reveal that in the dorsal medial superior temporal (MSTd) and the ventral intraparietal (VIP) areas, where visual and vestibular cues are integrated to infer heading direction, there are two types of neurons in roughly equal numbers. One is "congruent" cells, whose preferred heading directions are similar in response to visual and vestibular cues; the other is "opposite" cells, whose preferred heading directions are nearly "opposite" (with an offset of 180 degrees) in response to visual vs. vestibular cues. Congruent neurons are known to be responsible for cue integration, but the computational role of opposite neurons remains largely unknown. Here, we propose that opposite neurons may serve to encode the disparity information between cues necessary for multisensory segregation. We build a computational model composed of two reciprocally coupled modules, MSTd and VIP, each consisting of groups of congruent and opposite neurons. In the model, congruent neurons in the two modules are reciprocally connected with each other in the congruent manner, whereas opposite neurons are reciprocally connected in the opposite manner. Mimicking the experimental protocol, our model reproduces the characteristics of congruent and opposite neurons, and demonstrates that in each module, the sisters of congruent and opposite neurons can jointly achieve optimal multisensory information integration and segregation. This study sheds light on our understanding of how the brain implements optimal multisensory integration and segregation concurrently in a distributed manner.

NeurIPS Conference 2010 Conference Paper

Attractor Dynamics with Synaptic Depression

  • K. Wong
  • He Wang
  • Si Wu
  • Chi Fung

Neuronal connection weights exhibit short-term depression (STD). The present study investigates the impact of STD on the dynamics of a continuous attractor neural network (CANN) and its potential roles in neural information processing. We find that the network with STD can generate both static and traveling bumps, and that STD enhances the performance of the network in tracking external inputs. In particular, we find that STD endows the network with slow-decaying plateau behaviors: a network initially stimulated into an active state will decay to silence very slowly, on the time scale of STD rather than that of neural signaling. We argue that this provides a mechanism for neural systems to hold short-term memory easily and shut off persistent activities naturally.