Arrow Research search

Author name cluster

Hao Su

Possible papers associated with this exact author name in Arrow. This page groups case-insensitive exact name matches and is not a full identity disambiguation profile.

31 papers
2 author rows

Possible papers

31

AAAI Conference 2026 Conference Paper

New Synthetic Goldmine: Hand Joint Angle-Driven EMG Data Generation Framework for Micro-Gesture Recognition

  • Nana Wang
  • Suli Wang
  • Gen Li
  • Pengfei Ren
  • Hao Su

Electromyography (EMG)-based gesture recognition has emerged as a promising approach for human-computer interaction. However, its performance is often limited by the scarcity of labeled EMG data, significant cross-user variability, and poor generalization to unseen gestures. To address these challenges, we propose SeqEMG-GAN, a conditional, sequence-driven generative framework that synthesizes high-fidelity EMG signals from hand joint angle sequences. Our method introduces a context-aware architecture composed of an angle encoder, a dual-layer context encoder featuring the novel Ang2Gist unit, a deep convolutional EMG generator, and a discriminator, all jointly optimized via adversarial learning. By conditioning on joint kinematic trajectories, SeqEMG-GAN is capable of generating semantically consistent EMG sequences, even for previously unseen gestures, thereby enhancing data diversity and physiological plausibility. Experimental results show that classifiers trained solely on synthetic data experience only a slight accuracy drop (from 57.77% to 55.71%). In contrast, training with a combination of real and synthetic data significantly improves accuracy to 60.53%, outperforming real-only training by 2.76%. These findings demonstrate the effectiveness of our framework, which also achieves state-of-the-art performance in augmenting EMG datasets and enhancing gesture recognition for applications such as neural robotic hand control, AI/AR glasses, and gesture-based virtual gaming systems.
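
A minimal sketch of the central conditioning idea (my illustration, not the authors' released code; layer types and sizes are invented): encode the joint-angle trajectory into a context vector, then generate an EMG window from noise plus that context.

```python
import torch
import torch.nn as nn

class AngleConditionedGenerator(nn.Module):
    """Toy stand-in for an angle-conditioned EMG generator."""
    def __init__(self, n_angles=15, n_channels=8, latent_dim=64, hidden=128):
        super().__init__()
        # Angle encoder: summarizes the joint-angle trajectory into a context vector.
        self.angle_encoder = nn.GRU(n_angles, hidden, batch_first=True)
        # Decoder: maps (noise, context) to an EMG frame at every time step.
        self.decode = nn.GRU(latent_dim + hidden, hidden, batch_first=True)
        self.to_emg = nn.Linear(hidden, n_channels)

    def forward(self, angles, z):
        # angles: (B, T, n_angles), z: (B, T, latent_dim)
        _, h = self.angle_encoder(angles)                  # h: (1, B, hidden)
        ctx = h[-1].unsqueeze(1).expand(-1, angles.size(1), -1)
        out, _ = self.decode(torch.cat([z, ctx], dim=-1))
        return self.to_emg(out)                            # (B, T, n_channels)

g = AngleConditionedGenerator()
fake_emg = g(torch.randn(4, 200, 15), torch.randn(4, 200, 64))  # (4, 200, 8)
```

In the full adversarial setup, a discriminator judging (angles, EMG) pairs would supply the training signal.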

NeurIPS Conference 2025 Conference Paper

MyoChallenge 2024: A New Benchmark for Physiological Dexterity and Agility in Bionic Humans

  • Huiyi Wang
  • Chun Kwang Tan
  • Balint Hodossy
  • Shirui Lyu
  • Pierre Schumacher
  • James Heald
  • Kai Biegun
  • Samo Hromadka

Recent advancements in bionic prosthetic technology offer transformative opportunities to restore mobility and functionality for individuals with missing limbs. Users of bionic limbs, or bionic humans, learn to seamlessly integrate prosthetic extensions into their motor repertoire, regaining critical motor abilities. The remarkable movement generalization and environmental adaptability demonstrated by these individuals highlight motor intelligence capabilities unmatched by current artificial intelligence systems. Addressing these limitations, MyoChallenge '24 at NeurIPS 2024 established a benchmark for human-robot coordination with an emphasis on joint control of both biological and mechanical limbs. The competition featured two distinct tracks: a manipulation task utilizing the myoMPL model, integrating a virtual biological arm and the Modular Prosthetic Limb (MPL) for a passover task; and a locomotion task using the novel myoOSL model, combining a bilateral virtual biological leg with a trans-femoral amputation and the Open Source Leg (OSL) to navigate varied terrains. Marking the third iteration of the MyoChallenge, the event attracted over 50 teams with more than 290 submissions from around the globe, with diverse participants ranging from independent researchers to high school students. The competition facilitated the development of several state-of-the-art control algorithms for bionic musculoskeletal systems, leveraging techniques such as imitation learning, muscle synergy, and model-based reinforcement learning that significantly surpassed our proposed baseline performance by a factor of 10. By providing the open-source simulation framework of MyoSuite, standardized tasks, and physiologically realistic models, MyoChallenge serves as a reproducible testbed and benchmark for bridging ML and biomechanics. The competition website is available at https://sites.google.com/view/myosuite/myochallenge/myochallenge-2024.

TMLR Journal 2025 Journal Article

Responsive Noise-Relaying Diffusion Policy: Responsive and Efficient Visuomotor Control

  • Zhuoqun Chen
  • Xiu Yuan
  • Tongzhou Mu
  • Hao Su

Imitation learning is an efficient method for teaching robots a variety of tasks. Diffusion Policy, which uses a conditional denoising diffusion process to generate actions, has demonstrated superior performance, particularly in learning from multi-modal demonstrations. However, it relies on executing multiple actions predicted from the same inference step to retain performance and prevent mode bouncing, which limits its responsiveness, as actions are not conditioned on the most recent observations. To address this, we introduce Responsive Noise-Relaying Diffusion Policy (RNR-DP), which maintains a noise-relaying buffer with progressively increasing noise levels and employs a sequential denoising mechanism that generates immediate, noise-free actions at the head of the sequence, while appending noisy actions at the tail. This ensures that actions are responsive and conditioned on the latest observations, while maintaining motion consistency through the noise-relaying buffer. This design enables the handling of tasks requiring responsive control, and accelerates action generation by reusing denoising steps. Experiments on response-sensitive tasks demonstrate that, compared to Diffusion Policy, ours achieves an 18% improvement in success rate. Further evaluation on regular tasks demonstrates that RNR-DP also exceeds the best acceleration method (DDIM) by 6.9% in success rate, highlighting its computational efficiency advantage in scenarios where responsiveness is less critical.
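
A schematic of the noise-relaying buffer as I read the abstract (not the authors' implementation): the buffer holds one action per noise level, head lowest; each control step denoises every slot one level conditioned on the latest observation, emits the now-clean head, and appends fresh Gaussian noise at the tail.

```python
import collections
import torch

def denoise_one_level(action, level, obs):
    """Stand-in for one conditional denoising step of a diffusion policy."""
    return 0.9 * action  # a real model predicts the action one noise level cleaner

class NoiseRelayingBuffer:
    """Slots hold actions at noise levels 1..H (head lowest, tail highest)."""
    def __init__(self, horizon, action_dim):
        self.action_dim = action_dim
        self.buf = collections.deque(torch.randn(horizon, action_dim))

    def step(self, obs):
        # Denoise every slot one level, conditioned on the latest observation.
        self.buf = collections.deque(
            denoise_one_level(a, level, obs)
            for level, a in enumerate(self.buf, start=1)
        )
        action = self.buf.popleft()                      # head is now noise-free
        self.buf.append(torch.randn(self.action_dim))    # fresh noise at the tail
        return action
```

Each executed action has thus been denoised over many past steps (reusing work) yet its final refinements saw the most recent observations.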

NeurIPS Conference 2025 Conference Paper

Staggered Environment Resets Improve Massively Parallel On-Policy Reinforcement Learning

  • Sid Bharthulwar
  • Stone Tao
  • Hao Su

Massively parallel GPU simulation environments have accelerated reinforcement learning (RL) research by enabling fast data collection for on-policy RL algorithms like Proximal Policy Optimization (PPO). To maximize throughput, it is common to use short rollouts per policy update, increasing the update-to-data (UTD) ratio. However, we find that, in this setting, standard synchronous resets introduce harmful nonstationarity, skewing the learning signal and destabilizing training. We introduce staggered resets, a simple yet effective technique where environments are initialized and reset at varied points within the task horizon. This yields training batches with more uniform state visitation distributions, reducing the nonstationarity induced by synchronized rollouts. We characterize dimensions along which RL environments can benefit significantly from staggered resets through toy environments. We then apply this technique to challenging high-dimensional robotics environments, achieving significantly higher sample efficiency, faster wall-clock convergence, and stronger final performance. Finally, this technique scales better with more parallel environments compared to naive synchronized rollouts, yielding better utilization of computational resources.
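
The mechanism is simple enough to show directly. A minimal sketch under an assumed vectorized-env interface (the offset handling is mine, not the paper's code):

```python
import numpy as np

def staggered_initial_steps(num_envs, horizon, seed=0):
    # One offset per environment, spread uniformly over the task horizon,
    # so a rollout batch mixes early-, mid-, and late-episode states.
    rng = np.random.default_rng(seed)
    return rng.integers(0, horizon, size=num_envs)

offsets = staggered_initial_steps(num_envs=8, horizon=100)
# A vectorized env would then reset env i after (horizon - offsets[i]) steps
# on its first episode, and every `horizon` steps thereafter, keeping reset
# times permanently desynchronized across environments.
```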

AAAI Conference 2025 Conference Paper

When Should We Prefer State-to-Visual DAgger over Visual Reinforcement Learning?

  • Tongzhou Mu
  • Zhaoyang Li
  • Stanisław Wiktor Strzelecki
  • Xiu Yuan
  • Yunchao Yao
  • Litian Liang
  • Hao Su

Learning policies from high-dimensional visual inputs, such as pixels and point clouds, is crucial in various applications. Visual reinforcement learning is a promising approach that directly trains policies from visual observations, although it faces challenges in sample efficiency and computational costs. This study conducts an empirical comparison of State-to-Visual DAgger — a two-stage framework that initially trains a state policy before adopting online imitation to learn a visual policy — and Visual RL across a diverse set of tasks. We evaluate both methods across 16 tasks from three benchmarks, focusing on their asymptotic performance, sample efficiency, and computational costs. Surprisingly, our findings reveal that State-to-Visual DAgger does not universally outperform Visual RL but shows significant advantages in challenging tasks, offering more consistent performance. In contrast, its benefits in sample efficiency are less pronounced, although it often reduces the overall wall-clock time required for training. Based on our findings, we provide recommendations for practitioners and hope that our results contribute valuable perspectives for future research in visual policy learning.
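
The two-stage recipe under comparison fits in a few lines. A sketch of the online-imitation stage, under an assumed interface where the simulator exposes both visual and state observations (the interface and names are mine, not the paper's code):

```python
def state_to_visual_dagger(env, teacher, student, num_iters, rollout_len):
    # Stage 1 (not shown): train `teacher` with RL on low-dim state inputs.
    # Stage 2: the visual student drives the env; the state-based teacher
    # labels every visited observation; the student fits the aggregate data.
    dataset = []
    for _ in range(num_iters):
        obs_visual, obs_state = env.reset()
        for _ in range(rollout_len):
            dataset.append((obs_visual, teacher.act(obs_state)))
            (obs_visual, obs_state), done = env.step(student.act(obs_visual))
            if done:
                break
        student.fit(dataset)     # supervised regression onto teacher actions
    return student
```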

IJCAI Conference 2024 Conference Paper

A Consistency and Integration Model with Adaptive Thresholds for Weakly Supervised Object Localization

  • Hao Su
  • Meng Yang

Weakly Supervised Object Localization (WSOL) is a challenging task, which aims to learn object localization with less costly image-level labels. Existing convolutional neural network (CNN)-based methods tend to focus on discriminative regions of objects, while transformer-based methods overemphasize deep global features powerful for classification and lack the capability to perceive object details, leading to prediction results far from the object boundary. In this paper, we propose a novel Consistency and Integration Model with Adaptive Thresholds (CIAT) that exploits the spatial-semantic consistency between shallow and deep features to activate more object regions and detects object regions adaptively in different images. First, we introduce a simple plug-and-play consistency and integration module of shallow-deep features (CISD), which utilizes shallow features efficiently to enhance perception of the entire object. Then, we design an online adaptive threshold (OAT) based on Bayesian decision theory, which computes a reasonable segmentation threshold adapted to the localization map of each image, making the predicted bounding box closer to the ground truth. Extensive experiments on the two widely used CUB-200-2011 and ILSVRC datasets verify the effectiveness of our method.
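
For intuition about per-image adaptive thresholding (this is Otsu's classic method, not the paper's OAT module): Otsu's threshold is the Bayes-optimal split of the localization-map values under a two-class Gaussian model with equal variances, and it already adapts to each image.

```python
import numpy as np

def adaptive_threshold(cam, bins=256):
    # cam: 2-D localization map; returns a per-image segmentation threshold.
    hist, edges = np.histogram(cam.ravel(), bins=bins)
    p = hist / hist.sum()
    centers = (edges[:-1] + edges[1:]) / 2
    w0 = np.cumsum(p)                  # background class weight at each cut
    m = np.cumsum(p * centers)         # cumulative first moment
    mt = m[-1]
    with np.errstate(divide="ignore", invalid="ignore"):
        # Between-class variance for every candidate cut; pick the maximizer.
        sigma_b = (mt * w0 - m) ** 2 / (w0 * (1.0 - w0))
    return centers[np.nanargmax(sigma_b)]

cam = np.random.rand(14, 14) ** 2      # toy localization map
mask = cam > adaptive_threshold(cam)   # adaptive foreground segmentation
```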

IJCAI Conference 2024 Conference Paper

BeyondVision: An EMG-driven Micro Hand Gesture Recognition Based on Dynamic Segmentation

  • Nana Wang
  • Jianwei Niu
  • Xuefeng Liu
  • Dongqin Yu
  • Guogang Zhu
  • Xinghao Wu
  • Mingliang Xu
  • Hao Su

Hand gesture recognition (HGR) plays a pivotal role in natural and intuitive human-computer interactions. Recent HGR methods focus on recognizing gestures from vision-based images or videos. However, vision-based methods are limited in recognizing micro hand gestures (MHGs) (e.g., a pinch within 1 cm) and gestures with occluded fingers. To address these issues, combined with the electromyography (EMG) technique, we propose BeyondVision, an EMG-driven MHG recognition system based on deep learning. BeyondVision consists of a wristband-style EMG sampling device and a tailored lightweight neural network, BV-Net, that can accurately translate EMG signals of MHGs to control commands in real time. Moreover, we propose a post-processing mechanism and a weight segmentation algorithm to effectively improve the accuracy of MHG recognition. Subjective and objective experimental results show that our approach achieves an average recognition rate of over 95%, a 2000 Hz sampling frequency, and real-time micro gesture recognition. Our technique has been applied in a commercially available product, introduced at https://github.com/tyc333/NoBarriers.

ICRA Conference 2024 Conference Paper

EgoPAT3Dv2: Predicting 3D Action Target from 2D Egocentric Vision for Human-Robot Interaction

  • Irving Fang
  • Yuzhong Chen
  • Yifan Wang
  • Jianghan Zhang
  • Qiushi Zhang
  • Jiali Xu
  • Xibo He
  • Weibo Gao

A robot’s ability to anticipate the 3D action target location of a hand’s movement from egocentric videos can greatly improve safety and efficiency in human-robot interaction (HRI). While previous research predominantly focused on semantic action classification or 2D target region prediction, we argue that predicting the action target’s 3D coordinate could pave the way for more versatile downstream robotics tasks, especially given the increasing prevalence of headset devices. This study expands EgoPAT3D, the sole dataset dedicated to egocentric 3D action target prediction. We augment both its size and diversity, enhancing its potential for generalization. Moreover, we substantially enhance the baseline algorithm by introducing a large pre-trained model and human prior knowledge. Remarkably, our novel algorithm can now achieve superior prediction outcomes using solely RGB images, eliminating the previous need for 3D point clouds and IMU input. Furthermore, we deploy our enhanced baseline algorithm on a real-world robotic platform to illustrate its practical utility in straightforward HRI tasks. The demonstrations showcase the real-world applicability of our advancements and may inspire more HRI use cases involving egocentric vision. All code and data are open-sourced and can be found on the project website.

IJCAI Conference 2024 Conference Paper

Estimating before Debiasing: A Bayesian Approach to Detaching Prior Bias in Federated Semi-Supervised Learning

  • Guogang Zhu
  • Xuefeng Liu
  • Xinghao Wu
  • Shaojie Tang
  • Chao Tang
  • Jianwei Niu
  • Hao Su

Federated Semi-Supervised Learning (FSSL) leverages both labeled and unlabeled data on clients to collaboratively train a model. In FSSL, the heterogeneous data can introduce prediction bias into the model, causing the model's predictions to skew towards certain classes. Existing FSSL methods primarily tackle this issue by enhancing consistency in model parameters or outputs. However, as the models themselves are biased, merely constraining their consistency is not sufficient to alleviate prediction bias. In this paper, we explore this bias from a Bayesian perspective and demonstrate that it principally originates from label prior bias within the training data. Building upon this insight, we propose a debiasing method for FSSL named FedDB. FedDB utilizes the Average Prediction Probability of Unlabeled Data (APP-U) to approximate the biased prior. During local training, FedDB employs APP-U to refine pseudo-labeling through Bayes' theorem, thereby significantly reducing the label prior bias. Concurrently, during the model aggregation, FedDB uses APP-U from participating clients to formulate unbiased aggregate weights, thereby effectively diminishing bias in the global model. Experimental results show that FedDB can surpass existing FSSL methods. The code is available at https://github.com/GuogangZhu/FedDB.
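
A sketch of the Bayes-rule correction the abstract describes (variable names are mine, not from the FedDB code): divide the predictive probabilities by the approximated biased prior (APP-U), renormalize, and pseudo-label from the corrected posterior.

```python
import torch

def debiased_pseudo_labels(probs, app_u, threshold=0.95, eps=1e-8):
    # probs: (B, C) softmax outputs on unlabeled data
    # app_u: (C,) average prediction probability on unlabeled data (biased prior)
    corrected = probs / (app_u + eps)              # p(y|x) / p(y), up to a constant
    corrected = corrected / corrected.sum(dim=1, keepdim=True)
    conf, labels = corrected.max(dim=1)
    mask = conf >= threshold                       # keep only confident pseudo-labels
    return labels, mask
```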

NeurIPS Conference 2024 Conference Paper

MeshFormer: High-Quality Mesh Generation with 3D-Guided Reconstruction Model

  • Minghua Liu
  • Chong Zeng
  • Xinyue Wei
  • Ruoxi Shi
  • Linghao Chen
  • Chao Xu
  • Mengqi Zhang
  • Zhaoning Wang

Open-world 3D reconstruction models have recently garnered significant attention. However, without sufficient 3D inductive bias, existing methods typically entail expensive training costs and struggle to extract high-quality 3D meshes. In this work, we introduce MeshFormer, a sparse-view reconstruction model that explicitly leverages 3D native structure, input guidance, and training supervision. Specifically, instead of using a triplane representation, we store features in 3D sparse voxels and combine transformers with 3D convolutions to leverage an explicit 3D structure and projective bias. In addition to sparse-view RGB input, we require the network to take input and generate corresponding normal maps. The input normal maps can be predicted by 2D diffusion models, significantly aiding in the guidance and refinement of the geometry's learning. Moreover, by combining Signed Distance Function (SDF) supervision with surface rendering, we directly learn to generate high-quality meshes without the need for complex multi-stage training processes. By incorporating these explicit 3D biases, MeshFormer can be trained efficiently and deliver high-quality textured meshes with fine-grained geometric details. It can also be integrated with 2D diffusion models to enable fast single-image-to-3D and text-to-3D tasks. Videos are available at https://meshformer3d.github.io/

NeurIPS Conference 2023 Conference Paper

Deductive Verification of Chain-of-Thought Reasoning

  • Zhan Ling
  • Yunhao Fang
  • Xuanlin Li
  • Zhiao Huang
  • Mingu Lee
  • Roland Memisevic
  • Hao Su

Large Language Models (LLMs) significantly benefit from Chain-of-thought (CoT) prompting in performing various reasoning tasks. While CoT allows models to produce more comprehensive reasoning processes, its emphasis on intermediate reasoning steps can inadvertently introduce hallucinations and accumulated errors, thereby limiting models' ability to solve complex reasoning tasks. Inspired by how humans engage in careful and meticulous deductive logical reasoning processes to solve tasks, we seek to enable language models to perform explicit and rigorous deductive reasoning, and also ensure the trustworthiness of their reasoning process through self-verification. However, directly verifying the validity of an entire deductive reasoning process is challenging, even with advanced models like ChatGPT. In light of this, we propose to decompose a reasoning verification process into a series of step-by-step subprocesses, each receiving only its necessary context and premises. To facilitate this procedure, we propose Natural Program, a natural language-based deductive reasoning format. Our approach enables models to generate precise reasoning steps where subsequent steps are more rigorously grounded on prior steps. It also empowers language models to carry out reasoning self-verification in a step-by-step manner. By integrating this verification process into each deductive reasoning stage, we significantly enhance the rigor and trustworthiness of generated reasoning steps. Along this process, we also improve the answer correctness on complex reasoning tasks.
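
A schematic of the decomposed verification loop (an illustration of the idea, not the paper's Natural Program format; `ask_llm` stands in for any chat-completion call):

```python
def verify_chain(premises, steps, ask_llm):
    # Verify each reasoning step in isolation, given only the premises and
    # the steps derived so far, rather than judging the whole chain at once.
    for i, step in enumerate(steps, start=1):
        context = "\n".join(premises + steps[:i - 1])
        verdict = ask_llm(
            "Premises and previously verified steps:\n"
            f"{context}\n\n"
            f"Candidate step: {step}\n"
            "Does the candidate step follow deductively? Answer yes or no."
        )
        if not verdict.strip().lower().startswith("yes"):
            return False, i            # index of the first failing step
    return True, None
```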

NeurIPS Conference 2023 Conference Paper

DiffVL: Scaling Up Soft Body Manipulation using Vision-Language Driven Differentiable Physics

  • Zhiao Huang
  • Feng Chen
  • Yewen Pu
  • Chunru Lin
  • Hao Su
  • Chuang Gan

Combining gradient-based trajectory optimization with differentiable physics simulation is an efficient technique for solving soft-body manipulation problems. Using a well-crafted optimization objective, the solver can quickly converge onto a valid trajectory. However, writing the appropriate objective functions requires expert knowledge, making it difficult to collect a large set of naturalistic problems from non-expert users. We introduce DiffVL, a method that enables non-expert users to communicate soft-body manipulation tasks -- a combination of vision and natural language, given in multiple stages -- that can be readily leveraged by a differentiable physics solver. We have developed GUI tools that enable non-expert users to specify 100 tasks inspired by real-life soft-body manipulations from online videos, which we'll make public. We leverage large language models to translate task descriptions into machine-interpretable optimization objectives. The optimization objectives can help differentiable physics solvers to solve these long-horizon multistage tasks that are challenging for previous baselines.

NeurIPS Conference 2023 Conference Paper

One-2-3-45: Any Single Image to 3D Mesh in 45 Seconds without Per-Shape Optimization

  • Minghua Liu
  • Chao Xu
  • Haian Jin
  • Linghao Chen
  • Mukund Varma T
  • Zexiang Xu
  • Hao Su

Single image 3D reconstruction is an important but challenging task that requires extensive knowledge of our natural world. Many existing methods solve this problem by optimizing a neural radiance field under the guidance of 2D diffusion models but suffer from lengthy optimization time, 3D-inconsistent results, and poor geometry. In this work, we propose a novel method that takes a single image of any object as input and generates a full 360-degree 3D textured mesh in a single feed-forward pass. Given a single image, we first use a view-conditioned 2D diffusion model, Zero123, to generate multi-view images for the input view, and then aim to lift them up to 3D space. Since traditional reconstruction methods struggle with inconsistent multi-view predictions, we build our 3D reconstruction module upon an SDF-based generalizable neural surface reconstruction method and propose several critical training strategies to enable the reconstruction of 360-degree meshes. Without costly optimizations, our method reconstructs 3D shapes in significantly less time than existing methods. Moreover, our method favors better geometry, generates more 3D-consistent results, and adheres more closely to the input image. We evaluate our approach on both synthetic data and in-the-wild images and demonstrate its superiority in terms of both mesh quality and runtime. In addition, our approach can seamlessly support the text-to-3D task by integrating with off-the-shelf text-to-image diffusion models.

NeurIPS Conference 2023 Conference Paper

OpenIllumination: A Multi-Illumination Dataset for Inverse Rendering Evaluation on Real Objects

  • Isabella Liu
  • Linghao Chen
  • Ziyang Fu
  • Liwen Wu
  • Haian Jin
  • Zhong Li
  • Chin Ming Ryan Wong
  • Yi Xu

We introduce OpenIllumination, a real-world dataset containing over 108K images of 64 objects with diverse materials, captured under 72 camera views and a large number of different illuminations. For each image in the dataset, we provide accurate camera parameters, illumination ground truth, and foreground segmentation masks. Our dataset enables the quantitative evaluation of most inverse rendering and material decomposition methods for real objects. We examine several state-of-the-art inverse rendering methods on our dataset and compare their performances. The dataset and code can be found on the project page: https://oppo-us-research.github.io/OpenIllumination.

NeurIPS Conference 2023 Conference Paper

OpenShape: Scaling Up 3D Shape Representation Towards Open-World Understanding

  • Minghua Liu
  • Ruoxi Shi
  • Kaiming Kuang
  • Yinhao Zhu
  • Xuanlin Li
  • Shizhong Han
  • Hong Cai
  • Fatih Porikli

We introduce OpenShape, a method for learning multi-modal joint representations of text, image, and point clouds. We adopt the commonly used multi-modal contrastive learning framework for representation alignment, but with a specific focus on scaling up 3D representations to enable open-world 3D shape understanding. To achieve this, we scale up training data by ensembling multiple 3D datasets and propose several strategies to automatically filter and enrich noisy text descriptions. We also explore and compare strategies for scaling 3D backbone networks and introduce a novel hard negative mining module for more efficient training. We evaluate OpenShape on zero-shot 3D classification benchmarks and demonstrate its superior capabilities for open-world recognition. Specifically, OpenShape achieves a zero-shot accuracy of 46.8% on the 1,156-category Objaverse-LVIS benchmark, compared to less than 10% for existing methods. OpenShape also achieves an accuracy of 85.3% on ModelNet40, outperforming previous zero-shot baseline methods by 20% and performing on par with some fully-supervised methods. Furthermore, we show that our learned embeddings encode a wide range of visual and semantic concepts (e.g., subcategories, color, shape, style) and facilitate fine-grained text-3D and image-3D interactions. Due to their alignment with CLIP embeddings, our learned shape representations can also be integrated with off-the-shelf CLIP-based models for various applications, such as point cloud captioning and point cloud-conditioned image generation.
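
The alignment objective referred to above is the standard CLIP-style contrastive loss; a generic formulation follows (not OpenShape's exact code; in practice the text/image branches are typically frozen CLIP encoders):

```python
import torch
import torch.nn.functional as F

def contrastive_loss(shape_emb, text_emb, temperature=0.07):
    # shape_emb, text_emb: (B, D); row i of each modality is a matched pair.
    shape_emb = F.normalize(shape_emb, dim=-1)
    text_emb = F.normalize(text_emb, dim=-1)
    logits = shape_emb @ text_emb.t() / temperature    # (B, B) similarities
    targets = torch.arange(shape_emb.size(0), device=shape_emb.device)
    # Symmetric cross-entropy: match shapes to texts and texts to shapes.
    return (F.cross_entropy(logits, targets)
            + F.cross_entropy(logits.t(), targets)) / 2
```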

ICLR Conference 2021 Conference Paper

BiPointNet: Binary Neural Network for Point Clouds

  • Haotong Qin
  • Zhongang Cai
  • Mingyuan Zhang
  • Yifu Ding
  • Haiyu Zhao
  • Shuai Yi
  • Xianglong Liu 0001
  • Hao Su

To alleviate the resource constraint for real-time point cloud applications that run on edge devices, in this paper we present BiPointNet, the first model binarization approach for efficient deep learning on point clouds. We discover that the immense performance drop of binarized models for point clouds mainly stems from two challenges: aggregation-induced feature homogenization that leads to a degradation of information entropy, and scale distortion that hinders optimization and invalidates scale-sensitive structures. With theoretical justifications and in-depth analysis, our BiPointNet introduces Entropy-Maximizing Aggregation (EMA) to modulate the distribution before aggregation for the maximum information entropy, and Layer-wise Scale Recovery (LSR) to efficiently restore feature representation capacity. Extensive experiments show that BiPointNet outperforms existing binarization methods by convincing margins, at a level even comparable with the full-precision counterpart. We highlight that our techniques are generic, guaranteeing significant improvements on various fundamental tasks and mainstream backbones. Moreover, BiPointNet gives an impressive 14.7× speedup and 18.9× storage saving on real-world resource-constrained devices.
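
A sketch of the Layer-wise Scale Recovery idea as the abstract describes it (a simplification: the straight-through estimator needed to train through sign() is omitted): after a binarized linear layer, a single learnable per-layer scalar restores the output scale that binarization destroys.

```python
import torch
import torch.nn as nn

class BinaryLinearLSR(nn.Module):
    def __init__(self, in_features, out_features):
        super().__init__()
        self.weight = nn.Parameter(torch.randn(out_features, in_features) * 0.01)
        self.scale = nn.Parameter(torch.ones(1))   # layer-wise scale recovery

    def forward(self, x):
        # Binarize activations and weights to {-1, +1} (0 maps to 0 here).
        xb = torch.sign(x)
        wb = torch.sign(self.weight)
        # The 1-bit matmul collapses output magnitudes; `scale` recovers them.
        return self.scale * (xb @ wb.t())
```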

AAAI Conference 2021 Conference Paper

MangaGAN: Unpaired Photo-to-Manga Translation Based on The Methodology of Manga Drawing

  • Hao Su
  • Jianwei Niu
  • Xuefeng Liu
  • Qingfeng Li
  • Jiahe Cui
  • Ji Wan

Manga is a world-popular comic form that originated in Japan, which typically employs black-and-white stroke lines and geometric exaggeration to describe humans’ appearances, poses, and actions. In this paper, we propose MangaGAN, the first method based on Generative Adversarial Network (GAN) for unpaired photo-to-manga translation. Inspired by the drawing process of experienced manga artists, MangaGAN generates geometric features and converts each facial region into the manga domain with a tailored multi-GANs architecture. For training MangaGAN, we collect a new dataset from a popular manga work with extensive features. To produce high-quality manga faces, we propose a structural smoothing loss to smooth stroke lines and avoid noisy pixels, and a similarity preserving module to improve the similarity between the photo and manga domains. Extensive experiments show that MangaGAN can produce high-quality manga faces preserving both the facial similarity and manga style, and outperforms other reference methods.

NeurIPS Conference 2021 Conference Paper

ManiSkill: Generalizable Manipulation Skill Benchmark with Large-Scale Demonstrations

  • Tongzhou Mu
  • Zhan Ling
  • Fanbo Xiang
  • Derek Yang
  • Xuanlin Li
  • Stone Tao
  • Zhiao Huang
  • Zhiwei Jia

Object manipulation from 3D visual inputs poses many challenges on building generalizable perception and policy models. However, 3D assets in existing benchmarks mostly lack the diversity of 3D shapes that align with real-world intra-class complexity in topology and geometry. Here we propose SAPIEN Manipulation Skill Benchmark (ManiSkill) to benchmark manipulation skills over diverse objects in a full-physics simulator. 3D assets in ManiSkill include large intra-class topological and geometric variations. Tasks are carefully chosen to cover distinct types of manipulation challenges. Latest progress in 3D vision also makes us believe that we should customize the benchmark so that the challenge is inviting to researchers working on 3D deep learning. To this end, we simulate a moving panoramic camera that returns ego-centric point clouds or RGB-D images. In addition, we would like ManiSkill to serve a broad set of researchers interested in manipulation research. Besides supporting the learning of policies from interactions, we also support learning-from-demonstrations (LfD) methods, by providing a large number of high-quality demonstrations (~36,000 successful trajectories, ~1.5M point cloud/RGB-D frames in total). We provide baselines using 3D deep learning and LfD algorithms. All code of our benchmark (simulator, environment, SDK, and baselines) is open-sourced (https://github.com/haosulab/ManiSkill), and a challenge facing interdisciplinary researchers will be held based on the benchmark.

NeurIPS Conference 2021 Conference Paper

Particle Cloud Generation with Message Passing Generative Adversarial Networks

  • Raghav Kansal
  • Javier Duarte
  • Hao Su
  • Breno Orzari
  • Thiago Tomei
  • Maurizio Pierini
  • Mary Touranakou
  • Jean-Roch Vlimant

In high energy physics (HEP), jets are collections of correlated particles produced ubiquitously in particle collisions such as those at the CERN Large Hadron Collider (LHC). Machine learning (ML)-based generative models, such as generative adversarial networks (GANs), have the potential to significantly accelerate LHC jet simulations. However, despite jets having a natural representation as a set of particles in momentum-space, a.k.a. a particle cloud, there exist no generative models applied to such a dataset. In this work, we introduce a new particle cloud dataset (JetNet), and apply to it existing point cloud GANs. Results are evaluated using (1) 1-Wasserstein distances between high- and low-level feature distributions, (2) a newly developed Fréchet ParticleNet Distance, and (3) the coverage and (4) minimum matching distance metrics. Existing GANs are found to be inadequate for physics applications, hence we develop a new message passing GAN (MPGAN), which outperforms existing point cloud GANs on virtually every metric and shows promise for use in HEP. We propose JetNet as a novel point-cloud-style dataset for the ML community to experiment with, and set MPGAN as a benchmark to improve upon for future generative models. Additionally, to facilitate research and improve accessibility and reproducibility in this area, we release the open-source JetNet Python package with interfaces for particle cloud datasets, implementations for evaluation and loss metrics, and more tools for ML in HEP development.
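
For reference, the first of those metrics is easy to state concretely: for two equal-size 1-D samples, the 1-Wasserstein distance reduces to the mean absolute difference of the sorted values (a generic identity, not code from the paper or the JetNet package).

```python
import numpy as np

def wasserstein_1d(x, y):
    # x, y: equal-length 1-D samples of a jet/particle feature distribution.
    return np.abs(np.sort(x) - np.sort(y)).mean()

real = np.random.normal(0.0, 1.0, 10_000)
gen = np.random.normal(0.1, 1.1, 10_000)
print(wasserstein_1d(real, gen))   # small value = closely matched distributions
```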

NeurIPS Conference 2021 Conference Paper

Stabilizing Deep Q-Learning with ConvNets and Vision Transformers under Data Augmentation

  • Nicklas Hansen
  • Hao Su
  • Xiaolong Wang

While agents trained by Reinforcement Learning (RL) can solve increasingly challenging tasks directly from visual observations, generalizing learned skills to novel environments remains very challenging. Extensive use of data augmentation is a promising technique for improving generalization in RL, but it is often found to decrease sample efficiency and can even lead to divergence. In this paper, we investigate causes of instability when using data augmentation in common off-policy RL algorithms. We identify two problems, both rooted in high-variance Q-targets. Based on our findings, we propose a simple yet effective technique for stabilizing this class of algorithms under augmentation. We perform extensive empirical evaluation of image-based RL using both ConvNets and Vision Transformers (ViT) on a family of benchmarks based on DeepMind Control Suite, as well as in robotic manipulation tasks. Our method greatly improves stability and sample efficiency of ConvNets under augmentation, and achieves generalization results competitive with state-of-the-art methods for image-based RL in environments with unseen visuals. We further show that our method scales to RL with ViT-based architectures, and that data augmentation may be especially important in this setting.
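
One common recipe for the "high-variance Q-targets" problem, sketched here as a general idea rather than the paper's exact algorithm: apply augmentation only to the online Q-network's input and bootstrap from the clean next observation, so the regularization benefit remains while the target stays low-variance.

```python
import torch
import torch.nn.functional as F

def q_learning_loss(q_net, target_net, obs, action, reward, next_obs, done,
                    augment, gamma=0.99):
    with torch.no_grad():
        # Target from clean (non-augmented) observations: low-variance bootstrap.
        next_q = target_net(next_obs).max(dim=1).values
        target = reward + gamma * (1.0 - done) * next_q
    # Online estimate from augmented observations: augmentation still regularizes.
    q = q_net(augment(obs)).gather(1, action.unsqueeze(1)).squeeze(1)
    return F.mse_loss(q, target)
```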

NeurIPS Conference 2020 Conference Paper

Multi-task Batch Reinforcement Learning with Metric Learning

  • Jiachen Li
  • Quan Vuong
  • Shuang Liu
  • Minghua Liu
  • Kamil Ciosek
  • Henrik Christensen
  • Hao Su

We tackle the Multi-task Batch Reinforcement Learning problem. Given multiple datasets collected from different tasks, we train a multi-task policy to perform well in unseen tasks sampled from the same distribution. The task identities of the unseen tasks are not provided. To perform well, the policy must infer the task identity from collected transitions by modelling its dependency on states, actions and rewards. Because the different datasets may have state-action distributions with large divergence, the task inference module can learn to ignore the rewards and spuriously correlate only state-action pairs to the task identity, leading to poor test time performance. To robustify task inference, we propose a novel application of the triplet loss. To mine hard negative examples, we relabel the transitions from the training tasks by approximating their reward functions. When we allow further training on the unseen tasks, using the trained policy as an initialization leads to significantly faster convergence compared to randomly initialized policies (up to 80% improvement and across 5 different Mujoco task distributions). We name our method MBML (Multi-task Batch RL with Metric Learning).
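
The triplet loss itself is standard; a minimal form of how it could be applied to task-inference embeddings (my framing of the abstract, with hypothetical inputs):

```python
import torch
import torch.nn.functional as F

def task_triplet_loss(anchor, positive, negative, margin=1.0):
    # anchor/positive: embeddings inferred from transitions of the same task;
    # negative: embedding from a reward-relabeled (hard negative) transition.
    d_pos = F.pairwise_distance(anchor, positive)
    d_neg = F.pairwise_distance(anchor, negative)
    # Pull same-task embeddings together, push relabeled ones at least
    # `margin` farther away.
    return F.relu(d_pos - d_neg + margin).mean()
```

PyTorch's built-in `nn.TripletMarginLoss` computes the same quantity.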

NeurIPS Conference 2020 Conference Paper

Refactoring Policy for Compositional Generalizability using Self-Supervised Object Proposals

  • Tongzhou Mu
  • Jiayuan Gu
  • Zhiwei Jia
  • Hao Tang
  • Hao Su

We study how to learn a policy with compositional generalizability. We propose a two-stage framework, which refactorizes a high-reward teacher policy into a generalizable student policy with strong inductive bias. Particularly, we implement an object-centric GNN-based student policy, whose input objects are learned from images through self-supervised learning. Empirically, we evaluate our approach on four difficult tasks that require compositional generalizability, and achieve superior performance compared to baselines.

NeurIPS Conference 2020 Conference Paper

Towards Scale-Invariant Graph-related Problem Solving by Iterative Homogeneous GNNs

  • Hao Tang
  • Zhiao Huang
  • Jiayuan Gu
  • Bao-Liang Lu
  • Hao Su

Current graph neural networks (GNNs) lack generalizability with respect to scales (graph sizes, graph diameters, edge weights, etc.) when solving many graph analysis problems. Taking the perspective of synthesizing graph theory programs, we propose several extensions to address the issue. First, inspired by the dependency of the iteration number of common graph theory algorithms on graph size, we learn to terminate the message passing process in GNNs adaptively according to the computation progress. Second, inspired by the fact that many graph theory algorithms are homogeneous with respect to graph weights, we introduce homogeneous transformation layers that are universal homogeneous function approximators, to convert ordinary GNNs to be homogeneous. Experimentally, we show that our GNN can be trained on small-scale graphs but generalizes well to large-scale graphs for a number of basic graph theory problems. It also shows generalizability for applications of multi-body physical simulation and image-based navigation problems.
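
To make the homogeneity idea concrete: a positively degree-1 homogeneous layer satisfies f(a·x) = a·f(x) for a > 0, so scaling all edge weights scales the output identically. One simple construction with that property (my illustration, not necessarily the paper's layer) factors the input scale out of an arbitrary network:

```python
import torch
import torch.nn as nn

class HomogeneousWrapper(nn.Module):
    """Makes any net positively homogeneous of degree 1 in its input."""
    def __init__(self, net):
        super().__init__()
        self.net = net

    def forward(self, x):
        # Normalize by the input scale, apply the net, multiply the scale back.
        scale = x.abs().amax(dim=-1, keepdim=True).clamp_min(1e-12)
        return scale * self.net(x / scale)

net = HomogeneousWrapper(nn.Sequential(nn.Linear(4, 4), nn.ReLU(), nn.Linear(4, 1)))
x = torch.randn(3, 4)
assert torch.allclose(net(2.0 * x), 2.0 * net(x), atol=1e-5)  # f(2x) = 2 f(x)
```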

NeurIPS Conference 2019 Conference Paper

Mapping State Space using Landmarks for Universal Goal Reaching

  • Zhiao Huang
  • Fangchen Liu
  • Hao Su

An agent that has well understood the environment should be able to apply its skills to any given goal, leading to the fundamental problem of learning the Universal Value Function Approximator (UVFA). A UVFA learns to predict the cumulative rewards between all state-goal pairs. However, empirically, the value function for long-range goals is always hard to estimate and may consequently result in a failed policy. This has presented challenges to the learning process and the capability of neural networks. We propose a method to address this issue in large MDPs with sparse rewards, in which exploration and routing across remote states are both extremely challenging. Our method explicitly models the environment in a hierarchical manner, with a high-level dynamic landmark-based map abstracting the visited state space, and a low-level value network to derive precise local decisions. We use farthest point sampling to select landmark states from past experience, which improves exploration compared with simple uniform sampling. Experimentally, we show that our method enables the agent to reach long-range goals at an early training stage, and achieves better performance than standard RL algorithms for a number of challenging tasks.
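
Farthest point sampling is the one concrete primitive named above; a straightforward O(N·k) version (a generic implementation, not the paper's code):

```python
import numpy as np

def farthest_point_sampling(states, k, seed=0):
    # states: (N, D) visited states; returns indices of k spread-out landmarks.
    rng = np.random.default_rng(seed)
    n = states.shape[0]
    chosen = [int(rng.integers(n))]
    dist = np.linalg.norm(states - states[chosen[0]], axis=1)
    for _ in range(k - 1):
        idx = int(dist.argmax())          # farthest from all current landmarks
        chosen.append(idx)
        dist = np.minimum(dist, np.linalg.norm(states - states[idx], axis=1))
    return np.array(chosen)

landmarks = farthest_point_sampling(np.random.rand(1000, 2), k=16)
```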

NeurIPS Conference 2018 Conference Paper

Deep Functional Dictionaries: Learning Consistent Semantic Structures on 3D Models from Functions

  • Minhyuk Sung
  • Hao Su
  • Ronald Yu
  • Leonidas Guibas

Various 3D semantic attributes such as segmentation masks, geometric features, keypoints, and materials can be encoded as per-point probe functions on 3D geometries. Given a collection of related 3D shapes, we consider how to jointly analyze such probe functions over different shapes, and how to discover common latent structures using a neural network — even in the absence of any correspondence information. Our network is trained on point cloud representations of shape geometry and associated semantic functions on that point cloud. These functions express a shared semantic understanding of the shapes but are not coordinated in any way. For example, in a segmentation task, the functions can be indicator functions of arbitrary sets of shape parts, with the particular combination involved not known to the network. Our network is able to produce a small dictionary of basis functions for each shape, a dictionary whose span includes the semantic functions provided for that shape. Even though our shapes have independent discretizations and no functional correspondences are provided, the network is able to generate latent bases, in a consistent order, that reflect the shared semantic structure among the shapes. We demonstrate the effectiveness of our technique in various segmentation and keypoint selection applications.

NeurIPS Conference 2017 Conference Paper

PointNet++: Deep Hierarchical Feature Learning on Point Sets in a Metric Space

  • Charles Ruizhongtai Qi
  • Li Yi
  • Hao Su
  • Leonidas Guibas

Few prior works study deep learning on point sets. PointNet is a pioneer in this direction. However, by design PointNet does not capture local structures induced by the metric space the points live in, limiting its ability to recognize fine-grained patterns and its generalizability to complex scenes. In this work, we introduce a hierarchical neural network that applies PointNet recursively on a nested partitioning of the input point set. By exploiting metric space distances, our network is able to learn local features with increasing contextual scales. Further observing that point sets are usually sampled with varying densities, which results in greatly decreased performance for networks trained on uniform densities, we propose novel set learning layers to adaptively combine features from multiple scales. Experiments show that our network, called PointNet++, is able to learn deep point set features efficiently and robustly. In particular, results significantly better than the state of the art have been obtained on challenging benchmarks of 3D point clouds.
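
The "nested partitioning" is typically built from two primitives: farthest point sampling to pick group centroids (see the sketch two entries above) and a radius search to gather each centroid's neighborhood, on which a small PointNet is then run. A minimal radius-search sketch (a generic implementation, not the reference code):

```python
import numpy as np

def ball_query(points, centroids, radius, max_k):
    # points: (N, 3), centroids: (M, 3) -> one neighbor-index array per centroid.
    groups = []
    for c in centroids:
        d = np.linalg.norm(points - c, axis=1)
        idx = np.nonzero(d < radius)[0][:max_k]   # local neighborhood of c
        groups.append(idx)
    return groups

pts = np.random.rand(2048, 3)
groups = ball_query(pts, pts[:128], radius=0.1, max_k=32)
```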

ECAI Conference 2016 Conference Paper

Adaptive Binary Quantization for Fast Nearest Neighbor Search

  • Zhujin Li
  • Xianglong Liu 0001
  • Junjie Wu
  • Hao Su

Hashing has proven to be an attractive technique for fast nearest neighbor search over big data. Compared to projection-based hashing methods, prototype-based ones have a stronger capability of generating discriminative binary codes for data with complex inherent structure. However, our observation indicates that they still suffer from insufficient coding, which usually utilizes the complete binary codes in a hypercube. To address this problem, we propose an adaptive binary quantization method that learns a discriminative hash function with prototypes correspondingly associated with small unique binary codes. Our alternating optimization adaptively discovers the prototype set and the code set of a varying size in an efficient way, which together robustly approximate the data relations. Our method can be naturally generalized to the product space for long hash codes. We believe that our idea serves as a very helpful insight for hashing research. Extensive experiments on four large-scale (up to 80 million) datasets demonstrate that our method significantly outperforms state-of-the-art hashing methods, with relative performance gains of up to 58.84%.

NeurIPS Conference 2016 Conference Paper

FPNN: Field Probing Neural Networks for 3D Data

  • Yangyan Li
  • Soeren Pirk
  • Hao Su
  • Charles Qi
  • Leonidas Guibas

Building discriminative representations for 3D data has been an important task in computer graphics and computer vision research. Convolutional Neural Networks (CNNs) have been shown to operate on 2D images with great success for a variety of tasks. Lifting convolution operators to 3D (3D CNNs) seems like a plausible and promising next step. Unfortunately, the computational complexity of 3D CNNs grows cubically with respect to voxel resolution. Moreover, since most 3D geometry representations are boundary based, occupied regions do not increase proportionately with the size of the discretization, resulting in wasted computation. In this work, we represent 3D spaces as volumetric fields, and propose a novel design that employs field probing filters to efficiently extract features from them. Each field probing filter is a set of probing points -- sensors that perceive the space. Our learning algorithm optimizes not only the weights associated with the probing points, but also their locations, which deforms the shape of the probing filters and adaptively distributes them in 3D space. The optimized probing points sense the 3D space "intelligently", rather than operating blindly over the entire domain. We show that field probing is significantly more efficient than 3D CNNs, while providing state-of-the-art performance, on classification tasks for 3D object recognition benchmark datasets.
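
A toy version of one field-probing filter (my sketch of the abstract's description, not the paper's layers): learnable probe locations sample the volumetric field via trilinear interpolation, so gradients reach the locations as well as the weights, which is what lets training reposition and deform the filter.

```python
import torch
import torch.nn as nn
import torch.nn.functional as F

class FieldProbingFilter(nn.Module):
    def __init__(self, n_probes=16):
        super().__init__()
        # Probe locations in [-1, 1]^3, optimized jointly with per-probe weights.
        self.locs = nn.Parameter(torch.empty(n_probes, 3).uniform_(-1, 1))
        self.weights = nn.Parameter(torch.randn(n_probes))

    def forward(self, field):
        # field: (D, H, W) volumetric field representing one 3D shape.
        vol = field[None, None]                       # (1, 1, D, H, W)
        grid = self.locs.view(1, 1, 1, -1, 3)         # (1, 1, 1, P, 3)
        # Trilinear sampling keeps the probe locations differentiable.
        vals = F.grid_sample(vol, grid, align_corners=True)
        return (self.weights * vals.flatten()).sum()

f = FieldProbingFilter()
out = f(torch.rand(32, 32, 32))
out.backward()                  # gradients flow to both locs and weights
```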

IROS Conference 2013 Conference Paper

Teleoperation system with hybrid pneumatic-piezoelectric actuation for MRI-guided needle insertion with haptic feedback

  • Weijian Shang
  • Hao Su
  • Gang Li 0018
  • Gregory S. Fischer

This paper presents a surgical master-slave teleoperation system for percutaneous interventional procedures under continuous magnetic resonance imaging (MRI) guidance. The system consists of a piezoelectrically actuated slave robot for needle placement with an integrated fiber optic force sensor utilizing the Fabry-Perot interferometry (FPI) sensing principle. The sensor flexure is optimized and embedded in the slave robot for measuring needle insertion force. A novel, compact opto-mechanical FPI sensor interface is integrated into the MRI robot control system. By leveraging the complementary features of pneumatic and piezoelectric actuation, a pneumatically actuated haptic master robot is also developed to render the forces associated with needle placement interventions to the clinician. An aluminum load cell is implemented and calibrated to close the impedance control loop of the master robot. A force-position control algorithm is developed to control the hybrid actuated system. Teleoperated needle insertion is demonstrated under live MR imaging, where the slave robot resides in the scanner bore and the user manipulates the master beside the patient outside the bore. Force and position tracking results of the master-slave robot are demonstrated to validate the tracking performance of the integrated system, which achieves a position tracking error of 0.318 mm and a sine-wave force tracking error of 2.227 N.

NeurIPS Conference 2010 Conference Paper

Object Bank: A High-Level Image Representation for Scene Classification & Semantic Feature Sparsification

  • Li-Jia Li
  • Hao Su
  • Li Fei-Fei
  • Eric Xing

Robust low-level image features have been proven to be effective representations for a variety of visual recognition tasks such as object recognition and scene classification; but pixels, or even local image patches, carry little semantic meaning. For high-level visual tasks, such low-level image representations are potentially not enough. In this paper, we propose a high-level image representation, called the Object Bank, where an image is represented as a scale-invariant response map of a large number of pre-trained generic object detectors, blind to the testing dataset or visual task. Leveraging the Object Bank representation, superior performance on high-level visual recognition tasks can be achieved with simple off-the-shelf classifiers such as logistic regression and linear SVM. Sparsity algorithms make our representation more efficient and scalable for large scene datasets, and reveal semantically meaningful feature patterns.