Arrow Research search

Author name cluster

Yiming Wang

Possible papers associated with this exact author name in Arrow. This page groups case-insensitive exact name matches and is not a full identity disambiguation profile.

34 papers
2 author rows

Possible papers

34

AAAI Conference 2026 Conference Paper

DSAP: Enhancing Generalization in Goal-Conditioned Reinforcement Learning

  • Yiming Wang
  • Kaiyan Zhao
  • Ming Yang
  • Yan Li
  • Furui Liu
  • Jiayu Chen
  • Leong Hou U

Goal-conditioned Reinforcement Learning (RL) is a promising direction for training agents capable of tackling a variety of tasks. However, generalizing to new goals in different environments remains a central challenge for goal-conditioned RL agents. Existing methods often rely on state abstraction, which involves learning abstracted state representations by excluding irrelevant features, to improve generalization. Despite their success in simplified settings, these methods often fail to generalize effectively to realistic environments with varied goals. In this work, we propose to enhance generalization through state abstraction from the perspective of causal inference. We hypothesize that the generalization gap arises in part due to unobserved confounders: latent variables that simultaneously influence both the global and goal states. To address this, we introduce Deconfounded State Abstraction for Policy learning (DSAP), a novel framework that mitigates backdoor confounding by employing a learned causal graph as a proxy for the hidden confounders. We provide theoretical analysis demonstrating that DSAP improves both the learning process and the generalization capability of goal-conditioned policies. Extensive experiments across different settings of multiple benchmarks show that our method significantly outperforms existing methods.

AAAI Conference 2026 Conference Paper

Explore to Learn: Latent Exploration Through Disentangled Synergy Patterns for Reinforcement Learning in Overactuated Control

  • Yiming Wang
  • Kaiyan Zhao
  • Xu Li
  • Yan Li
  • Jiayu Chen
  • Steven Morad
  • Leong Hou U

Control in high-dimensional action spaces remains a fundamental challenge in reinforcement learning (RL), primarily due to inefficient exploration of the action space. While recent methods attempt to guide exploration, they often fall short of achieving the agility and coordination exhibited in biological motor control. Inspired by how organisms exploit muscle synergies for efficient movement, we propose Explore to Learn (ETL), a two-stage framework that first discovers fundamental synergy patterns and then leverages them for task-specific policy learning. In the first stage, ETL discovers underlying synergy patterns by deploying a targeted exploration policy. These patterns are modeled as latent directions in a low-dimensional space, along which the agent is guided to collect diverse and structured muscle activation trajectories. A variational autoencoder (VAE) is then trained to encode high-dimensional actions into a latent space whose dimensions correspond to the synergy patterns. In the second stage, the policy is trained entirely in this synergy-aware latent space, producing synergy coefficients that the decoder maps back to full-dimensional muscle actions. This structured representation significantly reduces the complexity of learning, while the decoder is further fine-tuned to enhance expressiveness and generalization across downstream tasks. Extensive experiments across musculoskeletal environments and the DMControl suite demonstrate that ETL consistently outperforms prior methods in both exploration efficiency and control performance, achieving superior scalability and generalization in overactuated control tasks.

AAAI Conference 2026 Conference Paper

Latent State-Predictive Exploration for Deep Reinforcement Learning

  • Yiming Wang
  • Kaiyan Zhao
  • Borong Zhang
  • Yan Li
  • Leong Hou U

Reinforcement learning (RL) has achieved promising results in continuous control tasks, where efficient exploration of the state space is crucial for success. However, many recent RL approaches still struggle with sample inefficiency and insufficient exploration for long-horizon tasks, particularly in environments characterized by high-dimensional and complex state spaces. To address these challenges, we propose a novel exploration framework, Latent State Predictive Exploration (LSPE). The core idea behind LSPE is to endow the agent with a form of "foresight" to enhance exploration in long-horizon settings. Specifically, LSPE employs a state encoder to learn compact latent representations from high-dimensional visual observations, effectively filtering out irrelevant or noisy information. To further enrich and stabilize these representations, we incorporate a diffusion-based self-predictive module that enforces temporal consistency by predicting future states, thereby improving both exploration and downstream predictive control. Additionally, we introduce an Exploration Reward Function (ERF) that explicitly encourages the agent to visit novel latent states. This reward signal promotes more efficient and scalable exploration in complex environments. We evaluate LSPE across a diverse set of challenging long-horizon navigation and manipulation tasks, spanning simulation environments such as Habitat and Robosuite, as well as deployment on a real robot in a physical indoor environment. Experimental results show that LSPE substantially enhances exploration efficiency and scales effectively to complex, high-dimensional tasks.

AAAI Conference 2026 Conference Paper

RGMP: Recurrent Geometric-prior Multimodal Policy for Generalizable Humanoid Robot Manipulation

  • Xuetao Li
  • Wenke Huang
  • Nengyuan Pan
  • Kaiyan Zhao
  • Songhua Yang
  • Yiming Wang
  • Mengde Li
  • Mang Ye

Humanoid robots exhibit significant potential in executing diverse human-level skills. However, current research predominantly relies on data-driven approaches that necessitate extensive training datasets to achieve robust multimodal decision-making capabilities and generalizable visuomotor control. These methods raise concerns due to the neglect of geometric reasoning in unseen scenarios and the inefficient modeling of robot-target relationships within the training data, resulting in a significant waste of training resources. To address these limitations, we present the Recurrent Geometric-prior Multimodal Policy (RGMP), an end-to-end framework that unifies geometric-semantic skill reasoning with data-efficient visuomotor control. For perception capabilities, we propose the Geometric-prior Skill Selector, which infuses geometric inductive biases into a vision language model, producing adaptive skill sequences for unseen scenes with minimal spatial common sense tuning. To achieve data-efficient robotic motion synthesis, we introduce the Adaptive Recursive Gaussian Network, which parameterizes robot-object interactions as a compact hierarchy of Gaussian processes that recursively encode multi-scale spatial relationships, yielding dexterous, data-efficient motion synthesis even from sparse demonstrations. Evaluated on both our humanoid robot and desktop robot, the RGMP framework achieves 87% task success in generalization tests and exhibits 5× greater data efficiency than the state-of-the-art model. This performance underscores its superior cross-domain generalization, paving the way for more versatile and data-efficient robotic systems.

AAAI Conference 2026 Conference Paper

Spherical Geometry Diffusion: Generating High-quality 3D Face Geometry via Sphere-anchored Representations

  • Junyi Zhang
  • Yiming Wang
  • Yunhong Lu
  • Qichao Wang
  • Wenzhe Qian
  • Xiaoyin Xu
  • David Gu
  • Min Zhang

A fundamental challenge in text-to-3D face generation is achieving high-quality geometry. The core difficulty lies in the arbitrary and intricate distribution of vertices in 3D space, making it challenging for existing models to establish clean connectivity and resulting in suboptimal geometry. To address this, our core insight is to simplify the underlying geometric structure by constraining the distribution onto a simple and regular manifold, a topological sphere. Building on this, we first propose the Spherical Geometry Representation, a novel face representation that anchors geometric signals to uniform spherical coordinates. This guarantees a regular point distribution, from which the mesh connectivity can be robustly reconstructed. Critically, this canonical sphere can be seamlessly unwrapped into a 2D map, creating a perfect synergy with powerful 2D generative models. We then introduce Spherical Geometry Diffusion, a conditional diffusion framework built upon this 2D map. It enables diverse and controllable generation by jointly modeling geometry and texture, where the geometry explicitly conditions the texture synthesis process. Our method's effectiveness is demonstrated through its success in a wide range of tasks: text-to-3D generation, face reconstruction, and text-based 3D editing. Extensive experiments show that our approach substantially outperforms existing methods in geometric quality, textual fidelity, and inference efficiency.

IJCAI Conference 2025 Conference Paper

BILE: An Effective Behavior-based Latent Exploration Scheme for Deep Reinforcement Learning

  • Yiming Wang
  • Kaiyan Zhao
  • Yan Li
  • Leong Hou U

Efficient exploration of state spaces is critical for the success of deep reinforcement learning (RL). While many methods leverage exploration bonuses to encourage exploration instead of relying solely on extrinsic rewards, these bonus-based approaches often face challenges with learning efficiency and scalability, especially in environments with high-dimensional state spaces. To address these issues, we propose BehavIoral metric-based Latent Exploration (BILE). The core idea is to learn a compact representation within the behavioral metric space that preserves value differences between states. By introducing additional rewards to encourage exploration in this latent space, BILE drives the agent to visit states with higher value diversity and exhibit more behaviorally distinct actions, leading to more effective exploration of the state space. Additionally, we present a novel behavioral metric for efficient and robust training of the state encoder, backed by theoretical guarantees. Extensive experiments on high-dimensional environments, including realistic indoor scenarios in Habitat, robotic tasks in Robosuite, and challenging discrete Minigrid benchmarks, demonstrate the superiority and scalability of our method over other approaches.

NeurIPS Conference 2025 Conference Paper

ConViS-Bench: Estimating Video Similarity Through Semantic Concepts

  • Benedetta Liberatori
  • Alessandro Conti
  • Lorenzo Vaquero
  • Yiming Wang
  • Elisa Ricci
  • Paolo Rota

What does it mean for two videos to be similar? Videos may appear similar when judged by the actions they depict, yet entirely different if evaluated based on the locations where they were filmed. While humans naturally compare videos by taking different aspects into account, this ability has not been thoroughly studied and presents a challenge for models that often depend on broad global similarity scores. Large Multimodal Models (LMMs) with video understanding capabilities open new opportunities for leveraging natural language in comparative video tasks. We introduce Concept-based Video Similarity estimation (ConViS), a novel task that compares pairs of videos by computing interpretable similarity scores across a predefined set of key semantic concepts. ConViS allows for human-like reasoning about video similarity and enables new applications such as concept-conditioned video retrieval. To support this task, we also introduce ConViS-Bench, a new benchmark comprising carefully annotated video pairs spanning multiple domains. Each pair comes with concept-level similarity scores and textual descriptions of both differences and similarities. Additionally, we benchmark several state-of-the-art models on ConViS, providing insights into their alignment with human judgments. Our results reveal significant performance differences on ConViS, indicating that some concepts present greater challenges for estimating video similarity. We believe that ConViS-Bench will serve as a valuable resource for advancing research in language-driven video understanding.

IJCAI Conference 2025 Conference Paper

Efficient Diversity-based Experience Replay for Deep Reinforcement Learning

  • Kaiyan Zhao
  • Yiming Wang
  • Yuyang Chen
  • Yan Li
  • Leong Hou U
  • Xiaoguang Niu

Experience replay is widely used to improve learning efficiency in reinforcement learning by leveraging past experiences. However, existing experience replay methods, whether based on uniform or prioritized sampling, often suffer from low efficiency, particularly in real-world scenarios with high-dimensional state spaces. To address this limitation, we propose a novel approach, Efficient Diversity-based Experience Replay (EDER). EDER employs a determinantal point process to model the diversity between samples and prioritizes replay accordingly. To further enhance learning efficiency, we incorporate Cholesky decomposition for handling large state spaces in realistic environments. Additionally, rejection sampling is applied to select samples with higher diversity, thereby improving overall learning efficacy. Extensive experiments are conducted on robotic manipulation tasks in MuJoCo, Atari games, and realistic indoor environments in Habitat. The results demonstrate that our approach not only significantly improves learning efficiency but also achieves superior performance in high-dimensional, realistic environments.
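The determinantal-point-process diversity scoring described in this abstract can be caricatured in a few lines. The sketch below is hypothetical and is not the authors' code: it scores a candidate batch by the log-determinant of an RBF kernel Gram matrix (the DPP likelihood up to normalization), computed via Cholesky factorization for numerical stability, so that batches of near-duplicate states score far lower than spread-out ones.

```python
import numpy as np

def rbf_kernel(states, gamma=1.0):
    """Gram matrix K[i, j] = exp(-gamma * ||s_i - s_j||^2)."""
    sq = np.sum(states**2, axis=1)
    d2 = sq[:, None] + sq[None, :] - 2.0 * states @ states.T
    return np.exp(-gamma * d2)

def diversity_score(states, jitter=1e-6):
    """Log-determinant of the DPP kernel, via Cholesky for stability.
    Higher values mean the sampled states are more spread out."""
    K = rbf_kernel(states) + jitter * np.eye(len(states))
    L = np.linalg.cholesky(K)
    return 2.0 * float(np.sum(np.log(np.diag(L))))

rng = np.random.default_rng(0)
spread = rng.uniform(-1, 1, size=(8, 4))               # a diverse batch
clumped = spread[0] + 1e-3 * rng.normal(size=(8, 4))   # near-duplicates
assert diversity_score(spread) > diversity_score(clumped)
```

In a full replay buffer one would score (or reject-sample) candidate batches by this quantity rather than computing it for every subset, since the determinant grows expensive with batch size.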

ICRA Conference 2025 Conference Paper

ForceMimic: Force-Centric Imitation Learning with Force-Motion Capture System for Contact-Rich Manipulation

  • Wenhai Liu
  • Junbo Wang 0004
  • Yiming Wang
  • Weiming Wang
  • Cewu Lu

In most contact-rich manipulation tasks, humans apply time-varying forces to the target object, compensating for inaccuracies in the vision-guided hand trajectory. However, current robot learning algorithms primarily focus on trajectory-based policy, with limited attention given to learning force-related skills. To address this limitation, we introduce ForceMimic, a force-centric robot learning system, providing a natural, force-aware and robot-free robotic demonstration collection system, along with a hybrid force-motion imitation learning algorithm for robust contact-rich manipulation. Using the proposed ForceCapture system, an operator can peel a zucchini in 5 minutes, while force-feedback teleoperation takes over 13 minutes and struggles with task completion. With the collected data, we propose HybridIL to train a force-centric imitation learning model, equipped with hybrid force-position control primitive to fit the predicted wrench-position parameters during robot execution. Experiments demonstrate that our approach enables the model to learn a more robust policy under the contact-rich task of vegetable peeling, increasing the success rates by 54.5% relative to state-of-the-art pure-vision-based imitation learning. Hardware, code, data and more results can be found on the project website at https://forcemimic.github.io.

EAAI Journal 2025 Journal Article

Interpretable interval prediction of dam displacement based on variational autoencoder and improved temporal fusion transformer considering solar radiation effects

  • Taiqi Lu
  • Hao Gu
  • Chongshi Gu
  • Chenfei Shao
  • Yiming Wang
  • Dongyang Yuan

Ensuring the safety of dams is critical to maintaining national economic development and social stability, requiring the implementation of accurate displacement prediction methods for early detection of structural anomalies and effective risk mitigation. However, existing statistical models primarily focus on point predictions, failing to quantify the uncertainty in displacement variations, and often neglect the critical environmental factor of solar radiation. To address these limitations, this study proposes a novel interpretable interval prediction framework that integrates solar radiation factors into an advanced hydrostatic-temperature-time (AHTT) model. A variational autoencoder (VAE) is employed to extract robust latent features from a large volume of measured temperature data, effectively reducing temperature-related noise. Subsequently, an improved temporal fusion transformer method is introduced for probabilistic dam displacement prediction. This method uses an enhanced quantile loss function based on the Huber loss to generate both point and interval predictions that dynamically reflect the prediction uncertainty. In addition, an interpretable multi-head attention module is incorporated to quantify the contribution of each environmental factor. Hyperparameter tuning of the improved temporal fusion transformer is further optimized using Bayesian optimization based on the tree-structured Parzen estimator (TPE), which improves prediction accuracy. Engineering case studies validate that the proposed model not only achieves the highest point prediction accuracy, but also provides narrower prediction intervals with the best coverage width criterion. Ablation experiments and interpretability analyses further confirm the significant impact of solar radiation on dam displacement, providing valuable insights for the development of dam displacement prediction models and risk-informed decision making.

NeurIPS Conference 2025 Conference Paper

Learning Efficient Fuse-and-Refine for Feed-Forward 3D Gaussian Splatting

  • Yiming Wang
  • Lucy Chai
  • Xuan Luo
  • Michael Niemeyer
  • Manuel Lagunas
  • Stephen Lombardi
  • Siyu Tang
  • Tiancheng Sun

Recent advances in feed-forward 3D Gaussian Splatting have led to rapid improvements in efficient scene reconstruction from sparse views. However, most existing approaches construct Gaussian primitives directly aligned with the pixels in one or more of the input images. This leads to redundancies in the representation when input views overlap and constrains the position of the primitives to lie along the input rays without full flexibility in 3D space. Moreover, these pixel-aligned approaches do not naturally generalize to dynamic scenes, where effectively leveraging temporal information requires resolving both redundant and newly appearing content across frames. To address these limitations, we introduce a novel Fuse-and-Refine module that enhances existing feed-forward models by merging and refining the primitives in a canonical 3D space. At the core of our method is an efficient hybrid Splat-Voxel representation – from an initial set of pixel-aligned Gaussian primitives, we aggregate local features into a coarse-to-fine voxel hierarchy, and then use a sparse voxel transformer to process these voxel features and generate refined Gaussian primitives. By fusing and refining an arbitrary number of inputs into a consistent set of primitives, our representation effectively reduces redundancy and naturally adapts to temporal frames, enabling history-aware online reconstruction of dynamic scenes. Trained on large-scale static scene datasets, our model learns an effective global strategy to process around 200k primitives within 15ms and significantly enhances reconstruction quality compared to pixel-aligned reconstruction approaches. Without additional training, our model generalizes to video by fusing primitives across time, yielding a more temporally coherent result compared to baseline methods with graceful handling of occluded content. Our approach achieves state-of-the-art performance in both static and streaming scene reconstructions while running at interactive rates (15 fps with 350ms delay) on a single H100 GPU.

NeurIPS Conference 2025 Conference Paper

Learning from Disjoint Views: A Contrastive Prototype Matching Network for Fully Incomplete Multi-View Clustering

  • Yiming Wang
  • Qun Li
  • Dongxia Chang
  • Jie Wen
  • Hua Dai
  • Fu Xiao
  • Yao Zhao

Multi-view clustering aims to enhance clustering performance by leveraging information from diverse sources. However, its practical application is often hindered by a barrier: the lack of correspondences across views. This paper focuses on the understudied problem of fully incomplete multi-view clustering (FIMC), a scenario where existing methods fail due to their reliance on partial alignment. To address this problem, we introduce the Contrastive Prototype Matching Network (CPMN), a novel framework that establishes a new paradigm for cross-view alignment based on matching high-level categorical structures. Instead of aligning individual instances, CPMN performs a more robust cluster prototype alignment. CPMN first employs a correspondence-free graph contrastive learning approach, leveraging mutual k-nearest neighbors (MNN) to uncover intrinsic data structures and establish initial prototypes from entirely unpaired views. Building on the prototypes, we introduce a cross-view prototype graph matching stage to resolve category misalignment and forge a unified clustering structure. Finally, guided by this alignment, we devise a prototype-aware contrastive learning mechanism to promote semantic consistency, replacing the reliance on the initial MNN-based structural similarity. Extensive experiments on benchmark datasets demonstrate that our method significantly outperforms various baselines and ablation variants, validating its effectiveness.

JBHI Journal 2025 Journal Article

Ortho-OPD: an Automatic Osteotomy Planes Design Model for Orthognathic Surgery Based on Deep Learning

  • Yiming Wang
  • Xiangmin Li
  • Yang Yang
  • Mengjia Cheng
  • Xu Zhang
  • Jiahao Bao
  • Hongjun Qian
  • Xinyi Huang

Orthognathic surgery is applied to restore esthetical facial profile and functional occlusion for patients with dentofacial deformity. Virtual surgical planning (VSP) is indispensable for precise and individualized treatment. Manually designing osteotomy planes is time-consuming and highly experience-dependent. This study aimed to develop and validate an automatic osteotomy plane design method based on deep learning. Methods: A deep learning model, Ortho-OPD (orthognathic osteotomy planes designer), was proposed, consisting of a segmentation network and the random sample consensus (RANSAC) algorithm. The segmentation network, based on a convolutional neural network (CNN), segments the craniomaxillofacial (CMF) CT data. Osteotomy planes were then defined by the RANSAC algorithm. Ortho-OPD was trained on 71 samples and tested on 31 cases. The performance was evaluated quantitatively and qualitatively. Results: Ortho-OPD functioned smoothly, and all cases were successfully performed. The 3D boundary-sensitive loss was employed to optimize precision. Evaluation metrics included accuracy and clinical efficiency. The mean dice similarity coefficient (DSC) was 0.920 ± 0.032 in CMF segmentation. Ortho-OPD showcased excellent productivity, taking an average of about 9 seconds to complete virtual bimaxillary osteotomy compared to manual work. The angular errors between the predicted planes and ground truth planes, plus the shortest distance from the neural tube or the adjacent apical points to predicted planes, were examined, indicating no significant difference and reliability for preserving vital anatomical structures. Overall, the automatic osteotomy plane design from raw CT data was realized using Ortho-OPD, composed of CNN and RANSAC, providing an efficient and ideal alternative in orthognathic osteotomy planning.

NeurIPS Conference 2025 Conference Paper

PolyMath: Evaluating Mathematical Reasoning in Multilingual Contexts

  • Yiming Wang
  • Pei Zhang
  • Jialong Tang
  • Hao-Ran Wei
  • Baosong Yang
  • Rui Wang
  • Chenshu Sun
  • Feitong Sun

In this paper, we introduce PolyMath, a multilingual mathematical reasoning benchmark covering 18 languages and 4 easy-to-hard difficulty levels. Our benchmark ensures difficulty comprehensiveness, language diversity, and high-quality translation, making it a highly discriminative multilingual mathematical benchmark in the era of reasoning LLMs. We conduct a comprehensive evaluation for advanced LLMs and find that even Qwen-3-235B-A22B-Thinking and Gemini-2.5-pro achieve only 54.6 and 52.2 benchmark scores, with about 40% accuracy at the highest level. From a language perspective, our benchmark reveals several key challenges of LLMs in multilingual reasoning: (1) Reasoning performance varies widely across languages for current LLMs; (2) Input-output language consistency is low in reasoning LLMs and may be correlated with performance; (3) The thinking length differs significantly by language for current LLMs. Additionally, we demonstrate that controlling the output language in the instructions has the potential to affect reasoning performance, especially for some low-resource languages, suggesting a promising direction for improving multilingual capabilities in LLMs.

NeurIPS Conference 2025 Conference Paper

Sampling-Efficient Test-Time Scaling: Self-Estimating the Best-of-N Sampling in Early Decoding

  • Yiming Wang
  • Pei Zhang
  • Siyuan Huang
  • Baosong Yang
  • Zhuosheng Zhang
  • Fei Huang
  • Rui Wang

Test-time scaling enhances large language model performance by allocating additional compute resources during decoding. Best-of-N (BoN) sampling serves as a common sampling-based scaling technique, broadening the search space in parallel to find better solutions from the model distribution. However, its cost–performance trade-off is still underexplored. Two main challenges limit the efficiency of BoN sampling: (1) Generating N full samples consumes substantial GPU memory, reducing inference capacity under limited resources. (2) Reward models add extra memory and latency overhead, and training strong reward models introduces potential training data costs. Although some studies have explored efficiency improvements, none have addressed both challenges at once. To address this gap, we propose Self-Truncation Best-of-N (ST-BoN), a decoding method that avoids fully generating all N samples and eliminates the need for reward models. It leverages early sampling consistency in the model’s internal states to identify the most promising path and truncate suboptimal ones. In terms of cost, ST-BoN reduces dynamic GPU memory usage by over 80% and inference latency by 50%. In terms of cost–performance trade-off, ST-BoN achieves the same performance as Full-BoN while saving computational cost by 70%–80%, and under the same cost, it can improve accuracy by 3–4 points.
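The self-truncation idea can be illustrated with a deliberately crude, model-free stand-in. The sketch below is hypothetical: where the paper uses consistency of the model's internal hidden states, it substitutes token overlap between early prefixes, keeping the partial sample that agrees most with its siblings and truncating the rest before any sample is fully decoded.

```python
from collections import Counter

def st_bon_select(prefixes):
    """Pick the index of the partial sample most consistent with the
    others, measured here by mean token overlap (a crude stand-in for
    the paper's hidden-state consistency signal)."""
    def overlap(a, b):
        ca, cb = Counter(a), Counter(b)
        return sum((ca & cb).values()) / max(len(a), len(b))
    scores = []
    for i, p in enumerate(prefixes):
        others = [q for j, q in enumerate(prefixes) if j != i]
        scores.append(sum(overlap(p, q) for q in others) / len(others))
    return max(range(len(prefixes)), key=scores.__getitem__)

# Three early prefixes: two agree, one diverges, so a consensus path
# is identifiable before full generation and without a reward model.
prefixes = [
    ["2", "+", "2", "=", "4"],
    ["2", "+", "2", "=", "4"],
    ["the", "answer", "is", "five"],
]
assert st_bon_select(prefixes) in (0, 1)
```

The savings come from the same place as in the paper's framing: the N − 1 truncated branches never pay the memory and latency cost of full decoding, and no reward model is ever queried.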

IJCAI Conference 2025 Conference Paper

STAMImputer: Spatio-Temporal Attention MoE for Traffic Data Imputation

  • Yiming Wang
  • Hao Peng
  • Senzhang Wang
  • Haohua Du
  • Chunyang Liu
  • Jia Wu
  • Guanlin Wu

Traffic data imputation is fundamentally important to support various applications in intelligent transportation systems such as traffic flow prediction. However, existing time-to-space sequential methods often fail to effectively extract features in block-wise missing data scenarios. Meanwhile, the static graph structure for spatial feature propagation significantly constrains the model's flexibility in handling the distribution shift issue for the nonstationary traffic data. To address these issues, this paper proposes a Spatio-Temporal Attention Mixture of experts network named STAMImputer for traffic data imputation. Specifically, we introduce a Mixture of Experts (MoE) framework to capture latent spatio-temporal features and their influence weights, effectively imputing block-wise missing data. A novel Low-rank guided Sampling Graph ATtention (LrSGAT) mechanism is designed to dynamically balance the local and global correlations across road networks. The sampled attention vectors are utilized to generate dynamic graphs that capture real-time spatial correlations. Extensive experiments are conducted on four traffic datasets for evaluation. The results show that STAMImputer achieves significant performance improvements compared with existing SOTA approaches. Our code is available at https://github.com/RingBDStack/STAMImupter.

NeurIPS Conference 2025 Conference Paper

Training-free Online Video Step Grounding

  • Luca Zanella
  • Massimiliano Mancini
  • Yiming Wang
  • Alessio Tonioni
  • Elisa Ricci

Given a task and a set of steps composing it, Video Step Grounding (VSG) aims to detect which steps are performed in a video. Standard approaches for this task require a labeled training set (e.g., with step-level annotations or narrations), which may be costly to collect. Moreover, they process the full video offline, limiting their applications for scenarios requiring online decisions. Thus, in this work, we explore how to perform VSG online and without training. We achieve this by exploiting the zero-shot capabilities of recent Large Multimodal Models (LMMs). In particular, we use LMMs to predict the step associated with a restricted set of frames, without access to the whole video. We show that this online strategy without task-specific tuning outperforms offline and training-based models. Motivated by this finding, we develop Bayesian Grounding with Large Multimodal Models (BaGLM), further injecting knowledge of past frames into the LMM-based predictions. BaGLM exploits Bayesian filtering principles, modeling step transitions via (i) a dependency matrix extracted through large language models and (ii) an estimation of step progress. Experiments on three datasets show superior performance of BaGLM over state-of-the-art training-based offline methods.

ICRA Conference 2024 Conference Paper

AirExo: Low-Cost Exoskeletons for Learning Whole-Arm Manipulation in the Wild

  • Hongjie Fang
  • Haoshu Fang
  • Yiming Wang
  • Jieji Ren
  • Jingjing Chen
  • Ruo Zhang
  • Weiming Wang
  • Cewu Lu

While humans can use parts of their arms other than the hands for manipulations like gathering and supporting, whether robots can effectively learn and perform the same type of operations remains relatively unexplored. As these manipulations require joint-level control to regulate the complete poses of the robots, we develop AirExo, a low-cost, adaptable, and portable dual-arm exoskeleton, for teleoperation and demonstration collection. As collecting teleoperated data is expensive and time-consuming, we further leverage AirExo to collect cheap in-the-wild demonstrations at scale. Under our in-the-wild learning framework, we show that with only 3 minutes of the teleoperated demonstrations, augmented by diverse and extensive in-the-wild data collected by AirExo, robots can learn a policy that is comparable to or even better than one learned from teleoperated demonstrations lasting over 20 minutes. Experiments demonstrate that our approach enables the model to learn a more general and robust policy across the various stages of the task, enhancing the success rates in task completion even with the presence of disturbances. Project website: airexo.github.io.

IJCAI Conference 2024 Conference Paper

Boosting Single Positive Multi-label Classification with Generalized Robust Loss

  • Yanxi Chen
  • Chunxiao Li
  • Xinyang Dai
  • Jinhuan Li
  • Weiyu Sun
  • Yiming Wang
  • Renyuan Zhang
  • Tinghe Zhang

Multi-label learning (MLL) requires comprehensive multi-semantic annotations that are hard to fully obtain, thus often resulting in missing labels scenarios. In this paper, we investigate Single Positive Multi-label Learning (SPML), where each image is associated with merely one positive label. Existing SPML methods only focus on designing losses using mechanisms such as hard pseudo-labeling and robust losses, mostly leading to unacceptable false negatives. To address this issue, we first propose a generalized loss framework based on expected risk minimization to provide soft pseudo labels, and point out that the former losses can be seamlessly converted into our framework. In particular, we design a novel robust loss based on our framework, which enjoys flexible coordination between false positives and false negatives, and can additionally deal with the imbalance between positive and negative samples. Extensive experiments show that our approach can significantly improve SPML performance and outperform the vast majority of state-of-the-art methods on all four benchmarks. Our code is available at https://github.com/yan4xi1/GRLoss.

AAAI Conference 2024 Conference Paper

DMMR: Cross-Subject Domain Generalization for EEG-Based Emotion Recognition via Denoising Mixed Mutual Reconstruction

  • Yiming Wang
  • Bin Zhang
  • Yujiao Tang

Electroencephalography (EEG) has proven to be effective in emotion analysis. However, current methods struggle with individual variations, complicating the generalization of models trained on data from source subjects to unseen target subjects. To tackle this issue, we propose the Denoising Mixed Mutual Reconstruction (DMMR) model, employing two-stage pre-training followed by fine-tuning. During the pre-training phase, DMMR leverages self-supervised learning through a multi-decoder autoencoder, which encodes and reconstructs features of one subject, aiming to generate features resembling those from other subjects within the same category, thereby encouraging the encoder to learn subject-invariant features. We introduce a hidden-layer mixed data augmentation approach to mitigate the limitations posed by the scarcity of source data, thereby extending the method to a two-stage process. To bolster stability against noise, we incorporate a noise injection method, named “Time Steps Shuffling”, into the input data. During the fine-tuning phase, an emotion classifier is integrated to extract emotion-related features. Experimental accuracy on the SEED and SEED-IV datasets reached 88.27% (±5.62) and 72.70% (±8.01), respectively, demonstrating state-of-the-art and comparable performance, thereby showcasing the superiority of DMMR. The proposed data augmentation and noise injection methods were observed to complementarily enhance accuracy and stability, thus alleviating the aforementioned issues.

NeurIPS Conference 2024 Conference Paper

Embedding Trajectory for Out-of-Distribution Detection in Mathematical Reasoning

  • Yiming Wang
  • Pei Zhang
  • Baosong Yang
  • Derek F. Wong
  • Zhuosheng Zhang
  • Rui Wang

Real-world data deviating from the independent and identically distributed (*i.i.d.*) assumption of in-distribution training data poses security threats to deep networks, thus advancing out-of-distribution (OOD) detection algorithms. Detection methods in generative language models (GLMs) mainly focus on uncertainty estimation and embedding distance measurement, with the latter proven to be most effective in traditional linguistic tasks like summarization and translation. However, another complex generative scenario, mathematical reasoning, poses significant challenges to embedding-based methods due to the high density of its output spaces, but this feature causes larger discrepancies in the embedding shift trajectory between different samples in latent spaces. Hence, we propose a trajectory-based method, TV score, which uses trajectory volatility for OOD detection in mathematical reasoning. Experiments show that our method outperforms all traditional algorithms on GLMs under mathematical reasoning scenarios and can be extended to more applications with high-density features in output spaces, such as multiple-choice questions.

NeurIPS Conference 2024 Conference Paper

Rethinking Exploration in Reinforcement Learning with Effective Metric-Based Exploration Bonus

  • Yiming Wang
  • Kaiyan Zhao
  • Furui Liu
  • Leong Hou U

Enhancing exploration in reinforcement learning (RL) through the incorporation of intrinsic rewards, specifically by leveraging *state discrepancy* measures within various metric spaces as exploration bonuses, has emerged as a prevalent strategy to encourage agents to visit novel states. The critical factor lies in how to quantify the difference between adjacent states as *novelty* for promoting effective exploration. Nonetheless, existing methods that evaluate state discrepancy in the latent space under $L_1$ or $L_2$ norm often depend on count-based episodic terms as scaling factors for exploration bonuses, significantly limiting their scalability. Additionally, methods that utilize the bisimulation metric for evaluating state discrepancies face a theory-practice gap due to improper approximations in metric learning, particularly struggling with *hard exploration* tasks. To overcome these challenges, we introduce the **E**ffective **M**etric-based **E**xploration-bonus (EME). EME critically examines and addresses the inherent limitations and approximation inaccuracies of current metric-based state discrepancy methods for exploration, proposing a robust metric for state discrepancy evaluation backed by comprehensive theoretical analysis. Furthermore, we propose the diversity-enhanced scaling factor integrated into the exploration bonus to be dynamically adjusted by the variance of prediction from an ensemble of reward models, thereby enhancing exploration effectiveness in particularly challenging scenarios. Extensive experiments are conducted on hard exploration tasks within Atari games, Minigrid, Robosuite, and Habitat, which illustrate our method's scalability to various scenarios. The project website can be found at https://sites.google.com/view/effective-metric-exploration.

ICML Conference 2023 Conference Paper

DS-1000: A Natural and Reliable Benchmark for Data Science Code Generation

  • Yuhang Lai
  • Chengxi Li 0011
  • Yiming Wang
  • Tianyi Zhang
  • Ruiqi Zhong
  • Luke Zettlemoyer
  • Wen-tau Yih
  • Daniel Fried

We introduce DS-1000, a code generation benchmark with a thousand data science problems spanning seven Python libraries, such as NumPy and Pandas. Compared to prior works, DS-1000 incorporates three core features. First, our problems reflect diverse, realistic, and practical use cases since we collected them from StackOverflow. Second, our automatic evaluation is highly specific (reliable) – across all Codex-002-predicted solutions that our evaluation accepts, only 1.8% of them are incorrect; we achieve this with multi-criteria metrics, checking both functional correctness by running test cases and surface-form constraints by restricting API usages or keywords. Finally, we proactively defend against memorization by slightly modifying our problems to be different from the original StackOverflow source; consequently, models cannot answer them correctly by memorizing the solutions from pre-training. The current best public system (Codex-002) achieves 43.3% accuracy, leaving ample room for improvement. We release our benchmark at https://ds1000-code-gen.github.io.

NeurIPS Conference 2023 Conference Paper

Efficient Potential-based Exploration in Reinforcement Learning using Inverse Dynamic Bisimulation Metric

  • Yiming Wang
  • Ming Yang
  • Renzhi Dong
  • Binbin Sun
  • Furui Liu
  • Leong Hou U

Reward shaping is an effective technique for integrating domain knowledge into reinforcement learning (RL). However, traditional approaches like potential-based reward shaping rely entirely on manually designed shaping reward functions, which significantly restricts exploration efficiency and introduces human cognitive biases. A number of RL methods have been proposed to boost exploration by designing an intrinsic reward signal as an exploration bonus; nevertheless, these methods heavily rely on the count-based episodic term in their exploration bonus, which falls short in scalability. To address these limitations, we propose a general end-to-end potential-based exploration bonus for deep RL via potentials of state discrepancy, which motivates the agent to discover novel states and provides denser rewards without manual intervention. Specifically, we measure the novelty of adjacent states by calculating their distance using the bisimulation metric-based potential function, which enhances the agent's exploration and ensures policy invariance. In addition, we offer a theoretical guarantee on our inverse dynamic bisimulation metric, bounding the value difference and ensuring that the agent explores states with higher TD error, thus significantly improving training efficiency. The proposed approach is named **LIBERTY** (exp**L**oration v**I**a **B**isimulation m**E**t**R**ic-based s**T**ate discrepanc**Y**), which is comprehensively evaluated on the MuJoCo and Arcade Learning Environments. Extensive experiments have verified the superiority and scalability of our algorithm compared with other competitive methods.

AAAI Conference 2023 Conference Paper

Query Your Model with Definitions in FrameNet: An Effective Method for Frame Semantic Role Labeling

  • Ce Zheng
  • Yiming Wang
  • Baobao Chang

Frame Semantic Role Labeling (FSRL) identifies arguments and labels them with frame semantic roles defined in FrameNet. Previous research tends to divide FSRL into argument identification and role classification. Such methods usually model role classification as naive multi-class classification and treat arguments individually, which neglects label semantics and interactions between arguments, thus hindering the performance and generalization of models. In this paper, we propose a query-based framework named ArGument Extractor with Definitions in FrameNet (AGED) to mitigate these problems. Definitions of frames and frame elements (FEs) in FrameNet can be used to query arguments in text. Encoding text-definition pairs can guide models in learning label semantics and strengthening argument interactions. Experiments show that AGED outperforms the previous state-of-the-art by up to 1.3 F1-score on two FrameNet datasets and demonstrate the generalization power of AGED in zero-shot and few-shot scenarios. Our code and technical appendix are available at https://github.com/PKUnlp-icler/AGED.

NeurIPS Conference 2023 Conference Paper

Vocabulary-free Image Classification

  • Alessandro Conti
  • Enrico Fini
  • Massimiliano Mancini
  • Paolo Rota
  • Yiming Wang
  • Elisa Ricci

Recent advances in large vision-language models have revolutionized the image classification paradigm. Despite showing impressive zero-shot capabilities, a pre-defined set of categories, a.k.a. the vocabulary, is assumed at test time for composing the textual prompts. However, such an assumption can be impractical when the semantic context is unknown and evolving. We thus formalize a novel task, termed Vocabulary-free Image Classification (VIC), where we aim to assign to an input image a class that resides in an unconstrained language-induced semantic space, without the prerequisite of a known vocabulary. VIC is a challenging task as the semantic space is extremely large, containing millions of concepts, with hard-to-discriminate fine-grained categories. In this work, we first empirically verify that representing this semantic space by means of an external vision-language database is the most effective way to obtain semantically relevant content for classifying the image. We then propose Category Search from External Databases (CaSED), a method that exploits a pre-trained vision-language model and an external vision-language database to address VIC in a training-free manner. CaSED first extracts a set of candidate categories from captions retrieved from the database based on their semantic similarity to the image, and then assigns to the image the best matching candidate category according to the same vision-language model. Experiments on benchmark datasets validate that CaSED outperforms other complex vision-language frameworks, while being efficient with much fewer parameters, paving the way for future research in this direction.

NeurIPS Conference 2023 Conference Paper

When Visual Prompt Tuning Meets Source-Free Domain Adaptive Semantic Segmentation

  • Xinhong Ma
  • Yiming Wang
  • Hao Liu
  • Tianyu Guo
  • Yunhe Wang

Source-free domain adaptive semantic segmentation aims to adapt a pre-trained source model to the unlabeled target domain without accessing the private source data. Previous methods usually fine-tune the entire network, which suffers from expensive parameter tuning. To avoid this problem, we propose to utilize visual prompt tuning for parameter-efficient adaptation. However, the existing visual prompt tuning methods are unsuitable for source-free domain adaptive semantic segmentation due to the following two reasons: (1) Commonly used visual prompts like input tokens or pixel-level perturbations cannot reliably learn informative knowledge beneficial for semantic segmentation. (2) Visual prompts require sufficient labeled data to fill the gap between the pre-trained model and downstream tasks. To alleviate these problems, we propose a universal unsupervised visual prompt tuning (Uni-UVPT) framework, which is applicable to various transformer-based backbones. Specifically, we first divide the source pre-trained backbone with frozen parameters into multiple stages, and propose a lightweight prompt adapter for progressively encoding informative knowledge into prompts and enhancing the generalization of target features between adjacent backbone stages. Cooperatively, a novel adaptive pseudo-label correction strategy with a multiscale consistency loss is designed to alleviate the negative effect of target samples with noisy pseudo labels and raise the capacity of visual prompts to spatial perturbations. Extensive experiments demonstrate that Uni-UVPT achieves state-of-the-art performance on GTA5 $\to$ Cityscapes and SYNTHIA $\to$ Cityscapes tasks and can serve as a universal and parameter-efficient framework for large-model unsupervised knowledge transfer. Code will be available at https://gitee.com/mindspore/models/tree/master/research/cv/uni-uvpt and https://github.com/huawei-noah/noah-research/tree/master/uni-uvpt.

IJCAI Conference 2022 Conference Paper

Weakly-supervised Text Classification with Wasserstein Barycenters Regularization

  • Jihong Ouyang
  • Yiming Wang
  • Ximing Li
  • Changchun Li

Weakly-supervised text classification aims to train predictive models with unlabeled texts and a few representative words of classes, referred to as category words, rather than labeled texts. These weak supervisions are much cheaper and easier to collect in real-world scenarios. To resolve this task, we propose a novel deep classification model, namely Weakly-supervised Text Classification with Wasserstein Barycenter Regularization (WTC-WBR). Specifically, we initialize the pseudo-labels of texts by using the category word occurrences, and formulate a weakly self-training framework to iteratively update the weakly-supervised targets by combining the pseudo-labels with the sharpened predictions. Most importantly, we suggest a Wasserstein barycenter regularization with the weakly-supervised targets on the deep feature space. The intuition is that the texts tend to be close to the corresponding Wasserstein barycenter indicated by weakly-supervised targets. Another benefit is that the regularization can capture the geometric information of deep feature space to boost the discriminative power of deep features. Experimental results demonstrate that WTC-WBR outperforms the existing weakly-supervised baselines, and achieves comparable performance to semi-supervised and supervised baselines.

AAAI Conference 2021 Conference Paper

GraphMSE: Efficient Meta-path Selection in Semantically Aligned Feature Space for Graph Neural Networks

  • Yi Li
  • Yilun Jin
  • Guojie Song
  • Zihao Zhu
  • Chuan Shi
  • Yiming Wang

Heterogeneous information networks (HINs) are ideal for describing real-world data with different types of entities and relationships. To carry out machine learning on HINs, meta-paths are widely utilized to extract semantics with pre-defined patterns, and models such as graph convolutional networks (GCNs) are thus enabled. However, previous works generally assume a fixed set of meta-paths, which is unrealistic as real-world data are overwhelmingly diverse. Therefore, it is appealing if meta-paths can be automatically selected given an HIN, yet existing works aiming at this problem possess drawbacks, such as poor efficiency and ignoring feature heterogeneity. To address these drawbacks, we propose GraphMSE, an efficient heterogeneous GCN combined with automatic meta-path selection. Specifically, we design highly efficient meta-path sampling techniques, and then injectively project sampled meta-path instances to vectors. We then design a novel semantic feature space alignment, aiming to align the meta-path instance vectors and hence facilitate meta-path selection. Extensive experiments on real-world datasets demonstrate that GraphMSE outperforms state-of-the-art counterparts, figures out important meta-paths, and is dramatically (e.g., 200 times) more efficient.

IJCAI Conference 2021 Conference Paper

Layer-Assisted Neural Topic Modeling over Document Networks

  • Yiming Wang
  • Ximing Li
  • Jihong Ouyang

Neural topic modeling provides a flexible, efficient, and powerful way to extract topic representations from text documents. Unfortunately, most existing models cannot handle text data with network links, such as web pages with hyperlinks and scientific papers with citations. To model this kind of data, we develop a novel neural topic model, namely the Layer-Assisted Neural Topic Model (LANTM), which can be interpreted from the perspective of variational auto-encoders. Our major motivation is to enhance the topic representation encoding by using not only the text contents but also the assisted network links. Specifically, LANTM encodes the texts and network links to the topic representations by an augmented network with graph convolutional modules, and decodes them by maximizing the likelihood of the generative process. Neural variational inference is adopted for efficient inference. Experimental results validate that LANTM significantly outperforms the existing models on topic quality, text classification and link prediction.

AAAI Conference 2019 Short Paper

Robust Principal Component Analysis-Based Infrared Small Target Detection

  • Qiwei Chen
  • Cheng Wu
  • Yiming Wang

A method based on the Robust Principal Component Analysis (RPCA) technique is proposed to detect small targets in infrared images. Using the low-rank characteristic of the background and the sparse characteristic of the target, the observed image is regarded as the sum of a low-rank background matrix and a sparse outlier matrix, and the decomposition is then solved by RPCA. The infrared small target is extracted from the single-frame image or multi-frame sequence. To obtain a more efficient algorithm, the iteration process in the augmented Lagrange multiplier method is improved. The simulation results show that the method can detect the small target precisely and efficiently.

AAAI Conference 2019 Short Paper

T-Center: A Novel Discriminative Feature Extraction Approach for Iris Recognition

  • Yifeng Chen
  • Cheng Wu
  • Yiming Wang

For large-scale iris recognition tasks, the determination of classification thresholds remains a challenging task, especially in practical applications where the sample space is growing rapidly. Due to the complexity of iris samples, the classification threshold is difficult to determine as samples increase. The key to solving such threshold determination problems is to obtain iris feature vectors with more obvious discrimination. Therefore, we train deep convolutional neural networks on a large number of iris samples to extract iris features. More importantly, an optimized center loss function, referred to as Tight Center (T-Center) Loss, is used to solve the problem of insufficient discrimination caused by the Softmax loss function. To evaluate the effectiveness of our proposed method, we use cosine similarity to estimate the similarity between the features on the published datasets CASIA-IrisV4 and IITD2.0. Our experimental results demonstrate that the T-Center loss can minimize intra-class variance and maximize inter-class variance, achieving significant performance on the benchmark experiments.

IJCAI Conference 2018 Conference Paper

Galaxy Network Embedding: A Hierarchical Community Structure Preserving Approach

  • Lun Du
  • Zhicong Lu
  • Yun Wang
  • Guojie Song
  • Yiming Wang
  • Wei Chen

Network embedding is a method of learning a low-dimensional vector representation of network vertices under the condition of preserving different types of network properties. Previous studies mainly focus on preserving structural information of vertices at a particular scale, like neighbor information or community information, but cannot preserve the hierarchical community structure, which would enable the network to be easily analyzed at various scales. Inspired by the hierarchical structure of galaxies, we propose the Galaxy Network Embedding (GNE) model, which formulates an optimization problem with spherical constraints to describe the hierarchical community structure preserving network embedding. More specifically, we present an approach of embedding communities into a low-dimensional spherical surface, the center of which represents the parent community they belong to. Our experiments reveal that the representations from GNE preserve the hierarchical community structure and show advantages in several applications such as vertex multi-class classification and network visualization. The source code of GNE is available online.

NeurIPS Conference 2014 Conference Paper

Accelerated Mini-batch Randomized Block Coordinate Descent Method

  • Tuo Zhao
  • Mo Yu
  • Yiming Wang
  • Raman Arora
  • Han Liu

We consider regularized empirical risk minimization problems. In particular, we minimize the sum of a smooth empirical risk function and a nonsmooth regularization function. When the regularization function is block separable, we can solve the minimization problems in a randomized block coordinate descent (RBCD) manner. Existing RBCD methods usually decrease the objective value by exploiting the partial gradient of a randomly selected block of coordinates in each iteration. Thus they need all data to be accessible so that the partial gradient of the block can be exactly obtained. However, such a "batch" setting may be computationally expensive in practice. In this paper, we propose a mini-batch randomized block coordinate descent (MRBCD) method, which estimates the partial gradient of the selected block based on a mini-batch of randomly sampled data in each iteration. We further accelerate the MRBCD method by exploiting the semi-stochastic optimization scheme, which effectively reduces the variance of the partial gradient estimators. Theoretically, we show that for strongly convex functions, the MRBCD method attains lower overall iteration complexity than existing RBCD methods. As an application, we further trim the MRBCD method to solve regularized sparse learning problems. Our numerical experiments show that the MRBCD method naturally exploits the sparsity structure and achieves better computational performance than existing methods.