Arrow Research

Author name cluster

Cewu Lu

Possible papers associated with this exact author name in Arrow. This page groups case-insensitive exact name matches and is not a full identity disambiguation profile.

66 papers
2 author rows

Possible papers (66)

AAAI Conference 2026 Conference Paper

Exploring Category-level Articulated Object Pose Tracking on SE(3) Manifolds

  • Xianhui Meng
  • Yukang Huo
  • Li Zhang
  • Liu Liu
  • Haonan Jiang
  • Yan Zhong
  • Pingrui Zhang
  • Cewu Lu

Articulated objects are prevalent in daily life and robotic manipulation tasks. However, compared to rigid objects, pose tracking for articulated objects remains an underexplored problem due to their inherent kinematic constraints. To address these challenges, this work proposes a novel point-pair-based pose tracking framework, termed PPF-Tracker. The proposed framework first performs quasi-canonicalization of point clouds in the SE(3) Lie group space, and then models articulated objects using Point Pair Features (PPF) to predict pose voting parameters by leveraging the invariance properties of SE(3). Finally, semantic information of joint axes is incorporated to impose unified kinematic constraints across all parts of the articulated object. PPF-Tracker is systematically evaluated on both synthetic datasets and real-world scenarios, demonstrating strong generalization across diverse and challenging environments. Experimental results highlight the effectiveness and robustness of PPF-Tracker in multi-frame pose tracking of articulated objects. We believe this work can foster advances in robotics, embodied intelligence, and augmented reality.
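
For context, the descriptor at the heart of this family of methods is the classical four-dimensional Point Pair Feature of Drost et al.; below is a minimal sketch of that descriptor only (the SE(3) quasi-canonicalization and voting stages of PPF-Tracker itself are not shown, and unit-length normals are assumed):

```python
import numpy as np

def point_pair_feature(p1, n1, p2, n2):
    """Classical 4D PPF: F(m1, m2) = (||d||, angle(n1, d), angle(n2, d), angle(n1, n2))."""
    d = p2 - p1
    dist = np.linalg.norm(d)
    if dist < 1e-9:
        return np.zeros(4)
    d_hat = d / dist

    def angle(u, v):
        # numerically safe angle between two unit vectors
        return np.arccos(np.clip(np.dot(u, v), -1.0, 1.0))

    return np.array([dist, angle(n1, d_hat), angle(n2, d_hat), angle(n1, n2)])

# toy usage: two oriented points
f = point_pair_feature(np.array([0.0, 0.0, 0.0]), np.array([0.0, 0.0, 1.0]),
                       np.array([0.1, 0.0, 0.0]), np.array([0.0, 1.0, 0.0]))
```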

AAAI Conference 2026 Conference Paper

Verb Mirage: Unveiling and Assessing Verb Concept Hallucinations in Multimodal Large Language Models

  • Zehao Wang
  • Xinpeng Liu
  • Yudonglin Zhang
  • Xiaoqian Wu
  • Zhou Fang
  • Yifan Fang
  • Junfu Pu
  • Cewu Lu

Multimodal Large Language Models (MLLMs) have garnered significant attention recently and demonstrate outstanding capabilities in various tasks such as OCR, VQA, and captioning. However, hallucination remains a persistent issue. While numerous methods have been proposed to mitigate hallucinations and have achieved notable improvements, they primarily target hallucinations related to object/noun concepts. Verb concepts, which are crucial for understanding human actions, have been largely overlooked. In this paper, to the best of our knowledge, we are the first to investigate the verb hallucination phenomenon of MLLMs from various perspectives. Our findings reveal that most state-of-the-art MLLMs suffer from severe verb hallucination. To assess whether existing mitigation methods for object concept hallucination also help with verb hallucination, we evaluate these methods and find that they do not effectively address it. To address this issue, we propose a baseline method based on fine-tuning with rich verb knowledge, which yields a clear improvement. Experimental results demonstrate that our method significantly reduces verb-related hallucinations.

IROS Conference 2025 Conference Paper

ArtGS: 3D Gaussian Splatting for Interactive Visual-Physical Modeling and Manipulation of Articulated Objects

  • Qiaojun Yu
  • Xibin Yuan
  • Yu Jiang
  • Junting Chen
  • Dongzhe Zheng
  • Ce Hao
  • Yang You 0004
  • Yixing Chen 0008

Articulated object manipulation remains a critical challenge in robotics due to the complex kinematic constraints and the limited physical reasoning of existing methods. In this work, we introduce ArtGS, a novel framework that extends 3D Gaussian Splatting (3DGS) by integrating visual-physical modeling for articulated object understanding and interaction. ArtGS begins with multi-view RGB-D reconstruction, followed by reasoning with a vision-language model (VLM) to extract semantic and structural information, particularly the articulated bones. Through dynamic, differentiable 3DGS-based rendering, ArtGS optimizes the parameters of the articulated bones, ensuring physically consistent motion constraints and enhancing the manipulation policy. By leveraging dynamic Gaussian splatting, cross-embodiment adaptability, and closed-loop optimization, ArtGS establishes a new framework for efficient, scalable, and generalizable articulated object modeling and manipulation. Experiments conducted in both simulation and real-world environments demonstrate that ArtGS significantly outperforms previous methods in joint estimation accuracy and manipulation success rates across a variety of articulated objects. Additional images and videos are available on the project website: sites.google.com/view/artgs.

ICRA Conference 2025 Conference Paper

CAGE: Causal Attention Enables Data-Efficient Generalizable Robotic Manipulation

  • Shangning Xia
  • Hongjie Fang
  • Cewu Lu
  • Haoshu Fang

Generalization in robotic manipulation remains a critical challenge, particularly when scaling to new environments with limited demonstrations. This paper introduces CAGE, a novel robotic manipulation policy designed to overcome these generalization barriers by integrating a pretrained visual representation with a causal attention mechanism. CAGE utilizes the powerful feature extraction capabilities of the vision foundation model DINOv2, combined with LoRA fine-tuning for robust environment understanding. The policy further employs a causal perceiver for effective token compression and a diffusion-based action head with attention to enhance task-specific fine-grained conditioning. With as few as 50 demonstrations from a single training environment, CAGE achieves robust generalization across diverse visual changes in objects, backgrounds, and viewpoints. Extensive experiments validate that CAGE significantly outperforms existing state-of-the-art RGB/RGB-D-based approaches in various manipulation tasks, especially under large distribution shifts. In similar environments, CAGE offers an average 42% increase in task completion rate. While all baselines fail in unseen environments, CAGE manages to obtain a 43% completion rate and a 51% success rate on average, marking a substantial advancement toward the practical deployment of robots in real-world settings. Project website: cage-policy.github.io.

ICRA Conference 2025 Conference Paper

DeformPAM: Data-Efficient Learning for Long-Horizon Deformable Object Manipulation via Preference-Based Action Alignment

  • Wendi Chen
  • Han Xue
  • Fangyuan Zhou
  • Yuan Fang
  • Cewu Lu

In recent years, imitation learning has made progress in the field of robotic manipulation. However, it still faces challenges when addressing complex long-horizon tasks with deformable objects, such as high-dimensional state spaces, complex dynamics, and multimodal action distributions. Traditional imitation learning methods often require a large amount of data and encounter distributional shifts and accumulative errors in these tasks. To address these issues, we propose a data-efficient general learning framework (DeformPAM) based on preference learning and reward-guided action selection. DeformPAM decomposes long-horizon tasks into multiple action primitives, utilizes 3D point cloud inputs and diffusion models to model action distributions, and trains an implicit reward model using human preference data. During the inference phase, the reward model scores multiple candidate actions, selecting the optimal action for execution, thereby reducing the occurrence of anomalous actions and improving task completion quality. Experiments conducted on three challenging real-world long-horizon deformable object manipulation tasks demonstrate the effectiveness of this method. Results show that DeformPAM improves both task completion quality and efficiency compared to baseline methods even with limited data. Code and data will be available at deform-pam.robotflow.ai.
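
At inference time the recipe described above is essentially sample-then-score; a minimal sketch, where `policy_sample` and `reward_model` are hypothetical stand-ins for the diffusion policy and the learned preference-based reward model:

```python
import numpy as np

def select_action(policy_sample, reward_model, obs, num_candidates=16):
    """Sample K candidate actions from a generative policy and execute the one
    the learned reward model scores highest."""
    candidates = [policy_sample(obs) for _ in range(num_candidates)]
    scores = [reward_model(obs, a) for a in candidates]
    return candidates[int(np.argmax(scores))]

# toy stand-ins: a random 2D action sampler and a reward preferring small actions
rng = np.random.default_rng(0)
best = select_action(lambda obs: rng.normal(size=2),
                     lambda obs, a: -np.linalg.norm(a),
                     obs=None)
```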

IROS Conference 2025 Conference Paper

Dexterous Manipulation Based on Prior Dexterous Grasp Pose Knowledge

  • Hengxu Yan
  • Haoshu Fang
  • Cewu Lu

Dexterous manipulation has received considerable attention in recent research. Predominantly, existing studies have concentrated on reinforcement learning methods to address the substantial degrees of freedom in hand movements. Nonetheless, these methods typically suffer from low efficiency and accuracy. In this work, we introduce a novel reinforcement learning approach that leverages prior dexterous grasp pose knowledge to enhance both efficiency and accuracy. Unlike previous work, which typically keeps the robotic hand in a fixed dexterous grasp pose, we decouple the manipulation process into two distinct phases: initially, we generate a dexterous grasp pose targeting the functional part of the object; after that, we employ reinforcement learning to comprehensively explore the environment. Our findings suggest that the majority of learning time is expended in identifying the appropriate initial position and selecting the optimal manipulation viewpoint. Experimental results demonstrate significant improvements in learning efficiency and success rates across four distinct tasks.

IROS Conference 2025 Conference Paper

DiffGen: Robot Demonstration Generation via Differentiable Physics Simulation, Differentiable Rendering, and Vision-Language Model

  • Yang Jin
  • Jun Lv
  • Shuqiang Jiang
  • Cewu Lu

Generating robot demonstrations through simulation is widely recognized as an effective way to scale up robot data. Previous work often trained reinforcement learning agents to generate expert policies, but this approach lacks sample efficiency. Recently, a line of work has attempted to generate robot demonstrations via differentiable simulation, which is promising but heavily relies on reward design, a labor-intensive process. In this paper, we propose DiffGen, a novel framework that integrates differentiable physics simulation, differentiable rendering, and a vision-language model to enable automatic and efficient generation of robot demonstrations. Given a simulated robot manipulation scenario and a natural language instruction, DiffGen can generate realistic robot demonstrations by minimizing the distance between the embedding of the language instruction and the embedding of the simulated observation after manipulation in representation space. The embeddings are obtained from the vision-language model, and the optimization is achieved by calculating and descending gradients through the differentiable simulation, differentiable rendering, and vision-language model components. Experiments demonstrate that with DiffGen, we could efficiently and effectively generate robot data with minimal human effort or training time. The videos of the results can be accessed at https://sites.google.com/view/diffgen.
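
The optimization described here can be sketched as gradient descent on an embedding distance through the composed differentiable pipeline; in the sketch below, `simulate`, `render`, and `vlm_image_embed` are hypothetical differentiable placeholders, and cosine distance is an assumed choice of embedding distance:

```python
import torch

def diffgen_step(actions, simulate, render, vlm_image_embed, text_embed, lr=1e-2):
    """One gradient step: push the VLM embedding of the rendered
    post-manipulation observation toward the instruction embedding."""
    actions = actions.clone().requires_grad_(True)
    state = simulate(actions)        # differentiable physics rollout
    image = render(state)            # differentiable rendering
    img_emb = vlm_image_embed(image)
    # loss = 1 - cosine similarity between image and text embeddings
    loss = 1.0 - torch.nn.functional.cosine_similarity(img_emb, text_embed, dim=-1).mean()
    loss.backward()
    with torch.no_grad():
        actions -= lr * actions.grad
    return actions.detach(), loss.item()

# toy stand-ins keep the example self-contained
a0 = torch.full((4,), 0.1)
a1, l = diffgen_step(a0, simulate=lambda a: a * 2, render=lambda s: s,
                     vlm_image_embed=lambda i: i, text_embed=torch.ones(4))
```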

AAAI Conference 2025 Conference Paper

Discovering Conceptual Knowledge with Analytic Ontology Templates for Articulated Objects

  • Jianhua Sun
  • Yuxuan Li
  • Longfei Xu
  • Jiude Wei
  • Liang Chai
  • Cewu Lu

Human cognition can leverage fundamental conceptual knowledge, such as geometric and kinematic concepts, to appropriately perceive, comprehend and interact with novel objects. Motivated by this finding, we aim to endow machine intelligence with an analogous capability by operating at the conceptual level, in order to understand and then interact with articulated objects, especially those in novel categories, which is challenging due to the intricate geometric structures and diverse joint types of articulated objects. To achieve this goal, we propose the Analytic Ontology Template (AOT), a parameterized and differentiable program description of generalized conceptual ontologies. A baseline approach called AOTNet, driven by AOTs, is designed accordingly to equip intelligent agents with these generalized concepts and then empower the agents to effectively discover conceptual knowledge about the structure and affordance of articulated objects. The AOT-driven approach yields benefits in three key perspectives: i) enabling concept-level understanding of articulated objects without relying on any real training data, ii) providing analytic structure information, and iii) introducing rich affordance information indicating proper ways of interaction. We conduct exhaustive experiments, and the results demonstrate the superiority of our approach in understanding and then interacting with articulated objects.

ICRA Conference 2025 Conference Paper

ForceMimic: Force-Centric Imitation Learning with Force-Motion Capture System for Contact-Rich Manipulation

  • Wenhai Liu
  • Junbo Wang 0004
  • Yiming Wang
  • Weiming Wang
  • Cewu Lu

In most contact-rich manipulation tasks, humans apply time-varying forces to the target object, compensating for inaccuracies in the vision-guided hand trajectory. However, current robot learning algorithms primarily focus on trajectory-based policies, with limited attention given to learning force-related skills. To address this limitation, we introduce ForceMimic, a force-centric robot learning system, providing a natural, force-aware and robot-free robotic demonstration collection system, along with a hybrid force-motion imitation learning algorithm for robust contact-rich manipulation. Using the proposed ForceCapture system, an operator can peel a zucchini in 5 minutes, while force-feedback teleoperation takes over 13 minutes and struggles with task completion. With the collected data, we propose HybridIL to train a force-centric imitation learning model, equipped with a hybrid force-position control primitive to fit the predicted wrench-position parameters during robot execution. Experiments demonstrate that our approach enables the model to learn a more robust policy under the contact-rich task of vegetable peeling, increasing the success rate by 54.5% relative to state-of-the-art pure-vision-based imitation learning. Hardware, code, data and more results can be found on the project website at https://forcemimic.github.io.
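
The hybrid force-position primitive mentioned above is, in spirit, the classical selection-matrix formulation of hybrid control; a minimal sketch with illustrative proportional gains (the actual HybridIL controller and its frame conventions are not specified here):

```python
import numpy as np

def hybrid_force_position(S, x, x_des, f, f_des, kp_x=1.0, kp_f=0.05):
    """Classical hybrid control: force-controlled axes (S=1) track a desired
    wrench, position-controlled axes (S=0) track a desired pose, per axis."""
    S = np.asarray(S, dtype=float)          # 1 = force-controlled, 0 = position
    u_force = kp_f * (f_des - f)            # wrench error -> corrective motion
    u_pos = kp_x * (x_des - x)              # pose error  -> corrective motion
    return S * u_force + (1.0 - S) * u_pos  # per-axis command (e.g., Cartesian velocity)

# toy usage: regulate contact force along z, position along x and y
u = hybrid_force_position(S=[0, 0, 1], x=np.zeros(3), x_des=np.array([0.1, 0.0, 0.0]),
                          f=np.array([0.0, 0.0, 2.0]), f_des=np.array([0.0, 0.0, 5.0]))
```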

NeurIPS Conference 2025 Conference Paper

ForceVLA: Enhancing VLA Models with a Force-aware MoE for Contact-rich Manipulation

  • Jiawen Yu
  • Hairuo Liu
  • Qiaojun Yu
  • Jieji Ren
  • Ce Hao
  • Haitong Ding
  • Guangyu Huang
  • Guofan Huang

Vision-Language-Action (VLA) models have advanced general-purpose robotic manipulation by leveraging pretrained visual and linguistic representations. However, they struggle with contact-rich tasks that require fine-grained control involving force, especially under visual occlusion or dynamic uncertainty. To address these limitations, we propose ForceVLA, a novel end-to-end manipulation framework that treats external force sensing as a first-class modality within VLA systems. ForceVLA introduces FVLMoE, a force-aware Mixture-of-Experts fusion module that dynamically integrates pretrained visual-language embeddings with real-time 6-axis force feedback during action decoding. This enables context-aware routing across modality-specific experts, enhancing the robot's ability to adapt to subtle contact dynamics. We also introduce ForceVLA-Data, a new dataset comprising synchronized vision, proprioception, and force-torque signals across five contact-rich manipulation tasks. ForceVLA improves average task success by 23.2% over strong π0-based baselines, achieving up to 80% success in tasks such as plug insertion. Our approach highlights the importance of multimodal integration for dexterous manipulation and sets a new benchmark for physically intelligent robotic control. Code and data will be released at https://sites.google.com/view/forcevla2025/.
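
A force-aware MoE fusion of this kind can be sketched as a gate over experts conditioned on the concatenated visual-language and force features; the following is a schematic stand-in, not the paper's FVLMoE (all dimensions and the dense soft routing are illustrative assumptions):

```python
import torch
import torch.nn as nn

class ForceAwareMoE(nn.Module):
    """Schematic MoE fusion: a gate conditioned on visual-language and force
    embeddings softly routes across a set of expert MLPs."""
    def __init__(self, d_vl=256, d_force=6, d_out=256, n_experts=4):
        super().__init__()
        d_in = d_vl + d_force
        self.gate = nn.Linear(d_in, n_experts)
        self.experts = nn.ModuleList(nn.Sequential(nn.Linear(d_in, d_out), nn.GELU())
                                     for _ in range(n_experts))

    def forward(self, vl_emb, force):
        x = torch.cat([vl_emb, force], dim=-1)
        w = torch.softmax(self.gate(x), dim=-1)               # (B, n_experts)
        y = torch.stack([e(x) for e in self.experts], dim=1)  # (B, n_experts, d_out)
        return (w.unsqueeze(-1) * y).sum(dim=1)               # weighted expert mix

fused = ForceAwareMoE()(torch.randn(2, 256), torch.randn(2, 6))
```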

IROS Conference 2025 Conference Paper

FSGlove: An Inertial-Based Hand Tracking System with Shape-Aware Calibration

  • Yutong Li 0004
  • Jieyi Zhang 0001
  • Wenqiang Xu
  • Tutian Tang
  • Cewu Lu

Accurate hand motion capture (MoCap) is vital for applications in robotics, virtual reality, and biomechanics, yet existing systems face limitations in capturing high-degree-of-freedom (DoF) joint kinematics and personalized hand shape. Commercial gloves offer up to 21 DoFs, which are insufficient for complex manipulations while neglecting shape variations that are critical for contact-rich tasks. We present FSGlove, an inertial-based system that simultaneously tracks up to 48 DoFs and reconstructs personalized hand shapes via DiffHCal, a novel calibration method. Each finger joint and the dorsum are equipped with IMUs, enabling high-resolution motion sensing. DiffHCal integrates with the parametric MANO model through differentiable optimization, resolving joint kinematics, shape parameters, and sensor misalignment during a single streamlined calibration. The system achieves state-of-the-art accuracy, with joint angle errors of less than 2.7°, and outperforms commercial alternatives in shape reconstruction and contact fidelity. FSGlove's open-source hardware and software design ensures compatibility with current VR and robotics ecosystems, while its ability to capture subtle motions (e.g., fingertip rubbing) bridges the gap between human dexterity and robotic imitation. Evaluated against Nokov optical MoCap, FSGlove advances hand tracking by unifying kinematic and contact fidelity. Hardware design, software, and more results are available at: https://sites.google.com/view/fsglove.

IROS Conference 2025 Conference Paper

Generalizable and Actionable Part Detection and Manipulation with SAM-rectified Segmentation and Iterative Pose Refinement

  • Sucheng Qian
  • Li Zhang 0104
  • Yanyan Wei
  • Liu Liu 0012
  • Cewu Lu

The ability to perform cross-category object perception and manipulation is highly desirable in building intelligent robots. One promising approach is to define the concept of Generalizable and Actionable Parts (GAParts), such as buttons and handles, on both seen and unseen object categories. However, accurate cross-category perception of GAParts is still challenging due to the large inter-category object shape variations. To address this issue, we introduce SAMIR, a novel framework using SAM-rectified segmentation and Iterative pose Refinement for GAPart detection and manipulation. Firstly, we introduce a Segment Anything Model (SAM) segmentation prior to rectify the low-confidence, fragmented GAPart instance proposals. Secondly, in addition to the zero-shot generalization of the SAM foundation model, we further finetune it with a lightweight adaptor model on our task dataset. Finally, we propose an iterative pose refinement procedure that improves the accuracy of GAPart pose estimation. Our perception experiments on the GAPartNet dataset show that SAMIR consistently outperforms the baseline method on instance segmentation and pose estimation tasks. Our manipulation experiments in the SAPIEN simulator illustrate that SAMIR leads to an improved manipulation success rate. We also deploy our method to a real robot for real-world manipulation. Our code and video are available at sites.google.com/view/samir-gapart.

ICRA Conference 2025 Conference Paper

Human-Agent Joint Learning for Efficient Robot Manipulation Skill Acquisition

  • Shengcheng Luo
  • Quanquan Peng
  • Jun Lv
  • Kaiwen Hong
  • Katherine Driggs-Campbell
  • Cewu Lu
  • Yonglu Li 0001

Employing a teleoperation system for gathering demonstrations offers the potential for more efficient learning of robot manipulation. However, teleoperating a robot arm equipped with a dexterous hand or gripper presents inherent challenges due to the task's high dimensionality, the complexity of motion, and the differences between human and robot physiological structures. In this study, we introduce a novel system for joint learning between human operators and robots that enables human operators to share control of a robot end-effector with a learned assistive agent, simplifies the data collection process, and facilitates simultaneous human demonstration collection and robot manipulation training. As data accumulate, the assistive agent gradually learns. Consequently, less human effort and attention are required, enhancing the efficiency of the data collection process. The system also allows the human operator to adjust the control ratio to achieve a tradeoff between manual and automated control. We conducted experiments in both simulated environments and physical real-world settings. Through user studies and quantitative evaluations, it is evident that the proposed system can enhance data collection efficiency and reduce the need for human adaptation while ensuring the collected data are of sufficient quality for downstream tasks. For more details, please refer to our webpage https://norweig1an.github.io/HAJL.github.io/.
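
The adjustable control ratio amounts to blending operator and agent commands; a minimal sketch, assuming a simple linear blending rule (the paper's assistive agent is learned, and its blending scheme may differ):

```python
import numpy as np

def shared_control(a_human, a_agent, alpha):
    """Blend operator and assistive-agent end-effector commands.
    alpha = 1.0 -> fully manual, alpha = 0.0 -> fully autonomous."""
    alpha = float(np.clip(alpha, 0.0, 1.0))
    return alpha * np.asarray(a_human) + (1.0 - alpha) * np.asarray(a_agent)

# operator keeps 70% authority early in training, less as the agent improves
cmd = shared_control([0.02, 0.0, -0.01], [0.01, 0.01, 0.0], alpha=0.7)
```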

ICLR Conference 2025 Conference Paper

ImDy: Human Inverse Dynamics from Imitated Observations

  • Xinpeng Liu 0002
  • Junxuan Liang
  • Zili Lin
  • Haowen Hou
  • Yonglu Li 0001
  • Cewu Lu

Inverse dynamics (ID), which aims at reproducing the driving torques from human kinematic observations, has been a critical tool for gait analysis. However, it is hindered from wider application to general motion due to its limited scalability. Conventional optimization-based ID requires expensive laboratory setups, restricting its availability. To alleviate this problem, we propose to exploit recently progressive human motion imitation algorithms to learn human inverse dynamics in a data-driven manner. The key insight is that human ID knowledge is implicitly possessed by motion imitators, though not directly applicable. In light of this, we devise an efficient data collection pipeline with state-of-the-art motion imitation algorithms and physics simulators, resulting in a large-scale human inverse dynamics benchmark named Imitated Dynamics (ImDy). ImDy contains over 150 hours of motion with joint torque and full-body ground reaction force data. With ImDy, we train a data-driven human inverse dynamics solver, ImDyS(olver), in a fully supervised manner, which conducts ID and ground reaction force estimation simultaneously. Experiments on ImDy and real-world data demonstrate the impressive competency of ImDyS in human inverse dynamics and ground reaction force estimation. Moreover, the potential of ImDy(-S) as a fundamental motion analysis tool is exhibited with downstream applications. The project page is https://foruck.github.io/ImDy.
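
Training a data-driven inverse-dynamics solver of this kind reduces to supervised regression from kinematic windows to torques and ground reaction forces; a minimal sketch with illustrative dimensions and random toy data (not the ImDyS architecture):

```python
import torch
import torch.nn as nn

# regress joint torques and a 6-D net ground reaction force
# from a 10-frame window of 69-D pose features (all dims illustrative)
model = nn.Sequential(nn.Linear(10 * 69, 512), nn.ReLU(), nn.Linear(512, 69 + 6))
opt = torch.optim.Adam(model.parameters(), lr=1e-4)

kinematics = torch.randn(32, 10 * 69)   # toy batch of kinematic windows
target = torch.randn(32, 69 + 6)        # toy torque + GRF labels
loss = nn.functional.mse_loss(model(kinematics), target)
opt.zero_grad()
loss.backward()
opt.step()
```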

AAAI Conference 2025 Conference Paper

Interacted Object Grounding in Spatio-Temporal Human-Object Interactions

  • Xiaoyang Liu
  • Boran Wen
  • Xinpeng Liu
  • Zizheng Zhou
  • Hongwei Fan
  • Cewu Lu
  • Lizhuang Ma
  • Yulong Chen

Spatio-temporal Human-Object Interaction (ST-HOI) understanding aims at detecting HOIs from videos, which is crucial for activity understanding. However, existing whole-body-object interaction video benchmarks overlook the fact that open-world objects are diverse; that is, they usually provide limited and predefined object classes. Therefore, we introduce a new open-world benchmark: Grounding Interacted Objects (GIO), including 1,098 interacted object classes and 290K interacted object box annotations. Accordingly, an object grounding task is proposed, expecting vision systems to discover interacted objects. Even though today's detectors and grounding methods have succeeded greatly, they perform unsatisfactorily in localizing diverse and rare objects in GIO. This profoundly reveals the limitations of current vision systems and poses a great challenge. Thus, we explore leveraging spatio-temporal cues to address object grounding and propose a 4D question-answering framework (4D-QA) to discover interacted objects from diverse videos. Our method demonstrates significant superiority in extensive experiments compared to current baselines.

ICLR Conference 2025 Conference Paper

Interactive Adjustment for Human Trajectory Prediction with Individual Feedback

  • Jianhua Sun 0003
  • Yuxuan Li
  • Liang Chai
  • Cewu Lu

Human trajectory prediction is fundamental for autonomous driving and service robots. The research community has studied various important aspects of this task and made remarkable progress recently. However, there is an essential perspective that has not been well exploited in previous research, namely individual feedback. Individual feedback arises from the sequential nature of trajectory prediction, where earlier predictions for a target can be verified over time against its ground-truth trajectories to obtain feedback, which provides valuable experience for subsequent predictions on the same agent. In this paper, we show that such feedback can reveal the strengths and weaknesses of the model's predictions on a specific target and heuristically guide the model to deliver better predictions on that target. We present an interactive adjustment network to effectively model and leverage the feedback. This network first exploits the feedback from previous predictions to dynamically generate an adjuster, which then interactively makes appropriate adjustments to current predictions for more accurate ones. We introduce a novel displacement expectation loss to train this interactive architecture. Through experiments on representative prediction methods and widely-used benchmarks, we demonstrate the great value of individual feedback and the superior effectiveness of the proposed interactive adjustment network.

IROS Conference 2025 Conference Paper

Kalib: Easy Hand-Eye Calibration with Reference Point Tracking

  • Tutian Tang
  • Minghao Liu
  • Wenqiang Xu
  • Cewu Lu

Hand-eye calibration aims to estimate the transformation between a camera and a robot. Traditional methods rely on fiducial markers, which require considerable manual effort and precise setup. Recent advances in deep learning have introduced markerless techniques but come with more prerequisites, such as retraining networks for each robot and accessing accurate mesh models for data generation. In this paper, we propose Kalib, an automatic and easy-to-setup hand-eye calibration method that leverages the generalizability of visual foundation models to overcome these challenges. It features only two basic prerequisites: the robot's kinematic chain and a predefined reference point on the robot. During calibration, the reference point is tracked in the camera space. Its corresponding 3D coordinates in the robot frame can be inferred by forward kinematics. Then, a PnP solver directly estimates the transformation between the camera and the robot without training new networks or accessing mesh models. Evaluations in simulated and real-world benchmarks show that Kalib achieves good accuracy with a lower manual workload compared with recent baseline methods. We also demonstrate its application in multiple real-world settings with various robot arms and grippers. Kalib's user-friendly design and minimal setup requirements make it a possible solution for continuous operation in unstructured environments. The code, data, and supplementary materials are available at https://sites.google.com/view/hand-eye-kalib.
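
The core computation described here is a standard PnP solve between FK-derived 3D reference-point positions and their tracked 2D pixel locations; a minimal sketch with OpenCV, assuming the tracked correspondences and camera intrinsics are already given:

```python
import numpy as np
import cv2

def calibrate_hand_eye(points_robot, points_image, K, dist=None):
    """Estimate the camera pose w.r.t. the robot base from N >= 6 correspondences:
    3D reference-point positions from forward kinematics (robot frame) and their
    tracked 2D pixel locations (image). K is the 3x3 camera intrinsics matrix."""
    ok, rvec, tvec = cv2.solvePnP(
        np.asarray(points_robot, dtype=np.float64),
        np.asarray(points_image, dtype=np.float64),
        K, dist)
    assert ok, "PnP failed"
    R, _ = cv2.Rodrigues(rvec)
    T = np.eye(4)                    # maps robot-frame points into the camera frame
    T[:3, :3], T[:3, 3] = R, tvec.ravel()
    return T

# synthetic check: project known 3D points with an identity pose, then recover it
K = np.array([[600.0, 0.0, 320.0], [0.0, 600.0, 240.0], [0.0, 0.0, 1.0]])
pts3d = np.array([[0.1, 0.0, 0.5], [-0.1, 0.05, 0.6], [0.05, 0.1, 0.7],
                  [-0.05, -0.1, 0.55], [0.12, -0.04, 0.65], [0.0, 0.0, 0.8]])
uv = (K @ pts3d.T).T
uv = uv[:, :2] / uv[:, 2:]
T = calibrate_hand_eye(pts3d, uv, K)   # should be near the identity transform
```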

IROS Conference 2025 Conference Paper

Knowledge-Driven Imitation Learning: Enabling Generalization Across Diverse Conditions

  • Zhuochen Miao
  • Jun Lv
  • Hongjie Fang
  • Yang Jin
  • Cewu Lu

Imitation learning has emerged as a powerful paradigm in robot manipulation, yet its generalization capability remains constrained by object-specific dependencies in limited expert demonstrations. To address this challenge, we propose knowledge-driven imitation learning, a framework that leverages external structural semantic knowledge to abstract object representations within the same category. We introduce a novel semantic keypoint graph as a knowledge template and develop a coarse-to-fine template-matching algorithm that optimizes both structural consistency and semantic similarity. Evaluated on three real-world robotic manipulation tasks, our method achieves superior performance, surpassing image-based diffusion policies with only one-quarter of the expert demonstrations. Extensive experiments further demonstrate its robustness across novel objects, backgrounds, and lighting conditions. This work pioneers a knowledge-driven approach to data-efficient robotic learning in real-world settings. Code and more materials are available on knowledge-driven.github.io.

IROS Conference 2025 Conference Paper

LDexMM: Language-Guided Dexterous Multi-Task Manipulation with Reinforcement Learning

  • Hengxu Yan
  • Junbo Wang 0004
  • Haoshu Fang
  • Qiaojun Yu
  • Cewu Lu

Language plays a crucial role in robotic manipulation, particularly in facilitating complex tasks. Previous work primarily focused on two-finger manipulation. However, leveraging language to guide reinforcement learning for dexterous hands remains a challenge due to their high degrees of freedom. In this work, we introduce a language-guided dexterous multi-task manipulation framework (LDexMM), which decomposes the problem into two distinct phases. First, we use language instructions to guide a segmentation model in generating a dexterous grasp pose for the functional part of the object. After establishing this initial grasp, reinforcement learning is employed to refine the grasp pose and complete the task. Simultaneously, language constraints are applied to focus the actions on the specified object. Our experiments demonstrate success rates of 31%, 40%, 49.2%, and 72.7% on 10, 7, 5, and 3 tasks, respectively, with a single model.

NeurIPS Conference 2025 Conference Paper

REArtGS: Reconstructing and Generating Articulated Objects via 3D Gaussian Splatting with Geometric and Motion Constraints

  • Di Wu
  • Liu Liu
  • Zhou Linli
  • Anran Huang
  • Liangtu Song
  • Qiaojun Yu
  • Qi Wu
  • Cewu Lu

Articulated objects are prevalent entities in human life, and their 3D representations play crucial roles across various applications. However, achieving both high-fidelity textured surface reconstruction and dynamic generation for articulated objects remains challenging for existing methods. In this paper, we present REArtGS, a novel framework that introduces additional geometric and motion constraints to 3D Gaussian primitives, enabling realistic surface reconstruction and generation for articulated objects. Specifically, given multi-view RGB images of two arbitrary states of articulated objects, we first introduce an unbiased Signed Distance Field (SDF) guidance to regularize Gaussian opacity fields, enhancing geometry constraints and improving surface reconstruction quality. Then we establish deformable fields for 3D Gaussians constrained by the kinematic structures of articulated objects, achieving unsupervised generation of surface meshes in unseen states. Extensive experiments on both synthetic and real datasets demonstrate our approach achieves high-quality textured surface reconstruction for given states, and enables high-fidelity surface generation for unseen states. Project site: https://sites.google.com/view/reartgs/home.

IROS Conference 2025 Conference Paper

RH20T-P: A Primitive-Level Robotic Manipulation Dataset towards Composable Generalization Agents in Real-world Scenarios

  • Zeren Chen
  • Zhelun Shi
  • Xiaoya Lu
  • Lehan He
  • Sucheng Qian
  • Enshen Zhou
  • Zhenfei Yin
  • Wanli Ouyang

Achieving generalizability in solving out-of-distribution tasks is one of the ultimate goals of learning robotic manipulation. Recent progress in Vision-Language Models (VLMs) has shown that VLM-based task planners can alleviate the difficulty of solving novel tasks by decomposing compounded tasks into plans of sequentially executed primitive-level skills that have already been mastered. It is also promising for robotic manipulation to adopt such composable generalization ability, in the form of composable generalization agents (CGAs). However, the community lacks a reliable design of primitive skills and a sufficient amount of primitive-level data annotations. Therefore, we propose RH20T-P, a primitive-level robotic manipulation dataset, which contains about 38k video clips covering 67 diverse manipulation tasks in real-world scenarios. Each clip is manually annotated according to a set of meticulously designed primitive skills that are common in robotic manipulation. Furthermore, we standardize a plan-execute CGA paradigm and implement an exemplar baseline called RA-P on our RH20T-P, whose positive performance on solving unseen tasks validates that the proposed dataset can offer composable generalization ability to robotic manipulation agents. Project homepage: https://sites.google.com/view/rh20t-primitive/main.

IROS Conference 2025 Conference Paper

SIME: Enhancing Policy Self-Improvement with Modal-level Exploration

  • Yang Jin
  • Jun Lv
  • Wenye Yu
  • Hongjie Fang
  • Yonglu Li 0001
  • Cewu Lu

Self-improvement requires robotic systems to initially learn from human-provided data and then gradually enhance their capabilities through interaction with the environment. This is similar to how humans improve their skills through continuous practice. However, achieving effective self-improvement is challenging, primarily because robots tend to repeat their existing abilities during interactions, often failing to generate new, valuable data for learning. In this paper, we identify the key to successful self-improvement: modal-level exploration and data selection. By incorporating a modal-level exploration mechanism during policy execution, the robot can produce more diverse and multi-modal interactions. At the same time, we select the most valuable trials and high-quality segments from these interactions for learning. We successfully demonstrate effective robot self-improvement on both simulation benchmarks and real-world experiments. The capability for self-improvement will enable us to develop more robust and high-success-rate robotic control strategies at a lower cost. Our code and experiment scripts are available at ericjin2002.github.io/SIME.

IROS Conference 2025 Conference Paper

SKT: Integrating State-Aware Keypoint Trajectories with Vision-Language Models for Robotic Garment Manipulation

  • Xin Li 0110
  • Siyuan Huang 0004
  • Qiaojun Yu
  • Zhengkai Jiang 0001
  • Ce Hao
  • Yimeng Zhu
  • Hongsheng Li 0001
  • Peng Gao 0007

Automating garment manipulation poses a significant challenge for assistive robotics due to the diverse and deformable nature of garments. Traditional approaches typically require separate models for each garment type, which limits scalability and adaptability. In contrast, this paper presents a unified approach using vision-language models (VLMs) to improve keypoint prediction across various garment categories. By interpreting both visual and semantic information, our model enables robots to manage different garment states with a single model. We created a large-scale synthetic dataset using advanced simulation techniques, allowing scalable training without extensive real-world data. Experimental results indicate that the VLM-based method significantly enhances keypoint detection accuracy and task success rates, providing a more flexible and general solution for robotic garment manipulation. In addition, this research also underscores the potential of VLMs to unify various garment manipulation tasks within a single framework, paving the way for broader applications in home automation and assistive robotics in the future.

ICLR Conference 2025 Conference Paper

The Labyrinth of Links: Navigating the Associative Maze of Multi-modal LLMs

  • Hong Li
  • Nanxi Li
  • Yuanjie Chen
  • Jianbin Zhu
  • Qinlu Guo
  • Cewu Lu
  • Yonglu Li 0001

Multi-modal Large Language Models (MLLMs) have exhibited impressive capability. However, many deficiencies of MLLMs compared to human intelligence have recently been found, e.g., hallucination. To drive MLLM studies, the community has dedicated efforts to building larger benchmarks with complex tasks. In this paper, we propose benchmarking an essential but usually overlooked intelligence: association, a human's basic capability to link observation and prior practice memory. To comprehensively investigate MLLMs' performance on association, we formulate the association task and devise a standard benchmark based on adjective and verb semantic concepts. Instead of costly data annotation and curation, we propose a convenient annotation-free construction method that transforms a general dataset for our association tasks. Simultaneously, we devise a rigorous data refinement process to eliminate confusion in the raw dataset. Building on this database, we establish three levels of association tasks: single-step, synchronous, and asynchronous associations. Moreover, we conduct a comprehensive investigation into the MLLMs' zero-shot association capabilities, addressing multiple dimensions, including three distinct memory strategies, both open-source and closed-source MLLMs, cutting-edge Mixture-of-Experts (MoE) models, and the involvement of human experts. Our systematic investigation shows that current open-source MLLMs consistently exhibit poor capability in our association tasks; even the state-of-the-art GPT-4V(ision) has a significant gap compared to humans. We believe our benchmark will pave the way for future MLLM studies. Our data and code are available at https://mvig-rhos.com/llm_inception.

ICRA Conference 2025 Conference Paper

Towards Effective Utilization of Mixed-Quality Demonstrations in Robotic Manipulation via Segment-Level Selection and Optimization

  • Jingjing Chen
  • Hongjie Fang
  • Haoshu Fang
  • Cewu Lu

Data is crucial for robotic manipulation, as it underpins the development of robotic systems for complex tasks. While high-quality, diverse datasets enhance the performance and adaptability of robotic manipulation policies, collecting extensive expert-level data is resource-intensive. Consequently, many current datasets suffer from quality inconsistencies due to operator variability, highlighting the need for methods to utilize mixed-quality data effectively. To mitigate these issues, we propose "Select Segments to Imitate" (S2I), a framework that selects and optimizes mixed-quality demonstration data at the segment level, while ensuring plug-and-play compatibility with existing robotic manipulation policies. The framework has three components: demonstration segmentation, which divides the original data into meaningful segments; segment selection, which uses contrastive learning to find high-quality segments; and trajectory optimization, which refines suboptimal segments for better policy learning. We evaluate S2I through comprehensive experiments in simulation and real-world environments across six tasks, demonstrating that with only 3 expert demonstrations for reference, S2I can improve the performance of various downstream policies when trained with mixed-quality demonstrations. Project website: https://tonyfang.net/s2i/.

NeurIPS Conference 2025 Conference Paper

Tru-POMDP: Task Planning Under Uncertainty via Tree of Hypotheses and Open-Ended POMDPs

  • Wenjing Tang
  • Xinyu He
  • Yongxi Huang
  • Yunxiao Xiao
  • Cewu Lu
  • Panpan Cai

Task planning under uncertainty is essential for home-service robots operating in the real world. Tasks involve ambiguous human instructions, hidden or unknown object locations, and open-vocabulary object types, leading to significant open-ended uncertainty and a boundlessly large planning space. To address these challenges, we propose Tru-POMDP, a planner that combines structured belief generation using Large Language Models (LLMs) with principled POMDP planning. Tru-POMDP introduces a hierarchical Tree of Hypotheses (TOH), which systematically queries an LLM to construct high-quality particle beliefs over possible world states and human goals. We further formulate an open-ended POMDP model that enables rigorous Bayesian belief tracking and efficient belief-space planning over these LLM-generated hypotheses. Experiments on complex object rearrangement tasks across diverse kitchen environments show that Tru-POMDP significantly outperforms state-of-the-art LLM-based and LLM-tree-search hybrid planners, achieving higher success rates with significantly better plans, stronger robustness to ambiguity and occlusion, and greater planning efficiency.
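
Bayesian belief tracking over LLM-generated hypotheses can be sketched as a standard particle reweighting step; the observation model below is a hypothetical placeholder:

```python
import numpy as np

def update_belief(particles, weights, likelihood, obs, eps=1e-12):
    """Reweight each hypothesis particle by the observation likelihood
    and renormalize (one Bayesian belief-update step)."""
    w = np.array([weights[i] * likelihood(obs, s) for i, s in enumerate(particles)])
    w = np.maximum(w, eps)
    return w / w.sum()

# toy usage: three hypotheses about where a requested mug might be hidden
particles = ["cabinet", "drawer", "counter"]
weights = np.array([1 / 3] * 3)
# observing an open, empty drawer makes the "drawer" hypothesis unlikely
weights = update_belief(particles, weights,
                        lambda o, s: 0.05 if s == o else 1.0, "drawer")
```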

ICRA Conference 2025 Conference Paper

UniAff: A Unified Representation of Affordances for Tool Usage and Articulation with Vision-Language Models

  • Qiaojun Yu
  • Siyuan Huang 0004
  • Xibin Yuan
  • Zhengkai Jiang 0001
  • Ce Hao
  • Xin Li 0110
  • Haonan Chang
  • Junbo Wang 0004

Previous studies on robotic manipulation are based on a limited understanding of the underlying 3D motion constraints and affordances. To address these challenges, we propose a comprehensive paradigm, termed UniAff, that integrates 3D object-centric manipulation and task understanding in a unified formulation. Specifically, we constructed a dataset labeled with manipulation-related key attributes, comprising 900 articulated objects from 19 categories and 600 tools from 12 categories. Furthermore, we leverage MLLMs to infer object-centric representations for manipulation tasks, including affordance recognition and reasoning about 3D motion constraints. Comprehensive experiments in both simulation and real-world settings indicate that UniAff significantly improves the generalization of robotic manipulation for tools and articulated objects. We hope that UniAff will serve as a general baseline for unified robotic manipulation tasks in the future. Images, videos, dataset and code are published on the project website at: https://sites.google.com/view/uni-aff/home.

NeurIPS Conference 2025 Conference Paper

UniDomain: Pretraining a Unified PDDL Domain from Real-World Demonstrations for Generalizable Robot Task Planning

  • Haoming Ye
  • Yunxiao Xiao
  • Cewu Lu
  • Panpan Cai

Robotic task planning in real-world environments requires reasoning over implicit constraints from language and vision. While LLMs and VLMs offer strong priors, they struggle with long-horizon structure and symbolic grounding. Existing methods that combine LLMs with symbolic planning often rely on handcrafted or narrow domains, limiting generalization. We propose UniDomain, a framework that pre-trains a PDDL domain from robot manipulation demonstrations and applies it for online robotic task planning. It extracts atomic domains from 12,393 manipulation videos to form a unified domain with 3,137 operators, 2,875 predicates, and 16,481 causal edges. Given a target class of tasks, it retrieves relevant atomic domains from the unified domain and systematically fuses them into high-quality meta-domains for zero-shot planning. Experiments on diverse real-world tasks show that UniDomain solves complex, unseen tasks in a zero-shot manner, achieving up to 58% higher task success and 160% improvement in plan optimality over state-of-the-art LLM and LLM-PDDL baselines.

ICRA Conference 2024 Conference Paper

A Surprisingly Efficient Representation for Multi-Finger Grasping

  • Hengxu Yan
  • Haoshu Fang
  • Cewu Lu

The problem of grasping objects using a multi-finger hand has received significant attention in recent years. However, it remains challenging to handle a large number of unfamiliar objects in real and cluttered environments. In this work, we propose a representation that can be effectively mapped to the multi-finger grasp space. Based on this representation, we develop a simple decision model that generates accurate grasp quality scores for different multi-finger grasp poses using only hundreds to thousands of training samples. We demonstrate that our representation performs well on a real robot, achieving a success rate of 78.64% after training with only 500 real-world grasp attempts and 87% with 4,500 grasp attempts. Additionally, we achieve a success rate of 84.51% in a dynamic human-robot handover scenario using a multi-finger hand.

ICRA Conference 2024 Conference Paper

AirExo: Low-Cost Exoskeletons for Learning Whole-Arm Manipulation in the Wild

  • Hongjie Fang
  • Haoshu Fang
  • Yiming Wang
  • Jieji Ren
  • Jingjing Chen
  • Ruo Zhang
  • Weiming Wang
  • Cewu Lu

While humans can use parts of their arms other than the hands for manipulations like gathering and supporting, whether robots can effectively learn and perform the same type of operations remains relatively unexplored. As these manipulations require joint-level control to regulate the complete poses of the robots, we develop AirExo, a low-cost, adaptable, and portable dual-arm exoskeleton, for teleoperation and demonstration collection. As collecting teleoperated data is expensive and time-consuming, we further leverage AirExo to collect cheap in-the-wild demonstrations at scale. Under our in-the-wild learning framework, we show that with only 3 minutes of the teleoperated demonstrations, augmented by diverse and extensive in-the-wild data collected by AirExo, robots can learn a policy that is comparable to or even better than one learned from teleoperated demonstrations lasting over 20 minutes. Experiments demonstrate that our approach enables the model to learn a more general and robust policy across the various stages of the task, enhancing the success rates in task completion even with the presence of disturbances. Project website: airexo.github.io.

NeurIPS Conference 2024 Conference Paper

ConceptFactory: Facilitate 3D Object Knowledge Annotation with Object Conceptualization

  • Jianhua Sun
  • Yuxuan Li
  • Longfei Xu
  • Nange Wang
  • Jiude Wei
  • Yining Zhang
  • Cewu Lu

We present ConceptFactory, a novel scope to facilitate more efficient annotation of 3D object knowledge by recognizing 3D objects through generalized concepts (i.e., object conceptualization), aiming at promoting machine intelligence to learn comprehensive object knowledge from both vision and robotics aspects. This idea originates from findings in human cognition research that the perceptual recognition of objects can be explained as a process of arranging generalized geometric components (e.g., cuboids and cylinders). ConceptFactory consists of two critical parts: i) ConceptFactory Suite, a unified toolbox that adopts the Standard Concept Template Library (STL-C) to drive a web-based platform for object conceptualization, and ii) ConceptFactory Asset, a large collection of conceptualized objects acquired using the ConceptFactory Suite. Our approach enables researchers to effortlessly acquire or customize extensive varieties of object knowledge to comprehensively study different object understanding tasks. We validate our idea on a wide range of benchmark tasks from both vision and robotics aspects with state-of-the-art algorithms, demonstrating the high quality and versatility of annotations provided by our approach. Our website is available at https://apeirony.github.io/ConceptFactory.

IROS Conference 2024 Conference Paper

Differentiable Fluid Physics Parameter Identification By Stirring and For Stirring

  • Wenqiang Xu
  • Dongzhe Zheng
  • Yutong Li 0004
  • Jieji Ren
  • Cewu Lu

Fluid interactions are crucial in daily tasks, with properties like density and viscosity being key parameters. The property states can be used as control signals for robot operation. While density estimation is simple, assessing viscosity, especially for different fluid types, is complex. This study introduces a novel differentiable fitting framework, DiffStir, tailored to identify key physics parameters through stirring. Then, given the estimated physics parameters, we can generate commands to guide the robotic stirring. Comprehensive experiments were conducted to validate the efficacy of DiffStir, showcasing its precision in parameter estimation when benchmarked against reported values in the literature. More experiments and videos can be found in the supplementary materials and on the website: https://diffstir.robotflow.ai.
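
Identification through a differentiable simulator boils down to gradient descent on a trajectory-matching loss with respect to the physics parameters; a minimal sketch, where `sim` is a hypothetical differentiable stand-in for the fluid simulator:

```python
import torch

def identify_fluid_params(sim, observed, init, steps=200, lr=1e-2):
    """Fit physics parameters (e.g., density and viscosity) by descending the
    gradient of a trajectory loss through a differentiable simulator."""
    params = torch.tensor(init, requires_grad=True)
    opt = torch.optim.Adam([params], lr=lr)
    for _ in range(steps):
        opt.zero_grad()
        loss = torch.nn.functional.mse_loss(sim(params), observed)
        loss.backward()
        opt.step()
    return params.detach()

# toy stand-in: a "simulator" that is linear in the parameters;
# the true parameters here are (1.0, 1.0)
observed = torch.tensor([2.0, 3.0])
est = identify_fluid_params(lambda p: p * torch.tensor([2.0, 3.0]),
                            observed, init=(0.5, 0.2))
```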

AAAI Conference 2024 Conference Paper

FAVOR: Full-Body AR-Driven Virtual Object Rearrangement Guided by Instruction Text

  • Kailin Li
  • Lixin Yang
  • Zenan Lin
  • Jian Xu
  • Xinyu Zhan
  • Yifei Zhao
  • Pengxiang Zhu
  • Wenxiong Kang

Rearrangement operations form the crux of interactions between humans and their environment. The ability to generate natural, fluid sequences of this operation is of essential value in AR/VR and CG. Bridging a gap in the field, our study introduces FAVOR: a novel dataset for Full-body AR-driven Virtual Object Rearrangement that uniquely employs motion capture systems and AR eyeglasses. Comprising 3k diverse motion rearrangement sequences and 7.17 million interaction data frames, this dataset breaks new ground in research data. We also present a pipeline FAVORITE for producing digital human rearrangement motion sequences guided by instructions. Experimental results, both qualitative and quantitative, suggest that this dataset and pipeline deliver high-quality motion sequences. Our dataset, code, and appendix are available at https://kailinli.github.io/FAVOR.

ICRA Conference 2024 Conference Paper

GAMMA: Generalizable Articulation Modeling and Manipulation for Articulated Objects

  • Qiaojun Yu
  • Junbo Wang 0004
  • Wenhai Liu
  • Ce Hao
  • Liu Liu 0012
  • Lin Shao 0002
  • Weiming Wang
  • Cewu Lu

Articulated objects like cabinets and doors are widespread in daily life. However, directly manipulating 3D articulated objects is challenging because they have diverse geometrical shapes, semantic categories, and kinetic constraints. Prior works mostly focused on recognizing and manipulating articulated objects with specific joint types. They can either estimate the joint parameters or distinguish suitable grasp poses to facilitate trajectory planning. Although these approaches have succeeded in certain types of articulated objects, they lack generalizability to unseen objects, which significantly impedes their application in broader scenarios. In this paper, we propose a novel framework of Generalizable Articulation Modeling and Manipulating for Articulated Objects (GAMMA), which learns both articulation modeling and grasp pose affordance from diverse articulated objects with different categories. In addition, GAMMA adopts adaptive manipulation to iteratively reduce the modeling errors and enhance manipulation performance. We train GAMMA with the PartNet-Mobility dataset and evaluate it with comprehensive experiments in SAPIEN simulation and on a real-world Franka robot. Results show that GAMMA significantly outperforms SOTA articulation modeling and manipulation algorithms on unseen and cross-category articulated objects. Images, videos and codes are published on the project website at: sites.google.com/view/gamma-articulation.

NeurIPS Conference 2024 Conference Paper

General Articulated Objects Manipulation in Real Images via Part-Aware Diffusion Process

  • Zhou Fang
  • Yong-Lu Li
  • Lixin Yang
  • Cewu Lu

Articulated object manipulation in real images is a fundamental step in computer and robotic vision tasks. Recently, several image editing methods based on diffusion models have been proposed to manipulate articulated objects according to text prompts. However, these methods often generate weird artifacts or even fail in real images. To this end, we introduce the Part-Aware Diffusion Model to approach the manipulation of articulated objects in real images. First, we develop Abstract 3D Models to represent and manipulate articulated objects efficiently. Then we propose dynamic feature maps to transfer the appearance of objects from input images to edited ones, meanwhile generating the novel-appearing parts reasonably. Extensive experiments are provided to illustrate the advanced manipulation capabilities of our method in comparison with state-of-the-art editing works. Additionally, we verify our method on 3D articulated object understanding for embodied robot scenarios, and the promising results prove that our method strongly supports this task. The project page is https://mvig-rhos.com/pa_diffusion.

NeurIPS Conference 2024 Conference Paper

HumanVLA: Towards Vision-Language Directed Object Rearrangement by Physical Humanoid

  • Xinyu Xu
  • Yizheng Zhang
  • Yong-Lu Li
  • Lei Han
  • Cewu Lu

Physical Human-Scene Interaction (HSI) plays a crucial role in numerous applications. However, existing HSI techniques are limited to specific object dynamics and privileged information, which prevents the development of more comprehensive applications. To address this limitation, we introduce HumanVLA for general object rearrangement directed by practical vision and language. A teacher-student framework is utilized to develop HumanVLA. A state-based teacher policy is trained first using goal-conditioned reinforcement learning and adversarial motion prior. Then, it is distilled into a vision-language-action model via behavior cloning. We propose several key insights to facilitate the large-scale learning process. To support general object rearrangement by physical humanoid, we introduce a novel Human-in-the-Room dataset encompassing various rearrangement tasks. Through extensive experiments and analysis, we demonstrate the effectiveness of our approach.

ICML Conference 2024 Conference Paper

Low-Rank Similarity Mining for Multimodal Dataset Distillation

  • Yue Xu
  • Zhilin Lin
  • Yusong Qiu
  • Cewu Lu
  • Yonglu Li 0001

Though dataset distillation has witnessed rapid development in recent years, the distillation of multimodal data, e.g., image-text pairs, poses unique and under-explored challenges. Unlike unimodal data, image-text contrastive learning (ITC) data lack inherent categorization and should instead place greater emphasis on modality correspondence. In this work, we propose Low-Rank Similarity Mining (LoRS) for multimodal dataset distillation, which concurrently distills a ground-truth similarity matrix with image-text pairs and leverages low-rank factorization for efficiency and scalability. The proposed approach brings significant improvement to the existing algorithms, marking a significant contribution to the field of visual-language dataset distillation. We advocate adopting LoRS as a foundational synthetic data setup for image-text dataset distillation. Our code is available at https://github.com/silicx/LoRS_Distill.
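
The low-rank idea can be sketched as parameterizing a learnable N x N similarity target as identity plus a rank-r correction, so storage and gradients scale with N·r rather than N²; the shapes and the additive-identity form below are illustrative assumptions:

```python
import torch
import torch.nn as nn

class LowRankSimilarity(nn.Module):
    """Learnable N x N similarity target stored as I + L @ R^T,
    costing O(N*r) parameters instead of O(N^2)."""
    def __init__(self, n_pairs, rank=8):
        super().__init__()
        self.L = nn.Parameter(0.01 * torch.randn(n_pairs, rank))
        self.R = nn.Parameter(0.01 * torch.randn(n_pairs, rank))

    def forward(self):
        # dense similarity matrix materialized only when needed
        return torch.eye(self.L.shape[0]) + self.L @ self.R.T

sim = LowRankSimilarity(n_pairs=100, rank=8)()   # (100, 100) soft ITC targets
```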

ICRA Conference 2024 Conference Paper

Open X-Embodiment: Robotic Learning Datasets and RT-X Models: Open X-Embodiment Collaboration

  • Abby O'Neill
  • Abdul Rehman
  • Abhiram Maddukuri
  • Abhishek Gupta 0004
  • Abhishek Padalkar
  • Abraham Lee
  • Acorn Pooley
  • Agrim Gupta

Large, high-capacity models trained on diverse datasets have shown remarkable success in efficiently tackling downstream applications. In domains from NLP to Computer Vision, this has led to a consolidation of pretrained models, with general pretrained backbones serving as a starting point for many applications. Can such a consolidation happen in robotics? Conventionally, robotic learning methods train a separate model for every application, every robot, and even every environment. Can we instead train a "generalist" X-robot policy that can be adapted efficiently to new robots, tasks, and environments? In this paper, we provide datasets in standardized data formats and models to make it possible to explore this possibility in the context of robotic manipulation, alongside experimental results that provide an example of effective X-robot policies. We assemble a dataset from 22 different robots collected through a collaboration between 21 institutions, demonstrating 527 skills (160,266 tasks). We show that a high-capacity model trained on this data, which we call RT-X, exhibits positive transfer and improves the capabilities of multiple robots by leveraging experience from other platforms. The project website is robotics-transformer-x.github.io.

AAAI Conference 2024 Conference Paper

Primitive-Based 3D Human-Object Interaction Modelling and Programming

  • Siqi Liu
  • Yong-Lu Li
  • Zhou Fang
  • Xinpeng Liu
  • Yang You
  • Cewu Lu

Embedding Human and Articulated Object Interaction (HAOI) in 3D is an important direction for a deeper human activity understanding. Different from previous works that use parametric and CAD models to represent humans and objects, in this work, we propose a novel 3D geometric primitive-based language to encode both humans and objects. Given our new paradigm, humans and objects are all compositions of primitives instead of heterogeneous entities. Thus, mutual information learning may be achieved between the limited 3D data of humans and different object categories. Moreover, considering the simplicity of the expression and the richness of the information it contains, we choose the superquadric as the primitive representation. To explore an effective embedding of HAOI for the machine, we build a new benchmark on 3D HAOI consisting of primitives together with their images and propose a task requiring machines to recover 3D HAOI using primitives from images. Moreover, we propose a baseline of single-view 3D reconstruction on HAOI. We believe this primitive-based 3D HAOI representation would pave the way for 3D HAOI studies. Our code and data are available at https://mvig-rhos.com/p3haoi.
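
For context, the superquadric primitive chosen here is defined by a simple inside-outside function; a minimal sketch in the primitive's canonical frame (rotation and translation handling omitted):

```python
import numpy as np

def superquadric_io(points, scale, eps1, eps2):
    """Inside-outside function of a superquadric in its canonical frame:
    F < 1 inside, F = 1 on the surface, F > 1 outside."""
    x, y, z = (np.abs(points) / np.asarray(scale)).T
    return ((x ** (2 / eps2) + y ** (2 / eps2)) ** (eps2 / eps1)
            + z ** (2 / eps1))

# eps1 = eps2 = 1 recovers an ellipsoid; small eps values approach a box
pts = np.array([[0.0, 0.0, 0.0], [1.0, 0.0, 0.0], [2.0, 0.0, 0.0]])
print(superquadric_io(pts, scale=(1.0, 1.0, 1.0), eps1=1.0, eps2=1.0))  # [0, 1, 4]
```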

ICRA Conference 2024 Conference Paper

RH20T: A Comprehensive Robotic Dataset for Learning Diverse Skills in One-Shot

  • Haoshu Fang
  • Hongjie Fang
  • Zhenyu Tang 0005
  • Jirong Liu
  • Chenxi Wang 0003
  • Junbo Wang 0004
  • Haoyi Zhu
  • Cewu Lu

A key challenge for robotic manipulation in open domains is how to acquire diverse and generalizable skills for robots. Recent progress in one-shot imitation learning and robotic foundation models has shown promise in transferring trained policies to new tasks based on demonstrations. This feature is attractive for enabling robots to acquire new skills and improve their manipulative ability. However, due to limitations in the training dataset, the current focus of the community has mainly been on simple cases, such as push or pick-place tasks, relying solely on visual guidance. In reality, there are many complex skills, some of which may even require both visual and tactile perception to solve. This paper aims to unlock the potential for an agent to generalize to hundreds of real-world skills with multi-modal perception. To achieve this, we have collected a dataset comprising over 110,000 contact-rich robot manipulation sequences across diverse skills, contexts, robots, and camera viewpoints, all collected in the real world. Each sequence in the dataset includes visual, force, audio, and action information. Moreover, we also provide a corresponding human demonstration video and a language description for each robot sequence. We have invested significant efforts in calibrating all the sensors and ensuring a high-quality dataset. The dataset is made publicly available on our website: rh20t.github.io.

IROS Conference 2024 Conference Paper

RISE: 3D Perception Makes Real-World Robot Imitation Simple and Effective

  • Chenxi Wang 0003
  • Hongjie Fang
  • Haoshu Fang
  • Cewu Lu

Precise robot manipulations require rich spatial information in imitation learning. Image-based policies model object positions from fixed cameras, which are sensitive to camera view changes. Policies utilizing 3D point clouds usually predict keyframes rather than continuous actions, posing difficulty in dynamic and contact-rich scenarios. To utilize 3D perception efficiently, we present RISE, an end-to-end baseline for real-world imitation learning, which predicts continuous actions directly from single-view point clouds. It compresses the point cloud to tokens with a sparse 3D encoder. After adding sparse positional encoding, the tokens are featurized using a transformer. Finally, the features are decoded into robot actions by a diffusion head. Trained with 50 demonstrations for each real-world task, RISE surpasses current representative 2D and 3D policies by a large margin, showcasing significant advantages in both accuracy and efficiency. Experiments also demonstrate that RISE is more general and robust to environmental change compared with previous baselines. Project website: rise-policy.github.io.
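
The described pipeline (point cloud to tokens, transformer, action head) can be caricatured in a few lines. Everything below is a toy stand-in rather than the RISE implementation: the real system uses a sparse 3D convolutional encoder and a diffusion action head, both replaced here with plain layers for brevity:

```python
import torch
import torch.nn as nn

class ToyPointPolicy(nn.Module):
    """Toy point-cloud-to-action policy in the spirit of the RISE pipeline."""
    def __init__(self, d_model=128, action_dim=7):
        super().__init__()
        self.point_embed = nn.Linear(3, d_model)   # stand-in for a sparse 3D encoder
        self.pos_embed = nn.Linear(3, d_model)     # stand-in for sparse positional encoding
        layer = nn.TransformerEncoderLayer(d_model, nhead=4, batch_first=True)
        self.transformer = nn.TransformerEncoder(layer, num_layers=2)
        self.action_head = nn.Linear(d_model, action_dim)  # stand-in for a diffusion head

    def forward(self, points):                     # points: (B, N, 3)
        tokens = self.point_embed(points) + self.pos_embed(points)
        feats = self.transformer(tokens)           # (B, N, d_model)
        return self.action_head(feats.mean(dim=1))  # pooled -> continuous action

policy = ToyPointPolicy()
action = policy(torch.randn(2, 256, 3))            # (2, 7): e.g. pose + gripper width
```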

IROS Conference 2024 Conference Paper

RPMArt: Towards Robust Perception and Manipulation for Articulated Objects

  • Junbo Wang 0004
  • Wenhai Liu
  • Qiaojun Yu
  • Yang You 0004
  • Liu Liu 0012
  • Weiming Wang
  • Cewu Lu

Articulated objects are commonly found in daily life. It is essential that robots can exhibit robust perception and manipulation skills for articulated objects in real-world robotic applications. However, existing methods for articulated objects insufficiently address noise in point clouds and struggle to bridge the gap between simulation and reality, thus limiting the practical deployment in real-world scenarios. To tackle these challenges, we propose a framework towards Robust Perception and Manipulation for Articulated Objects (RPMArt), which learns to estimate the articulation parameters and manipulate the articulation part from the noisy point cloud. Our primary contribution is a Robust Articulation Network (RoArtNet) that is able to predict both joint parameters and affordable points robustly by local feature learning and point tuple voting. Moreover, we introduce an articulation-aware classification scheme to enhance its ability for sim-to-real transfer. Finally, with the estimated affordable point and articulation joint constraint, the robot can generate robust actions to manipulate articulated objects. After learning only from synthetic data, RPMArt is able to transfer zero-shot to real-world articulated objects. Experimental results confirm our approach’s effectiveness, with our framework achieving state-of-the-art performance in both noise-added simulation and real-world environments. Code, data and more results can be found on the project website at https://r-pmart.github.io.

AAAI Conference 2024 Conference Paper

ShapeBoost: Boosting Human Shape Estimation with Part-Based Parameterization and Clothing-Preserving Augmentation

  • Siyuan Bian
  • Jiefeng Li
  • Jiasheng Tang
  • Cewu Lu

Accurate human shape recovery from a monocular RGB image is a challenging task because humans come in different shapes and sizes and wear different clothes. In this paper, we propose ShapeBoost, a new human shape recovery framework that achieves pixel-level alignment even for rare body shapes and high accuracy for people wearing different types of clothes. Unlike previous approaches that rely on the use of PCA-based shape coefficients, we adopt a new human shape parameterization that decomposes the human shape into bone lengths and the mean width of each part slice. This part-based parameterization technique achieves a balance between flexibility and validity using a semi-analytical shape reconstruction algorithm. Based on this new parameterization, a clothing-preserving data augmentation module is proposed to generate realistic images with diverse body shapes and accurate annotations. Experimental results show that our method outperforms other state-of-the-art methods in diverse body shape situations as well as in varied clothing situations.

AAAI Conference 2023 Conference Paper

CRIN: Rotation-Invariant Point Cloud Analysis and Rotation Estimation via Centrifugal Reference Frame

  • Yujing Lou
  • Zelin Ye
  • Yang You
  • Nianjuan Jiang
  • Jiangbo Lu
  • Weiming Wang
  • Lizhuang Ma
  • Cewu Lu

Various recent methods attempt to implement rotation-invariant 3D deep learning by replacing the input coordinates of points with relative distances and angles. Due to the incompleteness of these low-level features, such methods inevitably sacrifice global information. In this paper, we propose CRIN, namely the Centrifugal Rotation-Invariant Network. CRIN directly takes the coordinates of points as input and transforms local points into rotation-invariant representations via centrifugal reference frames. Aided by centrifugal reference frames, each point corresponds to a discrete rotation so that the information of rotations can be implicitly stored in point features. Unfortunately, discrete points are far from describing the whole rotation space. We further introduce a continuous distribution for 3D rotations based on points. Furthermore, we propose an attention-based down-sampling strategy to sample points invariant to rotations. Finally, a relation module is adopted to reinforce the long-range dependencies between sampled points and to predict the anchor point for unsupervised rotation estimation. Extensive experiments show that our method achieves rotation invariance, accurately estimates the object rotation, and obtains state-of-the-art results on rotation-augmented classification and part segmentation. Ablation studies validate the effectiveness of the network design.

IROS Conference 2023 Conference Paper

Flexible Handover with Real-Time Robust Dynamic Grasp Trajectory Generation

  • Gu Zhang
  • Haoshu Fang
  • Hongjie Fang
  • Cewu Lu

In recent years, there has been a significant effort dedicated to developing efficient, robust, and general human-to-robot handover systems. However, the area of flexible handover in the context of complex and continuous object motion remains relatively unexplored. In this work, we propose an approach for effective and robust flexible handover, which enables the robot to grasp objects moving along flexible trajectories with a high success rate. The key innovation of our approach is the generation of real-time robust grasp trajectories. We also design a future grasp prediction algorithm to enhance the system's adaptability to dynamic handover scenes. We conduct one-motion handover experiments and motion-continuous handover experiments on our novel benchmark that includes 31 diverse household objects. The system we have developed allows users to move and rotate objects in their hands within a relatively large range. The success rate of the robot grasping such moving objects is 78.15% over the entire household object benchmark.

NeurIPS Conference 2023 Conference Paper

Symbol-LLM: Leverage Language Models for Symbolic System in Visual Human Activity Reasoning

  • Xiaoqian Wu
  • Yong-Lu Li
  • Jianhua Sun
  • Cewu Lu

Human reasoning can be understood as a cooperation between the intuitive, associative "System-1" and the deliberative, logical "System-2". For existing System-1-like methods in visual activity understanding, it is crucial to integrate System-2 processing to improve explainability, generalization, and data efficiency. One possible path of activity reasoning is building a symbolic system composed of symbols and rules, where one rule connects multiple symbols, implying human knowledge and reasoning abilities. Previous methods have made progress, but are limited by handcrafted symbols and rules mined from visual-based annotations, failing to cover the complex patterns of activities and lacking compositional generalization. To overcome these defects, we propose a new symbolic system with two ideal important properties: broad-coverage symbols and rational rules. Since collecting massive human knowledge via manual annotation is expensive, instantiating this symbolic system directly is impractical. Instead, we leverage the recent advancement of LLMs (Large Language Models) as an approximation of the two ideal properties, i.e., Symbols from Large Language Models (Symbol-LLM). Then, given an image, visual contents are extracted and checked as symbols, and activity semantics are reasoned out based on rules via fuzzy logic calculation. Our method shows superiority in extensive activity understanding tasks. Code and data are available at https://mvig-rhos.com/symbol_llm.
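
To make the reasoning step concrete, here is a hedged sketch of rule evaluation under fuzzy logic, taking AND as the product t-norm and OR as the probabilistic sum; the paper's exact calculus and rule format may differ, and the symbols and rule below are hypothetical:

```python
# Symbol scores as a visual extractor might produce them (hypothetical).
symbol_scores = {"holding_cup": 0.9, "cup_tilted": 0.8, "liquid_visible": 0.4}

def f_and(*xs):            # product t-norm: AND of fuzzy truth values
    out = 1.0
    for x in xs:
        out *= x
    return out

def f_or(a, b):            # probabilistic sum: OR of fuzzy truth values
    return a + b - a * b

# Hypothetical LLM-generated rule:
#   drinking <- holding_cup AND (cup_tilted OR liquid_visible)
score_drinking = f_and(symbol_scores["holding_cup"],
                       f_or(symbol_scores["cup_tilted"],
                            symbol_scores["liquid_visible"]))
print(f"fuzzy truth of 'drinking': {score_drinking:.3f}")   # 0.9 * 0.88 = 0.792
```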

AAAI Conference 2022 Conference Paper

Correlation Field for Boosting 3D Object Detection in Structured Scenes

  • Jianhua Sun
  • Hao-Shu Fang
  • Xianghui Zhu
  • Jiefeng Li
  • Cewu Lu

Data augmentation is an efficient way to elevate 3D object detection performance. In this paper, we propose a simple but effective online crop-and-paste data augmentation pipeline for structured 3D point cloud scenes, named CorrelaBoost. Observing that 3D objects should have reasonable relative positions in a structured scene because of the objects' functionalities and natural relationships, we express this correlation as a kind of interactive force. An energy field called the Correlation Field can be calculated correspondingly across the whole 3D space. According to the Correlation Field, we propose two data augmentation strategies to explore highly congruent positions that a designated object may be pasted to: 1) Category Consistent Exchanging and 2) Energy Optimized Transformation. We conduct exhaustive experiments on various popular benchmarks with different detection frameworks, and the results illustrate that our method brings substantial free-lunch improvement and significantly outperforms state-of-the-art approaches in terms of data augmentation. It is worth noting that the performance of VoteNet at mAP@0.5 is improved by 7.7 on the ScanNetV2 dataset and 5.0 on the SUN RGB-D dataset. Our method is simple to implement and adds little computational overhead.
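
An illustrative toy of the energy idea, assuming (for illustration only) a quadratic well around a preferred pairwise distance per category pair; the paper's actual Correlation Field definition is richer than this:

```python
import numpy as np

rng = np.random.default_rng(0)

# Existing scene objects: (category, xy position). preferred_dist gives a
# hypothetical learned "natural" distance between two categories.
scene = [("table", np.array([2.0, 2.0])), ("sofa", np.array([0.0, 0.0]))]
preferred_dist = {("chair", "table"): 0.8, ("chair", "sofa"): 2.5}

def correlation_energy(cat, pos, sigma=0.5):
    """Lower energy = more plausible placement for `cat` at `pos`."""
    e = 0.0
    for other_cat, other_pos in scene:
        d = np.linalg.norm(pos - other_pos)
        d0 = preferred_dist.get((cat, other_cat), 1.5)
        e += (d - d0) ** 2 / (2 * sigma ** 2)   # quadratic well around d0
    return e

# Energy-optimized transformation: sample candidate paste positions and
# keep the lowest-energy one.
candidates = rng.uniform(-1.0, 3.0, size=(256, 2))
best = min(candidates, key=lambda p: correlation_energy("chair", p))
print("paste chair at", best)
```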

NeurIPS Conference 2022 Conference Paper

DART: Articulated Hand Model with Diverse Accessories and Rich Textures

  • Daiheng Gao
  • Yuliang Xiu
  • Kailin Li
  • Lixin Yang
  • Feng Wang
  • Peng Zhang
  • Bang Zhang
  • Cewu Lu

Hand, the bearer of human productivity and intelligence, is receiving much attention due to the recent surge of interest in digital twins. Among different hand morphable models, MANO has been widely used in the vision and graphics communities. However, MANO disregards textures and accessories, which largely limits its power to synthesize photorealistic hand data. In this paper, we extend MANO with Diverse Accessories and Rich Textures, namely DART. DART is composed of 50 daily 3D accessories which vary in appearance and shape, and 325 hand-crafted 2D texture maps covering different kinds of blemishes or make-ups. A Unity GUI is also provided to generate synthetic hand data with user-defined settings, e.g., pose, camera, background, lighting, textures, and accessories. Finally, we release DARTset, which contains large-scale (800K), high-fidelity synthetic hand images, paired with perfectly aligned 3D labels. Experiments demonstrate its superiority in diversity. As a complement to existing hand datasets, DARTset boosts generalization in both hand pose estimation and mesh recovery tasks. Raw ingredients (textures, accessories), Unity GUI, source code and DARTset are publicly available at dart2022.github.io.

AAAI Conference 2022 Conference Paper

Highlighting Object Category Immunity for the Generalization of Human-Object Interaction Detection

  • Xinpeng Liu
  • Yong-Lu Li
  • Cewu Lu

Human-Object Interaction (HOI) detection plays a core role in activity understanding. As a compositional learning problem (human-verb-object), studying its generalization matters. However, the widely used mean average precision (mAP) metric may not be enough to model compositional generalization well. Here, we propose a novel metric, mPD (mean Performance Degradation), as a complement to mAP for evaluating the performance gap among compositions of different objects and the same verb. Surprisingly, mPD reveals that previous state-of-the-art methods usually do not generalize well. With mPD as a cue, we propose Object Category (OC) Immunity to advance HOI generalization. Concretely, our core idea is to prevent the model from learning spurious object-verb correlations as a shortcut to overfit the training set. To achieve OC-immunity, we propose an OC-immune network that decouples the inputs from OC, extracts OC-immune representations, and leverages uncertainty quantification to generalize to unseen objects. In both conventional and zero-shot experiments, our method achieves decent improvements. To fully evaluate the generalization, we design a new and more difficult benchmark, on which our method presents a significant advantage. The code is available at https://github.com/Foruck/OC-Immunity.
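
One plausible formalization of mPD consistent with the abstract (the paper's exact normalization may differ): for each verb, measure every composition's relative gap to the best composition of that verb, then average over all compositions:

```python
# Hypothetical per-composition average precision, keyed by (verb, object).
ap = {
    ("ride", "horse"): 0.72, ("ride", "bicycle"): 0.65, ("ride", "elephant"): 0.31,
    ("hold", "cup"): 0.58, ("hold", "umbrella"): 0.52,
}

def mean_performance_degradation(ap):
    """Average relative gap of each verb-object composition to the best
    composition of the same verb; higher values mean worse generalization."""
    verbs = {v for v, _ in ap}
    gaps = []
    for v in verbs:
        scores = [s for (vv, _), s in ap.items() if vv == v]
        best = max(scores)
        gaps.extend((best - s) / best for s in scores)
    return sum(gaps) / len(gaps)

print(f"mPD = {mean_performance_degradation(ap):.3f}")
```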

IROS Conference 2022 Conference Paper

RCare World: A Human-centric Simulation World for Caregiving Robots

  • Ruolin Ye
  • Wenqiang Xu
  • Haoyuan Fu
  • Rajat Kumar Jenamani
  • Vy Nguyen
  • Cewu Lu
  • Katherine Dimitropoulou
  • Tapomayukh Bhattacharjee

We present RCareWorld, a human-centric simulation world for physical and social robotic caregiving designed with inputs from stakeholders. RCareWorld has realistic human models of care recipients with mobility limitations and caregivers, home environments with multiple levels of accessibility and assistive devices, and robots commonly used for caregiving. It interfaces with various physics engines to model diverse material types necessary for simulating caregiving scenarios, and provides the capability to plan, control, and learn both human and robot control policies by integrating with state-of-the-art external planning and learning libraries, and VR devices. We propose a set of realistic caregiving tasks in RCareWorld as a benchmark for physical robotic caregiving and provide baseline control policies for them. We illustrate the high-fidelity simulation capabilities of RCareWorld by demonstrating the execution of a policy learnt in simulation for one of these tasks on a real-world setup. Additionally, we perform a real-world social robotic caregiving experiment using behaviors modeled in RCareWorld. Robotic caregiving, though potentially impactful towards enhancing the quality of life of care recipients and caregivers, is a field with many barriers to entry due to its interdisciplinary facets. RCareWorld takes the first step towards building a realistic simulation world for robotic caregiving that would enable researchers worldwide to contribute to this impactful field. Demo videos and supplementary materials can be found at: https://emprise.cs.cornell.edu/rcareworld/.

ICRA Conference 2022 Conference Paper

SAGCI-System: Towards Sample-Efficient, Generalizable, Compositional, and Incremental Robot Learning

  • Jun Lv
  • Qiaojun Yu
  • Lin Shao 0002
  • Wenhai Liu
  • Wenqiang Xu
  • Cewu Lu

Building general-purpose robots to perform a diverse range of tasks in a large variety of environments in the physical world at the human level is extremely challenging. According to [1], it requires the robot learning to be sample-efficient, generalizable, compositional, and incremental. In this work, we introduce a systematic learning framework called SAGCI-system towards achieving the above four requirements. Our system first takes the raw point clouds gathered by the camera mounted on the robot's wrist as the inputs and produces initial modeling of the surrounding environment represented as a file of Unified Robot Description Format (URDF). Our system adopts a learning-augmented differentiable simulation that loads the URDF. The robot then utilizes interactive perception to interact with the environment in order to verify and modify the URDF online. Leveraging the differentiable simulation, we propose a model-based learning algorithm combining object-centric and robot-centric stages to efficiently produce policies to accomplish manipulation tasks. We apply our system to perform articulated object manipulation tasks, both in the simulation and the real world. Extensive experiments demonstrate the effectiveness of our proposed learning framework. Supplemental materials and videos are available on our project webpage https://sites.google.com/view/egci.

AAAI Conference 2022 Conference Paper

Unsupervised Representation for Semantic Segmentation by Implicit Cycle-Attention Contrastive Learning

  • Bo Pang
  • Yizhuo Li
  • Yifan Zhang
  • Gao Peng
  • Jiajun Tang
  • Kaiwen Zha
  • Jiefeng Li
  • Cewu Lu

We study unsupervised representation learning for the semantic segmentation task. Different from previous works that aim at providing unsupervised pre-trained backbones for segmentation models which need further supervised fine-tuning, we focus here on providing representations trained only by unsupervised methods. This means models need to directly generate pixel-level, linearly separable semantic results. We first explore and present two factors that have significant effects on segmentation under the contrastive learning framework: 1) the difficulty and diversity of the positive contrastive pairs, and 2) the balance of global and local features. To optimize these factors, we propose cycle-attention contrastive learning (CACL). CACL makes use of the semantic continuity of video frames, adopting an unsupervised cycle-consistent attention mechanism to implicitly conduct contrastive learning with difficult, global-local-balanced positive pixel pairs. Compared with the baseline model MoCo-v2 and other unsupervised methods, CACL demonstrates consistently superior performance on the PASCAL VOC (+4.5 mIoU) and Cityscapes (+4.5 mIoU) datasets.
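
A rough sketch of the cycle-consistency idea for mining positive pixel pairs (heavily simplified relative to CACL: feature maps are flattened to pixel-feature matrices, there is no attention module beyond a similarity softmax, and all names are illustrative):

```python
import torch
import torch.nn.functional as F

# Per-pixel features of two nearby video frames (toy sizes).
fa = F.normalize(torch.randn(64, 128), dim=-1)   # frame A: 64 pixels, dim 128
fb = F.normalize(torch.randn(80, 128), dim=-1)   # frame B: 80 pixels

ab = (fa @ fb.T).softmax(dim=-1)                 # attention A -> B
ba = (fb @ fa.T).softmax(dim=-1)                 # attention B -> A
cycle = ab @ ba                                  # (64, 64): A -> B -> A

# Pixels that return to themselves after the cycle are treated as reliable;
# their best match in frame B becomes a positive pair for contrastive learning.
is_consistent = cycle.argmax(dim=-1) == torch.arange(64)
match_in_b = ab.argmax(dim=-1)
positives = [(i, match_in_b[i].item())
             for i in torch.nonzero(is_consistent).flatten().tolist()]
print(f"{len(positives)} cycle-consistent positive pairs mined")
```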

AAAI Conference 2021 Conference Paper

DecAug: Augmenting HOI Detection via Decomposition

  • Hao-Shu Fang
  • Yichen Xie
  • Dian Shao
  • Yong-Lu Li
  • Cewu Lu

Human-object interaction (HOI) detection requires a large amount of annotated data. Current algorithms suffer from insufficient training samples and category imbalance within datasets. To increase data efficiency, in this paper, we propose an efficient and effective data augmentation method called DecAug for HOI detection. Based on our proposed object state similarity metric, object patterns across different HOIs are shared to augment local object appearance features without changing their states. Further, we shift the spatial correlation between humans and objects to other feasible configurations with the aid of a pose-guided Gaussian Mixture Model while preserving their interactions. Experiments show that our method brings up to 3.3 mAP and 1.6 mAP improvements on the V-COCO and HICO-DET datasets for two advanced models. Specifically, interactions with fewer samples enjoy more notable improvement. Our method can be easily integrated into various HOI detection models with negligible extra computational consumption.

AAAI Conference 2021 Conference Paper

DIRV: Dense Interaction Region Voting for End-to-End Human-Object Interaction Detection

  • Hao-Shu Fang
  • Yichen Xie
  • Dian Shao
  • Cewu Lu

In recent years, human-object interaction (HOI) detection has achieved impressive advances. However, conventional two-stage methods are usually slow in inference. On the other hand, existing one-stage methods mainly focus on the union regions of interactions, which introduce unnecessary visual information as disturbances to HOI detection. To tackle the problems above, we propose DIRV, a novel one-stage HOI detection approach based on a new concept called the interaction region. Unlike previous methods, our approach concentrates on the densely sampled interaction regions across different scales for each human-object pair, so as to capture the subtle visual features that are most essential to the interaction. Moreover, in order to compensate for the detection flaws of a single interaction region, we introduce a novel voting strategy that makes full use of those overlapped interaction regions in place of conventional Non-Maximal Suppression (NMS). Extensive experiments on two popular benchmarks, V-COCO and HICO-DET, show that our approach outperforms existing state-of-the-art methods by a large margin with the highest inference speed and lightest network architecture. Our code is publicly available at www.github.com/MVIG-SJTU/DIRV.
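
A minimal illustration of voting in place of NMS, assuming (for illustration only) that overlapping interaction regions simply pool their verb scores weighted by confidence; DIRV's actual vote weighting and region assignment are more involved:

```python
import numpy as np

# Predictions from overlapping interaction regions for one human-object
# pair: (region confidence, verb-score vector). NMS would keep only the
# first; voting lets all regions contribute.
preds = [
    (0.9, np.array([0.7, 0.2, 0.1])),
    (0.6, np.array([0.6, 0.3, 0.1])),
    (0.3, np.array([0.2, 0.7, 0.1])),
]

weights = np.array([c for c, _ in preds])
scores = np.stack([s for _, s in preds])
fused = (weights[:, None] * scores).sum(axis=0) / weights.sum()
print(fused)   # consensus verb scores from all overlapping regions
```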

NeurIPS Conference 2021 Conference Paper

Localization with Sampling-Argmax

  • Jiefeng Li
  • Tong Chen
  • Ruiqi Shi
  • Yujing Lou
  • Yong-Lu Li
  • Cewu Lu

The soft-argmax operation is commonly adopted in detection-based methods to localize the target position in a differentiable manner. However, training the neural network with soft-argmax leaves the shape of the probability map unconstrained. Consequently, the model lacks pixel-wise supervision through the map during training, leading to performance degradation. In this work, we propose sampling-argmax, a differentiable training method that imposes implicit constraints on the shape of the probability map by minimizing the expectation of the localization error. To approximate the expectation, we introduce a continuous formulation of the output distribution and develop a differentiable sampling process. The expectation can be approximated by calculating the average error of all samples drawn from the output distribution. We show that sampling-argmax can seamlessly replace the conventional soft-argmax operation on various localization tasks. Comprehensive experiments demonstrate the effectiveness and flexibility of the proposed method. Code is available at https://github.com/Jeff-sjtu/sampling-argmax.
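
The contrast between the two objectives can be seen on a tiny 1-D heatmap. Note that the paper approximates the expected error with differentiable samples from a continuous mixture; this sketch instead computes the expectation in closed form over the discrete map, which is what the sampling approximates:

```python
import torch

logits = torch.randn(1, 32, requires_grad=True)   # 1-D localization heatmap
probs = logits.softmax(dim=-1)
grid = torch.linspace(0.0, 1.0, 32)               # normalized pixel positions
target = torch.tensor([0.3])                      # ground-truth position

# Soft-argmax: L2 error of the expected position. Many differently shaped
# probability maps give the same expectation, so the map is unconstrained.
pred = (probs * grid).sum(dim=-1)
loss_soft = ((pred - target) ** 2).mean()

# Sampling-argmax objective: the expected L2 error over positions. Mass far
# from the target is penalized directly, constraining the map's shape.
loss_sampling = (probs * (grid - target.unsqueeze(-1)) ** 2).sum(dim=-1).mean()

(loss_soft + loss_sampling).backward()   # both differentiable in the logits
```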

ICRA Conference 2021 Conference Paper

RGB Matters: Learning 7-DoF Grasp Poses on Monocular RGBD Images

  • Minghao Gou
  • Haoshu Fang
  • Zhanda Zhu
  • Sheng Xu
  • Chenxi Wang 0003
  • Cewu Lu

General object grasping is an important yet unsolved problem in the field of robotics. Most current methods either generate grasp poses with few DoF, failing to cover most successful grasps, or take only the unstable depth image or point cloud as input, which may lead to poor results in some cases. In this paper, we propose RGBD-Grasp, a pipeline that solves this problem by decoupling 7-DoF grasp detection into two sub-tasks where RGB and depth information are processed separately. In the first stage, an encoder-decoder-like convolutional neural network, Angle-View Net (AVN), is proposed to predict the SO(3) orientation of the gripper at every location of the image. Consequently, a Fast Analytic Searching (FAS) module calculates the opening width and the distance of the gripper to the grasp point. By decoupling the grasp detection problem and introducing the stable RGB modality, our pipeline alleviates the requirement for high-quality depth images and is robust to depth sensor noise. We achieve state-of-the-art results on the GraspNet-1Billion dataset compared with several baselines. Real robot experiments on a UR5 robot with an Intel RealSense camera and a Robotiq two-finger gripper show high success rates for both single-object scenes and cluttered scenes. Our code and trained model are available at graspnet.net.

AAAI Conference 2021 Conference Paper

TDAF: Top-Down Attention Framework for Vision Tasks

  • Bo Pang
  • Yizhuo Li
  • Jiefeng Li
  • Muchen Li
  • Hanwen Cao
  • Cewu Lu

Human attention mechanisms often work in a top-down manner, yet this is not well explored in vision research. Here, we propose the Top-Down Attention Framework (TDAF) to capture top-down attention, which can be easily adopted in most existing models. Its Recursive Dual-Directional Nested Structure forms two sets of orthogonal paths, recursive and structural ones, where bottom-up spatial features and top-down attention features are extracted respectively. Such spatial and attention features are nested deeply; therefore, the proposed framework works in a mixed top-down and bottom-up manner. Empirical evidence shows that our TDAF can capture effective stratified attention information and boost performance. ResNet with TDAF achieves a 2.0% improvement on ImageNet. For object detection, the performance is improved by 2.7% AP over FCOS. For pose estimation, TDAF improves the baseline by 1.6%. And for action recognition, the 3D-ResNet adopting TDAF achieves an improvement of 1.7% in accuracy.

ICRA Conference 2020 Conference Paper

6-PACK: Category-level 6D Pose Tracker with Anchor-Based Keypoints

  • Chen Wang 0053
  • Roberto Martín-Martín
  • Danfei Xu
  • Jun Lv
  • Cewu Lu
  • Li Fei-Fei 0001
  • Silvio Savarese
  • Yuke Zhu

We present 6-PACK, a deep learning approach to category-level 6D object pose tracking on RGB-D data. Our method tracks in real time novel object instances of known object categories such as bowls, laptops, and mugs. 6-PACK learns to compactly represent an object by a handful of 3D keypoints, based on which the interframe motion of an object instance can be estimated through keypoint matching. These keypoints are learned end-to-end without manual supervision in order to be most effective for tracking. Our experiments show that our method substantially outperforms existing methods on the NOCS category-level 6D pose estimation benchmark and supports a physical robot to perform simple vision-based closed-loop manipulation tasks. Our code and video are available at https://sites.google.com/view/6packtracking.
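
The interframe-motion-from-matched-keypoints step corresponds to the classical least-squares rigid alignment (Kabsch/Umeyama). The snippet below is the generic textbook procedure, not code from 6-PACK:

```python
import numpy as np

def rigid_transform(src, dst):
    """Find R, t minimizing sum ||R @ src_i + t - dst_i||^2. src, dst: (N, 3)."""
    c_src, c_dst = src.mean(axis=0), dst.mean(axis=0)
    H = (src - c_src).T @ (dst - c_dst)          # 3x3 cross-covariance
    U, _, Vt = np.linalg.svd(H)
    d = np.sign(np.linalg.det(Vt.T @ U.T))       # guard against reflections
    R = Vt.T @ np.diag([1.0, 1.0, d]) @ U.T
    t = c_dst - R @ c_src
    return R, t

# Keypoints in frame t and the same keypoints re-detected in frame t+1:
kp_t = np.random.randn(8, 3)
angle = np.deg2rad(10.0)
R_true = np.array([[np.cos(angle), -np.sin(angle), 0.0],
                   [np.sin(angle),  np.cos(angle), 0.0],
                   [0.0, 0.0, 1.0]])
kp_t1 = kp_t @ R_true.T + np.array([0.05, 0.0, 0.02])

R, t = rigid_transform(kp_t, kp_t1)
assert np.allclose(R, R_true) and np.allclose(t, [0.05, 0.0, 0.02])
```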

AAAI Conference 2020 Conference Paper

Further Understanding Videos through Adverbs: A New Video Task

  • Bo Pang
  • Kaiwen Zha
  • Yifan Zhang
  • Cewu Lu

Video understanding is a research hotspot of computer vision, and significant progress has been made on video action recognition recently. However, the semantic information contained in actions is not rich enough to build powerful video understanding models. This paper first introduces a new video semantics: the Behavior Adverb (BA), which is a more expressive and difficult one covering subtle and inherent characteristics of human action behavior. To exhaustively decode this semantics, we construct the Videos with Action and Adverb Dataset (VAAD), which is a large-scale dataset with a semantically complete set of BAs. The dataset will be released to the public with this paper. We benchmark several representative video understanding methods (originally for action recognition) on BA and action recognition. The results show that the BA recognition task is more challenging than conventional action recognition. Accordingly, we propose the BA Understanding Network (BAUN) to solve this problem, and the experiments reveal that our BAUN is more suitable for BA recognition (11% better than I3D). Furthermore, we find these two semantics (action and BA) can propel each other to better performance: promoting action recognition results by 3.4% on average on three standard action recognition datasets (UCF-101, HMDB-51, Kinetics).

NeurIPS Conference 2020 Conference Paper

HOI Analysis: Integrating and Decomposing Human-Object Interaction

  • Yong-Lu Li
  • Xinpeng Liu
  • Xiaoqian Wu
  • Yizhuo Li
  • Cewu Lu

Human-Object Interaction (HOI) consists of human, object and implicit interaction/verb. Different from previous methods that directly map pixels to HOI semantics, we propose a novel perspective for HOI learning in an analytical manner. In analogy to Harmonic Analysis, whose goal is to study how to represent signals with the superposition of basic waves, we propose HOI Analysis. We argue that coherent HOI can be decomposed into isolated human and object. Meanwhile, isolated human and object can also be integrated into coherent HOI again. Moreover, transformations between human-object pairs with the same HOI can also be more easily approached with integration and decomposition. As a result, the implicit verb will be represented in the transformation function space. In light of this, we propose an Integration-Decomposition Network (IDN) to implement the above transformations and achieve state-of-the-art performance on widely-used HOI detection benchmarks. Code is available at https://github.com/DirtyHarryLYL/HAKE-Action-Torch/tree/IDN-(Integrating-Decomposing-Network).

AAAI Conference 2020 Conference Paper

Pointwise Rotation-Invariant Network with Adaptive Sampling and 3D Spherical Voxel Convolution

  • Yang You
  • Yujing Lou
  • Qi Liu
  • Yu-Wing Tai
  • Lizhuang Ma
  • Cewu Lu
  • Weiming Wang

Point cloud analysis without pose priors is very challenging in real applications, as the orientations of point clouds are often unknown. In this paper, we propose a brand-new point-set learning framework, PRIN, namely the Pointwise Rotation-Invariant Network, focusing on rotation-invariant feature extraction in point cloud analysis. We construct spherical signals by Density-Aware Adaptive Sampling to deal with distorted point distributions in spherical space. In addition, we propose Spherical Voxel Convolution and Point Re-sampling to extract rotation-invariant features for each point. Our network can be applied to tasks ranging from object classification and part segmentation to 3D feature matching and label alignment. We show that, on the dataset with randomly rotated point clouds, PRIN demonstrates better performance than state-of-the-art methods without any data augmentation. We also provide theoretical analysis for the rotation-invariance achieved by our methods.

ICRA Conference 2020 Conference Paper

Transferable Active Grasping and Real Embodied Dataset

  • Xiangyu Chen
  • Zelin Ye
  • Jiankai Sun
  • Yuda Fan
  • Fang Hu 0002
  • Chenxi Wang 0003
  • Cewu Lu

Grasping in cluttered scenes is challenging for robot vision systems, as detection accuracy can be hindered by partial occlusion of objects. We adopt a reinforcement learning (RL) framework and 3D vision architectures to search for feasible viewpoints for grasping using hand-mounted RGB-D cameras. To overcome the disadvantages of photo-realistic environment simulation, we propose a large-scale dataset called the Real Embodied Dataset (RED), which includes full-viewpoint real samples on the upper hemisphere with amodal annotation and enables a simulator with real visual feedback. Based on this dataset, a practical 3-stage transferable active grasping pipeline is developed that is adaptive to unseen cluttered scenes. In our pipeline, we propose a novel mask-guided reward to overcome the sparse-reward issue in grasping and to ensure category-irrelevant behavior. The grasping pipeline and its possible variants are evaluated with extensive experiments both in simulation and on a real-world UR-5 robotic arm.
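
A toy version of a mask-guided reward, assuming (for illustration only) that the shaping term tracks the visible-to-amodal area ratio of the target object across viewpoints; the paper's exact reward design is not reproduced here:

```python
import numpy as np

def mask_guided_reward(visible_mask, amodal_mask, prev_ratio):
    """Reward viewpoint changes that reveal more of the target object.

    The visible/amodal area ratio rises as occlusion decreases, densifying
    the otherwise sparse grasp-success signal.
    """
    ratio = visible_mask.sum() / max(amodal_mask.sum(), 1)
    return ratio - prev_ratio, ratio   # positive reward when occlusion drops

visible = np.zeros((64, 64), dtype=bool); visible[10:30, 10:30] = True
amodal = np.zeros((64, 64), dtype=bool);  amodal[10:40, 10:40] = True
r, ratio = mask_guided_reward(visible, amodal, prev_ratio=0.3)
print(f"reward = {r:.3f}, visibility ratio = {ratio:.3f}")
```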

IROS Conference 2019 Conference Paper

TendencyRL: Multi-stage Discriminative Hints for Efficient Goal-Oriented Reverse Curriculum Learning

  • Chen Wang 0033
  • Junfeng Ding
  • Xiangyu Chen
  • Zelin Ye
  • Jialu Wang
  • Ziruo Cai
  • Cewu Lu

Deep reinforcement learning algorithms have proven successful in a variety of simulation tasks with dense reward feedback. However, real-world RL applications, e.g., robotic manipulation, remain challenging, as most of them are multi-stage and a positive reward can only be received when the final goal is accomplished. In this work, we propose a potential solution to such problems by introducing an experience-based tendency reward shaping mechanism, which provides the robot with additional hints based on discriminative learning on past experience. The reward, along with a stage-awareness network, helps accelerate solving a multi-stage task split into shorter phases in a reverse curriculum learning manner. We extensively study the advantages of TRL on standard long-term goal-oriented robotics domains such as pick-and-place, and show that TRL performs more efficiently and robustly than prior approaches in tasks with large state spaces. In addition, we demonstrate that TRL can solve difficult robot manipulation challenges directly from perception.

IROS Conference 2019 Conference Paper

Transferable Trial-Minimizing Progressive Peg-in-hole Model

  • Junfeng Ding
  • Chen Wang 0033
  • Cewu Lu

Peg-in-hole is a fine-level manipulation task that requires a highly precise estimate of the location and normal direction of the hole, which is beyond the capability of state-of-the-art object detectors. Therefore, we propose a novel method that trains a robot arm to progressively search for the right inserting pose through trials with both force feedback and visual inputs. Under a reinforcement learning (RL) framework, an agent trying to minimize the number of trials is learned based on the sequentially estimated relative poses with regard to the correct inserting pose. Moreover, our learned dynamics model is transferable. Thanks to our context-independent force and visual feature design, our pre-trained model can be fine-tuned efficiently for another unseen peg-in-hole case. Extensive experiments show the effectiveness of the proposed framework.

IJCAI Conference 2018 Conference Paper

Annotation-Free and One-Shot Learning for Instance Segmentation of Homogeneous Object Clusters

  • Zheng Wu
  • Ruiheng Chang
  • Jiaxu Ma
  • Cewu Lu
  • Chi Keung Tang

We propose a novel approach for instance segmentation given an image of a homogeneous object cluster (HOC). Our learning approach is one-shot because a single video of an object instance is captured, and it requires no human annotation. Our intuition is that images of homogeneous objects can be effectively synthesized based on structure and illumination priors derived from real images. A novel solver is proposed that iteratively maximizes our structured likelihood to generate realistic images of HOCs. An illumination transformation scheme is applied to make the real and synthetic images share the same illumination condition. Extensive experiments and comparisons are performed to verify our method. We build a dataset consisting of pixel-level annotated images of HOCs. The dataset and code will be released.

AAAI Conference 2018 Conference Paper

PoseHD: Boosting Human Detectors Using Human Pose Information

  • Zhijian Liu
  • Bowen Pan
  • Yuliang Xiu
  • Cewu Lu

As most recently proposed methods for human detection have achieved a sufficiently high recall rate within a reasonable number of proposals, in this paper, we mainly focus on how to improve the precision rate of human detectors. In order to address the two main challenges in precision improvement, i.e., i) hard background instances and ii) redundant partial proposals, we propose the novel PoseHD framework, a top-down pose-based approach on the basis of an arbitrary state-of-the-art human detector. In our proposed PoseHD framework, we first make use of human pose estimation (in a batch manner) and present pose heatmap classification (by a convolutional neural network) to eliminate hard negatives by extracting more detailed structural information; then, we utilize pose-based proposal clustering and reranking modules, filtering redundant partial proposals by comprehensively considering both holistic and part information. The experimental results on multiple pedestrian benchmark datasets validate that our proposed PoseHD framework can generally improve the overall performance of recent state-of-the-art human detectors (by 2-4% in both mAP and MR metrics). Moreover, our PoseHD framework can be easily extended to object detection with large-scale object part annotations. Finally, in this paper, we present extensive ablative analysis to compare our approach with traditional bottom-up pose-based models and highlight the importance of our framework design decisions.