Arrow Research

Author name cluster

Shubham Garg

Possible papers associated with this exact author name in Arrow. This page groups case-insensitive exact name matches and is not a full identity disambiguation profile.

4 papers
2 author rows

Possible papers

4

ICRA 2025 · Conference Paper

A Real-to-Sim-to-Real Approach to Robotic Manipulation with VLM-Generated Iterative Keypoint Rewards

  • Shivansh Patel
  • Xinchen Yin
  • Wenlong Huang
  • Shubham Garg
  • Hooshang Nayyeri
  • Li Fei-Fei 0001
  • Svetlana Lazebnik
  • Yunzhu Li

Task specification for robotic manipulation in open-world environments is challenging, requiring flexible and adaptive objectives that align with human intentions and can evolve through iterative feedback. We introduce Iterative Keypoint Reward (IKER), a visually grounded, Python-based reward function that serves as a dynamic task specification. Our framework leverages VLMs to generate and refine these reward functions for multi-step manipulation tasks. Given RGB-D observations and free-form language instructions, we sample keypoints in the scene and generate a reward function conditioned on these keypoints. IKER operates on the spatial relationships between keypoints, leveraging commonsense priors about the desired behaviors and enabling precise SE(3) control. We reconstruct real-world scenes in simulation and use the generated rewards to train reinforcement learning (RL) policies, which are then deployed in the real world, forming a real-to-sim-to-real loop. Our approach demonstrates notable capabilities across diverse scenarios, including both prehensile and non-prehensile tasks, showcasing multi-step task execution, spontaneous error recovery, and on-the-fly strategy adjustments. The results highlight IKER's effectiveness in enabling robots to perform multi-step tasks in dynamic environments through iterative reward shaping. Project page: https://iker-robot.github.io/
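The abstract describes rewards as Python functions written by a VLM over sampled keypoints. The snippet below is a minimal sketch, not the authors' released code, of what such a keypoint-conditioned reward could look like; the keypoint indices, target position, and weights are hypothetical placeholders.

```python
# Minimal sketch of a VLM-style Python reward defined over tracked 3D keypoints.
# The specific keypoints, target, and weights below are illustrative assumptions.
import numpy as np

def iterative_keypoint_reward(keypoints: np.ndarray) -> float:
    """keypoints: (N, 3) array of tracked keypoint positions in the world frame."""
    gripper = keypoints[0]                    # e.g. a keypoint on the robot gripper
    handle = keypoints[2]                     # e.g. a sampled keypoint on a drawer handle
    target = np.array([0.4, 0.0, 0.2])        # hypothetical desired handle position

    reach_cost = np.linalg.norm(gripper - handle)   # encourage reaching the handle
    task_cost = np.linalg.norm(handle - target)     # encourage moving it to the target
    return -(reach_cost + 2.0 * task_cost)          # dense negative-cost reward for RL
```

Because the reward depends only on spatial relations between keypoints, a VLM can regenerate or adjust it from new observations at each iteration of the feedback loop.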

NeurIPS 2025 · Conference Paper

MIRAGE: A Benchmark for Multimodal Information-Seeking and Reasoning in Agricultural Expert-Guided Conversations

  • Vardhan Dongre
  • Chi Gui
  • Shubham Garg
  • Hooshang Nayyeri
  • Gokhan Tur
  • Dilek Hakkani-Tur
  • Vikram Adve

We introduce MIRAGE, a new benchmark for multimodal expert-level reasoning and decision-making in consultative interaction settings. Designed for the domain of agriculture, MIRAGE captures the full complexity of expert consultations by combining natural user queries, expert-authored responses, and image-based context, offering a high-fidelity benchmark for evaluating models on grounded reasoning, clarification strategies, and long-form generation in a real-world, knowledge-intensive domain. Grounded in over 35,000 real user-expert interactions and curated through a carefully designed multi-step pipeline, MIRAGE spans diverse crop health, pest diagnosis, and crop management scenarios. The benchmark includes more than 7,000 unique biological entities, covering plant species, pests, and diseases, making it one of the most taxonomically diverse benchmarks available for vision-language models in real-world expert-guided domains. Unlike existing benchmarks that rely on well-specified user inputs, MIRAGE features underspecified, context-rich scenarios, requiring models to infer latent knowledge gaps and either proactively guide the interaction or respond. Our benchmark comprises two core components: a Single-Turn Challenge, which requires models to reason over a single user turn and image set, identify relevant entities, infer causal explanations, and generate actionable recommendations; and a Multi-Turn Challenge for dialogue state tracking, goal-driven generation, and expert-level conversational decision-making. We evaluate more than 20 closed and open-source frontier vision-language models (VLMs), using three reasoning language models as evaluators, highlighting the significant challenges posed by MIRAGE in both single-turn and multi-turn interaction settings. Even the advanced GPT-4.1 and GPT-4o models achieve only 44.6% and 40.9% accuracy, respectively, indicating significant room for improvement.
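To make the single-turn component concrete, here is a hedged sketch of how such an example might be represented and scored. The field names, sample record, and recall metric are illustrative assumptions, not the benchmark's actual schema or its official evaluator.

```python
# Hypothetical representation of a MIRAGE-style single-turn example and a simple
# entity-recall score; schema and metric are assumptions for illustration only.
from dataclasses import dataclass

@dataclass
class SingleTurnExample:
    user_query: str             # underspecified, free-form grower question
    image_paths: list[str]      # accompanying crop / pest photos
    gold_entities: set[str]     # expert-identified species, pests, diseases
    gold_response: str          # expert-authored long-form answer

def entity_recall(predicted: set[str], gold: set[str]) -> float:
    """Fraction of expert-identified entities the model mentioned."""
    return len(predicted & gold) / len(gold) if gold else 0.0

example = SingleTurnExample(
    user_query="My tomato leaves have yellow spots, what should I do?",
    image_paths=["leaf_close_up.jpg"],
    gold_entities={"Solanum lycopersicum", "Septoria lycopersici"},
    gold_response="The spotting pattern suggests Septoria leaf spot...",
)
print(entity_recall({"Solanum lycopersicum"}, example.gold_entities))  # 0.5
```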

IROS 2025 · Conference Paper

Multi-Cali Anything: Dense Feature Multi-Frame Structure-from-Motion for Large-Scale Camera Array Calibration

  • Jinjiang You
  • Hewei Wang 0001
  • Yijie Li 0003
  • Mingxiao Huo
  • Long Van Tran Ha
  • Mingyuan Ma
  • Jinfeng Xu 0003
  • Jiayi Zhang

Calibrating large-scale camera arrays, such as those in dome-based setups, is time-intensive and typically requires dedicated captures of known patterns. While extrinsics in such arrays are fixed due to the physical setup, intrinsics often vary across sessions due to factors like lens adjustments or temperature changes. In this paper, we propose a dense-feature-driven multi-frame calibration method that refines intrinsics directly from scene data, eliminating the need for additional calibration captures. Our approach enhances traditional Structure-from-Motion (SfM) pipelines by introducing an extrinsics regularization term to progressively align estimated extrinsics with ground-truth values, a dense feature reprojection term to reduce keypoint errors by minimizing reprojection loss in the feature space, and an intrinsics variance term for joint optimization across multiple frames. Experiments on the Multiface dataset show that our method achieves nearly the same precision as dedicated calibration processes and significantly enhances intrinsics and 3D reconstruction accuracy. Fully compatible with existing SfM pipelines, our method provides an efficient and practical plug-and-play solution for large-scale camera setups. Our code is publicly available at: https://github.com/YJJfish/Multi-Cali-Anything
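The abstract names three optimization terms. The following is a rough sketch, with assumed shapes and weights rather than the released implementation, of how those terms could combine into a single objective: extrinsics regularization toward the fixed rig, a dense-feature reprojection loss, and an intrinsics variance term that couples per-frame intrinsic estimates.

```python
# Rough sketch of the three-term objective described above; shapes, weights, and
# function names are assumptions, not the paper's released code.
import numpy as np

def calibration_objective(extrinsics, extrinsics_prior, reproj_residuals,
                          intrinsics_per_frame, w_ext=1.0, w_feat=1.0, w_var=1.0):
    """
    extrinsics, extrinsics_prior: (C, 6) estimated camera poses and rig ground truth.
    reproj_residuals: (M, 2) dense-feature reprojection errors in feature/pixel space.
    intrinsics_per_frame: (F, C, 4) per-frame estimates of fx, fy, cx, cy per camera.
    """
    ext_reg = np.sum((extrinsics - extrinsics_prior) ** 2)    # keep extrinsics near the rig
    feat_reproj = np.sum(reproj_residuals ** 2)                # dense-feature reprojection loss
    intr_var = np.sum(np.var(intrinsics_per_frame, axis=0))    # intrinsics agreement across frames
    return w_ext * ext_reg + w_feat * feat_reproj + w_var * intr_var
```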

NeurIPS 2023 · Conference Paper

A Dataset of Relighted 3D Interacting Hands

  • Gyeongsik Moon
  • Shunsuke Saito
  • Weipeng Xu
  • Rohan Joshi
  • Julia Buffalini
  • Harley Bellan
  • Nicholas Rosen
  • Jesse Richardson

Two-hand interaction is one of the most challenging signals to analyze due to the self-similarity, complicated articulations, and occlusions of hands. Although several datasets have been proposed for two-hand interaction analysis, none of them achieves both 1) diverse and realistic image appearances and 2) diverse and large-scale ground-truth (GT) 3D poses at the same time. In this work, we propose Re:InterHand, a dataset of relighted 3D interacting hands that achieves both goals. To this end, we employ a state-of-the-art hand relighting network with our accurately tracked two-hand 3D poses. We compare our Re:InterHand with existing 3D interacting hands datasets and show its benefits. Our Re:InterHand is available at https://mks0601.github.io/ReInterHand/