
Author name cluster

Haochen Shi

Possible papers associated with this exact author name in Arrow. This page groups case-insensitive exact name matches and is not a full identity disambiguation profile.

9 papers
2 author rows

Possible papers


IROS Conference 2025 Conference Paper

DRTT: A Diffusion-based Framework for 4DCT Generation, Robust Thoracic Registration and Tumor Deformation Tracking

  • Dongyuan Li
  • Yixin Shan
  • Yuxuan Mao
  • Haochen Shi
  • Shenghao Huang
  • Weiyan Sun
  • Chang Chen
  • Xiaojun Chen

In minimally invasive robotic thoracic surgery, the unavoidable respiratory motion of the patient causes lung lesions to move and deform, making precise tumor localization a significant challenge for surgeons. To address this, we introduce an RDDM (Recursive Deformable Diffusion Model)-based framework designed for real-time intraoperative tumor tracking, which can be used for registration and navigation in robot-assisted thoracic surgery. The RDDM reduces training complexity and enhances dataset utilization by employing a simplified DDM (Diffusion Deformable Model) iteratively, significantly lowering computational demands while maximizing the extraction of valuable information from limited 4D-CT (four-dimensional computed tomography) datasets. Considering the robustness required for intraoperative registration and navigation, we incorporate an ICP (Iterative Closest Point)-based point cloud registration method into the framework and validate our approach using publicly available datasets and volunteer trials. This innovation has the potential to reduce radiation exposure, trauma, and the risk of complications for patients undergoing minimally invasive thoracic surgery, and enables downstream tasks such as RAPNB (robot-assisted percutaneous needle biopsy) and radiation therapy.

TMLR Journal 2025 Journal Article

The Curse of CoT: On the Limitations of Chain-of-Thought in In-Context Learning

  • Tianshi Zheng
  • Yixiang Chen
  • Chengxi Li
  • Chunyang Li
  • Qing Zong
  • Haochen Shi
  • Baixuan Xu
  • Yangqiu Song

Chain-of-Thought (CoT) prompting has been widely recognized for its ability to enhance reasoning capabilities in large language models (LLMs). However, our study reveals a surprising contradiction to this prevailing perspective within the fundamental domain of pattern-based in-context learning (ICL). Through extensive experiments involving 16 state-of-the-art LLMs and nine diverse pattern-based ICL datasets, we demonstrate that CoT and its reasoning variants consistently underperform direct answering across varying model scales and benchmark complexities. To systematically investigate this unexpected phenomenon, we designed extensive experiments to validate several hypothetical explanations. Our analysis uncovers a fundamental hybrid mechanism of explicit-implicit reasoning driving CoT’s performance in pattern-based ICL: while explicit reasoning falters due to LLMs’ struggles to infer underlying patterns from demonstrations, implicit reasoning—disrupted by the increased contextual distance of CoT rationales—often compensates, delivering correct answers despite flawed rationales. This hybrid mechanism explains CoT’s relative underperformance, as noise from weak explicit inference undermines the process, even as implicit mechanisms partially salvage outcomes. Notably, even long-CoT reasoning models, which excel in abstract and symbolic reasoning, fail to fully overcome these limitations despite higher computational costs. Our findings challenge existing assumptions regarding the universal efficacy of CoT, yielding novel insights into its limitations and guiding future research toward more nuanced and effective reasoning methodologies for LLMs.

ICRA Conference 2024 Conference Paper

Open X-Embodiment: Robotic Learning Datasets and RT-X Models: Open X-Embodiment Collaboration

  • Abby O'Neill
  • Abdul Rehman
  • Abhiram Maddukuri
  • Abhishek Gupta 0004
  • Abhishek Padalkar
  • Abraham Lee
  • Acorn Pooley
  • Agrim Gupta

Large, high-capacity models trained on diverse datasets have shown remarkable successes on efficiently tackling downstream applications. In domains from NLP to Computer Vision, this has led to a consolidation of pretrained models, with general pretrained backbones serving as a starting point for many applications. Can such a consolidation happen in robotics? Conventionally, robotic learning methods train a separate model for every application, every robot, and even every environment. Can we instead train a "generalist" X-robot policy that can be adapted efficiently to new robots, tasks, and environments? In this paper, we provide datasets in standardized data formats and models to make it possible to explore this possibility in the context of robotic manipulation, alongside experimental results that provide an example of effective X-robot policies. We assemble a dataset from 22 different robots collected through a collaboration between 21 institutions, demonstrating 527 skills (160266 tasks). We show that a high-capacity model trained on this data, which we call RT-X, exhibits positive transfer and improves the capabilities of multiple robots by leveraging experience from other platforms. The project website is robotics-transformer-x.github.io.

AAMAS Conference 2024 Conference Paper

OPEx: A Large Language Model-Powered Framework for Embodied Instruction Following

  • Haochen Shi
  • Zhiyuan Sun
  • Xingdi Yuan
  • Marc-Alexandre Côté
  • Bang Liu

Embodied Instruction Following (EIF) is crucial for understanding natural language in a practical context, requiring agents to follow verbal instructions for complex tasks. Traditionally, EIF relies heavily on expert annotations for learning, which are costly and sometimes unattainable. Recent research shows Large Language Models (LLMs) can use their reasoning ability to help in EIF with minimal examples, but applying LLMs directly faces issues like hallucinations and partially observable environments. To bridge the gap, we introduce OPEx, a new LLM-based method for EIF that needs far less specific data. OPEx uses three LLMs for different roles: observing to gather environment data, planning by breaking down instructions, and executing tasks with learned skills. Our tests reveal OPEx significantly outperforms the FILM baseline, with 90% less training data for planning tasks and achieving up to 38% performance gain when FILM is trained on identical data.

AAAI Conference 2022 Conference Paper

MAGIC: Multimodal relAtional Graph adversarIal inferenCe for Diverse and Unpaired Text-Based Image Captioning

  • Wenqiao Zhang
  • Haochen Shi
  • Jiannan Guo
  • Shengyu Zhang
  • Qingpeng Cai
  • Juncheng Li
  • Sihui Luo
  • Yueting Zhuang

Text-based image captioning (TextCap) requires simultaneous comprehension of visual content and reading the text of images to generate a natural language description. Although this task can teach machines to understand the complex human environment further, given that text is omnipresent in our daily surroundings, it poses additional challenges in normal captioning. A text-based image intuitively contains abundant and complex multimodal relational content, that is, image details can be described diversely from multiple views rather than by a single caption. Certainly, we can introduce additional paired training data to show the diversity of images’ descriptions, but this process is labor-intensive and time-consuming for TextCap pair annotations with extra texts. Based on the insight mentioned above, we investigate how to generate diverse captions that focus on different image parts using an unpaired training paradigm. We propose the Multimodal relAtional Graph adversarIal inferenCe (MAGIC) framework for diverse and unpaired TextCap. This framework can adaptively construct multiple multimodal relational graphs of images and model complex relationships among graphs to represent descriptive diversity. Moreover, a cascaded generative adversarial network is developed from modeled graphs to infer the unpaired caption generation in image–sentence feature alignment and linguistic coherence levels. We validate the effectiveness of MAGIC in generating diverse captions from different relational information items of an image. Experimental results show that MAGIC can generate very promising outcomes without using any image–caption training pairs.

ICRA Conference 2021 Conference Paper

CollisionIK: A Per-Instant Pose Optimization Method for Generating Robot Motions with Environment Collision Avoidance

  • Daniel Rakita
  • Haochen Shi
  • Bilge Mutlu
  • Michael Gleicher

In this work, we present a per-instant pose optimization method that can generate configurations that achieve specified pose or motion objectives as best as possible over a sequence of solutions, while also simultaneously avoiding collisions with static or dynamic obstacles in the environment. We cast our method as a weighted sum non-linear constrained optimization-based IK problem where each term in the objective function encodes a particular pose objective. We demonstrate how to effectively incorporate environment collision avoidance as a single term in this multi-objective, optimization-based IK structure, and provide solutions for how to spatially represent and organize external environments such that data can be efficiently passed to a real-time, performance-critical optimization loop. We demonstrate the effectiveness of our method by comparing it to various state-of-the-art methods in a testbed of simulation experiments and discuss the implications of our work based on our results.

AAAI Conference 2021 Conference Paper

Consensus Graph Representation Learning for Better Grounded Image Captioning

  • Wenqiao Zhang
  • Haochen Shi
  • Siliang Tang
  • Jun Xiao
  • Qiang Yu
  • Yueting Zhuang

Contemporary visual captioning models frequently hallucinate objects that are not actually in a scene, due to visual misclassification or over-reliance on priors, resulting in semantic inconsistency between the visual information and the target lexical words. The most common way to address this is to encourage the captioning model to dynamically link generated object words or phrases to appropriate regions of the image, i.e., grounded image captioning (GIC). However, GIC utilizes an auxiliary task (grounding objects) that has not solved the key issue of object hallucination, i.e., the semantic inconsistency. In this paper, we take a novel perspective on the issue above: exploiting the semantic coherency between the visual and language modalities. Specifically, we propose the Consensus Graph Representation Learning framework (CGRL) for GIC that incorporates a consensus representation into the grounded captioning pipeline. The consensus is learned by aligning the visual graph (e.g., scene graph) to the language graph, considering both the nodes and edges in a graph. With the aligned consensus, the captioning model can capture both the correct linguistic characteristics and visual relevance, and then ground appropriate image regions further. We validate the effectiveness of our model, with a significant decline in object hallucination (-9% CHAIRi) on the Flickr30k Entities dataset. In addition, CGRL is evaluated by several automatic metrics and human evaluation; the results indicate that the proposed approach can simultaneously improve the performance of image captioning (+2.9 CIDEr) and grounding (+2.3 F1LOC).

AAAI Conference 2021 Conference Paper

Empower Distantly Supervised Relation Extraction with Collaborative Adversarial Training

  • Tao Chen
  • Haochen Shi
  • Liyuan Liu
  • Siliang Tang
  • Jian Shao
  • Zhigang Chen
  • Yueting Zhuang

With recent advances in distantly supervised (DS) relation extraction (RE), considerable attention has been devoted to leveraging multi-instance learning (MIL) to distill high-quality supervision from the noisy DS. Here, we go beyond label noise and identify the key bottleneck of DS-MIL to be its low data utilization: as high-quality supervision is refined by MIL, MIL abandons a large number of training instances, which leads to low data utilization and hinders model training from having abundant supervision. In this paper, we propose collaborative adversarial training to improve data utilization, which coordinates virtual adversarial training (VAT) and adversarial training (AT) at different levels. Specifically, since VAT is label-free, we employ instance-level VAT to recycle instances abandoned by MIL. In addition, we deploy AT at the bag level to unleash the full potential of the high-quality supervision obtained by MIL. Our proposed method brings consistent improvements (∼5 absolute AUC score) to the previous state of the art, which verifies the importance of the data utilization issue and the effectiveness of our method.

IJCAI Conference 2020 Conference Paper

Alleviate Dataset Shift Problem in Fine-grained Entity Typing with Virtual Adversarial Training

  • Haochen Shi
  • Siliang Tang
  • Xiaotao Gu
  • Bo Chen
  • Zhigang Chen
  • Jian Shao
  • Xiang Ren

The recent success of Distant Supervision (DS) brings abundant labeled data for the task of fine-grained entity typing (FET) without human annotation. However, the heuristically generated labels inevitably bring a significant distribution gap, namely dataset shift, between the distantly labeled training set and the manually curated test set. Considerable efforts have been made to alleviate this problem from the label perspective by either intelligently denoising the training labels, or designing noise-aware loss functions. Despite their progress, the dataset shift can hardly be eliminated completely. In this work, complementary to the label perspective, we reconsider this problem from the model perspective: Can we learn a more robust typing model with the existence of dataset shift? To this end, we propose a novel regularization module based on virtual adversarial training (VAT). The proposed approach first uses a self-paced sample selection function to select suitable samples for VAT, then constructs virtual adversarial perturbations based on the selected samples, and finally regularizes the model to be robust to such perturbations. Experiments on two benchmarks demonstrate the effectiveness of the proposed method, with an average 3.8%, 2.5%, and 3.2% improvement in accuracy, Macro F1 and Micro F1 respectively compared to the next best method.