Arrow Research search

Author name cluster

Jian Hu

Possible papers associated with this exact author name in Arrow. This page groups case-insensitive exact name matches and is not a full identity disambiguation profile.

12 papers
2 author rows

Possible papers

12

IJCAI Conference 2025 Conference Paper

INT: Instance-Specific Negative Mining for Task-Generic Promptable Segmentation

  • Jian Hu
  • Zixu Cheng
  • Shaogang Gong

Task-generic promptable image segmentation aims to achieve segmentation of diverse samples under a single task description by utilizing only one task-generic prompt. Current methods leverage the generalization capabilities of Vision-Language Models (VLMs) to infer instance-specific prompts from these task-generic prompts in order to guide the segmentation process. However, when VLMs struggle to generalise to some image instances, the predicted instance-specific prompts become unreliable. To solve this problem, we introduce Instance-specific Negative Mining for Task-Generic Promptable Segmentation (INT). The key idea of INT is to adaptively reduce the influence of irrelevant (negative) prior knowledge whilst increasing the use of the most plausible prior knowledge, selected by negative mining with higher contrast, in order to optimise instance-specific prompt generation. Specifically, INT consists of two components: (1) instance-specific prompt generation, which progressively filters out incorrect information during prompt generation; (2) semantic mask generation, which ensures that each image instance segmentation correctly matches the semantics of the instance-specific prompts. INT is validated on six datasets, including camouflaged objects and medical images, demonstrating its effectiveness, robustness and scalability.
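The contrast-based selection described in the abstract can be sketched in a few lines. The scoring callables and the mean-contrast threshold below are illustrative assumptions, not INT's actual criterion:

```python
def negative_mining(candidates, score_with_image, score_prior_only):
    """Contrast-based negative mining (illustrative): a prompt candidate
    grounded in the image should score much higher when conditioned on
    the image than from prior knowledge alone. Low-contrast candidates
    are treated as negatives and filtered out."""
    contrasts = {c: score_with_image(c) - score_prior_only(c)
                 for c in candidates}
    # Hypothetical threshold: keep candidates above the mean contrast.
    threshold = sum(contrasts.values()) / len(contrasts)
    return [c for c, v in contrasts.items() if v >= threshold]
```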

AAAI Conference 2025 Conference Paper

InvSeg: Test-Time Prompt Inversion for Semantic Segmentation

  • Jiayi Lin
  • Jiabo Huang
  • Jian Hu
  • Shaogang Gong

Visual-textual correlations in the attention maps derived from text-to-image diffusion models are proven beneficial to dense visual prediction tasks, e.g., semantic segmentation. However, a significant challenge arises due to the input distributional discrepancy between the context-rich sentences used for image generation and the isolated class names typically used in semantic segmentation. This discrepancy hinders diffusion models from capturing accurate visual-textual correlations. To solve this, we propose InvSeg, a test-time prompt inversion method that tackles open-vocabulary semantic segmentation by inverting image-specific visual context into text prompt embedding space, leveraging structure information derived from the diffusion model's reconstruction process to enrich text prompts so as to associate each class with a structure-consistent mask. Specifically, we introduce Contrastive Soft Clustering (CSC) to align derived masks with the image's structure information, softly selecting anchors for each class and calculating weighted distances to push inner-class pixels closer while separating inter-class pixels, thereby ensuring mask distinction and internal consistency. By incorporating sample-specific context, InvSeg learns context-rich text prompts in embedding space and achieves accurate semantic alignment across modalities. Experiments show that InvSeg achieves state-of-the-art performance on the PASCAL VOC, PASCAL Context and COCO Object datasets.
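The soft assignment underlying Contrastive Soft Clustering can be illustrated with a softmax over negative squared distances to the class anchors. This is a generic sketch of the assignment step, not the paper's exact weighting:

```python
import math

def csc_soft_assign(pixel_feats, anchors, temperature=1.0):
    """Soft-cluster each pixel feature over class anchors via a softmax
    on negative squared distances: nearby anchors receive high weight,
    distant ones low weight (illustrative sketch)."""
    def sqdist(p, a):
        return sum((x - y) ** 2 for x, y in zip(p, a))
    weights = []
    for p in pixel_feats:
        logits = [-sqdist(p, a) / temperature for a in anchors]
        m = max(logits)                      # subtract max for stability
        exps = [math.exp(l - m) for l in logits]
        z = sum(exps)
        weights.append([e / z for e in exps])
    return weights
```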

NeurIPS Conference 2025 Conference Paper

ProRL: Prolonged Reinforcement Learning Expands Reasoning Boundaries in Large Language Models

  • Mingjie Liu
  • Shizhe Diao
  • Ximing Lu
  • Jian Hu
  • Xin Dong
  • Yejin Choi
  • Jan Kautz
  • Yi Dong

Recent advances in reasoning-centric language models have highlighted reinforcement learning (RL) as a promising method for aligning models with verifiable rewards. However, it remains contentious whether RL truly expands a model's reasoning capabilities or merely amplifies high-reward outputs already latent in the base model's distribution, and whether continually scaling up RL compute reliably leads to improved reasoning performance. In this work, we challenge prevailing assumptions by demonstrating that prolonged RL (ProRL) training can uncover novel reasoning strategies that are inaccessible to base models, even under extensive sampling. We introduce ProRL, a novel training methodology that incorporates KL divergence control, reference policy resetting, and a diverse suite of tasks. Our empirical analysis reveals that RL-trained models consistently outperform base models across a wide range of pass@k evaluations, including scenarios where base models fail entirely regardless of the number of attempts. We further show that reasoning boundary improvements correlate strongly with the base model's task competence and the training duration, suggesting that RL can explore and populate new regions of solution space over time. These findings offer new insights into the conditions under which RL meaningfully expands reasoning boundaries in language models and establish a foundation for future work on long-horizon RL for reasoning. We will release model weights and data to support further research.
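The pass@k evaluations mentioned above are typically computed with the standard unbiased estimator (n samples per problem, c of them correct, k drawn without replacement); a minimal sketch:

```python
from math import comb

def pass_at_k(n: int, c: int, k: int) -> float:
    """Unbiased pass@k: probability that at least one of k samples,
    drawn without replacement from n attempts of which c are correct,
    solves the task."""
    if n - c < k:
        return 1.0  # fewer than k incorrect attempts: a correct draw is guaranteed
    return 1.0 - comb(n - c, k) / comb(n, k)
```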

NeurIPS Conference 2025 Conference Paper

Uncertainty-quantified Rollout Policy Adaptation for Unlabelled Cross-domain Video Temporal Grounding

  • Jian Hu
  • Zixu Cheng
  • Shaogang Gong
  • Isabel Guan
  • Jianye Hao
  • Jun Wang
  • Kun Shao

Video Temporal Grounding (TG) aims to temporally locate video segments matching a natural language description (a query) in a long video. While Vision-Language Models (VLMs) are effective at holistic semantic matching, they often struggle with fine-grained temporal localisation. Recently, Group Relative Policy Optimisation (GRPO) reformulates the inference process as a reinforcement learning task, enabling fine-grained grounding and achieving strong in-domain performance. However, GRPO relies on labelled data, making it unsuitable in unlabelled domains. Moreover, because videos are large and expensive to store and process, performing full-scale adaptation introduces prohibitive latency and computational overhead, making it impractical for real-time deployment. To overcome both problems, we introduce a Data-Efficient Unlabelled Cross-domain Temporal Grounding method, in which a model is first trained on a labelled source domain, then adapted to a target domain using only a small number of unlabelled videos from the target domain. This approach eliminates the need for target annotation and keeps both computational and storage overhead low enough to run in real time. Specifically, we introduce Uncertainty-quantified Rollout Policy Adaptation (URPA) for cross-domain knowledge transfer in learning video temporal grounding without target labels. URPA generates multiple candidate predictions using GRPO rollouts, averages them to form a pseudo label, and estimates confidence from the variance across these rollouts. This confidence then weights the training rewards, guiding the model to focus on reliable supervision. Experiments on three datasets across six cross-domain settings show that URPA generalises well using only a few unlabelled target videos. Code is provided in the supplemental materials.
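The rollout-averaging and variance-based confidence step can be sketched as follows; the variance-to-confidence mapping below is an illustrative assumption, not URPA's exact formula:

```python
import statistics

def urpa_pseudo_label(rollout_preds):
    """Average (start, end) segment predictions from multiple GRPO
    rollouts into a pseudo label, and derive a confidence weight from
    their variance (low spread across rollouts -> high confidence)."""
    starts = [s for s, _ in rollout_preds]
    ends = [e for _, e in rollout_preds]
    pseudo = (statistics.mean(starts), statistics.mean(ends))
    spread = statistics.pvariance(starts) + statistics.pvariance(ends)
    confidence = 1.0 / (1.0 + spread)  # illustrative mapping to (0, 1]
    return pseudo, confidence
```

The confidence can then scale the reward for the corresponding training sample, so unreliable pseudo labels contribute less.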

ICRA Conference 2024 Conference Paper

Co-Axial Slender Tubular robot (CAST): Towards Robotized Operation for Transorbital Neurosurgery with Minimal Invasiveness

  • Shuai Wang 0024
  • Qingxiang Zhao
  • Jian Chen 0036
  • Mingcong Chen
  • Guanglin Cao
  • Jian Hu
  • Runfeng Zhu
  • Danny Tat Ming Chan

Transorbital Neurosurgery (TNS) offers a novel, minimally invasive treatment route toward lesions inside the skull. Most conventional TNS tools are rigid and straight, limiting dexterity and accessibility when passing through a small port. Bendable and steerable surgical tools provide an alternative. In this work, we propose a dual-segment slender surgical robot arm for TNS, the Co-Axial Slender Tubular robot (CAST), and model it using novel approaches. Another contribution is the tenon-mortise shaped slits along the axial direction, which enhance the overall stiffness. The bending of CAST is actuated by push/pull displacement, and the maximum diameter is only 1.7 mm, with high dexterity once mounted on a rigid robot arm. Experiments demonstrate that the proposed slit design doubles the stiffness compared to traditional rectangular slit designs. A path-following task shows that the position error was at most 3 mm under open-loop control. Tests on a skull model demonstrate that the whole system can effectively perform an electrocoagulation procedure deep inside the skull in a robotized manner.

ICRA Conference 2024 Conference Paper

Design and Visual Servoing Control of a Hybrid Dual-Segment Flexible Neurosurgical Robot for Intraventricular Biopsy

  • Jian Chen 0036
  • Mingcong Chen
  • Qingxiang Zhao
  • Shuai Wang 0024
  • Yihe Wang
  • Ying Xiao
  • Jian Hu
  • Danny Tat Ming Chan

Traditional rigid endoscopes face challenges in flexibly treating tumors located deep in the brain, and their low operability and fixed viewing angles limit their use. This study introduces MicroNeuro, a novel dual-segment flexible robotic endoscope designed to perform biopsies with dexterous surgical manipulation deep in the brain. Taking into account the uncertainty of the control model, image-based visual servoing with online robot Jacobian estimation has been implemented to enhance motion accuracy. Furthermore, the application of model predictive control with constraints significantly bolsters the flexible robot's ability to adaptively track mobile objects and resist external interference. Experimental results underscore that the proposed control system enhances motion stability and precision. Phantom testing substantiates its considerable potential for deployment in neurosurgery.
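Online image-Jacobian estimation in uncalibrated visual servoing is commonly done with a rank-1 Broyden update; this generic sketch stands in for the paper's estimator (dq is the actuation displacement, de the observed image-feature change):

```python
def broyden_update(J, dq, de, alpha=1.0):
    """Rank-1 Broyden update of an estimated image Jacobian J so that
    the corrected J maps dq toward the observed feature change de:
    J <- J + alpha * (de - J*dq) * dq^T / (dq^T dq)."""
    n = len(dq)
    pred = [sum(J[i][j] * dq[j] for j in range(n)) for i in range(len(J))]
    denom = sum(q * q for q in dq)  # dq^T dq
    return [[J[i][j] + alpha * (de[i] - pred[i]) * dq[j] / denom
             for j in range(n)] for i in range(len(J))]
```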

NeurIPS Conference 2024 Conference Paper

Leveraging Hallucinations to Reduce Manual Prompt Dependency in Promptable Segmentation

  • Jian Hu
  • Jiayi Lin
  • Junchi Yan
  • Shaogang Gong

Promptable segmentation typically requires instance-specific manual prompts to guide the segmentation of each desired object. To minimize such a need, task-generic promptable segmentation has been introduced, which employs a single task-generic prompt to segment various images of different objects in the same task. Current methods use Multimodal Large Language Models (MLLMs) to reason detailed instance-specific prompts from a task-generic prompt for improving segmentation accuracy. The effectiveness of this segmentation heavily depends on the precision of these derived prompts. However, MLLMs often suffer hallucinations during reasoning, resulting in inaccurate prompting. While existing methods focus on eliminating hallucinations to improve a model, we argue that MLLM hallucinations can reveal valuable contextual insights when leveraged correctly, as they represent pre-trained large-scale knowledge beyond individual images. In this paper, we first utilize hallucinations to mine task-related information from images and verify its accuracy to enhance the precision of the generated prompts. Specifically, we introduce an iterative Prompt-Mask Cycle generation framework (ProMaC) with a prompt generator and a mask generator. The prompt generator uses multi-scale chain-of-thought prompting, initially leveraging hallucinations to extract extended contextual prompts on a test image. These hallucinations are then minimized to formulate precise instance-specific prompts, directing the mask generator to produce masks that are consistent with task semantics via mask semantic alignment. Iteratively, the generated masks induce the prompt generator to focus more on task-relevant image areas and reduce irrelevant hallucinations, jointly yielding better prompts and masks. Experiments on 5 benchmarks demonstrate the effectiveness of ProMaC. Code is at https://lwpyh.github.io/ProMaC/.
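The iterative prompt-mask cycle can be sketched as a simple loop; prompt_gen, mask_gen and the reweighting scheme below are hypothetical stand-ins for the MLLM, the mask generator and the actual mask semantic alignment:

```python
def promac_cycle(image, task_prompt, prompt_gen, mask_gen, n_iters=3):
    """Skeleton of an iterative prompt-mask cycle: each round, the mask
    reweights the image so the next prompt focuses on task-relevant
    regions. Here the image is a flat list of pixel intensities."""
    focus = image
    prompt = mask = None
    for _ in range(n_iters):
        prompt = prompt_gen(focus, task_prompt)  # instance-specific prompt
        mask = mask_gen(focus, prompt)           # mask aligned to the prompt
        # Keep masked pixels, damp the rest (illustrative reweighting).
        focus = [px * (0.5 + 0.5 * m) for px, m in zip(image, mask)]
    return prompt, mask
```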

AAAI Conference 2024 Conference Paper

Relax Image-Specific Prompt Requirement in SAM: A Single Generic Prompt for Segmenting Camouflaged Objects

  • Jian Hu
  • Jiayi Lin
  • Shaogang Gong
  • Weitong Cai

Camouflaged object detection (COD) approaches heavily rely on pixel-level annotated datasets. Weakly-supervised COD (WSCOD) approaches use sparse annotations like scribbles or points to reduce annotation effort, but this can lead to decreased accuracy. The Segment Anything Model (SAM) shows remarkable segmentation ability with sparse prompts like points. However, manual prompts are not always feasible, as they may not be available in real-world applications. Additionally, they only provide localization information rather than semantic information, which can intrinsically cause ambiguity in interpreting targets. In this work, we aim to eliminate the need for manual prompts. The key idea is to employ Cross-modal Chains of Thought Prompting (CCTP) to reason visual prompts using the semantic information given by a generic text prompt. To that end, we introduce a test-time instance-wise adaptation mechanism called Generalizable SAM (GenSAM) to automatically generate and optimize visual prompts from the generic task prompt for WSCOD. In particular, CCTP maps a single generic text prompt onto image-specific consensus foreground and background heatmaps using vision-language models, acquiring reliable visual prompts. Moreover, to adapt the visual prompts at test time, we further propose Progressive Mask Generation (PMG) to iteratively reweight the input image, guiding the model to focus on the targeted region in a coarse-to-fine manner. Crucially, all network parameters are fixed, avoiding the need for additional training. Experiments on three benchmarks demonstrate that GenSAM outperforms point-supervision approaches and achieves comparable results to scribble-supervision ones, relying solely on general task descriptions. Our code is at https://github.com/jyLin8100/GenSAM.
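Turning consensus heatmaps into point prompts can be illustrated by averaging the maps and taking the argmax of each as a positive and a negative click; this is a toy sketch, not GenSAM's exact procedure:

```python
def heatmap_point_prompts(fg_maps, bg_maps):
    """Average consensus foreground/background heatmaps and take the
    argmax of each as a positive / negative point prompt (toy sketch)."""
    def avg(maps):
        n, rows, cols = len(maps), len(maps[0]), len(maps[0][0])
        return [[sum(m[i][j] for m in maps) / n for j in range(cols)]
                for i in range(rows)]
    def argmax2d(m):
        _, i, j = max((v, i, j) for i, row in enumerate(m)
                      for j, v in enumerate(row))
        return i, j
    return argmax2d(avg(fg_maps)), argmax2d(avg(bg_maps))
```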

IROS Conference 2024 Conference Paper

Vertebrae-based Global X-ray to CT Registration for Thoracic Surgeries

  • Lilu Liu
  • Yanmei Jiao
  • Zhou An
  • Honghai Ma
  • Chunlin Zhou
  • Haojian Lu
  • Jian Hu
  • Rong Xiong

X-ray to CT registration is an essential technique to provide on-site guidance for clinicians and medical robots by aligning preoperative information with intraoperative images. Current methods focus on local registration with small capture ranges and necessitate a manual initial alignment before precise registration. Some existing global methods are likely to fail in thoracic surgeries because of respiratory motion and the nearly colinear nature of vertebrae landmarks. In this study, we propose a vertebrae-based global X-ray to CT registration method assisted by clinical setups for thoracic surgeries. Firstly, vertebrae centroids are automatically localized by CNN-based networks in CT and X-ray to establish 2D/3D correspondences. Then, inspired by the clinical setup, we address the degeneracy of 6-DoF pose estimation caused by colinear landmarks by introducing a 4-DoF solver. Considering inaccurate priors and landmark mislocalization, the solver is embedded into the Adaptive Error-Aware Estimator (AE²) to simultaneously estimate weights and aggregate candidate poses. Finally, the whole method is trained in an end-to-end manner for better performance. Evaluations on both the public LIDC-IDRI dataset and a clinical dataset demonstrate that our method outperforms existing optimization-based and learning-based approaches in terms of registration accuracy and success rate. Our code: https://github.com/LiuLiluZJU/2P-AE2
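The candidate-pose aggregation step can be illustrated as a confidence-weighted average over 4-DoF poses (tx, ty, tz, theta). Linear averaging ignores angle wraparound, so this is only a stand-in for the paper's Adaptive Error-Aware Estimator:

```python
def aggregate_poses(candidates, weights):
    """Confidence-weighted average of candidate 4-DoF poses
    (tx, ty, tz, theta). Illustrative only: a proper estimator would
    handle the angular component circularly."""
    total = sum(weights)
    return tuple(
        sum(w * pose[i] for pose, w in zip(candidates, weights)) / total
        for i in range(4)
    )
```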

IROS Conference 2021 Conference Paper

A Static Model for a Stiffness-Adjustable Snake-Like Robot

  • Di Shun Huang
  • Jian Hu
  • Liuchunzi Guo
  • Yi Sun 0008
  • Liao Wu

In minimally invasive surgery, miniaturisation and in situ adjustable stiffness of robotic manipulators are desired features. Previous research proposed a simple and effective tendon-driven curve-joint manipulator design using a variable neutral-line mechanism, which satisfies both criteria well. A kinematic model was developed for such a manipulator based on the geometry of the structure. However, that model assumes without rigorous derivation that joint angles are all equal between disks, and fails if the shapes of the disks are not all identical. Moreover, the model does not analyse the tension of each tendon. This paper suggests a static model for predicting the articulation of such a manipulator given the applied tensions on the driving tendons. It validates the assumption of equally distributed joint angles and works for manipulators with more general configurations of disks and tendons. It also sets a foundation for further development of tension-based control and external force estimation. Simulations in Adams were conducted to verify the correctness of the proposed model. A video demonstrating the simulation results can be found at https://youtu.be/MXhL1LGwLtw

AAAI Conference 2008 Conference Paper

Mining Translations of Web Queries from Web Click-through Data

  • Rong Hu
  • Jian Hu
  • Zheng Chen

Query translation for Cross-Lingual Information Retrieval (CLIR) has gained increasing attention in the research area. Previous work mainly used machine translation systems, bilingual dictionaries, or web corpora to perform query translation. However, most of these approaches require either expensive language resources or complex language models, and cannot achieve timely translation for new queries. In this paper, we propose a novel solution to automatically acquire query translation pairs from the knowledge hidden in click-through data, which records the URL a user clicks after submitting a query to a search engine. Our proposed solution consists of two stages: identifying bilingual URL pair patterns in the click-through data, and matching query translation pairs based on user click behavior. Experimental results on a real dataset show that our method not only generates existing query translation pairs with high precision, but also generates many timely query translation pairs that could not be obtained by previous methods. A comparative study between our system and two commercial online translation systems shows the advantage of our proposed method.
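The first stage, identifying bilingual URL pair patterns, can be sketched by normalizing a language marker in the URL path; the /en/ vs /zh/ marker below is a hypothetical example of such a pattern:

```python
import re

def bilingual_url_pairs(click_log):
    """Find query pairs whose clicked URLs differ only in a language
    marker in the path (here /en/ vs /zh/), suggesting the two queries
    that led to them are translations of each other."""
    lang_pat = re.compile(r"/(en|zh)/")
    buckets = {}
    for query, url in click_log:
        m = lang_pat.search(url)
        if m:
            key = lang_pat.sub("/{lang}/", url)  # language-neutral URL template
            buckets.setdefault(key, {})[m.group(1)] = query
    return [(b["en"], b["zh"]) for b in buckets.values()
            if "en" in b and "zh" in b]
```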