Arrow Research search

Author name cluster

Hao Lu

Possible papers associated with this exact author name in Arrow. This page groups case-insensitive exact name matches and is not a full identity disambiguation profile.

26 papers
2 author rows

Possible papers

26

AAAI Conference 2026 Conference Paper

IPFormer: Instance Prompt-guided Transformer for Multi-modal Multi-shot Video Understanding

  • Yujia Liang
  • Jile Jiao
  • Xuetao Feng
  • Xinchen Liu
  • Kun Liu
  • Yuan Wang
  • Zixuan Ye
  • Hao Lu

Video Large Language Models (VideoLLMs), which adopt large language models for video understanding, have been demonstrated for single-shot videos. However, they usually struggle in multi-shot videos with frequent shot changes, varying camera angles, etc., which makes it hard for VideoLLMs to answer questions about multiple instances or shots over the whole video. We attribute this challenge to two issues: 1) the lack of multi-shot multi-instance annotations in existing datasets, and 2) the negligence of instance-aware modeling in current VideoLLMs. Therefore, we first introduce a new dataset termed MultiClip-Bench, featuring dense descriptions and question-answering pairs tailored for multi-shot and multi-instance scenarios. Moreover, since existing VideoLLMs neglect the explicit modeling of instance-related features, we propose a novel Instance Prompt-guided Transformer, named IPFormer, to achieve instance-aware video understanding. In IPFormer, we design a simple but effective instance-aware feature injection module, which encodes instance features as instance prompts via an attention-based connector. By this means, IPFormer can aggregate instance-specific information across multiple shots. Extensive experiments not only show that our dataset and model significantly improve multi-shot video understanding, but also show that our MultiClip-Bench can provide valuable training data and benchmarks for various video understanding tasks.

AAAI Conference 2026 Conference Paper

MCTSr-Zero: Self-Reflective Psychological Counseling Dialogues Generation via Principles and Adaptive Exploration

  • Hao Lu
  • Yanchi Gu
  • Haoyuan Huang
  • Yulin Zhou
  • Ningxin Zhu
  • Chen Li

The integration of Monte Carlo Tree Search (MCTS) with Large Language Models (LLMs) has demonstrated significant success in structured, problem-oriented tasks. However, applying these methods to open-ended dialogues, such as those in psychological counseling, presents unique challenges. Unlike tasks with objective correctness, success in therapeutic conversations depends on subjective factors like empathetic engagement, ethical adherence, and alignment with human preferences, for which strict correctness criteria are ill-defined. Existing result-oriented MCTS approaches can therefore produce misaligned responses. To address this, we introduce MCTSr-Zero, an MCTS framework designed for open-ended, human-centric dialogues. Its core innovation is domain alignment, which shifts the MCTS search objective from predefined end-states towards conversational trajectories that conform to target domain principles (e.g., empathy in counseling). Furthermore, MCTSr-Zero incorporates Regeneration and Meta-Prompt Adaptation mechanisms to substantially broaden exploration by allowing the MCTS to consider fundamentally different initial dialogue strategies. We evaluate MCTSr-Zero in psychological counseling by generating multi-turn dialogue data, which is used to fine-tune an LLM, PsyLLM. We also introduce PsyEval, a benchmark for assessing multi-turn psychological counseling dialogues. Experiments demonstrate that PsyLLM achieves state-of-the-art performance on PsyEval and other relevant metrics, validating MCTSr-Zero's effectiveness in generating high-quality, principle-aligned conversational data for human-centric domains and addressing the LLM challenge of consistently adhering to complex psychological standards.

NeurIPS Conference 2025 Conference Paper

DrivingRecon: Large 4D Gaussian Reconstruction Model For Autonomous Driving

  • Hao Lu
  • Tianshuo Xu
  • Wenzhao Zheng
  • Yunpeng Zhang
  • Wei Zhan
  • Dalong Du
  • Masayoshi Tomizuka
  • Kurt Keutzer

Large reconstruction models have made remarkable progress; they can directly predict 3D or 4D representations for unseen scenes and objects. However, current work has not systematically explored the potential of large reconstruction models in the field of autonomous driving. To achieve this, we introduce the Large 4D Gaussian Reconstruction Model (DrivingRecon). With an elaborate yet simple framework design, it not only ensures efficient and high-quality reconstruction, but also provides potential for downstream tasks. There are two core contributions: firstly, the Prune and Dilate Block (PD-Block) is proposed to prune redundant and overlapping Gaussian points and dilate Gaussian points for complex objects. Then, dynamic and static decoupling is tailored to better learn the temporally consistent geometry across different timestamps. Experimental results demonstrate that DrivingRecon significantly improves scene reconstruction quality compared to existing methods. Furthermore, we explore applications of DrivingRecon in model pre-training, vehicle type adaptation, and scene editing. Our code will be available.

NeurIPS Conference 2025 Conference Paper

OCRBench v2: An Improved Benchmark for Evaluating Large Multimodal Models on Visual Text Localization and Reasoning

  • Ling Fu
  • Zhebin Kuang
  • Jiajun Song
  • Mingxin Huang
  • Biao Yang
  • Yuzhe Li
  • Linghao Zhu
  • Qidi Luo

Scoring the Optical Character Recognition (OCR) capabilities of Large Multimodal Models (LMMs) has witnessed growing interest. Existing benchmarks have highlighted the impressive performance of LMMs in text recognition; however, their abilities in certain challenging tasks, such as text localization, handwritten content extraction, and logical reasoning, remain underexplored. To bridge this gap, we introduce OCRBench v2, a large-scale bilingual text-centric benchmark with currently the most comprehensive set of tasks ($4\times$ more tasks than the previous multi-scene benchmark OCRBench), the widest coverage of scenarios ($31$ diverse scenarios), and thorough evaluation metrics, with $10,000$ human-verified question-answering pairs and a high proportion of difficult samples. Moreover, we construct a private test set with $1,500$ manually annotated images. The consistent evaluation trends observed across both public and private test sets validate OCRBench v2's reliability. After carefully benchmarking state-of-the-art LMMs, we find that most LMMs score below $50$ ($100$ in total) and suffer from five types of limitations, including less frequently encountered text recognition, fine-grained perception, layout perception, complex element parsing, and logical reasoning. The benchmark and evaluation scripts are available at https://github.com/Yuliang-Liu/MultimodalOCR.

AAAI Conference 2025 Conference Paper

Towards Generalizable Multi-Camera 3D Object Detection via Perspective Rendering

  • Hao Lu
  • Yunpeng Zhang
  • Guoqing Wang
  • Qing Lian
  • Dalong Du
  • Ying-Cong Chen

Detecting and localizing objects in 3D space using multiple cameras, known as Multi-Camera 3D Object Detection (MC3D-Det), has gained prominence with the advent of bird's-eye view (BEV) approaches. However, these methods often struggle with the serious domain gaps caused by various viewpoints and environments between the training and testing domains. To address this challenge, we propose a novel framework that aligns 3D detection with 2D camera plane results by perspective rendering, thus achieving consistent and accurate results when facing serious domain shifts. Our approach consists of two main steps in both source and target domains: 1) rendering diverse view maps from BEV features by leveraging implicit foreground volumes and 2) rectifying the perspective bias of these maps. This design promotes the learning of perspective- and context-independent features, crucial for accurate object detection across varying viewpoints, camera parameters, and environmental conditions. Notably, our model-agnostic approach preserves the original network structure without incurring additional inference costs, facilitating seamless integration across various models and simplifying deployment. Worth noting is that our approach achieves satisfactory results in real data when trained only with virtual datasets, eliminating the need for real scene annotations. Experimental results on both Domain Generalization (DG) and Unsupervised Domain Adaptation (UDA) demonstrate its effectiveness.

AAAI Conference 2025 Conference Paper

Training Matting Models Without Alpha Labels

  • Wenze Liu
  • Zixuan Ye
  • Hao Lu
  • Zhiguo Cao
  • Xiangyu Yue

The labeling difficulty has been a longstanding problem in deep image matting. To escape from fine labels, this work explores using rough annotations such as trimaps coarsely indicating the foreground/background as supervision. We show that the cooperation between learned semantics from indicated known regions and properly assumed matting rules can help infer alpha values at transition areas. Inspired by the nonlocal principle in traditional image matting, we build a directional distance consistency loss (DDC loss) at each pixel neighborhood to constrain the alpha values conditioned on the input image. DDC loss forces the distance of similar pairs on the alpha matte and on its corresponding image to be consistent. In this way, the alpha values can be propagated from learned known regions to unknown transition areas. With only images and trimaps, a matting model can be trained under the supervision of a known loss and the proposed DDC loss. Experiments on the AM-2K and P3M-10K datasets show that our paradigm achieves comparable performance with the fine-label-supervised baseline, while sometimes offering even more satisfying results than human-labeled ground truth.

JBHI Journal 2024 Journal Article

ConDiff-rPPG: Robust Remote Physiological Measurement to Heterogeneous Occlusions

  • Jiyao Wang
  • Ximeng Wei
  • Hao Lu
  • Yingcong Chen
  • Dengbo He

Remote photoplethysmography (rPPG) is a contactless technique that facilitates the measurement of physiological signals and cardiac activities through facial video recordings. This approach holds tremendous potential for various applications. However, existing rPPG methods often did not account for different types of occlusions that commonly occur in real-world scenarios, such as temporary movement or actions of humans in videos or dust on the camera. The failure to address these occlusions can compromise the accuracy of rPPG algorithms. To address this issue, we proposed a novel ConDiff-rPPG to improve the robustness of rPPG measurement facing various occlusions. First, we compressed the damaged face video into a spatio-temporal representation with several types of masks. Second, a diffusion model was designed to recover the missing information with observed values as a condition. Moreover, a novel low-rank decomposition regularization was proposed to eliminate background noise and maximize informative features. ConDiff-rPPG ensured consistency in optimization goals during the training process. Through extensive experiments, including intra- and cross-dataset evaluations, as well as ablation tests, we demonstrated the robustness and generalization ability of our proposed model.

NeurIPS Conference 2024 Conference Paper

HAWK: Learning to Understand Open-World Video Anomalies

  • Jiaqi Tang
  • Hao Lu
  • Ruizheng Wu
  • Xiaogang Xu
  • Ke Ma
  • Cheng Fang
  • Bin Guo
  • Jiangbo Lu

Video Anomaly Detection (VAD) systems can autonomously monitor and identify disturbances, reducing the need for manual labor and associated costs. However, current VAD systems are often limited by their superficial semantic understanding of scenes and minimal user interaction. Additionally, the prevalent data scarcity in existing datasets restricts their applicability in open-world scenarios. In this paper, we introduce HAWK, a novel framework that leverages interactive large Visual Language Models (VLMs) to interpret video anomalies precisely. Recognizing the difference in motion information between abnormal and normal videos, HAWK explicitly integrates motion modality to enhance anomaly identification. To reinforce motion attention, we construct an auxiliary consistency loss within the motion and video space, guiding the video branch to focus on the motion modality. Moreover, to improve the interpretation of motion-to-language, we establish a clear supervisory relationship between motion and its linguistic representation. Furthermore, we have annotated over 8,000 anomaly videos with language descriptions, enabling effective training across diverse open-world scenarios, and also created 8,000 question-answering pairs for users' open-world questions. The final results demonstrate that HAWK achieves SOTA performance, surpassing existing baselines in both video description generation and question-answering. Our codes/dataset/demo will be released at https://github.com/jqtangust/hawk.

JBHI Journal 2024 Journal Article

Hierarchical Style-Aware Domain Generalization for Remote Physiological Measurement

  • Jiyao Wang
  • Hao Lu
  • Ange Wang
  • Yingcong Chen
  • Dengbo He

The utilization of remote photoplethysmography (rPPG) technology has gained attention in recent years due to its ability to extract blood volume pulse (BVP) from facial videos, making it accessible for various applications such as health monitoring and emotional analysis. However, the BVP signal is susceptible to complex environmental changes or individual differences, causing existing methods to struggle in generalizing for unseen domains. This article addresses the domain shift problem in rPPG measurement and shows that most domain generalization methods fail to work well in this problem due to ambiguous instance-specific differences. To address this, the article proposes a novel approach called Hierarchical Style-aware Representation Disentangling (HSRD). HSRD improves generalization capacity by separating domain-invariant and instance-specific feature space during training, which increases the robustness of out-of-distribution samples during inference. This work presents state-of-the-art performance against several methods in both cross and intra-dataset settings.

IROS Conference 2024 Conference Paper

Sampling-based Motion Planning for Optimal Probability of Collision under Environment Uncertainty

  • Hao Lu
  • Hanna Kurniawati
  • Rahul Shome

Motion planning is a fundamental capability in robotics applications. Real-world scenarios can introduce uncertainty to the motion planning problem. In this work we study environment uncertainty in general high-dimensional problems wherein the choice of appropriate metrics and formulations are shown to have significant effect on the probability of collision of the solution path. Several practically motivated cost functions have been proposed in literature to model and solve the problem but are shown in this work to suffer from higher probabilities of collision. The current work presents a theoretically sound formulation that was first mentioned in previous work on minimum constraint removal. In this work, approximating the optimal problem is shown to be better in achieving lower probability of collision. To demonstrate the formulation in a sampling-based setting, a mixed integer linear program seeded by greedy search over a roadmap with sampled environments is used to report paths with low probability of collision. Compared against minimizing the sum and minimizing max probability cost functions on a seven degree-of-freedom robotic arm in uncertain environments, we show clear benefits and promise towards motion planning for optimal probability of collision.

AAAI Conference 2024 Conference Paper

Vision Transformer Off-the-Shelf: A Surprising Baseline for Few-Shot Class-Agnostic Counting

  • Zhicheng Wang
  • Liwen Xiao
  • Zhiguo Cao
  • Hao Lu

Class-agnostic counting (CAC) aims to count objects of interest from a query image given a few exemplars. This task is typically addressed by extracting the features of the query image and exemplars respectively and then matching their feature similarity, leading to an extract-then-match paradigm. In this work, we show that CAC can be simplified in an extract-and-match manner, particularly using a vision transformer (ViT) where feature extraction and similarity matching are executed simultaneously within the self-attention. We reveal the rationale of such simplification from a decoupled view of the self-attention. The resulting model, termed CACViT, simplifies the CAC pipeline into a single pretrained plain ViT. Further, to compensate for the loss of the scale and the order-of-magnitude information due to resizing and normalization in plain ViT, we present two effective strategies for scale and magnitude embedding. Extensive experiments on the FSC147 and the CARPK datasets show that CACViT significantly outperforms state-of-the-art CAC approaches in both effectiveness (23.60% error reduction) and generalization, which suggests CACViT provides a concise and strong baseline for CAC. Code will be available.

AAAI Conference 2023 Conference Paper

Find Beauty in the Rare: Contrastive Composition Feature Clustering for Nontrivial Cropping Box Regression

  • Zhiyu Pan
  • Yinpeng Chen
  • Jiale Zhang
  • Hao Lu
  • Zhiguo Cao
  • Weicai Zhong

Automatic image cropping algorithms aim to recompose images like human-being photographers by generating the cropping boxes with improved composition quality. Cropping box regression approaches learn the beauty of composition from annotated cropping boxes. However, the bias of annotations leads to quasi-trivial recomposing results, which has an obvious tendency to the average location of training samples. The crux of this predicament is that the task is naively treated as a box regression problem, where rare samples might be dominated by normal samples, and the composition patterns of rare samples are not well exploited. Observing that similar composition patterns tend to be shared by the cropping boundaries annotated nearly, we argue to find the beauty of composition from the rare samples by clustering the samples with similar cropping boundary annotations, i.e., similar composition patterns. We propose a novel Contrastive Composition Clustering (C2C) to regularize the composition features by contrasting dynamically established similar and dissimilar pairs. In this way, common composition patterns of multiple images can be better summarized, which especially benefits the rare samples and endows our model with better generalizability to render nontrivial results. Extensive experimental results show the superiority of our model compared with prior arts. We also illustrate the philosophy of our design with an interesting analytical visualization.

AAAI Conference 2023 Conference Paper

Infusing Definiteness into Randomness: Rethinking Composition Styles for Deep Image Matting

  • Zixuan Ye
  • Yutong Dai
  • Chaoyi Hong
  • Zhiguo Cao
  • Hao Lu

We study the composition style in deep image matting, a notion that characterizes a data generation flow on how to exploit limited foregrounds and random backgrounds to form a training dataset. Prior art executes this flow in a completely random manner by simply going through the foreground pool or by optionally combining two foregrounds before foreground-background composition. In this work, we first show that naive foreground combination can be problematic and therefore derive an alternative formulation to reasonably combine foregrounds. Our second contribution is an observation that matting performance can benefit from a certain occurrence frequency of combined foregrounds and their associated source foregrounds during training. Inspired by this, we introduce a novel composition style that binds the source and combined foregrounds in a definite triplet. In addition, we also find that different orders of foreground combination lead to different foreground patterns, which further inspires a quadruplet-based composition style. Results under controlled experiments on four matting baselines show that our composition styles outperform existing ones and invite consistent performance improvement on both composited and real-world datasets. Code is available at: https://github.com/coconuthust/composition_styles

AAAI Conference 2023 Conference Paper

Learning Second-Order Attentive Context for Efficient Correspondence Pruning

  • Xinyi Ye
  • Weiyue Zhao
  • Hao Lu
  • Zhiguo Cao

Correspondence pruning aims to search consistent correspondences (inliers) from a set of putative correspondences. It is challenging because of the disorganized spatial distribution of numerous outliers, especially when putative correspondences are largely dominated by outliers. It is even more challenging to ensure effectiveness while maintaining efficiency. In this paper, we propose an effective and efficient method for correspondence pruning. Inspired by the success of attentive context in correspondence problems, we first extend the attentive context to the first-order attentive context and then introduce the idea of attention in attention (ANA) to model second-order attentive context for correspondence pruning. Compared with first-order attention, which focuses on feature-consistent context, second-order attention attends to the attention weights themselves and provides an additional source to encode consistent context from the attention map. For efficiency, we derive two approximate formulations for the naive implementation of second-order attention to reduce the cubic complexity to linear complexity, such that second-order attention can be used with negligible computational overheads. We further implement our formulations in a second-order context layer and then incorporate the layer in an ANA block. Extensive experiments demonstrate that our method is effective and efficient in pruning outliers, especially in high-outlier-ratio cases. Compared with the state-of-the-art correspondence pruning approach LMCNet, our method runs 14 times faster while maintaining a competitive accuracy.

NeurIPS Conference 2022 Conference Paper

SAPA: Similarity-Aware Point Affiliation for Feature Upsampling

  • Hao Lu
  • Wenze Liu
  • Zixuan Ye
  • Hongtao Fu
  • Yuliang Liu
  • Zhiguo Cao

We introduce point affiliation into feature upsampling, a notion that describes the affiliation of each upsampled point to a semantic cluster formed by local decoder feature points with semantic similarity. By rethinking point affiliation, we present a generic formulation for generating upsampling kernels. The kernels encourage not only semantic smoothness but also boundary sharpness in the upsampled feature maps. Such properties are particularly useful for some dense prediction tasks such as semantic segmentation. The key idea of our formulation is to generate similarity-aware kernels by comparing the similarity between each encoder feature point and the spatially associated local region of decoder features. In this way, the encoder feature point can function as a cue to inform the semantic cluster of upsampled feature points. To embody the formulation, we further instantiate a lightweight upsampling operator, termed Similarity-Aware Point Affiliation (SAPA), and investigate its variants. SAPA invites consistent performance improvements on a number of dense prediction tasks, including semantic segmentation, object detection, depth estimation, and image matting. Code is available at: https://github.com/poppinace/sapa

ICML Conference 2021 Conference Paper

Bootstrapping Fitted Q-Evaluation for Off-Policy Inference

  • Botao Hao
  • Xiang Ji
  • Yaqi Duan
  • Hao Lu
  • Csaba Szepesvári
  • Mengdi Wang 0001

Bootstrapping provides a flexible and effective approach for assessing the quality of batch reinforcement learning, yet its theoretical properties are poorly understood. In this paper, we study the use of bootstrapping in off-policy evaluation (OPE), and in particular, we focus on the fitted Q-evaluation (FQE) that is known to be minimax-optimal in the tabular and linear-model cases. We propose a bootstrapping FQE method for inferring the distribution of the policy evaluation error and show that this method is asymptotically efficient and distributionally consistent for off-policy statistical inference. To overcome the computation limit of bootstrapping, we further adapt a subsampling procedure that improves the runtime by an order of magnitude. We numerically evaluate the bootstrapping method in classical RL environments for confidence interval estimation, estimating the variance of an off-policy evaluator, and estimating the correlation between multiple off-policy evaluators.

ICLR Conference 2020 Conference Paper

A Learning-based Iterative Method for Solving Vehicle Routing Problems

  • Hao Lu
  • Xingwen Zhang
  • Shuang Yang

This paper is concerned with solving combinatorial optimization problems, in particular, the capacitated vehicle routing problems (CVRP). Classical Operations Research (OR) algorithms such as LKH3 \citep{helsgaun2017extension} are inefficient and difficult to scale to larger-size problems. Machine learning based approaches have recently shown to be promising, partly because of their efficiency (once trained, they can perform solving within minutes or even seconds). However, there is still a considerable gap between the quality of a machine learned solution and what OR methods can offer (e.g., on CVRP-100, the best result of learned solutions is between 16.10-16.80, significantly worse than LKH3's 15.65). In this paper, we present ``Learn to Improve'' (L2I), the first learning based approach for CVRP that is efficient in solving speed and at the same time outperforms OR methods. Starting with a random initial solution, L2I learns to iteratively refine the solution with an improvement operator, selected by a reinforcement learning based controller. The improvement operator is selected from a pool of powerful operators that are customized for routing problems. By combining the strengths of the two worlds, our approach achieves the new state-of-the-art results on CVRP, e.g., an average cost of 15.57 on CVRP-100.

ICML Conference 2018 Conference Paper

The Edge Density Barrier: Computational-Statistical Tradeoffs in Combinatorial Inference

  • Hao Lu
  • Yuan Cao 0006
  • Junwei Lu
  • Han Liu 0001
  • Zhaoran Wang 0001

We study the hypothesis testing problem of inferring the existence of combinatorial structures in undirected graphical models. Although there exist extensive studies on the information-theoretic limits of this problem, it remains largely unexplored whether such limits can be attained by efficient algorithms. In this paper, we quantify the minimum computational complexity required to attain the information-theoretic limits based on an oracle computational model. We prove that, for testing common combinatorial structures, such as clique, nearest neighbor graph and perfect matching, against an empty graph, or large clique against small clique, the information-theoretic limits are provably unachievable by tractable algorithms in general. More importantly, we define structural quantities called the weak and strong edge densities, which offer deep insight into the existence of such computational-statistical tradeoffs. To the best of our knowledge, our characterization is the first to identify and explain the fundamental tradeoffs between statistics and computation for combinatorial inference problems in undirected graphical models.

IS Journal 2011 Journal Article

Next-Generation Team-Science Platform for Scientific Collaboration

  • Xiaolong Zheng
  • Guanyan Ke
  • D. D. Zeng
  • S. Ram
  • Hao Lu

In the past two decades, many branches of science have shifted from individually oriented research toward team-based scientific collaboration. Teams of researchers representing different disciplines are often brought together to better solve large-scale and often urgent problems of scientific, societal, and environmental relevance. In addition to combined subject matter expertise and the team's disciplinary composition, many contextual factors such as antecedent conditions, collaborative processes, and support technologies, as well as a host of social factors such as team size and organizational complexity, can directly influence outcomes in team-based research. From this perspective, emerging research on team science aims at better understanding the key contextual factors related to transdisciplinary scientific collaboration processes and enhancing the outcomes of large-scale collaborative research programs. More specifically, team-science research combines problem-solving frameworks, specialized expertise, and research methods across disciplinary boundaries to help produce high-impact science.

AAAI Conference 2010 Conference Paper

Assisting Users with Clustering Tasks by Combining Metric Learning and Classification

  • Sumit Basu
  • Danyel Fisher
  • Steven Drucker
  • Hao Lu

Interactive clustering refers to situations in which a human labeler is willing to assist a learning algorithm in automatically clustering items. We present a related but somewhat different task, assisted clustering, in which a user creates explicit groups of items from a large set and wants suggestions on what items to add to each group. While the traditional approach to interactive clustering has been to use metric learning to induce a distance metric, our situation seems equally amenable to classification. Using clusterings of documents from human subjects, we found that one or the other method proved to be superior for a given cluster, but not uniformly so. We thus developed a hybrid mechanism for combining the metric learner and the classifier. We present results from a large number of trials based on human clusterings, in which we show that our combination scheme matches and often exceeds the performance of a method which exclusively uses either type of learner.