Arrow Research

Author name cluster

Di Hu

Possible papers associated with this exact author name in Arrow. This page groups case-insensitive exact name matches and is not a full identity disambiguation profile.

12 papers
1 author row

Possible papers

12

NeurIPS Conference 2025 Conference Paper

Human-assisted Robotic Policy Refinement via Action Preference Optimization

  • Wenke Xia
  • Yichu Yang
  • Hongtao Wu
  • Xiao Ma
  • Tao Kong
  • Di Hu

Establishing a reliable and iteratively refined robotic system is essential for deploying real-world applications. While Vision-Language-Action (VLA) models are widely recognized as the foundation model for such robotic deployment, their reliance on offline expert demonstrations critically limits their capacity for post-deployment refinement. To mitigate this limitation, we introduce Action Preference Optimization (APO), a method designed to refine VLA models through human-assisted preference alignment gathered via interaction with environments. This method begins with a human-robot collaboration framework for reliable failure correction and interaction trajectory collection through human intervention. However, directly leveraging these interaction trajectories for preference optimization is non-trivial due to the challenges of irreversible robotic actions and token distribution mismatch. To solve this, APO proposes an adaptive reweighting algorithm with binary desirability signals derived from interaction, empowering VLA models to effectively suppress failure-prone actions while enhancing corrective action adaptation. Ultimately, APO equips VLA models with the crucial capability to learn from failure, paving the way for their iterative refinement and reliable deployment in dynamic environments. Experiments conducted in simulation and real-world scenarios demonstrate the superior generalization and robustness of our human-assisted framework across a variety of manipulation tasks. We believe this work can offer insights into the efficient and stable optimization of VLA models through human-robot collaboration. The code and dataset are released at https://github.com/GeWu-Lab/Action-Preference-Optimization.
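The abstract's "adaptive reweighting algorithm with binary desirability signals" can be illustrated with a toy sketch. The abstract does not give the objective in closed form, so every name and the weighting scheme below are hypothetical stand-ins, not the paper's actual formulation:

```python
def adaptive_weight(n_desirable, n_undesirable, desirable):
    """Hypothetical reweighting: balance desirable (corrective) and
    undesirable (failure-prone) actions so the rarer class of
    interaction signals is not drowned out during optimization."""
    total = n_desirable + n_undesirable
    n = n_desirable if desirable else n_undesirable
    return total / (2.0 * max(n, 1))

def apo_step_loss(logp_action, desirable, weight):
    """Per-action objective sketch: minimizing it raises the policy's
    log-likelihood of desirable actions and lowers it for undesirable
    ones, scaled by the adaptive weight."""
    sign = 1.0 if desirable else -1.0
    return -sign * weight * logp_action
```

Summing `apo_step_loss` over a batch of interaction trajectories would push a policy toward corrective actions and away from failure-prone ones; the paper's actual method additionally handles token distribution mismatch, which this sketch ignores.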

NeurIPS Conference 2025 Conference Paper

MokA: Multimodal Low-Rank Adaptation for MLLMs

  • Yake Wei
  • Yu Miao
  • Dongzhan Zhou
  • Di Hu

In this paper, we reveal that most current efficient multimodal fine-tuning methods are hindered by a key limitation: they are directly borrowed from LLMs, often neglecting the intrinsic differences of multimodal scenarios and even hampering the full utilization of all modalities. Inspired by our empirical observations, we argue that unimodal adaptation and cross-modal adaptation are two essential parts of the effective fine-tuning of MLLMs. From this perspective, we propose Multimodal Low-rank Adaptation (MokA), a multimodal-aware efficient fine-tuning strategy that takes multimodal characteristics into consideration. It compresses unimodal information with modality-specific parameters while explicitly enhancing cross-modal interaction, ensuring both unimodal and cross-modal adaptation. Extensive experiments cover three representative multimodal scenarios (audio-visual-text, visual-text, and speech-text) and multiple LLM backbones (LLaMA2, Qwen2, Qwen2.5-VL, etc.). Consistent improvements indicate the efficacy and versatility of the proposed method. Ablation studies and an efficiency evaluation are also conducted to fully assess our method. Overall, we believe MokA provides a more targeted solution for the efficient adaptation of MLLMs, paving the way for further exploration.
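The split between "unimodal adaptation" via modality-specific parameters and explicit "cross-modal interaction" can be sketched as a standalone toy layer. This is only an illustration of the idea under assumed shapes; the real MokA operates inside an MLLM and its exact parameterization may differ:

```python
def matvec(W, x):
    """Multiply a matrix W (list of rows) by a vector x."""
    return [sum(w * xi for w, xi in zip(row, x)) for row in W]

class MokALayer:
    """Hypothetical sketch: each modality gets its own low-rank
    down-projection A_m (unimodal adaptation), a small mixing matrix
    acts on a pooled cross-modal summary (cross-modal adaptation),
    and a shared up-projection B maps back to model dimension."""

    def __init__(self, A_per_modality, B, cross):
        self.A = A_per_modality  # modality name -> (r x d) matrix
        self.B = B               # shared (d x r) up-projection
        self.cross = cross       # (r x r) cross-modal mixing matrix

    def forward(self, tokens):
        # tokens: list of (modality, d-dim feature vector)
        z = [matvec(self.A[m], x) for m, x in tokens]    # unimodal compress
        pooled = [sum(col) / len(z) for col in zip(*z)]  # cross-modal pool
        mix = matvec(self.cross, pooled)
        return [matvec(self.B, [zi + mi for zi, mi in zip(zr, mix)])
                for zr in z]
```

With rank r = 1 and dimension d = 2, each token is compressed by its own modality's projection, enriched by the shared cross-modal summary, and expanded by the common B.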

AAAI Conference 2024 Conference Paper

Prompting Segmentation with Sound Is Generalizable Audio-Visual Source Localizer

  • Yaoting Wang
  • Weisong Liu
  • Guangyao Li
  • Jian Ding
  • Di Hu
  • Xi Li

Never having seen an object and heard its sound simultaneously, can a model still accurately localize the object's visual position from input audio? In this work, we concentrate on the Audio-Visual Localization and Segmentation tasks, but under the demanding zero-shot and few-shot scenarios. Unlike existing approaches that mostly employ the encoder-fusion-decoder paradigm to decode localization information from the fused audio-visual feature, we introduce the encoder-prompt-decoder paradigm, aiming to better handle data scarcity and varying data distributions with the help of the abundant knowledge of pre-trained models. Specifically, we first propose constructing a Semantic-aware Audio Prompt (SAP) to help the visual foundation model focus on sounding objects, while also encouraging the semantic gap between the visual and audio modalities to shrink. We then develop a Correlation Adapter (ColA) that keeps training effort minimal while preserving adequate knowledge of the visual foundation model. Equipped with these components, the new paradigm outperforms other fusion-based methods in extensive experiments under both the unseen-class and cross-dataset settings. We hope our work can further promote the study of the generalization of Audio-Visual Localization and Segmentation in practical application scenarios. Project page: https://github.com/GeWu-Lab/Generalizable-Audio-Visual-Segmentation

AAAI Conference 2024 Conference Paper

SphereDiffusion: Spherical Geometry-Aware Distortion Resilient Diffusion Model

  • Tao Wu
  • Xuewei Li
  • Zhongang Qi
  • Di Hu
  • Xintao Wang
  • Ying Shan
  • Xi Li

Controllable spherical panoramic image generation holds substantial application potential across a variety of domains. However, it remains a challenging task due to the inherent spherical distortion and geometry characteristics, resulting in low-quality content generation. In this paper, we introduce SphereDiffusion, a novel framework that addresses these unique challenges to better generate high-quality and precisely controllable spherical panoramic images. For the spherical distortion characteristic, we embed the semantics of the distorted object with text encoding, then explicitly construct the relationship via text-object correspondence to better use the pre-trained knowledge of planar images. Meanwhile, we employ a deformable technique to mitigate the semantic deviation in latent space caused by spherical distortion. For the spherical geometry characteristic, by virtue of spherical rotation invariance, we improve the data diversity and optimization objectives in the training process, enabling the model to better learn the spherical geometry characteristic. Furthermore, we enhance the denoising process of the diffusion model, enabling it to effectively use the learned geometric characteristic to ensure the boundary continuity of the generated images. With these techniques, experiments on the Structured3D dataset show that SphereDiffusion significantly improves the quality of controllable spherical image generation, with a relative FID reduction of around 35% on average.

TMLR Journal 2023 Journal Article

Supervised Knowledge May Hurt Novel Class Discovery Performance

  • Ziyun Li
  • Jona Otholt
  • Ben Dai
  • Di Hu
  • Christoph Meinel
  • Haojin Yang

Novel class discovery (NCD) aims to infer novel categories in an unlabeled dataset by leveraging prior knowledge of a labeled set comprising disjoint but related classes. Given that most existing literature focuses primarily on utilizing supervised knowledge from a labeled set at the methodology level, this paper considers the question: is supervised knowledge always helpful at different levels of semantic relevance? To proceed, we first establish a novel metric, so-called transfer leakage, to measure the semantic similarity between labeled and unlabeled datasets. To show the validity of the proposed metric, we build a large-scale benchmark with various degrees of semantic similarity between labeled and unlabeled datasets on ImageNet by leveraging its hierarchical class structure. Results on this benchmark show that transfer leakage is in line with the hierarchical class structure, and that NCD performance is consistent with the semantic similarities measured by the proposed metric. Next, using transfer leakage, we conduct various empirical experiments at different levels of semantic similarity, showing that supervised knowledge may hurt NCD performance. Specifically, using supervised information from a low-similarity labeled set may lead to a suboptimal result compared to using pure self-supervised knowledge. These results reveal the inadequacy of the existing NCD literature, which usually assumes that supervised knowledge is beneficial. Finally, we develop a pseudo-version of transfer leakage as a practical reference for deciding whether supervised knowledge should be used in NCD. Its effectiveness is supported by our empirical studies, which show that the pseudo transfer leakage (with or without supervised knowledge) is consistent with the corresponding accuracy on various datasets.

AAAI Conference 2022 Conference Paper

SepFusion: Finding Optimal Fusion Structures for Visual Sound Separation

  • Dongzhan Zhou
  • Xinchi Zhou
  • Di Hu
  • Hang Zhou
  • Lei Bai
  • Ziwei Liu
  • Wanli Ouyang

Multiple modalities can provide rich semantic information, and exploiting such information normally leads to better performance than the single-modality counterpart. However, it is not easy to devise an effective cross-modal fusion structure due to variations in feature dimensions and semantics, especially when the inputs come from different sensors, as in the field of audio-visual learning. In this work, we propose SepFusion, a novel framework that can smoothly produce optimal fusion structures for visual sound separation. The framework is composed of two components, namely the model generator and the evaluator. To construct the generator, we devise a lightweight architecture space that can adapt to different input modalities, so that we can easily obtain audio-visual fusion structures according to our demands. For the evaluator, we adopt the idea of neural architecture search to select superior networks effectively. This automatic process significantly saves human effort while achieving competitive performance. Moreover, since SepFusion provides a series of strong models, the model family can serve broader applications, such as further promoting performance via model assembly or providing suitable architectures for the separation of certain instrument classes. These potential applications further enhance the competitiveness of our approach.

AAAI Conference 2022 Conference Paper

Visual Sound Localization in the Wild by Cross-Modal Interference Erasing

  • Xian Liu
  • Rui Qian
  • Hang Zhou
  • Di Hu
  • Weiyao Lin
  • Ziwei Liu
  • Bolei Zhou
  • Xiaowei Zhou

The task of audio-visual sound source localization has been well studied under constrained scenes, where the audio recordings are clean. However, in real-world scenarios, audio is usually contaminated by off-screen sound and background noise, which interferes with identifying desired sources and building visual-sound connections, making previous approaches inapplicable. In this work, we propose the Interference Eraser (IEr) framework, which tackles the problem of audio-visual sound source localization in the wild. The key idea is to eliminate the interference by redefining and carving discriminative audio representations. Specifically, we observe that the previous practice of learning only a single audio representation is insufficient due to the additive nature of audio signals. We thus extend the audio representation with our Audio Instance Identifier module, which clearly distinguishes sounding instances when audio signals of different volumes are unevenly mixed. We then erase the influence of audible but off-screen sounds and silent but visible objects with a Cross-modal Referrer module via cross-modality distillation. Quantitative and qualitative evaluations demonstrate that our framework achieves superior results on sound localization tasks, especially in real-world scenarios.

AAAI Conference 2021 Conference Paper

Temporal Relational Modeling with Self-Supervision for Action Segmentation

  • Dong Wang
  • Di Hu
  • Xingjian Li
  • Dejing Dou

Temporal relational modeling in video is essential for human action understanding, such as action recognition and action segmentation. Although Graph Convolution Networks (GCNs) have shown promising advantages in relation reasoning on many tasks, it is still a challenge to apply graph convolution networks to long video sequences effectively, mainly because the large number of nodes (i.e., video frames) makes it hard for GCNs to capture and model temporal relations in videos. To tackle this problem, in this paper we introduce an effective GCN module, the Dilated Temporal Graph Reasoning Module (DTGRM), designed to model temporal relations and dependencies between video frames at various time spans. In particular, we capture and model temporal relations by constructing multi-level dilated temporal graphs whose nodes represent frames from different moments in the video. Moreover, to enhance the temporal reasoning ability of the proposed model, an auxiliary self-supervised task is proposed to encourage the dilated temporal graph reasoning module to find and correct wrong temporal relations in videos. Our DTGRM model outperforms state-of-the-art action segmentation models on three challenging datasets: 50Salads, Georgia Tech Egocentric Activities (GTEA), and the Breakfast dataset. The code is available at https://github.com/redwang/DTGRM.
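The "multi-level dilated temporal graphs" described above can be sketched in a few lines: at each dilation level, every frame is linked to the frame a fixed stride away, so larger dilations cover longer time spans with few edges per node. This toy builder is an illustration only (the paper's DTGRM graphs are richer, and the function name is made up):

```python
def dilated_temporal_edges(n_frames, dilations=(1, 2, 4)):
    """Build one edge list per dilation level: at dilation d, frame t
    is connected to frame t + d, forming a multi-level temporal graph
    over the video's frames."""
    return {d: [(t, t + d) for t in range(n_frames - d)]
            for d in dilations}
```

A GCN layer run on each level's edge list, with the results combined across levels, would then aggregate information from both adjacent frames and frames several steps apart.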

NeurIPS Conference 2020 Conference Paper

Discriminative Sounding Objects Localization via Self-supervised Audiovisual Matching

  • Di Hu
  • Rui Qian
  • Minyue Jiang
  • Xiao Tan
  • Shilei Wen
  • Errui Ding
  • Weiyao Lin
  • Dejing Dou

Discriminatively localizing sounding objects in cocktail-party, i.e., mixed-sound, scenes is commonplace for humans but still challenging for machines. In this paper, we propose a two-stage learning framework to perform self-supervised class-aware sounding object localization. First, we propose to learn robust object representations by aggregating candidate sound localization results in single-source scenes. Then, class-aware object localization maps are generated in cocktail-party scenarios by referring to the pre-learned object knowledge, and the sounding objects are accordingly selected by matching audio and visual object category distributions, where audiovisual consistency is viewed as the self-supervised signal. Experimental results on both realistic and synthesized cocktail-party videos demonstrate that our model is superior in filtering out silent objects and pointing out the locations of sounding objects of different classes. Code is available at https://github.com/DTaoo/Discriminative-Sounding-Objects-Localization.
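The selection step, matching audio and visual category distributions to filter out silent objects, can be caricatured as follows. This is a deliberately crude stand-in (names and threshold are invented; the paper's audiovisual-consistency objective is richer than a hard cutoff):

```python
def select_sounding_objects(audio_dist, visual_maps, thresh=0.1):
    """Keep a class's visual localization map only when the audio
    category distribution assigns that class non-trivial probability,
    so silent-but-visible objects are filtered out."""
    return {cls: vmap for cls, vmap in visual_maps.items()
            if audio_dist.get(cls, 0.0) >= thresh}
```

Given a guitar-plus-cat frame where only the guitar is audible, the cat's localization map would be dropped while the guitar's is kept.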

AAAI Conference 2017 Conference Paper

Large Graph Hashing with Spectral Rotation

  • Xuelong Li
  • Di Hu
  • Feiping Nie

Faced with today's demands for processing huge amounts of data, hashing techniques have attracted much attention due to their efficient storage and search ability. Among these techniques, those based on spectral graphs show remarkable performance, as they can embed the data on a low-dimensional manifold and maintain the neighborhood structure via a non-linear spectral eigenmap. However, the real-valued spectral solution of such methods may deviate from the discrete solution. The common practice is simply to perform a rounding operation to obtain the final binary codes, which can break the constraints and even degrade the result. In this paper, we propose to apply a so-called spectral rotation technique to the spectral hashing objective, which transforms the candidate solution into a new one that better approximates the discrete solution. Moreover, the binary codes are obtained from the modified solution by minimizing the Euclidean distance, which yields more semantic correlation within the manifold, where the constraints on the codes always hold. We provide an efficient alternating algorithm to solve the above problems, and a manifold-learning perspective motivating the proposed method is also presented. Extensive experiments are conducted on three large-scale benchmark datasets, and the results show that our method outperforms state-of-the-art hashing methods, especially the spectral graph ones.
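The core idea, rotating the real-valued spectral embedding so it sits closer to its own binary rounding before taking signs, can be illustrated in two dimensions, where the rotation reduces to a single angle that a brute-force scan can find. The paper solves the general high-dimensional case with an alternating algorithm; this 2-D toy (all names invented) only shows why the rotation helps:

```python
import math

def rotate(Y, theta):
    """Apply a 2-D rotation by angle theta to each row of Y."""
    c, s = math.cos(theta), math.sin(theta)
    return [[c * y0 - s * y1, s * y0 + c * y1] for y0, y1 in Y]

def sgn(v):
    return 1.0 if v >= 0 else -1.0

def spectral_rotation_2d(Y, n_angles=720):
    """Scan angles for the rotation R minimizing ||sign(YR) - YR||_F^2,
    i.e. the one bringing the spectral embedding closest to binary
    codes, then read off the codes as the signs of the rotated rows."""
    best_err, best_theta = float("inf"), 0.0
    for k in range(n_angles):
        theta = 2 * math.pi * k / n_angles
        Z = rotate(Y, theta)
        err = sum((sgn(z) - z) ** 2 for row in Z for z in row)
        if err < best_err:
            best_err, best_theta = err, theta
    B = [[sgn(z) for z in row] for row in rotate(Y, best_theta)]
    return B, best_theta, best_err
```

If the embedding is a rotated copy of points near the hypercube corners, the scan recovers (up to symmetry) the inverse rotation, and the rounding error after rotation is far smaller than naive rounding of the original embedding.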