Arrow Research search

Author name cluster

Jiahao Li

Possible papers associated with this exact author name in Arrow. This page groups case-insensitive exact name matches and is not a full identity disambiguation profile.

24 papers
2 author rows

Possible papers (24)

AAAI Conference 2026 Conference Paper

ReCAD: Reinforcement Learning Enhanced Parametric CAD Model Generation with Vision-Language Models

  • Jiahao Li
  • Yusheng Luo
  • Yunzhong Lou
  • Xiangdong Zhou

We present ReCAD, a reinforcement learning (RL) framework that bootstraps pretrained large models (PLMs) to generate precise parametric computer-aided design (CAD) models from multimodal inputs by leveraging their inherent generative capabilities. Given access to only simple functional interfaces (e.g., point coordinates), our approach enables the emergence of complex CAD operations (e.g., pattern replication and mirroring). This stands in contrast to previous methods, which typically rely on knowledge injected through supervised fine-tuning (SFT), offer limited support for editability, and fail to exploit the strong generative priors of PLMs. Specifically, the ReCAD framework begins by fine-tuning vision-language models (VLMs) to equip them with basic CAD model generation capabilities, where we rewrite CAD scripts into parameterized code that is leveraged to generate accurate textual descriptions for supervision. Then, we propose a novel RL strategy that incorporates parameterized code as guidance to enhance the model's reasoning on challenging questions. Furthermore, we employ a hierarchical primitive learning process to progressively teach structured and compositional skills under a unified reward function that ensures both geometric accuracy and semantic fidelity. ReCAD sets a new state-of-the-art in both text-to-CAD and image-to-CAD tasks, significantly improving geometric accuracy across in-distribution and out-of-distribution settings. In the image-to-CAD task, for instance, it reduces the mean Chamfer Distance from 73.47 to 29.61 (in-distribution) and from 272.06 to 80.23 (out-of-distribution), outperforming existing baselines by a substantial margin.
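
As an illustration of the unified reward described above, here is a minimal sketch, not the paper's implementation: a geometric term based on Chamfer distance combined with a semantic-fidelity score. The weights and the semantic_score input are assumptions.

import torch

def chamfer_distance(p: torch.Tensor, q: torch.Tensor) -> torch.Tensor:
    # p: (N, 3), q: (M, 3) point clouds sampled from the generated and reference CAD models.
    d = torch.cdist(p, q)                               # (N, M) pairwise distances
    return d.min(dim=1).values.mean() + d.min(dim=0).values.mean()

def cad_reward(p, q, semantic_score, w_geo=1.0, w_sem=0.5):
    # Unified reward: geometric accuracy (negative Chamfer distance) plus a
    # semantic-fidelity score in [0, 1], e.g. a VLM judgment of prompt match.
    # The weights w_geo and w_sem are illustrative, not values from the paper.
    return -w_geo * chamfer_distance(p, q) + w_sem * semantic_score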

AAAI Conference 2026 Conference Paper

SparseRM: A Lightweight Preference Modeling with Sparse Autoencoder

  • Dengcan Liu
  • Jiahao Li
  • Zheren Fu
  • Yi Tu
  • Jiajun Li
  • Zhendong Mao
  • Yongdong Zhang

Reward models (RMs) are a core component in the post-training of large language models (LLMs), serving as proxies for human preference evaluation and guiding model alignment. However, training reliable RMs under limited resources remains challenging due to the reliance on large-scale preference annotations and the high cost of fine-tuning LLMs. To address this, we propose SparseRM, which leverages Sparse Autoencoder (SAE) to extract preference-relevant information encoded in model representations, enabling the construction of a lightweight and interpretable reward model. SparseRM first employs SAE to decompose LLM representations into interpretable directions that capture preference-relevant features. The representations are then projected onto these directions to compute alignment scores, which quantify the strength of each preference feature in the representations. A simple reward head aggregates these scores to predict preference scores. Experiments on three preference modeling tasks show that SparseRM achieves superior performance over most mainstream RMs while using less than 1% of trainable parameters. Moreover, it integrates seamlessly into downstream alignment pipelines, highlighting its potential for efficient alignment.
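
A minimal PyTorch sketch of the idea, assuming precomputed SAE decoder directions and a pooled LLM hidden state per response; shapes and names are hypothetical.

import torch
import torch.nn as nn

class SparseRewardHead(nn.Module):
    def __init__(self, sae_directions: torch.Tensor):
        # sae_directions: (num_features, hidden_dim), preference-relevant SAE decoder directions.
        super().__init__()
        self.register_buffer("directions", sae_directions)
        self.head = nn.Linear(sae_directions.shape[0], 1)  # the only trainable part

    def forward(self, hidden: torch.Tensor) -> torch.Tensor:
        # hidden: (batch, hidden_dim) pooled LLM representation of a response.
        scores = hidden @ self.directions.T      # alignment score per preference direction
        return self.head(scores).squeeze(-1)     # scalar preference score per response

Only the small reward head is trained in this sketch, which is in the spirit of the abstract's claim of using under 1% of trainable parameters.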

AAAI Conference 2026 Conference Paper

Target Refocusing via Attention Redistribution for Open-Vocabulary Semantic Segmentation: An Explainability Perspective

  • Jiahao Li
  • Yang Lu
  • Yachao Zhang
  • Yong Xie
  • Fangyong Wang
  • Yuan Xie
  • Yanyun Qu

Open-vocabulary semantic segmentation (OVSS) employs pixel-level vision-language alignment to associate category-related prompts with corresponding pixels. A key challenge is enhancing the multimodal dense prediction capability, specifically this pixel-level multimodal alignment. Although existing methods achieve promising results by leveraging CLIP's vision-language alignment, they rarely investigate the performance boundaries of CLIP for dense prediction from the perspective of its interpretability mechanisms. In this work, we systematically investigate CLIP's internal mechanisms and identify a critical phenomenon: analogous to human distraction, CLIP diverts significant attention resources from target regions to irrelevant tokens. Our analysis reveals that these tokens arise from dimension-specific over-activation; filtering them enhances CLIP's dense prediction performance. Consequently, we propose Refocusing CLIP (RF-CLIP), a training-free approach that emulates human distraction-refocusing behavior to redirect attention from distraction tokens back to target regions, thereby refining CLIP's multimodal alignment granularity. Our method achieves SOTA performance on eight benchmarks while maintaining high inference efficiency.
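
A rough, training-free sketch of the refocusing idea; the quantile heuristic is an assumption, not the paper's exact criterion for detecting over-activated tokens.

import torch

def refocus_attention(attn: torch.Tensor, token_feats: torch.Tensor, q: float = 0.99) -> torch.Tensor:
    # attn: (num_tokens, num_tokens) attention weights; token_feats: (num_tokens, dim).
    # Flag "distraction" tokens via dimension-specific over-activation, zero out
    # attention toward them, and renormalize so mass flows back to the remaining tokens.
    over_activated = (token_feats.abs() > token_feats.abs().quantile(q)).any(dim=-1)
    attn = attn.masked_fill(over_activated.unsqueeze(0), 0.0)
    return attn / attn.sum(dim=-1, keepdim=True).clamp_min(1e-8)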

IROS Conference 2025 Conference Paper

A Soft Active Surface Gripper for Safe In-Hand Manipulation of Fragile Objects

  • Sheng Xiang
  • Jiahao Li
  • Yinqi Zhang
  • Zhong Wei
  • Jia Liu
  • Yang Yang 0002

This paper introduces a soft active surface gripper designed to manipulate fragile objects safely. The gripper consists of two fingers, each equipped with two compliant pneumatic actuators and a soft active surface. The gripper utilizes an elastic belt as its soft active surface, which is driven by a motor, while the opening angle of the belt is controlled by the pneumatic actuators. The novel design allows for the passive deformation of both the soft active surface and the compliant pneumatic actuator, enabling adaptation to various object shapes and demonstrating superior handling capabilities for delicate items. By synchronizing the opening and closing of the pneumatic fingers with the conveying motion of the active surface, the active surface gripper realizes three degrees of freedom (DOF) for in-plane manipulation, specifically two translational movements and one rotational movement. A prototype gripper has been designed and fabricated for in-plane manipulation experiments with fragile objects, including strawberries, miniature cupcakes, and pears. Experimental results demonstrate that the gripper can execute in-plane in-hand manipulation of fragile objects with varying geometries and dimensions while maintaining secure and robust handling, preventing object slippage and preserving surface integrity without causing damage.

AAAI Conference 2025 Conference Paper

BSDB-Net: Band-Split Dual-Branch Network with Selective State Spaces Mechanism for Monaural Speech Enhancement

  • Cunhang Fan
  • Enrui Liu
  • Andong Li
  • Jianhua Tao
  • Jian Zhou
  • Jiahao Li
  • Chengshi Zheng
  • Zhao Lv

Although complex spectrum-based speech enhancement (SE) methods have achieved significant performance, coupling amplitude and phase can lead to a compensation effect, where amplitude information is sacrificed to compensate for the phase, which is harmful to SE. In addition, many modules are stacked to further improve SE performance, resulting in increased model complexity that limits the application of SE. To address these problems, we propose a dual-path network based on compressed frequency using Mamba. First, we extract amplitude and phase information through parallel dual branches. This approach leverages structured complex spectra to implicitly capture phase information and resolves the compensation effect by decoupling amplitude and phase, and the network incorporates an interaction module to suppress unnecessary parts and recover missing components from the other branch. Second, to reduce network complexity, the network introduces a band-split strategy to compress the frequency dimension. To further reduce complexity while maintaining good performance, we designed a Mamba-based module that models the time and frequency dimensions under linear complexity. Finally, compared to baselines, our model achieves an average 8.3 times reduction in computational complexity while maintaining superior performance. Furthermore, it achieves a 25 times reduction in complexity compared to transformer-based models.
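
To make the band-split step concrete, a small sketch under assumed tensor layout and band size: groups of adjacent frequency bins are folded into the channel axis so that later layers operate on a compressed frequency dimension.

import torch

def band_split(spec: torch.Tensor, band_size: int) -> torch.Tensor:
    # spec: (batch, channels, freq, time) spectrogram features.
    # Folding band_size adjacent bins into channels shortens the frequency axis,
    # which is where most of the complexity reduction comes from.
    b, c, f, t = spec.shape
    assert f % band_size == 0, "frequency bins must divide evenly into bands"
    bands = spec.view(b, c, f // band_size, band_size, t)                      # (B, C, F', S, T)
    return bands.permute(0, 1, 3, 2, 4).reshape(b, c * band_size, f // band_size, t)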

IJCAI Conference 2025 Conference Paper

Conditional Causal Representation Learning for Heterogeneous Single-cell RNA Data Integration and Prediction

  • Jiayi Dong
  • Jiahao Li
  • Fei Wang

Single-cell sequencing technology provides deep insights into gene activity at the individual cell level, facilitating the study of gene regulatory mechanisms. However, observed gene expression is often influenced by confounding factors such as batch effects, perturbations, and spatial position, which obscure the true gene regulatory network that governs the cell's intrinsic state. To address these challenges, we propose scConCRL, a novel conditional causal representation learning framework designed to extract the true gene regulatory relationships independent of confounding information. By considering both fine-grained molecular gene variables and coarse-grained latent domain variables, scConCRL not only uncovers the intrinsic biological signals but also models the complex relationships between these variables. This dual function enables the separation of genuine cellular states from domain information, providing valuable insights for downstream analyses and biological discovery. We demonstrate the effectiveness of our model on multi-domain datasets from different platforms and perturbation conditions, showing its ability to accurately disentangle confounding influences and discover novel gene relationships. Extensive comparisons across various scenarios illustrate the superior performance of scConCRL in several tasks compared to existing methods.

NeurIPS Conference 2025 Conference Paper

Deep Video Discovery: Agentic Search with Tool Use for Long-form Video Understanding

  • Xiaoyi Zhang
  • Zhaoyang Jia
  • Zongyu Guo
  • Jiahao Li
  • Bin Li
  • Houqiang Li
  • Yan Lu

Long-form video understanding presents significant challenges due to extensive temporal-spatial complexity and the difficulty of question answering under such extended contexts. While Large Language Models (LLMs) have demonstrated considerable advancements in video analysis capabilities and long context handling, they continue to exhibit limitations when processing information-dense hour-long videos. To overcome such limitations, we propose the $\textbf{D}eep \ \textbf{V}ideo \ \textbf{D}iscovery \ (\textbf{DVD})$ agent to leverage an $\textit{agentic search}$ strategy over segmented video clips. Unlike previous video agents that manually design a rigid workflow, our approach emphasizes the autonomous nature of agents. By providing a set of search-centric tools on a multi-granular video database, our DVD agent leverages the advanced reasoning capability of the LLM to plan on its current observation state and strategically selects tools to orchestrate an adaptive workflow for different queries in light of the gathered information. We perform comprehensive evaluations on multiple long video understanding benchmarks that demonstrate our advantage. Our DVD agent achieves state-of-the-art performance on the challenging LVBench dataset, reaching an accuracy of $\textbf{74.2\%}$, which substantially surpasses all prior works, and further improves to $\textbf{76.0\%}$ with transcripts.
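
The agentic search loop might look roughly like the sketch below; the planner interface (llm.plan, llm.answer), the action structure, and the tool names are hypothetical stand-ins, not the authors' API.

def answer_video_question(question, tools, llm, max_steps=10):
    # tools: dict mapping tool names (e.g., clip-level search, frame inspection,
    # transcript lookup over a multi-granular video database) to callables.
    observations = []
    for _ in range(max_steps):
        # The LLM plans on its current observation state and picks the next tool call.
        action = llm.plan(question=question, observations=observations, tools=list(tools))
        if action.name == "final_answer":
            return action.arguments["answer"]
        result = tools[action.name](**action.arguments)
        observations.append((action.name, action.arguments, result))
    return llm.answer(question=question, observations=observations)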

ICRA Conference 2025 Conference Paper

FitnessAgent: A Unified Agent Framework for Open-Set and Personalized Fitness Evaluation

  • Zhenhui Tang
  • Jiahao Li
  • Ping Guo
  • Bowen Tian
  • Qingjun Xing
  • Xuyang Xing
  • Peng Wang 0099

Robotic systems face challenges in performing open-set and personalized fitness evaluations, especially when adapting to new exercises and individual user needs. This paper introduces FitnessAgent, a unified agent framework designed to address these challenges. Unlike traditional systems that rely on pre-trained neural networks or fixed rule-based criteria, FitnessAgent can assess any exercise without prior training, adapting evaluation metrics based on expert knowledge and user-specific requirements. The system breaks down fitness evaluation tasks into combinations of metrics, each calculated using measurable operators such as angles, distances, and positions. By leveraging a set of primitive, exercise-agnostic operators, a large language model (LLM)-based planner dynamically selects and combines these operators for each task. The open-set capability of FitnessAgent is validated through experiments on both the widely-used Functional Movement Screen dataset and a newly collected isometric pose dataset. Results highlight the system's flexibility in handling new movements and its ability to adapt to personalized evaluation criteria without the need for code or algorithm modifications. FitnessAgent offers a scalable and personalized solution for fitness evaluation, making it well-suited for robotic applications that require adaptability to diverse user needs.
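
One of the primitive, exercise-agnostic operators mentioned above could look like the following sketch; the keypoint convention (three 3D joint positions) is an assumption.

import numpy as np

def joint_angle(a: np.ndarray, b: np.ndarray, c: np.ndarray) -> float:
    # Angle in degrees at joint b formed by keypoints a-b-c (e.g., hip-knee-ankle).
    # An LLM-based planner would compose primitives like this into per-exercise metrics.
    u, v = a - b, c - b
    cos = np.dot(u, v) / (np.linalg.norm(u) * np.linalg.norm(v) + 1e-8)
    return float(np.degrees(np.arccos(np.clip(cos, -1.0, 1.0))))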

AAAI Conference 2025 Conference Paper

Locate Anything on Earth: Advancing Open-Vocabulary Object Detection for Remote Sensing Community

  • Jiancheng Pan
  • Yanxing Liu
  • Yuqian Fu
  • Muyuan Ma
  • Jiahao Li
  • Danda Pani Paudel
  • Luc Van Gool
  • Xiaomeng Huang

Object detection, particularly open-vocabulary object detection, plays a crucial role in Earth sciences, such as environmental monitoring, natural disaster assessment, and land-use planning. However, existing open-vocabulary detectors, primarily trained on natural-world images, struggle to generalize to remote sensing images due to a significant data domain gap. Thus, this paper aims to advance the development of open-vocabulary object detection in the remote sensing community. To achieve this, we first reformulate the task as Locate Anything on Earth (LAE) with the goal of detecting any novel concepts on Earth. We then develop the LAE-Label Engine, which collects, auto-annotates, and unifies up to 10 remote sensing datasets, creating LAE-1M, the first large-scale remote sensing object detection dataset with broad category coverage. Using LAE-1M, we further propose and train the novel LAE-DINO Model, the first open-vocabulary foundation object detector for the LAE task, featuring Dynamic Vocabulary Construction (DVC) and Visual-Guided Text Prompt Learning (VisGT) modules. DVC dynamically constructs vocabulary for each training batch, while VisGT maps visual features to semantic space, enhancing text features. We conduct comprehensive experiments on the established remote sensing benchmarks DIOR and DOTAv2.0, as well as our newly introduced 80-class LAE-80C benchmark. Results demonstrate the advantages of the LAE-1M dataset and the effectiveness of the LAE-DINO method.

AAAI Conference 2025 Conference Paper

MaskViM: Domain Generalized Semantic Segmentation with State Space Models

  • Jiahao Li
  • Yang Lu
  • Yuan Xie
  • Yanyun Qu

Domain Generalized Semantic Segmentation (DGSS) aims to utilize segmentation models trained on known source domains to make predictions on unknown target domains. Currently, there are two dominant network architectures: one based on Convolutional Neural Networks (CNNs) and the other based on Vision Transformers (ViTs). However, both CNN-based and ViT-based DGSS methods face challenges: the former lacks a global receptive field, while the latter has higher computational demands. Drawing inspiration from State Space Models (SSMs), which not only possess a global receptive field but also maintain linear complexity, we propose an SSM-based method for DGSS. In this work, we first elucidate why masking makes sense in SSM-based DGSS and propose our mask learning mechanism. Leveraging this mechanism, we present our Mask Vision Mamba network (MaskViM), a model for SSM-based DGSS, and design our mask loss to optimize MaskViM. Our method achieves superior performance on four diverse DGSS settings, which demonstrates the effectiveness of our method.

JBHI Journal 2025 Journal Article

MSHGCL: Multi-Scale Hierarchical Graph Contrastive Learning for Drug-Drug Interactions

  • Daohui Ge
  • Xueyan Song
  • Guangshun Zhang
  • Jiaying Yan
  • Jiahao Li
  • Jin-Xing Liu

Drug–drug interactions (DDIs) refer to changes in efficacy when two or more drugs are taken at the same time. A reasonable DDI can enhance efficacy or reduce adverse drug reactions; otherwise, it may lead to adverse events. Recent DDI prediction methods use various graph neural networks to extract drug substructures, converting DDI prediction into predicting the relationship between the two drugs' substructures, and achieve good performance. However, these methods do not pay attention to the relationship between drug substructures extracted from the same drug. Therefore, we propose MSHGCL, which constrains the relationship between drug substructures through a multi-scale hierarchical graph contrastive learning module. Specifically, the intra-layer contrastive learning module constrains the relationship between drug substructures of the same scale, and the inter-layer contrastive learning module further constrains the relationship between drug substructures of adjacent layers. We evaluate MSHGCL on two real-world datasets. Experimental results show that the proposed MSHGCL method outperforms the most advanced DDI prediction methods. The source code is available at https://github.com/N-drinking/MSHGCL.git.

NeurIPS Conference 2025 Conference Paper

Omnidirectional 3D Scene Reconstruction from Single Image

  • Ren Yang
  • Jiahao Li
  • Yan Lu

Reconstruction of 3D scenes from a single image is a crucial step towards enabling next-generation AI-powered immersive experiences. However, existing diffusion-based methods often struggle with reconstructing omnidirectional scenes due to geometric distortions and inconsistencies across the generated novel views, hindering accurate 3D recovery. To overcome this challenge, we propose Omni3D, an approach designed to enhance the geometric fidelity of diffusion-generated views for robust omnidirectional reconstruction. Our method leverages priors from pose estimation techniques, such as MASt3R, to iteratively refine both the generated novel views and their estimated camera poses. Specifically, we minimize the 3D reprojection errors between paired views to optimize the generated images, and simultaneously, correct the pose estimation based on the refined views. This synergistic optimization process yields geometrically consistent views and accurate poses, which are then used to build an explicit 3D Gaussian Splatting representation capable of omnidirectional rendering. Experimental results validate the effectiveness of Omni3D, demonstrating significantly advanced 3D reconstruction quality in the omnidirectional space, compared to previous state-of-the-art methods. Project page: https://omni3d-neurips.github.io.

NeurIPS Conference 2025 Conference Paper

One-Step Diffusion-Based Image Compression with Semantic Distillation

  • Naifu Xue
  • Zhaoyang Jia
  • Jiahao Li
  • Bin Li
  • Yuan Zhang
  • Yan Lu

While recent diffusion-based generative image codecs have shown impressive performance, their iterative sampling process introduces unpleasant latency. In this work, we revisit the design of a diffusion-based codec and argue that multi-step sampling is not necessary for generative compression. Based on this insight, we propose OneDC, a One-step Diffusion-based generative image Codec, which integrates a latent compression module with a one-step diffusion generator. Recognizing the critical role of semantic guidance in one-step diffusion, we propose using the hyperprior as a semantic signal, overcoming the limitations of text prompts in representing complex visual content. To further enhance the semantic capability of the hyperprior, we introduce a semantic distillation mechanism that transfers knowledge from a pretrained generative tokenizer to the hyperprior codec. Additionally, we adopt a hybrid pixel- and latent-domain optimization to jointly enhance both reconstruction fidelity and perceptual realism. Extensive experiments demonstrate that OneDC achieves SOTA perceptual quality even with one-step generation, offering over 39% bitrate reduction and 20× faster decoding compared to prior multi-step diffusion-based codecs. Project: https://onedc-codec.github.io/

NeurIPS Conference 2025 Conference Paper

Safe RLHF-V: Safe Reinforcement Learning from Multi-modal Human Feedback

  • Jiaming Ji
  • Xinyu Chen
  • Rui Pan
  • Han Zhu
  • Jiahao Li
  • Donghai Hong
  • Boyuan Chen
  • Jiayi Zhou

Multimodal large language models (MLLMs) are essential for building general-purpose AI assistants; however, they pose increasing safety risks. How can we ensure safety alignment of MLLMs to prevent undesired behaviors? Going further, it is critical to explore how to fine-tune MLLMs to preserve capabilities while meeting safety constraints. Fundamentally, this challenge can be formulated as a min-max optimization problem. However, existing datasets have not yet disentangled single preference signals into explicit safety constraints, hindering systematic investigation in this direction. Moreover, it remains an open question whether such constraints can be effectively incorporated into the optimization process for multi-modal models. In this work, we present Safe RLHF-V, the first multimodal safety alignment framework. The framework consists of: (I) BeaverTails-V, the first open-source dataset featuring dual preference annotations for helpfulness and safety, supplemented with multi-level safety labels (minor, moderate, severe); (II) Beaver-Guard-V, a multi-level guardrail system to proactively defend against unsafe queries and adversarial attacks. Applying the guard model over five rounds of filtering and regeneration significantly enhances the precursor model's overall safety by an average of 40.9%. (III) Based on the dual preference data, we initiate the first exploration of multi-modal safety alignment within a constrained optimization framework. Experimental results demonstrate that Safe RLHF-V effectively improves both model helpfulness and safety. Specifically, Safe RLHF-V enhances model safety by 34.2% and helpfulness by 34.3%.
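
The min-max formulation mentioned above is, in its standard Lagrangian form (the symbols here are generic notation assumed for illustration, not necessarily the paper's):

$$\max_{\theta}\; J_R(\theta) \quad \text{s.t.} \quad J_C(\theta) \le d
\;\;\Longleftrightarrow\;\;
\min_{\lambda \ge 0}\, \max_{\theta}\; J_R(\theta) - \lambda\,\bigl(J_C(\theta) - d\bigr),$$

where $J_R(\theta)$ is the expected helpfulness reward, $J_C(\theta)$ the expected safety cost, $d$ the allowed cost budget, and $\lambda$ a Lagrange multiplier updated alongside the policy.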

AAAI Conference 2024 Conference Paper

Arbitrary-Scale Video Super-resolution Guided by Dynamic Context

  • Cong Huang
  • Jiahao Li
  • Lei Chu
  • Dong Liu
  • Yan Lu

We propose a Dynamic Context-Guided Upsampling (DCGU) module for video super-resolution (VSR) that leverages temporal context guidance to achieve efficient and effective arbitrary-scale VSR. While most VSR research focuses on backbone design, the importance of the upsampling part is often overlooked. Existing methods rely on pixelshuffle-based upsampling, which has limited capabilities in handling arbitrary upsampling scales. Recent attempts to replace pixelshuffle-based modules with implicit neural function-based and filter-based approaches suffer from slow inference speeds and limited representation capacity, respectively. To overcome these limitations, our DCGU module predicts non-local sampling locations and content-dependent filter weights, enabling efficient and effective arbitrary-scale VSR. Our proposed multi-granularity location search module efficiently identifies non-local sampling locations across the entire low-resolution grid, and the temporal bilateral filter modulation module integrates content information with the filter weight to enhance textural details. Extensive experiments demonstrate the superiority of our method in terms of performance and speed on arbitrary-scale VSR.

IJCAI Conference 2024 Conference Paper

Combinatorial Routing for Neural Trees

  • Jiahao Li
  • Ruichu Cai
  • Yuguang Yan

Neural trees benefit from the high-level representation of neural networks and the interpretability of decision trees. Therefore, the existing works on neural trees perform outstandingly on various tasks such as architecture search. However, these works require every router to provide only one successor for each sample, causing the predictions to be dominated by the elite branch and its derivative architectures. To break this branch dominance, we propose the combinatorial routing neural tree method, termed CombRo. Unlike the previous methods employing unicast routing, CombRo performs multicast schema in each iteration, allowing the features to be routed to any combination of successors at every non-leaf. The weights of each architecture are then evaluated accordingly. We update the weights by training the routing subnetwork, and the architecture with the top weight is selected in the final step. We compare CombRo with the existing algorithms on 3 public image datasets, demonstrating its superior performance in terms of accuracy. Visualization results further validate the effectiveness of the multicast routing schema. Code is available at https://github.com/JiahaoLi-gdut/CombRo.

AAAI Conference 2024 Conference Paper

Discriminative Forests Improve Generative Diversity for Generative Adversarial Networks

  • Junjie Chen
  • Jiahao Li
  • Chen Song
  • Bin Li
  • Qingcai Chen
  • Hongchang Gao
  • Wendy Hui Wang
  • Zenglin Xu

Improving the diversity of Artificial Intelligence Generated Content (AIGC) is one of the fundamental problems in the theory of generative models such as generative adversarial networks (GANs). Previous studies have demonstrated that the discriminator in GANs should have high capacity and robustness to achieve the diversity of generated data. However, a discriminator with high capacity tends to overfit and guide the generator toward a collapsed equilibrium. In this study, we propose a novel discriminative forest GAN, named Forest-GAN, which replaces the single discriminator with a discriminative forest to improve the capacity and robustness for modeling the statistics of real-world data distributions. A discriminative forest is composed of multiple independent discriminators built on bootstrapped data. We prove that a discriminative forest has a generalization error bound, which is determined by the strength of individual discriminators and the correlations among them. Hence, a discriminative forest can provide very large capacity without any risk of overfitting, which subsequently improves the generative diversity. With the discriminative forest framework, we significantly improved the performance of AutoGAN with a new record FID of 19.27 from 30.71 on STL10 and improved the performance of StyleGAN2-ADA with a new record FID of 6.87 from 9.22 on LSUN-cat.
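
A minimal sketch of the forest-of-discriminators idea; the factory function, tree count, and score averaging are placeholders, and the bootstrap resampling of real data would happen in the training loop (not shown).

import torch
import torch.nn as nn

class DiscriminativeForest(nn.Module):
    def __init__(self, make_discriminator, num_trees: int = 5):
        # Multiple independent discriminators, each intended to be trained on its
        # own bootstrap resample of the real data.
        super().__init__()
        self.trees = nn.ModuleList([make_discriminator() for _ in range(num_trees)])

    def forward(self, x: torch.Tensor) -> torch.Tensor:
        # Average the per-discriminator scores when scoring real or generated samples.
        return torch.stack([d(x) for d in self.trees], dim=0).mean(dim=0)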

ICLR Conference 2024 Conference Paper

DMV3D: Denoising Multi-view Diffusion Using 3D Large Reconstruction Model

  • Yinghao Xu 0001
  • Hao Tan 0002
  • Fujun Luan
  • Sai Bi
  • Peng Wang 0099
  • Jiahao Li
  • Zifan Shi
  • Kalyan Sunkavalli

We propose DMV3D, a novel 3D generation approach that uses a transformer-based 3D large reconstruction model to denoise multi-view diffusion. Our reconstruction model incorporates a triplane NeRF representation and, functioning as a denoiser, can denoise noisy multi-view images via 3D NeRF reconstruction and rendering, achieving single-stage 3D generation in the 2D diffusion denoising process. We train DMV3D on large-scale multi-view image datasets of extremely diverse objects using only image reconstruction losses, without accessing 3D assets. We demonstrate state-of-the-art results for the single-image reconstruction problem where probabilistic modeling of unseen object parts is required for generating diverse reconstructions with sharp textures. We also show high-quality text-to-3D generation results outperforming previous 3D diffusion models. Our project website is at: https://dmv3d.github.io/.

ICLR Conference 2024 Conference Paper

Instant3D: Fast Text-to-3D with Sparse-view Generation and Large Reconstruction Model

  • Jiahao Li
  • Hao Tan 0002
  • Kai Zhang 0045
  • Zexiang Xu
  • Fujun Luan
  • Yinghao Xu 0001
  • Yicong Hong
  • Kalyan Sunkavalli

Text-to-3D with diffusion models has achieved remarkable progress in recent years. However, existing methods either rely on score distillation-based optimization, which suffers from slow inference, low diversity, and Janus problems, or are feed-forward methods that generate low-quality results due to the scarcity of 3D training data. In this paper, we propose Instant3D, a novel method that generates high-quality and diverse 3D assets from text prompts in a feed-forward manner. We adopt a two-stage paradigm, which first generates a sparse set of four structured and consistent views from text in one shot with a fine-tuned 2D text-to-image diffusion model, and then directly regresses the NeRF from the generated images with a novel transformer-based sparse-view reconstructor. Through extensive experiments, we demonstrate that our method can generate diverse 3D assets of high visual quality within 20 seconds, which is two orders of magnitude faster than previous optimization-based methods that can take 1 to 10 hours. Our project webpage is: https://jiahao.ai/instant3d/.

NeurIPS Conference 2024 Conference Paper

Relationship Prompt Learning is Enough for Open-Vocabulary Semantic Segmentation

  • Jiahao Li
  • Yang Lu
  • Yuan Xie
  • Yanyun Qu

Open-vocabulary semantic segmentation (OVSS) aims to segment unseen classes without corresponding labels. Existing Vision-Language Model (VLM)-based methods leverage VLM's rich knowledge to enhance additional explicit segmentation-specific networks, yielding competitive results, but at the cost of extensive training. To reduce the cost, we attempt to enable VLM to directly produce the segmentation results without any segmentation-specific networks. Prompt learning offers a direct and parameter-efficient approach, yet it falls short in guiding VLM for pixel-level visual classification. Therefore, we propose the ${\bf R}$elationship ${\bf P}$rompt ${\bf M}$odule (${\bf RPM}$), which generates the relationship prompt that directs VLM to extract pixel-level semantic embeddings suitable for OVSS. Moreover, RPM integrates with VLM to construct the ${\bf R}$elationship ${\bf P}$rompt ${\bf N}$etwork (${\bf RPN}$), achieving OVSS without any segmentation-specific networks. RPN attains state-of-the-art performance with merely about ${\bf 3M}$ trainable parameters (2\% of total parameters).

IJCAI Conference 2023 Conference Paper

Spatially Covariant Lesion Segmentation

  • Hang Zhang
  • Rongguang Wang
  • Jinwei Zhang
  • Dongdong Liu
  • Chao Li
  • Jiahao Li

Compared to natural images, medical images usually exhibit stronger visual patterns, which adds flexibility and elasticity for resource-limited clinical applications when proper priors are injected into neural networks. In this paper, we propose a spatially covariant pixel-aligned classifier (SCP) to improve computational efficiency while maintaining or increasing accuracy for lesion segmentation. SCP relaxes the spatial invariance constraint imposed by convolutional operations and optimizes an underlying implicit function that maps image coordinates to network weights; the parameters of this function are obtained along with backbone network training and later used to generate network weights that capture spatially covariant contextual information. We demonstrate the effectiveness and efficiency of the proposed SCP using two lesion segmentation tasks from different imaging modalities: white matter hyperintensity segmentation in magnetic resonance imaging and liver tumor segmentation in contrast-enhanced abdominal computed tomography. The network using SCP achieves 23.8%, 64.9% and 74.7% reductions in GPU memory usage, FLOPs, and network size, respectively, with similar or better accuracy for lesion segmentation.
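
A rough sketch of the coordinate-to-weights idea, with layer sizes and the coordinate normalization as assumptions rather than the paper's design: a small MLP maps normalized pixel coordinates to per-location classifier weights, which are then applied to backbone features.

import torch
import torch.nn as nn

class SpatiallyCovariantHead(nn.Module):
    def __init__(self, feat_dim: int, num_classes: int, hidden: int = 64):
        super().__init__()
        # Implicit function: (y, x) coordinate -> flattened per-pixel classifier weights.
        self.mlp = nn.Sequential(
            nn.Linear(2, hidden), nn.ReLU(),
            nn.Linear(hidden, feat_dim * num_classes))
        self.feat_dim, self.num_classes = feat_dim, num_classes

    def forward(self, feats: torch.Tensor) -> torch.Tensor:
        # feats: (B, C, H, W) backbone features.
        b, c, h, w = feats.shape
        ys, xs = torch.meshgrid(
            torch.linspace(-1, 1, h), torch.linspace(-1, 1, w), indexing="ij")
        coords = torch.stack([ys, xs], dim=-1).reshape(-1, 2).to(feats)        # (H*W, 2)
        weight = self.mlp(coords).view(h * w, self.num_classes, c)             # per-pixel classifiers
        f = feats.reshape(b, c, h * w)
        logits = torch.einsum("bcp,pkc->bkp", f, weight)                        # apply per-location weights
        return logits.view(b, self.num_classes, h, w)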

IJCAI Conference 2023 Conference Paper

STS-GAN: Can We Synthesize Solid Texture with High Fidelity from Arbitrary 2D Exemplar?

  • Xin Zhao
  • Jifeng Guo
  • Lin Wang
  • Fanqi Li
  • Jiahao Li
  • Junteng Zheng
  • Bo Yang

Solid texture synthesis (STS), an effective way to extend a 2D exemplar to a 3D solid volume, exhibits advantages in computational photography. However, existing methods generally fail to accurately learn arbitrary textures, which may result in the failure to synthesize solid textures with high fidelity. In this paper, we propose a novel generative adversarial nets-based framework (STS-GAN) to extend the given 2D exemplar to arbitrary 3D solid textures. In STS-GAN, multi-scale 2D texture discriminators evaluate the similarity between the given 2D exemplar and slices from the generated 3D texture, encouraging the 3D texture generator to synthesize realistic solid textures. Finally, experiments demonstrate that the proposed method can generate high-fidelity solid textures with similar visual characteristics to the 2D exemplar.

NeurIPS Conference 2021 Conference Paper

Deep Contextual Video Compression

  • Jiahao Li
  • Bin Li
  • Yan Lu

Most of the existing neural video compression methods adopt the predictive coding framework, which first generates the predicted frame and then encodes its residue with the current frame. However, in terms of compression ratio, predictive coding is only a sub-optimal solution, as it uses a simple subtraction operation to remove the redundancy across frames. In this paper, we propose a deep contextual video compression framework to enable a paradigm shift from predictive coding to conditional coding. In particular, we try to answer the following questions: how to define, use, and learn the condition under a deep video compression framework. To tap the potential of conditional coding, we propose using the feature-domain context as the condition. This enables us to leverage the high-dimensional context to carry rich information to both the encoder and the decoder, which helps reconstruct the high-frequency contents for higher video quality. Our framework is also extensible, in which the condition can be flexibly designed. Experiments show that our method can significantly outperform the previous state-of-the-art (SOTA) deep video compression methods. When compared with x265 using the veryslow preset, we can achieve 26.0% bitrate saving for 1080P standard test videos.
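
The shift from residual to conditional coding can be sketched as follows; module interfaces here are hypothetical (in the residual paradigm one would instead encode the difference between the current frame and a predicted frame).

import torch
import torch.nn as nn

class ConditionalCoder(nn.Module):
    def __init__(self, context_net: nn.Module, encoder: nn.Module, decoder: nn.Module):
        # context_net extracts a feature-domain temporal context from the reference
        # frame; encoder and decoder both consume that context as their condition.
        super().__init__()
        self.context_net, self.encoder, self.decoder = context_net, encoder, decoder

    def forward(self, x_t: torch.Tensor, ref: torch.Tensor):
        ctx = self.context_net(ref)           # high-dimensional context, not a pixel-domain prediction
        latent = self.encoder(x_t, ctx)       # encode the current frame conditioned on the context
        x_hat = self.decoder(latent, ctx)     # the decoder reuses the same condition
        return x_hat, latent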