Arrow Research search

Author name cluster

Linlin Yang

Possible papers associated with this exact author name in Arrow. This page groups case-insensitive exact name matches and is not a full identity disambiguation profile.

12 papers
2 author rows

Possible papers

12

AAAI Conference 2026 Conference Paper

AnchorDS: Anchoring Dynamic Sources for Semantically Consistent Text-to-3D Generation

  • Jiayin Zhu
  • Linlin Yang
  • Yicong Li
  • Angela Yao

Optimization-based text-to-3D methods distill guidance from 2D generative models via Score Distillation Sampling (SDS), but implicitly treat this guidance as static. This work shows that ignoring source dynamics yields inconsistent trajectories that suppress or merge semantic cues, leading to "semantic over-smoothing" artifacts. As such, we reformulate text-to-3D optimization as mapping a *dynamically evolving source* distribution to a fixed target distribution. We cast the problem into a dual-conditioned latent space, conditioned on both the text prompt and the intermediately rendered image. Given this joint setup, we observe that the image condition naturally anchors the current source distribution. Building on this insight, we introduce AnchorDS, an improved score distillation mechanism that provides state-anchored guidance with image conditions and stabilizes generation. We further penalize erroneous source estimates and design a lightweight filtering strategy and fine-tuning strategy that refine the anchor with negligible overhead. AnchorDS produces finer-grained detail, more natural colours, and stronger semantic consistency, particularly for complex prompts, while maintaining efficiency. Extensive experiments show that our method surpasses previous methods in both quality and efficiency.
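
For readers new to score distillation, the standard SDS gradient this abstract builds on is commonly written as below; the image-conditioned variant is our hedged paraphrase of the "state-anchored" idea, not the paper's exact formulation (g, y, and I are our notation for the renderer, text prompt, and current rendering).

```latex
% Standard SDS: distill a 2D diffusion prior \epsilon_\phi into 3D
% parameters \theta, where x = g(\theta) is the rendering and y the prompt.
\nabla_\theta \mathcal{L}_{\mathrm{SDS}}
  = \mathbb{E}_{t,\epsilon}\left[ w(t)\,
    \bigl(\epsilon_\phi(x_t;\, y,\, t) - \epsilon\bigr)
    \tfrac{\partial x}{\partial \theta} \right]

% Anchored variant (our paraphrase): also condition on the current
% rendering I, so the guidance tracks the evolving source distribution.
\nabla_\theta \mathcal{L}_{\mathrm{anchored}}
  \approx \mathbb{E}_{t,\epsilon}\left[ w(t)\,
    \bigl(\epsilon_\phi(x_t;\, y,\, I,\, t) - \epsilon\bigr)
    \tfrac{\partial x}{\partial \theta} \right]
```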

AAAI Conference 2026 Conference Paper

Efficiently Seeking Flat Minima for Better Generalization in Fine-Tuning Large Language Models and Beyond

  • Jiaxin Deng
  • Qingcheng Zhu
  • Junbiao Pang
  • Linlin Yang
  • Zhongqian Fu
  • Baochang Zhang

Little research explores the correlation between the expressive ability and the generalization ability of low-rank adaptation (LoRA). Sharpness-Aware Minimization (SAM) improves model generalization for both Convolutional Neural Networks (CNNs) and Transformers by encouraging convergence to locally flat minima. However, the connection between sharpness and generalization has not been fully explored for LoRA, due to the lack of tools to either empirically seek flat minima or develop theoretical methods. In this work, we propose Flat Minima LoRA (FMLoRA) and its efficient version, EFMLoRA, to seek flat minima for LoRA. Concretely, we theoretically demonstrate that perturbations in the full parameter space can be transferred to the low-rank subspace. This eliminates the potential interference introduced by perturbations across multiple matrices in the low-rank subspace. Our extensive experiments on large language models and vision-language models demonstrate that EFMLoRA achieves optimization efficiency comparable to that of LoRA while attaining comparable or even better performance. For example, on the GLUE dataset with RoBERTa-large, EFMLoRA outperforms LoRA and full fine-tuning by 1.0% and 0.5% on average, respectively. On vision-language models, e.g., Qwen-VL-Chat, it yields improvements of 1.5% and 1.0% on the SQA and VizWiz datasets, respectively. These results also verify that the generalization of LoRA is closely related to sharpness, which previous methods overlook.
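
To make the flat-minima idea concrete, here is a minimal PyTorch sketch of a SAM-style step restricted to the LoRA factors. It illustrates the general technique under our own assumptions (the `"lora_"` name filter and a single perturbation radius `rho`); it is not the paper's EFMLoRA algorithm.

```python
import torch

def sam_lora_step(model, loss_fn, batch, opt, rho=0.05):
    # SAM restricted to LoRA factors -- a sketch, not the paper's EFMLoRA.
    # Assumption: adapter parameters are named with a "lora_" prefix.
    lora_params = [p for n, p in model.named_parameters()
                   if "lora_" in n and p.requires_grad]
    # 1) gradient at the current point
    loss_fn(model(batch["x"]), batch["y"]).backward()
    grads = [p.grad.detach().clone() if p.grad is not None
             else torch.zeros_like(p) for p in lora_params]
    grad_norm = torch.norm(torch.stack([g.norm() for g in grads])) + 1e-12
    # 2) ascend to the worst-case point in a rho-ball, restricted to
    #    the low-rank subspace spanned by the LoRA factors
    with torch.no_grad():
        for p, g in zip(lora_params, grads):
            p.add_(rho * g / grad_norm)
    opt.zero_grad()
    # 3) gradient at the perturbed point drives the actual update
    loss_fn(model(batch["x"]), batch["y"]).backward()
    with torch.no_grad():
        for p, g in zip(lora_params, grads):
            p.sub_(rho * g / grad_norm)  # undo the perturbation
    opt.step()
    opt.zero_grad()
```

Plain SAM doubles the forward-backward cost over the full parameter set; restricting the perturbation to the low-rank factors is what keeps the overhead close to ordinary LoRA training.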

ICLR Conference 2025 Conference Paper

Efficient Low-Bit Quantization with Adaptive Scales for Multi-Task Co-Training

  • Boyu Liu
  • Haoyu Huang
  • Linlin Yang
  • Yanjing Li
  • Guodong Guo
  • Xianbin Cao 0001
  • Baochang Zhang 0001

Co-training can achieve parameter-efficient multi-task models but remains unexplored for quantization-aware training (QAT). Our investigation shows that directly introducing co-training into existing QAT methods results in significant performance degradation. Our experimental study identifies that the primary issue stems from activation quantization scales that are inadequate for the co-training framework. To address these mismatched scales, we propose Task-Specific Scales Quantization for Multi-Task Co-Training (TSQ-MTC). Specifically, a task-specific learnable multi-scale activation quantizer (TLMAQ) is incorporated to enrich the representational ability of shared features for different tasks. Additionally, we find that in the deeper layers of the Transformer model, the quantized network suffers from information distortion within the attention quantizer. A structure-based layer-by-layer distillation (SLLD) is then introduced to ensure that the quantized features effectively preserve the information from their full-precision counterparts. Our extensive experiments in two co-training scenarios demonstrate the effectiveness and versatility of TSQ-MTC. In particular, we achieve a 4-bit quantized low-level visual foundation model based on IPT that attains a PSNR comparable to the full-precision model while offering a 7.99× compression ratio in the ×4 super-resolution task on the Set5 benchmark.
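
As a rough illustration of the task-specific-scale idea, the sketch below implements an LSQ-style fake quantizer with one learnable scale per task. The real TLMAQ is multi-scale and considerably richer, so treat the class name and structure as our assumptions.

```python
import torch
import torch.nn as nn

class TaskScaledActQuant(nn.Module):
    # Hypothetical per-task activation quantizer (inspired by, but far
    # simpler than, the paper's TLMAQ).
    def __init__(self, num_tasks, n_bits=4):
        super().__init__()
        self.qmax = 2 ** n_bits - 1
        # one learnable scale per task: shared features get re-ranged
        # differently for each co-trained task
        self.scales = nn.Parameter(torch.ones(num_tasks))

    def forward(self, x, task_id):
        s = self.scales[task_id].abs() + 1e-8
        v = torch.clamp(x / s, 0, self.qmax)
        # straight-through rounding: quantize in the forward pass while
        # letting gradients reach both x and the task-specific scale
        v_q = v + (torch.round(v) - v).detach()
        return v_q * s
```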

ICML Conference 2025 Conference Paper

ExtPose: Robust and Coherent Pose Estimation by Extending ViTs

  • Rongyu Chen
  • Li'an Zhuo
  • Linlin Yang
  • Qi Wang 0148
  • Liefeng Bo
  • Bang Zhang
  • Angela Yao

Vision Transformers (ViTs) are remarkable at 3D pose estimation, yet they still face certain challenges. One issue is that the popular ViT architecture for pose estimation is limited to images and lacks temporal information. Another is that predictions often fail to maintain pixel alignment with the original images. To address these issues, we propose a systematic framework for 3D pose estimation, called ExtPose. ExtPose extends an image ViT to the challenging video setting by taking in additional 2D pose evidence and capturing temporal information in a fully attention-based manner. We use 2D human skeleton images to integrate structured 2D pose information. By sharing parameters and attending across modalities and frames, we enhance the consistency between 3D poses and 2D videos without introducing additional parameters. We achieve state-of-the-art (SOTA) performance on multiple human and hand pose estimation benchmarks, with substantial improvements over other ViT-based methods: 34.0mm (-23%) PA-MPJPE on 3DPW and 4.9mm (-18%) on FreiHAND.
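
The 2D-pose-as-image idea can be pictured in a few lines: rasterize keypoints into a skeleton image that the same image ViT can consume. This is our illustration only; the paper's exact rendering (colors, thickness, per-bone encoding) may differ.

```python
import numpy as np
import cv2  # opencv-python

def draw_skeleton_image(kpts, bones, hw=(224, 224)):
    # Rasterize 2D keypoints into an RGB image so a standard image ViT
    # can ingest pose evidence. kpts: (K, 2) pixel coordinates;
    # bones: list of (parent, child) keypoint index pairs.
    img = np.zeros((hw[0], hw[1], 3), dtype=np.uint8)
    for i, j in bones:
        p = tuple(int(v) for v in kpts[i])
        q = tuple(int(v) for v in kpts[j])
        cv2.line(img, p, q, color=(0, 255, 0), thickness=2)
    for x, y in kpts:
        cv2.circle(img, (int(x), int(y)), 3, (255, 0, 0), -1)
    return img
```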

ICLR Conference 2025 Conference Paper

Prompt as Knowledge Bank: Boost Vision-language model via Structural Representation for zero-shot medical detection

  • Yuguang Yang 0007
  • Tongfei Chen
  • Haoyu Huang
  • Linlin Yang
  • Chunyu Xie
  • Dawei Leng
  • Xianbin Cao 0001
  • Baochang Zhang 0001

Zero-shot medical detection can improve detection performance without relying on annotated medical images, even on top of a fine-tuned model, showing great clinical value. Recent studies achieve this with grounded vision-language models (GLIP) by using detailed disease descriptions as prompts for the target disease name during inference. However, these methods typically treat prompts as context equivalent to the target name, making it difficult to assign specific disease knowledge based on visual information, and leading to a coarse alignment between images and target descriptions. In this paper, we propose StructuralGLIP, which introduces an auxiliary branch to encode prompts into a latent knowledge bank layer by layer, enabling more context-aware and fine-grained alignment. Specifically, in each layer, we select highly similar features from both the image representation and the knowledge bank, forming structural representations that capture nuanced relationships between image patches and target descriptions. These features are then fused across modalities to further enhance detection performance. Extensive experiments demonstrate that StructuralGLIP achieves a +4.1% AP improvement over prior state-of-the-art methods across seven zero-shot medical detection benchmarks, and consistently improves fine-tuned models by +3.2% AP on endoscopy image datasets.
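
A toy sketch of one layer's bank selection and fusion, as we read the abstract; the function, the top-k rule, and the attention fusion are our simplifications, not the released StructuralGLIP code.

```python
import torch
import torch.nn.functional as F

def structural_select(img_tokens, bank, k=16):
    # img_tokens: (N, d) image patch features;
    # bank: (M, d) layer-wise encoded prompt entries.
    sim = F.normalize(img_tokens, dim=-1) @ F.normalize(bank, dim=-1).T  # (N, M)
    # keep the k bank entries most similar to any image patch
    _, top_idx = sim.max(dim=0).values.topk(k)
    selected = bank[top_idx]                          # (k, d) knowledge snippets
    # fuse the selected knowledge back into each patch by attention
    attn = (img_tokens @ selected.T).softmax(dim=-1)  # (N, k)
    return img_tokens + attn @ selected               # (N, d) fused features
```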

AAAI Conference 2024 Conference Paper

AQ-DETR: Low-Bit Quantized Detection Transformer with Auxiliary Queries

  • Runqi Wang
  • Huixin Sun
  • Linlin Yang
  • Shaohui Lin
  • Chuanjian Liu
  • Yan Gao
  • Yao Hu
  • Baochang Zhang

DEtection TRansformer (DETR)-based models have achieved remarkable performance. However, they come with a large computational cost, which significantly hinders their deployment on resource-limited devices. Prior art attempts to reduce the computational burden of DETR using low-bit quantization, but these methods suffer a severe performance drop under low-bit quantization of weights, activations, and attention. We observe that the number of matched queries and positive samples strongly affects the representation capacity of queries in DETR, and that quantizing the queries reduces this capacity further, leading to a severe performance drop. We introduce a new quantization strategy based on Auxiliary Queries for DETR (AQ-DETR), aiming to enhance the capacity of quantized queries. In addition, a layer-by-layer distillation is proposed to reduce the quantization error between quantized attention and its full-precision counterpart. Through extensive experiments on large-scale open datasets, we show that 4-bit quantized DETR and Deformable DETR models achieve performance comparable to their full-precision counterparts.
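
The layer-by-layer distillation term can be pictured as a per-layer feature match between the quantized student and its full-precision teacher; the sketch below is our simplification (plain MSE, equal layer weights), not AQ-DETR's exact loss.

```python
import torch.nn.functional as F

def layerwise_distill_loss(student_feats, teacher_feats):
    # Match each quantized layer's output (e.g., attention features) to
    # the frozen full-precision counterpart, summed over layers.
    return sum(F.mse_loss(s, t.detach())
               for s, t in zip(student_feats, teacher_feats))
```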

NeurIPS Conference 2024 Conference Paper

CLIP in Mirror: Disentangling text from visual images through reflection

  • Tiancheng Wang
  • Yuguang Yang
  • Linlin Yang
  • Shaohui Lin
  • Juan Zhang
  • Guodong Guo
  • Baochang Zhang

The CLIP network excels at various tasks but struggles with text-visual images, i.e., images that contain both text and visual objects; it risks confusing textual and visual representations. To address this issue, we propose MirrorCLIP, a zero-shot framework that disentangles the image features of CLIP by exploiting the difference in the mirror effect between visual objects and text in images. Specifically, MirrorCLIP takes both original and flipped images as inputs, comparing their features dimension-wise in the latent space to generate disentangling masks. With these masks, we further design filters to separate textual and visual factors more precisely, and then obtain disentangled representations. Qualitative experiments using stable diffusion models and class activation mapping (CAM) validate the effectiveness of our disentanglement. Moreover, MirrorCLIP reduces confusion when encountering text-visual images and achieves a substantial improvement in typographic defense, further demonstrating its superior disentanglement ability. Our code is available at https://github.com/tcwangbuaa/MirrorCLIP
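
The mirror trick is easy to sketch: rendered text is not flip-invariant while most visual content roughly is, so feature dimensions that change strongly under horizontal flipping are treated as text-related. The snippet assumes an openai/CLIP-style `encode_image` API, and the threshold rule is ours, not the paper's mask construction.

```python
import torch
import torchvision.transforms.functional as TF

@torch.no_grad()
def mirror_split(clip_model, x, thresh=0.2):
    # x: (1, 3, H, W) preprocessed image batch.
    f = clip_model.encode_image(x)
    f_flip = clip_model.encode_image(TF.hflip(x))
    rel_change = (f - f_flip).abs() / (f.abs() + 1e-8)  # per-dim sensitivity
    text_mask = (rel_change > thresh).float()           # flip-sensitive dims
    return f * (1.0 - text_mask), f * text_mask         # visual part, text part
```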

ICLR Conference 2023 Conference Paper

Improving Deep Regression with Ordinal Entropy

  • Shihao Zhang
  • Linlin Yang
  • Michael Bi Mi
  • Xiaoxu Zheng
  • Angela Yao

In computer vision, it is often observed that formulating regression problems as a classification task yields better performance. We investigate this curious phenomenon and provide a derivation to show that classification, with the cross-entropy loss, outperforms regression with a mean squared error loss in its ability to learn high-entropy feature representations. Based on the analysis, we propose an ordinal entropy loss to encourage higher-entropy feature spaces while maintaining ordinal relationships to improve the performance of regression tasks. Experiments on synthetic and real-world regression tasks demonstrate the importance and benefits of increasing entropy for regression.
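
To give the flavor of an ordinal, entropy-raising regularizer, the sketch below pushes sample features apart in proportion to how far apart their regression targets are; this is our paraphrase of the idea, not the paper's exact loss.

```python
import torch
import torch.nn.functional as F

def ordinal_entropy_sketch(feats, targets):
    # feats: (N, d) features; targets: (N,) float regression labels.
    f = F.normalize(feats, dim=-1)
    fdist = torch.cdist(f, f)                                # (N, N) feature distances
    tdist = torch.cdist(targets[:, None], targets[:, None])  # (N, N) label gaps
    w = tdist / (tdist.max() + 1e-8)                         # normalize to [0, 1]
    # minimizing this spreads features out, more so for distant targets,
    # so the layout stays ordinal while entropy increases
    return -(w * fdist).mean()
```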

NeurIPS Conference 2023 Conference Paper

Synthetic-to-Real Pose Estimation with Geometric Reconstruction

  • Qiuxia Lin
  • Kerui Gu
  • Linlin Yang
  • Angela Yao

Pose estimation is remarkably successful under supervised learning, but obtaining annotations, especially for new deployments, is costly and time-consuming. This work tackles adapting models trained on synthetic data to real-world target domains with only unlabelled data. A common approach is model fine-tuning with pseudo-labels from the target domain, yet many pseudo-labelling strategies cannot provide sufficient high-quality pose labels. This work proposes a reconstruction-based strategy as a complement to pseudo-labelling for synthetic-to-real domain adaptation. We generate a driving image by geometrically transforming a base image according to the predicted keypoints, and enforce a reconstruction loss to refine the predictions. This provides a novel way to correct confident yet inaccurate keypoint locations through image reconstruction during domain adaptation. Our approach outperforms the previous state of the art by 8% PCK on four large-scale hand and human real-world datasets. In particular, we excel on endpoints such as fingertips and the head, with 7.2% and 29.9% improvements in PCK.
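
Conceptually, the reconstruction signal looks like the sketch below: warp a base image so that its keypoints land on the predicted locations, then penalize the difference from the real target frame. Here `warp_fn` is a hypothetical placeholder for any differentiable keypoint-driven warp (e.g., a thin-plate-spline warp).

```python
import torch.nn.functional as F

def reconstruction_refine_loss(base_img, target_img, warp_fn,
                               base_kpts, pred_kpts):
    # warp_fn is a hypothetical differentiable warp: it moves base_kpts
    # to pred_kpts. If the predicted keypoints are wrong, the warped
    # "driving" image mismatches the real target, and the loss pushes
    # the keypoint predictions toward correction.
    driving = warp_fn(base_img, base_kpts, pred_kpts)
    return F.l1_loss(driving, target_img)
```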

ICLR Conference 2022 Conference Paper

Dive Deeper Into Integral Pose Regression

  • Kerui Gu
  • Linlin Yang
  • Angela Yao

Integral pose regression combines an implicit heatmap with end-to-end training for human body and hand pose estimation. Unlike detection-based heatmap methods, which decode final joint positions from the heatmap with a non-differentiable argmax operation, integral regression methods apply a differentiable expectation operation. This paper offers a deep dive into the inference and back-propagation of integral pose regression to better understand the differences in performance and training compared to detection-based methods. For inference, we give theoretical support as to why expectation should always be better than the argmax operation, i.e. integral regression should always outperform detection. Yet, in practice, this is observed only in hard cases because the heatmap activation for regression shrinks in easy cases. We then experimentally show that activation shrinkage is one of the leading causes for integral regression's inferior performance. For back-propagation, we theoretically and empirically analyze the gradients to explain the slow training speed of integral regression. Based on these findings, we incorporate the supervision of a spatial prior to speed up training and improve performance.
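
The two decoders the paper contrasts are simple to state in code, shown here in 1-D for clarity (real methods use 2-D heatmaps).

```python
import torch

def decode_1d(hm, tau=1.0):
    # hm: (N, W) raw heatmap logits for one joint along one axis.
    coords = torch.arange(hm.shape[-1], dtype=hm.dtype)
    hard = hm.argmax(dim=-1).to(hm.dtype)  # detection: non-differentiable argmax
    p = torch.softmax(hm / tau, dim=-1)    # normalized heatmap
    soft = (p * coords).sum(dim=-1)        # integral regression: expectation
    # Note: if p spreads out instead of concentrating ("shrinking"), the
    # expectation is pulled toward the center of the map, biasing `soft`.
    return hard, soft
```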

IJCAI Conference 2021 Conference Paper

Uncertainty-aware Binary Neural Networks

  • Junhe Zhao
  • Linlin Yang
  • Baochang Zhang
  • Guodong Guo
  • David Doermann

Binary Neural Networks (BNNs) are promising machine learning solutions for deployment on resource-limited devices. Recent approaches to training BNNs have produced impressive results, but minimizing the accuracy drop from full-precision networks remains challenging. One reason is that conventional BNNs ignore the uncertainty caused by weights that are near zero, resulting in instability or frequent sign flips during learning. In this work, we investigate the intrinsic uncertainty of near-zero weights, which makes training vulnerable to instability. We introduce an uncertainty-aware BNN (UaBNN) that leverages a new mapping function, called certainty-sign (c-sign), to reduce the uncertainty of these weights. Our c-sign function is the first to train BNNs with decreasing uncertainty during binarization, leading to a more controlled learning process. We also introduce a simple but effective method to measure the uncertainty based on a Gaussian function. Extensive experiments demonstrate that our method improves multiple BNN baselines by stabilizing training and achieves higher performance than prior art.
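
A hedged sketch of the two ingredients: a Gaussian-shaped uncertainty that peaks for near-zero weights, and a sign function whose straight-through gradient is damped for uncertain weights. The exact c-sign mapping in the paper differs; `sigma` here is our free parameter.

```python
import torch

def weight_uncertainty(w, sigma=0.1):
    # Gaussian-shaped uncertainty: ~1 for near-zero weights, ~0 otherwise.
    return torch.exp(-w ** 2 / (2 * sigma ** 2))

def c_sign_sketch(w, sigma=0.1):
    # Stand-in for the paper's certainty-sign: binarize in the forward
    # pass, but damp the straight-through gradient of uncertain
    # (near-zero) weights so they flip less often during training.
    gate = 1.0 - weight_uncertainty(w, sigma)  # small gradient when uncertain
    return w * gate + (torch.sign(w) - w * gate).detach()
```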

IJCAI Conference 2020 Conference Paper

CP-NAS: Child-Parent Neural Architecture Search for 1-bit CNNs

  • Li'an Zhuo
  • Baochang Zhang
  • Hanlin Chen
  • Linlin Yang
  • Chen Chen
  • Yanjun Zhu
  • David Doermann

Neural architecture search (NAS) is among the best approaches for many tasks, generating application-adaptive neural architectures, yet it is still challenged by high computational cost and memory consumption. At the same time, 1-bit convolutional neural networks (CNNs) with binarized weights and activations show their potential for resource-limited embedded devices. One natural approach is to use 1-bit CNNs to reduce the computation and memory cost of NAS, taking advantage of the strengths of each in a unified framework. To this end, a Child-Parent model is introduced into differentiable NAS to search for a binarized architecture (Child) under the supervision of a full-precision model (Parent). In the search stage, the Child-Parent model uses an indicator derived from the parent and child model accuracies to evaluate performance and abandon operations with less potential. In the training stage, a kernel-level CP loss is introduced to optimize the binarized network. Extensive experiments demonstrate that the proposed CP-NAS achieves accuracy comparable to traditional NAS on both the CIFAR and ImageNet databases. It achieves an accuracy of 95.27% on CIFAR-10 and 64.3% on ImageNet with binarized weights and activations, with a 30% faster search than prior art.
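
As a toy version of the search-stage signal, one can score a candidate operation by the 1-bit child's accuracy and by how much of the full-precision parent's accuracy it retains; the paper's actual indicator differs, so the formula below is purely illustrative.

```python
def operation_potential(child_acc, parent_acc):
    # Illustrative stand-in for CP-NAS's pruning signal: favor operations
    # where the binarized child both performs well on its own and retains
    # most of its full-precision parent's accuracy.
    retention = child_acc / max(parent_acc, 1e-8)
    return 0.5 * child_acc + 0.5 * retention
```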