Arrow Research search

Author name cluster

Jing Lin

Possible papers associated with this exact author name in Arrow. This page groups case-insensitive exact name matches and is not a full identity disambiguation profile.

18 papers
2 author rows

Possible papers (18)

AAAI Conference 2026 Conference Paper

UniMo: Unified Motion Generation and Understanding with Chain of Thought

  • Guocun Wang
  • Kenkun Liu
  • Jing Lin
  • Guorui Song
  • Jian Li
  • Xiaoguang Han

Existing 3D human motion generation and understanding methods often exhibit limited interpretability, restricting effective mutual enhancement between these inherently related tasks. While current unified frameworks based on large language models (LLMs) leverage linguistic priors, they frequently encounter challenges in semantic alignment and task coherence. Moreover, the next-token prediction paradigm in LLMs is ill-suited for motion sequences, causing cumulative prediction errors. To address these limitations, we propose UniMo, a novel framework that integrates motion-language information and interpretable chain of thought (CoT) reasoning into the LLM via supervised fine-tuning (SFT). We further introduce reinforcement learning with Group Relative Policy Optimization (GRPO) as a post-training strategy that optimizes over groups of tokens to enforce structural correctness and semantic alignment, mitigating cumulative errors in motion token prediction. Extensive experiments demonstrate that UniMo significantly outperforms existing unified and task-specific models, achieving state-of-the-art performance in both motion generation and understanding.
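
A minimal sketch of the group-relative advantage at the core of GRPO, assuming one scalar reward per sampled motion-token sequence; the reward values and group size below are illustrative, not taken from the paper.

```python
import numpy as np

def grpo_advantages(group_rewards: np.ndarray, eps: float = 1e-8) -> np.ndarray:
    """Score each rollout relative to its own group, so no learned critic is needed."""
    return (group_rewards - group_rewards.mean()) / (group_rewards.std() + eps)

# Example: four motion sequences sampled for the same text prompt, each scored
# for structural correctness and semantic alignment (hypothetical rewards).
rewards = np.array([0.9, 0.4, 0.7, 0.2])
print(grpo_advantages(rewards))  # positive => better than the group average
```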

AAAI Conference 2026 Conference Paper

UniSketch: A Unified Framework for Parametric Sketch Generation and Constraint Prediction

  • Jing Lin
  • Fazhi He
  • Rubin Fan

In modern Computer-Aided Design (CAD), parametric sketches play a crucial role by capturing both the geometric structure and design intent through constraints. However, existing deep learning–based sketch methods remain restricted to simple geometric primitives and limited constraint types, hindering their application to complex real-world engineering tasks. To address this gap, we introduce the UniSketch dataset, comprising 3,836,290 sketches. It offers a comprehensive and diverse collection of 7 types of geometric primitives and 23 types of 2D constraints, all represented as unified vector sequences suitable for deep learning applications. Leveraging the UniSketch dataset, we propose a unified multi-task Transformer framework as a true foundation model for parametric sketch modeling, supporting diverse core tasks such as image-to-sketch generation, constraint prediction, and unconditional sketch synthesis. Furthermore, the generated sketches can be efficiently converted to CAD-compatible formats, enabling seamless integration with industrial CAD systems for re-editing and reuse. The experimental results show that UniSketch outperforms existing methods in multiple tasks, demonstrating its versatility and practical value in industrial CAD applications.
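
To illustrate what a "unified vector sequence" over primitives and constraints might look like, here is a hypothetical flattening of a tiny sketch into one token list; the type ids, field layout, and quantization are assumptions, not UniSketch's actual encoding.

```python
# Hypothetical vocabulary: primitive type ids (7 types in the dataset) and
# constraint type ids (23 types in the dataset); values here are made up.
LINE, CIRCLE = 0, 1
COINCIDENT, TANGENT = 100, 101

def encode_sketch(primitives, constraints):
    seq = []
    for ptype, params in primitives:          # params: quantized coordinates
        seq.append(ptype)
        seq.extend(params)
    for ctype, ref_a, ref_b in constraints:   # refs: indices of primitives
        seq.extend([ctype, ref_a, ref_b])
    return seq

sketch = encode_sketch(
    primitives=[(LINE, [0, 0, 10, 0]), (CIRCLE, [10, 5, 5])],
    constraints=[(TANGENT, 0, 1)],            # circle 1 tangent to line 0
)
print(sketch)
```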

NeurIPS Conference 2025 Conference Paper

DartQuant: Efficient Rotational Distribution Calibration for LLM Quantization

  • Yuantian Shao
  • Yuanteng Chen
  • Peisong Wang
  • Jianlin Yu
  • Jing Lin
  • Yiwu Yao
  • Zhihui Wei
  • Jian Cheng

Quantization plays a crucial role in accelerating the inference of large-scale models, and rotational matrices have been shown to effectively improve quantization performance by smoothing outliers. However, end-to-end fine-tuning of rotational optimization algorithms incurs high computational costs and is prone to overfitting. To address this challenge, we propose an efficient distribution-aware rotational calibration method, DartQuant, which reduces the complexity of rotational optimization by constraining the distribution of the activations after rotation. This approach also effectively reduces reliance on task-specific losses, thereby mitigating the risk of overfitting. Additionally, we introduce the QR-Orth optimization scheme, which replaces expensive alternating optimization with a more efficient solution. In a variety of model quantization experiments, DartQuant demonstrates superior performance. Compared to existing methods, it achieves 47$\times$ acceleration and 10$\times$ memory savings for rotational optimization on a 70B model. Furthermore, it is the first to successfully complete rotational calibration for a 70B model on a single 3090 GPU, making quantization of large language models feasible in resource-constrained environments.
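
As a rough illustration of why rotation helps, the sketch below draws a random orthogonal matrix via a QR decomposition (a generic construction, not the paper's QR-Orth scheme or its distribution-aware calibration objective) and compares naive 4-bit quantization error before and after rotating a weight matrix that has an outlier channel.

```python
import numpy as np

def random_orthogonal(d: int, seed: int = 0) -> np.ndarray:
    rng = np.random.default_rng(seed)
    q, r = np.linalg.qr(rng.standard_normal((d, d)))
    return q * np.sign(np.diag(r))   # sign fix so Q is uniformly distributed

def quantize_int4(x: np.ndarray) -> np.ndarray:
    scale = np.abs(x).max() / 7.0    # symmetric 4-bit: integer levels in [-7, 7]
    return np.round(x / scale).clip(-7, 7) * scale

d = 64
w = np.random.default_rng(1).standard_normal((d, d))
w[:, 0] *= 50.0                      # inject an outlier channel
q = random_orthogonal(d)

err_plain = np.abs(w - quantize_int4(w)).mean()
err_rot = np.abs(w @ q - quantize_int4(w @ q)).mean()
print(err_plain, err_rot)            # rotated error is typically much lower
```

Because q is orthogonal, the rotation can be fused into adjacent layers without changing the network's function, which is what makes this a free lunch for quantization.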

ICLR Conference 2025 Conference Paper

Dynamic Low-Rank Sparse Adaptation for Large Language Models

  • Weizhong Huang
  • Yuxin Zhang 0002
  • Xiawu Zheng
  • Yang Liu 0005
  • Jing Lin
  • Yiwu Yao
  • Rongrong Ji

Despite the efficacy of network sparsity in alleviating the deployment strain of Large Language Models (LLMs), it incurs significant performance degradation. Applying Low-Rank Adaptation (LoRA) to fine-tune the sparse LLMs offers an intuitive approach to counter this predicament, but it has two shortcomings: 1) the inability to integrate LoRA weights into sparse LLMs post-training, and 2) insufficient performance recovery at high sparsity ratios. In this paper, we introduce dynamic $\textbf{Lo}$w-rank $\textbf{S}$parse $\textbf{A}$daptation $\textbf{(LoSA)}$, a novel method that seamlessly integrates low-rank adaptation into LLM sparsity within a unified framework, thereby enhancing the performance of sparse LLMs without increasing the inference latency. In particular, LoSA dynamically sparsifies the LoRA outcomes based on the corresponding sparse weights during fine-tuning, thus guaranteeing that the LoRA module can be integrated into the sparse LLMs post-training. Besides, to achieve the optimal sparse model architecture, LoSA leverages Representation Mutual Information (RMI) as an indicator to determine the importance of layers, thereby dynamically determining the optimal layer-wise sparsity rates during fine-tuning. Predicated on this, LoSA adjusts the rank of the LoRA module based on the variability in layer-wise reconstruction errors, allocating appropriate fine-tuning capacity to each layer to reduce the output discrepancies between dense and sparse LLMs. Extensive experiments show that LoSA can efficiently boost the efficacy of sparse LLMs within a few hours, without introducing any additional inference burden. For example, LoSA reduced the perplexity of sparse LLaMA-2-7B by $\textbf{68.73}$$\downarrow$ and increased zero-shot accuracy by $\textbf{16.32}$%$\uparrow$, achieving a $\textbf{2.60$\times$}$ speedup on CPU and $\textbf{2.23$\times$}$ speedup on GPU, requiring only $\textbf{45 minutes}$ of fine-tuning on $\textbf{a single}$ NVIDIA A100 80GB GPU. Code is available at https://github.com/wzhuang-xmu/LoSA.
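
The mergeability constraint can be illustrated in a few lines: if the low-rank update is masked with the same sparsity pattern as the base weights during fine-tuning, it folds into the sparse matrix afterwards with no extra inference cost. Shapes, rank, and the mask below are illustrative, and LoSA's dynamic rank and layer-wise sparsity allocation are omitted.

```python
import torch

d, r = 256, 8
w = torch.randn(d, d)
mask = (torch.rand(d, d) > 0.5).float()      # 50% unstructured sparsity
w_sparse = w * mask
b, a = torch.randn(d, r), torch.randn(r, d)  # LoRA factors

update = (b @ a) * mask        # sparsify the LoRA outcome with the same mask
w_merged = w_sparse + update   # merged weight keeps the sparsity pattern

# Merging never reintroduces pruned weights, so inference stays sparse.
assert torch.all(w_merged[mask == 0] == 0)
```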

JBHI Journal 2025 Journal Article

Enhancing Weakly Supervised Semantic Segmentation With Multi-Label Contrastive Learning and LLM Features Guidance

  • Wentian Cai
  • Yijiang Li
  • Yandan Chen
  • Jing Lin
  • Zihao Huang
  • Ping Gao
  • Thippa Reddy Gadekallu
  • Wei Wang

Segmentation of histopathological whole-slide images (WSIs) is essential for precise tissue characterization in medical diagnostics. However, traditional approaches require labor-intensive pixel-level annotations. To this end, we study weakly supervised semantic segmentation (WSSS), which uses patch-level classification labels, reducing annotation effort significantly. However, the complexity of WSIs and the challenge of sparse classification labels hinder effective dense pixel predictions. Moreover, due to the multi-label nature of WSIs, existing single-label contrastive learning approaches are designed to represent a single category, neglect the presence of other relevant categories, and thus fail to adapt to WSI tasks. This paper presents a novel multi-label contrastive learning method for WSSS that incorporates class-specific embedding extraction with LLM feature guidance. Specifically, we propose to obtain class-specific embeddings by utilizing classifier weights, followed by a dot-product-based attention fusion method that leverages LLM features to enrich their semantics, facilitating contrastive learning between different classes from a single image. Besides, we propose a robust learning approach that leverages multi-layer features to evaluate the uncertainty of pseudo-labels, thereby mitigating the impact of noisy pseudo-labels on the learning process of segmentation. Extensive experiments have been conducted on two histopathological image segmentation datasets, i.e., the LUAD and BCSS datasets, demonstrating the effectiveness of our method with leading performance.
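
A minimal sketch of the class-embedding mechanism the abstract describes: classifier weight rows serve as class-specific embeddings and are enriched with LLM features through dot-product attention. Dimensions and the source of the LLM features are assumptions.

```python
import torch
import torch.nn.functional as F

n_classes, d = 4, 256
classifier = torch.nn.Linear(d, n_classes, bias=False)
class_emb = classifier.weight                  # [n_classes, d]: one row per class
llm_feats = torch.randn(n_classes, d)          # e.g., text features per class name

# Dot-product attention: each class embedding attends over the LLM features.
attn = F.softmax(class_emb @ llm_feats.T / d ** 0.5, dim=-1)
enriched = class_emb + attn @ llm_feats        # semantics-enriched class embeddings
print(enriched.shape)                          # torch.Size([4, 256])
```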

ICLR Conference 2025 Conference Paper

RazorAttention: Efficient KV Cache Compression Through Retrieval Heads

  • Hanlin Tang
  • Yang Lin
  • Jing Lin
  • Qingsen Han
  • Danning Ke
  • Shikuan Hong
  • Yiwu Yao
  • Gongyi Wang

The memory and computational demands of the Key-Value (KV) cache present significant challenges for deploying long-context language models. Previous approaches attempt to mitigate this issue by selectively dropping tokens, which irreversibly erases critical information that might be needed for future queries. In this paper, we propose a novel compression technique for the KV cache that preserves all token information. Our investigation reveals that: i) most attention heads primarily focus on the local context; ii) only a few heads, denoted as retrieval heads, can essentially pay attention to all input tokens. These key observations motivate us to use a separate caching strategy for each class of attention heads. Therefore, we propose RazorAttention, a training-free KV cache compression algorithm, which maintains a full cache for these crucial retrieval heads and discards the remote tokens in non-retrieval heads. Furthermore, we introduce a novel mechanism involving a “compensation token” to further recover the information in the dropped tokens. Extensive evaluations across a diverse set of large language models (LLMs) demonstrate that RazorAttention achieves a reduction in KV cache size of over 70% without noticeable impacts on performance. Additionally, RazorAttention is compatible with FlashAttention, rendering it an efficient and plug-and-play solution that enhances LLM inference efficiency without overhead or retraining of the original model.
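
A minimal sketch of a per-head cache policy in this spirit: retrieval heads keep every token while the remaining heads keep only a recent window. How heads are classified, the compensation-token computation, and all sizes here are placeholders rather than the paper's exact procedure.

```python
import torch

def compress_kv(keys, values, retrieval_heads, window=128):
    """keys/values: [num_heads, seq_len, head_dim]."""
    out_k, out_v = [], []
    for h in range(keys.shape[0]):
        if h in retrieval_heads:               # full cache for retrieval heads
            k, v = keys[h], values[h]
        else:                                  # local window for the rest;
            k, v = keys[h, -window:], values[h, -window:]
            # RazorAttention would also fold the dropped tokens into a single
            # "compensation token" here (omitted in this sketch).
        out_k.append(k)
        out_v.append(v)
    return out_k, out_v

k, v = torch.randn(8, 4096, 64), torch.randn(8, 4096, 64)
ck, cv = compress_kv(k, v, retrieval_heads={0, 3})
print([t.shape[0] for t in ck])  # [4096, 128, 128, 4096, 128, ...]
```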

NeurIPS Conference 2025 Conference Paper

VaporTok: RL-Driven Adaptive Video Tokenizer with Prior & Task Awareness

  • Minghao Yang
  • Zechen Bai
  • Jing Lin
  • Haoqian Wang
  • Alex Jinpeng Wang

Recent advances in visual tokenizers have demonstrated their effectiveness for multimodal large language models and autoregressive generative models. However, most existing visual tokenizers rely on a fixed downsampling rate at a given visual resolution and consequently produce a constant number of visual tokens, ignoring the fact that visual information of varying complexity warrants different token budgets. Motivated by this observation, we propose an adaptive video tokenizer, VaporTok, with two core contributions. (1) Probabilistic Taildrop: we introduce a novel taildrop mechanism that learns a truncation-index sampling distribution conditioned on the visual complexity of the video. During both training and inference, the decoder reconstructs videos at adaptive token lengths, allocating more tokens to complex videos and fewer to simpler ones. (2) Parallel Sample GRPO with Vapor Reward: by leveraging the probability distribution produced by probabilistic taildrop, we reformulate the visual tokenization pipeline as a sequential decision process. To optimize this process, we propose a variant of GRPO and a composite reward encompassing token efficiency, reconstruction fidelity, and generative quality, thus enabling metrics-aware adaptive tokenization across diverse objectives. Extensive experiments on standard video generation benchmarks confirm our analysis, showing that our adaptive approach matches or outperforms fixed-rate baselines and naive taildrop while using fewer tokens.
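
A minimal sketch of the taildrop sampling step: draw a truncation index from a categorical distribution over token budgets and keep only that prefix. The logits below stand in for whatever complexity-conditioned distribution VaporTok actually learns.

```python
import torch

def taildrop(tokens: torch.Tensor, logits: torch.Tensor) -> torch.Tensor:
    """tokens: [seq_len, dim]; logits: [seq_len], one score per truncation index."""
    dist = torch.distributions.Categorical(logits=logits)
    k = int(dist.sample()) + 1    # keep at least one token
    return tokens[:k]             # the decoder reconstructs from this prefix

tokens = torch.randn(256, 32)
logits = torch.linspace(-2.0, 2.0, 256)  # a "complex" clip: favor long prefixes
print(taildrop(tokens, logits).shape)
```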

ICML Conference 2024 Conference Paper

HumanTOMATO: Text-aligned Whole-body Motion Generation

  • Shunlin Lu
  • Linghao Chen
  • Ailing Zeng
  • Jing Lin
  • Ruimao Zhang
  • Lei Zhang 0001
  • Harry Shum

This work targets a novel text-driven whole-body motion generation task, which takes a given textual description as input and aims at generating high-quality, diverse, and coherent facial expressions, hand gestures, and body motions simultaneously. Previous works on text-driven motion generation tasks mainly have two limitations: they ignore the key role of fine-grained hand and face control in vivid whole-body motion generation, and they lack a good alignment between text and motion. To address these limitations, we propose a Text-aligned whOle-body Motion generATiOn framework, named HumanTOMATO, which, to our knowledge, is the first attempt toward applicable holistic motion generation in this research area. To tackle this challenging task, our solution includes two key designs: (1) a Holistic Hierarchical VQ-VAE (aka H${}^{2}$VQ) and a Hierarchical-GPT for fine-grained body and hand motion reconstruction and generation with two structured codebooks; and (2) a pre-trained text-motion-alignment model to help the generated motion align with the input textual description explicitly. Comprehensive experiments verify that our model has significant advantages in both the quality of generated motions and their alignment with text.
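
A loose sketch of two-level vector quantization with separate structured codebooks, gesturing at the H${}^{2}$VQ idea for body and hand; quantizing the hand residual against the body codes is an assumption about the hierarchy's wiring, and all sizes are arbitrary.

```python
import torch

def quantize(x, codebook):
    """x: [n, d]; codebook: [K, d] -> nearest code indices and code vectors."""
    idx = torch.cdist(x, codebook).argmin(dim=-1)
    return idx, codebook[idx]

body_cb, hand_cb = torch.randn(512, 64), torch.randn(512, 64)
body_feat, hand_feat = torch.randn(10, 64), torch.randn(10, 64)

b_idx, b_q = quantize(body_feat, body_cb)
# Hypothetical hierarchy: hand quantization conditions on the body codes by
# quantizing the residual of the hand features w.r.t. them.
h_idx, h_q = quantize(hand_feat - b_q, hand_cb)
print(b_idx.shape, h_idx.shape)
```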

NeurIPS Conference 2023 Conference Paper

Binarized Spectral Compressive Imaging

  • Yuanhao Cai
  • Yuxin Zheng
  • Jing Lin
  • Xin Yuan
  • Yulun Zhang
  • Haoqian Wang

Existing deep learning models for hyperspectral image (HSI) reconstruction achieve good performance but require powerful hardware with enormous memory and computational resources. Consequently, these methods can hardly be deployed on resource-limited mobile devices. In this paper, we propose a novel method, Binarized Spectral-Redistribution Network (BiSRNet), for efficient and practical HSI restoration from compressed measurements in snapshot compressive imaging (SCI) systems. Firstly, we redesign a compact and easy-to-deploy base model to be binarized. Then we present the basic unit, Binarized Spectral-Redistribution Convolution (BiSR-Conv). BiSR-Conv can adaptively redistribute the HSI representations before binarizing activations and uses a scalable hyperbolic tangent function to more closely approximate the Sign function in backpropagation. Based on our BiSR-Conv, we customize four binarized convolutional modules to address the dimension mismatch and propagate full-precision information throughout the whole network. Finally, our BiSRNet is derived by using the proposed techniques to binarize the base model. Comprehensive quantitative and qualitative experiments manifest that our proposed BiSRNet outperforms state-of-the-art binarization algorithms. Code and models are publicly available at https://github.com/caiyuanhao1998/BiSCI.
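
The scalable-tanh surrogate is concrete enough to sketch: binarize with Sign in the forward pass, but backpropagate the gradient of tanh(kx), where a larger sharpness k approximates Sign more closely. The fixed k below is an assumption; the paper makes it scalable during training.

```python
import torch

class TanhSign(torch.autograd.Function):
    @staticmethod
    def forward(ctx, x, k):
        ctx.save_for_backward(x)
        ctx.k = k
        return torch.sign(x)                   # hard binarization forward

    @staticmethod
    def backward(ctx, grad_out):
        (x,) = ctx.saved_tensors
        k = ctx.k
        # d/dx tanh(kx) = k * (1 - tanh(kx)^2); larger k => closer to Sign.
        return grad_out * k * (1.0 - torch.tanh(k * x) ** 2), None

x = torch.randn(4, requires_grad=True)
TanhSign.apply(x, 3.0).sum().backward()
print(x.grad)
```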

NeurIPS Conference 2023 Conference Paper

Motion-X: A Large-scale 3D Expressive Whole-body Human Motion Dataset

  • Jing Lin
  • Ailing Zeng
  • Shunlin Lu
  • Yuanhao Cai
  • Ruimao Zhang
  • Haoqian Wang
  • Lei Zhang

In this paper, we present Motion-X, a large-scale 3D expressive whole-body motion dataset. Existing motion datasets predominantly contain body-only poses, lacking facial expressions, hand gestures, and fine-grained pose descriptions. Moreover, they are primarily collected from limited laboratory scenes with textual descriptions manually labeled, which greatly limits their scalability. To overcome these limitations, we develop a whole-body motion and text annotation pipeline, which can automatically annotate motion from either single- or multi-view videos and provide comprehensive semantic labels for each video and fine-grained whole-body pose descriptions for each frame. This pipeline is highly precise, cost-effective, and scalable for further research. Based on it, we construct Motion-X, which comprises 15.6M precise 3D whole-body pose annotations (i.e., SMPL-X) covering 81.1K motion sequences from massive scenes. Besides, Motion-X provides 15.6M frame-level whole-body pose descriptions and 81.1K sequence-level semantic labels. Comprehensive experiments demonstrate the accuracy of the annotation pipeline and the significant benefit of Motion-X in enhancing expressive, diverse, and natural motion generation, as well as 3D whole-body human mesh recovery.

ECAI Conference 2023 Conference Paper

Time-Series Data Imputation via Realistic Masking-Guided Tri-Attention Bi-GRU

  • Yiqun Zhang 0006
  • An Zeng
  • Dan Pan 0001
  • Yuzhu Ji
  • Zhipeng Zhang
  • Jing Lin

Time series data with missing values are ubiquitous in real applications due to various unforeseen faults during data generation, storage, and transmission. Time-Series Data Imputation (TSDI) is thus crucial to many temporal data analysis tasks. However, existing works usually consider only one of the following two issues: (1) intra-feature temporal dependency, and (2) inter-feature correlation, leading to the overlooking of complex coupling information in imputation. To achieve more accurate TSDI, we design a novel imputation model called TABiG, which delicately preserves the short-term, long-term, and inter-feature dependencies by attention mechanisms in a delay-error-reduced bi-directional architecture. That is, it leverages GRU to model short-term temporal dependencies and adopts self-attention mechanisms hierarchically to capture long-term temporal dependencies and inter-feature correlations. The multiple self-attention mechanisms are nested in a bi-directional structure to alleviate the problem of delay errors in RNN-like structures. To facilitate model training with higher generalization, a masking strategy that mimics various extreme real missing situations beyond simple random ones is adopted to generate self-supervised learning tasks. Comprehensive experiments demonstrate that TABiG significantly outperforms most state-of-the-art imputation counterparts. Complementary results and source code can be accessed at https://github.com/Zhang2112105189/TABiG.
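
A minimal sketch of generating self-supervised imputation tasks with masks beyond uniform random dropout; the block-missing pattern below, mimicking sensor outages, is an illustrative choice rather than the paper's exact masking recipe.

```python
import numpy as np

def random_mask(shape, rate, rng):
    return rng.random(shape) < rate            # scattered point missingness

def block_mask(shape, max_len, rng):
    """One contiguous gap per feature, mimicking a sensor outage."""
    t, f = shape
    mask = np.zeros(shape, dtype=bool)
    for j in range(f):
        start = rng.integers(0, t)
        mask[start:start + rng.integers(1, max_len + 1), j] = True
    return mask

rng = np.random.default_rng(0)
x = rng.standard_normal((100, 5))              # [time, features]
hide = random_mask(x.shape, 0.1, rng) | block_mask(x.shape, 20, rng)
x_train = np.where(hide, np.nan, x)            # the model is asked to fill the NaNs
```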

NeurIPS Conference 2022 Conference Paper

Degradation-Aware Unfolding Half-Shuffle Transformer for Spectral Compressive Imaging

  • Yuanhao Cai
  • Jing Lin
  • Haoqian Wang
  • Xin Yuan
  • Henghui Ding
  • Yulun Zhang
  • Radu Timofte
  • Luc Van Gool

In coded aperture snapshot spectral compressive imaging (CASSI) systems, hyperspectral image (HSI) reconstruction methods are employed to recover the spatial-spectral signal from a compressed measurement. Among these algorithms, deep unfolding methods demonstrate promising performance but suffer from two issues. Firstly, they do not estimate the degradation patterns and ill-posedness degree from CASSI to guide the iterative learning. Secondly, they are mainly CNN-based, showing limitations in capturing long-range dependencies. In this paper, we propose a principled Degradation-Aware Unfolding Framework (DAUF) that estimates parameters from the compressed image and physical mask, and then uses these parameters to control each iteration. Moreover, we customize a novel Half-Shuffle Transformer (HST) that simultaneously captures local contents and non-local dependencies. By plugging HST into DAUF, we establish the first Transformer-based deep unfolding method, Degradation-Aware Unfolding Half-Shuffle Transformer (DAUHST), for HSI reconstruction. Experiments show that DAUHST surpasses state-of-the-art methods while requiring lower computational and memory costs. Code and models are publicly available at https://github.com/caiyuanhao1998/MST.
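
A minimal sketch of the degradation-aware unfolding pattern: a small estimator predicts per-stage parameters from the measurement and mask, and each unfolded stage alternates a data-fidelity gradient step with a learned prior step. The estimator and the plain-convolution denoisers below are placeholders for DAUF's parameter estimator and the HST blocks, and the simplified elementwise-mask measurement model is an assumption.

```python
import torch
import torch.nn as nn

class UnfoldingNet(nn.Module):
    def __init__(self, stages=3, channels=28):
        super().__init__()
        self.estimator = nn.Linear(2, stages * 2)    # -> (alpha, beta) per stage
        self.denoisers = nn.ModuleList(
            nn.Conv2d(channels, channels, 3, padding=1) for _ in range(stages)
        )

    def forward(self, y, mask):
        # Degradation-aware knobs estimated from measurement/mask statistics.
        stats = torch.stack([y.mean(), mask.mean()]).unsqueeze(0)
        params = self.estimator(stats).view(-1, 2)
        x = y * mask                                  # crude initialization
        for (alpha, beta), denoiser in zip(params, self.denoisers):
            x = x - alpha * mask * (mask * x - y)     # data-fidelity gradient step
            x = x + beta * denoiser(x)                # learned prior step
        return x

net = UnfoldingNet()
y, mask = torch.randn(1, 28, 64, 64), torch.rand(1, 28, 64, 64)
print(net(y, mask).shape)
```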

TIST Journal 2022 Journal Article

FLAG: A Feedback-aware Local and Global Model for Heterogeneous Sequential Recommendation

  • Mingkai He
  • Jing Lin
  • Jinwei Luo
  • Weike Pan
  • Zhong Ming

Heterogeneous sequential recommendation, which models sequences of items associated with more than one type of feedback such as examinations and purchases, is an emerging topic in the research community and an important problem in many real-world applications. Though some methods have been proposed to exploit different types of feedback in item sequences, such as RLBL, RIB, and BINN, they are based on RNNs and may not be very competitive in capturing users’ complex and dynamic preferences. Moreover, most existing advanced sequential recommendation methods, such as the CNN- and attention-based methods, are designed for item sequences with one single type of feedback and thus cannot be applied to the studied problem directly. As a response, we propose a novel feedback-aware local and global (FLAG) preference learning model for heterogeneous sequential recommendation. Our FLAG contains four modules, including (i) a local preference learning module for capturing a user’s short-term interest, which adopts a novel feedback-aware self-attention block to distinguish different types of feedback; (ii) a global preference learning module for modeling a user’s global preference; (iii) a local intention learning module, which takes a user’s real feedback in the next step, i.e., the user’s intention at the current step, as the query vector in a self-attention block to figure out the items that match the user’s intention well; and (iv) a prediction module for preference integration and final prediction. We then conduct extensive experiments on three public datasets and find that our FLAG significantly outperforms 13 very competitive baselines in terms of two commonly used ranking-oriented metrics in most cases. We also include ablation studies and a sensitivity analysis of our FLAG for more in-depth insights.
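
The feedback-aware ingredient can be sketched simply: add a learned embedding for the feedback type to each item embedding before self-attention, so the same item is represented differently under examination versus purchase. Sizes below are illustrative, and the rest of FLAG's four modules are omitted.

```python
import torch
import torch.nn as nn

n_items, n_feedback_types, d = 1000, 2, 64
item_emb = nn.Embedding(n_items, d)
feedback_emb = nn.Embedding(n_feedback_types, d)
attn = nn.MultiheadAttention(d, num_heads=4, batch_first=True)

items = torch.tensor([[3, 17, 42, 42]])        # one user's item sequence
feedback = torch.tensor([[0, 0, 1, 0]])        # 0 = examination, 1 = purchase
h = item_emb(items) + feedback_emb(feedback)   # same item, different feedback
out, _ = attn(h, h, h)                         # feedback now shapes attention
print(out.shape)                               # torch.Size([1, 4, 64])
```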

ICML Conference 2022 Conference Paper

Flow-Guided Sparse Transformer for Video Deblurring

  • Jing Lin
  • Yuanhao Cai
  • Xiaowan Hu
  • Haoqian Wang
  • Youliang Yan
  • Xueyi Zou
  • Henghui Ding
  • Yulun Zhang 0001

Exploiting similar and sharper scene patches in spatio-temporal neighborhoods is critical for video deblurring. However, CNN-based methods show limitations in capturing long-range dependencies and modeling non-local self-similarity. In this paper, we propose a novel framework, Flow-Guided Sparse Transformer (FGST), for video deblurring. In FGST, we customize a self-attention module, Flow-Guided Sparse Window-based Multi-head Self-Attention (FGSW-MSA). For each $query$ element on the blurry reference frame, FGSW-MSA enjoys the guidance of the estimated optical flow to globally sample spatially sparse yet highly related $key$ elements corresponding to the same scene patch in neighboring frames. Besides, we present a Recurrent Embedding (RE) mechanism to transfer information from past frames and strengthen long-range temporal dependencies. Comprehensive experiments demonstrate that our proposed FGST outperforms state-of-the-art (SOTA) methods on both DVD and GOPRO datasets and yields visually pleasant results in real video deblurring. https://github.com/linjing7/VR-Baseline
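
A minimal sketch of the flow-guided sampling idea: for each query position in the reference frame, follow the optical flow into a neighboring frame and bilinearly sample the features there as attention keys. The window-based sparsity and multi-head machinery of FGSW-MSA are not reproduced here.

```python
import torch
import torch.nn.functional as F

def flow_guided_keys(neighbor_feats, flow):
    """neighbor_feats: [1, C, H, W]; flow: [1, 2, H, W] in pixels."""
    _, _, h, w = neighbor_feats.shape
    ys, xs = torch.meshgrid(torch.arange(h), torch.arange(w), indexing="ij")
    grid_x = (xs + flow[:, 0]) / (w - 1) * 2 - 1   # normalize to [-1, 1]
    grid_y = (ys + flow[:, 1]) / (h - 1) * 2 - 1
    grid = torch.stack([grid_x, grid_y], dim=-1)   # [1, H, W, 2]
    return F.grid_sample(neighbor_feats, grid, align_corners=True)

feats, flow = torch.randn(1, 32, 64, 64), torch.randn(1, 2, 64, 64) * 4
print(flow_guided_keys(feats, flow).shape)         # torch.Size([1, 32, 64, 64])
```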

ICML Conference 2022 Conference Paper

Unsupervised Flow-Aligned Sequence-to-Sequence Learning for Video Restoration

  • Jing Lin
  • Xiaowan Hu
  • Yuanhao Cai
  • Haoqian Wang
  • Youliang Yan
  • Xueyi Zou
  • Yulun Zhang 0001
  • Luc Van Gool

How to properly model the inter-frame relation within the video sequence is an important but unsolved challenge for video restoration (VR). In this work, we propose an unsupervised flow-aligned sequence-to-sequence model (S2SVR) to address this problem. On the one hand, the sequence-to-sequence model, which has proven capable of sequence modeling in the field of natural language processing, is explored for the first time in VR. Optimized serialization modeling shows potential in capturing long-range dependencies among frames. On the other hand, we equip the sequence-to-sequence model with an unsupervised optical flow estimator to maximize its potential. The flow estimator is trained with our proposed unsupervised distillation loss, which can alleviate the data discrepancy and inaccurate degraded optical flow issues of previous flow-based methods. With reliable optical flow, we can establish accurate correspondence among multiple frames, narrowing the domain difference between 1D language and 2D misaligned frames and improving the potential of the sequence-to-sequence model. S2SVR shows superior performance in multiple VR tasks, including video deblurring, video super-resolution, and compressed video quality enhancement. https://github.com/linjing7/VR-Baseline

IROS Conference 2019 Conference Paper

Joint Torque Estimation toward Dynamic and Compliant Control for Gear-Driven Torque Sensorless Quadruped Robot

  • Bingchen Jin
  • Caiming Sun
  • Aidong Zhang 0002
  • Ning Ding 0003
  • Jing Lin
  • Ganyu Deng
  • Zuwen Zhu
  • Zhenglong Sun 0001

This paper investigates dynamic and compliant control based on joint output torque estimation for electrically actuated quadruped robots with large-reduction-ratio harmonic gears. Compared with position control, force control exhibits better dynamics and compliance in the robot's interactions with complex environments. However, force control without direct feedback from torque sensors may suffer poor joint-compliance tracking when the robot is equipped with high-reduction gears. To solve this problem, we propose a new method to estimate joint torque from the motor current and rotation velocity detected at each joint, using a more precise friction model of the harmonic gear. We also introduce a pre-stance phase into the whole cycle of alternating leg swing/stance based on hybrid force and position control, to dynamically absorb foot impacts on the ground. Our controller's performance is validated in standing and walking experiments.
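
A minimal sketch of the estimation idea, assuming a simple Coulomb-plus-viscous friction model: joint torque is the motor torque reflected through the gear minus the modeled friction. All coefficients below are illustrative; the paper fits a more precise harmonic-gear friction model.

```python
import numpy as np

def estimate_joint_torque(i_motor, omega, kt=0.05, ratio=100.0,
                          tau_coulomb=0.8, b_viscous=0.02):
    """i_motor [A], omega [rad/s at the joint] -> estimated joint torque [N*m]."""
    tau_motor = kt * i_motor * ratio                 # motor torque through the gear
    tau_friction = tau_coulomb * np.sign(omega) + b_viscous * omega
    return tau_motor - tau_friction

print(estimate_joint_torque(i_motor=2.0, omega=1.5))  # 10.0 - 0.83 = 9.17 N*m
```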