
Author name cluster

Linghe Kong

Possible papers associated with this exact author name in Arrow. This page groups case-insensitive exact name matches and is not a full identity disambiguation profile.

25 papers
2 author rows

Possible papers (25)

AAAI Conference 2026 Conference Paper

AdaFuse: Accelerating Dynamic Adapter Inference via Token-Level Pre-Gating and Fused Kernel Optimization

  • Qiyang Li
  • Rui Kong
  • Yuchen Li
  • Hengyi Cai
  • Shuaiqiang Wang
  • Linghe Kong
  • Guihai Chen
  • Dawei Yin

The integration of dynamic, sparse structures like Mixture-of-Experts (MoE) with parameter-efficient adapters (e.g., LoRA) is a powerful technique for enhancing Large Language Models (LLMs). However, this architectural enhancement comes at a steep cost: despite minimal increases in computational load, inference latency often skyrockets, slowing decoding by over 2.5x. Through a fine-grained performance analysis, we pinpoint the primary bottleneck not in the computation itself, but in the severe overhead from the fragmented, sequential CUDA kernel launches required for conventional dynamic routing. To address this challenge, we introduce AdaFuse, a framework built on a tight co-design between the algorithm and the underlying hardware system to enable efficient dynamic adapter execution. Departing from conventional layer-wise or block-wise routing, AdaFuse employs a token-level pre-gating strategy, which makes a single, global routing decision for all adapter layers before a token is processed. This "decide-once, apply-everywhere" approach effectively staticizes the execution path for each token, creating an opportunity for holistic optimization. We capitalize on this by developing a custom CUDA kernel that performs a fused switching operation, merging the parameters of all selected LoRA adapters into the backbone model in a single, efficient pass. Experimental results on popular open-source LLMs show that AdaFuse achieves accuracy on par with state-of-the-art dynamic adapters while cutting decoding latency by more than 2.4x, bridging the gap between model capability and inference efficiency.
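
To make the "decide-once, apply-everywhere" idea concrete, here is a minimal NumPy sketch of token-level pre-gating, assuming a toy gating matrix and per-layer LoRA factors of our own naming (`pre_gate` and `fused_forward` are illustrative, not the authors' API); the real system fuses the merge into a single CUDA kernel, which plain NumPy cannot show.

```python
import numpy as np

rng = np.random.default_rng(0)
d, r, n_adapters, n_layers = 64, 8, 4, 12

# toy backbone weights and per-layer LoRA factors for each adapter
W = [rng.standard_normal((d, d)) / np.sqrt(d) for _ in range(n_layers)]
A = [[0.01 * rng.standard_normal((d, r)) for _ in range(n_adapters)]
     for _ in range(n_layers)]
B = [[0.01 * rng.standard_normal((r, d)) for _ in range(n_adapters)]
     for _ in range(n_layers)]
gate_W = rng.standard_normal((d, n_adapters))

def pre_gate(token_emb, k=2):
    """One global routing decision per token: top-k adapters, all layers."""
    scores = token_emb @ gate_W              # (n_adapters,)
    return np.argsort(scores)[-k:]

def fused_forward(token_emb, selected):
    """'Decide once, apply everywhere': the adapter set is fixed, so each
    layer's LoRA deltas can be merged into the backbone in one pass."""
    h = token_emb
    for l in range(n_layers):
        W_merged = W[l] + sum(A[l][i] @ B[l][i] for i in selected)
        h = np.tanh(h @ W_merged)
    return h

x = rng.standard_normal(d)
print(fused_forward(x, pre_gate(x)).shape)   # (64,)
```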

TAAS Journal 2025 Journal Article

ACORN+: Adaptive Compression-Reconstruction for Device-Cloud Collaboration Video Services

  • Jiale Lei
  • Peihao Yang
  • Linghe Kong
  • Yehan Ma
  • Deyu Lin
  • Guihai Chen
  • E. Zhao

With the improvement of edge-based autonomous systems such as mobile Industrial IoT (IIoT) networks, edge devices can capture and upload videos at increasing bitrates. Massive edge-computing end nodes are eager for adequate multimedia data to satisfy the requirements of real-time video services. However, existing encoding standards for video services in Web 2.0 were not designed for IoT video streaming. We have improved our Adaptive Compression-Reconstruction (ACORN) framework to obtain ACORN+, based on compressed sensing and recent advances in deep learning. At end nodes, we compress multiple sequential video frames into a single frame to reduce video volume. Given that multiple kinds of intelligent tasks are expected to be finished on the device side, we also designed a device-cloud collaboration scheme where deep learning-based algorithms can be executed on both the device and server sides. Experiments reveal that video analytics can be conducted on compressed frames. Taking action recognition as a device-cloud collaboration use case, we find ACORN+ obtains more than 3x speedup on compressed frames. The reconstruction algorithm in ACORN+ achieves 1–4 dB improvements. Moreover, the encoding time cost and the encoded video volume are reduced by more than 4x under the ACORN+ framework.
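
A minimal sketch of the snapshot-style compression step such frameworks build on, under our own assumption of binary per-frame modulation masks; the actual encoder and learned reconstruction network are far more elaborate.

```python
import numpy as np

rng = np.random.default_rng(0)
T, H, W = 8, 64, 64                          # toy: 8 frames folded into 1
frames = rng.random((T, H, W))               # sequential video frames
masks = rng.integers(0, 2, size=(T, H, W))   # per-frame modulation masks

# single compressed frame uploaded by the end node (8x volume reduction)
measurement = (masks * frames).sum(axis=0)
print(measurement.shape)                     # (64, 64)
```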

ICLR Conference 2025 Conference Paper

ARB-LLM: Alternating Refined Binarizations for Large Language Models

  • Zhiteng Li
  • Xianglong Yan
  • Tianao Zhang
  • Haotong Qin
  • Dong Xie
  • Jiang Tian
  • Zhongchao Shi
  • Linghe Kong

Large Language Models (LLMs) have greatly pushed forward advancements in natural language processing, yet their high memory and computational demands hinder practical deployment. Binarization, as an effective compression technique, can shrink model weights to just 1 bit, significantly reducing the high demands on computation and memory. However, current binarization methods struggle to narrow the distribution gap between binarized and full-precision weights, while also overlooking the column deviation in LLM weight distribution. To tackle these issues, we propose ARB-LLM, a novel 1-bit post-training quantization (PTQ) technique tailored for LLMs. To narrow the distribution shift between binarized and full-precision weights, we first design an alternating refined binarization (ARB) algorithm to progressively update the binarization parameters, which significantly reduces the quantization error. Moreover, considering the pivotal role of calibration data and the column deviation in LLM weights, we further extend ARB to ARB-X and ARB-RC. In addition, we refine the weight partition strategy with a column-group bitmap (CGB), which further enhances performance. Equipping ARB-X and ARB-RC with CGB, we obtain ARB-LLM_X and ARB-LLM_RC respectively, which significantly outperform state-of-the-art (SOTA) binarization methods for LLMs. As a binary PTQ method, our ARB-LLM_RC is the first to surpass FP16 models of the same size. Code: https://github.com/ZHITENGLI/ARB-LLM.
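
A hedged sketch of an alternating binarization loop in the spirit of ARB, assuming a simple per-column scale/shift model W ≈ alpha·B + mu with closed-form least-squares updates; the paper's exact refinements (ARB-X, ARB-RC, CGB) are not reproduced here.

```python
import numpy as np

def arb_binarize(W, iters=5):
    """Toy loop: alternately refine scale alpha, shift mu, signs B per column."""
    mu = W.mean(axis=0, keepdims=True)
    B = np.sign(W - mu)
    for _ in range(iters):
        alpha = ((W - mu) * B).mean(axis=0, keepdims=True)  # LS-optimal scale
        mu = (W - alpha * B).mean(axis=0, keepdims=True)    # refreshed shift
        B = np.sign(W - mu)                                 # refreshed signs
    return alpha, B, mu

rng = np.random.default_rng(0)
W = rng.standard_normal((128, 128))
alpha, B, mu = arb_binarize(W)
rel_err = np.linalg.norm(W - (alpha * B + mu)) / np.linalg.norm(W)
print(f"relative reconstruction error: {rel_err:.3f}")
```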

ICML Conference 2025 Conference Paper

BiMaCoSR: Binary One-Step Diffusion Model Leveraging Flexible Matrix Compression for Real Super-Resolution

  • Kai Liu 0034
  • Kaicheng Yang
  • Zheng Chen 0014
  • Zhiteng Li
  • Yong Guo
  • Wenbo Li 0001
  • Linghe Kong
  • Yulun Zhang 0001

While super-resolution (SR) methods based on diffusion models (DM) have demonstrated inspiring performance, their deployment is impeded by heavy memory and computation requirements. Recent research applies two kinds of methods to compress or accelerate DMs. One compresses the DM to 1 bit, aka binarization, alleviating storage and computation pressure. The other distills the multi-step DM into a single step, significantly speeding up the inference process. Nonetheless, it remains infeasible to deploy DMs on resource-limited edge devices. To address this problem, we propose BiMaCoSR, which combines binarization and one-step distillation to obtain extreme compression and acceleration. To prevent the catastrophic collapse of the model caused by binarization, we propose a sparse matrix branch (SMB) and a low-rank matrix branch (LRMB). Both auxiliary branches pass full-precision (FP) information, but in different ways. SMB absorbs the extreme values and its output is high rank, carrying abundant FP information, whereas LRMB is inspired by LoRA and is initialized with the top-r SVD components, outputting a low-rank representation. The computation and storage overhead of our proposed branches can be safely ignored. Comprehensive comparison experiments show that BiMaCoSR outperforms current state-of-the-art binarization methods and gains competitive performance compared with the FP one-step model. Moreover, we achieve excellent compression and acceleration: BiMaCoSR achieves a 23.8x compression ratio and a 27.4x speedup ratio compared to its FP counterpart. Our code and model are available at https://github.com/Kai-Liu001/BiMaCoSR
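
An illustrative decomposition (all names and thresholds ours) showing how a binarized weight can be supplemented by a sparse branch that absorbs extreme values and a LoRA-style low-rank branch initialized from the top-r SVD components:

```python
import numpy as np

rng = np.random.default_rng(0)
W = rng.standard_normal((256, 256))          # toy full-precision weight

# binarized main branch
alpha = np.abs(W).mean()
W_bin = alpha * np.sign(W)

# sparse branch: keep only extreme-magnitude entries in full precision
thresh = np.quantile(np.abs(W), 0.99)
W_sparse = np.where(np.abs(W) > thresh, W - W_bin, 0.0)

# low-rank branch: LoRA-style init from top-r SVD of the residual
r = 8
U, S, Vt = np.linalg.svd(W - W_bin - W_sparse, full_matrices=False)
L, R = U[:, :r] * S[:r], Vt[:r]

approx = W_bin + W_sparse + L @ R
print(np.linalg.norm(W - approx) / np.linalg.norm(W))   # residual error
```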

NeurIPS Conference 2025 Conference Paper

First SFT, Second RL, Third UPT: Continual Improving Multi-Modal LLM Reasoning via Unsupervised Post-Training

  • Lai Wei
  • Yuting Li
  • Chen Wang
  • Yue Wang
  • Linghe Kong
  • Weiran Huang
  • Lichao Sun

Improving Multi-modal Large Language Models (MLLMs) in the post-training stage typically relies on supervised fine-tuning (SFT) or reinforcement learning (RL), which require expensive, manually annotated multi-modal data, an ultimately unsustainable resource. This limitation has motivated growing interest in unsupervised paradigms as a third stage of post-training after SFT and RL. While recent efforts have explored this direction, their methods are complex and difficult to iterate. To address this, we propose MM-UPT, a simple yet effective framework for unsupervised post-training of MLLMs, enabling continual self-improvement without any external supervision. The training method of MM-UPT builds upon GRPO, replacing traditional reward signals with a self-rewarding mechanism based on majority voting over multiple sampled responses. Our experiments demonstrate that this training method effectively improves the reasoning ability of Qwen2.5-VL-7B (e.g., 66.3% → 72.9% on MathVista, 62.9% → 68.7% on We-Math), using standard datasets without ground-truth labels. To further explore scalability, we extend our framework to a data self-generation setting, designing two strategies that prompt the MLLM to synthesize new training samples on its own. Additional experiments show that combining these synthetic data with the unsupervised training method can also boost performance, highlighting a promising approach for scalable self-improvement. Overall, MM-UPT offers a new paradigm for autonomous enhancement of MLLMs, serving as a critical third step after initial SFT and RL in the absence of external supervision. Our code is available at https://github.com/waltonfuture/MM-UPT
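
A toy sketch of the self-rewarding signal, assuming (our simplification) that sampled answers can be compared as strings: the majority answer acts as the pseudo-label and agreement earns reward, which would then feed a GRPO-style update.

```python
from collections import Counter

def majority_vote_rewards(sampled_answers):
    """Reward responses that agree with the majority answer (toy version)."""
    majority, _ = Counter(sampled_answers).most_common(1)[0]
    return [1.0 if a == majority else 0.0 for a in sampled_answers]

answers = ["42", "42", "41", "42", "40"]    # multiple sampled responses
print(majority_vote_rewards(answers))       # [1.0, 1.0, 0.0, 1.0, 0.0]
```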

IROS Conference 2025 Conference Paper

Learning to Generate Vectorized Maps at Intersections with Multiple Roadside Cameras

  • Quanxin Zheng
  • Miao Fan
  • Shengtong Xu
  • Linghe Kong
  • Haoyi Xiong

Vectorized maps are indispensable for precise navigation and the safe operation of autonomous vehicles. Traditional methods for constructing these maps fall into two categories: offline techniques, which rely on expensive, labor-intensive LiDAR data collection and manual annotation, and online approaches that use onboard cameras to reduce costs but suffer from limited performance, especially at complex intersections. To bridge this gap, we introduce the Multiple Roadside Camera-based Vectorized Map approach, MRC-VMap, a cost-effective, vision-centric, end-to-end neural network designed to generate high-definition vectorized maps directly at intersections. Leveraging existing roadside surveillance cameras, MRC-VMap directly converts time-aligned, multi-directional images into vectorized map representations. This integrated solution lowers the need for additional intermediate modules, such as separate feature extraction and Bird's-Eye View (BEV) conversion steps, thus reducing both computational overhead and error propagation. Moreover, the use of multiple camera views enhances mapping completeness, mitigates occlusions, and provides robust performance under practical deployment constraints. Extensive experiments conducted on 4,000 intersections across 4 major metropolitan areas in China demonstrate that MRC-VMap not only outperforms state-of-the-art online methods but also achieves accuracy comparable to high-cost LiDAR-based approaches, thereby offering a scalable and efficient solution for modern autonomous navigation systems.

IJCAI Conference 2025 Conference Paper

Multi-player Multi-armed Bandits with Delayed Feedback

  • Jingqi Fan
  • Zilong Wang
  • Shuai Li
  • Linghe Kong

Multi-player multi-armed bandits (MP-MAB) have been extensively studied due to their application in cognitive radio networks. In this setting, multiple players simultaneously select arms and instantly receive feedback. However, in realistic decentralized networks, feedback is often delayed due to sensing latency and signal processing. Without a central coordinator, explicit communication is impossible, and delayed feedback disrupts implicit coordination, since coordination depends on synchronous observations. As a result, collisions are frequent and system performance degrades significantly. In this paper, we propose an algorithm for MP-MAB with stochastically delayed feedback. Each player independently maintains an estimate of the optimal arm set based on its own delayed rewards and pulls arms only from this set, which is, with high probability, identical across players, thus avoiding collisions. The identical arm set also enables implicit communication, allowing players to utilize the exploration results of others. We establish a regret upper bound and derive a lower bound to prove the algorithm is near-optimal. Numerical experiments on both synthetic and real-world datasets validate the effectiveness of our algorithm.
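
A minimal sketch (our notation, simplified confidence radii) of the shared-set idea: from its own delayed observations, each player keeps the arms that are still plausibly in the top-k, a set that coincides across players with high probability.

```python
import numpy as np

def plausible_optimal_set(counts, means, t, k):
    """Arms not yet provably outside the top-k, via UCB/LCB overlap (toy)."""
    rad = np.sqrt(2 * np.log(max(t, 2)) / np.maximum(counts, 1))
    ucb, lcb = means + rad, means - rad
    kth_best_lcb = np.sort(lcb)[-k]          # LCB of the k-th best arm
    return np.flatnonzero(ucb >= kth_best_lcb)

counts = np.full(5, 500)                     # delayed-but-arrived samples
means = np.array([0.90, 0.85, 0.40, 0.35, 0.20])
print(plausible_optimal_set(counts, means, t=2500, k=2))   # [0 1]
```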

NeurIPS Conference 2025 Conference Paper

PANTHER: Generative Pretraining Beyond Language for Sequential User Behavior Modeling

  • Guilin Li
  • Yun Zhang
  • Xiuyuan Chen
  • Chengqi Li
  • Bo Wang
  • Linghe Kong
  • Wenjia Wang
  • Weiran Huang

Large language models (LLMs) have shown that generative pretraining can distill vast world knowledge into compact token representations. While LLMs encapsulate extensive world knowledge, they remain limited in modeling the behavioral knowledge contained within user interaction histories. User behavior forms a distinct modality, where each action, defined by multi-dimensional attributes such as time, context, and transaction type, constitutes a behavioral token. Modeling these high-cardinality, sparse, and irregular sequences is challenging, and discriminative models often falter under limited supervision. To bridge this gap, we extend generative pretraining to user behavior, learning transferable representations from unlabeled behavioral data analogous to how LLMs learn from text. We present PANTHER, a hybrid generative-discriminative framework that unifies user behavior pretraining and downstream adaptation, enabling large-scale sequential user representation learning and real-time inference. PANTHER introduces: (1) Structured Tokenization to compress multi-dimensional transaction attributes into an interpretable vocabulary; (2) a Sequence Pattern Recognition Module (SPRM) for modeling periodic transaction motifs; (3) a Unified User-Profile Embedding that fuses static demographics with dynamic transaction histories, enabling both personalized predictions and population-level knowledge transfer; and (4) real-time scalability enabled by offline caching of pre-trained embeddings for millisecond-level inference. Fully deployed and operational online at WeChat Pay, PANTHER delivers a 25.6% boost in next-transaction prediction HitRate@1 and a 38.6% relative improvement in fraud detection recall over baselines. Cross-domain evaluations on public benchmarks (CCT, MBD, MovieLens-1M, Yelp) show strong generalization, achieving up to 21% HitRate@1 gains over transformer baselines, establishing PANTHER as a scalable, high-performance framework for industrial sequential user behavior modeling.
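
A hypothetical sketch of structured tokenization, with field names and bin sizes of our own choosing: discretized transaction attributes are packed into a single behavioral token id drawn from a compact, interpretable vocabulary.

```python
def behavior_token(hour, amount_bin, tx_type,
                   n_amount_bins=8, n_types=16):
    """Pack discretized (hour, amount bin, type) into one token id (toy)."""
    return (hour * n_amount_bins + amount_bin) * n_types + tx_type

# vocabulary stays compact: 24 * 8 * 16 = 3072 behavioral tokens
print(behavior_token(hour=14, amount_bin=3, tx_type=5))   # 1845
```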

ICML Conference 2025 Conference Paper

RUN: Reversible Unfolding Network for Concealed Object Segmentation

  • Chunming He
  • Rihan Zhang
  • Fengyang Xiao
  • Chengyu Fang 0001
  • Longxiang Tang
  • Yulun Zhang 0001
  • Linghe Kong
  • Deng-Ping Fan

Concealed object segmentation (COS) is a challenging problem that focuses on identifying objects that are visually blended into their background. Existing methods often employ reversible strategies to concentrate on uncertain regions, but operate only at the mask level, overlooking the valuable information in the RGB domain. To address this, we propose a Reversible Unfolding Network (RUN). RUN formulates the COS task as a foreground-background separation process and incorporates an extra residual sparsity constraint to minimize segmentation uncertainties. The optimization solution of the proposed model is unfolded into a multistage network, allowing the original fixed parameters to become learnable. Each stage of RUN consists of two reversible modules: the Segmentation-Oriented Foreground Separation (SOFS) module and the Reconstruction-Oriented Background Extraction (ROBE) module. SOFS applies the reversible strategy at the mask level and introduces Reversible State Space to capture non-local information. ROBE extends this to the RGB domain, employing a reconstruction network to address conflicting foreground and background regions identified as distortion-prone areas, which arise from their separate estimation by independent modules. As the stages progress, RUN gradually facilitates reversible modeling of foreground and background in both the mask and RGB domains, reducing false-positive and false-negative regions. Extensive experiments demonstrate the superior performance of RUN and underscore the promise of unfolding-based frameworks for COS and other high-level vision tasks. Code is available at https://github.com/ChunmingHe/RUN

ECAI Conference 2025 Conference Paper

VariGen: Controllable Image Generation Via Personalized Diffusion Framework

  • Mingxin Cai
  • Zexu Huang
  • Yuchen Li 0006
  • Linghe Kong
  • Guihai Chen

Diffusion models have achieved remarkable progress in text-to-image generation, enabling the rise of personalized models. A key challenge in personalized generation is to provide users with precise control while ensuring high fidelity to the content. To address this, we introduce VariGen: a framework that empowers users to achieve fine-grained, layout-controllable personalized image generation. VariGen employs the Variational Detail-Aware Feature Extractor to capture intricate details from reference subjects and the Dual Layout Control Mechanism to integrate layout specifications seamlessly into the generation process. We demonstrate that VariGen achieves superior performance through extensive experimentation, offering unparalleled creative freedom and fidelity. To our knowledge, this is the first work to enable users to “create anything, anywhere” with such precision and flexibility.

NeurIPS Conference 2024 Conference Paper

2DQuant: Low-bit Post-Training Quantization for Image Super-Resolution

  • Kai Liu
  • Haotong Qin
  • Yong Guo
  • Xin Yuan
  • Linghe Kong
  • Guihai Chen
  • Yulun Zhang

Low-bit quantization has become widespread for compressing image super-resolution (SR) models for edge deployment, which allows advanced SR models to enjoy compact low-bit parameters and efficient integer/bitwise constructions for storage compression and inference acceleration, respectively. However, it is notorious that low-bit quantization degrades the accuracy of SR models compared to their full-precision (FP) counterparts. Despite several efforts to alleviate the degradation, transformer-based SR models still suffer severe degradation due to their distinctive activation distribution. In this work, we present a dual-stage low-bit post-training quantization (PTQ) method for image super-resolution, namely 2DQuant, which achieves efficient and accurate SR under low-bit quantization. The proposed method first investigates the weight and activation distributions and finds that they are characterized by coexisting symmetry and asymmetry, as well as long tails. Specifically, we propose Distribution-Oriented Bound Initialization (DOBI), which uses different search strategies to find coarse bounds for quantizers. To obtain refined quantizer parameters, we further propose Distillation Quantization Calibration (DQC), which employs a distillation approach to make the quantized model learn from its FP counterpart. Through extensive experiments on different bit-widths and scaling factors, DOBI alone can reach state-of-the-art (SOTA) performance, while after stage two our method surpasses existing PTQ methods in both metrics and visual quality. 2DQuant gains an increase in PSNR as high as 4.52 dB on Set5 (x2) compared with SOTA when quantized to 2-bit, and enjoys a 3.60x compression ratio and 5.08x speedup ratio. The code and models are available at https://github.com/Kai-Liu001/2DQuant
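
A simplified sketch of a distribution-oriented bound search in the spirit of DOBI, with our own brute-force objective (uniform-quantization MSE over progressively shrunken ranges); the paper's actual search strategies and the DQC distillation stage are omitted.

```python
import numpy as np

def quantize(x, lo, hi, bits=2):
    """Uniform quantization to 2^bits levels within [lo, hi]."""
    levels = 2 ** bits - 1
    q = np.clip(x, lo, hi)
    return np.round((q - lo) / (hi - lo) * levels) / levels * (hi - lo) + lo

def search_bounds(x, bits=2, steps=64):
    """Scan shrunken ranges; keep the bounds with minimal quantization MSE."""
    best, best_err = None, np.inf
    for frac in np.linspace(0.5, 1.0, steps):
        lo, hi = x.min() * frac, x.max() * frac
        err = np.mean((x - quantize(x, lo, hi, bits)) ** 2)
        if err < best_err:
            best, best_err = (lo, hi), err
    return best, best_err

rng = np.random.default_rng(0)
acts = np.concatenate([rng.standard_normal(10_000),
                       8 * rng.standard_normal(100)])   # long-tailed (toy)
(lo, hi), err = search_bounds(acts)
print(f"bounds=({lo:.2f}, {hi:.2f}), mse={err:.4f}")
```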

NeurIPS Conference 2024 Conference Paper

Binarized Diffusion Model for Image Super-Resolution

  • Zheng Chen
  • Haotong Qin
  • Yong Guo
  • Xiongfei Su
  • Xin Yuan
  • Linghe Kong
  • Yulun Zhang

Advanced diffusion models (DMs) perform impressively in image super-resolution (SR), but their high memory and computational costs hinder deployment. Binarization, an ultra-compression algorithm, offers the potential for effectively accelerating DMs. Nonetheless, due to the model structure and the multi-step iterative attribute of DMs, existing binarization methods result in significant performance degradation. In this paper, we introduce a novel binarized diffusion model, BI-DiffSR, for image SR. First, for the model structure, we design a UNet architecture optimized for binarization. We propose consistent-pixel-downsample (CP-Down) and consistent-pixel-upsample (CP-Up) modules to maintain dimension consistency and facilitate full-precision information transfer. Meanwhile, we design channel-shuffle-fusion (CS-Fusion) to enhance feature fusion in skip connections. Second, for the activation differences across timesteps, we design timestep-aware redistribution (TaR) and a timestep-aware activation function (TaA). TaR and TaA dynamically adjust the distribution of activations based on the timestep, improving the flexibility and representation ability of the binarized module. Comprehensive experiments demonstrate that our BI-DiffSR outperforms existing binarization methods. Code is released at https://github.com/zhengchen1999/BI-DiffSR

NeurIPS Conference 2024 Conference Paper

CondTSF: One-line Plugin of Dataset Condensation for Time Series Forecasting

  • Jianrong Ding
  • Zhanyu Liu
  • Guanjie Zheng
  • Haiming Jin
  • Linghe Kong

Dataset condensation is a recently emerged technique that generates a small synthetic dataset that can be used to train deep neural networks (DNNs), lowering storage and training costs. The objective of dataset condensation is to ensure that a model trained with the synthetic dataset performs comparably to a model trained with the full dataset. However, existing methods predominantly concentrate on classification tasks, posing challenges in their adaptation to time series forecasting (TS-forecasting). This challenge arises from disparities in how synthetic data is evaluated. In classification, the synthetic data is considered well-distilled if the model trained with the full dataset and the model trained with the synthetic dataset yield identical labels for the same input, regardless of variations in the output logits distribution. Conversely, in TS-forecasting, the effectiveness of synthetic data distillation is determined by the distance between the predictions of the two models; the synthetic data is deemed well-distilled only when all data points within the predictions are similar. Consequently, TS-forecasting has a more rigorous evaluation methodology than classification. To mitigate this gap, we theoretically analyze the optimization objective of dataset condensation for TS-forecasting and, based on our analysis, propose a new one-line plugin designated Dataset Condensation for Time Series Forecasting (CondTSF). Plugging CondTSF into previous dataset condensation methods reduces the distance between the predictions of the model trained with the full dataset and the model trained with the synthetic dataset, thereby enhancing performance. We conduct extensive experiments on eight commonly used time series datasets. CondTSF consistently improves the performance of all previous dataset condensation methods across all datasets, particularly at low condensing ratios.
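
A tiny contrast (toy metrics of our own) between the two evaluation views described above: classification needs only label agreement, while TS-forecasting penalizes any pointwise gap between the two models' predictions.

```python
import numpy as np

def classification_match(logits_full, logits_syn):
    """Well-distilled if argmax labels agree; logits may differ freely."""
    return np.mean(logits_full.argmax(-1) == logits_syn.argmax(-1))

def forecasting_gap(pred_full, pred_syn):
    """Well-distilled only if every predicted point is close (MSE)."""
    return np.mean((pred_full - pred_syn) ** 2)

rng = np.random.default_rng(0)
a = rng.standard_normal((32, 10))
b = a + 0.1 * rng.standard_normal((32, 10))   # small perturbation
print(classification_match(a, b))   # usually 1.0 despite logit noise
print(forecasting_gap(a, b))        # strictly positive, ~0.01
```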

IJCAI Conference 2024 Conference Paper

GS2P: A Generative Pre-trained Learning to Rank Model with Over-parameterization for Web-Scale Search (Extended Abstract)

  • Yuchen Li
  • Haoyi Xiong
  • Linghe Kong
  • Jiang Bian
  • Shuaiqiang Wang
  • Guihai Chen
  • Dawei Yin

While Learning to Rank (LTR) is widely employed in web search to prioritize pertinent webpages from retrieved content based on input queries, traditional LTR models stumble over two principal obstacles that lead to subpar performance: 1) the lack of well-annotated query-webpage pairs with ranking scores covering search queries across the popularity spectrum, and 2) ill-trained models incapable of inducing generalized representations for LTR, culminating in overfitting. To tackle these challenges, we propose a Generative Semi-supervised Pre-trained (GS2P) LTR model. Specifically, GS2P first generates pseudo-labels for unlabeled samples using tree-based LTR models after a series of co-training procedures, then learns the representations of query-webpage pairs with self-attentive transformers via both discriminative and generative losses. Finally, GS2P boosts LTR performance by incorporating Random Fourier Features to over-parameterize the models into the "interpolating regime", so as to enjoy the further descent of generalization errors with learned representations. We conduct extensive offline experiments on a publicly available dataset and a real-world dataset collected from a large-scale search engine. The results show that GS2P achieves the best performance on both datasets compared to baselines. We also deploy GS2P at a large-scale web search engine with realistic traffic, where we still observe significant improvement in real-world applications.
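
A minimal Random Fourier Features sketch using the generic RBF-kernel recipe (not necessarily the authors' exact configuration): learned representations are projected into a much wider random cosine basis, pushing the ranking head into the over-parameterized, interpolating regime.

```python
import numpy as np

def random_fourier_features(X, D, gamma=1.0, seed=0):
    """Approximate an RBF-kernel feature map with D random cosines."""
    rng = np.random.default_rng(seed)
    W = rng.normal(scale=np.sqrt(2 * gamma), size=(X.shape[1], D))
    b = rng.uniform(0, 2 * np.pi, size=D)
    return np.sqrt(2.0 / D) * np.cos(X @ W + b)

X = np.random.default_rng(1).standard_normal((100, 32))  # toy pair embeddings
Z = random_fourier_features(X, D=4096)   # far more features than samples
print(Z.shape)                           # (100, 4096)
```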

NeurIPS Conference 2024 Conference Paper

HyperPrism: An Adaptive Non-linear Aggregation Framework for Distributed Machine Learning over Non-IID Data and Time-varying Communication Links

  • Haizhou Du
  • Yijian Chen
  • Ryan Yang
  • Yuchen Li
  • Linghe Kong

While Distributed Machine Learning (DML) has been widely used to achieve decent performance, it is still challenging to take full advantage of data and devices distributed at multiple vantage points to adapt and learn. In particular, it is non-trivial to address two challenges of the linear aggregation framework: (1) heterogeneous learning data at different devices (i.e., non-IID data), resulting in model divergence, and (2) the limited ability of devices to reconcile model divergence under time-varying communication links. In this paper, we contribute a non-linear aggregation framework, HyperPrism, that leverages distributed mirror descent with averaging done in the mirror-descent dual space and adapts the degree of the Weighted Power Mean (WPM) used in each round. Moreover, HyperPrism can adaptively choose different mappings for different layers of the local model with a dedicated hypernetwork per device, achieving automatic optimization of DML in high-divergence settings. We perform rigorous analysis and experimental evaluations to demonstrate the effectiveness of adaptive, mirror-mapping DML. In particular, we extend the generalizability of existing related works and position them as special cases within HyperPrism. Our experimental results show that HyperPrism can improve convergence speed by up to 98.63% and scale well to more devices compared with the state-of-the-art, all with little additional computation overhead compared to traditional linear aggregation.
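
A minimal sketch of Weighted Power Mean aggregation, the knob HyperPrism adapts per round (the mirror-descent machinery and hypernetworks are omitted; restricting to positive parameter values is an assumption of this toy): p = 1 recovers ordinary linear averaging, and larger p interpolates toward the maximum.

```python
import numpy as np

def weighted_power_mean(params, weights, p):
    """WPM over positive parameters: (sum_i w_i * x_i^p)^(1/p)."""
    params, weights = np.asarray(params), np.asarray(weights)
    return (weights[:, None] * params ** p).sum(axis=0) ** (1.0 / p)

x = np.array([[1.0, 2.0], [3.0, 4.0]])   # two devices' (positive) params
w = np.array([0.5, 0.5])
print(weighted_power_mean(x, w, p=1))    # linear averaging: [2. 3.]
print(weighted_power_mean(x, w, p=4))    # biased toward larger values
```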

IJCAI Conference 2024 Conference Paper

MPGraf: a Modular and Pre-trained Graphformer for Learning to Rank at Web-scale (Extended Abstract)

  • Yuchen Li
  • Haoyi Xiong
  • Linghe Kong
  • Zeyi Sun
  • Hongyang Chen
  • Shuaiqiang Wang
  • Dawei Yin

Both Transformers and Graph Neural Networks (GNNs) have been used in learning to rank (LTR); however, they adhere to two distinct yet complementary problem formulations, i.e., ranking score regression based on query-webpage pairs and link prediction within query-webpage bipartite graphs, respectively. Though it is possible to pre-train GNNs or Transformers on source datasets and fine-tune them on sparsely annotated LTR datasets separately, the source-target distribution shifts across the pair and bipartite-graph domains make it extremely difficult to integrate these diverse models into a single LTR framework at web scale. We introduce the novel MPGraf model, which utilizes a modular and capsule-based pre-training approach, aiming to cohesively incorporate the regression capacities of Transformers and the link prediction capabilities of GNNs. We conduct extensive experiments to evaluate the performance of MPGraf using real-world datasets collected from large-scale search engines. The results show that MPGraf outperforms baseline algorithms on several major metrics. Further, we deploy and evaluate MPGraf atop a large-scale search engine with realistic web traffic via A/B tests, where we still observe significant improvement. MPGraf performs consistently in both offline and online evaluations.

ICLR Conference 2024 Conference Paper

Recursive Generalization Transformer for Image Super-Resolution

  • Zheng Chen 0014
  • Yulun Zhang 0001
  • Jinjin Gu
  • Linghe Kong
  • Xiaokang Yang 0001

Transformer architectures have exhibited remarkable performance in image super-resolution (SR). Owing to the quadratic computational complexity of self-attention (SA) in Transformers, existing methods tend to adopt SA in a local region to reduce overheads. However, the local design restricts global context exploitation, which is crucial for accurate image reconstruction. In this work, we propose the Recursive Generalization Transformer (RGT) for image SR, which can capture global spatial information and is suitable for high-resolution images. Specifically, we propose recursive-generalization self-attention (RG-SA). It recursively aggregates input features into representative feature maps, and then utilizes cross-attention to extract global information. Meanwhile, the channel dimensions of the attention matrices (query, key, and value) are further scaled to mitigate redundancy in the channel domain. Furthermore, we combine RG-SA with local self-attention to enhance the exploitation of global context, and propose hybrid adaptive integration (HAI) for module integration. HAI allows direct and effective fusion between features at different levels (local or global). Extensive experiments demonstrate that our RGT outperforms recent state-of-the-art methods quantitatively and qualitatively. Code and pre-trained models are available at https://github.com/zhengchen1999/RGT.
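
A rough NumPy simplification (ours) of the RG-SA idea: tokens are aggregated into a few representative tokens (here by plain average pooling rather than the paper's recursive aggregation), and every token then cross-attends to them, avoiding the quadratic attention map.

```python
import numpy as np

def softmax(x, axis=-1):
    e = np.exp(x - x.max(axis=axis, keepdims=True))
    return e / e.sum(axis=axis, keepdims=True)

def rg_attention(X, n_rep=16):
    """Cross-attend all N tokens to a few representative tokens (toy)."""
    N, d = X.shape
    # aggregation stub: average-pool tokens into n_rep representatives
    reps = X.reshape(n_rep, N // n_rep, d).mean(axis=1)   # (n_rep, d)
    attn = softmax(X @ reps.T / np.sqrt(d))               # (N, n_rep)
    return attn @ reps                                    # (N, d)

X = np.random.default_rng(0).standard_normal((4096, 64))
print(rg_attention(X).shape)   # (4096, 64); no 4096 x 4096 attention map
```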

NeurIPS Conference 2024 Conference Paper

UrbanDataLayer: A Unified Data Pipeline for Urban Science

  • Yiheng Wang
  • Tianyu Wang
  • Yuying Zhang
  • Hongji Zhang
  • Haoyu Zheng
  • Guanjie Zheng
  • Linghe Kong

The rapid progression of urbanization has generated a diverse array of urban data, facilitating significant advancements in urban science and urban computing. Current studies often tackle separate problems case by case using diverse data, e.g., air quality prediction and built-up area classification. This fragmented approach hinders the urban research field from advancing at the pace observed in Computer Vision and Natural Language Processing, for two primary reasons. On the one hand, diverse data processing steps lead to a lack of large-scale benchmarks and therefore decelerate iterative methodology improvement on a single problem. On the other hand, the disparity in multi-modal data formats hinders combining related modal data to stimulate more research findings. To address these challenges, we propose UrbanDataLayer (UDL), a suite of standardized data structures and pipelines for city data engineering that provides a unified data format for researchers. This allows researchers to easily build large-scale benchmarks and combine multi-modal data, thus expediting the development of multi-modal urban foundation models. To verify the effectiveness of our work, we present four distinct urban problem tasks utilizing the proposed data layer. UrbanDataLayer aims to enhance standardization and operational efficiency within the urban science research community. Examples and source code are available at https://github.com/SJTU-CILAB/udl

ICLR Conference 2024 Conference Paper

Xformer: Hybrid X-Shaped Transformer for Image Denoising

  • Jiale Zhang
  • Yulun Zhang 0001
  • Jinjin Gu
  • Jiahua Dong 0001
  • Linghe Kong
  • Xiaokang Yang 0001

In this paper, we present a hybrid X-shaped vision Transformer, named Xformer, which performs notably well on image denoising tasks. We explore strengthening the global representation of tokens from different scopes. In detail, we adopt two types of Transformer blocks. The spatial-wise Transformer block performs fine-grained local patch interactions across tokens defined along the spatial dimension. The channel-wise Transformer block performs direct global context interactions across tokens defined along the channel dimension. Based on this concurrent network structure, we design two branches to conduct these two interaction fashions. Within each branch, we employ an encoder-decoder architecture to capture multi-scale features. Besides, we propose the Bidirectional Connection Unit (BCU) to couple the learned representations from the two branches while providing enhanced information fusion. These joint designs make Xformer powerful at global information modeling in both the spatial and channel dimensions. Extensive experiments show that Xformer, under comparable model complexity, achieves state-of-the-art performance on synthetic and real-world image denoising tasks. We also provide code and models at https://github.com/gladzhang/Xformer.

ICLR Conference 2023 Conference Paper

Accurate Image Restoration with Attention Retractable Transformer

  • Jiale Zhang
  • Yulun Zhang 0001
  • Jinjin Gu
  • Yongbing Zhang 0002
  • Linghe Kong
  • Xin Yuan 0002

Recently, Transformer-based image restoration networks have achieved promising improvements over convolutional neural networks due to parameter-independent global interactions. To lower computational cost, existing works generally limit self-attention computation to non-overlapping windows. However, each group of tokens is always drawn from a dense area of the image. This is considered a dense attention strategy, since the interactions of tokens are restrained to dense regions, and it clearly results in restricted receptive fields. To address this issue, we propose the Attention Retractable Transformer (ART) for image restoration, which presents both dense and sparse attention modules in the network. The sparse attention module allows tokens from sparse areas to interact and thus provides a wider receptive field. Furthermore, the alternating application of dense and sparse attention modules greatly enhances the representation ability of the Transformer while providing retractable attention on the input image. We conduct extensive experiments on image super-resolution, denoising, and JPEG compression artifact reduction tasks. Experimental results validate that our proposed ART outperforms state-of-the-art methods on various benchmark datasets both quantitatively and visually. We also provide code and models at https://github.com/gladzhang/ART.
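
A toy, indices-only illustration of dense versus sparse token grouping as described above: dense groups take contiguous tokens, while sparse groups take strided tokens from across the whole image, widening the receptive field.

```python
import numpy as np

tokens = np.arange(16)                    # a tiny 1-D token sequence (toy)
dense_groups = tokens.reshape(4, 4)       # contiguous windows
sparse_groups = tokens.reshape(4, 4).T    # strided: every 4th token
print(dense_groups[0])    # [0 1 2 3]      one local, dense window
print(sparse_groups[0])   # [ 0  4  8 12]  tokens spread across the image
```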

NeurIPS Conference 2023 Conference Paper

Hierarchical Integration Diffusion Model for Realistic Image Deblurring

  • Zheng Chen
  • Yulun Zhang
  • Ding Liu
  • Bin Xia
  • Jinjin Gu
  • Linghe Kong
  • Xin Yuan

Diffusion models (DMs) have recently been introduced in image deblurring and exhibited promising performance, particularly in terms of details reconstruction. However, the diffusion model requires a large number of inference iterations to recover the clean image from pure Gaussian noise, which consumes massive computational resources. Moreover, the distribution synthesized by the diffusion model is often misaligned with the target results, leading to restrictions in distortion-based metrics. To address the above issues, we propose the Hierarchical Integration Diffusion Model (HI-Diff), for realistic image deblurring. Specifically, we perform the DM in a highly compacted latent space to generate the prior feature for the deblurring process. The deblurring process is implemented by a regression-based method to obtain better distortion accuracy. Meanwhile, the highly compact latent space ensures the efficiency of the DM. Furthermore, we design the hierarchical integration module to fuse the prior into the regression-based model from multiple scales, enabling better generalization in complex blurry scenarios. Comprehensive experiments on synthetic and real-world blur datasets demonstrate that our HI-Diff outperforms state-of-the-art methods. Code and trained models are available at https://github.com/zhengchen1999/HI-Diff

NeurIPS Conference 2022 Conference Paper

Cross Aggregation Transformer for Image Restoration

  • Zheng Chen
  • Yulun Zhang
  • Jinjin Gu
  • Yongbing Zhang
  • Linghe Kong
  • Xin Yuan

Recently, Transformer architectures have been introduced into image restoration to replace convolutional neural networks (CNNs), with surprising results. Considering the high computational complexity of Transformers with global attention, some methods use local square windows to limit the scope of self-attention. However, these methods lack direct interaction among different windows, which limits the establishment of long-range dependencies. To address the above issue, we propose a new image restoration model, Cross Aggregation Transformer (CAT). The core of our CAT is Rectangle-Window Self-Attention (Rwin-SA), which utilizes horizontal and vertical rectangle window attention in different heads in parallel to expand the attention area and aggregate features across different windows. We also introduce the Axial-Shift operation for different window interactions. Furthermore, we propose the Locality Complementary Module to complement the self-attention mechanism, which incorporates the inductive bias of CNNs (e.g., translation invariance and locality) into the Transformer, enabling global-local coupling. Extensive experiments demonstrate that our CAT outperforms recent state-of-the-art methods on several image restoration applications. The code and models are available at https://github.com/zhengchen1999/CAT
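
A small sketch (helper of our own) of rectangle-window partitioning: horizontal heads see wide, short windows while vertical heads see tall, narrow ones, so the attention areas of different heads cross each other.

```python
import numpy as np

def rect_windows(feat, wh, ww):
    """Split an (H, W, C) map into non-overlapping wh x ww windows (toy)."""
    H, W, C = feat.shape
    return (feat.reshape(H // wh, wh, W // ww, ww, C)
                .transpose(0, 2, 1, 3, 4)
                .reshape(-1, wh * ww, C))

x = np.zeros((32, 32, 8))
print(rect_windows(x, 4, 16).shape)   # horizontal heads: (16, 64, 8)
print(rect_windows(x, 16, 4).shape)   # vertical heads:   (16, 64, 8)
```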

AAAI Conference 2020 Conference Paper

Tensor FISTA-Net for Real-Time Snapshot Compressive Imaging

  • Xiaochen Han
  • Bo Wu
  • Zheng Shou
  • Xiao-Yang Liu
  • Yimeng Zhang
  • Linghe Kong

Snapshot compressive imaging (SCI) cameras capture high-speed videos by compressing multiple video frames into a single measurement frame. However, reconstructing video frames from the compressed measurement frame is challenging. Existing state-of-the-art reconstruction algorithms suffer from low reconstruction quality or heavy time consumption, making them unsuitable for real-time applications. In this paper, exploiting the powerful learning ability of deep neural networks (DNNs), we propose a novel Tensor Fast Iterative Shrinkage-Thresholding Algorithm Net (Tensor FISTA-Net) as a decoder for SCI video cameras. Tensor FISTA-Net not only learns the sparsest representation of the video frames through convolution layers, but also reduces reconstruction time significantly through tensor calculations. Experimental results on synthetic datasets show that Tensor FISTA-Net achieves an average PSNR improvement of 1.63–3.89 dB over state-of-the-art algorithms. Moreover, Tensor FISTA-Net takes less than 2 seconds of running time and a 12 MB memory footprint, making it practical for real-time IoT applications.
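
For reference, the classic FISTA iteration that networks like this unfold into learnable layers; this plain NumPy version solves min_x 0.5*||Ax - y||^2 + lam*||x||_1 with soft-thresholding (the paper replaces such steps with convolutional, tensor-based layers).

```python
import numpy as np

def fista(A, y, lam, iters=200):
    """Textbook FISTA for the LASSO objective above."""
    L = np.linalg.norm(A, 2) ** 2            # Lipschitz const. of gradient
    x = z = np.zeros(A.shape[1])
    t = 1.0
    for _ in range(iters):
        g = z - A.T @ (A @ z - y) / L                            # gradient step
        x_new = np.sign(g) * np.maximum(np.abs(g) - lam / L, 0)  # shrinkage
        t_new = (1 + np.sqrt(1 + 4 * t * t)) / 2
        z = x_new + (t - 1) / t_new * (x_new - x)                # momentum
        x, t = x_new, t_new
    return x

rng = np.random.default_rng(0)
A = rng.standard_normal((64, 256))
x_true = np.zeros(256); x_true[:5] = 1.0                 # 5-sparse signal
x_hat = fista(A, A @ x_true, lam=0.1)
print(np.count_nonzero(np.abs(x_hat) > 1e-2))  # close to 5 significant entries
```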

AAAI Conference 2019 Conference Paper

Optimizing in the Dark: Learning an Optimal Solution through a Simple Request Interface

  • Qiao Xiang
  • Haitao Yu
  • James Aspnes
  • Franck Le
  • Linghe Kong
  • Y. Richard Yang

Network resource reservation systems are being developed and deployed, driven by the demand for and substantial benefits of providing performance predictability for modern distributed applications. However, existing systems suffer limitations: they either are inefficient in finding the optimal resource reservation, or cause private information (e.g., from the network infrastructure) to be exposed (e.g., to the user). In this paper, we design BoxOpt, a novel system that leverages efficient oracle construction techniques from optimization and learning theory to automatically and swiftly learn the optimal resource reservations without exchanging any private information between the network and the user. We implement a prototype of BoxOpt and demonstrate its efficiency and efficacy via extensive experiments using a real network topology and trace. Results show that (1) BoxOpt has a 100% correctness ratio, and (2) for 95% of requests, BoxOpt learns the optimal resource reservation within 13 seconds.
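
A toy sketch of learning through a simple request interface, where the accept/reject shape of the interface is our assumption: bisection over a private feasibility oracle finds the largest acceptable reservation without exposing any internal information.

```python
def max_feasible(request_ok, lo=0.0, hi=100.0, tol=0.01):
    """Bisect over a private accept/reject oracle for the best reservation."""
    while hi - lo > tol:
        mid = (lo + hi) / 2
        lo, hi = (mid, hi) if request_ok(mid) else (lo, mid)
    return lo

capacity = 37.5                                          # hidden from the user
print(round(max_feasible(lambda r: r <= capacity), 1))   # 37.5
```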