Author name cluster

Tong Yang

Possible papers associated with this exact author name in Arrow. This page groups case-insensitive exact name matches and is not a full identity disambiguation profile.

20 papers

2 author rows

AAAI Conference 2026 Conference Paper

KeepKV: Achieving Periodic Lossless KV Cache Compression for Efficient LLM Inference

Yuxuan Tian
Zihan Wang
Yebo Peng
Aomufei Yuan
Zhiming Wang
Bairen Yi
Xin Liu
Yong Cui

Efficient inference of large language models (LLMs) is hindered by an ever-growing key-value (KV) cache, making KV cache compression a critical research direction. Traditional methods selectively evict less important KV cache entries, which leads to information loss and hallucinations. Recently, merging-based strategies have been explored to retain more information by merging KV pairs that would be discarded; however, these existing approaches inevitably introduce inconsistencies in attention distributions before and after merging, causing degraded generation quality. To overcome this challenge, we propose KeepKV, a novel adaptive KV cache merging method designed to preserve performance under strict memory constraints, achieving single-step lossless compression and providing error bounds for multi-step compression. KeepKV introduces the Electoral Votes mechanism that records merging history and adaptively adjusts attention scores. Moreover, it further leverages a novel Zero Inference-Perturbation Merging method, compensating for attention loss resulting from cache merging. Extensive experiments on various benchmarks and LLM architectures demonstrate that KeepKV substantially reduces memory usage while successfully retaining essential context information, achieving over 2 times inference throughput improvement and maintaining superior generation quality even with only 10% KV cache budgets.

PDF Details DOI

AAAI Conference 2026 Conference Paper

RatioSketch: Towards More Accurate Frequency Estimation in Data Streams via a Lightweight Neural Network

Mengbo Wang
Zhuochen Fan
Dayu Wang
Guorui Xie
Qing Li
Zeyu Luan
Yong Jiang
Tong Yang

Sketch-based solutions are widely used to estimate item frequencies in infinite data streams. Traditional hand-crafted sketches face the bottleneck of further eliminating errors because they cannot fully utilize the data stream distribution. Although recent neural sketches represented by MetaSketch and LegoSketch have improved generalization capabilities, they face bottlenecks such as high computational overhead and parameter sensitivity. Meanwhile, they ignore load information, fail to fully utilize the local information in hand-crafted sketches, and do not focus on the frequent items that are usually more important in data streams. In this paper, we propose RatioSketch, a novel lightweight neural network correction framework that synergizes the advantages of hand-crafted sketches and neural sketches in a "micro-correction" paradigm. The key idea is to retain the efficient underlying data structure of the hand-crafted sketch and to build a neural correction layer in its output space. We select multiple representative hand-crafted sketches as use cases to study the correction performance of RatioSketch on them. Extensive experimental evaluations on several real-world datasets show that RatioSketch-corrected sketches achieve consistently higher estimation accuracy than their uncorrected counterparts, as well as outperforming neural baselines such as MetaSketch and LegoSketch under identical memory budgets.
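As a rough illustration of the "hand-crafted sketch plus output-space correction" idea, the following Python sketch implements a Count-Min sketch with a pluggable correction hook. Count-Min is chosen here only because it is a well-known hand-crafted sketch; the abstract does not say which sketches were selected. The class name, the `corrected_estimate` helper, and the identity `correction` default are all hypothetical stand-ins, and no neural network is involved.

```python
import hashlib


class CountMinSketch:
    """Count-Min sketch: depth rows of width counters, one hash per row."""

    def __init__(self, width=64, depth=4):
        self.width = width
        self.depth = depth
        self.rows = [[0] * width for _ in range(depth)]

    def _index(self, item, row):
        # Derive a per-row hash by salting the item with the row number.
        data = f"{row}:{item}".encode("utf-8")
        digest = hashlib.blake2b(data, digest_size=8).digest()
        return int.from_bytes(digest, "big") % self.width

    def add(self, item, count=1):
        for r in range(self.depth):
            self.rows[r][self._index(item, r)] += count

    def estimate(self, item):
        # Collisions only inflate counters, so the minimum never undercounts.
        return min(self.rows[r][self._index(item, r)] for r in range(self.depth))


def corrected_estimate(sketch, item, correction=lambda raw: raw):
    """Apply a correction function to the raw estimate (hypothetical hook).

    The identity default is a placeholder for where a learned correction
    could act on the sketch output; it is not the method from the paper.
    """
    return correction(sketch.estimate(item))


stream = ["a"] * 50 + ["b"] * 5 + ["c"]
sketch = CountMinSketch()
for token in stream:
    sketch.add(token)
print(corrected_estimate(sketch, "a"))
```

The underlying counters are left untouched and only the query result passes through the hook, which mirrors the abstract's description of retaining the sketch's data structure while correcting in its output space.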

PDF Details DOI

UAI Conference 2025 Conference Paper

Beyond Invisibility: Learning Robust Visible Watermarks for Stronger Copyright Protection

Tianci Liu 0003
Tong Yang
Quan Zhang
Qi Lei

As AI advances, copyrighted content faces growing risk of unauthorized use, whether through model training or direct misuse. Building upon invisible adversarial perturbation, recent works developed copyright protections against specific AI techniques that are misused, such as unauthorized personalization through DreamBooth. However, these methods offer only short-term security, as they require retraining whenever the underlying model architectures change. To establish long-term protection aiming at better robustness, we go beyond invisible perturbation and propose a universal approach that embeds visible, hard-to-remove watermarks into images. Grounded in a new probabilistic and inverse problem-based formulation, our framework maximizes the discrepancy between the optimal reconstruction and the original content. We develop an effective and efficient approximation algorithm to circumvent an intractable bi-level optimization. Experimental results demonstrate the superiority of our approach across diverse scenarios.

Details

ICML Conference 2025 Conference Paper

Continuous Semi-Implicit Models

Longlin Yu
Jiajun Zha
Tong Yang
Tianyu Xie 0001
Xiangyu Zhang
S. -H. Gary Chan
Cheng Zhang

Semi-implicit distributions have shown great promise in variational inference and generative modeling. Hierarchical semi-implicit models, which stack multiple semi-implicit layers, enhance the expressiveness of semi-implicit distributions and can be used to accelerate diffusion models given pretrained score networks. However, their sequential training often suffers from slow convergence. In this paper, we introduce CoSIM, a continuous semi-implicit model that extends hierarchical semi-implicit models into a continuous framework. By incorporating a continuous transition kernel, CoSIM enables efficient, simulation-free training. Furthermore, we show that CoSIM achieves consistency with a carefully designed transition kernel, offering a novel approach for multistep distillation of generative models at the distributional level. Extensive experiments on image generation demonstrate that CoSIM performs on par with or better than existing diffusion model acceleration methods, achieving superior performance on FD-DINOv2.

Details

NeurIPS Conference 2025 Conference Paper

Exploration from a Primal-Dual Lens: Value-Incentivized Actor-Critic Methods for Sample-Efficient Online RL

Tong Yang
Bo Dai
Lin Xiao
Yuejie Chi

Online reinforcement learning (RL) with complex function approximations such as transformers and deep neural networks plays a significant role in the modern practice of artificial intelligence. Despite its popularity and importance, balancing the fundamental trade-off between exploration and exploitation remains a long-standing challenge; in particular, we still lack efficient and practical schemes that are backed by theoretical performance guarantees. Motivated by recent developments in exploration via optimistic regularization, this paper provides an interpretation of the principle of optimism through the lens of primal-dual optimization. From this fresh perspective, we set forth a new value-incentivized actor-critic (VAC) method, which optimizes a single easy-to-optimize objective integrating exploration and exploitation: it promotes state-action and policy estimates that are both consistent with collected data transitions and result in higher value functions. Theoretically, the proposed VAC method has near-optimal regret guarantees under linear Markov decision processes (MDPs) in both finite-horizon and infinite-horizon settings, which can be extended to the general function approximation setting under appropriate assumptions.

PDF Details

NeurIPS Conference 2025 Conference Paper

Multi-head Transformers Provably Learn Symbolic Multi-step Reasoning via Gradient Descent

Tong Yang
Yu Huang
Yingbin Liang
Yuejie Chi

Transformers have demonstrated remarkable capabilities in multi-step reasoning tasks. However, understanding of the underlying mechanisms by which they acquire these abilities through training remains limited, particularly from a theoretical standpoint. This work investigates how transformers learn to solve symbolic multi-step reasoning problems through chain-of-thought processes, focusing on path-finding in trees. We analyze two intertwined tasks: a backward reasoning task, where the model outputs a path from a goal node to the root, and a more complex forward reasoning task, where the model implements two-stage reasoning by first identifying the goal-to-root path and then reversing it to produce the root-to-goal path. Our theoretical analysis, grounded in the dynamics of gradient descent, shows that trained one-layer transformers can provably solve both tasks with generalization guarantees to unseen trees. In particular, our multi-phase training dynamics for forward reasoning elucidate how different attention heads learn to specialize and coordinate autonomously to solve the two subtasks in a single autoregressive path. These results provide a mechanistic explanation of how trained transformers can implement sequential algorithmic procedures. Moreover, they offer insights into the emergence of reasoning abilities, suggesting that when tasks are structured to take intermediate chain-of-thought steps, even shallow multi-head transformers can effectively solve problems that would otherwise require deeper architectures.

PDF Details

ICLR Conference 2024 Conference Paper

A Primal-Dual Approach to Solving Variational Inequalities with General Constraints

Tatjana Chavdarova
Tong Yang
Matteo Pagliardini
Michael I. Jordan

Yang et al. (2023) recently showed how to use first-order gradient methods to solve general variational inequalities (VIs) under a limiting assumption that analytic solutions of specific subproblems are available. In this paper, we circumvent this assumption via a warm-starting technique where we solve subproblems approximately and initialize variables with the approximate solution found at the previous iteration. We prove the convergence of this method and show that the gap function of the last iterate of the method decreases at a rate of $\mathcal{O}(\frac{1}{\sqrt{K}})$ when the operator is $L$-Lipschitz and monotone. In numerical experiments, we show that this technique can converge much faster than its exact counterpart. Furthermore, for the cases when the inequality constraints are simple, we introduce an alternative variant of ACVI and establish its convergence under the same conditions. Finally, we relax the smoothness assumptions in Yang et al., yielding, to our knowledge, the first convergence result for VIs with general constraints that does not rely on the assumption that the operator is $L$-Lipschitz.

Details

NeurIPS Conference 2024 Conference Paper

Federated Natural Policy Gradient and Actor Critic Methods for Multi-task Reinforcement Learning

Tong Yang
Shicong Cen
Yuting Wei
Yuxin Chen
Yuejie Chi

Federated reinforcement learning (RL) enables collaborative decision making of multiple distributed agents without sharing local data trajectories. In this work, we consider a multi-task setting, in which each agent has its own private reward function corresponding to different tasks, while sharing the same transition kernel of the environment. Focusing on infinite-horizon Markov decision processes, the goal is to learn a globally optimal policy that maximizes the sum of the discounted total rewards of all the agents in a decentralized manner, where each agent only communicates with its neighbors over some prescribed graph topology. We develop federated vanilla and entropy-regularized natural policy gradient (NPG) methods in the tabular setting under softmax parameterization, where gradient tracking is applied to estimate the global Q-function to mitigate the impact of imperfect information sharing. We establish non-asymptotic global convergence guarantees under exact policy evaluation, where the rates are nearly independent of the size of the state-action space and illuminate the impacts of network size and connectivity. To the best of our knowledge, this is the first time that global convergence is established for federated multi-task RL using policy optimization. We further go beyond the tabular setting by proposing a federated natural actor critic (NAC) method for multi-task RL with function approximation, and establish its finite-time sample complexity taking the errors of function approximation into account.

PDF Details DOI

NeurIPS Conference 2024 Conference Paper

In-Context Learning with Representations: Contextual Generalization of Trained Transformers

Tong Yang
Yu Huang
Yingbin Liang
Yuejie Chi

In-context learning (ICL) refers to a remarkable capability of pretrained large language models, which can learn a new task given a few examples during inference. However, theoretical understanding of ICL is largely under-explored, particularly whether transformers can be trained to generalize to unseen examples in a prompt, which will require the model to acquire contextual knowledge of the prompt for generalization. This paper investigates the training dynamics of transformers by gradient descent through the lens of non-linear regression tasks. The contextual generalization here can be attained via learning the template function for each task in-context, where all template functions lie in a linear space with $m$ basis functions. We analyze the training dynamics of one-layer multi-head transformers to predict unlabeled inputs in-context given partially labeled prompts, where the labels contain Gaussian noise and the number of examples in each prompt is not sufficient to determine the template. Under mild assumptions, we show that the training loss for a one-layer multi-head transformer converges linearly to a global minimum. Moreover, the transformer effectively learns to perform ridge regression over the basis functions. To our knowledge, this study is the first provable demonstration that transformers can learn contextual (i.e., template) information to generalize to both unseen examples and tasks when prompts contain only a small number of query-answer pairs.
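To make "ridge regression over the basis functions" concrete, here is a minimal pure-Python sketch of the closed-form ridge estimator for a template that is a linear combination of basis functions. It illustrates the target computation only and is not a transformer or the paper's construction; the polynomial basis, the regularization values, and the function names are hypothetical.

```python
def solve(a, b):
    """Solve a small dense linear system by Gaussian elimination with pivoting."""
    n = len(b)
    aug = [row[:] + [rhs] for row, rhs in zip(a, b)]
    for col in range(n):
        pivot = max(range(col, n), key=lambda r: abs(aug[r][col]))
        aug[col], aug[pivot] = aug[pivot], aug[col]
        for r in range(col + 1, n):
            factor = aug[r][col] / aug[col][col]
            for c in range(col, n + 1):
                aug[r][c] -= factor * aug[col][c]
    w = [0.0] * n
    for i in range(n - 1, -1, -1):
        tail = sum(aug[i][j] * w[j] for j in range(i + 1, n))
        w[i] = (aug[i][n] - tail) / aug[i][i]
    return w


def ridge_fit(xs, ys, basis, lam):
    """Minimize sum_i (y_i - w . phi(x_i))^2 + lam * ||w||^2 over w."""
    m = len(basis)
    phi = [[f(x) for f in basis] for x in xs]
    # Normal equations: (Phi^T Phi + lam * I) w = Phi^T y
    gram = [[sum(row[i] * row[j] for row in phi) + (lam if i == j else 0.0)
             for j in range(m)] for i in range(m)]
    rhs = [sum(row[i] * y for row, y in zip(phi, ys)) for i in range(m)]
    return solve(gram, rhs)


def predict(w, basis, x):
    return sum(wi * f(x) for wi, f in zip(w, basis))


# Hypothetical basis with m = 3 functions and a few labeled prompt examples.
basis = [lambda x: 1.0, lambda x: x, lambda x: x * x]
xs = [-2.0, -1.0, 0.0, 1.0, 2.0, 3.0]
ys = [1.0 - 2.0 * x + 0.5 * x * x for x in xs]
w = ridge_fit(xs, ys, basis, lam=1e-9)
print(predict(w, basis, 4.0))  # prediction for an unlabeled query input
```

A larger `lam` shrinks the coefficients toward zero, which is what keeps the estimate stable when the labels are noisy or the labeled examples are too few to pin down the template.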

PDF Details DOI

NeurIPS Conference 2024 Conference Paper

LCGen: Mining in Low-Certainty Generation for View-consistent Text-to-3D

Zeng Tao
Tong Yang
Junxiong Lin
Xinji Mai
Haoran Wang
Beining Wang
Enyu Zhou
Yan Wang

The Janus Problem is a common issue in SDS-based text-to-3D methods. Due to the view encoding approach and 2D diffusion prior guidance, the 3D representation model tends to learn content with higher certainty from each perspective, leading to view inconsistency. In this work, we first model and analyze the problem, visualizing the specific causes of the Janus Problem, which are associated with discrete view encoding and shared priors in 2D lifting. Based on this, we further propose the LCGen method, which guides text-to-3D to obtain different priors with different certainty from various viewpoints, aiding in view-consistent generation. Experiments have proven that our LCGen method can be directly applied to different SDS-based text-to-3D methods, alleviating the Janus Problem without introducing additional information, increasing excessive training burden, or compromising the generation effect.

PDF Details DOI

ICML Conference 2024 Conference Paper

Reflected Flow Matching

Tianyu Xie 0001
Yu Zhu 0004
Longlin Yu
Tong Yang
Ziheng Cheng
Shiyue Zhang 0002
Xiangyu Zhang
Cheng Zhang

Continuous normalizing flows (CNFs) learn an ordinary differential equation to transform prior samples into data. Flow matching (FM) has recently emerged as a simulation-free approach for training CNFs by regressing a velocity model towards the conditional velocity field. However, on constrained domains, the learned velocity model may lead to undesirable flows that result in highly unnatural samples, e.g., oversaturated images, due to both flow matching error and simulation error. To address this, we add a boundary constraint term to CNFs, which leads to reflected CNFs that keep trajectories within the constrained domains. We propose reflected flow matching (RFM) to train the velocity model in reflected CNFs by matching the conditional velocity fields in a simulation-free manner, similar to the vanilla FM. Moreover, the analytical form of conditional velocity fields in RFM avoids potentially biased approximations, making it superior to existing score-based generative models on constrained domains. We demonstrate that RFM achieves comparable or better results on standard image benchmarks and produces high-quality class-conditioned samples under high guidance weight.
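The idea of keeping trajectories inside a constrained domain can be illustrated with a one-dimensional toy in Python: an Euler integrator whose steps are folded back into an interval whenever they overshoot the boundary. This is a discrete-time illustration under assumed names (`reflect`, `euler_reflected`) and an assumed domain of [0, 1]; it is not the continuous-time boundary constraint term or the training procedure described in the abstract.

```python
def reflect(x, lo=0.0, hi=1.0):
    """Fold x back into [lo, hi] by mirroring it at the boundaries."""
    width = hi - lo
    y = (x - lo) % (2.0 * width)  # position within one period of the mirrored domain
    return lo + (y if y <= width else 2.0 * width - y)


def euler_reflected(x0, velocity, steps=100, lo=0.0, hi=1.0):
    """Integrate dx/dt = velocity(x, t) over t in [0, 1], reflecting each step."""
    dt = 1.0 / steps
    x = x0
    for k in range(steps):
        x = reflect(x + dt * velocity(x, k * dt), lo, hi)
    return x


# A constant outward velocity would leave [0, 1] without the reflection.
end = euler_reflected(0.9, lambda x, t: 3.0, steps=50)
print(0.0 <= end <= 1.0)
```

Without the fold-back, the same velocity field would carry the sample to 3.9, far outside the domain, which is the kind of out-of-range output the abstract associates with oversaturated images.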

Details

NeurIPS Conference 2023 Conference Paper

Hierarchical Semi-Implicit Variational Inference with Application to Diffusion Model Acceleration

Longlin Yu
Tianyu Xie
Yu Zhu
Tong Yang
Xiangyu Zhang
Cheng Zhang

Semi-implicit variational inference (SIVI) has been introduced to expand the analytical variational families by defining expressive semi-implicit distributions in a hierarchical manner. However, the single-layer architecture commonly used in current SIVI methods can be insufficient when the target posterior has complicated structures. In this paper, we propose hierarchical semi-implicit variational inference, called HSIVI, which generalizes SIVI to allow more expressive multi-layer construction of semi-implicit distributions. By introducing auxiliary distributions that interpolate between a simple base distribution and the target distribution, the conditional layers can be trained by progressively matching these auxiliary distributions one layer after another. Moreover, given pre-trained score networks, HSIVI can be used to accelerate the sampling process of diffusion models with the score matching objective. We show that HSIVI significantly enhances the expressiveness of SIVI on several Bayesian inference problems with complicated target distributions. When used for diffusion model acceleration, we show that HSIVI can produce high quality samples comparable to or better than the existing fast diffusion model based samplers with a small number of function evaluations on various datasets.

PDF Details

ICML Conference 2023 Conference Paper

Optimization for Amortized Inverse Problems

Tianci Liu 0003
Tong Yang
Quan Zhang
Qi Lei

Incorporating a deep generative model as the prior distribution in inverse problems has established substantial success in reconstructing images from corrupted observations. Notwithstanding, the existing optimization approaches use gradient descent largely without adapting to the non-convex nature of the problem and can be sensitive to initial values, impeding further performance improvement. In this paper, we propose an efficient amortized optimization scheme for inverse problems with a deep generative prior. Specifically, the optimization task with high degrees of difficulty is decomposed into optimizing a sequence of much easier ones. We provide a theoretical guarantee of the proposed algorithm and empirically validate it on different inverse problems. As a result, our approach outperforms baseline methods qualitatively and quantitatively by a large margin.

Details

NeurIPS Conference 2023 Conference Paper

Slot-guided Volumetric Object Radiance Fields

Di Qi
Tong Yang
Xiangyu Zhang

We present a novel framework for 3D object-centric representation learning. Our approach effectively decomposes complex scenes into individual objects from a single image in an unsupervised fashion. This method, called slot-guided Volumetric Object Radiance Fields (sVORF), composes volumetric object radiance fields with object slots as a guidance to implement unsupervised 3D scene decomposition. Specifically, sVORF obtains object slots from a single image via a transformer module, maps these slots to volumetric object radiance fields with a hypernetwork and composes object radiance fields with the guidance of object slots at a 3D location. Moreover, sVORF significantly reduces memory requirement due to small-sized pixel rendering during training. We demonstrate the effectiveness of our approach by showing top results in scene decomposition and generation tasks of complex synthetic datasets (e.g., Room-Diverse). Furthermore, we also confirm the potential of sVORF to segment objects in real-world scenes (e.g., the LLFF dataset). We hope our approach can provide preliminary understanding of the physical world and help ease future research in 3D object-centric representation learning.

PDF Details

ICLR Conference 2023 Conference Paper

Solving Constrained Variational Inequalities via a First-order Interior Point-based Method

Tong Yang
Michael I. Jordan
Tatjana Chavdarova

We develop an interior-point approach to solve constrained variational inequality (cVI) problems. Inspired by the efficacy of the alternating direction method of multipliers (ADMM) in the single-objective context, we generalize ADMM to derive a first-order method for cVIs, which we refer to as the ADMM-based interior-point method for constrained VIs (ACVI). We provide convergence guarantees for ACVI in two general classes of problems: (i) when the operator is $\xi$-monotone, and (ii) when it is monotone, some constraints are active and the game is not purely rotational. When the operator is in addition $L$-Lipschitz for the latter case, we match known lower bounds on rates for the gap function of $\mathcal{O}(1/\sqrt{K})$ and $\mathcal{O}(1/K)$ for the last and average iterate, respectively. To the best of our knowledge, this is the first presentation of a first-order interior-point method for the general cVI problem that has a global convergence guarantee. Moreover, unlike previous work in this setting, ACVI provides a means to solve cVIs when the constraints are nontrivial. Empirical analyses demonstrate clear advantages of ACVI over common first-order methods. In particular, (i) cyclical behavior is notably reduced as our methods approach the solution from the analytic center, and (ii) unlike projection-based methods that zigzag when near a constraint, ACVI efficiently handles the constraints.

Details

AAAI Conference 2022 Conference Paper

Anchor DETR: Query Design for Transformer-Based Detector

Yingming Wang
Xiangyu Zhang
Tong Yang
Jian Sun

In this paper, we propose a novel query design for transformer-based object detection. In previous transformer-based detectors, the object queries are a set of learned embeddings. However, each learned embedding does not have an explicit physical meaning and we cannot explain where it will focus. It is difficult to optimize as the prediction slot of each object query does not have a specific mode. In other words, each object query will not focus on a specific region. To solve these problems, in our query design, object queries are based on anchor points, which are widely used in CNN-based detectors. So each object query focuses on the objects near the anchor point. Moreover, our query design can predict multiple objects at one position to solve the difficulty: “one region, multiple objects”. In addition, we design an attention variant, which can reduce the memory cost while achieving similar or better performance than the standard attention in DETR. Thanks to the query design and the attention variant, the proposed detector, which we call Anchor DETR, can achieve better performance and run faster than DETR with 10× fewer training epochs. For example, it achieves 44.2 AP with 19 FPS on the MSCOCO dataset when using the ResNet50-DC5 feature for training 50 epochs. Extensive experiments on the MSCOCO benchmark prove the effectiveness of the proposed methods. Code is available at https://github.com/megvii-research/AnchorDETR.
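A small Python sketch can show what "object queries based on anchor points" might look like as a data structure: a uniform grid of anchor points in normalized image coordinates, with several query slots per point so that more than one object can be predicted at the same position. The grid layout, the `num_patterns` parameter, and both function names are assumptions made for illustration; the abstract does not specify how the anchor points are placed or how multiple predictions per position are produced.

```python
def anchor_points(grid_size):
    """Return grid_size * grid_size (x, y) points at cell centers of [0, 1]^2."""
    step = 1.0 / grid_size
    return [((col + 0.5) * step, (row + 0.5) * step)
            for row in range(grid_size)
            for col in range(grid_size)]


def build_queries(points, num_patterns):
    """Pair every anchor point with num_patterns query slots (hypothetical).

    Several slots per point allow several predictions at one position,
    addressing the "one region, multiple objects" case.
    """
    return [{"anchor": point, "pattern": k}
            for point in points
            for k in range(num_patterns)]


points = anchor_points(4)
queries = build_queries(points, num_patterns=3)
print(len(points), len(queries))
```

Each query now carries an explicit position, so it is clear which image region it is responsible for, in contrast to a learned embedding with no stated physical meaning.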

PDF Details

AAAI Conference 2022 Conference Paper

LGD: Label-Guided Self-Distillation for Object Detection

Peizhen Zhang
Zijian Kang
Tong Yang
Xiangyu Zhang
Nanning Zheng
Jian Sun

In this paper, we propose the first self-distillation framework for general object detection, termed LGD (Label-Guided self-Distillation). Previous studies rely on a strong pretrained teacher to provide instructive knowledge that could be unavailable in real-world scenarios. Instead, we generate instructive knowledge based only on student representations and regular labels. Our framework includes a sparse label-appearance encoder, an inter-object relation adapter and an intra-object knowledge mapper that jointly form an implicit teacher at the training phase, dynamically dependent on labels and evolving student representations. They are trained end-to-end with the detector and discarded in inference. Experimentally, LGD obtains decent results on various detectors, datasets, and extensive tasks like instance segmentation. For example, on the MS-COCO dataset, LGD improves RetinaNet with ResNet-50 under 2× single-scale training from 36.2% to 39.0% mAP (+2.8%). It boosts much stronger detectors like FCOS with ResNeXt-101 DCN v2 under 2× multi-scale training from 46.1% to 47.9% (+1.8%). Compared with a classical teacher-based method FGFI, LGD not only performs better without requiring a pretrained teacher but also reduces 51% training cost beyond inherent student learning. Codes are available at https://github.com/megvii-research/LGD.

PDF Details

AAAI Conference 2021 Conference Paper

Co-mining: Self-Supervised Learning for Sparsely Annotated Object Detection

Tiancai Wang
Tong Yang
Jiale Cao
Xiangyu Zhang

Object detectors usually achieve promising results with the supervision of complete instance annotations. However, their performance is far from satisfactory with sparse instance annotations. Most existing methods for sparsely annotated object detection either re-weight the loss of hard negative samples or convert the unlabeled instances into ignored regions to reduce the interference of false negatives. We argue that these strategies are insufficient since they can at most alleviate the negative effect caused by missing annotations. In this paper, we propose a simple but effective mechanism, called Co-mining, for sparsely annotated object detection. In our Co-mining, two branches of a Siamese network predict the pseudo-label sets for each other. To enhance multi-view learning and better mine unlabeled instances, the original image and corresponding augmented image are used as the inputs of two branches of the Siamese network, respectively. Co-mining can serve as a general training mechanism applied to most modern object detectors. Experiments are performed on the MS COCO dataset with three different sparsely annotated settings using two typical frameworks: the anchor-based detector RetinaNet and the anchor-free detector FCOS. Experimental results show that our Co-mining with RetinaNet achieves 1.4% ∼ 2.1% improvements compared with different baselines and surpasses existing methods under the same sparsely annotated setting.

PDF Details

NeurIPS Conference 2019 Conference Paper

DetNAS: Backbone Search for Object Detection

Yukang Chen
Tong Yang
Xiangyu Zhang
Gaofeng Meng
Xinyu Xiao
Jian Sun

Object detectors are usually equipped with backbone networks designed for image classification. It might be sub-optimal because of the gap between the tasks of image classification and object detection. In this work, we present DetNAS to use Neural Architecture Search (NAS) for the design of better backbones for object detection. It is non-trivial because detection training typically needs ImageNet pre-training while NAS systems require accuracies on the target detection task as supervisory signals. Based on the technique of one-shot supernet, which contains all possible networks in the search space, we propose a framework for backbone search on object detection. We train the supernet under the typical detector training schedule: ImageNet pre-training and detection fine-tuning. Then, the architecture search is performed on the trained supernet, using the detection task as the guidance. This framework makes NAS on backbones very efficient. In experiments, we show the effectiveness of DetNAS on various detectors, for instance, the one-stage RetinaNet and the two-stage FPN. We empirically find that networks searched on object detection show consistent superiority compared to those searched on ImageNet classification. The resulting architecture outperforms hand-crafted networks on COCO with far fewer FLOPs.

PDF Details

NeurIPS Conference 2018 Conference Paper

MetaAnchor: Learning to Detect Objects with Customized Anchors

Tong Yang
Xiangyu Zhang
Zeming Li
Wenqiang Zhang
Jian Sun

We propose a novel and flexible anchor mechanism named MetaAnchor for object detection frameworks. Unlike many previous detectors, which model anchors in a predefined manner, in MetaAnchor anchor functions can be dynamically generated from arbitrary customized prior boxes. Taking advantage of weight prediction, MetaAnchor is able to work with most anchor-based object detection systems such as RetinaNet. Compared with the predefined anchor scheme, we empirically find that MetaAnchor is more robust to anchor settings and bounding box distributions; in addition, it also shows potential on the transfer task. Our experiment on the COCO detection task shows that MetaAnchor consistently outperforms its counterparts in various scenarios.

PDF Details