Author name cluster

Xinlei Chen

Possible papers associated with this exact author name in Arrow. This page groups case-insensitive exact name matches and is not a full identity disambiguation profile.

26 papers

2 author rows

AAAI Conference 2026 Conference Paper

AirCopBench: A Benchmark for Multi-drone Collaborative Embodied Perception and Reasoning

Jirong Zha
Yuxuan Fan
Tianyu Zhang
Geng Chen
Yingfeng Chen
Chen Gao
Xinlei Chen

Multimodal Large Language Models (MLLMs) have shown promise in single-agent vision tasks, yet benchmarks for evaluating multi-agent collaborative perception remain scarce. This gap is critical, as multi-drone systems provide enhanced coverage, robustness, and collaboration compared to single-sensor setups. Existing multi-image benchmarks mainly target basic perception tasks using high-quality single-agent images, thus failing to evaluate MLLMs in more complex, egocentric collaborative scenarios, especially under real-world degraded perception conditions. To address these challenges, we introduce AirCopBench, the first comprehensive benchmark designed to evaluate MLLMs in embodied aerial collaborative perception under challenging perceptual conditions. AirCopBench includes 14.6k+ questions derived from both simulator and real-world data, spanning four key task dimensions: Scene Understanding, Object Understanding, Perception Assessment, and Collaborative Decision, across 14 task types. We construct the benchmark using data from challenging degraded-perception scenarios with annotated collaborative events, generating large-scale questions through model-, rule-, and human-based methods under rigorous quality control. Evaluations on 40 MLLMs show significant performance gaps in collaborative perception tasks, with the best model trailing humans by 24.38% on average and exhibiting inconsistent results across tasks. Fine-tuning experiments further confirm the feasibility of sim-to-real transfer in aerial collaborative perception.

PDF Details DOI

AAAI Conference 2026 Conference Paper

DIMM: Decoupled Multi-hierarchy Kalman Filter via Reinforcement Learning

Jirong Zha
Yuxuan Fan
Kai Li
Han Li
Chen Gao
Xinlei Chen

State estimation is challenging for target tracking with high maneuverability, as the target's state transition function changes rapidly, irregularly, and is unknown to the estimator. Existing work based on interacting multiple model (IMM) achieves more accurate estimation than single-filter approaches through model combination, aligning appropriate models for different motion modes of the target over time. However, two limitations of conventional IMM remain unsolved. First, the solution space of the model combination is constrained as the target's diverse kinematic properties in different directions are ignored. Second, the model combination weights calculated by the observation likelihood are not accurate enough due to the measurement uncertainty. In this paper, we propose a novel framework, DIMM, to effectively combine estimates from different motion models in each direction, thus increasing the target tracking accuracy. First, DIMM extends the model combination solution space of conventional IMM from a hyperplane to a hypercube by designing a 3D-decoupled multi-hierarchy filter bank, which describes the target's motion with various-order linear models. Second, DIMM generates more reliable combination weight matrices through a differentiable adaptive fusion network for importance allocation rather than solely relying on the observation likelihood; it contains an attention-based twin delayed deep deterministic policy gradient (TD3) method with a hierarchical reward. Experiments demonstrate that DIMM significantly improves the tracking accuracy of existing state estimation methods by 31.61%~99.23%.

PDF Details DOI

ICLR Conference 2025 Conference Paper

An Image is Worth More Than 16x16 Patches: Exploring Transformers on Individual Pixels

Duy-Kien Nguyen
Mahmoud Assran
Unnat Jain
Martin R. Oswald
Cees G. M. Snoek
Xinlei Chen

This work does not introduce a new method. Instead, we present an interesting finding that questions the necessity of the inductive bias of locality in modern computer vision architectures. Concretely, we find that vanilla Transformers can operate by directly treating each individual pixel as a token and achieve highly performant results. This is substantially different from the popular design in Vision Transformer, which maintains the inductive bias from ConvNets towards local neighborhoods (e.g., by treating each 16x16 patch as a token). We showcase the effectiveness of pixels-as-tokens across three well-studied computer vision tasks: supervised learning for classification and regression, self-supervised learning via masked autoencoding, and image generation with diffusion models. Although it's computationally less practical to directly operate on individual pixels, we believe the community must be made aware of this surprising piece of knowledge when devising the next generation of neural network architectures for computer vision.

Details

NeurIPS Conference 2025 Conference Paper

Balanced Token Pruning: Accelerating Vision Language Models Beyond Local Optimization

kaiyuan Li
Xiaoyue Chen
Chen Gao
Yong Li
Xinlei Chen

Large Vision-Language Models (LVLMs) have shown impressive performance across multi-modal tasks by encoding images into thousands of tokens. However, the large number of image tokens results in significant computational overhead, and the use of dynamic high-resolution inputs further increases this burden. Previous approaches have attempted to reduce the number of image tokens through token pruning, typically by selecting tokens based on attention scores or image token diversity. Through empirical studies, we observe that existing methods often overlook the joint impact of pruning on both the current layer's output (local) and the outputs of subsequent layers (global), leading to suboptimal pruning decisions. To address this challenge, we propose Balanced Token Pruning (BTP), a plug-and-play method for pruning vision tokens. Specifically, our method utilizes a small calibration set to divide the pruning process into multiple stages. In the early stages, our method emphasizes the impact of pruning on subsequent layers, whereas in the deeper stages, the focus shifts toward preserving the consistency of local outputs. Extensive experiments across various LVLMs demonstrate the broad effectiveness of our approach on multiple benchmarks. Our method achieves a 78\% compression rate while preserving 96. 7\% of the original models' performance on average. Our code is available at https: //github. com/EmbodiedCity/NeurIPS2025-Balanced-Token-Pruning.

PDF Details

ICLR Conference 2025 Conference Paper

Deconstructing Denoising Diffusion Models for Self-Supervised Learning

Xinlei Chen
Zhuang Liu 0003
Saining Xie
Kaiming He

In this study, we examine the representation learning abilities of Denoising Diffusion Models (DDM) that were originally purposed for image generation. Our philosophy is to deconstruct a DDM, gradually transforming it into a classical Denoising Autoencoder (DAE). This deconstructive process allows us to explore how various components of modern DDMs influence self-supervised representation learning. We observe that only a very few modern components are critical for learning good representations, while many others are nonessential. Our study ultimately arrives at an approach that is highly simplified and to a large extent resembles a classical DAE. We hope our study will rekindle interest in a family of classical methods within the realm of modern self-supervised learning.

Details

ICML Conference 2025 Conference Paper

Highly Compressed Tokenizer Can Generate Without Training

Lukas Lao Beyer
Tianhong Li
Xinlei Chen
Sertac Karaman
Kaiming He

Commonly used image tokenizers produce a 2D grid of spatially arranged tokens. In contrast, so-called 1D image tokenizers represent images as highly compressed one-dimensional sequences of as few as 32 discrete tokens. We find that the high degree of compression achieved by a 1D tokenizer with vector quantization enables image editing and generative capabilities through heuristic manipulation of tokens, demonstrating that even very crude manipulations – such as copying and replacing tokens between latent representations of images – enable fine-grained image editing by transferring appearance and semantic attributes. Motivated by the expressivity of the 1D tokenizer’s latent space, we construct an image generation pipeline leveraging gradient-based test-time optimization of tokens with plug-and-play loss functions such as reconstruction or CLIP similarity. Our approach is demonstrated for inpainting and text-guided image editing use cases, and can generate diverse and realistic samples without requiring training of any generative model.

Details

IJCAI Conference 2025 Conference Paper

How to Enable LLM with 3D Capacity? A Survey of Spatial Reasoning in LLM

Jirong Zha
Yuxuan Fan
Xiao Yang
Chen Gao
Xinlei Chen

3D spatial understanding is essential in real-world applications such as robotics, autonomous vehicles, virtual reality, and medical imaging. Recently, Large Language Models (LLMs), having demonstrated remarkable success across various domains, have been leveraged to enhance 3D understanding tasks, showing potential to surpass traditional computer vision methods. In this survey, we present a comprehensive review of methods integrating LLMs with 3D spatial understanding. We propose a taxonomy that categorizes existing methods into three branches: image-based methods deriving 3D understanding from 2D visual data, point cloud-based methods working directly with 3D representations, and hybrid modality-based methods combining multiple data streams. We systematically review representative methods along these categories, covering data representations, architectural modifications, and training strategies that bridge textual and 3D modalities. Finally, we discuss current limitations, including dataset scarcity and computational challenges, while highlighting promising research directions in spatial perception, multi-modal fusion, and real-world applications.

PDF Details DOI

ICML Conference 2025 Conference Paper

Learning to (Learn at Test Time): RNNs with Expressive Hidden States

Yu Sun 0020
Xinhao Li
Karan Dalal
Jiarui Xu
Arjun Vikram
Genghan Zhang
Yann Dubois
Xinlei Chen

Self-attention performs well in long context but has quadratic complexity. Existing RNN layers have linear complexity, but their performance in long context is limited by the expressive power of their hidden states. We present a practical framework for instantiating sequence modeling layers with linear complexity and expressive hidden states. The key idea is to make the hidden state a machine learning model itself, and the update rule a step of self-supervised learning. Since the hidden state is updated by training even on test sequences, our layers are called Test-Time Training (TTT) layers. We consider two instantiations: TTT-Linear and TTT-MLP, whose hidden state is a linear model and a two-layer MLP respectively. We evaluate our instantiations at the scale of 125M to 1. 3B parameters, comparing with a strong Transformer and Mamba, a modern RNN. Similar to Transformer, TTT-Linear and TTT-MLP can keep reducing perplexity by conditioning on more tokens, while Mamba cannot after 16k context. TTT-MLP still faces challenges in memory I/O, but shows larger potential in long context, pointing to a promising direction for future research.

Details

ICML Conference 2025 Conference Paper

Learnings from Scaling Visual Tokenizers for Reconstruction and Generation

Philippe Hansen-Estruch
David Yan
Ching-Yao Chuang
Orr Zohar
Jialiang Wang 0001
Tingbo Hou
Tao Xu
Sriram Vishwanath

Visual tokenization via auto-encoding empowers state-of-the-art image and video generative models by compressing pixels into a latent space. However, questions remain about how auto-encoder design impacts reconstruction and downstream generative performance. This work explores scaling in auto-encoders for reconstruction and generation by replacing the convolutional backbone with an enhanced Vision Transformer for Tokenization (ViTok). We find scaling the auto-encoder bottleneck correlates with reconstruction but exhibits a nuanced relationship with generation. Separately, encoder scaling yields no gains, while decoder scaling improves reconstruction with minimal impact on generation. As a result, we determine that scaling the current paradigm of auto-encoders is not effective for improving generation performance. Coupled with Diffusion Transformers, ViTok achieves competitive image reconstruction and generation performance on 256p and 512p ImageNet-1K. In videos, ViTok achieves SOTA reconstruction and generation performance on 16-frame 128p UCF-101.

Details

ICML Conference 2025 Conference Paper

LLMs can see and hear without any training

Kumar Ashutosh
Yossi Gandelsman
Xinlei Chen
Ishan Misra
Rohit Girdhar

We present MILS: Multimodal Iterative LLM Solver, a surprisingly simple, training-free approach, to imbue multimodal capabilities into your favorite LLM. Leveraging their innate ability to perform multi-step reasoning, MILS prompts the LLM to generate candidate outputs, each of which are scored and fed back iteratively, eventually generating a solution to the task. This enables various applications that typically require training specialized models on task-specific data. In particular, we establish a new state-of-the-art on emergent zero-shot image, video and audio captioning. MILS seamlessly applies to media generation as well, discovering prompt rewrites to improve text-to-image generation, and even edit prompts for style transfer! Finally, being a gradient-free optimization approach, MILS can invert multimodal embeddings into text, enabling applications like cross-modal arithmetic.

Details

NeurIPS Conference 2025 Conference Paper

Meta CLIP 2: A Worldwide Scaling Recipe

Yung-Sung Chuang
Yang Li
Dong Wang
Ching-Feng Yeh
Kehan Lyu
Ramya Raghavendra
Jim Glass
Lifei Huang

Contrastive Language-Image Pretraining (CLIP) is a popular foundation model, supporting from zero-shot classification, retrieval to encoders for multimodal large language models (MLLMs). Although CLIP is successfully trained on billion-scale image-text pairs from the English world, scaling CLIP's training further to learning from the worldwide web data is still challenging: (1) no curation method is available to handle data points from non-English world; (2) the English performance from existing multilingual CLIP is worse than its English-only counterpart, i. e. , "curse of multilinguality" that is common in LLMs. Here, we present Meta CLIP 2, the first recipe training CLIP from scratch on worldwide web-scale image-text pairs. To generalize our findings, we conduct rigorous ablations with minimal changes that are necessary to address the above challenges and present a recipe enabling mutual benefits from English and non-English world data. In zero-shot ImageNet classification, Meta CLIP 2 ViT-H/14 surpasses its English-only counterpart by 0. 8% and mSigLIP by 0. 7%, and surprisingly sets new state-of-the-art without system-level confounding factors (e. g. , translation, bespoke architecture changes) on multilingual benchmarks, such as CVQA with 57. 4%, Babel-ImageNet with 50. 2% and XM3600 with 64. 3% on image-to-text retrieval. Code and model are available at https: //github. com/facebookresearch/MetaCLIP.

PDF Details

JMLR Journal 2025 Journal Article

Test-Time Training on Video Streams

Renhao Wang
Yu Sun
Arnuv Tandon
Yossi Gandelsman
Xinlei Chen
Alexei A. Efros
Xiaolong Wang

Prior work has established Test-Time Training (TTT) as a general framework to further improve a trained model at test time. Before making a prediction on each test instance, the model is first trained on the same instance using a self-supervised task such as reconstruction. We extend TTT to the streaming setting, where multiple test instances - video frames in our case - arrive in temporal order. Our extension is online TTT: The current model is initialized from the previous model, then trained on the current frame and a small window of frames immediately before. Online TTT significantly outperforms the fixed-model baseline for four tasks, on three real-world datasets. The improvements are more than 2.2x and 1.5x for instance and panoptic segmentation. Surprisingly, online TTT also outperforms its offline variant that accesses strictly more information, training on all frames from the entire test video regardless of temporal order. This finding challenges those in prior work using synthetic videos. We formalize a notion of locality as the advantage of online over offline TTT, and analyze its role with ablations and a theory based on bias-variance trade-off. [abs] [ pdf ][ bib ] [ code ] &copy JMLR 2025. ( edit, beta )

PDF Details

ICRA Conference 2025 Conference Paper

Underwater Motions Analysis and Control of a Coupling-Tiltable Unmanned Aerial-Aquatic Vehicle

Dongyue Huang
Minghao Dou
Xuchen Liu 0001
Tao Sun
Jianguo Zhang
Ning Ding
Xinlei Chen
Ben M. Chen

Coupling-Tiltable Unmanned Aerial-Aquatic Vehicles (UAAVs) have gained increasing importance, yet lack comprehensive analysis and suitable controllers. This paper analyzes the underwater motion characteristics of a self-designed UAAV, Mirs-Alioth, and designs a controller for it. The effectiveness of the controller is validated through experiments. The singularities of Mirs-Alioth are derived as Singular Thrust Tilt Angle (STTA), which serve as an essential tool for an analysis of its underwater motion characteristics. The analysis reveals several key factors for designing the controller. These include the need for logic switching, using a Nussbaum function to compensate control direction uncertainty in the auxiliary channel, and employing an auxiliary controller to mitigate coupling effects. Based on these key points, a control scheme is designed. It consists of a controller that regulates the thrust tilt angle to the singular value, an auxiliary controller incorporating a Saturated Nussbaum function, and a logic switch. Eventually, two sets of experiments are conducted to validate the effectiveness of the controller and demonstrate the necessity of the Nussbaum function.

Details

NeurIPS Conference 2025 Conference Paper

VolleyBots: A Testbed for Multi-Drone Volleyball Game Combining Motion Control and Strategic Play

Zelai Xu
Ruize Zhang
Chao Yu
Huining Yuan
Xiangmin Yi
Shilong Ji
Chuqi Wang
Wenhao Tang

Robot sports, characterized by well-defined objectives, explicit rules, and dynamic interactions, present ideal scenarios for demonstrating embodied intelligence. In this paper, we present VolleyBots, a novel robot sports testbed where multiple drones cooperate and compete in the sport of volleyball under physical dynamics. VolleyBots integrates three features within a unified platform: competitive and cooperative gameplay, turn-based interaction structure, and agile 3D maneuvering. These intertwined features yield a complex problem combining motion control and strategic play, with no available expert demonstrations. We provide a comprehensive suite of tasks ranging from single-drone drills to multi-drone cooperative and competitive tasks, accompanied by baseline evaluations of representative reinforcement learning (RL), multi-agent reinforcement learning (MARL) and game-theoretic algorithms. Simulation results show that on-policy RL methods outperform off-policy methods in single-agent tasks, but both approaches struggle in complex tasks that combine motion control and strategic play. We additionally design a hierarchical policy which achieves 69. 5% win rate against the strongest baseline in the 3 vs 3 task, demonstrating its potential for tackling the complex interplay between low-level control and high-level strategy. To highlight VolleyBots’ sim-to-real potential, we further demonstrate the zero-shot deployment of a policy trained entirely in simulation on real-world drones.

PDF Details

NeurIPS Conference 2025 Conference Paper

What Can RL Bring to VLA Generalization? An Empirical Study

Jijia Liu
Feng Gao
Bingwen Wei
Xinlei Chen
Qingmin Liao
Yi Wu
Chao Yu
Yu Wang

Large Vision-Language Action (VLA) models have shown significant potential for embodied AI. However, their predominant training via supervised fine-tuning (SFT) limits generalization due to susceptibility to compounding errors under distribution shifts. Reinforcement learning (RL) offers a path to overcome these limitations by optimizing for task objectives via trial-and-error, yet a systematic understanding of its specific generalization benefits for VLAs compared to SFT is lacking. To address this, our study introduces a comprehensive benchmark for evaluating VLA generalization and systematically investigates the impact of RL fine-tuning across diverse visual, semantic, and execution dimensions. Our extensive experiments reveal that RL fine-tuning, particularly with PPO, significantly enhances generalization in semantic understanding and execution robustness over SFT, while maintaining comparable visual robustness. We identify PPO as a more effective RL algorithm for VLAs than LLM-derived methods like DPO and GRPO. We also develop a simple recipe for efficient PPO training on VLAs, and demonstrate its practical utility for improving VLA generalization. The project page is at https: //rlvla. github. io

PDF Details

NeurIPS Conference 2024 Conference Paper

On the Surprising Effectiveness of Attention Transfer for Vision Transformers

Alexander C. Li
Yuandong Tian
Beidi Chen
Deepak Pathak
Xinlei Chen

Conventional wisdom suggests that pre-training Vision Transformers (ViT) improves downstream performance by learning useful representations. Is this actually true? We investigate this question and find that the features and representations learned during pre-training are not essential. Surprisingly, using only the attention patterns from pre-training (i. e. , guiding how information flows between tokens) is sufficient for models to learn high quality features from scratch and achieve comparable downstream performance. We show this by introducing a simple method called attention transfer, where only the attention patterns from a pre-trained teacher ViT are transferred to a student, either by copying or distilling the attention maps. Since attention transfer lets the student learn its own features, ensembling it with a fine-tuned teacher also further improves accuracy on ImageNet. We systematically study various aspects of our findings on the sufficiency of attention maps, including distribution shift settings where they underperform fine-tuning. We hope our exploration provides a better understanding of what pre-training accomplishes and leads to a useful alternative to the standard practice of fine-tuning.

PDF Details DOI

ICLR Conference 2024 Conference Paper

R-MAE: Regions Meet Masked Autoencoders

Duy-Kien Nguyen
Yanghao Li
Vaibhav Aggarwal
Martin R. Oswald
Alexander Kirillov
Cees G. M. Snoek
Xinlei Chen

In this work, we explore regions as a potential visual analogue of words for self-supervised image representation learning. Inspired by Masked Autoencoding (MAE), a generative pre-training baseline, we propose masked region autoencoding to learn from groups of pixels or regions. Specifically, we design an architecture which efficiently addresses the one-to-many mapping between images and regions, while being highly effective especially with high-quality regions. When integrated with MAE, our approach (R-MAE) demonstrates consistent improvements across various pre-training datasets and downstream detection and segmentation benchmarks, with negligible computational overheads. Beyond the quantitative evaluation, our analysis indicates the models pre-trained with masked region autoencoding unlock the potential for interactive segmentation. The code is provided at https://github.com/facebookresearch/r-mae.

Details

TMLR Journal 2024 Journal Article

Revisiting Feature Prediction for Learning Visual Representations from Video

Adrien Bardes
Quentin Garrido
Jean Ponce
Xinlei Chen
Michael Rabbat
Yann LeCun
Mido Assran
Nicolas Ballas

This paper explores feature prediction as a stand-alone objective for unsupervised learning from video and introduces V-JEPA, a collection of vision models trained solely using a feature prediction objective, without the use of pretrained image encoders, text, negative examples, reconstruction, or other sources of supervision. The models are trained on 2 million videos collected from public datasets and are evaluated on downstream image and video tasks. Our results show that learning by predicting video features leads to versatile visual representations that perform well on both motion and appearance-based tasks, without adaption of the model’s parameters; e.g., using a frozen backbone. Our largest model, a ViT-H/16 trained only on videos, obtains 81.9% on Kinetics-400, 72.2% on Something-Something-v2, and 77.9% on ImageNet1K.

PDF Details

NeurIPS Conference 2024 Conference Paper

Scaling Proprioceptive-Visual Learning with Heterogeneous Pre-trained Transformers

Lirui Wang
Xinlei Chen
Jialiang Zhao
Kaiming He

One of the roadblocks for training generalist robotic models today is heterogeneity. Previous robot learning methods often collect data to train with one specific embodiment for one task, which is expensive and prone to overfitting. This work studies the problem of learning policy representations through heterogeneous pre-training on robot data across different embodiments and tasks at scale. We propose Heterogeneous Pre-trained Transformers (HPT), which pre-train a large, shareable trunk of a policy neural network to learn a task and embodiment agnostic shared representation. This general architecture aligns the specific proprioception and vision inputs from distinct embodiments to a short sequence of tokens and then processes such tokens to map to control robots for different tasks. Leveraging the recent large-scale multi-embodiment real-world robotic datasets as well as simulation, deployed robots, and human video datasets, we investigate pre-training policies across heterogeneity. We conduct experiments to investigate the scaling behaviors of training objectives, to the extent of 52 datasets. HPTs outperform several baselines and enhance the fine-tuned policy performance by over 20% on unseen tasks in multiple simulator benchmarks and real-world settings. See the project website (liruiw. github. io/hpt) for code and videos.

PDF Details DOI

IROS Conference 2023 Conference Paper

Autonomous Swarm Robot Coordination via Mean-Field Control Embedding Multi-Agent Reinforcement Learning

Huaze Tang
Hengxi Zhang
Zhenpeng Shi
Xinlei Chen
Wenbo Ding 0001
Xiao-Ping Zhang 0002

The learning approaches of designing a controller to guide the collective behavior of swarm robots have gained significant attention in recent years. However, the scalability of swarm robots and their inherent stochasticity complicate the control problem due to increasing complexity, unpredictability, and non-linearity. Despite considerable progress made in swarm robotics, addressing these challenges remains a significant issue. In this work, we model the stochastic dynamics of a swarm robot system and then propose a novel control framework based on a mean-field control (MFC) embedding multi-agent reinforcement learning (MARL) approach named MF-MARL to deal with these challenges. While MARL is able to deal with stochasticity statistically, we integrate MFC, allowing MF-MARL to cope with large-scale robots. Moreover, we apply statistical moments of robots' state and control action to discretize continuous input and enable MF-MARL to be applied in continuous scenarios. To demonstrate the effectiveness of MF-MARL, we evaluate the performance of the robots on a specific swarm simulation platform. The experimental results show that our algorithm outperforms the traditional algorithms both in navigation and manipulation tasks. Finally, we demonstrate the adaptability of the proposed algorithm through the component failure test.

Details

ICLR Conference 2022 Conference Paper

NASViT: Neural Architecture Search for Efficient Vision Transformers with Gradient Conflict aware Supernet Training

Chengyue Gong
Dilin Wang
Meng Li 0004
Xinlei Chen
Zhicheng Yan 0001
Yuandong Tian
Qiang Liu 0001
Vikas Chandra

Designing accurate and efficient vision transformers (ViTs) is a highly important but challenging task. Supernet-based one-shot neural architecture search (NAS) enables fast architecture optimization and has achieved state-of-the-art (SOTA) results on convolutional neural networks (CNNs). However, directly applying the supernet-based NAS to optimize ViTs leads to poor performance - even worse compared to training single ViTs. In this work, we observe that the poor performance is due to a gradient conflict issue: the gradients of different sub-networks conflict with that of the supernet more severely in ViTs than CNNs, which leads to early saturation in training and inferior convergence. To alleviate this issue, we propose a series of techniques, including a gradient projection algorithm, a switchable layer scaling design, and a simplified data augmentation and regularization training recipe. The proposed techniques significantly improve the convergence and the performance of all sub-networks. Our discovered hybrid ViT model family, dubbed NASViT, achieves top-1 accuracy from 78.2% to 81.8% on ImageNet from 200M to 800M FLOPs, and outperforms all the prior art CNNs and ViTs, including AlphaNet and LeViT, etc. When transferred to semantic segmentation tasks, NASViTs also outperform previous backbones on both Cityscape and ADE20K datasets, achieving 73.2% and 37.9% mIoU with only 5G FLOPs, respectively. Code is available at https://github.com/facebookresearch/NASViT.

Details

NeurIPS Conference 2022 Conference Paper

Test-Time Training with Masked Autoencoders

Yossi Gandelsman
Yu Sun
Xinlei Chen
Alexei Efros

Test-time training adapts to a new test distribution on the fly by optimizing a model for each test input using self-supervision. In this paper, we use masked autoencoders for this one-sample learning problem. Empirically, our simple method improves generalization on many visual benchmarks for distribution shifts. Theoretically, we characterize this improvement in terms of the bias-variance trade-off.

PDF Details

ICLR Conference 2021 Conference Paper

MoVie: Revisiting Modulated Convolutions for Visual Counting and Beyond

Duy-Kien Nguyen
Vedanuj Goswami
Xinlei Chen

This paper focuses on visual counting, which aims to predict the number of occurrences given a natural image and a query (e.g. a question or a category). Unlike most prior works that use explicit, symbolic models which can be computationally expensive and limited in generalization, we propose a simple and effective alternative by revisiting modulated convolutions that fuse the query and the image locally. Following the design of residual bottleneck, we call our method MoVie, short for Modulated conVolutional bottlenecks. Notably, MoVie reasons implicitly and holistically and only needs a single forward-pass during inference. Nevertheless, MoVie showcases strong performance for counting: 1) advancing the state-of-the-art on counting-specific VQA tasks while being more efficient; 2) outperforming prior-art on difficult benchmarks like COCO for common object counting; 3) helped us secure the first place of 2020 VQA challenge when integrated as a module for ‘number’ related questions in generic VQA models. Finally, we show evidence that modulated convolutions such as MoVie can serve as a general mechanism for reasoning tasks beyond counting.

Details

ICML Conference 2021 Conference Paper

Understanding self-supervised learning dynamics without contrastive pairs

Yuandong Tian
Xinlei Chen
Surya Ganguli

While contrastive approaches of self-supervised learning (SSL) learn representations by minimizing the distance between two augmented views of the same data point (positive pairs) and maximizing views from different data points (negative pairs), recent \emph{non-contrastive} SSL (e. g. , BYOL and SimSiam) show remarkable performance {\it without} negative pairs, with an extra learnable predictor and a stop-gradient operation. A fundamental question rises: why they do not collapse into trivial representation? In this paper, we answer this question via a simple theoretical study and propose a novel approach, \ourmethod{}, that \emph{directly} sets the linear predictor based on the statistics of its inputs, rather than trained with gradient update. On ImageNet, it performs comparably with more complex two-layer non-linear predictors that employ BatchNorm and outperforms linear predictor by $2. 5%$ in 300-epoch training (and $5%$ in 60-epoch). \ourmethod{} is motivated by our theoretical study of the nonlinear learning dynamics of non-contrastive SSL in simple linear networks. Our study yields conceptual insights into how non-contrastive SSL methods learn, how they avoid representational collapse, and how multiple factors, like predictor networks, stop-gradients, exponential moving averages, and weight decay all come into play. Our simple theory recapitulates the results of real-world ablation studies in both STL-10 and ImageNet. Code is released\footnote{\url{https: //github. com/facebookresearch/luckmatters/tree/master/ssl}}.

Details

AAAI Conference 2015 Conference Paper

Never-Ending Learning

Tom Mitchell
William Cohen
Estevam Hruschka
Partha Talukdar
Justin Betteridge
Andrew Carlson
Bhavana Dalvi Mishra
Matthew Gardner

Whereas people learn many different types of knowledge from diverse experiences over many years, most current machine learning systems acquire just a single function or data model from just a single data set. We propose a never-ending learning paradigm for machine learning, to better reﬂect the more ambitious and encompassing type of learning performed by humans. As a case study, we describe the Never- Ending Language Learner (NELL), which achieves some of the desired properties of a never-ending learner, and we discuss lessons learned. NELL has been learning to read the web 24 hours/day since January 2010, and so far has acquired a knowledge base with over 80 million conﬁdence-weighted beliefs (e. g. , servedWith(tea, biscuits)), while learning continually to improve its reading competence over time. NELL has also learned to reason over its knowledge base to infer new beliefs from old ones, and is now beginning to extend its ontology by synthesizing new relational predicates. NELL can be tracked online at http: //rtw. ml. cmu. edu, and followed on Twitter at @CMUNELL.

PDF Details

AAAI Conference 2011 Conference Paper

Large Scale Spectral Clustering with Landmark-Based Representation

Xinlei Chen
Deng Cai

Spectral clustering is one of the most popular clustering approaches. Despite its good performance, it is limited in its applicability to large-scale problems due to its high computational complexity. Recently, many approaches have been proposed to accelerate the spectral clustering. Unfortunately, these methods usually sacri- ﬁce quite a lot information of the original data, thus result in a degradation of performance. In this paper, we propose a novel approach, called Landmark-based Spectral Clustering (LSC), for large scale clustering problems. Speciﬁcally, we select p ( n) representative data points as the landmarks and represent the original data points as the linear combinations of these landmarks. The spectral embedding of the data can then be efﬁciently computed with the landmark-based representation. The proposed algorithm scales linearly with the problem size. Extensive experiments show the effectiveness and efﬁciency of our approach comparing to the state-of-the-art methods.

PDF Details