Arrow Research search

Author name cluster

Xiaolong Wang

Possible papers associated with this exact author name in Arrow. This page groups case-insensitive exact name matches and is not a full identity disambiguation profile.

38 papers
2 author rows

Possible papers

38

AAAI Conference 2026 Conference Paper

MCA-Bench: A Multimodal Benchmark for Evaluating CAPTCHA Robustness Against VLM-based Attacks

  • Zonglin Wu
  • Yule Xue
  • Yaoyao Feng
  • Xiaolong Wang
  • Yiren Song

As automated attack techniques rapidly advance, CAPTCHAs remain a critical defense mechanism against malicious bots. However, existing CAPTCHA schemes encompass a diverse range of modalities—from static distorted text and obfuscated images to interactive clicks, sliding puzzles, and logic-based questions—yet the community still lacks a unified, large-scale, multimodal benchmark to rigorously evaluate their security robustness. To address this gap, we introduce MCA-Bench, a comprehensive and reproducible benchmarking suite that integrates heterogeneous CAPTCHA types into a single evaluation protocol. Leveraging a shared vision–language model backbone, we fine-tune specialized cracking agents for each CAPTCHA category, enabling consistent, cross-modal assessments. Extensive experiments reveal that MCA-Bench effectively maps the vulnerability spectrum of modern CAPTCHA designs under varied attack settings, and—crucially—offers the first quantitative analysis of how challenge complexity, interaction depth, and model solvability interrelate. Based on these findings, we propose three actionable design principles and identify key open challenges, laying the groundwork for systematic CAPTCHA hardening, fair benchmarking, and broader community collaboration.

NeurIPS Conference 2025 Conference Paper

ARGenSeg: Image Segmentation with Autoregressive Image Generation Model

  • Xiaolong Wang
  • Lixiang Ru
  • Ziyuan Huang
  • Kaixiang Ji
  • DanDan Zheng
  • Jingdong Chen
  • Jun Zhou

We propose a novel AutoRegressive Generation-based paradigm for image Segmentation (ARGenSeg), achieving multimodal understanding and pixel-level perception within a unified framework. Prior works integrating image segmentation into multimodal large language models (MLLMs) typically employ either boundary-point representations or dedicated segmentation heads. These methods rely on discrete representations or semantic prompts fed into task-specific decoders, which limits the ability of the MLLM to capture fine-grained visual details. To address these challenges, we introduce a segmentation framework for MLLMs based on image generation, which naturally produces dense masks for target objects. We leverage the MLLM to output visual tokens and detokenize them into images using a universal VQ-VAE, making the segmentation fully dependent on the pixel-level understanding of the MLLM. To reduce inference latency, we employ a next-scale-prediction strategy to generate the required visual tokens in parallel. Extensive experiments demonstrate that our method surpasses prior state-of-the-art approaches on multiple segmentation datasets with a remarkable boost in inference speed, while maintaining strong understanding capabilities.

ICLR Conference 2025 Conference Paper

Consistent Flow Distillation for Text-to-3D Generation

  • Runjie Yan
  • Yinbo Chen
  • Xiaolong Wang

Score Distillation Sampling (SDS) has made significant strides in distilling image-generative models for 3D generation. However, its maximum-likelihood-seeking behavior often leads to degraded visual quality and diversity, limiting its effectiveness in 3D applications. In this work, we propose Consistent Flow Distillation (CFD), which addresses these limitations. We begin by leveraging the gradient of the diffusion ODE or SDE sampling process to guide the 3D generation. From the gradient-based sampling perspective, we find that the consistency of 2D image flows across different viewpoints is important for high-quality 3D generation. To achieve this, we introduce multi-view consistent Gaussian noise on the 3D object, which can be rendered from various viewpoints to compute the flow gradient. Our experiments demonstrate that CFD, through consistent flows, significantly outperforms previous methods in text-to-3D generation.

IROS Conference 2025 Conference Paper

DNAct: Diffusion Guided Multi-Task 3D Policy Learning

  • Ge Yan
  • Yueh-Hua Wu
  • Xiaolong Wang

This paper presents DNAct, a language-conditioned multi-task policy framework that integrates neural rendering pre-training and diffusion training to enforce multi-modality learning in action sequence spaces. To learn a generalizable multi-task policy with few demonstrations, the pre-training phase of DNAct leverages neural rendering to distill 2D semantic features from foundation models such as Stable Diffusion into a 3D space, which provides a comprehensive semantic understanding of the scene. Consequently, it supports challenging robotic tasks that require rich 3D semantics and accurate geometry. Furthermore, we introduce a novel approach utilizing diffusion training to learn a vision and language feature that encapsulates the inherent multi-modality in the multi-task demonstrations. By reconstructing the action sequences from different tasks via the diffusion process, the model is capable of distinguishing different modalities and thus improving the robustness and the generalizability of the learned representation. DNAct significantly surpasses SOTA NeRF-based multi-task manipulation approaches with over 30% improvement in success rate. Videos are available at dnact.github.io.

NeurIPS Conference 2025 Conference Paper

Gaussian-Augmented Physics Simulation and System Identification with Complex Colliders

  • Federico Vasile
  • Ri-Zhao Qiu
  • Lorenzo Natale
  • Xiaolong Wang

System identification involving the geometry, appearance, and physical properties from video observations is a challenging task with applications in robotics and graphics. Recent approaches have relied on fully differentiable Material Point Method (MPM) and rendering for simultaneous optimization of these properties. However, they are limited to simplified object-environment interactions with planar colliders and fail in more challenging scenarios where objects collide with non-planar surfaces. We propose AS-DiffMPM, a differentiable MPM framework that enables physical property estimation with arbitrarily shaped colliders. Our approach extends existing methods by incorporating a differentiable collision handling mechanism, allowing the target object to interact with complex rigid bodies while maintaining end-to-end optimization. We show AS-DiffMPM can be easily interfaced with various novel view synthesis methods as a framework for system identification from visual observations.

TMLR Journal 2025 Journal Article

HandsOnVLM: Vision-Language Models for Hand-Object Interaction Prediction

  • Chen Bao
  • Jiarui Xu
  • Xiaolong Wang
  • Abhinav Gupta
  • Homanga Bharadhwaj

How can we predict future interaction trajectories of human hands in a scene given high-level colloquial task specifications in the form of natural language? In this paper, we extend the classic hand trajectory prediction task to several tasks involving explicit and implicit language queries. Our proposed tasks require extensive understanding of human daily activities and reasoning abilities about what is happening next given cues from the current scene. We also develop new benchmarks to evaluate the two proposed tasks, Vanilla Hand Prediction (VHP) and Reasoning-Based Hand Prediction (RBHP). We enable solving these tasks by integrating the high-level world knowledge and reasoning capabilities of Vision-Language Models (VLMs) with the auto-regressive nature of low-level ego-centric hand trajectories. Our model, HandsOnVLM, is a novel VLM that can generate textual responses and produce future hand trajectories through natural-language conversations. Our experiments show that HandsOnVLM outperforms existing task-specific methods and other VLM baselines on the proposed tasks, and demonstrates its ability to effectively utilize world knowledge for reasoning about low-level human hand trajectories based on the provided context. More details can be found at https://www.chenbao.tech/handsonvlm/.

AAAI Conference 2025 Conference Paper

HomoMatcher: Achieving Dense Feature Matching with Semi-Dense Efficiency by Homography Estimation

  • Xiaolong Wang
  • Lei Yu
  • Yingying Zhang
  • Jiangwei Lao
  • Lixiang Ru
  • Liheng Zhong
  • Jingdong Chen
  • Yu Zhang

Feature matching between image pairs is a fundamental problem in computer vision that drives many applications, such as SLAM. Recently, semi-dense matching approaches have achieved substantial performance enhancements and established a widely-accepted coarse-to-fine paradigm. However, the majority of existing methods focus on improving coarse feature representation rather than the fine-matching module. Prior fine-matching techniques, which rely on point-to-patch matching probability expectation or direct regression, often lack precision and do not guarantee the continuity of feature points across sequential images. To address this limitation, this paper concentrates on enhancing the fine-matching module in the semi-dense matching framework. We employ a lightweight and efficient homography estimation network to generate the perspective mapping between patches obtained from coarse matching. This patch-to-patch approach achieves the overall alignment of two patches, resulting in a higher sub-pixel accuracy by incorporating additional constraints. By leveraging the homography estimation between patches, we can achieve a dense matching result with low computational cost. Extensive experiments demonstrate that our method achieves higher accuracy compared to previous semi-dense matchers. Meanwhile, our dense matching results exhibit similar end-point-error accuracy compared to previous dense matchers while maintaining semi-dense efficiency.

JMLR Journal 2025 Journal Article

Test-Time Training on Video Streams

  • Renhao Wang
  • Yu Sun
  • Arnuv Tandon
  • Yossi Gandelsman
  • Xinlei Chen
  • Alexei A. Efros
  • Xiaolong Wang

Prior work has established Test-Time Training (TTT) as a general framework to further improve a trained model at test time. Before making a prediction on each test instance, the model is first trained on the same instance using a self-supervised task such as reconstruction. We extend TTT to the streaming setting, where multiple test instances - video frames in our case - arrive in temporal order. Our extension is online TTT: The current model is initialized from the previous model, then trained on the current frame and a small window of frames immediately before. Online TTT significantly outperforms the fixed-model baseline for four tasks, on three real-world datasets. The improvements are more than 2.2x and 1.5x for instance and panoptic segmentation. Surprisingly, online TTT also outperforms its offline variant that accesses strictly more information, training on all frames from the entire test video regardless of temporal order. This finding challenges those in prior work using synthetic videos. We formalize a notion of locality as the advantage of online over offline TTT, and analyze its role with ablations and a theory based on bias-variance trade-off.
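The online TTT loop the abstract describes (initialize from the previous model, train on the current frame plus a small window of preceding frames, then predict) can be sketched on a toy one-parameter model. This is a minimal illustration, not the paper's method: `sgd_step`, the scalar "frames", and the reconstruction objective are hypothetical stand-ins.

```python
def sgd_step(theta, frame, lr=0.5):
    # One gradient step on a toy self-supervised objective:
    # reconstruct the frame's scalar intensity, loss = (theta - frame)^2.
    return theta - lr * 2.0 * (theta - frame)

def online_ttt(theta0, stream, window=2, lr=0.5):
    """Online TTT: the model for frame t starts from the model for frame
    t-1, is trained on frame t and the `window` frames immediately
    before it, and is then used for prediction on frame t."""
    theta, preds = theta0, []
    for t, frame in enumerate(stream):
        for f in stream[max(0, t - window):t + 1]:
            theta = sgd_step(theta, f, lr)
        preds.append(theta)  # prediction made after adapting on frame t
    return preds
```

On a slowly drifting stream the adapted model tracks the data, while a model frozen at `theta0` does not, mirroring the fixed-model baseline comparison in the abstract.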

NeurIPS Conference 2025 Conference Paper

WorldModelBench: Judging Video Generation Models As World Models

  • Dacheng Li
  • Yunhao Fang
  • Yukang Chen
  • Shuo Yang
  • Shiyi Cao
  • Justin Wong
  • Michael Luo
  • Xiaolong Wang

Video generation models have rapidly progressed, positioning themselves as video world models capable of supporting decision-making applications like robotics and autonomous driving. However, current benchmarks fail to rigorously evaluate these claims, focusing only on general video quality and ignoring factors important to world models such as physics adherence. To bridge this gap, we propose WorldModelBench, a benchmark designed to evaluate the world modeling capabilities of video generation models in application-driven domains. WorldModelBench offers two key advantages: (1) Sensitivity to nuanced world modeling violations: by incorporating instruction-following and physics-adherence dimensions, WorldModelBench detects subtle violations, such as irregular changes in object size that breach the law of mass conservation, issues overlooked by prior benchmarks. (2) Alignment with large-scale human preferences: we crowd-source 67K human labels to accurately measure 14 frontier models. Using these high-quality human labels, we further fine-tune an accurate judger to automate the evaluation procedure, achieving 9.9% lower error than GPT-4o in predicting world modeling violations with only 2B parameters. In addition, we demonstrate that training to align with human annotations by maximizing the rewards from the judger noticeably improves world modeling capability. The dataset is hosted on HuggingFace at https://huggingface.co/datasets/Efficient-Large-Model/worldmodelbench, and the evaluation code is available at https://github.com/WorldModelBench-Team/WorldModelBench.

NeurIPS Conference 2024 Conference Paper

A Simulation Benchmark for Autonomous Racing with Large-Scale Human Data

  • Adrian Remonda
  • Nicklas Hansen
  • Ayoub Raji
  • Nicola Musiu
  • Marko Bertogna
  • Eduardo E. Veas
  • Xiaolong Wang

Despite the availability of international prize-money competitions, scaled vehicles, and simulation environments, research on autonomous racing and the control of sports cars operating close to the limit of handling has been limited by the high costs of vehicle acquisition and management, as well as the limited physics accuracy of open-source simulators. In this paper, we propose a racing simulation platform based on the simulator Assetto Corsa to test, validate, and benchmark autonomous driving algorithms, including reinforcement learning (RL) and classical Model Predictive Control (MPC), in realistic and challenging scenarios. Our contributions include the development of this simulation platform, several state-of-the-art algorithms tailored to the racing environment, and a comprehensive dataset collected from human drivers. Additionally, we evaluate algorithms in the offline RL setting. All the necessary code (including environment and benchmarks), working examples, and datasets are publicly released and can be found at: https://github.com/dasGringuen/assetto_corsa_gym.

TMLR Journal 2024 Journal Article

IMProv: Inpainting-based Multimodal Prompting for Computer Vision Tasks

  • Jiarui Xu
  • Yossi Gandelsman
  • Amir Bar
  • Jianwei Yang
  • Jianfeng Gao
  • Trevor Darrell
  • Xiaolong Wang

In-context learning allows adapting a model to new tasks given a task description at test time. In this paper, we present IMProv - a generative model that is able to in-context learn visual tasks from multimodal prompts. Given a textual description of a visual task (e.g. “Left: input image, Right: foreground segmentation”), a few input-output visual examples, or both, the model in-context learns to solve it for a new test input. We train a masked generative transformer on a new dataset of figures from computer vision papers and their associated captions, together with a captioned large-scale image-text dataset. During inference time, we prompt the model with text and/or image task example(s) and have the model inpaint the corresponding output. We show that training our model with text conditioning and scaling the dataset size improves in-context learning for computer vision tasks by over +10% AP for Foreground Segmentation, over +5% AP for Single Object Detection, and almost 20% lower LPIPS in Colorization. Our empirical results suggest that vision and language prompts are complementary and it is advantageous to use both to achieve better in-context learning performance.

NeurIPS Conference 2024 Conference Paper

SpatialRGPT: Grounded Spatial Reasoning in Vision-Language Models

  • An-Chieh Cheng
  • Hongxu Yin
  • Yang Fu
  • Qiushan Guo
  • Ruihan Yang
  • Jan Kautz
  • Xiaolong Wang
  • Sifei Liu

Vision Language Models (VLMs) have demonstrated remarkable performance in 2D vision and language tasks. However, their ability to reason about spatial arrangements remains limited. In this work, we introduce Spatial Region GPT (SpatialRGPT) to enhance VLMs’ spatial perception and reasoning capabilities. SpatialRGPT advances VLMs’ spatial understanding through two key innovations: (i) a data curation pipeline that enables effective learning of regional representation from 3D scene graphs, and (ii) a flexible "plugin" module for integrating depth information into the visual encoder of existing VLMs. During inference, when provided with user-specified region proposals, SpatialRGPT can accurately perceive their relative directions and distances. Additionally, we propose SpatialRGPT-Bench, a benchmark with ground-truth 3D annotations encompassing indoor, outdoor, and simulated environments, for evaluating 3D spatial cognition in VLMs. Our results demonstrate that SpatialRGPT significantly enhances performance in spatial reasoning tasks, both with and without local region prompts. The model also exhibits strong generalization capabilities, effectively reasoning about complex spatial relations and functioning as a region-aware dense reward annotator for robotic tasks. Code, dataset, and benchmark are released at https://www.anjiecheng.me/SpatialRGPT.

TARK Conference 2023 Conference Paper

Causal Kripke Models

  • Yiwen Ding
  • Krishna Manoorkar
  • Apostolos Tzimoulis
  • Ruoding Wang
  • Xiaolong Wang

This work extends Halpern and Pearl's causal models for actual causality to a possible world semantics environment. Using this framework we introduce a logic of actual causality with modal operators, which allows for reasoning about causality in scenarios involving multiple possibilities, temporality, knowledge and uncertainty. We illustrate this with a number of examples, and conclude by discussing some future directions for research.

AAAI Conference 2023 Conference Paper

Cross-Modality Person Re-identification with Memory-Based Contrastive Embedding

  • De Cheng
  • Xiaolong Wang
  • Nannan Wang
  • Zhen Wang
  • Xiaoyu Wang
  • Xinbo Gao

Visible-infrared person re-identification (VI-ReID) aims to retrieve person images of the same identity across the RGB and infrared image spaces, which is very important for real-world surveillance systems. In practice, VI-ReID is more challenging due to the heterogeneous modality discrepancy, which further aggravates the challenges of the traditional single-modality person ReID problem, i.e., inter-class confusion and intra-class variations. In this paper, we propose an aggregated memory-based cross-modality deep metric learning framework, which benefits from an increasing number of learned modality-aware and modality-agnostic centroid proxies for cluster contrast and mutual information learning. Furthermore, to suppress the modality discrepancy, the proposed cross-modality alignment objective simultaneously utilizes both historical and up-to-date learned cluster proxies for enhanced cross-modality association. This training mechanism helps to obtain hard positive references through increased diversity of the learned cluster proxies, and finally achieves a stronger "pulling close" effect between cross-modality image features. Extensive experimental results demonstrate the effectiveness of the proposed method, which surpasses state-of-the-art works by a large margin on the commonly used VI-ReID datasets.

NeurIPS Conference 2023 Conference Paper

Elastic Decision Transformer

  • Yueh-Hua Wu
  • Xiaolong Wang
  • Masashi Hamaya

This paper introduces Elastic Decision Transformer (EDT), a significant advancement over the existing Decision Transformer (DT) and its variants. Although DT purports to generate an optimal trajectory, empirical evidence suggests it struggles with trajectory stitching, a process involving the generation of an optimal or near-optimal trajectory from the best parts of a set of sub-optimal trajectories. The proposed EDT differentiates itself by facilitating trajectory stitching during action inference at test time, achieved by adjusting the history length maintained in DT. Further, the EDT optimizes the trajectory by retaining a longer history when the previous trajectory is optimal and a shorter one when it is sub-optimal, enabling it to "stitch" with a more optimal trajectory. Extensive experimentation demonstrates EDT's ability to bridge the performance gap between DT-based and Q Learning-based approaches. In particular, the EDT outperforms Q Learning-based methods in a multi-task regime on the D4RL locomotion benchmark and Atari games.
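The test-time mechanism described above, as we read it, searches over candidate history lengths and keeps the one whose estimated maximal return is highest. A minimal sketch under that assumption; `value_of` is a hypothetical stand-in for EDT's learned estimate of the maximal achievable return, not the paper's actual API:

```python
def elastic_history(history, value_of, candidate_lengths):
    # Keep a long history when the recent trajectory looks optimal and a
    # short one when it does not: pick the truncation whose estimated
    # maximal achievable return is highest, enabling trajectory stitching.
    best = max(candidate_lengths, key=lambda w: value_of(history[-w:]))
    return history[-best:]
```

With a toy value estimate (mean reward of the retained suffix), a poor past gets truncated away while a good past is kept in full, which is the stitching behavior the abstract describes.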

NeurIPS Conference 2023 Conference Paper

Fine-Grained Cross-View Geo-Localization Using a Correlation-Aware Homography Estimator

  • Xiaolong Wang
  • Runsen Xu
  • Zhuofan Cui
  • Zeyu Wan
  • Yu Zhang

In this paper, we introduce a novel approach to fine-grained cross-view geo-localization. Our method aligns a warped ground image with a corresponding GPS-tagged satellite image covering the same area using homography estimation. We first employ a differentiable spherical transform, adhering to geometric principles, to accurately align the perspective of the ground image with the satellite map. This transformation effectively places ground and aerial images in the same view and on the same plane, reducing the task to an image alignment problem. To address challenges such as occlusion, small overlapping range, and seasonal variations, we propose a robust correlation-aware homography estimator to align similar parts of the transformed ground image with the satellite image. Our method achieves sub-pixel resolution and meter-level GPS accuracy by mapping the center point of the transformed ground image to the satellite image using a homography matrix and determining the orientation of the ground camera using a point above the central axis. Operating at a speed of 30 FPS, our method outperforms state-of-the-art techniques, reducing the mean metric localization error by 21.3% and 32.4% in same-area and cross-area generalization tasks on the VIGOR benchmark, respectively, and by 34.4% on the KITTI benchmark in same-area evaluation.
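The final localization step above, mapping the transformed ground image's center point into the satellite image with the estimated homography, amounts to one projective transform of a pixel. A self-contained sketch; the 3x3 matrix `H` here is illustrative, not an estimate produced by the paper's network:

```python
def warp_point(H, x, y):
    # Apply a 3x3 homography to a pixel via homogeneous coordinates:
    # [X, Y, W]^T = H @ [x, y, 1]^T, then divide out the scale W.
    X = H[0][0] * x + H[0][1] * y + H[0][2]
    Y = H[1][0] * x + H[1][1] * y + H[1][2]
    W = H[2][0] * x + H[2][1] * y + H[2][2]
    return X / W, Y / W  # sub-pixel position in the satellite image
```

Once the satellite image's ground sampling distance and GPS tag are known, the mapped center converts directly into a metric camera position, which is how sub-pixel alignment yields meter-level accuracy.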

JBHI Journal 2023 Journal Article

Multimodal Data Matters: Language Model Pre-Training Over Structured and Unstructured Electronic Health Records

  • Sicen Liu
  • Xiaolong Wang
  • Yongshuai Hou
  • Ge Li
  • Hui Wang
  • Hui Xu
  • Yang Xiang
  • Buzhou Tang

As two important textual modalities in electronic health records (EHR), both structured data (clinical codes) and unstructured data (clinical narratives) have recently been increasingly applied to the healthcare domain. Most existing EHR-oriented studies, however, either focus on a particular modality or integrate data from different modalities in a straightforward manner, treating structured and unstructured data as two independent sources of information about a patient admission and ignoring the intrinsic interactions between them. In fact, the two modalities are documented during the same encounter, where structured data inform the documentation of unstructured data and vice versa. In this paper, we propose a Medical Multimodal Pre-trained Language Model, named MedM-PLM, to learn enhanced EHR representations over structured and unstructured data and explore the interaction of the two modalities. In MedM-PLM, two Transformer-based neural network components are first adopted to learn representative characteristics from each modality. A cross-modal module is then introduced to model their interactions. We pre-trained MedM-PLM on the MIMIC-III dataset and verified the effectiveness of the model on three downstream clinical tasks, i.e., medication recommendation, 30-day readmission prediction, and ICD coding. Extensive experiments demonstrate the power of MedM-PLM compared with state-of-the-art methods. Further analyses and visualizations show the robustness of our model, which could potentially provide more comprehensive interpretations for clinical decision-making.

JBHI Journal 2023 Journal Article

SHAPE: A Sample-Adaptive Hierarchical Prediction Network for Medication Recommendation

  • Sicen Liu
  • Xiaolong Wang
  • Jingcheng Du
  • Yongshuai Hou
  • Xianbing Zhao
  • Hui Xu
  • Hui Wang
  • Yang Xiang

Effective medication recommendation under complex multimorbidity conditions is a critical yet challenging task in healthcare. Most existing works predict medications from longitudinal records, assuming that intra-visit medical events are encoded as serialized sequences and that the information-transmission patterns learned from longitudinal sequence data are stable. However, two conditions are often ignored: 1) a more compact encoder is needed for the relationships within intra-visit medical events; 2) strategies for learning accurate representations differ across patients' variable-length longitudinal sequences. In this article, we propose a novel Sample-adaptive Hierarchical medicAtion Prediction nEtwork, termed SHAPE, to tackle the above challenges in the medication recommendation task. Specifically, we design a compact intra-visit set encoder to encode the relationships among medical events and obtain a visit-level representation, and then develop an inter-visit longitudinal encoder to learn the patient-level longitudinal representation efficiently. To endow the model with the capability of modeling variable visit lengths, we introduce a soft curriculum learning method that automatically assigns each sample's difficulty according to its visit length. Extensive experiments on a benchmark dataset verify the superiority of our model compared with several state-of-the-art baselines.

NeurIPS Conference 2022 Conference Paper

Category-Level 6D Object Pose Estimation in the Wild: A Semi-Supervised Learning Approach and A New Dataset

  • Yang Fu
  • Xiaolong Wang

6D object pose estimation is one of the fundamental problems in computer vision and robotics research. While a lot of recent efforts have been made on generalizing pose estimation to novel object instances within the same category, namely category-level 6D pose estimation, it is still restricted to constrained environments given the limited number of annotated data. In this paper, we collect Wild6D, a new unlabeled RGBD object video dataset with diverse instances and backgrounds. We utilize this data to generalize category-level 6D object pose estimation in the wild with semi-supervised learning. We propose a new model, called Rendering for Pose estimation network (RePoNet), that is jointly trained using the freely available ground-truths of the synthetic data and a silhouette-matching objective on the real-world data. Without using any 3D annotations on real data, our method outperforms state-of-the-art methods on the previous dataset and our Wild6D test set (with manual annotations for evaluation) by a large margin. Project page with Wild6D data: https://oasisyang.github.io/semi-pose/.

NeurIPS Conference 2021 Conference Paper

Multi-Person 3D Motion Prediction with Multi-Range Transformers

  • Jiashun Wang
  • Huazhe Xu
  • Medhini Narasimhan
  • Xiaolong Wang

We propose a novel framework for multi-person 3D motion trajectory prediction. Our key observation is that a human's actions and behaviors may highly depend on the other people around them. Thus, instead of predicting each human pose trajectory in isolation, we introduce a Multi-Range Transformers model which consists of a local-range encoder for individual motion and a global-range encoder for social interactions. The Transformer decoder then performs prediction for each person by taking a corresponding pose as a query which attends to both local- and global-range encoder features. Our model not only outperforms state-of-the-art methods on long-term 3D motion prediction, but also generates diverse social interactions. More interestingly, our model can even predict 15-person motion simultaneously by automatically dividing the persons into different interaction groups. Project page with code is available at https://jiashunwang.github.io/MRT/.

NeurIPS Conference 2021 Conference Paper

NovelD: A Simple yet Effective Exploration Criterion

  • Tianjun Zhang
  • Huazhe Xu
  • Xiaolong Wang
  • Yi Wu
  • Kurt Keutzer
  • Joseph E. Gonzalez
  • Yuandong Tian

Efficient exploration under sparse rewards remains a key challenge in deep reinforcement learning. Previous exploration methods (e.g., RND) have achieved strong results on multiple hard tasks. However, when there are multiple novel areas to explore, these methods often focus quickly on one without sufficiently trying the others (in a depth-first-search manner). In some scenarios (e.g., the four-corridor environment in Sec. 4.2), we observe that they explore one corridor for a long time and fail to cover all the states. On the other hand, in theoretical RL, with optimistic initialization and the inverse square root of the visitation count as a bonus, an agent does not suffer from this and explores different novel regions alternately (in a breadth-first-search manner). Inspired by this, we propose a simple but effective criterion called NovelD that weights every novel area approximately equally. Our algorithm is very simple yet matches or even outperforms multiple SOTA exploration methods on many hard exploration tasks. Specifically, NovelD solves all the static procedurally-generated tasks in MiniGrid within just 120M environment steps, without any curriculum learning. In comparison, the previous SOTA solves only 50% of them. NovelD also achieves SOTA on multiple tasks in NetHack, a rogue-like game that contains more challenging procedurally-generated environments. In multiple Atari games (e.g., Montezuma's Revenge, Venture, Gravitar), NovelD outperforms RND. We analyze NovelD thoroughly in MiniGrid and find that it empirically helps the agent explore the environment more uniformly, with a focus on exploring beyond the boundary.
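The NovelD criterion itself is compact; a sketch of the intrinsic reward as we understand it from the paper (a clipped novelty difference between consecutive states, granted only on the first episodic visit), with `novelty_s` and `novelty_s_next` standing in for an RND-style novelty measure:

```python
def noveld_bonus(novelty_s, novelty_s_next, episodic_count_s_next, alpha=0.5):
    # Reward the agent for crossing from a familiar state into a more
    # novel one, but only the first time s_next is visited this episode,
    # so every novel area ends up weighted approximately equally.
    first_visit = 1.0 if episodic_count_s_next == 1 else 0.0
    return first_visit * max(novelty_s_next - alpha * novelty_s, 0.0)
```

Because the bonus is a difference at the boundary between explored and unexplored regions rather than raw novelty, the agent is pushed outward along the whole frontier instead of dwelling in whichever novel corridor it entered first.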

NeurIPS Conference 2021 Conference Paper

Stabilizing Deep Q-Learning with ConvNets and Vision Transformers under Data Augmentation

  • Nicklas Hansen
  • Hao Su
  • Xiaolong Wang

While agents trained by Reinforcement Learning (RL) can solve increasingly challenging tasks directly from visual observations, generalizing learned skills to novel environments remains very challenging. Extensive use of data augmentation is a promising technique for improving generalization in RL, but it is often found to decrease sample efficiency and can even lead to divergence. In this paper, we investigate causes of instability when using data augmentation in common off-policy RL algorithms. We identify two problems, both rooted in high-variance Q-targets. Based on our findings, we propose a simple yet effective technique for stabilizing this class of algorithms under augmentation. We perform extensive empirical evaluation of image-based RL using both ConvNets and Vision Transformers (ViT) on a family of benchmarks based on DeepMind Control Suite, as well as in robotic manipulation tasks. Our method greatly improves stability and sample efficiency of ConvNets under augmentation, and achieves generalization results competitive with state-of-the-art methods for image-based RL in environments with unseen visuals. We further show that our method scales to RL with ViT-based architectures, and that data augmentation may be especially important in this setting.
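One plausible reading of the stabilization idea above, sketched as a toy: since the instability is rooted in high-variance Q-targets, compute the TD target from the unaugmented observation only, and regress Q-predictions from both the clean and the augmented views onto that single low-variance target. The function names and scalar "observations" are illustrative assumptions, not the paper's API.

```python
def stabilized_td_loss(q, q_target, transition, augment, gamma=0.99, lam=0.5):
    s, a, r, s_next = transition
    # Low-variance target: no augmentation on the next observation.
    target = r + gamma * q_target(s_next)
    # Both the clean and the augmented views regress to the same target,
    # so augmentation noise never enters the bootstrapped target itself.
    err_clean = (q(s, a) - target) ** 2
    err_aug = (q(augment(s), a) - target) ** 2
    return lam * err_clean + (1.0 - lam) * err_aug
```

Keeping augmentation out of the target while still training the predictions on augmented views is what lets the agent gain augmentation's generalization benefits without the divergence the abstract describes.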

NeurIPS Conference 2021 Conference Paper

Test-Time Personalization with a Transformer for Human Pose Estimation

  • Yizhuo Li
  • Miao Hao
  • Zonglin Di
  • Nitesh Bharadwaj Gundavarapu
  • Xiaolong Wang

We propose to personalize a 2D human pose estimator given a set of test images of a person, without using any manual annotations. While there has been significant advancement in human pose estimation, it is still very challenging for a model to generalize to unknown environments and unseen persons. Instead of using a fixed model for every test case, we adapt our pose estimator during test time to exploit person-specific information. We first train our model on diverse data with supervised and self-supervised pose estimation objectives jointly. We use a Transformer model to build a transformation between the self-supervised keypoints and the supervised keypoints. During test time, we personalize and adapt our model by fine-tuning with the self-supervised objective; the pose is then improved by transforming the updated self-supervised keypoints. We experiment with multiple datasets and show significant improvements on pose estimation with our self-supervised personalization. Project page with code is available at https://liyz15.github.io/TTP/.

NeurIPS Conference 2020 Conference Paper

Multi-Task Reinforcement Learning with Soft Modularization

  • Ruihan Yang
  • Huazhe Xu
  • Yi Wu
  • Xiaolong Wang

Multi-task learning is a very challenging problem in reinforcement learning. While training multiple tasks jointly allows the policies to share parameters across different tasks, the optimization problem becomes non-trivial: it remains unclear which parameters in the network should be reused across tasks, and how the gradients from different tasks may interfere with each other. Thus, instead of naively sharing parameters across tasks, we introduce an explicit modularization technique on the policy representation to alleviate this optimization issue. Given a base policy network, we design a routing network which estimates different routing strategies to reconfigure the base network for each task. Instead of directly selecting routes for each task, our task-specific policy uses a method called soft modularization to softly combine all the possible routes, which makes it suitable for sequential tasks. We experiment with various robotic manipulation tasks in simulation and show that our method improves both sample efficiency and performance over strong baselines by a large margin.
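The soft combination of routes described above can be sketched as follows. This is our simplification, not the paper's architecture: at each layer a routing network maps the task embedding to a soft distribution over that layer's modules, and the layer output is the probability-weighted mix of all module outputs, so route selection stays differentiable. The module shapes, the `tanh` nonlinearity, and the per-layer routing matrices `routing_w` are all illustrative assumptions.

```python
import numpy as np

def softmax(x, axis=-1):
    e = np.exp(x - x.max(axis=axis, keepdims=True))
    return e / e.sum(axis=axis, keepdims=True)

def soft_modular_forward(x, task_emb, modules, routing_w):
    """Minimal soft-modularization sketch.

    modules[l]   : list of weight matrices (the layer's candidate modules)
    routing_w[l] : maps the task embedding to one logit per module
    """
    h = x
    for layer_modules, r_w in zip(modules, routing_w):
        probs = softmax(task_emb @ r_w)                    # soft route over modules
        outs = np.stack([np.tanh(h @ m) for m in layer_modules])
        h = np.tensordot(probs, outs, axes=([0], [0]))     # weighted combination
    return h
```

Because every module contributes with a continuous weight, gradients flow through all routes, avoiding the non-differentiable hard selection the abstract contrasts against.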

NeurIPS Conference 2020 Conference Paper

Online Adaptation for Consistent Mesh Reconstruction in the Wild

  • Xueting Li
  • Sifei Liu
  • Shalini De Mello
  • Kihwan Kim
  • Xiaolong Wang
  • Ming-Hsuan Yang
  • Jan Kautz

This paper presents an algorithm to reconstruct temporally consistent 3D meshes of deformable object instances from videos in the wild. Without requiring annotations of 3D mesh, 2D keypoints, or camera pose for each video frame, we pose video-based reconstruction as a self-supervised online adaptation problem applied to any incoming test video. We first learn a category-specific 3D reconstruction model from a collection of single-view images of the same category that jointly predicts the shape, texture, and camera pose of an image. Then, at inference time, we adapt the model to a test video over time using self-supervised regularization terms that exploit temporal consistency of an object instance to enforce that all reconstructed meshes share a common texture map, a base shape, as well as parts. We demonstrate that our algorithm recovers temporally consistent and reliable 3D structures from videos of non-rigid objects including those of animals captured in the wild -- an extremely challenging task rarely addressed before.

NeurIPS Conference 2019 Conference Paper

Joint-task Self-supervised Learning for Temporal Correspondence

  • Xueting Li
  • Sifei Liu
  • Shalini De Mello
  • Xiaolong Wang
  • Jan Kautz
  • Ming-Hsuan Yang

This paper proposes to learn reliable dense correspondence from videos in a self-supervised manner. Our learning process integrates two highly related tasks: tracking large image regions and establishing fine-grained pixel-level associations between consecutive video frames. We exploit the synergy between both tasks through a shared inter-frame affinity matrix, which simultaneously models transitions between video frames at both the region and pixel levels. Region-level localization helps reduce ambiguities in fine-grained matching by narrowing down search regions, while fine-grained matching provides bottom-up features to facilitate region-level localization. Our method outperforms state-of-the-art self-supervised methods on a variety of visual correspondence tasks, including video object and part segmentation propagation, keypoint tracking, and object tracking. Our self-supervised method even surpasses the fully-supervised affinity feature representation obtained from a ResNet-18 pre-trained on ImageNet.
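The shared inter-frame affinity at the heart of the abstract can be sketched as a normalized feature-similarity matrix. This is our rendering of the general construction, not the paper's exact formulation: features of frame A and frame B (one row per pixel or region) are L2-normalized, compared by dot product, and a row-wise softmax turns similarities into transition probabilities from A's locations to B's. The `temperature` value is an illustrative choice.

```python
import numpy as np

def interframe_affinity(feat_a, feat_b, temperature=0.07):
    """Affinity from frame A's locations to frame B's locations.

    feat_a: (n_a, d) features for frame A; feat_b: (n_b, d) for frame B.
    Returns an (n_a, n_b) row-stochastic transition matrix. The same
    construction can be read at region level (pooled features) or pixel level.
    """
    a = feat_a / np.linalg.norm(feat_a, axis=1, keepdims=True)
    b = feat_b / np.linalg.norm(feat_b, axis=1, keepdims=True)
    sim = a @ b.T / temperature
    sim -= sim.max(axis=1, keepdims=True)      # numerically stable softmax
    e = np.exp(sim)
    return e / e.sum(axis=1, keepdims=True)    # each row sums to 1
```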

AAAI Conference 2018 Conference Paper

DeepRebirth: Accelerating Deep Neural Network Execution on Mobile Devices

  • Dawei Li
  • Xiaolong Wang
  • Deguang Kong

Deploying deep neural networks on mobile devices is a challenging task. Current model compression methods such as matrix decomposition effectively reduce the deployed model size, but still cannot satisfy real-time processing requirements. This paper first discovers that the major obstacle is the excessive execution time of non-tensor layers such as pooling and normalization, which have no tensor-like trainable parameters. This motivates us to design a novel acceleration framework, DeepRebirth, that works by “slimming” existing consecutive and parallel non-tensor and tensor layers. The layer slimming is executed at different substructures: (a) streamline slimming, which merges consecutive non-tensor and tensor layers vertically; (b) branch slimming, which merges non-tensor and tensor branches horizontally. The proposed optimization operations significantly accelerate model execution and also greatly reduce the run-time memory cost, since the slimmed model architecture contains fewer hidden layers. To maximally avoid accuracy loss, the parameters in newly generated layers are learned with layer-wise fine-tuning based on both theoretical analysis and empirical verification. In experiments, DeepRebirth achieves more than 3x speed-up and 2.5x run-time memory saving on GoogLeNet, with only a 0.4% drop in top-5 accuracy on ImageNet. Furthermore, by combining with other model compression techniques, DeepRebirth offers an average inference time of 106.3ms on the CPU of a Samsung Galaxy S5 with 86.5% top-5 accuracy, 14% faster than SqueezeNet, which only has a top-5 accuracy of 80.5%.
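The simplest instance of merging a non-tensor layer into an adjacent tensor layer is folding batch normalization into the preceding linear/conv weights, which removes the non-tensor layer at no accuracy cost. This closed-form fold is just the degenerate exact case of "streamline slimming" as we read it; DeepRebirth goes further and retrains the merged layers.

```python
import numpy as np

def fold_batchnorm(w, b, gamma, beta, mean, var, eps=1e-5):
    """Absorb a BatchNorm layer into the preceding linear layer's weights.

    y = gamma * ((w @ x + b) - mean) / sqrt(var + eps) + beta
      = (scale * w) @ x + (scale * (b - mean) + beta),
    where scale = gamma / sqrt(var + eps), applied per output channel.
    """
    scale = gamma / np.sqrt(var + eps)
    w_folded = w * scale[:, None]            # rescale each output row
    b_folded = (b - mean) * scale + beta     # shift the bias accordingly
    return w_folded, b_folded
```

After folding, inference executes one tensor layer instead of two layers, which is exactly the kind of non-tensor-layer elimination the abstract argues dominates mobile execution time.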

IJCAI Conference 2018 Conference Paper

Dynamically Hierarchy Revolution: DirNet for Compressing Recurrent Neural Network on Mobile Devices

  • Jie Zhang
  • Xiaolong Wang
  • Dawei Li
  • Yalin Wang

Recurrent neural networks (RNNs) achieve cutting-edge performance on a variety of problems. However, due to their high computational and memory demands, deploying RNNs on resource-constrained mobile devices is a challenging task. To guarantee minimum accuracy loss at a higher compression rate, and driven by mobile resource requirements, we introduce a novel model compression approach, DirNet, based on an optimized fast dictionary learning algorithm, which 1) dynamically mines the dictionary atoms of the projection dictionary matrix within a layer to adjust the compression rate, and 2) adaptively changes the sparsity of the sparse codes across the hierarchical layers. Experimental results on a language model and an ASR model trained on a 1000h speech dataset demonstrate that our method significantly outperforms prior approaches. Evaluated on off-the-shelf mobile devices, we are able to reduce the size of the original model by eight times with real-time model inference and negligible accuracy loss.
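A toy version of dictionary-based weight compression in the spirit the abstract describes (not the authors' optimized algorithm) can be sketched as follows: approximate a weight matrix W as D @ S, where D holds `n_atoms` dictionary atoms and S keeps only the `sparsity` largest coefficients per column. Varying `n_atoms` adjusts the compression rate and varying `sparsity` adjusts the sparse-code density, mirroring the two knobs 1) and 2) above. The SVD-based dictionary here is an illustrative shortcut, not a proper dictionary-learning solver.

```python
import numpy as np

def compress_layer(w, n_atoms, sparsity):
    """Approximate w (out x in) as d @ s with a small dictionary and sparse codes."""
    # SVD gives an easy (if not sparse-optimal) dictionary of atoms.
    u, sv, vt = np.linalg.svd(w, full_matrices=False)
    d = u[:, :n_atoms]                           # dictionary atoms
    codes = sv[:n_atoms, None] * vt[:n_atoms]    # dense codes for each column
    # Hard-threshold each column to its `sparsity` largest-magnitude entries.
    keep = np.argsort(np.abs(codes), axis=0)[-sparsity:]
    sparse = np.zeros_like(codes)
    cols = np.arange(codes.shape[1])
    sparse[keep, cols] = codes[keep, cols]
    return d, sparse
```

Storage drops from `out * in` weights to `out * n_atoms` dictionary entries plus `sparsity * in` nonzero codes, which is where the compression rate comes from.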

AAAI Conference 2017 Conference Paper

Dual-Clustering Maximum Entropy with Application to Classification and Word Embedding

  • Xiaolong Wang
  • Jingjing Wang
  • ChengXiang Zhai

Maximum Entropy (ME), as a general-purpose machine learning model, has been successfully applied to various fields such as text mining and natural language processing. It has been used as a classification technique and recently also applied to learn word embedding. ME establishes a distribution of the exponential form over items (classes/words). When training such a model, learning efficiency is guaranteed by globally updating the entire set of model parameters associated with all items at each training instance. This creates a significant computational challenge when the number of items is large. To achieve learning efficiency with affordable computational cost, we propose an approach named Dual-Clustering Maximum Entropy (DCME). Exploiting the primal-dual form of ME, it conducts clustering in the dual space and approximates each dual distribution by the corresponding cluster center. This naturally enables a hybrid online-offline optimization algorithm whose time complexity per instance only scales as the product of the feature/word vector dimensionality and the cluster number. Experimental studies on text classification and word embedding learning demonstrate that DCME effectively strikes a balance between training speed and model quality, substantially outperforming state-of-the-art methods.

AAAI Conference 2016 Conference Paper

Write-righter: An Academic Writing Assistant System

  • Yuanchao Liu
  • Xin Wang
  • Ming Liu
  • Xiaolong Wang

Writing academic articles in English is a challenging task for non-native speakers, as considerable effort must be spent polishing their language expression. This paper presents an academic writing assistant system called Write-righter, which provides real-time hints and recommendations by analyzing the input context. To achieve this goal, several novel strategies are proposed, e.g., semantic-extension-based sentence retrieval and LDA-based sentence structure identification. Write-righter is expected to help people express their ideas correctly by recommending the top-N most likely expressions.

IJCAI Conference 2015 Conference Paper

Modeling Mention, Context and Entity with Neural Networks for Entity Disambiguation

  • Yaming Sun
  • Lei Lin
  • Duyu Tang
  • Nan Yang
  • Zhenzhou Ji
  • Xiaolong Wang

Given a query consisting of a mention (name string) and a background document, entity disambiguation calls for linking the mention to an entity from a reference knowledge base such as Wikipedia. Existing studies typically use hand-crafted features to represent mention, context and entity, which is labor-intensive and weak at discovering the explanatory factors of the data. In this paper, we address this problem by presenting a new neural network approach. The model takes into consideration the semantic representations of mention, context and entity, encodes them in a continuous vector space, and effectively leverages them for entity disambiguation. Specifically, we model variable-sized contexts with a convolutional neural network, and embed the positions of context words to factor in the distance between a context word and the mention. Furthermore, we employ a neural tensor network to model the semantic interactions between context and mention. We conduct experiments for entity disambiguation on two benchmark datasets from TAC-KBP 2009 and 2010. Experimental results show that our method yields state-of-the-art performance on both datasets.

IJCAI Conference 2015 Conference Paper

VRCA: A Clustering Algorithm for Massive Amount of Texts

  • Ming Liu
  • Lei Chen
  • Bingquan Liu
  • Xiaolong Wang

Large numbers of texts appear on the web every day, causing the amount of text on the web to explode; how to deal with large-scale text collections therefore becomes more and more important. Clustering is a generally accepted solution for text organization: thanks to its unsupervised nature, users can easily dig out the useful information they desire. However, traditional clustering algorithms can only deal with small-scale text collections; as the collection grows, their performance degrades, mainly because of the high-dimensional vectors generated from texts. To cluster texts in large amounts, this paper therefore proposes a novel clustering algorithm in which only the features that can represent a cluster are preserved in the cluster's vector. The clustering process is separated into two parts: in one part, feature weights are fine-tuned to make the cluster partition meet an optimization function; in the other, features are reordered and only the useful features that represent the cluster are kept in the cluster's vector. Experimental results demonstrate that our algorithm obtains high performance on both small-scale and large-scale text collections.
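The core representational idea we read in this abstract can be sketched in a toy form (our rendering, not the authors' algorithm): a cluster's vector keeps only its most representative features, so the representation stays sparse even when the text vectors are high-dimensional. Here the centroid of the cluster's member vectors is hard-thresholded to its `k_features` largest-magnitude entries.

```python
import numpy as np

def sparse_centroid(points, k_features):
    """Cluster vector that preserves only the top-`k_features` representative
    features of the dense centroid; everything else is zeroed out.
    """
    centroid = points.mean(axis=0)
    keep = np.argsort(np.abs(centroid))[-k_features:]  # strongest features
    sparse = np.zeros_like(centroid)
    sparse[keep] = centroid[keep]
    return sparse
```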

NeurIPS Conference 2014 Conference Paper

Deep Joint Task Learning for Generic Object Extraction

  • Xiaolong Wang
  • Liliang Zhang
  • Liang Lin
  • Zhujin Liang
  • Wangmeng Zuo

This paper investigates how to extract objects of interest without relying on hand-crafted features and sliding-window approaches, aiming to jointly solve two sub-tasks: (i) rapidly localizing salient objects in images, and (ii) accurately segmenting the objects based on the localizations. We present a general joint task learning framework in which each task (either object localization or object segmentation) is tackled via a multi-layer convolutional neural network, and the two networks work collaboratively to boost performance. In particular, we propose to incorporate latent variables bridging the two networks in a joint optimization manner. The first network directly predicts the positions and scales of salient objects from raw images, and the latent variables adjust the object localizations to feed the second network, which produces pixelwise object masks. An EM-type method is then studied for the joint optimization, iterating between two steps: (i) using the two networks, it estimates the latent variables with an MCMC-based sampling method; (ii) it optimizes the parameters of the two networks jointly via back-propagation with the latent variables fixed. Extensive experiments demonstrate that our joint learning framework significantly outperforms other state-of-the-art approaches in both accuracy and efficiency (e.g., 1000 times faster than competing approaches).

NeurIPS Conference 2012 Conference Paper

Dynamical And-Or Graph Learning for Object Shape Modeling and Detection

  • Xiaolong Wang
  • Liang Lin

This paper studies a novel discriminative part-based model to represent and recognize object shapes with an “And-Or graph”. We define this model as consisting of three layers: the leaf-nodes with collaborative edges for localizing local parts, the or-nodes specifying the switch of leaf-nodes, and the root-node encoding the global verification. A discriminative learning algorithm, extended from the CCCP [23], is proposed to train the model in a dynamical manner: the model structure (e.g., the configuration of the leaf-nodes associated with the or-nodes) is automatically determined while optimizing the multi-layer parameters during the iteration. The advantages of our method are two-fold. (i) The And-Or graph model enables us to handle large intra-class variance and background clutter well for object shape detection in images. (ii) The proposed learning algorithm is able to obtain the And-Or graph representation without requiring elaborate supervision and initialization. We validate the proposed method on several challenging databases (e.g., INRIA-Horse, ETHZ-Shape, and UIUC-People), and it outperforms state-of-the-art approaches.

AAAI Conference 2011 Conference Paper

Partially Supervised Text Classification with Multi-Level Examples

  • Tao Liu
  • Xiaoyong Du
  • Yongdong Xu
  • Minghui Li
  • Xiaolong Wang

Partially supervised text classification has received great research attention since it only uses positive and unlabeled examples as training data. This problem can be solved by automatically labeling some negative (and more positive) examples from unlabeled examples before training a text classifier. But it is difficult to guarantee both high quality and quantity of the new labeled examples. In this paper, a multi-level example based learning method for partially supervised text classification is proposed, which can make full use of all unlabeled examples. A heuristic method is proposed to assign possible labels to unlabeled examples and partition them into multiple levels according to their labeling confidence. A text classifier is trained on these multi-level examples using weighted support vector machines. Experiments show that the multi-level example based learning method is effective for partially supervised text classification, and outperforms the existing popular methods such as Biased-SVM, ROC-SVM, S-EM and WL.