Arrow Research search

Author name cluster

Jue Wang

Possible papers associated with this exact author name in Arrow. This page groups case-insensitive exact name matches and is not a full identity disambiguation profile.

47 papers
2 author rows

Possible papers

47

AAAI Conference 2026 Conference Paper

RipAlert: A Future-Frame-Aware Framework for Rip Current Forecasting and Early Alerting

  • Meng Wan
  • Qi Su
  • Zhixin Xia
  • Kanglin Chen
  • Jue Wang
  • Tiantian Liu
  • Rongqiang Cao
  • Hui Cui

Rip currents cause over 100 drowning deaths and more than 30,000 rescues annually in the United States, posing a severe threat to beach safety worldwide. However, most existing detection methods are reactive, identifying rip currents only after they form, leaving limited time for intervention. We propose RipAlert, a future-frame-aware framework that forecasts near-future coastal dynamics and proactively identifies rip current risks. We design a region-sensitive optical flow prediction method with a novel entropy-based object detector to capture early-stage reverse-flow anomalies. Unlike static-image approaches, RipAlert leverages temporal motion patterns to detect rip currents up to 5 seconds before they visibly form. To support real-world deployment, we design a lightweight mobile application and release a curated dataset with over 2,000 annotated images. Experiments on the RipVIS benchmark show that our approach achieves state-of-the-art performance. The system has been deployed at high-risk beaches in China, issuing successful early warnings during real-world events. Our work advances AI-driven coastal safety and contributes to SDG 3 (Good Health and Well-Being) and SDG 13 (Climate Action).

JBHI Journal 2025 Journal Article

A Post-Quantum Blockchain and Autonomous AI-Enabled Scheme for Secure Healthcare Information Exchange

  • Linlin He
  • Siyuan Rao
  • Kexin Tian
  • Yuyuan Liu
  • Jue Wang
  • Shuanggen Liu
  • Xiuhua Lu

Secure healthcare information exchange (HIE) is critical to improving medical services, enabling data interoperability, and ensuring patient privacy. However, the increasing threat posed by quantum computing challenges the reliability of conventional cryptographic mechanisms. To address this, we propose a post-quantum secure healthcare data-sharing scheme that combines the Extended Merkle Signature Scheme (XMSS) and consortium blockchain technology to guarantee the integrity, authenticity, and traceability of electronic medical records (EMRs). Furthermore, the scheme incorporates autonomous artificial intelligence (AI) to assist healthcare professionals in generating accurate and intelligent diagnostic reports, enhancing clinical decision-making. We theoretically analyze the scheme's security in the random oracle model, demonstrating that it effectively resists various threats. Performance evaluation shows that the scheme is particularly suitable for HIE scenarios, as it reduces total computational overhead by about 49% and blockchain storage by 36% compared to other schemes.

AAAI Conference 2025 Conference Paper

FoldToken: Learning Protein Language via Vector Quantization and Beyond

  • Zhangyang Gao
  • Cheng Tan
  • Jue Wang
  • Yufei Huang
  • Lirong Wu
  • Stan Z. Li

Is there a foreign language describing protein sequences and structures simultaneously? Protein structures, represented by continuous 3D points, have long posed a challenge due to the contrasting modeling paradigms of discrete sequences. We introduce FoldTokenizer to represent protein sequence-structure as discrete symbols. This approach involves projecting residue types and structures into a discrete space, guided by a reconstruction loss for information preservation. We name the learned discrete symbols FoldTokens, and the sequence of FoldTokens serves as a new protein language, transforming the protein sequence-structure into a unified modality. We apply the created protein language to the general backbone inpainting task, building the first GPT-style model (FoldGPT) for sequence-structure co-generation with promising results. Key to our success is the substantial enhancement of the vector quantization module, Soft Conditional Vector Quantization (SoftCVQ).

ICML Conference 2025 Conference Paper

Improving Model Alignment Through Collective Intelligence of Open-Source Models

  • Junlin Wang
  • Roy Xie
  • Shang Zhu
  • Jue Wang
  • Ben Athiwaratkun
  • Bhuwan Dhingra
  • Shuaiwen Leon Song
  • Ce Zhang 0001

Building helpful and harmless large language models (LLMs) requires an effective model alignment approach based on human instructions and feedback, which necessitates high-quality human-labeled data. Constructing such datasets is often expensive and hard to scale, and may face potential limitations on diversity and generalization. To address these challenges, we introduce Mixture of Agents Alignment (MoAA), which leverages the collective strengths of various language models to provide high-quality data for model alignment. By employing MoAA, we enhance both supervised fine-tuning and preference optimization, leading to improved performance compared to using a single model alone to generate alignment data (e.g., using GPT-4o alone). Evaluation results show that our approach can improve the win rate of LLaMA-3.1-8B-Instruct from 19.5 to 48.3 on Arena-Hard and from 22.33 to 57.23 on AlpacaEval 2, highlighting a promising direction for model alignment through this new scalable and diverse synthetic data recipe. Furthermore, we demonstrate that MoAA enables a self-improvement pipeline, where models fine-tuned on MoA-generated data surpass their own initial capabilities, providing evidence that our approach can push the frontier of open-source LLMs without reliance on stronger external supervision. Data and code will be released.

ICML Conference 2025 Conference Paper

Ladder-Residual: Parallelism-Aware Architecture for Accelerating Large Model Inference with Communication Overlapping

  • Muru Zhang
  • Mayank Mishra
  • Zhongzhu Zhou
  • William Brandon
  • Jue Wang
  • Yoon Kim
  • Jonathan Ragan-Kelley
  • Shuaiwen Leon Song

Large language model inference is both memory-intensive and time-consuming, often requiring distributed algorithms to efficiently scale. Various model parallelism strategies are used in multi-GPU training and inference to partition computation across multiple devices, reducing memory load and computation time. However, using model parallelism necessitates communication of information between GPUs, which has been a major bottleneck and limits the gains obtained by scaling up the number of devices. We introduce Ladder Residual, a simple architectural modification applicable to all residual-based models that enables straightforward overlapping and effectively hides the latency of communication. Our insight is that in addition to systems optimization, one can also redesign the model architecture to decouple communication from computation. While Ladder Residual can allow communication-computation decoupling in conventional parallelism patterns, we focus on Tensor Parallelism in this paper, which is particularly bottlenecked by its heavy communication. For a Transformer model with 70B parameters, applying Ladder Residual to all its layers achieves a 29% end-to-end wall-clock speedup at inference time with TP sharding over 8 devices. We refer to the resulting Transformer model as the Ladder Transformer. We train a 1B and 3B Ladder Transformer from scratch and observe comparable performance to a standard dense transformer baseline. We also show that it is possible to convert parts of the Llama-3.1 8B model to our Ladder Residual architecture with minimal accuracy degradation by retraining on only 3B tokens.

IJCAI Conference 2025 Conference Paper

MCloudNet: An Ultra-Short-Term Photovoltaic Power Forecasting Framework With Multi-Layer Cloud Coverage

  • Meng Wan
  • Tiantian Liu
  • Yuxuan Bi
  • Jue Wang
  • Hui Cui
  • Rongqiang Cao
  • Jiaxiang Wang
  • Peng Shi

Over 4.15 million low-income households across nearly 60,000 villages in China benefit from photovoltaic (PV) poverty alleviation power stations. However, weak infrastructure and limited capabilities make these systems vulnerable to fluctuations. One of the United Nations' Sustainable Development Goals (SDG 7) seeks to ensure access to affordable and reliable energy for all, especially in underdeveloped regions. This paper proposes MCloudNet, a multi-modal framework designed to improve ultra-short-term PV prediction in data-scarce, cloud-dynamic environments. MCloudNet explicitly models multi-layer cloud structures from satellite imagery and fuses them with time-series meteorological data to enhance prediction accuracy and interpretability. A province-level dispatch system with MCloudNet has been deployed in Hebei, supporting scheduling across rural PV stations. Experiments conducted in counties such as Shexian and Luxi highlight the framework's effectiveness for use in underdeveloped micro-grids. Operational results show that the system has reduced over 60 million kWh of solar curtailment and generated 24 million CNY in economic value, benefiting approximately 50,000 rural households. By minimizing power fluctuations and improving rural energy scheduling, MCloudNet supports essential services such as lighting, medical facilities, and communications. The source code is available at: https://github.com/AI4SClab/MCloudNet.

ICLR Conference 2025 Conference Paper

Mixture-of-Agents Enhances Large Language Model Capabilities

  • Junlin Wang
  • Jue Wang
  • Ben Athiwaratkun
  • Ce Zhang 0001
  • James Y. Zou

Recent advances in large language models (LLMs) demonstrate substantial capabilities in natural language understanding and generation tasks. With the growing number of LLMs, how to harness the collective expertise of multiple LLMs is an exciting open direction. Toward this goal, we propose a new approach that leverages the collective strengths of multiple LLMs through a Mixture-of-Agents (MoA) methodology. In our approach, we construct a layered MoA architecture wherein each layer comprises multiple LLM agents. Each agent takes all the outputs from agents in the previous layer as auxiliary information in generating its response. MoA models achieve state-of-the-art performance on AlpacaEval 2.0, Arena-Hard, MT-Bench, and FLASK, surpassing GPT-4 Omni. For example, our MoA using only open-source LLMs achieves a score of 65.1% on AlpacaEval 2.0 compared to 57.5% by GPT-4 Omni.
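
The layered aggregation described in this abstract can be summarized in a few lines of code. The sketch below is an illustration of that flow, not the authors' implementation; the `generate` callable is a hypothetical stand-in for any LLM completion call, and the prompt format is an assumption.

```python
# Minimal sketch of a layered Mixture-of-Agents (MoA) pipeline, assuming a
# hypothetical generate(model_name, prompt) helper that returns a completion.
from typing import Callable, List

def mixture_of_agents(
    prompt: str,
    layers: List[List[str]],                  # each inner list names the agents in one layer
    aggregator: str,                          # model that produces the final answer
    generate: Callable[[str, str], str],      # generate(model_name, prompt) -> completion
) -> str:
    previous_outputs: List[str] = []
    for layer in layers:
        current_outputs = []
        for model_name in layer:
            # Each agent sees the user prompt plus all outputs from the previous layer.
            aux = "\n\n".join(f"[Response {i + 1}]\n{o}" for i, o in enumerate(previous_outputs))
            agent_prompt = f"{prompt}\n\nPrevious responses to consider:\n{aux}" if aux else prompt
            current_outputs.append(generate(model_name, agent_prompt))
        previous_outputs = current_outputs

    # A final aggregator model synthesizes the last layer's responses.
    synthesis_prompt = (
        f"{prompt}\n\nSynthesize the best possible answer from these responses:\n"
        + "\n\n".join(previous_outputs)
    )
    return generate(aggregator, synthesis_prompt)
```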

ICLR Conference 2025 Conference Paper

Scaling Instruction-tuned LLMs to Million-token Contexts via Hierarchical Synthetic Data Generation

  • Linda He
  • Jue Wang
  • Maurice Weber
  • Shang Zhu
  • Ben Athiwaratkun
  • Ce Zhang 0001

Large Language Models (LLMs) struggle with long-context reasoning, not only due to the quadratic scaling of computational complexity with sequence length but also because of the scarcity and expense of annotating long-context data. There has been barely any open-source work that systematically ablates long-context data, nor is there any openly available instruction tuning dataset with contexts surpassing 100K tokens. To bridge this gap, we introduce a novel post-training synthetic data generation strategy designed to efficiently extend the context window of LLMs while preserving their general task performance. Our approach scalably extends to arbitrarily long context lengths, unconstrained by the length of available real-world data, which effectively addresses the scarcity of raw long-context data. Through a step-by-step rotary position embedding (RoPE) scaling training strategy, we demonstrate that our model, with a context length of up to 1M tokens, performs well on the RULER benchmark and InfiniteBench and maintains robust performance on general language tasks.

IJCAI Conference 2025 Conference Paper

SEP: A General Lossless Compression Framework with Semantics Enhancement and Multi-Stream Pipelines

  • Meng Wan
  • Rongqiang Cao
  • Yanghao Li
  • Jue Wang
  • Zijian Wang
  • Qi Su
  • Lei Qiu
  • Peng Shi

Deep-learning-based lossless compression is of immense importance in real-world applications, such as cold data persistence, sensor data collection, and astronomical data transmission. However, existing compressors typically model data using single-byte symbols as tokens, which makes it hard to capture the inherent correlations and cannot effectively utilize the parallel capabilities of GPUs and multi-core CPUs. This paper proposes SEP, a novel lossless compression framework for most time-series backbone neural networks. We first introduce a semantic enhancement module to capture the complex intra-patch relationships of binary byte streams. To improve the compression speed, we design multi-stream pipelines that dynamically assign parallel tasks to GPU streams and multi-cores. We further propose a novel GPU memory optimization strategy, which reuses GPU memory via a shared pool across streams. We conduct experiments on seven real-world datasets and the results demonstrate that our SEP framework outperforms state-of-the-art compressors with an average speed improvement of 30.0% and an average compression ratio gain of 5.1%, which is further elevated to 7.6% with the use of pre-trained models. The GPU memory footprint is reduced by as much as 63.1% and by an average of 36.2%. The source code is available at: https://github.com/damonwan1/SEP.

ICML Conference 2024 Conference Paper

Soft Prompt Recovers Compressed LLMs, Transferably

  • Zhaozhuo Xu
  • Zirui Liu 0001
  • Beidi Chen
  • Shaochen (Henry) Zhong
  • Yuxin Tang
  • Jue Wang
  • Kaixiong Zhou
  • Xia Hu 0001

Model compression is one of the most popular approaches to improve the accessibility of Large Language Models (LLMs) by reducing their memory footprint. However, gaining such efficiency benefits often demands extensive engineering efforts and intricate designs to mitigate the performance decline. In this work, we leverage (Soft) Prompt Tuning in its most vanilla form and discover that such conventionally learned soft prompts can recover the performance of compressed LLMs. More surprisingly, we observe this recovery effect to be transferable among different tasks and models (albeit natural tokenizer and dimensionality limitations), resulting in further overhead reduction and yet, subverting the common belief that learned soft prompts are task-specific. Our work is fully orthogonal and compatible with model compression frameworks such as pruning and quantization, where we enable an up to $8\times$ compressed LLM (with a joint 4-bit quantization and 50% weight pruning compression) to match its uncompressed counterpart on popular benchmarks. We note that we are the first to reveal that vanilla Parameter-Efficient Fine-Tuning (PEFT) techniques have the potential to be utilized under a compression recovery context, opening a new line of opportunities for model accessibility advancement while freeing our fellow researchers from the previously present engineering burdens and constraints. The code is available at https://github.com/zirui-ray-liu/compress-then-prompt.
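
For readers unfamiliar with "vanilla" soft prompt tuning, the sketch below shows the mechanism the abstract refers to: the (compressed) backbone stays frozen and only a small learned prompt is prepended to the input embeddings. The toy backbone and dimensions are assumptions for illustration, not the paper's setup.

```python
# Minimal PyTorch sketch of vanilla soft prompt tuning on a frozen backbone.
import torch
import torch.nn as nn

class SoftPromptWrapper(nn.Module):
    def __init__(self, frozen_model: nn.Module, embed_dim: int, prompt_len: int = 20):
        super().__init__()
        self.model = frozen_model
        for p in self.model.parameters():
            p.requires_grad = False          # the compressed backbone is not updated
        # The only trainable parameters: the soft prompt embeddings.
        self.soft_prompt = nn.Parameter(torch.randn(prompt_len, embed_dim) * 0.02)

    def forward(self, input_embeds: torch.Tensor) -> torch.Tensor:
        # input_embeds: (batch, seq_len, embed_dim)
        batch = input_embeds.size(0)
        prompt = self.soft_prompt.unsqueeze(0).expand(batch, -1, -1)
        return self.model(torch.cat([prompt, input_embeds], dim=1))

# Usage with a toy frozen backbone standing in for a pruned/quantized LLM.
backbone = nn.Sequential(nn.Linear(64, 64), nn.ReLU(), nn.Linear(64, 64))
wrapper = SoftPromptWrapper(backbone, embed_dim=64, prompt_len=8)
out = wrapper(torch.randn(2, 10, 64))        # (2, 18, 64) after prepending the prompt
```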

NeurIPS Conference 2024 Conference Paper

UniIF: Unified Molecule Inverse Folding

  • Zhangyang Gao
  • Jue Wang
  • Cheng Tan
  • Lirong Wu
  • Yufei Huang
  • Siyuan Li
  • Zhirui Ye
  • Stan Z. Li

Molecule inverse folding has been a long-standing challenge in chemistry and biology, with the potential to revolutionize drug discovery and material science. Although specialized models have been proposed for different small- or macro-molecules, few have attempted to unify the learning process, resulting in redundant efforts. Complementary to recent advancements in molecular structure prediction, such as RoseTTAFold All-Atom and AlphaFold3, we propose the unified model UniIF for the inverse folding of all molecules. We achieve this unification at two levels: 1) Data-Level: We propose a unified block graph data form for all molecules, including the local frame building and geometric feature initialization. 2) Model-Level: We introduce a geometric block attention network, comprising geometric interaction, interactive attention, and virtual long-term dependency modules, to capture the 3D interactions of all molecules. Through comprehensive evaluations across various tasks such as protein design, RNA design, and material design, we demonstrate that our proposed method surpasses state-of-the-art methods on all tasks. UniIF offers a versatile and effective solution for general molecule inverse folding.

NeurIPS Conference 2024 Conference Paper

Video Token Merging for Long Video Understanding

  • Seon-Ho Lee
  • Jue Wang
  • Zhikang Zhang
  • David Fan
  • Xinyu Li

As the scale of data and models for video understanding rapidly expands, handling long-form video input in transformer-based models presents a practical challenge. Rather than resorting to input sampling or token dropping, which may result in information loss, token merging shows promising results when used in collaboration with transformers. However, the application of token merging to long-form video processing is not trivial. We begin with the premise that token merging should not rely solely on the similarity of video tokens; the saliency of tokens should also be considered. To address this, we explore various video token merging strategies for long-form video classification, starting with a simple extension of image token merging, moving to region-concentrated merging, and finally proposing a learnable video token merging (VTM) algorithm that dynamically merges tokens based on their saliency. Extensive experimental results show that we achieve better or comparable performance on the LVU, COIN, and Breakfast datasets. Moreover, our approach significantly reduces memory costs by 84% and boosts throughput by approximately 6.89 times compared to baseline algorithms.
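
To make the "similarity plus saliency" idea concrete, the sketch below merges the least salient tokens into their most similar kept tokens. It is an illustrative scheme under simple assumptions (cosine similarity, norm-based saliency proxy), not the learnable VTM algorithm from the paper.

```python
# Illustrative saliency-aware token merging: keep the most salient tokens and
# fold every dropped token into its most similar kept token by averaging.
import torch

def saliency_aware_merge(tokens: torch.Tensor, saliency: torch.Tensor, keep: int):
    """tokens: (N, D), saliency: (N,); returns (keep, D) merged tokens."""
    order = torch.argsort(saliency, descending=True)
    kept_idx, drop_idx = order[:keep], order[keep:]
    kept, dropped = tokens[kept_idx], tokens[drop_idx]

    # Assign each dropped token to its most similar kept token (cosine similarity).
    sim = torch.nn.functional.normalize(dropped, dim=-1) @ \
          torch.nn.functional.normalize(kept, dim=-1).T
    assign = sim.argmax(dim=-1)                       # (N - keep,)

    merged = kept.clone()
    counts = torch.ones(keep)
    for i, j in enumerate(assign.tolist()):
        merged[j] += dropped[i]
        counts[j] += 1
    return merged / counts.unsqueeze(-1)              # average each merged group

# Example: 196 video tokens of dim 64 reduced to 49.
toks = torch.randn(196, 64)
sal = toks.norm(dim=-1)                               # a simple saliency proxy
out = saliency_aware_merge(toks, sal, keep=49)        # (49, 64)
```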

ICML Conference 2023 Conference Paper

CocktailSGD: Fine-tuning Foundation Models over 500Mbps Networks

  • Jue Wang
  • Yucheng Lu 0003
  • Binhang Yuan
  • Beidi Chen
  • Percy Liang
  • Christopher De Sa
  • Christopher Ré
  • Ce Zhang 0001

Distributed training of foundation models, especially large language models (LLMs), is communication-intensive and so has heavily relied on centralized data centers with fast interconnects. Can we train on slow networks and unlock the potential of decentralized infrastructure for foundation models? In this paper, we propose CocktailSGD, a novel communication-efficient training framework that combines three distinct compression techniques – random sparsification, top-K sparsification, and quantization – to achieve much greater compression than each individual technique alone. We justify the benefit of such a hybrid approach through a theoretical analysis of convergence. Empirically, we show that CocktailSGD achieves up to 117$\times$ compression in fine-tuning LLMs up to 20 billion parameters without hurting convergence. On a 500Mbps network, CocktailSGD only incurs $\sim$1.2$\times$ slowdown compared with data center networks.
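
The sketch below stacks the three compression primitives named in the abstract on a single gradient tensor, to show how their ratios multiply. The composition order, the absence of error feedback, and all hyperparameters are illustrative assumptions; CocktailSGD's actual procedure differs.

```python
# Illustrative composition: random sparsification -> top-K -> low-bit quantization.
import torch

def cocktail_compress(grad: torch.Tensor, random_keep=0.1, topk_keep=0.1, bits=4):
    flat = grad.flatten()

    # 1) Random sparsification: keep a random subset of coordinates.
    n_rand = max(1, int(flat.numel() * random_keep))
    rand_idx = torch.randperm(flat.numel())[:n_rand]
    rand_vals = flat[rand_idx]

    # 2) Top-K sparsification among the surviving coordinates.
    n_top = max(1, int(n_rand * topk_keep))
    _, top_pos = rand_vals.abs().topk(n_top)
    idx, vals = rand_idx[top_pos], rand_vals[top_pos]

    # 3) Uniform quantization of the kept values to `bits` bits.
    scale = vals.abs().max().clamp(min=1e-12)
    levels = 2 ** (bits - 1) - 1
    q = torch.round(vals / scale * levels).to(torch.int8)
    return idx, q, scale                       # sparse indices + low-precision values

def cocktail_decompress(idx, q, scale, shape, bits=4):
    levels = 2 ** (bits - 1) - 1
    out = torch.zeros(torch.Size(shape)).flatten()
    out[idx] = q.float() / levels * scale
    return out.reshape(shape)
```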

AAAI Conference 2023 Conference Paper

CoordFill: Efficient High-Resolution Image Inpainting via Parameterized Coordinate Querying

  • Weihuang Liu
  • Xiaodong Cun
  • Chi-Man Pun
  • Menghan Xia
  • Yong Zhang
  • Jue Wang

Image inpainting aims to fill the missing hole of the input. It is hard to solve this task efficiently when facing high-resolution images due to two reasons: (1) A large receptive field needs to be handled for high-resolution image inpainting. (2) The general encoder and decoder network synthesizes many background pixels synchronously due to the form of the image matrix. In this paper, we try to break the above limitations for the first time thanks to the recent development of continuous implicit representation. In detail, we down-sample and encode the degraded image to produce the spatial-adaptive parameters for each spatial patch via an attentional Fast Fourier Convolution (FFC)-based parameter generation network. Then, we take these parameters as the weights and biases of a series of multi-layer perceptrons (MLPs), where the input is the encoded continuous coordinates and the output is the synthesized color value. Thanks to the proposed structure, we only encode the high-resolution image at a relatively low resolution to capture a larger receptive field. Then, the continuous position encoding helps synthesize photo-realistic high-frequency textures by re-sampling the coordinates at a higher resolution. Also, our framework enables us to query only the coordinates of missing pixels, in parallel, yielding a more efficient solution than previous methods. Experiments show that the proposed method achieves real-time performance on 2048×2048 images using a single GTX 2080 Ti GPU and can handle 4096×4096 images, with much better performance than existing state-of-the-art methods visually and numerically. The code is available at: https://github.com/NiFangBaAGe/CoordFill.

ICML Conference 2023 Conference Paper

Deja Vu: Contextual Sparsity for Efficient LLMs at Inference Time

  • Zichang Liu
  • Jue Wang
  • Tri Dao
  • Tianyi Zhou 0002
  • Binhang Yuan
  • Zhao Song 0002
  • Anshumali Shrivastava
  • Ce Zhang 0001

Large language models (LLMs) with hundreds of billions of parameters have sparked a new wave of exciting AI applications. However, they are computationally expensive at inference time. Sparsity is a natural approach to reduce this cost, but existing methods either require costly retraining, have to forgo the LLM's in-context learning ability, or do not yield wall-clock time speedup on modern hardware. We hypothesize that contextual sparsity, i.e., small, input-dependent sets of attention heads and MLP parameters that yield approximately the same output as the dense model for a given input, can address these issues. We show that contextual sparsity exists, that it can be accurately predicted, and that we can exploit it to speed up LLM inference in wall-clock time without compromising the LLM's quality or in-context learning ability. Based on these insights, we propose DejaVu, a system that uses a low-cost algorithm to predict contextual sparsity on the fly given inputs to each layer, along with an asynchronous and hardware-aware implementation that speeds up LLM inference. We validate that DejaVu can reduce the inference latency of OPT-175B by over 2$\times$ compared to the state-of-the-art FasterTransformer, and over 6$\times$ compared to the widely used Hugging Face implementation, without compromising model quality. The code is available at https://github.com/FMInference/DejaVu.
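
The core of the "low-cost predictor" idea can be illustrated with a tiny module that scores attention heads from the layer input and masks out all but the top few. This is a schematic sketch under assumed shapes, not the DejaVu system's predictor or its asynchronous implementation.

```python
# Sketch: a small MLP predicts, per input, which attention heads to run.
import torch
import torch.nn as nn

class HeadSparsityPredictor(nn.Module):
    def __init__(self, hidden_dim: int, num_heads: int, keep_ratio: float = 0.3):
        super().__init__()
        self.scorer = nn.Sequential(
            nn.Linear(hidden_dim, hidden_dim // 4), nn.ReLU(),
            nn.Linear(hidden_dim // 4, num_heads),
        )
        self.keep = max(1, int(num_heads * keep_ratio))

    def forward(self, layer_input: torch.Tensor) -> torch.Tensor:
        # layer_input: (batch, seq, hidden); score heads from the mean token state.
        scores = self.scorer(layer_input.mean(dim=1))      # (batch, num_heads)
        topk = scores.topk(self.keep, dim=-1).indices
        mask = torch.zeros_like(scores)
        mask.scatter_(1, topk, 1.0)                        # 1 = run this head
        return mask                                        # (batch, num_heads)

predictor = HeadSparsityPredictor(hidden_dim=512, num_heads=16)
mask = predictor(torch.randn(2, 128, 512))   # per-input head mask, ~30% of heads active
```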

AAAI Conference 2023 Conference Paper

Effective Continual Learning for Text Classification with Lightweight Snapshots

  • Jue Wang
  • Dajie Dong
  • Lidan Shou
  • Ke Chen
  • Gang Chen

Continual learning is known for suffering from catastrophic forgetting, a phenomenon where previously learned concepts are forgotten upon learning new tasks. A natural remedy is to use trained models for old tasks as ‘teachers’ to regularize the update of the current model to prevent such forgetting. However, this requires storing all past models, which is very space-consuming for large models, e.g. BERT, thus impractical in real-world applications. To tackle this issue, we propose to construct snapshots of seen tasks whose key knowledge is captured in lightweight adapters. During continual learning, we transfer knowledge from past snapshots to the current model through knowledge distillation, allowing the current model to review previously learned knowledge while learning new tasks. We also design representation recalibration to better handle the class-incremental setting. Experiments over various task sequences show that our approach effectively mitigates catastrophic forgetting and outperforms all baselines.

TMLR Journal 2023 Journal Article

Holistic Evaluation of Language Models

  • Percy Liang
  • Rishi Bommasani
  • Tony Lee
  • Dimitris Tsipras
  • Dilara Soylu
  • Michihiro Yasunaga
  • Yian Zhang
  • Deepak Narayanan

Language models (LMs) are becoming the foundation for almost all major language technologies, but their capabilities, limitations, and risks are not well understood. We present Holistic Evaluation of Language Models (HELM) to improve the transparency of language models. First, we taxonomize the vast space of potential scenarios (i.e. use cases) and metrics (i.e. desiderata) that are of interest for LMs. Then we select a broad subset based on coverage and feasibility, noting what’s missing or underrepresented (e.g. question answering for neglected English dialects, metrics for trustworthiness). Second, we adopt a multi-metric approach: We measure 7 metrics (accuracy, calibration, robustness, fairness, bias, toxicity, and efficiency) for each of 16 core scenarios to the extent possible (87.5% of the time), ensuring that metrics beyond accuracy don’t fall to the wayside, and that trade-offs across models and metrics are clearly exposed. We also perform 7 targeted evaluations, based on 26 targeted scenarios, to more deeply analyze specific aspects (e.g. knowledge, reasoning, memorization/copyright, disinformation). Third, we conduct a large-scale evaluation of 30 prominent language models (spanning open, limited-access, and closed models) on all 42 scenarios, including 21 scenarios that were not previously used in mainstream LM evaluation. Prior to HELM, models on average were evaluated on just 17.9% of the core HELM scenarios, with some prominent models not sharing a single scenario in common. We improve this to 96.0%: now all 30 models have been densely benchmarked on a set of core scenarios and metrics under standardized conditions. Our evaluation surfaces 25 top-level findings concerning the interplay between different scenarios, metrics, and models. For full transparency, we release all raw model prompts and completions publicly for further analysis, as well as a general modular toolkit for easily adding new scenarios, models, metrics, and prompting strategies. We intend for HELM to be a living benchmark for the community, continuously updated with new scenarios, metrics, and models.

AAAI Conference 2023 Conference Paper

InParformer: Evolutionary Decomposition Transformers with Interactive Parallel Attention for Long-Term Time Series Forecasting

  • Haizhou Cao
  • Zhenhao Huang
  • Tiechui Yao
  • Jue Wang
  • Hui He
  • Yangang Wang

Long-term time series forecasting (LTSF) provides substantial benefits for numerous real-world applications, while placing essential demands on model capacity to capture long-range dependencies. Recent Transformer-based models have significantly improved LTSF performance. It is worth noting that the Transformer with the self-attention mechanism was originally proposed to model language sequences whose tokens (i.e., words) are discrete and highly semantic. However, unlike language sequences, most time series are sequential and continuous numeric points. Time steps with temporal redundancy are weakly semantic, and leveraging only time-domain tokens makes it hard to depict the overall properties of time series (e.g., the overall trend and periodic variations). To address these problems, we propose a novel Transformer-based forecasting model named InParformer with an Interactive Parallel Attention (InPar Attention) mechanism. The InPar Attention is proposed to learn long-range dependencies comprehensively in both frequency and time domains. To improve its learning capacity and efficiency, we further design several mechanisms, including query selection, key-value pair compression, and recombination. Moreover, InParformer is constructed with evolutionary seasonal-trend decomposition modules to enhance intricate temporal pattern extraction. Extensive experiments on six real-world benchmarks show that InParformer outperforms the state-of-the-art forecasting Transformers.

NeurIPS Conference 2023 Conference Paper

Skill-it! A data-driven skills framework for understanding and training language models

  • Mayee Chen
  • Nicholas Roberts
  • Kush Bhatia
  • Jue Wang
  • Ce Zhang
  • Frederic Sala
  • Christopher Ré

The quality of training data impacts the performance of pre-trained large language models (LMs). Given a fixed budget of tokens, we study how to best select data that leads to good downstream model performance across tasks. We develop a new framework based on a simple hypothesis: just as humans acquire interdependent skills in a deliberate order, language models also follow a natural order when learning a set of skills from their training data. If such an order exists, it can be utilized for improved understanding of LMs and for data-efficient training. Using this intuition, our framework formalizes the notion of a skill and of an ordered set of skills in terms of the associated data. First, using both synthetic and real data, we demonstrate that these ordered skill sets exist, and that their existence enables more advanced skills to be learned with less data when we train on their prerequisite skills. Second, using our proposed framework, we introduce an online data sampling algorithm, Skill-It, over mixtures of skills for both continual pre-training and fine-tuning regimes, where the objective is to efficiently learn multiple skills in the former and an individual skill in the latter. On the LEGO synthetic dataset in the continual pre-training setting, Skill-It obtains 37.5 points higher accuracy than random sampling. On the Natural Instructions dataset in the fine-tuning setting, Skill-It reduces the validation loss on the target skill by 13.6% versus training on data associated with the target skill itself. We apply our skills framework on the RedPajama dataset to continually pre-train a 3B-parameter LM, achieving higher accuracy on the LM Evaluation Harness with 1B tokens than the baseline approach of sampling uniformly over data sources with 3B tokens.
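
The flavor of online skill-weighted sampling can be conveyed with a multiplicative-weights-style update: skills the model still loses on get sampled more in the next round. This is an illustrative stand-in, not the exact Skill-It update rule, and the losses and learning rate below are hypothetical.

```python
# Illustrative online mixing over skills: higher validation loss -> higher sampling weight.
import numpy as np

def update_skill_weights(weights: np.ndarray, skill_losses: np.ndarray, eta: float = 1.0):
    """Multiplicative-weights-style update over skill sampling proportions."""
    new = weights * np.exp(eta * skill_losses)
    return new / new.sum()

weights = np.full(4, 0.25)                  # uniform start over 4 skills
losses = np.array([0.2, 0.9, 0.5, 0.1])     # hypothetical per-skill validation losses
for _ in range(3):
    weights = update_skill_weights(weights, losses)
print(weights.round(3))                     # sampling mass shifts toward the hardest skill
```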

AAAI Conference 2023 Conference Paper

Truncate-Split-Contrast: A Framework for Learning from Mislabeled Videos

  • Zixiao Wang
  • Junwu Weng
  • Chun Yuan
  • Jue Wang

Learning with noisy labels is a classic problem that has been extensively studied for image tasks, but much less so for video in the literature. A straightforward migration from images to videos without considering temporal semantics and computational cost is not a sound choice. In this paper, we propose two new strategies for video analysis with noisy labels: 1) a lightweight channel selection method dubbed Channel Truncation for feature-based label noise detection. This method selects the most discriminative channels to split clean and noisy instances in each category. 2) A novel contrastive strategy dubbed Noise Contrastive Learning, which constructs the relationship between clean and noisy instances to regularize model training. Experiments on three well-known benchmark datasets for video classification show that our proposed truNcatE-split-contrAsT (NEAT) significantly outperforms the existing baselines. By reducing the feature dimension to 10% of its original size, our method achieves a noise detection F1-score of over 0.4 and a 5% classification accuracy improvement on the Mini-Kinetics dataset under severe noise (symmetric-80%). Thanks to Noise Contrastive Learning, the average classification accuracy improvement on Mini-Kinetics and Sth-Sth-V1 is over 1.6%.

NeurIPS Conference 2022 Conference Paper

AdaptFormer: Adapting Vision Transformers for Scalable Visual Recognition

  • Shoufa Chen
  • Chongjian Ge
  • Zhan Tong
  • Jiangliu Wang
  • Yibing Song
  • Jue Wang
  • Ping Luo

Pretraining Vision Transformers (ViTs) has achieved great success in visual recognition. A following scenario is to adapt a ViT to various image and video recognition tasks. The adaptation is challenging because of heavy computation and memory storage. Each model needs an independent and complete finetuning process to adapt to different tasks, which limits its transferability to different visual domains. To address this challenge, we propose an effective adaptation approach for Transformer, namely AdaptFormer, which can adapt the pre-trained ViTs into many different image and video tasks efficiently. It possesses several benefits more appealing than prior arts. Firstly, AdaptFormer introduces lightweight modules that only add less than 2% extra parameters to a ViT, while it is able to increase the ViT's transferability without updating its original pre-trained parameters, significantly outperforming the existing 100\% fully fine-tuned models on action recognition benchmarks. Secondly, it can be plug-and-play in different Transformers and scalable to many visual tasks. Thirdly, extensive experiments on five image and video datasets show that AdaptFormer largely improves ViTs in the target domains. For example, when updating just 1. 5% extra parameters, it achieves about 10% and 19% relative improvement compared to the fully fine-tuned models on Something-Something~v2 and HMDB51, respectively. Code is available at https: //github. com/ShoufaChen/AdaptFormer.
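
A lightweight adapter of this kind is typically a small bottleneck branch added alongside a frozen block, whose scaled output is summed back into the residual path. The sketch below is a minimal illustration in that spirit; the dimensions, scaling factor, and stand-in MLP block are assumptions, not AdaptFormer's exact design.

```python
# Minimal parallel bottleneck adapter around a frozen transformer block.
import torch
import torch.nn as nn

class ParallelAdapter(nn.Module):
    def __init__(self, frozen_block: nn.Module, dim: int, bottleneck: int = 64, scale: float = 0.1):
        super().__init__()
        self.block = frozen_block
        for p in self.block.parameters():
            p.requires_grad = False            # original ViT weights stay fixed
        self.down = nn.Linear(dim, bottleneck)  # trainable down-projection
        self.up = nn.Linear(bottleneck, dim)    # trainable up-projection
        self.scale = scale

    def forward(self, x: torch.Tensor) -> torch.Tensor:
        adapter_out = self.up(torch.relu(self.down(x))) * self.scale
        return self.block(x) + adapter_out     # parallel branch added to the frozen path

# Example with a frozen MLP standing in for a ViT feed-forward block.
mlp = nn.Sequential(nn.Linear(768, 3072), nn.GELU(), nn.Linear(3072, 768))
layer = ParallelAdapter(mlp, dim=768, bottleneck=64)
y = layer(torch.randn(2, 197, 768))            # only ~0.1M adapter params are trainable
```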

NeurIPS Conference 2022 Conference Paper

Boosting the Transferability of Adversarial Attacks with Reverse Adversarial Perturbation

  • Zeyu Qin
  • Yanbo Fan
  • Yi Liu
  • Li Shen
  • Yong Zhang
  • Jue Wang
  • Baoyuan Wu

Deep neural networks (DNNs) have been shown to be vulnerable to adversarial examples, which can produce erroneous predictions by injecting imperceptible perturbations. In this work, we study the transferability of adversarial examples, which is significant due to its threat to real-world applications where model architecture or parameters are usually unknown. Many existing works reveal that adversarial examples are likely to overfit the surrogate model that they are generated from, limiting their transfer attack performance against different target models. To mitigate the overfitting of the surrogate model, we propose a novel attack method, dubbed reverse adversarial perturbation (RAP). Specifically, instead of minimizing the loss of a single adversarial point, we advocate seeking adversarial examples located in a region with uniformly low loss values, by injecting the worst-case perturbation (the reverse adversarial perturbation) at each step of the optimization procedure. The adversarial attack with RAP is formulated as a min-max bi-level optimization problem. By integrating RAP into the iterative process for attacks, our method can find more stable adversarial examples that are less sensitive to changes of the decision boundary, mitigating the overfitting of the surrogate model. Comprehensive experimental comparisons demonstrate that RAP can significantly boost adversarial transferability. Furthermore, RAP can be naturally combined with many existing black-box attack techniques to further boost the transferability. When attacking a real-world image recognition system, Google Cloud Vision API, we obtain a 22% performance improvement of targeted attacks over the compared method. Our codes are available at https://github.com/SCLBD/Transfer_attack_RAP.
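
The min-max structure described here can be sketched as: at each attack iteration, first take a small gradient-ascent step (the reverse perturbation) to the locally worst point, then take the attack's descent step using the gradient at that point. The code below is a schematic single-step illustration under assumed step sizes; it is not the authors' full attack.

```python
# Schematic RAP-style attack step. `loss_fn` is the attack objective the
# attacker wants to MINIMIZE (e.g., cross-entropy to a target class).
import torch

def rap_attack_step(model, x_adv, y, loss_fn, eps_rap=0.01, alpha=0.004, eps=0.03, x_clean=None):
    # Inner maximization: one ascent step gives the reverse adversarial perturbation.
    x_adv = x_adv.detach().requires_grad_(True)
    loss_fn(model(x_adv), y).backward()
    x_inner = (x_adv + eps_rap * x_adv.grad.sign()).detach()

    # Outer minimization: take the attack step using the gradient at the worst-case point.
    x_inner.requires_grad_(True)
    loss_fn(model(x_inner), y).backward()
    x_next = x_adv.detach() - alpha * x_inner.grad.sign()

    if x_clean is not None:                     # project back into the epsilon ball
        x_next = x_clean + (x_next - x_clean).clamp(-eps, eps)
    return x_next.clamp(0, 1).detach()
```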

IJCAI Conference 2022 Conference Paper

Continual Federated Learning Based on Knowledge Distillation

  • Yuhang Ma
  • Zhongle Xie
  • Jue Wang
  • Ke Chen
  • Lidan Shou

Federated learning (FL) is a promising approach for learning a shared global model on decentralized data owned by multiple clients without exposing their privacy. In real-world scenarios, data accumulated at the client-side varies in distribution over time. As a consequence, the global model tends to forget the knowledge obtained from previous tasks while learning new tasks, showing signs of "catastrophic forgetting". Previous studies in centralized learning use techniques such as data replay and parameter regularization to mitigate catastrophic forgetting. Unfortunately, these techniques cannot adequately solve the non-trivial problem in FL. We propose Continual Federated Learning with Distillation (CFeD) to address catastrophic forgetting under FL. CFeD performs knowledge distillation on both the clients and the server, with each party independently having an unlabeled surrogate dataset, to mitigate forgetting. Moreover, CFeD assigns different learning objectives, namely learning the new task and reviewing old tasks, to different clients, aiming to improve the learning ability of the model. The results show that our method performs well in mitigating catastrophic forgetting and achieves a good trade-off between the two objectives.

NeurIPS Conference 2022 Conference Paper

Fine-tuning Language Models over Slow Networks using Activation Quantization with Guarantees

  • Jue Wang
  • Binhang Yuan
  • Luka Rimanic
  • Yongjun He
  • Tri Dao
  • Beidi Chen
  • Christopher Ré
  • Ce Zhang

Communication compression is a crucial technique for modern distributed learning systems to alleviate their communication bottlenecks over slower networks. Despite recent intensive studies of gradient compression for data parallel-style training, compressing the activations for models trained with pipeline parallelism is still an open problem. In this paper, we propose AQ-SGD, a novel activation compression algorithm for communication-efficient pipeline parallelism training over slow networks. Different from previous efforts in activation compression, instead of compressing activation values directly, AQ-SGD compresses the changes of the activations. This allows us to show, to the best of our knowledge for the first time, that one can still achieve $O(1/\sqrt{T})$ convergence rate for non-convex objectives under activation compression, without making assumptions on gradient unbiasedness that do not hold for deep learning models with non-linear activation functions. We then show that AQ-SGD can be optimized and implemented efficiently, without additional end-to-end runtime overhead. We evaluated AQ-SGD to fine-tune language models with up to 1.5 billion parameters, compressing activations to 2-4 bits. AQ-SGD provides up to $4.3\times$ end-to-end speed-up in slower networks, without sacrificing model quality. Moreover, we also show that AQ-SGD can be combined with state-of-the-art gradient compression algorithms to enable end-to-end communication compression: all communications between machines, including model gradients, forward activations, and backward gradients, are compressed into lower precision. This provides up to $4.9\times$ end-to-end speed-up, without sacrificing model quality.
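
The key mechanism stated in the abstract, quantizing the change in activations rather than the activations themselves, can be sketched as a simple delta codec: both sides keep a running reference, only the quantized delta crosses the network, and the reference is advanced by the same quantized delta so sender and receiver stay in sync. The uniform quantizer and bookkeeping below are simplified assumptions, not AQ-SGD's implementation.

```python
# Sketch: communicate quantized activation deltas against a shared reference.
import torch

def quantize_uniform(x: torch.Tensor, bits: int = 4):
    scale = x.abs().max().clamp(min=1e-12)
    levels = 2 ** (bits - 1) - 1
    q = torch.round(x / scale * levels)
    return q / levels * scale                     # dequantized low-precision tensor

class ActivationDeltaCodec:
    def __init__(self):
        self.reference = None                     # last communicated activation estimate

    def encode(self, activation: torch.Tensor, bits: int = 4) -> torch.Tensor:
        if self.reference is None:
            self.reference = torch.zeros_like(activation)
        delta_q = quantize_uniform(activation - self.reference, bits)
        self.reference = self.reference + delta_q  # same update applied on the receiver
        return delta_q                             # only this crosses the slow network

codec = ActivationDeltaCodec()
for step in range(3):
    act = torch.randn(4, 16)
    sent = codec.encode(act, bits=4)
```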

NeurIPS Conference 2022 Conference Paper

One Model to Edit Them All: Free-Form Text-Driven Image Manipulation with Semantic Modulations

  • Yiming Zhu
  • Hongyu Liu
  • Yibing Song
  • Ziyang Yuan
  • Xintong Han
  • Chun Yuan
  • Qifeng Chen
  • Jue Wang

Free-form text prompts allow users to describe their intentions during image manipulation conveniently. Based on the visual latent space of StyleGAN [21] and the text embedding space of CLIP [34], studies focus on how to map these two latent spaces for text-driven attribute manipulations. Currently, the latent mapping between these two spaces is empirically designed, which constrains each manipulation model to handle only one fixed text prompt. In this paper, we propose a method named Free-Form CLIP (FFCLIP), aiming to establish an automatic latent mapping so that one manipulation model handles free-form text prompts. Our FFCLIP has a cross-modality semantic modulation module containing semantic alignment and injection. The semantic alignment performs the automatic latent mapping via linear transformations with a cross-attention mechanism. After alignment, we inject semantics from text prompt embeddings into the StyleGAN latent space. For one type of image (e.g., 'human portrait'), one FFCLIP model can be learned to handle free-form text prompts. Meanwhile, we observe that although each training text prompt only contains a single semantic meaning, FFCLIP can leverage text prompts with multiple semantic meanings for image manipulation. In the experiments, we evaluate FFCLIP on three types of images (i.e., 'human portraits', 'cars', and 'churches'). Both visual and numerical results show that FFCLIP effectively produces semantically accurate and visually realistic images. Project page: https://github.com/KumapowerLIU/FFCLIP.

NeurIPS Conference 2022 Conference Paper

OST: Improving Generalization of DeepFake Detection via One-Shot Test-Time Training

  • Liang Chen
  • Yong Zhang
  • Yibing Song
  • Jue Wang
  • Lingqiao Liu

State-of-the-art deepfake detectors perform well in identifying forgeries when they are evaluated on a test set similar to the training set, but struggle to maintain good performance when the test forgeries exhibit different characteristics from the training images, e.g., forgeries created by unseen deepfake methods. Such a weak generalization capability hinders the applicability of deepfake detectors. In this paper, we introduce a new learning paradigm specially designed for the generalizable deepfake detection task. Our key idea is to construct a test-sample-specific auxiliary task to update the model before applying it to the sample. Specifically, we synthesize pseudo-training samples from each test image and create a test-time training objective to update the model. Moreover, we propose to leverage meta-learning to ensure that a fast single-step test-time gradient descent, dubbed one-shot test-time training (OST), can be sufficient for good deepfake detection performance. Extensive results across several benchmark datasets demonstrate that our approach performs favorably against existing arts in terms of generalization to unseen data and robustness to different post-processing steps.

NeurIPS Conference 2022 Conference Paper

Stability Analysis and Generalization Bounds of Adversarial Training

  • Jiancong Xiao
  • Yanbo Fan
  • Ruoyu Sun
  • Jue Wang
  • Zhi-Quan Luo

In adversarial machine learning, deep neural networks can fit the adversarial examples on the training dataset but have poor generalization ability on the test set. This phenomenon is called robust overfitting, and it can be observed when adversarially training neural nets on common datasets, including SVHN, CIFAR-10, CIFAR-100, and ImageNet. In this paper, we study the robust overfitting issue of adversarial training by using tools from uniform stability. One major challenge is that the outer function (as a maximization of the inner function) is nonsmooth, so the standard technique (e.g., Hardt et al., 2016) cannot be applied. Our approach is to consider $\eta$-approximate smoothness: we show that the outer function satisfies this modified smoothness assumption with $\eta$ being a constant related to the adversarial perturbation $\epsilon$. Based on this, we derive stability-based generalization bounds for stochastic gradient descent (SGD) on the general class of $\eta$-approximate smooth functions, which covers the adversarial loss. Our results suggest that robust test accuracy decreases in $\epsilon$ when $T$ is large, with a speed between $\Omega(\epsilon\sqrt{T})$ and $\mathcal{O}(\epsilon T)$. This phenomenon is also observed in practice. Additionally, we show that a few popular techniques for adversarial training (e.g., early stopping, cyclic learning rate, and stochastic weight averaging) are stability-promoting in theory.

NeurIPS Conference 2022 Conference Paper

VideoMAE: Masked Autoencoders are Data-Efficient Learners for Self-Supervised Video Pre-Training

  • Zhan Tong
  • Yibing Song
  • Jue Wang
  • Limin Wang

Pre-training video transformers on extra large-scale datasets is generally required to achieve premier performance on relatively small datasets. In this paper, we show that video masked autoencoders (VideoMAE) are data-efficient learners for self-supervised video pre-training (SSVP). We are inspired by the recent ImageMAE and propose customized video tube masking with an extremely high ratio. This simple design makes video reconstruction a more challenging and meaningful self-supervision task, thus encouraging extracting more effective video representations during the pre-training process. We obtain three important findings with VideoMAE: (1) An extremely high masking ratio (i.e., 90% to 95%) still yields favorable performance for VideoMAE. The temporally redundant video content enables a higher masking ratio than that of images. (2) VideoMAE achieves impressive results on very small datasets (i.e., around 3k-4k videos) without using any extra data. This is partially ascribed to the challenging task of video reconstruction enforcing high-level structure learning. (3) VideoMAE shows that data quality is more important than data quantity for SSVP. Domain shift between pre-training and target datasets is an important factor. Notably, our VideoMAE with the vanilla ViT backbone can achieve 87.4% on Kinetics-400, 75.4% on Something-Something V2, 91.3% on UCF101, and 62.6% on HMDB51, without using any extra data. Code is available at https://github.com/MCG-NJU/VideoMAE.
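
Tube masking, as described here, samples one spatial patch mask at a very high ratio and repeats it across all frames, so a masked patch stays masked for the whole clip. The sketch below shows just that mask construction; the shapes are assumptions for illustration.

```python
# Minimal tube-masking sketch: one spatial mask, repeated along time.
import numpy as np

def tube_mask(num_frames: int, patches_per_frame: int, mask_ratio: float = 0.9, seed=None):
    rng = np.random.default_rng(seed)
    num_masked = int(patches_per_frame * mask_ratio)
    spatial = np.zeros(patches_per_frame, dtype=bool)
    spatial[rng.choice(patches_per_frame, num_masked, replace=False)] = True
    # Repeat the same spatial mask along time: a "tube" of masked patches.
    return np.tile(spatial, (num_frames, 1))        # (T, N), True = masked

mask = tube_mask(num_frames=16, patches_per_frame=196, mask_ratio=0.9)
print(mask.shape, mask.mean())                      # (16, 196), ~0.9 of patches masked
```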

AAAI Conference 2021 Conference Paper

Effective Slot Filling via Weakly-Supervised Dual-Model Learning

  • Jue Wang
  • Ke Chen
  • Lidan Shou
  • Sai Wu
  • Gang Chen

Slot filling is a challenging task in Spoken Language Understanding (SLU). Supervised methods usually require large amounts of annotation to maintain desirable performance. A solution to relieve the heavy dependency on labeled data is to employ bootstrapping, which leverages unlabeled data. However, bootstrapping is known to suffer from semantic drift. We argue that semantic drift can be tackled by exploiting the correlation between slot values (phrases) and their respective types. By using some particular weakly-labeled data, namely the plain phrases included in sentences, we propose a weakly-supervised slot filling approach. Our approach trains two models, namely a classifier and a tagger, which can effectively learn from each other on the weakly-labeled data. The experimental results demonstrate that our approach achieves better results than standard baselines on multiple datasets, especially in the low-resource setting.

NeurIPS Conference 2021 Conference Paper

Revitalizing CNN Attention via Transformers in Self-Supervised Visual Representation Learning

  • Chongjian Ge
  • Youwei Liang
  • Yibing Song
  • Jianbo Jiao
  • Jue Wang
  • Ping Luo

Studies on self-supervised visual representation learning (SSL) improve encoder backbones to discriminate training samples without labels. While CNN encoders via SSL achieve comparable recognition performance to those via supervised learning, their network attention is under-explored for further improvement. Motivated by the transformers that explore visual attention effectively in recognition scenarios, we propose a CNN Attention REvitalization (CARE) framework to train attentive CNN encoders guided by transformers in SSL. The proposed CARE framework consists of a CNN stream (C-stream) and a transformer stream (T-stream), where each stream contains two branches. C-stream follows an existing SSL framework with two CNN encoders, two projectors, and a predictor. T-stream contains two transformers, two projectors, and a predictor. T-stream connects to the CNN encoders and runs in parallel with the remaining C-stream. During training, we perform SSL in both streams simultaneously and use the T-stream output to supervise C-stream. The features from the CNN encoders are modulated in T-stream for visual attention enhancement and become suitable for the SSL scenario. We use these modulated features to supervise C-stream for learning attentive CNN encoders. To this end, we revitalize CNN attention by using transformers as guidance. Experiments on several standard visual recognition benchmarks, including image classification, object detection, and semantic segmentation, show that the proposed CARE framework improves CNN encoder backbones to state-of-the-art performance.

AAAI Conference 2020 Conference Paper

When AWGN-Based Denoiser Meets Real Noises

  • Yuqian Zhou
  • Jianbo Jiao
  • Haibin Huang
  • Yang Wang
  • Jue Wang
  • Honghui Shi
  • Thomas Huang

Discriminative learning based image denoisers have achieved promising performance on synthetic noises such as Additive White Gaussian Noise (AWGN). The synthetic noises adopted in most previous work are pixel-independent, but real noises are mostly spatially/channel-correlated and spatially/channel-variant. This domain gap yields unsatisfactory performance on images with real noises if the model is only trained with AWGN. In this paper, we propose a novel approach to boost the performance of a real image denoiser which is trained only with synthetic pixel-independent noise data dominated by AWGN. First, we train a deep model that consists of a noise estimator and a denoiser with mixed AWGN and Random Value Impulse Noise (RVIN). We then investigate a Pixel-shuffle Down-sampling (PD) strategy to adapt the trained model to real noises. Extensive experiments demonstrate the effectiveness and generalization of the proposed approach. Notably, our method achieves state-of-the-art performance on real sRGB images in the DND benchmark among models trained with synthetic noises. Codes are available at https://github.com/yzhouas/PD-Denoising-pytorch.

AAAI Conference 2019 Short Paper

Adaptation Strategies for Applying AWGN-Based Denoiser to Realistic Noise

  • Yuqian Zhou
  • Jianbo Jiao
  • Haibin Huang
  • Jue Wang
  • Thomas Huang

Discriminative learning based denoising models trained with Additive White Gaussian Noise (AWGN) perform well on synthesized noise. However, realistic noise can be spatial-variant, signal-dependent, and a mixture of complicated noises. In this paper, we explore multiple strategies for applying an AWGN-based denoiser to realistic noise. Specifically, we trained a deep network integrating noise estimation and denoising with mixed Gaussian (AWGN) and Random Value Impulse Noise (RVIN). To adapt the model to realistic noises, we investigated multi-channel, multi-scale, and super-resolution approaches. Our preliminary results demonstrate the effectiveness of the newly-proposed noise model and adaptation strategies.

AAAI Conference 2019 Conference Paper

Video Inpainting by Jointly Learning Temporal Structure and Spatial Details

  • Chuan Wang
  • Haibin Huang
  • Xiaoguang Han
  • Jue Wang

We present a new data-driven video inpainting method for recovering missing regions of video frames. A novel deep learning architecture is proposed which contains two sub-networks: a temporal structure inference network and a spatial detail recovering network. The temporal structure inference network is built upon a 3D fully convolutional architecture: it only learns to complete a low-resolution video volume given the expensive computational cost of 3D convolution. The low-resolution result provides temporal guidance to the spatial detail recovering network, which performs image-based inpainting with a 2D fully convolutional network to produce recovered video frames in their original resolution. Such a two-step network design ensures both the spatial quality of each frame and the temporal coherence across frames. Our method jointly trains both sub-networks in an end-to-end manner. We provide qualitative and quantitative evaluation on three datasets, demonstrating that our method outperforms previous learning-based video inpainting methods.

NeurIPS Conference 2012 Conference Paper

Unsupervised Template Learning for Fine-Grained Object Recognition

  • Shulin Yang
  • Liefeng Bo
  • Jue Wang
  • Linda Shapiro

Fine-grained recognition refers to a subordinate level of recognition, such as recognizing different species of birds, animals or plants. It differs from recognition of basic categories, such as humans, tables, and computers, in that there are global similarities in shape or structure shared within a category, and the differences are in the details of the object parts. We suggest that the key to identifying the fine-grained differences lies in finding the right alignment of image regions that contain the same object parts. We propose a template model for this purpose, which captures common shape patterns of object parts, as well as the co-occurrence relation of the shape patterns. Once the image regions are aligned, extracted features are used for classification. Learning of the template model is efficient, and the recognition results we achieve significantly outperform the state-of-the-art algorithms.

NeurIPS Conference 2010 Conference Paper

Avoiding False Positive in Multi-Instance Learning

  • Yanjun Han
  • Qing Tao
  • Jue Wang

In multi-instance learning, there are two kinds of prediction failure, i.e., false negatives and false positives. Current research mainly focuses on avoiding the former. We attempt to utilize the geometric distribution of instances inside positive bags to avoid both the former and the latter. Based on kernel principal component analysis, we define a projection constraint for each positive bag to classify its constituent instances far away from the separating hyperplane while placing positive instances and negative instances on opposite sides. We apply the Constrained Concave-Convex Procedure to solve the resulting problem. Empirical results demonstrate that our approach offers improved generalization performance.

IS Journal 2008 Journal Article

AI in China: A Survey

  • Xiao-Shan Gao
  • Dan-tong Ouyang
  • Ji-gui Sun
  • San-jiang Li
  • Tian-shun Yao
  • Ru-zhan Lu
  • Chun-yi Shi
  • Zhan-gang Han

This article consists of nine short essays discussing research pursued by AI researchers in China and their perspectives on research in several AI subareas. The article first introduces the mechanization of mathematics, an area in which Chinese scientists have made significant contributions. It then discusses research in automated reasoning, temporal and spatial knowledge representation and reasoning, natural language understanding, intelligent diagnosis, multiagent systems, computational intelligence, large-scale knowledge processing, and several research streams integrating AI techniques with methods from other fields. Finally, the article makes suggestions concerning future AI research in China.

AAAI Conference 2008 Conference Paper

Hybrid Markov Logic Networks

  • Jue Wang

Markov logic networks (MLNs) combine first-order logic and Markov networks, allowing us to handle the complexity and uncertainty of real-world problems in a single consistent framework. However, in MLNs all variables and features are discrete, while most real-world applications also contain continuous ones. In this paper we introduce hybrid MLNs, in which continuous properties (e.g., the distance between two objects) and functions over them can appear as features. Hybrid MLNs have all distributions in the exponential family as special cases (e.g., multivariate Gaussians), and allow much more compact modeling of non-i.i.d. data than propositional representations like hybrid Bayesian networks. We also introduce inference algorithms for hybrid MLNs, by extending the MaxWalkSAT and MC-SAT algorithms to continuous domains. Experiments in a mobile robot mapping domain—involving joint classification, clustering and regression—illustrate the power of hybrid MLNs as a modeling language, and the accuracy and efficiency of the inference algorithms.

IS Journal 2008 Journal Article

Machine Learning: The State of the Art

  • Jue Wang
  • Qing Tao

The two fundamental problems in machine learning (ML) are statistical analysis and algorithm design. The former tells us the principles of the mathematical models that we establish from the observation data. The latter defines the conditions on which implementation of data models and data sets rely. A newly discovered challenge to ML is the Rashomon effect, which means that data are possibly generated from a mixture of heterogeneous sources. A simple classification standard can shed light on emerging forms of ML. This article is part of a special issue on AI in China.

IROS Conference 2005 Conference Paper

Designing robots for long-term social interaction

  • Rachel Gockley
  • Allison Bruce
  • Jodi Forlizzi
  • Marek P. Michalowski
  • Anne Mundell
  • Stephanie Rosenthal
  • Brennan Sellner
  • Reid G. Simmons

Valerie the roboceptionist is the most recent addition to Carnegie Mellon's social robots project. A permanent installation in the entranceway to Newell-Simon hall, the robot combines useful functionality - giving directions, looking up weather forecasts, etc. - with an interesting and compelling character. We are using Valerie to investigate human-robot social interaction, especially long-term human-robot "relationships". Over a nine-month period, we have found that many visitors continue to interact with the robot on a daily basis, but that few of the individual interactions last for more than 30 seconds. Our analysis of the data has indicated several design decisions that should facilitate more natural human-robot interactions.

IS Journal 2005 Journal Article

Rule + Exception Strategies for Security Information Analysis

  • Yiyu Yao
  • Fei-Yue Wang
  • Jue Wang
  • Daniel Zeng

Broadly defined, intelligence and security informatics is "the study of the use and development of advanced information technologies, systems, algorithms, and databases for national- and homeland-security-related applications". Processing security-related information is a critical component of ISI research, which involves studying a wide range of technical and systems challenges related to the acquisition, collection, storage, retrieval, synthesis, analysis, visualization, presentation, and understanding of security-related information. Our research aims to develop a unified data description and understanding framework to enable discovery of useful knowledge and events from data sets related to international, homeland, or other types of security. In particular, this article focuses on a common security information analysis task: how to develop an efficient knowledge representation framework and related automated learning and mining mechanisms to describe and identify abnormal situations or behavior. We advocate the use of a specific knowledge representation and data mining framework based on rules and exceptions for analysis of security-related information. In this rule+exception framework, normal and abnormal situations or behaviors occur as pairs of dual entities: rules succinctly summarize normal situations, and exceptions characterize abnormal situations. The rule+exception approach, which closely resembles how humans understand, organize, and use knowledge, has the potential to evolve into a unified, multilevel data description and understanding framework applicable across many security informatics applications.