Arrow Research search

Author name cluster

Jianmin Wang

Possible papers associated with this exact author name in Arrow. This page groups case-insensitive exact name matches and is not a full identity disambiguation profile.

49 papers
1 author row

Possible papers

49

JMLR Journal 2025 Journal Article

depyf: Open the Opaque Box of PyTorch Compiler for Machine Learning Researchers

  • Kaichao You
  • Runsheng Bai
  • Meng Cao
  • Jianmin Wang
  • Ion Stoica
  • Mingsheng Long

PyTorch 2.x introduces a compiler designed to accelerate deep learning programs. However, for machine learning researchers, fully leveraging the PyTorch compiler can be challenging due to its operation at the Python bytecode level, making it appear as an opaque box. To address this, we introduce depyf, a tool designed to demystify the inner workings of the PyTorch compiler. depyf decompiles the bytecode generated by PyTorch back into equivalent source code and establishes connections between the code objects in memory and their counterparts in source files on disk. This feature enables users to step through the source code line by line using debuggers, thus enhancing their understanding of the underlying processes. Notably, depyf is non-intrusive and user-friendly, primarily relying on two convenient context managers for its core functionality. The project is openly available at https://github.com/thuml/depyf and is recognized as a PyTorch ecosystem project at https://pytorch.org/blog/introducing-depyf.
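
The mapping the abstract describes — connecting in-memory code objects to real source files so a debugger can step through them — can be illustrated with a small stdlib sketch. This is not depyf's implementation, just the core idea: write the (decompiled) source to disk and recompile it under that filename, so tools like `pdb` and `linecache` resolve frames to actual lines.

```python
# Sketch only (not depyf's code): materialize an in-memory code object
# as a real source file and recompile it with that filename, so
# debuggers can map execution frames back to lines on disk.
import linecache
import os
import tempfile

def materialize(source: str, name: str):
    """Write `source` to disk and compile it under that filename."""
    path = os.path.join(tempfile.mkdtemp(), f"{name}.py")
    with open(path, "w") as f:
        f.write(source)
    code = compile(source, path, "exec")   # co_filename now points at disk
    namespace = {}
    exec(code, namespace)
    return namespace, path

src = "def double(x):\n    return 2 * x\n"
ns, path = materialize(src, "generated")
print(ns["double"](21))                         # 42
print(ns["double"].__code__.co_filename == path)  # True: steppable source
```

Because `co_filename` points at a real file, a debugger set to break inside `double` would show the written source rather than an opaque bytecode frame.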

IJCAI Conference 2025 Conference Paper

DO-CoLM: Dynamic 3D Conformation Relationships Capture with Self-Adaptive Ordering Molecular Relational Modeling in Language Models

  • Zhuo Chen
  • Jiahui Zhang
  • Sihan Wang
  • Hongxin Xiang
  • Jianmin Wang
  • Wenjie Du
  • Yang Wang

Molecular Relational Learning (MRL) aims to understand interactions between molecular pairs, playing a critical role in advancing biochemical research. Recently, Large Language Models (LLMs), with their extensive knowledge bases and advanced reasoning capabilities, have emerged as powerful tools for MRL. However, existing LLMs, which primarily rely on SMILES strings and molecular graphs, face two major challenges. They struggle to capture molecular stereochemistry and dynamics, as molecules possess multiple 3D conformations with varying reactivity and dynamic transformation relationships that are essential for accurately predicting molecular interactions but cannot be effectively represented by 1D SMILES or 2D molecular graphs. Additionally, these models do not consider the autoregressive nature of LLMs, overlooking the impact of input order on model performance. To address these issues, we propose DO-CoLM: a Dynamic relationship capture and self-adaptive Ordering 3D molecular Conformation LM for MRL. By introducing modules to dynamically model intra-molecular and inter-molecular conformational relationships and adaptively adjust the molecular modality input order, DO-CoLM achieves superior performance, as demonstrated by experimental results on 12 cross-domain datasets.

NeurIPS Conference 2025 Conference Paper

FlashBias: Fast Computation of Attention with Bias

  • Haixu Wu
  • Minghao Guo
  • Yuezhou Ma
  • Yuanxu Sun
  • Jianmin Wang
  • Wojciech Matusik
  • Mingsheng Long

Attention with bias, which extends standard attention by introducing prior knowledge as an additive bias matrix to the query-key scores, has been widely deployed in vision, language, protein-folding and other advanced scientific models, underscoring its status as a key evolution of this foundational module. However, introducing bias terms creates a severe efficiency bottleneck in attention computation. It disrupts the tightly fused memory-compute pipeline that underlies the speed of accelerators like FlashAttention, thereby stripping away most of their performance gains and leaving biased attention computationally expensive. Surprisingly, despite its common usage, targeted efficiency optimization for attention with bias remains absent, which seriously hinders its application in complex tasks. Diving into the computation of FlashAttention, we prove that its optimal efficiency is determined by the rank of the attention weight matrix. Inspired by this theoretical result, this paper presents FlashBias based on the low-rank compressed sensing theory, which can provide fast-exact computation for many widely used attention biases and a fast-accurate approximation for biases in general formalizations. FlashBias can fully take advantage of the extremely optimized matrix multiplication operation in modern GPUs, achieving 1.5$\times$ speedup for Pairformer in AlphaFold 3, and over 2$\times$ speedup for attention with bias in vision and language models without loss of accuracy. Code is available at this repository: https://github.com/thuml/FlashBias.
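
The low-rank trick the abstract alludes to can be checked numerically (my reading of the abstract, not the paper's GPU kernels): if the bias factors as $B = UV^\top$, then $QK^\top + UV^\top = [Q\,|\,U][K\,|\,V]^\top$, so the bias folds into one fused matrix multiplication of the kind FlashAttention-style kernels already execute.

```python
# Verify QK^T + UV^T == [Q|U] @ [K|V]^T on tiny row-major matrices.
def matmul_t(A, B):
    """Compute A @ B.T for nested-list matrices."""
    return [[sum(a * b for a, b in zip(ra, rb)) for rb in B] for ra in A]

def concat_cols(A, B):
    """Horizontally concatenate two matrices with equal row counts."""
    return [ra + rb for ra, rb in zip(A, B)]

Q = [[1.0, 2.0], [3.0, 4.0]]
K = [[0.5, 1.0], [1.5, 2.0]]
U = [[1.0], [0.0]]          # rank-1 factors of the bias matrix B
V = [[2.0], [3.0]]

# Naive: compute scores and bias separately, then add elementwise.
scores_naive = [[qk + b for qk, b in zip(rq, rb)]
                for rq, rb in zip(matmul_t(Q, K), matmul_t(U, V))]
# Fused: one matmul over the concatenated factors.
scores_fused = matmul_t(concat_cols(Q, U), concat_cols(K, V))
print(scores_naive == scores_fused)  # True
```

This identity is why a low-rank bias costs only a few extra "head dimensions" inside the fused kernel instead of a separate memory-bound addition pass.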

IJCAI Conference 2025 Conference Paper

MTGIB-UNet: A Multi-Task Graph Information Bottleneck and Uncertainty Weighted Network for ADMET Prediction

  • Xuqiang Li
  • Wenjie Du
  • Jun Xia
  • Jianmin Wang
  • Xiaoqi Wang
  • Yang Yang
  • Yang Wang

Accurate prediction of ADMET (Absorption, Distribution, Metabolism, Excretion, and Toxicity) properties is crucial in drug development, as these properties directly impact a drug's efficacy and safety. However, existing multi-task learning models often face challenges related to noise interference and task conflicts when dealing with complex molecular structures. To address these issues, we propose a novel multi-task Graph Neural Network (GNN) model, MTGIB-UNet. The model begins by encoding molecular graphs to capture intricate molecular structure information. Subsequently, based on the Graph Information Bottleneck (GIB) principle, the model compresses the information flow by extracting subgraphs, retaining task-relevant features while removing noise for each task. These embeddings are then fused through a gated network that dynamically adjusts the contribution weights of auxiliary tasks to the primary task. Specifically, an uncertainty weighting (UW) strategy is applied, with additional emphasis placed on the primary task, allowing dynamic adjustment of task weights while strengthening the influence of the primary task on model training. Experiments on standard ADMET datasets demonstrate that our model outperforms existing methods. Additionally, the model shows good interpretability by identifying key molecular substructures related to specific ADMET endpoints.
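
A common form of uncertainty weighting (in the style of Kendall et al.; the paper's exact variant may differ) scales each task loss by a learned $1/(2\sigma_i^2)$ and adds a $\log\sigma_i$ penalty so the uncertainties cannot grow without bound. The extra primary-task emphasis is sketched here as a fixed multiplier, which is a hypothetical knob, not a detail from the abstract.

```python
# Illustrative uncertainty-weighted multi-task loss; primary_boost is
# a hypothetical stand-in for the paper's primary-task emphasis.
import math

def uw_total_loss(task_losses, log_sigmas, primary_index=0, primary_boost=2.0):
    total = 0.0
    for i, (loss, log_s) in enumerate(zip(task_losses, log_sigmas)):
        weight = math.exp(-2.0 * log_s) / 2.0   # 1 / (2 * sigma_i^2)
        term = weight * loss + log_s            # log-sigma regularizer
        if i == primary_index:
            term *= primary_boost               # emphasize the primary task
        total += term
    return total

print(uw_total_loss([1.0, 4.0], [0.0, 0.0]))  # 3.0 with sigma = 1 everywhere
```

With all `log_sigmas` at zero every task starts at weight 0.5; during training the learned uncertainties down-weight noisy tasks automatically.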

NeurIPS Conference 2025 Conference Paper

PhySense: Sensor Placement Optimization for Accurate Physics Sensing

  • Yuezhou Ma
  • Haixu Wu
  • Hang Zhou
  • Huikun Weng
  • Jianmin Wang
  • Mingsheng Long

Physics sensing plays a central role in many scientific and engineering domains, which inherently involves two coupled tasks: reconstructing dense physical fields from sparse observations and optimizing scattered sensor placements to observe maximum information. While deep learning has made rapid advances in sparse-data reconstruction, existing methods generally omit optimization of sensor placements, leaving the mutual enhancement between reconstruction and placement on the shelf. To change this suboptimal practice, we propose PhySense, a synergistic two-stage framework that learns to jointly reconstruct physical fields and to optimize sensor placements, both aiming for accurate physics sensing. The first stage involves a flow-based generative model enhanced by cross-attention to adaptively fuse sparse observations. Leveraging the reconstruction feedback, the second stage performs sensor placement via projected gradient descent to satisfy spatial constraints. We further prove that the learning objectives of the two stages are consistent with classical variance-minimization principles, providing theoretical guarantees. Extensive experiments across three challenging benchmarks, especially a 3D geometry dataset, indicate PhySense achieves state-of-the-art physics sensing accuracy and discovers informative sensor placements previously unconsidered. Code is available at this repository: https://github.com/thuml/PhySense.
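
Projected gradient descent, the mechanism named for the second stage, alternates a gradient step on the placement objective with a projection back onto the feasible region. A toy stand-in (one-dimensional sensors constrained to the unit interval; the gradient here is a placeholder for the reconstruction feedback):

```python
# Minimal projected-gradient-descent sketch for sensor placement.
def project(positions, lo=0.0, hi=1.0):
    """Project each sensor back into the allowed spatial domain."""
    return [min(max(p, lo), hi) for p in positions]

def pgd_step(positions, grad, lr=0.1):
    """Gradient step on the placement objective, then projection."""
    return project([p - lr * g for p, g in zip(positions, grad)])

pos = [0.05, 0.95]
grad = [1.0, -1.0]          # placeholder reconstruction-feedback gradient
pos = pgd_step(pos, grad)
print(pos)                  # both sensors clipped to the domain boundary
```

Real spatial constraints (e.g. points restricted to a 3D surface) would replace the clipping in `project`, but the step/project alternation is the same.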

JBHI Journal 2025 Journal Article

Predicting Mutation-Disease Associations Through Protein Interactions Via Deep Learning

  • Xue Li
  • Ben Cao
  • Jianmin Wang
  • Xiangyu Meng
  • Shuang Wang
  • Yu Huang
  • Enrico Petretto
  • Tao Song

Disease is one of the primary factors affecting life activities, with complex etiologies often influenced by gene expression and mutation. Wet lab experiments have analyzed the mechanisms of mutations, but they are usually limited by experimental costs and by constraints on sample types and scales. Therefore, this paper constructs a real-world mutation-induced disease dataset and proposes Capsule and Graph topology networks with Multi-head attention (CGM) to predict mutation-disease associations. CGM can accurately predict protein mutation-disease associations, and to further elucidate the pathogenicity of protein mutations, we also used the model to verify that protein mutations lead to protein structural alterations, suggesting that mutation-induced conformational changes may be an important pathogenic factor. Given the limited size of the mutated protein dataset, we also performed experiments on benchmark and imbalanced datasets, where CGM mined 22 unknown protein interaction pairs from the benchmark dataset, better illustrating the potential of CGM in predicting mutation-disease associations. In summary, this paper curates a real-world dataset and proposes CGM to predict associations between protein mutations and diseases, providing a novel tool for further understanding of biomolecular pathways and disease mechanisms.

IJCAI Conference 2024 Conference Paper

An Image-enhanced Molecular Graph Representation Learning Framework

  • Hongxin Xiang
  • Shuting Jin
  • Jun Xia
  • Man Zhou
  • Jianmin Wang
  • Li Zeng
  • Xiangxiang Zeng

Extracting rich molecular representation is a crucial prerequisite for accurate drug discovery. Recent molecular representation learning methods achieve impressive progress, but the paradigm of learning from a single modality gradually encounters the bottleneck of limited representation capabilities. In this work, we fully consider the rich visual information contained in 3D conformation molecular images (i.e., texture, shadow, color and planar spatial information) and distill graph-based models for more discriminative drug discovery. Specifically, we propose an image-enhanced molecular graph representation learning framework that leverages multi-view molecular images rendered from 3D conformations to boost molecular graph representations. To extract useful auxiliary knowledge from multi-view images, we design a teacher, which is pre-trained on 2 million molecules with conformations through five meticulously designed pre-training tasks. To transfer knowledge from teacher to graph-based students, we pose an efficient cross-modal knowledge distillation strategy with knowledge enhancer and task enhancer. It is worth noting that the distillation architecture of IEM can be directly integrated into existing graph-based models, and significantly improves the capabilities of these models (e.g., GIN, EdgePred, GraphMVP, MoleBERT) for molecular representation learning. In particular, GraphMVP and MoleBERT equipped with IEM achieve new state-of-the-art performance on MoleculeNet benchmark, achieving average 73.89% and 73.81% ROC-AUC, respectively. Code is available at https://github.com/HongxinXiang/IEM.

NeurIPS Conference 2024 Conference Paper

AutoTimes: Autoregressive Time Series Forecasters via Large Language Models

  • Yong Liu
  • Guo Qin
  • Xiangdong Huang
  • Jianmin Wang
  • Mingsheng Long

Foundation models of time series have not been fully developed due to the limited availability of time series corpora and the underexploration of scalable pre-training. Based on the similar sequential formulation of time series and natural language, increasing research demonstrates the feasibility of leveraging large language models (LLM) for time series. Nevertheless, the inherent autoregressive property and decoder-only architecture of LLMs have not been fully considered, resulting in insufficient utilization of LLM abilities. To fully revitalize the general-purpose token transition and multi-step generation capability of large language models, we propose AutoTimes to repurpose LLMs as autoregressive time series forecasters, which projects time series into the embedding space of language tokens and autoregressively generates future predictions with arbitrary lengths. Compatible with any decoder-only LLMs, the consequent forecaster exhibits the flexibility of the lookback length and scalability with larger LLMs. Further, we formulate time series as prompts, extending the context for prediction beyond the lookback window, termed in-context forecasting. By introducing LLM-embedded textual timestamps, AutoTimes can utilize chronological information to align multivariate time series. Empirically, AutoTimes achieves state-of-the-art with 0.1% trainable parameters and over $5\times$ training/inference speedup compared to advanced LLM-based forecasters. Code is available at this repository: https://github.com/thuml/AutoTimes.
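
The autoregressive rollout the abstract describes — emit the next segment, append it to the context, repeat to any horizon — reduces to a short loop. A sketch with a dummy forecaster standing in for the LLM (the real model predicts embedded segment tokens, not raw values):

```python
# Autoregressive multi-step forecasting loop; `predict_next` is a
# stand-in for the decoder-only model's next-token prediction.
def rollout(history, steps, predict_next):
    context = list(history)
    predictions = []
    for _ in range(steps):
        nxt = predict_next(context)   # one token = one future segment
        predictions.append(nxt)
        context.append(nxt)           # feed the prediction back in
    return predictions

mean_model = lambda ctx: sum(ctx) / len(ctx)   # toy forecaster
print(rollout([1.0, 2.0, 3.0], 2, mean_model))
```

Because the horizon is just the number of loop iterations, the forecaster naturally supports arbitrary prediction lengths, which is the flexibility the abstract claims.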

NeurIPS Conference 2024 Conference Paper

RoPINN: Region Optimized Physics-Informed Neural Networks

  • Haixu Wu
  • Huakun Luo
  • Yuezhou Ma
  • Jianmin Wang
  • Mingsheng Long

Physics-informed neural networks (PINNs) have been widely applied to solve partial differential equations (PDEs) by enforcing outputs and gradients of deep models to satisfy target equations. Due to the limitation of numerical computation, PINNs are conventionally optimized on finite selected points. However, since PDEs are usually defined on continuous domains, solely optimizing models on scattered points may be insufficient to obtain an accurate solution for the whole domain. To mitigate this inherent deficiency of the default scatter-point optimization, this paper proposes and theoretically studies a new training paradigm as region optimization. Concretely, we propose to extend the optimization process of PINNs from isolated points to their continuous neighborhood regions, which can theoretically decrease the generalization error, especially for hidden high-order constraints of PDEs. A practical training algorithm, Region Optimized PINN (RoPINN), is seamlessly derived from this new paradigm, which is implemented by a straightforward but effective Monte Carlo sampling method. By calibrating the sampling process into trust regions, RoPINN finely balances optimization and generalization error. Experimentally, RoPINN consistently boosts the performance of diverse PINNs on a wide range of PDEs without extra backpropagation or gradient calculation. Code is available at this repository: https://github.com/thuml/RoPINN.
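
The Monte Carlo region loss can be sketched in a few lines: instead of evaluating the PDE residual only at each collocation point, average it over samples drawn from a small neighborhood of that point. Region size and sample count below are hypothetical knobs, and a toy 1D residual stands in for a real PDE.

```python
# Region-optimization sketch: average the squared PDE residual over
# Monte Carlo samples in a neighborhood of each collocation point.
import random

def region_loss(residual, points, r=0.1, n_samples=8, seed=0):
    rng = random.Random(seed)
    total = 0.0
    for x in points:
        for _ in range(n_samples):
            xs = x + rng.uniform(-r, r)      # sample within the region
            total += residual(xs) ** 2
    return total / (len(points) * n_samples)

# For u(x) = x^2, the residual of u'(x) = 2x is identically zero,
# so the region loss vanishes everywhere.
residual = lambda x: 2 * x - 2 * x
print(region_loss(residual, [0.0, 0.5, 1.0]))  # 0.0
```

Shrinking `r` toward zero recovers ordinary point-wise PINN training, which makes the region loss a strict generalization of the default objective.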

NeurIPS Conference 2024 Conference Paper

TimeXer: Empowering Transformers for Time Series Forecasting with Exogenous Variables

  • Yuxuan Wang
  • Haixu Wu
  • Jiaxiang Dong
  • Guo Qin
  • Haoran Zhang
  • Yong Liu
  • Yunzhong Qiu
  • Jianmin Wang

Deep models have demonstrated remarkable performance in time series forecasting. However, due to the partially-observed nature of real-world applications, solely focusing on the target of interest, so-called endogenous variables, is usually insufficient to guarantee accurate forecasting. Notably, a system is often recorded into multiple variables, where the exogenous variables can provide valuable external information for endogenous variables. Thus, unlike well-established multivariate or univariate forecasting paradigms that either treat all the variables equally or ignore exogenous information, this paper focuses on a more practical setting: time series forecasting with exogenous variables. We propose a novel approach, TimeXer, to ingest external information to enhance the forecasting of endogenous variables. With deftly designed embedding layers, TimeXer empowers the canonical Transformer with the ability to reconcile endogenous and exogenous information, where patch-wise self-attention and variate-wise cross-attention are used simultaneously. Moreover, global endogenous tokens are learned to effectively bridge the causal information underlying exogenous series into endogenous temporal patches. Experimentally, TimeXer achieves consistent state-of-the-art performance on twelve real-world forecasting benchmarks and exhibits notable generality and scalability. Code is available at this repository: https: //github. com/thuml/TimeXer.

NeurIPS Conference 2023 Conference Paper

Koopa: Learning Non-stationary Time Series Dynamics with Koopman Predictors

  • Yong Liu
  • Chenyu Li
  • Jianmin Wang
  • Mingsheng Long

Real-world time series are characterized by intrinsic non-stationarity that poses a principal challenge for deep forecasting models. While previous models suffer from complicated series variations induced by changing temporal distribution, we tackle non-stationary time series with modern Koopman theory that fundamentally considers the underlying time-variant dynamics. Inspired by Koopman theory of portraying complex dynamical systems, we disentangle time-variant and time-invariant components from intricate non-stationary series by Fourier Filter and design Koopman Predictor to advance respective dynamics forward. Technically, we propose Koopa as a novel Koopman forecaster composed of stackable blocks that learn hierarchical dynamics. Koopa seeks measurement functions for Koopman embedding and utilizes Koopman operators as linear portraits of implicit transition. To cope with time-variant dynamics that exhibit strong locality, Koopa calculates context-aware operators in the temporal neighborhood and is able to utilize incoming ground truth to scale up forecast horizon. Besides, by integrating Koopman Predictors into deep residual structure, we ravel out the binding reconstruction loss in previous Koopman forecasters and achieve end-to-end forecasting objective optimization. Compared with the state-of-the-art model, Koopa achieves competitive performance while saving 77.3% training time and 76.0% memory.
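
The core Koopman idea — the transition is linear in the embedding space, $z_{t+1} \approx K z_t$ — admits a one-dimensional caricature where the operator is fit in closed form by least squares and then iterated to forecast. (Koopa's actual operators act on learned multi-dimensional embeddings and are computed per temporal neighborhood.)

```python
# 1D least-squares fit of a linear Koopman-style operator K, then
# iterate it to forecast: z_{t+1} = K * z_t.
def fit_linear_operator(z):
    num = sum(a * b for a, b in zip(z[:-1], z[1:]))
    den = sum(a * a for a in z[:-1])
    return num / den   # argmin_K sum (z_{t+1} - K z_t)^2

z = [1.0, 2.0, 4.0, 8.0]     # embedded series with true transition K = 2
K = fit_linear_operator(z)
forecast, preds = z[-1], []
for _ in range(2):           # roll the operator forward two steps
    forecast *= K
    preds.append(forecast)
print(K, preds)
```

Replacing the global least-squares fit with one computed over a sliding temporal neighborhood gives the "context-aware operators" flavor described in the abstract.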

NeurIPS Conference 2023 Conference Paper

SimMTM: A Simple Pre-Training Framework for Masked Time-Series Modeling

  • Jiaxiang Dong
  • Haixu Wu
  • Haoran Zhang
  • Li Zhang
  • Jianmin Wang
  • Mingsheng Long

Time series analysis is widely used in extensive areas. Recently, to reduce labeling expenses and benefit various tasks, self-supervised pre-training has attracted immense interest. One mainstream paradigm is masked modeling, which successfully pre-trains deep models by learning to reconstruct the masked content based on the unmasked part. However, since the semantic information of time series is mainly contained in temporal variations, the standard way of randomly masking a portion of time points will seriously ruin vital temporal variations of time series, making the reconstruction task too difficult to guide representation learning. We thus present SimMTM, a Simple pre-training framework for Masked Time-series Modeling. By relating masked modeling to manifold learning, SimMTM proposes to recover masked time points by the weighted aggregation of multiple neighbors outside the manifold, which eases the reconstruction task by assembling ruined but complementary temporal variations from multiple masked series. SimMTM further learns to uncover the local structure of the manifold, which is helpful for masked modeling. Experimentally, SimMTM achieves state-of-the-art fine-tuning performance compared to the most advanced time series pre-training methods in two canonical time series analysis tasks: forecasting and classification, covering both in- and cross-domain settings.

NeurIPS Conference 2022 Conference Paper

Debiased Self-Training for Semi-Supervised Learning

  • Baixu Chen
  • Junguang Jiang
  • Ximei Wang
  • Pengfei Wan
  • Jianmin Wang
  • Mingsheng Long

Deep neural networks achieve remarkable performances on a wide range of tasks with the aid of large-scale labeled datasets. Yet these datasets are time-consuming and labor-intensive to obtain for realistic tasks. To mitigate the requirement for labeled data, self-training is widely used in semi-supervised learning by iteratively assigning pseudo labels to unlabeled samples. Despite its popularity, self-training is widely believed to be unreliable and often leads to training instability. Our experimental studies further reveal that the bias in semi-supervised learning arises from both the problem itself and the inappropriate training with potentially incorrect pseudo labels, which accumulates the error in the iterative self-training process. To reduce the above bias, we propose Debiased Self-Training (DST). First, the generation and utilization of pseudo labels are decoupled by two parameter-independent classifier heads to avoid direct error accumulation. Second, we estimate the worst case of self-training bias, where the pseudo labeling function is accurate on labeled samples, yet makes as many mistakes as possible on unlabeled samples. We then adversarially optimize the representations to improve the quality of pseudo labels by avoiding the worst case. Extensive experiments justify that DST achieves an average improvement of 6.3% against state-of-the-art methods on standard semi-supervised learning benchmark datasets and 18.9% against FixMatch on 13 diverse tasks. Furthermore, DST can be seamlessly adapted to other self-training methods and help stabilize their training and balance performance across classes in both cases of training from scratch and fine-tuning from pre-trained models.

NeurIPS Conference 2022 Conference Paper

Hub-Pathway: Transfer Learning from A Hub of Pre-trained Models

  • Yang Shu
  • Zhangjie Cao
  • Ziyang Zhang
  • Jianmin Wang
  • Mingsheng Long

Transfer learning aims to leverage knowledge from pre-trained models to benefit the target task. Prior transfer learning work mainly transfers from a single model. However, with the emergence of deep models pre-trained from different resources, model hubs consisting of diverse models with various architectures, pre-trained datasets and learning paradigms are available. Directly applying single-model transfer learning methods to each model wastes the abundant knowledge of the model hub and suffers from high computational cost. In this paper, we propose a Hub-Pathway framework to enable knowledge transfer from a model hub. The framework generates data-dependent pathway weights, based on which we assign the pathway routes at the input level to decide which pre-trained models are activated and passed through, and then set the pathway aggregation at the output level to aggregate the knowledge from different models to make predictions. The proposed framework can be trained end-to-end with the target task-specific loss, where it learns to explore better pathway configurations and exploit the knowledge in pre-trained models for each target datum. We utilize a noisy pathway generator and design an exploration loss to further explore different pathways throughout the model hub. To fully exploit the knowledge in pre-trained models, each model is further trained by specific data that activate it, which ensures its performance and enhances knowledge transfer. Experiment results on computer vision and reinforcement learning tasks demonstrate that the proposed Hub-Pathway framework achieves the state-of-the-art performance for model hub transfer learning.
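
The data-dependent routing the abstract outlines can be sketched as a gate that scores each pre-trained model per input, activates only the highest-scoring pathways, and aggregates their outputs by renormalized gate weights. The softmax gate, `top_k` cutoff, and toy models below are illustrative assumptions, not details from the paper.

```python
# Sketch of gated pathway routing over a hub of pre-trained models.
import math

def route(x, models, gate, top_k=2):
    scores = gate(x)                       # data-dependent pathway scores
    exp = [math.exp(s) for s in scores]
    weights = [e / sum(exp) for e in exp]  # softmax over pathways
    # Activate only the top-k pathways and renormalize their weights.
    top = sorted(range(len(models)), key=lambda i: -weights[i])[:top_k]
    norm = sum(weights[i] for i in top)
    return sum(weights[i] / norm * models[i](x) for i in top)

models = [lambda x: x + 1, lambda x: 2 * x, lambda x: -x]
gate = lambda x: [1.0, 1.0, -100.0]   # third pathway effectively switched off
print(route(3.0, models, gate))
```

Skipping the non-activated models is what saves computation relative to running every model in the hub on every input.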

NeurIPS Conference 2022 Conference Paper

Non-stationary Transformers: Exploring the Stationarity in Time Series Forecasting

  • Yong Liu
  • Haixu Wu
  • Jianmin Wang
  • Mingsheng Long

Transformers have shown great power in time series forecasting due to their global-range modeling ability. However, their performance can degenerate terribly on non-stationary real-world data in which the joint distribution changes over time. Previous studies primarily adopt stationarization to attenuate the non-stationarity of original series for better predictability. But the stationarized series deprived of inherent non-stationarity can be less instructive for real-world bursty events forecasting. This problem, termed over-stationarization in this paper, leads Transformers to generate indistinguishable temporal attentions for different series and impedes the predictive capability of deep models. To tackle the dilemma between series predictability and model capability, we propose Non-stationary Transformers as a generic framework with two interdependent modules: Series Stationarization and De-stationary Attention. Concretely, Series Stationarization unifies the statistics of each input and converts the output with restored statistics for better predictability. To address the over-stationarization problem, De-stationary Attention is devised to recover the intrinsic non-stationary information into temporal dependencies by approximating distinguishable attentions learned from raw series. Our Non-stationary Transformers framework consistently boosts mainstream Transformers by a large margin, which reduces MSE by 49.43% on Transformer, 47.34% on Informer, and 46.89% on Reformer, making them the state-of-the-art in time series forecasting. Code is available at this repository: https://github.com/thuml/Nonstationary_Transformers.
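
Series Stationarization as described — normalize each input window by its own statistics, forecast on the normalized series, then restore the statistics on the output — fits in a few lines. The per-window mean/std normalization below matches the abstract's description; the forecaster is a toy stand-in.

```python
# Series Stationarization in miniature: normalize in, de-normalize out.
import math

def stationarize(x):
    mu = sum(x) / len(x)
    sigma = math.sqrt(sum((v - mu) ** 2 for v in x) / len(x)) or 1.0
    return [(v - mu) / sigma for v in x], mu, sigma

def forecast(x, model):
    z, mu, sigma = stationarize(x)
    y = model(z)                          # model sees a normalized window
    return [v * sigma + mu for v in y]    # restore the input statistics

last_value_model = lambda z: [z[-1]]      # toy forecaster
print(forecast([10.0, 20.0, 30.0], last_value_model))
```

De-stationary Attention then compensates inside the model for exactly this normalization, so the attention maps are not flattened into the indistinguishable patterns the abstract calls over-stationarization.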

JMLR Journal 2022 Journal Article

Ranking and Tuning Pre-trained Models: A New Paradigm for Exploiting Model Hubs

  • Kaichao You
  • Yong Liu
  • Ziyang Zhang
  • Jianmin Wang
  • Michael I. Jordan
  • Mingsheng Long

Model hubs with many pre-trained models (PTMs) have become a cornerstone of deep learning. Although built at a high cost, they remain under-exploited---practitioners usually pick one PTM from the provided model hub by popularity and then fine-tune the PTM to solve the target task. This naïve but common practice poses two obstacles to full exploitation of pre-trained model hubs: first, the PTM selection by popularity has no optimality guarantee, and second, only one PTM is used while the remaining PTMs are ignored. An alternative might be to consider all possible combinations of PTMs and extensively fine-tune each combination, but this would not only be prohibitive computationally but may also lead to statistical over-fitting. In this paper, we propose a new paradigm for exploiting model hubs that is intermediate between these extremes. The paradigm is characterized by two aspects: (1) We use an evidence maximization procedure to estimate the maximum value of label evidence given features extracted by pre-trained models. This procedure can rank all the PTMs in a model hub for various types of PTMs and tasks before fine-tuning. (2) The best ranked PTM can either be fine-tuned and deployed if we have no preference for the model's architecture or the target PTM can be tuned by the top $K$ ranked PTMs via a Bayesian procedure that we propose. This procedure, which we refer to as B-Tuning, not only improves upon specialized methods designed for tuning homogeneous PTMs, but also applies to the challenging problem of tuning heterogeneous PTMs where it yields a new level of benchmark performance.

NeurIPS Conference 2022 Conference Paper

Supported Policy Optimization for Offline Reinforcement Learning

  • Jialong Wu
  • Haixu Wu
  • Zihan Qiu
  • Jianmin Wang
  • Mingsheng Long

Policy constraint methods to offline reinforcement learning (RL) typically utilize parameterization or regularization that constrains the policy to perform actions within the support set of the behavior policy. The elaborative designs of parameterization methods usually intrude into the policy networks, which may bring extra inference cost and cannot take full advantage of well-established online methods. Regularization methods reduce the divergence between the learned policy and the behavior policy, which may mismatch the inherent density-based definition of support set thereby failing to avoid the out-of-distribution actions effectively. This paper presents Supported Policy OpTimization (SPOT), which is directly derived from the theoretical formalization of the density-based support constraint. SPOT adopts a VAE-based density estimator to explicitly model the support set of behavior policy and presents a simple but effective density-based regularization term, which can be plugged non-intrusively into off-the-shelf off-policy RL algorithms. SPOT achieves the state-of-the-art performance on standard benchmarks for offline RL. Benefiting from the pluggable design, offline pretrained models from SPOT can also be applied to perform online fine-tuning seamlessly.

NeurIPS Conference 2021 Conference Paper

Autoformer: Decomposition Transformers with Auto-Correlation for Long-Term Series Forecasting

  • Haixu Wu
  • Jiehui Xu
  • Jianmin Wang
  • Mingsheng Long

Extending the forecasting time is a critical demand for real applications, such as extreme weather early warning and long-term energy consumption planning. This paper studies the long-term forecasting problem of time series. Prior Transformer-based models adopt various self-attention mechanisms to discover the long-range dependencies. However, intricate temporal patterns of the long-term future prohibit the model from finding reliable dependencies. Also, Transformers have to adopt the sparse versions of point-wise self-attentions for long series efficiency, resulting in the information utilization bottleneck. Going beyond Transformers, we design Autoformer as a novel decomposition architecture with an Auto-Correlation mechanism. We break with the pre-processing convention of series decomposition and renovate it as a basic inner block of deep models. This design empowers Autoformer with progressive decomposition capacities for complex time series. Further, inspired by the stochastic process theory, we design the Auto-Correlation mechanism based on the series periodicity, which conducts dependency discovery and representation aggregation at the sub-series level. Auto-Correlation outperforms self-attention in both efficiency and accuracy. In long-term forecasting, Autoformer yields state-of-the-art accuracy, with a 38% relative improvement on six benchmarks, covering five practical applications: energy, traffic, economics, weather and disease. Code is available at this repository: https://github.com/thuml/Autoformer.
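
The series-decomposition block that Autoformer builds into its architecture is, at its core, a moving average: the smoothed series is the trend, and the residual is the seasonal part. A toy version (edge replication for padding; the kernel size is a hyperparameter):

```python
# Moving-average series decomposition: trend = smoothed, seasonal = residual.
def decompose(x, kernel=3):
    half = kernel // 2
    padded = [x[0]] * half + list(x) + [x[-1]] * half   # replicate edges
    trend = [sum(padded[i:i + kernel]) / kernel for i in range(len(x))]
    seasonal = [v - t for v, t in zip(x, trend)]
    return trend, seasonal

trend, seasonal = decompose([1.0, 2.0, 3.0, 4.0])
print(trend, seasonal)   # linear series: trend tracks it, seasonal ~ 0 inside
```

Applying this block repeatedly between layers, rather than once as pre-processing, is the "progressive decomposition" the abstract describes.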

NeurIPS Conference 2021 Conference Paper

Cycle Self-Training for Domain Adaptation

  • Hong Liu
  • Jianmin Wang
  • Mingsheng Long

Mainstream approaches for unsupervised domain adaptation (UDA) learn domain-invariant representations to narrow the domain shift, which are empirically effective but theoretically challenged by the hardness or impossibility theorems. Recently, self-training has been gaining momentum in UDA, which exploits unlabeled target data by training with target pseudo-labels. However, as corroborated in this work, under distributional shift, the pseudo-labels can be unreliable in terms of their large discrepancy from target ground truth. In this paper, we propose Cycle Self-Training (CST), a principled self-training algorithm that explicitly enforces pseudo-labels to generalize across domains. CST cycles between a forward step and a reverse step until convergence. In the forward step, CST generates target pseudo-labels with a source-trained classifier. In the reverse step, CST trains a target classifier using target pseudo-labels, and then updates the shared representations to make the target classifier perform well on the source data. We introduce the Tsallis entropy as a confidence-friendly regularization to improve the quality of target pseudo-labels. We analyze CST theoretically under realistic assumptions, and provide hard cases where CST recovers target ground truth, while both invariant feature learning and vanilla self-training fail. Empirical results indicate that CST significantly improves over the state-of-the-arts on visual recognition and sentiment analysis benchmarks.
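
The Tsallis entropy used as the confidence regularizer has the generic form $S_q(p) = \bigl(1 - \sum_i p_i^q\bigr)/(q - 1)$, recovering Shannon entropy as $q \to 1$; the paper may fix a particular entropic index $q$, which the abstract does not state.

```python
# Tsallis entropy of a probability vector p with entropic index q.
def tsallis_entropy(p, q=2.0):
    return (1.0 - sum(pi ** q for pi in p)) / (q - 1.0)

print(tsallis_entropy([0.5, 0.5]))   # maximal uncertainty -> largest entropy
print(tsallis_entropy([1.0, 0.0]))   # fully confident prediction -> zero
```

Minimizing this quantity on target predictions pushes them toward confident pseudo-labels while, for $q > 1$, penalizing overconfidence less aggressively than Shannon entropy does, which is why it is described as confidence-friendly.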

NeurIPS Conference 2020 Conference Paper

Co-Tuning for Transfer Learning

  • Kaichao You
  • Zhi Kou
  • Mingsheng Long
  • Jianmin Wang

Fine-tuning pre-trained deep neural networks (DNNs) to a target dataset, also known as transfer learning, is widely used in computer vision and NLP. Because task-specific layers mainly contain categorical information and categories vary with datasets, practitioners only \textit{partially} transfer pre-trained models by discarding task-specific layers and fine-tuning bottom layers. However, it is a reckless loss to simply discard task-specific parameters, which take up as much as $20\%$ of the total parameters in pre-trained models. To \textit{fully} transfer pre-trained models, we propose a two-step framework named \textbf{Co-Tuning}: (i) learn the relationship between source categories and target categories from the pre-trained model and calibrated predictions; (ii) target labels (one-hot labels), as well as source labels (probabilistic labels) translated by the category relationship, collaboratively supervise the fine-tuning process. A simple instantiation of the framework shows strong empirical results in four visual classification tasks and one NLP classification task, bringing up to $20\%$ relative improvement. While state-of-the-art fine-tuning techniques mainly focus on how to impose regularization when data are not abundant, Co-Tuning works not only in medium-scale datasets (100 samples per class) but also in large-scale datasets (1000 samples per class) where regularization-based methods bring no gains over vanilla fine-tuning. Co-Tuning relies on a typically valid assumption that the pre-trained dataset is diverse enough, implying its broad application area.
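The label-translation part of step (ii) can be sketched as follows. The relationship matrix and function name are illustrative assumptions: rows are indexed by target classes, columns by source classes, and each row is a probability distribution learned in step (i).

```python
import numpy as np

def co_tuning_targets(y_target, relationship):
    """Toy sketch of Co-Tuning's label translation: each hard target
    label yields (a) its one-hot target supervision and (b) a
    probabilistic source label read off the category-relationship
    matrix; both then supervise fine-tuning collaboratively."""
    onehot = np.eye(relationship.shape[0])[y_target]   # target supervision
    soft_source = relationship[y_target]               # translated source supervision
    return onehot, soft_source
```

In training, the one-hot targets supervise the new task head while the translated probabilistic labels supervise the retained pre-trained task head, so no task-specific parameters are discarded.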

NeurIPS Conference 2020 Conference Paper

Learning to Adapt to Evolving Domains

  • Hong Liu
  • Mingsheng Long
  • Jianmin Wang
  • Yu Wang

Domain adaptation aims at knowledge transfer from a labeled source domain to an unlabeled target domain. Current domain adaptation methods have made substantial advances in adapting discrete domains. However, this can be unrealistic in real-world applications, where target data usually comes in an online and continually evolving manner as small batches, posing challenges to the classic domain adaptation paradigm: (1) Mainstream domain adaptation methods are tailored to stationary target domains, and can fail in non-stationary environments. (2) Since the target data arrive online, the agent should also maintain competence on previous target domains, i.e., adapt without forgetting. To tackle these challenges, we propose a meta-adaptation framework which enables the learner to adapt to a continually evolving target domain without catastrophic forgetting. Our framework comprises two components: a meta-objective of learning representations to adapt to evolving domains, enabling meta-learning for unsupervised domain adaptation; and a meta-adapter for learning to adapt without forgetting, preserving knowledge from previous target data. Experiments validate the effectiveness of our method on evolving target domains.

AAAI Conference 2020 Conference Paper

Simultaneous Learning of Pivots and Representations for Cross-Domain Sentiment Classification

  • Liang Li
  • Weirui Ye
  • Mingsheng Long
  • Yateng Tang
  • Jin Xu
  • Jianmin Wang

Cross-domain sentiment classification aims to leverage useful knowledge from a source domain to mitigate the supervision sparsity in a target domain. A series of approaches depend on the pivot features that behave similarly for polarity prediction in both domains. However, the engineering of such pivot features remains cumbersome and prevents us from learning the disentangled and transferable representations from rich semantic and syntactic information. Towards learning the pivots and representations simultaneously, we propose a new Transferable Pivot Transformer (TPT). Our model consists of two networks: a Pivot Selector that learns to detect transferable n-gram pivots from contexts, and a Transferable Transformer that learns to generate domain-invariant representations by modeling the correlation between pivot and non-pivot words. The Pivot Selector and Transferable Transformer are jointly optimized through end-to-end back-propagation. We experiment with real tasks of cross-domain sentiment classification over 20 domain pairs, where our model outperforms prior art.

NeurIPS Conference 2020 Conference Paper

Stochastic Normalization

  • Zhi Kou
  • Kaichao You
  • Mingsheng Long
  • Jianmin Wang

Fine-tuning pre-trained deep networks on a small dataset is an important component in the deep learning pipeline. A critical problem in fine-tuning is how to avoid over-fitting when data are limited. Existing efforts work from two aspects: (1) impose regularization on parameters or features; (2) transfer prior knowledge to fine-tuning by reusing pre-trained parameters. In this paper, we take an alternative approach by refactoring the widely used Batch Normalization (BN) module to mitigate over-fitting. We propose a two-branch design with one branch normalized by mini-batch statistics and the other branch normalized by moving statistics. During training, the two branches are stochastically selected to avoid over-depending on some sample statistics, resulting in a strong regularization effect, which we interpret as ``architecture regularization''. The resulting method is dubbed stochastic normalization (\textbf{StochNorm}). With the two-branch architecture, it naturally incorporates pre-trained moving statistics in BN layers during fine-tuning, exploiting more prior knowledge of pre-trained networks. Extensive empirical experiments show that StochNorm is a powerful tool to avoid over-fitting in fine-tuning with small datasets. Besides, StochNorm is readily pluggable into modern CNN backbones. It is complementary to other fine-tuning methods and can work together with them to achieve a stronger regularization effect.
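A minimal per-feature sketch of the two-branch idea. The function name, the per-feature granularity of the Bernoulli switch, and the omission of BN's learnable scale/shift are my simplifying assumptions:

```python
import numpy as np

def stoch_norm(x, moving_mean, moving_var, p=0.5, training=True,
               eps=1e-5, rng=None):
    """Toy StochNorm-style normalization for a (batch, features) array:
    one branch normalizes with mini-batch statistics, the other with
    (pre-trained) moving statistics; during training each feature
    stochastically picks a branch with probability p."""
    if not training:
        # Inference behaves like standard BN with moving statistics
        return (x - moving_mean) / np.sqrt(moving_var + eps)
    if rng is None:
        rng = np.random.default_rng()
    batch_mean, batch_var = x.mean(axis=0), x.var(axis=0)
    # Per-feature Bernoulli switch between the two branches
    s = rng.random(x.shape[1]) < p
    mean = np.where(s, moving_mean, batch_mean)
    var = np.where(s, moving_var, batch_var)
    return (x - mean) / np.sqrt(var + eps)
```

Setting `p=0` recovers plain batch-statistics normalization, while `p>0` mixes in the pre-trained moving statistics, which is where the regularization and prior-knowledge reuse come from.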

NeurIPS Conference 2020 Conference Paper

Transferable Calibration with Lower Bias and Variance in Domain Adaptation

  • Ximei Wang
  • Mingsheng Long
  • Jianmin Wang
  • Michael Jordan

Domain Adaptation (DA) enables transferring a learning machine from a labeled source domain to an unlabeled target one. While remarkable advances have been made, most of the existing DA methods focus on improving the target accuracy at inference. How to estimate the predictive uncertainty of DA models is vital for decision-making in safety-critical scenarios but remains largely unexplored. In this paper, we delve into the open problem of Calibration in DA, which is extremely challenging due to the coexistence of domain shift and the lack of target labels. We first reveal the dilemma that DA models learn higher accuracy at the expense of well-calibrated probabilities. Driven by this finding, we propose Transferable Calibration (TransCal) to achieve more accurate calibration with lower bias and variance in a unified hyperparameter-free optimization framework. As a general post-hoc calibration method, TransCal can be easily applied to recalibrate existing DA methods. Its efficacy has been justified both theoretically and empirically.

NeurIPS Conference 2019 Conference Paper

Catastrophic Forgetting Meets Negative Transfer: Batch Spectral Shrinkage for Safe Transfer Learning

  • Xinyang Chen
  • Sinan Wang
  • Bo Fu
  • Mingsheng Long
  • Jianmin Wang

When sufficient training data is unavailable, fine-tuning neural networks pre-trained on large-scale datasets substantially outperforms training from random initialization. However, fine-tuning methods suffer from two dilemmas: catastrophic forgetting and negative transfer. While several methods with explicit attempts to overcome catastrophic forgetting have been proposed, negative transfer has rarely been studied. In this paper, we launch an in-depth empirical investigation into negative transfer in fine-tuning and find that, for the weight parameters and feature representations, the transferability of their spectral components is diverse. For safe transfer learning, we present Batch Spectral Shrinkage (BSS), a novel regularization approach that penalizes smaller singular values so that untransferable spectral components are suppressed. BSS is orthogonal to existing fine-tuning methods and is readily pluggable into them. Experimental results show that BSS can significantly enhance the performance of representative methods, especially with limited training data.
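The penalty itself is compact. Below is a sketch under the assumption that it takes the sum of squares of the k smallest singular values of the batch feature matrix; the function name and this exact form are illustrative, not quoted from the paper.

```python
import numpy as np

def batch_spectral_shrinkage(features, k=1):
    """Toy BSS-style penalty: sum of squares of the k smallest singular
    values of a (batch, dim) feature matrix. Added to the task loss, it
    shrinks the weakest (least transferable) spectral components."""
    s = np.linalg.svd(features, compute_uv=False)  # returned in descending order
    return float((s[-k:] ** 2).sum())
```

In a training loop this term would be weighted and added to the classification loss, so gradients push the smallest singular values toward zero while leaving the dominant, transferable components intact.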

AAAI Conference 2019 Conference Paper

Transferable Attention for Domain Adaptation

  • Ximei Wang
  • Liang Li
  • Weirui Ye
  • Mingsheng Long
  • Jianmin Wang

Recent work in domain adaptation bridges different domains by adversarially learning a domain-invariant representation that cannot be distinguished by a domain discriminator. Existing methods of adversarial domain adaptation mainly align the global images across the source and target domains. However, it is obvious that not all regions of an image are transferable, while forcefully aligning the untransferable regions may lead to negative transfer. Furthermore, some of the images are significantly dissimilar across domains, resulting in weak image-level transferability. To this end, we present Transferable Attention for Domain Adaptation (TADA), focusing our adaptation model on transferable regions or images. We implement two types of complementary transferable attention: transferable local attention generated by multiple region-level domain discriminators to highlight transferable regions, and transferable global attention generated by a single image-level domain discriminator to highlight transferable images. Extensive experiments validate that our proposed models exceed state-of-the-art results on standard domain adaptation datasets.

AAAI Conference 2019 Conference Paper

Transferable Curriculum for Weakly-Supervised Domain Adaptation

  • Yang Shu
  • Zhangjie Cao
  • Mingsheng Long
  • Jianmin Wang

Domain adaptation improves a target task by knowledge transfer from a source domain with rich annotations. It is not uncommon that “source-domain engineering” becomes a cumbersome process in domain adaptation: high-quality source domains highly related to the target domain are hardly available. Thus, weakly-supervised domain adaptation has been introduced to address this difficulty, where we can tolerate a source domain with noises in labels, features, or both. As such, for a particular target task, we simply collect a source domain with coarse labeling or corrupted data. In this paper, we try to address two entangled challenges of weakly-supervised domain adaptation: sample noises of the source domain and distribution shift across domains. To disentangle these challenges, a Transferable Curriculum Learning (TCL) approach is proposed to train the deep networks, guided by a transferable curriculum informing which of the source examples are noiseless and transferable. The approach enhances positive transfer from clean source examples to the target and mitigates negative transfer of noisy source examples. A thorough evaluation shows that our approach significantly outperforms the state-of-the-art on weakly-supervised domain adaptation tasks.

NeurIPS Conference 2019 Conference Paper

Transferable Normalization: Towards Improving Transferability of Deep Neural Networks

  • Ximei Wang
  • Ying Jin
  • Mingsheng Long
  • Jianmin Wang
  • Michael Jordan

Deep neural networks (DNNs) excel at learning representations when trained on large-scale datasets. Pre-trained DNNs also show strong transferability when fine-tuned to other labeled datasets. However, such transferability becomes weak when the target dataset is fully unlabeled as in Unsupervised Domain Adaptation (UDA). We envision that the loss of transferability may stem from the intrinsic limitation of the architecture design of DNNs. In this paper, we delve into the components of DNN architectures and propose Transferable Normalization (TransNorm) in place of existing normalization techniques. TransNorm is an end-to-end trainable layer to make DNNs more transferable across domains. As a general method, TransNorm can be easily applied to various deep neural networks and domain adaptation methods, without introducing any extra hyper-parameters or learnable parameters. Empirical results justify that TransNorm not only improves classification accuracies but also accelerates convergence for mainstream DNN-based domain adaptation methods.

NeurIPS Conference 2018 Conference Paper

Conditional Adversarial Domain Adaptation

  • Mingsheng Long
  • Zhangjie Cao
  • Jianmin Wang
  • Michael Jordan

Adversarial learning has been embedded into deep networks to learn disentangled and transferable representations for domain adaptation. Existing adversarial domain adaptation methods may struggle to align different domains of multimodal distributions that are native in classification problems. In this paper, we present conditional adversarial domain adaptation, a principled framework that conditions the adversarial adaptation models on discriminative information conveyed in the classifier predictions. Conditional domain adversarial networks (CDANs) are designed with two novel conditioning strategies: multilinear conditioning that captures the cross-covariance between feature representations and classifier predictions to improve the discriminability, and entropy conditioning that controls the uncertainty of classifier predictions to guarantee the transferability. Experiments testify that the proposed approach exceeds the state-of-the-art results on five benchmark datasets.

NeurIPS Conference 2018 Conference Paper

Generalized Zero-Shot Learning with Deep Calibration Network

  • Shichen Liu
  • Mingsheng Long
  • Jianmin Wang
  • Michael Jordan

A technical challenge of deep learning is recognizing target classes without seen data. Zero-shot learning leverages semantic representations such as attributes or class prototypes to bridge source and target classes. Existing standard zero-shot learning methods may be prone to overfitting the seen data of source classes as they are blind to the semantic representations of target classes. In this paper, we study generalized zero-shot learning, which assumes that the semantic representations of target classes are accessible during training and makes predictions on unseen data by searching over both source and target classes. We propose a novel Deep Calibration Network (DCN) approach towards this generalized zero-shot learning paradigm, which enables simultaneous calibration of deep networks on the confidence of source classes and uncertainty of target classes. Our approach maps visual features of images and semantic representations of class prototypes to a common embedding space such that the compatibility of seen data to both source and target classes is maximized. We show superior accuracy of our approach over the state of the art on benchmark datasets for generalized zero-shot learning, including AwA, CUB, SUN, and aPY.

AAAI Conference 2018 Conference Paper

Multi-Adversarial Domain Adaptation

  • Zhongyi Pei
  • Zhangjie Cao
  • Mingsheng Long
  • Jianmin Wang

Recent advances in deep domain adaptation reveal that adversarial learning can be embedded into deep networks to learn transferable features that reduce distribution discrepancy between the source and target domains. Existing domain adversarial adaptation methods based on a single domain discriminator only align the source and target data distributions without exploiting the complex multimode structures. In this paper, we present a multi-adversarial domain adaptation (MADA) approach, which captures multimode structures to enable fine-grained alignment of different data distributions based on multiple domain discriminators. The adaptation can be achieved by stochastic gradient descent with the gradients computed by back-propagation in linear time. Empirical evidence demonstrates that the proposed model outperforms state-of-the-art methods on standard domain adaptation datasets.

IJCAI Conference 2018 Conference Paper

PredCNN: Predictive Learning with Cascade Convolutions

  • Ziru Xu
  • Yunbo Wang
  • Mingsheng Long
  • Jianmin Wang

Predicting future frames in videos remains a challenging, unsolved problem. Mainstream recurrent models suffer from huge memory usage and computation cost, while convolutional models are unable to effectively capture the temporal dependencies between consecutive video frames. To tackle this problem, we introduce an entirely CNN-based architecture, PredCNN, that models the dependencies between the next frame and the sequential video inputs. Inspired by the core idea of recurrent models that previous states have more transition operations than future states, we design a cascade multiplicative unit (CMU) that provides relatively more operations for previous video frames. This newly proposed unit enables PredCNN to predict future spatiotemporal data without any recurrent chain structures, which eases gradient propagation and enables a fully parallelized optimization. We show that PredCNN outperforms the state-of-the-art recurrent models for video prediction on the standard Moving MNIST dataset and two challenging crowd flow prediction datasets, and achieves a faster training speed and lower memory footprint.

AAAI Conference 2018 Conference Paper

Transfer Adversarial Hashing for Hamming Space Retrieval

  • Zhangjie Cao
  • Mingsheng Long
  • Chao Huang
  • Jianmin Wang

Hashing is widely applied to large-scale image retrieval due to its storage and retrieval efficiency. Existing work on deep hashing assumes that the database in the target domain is identically distributed with the training set in the source domain. This paper relaxes this assumption to a transfer retrieval setting, which allows the database and the training set to come from different but relevant domains. However, the transfer retrieval setting introduces two technical difficulties: first, the hash model trained on the source domain cannot work well on the target domain due to the large distribution gap; second, the domain gap makes it difficult to concentrate the database points to be within a small Hamming ball. As a consequence, transfer retrieval performance within Hamming radius 2 degrades significantly in existing hashing methods. This paper presents Transfer Adversarial Hashing (TAH), a new hybrid deep architecture that incorporates a pairwise t-distribution cross-entropy loss to learn concentrated hash codes and an adversarial network to align the data distributions between the source and target domains. TAH can generate compact transfer hash codes for efficient image retrieval on both source and target domains. Comprehensive experiments validate that TAH yields state-of-the-art Hamming space retrieval performance on standard datasets.

AAAI Conference 2018 Conference Paper

Unsupervised Domain Adaptation With Distribution Matching Machines

  • Yue Cao
  • Mingsheng Long
  • Jianmin Wang

Domain adaptation generalizes a learning model across a source domain and a target domain that follow different distributions. Most existing work follows a two-step procedure: first, it explores either feature matching or instance reweighting independently; second, it trains the transfer classifier separately. In this paper, we show that either feature matching or instance reweighting can only reduce, but not remove, the cross-domain discrepancy, and the knowledge hidden in the relations between the data labels from the source and target domains is important for unsupervised domain adaptation. We propose a new Distribution Matching Machine (DMM) based on the structural risk minimization principle, which learns a transfer support vector machine by extracting invariant feature representations and estimating unbiased instance weights that jointly minimize the cross-domain distribution discrepancy. This leads to a robust transfer learner that performs well against both mismatched features and irrelevant instances. Our theoretical analysis proves that the proposed approach further reduces the generalization error bound of related domain adaptation methods. Comprehensive experiments validate that the DMM approach significantly outperforms competitive methods on standard domain adaptation benchmarks.

AAAI Conference 2017 Conference Paper

Collective Deep Quantization for Efficient Cross-Modal Retrieval

  • Yue Cao
  • Mingsheng Long
  • Jianmin Wang
  • Shichen Liu

Cross-modal similarity retrieval is a problem about designing a retrieval system that supports querying across content modalities, e.g., using an image to retrieve texts. This paper presents a compact coding solution for efficient cross-modal retrieval, with a focus on the quantization approach which has already shown superior performance over the hashing solutions in single-modal similarity retrieval. We propose a collective deep quantization (CDQ) approach, which is the first attempt to introduce quantization in an end-to-end deep architecture for cross-modal retrieval. The major contribution lies in jointly learning deep representations and the quantizers for both modalities using carefully-crafted hybrid networks and well-specified loss functions. In addition, our approach simultaneously learns the common quantizer codebook for both modalities, through which the cross-modal correlation can be substantially enhanced. CDQ enables efficient and effective cross-modal retrieval using inner product distance computed based on the common codebook with fast distance table lookup. Extensive experiments show that CDQ yields state-of-the-art cross-modal retrieval results on standard benchmarks.

NeurIPS Conference 2017 Conference Paper

Learning Multiple Tasks with Multilinear Relationship Networks

  • Mingsheng Long
  • Zhangjie Cao
  • Jianmin Wang
  • Philip Yu

Deep networks trained on large-scale data can learn transferable features to promote learning multiple tasks. Since deep features eventually transition from general to specific along deep networks, a fundamental problem of multi-task learning is how to exploit the task relatedness underlying parameter tensors and improve feature transferability in the multiple task-specific layers. This paper presents Multilinear Relationship Networks (MRN) that discover the task relationships based on novel tensor normal priors over parameter tensors of multiple task-specific layers in deep convolutional networks. By jointly learning transferable features and multilinear relationships of tasks and features, MRN is able to alleviate the dilemma of negative-transfer in the feature layers and under-transfer in the classifier layer. Experiments show that MRN yields state-of-the-art results on three multi-task learning datasets.

NeurIPS Conference 2017 Conference Paper

PredRNN: Recurrent Neural Networks for Predictive Learning using Spatiotemporal LSTMs

  • Yunbo Wang
  • Mingsheng Long
  • Jianmin Wang
  • Zhifeng Gao
  • Philip Yu

The predictive learning of spatiotemporal sequences aims to generate future images by learning from the historical frames, where spatial appearances and temporal variations are two crucial structures. This paper models these structures by presenting a predictive recurrent neural network (PredRNN). This architecture is inspired by the idea that spatiotemporal predictive learning should memorize both spatial appearances and temporal variations in a unified memory pool. Concretely, memory states are no longer constrained inside each LSTM unit. Instead, they are allowed to zigzag in two directions: across stacked RNN layers vertically and through all RNN states horizontally. The core of this network is a new Spatiotemporal LSTM (ST-LSTM) unit that extracts and memorizes spatial and temporal representations simultaneously. PredRNN achieves the state-of-the-art prediction performance on three video prediction datasets and is a more general framework that can be easily extended to other predictive learning tasks by integrating with other architectures.

AAAI Conference 2017 Conference Paper

Transitive Hashing Network for Heterogeneous Multimedia Retrieval

  • Zhangjie Cao
  • Mingsheng Long
  • Jianmin Wang
  • Qiang Yang

Hashing is widely applied to large-scale multimedia retrieval due to its storage and retrieval efficiency. Cross-modal hashing enables efficient retrieval of one modality from a database relevant to a query of another modality. Existing work on cross-modal hashing assumes that a heterogeneous relationship across modalities is available for learning to hash. This paper relaxes this strict assumption by only requiring a heterogeneous relationship in some auxiliary dataset different from the query or database domain. We design a novel hybrid deep architecture, transitive hashing network (THN), to jointly learn cross-modal correlation from the auxiliary dataset, and align the data distributions of the auxiliary dataset with that of the query or database domain, which generates compact transitive hash codes for efficient cross-modal retrieval. Comprehensive empirical evidence validates that the proposed THN approach yields state-of-the-art retrieval performance on standard multimedia benchmarks, i.e., NUS-WIDE and ImageNet-YahooQA.

AAAI Conference 2016 Conference Paper

Deep Hashing Network for Efficient Similarity Retrieval

  • Han Zhu
  • Mingsheng Long
  • Jianmin Wang
  • Yue Cao

Due to its storage and retrieval efficiency, hashing has been widely deployed to approximate nearest neighbor search for large-scale multimedia retrieval. Supervised hashing, which improves the quality of hash coding by exploiting the semantic similarity on data pairs, has received increasing attention recently. For most existing supervised hashing methods for image retrieval, an image is first represented as a vector of hand-crafted or machine-learned features, followed by another separate quantization step that generates binary codes. However, suboptimal hash coding may be produced, because the quantization error is not statistically minimized and the feature representation is not optimally compatible with the binary coding. In this paper, we propose a novel Deep Hashing Network (DHN) architecture for supervised hashing, in which we jointly learn good image representation tailored to hash coding and formally control the quantization error. The DHN model constitutes four key components: (1) a subnetwork with multiple convolution-pooling layers to capture image representations; (2) a fully-connected hashing layer to generate compact binary hash codes; (3) a pairwise cross-entropy loss layer for similarity-preserving learning; and (4) a pairwise quantization loss for controlling hashing quality. Extensive experiments on standard image retrieval datasets show that the proposed DHN model yields substantial boosts over the latest state-of-the-art hashing methods.
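A rough sketch of loss terms (3) and (4) for a single pair of continuous codes. The logistic pairwise form and the L2 quantization penalty below are common stand-ins and may differ from the paper's exact formulation; `similar` is assumed to be 1 for a similar pair and 0 otherwise.

```python
import numpy as np

def dhn_losses(h_i, h_j, similar, lam=0.1):
    """Toy sketch of two DHN-style loss terms for continuous hash
    outputs h_i, h_j in R^K: a pairwise logistic cross-entropy on their
    inner product (similarity-preserving learning), plus a quantization
    penalty pushing each code toward a binary {-1, +1} vector."""
    s = float(np.dot(h_i, h_j))
    # Numerically stable log(1 + e^s) - similar * s
    pair_loss = np.log1p(np.exp(-abs(s))) + max(s, 0.0) - similar * s
    # Quantization loss: squared distance of codes from the nearest binary vector
    quant_loss = ((np.abs(h_i) - 1) ** 2).sum() + ((np.abs(h_j) - 1) ** 2).sum()
    return pair_loss + lam * quant_loss
```

For already-binary codes the quantization term vanishes, so the remaining loss simply rewards a large inner product for similar pairs and a small one for dissimilar pairs.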

AAAI Conference 2016 Conference Paper

Deep Quantization Network for Efficient Image Retrieval

  • Yue Cao
  • Mingsheng Long
  • Jianmin Wang
  • Han Zhu
  • Qingfu Wen

Hashing has been widely applied to approximate nearest neighbor search for large-scale multimedia retrieval. Supervised hashing improves the quality of hash coding by exploiting the semantic similarity on data pairs and has received increasing attention recently. For most existing supervised hashing methods for image retrieval, an image is first represented as a vector of hand-crafted or machine-learned features, then quantized by a separate quantization step that generates binary codes. However, suboptimal hash coding may be produced, since the quantization error is not statistically minimized and the feature representation is not optimally compatible with the hash coding. In this paper, we propose a novel Deep Quantization Network (DQN) architecture for supervised hashing, which learns image representations for hash coding and formally controls the quantization error. The DQN model constitutes four key components: (1) a sub-network with multiple convolution-pooling layers to capture deep image representations; (2) a fully connected bottleneck layer to generate a dimension-reduced representation optimal for hash coding; (3) a pairwise cosine loss layer for similarity-preserving learning; and (4) a product quantization loss for controlling hashing quality and the quantizability of the bottleneck representation. Extensive experiments on standard image retrieval datasets show that the proposed DQN model yields substantial boosts over the latest state-of-the-art hashing methods.

IJCAI Conference 2016 Conference Paper

Semi-Supervised Active Learning with Cross-Class Sample Transfer

  • Yuchen Guo
  • Guiguang Ding
  • Yue Gao
  • Jianmin Wang

To save the labeling effort for training a classification model, we can simultaneously adopt Active Learning (AL) to select the most informative samples for human labeling, and Semi-supervised Learning (SSL) to construct effective classifiers using a few labeled samples and a large number of unlabeled samples. Recently, using Transfer Learning (TL) to enhance AL and SSL, i.e., T-SS-AL, has gained considerable attention. However, existing T-SS-AL methods mostly focus on the situation where the source domain and the target domain share the same classes. In this paper, we consider a more practical and challenging setting where the source domain and the target domain have different but related classes. We propose a novel cross-class sample transfer based T-SS-AL method, called CC-SS-AL, to exploit the information from the source domain. Our key idea is to select samples from the source domain which are very similar to the target domain classes and assign pseudo labels to them for classifier training. Extensive experiments on three datasets verify the efficacy of the proposed method.

AAAI Conference 2016 Conference Paper

Transductive Zero-Shot Recognition via Shared Model Space Learning

  • Yuchen Guo
  • Guiguang Ding
  • Xiaoming Jin
  • Jianmin Wang

Zero-shot Recognition (ZSR) is to learn recognition models for novel classes without labeled data. It is a challenging task and has drawn considerable attention in recent years. The basic idea is to transfer knowledge from seen classes via the shared attributes. This paper focuses on transductive ZSR, i.e., the setting where we have unlabeled data for novel classes. Instead of learning models for seen and novel classes separately as in existing works, we put forward a novel joint learning approach which learns the shared model space (SMS) for models such that knowledge can be effectively transferred between classes using the attributes. An effective algorithm is proposed for optimization. We conduct comprehensive experiments on three benchmark datasets for ZSR. The results demonstrate that the proposed SMS can significantly outperform state-of-the-art related approaches, which validates its efficacy for the ZSR task.

NeurIPS Conference 2016 Conference Paper

Unsupervised Domain Adaptation with Residual Transfer Networks

  • Mingsheng Long
  • Han Zhu
  • Jianmin Wang
  • Michael Jordan

The recent success of deep neural networks relies on massive amounts of labeled data. For a target task where labeled data is unavailable, domain adaptation can transfer a learner from a different source domain. In this paper, we propose a new approach to domain adaptation in deep networks that can jointly learn adaptive classifiers and transferable features from labeled data in the source domain and unlabeled data in the target domain. We relax a shared-classifier assumption made by previous methods and assume that the source classifier and target classifier differ by a residual function. We enable classifier adaptation by plugging several layers into the deep network to explicitly learn the residual function with reference to the target classifier. We fuse features of multiple layers with tensor product and embed them into reproducing kernel Hilbert spaces to match distributions for feature adaptation. The adaptation can be achieved in most feed-forward models by extending them with new residual layers and loss functions, which can be trained efficiently via back-propagation. Empirical evidence shows that the new approach outperforms state-of-the-art methods on standard domain adaptation benchmarks.

AAAI Conference 2015 Conference Paper

A Personalized Interest-Forgetting Markov Model for Recommendations

  • Jun Chen
  • Chaokun Wang
  • Jianmin Wang

Intelligent item recommendation is a key issue in AI research which enables recommender systems to be more “human-minded” when generating recommendations. However, one major characteristic of human memory, forgetting, has barely been discussed with regard to recommender systems. In this paper, we considered people’s forgetting of interest when performing personalized recommendations, and brought forward a personalized framework to integrate the interest-forgetting property with a Markov model. Multiple implementations of the framework were investigated and compared. The experimental evaluation showed that our methods could significantly improve the accuracy of item recommendation, which verified the importance of considering interest forgetting in recommendations.
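One plausible reading of "integrating interest-forgetting with a Markov model" is to damp each past item's transition probabilities by how long ago it was consumed. The exponential decay below and the function names are hypothetical, one of many possible implementations rather than the paper's specific choice.

```python
import numpy as np

def forgetting_weight(dt, lam=0.5):
    # Exponential forgetting: interest tied to an item consumed dt
    # time units ago decays toward zero.
    return np.exp(-lam * dt)

def recommend_scores(history, P, lam=0.5):
    """Blend Markov transitions from past items, damped by forgetting.

    history: list of (item_index, elapsed_time) pairs
    P: Markov transition matrix, P[i, j] = Pr(next = j | last = i)
    """
    scores = np.zeros(P.shape[0])
    for item, dt in history:
        scores += forgetting_weight(dt, lam) * P[item]
    return scores / scores.sum()
```

A recently consumed item thus dominates the recommendation, while the influence of old consumptions fades, which is the "human-minded" behavior the abstract argues for.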

AAAI Conference 2015 Conference Paper

Learning Predictable and Discriminative Attributes for Visual Recognition

  • Yuchen Guo
  • Guiguang Ding
  • Xiaoming Jin
  • Jianmin Wang

Utilizing attributes for visual recognition has attracted increasing interest because attributes can effectively bridge the semantic gap between low-level visual features and high-level semantic labels. In this paper, we propose a novel method for learning predictable and discriminative attributes. Specifically, we require that the learned attributes can be reliably predicted from visual features and that they discover the inherent discriminative structure of the data. In addition, we propose to exploit the intra-category locality of data to overcome the intra-category variance in visual data. We conduct extensive experiments on the Animals with Attributes (AwA) and Caltech256 datasets, and the results demonstrate that the proposed method achieves state-of-the-art performance.
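The "predictable" requirement can be made concrete with a toy check: attributes are predictable when a regressor from visual features recovers them with low error. The ridge-regression proxy and the synthetic data below are illustrative assumptions; the paper's actual objective and its discriminative term are not reproduced here.

```python
import numpy as np

rng = np.random.default_rng(0)

# Predictability proxy: fit a ridge regressor from visual features X to
# attributes A and measure how much of A it recovers.
n, d, m = 100, 20, 5
X = rng.normal(size=(n, d))            # synthetic visual features
B_true = rng.normal(size=(d, m))
A = X @ B_true                         # synthetic attribute annotations

lam = 1e-3
B = np.linalg.solve(X.T @ X + lam * np.eye(d), X.T @ A)  # ridge fit
A_pred = X @ B
predictability = 1 - np.linalg.norm(A - A_pred) / np.linalg.norm(A)
```

Attributes that score poorly on such a check cannot be inferred for unseen images, which is why the paper couples predictability with discriminativeness rather than optimizing either alone.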

AAAI Conference 2015 Conference Paper

Will You “Reconsume” the Near Past? Fast Prediction on Short-Term Reconsumption Behaviors

  • Jun Chen
  • Chaokun Wang
  • Jianmin Wang

The short-term reconsumption behaviors, i.e., “reconsuming” the near past, account for a large proportion of people’s activities every day and everywhere. In this paper, we first derived four generic features which influence people’s short-term reconsumption behaviors. These features were extracted with respect to the different roles in the process of reconsumption, i.e., users, items and interactions. Then, we brought forward two fast algorithms with linear and quadratic kernels to predict whether a user will perform a short-term reconsumption at a specific time given the context. The experimental results show that our proposed algorithms are more accurate in the prediction tasks than the baselines. Meanwhile, the time complexity of online prediction in our algorithms is O(1), which enables fast prediction in real-world scenarios. Such prediction contributes to more intelligent decision-making, e.g., identification of potential returning customers, personalized recommendation, and information re-finding.
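The O(1) online-prediction claim is easy to see for the linear kernel: with a fixed number of features, each prediction is one fixed-size dot product. The sigmoid scorer below is a hedged sketch; the feature names and weights are placeholders, not the four features derived in the paper.

```python
import math

def predict_reconsume(features, weights, bias=0.0):
    # Linear-kernel scorer over a fixed-size feature vector: one dot
    # product plus a sigmoid, so every online prediction costs O(1).
    z = sum(w * f for w, f in zip(weights, features)) + bias
    return 1.0 / (1.0 + math.exp(-z))   # probability of reconsumption
```

A quadratic kernel would additionally score pairwise feature products, which is still a constant amount of work when the feature count is fixed at four.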

AAAI Conference 2012 Conference Paper

Transfer Learning with Graph Co-Regularization

  • Mingsheng Long
  • Jianmin Wang
  • Guiguang Ding
  • Dou Shen
  • Qiang Yang

Transfer learning proves to be effective for leveraging labeled data in the source domain to build an accurate classifier in the target domain. The basic assumption behind transfer learning is that the involved domains share some common latent factors. Previous methods usually explore these latent factors by optimizing two separate objective functions, i.e., either maximizing the empirical likelihood, or preserving the geometric structure. Actually, these two objective functions are complementary to each other and optimizing them simultaneously can make the solution smoother and further improve the accuracy of the final model. In this paper, we propose a novel approach called Graph co-regularized Transfer Learning (GTL) for this purpose, which integrates the two objective functions seamlessly into one unified optimization problem. Thereafter, we present an iterative algorithm for the optimization problem with rigorous analysis on convergence and complexity. Our empirical study on two open data sets validates that GTL can consistently improve the classification accuracy compared to the state-of-the-art transfer learning methods.
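The "one unified optimization problem" can be sketched as an empirical loss plus a graph co-regularization term, where the standard Laplacian quadratic form `tr(F^T L F)` penalizes latent factors that disagree across connected points. The specific weighting `mu` and the function names are illustrative assumptions, not GTL's exact objective.

```python
import numpy as np

def graph_laplacian(W):
    # Unnormalised graph Laplacian L = D - W of an affinity matrix W.
    return np.diag(W.sum(axis=1)) - W

def joint_objective(empirical_loss, F, W, mu=0.1):
    # One unified objective instead of two separate ones: the empirical
    # loss plus mu * tr(F^T L F), which is small exactly when points
    # connected in the affinity graph receive similar latent factors F.
    L = graph_laplacian(W)
    return empirical_loss + mu * np.trace(F.T @ L @ F)
```

Latent factors that vary smoothly along the graph leave the regularizer at zero, so minimizing the sum trades empirical fit against geometric smoothness in a single problem, as the abstract describes.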