Author name cluster

Rong Jin

Possible papers associated with this exact author name in Arrow. This page groups case-insensitive exact name matches and is not a full identity disambiguation profile.

72 papers

1 author row

TMLR Journal 2023 Journal Article

Attentional-Biased Stochastic Gradient Descent

Qi Qi
Yi Xu
Wotao Yin
Rong Jin
Tianbao Yang

In this paper, we present a simple yet effective provable method (named ABSGD) for addressing the data imbalance or label noise problem in deep learning. Our method is a simple modification to momentum SGD where we assign an individual importance weight to each sample in the mini-batch. The individual-level weight of a sampled data is systematically proportional to the exponential of a scaled loss value of the data, where the scaling factor is interpreted as the regularization parameter in the framework of distributionally robust optimization (DRO). Depending on whether the scaling factor is positive or negative, ABSGD is guaranteed to converge to a stationary point of an information-regularized min-max or min-min DRO problem, respectively. Compared with existing class-level weighting schemes, our method can capture the diversity between individual examples within each class. Compared with existing individual-level weighting methods using meta-learning that require three backward propagations for computing mini-batch stochastic gradients, our method is more efficient with only one backward propagation at each iteration as in standard deep learning methods. ABSGD is flexible enough to combine with other robust losses without any additional cost. Our empirical studies on several benchmark datasets demonstrate the effectiveness of the proposed method.

NeurIPS Conference 2023 Conference Paper

One Fits All: Power General Time Series Analysis by Pretrained LM

Tian Zhou
Peisong Niu
Xue Wang
Liang Sun
Rong Jin

Although we have witnessed great success of pre-trained models in natural language processing (NLP) and computer vision (CV), limited progress has been made for general time series analysis. Unlike NLP and CV where a unified model can be used to perform different tasks, specially designed approach still dominates in each time series analysis task such as classification, anomaly detection, forecasting, and few-shot learning. The main challenge that blocks the development of pre-trained model for time series analysis is the lack of a large amount of data for training. In this work, we address this challenge by leveraging language or CV models, pre-trained from billions of tokens, for time series analysis. Specifically, we refrain from altering the self-attention and feedforward layers of the residual blocks in the pre-trained language or image model. This model, known as the Frozen Pretrained Transformer (FPT), is evaluated through fine-tuning on all major types of tasks involving time series. Our results demonstrate that pre-trained models on natural language or images can lead to a comparable or state-of-the-art performance in all main time series analysis tasks, as illustrated in Figure1. We also found both theoretically and empirically that the self-attention module behaviors similarly to principle component analysis (PCA), an observation that helps explains how transformer bridges the domain gap and a crucial step towards understanding the universality of a pre-trained transformer. The code is publicly available at https: //anonymous. 4open. science/r/Pretrained-LM-for-TSForcasting-C561.

NeurIPS Conference 2023 Conference Paper

OneNet: Enhancing Time Series Forecasting Models under Concept Drift by Online Ensembling

Yifan Zhang
Qingsong Wen
Xue Wang
Weiqi Chen
Liang Sun
Zhang Zhang
Liang Wang
Rong Jin

Online updating of time series forecasting models aims to address the concept drifting problem by efficiently updating forecasting models based on streaming data. Many algorithms are designed for online time series forecasting, with some exploiting cross-variable dependency while others assume independence among variables. Given every data assumption has its own pros and cons in online time series modeling, we propose **On**line **e**nsembling **Net**work (**OneNet**). It dynamically updates and combines two models, with one focusing on modeling the dependency across the time dimension and the other on cross-variate dependency. Our method incorporates a reinforcement learning-based approach into the traditional online convex programming framework, allowing for the linear combination of the two models with dynamically adjusted weights. OneNet addresses the main shortcoming of classical online learning methods that tend to be slow in adapting to the concept drift. Empirical results show that OneNet reduces online forecasting error by more than $\mathbf{50}\\%$ compared to the State-Of-The-Art (SOTA) method.

AAAI Conference 2022 System Paper

A Trend-Driven Fashion Design System for Rapid Response Marketing in E-commerce

Lianghua Huang
Yu Liu
Bin Wang
Pan Pan
Rong Jin

Fashion is the way we express ourselves and has grown into one of the largest industries in the world. Despite the significant evolvement of the fashion industry over the past decades, it is still a great challenge to respond to the diverse preferences of a large number of different consumers in time and accurately. To deal with the problem, we present an innovative demonstration of a trend-driven fashion design system using deep generative modeling, which enables automatic fashion design and editing based on trend reports. Our system consists of three components, including trend-driven fashion design, interactive fashion editing, and popularity estimation. The system offers a unified framework for mass-production of fashion designs that conform to the trend, which helps businesses better respond to market demands.

NeurIPS Conference 2022 Conference Paper

FiLM: Frequency improved Legendre Memory Model for Long-term Time Series Forecasting

Tian Zhou
Ziqing Ma
Xue Wang
Qingsong Wen
Liang Sun
Tao Yao
Wotao Yin
Rong Jin

Recent studies have shown that deep learning models such as RNNs and Transformers have brought significant performance gains for long-term forecasting of time series because they effectively utilize historical information. We found, however, that there is still great room for improvement in how to preserve historical information in neural networks while avoiding overfitting to noise present in the history. Addressing this allows better utilization of the capabilities of deep learning models. To this end, we design a Frequency improved Legendre Memory model, or FiLM: it applies Legendre polynomial projections to approximate historical information, uses Fourier projection to remove noise, and adds a low-rank approximation to speed up computation. Our empirical studies show that the proposed FiLM significantly improves the accuracy of state-of-the-art models in multivariate and univariate long-term forecasting by (19. 2%, 22. 6%), respectively. We also demonstrate that the representation module developed in this work can be used as a general plugin to improve the long-term prediction performance of other deep learning modules. Code is available at https: //github. com/tianzhou2011/FiLM/.

NeurIPS Conference 2022 Conference Paper

Grow and Merge: A Unified Framework for Continuous Categories Discovery

Xinwei Zhang
Jianwen Jiang
Yutong Feng
Zhi-Fan Wu
Xibin Zhao
Hai Wan
Mingqian Tang
Rong Jin

Although a number of studies are devoted to novel category discovery, most of them assume a static setting where both labeled and unlabeled data are given at once for finding new categories. In this work, we focus on the application scenarios where unlabeled data are continuously fed into the category discovery system. We refer to it as the {\bf Continuous Category Discovery} ({\bf CCD}) problem, which is significantly more challenging than the static setting. A common challenge faced by novel category discovery is that different sets of features are needed for classification and category discovery: class discriminative features are preferred for classification, while rich and diverse features are more suitable for new category mining. This challenge becomes more severe for dynamic setting as the system is asked to deliver good performance for known classes over time, and at the same time continuously discover new classes from unlabeled data. To address this challenge, we develop a framework of {\bf Grow and Merge} ({\bf GM}) that works by alternating between a growing phase and a merge phase: in the growing phase, it increases the diversity of features through a continuous self-supervised learning for effective category mining, and in the merging phase, it merges the grown model with a static one to ensure satisfying performance for known classes. Our extensive studies verify that the proposed GM framework is significantly more effective than the state-of-the-art approaches for continuous category discovery.

NeurIPS Conference 2022 Conference Paper

Improved Fine-Tuning by Better Leveraging Pre-Training Data

Ziquan Liu
Yi Xu
Yuanhong Xu
Qi Qian
Hao Li
Xiangyang Ji
Antoni Chan
Rong Jin

As a dominant paradigm, fine-tuning a pre-trained model on the target data is widely used in many deep learning applications, especially for small data sets. However, recent studies have empirically shown that training from scratch has the final performance that is no worse than this pre-training strategy once the number of training samples is increased in some vision tasks. In this work, we revisit this phenomenon from the perspective of generalization analysis by using excess risk bound which is popular in learning theory. The result reveals that the excess risk bound may have a weak dependency on the pre-trained model. The observation inspires us to leverage pre-training data for fine-tuning, since this data is also available for fine-tuning. The generalization result of using pre-training data shows that the excess risk bound on a target task can be improved when the appropriate pre-training data is included in fine-tuning. With the theoretical motivation, we propose a novel selection strategy to select a subset from pre-training data to help improve the generalization on the target task. Extensive experimental results for image classification tasks on 8 benchmark data sets verify the effectiveness of the proposed data selection based fine-tuning pipeline. Our code is available at https: //github. com/ziquanliu/NeurIPS2022 UOT fine_tuning.

NeurIPS Conference 2022 Conference Paper

Robust Graph Structure Learning via Multiple Statistical Tests

Yaohua Wang
Fangyi Zhang
Ming Lin
Senzhang Wang
Xiuyu Sun
Rong Jin

Graph structure learning aims to learn connectivity in a graph from data. It is particularly important for many computer vision related tasks since no explicit graph structure is available for images for most cases. A natural way to construct a graph among images is to treat each image as a node and assign pairwise image similarities as weights to corresponding edges. It is well known that pairwise similarities between images are sensitive to the noise in feature representations, leading to unreliable graph structures. We address this problem from the viewpoint of statistical tests. By viewing the feature vector of each node as an independent sample, the decision of whether creating an edge between two nodes based on their similarity in feature representation can be thought as a ${\it single}$ statistical test. To improve the robustness in the decision of creating an edge, multiple samples are drawn and integrated by ${\it multiple}$ statistical tests to generate a more reliable similarity measure, consequentially more reliable graph structure. The corresponding elegant matrix form named $\mathcal{B}$$\textbf{-Attention}$ is designed for efficiency. The effectiveness of multiple tests for graph structure learning is verified both theoretically and empirically on multiple clustering and ReID benchmark datasets. Source codes are available at https: //github. com/Thomas-wyh/B-Attention.

AAAI Conference 2022 Conference Paper

Scaled ReLU Matters for Training Vision Transformers

Pichao Wang
Xue Wang
Hao Luo
Jingkai Zhou
Zhipeng Zhou
Fan Wang
Hao Li
Rong Jin

Vision transformers (ViTs) have been an alternative design paradigm to convolutional neural networks (CNNs). However, the training of ViTs is much harder than CNNs, as it is sensitive to the training parameters, such as learning rate, optimizer and warmup epoch. The reasons for training difficulty are empirically analysed in the paper Early Convolutions Help Transformers See Better, and the authors conjecture that the issue lies with the patchify-stem of ViT models. In this paper, we further investigate this problem and extend the above conclusion: only early convolutions do not help for stable training, but the scaled ReLU operation in the convolutional stem (conv-stem) matters. We verify, both theoretically and empirically, that scaled ReLU in conv-stem not only improves training stabilization, but also increases the diversity of patch tokens, thus boosting peak performance with a large margin via adding few parameters and flops. In addition, extensive experiments are conducted to demonstrate that previous ViTs are far from being well trained, further showing that ViTs have great potential to be a better substitute of CNNs.

NeurIPS Conference 2022 Conference Paper

Stability and Generalization Analysis of Gradient Methods for Shallow Neural Networks

Yunwen Lei
Rong Jin
Yiming Ying

While significant theoretical progress has been achieved, unveiling the generalization mystery of overparameterized neural networks still remains largely elusive. In this paper, we study the generalization behavior of shallow neural networks (SNNs) by leveraging the concept of algorithmic stability. We consider gradient descent (GD) and stochastic gradient descent (SGD) to train SNNs, for both of which we develop consistent excess risk bounds by balancing the optimization and generalization via early-stopping. As compared to existing analysis on GD, our new analysis requires a relaxed overparameterization assumption and also applies to SGD. The key for the improvement is a better estimation of the smallest eigenvalues of the Hessian matrices of the empirical risks and the loss function along the trajectories of GD and SGD by providing a refined estimation of their iterates.

NeurIPS Conference 2021 Conference Paper

An Online Method for A Class of Distributionally Robust Optimization with Non-convex Objectives

Qi Qi
Zhishuai Guo
Yi Xu
Rong Jin
Tianbao Yang

In this paper, we propose a practical online method for solving a class of distributional robust optimization (DRO) with non-convex objectives, which has important applications in machine learning for improving the robustness of neural networks. In the literature, most methods for solving DRO are based on stochastic primal-dual methods. However, primal-dual methods for DRO suffer from several drawbacks: (1) manipulating a high-dimensional dual variable corresponding to the size of data is time expensive; (2) they are not friendly to online learning where data is coming sequentially. To address these issues, we consider a class of DRO with an KL divergence regularization on the dual variables, transform the min-max problem into a compositional minimization problem, and propose practical duality-free online stochastic methods without requiring a large mini-batch size. We establish the state-of-the-art complexities of the proposed methods with and without a Polyak-Łojasiewicz (PL) condition of the objective. Empirical studies on large-scale deep learning tasks (i) demonstrate that our method can speed up the training by more than 2 times than baseline methods and save days of training time on a large-scale dataset with ∼ 265K images, and (ii) verify the supreme performance of DRO over Empirical Risk Minimization (ERM) on imbalanced datasets. Of independent interest, the proposed method can be also used for solving a family of stochastic compositional problems with state-of-the-art complexities.

TCS Journal 2021 Journal Article

Novel algorithms for maximum DS decomposition

Shengminjie Chen
Wenguo Yang
Suixiang Gao
Rong Jin

AAAI Conference 2021 Conference Paper

Train a One-Million-Way Instance Classifier for Unsupervised Visual Representation Learning

Yu Liu
Lianghua Huang
Pan Pan
Bin Wang
Yinghui Xu
Rong Jin

This paper presents a simple unsupervised visual representation learning method with a pretext task of discriminating all images in a dataset using a parametric, instance-level classifier. The overall framework is a replica of a supervised classification model, where semantic classes (e. g. , dog, bird, and ship) are replaced by instance IDs. However, scaling up the classification task from thousands of semantic labels to millions of instance labels brings specific challenges including 1) the large-scale softmax computation; 2) the slow convergence due to the infrequent visiting of instance samples; and 3) the massive number of negative classes that can be noisy. This work presents several novel techniques to handle these difficulties. First, we introduce a hybrid parallel training framework to make large-scale training feasible. Second, we present a raw-feature initialization mechanism for classification weights, which we assume offers a contrastive prior for instance discrimination and can clearly speed up converge in our experiments. Finally, we propose to smooth the labels of a few hardest classes to avoid optimizing over very similar negative pairs. While being conceptually simple, our framework achieves competitive or superior performance compared to state-of-the-art unsupervised approaches, i. e. , SimCLR, Mo- CoV2, and PIC under ImageNet linear evaluation protocol and on several downstream visual tasks, verifying that full instance classification is a strong pretraining technique for many semantic visual tasks.

IJCAI Conference 2019 Conference Paper

A Practical Semi-Parametric Contextual Bandit

Yi Peng
Miao Xie
Jiahao Liu
Xuying Meng
Nan Li
Cheng Yang
Tao Yao
Rong Jin

Classic multi-armed bandit algorithms are inefficient for a large number of arms. On the other hand, contextual bandit algorithms are more efficient, but they suffer from a large regret due to the bias of reward estimation with finite dimensional features. Although recent studies proposed semi-parametric bandits to overcome these defects, they assume arms' features are constant over time. However, this assumption rarely holds in practice, since real-world problems often involve underlying processes that are dynamically evolving over time especially for the special promotions like Singles' Day sales. In this paper, we formulate a novel Semi-Parametric Contextual Bandit Problem to relax this assumption. For this problem, a novel Two-Steps Upper-Confidence Bound framework, called Semi-Parametric UCB (SPUCB), is presented. It can be flexibly applied to linear parametric function problem with a satisfied gap-free bound on the n-step regret. Moreover, to make our method more practical in online system, an optimization is proposed for dealing with high dimensional features of a linear function. Extensive experiments on synthetic data as well as a real dataset from one of the largest e-commercial platforms demonstrate the superior performance of our algorithm.

NeurIPS Conference 2019 Conference Paper

Non-asymptotic Analysis of Stochastic Methods for Non-Smooth Non-Convex Regularized Problems

Yi Xu
Rong Jin
Tianbao Yang

Stochastic Proximal Gradient (SPG) methods have been widely used for solving optimization problems with a simple (possibly non-smooth) regularizer in machine learning and statistics. However, to the best of our knowledge no non-asymptotic convergence analysis of SPG exists for non-convex optimization with a non-smooth and non-convex regularizer. All existing non-asymptotic analysis of SPG for solving non-smooth non-convex problems require the non-smooth regularizer to be a convex function, and hence are not applicable to a non-smooth non-convex regularized problem. This work initiates the analysis to bridge this gap and opens the door to non-asymptotic convergence analysis of non-smooth non-convex regularized problems. We analyze several variants of mini-batch SPG methods for minimizing a non-convex objective that consists of a smooth non-convex loss and a non-smooth non-convex regularizer. Our contributions are two-fold: (i) we show that they enjoy the same complexities as their counterparts for solving convex regularized non-convex problems in terms of finding an approximate stationary point; (ii) we develop more practical variants using dynamic mini-batch size instead of a fixed mini-batch size without requiring the target accuracy level of solution. The significance of our results is that they improve upon the-state-of-art results for solving non-smooth non-convex regularized problems. We also empirically demonstrate the effectiveness of the considered SPG methods in comparison with other peer stochastic methods.

IJCAI Conference 2019 Conference Paper

On the Convergence of (Stochastic) Gradient Descent with Extrapolation for Non-Convex Minimization

Yi Xu
Zhuoning Yuan
Sen Yang
Rong Jin
Tianbao Yang

Extrapolation is a well-known technique for solving convex optimization and variational inequalities and recently attracts some attention for non-convex optimization. Several recent works have empirically shown its success in some machine learning tasks. However, it has not been analyzed for non-convex minimization and there still remains a gap between the theory and the practice. In this paper, we analyze gradient descent and stochastic gradient descent with extrapolation for finding an approximate first-order stationary point in smooth non-convex optimization problems. Our convergence upper bounds show that the algorithms with extrapolation can be accelerated than without extrapolation.

JMLR Journal 2019 Journal Article

Relative Error Bound Analysis for Nuclear Norm Regularized Matrix Completion

Lijun Zhang
Tianbao Yang
Rong Jin
Zhi-Hua Zhou

In this paper, we develop a relative error bound for nuclear norm regularized matrix completion, with the focus on the completion of full-rank matrices. Under the assumption that the top eigenspaces of the target matrix are incoherent, we derive a relative upper bound for recovering the best low-rank approximation of the unknown matrix. Although multiple works have been devoted to analyzing the recovery error of full-rank matrix completion, their error bounds are usually additive, making it impossible to obtain the perfect recovery case and more generally difficult to leverage the skewed distribution of eigenvalues. Our analysis is built upon the optimality condition of the regularized formulation and existing guarantees for low-rank matrix completion. To the best of our knowledge, this is the first relative bound that has been proved for the regularized formulation of matrix completion. [abs] [ pdf ][ bib ] &copy JMLR 2019. ( edit, beta )

AAAI Conference 2019 Conference Paper

Robust Online Matching with User Arrival Distribution Drift

Yu-Hang Zhou
Chen Liang
Nan Li
Cheng Yang
Shenghuo Zhu
Rong Jin

Recently, online matching problems have attracted much attention due to its emerging applications in internet advertising. Most existing online matching methods have adopted either adversarial or stochastic user arrival assumption, while on both of them significant limitation exists. The adversarial model does not exploit existing knowledge of the user sequence, and thus can be pessimistic in practice. On other hands, the stochastic model assumes that users are drawn from a stationary distribution, which may not be true in real applications. In this paper, we consider a novel user arrival model where users are drawn from drifting distribution, which is a hybrid case between the adversarial and stochastic model, and propose a new approach RDLA to deal with such assumption. Instead of maximizing empirical total revenues on the revealed users, RDLA leverages distributionally robust optimization techniques to learn dual variables via a worst-case consideration over an ambiguity set on the underlying user distribution. Experiments on a real-world dataset exhibit the superiority of our approach.

AAAI Conference 2019 Conference Paper

Robust Optimization over Multiple Domains

Qi Qian
Shenghuo Zhu
Jiasheng Tang
Rong Jin
Baigui Sun
Hao Li

In this work, we study the problem of learning a single model for multiple domains. Unlike the conventional machine learning scenario where each domain can have the corresponding model, multiple domains (i. e. , applications/users) may share the same machine learning model due to maintenance loads in cloud computing services. For example, a digit-recognition model should be applicable to hand-written digits, house numbers, car plates, etc. Therefore, an ideal model for cloud computing has to perform well at each applicable domain. To address this new challenge from cloud computing, we develop a framework of robust optimization over multiple domains. In lieu of minimizing the empirical risk, we aim to learn a model optimized to the adversarial distribution over multiple domains. Hence, we propose to learn the model and the adversarial distribution simultaneously with the stochastic algorithm for efficiency. Theoretically, we analyze the convergence rate for convex and non-convex models. To our best knowledge, we first study the convergence rate of learning a robust non-convex model with a practical algorithm. Furthermore, we demonstrate that the robustness of the framework and the convergence rate can be further enhanced by appropriate regularizers over the adversarial distribution. The empirical study on real-world fine-grained visual categorization and digits recognition tasks verifies the effectiveness and efficiency of the proposed framework.

AAAI Conference 2019 Conference Paper

Semi-Parametric Sampling for Stochastic Bandits with Many Arms

Mingdong Ou
Nan Li
Cheng Yang
Shenghuo Zhu
Rong Jin

We consider the stochastic bandit problem with a large candidate arm set. In this setting, classic multi-armed bandit algorithms, which assume independence among arms and adopt non-parametric reward model, are inefficient, due to the large number of arms. By exploiting arm correlations based on a parametric reward model with arm features, contextual bandit algorithms are more efficient, but they can also suffer from large regret in practical applications, due to the reward estimation bias from mis-specified model assumption or incomplete features. In this paper, we propose a novel Bayesian framework, called Semi-Parametric Sampling (SPS), for this problem, which employs semi-parametric function as the reward model. Specifically, the parametric part of SPS, which models expected reward as a parametric function of arm feature, can efficiently eliminate poor arms from candidate set. The non-parametric part of SPS, which adopts nonparametric reward model, revises the parametric estimation to avoid estimation bias, especially on the remained candidate arms. We give an implementation of SPS, Linear SPS (LSPS), which utilizes linear function as the parametric part. In semi-parametric environment, theoretical analysis shows that LSPS achieves better regret bound (i. e. Õ( √ N 1−α dα √ T) with α ∈ [0, 1]) than existing approaches. Also, experiments demonstrate the superiority of the proposed approach.

NeurIPS Conference 2019 Conference Paper

Stagewise Training Accelerates Convergence of Testing Error Over SGD

Zhuoning Yuan
Yan Yan
Rong Jin
Tianbao Yang

Stagewise training strategy is widely used for learning neural networks, which runs a stochastic algorithm (e. g. , SGD) starting with a relatively large step size (aka learning rate) and geometrically decreasing the step size after a number of iterations. It has been observed that the stagewise SGD has much faster convergence than the vanilla SGD with a polynomially decaying step size in terms of both training error and testing error. {\it But how to explain this phenomenon has been largely ignored by existing studies. } This paper provides some theoretical evidence for explaining this faster convergence. In particular, we consider a stagewise training strategy for minimizing empirical risk that satisfies the Polyak-\L ojasiewicz (PL) condition, which has been observed/proved for neural networks and also holds for a broad family of convex functions. For convex loss functions and two classes of ``nice-behaviored" non-convex objectives that are close to a convex function, we establish faster convergence of stagewise training than the vanilla SGD under the PL condition on both training error and testing error. Experiments on stagewise learning of deep residual networks exhibits that it satisfies one type of non-convexity assumption and therefore can be explained by our theory.

AAAI Conference 2019 Conference Paper

Which Factorization Machine Modeling Is Better: A Theoretical Answer with Optimal Guarantee

Ming Lin
Shuang Qiu
Jieping Ye
Xiaomin Song
Qi Qian
Liang Sun
Shenghuo Zhu
Rong Jin

Factorization machine (FM) is a popular machine learning model to capture the second order feature interactions. The optimal learning guarantee of FM and its generalized version is not yet developed. For a rank k generalized FM of d dimensional input, the previous best known sampling complexity is O[k3 d · polylog(kd)] under Gaussian distribution. This bound is sub-optimal comparing to the information theoretical lower bound O(kd). In this work, we aim to tighten this bound towards optimal and generalize the analysis to sub-gaussian distribution. We prove that when the input data satisfies the so-called τ-Moment Invertible Property, the sampling complexity of generalized FM can be improved to O[k2 d · polylog(kd)/τ2 ]. When the second order self-interaction terms are excluded in the generalized FM, the bound can be improved to the optimal O[kd · polylog(kd)] up to the logarithmic factors. Our analysis also suggests that the positive semi-definite constraint in the conventional FM is redundant as it does not improve the sampling complexity while making the model difficult to optimize. We evaluate our improved FM model in real-time high precision GPS signal calibration task to validate its superiority.

NeurIPS Conference 2019 Conference Paper

XNAS: Neural Architecture Search with Expert Advice

Niv Nayman
Asaf Noy
Tal Ridnik
Itamar Friedman
Rong Jin
Lihi Zelnik

This paper introduces a novel optimization method for differential neural architecture search, based on the theory of prediction with expert advice. Its optimization criterion is well fitted for an architecture-selection, i. e. , it minimizes the regret incurred by a sub-optimal selection of operations. Unlike previous search relaxations, that require hard pruning of architectures, our method is designed to dynamically wipe out inferior architectures and enhance superior ones. It achieves an optimal worst-case regret bound and suggests the use of multiple learning-rates, based on the amount of information carried by the backward gradients. Experiments show that our algorithm achieves a strong performance over several image classification datasets. Specifically, it obtains an error rate of 1. 6% for CIFAR-10, 23. 9% for ImageNet under mobile settings, and achieves state-of-the-art results on three additional datasets.

AAAI Conference 2018 Conference Paper

Extremely Low Bit Neural Network: Squeeze the Last Bit Out With ADMM

Cong Leng
Zesheng Dou
Hao Li
Shenghuo Zhu
Rong Jin

Although deep learning models are highly effective for various learning tasks, their high computational costs prohibit the deployment to scenarios where either memory or computational resources are limited. In this paper, we focus on compressing and accelerating deep models with network weights represented by very small numbers of bits, referred to as extremely low bit neural network. We model this problem as a discretely constrained optimization problem. Borrowing the idea from Alternating Direction Method of Multipliers (ADMM), we decouple the continuous parameters from the discrete constraints of network, and cast the original hard problem into several subproblems. We propose to solve these subproblems using extragradient and iterative quantization algorithms that lead to considerably faster convergency compared to conventional optimization methods. Extensive experiments on image recognition and object detection verify that the proposed algorithm is more effective than state-ofthe-art approaches when coming to extremely low bit neural network.

NeurIPS Conference 2018 Conference Paper

Fast Rates of ERM and Stochastic Approximation: Adaptive to Error Bound Conditions

Mingrui Liu
Xiaoxuan Zhang
Lijun Zhang
Rong Jin
Tianbao Yang

Error bound conditions (EBC) are properties that characterize the growth of an objective function when a point is moved away from the optimal set. They have recently received increasing attention in the field of optimization for developing optimization algorithms with fast convergence. However, the studies of EBC in statistical learning are hitherto still limited. The main contributions of this paper are two-fold. First, we develop fast and intermediate rates of empirical risk minimization (ERM) under EBC for risk minimization with Lipschitz continuous, and smooth convex random functions. Second, we establish fast and intermediate rates of an efficient stochastic approximation (SA) algorithm for risk minimization with Lipschitz continuous random functions, which requires only one pass of $n$ samples and adapts to EBC. For both approaches, the convergence rates span a full spectrum between $\widetilde O(1/\sqrt{n})$ and $\widetilde O(1/n)$ depending on the power constant in EBC, and could be even faster than $O(1/n)$ in special cases for ERM. Moreover, these convergence rates are automatically adaptive without using any knowledge of EBC. Overall, this work not only strengthens the understanding of ERM for statistical learning but also brings new fast stochastic algorithms for solving a broad range of statistical learning problems.

NeurIPS Conference 2018 Conference Paper

First-order Stochastic Algorithms for Escaping From Saddle Points in Almost Linear Time

Yi Xu
Rong Jin
Tianbao Yang

(This is a theory paper) In this paper, we consider first-order methods for solving stochastic non-convex optimization problems. The key building block of the proposed algorithms is first-order procedures to extract negative curvature from the Hessian matrix through a principled sequence starting from noise, which are referred to {\it NEgative-curvature-Originated-from-Noise or NEON} and are of independent interest. Based on this building block, we design purely first-order stochastic algorithms for escaping from non-degenerate saddle points with a much better time complexity (almost linear time in the problem's dimensionality). In particular, we develop a general framework of {\it first-order stochastic algorithms} with a second-order convergence guarantee based on our new technique and existing algorithms that may only converge to a first-order stationary point. For finding a nearly {\it second-order stationary point} $\x$ such that $\|\nabla F(\x)\|\leq \epsilon$ and $\nabla^2 F(\x)\geq -\sqrt{\epsilon}I$ (in high probability), the best time complexity of the presented algorithms is $\widetilde O(d/\epsilon^{3. 5})$, where $F(\cdot)$ denotes the objective function and $d$ is the dimensionality of the problem. To the best of our knowledge, this is the first theoretical result of first-order stochastic algorithms with an almost linear time in terms of problem's dimensionality for finding second-order stationary points, which is even competitive with existing stochastic algorithms hinging on the second-order information.

IJCAI Conference 2018 Conference Paper

Multinomial Logit Bandit with Linear Utility Functions

Mingdong Ou
Nan Li
Shenghuo Zhu
Rong Jin

Multinomial logit bandit is a sequential subset selection problem which arises in many applications. In each round, the player selects a K-cardinality subset from N candidate items, and receives a reward which is governed by a multinomial logit (MNL) choice model considering both item utility and substitution property among items. The player's objective is to dynamically learn the parameters of MNL model and maximize cumulative reward over a finite horizon T. This problem faces the exploration-exploitation dilemma, and the involved combinatorial nature makes it non-trivial. In recent years, there have developed some algorithms by exploiting specific characteristics of the MNL model, but all of them estimate the parameters of MNL model separately and incur a regret bound which is not preferred for large candidate set size N. In this paper, we consider the linear utility MNL choice model whose item utilities are represented as linear functions of d-dimension item features, and propose an algorithm, titled LUMB, to exploit the underlying structure. It is proven that the proposed algorithm achieves regret which is free of candidate set size. Experiments show the superiority of the proposed algorithm.

AAAI Conference 2017 Conference Paper

A Two-Stage Approach for Learning a Sparse Model with Sharp Excess Risk Analysis

Zhe Li
Tianbao Yang
Lijun Zhang
Rong Jin

This paper aims to provide a sharp excess risk guarantee for learning a sparse linear model without any assumptions about the strong convexity of the expected loss and the sparsity of the optimal solution in hindsight. Given a target level _ for the excess risk, an interesting question to ask is how many examples and how large the support set of the solution are enough for learning a good model with the target excess risk. To answer these questions, we present a two-stage algorithm that (i) in the first stage an epoch based stochastic optimization algorithm is exploited with an established O(1/_) bound on the sample complexity; and (ii) in the second stage a distribution dependent randomized sparsification is presented with an O(1/_) bound on the sparsity (referred to as support complexity) of the resulting model. Compared to previous works, our contributions lie at (i) we reduce the order of the sample complexity from O(1/_2) to O(1/_) without the strong convexity assumption; and (ii) we reduce the constant in O(1/_) for the sparsity by exploring the distribution dependent sampling.

IJCAI Conference 2017 Conference Paper

Deep Learning at Alibaba

Rong Jin

In this talk, I will focus on the applications and the latest development of deep learning technologies at Alibaba. More specifically, I will discuss (a) how to handle high dimensional data in DNN and its application to recommender system, (b) the development of deep learning models for transfer learning and its application to multimedia data analysis, (c) the development of combinatorial optimization techniques for DNN model compression and its application to large-scale image classification, and (d) the exploration of deep learning technique for combinatorial optimization and its application to the packing problem in shipping industry. I will conclude my talk with a discussion of new directions for deep learning that are under development at Alibaba.

NeurIPS Conference 2017 Conference Paper

Improved Dynamic Regret for Non-degenerate Functions

Lijun Zhang
Tianbao Yang
Jinfeng Yi
Rong Jin
Zhi-Hua Zhou

Recently, there has been a growing research interest in the analysis of dynamic regret, which measures the performance of an online learner against a sequence of local minimizers. By exploiting the strong convexity, previous studies have shown that the dynamic regret can be upper bounded by the path-length of the comparator sequence. In this paper, we illustrate that the dynamic regret can be further improved by allowing the learner to query the gradient of the function multiple times, and meanwhile the strong convexity can be weakened to other non-degenerate conditions. Specifically, we introduce the squared path-length, which could be much smaller than the path-length, as a new regularity of the comparator sequence. When multiple gradients are accessible to the learner, we first demonstrate that the dynamic regret of strongly convex functions can be upper bounded by the minimum of the path-length and the squared path-length. We then extend our theoretical guarantee to functions that are semi-strongly convex or self-concordant. To the best of our knowledge, this is the first time that semi-strong convexity and self-concordance are utilized to tighten the dynamic regret.

AAAI Conference 2016 Conference Paper

Accelerated Sparse Linear Regression via Random Projection

Weizhong Zhang
Lijun Zhang
Rong Jin
Deng Cai
Xiaofei He

In this paper, we present an accelerated numerical method based on random projection for sparse linear regression. Previous studies have shown that under appropriate conditions, gradient-based methods enjoy a geometric convergence rate when applied to this problem. However, the time complexity of evaluating the gradient is as large as O(nd), where n is the number of data points and d is the dimensionality, making those methods inefﬁcient for large-scale and highdimensional dataset. To address this limitation, we ﬁrst utilize random projection to ﬁnd a rank-k approximator for the data matrix, and reduce the cost of gradient evaluation to O(nk + dk), a signiﬁcant improvement when k is much smaller than d and n. Then, we solve the sparse linear regression problem via a proximal gradient method with a homotopy strategy to generate sparse intermediate solutions. Theoretical analysis shows that our method also achieves a global geometric convergence rate, and moreover the sparsity of all the intermediate solutions are well-bounded over the iterations. Finally, we conduct experiments to demonstrate the ef- ﬁciency of the proposed method.

AAAI Conference 2016 Conference Paper

Fast and Accurate Refined Nyström-Based Kernel SVM

Zhe Li
Tianbao Yang
Lijun Zhang
Rong Jin

AIJ Journal 2016 Journal Article

One-pass AUC optimization

Wei Gao
Lu Wang
Rong Jin
Shenghuo Zhu
Zhi-Hua Zhou

AAAI Conference 2016 Conference Paper

Stochastic Optimization for Kernel PCA

Lijun Zhang
Tianbao Yang
Jinfeng Yi
Rong Jin
Zhi-Hua Zhou

Kernel Principal Component Analysis (PCA) is a popular extension of PCA which is able to ﬁnd nonlinear patterns from data. However, the application of kernel PCA to largescale problems remains a big challenge, due to its quadratic space complexity and cubic time complexity in the number of examples. To address this limitation, we utilize techniques from stochastic optimization to solve kernel PCA with linear space and time complexities per iteration. Speciﬁcally, we formulate it as a stochastic composite optimization problem, where a nuclear norm regularizer is introduced to promote low-rankness, and then develop a simple algorithm based on stochastic proximal gradient descent. During the optimization process, the proposed algorithm always maintains a low-rank factorization of iterates that can be conveniently held in memory. Compared to previous iterative approaches, a remarkable property of our algorithm is that it is equipped with an explicit rate of convergence. Theoretical analysis shows that the solution of our algorithm converges to the optimal one at an O(1/T) rate, where T is the number of iterations.

AAAI Conference 2015 Conference Paper

Nystrom Approximation for Sparse Kernel Methods: Theoretical Analysis and Empirical Evaluation

Zenglin Xu
Rong Jin
Bin Shen
Shenghuo Zhu

Nyström approximation is an effective approach to accelerate the computation of kernel matrices in many kernel methods. In this paper, we consider the Nyström approximation for sparse kernel methods. Instead of relying on the low-rank assumption of the original kernels, which sometimes does not hold in some applications, we take advantage of the restricted eigenvalue condition, which has been proved to be robust for sparse kernel methods. Based on the restricted eigenvalue condition, we have provided not only the approximation bound for the original kernel matrix but also the recovery bound for the sparse solutions of sparse kernel regression. In addition to the theoretical analysis, we also demonstrate the good performance of the Nyström approximation for sparse kernel regression on real world data sets.

AAAI Conference 2015 Conference Paper

Online Bandit Learning for a Special Class of Non-Convex Losses

Lijun Zhang
Tianbao Yang
Rong Jin
Zhi-Hua Zhou

In online bandit learning, the learner aims to minimize a sequence of losses, while only observing the value of each loss at a single point. Although various algorithms and theories have been developed for online bandit learning, most of them are limited to convex losses. In this paper, we investigate the problem of online bandit learning with non-convex losses, and develop an efficient algorithm with formal theoretical guarantees. To be specific, we consider a class of losses which is a composition of a non-increasing scalar function and a linear function. This setting models a wide range of supervised learning applications such as online classification with a non-convex loss. Theoretical analysis shows that our algorithm achieves an e O(poly(d)T2/3 ) regret bound when the variation of the loss function is small. To the best of our knowledge, this is the first work in online bandit learning that does not rely on convexity.

NeurIPS Conference 2014 Conference Paper

Extracting Certainty from Uncertainty: Transductive Pairwise Classification from Pairwise Similarities

Tianbao Yang
Rong Jin

In this work, we study the problem of transductive pairwise classification from pairwise similarities~\footnote{The pairwise similarities are usually derived from some side information instead of the underlying class labels. }. The goal of transductive pairwise classification from pairwise similarities is to infer the pairwise class relationships, to which we refer as pairwise labels, between all examples given a subset of class relationships for a small set of examples, to which we refer as labeled examples. We propose a very simple yet effective algorithm that consists of two simple steps: the first step is to complete the sub-matrix corresponding to the labeled examples and the second step is to reconstruct the label matrix from the completed sub-matrix and the provided similarity matrix. Our analysis exhibits that under several mild preconditions we can recover the label matrix with a small error, if the top eigen-space that corresponds to the largest eigenvalues of the similarity matrix covers well the column space of label matrix and is subject to a low coherence, and the number of observed pairwise labels is sufficiently enough. We demonstrate the effectiveness of the proposed algorithm by several experiments.

AAAI Conference 2014 Conference Paper

Privacy and Regression Model Preserved Learning

Jinfeng Yi
Jun Wang
Rong Jin

Sensitive data such as medical records and business reports usually contains valuable information that can be used to build prediction models. However, designing learning models by directly using sensitive data might result in severe privacy and copyright issues. In this paper, we propose a novel matrix completion based framework that aims to tackle two challenging issues simultaneously: i) handling missing and noisy sensitive data, and ii) preserving the privacy of the sensitive data during the learning process. In particular, the proposed framework is able to mask the sensitive data while ensuring that the transformed data are still usable for training regression models. We show that two key properties, namely model preserving and privacy preserving, are satisfied by the transformed data obtained from the proposed framework. In model preserving, we guarantee that the linear regression model built from the masked data approximates the regression model learned from the original data in a perfect way. In privacy preserving, we ensure that the original sensitive data cannot be recovered since the transformation procedure is irreversible. Given these two characteristics, the transformed data can be safely released to any learners for designing prediction models without revealing any private content. Our empirical studies with a synthesized dataset and multiple sensitive benchmark datasets verify our theoretical claim as well as the effectiveness of the proposed framework.

AAAI Conference 2014 Conference Paper

Sparse Learning for Stochastic Composite Optimization

Weizhong Zhang
Lijun Zhang
Yao Hu
Rong Jin
Deng Cai
Xiaofei He

In this paper, we focus on Stochastic Composite Optimization (SCO) for sparse learning that aims to learn a sparse solution. Although many SCO algorithms have been developed for sparse learning with an optimal convergence rate O(1/T), they often fail to deliver sparse solutions at the end either because of the limited sparsity regularization during stochastic optimization or due to the limitation in online-to-batch conversion. To improve the sparsity of solutions obtained by SCO, we propose a simple but effective stochastic optimization scheme that adds a novel sparse online-to-batch conversion to the traditional SCO algorithms. The theoretical analysis shows that our scheme can find a solution with better sparse patterns without affecting the convergence rate. Experimental results on both synthetic and real-world data sets show that the proposed methods are more effective in recovering the sparse solution and have comparable convergence rate as the state-of-the-art SCO algorithms for sparse learning.

NeurIPS Conference 2014 Conference Paper

Top Rank Optimization in Linear Time

Nan Li
Rong Jin
Zhi-Hua Zhou

Bipartite ranking aims to learn a real-valued ranking function that orders positive instances before negative instances. Recent efforts of bipartite ranking are focused on optimizing ranking accuracy at the top of the ranked list. Most existing approaches are either to optimize task specific metrics or to extend the rank loss by emphasizing more on the error associated with the top ranked instances, leading to a high computational cost that is super-linear in the number of training instances. We propose a highly efficient approach, titled TopPush, for optimizing accuracy at the top that has computational complexity linear in the number of training instances. We present a novel analysis that bounds the generalization error for the top ranked instances for the proposed approach. Empirical study shows that the proposed approach is highly competitive to the state-of-the-art approaches and is 10-100 times faster.

NeurIPS Conference 2013 Conference Paper

Linear Convergence with Condition Number Independent Access of Full Gradients

Lijun Zhang
Mehrdad Mahdavi
Rong Jin

For smooth and strongly convex optimization, the optimal iteration complexity of the gradient-based algorithm is $O(\sqrt{\kappa}\log 1/\epsilon)$, where $\kappa$ is the conditional number. In the case that the optimization problem is ill-conditioned, we need to evaluate a larger number of full gradients, which could be computationally expensive. In this paper, we propose to reduce the number of full gradient required by allowing the algorithm to access the stochastic gradients of the objective function. To this end, we present a novel algorithm named Epoch Mixed Gradient Descent (EMGD) that is able to utilize two kinds of gradients. A distinctive step in EMGD is the mixed gradient descent, where we use an combination of the gradient and the stochastic gradient to update the intermediate solutions. By performing a fixed number of mixed gradient descents, we are able to improve the sub-optimality of the solution by a constant factor, and thus achieve a linear convergence rate. Theoretical analysis shows that EMGD is able to find an $\epsilon$-optimal solution by computing $O(\log 1/\epsilon)$ full gradients and $O(\kappa^2\log 1/\epsilon)$ stochastic gradients.

NeurIPS Conference 2013 Conference Paper

Mixed Optimization for Smooth Functions

Mehrdad Mahdavi
Lijun Zhang
Rong Jin

It is well known that the optimal convergence rate for stochastic optimization of smooth functions is $[O(1/\sqrt{T})]$, which is same as stochastic optimization of Lipschitz continuous convex functions. This is in contrast to optimizing smooth functions using full gradients, which yields a convergence rate of $[O(1/T^2)]$. In this work, we consider a new setup for optimizing smooth functions, termed as {\bf Mixed Optimization}, which allows to access both a stochastic oracle and a full gradient oracle. Our goal is to significantly improve the convergence rate of stochastic optimization of smooth functions by having an additional small number of accesses to the full gradient oracle. We show that, with an $[O(\ln T)]$ calls to the full gradient oracle and an $O(T)$ calls to the stochastic oracle, the proposed mixed optimization algorithm is able to achieve an optimization error of $[O(1/T)]$.

NeurIPS Conference 2013 Conference Paper

Speedup Matrix Completion with Side Information: Application to Multi-Label Learning

Miao Xu
Rong Jin
Zhi-Hua Zhou

In standard matrix completion theory, it is required to have at least $O(n\ln^2 n)$ observed entries to perfectly recover a low-rank matrix $M$ of size $n\times n$, leading to a large number of observations when $n$ is large. In many real tasks, side information in addition to the observed entries is often available. In this work, we develop a novel theory of matrix completion that explicitly explore the side information to reduce the requirement on the number of observed entries. We show that, under appropriate conditions, with the assistance of side information matrices, the number of observed entries needed for a perfect recovery of matrix $M$ can be dramatically reduced to $O(\ln n)$. We demonstrate the effectiveness of the proposed approach for matrix completion in transductive incomplete multi-label learning.

NeurIPS Conference 2013 Conference Paper

Stochastic Convex Optimization with Multiple Objectives

Mehrdad Mahdavi
Tianbao Yang
Rong Jin

In this paper, we are interested in the development of efficient algorithms for convex optimization problems in the simultaneous presence of multiple objectives and stochasticity in the first-order information. We cast the stochastic multiple objective optimization problem into a constrained optimization problem by choosing one function as the objective and try to bound other objectives by appropriate thresholds. We first examine a two stages exploration-exploitation based algorithm which first approximates the stochastic objectives by sampling and then solves a constrained stochastic optimization problem by projected gradient method. This method attains a suboptimal convergence rate even under strong assumption on the objectives. Our second approach is an efficient primal-dual stochastic algorithm. It leverages on the theory of Lagrangian method in constrained optimization and attains the optimal convergence rate of $[O(1/ \sqrt{T})]$ in high probability for general Lipschitz continuous objectives.

AAAI Conference 2012 Conference Paper

Efficient Online Learning for Large-Scale Sparse Kernel Logistic Regression

Lijun Zhang
Rong Jin
Chun Chen
Jiajun Bu
Xiaofei He

In this paper, we study the problem of large-scale Kernel Logistic Regression (KLR). A straightforward approach is to apply stochastic approximation to KLR. We refer to this approach as non-conservative online learning algorithm because it updates the kernel classifier after every received training example, leading to a dense classifier. To improve the sparsity of the KLR classifier, we propose two conservative online learning algorithms that update the classifier in a stochastic manner and generate sparse solutions. With appropriately designed updating strategies, our analysis shows that the two conservative algorithms enjoy similar theoretical guarantee as that of the non-conservative algorithm. Empirical studies on several benchmark data sets demonstrate that compared to batch-mode algorithms for KLR, the proposed conservative online learning algorithms are able to produce sparse KLR classifiers, and achieve similar classification accuracy but with significantly shorter training time. Furthermore, both the sparsity and classification accuracy of our methods are comparable to those of the online kernel SVM.

NeurIPS Conference 2012 Conference Paper

Nyström Method vs Random Fourier Features: A Theoretical and Empirical Comparison

Tianbao Yang
Yu-Feng Li
Mehrdad Mahdavi
Rong Jin
Zhi-Hua Zhou

Both random Fourier features and the Nyström method have been successfully applied to efficient kernel learning. In this work, we investigate the fundamental difference between these two approaches, and how the difference could affect their generalization performances. Unlike approaches based on random Fourier features where the basis functions (i. e. , cosine and sine functions) are sampled from a distribution {\it independent} from the training data, basis functions used by the Nyström method are randomly sampled from the training examples and are therefore {\it data dependent}. By exploring this difference, we show that when there is a large gap in the eigen-spectrum of the kernel matrix, approaches based the Nyström method can yield impressively better generalization error bound than random Fourier features based approach. We empirically verify our theoretical findings on a wide range of large data sets.

AAAI Conference 2012 Conference Paper

Online Kernel Selection: Algorithms and Evaluations

Tianbao Yang
Mehrdad Mahdavi
Rong Jin
Jinfeng Yi
Steven Hoi

Kernel methods have been successfully applied to many machine learning problems. Nevertheless, since the performance of kernel methods depends heavily on the type of kernels being used, identifying good kernels among a set of given kernels is important to the success of kernel methods. A straightforward approach to address this problem is cross-validation by training a separate classifier for each kernel and choosing the best kernel classifier out of them. Another approach is Multiple Kernel Learning (MKL), which aims to learn a single kernel classifier from an optimal combination of multiple kernels. However, both approaches suffer from a high computational cost in computing the full kernel matrices and in training, especially when the number of kernels or the number of training examples is very large. In this paper, we tackle this problem by proposing an efficient online kernel selection algorithm. It incrementally learns a weight for each kernel classifier. The weight for each kernel classifier can help us to select a good kernel among a set of given kernels. The proposed approach is efficient in that (i) it is an online approach and therefore avoids computing all the full kernel matrices before training; (ii) it only updates a single kernel classifier each time by a sampling technique and therefore saves time on updating kernel classifiers with poor performance; (iii) it has a theoretically guaranteed performance compared to the best kernel predictor. Empirical studies on image classification tasks demonstrate the effectiveness of the proposed approach for selecting a good kernel among a set of kernels.

AAAI Conference 2012 Conference Paper

Random Projection with Filtering for Nearly Duplicate Search

Yue Lin
Rong Jin
Deng Cai
Xiaofei He

High dimensional nearest neighbor search is a fundamental problem and has found applications in many domains. Although many hashing based approaches have been proposed for approximate nearest neighbor search in high dimensional space, one main drawback is that they often return many false positives that need to be filtered out by a post procedure. We propose a novel method to address this limitation in this paper. The key idea is to introduce a filtering procedure within the search algorithm, based on the compressed sensing theory, that effectively removes the false positive answers. We first obtain a sparse representation for each data point by the landmark based approach, after which we solve the nearly duplicate search that the difference between the query and its nearest neighbors forms a sparse vector living in a small `p ball, where p ≤ 1. Our empirical study on real-world datasets demonstrates the effectiveness of the proposed approach compared to the state-of-the-art hashing methods.

NeurIPS Conference 2012 Conference Paper

Semi-Crowdsourced Clustering: Generalizing Crowd Labeling by Robust Distance Metric Learning

Jinfeng Yi
Rong Jin
Shaili Jain
Tianbao Yang
Anil Jain

One of the main challenges in data clustering is to define an appropriate similarity measure between two objects. Crowdclustering addresses this challenge by defining the pairwise similarity based on the manual annotations obtained through crowdsourcing. Despite its encouraging results, a key limitation of crowdclustering is that it can only cluster objects when their manual annotations are available. To address this limitation, we propose a new approach for clustering, called \textit{semi-crowdsourced clustering} that effectively combines the low-level features of objects with the manual annotations of a subset of the objects obtained via crowdsourcing. The key idea is to learn an appropriate similarity measure, based on the low-level features of objects, from the manual annotations of only a small portion of the data to be clustered. One difficulty in learning the pairwise similarity measure is that there is a significant amount of noise and inter-worker variations in the manual annotations obtained via crowdsourcing. We address this difficulty by developing a metric learning algorithm based on the matrix completion method. Our empirical study with two real-world image data sets shows that the proposed algorithm outperforms state-of-the-art distance metric learning algorithms in both clustering accuracy and computational efficiency.

NeurIPS Conference 2012 Conference Paper

Stochastic Gradient Descent with Only One Projection

Mehrdad Mahdavi
Tianbao Yang
Rong Jin
Shenghuo Zhu
Jinfeng Yi

Although many variants of stochastic gradient descent have been proposed for large-scale convex optimization, most of them require projecting the solution at {\it each} iteration to ensure that the obtained solution stays within the feasible domain. For complex domains (e. g. , positive semidefinite cone), the projection step can be computationally expensive, making stochastic gradient descent unattractive for large-scale optimization problems. We address this limitation by developing a novel stochastic gradient descent algorithm that does not need intermediate projections. Instead, only one projection at the last iteration is needed to obtain a feasible solution in the given domain. Our theoretical analysis shows that with a high probability, the proposed algorithms achieve an $O(1/\sqrt{T})$ convergence rate for general convex optimization, and an $O(\ln T/T)$ rate for strongly convex optimization under mild conditions about the domain and the objective function.

JMLR Journal 2012 Journal Article

Trading Regret for Efficiency: Online Convex Optimization with Long Term Constraints

Mehrdad Mahdavi
Rong Jin
Tianbao Yang

In this paper we propose efficient algorithms for solving constrained online convex optimization problems. Our motivation stems from the observation that most algorithms proposed for online convex optimization require a projection onto the convex set K from which the decisions are made. While the projection is straightforward for simple shapes (e.g., Euclidean ball), for arbitrary complex sets it is the main computational challenge and may be inefficient in practice. In this paper, we consider an alternative online convex optimization problem. Instead of requiring that decisions belong to K for all rounds, we only require that the constraints, which define the set K, be satisfied in the long run. By turning the problem into an online convex-concave optimization problem, we propose an efficient algorithm which achieves O(√T) regret bound and O(T 3/4 ) bound on the violation of constraints. Then, we modify the algorithm in order to guarantee that the constraints are satisfied in the long run. This gain is achieved at the price of getting O(T 3/4 ) regret bound. Our second algorithm is based on the mirror prox method (Nemirovski, 2005) to solve variational inequalities which achieves O(T 2/3 ) bound for both regret and the violation of constraints when the domain K can be described by a finite number of linear constraints. Finally, we extend the results to the setting where we only have partial access to the convex set K and propose a multipoint bandit feedback algorithm with the same bounds in expectation as our first algorithm. [abs] [ pdf ][ bib ] &copy JMLR 2012. ( edit, beta )

TIST Journal 2011 Journal Article

Distance metric learning from uncertain side information for automated photo tagging

Lei Wu
Steven C.H. Hoi
Rong Jin
Jianke Zhu
Nenghai Yu

Automated photo tagging is an important technique for many intelligent multimedia information systems, for example, smart photo management system and intelligent digital media library. To attack the challenge, several machine learning techniques have been developed and applied for automated photo tagging. For example, supervised learning techniques have been applied to automated photo tagging by training statistical classifiers from a collection of manually labeled examples. Although the existing approaches work well for small testbeds with relatively small number of annotation words, due to the long-standing challenge of object recognition, they often perform poorly in large-scale problems. Another limitation of the existing approaches is that they require a set of high-quality labeled data, which is not only expensive to collect but also time consuming. In this article, we investigate a social image based annotation scheme by exploiting implicit side information that is available for a large number of social photos from the social web sites. The key challenge of our intelligent annotation scheme is how to learn an effective distance metric based on implicit side information (visual or textual) of social photos. To this end, we present a novel “Probabilistic Distance Metric Learning” (PDML) framework, which can learn optimized metrics by effectively exploiting the implicit side information vastly available on the social web. We apply the proposed technique to photo annotation tasks based on a large social image testbed with over 1 million tagged photos crawled from a social photo sharing portal. Encouraging results show that the proposed technique is effective and promising for social photo based annotation tasks.

JMLR Journal 2011 Journal Article

Double Updating Online Learning

Peilin Zhao
Steven C.H. Hoi
Rong Jin

In most kernel based online learning algorithms, when an incoming instance is misclassified, it will be added into the pool of support vectors and assigned with a weight, which often remains unchanged during the rest of the learning process. This is clearly insufficient since when a new support vector is added, we generally expect the weights of the other existing support vectors to be updated in order to reflect the influence of the added support vector. In this paper, we propose a new online learning method, termed Double Updating Online Learning, or DUOL for short, that explicitly addresses this problem. Instead of only assigning a fixed weight to the misclassified example received at the current trial, the proposed online learning algorithm also tries to update the weight for one of the existing support vectors. We show that the mistake bound can be improved by the proposed online learning method. We conduct an extensive set of empirical evaluations for both binary and multi-class online learning tasks. The experimental results show that the proposed technique is considerably more effective than the state-of-the-art online learning algorithms. The source code is available to public at http://www.cais.ntu.edu.sg/~chhoi/DUOL/. [abs] [ pdf ][ bib ] &copy JMLR 2011. ( edit, beta )

AAAI Conference 2011 Conference Paper

Multi-Task Learning in Square Integrable Space

Wei Wu
Hang Li
Yunhua Hu
Rong Jin

Several kernel based methods for multi-task learning have been proposed, which leverage relations among tasks as regularization to enhance the overall learning accuracies. These methods assume that the tasks share the same kernel, which could limit their applications because in practice diﬀerent tasks may need diﬀerent kernels. The main challenge of introducing multiple kernels into multiple tasks is that models from diﬀerent Reproducing Kernel Hilbert Spaces (RKHSs) are not comparable, making it diﬃcult to exploit relations among tasks. This paper addresses the challenge by formalizing the problem in the Square Integrable Space (SIS). Specially, it proposes a kernel based method which makes use of a regularization term deﬁned in the SIS to represent task relations. We prove a new representer theorem for the proposed approach in SIS. We further derive a practical method for solving the learning problem and conduct consistency analysis of the method. We discuss the relations between our method and an existing method. We also give an SVM based implementation of our method for multi-label classiﬁcation. Experiments on two real-world data sets show that the proposed method performs better than the existing method.

NeurIPS Conference 2010 Conference Paper

Active Learning by Querying Informative and Representative Examples

Sheng-Jun Huang
Rong Jin
Zhi-Hua Zhou

Most active learning approaches select either informative or representative unlabeled instances to query their labels. Although several active learning algorithms have been proposed to combine the two criterions for query selection, they are usually ad hoc in finding unlabeled instances that are both informative and representative. We address this challenge by a principled approach, termed QUIRE, based on the min-max view of active learning. The proposed approach provides a systematic way for measuring and combining the informativeness and representativeness of an instance. Extensive experimental results show that the proposed QUIRE approach outperforms several state-of -the-art active learning approaches.

NeurIPS Conference 2010 Conference Paper

Multi-label Multiple Kernel Learning by Stochastic Approximation: Application to Visual Object Recognition

Serhat Bucak
Rong Jin
Anil Jain

Recent studies have shown that multiple kernel learning is very effective for object recognition, leading to the popularity of kernel learning in computer vision problems. In this work, we develop an efficient algorithm for multi-label multiple kernel learning (ML-MKL). We assume that all the classes under consideration share the same combination of kernel functions, and the objective is to find the optimal kernel combination that benefits all the classes. Although several algorithms have been developed for ML-MKL, their computational cost is linear in the number of classes, making them unscalable when the number of classes is large, a challenge frequently encountered in visual object recognition. We address this computational challenge by developing a framework for ML-MKL that combines the worst-case analysis with stochastic approximation. Our analysis shows that the complexity of our algorithm is $O(m^{1/3}\sqrt{ln m})$, where $m$ is the number of classes. Empirical studies with object recognition show that while achieving similar classification accuracy, the proposed method is significantly more efficient than the state-of-the-art algorithms for ML-MKL.

AAAI Conference 2010 Conference Paper

Smooth Optimization for Effective Multiple Kernel Learning

Zenglin Xu
Rong Jin
Shenghuo Zhu
Michael Lyu
Irwin King

Multiple Kernel Learning (MKL) can be formulated as a convex-concave minmax optimization problem, whose saddle point corresponds to the optimal solution to MKL. Most MKL methods employ the L1-norm simplex constraints on the combination weights of kernels, which therefore involves optimization of a non-smooth function of the kernel weights. These methods usually divide the optimization into two cycles: one cycle deals with the optimization on the kernel combination weights, and the other cycle updates the parameters of SVM. Despite the success of their efficiency, they tend to discard informative complementary kernels. To improve accuracy, we introduce smoothness to the optimization procedure. Furthermore, we transform the optimization into a single smooth convex optimization problem and employ the Nesterov’s method to efficiently solve the optimization problem. Experiments on benchmark data sets demonstrate that the proposed algorithm clearly improves current MKL methods in a number scenarios.

IJCAI Conference 2009 Conference Paper

Zenglin Xu
Rong Jin
Michael R. Lyu
Irwin King

We consider the problem of semi-supervised feature selection, where we are given a small amount of labeled examples and a large amount of unlabeled examples. Since a small number of labeled samples are usually insufﬁcient for identifying the relevant features, the critical problem arising from semi-supervised feature selection is how to take advantage of the information underneath the unlabeled data. To address this problem, we propose a novel discriminative semi-supervised feature selection method based on the idea of manifold regularization. The proposed method selects features through maximizing the classiﬁcation margin between different classes and simultaneously exploiting the geometry of the probability distribution that generates both labeled and unlabeled data. We formulate the proposed feature selection method into a convex-concave optimization problem, where the saddle point corresponds to the optimal solution. To ﬁnd the optimal solution, the level method, a fairly recent optimization method, is employed. We also present a theoretic proof of the convergence rate for the application of the level method to our problem. Empirical evaluation on several benchmark data sets demonstrates the effectiveness of the proposed semi-supervised feature selection method.

NeurIPS Conference 2009 Conference Paper

Adaptive Regularization for Transductive Support Vector Machine

Zenglin Xu
Rong Jin
Jianke Zhu
Irwin King
Michael Lyu
Zhirong Yang

We discuss the framework of Transductive Support Vector Machine (TSVM) from the perspective of the regularization strength induced by the unlabeled data. In this framework, SVM and TSVM can be regarded as a learning machine without regularization and one with full regularization from the unlabeled data, respectively. Therefore, to supplement this framework of the regularization strength, it is necessary to introduce data-dependant partial regularization. To this end, we reformulate TSVM into a form with controllable regularization strength, which includes SVM and TSVM as special cases. Furthermore, we introduce a method of adaptive regularization that is data dependant and is based on the smoothness assumption. Experiments on a set of benchmark data sets indicate the promising results of the proposed work compared with state-of-the-art TSVM algorithms.

NeurIPS Conference 2009 Conference Paper

DUOL: A Double Updating Approach for Online Learning

Peilin Zhao
Steven Hoi
Rong Jin

In most online learning algorithms, the weights assigned to the misclassified examples (or support vectors) remain unchanged during the entire learning process. This is clearly insufficient since when a new misclassified example is added to the pool of support vectors, we generally expect it to affect the weights for the existing support vectors. In this paper, we propose a new online learning method, termed Double Updating Online Learning", or "DUOL" for short. Instead of only assigning a fixed weight to the misclassified example received in current trial, the proposed online learning algorithm also tries to update the weight for one of the existing support vectors. We show that the mistake bound can be significantly improved by the proposed online learning method. Encouraging experimental results show that the proposed technique is in general considerably more effective than the state-of-the-art online learning algorithms. "

NeurIPS Conference 2009 Conference Paper

Learning Bregman Distance Functions and Its Application for Semi-Supervised Clustering

Lei Wu
Rong Jin
Steven Hoi
Jianke Zhu
Nenghai Yu

Learning distance functions with side information plays a key role in many machine learning and data mining applications. Conventional approaches often assume a Mahalanobis distance function. These approaches are limited in two aspects: (i) they are computationally expensive (even infeasible) for high dimensional data because the size of the metric is in the square of dimensionality; (ii) they assume a fixed metric for the entire input space and therefore are unable to handle heterogeneous data. In this paper, we propose a novel scheme that learns nonlinear Bregman distance functions from side information using a non-parametric approach that is similar to support vector machines. The proposed scheme avoids the assumption of fixed metric because its local distance metric is implicitly derived from the Hessian matrix of a convex function that is used to generate the Bregman distance function. We present an efficient learning algorithm for the proposed scheme for distance function learning. The extensive experiments with semi-supervised clustering show the proposed technique (i) outperforms the state-of-the-art approaches for distance function learning, and (ii) is computationally efficient for high dimensional data.

NeurIPS Conference 2009 Conference Paper

Learning to Rank by Optimizing NDCG Measure

Hamed Valizadegan
Rong Jin
Ruofei Zhang
Jianchang Mao

Learning to rank is a relatively new field of study, aiming to learn a ranking function from a set of training data with relevancy labels. The ranking algorithms are often evaluated using Information Retrieval measures, such as Normalized Discounted Cumulative Gain [1] and Mean Average Precision [2]. Until recently, most learning to rank algorithms were not using a loss function related to the above mentioned evaluation measures. The main difficulty in direct optimization of these measures is that they depend on the ranks of documents, not the numerical values output by the ranking function. We propose a probabilistic framework that addresses this challenge by optimizing the expectation of NDCG over all the possible permutations of documents. A relaxation strategy is used to approximate the average of NDCG over the space of permutation, and a bound optimization approach is proposed to make the computation efficient. Extensive experiments show that the proposed algorithm outperforms state-of-the-art ranking algorithms on several benchmark data sets.

NeurIPS Conference 2009 Conference Paper

Regularized Distance Metric Learning:Theory and Algorithm

Rong Jin
Shijun Wang
Yang Zhou

In this paper, we examine the generalization error of regularized distance metric learning. We show that with appropriate constraints, the generalization error of regularized distance metric learning could be independent from the dimensionality, making it suitable for handling high dimensional data. In addition, we present an efficient online learning algorithm for regularized distance metric learning. Our empirical studies with data classification and face recognition show that the proposed algorithm is (i) effective for distance metric learning when compared to the state-of-the-art methods, and (ii) efficient and robust for high dimensional data.

NeurIPS Conference 2008 Conference Paper

An Extended Level Method for Efficient Multiple Kernel Learning

Zenglin Xu
Rong Jin
Irwin King
Michael Lyu

We consider the problem of multiple kernel learning (MKL), which can be formulated as a convex-concave problem. In the past, two efficient methods, i. e. , Semi-Infinite Linear Programming (SILP) and Subgradient Descent (SD), have been proposed for large-scale multiple kernel learning. Despite their success, both methods have their own shortcomings: (a) the SD method utilizes the gradient of only the current solution, and (b) the SILP method does not regularize the approximate solution obtained from the cutting plane model. In this work, we extend the level method, which was originally designed for optimizing non-smooth objective functions, to convex-concave optimization, and apply it to multiple kernel learning. The extended level method overcomes the drawbacks of SILP and SD by exploiting all the gradients computed in past iterations and by regularizing the solution via a projection to a level set. Empirical study with eight UCI datasets shows that the extended level method can significantly improve efficiency by saving on average 91. 9% of computational time over the SILP method and 70. 3% over the SD method.

NeurIPS Conference 2008 Conference Paper

Multi-label Multiple Kernel Learning

Shuiwang Ji
Liang Sun
Rong Jin
Jieping Ye

We present a multi-label multiple kernel learning (MKL) formulation, in which the data are embedded into a low-dimensional space directed by the instance-label correlations encoded into a hypergraph. We formulate the problem in the kernel-induced feature space and propose to learn the kernel matrix as a linear combination of a given collection of kernel matrices in the MKL framework. The proposed learning formulation leads to a non-smooth min-max problem, and it can be cast into a semi-infinite linear program (SILP). We further propose an approximate formulation with a guaranteed error bound which involves an unconstrained and convex optimization problem. In addition, we show that the objective function of the approximate formulation is continuously differentiable with Lipschitz gradient, and hence existing methods can be employed to compute the optimal solution efficiently. We apply the proposed formulation to the automated annotation of Drosophila gene expression pattern images, and promising results have been reported in comparison with representative algorithms.

NeurIPS Conference 2008 Conference Paper

Semi-supervised Learning with Weakly-Related Unlabeled Data : Towards Better Text Categorization

Liu Yang
Rong Jin
Rahul Sukthankar

The cluster assumption is exploited by most semi-supervised learning (SSL) methods. However, if the unlabeled data is merely weakly related to the target classes, it becomes questionable whether driving the decision boundary to the low density regions of the unlabeled data will help the classification. In such case, the cluster assumption may not be valid; and consequently how to leverage this type of unlabeled data to enhance the classification accuracy becomes a challenge. We introduce Semi-supervised Learning with Weakly-Related Unlabeled Data" (SSLW), an inductive method that builds upon the maximum-margin approach, towards a better usage of weakly-related unlabeled information. Although the SSLW could improve a wide range of classification tasks, in this paper, we focus on text categorization with a small training pool. The key assumption behind this work is that, even with different topics, the word usage patterns across different corpora tends to be consistent. To this end, SSLW estimates the optimal word-correlation matrix that is consistent with both the co-occurrence information derived from the weakly-related unlabeled documents and the labeled documents. For empirical evaluation, we present a direct comparison with a number of state-of-the-art methods for inductive semi-supervised learning and text categorization; and we show that SSLW results in a significant improvement in categorization accuracy, equipped with a small training set and an unlabeled resource that is weakly related to the test beds. "

NeurIPS Conference 2007 Conference Paper

Efficient Convex Relaxation for Transductive Support Vector Machine

Zenglin Xu
Rong Jin
Jianke Zhu
Irwin King
Michael Lyu

We consider the problem of Support Vector Machine transduction, which involves a combinatorial problem with exponential computational complexity in the number of unlabeled examples. Although several studies are devoted to Transductive SVM, they suffer either from the high computation complexity or from the solutions of local optimum. To address this problem, we propose solving Transductive SVM via a convex relaxation, which converts the NP-hard problem to a semi-definite programming. Compared with the other SDP relaxation for Transductive SVM, the proposed algorithm is computationally more efficient with the number of free parameters reduced from O(n2) to O(n) where n is the number of examples. Empirical study with several benchmark data sets shows the promising performance of the proposed algorithm in comparison with other state-of-the-art implementations of Transductive SVM.

NeurIPS Conference 2006 Conference Paper

Generalized Maximum Margin Clustering and Unsupervised Kernel Learning

Hamed Valizadegan
Rong Jin

Maximum margin clustering was proposed lately and has shown promising performance in recent studies [1, 2]. It extends the theory of support vector machine to unsupervised learning. Despite its good performance, there are three ma jor problems with maximum margin clustering that question its efficiency for real-world applications. First, it is computationally expensive and difficult to scale to large-scale datasets because the number of parameters in maximum margin clustering is quadratic in the number of examples. Second, it requires data preprocessing to ensure that any clustering boundary will pass through the origins, which makes it unsuitable for clustering unbalanced dataset. Third, it is sensitive to the choice of kernel functions, and requires external procedure to determine the appropriate values for the parameters of kernel functions. In this paper, we propose "generalized maximum margin clustering" framework that addresses the above three problems simultaneously. The new framework generalizes the maximum margin clustering algorithm by allowing any clustering boundaries including those not passing through the origins. It significantly improves the computational efficiency by reducing the number of parameters. Furthermore, the new framework is able to automatically determine the appropriate kernel matrix without any labeled data. Finally, we show a formal connection between maximum margin clustering and spectral clustering. We demonstrate the efficiency of the generalized maximum margin clustering algorithm using both synthetic datasets and real datasets from the UCI repository.

IJCAI Conference 2005 Conference Paper

A Novel Approach to Model Generation for Heterogeneous Data Classification

Rong Jin
Huan

Ensemble methods such as bagging and boosting have been successfully applied to classification problems. Two important issues associated with an ensemble approach are: how to generate models to construct an ensemble, and how to combine them for classification. In this paper, we focus on the problem of model generation for heterogeneous data classification. If we could partition heterogeneous data into a number of homogeneous partitions, we will likely generate reliable and accurate classification models over the homogeneous partitions. We examine different ways of forming homogeneous subsets and propose a novel method that allows a data point to be assigned multiple times in order to generate homogeneous partitions for ensemble learning. We present the details of the new algorithm and empirical studies over the UCI benchmark datasets and datasets of image classification, and show that the proposed approach is effective for heterogeneous data classification.

NeurIPS Conference 2005 Conference Paper

A Probabilistic Approach for Optimizing Spectral Clustering

Rong Jin
Feng Kang
Chris Ding

Spectral clustering enjoys its success in both data clustering and semisupervised learning. But, most spectral clustering algorithms cannot handle multi-class clustering problems directly. Additional strategies are needed to extend spectral clustering algorithms to multi-class clustering problems. Furthermore, most spectral clustering algorithms employ hard cluster membership, which is likely to be trapped by the local optimum. In this paper, we present a new spectral clustering algorithm, named "Soft Cut". It improves the normalized cut algorithm by introducing soft membership, and can be efficiently computed using a bound optimization algorithm. Our experiments with a variety of datasets have shown the promising performance of the proposed clustering algorithm.

IJCAI Conference 2005 Conference Paper

Learning with Labeled Sessions

Rong Jin
Huan

Traditional supervised learning deals with labeled instances. In many applications such as physiological data modeling and speaker identification, however, training examples are often labeled objects and each of the labeled objects consists of multiple unlabeled instances. When classifying a new object, its class is determined by the majority of its instance classes. As a consequence of this decision rule, one challenge to learning with labeled objects (or sessions) is to determine during training which subset of the instances inside an object should belong to the class of the object. We call this type of learning ‘session-based learning’ to distinguish it from the traditional supervised learning. In this paper, we introduce session-based learning problems, give a formal description of session-based learning in the context of related work, and propose an approach that is particularly designed for sessionbased learning. Empirical studies with UCI datasets and real-world data show that the proposed approach is effective for session-based learning.

NeurIPS Conference 2002 Conference Paper

Learning with Multiple Labels

Rong Jin
Zoubin Ghahramani

In this paper, we study a special kind of learning problem in which each training instance is given a set of (or distribution over) candidate class labels and only one of the candidate labels is the correct one. Such a problem can occur, e. g. , in an information retrieval setting where a set of words is associated with an image, or if classes labels are organized hierarchically. We propose a novel discriminative approach for handling the ambiguity of class labels in the training examples. The experiments with the proposed approach over five different UCI datasets show that our approach is able to find the correct label among the set of candidate labels and actually achieve performance close to the case when each training instance is given a single correct label. In contrast, naIve methods degrade rapidly as more ambiguity is introduced into the labels.