Arrow Research search

Author name cluster

Ji Liu

Possible papers associated with this exact author name in Arrow. This page groups case-insensitive exact name matches and is not a full identity disambiguation profile.

67 papers
2 author rows

Possible papers (67)

AAAI Conference 2026 Conference Paper

DiffBench Meets DiffAgent: End-to-End LLM-Driven Diffusion Acceleration Code Generation

  • Jiajun Jiao
  • Haowei Zhu
  • Puyuan Yang
  • Jianghui Wang
  • Ji Liu
  • Ziqiong Liu
  • Dong Li
  • Yuejian Fang

Diffusion models have achieved remarkable success in image and video generation. However, their inherently multi-step inference process imposes substantial computational overhead, hindering real-world deployment. Accelerating diffusion models is therefore essential, yet determining how to combine multiple model acceleration techniques remains a significant challenge. To address this issue, we introduce a framework driven by large language models (LLMs) for automated acceleration code generation and evaluation. First, we present DiffBench, a comprehensive benchmark that implements a three-stage automated evaluation pipeline across diverse diffusion architectures, optimization combinations, and deployment scenarios. Second, we propose DiffAgent, an agent that generates optimal acceleration strategies and code for arbitrary diffusion models. DiffAgent employs a closed-loop workflow in which a planning component and a debugging component iteratively refine the output of a code generation component, while a genetic algorithm extracts performance feedback from the execution environment to guide subsequent code refinements. We provide a detailed explanation of the DiffBench construction and the design principles underlying DiffAgent. Extensive experiments show that DiffBench offers a thorough evaluation of generated code and that DiffAgent significantly outperforms existing LLMs in producing effective diffusion acceleration strategies.
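
The closed-loop workflow lends itself to a compact sketch. Below is a minimal, heavily simplified Python rendering of the plan/generate/debug/evolve loop the abstract describes; every function here (plan_strategies, run_and_profile, debug, and the strategy options) is a hypothetical stub standing in for the LLM-backed components and the DiffBench execution environment, not the authors' implementation.

```python
import random

# Hypothetical stubs: the real DiffAgent prompts an LLM for planning, code
# generation, and debugging, and DiffBench executes the generated scripts.
OPTIONS = ["step-skipping", "feature-caching", "int8-quant", "tiny-vae"]

def plan_strategies(n=4):
    return [tuple(random.sample(OPTIONS, k=2)) for _ in range(n)]

def generate_code(strategy):
    return f"# acceleration script applying {strategy}"

def run_and_profile(code):
    return random.random() > 0.2, random.uniform(0.5, 2.0)  # (success, latency)

def debug(code):
    return code + "  # repaired"

def diffagent_loop(generations=3):
    population = plan_strategies()
    best = None
    for _ in range(generations):
        scored = []
        for strategy in population:
            code = generate_code(strategy)
            ok, latency = run_and_profile(code)
            if not ok:                    # debugging component repairs failures
                ok, latency = run_and_profile(debug(code))
            if ok:
                scored.append((latency, strategy))
        scored.sort()                     # genetic step: keep the fastest...
        elite = [s for _, s in scored[:2]]
        population = elite + plan_strategies(n=2)  # ...and refill with new plans
        best = scored[0] if scored else best
    return best

print(diffagent_loop())
```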

AAAI Conference 2026 Conference Paper

Learnable Permutation for Structured Sparsity on Transformer Models

  • Zekai Li
  • Ji Liu
  • Guanchen Li
  • Yixing Xu
  • Ziqiong Liu
  • Xuanwu Yin
  • Dong Li
  • Emad Barsoum

Structured sparsity has emerged as a popular model pruning technique, widely adopted in various architectures, including CNNs, Transformer models, and especially large language models (LLMs) in recent years. A promising direction to further improve post-pruning performance is weight permutation, which reorders model weights into patterns more amenable to pruning. However, the exponential growth of the permutation search space with the scale of Transformer architectures forces most methods to rely on greedy or heuristic algorithms, limiting the effectiveness of reordering. In this work, we propose a novel end-to-end learnable permutation framework. Our method introduces a learnable permutation cost matrix to quantify the cost of swapping any two input channels of a given weight matrix, a differentiable bipartite matching solver to obtain the optimal binary permutation matrix given a cost matrix, and a sparsity optimization loss function to directly optimize the permutation operator. We extensively validate our approach on vision and language Transformers, demonstrating that our method achieves state-of-the-art permutation results for structured sparsity.
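
A rough sketch of the framework's three pieces, under stated assumptions: the cost matrix is a plain learnable parameter, Sinkhorn iterations provide a soft doubly stochastic relaxation, Hungarian matching plus a straight-through estimator stands in for the differentiable bipartite matching solver (the paper's solver may differ), and a 2:4 magnitude objective plays the role of the sparsity optimization loss.

```python
import torch
from scipy.optimize import linear_sum_assignment

def sinkhorn(logits, iters=20):
    """Alternating row/column log-normalization: a soft permutation matrix."""
    log_p = logits
    for _ in range(iters):
        log_p = log_p - log_p.logsumexp(dim=1, keepdim=True)
        log_p = log_p - log_p.logsumexp(dim=0, keepdim=True)
    return log_p.exp()

def hard_permutation(soft_p):
    """Hungarian matching yields the binary permutation; the straight-through
    trick keeps the pipeline differentiable end to end."""
    rows, cols = linear_sum_assignment(-soft_p.detach().numpy())
    hard = torch.zeros_like(soft_p)
    hard[torch.as_tensor(rows), torch.as_tensor(cols)] = 1.0
    return hard + soft_p - soft_p.detach()

def nm_sparsity_loss(w, n=2, m=4):
    """Magnitude an N:M (2:4) pruner would remove: the (m - n) smallest |w|
    in every group of m consecutive weights."""
    groups = w.abs().reshape(-1, m)
    return groups.sort(dim=1).values[:, : m - n].sum()

torch.manual_seed(0)
weight = torch.randn(64, 64)                          # frozen layer weight
cost = torch.nn.Parameter(0.1 * torch.randn(64, 64))  # learnable permutation cost
opt = torch.optim.Adam([cost], lr=0.05)

for _ in range(200):
    perm = hard_permutation(sinkhorn(cost))
    loss = nm_sparsity_loss(weight @ perm)  # permute channels, score 2:4 pruning
    opt.zero_grad(); loss.backward(); opt.step()

print("final 2:4 pruning loss:", float(loss))
```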

AAAI Conference 2025 Conference Paper

EGSRAL: An Enhanced 3D Gaussian Splatting Based Renderer with Automated Labeling for Large-Scale Driving Scene

  • Yixiong Huo
  • Guangfeng Jiang
  • Hongyang Wei
  • Ji Liu
  • Song Zhang
  • Han Liu
  • Xingliang Huang
  • Mingjie Lu

3D Gaussian Splatting (3D GS) has gained popularity due to its faster rendering speed and high-quality novel view synthesis. Some researchers have explored using 3D GS for reconstructing driving scenes. However, these methods often rely on various types of data, such as depth maps, 3D bounding boxes, and trajectories of moving objects. Additionally, the lack of annotations for synthesized images limits their direct application in downstream tasks. To address these issues, we propose EGSRAL, a 3D GS-based method that relies solely on training images without extra annotations. EGSRAL enhances 3D GS's capability to model both dynamic objects and static backgrounds and introduces a novel adaptor for auto labeling, generating corresponding annotations based on existing annotations. We also propose a grouping strategy for vanilla 3D GS to address perspective issues in rendering large-scale, complex scenes. Our method achieves state-of-the-art performance on multiple datasets without any extra annotation. For example, the PSNR metric reaches 29.04 on the nuScenes dataset. Moreover, our automated labeling can significantly improve the performance of 2D/3D detection tasks.

NeurIPS Conference 2025 Conference Paper

Týr-the-Pruner: Structural Pruning LLMs via Global Sparsity Distribution Optimization

  • Guanchen Li
  • Yixing Xu
  • Zeping Li
  • Ji Liu
  • Xuanwu Yin
  • Dong Li
  • Emad Barsoum

Structural pruning enhances hardware-agnostic inference efficiency for large language models (LLMs) yet often fails to maintain comparable performance. Local pruning performs efficient layer-by-layer compression but ignores global topology. Although global pruning aims to identify an optimal sparse model, intuitive methods typically adopt a two-stage paradigm that first evaluates substructure saliency and then applies global pruning, which ignores inter-structure dependencies and fails to achieve end-to-end optimization. To address these limitations, we propose Týr-the-Pruner, an efficient end-to-end search-based global structural pruning framework. This framework constructs a supernet by repeatedly applying local pruning across a range of sparsity ratios to each layer in an LLM, with the core goal of determining the optimal sparsity distribution under a target overall sparsity ratio. Concretely, we introduce an effective local pruning method and an expectation error accumulation approach to improve supernet construction. Furthermore, we employ an iterative prune-and-search strategy with coarse-to-fine sparsity granularity to ensure efficient search convergence. Experimental results show that Týr-the-Pruner achieves state-of-the-art structural pruning, retaining 97% of the dense model's performance while removing a challenging 50% of Llama-3.1-70B's parameters.
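
To make the search idea concrete, here is a toy mutation-based search over per-layer sparsity ratios under a fixed overall budget; the proxy loss and the mutation rule are illustrative placeholders, not the paper's supernet evaluation or its prune-and-search schedule.

```python
import random

LEVELS = [0.0, 0.25, 0.5, 0.75]   # candidate sparsity ratios per layer
N_LAYERS, TARGET = 8, 0.5         # overall sparsity budget

def proxy_loss(dist):
    # Stand-in for evaluating a supernet slice on calibration data; here
    # deeper layers are (arbitrarily) assumed more sensitive to pruning.
    return sum(s * (1 + 0.2 * i) for i, s in enumerate(dist))

def mutate(dist):
    # Swap sparsity mass between two layers so the mean ratio stays on budget.
    child = list(dist)
    i, j = random.sample(range(N_LAYERS), 2)
    if child[i] < LEVELS[-1] and child[j] > LEVELS[0]:
        child[i] += 0.25
        child[j] -= 0.25
    return child

best = [TARGET] * N_LAYERS        # start from a uniform 50% distribution
for _ in range(500):
    cand = mutate(best)
    if proxy_loss(cand) < proxy_loss(best):
        best = cand

assert abs(sum(best) / N_LAYERS - TARGET) < 1e-9   # budget preserved
print("searched sparsity distribution:", best)
```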

TMLR Journal 2024 Journal Article

Byzantine-Resilient Decentralized Multi-Armed Bandits

  • Jingxuan Zhu
  • Alec Koppel
  • Alvaro Velasquez
  • Ji Liu

In decentralized cooperative multi-armed bandits (MAB), each agent observes a distinct stream of rewards, and seeks to exchange information with others to select a sequence of arms so as to minimize its regret. Agents in the cooperative setting can outperform a single agent running a MAB method such as Upper-Confidence Bound (UCB) independently. In this work, we study how to recover such salient behavior when an unknown fraction of the agents can be Byzantine, that is, communicate arbitrarily wrong information in the form of reward mean-estimates or confidence sets. This framework can be used to model attackers in computer networks, instigators of offensive content into recommender systems, or manipulators of financial markets. Our key contribution is the development of a fully decentralized resilient upper confidence bound (UCB) algorithm that fuses an information mixing step among agents with a truncation of inconsistent and extreme values. This truncation step enables us to establish that the performance of each normal agent is no worse than the classic single-agent UCB1 algorithm in terms of regret, and more importantly, that the cumulative regret of all normal agents is strictly better than the non-cooperative case, provided that each agent has at least $3f+1$ neighbors, where $f$ is the maximum possible number of Byzantine agents in each agent's neighborhood. Extensions to time-varying neighbor graphs and minimax lower bounds on the achievable regret are further established. Experiments corroborate the merits of this framework in practice.
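
A minimal sketch of the truncation idea, under stated assumptions: each agent drops the f largest and f smallest neighbor reports before mixing, then applies a standard UCB1 index to the fused estimates. The exact truncation rule and index in the paper may differ.

```python
import numpy as np

def trimmed_fusion(own_est, neighbor_ests, f):
    """Truncate inconsistent/extreme values: drop the f largest and f smallest
    neighbor reports, then average the rest with the agent's own estimate."""
    vals = np.sort(np.asarray(neighbor_ests))
    kept = vals[f : len(vals) - f]
    return np.mean(np.concatenate([[own_est], kept]))

def ucb_index(means, counts, t, c=2.0):
    """Classic UCB1 arm index computed on the fused mean estimates."""
    bonus = np.sqrt(c * np.log(t + 1) / np.maximum(counts, 1))
    return int(np.argmax(means + bonus))

f = 1                                  # Byzantine reports tolerated per neighborhood
reports = [0.58, 0.61, 0.59, 5.0]      # one adversarial outlier (5.0)
fused = trimmed_fusion(0.60, reports, f)
print(fused)                           # ~0.60: the outlier is truncated away
print(ucb_index(np.array([fused, 0.4]), np.array([3, 3]), t=10))
```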

NeurIPS Conference 2024 Conference Paper

DiP-GO: A Diffusion Pruner via Few-step Gradient Optimization

  • Haowei Zhu
  • Dehua Tang
  • Ji Liu
  • Mingjie Lu
  • Jintu Zheng
  • Jinzhang Peng
  • Dong Li
  • Yu Wang

Diffusion models have achieved remarkable progress in the field of image generation due to their outstanding capabilities. However, these models require substantial computing resources because of the multi-step denoising process during inference. While traditional pruning methods have been employed to optimize these models, the retraining process necessitates large-scale training datasets and extensive computational costs to maintain generalization ability, making it neither convenient nor efficient. Recent studies attempt to utilize the similarity of features across adjacent denoising stages to reduce computational costs through simple and static strategies. However, these strategies cannot fully harness the potential of the similar feature patterns across adjacent timesteps. In this work, we propose a novel pruning method that derives an efficient diffusion model via a more intelligent and differentiable pruner. At the core of our approach is casting the model pruning process into a SubNet search process. Specifically, we first introduce a SuperNet based on standard diffusion via adding some backup connections built upon the similar features. We then construct a plugin pruner network and design optimization losses to identify redundant computation. Finally, our method can identify an optimal SubNet through few-step gradient optimization and a simple post-processing procedure. We conduct extensive experiments on various diffusion models including the Stable Diffusion series and DiTs. Our DiP-GO approach achieves a 4.4× speedup for SD-1.5 without any loss of accuracy, significantly outperforming the previous state-of-the-art methods.

TIST Journal 2024 Journal Article

Efficient Federated Learning Using Dynamic Update and Adaptive Pruning with Momentum on Shared Server Data

  • Ji Liu
  • Juncheng Jia
  • Hong Zhang
  • Yuhui Yun
  • Leye Wang
  • Yang Zhou
  • Huaiyu Dai
  • Dejing Dou

Despite achieving remarkable performance, Federated Learning (FL) encounters two important problems, i.e., low training efficiency and limited computational resources. In this article, we propose a new FL framework, i.e., FedDUMAP, with three original contributions, to leverage the shared insensitive data on the server in addition to the distributed data in edge devices so as to efficiently train a global model. First, we propose a simple dynamic server update algorithm, which takes advantage of the shared insensitive data on the server while dynamically adjusting the update steps on the server in order to speed up the convergence and improve the accuracy. Second, we propose an adaptive optimization method with the dynamic server update algorithm to exploit the global momentum on the server and each local device for superior accuracy. Third, we develop a layer-adaptive model pruning method to carry out specific pruning operations, which is adapted to the diverse features of each layer so as to attain an excellent tradeoff between effectiveness and efficiency. Our proposed FL model, FedDUMAP, combines the three original techniques and has a significantly better performance compared with baseline approaches in terms of efficiency (up to 16.9 times faster), accuracy (up to 20.4% higher), and computational cost (up to 62.6% smaller).

AAAI Conference 2024 Conference Paper

FedASMU: Efficient Asynchronous Federated Learning with Dynamic Staleness-Aware Model Update

  • Ji Liu
  • Juncheng Jia
  • Tianshi Che
  • Chao Huo
  • Jiaxiang Ren
  • Yang Zhou
  • Huaiyu Dai
  • Dejing Dou

As a promising approach to deal with distributed data, Federated Learning (FL) has achieved major advancements in recent years. FL enables collaborative model training by exploiting the raw data dispersed in multiple edge devices. However, the data is generally non-independent and identically distributed, i.e., statistical heterogeneity, and the edge devices significantly differ in terms of both computation and communication capacity, i.e., system heterogeneity. The statistical heterogeneity leads to severe accuracy degradation while the system heterogeneity significantly prolongs the training process. In order to address the heterogeneity issue, we propose an Asynchronous Staleness-aware Model Update FL framework, i.e., FedASMU, with two novel methods. First, we propose an asynchronous FL system model with a dynamic model aggregation method between updated local models and the global model on the server, for superior accuracy and high efficiency. Then, we propose an adaptive local model adjustment method by aggregating the fresh global model with local models on devices to further improve the accuracy. Extensive experimentation with 6 models and 5 public datasets demonstrates that FedASMU significantly outperforms baseline approaches in terms of accuracy (0.60% to 23.90% higher) and efficiency (3.54% to 97.98% faster).
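
A small sketch of server-side staleness-aware mixing, assuming a simple polynomial staleness discount; FedASMU adjusts the aggregation dynamically, so the decay rule and constants below are only placeholders.

```python
import torch

def staleness_weight(tau, alpha=0.6):
    """Hypothetical staleness discount: the later an update arrives, the
    smaller its mixing weight (FedASMU adapts this weight dynamically)."""
    return alpha / (1.0 + tau) ** 0.5

@torch.no_grad()
def async_aggregate(global_model, local_model, staleness):
    """Server-side mixing of one stale local model into the global model."""
    w = staleness_weight(staleness)
    for g, l in zip(global_model.parameters(), local_model.parameters()):
        g.mul_(1 - w).add_(w * l)

net_g, net_l = torch.nn.Linear(4, 2), torch.nn.Linear(4, 2)
async_aggregate(net_g, net_l, staleness=3)   # this update arrived 3 rounds late
```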

AAAI Conference 2024 Conference Paper

G-LIME: Statistical Learning for Local Interpretations of Deep Neural Networks Using Global Priors (Abstract Reprint)

  • Xuhong Li
  • Haoyi Xiong
  • Xingjian Li
  • Xiao Zhang
  • Ji Liu
  • Haiyan Jiang
  • Zeyu Chen
  • Dejing Dou

To explain the prediction result of a Deep Neural Network (DNN) model on a given sample, LIME [1] and its derivatives have been proposed to approximate the local behavior of the DNN model around the data point via linear surrogates. Though these algorithms interpret the DNN by finding the key features used for classification, the random interpolations used by LIME perturb the explanation result and cause instability and inconsistency between repetitions of LIME computations. To tackle this issue, we propose G-LIME, which extends vanilla LIME through high-dimensional Bayesian linear regression using sparsity-inducing and informative global priors. Specifically, with a dataset representing the population of samples (e.g., the training set), G-LIME first pursues the global explanation of the DNN model using the whole dataset. Then, given a new data point, G-LIME incorporates a modified ElasticNet-like estimator to refine the local explanation result by balancing the distance to the global explanation against the sparsity/feature selection in the explanation. Finally, G-LIME uses Least Angle Regression (LARS) to retrieve the solution path of a modified ElasticNet under varying ℓ1-regularization, screening and ranking the importance of features [2] as the explanation result. Through extensive experiments on real-world tasks, we show that the proposed method yields more stable, consistent, and accurate results compared to LIME.
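
The LARS-path ranking step can be illustrated with scikit-learn. This sketch fits a plain Lasso path on local perturbations around the explained sample and ranks features by their order of entry into the path; it omits the global ElasticNet prior that distinguishes G-LIME, and the black-box function is a toy stand-in.

```python
import numpy as np
from sklearn.linear_model import lars_path

rng = np.random.default_rng(0)

def black_box(X):                # toy stand-in for the DNN being explained
    return 3 * X[:, 0] - 2 * X[:, 3] + 0.1 * rng.standard_normal(len(X))

x0 = rng.standard_normal(10)                    # the sample to explain
X = x0 + 0.1 * rng.standard_normal((500, 10))   # local perturbations around x0
y = black_box(X)

# LARS traces the solution path as the l1 penalty varies; the order in which
# coefficients become nonzero ranks the features' importance.
alphas, order, coefs = lars_path(X - X.mean(0), y - y.mean(), method="lasso")
print("features ranked by path entry:", order)  # indices 0 and 3 enter first
```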

AAAI Conference 2024 Conference Paper

UPDP: A Unified Progressive Depth Pruner for CNN and Vision Transformer

  • Ji Liu
  • Dehua Tang
  • Yuanxian Huang
  • Li Zhang
  • Xiaocheng Zeng
  • Dong Li
  • Mingjie Lu
  • Jinzhang Peng

Traditional channel-wise pruning methods, which reduce network channels, struggle to effectively prune efficient CNN models with depth-wise convolutional layers and certain efficient modules, such as popular inverted residual blocks. Prior depth pruning methods, which reduce network depth, are not suitable for pruning some efficient models due to the existence of certain normalization layers. Moreover, finetuning a subnet after directly removing activation layers corrupts the original model weights, hindering the pruned model from achieving high performance. To address these issues, we propose a novel depth pruning method for efficient models. Our approach introduces a novel block pruning strategy and a progressive training method for the subnet. Additionally, we extend our pruning method to vision transformer models. Experimental results demonstrate that our method consistently outperforms existing depth pruning methods across various pruning configurations. By applying our method to ConvNeXtV1, we obtained three pruned ConvNeXtV1 models that surpass most SOTA efficient models with comparable inference performance. Our method also achieves state-of-the-art pruning performance on the vision transformer model.

IJCAI Conference 2023 Conference Paper

Accurate MRI Reconstruction via Multi-Domain Recurrent Networks

  • Jinbao Wei
  • Zhijie Wang
  • Kongqiao Wang
  • Li Guo
  • Xueyang Fu
  • Ji Liu
  • Xun Chen

In recent years, deep convolutional neural networks (CNNs) have become dominant in MRI reconstruction from undersampled k-space. However, most existing CNN methods reconstruct the undersampled images either in the spatial domain or in the frequency domain, neglecting the correlation between these two domains. This hinders further improvement in reconstruction performance. To tackle this issue, we propose a new multi-domain recurrent network (MDR-Net) with multi-domain learning (MDL) blocks as its basic units to reconstruct the undersampled MR image progressively. Specifically, the MDL block interactively processes the local spatial features and the global frequency information to facilitate complementary learning, leading to fine-grained feature generation. Furthermore, we introduce an effective frequency-based loss to narrow the frequency spectrum gap, compensating for the over-smoothness caused by the widely used spatial reconstruction loss. Extensive experiments on public fastMRI datasets demonstrate that our MDR-Net consistently outperforms other competitive methods and is able to provide more details.
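
A plausible form of such a frequency-based loss, as a sketch (the paper's exact formulation may differ): compare the 2-D Fourier spectra of the reconstruction and the target alongside the usual spatial term.

```python
import torch

def frequency_loss(pred, target):
    """L1 distance between 2-D Fourier spectra of prediction and target."""
    return (torch.fft.fft2(pred) - torch.fft.fft2(target)).abs().mean()

def reconstruction_loss(pred, target, lam=0.1):
    # Spatial L1 term plus the frequency term that combats over-smoothing;
    # the weight lam is illustrative.
    return (pred - target).abs().mean() + lam * frequency_loss(pred, target)

pred = torch.rand(1, 1, 64, 64, requires_grad=True)
target = torch.rand(1, 1, 64, 64)
reconstruction_loss(pred, target).backward()
```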

AAAI Conference 2023 Conference Paper

PINAT: A Permutation INvariance Augmented Transformer for NAS Predictor

  • Shun Lu
  • Yu Hu
  • Peihao Wang
  • Yan Han
  • Jianchao Tan
  • Jixiang Li
  • Sen Yang
  • Ji Liu

Time-consuming performance evaluation is the bottleneck of traditional Neural Architecture Search (NAS) methods. Predictor-based NAS can speed up performance evaluation by directly predicting performance, rather than training a large number of sub-models and then validating their performance. Most predictor-based NAS approaches use a proxy dataset to train model-based predictors efficiently but suffer from performance degradation and generalization problems. We attribute these problems to the poor ability of existing predictors to characterize the sub-models' structure, specifically the topology information extraction and the node feature representation of the input graph data. To address these problems, we propose PINAT, a Transformer-like NAS predictor consisting of a Permutation INvariance Augmentation module serving as both the token embedding layer and the self-attention head, with a Laplacian matrix as the positional encoding. Our design produces more representative features of the encoded architecture and outperforms state-of-the-art NAS predictors on six search spaces: NAS-Bench-101, NAS-Bench-201, DARTS, ProxylessNAS, PPI, and ModelNet. The code is available at https://github.com/ShunLu91/PINAT.
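
The Laplacian positional encoding is straightforward to compute. A minimal sketch: take the k eigenvectors with the smallest eigenvalues of the normalized graph Laplacian of an architecture's cell graph as per-node position vectors (the value of k and the normalization choice here are assumptions).

```python
import numpy as np

def laplacian_positional_encoding(adj, k=4):
    """Eigenvectors of the normalized graph Laplacian with the k smallest
    eigenvalues serve as node positional encodings."""
    deg = adj.sum(1)
    d_inv_sqrt = np.diag(1.0 / np.sqrt(np.maximum(deg, 1e-12)))
    lap = np.eye(len(adj)) - d_inv_sqrt @ adj @ d_inv_sqrt
    eigvals, eigvecs = np.linalg.eigh(lap)   # eigh sorts eigenvalues ascending
    return eigvecs[:, :k]                    # one k-dim position vector per node

# A tiny 5-node cell graph (undirected adjacency for the Laplacian).
adj = np.array([[0, 1, 1, 0, 0],
                [1, 0, 0, 1, 0],
                [1, 0, 0, 1, 0],
                [0, 1, 1, 0, 1],
                [0, 0, 0, 1, 0]], dtype=float)
print(laplacian_positional_encoding(adj))
```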

AAAI Conference 2023 Conference Paper

ProxyBO: Accelerating Neural Architecture Search via Bayesian Optimization with Zero-Cost Proxies

  • Yu Shen
  • Yang Li
  • Jian Zheng
  • Wentao Zhang
  • Peng Yao
  • Jixiang Li
  • Sen Yang
  • Ji Liu

Designing neural architectures requires immense manual effort. This has promoted the development of neural architecture search (NAS) to automate the design. While previous NAS methods achieve promising results, they run slowly; zero-cost proxies run extremely fast but are less promising. Therefore, there is great potential to accelerate NAS with such zero-cost proxies. The existing method has two limitations: unforeseeable reliability and one-shot usage. To address these limitations, we present ProxyBO, an efficient Bayesian optimization (BO) framework that utilizes zero-cost proxies to accelerate neural architecture search. We apply a generalization-ability measurement to estimate the fitness of proxies on the task during each iteration and design a novel acquisition function to combine BO with zero-cost proxies based on their dynamic influence. Extensive empirical studies show that ProxyBO consistently outperforms competitive baselines on five tasks from three public benchmarks. Concretely, ProxyBO achieves up to 5.41× and 3.86× speedups over the state-of-the-art approaches REA and BRP-NAS, respectively.

AAAI Conference 2023 Conference Paper

Quality-Aware Self-Training on Differentiable Synthesis of Rare Relational Data

  • Chongsheng Zhang
  • Yaxin Hou
  • Ke Chen
  • Shuang Cao
  • Gaojuan Fan
  • Ji Liu

Data scarcity is a very common real-world problem that poses a major challenge to data-driven analytics. Although many data-balancing approaches have been proposed to mitigate this problem, they may drop useful information or fall into the overfitting problem. Generative Adversarial Network (GAN) based data synthesis methods can alleviate such a problem but lack quality control over the generated samples. Moreover, the latent associations between the attribute set and the class labels in relational data cannot be easily captured by a vanilla GAN. In light of this, we introduce an end-to-end self-training scheme (namely, Quality-Aware Self-Training) for rare relational data synthesis, which generates labeled synthetic data via pseudo labeling on GAN-based synthesis. We design a semantic pseudo labeling module to first control the quality of the generated features/samples, then calibrate their semantic labels via a classifier committee consisting of multiple pre-trained shallow classifiers. The high-confidence generated samples with calibrated pseudo labels are then fed into a semantic classification network as augmented samples for self-training. We conduct extensive experiments on 20 benchmark datasets of different domains, including 14 industrial datasets. The results show that our method significantly outperforms state-of-the-art methods, including two recent GAN-based data synthesis schemes. Code is available at https://github.com/yaxinhou/QAST.

JBHI Journal 2023 Journal Article

SpindlesTracker: An Automatic and Low-Cost Labeled Workflow for Spindle Analysis

  • Zhongzhong Li
  • Yanze Jian
  • Jiatong Hu
  • Chuanmin Zhang
  • Xuechun Meng
  • Ji Liu

Quantitative analysis of spindle dynamics in mitosis through fluorescence microscopy requires tracking spindle elongation in noisy image sequences. Deterministic methods, which use typical microtubule detection and tracking methods, perform poorly against the sophisticated background of spindles. In addition, the expensive cost of data labeling also limits the application of machine learning in this field. Here we present SpindlesTracker, a fully automatic and low-cost labeled workflow that efficiently analyzes the dynamic spindle mechanism in time-lapse images. In this workflow, we design a network named YOLOX-SP which can accurately detect the location and endpoint of each spindle under box-level data supervision. We then optimize the SORT and MCP algorithms for spindle tracking and skeletonization. As there was no publicly available dataset, we annotated an S. pombe dataset, acquired entirely from the real world, for both training and evaluation. Extensive experiments demonstrate that SpindlesTracker achieves excellent performance in all aspects while reducing labeling costs by 60%. Specifically, it achieves 84.1% mAP in spindle detection and over 90% accuracy in endpoint detection. Furthermore, the improved algorithm enhances tracking accuracy by 1.3% and tracking precision by 6.5%. Statistical results also indicate that the mean error of spindle length is within 1 μm. In summary, SpindlesTracker holds significant implications for the study of mitotic dynamic mechanisms and can be readily extended to the analysis of other filamentous objects. The code and the dataset are both released on GitHub.

TIST Journal 2022 Journal Article

Decentralized Online Learning: Take Benefits from Others’ Data without Sharing Your Own to Track Global Trend

  • Wendi Wu
  • Zongren Li
  • Yawei Zhao
  • Chen Yu
  • Peilin Zhao
  • Ji Liu
  • Kunlun He

Decentralized online learning (online learning in decentralized networks) has been attracting more and more attention, since it is believed that decentralized online learning can help data providers cooperatively solve their online problems better without sharing their private data with a third party or other providers. Typically, the cooperation is achieved by letting the data providers exchange their models with neighbors, e.g., a recommendation model. However, the best regret bound for a decentralized online learning algorithm is \( \mathcal{O}(n\sqrt{T}) \), where n is the number of nodes (or users) and T is the number of iterations. This is clearly insignificant, since this bound can be achieved without any communication in the networks. This prompts a fundamental question: can people really benefit from decentralized online learning by exchanging information? In this article, we study when and why communication can help decentralized online learning reduce regret. Specifically, each loss function is characterized by two components: the adversarial component and the stochastic component. Under this characterization, we show that decentralized online gradient enjoys a regret bound \( \mathcal{O}(\sqrt{n^2 T G^2 + n T \sigma^2}) \), where G measures the magnitude of the adversarial component in the private data (or equivalently the local loss function) and σ measures the randomness within the private data. This regret suggests that people can benefit from the randomness in the private data by exchanging private information. Another important contribution of this article is to consider the dynamic regret, a more practical regret that tracks users' interest dynamics. Empirical studies are also conducted to validate our analysis.
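
The analyzed algorithm, decentralized online gradient, is simple to state: each node gossip-averages its neighbors' models and then takes a local gradient step. A toy sketch in the purely stochastic regime (G = 0, σ > 0), the regime in which the analysis says communication provably helps; the loss and mixing matrix are illustrative.

```python
import numpy as np

rng = np.random.default_rng(0)
n, d, T, lr = 4, 5, 200, 0.1
W = np.full((n, n), 1.0 / n)     # doubly stochastic mixing matrix (complete graph)
x = np.zeros((n, d))             # one model per node
trend = rng.standard_normal(d)   # the common signal every node wants to track

for t in range(T):
    # Each node sees a noisy quadratic loss around the trend (sigma > 0, G = 0).
    grads = np.stack([2 * (x[i] - (trend + 0.5 * rng.standard_normal(d)))
                      for i in range(n)])
    x = W @ x - lr * grads       # mix neighbors' models, then local gradient step

print(np.linalg.norm(x - trend, axis=1))   # all nodes hover near the global trend
```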

AAAI Conference 2022 Conference Paper

Efficient Device Scheduling with Multi-Job Federated Learning

  • Chendi Zhou
  • Ji Liu
  • Juncheng Jia
  • Jingbo Zhou
  • Yang Zhou
  • Huaiyu Dai
  • Dejing Dou

Recent years have witnessed a large amount of decentralized data in multiple (edge) devices of end-users, while the aggregation of the decentralized data remains difficult for machine learning jobs due to laws or regulations. Federated Learning (FL) emerges as an effective approach to handling decentralized data without sharing the sensitive raw data, while collaboratively training global machine learning models. The servers in FL need to select (and schedule) devices during the training process. However, the scheduling of devices for multiple jobs with FL remains a critical and open problem. In this paper, we propose a novel multi-job FL framework to enable the parallel training process of multiple jobs. The framework consists of a system model and two scheduling methods. In the system model, we propose a parallel training process of multiple jobs, and construct a cost model based on the training time and the data fairness of various devices during the training process of diverse jobs. We propose a reinforcement learning-based method and a Bayesian optimization-based method to schedule devices for multiple jobs while minimizing the cost. We conduct extensive experimentation with multiple jobs and datasets. The experimental results show that our proposed approaches significantly outperform baseline approaches in terms of training time (up to 8.67 times faster) and accuracy (up to 44.6% higher).

IJCAI Conference 2022 Conference Paper

FedDUAP: Federated Learning with Dynamic Update and Adaptive Pruning Using Shared Data on the Server

  • Hong Zhang
  • Ji Liu
  • Juncheng Jia
  • Yang Zhou
  • Huaiyu Dai
  • Dejing Dou

Despite achieving remarkable performance, Federated Learning (FL) suffers from two critical challenges, i.e., limited computational resources and low training efficiency. In this paper, we propose a novel FL framework, i.e., FedDUAP, with two original contributions, to exploit the insensitive data on the server and the decentralized data in edge devices to further improve the training efficiency. First, a dynamic server update algorithm is designed to exploit the insensitive data on the server, in order to dynamically determine the optimal steps of the server update for improving the convergence and accuracy of the global model. Second, a layer-adaptive model pruning method is developed to perform unique pruning operations adapted to the different dimensions and importance of multiple layers, to achieve a good balance between efficiency and effectiveness. By integrating the two original techniques together, our proposed FL model, FedDUAP, significantly outperforms baseline approaches in terms of accuracy (up to 4.8% higher), efficiency (up to 2.8 times faster), and computational cost (up to 61.9% smaller).

NeurIPS Conference 2022 Conference Paper

Improving Certified Robustness via Statistical Learning with Logical Reasoning

  • Zhuolin Yang
  • Zhikuan Zhao
  • Boxin Wang
  • Jiawei Zhang
  • Linyi Li
  • Hengzhi Pei
  • Bojan Karlaš
  • Ji Liu

Intensive algorithmic efforts have been made recently to enable rapid improvements in certified robustness for complex ML models. However, current robustness certification methods are only able to certify under a limited perturbation radius. Given that existing pure data-driven statistical approaches have reached a bottleneck, in this paper we propose to integrate statistical ML models with knowledge (expressed as logical rules) as a reasoning component using Markov logic networks (MLN), so as to further improve the overall certified robustness. This opens new research questions about certifying the robustness of such a paradigm, especially the reasoning component (e.g., the MLN). As a first step towards understanding these questions, we first prove that the computational complexity of certifying the robustness of MLN is #P-hard. Guided by this hardness result, we then derive the first certified robustness bound for MLN by carefully analyzing different model regimes. Finally, we conduct extensive experiments on five datasets including both high-dimensional images and natural language texts, and we show that the certified robustness with knowledge-based logical reasoning indeed significantly outperforms the state of the art.

AAAI Conference 2021 Conference Paper

C-Watcher: A Framework for Early Detection of High-Risk Neighborhoods Ahead of COVID-19 Outbreak

  • Congxi Xiao
  • Jingbo Zhou
  • Jizhou Huang
  • An Zhuo
  • Ji Liu
  • Haoyi Xiong
  • Dejing Dou

The novel coronavirus disease (COVID-19) has crushed daily routines and is still rampaging through the world. Existing solutions for nonpharmaceutical interventions usually need to timely and precisely select a subset of residential urban areas for containment or even quarantine, where the spatial distribution of confirmed cases has been considered a key criterion for the subset selection. While such containment measures have successfully stopped or slowed down the spread of COVID-19 in some countries, they are criticized for being inefficient or ineffective, as the statistics of confirmed cases are usually time-delayed and coarse-grained. To tackle these issues, we propose C-Watcher, a novel data-driven framework that aims at screening every neighborhood in a target city and predicting infection risks prior to the spread of COVID-19 from epicenters to the city. In terms of design, C-Watcher collects large-scale long-term human mobility data from Baidu Maps, then characterizes every residential neighborhood in the city using a set of features based on urban mobility patterns. Furthermore, to transfer the firsthand knowledge (witnessed in epicenters) to the target city before local outbreaks, we adopt a novel adversarial encoder framework to learn “city-invariant” representations from the mobility-related features for precise early detection of high-risk neighborhoods, even before any confirmed cases are known, in the target city. We carried out extensive experiments on C-Watcher using real-world records from the early stage of COVID-19 outbreaks, where the results demonstrate the efficiency and effectiveness of C-Watcher for early detection of high-risk neighborhoods across a large number of cities.

ICML Conference 2021 Conference Paper

DouZero: Mastering DouDizhu with Self-Play Deep Reinforcement Learning

  • Daochen Zha
  • Jingru Xie
  • Wenye Ma
  • Sheng Zhang
  • Xiangru Lian
  • Xia Hu 0001
  • Ji Liu

Games are abstractions of the real world, where artificial agents learn to compete and cooperate with other agents. While significant achievements have been made in various perfect- and imperfect-information games, DouDizhu (a.k.a. Fighting the Landlord), a three-player card game, is still unsolved. DouDizhu is a very challenging domain with competition, collaboration, imperfect information, large state space, and particularly a massive set of possible actions where the legal actions vary significantly from turn to turn. Unfortunately, modern reinforcement learning algorithms mainly focus on simple and small action spaces and, not surprisingly, are shown not to make satisfactory progress in DouDizhu. In this work, we propose a conceptually simple yet effective DouDizhu AI system, namely DouZero, which enhances traditional Monte-Carlo methods with deep neural networks, action encoding, and parallel actors. Starting from scratch on a single server with four GPUs, DouZero outperformed all existing DouDizhu AI programs in days of training and ranked first in the Botzone leaderboard among 344 AI agents. Through building DouZero, we show that classic Monte-Carlo methods can be made to deliver strong results in a hard domain with a complex action space. The code and an online demo are released at https://github.com/kwai/DouZero with the hope that this insight could motivate future work.

NeurIPS Conference 2021 Conference Paper

ErrorCompensatedX: error compensation for variance reduced algorithms

  • Hanlin Tang
  • Yao Li
  • Ji Liu
  • Ming Yan

Communication cost is one major bottleneck for the scalability of distributed learning. One approach to reducing the communication cost is to compress the gradient during communication. However, directly compressing the gradient decelerates the convergence speed, and the resulting algorithm may diverge for biased compression. Recent work addressed this problem for stochastic gradient descent by adding back the compression error from the previous step. This idea was further extended to one class of variance reduced algorithms, where the variance of the stochastic gradient is reduced by taking a moving average over all history gradients. However, our analysis shows that just adding back the previous step's compression error, as done in existing work, does not fully compensate the compression error. So, we propose ErrorCompensatedX, which uses the compression errors from the previous two steps. We show that ErrorCompensatedX can achieve the same asymptotic convergence rate as training without compression. Moreover, we provide a unified theoretical analysis framework for this class of variance reduced algorithms, with or without error compensation.
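
A sketch of the mechanism with a top-k compressor: the compression errors from the previous two steps are folded back before compressing, instead of only the last step's error. The combination coefficients below are illustrative; the paper derives them from the variance-reduction recursion of the underlying algorithm.

```python
import torch

def topk_compress(g, ratio=0.1):
    """Biased sparsifier: transmit only the largest-magnitude entries
    (assumes a flat 1-D vector)."""
    k = max(1, int(g.numel() * ratio))
    out = torch.zeros_like(g)
    idx = g.abs().topk(k).indices
    out[idx] = g[idx]
    return out

class TwoStepErrorFeedback:
    """Fold back the compression errors of the previous *two* steps before
    compressing (coefficients a1, a2 are illustrative placeholders)."""
    def __init__(self, a1=1.0, a2=0.5):
        self.a1, self.a2 = a1, a2
        self.e1 = self.e2 = None   # errors from steps t-1 and t-2

    def __call__(self, v):
        corrected = v.clone()
        if self.e1 is not None:
            corrected += self.a1 * self.e1
        if self.e2 is not None:
            corrected += self.a2 * self.e2
        sent = topk_compress(corrected)
        self.e2, self.e1 = self.e1, corrected - sent
        return sent

ef = TwoStepErrorFeedback()
for _ in range(5):
    sent = ef(torch.randn(100))
print("last transmitted vector has", int((sent != 0).sum()), "nonzeros")
```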

AAAI Conference 2021 Conference Paper

Optimizing Information Theory Based Bitwise Bottlenecks for Efficient Mixed-Precision Activation Quantization

  • Xichuan Zhou
  • Kui Liu
  • Cong Shi
  • Haijun Liu
  • Ji Liu

Recent research on information theory sheds new light on the continuous attempts to open the black box of neural signal encoding. Inspired by the problem of lossy signal compression for wireless communication, this paper presents a Bitwise Bottleneck approach for quantizing and encoding neural network activations. Based on rate-distortion theory, the Bitwise Bottleneck attempts to determine the most significant bits in the activation representation by assigning and approximating the sparse coefficients associated with different bits. Given the constraint of a limited average code rate, the bottleneck minimizes the distortion for optimal activation quantization in a flexible layer-by-layer manner. Experiments on ImageNet and other datasets show that, by minimizing the quantization distortion of each layer, the neural network with bottlenecks achieves state-of-the-art accuracy with low-precision activations. Meanwhile, by reducing the code rate, the proposed method can improve memory and computational efficiency by over six times compared with the deep neural network with standard single-precision representation. The source code is available on GitHub: https://github.com/CQUlearningsystemgroup/BitwiseBottleneck.

ICLR Conference 2021 Conference Paper

Rank the Episodes: A Simple Approach for Exploration in Procedurally-Generated Environments

  • Daochen Zha
  • Wenye Ma
  • Lei Yuan
  • Xia Hu 0001
  • Ji Liu

Exploration under sparse reward is a long-standing challenge of model-free reinforcement learning. The state-of-the-art methods address this challenge by introducing intrinsic rewards to encourage exploration in novel states or uncertain environment dynamics. Unfortunately, methods based on intrinsic rewards often fall short in procedurally-generated environments, where a different environment is generated in each episode so that the agent is not likely to visit the same state more than once. Motivated by how humans distinguish good exploration behaviors by looking into the entire episode, we introduce RAPID, a simple yet effective episode-level exploration method for procedurally-generated environments. RAPID regards each episode as a whole and gives an episodic exploration score from both per-episode and long-term views. Those highly scored episodes are treated as good exploration behaviors and are stored in a small ranking buffer. The agent then imitates the episodes in the buffer to reproduce the past good exploration behaviors. We demonstrate our method on several procedurally-generated MiniGrid environments, a first-person-view 3D Maze navigation task from MiniWorld, and several sparse MuJoCo tasks. The results show that RAPID significantly outperforms the state-of-the-art intrinsic reward strategies in terms of sample efficiency and final performance. The code is available at https://github.com/daochenzha/rapid
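
A minimal sketch of the episode-scoring and ranking-buffer loop; the per-episode and long-term scores and their weights below are illustrative stand-ins for RAPID's exact scoring functions.

```python
buffer = []          # small ranking buffer of highly-scored episodes
visit_counts = {}    # lifetime state visitation counts (long-term view)

def exploration_score(states, w_local=1.0, w_global=0.1):
    """Per-episode view (fraction of distinct states) plus a long-term view
    (novelty w.r.t. lifetime counts); weights are illustrative."""
    local = len(set(states)) / max(len(states), 1)
    glob = sum(1.0 / (1 + visit_counts.get(s, 0)) for s in states) / max(len(states), 1)
    return w_local * local + w_global * glob

def store_episode(states, actions, capacity=16):
    score = exploration_score(states)
    for s in states:
        visit_counts[s] = visit_counts.get(s, 0) + 1
    buffer.append((score, states, actions))
    buffer.sort(key=lambda e: e[0], reverse=True)
    del buffer[capacity:]      # keep only the best exploration behaviors

# The agent then behavior-clones state-action pairs sampled from the buffer
# to reproduce past good exploration.
store_episode([(0, 0), (0, 1), (1, 1)], ["right", "down"])
print(buffer[0][0])
```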

NeurIPS Conference 2021 Conference Paper

Shifted Chunk Transformer for Spatio-Temporal Representational Learning

  • Xuefan Zha
  • Wentao Zhu
  • Lv Xun
  • Sen Yang
  • Ji Liu

Spatio-temporal representational learning has been widely adopted in various fields such as action recognition, video object segmentation, and action anticipation. Previous spatio-temporal representational learning approaches primarily employ ConvNets or sequential models, e.g., LSTM, to learn the intra-frame and inter-frame features. Recently, Transformer models have successfully dominated the study of natural language processing (NLP), image classification, etc. However, pure-Transformer based spatio-temporal learning can be prohibitively costly in memory and computation to extract fine-grained features from a tiny patch. To tackle the training difficulty and enhance spatio-temporal learning, we construct a shifted chunk Transformer with pure self-attention blocks. Leveraging recent efficient Transformer designs in NLP, this shifted chunk Transformer can learn hierarchical spatio-temporal features from a local tiny patch to a global video clip. Our shifted self-attention can also effectively model complicated inter-frame variances. Furthermore, we build a clip encoder based on the Transformer to model long-term temporal dependencies. We conduct thorough ablation studies to validate each component and hyper-parameter in our shifted chunk Transformer, and it outperforms previous state-of-the-art approaches on Kinetics-400, Kinetics-600, UCF101, and HMDB51.

ICML Conference 2021 Conference Paper

Streaming Bayesian Deep Tensor Factorization

  • Shikai Fang
  • Zheng Wang 0042
  • Zhimeng Pan
  • Ji Liu
  • Shandian Zhe

Despite the success of existing tensor factorization methods, most of them conduct a multilinear decomposition, and rarely exploit powerful modeling frameworks, like deep neural networks, to capture a variety of complicated interactions in data. More importantly, for highly expressive, deep factorization, we lack an effective approach to handle streaming data, which are ubiquitous in real-world applications. To address these issues, we propose SBTD, a Streaming Bayesian Deep Tensor factorization method. We first use Bayesian neural networks (NNs) to build a deep tensor factorization model. We assign a spike-and-slab prior over each NN weight to encourage sparsity and to prevent overfitting. We then use the multivariate delta method and moment matching to approximate the posterior of the NN output and calculate the running model evidence, based on which we develop an efficient streaming posterior inference algorithm in the assumed-density-filtering and expectation-propagation framework. Our algorithm provides responsive incremental updates for the posterior of the latent factors and NN weights upon receiving newly observed tensor entries, and meanwhile identifies and inhibits redundant or useless weights. We show the advantages of our approach in four real-world applications.

NeurIPS Conference 2021 Conference Paper

TNASP: A Transformer-based NAS Predictor with a Self-evolution Framework

  • Shun Lu
  • Jixiang Li
  • Jianchao Tan
  • Sen Yang
  • Ji Liu

Predictor-based Neural Architecture Search (NAS) continues to be an important topic because it aims to mitigate the time-consuming search procedure of traditional NAS methods. A promising performance predictor determines the quality of the final searched models in predictor-based NAS methods. Most existing predictor-based methodologies train model-based predictors under a proxy dataset setting, which may suffer from accuracy decline and generalization problems, mainly due to their poor ability to represent the spatial topology information of graph-structured data. Besides the poor encoding of spatial topology information, these works do not take advantage of temporal information such as historical evaluations during training. Thus, we propose a Transformer-based NAS performance predictor, associated with a Laplacian-matrix-based positional encoding strategy, which better represents topology information and achieves better performance than previous state-of-the-art methods on NAS-Bench-101, NAS-Bench-201, and the DARTS search space. Furthermore, we also propose a self-evolution framework that can fully utilize temporal information as guidance. This framework iteratively incorporates the evaluations of previously predicted results as constraints into the current optimization iteration, further improving the performance of our predictor. The framework is model-agnostic and can thus enhance performance on various backbone structures for the prediction task. Our proposed method helped us rank 2nd among all teams in the CVPR 2021 NAS Competition Track 2: Performance Prediction Track.

NeurIPS Conference 2021 Conference Paper

Validating the Lottery Ticket Hypothesis with Inertial Manifold Theory

  • Zeru Zhang
  • Jiayin Jin
  • Zijie Zhang
  • Yang Zhou
  • Xin Zhao
  • Jiaxiang Ren
  • Ji Liu
  • Lingfei Wu

Despite achieving remarkable efficiency, traditional network pruning techniques often follow manually crafted heuristics to generate pruned sparse networks. Such heuristic pruning strategies can hardly guarantee that the pruned networks achieve test accuracy comparable to the original dense ones. Recent works have empirically identified and verified the Lottery Ticket Hypothesis (LTH): a randomly-initialized dense neural network contains an extremely sparse subnetwork which can be trained to achieve accuracy similar to the former. Due to the lack of theoretical evidence, these works often need to run multiple rounds of expensive training and pruning over the original large networks to discover the sparse subnetworks with low accuracy loss. By leveraging dynamical systems theory and inertial manifold theory, this work theoretically verifies the validity of the LTH. We explore the possibility of theoretically lossless pruning as well as one-time pruning, compared with existing neural network pruning and LTH techniques. We reformulate the neural network optimization problem as a gradient dynamical system and reduce this high-dimensional system onto inertial manifolds to obtain a low-dimensional system regarding pruned subnetworks. We demonstrate the precondition for and the existence of pruned subnetworks, and prune the original networks according to the gap in their spectrum, which yields subnetworks of the smallest dimensions.

AAAI Conference 2020 Conference Paper

Deep Embedded Complementary and Interactive Information for Multi-View Classification

  • Jinglin Xu
  • Wenbin Li
  • Xinwang Liu
  • Dingwen Zhang
  • Ji Liu
  • Junwei Han

Multi-view classification optimally integrates various features from different views to improve classification tasks. Though most of the existing works demonstrate promising performance in various computer vision applications, we observe that they can be further improved by sufficiently utilizing complementary view-specific information, deep interactive information between different views, and the strategy of fusing various views. In this work, we propose a novel multi-view learning framework that seamlessly embeds various view-specific information and deep interactive information and introduces a novel multi-view fusion strategy to make a joint decision during the optimization for classification. Specifically, we utilize different deep neural networks to learn multiple view-specific representations, and model deep interactive information through a shared interactive network using the cross-correlations between attributes of these representations. After that, we adaptively integrate multiple neural networks by flexibly tuning the power exponent of weight, which not only avoids the trivial solution of weight but also provides a new approach to fuse outputs from different deterministic neural networks. Extensive experiments on several public datasets demonstrate the rationality and effectiveness of our method.

AAAI Conference 2020 Conference Paper

Feature Variance Regularization: A Simple Way to Improve the Generalizability of Neural Networks

  • Ranran Huang
  • Hanbo Sun
  • Ji Liu
  • Lu Tian
  • Li Wang
  • Yi Shan
  • Yu Wang

To improve the generalization ability of neural networks, we propose a novel regularization method that regularizes the empirical risk using a penalty on the empirical variance of the features. Intuitively, our approach introduces confusion into feature extraction and prevents the models from learning features that may relate to specific training samples. According to our theoretical analysis, our method encourages models to generate closer feature distributions for the training set and unobservable true data, and to minimize the expected risk as well, which allows the model to adapt better to new samples. We provide a thorough empirical justification of our approach, which achieves a greater improvement than other regularization methods. The experimental results show the effectiveness of our method on multiple visual tasks, including classification (CIFAR100, ImageNet, fine-grained datasets) and semantic segmentation (Cityscapes).
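
The penalty itself is one line. A sketch, assuming the regularizer is applied to the penultimate features with an illustrative weight; the backbone, head, and weight 0.1 are placeholders, not the paper's setup.

```python
import torch
import torch.nn.functional as F

def variance_penalty(features):
    """Mean over feature dimensions of the empirical (batch) variance."""
    return features.var(dim=0, unbiased=False).mean()

backbone = torch.nn.Sequential(torch.nn.Linear(32, 64), torch.nn.ReLU())
head = torch.nn.Linear(64, 10)

x, y = torch.randn(128, 32), torch.randint(0, 10, (128,))
feats = backbone(x)
# Empirical risk plus the feature-variance penalty (weight is illustrative).
loss = F.cross_entropy(head(feats), y) + 0.1 * variance_penalty(feats)
loss.backward()
```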

NeurIPS Conference 2020 Conference Paper

Once-for-All Adversarial Training: In-Situ Tradeoff between Robustness and Accuracy for Free

  • Haotao Wang
  • Tianlong Chen
  • Shupeng Gui
  • TingKuei Hu
  • Ji Liu
  • Zhangyang Wang

Adversarial training and its many variants substantially improve deep network robustness, yet at the cost of compromising standard accuracy. Moreover, the training process is heavy, and hence it becomes impractical to thoroughly explore the trade-off between accuracy and robustness. This paper asks a new question: how can we quickly calibrate a trained model in situ, to examine the achievable trade-offs between its standard and robust accuracies, without (re-)training it many times? Our proposed framework, Once-for-all Adversarial Training (OAT), is built on an innovative model-conditional training framework, with a controlling hyper-parameter as the input. The trained model could be adjusted among different standard and robust accuracies “for free” at testing time. As an important knob, we exploit dual batch normalization to separate standard and adversarial feature statistics, so that they can be learned in one model without degrading performance. We further extend OAT to a Once-for-all Adversarial Training and Slimming (OATS) framework that allows for a joint trade-off among accuracy, robustness, and runtime efficiency. Experiments show that, without any re-training or ensembling, OAT/OATS achieve similar or even superior performance compared to dedicatedly trained models at various configurations. Our codes and pretrained models are available at: https://github.com/VITA-Group/Once-for-All-Adversarial-Training.

ICLR Conference 2020 Conference Paper

Watch the Unobserved: A Simple Approach to Parallelizing Monte Carlo Tree Search

  • Anji Liu
  • Jianshu Chen
  • Mingze Yu
  • Yu Zhai
  • Xuewen Zhou
  • Ji Liu

Monte Carlo Tree Search (MCTS) algorithms have achieved great success on many challenging benchmarks (e.g., Computer Go). However, they generally require a large number of rollouts, making their applications costly. Furthermore, it is also extremely challenging to parallelize MCTS due to its inherent sequential nature: each rollout heavily relies on the statistics (e.g., node visitation counts) estimated from previous simulations to achieve an effective exploration-exploitation tradeoff. In spite of these difficulties, we develop an algorithm, WU-UCT, to effectively parallelize MCTS, which achieves linear speedup and exhibits only limited performance loss with an increasing number of workers. The key idea in WU-UCT is a set of statistics that we introduce to track the number of on-going yet incomplete simulation queries (termed unobserved samples). These statistics are used to modify the UCT tree policy in the selection steps in a principled manner to retain an effective exploration-exploitation tradeoff when we parallelize the most time-consuming expansion and simulation steps. Experiments on a proprietary benchmark and the Atari Game benchmark demonstrate the linear speedup and the superior performance of WU-UCT compared to existing techniques.
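
The core modification is small: treat in-flight (unobserved) rollouts as if they were already counted when computing the UCT score. A sketch of the selection rule, with illustrative constants; the bookkeeping for incrementing and clearing O on dispatch and completion is omitted.

```python
import math

class Node:
    def __init__(self):
        self.children = {}   # action -> Node
        self.N = 0           # completed simulations through this child
        self.O = 0           # on-going, not-yet-returned simulation queries
        self.W = 0.0         # total return from completed simulations

def wu_uct_select(node, c=1.4):
    """UCT selection with visit counts corrected by the count O of incomplete
    queries: in-flight rollouts already discount a child's exploration bonus."""
    total = sum(ch.N + ch.O for ch in node.children.values()) + 1
    def score(ch):
        q = ch.W / ch.N if ch.N > 0 else 0.0
        return q + c * math.sqrt(math.log(total) / max(ch.N + ch.O, 1))
    return max(node.children.items(), key=lambda kv: score(kv[1]))[0]

root = Node()
root.children = {a: Node() for a in range(3)}
root.children[0].N, root.children[0].W = 10, 6.0
root.children[1].N, root.children[1].O, root.children[1].W = 10, 8, 6.0  # 8 pending
root.children[2].N = 1
# Child 1 scores below child 0 despite identical (N, W): its pending queries
# shrink the bonus, steering parallel workers elsewhere (here, to child 2).
print(wu_uct_select(root))
```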

IROS Conference 2019 Conference Paper

Accelerated Visual Inertial Navigation via Fragmented Structure Updates

  • Yehonathan Litman
  • Ya Wang
  • Ji Liu

Tightly coupled Visual-Inertial Navigation System (VINS) implementations have proven their superiority due to their ability to jointly optimize all state variables. However, this joint optimization is considered a computational bottleneck within the system, and thus many traditional VINS can only be implemented on platforms containing powerful processors. In this work, we show that a significant reduction in the computational burden of optimization can be achieved through the use of fragments: sets of co-visible feature points from across several different cameras, efficiently split, categorized into groups, and then analyzed using a machine-learning-inspired frequent-pattern growth algorithm. Furthermore, we use a reduced continuous representation during preintegration for better accuracy while requiring fewer computational resources. We validate our algorithm on datasets, showing that the derivation is not only more accurate but also requires significantly fewer computational resources. When tested on a Raspberry Pi, our implementation was able to track the system's state at nearly 20 frames per second using only a single CPU core. Testing on another low-power module shows a power draw of around 800 mW. Due to run-time considerations, the Raspberry Pi was only competitive with regard to other optimization methodologies, but our algorithm displays remarkable accuracy on a more powerful platform.

NeurIPS Conference 2019 Conference Paper

Efficient Smooth Non-Convex Stochastic Compositional Optimization via Stochastic Recursive Gradient Descent

  • Wenqing Hu
  • Chris Junchi Li
  • Xiangru Lian
  • Ji Liu
  • Huizhuo Yuan

Stochastic compositional optimization arises in many important machine learning tasks such as reinforcement learning and portfolio management. The objective function is the composition of two expectations of stochastic functions, and is more challenging to optimize than vanilla stochastic optimization problems. In this paper, we investigate stochastic compositional optimization in the general smooth non-convex setting. We employ a recently developed idea of \textit{Stochastic Recursive Gradient Descent} to design a novel algorithm named SARAH-Compositional, and prove a sharp Incremental First-order Oracle (IFO) complexity upper bound for stochastic compositional optimization: $\mathcal{O}((n+m)^{1/2} \varepsilon^{-2})$ in the finite-sum case and $\mathcal{O}(\varepsilon^{-3})$ in the online case. Such a complexity is the best known among IFO complexity results for non-convex stochastic compositional optimization. Numerical experiments validate the superior performance of our algorithm and theory.

NeurIPS Conference 2019 Conference Paper

Global Sparse Momentum SGD for Pruning Very Deep Neural Networks

  • Xiaohan Ding
  • Guiguang Ding
  • Xiangxin Zhou
  • Yuchen Guo
  • Jungong Han
  • Ji Liu

Deep Neural Networks (DNNs) are powerful but computationally expensive and memory intensive, thus impeding their practical usage on resource-constrained front-end devices. DNN pruning is an approach to deep model compression which aims at eliminating some parameters with tolerable performance degradation. In this paper, we propose a novel momentum-SGD-based optimization method to reduce network complexity by on-the-fly pruning. Concretely, given a global compression ratio, we categorize all the parameters into two parts at each training iteration, which are updated using different rules. In this way, we gradually zero out the redundant parameters, as we update them using only the ordinary weight decay but no gradients derived from the objective function. As a departure from prior methods that require heavy human effort to tune the layer-wise sparsity ratios, prune by solving complicated non-differentiable problems, or finetune the model after pruning, our method is characterized by 1) global compression that automatically finds the appropriate per-layer sparsity ratios; 2) end-to-end training; 3) no need for a time-consuming re-training process after pruning; and 4) a superior capability to find better winning tickets which have won the initialization lottery.
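
A sketch of the two-rule update, assuming a first-order saliency |w · grad| as the global ranking criterion (an assumption; the paper's criterion may differ in detail): top-ranked parameters keep their objective gradients, while the rest are updated by weight decay alone, which momentum SGD still applies inside step().

```python
import torch

def global_sparse_momentum_step(params, opt, keep_ratio=0.3):
    """Rank all parameters globally by |w * grad|, let only the top fraction
    keep their objective gradients, and update the rest with weight decay
    alone (PyTorch SGD adds the decay term inside opt.step())."""
    saliency = torch.cat([(p * p.grad).abs().flatten() for p in params])
    k = max(1, int(saliency.numel() * keep_ratio))
    thresh = saliency.topk(k).values.min()
    for p in params:
        p.grad.mul_(((p * p.grad).abs() >= thresh).float())  # zero redundant grads
    opt.step()

model = torch.nn.Linear(20, 5)
opt = torch.optim.SGD(model.parameters(), lr=0.1, momentum=0.9, weight_decay=1e-4)
loss = model(torch.randn(8, 20)).pow(2).mean()
opt.zero_grad(); loss.backward()
global_sparse_momentum_step(list(model.parameters()), opt)
```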

NeurIPS Conference 2019 Conference Paper

LIIR: Learning Individual Intrinsic Reward in Multi-Agent Reinforcement Learning

  • Yali Du
  • Lei Han
  • Meng Fang
  • Ji Liu
  • Tianhong Dai
  • Dacheng Tao

A great challenge in cooperative decentralized multi-agent reinforcement learning (MARL) is generating diversified behaviors for each individual agent when receiving only a team reward. Prior studies have devoted much effort to reward shaping or designing a centralized critic that can discriminatively credit the agents. In this paper, we propose to merge the two directions and learn, for each agent, an intrinsic reward function which diversely stimulates the agents at each time step. Specifically, the intrinsic reward for a specific agent will be involved in computing a distinct proxy critic for the agent to direct the updating of its individual policy. Meanwhile, the parameterized intrinsic reward function will be updated towards maximizing the expected accumulated team reward from the environment, so that the objective is consistent with the original MARL problem. The proposed method is referred to as learning individual intrinsic reward (LIIR) in MARL. We compare LIIR with a number of state-of-the-art MARL methods on battle games in StarCraft II. The results demonstrate the effectiveness of LIIR, and we show that LIIR can assign each individual agent an insightful intrinsic reward per time step.

NeurIPS Conference 2019 Conference Paper

Model Compression with Adversarial Robustness: A Unified Optimization Framework

  • Shupeng Gui
  • Haotao Wang
  • Haichuan Yang
  • Chen Yu
  • Zhangyang Wang
  • Ji Liu

Deep model compression has been extensively studied, and state-of-the-art methods can now achieve high compression ratios with minimal accuracy loss. This paper studies model compression through a different lens: could we compress models without hurting their robustness to adversarial attacks, in addition to maintaining accuracy? Previous literature suggested that the goals of robustness and compactness may sometimes conflict. We propose a novel Adversarially Trained Model Compression (ATMC) framework. ATMC constructs a unified constrained optimization formulation, where existing compression means (pruning, factorization, quantization) are all integrated into the constraints. An efficient algorithm is then developed. An extensive set of experiments demonstrates that ATMC obtains a remarkably more favorable trade-off among model size, accuracy, and robustness than currently available alternatives in various settings. The code is publicly available at https://github.com/shupenggui/ATMC.
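
A highly simplified sketch of the "compression as a constraint" idea: run adversarial training and, after every update, project the weights back onto a compression constraint set. ATMC's actual constraint set also covers factorization and quantization; here only magnitude pruning is kept, and a one-step FGSM perturbation stands in for the inner attack. The model, data, epsilon, and k below are all illustrative assumptions.

```python
import numpy as np

rng = np.random.default_rng(2)
n, d, k = 500, 50, 10                  # keep only k nonzero weights
X = rng.normal(size=(n, d))
w_true = np.zeros(d); w_true[:k] = 1.0
y = (X @ w_true > 0).astype(float)

def sigmoid(z): return 1.0 / (1.0 + np.exp(-z))

w = np.zeros(d)
lr, eps = 0.1, 0.1
for step in range(300):
    # Inner maximization: FGSM perturbation of the inputs.
    margin = sigmoid(X @ w) - y
    X_adv = X + eps * np.sign(np.outer(margin, w))
    # Outer minimization: gradient step on the adversarial batch.
    g = X_adv.T @ (sigmoid(X_adv @ w) - y) / n
    w -= lr * g
    # Projection onto the constraint set {w : ||w||_0 <= k}.
    w[np.argsort(np.abs(w))[:-k]] = 0.0

print("nonzeros:", np.count_nonzero(w))
```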

JBHI Journal 2019 Journal Article

Multitask Cascade Convolution Neural Networks for Automatic Thyroid Nodule Detection and Recognition

  • Wenfeng Song
  • Shuai Li
  • Ji Liu
  • Hong Qin
  • Bo Zhang
  • Shuyang Zhang
  • Aimin Hao

Thyroid ultrasonography is a widely used clinical technique for nodule diagnosis in thyroid regions. However, it remains difficult to detect and recognize nodules due to low contrast, high noise, and the diverse appearance of nodules. In today's clinical practice, senior doctors pinpoint nodules by analyzing global context features, local geometric structure, and intensity changes, which requires rich clinical experience accumulated from hundreds and thousands of nodule case studies. To alleviate doctors' tremendous labor in the diagnosis procedure, we advocate a machine learning approach to the detection and recognition tasks in this paper. In particular, we develop a multitask cascade convolution neural network (MC-CNN) framework to exploit the context information of thyroid nodules. Our framework is built upon a large number of clinically confirmed thyroid ultrasound images with accurate and detailed ground truth labels. Other key advantages of our framework are a multitask cascade architecture with two stages of carefully designed deep convolution networks that detect and recognize thyroid nodules in a pyramidal fashion, capturing various intrinsic features in a global-to-local way. Within our framework, the potential regions of interest after initial detection are further fed to spatial pyramid augmented CNNs to embed multiscale discriminative information for fine-grained thyroid recognition. Experimental results on 4309 clinical ultrasound images indicate that our MC-CNN is accurate and effective for both thyroid nodule detection and recognition. For the correct diagnosis rate of malignant and benign thyroid nodules, its mean Average Precision (mAP) reaches 98.2%, outperforming common CNNs by 5% on average. In addition, we conduct rigorous user studies to confirm that our MC-CNN outperforms experienced doctors while consuming only roughly 2% (1/48) of doctors' examination time on average. The accuracy and efficiency of our new method therefore exhibit its great potential in clinical applications.

AAAI Conference 2019 Conference Paper

Optimal Projection Guided Transfer Hashing for Image Retrieval

  • Ji Liu
  • Lei Zhang

Recently, learning to hash has been widely studied for image retrieval thanks to the computation and storage efficiency of binary codes. Most existing learning-to-hash methods require sufficient training images to learn precise hashing codes. However, in some real-world applications there are not always sufficient training images in the domain of interest. In addition, some existing supervised approaches need a large amount of labeled data, which is expensive in terms of time, labor, and human expertise. To handle such problems, inspired by transfer learning, we propose a simple yet effective unsupervised hashing method named Optimal Projection Guided Transfer Hashing (GTH), where we borrow images from a different but related domain, i.e., the source domain, to help learn precise hashing codes for the domain of interest, i.e., the target domain. Moreover, due to the domain gap, we propose to seek the maximum likelihood estimation (MLE) solution of the hashing functions of the target and source domains. Furthermore, an alternating optimization method is adopted to obtain the two projections of the target and source domains such that the domain hashing disparity is reduced gradually. Extensive experiments on various benchmark databases verify that our method outperforms many state-of-the-art learning-to-hash methods. The implementation details are available at https://github.com/liuji93/GTH.

AAAI Conference 2018 Conference Paper

Accelerated Method for Stochastic Composition Optimization With Nonsmooth Regularization

  • Zhouyuan Huo
  • Bin Gu
  • Ji Liu
  • Heng Huang

Stochastic composition optimization has drawn much attention recently and has been successful in many emerging applications of machine learning, statistical analysis, and reinforcement learning. In this paper, we focus on the composition problem with a nonsmooth regularization penalty. Previous works either have slow convergence rates or do not provide a complete convergence analysis for the general problem. We tackle these two issues by proposing a new stochastic composition optimization method for the composition problem with a nonsmooth regularization penalty. In our method, we apply a variance reduction technique to accelerate convergence. To the best of our knowledge, our method admits the fastest convergence rate for stochastic composition optimization: for the strongly convex composition problem, our algorithm is proved to admit linear convergence; for the general composition problem, our algorithm significantly improves the state-of-the-art convergence rate from $O(T^{-1/2})$ to $O((n_1+n_2)^{2/3} T^{-1})$. Finally, we apply our proposed algorithm to portfolio management and policy evaluation in reinforcement learning. Experimental results verify our theoretical analysis.

NeurIPS Conference 2018 Conference Paper

Communication Compression for Decentralized Training

  • Hanlin Tang
  • Shaoduo Gan
  • Ce Zhang
  • Tong Zhang
  • Ji Liu

Optimizing distributed learning systems is an art of balancing between computation and communication. There have been two lines of research that try to deal with slower networks: communication compression for low-bandwidth networks, and decentralization for high-latency networks. In this paper, we explore a natural question: can the combination of both techniques lead to a system that is robust to both bandwidth and latency? Although the system implication of such a combination is trivial, the underlying theoretical principle and algorithm design are challenging: unlike in centralized algorithms, simply compressing the exchanged information within the decentralized network, even in an unbiased stochastic way, would accumulate error and cause divergence. In this paper, we develop a framework of quantized, decentralized training and propose two different strategies, which we call extrapolation compression and difference compression. We analyze both algorithms and prove that both converge at the rate of $O(1/\sqrt{nT})$, where $n$ is the number of workers and $T$ is the number of iterations, matching the convergence rate of full-precision, centralized training. We validate our algorithms and find that our proposed algorithm significantly outperforms the best of merely decentralized and merely quantized algorithms for networks with both high latency and low bandwidth.
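
A minimal single-link sketch of the difference-compression idea described above: instead of compressing the model itself, a node compresses the change since its last transmission, and the receiver integrates those compressed differences into a running replica. Unbiased stochastic rounding to a fixed grid plays the role of the quantizer here; all constants are illustrative assumptions, not the paper's algorithm verbatim.

```python
import numpy as np

rng = np.random.default_rng(3)

def quantize(v, delta=0.05):
    # Unbiased stochastic rounding to a grid of spacing delta: E[q(v)] = v.
    low = np.floor(v / delta) * delta
    p = (v - low) / delta                 # probability of rounding up
    return low + delta * (rng.random(v.shape) < p)

d = 8
x = rng.normal(size=d)        # the sender's true model
replica = quantize(x)         # receiver's estimate after the first message
sent = replica.copy()         # what the sender knows the receiver holds

for step in range(100):
    x += 0.1 * rng.normal(size=d)         # model keeps changing (e.g. SGD)
    diff = quantize(x - sent)             # compress only the difference
    replica += diff                       # receiver updates its replica
    sent += diff                          # sender tracks the receiver state

print("replica error:", np.linalg.norm(x - replica))
```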

AAMAS Conference 2018 Conference Paper

Gossip Gradient Descent

  • Yang Liu
  • Ji Liu
  • Tamer Basar

We consider the problem of learning a linear regression model distributively with a network of $N$ interconnected agents which receive private streaming data. Each agent can deploy an online learning algorithm, e.g., stochastic gradient descent, to learn the regression model adaptively from its private data. The goal is to devise an algorithm for each agent, under the constraint that each agent can communicate only with its neighboring agents based on a communication graph, that enables every agent to converge to the true model with performance comparable to that of the traditional centralized solution. We propose an algorithm called gossip gradient descent, and establish $O\left(\sqrt{\frac{\log t}{(1-\lambda_2)Nt}}\right)$ convergence in expectation and mean square, where $\lambda_2$ is the second largest eigenvalue of the expected gossip matrix corresponding to the underlying communication graph. For the case when agents are privacy sensitive, we propose a differentially private variant of the algorithm, which achieves $\epsilon$-differential privacy and $O\left(\sqrt{\frac{\log^2 t}{\epsilon(1-\lambda_2)Nt}}\right)$ convergence.
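
A toy simulation of the gossip-plus-SGD pattern described above: each agent takes a stochastic gradient step on its private linear-regression stream, then a randomly chosen pair of ring neighbors averages their iterates (one gossip exchange per step). The ring topology, step-size schedule, and problem sizes are illustrative assumptions.

```python
import numpy as np

rng = np.random.default_rng(4)
N, d = 10, 5
w_true = rng.normal(size=d)
W = rng.normal(size=(N, d))            # each row is one agent's iterate

for t in range(5000):
    lr = 1.0 / (10 + t)                # diminishing step size
    for i in range(N):                 # private streams: one fresh sample each
        a = rng.normal(size=d)
        y = a @ w_true + 0.1 * rng.normal()
        W[i] -= lr * (a @ W[i] - y) * a
    i = rng.integers(N)                # gossip: average with a ring neighbor
    j = (i + 1) % N
    W[i] = W[j] = 0.5 * (W[i] + W[j])

print("max agent error:", np.abs(W - w_true).max())
```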

NeurIPS Conference 2018 Conference Paper

Gradient Sparsification for Communication-Efficient Distributed Optimization

  • Jianqiao Wangni
  • Jialei Wang
  • Ji Liu
  • Tong Zhang

Modern large-scale machine learning applications require stochastic optimization algorithms to be implemented on distributed computational architectures. A key bottleneck is the communication overhead for exchanging information, such as stochastic gradients, among different workers. In this paper, to reduce the communication cost, we propose a convex optimization formulation to minimize the coding length of stochastic gradients. The key idea is to randomly drop out coordinates of the stochastic gradient vectors and amplify the remaining coordinates appropriately so that the sparsified gradient remains unbiased. To solve the optimal sparsification efficiently, several simple and fast algorithms are proposed for an approximate solution, with a theoretical guarantee on sparseness. Experiments on $\ell_2$-regularized logistic regression, support vector machines, and convolutional neural networks validate our sparsification approaches.
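
A minimal sketch of the unbiased sparsification described above: coordinate i survives with probability p_i and, if kept, is amplified by 1/p_i, so the estimator's expectation equals the original gradient. The simple magnitude-proportional choice of p_i below is an illustrative assumption; the paper derives the probabilities from a convex optimization problem.

```python
import numpy as np

rng = np.random.default_rng(5)

def sparsify(g, target_nnz):
    # Keep-probabilities proportional to |g_i|, capped at 1, with expected
    # number of survivors roughly target_nnz.
    p = np.minimum(1.0, target_nnz * np.abs(g) / np.sum(np.abs(g)))
    keep = rng.random(g.shape) < p
    out = np.zeros_like(g)
    out[keep] = g[keep] / p[keep]      # rescale so that E[out] = g
    return out

g = rng.normal(size=1000)
samples = np.stack([sparsify(g, 100) for _ in range(2000)])
print("avg nonzeros per sample:", (samples != 0).sum(axis=1).mean())  # ~100
print("empirical bias norm:", np.linalg.norm(samples.mean(axis=0) - g))
```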

IJCAI Conference 2018 Conference Paper

New Balanced Active Learning Model and Optimization Algorithm

  • Xiaoqian Wang
  • Yijun Huang
  • Ji Liu
  • Heng Huang

It is common in machine learning applications that unlabeled data are abundant while acquiring labels is extremely difficult. In order to reduce the cost of training a model while maintaining its quality, active learning provides a feasible solution. Instead of acquiring labels for random samples, active learning methods carefully select the data to be labeled so as to alleviate the impact of redundancy or noise in the selected data and improve the trained model's performance. In early-stage experimental design, previous active learning methods adopted a data reconstruction framework, such that the selected data maintained high representative power. However, these models did not consider the data class structure, so the selected samples could be dominated by samples from major classes. Such a mechanism fails to include samples from the minor classes and thus tends to be less "representative". To solve this challenging problem, we propose a novel active learning model for the early stage of experimental design. We use an exclusive sparsity norm to enforce that the selected samples are (roughly) evenly distributed among different groups. We provide a new efficient optimization algorithm and theoretically prove the optimal convergence rate $O(1/T^2)$. With a simple substitution, we reduce the computational load of each iteration from $O(n^3)$ to $O(n^2)$, which makes our algorithm more scalable than previous frameworks.

JAIR Journal 2018 Journal Article

Proximal Gradient Temporal Difference Learning: Stable Reinforcement Learning with Polynomial Sample Complexity

  • Bo Liu
  • Ian Gemp
  • Mohammad Ghavamzadeh
  • Ji Liu
  • Sridhar Mahadevan
  • Marek Petrik

In this paper, we introduce proximal gradient temporal difference learning, which provides a principled way of designing and analyzing true stochastic gradient temporal difference learning algorithms. We show how gradient TD (GTD) reinforcement learning methods can be formally derived, not by starting from their original objective functions, as previously attempted, but rather from a primal-dual saddle-point objective function. We also conduct a saddle-point error analysis to obtain finite-sample bounds on their performance. Previous analyses of this class of algorithms use stochastic approximation techniques to prove asymptotic convergence, and do not provide any finite-sample analysis. We also propose an accelerated algorithm, called GTD2-MP, that uses proximal "mirror maps" to yield an improved convergence rate. The results of our theoretical analysis imply that the GTD family of algorithms are comparable and may indeed be preferred over existing least squares TD methods for off-policy learning, due to their linear complexity. We provide experimental results showing the improved performance of our accelerated gradient TD methods.
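
The GTD2 update recovered by this saddle-point view is compact enough to sketch. Below is a minimal two-timescale simulation on a toy random-walk MRP; the chain, features, and step sizes are illustrative assumptions, and the GTD2-MP mirror-map acceleration is omitted.

```python
import numpy as np

rng = np.random.default_rng(6)
n_states, d, gamma = 5, 3, 0.9
Phi = rng.normal(size=(n_states, d))   # state features

theta = np.zeros(d)                    # primal: value-function weights
w = np.zeros(d)                        # dual: correction weights
alpha, beta = 0.01, 0.05               # two timescales (beta > alpha)

s = 0
for t in range(20000):
    s_next = rng.integers(n_states)    # uniform toy transitions
    r = 1.0 if s_next == 0 else 0.0
    phi, phi_next = Phi[s], Phi[s_next]
    delta = r + gamma * phi_next @ theta - phi @ theta
    # Dual ascent: w tracks the projected TD error.
    w += beta * (delta - phi @ w) * phi
    # Primal descent along the gradient of the MSPBE surrogate.
    theta += alpha * (phi - gamma * phi_next) * (phi @ w)
    s = s_next

print("learned values:", Phi @ theta)
```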

NeurIPS Conference 2018 Conference Paper

Stochastic Primal-Dual Method for Empirical Risk Minimization with O(1) Per-Iteration Complexity

  • Conghui Tan
  • Tong Zhang
  • Shiqian Ma
  • Ji Liu

The regularized empirical risk minimization problem with a linear predictor appears frequently in machine learning. In this paper, we propose a new stochastic primal-dual method to solve this class of problems. Different from existing methods, our proposed method requires only O(1) operations per iteration. We also develop a variance-reduced variant of the algorithm that converges linearly. Numerical experiments suggest that our methods are faster than existing ones such as proximal SGD, SVRG, and SAGA on high-dimensional problems.

JMLR Journal 2017 Journal Article

Accelerating Stochastic Composition Optimization

  • Mengdi Wang
  • Ji Liu
  • Ethan X. Fang

We consider the stochastic nested composition optimization problem where the objective is a composition of two expected-value functions. We propose a new stochastic first-order method, namely the accelerated stochastic compositional proximal gradient (ASC-PG) method. This algorithm updates the solution based on noisy gradient queries using a two-timescale iteration. ASC-PG is the first proximal gradient method for the stochastic composition problem that can deal with a nonsmooth regularization penalty. We show that ASC-PG exhibits faster convergence than the best known algorithms, and that it achieves the optimal sample-error complexity in several important special cases. We demonstrate the application of ASC-PG to reinforcement learning and conduct numerical experiments.

NeurIPS Conference 2017 Conference Paper

Can Decentralized Algorithms Outperform Centralized Algorithms? A Case Study for Decentralized Parallel Stochastic Gradient Descent

  • Xiangru Lian
  • Ce Zhang
  • Huan Zhang
  • Cho-Jui Hsieh
  • Wei Zhang
  • Ji Liu

Most distributed machine learning systems nowadays, including TensorFlow and CNTK, are built in a centralized fashion. One bottleneck of centralized algorithms lies in the high communication cost on the central node. Motivated by this, we ask: can decentralized algorithms be faster than their centralized counterparts? Although decentralized PSGD (D-PSGD) algorithms have been studied by the control community, existing analysis and theory do not show any advantage over centralized PSGD (C-PSGD) algorithms, simply assuming the application scenario where only the decentralized network is available. In this paper, we study a D-PSGD algorithm and provide the first theoretical analysis that indicates a regime in which decentralized algorithms might outperform centralized algorithms for distributed stochastic gradient descent. This is because D-PSGD has a total computational complexity comparable to C-PSGD but requires much less communication on the busiest node. We further conduct an empirical study to validate our theoretical analysis across multiple frameworks (CNTK and Torch), different network configurations, and computation platforms with up to 112 GPUs. On network configurations with low bandwidth or high latency, D-PSGD can be up to one order of magnitude faster than its well-optimized centralized counterparts.
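
One D-PSGD iteration is simple enough to sketch: every worker averages its iterate with its neighbors' through a doubly stochastic mixing matrix, then applies its own local stochastic gradient. The ring topology and the toy quadratic objectives below are illustrative assumptions.

```python
import numpy as np

rng = np.random.default_rng(7)
n, d = 8, 4                            # workers, dimension
targets = rng.normal(size=(n, d))      # worker i holds f_i(x)=0.5||x - t_i||^2
X = rng.normal(size=(n, d))            # one iterate per worker

# Ring mixing matrix: each worker averages itself with its two neighbors.
W = np.zeros((n, n))
for i in range(n):
    W[i, i] = W[i, (i - 1) % n] = W[i, (i + 1) % n] = 1.0 / 3.0

lr = 0.1
for step in range(300):
    grads = X - targets + 0.01 * rng.normal(size=(n, d))  # stochastic grads
    X = W @ X - lr * grads             # gossip average, then local step

consensus = X.mean(axis=0)
print("distance to optimum:", np.linalg.norm(consensus - targets.mean(axis=0)))
print("disagreement:", np.linalg.norm(X - consensus))
```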

IJCAI Conference 2017 Conference Paper

No Learner Left Behind: On the Complexity of Teaching Multiple Learners Simultaneously

  • Xiaojin Zhu
  • Ji Liu
  • Manuel Lopes

We present a theoretical study of algorithmic teaching in the setting where the teacher must use the same training set to teach multiple learners. This problem is a theoretical abstraction of the real-world classroom setting in which the teacher delivers the same lecture to academically diverse students. We define a minimax teaching criterion to guarantee the performance of the worst learner in the class. We prove that the teaching dimension increases with class diversity in general. For the classes of conjugate Bayesian learners and linear regression learners, respectively, we exhibit the corresponding minimax teaching sets. We then propose a method to enhance teaching by partitioning the class into sections. We present cases where the optimal partition minimizes the overall teaching dimension while maintaining the guarantee on all learners. Interestingly, we show that personalized education (one learner per section) is not necessarily the optimal partition. Our results generalize algorithmic teaching to multiple learners and offer insight into how to teach large classes.

NeurIPS Conference 2016 Conference Paper

A Comprehensive Linear Speedup Analysis for Asynchronous Stochastic Parallel Optimization from Zeroth-Order to First-Order

  • Xiangru Lian
  • Huan Zhang
  • Cho-Jui Hsieh
  • Yijun Huang
  • Ji Liu

Asynchronous parallel optimization has received substantial success and extensive attention recently. One of the core theoretical questions is how much speedup (or benefit) asynchronous parallelization can bring. This paper provides a comprehensive and generic analysis of the speedup property for a broad range of asynchronous parallel stochastic algorithms, from zeroth-order to first-order methods. Our result recovers or improves existing analyses in special cases, provides more insight into asynchronous parallel behavior, and suggests a novel asynchronous parallel zeroth-order method for the first time. Our experiments provide novel applications of the proposed asynchronous parallel zeroth-order method to hyperparameter tuning and model blending problems.
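
A minimal sketch of the two-point zeroth-order gradient estimator on which such zeroth-order methods are built: the gradient is estimated from function values only, via a random direction u and a small smoothing radius mu (the asynchrony is omitted here). The test function and all constants are illustrative assumptions.

```python
import numpy as np

rng = np.random.default_rng(8)
d, mu = 10, 1e-4

def f(x):                               # black-box objective (toy quadratic)
    return 0.5 * np.sum((x - 1.0) ** 2)

def zo_grad(x):
    u = rng.normal(size=d)              # random search direction
    # Finite-difference along u: in expectation this approximates grad f(x).
    return (f(x + mu * u) - f(x)) / mu * u

x = np.zeros(d)
for t in range(5000):
    x -= 0.01 * zo_grad(x)              # zeroth-order SGD step

print("distance to optimum:", np.linalg.norm(x - 1.0))
```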

NeurIPS Conference 2016 Conference Paper

Accelerating Stochastic Composition Optimization

  • Mengdi Wang
  • Ji Liu
  • Ethan Fang

Consider the stochastic composition optimization problem where the objective is a composition of two expected-value functions. We propose a new stochastic first-order method, namely the accelerated stochastic compositional proximal gradient (ASC-PG) method, which updates based on queries to the sampling oracle using two different timescales. The ASC-PG is the first proximal gradient method for the stochastic composition problem that can deal with nonsmooth regularization penalty. We show that the ASC-PG exhibits faster convergence than the best known algorithms, and that it achieves the optimal sample-error complexity in several important special cases. We further demonstrate the application of ASC-PG to reinforcement learning and conduct numerical experiments.

NeurIPS Conference 2016 Conference Paper

Asynchronous Parallel Greedy Coordinate Descent

  • Yang You
  • Xiangru Lian
  • Ji Liu
  • Hsiang-Fu Yu
  • Inderjit Dhillon
  • James Demmel
  • Cho-Jui Hsieh

In this paper, we propose and study an Asynchronous parallel Greedy Coordinate Descent (Asy-GCD) algorithm for minimizing a smooth function with bounded constraints. At each iteration, workers asynchronously conduct greedy coordinate descent updates on a block of variables. In the first part of the paper, we analyze the theoretical behavior of Asy-GCD and prove a linear convergence rate. In the second part, we develop an efficient kernel SVM solver based on Asy-GCD in the shared-memory multi-core setting. Since our algorithm is fully asynchronous---each core does not need to idle and wait for the other cores---the resulting algorithm enjoys good speedup and outperforms existing multi-core kernel SVM solvers, including asynchronous stochastic coordinate descent and multi-core LIBSVM.
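
A sequential sketch of the greedy coordinate rule at the heart of Asy-GCD: a worker inspects a random block of coordinates, picks the one with the largest gradient magnitude, and updates only that coordinate; the actual algorithm runs many such workers asynchronously on shared memory. The unconstrained quadratic objective and the block size are illustrative assumptions.

```python
import numpy as np

rng = np.random.default_rng(9)
d, block = 100, 10
Q = rng.normal(size=(d, d))
Q = Q @ Q.T / d + np.eye(d)            # positive definite quadratic
b = rng.normal(size=d)
x = np.zeros(d)

for step in range(5000):
    idx = rng.choice(d, size=block, replace=False)
    g = Q[idx] @ x - b[idx]            # gradient on the sampled block
    j = idx[np.argmax(np.abs(g))]      # greedy choice within the block
    x[j] -= (Q[j] @ x - b[j]) / Q[j, j]  # exact coordinate minimization

print("gradient norm:", np.linalg.norm(Q @ x - b))
```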

AAAI Conference 2016 Conference Paper

Optimal Discrete Matrix Completion

  • Zhouyuan Huo
  • Ji Liu
  • Heng Huang

In recent years, matrix completion methods have been successfully applied to recommender system applications. Most of them focus on the matrix completion problem in the real number domain and produce continuous prediction values. However, these methods are not appropriate in settings where the entries of the matrix are discrete values, such as movie rating prediction and social network relation and interaction prediction, because their continuous outputs are not probabilities and are uninterpretable. In this case, an additional step is necessary to process the continuous results with either heuristic threshold parameters or a complicated mapping, which is inefficient and may diverge from the optimal solution. A few matrix completion methods work in the discrete number domain; however, they are not applicable to sparse, large-scale data sets. In this paper, we propose a novel optimal discrete matrix completion model, which is able to learn optimal thresholds automatically and also guarantees an exact low-rank structure of the target matrix. We use stochastic gradient descent with momentum to optimize the new objective function and speed up optimization. Experiments show that our method can predict discrete values with high accuracy, very close to or even better than values obtained by carefully tuned thresholds, on the Movielens and YouTube data sets. Meanwhile, our model is able to handle online data and is easy to parallelize.

AAAI Conference 2016 Conference Paper

Privacy-CNH: A Framework to Detect Photo Privacy with Convolutional Neural Network using Hierarchical Features

  • Lam Tran
  • Deguang Kong
  • Hongxia Jin
  • Ji Liu

Photo privacy is a very important problem in the digital age, where photos are commonly shared on social networking sites and mobile devices. The main challenge in photo privacy detection is how to generate discriminant features to accurately detect privacy-at-risk photos. Existing photo privacy detection works, which rely on low-level vision features, are non-informative to users regarding what privacy information is leaked from their photos. In this paper, we propose a new framework called Privacy-CNH that utilizes hierarchical features, including both object and convolutional features, in a deep learning model to detect privacy-at-risk photos. The generation of object features enables our model to better inform users about why a photo carries privacy risk. The combination of convolutional and object features provides a richer model for understanding photo privacy from different aspects, thus improving photo privacy detection accuracy. Experimental results demonstrate that the proposed model outperforms the state-of-the-art work and the standard convolutional neural network (CNN) with convolutional features on photo privacy detection tasks.

IJCAI Conference 2016 Conference Paper

Proximal Gradient Temporal Difference Learning Algorithms

  • Bo Liu
  • Ji Liu
  • Mohammad Ghavamzadeh
  • Sridhar Mahadevan
  • Marek Petrik

In this paper, we describe proximal gradient temporal difference learning, which provides a principled way for designing and analyzing true stochastic gradient temporal difference learning algorithms. We show how gradient TD (GTD) reinforcement learning methods can be formally derived, not with respect to their original objective functions as previously attempted, but rather with respect to primal-dual saddle-point objective functions. We also conduct a saddle-point error analysis to obtain finite-sample bounds on their performance. Previous analyses of this class of algorithms use stochastic approximation techniques to prove asymptotic convergence, and no finite-sample analysis had been attempted. An accelerated algorithm is also proposed, namely GTD2-MP, which uses proximal mirror maps to yield acceleration. The results of our theoretical analysis imply that the GTD family of algorithms are comparable and may indeed be preferred over existing least squares TD methods for off-policy learning, due to their linear complexity. We provide experimental results showing the improved performance of our accelerated gradient TD methods.

IJCAI Conference 2016 Conference Paper

Staleness-Aware Async-SGD for Distributed Deep Learning

  • Wei Zhang
  • Suyog Gupta
  • Xiangru Lian
  • Ji Liu

Deep neural networks have been shown to achieve state-of-the-art performance on several machine learning tasks. Stochastic Gradient Descent (SGD) is the preferred optimization algorithm for training these networks, and asynchronous SGD (ASGD) has been widely adopted for accelerating the training of large-scale deep networks in a distributed computing environment. However, in practice it is quite challenging to tune the training hyperparameters (such as the learning rate) when using ASGD so as to achieve convergence and linear speedup, since the stability of the optimization algorithm is strongly influenced by the asynchronous nature of parameter updates. In this paper, we propose a variant of the ASGD algorithm in which the learning rate is modulated according to the gradient staleness, and we provide theoretical guarantees for the convergence of this algorithm. Experimental verification is performed on commonly used image classification benchmarks, CIFAR10 and ImageNet, to demonstrate the superior effectiveness of the proposed approach compared to SSGD (Synchronous SGD) and the conventional ASGD algorithm.
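
A minimal sketch of staleness-dependent step sizes in the spirit described above: a gradient computed from a tau-step-old copy of the parameters is applied with step size lr / tau, so stale updates are automatically damped. The simulated staleness distribution and toy objective are illustrative assumptions.

```python
import numpy as np

rng = np.random.default_rng(10)
d, lr = 20, 0.1
w = rng.normal(size=d)
history = [w.copy()]                   # past parameter snapshots

for step in range(2000):
    tau = rng.integers(1, 6)           # simulated gradient staleness
    w_stale = history[max(0, len(history) - tau)]
    g = w_stale - 1.0                  # gradient of 0.5*||w - 1||^2 at stale w
    w -= (lr / tau) * g                # staleness-modulated step size
    history.append(w.copy())

print("distance to optimum:", np.linalg.norm(w - 1.0))
```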

JMLR Journal 2016 Journal Article

The Teaching Dimension of Linear Learners

  • Ji Liu
  • Xiaojin Zhu

Teaching dimension is a learning theoretic quantity that specifies the minimum training set size to teach a target model to a learner. Previous studies on teaching dimension focused on version-space learners which maintain all hypotheses consistent with the training data, and cannot be applied to modern machine learners which select a specific hypothesis via optimization. This paper presents the first known teaching dimension for ridge regression, support vector machines, and logistic regression. We also exhibit optimal training sets that match these teaching dimensions. Our approach generalizes to other linear learners.

ICML Conference 2016 Conference Paper

The Teaching Dimension of Linear Learners

  • Ji Liu
  • Xiaojin Zhu 0001
  • Hrag Ohannessian

Teaching dimension is a learning theoretic quantity that specifies the minimum training set size to teach a target model to a learner. Previous studies on teaching dimension focused on version-space learners which maintain all hypotheses consistent with the training data, and cannot be applied to modern machine learners which select a specific hypothesis via optimization. This paper presents the first known teaching dimension for ridge regression, support vector machines, and logistic regression. We also exhibit optimal training sets that match these teaching dimensions. Our approach generalizes to other linear learners.

AAAI Conference 2016 Conference Paper

Uncorrelated Group LASSO

  • Deguang Kong
  • Ji Liu
  • Bo Liu
  • Xuan Bao

The $\ell_{2,1}$-norm is an effective regularization to enforce a simple group sparsity for feature learning. To capture some subtle structures among feature groups, we propose a new regularization called the exclusive group $\ell_{2,1}$-norm. It enforces sparsity at the intra-group level by using the $\ell_{2,1}$-norm, while encouraging the selected features to distribute across different groups by using the $\ell_2$ norm at the inter-group level. The proposed exclusive group $\ell_{2,1}$-norm is capable of eliminating feature correlations in the context of feature selection, if highly correlated features are collected in the same groups. To solve generic exclusive group $\ell_{2,1}$-norm regularized problems, we propose an efficient iterative re-weighting algorithm and provide a rigorous convergence analysis. Experimental results on real-world datasets demonstrate the effectiveness of the proposed regularization and algorithm.

JMLR Journal 2015 Journal Article

An Asynchronous Parallel Stochastic Coordinate Descent Algorithm

  • Ji Liu
  • Stephen J. Wright
  • Christopher Ré
  • Victor Bittorf
  • Srikrishna Sridhar

We describe an asynchronous parallel stochastic coordinate descent algorithm for minimizing smooth unconstrained or separably constrained functions. The method achieves a linear convergence rate on functions that satisfy an essential strong convexity property and a sublinear rate ($1/K$) on general convex functions. Near-linear speedup on a multicore system can be expected if the number of processors is $O(n^{1/2})$ in unconstrained optimization and $O(n^{1/4})$ in the separably constrained case, where $n$ is the number of variables. We describe results from an implementation on 40-core processors.

NeurIPS Conference 2015 Conference Paper

Asynchronous Parallel Stochastic Gradient for Nonconvex Optimization

  • Xiangru Lian
  • Yijun Huang
  • Yuncheng Li
  • Ji Liu

Asynchronous parallel implementations of stochastic gradient (SG) methods have been broadly used in training deep neural networks and have achieved many successes in practice recently. However, existing theories cannot explain their convergence and speedup properties, mainly due to the nonconvexity of most deep learning formulations and the asynchronous parallel mechanism. To fill the gaps in theory and provide theoretical support, this paper studies two asynchronous parallel implementations of SG: one on a computer network and the other on a shared-memory system. We establish an ergodic convergence rate $O(1/\sqrt{K})$ for both algorithms and prove that linear speedup is achievable if the number of workers is bounded by $\sqrt{K}$ ($K$ is the total number of iterations). Our results generalize and improve existing analyses for convex minimization.

NeurIPS Conference 2014 Conference Paper

Exclusive Feature Learning on Arbitrary Structures via $\ell_{1,2}$-norm

  • Deguang Kong
  • Ryohei Fujimaki
  • Ji Liu
  • Feiping Nie
  • Chris Ding

Group lasso is widely used to enforce structural sparsity, achieving sparsity at the inter-group level. In this paper, we propose a new formulation called "exclusive group lasso", which brings out sparsity at the intra-group level in the context of feature selection. The proposed exclusive group lasso is applicable to any feature structure, regardless of whether the groups overlap. We analyze the properties of the exclusive group lasso and propose an effective iteratively re-weighted algorithm to solve the corresponding optimization problem, with a rigorous convergence analysis. We show applications of exclusive group lasso to uncorrelated feature selection. Extensive experiments on both synthetic and real-world datasets indicate the good performance of the proposed methods.
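
A minimal sketch of the exclusive regularizer and an iteratively re-weighted scheme of the kind described above, on a least-squares problem: the penalty $\sum_g \|w_g\|_1^2$ is majorized by a quadratic $w^\top F w$ with diagonal weights $F_{ii} = \|w_{g(i)}\|_1 / |w_i|$, and each iteration solves the resulting ridge system with the weights held fixed. Problem sizes, groups, and the regularization strength are illustrative assumptions.

```python
import numpy as np

rng = np.random.default_rng(11)
n, d, lam = 100, 12, 1.0
groups = [list(range(0, 4)), list(range(4, 8)), list(range(8, 12))]
X = rng.normal(size=(n, d))
w_true = np.zeros(d); w_true[[0, 4, 8]] = 2.0   # one feature per group
y = X @ w_true + 0.1 * rng.normal(size=n)

w = np.ones(d)
for it in range(50):
    # Re-weighting step: diagonal majorizer of sum_g ||w_g||_1^2.
    F = np.zeros(d)
    for g in groups:
        F[g] = np.sum(np.abs(w[g])) / (np.abs(w[g]) + 1e-8)
    # Ridge-type subproblem with the current weights held fixed.
    w = np.linalg.solve(X.T @ X + lam * np.diag(F), X.T @ y)

print("learned w (rounded):", np.round(w, 2))
```

The recovered solution keeps roughly one large coefficient per group, which is the intra-group sparsity the exclusive penalty is designed to promote.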

NeurIPS Conference 2013 Conference Paper

An Approximate, Efficient LP Solver for LP Rounding

  • Srikrishna Sridhar
  • Stephen Wright
  • Christopher Re
  • Ji Liu
  • Victor Bittorf
  • Ce Zhang

Many problems in machine learning can be solved by rounding the solution of an appropriate linear program. We propose a scheme that is based on a quadratic program relaxation which allows us to use parallel stochastic-coordinate-descent to approximately solve large linear programs efficiently. Our software is an order of magnitude faster than Cplex (a commercial linear programming solver) and yields similar solution quality. Our results include a novel perturbation analysis of a quadratic-penalty formulation of linear programming and a convergence result, which we use to derive running time and quality guarantees.

JMLR Journal 2012 Journal Article

A Multi-Stage Framework for Dantzig Selector and LASSO

  • Ji Liu
  • Peter Wonka
  • Jieping Ye

We consider the following sparse signal recovery (or feature selection) problem: given a design matrix $X\in \mathbb{R}^{n\times m}$ ($m\gg n$) and a noisy observation vector $y\in \mathbb{R}^{n}$ satisfying $y=X\beta^*+\epsilon$ where $\epsilon$ is the noise vector following a Gaussian distribution $N(0,\sigma^2 I)$, how to recover the signal (or parameter vector) $\beta^*$ when the signal is sparse? The Dantzig selector has been proposed for sparse signal recovery with strong theoretical guarantees. In this paper, we propose a multi-stage Dantzig selector method, which iteratively refines the target signal $\beta^*$. We show that if $X$ obeys a certain condition, then with a large probability the difference between the solution $\hat\beta$ estimated by the proposed method and the true solution $\beta^*$ measured in terms of the $l_p$ norm ($p\geq 1$) is bounded as $\|\hat\beta-\beta^*\|_p \leq (C(s-N)^{1/p}\sqrt{\log m}+\Delta)\sigma$, where $C$ is a constant, $s$ is the number of nonzero entries in $\beta^*$, the risk of the oracle estimator $\Delta$ is independent of $m$ and is much smaller than the first term, and $N$ is the number of entries of $\beta^*$ larger than a certain value in the order of $\mathcal{O}(\sigma\sqrt{\log m})$. The proposed method improves the estimation bound of the standard Dantzig selector approximately from $Cs^{1/p}\sqrt{\log m}\,\sigma$ to $C(s-N)^{1/p}\sqrt{\log m}\,\sigma$ where the value $N$ depends on the number of large entries in $\beta^*$. When $N=s$, the proposed algorithm achieves the oracle solution with a high probability, where the oracle solution is the projection of the observation vector $y$ onto true features. In addition, with a large probability, the proposed method can select the same number of correct features under a milder condition than the Dantzig selector. Finally, we extend this multi-stage procedure to the LASSO case.

NeurIPS Conference 2012 Conference Paper

Regularized Off-Policy TD-Learning

  • Bo Liu
  • Sridhar Mahadevan
  • Ji Liu

We present a novel $l_1$ regularized off-policy convergent TD-learning method (termed RO-TD), which is able to learn sparse representations of value functions with low computational complexity. The algorithmic framework underlying RO-TD integrates two key ideas: off-policy convergent gradient TD methods, such as TDC, and a convex-concave saddle-point formulation of non-smooth convex optimization, which enables first-order solvers and feature selection using online convex regularization. A detailed theoretical analysis of RO-TD is presented, and a variety of experiments illustrate the off-policy convergence, sparse feature selection capability, and low computational cost of the RO-TD algorithm.

NeurIPS Conference 2010 Conference Paper

Multi-Stage Dantzig Selector

  • Ji Liu
  • Peter Wonka
  • Jieping Ye

We consider the following sparse signal recovery (or feature selection) problem: given a design matrix $X\in \mathbb{R}^{n\times m}$ $(m\gg n)$ and a noisy observation vector $y\in \mathbb{R}^{n}$ satisfying $y=X\beta^*+\epsilon$ where $\epsilon$ is the noise vector following a Gaussian distribution $N(0, \sigma^2I)$, how to recover the signal (or parameter vector) $\beta^*$ when the signal is sparse? The Dantzig selector has been proposed for sparse signal recovery with strong theoretical guarantees. In this paper, we propose a multi-stage Dantzig selector method, which iteratively refines the target signal $\beta^*$. We show that if $X$ obeys a certain condition, then with a large probability the difference between the solution $\hat\beta$ estimated by the proposed method and the true solution $\beta^*$ measured in terms of the $l_p$ norm ($p\geq 1$) is bounded as \begin{equation*} \|\hat\beta-\beta^*\|_p\leq \left(C(s-N)^{1/p}\sqrt{\log m}+\Delta\right)\sigma, \end{equation*} where $C$ is a constant, $s$ is the number of nonzero entries in $\beta^*$, $\Delta$ is independent of $m$ and is much smaller than the first term, and $N$ is the number of entries of $\beta^*$ larger than a certain value in the order of $\mathcal{O}(\sigma\sqrt{\log m})$. The proposed method improves the estimation bound of the standard Dantzig selector approximately from $Cs^{1/p}\sqrt{\log m}\sigma$ to $C(s-N)^{1/p}\sqrt{\log m}\sigma$ where the value $N$ depends on the number of large entries in $\beta^*$. When $N=s$, the proposed algorithm achieves the oracle solution with a high probability. In addition, with a large probability, the proposed method can select the same number of correct features under a milder condition than the Dantzig selector.