Arrow Research search

Author name cluster

Anastasios Kyrillidis

Possible papers associated with this exact author name in Arrow. This page groups case-insensitive exact name matches and is not a full identity disambiguation profile.

30 papers
2 author rows

Possible papers

30

AAAI Conference 2026 Conference Paper

A Catalyst Framework for the Quantum Linear System Problem via the Proximal Point Algorithm

  • Junhyung Lyle Kim
  • Nai-Hui Chia
  • Anastasios Kyrillidis

Solving systems of linear equations is a fundamental problem, but it can be computationally intensive for classical algorithms in high dimensions. Existing quantum algorithms can achieve exponential speedups for the quantum linear system problem (QLSP) in terms of the problem dimension, but the advantage is bottlenecked by the condition number of the coefficient matrix. In this work, we propose a new quantum algorithm for QLSP inspired by the classical proximal point algorithm (PPA). Our proposed method can be viewed as a meta-algorithm that allows inverting a modified matrix via an existing QLSP solver, thereby directly approximating the solution vector instead of approximating the inverse of the coefficient matrix. By carefully choosing the step size eta, the proposed algorithm can effectively precondition the linear system to mitigate the dependence on condition numbers that hindered the applicability of previous approaches. Importantly, this is the first iterative framework for QLSP in which a tunable parameter eta and an initialization x_0 allow controlling the trade-off between the runtime and the approximation error.
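
To make the preconditioning intuition concrete, here is a minimal sketch of the classical proximal-point view, assuming a positive-definite coefficient matrix A (the quantum construction and the exact runtime analysis are in the paper):

```latex
% Classical proximal-point view of Ax = b with A positive definite (illustrative only):
% f(x) = (1/2) x^T A x - b^T x, so each PPA step solves a better-conditioned system.
\[
x_{k+1} \;=\; \arg\min_{x}\; \tfrac{1}{2}x^{\top}Ax - b^{\top}x + \tfrac{1}{2\eta}\,\|x - x_k\|^2
\quad\Longleftrightarrow\quad
(I + \eta A)\,x_{k+1} \;=\; x_k + \eta\, b,
\]
\[
\kappa(I + \eta A) \;=\; \frac{1 + \eta\,\lambda_{\max}(A)}{1 + \eta\,\lambda_{\min}(A)} \;\le\; \kappa(A),
\qquad \kappa(I + \eta A) \to 1 \ \text{as}\ \eta \to 0 .
\]
```

Smaller eta brings the modified condition number toward 1 but requires more outer iterations, which is exactly the runtime/accuracy trade-off controlled by eta and x_0.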

NeurIPS Conference 2025 Conference Paper

Learning to Specialize: Joint Gating-Expert Training for Adaptive MoEs in Decentralized Settings

  • Yehya Farhat
  • Hamza ElMokhtar Shili
  • Fangshuo Liao
  • Chen Dun
  • Mirian Hipolito Garcia
  • Guoqing Zheng
  • Ahmed Awadallah
  • Robert Sim

Mixture-of-Experts (MoEs) achieve scalability by dynamically activating subsets of their components. Yet, understanding how expertise emerges through joint training of gating mechanisms and experts remains incomplete, especially in scenarios without clear task partitions. Motivated by inference costs and data heterogeneity, we study how joint training of gating functions and experts can dynamically allocate domain-specific expertise across multiple underlying data distributions. As an outcome of our framework, we develop an instance tailored specifically to decentralized training scenarios, introducing Dynamically Decentralized Orchestration of MoEs, or DDOME. DDOME leverages heterogeneity emerging from distributional shifts across decentralized data sources to specialize experts dynamically. By integrating a pretrained common expert to inform a gating function, DDOME achieves personalized expert subset selection on-the-fly, facilitating just-in-time personalization. We empirically validate DDOME within a Federated Learning (FL) context: DDOME attains from a 4% up to a 24% accuracy improvement over state-of-the-art FL baselines in image and text classification tasks, while maintaining competitive zero-shot generalization capabilities. Furthermore, we provide theoretical insights confirming that the joint gating-experts training is critical for achieving meaningful expert specialization.
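
As a rough, hypothetical sketch of the kind of joint gating-expert computation described above (names and architecture are illustrative, not the DDOME implementation):

```python
import torch
import torch.nn as nn

class TopKGating(nn.Module):
    """Generic sparse gating over a pool of experts (illustrative; not the DDOME architecture).

    A linear gate scores every expert per input; only the top-k experts are
    activated and their outputs are combined with softmax-normalized weights.
    """

    def __init__(self, dim: int, experts: nn.ModuleList, k: int = 2):
        super().__init__()
        self.gate = nn.Linear(dim, len(experts))
        self.experts = experts
        self.k = k

    def forward(self, x: torch.Tensor) -> torch.Tensor:
        scores = self.gate(x)                              # (batch, n_experts)
        topk_vals, topk_idx = scores.topk(self.k, dim=-1)  # choose k experts per input
        weights = torch.softmax(topk_vals, dim=-1)         # normalize over the chosen k
        out = torch.zeros_like(x)
        for b in range(x.shape[0]):                        # per-sample routing, kept simple for clarity
            for slot in range(self.k):
                expert = self.experts[int(topk_idx[b, slot])]
                out[b] += weights[b, slot] * expert(x[b:b + 1]).squeeze(0)
        return out
```

With, e.g., experts = nn.ModuleList([nn.Linear(dim, dim) for _ in range(8)]), only two experts fire per input; jointly training the gate and the experts is what lets specialization emerge.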

AAAI Conference 2025 Conference Paper

Sweeping Heterogeneity with Smart MoPs: Mixture of Prompts for LLM Task Adaptation

  • Chen Dun
  • Mirian Del Carmen Hipolito Garcia
  • Guoqing Zheng
  • Ahmed Hassan Awadallah
  • Robert Sim
  • Anastasios Kyrillidis

Prompt instruction tuning is a popular approach to better adjust pretrained LLMs for specific downstream tasks. How to extend this approach to simultaneously handle multiple tasks and data distributions is an interesting question. We propose Mixture of Prompts (MoPs) with smart gating functionality. Our proposed system identifies relevant skills embedded in different groups of prompts and dynamically weighs experts (i.e., collections of prompts) based on the target task. Experiments show that MoPs are resilient to model compression, data source, and task composition, making them highly versatile and applicable in various contexts. In practice, MoPs can simultaneously mitigate prompt training "interference" in multi-task, multi-source scenarios (e.g., task and data heterogeneity across sources) and possible implications from model approximations. Empirically, MoPs show particular effectiveness in compressed model scenarios, while maintaining favorable performance in uncompressed settings: MoPs can reduce final perplexity from 9% up to 70% in non-i.i.d. distributed cases and from 3% up to 30% in centralized cases, compared to baselines.

ICLR Conference 2024 Conference Paper

Adaptive Federated Learning with Auto-Tuned Clients

  • Junhyung Lyle Kim
  • Mohammad Taha Toghani
  • Cesar A. Uribe
  • Anastasios Kyrillidis

Federated learning (FL) is a distributed machine learning framework where the global model of a central server is trained via multiple collaborative steps by participating clients without sharing their data. FL is a flexible framework, in which the distribution of local data, the participation rate, and the computing power of each client can vary greatly, but this flexibility gives rise to many new challenges, especially for hyperparameter tuning on the client side. We propose $\Delta$-SGD, a simple step size rule for SGD that enables each client to use its own step size by adapting to the local smoothness of the function each client is optimizing. We provide theoretical and empirical results showing the benefit of client adaptivity in various FL scenarios.
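
A minimal sketch of a locally adaptive step size of this flavor is shown below; it follows a standard smoothness-estimating rule and is only meant to illustrate the mechanism, not to reproduce the exact $\Delta$-SGD schedule or its FL client loop:

```python
import numpy as np

def adaptive_sgd(grad, x0, steps=100, lr0=1e-3):
    """Gradient descent with a locally estimated step size (illustrative sketch).

    The step size is capped by an inverse estimate of the local smoothness,
    ||x_k - x_{k-1}|| / (2 ||g_k - g_{k-1}||), in the spirit of auto-tuned rules
    such as Delta-SGD; the exact client-side schedule is in the paper.
    """
    x_prev, g_prev, lr_prev, theta = x0, grad(x0), lr0, np.inf
    x = x_prev - lr_prev * g_prev
    for _ in range(steps):
        g = grad(x)
        local_smooth = np.linalg.norm(x - x_prev) / (2.0 * np.linalg.norm(g - g_prev) + 1e-12)
        lr = min(np.sqrt(1.0 + theta) * lr_prev, local_smooth)  # cautious growth vs. smoothness cap
        theta, lr_prev = lr / lr_prev, lr
        x_prev, g_prev = x, g
        x = x - lr * g
    return x
```

E.g., adaptive_sgd(lambda x: 2.0 * (x - 3.0), np.array([0.0])) drifts toward 3 without any manual learning-rate tuning.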

TMLR Journal 2024 Journal Article

How Much Pre-training Is Enough to Discover a Good Subnetwork?

  • Cameron R. Wolfe
  • Fangshuo Liao
  • Qihan Wang
  • Junhyung Lyle Kim
  • Anastasios Kyrillidis

Neural network pruning helps discover efficient, high-performing subnetworks within pre-trained, dense network architectures. More often than not, it involves a three-step process—pre-training, pruning, and re-training—that is computationally expensive, as the dense model must be fully pre-trained. While previous work has revealed through experiments the relationship between the amount of pre-training and the performance of the pruned network, a theoretical characterization of this dependency is still missing. Aiming to mathematically analyze the amount of dense network pre-training needed for a pruned network to perform well, we discover a simple theoretical bound on the number of gradient descent pre-training iterations of a two-layer fully connected network in the NTK regime, beyond which pruning via greedy forward selection \citep{provable_subnetworks} yields a subnetwork that achieves good training error. Interestingly, this threshold depends logarithmically on the size of the dataset, meaning that experiments with larger datasets require more pre-training for subnetworks obtained via pruning to perform well. Lastly, we empirically validate our theoretical results on multi-layer perceptrons and residual-based convolutional networks trained on the MNIST, CIFAR, and ImageNet datasets.

ICML Conference 2024 Conference Paper

On the Error-Propagation of Inexact Hotelling's Deflation for Principal Component Analysis

  • Fangshuo Liao
  • Junhyung Lyle Kim
  • Cruz Barnum
  • Anastasios Kyrillidis

Principal Component Analysis (PCA) aims to find subspaces spanned by the so-called principal components that best represent the variance in the dataset. The deflation method is a popular meta-algorithm that sequentially finds individual principal components, starting from the most important ones and working towards the less important ones. However, as deflation proceeds, numerical errors from the imprecise estimation of principal components propagate due to its sequential nature. This paper mathematically characterizes the error propagation of the inexact Hotelling’s deflation method. We consider two scenarios: $i)$ when the sub-routine for finding the leading eigenvector is abstract and can represent various algorithms; and $ii)$ when power iteration is used as the sub-routine. In the latter case, the additional directional information from power iteration allows us to obtain a tighter error bound than the sub-routine agnostic case. For both scenarios, we explicitly characterize how the errors progress and affect subsequent principal component estimations.
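
For reference, a bare-bones version of inexact Hotelling's deflation with power iteration as the sub-routine (illustrative only; the error-propagation bounds are in the paper):

```python
import numpy as np

def power_iteration(A, iters=100, seed=0):
    """Approximate the leading eigenpair of a symmetric PSD matrix A."""
    rng = np.random.default_rng(seed)
    v = rng.standard_normal(A.shape[0])
    v /= np.linalg.norm(v)
    for _ in range(iters):
        v = A @ v
        v /= np.linalg.norm(v)
    return v @ A @ v, v

def hotelling_deflation(A, k, iters=100):
    """Sequentially estimate k principal components via inexact deflation.

    Each leading eigenvector is only approximated (here by a finite number of
    power iterations), so estimation error propagates into later components --
    exactly the effect the paper quantifies.
    """
    A = A.copy()
    comps = []
    for _ in range(k):
        lam, v = power_iteration(A, iters=iters)
        comps.append(v)
        A = A - lam * np.outer(v, v)   # Hotelling's deflation step
    return np.array(comps)
```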

ICRA Conference 2024 Conference Paper

Stochastic Implicit Neural Signed Distance Functions for Safe Motion Planning under Sensing Uncertainty

  • Carlos Quintero-Peña
  • Wil Thomason
  • Zachary Kingston
  • Anastasios Kyrillidis
  • Lydia E. Kavraki

Motion planning under sensing uncertainty is critical for robots in unstructured environments, to guarantee safety for both the robot and any nearby humans. Most work on planning under uncertainty does not scale to high-dimensional robots such as manipulators, assumes simplified geometry of the robot or environment, or requires per-object knowledge of noise. Instead, we propose a method that directly models sensor-specific aleatoric uncertainty to find safe motions for high-dimensional systems in complex environments, without exact knowledge of environment geometry. We combine a novel implicit neural model of stochastic signed distance functions with a hierarchical optimization-based motion planner to plan low-risk motions without sacrificing path quality. Our method also explicitly bounds the risk of the path, offering trustworthiness. We empirically validate that our method produces safe motions and accurate risk bounds and is safer than baseline approaches.

TMLR Journal 2024 Journal Article

When is Momentum Extragradient Optimal? A Polynomial-Based Analysis

  • Junhyung Lyle Kim
  • Gauthier Gidel
  • Anastasios Kyrillidis
  • Fabian Pedregosa

The extragradient method has gained popularity due to its robust convergence properties for differentiable games. Unlike single-objective optimization, game dynamics involve complex interactions reflected by the eigenvalues of the game vector field's Jacobian scattered across the complex plane. This complexity can cause the simple gradient method to diverge, even for bilinear games, while the extragradient method achieves convergence. Building on the recently proven accelerated convergence of the momentum extragradient method for bilinear games \citep{azizian2020accelerating}, we use a polynomial-based analysis to identify three distinct scenarios where this method exhibits further accelerated convergence. These scenarios encompass situations where the eigenvalues reside on the (positive) real line, lie on the real line alongside complex conjugates, or exist solely as complex conjugates. Furthermore, we derive the hyperparameters for each scenario that achieve the fastest convergence rate.
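
For concreteness, one common way to write the momentum extragradient iteration on a game vector field F (generic notation; the paper derives the optimal hyperparameters for each spectral scenario):

```latex
% Momentum extragradient on a game vector field F (generic form; step size h,
% extrapolation step gamma, and momentum m are the tunable hyperparameters).
\[
x_{k+1} \;=\; x_k \;-\; h\,F\!\bigl(x_k - \gamma\,F(x_k)\bigr) \;+\; m\,(x_k - x_{k-1}).
\]
```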

ICRA Conference 2023 Conference Paper

Optimal Grasps and Placements for Task and Motion Planning in Clutter

  • Carlos Quintero-Peña
  • Zachary Kingston
  • Tianyang Pan
  • Rahul Shome
  • Anastasios Kyrillidis
  • Lydia E. Kavraki

Many methods that solve robot planning problems, such as task and motion planners, employ discrete symbolic search to find sequences of valid symbolic actions that are grounded with motion planning. Much of the efficacy of these planners lies in this grounding: bad placement and grasp choices can lead to inefficient planning when a problem has many geometric constraints. Moreover, grounding methods such as naïve sampling often fail to find appropriate values for these choices in the presence of clutter. Towards efficient task and motion planning, we present a novel optimization-based approach for grounding to solve cluttered problems that have many constraints that arise from geometry. Our approach finds an optimal grounding and can provide feedback to discrete search for more effective planning. We demonstrate our method against baseline methods in complex simulated environments.

NeurIPS Conference 2023 Conference Paper

Scissorhands: Exploiting the Persistence of Importance Hypothesis for LLM KV Cache Compression at Test Time

  • Zichang Liu
  • Aditya Desai
  • Fangshuo Liao
  • Weitao Wang
  • Victor Xie
  • Zhaozhuo Xu
  • Anastasios Kyrillidis
  • Anshumali Shrivastava

Large language models (LLMs) have sparked a new wave of exciting AI applications. Hosting these models at scale requires significant memory resources. One crucial memory bottleneck for deployment stems from the context window. It is commonly recognized that model weights are memory hungry; however, the size of the key-value embeddings stored during the generation process (the KV cache) can easily surpass the model size. The enormous size of the KV cache puts constraints on the inference batch size, which is crucial for high-throughput inference workloads. Inspired by an interesting observation of the attention scores, we hypothesize the persistence of importance: only pivotal tokens, which had a substantial influence at one step, will significantly influence future generations. Based on our empirical verification and theoretical analysis around this hypothesis, we propose Scissorhands, a system that maintains the memory usage of the KV cache at a fixed budget without finetuning the model. In essence, Scissorhands manages the KV cache by storing the pivotal tokens with a higher probability. We validate that Scissorhands reduces the inference memory usage of the KV cache by up to 5$\times$ without compromising model quality. We further demonstrate that Scissorhands can be combined with 4-bit quantization, traditionally used to compress model weights, to achieve up to 20$\times$ compression.
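
A toy sketch of fixed-budget KV-cache eviction driven by accumulated attention is shown below; it is a simplification (deterministic, single head) of the probabilistic, per-head policy the paper describes:

```python
import numpy as np

def compress_kv_cache(keys, values, attn_history, budget):
    """Keep the KV cache within a fixed token budget (illustrative sketch).

    Tokens that have received the largest cumulative attention so far are treated
    as pivotal and retained; the rest are evicted. The actual Scissorhands system
    makes this decision probabilistically and per attention head.
    """
    n_tokens = keys.shape[0]
    if n_tokens <= budget:
        return keys, values, attn_history
    # Rank tokens by how much attention they attracted in past decoding steps.
    keep = np.argsort(attn_history)[-budget:]
    keep.sort()                      # preserve original token order
    return keys[keep], values[keep], attn_history[keep]

# Example: 6 cached tokens, keep the 4 most-attended ones.
K = np.random.randn(6, 8); V = np.random.randn(6, 8)
scores = np.array([0.9, 0.1, 0.7, 0.05, 0.6, 0.3])
K2, V2, s2 = compress_kv_cache(K, V, scores, budget=4)   # keeps tokens 0, 2, 4, 5
```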

TMLR Journal 2022 Journal Article

On the Convergence of Shallow Neural Network Training with Randomly Masked Neurons

  • Fangshuo Liao
  • Anastasios Kyrillidis

With the motive of training all the parameters of a neural network, we study why and when one can achieve this by iteratively creating, training, and combining randomly selected subnetworks. Such scenarios have either implicitly or explicitly emerged in the recent literature: see e.g., the Dropout family of regularization techniques, or some distributed ML training protocols that reduce communication/computation complexities, such as the Independent Subnet Training protocol. While these methods are studied empirically and utilized in practice, they often enjoy partial or no theoretical support, especially when applied on neural network-based objectives. In this manuscript, our focus is on overparameterized single hidden layer neural networks with ReLU activations in the lazy training regime. By carefully analyzing $i)$ the subnetworks' neural tangent kernel, $ii)$ the surrogate functions' gradient, and $iii)$ how we sample and combine the surrogate functions, we prove linear convergence rate of the training error --up to a neighborhood around the optimal point-- for an overparameterized single-hidden layer perceptron with a regression loss. Our analysis reveals a dependency of the size of the neighborhood around the optimal point on the number of surrogate models and the number of local training steps for each selected subnetwork. Moreover, the considered framework generalizes and provides new insights on dropout training, multi-sample dropout training, as well as Independent Subnet Training; for each case, we provide convergence results as corollaries of our main theorem.

ICLR Conference 2022 Conference Paper

PipeGCN: Efficient Full-Graph Training of Graph Convolutional Networks with Pipelined Feature Communication

  • Cheng Wan 0005
  • Youjie Li
  • Cameron R. Wolfe
  • Anastasios Kyrillidis
  • Nam Sung Kim
  • Yingyan Celine Lin

Graph Convolutional Networks (GCNs) are the state-of-the-art method for learning graph-structured data, and training large-scale GCNs requires distributed training across multiple accelerators such that each accelerator is able to hold a partitioned subgraph. However, distributed GCN training incurs prohibitive overhead of communicating node features and feature gradients among partitions for every GCN layer during each training iteration, limiting the achievable training efficiency and model scalability. To this end, we propose PipeGCN, a simple yet effective scheme that hides the communication overhead by pipelining inter-partition communication with intra-partition computation. It is non-trivial to pipeline for efficient GCN training, as communicated node features/gradients will become stale and thus can harm the convergence, negating the pipeline benefit. Notably, little is known regarding the convergence rate of GCN training with both stale features and stale feature gradients. This work not only provides a theoretical convergence analysis but also finds the convergence rate of PipeGCN to be close to that of vanilla distributed GCN training without any staleness. Furthermore, we develop a smoothing method to further improve PipeGCN's convergence. Extensive experiments show that PipeGCN can largely boost the training throughput (1.7×–28.5×) while achieving the same accuracy as its vanilla counterpart and existing full-graph training methods. The code is available at https://github.com/RICE-EIC/PipeGCN.

UAI Conference 2022 Conference Paper

ResIST: Layer-wise decomposition of ResNets for distributed training

  • Chen Dun
  • Cameron R. Wolfe
  • Christopher M. Jermaine
  • Anastasios Kyrillidis

We propose ResIST, a novel distributed training protocol for Residual Networks (ResNets). ResIST randomly decomposes a global ResNet into several shallow sub-ResNets that are trained independently in a distributed manner for several local iterations, before having their updates synchronized and aggregated into the global model. In the next round, new sub-ResNets are randomly generated and the process repeats until convergence. By construction, per iteration, ResIST communicates only a small portion of network parameters to each machine and never uses the full model during training. Thus, ResIST reduces the per-iteration communication, memory, and time requirements of ResNet training to only a fraction of the requirements of full-model training. In comparison to common protocols, like data-parallel training and data-parallel training with local SGD, ResIST yields a decrease in communication and compute requirements, while being competitive with respect to model performance.

UAI Conference 2022 Conference Paper

Stackmix: a complementary mix algorithm

  • John Chen 0002
  • Samarth Sinha
  • Anastasios Kyrillidis

Techniques combining multiple images as input/output have proven to be effective data augmentations for training convolutional neural networks. In this paper, we present StackMix: each input is presented as a concatenation of two images, and the label is the mean of the two one-hot labels. On its own, StackMix rivals other widely used methods in the “Mix” line of work. More importantly, unlike previous work, significant gains across a variety of benchmarks are achieved by combining StackMix with existing Mix augmentations, effectively mixing more than two images. E.g., by combining StackMix with CutMix, test error in the supervised setting is improved across a variety of settings over CutMix, including 0.8% on ImageNet, 3% on Tiny ImageNet, 2% on CIFAR-100, 0.5% on CIFAR-10, and 1.5% on STL-10. Similar results are achieved with Mixup. We further show that gains hold for robustness to common input corruptions and perturbations at varying severities, with a 0.7% improvement on CIFAR-100-C by combining StackMix with AugMix over AugMix. On its own, improvements with StackMix hold across different numbers of labeled samples on CIFAR-100, maintaining approximately a 2% gap in test accuracy (down to using only 5% of the whole dataset), and StackMix is effective in the semi-supervised setting with a 2% improvement with the standard benchmark Pi-model. Finally, we perform an extensive ablation study to better understand the proposed methodology.
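
A minimal numpy sketch of the StackMix pairing (the concatenation axis and dataset shapes here are illustrative):

```python
import numpy as np

def stackmix(x1, y1, x2, y2, num_classes, axis=1):
    """StackMix-style pairing (illustrative): concatenate two images along one
    spatial axis and average their one-hot labels."""
    def one_hot(y):
        v = np.zeros(num_classes, dtype=np.float32)
        v[y] = 1.0
        return v
    x = np.concatenate([x1, x2], axis=axis)      # e.g. stack two HxWxC images side by side
    y = 0.5 * (one_hot(y1) + one_hot(y2))        # label is the mean of the two one-hot labels
    return x, y

# Example: two 32x32x3 CIFAR-style images become one 32x64x3 input.
a, b = np.zeros((32, 32, 3)), np.ones((32, 32, 3))
x, y = stackmix(a, 3, b, 7, num_classes=10)
assert x.shape == (32, 64, 3) and y[3] == y[7] == 0.5
```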

AAAI Conference 2021 Conference Paper

On Continuous Local BDD-Based Search for Hybrid SAT Solving

  • Anastasios Kyrillidis
  • Moshe Vardi
  • Zhiwei Zhang

We explore the potential of continuous local search (CLS) in SAT solving by proposing a novel approach for finding a solution of a hybrid system of Boolean constraints. The algorithm is based on CLS combined with belief propagation on binary decision diagrams (BDDs). Our framework accepts all Boolean constraints that admit compact BDDs, including symmetric Boolean constraints and small-coefficient pseudo-Boolean constraints as interesting families. We propose a novel algorithm for efficiently computing the gradient needed by CLS. We study the capabilities and limitations of our versatile CLS solver, GradSAT, by applying it on many benchmark instances. The experimental results indicate that GradSAT can be a useful addition to the portfolio of existing SAT and MaxSAT solvers for solving Boolean satisfiability and optimization problems.

ICRA Conference 2021 Conference Paper

Robust Optimization-based Motion Planning for high-DOF Robots under Sensing Uncertainty

  • Carlos Quintero-Peña
  • Anastasios Kyrillidis
  • Lydia E. Kavraki

Motion planning for high degree-of-freedom (DOF) robots is challenging, especially when acting in complex environments under sensing uncertainty. While there is significant work on how to plan under state uncertainty for low-DOF robots, existing methods cannot be easily translated into the high-DOF case, due to the complex geometry of the robot’s body and its environment. In this paper, we present a method that enhances optimization-based motion planners to produce robust trajectories for high-DOF robots in environments with convex obstacles. Our approach introduces robustness into planners that are based on sequential convex programming: we reformulate each convex subproblem as a robust optimization problem that "protects" the solution against deviations due to sensing uncertainty. The parameters of the robust problem are estimated by sampling from the distribution of noisy obstacles, and by performing a first-order approximation of the signed distance function. The original merit function is updated to account for the new costs of the robust formulation at every step. The effectiveness of our approach is demonstrated in two simulated experiments that involve a full-body square robot moving in randomly generated scenes, and a 7-DOF Fetch robot performing tabletop operations. The results show nearly zero probability of collision for a reasonable range of the noise parameters under Gaussian and uniform uncertainty.

AIJ Journal 2021 Journal Article

Solving hybrid Boolean constraints in continuous space via multilinear Fourier expansions

  • Anastasios Kyrillidis
  • Anshumali Shrivastava
  • Moshe Y. Vardi
  • Zhiwei Zhang

The Boolean SATisfiability problem (SAT) is of central importance in computer science. Although SAT is known to be NP-complete, progress on the engineering side—especially that of Conflict-Driven Clause Learning (CDCL) and Local Search SAT solvers—has been remarkable. Yet, while SAT solvers, aimed at solving industrial-scale benchmarks in Conjunctive Normal Form (CNF), have become quite mature, SAT solvers that are effective on other types of constraints (e.g., cardinality constraints and XORs) are less well-studied; a general approach to handling non-CNF constraints is still lacking. To address the issue above, we design FourierSAT, an incomplete SAT solver based on Fourier Analysis (also known as the Walsh-Fourier Transform) of Boolean functions, a technique to represent Boolean functions by multilinear polynomials. By such a reduction to continuous optimization, we propose an algebraic framework for solving systems consisting of different types of constraints. The idea is to leverage gradient information to guide the search process in the direction of local improvements. We show this reduction enjoys interesting theoretical properties. Empirical results demonstrate that FourierSAT can be a useful complement to other solvers on certain classes of benchmarks.

AAAI Conference 2020 Conference Paper

FourierSAT: A Fourier Expansion-Based Algebraic Framework for Solving Hybrid Boolean Constraints

  • Anastasios Kyrillidis
  • Anshumali Shrivastava
  • Moshe Vardi
  • Zhiwei Zhang

The Boolean SATisfiability problem (SAT) is of central importance in computer science. Although SAT is known to be NP-complete, progress on the engineering side—especially that of Conflict-Driven Clause Learning (CDCL) and Local Search SAT solvers—has been remarkable. Yet, while SAT solvers, aimed at solving industrial-scale benchmarks in Conjunctive Normal Form (CNF), have become quite mature, SAT solvers that are effective on other types of constraints (e.g., cardinality constraints and XORs) are less well-studied; a general approach to handling non-CNF constraints is still lacking. In addition, previous work indicated that for specific classes of benchmarks, the running time of extant SAT solvers depends heavily on properties of the formula and details of encoding, instead of the scale of the benchmarks, which adds uncertainty to expectations of running time. To address the issues above, we design FourierSAT, an incomplete SAT solver based on Fourier analysis of Boolean functions, a technique to represent Boolean functions by multilinear polynomials. By such a reduction to continuous optimization, we propose an algebraic framework for solving systems consisting of different types of constraints. The idea is to leverage gradient information to guide the search process in the direction of local improvements. Empirical results demonstrate that FourierSAT is more robust than other solvers on certain classes of benchmarks.
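
As a toy illustration of the Walsh-Fourier reduction on the simplest constraint type, the sketch below encodes two XOR constraints as multilinear monomials over [-1, 1]^3 and runs projected gradient descent; FourierSAT's actual expansions, constraint weighting, and search heuristics are more involved:

```python
import numpy as np

# Encode False -> +1, True -> -1. The Walsh-Fourier expansion of an XOR (parity)
# constraint over variables S is the single monomial prod_{i in S} x_i, which
# equals -1 exactly when the constraint is satisfied.

def xor_poly(x, idx):
    """Multilinear extension of XOR over the variables in idx (value -1 iff satisfied)."""
    return np.prod(x[idx])

def objective(x, constraints):
    # Sum of constraint polynomials: lower means "more satisfied".
    return sum(xor_poly(x, c) for c in constraints)

def grad(x, constraints):
    g = np.zeros_like(x)
    for c in constraints:
        for i in c:
            rest = [j for j in c if j != i]
            g[i] += np.prod(x[rest])        # d/dx_i of prod_{j in c} x_j
    return g

constraints = [np.array([0, 1]), np.array([1, 2])]   # b0 XOR b1 = 1,  b1 XOR b2 = 1
x = np.random.uniform(-0.5, 0.5, size=3)
for _ in range(200):
    x = np.clip(x - 0.1 * grad(x, constraints), -1.0, 1.0)   # projected gradient step
assignment = x < 0                                           # decode: negative -> True
print(assignment, objective(np.sign(x), constraints))        # objective -2 means both satisfied
```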

ICML Conference 2020 Conference Paper

Negative Sampling in Semi-Supervised learning

  • John Chen 0002
  • Vatsal Shah
  • Anastasios Kyrillidis

We introduce Negative Sampling in Semi-Supervised Learning (NS^3L), a simple, fast, easy-to-tune algorithm for semi-supervised learning (SSL). NS^3L is motivated by the success of negative sampling/contrastive estimation. We demonstrate that adding the NS^3L loss to state-of-the-art SSL algorithms, such as Virtual Adversarial Training (VAT), significantly improves upon vanilla VAT and its variant, VAT with Entropy Minimization. By adding the NS^3L loss to MixMatch, the current state-of-the-art approach on semi-supervised tasks, we observe significant improvements over vanilla MixMatch. We conduct extensive experiments on the CIFAR10, CIFAR100, SVHN and STL10 benchmark datasets. Finally, we perform an ablation study for NS^3L regarding its hyperparameter tuning.

ICML Conference 2019 Conference Paper

Compressing Gradient Optimizers via Count-Sketches

  • Ryan Spring
  • Anastasios Kyrillidis
  • Vijai Mohan
  • Anshumali Shrivastava

Many popular first-order optimization methods accelerate the convergence rate of deep learning models. However, these algorithms require auxiliary variables, which cost additional memory proportional to the number of parameters in the model. The problem is becoming more severe as models grow larger to learn from complex, large-scale datasets. Our proposed solution is to maintain a linear sketch to compress the auxiliary variables. Our approach has the same performance as the full-sized baseline, while using less space for the auxiliary variables. Theoretically, we prove that count-sketch optimization maintains the SGD convergence rate, while gracefully reducing memory usage for large models. We show a rigorous evaluation on popular architectures such as ResNet-18 and Transformer-XL. On the 1-Billion Word dataset, we save 25% of the memory used during training (7.7 GB instead of 10.8 GB) with minimal accuracy and performance loss. For an Amazon extreme classification task with over 49.5 million classes, we also reduce the training time by 38%, by increasing the mini-batch size 3.5x using our count-sketch optimizer.
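
A compact, hypothetical sketch of the underlying data structure, a count-sketch used in place of a dense auxiliary buffer (the hash functions here are simplified stand-ins):

```python
import numpy as np

class CountSketch:
    """Count-sketch for compressing a dense auxiliary vector (e.g. momentum).

    Illustrative sketch of the idea in the paper: each parameter index hashes to
    one bucket per row with a random sign; updates are added, and a query
    returns the median of the signed bucket estimates.
    """

    def __init__(self, rows=5, width=1 << 14, seed=0):
        rng = np.random.default_rng(seed)
        self.table = np.zeros((rows, width))
        self.width = width
        # Per-row hashing via random affine maps over a large prime (simplified).
        self.a = rng.integers(1, 2**31 - 1, size=rows)
        self.b = rng.integers(0, 2**31 - 1, size=rows)

    def _bucket(self, idx):
        return (self.a[:, None] * idx + self.b[:, None]) % (2**31 - 1) % self.width

    def _sign(self, idx):
        return 1 - 2 * (((self.a[:, None] * (idx + 1)) % (2**31 - 1)) & 1)

    def update(self, idx, vals):
        buckets, signs = self._bucket(idx), self._sign(idx)
        for r in range(self.table.shape[0]):
            np.add.at(self.table[r], buckets[r], signs[r] * vals)

    def query(self, idx):
        buckets, signs = self._bucket(idx), self._sign(idx)
        return np.median(signs * self.table[np.arange(len(self.a))[:, None], buckets], axis=0)
```

A momentum-style optimizer would then call cs.update(idx, grad_values) and read the compressed statistic back with cs.query(idx) instead of storing a dense per-parameter buffer.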

NeurIPS Conference 2019 Conference Paper

Learning Sparse Distributions using Iterative Hard Thresholding

  • Jacky Zhang
  • Rajiv Khanna
  • Anastasios Kyrillidis
  • Oluwasanmi Koyejo

Iterative hard thresholding (IHT) is a projected gradient descent algorithm, known to achieve state of the art performance for a wide range of structured estimation problems, such as sparse inference. In this work, we consider IHT as a solution to the problem of learning sparse discrete distributions. We study the hardness of using IHT on the space of measures. As a practical alternative, we propose a greedy approximate projection which simultaneously captures appropriate notions of sparsity in distributions, while satisfying the simplex constraint, and investigate the convergence behavior of the resulting procedure in various settings. Our results show, both in theory and practice, that IHT can achieve state of the art results for learning sparse distributions.
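
One natural instantiation of the idea, a gradient step followed by a greedy approximate projection onto k-sparse probability vectors, is sketched below (illustrative; the paper's projection and its analysis are more careful):

```python
import numpy as np

def greedy_sparse_simplex(p, k):
    """Greedy approximate projection onto {q >= 0, sum q = 1, ||q||_0 <= k}:
    keep the k largest coordinates, clip at zero, and renormalize."""
    q = np.zeros_like(p)
    top = np.argsort(p)[-k:]
    q[top] = np.maximum(p[top], 0.0)
    s = q.sum()
    return q / s if s > 0 else np.eye(len(p))[np.argmax(p)]   # fallback: a point mass

def iht_sparse_distribution(grad, p0, k, steps=200, lr=0.1):
    """Iterative hard thresholding over the probability simplex (illustrative sketch):
    gradient step followed by the greedy sparse-simplex projection above."""
    p = greedy_sparse_simplex(p0, k)
    for _ in range(steps):
        p = greedy_sparse_simplex(p - lr * grad(p), k)
    return p
```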

UAI Conference 2018 Conference Paper

Simple and practical algorithms for ℓp-norm low-rank approximation

  • Anastasios Kyrillidis

There are numerous applications where ℓ1-/ℓ∞-norm low-rank approximations are useful in practice. First, the ℓ1-norm is more robust than the ℓ2-norm, and is suited to problem settings where Gaussian assumptions on the noise model may not apply. ℓ1-norm low-rank applications include robust PCA [56, 6, 31, 32, 24, 57], computer vision tasks such as background subtraction and motion detection [52, 1, 38], detection of brain activation patterns [44], and detection of anomalous behavior in dynamic networks [44]. We propose practical algorithms for entrywise ℓp-norm low-rank approximation, for p = 1 or p = ∞. The proposed framework, which is non-convex and gradient-based, is easy to implement and typically attains better approximations, faster, than the state of the art. From a theoretical standpoint, we show that the proposed scheme can attain (1 + ε)OPT approximations. Our algorithms are not hyperparameter-free: they achieve the desiderata only assuming the algorithm's hyperparameters are known a priori, or are at least approximable. That is, our theory indicates which problem quantities need to be known in order to get a good solution within polynomial time, and does not contradict recent inapproximability results, as in [46]. For the ℓ∞-norm version of the problem, the special cases studied are few: [43] considers the case of m = n and r = min{m, n} − 1 as the problem of distance to robust non-singularity; [22, 23] use the notion of ℓ∞-norm low-rank approximation for the maximal-volume concept in approximation, as well as for the skeleton approximation of a matrix; and [17] identifies that the p = ∞ case can be used for the recovery of a low-rank matrix from a quantized matrix M. Despite the utility of the problem, its solution is not straightforward. While the ℓ2-norm version has a closed-form solution via the Singular Value Decomposition (SVD), the same does not hold for p ∈ {1, ∞}; moreover, finding the exact solution has been shown to be exponentially hard.

AAAI Conference 2018 Conference Paper

Statistical Inference Using SGD

  • Tianyang Li
  • Liu Liu
  • Anastasios Kyrillidis
  • Constantine Caramanis

We present a novel method for frequentist statistical inference in M-estimation problems, based on stochastic gradient descent (SGD) with a fixed step size: we demonstrate that the average of such SGD sequences can be used for statistical inference, after proper scaling. An intuitive analysis using the Ornstein-Uhlenbeck process suggests that such averages are asymptotically normal. To show the merits of our scheme, we apply it to both synthetic and real data sets, and demonstrate that its accuracy is comparable to classical statistical methods, while requiring potentially far less computation.

ICML Conference 2016 Conference Paper

A Simple and Provable Algorithm for Sparse Diagonal CCA

  • Megasthenis Asteris
  • Anastasios Kyrillidis
  • Sanmi Koyejo
  • Russell A. Poldrack

Given two sets of variables, derived from a common set of samples, sparse Canonical Correlation Analysis (CCA) seeks linear combinations of a small number of variables in each set, such that the induced canonical variables are maximally correlated. Sparse CCA is NP-hard. We propose a novel combinatorial algorithm for sparse diagonal CCA, i.e., sparse CCA under the additional assumption that variables within each set are standardized and uncorrelated. Our algorithm operates on a low-rank approximation of the input data and its computational complexity scales linearly with the number of input variables. It is simple to implement and parallelizable. In contrast to most existing approaches, our algorithm administers precise control on the sparsity of the extracted canonical vectors, and comes with theoretical data-dependent global approximation guarantees that hinge on the spectrum of the input data. Finally, it can be straightforwardly adapted to other constrained variants of CCA enforcing structure beyond sparsity. We empirically evaluate the proposed scheme and apply it on a real neuroimaging dataset to investigate associations between brain activity and behavior measurements.

JMLR Journal 2015 Journal Article

Composite Self-Concordant Minimization

  • Quoc Tran-Dinh
  • Anastasios Kyrillidis
  • Volkan Cevher

We propose a variable metric framework for minimizing the sum of a self-concordant function and a possibly non-smooth convex function, endowed with an easily computable proximal operator. We theoretically establish the convergence of our framework without relying on the usual Lipschitz gradient assumption on the smooth part. An important highlight of our work is a new set of analytic step-size selection and correction procedures based on the structure of the problem. We describe concrete algorithmic instances of our framework for several interesting applications and demonstrate them numerically on both synthetic and real data.

NeurIPS Conference 2015 Conference Paper

Sparse PCA via Bipartite Matchings

  • Megasthenis Asteris
  • Dimitris Papailiopoulos
  • Anastasios Kyrillidis
  • Alexandros Dimakis

We consider the following multi-component sparse PCA problem: given a set of data points, we seek to extract a small number of sparse components with \emph{disjoint} supports that jointly capture the maximum possible variance. Such components can be computed one by one, repeatedly solving the single-component problem and deflating the input data matrix, but this greedy procedure is suboptimal. We present a novel algorithm for sparse PCA that jointly optimizes multiple disjoint components. The extracted features capture variance that lies within a multiplicative factor arbitrarily close to $1$ from the optimal. Our algorithm is combinatorial and computes the desired components by solving multiple instances of the bipartite maximum weight matching problem. Its complexity grows as a low order polynomial in the ambient dimension of the input data, but exponentially in its rank. However, it can be effectively applied on a low-dimensional sketch of the input data. We evaluate our algorithm on real datasets and empirically demonstrate that in many cases it outperforms existing, deflation-based approaches.

ICML Conference 2015 Conference Paper

Stay on path: PCA along graph paths

  • Megasthenis Asteris
  • Anastasios Kyrillidis
  • Alexandros G. Dimakis
  • Han-Gyol Yi
  • Bharath Chandrasekaran

We introduce a variant of (sparse) PCA in which the set of feasible support sets is determined by a graph. In particular, we consider the following setting: given a directed acyclic graph G on p vertices corresponding to variables, the non-zero entries of the extracted principal component must coincide with vertices lying along a path in G. From a statistical perspective, information on the underlying network may potentially reduce the number of observations required to recover the population principal component. We consider the canonical estimator which optimally exploits the prior knowledge by solving a non-convex quadratic maximization on the empirical covariance. We introduce a simple network and analyze the estimator under the spiked covariance model for sparse PCA. We show that side information potentially improves the statistical complexity. We propose two algorithms to approximate the solution of the constrained quadratic maximization, and recover a component with the desired properties. We empirically evaluate our schemes on synthetic and real datasets.

AAAI Conference 2014 Conference Paper

Scalable Sparse Covariance Estimation via Self-Concordance

  • Anastasios Kyrillidis
  • Rabeeh Karimi Mahabadi
  • Quoc Tran Dinh
  • Volkan Cevher

We consider the class of convex minimization problems composed of a self-concordant function, such as the log det metric, a convex data fidelity term h(·), and a regularizing, possibly non-smooth, function g(·). This type of problem has recently attracted a great deal of interest, mainly due to its omnipresence in top-notch applications. Under this locally Lipschitz continuous gradient setting, we analyze the convergence behavior of proximal Newton schemes with the added twist of a probable presence of inexact evaluations. We prove attractive convergence rate guarantees and enhance state-of-the-art optimization schemes to accommodate such developments. Experimental results on sparse covariance estimation show the merits of our algorithm, both in terms of recovery efficiency and complexity.

ICML Conference 2013 Conference Paper

A proximal Newton framework for composite minimization: Graph learning without Cholesky decompositions and matrix inversions

  • Quoc Tran-Dinh
  • Anastasios Kyrillidis
  • Volkan Cevher

We propose an algorithmic framework for convex minimization problems of composite functions with two terms: a self-concordant part and a possibly nonsmooth regularization part. Our method is a new proximal Newton algorithm with local quadratic convergence rate. As a specific problem instance, we consider sparse precision matrix estimation problems in graph learning. Via a careful dual formulation and a novel analytic step-size selection, we instantiate an algorithm within our framework for graph learning that avoids Cholesky decompositions and matrix inversions, making it attractive for parallel and distributed implementations.
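
For orientation, the generic proximal Newton step that this framework builds on, with f the self-concordant part and g the nonsmooth regularizer (notation generic; the paper's contributions are the dual formulation and the analytic step-size selection):

```latex
% Generic proximal Newton iteration for min_x f(x) + g(x), f self-concordant, g nonsmooth.
\[
d_k \;=\; \arg\min_{d}\; \langle \nabla f(x_k), d\rangle
\;+\; \tfrac{1}{2}\,\langle \nabla^2 f(x_k)\, d, d\rangle \;+\; g(x_k + d),
\qquad
x_{k+1} \;=\; x_k + \alpha_k\, d_k .
\]
```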

ICML Conference 2013 Conference Paper

Sparse projections onto the simplex

  • Anastasios Kyrillidis
  • Stephen Becker
  • Volkan Cevher
  • Christoph Koch 0001

Most learning methods with rank or sparsity constraints use convex relaxations, which lead to optimization with the nuclear norm or the \ell_1-norm. However, several important learning applications cannot benefit from this approach as they feature these convex norms as constraints in addition to the non-convex rank and sparsity constraints. In this setting, we derive efficient sparse projections onto the simplex and its extension, and illustrate how to use them to solve high-dimensional learning problems in quantum tomography, sparse density estimation and portfolio selection with non-convex constraints.
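
A short sketch of one such sparse projection onto the simplex is given below: select a k-element support, then project the selected entries with the usual sort-and-threshold routine (illustrative; the paper derives the exact projection operators and their guarantees):

```python
import numpy as np

def project_simplex(v):
    """Euclidean projection of v onto the probability simplex (sort-and-threshold)."""
    u = np.sort(v)[::-1]
    css = np.cumsum(u)
    rho = np.nonzero(u + (1.0 - css) / (np.arange(len(v)) + 1) > 0)[0][-1]
    theta = (1.0 - css[rho]) / (rho + 1)
    return np.maximum(v + theta, 0.0)

def sparse_simplex_projection(v, k):
    """Two-stage sparse projection onto the simplex (illustrative sketch):
    pick the k largest entries of v, then project that subvector onto the simplex."""
    w = np.zeros_like(v)
    support = np.argsort(v)[-k:]
    w[support] = project_simplex(v[support])
    return w

# Example: project onto 2-sparse probability vectors.
print(sparse_simplex_projection(np.array([0.9, 0.2, 0.5, -0.3]), k=2))   # [0.7, 0., 0.3, 0.]
```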