Arrow Research search

Author name cluster

Vikas Singh

Possible papers associated with this exact author name in Arrow. This page groups case-insensitive exact name matches and is not a full identity disambiguation profile.

47 papers
2 author rows

Possible papers

47

NeurIPS Conference 2025 Conference Paper

Composing Linear Layers from Irreducibles

  • Travis Pence
  • Daisuke Yamada
  • Vikas Singh

Contemporary large models often exhibit behaviors suggesting the presence of low-level primitives that compose into modules with richer functionality, but these fundamental building blocks remain poorly understood. We investigate this compositional structure in linear layers by asking: \textit{can we identify/synthesize linear transformations from a minimal set of geometric primitives?} Using Clifford algebra, we show that linear layers can be expressed as compositions of bivectors---geometric objects encoding oriented planes---and introduce a differentiable algorithm that decomposes them into products of rotors. This construction uses only $\mathcal{O}(\log^2 d)$ parameters, versus $\mathcal{O}(d^2)$ required by dense matrices. Applied to the key, query, and value projections in LLM attention layers, our rotor-based layers match the performance of strong baselines such as block-Hadamard and low-rank approximations. Our findings provide an algebraic perspective on how these geometric primitives can compose into higher-level functions within deep models.
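The paper's rotor decomposition requires its own differentiable algorithm, but the classical fact it builds on can be sketched: any rotation in $\mathbb{R}^d$ factors into plane (Givens) rotations, each parameterized by a single angle rather than $d^2$ matrix entries. A minimal illustration (my example, not the paper's construction):

```python
import numpy as np

def givens(d, i, j, theta):
    """Rotation by angle theta in the (i, j) coordinate plane of R^d."""
    G = np.eye(d)
    c, s = np.cos(theta), np.sin(theta)
    G[i, i] = c; G[j, j] = c
    G[i, j] = -s; G[j, i] = s
    return G

rng = np.random.default_rng(0)
d = 8
# Compose a handful of plane rotations; each costs one parameter (an angle),
# versus d*d parameters for a dense matrix.
Q = np.eye(d)
for _ in range(12):
    i, j = rng.choice(d, size=2, replace=False)
    Q = Q @ givens(d, i, j, rng.uniform(0, 2 * np.pi))

# The composition is still an exact rotation: orthogonal with determinant 1.
assert np.allclose(Q @ Q.T, np.eye(d))
assert np.isclose(np.linalg.det(Q), 1.0)
```

The abstract's $\mathcal{O}(\log^2 d)$ count comes from the paper's specific rotor construction; the sketch above only shows why products of oriented-plane rotations are a natural low-parameter primitive.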

NeurIPS Conference 2025 Conference Paper

FoGE: Fock Space inspired encoding for graph prompting

  • Takis Chytas
  • Rudrasis Chakraborty
  • Vikas Singh

Recent results show that modern Large Language Models (LLMs) are indeed capable of understanding and answering questions about structured data such as graphs. This new paradigm can lead to solutions that require less supervision while, at the same time, providing a model that can generalize and answer questions beyond the training labels. Existing proposals often use some description of the graph to create an ``augmented'' prompt fed to the LLM. For a chosen class of graphs, if a well-tailored graph encoder is deployed to work in tandem with a pre-trained LLM, the model can answer graph-related questions well. Existing solutions to graph-based prompts range from graph serialization to graph transformers. In this work, we show that the use of a parameter-free graph encoder based on Fock space representations, a concept borrowed from mathematical physics, is remarkably versatile in this problem setting. The simple construction, inherited directly from the theory with a few small adjustments, can provide rich and informative graph encodings, for a wide range of different graphs. We investigate the use of this idea for prefix-tuned prompts leveraging the capabilities of a pre-trained, frozen LLM. The modifications lead to a model that can answer graph-related questions -- from simple graphs to proteins to hypergraphs -- effectively and with minimal, if any, adjustments to the architecture. Our work significantly simplifies existing solutions and generalizes effortlessly to many different graph-based structures.

NeurIPS Conference 2025 Conference Paper

PINNs with Learnable Quadrature

  • Sourav Pal
  • Kamyar Azizzadenesheli
  • Vikas Singh

The growing body of work on Physics-Informed Neural Networks (PINNs) seeks to use machine learning strategies to improve methods for solution discovery of Partial Differential Equations (PDEs). While classical solvers may remain the preferred tool of choice in the short term, PINNs can be viewed as complementary, and the expectation is that in some specific use cases they can be effective on their own. A key step in training PINNs is selecting domain points for loss evaluation, where Monte Carlo sampling remains the dominant choice but is often suboptimal in the low-dimensional settings common in physics. We leverage recent advances in asymptotic expansions of quadrature nodes and weights (for weight functions belonging to the modified Gauss-Jacobi family), together with suitable adjustments to the parameterization, towards a data-driven framework for learnable quadrature rules. A direct benefit is a performance improvement of PINNs, relative to existing alternatives, on a wide range of problems studied in the literature. Beyond finding a standard solution for an instance of a single PDE, our construction enables learning rules to predict solutions for a given family of PDEs via hyper-networks, a useful capability for PINNs.
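The gap the abstract targets is easy to see numerically: for a smooth 1D integrand, a fixed Gauss-Legendre rule beats Monte Carlo at the same sample budget by many orders of magnitude. A sketch of that baseline comparison (textbook quadrature, not the paper's learnable rules):

```python
import numpy as np

# Integrate a smooth function on [-1, 1] two ways: n-point Gauss-Legendre
# quadrature versus n-point Monte Carlo.  For smooth low-dimensional
# integrands, quadrature converges far faster.
f = lambda x: np.exp(x) * np.cos(3 * x)
# Antiderivative of e^x cos(3x) is e^x (cos 3x + 3 sin 3x) / 10.
F = lambda x: np.exp(x) * (np.cos(3 * x) + 3 * np.sin(3 * x)) / 10
exact = F(1.0) - F(-1.0)

n = 16
nodes, weights = np.polynomial.legendre.leggauss(n)   # fixed quadrature rule
quad = np.dot(weights, f(nodes))

rng = np.random.default_rng(0)
mc = 2.0 * f(rng.uniform(-1, 1, n)).mean()            # MC estimate, same budget

assert abs(quad - exact) < 1e-10            # quadrature: near machine precision
assert abs(quad - exact) < abs(mc - exact)  # and beats MC at equal cost
```

The paper goes further by making the nodes and weights themselves learnable (via asymptotic expansions of Gauss-Jacobi rules); this block only demonstrates why quadrature is the right starting point in low dimensions.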

ICLR Conference 2025 Conference Paper

SimpleTM: A Simple Baseline for Multivariate Time Series Forecasting

  • Hui Chen
  • Viet Luong
  • Lopamudra Mukherjee
  • Vikas Singh

The versatility of large Transformer-based models has led to many efforts focused on adaptations to other modalities, including time-series data. For instance, one could start from a pre-trained checkpoint of a large language model and attach adapters to recast the new modality (e.g., time-series) as ``language''. Alternatively, one can use a suitably large Transformer-based model, and make some modifications for time-series data. These ideas offer good performance across available benchmarks. But temporal data are quite heterogeneous (e.g., wearable sensors, physiological measurements in healthcare), and unlike text/image corpora, much of it is not publicly available. So, these models need a fair bit of domain-specific fine-tuning to achieve good performance -- this is often expensive or difficult with limited resources. In this paper, we study and characterize the performance profile of a non-generalist approach: our SimpleTM model is specialized for multivariate time-series forecasting. By simple, we mean that the model is lightweight. It is restricted to tokenization based on textbook signal-processing ideas (shown to be effective in vision), whose tokens are then allowed to attend/interact: via self-attention but also via ways that are a bit more general than dot-product attention, accomplished via basic geometric algebra operations. We show that even a single- or two-layer model gives results that are competitive with much bigger models, including large transformer-based architectures, on most benchmarks commonly reported in the literature.

NeurIPS Conference 2025 Conference Paper

Single-Step Operator Learning for Conditioned Time-Series Diffusion Models

  • Hui Chen
  • Vikas Singh

Diffusion models have achieved significant success, yet their application to time series data, particularly with regard to efficient sampling, remains an active area of research. We describe an operator-learning approach for conditioned time-series diffusion models that gives efficient single-step generation by leveraging insights from the frequency-domain characteristics of both the time-series data and the diffusion process itself. The forward diffusion process induces a structured, frequency-dependent smoothing of the data's probability density function. However, this frequency smoothing is related (e.g., via the likelihood function) to easily accessible frequency components of time-series data. This suggests that a module operating in the frequency space of the time-series can, potentially, more effectively learn to reverse the frequency-dependent smoothing of the data distribution induced by the diffusion process. We set up an operator learning task, based on frequency-aware building blocks, which satisfies semi-group properties, while exploiting the structure of time-series data. Evaluations on multiple datasets show that our single-step generation proposal achieves forecasting/imputation results comparable (or superior) to many multi-step diffusion schemes while significantly reducing inference costs.
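The frequency-dependent smoothing the abstract mentions is a standard Fourier fact: Gaussian smoothing multiplies each Fourier mode by a gain that decays with frequency, so high frequencies are damped far more than low ones. A toy illustration of that fact alone (not the paper's operator):

```python
import numpy as np

# Convolving with a Gaussian of width sigma multiplies each Fourier mode of
# frequency f (cycles/sample) by exp(-2 (pi f sigma)^2): a low-pass gain.
n, sigma = 256, 4.0
t = np.arange(n)
signal = np.sin(2 * np.pi * 4 * t / n) + np.sin(2 * np.pi * 40 * t / n)

freqs = np.fft.rfftfreq(n)                      # cycles per sample
gauss_gain = np.exp(-2 * (np.pi * freqs * sigma) ** 2)
blurred = np.fft.irfft(np.fft.rfft(signal) * gauss_gain, n)

spec_before = np.abs(np.fft.rfft(signal))
spec_after = np.abs(np.fft.rfft(blurred))
low, high = 4, 40                               # mode indices of the two sines
# The low-frequency mode survives; the high-frequency mode is nearly erased.
assert spec_after[low] / spec_before[low] > 0.9
assert spec_after[high] / spec_before[high] < 1e-3
```

A frequency-space module can exploit exactly this structure: reversing the smoothing amounts to undoing a known, frequency-dependent attenuation, which is the intuition the paper's single-step operator builds on.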

ICML Conference 2024 Conference Paper

FrameQuant: Flexible Low-Bit Quantization for Transformers

  • Harshavardhan Adepu
  • Zhanpeng Zeng
  • Li Zhang
  • Vikas Singh

Transformers are the backbone of powerful foundation models for many Vision and Natural Language Processing tasks. But their compute and memory/storage footprint is large, and so serving such models is expensive, often requiring high-end hardware. To mitigate this difficulty, Post-Training Quantization seeks to modify a pre-trained model and quantize it to eight bits or lower, significantly boosting compute/memory/latency efficiency. Such models have been successfully quantized to four bits with some performance loss. In this work, we outline a simple scheme to quantize Transformer-based models to just two bits (plus some overhead) with only a small drop in accuracy. Key to our formulation is a concept borrowed from Harmonic analysis called Fusion Frames. Our main finding is that the quantization must take place not in the original weight space, but instead in the Fusion Frame representations. If quantization is interpreted as the addition of noise, our casting of the problem allows invoking an extensive body of known consistent recovery and noise robustness guarantees. Further, if desired, de-noising filters are known in closed form. We show empirically, via a variety of experiments, that (almost) two-bit quantization for Transformer models promises sizable efficiency gains. The code is available at https://github.com/vsingh-group/FrameQuant

ICML Conference 2024 Conference Paper

IM-Unpack: Training and Inference with Arbitrarily Low Precision Integers

  • Zhanpeng Zeng
  • Karthikeyan Sankaralingam
  • Vikas Singh

GEneral Matrix Multiply (GEMM) is a central operation in deep learning and corresponds to a large chunk of the compute footprint. Therefore, improving its efficiency is an active topic of research. A popular strategy is the use of low bit-width integers to approximate the original matrix entries. This allows efficiency gains, but often requires sophisticated techniques to control the rounding error. In this work, we first verify that when the low bit-width restriction is removed, for a variety of Transformer-based models, integers are, in fact, sufficient for all the GEMMs needed in both training and inference stages, and achieve parity with floating point. No sophisticated techniques are needed. We find that while a large majority of entries in the matrices encountered in such models can be easily represented by low bit-width integers, the existence of a few heavy-hitter entries makes it difficult to achieve efficiency gains via the exclusive use of low bit-width GEMMs alone. To address this issue, we develop a simple algorithm, Integer Matrix Unpacking (IM-Unpack), to unpack a matrix with large integer entries into a larger matrix whose entries all lie within the representable range of arbitrarily low bit-width integers. This allows equivalence with the original GEMM, i.e., the exact result can be obtained using purely low bit-width integer GEMMs. This comes at the cost of additional operations – we show that for many popular models, this overhead is quite small. Code is available at https://github.com/vsingh-group/im-unpack.
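IM-Unpack's actual construction unpacks into a larger matrix; the sketch below uses a simpler base-$B$ digit split to show the same exact-equivalence idea: a matrix with wide integer entries can be rewritten so that the product is computed exactly using only GEMMs over low-range integers.

```python
import numpy as np

# Split A into base-B digits: A = A0 + B*A1 with every entry of A0, A1 in
# [0, B).  Then A @ X = A0 @ X + B * (A1 @ X) holds exactly, using only
# GEMMs whose operands fit in log2(B) bits.
rng = np.random.default_rng(0)
B = 16                                   # 4-bit digits
A = rng.integers(0, B * B, size=(5, 7))  # entries need 8 bits
X = rng.integers(0, B, size=(7, 3))

A0, A1 = A % B, A // B                   # low 4 bits, high 4 bits
assert A0.max() < B and A1.max() < B

exact = A @ X
via_digits = A0 @ X + B * (A1 @ X)
assert np.array_equal(exact, via_digits)
```

The overhead the abstract mentions shows up here as the extra GEMM per digit; since only a few heavy-hitter entries need the high digits, the paper's unpacking keeps that overhead small in practice.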

ICML Conference 2024 Conference Paper

Implicit Representations via Operator Learning

  • Sourav Pal
  • Harshavardhan Adepu
  • Clinton J. Wang
  • Polina Golland
  • Vikas Singh

The idea of representing a signal as the weights of a neural network, called Implicit Neural Representations (INRs), has led to exciting implications for compression, view synthesis and 3D volumetric data understanding. One problem in this setting pertains to the use of INRs for downstream processing tasks. Despite some conceptual results, this remains challenging because the INR for a given image/signal often exists in isolation. What does the neighborhood around a given INR correspond to? Based on this question, we offer an operator theoretic reformulation of the INR model, which we call Operator INR (or O-INR). At a high level, instead of mapping positional encodings to a signal, O-INR maps one function space to another function space. A practical form of this general casting is obtained by appealing to Integral Transforms. The resultant model does not need multi-layer perceptrons (MLPs), used in most existing INR models – we show that convolutions are sufficient and offer benefits including numerically stable behavior. We show that O-INR can easily handle most problem settings in the literature, and offers a similar performance profile as baselines. These benefits come with minimal, if any, compromise. Our code is available at https://github.com/vsingh-group/oinr.

ICLR Conference 2024 Conference Paper

Pooling Image Datasets with Multiple Covariate Shift and Imbalance

  • Sotirios Panagiotis Chytas
  • Vishnu Suresh Lokhande
  • Vikas Singh

Small sample sizes are common in many disciplines, which necessitates pooling roughly similar datasets across multiple sites/institutions to study weak but relevant associations between images and disease incidence. Such data often manifest shifts and imbalances in covariates (secondary non-imaging data). These issues are well-studied for classical models, but the ideas simply do not apply to overparameterized DNN models. Consequently, recent work has shown how strategies from fairness and invariant representation learning provide a meaningful starting point, but the current repertoire of methods remains limited to accounting for shifts/imbalances in just a couple of covariates at a time. In this paper, we show how viewing this problem from the perspective of Category theory provides a simple and effective solution that completely avoids elaborate multi-stage training pipelines that would otherwise be needed. We show the effectiveness of this approach via extensive experiments on real datasets. Further, we discuss how our style of formulation offers a unified perspective on at least five distinct problem settings in vision, from self-supervised learning to matching problems in 3D reconstruction.

ICML Conference 2023 Conference Paper

Controlled Differential Equations on Long Sequences via Non-standard Wavelets

  • Sourav Pal
  • Zhanpeng Zeng
  • Sathya N. Ravi
  • Vikas Singh

Neural Controlled Differential Equations (NCDEs) are a powerful mechanism to model the dynamics in temporal sequences, e.g., applications involving physiological measures, where apart from the initial condition, the dynamics also depend on subsequent measures or even a different "control" sequence. But NCDEs do not scale well to longer sequences. Existing strategies adapt rough path theory, and instead model the dynamics over summaries known as log signatures. While rigorous and elegant, invertibility of these summaries is difficult, and limits the scope of problems where these ideas can offer strong benefits (reconstruction, generative modeling). For tasks where it is sensible to assume that the (long) sequences in the training data are a fixed length of temporal measurements – this assumption holds in most experiments tackled in the literature – we describe an efficient simplification. First, we recast the regression/classification task as an integral transform. We then show how restricting the class of operators (permissible in the integral transform), allows the use of a known algorithm that leverages non-standard Wavelets to decompose the operator. Thereby, our task (learning the operator) radically simplifies. A neural variant of this idea yields consistent improvements across a wide gamut of use cases tackled in existing works. We also describe a novel application on modeling tasks involving coupled differential equations.

ICLR Conference 2023 Conference Paper

Efficient Discrete Multi Marginal Optimal Transport Regularization

  • Ronak R. Mehta
  • Jeffery Kline
  • Vishnu Suresh Lokhande
  • Glenn Fung
  • Vikas Singh

Optimal transport has emerged as a powerful tool for a variety of problems in machine learning, and it is frequently used to enforce distributional constraints. In this context, existing methods often use either a Wasserstein metric, or else they apply concurrent barycenter approaches when more than two distributions are considered. In this paper, we leverage multi-marginal optimal transport (MMOT), where we take advantage of a procedure that computes a generalized earth mover's distance as a sub-routine. We show that not only is our algorithm computationally more efficient compared to other barycentric-based distance methods, but it has the additional advantage that gradients used for backpropagation can be efficiently computed during the forward pass computation itself, which leads to substantially faster model training. We provide technical details about this new regularization term and its properties, and we present experimental demonstrations of faster runtimes when compared to standard Wasserstein-style methods. Finally, on a range of experiments designed to assess effectiveness at enforcing fairness, we demonstrate our method compares well with alternatives.
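The generalized earth mover's distance used as a sub-routine here reduces, in the simplest two-marginal 1D case, to a classical closed form: the L1 distance between cumulative distribution functions. A minimal sketch of that special case (not the paper's multi-marginal algorithm):

```python
import numpy as np

# For 1D histograms on a common unit-spaced grid, the earth mover's distance
# is the L1 distance between the cumulative sums of the two distributions.
def emd_1d(p, q):
    """EMD between two 1D probability vectors on the same unit-spaced grid."""
    return np.abs(np.cumsum(p) - np.cumsum(q)).sum()

p = np.array([1.0, 0.0, 0.0, 0.0])   # all mass at bin 0
q = np.array([0.0, 0.0, 0.0, 1.0])   # all mass at bin 3
assert emd_1d(p, q) == 3.0           # move one unit of mass three bins

r = np.array([0.5, 0.5, 0.0, 0.0])
s = np.array([0.0, 0.0, 0.5, 0.5])
assert emd_1d(r, s) == 2.0           # each half of the mass travels two bins
```

The closed form is what makes the 1D case so cheap; the paper's contribution is an efficient generalized EMD over more than two marginals, where no such simple formula applies.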

ICML Conference 2023 Conference Paper

LookupFFN: Making Transformers Compute-lite for CPU inference

  • Zhanpeng Zeng
  • Michael Davies
  • Pranav Pulijala
  • Karthikeyan Sankaralingam
  • Vikas Singh

While GPU clusters are the de facto choice for training large deep neural network (DNN) models today, several reasons including ease of workflow, security and cost have led to efforts investigating whether CPUs may be viable for inference in routine use in many sectors of the industry. But the imbalance between the compute capabilities of GPUs and CPUs is huge. Motivated by these considerations, we study a module which is a workhorse within modern DNN architectures, GEMM based Feed Forward Networks (FFNs), and assess the extent to which it can be made compute- (or FLOP-) lite. Specifically, we propose an alternative formulation (we call it LookupFFN) to GEMM based FFNs inspired by the recent studies of using Locality Sensitive Hashing (LSH) to approximate FFNs. Our formulation recasts most essential operations as a memory look-up, leveraging the trade-off between the two resources on any platform: compute and memory (since CPUs offer the latter in abundance). For RoBERTa language model pretraining, our formulation achieves similar performance compared to GEMM based FFNs, while dramatically reducing the required FLOPs. Our development is complemented with a detailed hardware profiling of strategies that will maximize efficiency – not just on contemporary hardware but on products that will be offered in the near/medium term future. Code is available at https://github.com/mlpen/LookupFFN.

NeurIPS Conference 2023 Conference Paper

VCC: Scaling Transformers to 128K Tokens or More by Prioritizing Important Tokens

  • Zhanpeng Zeng
  • Cole Hawkins
  • Mingyi Hong
  • Aston Zhang
  • Nikolaos Pappas
  • Vikas Singh
  • Shuai Zheng

Transformers are central in modern natural language processing and computer vision applications. Despite recent works devoted to reducing the quadratic cost of such models with respect to sequence length, dealing with ultra long sequences (e.g., $>$16K tokens) remains challenging. Applications such as answering questions based on a book or summarizing a scientific article are inefficient or infeasible. Here, we propose to significantly improve the efficiency of Transformers for ultra long sequences, by compressing the sequence into a much smaller representation at each layer. Specifically, by exploiting the fact that in many tasks, only a small subset of special tokens, which we call VIP-tokens, are most relevant to the final prediction, we propose a VIP-token centric compression (VCC) scheme which selectively compresses non-VIP tokens based on their impact on approximating the representation of the VIP-tokens. Compared with competitive baselines, our algorithm is not only efficient (achieving more than $3\times$ compute efficiency gain compared to baselines on 4K and 16K lengths), but also offers competitive/better performance on a large number of tasks. Further, we show that our algorithm scales to 128K tokens (or more) while consistently offering accuracy improvement. Code is available at https://github.com/mlpen/VCC.

ICML Conference 2022 Conference Paper

Forward Operator Estimation in Generative Models with Kernel Transfer Operators

  • Zhichun Huang
  • Rudrasis Chakraborty
  • Vikas Singh

Generative models which use explicit density modeling (e.g., variational autoencoders, flow-based generative models) involve finding a mapping from a known distribution, e.g., Gaussian, to the unknown input distribution. This often requires searching over a class of non-linear functions (e.g., representable by a deep neural network). While effective in practice, the associated runtime/memory costs can increase rapidly, usually as a function of the performance desired in an application. We propose a substantially cheaper (and simpler) forward operator estimation strategy based on adapting known results on kernel transfer operators. We show that our formulation enables highly efficient distribution approximation and sampling, and offers surprisingly good empirical performance that compares favorably with powerful baselines, but with significant runtime savings. We show that the algorithm also performs well in small sample size settings (in brain imaging).

ICML Conference 2022 Conference Paper

Multi Resolution Analysis (MRA) for Approximate Self-Attention

  • Zhanpeng Zeng
  • Sourav Pal
  • Jeffery Kline
  • Glenn Fung
  • Vikas Singh

Transformers have emerged as a preferred model for many tasks in natural language processing and vision. Recent efforts on training and deploying Transformers more efficiently have identified many strategies to approximate the self-attention matrix, a key module in a Transformer architecture. Effective ideas include various prespecified sparsity patterns, low-rank basis expansions and combinations thereof. In this paper, we revisit classical Multiresolution Analysis (MRA) concepts such as Wavelets, whose potential value in this setting remains underexplored thus far. We show that simple approximations based on empirical feedback and design choices informed by modern hardware and implementation challenges, eventually yield an MRA-based approach for self-attention with an excellent performance profile across most criteria of interest. We undertake an extensive set of experiments and demonstrate that this multi-resolution scheme outperforms most efficient self-attention proposals and is favorable for both short and long sequences. Code is available at \url{https://github.com/mlpen/mra-attention}.
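The core multiresolution intuition can be shown on a small matrix: approximate it coarsely with block averages everywhere, then spend a refinement budget only where the residual is largest. This toy sketch is mine, not the paper's attention scheme:

```python
import numpy as np

# Approximate a matrix by 2x2 block averages (coarse level), then refine
# exactly the blocks with the largest residual energy -- fine resolution
# only where it matters.
rng = np.random.default_rng(0)
n = 8
A = rng.standard_normal((n, n))

coarse = A.reshape(n // 2, 2, n // 2, 2).mean(axis=(1, 3))   # 4x4 block means
approx = np.kron(coarse, np.ones((2, 2)))                    # back to 8x8

# Residual energy per 2x2 block; refine the worst half of the blocks exactly.
resid = (A - approx).reshape(n // 2, 2, n // 2, 2)
energy = (resid ** 2).sum(axis=(1, 3))
budget = (n // 2) ** 2 // 2
worst = np.argsort(energy, axis=None)[::-1][:budget]
for flat in worst:
    bi, bj = divmod(flat, n // 2)
    approx[2*bi:2*bi+2, 2*bj:2*bj+2] = A[2*bi:2*bi+2, 2*bj:2*bj+2]

err_coarse = np.linalg.norm(A - np.kron(coarse, np.ones((2, 2))))
err_refined = np.linalg.norm(A - approx)
assert err_refined < err_coarse          # targeted refinement strictly helps
```

Applied to a self-attention matrix, the same idea allocates resolution adaptively: most of the matrix is kept at coarse scale while the high-energy regions get fine-scale treatment.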

UAI Conference 2021 Conference Paper

A variational approximation for analyzing the dynamics of panel data

  • Jurijs Nazarovs
  • Rudrasis Chakraborty
  • Songwong Tasneeyapant
  • Sathya N. Ravi
  • Vikas Singh

Panel data involving longitudinal measurements of the same set of participants or entities taken over multiple time points is common in studies to understand early childhood development and disease modeling. Deep hybrid models that marry the predictive power of neural networks with physical simulators such as differential equations are starting to drive advances in such applications. The task of modeling not just the observations/data but the hidden dynamics that are captured by the measurements poses interesting statistical/computational questions. We propose a probabilistic model called ME-NODE to incorporate (fixed + random) mixed effects for analyzing such panel data. We show that our model can be derived using smooth approximations of SDEs provided by the Wong-Zakai theorem. We then derive Evidence Lower Bounds (ELBOs) for ME-NODE, and develop (efficient) training algorithms using MC-based sampling methods and numerical ODE solvers. We demonstrate ME-NODE's utility on tasks spanning the spectrum from simulations and toy datasets to real longitudinal 3D imaging data from an Alzheimer's disease (AD) study, and study reconstruction accuracy for interpolation, uncertainty estimates, and personalized prediction.

NeurIPS Conference 2021 Conference Paper

An Online Riemannian PCA for Stochastic Canonical Correlation Analysis

  • Zihang Meng
  • Rudrasis Chakraborty
  • Vikas Singh

We present an efficient stochastic algorithm (RSG+) for canonical correlation analysis (CCA) using a reparametrization of the projection matrices. We show how this reparametrization (into structured matrices), simple in hindsight, directly presents an opportunity to repurpose/adjust mature techniques for numerical optimization on Riemannian manifolds. Our developments nicely complement existing methods for this problem which either require $O(d^3)$ time complexity per iteration with $O(\frac{1}{\sqrt{t}})$ convergence rate (where $d$ is the dimensionality) or only extract the top $1$ component with $O(\frac{1}{t})$ convergence rate. In contrast, our algorithm offers a strict improvement for this classical problem: it achieves $O(d^2k)$ runtime complexity per iteration for extracting the top $k$ canonical components with $O(\frac{1}{t})$ convergence rate. While the paper primarily focuses on the formulation and technical analysis of its properties, our experiments show that the empirical behavior on common datasets is quite promising. We also explore a potential application in training fair models where the label of the protected attribute is missing or otherwise unavailable.
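For reference, the batch problem RSG+ solves stochastically has a classical solution: the canonical correlations are the singular values of the whitened cross-covariance. A minimal batch-CCA sketch (textbook method, not the paper's Riemannian algorithm):

```python
import numpy as np

# Classical (batch) CCA: whiten each view via Cholesky factors of the
# covariances, then read canonical correlations off the SVD of the
# whitened cross-covariance.
def cca_top_corr(X, Y, eps=1e-8):
    """Top canonical correlation between data matrices X (n x p), Y (n x q)."""
    X = X - X.mean(0); Y = Y - Y.mean(0)
    Cxx = X.T @ X / len(X) + eps * np.eye(X.shape[1])
    Cyy = Y.T @ Y / len(Y) + eps * np.eye(Y.shape[1])
    Cxy = X.T @ Y / len(X)
    Wx = np.linalg.inv(np.linalg.cholesky(Cxx))
    Wy = np.linalg.inv(np.linalg.cholesky(Cyy))
    s = np.linalg.svd(Wx @ Cxy @ Wy.T, compute_uv=False)
    return min(s[0], 1.0)

rng = np.random.default_rng(0)
z = rng.standard_normal((2000, 1))
X = np.hstack([z, rng.standard_normal((2000, 2))])      # shares latent z with Y
Y = np.hstack([-z, rng.standard_normal((2000, 2))])
assert cca_top_corr(X, Y) > 0.95    # the shared latent is found (sign-free)

X2 = rng.standard_normal((2000, 3))                     # independent views
Y2 = rng.standard_normal((2000, 3))
assert cca_top_corr(X2, Y2) < 0.3
```

The batch route costs an SVD per solve; the paper's contribution is a stochastic, manifold-constrained update that reaches the top-$k$ components at $O(d^2 k)$ per iteration.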

NeurIPS Conference 2021 Conference Paper

Differentiable Optimization of Generalized Nondecomposable Functions using Linear Programs

  • Zihang Meng
  • Lopamudra Mukherjee
  • Yichao Wu
  • Vikas Singh
  • Sathya Ravi

We propose a framework which makes it feasible to directly train deep neural networks with respect to popular families of task-specific non-decomposable performance measures such as AUC, multi-class AUC, $F$-measure and others. A common feature of the optimization model that emerges from these tasks is that it involves solving a Linear Program (LP) during training where representations learned by upstream layers characterize the constraints or the feasible set. The constraint matrix is not only large but the constraints are also modified at each iteration. We show how adopting a set of ingenious ideas proposed by Mangasarian for 1-norm SVMs -- which advocates for solving LPs with a generalized Newton method -- provides a simple and effective solution that can be run on the GPU. In particular, this strategy needs little unrolling, which makes it more efficient during the backward pass. Further, even when the constraint matrix is too large to fit on the GPU memory (say large minibatch settings), we show that running the Newton method in a lower dimensional space yields accurate gradients for training, by utilizing a statistical concept called {\em sufficient} dimension reduction. While a number of specialized algorithms have been proposed for the models that we describe here, our module turns out to be applicable without any specific adjustments or relaxations. We describe each use case, study its properties and demonstrate the efficacy of the approach over alternatives which use surrogate lower bounds and often, specialized optimization schemes. Frequently, we achieve superior computational behavior and performance improvements on common datasets used in the literature.

AAAI Conference 2021 Conference Paper

Flow-based Generative Models for Learning Manifold to Manifold Mappings

  • Xingjian Zhen
  • Rudrasis Chakraborty
  • Liu Yang
  • Vikas Singh

Many measurements or observations in computer vision and machine learning manifest as non-Euclidean data. While recent proposals (like spherical CNN) have extended a number of deep neural network architectures to manifold-valued data, and this has often provided strong improvements in performance, the literature on generative models for manifold data is quite sparse. Partly due to this gap, there are also no modality transfer/translation models for manifold-valued data whereas numerous such methods based on generative models are available for natural images. This paper addresses this gap, motivated by a need in brain imaging – in doing so, we expand the operating range of certain generative models (as well as generative models for modality transfer) from natural images to images with manifold-valued measurements. Our main result is the design of a two-stream version of GLOW (flow-based invertible generative models) that can synthesize information of a field of one type of manifold-valued measurements given another. On the theoretical side, we introduce three kinds of invertible layers for manifold-valued data, which are not only analogous to their functionality in flow-based generative models (e.g., GLOW) but also preserve the key benefits (determinants of the Jacobian are easy to calculate). For experiments, on a large dataset from the Human Connectome Project (HCP), we show promising results where we can reliably and accurately reconstruct brain images of a field of orientation distribution functions (ODF) from diffusion tensor images (DTI), where the latter has a 5× faster acquisition time but at the expense of worse angular resolution.

UAI Conference 2021 Conference Paper

Graph reparameterizations for enabling 1000+ Monte Carlo iterations in Bayesian deep neural networks

  • Jurijs Nazarovs
  • Ronak R. Mehta
  • Vishnu Suresh Lokhande
  • Vikas Singh

Uncertainty estimation in deep models is essential in many real-world applications and has benefited from developments over the last several years. Recent evidence suggests that existing solutions dependent on simple Gaussian formulations may not be sufficient. However, moving to other distributions necessitates Monte Carlo (MC) sampling to estimate quantities such as the KL divergence; this can be expensive and scales poorly as the dimensions of both the input data and the model grow. This is directly related to the structure of the computation graph, which can grow linearly as a function of the number of MC samples needed. Here, we construct a framework to describe these computation graphs, and identify probability families where the graph size can be independent of, or only weakly dependent on, the number of MC samples. These families correspond directly to large classes of distributions. Empirically, we can run a much larger number of iterations for MC approximations for larger architectures used in computer vision with gains in performance measured in confident accuracy, stability of training, memory and training time.

AAAI Conference 2021 Conference Paper

Learning Invariant Representations using Inverse Contrastive Loss

  • Aditya Kumar Akash
  • Vishnu Suresh Lokhande
  • Sathya N. Ravi
  • Vikas Singh

Learning invariant representations is a critical first step in a number of machine learning tasks. A common approach corresponds to the so-called information bottleneck principle in which an application dependent function of mutual information is carefully chosen and optimized. Unfortunately, in practice, these functions are not suitable for optimization purposes since these losses are agnostic of the metric structure of the parameters of the model. We introduce a class of losses for learning representations that are invariant to some extraneous variable of interest by inverting the class of contrastive losses, i.e., inverse contrastive loss (ICL). We show that if the extraneous variable is binary, then optimizing ICL is equivalent to optimizing a regularized MMD divergence. More generally, we also show that if we are provided a metric on the sample space, our formulation of ICL can be decomposed into a sum of convex functions of the given distance metric. Our experimental results indicate that models obtained by optimizing ICL achieve significantly better invariance to the extraneous variable for a fixed desired level of accuracy. In a variety of experimental settings, we show applicability of ICL for learning invariant representations for both continuous and discrete extraneous variables. The project page with code is available at https://github.com/adityakumarakash/ICL
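The MMD divergence that ICL reduces to in the binary case has a simple sample estimate from kernel evaluations. A minimal sketch of that estimator (standard biased MMD, not the paper's loss):

```python
import numpy as np

# Biased-estimator MMD^2 with an RBF kernel: compares mean kernel similarity
# within each sample against similarity across samples.
def mmd2(X, Y, gamma=1.0):
    """Biased MMD^2 estimate between samples X (n x d) and Y (m x d)."""
    def k(A, B):
        d2 = ((A[:, None, :] - B[None, :, :]) ** 2).sum(-1)
        return np.exp(-gamma * d2)
    return k(X, X).mean() + k(Y, Y).mean() - 2 * k(X, Y).mean()

rng = np.random.default_rng(0)
same_a = rng.standard_normal((200, 2))
same_b = rng.standard_normal((200, 2))       # drawn from the same distribution
shifted = rng.standard_normal((200, 2)) + 3  # mean-shifted distribution

assert mmd2(same_a, same_a) < 1e-12          # identical samples: exactly zero
assert mmd2(same_a, shifted) > 10 * mmd2(same_a, same_b)
```

Driving this quantity down for representations grouped by the extraneous variable is what makes them statistically indistinguishable across its values, which is the invariance ICL targets.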

AAAI Conference 2021 Conference Paper

Nyströmformer: A Nyström-based Algorithm for Approximating Self-Attention

  • Yunyang Xiong
  • Zhanpeng Zeng
  • Rudrasis Chakraborty
  • Mingxing Tan
  • Glenn Fung
  • Yin Li
  • Vikas Singh

Transformers have emerged as a powerful tool for a broad range of natural language processing tasks. A key component that drives the impressive performance of Transformers is the self-attention mechanism that encodes the influence or dependence of other tokens on each specific token. While beneficial, the quadratic complexity of self-attention on the input sequence length has limited its application to longer sequences – a topic being actively studied in the community. To address this limitation, we propose Nyströmformer – a model that exhibits favorable scalability as a function of sequence length. Our idea is based on adapting the Nyström method to approximate standard self-attention with O(n) complexity. The scalability of Nyströmformer enables application to longer sequences with thousands of tokens. We perform evaluations on multiple downstream tasks on the GLUE benchmark and IMDB reviews with standard sequence length, and find that our Nyströmformer performs comparably to, or in a few cases even slightly better than, standard self-attention. On longer sequence tasks in the Long Range Arena (LRA) benchmark, Nyströmformer performs favorably relative to other efficient self-attention methods. Our code is available at https://github.com/mlpen/Nystromformer.
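The Nyström idea can be sketched in a few lines: pick m landmark queries/keys, form three small softmax matrices, and chain them through a pseudoinverse so the full n × n attention matrix is never materialized. Note the paper itself uses segment-means for landmarks and an iterative Moore–Penrose approximation; the subsampled landmarks and exact `np.linalg.pinv` below are simplifications for illustration:

```python
import numpy as np

def softmax(x):
    e = np.exp(x - x.max(axis=-1, keepdims=True))
    return e / e.sum(axis=-1, keepdims=True)

def nystrom_attention(Q, K, V, m=8):
    """Landmark (Nystrom) approximation of softmax(Q K^T / sqrt(d)) V.

    Cost is O(n * m) rather than O(n^2): the n x n attention matrix is
    never formed, only n x m, m x m, and m x n pieces.
    """
    n, d = Q.shape
    idx = np.arange(0, n, n // m)[:m]        # m landmark positions (subsampled)
    Qm, Km = Q[idx], K[idx]
    F = softmax(Q @ Km.T / np.sqrt(d))       # n x m
    A = softmax(Qm @ Km.T / np.sqrt(d))      # m x m
    B = softmax(Qm @ K.T / np.sqrt(d))       # m x n
    return F @ np.linalg.pinv(A) @ (B @ V)

rng = np.random.default_rng(0)
n, d = 64, 16
Q, K, V = rng.normal(size=(3, n, d))
exact = softmax(Q @ K.T / np.sqrt(d)) @ V    # the O(n^2) reference
approx = nystrom_attention(Q, K, V)
```

On unstructured random data the approximation is rough; its quality improves when the attention matrix has the (approximately) low-rank structure the paper exploits.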

AAAI Conference 2021 Conference Paper

Physarum Powered Differentiable Linear Programming Layers and Applications

  • Zihang Meng
  • Sathya N. Ravi
  • Vikas Singh

Consider a learning algorithm, which involves an internal call to an optimization routine such as a generalized eigenvalue problem, a cone programming problem or even sorting. Integrating such a method as a layer(s) within a trainable deep neural network (DNN) in an efficient and numerically stable way is not straightforward – for instance, only recently, strategies have emerged for eigendecomposition and differentiable sorting. We propose an efficient and differentiable solver for general linear programming problems which can be used in a plug-and-play manner within DNNs as a layer. Our development is inspired by a fascinating but not widely used link between dynamics of slime mold (physarum) and optimization schemes such as steepest descent. We describe our development and show the use of our solver in a video segmentation task and meta-learning for few-shot learning. We review the existing results and provide a technical analysis describing its applicability for our use cases. Our solver performs comparably with a customized projected gradient descent method on the first task and outperforms the differentiable CVXPY-SCS solver on the second task. Experiments show that our solver converges quickly without the need for a feasible initial point. Our proposal is easy to implement and can easily serve as layers whenever a learning procedure needs a fast approximate solution to an LP, within a larger network.

ICML Conference 2021 Conference Paper

You Only Sample (Almost) Once: Linear Cost Self-Attention Via Bernoulli Sampling

  • Zhanpeng Zeng
  • Yunyang Xiong
  • Sathya N. Ravi
  • Shailesh Acharya
  • Glenn Fung
  • Vikas Singh

Transformer-based models are widely used in natural language processing (NLP). Central to the transformer model is the self-attention mechanism, which captures the interactions of token pairs in the input sequences and depends quadratically on the sequence length. Training such models on longer sequences is expensive. In this paper, we show that a Bernoulli sampling attention mechanism based on Locality Sensitive Hashing (LSH) decreases the quadratic complexity of such models to linear. We bypass the quadratic cost by considering self-attention as a sum of individual tokens associated with Bernoulli random variables that can, in principle, be sampled at once by a single hash (although in practice, this number may be a small constant). This leads to an efficient sampling scheme to estimate self-attention which relies on specific modifications of LSH (to enable deployment on GPU architectures). We evaluate our algorithm on the GLUE benchmark with standard 512 sequence length where we see favorable performance relative to a standard pretrained Transformer. On the Long Range Arena (LRA) benchmark, for evaluating performance on long sequences, our method achieves results consistent with softmax self-attention but with sizable speed-ups and memory savings and often outperforms other efficient self-attention methods. Our code is available at https://github.com/mlpen/YOSO.
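The Bernoulli view rests on a classical fact about sign (random-hyperplane) LSH: two vectors collide under one hash with probability 1 − θ/π, where θ is their angle. A small numpy check of that fact only; the paper's GPU-oriented LSH modifications and the attention estimator itself are not reproduced here:

```python
import numpy as np

rng = np.random.default_rng(0)

def collision_rate(x, y, n_hashes=5000):
    """Fraction of random-hyperplane (sign) hashes on which x and y agree."""
    H = rng.normal(size=(n_hashes, x.size))    # one random hyperplane per hash
    return float(np.mean(np.sign(H @ x) == np.sign(H @ y)))

x = rng.normal(size=32)
y = x + 0.1 * rng.normal(size=32)              # nearly aligned with x
z = rng.normal(size=32)                        # unrelated direction

theta = np.arccos(x @ y / (np.linalg.norm(x) * np.linalg.norm(y)))
expected = 1 - theta / np.pi                   # sign-LSH collision probability
```

Aligned pairs collide on almost every hash while unrelated pairs collide about half the time, which is what makes hash collisions a usable Bernoulli signal for similarity.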

AAAI Conference 2020 Conference Paper

Optimizing Nondecomposable Data Dependent Regularizers via Lagrangian Reparameterization Offers Significant Performance and Efficiency Gains

  • Sathya N. Ravi
  • Abhay Venkatesh
  • Glenn M. Fung
  • Vikas Singh

Data dependent regularization is known to benefit a wide variety of problems in machine learning. Often, these regularizers cannot be easily decomposed into a sum over a finite number of terms, e.g., a sum over individual example-wise terms. The Fβ measure, Area under the ROC curve (AUCROC) and Precision at a fixed recall (P@R) are some prominent examples that are used in many applications. We find that for most medium to large sized datasets, scalability issues severely limit our ability to leverage the benefits of such regularizers. Importantly, the key technical impediment, despite some recent progress, is that such objectives remain difficult to optimize via backpropagation procedures. While an efficient general-purpose strategy for this problem still remains elusive, in this paper, we show that for many data-dependent nondecomposable regularizers that are relevant in applications, sizable gains in efficiency are possible with minimal code-level changes; in other words, no specialized tools or numerical schemes are needed. Our procedure involves a reparameterization followed by a partial dualization – this leads to a formulation that has provably cheap projection operators. We present a detailed analysis of runtime and convergence properties of our algorithm. On the experimental side, we show that a direct use of our scheme significantly improves the state of the art IOU measures reported for the MSCOCO Stuff segmentation dataset.

YNICL Journal 2019 Journal Article

Cerebrospinal fluid biomarkers of neurofibrillary tangles and synaptic dysfunction are associated with longitudinal decline in white matter connectivity: A multi-resolution graph analysis

  • Won Hwa Kim
  • Annie M. Racine
  • Nagesh Adluru
  • Seong Jae Hwang
  • Kaj Blennow
  • Henrik Zetterberg
  • Cynthia M. Carlsson
  • Sanjay Asthana

In addition to the development of beta amyloid plaques and neurofibrillary tangles, Alzheimer's disease (AD) involves the loss of connecting structures including degeneration of myelinated axons and synaptic connections. However, the extent to which white matter tracts change longitudinally, particularly in the asymptomatic, preclinical stage of AD, remains poorly characterized. In this study we used a novel graph wavelet algorithm to determine the extent to which microstructural brain changes evolve in concert with the development of AD neuropathology as observed using CSF biomarkers. A total of 118 participants with at least two diffusion tensor imaging (DTI) scans and one lumbar puncture for CSF were selected from two observational and longitudinally followed cohorts. CSF was assayed for pathology specific to AD (Aβ42 and phosphorylated-tau), neurodegeneration (total-tau), axonal degeneration (neurofilament light chain protein; NFL), and synaptic degeneration (neurogranin). Tractography was performed on DTI scans to obtain structural connectivity networks with 160 nodes where the nodes correspond to specific brain regions of interest (ROIs) and their connections were defined by DTI metrics (i.e., fractional anisotropy (FA) and mean diffusivity (MD)). For the analysis, we adopted a multi-resolution graph wavelet technique called Wavelet Connectivity Signature (WaCS) which derives higher order representations from DTI metrics at each brain connection. Our statistical analysis showed interactions between the CSF measures and the MRI time interval, such that elevated CSF biomarkers and longer time were associated with greater longitudinal changes in white matter microstructure (decreasing FA and increasing MD). Specifically, we detected a total of 17 fiber tracts whose WaCS representations showed an association between longitudinal decline in white matter microstructure and both CSF p-tau and neurogranin. 
While development of neurofibrillary tangles and synaptic degeneration are cortical phenomena, the results show that they are also associated with degeneration of underlying white matter tracts, a process which may eventually play a role in the development of cognitive decline and dementia.

AAAI Conference 2019 Conference Paper

Explicitly Imposing Constraints in Deep Networks via Conditional Gradients Gives Improved Generalization and Faster Convergence

  • Sathya N. Ravi
  • Tuan Dinh
  • Vishnu Suresh Lokhande
  • Vikas Singh

A number of results have recently demonstrated the benefits of incorporating various constraints when training deep architectures in vision and machine learning. The advantages range from guarantees for statistical generalization to better accuracy to compression. But support for general constraints within widely used libraries remains scarce, and their broader deployment within many applications that can benefit from them remains under-explored. Part of the reason is that stochastic gradient descent (SGD), the workhorse for training deep neural networks, does not natively deal with constraints with global scope very well. In this paper, we revisit a classical first order scheme from numerical optimization, Conditional Gradients (CG), that has, thus far, had limited applicability in training deep models. We show via rigorous analysis how various constraints can be naturally handled by modifications of this algorithm. We provide convergence guarantees and show a suite of immediate benefits that are possible — from training ResNets with fewer layers but better accuracy simply by substituting in our version of CG, to faster training of GANs with 50% fewer epochs in image inpainting applications, to provably better generalization guarantees using efficiently implementable forms of recently proposed regularizers.
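As a reference point for what Conditional Gradients (Frank-Wolfe) does, here is a minimal numpy sketch on a toy constrained least-squares problem over an ℓ1 ball: each step solves a linear subproblem over the constraint set (select one signed coordinate), so every iterate is feasible by construction and no projection is ever needed. The problem instance is deliberately easy and illustrative; it is unrelated to the paper's deep-network experiments:

```python
import numpy as np

def frank_wolfe_l1(A, b, radius=1.0, iters=50):
    """Conditional gradients for min ||Ax - b||^2 over the l1 ball ||x||_1 <= radius.

    The linear subproblem over the l1 ball is solved by picking a single
    signed coordinate (a vertex of the ball), so iterates stay feasible
    throughout and no projection step is required.
    """
    x = np.zeros(A.shape[1])
    for _ in range(iters):
        grad = 2 * A.T @ (A @ x - b)
        i = int(np.argmax(np.abs(grad)))
        s = np.zeros_like(x)
        s[i] = -radius * np.sign(grad[i])      # vertex minimizing <grad, s>
        d = s - x
        denom = 2 * np.linalg.norm(A @ d) ** 2
        if denom == 0:
            break
        gamma = float(np.clip(-(grad @ d) / denom, 0.0, 1.0))  # exact line search
        x = x + gamma * d
    return x

rng = np.random.default_rng(0)
A = rng.normal(size=(40, 10))
x_true = np.zeros(10)
x_true[0] = 0.5                                # sparse target inside the ball
b = A @ x_true
x_hat = frank_wolfe_l1(A, b)
```

Because each iterate is a convex combination of feasible points, the constraint holds exactly at every step, which is the property the paper exploits for constraints with global scope.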

UAI Conference 2019 Conference Paper

Sampling-free Uncertainty Estimation in Gated Recurrent Units with Applications to Normative Modeling in Neuroimaging

  • Seong Jae Hwang
  • Ronak R. Mehta
  • Hyunwoo J. Kim
  • Sterling C. Johnson
  • Vikas Singh

There has recently been a concerted effort to derive mechanisms in vision and machine learning systems to offer uncertainty estimates of the predictions they make. Clearly, there are enormous benefits to a system that is not only accurate but also has a sense for when it is not. Existing proposals center around Bayesian interpretations of modern deep architectures - these are effective but can often be computationally demanding. We show how classical ideas in the literature on exponential families on probabilistic networks provide an excellent starting point to derive uncertainty estimates in Gated Recurrent Units (GRU). Our proposal directly quantifies uncertainty deterministically, without the need for costly sampling-based estimation. While uncertainty is quite useful by itself in computer vision and machine learning, we also demonstrate that it can play a key role in enabling statistical analysis with deep networks in neuroimaging studies with normative modeling methods. To our knowledge, this is the first result describing sampling-free uncertainty estimation for powerful sequential models such as GRUs.

NeurIPS Conference 2018 Conference Paper

A Statistical Recurrent Model on the Manifold of Symmetric Positive Definite Matrices

  • Rudrasis Chakraborty
  • Chun-Hao Yang
  • Xingjian Zhen
  • Monami Banerjee
  • Derek Archer
  • David Vaillancourt
  • Vikas Singh
  • Baba Vemuri

In a number of disciplines, the data (e.g., graphs, manifolds) to be analyzed are non-Euclidean in nature. Geometric deep learning corresponds to techniques that generalize deep neural network models to such non-Euclidean spaces. Several recent papers have shown how convolutional neural networks (CNNs) can be extended to learn with graph-based data. In this work, we study the setting where the data (or measurements) are ordered, longitudinal or temporal in nature and live on a Riemannian manifold -- this setting is common in a variety of problems in statistical machine learning, vision and medical imaging. We show how statistical recurrent network models can be defined in such spaces. We give an efficient algorithm and conduct a rigorous analysis of its statistical properties. We perform extensive numerical experiments demonstrating competitive performance with state of the art methods but with significantly fewer parameters. We also show applications to a statistical analysis task in brain imaging, a regime where deep neural network models have only been utilized in limited ways.

YNIMG Journal 2017 Journal Article

Accelerating permutation testing in voxel-wise analysis through subspace tracking: A new plugin for SnPM

  • Felipe Gutierrez-Barragan
  • Vamsi K. Ithapu
  • Chris Hinrichs
  • Camille Maumet
  • Sterling C. Johnson
  • Thomas E. Nichols
  • Vikas Singh

Permutation testing is a non-parametric method for obtaining the max null distribution used to compute corrected p-values that provide strong control of false positives. In neuroimaging, however, the computational burden of running such an algorithm can be significant. We find that by viewing the permutation testing procedure as the construction of a very large permutation testing matrix, T, one can exploit structural properties derived from the data and the test statistics to reduce the runtime under certain conditions. In particular, we see that T is low-rank plus a low-variance residual. This makes T a good candidate for low-rank matrix completion, where only a very small number of entries of T (∼0.35% of all entries in our experiments) have to be computed to obtain a good estimate. Based on this observation, we present RapidPT, an algorithm that efficiently recovers the max null distribution commonly obtained through regular permutation testing in voxel-wise analysis. We present an extensive validation on a synthetic dataset and four varying sized datasets against two baselines: Statistical NonParametric Mapping (SnPM13) and a standard permutation testing implementation (referred to as NaivePT). We find that RapidPT achieves its best runtime performance on medium sized datasets (50 ≤ n ≤ 200), with speedups of 1.5×–38× (vs. SnPM13) and 20×–1000× (vs. NaivePT). For larger datasets (n ≥ 200) RapidPT outperforms NaivePT (6×–200×) on all datasets, and provides large speedups over SnPM13 when more than 10,000 permutations (2×–15×) are needed. The implementation is a standalone toolbox and is also integrated within SnPM13, able to leverage multi-core architectures when available.

ICML Conference 2017 Conference Paper

When can Multi-Site Datasets be Pooled for Regression? Hypothesis Tests, $\ell_2$-consistency and Neuroscience Applications

  • Hao Henry Zhou
  • Yilin Zhang
  • Vamsi K. Ithapu
  • Sterling C. Johnson
  • Grace Wahba
  • Vikas Singh

Many studies in biomedical and health sciences involve small sample sizes due to logistic or financial constraints. Often, identifying weak (but scientifically interesting) associations between a set of predictors and a response necessitates pooling datasets from multiple diverse labs or groups. While there is a rich literature in statistical machine learning to address distributional shifts and inference in multi-site datasets, it is less clear when such pooling is guaranteed to help (and when it does not) – independent of the inference algorithms we use. In this paper, we present a hypothesis test to answer this question, both for classical and high dimensional linear regression. We precisely identify regimes where pooling datasets across multiple sites is sensible, and how such policy decisions can be made via simple checks executable on each site before any data transfer ever happens. With a focus on Alzheimer’s disease studies, we present empirical results showing that in regimes suggested by our analysis, pooling a local dataset with data from an international study improves power.

ICML Conference 2016 Conference Paper

Experimental Design on a Budget for Sparse Linear Models and Applications

  • Sathya N. Ravi
  • Vamsi K. Ithapu
  • Sterling C. Johnson
  • Vikas Singh

Budget constrained optimal design of experiments is a classical problem in statistics. Although the optimal design literature is very mature, few efficient strategies are available when these design problems appear in the context of sparse linear models commonly encountered in high dimensional machine learning and statistics. In this work, we study experimental design for the setting where the underlying regression model is characterized by an $\ell_1$-regularized linear function. We propose two novel strategies: the first is motivated geometrically whereas the second is algebraic in nature. We obtain tractable algorithms for this problem; our results also hold for a more general class of sparse linear models. We perform an extensive set of experiments, on benchmarks and a large multi-site neuroscience study, showing that the proposed models are effective in practice. The latter experiment suggests that these ideas may play a small role in informing enrollment strategies for similar scientific studies in the short-to-medium term future.

NeurIPS Conference 2016 Conference Paper

Hypothesis Testing in Unsupervised Domain Adaptation with Applications in Alzheimer's Disease

  • Hao Zhou
  • Vamsi Ithapu
  • Sathya Narayanan Ravi
  • Vikas Singh
  • Grace Wahba
  • Sterling Johnson

Consider samples from two different data sources $\{\mathbf{x_s^i}\} \sim P_{\rm source}$ and $\{\mathbf{x_t^i}\} \sim P_{\rm target}$. We only observe their transformed versions $h(\mathbf{x_s^i})$ and $g(\mathbf{x_t^i})$, for some known function classes $h(\cdot)$ and $g(\cdot)$. Our goal is to perform a statistical test checking if $P_{\rm source} = P_{\rm target}$ while removing the distortions induced by the transformations. This problem is closely related to concepts underlying numerous domain adaptation algorithms, and in our case, is motivated by the need to combine clinical and imaging based biomarkers from multiple sites and/or batches, where this problem is fairly common and an impediment in the conduct of analyses with much larger sample sizes. We develop a framework that addresses this problem using ideas from hypothesis testing on the transformed measurements, wherein the distortions need to be estimated {\it in tandem} with the testing. We derive a simple algorithm and study its convergence and consistency properties in detail, and we also provide lower-bound strategies based on recent work in continuous optimization. On a dataset of individuals at risk for neurological disease, our results are competitive with alternative procedures that are twice as expensive and in some cases operationally infeasible to implement.

ICML Conference 2015 Conference Paper

Manifold-valued Dirichlet Processes

  • Hyunwoo J. Kim
  • Jia Xu 0011
  • Baba C. Vemuri
  • Vikas Singh

Statistical models for manifold-valued data permit capturing the intrinsic nature of the curved spaces in which the data lie and have been a topic of research for several decades. Typically, these formulations use geodesic curves and distances defined locally for most cases - this makes it hard to design parametric models globally on smooth manifolds. Thus, most (manifold specific) parametric models available today assume that the data lie in a small neighborhood on the manifold. To address this ’locality’ problem, we propose a novel nonparametric model which unifies multivariate general linear models (MGLMs) using multiple tangent spaces. Our framework generalizes existing work on (both Euclidean and non-Euclidean) general linear models providing a recipe to globally extend the locally-defined parametric models (using a mixture of local models). By grouping observations into sub-populations at multiple tangent spaces, our method provides insights into the hidden structure (geodesic relationships) in the data. This yields a framework to group observations and discover geodesic relationships between covariates X and manifold-valued responses Y, which we call Dirichlet process mixtures of multivariate general linear models (DP-MGLM) on Riemannian manifolds. Finally, we present proof of concept experiments to validate our model.

YNIMG Journal 2015 Journal Article

Multi-resolution statistical analysis of brain connectivity graphs in preclinical Alzheimer's disease

  • Won Hwa Kim
  • Nagesh Adluru
  • Moo K. Chung
  • Ozioma C. Okonkwo
  • Sterling C. Johnson
  • Barbara B. Bendlin
  • Vikas Singh

There is significant interest, both from basic and applied research perspectives, in understanding how structural/functional connectivity changes can explain behavioral symptoms and predict decline in neurodegenerative diseases such as Alzheimer's disease (AD). The first step in most such analyses is to encode the connectivity information as a graph; then, one may perform statistical inference on various ‘global’ graph theoretic summary measures (e.g., modularity, graph diameter) and/or at the level of individual edges (or connections). For AD in particular, clear differences in connectivity at the dementia stage of the disease (relative to healthy controls) have been identified. Despite such findings, AD-related connectivity changes in preclinical disease remain poorly characterized. Such preclinical datasets are typically smaller and group differences are weaker. In this paper, we propose a new multi-resolution method for performing statistical analysis of connectivity networks/graphs derived from neuroimaging data. At a high level, the method occupies the middle ground between the two extremes — analyzing global graph summary measures (global) versus connectivity strengths or correlations for individual edges, similar to voxel-based analysis (local). Instead, our strategy derives a wavelet representation at each primitive (connection edge) which captures the graph context at multiple resolutions. We provide extensive empirical evidence of how this framework offers improved statistical power by analyzing two distinct AD datasets. Here, connectivity is derived from diffusion tensor magnetic resonance images by running a tractography routine. We first present results showing significant connectivity differences between AD patients and controls that were not evident using standard approaches.
Later, we show results on populations that are not diagnosed with AD but have a positive family history risk of AD where our algorithm helps in identifying potentially subtle differences between patient groups. We also give an easy to deploy open source implementation of the algorithm for use within studies of connectivity in AD and other neurodegenerative disorders.

JMLR Journal 2015 Journal Article

SnFFT: A Julia Toolkit for Fourier Analysis of Functions over Permutations

  • Gregory Plumb
  • Deepti Pachauri
  • Risi Kondor
  • Vikas Singh

$\mathbb{S}_n$FFT is an easy-to-use software library written in the Julia language to facilitate Fourier analysis on the symmetric group (set of permutations) of degree $n$, denoted $\mathbb{S}_n$, and make it more easily deployable within statistical machine learning algorithms. Our implementation internally creates the irreducible matrix representations of $\mathbb{S}_n$, and efficiently computes fast Fourier transforms (FFTs) and inverse fast Fourier transforms (iFFTs). Advanced users can achieve scalability and promising practical performance by exploiting various other forms of sparsity. Further, the library also supports partial inverse Fourier transforms, which exploit the smoothness properties of functions by maintaining only the first few Fourier coefficients. Out of the box, $\mathbb{S}_n$FFT currently offers two non-trivial operations for functions defined on $\mathbb{S}_n$, namely convolution and correlation. While the potential applicability of $\mathbb{S}_n$FFT is fairly broad, as an example, we show how it can be used for clustering ranked data, where each ranking is modeled as a distribution on $\mathbb{S}_n$.

YNIMG Journal 2014 Journal Article

Multi-resolutional shape features via non-Euclidean wavelets: Applications to statistical analysis of cortical thickness

  • Won Hwa Kim
  • Vikas Singh
  • Moo K. Chung
  • Chris Hinrichs
  • Deepti Pachauri
  • Ozioma C. Okonkwo
  • Sterling C. Johnson

Statistical analysis on arbitrary surface meshes such as the cortical surface is an important approach to understanding brain diseases such as Alzheimer's disease (AD). Surface analysis may be able to identify specific cortical patterns that relate to certain disease characteristics or exhibit differences between groups. Our goal in this paper is to make group analysis of signals on surfaces more sensitive. To do this, we derive multi-scale shape descriptors that characterize the signal around each mesh vertex, i.e., its local context, at varying levels of resolution. In order to define such a shape descriptor, we make use of recent results from harmonic analysis that extend traditional continuous wavelet theory from the Euclidean to a non-Euclidean setting (i.e., a graph, mesh or network). Using this descriptor, we conduct experiments on two different datasets, the Alzheimer's Disease NeuroImaging Initiative (ADNI) data and images acquired at the Wisconsin Alzheimer's Disease Research Center (W-ADRC), focusing on individuals labeled as having Alzheimer's disease (AD), mild cognitive impairment (MCI) and healthy controls. In particular, we contrast traditional univariate methods with our multi-resolution approach, which shows increased sensitivity and improved statistical power to detect group-level effects. We also provide an open source implementation.

NeurIPS Conference 2014 Conference Paper

Permutation Diffusion Maps (PDM) with Application to the Image Association Problem in Computer Vision

  • Deepti Pachauri
  • Risi Kondor
  • Gautam Sargur
  • Vikas Singh

Consistently matching keypoints across images, and the related problem of finding clusters of nearby images, are critical components of various tasks in Computer Vision, including Structure from Motion (SfM). Unfortunately, occlusion and large repetitive structures tend to mislead most currently used matching algorithms, leading to characteristic pathologies in the final output. In this paper we introduce a new method, Permutation Diffusion Maps (PDM), to solve the matching problem, as well as a related new affinity measure, derived using ideas from harmonic analysis on the symmetric group. We show that just by using it as a preprocessing step to existing SfM pipelines, PDM can greatly improve reconstruction quality on difficult datasets.

NeurIPS Conference 2013 Conference Paper

Solving the multi-way matching problem by permutation synchronization

  • Deepti Pachauri
  • Risi Kondor
  • Vikas Singh

The problem of matching not just two, but m different sets of objects to each other arises in a variety of contexts, including finding the correspondence between feature points across multiple images in computer vision. At present it is usually solved by matching the sets pairwise, in series. In contrast, we propose a new method, permutation synchronization, which finds all the matchings jointly, in one shot, via a relaxation to eigenvector decomposition. The resulting algorithm is both computationally efficient, and, as we demonstrate with theoretical arguments as well as experimental results, much more stable to noise than previous methods.
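The one-shot recovery can be sketched directly from the abstract: stack the pairwise matchings into a block matrix, take its top-n eigenvectors, and round each block back to a permutation. A noise-free numpy demo; the sizes and the simple argmax rounding are illustrative (with noisy matchings one would round with the Hungarian algorithm instead):

```python
import numpy as np

def synchronize(T, m, n):
    """Jointly recover P_1..P_m from pairwise matchings T[i][j] ~= P_i P_j^T.

    Stack the pairwise permutation matrices into an mn x mn block matrix,
    take its top-n eigenspace, and round each n x n block (relative to
    frame 0) back to a permutation matrix.
    """
    M = np.block([[T[i][j] for j in range(m)] for i in range(m)])
    _, vecs = np.linalg.eigh(M)
    U = vecs[:, -n:]                               # top-n eigenvectors
    perms = []
    for i in range(m):
        block = U[i * n:(i + 1) * n] @ U[:n].T     # proportional to P_i P_0^T
        P = np.zeros((n, n))
        P[np.arange(n), block.argmax(axis=1)] = 1  # noise-free: exact rounding
        perms.append(P)
    return perms

# Noise-free demo: consistent pairwise matchings from ground-truth permutations.
rng = np.random.default_rng(0)
m, n = 5, 6
gt = [np.eye(n)[rng.permutation(n)] for _ in range(m)]
T = [[gt[i] @ gt[j].T for j in range(m)] for i in range(m)]
perms = synchronize(T, m, n)
```

In the noise-free case the top-n eigenspace spans exactly the stacked ground-truth permutations, so the recovered matchings reproduce every pairwise observation.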

NeurIPS Conference 2013 Conference Paper

Speeding up Permutation Testing in Neuroimaging

  • Chris Hinrichs
  • Vamsi Ithapu
  • Qinyuan Sun
  • Sterling Johnson
  • Vikas Singh

Multiple hypothesis testing is a significant problem in nearly all neuroimaging studies. In order to correct for this phenomenon, we require a reliable estimate of the Family-Wise Error Rate (FWER). The well known Bonferroni correction method, while being simple to implement, is quite conservative, and can substantially under-power a study because it ignores dependencies between test statistics. Permutation testing, on the other hand, is an exact, non parametric method of estimating the FWER for a given α threshold, but for acceptably low thresholds the computational burden can be prohibitive. In this paper, we observe that permutation testing in fact amounts to populating the columns of a very large matrix P. By analyzing the spectrum of this matrix, under certain conditions, we see that P has a low-rank plus a low-variance residual decomposition which makes it suitable for highly sub-sampled — on the order of 0.5% — matrix completion methods. Thus, we propose a novel permutation testing methodology which offers a large speedup, without sacrificing the fidelity of the estimated FWER. Our evaluations on four different neuroimaging datasets show that a computational speedup factor of roughly 50× can be achieved while recovering the FWER distribution up to very high accuracy. Further, we show that the estimated α threshold is also recovered faithfully, and is stable.
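The low-rank observation is easy to reproduce for a linear test statistic. The paper analyzes t-statistics, which give low-rank plus a low-variance residual; a group mean difference, used below for simplicity, makes the matrix exactly low-rank. A numpy sketch with illustrative sizes:

```python
import numpy as np

rng = np.random.default_rng(0)
n_subjects, n_voxels, n_perms = 20, 2000, 300
X = rng.normal(size=(n_subjects, n_voxels))        # subjects x voxels data

# One row of the permutation-testing matrix per relabeling: here the
# statistic is a per-voxel group mean difference (a linear statistic).
P = np.empty((n_perms, n_voxels))
for t in range(n_perms):
    g = rng.permutation(n_subjects) < n_subjects // 2
    P[t] = X[g].mean(axis=0) - X[~g].mean(axis=0)

# Every row is a linear combination of the n_subjects rows of X, so the
# 300 x 2000 matrix has rank at most n_subjects.
rank = int(np.linalg.matrix_rank(P))
```

Even with far more permutations than subjects, the matrix never gains rank, which is exactly why a tiny sampled fraction of its entries suffices for completion.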

NeurIPS Conference 2012 Conference Paper

Q-MKL: Matrix-induced Regularization in Multi-Kernel Learning with Applications to Neuroimaging

  • Chris Hinrichs
  • Vikas Singh
  • Jiming Peng
  • Sterling Johnson

Multiple Kernel Learning (MKL) generalizes SVMs to the setting where one simultaneously trains a linear classifier and chooses an optimal combination of given base kernels. Model complexity is typically controlled using various norm regularizations on the vector of base kernel mixing coefficients. Existing methods, however, neither regularize nor exploit potentially useful information pertaining to how kernels in the input set 'interact'; that is, higher order kernel-pair relationships that can be easily obtained via unsupervised (similarity, geodesics), supervised (correlation in errors), or domain knowledge driven mechanisms (which features were used to construct the kernel?). We show that by substituting the norm penalty with an arbitrary quadratic function $Q \succeq 0$, one can impose a desired covariance structure on mixing coefficient selection, and use this as an inductive bias when learning the concept. This formulation significantly generalizes the widely used 1- and 2-norm MKL objectives. We explore the model’s utility via experiments on a challenging Neuroimaging problem, where the goal is to predict a subject’s conversion to Alzheimer’s Disease (AD) by exploiting aggregate information from several distinct imaging modalities. Here, our new model outperforms the state of the art (p-values ≪ 10⁻³). We briefly discuss ramifications in terms of learning bounds (Rademacher complexity).

NeurIPS Conference 2012 Conference Paper

Wavelet based multi-scale shape features on arbitrary surfaces for cortical thickness discrimination

  • Won Kim
  • Deepti Pachauri
  • Charles Hatt
  • Moo K. Chung
  • Sterling Johnson
  • Vikas Singh

Hypothesis testing on signals defined on surfaces (such as the cortical surface) is a fundamental component of a variety of studies in Neuroscience. The goal here is to identify regions that exhibit changes as a function of the clinical condition under study. As the clinical questions of interest move towards identifying very early signs of diseases, the corresponding statistical differences at the group level invariably become weaker and increasingly hard to identify. Indeed, after a multiple comparisons correction is adopted (to account for correlated statistical tests over all surface points), very few regions may survive. In contrast to hypothesis tests on point-wise measurements, in this paper, we make the case for performing statistical analysis on multi-scale shape descriptors that characterize the local topological context of the signal around each surface vertex. Our descriptors are based on recent results from harmonic analysis, that show how wavelet theory extends to non-Euclidean settings (i.e., irregular weighted graphs). We provide strong evidence that these descriptors successfully pick up group-wise differences, where traditional methods either fail or yield unsatisfactory results. Other than this primary application, we show how the framework allows performing cortical surface smoothing in the native space without mapping to a unit sphere.

YNIMG Journal 2011 Journal Article

Predictive markers for AD in a multi-modality framework: An analysis of MCI progression in the ADNI population

  • Chris Hinrichs
  • Vikas Singh
  • Guofan Xu
  • Sterling C. Johnson

Alzheimer's Disease (AD) and other neurodegenerative diseases affect over 20 million people worldwide, and this number is projected to significantly increase in the coming decades. Proposed imaging-based markers have shown steadily improving levels of sensitivity/specificity in classifying individual subjects as AD or normal. Several of these efforts have utilized statistical machine learning techniques, using brain images as input, as means of deriving such AD-related markers. A common characteristic of this line of research is a focus on either (1) using a single imaging modality for classification, or (2) incorporating several modalities, but reporting separate results for each. One strategy to improve on the success of these methods is to leverage all available imaging modalities together in a single automated learning framework. The rationale is that some subjects may show signs of pathology in one modality but not in another—by combining all available images a clearer view of the progression of disease pathology will emerge. Our method is based on the Multi-Kernel Learning (MKL) framework, which allows the inclusion of an arbitrary number of views of the data in a maximum-margin kernel learning framework. The principal innovation behind MKL is that it learns an optimal combination of kernel (similarity) matrices while simultaneously training a classifier. In classification experiments MKL outperformed an SVM trained on all available features by 3%–4%. We are especially interested in whether such markers are capable of identifying early signs of the disease. To address this question, we have examined whether our multi-modal disease marker (MMDM) can predict conversion from Mild Cognitive Impairment (MCI) to AD. Our experiments reveal that this measure shows significant group differences between MCI subjects who progressed to AD, and those who remained stable for 3 years. These differences were most significant in MMDMs based on imaging data. We also discuss the relationship between our MMDM and an individual's conversion from MCI to AD.
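To make the multi-modality combination step concrete, here is a hedged sketch: per-modality kernel matrices are mixed with convex weights, and a kernel ridge model stands in for the max-margin classifier the paper actually trains. All names are illustrative:

```python
import numpy as np

def combined_kernel(kernels, beta):
    """Convex combination of per-modality kernel matrices, as in MKL."""
    return sum(b * K_m for b, K_m in zip(beta, kernels))

def kernel_ridge_predict(K_train, y, K_test_train, reg=1e-2):
    """Stand-in classifier (kernel ridge regression) on the combined
    kernel; the paper uses a max-margin SVM, this is illustration only."""
    alpha = np.linalg.solve(K_train + reg * np.eye(len(y)), y)
    return K_test_train @ alpha
```

In the full MKL setting the weights `beta` are learned jointly with the classifier rather than fixed by hand.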

NeurIPS Conference 2010 Conference Paper

Epitome driven 3-D Diffusion Tensor image segmentation: on extracting specific structures

  • Kamiya Motwani
  • Nagesh Adluru
  • Chris Hinrichs
  • Andrew Alexander
  • Vikas Singh

We study the problem of segmenting specific white matter structures of interest from Diffusion Tensor (DT-MR) images of the human brain. This is an important requirement in many Neuroimaging studies: for instance, to evaluate whether a brain structure exhibits group level differences as a function of disease in a set of images. Typically, interactive expert guided segmentation has been the method of choice for such applications, but this is tedious for large datasets common today. To address this problem, we endow an image segmentation algorithm with 'advice' encoding some global characteristics of the region(s) we want to extract. This is accomplished by constructing (using expert-segmented images) an epitome of a specific region - as a histogram over a bag of 'words' (e.g., suitable feature descriptors). Now, given such a representation, the problem reduces to segmenting a new brain image with additional constraints that enforce consistency between the segmented foreground and the pre-specified histogram over features. We present combinatorial approximation algorithms to incorporate such domain specific constraints for Markov Random Field (MRF) segmentation. Making use of recent results on image co-segmentation, we derive effective solution strategies for our problem. We provide an analysis of solution quality, and present promising experimental evidence showing that many structures of interest in Neuroscience can be extracted reliably from 3-D brain image volumes using our algorithm.
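The epitome construction can be sketched as a normalised bag-of-words histogram. The helper names below are hypothetical, and the L1 mismatch is only one possible consistency measure; the paper enforces the histogram constraint inside the MRF objective rather than as a post-hoc score:

```python
import numpy as np

def region_epitome(descriptors, codebook):
    """Build the 'epitome' of a region: a normalised histogram over a
    bag of words (codebook rows are word centres; descriptors are
    per-voxel feature vectors from expert-segmented examples)."""
    # assign each descriptor to its nearest codebook word
    d2 = ((descriptors[:, None, :] - codebook[None, :, :]) ** 2).sum(-1)
    words = d2.argmin(axis=1)
    hist = np.bincount(words, minlength=len(codebook)).astype(float)
    return hist / hist.sum()

def histogram_mismatch(segment_descriptors, codebook, epitome):
    """L1 distance between a candidate segmentation's word histogram
    and the pre-specified epitome."""
    return np.abs(region_epitome(segment_descriptors, codebook) - epitome).sum()
```

A segmentation whose foreground reproduces the epitome's word distribution incurs zero mismatch, which is the consistency the constraints are meant to enforce.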

YNIMG Journal 2009 Journal Article

Spatially augmented LPboosting for AD classification with evaluations on the ADNI dataset

  • Chris Hinrichs
  • Vikas Singh
  • Lopamudra Mukherjee
  • Guofan Xu
  • Moo K. Chung
  • Sterling C. Johnson

Structural and functional brain images are playing an important role in helping us understand the changes associated with neurological disorders such as Alzheimer's disease (AD). Recent efforts have now started investigating their utility for diagnosis purposes. This line of research has shown promising results where methods from machine learning (such as Support Vector Machines) have been used to identify AD-related patterns from images, for use in diagnosing new individual subjects. In this paper, we propose a new framework for AD classification which makes use of Linear Program (LP) boosting with a novel additional regularization based on spatial “smoothness” in 3D image coordinate spaces. The algorithm formalizes the expectation that since the examples for training the classifier are images, the voxels eventually selected for specifying the decision boundary must constitute spatially contiguous chunks, i.e., “regions” must be preferred over isolated voxels. This prior belief turns out to be useful for significantly reducing the space of possible classifiers and leads to substantial benefits in generalization. In our method, the requirement of spatial contiguity (of selected discriminating voxels) is incorporated within the optimization framework directly. Other methods have made use of similar biases as a pre- or post-processing step, however, our model incorporates this emphasis on spatial smoothness directly into the learning step. We report on extensive evaluations of our algorithm on MR and FDG-PET images from the Alzheimer's Disease Neuroimaging Initiative (ADNI) dataset, and discuss the relationship of the classification output with the clinical and cognitive biomarker data available within ADNI.
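A toy linear program in the spirit described above, combining a 1-norm (LPBoost-style) objective with a penalty on weight differences across adjacent voxels, can be written with `scipy.optimize.linprog`. This is a simplified sketch with hypothetical names, not the paper's exact formulation:

```python
import numpy as np
from scipy.optimize import linprog

def spatial_lp_boost(X, y, edges, C=1.0, lam=0.1):
    """Sparse linear classifier with hinge slacks plus a spatial term
    lam * sum |w_i - w_j| over pairs of adjacent voxels, so selected
    voxels are encouraged to form contiguous chunks.
    Variable layout: [w (d, >=0) | b (free) | xi (n, >=0) | t (m, >=0)]."""
    n, d = X.shape
    m = len(edges)
    c = np.concatenate([np.ones(d), [0.0], C * np.ones(n), lam * np.ones(m)])
    A, b_ub = [], []
    # margin constraints: -y_k (x_k . w + b) - xi_k <= -1
    for k in range(n):
        row = np.zeros(d + 1 + n + m)
        row[:d] = -y[k] * X[k]
        row[d] = -y[k]
        row[d + 1 + k] = -1.0
        A.append(row); b_ub.append(-1.0)
    # |w_i - w_j| <= t_e, linearised as two inequalities per edge
    for e, (i, j) in enumerate(edges):
        for sgn in (1.0, -1.0):
            row = np.zeros(d + 1 + n + m)
            row[i], row[j] = sgn, -sgn
            row[d + 1 + n + e] = -1.0
            A.append(row); b_ub.append(0.0)
    bounds = [(0, None)] * d + [(None, None)] + [(0, None)] * (n + m)
    res = linprog(c, A_ub=np.array(A), b_ub=b_ub, bounds=bounds)
    return res.x[:d], res.x[d]
```

Here `edges` lists pairs of spatially neighbouring voxels; folding the smoothness term into the LP itself, rather than applying it as pre- or post-processing, is the design choice the abstract emphasizes.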

NeurIPS Conference 2007 Conference Paper

Ensemble Clustering using Semidefinite Programming

  • Vikas Singh
  • Lopamudra Mukherjee
  • Jiming Peng
  • Jinhui Xu

We consider the ensemble clustering problem where the task is to ‘aggregate’ multiple clustering solutions into a single consolidated clustering that maximizes the shared information among given clustering solutions. We obtain several new results for this problem. First, we note that the notion of agreement under such circumstances can be better captured using an agreement measure based on a 2D string encoding rather than voting strategy based methods proposed in the literature. Using this generalization, we first derive a nonlinear optimization model to maximize the new agreement measure. We then show that our optimization problem can be transformed into a strict 0-1 Semidefinite Program (SDP) via novel convexification techniques which can subsequently be relaxed to a polynomial time solvable SDP. Our experiments indicate improvements not only in terms of the proposed agreement measure but also the existing agreement measures based on voting strategies. We discuss evaluations on clustering and image segmentation databases.
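For contrast with the SDP formulation, the voting-style agreement the abstract compares against can be computed directly as a co-association matrix. A minimal sketch (the function name is illustrative, and this is the simpler baseline, not the paper's 2D string-encoding measure):

```python
import numpy as np

def coassociation_agreement(labelings):
    """Pairwise agreement matrix for an ensemble of clusterings: entry
    (i, j) is the fraction of base clusterings that place items i and j
    in the same cluster."""
    labelings = np.asarray(labelings)          # shape (n_clusterings, n_items)
    same = labelings[:, :, None] == labelings[:, None, :]
    return same.mean(axis=0)                   # shape (n_items, n_items)
```

Consolidation methods based on voting then cluster this matrix directly, whereas the paper optimizes a richer agreement objective over all base solutions jointly.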