Arrow Research search

Author name cluster

Eric P. Xing

Possible papers associated with this exact author name in Arrow. This page groups case-insensitive exact name matches and is not a full identity disambiguation profile.

99 papers
2 author rows

Possible papers (99)

TMLR Journal 2026 Journal Article

ActionEQA: Action Interface for Embodied Question Answering

  • Tianwei Bao
  • Qineng Wang
  • Kangrui Wang
  • Mingkai Deng
  • Guangyi Liu
  • Jiayuan Mao
  • Lawrence Birnbaum
  • Zhiting Hu

While Vision-Language Models (VLMs) are increasingly integral to embodied intelligence, a significant action understanding bottleneck persists in translating high-level semantic instructions into precise low-level physical actions. However, current benchmarks for embodied agents primarily focus on high-level perception and planning, failing to capture the depth and nature of this semantic-to-physical gap. To address this, we introduce ActionEQA, the first Embodied Question Answering (EQA) benchmark designed to methodically evaluate the ability of VLMs to bridge this critical yet underexplored semantic-physical divide. Grounded in real-world robotics data, ActionEQA thoroughly analyzes VLMs’ grasp of the action interface using a dual-tier design: (1) a Three-Tiered Action Hierarchy for pinpointing the depth at which VLMs' action reasoning collapses. (2) Bidirectional Reasoning Tasks for testing whether VLMs struggle more to predict action outcomes or infer the actions that led to them. Our key findings reveal: (1) The primary bottleneck in action understanding occurs at the mid-level, arising from the challenge of grounding compositional language in 3D physical geometry. (2) VLMs are more adept at inferring past actions than predicting their future outcomes. (3) Richer visual inputs require greater spatial reasoning from VLMs to map actions to physical geometry. (4) Within the action hierarchy, model failures shift from predominantly perceptual errors at the high level to flawed geometric and physical reasoning at the low level.

AAAI Conference 2026 Conference Paper

Vision-G1: Towards General Reasoning Vision-Language Models via Reinforcement Learning

  • Yuheng Zha
  • Kun Zhou
  • Yujia Wu
  • Yushu Wang
  • Jie Feng
  • Zhi Xu
  • Shibo Hao
  • Zhengzhong Liu

Recent vision-language models (VLMs) show strong reasoning capabilities through training with reinforcement learning from verifiable rewards (RLVR). Despite their impressive capabilities, current VLMs focus on a limited range of reasoning tasks, such as mathematical and logical reasoning, due to the lack of readily available verifiable reward data in broader domains. As a result, these models struggle to generalize their reasoning abilities to the wide variety of challenges encountered in real-world environments. To address this limitation, we collect and assemble a comprehensive RL-ready visual reasoning training dataset encompassing 46 datasets across 13 dimensions of 5 domains, covering a wide range of realistic scenarios such as infographic reasoning, mathematical reasoning, spatial reasoning, and general science reasoning. Based on this dataset, we propose an influence function-based data filtering strategy and a multi-round data curriculum method to iteratively strengthen general visual reasoning abilities. Using this approach, we train a general reasoning VLM, namely Vision-G1. Our 7B model achieves state-of-the-art performance across nine visual reasoning benchmarks, surpassing previous similar-sized VLMs and even GPT-4o and Gemini-1.5 Flash.

ICLR Conference 2025 Conference Paper

Causal Representation Learning from Multimodal Biomedical Observations

  • Yuewen Sun
  • Lingjing Kong
  • Guangyi Chen 0002
  • Loka Li
  • Gongxu Luo
  • Zijian Li 0001
  • Yixuan Zhang 0001
  • Yujia Zheng 0001

Prevalent in biomedical applications (e.g., human phenotype research), multimodal datasets can provide valuable insights into the underlying physiological mechanisms. However, current machine learning (ML) models designed to analyze these datasets often lack interpretability and identifiability guarantees, which are essential for biomedical research. Recent advances in causal representation learning have shown promise in identifying interpretable latent causal variables with formal theoretical guarantees. Unfortunately, most current work on multimodal distributions either relies on restrictive parametric assumptions or yields only coarse identification results, limiting their applicability to biomedical research that favors a detailed understanding of the mechanisms. In this work, we aim to develop flexible identification conditions for multimodal data and principled methods to facilitate the understanding of biomedical datasets. Theoretically, we consider a nonparametric latent distribution (c.f., parametric assumptions in previous work) that allows for causal relationships across potentially different modalities. We establish identifiability guarantees for each latent component, extending the subspace identification results from previous work. Our key theoretical contribution is the structural sparsity of causal connections between modalities, which, as we will discuss, is natural for a large collection of biomedical systems. Empirically, we present a practical framework to instantiate our theoretical insights. We demonstrate the effectiveness of our approach through extensive experiments on both numerical and synthetic datasets. Results on a real-world human phenotype dataset are consistent with established biomedical research, validating our theoretical and methodological framework.

ICML Conference 2025 Conference Paper

Data Mixing Optimization for Supervised Fine-Tuning of Large Language Models

  • Yuan Li 0032
  • Zhengzhong Liu 0001
  • Eric P. Xing

Optimizing data mixtures for supervised fine-tuning (SFT) of large language models (LLMs) is critical for developing general-purpose models, yet this area remains underexplored. In this paper, we frame data mixing as an optimization problem and introduce a novel method designed to minimize validation loss. Our approach parametrizes the loss by modeling effective data transferred and leveraging scaling laws for fine-tuning. By experimenting with various small-scale data mixtures, we fit these parameters and derive the optimal weights. We provide both mathematical proofs and empirical results demonstrating that our algorithm achieves excellent overall and individual performance across all domains. Through controlled experiments, we show that models trained with our optimized weights perform on par with those using optimal weights determined via grid search, with per-domain loss only 0.66% higher than the best domain loss from grid search on average. Additionally, we show that reweighting popular SFT datasets using our method improves both validation loss and downstream performance. Finally, we discuss how our method can generalize to guide data selection for domain-specific models and provide insights into SFT.
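
A rough illustration of the idea of fitting a loss model from small-scale mixture runs and then optimizing the weights. The power-law functional form, the synthetic "logged runs", and all constants below are hypothetical stand-ins, not the paper's parameterization.

```python
# Minimal sketch of data-mixing optimization via a fitted loss model.
# The power law in "effective tokens" is a hypothetical stand-in for the paper's scaling law.
import numpy as np
from scipy.optimize import minimize

rng = np.random.default_rng(0)
n_domains = 3

# Pretend we ran several small-scale SFT mixtures and logged per-domain validation loss.
mixtures = rng.dirichlet(np.ones(n_domains), size=12)          # mixture weights per run
true_params = np.array([[2.0, 0.5], [1.5, 0.4], [2.5, 0.6]])    # hidden "scaling law"
total_tokens = 1e6

def predicted_loss(weights, params):
    # loss_i = c_i * (w_i * total_tokens)^(-alpha_i): a simple power-law proxy
    c, alpha = params[:, 0], params[:, 1]
    return c * np.power(np.maximum(weights, 1e-8) * total_tokens, -alpha)

losses = np.array([predicted_loss(w, true_params) for w in mixtures])
losses += 0.01 * rng.standard_normal(losses.shape)              # observation noise

def fit_params(mixtures, losses):
    # Fit (c_i, alpha_i) per domain by least squares on the logged runs.
    def objective(flat):
        params = flat.reshape(n_domains, 2)
        pred = np.array([predicted_loss(w, params) for w in mixtures])
        return np.mean((pred - losses) ** 2)
    res = minimize(objective, np.ones(n_domains * 2), method="Nelder-Mead",
                   options={"maxiter": 20000, "xatol": 1e-8})
    return res.x.reshape(n_domains, 2)

fitted = fit_params(mixtures, losses)

def mixture_objective(logits):
    # Optimize weights on the simplex via a softmax parameterization;
    # minimize the average predicted per-domain validation loss.
    w = np.exp(logits) / np.exp(logits).sum()
    return predicted_loss(w, fitted).mean()

best = minimize(mixture_objective, np.zeros(n_domains), method="Nelder-Mead")
weights = np.exp(best.x) / np.exp(best.x).sum()
print("optimized mixture weights:", np.round(weights, 3))
```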

ICLR Conference 2025 Conference Paper

Follow My Instruction and Spill the Beans: Scalable Data Extraction from Retrieval-Augmented Generation Systems

  • Zhenting Qi
  • Hanlin Zhang 0002
  • Eric P. Xing
  • Sham M. Kakade
  • Himabindu Lakkaraju

Retrieval-Augmented Generation (RAG) improves pre-trained models by incorporating external knowledge at test time to enable customized adaptation. We study the risk of datastore leakage in Retrieval-In-Context RAG Language Models (LMs). We show that an adversary can exploit LMs' instruction-following capabilities to easily extract text data verbatim from the datastore of RAG systems built with instruction-tuned LMs via prompt injection. The vulnerability exists for a wide range of modern LMs that span Llama2, Mistral/Mixtral, Vicuna, SOLAR, WizardLM, Qwen1.5, and Platypus2, and the exploitability exacerbates as the model size scales up. We also study multiple effects of RAG setup on the extractability of data, indicating that following unexpected instructions to regurgitate data can be an outcome of failure in effectively utilizing contexts for modern LMs, and further show that such vulnerability can be greatly mitigated by position bias elimination strategies. Extending our study to production RAG models, GPTs, we design an attack that can cause datastore leakage with a near-perfect success rate on 25 randomly selected customized GPTs with at most 2 queries, and we extract text data verbatim at a rate of 41% from a book of 77,000 words and 3% from a corpus of 1,569,000 words by prompting the GPTs with only 100 queries generated by themselves.

ICML Conference 2025 Conference Paper

Learning Vision and Language Concepts for Controllable Image Generation

  • Shaoan Xie
  • Lingjing Kong
  • Yujia Zheng 0001
  • Zeyu Tang 0002
  • Eric P. Xing
  • Guangyi Chen 0002
  • Kun Zhang 0001

Concept learning seeks to extract semantic and interpretable representations of atomic concepts from high-dimensional data such as images and text, which can be instrumental to a variety of downstream tasks (e.g., image generation/editing). Despite its importance, the theoretical foundations for learning atomic concepts and their interactions, especially from multimodal distributions, remain underexplored. In this work, we establish fundamental conditions for learning atomic multimodal concepts and their underlying interactions with identifiability guarantees. We formulate concept learning as a latent variable identification problem, representing atomic concepts in each modality as latent variables, with a graphical model to specify their interactions across modalities. Our theoretical contribution is to provide component-wise identifiability of atomic concepts under flexible, nonparametric conditions that accommodate both continuous and discrete modalities. Building on these theoretical insights, we demonstrate the practical utility of our theory in a downstream task: text-to-image (T2I) generation. We develop a principled T2I model that explicitly learns atomic textual and visual concepts with sparse connections between them, allowing us to achieve image generation and editing at the atomic concept level. Empirical evaluations show that our model outperforms existing methods in T2I generation tasks, offering superior controllability and interpretability.

ICLR Conference 2025 Conference Paper

Scaling Long Context Training Data by Long-Distance Referrals

  • Yonghao Zhuang 0001
  • Lanxiang Hu
  • Longfei Yun
  • Souvik Kundu 0009
  • Zhengzhong Liu 0001
  • Eric P. Xing
  • Hao Zhang 0025

Training large language models for long context understanding faces the challenge of data shortage. Previous data engineering approaches mechanically concatenate short documents, which may create many pseudo long documents but raise concerns about data quality. In this paper, we study the core attribute of high quality data for long context training, and provide a data pipeline, LongPack, to scale such data. We found that long distance referrals, which occur in natural long documents, are crucial for long-context training. However, simply concatenating short documents does not reliably generate these relations. We further show that the density of long-distance referrals, which is higher in longer documents, has a key role in training efficiency, making previous upsampling methods suboptimal. To enrich long documents, we propose LongPack, a data pipeline that constructs long documents by packing shorter ones based on referral relationships. Specifically, for web pages, which are the primary source for language model training, we find hyperlinks to be a native signal for such a relation. By packing web pages through their hyperlink connections, we can create longer, high-quality documents. Our experiments demonstrate that LongPack is highly scalable, generating a corpus of long documents equivalent in size to an entire pretraining dataset using just 0.5% root documents. Furthermore, the constructed documents have a ‘near-natural’ quality as innate long documents for long context training, reaching a 32.7% higher score than previous state-of-the-art methods.
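
A simplified sketch of the packing idea, under the assumption that we already have a small page store with text and outgoing links; the real pipeline's deduplication, referral-density criteria, and length budgeting are omitted, and the page structures here are illustrative.

```python
# Simplified sketch of hyperlink-based document packing in the spirit of LongPack.
# The page/link structures and the token budget are illustrative, not the paper's pipeline.
from collections import deque

pages = {
    "a.html": {"text": "Root article " * 200, "links": ["b.html", "c.html"]},
    "b.html": {"text": "Linked page B " * 150, "links": ["d.html"]},
    "c.html": {"text": "Linked page C " * 100, "links": []},
    "d.html": {"text": "Linked page D " * 120, "links": []},
}

def pack_long_document(root, pages, max_tokens=2048):
    """Concatenate a root web page with pages it links to (breadth-first),
    stopping once the packed document reaches the token budget."""
    packed, seen, n_tokens = [], set(), 0
    queue = deque([root])
    while queue and n_tokens < max_tokens:
        url = queue.popleft()
        if url in seen or url not in pages:
            continue
        seen.add(url)
        text = pages[url]["text"]
        packed.append(text)
        n_tokens += len(text.split())        # crude whitespace "token" count
        queue.extend(pages[url]["links"])    # follow hyperlinks for long-distance referrals
    return "\n\n".join(packed)

doc = pack_long_document("a.html", pages)
print(len(doc.split()), "tokens in packed document")
```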

ICML Conference 2025 Conference Paper

Synthesizing Privacy-Preserving Text Data via Finetuning *without* Finetuning Billion-Scale LLMs

  • Bowen Tan
  • Zheng Xu
  • Eric P. Xing
  • Zhiting Hu
  • Shanshan Wu

Synthetic data offers a promising path to train models while preserving data privacy. Differentially private (DP) finetuning of large language models (LLMs) as data generators is effective, but is impractical when computation resources are limited. Meanwhile, prompt-based methods such as private evolution depend heavily on manual prompts, and ineffectively use private information in their iterative data selection process. To overcome these limitations, we propose CTCL (Data Synthesis with ConTrollability and CLustering), a novel framework for generating privacy-preserving synthetic data without extensive prompt engineering or billion-scale LLM finetuning. CTCL pretrains a lightweight 140M conditional generator and a clustering-based topic model on large-scale public data. To further adapt to the private domain, the generator is DP finetuned on private data for fine-grained textual information, while the topic model extracts a DP histogram representing distributional information. The DP generator then samples according to the DP histogram to synthesize a desired number of data examples. Evaluation across five diverse domains demonstrates the effectiveness of our framework, particularly in the strong privacy regime. Systematic ablation validates the design of each framework component and highlights the scalability of our approach.
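
A tiny sketch of the DP-histogram step only: privately estimate the topic distribution, then allocate how many synthetic examples to draw per topic. The Laplace mechanism with unit sensitivity is an illustrative choice here, not necessarily the mechanism or calibration used in the paper.

```python
# Sketch of a DP histogram over topic clusters used to drive conditional generation.
# Laplace noise with per-example sensitivity 1 is an illustrative stand-in.
import numpy as np

rng = np.random.default_rng(0)
private_topic_ids = rng.integers(0, 5, size=1000)   # topic id per private document
n_topics, epsilon = 5, 1.0

counts = np.bincount(private_topic_ids, minlength=n_topics).astype(float)
noisy = counts + rng.laplace(scale=1.0 / epsilon, size=n_topics)  # DP noise
noisy = np.clip(noisy, 0, None)
dp_histogram = noisy / noisy.sum()

n_synthetic = 2000
per_topic = np.round(dp_histogram * n_synthetic).astype(int)
print("synthetic examples to sample per topic:", per_topic)
# Each topic's examples would then be drawn from the (DP-finetuned) conditional
# generator, conditioned on that topic's description.
```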

ICML Conference 2025 Conference Paper

Understanding the Skill Gap in Recurrent Language Models: The Role of the Gather-and-Aggregate Mechanism

  • Aviv Bick
  • Eric P. Xing
  • Albert Gu

State-space models (SSMs) offer efficient alternatives to Transformers for long sequences, but their fixed-size recurrent state limits capability on algorithmic tasks, such as retrieving past context. In this work, we examine how in-context retrieval operates in Transformer- and SSM-based language models and find that both rely on a Gather-and-Aggregate (G&A) mechanism: a Gather Head extracts relevant information from context, which an Aggregate Head integrates into the representation. In both architectures, G&A concentrates in a few heads, forming bottlenecks even for simple retrieval. For example, disabling a single Gather or Aggregate Head in a pruned Llama-3.1-8B impairs retrieving the correct answer letter in MMLU, reducing its accuracy from 66% to 25%. Moreover, this retrieval bottleneck can obscure knowledge demands of tasks as the pruned model succeeds on MMLU with functioning G&A heads yet fails on other knowledge benchmarks. The bottleneck similarly extends to tasks where SSMs typically underperform, like GSM8K, BBH, and dialogue. We show that SSMs’ retrieval challenges manifest in these heads, creating smoother attention patterns instead of the sharp transitions effective G&A requires. Thus, the Transformer-SSM retrieval gap exists in just a few heads, rather than the entire language model. This suggests a unified explanation for the Transformer vs. SSM performance gap while showing how to merge their strengths. We find that pretrained hybrid models, where SSMs are combined with attention layers, delegate the role of Aggregate Heads to attention. Similarly, replacing a single G&A head in a pretrained SSM with an attention variant boosts retrieval and benchmark scores.
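
A toy illustration of the "disable one head and measure the drop" style of ablation described above. The explicit multi-head attention module and the head mask are illustrative; ablating a head in a real pretrained LM requires hooking that model's own attention modules.

```python
# Toy head-ablation example: a small explicit multi-head self-attention module with a
# head mask, so a single head can be zeroed out and the output compared against the full model.
import torch
import torch.nn as nn
import torch.nn.functional as F

class ToyMHA(nn.Module):
    def __init__(self, d_model=64, n_heads=4):
        super().__init__()
        self.n_heads, self.d_head = n_heads, d_model // n_heads
        self.qkv = nn.Linear(d_model, 3 * d_model)
        self.out = nn.Linear(d_model, d_model)

    def forward(self, x, head_mask=None):
        B, T, D = x.shape
        q, k, v = self.qkv(x).chunk(3, dim=-1)
        shape = (B, T, self.n_heads, self.d_head)
        q, k, v = (t.view(shape).transpose(1, 2) for t in (q, k, v))   # (B, heads, T, d_head)
        attn = F.softmax(q @ k.transpose(-2, -1) / self.d_head ** 0.5, dim=-1)
        heads = attn @ v
        if head_mask is not None:                  # zero out ablated heads
            heads = heads * head_mask.view(1, -1, 1, 1)
        return self.out(heads.transpose(1, 2).reshape(B, T, D))

mha = ToyMHA()
x = torch.randn(2, 10, 64)
mask = torch.ones(4)
mask[2] = 0.0                                      # "disable" head 2, as in a G&A ablation
out_full, out_ablated = mha(x), mha(x, head_mask=mask)
print((out_full - out_ablated).abs().mean())       # downstream: compare task accuracy instead
```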

ICML Conference 2024 Conference Paper

Contextualized Policy Recovery: Modeling and Interpreting Medical Decisions with Adaptive Imitation Learning

  • Jannik Deuschel
  • Caleb Ellington
  • Yingtao Luo
  • Benjamin J. Lengerich
  • Pascal Friederich
  • Eric P. Xing

Interpretable policy learning seeks to estimate intelligible decision policies from observed actions; however, existing models force a tradeoff between accuracy and interpretability, limiting data-driven interpretations of human decision-making processes. Fundamentally, existing approaches are burdened by this tradeoff because they represent the underlying decision process as a universal policy, when in fact human decisions are dynamic and can change drastically under different contexts. Thus, we develop Contextualized Policy Recovery (CPR), which re-frames the problem of modeling complex decision processes as a multi-task learning problem, where each context poses a unique task and complex decision policies can be constructed piece-wise from many simple context-specific policies. CPR models each context-specific policy as a linear map, and generates new policy models on-demand as contexts are updated with new observations. We provide two flavors of the CPR framework: one focusing on exact local interpretability, and one retaining full global interpretability. We assess CPR through studies on simulated and real data, achieving state-of-the-art performance on predicting antibiotic prescription in intensive care units (+22% AUROC vs. previous SOTA) and predicting MRI prescription for Alzheimer’s patients (+7.7% AUROC vs. previous SOTA). With this improvement, CPR closes the accuracy gap between interpretable and black-box methods, allowing high-resolution exploration and analysis of context-specific decision models.
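
A minimal sketch of a context-specific linear policy: a small encoder maps the context to the weights and bias of a per-sample logistic policy over the observed features. The architecture, sizes, and names (`ContextualizedLinearPolicy`, `hypernet`) are hypothetical, not the paper's implementation.

```python
# Minimal sketch of a contextualized linear policy: each context yields its own
# interpretable linear decision model over the current observation.
import torch
import torch.nn as nn

class ContextualizedLinearPolicy(nn.Module):
    def __init__(self, context_dim, obs_dim, hidden=32):
        super().__init__()
        self.hypernet = nn.Sequential(
            nn.Linear(context_dim, hidden), nn.ReLU(),
            nn.Linear(hidden, obs_dim + 1),       # weights + bias of the local policy
        )

    def forward(self, context, obs):
        params = self.hypernet(context)           # (batch, obs_dim + 1)
        w, b = params[:, :-1], params[:, -1]
        logits = (w * obs).sum(-1) + b            # per-sample linear policy
        return torch.sigmoid(logits), w           # action probability + interpretable weights

policy = ContextualizedLinearPolicy(context_dim=8, obs_dim=5)
context, obs = torch.randn(4, 8), torch.randn(4, 5)
prob, local_weights = policy(context, obs)
print(prob.shape, local_weights.shape)            # inspect per-context coefficients
```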

ICLR Conference 2024 Conference Paper

Fusing Models with Complementary Expertise

  • Hongyi Wang 0001
  • Felipe Maia Polo
  • Yuekai Sun
  • Souvik Kundu 0009
  • Eric P. Xing
  • Mikhail Yurochkin

Training AI models that generalize across tasks and domains has long been among the open problems driving AI research. The emergence of Foundation Models made it easier to obtain expert models for a given task, but the heterogeneity of data that may be encountered at test time often means that any single expert is insufficient. We consider the Fusion of Experts (FoE) problem of fusing outputs of expert models with complementary knowledge of the data distribution and formulate it as an instance of supervised learning. Our method is applicable to both discriminative and generative tasks and leads to significant performance improvements in image and text classification, text summarization, multiple-choice QA, and automatic evaluation of generated text. We also extend our method to the "frugal" setting where it is desired to reduce the number of expert model evaluations at test time. Our implementation is publicly available at https://github.com/hwang595/FoE-ICLR2024.
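
A minimal sketch of the supervised-fusion framing: a small fuser network is trained on the concatenated outputs of frozen experts. The random expert outputs and the fuser architecture are illustrative placeholders, not the paper's exact model.

```python
# Minimal sketch of Fusion of Experts as supervised learning: a small fuser network
# takes the concatenated (frozen) expert predictions and predicts the label.
import torch
import torch.nn as nn

n_experts, n_classes, batch = 3, 10, 32
expert_outputs = [torch.softmax(torch.randn(batch, n_classes), dim=-1)
                  for _ in range(n_experts)]                  # stand-ins for frozen experts
labels = torch.randint(0, n_classes, (batch,))

fuser = nn.Sequential(
    nn.Linear(n_experts * n_classes, 64), nn.ReLU(),
    nn.Linear(64, n_classes),
)
opt = torch.optim.Adam(fuser.parameters(), lr=1e-3)
loss_fn = nn.CrossEntropyLoss()

for step in range(100):                                       # standard supervised training
    logits = fuser(torch.cat(expert_outputs, dim=-1))
    loss = loss_fn(logits, labels)
    opt.zero_grad()
    loss.backward()
    opt.step()
print("final fuser loss:", loss.item())
```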

ICLR Conference 2024 Conference Paper

LMSYS-Chat-1M: A Large-Scale Real-World LLM Conversation Dataset

  • Lianmin Zheng
  • Wei-Lin Chiang
  • Ying Sheng 0007
  • Tianle Li
  • Siyuan Zhuang
  • Zhanghao Wu
  • Yonghao Zhuang 0001
  • Zhuohan Li 0001

Studying how people interact with large language models (LLMs) in real-world scenarios is increasingly important due to their widespread use in various applications. In this paper, we introduce LMSYS-Chat-1M, a large-scale dataset containing one million real-world conversations with 25 state-of-the-art LLMs. This dataset is collected from 210K unique IP addresses in the wild on our Vicuna demo and Chatbot Arena website. We offer an overview of the dataset's content, including its curation process, basic statistics, and topic distribution, highlighting its diversity, originality, and scale. We demonstrate its versatility through four use cases: developing content moderation models that perform similarly to GPT-4, building a safety benchmark, training instruction-following models that perform similarly to Vicuna, and creating challenging benchmark questions. We believe that this dataset will serve as a valuable resource for understanding and advancing LLM capabilities. The dataset is publicly available at https://huggingface.co/datasets/lmsys/lmsys-chat-1m.
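
A short loading sketch with the Hugging Face `datasets` library, since the dataset is hosted at the URL above. The dataset is gated, so accepting its license on the Hub and logging in (e.g., `huggingface-cli login`) is likely required first; the field names follow the dataset card as I recall and should be verified.

```python
# Sketch of loading LMSYS-Chat-1M with the `datasets` library (gated dataset: accept
# the license on the Hub and authenticate before downloading). Field names may change.
from datasets import load_dataset

ds = load_dataset("lmsys/lmsys-chat-1m", split="train")
print(ds)                                        # ~1M conversations
example = ds[0]
print(example.get("model"), example.get("language"))
for turn in example.get("conversation", []):     # list of {"role": ..., "content": ...}
    print(turn["role"], ":", turn["content"][:80])
```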

ICLR Conference 2024 Conference Paper

LQ-LoRA: Low-rank plus Quantized Matrix Decomposition for Efficient Language Model Finetuning

  • Han Guo
  • Philip Greengard
  • Eric P. Xing
  • Yoon Kim

We propose a simple approach for memory-efficient adaptation of pretrained language models. Our approach uses an iterative algorithm to decompose each pretrained matrix into a high-precision low-rank component and a memory-efficient quantized component. During finetuning, the quantized component remains fixed and only the low-rank component is updated. We present an integer linear programming formulation of the quantization component which enables dynamic configuration of quantization parameters (e.g., bit-width, block size) for each matrix given an overall target memory budget. We further explore a data-aware version of the algorithm which uses an approximation of the Fisher information matrix to weight the reconstruction objective during matrix decomposition. Experiments on finetuning RoBERTa and LLaMA-2 (7B and 70B) demonstrate that our low-rank plus quantized matrix decomposition approach (LQ-LoRA) outperforms strong QLoRA and GPTQ-LoRA baselines and enables aggressive quantization to sub-3 bits with only minor performance degradations. When finetuned on a language modeling calibration dataset, LQ-LoRA can also be used for model compression; in this setting our 2.75-bit LLaMA-2-70B model (which has 2.85 bits on average when including the low-rank components and requires 27GB of GPU memory) performs respectably compared to the 16-bit baseline.
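
A minimal sketch of the low-rank plus quantized split described above, alternating between quantizing the residual and taking a truncated SVD. Uniform 4-bit round-to-nearest quantization stands in for the paper's ILP-configured quantizer, and the sizes are illustrative.

```python
# Sketch of an iterative W ~ Q + L decomposition in the spirit of LQ-LoRA:
# (1) quantize the residual W - L, (2) take a rank-r SVD of W - Q, and repeat.
import torch

def quantize_rtn(x, bits=4):
    # simple uniform round-to-nearest quantizer (stand-in for the paper's quantizer)
    levels = 2 ** bits - 1
    lo, hi = x.min(), x.max()
    scale = (hi - lo) / levels
    return torch.round((x - lo) / scale) * scale + lo

def lq_decompose(W, rank=16, bits=4, iters=10):
    L = torch.zeros_like(W)
    for _ in range(iters):
        Q = quantize_rtn(W - L, bits)                           # memory-efficient component
        U, S, Vh = torch.linalg.svd(W - Q, full_matrices=False)
        L = U[:, :rank] @ torch.diag(S[:rank]) @ Vh[:rank]      # high-precision low-rank part
    return Q, L

W = torch.randn(256, 256)
Q, L = lq_decompose(W)
print("relative reconstruction error:", ((W - (Q + L)).norm() / W.norm()).item())
# During finetuning, Q stays frozen and only the low-rank factors would be updated.
```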

ICML Conference 2024 Conference Paper

Position: TrustLLM: Trustworthiness in Large Language Models

  • Yue Huang 0001
  • Lichao Sun 0001
  • Haoran Wang 0005
  • Siyuan Wu 0001
  • Qihui Zhang
  • Yuan Li
  • Chujie Gao
  • Yixin Huang

Large language models (LLMs) have gained considerable attention for their excellent natural language processing capabilities. Nonetheless, these LLMs present many challenges, particularly in the realm of trustworthiness. This paper introduces TrustLLM, a comprehensive study of trustworthiness in LLMs, including principles for different dimensions of trustworthiness, established benchmark, evaluation, and analysis of trustworthiness for mainstream LLMs, and discussion of open challenges and future directions. Specifically, we first propose a set of principles for trustworthy LLMs that span eight different dimensions. Based on these principles, we further establish a benchmark across six dimensions including truthfulness, safety, fairness, robustness, privacy, and machine ethics. We then present a study evaluating 16 mainstream LLMs in TrustLLM, consisting of over 30 datasets. Our findings firstly show that in general trustworthiness and capability (i.e., functional effectiveness) are positively related. Secondly, our observations reveal that proprietary LLMs generally outperform most open-source counterparts in terms of trustworthiness, raising concerns about the potential risks of widely accessible open-source LLMs. However, a few open-source LLMs come very close to proprietary ones, suggesting that open-source models can achieve high levels of trustworthiness without additional mechanisms like a moderator, offering valuable insights for developers in this field. Thirdly, it is important to note that some LLMs may be overly calibrated towards exhibiting trustworthiness, to the extent that they compromise their utility by mistakenly treating benign prompts as harmful and consequently not responding. Besides these observations, we’ve uncovered key insights into the multifaceted trustworthiness in LLMs. We emphasize the importance of ensuring transparency not only in the models themselves but also in the technologies that underpin trustworthiness. We advocate that the establishment of an AI alliance among industry, academia, and the open-source community to foster collaboration is imperative to advance the trustworthiness of LLMs.

ICLR Conference 2024 Conference Paper

PromptAgent: Strategic Planning with Language Models Enables Expert-level Prompt Optimization

  • Xinyuan Wang 0010
  • Chenxi Li
  • Zhen Wang 0041
  • Fan Bai 0006
  • Haotian Luo
  • Jiayou Zhang
  • Nebojsa Jojic
  • Eric P. Xing

Expert-level prompts, carefully engineered by human experts who have a deep understanding of both large language models (LLMs) and domain knowledge, are the future of prompting and pivotal to harnessing the full power of advanced LLMs. Discovering such prompts with an automated process remains a sought-after and unresolved challenge. Existing prompt optimization techniques, though automated through iterative sampling, often fall short in injecting domain knowledge and exploring the vast prompt space for complex expert-level prompts efficiently. To address this pressing need and achieve expert-level prompting, we introduce PromptAgent, which autonomously discovers prompts equivalent in quality to those handcrafted by experts. At its core, PromptAgent views prompt optimization as a strategic planning problem and employs a principled planning algorithm (rooted in Monte Carlo Tree Search) to strategically explore the vast expert-level prompt space. PromptAgent interacts with the LLM in a human-like trial-and-error manner during the planning, and injects expert-level knowledge by reflecting on model errors and generating insightful error feedback. This novel formulation allows it to iteratively evaluate intermediate prompts, refine them based on errors, simulate future rewards, and search for high-reward paths leading to expert-level prompts. We apply PromptAgent to 12 tasks spanning three practical domains: BIG-Bench Hard (BBH), domain-expert, and general NLU tasks, showing PromptAgent consistently outperforms strong prompting and prompt optimization baselines by great margins. Our qualitative analysis further emphasizes PromptAgent's capability to distill insightful errors into expert-level prompts.

NeurIPS Conference 2024 Conference Paper

Transformers to SSMs: Distilling Quadratic Knowledge to Subquadratic Models

  • Aviv Bick
  • Kevin Y. Li
  • Eric P. Xing
  • J. Z. Kolter
  • Albert Gu

Transformer architectures have become a dominant paradigm for domains like language modeling but suffer in many inference settings due to their quadratic-time self-attention. Recently proposed subquadratic architectures, such as Mamba, have shown promise, but have been pretrained with substantially less computational resources than the strongest Transformer models. In this work, we present a method that is able to distill a pretrained Transformer architecture into alternative architectures such as state space models (SSMs). The key idea of our approach is that we can view both Transformers and SSMs as applying different forms of mixing matrices over the token sequences. We can thus progressively distill the Transformer architecture by matching different degrees of granularity in the SSM: first matching the mixing matrices themselves, then the hidden units at each block, and finally the end-to-end predictions. Our method, called MOHAWK, is able to distill a Mamba-2 variant based on the Phi-1.5 architecture (Phi-Mamba) using only 3B tokens. Despite using less than 1% of the training data typically used to train models from scratch, Phi-Mamba boasts substantially stronger performance compared to all past open-source non-Transformer models. MOHAWK allows models like SSMs to leverage computational resources invested in training Transformer-based architectures, highlighting a new avenue for building such models.
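
A schematic sketch of the three stage-wise matching losses on placeholder tensors. Extracting a teacher's attention matrices and materializing a student SSM's token-mixing matrix from real models is architecture-specific and omitted here; all shapes are illustrative.

```python
# Schematic sketch of MOHAWK-style stage-wise distillation losses on placeholder tensors.
import torch
import torch.nn.functional as F

B, H, T, D, V = 2, 4, 16, 64, 1000

# Stage 1: match token-mixing matrices (teacher attention vs. student SSM mixing).
teacher_mix = torch.softmax(torch.randn(B, H, T, T), dim=-1)
student_mix = torch.softmax(torch.randn(B, H, T, T), dim=-1)
loss_matrix = F.mse_loss(student_mix, teacher_mix)

# Stage 2: match hidden states at each block.
teacher_hidden = torch.randn(B, T, D)
student_hidden = torch.randn(B, T, D)
loss_hidden = F.mse_loss(student_hidden, teacher_hidden)

# Stage 3: match end-to-end predictions (KL between next-token distributions).
teacher_logits = torch.randn(B, T, V)
student_logits = torch.randn(B, T, V)
loss_logits = F.kl_div(F.log_softmax(student_logits, dim=-1),
                       F.softmax(teacher_logits, dim=-1), reduction="batchmean")

print(loss_matrix.item(), loss_hidden.item(), loss_logits.item())
```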

ICML Conference 2024 Conference Paper

Unified Generation, Reconstruction, and Representation: Generalized Diffusion with Adaptive Latent Encoding-Decoding

  • Guangyi Liu 0005
  • Yu Wang 0170
  • Zeyu Feng
  • Qiyu Wu 0001
  • Liping Tang
  • Yuan Gao
  • Zhen Li 0026
  • Shuguang Cui

The vast applications of deep generative models are anchored in three core capabilities—generating new instances, reconstructing inputs, and learning compact representations—across various data types, such as discrete text/protein sequences and continuous images. Existing model families, like variational autoencoders (VAEs), generative adversarial networks (GANs), autoregressive models, and (latent) diffusion models, generally excel in specific capabilities and data types but fall short in others. We introduce Generalized Encoding-Decoding Diffusion Probabilistic Models (EDDPMs) which integrate the core capabilities for broad applicability and enhanced performance. EDDPMs generalize the Gaussian noising-denoising in standard diffusion by introducing parameterized encoding-decoding. Crucially, EDDPMs are compatible with the well-established diffusion model objective and training recipes, allowing effective learning of the encoder-decoder parameters jointly with diffusion. By choosing appropriate encoder/decoder (e.g., large language models), EDDPMs naturally apply to different data types. Extensive experiments on text, proteins, and images demonstrate the flexibility to handle diverse data and tasks and the strong improvement over various existing models. Code is available at https://github.com/guangyliu/EDDPM.

NeurIPS Conference 2024 Conference Paper

Web2Code: A Large-scale Webpage-to-Code Dataset and Evaluation Framework for Multimodal LLMs

  • Sukmin Yun
  • Haokun Lin
  • Rusiru Thushara
  • Mohammad Q. Bhat
  • Yongxin Wang
  • Zutao Jiang
  • Mingkai Deng
  • Jinhong Wang

Multimodal large language models (MLLMs) have shown impressive success across modalities such as image, video, and audio in a variety of understanding and generation tasks. However, current MLLMs are surprisingly poor at understanding webpage screenshots and generating their corresponding HTML code. To address this problem, we propose Web2Code, a benchmark consisting of a new large-scale webpage-to-code dataset for instruction tuning and an evaluation framework for the webpage understanding and HTML code translation abilities of MLLMs. For dataset construction, we leverage pretrained LLMs to enhance existing webpage-to-code datasets as well as generate a diverse pool of new webpages rendered into images. Specifically, the inputs are webpage images and instructions, while the responses are the webpage's HTML code. We further include diverse natural language QA pairs about the webpage content in the responses to enable a more comprehensive understanding of the web content. To evaluate model performance in these tasks, we develop an evaluation framework for testing MLLMs' abilities in webpage understanding and web-to-code generation. Extensive experiments show that our proposed dataset is beneficial not only to our proposed tasks but also in the general visual domain. We hope our work will contribute to the development of general MLLMs suitable for web-based content generation and task automation. Our data and code are available at https://github.com/MBZUAI-LLM/web2code.

ICLR Conference 2023 Conference Paper

Betty: An Automatic Differentiation Library for Multilevel Optimization

  • Sang Keun Choe
  • Willie Neiswanger
  • Pengtao Xie
  • Eric P. Xing

Gradient-based multilevel optimization (MLO) has gained attention as a framework for studying numerous problems, ranging from hyperparameter optimization and meta-learning to neural architecture search and reinforcement learning. However, gradients in MLO, which are obtained by composing best-response Jacobians via the chain rule, are notoriously difficult to implement and memory/compute intensive. We take an initial step towards closing this gap by introducing Betty, a software library for large-scale MLO. At its core, we devise a novel dataflow graph for MLO, which allows us to (1) develop efficient automatic differentiation for MLO that reduces the computational complexity from $\mathcal{O}(d^3)$ to $\mathcal{O}(d^2)$, (2) incorporate systems support such as mixed-precision and data-parallel training for scalability, and (3) facilitate implementation of MLO programs of arbitrary complexity while allowing a modular interface for diverse algorithmic and systems design choices. We empirically demonstrate that Betty can be used to implement an array of MLO programs, while also observing up to 11% increase in test accuracy, 14% decrease in GPU memory usage, and 20% decrease in training wall time over existing implementations on multiple benchmarks. We also showcase that Betty enables scaling MLO to models with hundreds of millions of parameters. We open-source the code at https://github.com/leopard-ai/betty.
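
As a point of reference for what a two-level MLO program computes (this is plain PyTorch, not Betty's API): the outer level tunes a regularization strength by differentiating through one unrolled inner gradient step, i.e., composing a best-response Jacobian via the chain rule. All data and constants are synthetic.

```python
# Bare-bones bilevel example in plain PyTorch (NOT Betty's API): the outer level learns a
# log-regularization weight by backpropagating through one unrolled inner step.
import torch

torch.manual_seed(0)
X_train, y_train = torch.randn(64, 5), torch.randn(64)
X_val, y_val = torch.randn(32, 5), torch.randn(32)

w = torch.zeros(5, requires_grad=True)            # inner variable (model weights)
log_lam = torch.zeros((), requires_grad=True)     # outer variable (hyperparameter)
outer_opt = torch.optim.Adam([log_lam], lr=0.05)
inner_lr = 0.1

for step in range(200):
    lam = log_lam.exp()
    train_loss = ((X_train @ w - y_train) ** 2).mean() + lam * (w ** 2).sum()
    (g,) = torch.autograd.grad(train_loss, w, create_graph=True)
    w_new = w - inner_lr * g                      # one unrolled inner step
    val_loss = ((X_val @ w_new - y_val) ** 2).mean()
    outer_opt.zero_grad()
    val_loss.backward()                           # best-response Jacobian via the chain rule
    outer_opt.step()
    with torch.no_grad():
        w.copy_(w_new.detach())
print("learned regularization strength:", log_lam.exp().item())
```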

ICLR Conference 2023 Conference Paper

Federated Learning as Variational Inference: A Scalable Expectation Propagation Approach

  • Han Guo
  • Philip Greengard
  • Hongyi Wang 0001
  • Andrew Gelman
  • Yoon Kim
  • Eric P. Xing

The canonical formulation of federated learning treats it as a distributed optimization problem where the model parameters are optimized against a global loss function that decomposes across client loss functions. A recent alternative formulation instead treats federated learning as a distributed inference problem, where the goal is to infer a global posterior from partitioned client data (Al-Shedivat et al., 2021). This paper extends the inference view and describes a variational inference formulation of federated learning where the goal is to find a global variational posterior that well-approximates the true posterior. This naturally motivates an expectation propagation approach to federated learning (FedEP), where approximations to the global posterior are iteratively refined through probabilistic message-passing between the central server and the clients. We conduct an extensive empirical study across various algorithmic considerations and describe practical strategies for scaling up expectation propagation to the modern federated setting. We apply FedEP on standard federated learning benchmarks and find that it outperforms strong baselines in terms of both convergence speed and accuracy.

ICLR Conference 2023 Conference Paper

MPCFORMER: Fast, Performant and Private Transformer Inference with MPC

  • Dacheng Li
  • Hongyi Wang 0001
  • Rulin Shao
  • Han Guo
  • Eric P. Xing
  • Hao Zhang 0025

Enabling private inference is crucial for many cloud inference services that are based on Transformer models. However, existing private inference solutions can increase the inference latency by more than 60$\times$ or significantly compromise the inference quality. In this paper, we design the framework MPCFORMER as a practical solution, using Secure Multi-Party Computation (MPC) and Knowledge Distillation (KD). Through extensive evaluations, we show that MPCFORMER significantly speeds up Transformer inference in MPC settings while achieving similar ML performance to the input model. On the IMDb dataset, it achieves similar performance to $\text{BERT}_\text{BASE}$, while being 5.3$\times$ faster. On the GLUE benchmark, it achieves 97% performance of $\text{BERT}_\text{BASE}$ with a 2.2$\times$ speedup. MPCFORMER remains effective with different trained Transformer weights such as $\text{ROBERTA}_\text{BASE}$ and larger models including $\text{BERT}_\text{LARGE}$. Code is available at https://github.com/MccRee177/MPCFormer.

ICML Conference 2022 Conference Paper

SDQ: Stochastic Differentiable Quantization with Mixed Precision

  • Xijie Huang
  • Zhiqiang Shen
  • Shichao Li 0002
  • Zechun Liu
  • Xianghong Hu 0001
  • Jeffry Wicaksana
  • Eric P. Xing
  • Kwang-Ting (Tim) Cheng

In order to deploy deep models in a computationally efficient manner, model quantization approaches have been frequently used. In addition, as new hardware supports various-bit arithmetic operations, recent research on mixed precision quantization (MPQ) has begun to fully leverage the capacity of representation by searching various bitwidths for different layers and modules in a network. However, previous studies mainly search the MPQ strategy in a costly scheme using reinforcement learning, neural architecture search, etc., or simply utilize partial prior knowledge for bitwidth distribution, which might be biased and sub-optimal. In this work, we present a novel Stochastic Differentiable Quantization (SDQ) method that can automatically learn the MPQ strategy in a more flexible and globally-optimized space with a smoother gradient approximation. Particularly, Differentiable Bitwidth Parameters (DBPs) are employed as the probability factors in stochastic quantization between adjacent bitwidths. After the optimal MPQ strategy is acquired, we further train our network with the entropy-aware bin regularization and knowledge distillation. We extensively evaluate our method on different networks, hardware (GPUs and FPGAs), and datasets. SDQ outperforms all other state-of-the-art mixed or single precision quantization methods with lower bitwidth, and is even better than the original full-precision counterparts across various ResNet and MobileNet families, demonstrating the effectiveness and superiority of our method. Code will be publicly available.

UAI Conference 2022 Conference Paper

Toward learning human-aligned cross-domain robust models by countering misaligned features

  • Haohan Wang
  • Zeyi Huang
  • Hanlin Zhang 0002
  • Yong Jae Lee
  • Eric P. Xing

Machine learning has demonstrated remarkable prediction accuracy over i.i.d. data, but the accuracy often drops when tested with data from another distribution. In this paper, we aim to offer another view of this problem, assuming the reason behind this accuracy drop is the models' reliance on features that are not aligned well with what a data annotator would consider similar across these two datasets. We refer to these features as misaligned features. We extend the conventional generalization error bound to a new one for this setup with the knowledge of how the misaligned features are associated with the label. Our analysis offers a set of techniques for this problem, and these techniques are naturally linked to many previous methods in robust machine learning literature. We also compare the empirical strength of these methods and demonstrate the performance when these previous techniques are combined, with an implementation available.

ICLR Conference 2021 Conference Paper

Federated Learning via Posterior Averaging: A New Perspective and Practical Algorithms

  • Maruan Al-Shedivat
  • Jennifer Gillenwater
  • Eric P. Xing
  • Afshin Rostamizadeh

Federated learning is typically approached as an optimization problem, where the goal is to minimize a global loss function by distributing computation across client devices that possess local data and specify different parts of the global objective. We present an alternative perspective and formulate federated learning as a posterior inference problem, where the goal is to infer a global posterior distribution by having client devices each infer the posterior of their local data. While exact inference is often intractable, this perspective provides a principled way to search for global optima in federated settings. Further, starting with the analysis of federated quadratic objectives, we develop a computation- and communication-efficient approximate posterior inference algorithm—federated posterior averaging (FedPA). Our algorithm uses MCMC for approximate inference of local posteriors on the clients and efficiently communicates their statistics to the server, where the latter uses them to refine a global estimate of the posterior mode. Finally, we show that FedPA generalizes federated averaging (FedAvg), can similarly benefit from adaptive optimizers, and yields state-of-the-art results on four realistic and challenging benchmarks, converging faster, to better optima.
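
A small sketch of the identity the posterior-inference view rests on: if each client's local posterior is approximated as Gaussian (estimated via MCMC in the paper), the global posterior mode is a precision-weighted combination of client means. The client moments below are synthetic, and the paper's efficient moment estimators and communication strategies are omitted.

```python
# Sketch of posterior averaging for Gaussian local posteriors, contrasted with plain FedAvg.
import numpy as np

rng = np.random.default_rng(0)
d, n_clients = 4, 5

client_means = [rng.normal(size=d) for _ in range(n_clients)]
client_covs = []
for _ in range(n_clients):
    A = rng.normal(size=(d, d))
    client_covs.append(A @ A.T + np.eye(d))        # SPD local posterior covariances

precisions = [np.linalg.inv(S) for S in client_covs]
global_precision = sum(precisions)
global_mean = np.linalg.solve(global_precision,
                              sum(P @ m for P, m in zip(precisions, client_means)))

fedavg_mean = np.mean(client_means, axis=0)         # what unweighted averaging would return
print("posterior-averaging global mode:", np.round(global_mean, 3))
print("plain average of client means:  ", np.round(fedavg_mean, 3))
```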

ICLR Conference 2021 Conference Paper

Interactive Weak Supervision: Learning Useful Heuristics for Data Labeling

  • Benedikt Boecking
  • Willie Neiswanger
  • Eric P. Xing
  • Artur Dubrawski

Obtaining large annotated datasets is critical for training successful machine learning models and it is often a bottleneck in practice. Weak supervision offers a promising alternative for producing labeled datasets without ground truth annotations by generating probabilistic labels using multiple noisy heuristics. This process can scale to large datasets and has demonstrated state of the art performance in diverse domains such as healthcare and e-commerce. One practical issue with learning from user-generated heuristics is that their creation requires creativity, foresight, and domain expertise from those who hand-craft them, a process which can be tedious and subjective. We develop the first framework for interactive weak supervision in which a method proposes heuristics and learns from user feedback given on each proposed heuristic. Our experiments demonstrate that only a small number of feedback iterations are needed to train models that achieve highly competitive test set performance without access to ground truth training labels. We conduct user studies, which show that users are able to effectively provide feedback on heuristics and that test set results track the performance of simulated oracles.

ECAI Conference 2020 Conference Paper

Adversarial Domain Adaptation Being Aware of Class Relationships

  • Zeya Wang
  • Baoyu Jing
  • Yang Ni
  • Nanqing Dong
  • Pengtao Xie
  • Eric P. Xing

Adversarial training is a useful approach to promote the learning of transferable representations across the source and target domains, which has been widely applied for domain adaptation (DA) tasks based on deep neural networks. Until very recently, existing adversarial domain adaptation (ADA) methods have ignored the useful information from the label space, which is an important factor accountable for the complicated data distributions associated with different semantic classes. Especially, the inter-class semantic relationships have been rarely considered and discussed in current work on transfer learning. In this paper, we propose a novel relationship-aware adversarial domain adaptation (RADA) algorithm, which first utilizes a single multi-class domain discriminator to enforce the learning of inter-class dependency structure during domain-adversarial training and then aligns this structure with the inter-class dependencies that are characterized from training the label predictor on source domain. Specifically, we impose a regularization term to penalize the structure discrepancy between the inter-class dependencies respectively estimated from domain discriminator and label predictor. Through this alignment, our proposed method makes the adversarial domain adaptation aware of the class relationships. Empirical studies show that the incorporation of class relationships significantly improves the performance on benchmark datasets.

JMLR Journal 2020 Journal Article

Tuning Hyperparameters without Grad Students: Scalable and Robust Bayesian Optimisation with Dragonfly

  • Kirthevasan Kandasamy
  • Karun Raju Vysyaraju
  • Willie Neiswanger
  • Biswajit Paria
  • Christopher R. Collins
  • Jeff Schneider
  • Barnabas Poczos
  • Eric P. Xing

Bayesian Optimisation (BO) refers to a suite of techniques for global optimisation of expensive black box functions, which use introspective Bayesian models of the function to efficiently search for the optimum. While BO has been applied successfully in many applications, modern optimisation tasks usher in new challenges where conventional methods fail spectacularly. In this work, we present Dragonfly, an open source Python library for scalable and robust BO. Dragonfly incorporates multiple recently developed methods that allow BO to be applied in challenging real world settings; these include better methods for handling higher dimensional domains, methods for handling multi-fidelity evaluations when cheap approximations of an expensive function are available, methods for optimising over structured combinatorial spaces, such as the space of neural network architectures, and methods for handling parallel evaluations. Additionally, we develop new methodological improvements in BO for selecting the Bayesian model, selecting the acquisition function, and optimising over complex domains with different variable types and additional constraints. We compare Dragonfly to a suite of other packages and algorithms for global optimisation and demonstrate that when the above methods are integrated, they enable significant improvements in the performance of BO. The Dragonfly library is available at dragonfly.github.io.
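
A usage sketch in the style of the library's quick-start examples (installed as `dragonfly-opt`, as I recall); the exact signature and return values of `minimise_function` should be checked against the current documentation.

```python
# Quick-start style sketch for Dragonfly (pip install dragonfly-opt); verify against the docs.
from dragonfly import minimise_function

def objective(x):
    # simple 1-D black-box function; Dragonfly treats it as expensive to evaluate
    return (x[0] - 0.3) ** 2 + 0.1 * abs(x[0])

domain = [[-5.0, 5.0]]          # one continuous variable in [-5, 5]
max_capital = 20                # evaluation budget

min_val, min_pt, history = minimise_function(objective, domain, max_capital)
print("best value:", min_val, "at", min_pt)
```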

ICML Conference 2019 Conference Paper

Fault Tolerance in Iterative-Convergent Machine Learning

  • Aurick Qiao
  • Bryon Aragam
  • Bingjing Zhang
  • Eric P. Xing

Machine learning (ML) training algorithms often possess an inherent self-correcting behavior due to their iterative-convergent nature. Recent systems exploit this property to achieve adaptability and efficiency in unreliable computing environments by relaxing the consistency of execution and allowing calculation errors to be self-corrected during training. However, the behavior of such systems is only well understood for specific types of calculation errors, such as those caused by staleness, reduced precision, or asynchronicity, and for specific algorithms, such as stochastic gradient descent. In this paper, we develop a general framework to quantify the effects of calculation errors on iterative-convergent algorithms. We then use this framework to derive a worst-case upper bound on the cost of arbitrary perturbations to model parameters during training and to design new strategies for checkpoint-based fault tolerance. Our system, SCAR, can reduce the cost of partial failures by 78%–95% when compared with traditional checkpoint-based fault tolerance across a variety of ML models and training algorithms, providing near-optimal performance in recovering from failures.

AAAI Conference 2019 Conference Paper

Knowledge-Driven Encode, Retrieve, Paraphrase for Medical Image Report Generation

  • Christy Y. Li
  • Xiaodan Liang
  • Zhiting Hu
  • Eric P. Xing

Generating long and semantic-coherent reports to describe medical images poses great challenges towards bridging visual and linguistic modalities, incorporating medical domain knowledge, and generating realistic and accurate descriptions. We propose a novel Knowledge-driven Encode, Retrieve, Paraphrase (KERP) approach which reconciles traditional knowledge- and retrieval-based methods with modern learning-based methods for accurate and robust medical report generation. Specifically, KERP decomposes medical report generation into explicit medical abnormality graph learning and subsequent natural language modeling. KERP first employs an Encode module that transforms visual features into a structured abnormality graph by incorporating prior medical knowledge; then a Retrieve module that retrieves text templates based on the detected abnormalities; and lastly, a Paraphrase module that rewrites the templates according to specific cases. The core of KERP is a proposed generic implementation unit—Graph Transformer (GTR) that dynamically transforms high-level semantics between graph-structured data of multiple domains such as knowledge graphs, images and sequences. Experiments show that the proposed approach generates structured and robust reports supported with accurate abnormality description and explainable attentive regions, achieving the state-of-the-art results on two medical report benchmarks, with the best medical abnormality and disease classification accuracy and improved human evaluation performance.

ICLR Conference 2019 Conference Paper

Learning Robust Representations by Projecting Superficial Statistics Out

  • Haohan Wang
  • Zexue He
  • Zachary C. Lipton
  • Eric P. Xing

Despite impressive performance as evaluated on i.i.d. holdout data, deep neural networks depend heavily on superficial statistics of the training data and are liable to break under distribution shift. For example, subtle changes to the background or texture of an image can break a seemingly powerful classifier. Building on previous work on domain generalization, we hope to produce a classifier that will generalize to previously unseen domains, even when domain identifiers are not available during training. This setting is challenging because the model may extract many distribution-specific (superficial) signals together with distribution-agnostic (semantic) signals. To overcome this challenge, we incorporate the gray-level co-occurrence matrix (GLCM) to extract patterns that our prior knowledge suggests are superficial: they are sensitive to the texture but unable to capture the gestalt of an image. Then we introduce two techniques for improving our networks' out-of-sample performance. The first method is built on the reverse gradient method that pushes our model to learn representations from which the GLCM representation is not predictable. The second method is built on the independence introduced by projecting the model's representation onto the subspace orthogonal to that of the GLCM representation. We test our method on a battery of standard domain generalization datasets and, interestingly, achieve comparable or better performance as compared to other domain generalization methods that explicitly require samples from the target distribution for training.
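
A minimal linear-algebra sketch of the second technique, projecting a learned representation onto the orthogonal complement of a superficial (GLCM-style) representation. Both feature matrices are random placeholders; computing actual GLCM features and the paper's exact construction are not reproduced here.

```python
# Sketch of removing the component of a representation that is linearly predictable
# from a "superficial" representation: project h onto the orthogonal complement of g.
import torch

B, d_h, d_g = 128, 64, 16
h = torch.randn(B, d_h)                 # model representation (batch)
g = torch.randn(B, d_g)                 # superficial / GLCM-style representation (batch)

# Least-squares fit of h from g, then subtract the fitted part:
coef = torch.linalg.lstsq(g, h).solution        # (d_g, d_h)
h_perp = h - g @ coef                           # in-sample orthogonal to g's column space

print("residual correlation:", (g.T @ h_perp).abs().max().item())   # ~0 up to numerics
```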

ICML Conference 2019 Conference Paper

Theoretically Principled Trade-off between Robustness and Accuracy

  • Hongyang Zhang 0001
  • Yaodong Yu
  • Jiantao Jiao
  • Eric P. Xing
  • Laurent El Ghaoui
  • Michael I. Jordan

We identify a trade-off between robustness and accuracy that serves as a guiding principle in the design of defenses against adversarial examples. Although this problem has been widely studied empirically, much remains unknown concerning the theory underlying this trade-off. In this work, we decompose the prediction error for adversarial examples (robust error) as the sum of the natural (classification) error and boundary error, and provide a differentiable upper bound using the theory of classification-calibrated loss, which is shown to be the tightest possible upper bound uniform over all probability distributions and measurable predictors. Inspired by our theoretical analysis, we also design a new defense method, TRADES, to trade adversarial robustness off against accuracy. Our proposed algorithm performs well experimentally in real-world datasets. The methodology is the foundation of our entry to the NeurIPS 2018 Adversarial Vision Challenge in which we won the 1st place out of 2,000 submissions, surpassing the runner-up approach by 11.41% in terms of mean L_2 perturbation distance.
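
A condensed sketch of the resulting training objective: natural cross-entropy plus a boundary term that pulls predictions on adversarially perturbed inputs toward the natural prediction. The PGD steps, epsilon, step size, and beta below are illustrative defaults (and the [0, 1] clamp assumes image inputs), not the paper's tuned settings.

```python
# Condensed TRADES-style objective: CE(f(x), y) + beta * KL(f(x_adv) || f(x)),
# where x_adv maximizes the KL term within an L_inf ball around x.
import torch
import torch.nn.functional as F

def trades_loss(model, x, y, eps=0.03, step_size=0.007, n_steps=10, beta=6.0):
    model.eval()
    p_natural = F.softmax(model(x), dim=-1).detach()
    x_adv = x + 0.001 * torch.randn_like(x)              # small random start
    for _ in range(n_steps):                              # inner maximization of the KL term
        x_adv = x_adv.detach().requires_grad_(True)
        kl = F.kl_div(F.log_softmax(model(x_adv), dim=-1), p_natural,
                      reduction="batchmean")
        grad = torch.autograd.grad(kl, x_adv)[0]
        x_adv = x_adv.detach() + step_size * grad.sign()
        x_adv = torch.min(torch.max(x_adv, x - eps), x + eps).clamp(0, 1)
    model.train()
    logits = model(x)
    robust_kl = F.kl_div(F.log_softmax(model(x_adv), dim=-1),
                         F.softmax(logits, dim=-1), reduction="batchmean")
    return F.cross_entropy(logits, y) + beta * robust_kl  # natural error + boundary term
```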

AAAI Conference 2019 Conference Paper

What if We Simply Swap the Two Text Fragments? A Straightforward yet Effective Way to Test the Robustness of Methods to Confounding Signals in Nature Language Inference Tasks

  • Haohan Wang
  • Da Sun
  • Eric P. Xing

Natural language inference (NLI) is a predictive task of determining the inference relationship of a pair of natural language sentences. With the increasing popularity of NLI, many state-of-the-art predictive models have been proposed with impressive performances. However, several works have noticed the statistical irregularities in the collected NLI data set that may result in an over-estimated performance of these models and proposed remedies. In this paper, we further investigate the statistical irregularities, what we refer to as confounding factors, of the NLI data sets. With the belief that some NLI labels should be preserved under swapping operations, we propose a simple yet effective way (swapping the two text fragments) of evaluating the NLI predictive models that naturally mitigates the observed problems. Further, we continue to train the predictive models with our swapping manner and propose to use the deviation of the model’s evaluation performances under different percentages of training text fragments to be swapped to describe the robustness of a predictive model. Our evaluation metric leads to some interesting understandings of recently published NLI methods. Finally, we also apply the swapping operation on NLI models to see the effectiveness of this straightforward method in mitigating the confounding factor problems in training generic sentence embeddings for other NLP transfer tasks.

ICML Conference 2018 Conference Paper

DiCE: The Infinitely Differentiable Monte Carlo Estimator

  • Jakob N. Foerster
  • Gregory Farquhar
  • Maruan Al-Shedivat
  • Tim Rocktäschel
  • Eric P. Xing
  • Shimon Whiteson

The score function estimator is widely used for estimating gradients of stochastic objectives in stochastic computation graphs (SCGs), e.g., in reinforcement learning and meta-learning. While deriving the first-order gradient estimators by differentiating a surrogate loss (SL) objective is computationally and conceptually simple, using the same approach for higher-order derivatives is more challenging. Firstly, analytically deriving and implementing such estimators is laborious and not compliant with automatic differentiation. Secondly, repeatedly applying SL to construct new objectives for each order derivative involves increasingly cumbersome graph manipulations. Lastly, to match the first-order gradient under differentiation, SL treats part of the cost as a fixed sample, which we show leads to missing and wrong terms for estimators of higher-order derivatives. To address all these shortcomings in a unified way, we introduce DiCE, which provides a single objective that can be differentiated repeatedly, generating correct estimators of derivatives of any order in SCGs. Unlike SL, DiCE relies on automatic differentiation for performing the requisite graph manipulations. We verify the correctness of DiCE both through a proof and numerical evaluation of the DiCE derivative estimates. We also use DiCE to propose and evaluate a novel approach for multi-agent learning. Our code is available at https://github.com/alshedivat/lola.
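
A minimal sketch of the central construction, the "MagicBox" operator: its value is 1, but differentiating it reinserts the score-function term, and this remains true under repeated differentiation. The Bernoulli objective is a toy check, not the paper's tasks; see the linked repository for the authors' implementation.

```python
# Minimal sketch of the DiCE MagicBox operator on a toy Bernoulli objective.
import torch

def magic_box(log_probs):
    # exp(tau - stop_gradient(tau)) == 1 in value, but its derivative is tau' * magic_box(tau)
    return torch.exp(log_probs - log_probs.detach())

theta = torch.tensor(0.2, requires_grad=True)
p = torch.sigmoid(theta)

samples = torch.bernoulli(p.detach().repeat(10000))          # sampled actions
log_probs = samples * torch.log(p) + (1 - samples) * torch.log(1 - p)
costs = (samples - 0.7) ** 2                                  # per-sample cost

dice_objective = (magic_box(log_probs) * costs).mean()
grad = torch.autograd.grad(dice_objective, theta, create_graph=True)[0]
print("estimated gradient:", grad.item())                     # score-function estimate
second = torch.autograd.grad(grad, theta)[0]                  # higher-order derivative also works
print("estimated second derivative:", second.item())
```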

JMLR Journal 2018 Journal Article

Distributed Proximal Gradient Algorithm for Partially Asynchronous Computer Clusters

  • Yi Zhou
  • Yingbin Liang
  • Yaoliang Yu
  • Wei Dai
  • Eric P. Xing

With ever growing data volume and model size, an error-tolerant, communication efficient, yet versatile distributed algorithm has become vital for the success of many large-scale machine learning applications. In this work we propose m-PAPG, an implementation of the flexible proximal gradient algorithm in model parallel systems equipped with the partially asynchronous communication protocol. The worker machines communicate asynchronously with a controlled staleness bound $s$ and operate at different frequencies. We characterize various convergence properties of m-PAPG: 1) Under a general non-smooth and non-convex setting, we prove that every limit point of the sequence generated by m-PAPG is a critical point of the objective function; 2) Under an error bound condition of convex objective functions, we prove that the optimality gap decays linearly for every $s$ steps; 3) Under the Kurdyka-Łojasiewicz inequality and a sufficient decrease assumption, we prove that the sequences generated by m-PAPG converge to the same critical point, provided that a proximal Lipschitz condition is satisfied.

ICML Conference 2018 Conference Paper

Gated Path Planning Networks

  • Lisa Lee
  • Emilio Parisotto
  • Devendra Singh Chaplot
  • Eric P. Xing
  • Ruslan Salakhutdinov

Value Iteration Networks (VINs) are effective differentiable path planning modules that can be used by agents to perform navigation while still maintaining end-to-end differentiability of the entire architecture. Despite their effectiveness, they suffer from several disadvantages including training instability, random seed sensitivity, and other optimization problems. In this work, we reframe VINs as recurrent-convolutional networks which demonstrates that VINs couple recurrent convolutions with an unconventional max-pooling activation. From this perspective, we argue that standard gated recurrent update equations could potentially alleviate the optimization issues plaguing VIN. The resulting architecture, which we call the Gated Path Planning Network, is shown to empirically outperform VIN on a variety of metrics such as learning speed, hyperparameter sensitivity, iteration count, and even generalization. Furthermore, we show that this performance gap is consistent across different maze transition types, maze sizes and even show success on a challenging 3D environment, where the planner is only provided with first-person RGB images.

ICML Conference 2018 Conference Paper

Nonoverlap-Promoting Variable Selection

  • Pengtao Xie
  • Hongbao Zhang
  • Yichen Zhu
  • Eric P. Xing

Variable selection is a classic problem in machine learning (ML), widely used to find important explanatory factors and to improve the generalization performance and interpretability of ML models. In this paper, we consider variable selection for models where multiple responses are to be predicted based on the same set of covariates. Since each response is relevant to a unique subset of covariates, we desire that the selected variables for different responses have small overlap. We propose a regularizer that simultaneously encourages orthogonality and sparsity, which jointly produce the effect of reducing overlap. We apply this regularizer to four model instances and develop efficient algorithms to solve the regularized problems. We provide a formal analysis of why the proposed regularizer can reduce generalization error. Experiments on both simulation studies and real-world datasets demonstrate the effectiveness of the proposed regularizer in selecting less-overlapped variables and improving generalization performance.

ICML Conference 2018 Conference Paper

Orthogonality-Promoting Distance Metric Learning: Convex Relaxation and Theoretical Analysis

  • Pengtao Xie
  • Wei Wu 0023
  • Yichen Zhu
  • Eric P. Xing

Distance metric learning (DML), which learns a distance metric from labeled "similar" and "dissimilar" data pairs, is widely utilized. Recently, several works have investigated orthogonality-promoting regularization (OPR), which encourages the projection vectors in DML to be close to being orthogonal, to achieve three effects: (1) high balancedness – achieving comparable performance on both frequent and infrequent classes; (2) high compactness – using a small number of projection vectors to achieve a "good" metric; (3) good generalizability – alleviating overfitting to training data. While showing promising results, these approaches suffer from three problems. First, they involve solving non-convex optimization problems where achieving the global optimum is NP-hard. Second, they lack a theoretical understanding of why OPR can lead to balancedness. Third, the current generalization error analysis of OPR is not directly on the regularizer. In this paper, we address these three issues by (1) seeking convex relaxations of the original nonconvex problems so that the global optimum is guaranteed to be achievable; (2) providing a formal analysis of OPR’s capability of promoting balancedness; (3) providing a theoretical analysis that directly reveals the relationship between OPR and generalization performance. Experiments on various datasets demonstrate that our convex methods are more effective in promoting balancedness, compactness, and generalization, and are computationally more efficient, compared with the nonconvex methods.

ICML Conference 2018 Conference Paper

Transformation Autoregressive Networks

  • Junier B. Oliva
  • Avinava Dubey
  • Manzil Zaheer
  • Barnabás Póczos
  • Ruslan Salakhutdinov
  • Eric P. Xing
  • Jeff G. Schneider

The fundamental task of general density estimation $p(x)$ has been of keen interest to machine learning. In this work, we attempt to systematically characterize methods for density estimation. Broadly speaking, most of the existing methods can be categorized as using either: (a) autoregressive models to estimate the conditional factors of the chain rule, $p(x_{i}\, |\, x_{i-1}, \ldots)$; or (b) non-linear transformations of variables of a simple base distribution. Based on a study of the characteristics of these categories, we propose multiple novel methods for each category. For example, we propose RNN-based transformations to model non-Markovian dependencies. Further, through a comprehensive study over both real-world and synthetic data, we show that jointly leveraging transformations of variables and autoregressive conditional models results in a considerable improvement in performance. We illustrate the use of our models in outlier detection and image modeling. Finally, we introduce a novel data-driven framework for learning a family of distributions.
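In notation of my own choosing (an illustrative assumption, not the paper's), combining the two families amounts to the standard change-of-variables identity with an autoregressive base density:

$$\log p(x) \;=\; \sum_{i} \log q\left(z_i \mid z_{i-1}, \ldots, z_1\right) \;+\; \log\left|\det \frac{\partial f(x)}{\partial x}\right|, \qquad z = f(x),$$

where $f$ is an invertible transformation of the variables and $q$ is an autoregressive model over the transformed variables.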

ICML Conference 2017 Conference Paper

Learning Latent Space Models with Angular Constraints

  • Pengtao Xie
  • Yuntian Deng
  • Yi Zhou 0017
  • Abhimanu Kumar
  • Yaoliang Yu
  • James Y. Zou
  • Eric P. Xing

The large model capacity of latent space models (LSMs) enables them to achieve great performance on various applications, but it also renders LSMs prone to overfitting. Several recent studies investigate a new type of regularization approach, which encourages components in LSMs to be diverse, for the sake of alleviating overfitting. While they have shown promising empirical effectiveness, it is still theoretically unclear why larger “diversity” results in less overfitting. To bridge this gap, we propose a new diversity-promoting approach that is both theoretically analyzable and empirically effective. Specifically, we use near-orthogonality to characterize “diversity” and impose angular constraints (ACs) on the components of LSMs to promote diversity. A generalization error analysis shows that larger diversity results in smaller estimation error and larger approximation error. An efficient ADMM algorithm is developed to solve the constrained LSM problems. Experiments demonstrate that ACs improve the generalization performance of LSMs and outperform other diversity-promoting approaches.

JMLR Journal 2017 Journal Article

Learning Scalable Deep Kernels with Recurrent Structure

  • Maruan Al-Shedivat
  • Andrew Gordon Wilson
  • Yunus Saatchi
  • Zhiting Hu
  • Eric P. Xing

Many applications in speech, robotics, finance, and biology deal with sequential data, where ordering matters and recurrent structures are common. However, this structure cannot be easily captured by standard kernel functions. To model such structure, we propose expressive closed-form kernel functions for Gaussian processes. The resulting model, GP-LSTM, fully encapsulates the inductive biases of long short-term memory (LSTM) recurrent networks, while retaining the non-parametric probabilistic advantages of Gaussian processes. We learn the properties of the proposed kernels by optimizing the Gaussian process marginal likelihood using a new provably convergent semi-stochastic gradient procedure, and exploit the structure of these kernels for scalable training and prediction. This approach provides a practical representation for Bayesian LSTMs. We demonstrate state-of-the-art performance on several benchmarks, and thoroughly investigate a consequential autonomous driving application, where the predictive uncertainties provided by GP-LSTM are uniquely valuable.

UAI Conference 2017 Conference Paper

Near-Orthogonality Regularization in Kernel Methods

  • Pengtao Xie
  • Barnabás Póczos
  • Eric P. Xing

Kernel methods perform nonlinear learning in high-dimensional reproducing kernel Hilbert spaces (RKHSs). Even though their large model capacity leads to high representational power, it also incurs a substantial risk of overfitting. To alleviate this problem, we propose a new regularization approach, near-orthogonality regularization, which encourages the RKHS functions to be close to being orthogonal. This effectively imposes a structural constraint over the function space, which reduces model complexity and can improve generalization performance. Besides, encouraging orthogonality reduces the redundancy among functions, which hence can reduce model size without compromising modeling power and better capture infrequent patterns in the data. Here, we define a family of orthogonality-promoting regularizers by encouraging the Gram matrix of the RKHS functions to be close to an identity matrix, where the closeness is measured by Bregman matrix divergences. We apply these regularizers to two kernel methods, and develop an efficient ADMM-based algorithm to solve the regularized optimization problems. We analyze how near-orthogonality affects the generalization performance of kernel methods. Our results suggest that the closer the functions are to being orthogonal, the smaller the generalization error is. Experiments demonstrate the efficacy of near-orthogonality regularization in kernel methods.
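To make the regularizer concrete, here is a sketch using the squared Frobenius distance, one simple member of the Bregman family the abstract mentions, with RKHS functions represented by finite kernel expansions. The kernel choice, data shapes, and function names are illustrative assumptions.

```python
# Sketch: "Gram matrix close to identity" regularizer for RKHS functions given by
# kernel expansions f_j(.) = sum_i A[j, i] k(x_i, .); the squared Frobenius distance
# is used here as one concrete Bregman matrix divergence.
import numpy as np

def rbf_kernel(X, gamma=1.0):
    sq = np.sum(X**2, axis=1)
    d2 = sq[:, None] + sq[None, :] - 2 * X @ X.T
    return np.exp(-gamma * d2)

def near_orthogonality_penalty(A, K):
    G = A @ K @ A.T                  # Gram matrix of the RKHS functions: G[j, l] = <f_j, f_l>
    return np.linalg.norm(G - np.eye(G.shape[0]), "fro") ** 2

rng = np.random.default_rng(0)
X = rng.normal(size=(30, 5))         # expansion points
A = rng.normal(size=(4, 30)) * 0.1   # coefficients of 4 RKHS functions
print(near_orthogonality_penalty(A, rbf_kernel(X)))
```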

ICML Conference 2017 Conference Paper

Post-Inference Prior Swapping

  • Willie Neiswanger
  • Eric P. Xing

While Bayesian methods are praised for their ability to incorporate useful prior knowledge, in practice, convenient priors that allow for computationally cheap or tractable inference are commonly used. In this paper, we investigate the following question: for a given model, is it possible to compute an inference result with any convenient false prior, and afterwards, given any target prior of interest, quickly transform this result into the target posterior? A potential solution is to use importance sampling (IS). However, we demonstrate that IS will fail for many choices of the target prior, depending on its parametric form and similarity to the false prior. Instead, we propose prior swapping, a method that leverages the pre-inferred false posterior to efficiently generate accurate posterior samples under arbitrary target priors. Prior swapping lets us apply less-costly inference algorithms to certain models, and incorporate new or updated prior information “post-inference”. We give theoretical guarantees about our method, and demonstrate it empirically on a number of models and priors.
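To make the failure mode concrete, here is the importance-sampling baseline the abstract argues against: reweight samples drawn under a convenient "false" prior by the ratio of target to false prior, and monitor the effective sample size, which collapses when the priors differ substantially. This is a sketch of the baseline only (prior swapping itself avoids these weights); the specific distributions and names are illustrative assumptions.

```python
# Importance-sampling baseline: reweight false-posterior samples by target_prior / false_prior.
# A small effective sample size (ESS) signals the failure mode the abstract describes.
import numpy as np
from scipy import stats

rng = np.random.default_rng(0)
false_posterior_samples = rng.normal(loc=1.0, scale=0.5, size=5000)

false_prior = stats.norm(0.0, 10.0)      # broad, computationally convenient prior
target_prior = stats.laplace(0.0, 1.0)   # the prior we actually care about

log_w = target_prior.logpdf(false_posterior_samples) - false_prior.logpdf(false_posterior_samples)
w = np.exp(log_w - log_w.max())
w /= w.sum()
ess = 1.0 / np.sum(w**2)
print("reweighted mean:", np.sum(w * false_posterior_samples), "ESS:", ess)
```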

ICML Conference 2017 Conference Paper

Toward Controlled Generation of Text

  • Zhiting Hu
  • Zichao Yang
  • Xiaodan Liang
  • Ruslan Salakhutdinov
  • Eric P. Xing

Generic generation and manipulation of text is challenging and has had limited success compared to recent deep generative modeling in the visual domain. This paper aims at generating plausible text sentences whose attributes are controlled by learning disentangled latent representations with designated semantics. We propose a new neural generative model which combines variational auto-encoders (VAEs) and holistic attribute discriminators for effective imposition of semantic structures. The model can alternatively be seen as enhancing VAEs with the wake-sleep algorithm for leveraging fake samples as extra training data. With differentiable approximation to discrete text samples, explicit constraints on independent attribute controls, and efficient collaborative learning of generator and discriminators, our model learns interpretable representations even from only word annotations, and produces short sentences with desired attributes of sentiment and tense. Quantitative experiments using trained classifiers as evaluators validate the accuracy of sentence and attribute generation.

ICML Conference 2017 Conference Paper

Uncorrelation and Evenness: a New Diversity-Promoting Regularizer

  • Pengtao Xie
  • Aarti Singh
  • Eric P. Xing

Latent space models (LSMs) provide a principled and effective way to extract hidden patterns from observed data. To cope with two challenges in LSMs, namely (1) how to capture infrequent patterns when pattern frequency is imbalanced and (2) how to reduce model size without sacrificing expressiveness, several studies have proposed to “diversify” LSMs by designing regularizers that encourage the components therein to be “diverse”. In light of the limitations of existing approaches, we design a new diversity-promoting regularizer by considering two factors: uncorrelation and evenness, which encourage the components to be uncorrelated and to play equally important roles in modeling data. Formally, this amounts to encouraging the covariance matrix of the components to have more uniform eigenvalues. We apply the regularizer to two LSMs and develop an efficient optimization algorithm. Experiments on healthcare, image and text data demonstrate the effectiveness of the regularizer.

ICML Conference 2016 Conference Paper

Diversity-Promoting Bayesian Learning of Latent Variable Models

  • Pengtao Xie
  • Jun Zhu 0001
  • Eric P. Xing

In learning latent variable models (LVMs), it is important to effectively capture infrequent patterns and shrink model size without sacrificing modeling power. Various studies have been done to “diversify” an LVM, aiming to learn a diverse set of latent components. Most existing studies fall into a frequentist-style regularization framework, where the components are learned via point estimation. In this paper, we investigate how to “diversify” LVMs in the paradigm of Bayesian learning, which has advantages complementary to point estimation, such as alleviating overfitting via model averaging and quantifying uncertainty. We propose two approaches that have complementary advantages. One is to define diversity-promoting mutual angular priors, which assign larger density to components with larger mutual angles, based on Bayesian networks and the von Mises-Fisher distribution, and to use these priors to affect the posterior via Bayes' rule. We develop two efficient approximate posterior inference algorithms based on variational inference and Markov chain Monte Carlo sampling. The other approach is to impose diversity-promoting regularization directly over the post-data distribution of components. These two methods are applied to the Bayesian mixture of experts model to encourage the “experts” to be diverse, and experimental results demonstrate the effectiveness and efficiency of our methods.

JMLR Journal 2016 Journal Article

Latent Space Inference of Internet-Scale Networks

  • Qirong Ho
  • Junming Yin
  • Eric P. Xing

The rise of Internet-scale networks, such as web graphs and social media with hundreds of millions to billions of nodes, presents new scientific opportunities, such as overlapping community detection to discover the structure of the Internet, or to analyze trends in online social behavior. However, many existing probabilistic network models are difficult or impossible to deploy at these massive scales. We propose a scalable approach for modeling and inferring latent spaces in Internet-scale networks, with an eye towards overlapping community detection as a key application. By applying a succinct representation of networks as a bag of triangular motifs, developing a parsimonious statistical model, deriving an efficient stochastic variational inference algorithm, and implementing it as a distributed cluster program via the Petuum parameter server system, we demonstrate overlapping community detection on real networks with up to 100 million nodes and 1000 communities on 5 machines in under 40 hours. Compared to other state-of-the-art probabilistic network approaches, our method is several orders of magnitude faster, with competitive or improved accuracy at overlapping community detection.

UAI Conference 2016 Conference Paper

Lighter-Communication Distributed Machine Learning via Sufficient Factor Broadcasting

  • Pengtao Xie
  • Jin Kyu Kim
  • Yi Zhou 0017
  • Qirong Ho
  • Abhimanu Kumar
  • Yaoliang Yu
  • Eric P. Xing

Matrix-parametrized models (MPMs) are widely used in machine learning (ML) applications. In large-scale ML problems, the parameter matrix of an MPM can grow at an unexpected rate, resulting in high communication and parameter synchronization costs. To address this issue, we offer two contributions: first, we develop a computation model for a large family of MPMs, which share the following property: the parameter update computed on each data sample is a rank-1 matrix, i.e., the outer product of two “sufficient factors” (SFs). Second, we implement a decentralized, peer-to-peer system, Sufficient Factor Broadcasting (SFB), which broadcasts the SFs among worker machines and reconstructs the update matrices locally at each worker. SFB takes advantage of small rank-1 matrix updates and efficient partial broadcasting strategies to dramatically improve communication efficiency. We propose a graph-optimization-based partial broadcasting scheme, which minimizes the delay of information dissemination under the constraint that each machine only communicates with a subset of machines rather than all of them. Furthermore, we provide theoretical analysis to show that SFB guarantees convergence of algorithms (under full broadcasting) without requiring a centralized synchronization mechanism. Experiments corroborate SFB’s efficiency on four MPMs.
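A toy sketch of the core observation: when each per-sample update is a rank-1 outer product $u v^T$, workers can transmit the two vectors instead of the full parameter matrix and rebuild the update locally. The multiclass-logistic example below (and the absence of any actual networking) is an illustrative assumption, not the SFB system itself.

```python
# Toy sketch of the sufficient-factor idea: send two vectors per update instead of a
# K x D matrix, and reconstruct the rank-1 update at the receiver.
import numpy as np

def local_update(W, x, y, lr=0.1):
    """One multiclass-logistic-style step whose gradient is an outer product."""
    logits = W @ x
    p = np.exp(logits - logits.max()); p /= p.sum()
    u = lr * (np.eye(len(p))[y] - p)   # sufficient factor 1 (length: #classes)
    v = x                              # sufficient factor 2 (length: #features)
    return u, v                        # broadcast these, not the full K x D matrix

K, D = 10, 100_000
W = np.zeros((K, D))
rng = np.random.default_rng(0)
u, v = local_update(W, rng.normal(size=D), y=3)
W += np.outer(u, v)                    # receiver rebuilds the rank-1 update
print(u.shape, v.shape, W.shape)       # K + D numbers communicated vs. K * D
```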

ICML Conference 2016 Conference Paper

Parallel and Distributed Block-Coordinate Frank-Wolfe Algorithms

  • Yu-Xiang Wang 0003
  • Veeranjaneyulu Sadhanala
  • Wei Dai 0003
  • Willie Neiswanger
  • Suvrit Sra
  • Eric P. Xing

We study parallel and distributed Frank-Wolfe algorithms; the former on shared-memory machines with mini-batching, and the latter in a delayed update framework. In both cases, we perform computations asynchronously whenever possible. We assume block-separable constraints as in the Block-Coordinate Frank-Wolfe (BCFW) method (Lacoste et al., 2013), but our analysis subsumes BCFW and reveals problem-dependent quantities that govern the speedups of our methods over BCFW. A notable feature of our algorithms is that they do not depend on worst-case bounded delays, but only (mildly) on **expected** delays, making them robust to stragglers and faulty worker threads. We present experiments on structural SVM and Group Fused Lasso, and observe significant speedups over competing state-of-the-art (and synchronous) methods.

JMLR Journal 2015 Journal Article

AD3: Alternating Directions Dual Decomposition for MAP Inference in Graphical Models

  • André F. T. Martins
  • Mário A. T. Figueiredo
  • Pedro M. Q. Aguiar
  • Noah A. Smith
  • Eric P. Xing

We present AD$^3$, a new algorithm for approximate maximum a posteriori (MAP) inference on factor graphs, based on the alternating directions method of multipliers. Like other dual decomposition algorithms, AD$^3$ has a modular architecture, where local subproblems are solved independently, and their solutions are gathered to compute a global update. The key characteristic of AD$^3$ is that each local subproblem has a quadratic regularizer, leading to faster convergence, both theoretically and in practice. We provide closed-form solutions for these AD$^3$ subproblems for binary pairwise factors and factors imposing first-order logic constraints. For arbitrary factors (large or combinatorial), we introduce an active set method which requires only an oracle for computing a local MAP configuration, making AD$^3$ applicable to a wide range of problems. Experiments on synthetic and real-world problems show that AD$^3$ compares favorably with the state-of-the-art.

IJCAI Conference 2015 Conference Paper

An Active Learning Approach to Coreference Resolution

  • Mrinmaya Sachan
  • Eduard Hovy
  • Eric P. Xing

In this paper, we define the problem of coreference resolution in text as one of clustering with pairwise constraints, where human experts are asked to provide pairwise constraints (pairwise judgments of coreferentiality) to guide the clustering process. Positing that these pairwise judgments are easy to obtain from humans given the right context, we show that with a significantly lower number of pairwise judgments and less feature-engineering effort, we can achieve competitive coreference performance. Further, we describe an active learning strategy that minimizes the overall number of such pairwise judgments needed by asking the most informative questions to human experts at each step of coreference resolution. We evaluate this hypothesis and our algorithms on both entity and event coreference tasks and on two languages.

ICML Conference 2015 Conference Paper

Complex Event Detection using Semantic Saliency and Nearly-Isotonic SVM

  • Xiaojun Chang
  • Yi Yang 0001
  • Eric P. Xing
  • Yaoliang Yu

We aim to detect complex events in long Internet videos that may last for hours. A major challenge in this setting is that only a few shots in a long video are relevant to the event of interest while others are irrelevant or even misleading. Instead of indifferently pooling the shots, we first define a novel notion of semantic saliency that assesses the relevance of each shot with the event of interest. We then prioritize the shots according to their saliency scores since shots that are semantically more salient are expected to contribute more to the final event detector. Next, we propose a new isotonic regularizer that is able to exploit the semantic ordering information. The resulting nearly-isotonic SVM classifier exhibits higher discriminative power. Computationally, we develop an efficient implementation using the proximal gradient algorithm, and we prove new, closed-form proximal steps. We conduct extensive experiments on three real-world video datasets and confirm the effectiveness of the proposed approach.
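One common form of a nearly-isotonic penalty, shown below with shots ordered by decreasing saliency, charges only for increases of the weight along that order. The exact regularizer, its orientation, and its proximal step in the paper may well differ; this is only a small illustrative sketch.

```python
# Common nearly-isotonic penalty: with shots sorted by decreasing saliency,
# penalize any increase of the weight along that order. Illustrative only.
import numpy as np

def nearly_isotonic_penalty(w, lam=1.0):
    return lam * np.sum(np.maximum(w[1:] - w[:-1], 0.0))

w = np.array([0.9, 0.7, 0.8, 0.2, 0.25])
print(nearly_isotonic_penalty(w))   # only the 0.7 -> 0.8 and 0.2 -> 0.25 increases count
```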

AAAI Conference 2015 Conference Paper

Integrating Image Clustering and Codebook Learning

  • Pengtao Xie
  • Eric P. Xing

Image clustering and visual codebook learning are two fundamental problems in computer vision and they are tightly related. On one hand, a good codebook can generate effective feature representations which largely affect clustering performance. On the other hand, class labels obtained from image clustering can serve as supervised information to guide codebook learning. Traditionally, these two processes are conducted separately and their correlation is generally ignored. In this paper, we propose a Double Layer Gaussian Mixture Model (DLGMM) to simultaneously perform image clustering and codebook learning. In DLGMM, two tasks are seamlessly coupled and can mutually promote each other. Cluster labels and codebook are jointly estimated to achieve the overall best performance. To incorporate the spatial coherence between neighboring visual patches, we propose a Spatially Coherent DLGMM which uses a Markov Random Field to encourage neighboring patches to share the same visual word label. We use variational inference to approximate the posterior of latent variables and learn model parameters. Experiments on two datasets demonstrate the effectiveness of two models.

ICML Conference 2015 Conference Paper

Large-scale Distributed Dependent Nonparametric Trees

  • Zhiting Hu
  • Qirong Ho
  • Avinava Dubey
  • Eric P. Xing

Practical applications of Bayesian nonparametric (BNP) models have been limited, due to their high computational complexity and poor scaling on large data. In this paper, we consider dependent nonparametric trees (DNTs), a powerful infinite model that captures time-evolving hierarchies, and develop a large-scale distributed training system. Our major contributions include: (1) an effective memoized variational inference for DNTs, with a novel birth-merge strategy for exploring the unbounded tree space; (2) a model-parallel scheme for concurrent tree growing/pruning and efficient model alignment, through conflict-free model partitioning and lightweight synchronization; (3) a data-parallel scheme for variational parameter updates that allows distributed processing of massive data. Using 64 cores in 36 hours, our system learns a 10K-node DNT topic model on 8M documents that captures both high-frequency and long-tail topics. Our data and model scales are orders-of-magnitude larger than recent results on the hierarchical Dirichlet process, and the near-linear scalability indicates great potential for even bigger problem sizes.

UAI Conference 2014 Conference Paper

Asymptotically Exact, Embarrassingly Parallel MCMC

  • Willie Neiswanger
  • Chong Wang 0002
  • Eric P. Xing

Communication costs, resulting from synchronization requirements during learning, can greatly slow down many parallel machine learning algorithms. In this paper, we present a parallel Markov chain Monte Carlo (MCMC) algorithm in which subsets of data are processed independently, with very little communication. First, we arbitrarily partition data onto multiple machines. Then, on each machine, any classical MCMC method (e.g., Gibbs sampling) may be used to draw samples from a posterior distribution given the data subset. Finally, the samples from each machine are combined to form samples from the full posterior. This embarrassingly parallel algorithm allows each machine to act independently on a subset of the data (without communication) until the final combination stage. We prove that our algorithm generates asymptotically exact samples and empirically demonstrate its ability to parallelize burn-in and sampling in several models.
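For intuition about the final combination stage, here is the simplest parametric combination rule: fit a Gaussian to each machine's subposterior samples and take their product. The paper's asymptotically exact combination is nonparametric, so this sketch (and its variable names and toy data) is illustrative only.

```python
# Gaussian-product combination of subposterior sample sets, as a simple illustration
# of merging independently drawn samples; not the paper's asymptotically exact estimator.
import numpy as np

def combine_gaussian(subposterior_samples):
    """subposterior_samples: list of (n_m, d) arrays, one per machine."""
    precisions, weighted_means = [], []
    for s in subposterior_samples:
        prec = np.linalg.inv(np.atleast_2d(np.cov(s, rowvar=False)))
        precisions.append(prec)
        weighted_means.append(prec @ s.mean(axis=0))
    combined_cov = np.linalg.inv(sum(precisions))
    combined_mean = combined_cov @ sum(weighted_means)
    return combined_mean, combined_cov

rng = np.random.default_rng(0)
subs = [rng.normal(loc=m, scale=1.0, size=(2000, 2)) for m in (0.1, -0.2, 0.05)]
mean, cov = combine_gaussian(subs)
print(mean, np.diag(cov))
```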

JMLR Journal 2014 Journal Article

Bayesian Inference with Posterior Regularization and Applications to Infinite Latent SVMs

  • Jun Zhu
  • Ning Chen
  • Eric P. Xing

Existing Bayesian models, especially nonparametric Bayesian methods, rely on specially conceived priors to incorporate domain knowledge for discovering improved latent representations. While priors affect posterior distributions through Bayes' rule, imposing posterior regularization is arguably more direct and in some cases more natural and general. In this paper, we present regularized Bayesian inference (RegBayes), a novel computational framework that performs posterior inference with a regularization term on the desired post-data posterior distribution under an information theoretical formulation. RegBayes is more flexible than the procedure that elicits expert knowledge via priors, and it covers both directed Bayesian networks and undirected Markov networks. When the regularization is induced from a linear operator on the posterior distributions, such as the expectation operator, we present a general convex-analysis theorem to characterize the solution of RegBayes. Furthermore, we present two concrete examples of RegBayes, infinite latent support vector machines (iLSVM) and multi-task infinite latent support vector machines (MT-iLSVM), which explore the large-margin idea in combination with a nonparametric Bayesian model for discovering predictive latent features for classification and multi-task learning, respectively. We present efficient inference methods and report empirical studies on several benchmark data sets, which appear to demonstrate the merits inherited from both large-margin learning and Bayesian nonparametrics. Such results contribute to push forward the interface between these two important subfields, which have been largely treated as isolated in the community.

JMLR Journal 2014 Journal Article

Graph Estimation From Multi-Attribute Data

  • Mladen Kolar
  • Han Liu
  • Eric P. Xing

Undirected graphical models are important in a number of modern applications that involve exploring or exploiting dependency structures underlying the data. For example, they are often used to explore complex systems where connections between entities are not well understood, such as in functional brain networks or genetic networks. Existing methods for estimating structure of undirected graphical models focus on scenarios where each node represents a scalar random variable, such as a binary neural activation state or a continuous mRNA abundance measurement, even though in many real world problems, nodes can represent multivariate variables with much richer meanings, such as whole images, text documents, or multi-view feature vectors. In this paper, we propose a new principled framework for estimating the structure of undirected graphical models from such multivariate (or multi-attribute) nodal data. The structure of a graph is inferred through estimation of non-zero partial canonical correlation between nodes. Under a Gaussian model, this strategy is equivalent to estimating conditional independencies between random vectors represented by the nodes and it generalizes the classical problem of covariance selection (Dempster, 1972). We relate the problem of estimating non-zero partial canonical correlations to maximizing a penalized Gaussian likelihood objective and develop a method that efficiently maximizes this objective. Extensive simulation studies demonstrate the effectiveness of the method under various conditions. We provide illustrative applications to uncovering gene regulatory networks from gene and protein profiles, and uncovering brain connectivity graph from positron emission tomography data. Finally, we provide sufficient conditions under which the true graphical structure can be recovered correctly.

UAI Conference 2014 Conference Paper

Modeling Citation Networks Using Latent Random Offsets

  • Willie Neiswanger
  • Chong Wang 0002
  • Qirong Ho
  • Eric P. Xing

Out of the many potential factors that determine which links form in a document citation network, two in particular are of high importance: first, a document may be cited based on its subject matter—this can be modeled by analyzing document content; second, a document may be cited based on which other documents have previously cited it—this can be modeled by analyzing citation structure. Both factors are important for users to make informed decisions and choose appropriate citations as the network grows. In this paper, we present a novel model that integrates the merits of content and citation analyses into a single probabilistic framework. We demonstrate our model on three real-world citation networks. Compared with existing baselines, our model can be used to effectively explore a citation network and provide meaningful explanations for links while still maintaining competitive citation prediction performance.

UAI Conference 2014 Conference Paper

Parallel Markov Chain Monte Carlo for Pitman-Yor Mixture Models

  • Avinava Dubey
  • Sinead A. Williamson
  • Eric P. Xing

The Pitman-Yor process provides an elegant way to cluster data that exhibit power law behavior, where the number of clusters is unknown or unbounded. Unfortunately, inference in Pitman-Yor process-based models is typically slow and does not scale well with dataset size. In this paper we present new auxiliary-variable representations for the Pitman-Yor process and a special case of the hierarchical Pitman-Yor process that allows us to develop parallel inference algorithms that distribute inference both on the data space and the model space. We show that our method scales well with increasing data while avoiding any degradation in estimate quality.

ICML Conference 2013 Conference Paper

An Adaptive Learning Rate for Stochastic Variational Inference

  • Rajesh Ranganath
  • Chong Wang 0002
  • David M. Blei
  • Eric P. Xing

Stochastic variational inference finds good posterior approximations of probabilistic models with very large data sets. It optimizes the variational objective with stochastic optimization, following noisy estimates of the natural gradient. Operationally, stochastic inference iteratively subsamples from the data, analyzes the subsample, and updates parameters with a decreasing learning rate. However, the algorithm is sensitive to that rate, which usually requires hand-tuning to each application. We solve this problem by developing an adaptive learning rate for stochastic inference. Our method requires no tuning and is easily implemented with computations already made in the algorithm. We demonstrate our approach with latent Dirichlet allocation applied to three large text corpora. Inference with the adaptive learning rate converges faster and to a better approximation than the best settings of hand-tuned rates.
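Below is a heavily hedged sketch of one adaptive step-size scheme driven by running gradient statistics, roughly $\rho \approx \|\bar{g}\|^2 / \mathbb{E}[g^\top g]$ with both expectations tracked by moving averages. It illustrates the general recipe of computing the rate from gradient noise; the paper's exact estimator (including its adaptive memory size) may differ, and all names here are illustrative.

```python
# Adaptive step size from running gradient statistics: rho ~ ||E[g]||^2 / E[g^T g],
# with both expectations estimated by exponential moving averages. Illustrative sketch.
import numpy as np

class AdaptiveRate:
    def __init__(self, dim, decay=0.9):
        self.g_bar = np.zeros(dim)   # running mean of the noisy (natural) gradient
        self.h_bar = 1.0             # running mean of g^T g
        self.decay = decay

    def step(self, g):
        self.g_bar = self.decay * self.g_bar + (1 - self.decay) * g
        self.h_bar = self.decay * self.h_bar + (1 - self.decay) * float(g @ g)
        return float(self.g_bar @ self.g_bar) / max(self.h_bar, 1e-12)

rng = np.random.default_rng(0)
lam, rate = np.ones(5), AdaptiveRate(dim=5)
for _ in range(100):
    noisy_nat_grad = (np.full(5, 2.0) - lam) + rng.normal(scale=0.5, size=5)
    rho = rate.step(noisy_nat_grad)
    lam = lam + rho * noisy_nat_grad
print(lam, rho)
```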

ICML Conference 2013 Conference Paper

Hierarchical Tensor Decomposition of Latent Tree Graphical Models

  • Le Song
  • Mariya Ishteva
  • Ankur P. Parikh
  • Eric P. Xing
  • Haesun Park

We approach the problem of estimating the parameters of a latent tree graphical model from a hierarchical tensor decomposition point of view. In this new view, the marginal probability table of the observed variables in a latent tree is treated as a tensor, and we show that: (i) the latent variables induce low rank structures in various matricizations of the tensor; (ii) this collection of low rank matricizations induces a hierarchical low rank decomposition of the tensor. Exploiting these properties, we derive an optimization problem for estimating the parameters of a latent tree graphical model, i.e., a hierarchical decomposition of the tensor which minimizes the Frobenius norm of the difference between the original tensor and its decomposition. When the latent tree graphical models are correctly specified, we show that a global optimum of the optimization problem can be obtained via a recursive decomposition algorithm. This algorithm recovers previous spectral algorithms for hidden Markov models (Hsu et al., 2009; Foster et al., 2012) and latent tree graphical models (Parikh et al., 2011; Song et al., 2011) as special cases, elucidating the global objective these algorithms are optimizing for. When the latent tree graphical models are misspecified, we derive a better decomposition based on our framework, and provide an approximation guarantee for this new estimator. On both synthetic and real-world data, this new estimator significantly improves over the state of the art.

UAI Conference 2013 Conference Paper

Integrating Document Clustering and Topic Modeling

  • Pengtao Xie
  • Eric P. Xing

Document clustering and topic modeling are two closely related tasks which can mutually benefit each other. Topic modeling can project documents into a topic space which facilitates effective document clustering. Cluster labels discovered by document clustering can be incorporated into topic models to extract local topics specific to each cluster and global topics shared by all clusters. In this paper, we propose a multi-grain clustering topic model (MGCTM) which integrates document clustering and topic modeling into a unified framework and jointly performs the two tasks to achieve the overall best performance. Our model tightly couples two components: a mixture component used for discovering latent groups in document collection and a topic model component used for mining multi-grain topics including local topics specific to each cluster and global topics shared across clusters. We employ variational inference to approximate the posterior of hidden variables and learn model parameters. Experiments on two datasets demonstrate the effectiveness of our model.

ICML Conference 2013 Conference Paper

Markov Network Estimation From Multi-attribute Data

  • Mladen Kolar
  • Han Liu 0001
  • Eric P. Xing

Many real world network problems often concern multivariate nodal attributes such as image, textual, and multi-view feature vectors on nodes, rather than simple univariate nodal attributes. The existing graph estimation methods built on Gaussian graphical models and covariance selection algorithms cannot handle such data, nor can the theories developed around such methods be directly applied. In this paper, we propose a new principled framework for estimating multi-attribute graphs. Instead of estimating the partial correlation as in the current literature, our method estimates the partial canonical correlations that naturally accommodate complex nodal features. Computationally, we provide an efficient algorithm which utilizes the multi-attribute structure. Theoretically, we provide sufficient conditions which guarantee consistent graph recovery. Extensive simulation studies demonstrate the performance of our method under various conditions.

IJCAI Conference 2013 Conference Paper

Multi-Modal Distance Metric Learning

  • Pengtao Xie
  • Eric P. Xing

Multi-modal data is dramatically increasing with the fast growth of social media. Learning a good distance measure for data with multiple modalities is of vital importance for many applications, including retrieval, clustering, classification and recommendation. In this paper, we propose an effective and scalable multi-modal distance metric learning framework. Based on the multi-wing harmonium model, our method provides a principled way to embed data of arbitrary modalities into a single latent space, in which an optimal distance metric can be learned under proper supervision, i.e., by minimizing the distance between similar pairs while maximizing the distance between dissimilar pairs. The parameters are learned by jointly optimizing the data likelihood under the latent space model and the loss induced by distance supervision; our method thus seeks a balance between explaining the data and providing an effective distance metric, which naturally avoids overfitting. We apply our general framework to text/image data and present empirical results on retrieval and classification to demonstrate its effectiveness and scalability.

ICML Conference 2013 Conference Paper

Parallel Markov Chain Monte Carlo for Nonparametric Mixture Models

  • Sinead A. Williamson
  • Avinava Dubey
  • Eric P. Xing

Nonparametric mixture models based on the Dirichlet process are an elegant alternative to finite models when the number of underlying components is unknown, but inference in such models can be slow. Existing attempts to parallelize inference in such models have relied on introducing approximations, which can lead to inaccuracies in the posterior estimate. In this paper, we describe auxiliary variable representations for the Dirichlet process and the hierarchical Dirichlet process that allow us to perform MCMC using the correct equilibrium distribution, in a distributed manner. We show that our approach allows scalable inference without the deterioration in estimate quality that accompanies existing methods.

UAI Conference 2012 Conference Paper

A Spectral Algorithm for Latent Junction Trees

  • Ankur P. Parikh
  • Le Song
  • Mariya Ishteva
  • Gabi Teodoru
  • Eric P. Xing

Latent variable models are an elegant framework for capturing rich probabilistic dependencies in many applications. However, current approaches typically parametrize these models using conditional probability tables, and learning relies predominantly on local search heuristics such as Expectation Maximization. Using tensor algebra, we propose an alternative parameterization of latent variable models (where the model structures are junction trees) that still allows for computation of marginals among observed variables. While this novel representation leads to a moderate increase in the number of parameters for junction trees of low treewidth, it lets us design a local-minimum-free algorithm for learning this parameterization. The main computation of the algorithm involves only tensor operations and SVDs which can be orders of magnitude faster than EM algorithms for large datasets. To our knowledge, this is the first provably consistent parameter learning technique for a large class of low-treewidth latent graphical models beyond trees. We demonstrate the advantages of our method on synthetic and real datasets.

JMLR Journal 2012 Journal Article

MedLDA: Maximum Margin Supervised Topic Models

  • Jun Zhu
  • Amr Ahmed
  • Eric P. Xing

A supervised topic model can use side information such as ratings or labels associated with documents or images to discover more predictive low dimensional topical representations of the data. However, existing supervised topic models predominantly employ likelihood-driven objective functions for learning and inference, leaving the popular and potentially powerful max-margin principle unexploited for seeking predictive representations of data and more discriminative topic bases for the corpus. In this paper, we propose the maximum entropy discrimination latent Dirichlet allocation (MedLDA) model, which integrates the mechanism behind the max-margin prediction models (e.g., SVMs) with the mechanism behind the hierarchical Bayesian topic models (e.g., LDA) under a unified constrained optimization framework, and yields latent topical representations that are more discriminative and more suitable for prediction tasks such as document classification or regression. The principle underlying the MedLDA formalism is quite general and can be applied for jointly max-margin and maximum likelihood learning of directed or undirected topic models when supervising side information is available. Efficient variational methods for posterior inference and parameter estimation are derived and extensive empirical studies on several real data sets are also provided. Our experimental results demonstrate qualitatively and quantitatively that MedLDA could: 1) discover sparse and highly discriminative topical representations; 2) achieve state of the art prediction performance; and 3) be more efficient than existing supervised topic models, especially for classification.

ICML Conference 2011 Conference Paper

A Spectral Algorithm for Latent Tree Graphical Models

  • Ankur P. Parikh
  • Le Song
  • Eric P. Xing

Latent variable models are powerful tools for probabilistic modeling, and have been successfully applied to various domains, such as speech analysis and bioinformatics. However, parameter learning algorithms for latent variable models have predominantly relied on local search heuristics such as expectation maximization (EM). We propose a fast, local-minimum-free spectral algorithm for learning latent variable models with arbitrary tree topologies, and show that the joint distribution of the observed variables can be reconstructed from the marginals of triples of observed variables irrespective of the maximum degree of the tree. We demonstrate the performance of our spectral algorithm on synthetic and real datasets; for large training sizes, our algorithm performs comparably to or better than EM while being orders of magnitude faster.

UAI Conference 2011 Conference Paper

Sparse Topical Coding

  • Jun Zhu 0001
  • Eric P. Xing

We present sparse topical coding (STC), a non-probabilistic formulation of topic models for discovering latent representations of large collections of data. Unlike probabilistic topic models, STC relaxes the normalization constraint of admixture proportions and the constraint of defining a normalized likelihood function. Such relaxations make STC amenable to: 1) directly controlling the sparsity of inferred representations by using sparsity-inducing regularizers; 2) being seamlessly integrated with a convex error function (e.g., SVM hinge loss) for supervised learning; and 3) being efficiently learned with a simply structured coordinate descent algorithm. Our results demonstrate the advantages of STC and supervised MedSTC on identifying topical meanings of words and improving classification accuracy and time efficiency.

UAI Conference 2010 Conference Paper

Timeline: A Dynamic Hierarchical Dirichlet Process Model for Recovering Birth/Death and Evolution of Topics in Text Stream

  • Amr Ahmed 0001
  • Eric P. Xing

Topic models have proven to be a useful tool for discovering latent structures in document collections. However, most document collections often come as temporal streams, and thus several aspects of the latent structure, such as the number of topics and the topics’ distribution and popularity, are time-evolving. Several models exist that capture the evolution of some but not all of the above aspects. In this paper we introduce infinite dynamic topic models, iDTM, that can accommodate the evolution of all the aforementioned aspects. Our model assumes that documents are organized into epochs, where the documents within each epoch are exchangeable but the order between the documents is maintained across epochs. iDTM allows for an unbounded number of topics: topics can die or be born at any epoch, and the representation of each topic can evolve according to a Markovian dynamics. We use iDTM to analyze the birth and evolution of topics in the NIPS community and evaluate the efficacy of our model on both simulated and real datasets with favorable outcomes.

ICML Conference 2009 Conference Paper

Dynamic mixed membership blockmodel for evolving networks

  • Wenjie Fu 0002
  • Le Song
  • Eric P. Xing

In a dynamic social or biological environment, interactions between the underlying actors can undergo large and systematic changes. Each actor can assume multiple roles and their degrees of affiliation to these roles can also exhibit rich temporal phenomena. We propose a state space mixed membership stochastic blockmodel which can track across time the evolving roles of the actors. We also derive an efficient variational inference procedure for our model, and apply it to the Enron email networks, and rewiring gene regulatory networks of yeast. In both cases, our model reveals interesting dynamical roles of the actors.

JMLR Journal 2009 Journal Article

Maximum Entropy Discrimination Markov Networks

  • Jun Zhu
  • Eric P. Xing

The standard maximum margin approach for structured prediction lacks a straightforward probabilistic interpretation of the learning scheme and the prediction rule. Therefore its unique advantages such as dual sparseness and kernel tricks cannot be easily conjoined with the merits of a probabilistic model such as Bayesian regularization, model averaging, and the ability to model hidden variables. In this paper, we present a new general framework called maximum entropy discrimination Markov networks (MaxEnDNet, or simply, MEDN), which integrates these two approaches and combines and extends their merits. Major innovations of this approach include: 1) It extends the conventional max-entropy discrimination learning of classification rules to a new structural max-entropy discrimination paradigm of learning a distribution of Markov networks. 2) It generalizes the extant Markov network structured-prediction rule based on a point estimator of model coefficients to an averaging model akin to a Bayesian predictor that integrates over a learned posterior distribution of model coefficients. 3) It admits flexible entropic regularization of the model during learning. By plugging in different prior distributions of the model coefficients, it subsumes the well-known maximum margin Markov networks (M$^3$N) as a special case, and leads to a model similar to an $L_1$-regularized M$^3$N that is simultaneously primal and dual sparse, or other new types of Markov networks. 4) It applies a modular learning algorithm that combines existing variational inference techniques and convex-optimization based M$^3$N solvers as subroutines. Essentially, MEDN can be understood as a jointly maximum likelihood and maximum margin estimate of a Markov network. It represents the first successful attempt to combine maximum entropy learning (a dual form of maximum likelihood learning) with maximum margin learning of Markov networks for structured input/output problems, and the basic principle can be generalized to learning arbitrary graphical models, such as generative Bayesian networks or models with structured hidden variables. We discuss a number of theoretical properties of this approach, and show that empirically it outperforms a wide array of competing methods for structured input/output learning on both synthetic and real OCR and web data extraction data sets.

JMLR Journal 2009 Journal Article

Nonextensive Information Theoretic Kernels on Measures

  • André F. T. Martins
  • Noah A. Smith
  • Eric P. Xing
  • Pedro M. Q. Aguiar
  • Mário A. T. Figueiredo

Positive definite kernels on probability measures have been recently applied to classification problems involving text, images, and other types of structured data. Some of these kernels are related to classic information theoretic quantities, such as (Shannon's) mutual information and the Jensen-Shannon (JS) divergence. Meanwhile, there have been recent advances in nonextensive generalizations of Shannon's information theory. This paper bridges these two trends by introducing nonextensive information theoretic kernels on probability measures, based on new JS-type divergences. These new divergences result from extending the two building blocks of the classical JS divergence: convexity and Shannon's entropy. The notion of convexity is extended to the wider concept of $q$-convexity, for which we prove a Jensen $q$-inequality. Based on this inequality, we introduce Jensen-Tsallis (JT) $q$-differences, a nonextensive generalization of the JS divergence, and define a $k$-th order JT $q$-difference between stochastic processes. We then define a new family of nonextensive mutual information kernels, which allow weights to be assigned to their arguments, and which include the Boolean, JS, and linear kernels as particular cases. Nonextensive string kernels are also defined that generalize the $p$-spectrum kernel. We illustrate the performance of these kernels on text categorization tasks, in which documents are modeled both as bags of words and as sequences of characters.

ICML Conference 2009 Conference Paper

Polyhedral outer approximations with application to natural language parsing

  • André F. T. Martins
  • Noah A. Smith
  • Eric P. Xing

Recent approaches to learning structured predictors often require approximate inference for tractability; yet its effects on the learned model are unclear. Meanwhile, most learning algorithms act as if computational cost was constant within the model class. This paper sheds some light on the first issue by establishing risk bounds for max-margin learning with LP relaxed inference and addresses the second issue by proposing a new paradigm that attempts to penalize "time-consuming" hypotheses. Our analysis relies on a geometric characterization of the outer polyhedra associated with the LP relaxation. We then apply these techniques to the problem of dependency parsing, for which a concise LP formulation is provided that handles non-local output features. A significant improvement is shown over arc-factored models.

UAI Conference 2008 Conference Paper

Feature Selection via Block-Regularized Regression

  • Seyoung Kim
  • Eric P. Xing

Identifying co-varying causal elements in very high dimensional feature spaces with internal structure, e.g., a space with as many as millions of linearly ordered features, as one typically encounters in problems such as whole genome association (WGA) mapping, remains an open problem in statistical learning. We propose a block-regularized regression model for sparse variable selection in a high-dimensional space where the covariates are linearly ordered, and are possibly subject to local statistical linkages (e.g., block structures) due to spatial or temporal proximity of the features. Our goal is to identify a small subset of relevant covariates that are not merely from random positions in the ordering, but grouped as contiguous blocks from the large number of ordered covariates. Following a typical linear regression framework between the features and the response, our proposed model employs a sparsity-enforcing Laplacian prior for the regression coefficients, augmented by a first-order Markovian process along the feature sequence that “activates” the regression coefficients in a coupled fashion. We describe a sampling-based learning algorithm and demonstrate the performance of our method on simulated and biological data for marker identification under WGA.

JMLR Journal 2008 Journal Article

Mixed Membership Stochastic Blockmodels

  • Edoardo M. Airoldi
  • David M. Blei
  • Stephen E. Fienberg
  • Eric P. Xing

Consider data consisting of pairwise measurements, such as presence or absence of links between pairs of objects. These data arise, for instance, in the analysis of protein interactions and gene regulatory networks, collections of author-recipient email, and social networks. Analyzing pairwise measurements with probabilistic models requires special assumptions, since the usual independence or exchangeability assumptions no longer hold. Here we introduce a class of variance allocation models for pairwise measurements: mixed membership stochastic blockmodels. These models combine global parameters that instantiate dense patches of connectivity (blockmodel) with local parameters that instantiate node-specific variability in the connections (mixed membership). We develop a general variational inference algorithm for fast approximate posterior inference. We demonstrate the advantages of mixed membership stochastic blockmodels with applications to social networks and protein interaction networks.

IROS Conference 2007 Conference Paper

Feature selection for grasp recognition from optical markers

  • Lillian Y. Chang
  • Nancy S. Pollard
  • Tom M. Mitchell
  • Eric P. Xing

Although the human hand is a complex biomechanical system, only a small set of features may be necessary for observation learning of functional grasp classes. We explore how to methodically select a minimal set of hand pose features from optical marker data for grasp recognition. Supervised feature selection is used to determine a reduced feature set of surface marker locations on the hand that is appropriate for grasp classification of individual hand poses. Classifiers trained on the reduced feature set of five markers retain at least 92% of the prediction accuracy of classifiers trained on a full feature set of thirty markers. The reduced model also generalizes better to new subjects. The dramatic reduction of the marker set size and the success of a linear classifier from local marker coordinates recommend optical marker techniques as a practical alternative to data glove methods for observation learning of grasping.

ICML Conference 2007 Conference Paper

Recovering temporally rewiring networks: a model-based approach

  • Fan Guo 0006
  • Steve Hanneke
  • Wenjie Fu 0002
  • Eric P. Xing

A plausible representation of relational information among entities in dynamic systems such as a living cell or a social community is a stochastic network which is topologically rewiring and semantically evolving over time. While there is a rich literature on modeling static or temporally invariant networks, much less has been done toward modeling the dynamic processes underlying rewiring networks, and on recovering such networks when they are not observable. We present a class of hidden temporal exponential random graph models (htERGMs) to study the yet unexplored topic of modeling and recovering temporally rewiring networks from time series of node attributes such as activities of social actors or expression levels of genes. We show that one can reliably infer the latent time-specific topologies of the evolving networks from the observation. We report empirical results on both synthetic data and a Drosophila lifecycle gene expression data set, in comparison with a static counterpart of htERGM.

UAI Conference 2005 Conference Paper

Mining Associated Text and Images with Dual-Wing Harmoniums

  • Eric P. Xing
  • Rong Yan
  • Alexander G. Hauptmann

We propose a multi-wing harmonium model for mining multimedia data that extends and improves on earlier models based on two-layer random fields, which capture bidirectional dependencies between hidden topic aspects and observed inputs. This model can be viewed as an undirected counterpart of two-layer directed models such as LDA for similar tasks, but it bears significant differences in inference/learning cost tradeoffs, latent topic representations, and topic mixing mechanisms. In particular, our model facilitates efficient inference and robust topic mixing, and potentially provides high flexibility in modeling the latent topic spaces. A contrastive divergence and a variational algorithm are derived for learning. We specialize our model to a dual-wing harmonium for captioned images, incorporating a multivariate Poisson for word counts and a multivariate Gaussian for color histograms. We present empirical results on the applications of this model to classification, retrieval and image annotation on news video collections, and we report an extensive comparison with various extant models.

UAI Conference 2004 Conference Paper

Graph Partition Strategies for Generalized Mean Field Inference

  • Eric P. Xing
  • Michael I. Jordan

An autonomous variational inference algorithm for arbitrary graphical models requires the ability to optimize variational approximations over the space of model parameters as well as over the choice of tractable families used for the variational approximation. In this paper, we present a novel combination of graph partitioning algorithms with a generalized mean field (GMF) inference algorithm. This combination optimizes over disjoint clustering of variables and performs inference using those clusters. We provide a formal analysis of the relationship between the graph cut and the GMF approximation, and explore several graph partition strategies empirically. Our empirical results provide rather clear support for a weighted version of MinCut as a useful clustering algorithm for GMF inference, which is consistent with the implications from the formal analysis.

UAI Conference 2003 Conference Paper

A generalized mean field algorithm for variational inference in exponential families

  • Eric P. Xing
  • Michael I. Jordan
  • Stuart Russell 0001

Mean field methods, which entail approximating intractable probability distributions variationally with distributions from a tractable family, enjoy high efficiency, guaranteed convergence, and provide lower bounds on the true likelihood. But due to the requirement of model-specific derivations of the optimization equations and unclear inference quality in various models, they are not widely used as a generic approximate inference algorithm. In this paper, we discuss a generalized mean field theory on variational approximation to a broad class of intractable distributions using a rich set of tractable distributions via constrained optimization over distribution spaces. We present a class of generalized mean field (GMF) algorithms for approximate inference in complex exponential family models, which entails limiting the optimization to the class of cluster-factorizable distributions. GMF is a generic method requiring no model-specific derivations. It factors a complex model into a set of disjoint variable clusters, and uses a set of canonical fixed-point equations to iteratively update the cluster distributions, converging to locally optimal cluster marginals that preserve the original dependency structure within each cluster, hence fully decomposing the overall inference problem. We empirically analyze the effect of different tractable families (clusters of different granularity) on inference quality, and compare GMF with BP on several canonical models. A possible extension to higher-order MF approximation is also discussed.