Arrow Research search

Author name cluster

Jimeng Sun

Possible papers associated with this exact author name in Arrow. This page groups case-insensitive exact name matches and is not a full identity disambiguation profile.

49 papers
1 author row

Possible papers

49

AAAI Conference 2026 Conference Paper

Targeted Pathway Inference for Biological Knowledge Bases via Graph Learning and Explanation

  • Rikuto Kotoge
  • Ziwei Yang
  • Zheng Chen
  • Yushun Dong
  • Yasuko Matsubara
  • Jimeng Sun
  • Yasushi Sakurai

Retrieving targeted pathways in biological knowledge bases, particularly when incorporating wet-lab experimental data, remains a challenging task that often requires downstream analyses and specialized expertise. In this paper, we frame this challenge as a solvable graph learning and explanation task and propose ExPath, a novel subgraph inference framework that explicitly integrates experimental data to classify various graphs (bio-networks) in biological databases. The links (representing pathways) that contribute most to classification can be considered targeted pathways. Our framework can seamlessly integrate biological foundation models to encode the experimental molecular data. We propose ML-oriented biological evaluations and a new metric. Experiments covering 301 bio-networks demonstrate that pathways inferred by ExPath are biologically meaningful, achieving up to 4.5× higher Fidelity+ (necessity) and 14× lower Fidelity- (sufficiency) than explainer baselines, while preserving signaling chains up to 4× longer.

AAAI Conference 2025 Conference Paper

Bi-level Contrastive Learning for Knowledge-Enhanced Molecule Representations

  • Pengcheng Jiang
  • Cao Xiao
  • Tianfan Fu
  • Parminder Bhatia
  • Taha Kass-Hout
  • Jimeng Sun
  • Jiawei Han

Molecular representation learning is vital for various downstream applications, including the analysis and prediction of molecular properties and side effects. While Graph Neural Networks (GNNs) have been a popular framework for modeling molecular data, they often struggle to capture the full complexity of molecular representations. In this paper, we introduce a novel method called Gode, which accounts for the dual-level structure inherent in molecules. Molecules possess an intrinsic graph structure and simultaneously function as nodes within a broader molecular knowledge graph. Gode integrates individual molecular graph representations with multi-domain biochemical data from knowledge graphs. By pre-training two GNNs on different graph structures and employing contrastive learning, Gode effectively fuses molecular structures with their corresponding knowledge graph substructures. This fusion yields a more robust and informative representation, enhancing molecular property predictions by leveraging both chemical and biological information. When fine-tuned across 11 chemical property tasks, our model significantly outperforms existing benchmarks, achieving an average ROC-AUC improvement of 12.7% for classification tasks and an average RMSE/MAE improvement of 34.4% for regression tasks. Notably, Gode surpasses the current leading model in property prediction, with advancements of 2.2% in classification and 7.2% in regression tasks.

AAAI Conference 2025 Conference Paper

Long-Term EEG Partitioning for Seizure Onset Detection

  • Zheng Chen
  • Yasuko Matsubara
  • Yasushi Sakurai
  • Jimeng Sun

Deep learning models have recently shown great success in classifying epileptic patients using EEG recordings. Unfortunately, classification-based methods lack a sound mechanism to detect the onset of seizure events. In this work, we propose a two-stage framework, SODor, that explicitly models seizure onset through a novel task formulation of subsequence clustering. Given an EEG sequence, the framework first learns a set of second-level embeddings with label supervision. It then employs model-based clustering to explicitly capture long-term temporal dependencies in EEG sequences and identify meaningful subsequences. Epochs within a subsequence share a common cluster assignment (normal or seizure), with cluster or state transitions representing successful onset detections. Extensive experiments on three datasets demonstrate that our method can correct misclassifications, achieving 5%-11% classification improvements over other baselines and accurately detecting seizure onsets.

NeurIPS Conference 2025 Conference Paper

Reinforcement Learning for Out-of-Distribution Reasoning in LLMs: An Empirical Study on Diagnosis-Related Group Coding

  • Hanyin Wang
  • Zhenbang Wu
  • Gururaj Kolar
  • Hariprasad Korsapati
  • Brian Bartlett
  • Bryan Hull
  • Jimeng Sun

Diagnosis-Related Group (DRG) codes are essential for hospital reimbursement and operations but require labor-intensive assignment. Large Language Models (LLMs) struggle with DRG coding due to the out-of-distribution (OOD) nature of the task: pretraining corpora rarely contain private clinical or billing data. We introduce DRG-Sapphire, which uses large-scale reinforcement learning (RL) for automated DRG coding from clinical notes. Built on Qwen2.5-7B and trained with Group Relative Policy Optimization (GRPO) using rule-based rewards, DRG-Sapphire introduces a series of RL enhancements to address domain-specific challenges not seen in previous mathematical tasks. Our model achieves state-of-the-art accuracy on the MIMIC-IV benchmark and generates physician-validated reasoning for DRG assignments, significantly enhancing explainability. Our study further sheds light on broader challenges of applying RL to knowledge-intensive, OOD tasks. We observe that RL performance scales approximately linearly with the logarithm of the number of supervised fine-tuning (SFT) examples, suggesting that RL effectiveness is fundamentally constrained by the domain knowledge encoded in the base model. For OOD tasks like DRG coding, strong RL performance requires sufficient knowledge infusion prior to RL. Consequently, scaling SFT may be more effective and computationally efficient than scaling RL alone for such tasks.

NeurIPS Conference 2024 Conference Paper

CARES: A Comprehensive Benchmark of Trustworthiness in Medical Vision Language Models

  • Peng Xia
  • Ze Chen
  • Juanxi Tian
  • Yangrui Gong
  • Ruibo Hou
  • Yue Xu
  • Zhenbang Wu
  • Zhiyuan Fan

Artificial intelligence has significantly impacted medical applications, particularly with the advent of Medical Large Vision Language Models (Med-LVLMs), sparking optimism for the future of automated and personalized healthcare. However, the trustworthiness of Med-LVLMs remains unverified, posing significant risks for future model deployment. In this paper, we introduce CARES, which aims to comprehensively evaluate the trustworthiness of Med-LVLMs across the medical domain. We assess the trustworthiness of Med-LVLMs along five dimensions: trustfulness, fairness, safety, privacy, and robustness. CARES comprises about 41K question-answer pairs in both closed-ended and open-ended formats, covering 16 medical image modalities and 27 anatomical regions. Our analysis reveals that the models consistently exhibit concerns regarding trustworthiness, often displaying factual inaccuracies and failing to maintain fairness across different demographic groups. Furthermore, they are vulnerable to attacks and demonstrate a lack of privacy awareness. We publicly release our benchmark and code at https://github.com/richard-peng-xia/CARES.

AAAI Conference 2024 Conference Paper

ConSequence: Synthesizing Logically Constrained Sequences for Electronic Health Record Generation

  • Brandon Theodorou
  • Shrusti Jain
  • Cao Xiao
  • Jimeng Sun

Generative models can produce synthetic patient records for analytical tasks when real data is unavailable or limited. However, current methods struggle with adhering to domain-specific knowledge and removing invalid data. We present ConSequence, an effective approach to integrating domain knowledge into sequential generative neural network outputs. Our rule-based formulation includes temporal aggregation and antecedent evaluation modules, ensured by an efficient matrix multiplication formulation, to satisfy hard and soft logical constraints across time steps. Existing constraint methods often fail to guarantee constraint satisfaction, lack the ability to handle temporal constraints, and hinder the learning and computational efficiency of the model. In contrast, our approach efficiently handles all types of constraints with guaranteed logical coherence. We demonstrate ConSequence's effectiveness in generating electronic health records, outperforming competitors in achieving complete temporal and spatial constraint satisfaction without compromising runtime performance or generative quality. Specifically, ConSequence successfully prevents all rule violations while improving the model quality in reducing its test perplexity by 5% and incurring less than a 13% slowdown in generation speed compared to an unconstrained model.
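The matrix-multiplication idea behind the constraint modules can be illustrated with a toy sketch. This is not ConSequence's implementation: the rule matrix `R` and helper `enforce_implications` are hypothetical, and the paper's actual modules also handle temporal aggregation across time steps and soft constraints. The sketch only shows how simple "code i implies code j" hard rules can be enforced on generated binary records with a single matrix multiply.

```python
import numpy as np

def enforce_implications(x: np.ndarray, R: np.ndarray) -> np.ndarray:
    """Enforce hard rules of the form "code i implies code j" on a batch of
    binary code vectors x (n_samples x n_codes) using one matrix multiply.
    R[i, j] = 1 encodes the implication; any record that activates code i
    gets its consequent code j switched on as well."""
    xb = x.astype(bool)
    implied = (xb.astype(int) @ R) > 0  # which consequents are triggered
    return (xb | implied).astype(int)

R = np.zeros((3, 3), dtype=int)
R[0, 2] = 1                              # code 0 implies code 2
x = np.array([[1, 0, 0],                 # violates the rule
              [0, 1, 0]])                # unaffected
print(enforce_implications(x, R))        # → [[1 0 1] [0 1 0]]
```

Because the repair is a vectorized boolean operation rather than a per-sample check, it adds negligible overhead at generation time, which is consistent with the small slowdown the abstract reports.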

TMLR Journal 2024 Journal Article

Generating with Confidence: Uncertainty Quantification for Black-box Large Language Models

  • Zhen Lin
  • Shubhendu Trivedi
  • Jimeng Sun

Large language models (LLMs) specializing in natural language generation (NLG) have recently started exhibiting promising capabilities across a variety of domains. However, gauging the trustworthiness of responses generated by LLMs remains an open challenge, with limited research on uncertainty quantification (UQ) for NLG. Furthermore, existing literature typically assumes white-box access to language models, which is becoming unrealistic either due to the closed-source nature of the latest LLMs or computational constraints. In this work, we investigate UQ in NLG for *black-box* LLMs. We first differentiate *uncertainty* vs *confidence*: the former refers to the "dispersion" of the potential predictions for a fixed input, and the latter refers to the confidence in a particular prediction/generation. We then propose and compare several confidence/uncertainty measures, applying them to *selective NLG* where unreliable results could either be ignored or yielded for further assessment. Experiments were carried out with several popular LLMs on question-answering datasets (for evaluation purposes). Results reveal that a simple measure for the semantic dispersion can be a reliable predictor of the quality of LLM responses, providing valuable insights for practitioners on uncertainty management when adopting LLMs.
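The dispersion idea can be sketched in a few lines: sample several generations for the same prompt and measure how much they disagree. This is a minimal illustration, not the paper's measure; lexical Jaccard overlap stands in for the semantic similarity model, and the `dispersion` helper is an assumed name.

```python
from itertools import combinations

def jaccard(a: str, b: str) -> float:
    """Lexical overlap between two responses (a crude stand-in for a
    semantic similarity model)."""
    sa, sb = set(a.lower().split()), set(b.lower().split())
    return len(sa & sb) / len(sa | sb) if sa | sb else 1.0

def dispersion(samples: list[str]) -> float:
    """Average pairwise dissimilarity among sampled generations.
    High dispersion suggests high uncertainty; low dispersion means the
    model answers consistently."""
    pairs = list(combinations(samples, 2))
    if not pairs:
        return 0.0
    return sum(1.0 - jaccard(a, b) for a, b in pairs) / len(pairs)

consistent = ["paris is the capital of france"] * 5
conflicting = ["paris", "london", "berlin", "madrid", "rome"]
print(dispersion(consistent))   # → 0.0
print(dispersion(conflicting))  # → 1.0
```

In a *selective NLG* setting, responses whose dispersion exceeds a threshold would be withheld or routed for human review.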

NeurIPS Conference 2024 Conference Paper

Instruction Tuning Large Language Models to Understand Electronic Health Records

  • Zhenbang Wu
  • Anant Dadu
  • Mike Nalls
  • Faraz Faghri
  • Jimeng Sun

Large language models (LLMs) have shown impressive capabilities in solving a wide range of tasks based on human instructions. However, developing a conversational AI assistant for electronic health record (EHR) data remains challenging due to (1) the lack of large-scale instruction-following datasets and (2) the limitations of existing model architectures in handling complex and heterogeneous EHR data. In this paper, we introduce MIMIC-Instr, a dataset comprising over 400K open-ended instruction-following examples derived from the MIMIC-IV EHR database. This dataset covers various topics and is suitable for instruction-tuning general-purpose LLMs for diverse clinical use cases. Additionally, we propose Llemr, a general framework that enables LLMs to process and interpret EHRs with complex data structures. Llemr demonstrates competitive performance in answering a wide range of patient-related questions based on EHR data. Furthermore, our evaluations on clinical predictive modeling benchmarks reveal that the fine-tuned Llemr achieves performance comparable to state-of-the-art (SOTA) baselines using curated features. The dataset and code are available at https://github.com/zzachw/llemr.

NeurIPS Conference 2024 Conference Paper

KG-FIT: Knowledge Graph Fine-Tuning Upon Open-World Knowledge

  • Pengcheng Jiang
  • Lang Cao
  • Cao Xiao
  • Parminder Bhatia
  • Jimeng Sun
  • Jiawei Han

Knowledge Graph Embedding (KGE) techniques are crucial in learning compact representations of entities and relations within a knowledge graph, facilitating efficient reasoning and knowledge discovery. While existing methods typically focus either on training KGE models solely based on graph structure or fine-tuning pre-trained language models with classification data in KG, KG-FIT leverages LLM-guided refinement to construct a semantically coherent hierarchical structure of entity clusters. By incorporating this hierarchical knowledge along with textual information during the fine-tuning process, KG-FIT effectively captures both global semantics from the LLM and local semantics from the KG. Extensive experiments on the benchmark datasets FB15K-237, YAGO3-10, and PrimeKG demonstrate the superiority of KG-FIT over state-of-the-art pre-trained language model-based methods, achieving improvements of 14.4%, 13.5%, and 11.9% in the Hits@10 metric for the link prediction task, respectively. Furthermore, KG-FIT yields substantial performance gains of 12.6%, 6.7%, and 17.7% compared to the structure-based base models upon which it is built. These results highlight the effectiveness of KG-FIT in incorporating open-world knowledge from LLMs to significantly enhance the expressiveness and informativeness of KG embeddings.

IJCAI Conference 2024 Conference Paper

MediTab: Scaling Medical Tabular Data Predictors via Data Consolidation, Enrichment, and Refinement

  • Zifeng Wang
  • Chufan Gao
  • Cao Xiao
  • Jimeng Sun

Tabular data prediction has been employed in medical applications such as patient health risk prediction. However, existing methods usually revolve around the algorithm design while overlooking the significance of data engineering. Medical tabular datasets frequently exhibit significant heterogeneity across different sources, with limited sample sizes per source. As such, previous predictors are often trained on manually curated small datasets that struggle to generalize across different tabular datasets during inference. This paper proposes to scale medical tabular data predictors (MediTab) to various tabular inputs with varying features. The method uses a data engine that leverages large language models (LLMs) to consolidate tabular samples to overcome the barrier across tables with distinct schemas. It also aligns out-domain data with the target task using a "learn, annotate, and refinement" pipeline. The expanded training data then enables the pre-trained MediTab to infer for arbitrary tabular input in the domain without fine-tuning, resulting in significant improvements over supervised baselines: it reaches an average ranking of 1.57 and 1.00 on 7 patient outcome prediction datasets and 3 trial outcome prediction datasets, respectively. In addition, MediTab exhibits impressive zero-shot performances: it outperforms supervised XGBoost models by 8.9% and 17.2% on average in two prediction tasks, respectively.

JBHI Journal 2024 Journal Article

Patient Stratification Using Electronic Health Records from a Chronic Disease Management Program

  • Robert Chen
  • Jimeng Sun
  • Robert S. Dittus
  • Daniel Fabbri
  • Jacqueline Kirby
  • Cheryl L. Laffer
  • Candace D. McNaughton
  • Bradley Malin

Objective: The goal of this study is to devise a machine learning framework to assist care coordination programs in prognostic stratification to design and deliver personalized care plans and to allocate financial and medical resources effectively. Materials and Methods: This study is based on a de-identified cohort of 2,521 hypertension patients from a chronic care coordination program at the Vanderbilt University Medical Center. Patients were modeled as vectors of features derived from electronic health records (EHRs) over a six-year period. We applied a stepwise regression to identify risk factors associated with a decrease in mean arterial pressure of at least 2 mmHg after program enrollment. The resulting features were subsequently validated via a logistic regression classifier. Finally, risk factors were applied to group the patients through model-based clustering. Results: We identified a set of predictive features that consisted of a mix of demographic, medication, and diagnostic concepts. Logistic regression over these features yielded an area under the ROC curve (AUC) of 0.71 (95% CI: [0.67, 0.76]). Based on these features, four clinically meaningful groups were identified through clustering - two of which represented patients with more severe disease profiles, while the remaining two represented patients with mild disease profiles. Discussion: Patients with hypertension can exhibit significant variation in their blood pressure control status and responsiveness to therapy. Yet this work shows that a clustering analysis can generate more homogeneous patient groups, which may aid clinicians in designing and implementing customized care programs. Conclusion: The study shows that predictive modeling and clustering using EHR data can be beneficial for providing a systematic, generalized approach for care providers to tailor their management approach based upon patient-level factors.

IJCAI Conference 2024 Conference Paper

Recent Advances in Predictive Modeling with Electronic Health Records

  • Jiaqi Wang
  • Junyu Luo
  • Muchao Ye
  • Xiaochen Wang
  • Yuan Zhong
  • Aofei Chang
  • Guanjie Huang
  • Ziyi Yin

The development of electronic health records (EHR) systems has enabled the collection of a vast amount of digitized patient data. However, utilizing EHR data for predictive modeling presents several challenges due to its unique characteristics. With the advancements in machine learning techniques, deep learning has demonstrated its superiority in various applications, including healthcare. This survey systematically reviews recent advances in deep learning-based predictive models using EHR data. Specifically, we introduce the background of EHR data and provide a mathematical definition of the predictive modeling task. We then categorize and summarize predictive deep models from multiple perspectives. Furthermore, we present benchmarks and toolkits relevant to predictive modeling in healthcare. Finally, we conclude this survey by discussing open challenges and suggesting promising directions for future research.

NeurIPS Conference 2023 Conference Paper

An Iterative Self-Learning Framework for Medical Domain Generalization

  • Zhenbang Wu
  • Huaxiu Yao
  • David Liebovitz
  • Jimeng Sun

Deep learning models have been widely used to assist doctors with clinical decision-making. However, these models often encounter a significant performance drop when applied to data that differs from the distribution they were trained on. This challenge is known as the domain shift problem. Existing domain generalization algorithms attempt to address this problem by assuming the availability of domain IDs and training a single model to handle all domains. However, in healthcare settings, patients can be classified into numerous latent domains, where the actual domain categorizations are unknown. Furthermore, each patient domain exhibits distinct clinical characteristics, making it sub-optimal to train a single model for all domains. To overcome these limitations, we propose SLGD, a self-learning framework that iteratively discovers decoupled domains and trains personalized classifiers for each decoupled domain. We evaluate the generalizability of SLGD across spatial and temporal data distribution shifts on two real-world public EHR datasets: eICU and MIMIC-IV. Our results show that SLGD achieves up to 11% improvement in the AUPRC score over the best baseline.

NeurIPS Conference 2023 Conference Paper

BIOT: Biosignal Transformer for Cross-data Learning in the Wild

  • Chaoqi Yang
  • M Westover
  • Jimeng Sun

Biological signals, such as electroencephalograms (EEG), play a crucial role in numerous clinical applications, exhibiting diverse data formats and quality profiles. Current deep learning models for biosignals (based on CNN, RNN, and Transformers) are typically specialized for specific datasets and clinical settings, limiting their broader applicability. This paper explores the development of a flexible biosignal encoder architecture that enables pre-training on multiple datasets and fine-tuning on downstream biosignal tasks with different formats. To overcome the unique challenges associated with biosignals of various formats, such as mismatched channels, variable sample lengths, and prevalent missing values, we propose Biosignal Transformer (BIOT). The proposed BIOT model can enable cross-data learning with mismatched channels, variable lengths, and missing values by tokenizing different biosignals into a unified "sentence" structure. Specifically, we tokenize each channel separately into fixed-length segments containing local signal features and then rearrange the segments to form a long "sentence". Channel embeddings and relative position embeddings are added to each segment (viewed as a "token") to preserve spatio-temporal features. The BIOT model is versatile and applicable to various biosignal learning settings across different datasets, including joint pre-training for larger models. Comprehensive evaluations on EEG, electrocardiogram (ECG), and human activity sensory signals demonstrate that BIOT outperforms robust baselines in common settings and facilitates learning across multiple datasets with different formats. Using the CHB-MIT seizure detection task as an example, our vanilla BIOT model shows a 3% improvement over baselines in balanced accuracy, and the pre-trained BIOT models (optimized from other data sources) can further bring up to 4% improvements. Our repository is public at https://github.com/ycq091044/BIOT.

NeurIPS Conference 2023 Conference Paper

CoDrug: Conformal Drug Property Prediction with Density Estimation under Covariate Shift

  • Siddhartha Laghuvarapu
  • Zhen Lin
  • Jimeng Sun

In drug discovery, it is vital to confirm the predictions of pharmaceutical properties from computational models using costly wet-lab experiments. Hence, obtaining reliable uncertainty estimates is crucial for prioritizing drug molecules for subsequent experimental validation. Conformal Prediction (CP) is a promising tool for creating such prediction sets for molecular properties with a coverage guarantee. However, the exchangeability assumption of CP is often challenged with covariate shift in drug discovery tasks: Most datasets contain limited labeled data, which may not be representative of the vast chemical space from which molecules are drawn. To address this limitation, we propose a method called CoDrug that employs an energy-based model leveraging both training data and unlabelled data, and Kernel Density Estimation (KDE) to assess the densities of a molecule set. The estimated densities are then used to weigh the molecule samples while building prediction sets and rectifying for distribution shift. In extensive experiments involving realistic distribution drifts in various small-molecule drug discovery tasks, we demonstrate the ability of CoDrug to provide valid prediction sets and its utility in addressing the distribution shift arising from de novo drug design models. On average, using CoDrug can reduce the coverage gap by over 35% when compared to conformal prediction sets not adjusted for covariate shift.
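The density-weighted conformal step can be sketched generically. The snippet below is a standard weighted split-conformal quantile under assumed likelihood-ratio weights; the function name is illustrative, and CoDrug's actual weights come from its energy-based model plus KDE rather than being supplied directly.

```python
import numpy as np

def weighted_conformal_quantile(scores, weights, alpha=0.1):
    """Weighted (1 - alpha) quantile of calibration nonconformity scores.
    `weights` are likelihood ratios w(x) = p_test(x) / p_train(x), e.g.
    estimated from densities of the calibration and test distributions."""
    scores, weights = np.asarray(scores, float), np.asarray(weights, float)
    order = np.argsort(scores)
    scores, weights = scores[order], weights[order]
    # Reserve unit weight for the (unseen) test point, as in weighted split CP.
    p = np.append(weights, 1.0) / (weights.sum() + 1.0)
    cdf = np.cumsum(p)
    idx = np.searchsorted(cdf, 1 - alpha)
    return scores[min(idx, len(scores) - 1)]

# With uniform weights (no covariate shift) this reduces to ordinary split CP.
rng = np.random.default_rng(0)
cal_scores = rng.normal(size=1000)
q = weighted_conformal_quantile(cal_scores, np.ones(1000), alpha=0.1)
```

The returned threshold `q` is then used to build the prediction set; under covariate shift, samples from regions over-represented at test time receive larger weights and pull the quantile accordingly.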

NeurIPS Conference 2022 Conference Paper

ATD: Augmenting CP Tensor Decomposition by Self Supervision

  • Chaoqi Yang
  • Cheng Qian
  • Navjot Singh
  • Cao (Danica) Xiao
  • M Westover
  • Edgar Solomonik
  • Jimeng Sun

Tensor decompositions are powerful tools for dimensionality reduction and feature interpretation of multidimensional data such as signals. Existing tensor decomposition objectives (e.g., Frobenius norm) are designed for fitting raw data under statistical assumptions, which may not align with downstream classification tasks. In practice, the raw input tensor can contain irrelevant information, while data augmentation techniques may be used to smooth out class-irrelevant noise in samples. This paper addresses the above challenges by proposing augmented tensor decomposition (ATD), which effectively incorporates data augmentations and self-supervised learning (SSL) to boost downstream classification. To address the non-convexity of the new augmented objective, we develop an iterative method that enables the optimization to follow an alternating least squares (ALS) fashion. We evaluate our proposed ATD on multiple datasets. It can achieve a 0.8%~2.5% accuracy gain over tensor-based baselines. Also, our ATD model shows comparable or better performance (e.g., up to 15% in accuracy) over self-supervised and autoencoder baselines while using less than 5% of the learnable parameters of these baseline models.
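For reference, the classical Frobenius-norm CP-ALS baseline that ATD augments can be sketched in a few lines. This is a plain rank-R alternating-least-squares solver for a 3-way tensor; ATD's augmented self-supervised objective is not reproduced here, and the helper names are illustrative.

```python
import numpy as np

def khatri_rao(A, B):
    """Column-wise Kronecker product of two factor matrices."""
    return np.einsum('ir,jr->ijr', A, B).reshape(-1, A.shape[1])

def cp_als(T, rank, n_iter=100, seed=0):
    """Rank-`rank` CP decomposition of a 3-way tensor T by alternating least
    squares on the Frobenius-norm fitting objective."""
    rng = np.random.default_rng(seed)
    I, J, K = T.shape
    A = rng.standard_normal((I, rank))
    B = rng.standard_normal((J, rank))
    C = rng.standard_normal((K, rank))
    T1 = T.reshape(I, -1)                      # mode-1 unfolding
    T2 = np.moveaxis(T, 1, 0).reshape(J, -1)   # mode-2 unfolding
    T3 = np.moveaxis(T, 2, 0).reshape(K, -1)   # mode-3 unfolding
    for _ in range(n_iter):
        A = T1 @ np.linalg.pinv(khatri_rao(B, C).T)
        B = T2 @ np.linalg.pinv(khatri_rao(A, C).T)
        C = T3 @ np.linalg.pinv(khatri_rao(A, B).T)
    return A, B, C

# Recover an exact rank-2 tensor built from known factors.
rng = np.random.default_rng(1)
A0, B0, C0 = (rng.standard_normal((d, 2)) for d in (4, 5, 6))
T = np.einsum('ir,jr,kr->ijk', A0, B0, C0)
A, B, C = cp_als(T, rank=2, n_iter=200)
```

ATD replaces the plain fitting loss with an augmented, contrastive objective but keeps this alternating update structure, which is what makes the optimization tractable.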

TMLR Journal 2022 Journal Article

Conformal Prediction Intervals with Temporal Dependence

  • Zhen Lin
  • Shubhendu Trivedi
  • Jimeng Sun

Cross-sectional prediction is common in many domains such as healthcare, including forecasting tasks using electronic health records, where different patients form a cross-section. We focus on the task of constructing valid prediction intervals (PIs) in time series regression with a cross-section. A prediction interval is considered valid if it covers the true response with (a pre-specified) high probability. We first distinguish between two notions of validity in such a setting: cross-sectional and longitudinal. Cross-sectional validity is concerned with validity across the cross-section of the time series data, while longitudinal validity accounts for the temporal dimension. Coverage guarantees along both these dimensions are ideally desirable; however, we show that distribution-free longitudinal validity is theoretically impossible. Despite this limitation, we propose Conformal Prediction with Temporal Dependence (CPTD), a procedure that is able to maintain strict cross-sectional validity while improving longitudinal coverage. CPTD is post-hoc and light-weight, and can easily be used in conjunction with any prediction model as long as a calibration set is available. We focus on neural networks due to their ability to model complicated data such as diagnosis codes for time series regression, and perform extensive experimental validation to verify the efficacy of our approach. We find that CPTD outperforms baselines on a variety of datasets by improving longitudinal coverage and often providing more efficient (narrower) PIs.

NeurIPS Conference 2022 Conference Paper

Conformal Prediction with Temporal Quantile Adjustments

  • Zhen Lin
  • Shubhendu Trivedi
  • Jimeng Sun

We develop Temporal Quantile Adjustment (TQA), a general method to construct efficient and valid prediction intervals (PIs) for regression on cross-sectional time series data. Such data is common in many domains, including econometrics and healthcare. A canonical example in healthcare is predicting patient outcomes using physiological time-series data, where a population of patients composes a cross-section. Reliable PI estimators in this setting must address two distinct notions of coverage: cross-sectional coverage across a cross-sectional slice, and longitudinal coverage along the temporal dimension for each time series. Recent works have explored adapting Conformal Prediction (CP) to obtain PIs in the time series context. However, none handles both notions of coverage simultaneously. CP methods typically query a pre-specified quantile from the distribution of nonconformity scores on a calibration set. TQA adjusts the quantile to query in CP at each time $t$, accounting for both cross-sectional and longitudinal coverage in a theoretically-grounded manner. The post-hoc nature of TQA facilitates its use as a general wrapper around any time series regression model. We validate TQA's performance through extensive experimentation: TQA generally obtains efficient PIs and improves longitudinal coverage while preserving cross-sectional coverage.
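The baseline quantile-query step that TQA adjusts looks like the following in standard split conformal prediction. This is a generic sketch with an illustrative function name; TQA's contribution, the per-time-step adjustment of the queried quantile, is not reproduced here.

```python
import numpy as np

def split_conformal_interval(residuals, y_hat, alpha=0.1):
    """Standard split-CP interval: query the (1 - alpha) empirical quantile
    of calibration nonconformity scores |y - y_hat| and pad the new
    prediction by that amount."""
    residuals = np.asarray(residuals, float)
    n = len(residuals)
    # Finite-sample-corrected quantile level, capped at 1.
    level = min(1.0, np.ceil((n + 1) * (1 - alpha)) / n)
    q = np.quantile(residuals, level)
    return y_hat - q, y_hat + q

rng = np.random.default_rng(0)
cal_residuals = np.abs(rng.normal(size=2000))  # calibration |y - y_hat|
lo, hi = split_conformal_interval(cal_residuals, y_hat=5.0, alpha=0.1)
```

Because the pre-specified level `1 - alpha` is fixed, this baseline only targets cross-sectional coverage; TQA instead makes the queried level a function of the time step and the past coverage errors of each series.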

IJCAI Conference 2022 Conference Paper

GOCPT: Generalized Online Canonical Polyadic Tensor Factorization and Completion

  • Chaoqi Yang
  • Cheng Qian
  • Jimeng Sun

Low-rank tensor factorization or completion is well-studied and applied in various online settings, such as online tensor factorization (where the temporal mode grows) and online tensor completion (where incomplete slices arrive gradually). However, in many real-world settings, tensors may have more complex evolving patterns: (i) one or more modes can grow; (ii) missing entries may be filled; (iii) existing tensor elements can change. Existing methods cannot support such complex scenarios. To fill the gap, this paper proposes a Generalized Online Canonical Polyadic (CP) Tensor factorization and completion framework (named GOCPT) for this general setting, where we maintain the CP structure of such dynamic tensors during the evolution. We show that existing online tensor factorization and completion setups can be unified under the GOCPT framework. Furthermore, we propose a variant, named GOCPTE, to deal with cases where historical tensor elements are unavailable (e.g., privacy protection), which achieves similar fitness as GOCPT but with much less computational cost. Experimental results demonstrate that our GOCPT can improve fitness by up to 2.8% on the JHU Covid data and 9.2% on a proprietary patient claim dataset over baselines. Our variant GOCPTE shows up to 1.2% and 5.5% fitness improvement on two datasets with about 20% speedup compared to the best model.

NeurIPS Conference 2022 Conference Paper

Reinforced Genetic Algorithm for Structure-based Drug Design

  • Tianfan Fu
  • Wenhao Gao
  • Connor Coley
  • Jimeng Sun

Structure-based drug design (SBDD) aims to discover drug candidates by finding molecules (ligands) that bind tightly to a disease-related protein (targets), which is the primary approach to computer-aided drug discovery. Recently, applying deep generative models for three-dimensional (3D) molecular design conditioned on protein pockets to solve SBDD has attracted much attention, but their formulation as probabilistic modeling often leads to unsatisfactory optimization performance. On the other hand, traditional combinatorial optimization methods such as genetic algorithms (GA) have demonstrated state-of-the-art performance in various molecular optimization tasks. However, they do not utilize protein target structure to inform design steps but rely on a random-walk-like exploration, which leads to unstable performance and no knowledge transfer between different tasks despite the similar binding physics. To achieve a more stable and efficient SBDD, we propose Reinforced Genetic Algorithm (RGA) that uses neural models to prioritize the profitable design steps and suppress random-walk behavior. The neural models take the 3D structure of the targets and ligands as inputs and are pre-trained using native complex structures to utilize the knowledge of the shared binding physics from different targets and then fine-tuned during optimization. We conduct thorough empirical studies on optimizing binding affinity to various disease targets and show that RGA outperforms the baselines in terms of docking scores and is more robust to random initializations. The ablation study also indicates that the training on different targets helps improve the performance by leveraging the shared underlying physics of the binding processes. The code is available at https://github.com/futianfan/reinforced-genetic-algorithm.

NeurIPS Conference 2022 Conference Paper

Sample Efficiency Matters: A Benchmark for Practical Molecular Optimization

  • Wenhao Gao
  • Tianfan Fu
  • Jimeng Sun
  • Connor Coley

Molecular optimization is a fundamental goal in the chemical sciences and is of central interest to drug and material design. In recent years, significant progress has been made on challenging problems across computational molecular optimization, emphasizing high validity, diversity, and, most recently, synthesizability. Despite this progress, many papers report results on trivial or self-designed tasks, making it hard to directly assess the performance of new methods. Moreover, the sample efficiency of the optimization — the number of molecules evaluated by the oracle — is rarely discussed, despite being an essential consideration for realistic discovery applications. To fill this gap, we have created an open-source benchmark for practical molecular optimization, PMO, to facilitate the transparent and reproducible evaluation of algorithmic advances in molecular optimization. This paper thoroughly investigates the performance of 25 molecular design algorithms on 23 single-objective (scalar) optimization tasks with a particular focus on sample efficiency. Our results show that most "state-of-the-art" methods fail to outperform their predecessors under a limited oracle budget of 10K queries, and that no existing algorithm can efficiently solve certain molecular optimization problems in this setting. We analyze the influence of optimization algorithm choices, molecular assembly strategies, and oracle landscapes on optimization performance to inform future algorithm development and benchmarking. PMO provides a standardized experimental setup to comprehensively evaluate and compare new molecule optimization methods with existing ones. All code can be found at https://github.com/wenhao-gao/mol_opt.

AAAI Conference 2022 Conference Paper

SCRIB: Set-Classifier with Class-Specific Risk Bounds for Blackbox Models

  • Zhen Lin
  • Lucas Glass
  • M. Brandon Westover
  • Cao Xiao
  • Jimeng Sun

Despite the success of deep learning (DL) in classification problems, DL classifiers do not provide a sound mechanism for deciding when to refrain from predicting. Recent works tried to control the overall prediction risk with classification-with-rejection options. However, existing works overlook the different significance of different classes. We introduce the Set-classifier with Class-specific RIsk Bounds (SCRIB) to tackle this problem, assigning multiple labels to each example. Given the output of a black-box model on the validation set, SCRIB constructs a set-classifier that controls the class-specific prediction risks. The key idea is to reject when the set-classifier returns more than one label. We validated SCRIB on several medical applications, including sleep staging on electroencephalogram (EEG) data, X-ray COVID image classification, and atrial fibrillation detection based on electrocardiogram (ECG) data. SCRIB obtained desirable class-specific risks, which are 35%–88% closer to the target risks than those of baseline methods.

NeurIPS Conference 2022 Conference Paper

TransTab: Learning Transferable Tabular Transformers Across Tables

  • Zifeng Wang
  • Jimeng Sun

Tabular data (or tables) are the most widely used data format in machine learning (ML). However, ML models often assume the table structure stays fixed between training and testing. Before ML modeling, heavy data cleaning is required to merge disparate tables with different columns, and this preprocessing often incurs significant data waste (e.g., removing unmatched columns and samples). How can we learn ML models from multiple tables with partially overlapping columns? How can we incrementally update ML models as more columns become available over time? Can we leverage model pretraining on multiple distinct tables? How can we train an ML model that predicts on an unseen table? To answer these questions, we propose to relax fixed table structures by introducing a Transferable Tabular Transformer (TransTab) for tables. The goal of TransTab is to convert each sample (a row in the table) into a generalizable embedding vector and then apply stacked transformers for feature encoding. One methodological insight is to combine column descriptions and table cells as the raw input to a gated transformer model; the other is to introduce supervised and self-supervised pretraining to improve model performance. We compare TransTab with multiple baseline methods on diverse benchmark datasets and five oncology clinical trial datasets. Overall, TransTab ranks 1.00, 1.00, and 1.78 out of 12 methods in the supervised learning, incremental feature learning, and transfer learning scenarios, respectively, and the proposed pretraining leads to a 2.3% AUC lift on average over supervised learning.

IJCAI Conference 2021 Conference Paper

Change Matters: Medication Change Prediction with Recurrent Residual Networks

  • Chaoqi Yang
  • Cao Xiao
  • Lucas Glass
  • Jimeng Sun

Deep learning is revolutionizing predictive healthcare, including recommending medications to patients with complex health conditions. Existing approaches focus on predicting all medications for the current visit, which often overlap with medications from previous visits. A more clinically relevant task is to identify medication changes. In this paper, we propose a new recurrent residual network, named MICRON, for medication change prediction. MICRON takes the changes in patient health records as input and learns to update a hidden medication vector and the medication set recurrently with a reconstruction design. The medication vector acts as a memory cell that encodes longitudinal information about medications. Unlike traditional methods that require the entire patient history for prediction, MICRON has a residual-based inference that allows sequential updating based only on new patient features (e.g., new diagnoses in the recent visit), which is efficient. We evaluated MICRON on real inpatient and outpatient datasets, where it achieves 3.5% and 7.8% relative improvements over the best baseline in F1 score, respectively. MICRON also requires fewer parameters, which significantly reduces the training time to 38.3s per epoch, a 1.5× speed-up.

NeurIPS Conference 2021 Conference Paper

Locally Valid and Discriminative Prediction Intervals for Deep Learning Models

  • Zhen Lin
  • Shubhendu Trivedi
  • Jimeng Sun

Efficient and theoretically sound uncertainty quantification is crucial for building trust in deep learning models for critical real-world applications, yet it remains challenging. Useful uncertainty information is expected to have two key properties: it should be valid (guaranteeing coverage) and discriminative (more uncertain when the expected risk is high). Moreover, when combined with deep learning (DL) methods, it should be scalable and affect the DL model performance minimally. Most existing Bayesian methods lack frequentist coverage guarantees and usually affect model performance. The few available frequentist methods are rarely discriminative and/or violate coverage guarantees due to unrealistic assumptions. Moreover, many methods are expensive or require substantial modifications to the base neural network. Building upon recent advances in conformal prediction [13, 33] and leveraging the classical idea of kernel regression, we propose Locally Valid and Discriminative prediction intervals (LVD), a simple, efficient, and lightweight method to construct discriminative prediction intervals (PIs) for almost any DL model. With no assumptions on the data distribution, such PIs also offer finite-sample local coverage guarantees (in contrast to the simpler marginal coverage). We empirically verify, using diverse datasets, that besides being the only locally valid method for DL, LVD also exceeds or matches the performance (including coverage rate and prediction accuracy) of existing uncertainty quantification methods, while offering additional benefits in scalability and flexibility.

AAAI Conference 2021 Conference Paper

MIMOSA: Multi-constraint Molecule Sampling for Molecule Optimization

  • Tianfan Fu
  • Cao Xiao
  • Xinhao Li
  • Lucas M. Glass
  • Jimeng Sun

Molecule optimization is a fundamental task for accelerating drug discovery, with the goal of generating new valid molecules that maximize multiple drug properties while maintaining similarity to the input molecule. Existing generative models and reinforcement learning approaches have made initial progress but still face difficulties in simultaneously optimizing multiple drug properties. To address these challenges, we propose the MultI-constraint MOlecule SAmpling (MIMOSA) approach, a sampling framework that uses the input molecule as an initial guess and samples molecules from the target distribution. MIMOSA first pretrains two property-agnostic graph neural networks (GNNs) for molecule topology and substructure-type prediction, where a substructure can be either an atom or a single ring. At each iteration, MIMOSA uses the GNNs' predictions and employs three basic substructure operations (add, replace, delete) to generate new molecules and associated weights. The weights can encode multiple constraints, including similarity and drug-property constraints, based on which we select promising molecules for the next iteration. MIMOSA enables flexible encoding of multiple property and similarity constraints, efficiently generates new molecules that satisfy various property constraints, and achieves up to 49.1% relative improvement over the best baseline in success rate.

IJCAI Conference 2021 Conference Paper

Multi-version Tensor Completion for Time-delayed Spatio-temporal Data

  • Cheng Qian
  • Nikos Kargas
  • Cao Xiao
  • Lucas Glass
  • Nicholas Sidiropoulos
  • Jimeng Sun

Real-world spatio-temporal data is often incomplete or inaccurate due to various data loading delays. For example, a location-disease-time tensor of case counts can have multiple delayed updates of recent temporal slices for some locations or diseases. Recovering such missing or noisy (under-reported) elements of the input tensor can be viewed as a generalized tensor completion problem. Existing tensor completion methods usually assume that i) missing elements are randomly distributed and ii) the noise for each tensor element is i.i.d. zero-mean. Both assumptions can be violated for spatio-temporal tensor data. We often observe multiple versions of the input tensor with different under-reporting noise levels. The amount of noise can be time- or location-dependent as more updates are progressively introduced to the tensor. We model such dynamic data as a multi-version tensor with an extra tensor mode capturing the data updates. We propose a low-rank tensor model to predict the updates over time. We demonstrate that our method can accurately predict the ground-truth values of many real-world tensors, obtaining up to 27.2% lower root mean squared error than the best baseline method. Finally, we extend our method to track the tensor data over time, leading to significant computational savings.

IJCAI Conference 2021 Conference Paper

SafeDrug: Dual Molecular Graph Encoders for Recommending Effective and Safe Drug Combinations

  • Chaoqi Yang
  • Cao Xiao
  • Fenglong Ma
  • Lucas Glass
  • Jimeng Sun

Medication recommendation is an essential task of AI for healthcare. Existing works focus on recommending drug combinations for patients with complex health conditions solely based on their electronic health records, and thus have the following limitations: (1) important data such as drug molecule structures are not utilized in the recommendation process; (2) drug-drug interactions (DDI) are modeled implicitly, which can lead to sub-optimal results. To address these limitations, we propose a DDI-controllable drug recommendation model named SafeDrug that leverages drugs' molecule structures and models DDIs explicitly. SafeDrug is equipped with a global message passing neural network (MPNN) module and a local bipartite learning module to fully encode the connectivity and functionality of drug molecules. SafeDrug also has a controllable loss function to effectively control the DDI level in the recommended drug combinations. On a benchmark dataset, SafeDrug reduces DDI by a relative 19.43% and improves the Jaccard similarity between recommended and actually prescribed drug combinations by 2.88% over previous approaches. Moreover, SafeDrug requires far fewer parameters than previous deep-learning-based approaches, leading to about 14% faster training and around a 2× speed-up in inference.

AAAI Conference 2021 Conference Paper

STELAR: Spatio-temporal Tensor Factorization with Latent Epidemiological Regularization

  • Nikos Kargas
  • Cheng Qian
  • Nicholas D. Sidiropoulos
  • Cao Xiao
  • Lucas M. Glass
  • Jimeng Sun

Accurate prediction of the transmission of epidemic diseases such as COVID-19 is crucial for implementing effective mitigation measures. In this work, we develop a tensor method to predict the evolution of epidemic trends for many regions simultaneously. We construct a 3-way spatio-temporal tensor (location, attribute, time) of case counts and propose a nonnegative tensor factorization with latent epidemiological model regularization named STELAR. Unlike standard tensor factorization methods which cannot predict slabs ahead, STELAR enables long-term prediction by incorporating latent temporal regularization through a system of discrete-time difference equations of a widely adopted epidemiological model. We use latent instead of location/attribute-level epidemiological dynamics to capture common epidemic profile sub-types and improve collaborative learning and prediction. We conduct experiments using both county- and state-level COVID-19 data and show that our model can identify interesting latent patterns of the epidemic. Finally, we evaluate the predictive ability of our method and show superior performance compared to the baselines, achieving up to 21% lower root mean square error and 25% lower mean absolute error for county-level prediction.

AAAI Conference 2021 Conference Paper

SWIFT: Scalable Wasserstein Factorization for Sparse Nonnegative Tensors

  • Ardavan Afshar
  • Kejing Yin
  • Sherry Yan
  • Cheng Qian
  • Joyce Ho
  • Haesun Park
  • Jimeng Sun

Existing tensor factorization methods assume that the input tensor follows some specific distribution (i.e., Poisson, Bernoulli, or Gaussian) and solve the factorization by minimizing an empirical loss function defined for that distribution. However, this approach suffers from several drawbacks: 1) in reality, the underlying distributions are complicated and unknown, making them infeasible to approximate with a simple distribution; 2) the correlation across dimensions of the input tensor is not well utilized, leading to sub-optimal performance. Although heuristics have been proposed to incorporate such correlation as side information under a Gaussian distribution, they cannot easily be generalized to other distributions. Thus, a more principled way of utilizing correlation in tensor factorization models remains an open challenge. Without assuming any explicit distribution, we formulate tensor factorization as an optimal transport problem with the Wasserstein distance, which can handle non-negative inputs. We introduce SWIFT, which minimizes the Wasserstein distance between the input tensor and its reconstruction. In particular, we define the N-th-order tensor Wasserstein loss for the widely used tensor CP factorization and derive the optimization algorithm that minimizes it. By leveraging sparsity structure and different equivalent formulations for optimizing computational efficiency, SWIFT is as scalable as other well-known CP algorithms. Using the factor matrices as features, SWIFT achieves up to 9.65% and 11.31% relative improvement over baselines for downstream prediction tasks. Under noisy conditions, SWIFT achieves up to 15% and 17% relative improvement over the best competitors for the prediction tasks.

NeurIPS Conference 2021 Conference Paper

Therapeutics Data Commons: Machine Learning Datasets and Tasks for Drug Discovery and Development

  • Kexin Huang
  • Tianfan Fu
  • Wenhao Gao
  • Yue Zhao
  • Yusuf Roohani
  • Jure Leskovec
  • Connor Coley
  • Cao Xiao

Therapeutics machine learning is an emerging field with incredible opportunities for innovation and impact. However, advancement in this field requires the formulation of meaningful tasks and careful curation of datasets. Here, we introduce Therapeutics Data Commons (TDC), the first unifying platform to systematically access and evaluate machine learning across the entire range of therapeutics. To date, TDC includes 66 AI-ready datasets spread across 22 learning tasks and spanning the discovery and development of safe and effective medicines. TDC also provides an ecosystem of tools and community resources, including 33 data functions and diverse types of data splits, 23 strategies for systematic model evaluation, 17 molecule generation oracles, and 29 public leaderboards. All resources are integrated and accessible via an open Python library. We carry out extensive experiments on selected datasets, demonstrating that even the strongest algorithms fall short of solving key therapeutics challenges, including distributional shifts, multi-scale and multi-modal learning, and robust generalization to novel data points. We envision that TDC can facilitate algorithmic advances and considerably accelerate machine-learning model development, validation and transition into biomedical and clinical implementation. TDC is available at https://tdcommons.ai.

AAAI Conference 2020 Conference Paper

CASTER: Predicting Drug Interactions with Chemical Substructure Representation

  • Kexin Huang
  • Cao Xiao
  • Trong Hoang
  • Lucas Glass
  • Jimeng Sun

Adverse drug-drug interactions (DDIs) remain a leading cause of morbidity and mortality. Identifying potential DDIs during the drug design process is critical for patients and society. Although several computational models have been proposed for DDI prediction, there are still limitations: (1) specialized design of drug representations for DDI prediction is lacking; (2) predictions are based on limited labelled data and do not generalize well to unseen drugs or DDIs; and (3) models are characterized by a large number of parameters and are thus hard to interpret. In this work, we develop a ChemicAl SubstrucTurE Representation (CASTER) framework that predicts DDIs given the chemical structures of drugs. CASTER aims to mitigate these limitations via (1) a sequential pattern mining module rooted in the DDI mechanism to efficiently characterize functional sub-structures of drugs; (2) an auto-encoding module that leverages both labelled and unlabelled chemical structure data to improve predictive accuracy and generalizability; and (3) a dictionary learning module that explains the prediction via a small set of coefficients measuring the relevance of each input sub-structure to the DDI outcome. We evaluated CASTER on two real-world DDI datasets and showed that it performed better than state-of-the-art baselines and provided interpretable predictions.

AAAI Conference 2020 Conference Paper

CONAN: Complementary Pattern Augmentation for Rare Disease Detection

  • Limeng Cui
  • Siddharth Biswal
  • Lucas M. Glass
  • Greg Lever
  • Jimeng Sun
  • Cao Xiao

Rare diseases affect hundreds of millions of people worldwide but are hard to detect since they have extremely low prevalence rates (varying from 1/1,000 to 1/200,000 patients) and are massively underdiagnosed. How do we reliably detect rare diseases with such low prevalence rates? How can we further leverage patients with possibly uncertain diagnoses to improve detection? In this paper, we propose a Complementary pattern Augmentation (CONAN) framework for rare disease detection. CONAN combines ideas from both adversarial training and max-margin classification. It first learns self-attentive and hierarchical embeddings for patient pattern characterization. Then, we develop a complementary generative adversarial network (GAN) model to generate candidate positive and negative samples from the uncertain patients by encouraging a max-margin between classes. In addition, CONAN has a disease detector that serves as the discriminator during adversarial training for identifying rare diseases. We evaluated CONAN on two disease detection tasks. For low-prevalence inflammatory bowel disease (IBD) detection, CONAN achieved a 0.96 precision-recall area under the curve (PR-AUC), a 50.1% relative improvement over the best baseline. For detection of the rare disease idiopathic pulmonary fibrosis (IPF), CONAN achieves a 0.22 PR-AUC, a 41.3% relative improvement over the best baseline.

AAAI Conference 2020 Conference Paper

CORE: Automatic Molecule Optimization Using Copy & Refine Strategy

  • Tianfan Fu
  • Cao Xiao
  • Jimeng Sun

Molecule optimization is about generating a molecule Y with more desirable properties based on an input molecule X. The state-of-the-art approaches partition the molecules into a large set of substructures S and grow the new molecule structure by iteratively predicting which substructure from S to add. However, since the set of available substructures S is large, such an iterative prediction task is often inaccurate, especially for substructures that are infrequent in the training data. To address this challenge, we propose a new generating strategy called "Copy&Refine" (CORE): at each step, the generator first decides whether to copy an existing substructure from input X or to generate a new substructure, then the most promising substructure is added to the new molecule. Combined with scaffolding-tree generation and adversarial training, CORE can significantly improve several recent molecule optimization methods on various measures, including drug-likeness (QED), dopamine receptor (DRD2) activity, and penalized LogP. We tested CORE and baselines on the ZINC database; CORE obtained up to 11% and 21% relative improvement over the baselines in success rate on the complete test set and on the subset with infrequent substructures, respectively.

AAAI Conference 2020 Conference Paper

Doctor2Vec: Dynamic Doctor Representation Learning for Clinical Trial Recruitment

  • Siddharth Biswal
  • Cao Xiao
  • Lucas M. Glass
  • Elizabeth Milkovits
  • Jimeng Sun

Massive electronic health records (EHRs) enable the success of learning accurate patient representations to support various predictive health applications. In contrast, doctor representations have not been well studied, even though doctors play pivotal roles in healthcare. How do we construct the right doctor representations? How do we use doctor representations to solve important health analytics problems? In this work, we study the problem of clinical trial recruitment: identifying the right doctors to help conduct a trial based on the trial description and the patient EHR data of those doctors. We propose Doctor2Vec, which simultaneously learns 1) doctor representations from EHR data and 2) trial representations from the description and categorical information about the trials. In particular, Doctor2Vec utilizes a dynamic memory network where a doctor's experience with patients is stored in the memory bank, and the network dynamically assigns weights based on the trial representation via an attention mechanism. Validated on large real-world trial and EHR data including 2,609 trials, 25K doctors, and 430K patients, Doctor2Vec demonstrated improved performance over the best baseline by up to 8.7% in PR-AUC. We also demonstrated that the Doctor2Vec embedding can be transferred to benefit data-insufficient settings, including trial recruitment in less populated or newly explored countries (13.7% improvement) and for rare diseases (8.1% improvement in PR-AUC).

IJCAI Conference 2019 Conference Paper

DDL: Deep Dictionary Learning for Predictive Phenotyping

  • Tianfan Fu
  • Trong Nghia Hoang
  • Cao Xiao
  • Jimeng Sun

Predictive phenotyping is about accurately predicting what phenotypes will occur in the next clinical visit based on longitudinal Electronic Health Record (EHR) data. Several deep learning (DL) models have demonstrated great performance in predictive phenotyping. However, these DL-based phenotyping models require access to a large amount of labeled data, which is often expensive to acquire. To address this label-insufficiency challenge, we propose a deep dictionary learning framework (DDL) for phenotyping, which utilizes unlabeled data as a complementary source of information to generate a better, more succinct data representation. With extensive experiments on multiple real-world EHR datasets, we demonstrate that DDL outperforms state-of-the-art predictive phenotyping methods on a wide variety of clinical tasks that require patient phenotyping, such as heart failure classification, mortality prediction, and sequential prediction. All empirical results consistently show that unlabeled data can indeed be used to generate better data representations, which helps DDL improve over existing baseline methods that use only labeled data.

AAAI Conference 2019 Conference Paper

GAMENet: Graph Augmented MEmory Networks for Recommending Medication Combination

  • Junyuan Shang
  • Cao Xiao
  • Tengfei Ma
  • Hongyan Li
  • Jimeng Sun

Recent progress in deep learning is revolutionizing the healthcare domain, including providing solutions for medication recommendation, especially recommending medication combinations for patients with complex health conditions. Existing approaches either do not customize based on patient health history or ignore existing knowledge on drug-drug interactions (DDI) that might lead to adverse outcomes. To fill this gap, we propose Graph Augmented Memory Networks (GAMENet), which integrates the drug-drug interaction knowledge graph via a memory module implemented as a graph convolutional network, and models longitudinal patient records as the query. It is trained end-to-end to provide safe and personalized recommendations of medication combinations. We demonstrate the effectiveness and safety of GAMENet by comparing it with several state-of-the-art methods on real EHR data. GAMENet outperformed all baselines in all effectiveness measures and also achieved a 3.60% DDI rate reduction relative to existing EHR data.

AAAI Conference 2019 Conference Paper

Hierarchical Reinforcement Learning for Course Recommendation in MOOCs

  • Jing Zhang
  • Bowen Hao
  • Bo Chen
  • Cuiping Li
  • Hong Chen
  • Jimeng Sun

The proliferation of massive open online courses (MOOCs) demands an effective way of providing personalized course recommendations. Recent attention-based recommendation models can distinguish the effects of different historical courses when recommending different target courses. However, when a user has interests in many different courses, the attention mechanism performs poorly, as the effects of the contributing courses are diluted by diverse historical courses. To address this challenge, we propose a hierarchical reinforcement learning algorithm that revises the user profiles and tunes the course recommendation model on the revised profiles. We systematically evaluate the proposed model on a real dataset consisting of 1,302 courses, 82,535 users, and 458,454 user enrollment behaviors, collected from XuetangX, one of the largest MOOC platforms in China. Experimental results show that the proposed model significantly outperforms state-of-the-art recommendation models (improving HR@10 by 5.02% to 18.95%).

IJCAI Conference 2019 Conference Paper

MINA: Multilevel Knowledge-Guided Attention for Modeling Electrocardiography Signals

  • Shenda Hong
  • Cao Xiao
  • Tengfei Ma
  • Hongyan Li
  • Jimeng Sun

Electrocardiography (ECG) signals are commonly used to diagnose various cardiac abnormalities. Recently, deep learning models have shown initial success in modeling ECG data; however, they are mostly black-box and thus lack the interpretability needed for clinical use. In this work, we propose MultIlevel kNowledge-guided Attention networks (MINA) that predict heart diseases from ECG signals with intuitive explanations aligned with medical knowledge. By extracting multilevel (beat-, rhythm- and frequency-level) domain knowledge features separately, MINA combines medical knowledge and ECG data via a multilevel attention model, making the learned models highly interpretable. Our experiments show that MINA achieves a PR-AUC of 0.9436 (outperforming the best baseline by 5.51%) on a real-world ECG dataset. Finally, MINA also demonstrates robust performance and strong interpretability under signal distortion and noise contamination.

IJCAI Conference 2019 Conference Paper

Pre-training of Graph Augmented Transformers for Medication Recommendation

  • Junyuan Shang
  • Tengfei Ma
  • Cao Xiao
  • Jimeng Sun

Medication recommendation is an important healthcare application. It is commonly formulated as a temporal prediction task. Hence, most existing works utilize only longitudinal electronic health records (EHRs) from a small number of patients with multiple visits, ignoring the large number of patients with a single visit (selection bias). Moreover, important hierarchical knowledge such as the diagnosis hierarchy is not leveraged in the representation learning process. Despite the success of deep learning techniques in computational phenotyping, most previous approaches thus have two limitations: task-oriented representation and ignoring hierarchies of medical codes. To address these challenges, we propose G-BERT, a new model that combines the power of Graph Neural Networks (GNNs) and BERT (Bidirectional Encoder Representations from Transformers) for medical code representation and medication recommendation. We use GNNs to represent the internal hierarchical structures of medical codes. We then integrate the GNN representation into a transformer-based visit encoder and pre-train it on EHR data from patients with only a single visit. The pre-trained visit encoder and representation are then fine-tuned for downstream predictive tasks on longitudinal EHRs from patients with multiple visits. G-BERT is the first to bring the language-model pre-training schema into the healthcare domain, and it achieved state-of-the-art performance on the medication recommendation task.

IJCAI Conference 2019 Conference Paper

RDPD: Rich Data Helps Poor Data via Imitation

  • Shenda Hong
  • Cao Xiao
  • Trong Nghia Hoang
  • Tengfei Ma
  • Hongyan Li
  • Jimeng Sun

In many situations, we need to build and deploy separate models in related environments with different data qualities. For example, an environment with strong observation equipment (e.g., intensive care units) often provides high-quality multi-modal data, acquired from multiple sensory devices with rich feature representations. On the other hand, an environment with poor observation equipment (e.g., at home) provides only low-quality, uni-modal data with poor feature representations. To deploy a competitive model in a poor-data environment without requiring direct access to multi-modal data acquired from a rich-data environment, this paper develops a knowledge distillation (KD) method, RDPD, that enhances a predictive model trained on poor data using knowledge distilled from a high-complexity model trained on rich, private data. We evaluated RDPD on three real-world datasets and showed that its distilled model consistently outperformed all baselines across all datasets, achieving the greatest performance improvement over a model trained only on low-quality data (24.56% on PR-AUC and 12.21% on ROC-AUC) and over a state-of-the-art KD model (5.91% on PR-AUC and 4.44% on ROC-AUC).

NeurIPS Conference 2018 Conference Paper

MiME: Multilevel Medical Embedding of Electronic Health Records for Predictive Healthcare

  • Edward Choi
  • Cao Xiao
  • Walter Stewart
  • Jimeng Sun

Deep learning models exhibit state-of-the-art performance for many predictive healthcare tasks using electronic health records (EHR) data, but these models typically require training data volumes that exceed the capacity of most healthcare systems. External resources such as medical ontologies are used to bridge the data volume constraint, but this approach is often not directly applicable or useful because of inconsistencies in terminology. To solve the data insufficiency challenge, we leverage the inherent multilevel structure of EHR data and, in particular, the encoded relationships among medical codes. We propose Multilevel Medical Embedding (MiME), which learns a multilevel embedding of EHR data while jointly performing auxiliary prediction tasks that rely on this inherent EHR structure, without the need for external labels. We conducted two prediction tasks, heart failure prediction and sequential disease prediction, where MiME outperformed baseline methods in diverse evaluation settings. In particular, MiME consistently outperformed all baselines when predicting heart failure on datasets of differing volumes, with the greatest improvement (a 15% relative gain in PR-AUC over the best baseline) on the smallest dataset, demonstrating its ability to effectively model the multilevel structure of EHR data.
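
The multilevel idea can be caricatured in a few lines: each diagnosis embedding interacts with the treatments ordered for it, and a visit aggregates its diagnosis-level embeddings. This is a hypothetical stand-in for MiME's actual interaction functions; the gating and shapes below are invented for illustration:

```python
import numpy as np

def dx_embedding(dx_vec, rx_vecs):
    """A diagnosis interacts with its treatments: here, the diagnosis
    vector gated by the mean treatment vector (illustrative choice)."""
    g = np.tanh(np.mean(rx_vecs, axis=0)) if len(rx_vecs) else 1.0
    return dx_vec * g

def visit_embedding(visit):
    """A visit is the sum of its diagnosis-level embeddings."""
    return np.sum([dx_embedding(d, rx) for d, rx in visit], axis=0)

rng = np.random.default_rng(1)
visit = [(rng.normal(size=4), [rng.normal(size=4)]),  # dx with one treatment
         (rng.normal(size=4), [])]                    # dx with no treatment
v = visit_embedding(visit)
```

Because the diagnosis-treatment pairing is explicit in the structure, auxiliary tasks (e.g., predicting a diagnosis's treatments) come for free, without external labels.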

AAAI Conference 2017 Conference Paper

StructInf: Mining Structural Influence from Social Streams

  • Jing Zhang
  • Jie Tang
  • Yuanyi Zhong
  • Yuchen Mo
  • Juanzi Li
  • Guojie Song
  • Wendy Hall
  • Jimeng Sun

Social influence is a fundamental issue in social network analysis and has attracted tremendous attention with the rapid growth of online social networks. However, existing research mainly focuses on studying peer influence. This paper introduces a novel notion of structural influence and studies how to efficiently discover structural influence patterns from social streams. We present three sampling algorithms with theoretical unbiasedness guarantees to speed up the discovery process. Experiments on a large microblogging dataset show that the proposed sampling algorithms achieve a 10× speedup over the exact influence pattern mining algorithm, with an average error rate of only 1.0%. The extracted structural influence patterns have many applications; we apply them to predicting retweet behavior and significantly improve performance.
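
The flavor of an unbiased sampling speedup (not the paper's specific algorithms) is that each kept match is reweighted by the inverse sampling probability, so the estimator's expectation equals the exact count. The event stream and pattern predicate below are toy stand-ins:

```python
import numpy as np

def sampled_count(events, is_pattern, p, rng):
    """Keep each event with probability p and weight kept matches by
    1/p; the estimate is unbiased for the exact pattern count."""
    keep = rng.random(len(events)) < p
    return sum(1.0 / p for e, k in zip(events, keep) if k and is_pattern(e))

events = list(range(10_000))
exact = sum(1 for e in events if e % 3 == 0)
est = np.mean([sampled_count(events, lambda e: e % 3 == 0, 0.1,
                             np.random.default_rng(s)) for s in range(50)])
```

Only ~10% of the stream is examined per run, yet the average estimate lands close to the exact count, which is the trade the paper quantifies at scale.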

NeurIPS Conference 2016 Conference Paper

RETAIN: An Interpretable Predictive Model for Healthcare using Reverse Time Attention Mechanism

  • Edward Choi
  • Mohammad Taha Bahadori
  • Jimeng Sun
  • Joshua Kulas
  • Andy Schuetz
  • Walter Stewart

Accuracy and interpretability are two dominant features of successful predictive models. Typically, a choice must be made in favor of complex black-box models, such as recurrent neural networks (RNNs), for accuracy versus less accurate but more interpretable traditional models such as logistic regression. This tradeoff poses challenges in medicine, where both accuracy and interpretability are important. We addressed this challenge by developing the REverse Time AttentIoN model (RETAIN) for application to Electronic Health Records (EHR) data. RETAIN achieves high accuracy while remaining clinically interpretable; it is based on a two-level neural attention model that detects influential past visits and significant clinical variables within those visits (e.g., key diagnoses). RETAIN mimics physician practice by attending to the EHR data in reverse time order, so that recent clinical visits are likely to receive higher attention. RETAIN was tested on a large health system EHR dataset with 14 million visits completed by 263K patients over an 8-year period and demonstrated predictive accuracy and computational scalability comparable to state-of-the-art methods such as RNNs, and ease of interpretability comparable to traditional models.
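
The two-level attention can be sketched in NumPy: a scalar weight per visit (alpha) and a per-variable gate (beta), combined into one context vector. In RETAIN the scores come from RNNs run in reverse time order; the linear scoring functions here are simplified stand-ins:

```python
import numpy as np

def softmax(z):
    e = np.exp(z - np.max(z))
    return e / e.sum()

def retain_context(V, w_alpha, W_beta):
    """Two-level attention in the spirit of RETAIN: alpha weights
    visits, beta gates variables within each visit. (RETAIN derives
    both from reverse-time RNNs; linear scores stand in here.)"""
    alpha = softmax(V @ w_alpha)        # visit-level weights, sum to 1
    beta = np.tanh(V @ W_beta)          # variable-level gates in [-1, 1]
    return (alpha[:, None] * beta * V).sum(axis=0)

rng = np.random.default_rng(0)
V = rng.normal(size=(5, 4))             # 5 visits, 4 clinical variables
w_a = rng.normal(size=4)
W_b = rng.normal(size=(4, 4))
c = retain_context(V, w_a, W_b)
```

Interpretability comes from inspecting the learned `alpha` and `beta`: the product `alpha[i] * beta[i, j]` attributes the prediction to variable `j` of visit `i`.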

JBHI Journal 2015 Journal Article

PSF: A Unified Patient Similarity Evaluation Framework Through Metric Learning With Weak Supervision

  • Fei Wang
  • Jimeng Sun

Patient similarity is an important analytic operation in healthcare applications. At its core, patient similarity takes an index patient as input and retrieves a ranked list of similar patients who are relevant in a specific clinical context. It takes patient information, such as electronic health records, as input and computes the distance between a pair of patients based on that information. To construct a clinically valid similarity measure, physician input often needs to be incorporated. However, obtaining physicians' input is difficult and expensive; as a result, typically only limited physician feedback can be obtained on a small portion of patients. How can we leverage all the unlabeled patient data, together with limited supervision from physicians, to construct a clinically meaningful distance metric? In this paper, we present a patient similarity framework (PSF) that unifies and significantly extends existing supervised patient similarity metric learning methods. PSF is a general framework that can learn an appropriate distance metric from both supervised and unsupervised information. Within the PSF framework, we propose a novel patient similarity algorithm that uses local spline regression to capture the unsupervised information. To speed up the incorporation of physician feedback or newly available clinical information, we introduce a general online update algorithm for an existing PSF distance metric.
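
The object such metric learning methods fit is typically a Mahalanobis-style distance parameterized by a positive semi-definite matrix M; supervision adjusts M so that clinically similar patients land close together. A minimal sketch with hypothetical feature vectors (PSF's actual objective is more involved):

```python
import numpy as np

def mahalanobis(x, y, M):
    """Distance parameterized by a PSD matrix M; M = I recovers the
    Euclidean distance, and metric learning fits M to supervision."""
    d = np.asarray(x) - np.asarray(y)
    return float(np.sqrt(d @ M @ d))

# A diagonal M acts as learned per-feature weights
x, y = np.array([1.0, 2.0]), np.array([2.0, 4.0])
M = np.diag([4.0, 1.0])                 # up-weight the first feature
d_w = mahalanobis(x, y, M)              # learned (weighted) distance
d_e = mahalanobis(x, y, np.eye(2))      # plain Euclidean distance
```

An online update, as the abstract describes, amounts to adjusting M incrementally as each new piece of physician feedback arrives, rather than refitting from scratch.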

NeurIPS Conference 2015 Conference Paper

Time-Sensitive Recommendation From Recurrent User Activities

  • Nan Du
  • Yichen Wang
  • Niao He
  • Jimeng Sun
  • Le Song

By making personalized suggestions, a recommender system plays a crucial role in improving user engagement in modern web services. However, most recommendation algorithms do not explicitly take into account the temporal behavior and recurrent activities of users. Two central but less explored questions are how to recommend the most desirable item \emph{at the right moment}, and how to predict \emph{the next returning time} of a user to a service. To address these questions, we propose a novel framework that connects self-exciting point processes and low-rank models to capture the recurrent temporal patterns in a large collection of user-item consumption pairs. We show that the parameters of the model can be estimated via convex optimization, and furthermore, we develop an efficient algorithm that maintains an $O(1 / \epsilon)$ convergence rate and scales up to problems with millions of user-item pairs and thousands of millions of temporal events. Compared to other state-of-the-art methods on both synthetic and real datasets, our model achieves superior predictive performance on the two time-sensitive recommendation questions. Finally, we point out that our formulation can incorporate extra context information about users, such as profile, textual, and spatial features.
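
A self-exciting (Hawkes) point process models exactly this "recurrent activity" intuition: each past event temporarily raises the intensity of future events. A minimal sketch with made-up parameters (the paper couples such processes with a low-rank model across all user-item pairs):

```python
import numpy as np

def hawkes_intensity(t, history, mu, alpha, omega):
    """Self-exciting intensity: a base rate mu plus an exponentially
    decaying boost from each past event -- recent activity raises the
    chance of a near-future return."""
    history = np.asarray(history, dtype=float)
    past = history[history < t]
    return float(mu + alpha * np.exp(-omega * (t - past)).sum())

events = [1.0, 2.5, 2.8]                # a user's past interaction times
lam_soon = hawkes_intensity(3.0, events, mu=0.2, alpha=0.8, omega=1.0)
lam_late = hawkes_intensity(9.0, events, mu=0.2, alpha=0.8, omega=1.0)
```

Shortly after a burst of activity the intensity is high and then decays back toward the base rate, which is what makes "next returning time" predictable from the history.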

AAAI Conference 2011 Conference Paper

Automatic Group Sparse Coding

  • Fei Wang
  • Noah Lee
  • Jimeng Sun
  • Jianying Hu
  • Shahram Ebadollahi

Sparse Coding (SC), which models the data vectors as sparse linear combinations over basis vectors (i.e., the dictionary), has been widely applied in machine learning, signal processing and neuroscience. Recently, one specific SC technique, Group Sparse Coding (GSC), has been proposed to learn a common dictionary over multiple different groups of data, where the data groups are assumed to be pre-defined. In practice, this may not always be the case. In this paper, we propose Automatic Group Sparse Coding (AutoGSC), which can (1) discover the hidden data groups; (2) learn a common dictionary over different data groups; and (3) learn an individual dictionary for each data group. Finally, we conduct experiments on both synthetic and real world data sets to demonstrate the effectiveness of AutoGSC, and compare it with traditional sparse coding and Nonnegative Matrix Factorization (NMF) methods.
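
As background for what "sparse linear combinations over a dictionary" means in practice (AutoGSC's group-level optimization is more involved), the basic sparse-coding inference step solves a lasso problem, e.g. via iterative shrinkage-thresholding (ISTA). The orthonormal toy dictionary below keeps the solution checkable by hand:

```python
import numpy as np

def ista(x, D, lam=0.1, lr=0.5, steps=200):
    """Solve min_a 0.5*||x - D a||^2 + lam*||a||_1 by iterative
    shrinkage-thresholding -- the basic sparse-coding inference step."""
    a = np.zeros(D.shape[1])
    for _ in range(steps):
        a = a - lr * (D.T @ (D @ a - x))                     # gradient step
        a = np.sign(a) * np.maximum(np.abs(a) - lr * lam, 0)  # soft-threshold
    return a

rng = np.random.default_rng(0)
D, _ = np.linalg.qr(rng.normal(size=(8, 8)))  # orthonormal toy dictionary
x = 1.5 * D[:, 3] - 0.8 * D[:, 7]             # signal built from two atoms
a = ista(x, D, lam=0.05)
```

With an orthonormal dictionary the lasso solution is soft-thresholding of `D.T @ x`, so `a` recovers the two generating atoms (shrunk by `lam`) and zeros elsewhere; GSC adds the constraint that codes within a data group share their active atoms.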

IJCAI Conference 2007 Conference Paper

Dynamic Mixture Models for Multiple Time Series

  • Xing Wei
  • Jimeng Sun
  • Xuerui Wang

Traditional probabilistic mixture models such as Latent Dirichlet Allocation imply that data records (such as documents) are fully exchangeable. However, data are naturally collected over time and thus obey a temporal order. In this paper, we present Dynamic Mixture Models (DMMs) for online pattern discovery in multiple time series. DMMs avoid a noticeable drawback of SVD-based methods for data streams, which often produce negative values in the hidden variables even when all inputs are non-negative. We apply DMMs to two real-world datasets and achieve significantly better results with intuitive interpretations.