Author name cluster

Prasenjit Mitra

Possible papers associated with this exact author name in Arrow. This page groups case-insensitive exact name matches and is not a full identity disambiguation profile.

21 papers

2 author rows

ICLR Conference 2025 Conference Paper

SiReRAG: Indexing Similar and Related Information for Multihop Reasoning

Nan Zhang
Prafulla Kumar Choubey
Alexander R. Fabbri
Gabriel Bernadett-Shapiro
Rui Zhang 0037
Prasenjit Mitra
Caiming Xiong
Chien-Sheng Wu

Indexing is an important step towards strong performance in retrieval-augmented generation (RAG) systems. However, existing methods organize data based on either semantic similarity (similarity) or related information (relatedness), but do not cover both perspectives comprehensively. Our analysis reveals that modeling only one perspective results in insufficient knowledge synthesis, leading to suboptimal performance on complex tasks requiring multihop reasoning. In this paper, we propose SiReRAG, a novel RAG indexing approach that explicitly considers both similar and related information. On the similarity side, we follow existing work and explore some variances to construct a similarity tree based on recursive summarization. On the relatedness side, SiReRAG extracts propositions and entities from texts, groups propositions via shared entities, and generates recursive summaries to construct a relatedness tree. We index and flatten both similarity and relatedness trees into a unified retrieval pool. Our experiments demonstrate that SiReRAG consistently outperforms state-of-the-art indexing methods on three multihop datasets (MuSiQue, 2WikiMultiHopQA, and HotpotQA), with an average 1.9% improvement in F1 scores. As a reasonably efficient solution, SiReRAG enhances existing reranking methods significantly, with up to 7.8% improvement in average F1 scores. Our code is available at https://github.com/SalesforceAIResearch/SiReRAG.
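The relatedness step described above (grouping propositions via shared entities) can be illustrated with a minimal sketch. The `(text, entities)` pair layout, the function name `group_by_shared_entities`, and the rule of keeping only entities mentioned by at least two propositions are assumptions made for illustration; they are not taken from the SiReRAG code.

```python
from collections import defaultdict


def group_by_shared_entities(propositions):
    """Group proposition texts under each entity they mention.

    `propositions` is a list of (text, entities) pairs. This layout is a
    hypothetical choice for the sketch, not the paper's data format.
    """
    groups = defaultdict(list)
    for text, entities in propositions:
        for entity in entities:
            groups[entity].append(text)
    # Assumption: only entities shared by two or more propositions form a group.
    return {entity: texts for entity, texts in groups.items() if len(texts) >= 2}


props = [
    ("Marie Curie was born in Warsaw.", {"Marie Curie", "Warsaw"}),
    ("Marie Curie won the Nobel Prize in Physics.", {"Marie Curie", "Nobel Prize"}),
    ("Warsaw is the capital of Poland.", {"Warsaw", "Poland"}),
]
groups = group_by_shared_entities(props)
```

In this toy input, "Marie Curie" and "Warsaw" each link two propositions, so each yields one group that a later summarization step could consume.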

NeurIPS Conference 2024 Conference Paper

Automated Multi-Task Learning for Joint Disease Prediction on Electronic Health Records

Suhan Cui
Prasenjit Mitra

In the realm of big data and digital healthcare, Electronic Health Records (EHR) have become a rich source of information with the potential to improve patient care and medical research. In recent years, machine learning models have proliferated for analyzing EHR data to predict patients' future health conditions. Among them, some studies advocate for multi-task learning (MTL) to jointly predict multiple target diseases for improving the prediction performance over single task learning. Nevertheless, current MTL frameworks for EHR data have significant limitations due to their heavy reliance on human experts to identify task groups for joint training and design model architectures. To reduce human intervention and improve the framework design, we propose an automated approach named AutoDP, which can search for the optimal configuration of task grouping and architectures simultaneously. To tackle the vast joint search space encompassing task combinations and architectures, we employ surrogate model-based optimization, enabling us to efficiently discover the optimal solution. Experimental results on real-world EHR data demonstrate the efficacy of the proposed AutoDP framework. It achieves significant performance improvements over both hand-crafted and automated state-of-the-art methods while maintaining a feasible search cost.

AAAI Conference 2024 Conference Paper

Data Disparity and Temporal Unavailability Aware Asynchronous Federated Learning for Predictive Maintenance on Transportation Fleets

Leonie von Wahl
Niklas Heidenreich
Prasenjit Mitra
Michael Nolting
Nicolas Tempelmeier

Predictive maintenance has emerged as a critical application in modern transportation, leveraging sensor data to forecast potential damages proactively using machine learning. However, privacy concerns limit data sharing, making federated learning an appealing approach to preserve data privacy. Nevertheless, challenges arise due to disparities in data distribution and temporal unavailability caused by individual usage patterns in transportation. In this paper, we present a novel asynchronous federated learning approach to address system heterogeneity and facilitate machine learning for predictive maintenance on transportation fleets. The approach introduces a novel data disparity aware aggregation scheme and a federated early stopping method for training. To validate the effectiveness of our approach, we evaluate it on two independent real-world datasets from the transportation domain: 1) oil dilution prediction of car combustion engines and 2) remaining lifetime prediction of plane turbofan engines. Our experiments show that we reliably outperform five state-of-the-art baselines, including federated and classical machine learning models. Moreover, we show that our approach generalises to various prediction model architectures.
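The abstract does not spell out its data disparity aware aggregation scheme. As a point of reference only, the sketch below shows the common baseline it would be compared against conceptually: a sample-size-weighted average of client parameters in the style of federated averaging. The function name `weighted_average` and the `(num_samples, weights)` layout are assumptions for illustration and are not the paper's method.

```python
def weighted_average(client_updates):
    """Average client parameter vectors, weighted by local sample counts.

    `client_updates` is a list of (num_samples, weights) pairs, where
    `weights` is a list of floats of equal length for every client.
    This is plain sample-size weighting, NOT the paper's aggregation scheme.
    """
    total = sum(num_samples for num_samples, _ in client_updates)
    if total == 0:
        raise ValueError("at least one client must hold data")
    dim = len(client_updates[0][1])
    aggregated = [0.0] * dim
    for num_samples, weights in client_updates:
        share = num_samples / total  # this client's fraction of all samples
        for i, value in enumerate(weights):
            aggregated[i] += share * value
    return aggregated


# Two hypothetical clients: one with 100 samples, one with 300.
updates = [(100, [1.0, 2.0]), (300, [3.0, 4.0])]
aggregated = weighted_average(updates)
```

With these inputs the larger client contributes three quarters of each averaged parameter, which is exactly the behaviour that can disadvantage clients with little or unevenly distributed data.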

AAAI Conference 2023 Short Paper

Can You Answer This? – Exploring Zero-Shot QA Generalization Capabilities in Large Language Models (Student Abstract)

Saptarshi Sengupta
Shreya Ghosh
Preslav Nakov
Prasenjit Mitra

The buzz around Transformer-based language models (TLM) such as BERT, RoBERTa, etc. is well-founded owing to their impressive results on an array of tasks. However, when applied to areas needing specialized knowledge (closed-domain), such as medicine and finance, their performance takes drastic hits, sometimes more than their older recurrent/convolutional counterparts. In this paper, we explore zero-shot capabilities of large LMs for extractive QA. Our objective is to examine performance change in the face of domain drift, i.e., when the target domain data is vastly different in semantic and statistical properties from the source domain, and attempt to explain the subsequent behavior. To this end, we present two studies in this paper while planning further experiments later down the road. Our findings indicate flaws in the current generation of TLM limiting their performance on closed-domain tasks.

JBHI Journal 2023 Journal Article

Forecasting User Interests Through Topic Tag Predictions in Online Health Communities

Amogh Subbakrishna Adishesha
Lily Jakielaszek
Fariha Azhar
Peixuan Zhang
Vasant Honavar
Fenglong Ma
Chandra Belani
Prasenjit Mitra

The increasing reliance on online communities for healthcare information by patients and caregivers has led to an increase in the spread of misinformation, or subjective, anecdotal and inaccurate or non-specific recommendations, which, if acted on, could cause serious harm to the patients. Hence, there is an urgent need to connect users with accurate and tailored health information in a timely manner to prevent such harm. This article proposes an innovative approach to suggesting reliable information to participants in online communities as they move through different stages in their disease or treatment. We hypothesize that patients with similar histories of disease progression or course of treatment would have similar information needs at comparable stages. Specifically, we pose the problem of predicting topic tags or keywords that describe the future information needs of users based on their profiles, traces of their online interactions within the community (past posts, replies) and the profiles and traces of online interactions of other users with similar profiles and similar traces of past interaction with the target users. The result is a variant of the collaborative information filtering or recommendation system tailored to the needs of users of online health communities. We report results of our experiments on two unique datasets from two different social media platforms, which demonstrate the superiority of the proposed approach over the state-of-the-art baselines with respect to accurate and timely prediction of topic tags (and hence information sources of interest).

IJCAI Conference 2023 Conference Paper

Understanding the Night-Sky? Developing AI-Enabled System for Exploring Night-Light Usage Patterns

Jakob Hederich
Shreya Ghosh
Zeyu He
Prasenjit Mitra

We present a demonstration of a nighttime light (NTL) pattern analysis system. Our tool, named NightVIEW, is powered by an efficient system architecture for exporting and analysing a huge volume of spatial NTL data, together with image segmentation and clustering algorithms that find unusual NTL patterns, identify hotspots of excess night light usage, and uncover the semantics of cities.

AAAI Conference 2020 Conference Paper

Joint Modeling of Local and Global Temporal Dynamics for Multivariate Time Series Forecasting with Missing Values

Xianfeng Tang
Huaxiu Yao
Yiwei Sun
Charu Aggarwal
Prasenjit Mitra
Suhang Wang

Multivariate time series (MTS) forecasting is widely used in various domains, such as meteorology and traffic. Due to limitations on data collection, transmission, and storage, real-world MTS data usually contains missing values, making it infeasible to apply existing MTS forecasting models such as linear regression and recurrent neural networks. Though many efforts have been devoted to this problem, most of them solely rely on local dependencies for imputing missing values, which ignores global temporal dynamics. Local dependencies/patterns would become less useful when the missing ratio is high, or the data have consecutive missing values, while exploring global patterns can alleviate this problem. Thus, jointly modeling local and global temporal dynamics is very promising for MTS forecasting with missing values. However, work in this direction is rather limited. Therefore, we study a novel problem of MTS forecasting with missing values by jointly exploring local and global temporal dynamics. We propose a new framework, LGnet, which leverages a memory network to explore global patterns given estimations from local perspectives. We further introduce adversarial training to enhance the modeling of global temporal distribution. Experimental results on real-world datasets show the effectiveness of LGnet for MTS forecasting with missing values and its robustness under various missing ratios.

IJCAI Conference 2016 Conference Paper

Detecting Rumors from Microblogs with Recurrent Neural Networks

Jing Ma
Wei Gao
Prasenjit Mitra
Sejeong Kwon
Bernard J. Jansen
Kam-Fai Wong
Meeyoung Cha

Microblogging platforms are an ideal place for spreading rumors and automatically debunking rumors is a crucial problem. To detect rumors, existing approaches have relied on hand-crafted features for employing machine learning algorithms that require daunting manual effort. Upon facing a dubious claim, people dispute its truthfulness by posting various cues over time, which generates long-distance dependencies of evidence. This paper presents a novel method that learns continuous representations of microblog events for identifying rumors. The proposed model is based on recurrent neural networks (RNN) for learning the hidden representations that capture the variation of contextual information of relevant posts over time. Experimental results on datasets from two real-world microblog platforms demonstrate that (1) the RNN method outperforms state-of-the-art rumor detection models that use hand-crafted features; (2) performance of the RNN-based algorithm is further improved via sophisticated recurrent units and extra hidden layers; (3) RNN-based method detects rumors more quickly and accurately than existing techniques, including the leading online rumor debunking services.

IJCAI Conference 2016 Conference Paper

WikiWrite: Generating Wikipedia Articles Automatically

Siddhartha Banerjee
Prasenjit Mitra

The growth of Wikipedia, limited by the availability of knowledgeable authors, cannot keep pace with the ever increasing requirements and demands of the readers. In this work, we propose WikiWrite, a system capable of generating content for new Wikipedia articles automatically. First, our technique obtains feature representations of entities on Wikipedia. We adapt an existing work on document embeddings to obtain vector representations of words and paragraphs. Using the representations, we identify articles that are very similar to the new entity on Wikipedia. We train machine learning classifiers using content from the similar articles to assign web retrieved content on the new entity into relevant sections in the Wikipedia article. Second, we propose a novel abstractive summarization technique that uses a two-step integer-linear programming (ILP) model to synthesize the assigned content in each section and rewrite the content to produce a well-formed informative summary. Our experiments show that our technique is able to reconstruct existing articles in Wikipedia with high accuracies. We also create several articles using our approach in the English Wikipedia, most of which have been retained in the online encyclopedia.

AAAI Conference 2015 Conference Paper

A Neural Probabilistic Model for Context Based Citation Recommendation

Wenyi Huang
Zhaohui Wu
Chen Liang
Prasenjit Mitra
C. Giles

Automatic citation recommendation can be very useful for authoring a paper and is an AI-complete problem due to the challenge of bridging the semantic gap between citation context and the cited paper. It is not always easy for knowledgeable researchers to give an accurate citation context for a cited paper or to find the right paper to cite given context. To help with this problem, we propose a novel neural probabilistic model that jointly learns the semantic representations of citation contexts and cited papers. The probability of citing a paper given a citation context is estimated by training a multi-layer neural network. We implement and evaluate our model on the entire CiteSeer dataset, which at the time of this work consists of 10,760,318 citation contexts from 1,017,457 papers. We show that the proposed model significantly outperforms other state-of-the-art models in recall, MAP, MRR, and nDCG.

IJCAI Conference 2015 Conference Paper

Multi-Document Abstractive Summarization Using ILP Based Multi-Sentence Compression

Siddhartha Banerjee
Prasenjit Mitra
Kazunari Sugiyama

Abstractive summarization is an ideal form of summarization since it can synthesize information from multiple documents to create concise informative summaries. In this work, we aim at developing an abstractive summarizer. First, our proposed approach identifies the most important document in the multi-document set. The sentences in the most important document are aligned to sentences in other documents to generate clusters of similar sentences. Second, we generate K-shortest paths from the sentences in each cluster using a word-graph structure. Finally, we select sentences from the set of shortest paths generated from all the clusters employing a novel integer linear programming (ILP) model with the objective of maximizing information content and readability of the final summary. Our ILP model represents the shortest paths as binary variables and considers the length of the path, information score and linguistic quality score in the objective function. Experimental results on the DUC 2004 and 2005 multi-document summarization datasets show that our proposed approach outperforms all the baselines and state-of-the-art extractive summarizers as measured by the ROUGE scores. Our method also outperforms a recent abstractive summarization technique. In manual evaluation, our approach also achieves promising results on informativeness and readability.

AAAI Conference 2012 Conference Paper

Combining Hashing and Abstraction in Sparse High Dimensional Feature Spaces

Cornelia Caragea
Adrian Silvescu
Prasenjit Mitra

With the exponential increase in the number of documents available online, e.g., news articles, weblogs, scientific documents, the development of effective and efficient classification methods is needed. The performance of document classifiers critically depends, among other things, on the choice of the feature representation. The commonly used “bag of words” and n-gram representations can result in prohibitively high dimensional input spaces. Data mining algorithms applied to these input spaces may be intractable due to the large number of dimensions. Thus, dimensionality reduction algorithms that can process data into features fast at runtime, ideally in constant time per feature, are greatly needed in high throughput applications, where the number of features and data points can be in the order of millions. One promising line of research on dimensionality reduction is feature clustering. We propose to combine two types of feature clustering, namely hashing and abstraction based on hierarchical agglomerative clustering, in order to take advantage of the strengths of both techniques. Experimental results on two text data sets show that the combined approach uses a significantly smaller number of features and gives similar performance when compared with the “bag of words” and n-gram approaches.
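The hashing half of this combination can be sketched in a few lines. The example below maps an arbitrary token vocabulary onto a fixed number of buckets in constant time per feature; the function name `hash_features`, the use of CRC32 as the hash, and the bucket count are assumptions for illustration, and the abstraction step based on hierarchical agglomerative clustering is not shown.

```python
import zlib
from collections import Counter


def hash_features(tokens, n_buckets=16):
    """Map a bag of tokens to a fixed-length count vector via hashing.

    Each distinct token is hashed to one of `n_buckets` positions, so the
    output size does not grow with the vocabulary. Collisions merge counts.
    CRC32 is used here only because it is deterministic across runs.
    """
    vector = [0] * n_buckets
    for token, count in Counter(tokens).items():
        index = zlib.crc32(token.encode("utf-8")) % n_buckets
        vector[index] += count
    return vector


tokens = "the quick brown fox jumps over the lazy dog".split()
vector = hash_features(tokens, n_buckets=8)
```

The vector always has `n_buckets` entries and its entries sum to the number of input tokens; distinct tokens that collide simply share a bucket, which is the accuracy cost traded for the fixed dimensionality.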

ECAI Conference 2012 Conference Paper

Disambiguating Road Names in Text Route Descriptions using Exact-All-Hop Shortest Path Algorithm

Xiao Zhang 0019
Baojun Qiu
Prasenjit Mitra
Sen Xu
Alexander Klippel
Alan M. MacEachren

Automatic extraction and understanding of human-generated route descriptions have been critical to research aiming at understanding human cognition of geospatial information. Among all research issues involved, road name disambiguation is the most important, because one road name can refer to more than one road. Compared with traditional toponym (place name) disambiguation, the challenges of disambiguating road names in human-generated route descriptions are three-fold: (1) the authors may use a wrong or obsolete road name and the gazetteer may have incomplete or out-of-date information; (2) geographic ontologies often used to disambiguate cities or counties do not exist for roads, due to their linear nature and large spatial extent; (3) knowledge of the co-occurrence of road names and other toponyms is difficult to learn due to the difficulty in automatic processing of natural language and the lack of external information sources on road entities. In this paper, we solve the problem of road name disambiguation in human-generated route descriptions with noise, i.e., in the presence of wrong names and an incomplete gazetteer. We model the problem as an Exact-All-Hop Shortest Path problem on a semi-complete directed k-partite graph, and design an efficient algorithm to solve it. Our disambiguation algorithm successfully handles the noisy data and does not require any extra information sources other than the gazetteer. We compared our algorithm with an existing map-based method. Experimental results show that our algorithm significantly outperforms the existing method.

AAAI Conference 2012 Conference Paper

Table Header Detection and Classification

Jing Fang
Prasenjit Mitra
Zhi Tang
C. Lee Giles

In digital libraries, a table, as a specific document component as well as a condensed way to present structured and relational data, contains rich information and is often the only source of that information. In order to explore, retrieve, and reuse that data, tables should be identified and the data extracted. Table recognition is an old field of research. However, due to the diversity of table styles, the results are still far from satisfactory, and not a single algorithm performs well on all different types of tables. In this paper, we randomly take samples from CiteSeerX to investigate diverse table styles for automatic table extraction. We find that table headers are one of the main characteristics of complex table styles. We identify a set of features that can be used to segregate headers from tabular data and build a classifier to detect table headers. Our empirical evaluation on PDF documents shows that using a Random Forest classifier achieves an accuracy of 92%.

IJCAI Conference 2011 Conference Paper

Context Sensitive Topic Models for Author Influence in Document Networks

Saurabh Kataria
Prasenjit Mitra
Cornelia Caragea
C. Lee Giles

Since the seminal work of Sampath et al. in 1996, despite the subsequent flourishing of techniques on diagnosis of discrete-event systems (DESs), the basic notions of fault and diagnosis have been remaining conceptually unchanged. Faults are defined at component level and diagnoses incorporate the occurrences of component faults within system evolutions: diagnosis is context-free. As this approach may be unsatisfactory for a complex DES, whose topology is organized in a hierarchy of abstractions, we propose to define different diagnosis rules for different subsystems in the hierarchy. Relevant fault patterns are specified as regular expressions on patterns of lower-level subsystems. Separation of concerns is achieved and the expressive power of diagnosis is enhanced: each subsystem has its proper set of diagnosis rules, which may or may not depend on the rules of other subsystems. Diagnosis is no longer anchored to components: it becomes context-sensitive. The approach yields seemingly contradictory but nonetheless possible scenarios: a subsystem can be normal despite the faulty behavior of a number of its components (positive paradox); also, it can be faulty despite the normal behavior of all its components (negative paradox).

AAAI Conference 2010 Conference Paper

Adopting Inference Networks for Online Thread Retrieval

Sumit Bhatia
Prasenjit Mitra

Online forums contain valuable human-generated information. End-users looking for information would like to find only those threads in forums where relevant information is present. Due to the distinctive characteristics of forum pages from generic web pages, special techniques are required to organize and search for information in these forums. Threads and pages in forums are different from other webpages in their hyperlinking patterns. Forum posts also have associated social and non-textual metadata. In this paper, we propose a model for online thread retrieval based on inference networks that utilizes the structural properties of forum threads. We also investigate the effects of incorporating various relevance indicators in our model. We empirically show the effectiveness of our proposed model using real-world data.

AAAI Conference 2010 Conference Paper

Utilizing Context in Generative Bayesian Models for Linked Corpus

Saurabh Kataria
Prasenjit Mitra
Sumit Bhatia

In an interlinked corpus of documents, the context in which a citation appears provides extra information about the cited document. However, associating terms in the context to the cited document remains an open problem. We propose a novel document generation approach that statistically incorporates the context in which a document links to another document. We quantitatively show that the proposed generation scheme explains the linking phenomenon better than previous approaches. The context information along with the actual content of the document provides significant improvements over the previous approaches for various real-world evaluation tasks such as link prediction and log-likelihood estimation on unseen content. The proposed method is more scalable to large collections of documents compared to the previous approaches.

AAAI Conference 2008 Conference Paper

Automatic Extraction of Data Points and Text Blocks from 2-Dimensional Plots in Digital Documents

Saurabh Kataria
Prasenjit Mitra

Two dimensional plots (2-D) in digital documents on the web are an important source of information that is largely under-utilized. In this paper, we outline how data and text can be extracted automatically from these 2-D plots, thus eliminating a time consuming manual process. Our information extraction algorithm identifies the axes of the figures, extracts text blocks like axes-labels and legends and identifies data points in the figure. It also extracts the units appearing in the axes labels and segments the legends to identify the different lines in the legend, the different symbols and their associated text explanations. Our algorithm also performs the challenging task of separating out overlapping text and data points effectively. Our experiments indicate that these techniques are computationally efficient and provide acceptable accuracy.

AAAI Conference 2008 Conference Paper

Hierarchical Location and Topic Based Query Expansion

Shu Huang
Prasenjit Mitra

In this paper, we propose a novel approach to expand queries by exploring both location information and topic information of the queries. Users at different locations tend to have different vocabularies, while the different expressions coming from different vocabularies may relate to the same topics. Thus these expressions are identified as location sensitive and can be used for query expansion. We propose a hierarchical query expansion model, which employs a two-level SVM classification model to classify queries as location sensitive or location non-sensitive, where the former are further classified into same location sensitive and different location sensitive. For the location sensitive queries, we propose an LDA based topic-level query similarity measure to rank the list of similar queries. Experiments with 2G raw log data from CiteSeer and Excite show that our hierarchical classification model predicts the query location sensitivity with more than 80% precision and that the final search result is significantly better than existing query expansion methods.

AAAI Conference 2007 Conference Paper

TableRank: A Ranking Algorithm for Table Search and Retrieval

Ying Liu
Prasenjit Mitra

Tables are ubiquitous in web pages and scientific documents. With the explosive development of the web, tables have become a valuable information repository. Therefore, effectively and efficiently searching tables becomes a challenge. Existing search engines do not provide satisfactory search results largely because the current ranking schemes are inadequate for table search and automatic table understanding and extraction are rather difficult in general. In this work, we design and evaluate a novel table ranking algorithm, TableRank, to improve the performance of our table search engine TableSeer. Given a keyword based table query, TableRank enables TableSeer to return the most relevant tables by tailoring the classic vector space model. TableRank adopts an innovative term weighting scheme by aggregating multiple weighting factors from three levels: term, table and document. The experimental results show that our table search engine outperforms existing search engines on table search. In addition, incorporating multiple weighting factors can significantly improve the ranking results.

TCS Journal 2006 Journal Article

Rewriting queries using views in the presence of arithmetic comparisons

Foto Afrati
Chen Li
Prasenjit Mitra

We consider the problem of answering queries using views, where queries and views are conjunctive queries with arithmetic comparisons over dense orders. Previous work only considered limited variants of this problem, without giving a complete solution. We first show that obtaining equivalent rewritings for conjunctive queries with arithmetic comparisons is decidable. Then, we consider the problem of finding maximally contained rewritings (MCRs) where the decidability proof does not carry over. We investigate two special cases of this problem where the query uses only semi-interval comparisons. In both cases decidability of finding MCRs depends on the query containment test. First, we address the case where the homomorphism property holds in testing query containment. In this case decidability is easy to prove but developing an efficient algorithm is not trivial. We develop such an algorithm and prove that it is sound and complete. This algorithm applies in many cases where the query uses only left (or right) semi-interval comparisons. Then, we develop a new query containment test for the case where the containing query uses both left and right semi-interval comparisons but with only one left (or right) semi-interval subgoal. Based on this test, we show how to produce an MCR which is a Datalog query with arithmetic comparisons. The containment test that we develop obtains a result of independent interest. It finds another special case where query containment in the presence of arithmetic comparisons can be tested in nondeterministic polynomial time.
