Author name cluster

Jiawei Han

Possible papers associated with this exact author name in Arrow. This page groups case-insensitive exact name matches and is not a full identity disambiguation profile.

47 papers

1 author row

TMLR Journal 2026 Journal Article

Pave Your Own Path: Graph Gradual Domain Adaptation on Fused Gromov-Wasserstein Geodesics

Zhichen Zeng
Ruizhong Qiu
Wenxuan Bao
Tianxin Wei
Xiao Lin
Yuchen Yan
Tarek F. Abdelzaher
Jiawei Han

Graph neural networks, despite their impressive performance, are highly vulnerable to distribution shifts on graphs. Existing graph domain adaptation (graph DA) methods often implicitly assume a mild shift between source and target graphs, limiting their applicability to real-world scenarios with large shifts. Gradual domain adaptation (GDA) has emerged as a promising approach for addressing large shifts by gradually adapting the source model to the target domain via a path of unlabeled intermediate domains. Existing GDA methods exclusively focus on independent and identically distributed (IID) data with a predefined path, leaving their extension to non-IID graphs without a given path an open challenge. To bridge this gap, we present Gadget, the first GDA framework for non-IID graph data. First (theoretical foundation), the Fused Gromov-Wasserstein (FGW) distance is adopted as the domain discrepancy for non-IID graphs, based on which, we derive an error bound on node, edge and graph-level tasks, showing that the target domain error is proportional to the length of the path. Second (optimal path), guided by the error bound, we identify the FGW geodesic as the optimal path, which can be efficiently generated by our proposed algorithm. The generated path can be seamlessly integrated with existing graph DA methods to handle large shifts on graphs, improving state-of-the-art graph DA methods by up to 6.8% in accuracy on real-world datasets.

AAAI Conference 2025 Conference Paper

Bi-level Contrastive Learning for Knowledge-Enhanced Molecule Representations

Pengcheng Jiang
Cao Xiao
Tianfan Fu
Parminder Bhatia
Taha Kass-Hout
Jimeng Sun
Jiawei Han

Molecular representation learning is vital for various downstream applications, including the analysis and prediction of molecular properties and side effects. While Graph Neural Networks (GNNs) have been a popular framework for modeling molecular data, they often struggle to capture the full complexity of molecular representations. In this paper, we introduce a novel method called Gode, which accounts for the dual-level structure inherent in molecules. Molecules possess an intrinsic graph structure and simultaneously function as nodes within a broader molecular knowledge graph. Gode integrates individual molecular graph representations with multi-domain biochemical data from knowledge graphs. By pre-training two GNNs on different graph structures and employing contrastive learning, Gode effectively fuses molecular structures with their corresponding knowledge graph substructures. This fusion yields a more robust and informative representation, enhancing molecular property predictions by leveraging both chemical and biological information. When fine-tuned across 11 chemical property tasks, our model significantly outperforms existing benchmarks, achieving an average ROC-AUC improvement of 12.7% for classification tasks and an average RMSE/MAE improvement of 34.4% for regression tasks. Notably, Gode surpasses the current leading model in property prediction, with advancements of 2.2% in classification and 7.2% in regression tasks.

IJCAI Conference 2025 Conference Paper

CoDiCast: Conditional Diffusion Model for Global Weather Forecasting with Uncertainty Quantification

Jimeng Shi
Bowen Jin
Jiawei Han
Sundararaman Gopalakrishnan
Giri Narasimhan

Accurate weather forecasting is critical for science and society. However, existing methods have not achieved the combination of high accuracy, low uncertainty, and high computational efficiency simultaneously. On one hand, traditional numerical weather prediction (NWP) models are computationally intensive because of their complexity. On the other hand, most machine learning-based weather prediction (MLWP) approaches offer efficiency and accuracy but remain deterministic, lacking the ability to capture forecast uncertainty. To tackle these challenges, we propose a conditional diffusion model, CoDiCast, to generate global weather prediction, integrating accuracy and uncertainty quantification at a modest computational cost. The key idea behind the prediction task is to generate realistic weather scenarios at a future time point, conditioned on observations from the recent past. Due to the probabilistic nature of diffusion models, they can be properly applied to capture the uncertainty of weather predictions. Therefore, we accomplish uncertainty quantifications by repeatedly sampling from stochastic Gaussian noise for each initial weather state and running the denoising process multiple times. Experimental results demonstrate that CoDiCast outperforms several existing MLWP methods in accuracy, and is faster than NWP models in inference speed. Our model can generate 6-day global weather forecasts, at 6-hour steps and 5.625-degree latitude-longitude resolutions, for over 5 variables, in about 12 minutes on a commodity A100 GPU machine with 80GB memory. The source code is available at https://github.com/JimengShi/CoDiCast.
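
The repeated-sampling idea in this abstract can be illustrated with a toy sketch. It is not the CoDiCast code: `sample_forecast` is a hypothetical stand-in for one full denoising run, and the scalar "state" and noise scale are assumptions made only to keep the example runnable. It shows how an ensemble mean and spread could be read off repeated stochastic samples.

```python
import random
import statistics


def sample_forecast(initial_state, rng):
    """Hypothetical stand-in for one denoising run of a diffusion model.

    A real model would start from Gaussian noise and denoise it conditioned
    on past observations; here we just perturb the initial state.
    """
    noise = rng.gauss(0.0, 1.0)
    return initial_state + 0.5 * noise


def ensemble_forecast(initial_state, n_members=20, seed=0):
    """Run the sampler several times and summarise the resulting ensemble."""
    rng = random.Random(seed)
    members = [sample_forecast(initial_state, rng) for _ in range(n_members)]
    mean = statistics.fmean(members)    # point forecast
    spread = statistics.stdev(members)  # simple uncertainty estimate
    return mean, spread, members


if __name__ == "__main__":
    mean, spread, _ = ensemble_forecast(15.0, n_members=50)
    print(f"ensemble mean={mean:.2f}, spread={spread:.2f}")
```

A larger spread across members would indicate a less certain forecast for that initial state.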

NeurIPS Conference 2025 Conference Paper

DynamicRAG: Leveraging Outputs of Large Language Model as Feedback for Dynamic Reranking in Retrieval-Augmented Generation

Jiashuo Sun
Xianrui Zhong
Sizhe Zhou
Jiawei Han

Retrieval-augmented generation (RAG) systems combine large language models (LLMs) with external knowledge retrieval, making them highly effective for knowledge-intensive tasks. A crucial but often under-explored component of these systems is the reranker, which refines retrieved documents to enhance generation quality and explainability. The challenge of selecting the optimal number of documents (k) remains unsolved: too few may omit critical information, while too many introduce noise and inefficiencies. Although recent studies have explored LLM-based rerankers, they primarily leverage internal model knowledge and overlook the rich supervisory signals that LLMs can provide, such as using response quality as feedback for optimizing reranking decisions. In this paper, we propose DynamicRAG, a novel RAG framework where the reranker dynamically adjusts both the order and number of retrieved documents based on the query. We model the reranker as an agent optimized through reinforcement learning (RL), using rewards derived from LLM output quality. Across seven knowledge-intensive datasets, DynamicRAG demonstrates superior performance, achieving state-of-the-art results.

NeurIPS Conference 2025 Conference Paper

FGBench: A Dataset and Benchmark for Molecular Property Reasoning at Functional Group-Level in Large Language Models

Xuan Liu
Siru Ouyang
Xianrui Zhong
Jiawei Han
Huimin Zhao

Large language models (LLMs) have gained significant attention in chemistry. However, most existing datasets center on molecular-level property prediction and overlook the role of fine-grained functional group (FG) information. Incorporating FG-level data can provide valuable prior knowledge that links molecular structures with textual descriptions, which can be used to build more interpretable, structure-aware LLMs for reasoning on molecule-related tasks. Moreover, LLMs can learn from such fine-grained information to uncover hidden relationships between specific functional groups and molecular properties, thereby advancing molecular design and drug discovery. Here, we introduce FGBench, a dataset comprising 625K molecular property reasoning problems with functional group information. Functional groups are precisely annotated and localized within the molecule, which ensures the dataset's interoperability, thereby facilitating further multimodal applications. FGBench includes both regression and classification tasks on 245 different functional groups across three categories for molecular property reasoning: (1) single functional group impacts, (2) multiple functional group interactions, and (3) direct molecular comparisons. In the benchmark of state-of-the-art LLMs on 7K curated data, the results indicate that current LLMs struggle with FG-level property reasoning, highlighting the need to enhance reasoning capabilities in LLMs for chemistry tasks. We anticipate that the methodology employed in FGBench to construct datasets with functional group-level information will serve as a foundational framework for generating new question–answer pairs, enabling LLMs to better understand fine-grained molecular structure–property relationships. The dataset and evaluation code are available at https://github.com/xuanliugit/FGBench.

NeurIPS Conference 2025 Conference Paper

Hybrid Latent Reasoning via Reinforcement Learning

Zhenrui Yue
Bowen Jin
Huimin Zeng
Honglei Zhuang
Zhen Qin
Jinsung Yoon
Lanyu Shang
Jiawei Han

Recent advances in large language models (LLMs) have introduced latent reasoning as a promising alternative to autoregressive reasoning. By performing internal computation with hidden states from previous steps, latent reasoning benefits from more informative features rather than sampling a discrete chain-of-thought (CoT) path. Yet latent reasoning approaches are often incompatible with LLMs, as their continuous paradigm conflicts with the discrete nature of autoregressive generation. Moreover, these methods rely on CoT traces for training and thus fail to exploit the inherent reasoning patterns of LLMs. In this work, we explore latent reasoning by leveraging the intrinsic capabilities of LLMs via reinforcement learning (RL). To this end, we introduce hybrid reasoning policy optimization (HRPO), an RL-based hybrid latent reasoning approach that (1) integrates prior hidden states into sampled tokens with a learnable gating mechanism, and (2) initializes training with predominantly token embeddings while progressively incorporating more hidden features. This design maintains LLMs' generative capabilities and incentivizes hybrid reasoning using both discrete and continuous representations. In addition, the hybrid HRPO introduces stochasticity into latent reasoning via token sampling, thereby enabling RL-based optimization without requiring CoT trajectories. Extensive evaluations across diverse benchmarks show that HRPO outperforms prior methods in both knowledge- and reasoning-intensive tasks. Furthermore, HRPO-trained LLMs remain interpretable and exhibit intriguing behaviors like cross-lingual patterns and shorter completion lengths, highlighting the potential of our RL-based approach and offering insights for future work in latent reasoning.

NeurIPS Conference 2025 Conference Paper

RAST: Reasoning Activation in LLMs via Small-model Transfer

Siru Ouyang
Xinyu Zhu
Zilin Xiao
Minhao Jiang
Yu Meng
Jiawei Han

Reinforcement learning (RL) has become a powerful approach for improving the reasoning capabilities of large language models (LLMs), as evidenced by recent successes such as OpenAI's o1 and DeepSeek-R1. However, applying RL at scale remains intimidatingly resource-intensive, requiring multiple model copies and extensive GPU workloads. On the other hand, while being powerful, recent studies suggest that RL does not fundamentally endow models with new knowledge; rather, it primarily reshapes the model's output distribution to activate reasoning capabilities latent in the base model. Building on this insight, we hypothesize that the changes in output probabilities induced by RL are largely model-size invariant, opening the door to a more efficient paradigm: training a small model with RL and transferring its induced probability shifts to larger base models. To verify our hypothesis, we conduct a token-level analysis of decoding trajectories and find high alignment in RL-induced output distributions across model scales, validating our hypothesis. Motivated by this, we propose RAST, a simple yet effective method that transfers reasoning behaviors by injecting RL-induced probability adjustments from a small RL-trained model into larger models. Experiments across multiple mathematical reasoning benchmarks show that RAST substantially and consistently enhances the reasoning capabilities of base models while requiring significantly lower GPU memory than direct RL training, sometimes even yielding better performance than the RL-trained counterparts. Our findings offer new insights into the nature of RL-driven reasoning and practical strategies for scaling its benefits without incurring its full computational cost. The project page of RAST is available at https://ozyyshr.github.io/RAST/.
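
The abstract does not give the exact transfer rule, so the following is only one plausible reading, sketched under assumptions: the RL-induced shift is taken as the difference between the small RL-trained model's logits and the small base model's logits, and it is added to the large base model's logits at each decoding step. The function names and the `scale` parameter are hypothetical.

```python
import math


def softmax(logits):
    """Numerically stable softmax over a list of logits."""
    m = max(logits)
    exps = [math.exp(x - m) for x in logits]
    total = sum(exps)
    return [e / total for e in exps]


def transfer_logits(large_base, small_base, small_rl, scale=1.0):
    """Add the small model's RL-induced logit shift to the large base model.

    Assumption: all three models share a vocabulary, so logits align by index.
    """
    return [
        lb + scale * (sr - sb)
        for lb, sb, sr in zip(large_base, small_base, small_rl)
    ]


if __name__ == "__main__":
    large_base = [2.0, 1.0, 0.0]  # next-token logits from the large base model
    small_base = [1.0, 1.0, 0.0]  # same step, small base model
    small_rl = [1.0, 2.5, 0.0]    # same step, small RL-trained model
    adjusted = transfer_logits(large_base, small_base, small_rl)
    print(softmax(large_base))
    print(softmax(adjusted))
```

In this toy case the small model's RL training raised the logit of token 1, and the same shift raises that token's probability under the large model without any training of the large model.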

NeurIPS Conference 2025 Conference Paper

The Unreasonable Effectiveness of Entropy Minimization in LLM Reasoning

Shivam Agarwal
Zimin Zhang
Lifan Yuan
Jiawei Han
Hao Peng

Entropy minimization (EM) trains the model to concentrate even more probability mass on its most confident outputs. We show that this simple objective alone, without any labeled data, can substantially improve large language models’ (LLMs) performance on challenging math, physics, and coding tasks. We explore three approaches: (1) EM-FT minimizes token-level entropy similarly to instruction finetuning, but on unlabeled outputs drawn from the model; (2) EM-RL: reinforcement learning with negative entropy as the only reward to maximize; (3) EM-INF: inference-time logit adjustment to reduce entropy without any training data or parameter updates. On Qwen-7B, EM-RL, without any labeled data, achieves comparable or better performance than strong RL baselines such as GRPO and RLOO that are trained on 60K labeled examples. Furthermore, EM-INF enables Qwen-32B to match or exceed the performance of proprietary models like GPT-4o, Claude 3 Opus, and Gemini 1.5 Pro on the challenging SciCode benchmark, while being 3x more efficient than self-consistency and sequential refinement. Our findings reveal that many pretrained LLMs possess previously underappreciated reasoning capabilities that can be effectively elicited through entropy minimization alone, without any labeled data or even any parameter updates.
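
The quantity being minimized here is the Shannon entropy of the next-token distribution. The sketch below computes it and shows that sharpening the logits lowers it. The abstract does not specify how EM-INF adjusts logits, so the multiplicative `scale` used here is an assumed stand-in chosen for illustration, not the paper's procedure.

```python
import math


def softmax(logits, scale=1.0):
    """Softmax of scaled logits; scale > 1 sharpens the distribution."""
    scaled = [scale * x for x in logits]
    m = max(scaled)
    exps = [math.exp(x - m) for x in scaled]
    total = sum(exps)
    return [e / total for e in exps]


def entropy(probs):
    """Shannon entropy in nats, skipping zero-probability entries."""
    return -sum(p * math.log(p) for p in probs if p > 0)


if __name__ == "__main__":
    logits = [2.0, 1.0, 0.5]
    for scale in (1.0, 2.0, 4.0):
        print(scale, round(entropy(softmax(logits, scale)), 4))
```

Entropy falls as `scale` grows because more probability mass concentrates on the highest-logit token, which is the "concentrate probability mass on confident outputs" effect described above.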

NeurIPS Conference 2024 Conference Paper

InstructG2I: Synthesizing Images from Multimodal Attributed Graphs

Bowen Jin
Ziqi Pang
Bingjun Guo
Yu-Xiong Wang
Jiaxuan You
Jiawei Han

In this paper, we approach an overlooked yet critical task, Graph2Image: generating images from multimodal attributed graphs (MMAGs). This task poses significant challenges due to the explosion in graph size, dependencies among graph entities, and the need for controllability in graph conditions. To address these challenges, we propose a graph context-conditioned diffusion model called InstructG2I. InstructG2I first exploits the graph structure and multimodal information to conduct informative neighbor sampling by combining personalized page rank and re-ranking based on vision-language features. Then, a graph QFormer encoder adaptively encodes the graph nodes into an auxiliary set of graph prompts to guide the denoising process of diffusion. Finally, we propose graph classifier-free guidance, enabling controllable generation by varying the strength of graph guidance and multiple connected edges to a node. Extensive experiments conducted on three datasets from different domains demonstrate the effectiveness and controllability of our approach. The code is available at https://github.com/PeterGriffinJin/InstructG2I.

NeurIPS Conference 2024 Conference Paper

KG-FIT: Knowledge Graph Fine-Tuning Upon Open-World Knowledge

Pengcheng Jiang
Lang Cao
Cao Xiao
Parminder Bhatia
Jimeng Sun
Jiawei Han

Knowledge Graph Embedding (KGE) techniques are crucial in learning compact representations of entities and relations within a knowledge graph, facilitating efficient reasoning and knowledge discovery. While existing methods typically focus either on training KGE models solely based on graph structure or fine-tuning pre-trained language models with classification data in KG, KG-FIT leverages LLM-guided refinement to construct a semantically coherent hierarchical structure of entity clusters. By incorporating this hierarchical knowledge along with textual information during the fine-tuning process, KG-FIT effectively captures both global semantics from the LLM and local semantics from the KG. Extensive experiments on the benchmark datasets FB15K-237, YAGO3-10, and PrimeKG demonstrate the superiority of KG-FIT over state-of-the-art pre-trained language model-based methods, achieving improvements of 14.4%, 13.5%, and 11.9% in the Hits@10 metric for the link prediction task, respectively. Furthermore, KG-FIT yields substantial performance gains of 12.6%, 6.7%, and 17.7% compared to the structure-based base models upon which it is built. These results highlight the effectiveness of KG-FIT in incorporating open-world knowledge from LLMs to significantly enhance the expressiveness and informativeness of KG embeddings.

TMLR Journal 2024 Journal Article

Multi-LoRA Composition for Image Generation

Ming Zhong
Yelong Shen
Shuohang Wang
Yadong Lu
Yizhu Jiao
Siru Ouyang
Donghan Yu
Jiawei Han

Low-Rank Adaptation (LoRA) is extensively utilized in text-to-image models for the accurate rendition of specific elements like distinct characters or unique styles in generated images. Nonetheless, existing methods face challenges in effectively composing multiple LoRAs, especially as the number of LoRAs to be integrated grows, thus hindering the creation of complex imagery. In this paper, we study multi-LoRA composition through a decoding-centric perspective. We present two training-free methods: LoRA Switch, which alternates between different LoRAs at each denoising step, and LoRA Composite, which simultaneously incorporates all LoRAs to guide more cohesive image synthesis. To evaluate the proposed approaches, we establish ComposLoRA, a new comprehensive testbed as part of this research. It features a diverse range of LoRA categories with 480 composition sets. Utilizing an evaluation framework based on GPT-4V, our findings demonstrate a clear improvement in performance with our methods over the prevalent baseline, particularly evident when increasing the number of LoRAs in a composition. The code, benchmarks, LoRA weights, and all evaluation details are available on our project website: https://maszhongming.github.io/Multi-LoRA-Composition.

AAAI Conference 2024 Conference Paper

Seed-Guided Fine-Grained Entity Typing in Science and Engineering Domains

Yu Zhang
Yunyi Zhang
Yanzhen Shen
Yu Deng
Lucian Popa
Larisa Shwartz
ChengXiang Zhai
Jiawei Han

Accurately typing entity mentions from text segments is a fundamental task for various natural language processing applications. Many previous approaches rely on massive human-annotated data to perform entity typing. Nevertheless, collecting such data in highly specialized science and engineering domains (e.g., software engineering and security) can be time-consuming and costly, not to mention the domain gaps between training and inference data if the model needs to be applied to confidential datasets. In this paper, we study the task of seed-guided fine-grained entity typing in science and engineering domains, which takes the name and a few seed entities for each entity type as the only supervision and aims to classify new entity mentions into both seen and unseen types (i.e., those without seed entities). To solve this problem, we propose SEType, which first enriches the weak supervision by finding more entities for each seen type from an unlabeled corpus using the contextualized representations of pre-trained language models. It then matches the enriched entities to unlabeled text to get pseudo-labeled samples and trains a textual entailment model that can make inferences for both seen and unseen types. Extensive experiments on two datasets covering four domains demonstrate the effectiveness of SEType in comparison with various baselines. Code and data are available at: https://github.com/yuzhimanhua/SEType.

NeurIPS Conference 2022 Conference Paper

Generating Training Data with Language Models: Towards Zero-Shot Language Understanding

Yu Meng
Jiaxin Huang
Yu Zhang
Jiawei Han

Pretrained language models (PLMs) have demonstrated remarkable performance in various natural language processing tasks: Unidirectional PLMs (e.g., GPT) are well known for their superior text generation capabilities; bidirectional PLMs (e.g., BERT) have been the prominent choice for natural language understanding (NLU) tasks. While both types of models have achieved promising few-shot learning performance, their potential for zero-shot learning has been underexplored. In this paper, we present a simple approach that uses both types of PLMs for fully zero-shot learning of NLU tasks without requiring any task-specific data: A unidirectional PLM generates class-conditioned texts guided by prompts, which are used as the training data for fine-tuning a bidirectional PLM. With quality training data selected based on the generation probability and regularization techniques (label smoothing and temporal ensembling) applied to the fine-tuning stage for better generalization and stability, our approach demonstrates strong performance across seven classification tasks of the GLUE benchmark (e.g., 72.3/73.8 on MNLI-m/mm and 92.8 on SST-2), significantly outperforming zero-shot prompting methods and achieving even comparable results to strong few-shot approaches using 32 training samples per class.

NeurIPS Conference 2021 Conference Paper

COCO-LM: Correcting and Contrasting Text Sequences for Language Model Pretraining

Yu Meng
Chenyan Xiong
Payal Bajaj
Saurabh Tiwary
Paul Bennett
Jiawei Han
Xia Song

We present a self-supervised learning framework, COCO-LM, that pretrains Language Models by COrrecting and COntrasting corrupted text sequences. Following ELECTRA-style pretraining, COCO-LM employs an auxiliary language model to corrupt text sequences, upon which it constructs two new tasks for pretraining the main model. The first token-level task, Corrective Language Modeling, is to detect and correct tokens replaced by the auxiliary model, in order to better capture token-level semantics. The second sequence-level task, Sequence Contrastive Learning, is to align text sequences originated from the same source input while ensuring uniformity in the representation space. Experiments on GLUE and SQuAD demonstrate that COCO-LM not only outperforms recent state-of-the-art pretrained models in accuracy, but also improves pretraining efficiency. It achieves the MNLI accuracy of ELECTRA with 50% of its pretraining GPU hours. With the same pretraining steps of standard base/large-sized models, COCO-LM outperforms the previous best models by 1+ GLUE average points.

NeurIPS Conference 2021 Conference Paper

Shift-Robust GNNs: Overcoming the Limitations of Localized Graph Training Data

Qi Zhu
Natalia Ponomareva
Jiawei Han
Bryan Perozzi

There has been a recent surge of interest in designing Graph Neural Networks (GNNs) for semi-supervised learning tasks. Unfortunately, this work has assumed that the nodes labeled for use in training were selected uniformly at random (i.e., are an IID sample). However, in many real-world scenarios gathering labels for graph nodes is both expensive and inherently biased, so this assumption cannot be met. GNNs can suffer poor generalization when this occurs, by overfitting to superfluous regularities present in the training data. In this work we present a method, Shift-Robust GNN (SR-GNN), designed to account for distributional differences between biased training data and the graph's true inference distribution. SR-GNN adapts GNN models for the presence of distributional shifts between the nodes which have had labels provided for training and the rest of the dataset. We illustrate the effectiveness of SR-GNN in a variety of experiments with biased training datasets on common GNN benchmark datasets for semi-supervised learning, where we see that SR-GNN outperforms other GNN baselines in accuracy, eliminating at least ~40% of the negative effects introduced by biased training data. On the largest dataset we consider, ogb-arxiv, we observe a 2% absolute improvement over the baseline and reduce 30% of the negative effects.

IJCAI Conference 2021 Conference Paper

TAXOGAN: Hierarchical Network Representation Learning via Taxonomy Guided Generative Adversarial Networks (Extended Abstract)

Carl Yang
Jieyu Zhang
Jiawei Han

Network representation learning aims at transferring node proximity in networks into distributed vectors, which can be leveraged in various downstream applications. Recent research has shown that nodes in a network can often be organized in latent hierarchical structures, but without a particular underlying taxonomy, the learned node embedding is less useful and less interpretable. In this work, we aim to improve network embedding by modeling the conditional node proximity in networks indicated by node labels residing in real taxonomies. In the meantime, we also aim to model the hierarchical label proximity in the given taxonomies, which is too coarse by solely looking at the hierarchical topologies. Comprehensive experiments and case studies demonstrate the utility of TAXOGAN.

NeurIPS Conference 2021 Conference Paper

Transfer Learning of Graph Neural Networks with Ego-graph Information Maximization

Qi Zhu
Carl Yang
Yidan Xu
Haonan Wang
Chao Zhang
Jiawei Han

Graph neural networks (GNNs) have achieved superior performance in various applications, but training dedicated GNNs can be costly for large-scale graphs. Some recent work started to study the pre-training of GNNs. However, none of them provide theoretical insights into the design of their frameworks, or clear requirements and guarantees towards their transferability. In this work, we establish a theoretically grounded and practically useful framework for the transfer learning of GNNs. Firstly, we propose a novel view towards the essential graph information and advocate the capturing of it as the goal of transferable GNN training, which motivates the design of EGI (Ego-Graph Information maximization) to analytically achieve this goal. Secondly, when node features are structure-relevant, we conduct an analysis of EGI transferability regarding the difference between the local graph Laplacians of the source and target graphs. We conduct controlled synthetic experiments to directly justify our theoretical conclusions. Comprehensive experiments on two real-world network datasets show consistent results in the analyzed setting of direct-transferring, while those on large-scale knowledge graphs show promising results in the more practical setting of transferring with fine-tuning.

NeurIPS Conference 2021 Conference Paper

Universal Graph Convolutional Networks

Di Jin
Zhizhi Yu
Cuiying Huo
Rui Wang
Xiao Wang
Dongxiao He
Jiawei Han

Graph Convolutional Networks (GCNs), aiming to obtain the representation of a node by aggregating its neighbors, have demonstrated great power in tackling various analytics tasks on graph (network) data. The remarkable performance of GCNs typically relies on the homophily assumption of networks, while such an assumption cannot always be satisfied, since heterophily and randomness are also widespread in the real world. This gives rise to one fundamental question: should networks with different structural properties adopt different propagation mechanisms? In this paper, we first conduct an experimental investigation. Surprisingly, we discover that there are actually segmentation rules for the propagation mechanism, i.e., 1-hop, 2-hop and $k$-nearest neighbor ($k$NN) neighbors are more suitable as neighborhoods of networks with complete homophily, complete heterophily and randomness, respectively. However, real-world networks are complex, and may present diverse structural properties, e.g., a network dominated by homophily may contain a small amount of randomness. So can we reasonably utilize these segmentation rules to design a universal propagation mechanism independent of the network structural assumption? To tackle this challenge, we develop a new universal GCN framework, namely U-GCN. It first introduces a multi-type convolution to extract information from 1-hop, 2-hop and $k$NN networks simultaneously, and then designs a discriminative aggregation to sufficiently fuse them according to the given learning objectives. Extensive experiments demonstrate the superiority of U-GCN over state-of-the-art methods. The code and data are available at https://github.com/jindi-tju.

AAAI Conference 2020 Conference Paper

Unsupervised Attributed Multiplex Network Embedding

Chanyoung Park
Donghyun Kim
Jiawei Han
Hwanjo Yu

Nodes in a multiplex network are connected by multiple types of relations. However, most existing network embedding methods assume that only a single type of relation exists between nodes. Even for those that consider the multiplexity of a network, they overlook node attributes, resort to node labels for training, and fail to model the global properties of a graph. We present a simple yet effective unsupervised network embedding method for attributed multiplex network called DMGI, inspired by Deep Graph Infomax (DGI) that maximizes the mutual information between local patches of a graph, and the global representation of the entire graph. We devise a systematic way to jointly integrate the node embeddings from multiple graphs by introducing 1) the consensus regularization framework that minimizes the disagreements among the relation-type specific node embeddings, and 2) the universal discriminator that discriminates true samples regardless of the relation types. We also show that the attention mechanism infers the importance of each relation type, and thus can be useful for filtering unnecessary relation types as a preprocessing step. Extensive experiments on various downstream tasks demonstrate that DMGI outperforms the state-of-the-art methods, even though DMGI is fully unsupervised.

IJCAI Conference 2020 Conference Paper

When Do GNNs Work: Understanding and Improving Neighborhood Aggregation

Yiqing Xie
Sha Li
Carl Yang
Raymond Chi-Wing Wong
Jiawei Han

Graph Neural Networks (GNNs) have been shown to be powerful in a wide range of graph-related tasks. While there exist various GNN models, a critical common ingredient is neighborhood aggregation, where the embedding of each node is updated by referring to the embedding of its neighbors. This paper aims to provide a better understanding of this mechanism by asking the following question: Is neighborhood aggregation always necessary and beneficial? In short, the answer is no. We carve out two conditions under which neighborhood aggregation is not helpful: (1) when a node's neighbors are highly dissimilar and (2) when a node's embedding is already similar to that of its neighbors. We propose novel metrics that quantitatively measure these two circumstances and integrate them into an Adaptive-layer module. Our experiments show that allowing for node-specific aggregation degrees has a significant advantage over current GNNs.

AAAI Conference 2019 Conference Paper

Mining Entity Synonyms with Efficient Neural Set Generation

Jiaming Shen
Ruiliang Lyu
Xiang Ren
Michelle Vanni
Brian Sadler
Jiawei Han

Mining entity synonym sets (i.e., sets of terms referring to the same entity) is an important task for many entity-leveraging applications. Previous work either ranks terms based on their similarity to a given query term, or treats the problem as a two-phase task (i.e., detecting synonymy pairs, followed by organizing these pairs into synonym sets). However, these approaches fail to model the holistic semantics of a set and suffer from the error propagation issue. Here we propose a new framework, named SynSetMine, that efficiently generates entity synonym sets from a given vocabulary, using example sets from external knowledge bases as distant supervision. SynSetMine consists of two novel modules: (1) a set-instance classifier that jointly learns how to represent a permutation invariant synonym set and whether to include a new instance (i.e., a term) into the set, and (2) a set generation algorithm that enumerates the vocabulary only once and applies the learned set-instance classifier to detect all entity synonym sets in it. Experiments on three real datasets from different domains demonstrate both effectiveness and efficiency of SynSetMine for mining entity synonym sets.

NeurIPS Conference 2019 Conference Paper

Spherical Text Embedding

Yu Meng
Jiaxin Huang
Guangyuan Wang
Chao Zhang
Honglei Zhuang
Lance Kaplan
Jiawei Han

Unsupervised text embedding has shown great power in a wide range of NLP tasks. While text embeddings are typically learned in the Euclidean space, directional similarity is often more effective in tasks such as word similarity and document clustering, which creates a gap between the training stage and usage stage of text embedding. To close this gap, we propose a spherical generative model based on which unsupervised word and paragraph embeddings are jointly learned. To learn text embeddings in the spherical space, we develop an efficient optimization algorithm with convergence guarantee based on Riemannian optimization. Our model enjoys high efficiency and achieves state-of-the-art performances on various text embedding tasks including word similarity and document clustering.
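The gap between Euclidean training and directional usage mentioned in this abstract can be seen in a small sketch. The vectors below are made up for illustration, and the helper names are assumptions of this example rather than anything from the paper.

```python
import math


def normalize(v):
    """Project a vector onto the unit sphere (l2 normalization)."""
    norm = math.sqrt(sum(x * x for x in v))
    return [x / norm for x in v]


def cosine_similarity(u, v):
    """Directional similarity: dot product of the unit-normalized vectors."""
    return sum(a * b for a, b in zip(normalize(u), normalize(v)))


a = [1.0, 2.0]
b = [10.0, 20.0]  # same direction as a, ten times the length
c = [2.0, 1.0]    # close to a in Euclidean terms, different direction

# Euclidean distance says c is much closer to a than b is ...
print(math.dist(a, b), math.dist(a, c))   # about 20.1 versus about 1.41
# ... while directional similarity ranks b above c.
print(cosine_similarity(a, b), cosine_similarity(a, c))
```

Embeddings that live on the unit sphere from the start remove this mismatch, because Euclidean distance and cosine similarity then induce the same ranking.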

AAAI Conference 2019 Conference Paper

Weakly-Supervised Hierarchical Text Classification

Yu Meng
Jiaming Shen
Chao Zhang
Jiawei Han

Hierarchical text classification, which aims to classify text documents into a given hierarchy, is an important task in many real-world applications. Recently, deep neural models are gaining increasing popularity for text classification due to their expressive power and minimum requirement for feature engineering. However, applying deep neural networks for hierarchical text classification remains challenging, because they heavily rely on a large amount of training data and meanwhile cannot easily determine appropriate levels of documents in the hierarchical setting. In this paper, we propose a weakly-supervised neural method for hierarchical text classification. Our method does not require a large amount of training data but requires only easy-to-provide weak supervision signals such as a few class-related documents or keywords. Our method effectively leverages such weak supervision signals to generate pseudo documents for model pre-training, and then performs self-training on real unlabeled data to iteratively refine the model. During the training process, our model features a hierarchical neural structure, which mimics the given hierarchy and is capable of determining the proper levels for documents with a blocking mechanism. Experiments on three datasets from different domains demonstrate the efficacy of our method compared with a comprehensive set of baselines.

AAAI Conference 2018 Conference Paper

A Spherical Hidden Markov Model for Semantics-Rich Human Mobility Modeling

Wanzheng Zhu
Chao Zhang
Shuochao Yao
Xiaobin Gao
Jiawei Han

We study the problem of modeling human mobility from semantic trace data, wherein each GPS record in a trace is associated with a text message that describes the user’s activity. Existing methods fall short in unveiling human movement regularities for such data, because they either do not model the text data at all or suffer severely from text sparsity. We propose SHMM, a multi-modal spherical hidden Markov model for semantics-rich human mobility modeling. Under the hidden Markov assumption, SHMM models the generation process of a given trace by jointly considering the observed location, time, and text at each step of the trace. The distinguishing characteristic of SHMM is the text modeling part. We use fixed-size vector representations to encode the semantics of the text messages, and model the generation of the l2-normalized text embeddings on a unit sphere with the von Mises-Fisher (vMF) distribution. Compared with alternatives like the multivariate Gaussian, our choice of the vMF distribution not only incurs far fewer parameters, but also better leverages the discriminative power of text embeddings in a directional metric space. The parameter inference for the vMF distribution is non-trivial since it involves functional inversion of ratios of Bessel functions. We theoretically prove, for the first time, that: 1) the classical Expectation-Maximization algorithm is able to work with vMF distributions; and 2) while closed-form solutions are hard to obtain for the M-step, Newton’s method is guaranteed to converge to the optimal solution with a quadratic convergence rate. We have performed extensive experiments on both synthetic and real-life data. The results on synthetic data verify our theoretical analysis, while the results on real-life data demonstrate that SHMM learns meaningful semantics-rich mobility models, outperforms state-of-the-art mobility models for next location prediction, and incurs lower training cost.

AAAI Conference 2018 Conference Paper

Empower Sequence Labeling with Task-Aware Neural Language Model

Liyuan Liu
Jingbo Shang
Xiang Ren
Frank Xu
Huan Gui
Jian Peng
Jiawei Han

Linguistic sequence labeling is a general approach encompassing a variety of problems, such as part-of-speech tagging and named entity recognition. Recent advances in neural networks (NNs) make it possible to build reliable models without handcrafted features. However, in many cases, it is hard to obtain sufficient annotations to train these models. In this study, we develop a neural framework to extract knowledge from raw texts and empower the sequence labeling task. Besides word-level knowledge contained in pretrained word embeddings, character-aware neural language models are incorporated to extract character-level knowledge. Transfer learning techniques are further adopted to mediate different components and guide the language model towards the key knowledge. Compared to previous methods, this task-specific knowledge allows us to adopt a more concise model and conduct more efficient training. Different from most transfer learning methods, the proposed framework does not rely on any additional supervision. It extracts knowledge from self-contained order information of training sequences. Extensive experiments on benchmark datasets demonstrate the effectiveness of leveraging character-level knowledge and the efficiency of co-training. For example, on the CoNLL03 NER task, model training completes in about 6 hours on a single GPU, reaching an F1 score of 91.71±0.10 without using any extra annotations.

TIST Journal 2018 Journal Article

GeoBurst+

Chao Zhang
Dongming Lei
Quan Yuan
Honglei Zhuang
Lance Kaplan
Shaowen Wang
Jiawei Han

The real-time discovery of local events (e.g., protests, disasters) has been widely recognized as a fundamental socioeconomic task. Recent studies have demonstrated that the geo-tagged tweet stream serves as an unprecedentedly valuable source for local event detection. Nevertheless, how to effectively extract local events from massive geo-tagged tweet streams in real time remains challenging. To bridge the gap, we propose a method for effective and real-time local event detection from geo-tagged tweet streams. Our method, named GeoBurst+, first leverages a novel cross-modal authority measure to identify several pivots in the query window. Such pivots reveal different geo-topical activities and naturally attract similar tweets to form candidate events. GeoBurst+ further summarizes the continuous stream and compares the candidates against the historical summaries to pinpoint truly interesting local events. Better still, as the query window shifts, GeoBurst+ is capable of updating the event list with little time cost, thus achieving continuous monitoring of the stream. We used crowdsourcing to evaluate GeoBurst+ on two million-scale datasets and found it significantly more effective than existing methods while being orders of magnitude faster.

AAAI Conference 2018 Conference Paper

Spatiotemporal Activity Modeling Under Data Scarcity: A Graph-Regularized Cross-Modal Embedding Approach

Chao Zhang
Mengxiong Liu
Zhengchao Liu
Carl Yang
Luming Zhang
Jiawei Han

Spatiotemporal activity modeling, which aims at modeling users’ activities at different locations and times from user behavioral data, is an important task for applications like urban planning and mobile advertising. State-of-the-art methods for this task use cross-modal embedding to map the units from different modalities (location, time, text) into the same latent space. However, the success of such methods relies on data sufficiency, and they may not learn quality embeddings when user behavioral data is scarce. To address this problem, we propose BRANCHNET, a spatiotemporal activity model that transfers knowledge from external sources for alleviating data scarcity. BRANCHNET adopts a graph-regularized cross-modal embedding framework. At the core of it is a main embedding space, which is shared by the main task of reconstructing user behaviors and the auxiliary graph embedding tasks for external sources, thus allowing external knowledge to guide the cross-modal embedding process. In addition to the main embedding space, the auxiliary tasks also have branched task-specific embedding spaces. The branched embeddings capture the discrepancies between the main task and the auxiliary ones, and free the main embeddings from encoding information for all the tasks. We have empirically evaluated the performance of BRANCHNET, and found that it is capable of effectively transferring knowledge from external sources to learn better spatiotemporal activity models and outperforming strong baseline methods.

IJCAI Conference 2016 Conference Paper

Collaborative Multi-Level Embedding Learning from Reviews for Rating Prediction

Wei Zhang
Quan Yuan
Jiawei Han
Jianyong Wang

We investigate the problem of personalized review-based rating prediction, which aims at predicting users' ratings for items that they have not evaluated by using their historical reviews and ratings. Most existing methods solve this problem by integrating a topic model and a latent factor model to learn interpretable user and item factors. However, these methods cannot utilize the local word context information of reviews. Moreover, they simply restrict user and item representations to be equivalent to their review representations, which may bring in some irrelevant information from review text and harm the accuracy of rating prediction. In this paper, we propose a novel Collaborative Multi-Level Embedding (CMLE) model to address these limitations. The main technical contribution of CMLE is to integrate a word embedding model with a standard matrix factorization model through a projection level. This allows CMLE to inherit the ability of capturing local word context information from the word embedding model and relax the strict equivalence requirement by projecting review embeddings to user and item embeddings. A joint optimization problem is formulated and solved through an efficient stochastic gradient ascent algorithm. Empirical evaluations on real datasets show that CMLE outperforms several competitive methods and can solve the two limitations well.

AAAI Conference 2016 Conference Paper

EKNOT: Event Knowledge from News and Opinions in Twitter

Min Li
Jingjing Wang
Wenzhu Tong
Hongkun Yu
Xiuli Ma
Yucheng Chen
Haoyan Cai
Jiawei Han

We present the EKNOT system that automatically discovers major events from online news articles, connects each event to its discussion in Twitter, and provides a comprehensive summary of the events from both news media and social media’s point of view. EKNOT takes a time period as input and outputs a complete picture of the events within the given time range along with the public opinions. For each event, EKNOT provides multi-dimensional summaries: a) a summary from news for an objective description; b) a summary from tweets containing opinions/sentiments; c) an entity graph which illustrates the major players involved and their correlations; d) the time span of the event; and e) an opinion (sentiment) distribution. Also, if a user is interested in a particular event, he/she can zoom into this event to investigate its aspects (subevents) summarized in the same manner. EKNOT is built on real-time crawled news articles and tweets, allowing users to explore the dynamics of major events with minimal delays.

IJCAI Conference 2016 Conference Paper

Learning Hostname Preference to Enhance Search Relevance

Jingjing Wang
Changsung Kang
Yi Chang
Jiawei Han

Hostnames such as en.wikipedia.org and www.amazon.com are strong indicators of the content they host. The relevant hostnames for a query can be a signature that captures the query intent. In this study, we learn the hostname preference of queries, which is further utilized to enhance search relevance. Implicit and explicit query intent are modeled simultaneously by a feature-aware matrix completion framework. A block-wise parallel algorithm was developed on top of Spark MLlib for fast optimization of feature-aware matrix completion. The optimization completes within minutes at the scale of a million x million matrix, which enables efficient experimental studies at the web scale. Evaluation of the learned hostname preference is performed both intrinsically on test errors, and extrinsically on the impact on search ranking relevance. Experimental results demonstrate that capturing hostname preference can significantly boost the retrieval performance.

AAAI Conference 2016 Conference Paper

Text Classification with Heterogeneous Information Network Kernels

Chenguang Wang
Yangqiu Song
Haoran Li
Ming Zhang
Jiawei Han

Text classification is an important problem with many applications. Traditional approaches represent text as a bag-of-words and build classifiers based on this representation. Rather than words, entity phrases, the relations between the entities, as well as the types of the entities and relations carry much more information to represent the texts. This paper presents a novel text-as-network classification framework, which introduces 1) a structured and typed heterogeneous information network (HIN) representation of texts, and 2) a meta-path based approach to link texts. We show that with the new representation and links of texts, the structured and typed information of entities and relations can be incorporated into kernels. In particular, we develop both a simple linear kernel and an indefinite kernel based on meta-paths in the HIN representation of texts, which we call HIN-kernels. Using Freebase, a well-known world knowledge base, to construct HINs for texts, our experiments on two benchmark datasets show that the indefinite HIN-kernel based on weighted meta-paths outperforms the state-of-the-art methods and other HIN-kernels.

IJCAI Conference 2015 Conference Paper

Constrained Information-Theoretic Tripartite Graph Clustering to Identify Semantically Similar Relations

Chenguang Wang
Yangqiu Song
Dan Roth
Chi Wang
Jiawei Han
Heng Ji
Ming Zhang

In knowledge bases or information extraction results, differently expressed relations can be semantically similar (e.g., (X, wrote, Y) and (X, ’s written work, Y)). Therefore, grouping semantically similar relations into clusters would facilitate and improve many applications, including knowledge base completion, information extraction, information retrieval, and more. This paper formulates relation clustering as a constrained tripartite graph clustering problem, presents an efficient clustering algorithm and exhibits the advantage of the constrained framework. We introduce several ways that provide side information via must-link and cannot-link constraints to improve the clustering results. Different from traditional semi-supervised learning approaches, we propose to use the similarity of relation expressions and the knowledge of entity types to automatically construct the constraints for the algorithm. We show improved relation clustering results on two datasets extracted from a human-annotated knowledge base (i.e., Freebase) and open information extraction results (i.e., ReVerb data).

NeurIPS Conference 2014 Conference Paper

Robust Tensor Decomposition with Gross Corruption

Quanquan Gu
Huan Gui
Jiawei Han

In this paper, we study the statistical performance of robust tensor decomposition with gross corruption. The observations are noisy realizations of the superposition of a low-rank tensor $\mathcal{W}^*$ and an entrywise sparse corruption tensor $\mathcal{V}^*$. Unlike conventional noise with bounded variance in previous convex tensor decomposition analysis, the magnitude of the gross corruption can be arbitrarily large. We show that under certain conditions, the true low-rank tensor as well as the sparse corruption tensor can be recovered simultaneously. Our theory yields nonasymptotic Frobenius-norm estimation error bounds for each tensor separately. We show through numerical experiments that our theory can precisely predict the scaling behavior in practice.

TIST Journal 2013 Journal Article

A framework of traveling companion discovery on trajectory data streams

Lu-An Tang
Yu Zheng
Jing Yuan
Jiawei Han
Alice Leung
Wen-Chih Peng
Thomas La Porta

The advance of mobile technologies leads to huge volumes of spatio-temporal data collected in the form of trajectory data streams. In this study, we investigate the problem of discovering object groups that travel together (i.e., traveling companions) from trajectory data streams. Such a technique has broad applications in the areas of scientific study, transportation management, and military surveillance. To discover traveling companions, the monitoring system should cluster the objects of each snapshot and intersect the clustering results to retrieve moving-together objects. Since both clustering and intersection steps involve high computational overhead, the key issue of companion discovery is to improve the efficiency of algorithms. We propose the models of closed companion candidates and smart intersection to accelerate data processing. A data structure termed traveling buddy is designed to facilitate scalable and flexible companion discovery from trajectory streams. The traveling buddies are microgroups of objects that are tightly bound together. By only storing the object relationships rather than their spatial coordinates, the buddies can be dynamically maintained along the trajectory stream with low cost. Based on traveling buddies, the system can discover companions without accessing the object details. In addition, we extend the proposed framework to discover companions in more complicated scenarios with spatial and temporal constraints, such as on the road network and battlefield. The proposed methods are evaluated with extensive experiments on both real and synthetic datasets. Experimental results show that our proposed buddy-based approach is an order of magnitude faster than the baselines and achieves higher accuracy in companion discovery.

IJCAI Conference 2013 Conference Paper

Large-Scale Spectral Clustering on Graphs

Jialu Liu
Chi Wang
Marina Danilevsky
Jiawei Han

Graph clustering has received growing attention in recent years as an important analytical technique, both due to the prevalence of graph data, and the usefulness of graph structures for exploiting intrinsic data characteristics. However, as graph data grows in scale, it becomes increasingly challenging to identify clusters. In this paper we propose an efficient clustering algorithm for large-scale graph data using spectral methods. The key idea is to repeatedly generate a small number of “supernodes” connected to the regular nodes, in order to compress the original graph into a sparse bipartite graph. By clustering the bipartite graph using spectral methods, we are able to greatly improve efficiency without losing considerable clustering power. Extensive experiments show the effectiveness and efficiency of our approach.

TIST Journal 2012 Journal Article

Latent Community Topic Analysis

Zhijun Yin
LiangLiang Cao
Quanquan Gu
Jiawei Han

This article studies the problem of latent community topic analysis in text-associated graphs. With the development of social media, a lot of user-generated content is available with user networks. Along with rich information in networks, user graphs can be extended with text information associated with nodes. Topic modeling is a classic problem in text mining and it is interesting to discover the latent topics in text-associated graphs. Different from traditional topic modeling methods considering links, we incorporate community discovery into topic analysis in text-associated graphs to guarantee the topical coherence in the communities so that users in the same community are closely linked to each other and share common latent topics. We handle topic modeling and community discovery in the same framework. In our model we separate the concepts of community and topic, so one community can correspond to multiple topics and multiple communities can share the same topic. We compare different methods and perform extensive experiments on two real datasets. The results confirm our hypothesis that topics could help understand community structure, while community structure could help model topics.

NeurIPS Conference 2012 Conference Paper

Selective Labeling via Error Bound Minimization

Quanquan Gu
Tong Zhang
Jiawei Han
Chris Ding

In many practical machine learning problems, the acquisition of labeled data is often expensive and/or time consuming. This motivates us to study the following problem: given a label budget, how to select data points to label such that the learning performance is optimized. We propose a selective labeling method by analyzing the generalization error of Laplacian regularized Least Squares (LapRLS). In particular, we derive a deterministic generalization error bound for LapRLS trained on subsampled data, and propose to select a subset of data points to label by minimizing this upper bound. Since the minimization is a combinatorial problem, we relax it into a continuous domain and solve it by projected gradient descent. Experiments on benchmark datasets show that the proposed method outperforms the state-of-the-art methods.
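Projected gradient descent, the solver named in this abstract, alternates a gradient step with a projection back onto the feasible set. The sketch below is generic and does not reproduce the paper's objective or constraints: the quadratic objective, the box constraint [0, 1], the step size, and all function names are assumptions chosen so the example stays self-contained.

```python
def project_to_box(x, lo=0.0, hi=1.0):
    """Euclidean projection onto the box [lo, hi]^n (coordinate-wise clipping)."""
    return [min(max(xi, lo), hi) for xi in x]


def projected_gradient_descent(grad, x0, step=0.1, iters=200):
    """Repeat: take a gradient step, then project back onto the feasible box."""
    x = list(x0)
    for _ in range(iters):
        g = grad(x)
        x = project_to_box([xi - step * gi for xi, gi in zip(x, g)])
    return x


# Toy objective (an assumption of this example): f(x) = sum_i (x_i - t_i)^2.
target = [1.5, -0.3, 0.4]

def grad(x):
    return [2.0 * (xi - ti) for xi, ti in zip(x, target)]

solution = projected_gradient_descent(grad, x0=[0.5, 0.5, 0.5])
print(solution)  # approximately [1.0, 0.0, 0.4]
```

In a relaxed selection problem the continuous solution would still have to be mapped back to a discrete choice of points, a step this sketch does not cover.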

TIST Journal 2011 Journal Article

Collection-based sparse label propagation and its application on social group suggestion from photos

Jie Yu
Xin Jin
Jiawei Han
Jiebo Luo

Online social network services pose great opportunities and challenges for many research areas. In multimedia content analysis, automatic social group recommendation for images holds the promise to expand one's social network through media sharing. However, most existing techniques cannot generate satisfactory social group suggestions when the images are classified independently. In this article, we present novel methods to produce accurate suggestions of suitable social groups from a user's personal photo collection. First, an automatic clustering process is designed to estimate the group similarities, select the optimal number of clusters and categorize the social groups. Both visual content and textual annotations are integrated to generate initial predictions of the group categories for the images. Next, the relationship among images in a user's collection is modeled as a sparse graph. A collection-based sparse label propagation method is proposed to improve the group suggestions. Furthermore, the sparse graph-based collection model can be readily exploited to select the most influential and informative samples for active relevance feedback, which can be integrated with the label propagation process without the need for classifier retraining. The proposed methods have been tested on group suggestion tasks for real user collections and demonstrated superior performance over the state-of-the-art techniques.

IJCAI Conference 2011 Conference Paper

Joint Feature Selection and Subspace Learning

Quanquan Gu
Zhenhui Li
Jiawei Han

Dimensionality reduction is a very important topic in machine learning. It can be generally classified into two categories: feature selection and subspace learning. In the past decades, many methods have been proposed for dimensionality reduction. However, most of these works study feature selection and subspace learning independently. In this paper, we present a framework for joint feature selection and subspace learning. We reformulate the subspace learning problem and use the L2,1-norm on the projection matrix to achieve row-sparsity, which leads to selecting relevant features and learning the transformation simultaneously. We discuss two situations of the proposed framework, and present their optimization algorithms. Experiments on benchmark face recognition data sets illustrate that the proposed framework outperforms the state-of-the-art methods overwhelmingly.
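The L2,1-norm mentioned in this abstract is the sum of the Euclidean norms of a matrix's rows, so penalizing it pushes whole rows of the projection matrix to zero. A minimal sketch follows; the matrix values and the helper names are made up for illustration.

```python
import math


def row_norm(row):
    """Euclidean (l2) norm of one row."""
    return math.sqrt(sum(x * x for x in row))


def l21_norm(matrix):
    """L2,1-norm: sum over rows of each row's l2 norm."""
    return sum(row_norm(row) for row in matrix)


def selected_features(matrix, tol=1e-12):
    """Indices of rows that are not (numerically) zero.

    With one row per input feature, a zero row means that feature
    does not contribute to any projected dimension.
    """
    return [i for i, row in enumerate(matrix) if row_norm(row) > tol]


# Toy projection matrix: 3 input features (rows) -> 2 projected dimensions.
W = [[3.0, 4.0],
     [0.0, 0.0],
     [1.0, 0.0]]

print(l21_norm(W))           # 6.0
print(selected_features(W))  # [0, 2]
```

Here the second feature is dropped entirely, which is how row-sparsity in the projection matrix doubles as feature selection.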

AAAI Conference 2011 Conference Paper

Learning a Kernel for Multi-Task Clustering

Quanquan Gu
Zhenhui Li
Jiawei Han

Multi-task learning has received increasing attention in the past decade. Many supervised multi-task learning methods have been proposed, while unsupervised multi-task learning is still a rarely studied problem. In this paper, we propose to learn a kernel for multi-task clustering. Our goal is to learn a Reproducing Kernel Hilbert Space, in which the geometric structure of the data in each task is preserved, while the data distributions of any two tasks are as close as possible. This is formulated as a unified kernel learning framework, under which we study two types of kernel learning: nonparametric kernel learning and spectral kernel design. Both types of kernel learning can be solved by linear programming. Experiments on several cross-domain text data sets demonstrate that kernel k-means on the learned kernel can achieve better clustering results than traditional single-task clustering methods. It also outperforms the newly proposed multi-task clustering method.

TIST Journal 2011 Journal Article

MoveMine

Zhenhui Li
Jiawei Han
Ming Ji
Lu-An Tang
Yintao Yu
Bolin Ding
Jae-Gil Lee
Roland Kays

With the maturity and wide availability of GPS, wireless, telecommunication, and Web technologies, massive amounts of object movement data have been collected from various moving object targets, such as animals, mobile devices, vehicles, and climate radars. Analyzing such data has deep implications in many applications, such as, ecological study, traffic control, mobile communication management, and climatological forecast. In this article, we focus our study on animal movement data analysis and examine advanced data mining methods for discovery of various animal movement patterns. In particular, we introduce a moving object data mining system, MoveMine, which integrates multiple data mining functions, including sophisticated pattern mining and trajectory analysis. In this system, two interesting moving object pattern mining functions are newly developed: (1) periodic behavior mining and (2) swarm pattern mining. For mining periodic behaviors, a reference location-based method is developed, which first detects the reference locations, discovers the periods in complex movements, and then finds periodic patterns by hierarchical clustering. For mining swarm patterns, an efficient method is developed to uncover flexible moving object clusters by relaxing the popularly-enforced collective movement constraints. In the MoveMine system, a set of commonly used moving object mining functions are built and a user-friendly interface is provided to facilitate interactive exploration of moving object data mining and flexible tuning of the mining constraints and parameters. MoveMine has been tested on multiple kinds of real datasets, especially for MoveBank applications and other moving object data analysis. The system will benefit scientists and other users to carry out versatile analysis tasks to analyze object movement regularities and anomalies. 
Moreover, it will benefit researchers to realize the importance and limitations of current techniques and promote future studies on moving object data mining. As expected, a mastery of animal movement patterns and trends will improve our understanding of the interactions between and the changes of the animal world and the ecosystem and therefore help ensure the sustainability of our ecosystem.

IJCAI Conference 2011 Conference Paper

On Trivial Solution and Scale Transfer Problems in Graph Regularized NMF

Quanquan Gu
Chris Ding
Jiawei Han

Combining graph regularization with nonnegative matrix (tri-)factorization (NMF) has shown great performance improvement compared with traditional nonnegative matrix (tri-)factorization models due to its ability to utilize the geometric structure of the documents and words. In this paper, we show that these models are not well-defined and suffer from trivial solution and scale transfer problems. In order to solve these common problems, we propose two models for graph regularized nonnegative matrix (tri-)factorization, which can be applied for document clustering and co-clustering respectively. In the proposed models, a Normalized Cut-like constraint is imposed on the cluster assignment matrix to make the optimization problem well-defined. We derive a multiplicative updating algorithm for the proposed models, and prove its convergence. Experiments of clustering and co-clustering on benchmark text data sets demonstrate that the proposed models outperform the original models as well as many other state-of-the-art clustering methods.

IJCAI Conference 2009 Conference Paper

Deng Cai
Xiaofei He
Xuanhui Wang
Hujun Bao
Jiawei Han

Matrix factorization techniques have been frequently applied in information processing tasks. Among them, Non-negative Matrix Factorization (NMF) has received considerable attention due to its psychological and physiological interpretation of naturally occurring data whose representation may be parts-based in the human brain. On the other hand, from a geometric perspective the data is usually sampled from a low dimensional manifold embedded in a high dimensional ambient space. One hopes then to find a compact representation which uncovers the hidden topics and simultaneously respects the intrinsic geometric structure. In this paper, we propose a novel algorithm, called Locality Preserving Non-negative Matrix Factorization (LPNMF), for this purpose. For two data points, we use KL-divergence to evaluate their similarity on the hidden topics. The optimal maps are obtained such that the feature values on hidden topics are restricted to be non-negative and vary smoothly along the geodesics of the data manifold. Our empirical study shows encouraging results for the proposed algorithm in comparison to the state-of-the-art algorithms on two large high-dimensional databases.

NeurIPS Conference 2009 Conference Paper

Graph-based Consensus Maximization among Multiple Supervised and Unsupervised Models

Jing Gao
Feng Liang
Wei Fan
Yizhou Sun
Jiawei Han

Little work has been done to directly combine the outputs of multiple supervised and unsupervised models. However, it can increase the accuracy and applicability of ensemble methods. First, we can boost the diversity of classification ensemble by incorporating multiple clustering outputs, each of which provides grouping constraints for the joint label predictions of a set of related objects. Secondly, ensemble of supervised models is limited in applications which have no access to raw data but to the meta-level model outputs. In this paper, we aim at calculating a consolidated classification solution for a set of objects by maximizing the consensus among both supervised predictions and unsupervised grouping constraints. We seek a global optimal label assignment for the target objects, which is different from the result of traditional majority voting and model combination approaches. We cast the problem into an optimization problem on a bipartite graph, where the objective function favors smoothness in the conditional probability estimates over the graph, as well as penalizes deviation from initial labeling of supervised models. We solve the problem through iterative propagation of conditional probability estimates among neighboring nodes, and interpret the method as conducting a constrained embedding in a transformed space, as well as a ranking on the graph. Experimental results on three real applications demonstrate the benefits of the proposed method over existing alternatives.

IJCAI Conference 2007 Conference Paper

Locality Sensitive Discriminant Analysis

Deng Cai
Xiaofei He
Kun Zhou
Jiawei Han
Hujun Bao

Linear Discriminant Analysis (LDA) is a popular data-analytic tool for studying the class relationship between data points. A major disadvantage of LDA is that it fails to discover the local geometrical structure of the data manifold. In this paper, we introduce a novel linear algorithm for discriminant analysis, called Locality Sensitive Discriminant Analysis (LSDA). When there are insufficient training samples, local structure is generally more important than global structure for discriminant analysis. By discovering the local manifold structure, LSDA finds a projection which maximizes the margin between data points from different classes in each local area. Specifically, the data points are mapped into a subspace in which nearby points with the same label are close to each other while nearby points with different labels are far apart. Experiments carried out on several standard face databases show a clear improvement over the results of LDA-based recognition.

IS Journal 2003 Journal Article

Profile-based object matching for information integration

AnHai Doan
Ying Lu
Yoonkyong Lee
Jiawei Han

Traditional object-matching methods rely on similarities among shared attributes. Profile-based object matching builds on this approach but also correlates disjoint attributes to improve matching accuracy. To illustrate the PROM approach, we use two relational tables: one contains information about movies, the other about movie reviews.

TCS Journal 1994 Journal Article

Towards efficient induction mechanisms in database systems

Jiawei Han

With the wide availability of huge amounts of data in database systems, the extraction of knowledge in databases by efficient and powerful induction or knowledge discovery mechanisms has become an important issue in the construction of new-generation database and knowledge-base systems. In this article, an attribute-oriented induction method for knowledge discovery in databases is investigated, which provides an efficient, set-oriented induction mechanism for extracting different kinds of knowledge rules, such as characteristic rules, discriminant rules, data evolution regularities, and high-level dependency rules, in large relational databases. Our study shows that the method is robust in the presence of noise and database updates, is extensible to knowledge discovery in advanced and/or special-purpose databases, such as object-oriented databases, active databases, and spatial databases, and has wide applications.
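A simplified sketch of the attribute-oriented idea in the abstract above: each attribute is generalized by climbing a concept hierarchy until it has at most `threshold` distinct values, and identical generalized tuples are then merged with a count. The rows, the hierarchies, the threshold, and the `"ANY"` marker used when an attribute cannot be generalized further are all hypothetical, and the sketch assumes the hierarchies contain no cycles. It is not the article's full method.

```python
from collections import Counter


def attribute_oriented_induction(rows, hierarchies, threshold):
    """Generalize each column via its concept hierarchy, then merge duplicates.

    hierarchies maps a column index to a {value: parent_value} dictionary.
    """
    if threshold < 1:
        raise ValueError("threshold must be at least 1")
    rows = [tuple(r) for r in rows]
    for col in range(len(rows[0])):
        parent = hierarchies.get(col, {})
        while len({r[col] for r in rows}) > threshold:
            # Lift every value in this column one level up its hierarchy.
            lifted = [r[:col] + (parent.get(r[col], r[col]),) + r[col + 1:]
                      for r in rows]
            if lifted == rows:
                # No higher concept available: collapse the attribute entirely.
                lifted = [r[:col] + ("ANY",) + r[col + 1:] for r in rows]
            rows = lifted
    # Set-oriented merge: identical generalized tuples become one row with a count.
    return Counter(rows)


# Hypothetical relation: (name, major, birthplace).
rows = [
    ("Ann", "physics", "Vancouver"),
    ("Bob", "biology", "Toronto"),
    ("Cid", "music", "Seattle"),
    ("Dee", "history", "Vancouver"),
]
hierarchies = {
    1: {"physics": "science", "biology": "science",
        "music": "arts", "history": "arts"},
    2: {"Vancouver": "Canada", "Toronto": "Canada", "Seattle": "USA"},
}
for generalized_row, count in attribute_oriented_induction(rows, hierarchies, 2).items():
    print(generalized_row, count)
```

Each merged row with its count can be read as a candidate characteristic rule over the generalized attributes, for example that two of the four tuples describe science majors born in Canada.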