Arrow Research search

Author name cluster

Philip S. Yu

Possible papers associated with this exact author name in Arrow. This page groups case-insensitive exact name matches and is not a full identity disambiguation profile.

92 papers
2 author rows

Possible papers

92

AAAI Conference 2026 Conference Paper

Hyperbolic Continuous Structural Entropy for Hierarchical Clustering

  • Guangjie Zeng
  • Hao Peng
  • Angsheng Li
  • Li Sun
  • Chunyang Liu
  • Shengze Li
  • Yicheng Pan
  • Philip S. Yu

Hierarchical clustering is a fundamental machine-learning technique for grouping data points into dendrograms. However, existing hierarchical clustering methods encounter two primary challenges: 1) Most methods specify dendrograms without a global objective. 2) Graph-based methods often neglect the significance of graph structure, optimizing objectives on complete or static predefined graphs. In this work, we propose Hyperbolic Continuous Structural Entropy neural networks, namely HypCSE, for structure-enhanced continuous hierarchical clustering. Our key idea is to map data points in the hyperbolic space and minimize the relaxed continuous structural entropy (SE) on structure-enhanced graphs. Specifically, we encode graph vertices in hyperbolic space using hyperbolic graph neural networks and minimize approximate SE defined on graph embeddings. To make the SE objective differentiable for optimization, we reformulate it into a function using the lowest common ancestor (LCA) on trees and then relax it into continuous SE (CSE) by the analogy of hyperbolic graph embeddings and partitioning trees. To ensure a graph structure that effectively captures the hierarchy of data points for CSE calculation, we employ a graph structure learning (GSL) strategy that updates the graph structure during training. Extensive experiments on seven datasets demonstrate the superior performance of HypCSE.

AAAI Conference 2026 Conference Paper

Learning to Explore: Policy-Guided Outlier Synthesis for Graph Out-of-Distribution Detection

  • Li Sun
  • Lanxu Yang
  • Jiayu Tian
  • Bowen Fang
  • Xiaoyan Yu
  • Junda Ye
  • Peng Tang
  • Hao Peng

Detecting Out-of-Distribution (OOD) graphs, i.e., those drawn from a distribution different from that of the training data, is a critical task for ensuring the safety and reliability of Graph Neural Networks. The main challenge in unsupervised graph-level Out-of-Distribution detection lies in its common reliance on purely in-distribution (ID) data. This ID-only training paradigm leads to an incomplete characterization of the feature space, resulting in decision boundaries that lack the robustness needed to effectively separate ID from OOD samples. While incorporating synthesized outliers into the training process is a promising direction, existing generation methods are limited by their dependence on pre-defined, non-adaptive sampling heuristics (e.g., distance- or density-based). Such fixed strategies lack the flexibility to systematically explore the most informative OOD regions for refining decision boundaries. To overcome this limitation, we propose a novel Policy-Guided Outlier Synthesis (PGOS) framework that replaces static heuristics with a learned, adaptive exploration policy. PGOS trains a reinforcement learning agent to autonomously navigate low-density regions within a structured latent space, sampling representations that are maximally effective for regularizing the OOD decision boundary. These sampled points are then decoded into high-quality pseudo-OOD graphs to enhance the detector's robustness. Extensive experiments demonstrate the strong performance of our method, which achieves state-of-the-art results on multiple graph OOD and anomaly detection benchmarks.

AAAI Conference 2026 Conference Paper

S²Drug: Bridging Protein Sequence and 3D Structure in Contrastive Representation Learning for Virtual Screening

  • Bowei He
  • Bowen Gao
  • Yankai Chen
  • Yanyan Lan
  • Chen Ma
  • Philip S. Yu
  • Ya-Qin Zhang
  • Wei-Ying Ma

Virtual screening (VS) is an essential task in drug discovery, focusing on the identification of small-molecule ligands that bind to specific protein pockets. Existing deep learning methods, from early regression models to recent contrastive learning approaches, primarily rely on structural data while overlooking protein sequences, which are more accessible and can enhance generalizability. However, directly integrating protein sequences poses challenges due to the redundancy and noise in large-scale protein-ligand datasets. To address these limitations, we propose S²Drug, a two-stage framework that explicitly incorporates protein Sequence information and 3D Structure context in protein-ligand contrastive representation learning. In the first stage, we perform protein sequence pretraining on ChEMBL using an ESM2-based backbone, combined with a tailored data sampling strategy to reduce redundancy and noise on both protein and ligand sides. In the second stage, we fine-tune on PDBBind by fusing sequence and structure information through a residue-level gating module, while introducing an auxiliary binding site prediction task. This auxiliary task guides the model to accurately localize binding residues within the protein sequence and capture their 3D spatial arrangement, thereby refining protein-ligand matching. Across multiple benchmarks, S²Drug consistently improves virtual screening performance and achieves strong results on binding site prediction, demonstrating the value of bridging sequence and structure in contrastive learning.

ICLR Conference 2025 Conference Paper

BANGS: Game-theoretic Node Selection for Graph Self-Training

  • Fangxin Wang 0003
  • Kay Liu
  • Sourav Medya
  • Philip S. Yu

Graph self-training is a semi-supervised learning method that iteratively selects a set of unlabeled data to retrain the underlying graph neural network (GNN) model and improve its prediction performance. While selecting highly confident nodes has proven effective for self-training, this pseudo-labeling strategy ignores the combinatorial dependencies between nodes and suffers from a local view of the distribution. To overcome these issues, we propose BANGS, a novel framework that unifies the labeling strategy with conditional mutual information as the objective of node selection. Our approach---grounded in game theory---selects nodes in a combinatorial fashion and provides theoretical guarantees for robustness under a noisy objective. More specifically, unlike traditional methods that rank and select nodes independently, BANGS considers nodes as a collective set in the self-training process. Our method demonstrates superior performance and robustness across various datasets, base models, and hyperparameter settings, outperforming existing techniques. The codebase is available at https://github.com/fangxin-wang/BANGS.

ICLR Conference 2025 Conference Paper

Benchmarking Multimodal Retrieval Augmented Generation with Dynamic VQA Dataset and Self-adaptive Planning Agent

  • Yangning Li
  • Yinghui Li
  • Xinyu Wang 0013
  • Yong Jiang 0005
  • Zhen Zhang
  • Xinran Zheng
  • Hui Wang 0030
  • Hai-Tao Zheng 0002

Multimodal Retrieval Augmented Generation (mRAG) plays an important role in mitigating the "hallucination" issue inherent in multimodal large language models (MLLMs). Although promising, existing heuristic mRAGs typically rely on predefined, fixed retrieval processes, which causes two issues: (1) Non-adaptive Retrieval Queries. (2) Overloaded Retrieval Queries. However, these flaws cannot be adequately reflected by current knowledge-seeking visual question answering (VQA) datasets, since most of the required knowledge can be readily obtained with a standard two-step retrieval. To bridge the dataset gap, we first construct the Dyn-VQA dataset, consisting of three types of "dynamic" questions that require complex knowledge retrieval strategies variable in query, tool, and time: (1) Questions with rapidly changing answers. (2) Questions requiring multi-modal knowledge. (3) Multi-hop questions. Experiments on Dyn-VQA reveal that existing heuristic mRAGs struggle to provide sufficient and precisely relevant knowledge for dynamic questions due to their rigid retrieval processes. Hence, we further propose the first self-adaptive planning agent for multimodal retrieval, **OmniSearch**. The underlying idea is to emulate human behavior in question solving, dynamically decomposing complex multimodal questions into sub-question chains with retrieval actions. Extensive experiments prove the effectiveness of OmniSearch and also provide direction for advancing mRAG. Code and dataset will be open-sourced.

ICLR Conference 2025 Conference Paper

Can Watermarked LLMs be Identified by Users via Crafted Prompts?

  • Aiwei Liu
  • Sheng Guan
  • Yiming Liu
  • Leyi Pan
  • Yifei Zhang
  • Liancheng Fang
  • Lijie Wen 0001
  • Philip S. Yu

Text watermarking for Large Language Models (LLMs) has made significant progress in detecting LLM outputs and preventing misuse. Current watermarking techniques offer high detectability, minimal impact on text quality, and robustness to text editing. However, current research lacks investigation into the imperceptibility of watermarking techniques in LLM services. This is crucial as LLM providers may not want to disclose the presence of watermarks in real-world scenarios, as it could reduce user willingness to use the service and make watermarks more vulnerable to attacks. This work is the first to investigate the imperceptibility of watermarked LLMs. We design an identification algorithm called Water-Probe that detects watermarks through well-designed prompts to the LLM. Our key motivation is that current watermarked LLMs expose consistent biases under the same watermark key, resulting in similar differences across prompts under different watermark keys. Experiments show that almost all mainstream watermarking algorithms are easily identified with our well-designed prompts, while Water-Probe demonstrates a minimal false positive rate for non-watermarked LLMs. Finally, we propose that the key to enhancing the imperceptibility of watermarked LLMs is to increase the randomness of watermark key selection. Based on this, we introduce the Water-Bag strategy, which significantly improves watermark imperceptibility by merging multiple watermark keys.
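
The core intuition, that a fixed watermark key leaves a consistent sampling bias which crafted, repeated prompts can surface, can be illustrated with a toy green-list-style simulation. The sampler, vocabulary, and detection statistic below are illustrative assumptions, not the paper's Water-Probe algorithm:

```python
import random

def sample_tokens(n, vocab=10, green=None, bias=3.0, seed=0):
    # Toy LLM sampler: tokens in a fixed "green list" get boosted weight,
    # standing in for a watermark under one fixed key; green=None = no watermark.
    rng = random.Random(seed)
    weights = [bias if green is not None and t in green else 1.0
               for t in range(vocab)]
    return rng.choices(range(vocab), weights=weights, k=n)

def green_fraction(tokens, green):
    return sum(t in green for t in tokens) / len(tokens)

# Probe: the same candidate green list is over-represented in *every* run of
# the watermarked sampler, but stays near the base rate (3/10) without one.
green = {0, 1, 2}
wm = [green_fraction(sample_tokens(2000, green=green, seed=s), green)
      for s in range(5)]
plain = [green_fraction(sample_tokens(2000, seed=s), green)
         for s in range(5)]
print(min(wm), max(plain))
```

Randomizing the key per query (in the spirit of the Water-Bag strategy) would mix several green lists together and shrink the gap this probe relies on.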

ICLR Conference 2025 Conference Paper

DiffPuter: Empowering Diffusion Models for Missing Data Imputation

  • Hengrui Zhang
  • Liancheng Fang
  • Qitian Wu
  • Philip S. Yu

Generative models play an important role in missing data imputation in that they aim to learn the joint distribution of full data. However, applying advanced deep generative models (such as Diffusion models) to missing data imputation is challenging due to 1) the inherent incompleteness of the training data and 2) the difficulty in performing conditional inference from unconditional generative models. To deal with these challenges, this paper introduces DiffPuter, a tailored diffusion model combined with the Expectation-Maximization (EM) algorithm for missing data imputation. DiffPuter iteratively trains a diffusion model to learn the joint distribution of missing and observed data and performs an accurate conditional sampling to update the missing values using a tailored reversed sampling strategy. Our theoretical analysis shows that DiffPuter's training step corresponds to the maximum likelihood estimation of data density (M-step), and its sampling step represents the Expected A Posteriori estimation of missing values (E-step). Extensive experiments across ten diverse datasets and comparisons with 17 different imputation methods demonstrate DiffPuter's superior performance. Notably, DiffPuter achieves an average improvement of 8.10% in MAE and 5.64% in RMSE compared to the most competitive existing method.
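
The EM structure described above can be sketched with a closed-form Gaussian density standing in for the diffusion model. The data, the Gaussian stand-in, and all names are illustrative; the paper's M-step trains a diffusion model and its E-step uses a tailored reverse-sampling strategy:

```python
import numpy as np

rng = np.random.default_rng(0)

# Toy data: column 1 strongly depends on column 0; hide ~20% of column 1.
n = 500
x = rng.normal(size=(n, 2))
x[:, 1] = 0.9 * x[:, 0] + 0.1 * x[:, 1]
mask = rng.random(n) < 0.2            # True -> entry of column 1 is missing
truth = x[:, 1].copy()
x_imp = x.copy()
x_imp[mask, 1] = 0.0                  # naive initial fill

for _ in range(20):
    # M-step: fit the joint density on the current completed data.
    mu = x_imp.mean(axis=0)
    cov = np.cov(x_imp.T)
    # E-step: reset missing entries to their conditional expectation.
    slope = cov[0, 1] / cov[0, 0]
    x_imp[mask, 1] = mu[1] + slope * (x_imp[mask, 0] - mu[0])

err0 = np.abs(truth[mask]).mean()                  # zero-fill error
err = np.abs(truth[mask] - x_imp[mask, 1]).mean()  # EM-imputation error
print(round(err0, 3), round(err, 3))
```

Each pass refits the density on its own imputations, which is exactly the alternation the abstract describes, just with a model expressive enough to need iterative sampling rather than a closed-form conditional mean.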

TMLR Journal 2025 Journal Article

Enhancing Fairness in Unsupervised Graph Anomaly Detection through Disentanglement

  • Wenjing Chang
  • Kay Liu
  • Philip S. Yu
  • Jianjun Yu

Graph anomaly detection (GAD) is becoming increasingly crucial in various applications, ranging from financial fraud detection to fake news detection. However, current GAD methods largely overlook the fairness problem, which might result in discriminatory decisions skewed toward certain demographic groups defined on sensitive attributes (e.g., gender). This greatly limits the applicability of these methods in real-world scenarios in light of societal and ethical restrictions. To address this critical gap, we make the first attempt to integrate fairness with utility in GAD decision-making. Specifically, we devise a novel DisEntangle-based FairnEss-aware aNomaly Detection framework on the attributed graph, named DEFEND. DEFEND first introduces disentanglement in GNNs to capture informative yet sensitive-irrelevant node representations, effectively reducing bias inherent in graph representation learning. Besides, to alleviate discriminatory bias in evaluating anomalies, DEFEND adopts a reconstruction-based method, which concentrates solely on node attributes and avoids incorporating biased graph topology. Additionally, given the inherent association between sensitive-relevant and -irrelevant attributes, DEFEND further constrains the correlation between the reconstruction error and predicted sensitive attributes. Empirical evaluations on real-world datasets reveal that DEFEND performs effectively in GAD and significantly enhances fairness compared to state-of-the-art baselines. Our code is available at https://github.com/AhaChang/DEFEND.

TIST Journal 2025 Journal Article

Graph Contrastive Learning on Multi-label Classification for Recommendations

  • Jiayang Wu
  • Wensheng Gan
  • Huashen Lu
  • Philip S. Yu

In business analysis, providing effective recommendations is crucial for boosting company profits. Graph structures, especially bipartite graphs, are favored for analyzing complex data relationships. Link prediction is crucial for recommending specific items to users. Traditional methods have primarily focused on binary classification tasks. These methods, which identify patterns in graph structures or use representation techniques like graph neural networks (GNNs), face challenges with increasing data volume and label count. Data growth strains system performance and efficiency. More labels intensify data sparsity, as users and items focus on only a few labels, leading to sparse matrices that hamper recommendation algorithms. To tackle these issues, we introduce the Graph Contrastive Learning for Multi-label Classification (MCGCL) model. It uses contrastive learning to improve recommendations and has two training phases: a main task of holistic user–item graph learning to grasp user–item relationships, and a subtask of constructing homogeneous user–user (item–item) subgraphs to capture user–user and item–item relationships. Comparative experiments with state-of-the-art methods confirm the effectiveness of MCGCL, highlighting its potential for improving recommendation systems.
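
The subtask of building homogeneous subgraphs from the bipartite user-item graph can be illustrated with one simple construction rule: connect two users whenever they interacted with a common item. The interaction data and the co-interaction rule are illustrative assumptions; MCGCL's actual subgraph construction may differ:

```python
from collections import defaultdict
from itertools import combinations

# Toy bipartite interactions (user, item).
interactions = [("u1", "i1"), ("u1", "i2"), ("u2", "i2"),
                ("u3", "i3"), ("u2", "i3")]

# Group users by item, then connect every pair of co-interacting users.
by_item = defaultdict(set)
for u, i in interactions:
    by_item[i].add(u)

user_edges = set()
for users in by_item.values():
    for a, b in combinations(sorted(users), 2):
        user_edges.add((a, b))
print(sorted(user_edges))  # [('u1', 'u2'), ('u2', 'u3')]
```

An item-item subgraph follows symmetrically by grouping items per user.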

ICML Conference 2025 Conference Paper

How Much Can Transfer? BRIDGE: Bounded Multi-Domain Graph Foundation Model with Generalization Guarantees

  • Haonan Yuan
  • Qingyun Sun
  • Junhua Shi
  • Xingcheng Fu
  • Bryan Hooi
  • Jianxin Li 0002
  • Philip S. Yu

Graph Foundation Models hold significant potential for advancing multi-domain graph learning, yet their full capabilities remain largely untapped. Existing works show promising task performance with the “pretrain-then-prompt” paradigm, which lacks theoretical foundations to understand why it works and how much knowledge can be transferred from source domains to the target. In this paper, we introduce BRIDGE, a bounded graph foundation model pre-trained on multiple domains with generalization guarantees. To learn discriminative source knowledge, we align multi-domain graph features with domain-invariant aligners during pre-training. Then, a lightweight Mixture of Experts (MoE) network is proposed to facilitate downstream prompting through self-supervised selective knowledge assembly and transfer. Further, to determine the maximum amount of transferable knowledge, we derive an optimizable generalization error upper bound from a graph spectral perspective given the Lipschitz continuity. Extensive experiments demonstrate the superiority of BRIDGE on both node and graph classification compared with 15 state-of-the-art baselines.

ICLR Conference 2025 Conference Paper

IGL-Bench: Establishing the Comprehensive Benchmark for Imbalanced Graph Learning

  • Jiawen Qin
  • Haonan Yuan
  • Qingyun Sun
  • Lyujin Xu
  • Jiaqi Yuan
  • Pengfeng Huang
  • Zhaonan Wang 0005
  • Xingcheng Fu

Deep graph learning has gained great popularity in recent years due to its versatility and success in representing graph data across a wide range of domains. However, the pervasive issue of imbalanced graph data distributions, where certain parts exhibit disproportionally abundant data while others remain sparse, undermines the efficacy of conventional graph learning algorithms, leading to biased outcomes. To address this challenge, Imbalanced Graph Learning (IGL) has garnered substantial attention, enabling more balanced data distributions and better task performance. Despite the proliferation of IGL algorithms, the absence of consistent experimental protocols and fair performance comparisons poses a significant barrier to comprehending advancements in this field. To bridge this gap, we introduce **IGL-Bench**, a foundational comprehensive benchmark for imbalanced graph learning, covering **17** diverse graph datasets and **24** distinct IGL algorithms with uniform data processing and splitting strategies. Specifically, IGL-Bench systematically investigates state-of-the-art IGL algorithms in terms of **effectiveness**, **robustness**, and **efficiency** on node-level and graph-level tasks, within the scope of both class imbalance and topology imbalance. Extensive experiments demonstrate the potential benefits of IGL algorithms under various imbalance conditions, offering insights and opportunities in the IGL field. Further, we have developed an open-source, unified package to facilitate reproducible evaluation and inspire further innovative research, available at: https://github.com/RingBDStack/IGL-Bench.

TMLR Journal 2025 Journal Article

LEGO-Learn: Label-Efficient Graph Open-Set Learning

  • Haoyan Xu
  • Kay Liu
  • Zhengtao Yao
  • Philip S. Yu
  • Mengyuan Li
  • Kaize Ding
  • Yue Zhao

How can we train graph-based models to recognize unseen classes while keeping labeling costs low? Graph open-set learning (GOL) and out-of-distribution (OOD) detection aim to address this challenge by training models that can accurately classify known, in-distribution (ID) classes while identifying and handling previously unseen classes during inference. This capability is critical for high-stakes, real-world applications where models frequently encounter unexpected data, including finance, security, and healthcare. However, current GOL methods assume access to a large number of labeled ID samples, which is unrealistic for large-scale graphs due to high annotation costs. In this paper, we propose LEGO-Learn (Label-Efficient Graph Open-set Learning), a novel framework that addresses open-set node classification on graphs within a given label budget by selecting the most informative ID nodes. LEGO-Learn employs a GNN-based filter to identify and exclude potential OOD nodes and then selects highly informative ID nodes for labeling using the K-Medoids algorithm. To prevent the filter from discarding valuable ID examples, we introduce a classifier that differentiates between the $C$ known ID classes and an additional class representing OOD nodes (hence, a $C+1$ classifier). This classifier utilizes a weighted cross-entropy loss to balance the removal of OOD nodes while retaining informative ID nodes. Experimental results on four real-world datasets demonstrate that LEGO-Learn significantly outperforms leading methods, achieving up to a $6.62\%$ improvement in ID classification accuracy and a $7.49\%$ increase in AUROC for OOD detection.
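
The K-Medoids selection step, picking a budget of representative ID nodes, can be sketched as follows. The tiny alternating k-medoids routine and the toy embeddings are illustrative assumptions; in the paper this selection operates on nodes that a GNN-based filter has already judged likely-ID:

```python
import numpy as np

def k_medoids(X, k, iters=20, seed=0):
    # Tiny alternating k-medoids: returns indices of k data points that
    # (locally) minimize the summed distance to their assigned points.
    rng = np.random.default_rng(seed)
    d = np.linalg.norm(X[:, None] - X[None, :], axis=-1)  # pairwise distances
    medoids = rng.choice(len(X), size=k, replace=False)
    for _ in range(iters):
        assign = d[:, medoids].argmin(axis=1)
        new = medoids.copy()
        for j in range(k):
            members = np.where(assign == j)[0]
            if len(members):
                # best medoid = the member closest to all others in its cluster
                new[j] = members[d[np.ix_(members, members)].sum(axis=1).argmin()]
        if np.array_equal(new, medoids):
            break
        medoids = new
    return medoids

# Toy "filtered ID node embeddings": three clusters; label budget k = 3.
rng = np.random.default_rng(1)
emb = np.vstack([rng.normal(c, 0.3, size=(30, 2)) for c in (0.0, 3.0, 6.0)])
picked = k_medoids(emb, k=3)
print(sorted(int(i) for i in picked))
```

Because medoids are actual data points, the selected indices can be handed directly to an annotator as the nodes to label under the budget.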

ICML Conference 2025 Conference Paper

One Example Shown, Many Concepts Known! Counterexample-Driven Conceptual Reasoning in Mathematical LLMs

  • Yinghui Li
  • Jiayi Kuang
  • Haojing Huang 0001
  • Zhikun Xu
  • Xinnian Liang
  • Yi Yu
  • Wenlian Lu
  • Yangning Li

Leveraging mathematical Large Language Models (LLMs) for proof generation is a fundamental topic in LLM research. We argue that the ability of current LLMs to prove statements largely depends on whether they have encountered the relevant proof process during training. This reliance limits their deeper understanding of mathematical theorems and related concepts. Inspired by the pedagogical method of "proof by counterexamples" commonly used in human mathematics education, our work aims to enhance LLMs’ ability to conduct mathematical reasoning and proof through counterexamples. Specifically, we manually create a high-quality, university-level mathematical benchmark, COUNTERMATH, which requires LLMs to prove mathematical statements by providing counterexamples, thereby assessing their grasp of mathematical concepts. Additionally, we develop a data engineering framework to automatically obtain training data for further model improvement. Extensive experiments and detailed analyses demonstrate that COUNTERMATH is challenging, indicating that LLMs, such as OpenAI o1, have insufficient counterexample-driven proof capabilities. Moreover, our exploration into model training reveals that strengthening LLMs’ counterexample-driven conceptual reasoning abilities is crucial for improving their overall mathematical capabilities. We believe that our work offers new perspectives to the mathematical LLM research community.

AIJ Journal 2025 Journal Article

Out-of-distribution detection by regaining lost clues

  • Zhilin Zhao
  • Longbing Cao
  • Philip S. Yu

Out-of-distribution (OOD) detection identifies samples in the test phase that are drawn from distributions distinct from that of training in-distribution (ID) samples for a trained network. According to the information bottleneck, networks that classify tabular data tend to extract labeling information from features with strong associations to ground-truth labels, discarding less relevant labeling cues. This behavior leads to a predicament in which OOD samples with limited labeling information receive high-confidence predictions, rendering the network incapable of distinguishing between ID and OOD samples. Hence, exploring more labeling information from ID samples, which makes it harder for an OOD sample to obtain high-confidence predictions, can address this over-confidence issue on tabular data. Accordingly, we propose a novel transformer chain (TC), which comprises a sequence of dependent transformers that iteratively regain discarded labeling information and integrate all the labeling information to enhance OOD detection. The generalization bound theoretically reveals that TC can balance ID generalization and OOD detection capabilities. Experimental results demonstrate that TC significantly surpasses state-of-the-art methods for OOD detection in tabular data.

IJCAI Conference 2025 Conference Paper

Out-of-Distribution Detection by Regaining Lost Clues (Abstract Reprint)

  • Zhilin Zhao
  • Longbing Cao
  • Philip S. Yu

Out-of-distribution (OOD) detection identifies samples in the test phase that are drawn from distributions distinct from that of training in-distribution (ID) samples for a trained network. According to the information bottleneck, networks that classify tabular data tend to extract labeling information from features with strong associations to ground-truth labels, discarding less relevant labeling cues. This behavior leads to a predicament in which OOD samples with limited labeling information receive high-confidence predictions, rendering the network incapable of distinguishing between ID and OOD samples. Hence, exploring more labeling information from ID samples, which makes it harder for an OOD sample to obtain high-confidence predictions, can address this over-confidence issue on tabular data. Accordingly, we propose a novel transformer chain (TC), which comprises a sequence of dependent transformers that iteratively regain discarded labeling information and integrate all the labeling information to enhance OOD detection. The generalization bound theoretically reveals that TC can balance ID generalization and OOD detection capabilities. Experimental results demonstrate that TC significantly surpasses state-of-the-art methods for OOD detection in tabular data.

AAAI Conference 2025 Conference Paper

Pioneer: Physics-informed Riemannian Graph ODE for Entropy-increasing Dynamics

  • Li Sun
  • Ziheng Zhang
  • Zixi Wang
  • Yujie Wang
  • Qiqi Wan
  • Hao Li
  • Hao Peng
  • Philip S. Yu

Dynamic interacting system modeling is important for understanding and simulating real world systems, e.g., meteorology and the spread of COVID-19. The system is typically described as a graph, where multiple objects dynamically interact with each other and evolve over time. In recent years, graph Ordinary Differential Equations (ODE) have received increasing research attention. While achieving encouraging results, existing solutions prioritize the traditional Euclidean space, and neglect the intrinsic geometry of the system and physics laws, e.g., the principle of entropy increase. The aforementioned limitations motivate us to rethink system dynamics from a fresh perspective of Riemannian geometry, and pose a more realistic problem of physics-informed dynamic system modeling, considering the underlying geometry and physics law for the first time. In this paper, we present a novel physics-informed Riemannian graph ODE for a wide range of entropy-increasing dynamic systems (termed Pioneer). In particular, we formulate a differential system on the Riemannian manifold, where a manifold-valued graph ODE is governed by the proposed constrained Ricci flow, and a manifold-preserving gyro-transform aware of the system geometry. Theoretically, we prove that entropy is non-decreasing under our formulation, in accordance with physical law. Empirical results show the superiority of Pioneer on real datasets.

ICLR Conference 2025 Conference Paper

Refine Knowledge of Large Language Models via Adaptive Contrastive Learning

  • Yinghui Li
  • Haojing Huang 0001
  • Jiayi Kuang
  • Yangning Li
  • Shu-Yu Guo
  • Chao Qu
  • Xiaoyu Tan
  • Hai-Tao Zheng 0002

How to alleviate the hallucinations of Large Language Models (LLMs) has always been the fundamental goal pursued by the LLMs research community. Looking through numerous hallucination-related studies, a mainstream category of methods is to reduce hallucinations by optimizing the knowledge representation of LLMs to change their output. Considering that the core focus of these works is the knowledge acquired by models, and knowledge has long been a central theme in human societal progress, we believe that the process of models refining knowledge can greatly benefit from the way humans learn. In our work, by imitating the human learning process, we design an Adaptive Contrastive Learning strategy. Our method flexibly constructs different positive and negative samples for contrastive learning based on LLMs' actual mastery of knowledge. This strategy helps LLMs consolidate the correct knowledge they already possess, deepen their understanding of the correct knowledge they have encountered but not fully grasped, forget the incorrect knowledge they previously learned, and honestly acknowledge the knowledge they lack. Extensive experiments and detailed analyses on widely used datasets demonstrate the effectiveness and competitiveness of our method.

ICML Conference 2025 Conference Paper

SDMG: Smoothing Your Diffusion Models for Powerful Graph Representation Learning

  • Junyou Zhu
  • Langzhou He
  • Chao Gao 0001
  • Dongpeng Hou
  • Zhen Su
  • Philip S. Yu
  • Jürgen Kurths
  • Frank Hellmann

Diffusion probabilistic models (DPMs) have recently demonstrated impressive generative capabilities. There is emerging evidence that their sample reconstruction ability can yield meaningful representations for recognition tasks. In this paper, we demonstrate that the objectives underlying generation and representation learning are not perfectly aligned. Through a spectral analysis, we find that minimizing the mean squared error (MSE) between the original graph and its reconstructed counterpart does not necessarily optimize representations for downstream tasks. Instead, focusing on reconstructing a small subset of features, specifically those capturing global information, proves to be more effective for learning powerful representations. Motivated by these insights, we propose a novel framework, the Smooth Diffusion Model for Graphs (SDMG), which introduces a multi-scale smoothing loss and low-frequency information encoders to promote the recovery of global, low-frequency details, while suppressing irrelevant high-frequency noise. Extensive experiments validate the effectiveness of our method, suggesting a promising direction for advancing diffusion models in graph representation learning.

AAAI Conference 2025 Conference Paper

Structural Entropy Guided Probabilistic Coding

  • Xiang Huang
  • Hao Peng
  • Li Sun
  • Hui Lin
  • Chunyang Liu
  • Jiang Cao
  • Philip S. Yu

Probabilistic embeddings have several advantages over deterministic embeddings as they map each data point to a distribution, which better describes the uncertainty and complexity of data. Many works focus on adjusting the distribution constraint under the Information Bottleneck (IB) principle to enhance representation learning. However, these proposed regularization terms only consider the constraint of each latent variable, omitting the structural information between latent variables. In this paper, we propose a novel structural entropy-guided probabilistic coding model, named SEPC. Specifically, we incorporate the relationship between latent variables into the optimization by proposing a structural entropy regularization loss. Besides, as traditional structural information theory is not well-suited for regression tasks, we propose a probabilistic encoding tree, transforming regression tasks into classification tasks while diminishing the influence of the transformation. Experimental results across 12 natural language understanding tasks, including both classification and regression tasks, demonstrate the superior performance of SEPC compared to other state-of-the-art models in terms of effectiveness, generalization capability, and robustness to label noise.
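
The structural entropy quantity such losses build on can be computed directly for a small graph. The sketch below uses the standard two-level (partition-based) formulation of structural entropy, with an illustrative graph of two triangles joined by a bridge:

```python
import math

def structural_entropy(edges, partition):
    # Two-level structural entropy of an undirected graph under a node
    # partition: lower means the partition better explains the structure.
    deg = {}
    for u, v in edges:
        deg[u] = deg.get(u, 0) + 1
        deg[v] = deg.get(v, 0) + 1
    two_m = sum(deg.values())           # 2 * number of edges
    label = {u: j for j, block in enumerate(partition) for u in block}
    H = 0.0
    for j, block in enumerate(partition):
        vol = sum(deg[u] for u in block)                      # block volume
        cut = sum((label[u] == j) != (label[v] == j) for u, v in edges)
        for u in block:                  # entropy of nodes inside the block
            H -= deg[u] / two_m * math.log2(deg[u] / vol)
        H -= cut / two_m * math.log2(vol / two_m)  # cost of crossing edges
    return H

# Two triangles joined by one bridge edge (2-3).
edges = [(0, 1), (1, 2), (0, 2), (3, 4), (4, 5), (3, 5), (2, 3)]
good = structural_entropy(edges, [{0, 1, 2}, {3, 4, 5}])
bad = structural_entropy(edges, [{0, 3, 4}, {1, 2, 5}])
print(round(good, 3), round(bad, 3))  # the natural split scores lower
```

A regularizer built on this quantity rewards latent structures whose partition makes the cut terms cheap, which is the relationship-between-latent-variables signal the abstract refers to.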

IJCAI Conference 2025 Conference Paper

T-T: Table Transformer for Tagging-based Aspect Sentiment Triplet Extraction

  • Kun Peng
  • Chaodong Tong
  • Cong Cao
  • Hao Peng
  • Qian Li
  • Guanlin Wu
  • Lei Jiang
  • Yanbing Liu

Aspect sentiment triplet extraction (ASTE) aims to extract triplets composed of aspect terms, opinion terms, and sentiment polarities from given sentences. The table tagging method is a popular approach to addressing this task, which encodes a sentence into a 2-dimensional table, allowing for the tagging of relations between any two words. Previous efforts have focused on designing various downstream relation learning modules to better capture interactions between tokens in the table, revealing that a stronger capability in relation capture can lead to greater improvements in the model. Motivated by this, we attempt to directly utilize transformer layers as downstream relation learning modules. Due to the powerful semantic modeling capability of transformers, it is foreseeable that this will lead to excellent improvement. However, owing to the quadratic relation between the length of the table and the length of the input sentence sequence, using transformers directly faces two challenges: overly long table sequences and unfair local attention interaction. To address these challenges, we propose a novel Table-Transformer (T-T) for the tagging-based ASTE method. Specifically, we introduce a stripe attention mechanism with a loop-shift strategy to tackle these challenges. The former modifies the global attention mechanism to only attend to a 2-dimensional local attention window, while the latter facilitates interaction between different attention windows. Extensive and comprehensive experiments demonstrate that the T-T, as a downstream relation learning module, achieves state-of-the-art performance with lower computational costs.

ICML Conference 2025 Conference Paper

TabNAT: A Continuous-Discrete Joint Generative Framework for Tabular Data

  • Hengrui Zhang
  • Liancheng Fang
  • Qitian Wu
  • Philip S. Yu

While autoregressive models dominate natural language generation, their application to tabular data remains limited due to two challenges: 1) tabular data contains heterogeneous types, whereas autoregressive next-token (distribution) prediction is designed for discrete data, and 2) tabular data is column permutation-invariant, requiring flexible generation orders. Traditional autoregressive models, with their fixed generation order, struggle with tasks like missing data imputation, where the target and conditioning columns vary. To address these issues, we propose Diffusion-nested Non-autoregressive Transformer (TabNAT), a hybrid model combining diffusion processes and masked generative modeling. For continuous columns, TabNAT uses a diffusion model to parameterize their conditional distributions, while for discrete columns, it employs next-token prediction with KL divergence minimization. A masked Transformer with bi-directional attention enables order-agnostic generation, allowing it to learn the distribution of target columns conditioned on arbitrary observed columns. Extensive experiments on ten datasets with diverse properties demonstrate TabNAT’s superiority in both unconditional tabular data generation and conditional missing data imputation tasks.
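The order-agnostic generation described above can be sketched schematically as follows. This is not TabNAT's implementation — `sample_fn` is a hypothetical stand-in for the model's per-column conditional sampler (a diffusion head for continuous columns, next-token prediction for discrete ones); the point is only that an arbitrary set of observed columns can condition generation of the rest, in any order:

```python
import random

def generate_row(columns, observed, sample_fn, rng=random):
    """Order-agnostic conditional generation over table columns.

    observed:  dict of already-known column values (e.g. the observed
               columns in a missing-data imputation task).
    sample_fn: stand-in sampler called as sample_fn(col, context).
    """
    row = dict(observed)
    missing = [c for c in columns if c not in row]
    rng.shuffle(missing)  # arbitrary generation order
    for col in missing:
        # condition on everything observed or generated so far
        row[col] = sample_fn(col, dict(row))
    return row
```

With a fixed left-to-right order this loop degenerates to conventional autoregression; shuffling is what lets the same model serve both unconditional generation and imputation with varying target columns.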

ICLR Conference 2025 Conference Paper

TIS-DPO: Token-level Importance Sampling for Direct Preference Optimization With Estimated Weights

  • Aiwei Liu
  • Haoping Bai
  • Zhiyun Lu
  • Yanchao Sun
  • Xiang Kong
  • Xiaoming Simon Wang
  • Jiulong Shan
  • Albin Madappally Jose

Direct Preference Optimization (DPO) has been widely adopted for preference alignment of Large Language Models (LLMs) due to its simplicity and effectiveness. However, DPO is derived as a bandit problem in which the whole response is treated as a single arm, ignoring the importance differences between tokens, which may affect optimization efficiency and make it difficult to achieve optimal results. In this work, we propose that the optimal data for DPO has equal expected rewards for each token in winning and losing responses, as there is no difference in token importance. However, since the optimal dataset is unavailable in practice, we propose using the original dataset for importance sampling to achieve unbiased optimization. Accordingly, we propose a token-level importance sampling DPO objective named TIS-DPO that assigns importance weights to each token based on its reward. Inspired by previous works, we estimate the token importance weights using the difference in prediction probabilities from a pair of contrastive LLMs. We explore three methods to construct these contrastive LLMs: (1) guiding the original LLM with contrastive prompts, (2) training two separate LLMs using winning and losing responses, and (3) performing forward and reverse DPO training with winning and losing responses. Experiments show that TIS-DPO significantly outperforms various baseline methods on harmlessness and helpfulness alignment and summarization tasks. We also visualize the estimated weights, demonstrating their ability to identify key token positions.
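The token-weighted objective described above can be sketched on a single preference pair. This is a minimal illustration under the standard DPO log-sigmoid form, not the paper's code; per-token importance weights simply scale each token's policy/reference log-ratio, and standard DPO is recovered when every weight is 1:

```python
import math

def tis_dpo_loss(pol_w, ref_w, wt_w, pol_l, ref_l, wt_l, beta=0.1):
    """Token-weighted DPO loss on one preference pair (illustrative).

    pol_*/ref_*: per-token log-probs under the policy and reference
                 models for the winning (w) and losing (l) responses.
    wt_*:        estimated token importance weights.
    """
    margin_w = sum(w * (p - r) for p, r, w in zip(pol_w, ref_w, wt_w))
    margin_l = sum(w * (p - r) for p, r, w in zip(pol_l, ref_l, wt_l))
    z = beta * (margin_w - margin_l)
    return -math.log(1.0 / (1.0 + math.exp(-z)))  # -log sigmoid(z)
```

Increasing the weight on tokens that genuinely separate winning from losing responses widens the weighted margin and lowers the loss, which is the intuition behind estimating weights from a pair of contrastive LLMs.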

AAAI Conference 2025 Conference Paper

Towards Effective, Efficient and Unsupervised Social Event Detection in the Hyperbolic Space

  • Xiaoyan Yu
  • Yifan Wei
  • Shuaishuai Zhou
  • Zhiwei Yang
  • Li Sun
  • Hao Peng
  • Liehuang Zhu
  • Philip S. Yu

The vast, complex, and dynamic nature of social message data has posed challenges to social event detection (SED). Despite considerable effort, these challenges persist, often resulting in inadequately expressive message representations (ineffective) and prolonged learning durations (inefficient). In response to the challenges, this work introduces an unsupervised framework, HyperSED (Hyperbolic SED). Specifically, the proposed framework first models social messages into semantic-based message anchors, and then leverages the structure of the anchor graph and the expressiveness of the hyperbolic space to acquire structure- and geometry-aware anchor representations. Finally, HyperSED builds the partitioning tree of the anchor message graph by incorporating differentiable structural information as the reflection of the detected events. Extensive experiments on public datasets demonstrate HyperSED's competitive performance, along with a substantial improvement in efficiency compared to the current state-of-the-art unsupervised paradigm. Statistically, HyperSED boosts incremental SED by an average of 2%, 2%, and 25% in NMI, AMI, and ARI, respectively; enhancing efficiency by up to 37.41 times and at least 12.10 times, illustrating the advancement of the proposed framework.

IJCAI Conference 2025 Conference Paper

Trace: Structural Riemannian Bridge Matching for Transferable Source Localization in Information Propagation

  • Li Sun
  • Suyang Zhou
  • Bowen Fang
  • Hechuan Zhang
  • Junda Ye
  • Yutong Ye
  • Philip S. Yu

Source localization, the inverse problem of information diffusion, is of fundamental importance for understanding social dynamics. While achieving notable progress, existing solutions are typically exposed to the risk of error accumulation, and require a large number of observations for effective inference. However, it is often impractical to obtain large quantities of observations in real scenarios, highlighting the need for a transferable model with broad applicability. Recently, Riemannian geometry has demonstrated its effectiveness in information diffusion and offers guidance in knowledge transfer, but has yet to be explored in source localization. In light of the issues above, we propose to study transferable source localization from a fresh geometric perspective, and present a novel approach (Trace) on the Riemannian manifold. Concretely, we establish a structural Schrödinger bridge to directly model the map between source and final distributions, where a functional curvature, encapsulating the graph structure, is formulated to govern the Schrödinger bridge and facilitate domain adaptation. Furthermore, we design a simple yet effective learning algorithm for Riemannian Schrödinger bridges (geodesic bridge matching) in which we prove the optimal projection holds for the Riemannian measure, so that the expensive iterative procedure is avoided. Extensive experiments demonstrate the effectiveness and transferability of Trace on both synthetic and real datasets.

TIST Journal 2024 Journal Article

A Survey on Evaluation of Large Language Models

  • Yupeng Chang
  • Xu Wang
  • Jindong Wang
  • Yuan Wu
  • Linyi Yang
  • Kaijie Zhu
  • Hao Chen
  • Xiaoyuan Yi

Large language models (LLMs) are gaining increasing popularity in both academia and industry, owing to their unprecedented performance in various applications. As LLMs continue to play a vital role in both research and daily use, their evaluation becomes increasingly critical, not only at the task level, but also at the society level for better understanding of their potential risks. Over the past years, significant efforts have been made to examine LLMs from various perspectives. This paper presents a comprehensive review of these evaluation methods for LLMs, focusing on three key dimensions: what to evaluate, where to evaluate, and how to evaluate. Firstly, we provide an overview from the perspective of evaluation tasks, encompassing general natural language processing tasks, reasoning, medical usage, ethics, education, natural and social sciences, agent applications, and other areas. Secondly, we answer the ‘where’ and ‘how’ questions by diving into the evaluation methods and benchmarks, which serve as crucial components in assessing the performance of LLMs. Then, we summarize the success and failure cases of LLMs in different tasks. Finally, we shed light on several future challenges that lie ahead in LLMs evaluation. Our aim is to offer invaluable insights to researchers in the realm of LLMs evaluation, thereby aiding the development of more proficient LLMs. Our key point is that evaluation should be treated as an essential discipline to better assist the development of LLMs. We consistently maintain the related open-source materials at: https://github.com/MLGroupJLU/LLM-eval-survey

AAAI Conference 2024 Conference Paper

Adaptive Feature Imputation with Latent Graph for Deep Incomplete Multi-View Clustering

  • Jingyu Pu
  • Chenhang Cui
  • Xinyue Chen
  • Yazhou Ren
  • Xiaorong Pu
  • Zhifeng Hao
  • Philip S. Yu
  • Lifang He

In recent years, incomplete multi-view clustering (IMVC), which studies the challenging multi-view clustering problem on missing views, has received growing research interest. Previous IMVC methods suffer from the following issues: (1) the inaccurate imputation of missing data, which leads to suboptimal clustering performance, and (2) most existing IMVC models merely consider the explicit presence of graph structure in data, ignoring the fact that latent graphs of different views also provide valuable information for the clustering task. To overcome such challenges, we present a novel method, termed Adaptive feature imputation with latent graph for incomplete multi-view clustering (AGDIMC). Specifically, it captures the embedded features of each view by incorporating view-specific deep encoders. Then, we construct partial latent graphs on complete data, which can consolidate the intrinsic relationships within each view while preserving the topological information. With the aim of estimating a missing sample based on the available information, we utilize an adaptive imputation layer to impute the embedded feature of missing data by using cross-view soft cluster assignments and global cluster centroids. As the imputation progresses, the portion of complete data increases, contributing to enhancing the discriminative information contained in global pseudo-labels. Meanwhile, to alleviate the negative impact caused by inferior imputed samples and the discrepancy of cluster structures, we further design an adaptive imputation strategy based on the global pseudo-label and the local cluster assignment. Experimental results on multiple real-world datasets demonstrate the effectiveness of our method over existing approaches.
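The core imputation step described above — a missing view's embedding estimated from cross-view soft cluster assignments and global centroids — can be sketched as a soft-assignment-weighted mean. This is a minimal illustration of the idea, not the paper's adaptive imputation layer:

```python
def impute_missing(assign_probs, centroids):
    """Impute a missing view's embedding as the soft-cluster-weighted
    mean of global cluster centroids (a minimal sketch of the idea).

    assign_probs: soft assignment over K clusters, summing to 1,
                  obtained here from the sample's available views.
    centroids:    K global cluster centroids, each a feature vector.
    """
    dim = len(centroids[0])
    return [sum(p * c[j] for p, c in zip(assign_probs, centroids))
            for j in range(dim)]
```

A sample assigned half-and-half to two clusters is imputed at the midpoint of their centroids; as assignments sharpen during training, the imputed features move toward a single centroid, which is consistent with the progressive refinement the abstract describes.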

ICLR Conference 2024 Conference Paper

An Unforgeable Publicly Verifiable Watermark for Large Language Models

  • Aiwei Liu
  • Leyi Pan
  • Xuming Hu
  • Shuang Li 0015
  • Lijie Wen 0001
  • Irwin King
  • Philip S. Yu

Recently, text watermarking algorithms for large language models (LLMs) have been proposed to mitigate the potential harms of text generated by LLMs, including fake news and copyright issues. However, current watermark detection algorithms require the secret key used in the watermark generation process, making them susceptible to security breaches and counterfeiting during public detection. To address this limitation, we propose an unforgeable publicly verifiable watermark algorithm named UPV that uses two different neural networks for watermark generation and detection, instead of using the same key at both stages. Meanwhile, the token embedding parameters are shared between the generation and detection networks, which makes the detection network achieve a high accuracy very efficiently. Experiments demonstrate that our algorithm attains high detection accuracy and computational efficiency through neural networks. Subsequent analysis confirms the high complexity involved in forging the watermark from the detection network. Our code is available at https://github.com/THU-BPM/unforgeable_watermark

AAAI Conference 2024 Conference Paper

Dual-Channel Learning Framework for Drug-Drug Interaction Prediction via Relation-Aware Heterogeneous Graph Transformer

  • Xiaorui Su
  • Pengwei Hu
  • Zhu-Hong You
  • Philip S. Yu
  • Lun Hu

Identifying novel drug-drug interactions (DDIs) is a crucial task in pharmacology, as the interference between pharmacological substances can pose serious medical risks. In recent years, several network-based techniques have emerged for predicting DDIs. However, they primarily focus on local structures within DDI-related networks, often overlooking the significance of indirect connections between pairwise drug nodes from a global perspective. Additionally, effectively handling heterogeneous information present in both biomedical knowledge graphs and drug molecular graphs remains a challenge for improved performance of DDI prediction. To address these limitations, we propose a Transformer-based relatIon-aware Graph rEpresentation leaRning framework (TIGER) for DDI prediction. TIGER leverages the Transformer architecture to effectively exploit the structure of heterogeneous graph, which allows it direct learning of long dependencies and high-order structures. Furthermore, TIGER incorporates a relation-aware self-attention mechanism, capturing a diverse range of semantic relations that exist between pairs of nodes in heterogeneous graph. In addition to these advancements, TIGER enhances predictive accuracy by modeling DDI prediction task using a dual-channel network, where drug molecular graph and biomedical knowledge graph are fed into two respective channels. By incorporating embeddings obtained at graph and node levels, TIGER can benefit from structural properties of drugs as well as rich contextual information provided by biomedical knowledge graph. Extensive experiments conducted on three real-world datasets demonstrate the effectiveness of TIGER in DDI prediction. Furthermore, case studies highlight its ability to provide a deeper understanding of underlying mechanisms of DDIs.

NeurIPS Conference 2024 Conference Paper

GC-Bench: An Open and Unified Benchmark for Graph Condensation

  • Qingyun Sun
  • Ziying Chen
  • Beining Yang
  • Cheng Ji
  • Xingcheng Fu
  • Sheng Zhou
  • Hao Peng
  • Jianxin Li

Graph condensation (GC) has recently garnered considerable attention due to its ability to reduce large-scale graph datasets while preserving their essential properties. The core concept of GC is to create a smaller, more manageable graph that retains the characteristics of the original graph. Despite the proliferation of graph condensation methods developed in recent years, there is no comprehensive evaluation and in-depth analysis, which creates a great obstacle to understanding the progress in this field. To fill this gap, we develop a comprehensive Graph Condensation Benchmark (GC-Bench) to analyze the performance of graph condensation in different scenarios systematically. Specifically, GC-Bench systematically investigates the characteristics of graph condensation in terms of the following dimensions: effectiveness, transferability, and complexity. We comprehensively evaluate 12 state-of-the-art graph condensation algorithms in node-level and graph-level tasks and analyze their performance in 12 diverse graph datasets. Further, we have developed an easy-to-use library for training and evaluating different GC methods to facilitate reproducible research. The GC-Bench library is available at https://github.com/RingBDStack/GC-Bench.

IJCAI Conference 2024 Conference Paper

Graph Neural Networks for Brain Graph Learning: A Survey

  • Xuexiong Luo
  • Jia Wu
  • Jian Yang
  • Shan Xue
  • Amin Beheshti
  • Quan Z. Sheng
  • David McAlpine
  • Paul Sowman

Exploring the complex structure of the human brain is crucial for understanding its functionality and diagnosing brain disorders. Thanks to advancements in neuroimaging technology, a novel approach has emerged that involves modeling the human brain as a graph-structured pattern, with different brain regions represented as nodes and the functional relationships among these regions as edges. Moreover, graph neural networks (GNNs) have demonstrated a significant advantage in mining graph-structured data. Developing GNNs to learn brain graph representations for brain disorder analysis has recently gained increasing attention. However, there is a lack of systematic survey work summarizing current research methods in this domain. In this paper, we aim to bridge this gap by reviewing brain graph learning works that utilize GNNs. We first introduce the process of brain graph modeling based on common neuroimaging data. Subsequently, we systematically categorize current works based on the type of brain graph generated and the targeted research problems. To make this research accessible to a broader range of interested researchers, we provide an overview of representative methods and commonly used datasets, along with their implementation sources. Finally, we present our insights on future research directions. The repository of this survey is available at https://github.com/XuexiongLuoMQ/Awesome-Brain-Graph-Learning-with-GNNs.

AAAI Conference 2024 Conference Paper

Hierarchical and Incremental Structural Entropy Minimization for Unsupervised Social Event Detection

  • Yuwei Cao
  • Hao Peng
  • Zhengtao Yu
  • Philip S. Yu

As a trending approach for social event detection, graph neural network (GNN)-based methods enable a fusion of natural language semantics and the complex social network structural information, thus showing SOTA performance. However, GNN-based methods can miss useful message correlations. Moreover, they require manual labeling for training and predetermining the number of events for prediction. In this work, we address social event detection via graph structural entropy (SE) minimization. While keeping the merits of the GNN-based methods, the proposed framework, HISEvent, constructs more informative message graphs, is unsupervised, and does not require the number of events given a priori. Specifically, we incrementally explore the graph neighborhoods using 1-dimensional (1D) SE minimization to supplement the existing message graph with edges between semantically related messages. We then detect events from the message graph by hierarchically minimizing 2-dimensional (2D) SE. Our proposed 1D and 2D SE minimization algorithms are customized for social event detection and effectively tackle the efficiency problem of the existing SE minimization algorithms. Extensive experiments show that HISEvent consistently outperforms GNN-based methods and achieves the new SOTA for social event detection under both closed- and open-set settings while being efficient and robust.
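The 2D structural entropy that HISEvent minimizes can be computed directly for a fixed partition. The sketch below uses the common textbook form of 1D and 2D SE (degree-weighted node entropy plus a boundary term per module); the paper's contribution is the customized minimization algorithms, not this formula:

```python
import math

def one_dim_se(adj):
    """1D structural entropy: degree-distribution entropy of the graph."""
    vol = sum(sum(row) for row in adj)
    deg = [sum(row) for row in adj]
    return -sum((d / vol) * math.log2(d / vol) for d in deg if d > 0)

def two_dim_se(adj, partition):
    """2D structural entropy of a graph under a node partition
    (common textbook form; lower is better for a good community split)."""
    vol = sum(sum(row) for row in adj)
    deg = [sum(row) for row in adj]
    h = 0.0
    for module in partition:
        vol_m = sum(deg[v] for v in module)
        # edges leaving the module (cut size)
        cut = sum(adj[u][v] for u in module
                  for v in range(len(adj)) if v not in module)
        h += -sum((deg[v] / vol) * math.log2(deg[v] / vol_m)
                  for v in module if deg[v] > 0)
        h += (cut / vol) * math.log2(vol / vol_m)
    return h
```

On two triangles joined by a single bridge edge, partitioning into the two triangles yields a 2D SE well below the 1D SE, reflecting that the partition captures real community structure — the kind of signal event detection exploits when messages about the same event form dense subgraphs.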

ICML Conference 2024 Conference Paper

LSEnet: Lorentz Structural Entropy Neural Network for Deep Graph Clustering

  • Li Sun 0008
  • Zhenhao Huang 0001
  • Hao Peng 0001
  • Yujie Wang
  • Chunyang Liu
  • Philip S. Yu

Graph clustering is a fundamental problem in machine learning. Deep learning methods have achieved state-of-the-art results in recent years, but they still cannot work without predefined cluster numbers. Such limitation motivates us to pose a more challenging problem of graph clustering with unknown cluster number. We propose to address this problem from a fresh perspective of graph information theory (i.e., structural information). In the literature, structural information has not yet been introduced to deep clustering, and its classic definition falls short of discrete formulation and modeling node features. In this work, we first formulate a differentiable structural information (DSI) in the continuous realm, accompanied by several theoretical results. By minimizing DSI, we construct the optimal partitioning tree where densely connected nodes in the graph tend to have the same assignment, revealing the cluster structure. DSI is also theoretically presented as a new graph clustering objective, not requiring the pre-defined cluster number. Furthermore, we design a neural LSEnet in the Lorentz model of hyperbolic space, where we integrate node features into structural information via manifold-valued graph convolution. Extensive empirical results on real graphs show the superiority of our approach.

AAAI Conference 2024 Conference Paper

Motif-Aware Riemannian Graph Neural Network with Generative-Contrastive Learning

  • Li Sun
  • Zhenhao Huang
  • Zixi Wang
  • Feiyang Wang
  • Hao Peng
  • Philip S. Yu

Graphs are typical non-Euclidean data of complex structures. In recent years, Riemannian graph representation learning has emerged as an exciting alternative to Euclidean ones. However, Riemannian methods are still in an early stage: most of them present a single curvature (radius) regardless of structural complexity, suffer from numerical instability due to the exponential/logarithmic map, and lack the ability to capture motif regularity. In light of the issues above, we propose the problem of Motif-aware Riemannian Graph Representation Learning, seeking a numerically stable encoder to capture motif regularity in a diverse-curvature manifold without labels. To this end, we present a novel Motif-aware Riemannian model with Generative-Contrastive learning (MotifRGC), which conducts a min-max game in the Riemannian manifold in a self-supervised manner. First, we propose a new type of Riemannian GCN (D-GCN), in which we construct a diverse-curvature manifold by a product layer with the diversified factor, and replace the exponential/logarithmic map by a stable kernel layer. Second, we introduce a motif-aware Riemannian generative-contrastive learning to capture motif regularity in the constructed manifold and learn motif-aware node representation without external labels. Empirical results show the superiority of MotifRGC.

UAI Conference 2024 Conference Paper

Multi-Relational Structural Entropy

  • Yuwei Cao
  • Hao Peng 0001
  • Angsheng Li
  • Chenyu You
  • Zhifeng Hao
  • Philip S. Yu

Structural Entropy (SE) measures the structural information contained in a graph. Minimizing or maximizing SE helps to reveal or obscure the intrinsic structural patterns underlying graphs in an interpretable manner, finding applications in various tasks driven by networked data. However, SE ignores the heterogeneity inherent in the graph relations, which is ubiquitous in modern networks. In this work, we extend SE to consider heterogeneous relations and propose the first metric for multi-relational graph structural information, namely, Multi-relational Structural Entropy (MrSE). To this end, we first cast SE through the novel lens of the stationary distribution from random surfing, which readily extends to multi-relational networks by considering the choices of both nodes and relation types simultaneously at each step. The resulting MrSE is then optimized by a new greedy algorithm to reveal the essential structures within a multi-relational network. Experimental results highlight that the proposed MrSE offers a more insightful interpretation of the structure of multi-relational graphs compared to SE. Additionally, it enhances the performance of two tasks that involve real-world multi-relational graphs, including node clustering and social event detection.
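The random-surfing view underlying MrSE can be sketched with power iteration. This is one plausible reading of the abstract's description, not the paper's formulation: at each step the surfer first picks uniformly among the relation types available at the current node, then moves to a uniform neighbor under that relation:

```python
def stationary(rel_adjs, iters=200):
    """Stationary distribution of a multi-relational random surfer.

    rel_adjs: one adjacency matrix per relation type over the same
              node set (every node is assumed reachable; isolated
              nodes would leak probability mass in this sketch).
    """
    n = len(rel_adjs[0])
    pi = [1.0 / n] * n
    for _ in range(iters):
        nxt = [0.0] * n
        for u in range(n):
            # relation types actually available at node u
            avail = [adj for adj in rel_adjs if sum(adj[u]) > 0]
            for adj in avail:
                deg = sum(adj[u])
                for v in range(n):
                    if adj[u][v]:
                        # choose relation uniformly, then neighbor uniformly
                        nxt[v] += pi[u] * adj[u][v] / (len(avail) * deg)
        pi = nxt
    return pi
```

With a single relation this reduces to a plain random walk, whose stationary distribution is proportional to node degree; adding relation types changes the distribution because the relation choice is made before the neighbor choice, which is exactly the extra structure SE ignores and MrSE accounts for.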

ICML Conference 2024 Conference Paper

Position: TrustLLM: Trustworthiness in Large Language Models

  • Yue Huang 0001
  • Lichao Sun 0001
  • Haoran Wang 0005
  • Siyuan Wu 0001
  • Qihui Zhang
  • Yuan Li
  • Chujie Gao
  • Yixin Huang

Large language models (LLMs) have gained considerable attention for their excellent natural language processing capabilities. Nonetheless, these LLMs present many challenges, particularly in the realm of trustworthiness. This paper introduces TrustLLM, a comprehensive study of trustworthiness in LLMs, including principles for different dimensions of trustworthiness, an established benchmark, evaluation, and analysis of trustworthiness for mainstream LLMs, and discussion of open challenges and future directions. Specifically, we first propose a set of principles for trustworthy LLMs that span eight different dimensions. Based on these principles, we further establish a benchmark across six dimensions including truthfulness, safety, fairness, robustness, privacy, and machine ethics. We then present a study evaluating 16 mainstream LLMs in TrustLLM, consisting of over 30 datasets. Our findings firstly show that in general trustworthiness and capability (i.e., functional effectiveness) are positively related. Secondly, our observations reveal that proprietary LLMs generally outperform most open-source counterparts in terms of trustworthiness, raising concerns about the potential risks of widely accessible open-source LLMs. However, a few open-source LLMs come very close to proprietary ones, suggesting that open-source models can achieve high levels of trustworthiness without additional mechanisms like a moderator, offering valuable insights for developers in this field. Thirdly, it is important to note that some LLMs may be overly calibrated towards exhibiting trustworthiness, to the extent that they compromise their utility by mistakenly treating benign prompts as harmful and consequently not responding. Besides these observations, we've uncovered key insights into the multifaceted trustworthiness in LLMs. We emphasize the importance of ensuring transparency not only in the models themselves but also in the technologies that underpin trustworthiness. We advocate that establishing an AI alliance between industry, academia, and the open-source community to foster collaboration is imperative to advance the trustworthiness of LLMs.

JMLR Journal 2024 Journal Article

PyGOD: A Python Library for Graph Outlier Detection

  • Kay Liu
  • Yingtong Dou
  • Xueying Ding
  • Xiyang Hu
  • Ruitong Zhang
  • Hao Peng
  • Lichao Sun
  • Philip S. Yu

PyGOD is an open-source Python library for detecting outliers in graph data. As the first comprehensive library of its kind, PyGOD supports a wide array of leading graph-based methods for outlier detection under an easy-to-use, well-documented API designed for use by both researchers and practitioners. PyGOD provides modularized components of the different detectors implemented so that users can easily customize each detector for their purposes. To ease the construction of detection workflows, PyGOD offers numerous commonly used utility functions. To scale computation to large graphs, PyGOD supports functionalities for deep models such as sampling and mini-batch processing. PyGOD uses best practices in fostering code reliability and maintainability, including unit testing, continuous integration, and code coverage. To facilitate accessibility, PyGOD is released under a BSD 2-Clause license at https://pygod.org and at the Python Package Index (PyPI).

NeurIPS Conference 2024 Conference Paper

Spiking Graph Neural Network on Riemannian Manifolds

  • Li Sun
  • Zhenhao Huang
  • Qiqi Wan
  • Hao Peng
  • Philip S. Yu

Graph neural networks (GNNs) have become the dominant solution for learning on graphs, the typical non-Euclidean structures. Conventional GNNs, constructed with the Artificial Neuron Network (ANN), have achieved impressive performance at the cost of high computation and energy consumption. In parallel, spiking GNNs with brain-like spiking neurons are drawing increasing research attention owing to the energy efficiency. So far, existing spiking GNNs consider graphs in Euclidean space, ignoring the structural geometry, and suffer from the high latency issue due to Back-Propagation-Through-Time (BPTT) with the surrogate gradient. In light of the aforementioned issues, we are devoted to exploring spiking GNN on Riemannian manifolds, and present a Manifold-valued Spiking GNN (MSG). In particular, we design a new spiking neuron on geodesically complete manifolds with the diffeomorphism, so that BPTT regarding the spikes is replaced by the proposed differentiation via manifold. Theoretically, we show that MSG approximates a solver of the manifold ordinary differential equation. Extensive experiments on common graphs show the proposed MSG achieves superior performance to previous spiking GNNs and energy efficiency to conventional GNNs.

AAAI Conference 2024 Conference Paper

Three Heads Are Better than One: Improving Cross-Domain NER with Progressive Decomposed Network

  • Xuming Hu
  • Zhaochen Hong
  • Yong Jiang
  • Zhichao Lin
  • Xiaobin Wang
  • Pengjun Xie
  • Philip S. Yu

Cross-domain named entity recognition (NER) tasks encourage NER models to transfer knowledge from data-rich source domains to sparsely labeled target domains. Previous works adopt the paradigms of pre-training on the source domain followed by fine-tuning on the target domain. However, these works ignore that general labeled NER source domain data can be easily retrieved in the real world, and soliciting more source domains could bring more benefits. Unfortunately, previous paradigms cannot efficiently transfer knowledge from multiple source domains. In this work, to transfer multiple source domains' knowledge, we decouple the NER task into the pipeline tasks of mention detection and entity typing, where the mention detection unifies the training object across domains, thus providing the entity typing with higher-quality entity mentions. Additionally, we request multiple general source domain models to suggest the potential named entities for sentences in the target domain explicitly, and transfer their knowledge to the target domain models through the knowledge progressive networks implicitly. Furthermore, we propose two methods to analyze in which source domain knowledge transfer occurs, thus helping us judge which source domain brings the greatest benefit. In our experiment, we develop a Chinese cross-domain NER dataset. Our model improved the F1 score by an average of 12.50% across 8 Chinese and English datasets compared to models without source domain data.

ICLR Conference 2024 Conference Paper

Towards Understanding Factual Knowledge of Large Language Models

  • Xuming Hu
  • Junzhe Chen 0001
  • Xiaochuan Li 0003
  • Yufei Guo
  • Lijie Wen 0001
  • Philip S. Yu
  • Zhijiang Guo

Large language models (LLMs) have recently driven striking performance improvements across a range of natural language processing tasks. The factual knowledge acquired during pretraining and instruction tuning can be useful in various downstream tasks, such as question answering, and language generation. Unlike conventional Knowledge Bases (KBs) that explicitly store factual knowledge, LLMs implicitly store facts in their parameters. Content generated by the LLMs can often exhibit inaccuracies or deviations from the truth, due to facts that can be incorrectly induced or become obsolete over time. To this end, we aim to explore the extent and scope of factual knowledge within LLMs by designing the benchmark Pinocchio. Pinocchio contains 20K diverse factual questions that span different sources, timelines, domains, regions, and languages. Furthermore, we investigate whether LLMs can compose multiple facts, update factual knowledge temporally, reason over multiple pieces of facts, identify subtle factual differences, and resist adversarial examples. Extensive experiments on different sizes and types of LLMs show that existing LLMs still lack factual knowledge and suffer from various spurious correlations. We believe this is a critical bottleneck for realizing trustworthy artificial intelligence. The dataset Pinocchio and our codes are publicly available at: https://github.com/THU-BPM/Pinocchio.

TMLR Journal 2024 Journal Article

Uncertainty in Graph Neural Networks: A Survey

  • Fangxin Wang
  • Yuqing Liu
  • Kay Liu
  • Yibo Wang
  • Sourav Medya
  • Philip S. Yu

Graph Neural Networks (GNNs) have been extensively used in various real-world applications. However, the predictive uncertainty of GNNs stemming from diverse sources such as inherent randomness in data and model training errors can lead to unstable and erroneous predictions. Therefore, identifying, quantifying, and utilizing uncertainty are essential to enhance the performance of the model for the downstream tasks as well as the reliability of the GNN predictions. This survey aims to provide a comprehensive overview of the GNNs from the perspective of uncertainty with an emphasis on its integration in graph learning. We compare and summarize existing graph uncertainty theory and methods, alongside the corresponding downstream tasks. Thereby, we bridge the gap between theory and practice, meanwhile connecting different GNN communities. Moreover, our work provides valuable insights into promising directions in this field.

NeurIPS Conference 2024 Conference Paper

When LLMs Meet Cunning Texts: A Fallacy Understanding Benchmark for Large Language Models

  • Yinghui Li
  • Qingyu Zhou
  • Yuanzhen Luo
  • Shirong Ma
  • Yangning Li
  • Hai-Tao Zheng
  • Xuming Hu
  • Philip S. Yu

Recently, Large Language Models (LLMs) have made remarkable advances in language understanding and generation, and various benchmarks for measuring their capabilities have sprung up. In this paper, we challenge the reasoning and understanding abilities of LLMs by proposing a FaLlacy Understanding Benchmark (FLUB) containing cunning texts that are easy for humans to understand but difficult for models to grasp. Specifically, the cunning texts that FLUB focuses on mainly consist of tricky, humorous, and misleading texts collected from the real internet environment. We design three tasks with increasing difficulty in the FLUB benchmark to evaluate the fallacy understanding ability of LLMs. Based on FLUB, we investigate the performance of multiple representative and advanced LLMs, showing that FLUB is challenging and worthy of further study. Interesting discoveries and valuable insights are obtained from our extensive experiments and detailed analyses. We hope that our benchmark can encourage the community to improve LLMs' ability to understand fallacies. Our data and codes are available at https://github.com/THUKElab/FLUB.

IJCAI Conference 2023 Conference Paper

CONGREGATE: Contrastive Graph Clustering in Curvature Spaces

  • Li Sun
  • Feiyang Wang
  • Junda Ye
  • Hao Peng
  • Philip S. Yu

Graph clustering is a longstanding research topic, and has achieved remarkable success with deep learning methods in recent years. Nevertheless, we observe that several important issues largely remain open. On the one hand, graph clustering from the geometric perspective is appealing but has rarely been touched before, as it lacks a promising space for geometric clustering. On the other hand, contrastive learning boosts deep graph clustering but usually struggles with either graph augmentation or hard sample mining. To bridge this gap, we rethink the problem of graph clustering from a geometric perspective and, to the best of our knowledge, make the first attempt to introduce a heterogeneous curvature space to the graph clustering problem. Correspondingly, we present a novel end-to-end contrastive graph clustering model named CONGREGATE, addressing geometric graph clustering with Ricci curvatures. To support geometric clustering, we construct a theoretically grounded Heterogeneous Curvature Space where deep representations are generated via the product of the proposed fully Riemannian graph convolutional nets. Thereafter, we train the graph clusters with an augmentation-free reweighted contrastive approach in which we pay more attention to both hard negatives and hard positives in our curvature space. Empirical results on real-world graphs show that our model outperforms the state-of-the-art competitors.
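The abstract above works with Ricci curvatures on graphs. As a rough, purely illustrative companion (not the curvature notion CONGREGATE necessarily uses), the combinatorial Forman-Ricci curvature of an edge in an unweighted graph has a one-line formula:

```python
def forman_curvature(adj, u, v):
    """Forman-Ricci curvature of edge (u, v) in an unweighted graph:
    F(u, v) = 4 - deg(u) - deg(v). It is negative on edges joining
    high-degree hubs and larger on edges touching low-degree vertices."""
    return 4 - len(adj[u]) - len(adj[v])

# small graph: a triangle (0, 1, 2) with a pendant vertex 3 attached to 2
adj = {0: [1, 2], 1: [0, 2], 2: [0, 1, 3], 3: [2]}
print(forman_curvature(adj, 0, 2))  # 4 - 2 - 3 = -1
print(forman_curvature(adj, 2, 3))  # 4 - 3 - 1 = 0
```

This is only a sketch of how a discrete curvature can be read off local degrees; the paper's fully Riemannian construction is far richer.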

IJCAI Conference 2023 Conference Paper

Hierarchical State Abstraction based on Structural Information Principles

  • Xianghua Zeng
  • Hao Peng
  • Angsheng Li
  • Chunyang Liu
  • Lifang He
  • Philip S. Yu

State abstraction optimizes decision-making by ignoring irrelevant environmental information in reinforcement learning with rich observations. Nevertheless, recent approaches focus on adequate representational capacities, resulting in essential information loss and affecting their performance on challenging tasks. In this article, we propose a novel mathematical Structural Information principles-based State Abstraction framework, namely SISA, from the information-theoretic perspective. Specifically, an unsupervised, adaptive hierarchical state clustering method that requires no manual assistance is presented, and meanwhile an optimal encoding tree is generated. On each non-root tree node, a new aggregation function and conditional structural entropy are designed to achieve hierarchical state abstraction and compensate for sampling-induced essential information loss in state abstraction. Empirical evaluations on a visual gridworld domain and six continuous control benchmarks demonstrate that, compared with five SOTA state abstraction approaches, SISA significantly improves mean episode reward and sample efficiency by up to 18.98 and 44.44%, respectively. Besides, we experimentally show that SISA is a general framework that can be flexibly integrated with different representation-learning objectives to improve their performance further.

AAAI Conference 2023 Conference Paper

Learning to Select from Multiple Options

  • Jiangshu Du
  • Wenpeng Yin
  • Congying Xia
  • Philip S. Yu

Many NLP tasks can be regarded as a selection problem from a set of options, such as classification tasks, multi-choice question answering, etc. Textual entailment (TE) has been shown to be the state-of-the-art (SOTA) approach to dealing with those selection problems. TE treats input texts as premises (P) and options as hypotheses (H), then handles the selection problem by modeling (P, H) pairwise. This has two limitations: first, pairwise modeling is unaware of other options, which is less intuitive since humans often determine the best option by comparing competing candidates; second, the inference process of pairwise TE is time-consuming, especially when the option space is large. To deal with the two issues, this work first proposes a contextualized TE model (Context-TE) by appending k other options as the context of the current (P, H) modeling. Context-TE is able to learn a more reliable decision for H since it considers various contexts. Second, we speed up Context-TE with Parallel-TE, which learns the decisions of multiple options simultaneously. Parallel-TE significantly improves the inference speed while keeping comparable performance with Context-TE. Our methods are evaluated on three tasks (ultra-fine entity typing, intent detection, and multi-choice QA) that are typical selection problems with different sizes of options. Experiments show our models set new SOTA performance; in particular, Parallel-TE is k times faster than pairwise TE in inference.

TIST Journal 2023 Journal Article

MC²: Unsupervised Multiple Social Network Alignment

  • Li Sun
  • Zhongbao Zhang
  • Gen Li
  • Pengxin Ji
  • Sen Su
  • Philip S. Yu

Social network alignment, identifying social accounts of the same individual across different social networks, is of fundamental importance in a wide spectrum of applications, such as link prediction and information diffusion. Individuals more often than not join multiple social networks, and it is in fact much too expensive or even impossible to acquire supervision for guiding the alignment. To the best of our knowledge, few methods in the literature can align multiple social networks without supervision. In this article, we propose to study the problem of unsupervised multiple social network alignment. To address this problem, we propose a novel unsupervised model of joint Matrix factorization with a diagonal Cone under orthogonal Constraint, referred to as MC². Its core idea is to embed and align multiple social networks in a common subspace via an unsupervised approach. Specifically, in the MC² model, we first design a matrix optimization to infer the common subspace from different social networks. To address the nonconvex optimization, we then design an efficient alternating algorithm by leveraging its inherent functional property. Through extensive experiments on real-world datasets, we demonstrate that the proposed MC² model significantly outperforms the state-of-the-art methods.
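The abstract above describes aligning networks in a common subspace under an orthogonal constraint. A minimal, generic illustration of one such orthogonal-alignment subproblem (not the MC² algorithm itself) is the closed-form orthogonal Procrustes solution:

```python
import numpy as np

def procrustes_align(X, Y):
    """Closed-form orthogonal Q minimizing ||X @ Q - Y||_F:
    take the SVD of X^T Y = U S V^T and return Q = U V^T."""
    U, _, Vt = np.linalg.svd(X.T @ Y)
    return U @ Vt

rng = np.random.default_rng(1)
X = rng.standard_normal((50, 4))          # embeddings of network 1
Q_true, _ = np.linalg.qr(rng.standard_normal((4, 4)))
Y = X @ Q_true                            # network 2: a rotated copy
Q = procrustes_align(X, Y)
print(np.allclose(X @ Q, Y))              # the rotation is recovered exactly
```

In an alternating scheme like the one sketched in the abstract, a step of this kind would be interleaved with updates to the embeddings themselves; the diagonal-cone constraint of MC² is not modeled here.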

AAAI Conference 2023 Conference Paper

Self-Organization Preserved Graph Structure Learning with Principle of Relevant Information

  • Qingyun Sun
  • Jianxin Li
  • Beining Yang
  • Xingcheng Fu
  • Hao Peng
  • Philip S. Yu

Most Graph Neural Networks follow the message-passing paradigm, assuming the observed structure depicts the ground-truth node relationships. However, this fundamental assumption cannot always be satisfied, as real-world graphs are always incomplete, noisy, or redundant. How to reveal the inherent graph structure in a unified way remains under-explored. We propose PRI-GSL, a Graph Structure Learning framework guided by the Principle of Relevant Information, providing a simple and unified framework for identifying self-organization and revealing hidden structure. PRI-GSL learns a structure that contains the most relevant yet least redundant information, quantified by von Neumann entropy and Quantum Jensen-Shannon divergence. PRI-GSL incorporates the evolution of quantum continuous walks with graph wavelets to encode node structural roles, showing how the nodes interplay and self-organize with the graph structure. Extensive experiments demonstrate the superior effectiveness and robustness of PRI-GSL.
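The abstract above quantifies structural information with von Neumann entropy. A minimal sketch of its standard spectral definition, using the trace-normalized graph Laplacian as the density-matrix analogue (one common convention; the paper may use a different normalization):

```python
import numpy as np

def von_neumann_entropy(adj):
    """Von Neumann entropy (in bits) of a graph: Shannon entropy of the
    eigenvalues of the Laplacian rescaled to unit trace."""
    deg = np.diag(adj.sum(axis=1))
    lap = deg - adj
    rho = lap / np.trace(lap)          # density-matrix analogue, trace 1
    eigs = np.linalg.eigvalsh(rho)
    eigs = eigs[eigs > 1e-12]          # convention: 0 * log 0 = 0
    return float(-np.sum(eigs * np.log2(eigs)))

# triangle graph: Laplacian spectrum {0, 3, 3} -> rho spectrum {0, 1/2, 1/2}
A = np.array([[0, 1, 1], [1, 0, 1], [1, 1, 0]], float)
print(von_neumann_entropy(A))          # exactly 1.0 bit
```

The entropy grows with spectral spread, which is why it can serve as a redundancy measure when learning a graph structure.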

AAAI Conference 2023 Conference Paper

Self-Supervised Continual Graph Learning in Adaptive Riemannian Spaces

  • Li Sun
  • Junda Ye
  • Hao Peng
  • Feiyang Wang
  • Philip S. Yu

Continual graph learning routinely finds its role in a variety of real-world applications where graph data with different tasks come sequentially. Despite the success of prior works, it still faces great challenges. On the one hand, existing methods work in the zero-curvature Euclidean space and largely ignore the fact that curvature varies over the coming graph sequence. On the other hand, continual learners in the literature rely on abundant labels, but labeling graphs in practice is particularly hard, especially for continuously emerging graphs on the fly. To address the aforementioned challenges, we propose to explore a challenging yet practical problem, self-supervised continual graph learning in adaptive Riemannian spaces. In this paper, we propose a novel self-supervised Riemannian Graph Continual Learner (RieGrace). In RieGrace, we first design an Adaptive Riemannian GCN (AdaRGCN), a unified GCN coupled with a neural curvature adapter, so that the Riemannian space is shaped by the learnt curvature adaptive to each graph. Then, we present a Label-free Lorentz Distillation approach, in which we create a teacher-student AdaRGCN for the graph sequence. The student successively performs intra-distillation from itself and inter-distillation from the teacher so as to consolidate knowledge without catastrophic forgetting. In particular, we propose a theoretically grounded Generalized Lorentz Projection for contrastive distillation in Riemannian space. Extensive experiments on the benchmark datasets show the superiority of RieGrace, and additionally, we investigate how curvature changes over the graph sequence.

TIST Journal 2022 Journal Article

A Survey on Text Classification: From Traditional to Deep Learning

  • Qian Li
  • Hao Peng
  • Jianxin Li
  • Congying Xia
  • Renyu Yang
  • Lichao Sun
  • Philip S. Yu
  • Lifang He

Text classification is the most fundamental and essential task in natural language processing. The last decade has seen a surge of research in this area due to the unprecedented success of deep learning. Numerous methods, datasets, and evaluation metrics have been proposed in the literature, raising the need for a comprehensive and updated survey. This paper fills the gap by reviewing the state-of-the-art approaches from 1961 to 2021, ranging from traditional models to deep learning. We create a taxonomy for text classification according to the text involved and the models used for feature extraction and classification. We then discuss each of these categories in detail, covering both the technical developments and the benchmark datasets that support tests of predictions. This survey also provides a comprehensive comparison between different techniques and identifies the pros and cons of various evaluation metrics. Finally, we conclude by summarizing key implications, future research directions, and the challenges facing the research area.

TIST Journal 2022 Journal Article

Federated Social Recommendation with Graph Neural Network

  • Zhiwei Liu
  • Liangwei Yang
  • Ziwei Fan
  • Hao Peng
  • Philip S. Yu

Recommender systems have become prosperous nowadays, designed to predict users’ potential interests in items by learning embeddings. Recent developments of Graph Neural Networks (GNNs) also provide recommender systems (RSs) with powerful backbones to learn embeddings from a user-item graph. However, only leveraging the user-item interactions suffers from the cold-start issue due to the difficulty in data collection. Hence, current endeavors propose fusing social information with user-item interactions to alleviate it, which is the social recommendation problem. Existing work employs GNNs to aggregate both social links and user-item interactions simultaneously. However, they all require centralized storage of the social links and item interactions of users, which leads to privacy concerns. Additionally, according to strict privacy protection under the General Data Protection Regulation, centralized data storage may not be feasible in the future, urging a decentralized framework for social recommendation. As a result, we design a federated learning recommender system for the social recommendation task, which is rather challenging because of its heterogeneity, personalization, and privacy protection requirements. To this end, we devise a novel framework, Federated Social recommendation with Graph neural network (FeSoG). Firstly, FeSoG adopts relational attention and aggregation to handle heterogeneity. Secondly, FeSoG infers user embeddings using local data to retain personalization. Last but not least, the proposed model employs pseudo-labeling techniques with item sampling to protect privacy and enhance training. Extensive experiments on three real-world datasets justify the effectiveness of FeSoG in completing social recommendation and privacy protection. To the best of our knowledge, we are the first to propose a federated learning framework for social recommendation.
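The abstract above builds on the federated learning setting, where clients train locally and a server aggregates parameters. As a baseline illustration (FeSoG's actual aggregation uses relational attention and pseudo-labeling, not reproduced here), the generic server-side federated averaging step looks like this:

```python
import numpy as np

def fed_avg(client_params, client_sizes):
    """Server-side FedAvg: average client parameter vectors,
    weighted by each client's local data size."""
    total = sum(client_sizes)
    return sum(p * (n / total) for p, n in zip(client_params, client_sizes))

# two clients; the larger client pulls the global model toward its parameters
clients = [np.array([1.0, 0.0]), np.array([3.0, 2.0])]
sizes = [1, 3]
print(fed_avg(clients, sizes))  # [2.5, 1.5]
```

In a privacy-preserving recommender, only these parameter updates leave the device; the raw social links and interactions stay local.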

TIST Journal 2022 Journal Article

Multivariate Correlation-aware Spatio-temporal Graph Convolutional Networks for Multi-scale Traffic Prediction

  • Senzhang Wang
  • Meiyue Zhang
  • Hao Miao
  • Zhaohui Peng
  • Philip S. Yu

Traffic flow prediction based on vehicle trajectories collected from installed GPS devices is critically important to Intelligent Transportation Systems (ITS). One limitation of existing traffic prediction models is that they mostly focus on predicting road-segment level traffic conditions, which can be considered a fine-grained prediction. In many scenarios, however, a coarse-grained prediction, such as predicting the traffic flows among different urban areas covering multiple road links, is also required to help governments better understand traffic conditions from a macroscopic point of view. This is especially useful in urban planning and public transportation planning applications. Another limitation is that the correlations among different types of traffic-related features are largely ignored. For example, traffic flow and traffic speed are usually negatively correlated. Existing works regard these traffic-related features as independent, without considering their correlations. In this article, we for the first time study the novel problem of multivariate correlation-aware multi-scale traffic flow prediction, and we propose a feature correlation-aware spatio-temporal graph convolutional network named MC-STGCN to effectively address it. Specifically, given a road graph, we first construct a coarse-grained road graph based on both the topology closeness and the traffic flow similarity among the nodes (road links). Then a cross-scale spatial-temporal feature learning and fusion technique is proposed for dealing with both the fine- and coarse-grained traffic data. In the spatial domain, a cross-scale GCN is proposed to learn the multi-scale spatial features jointly and fuse them together. In the temporal domain, a cross-scale temporal network composed of a hierarchical attention mechanism is designed for effectively capturing intra- and inter-scale temporal correlations. To effectively capture the feature correlations, a feature correlation learning component is also designed. Finally, a structural constraint is introduced to make the predictions on the two scales of traffic data consistent. We conduct extensive evaluations over two real traffic datasets, and the results demonstrate the superior performance of the proposal on both fine- and coarse-grained traffic predictions.
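The abstract above is built around graph convolutions over road graphs. A minimal sketch of the generic GCN propagation rule that such architectures build on (the standard building block, not the cross-scale MC-STGCN network itself):

```python
import numpy as np

def gcn_layer(adj, H, W):
    """One graph-convolution step: H' = ReLU(D^{-1/2} (A + I) D^{-1/2} H W),
    i.e., each node averages its self-looped, degree-normalized neighborhood
    and applies a shared linear transform."""
    A_hat = adj + np.eye(adj.shape[0])               # add self-loops
    d_inv_sqrt = 1.0 / np.sqrt(A_hat.sum(axis=1))
    A_norm = A_hat * d_inv_sqrt[:, None] * d_inv_sqrt[None, :]
    return np.maximum(A_norm @ H @ W, 0.0)           # ReLU

rng = np.random.default_rng(0)
A = np.array([[0, 1, 0], [1, 0, 1], [0, 1, 0]], float)  # 3-node path graph
H = rng.standard_normal((3, 4))                          # node features
W = rng.standard_normal((4, 2))                          # layer weights
print(gcn_layer(A, H, W).shape)  # (3, 2)
```

A cross-scale variant would run layers of this kind on both the fine- and coarse-grained road graphs and fuse the resulting features.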

TIST Journal 2021 Journal Article

A Comprehensive Survey of the Key Technologies and Challenges Surrounding Vehicular Ad Hoc Networks

  • Zhenchang Xia
  • Jia Wu
  • Libing Wu
  • Yanjiao Chen
  • Jian Yang
  • Philip S. Yu

Vehicular ad hoc networks (VANETs) and the services they support are an essential part of intelligent transportation. Through physical technologies, applications, protocols, and standards, they help to ensure traffic moves efficiently and vehicles operate safely. This article surveys the current state of play in VANET development. We summarize and classify the key technologies critical to the field, the resource-management and safety applications needed for smooth operations, the communications and data transmission protocols that support networking, and the theoretical and environmental constructs underpinning research and development, such as graph neural networks and the Internet of Things. Additionally, we identify and discuss several challenges facing VANETs, including poor safety, poor reliability, non-uniform standards, and low intelligence levels. Finally, we touch on hot technologies and techniques, such as reinforcement learning and 5G communications, to provide an outlook for the future of intelligent transportation systems.

TIST Journal 2021 Journal Article

Dynamic Planning of Bicycle Stations in Dockless Public Bicycle-sharing System Using Gated Graph Neural Network

  • Jianguo Chen
  • Kenli Li
  • Keqin Li
  • Philip S. Yu
  • Zeng Zeng

Benefiting from convenient cycling and flexible parking locations, the Dockless Public Bicycle-sharing (DL-PBS) network becomes increasingly popular in many countries. However, redundant and low-utility stations waste public urban space and maintenance costs of DL-PBS vendors. In this article, we propose a Bicycle Station Dynamic Planning (BSDP) system to dynamically provide the optimal bicycle station layout for the DL-PBS network. The BSDP system contains four modules: bicycle drop-off location clustering, bicycle-station graph modeling, bicycle-station location prediction, and bicycle-station layout recommendation. In the bicycle drop-off location clustering module, candidate bicycle stations are clustered from each spatio-temporal subset of the large-scale cycling trajectory records. In the bicycle-station graph modeling module, a weighted digraph model is built based on the clustering results and inferior stations with low station revenue and utility are filtered. Then, graph models across time periods are combined to create a graph sequence model. In the bicycle-station location prediction module, the GGNN model is used to train the graph sequence data and dynamically predict bicycle stations in the next period. In the bicycle-station layout recommendation module, the predicted bicycle stations are fine-tuned according to the government urban management plan, which ensures that the recommended station layout is conducive to city management, vendor revenue, and user convenience. Experiments on actual DL-PBS networks verify the effectiveness, accuracy, and feasibility of the proposed BSDP system.

IJCAI Conference 2021 Conference Paper

Graph Entropy Guided Node Embedding Dimension Selection for Graph Neural Networks

  • Gongxu Luo
  • Jianxin Li
  • Hao Peng
  • Carl Yang
  • Lichao Sun
  • Philip S. Yu
  • Lifang He

Graph representation learning has achieved great success in many areas, including e-commerce, chemistry, biology, etc. However, the fundamental problem of choosing the appropriate dimension of node embedding for a given graph still remains unsolved. The commonly used strategies for Node Embedding Dimension Selection (NEDS) based on grid search or empirical knowledge suffer from heavy computation and poor model performance. In this paper, we revisit NEDS from the perspective of the minimum entropy principle. Subsequently, we propose a novel Minimum Graph Entropy (MinGE) algorithm for NEDS with graph data. To be specific, MinGE considers both feature entropy and structure entropy on graphs, which are carefully designed according to the characteristics of the rich information in them. The feature entropy, which assumes the embeddings of adjacent nodes to be more similar, connects node features and link topology on graphs. The structure entropy takes the normalized degree as the basic unit to further measure the higher-order structure of graphs. Based on them, we design MinGE to directly calculate the ideal node embedding dimension for any graph. Finally, comprehensive experiments with popular Graph Neural Networks (GNNs) on benchmark datasets demonstrate the effectiveness and generalizability of our proposed MinGE.
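The abstract above takes the normalized degree as the basic unit of structure entropy. A minimal sketch of that ingredient, the Shannon entropy of the normalized-degree distribution (the full MinGE objective also includes a feature-entropy term not modeled here):

```python
import math

def degree_structure_entropy(degrees):
    """Shannon entropy (in bits) of the degree distribution normalized
    to sum to one: H = -sum_i (d_i / D) log2(d_i / D), D = sum_i d_i."""
    total = sum(degrees)
    return -sum((d / total) * math.log2(d / total) for d in degrees if d > 0)

print(degree_structure_entropy([2, 2, 2, 2]))  # regular graph: log2(4) = 2.0
print(degree_structure_entropy([3, 1, 1, 1]))  # star graph: ~1.79, more skewed
```

Regular graphs maximize this quantity for a fixed node count, while hub-dominated graphs score lower, which is what makes it usable as a structural measure.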

IJCAI Conference 2021 Conference Paper

Graph Learning based Recommender Systems: A Review

  • Shoujin Wang
  • Liang Hu
  • Yan Wang
  • Xiangnan He
  • Quan Z. Sheng
  • Mehmet A. Orgun
  • Longbing Cao
  • Francesco Ricci

Recent years have witnessed the fast development of the emerging topic of Graph Learning based Recommender Systems (GLRS). GLRS mainly employ advanced graph learning approaches to model users’ preferences and intentions as well as items’ characteristics and popularity for Recommender Systems (RS). Unlike other approaches, including content-based filtering and collaborative filtering, GLRS are built on graphs where the important objects, e.g., users, items, and attributes, are either explicitly or implicitly connected. With the rapid development of graph learning techniques, exploring and exploiting homogeneous or heterogeneous relations in graphs is a promising direction for building more effective RS. In this paper, we provide a systematic review of GLRS by discussing how they extract knowledge from graphs to improve the accuracy, reliability, and explainability of recommendations. First, we characterize and formalize GLRS, and then summarize and categorize the key challenges and main progress in this novel research area.

AAAI Conference 2021 Conference Paper

Hyperbolic Variational Graph Neural Network for Modeling Dynamic Graphs

  • Li Sun
  • Zhongbao Zhang
  • Jiawei Zhang
  • Feiyang Wang
  • Hao Peng
  • Sen Su
  • Philip S. Yu

Learning representations for graphs plays a critical role in a wide spectrum of downstream applications. In this paper, we summarize the limitations of prior works as threefold: representation space, modeling dynamics, and modeling uncertainty. To bridge this gap, we propose to learn dynamic graph representations in hyperbolic space, for the first time, with the aim of inferring stochastic node representations. Working with hyperbolic space, we present a novel Hyperbolic Variational Graph Neural Network, referred to as HVGNN. In particular, to model the dynamics, we introduce a Temporal GNN (TGNN) based on a theoretically grounded time encoding approach. To model the uncertainty, we devise a hyperbolic graph variational autoencoder built upon the proposed TGNN to generate stochastic node representations of hyperbolic normal distributions. Furthermore, we introduce a reparameterisable sampling algorithm for the hyperbolic normal distribution to enable the gradient-based learning of HVGNN. Extensive experiments show that HVGNN outperforms state-of-the-art baselines on real-world datasets.
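The abstract above hinges on reparameterisable sampling. The Euclidean Gaussian version of that trick, which HVGNN extends to the hyperbolic normal (the hyperbolic extension is not reproduced here), can be sketched in a few lines:

```python
import numpy as np

rng = np.random.default_rng(42)

def reparameterize(mu, log_var):
    """z = mu + sigma * eps with eps ~ N(0, I): the sample becomes a
    deterministic, differentiable function of (mu, log_var), so gradients
    can flow through the sampling step during training."""
    eps = rng.standard_normal(np.shape(mu))
    return mu + np.exp(0.5 * log_var) * eps

mu, log_var = np.zeros(3), np.zeros(3)   # a standard normal in 3 dimensions
z = reparameterize(mu, log_var)
print(z.shape)  # (3,)
```

Moving the randomness into a fixed noise source `eps` is what makes variational objectives trainable by backpropagation.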

AAAI Conference 2021 Conference Paper

KG-BART: Knowledge Graph-Augmented BART for Generative Commonsense Reasoning

  • Ye Liu
  • Yao Wan
  • Lifang He
  • Hao Peng
  • Philip S. Yu

Generative commonsense reasoning, which aims to empower machines to generate sentences with the capacity of reasoning over a set of concepts, is a critical bottleneck for text generation. Even the state-of-the-art pre-trained language generation models struggle at this task and often produce implausible and anomalous sentences. One reason is that they rarely consider incorporating the knowledge graph, which can provide rich relational information among the commonsense concepts. To promote the ability of commonsense reasoning for text generation, we propose a novel knowledge graph-augmented pre-trained language generation model, KG-BART, which encompasses the complex relations of concepts through the knowledge graph and produces more logical and natural sentences as output. Moreover, KG-BART can leverage the graph attention to aggregate the rich concept semantics that enhances the model generalization on unseen concept sets. Experiments on the benchmark CommonGen dataset verify the effectiveness of our proposed approach by comparing with several strong pre-trained language generation models; in particular, KG-BART outperforms BART by 5.80 and 4.60 in terms of BLEU-3 and BLEU-4. Moreover, we also show that the context generated by our model can work as background scenarios to benefit downstream commonsense QA tasks.

IJCAI Conference 2020 Conference Paper

Adversarial Mutual Information Learning for Network Embedding

  • Dongxiao He
  • Lu Zhai
  • Zhigang Li
  • Di Jin
  • Liang Yang
  • Yuxiao Huang
  • Philip S. Yu

Network embedding, which is to learn a low-dimensional representation of nodes in a network, has been used in many network analysis tasks. Some network embedding methods, including those based on generative adversarial networks (GAN) (a promising deep learning technique), have been proposed recently. Existing GAN-based methods typically use GAN to learn a Gaussian distribution as a priori for network embedding. However, this strategy makes it difficult to distinguish the node representation from the Gaussian distribution. Moreover, it does not make full use of the essential advantage of GAN (that is, to adversarially learn the representation mechanism rather than the representation itself), leading to compromised performance of the method. To address this problem, we propose to use the adversarial idea on the representation mechanism, i.e., on the encoding mechanism under the framework of autoencoder. Specifically, we use the mutual information between node attributes and embedding as a reasonable alternative to this encoding mechanism (which is much easier to track). Additionally, we introduce another mapping mechanism (which is based on GAN) as a competitor into the adversarial learning system. A range of empirical results demonstrate the effectiveness of the proposed approach.

ECAI Conference 2020 Conference Paper

DeepGS: Deep Representation Learning of Graphs and Sequences for Drug-Target Binding Affinity Prediction

  • Xuan Lin
  • Kaiqi Zhao 0001
  • Tong Xiao 0002
  • Zhe Quan
  • Zhi-Jie Wang 0009
  • Philip S. Yu

Accurately predicting drug-target binding affinity (DTA) in silico is a key task in drug discovery. Most of the conventional DTA prediction methods are simulation-based, which rely heavily on domain knowledge or the assumption of having the 3D structure of the targets, which are often difficult to obtain. Meanwhile, traditional machine learning-based methods apply various features and descriptors, and simply depend on the similarities between drug-target pairs. Recently, with the increasing amount of affinity data available and the success of deep representation learning models on various domains, deep learning techniques have been applied to DTA prediction. However, these methods consider either label/one-hot encodings or the topological structure of molecules, without considering the local chemical context of amino acids and SMILES sequences. Motivated by this, we propose a novel end-to-end learning framework, called DeepGS, which uses deep neural networks to extract the local chemical context from amino acids and SMILES sequences, as well as the molecular structure from the drugs. To assist the operations on the symbolic data, we propose to use advanced embedding techniques (i.e., Smi2Vec and Prot2Vec) to encode the amino acids and SMILES sequences into a distributed representation. Meanwhile, we suggest a new molecular structure modeling approach that works well under our framework. We have conducted extensive experiments to compare our proposed method with state-of-the-art models including KronRLS, SimBoost, DeepDTA and DeepCPI. Extensive experimental results demonstrate the superiority and competitiveness of DeepGS.

IJCAI Conference 2020 Conference Paper

Entity Synonym Discovery via Multipiece Bilateral Context Matching

  • Chenwei Zhang
  • Yaliang Li
  • Nan Du
  • Wei Fan
  • Philip S. Yu

Being able to automatically discover synonymous entities in an open-world setting benefits various tasks such as entity disambiguation or knowledge graph canonicalization. Existing works either only utilize entity features, or rely on structured annotations from a single piece of context where the entity is mentioned. To leverage diverse contexts where entities are mentioned, in this paper, we generalize the distributional hypothesis to a multi-context setting and propose a synonym discovery framework that detects entity synonyms from free-text corpora with considerations on effectiveness and robustness. As one of the key components in synonym discovery, we introduce a neural network model, SynonymNet, to determine whether two given entities are synonymous with each other. Instead of using entity features, SynonymNet makes use of multiple pieces of contexts in which the entity is mentioned, and compares the context-level similarity via a bilateral matching schema. Experimental results demonstrate that the proposed model is able to detect synonym sets that are not observed during training on both generic and domain-specific datasets: Wiki+Freebase, PubMed+UMLS, and MedBook+MKG, with up to 4.16% improvement in terms of Area Under the Curve and 3.19% in terms of Mean Average Precision compared to the best baseline method.
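The abstract above compares context-level similarity via a bilateral matching schema. As a simplified, purely illustrative stand-in for that learned matching (SynonymNet scores contexts with a neural model, not raw cosine similarity), a symmetric best-match score over two sets of context vectors looks like this:

```python
import numpy as np

def cosine(a, b):
    return float(a @ b / (np.linalg.norm(a) * np.linalg.norm(b)))

def bilateral_match(ctxs_a, ctxs_b):
    """Symmetric matching score: each context is matched against its best
    counterpart on the other side, and the two directions are averaged."""
    sims = np.array([[cosine(a, b) for b in ctxs_b] for a in ctxs_a])
    return 0.5 * (sims.max(axis=1).mean() + sims.max(axis=0).mean())

rng = np.random.default_rng(0)
ctxs = [rng.standard_normal(5) for _ in range(3)]
print(bilateral_match(ctxs, ctxs))  # identical context sets match perfectly: 1.0
```

Matching both directions keeps the score from being dominated by one entity that happens to have many generic contexts.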

AAAI Conference 2019 Conference Paper

DeepCF: A Unified Framework of Representation Learning and Matching Function Learning in Recommender System

  • Zhi-Hong Deng
  • Ling Huang
  • Chang-Dong Wang
  • Jian-Huang Lai
  • Philip S. Yu

In general, recommendation can be viewed as a matching problem, i.e., matching proper items to proper users. However, due to the huge semantic gap between users and items, it’s almost impossible to directly match users and items in their initial representation spaces. To solve this problem, many methods have been studied, which can be generally categorized into two types: representation learning-based CF methods and matching function learning-based CF methods. Representation learning-based CF methods try to map users and items into a common representation space; in this case, a higher similarity between a user and an item in that space implies they match better. Matching function learning-based CF methods try to directly learn the complex matching function that maps user-item pairs to matching scores. Although both methods are well developed, they suffer from two fundamental flaws: the limited expressiveness of the dot product and the weakness in capturing low-rank relations, respectively. To this end, we propose a general framework named DeepCF, short for Deep Collaborative Filtering, to combine the strengths of the two types of methods and overcome such flaws. Extensive experiments on four publicly available datasets demonstrate the effectiveness of the proposed DeepCF framework.
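The two CF families contrasted in the abstract above can be illustrated side by side: a shared-space similarity (the dot product) versus a learned matching function (an MLP over the concatenated pair). The weights below are random and purely illustrative, not a trained DeepCF model:

```python
import numpy as np

rng = np.random.default_rng(7)
d = 8
user, item = rng.standard_normal(d), rng.standard_normal(d)

# representation learning-based CF: match = similarity in a common space
dot_score = float(user @ item)

# matching function learning-based CF: match = MLP over the concatenated pair
W1, b1 = rng.standard_normal((16, 2 * d)), np.zeros(16)
w2 = rng.standard_normal(16)
hidden = np.maximum(W1 @ np.concatenate([user, item]) + b1, 0.0)  # ReLU layer
mlp_score = float(w2 @ hidden)
```

The dot product is cheap but limited in expressiveness, while the MLP can model arbitrary interactions but handles low-rank relations less well, which is the trade-off DeepCF aims to combine.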

IJCAI Conference 2019 Conference Paper

Fine-grained Event Categorization with Heterogeneous Graph Convolutional Networks

  • Hao Peng
  • Jianxin Li
  • Qiran Gong
  • Yangqiu Song
  • Yuanxin Ning
  • Kunfeng Lai
  • Philip S. Yu

Events happen in the real world and in real time; they can be planned and organized occasions involving multiple people and objects. Social media platforms publish a large volume of text messages containing public events on a wide range of topics. However, mining social events is challenging due to the heterogeneous event elements in texts and the explicit and implicit social network structures. In this paper, we design an event meta-schema to characterize the semantic relatedness of social events, build an event-based heterogeneous information network (HIN) integrating information from an external knowledge base, and propose a novel Pairwise Popularity Graph Convolutional Network (PP-GCN) based fine-grained social event categorization model. We propose a Knowledgeable meta-paths Instances based social Event Similarity (KIES) between events and build a weighted adjacency matrix as input to the PP-GCN model. Comprehensive experiments on real data collections are conducted to compare various social event detection and clustering tasks. Experimental results demonstrate that our proposed framework outperforms other alternative social event categorization techniques.

IJCAI Conference 2019 Conference Paper

Heterogeneous Graph Matching Networks for Unknown Malware Detection

  • Shen Wang
  • Zhengzhang Chen
  • Xiao Yu
  • Ding Li
  • Jingchao Ni
  • Lu-An Tang
  • Jiaping Gui
  • Zhichun Li

Information systems have widely been the target of malware attacks. Traditional signature-based malicious program detection algorithms can only detect known malware and are prone to evasion techniques such as binary obfuscation, while behavior-based approaches highly rely on the malware training samples and incur prohibitively high training cost. To address the limitations of existing techniques, we propose MatchGNet, a heterogeneous Graph Matching Network model to learn the graph representation and similarity metric simultaneously based on the invariant graph modeling of the program's execution behaviors. We conduct a systematic evaluation of our model and show that it is accurate in detecting malicious program behavior and can help detect malware attacks with fewer false positives. MatchGNet outperforms the state-of-the-art algorithms in malware detection by generating 50% fewer false positives while keeping zero false negatives.

TIST Journal 2019 Journal Article

Multi-View Fusion with Extreme Learning Machine for Clustering

  • Yongshan Zhang
  • Jia Wu
  • Chuan Zhou
  • Zhihua Cai
  • Jian Yang
  • Philip S. Yu

Unlabeled, multi-view data presents a considerable challenge in many real-world data analysis tasks. These data are worth exploring because they often contain complementary information that improves the quality of the analysis results. Clustering with multi-view data is a particularly challenging problem as revealing the complex data structures between many feature spaces demands discriminative features that are specific to the task and, when too few of these features are present, performance suffers. Extreme learning machines (ELMs) are an emerging form of learning model that have shown an outstanding representation ability and superior performance in a range of different learning tasks. Motivated by the promise of this advancement, we have developed a novel multi-view fusion clustering framework based on an ELM, called MVEC. MVEC learns the embeddings from each view of the data via the ELM network, then constructs a single unified embedding according to the correlations and dependencies between each embedding and automatically weighting the contribution of each. This process exposes the underlying clustering structures embedded within multi-view data with a high degree of accuracy. A simple yet efficient solution is also provided to solve the optimization problem within MVEC. Experiments and comparisons on eight different benchmarks from different domains confirm MVEC’s clustering accuracy.

IJCAI Conference 2019 Conference Paper

Outlier-Robust Multi-Aspect Streaming Tensor Completion and Factorization

  • Mehrnaz Najafi
  • Lifang He
  • Philip S. Yu

With the increasing popularity of streaming tensor data such as videos and audios, tensor factorization and completion have attracted much attention recently in this area. Existing work usually assumes that streaming tensors only grow in one mode. However, in many real-world scenarios, tensors may grow in multiple modes (or dimensions), i.e., multi-aspect streaming tensors. Standard streaming methods cannot directly handle this type of data elegantly. Moreover, due to inevitable system errors, data may be contaminated by outliers, which cause significant deviations from real data values and make such research particularly challenging. In this paper, we propose a novel method for Outlier-Robust Multi-Aspect Streaming Tensor Completion and Factorization (OR-MSTC), which is a technique capable of dealing with missing values and outliers in multi-aspect streaming tensor data. The key idea is to decompose the tensor structure into an underlying low-rank clean tensor and a structured-sparse error (outlier) tensor, along with a weighting tensor to mask missing data. We also develop an efficient algorithm to solve the non-convex and non-smooth optimization problem of OR-MSTC. Experimental results on various real-world datasets show the superiority of the proposed method over the baselines and its robustness against outliers.

AAAI Conference 2019 Conference Paper

Private Model Compression via Knowledge Distillation

  • Ji Wang
  • Weidong Bao
  • Lichao Sun
  • Xiaomin Zhu
  • Bokai Cao
  • Philip S. Yu

The soaring demand for intelligent mobile applications calls for deploying powerful deep neural networks (DNNs) on mobile devices. However, the outstanding performance of DNNs notoriously relies on increasingly complex models, which in turn is associated with an increase in computational expense far surpassing mobile devices’ capacity. What is worse, app service providers need to collect and utilize a large volume of users’ data, which contain sensitive information, to build the sophisticated DNN models. Directly deploying these models on public mobile devices presents prohibitive privacy risk. To benefit from the on-device deep learning without the capacity and privacy concerns, we design a private model compression framework RONA. Following the knowledge distillation paradigm, we jointly use hint learning, distillation learning, and self learning to train a compact and fast neural network. The knowledge distilled from the cumbersome model is adaptively bounded and carefully perturbed to enforce differential privacy. We further propose an elegant query sample selection method to reduce the number of queries and control the privacy loss. A series of empirical evaluations as well as the implementation on an Android mobile device show that RONA can not only compress cumbersome models efficiently but also provide a strong privacy guarantee. For example, on SVHN, when a meaningful (9.83, 10⁻⁶)-differential privacy is guaranteed, the compact model trained by RONA can obtain 20× compression ratio and 19× speed-up with merely 0.97% accuracy loss.
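The bound-then-perturb step on the distilled knowledge can be sketched as follows. This is an assumed simplification (L2 clipping plus per-coordinate Laplace noise); RONA's actual bounding, noise calibration, and privacy accounting are more involved.

```python
import math
import random

def privatize_logits(logits, bound, noise_scale, rng):
    """Clip the teacher's logit vector to L2 norm `bound`, then add
    per-coordinate Laplace noise before it is released to the student."""
    norm = math.sqrt(sum(v * v for v in logits)) or 1.0
    clipped = [v * min(1.0, bound / norm) for v in logits]

    def laplace():  # Laplace(0, noise_scale) sample via inverse CDF
        if noise_scale == 0.0:
            return 0.0
        u = rng.random() - 0.5
        return -noise_scale * math.copysign(1.0, u) * math.log(1.0 - 2.0 * abs(u))

    return [v + laplace() for v in clipped]

noisy = privatize_logits([3.0, 4.0], bound=1.0, noise_scale=0.1, rng=random.Random(0))
clean = privatize_logits([3.0, 4.0], bound=1.0, noise_scale=0.0, rng=random.Random(0))  # clipping only
```

Clipping bounds each query's sensitivity, which is what makes the added noise translate into a differential-privacy guarantee; reducing the number of student queries (as RONA does) then directly reduces the total privacy loss.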

IJCAI Conference 2018 Conference Paper

Aspect-Level Deep Collaborative Filtering via Heterogeneous Information Networks

  • Xiaotian Han
  • Chuan Shi
  • Senzhang Wang
  • Philip S. Yu
  • Li Song

Latent factor models have been widely used for recommendation. Most existing latent factor models mainly utilize the rating information between users and items, although some recently extended models add some auxiliary information to learn a unified latent factor between users and items. The unified latent factor only represents the latent features of users and items from the aspect of purchase history. However, the latent features of users and items may stem from different aspects, e.g., the brand-aspect and category-aspect of items. In this paper, we propose a Neural network based Aspect-level Collaborative Filtering model (NeuACF) to exploit different aspect latent factors. Through modelling rich objects and relations in recommender system as a heterogeneous information network, NeuACF first extracts different aspect-level similarity matrices of users and items through different meta-paths and then feeds an elaborately designed deep neural network with these matrices to learn aspect-level latent factors. Finally, the aspect-level latent factors are effectively fused with an attention mechanism for the top-N recommendation. Extensive experiments on three real datasets show that NeuACF significantly outperforms both existing latent factor models and recent neural network models.
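The final attention-fusion step can be sketched as a softmax-weighted sum (the attention scores here are given constants; in NeuACF they are produced by a learned attention network):

```python
import math

def attention_fuse(aspect_vecs, scores):
    """Softmax the per-aspect attention scores, then return the weighted
    sum of the aspect-level latent factors."""
    exps = [math.exp(s) for s in scores]
    total = sum(exps)
    dim = len(aspect_vecs[0])
    return [sum(e / total * v[d] for e, v in zip(exps, aspect_vecs)) for d in range(dim)]

# two aspect factors (say, brand-aspect and category-aspect) with equal attention
fused = attention_fuse([[1.0, 0.0], [0.0, 1.0]], [0.0, 0.0])
```

With equal scores the fusion reduces to a plain average; learned scores let the model weight whichever aspect (brand, category, purchase history) is most informative per user-item pair.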

AAAI Conference 2018 Conference Paper

Dual Attention Network for Product Compatibility and Function Satisfiability Analysis

  • Hu Xu
  • Sihong Xie
  • Lei Shu
  • Philip S. Yu

Product compatibility and functionality are of utmost importance to customers when they purchase products, and to sellers and manufacturers when they sell products. Due to the huge number of products available online, it is infeasible to enumerate and test the compatibility and functionality of every product. In this paper, we address two closely related problems: product compatibility analysis and function satisfiability analysis, where the second problem is a generalization of the first (e.g., whether a product works with another product can be considered a special function). We first identify a novel question-and-answering corpus that is up-to-date regarding product compatibility and functionality information. To allow automatic discovery of product compatibility and functionality, we then propose a deep learning model called Dual Attention Network (DAN). Given a QA pair for a to-be-purchased product, DAN learns to 1) discover complementary products (or functions), and 2) accurately predict the actual compatibility (or satisfiability) of the discovered products (or functions). The challenges addressed by the model include the briefness of QAs, linguistic patterns indicating compatibility, and the appropriate fusion of questions and answers. We conduct experiments to quantitatively and qualitatively show that the identified products and functions have both high coverage and accuracy, compared with a wide spectrum of baselines.

IJCAI Conference 2018 Conference Paper

Lifelong Domain Word Embedding via Meta-Learning

  • Hu Xu
  • Bing Liu
  • Lei Shu
  • Philip S. Yu

Learning high-quality domain word embeddings is important for achieving good performance in many NLP tasks. General-purpose embeddings trained on large-scale corpora are often sub-optimal for domain-specific applications. However, domain-specific tasks often do not have large in-domain corpora for training high-quality domain embeddings. In this paper, we propose a novel lifelong learning setting for domain embedding. That is, when performing the new domain embedding, the system has seen many past domains, and it tries to expand the new in-domain corpus by exploiting the corpora from the past domains via meta-learning. The proposed meta-learner characterizes the similarities of the contexts of the same word in many domain corpora, which helps retrieve relevant data from the past domains to expand the new domain corpus. Experimental results show that domain embeddings produced from such a process improve the performance of the downstream tasks.

ICML Conference 2018 Conference Paper

PredRNN++: Towards A Resolution of the Deep-in-Time Dilemma in Spatiotemporal Predictive Learning

  • Yunbo Wang
  • Zhifeng Gao
  • Mingsheng Long
  • Jianmin Wang 0001
  • Philip S. Yu

We present PredRNN++, a recurrent network for spatiotemporal predictive learning. In pursuit of a great modeling capability for short-term video dynamics, we make our network deeper in time by leveraging a new recurrent structure named Causal LSTM with cascaded dual memories. To alleviate the gradient propagation difficulties in deep predictive models, we propose a Gradient Highway Unit, which provides alternative quick routes for the gradient flows from outputs back to long-range previous inputs. The gradient highway units work seamlessly with the causal LSTMs, enabling our model to capture the short-term and the long-term video dependencies adaptively. Our model achieves state-of-the-art prediction results on both synthetic and real video datasets, showing its power in modeling entangled motions.
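The gradient-highway idea can be illustrated with a scalar toy unit. This is an assumed simplification: the paper's Gradient Highway Unit operates on full hidden-state tensors with learned convolutional weights, not scalars.

```python
import math

def sigmoid(x):
    return 1.0 / (1.0 + math.exp(-x))

def gradient_highway_step(z_prev, x, w_p, w_s):
    """One scalar highway step: a switch gate s interpolates between the
    transformed input p and the previous highway state z_prev, so when the
    gate is closed the state (and its gradient) passes straight through."""
    p = math.tanh(w_p * x)   # transformed input
    s = sigmoid(w_s * x)     # switch gate in (0, 1)
    return s * p + (1.0 - s) * z_prev

carried = gradient_highway_step(0.7, 1.0, w_p=1.0, w_s=-20.0)  # gate ~0: state carried through
```

When the gate saturates near zero, the output is essentially `z_prev`, so backpropagated gradients reach long-range past inputs without passing through the deep causal-LSTM stack at every step.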

ICML Conference 2017 Conference Paper

Kernelized Support Tensor Machines

  • Lifang He 0001
  • Chun-Ta Lu
  • Guixiang Ma
  • Shen Wang 0005
  • LinLin Shen
  • Philip S. Yu
  • Ann B. Ragin

In the context of supervised tensor learning, preserving the structural information and exploiting the discriminative nonlinear relationships of tensor data are crucial for improving the performance of learning tasks. Based on tensor factorization theory and kernel methods, we propose a novel Kernelized Support Tensor Machine (KSTM) which integrates kernelized tensor factorization with maximum-margin criterion. Specifically, the kernelized factorization technique is introduced to approximate the tensor data in kernel space such that the complex nonlinear relationships within tensor data can be explored. Further, dual structural preserving kernels are devised to learn the nonlinear boundary between tensor data. As a result of joint optimization, the kernels obtained in KSTM exhibit better generalization power to discriminative analysis. The experimental results on real-world neuroimaging datasets show the superiority of KSTM over the state-of-the-art techniques.

IJCAI Conference 2017 Conference Paper

SEVEN: Deep Semi-supervised Verification Networks

  • Vahid Noroozi
  • Lei Zheng
  • Sara Bahaadini
  • Sihong Xie
  • Philip S. Yu

Verification determines whether two samples belong to the same class or not, and has important applications such as face and fingerprint verification, where thousands or millions of categories are present but each category has scarce labeled examples, presenting two major challenges for existing deep learning models. We propose a deep semi-supervised model named SEmi-supervised VErification Network (SEVEN) to address these challenges. The model consists of two complementary components. The generative component addresses the lack of supervision within each category by learning general salient structures from a large amount of data across categories. The discriminative component exploits the learned general features to mitigate the lack of supervision within categories, and also directs the generative component to find more informative structures of the whole data manifold. The two components are tied together in SEVEN to allow an end-to-end training of the two components. Extensive experiments on four verification tasks demonstrate that SEVEN significantly outperforms other state-of-the-art deep semi-supervised techniques when labeled data are in short supply. Furthermore, SEVEN is competitive with fully supervised baselines trained with a larger amount of labeled data. It indicates the importance of the generative component in SEVEN.

IJCAI Conference 2016 Conference Paper

Causality Based Propagation History Ranking in Social Networks

  • Zheng Wang
  • Chaokun Wang
  • Jisheng Pei
  • Xiaojun Ye
  • Philip S. Yu

In social network sites (SNS), propagation histories which record the information diffusion process can be used to explain to users what happened in their networks. However, these histories easily grow in size and complexity, limiting their intuitive understanding by users. To reduce this information overload, in this paper, we present the problem of propagation history ranking. The goal is to rank participant edges/nodes by their contribution to the diffusion. Firstly, we discuss and adapt Difference of Causal Effects (DCE) as the ranking criterion. Then, to avoid the complex calculation of DCE, we propose a resp-cap ranking strategy by adopting two indicators. The first is responsibility which captures the necessary face of causal effects. We further give an approximate algorithm for this indicator. The second is capability which is defined to capture the sufficient face of causal effects. Finally, promising experimental results are presented to verify the feasibility of our method.

TIST Journal 2016 Journal Article

Coranking the Future Influence of Multiobjects in Bibliographic Network Through Mutual Reinforcement

  • Senzhang Wang
  • Sihong Xie
  • Xiaoming Zhang
  • Zhoujun Li
  • Philip S. Yu
  • Yueying He

Scientific literature ranking is essential to help researchers find valuable publications from a large literature collection. Recently, with the prevalence of webpage ranking algorithms such as PageRank and HITS, graph-based algorithms have been widely used to iteratively rank papers and researchers through the networks formed by citation and coauthor relationships. However, existing graph-based ranking algorithms mostly focus on ranking the current importance of literature. For researchers who enter an emerging research area, they might be more interested in new papers and young researchers that are likely to become influential in the future, since such papers and researchers are more helpful in letting them quickly catch up on the most recent advances and find valuable research directions. Meanwhile, although some works have been proposed to rank the prestige of a certain type of objects with the help of multiple networks formed of multiobjects, there still lacks a unified framework to rank multiple types of objects in the bibliographic network simultaneously. In this article, we propose a unified ranking framework MRCoRank to corank the future popularity of four types of objects: papers, authors, terms, and venues through mutual reinforcement. Specifically, because the citation data of new publications are sparse and not efficient to characterize their innovativeness, we make the first attempt to extract the text features to help characterize innovative papers and authors. With the observation that the current trend is more indicative of the future trend of citation and coauthor relationships, we then construct time-aware weighted graphs to quantify the importance of links established at different times on both citation and coauthor graphs. By leveraging both the constructed text features and time-aware graphs, we finally fuse the rich information in a mutual reinforcement ranking framework to rank the future importance of multiobjects simultaneously. We evaluate the proposed model through extensive experiments on the ArnetMiner dataset containing more than 1,500,000 papers. Experimental results verify the effectiveness of MRCoRank in coranking the future influence of multiobjects in a bibliographic network.

IJCAI Conference 2016 Conference Paper

Item Recommendation for Emerging Online Businesses

  • Chun-Ta Lu
  • Sihong Xie
  • Weixiang Shao
  • Lifang He
  • Philip S. Yu

Nowadays, a large number of new online businesses emerge rapidly. For these emerging businesses, existing recommendation models usually suffer from the data-sparsity. In this paper, we introduce a novel similarity measure, AmpSim (Augmented Meta Path-based Similarity) that takes both the linkage structures and the augmented link attributes into account. By traversing between heterogeneous networks through overlapping entities, AmpSim can easily gather side information from other networks and capture the rich similarity semantics between entities. We further incorporate the similarity information captured by AmpSim in a collective matrix factorization model such that the transferred knowledge can be iteratively propagated across networks to fit the emerging business. Extensive experiments conducted on real-world datasets demonstrate that our method significantly outperforms other state-of-the-art recommendation models in addressing item recommendation for emerging businesses.

IJCAI Conference 2016 Conference Paper

Understanding Information Diffusion under Interactions

  • Yuan Su
  • Xi Zhang
  • Philip S. Yu
  • Wen Hua
  • Xiaofang Zhou
  • Binxing Fang

Information diffusion in online social networks has attracted substantial research effort. Although recent models begin to incorporate interactions among contagions, they still do not consider the comprehensive interactions involving users and contagions as a whole. Moreover, the interactions obtained in previous work are modeled as latent factors and thus are difficult to understand and interpret. In this paper, we investigate contagion adoption behavior by incorporating various types of interactions into a coherent model, and propose a novel interaction-aware diffusion framework called IAD. IAD exploits the social network structures to distinguish user roles, and uses both structures and texts to categorize contagions. Experiments with a large-scale Weibo dataset demonstrate that IAD outperforms the state-of-the-art baselines in terms of F1-score and accuracy, as well as the runtime for learning. In addition, the interactions obtained through learning reveal interesting findings, e.g., food-related contagions have the strongest capability to suppress other contagions' propagation, while advertisement-related contagions have the weakest capability.

AAAI Conference 2016 Conference Paper

Unsupervised Feature Selection on Networks: A Generative View

  • Xiaokai Wei
  • Bokai Cao
  • Philip S. Yu

In the past decade, social and information networks have become prevalent, and research on the network data has attracted much attention. Besides the link structure, network data are often equipped with the content information (i.e., node attributes) that is usually noisy and characterized by high dimensionality. As the curse of dimensionality could hamper the performance of many machine learning tasks on networks (e.g., community detection and link prediction), feature selection can be a useful technique for alleviating this issue. In this paper, we investigate the problem of unsupervised feature selection on networks. Most existing feature selection methods fail to incorporate the linkage information, and the state-of-the-art approaches usually rely on pseudo labels generated from clustering. Such cluster labels may be far from accurate and can mislead the feature selection process. To address these issues, we propose a generative point of view for unsupervised feature selection on networks that can seamlessly exploit the linkage and content information in a more effective manner. We assume that the link structures and node content are generated from a succinct set of high-quality features, and we find these features through maximizing the likelihood of the generation process. Experimental results on three real-world datasets show that our approach can select more discriminative features than state-of-the-art methods.

AAAI Conference 2015 Conference Paper

Burst Time Prediction in Cascades

  • Senzhang Wang
  • Zhao Yan
  • Xia Hu
  • Philip S. Yu
  • Zhoujun Li

Studying the bursty nature of cascades in social media is practically important in many applications such as product sales prediction, disaster relief, and stock market prediction. Although the cascade volume prediction has been extensively studied, how to predict when a burst will come remains an open problem. It is challenging to predict the time of the burst due to the “quick rise and fall” pattern and the diverse time spans of the cascades. To this end, this paper proposes a classification based approach for burst time prediction by utilizing and modeling rich knowledge in information diffusion. Particularly, we first propose a time window based approach to predict in which time window the burst will appear. This paves the way to transform the time prediction task to a classification problem. To address the challenge that the raw time series of cascade popularity alone is not sufficient for predicting cascades with diverse magnitudes and time spans, we explore rich information-diffusion-related knowledge and model it in a scale-independent manner. Extensive experiments on a Sina Weibo reposting dataset demonstrate the superior performance of the proposed approach in accurately predicting the burst time of posts.
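The time-window reduction can be sketched in a few lines (the horizon and window count are illustrative choices, not the paper's settings):

```python
def burst_window_label(burst_time, horizon, n_windows):
    """Turn 'when will the burst come?' into a class label: split the
    prediction horizon into equal windows, with one extra class meaning
    'no burst inside the horizon'."""
    if burst_time >= horizon:
        return n_windows
    return int(burst_time // (horizon / n_windows))

# a 24-hour horizon split into four 6-hour windows
label = burst_window_label(17.0, horizon=24.0, n_windows=4)  # falls in window 2
```

Once every cascade carries such a label, any off-the-shelf classifier over the diffusion features can be trained, which is exactly the transformation the abstract describes.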

IJCAI Conference 2015 Conference Paper

Integrated Anchor and Social Link Predictions across Social Networks

  • Jiawei Zhang
  • Philip S. Yu

To enjoy more social network services, users nowadays are usually involved in multiple online social media sites at the same time. Across these social networks, users can be connected by both intra-network links (i.e., social links) and inter-network links (i.e., anchor links) simultaneously. In this paper, we want to predict the formation of social links among users in the target network as well as anchor links aligning the target network with other external social networks. The problem is formally defined as the “collective link identification” problem. To solve the collective link identification problem, a unified link prediction framework, CLF (Collective Link Fusion) is proposed in this paper, which consists of two phases: step (1) collective link prediction of anchor and social links, and step (2) propagation of predicted links across the partially aligned “probabilistic networks” with collective random walk. Extensive experiments conducted on two real-world partially aligned networks demonstrate that CLF can perform very well in predicting social and anchor links concurrently.

IS Journal 2015 Journal Article

Nonoccurring Behavior Analytics: A New Area

  • Longbing Cao
  • Philip S. Yu
  • Vipin Kumar

Nonoccurring behaviors (NOBs) refer to those behaviors that should happen but do not take place for some reasons. They are widely seen in online, business, government, health, medical, scientific, and social data and applications. Very limited research has been undertaken to explore such behaviors owing to their invisibility and the significant challenges in analyzing NOBs. This article outlines the concept, intrinsic characteristics, significant challenges, main issues, research directions, and state-of-the-art work related to NOBs, followed by the prospects for and applications of NOB analytics.

ICML Conference 2014 Conference Paper

Learning the Consistent Behavior of Common Users for Target Node Prediction across Social Networks

  • Shan-Hung Wu
  • Hao-Heng Chien
  • Kuan-Hua Lin
  • Philip S. Yu

We study the target node prediction problem: given two social networks, identify those nodes/users from one network (called the source network) who are likely to join another (called the target network, with nodes called target nodes). Although this problem can be solved using existing techniques in the field of cross domain classification, we observe that in many real-world situations the cross-domain classifiers perform sub-optimally due to the heterogeneity between source and target networks that prevents the knowledge from being transferred. In this paper, we propose learning the consistent behavior of common users to help the knowledge transfer. We first present the Consistent Incidence Co-Factorization (CICF) for identifying the consistent users, i.e., common users that behave consistently across networks. Then we introduce the Domain-UnBiased (DUB) classifiers that transfer knowledge only through those consistent users. Extensive experiments are conducted and the results show that our proposal copes with heterogeneity and improves prediction accuracy.

TIST Journal 2014 Journal Article

Multi-Label Classification Based on Multi-Objective Optimization

  • Chuan Shi
  • Xiangnan Kong
  • Di Fu
  • Philip S. Yu
  • Bin Wu

Multi-label classification refers to the task of predicting potentially multiple labels for a given instance. Conventional multi-label classification approaches focus on the single-objective setting, where the learning algorithm optimizes over a single performance criterion (e.g., Ranking Loss) or a heuristic function. The basic assumption is that the optimization over one single objective can improve the overall performance of multi-label classification and meet the requirements of various applications. However, in many real applications, an optimal multi-label classifier may need to consider the trade-offs among multiple inconsistent objectives, such as minimizing Hamming Loss while maximizing Micro F1. In this article, we study the problem of multi-objective multi-label classification and propose a novel solution (called Moml) to optimize over multiple objectives simultaneously. Note that optimization objectives may be inconsistent, even conflicting, thus one cannot identify a single solution that is optimal on all objectives. Our Moml algorithm finds a set of non-dominated solutions which are optimal according to different trade-offs among multiple objectives. So users can flexibly construct various predictive models from the solution set, which provides more meaningful classification results in different application scenarios. Empirical studies on real-world tasks demonstrate that Moml can effectively boost the overall performance of multi-label classification by optimizing over multiple objectives simultaneously.
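Non-dominance is the key concept here, and a brute-force Pareto filter makes it concrete (minimisation on every objective; the paper's algorithm searches for this set rather than enumerating candidates as below):

```python
def non_dominated(solutions):
    """Brute-force Pareto filter: a solution is kept unless some other
    solution is at least as good on all objectives and strictly better
    on at least one."""
    front = []
    for s in solutions:
        dominated = any(
            all(o <= v for o, v in zip(other, s)) and any(o < v for o, v in zip(other, s))
            for other in solutions
            if other is not s
        )
        if not dominated:
            front.append(s)
    return front

# candidate classifiers scored by (Hamming Loss, 1 - Micro F1), both minimised
candidates = [(0.10, 0.30), (0.20, 0.20), (0.15, 0.40), (0.30, 0.10)]
front = non_dominated(candidates)  # (0.15, 0.40) is dominated by (0.10, 0.30)
```

The surviving front is exactly the "solution set" the abstract mentions: each member realises a different trade-off, and the user picks the one matching the application.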

JAAMAS Journal 2012 Journal Article

A brief introduction to agent mining

  • Longbing Cao
  • Gerhard Weiss
  • Philip S. Yu

Agent mining is an emerging interdisciplinary area that integrates multiagent systems, data mining and knowledge discovery, machine learning and other relevant areas. It brings new opportunities to tackling issues in relevant fields more efficiently by engaging together the individual technologies. It will also bring about symbiosis and symbionts that combine advantages from the corresponding constituent systems. In this editorial, we briefly introduce the concept of agent mining, the main areas of research, and challenges and opportunities in agent mining. Finally, we give an overview of the papers in this special issue.

TIST Journal 2012 Journal Article

Identify Online Store Review Spammers via Social Review Graph

  • Guan Wang
  • Sihong Xie
  • Bing Liu
  • Philip S. Yu

Online shopping reviews provide valuable information for customers to compare the quality of products, store services, and many other aspects of future purchases. However, spammers are joining this community, trying to mislead consumers by writing fake or unfair reviews. Previous attempts have used reviewers' behaviors, such as text similarity and rating patterns, to detect spammers. These studies can identify certain types of spammers, for instance, those who post many similar reviews about one target. However, in reality, there are other kinds of spammers who can manipulate their behaviors to act just like normal reviewers, and thus cannot be detected by the available techniques. In this article, we propose a novel concept of a review graph that captures the relationships among all reviewers, reviews, and stores that the reviewers have reviewed, as a heterogeneous graph. We explore how interactions between nodes in this graph can reveal the cause of spam, and propose an iterative computation model to identify suspicious reviewers. The review graph has three kinds of nodes: reviewer, review, and store. We capture their relationships by introducing three fundamental concepts, the trustiness of reviewers, the honesty of reviews, and the reliability of stores, and identifying their interrelationships: a reviewer is more trustworthy if the person has written more honest reviews; a store is more reliable if it has more positive reviews from trustworthy reviewers; and a review is more honest if many other honest reviews support it. This is the first time such intricate relationships have been identified for spam detection and captured in a graph model. We further develop an effective computation method based on the proposed graph model. Unlike existing approaches, we do not use any review text information. Our model is thus complementary to existing approaches and able to find more difficult and subtle spamming activities, which are agreed upon by human judges after they evaluate our results.
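
The mutual reinforcement among trustiness, honesty, and reliability described in the abstract can be illustrated with a minimal iterative sketch. The data, score ranges, and exact update rules below are made up for illustration; they are not the paper's actual computation model:

```python
import math

# Toy data: each review = (reviewer, store, rating in {-1, +1}).
reviews = [
    ("alice", "s1", +1), ("bob", "s1", +1), ("carol", "s1", +1),
    ("alice", "s2", -1), ("bob", "s2", -1),
    ("spammer", "s2", +1),  # lone positive review against the consensus
]

trust = {r: 1.0 for r, _, _ in reviews}   # trustiness of reviewers
reliab = {}                               # reliability of stores
honesty = {}                              # honesty of reviews

for _ in range(20):
    # A store is more reliable if it has more positive reviews
    # from trustworthy reviewers (squashed into [-1, 1]).
    for s in {st for _, st, _ in reviews}:
        reliab[s] = math.tanh(sum(trust[rv] * rating
                                  for rv, st, rating in reviews if st == s))
    # A review is more honest if it agrees with its store's reliability.
    for i, (rv, st, rating) in enumerate(reviews):
        honesty[i] = rating * reliab[st]
    # A reviewer is more trustworthy if their reviews are more honest.
    for rv in trust:
        hs = [honesty[i] for i, (r2, _, _) in enumerate(reviews) if r2 == rv]
        trust[rv] = sum(hs) / len(hs)

# The contrarian "spammer" converges to strongly negative trustiness,
# while the consistent reviewers stay strongly positive.
```

Even in this toy version, the fixed point separates the reviewer whose ratings contradict the trust-weighted consensus from the reviewers who agree with it, using no review text at all.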

IS Journal 2010 Journal Article

Music Recommendation Using Content and Context Information Mining

  • Ja-Hwung Su
  • Hsin-Ho Yeh
  • Philip S. Yu
  • Vincent S. Tseng

Mobile devices such as smartphones are becoming popular, and real-time access to multimedia data in different environments is getting easier. With properly equipped communication services, users can easily obtain the widely distributed videos, music, and documents they want. Because of its usability and capacity requirements, music is more popular than other types of multimedia data. Documents and videos are difficult to view on mobile phones' small screens, and videos' large data size results in high overhead for retrieval. In contrast, advanced compression techniques for music reduce the required storage space significantly and make the circulation of music data easier. This means that users can capture their favorite music directly from the Web without going to music stores. Accordingly, helping users find music they like in a large archive has become an attractive but challenging issue over the past few years.

IJCAI Conference 2009 Conference Paper

  • Zhengzheng Xing
  • Jian Pei
  • Philip S. Yu

In this paper, we formulate the problem of early classification of time series data, which is important in some time-sensitive applications such as health informatics. We introduce a novel concept of MPL (Minimum Prediction Length) and develop ECTS (Early Classification on Time Series), an effective 1-nearest neighbor classification method. ECTS makes early predictions and at the same time retains accuracy comparable to that of a 1NN classifier using the full-length time series. Our empirical study using benchmark time series datasets shows that ECTS works well on the real datasets where 1NN classification is effective.
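
As a rough illustration of the early-classification idea (not the paper's actual MPL computation, which is learned from the training set), one can ask for the earliest prefix length at which a 1NN classifier's prediction already agrees with its prediction at every longer prefix. A minimal sketch, assuming equal-length series stored as NumPy rows:

```python
import numpy as np

def nn1_predict(train_X, train_y, x, L):
    """1NN label for x using only the first L observations."""
    d = np.linalg.norm(train_X[:, :L] - x[:L], axis=1)
    return train_y[int(np.argmin(d))]

def earliest_stable_length(train_X, train_y, x):
    """Smallest prefix length whose 1NN prediction never changes afterwards."""
    full = train_X.shape[1]
    final = nn1_predict(train_X, train_y, x, full)
    for L in range(1, full + 1):
        if all(nn1_predict(train_X, train_y, x, M) == final
               for M in range(L, full + 1)):
            return L, final
    return full, final

# Toy example: class 0 series start low, class 1 series start high.
train_X = np.array([[0., 0., 0., 0.], [1., 1., 1., 1.]])
train_y = np.array([0, 1])
L, label = earliest_stable_length(train_X, train_y, np.array([0.1, 0., 0., 0.]))
# Here the label (0) is already determined after the first observation.
```

The actual method differs in how it certifies stability: it learns, per training series, the minimum prediction length from the training data alone, so at test time it can commit early without peeking at the full-length series.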

AAAI Conference 2008 Conference Paper

Clustering on Complex Graphs

  • Bo Long
  • Philip S. Yu

Complex graphs, in which multi-type nodes are linked to each other, frequently arise in many important applications, such as Web mining, information retrieval, bioinformatics, and epidemiology. In this study, we propose a general framework for clustering on complex graphs. Under this framework, we derive a family of clustering algorithms, including both hard and soft versions, which are capable of learning cluster patterns from complex graphs with various structures and statistical properties. We also establish the connections between the proposed framework and traditional graph partitioning approaches. The experimental evaluation provides encouraging results that validate the proposed framework and algorithms.

IS Journal 2007 Journal Article

Domain-Driven, Actionable Knowledge Discovery

  • Longbing Cao
  • Chengqi Zhang
  • Qiang Yang
  • David Bell
  • Michail Vlachos
  • Bahar Taneri
  • Eamonn Keogh
  • Philip S. Yu

Data mining increasingly faces complex challenges in the real-life world of business problems and needs. The gap between business expectations and R&D results in this area involves key aspects of the field, such as methodologies, targeted problems, pattern interestingness, and infrastructure support. Both researchers and practitioners are realizing the importance of domain knowledge to close this gap and develop actionable knowledge for real user needs.

ICML Conference 2006 Conference Paper

Spectral clustering for multi-type relational data

  • Bo Long
  • Zhongfei Zhang
  • Xiaoyun Wu
  • Philip S. Yu

Clustering on multi-type relational data has attracted more and more attention in recent years due to its high impact on various important applications, such as Web mining, e-commerce, and bioinformatics. However, research on general multi-type relational data clustering is still limited and preliminary. The contribution of the paper is three-fold. First, we propose a general model, the collective factorization on related matrices, for multi-type relational data clustering. The model is applicable to relational data with various structures. Second, under this model, we derive a novel algorithm, spectral relational clustering, to cluster multi-type interrelated data objects simultaneously. The algorithm iteratively embeds each type of data object into a low-dimensional space and benefits from the interactions among the hidden structures of different types of data objects. Extensive experiments demonstrate the promise and effectiveness of the proposed algorithm. Third, we show that existing spectral clustering algorithms can be considered special cases of the proposed model and algorithm. This demonstrates the theoretical generality of the proposed model and algorithm.
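
For the two-type special case, simultaneously embedding both object types from one relation matrix reduces to bipartite spectral co-clustering. The sketch below is that special case on a toy relation matrix with two planted co-clusters (reading cluster labels off the sign of the second singular vectors); the full algorithm handles more than two types and iterates over the related matrices:

```python
import numpy as np

# Toy relation: rows = type-1 objects, cols = type-2 objects,
# two co-clusters plus weak cross links (weight 0.2) to connect the graph.
R = np.array([
    [1.0, 1.0, 0.2, 0.0],
    [1.0, 1.0, 0.0, 0.0],
    [0.0, 0.0, 1.0, 1.0],
    [0.2, 0.0, 1.0, 1.0],
])

# Degree-normalize the relation matrix, as in bipartite spectral methods.
d1 = R.sum(axis=1)
d2 = R.sum(axis=0)
Rn = np.diag(d1 ** -0.5) @ R @ np.diag(d2 ** -0.5)

# The second left/right singular vectors embed both object types at once;
# for two clusters, the sign pattern splits each type.
U, s, Vt = np.linalg.svd(Rn)
row_cluster = (U[:, 1] > 0).astype(int)
col_cluster = (Vt[1] > 0).astype(int)
```

In the general k-cluster setting one would keep several singular vectors as a low-dimensional embedding and run k-means on it; the sign test above is the simplest two-cluster shortcut.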

IJCAI Conference 2003 Conference Paper

Inductive Learning in Less Than One Sequential Data Scan

  • Wei Fan
  • Haixun Wang
  • Philip S. Yu
  • Shaw-Hwa Lo

Most recent research on scalable inductive learning over very large datasets, decision tree construction in particular, focuses on eliminating memory constraints and reducing the number of sequential data scans. However, state-of-the-art decision tree construction algorithms still require multiple scans over the dataset and use sophisticated control mechanisms and data structures. We first discuss a general inductive learning framework that scans the dataset exactly once. Then, we propose an extension based on Hoeffding's inequality that scans the dataset less than once. Our frameworks are applicable to a wide range of inductive learners.
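
The "less than one scan" idea rests on Hoeffding's inequality: once enough records have been read, the remaining data cannot move an estimated statistic by more than ε except with small probability, so the scan can stop early. A self-contained sketch for a single bounded statistic (illustrative only; the paper applies the bound inside its learning framework, not to one mean in isolation):

```python
import math
import random

def hoeffding_n(eps, delta):
    """Samples needed so the empirical mean of a [0, 1]-bounded variable
    is within eps of the true mean with probability >= 1 - delta:
    n >= ln(2/delta) / (2 * eps^2)."""
    return math.ceil(math.log(2 / delta) / (2 * eps ** 2))

random.seed(0)
# A million-record "dataset" streamed once; true mean is 0.7.
stream = (random.random() < 0.7 for _ in range(10 ** 6))

n_needed = hoeffding_n(eps=0.01, delta=0.05)   # far fewer than 10**6
total = sum(next(stream) for _ in range(n_needed))
estimate = total / n_needed                     # close to the true mean 0.7
# The rest of the stream is never scanned.
```

With eps=0.01 and delta=0.05 the bound asks for only 18,445 of the million records, which is the sense in which the dataset is scanned "less than once".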