Arrow Research search

Author name cluster

Yangqiu Song

Possible papers associated with this exact author name in Arrow. This page groups case-insensitive exact name matches and is not a full identity disambiguation profile.

55 papers
2 author rows

Possible papers (55)

TMLR Journal 2025 Journal Article

Learning Federated Neural Graph Databases for Answering Complex Queries from Distributed Knowledge Graphs

  • Qi Hu
  • Weifeng Jiang
  • Haoran Li
  • Zihao Wang
  • Jiaxin Bai
  • Qianren Mao
  • Yangqiu Song
  • Lixin Fan

The increasing demand for deep learning-based foundation models has highlighted the importance of efficient data retrieval mechanisms. Neural graph databases (NGDBs) offer a compelling solution, leveraging neural spaces to store and query graph-structured data, thereby enabling LLMs to access precise and contextually relevant information. However, current NGDBs are constrained to single-graph operation, limiting their capacity to reason across multiple, distributed graphs. Furthermore, the lack of support for multi-source graph data in existing NGDBs hinders their ability to capture the complexity and diversity of real-world data. In many applications, data is distributed across multiple sources, and the ability to reason across these sources is crucial for making informed decisions. This limitation is particularly problematic when dealing with sensitive graph data, as directly sharing and aggregating such data poses significant privacy risks. As a result, many applications that rely on NGDBs are forced to choose between compromising data privacy and sacrificing the ability to reason across multiple graphs. To address these limitations, we propose to learn a Federated Neural Graph DataBase (FedNGDB), a pioneering systematic framework that empowers privacy-preserving reasoning over multi-source graph data. FedNGDB leverages federated learning to collaboratively learn graph representations across multiple sources, enriching relationships between entities and improving the overall quality of graph data. Unlike existing methods, FedNGDB can handle complex graph structures and relationships, making it suitable for various downstream tasks. We evaluate FedNGDB on three real-world datasets, demonstrating its effectiveness in retrieving relevant information from multi-source graph data while keeping sensitive information secure on local devices. Our results show that FedNGDB can efficiently retrieve answers to cross-graph queries, making it a promising approach for LLMs and other applications that rely on efficient data retrieval mechanisms.
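
The abstract's core training mechanism is federated averaging over locally trained graph-query encoders. Below is a minimal sketch of that step, assuming a PyTorch setup; QueryEncoder and fed_avg are illustrative names, not the paper's actual API.

```python
# Hedged sketch of the federated-averaging step a FedNGDB-style system
# could use; class and function names are illustrative assumptions.
import torch
import torch.nn as nn

class QueryEncoder(nn.Module):
    """Toy stand-in for a per-source neural graph query encoder."""
    def __init__(self, num_entities: int, dim: int = 64):
        super().__init__()
        self.entity_emb = nn.Embedding(num_entities, dim)
        self.proj = nn.Linear(dim, dim)

    def forward(self, entity_ids):
        return self.proj(self.entity_emb(entity_ids))

def fed_avg(global_model, client_models, client_sizes):
    """Size-weighted parameter averaging: only model weights travel,
    raw graph data never leaves a client."""
    total = float(sum(client_sizes))
    state = global_model.state_dict()
    for key in state:
        state[key] = sum(m.state_dict()[key] * (n / total)
                         for m, n in zip(client_models, client_sizes))
    global_model.load_state_dict(state)
    return global_model

clients = [QueryEncoder(100) for _ in range(3)]  # one per graph source
server = fed_avg(QueryEncoder(100), clients, client_sizes=[500, 300, 200])
```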

NeurIPS Conference 2025 Conference Paper

MMLongBench: Benchmarking Long-Context Vision-Language Models Effectively and Thoroughly

  • Zhaowei Wang
  • Wenhao Yu
  • Xiyu REN
  • Jipeng Zhang
  • Yu Zhao
  • Rohit Saxena
  • Liang Cheng
  • Ginny Wong

The rapid extension of context windows in large vision-language models has given rise to long-context vision-language models (LCVLMs), which are capable of handling hundreds of images with interleaved text tokens in a single forward pass. In this work, we introduce MMLongBench, the first benchmark covering a diverse set of long-context vision-language tasks, to evaluate LCVLMs effectively and thoroughly. MMLongBench is composed of 13,331 examples spanning five different categories of downstream tasks, such as Visual RAG and Many-Shot ICL. It also provides broad coverage of image types, including various natural and synthetic images. To assess the robustness of the models to different input lengths, all examples are delivered at five standardized input lengths (8K-128K tokens) via a cross-modal tokenization scheme that combines vision patches and text tokens. Through a thorough benchmarking of 46 closed-source and open-source LCVLMs, we provide a comprehensive analysis of the current models' vision-language long-context ability. Our results show that: i) performance on a single task is a weak proxy for overall long-context capability; ii) both closed-source and open-source models face challenges in long-context vision-language tasks, indicating substantial room for future improvement; iii) models with stronger reasoning ability tend to exhibit better long-context performance. By offering wide task coverage, various image types, and rigorous length control, MMLongBench provides the missing foundation for diagnosing and advancing the next generation of LCVLMs.

ICML Conference 2025 Conference Paper

Revolve: Optimizing AI Systems by Tracking Response Evolution in Textual Optimization

  • Peiyan Zhang
  • Haibo Jin
  • Leyang Hu
  • Xinnuo Li
  • Liying Kang
  • Man Luo
  • Yangqiu Song
  • Haohan Wang

Recent advancements in large language models (LLMs) have significantly enhanced the ability of LLM-based systems to perform complex tasks through natural language processing and tool interaction. However, optimizing these LLM-based systems for specific tasks remains challenging, often requiring manual interventions like prompt engineering and hyperparameter tuning. Existing automatic optimization methods, such as textual feedback-based techniques (e.g., TextGrad), tend to focus on immediate feedback, analogous to using immediate derivatives in traditional numerical gradient descent. However, relying solely on such feedback can be limited when the adjustments made in response to this feedback are either too small or fluctuate irregularly, potentially slowing down or even stalling the optimization process. In this paper, we introduce $\textbf{REVOLVE}$, an optimization method that tracks how $\textbf{R}$esponses $\textbf{EVOLVE}$ across iterations in LLM systems. By focusing on the evolution of responses over time, REVOLVE enables more stable and effective optimization by making thoughtful, progressive adjustments at each step. Experiments across three tasks demonstrate the adaptability and efficiency of our proposal. Beyond its practical contributions, REVOLVE highlights a promising direction, where the rich knowledge from established optimization principles can be leveraged to enhance LLM systems, which paves the way for further advancements in this hybrid domain. Code is available at: https://llm-revolve.netlify.app.
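
For intuition, here is a hedged sketch of what "tracking response evolution" can look like in a textual optimization loop; llm() is a placeholder for any chat-completion call, and the prompt wording is an assumption, not REVOLVE's actual implementation.

```python
# Hedged sketch of "response evolution" tracking in a textual optimizer.
# llm() is a placeholder for any chat-completion call; the prompt wording
# below is an assumption, not REVOLVE's actual implementation.
def llm(prompt: str) -> str:
    raise NotImplementedError("plug in your LLM client here")

def optimize_prompt(task_input: str, system_prompt: str, steps: int = 5) -> str:
    history = []  # responses across iterations, oldest first
    for _ in range(steps):
        response = llm(f"{system_prompt}\n\nInput: {task_input}")
        history.append(response)
        # First-order textual "gradients" critique only the latest response;
        # here the critic also sees how recent responses evolved over time.
        trajectory = "\n---\n".join(history[-3:])
        feedback = llm(
            "Here are the most recent responses, in order:\n"
            f"{trajectory}\n"
            "If they are stagnating or oscillating, explain how the system "
            "prompt should change to make steady progress."
        )
        system_prompt = llm(
            "Rewrite this system prompt using the feedback.\n"
            f"Prompt: {system_prompt}\nFeedback: {feedback}"
        )
    return system_prompt
```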

AAAI Conference 2025 Conference Paper

Simulate and Eliminate: Revoke Backdoors for Generative Large Language Models

  • Haoran Li
  • Yulin Chen
  • Zihao Zheng
  • Qi Hu
  • Chunkit Chan
  • Heshan Liu
  • Yangqiu Song

With rapid advances, generative large language models (LLMs) dominate various Natural Language Processing (NLP) tasks from understanding to reasoning. Yet, language models' inherent vulnerabilities may be exacerbated by increased accessibility and unrestricted model training on massive data. A malicious adversary may publish poisoned data online and conduct backdoor attacks on victim LLMs pre-trained on the poisoned data. Backdoored LLMs behave innocuously for normal queries but generate harmful responses when the backdoor trigger is activated. Despite the significant effort devoted to LLMs' safety, LLMs still struggle against backdoor attacks. As Anthropic recently revealed, existing safety training strategies, including supervised fine-tuning (SFT) and Reinforcement Learning from Human Feedback (RLHF), fail to revoke backdoors once the LLM is backdoored during the pre-training stage. In this paper, we present Simulate and Eliminate (SANDE) to erase undesired backdoored mappings in generative LLMs. We first propose Overwrite Supervised Fine-tuning (OSFT) for effective backdoor removal when the trigger is known. Then, to handle scenarios where trigger patterns are unknown, we integrate OSFT into our two-stage framework, SANDE. Unlike other works that assume access to cleanly trained models, our safety-enhanced LLMs are able to revoke backdoors without any reference. Consequently, our safety-enhanced LLMs no longer produce targeted responses when the backdoor triggers are activated. We conduct comprehensive experiments to show that our proposed SANDE is effective against backdoor attacks while bringing minimal harm to LLMs' powerful capabilities.

NeurIPS Conference 2025 Conference Paper

SwitchLingua: The First Large-Scale Multilingual and Multi-Ethnic Code-Switching Dataset

  • Peng Xie
  • Xingyuan Liu
  • Yequan Bie
  • Tsz Wai Chan
  • Yangqiu Song
  • Yang Wang
  • Hao Chen
  • Kani Chen

Code-switching (CS) is the alternating use of two or more languages within a conversation or utterance, often influenced by social context and speaker identity. This linguistic phenomenon poses challenges for Automatic Speech Recognition (ASR) systems, which are typically designed for a single language and struggle to handle multilingual inputs. The growing global demand for multilingual applications, including Code-Switching ASR (CSASR), Text-to-Speech (TTS), and Cross-Lingual Information Retrieval (CLIR), highlights the inadequacy of existing monolingual datasets. Although some code-switching datasets exist, most are limited to bilingual mixing within homogeneous ethnic groups, leaving a critical need for a large-scale, diverse benchmark akin to ImageNet in computer vision. To bridge this gap, we introduce \textbf{LinguaMaster}, a multi-agent collaboration framework specifically designed for efficient and scalable multilingual data synthesis. Leveraging this framework, we curate \textbf{SwitchLingua}, the first large-scale multilingual and multi-ethnic code-switching dataset, including: (1) 420K CS textual samples across 12 languages, and (2) over 80 hours of audio recordings from 174 speakers representing 18 countries/regions and 63 racial/ethnic backgrounds, based on the textual data. This dataset captures rich linguistic and cultural diversity, offering a foundational resource for advancing multilingual and multicultural research. Furthermore, to address the issue that existing ASR evaluation metrics lack sensitivity to code-switching scenarios, we propose the \textbf{Semantic-Aware Error Rate (SAER)}, a novel evaluation metric that incorporates semantic information, providing a more accurate and context-aware assessment of system performance. Benchmark experiments on SwitchLingua with state-of-the-art ASR models reveal substantial performance gaps, underscoring the dataset’s utility as a rigorous benchmark for CS capability evaluation. In addition, SwitchLingua aims to encourage further research to promote cultural inclusivity and linguistic diversity in speech technology, fostering equitable progress in the ASR field. LinguaMaster (Code): github.com/Shelton1013/SwitchLingua, SwitchLingua (Data): https://huggingface.co/datasets/Shelton1013/SwitchLingua text, https://huggingface.co/datasets/Shelton1013/SwitchLingua audio
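
As a rough illustration of a semantic-aware error rate in the spirit of SAER, one can blend a surface edit error with an embedding-based semantic distance; the 0.5 mixing weight and the encode() function below are assumptions, not the paper's actual formula.

```python
# Hedged sketch of a semantic-aware error rate: blend a surface edit error
# with an embedding-based semantic distance. The 0.5 weight and encode()
# are assumptions, not the paper's actual SAER formula.
import math
from difflib import SequenceMatcher

def surface_error(ref_words, hyp_words):
    """Rough surface error; a real system would use Levenshtein-based WER."""
    return 1.0 - SequenceMatcher(None, ref_words, hyp_words).ratio()

def semantic_distance(ref_text, hyp_text, encode):
    """Cosine distance between sentence embeddings from any encoder."""
    a, b = encode(ref_text), encode(hyp_text)
    dot = sum(x * y for x, y in zip(a, b))
    norm = math.sqrt(sum(x * x for x in a)) * math.sqrt(sum(x * x for x in b))
    return 1.0 - dot / (norm + 1e-12)

def saer_like(ref, hyp, encode, alpha=0.5):
    """alpha mixes surface and semantic error; 0.5 is an assumed default."""
    return (alpha * surface_error(ref.split(), hyp.split())
            + (1 - alpha) * semantic_distance(ref, hyp, encode))
```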

TMLR Journal 2025 Journal Article

The Curse of CoT: On the Limitations of Chain-of-Thought in In-Context Learning

  • Tianshi Zheng
  • Yixiang Chen
  • Chengxi Li
  • Chunyang Li
  • Qing Zong
  • Haochen Shi
  • Baixuan Xu
  • Yangqiu Song

Chain-of-Thought (CoT) prompting has been widely recognized for its ability to enhance reasoning capabilities in large language models (LLMs). However, our study reveals a surprising contradiction to this prevailing perspective within the fundamental domain of pattern-based in-context learning (ICL). Through extensive experiments involving 16 state-of-the-art LLMs and nine diverse pattern-based ICL datasets, we demonstrate that CoT and its reasoning variants consistently underperform direct answering across varying model scales and benchmark complexities. To systematically investigate this unexpected phenomenon, we designed extensive experiments to validate several hypothetical explanations. Our analysis uncovers a fundamental hybrid mechanism of explicit-implicit reasoning driving CoT’s performance in pattern-based ICL: while explicit reasoning falters due to LLMs’ struggles to infer underlying patterns from demonstrations, implicit reasoning—disrupted by the increased contextual distance of CoT rationales—often compensates, delivering correct answers despite flawed rationales. This hybrid mechanism explains CoT’s relative underperformance, as noise from weak explicit inference undermines the process, even as implicit mechanisms partially salvage outcomes. Notably, even long-CoT reasoning models, which excel in abstract and symbolic reasoning, fail to fully overcome these limitations despite higher computational costs. Our findings challenge existing assumptions regarding the universal efficacy of CoT, yielding novel insights into its limitations and guiding future research toward more nuanced and effective reasoning methodologies for LLMs.

ECAI Conference 2024 Conference Paper

Audience Persona Knowledge-Aligned Prompt Tuning Method for Online Debate

  • Chunkit Chan
  • Jiayang Cheng
  • Xin Liu 0039
  • Yauwai Yim
  • Yuxin Jiang
  • Zheye Deng
  • Haoran Li 0003
  • Yangqiu Song

Debate is the process of exchanging viewpoints or convincing others on a particular issue. Recent research has provided empirical evidence that the persuasiveness of an argument is determined not only by language usage but also by communicator characteristics. Researchers have paid much attention to aspects of language, such as linguistic features and discourse structures, but combining argument persuasiveness and impact with the social personae of the audience has not been explored due to its difficulty and complexity. We have observed the impressive simulation and personification capability of ChatGPT, indicating that a giant pre-trained language model may function as an individual to provide personae and exert unique influences based on diverse background knowledge. Therefore, we propose a persona knowledge-aligned framework for argument quality assessment tasks from the audience side. This is the first work that leverages the emergence of ChatGPT and injects such audience persona knowledge into smaller language models via prompt tuning. The performance of our pipeline demonstrates significant and consistent improvement compared to competitive architectures.

ICLR Conference 2024 Conference Paper

Rethinking Complex Queries on Knowledge Graphs with Neural Link Predictors

  • Hang Yin 0008
  • Zihao Wang 0001
  • Yangqiu Song

Reasoning on knowledge graphs is a challenging task because it utilizes observed information to predict missing information. In particular, answering complex queries based on first-order logic is one of the crucial tasks for verifying learning-to-reason abilities in terms of generalization and composition. Recently, the prevailing method has been query embedding, which learns embeddings for sets of entities, treats logic operations as set operations, and has shown great empirical success. Though there has been much research following the same formulation, many of its claims lack a formal and systematic inspection. In this paper, we rethink this formulation and justify many of the previous claims by characterizing the scope of queries investigated previously and precisely identifying the gap between its formulation and its goal, as well as providing complexity analysis for the currently investigated queries. Moreover, we develop a new dataset containing ten new types of queries with features that have never been considered and can therefore provide a thorough investigation of complex queries. Finally, we propose a new neural-symbolic method, Fuzzy Inference with Truth value (FIT), where we equip neural link predictors with fuzzy logic theory to support end-to-end learning on complex queries with provable reasoning capability. Empirical results show that our method significantly outperforms previous methods on the new dataset and also surpasses previous methods on the existing dataset at the same time.

TMLR Journal 2023 Journal Article

Beyond the Imitation Game: Quantifying and extrapolating the capabilities of language models

  • Aarohi Srivastava
  • Abhinav Rastogi
  • Abhishek Rao
  • Abu Awal Md Shoeb
  • Abubakar Abid
  • Adam Fisch
  • Adam R. Brown
  • Adam Santoro

Language models demonstrate both quantitative improvement and new qualitative capabilities with increasing scale. Despite their potentially transformative impact, these new capabilities are as yet poorly characterized. In order to inform future research, prepare for disruptive new model capabilities, and ameliorate socially harmful effects, it is vital that we understand the present and near-future capabilities and limitations of language models. To address this challenge, we introduce the Beyond the Imitation Game benchmark (BIG-bench). BIG-bench currently consists of 204 tasks, contributed by 450 authors across 132 institutions. Task topics are diverse, drawing problems from linguistics, childhood development, math, common-sense reasoning, biology, physics, social bias, software development, and beyond. BIG-bench focuses on tasks that are believed to be beyond the capabilities of current language models. We evaluate the behavior of OpenAI's GPT models, Google-internal dense transformer architectures, and Switch-style sparse transformers on BIG-bench, across model sizes spanning millions to hundreds of billions of parameters. In addition, a team of human expert raters performed all tasks in order to provide a strong baseline. Findings include: model performance and calibration both improve with scale, but are poor in absolute terms (and when compared with rater performance); performance is remarkably similar across model classes, though with benefits from sparsity; tasks that improve gradually and predictably commonly involve a large knowledge or memorization component, whereas tasks that exhibit "breakthrough" behavior at a critical scale often involve multiple steps or components, or brittle metrics; social bias typically increases with scale in settings with ambiguous context, but this can be improved with prompting.

NeurIPS Conference 2023 Conference Paper

Complex Query Answering on Eventuality Knowledge Graph with Implicit Logical Constraints

  • Jiaxin Bai
  • Xin Liu
  • Weiqi Wang
  • Chen Luo
  • Yangqiu Song

Querying knowledge graphs (KGs) using deep learning approaches can naturally leverage the reasoning and generalization ability to learn to infer better answers. Traditional neural complex query answering (CQA) approaches mostly work on entity-centric KGs. However, in the real world, we also need to make logical inferences about events, states, and activities (i.e., eventualities or situations) to push learning systems from System I to System II, as proposed by Yoshua Bengio. Querying logically from an EVentuality-centric KG (EVKG) can naturally provide references for such intuitive and logical inference. Thus, in this paper, we propose a new framework to leverage neural methods to answer complex logical queries based on an EVKG, which can satisfy not only traditional first-order logic constraints but also implicit logical constraints over eventualities concerning their occurrences and orders. For instance, if we know that Food is bad happens before PersonX adds soy sauce, then PersonX adds soy sauce is unlikely to be the cause of Food is bad due to the implicit temporal constraint. To facilitate consistent reasoning on EVKGs, we propose Complex Eventuality Query Answering (CEQA), a more rigorous definition of CQA that considers the implicit logical constraints governing the temporal order and occurrence of eventualities. In this manner, we propose to leverage theorem provers for constructing benchmark datasets to ensure the answers satisfy implicit logical constraints. We also propose a Memory-Enhanced Query Encoding (MEQE) approach to significantly improve the performance of state-of-the-art neural query encoders on the CEQA task.

NeurIPS Conference 2023 Conference Paper

Enhancing User Intent Capture in Session-Based Recommendation with Attribute Patterns

  • Xin Liu
  • Zheng Li
  • Yifan Gao
  • Jingfeng Yang
  • Tianyu Cao
  • Zhengyang Wang
  • Bing Yin
  • Yangqiu Song

The goal of session-based recommendation in E-commerce is to predict the next item that an anonymous user will purchase based on the browsing and purchase history. However, constructing global or local transition graphs to supplement session data can lead to noisy correlations and user intent vanishing. In this work, we propose the Frequent Attribute Pattern Augmented Transformer (FAPAT) that characterizes user intents by building attribute transition graphs and matching attribute patterns. Specifically, frequent and compact attribute patterns serve as memory to augment session representations, followed by a gate and a transformer block to fuse the whole session information. Through extensive experiments on two public benchmarks and 100 million industrial data in three domains, we demonstrate that FAPAT consistently outperforms state-of-the-art methods by an average of 4.5% across various evaluation metrics (Hits, NDCG, MRR). Besides evaluating next-item prediction, we estimate the models' capabilities to capture user intents via predicting items' attributes and period-item recommendations.

ICLR Conference 2023 Conference Paper

Logical Message Passing Networks with One-hop Inference on Atomic Formulas

  • Zihao Wang 0001
  • Yangqiu Song
  • Ginny Y. Wong
  • Simon See

Complex Query Answering (CQA) over Knowledge Graphs (KGs) has attracted a lot of attention to potentially support many applications. Given that KGs are usually incomplete, neural models are proposed to answer logical queries by parameterizing set operators with complex neural networks. However, such methods usually train neural set operators with a large number of entity and relation embeddings from scratch, where whether and how the embeddings or the neural set operators contribute to the performance remains unclear. In this paper, we propose a simple framework for complex query answering that decomposes the KG embeddings from neural set operators. We propose to represent complex queries as query graphs. On top of the query graph, we propose the Logical Message Passing Neural Network (LMPNN) that connects the local one-hop inferences on atomic formulas to the global logical reasoning for complex query answering. We leverage existing effective KG embeddings to conduct one-hop inferences on atomic formulas, the results of which are regarded as the messages passed in LMPNN. The reasoning process over the overall logical formulas is turned into the forward pass of LMPNN that incrementally aggregates local information to finally predict the answers' embeddings. The complex logical inference across different types of queries will then be learned from training examples based on the LMPNN architecture. Theoretically, our query-graph representation is more general than the prevailing operator-tree formulation, so our approach applies to a broader range of complex KG queries. Empirically, our approach yields the new state-of-the-art neural CQA model. Our research bridges the gap between complex KG query answering tasks and the long-standing achievements of knowledge graph representation learning. Our implementation can be found at https://github.com/HKUST-KnowComp/LMPNN.

NeurIPS Conference 2023 Conference Paper

Open Visual Knowledge Extraction via Relation-Oriented Multimodality Model Prompting

  • Hejie Cui
  • Xinyu Fang
  • Zihan Zhang
  • Ran Xu
  • Xuan Kan
  • Xin Liu
  • Yue Yu
  • Manling Li

Images contain rich relational knowledge that can help machines understand the world. Existing methods for visual knowledge extraction often rely on a pre-defined format (e.g., sub-verb-obj tuples) or vocabulary (e.g., relation types), restricting the expressiveness of the extracted knowledge. In this work, we take a first step toward a new paradigm of open visual knowledge extraction. To achieve this, we present OpenVik, which consists of an open relational region detector to detect regions potentially containing relational knowledge and a visual knowledge generator that generates format-free knowledge by prompting the large multimodality model with the detected region of interest. We also explore two data enhancement techniques for diversifying the generated format-free visual knowledge. Extensive knowledge quality evaluations highlight the correctness and uniqueness of the extracted open visual knowledge by OpenVik. Moreover, integrating our extracted knowledge across various visual reasoning applications shows consistent improvements, indicating the real-world applicability of OpenVik.

TMLR Journal 2023 Journal Article

Sequential Query Encoding for Complex Query Answering on Knowledge Graphs

  • Jiaxin Bai
  • Tianshi Zheng
  • Yangqiu Song

Complex Query Answering (CQA) is an important and fundamental task for knowledge graph (KG) reasoning. Query encoding (QE) is proposed as a fast and robust solution to CQA. In the encoding process, most existing QE methods first parse the logical query into an executable computational directed acyclic graph (DAG), then use neural networks to parameterize the operators, and finally, recursively execute these neuralized operators. However, the parameterization-and-execution paradigm may be potentially over-complicated, as it can be structurally simplified by a single neural network encoder. Meanwhile, sequence encoders, like LSTM and Transformer, have proved to be effective for encoding semantic graphs in related tasks. Motivated by this, we propose sequential query encoding (SQE) as an alternative to encode queries for CQA. Instead of parameterizing and executing the computational graph, SQE first uses a search-based algorithm to linearize the computational graph to a sequence of tokens and then uses a sequence encoder to compute its vector representation. This vector representation is used as a query embedding to retrieve answers from the embedding space according to similarity scores. Despite its simplicity, SQE demonstrates state-of-the-art neural query encoding performance on FB15k, FB15k-237, and NELL on an extended benchmark including twenty-nine types of in-distribution queries. Further experiments show that SQE also demonstrates comparable knowledge inference capability on out-of-distribution queries, whose query types are not observed during the training process.
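
The linearization step SQE describes can be illustrated with a tiny recursive tokenizer over nested query tuples; the bracketed prefix notation below is an illustrative choice, not necessarily the paper's exact token vocabulary.

```python
# Minimal sketch of linearizing a complex-query computation graph into a
# token sequence before feeding a sequence encoder. The prefix notation
# is an illustrative assumption, not SQE's exact tokenization.
def linearize(query) -> list[str]:
    """query: nested tuples, e.g. ('i', subquery1, subquery2), where
    'e' = anchor entity, 'p' = relation projection, 'i' = intersection."""
    op, *args = query
    tokens = ["(", op]
    for arg in args:
        tokens.extend(linearize(arg) if isinstance(arg, tuple) else [arg])
    tokens.append(")")
    return tokens

q = ('i', ('p', 'friend_of', ('e', 'alice')), ('p', 'works_at', ('e', 'hkust')))
print(linearize(q))
# ['(', 'i', '(', 'p', 'friend_of', '(', 'e', 'alice', ')', ')',
#  '(', 'p', 'works_at', '(', 'e', 'hkust', ')', ')', ')']
```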

ICML Conference 2022 Conference Paper

Boosting Graph Structure Learning with Dummy Nodes

  • Xin Liu 0039
  • Jiayang Cheng
  • Yangqiu Song
  • Xin Jiang 0002

With the development of graph kernels and graph representation learning, many superior methods have been proposed to handle scalability and oversmoothing issues in graph structure learning. However, most of those strategies are designed based on practical experience rather than theoretical analysis. In this paper, we use a particular dummy node connected to all existing vertices without affecting original vertex and edge properties. We further prove that such a dummy node can help build an efficient monomorphic edge-to-vertex transform and an epimorphic inverse to recover the original graph. It also indicates that adding dummy nodes can preserve local and global structures for better graph representation learning. We extend graph kernels and graph neural networks with dummy nodes and conduct experiments on graph classification and subgraph isomorphism matching tasks. Empirical results demonstrate that taking graphs with dummy nodes as input significantly boosts graph structure learning, and using their edge-to-vertex graphs can also achieve similar results. We also discuss the gain in expressive power from the dummy node in neural networks.
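
The dummy-node construction itself is simple to state in code: one extra vertex adjacent to every existing vertex, with the original vertices and edges untouched. A minimal sketch over adjacency sets; the "__dummy__" label is illustrative.

```python
# Sketch of the dummy-node idea: add one extra vertex connected to every
# existing vertex, leaving original vertices and edges untouched.
def add_dummy_node(adj: dict[str, set[str]]) -> dict[str, set[str]]:
    g = {v: set(nbrs) for v, nbrs in adj.items()}  # copy, don't mutate input
    g["__dummy__"] = set(adj)                      # dummy sees all vertices
    for v in adj:
        g[v].add("__dummy__")
    return g

g = add_dummy_node({"a": {"b"}, "b": {"a", "c"}, "c": {"b"}})
assert g["__dummy__"] == {"a", "b", "c"} and "__dummy__" in g["a"]
```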

AAAI Conference 2022 Conference Paper

Graph Convolutional Networks with Dual Message Passing for Subgraph Isomorphism Counting and Matching

  • Xin Liu
  • Yangqiu Song

Graph neural networks (GNNs) and message passing neural networks (MPNNs) have been proven to be expressive for subgraph structures in many applications. Some applications in heterogeneous graphs require explicit edge modeling, such as subgraph isomorphism counting and matching. However, existing message passing mechanisms are not well designed in theory. In this paper, we start from a particular edge-to-vertex transform and exploit the isomorphism property in the edge-to-vertex dual graphs. We prove that searching isomorphisms on the original graph is equivalent to searching on its dual graph. Based on this observation, we propose dual message passing neural networks (DMPNNs) to enhance the substructure representation learning in an asynchronous way for subgraph isomorphism counting and matching as well as unsupervised node classification. Extensive experiments demonstrate the robust performance of DMPNNs by combining both node and edge representation learning in synthetic and real heterogeneous graphs.

NeurIPS Conference 2021 Conference Paper

Benchmarking the Combinatorial Generalizability of Complex Query Answering on Knowledge Graphs

  • Zihao Wang
  • Hang Yin
  • Yangqiu Song

Complex Query Answering (CQA) is an important reasoning task on knowledge graphs. Current CQA learning models have been shown to be able to generalize from atomic operators to more complex formulas, which can be regarded as combinatorial generalizability. In this paper, we present EFO-1-QA, a new dataset to benchmark the combinatorial generalizability of CQA models by including 301 different query types, which is 20 times larger than existing datasets. Besides, our benchmark, for the first time, provides a way to evaluate and analyze the impact of different operators and normal forms by using (a) 7 choices of operator systems and (b) 9 forms of complex queries. Specifically, we provide a detailed study of the combinatorial generalizability of two commonly used operators, i.e., projection and intersection, and justify the impact of the forms of queries given the canonical choice of operators. Our code and data provide an effective pipeline to benchmark CQA models.

JAIR Journal 2021 Journal Article

RWNE: A Scalable Random-Walk based Network Embedding Framework with Personalized Higher-order Proximity Preserved

  • Jianxin Li
  • Cheng Ji
  • Hao Peng
  • Yu He
  • Yangqiu Song
  • Xinmiao Zhang
  • Fanzhang Peng

Higher-order proximity preserved network embedding has attracted increasing attention. In particular, due to the superior scalability, random-walk-based network embedding has also been well developed, which could efficiently explore higher-order neighborhoods via multi-hop random walks. However, despite the success of current random-walk-based methods, most of them are usually not expressive enough to preserve the personalized higher-order proximity and lack a straightforward objective to theoretically articulate what and how network proximity is preserved. In this paper, to address the above issues, we present a general scalable random-walk-based network embedding framework, in which random walk is explicitly incorporated into a sound objective designed theoretically to preserve arbitrary higher-order proximity. Further, we introduce the random walk with restart process into the framework to naturally and effectively achieve personalized-weighted preservation of proximities of different orders. We conduct extensive experiments on several real-world networks and demonstrate that our proposed method consistently and substantially outperforms the state-of-the-art network embedding methods.

IJCAI Conference 2020 Conference Paper

On the Importance of Word and Sentence Representation Learning in Implicit Discourse Relation Classification

  • Xin Liu
  • Jiefu Ou
  • Yangqiu Song
  • Xin Jiang

Implicit discourse relation classification is one of the most difficult parts of shallow discourse parsing, as relation prediction without explicit connectives requires language understanding at both the text span level and the sentence level. Previous studies mainly focus on the interactions between two arguments. We argue that a powerful contextualized representation module, a bilateral multi-perspective matching module, and a global information fusion module are all important to implicit discourse analysis. We propose a novel model to combine these modules together. Extensive experiments show that our proposed model outperforms BERT and other state-of-the-art systems on the PDTB dataset by around 8% and on the CoNLL 2016 datasets by around 16%. We also analyze the effectiveness of different modules in the implicit discourse relation classification task and demonstrate how different levels of representation learning can affect the results.

IJCAI Conference 2020 Conference Paper

TransOMCS: From Linguistic Graphs to Commonsense Knowledge

  • Hongming Zhang
  • Daniel Khashabi
  • Yangqiu Song
  • Dan Roth

Commonsense knowledge acquisition is a key problem for artificial intelligence. Conventional methods of acquiring commonsense knowledge generally require laborious and costly human annotations, which are not feasible on a large scale. In this paper, we explore a practical way of mining commonsense knowledge from linguistic graphs, with the goal of transferring cheap knowledge obtained with linguistic patterns into expensive commonsense knowledge. The result is a conversion of ASER [Zhang et al., 2020], a large-scale selectional preference knowledge resource, into TransOMCS, of the same representation as ConceptNet [Liu and Singh, 2004] but two orders of magnitude larger. Experimental results demonstrate the transferability of linguistic knowledge to commonsense knowledge and the effectiveness of the proposed approach in terms of quantity, novelty, and quality. TransOMCS is publicly available at: https://github.com/HKUST-KnowComp/TransOMCS.

IJCAI Conference 2019 Conference Paper

Fine-grained Event Categorization with Heterogeneous Graph Convolutional Networks

  • Hao Peng
  • Jianxin Li
  • Qiran Gong
  • Yangqiu Song
  • Yuanxin Ning
  • Kunfeng Lai
  • Philip S. Yu

Events happen in the real world and in real time; they can be planned and organized occasions involving multiple people and objects. Social media platforms publish a lot of text messages containing public events with comprehensive topics. However, mining social events is challenging due to the heterogeneous event elements in texts and the explicit and implicit social network structures. In this paper, we design an event meta-schema to characterize the semantic relatedness of social events, build an event-based heterogeneous information network (HIN) integrating information from an external knowledge base, and propose a novel Pairwise Popularity Graph Convolutional Network (PP-GCN) based fine-grained social event categorization model. We propose a Knowledgeable meta-paths Instances based social Event Similarity (KIES) between events and build a weighted adjacency matrix as input to the PP-GCN model. Comprehensive experiments on real data collections are conducted to compare various social event detection and clustering tasks. Experimental results demonstrate that our proposed framework outperforms other alternative social event categorization techniques.

IJCAI Conference 2018 Conference Paper

Biased Random Walk based Social Regularization for Word Embeddings

  • Ziqian Zeng
  • Xin Liu
  • Yangqiu Song

Nowadays, people publish a lot of natural language texts on social media. Socialized word embeddings (SWE) have been proposed to deal with two phenomena of language use: everyone has his/her own personal characteristics of language use, and socially connected users are likely to use language in similar ways. We observe that the spread of language use is transitive: one user can affect his/her friends, and the friends can in turn affect their friends. However, SWE modeled this transitivity only implicitly. The social regularization in SWE applies only to one-hop neighbors, so users outside the one-hop social circle are not affected directly. In this work, we adopt random walk methods to generate paths on the social graph to model the transitivity explicitly. Each user on a path is affected by his/her adjacent user(s) on the path. Moreover, according to the update mechanism of SWE, the fewer friends a user has, the fewer update opportunities he/she gets. Hence, we propose a biased random walk method to provide these users with more update opportunities. Experiments show that our random-walk-based social regularizations perform better on sentiment classification.
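
A hedged sketch of a degree-biased walk of this flavor, where the walk prefers low-degree neighbors so that users with few friends land on more paths; the 1/degree bias is an illustrative choice, not necessarily the paper's exact scheme.

```python
# Hedged sketch of a degree-biased random walk on a social graph; the
# 1/degree neighbor weighting is an illustrative assumption.
import random

def biased_walk(adj, start, length, rng=random):
    path = [start]
    for _ in range(length - 1):
        nbrs = list(adj[path[-1]])
        if not nbrs:
            break
        # Favor low-degree neighbors so sparsely connected users get
        # visited (and hence updated) more often.
        weights = [1.0 / max(len(adj[n]), 1) for n in nbrs]
        path.append(rng.choices(nbrs, weights=weights, k=1)[0])
    return path

adj = {"u1": ["u2", "u3"], "u2": ["u1"], "u3": ["u1", "u2"]}
print(biased_walk(adj, "u1", 5))  # e.g. ['u1', 'u2', 'u1', 'u3', 'u2']
```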

IJCAI Conference 2018 Conference Paper

Make Evasion Harder: An Intelligent Android Malware Detection System

  • Shifu Hou
  • Yanfang Ye
  • Yangqiu Song
  • Melih Abdulhayoglu

To combat evolving Android malware attacks, in this paper, instead of only using Application Programming Interface (API) calls, we further analyze the different relationships between them and create higher-level semantics which require more effort from attackers to evade detection. We represent the Android applications (apps), related APIs, and their rich relationships as a structured heterogeneous information network (HIN). Then we use a meta-path based approach to characterize the semantic relatedness of apps and APIs. We use each meta-path to formulate a similarity measure over Android apps, and aggregate different similarities using multi-kernel learning to make predictions. Promising experimental results based on real sample collections from the Comodo Cloud Security Center demonstrate that our developed system HinDroid outperforms other alternative Android malware detection techniques.

NeurIPS Conference 2018 Conference Paper

MetaGAN: An Adversarial Approach to Few-Shot Learning

  • Ruixiang Zhang
  • Tong Che
  • Zoubin Ghahramani
  • Yoshua Bengio
  • Yangqiu Song

In this paper, we propose a conceptually simple and general framework called MetaGAN for few-shot learning problems. Most state-of-the-art few-shot classification models can be integrated with MetaGAN in a principled and straightforward way. By introducing an adversarial generator conditioned on tasks, we augment vanilla few-shot classification models with the ability to discriminate between real and fake data. We argue that this GAN-based approach can help few-shot classifiers learn sharper decision boundaries, which could generalize better. We show that with our MetaGAN framework, we can extend supervised few-shot learning models to naturally cope with unsupervised data. Different from previous work in semi-supervised few-shot learning, our algorithms can deal with semi-supervision at both the sample level and the task level. We give theoretical justifications of the strength of MetaGAN, and validate the effectiveness of MetaGAN on challenging few-shot image classification benchmarks.

AAAI Conference 2018 Conference Paper

Ranking Users in Social Networks With Higher-Order Structures

  • Huan Zhao
  • Xiaogang Xu
  • Yangqiu Song
  • Dik Lun Lee
  • Zhao Chen
  • Han Gao

PageRank has been widely used to measure the authority or the influence of a user in social networks. However, conventional PageRank only makes use of edge-based relations, ignoring higher-order structures captured by motifs, subgraphs consisting of a small number of nodes in complex networks. In this paper, we propose a novel framework, motif-based PageRank (MPR), to incorporate higher-order structures into conventional PageRank computation. We conduct extensive experiments on three real-world networks, i.e., DBLP, Epinions, and Ciao, to show that MPR can significantly improve the effectiveness of PageRank for ranking users in social networks. In addition to numerical results, we also provide a detailed analysis of MPR to show how and why incorporating higher-order information works better than PageRank in ranking users in social networks.
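
A minimal sketch of the motif-based PageRank idea: reweight each edge by the triangles it participates in, then run standard power iteration. The triangle motif and the 0.85 damping factor are the usual defaults, not necessarily the paper's configuration.

```python
# Hedged sketch of motif-based PageRank: edge weight = 1 + number of
# triangles (common neighbors) the edge closes, then power iteration.
def motif_pagerank(adj, damping=0.85, iters=50):
    nodes = list(adj)
    w = {(u, v): 1 + len(adj[u] & adj[v]) for u in nodes for v in adj[u]}
    rank = {v: 1.0 / len(nodes) for v in nodes}
    for _ in range(iters):
        new = {v: (1 - damping) / len(nodes) for v in nodes}
        for u in nodes:
            out = sum(w[(u, v)] for v in adj[u]) or 1.0
            for v in adj[u]:
                new[v] += damping * rank[u] * w[(u, v)] / out
        rank = new
    return rank

adj = {"a": {"b", "c"}, "b": {"a", "c"}, "c": {"a", "b", "d"}, "d": {"c"}}
print(motif_pagerank(adj))  # c ranks highest: most triangle-weighted links
```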

IJCAI Conference 2018 Conference Paper

Scalable Multiplex Network Embedding

  • Hongming Zhang
  • Liwei Qiu
  • Lingling Yi
  • Yangqiu Song

Network embedding has been proven to be helpful for many real-world problems. In this paper, we present a scalable multiplex network embedding model to represent information of multi-type relations into a unified embedding space. To combine information of different types of relations while maintaining their distinctive properties, for each node, we propose one high-dimensional common embedding and a lower-dimensional additional embedding for each type of relation. Then multiple relations can be learned jointly based on a unified network embedding model. We conduct experiments on two tasks: link prediction and node classification using six different multiplex networks. On both tasks, our model achieved better or comparable performance compared to current state-of-the-art models with less memory use.

IJCAI Conference 2018 Conference Paper

Time-evolving Text Classification with Deep Neural Networks

  • Yu He
  • Jianxin Li
  • Yangqiu Song
  • Mutian He
  • Hao Peng

Traditional text classification algorithms are based on the assumption that data are independent and identically distributed. However, in most non-stationary scenarios, data may change smoothly due to long-term evolution and short-term fluctuation, which raises new challenges to traditional methods. In this paper, we present the first attempt to explore evolutionary neural network models for time-evolving text classification. We first introduce a simple way to extend arbitrary neural networks to evolutionary learning by using a temporal smoothness framework, and then propose a diachronic propagation framework to incorporate the historical impact into currently learned features through diachronic connections. Experiments on real-world news data demonstrate that our approaches greatly and consistently outperform traditional neural network models in both accuracy and stability.

AAAI Conference 2018 Conference Paper

Training and Evaluating Improved Dependency-Based Word Embeddings

  • Chen Li
  • Jianxin Li
  • Yangqiu Song
  • Ziwei Lin

Word embedding has been widely used in many natural language processing tasks. In this paper, we focus on learning word embeddings through selective higher-order relationships in sentences to improve the embeddings to be less sensitive to local context and more accurate in capturing semantic compositionality. We present a novel multi-order dependency-based strategy to composite and represent the context under several essential constraints. In order to realize selective learning from the word contexts, we automatically assign the strengths of different dependencies between co-occurred words in the stochastic gradient descent process. We evaluate and analyze our proposed approach using several direct and indirect tasks for word embeddings. Experimental results demonstrate that our embeddings are competitive to or better than state-of-the-art methods and significantly outperform other methods in terms of context stability. The output weights and representations of dependencies obtained in our embedding model conform to most of the linguistic characteristics and are valuable for many downstream tasks.

AAAI Conference 2017 Conference Paper

Incrementally Learning the Hierarchical Softmax Function for Neural Language Models

  • Hao Peng
  • Jianxin Li
  • Yangqiu Song
  • Yaopeng Liu

Neural network language models (NNLMs) have attracted a lot of attention recently. In this paper, we present a training method that can incrementally train the hierarchical softmax function for NNLMs. We split the cost function to model the old and update corpora separately, and factorize the objective function for the hierarchical softmax. Then we provide a new stochastic gradient based method to update all the word vectors and parameters, by comparing the old tree generated from the old corpus and the new tree generated from the combined (old and update) corpus. Theoretical analysis shows that the mean square error of the parameter vectors can be bounded by a function of the number of changed words related to the parameter node. Experimental results show that incremental training can save a lot of time. The smaller the update corpus is, the faster the update training process is, with an up to 30 times speedup achieved. We also use word similarity/relatedness tasks and the dependency parsing task as benchmarks to evaluate the correctness of the updated word vectors.

IJCAI Conference 2017 Conference Paper

Semi-supervised Learning over Heterogeneous Information Networks by Ensemble of Meta-graph Guided Random Walks

  • He Jiang
  • Yangqiu Song
  • Chenguang Wang
  • Ming Zhang
  • Yizhou Sun

Heterogeneous information networks (HINs) are a general representation of many real-world applications. The difference between HINs and traditional homogeneous graphs is that the nodes and edges in an HIN are typed. In many applications, we need to consider these types to make the approach more semantically meaningful. For applications where annotation is expensive, one natural way is to consider semi-supervised learning over HINs. In this paper, we present a semi-supervised learning algorithm constrained by the types of HINs. We first decompose the original HIN into several semantically meaningful sub-graphs based on meta-graphs composed of entity and relation types. Then we perform random walks over the sub-graphs to propagate the labels from labeled data to unlabeled data. After we obtain all the labels propagated by different trials of random walks guided by meta-graphs, we use an ensemble algorithm to vote for the final labeling results. We use two publicly available datasets, the 20-newsgroups and RCV1 datasets, to test our algorithm. Experimental results show that our algorithm is better than traditional semi-supervised learning algorithms for HINs. One particular by-product of this work is that we show that the previous random walk approach guided by meta-paths can be non-stationary, which is the major reason we propose a meta-graph guided random walk for semi-supervised learning over HINs.

IJCAI Conference 2017 Conference Paper

Socialized Word Embeddings

  • Ziqian Zeng
  • Yichun Yin
  • Yangqiu Song
  • Ming Zhang

Word embeddings have attracted a lot of attention. On social media, each user’s language use can be significantly affected by the user’s friends. In this paper, we propose a socialized word embedding algorithm which can consider both user’s personal characteristics of language use and the user’s social relationship on social media. To incorporate personal characteristics, we propose to use a user vector to represent each user. Then for each user, the word embeddings are trained based on each user’s corpus by combining the global word vectors and local user vector. To incorporate social relationship, we add a regularization term to impose similarity between two friends. In this way, we can train the global word vectors and user vectors jointly. To demonstrate the effectiveness, we used the latest large-scale Yelp data to train our vectors, and designed several experiments to show how user vectors affect the results.
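
The two ingredients the abstract names, a per-user vector added to global word vectors and a friend-similarity regularizer, can be sketched in a few lines of NumPy; the dimensions, rates, and simplified update below are assumptions, not the paper's training procedure.

```python
# Hedged sketch of socialized word embeddings: context vector =
# global word vector + user vector, plus a regularizer pulling friends'
# user vectors together. Hyperparameters are illustrative assumptions.
import numpy as np

dim, lr, lam = 50, 0.05, 0.1
rng = np.random.default_rng(0)
word_vec = {w: rng.normal(0, 0.1, dim) for w in ["good", "food", "great"]}
user_vec = {u: np.zeros(dim) for u in ["alice", "bob"]}
friends = {"alice": ["bob"], "bob": ["alice"]}

def socialized_vector(user, word):
    """A user's context vector: global word vector + personal user vector."""
    return word_vec[word] + user_vec[user]

def social_regularize(user):
    """Gradient step pulling a user's vector toward his/her friends'."""
    for f in friends[user]:
        user_vec[user] -= lr * lam * (user_vec[user] - user_vec[f])

social_regularize("alice")  # alice's vector moves toward bob's
```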

IJCAI Conference 2016 Conference Paper

Cross-Lingual Dataless Classification for Many Languages

  • Yangqiu Song
  • Shyam Upadhyay
  • Haoruo Peng
  • Dan Roth

Dataless text classification [Chang et al., 2008] is a classification paradigm which maps documents into a given label space without requiring any annotated training data. This paper explores a cross-lingual variant of this paradigm, where documents in multiple languages are classified into an English label space. We use CLESA (cross-lingual explicit semantic analysis) to embed both foreign language documents and an English label space into a shared semantic space, and select the best label(s) for a document using the similarity between the corresponding semantic representations. We illustrate our approach by experimenting with classifying documents in 88 different languages into the same English label space. In particular, we show that CLESA is better than using a monolingual ESA on the target foreign language and translating the English labels into that language. Moreover, the evaluation on two benchmarks, TED and RCV2, showed that cross-lingual dataless classification outperforms supervised learning methods when a large collection of annotated documents is not available.
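
Dataless classification reduces to nearest-label search in a shared semantic space. A minimal sketch, with encode() standing in for CLESA or any cross-lingual encoder; all names here are illustrative.

```python
# Hedged sketch of dataless classification: embed the document and each
# label description into one semantic space, pick the closest label.
import math

def cosine(a, b):
    dot = sum(x * y for x, y in zip(a, b))
    return dot / (math.sqrt(sum(x * x for x in a))
                  * math.sqrt(sum(y * y for y in b)) + 1e-12)

def dataless_classify(doc: str, label_descriptions: dict, encode):
    d = encode(doc)  # the document may be in any language
    return max(label_descriptions,
               key=lambda lbl: cosine(d, encode(label_descriptions[lbl])))
```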

IJCAI Conference 2016 Conference Paper

Improving Topic Model Stability for Effective Document Exploration

  • Yi Yang
  • Shimei Pan
  • Yangqiu Song
  • Jie Lu
  • Mercan Topkara

Topic modeling has become a ubiquitous topic analysis tool for text exploration. Most of the existing works on topic modeling focus on fitting topic models to input data. However, they ignore an important usability issue that is closely related to the end-user experience: stability. In this study, we investigate the stability problem in topic modeling. We first report on experiments conducted to quantify the severity of the problem. We then propose a new learning framework to mitigate the problem by explicitly incorporating topic stability constraints in model training. We also perform a user study to demonstrate the advantages of the proposed method.

IJCAI Conference 2016 Conference Paper

Incorporating External Knowledge into Crowd Intelligence for More Specific Knowledge Acquisition

  • Tao Han
  • Hailong Sun
  • Yangqiu Song
  • Yili Fang
  • Xudong Liu

Crowdsourcing has been a helpful mechanism to leverage human intelligence to acquire useful knowledge for well-defined tasks. However, aggregating the crowd knowledge with currently developed voting algorithms often results in common knowledge that may not be what is expected. In this paper, we consider the problem of collecting knowledge that is as specific as possible via crowdsourcing. With the help of an external knowledge base such as WordNet, we incorporate the semantic relations between alternative answers into a probabilistic model to determine which answer is more specific. We formulate the probabilistic model considering both the worker's ability and the task's difficulty, and solve it by the expectation-maximization (EM) algorithm. Experimental results show that our approach achieved a 35.88% improvement over majority voting when more specific answers are expected.
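
The EM loop the abstract describes can be sketched as alternating between ability-weighted voting and re-estimating worker ability from agreement; this toy version omits task difficulty and the WordNet specificity prior.

```python
# Hedged sketch of ability-weighted EM aggregation for crowd answers.
# This simplification omits task difficulty and the specificity prior.
from collections import defaultdict

def em_aggregate(votes, iters=10):
    """votes: list of (worker, task, answer) triples."""
    ability = defaultdict(lambda: 1.0)
    answers = {}
    for _ in range(iters):
        # E-step: weighted majority vote per task.
        scores = defaultdict(lambda: defaultdict(float))
        for w, t, a in votes:
            scores[t][a] += ability[w]
        answers = {t: max(s, key=s.get) for t, s in scores.items()}
        # M-step: ability = smoothed fraction of votes matching consensus.
        hits, total = defaultdict(float), defaultdict(int)
        for w, t, a in votes:
            total[w] += 1
            hits[w] += answers[t] == a
        ability = {w: (hits[w] + 1) / (total[w] + 2) for w in total}
    return answers, ability

votes = [("w1", "t1", "cat"), ("w2", "t1", "cat"), ("w3", "t1", "dog")]
print(em_aggregate(votes))  # consensus 'cat'; w3's ability drops
```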

AAAI Conference 2016 Conference Paper

Instilling Social to Physical: Co-Regularized Heterogeneous Transfer Learning

  • Ying Wei
  • Yin Zhu
  • Cane Leung
  • Yangqiu Song
  • Qiang Yang

Ubiquitous computing tasks, such as human activity recognition (HAR), are enabling a wide spectrum of applications, ranging from healthcare to environment monitoring. The success of a ubiquitous computing task relies on sufficient physical sensor data with ground-truth labels, which are always scarce due to the expensive annotation process. Meanwhile, social media platforms provide a lot of social and semantic context information. People frequently share what they are doing and where they are in the messages they post. This rich set of socially shared activities motivates us to transfer knowledge from social media to address the sparsity of labelled physical sensor data. To transfer the knowledge of social and semantic context, we propose a Co-Regularized Heterogeneous Transfer Learning (CoHTL) model, which builds a common semantic space derived from two heterogeneous domains. Our proposed method outperforms state-of-the-art methods on two ubiquitous computing tasks, namely human activity recognition and region function discovery.

AAAI Conference 2016 Conference Paper

Text Classification with Heterogeneous Information Network Kernels

  • Chenguang Wang
  • Yangqiu Song
  • Haoran Li
  • Ming Zhang
  • Jiawei Han

Text classification is an important problem with many applications. Traditional approaches represent text as a bag-of-words and build classifiers based on this representation. Rather than words alone, entity phrases, the relations between the entities, as well as the types of the entities and relations carry much more information to represent the texts. This paper presents a novel text-as-network classification framework, which introduces 1) a structured and typed heterogeneous information network (HIN) representation of texts, and 2) a meta-path based approach to link texts. We show that with the new representation and links of texts, the structured and typed information of entities and relations can be incorporated into kernels. Particularly, we develop both a simple linear kernel and an indefinite kernel based on meta-paths in the HIN representation of texts, which we call HIN-kernels. Using Freebase, a well-known world knowledge base, to construct HINs for texts, our experiments on two benchmark datasets show that the indefinite HIN-kernel based on weighted meta-paths outperforms the state-of-the-art methods and other HIN-kernels.

AAAI Conference 2016 Conference Paper

Tracking Idea Flows between Social Groups

  • Yangxin Zhong
  • Shixia Liu
  • Xiting Wang
  • Jiannan Xiao
  • Yangqiu Song

In many applications, ideas that are described by a set of words often flow between different groups. To facilitate users in analyzing the flow, we present a method to model the flow behaviors that aims at identifying the lead-lag relationships between word clusters of different user groups. In particular, an improved Bayesian conditional cointegration based on dynamic time warping is employed to learn links between words in different groups. A tensor-based technique is developed to cluster these linked words into different clusters (ideas) and track the flow of ideas. The main feature of the tensor representation is that we introduce two additional dimensions to represent both time and lead-lag relationships. Experiments on both synthetic and real datasets show that our method is more effective than methods based on traditional clustering techniques and achieves better accuracy. A case study was conducted to demonstrate the usefulness of our method in helping users understand the flow of ideas between different user groups on social media.

AAAI Conference 2015 Conference Paper

Combining Machine Learning and Crowdsourcing for Better Understanding Commodity Reviews

  • Heting Wu
  • Hailong Sun
  • Yili Fang
  • Kefan Hu
  • Yongqing Xie
  • Yangqiu Song
  • Xudong Liu

In e-commerce systems, customer reviews are important information for understanding market feedback on certain commodities. However, accurately analyzing reviews is challenging due to the complexity of natural language processing and the informal descriptions in reviews. Existing methods mainly focus on efficient algorithms that cannot guarantee the accuracy of review analysis. Crowdsourcing can improve the accuracy of review analysis, but it is subject to extra costs and low response time. In this work, we combine machine learning and crowdsourcing for a better understanding of customer reviews. First, we collectively use multiple machine learning algorithms to pre-process review classification. Second, we select the reviews on which the machine learning algorithms cannot all agree and assign them to humans to process. Third, the results from machine learning and crowdsourcing are aggregated to form the final analysis results. Finally, we perform real experiments with practical review data to confirm the effectiveness of our method.

IJCAI Conference 2015 Conference Paper

Constrained Information-Theoretic Tripartite Graph Clustering to Identify Semantically Similar Relations

  • Chenguang Wang
  • Yangqiu Song
  • Dan Roth
  • Chi Wang
  • Jiawei Han
  • Heng Ji
  • Ming Zhang

In knowledge bases or information extraction results, differently expressed relations can be semantically similar (e.g., (X, wrote, Y) and (X, 's written work, Y)). Therefore, grouping semantically similar relations into clusters would facilitate and improve many applications, including knowledge base completion, information extraction, information retrieval, and more. This paper formulates relation clustering as a constrained tripartite graph clustering problem, presents an efficient clustering algorithm, and exhibits the advantage of the constrained framework. We introduce several ways to provide side information via must-link and cannot-link constraints to improve the clustering results. Different from traditional semi-supervised learning approaches, we propose to use the similarity of relation expressions and the knowledge of entity types to automatically construct the constraints for the algorithm. We show improved relation clustering results on two datasets extracted from a human-annotated knowledge base (i.e., Freebase) and open information extraction results (i.e., ReVerb data).

IS Journal 2015 Journal Article

Does Summarization Help Stock Prediction? A News Impact Analysis

  • Xiaodong Li
  • Haoran Xie
  • Yangqiu Song
  • Shanfeng Zhu
  • Qing Li
  • Fu Lee Wang

The authors study how news summarization can help stock price prediction, proposing a generic prediction framework that enables the use of different external signals to predict stock prices. Experiments were conducted on five years of Hong Kong Stock Exchange data, with news reported by Finet; evaluations were performed at the individual stock, sector index, and market index levels. The results show that prediction based on news article summarization can effectively outperform prediction based on full-length articles on both validation and independent testing sets.
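
As a toy illustration of the framework's shape (not the authors' actual pipeline), the sketch below feeds bag-of-words features from news summaries into a simple binary up/down classifier; the vocabulary, texts, labels, and perceptron learner are all hypothetical stand-ins.

```python
import numpy as np

def featurize(texts, vocab):
    """Bag-of-words counts over a fixed vocabulary."""
    X = np.zeros((len(texts), len(vocab)))
    for i, t in enumerate(texts):
        for w in t.lower().split():
            if w in vocab:
                X[i, vocab[w]] += 1
    return X

vocab = {"profit": 0, "loss": 1, "growth": 2, "lawsuit": 3}
summaries = ["record profit growth", "major loss after lawsuit"]
labels = np.array([1, -1])     # hypothetical next-day stock move: up / down

X = featurize(summaries, vocab)
w = np.zeros(len(vocab))
for _ in range(10):            # perceptron updates on the summary features
    for x, y in zip(X, labels):
        if y * (x @ w) <= 0:
            w += y * x
print(np.sign(X @ w))          # in-sample up/down predictions
```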

AAAI Conference 2015 Conference Paper

Microblog Sentiment Classification with Contextual Knowledge Regularization

  • Fangzhao Wu
  • Yangqiu Song
  • Yongfeng Huang

Microblog sentiment classification is an important research topic with wide applications in both academia and industry. Because microblog messages are short, noisy, and contain masses of acronyms and informal words, microblog sentiment classification is a very challenging task. Fortunately, the contextual information about these idiosyncratic words collectively provides knowledge about their sentiment orientations. In this paper, we propose to use microblogs’ contextual knowledge, mined from a large amount of unlabeled data, to help improve microblog sentiment classification. We define two kinds of contextual knowledge: word-word association and word-sentiment association. The contextual knowledge is formulated as regularization terms in supervised learning algorithms. An efficient optimization procedure is proposed to learn the model. Experimental results on benchmark datasets show that our method consistently and significantly outperforms state-of-the-art methods.
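
The exact objective is not given in the abstract; below is a minimal sketch of the general recipe, assuming a logistic loss plus a graph-Laplacian penalty that pulls the weights of associated words together. The association matrix A and all data are toy stand-ins for the mined word-word knowledge.

```python
import numpy as np

def loss_and_grad(w, X, y, A, lam):
    """Logistic loss plus a word-association (graph Laplacian) penalty."""
    z = X @ w
    p = 1.0 / (1.0 + np.exp(-y * z))      # probability of the correct label
    nll = -np.log(p).sum()
    grad = X.T @ (-y * (1.0 - p))
    L = np.diag(A.sum(axis=1)) - A        # Laplacian of the association graph
    nll += lam * (w @ L @ w)              # ~ sum_ij A_ij (w_i - w_j)^2 / 2
    grad += 2.0 * lam * (L @ w)
    return nll, grad

rng = np.random.default_rng(0)
X = rng.normal(size=(20, 4))
y = np.sign(rng.normal(size=20))
A = np.array([[0, 1, 0, 0],               # hypothetical: words 0 and 1,
              [1, 0, 0, 0],               # and words 2 and 3, are associated
              [0, 0, 0, 1],
              [0, 0, 1, 0]], float)

w = np.zeros(4)
for _ in range(200):                      # plain gradient descent
    _, g = loss_and_grad(w, X, y, A, lam=0.1)
    w -= 0.01 * g
print(w)                                  # associated words end up with similar weights
```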

IJCAI Conference 2015 Conference Paper

Open Domain Short Text Conceptualization: A Generative + Descriptive Modeling Approach

  • Yangqiu Song
  • Shusen Wang
  • Haixun Wang

Concepts embody the knowledge that facilitates our cognitive processes of learning. Mapping short texts to a large set of open-domain concepts has enabled many successful applications. In this paper, we unify the existing conceptualization methods from a Bayesian perspective and discuss three modeling approaches: descriptive, generative, and discriminative models. Motivated by the discussion of their advantages and shortcomings, we develop a generative + descriptive modeling approach. Our model considers term relatedness in the context and results in disambiguated conceptualization. We report short text clustering results on a news title dataset and a Twitter message dataset, and demonstrate the effectiveness of the developed approach compared with state-of-the-art conceptualization and topic modeling approaches.

AAAI Conference 2015 Conference Paper

Spectral Label Refinement for Noisy and Missing Text Labels

  • Yangqiu Song
  • Chenguang Wang
  • Ming Zhang
  • Hailong Sun
  • Qiang Yang

With the recent growth of online content on the Web, there is more user-generated data with noisy and missing labels, e.g., social tags and voted labels from Amazon’s Mechanical Turk. Most machine learning methods require accurate label sets and cannot be trusted when the label sets are unreliable. In this paper, we provide a text label refinement algorithm to adjust the labels of such noisy and missing-label datasets. We assume that the label sets can be refined based on the labels held with a certain confidence and on the similarity between data points being consistent with the labels. We propose a label smoothness ratio criterion to measure the smoothness of the labels and the consistency between labels and data. We demonstrate the effectiveness of the label refinement algorithm on eight labeled document datasets, and validate that the results are useful for generating better labels.
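
The paper's label smoothness ratio is not reproduced here, but the underlying graph-smoothing intuition can be sketched. Below, a standard propagation-style update smooths noisy labels over a row-normalized similarity graph; the graph, labels, and trade-off parameter are toy assumptions.

```python
import numpy as np

def refine_labels(S, Y, alpha=0.8, iters=100):
    """Iterative smoothing F <- alpha * S @ F + (1 - alpha) * Y."""
    F = Y.astype(float).copy()
    for _ in range(iters):
        F = alpha * S @ F + (1 - alpha) * Y
    return F.argmax(axis=1)

# Toy data: docs 0-2 are mutually similar (true class 0), doc 3 stands apart
# (class 1); doc 2's observed label is flipped to simulate noise.
W = np.array([[0, 1, 1, 0],
              [1, 0, 1, 0],
              [1, 1, 0, 0.2],
              [0, 0, 0.2, 0]], float)
S = W / W.sum(axis=1, keepdims=True)        # row-normalized similarity graph
Y_noisy = np.array([[1, 0], [1, 0], [0, 1], [0, 1]], float)
print(refine_labels(S, Y_noisy))            # doc 2 is refined back to class 0
```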

AAAI Conference 2014 Conference Paper

On Dataless Hierarchical Text Classification

  • Yangqiu Song
  • Dan Roth

In this paper, we systematically study the problem of dataless hierarchical text classification. Unlike standard text classification schemes that rely on supervised training, dataless classification depends on understanding the labels of the sought-after categories and requires no labeled data. Given a collection of text documents and a set of labels, we show that understanding the labels can be used to accurately categorize the documents. This is done by embedding both labels and documents in a semantic space that allows one to compute meaningful semantic similarity between a document and a potential label. We show that this scheme can be used to support accurate multiclass classification without any supervision. We study several semantic representations and show how to improve the classification using bootstrapping. Our results show that bootstrapped dataless classification is competitive with supervised classification trained on thousands of labeled examples.
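
A minimal sketch of the core step, assuming documents and label descriptions share one semantic embedding space (the random toy vectors below stand in for a real semantic representation such as ESA): each document takes the label whose description is most similar under cosine similarity.

```python
import numpy as np

def embed(text, vocab, vectors):
    """Average word vectors: a toy stand-in for a semantic representation."""
    idx = [vocab[w] for w in text.lower().split() if w in vocab]
    return vectors[idx].mean(axis=0)

def cos(u, v):
    return (u @ v) / (np.linalg.norm(u) * np.linalg.norm(v) + 1e-12)

rng = np.random.default_rng(0)
vocab = {w: i for i, w in enumerate(
    ["goal", "match", "striker", "election", "senate", "vote"])}
vectors = rng.normal(size=(len(vocab), 16))

# Hypothetical label descriptions and an unlabeled document.
labels = {"sports": "goal match striker", "politics": "election senate vote"}
doc = "the striker scored a late goal in the match"

doc_vec = embed(doc, vocab, vectors)
best = max(labels, key=lambda l: cos(doc_vec, embed(labels[l], vocab, vectors)))
print(best)   # -> sports, with no labeled training data
```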

IS Journal 2014 Journal Article

Semantic Multidimensional Scaling for Open-Domain Sentiment Analysis

  • Erik Cambria
  • Yangqiu Song
  • Haixun Wang
  • Newton Howard

The ability to understand natural language text is far from being emulated in machines. One of the main hurdles to overcome is that computers lack both the common and common-sense knowledge that humans normally acquire during the formative years of their lives. To really understand natural language, a machine should be able to comprehend this type of knowledge, rather than merely relying on the valence of keywords and word co-occurrence frequencies. In this article, the largest existing taxonomy of common knowledge is blended with a natural-language-based semantic network of common-sense knowledge. Multidimensional scaling is applied on the resulting knowledge base for open-domain opinion mining and sentiment analysis.
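
The blended knowledge base is not reproduced here, but the multidimensional-scaling step can be sketched: classical MDS recovers low-dimensional coordinates from a pairwise distance matrix, with the toy matrix below standing in for distances between concepts in the knowledge base.

```python
import numpy as np

def classical_mds(D, dim=2):
    """Classical MDS: embed points so Euclidean distances approximate D."""
    n = D.shape[0]
    J = np.eye(n) - np.ones((n, n)) / n     # centering matrix
    B = -0.5 * J @ (D ** 2) @ J             # double-centered Gram matrix
    vals, vecs = np.linalg.eigh(B)
    idx = np.argsort(vals)[::-1][:dim]      # top eigenpairs
    return vecs[:, idx] * np.sqrt(np.maximum(vals[idx], 0))

# Toy pairwise distances between three concepts.
D = np.array([[0, 1, 4],
              [1, 0, 4],
              [4, 4, 0]], float)
print(classical_mds(D))   # 2-D coordinates; the first two concepts land close
```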

TIST Journal 2012 Journal Article

TIARA

  • Shixia Liu
  • Michelle X. Zhou
  • Shimei Pan
  • Yangqiu Song
  • Weihong Qian
  • Weijia Cai
  • Xiaoxiao Lian

We are building an interactive visual text analysis tool that aids users in analyzing large collections of text. Unlike existing work in visual text analytics, which focuses either on developing sophisticated text analytic techniques or inventing novel text visualization metaphors, ours tightly integrates state-of-the-art text analytics with interactive visualization to maximize the value of both. In this article, we present our work from two aspects. We first introduce an enhanced, LDA-based topic analysis technique that automatically derives a set of topics to summarize a collection of documents and their content evolution over time. To help users understand the complex summarization results produced by our topic analysis technique, we then present the design and development of a time-based visualization of the results. Furthermore, we provide users with a set of rich interaction tools that help them further interpret the visualized results in context and examine the text collection from multiple perspectives. As a result, our work offers three unique contributions. First, we present an enhanced topic modeling technique to provide users with a time-sensitive and more meaningful text summary. Second, we develop an effective visual metaphor to transform abstract and often complex text summarization results into a comprehensible visual representation. Third, we offer users flexible visual interaction tools as alternatives to compensate for the deficiencies of current text summarization techniques. We have applied our work to a number of text corpora and our evaluation shows promise, especially in support of complex text analyses.
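
As a toy illustration of the data behind such a time-based topic view (not the tool's actual pipeline), the sketch below aggregates per-document topic mixtures, as an LDA-style model would produce, into per-time-bin topic strengths; all numbers are hypothetical.

```python
import numpy as np

# Hypothetical per-document topic mixtures (rows sum to 1, as from LDA).
theta = np.array([[0.9, 0.1],      # doc 0: mostly topic 0
                  [0.2, 0.8],      # doc 1
                  [0.7, 0.3],      # doc 2
                  [0.1, 0.9]])     # doc 3
timestamps = np.array([0, 0, 1, 1])    # time bin of each document

bins = np.unique(timestamps)
strength = np.vstack([theta[timestamps == b].sum(axis=0) for b in bins])
print(strength)   # rows: time bins; columns: topic strength drawn as layers
```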

IJCAI Conference 2011 Conference Paper

Short Text Conceptualization Using a Probabilistic Knowledgebase

  • Yangqiu Song
  • Haixun Wang
  • Zhongyuan Wang
  • Hongsong Li
  • Weizhu Chen

Most text mining tasks, such as clustering, are dominated by statistical approaches that treat text as a bag of words. Semantics in the text is largely ignored in the mining process, and the mining results are often not easily interpretable. One particular challenge faced by such approaches is short text understanding, as short text lacks enough content from which a statistical conclusion can be drawn. For example, traditional topic analysis methods consider topic segments with tens to hundreds of words. Latent topic modeling, such as latent Dirichlet allocation, also requires sufficient words to infer a document's topic distribution. We enhance machine learning algorithms by first giving the machine a probabilistic knowledgebase whose concepts (of worldly facts) are as large, rich, and consistent as those in our mental world. A Bayesian inference mechanism is then developed to conceptualize words and short text. We conducted comprehensive tests of our method on conceptualizing sets of text terms, as well as on clustering Twitter messages (tweets), which are typically about ten words long. Compared to latent semantic topic modeling and four other kinds of methods using WordNet, Freebase, and Wikipedia (category links and explicit semantic analysis), we show significant improvements in tweet clustering accuracy.
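
A minimal sketch of Bayesian conceptualization under stated assumptions: each concept c is scored for a set of observed terms via p(c | terms) ∝ p(c) · Π_t p(t | c), with the tiny hypothetical knowledgebase below standing in for a real probabilistic one.

```python
import math

# Toy probabilistic knowledgebase: concept priors and term likelihoods.
p_concept = {"fruit": 0.5, "company": 0.5}
p_term_given_concept = {
    "fruit":   {"apple": 0.6, "pear": 0.3, "ipad": 0.01},
    "company": {"apple": 0.4, "pear": 0.01, "ipad": 0.5},
}

def conceptualize(terms, smoothing=1e-4):
    """Pick the concept maximizing log p(c) + sum_t log p(t | c)."""
    scores = {}
    for c, prior in p_concept.items():
        log_p = math.log(prior)
        for t in terms:
            log_p += math.log(p_term_given_concept[c].get(t, smoothing))
        scores[c] = log_p
    return max(scores, key=scores.get)

print(conceptualize(["apple", "pear"]))   # -> fruit: context disambiguates
print(conceptualize(["apple", "ipad"]))   # -> company
```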

AAAI Conference 2010 Conference Paper

Constrained Coclustering for Textual Documents

  • Yangqiu Song
  • Shimei Pan
  • Shixia Liu
  • Furu Wei
  • Michelle Zhou
  • Weihong Qian

In this paper, we present a constrained co-clustering approach for clustering textual documents. Our approach combines the benefits of information-theoretic co-clustering and constrained clustering. We use a two-sided hidden Markov random field (HMRF) to model both the document and word constraints. We also develop an alternating expectation maximization (EM) algorithm to optimize the constrained co-clustering model. We have conducted two sets of experiments on a benchmark data set: (1) using human-provided category labels to derive document and word constraints for semi-supervised document clustering, and (2) using automatically extracted named entities to derive document constraints for unsupervised document clustering. Compared to several representative constrained clustering and co-clustering approaches, our approach is shown to be more effective for high-dimensional, sparse text data.
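
The HMRF formulation and information-theoretic objective are beyond a short snippet, but the alternating structure can be sketched in a much-simplified form: rows (documents) and columns (words) are reassigned in turn against block means, with a hypothetical soft penalty when must-linked documents land in different clusters.

```python
import numpy as np

def coclust(X, k_row, k_col, must_link, penalty=1.0, iters=10, seed=0):
    rng = np.random.default_rng(seed)
    r = rng.integers(k_row, size=X.shape[0])      # document (row) clusters
    c = rng.integers(k_col, size=X.shape[1])      # word (column) clusters
    for _ in range(iters):
        M = np.zeros((k_row, k_col))              # block means of X
        for i in range(k_row):
            for j in range(k_col):
                block = X[np.ix_(r == i, c == j)]
                M[i, j] = block.mean() if block.size else 0.0
        for d in range(X.shape[0]):               # reassign documents
            costs = ((X[d] - M[:, c]) ** 2).sum(axis=1)
            for a, b in must_link:                # soft must-link penalty
                if d in (a, b):
                    other = r[b] if d == a else r[a]
                    costs += penalty * (np.arange(k_row) != other)
            r[d] = costs.argmin()
        for w in range(X.shape[1]):               # reassign words
            c[w] = ((X[:, w] - M[r, :].T) ** 2).sum(axis=1).argmin()
    return r, c

# Toy document-word matrix with a 2x2 block structure.
X = np.array([[5, 5, 0, 0],
              [5, 4, 0, 1],
              [0, 0, 5, 5],
              [1, 0, 4, 5]], float)
print(coclust(X, 2, 2, must_link=[(0, 1)]))
```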

IJCAI Conference 2009 Conference Paper

Knowledge Transfer on Hybrid Graph

  • Zheng Wang
  • Yangqiu Song
  • Changshui Zhang

In machine learning problems, labeled data are often in short supply. One feasible solution to this problem is transfer learning, which can make use of labeled data from another domain to discriminate unlabeled data in the target domain. In this paper, we propose a transfer learning framework based on similarity matrix approximation to tackle such problems. Two practical algorithms are proposed: label propagation and similarity propagation. In these methods, we build a hybrid graph based on all available data. Information is then transferred across domains by alternately constructing the similarity matrix for different parts of the graph. Among all related methods, the similarity propagation approach makes maximum use of the available similarity information across domains, which leads to more efficient transfer and better learning results. Experiments on real-world text mining applications demonstrate the promise and effectiveness of our algorithms.
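
The similarity-propagation variant, which also updates the similarity matrix itself, is not reproduced here; the sketch below shows only the simpler cross-domain label-propagation idea on a toy hybrid graph with hypothetical weights, where source labels diffuse into unlabeled target nodes.

```python
import numpy as np

def propagate(W, Y, labeled_mask, alpha=0.8, iters=100):
    """Diffuse labels over the hybrid graph, clamping the known source labels."""
    S = W / W.sum(axis=1, keepdims=True)
    F = Y.astype(float).copy()
    for _ in range(iters):
        F = alpha * S @ F + (1 - alpha) * Y
        F[labeled_mask] = Y[labeled_mask]
    return F.argmax(axis=1)

# Toy hybrid graph: nodes 0-1 are labeled source docs, 2-3 unlabeled target docs.
W = np.array([[0, 1, 1, 0],
              [1, 0, 0, 1],
              [1, 0, 0, 0.5],
              [0, 1, 0.5, 0]], float)
Y = np.array([[1, 0], [0, 1], [0, 0], [0, 0]], float)
labeled = np.array([True, True, False, False])
print(propagate(W, Y, labeled))   # target nodes inherit their neighbors' classes
```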

IJCAI Conference 2009 Conference Paper

On-line Evolutionary Exponential Family Mixture

  • Jianwen Zhang
  • Yangqiu Song
  • Gang Chen
  • Changshui Zhang

This paper deals with evolutionary clustering, which refers to the problem of clustering data with distribution drifting along time. Starting from a density estimation view to clustering problems, we propose two general on-line frameworks. In the first framework, i. e. , historical data dependent (HDD), current model distribution is designed to approximate both current and historical data distributions. In the second framework, i. e. , historical model dependent (HMD), current model distribution is designed to approximate both current data distribution and historical model distribution. Both frameworks are based on the general exponential family mixture (EFM) model. As a result, all conventional clustering algorithms based on EFMs can be extended to evolutionary setting under the two frameworks. Empirical results validate the two frameworks.