Arrow Research search

Author name cluster

Shijin Wang

Possible papers associated with this exact author name in Arrow. This page groups case-insensitive exact name matches and is not a full identity disambiguation profile.

28 papers
1 author row

Possible papers


AAAI 2026 · Conference Paper

BLADE: A Behavior-Level Data Augmentation Framework with Dual Fusion Modeling for Multi-Behavior Sequential Recommendation

  • Yupeng Li
  • Mingyue Cheng
  • Yucong Luo
  • Yitong Zhou
  • Qingyang Mao
  • Shijin Wang

Multi-behavior sequential recommendation aims to capture users' dynamic interests by modeling diverse types of user interactions over time. Although several studies have explored this setting, the recommendation performance remains suboptimal, mainly due to two fundamental challenges: the heterogeneity of user behaviors and data sparsity. To address these challenges, we propose BLADE, a framework that enhances multi-behavior modeling while mitigating data sparsity. Specifically, to handle behavior heterogeneity, we introduce a dual item-behavior fusion architecture that incorporates behavior information at both the input and intermediate levels, enabling preference modeling from multiple perspectives. To mitigate data sparsity, we design three behavior-level data augmentation methods that operate directly on behavior sequences rather than core item sequences. These methods generate diverse augmented views while preserving the semantic consistency of item sequences. These augmented views further enhance representation learning and generalization via contrastive learning. Experiments on three real-world datasets demonstrate the effectiveness of our approach.

AAAI 2026 · Conference Paper

CARE-Bench: A Benchmark of Diverse Client Simulations Guided by Expert Principles for Evaluating LLMs in Psychological Counseling

  • Bichen Wang
  • Yixin Sun
  • Junzhe Wang
  • Hao Yang
  • Xing Fu
  • Yanyan Zhao
  • Si Wei
  • Shijin Wang

The mismatch between the growing demand for psychological counseling and the limited availability of services has motivated research into the application of Large Language Models (LLMs) in this domain. Consequently, there is a need for a robust and unified benchmark to assess the counseling competence of various LLMs. Existing works, however, are limited by unprofessional client simulation, static question-and-answer evaluation formats, and unidimensional metrics. These limitations hinder their effectiveness in assessing a model's comprehensive ability to handle diverse and complex clients. To address this gap, we introduce CARE-Bench, a dynamic and interactive automated benchmark. It is built upon diverse client profiles derived from real-world counseling cases and simulated according to expert guidelines. CARE-Bench provides a multidimensional performance evaluation grounded in established psychological scales. Using CARE-Bench, we evaluate several general-purpose LLMs and specialized counseling models, revealing their current limitations. In collaboration with psychologists, we conduct a detailed analysis of the reasons for LLMs' failures when interacting with clients of different types, which provides directions for developing more comprehensive, universal, and effective counseling models.

JBHI 2026 · Journal Article

Confidence-Aware Adaptive Fusion Learning of Imbalanced Multi-Modal Data for Cancer Diagnosis and Prognosis

  • Ziye Zhang
  • Shijin Wang
  • Yuying Huang
  • Xiaorou Zheng
  • Shoubin Dong

The effective fusion of pathological images and molecular omics holds significant potential for precision medicine. However, pathological and molecular data are highly heterogeneous, and large-scale multi-modal cancer data often suffer from incomplete information. Predicting clinical tasks from such imbalanced multi-modal data presents a major challenge. Therefore, we propose CAFusion, a confidence-aware adaptive fusion framework. The framework adopts a modular design, providing independent and flexible modal feature learning modules to capture high-quality features. To address the modal imbalance caused by heterogeneous and incomplete modalities, we design a confidence-aware method that evaluates the features of each modality and automatically adjusts their weights. To effectively fuse the pathological and molecular modalities, we propose an adaptive deep network with a flexible, non-fixed layer structure that effectively extracts hidden joint information from multi-modal features, ensuring high generalizability. Experimental results demonstrate that the CAFusion framework outperforms other state-of-the-art methods on both complete and incomplete datasets. Moreover, the CAFusion framework offers reasonable medical interpretability.

AAAI 2026 · Conference Paper

From Diagnosis to Generalization: A Cognitive Approach to Data Selection for Educational LLMs

  • Yuxiang Guo
  • Yan Zhuang
  • Qi Liu
  • Zhenya Huang
  • Xianquan Wang
  • Liyang He
  • Jiatong Li
  • Rui Li

Specializing Large Language Models for educational domains is a key frontier in creating personalized learning tools. The central challenge is not data scarcity but its abundance: efficiently selecting a curated data subset from vast corpora to enhance specialized skills and foster generalization, without degrading existing abilities. Existing data selection paradigms, relying on superficial semantic similarity or model training dynamics, often lack a principled framework to identify data that promotes true cognitive growth. Our work proposes a paradigm shift from leveraging indirect proxies of learning value, such as semantic similarity and training dynamics, towards a framework that performs a direct, cognitive-level modeling of the learner's state. We introduce CASS, a novel framework that implements this cognitive approach through a clear pipeline, moving from an initial diagnosis to the ultimate goal of expanding the model's cognitive frontier. First, CASS diagnoses the LLM's cognitive frontier using Multidimensional Item Response Theory. Leveraging this diagnosis, it then employs Fisher Information to select a data subset situated at the LLM's cognitive frontier that offers maximum informational gain. Finally, the model is fine-tuned on this curated data using a structured, easy-to-hard curriculum to ensure effective learning. Experiments on our new multi-subject dataset show that models trained with CASS not only achieve superior accuracy in the target domain but also exhibit enhanced generalization. CASS provides a more efficient, effective, and theoretically grounded paradigm for building expert educational LLMs.

NeurIPS 2025 · Conference Paper

A Closed-Form Solution for Fast and Reliable Adaptive Testing

  • Yan Zhuang
  • Chenye Ke
  • Zirui Liu
  • Qi Liu
  • Yuting Ning
  • Zhenya Huang
  • Weizhe Huang
  • Qingyang Mao

Human ability estimation is essential for educational assessment, career advancement, and professional certification. Adaptive Testing systems can improve estimation efficiency by selecting fewer, targeted questions, and are widely used in exams, e.g., the GRE, GMAT, and Duolingo English Test. However, selecting an optimal subset of questions remains a challenging nested optimization problem. Existing methods rely on costly approximations or data-intensive training, making them unsuitable for today's large-scale and complex testing environments. Thus, we propose a Closed-Form solution for question subset selection in Adaptive Testing. It directly minimizes ability estimation error by reducing the ability parameter's gradient bias while maintaining Hessian stability, which enables a simple greedy algorithm for question selection. Moreover, it can quantify the impact of human behavioral perturbations on ability estimation. Extensive experiments on large-scale educational datasets demonstrate that it reduces the number of required questions by 10% compared to SOTA methods while maintaining the same estimation accuracy.

NeurIPS 2025 · Conference Paper

FACT: Mitigating Inconsistent Hallucinations in LLMs via Fact-Driven Alternating Code-Text Training

  • Xinxin You
  • Qixin Sun
  • Chenwei Yan
  • Xiao Zhang
  • Chen Ning
  • Xiangling Fu
  • Si Liu
  • Guoping Hu

Inconsistent hallucinations remain a major challenge for large language models (LLMs), undermining the accuracy and reliability of fact-based reasoning in real-world applications. Existing approaches often rely on task-specific training or adaptation, such as hand-crafted synthetic datasets for domain tasks or solutions mainly focused on numerical reasoning, thereby limiting generalizability to broader, unseen NLP tasks. Inspired by the structural rigor and logical consistency of programming languages, we observe that fact-based texts can be mapped to programming structures due to their inherent patterns. We further propose FACT, a novel Fact-driven Alternating Code-text Training framework that alternates between text-to-code and code-to-text prediction. FACT is the first task-agnostic paradigm that embeds code and natural language in a shared semantic space, thereby transferring the logical consistency of code to LLM outputs in NLP tasks. Experiments show that with only a small subset of Wiki-40B-en for training, FACT reduces inconsistent hallucinations by 2.7%–8.0% and improves overall performance by 2.5%–6.1% across three leading LLMs and four diverse datasets covering QA and summarization tasks. This framework offers a new perspective on addressing challenging hallucinations in LLMs, contributing to more reliable AI.

NeurIPS 2025 · Conference Paper

How Does Sequence Modeling Architecture Influence Base Capabilities of Pre-trained Language Models? Exploring Key Architecture Design Principles to Avoid Base Capabilities Degradation

  • Xin Lu
  • Yanyan Zhao
  • Si Wei
  • Shijin Wang
  • Bing Qin
  • Ting Liu

Pre-trained language models represented by the Transformer have been proven to possess strong base capabilities, and the Transformer's self-attention mechanism has become a classic in sequence modeling architectures. Unlike work that proposes new sequence modeling architectures to improve the efficiency of the attention mechanism, this work focuses on the impact of sequence modeling architectures on base capabilities. Specifically, our concern is: how exactly do sequence modeling architectures affect the base capabilities of pre-trained language models? In this work, we first point out that the mixed-domain pre-training setting commonly adopted in existing architecture design works fails to adequately reveal the differences in base capabilities among architectures. To address this, we propose a limited-domain pre-training setting with out-of-distribution testing, which successfully uncovers significant differences in base capabilities among architectures at an early stage. Next, we analyze the base capabilities of stateful sequence modeling architectures and find that they exhibit significant degradation compared to the Transformer. Then, through a series of architecture component analyses, we summarize a key architecture design principle: a sequence modeling architecture must possess full-sequence arbitrary selection capability to avoid degradation in base capabilities. Finally, we empirically validate this principle using an extremely simple Top-1 element selection architecture and further generalize it to a more practical Top-1 chunk selection architecture. Experimental results support our proposed design principle and suggest that our work can serve as a valuable reference for future architecture improvements and novel designs.

NeurIPS 2025 · Conference Paper

Investigating and Mitigating Catastrophic Forgetting in Medical Knowledge Injection through Internal Knowledge Augmentation Learning

  • Yuxuan Zhou
  • Xien Liu
  • Xiao Zhang
  • Chen Ning
  • Shijin Wang
  • Guoping Hu
  • Ji Wu

Large Language Models (LLMs) are expected to possess comprehensive medical knowledge to support real-world clinical applications. While domain-specific fine-tuning effectively injects medical knowledge into LLMs, it often causes catastrophic forgetting of previously acquired knowledge and instruction-following capabilities. In this paper, we investigate this issue and reveal a pattern of proximity-dependent forgetting: knowledge that is semantically or topically close to the injected content is more likely to be forgotten, while unrelated knowledge shows minimal degradation. Moreover, we observe that existing mitigation techniques fail to address this type of forgetting effectively. Motivated by this observation and inspired by human learning mechanisms, we propose InternAL (Internal Knowledge Augmentation Learning), a novel approach that leverages LLMs' own internal knowledge to mitigate forgetting. InternAL first probes internal knowledge closely related to the injection by prompting the model with questions derived from the injected knowledge. This knowledge is then used to augment the original injection dataset, guiding the model to retain related prior knowledge during training. Experimental results on multiple LLMs (LLaMA, Qwen) demonstrate that InternAL significantly mitigates proximity-related forgetting while maintaining strong knowledge injection performance. Our findings provide new insights into the nature of catastrophic forgetting in medical knowledge injection and highlight a promising direction for robust domain adaptation in LLMs. Code and datasets are available at https://github.com/THUMLP/InternAL.

AAAI 2025 · Conference Paper

Multi-Perspective Consolidation Enhanced Cognitive Diagnosis via Conditional Diffusion Model

  • Guanhao Zhao
  • Zhenya Huang
  • Cheng Cheng
  • Yan Zhuang
  • Qingyang Mao
  • Xin Li
  • Shijin Wang
  • Enhong Chen

Cognitive diagnosis, which assesses learners' competence from their interaction logs, plays a vital role in education. It provides a crucial reference for gauging learners' proficiency levels and tailoring future learning activities accordingly. Researchers have proposed numerous cognitive diagnosis models to address this task. Despite their success, these models continue to face an ill-posed problem because of the information loss caused by under-expressive interaction functions and incomplete observations. In this paper, we address these challenges by proposing a novel cognitive diagnosis model, DMC-CDM, based on the theoretical premise that cognitive states can be captured with minimal information loss by maximizing the mutual information between observed and potential observations. Specifically, DMC-CDM incorporates a semantic extractor to provide a comprehensive semantic understanding of learners' interaction logs, thereby enhancing current collaboration-based cognitive state representations. It then consolidates multi-perspective observations to capture precise cognitive states by maximizing mutual information between these observations. We conducted extensive experiments on three datasets, and the results demonstrate that our proposed model is both effective and beneficial for downstream applications in education.

NeurIPS 2024 · Conference Paper

Computerized Adaptive Testing via Collaborative Ranking

  • Zirui Liu
  • Yan Zhuang
  • Qi Liu
  • Jiatong Li
  • Yuren Zhang
  • Zhenya Huang
  • Jinze Wu
  • Shijin Wang

With the deep integration of machine learning and intelligent education, Computerized Adaptive Testing (CAT) has received increasing research attention. Compared to traditional paper-and-pencil tests, CAT delivers personalized, interactive assessments by automatically adjusting testing questions according to students' performance during the test. CAT has therefore been recognized as an efficient testing methodology capable of accurately estimating a student's ability with a minimal number of questions, leading to its widespread adoption in mainstream selective exams such as the GMAT and GRE. However, merely improving the accuracy of ability estimation is far from satisfactory in real-world scenarios, since an accurate ranking of students is usually more important (e.g., in high-stakes exams). Considering the shortcomings of existing CAT solutions in student ranking, this paper emphasizes the importance of aligning test outcomes (student ranks) with the true underlying abilities of students. Along this line, departing from the conventional paradigm of testing students independently, we propose a novel collaborative framework, Collaborative Computerized Adaptive Testing (CCAT), that leverages inter-student information to enhance student ranking. By using collaborative students as anchors to assist in ranking test-takers, CCAT provides both theoretical guarantees and experimental validation of ranking consistency.

AAAI 2024 · Conference Paper

CONSIDER: Commonalities and Specialties Driven Multilingual Code Retrieval Framework

  • Rui Li
  • Liyang He
  • Qi Liu
  • Yuze Zhao
  • Zheng Zhang
  • Zhenya Huang
  • Yu Su
  • Shijin Wang

Multilingual code retrieval aims to find code snippets relevant to a user's query from a multilingual codebase, which plays a crucial role in software development and expands its application scenarios compared to classical monolingual code retrieval. Despite the performance improvements achieved by previous studies, two crucial problems are overlooked in the multilingual scenario. First, certain programming languages face data scarcity in specific domains, resulting in limited representation capabilities within those domains. Second, different programming languages can be used interchangeably within the same domain, making it challenging for multilingual models to accurately identify the intended programming language of a user's query. To address these issues, we propose the CommONalities and SpecIalties Driven Multilingual CodE Retrieval Framework (CONSIDER), which includes two modules. The first module enhances the representation of various programming languages by modeling pairwise and global commonalities among them. The second module introduces a novel contrastive learning negative sampling algorithm that leverages language confusion to automatically extract specific language features. Through our experiments, we confirm the significant benefits of our model in real-world multilingual code retrieval scenarios in various aspects. Furthermore, an evaluation demonstrates the effectiveness of our proposed CONSIDER framework in monolingual scenarios as well. Our source code is available at https://github.com/smsquirrel/consider.

NeurIPS 2024 · Conference Paper

JiuZhang3.0: Efficiently Improving Mathematical Reasoning by Training Small Data Synthesis Models

  • Kun Zhou
  • Beichen Zhang
  • Jiapeng Wang
  • Zhipeng Chen
  • Wayne X. Zhao
  • Jing Sha
  • Zhichao Sheng
  • Shijin Wang

Mathematical reasoning is an important capability of large language models (LLMs) for real-world applications. To enhance this capability, existing work either collects large-scale math-related texts for pre-training or relies on stronger LLMs (e.g., GPT-4) to synthesize massive numbers of math problems. Both approaches generally incur large costs in training or synthesis. To reduce the cost, based on openly available texts, we propose an efficient way to train a small LLM for math problem synthesis, so as to efficiently generate sufficient high-quality pre-training data. To achieve this, we create a dataset using GPT-4 to distill its data synthesis capability into the small LLM. Concretely, we craft a set of prompts based on human education stages to guide GPT-4 to synthesize problems covering diverse math knowledge and difficulty levels. Besides, we adopt a gradient-based influence estimation method to select the most valuable math-related texts. Both are fed into GPT-4 to create the knowledge distillation dataset used to train the small LLM. We leverage it to synthesize 6 million math problems for pre-training our JiuZhang3.0 model. The whole process only needs to invoke the GPT-4 API 9.3k times and use 4.6B data for training. Experimental results show that JiuZhang3.0 achieves state-of-the-art performance on several mathematical reasoning datasets, under both natural language reasoning and tool manipulation settings. Our code and data will be publicly released at https://github.com/RUCAIBox/JiuZhang3.0.

IJCAI 2024 · Conference Paper

Learning to Solve Geometry Problems via Simulating Human Dual-Reasoning Process

  • Tong Xiao
  • Jiayu Liu
  • Zhenya Huang
  • Jinze Wu
  • Jing Sha
  • Shijin Wang
  • Enhong Chen

Geometry Problem Solving (GPS), a classic and challenging math problem, has attracted much attention in recent years. It requires a solver to comprehensively understand both text and diagram, master essential geometry knowledge, and appropriately apply it in reasoning. However, existing works follow a paradigm of neural machine translation and focus only on enhancing the capability of encoders, which neglects the essential characteristics of human geometry reasoning. In this paper, inspired by dual-process theory, we propose a Dual-Reasoning Geometry Solver (DualGeoSolver) to simulate the dual-reasoning process of humans for GPS. Specifically, we construct two systems in DualGeoSolver, namely the Knowledge System and the Inference System. The Knowledge System controls an implicit reasoning process, which is responsible for providing diagram information and geometry knowledge according to a step-wise reasoning goal generated by the Inference System. The Inference System conducts an explicit reasoning process, which specifies the goal in each reasoning step and applies the knowledge to generate program tokens for resolving it. The two systems carry out the above process iteratively, which aligns more closely with human cognition. We conduct extensive experiments on two benchmark datasets, GeoQA and GeoQA+. The results demonstrate the superiority of DualGeoSolver in both solving accuracy and robustness, which derives from explicitly modeling the human reasoning process and knowledge application.

TIST 2024 · Journal Article

Model-Agnostic Adaptive Testing for Intelligent Education Systems via Meta-learned Gradient Embeddings

  • Haoyang Bi
  • Qi Liu
  • Han Wu
  • Weidong He
  • Zhenya Huang
  • Yu Yin
  • Haiping Ma
  • Yu Su

The field of education has undergone a significant revolution with the advent of intelligent systems and technology, which aim to personalize the learning experience, catering to the unique needs and abilities of individual learners. In this pursuit, a fundamental challenge is designing proper tests for assessing students' cognitive status on knowledge and skills accurately and efficiently. One promising approach, referred to as Computerized Adaptive Testing (CAT), is to administer computer-automated tests that alternately select the next item for each examinee and estimate their cognitive states given their responses to the selected items. Nevertheless, existing CAT systems suffer from inflexibility in item selection and ineffectiveness in cognitive state estimation, respectively. In this article, we propose a Model-Agnostic adaptive testing framework via Meta-learned Gradient Embeddings, MAMGE for short, improving both item selection and cognitive state estimation simultaneously. For item selection, we design a Gradient Embedding-based Item Selector (GEIS), which incorporates the concept of gradient embeddings to represent items and selects those that are both informative and representative. For cognitive state estimation, we propose a Meta-learned Cognitive State Estimator (MCSE) to automatically control the estimation process by learning to learn a proper initialization and dynamically inferred updates. Both MCSE and GEIS are inherently model-agnostic, and the two modules have an ingenious connection via meta-learned gradient embeddings. Finally, extensive experiments evaluate the effectiveness and flexibility of MAMGE.

NeurIPS 2024 · Conference Paper

SocraticLM: Exploring Socratic Personalized Teaching with Large Language Models

  • Jiayu Liu
  • Zhenya Huang
  • Tong Xiao
  • Jing Sha
  • Jinze Wu
  • Qi Liu
  • Shijin Wang
  • Enhong Chen

Large language models (LLMs) are considered a crucial technology for advancing intelligent education, since they exhibit the potential for an in-depth understanding of teaching scenarios and for providing students with personalized guidance. Nonetheless, current LLM-based applications in personalized teaching predominantly follow a "Question-Answering" paradigm, where students are passively provided with answers and explanations. In this paper, we propose SocraticLM, which achieves a Socratic "Thought-Provoking" teaching paradigm that fulfills the role of a real classroom teacher in actively engaging students in the thought process required for genuine problem-solving mastery. To build SocraticLM, we first propose a novel "Dean-Teacher-Student" multi-agent pipeline to construct a new dataset, SocraTeach, which contains 35K meticulously crafted Socratic-style multi-round (equivalent to 208K single-round) teaching dialogues grounded in fundamental mathematical problems. Our dataset simulates authentic teaching scenarios, interacting with six representative types of simulated students with different cognitive states, and strengthens four crucial teaching abilities. SocraticLM is then fine-tuned on SocraTeach with three strategies balancing its teaching and reasoning abilities. Moreover, we contribute a comprehensive evaluation system encompassing five pedagogical dimensions for assessing the teaching quality of LLMs. Extensive experiments verify that SocraticLM achieves significant improvements in teaching performance, outperforming GPT-4 by more than 12%. Our dataset and code are available at https://github.com/Ljyustc/SocraticLM.

NeurIPS 2024 · Conference Paper

Towards Accurate and Fair Cognitive Diagnosis via Monotonic Data Augmentation

  • Zheng Zhang
  • Wei Song
  • Qi Liu
  • Qingyang Mao
  • Yiyan Wang
  • Weibo Gao
  • Zhenya Huang
  • Shijin Wang

Intelligent education stands as a prominent application of machine learning. Within this domain, cognitive diagnosis (CD) is a key research focus that aims to diagnose students' proficiency levels in specific knowledge concepts. As a crucial task within the field of education, cognitive diagnosis encompasses two fundamental requirements: accuracy and fairness. Existing studies have achieved significant success by primarily utilizing observed historical logs of student-exercise interactions. However, real-world scenarios often present a challenge, where a substantial number of students engage with a limited number of exercises. This data sparsity issue can lead to both inaccurate and unfair diagnoses. To this end, we introduce a monotonic data augmentation framework, CMCD, to tackle the data sparsity issue and thereby achieve accurate and fair CD results. Specifically, CMCD integrates the monotonicity assumption, a fundamental educational principle in CD, to establish two constraints for data augmentation. These constraints are general and can be applied to the majority of CD backbones. Furthermore, we provide theoretical analysis to guarantee the accuracy and convergence speed of CMCD. Finally, extensive experiments on real-world datasets showcase the efficacy of our framework in addressing the data sparsity issue with accurate and fair CD results.

AAAI 2023 · Conference Paper

BETA-CD: A Bayesian Meta-Learned Cognitive Diagnosis Framework for Personalized Learning

  • Haoyang Bi
  • Enhong Chen
  • Weidong He
  • Han Wu
  • Weihao Zhao
  • Shijin Wang
  • Jinze Wu

Personalized learning is a promising educational approach that aims to provide high-quality personalized services for each student with minimal demands for practice data. The key to achieving this lies in the cognitive diagnosis task, which estimates the cognitive state of the student through his/her logged data of doing practice quizzes. Nevertheless, in the personalized learning scenario, existing cognitive diagnosis models suffer from the inability to (1) quickly adapt to new students using a small amount of data, and (2) measure the reliability of the diagnosis result to avoid improper services that mismatch the student's actual state. In this paper, we propose a general Bayesian mETA-learned Cognitive Diagnosis framework (BETA-CD), which addresses the two challenges through prior knowledge exploitation and model uncertainty quantification, respectively. Specifically, we first introduce Bayesian hierarchical modeling to associate each student's cognitive state with a shared prior distribution encoding prior knowledge and a personal posterior distribution indicating model uncertainty. Furthermore, we formulate a meta-learning objective to automatically exploit prior knowledge from historical students, and efficiently solve it with a gradient-based variational inference method. The code will be publicly available at https://github.com/AyiStar/pyat.

NeurIPS 2023 · Conference Paper

Evaluating and Improving Tool-Augmented Computation-Intensive Math Reasoning

  • Beichen Zhang
  • Kun Zhou
  • Xilin Wei
  • Xin Zhao
  • Jing Sha
  • Shijin Wang
  • Ji-Rong Wen

Chain-of-thought (CoT) prompting and tool augmentation have been validated in recent work as effective practices for improving the ability of large language models (LLMs) to perform step-by-step reasoning on complex math-related tasks. However, most existing math reasoning datasets may not be able to fully evaluate and analyze the ability of LLMs to manipulate tools and perform reasoning, as they often require only very few invocations of tools or lack annotations for evaluating intermediate reasoning steps, thus supporting only outcome evaluation. To address this issue, we construct CARP, a new Chinese dataset consisting of 4,886 computation-intensive algebra problems with formulated annotations on intermediate steps, facilitating the evaluation of the intermediate reasoning process. On CARP, we test four LLMs with CoT prompting and find that they are all prone to making mistakes at the early steps of the solution, leading to incorrect answers. Based on this finding, we propose DELI, a new approach that facilitates deliberation on reasoning steps with tool interfaces. In DELI, we first initialize a step-by-step solution based on retrieved exemplars, then iterate two deliberation procedures that check and refine the intermediate steps of the generated solution, from both tool manipulation and natural language reasoning perspectives, until the solutions converge or the maximum number of iterations is reached. Experimental results on CARP and six other datasets show that DELI mostly outperforms competitive baselines and can further boost the performance of existing CoT methods. Our data and code are available at https://github.com/RUCAIBox/CARP.

IJCAI 2023 · Conference Paper

Exploiting Non-Interactive Exercises in Cognitive Diagnosis

  • Fangzhou Yao
  • Qi Liu
  • Min Hou
  • Shiwei Tong
  • Zhenya Huang
  • Enhong Chen
  • Jing Sha
  • Shijin Wang

Cognitive diagnosis aims to quantify the proficiency level of students on specific knowledge concepts. Existing studies merely leverage observed historical student-exercise interaction logs to assess proficiency levels. Despite their effectiveness, observed interactions usually exhibit a power-law distribution, where the long tail, consisting of students with few records, lacks supervision signals. This phenomenon leads to inferior diagnoses for students with few records. In this paper, we propose the Exercise-aware Informative Response Sampling (EIRS) framework to address the long-tail problem. EIRS is a general framework that explores the partial order between observed and unobserved responses as an auxiliary ranking-based training signal to supplement cognitive diagnosis. Considering the abundance and complexity of unobserved responses, we first design an Exercise-aware Candidates Selection module, which helps our framework produce reliable potential responses for effective supplementary training. Then, we develop an Expected Ability Change-weighted Informative Sampling strategy to adaptively sample informative potential responses that contribute greatly to model training. Experiments on real-world datasets demonstrate the superiority of our framework on long-tailed data.

AAAI Conference 2023 Conference Paper

Towards a Holistic Understanding of Mathematical Questions with Contrastive Pre-training

  • Yuting Ning
  • Zhenya Huang
  • Xin Lin
  • Enhong Chen
  • Shiwei Tong
  • Zheng Gong
  • Shijin Wang

Understanding mathematical questions effectively is a crucial task, which can benefit many applications, such as difficulty estimation. Researchers have drawn much attention to designing pre-training models for question representations due to the scarcity of human annotations (e.g., labeling difficulty). However, unlike general free-format texts (e.g., user comments), mathematical questions are generally designed with explicit purposes and mathematical logic, and usually consist of more complex content, such as formulas, and related mathematical knowledge (e.g., Function). Therefore, the problem of holistically representing mathematical questions remains underexplored. To this end, in this paper, we propose a novel contrastive pre-training approach for mathematical question representations, namely QuesCo, which attempts to bring questions with more similar purposes closer. Specifically, we first design two-level question augmentations, including content-level and structure-level, which generate literally diverse question pairs with similar purposes. Then, to fully exploit hierarchical information of knowledge concepts, we propose a knowledge hierarchy-aware rank strategy (KHAR), which ranks the similarities between questions in a fine-grained manner. Next, we adopt a ranking contrastive learning task to optimize our model based on the augmented and ranked questions. We conduct extensive experiments on two real-world mathematical datasets. The experimental results demonstrate the effectiveness of our model.

AAAI Conference 2021 Conference Paper

HMS: A Hierarchical Solver with Dependency-Enhanced Understanding for Math Word Problem

  • Xin Lin
  • Zhenya Huang
  • Hongke Zhao
  • Enhong Chen
  • Qi Liu
  • Hao Wang
  • Shijin Wang

Automatically solving math word problems is a crucial task for exploring the intelligence levels of machines in the general AI domain. It is highly challenging since it requires not only natural language understanding but also mathematical expression inference. Existing solutions usually explore sequence-to-sequence models to generate expressions, where the problems are simply encoded sequentially. However, such models generally fall far short of human-like problem understanding and lead to incorrect answers. To this end, in this paper, we propose a novel Hierarchical Math Solver (HMS) for deep understanding and exploitation of problems. In problem understanding, imitating human reading habits, we propose a hierarchical word-clause-problem encoder. Specifically, we first split each problem into several clauses and learn problem semantics from the local clause level to the global problem level. Then, in clause understanding, we propose a dependency-based module to enhance clause semantics with the dependency structure of the problem. Next, in expression inference, we propose a novel tree-based decoder to generate the mathematical expression for the answer. In the decoder, we apply a hierarchical attention mechanism to enhance the problem semantics with context from different levels, and a pointer-generator network to guide the model to copy existing information and infer extra knowledge. Extensive experimental results on two widely used datasets demonstrate that HMS achieves not only better answers but also more reasonable inference.

AAAI Conference 2020 Conference Paper

Discriminative Sentence Modeling for Story Ending Prediction

  • Yiming Cui
  • Wanxiang Che
  • Wei-Nan Zhang
  • Ting Liu
  • Shijin Wang
  • Guoping Hu

Story Ending Prediction is a task that requires selecting an appropriate ending for a given story, which demands that the machine understand the story and sometimes draw on commonsense knowledge. To tackle this task, we propose a new neural network called Diff-Net for better modeling the differences between the candidate endings. The proposed model discriminates two endings at three semantic levels: contextual representation, story-aware representation, and discriminative representation. Experimental results on the Story Cloze Test dataset show that the proposed model significantly outperforms various systems by a large margin, and detailed ablation studies are given for a better understanding of our model. We also carefully examine traditional and BERT-based models on both SCT v1.0 and v1.5, with interesting findings that may potentially help future studies.

AAAI Conference 2020 Conference Paper

Neural Cognitive Diagnosis for Intelligent Education Systems

  • Fei Wang
  • Qi Liu
  • Enhong Chen
  • Zhenya Huang
  • Yuying Chen
  • Yu Yin
  • Zai Huang
  • Shijin Wang

Cognitive diagnosis is a fundamental issue in intelligent education, which aims to discover the proficiency level of students on specific knowledge concepts. Existing approaches usually model linear interactions in the student exercising process with manually designed functions (e.g., the logistic function), which is not sufficient for capturing the complex relations between students and exercises. In this paper, we propose a general Neural Cognitive Diagnosis (NeuralCD) framework, which incorporates neural networks to learn the complex exercising interactions, yielding both accurate and interpretable diagnosis results. Specifically, we project students and exercises to factor vectors and leverage multiple neural layers to model their interactions, where the monotonicity assumption is applied to ensure the interpretability of both factors. Furthermore, we propose two implementations of NeuralCD by specializing the required concepts of each exercise, i.e., NeuralCDM with a traditional Q-matrix and the improved NeuralCDM+ exploiting rich text content. Extensive experimental results on real-world datasets show the effectiveness of the NeuralCD framework in terms of both accuracy and interpretability.

AAAI Conference 2019 Conference Paper

Convolutional Spatial Attention Model for Reading Comprehension with Multiple-Choice Questions

  • Zhipeng Chen
  • Yiming Cui
  • Wentao Ma
  • Shijin Wang
  • Guoping Hu

Machine Reading Comprehension (MRC) with multiple-choice questions requires the machine to read a given passage and select the correct answer among several candidates. In this paper, we propose a novel approach called the Convolutional Spatial Attention (CSA) model, which can better handle MRC with multiple-choice questions. The proposed model fully extracts the mutual information among the passage, the question, and the candidates to form enriched representations. Furthermore, to merge various attention results, we propose using convolutional operations to dynamically summarize the attention values within regions of different sizes. Experimental results show that the proposed model gives substantial improvements over various state-of-the-art systems on both the RACE and SemEval-2018 Task 11 datasets.