Arrow Research search

Author name cluster

Yu Su

Possible papers associated with this exact author name in Arrow. This page groups case-insensitive exact name matches and is not a full identity disambiguation profile.

23 papers
1 author row

Possible papers

23

NeurIPS Conference 2025 Conference Paper

ARM: Adaptive Reasoning Model

  • Siye Wu
  • Jian Xie
  • Yikai Zhang
  • Aili Chen
  • Kai Zhang
  • Yu Su
  • Yanghua Xiao

While large reasoning models demonstrate strong performance on complex tasks, they lack the ability to adjust reasoning token usage based on task difficulty. This often leads to the "overthinking" problem—excessive and unnecessary reasoning—which, although potentially mitigated by human intervention to control the token budget, still fundamentally contradicts the goal of achieving fully autonomous AI. In this work, we propose Adaptive Reasoning Model (ARM), a reasoning model capable of adaptively selecting appropriate reasoning formats based on the task at hand. These formats include three efficient ones—Direct Answer, Short CoT, and Code—as well as a more elaborate format, Long CoT. To train ARM, we introduce Ada-GRPO, an adaptation of Group Relative Policy Optimization (GRPO), which addresses the format collapse issue in traditional GRPO. Ada-GRPO enables ARM to achieve high token efficiency, reducing tokens by an average of $\sim$30%, and up to $\sim$70%, while maintaining performance comparable to the model that relies solely on Long CoT. Furthermore, not only does it improve inference efficiency through reduced token generation, but it also brings a $\sim$2$\times$ speedup in training. In addition to the default Adaptive Mode, ARM supports two additional reasoning modes: 1) Instruction-Guided Mode, which allows users to explicitly specify the reasoning format via special tokens—ideal when the appropriate format is known for a batch of tasks. 2) Consensus-Guided Mode, which aggregates the outputs of the three efficient formats and resorts to Long CoT in case of disagreement, prioritizing performance with higher token usage. All the resources will be released.
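The Consensus-Guided Mode described above can be sketched as a simple aggregation rule. This is a minimal illustration with hypothetical function names, reading "disagreement" as the lack of a majority; the paper's actual implementation may differ:

```python
from collections import Counter

def consensus_guided(efficient_answers, long_cot):
    """Aggregate answers from the three efficient formats (Direct Answer,
    Short CoT, Code); fall back to Long CoT only when no majority exists.
    `long_cot` is a callable so the costly format runs only on disagreement."""
    answer, votes = Counter(efficient_answers).most_common(1)[0]
    if votes >= 2:        # at least two of the three formats agree
        return answer
    return long_cot()     # disagreement: pay for the elaborate format

print(consensus_guided(["42", "42", "7"], lambda: "42"))   # agreement -> 42
print(consensus_guided(["1", "2", "3"], lambda: "long"))   # fallback -> long
```

This matches the trade-off stated in the abstract: extra tokens are spent on Long CoT only for the subset of tasks where the cheap formats disagree.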

NeurIPS Conference 2025 Conference Paper

BioCLIP 2: Emergent Properties from Scaling Hierarchical Contrastive Learning

  • Jianyang Gu
  • Sam Stevens
  • Elizabeth Campolongo
  • Matthew Thompson
  • Net Zhang
  • Jiaman Wu
  • Andrei Kopanev
  • Zheda Mai

Foundation models trained at scale exhibit remarkable emergent behaviors, learning new capabilities beyond their initial training objectives. We find such emergent behaviors in biological vision models via large-scale contrastive vision-language training. To achieve this, we first curate TreeOfLife-200M, comprising 214 million images of living organisms, the largest and most diverse biological organism image dataset to date. We then train BioCLIP 2 on TreeOfLife-200M to distinguish different species. Despite the narrow training objective, BioCLIP 2 yields extraordinary accuracy when applied to various biological visual tasks such as habitat classification and trait prediction. We identify emergent properties in the learned embedding space of BioCLIP 2. At the inter-species level, the embedding distribution of different species aligns closely with functional and ecological meanings (e.g., beak sizes and habitats). At the intra-species level, instead of being diminished, the intra-species variations (e.g., life stages and sexes) are preserved and better separated in subspaces orthogonal to inter-species distinctions. We provide formal proof and analyses to explain why hierarchical supervision and contrastive objectives encourage these emergent properties. Crucially, our results reveal that these properties become increasingly significant with larger-scale training data, leading to a biologically meaningful embedding space.

AAAI Conference 2025 Conference Paper

Distribution-Driven Dense Retrieval: Modeling Many-to-One Query-Document Relationship

  • Junfeng Kang
  • Rui Li
  • Qi Liu
  • Zhenya Huang
  • Zheng Zhang
  • Yanjiang Chen
  • Linbo Zhu
  • Yu Su

Dense retrieval has emerged as the leading approach in information retrieval, aiming to find semantically relevant documents based on natural language queries. Given that a single document can be retrieved by multiple distinct queries, existing methods aim to represent a document with multiple vectors. Each vector is aligned with a different query to model the many-to-one relationship between queries and documents. However, these multiple vector-based approaches encounter challenges such as Increased Storage, Vector Collapse, and Search Efficiency. To address these issues, we introduce the Distribution-Driven Dense Retrieval framework (DDR). Specifically, we use vectors to represent queries and distributions to represent documents. This approach not only captures the relationships between multiple queries corresponding to the same document but also avoids the need to use multiple vectors to represent the document. Furthermore, to ensure search efficiency for DDR, we propose a dot product-based computation method to calculate the similarity between documents represented by distributions and queries represented by vectors. This allows for seamless integration with existing approximate nearest neighbor (ANN) search algorithms for efficient search. Finally, we conduct extensive experiments on real-world datasets, which demonstrate that our method significantly outperforms traditional dense retrieval methods.
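One way the dot-product computation described above can work, sketched under the assumption of Gaussian document distributions (an illustration, not the paper's exact formulation): the expected inner product between a query vector q and a document d ~ N(mu, Sigma) is simply q . mu, so off-the-shelf ANN indexes over the stored means still apply.

```python
import numpy as np

def ddr_score(query, doc_means):
    """Score a query vector against documents represented as distributions:
    E[q . d] for d ~ N(mu, Sigma) equals q . mu, a plain (ANN-friendly)
    dot product with the stored distribution means."""
    return doc_means @ query

q = np.array([0.2, 0.9, 0.1])
means = np.array([[0.1, 0.8, 0.0],   # document 0's distribution mean
                  [0.9, 0.1, 0.2]])  # document 1's distribution mean
scores = ddr_score(q, means)
print(int(scores.argmax()))  # document 0 matches the query better
```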

TMLR Journal 2025 Journal Article

Is Your LLM Secretly a World Model of the Internet? Model-Based Planning for Web Agents

  • Yu Gu
  • Kai Zhang
  • Yuting Ning
  • Boyuan Zheng
  • Boyu Gou
  • Tianci Xue
  • Cheng Chang
  • Sanjari Srivastava

Language agents based on large language models (LLMs) have demonstrated great promise in automating web-based tasks. Recent work has shown that incorporating advanced planning algorithms, e.g., tree search, is advantageous over reactive planning for web agents. However, unlike simulated sandbox environments, real-world environments such as the web are rife with irreversible actions. This undermines the feasibility of backtracking, a cornerstone of (tree) search. Overly relying on test-time search also hurts efficiency. We advocate model-based planning for web agents that employs a world model to simulate and deliberate over the outcome of each candidate action before committing to one. We systematically explore this paradigm by: (1) Proposing a model-based planning framework, WebDreamer, which employs LLMs to serve as both world models and value functions; (2) Training specialized LLMs as world models with a scalable data synthesis pipeline. Empirical results demonstrate that WebDreamer achieves substantial performance improvements over reactive baselines. It is competitive, while being - times more efficient, with tree search in sandbox environments (VisualWebArena) and also works effectively on real-world websites (Online-Mind2Web and Mind2Web-Live). Furthermore, our trained world model, Dreamer-7B, performs comparably to GPT-4o, highlighting the potential of specialized world models for efficient and effective planning in complex web environments. All code, models, and data are publicly available at https://github.com/OSU-NLP-Group/WebDreamer
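The model-based planning loop described above reduces to a simple pattern: imagine each candidate action's outcome with the world model, score it with the value function, then commit. A toy sketch, where the lambdas stand in for the paper's LLM-based world model and value function:

```python
def plan_one_step(candidate_actions, world_model, value_fn):
    """Simulate each candidate action before committing to one: the world
    model predicts the resulting state, the value function scores it, and
    the best-scoring action is the one executed for real."""
    return max(candidate_actions, key=lambda a: value_fn(world_model(a)))

# Toy environment: states are integers, the world model doubles an action,
# and the value function prefers imagined states near the goal state 6.
best = plan_one_step([1, 2, 3],
                     world_model=lambda a: a * 2,
                     value_fn=lambda s: -abs(s - 6))
print(best)  # action 3 leads to the imagined state closest to the goal
```

Because the simulation happens in the model rather than on the live website, no irreversible action is taken while deliberating, which is the point the abstract makes against tree search with backtracking.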

NeurIPS Conference 2025 Conference Paper

Mind2Web 2: Evaluating Agentic Search with Agent-as-a-Judge

  • Boyu Gou
  • Zanming Huang
  • Yuting Ning
  • Yu Gu
  • Michael Lin
  • Weijian Qi
  • Andrei Kopanev
  • Botao Yu

Agentic search systems such as Deep Research, in which agents autonomously browse the web, synthesize information, and return comprehensive citation-backed answers, represent a major shift in how users interact with web-scale information. While promising greater efficiency and cognitive offloading, the growing complexity and open-endedness of agentic search have outpaced existing evaluation benchmarks and methodologies, which largely assume short search horizons and static answers. In this paper, we introduce Mind2Web 2, a benchmark of 130 realistic, high-quality, and long-horizon tasks that require real-time web browsing and extensive information synthesis, constructed with over 1000 hours of human labor. To address the challenge of evaluating time-varying and complex answers, we propose a novel Agent-as-a-Judge framework. Our method constructs task-specific judge agents based on a tree-structured rubric design to automatically assess both answer correctness and source attribution. We conduct a comprehensive evaluation of ten frontier agentic search systems and human performance, along with a detailed error analysis to draw insights for future development. The best-performing system, OpenAI Deep Research, can already achieve 50-70% of human performance while spending half the time, highlighting its great potential. Altogether, Mind2Web 2 provides a rigorous foundation for developing and benchmarking the next generation of agentic search systems.

AAAI Conference 2025 Conference Paper

ScholarGEC: Enhancing Controllability of Large Language Model for Chinese Academic Grammatical Error Correction

  • Zixiao Kong
  • Xianquan Wang
  • Shuanghong Shen
  • Keyu Zhu
  • Huibo Xu
  • Yu Su

Large language models (LLMs) have demonstrated exceptional error detection capabilities and can correct sentences with high fluency in grammatical error correction (GEC) tasks. However, when correcting Chinese academic papers, LLMs face significant challenges of over-correction. To delve deeper into this issue, we explore the underlying reasons. On one hand, each discipline has its unique vocabulary and expressions, and LLMs have insufficient and incomplete understanding of domain-specific sentences. On the other hand, the controllability of generative LLMs in GEC tasks is inherently poor, and the traditional sequence-to-sequence (Seq2Seq) correction structure exacerbates this issue. Considering the two aforementioned factors, we propose a new error correction framework for Chinese academic GEC tasks using LLMs, named ScholarGEC. To improve LLMs' understanding of domain-specific knowledge, we construct appropriate disciplinary knowledge prefixes for sentences and use this domain-specific knowledge data to fine-tune the LLM. To enhance the controllability of LLMs, we replace the traditional Seq2Seq structure with a Detection-Correction separated structure. We also introduce a special token during the process to improve the model's error detection stability. Additionally, we incorporate iterative self-reflection into all three LLM generation stages to enhance the stability of generation. Extensive experiments demonstrate the effectiveness and robustness of our framework on a Chinese GEC dataset composed of academic papers, and further analysis reveals the capabilities of our framework in enhancing LLM performance in general GEC tasks.

AAAI Conference 2025 Conference Paper

VERSE: Verification-based Self-Play for Code Instructions

  • Hao Jiang
  • Qi Liu
  • Rui Li
  • Yuze Zhao
  • Yixiao Ma
  • Shengyu Ye
  • Junyu Lu
  • Yu Su

Instruction-tuned Code Large Language Models (Code LLMs) have excelled in diverse code-related tasks, such as program synthesis, automatic program repair, and code explanation. To collect training datasets for instruction-tuning, a popular method involves having models autonomously generate instructions and corresponding responses. However, the direct generation of responses does not ensure functional correctness, a crucial requirement for generating responses to code instructions. To overcome this, we present Verification-Based Self-Play (VERSE), aiming to enhance model proficiency in generating correct responses. VERSE establishes a robust verification framework that covers various code instructions. Employing VERSE, Code LLMs engage in self-play to generate instructions and corresponding verifications. They evaluate execution results and self-consistency as verification outcomes, using them as scores to rank generated data for self-training. Experiments show that VERSE improves multiple base Code LLMs (average 7.6%) across various languages and tasks on many benchmarks, affirming its effectiveness.
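The ranking step described above, which turns verification outcomes into scores for self-training, can be sketched as follows; the verifier callable and the candidate fields are hypothetical stand-ins for VERSE's actual pipeline:

```python
def rank_for_self_training(candidates, verifier):
    """Score each generated (instruction, response) pair by its
    verification outcome, e.g. the fraction of executed checks that pass,
    and sort so the most trustworthy data is kept for self-training."""
    return sorted(candidates, key=verifier, reverse=True)

cands = [
    {"response": "buggy_sort", "passed": 1, "total": 4},
    {"response": "good_sort",  "passed": 4, "total": 4},
]
ranked = rank_for_self_training(cands, lambda c: c["passed"] / c["total"])
print(ranked[0]["response"])  # the fully verified response ranks first
```

The key design point from the abstract is that execution results and self-consistency, not the generator's own confidence, decide which synthetic data is trusted.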

AAAI Conference 2024 Conference Paper

CONSIDER: Commonalities and Specialties Driven Multilingual Code Retrieval Framework

  • Rui Li
  • Liyang He
  • Qi Liu
  • Yuze Zhao
  • Zheng Zhang
  • Zhenya Huang
  • Yu Su
  • Shijin Wang

Multilingual code retrieval aims to find code snippets relevant to a user's query from a multilingual codebase, which plays a crucial role in software development and expands their application scenarios compared to classical monolingual code retrieval. Despite the performance improvements achieved by previous studies, two crucial problems are overlooked in the multilingual scenario. First, certain programming languages face data scarcity in specific domains, resulting in limited representation capabilities within those domains. Second, different programming languages can be used interchangeably within the same domain, making it challenging for multilingual models to accurately identify the intended programming language of a user's query. To address these issues, we propose the CommONalities and SpecIalties Driven Multilingual CodE Retrieval Framework (CONSIDER), which includes two modules. The first module enhances the representation of various programming languages by modeling pairwise and global commonalities among them. The second module introduces a novel contrastive learning negative sampling algorithm that leverages language confusion to automatically extract specific language features. Through our experiments, we confirm the significant benefits of our model in real-world multilingual code retrieval scenarios in various aspects. Furthermore, an evaluation demonstrates the effectiveness of our proposed CONSIDER framework in monolingual scenarios as well. Our source code is available at https://github.com/smsquirrel/consider.

NeurIPS Conference 2024 Conference Paper

Fine-Tuning is Fine, if Calibrated

  • Zheda Mai
  • Arpita Chowdhury
  • Ping Zhang
  • Cheng-Hao Tu
  • Hong-You Chen
  • Vardaan Pahuja
  • Tanya Berger-Wolf
  • Song Gao

Fine-tuning is arguably the most straightforward way to tailor a pre-trained model (e.g., a foundation model) to downstream applications, but it also comes with the risk of losing valuable knowledge the model had learned in pre-training. For example, fine-tuning a pre-trained classifier capable of recognizing a large number of classes to master a subset of classes at hand is shown to drastically degrade the model's accuracy in the other classes it had previously learned. As such, it is hard to further use the fine-tuned model when it encounters classes beyond the fine-tuning data. In this paper, we systematically dissect the issue, aiming to answer the fundamental question, "What has been damaged in the fine-tuned model?" To our surprise, we find that the fine-tuned model neither forgets the relationship among the other classes nor degrades the features to recognize these classes. Instead, the fine-tuned model often produces more discriminative features for these other classes, even if they were missing during fine-tuning! What really hurts the accuracy is the discrepant logit scales between the fine-tuning classes and the other classes, implying that a simple post-processing calibration would bring back the pre-trained model's capability and at the same time unveil the feature improvement over all classes. We conduct an extensive empirical study to demonstrate the robustness of our findings and provide preliminary explanations underlying them, suggesting new directions for future theoretical analysis.
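The "simple post-processing calibration" the abstract points to can be illustrated by offsetting the inflated logits of the fine-tuning classes with a single scalar. This is a sketch; `gamma` is a hypothetical bias one would tune on validation data, not a value from the paper:

```python
import numpy as np

def calibrate_logits(logits, finetune_classes, gamma):
    """Subtract a scalar bias from the fine-tuning classes' logits so the
    absent classes, whose features are intact, can compete again."""
    out = logits.copy()
    out[:, finetune_classes] -= gamma
    return out

# Class 2 was absent during fine-tuning; its logit scale lags behind.
logits = np.array([[5.0, 4.8, 2.0]])
before = int(logits.argmax())                                      # class 0 wins
after = int(calibrate_logits(logits, [0, 1], gamma=3.5).argmax())  # class 2 wins
print(before, after)
```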

NeurIPS Conference 2024 Conference Paper

Grokking of Implicit Reasoning in Transformers: A Mechanistic Journey to the Edge of Generalization

  • Boshi Wang
  • Xiang Yue
  • Yu Su
  • Huan Sun

We study whether transformers can learn to implicitly reason over parametric knowledge, a skill that even the most capable language models struggle with. Focusing on two representative reasoning types, composition and comparison, we consistently find that transformers can learn implicit reasoning, but only through grokking, i.e., extended training far beyond overfitting. The levels of generalization also vary across reasoning types: when faced with out-of-distribution examples, transformers fail to systematically generalize for composition but succeed for comparison. We delve into the model's internals throughout training, conducting analytical experiments that reveal: 1) the mechanism behind grokking, such as the formation of the generalizing circuit and its relation to the relative efficiency of generalizing and memorizing circuits, and 2) the connection between systematicity and the configuration of the generalizing circuit. Our findings guide data and training setup to better induce implicit reasoning and suggest potential improvements to the transformer architecture, such as encouraging cross-layer knowledge sharing. Furthermore, we demonstrate that for a challenging reasoning task with a large search space, GPT-4-Turbo and Gemini-1.5-Pro based on non-parametric memory fail badly regardless of prompting styles or retrieval augmentation, while a fully grokked transformer can achieve near-perfect accuracy, showcasing the power of parametric memory for complex reasoning.

NeurIPS Conference 2024 Conference Paper

HippoRAG: Neurobiologically Inspired Long-Term Memory for Large Language Models

  • Bernal J. Gutiérrez
  • Yiheng Shu
  • Yu Gu
  • Michihiro Yasunaga
  • Yu Su

In order to thrive in hostile and ever-changing natural environments, mammalian brains evolved to store large amounts of knowledge about the world and continually integrate new information while avoiding catastrophic forgetting. Despite the impressive accomplishments, large language models (LLMs), even with retrieval-augmented generation (RAG), still struggle to efficiently and effectively integrate a large amount of new experiences after pre-training. In this work, we introduce HippoRAG, a novel retrieval framework inspired by the hippocampal indexing theory of human long-term memory to enable deeper and more efficient knowledge integration over new experiences. HippoRAG synergistically orchestrates LLMs, knowledge graphs, and the Personalized PageRank algorithm to mimic the different roles of neocortex and hippocampus in human memory. We compare HippoRAG with existing RAG methods on multi-hop question answering (QA) and show that our method outperforms the state-of-the-art methods remarkably, by up to 20%. Single-step retrieval with HippoRAG achieves comparable or better performance than iterative retrieval like IRCoT while being 10-20 times cheaper and 6-13 times faster, and integrating HippoRAG into IRCoT brings further substantial gains. Finally, we show that our method can tackle new types of scenarios that are out of reach of existing methods.
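Personalized PageRank, the retrieval engine named above, is an ordinary random walk with restart biased toward seed nodes (in HippoRAG's case, entities extracted from the query). A minimal power-iteration sketch over a toy graph, not the paper's code:

```python
import numpy as np

def personalized_pagerank(adj, seeds, alpha=0.85, iters=100):
    """Random walk with restart: with probability alpha follow an
    out-edge, with probability 1 - alpha teleport back to a seed node."""
    A = np.asarray(adj, dtype=float)
    out_deg = A.sum(axis=1, keepdims=True)
    # Column-stochastic transitions; dangling rows contribute no mass.
    P = np.divide(A, out_deg, out=np.zeros_like(A), where=out_deg > 0).T
    restart = np.zeros(len(A))
    restart[list(seeds)] = 1.0 / len(seeds)
    rank = restart.copy()
    for _ in range(iters):
        rank = alpha * (P @ rank) + (1 - alpha) * restart
    return rank

# Tiny knowledge graph 0 -> 1 -> 2, seeded at node 0: relevance decays
# with distance from the query entity.
rank = personalized_pagerank([[0, 1, 0], [0, 0, 1], [0, 0, 0]], seeds=[0])
print(int(rank.argmax()))  # the seed itself scores highest
```

Passages linked to high-ranking graph nodes would then be retrieved, which is how a single PPR pass can substitute for multi-step iterative retrieval.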

TIST Journal 2024 Journal Article

Model-Agnostic Adaptive Testing for Intelligent Education Systems via Meta-learned Gradient Embeddings

  • Haoyang Bi
  • Qi Liu
  • Han Wu
  • Weidong He
  • Zhenya Huang
  • Yu Yin
  • Haiping Ma
  • Yu Su

The field of education has undergone a significant revolution with the advent of intelligent systems and technology, which aim to personalize the learning experience, catering to the unique needs and abilities of individual learners. In this pursuit, a fundamental challenge is designing proper tests for assessing the students' cognitive status on knowledge and skills accurately and efficiently. One promising approach, referred to as Computerized Adaptive Testing (CAT), is to administer computer-automated tests that alternately select the next item for each examinee and estimate their cognitive states given their responses to the selected items. Nevertheless, existing CAT systems suffer from inflexibility in item selection and ineffectiveness in cognitive state estimation, respectively. In this article, we propose a Model-Agnostic adaptive testing framework via Meta-learned Gradient Embeddings, MAMGE for short, improving both item selection and cognitive state estimation simultaneously. For item selection, we design a Gradient Embedding-based Item Selector (GEIS) which incorporates the concept of gradient embeddings to represent items and selects the best ones that are both informative and representative. For cognitive state estimation, we propose a Meta-learned Cognitive State Estimator (MCSE) to automatically control the estimation process by learning to learn a proper initialization and dynamically inferred updates. Both MCSE and GEIS are inherently model-agnostic, and the two modules have an ingenious connection via meta-learned gradient embeddings. Finally, extensive experiments evaluate the effectiveness and flexibility of MAMGE.

NeurIPS Conference 2024 Conference Paper

VLM4Bio: A Benchmark Dataset to Evaluate Pretrained Vision-Language Models for Trait Discovery from Biological Images

  • M. Maruf
  • Arka Daw
  • Kazi S. Mehrab
  • Harish B. Manogaran
  • Abhilash Neog
  • Medha Sawhney
  • Mridul Khurana
  • James P. Balhoff

Images are increasingly becoming the currency for documenting biodiversity on the planet, providing novel opportunities for accelerating scientific discoveries in the field of organismal biology, especially with the advent of large vision-language models (VLMs). We ask if pre-trained VLMs can aid scientists in answering a range of biologically relevant questions without any additional fine-tuning. In this paper, we evaluate the effectiveness of $12$ state-of-the-art (SOTA) VLMs in the field of organismal biology using a novel dataset, VLM4Bio, consisting of $469K$ question-answer pairs involving $30K$ images from three groups of organisms: fishes, birds, and butterflies, covering five biologically relevant tasks. We also explore the effects of applying prompting techniques and tests for reasoning hallucination on the performance of VLMs, shedding new light on the capabilities of current SOTA VLMs in answering biologically relevant questions using images.

NeurIPS Conference 2023 Conference Paper

Holistic Transfer: Towards Non-Disruptive Fine-Tuning with Partial Target Data

  • Cheng-Hao Tu
  • Hong-You Chen
  • Zheda Mai
  • Jike Zhong
  • Vardaan Pahuja
  • Tanya Berger-Wolf
  • Song Gao
  • Charles Stewart

We propose a learning problem involving adapting a pre-trained source model to the target domain for classifying all classes that appeared in the source data, using target data that covers only a partial label space. This problem is practical, as it is unrealistic for the target end-users to collect data for all classes prior to adaptation. However, it has received limited attention in the literature. To shed light on this issue, we construct benchmark datasets and conduct extensive experiments to uncover the inherent challenges. We found a dilemma --- on the one hand, adapting to the new target domain is important to claim better performance; on the other hand, we observe that preserving the classification accuracy of classes missing in the target adaptation data is highly challenging, let alone improving them. To tackle this, we identify two key directions: 1) disentangling domain gradients from classification gradients, and 2) preserving class relationships. We present several effective solutions that maintain the accuracy of the missing classes and enhance the overall performance, establishing solid baselines for holistic transfer of pre-trained models with partial target data.

NeurIPS Conference 2023 Conference Paper

MagicBrush: A Manually Annotated Dataset for Instruction-Guided Image Editing

  • Kai Zhang
  • Lingbo Mo
  • Wenhu Chen
  • Huan Sun
  • Yu Su

Text-guided image editing is widely needed in daily life, ranging from personal use to professional applications such as Photoshop. However, existing methods are either zero-shot or trained on an automatically synthesized dataset, which contains a high volume of noise. Thus, they still require lots of manual tuning to produce desirable outcomes in practice. To address this issue, we introduce MagicBrush, the first large-scale, manually annotated dataset for instruction-guided real image editing that covers diverse scenarios: single-turn, multi-turn, mask-provided, and mask-free editing. MagicBrush comprises over 10K manually annotated triplets (source image, instruction, target image), which supports training large-scale text-guided image editing models. We fine-tune InstructPix2Pix on MagicBrush and show that the new model can produce much better images according to human evaluation. We further conduct extensive experiments to evaluate current image editing baselines from multiple dimensions including quantitative, qualitative, and human evaluations. The results reveal the challenging nature of our dataset and the gap between current baselines and real-world editing needs.

NeurIPS Conference 2023 Conference Paper

Mind2Web: Towards a Generalist Agent for the Web

  • Xiang Deng
  • Yu Gu
  • Boyuan Zheng
  • Shijie Chen
  • Sam Stevens
  • Boshi Wang
  • Huan Sun
  • Yu Su

We introduce Mind2Web, the first dataset for developing and evaluating generalist agents for the web that can follow language instructions to complete complex tasks on any website. Existing datasets for web agents either use simulated websites or only cover a limited set of websites and tasks, thus not suitable for generalist web agents. With over 2,000 open-ended tasks collected from 137 websites spanning 31 domains and crowdsourced action sequences for the tasks, Mind2Web provides three necessary ingredients for building generalist web agents: 1) diverse domains, websites, and tasks, 2) use of real-world websites instead of simulated and simplified ones, and 3) a broad spectrum of user interaction patterns. Based on Mind2Web, we conduct an initial exploration of using large language models (LLMs) for building generalist web agents. While the raw HTML of real-world websites is often too large to be fed to LLMs, we show that first filtering it with a small LM significantly improves the effectiveness and efficiency of LLMs. Our solution demonstrates a decent level of performance, even on websites or entire domains the model has never seen before, but there is still substantial room for improvement towards truly generalizable agents. We open-source our dataset, model implementation, and trained models (https://osu-nlp-group.github.io/Mind2Web) to facilitate further research on building a generalist agent for the web.

AAAI Conference 2019 Conference Paper

Learning to Compose Topic-Aware Mixture of Experts for Zero-Shot Video Captioning

  • Xin Wang
  • Jiawei Wu
  • Da Zhang
  • Yu Su
  • William Yang Wang

Although promising results have been achieved in video captioning, existing models are limited to the fixed inventory of activities in the training corpus, and do not generalize to open vocabulary scenarios. Here we introduce a novel task, zero-shot video captioning, that aims at describing out-of-domain videos of unseen activities. Videos of different activities usually require different captioning strategies in many aspects, i.e., word selection, semantic construction, and style expression, etc., which poses a great challenge to depict novel activities without paired training data. Meanwhile, similar activities share some of those aspects in common. Therefore, we propose a principled Topic-Aware Mixture of Experts (TAMoE) model for zero-shot video captioning, which learns to compose different experts based on different topic embeddings, implicitly transferring the knowledge learned from seen activities to unseen ones. Besides, we leverage external topic-related text corpus to construct the topic embedding for each activity, which embodies the most relevant semantic vectors within the topic. Empirical results not only validate the effectiveness of our method in utilizing semantic knowledge for video captioning, but also show its strong generalization ability when describing novel activities.

AAAI Conference 2018 Conference Paper

Exercise-Enhanced Sequential Modeling for Student Performance Prediction

  • Yu Su
  • Qingwen Liu
  • Qi Liu
  • Zhenya Huang
  • Yu Yin
  • Enhong Chen
  • Chris Ding
  • Si Wei

In online education systems, for offering proactive services to students (e.g., personalized exercise recommendation), a crucial demand is to predict student performance (e.g., scores) on future exercising activities. Existing prediction methods mainly exploit the historical exercising records of students, where each exercise is usually represented as the manually labeled knowledge concepts, and the richer information contained in the text descriptions of exercises is still underexplored. In this paper, we propose a novel Exercise-Enhanced Recurrent Neural Network (EERNN) framework for student performance prediction by taking full advantage of both student exercising records and the text of each exercise. Specifically, for modeling the student exercising process, we first design a bidirectional LSTM to learn each exercise representation from its text description without any expertise and information loss. Then, we propose a new LSTM architecture to trace student states (i.e., knowledge states) in their sequential exercising process with the combination of exercise representations. For making final predictions, we design two strategies under EERNN, i.e., EERNNM with Markov property and EERNNA with Attention mechanism. Extensive experiments on large-scale real-world data clearly demonstrate the effectiveness of the EERNN framework. Moreover, by incorporating the exercise correlations, EERNN can well deal with the cold start problems from both student and exercise perspectives.

TIST Journal 2018 Journal Article

Fuzzy Cognitive Diagnosis for Modelling Examinee Performance

  • Qi Liu
  • Runze Wu
  • Enhong Chen
  • Guandong Xu
  • Yu Su
  • Zhigang Chen
  • Guoping Hu

Recent decades have witnessed the rapid growth of educational data mining (EDM), which aims at automatically extracting valuable information from large repositories of data generated by or related to people's learning activities in educational settings. One of the key EDM tasks is cognitive modelling with examination data, which tries to profile examinees by discovering their latent knowledge state and cognitive level (e.g., the proficiency of specific skills). However, to the best of our knowledge, the problem of extracting information from both objective and subjective examination problems to achieve more precise and interpretable cognitive analysis remains underexplored. To this end, we propose a fuzzy cognitive diagnosis framework (FuzzyCDF) for examinees' cognitive modelling with both objective and subjective problems. Specifically, to handle the partially correct responses on subjective problems, we first fuzzify the skill proficiency of examinees. Then we combine fuzzy set theory and educational hypotheses to model the examinees' mastery on the problems based on their skill proficiency. Finally, we simulate the generation of examination score on each problem by considering slip and guess factors. In this way, the whole diagnosis framework is built. For further comprehensive verification, we apply our FuzzyCDF to three classical cognitive assessment tasks, i.e., predicting examinee performance, slip and guess detection, and cognitive diagnosis visualization. Extensive experiments on three real-world datasets for these assessment tasks prove that FuzzyCDF can reveal the knowledge states and cognitive level of the examinees effectively and interpretably.
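The score-generation step described above, fuzzy mastery combined with slip and guess factors, can be sketched as follows; the logistic membership function and its parameters are illustrative assumptions, not the paper's exact hypotheses:

```python
import math

def mastery(theta, a=1.7, b=0.0):
    """Fuzzified skill proficiency: degree of membership in the 'masters
    this skill' set as a logistic function of latent proficiency theta
    (a and b play the roles of discrimination and difficulty here)."""
    return 1.0 / (1.0 + math.exp(-a * (theta - b)))

def p_correct(theta, slip=0.1, guess=0.2):
    """Simulated response: a master answers correctly unless they slip;
    a non-master is only right by guessing."""
    m = mastery(theta)
    return (1 - slip) * m + guess * (1 - m)

print(round(p_correct(2.0), 2), round(p_correct(-2.0), 2))
# strong examinees approach 1 - slip; weak ones approach the guess rate
```

Because mastery is a continuous membership degree rather than a binary state, partially correct responses on subjective problems fit naturally into this formulation.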

AAAI Conference 2017 Conference Paper

Question Difficulty Prediction for READING Problems in Standard Tests

  • Zhenya Huang
  • Qi Liu
  • Enhong Chen
  • Hongke Zhao
  • Mingyong Gao
  • Si Wei
  • Yu Su
  • Guoping Hu

Standard tests aim to evaluate the performance of examinees using different tests with consistent difficulties. Thus, a critical demand is to predict the difficulty of each test question before the test is conducted. Existing studies are usually based on the judgments of education experts (e.g., teachers), which may be subjective and labor intensive. In this paper, we propose a novel Test-aware Attention-based Convolutional Neural Network (TACNN) framework to automatically solve this Question Difficulty Prediction (QDP) task for READING problems (a typical problem style in English tests) in standard tests. Specifically, given the abundant historical test logs and text materials of questions, we first design a CNN-based architecture to extract sentence representations for the questions. Then, we utilize an attention strategy to qualify the difficulty contribution of each sentence to questions. Considering the incomparability of question difficulties in different tests, we propose a test-dependent pairwise strategy for training TACNN and generating the difficulty prediction value. Extensive experiments on a real-world dataset not only show the effectiveness of TACNN, but also give interpretable insights to track the attention information for questions.

IJCAI Conference 2015 Conference Paper

Cognitive Modelling for Predicting Examinee Performance

  • Runze Wu
  • Qi Liu
  • Yuping Liu
  • Enhong Chen
  • Yu Su
  • Zhigang Chen
  • Guoping Hu

Cognitive modelling can discover the latent characteristics of examinees for predicting their performance (i.e., scores) on each problem. As cognitive modelling is important for numerous applications, e.g., personalized remedy recommendation, some solutions have been designed in the literature. However, the problem of extracting information from both objective and subjective problems to get more precise and interpretable cognitive analysis is still underexplored. To this end, we propose a fuzzy cognitive diagnosis framework (FuzzyCDF) for examinees' cognitive modelling with both objective and subjective problems. Specifically, to handle the partially correct responses on subjective problems, we first fuzzify the skill proficiency of examinees. Then, we combine fuzzy set theory and educational hypotheses to model the examinees' mastery on the problems. Further, we simulate the generation of examination scores by considering both slip and guess factors. Extensive experiments on three real-world datasets prove that FuzzyCDF can predict examinee performance more effectively, and the output of FuzzyCDF is also interpretable.