Arrow Research search

Author name cluster

Jiale Chen

Possible papers associated with this exact author name in Arrow. This page groups case-insensitive exact name matches and is not a full identity disambiguation profile.

5 papers
1 author row

Possible papers

5

AAAI Conference 2026 · Conference Paper

HSKBenchmark: Modeling and Benchmarking Chinese Second Language Acquisition in Large Language Models Through Curriculum Tuning

  • Qihao Yang
  • Xuelin Wang
  • Jiale Chen
  • Xuelian Dong
  • Yuxin Hao
  • Tianyong Hao

Language acquisition is vital to revealing the nature of human language intelligence and has recently emerged as a promising perspective for improving the interpretability of large language models (LLMs). However, it is ethically and practically infeasible to conduct experiments that require controlling human learners' language inputs. This poses challenges for the verifiability and scalability of language acquisition modeling, particularly in Chinese second language acquisition (SLA). While LLMs provide a controllable and reproducible alternative, a systematic benchmark to support phase-wise modeling and assessment is still lacking. To address these issues, we propose HSKBenchmark, the first benchmark for staged modeling and writing assessment of LLMs in Chinese SLA. The benchmark covers HSK levels 3 to 6, comprising authentic textbooks with 6.76M tokens, 16K synthetic instruction samples, 30 test topics, and a linguistically grounded evaluation system. To simulate human acquisition trajectories, a curriculum-tuning framework is introduced, which trains LLMs in a progression from beginner to advanced proficiency levels. Since language production in writing is a key perspective for observing SLA development, an evaluation system is established to probe LLMs in writing, including the coverage of level-based grammar items, writing errors, lexical complexity, syntactic complexity, and holistic scoring. We also develop an HSKAgent fine-tuned on 10K compositions from Chinese second language learners to automate this evaluation system. Extensive experimental results demonstrate that HSKBenchmark not only models Chinese SLA effectively, but also serves as a reliable benchmark for dynamic writing assessment in LLMs. Our fine-tuned LLMs achieve writing performance on par with advanced human learners and exhibit human-like acquisition characteristics. The HSKBenchmark, HSKAgent, and checkpoints serve as foundational tools and resources, with the potential to pave the way for future research on language acquisition modeling and LLM interpretability.
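The curriculum-tuning idea above (train in stages from beginner to advanced HSK levels) can be sketched as a data-scheduling loop. This is a minimal illustration, not the paper's actual pipeline; the sample dictionaries and function names (`curriculum_order`, `run_curriculum`) are invented for the example.

```python
def curriculum_order(samples):
    """Sort training samples so lower (easier) HSK levels come first."""
    return sorted(samples, key=lambda s: s["hsk_level"])

def run_curriculum(samples, stages=(3, 4, 5, 6)):
    """Yield one training stage per HSK level, beginner to advanced.

    In a real setup each yielded batch would be passed to a fine-tuning
    step before moving on to the next stage.
    """
    ordered = curriculum_order(samples)
    for level in stages:
        batch = [s for s in ordered if s["hsk_level"] == level]
        yield level, batch

# Toy corpus: one sample per HSK level, deliberately out of order.
data = [
    {"text": "...", "hsk_level": 5},
    {"text": "...", "hsk_level": 3},
    {"text": "...", "hsk_level": 6},
    {"text": "...", "hsk_level": 4},
]
schedule = list(run_curriculum(data))
```

The point of the sketch is only the ordering constraint: every stage sees data no harder than its level, mirroring a human learner's progression.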

AAAI Conference 2026 · Conference Paper

What, Whether and How? Unveiling Process Reward Models for Thinking with Images Reasoning

  • Yujin Zhou
  • Pengcheng Wen
  • Jiale Chen
  • Boqin Yin
  • Han Zhu
  • Jiaming Ji
  • Juntao Dai
  • Chi-Min Chan

The rapid advancement of Large Vision Language Models (LVLMs) has demonstrated excellent abilities in various visual tasks. Building upon these developments, the thinking with images paradigm has emerged, enabling models to dynamically edit and re-encode visual information at each reasoning step, mirroring human visual processing. However, this paradigm introduces significant challenges as diverse errors may occur during reasoning processes. This necessitates Process Reward Models (PRMs) for distinguishing positive and negative reasoning steps, yet existing benchmarks for PRMs are predominantly text-centric and lack comprehensive assessment under this paradigm. To address these gaps, this work introduces the first comprehensive benchmark specifically designed for evaluating PRMs under the thinking with images paradigm. Our main contributions are: (1) Through extensive analysis of reasoning trajectories and guided search experiments with PRMs, we define 7 fine-grained error types and demonstrate both the necessity for specialized PRMs and the potential for improvement. (2) We construct a comprehensive benchmark comprising 1,206 manually annotated thinking with images reasoning trajectories spanning 4 categories and 16 subcategories for fine-grained evaluation of PRMs. (3) Our experimental analysis reveals that current LVLMs fall short as effective PRMs, exhibiting limited capabilities in visual reasoning process evaluation with significant performance disparities across error types, positive evaluation bias, and sensitivity to reasoning step positions. These findings demonstrate the effectiveness of our benchmark and establish crucial foundations for advancing PRMs in LVLMs.
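At its core, a process reward model scores individual reasoning steps rather than only the final answer. A minimal sketch of that usage pattern, assuming a stand-in scorer (`score_step` here is a placeholder for a real PRM, and the `quality` field is illustrative):

```python
def score_step(step):
    """Stand-in PRM: a real model would score (context, step) pairs."""
    return step["quality"]

def first_error_step(trajectory, threshold=0.5):
    """Return the index of the first step the PRM judges negative,
    or None if every step scores at or above the threshold."""
    for i, step in enumerate(trajectory):
        if score_step(step) < threshold:
            return i
    return None

# Toy trajectory: step 2 is a flawed reasoning step.
traj = [
    {"quality": 0.9},
    {"quality": 0.8},
    {"quality": 0.2},
    {"quality": 0.7},
]
```

Localizing the first bad step is what enables guided search: a solver can back up to that step and resample, instead of discarding the whole trajectory.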

AAAI Conference 2025 · Conference Paper

CA-MLIF: Cross-Attention and Multimodal Low-Rank Interaction Fusion Framework for Tumor Prognostic Prediction

  • Yajun An
  • Jiale Chen
  • Huan Lin
  • Zhenbing Liu
  • Siyang Feng
  • Hualong Zhang
  • Rushi Lan
  • Zaiyi Liu

Cancer is a leading cause of death worldwide due to its aggressive nature and complex variability. Accurate prognosis is therefore challenging but essential for guiding personalized treatment and follow-up. Previous research often relied on single data sources, missing the opportunity to combine various types of patient information for more comprehensive survival predictions. To address these challenges, we propose a two-stage fusion method named Cross-Attention and Multimodal Low-Rank Interaction Fusion Framework (CA-MLIF). In the first stage, we propose a CA mechanism for real-time feature updates and cross-modal mutual learning to capture rich semantic information. In the second stage, we design a novel multimodal low-rank interaction fusion method for survival prediction. Specifically, we present a modal attention mechanism (MAM) for feature filtration, low-rank multimodal fusion (LMF) for model complexity reduction, and optimal weight concatenation (OWC) for maximizing feature integration. Extensive experiments on two public datasets, TCGA-GBMLGG and TCGA-KIRC, as well as a multi-center in-house lung adenocarcinoma (LUAD) dataset, validate the effectiveness of CA-MLIF and demonstrate that our method outperforms existing approaches in survival prediction under both pathology-gene fusion and CT-pathology fusion scenarios.
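The low-rank multimodal fusion component mentioned above follows a general pattern: project each modality with per-rank factor matrices, take the elementwise product across modalities, and sum over rank terms. A minimal sketch under that assumption (the dimensions, random features, and the function name `low_rank_fusion` are illustrative, not CA-MLIF's actual configuration):

```python
import numpy as np

rng = np.random.default_rng(0)
d_out, rank = 4, 3
x_path = rng.normal(size=5)   # pathology features (toy)
x_gene = rng.normal(size=6)   # gene features (toy)

# One factor matrix per modality per rank term.
W_path = rng.normal(size=(rank, d_out, 5))
W_gene = rng.normal(size=(rank, d_out, 6))

def low_rank_fusion(xs, Ws, rank, d_out):
    """Sum over rank terms of the elementwise product of the
    per-modality linear projections W[r] @ x."""
    fused = np.zeros(d_out)
    for r in range(rank):
        prod = np.ones(d_out)
        for x, W in zip(xs, Ws):
            prod = prod * (W[r] @ x)
        fused += prod
    return fused

h = low_rank_fusion([x_path, x_gene], [W_path, W_gene], rank, d_out)
```

The appeal of this factorization is cost: it avoids materializing the full cross-modal outer-product tensor, so parameters grow linearly with rank instead of multiplicatively with modality dimensions.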

NeurIPS Conference 2025 · Conference Paper

Quartet: Native FP4 Training Can Be Optimal for Large Language Models

  • Roberto Castro
  • Andrei Panferov
  • Rush Tabesh
  • Oliver Sieberling
  • Jiale Chen
  • Mahdi Nikdan
  • Saleh Ashkboos
  • Dan Alistarh

Training large language models (LLMs) directly in low precision offers a way to address computational costs by improving both throughput and energy efficiency. For those purposes, NVIDIA's recent Blackwell architecture facilitates very low-precision operations using FP4 variants. Yet, current algorithms for training LLMs in FP4 precision face significant accuracy degradation and often rely on mixed-precision fallbacks. In this paper, we investigate hardware-supported FP4 training and introduce a new approach for accurate, end-to-end FP4 training with all the major computations (i.e., linear layers) in low precision. Through extensive evaluations on Llama-type models, we reveal a new low-precision scaling law that quantifies performance trade-offs across bit-widths and training setups. Guided by this investigation, we design an "optimal" technique in terms of accuracy-vs-computation, called Quartet. We implement Quartet using optimized CUDA kernels tailored for Blackwell, demonstrating that fully FP4-based training is a competitive alternative to FP16 half-precision and FP8 training. Our code is available at https://github.com/IST-DASLab/Quartet.
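To give a feel for the FP4 number format this work targets: the common E2M1 variant can represent only the magnitudes {0, 0.5, 1, 1.5, 2, 3, 4, 6} (signed), so values must be scaled into that range and snapped to the grid. The following is an illustrative round-to-nearest simulation with a per-tensor scale, not Quartet's actual CUDA kernels or rounding scheme:

```python
# Signed E2M1 (FP4) representable values.
FP4_GRID = [0.0, 0.5, 1.0, 1.5, 2.0, 3.0, 4.0, 6.0]
FP4_VALUES = sorted({s * v for v in FP4_GRID for s in (-1.0, 1.0)})

def quantize_fp4(xs):
    """Scale inputs so the max magnitude maps to 6 (the largest E2M1
    value), snap each to the nearest representable FP4 value, and
    return the dequantized results plus the scale."""
    amax = max(abs(x) for x in xs) or 1.0
    scale = amax / 6.0
    quantized = [min(FP4_VALUES, key=lambda v: abs(x / scale - v)) for x in xs]
    return [q * scale for q in quantized], scale

vals, scale = quantize_fp4([0.1, -0.9, 2.4, -6.0])
```

With only 8 magnitudes per sign, the rounding error per element is large, which is why the scaling strategy and the placement of quantization inside the training step matter so much for end-to-end accuracy.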