
AAAI 2026

Native Speech Processing with LLMs

Short Paper · AAAI Undergraduate Consortium · Artificial Intelligence

Abstract

Recent advances in Large Language Models (LLMs) have achieved state-of-the-art performance in Automatic Speech Recognition (ASR), surpassing ASR-only systems such as Whisper. However, their application to other speech processing tasks, particularly speaker diarisation (SD), remains underexplored. This work proposes extending existing speech-aware LLM architectures with diarisation-specific training and context-based prompting to enable joint transcription and segmentation of multi-speaker audio. By exploiting the semantic reasoning and multilingual capabilities of pretrained LLMs, the proposed approach aims to improve diarisation accuracy, enhancing accessibility for assistive technologies and real-time captioning applications that rely on accurate speaker-aware transcriptions.
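The abstract describes joint transcription and segmentation of multi-speaker audio. A minimal sketch of the kind of speaker-attributed, time-stamped output such a system would produce is shown below; the `Segment` structure and the transcript format are illustrative assumptions, not the paper's actual interface.

```python
# Illustrative sketch of a speaker-aware transcript format for joint
# ASR + speaker diarisation. All names here are hypothetical assumptions.
from dataclasses import dataclass


@dataclass
class Segment:
    start: float   # segment start time in seconds
    end: float     # segment end time in seconds
    speaker: str   # diarisation label, e.g. "SPK1"
    text: str      # transcribed words for this segment


def to_speaker_aware_transcript(segments):
    """Render segments, ordered by start time, as a time-stamped,
    speaker-attributed transcript line per segment."""
    lines = []
    for seg in sorted(segments, key=lambda s: s.start):
        lines.append(f"[{seg.start:.2f}-{seg.end:.2f}] {seg.speaker}: {seg.text}")
    return "\n".join(lines)


segments = [
    Segment(0.0, 2.1, "SPK1", "Hello, thanks for joining."),
    Segment(2.3, 4.0, "SPK2", "Glad to be here."),
]
print(to_speaker_aware_transcript(segments))
```

Real-time captioning and assistive applications, as mentioned in the abstract, consume exactly this kind of speaker-attributed output, which is why diarisation accuracy matters downstream.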

Authors

Keywords

No keywords are indexed for this paper.

Context

Venue
AAAI Conference on Artificial Intelligence
Archive span
1980-2026
Indexed papers
28718
Paper id
1053653539855499188