
IROS 2025

TEM³-Learning: Time-Efficient Multimodal Multi-Task Learning for Advanced Assistive Driving

Conference Paper · Accepted Paper · Artificial Intelligence · Robotics

Abstract

Multi-task learning (MTL) can advance assistive driving by exploiting inter-task correlations through shared representations. However, existing methods face two critical limitations: single-modality constraints that limit comprehensive scene understanding, and inefficient architectures that impede real-time deployment. This paper proposes TEM³-Learning (Time-Efficient Multimodal Multi-task Learning), a novel framework that jointly optimizes driver emotion recognition, driver behavior recognition, traffic context recognition, and vehicle behavior recognition through a two-stage architecture. The first component, the Mamba-based multi-view temporal-spatial feature extraction subnetwork (MTS-Mamba), introduces a forward-backward temporal scanning mechanism and global-local spatial attention to efficiently extract low-cost temporal-spatial features from multi-view sequential images. The second component, the MTL-based gated multimodal feature integrator (MGMI), employs task-specific multi-gating modules to adaptively highlight the modality features most relevant to each task, effectively alleviating the negative-transfer problem in MTL. Evaluated on the AIDE dataset, the proposed model achieves state-of-the-art accuracy across all four tasks while maintaining a lightweight architecture with fewer than 6 million parameters and delivering an inference speed of 142.32 FPS. Rigorous ablation studies further validate the effectiveness of the proposed framework and the independent contributions of each module. The code is available at https://github.com/Wenzhuo-Liu/TEM3-Learning.
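The two ideas in the abstract can be illustrated with a minimal NumPy sketch. This is not the paper's implementation: the exponential-moving-average scan below is a toy stand-in for the Mamba state-space recurrence, and the gate matrix, feature dimensions, and function names are all illustrative assumptions.

```python
import numpy as np

def ema_scan(seq, decay=0.9):
    """Causal scan over a (T, D) sequence: a toy stand-in for the
    selective state-space recurrence used by Mamba-style models."""
    h = np.zeros(seq.shape[1])
    out = np.empty_like(seq)
    for t, x in enumerate(seq):
        h = decay * h + (1.0 - decay) * x
        out[t] = h
    return out

def forward_backward_scan(seq):
    """Scan the frame sequence in both directions and concatenate,
    mirroring the forward-backward temporal scanning idea."""
    fwd = ema_scan(seq)
    bwd = ema_scan(seq[::-1])[::-1]
    return np.concatenate([fwd, bwd], axis=-1)   # (T, 2 * D)

def softmax(x):
    e = np.exp(x - x.max(axis=-1, keepdims=True))
    return e / e.sum(axis=-1, keepdims=True)

def task_gated_fusion(modality_feats, gate_w):
    """Task-specific gated fusion over per-modality features.

    modality_feats: (M, D) array, one D-dim feature per modality.
    gate_w: hypothetical learned (M * D, M) gate for ONE task; each
    task would own its own gate, letting it up-weight the modalities
    it needs and suppress the rest (mitigating negative transfer).
    """
    logits = modality_feats.reshape(-1) @ gate_w  # (M,) gate logits
    alphas = softmax(logits)                      # weights sum to 1
    return (alphas[:, None] * modality_feats).sum(axis=0)  # (D,)
```

With an all-zero gate the weights are uniform and the fusion reduces to a plain average of the modality features; training would move the gate away from that baseline on a per-task basis.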

Keywords

  • Emotion recognition
  • Limiting
  • Scalability
  • Face recognition
  • Logic gates
  • Multitasking
  • Feature extraction
  • Real-time systems
  • Intelligent robots
  • Vehicles
  • Multi-task Learning
  • Multimodal Learning
  • Multi-modal Multi-task Learning
  • Inference Speed
  • Driver Behavior
  • Multimodal Features
  • Behavior Recognition
  • Spatial-temporal Features
  • Frames Per Second
  • Negative Transfer
  • Modal Features
  • Multi-view Images
  • Spatial Features
  • Object Detection
  • Real-time Performance
  • Recognition Task
  • Stochastic Gradient Descent
  • Convolution Operation
  • Multiple Tasks
  • State-space Model
  • Advanced Driver Assistance Systems
  • Multi-task Learning Framework
  • Multi-view Data
  • Parameter Count
  • Task Conflict
  • Task-specific Features
  • Driver State
  • Ablation Experiments
  • Traffic Environment
  • Self-attention Mechanism

Context

Venue
IEEE/RSJ International Conference on Intelligent Robots and Systems
Archive span
1988-2025
Indexed papers
26578
Paper id
1101699204927446157