
IROS 2025

TEM³-Learning: Time-Efficient Multimodal Multi-Task Learning for Advanced Assistive Driving

Conference Paper · Accepted Paper · Artificial Intelligence · Robotics

Abstract

Multi-task learning (MTL) can advance assistive driving by exploiting inter-task correlations through shared representations. However, existing methods face two critical limitations: single-modality constraints that limit comprehensive scene understanding, and inefficient architectures that impede real-time deployment. This paper proposes TEM³-Learning (Time-Efficient Multimodal Multi-task Learning), a novel framework that jointly optimizes driver emotion recognition, driver behavior recognition, traffic context recognition, and vehicle behavior recognition through a two-stage architecture. The first component, the Mamba-based multi-view temporal-spatial feature extraction subnetwork (MTS-Mamba), introduces a forward-backward temporal scanning mechanism and global-local spatial attention to efficiently extract low-cost temporal-spatial features from multi-view sequential images. The second component, the MTL-based gated multimodal feature integrator (MGMI), employs task-specific multi-gating modules to adaptively highlight the modality features most relevant to each task, effectively alleviating the negative-transfer problem in MTL. Evaluated on the AIDE dataset, the proposed model achieves state-of-the-art accuracy across all four tasks while maintaining a lightweight architecture with fewer than 6 million parameters and delivering an inference speed of 142.32 FPS. Rigorous ablation studies further validate the effectiveness of the proposed framework and the independent contributions of each module. The code is available at https://github.com/Wenzhuo-Liu/TEM3-Learning.
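The two ideas in the abstract can be illustrated with a minimal NumPy sketch. This is not the paper's implementation: the exponential-moving-average scan below is a toy stand-in for the Mamba state-space recurrence, and the gate matrix, feature dimensions, and function names are all illustrative assumptions.

```python
import numpy as np

def ema_scan(seq, decay=0.9):
    """Causal scan over a (T, D) sequence: a toy stand-in for the
    selective state-space recurrence used by Mamba-style models."""
    h = np.zeros(seq.shape[1])
    out = np.empty_like(seq)
    for t, x in enumerate(seq):
        h = decay * h + (1.0 - decay) * x
        out[t] = h
    return out

def forward_backward_scan(seq):
    """Scan the frame sequence in both directions and concatenate,
    mirroring the forward-backward temporal scanning idea."""
    fwd = ema_scan(seq)
    bwd = ema_scan(seq[::-1])[::-1]
    return np.concatenate([fwd, bwd], axis=-1)   # (T, 2 * D)

def softmax(x):
    e = np.exp(x - x.max(axis=-1, keepdims=True))
    return e / e.sum(axis=-1, keepdims=True)

def task_gated_fusion(modality_feats, gate_w):
    """Task-specific gated fusion over per-modality features.

    modality_feats: (M, D) array, one D-dim feature per modality.
    gate_w: hypothetical learned (M * D, M) gate for ONE task; each
    task would own its own gate, letting it up-weight the modalities
    it needs and suppress the rest (mitigating negative transfer).
    """
    logits = modality_feats.reshape(-1) @ gate_w  # (M,) gate logits
    alphas = softmax(logits)                      # weights sum to 1
    return (alphas[:, None] * modality_feats).sum(axis=0)  # (D,)
```

With an all-zero gate the weights are uniform and the fusion reduces to a plain average of the modality features; training would move the gate away from that baseline on a per-task basis.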

Keywords

  • Emotion recognition
  • Limiting
  • Scalability
  • Face recognition
  • Logic gates
  • Multitasking
  • Feature extraction
  • Real-time systems
  • Intelligent robots
  • Vehicles
  • Multi-task Learning
  • Multimodal Learning
  • Multi-modal Multi-task Learning
  • Inference Speed
  • Driver Behavior
  • Multimodal Features
  • Behavior Recognition
  • Spatial-temporal Features
  • Frames Per Second
  • Negative Transfer
  • Modal Features
  • Multi-view Images
  • Spatial Features
  • Object Detection
  • Real-time Performance
  • Recognition Task
  • Stochastic Gradient Descent
  • Convolution Operation
  • Multiple Tasks
  • State-space Model
  • Advanced Driver Assistance Systems
  • Multi-task Learning Framework
  • Multi-view Data
  • Parameter Count
  • Task Conflict
  • Task-specific Features
  • Driver State
  • Ablation Experiments
  • Traffic Environment
  • Self-attention Mechanism

Context

Venue
IEEE/RSJ International Conference on Intelligent Robots and Systems
Archive span
1988-2025
Indexed papers
26578
Paper id
1101699204927446157