
AAAI 2026

FusedRec: Fused Embedding Communication for Distributed Recommendation Training on GPUs

Conference Paper
AAAI Technical Track on Data Mining & Knowledge Management I
Artificial Intelligence

Abstract

Recent years have witnessed the wide adoption of deep learning recommendation models (DLRMs) for many online services. Unlike traditional DNN training, DLRM training relies on massive embedding tables to represent sparse features, and these tables are distributed across GPUs following the model-parallel paradigm. Existing approaches apply deduplication to eliminate replicated embeddings in AllToAll transfers and thereby avoid unnecessary communication. In practice, we have observed that such deduplication exacerbates interconnect inefficiency: it fragments embedding transfers into messages of reduced size, hindering the performance of distributed DLRM training. This paper introduces FusedRec, a fused embedding communication and lookup mechanism that tackles this deduplication-induced inefficiency. By fusing embeddings from multiple categories into a group, FusedRec performs the communication in a single combined transfer, alleviating bandwidth under-utilization. Meanwhile, a categorical-aware hashing algorithm is integrated into FusedRec to retain category information during lookup without extra communication. Comprehensive results show that, combined with efficient unique and recovery operations, FusedRec achieves an average throughput speedup of 37.8% over the state-of-the-art (SOTA) industry implementation, without hurting the recommendation quality of our in-house models used in online production environments.
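The abstract does not specify FusedRec's hashing scheme or communication code, so the following is only a minimal Python sketch of the general idea under stated assumptions. IDs from several categories are packed into one key space so that a single unique pass deduplicates them without losing the category, the unique keys are grouped into one fused send buffer per destination GPU, and looked-up rows are scattered back via the inverse mapping. The bit-packing `encode`/`decode` scheme, `ID_BITS`, the `raw_id % world_size` placement rule, and every function name here are hypothetical illustrations, not the paper's design.

```python
from typing import Dict, List, Tuple

# Hypothetical encoding: raw IDs are assumed to fit in ID_BITS bits so the
# category index can be packed into the high bits of one integer key.
ID_BITS = 40
ID_MASK = (1 << ID_BITS) - 1


def encode(category: int, raw_id: int) -> int:
    """Pack (category, raw_id) into a single category-aware key."""
    if not 0 <= raw_id <= ID_MASK:
        raise ValueError("raw_id does not fit in ID_BITS")
    return (category << ID_BITS) | raw_id


def decode(key: int) -> Tuple[int, int]:
    """Unpack a key back into (category, raw_id); no side channel is needed."""
    return key >> ID_BITS, key & ID_MASK


def fused_unique(ids_per_category: List[List[int]]) -> Tuple[List[int], List[int]]:
    """Deduplicate all categories in one pass.

    Returns unique keys in first-seen order and, for every input ID, the index
    of its key in that list (the inverse mapping used later for recovery).
    """
    unique_keys: List[int] = []
    slot: Dict[int, int] = {}
    inverse: List[int] = []
    for category, ids in enumerate(ids_per_category):
        for raw_id in ids:
            key = encode(category, raw_id)
            if key not in slot:
                slot[key] = len(unique_keys)
                unique_keys.append(key)
            inverse.append(slot[key])
    return unique_keys, inverse


def bucket_by_rank(keys: List[int], world_size: int) -> List[List[int]]:
    """Build one fused send buffer per destination GPU.

    Placement by raw_id % world_size is a hypothetical row-wise sharding rule.
    """
    buckets: List[List[int]] = [[] for _ in range(world_size)]
    for key in keys:
        _, raw_id = decode(key)
        buckets[raw_id % world_size].append(key)
    return buckets


def recover(ids_per_category: List[List[int]], unique_rows: list,
            inverse: List[int]) -> List[list]:
    """Scatter rows fetched for unique keys back to every original position."""
    out: List[list] = []
    offset = 0
    for ids in ids_per_category:
        out.append([unique_rows[inverse[offset + i]] for i in range(len(ids))])
        offset += len(ids)
    return out


if __name__ == "__main__":
    batch = [[4, 3, 4], [4, 5]]  # ID 4 appears in both categories
    keys, inv = fused_unique(batch)
    print(len(keys), inv)  # 4 [0, 1, 0, 2, 3]
    print([[decode(k) for k in b] for b in bucket_by_rank(keys, world_size=2)])
    # prints [[(0, 4), (1, 4)], [(0, 3), (1, 5)]]
    # Stand-in "lookup": return the decoded (category, raw_id) as the row.
    rows = [decode(k) for k in keys]
    print(recover(batch, rows, inv))
    # prints [[(0, 4), (0, 3), (0, 4)], [(1, 4), (1, 5)]]
```

In this toy batch, ID 4 occurs in both categories. Packing the category into the key keeps the two entries distinct after one deduplication pass, and each of the two GPUs receives a single two-key buffer instead of two single-key buffers (one per category), which is the fragmentation the abstract describes.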



Context

Venue
AAAI Conference on Artificial Intelligence
Paper id
136198661033080138