FusedRec: Fused Embedding Communication for Distributed Recommendation Training on GPUs

Xuanteng Huang; Fan Li; Riyang Hu; Jianchang Zhang; Yuan Peng; Yang Zhou; Fangying Chen; Xianwei Zhang

doi:10.1609/aaai.v40i17.38512


AAAI 2026

Conference Paper AAAI Technical Track on Data Mining & Knowledge Management I Artificial Intelligence


Abstract

Recent years have witnessed the wide adoption of deep learning recommendation models (DLRMs) in many online services. Unlike traditional DNN training, DLRMs rely on massive embedding tables to represent sparse features, and these tables are stored across distributed GPUs following the model-parallel paradigm. Existing approaches adopt deduplication to eliminate replicated embeddings from AlltoAll transfers and thereby avoid unnecessary communication. In practice, we have observed that such a deduplication design exacerbates interconnect inefficiency: it fragments embedding transfers into smaller messages, which hinders the performance of distributed DLRM training. This paper introduces FusedRec, a fused embedding communication and lookup mechanism that tackles the inefficiency caused by deduplication. By seeking opportunities to fuse embeddings from multiple categories into a group, FusedRec performs the communication in a single combined shot to alleviate bandwidth under-utilization. Meanwhile, a categorical-aware hashing algorithm is integrated into FusedRec to retain the category information during lookup without extra communication. Combined with efficient unique and recovery operations, comprehensive results show that FusedRec achieves a 37.8% throughput speedup on average compared to the state-of-the-art industry implementation, without degrading the recommendation quality of our in-house models used in online production environments.
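
The ideas named in the abstract (fusing several categories into one group, keeping the category recoverable from the key, and pairing a unique step with a recovery step) can be illustrated with a small single-process sketch. This is not the FusedRec implementation, and the abstract does not specify the hashing scheme: the bit-packing in `fuse_keys`, the 48-bit ID width, and all function names below are assumptions chosen purely for illustration. No GPU or AlltoAll call is involved; the `unique` list simply stands in for the payload that would be sent in one combined message.

```python
# Illustrative sketch only: names and the bit-packing scheme are hypothetical,
# not taken from the paper.

ID_BITS = 48  # assumed width reserved for the raw feature ID


def fuse_keys(ids_by_category, id_bits=ID_BITS):
    """Pack (category, id) pairs into single integer keys.

    Keys from different categories never collide, so all categories can
    share one buffer and one deduplication pass.
    """
    fused = []
    for category, ids in enumerate(ids_by_category):
        for raw_id in ids:
            if not 0 <= raw_id < (1 << id_bits):
                raise ValueError("raw_id does not fit in id_bits")
            fused.append((category << id_bits) | raw_id)
    return fused


def split_key(key, id_bits=ID_BITS):
    """Recover (category, id) from a fused key without extra metadata."""
    return key >> id_bits, key & ((1 << id_bits) - 1)


def unique_with_inverse(keys):
    """Deduplicate keys in first-seen order and record how to undo it."""
    position = {}
    unique, inverse = [], []
    for key in keys:
        if key not in position:
            position[key] = len(unique)
            unique.append(key)
        inverse.append(position[key])
    return unique, inverse


def recover(rows_for_unique, inverse):
    """Expand per-unique-key rows back to one row per original request."""
    return [rows_for_unique[i] for i in inverse]


# Two categories whose raw IDs overlap (ID 7 appears in both).
ids_by_category = [[3, 7, 3], [7, 7, 1]]
fused = fuse_keys(ids_by_category)
unique, inverse = unique_with_inverse(fused)

# Stand-in for the remote lookup: each "row" is just the decoded key.
rows = [split_key(k) for k in unique]
restored = recover(rows, inverse)
```

In this toy run the six requested IDs collapse to four unique keys in a single buffer, and ID 7 stays distinct between the two categories because the category index occupies the high bits of the key.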
