Kelong Mao Papers

NeurIPS 2025 Conference Paper

UniGist: Towards General and Hardware-aligned Sequence-level Long Context Compression

Chenlong Deng
Zhisong Zhang
Kelong Mao
Shuaiyi Li
Tianqing Fang
Hongming Zhang
Haitao Mi
Dong Yu

Large language models are increasingly capable of handling long-context inputs, but the memory overhead of KV cache remains a major bottleneck for general-purpose deployment. While many compression strategies have been explored, sequence-level compression is particularly challenging due to its tendency to lose important details. We present UniGist, a gist token-based long context compression framework that removes the need for chunk-wise training, enabling the model to learn how to compress and utilize long-range context during training. To fully exploit the sparsity, we introduce a gist shift trick that transforms the attention layout into a right-aligned block structure and develop a block-table-free sparse attention kernel based on it. UniGist further supports one-pass training and flexible chunk sizes during inference, allowing efficient and adaptive context processing. Experiments across multiple long-context tasks show that UniGist significantly improves compression quality, with especially strong performance in recalling details and long-range dependency modeling.
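As a rough illustration of the sparsity that gist-token compression creates, the sketch below builds a generic gist-style attention mask: each chunk of raw tokens is followed by a few gist tokens, and later chunks can see earlier chunks only through those gist tokens. This is an assumption about the general technique, not the layout used in the paper; the gist shift trick and the sparse attention kernel are not reproduced here, and the function name and parameters are hypothetical.

```python
def gist_attention_mask(num_chunks, chunk_size, gists_per_chunk):
    """Build a causal attention mask for chunks that each end in gist tokens.

    Hypothetical helper for illustration only. mask[q][k] is True when
    query position q may attend to key position k.
    """
    block = chunk_size + gists_per_chunk      # raw tokens plus their gist tokens
    n = num_chunks * block
    # Within every block, the trailing positions are the gist tokens.
    is_gist = [(i % block) >= chunk_size for i in range(n)]
    mask = [[False] * n for _ in range(n)]
    for q in range(n):
        for k in range(q + 1):                # causal: never attend to the future
            same_chunk = (q // block) == (k // block)
            # Inside the current chunk everything earlier is visible;
            # from earlier chunks only the gist tokens remain visible.
            mask[q][k] = same_chunk or is_gist[k]
    return mask, is_gist


mask, is_gist = gist_attention_mask(num_chunks=2, chunk_size=3, gists_per_chunk=1)
visible = sum(mask[5])   # keys visible to a raw token in the second chunk
```

With two chunks of three raw tokens and one gist token each, the query at position 5 attends to three keys (the first chunk's gist token plus two positions in its own chunk) instead of the six a dense causal mask would allow. Under this assumption, raw keys and values from completed chunks are never attended to again and could be dropped from the KV cache.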


AAAI 2023 Conference Paper

FinalMLP: An Enhanced Two-Stream MLP Model for CTR Prediction

Kelong Mao
Jieming Zhu
Liangcai Su
Guohao Cai
Yuru Li
Zhenhua Dong

Click-through rate (CTR) prediction is one of the fundamental tasks in online advertising and recommendation. Multi-layer perceptron (MLP) serves as a core component in many deep CTR prediction models, but it has been widely shown that applying a vanilla MLP network alone is ineffective in learning complex feature interactions. As such, many two-stream models (e.g., Wide&Deep, DeepFM, and DCN) have recently been proposed, aiming to integrate two parallel sub-networks to learn feature interactions from two different views for enhanced CTR prediction. In addition to one MLP stream that learns feature interactions implicitly, most of the existing research focuses on designing another stream to complement the MLP stream with explicitly enhanced feature interactions. Instead, this paper presents a simple two-stream feature interaction model, namely FinalMLP, which employs only MLPs in both streams yet achieves surprisingly strong performance. In contrast to sophisticated network design in each stream, our work enhances CTR modeling through a feature selection module, which produces differentiated feature inputs to two streams, and a group-wise bilinear fusion module, which effectively captures stream-level interactions across two streams. We show that FinalMLP achieves competitive or even better performance against many existing two-stream CTR models on four open benchmark datasets and also brings significant CTR improvements during an online A/B test in our industrial news recommender system. We envision that the simple yet effective FinalMLP model could serve as a new strong baseline for future development of two-stream CTR models. Our source code will be available at MindSpore/models and FuxiCTR/model_zoo.
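One plausible reading of group-wise bilinear fusion can be sketched as follows: the two stream outputs are split into the same number of groups, each pair of groups is combined through its own small bilinear form, and the results are summed into a single logit. The abstract does not give the formulation, so the function names, the shapes, and the omission of any bias or first-order terms are assumptions made for illustration.

```python
def bilinear(x, W, y):
    """Compute x^T W y for plain Python lists (W has len(x) rows, len(y) columns)."""
    return sum(x[i] * W[i][j] * y[j]
               for i in range(len(x))
               for j in range(len(y)))


def groupwise_bilinear_fusion(o1, o2, weights):
    """Fuse two stream outputs with one bilinear form per group.

    Hypothetical sketch: o1 and o2 are the output vectors of the two MLP
    streams, and weights holds one (len(o1)/k x len(o2)/k) matrix for each
    of the k groups.
    """
    k = len(weights)
    if len(o1) % k or len(o2) % k:
        raise ValueError("stream outputs must split evenly into the groups")
    s1, s2 = len(o1) // k, len(o2) // k
    logit = 0.0
    for g, W in enumerate(weights):
        x = o1[g * s1:(g + 1) * s1]           # g-th slice of stream 1
        y = o2[g * s2:(g + 1) * s2]           # g-th slice of stream 2
        logit += bilinear(x, W, y)
    return logit


identity = [[1.0, 0.0], [0.0, 1.0]]
logit = groupwise_bilinear_fusion([1.0, 2.0, 3.0, 4.0],
                                  [1.0, 0.0, 0.0, 1.0],
                                  [identity, identity])
```

In this sketch, a single bilinear form over stream outputs of sizes d1 and d2 would need d1 * d2 weights, while k groups need only d1 * d2 / k, which is the usual motivation for grouping.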

