Arrow Research search

Author name cluster

Jieming Zhu

Possible papers associated with this exact author name in Arrow. This page groups case-insensitive exact name matches and is not a full identity disambiguation profile.

15 papers
2 author rows

Possible papers (15)

AAAI Conference 2026 Conference Paper

Length-Adaptive Interest Network for Balancing Long and Short Sequence Modeling in CTR Prediction

  • Zhicheng Zhang
  • Zhaocheng Du
  • Jieming Zhu
  • Jiwei Tang
  • Fengyuan Lu
  • Wang Jiaheng
  • Song-Li Wu
  • Qianhui Zhu

User behavior sequences in modern recommendation systems exhibit significant length heterogeneity, ranging from sparse short-term interactions to rich long-term histories. While longer sequences provide more context, we observe that increasing the maximum input sequence length in existing CTR models paradoxically degrades performance for short-sequence users due to attention polarization and length imbalance in training data. To address this, we propose LAIN (Length-Adaptive Interest Network), a plug-and-play framework that explicitly incorporates sequence length as a conditioning signal to balance long- and short-sequence modeling. LAIN consists of three lightweight components: a Spectral Length Encoder that maps length into continuous representations, Length-Conditioned Prompting that injects global contextual cues into both long- and short-term behavior branches, and Length-Modulated Attention that adaptively adjusts attention sharpness based on sequence length. Extensive experiments on three real-world benchmarks across five strong CTR backbones show that LAIN consistently improves overall performance, achieving up to 1.15% AUC gain and 2.25% log loss reduction. Notably, our method significantly improves accuracy for short-sequence users without sacrificing long-sequence effectiveness. Our work offers a general, efficient, and deployable solution to mitigate length-induced bias in sequential recommendation.
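
The abstract does not spell out the modulation function, but the core Length-Modulated Attention idea can be illustrated with a minimal PyTorch sketch: condition the softmax temperature of target-to-behavior attention on the (log) sequence length, so attention sharpness adapts to how much history a user has. The module name, the temperature MLP, and the shapes below are illustrative assumptions, not LAIN's actual design.

```python
import torch
import torch.nn as nn
import torch.nn.functional as F

class LengthModulatedAttention(nn.Module):
    """Illustrative sketch: target attention whose sharpness depends on sequence length.

    The modulation form (a learned positive multiplier from log-length applied to the
    softmax logits) is an assumption for illustration, not LAIN's published design.
    """

    def __init__(self, dim: int):
        super().__init__()
        self.q_proj = nn.Linear(dim, dim)
        self.k_proj = nn.Linear(dim, dim)
        self.v_proj = nn.Linear(dim, dim)
        # Maps log(sequence length) to a positive temperature multiplier.
        self.temp_mlp = nn.Sequential(
            nn.Linear(1, 16), nn.ReLU(), nn.Linear(16, 1), nn.Softplus()
        )

    def forward(self, target: torch.Tensor, behaviors: torch.Tensor,
                lengths: torch.Tensor) -> torch.Tensor:
        # target: (B, D) candidate-item embedding; behaviors: (B, L, D); lengths: (B,)
        q = self.q_proj(target).unsqueeze(1)              # (B, 1, D)
        k = self.k_proj(behaviors)                        # (B, L, D)
        v = self.v_proj(behaviors)                        # (B, L, D)
        logits = (q * k).sum(-1) / k.size(-1) ** 0.5      # (B, L)

        # Mask padded positions beyond each user's true length.
        pos = torch.arange(behaviors.size(1), device=behaviors.device)
        logits = logits.masked_fill(pos[None, :] >= lengths[:, None], float("-inf"))

        # Length-conditioned sharpness: the temperature is a function of history length.
        temp = self.temp_mlp(lengths.float().clamp(min=1).log().unsqueeze(-1)) + 1e-4  # (B, 1)
        attn = F.softmax(logits * temp, dim=-1)           # (B, L)
        return (attn.unsqueeze(-1) * v).sum(1)            # (B, D) pooled interest vector
```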

AAAI Conference 2026 Conference Paper

Suit the Remedy to the Retriever: Interpretable Query Optimization with Retriever Preference Alignment for Vision-Language Retrieval

  • GuangHao Meng
  • Jinpeng Wang
  • Jieming Zhu
  • Letian Zhang
  • Yong Jiang
  • Dan Zhao
  • Qing Li

Vision-language retrieval (VLR), which uses text or image queries to retrieve corresponding cross-modal content, plays a crucial role in multimedia and computer vision tasks. However, challenging concepts in queries often confuse retrievers, limiting their ability to align concepts with visual content. Existing query optimization methods neglect retrievers' preferences (i.e., text descriptions that better match their corresponding visual content), so the optimized queries are not adapted to the retriever, leading to suboptimal performance. To address this, we propose Retriever-Adaptive Query Optimization (RAQO), an interpretable framework that rewrites queries based on retriever-specific preferences. Specifically, we first leverage multimodal large language models (MLLMs) and retriever feedback to construct the MLLMs-Driven Preference-Aware Dataset Engine (MPADE), which automatically refines queries offline and captures the retriever's implicit preferences. Then, we introduce a "detect-then-rewrite" chain-of-thought rewriting (ReCoT) strategy equipped with a progressive preference alignment pipeline consisting of three stages: ambiguity detection fine-tuning, query rewriting fine-tuning, and preference rank optimization. This design enables the rewriter to focus on confusing concepts and produce retriever-adapted, high-quality queries. Extensive experiments on VLR benchmarks demonstrate the superiority of RAQO in cross-modal retrieval, as well as its interpretability, generalizability, and transferability.

NeurIPS Conference 2025 Conference Paper

Can LLMs Outshine Conventional Recommenders? A Comparative Evaluation

  • Qijiong Liu
  • Jieming Zhu
  • Lu Fan
  • Kun Wang
  • Hengchang Hu
  • Wei Guo
  • Yong Liu
  • Xiao-ming Wu

Integrating large language models (LLMs) into recommender systems has created new opportunities for improving recommendation quality. However, a comprehensive benchmark is needed to thoroughly evaluate and compare the recommendation capabilities of LLMs with traditional recommender systems. In this paper, we introduce RecBench, which systematically investigates various item representation forms (including unique identifier, text, semantic embedding, and semantic identifier) and evaluates two primary recommendation tasks, i.e., click-through rate prediction (CTR) and sequential recommendation (SeqRec). Our extensive experiments cover up to 17 large models and are conducted across five diverse datasets from the fashion, news, video, books, and music domains. Our findings indicate that LLM-based recommenders outperform conventional recommenders, achieving up to a 5% AUC improvement in CTR and up to a 170% NDCG@10 improvement in SeqRec. However, these substantial performance gains come at the expense of significantly reduced inference efficiency, rendering LLMs impractical as real-time recommenders. We have released our code and data to enable other researchers to reproduce and build upon our experimental results.

IJCAI Conference 2025 Conference Paper

Device-Cloud Collaborative Correction for On-Device Recommendation

  • Tianyu Zhan
  • Shengyu Zhang
  • Zheqi Lv
  • Jieming Zhu
  • Jiwei Li
  • Fan Wu
  • Fei Wu

With the rapid development of recommendation models and device computing power, on-device recommendation has become an important research area due to its better real-time performance and privacy protection. Previously, Transformer-based sequential recommendation models have been widely applied in this field because they outperform Recurrent Neural Network (RNN)-based recommendation models. However, as the length of interaction sequences increases, Transformer-based models introduce significantly more space and computational overhead than RNN-based models, posing challenges for on-device recommendation. To balance real-time responsiveness and recommendation quality on devices, we propose the Device-Cloud Collaborative Correction Framework for On-Device Recommendation (CoCorrRec). CoCorrRec uses a self-correction network (SCN) to correct parameters with extremely low time cost. By updating model parameters during testing based on the input token, it achieves performance comparable to current optimal but more complex Transformer-based models. Furthermore, to prevent SCN from overfitting, we design a global correction network (GCN) that processes hidden states uploaded from devices and provides a global correction solution. Extensive experiments on multiple datasets show that CoCorrRec outperforms existing Transformer-based and RNN-based on-device recommendation models with fewer parameters and lower FLOPs, thereby achieving a balance between real-time performance and high efficiency. Code is available at https://github.com/Yuzt-zju/CoCorrRec.

AAAI Conference 2025 Conference Paper

EvdCLIP: Improving Vision-Language Retrieval with Entity Visual Descriptions from Large Language Models

  • GuangHao Meng
  • Sunan He
  • Jinpeng Wang
  • Tao Dai
  • Letian Zhang
  • Jieming Zhu
  • Qing Li
  • Gang Wang

Vision-language retrieval (VLR), which involves using text (or images) as queries to retrieve corresponding images (or text), has attracted significant attention in both academia and industry. However, existing methods often neglect the rich visual semantic knowledge of entities, leading to incorrect retrieval results. To address this problem, we propose Entity Visual Description enhanced CLIP (EvdCLIP), designed to leverage the visual knowledge of entities to enrich queries. Specifically, since humans recognize entities through visual cues, we employ a large language model (LLM) to generate Entity Visual Descriptions (EVDs) as alignment cues to complement textual data. These EVDs are then integrated into raw queries to create visually rich, EVD-enhanced queries. Furthermore, recognizing that EVD-enhanced queries may introduce noise or low-quality expansions, we develop a novel, trainable EVD-aware Rewriter (EaRW) for vision-language retrieval tasks. EaRW utilizes EVD knowledge and the generative capabilities of the language model to effectively rewrite queries. With our specialized training strategy, EaRW can generate high-quality and low-noise EVD-enhanced queries. Extensive quantitative and qualitative experiments on image-text retrieval benchmarks validate the superiority of EvdCLIP on vision-language retrieval tasks.
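
As a rough illustration of the EVD idea (not EvdCLIP's actual pipeline or prompts), a query can be enriched by asking an LLM for a short visual description of each detected entity and appending the results before text encoding. The `ask_llm` callable, the prompt wording, and the entity list below are placeholders.

```python
from typing import Callable, List

def build_evd_query(query: str, entities: List[str],
                    ask_llm: Callable[[str], str]) -> str:
    """Append LLM-generated visual descriptions of query entities (illustrative only)."""
    if not entities:
        return query
    descriptions = []
    for entity in entities:
        # ask_llm is any text-completion callable supplied by the caller.
        desc = ask_llm(
            f"Describe the typical visual appearance of '{entity}' in one short phrase."
        )
        descriptions.append(f"{entity} ({desc.strip()})")
    # EVD-enhanced query: original text plus visual cues for its entities.
    return f"{query} " + "; ".join(descriptions)
```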

NeurIPS Conference 2025 Conference Paper

MemSim: A Bayesian Simulator for Evaluating Memory of LLM-based Personal Assistants

  • Zeyu Zhang
  • Quanyu Dai
  • Luyu Chen
  • Zeren Jiang
  • Rui Li
  • Jieming Zhu
  • Xu Chen
  • Yi Xie

LLM-based agents have been widely applied as personal assistants, capable of memorizing information from user messages and responding to personal queries. However, an objective and automatic evaluation of their memory capability is still lacking, largely due to the challenges of constructing reliable questions and answers (QAs) from user messages. In this paper, we propose MemSim, a Bayesian simulator designed to automatically construct reliable QAs from generated user messages while maintaining their diversity and scalability. Specifically, we introduce the Bayesian Relation Network (BRNet) and a causal generation mechanism to mitigate the impact of LLM hallucinations on factual information, facilitating the automatic creation of an evaluation dataset. Based on MemSim, we generate a dataset for the daily-life scenario, named MemDaily, and conduct extensive experiments to assess the effectiveness of our approach. We also provide a benchmark for evaluating different memory mechanisms in LLM-based agents with the MemDaily dataset.

NeurIPS Conference 2025 Conference Paper

Personalized Visual Content Generation in Conversational Systems

  • Xianquan Wang
  • Zhaocheng Du
  • Huibo Xu
  • Shukang Yin
  • Yupeng Han
  • Jieming Zhu
  • Kai Zhang
  • Qi Liu

With the rapid progress of large language models (LLMs) and diffusion models, there has been growing interest in personalized content generation. However, current conversational systems often present the same recommended content to all users, falling into the dilemma of "one-size-fits-all." To break this limitation and boost user engagement, in this paper we introduce PCG (Personalized Visual Content Generation), a unified framework for personalizing item images within conversational systems. We tackle two key bottlenecks: the depth of personalization and the fidelity of generated images. Specifically, an LLM-powered Inclinations Analyzer is adopted to capture user likes and dislikes from context and construct personalized prompts. Moreover, we design a dual-stage LoRA mechanism: Global LoRA for understanding task-specific visual style, and Local LoRA for capturing preferred visual elements from conversation history. During training, we introduce a visual content conditioning method to ensure that LoRA learns the historical visual context while maintaining fidelity to the original item images. Extensive experiments on benchmark conversational datasets, including objective metrics and GPT-based evaluations, demonstrate that our framework outperforms strong baselines, highlighting its potential to redefine personalization in visual content generation for conversational scenarios such as e-commerce and real-world recommendation.

NeurIPS Conference 2023 Conference Paper

Achieving Cross Modal Generalization with Multimodal Unified Representation

  • Yan Xia
  • Hai Huang
  • Jieming Zhu
  • Zhou Zhao

This paper introduces a novel task called Cross Modal Generalization (CMG), which addresses the challenge of learning a unified discrete representation from paired multimodal data during pre-training. In downstream tasks, the model can then achieve zero-shot generalization to other modalities when only one modality is labeled. Existing approaches in multimodal representation learning focus more on coarse-grained alignment or rely on the assumption that information from different modalities is completely aligned, which is impractical in real-world scenarios. To overcome this limitation, we propose Uni-Code, which contains two key contributions: the Dual Cross-modal Information Disentangling (DCID) module and the Multi-Modal Exponential Moving Average (MM-EMA). These methods facilitate bidirectional supervision between modalities and align semantically equivalent information in a shared discrete latent space, enabling fine-grained unified representation of multimodal sequences. During pre-training, we investigate various modality combinations, including audio-visual, audio-text, and the tri-modal combination of audio-visual-text. Extensive experiments on various downstream tasks, i.e., cross-modal event classification, localization, cross-modal retrieval, query-based video segmentation, and cross-dataset event localization, demonstrate the effectiveness of our proposed methods. The code is available at https://github.com/haihuangcode/CMG.
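
The MM-EMA component builds on exponential-moving-average codebook updates in a shared discrete latent space. The sketch below shows the standard EMA vector-quantization update that this style of method relies on; Uni-Code's cross-modal teacher-student weighting and the DCID module are not reproduced, so treat this as background rather than the paper's algorithm.

```python
import torch
import torch.nn.functional as F

def ema_codebook_update(codebook, ema_counts, ema_sums, z, decay=0.99, eps=1e-5):
    """One step of an EMA vector-quantization codebook update (VQ-VAE style sketch).

    codebook: (K, D) code vectors shared across modalities; z: (N, D) encoder
    outputs from any single modality; ema_counts: (K,); ema_sums: (K, D).
    """
    # Assign each encoder output to its nearest code.
    dists = torch.cdist(z, codebook)                      # (N, K)
    codes = dists.argmin(dim=1)                           # (N,)
    onehot = F.one_hot(codes, codebook.size(0)).float()   # (N, K)

    # EMA statistics: usage counts and summed assigned vectors per code.
    ema_counts.mul_(decay).add_(onehot.sum(0), alpha=1 - decay)
    ema_sums.mul_(decay).add_(onehot.t() @ z, alpha=1 - decay)

    # Laplace-smoothed means become the new code vectors.
    n = ema_counts.sum()
    smoothed = (ema_counts + eps) / (n + codebook.size(0) * eps) * n
    codebook.copy_(ema_sums / smoothed.unsqueeze(1))
    return codes
```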

NeurIPS Conference 2023 Conference Paper

Cross-modal Prompts: Adapting Large Pre-trained Models for Audio-Visual Downstream Tasks

  • Haoyi Duan
  • Yan Xia
  • Zhou Mingze
  • Li Tang
  • Jieming Zhu
  • Zhou Zhao

In recent years, the deployment of large-scale pre-trained models in audio-visual downstream tasks has yielded remarkable outcomes. However, these models, primarily trained on single-modality unconstrained datasets, still encounter challenges in feature extraction for multi-modal tasks, leading to suboptimal performance. This limitation arises due to the introduction of irrelevant modality-specific information during encoding, which adversely affects the performance of downstream tasks. To address this challenge, this paper proposes a novel Dual-Guided Spatial-Channel-Temporal (DG-SCT) attention mechanism. This mechanism leverages audio and visual modalities as soft prompts to dynamically adjust the parameters of pre-trained models based on the current multi-modal input features. Specifically, the DG-SCT module incorporates trainable cross-modal interaction layers into pre-trained audio-visual encoders, allowing adaptive extraction of crucial information from the current modality across spatial, channel, and temporal dimensions, while preserving the frozen parameters of large-scale pre-trained models. Experimental evaluations demonstrate that our proposed model achieves state-of-the-art results across multiple downstream tasks, including AVE, AVVP, AVS, and AVQA. Furthermore, our model exhibits promising performance in challenging few-shot and zero-shot scenarios. The source code and pre-trained models are available at https://github.com/haoyi-duan/DG-SCT.

AAAI Conference 2023 Conference Paper

FinalMLP: An Enhanced Two-Stream MLP Model for CTR Prediction

  • Kelong Mao
  • Jieming Zhu
  • Liangcai Su
  • Guohao Cai
  • Yuru Li
  • Zhenhua Dong

Click-through rate (CTR) prediction is one of the fundamental tasks in online advertising and recommendation. Multi-layer perceptron (MLP) serves as a core component in many deep CTR prediction models, but it has been widely shown that applying a vanilla MLP network alone is ineffective in learning complex feature interactions. As such, many two-stream models (e.g., Wide&Deep, DeepFM, and DCN) have recently been proposed, aiming to integrate two parallel sub-networks to learn feature interactions from two different views for enhanced CTR prediction. In addition to one MLP stream that learns feature interactions implicitly, most of the existing research focuses on designing another stream to complement the MLP stream with explicitly enhanced feature interactions. Instead, this paper presents a simple two-stream feature interaction model, namely FinalMLP, which employs only MLPs in both streams yet achieves surprisingly strong performance. In contrast to sophisticated network design in each stream, our work enhances CTR modeling through a feature selection module, which produces differentiated feature inputs to two streams, and a group-wise bilinear fusion module, which effectively captures stream-level interactions across two streams. We show that FinalMLP achieves competitive or even better performance against many existing two-stream CTR models on four open benchmark datasets and also brings significant CTR improvements during an online A/B test in our industrial news recommender system. We envision that the simple yet effective FinalMLP model could serve as a new strong baseline for future development of two-stream CTR models. Our source code will be available at MindSpore/models and FuxiCTR/model_zoo.
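
As a rough sketch of the two ideas named in the abstract, the snippet below fuses the outputs of two MLP streams with a group-wise bilinear term plus first-order terms. The exact feature-selection gating, dimensions, and initialization in FinalMLP differ; per the abstract, the reference code is planned for MindSpore/models and FuxiCTR/model_zoo.

```python
import torch
import torch.nn as nn

class GroupwiseBilinearFusion(nn.Module):
    """Sketch of a two-stream bilinear fusion head in the spirit of FinalMLP (illustrative)."""

    def __init__(self, dim1: int, dim2: int, num_groups: int = 4):
        super().__init__()
        assert dim1 % num_groups == 0 and dim2 % num_groups == 0
        self.k = num_groups
        self.d1, self.d2 = dim1 // num_groups, dim2 // num_groups
        self.w1 = nn.Linear(dim1, 1)   # first-order term from stream 1
        self.w2 = nn.Linear(dim2, 1)   # first-order term from stream 2
        # One small bilinear matrix per group captures stream-level interactions.
        self.w12 = nn.Parameter(torch.randn(num_groups, self.d1, self.d2) * 0.01)

    def forward(self, h1: torch.Tensor, h2: torch.Tensor) -> torch.Tensor:
        # h1: (B, dim1) output of MLP stream 1; h2: (B, dim2) output of MLP stream 2.
        b = h1.size(0)
        g1 = h1.view(b, self.k, self.d1)   # (B, k, d1)
        g2 = h2.view(b, self.k, self.d2)   # (B, k, d2)
        # Group-wise bilinear interaction: sum over groups of g1_g^T W_g g2_g.
        bilinear = torch.einsum("bkd,kde,bke->b", g1, self.w12, g2).unsqueeze(1)  # (B, 1)
        logit = self.w1(h1) + self.w2(h2) + bilinear
        return torch.sigmoid(logit)       # predicted click-through probability
```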

NeurIPS Conference 2022 Conference Paper

M4Singer: A Multi-Style, Multi-Singer and Musical Score Provided Mandarin Singing Corpus

  • Lichao Zhang
  • Ruiqi Li
  • Shoutong Wang
  • Liqun Deng
  • Jinglin Liu
  • Yi Ren
  • JinZheng He
  • Rongjie Huang

The lack of publicly available, high-quality, and accurately labeled datasets has long been a major bottleneck for singing voice synthesis (SVS). To tackle this problem, we present M4Singer, a free-to-use Multi-style, Multi-singer Mandarin singing collection with elaborately annotated Musical scores, together with its benchmarks. Specifically, 1) we construct and release a large high-quality Chinese singing voice corpus, recorded by 20 professional singers, covering 700 Chinese pop songs and all four SATB voice types (i.e., soprano, alto, tenor, and bass); 2) we make extensive efforts to manually compose the musical score for each recorded song, which is necessary for studying prosody modeling for SVS; and 3) to facilitate the use and demonstrate the quality of M4Singer, we conduct four benchmark experiments: score-based SVS, controllable singing voice (CSV), singing voice conversion (SVC), and automatic music transcription (AMT).

AAAI Conference 2021 Short Paper

Modeling High-order Interactions across Multi-interests for Micro-video Recommendation (Student Abstract)

  • Dong Yao
  • Shengyu Zhang
  • Zhou Zhao
  • Wenyan Fan
  • Jieming Zhu
  • Xiuqiang He
  • Fei Wu

Personalized recommendation systems have become pervasive on various video platforms. Many effective methods have been proposed, but most of them do not capture users' multi-level interest traits or the dependencies among their viewed micro-videos well. To solve these problems, we propose a Self-over-Co Attention module to enhance the user's interest representation. In particular, we first use co-attention to model correlation patterns across different levels and then use self-attention to model correlation patterns within a specific level. Experimental results on filtered public datasets verify that our proposed module is useful.

IJCAI Conference 2021 Conference Paper

UNBERT: User-News Matching BERT for News Recommendation

  • Qi Zhang
  • Jingjie Li
  • Qinglin Jia
  • Chuyuan Wang
  • Jieming Zhu
  • Zhaowei Wang
  • Xiuqiang He

Nowadays, news recommendation has become a popular channel for users to access news of their interest. How to represent the rich textual content of news and precisely match users' interests with candidate news lies at the core of news recommendation. However, existing recommendation methods merely learn textual representations from in-domain news data, which limits their ability to generalize to new news articles that are common in cold-start scenarios. Meanwhile, many of these methods represent each user by aggregating the historically browsed news into a single vector and then compute the matching score with the candidate news vector, which may lose low-level matching signals. In this paper, we explore the use of the successful BERT pre-training technique from NLP for news recommendation and propose a BERT-based user-news matching model, called UNBERT. In contrast to existing research, our UNBERT model not only leverages the pre-trained model with rich language knowledge to enhance textual representation, but also captures multi-grained user-news matching signals at both word level and news level. Extensive experiments on the Microsoft News Dataset (MIND) demonstrate that our approach consistently outperforms state-of-the-art methods.

NeurIPS Conference 2020 Conference Paper

Counterfactual Contrastive Learning for Weakly-Supervised Vision-Language Grounding

  • Zhu Zhang
  • Zhou Zhao
  • Zhijie Lin
  • Jieming Zhu
  • Xiuqiang He

Weakly-supervised vision-language grounding aims to localize a target moment in a video or a specific region in an image according to a given sentence query, where only video-level or image-level sentence annotations are provided during training. Most existing approaches employ MIL-based or reconstruction-based paradigms for the WSVLG task, but the former heavily depends on the quality of randomly selected negative samples and the latter cannot directly optimize the visual-textual alignment score. In this paper, we propose a novel Counterfactual Contrastive Learning (CCL) approach that develops sufficient contrastive training between counterfactual positive and negative results, which are based on robust and destructive counterfactual transformations. Concretely, we design three counterfactual transformation strategies at the feature, interaction, and relation levels: the feature-level method damages the visual features of selected proposals, the interaction-level approach confuses the vision-language interaction, and the relation-level strategy destroys the context clues in proposal relationships. Extensive experiments on five vision-language grounding datasets verify the effectiveness of our CCL paradigm.
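
To make the feature-level strategy concrete, the sketch below builds a counterfactual negative by damaging (here, zeroing) the features of a randomly selected subset of proposals. The selection rule and the damaging operation used by CCL may differ, so this is illustrative only.

```python
import torch

def feature_level_counterfactual(proposal_feats: torch.Tensor,
                                 damage_ratio: float = 0.3) -> torch.Tensor:
    """Damage a random subset of proposal features to form a counterfactual negative.

    proposal_feats: (P, D) features of P region/segment proposals.
    """
    num = max(1, int(damage_ratio * proposal_feats.size(0)))
    idx = torch.randperm(proposal_feats.size(0))[:num]
    damaged = proposal_feats.clone()
    damaged[idx] = 0.0   # destroy the selected proposals' visual features
    return damaged
```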

ECAI Conference 2020 Conference Paper

Directional Adversarial Training for Recommender Systems

  • Yangjun Xu
  • Liang Chen 0001
  • Fenfang Xie
  • Weibo Hu
  • Jieming Zhu
  • Chuan Chen 0001
  • Zibin Zheng

Adversarial training has been shown to be an effective way to improve the generalization ability of deep learning models by adding random perturbations in the input space during model training. A recent study successfully applied adversarial training to recommender systems by perturbing the embeddings of users and items through a minimax game. However, this method ignores the collaborative signal in recommender systems and fails to capture the smoothness of the data distribution. We argue that the collaborative signal, which reveals the behavioural similarity between users and items, is critical to modeling recommender systems. In this work, we develop the Directional Adversarial Training (DAT) strategy, which explicitly injects the collaborative signal into the perturbation process: both users and items are perturbed towards their similar neighbours in the embedding space, with a proper restriction on the perturbation. To verify its effectiveness, we demonstrate the use of DAT on Generalized Matrix Factorization (GMF), one of the most representative collaborative filtering methods. Our experimental results on three public datasets show that our method (called DAGMF) achieves a significant accuracy improvement over GMF and, meanwhile, is less prone to overfitting than GMF.
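
A minimal sketch of the directional idea, under the assumption that each user/item embedding already has a designated similar neighbour: instead of a random or purely gradient-based perturbation, the embedding is nudged toward its neighbour with the step size capped by a budget epsilon. DAT's exact restriction and the minimax training objective are defined in the paper and not reproduced here.

```python
import torch

def directional_perturbation(emb: torch.Tensor, neighbor_emb: torch.Tensor,
                             epsilon: float = 0.05) -> torch.Tensor:
    """Nudge each embedding toward its similar neighbour by at most epsilon (illustrative).

    emb, neighbor_emb: (N, D) tensors aligned row by row.
    """
    direction = neighbor_emb - emb
    norm = direction.norm(dim=1, keepdim=True).clamp(min=1e-12)
    return emb + epsilon * direction / norm   # unit-direction step of size epsilon
```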