
Author name cluster

Hui Xu

Possible papers associated with this exact author name in Arrow. This page groups case-insensitive exact name matches and is not a full identity disambiguation profile.

14 papers
1 author row

Possible papers


AAAI Conference 2026 Conference Paper

Negative Entity Suppression for Zero-Shot Captioning with Synthetic Images

  • Zimao Lu
  • Hui Xu
  • Bing Liu
  • Ke Wang

Text-only training provides an attractive approach to address data scarcity challenges in zero-shot image captioning (ZIC), avoiding the expense of collecting paired image-text annotations. However, although these approaches perform well within training domains, they suffer from poor cross-domain generalization, often producing hallucinated content when encountering novel visual environments. Retrieval-based methods attempt to mitigate this limitation by leveraging external knowledge, but they can paradoxically exacerbate hallucination when retrieved captions contain entities irrelevant to the inputs. We introduce the concept of negative entities (objects that appear in the generated caption but are absent from the input) and propose Negative Entity Suppression (NES) to tackle this challenge. NES seamlessly integrates three stages: (1) it employs synthetic images to ensure consistent image-to-text retrieval across both training and inference; (2) it filters negative entities from retrieved content to enhance accuracy; and (3) it applies attention-level suppression using identified negative entities to further minimize the impact of hallucination-prone features. Evaluation across multiple benchmarks demonstrates that NES maintains competitive in-domain performance while improving cross-domain transfer and reducing hallucination rates, achieving new state-of-the-art results in ZIC.
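
As a rough illustration of stages (2) and (3), the sketch below filters retrieved captions whose entities never appear in the image and masks attention logits on negative-entity tokens. Entity extraction, retrieval, and the captioning model are all stubbed out; every name here is hypothetical rather than taken from the authors' code.

```python
import numpy as np

def filter_negative_entities(retrieved_captions, caption_entities, image_entities):
    """Stage (2): drop retrieved captions mentioning entities absent from the image."""
    image_entities = set(image_entities)
    return [cap for cap, ents in zip(retrieved_captions, caption_entities)
            if not (set(ents) - image_entities)]

def suppress_attention(logits, token_entities, image_entities):
    """Stage (3): push attention logits of negative-entity tokens to -inf before softmax."""
    logits = np.asarray(logits, dtype=float).copy()
    for i, ent in enumerate(token_entities):
        if ent is not None and ent not in image_entities:
            logits[i] = -1e9  # effectively zero weight after softmax
    weights = np.exp(logits - logits.max())
    return weights / weights.sum()

# Toy usage: "dog" never appears in the image, so its caption and token are suppressed.
print(filter_negative_entities(["a cat on a mat", "a dog runs"],
                               [{"cat", "mat"}, {"dog"}], {"cat", "mat"}))
print(suppress_attention([1.0, 2.0, 0.5], ["cat", "dog", None], {"cat", "mat"}))
```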

NeurIPS Conference 2025 Conference Paper

OVERT: A Benchmark for Over-Refusal Evaluation on Text-to-Image Models

  • Ziheng Cheng
  • Yixiao Huang
  • Hui Xu
  • Somayeh Sojoudi
  • Xuandong Zhao
  • Dawn Song
  • Song Mei

Text-to-Image (T2I) models have achieved remarkable success in generating visual content from text inputs. Although multiple safety alignment strategies have been proposed to prevent harmful outputs, they often lead to overly cautious behavior, rejecting even benign prompts, a phenomenon known as over-refusal that reduces the practical utility of T2I models. Despite over-refusal having been observed in practice, there is no large-scale benchmark that systematically evaluates this phenomenon for T2I models. In this paper, we present an automatic workflow to construct synthetic evaluation data, resulting in OVERT (OVEr-Refusal evaluation on Text-to-image models), the first large-scale benchmark for assessing over-refusal behaviors in T2I models. OVERT includes 4,600 seemingly harmful but benign prompts across nine safety-related categories, along with 1,785 genuinely harmful prompts (OVERT-unsafe) to evaluate the safety–utility trade-off. Using OVERT, we evaluate several leading T2I models and find that over-refusal is a widespread issue across various categories (Figure 1), underscoring the need for further research to enhance the safety alignment of T2I models without compromising their functionality. As a preliminary attempt to reduce over-refusal, we explore prompt rewriting; however, we find it often compromises faithfulness to the meaning of the original prompts. Finally, we demonstrate the flexibility of our generation framework in accommodating diverse safety requirements by generating customized evaluation data adapting to user-defined policies.
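
The trade-off the benchmark targets is simple to state. The sketch below computes the refusal rate on benign prompts against the refusal rate on unsafe prompts, assuming a hypothetical `model` callable that returns True when the T2I system refuses; it is not the paper's evaluation harness.

```python
def refusal_rates(model, benign_prompts, unsafe_prompts):
    """Over-refusal rate on benign prompts vs. refusal rate on unsafe prompts."""
    benign = sum(model(p) for p in benign_prompts) / len(benign_prompts)
    unsafe = sum(model(p) for p in unsafe_prompts) / len(unsafe_prompts)
    return {"over_refusal_rate": benign, "unsafe_refusal_rate": unsafe}

# Toy "model" that refuses any prompt containing the word "weapon".
refuses = lambda prompt: "weapon" in prompt

print(refusal_rates(
    refuses,
    benign_prompts=["a toy weapon prop for a school play", "a red apple"],
    unsafe_prompts=["a realistic weapon with assembly instructions"]))
# {'over_refusal_rate': 0.5, 'unsafe_refusal_rate': 1.0}
```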

AAAI Conference 2025 Conference Paper

Temporal Action Localization with Cross Layer Task Decoupling and Refinement

  • Qiang Li
  • Di Liu
  • Jun Kong
  • Sen Li
  • Hui Xu
  • Jianzhong Wang

Temporal action localization (TAL) involves dual tasks to classify and localize actions within untrimmed videos. However, the two tasks often have conflicting requirements for features. Existing methods typically employ separate heads for the classification and localization tasks but share the same input feature, leading to suboptimal performance. To address this issue, we propose a novel TAL method with Cross Layer Task Decoupling and Refinement (CLTDR). Based on the feature pyramid of the video, the CLTDR strategy integrates semantically strong features from higher pyramid layers and detailed boundary-aware features from lower pyramid layers to effectively disentangle the action classification and localization tasks. Moreover, features from multiple cross layers are also employed to refine and align the disentangled classification and regression results. Finally, a lightweight Gated Multi-Granularity (GMG) module is proposed to comprehensively extract and aggregate video features at instant, local, and global temporal granularities. Benefiting from the CLTDR and GMG modules, our method achieves state-of-the-art performance on five challenging benchmarks: THUMOS14, MultiTHUMOS, EPIC-KITCHENS-100, ActivityNet-1.3, and HACS. Code: https://github.com/LiQiang0307/CLTDR-GMG
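
A minimal sketch of the routing idea the abstract describes: classification reads from a semantically strong higher pyramid layer while boundary regression reads from a detail-rich lower layer. The refinement and GMG stages are omitted, and all shapes and names are illustrative assumptions, not the released code.

```python
import numpy as np

rng = np.random.default_rng(0)
T, C, num_classes = 16, 32, 5              # time steps, channels, action classes

# Two pyramid levels: lower layers keep fine temporal detail,
# higher layers carry stronger semantics.
pyramid = {"low": rng.normal(size=(T, C)),
           "high": rng.normal(size=(T, C))}

W_cls = rng.normal(size=(C, num_classes))  # classification head
W_reg = rng.normal(size=(C, 2))            # regression head: (start, end) offsets

cls_logits = pyramid["high"] @ W_cls       # classify from the semantic layer
boundaries = pyramid["low"] @ W_reg        # localize from the detail layer
print(cls_logits.shape, boundaries.shape)  # (16, 5) (16, 2)
```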

JBHI Journal 2023 Journal Article

Multimodal Data Matters: Language Model Pre-Training Over Structured and Unstructured Electronic Health Records

  • Sicen Liu
  • Xiaolong Wang
  • Yongshuai Hou
  • Ge Li
  • Hui Wang
  • Hui Xu
  • Yang Xiang
  • Buzhou Tang

As two important textual modalities in electronic health records (EHRs), both structured data (clinical codes) and unstructured data (clinical narratives) have recently been increasingly applied to the healthcare domain. Most existing EHR-oriented studies, however, either focus on a particular modality or integrate data from different modalities in a straightforward manner, which usually treats structured and unstructured data as two independent sources of information about a patient admission and ignores the intrinsic interactions between them. In fact, the two modalities are documented during the same encounter, where structured data inform the documentation of unstructured data and vice versa. In this paper, we propose a Medical Multimodal Pre-trained Language Model, named MedM-PLM, to learn enhanced EHR representations over structured and unstructured data and explore the interaction of the two modalities. In MedM-PLM, two Transformer-based neural network components are first adopted to learn representative characteristics from each modality. A cross-modal module is then introduced to model their interactions. We pre-trained MedM-PLM on the MIMIC-III dataset and verified the effectiveness of the model on three downstream clinical tasks, i.e., medication recommendation, 30-day readmission prediction, and ICD coding. Extensive experiments demonstrate the power of MedM-PLM compared with state-of-the-art methods. Further analyses and visualizations show the robustness of our model, which could potentially provide more comprehensive interpretations for clinical decision-making.
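
The cross-modal module is described only at a high level; the sketch below shows one plausible reading, with single-head cross-attention letting each modality's encoder output attend over the other's. Dimensions and function names are assumptions, not the paper's implementation.

```python
import numpy as np

def softmax(x, axis=-1):
    e = np.exp(x - x.max(axis=axis, keepdims=True))
    return e / e.sum(axis=axis, keepdims=True)

def cross_attend(queries, context):
    """Single-head cross-attention: queries attend over the other modality."""
    scores = queries @ context.T / np.sqrt(queries.shape[-1])
    return softmax(scores) @ context

rng = np.random.default_rng(0)
codes = rng.normal(size=(10, 64))   # per-modality encoder output: clinical codes
notes = rng.normal(size=(40, 64))   # per-modality encoder output: narrative tokens

codes_enriched = cross_attend(codes, notes)  # codes informed by the narrative
notes_enriched = cross_attend(notes, codes)  # narrative informed by the codes
print(codes_enriched.shape, notes_enriched.shape)  # (10, 64) (40, 64)
```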

JBHI Journal 2023 Journal Article

SHAPE: A Sample-Adaptive Hierarchical Prediction Network for Medication Recommendation

  • Sicen Liu
  • Xiaolong Wang
  • Jingcheng Du
  • Yongshuai Hou
  • Xianbing Zhao
  • Hui Xu
  • Hui Wang
  • Yang Xiang

Effective medication recommendation under complex multimorbidity conditions is a critical yet challenging task in healthcare. Most existing works predict medications from longitudinal records, assuming that the encoding of intra-visit medical events is serialized and that the information-transmission patterns learned over longitudinal sequences are stable. However, two conditions may have been ignored: 1) a more compact encoder for the relationships among medical events within a visit is needed; 2) strategies for learning accurate representations of patients' variable-length longitudinal sequences differ. In this article, we propose a novel Sample-adaptive Hierarchical medicAtion Prediction nEtwork, termed SHAPE, to tackle the above challenges in the medication recommendation task. Specifically, we design a compact intra-visit set encoder to capture the relationships among medical events and obtain a visit-level representation, and then develop an inter-visit longitudinal encoder to learn the patient-level longitudinal representation efficiently. To endow the model with the capability of modeling variable visit lengths, we introduce a soft curriculum learning method that automatically assigns each sample's difficulty according to its visit length. Extensive experiments on a benchmark dataset verify the superiority of our model compared with several state-of-the-art baselines.
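
A hedged sketch of the soft curriculum idea, assuming a sigmoid pacing function (the abstract does not give the exact schedule): each sample's weight is derived from its visit length and the current training progress, so short records dominate early and long ones phase in.

```python
import math

def sample_weight(visit_len, max_len, progress, sharpness=8.0):
    """Soft weight in (0, 1): short records dominate early, long ones phase in.

    progress: training progress in [0, 1]; sharpness controls the transition.
    """
    difficulty = visit_len / max_len
    return 1.0 / (1.0 + math.exp(sharpness * (difficulty - progress)))

# Early (progress=0.2) the 30-visit record is nearly ignored;
# by the end (progress=1.0) every record carries substantial weight.
for visits in (2, 10, 30):
    print(visits,
          round(sample_weight(visits, 30, progress=0.2), 3),
          round(sample_weight(visits, 30, progress=1.0), 3))
```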

AAAI Conference 2023 Conference Paper

Temporal Knowledge Graph Reasoning with Historical Contrastive Learning

  • Yi Xu
  • Junjie Ou
  • Hui Xu
  • Luoyi Fu

Temporal knowledge graphs, serving as an effective way to store and model dynamic relations, show promising prospects in event forecasting. However, most temporal knowledge graph reasoning methods are highly dependent on the recurrence or periodicity of events, which brings challenges to inferring future events related to entities that lack historical interaction. In fact, the current moment is often the combined effect of a small part of historical information and unobserved underlying factors. To this end, we propose a new event forecasting model called Contrastive Event Network (CENET), based on a novel training framework of historical contrastive learning. CENET learns both the historical and non-historical dependency to distinguish the most potential entities that can best match the given query. Simultaneously, it trains representations of queries to investigate whether the current moment depends more on historical or non-historical events through contrastive learning. The representations further help train a binary classifier whose output is a Boolean mask indicating the related entities in the search space. During the inference process, CENET employs a mask-based strategy to generate the final results. We evaluate our proposed model on five benchmark graphs. The results demonstrate that CENET significantly outperforms all existing methods on most metrics, achieving at least an 8.3% relative improvement in Hits@1 over previous state-of-the-art baselines on event-based datasets.
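
The mask-based inference step can be pictured as follows: a Boolean mask from the binary classifier gates the candidate-entity scores before ranking. The scores, mask, and function below are toy stand-ins for CENET's learned components, not the authors' code.

```python
import numpy as np

def masked_ranking(entity_scores, historical_mask, prefer_historical):
    """Gate candidate entities with the classifier's Boolean mask, then rank."""
    keep = historical_mask if prefer_historical else ~historical_mask
    gated = np.where(keep, entity_scores, -np.inf)
    return np.argsort(-gated)  # masked-out entities fall to the back

scores = np.array([0.9, 0.4, 0.7, 0.2])             # combined dependency scores
seen_before = np.array([True, False, True, False])  # entities with history

print(masked_ranking(scores, seen_before, prefer_historical=True))   # e.g. [0 2 1 3]
print(masked_ranking(scores, seen_before, prefer_historical=False))  # e.g. [1 3 0 2]
```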

IJCAI Conference 2020 Conference Paper

Exploring Parameter Space with Structured Noise for Meta-Reinforcement Learning

  • Hui Xu
  • Chong Zhang
  • Jiaxing Wang
  • Deqiang Ouyang
  • Yu Zheng
  • Jie Shao

Efficient exploration is a major challenge in Reinforcement Learning (RL) and has been studied extensively. However, for a new task, existing methods explore either by taking actions that maximize task-agnostic objectives (such as information gain) or by applying a simple dithering strategy (such as noise injection), which might not be effective enough. In this paper, we investigate whether previous learning experiences can be leveraged to guide exploration of the current new task. To this end, we propose a novel Exploration with Structured Noise in Parameter Space (ESNPS) approach. ESNPS utilizes meta-learning and directly uses meta-policy parameters, which contain prior knowledge, as structured noise to perturb the base model for effective exploration in new tasks. Experimental results on four groups of tasks (cheetah velocity, cheetah direction, ant velocity and ant direction) demonstrate the superiority of ESNPS against a number of competitive baselines.
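
A minimal sketch of parameter-space exploration with structured noise, under the illustrative assumption that the policy is a flat parameter vector: the perturbation direction comes from the meta-policy parameters rather than from isotropic Gaussian noise. Scales and names are assumptions.

```python
import numpy as np

rng = np.random.default_rng(0)
base_params = rng.normal(size=128)  # parameters of the base policy
meta_params = rng.normal(size=128)  # meta-policy parameters (prior knowledge)

def perturb(base, meta, scale=0.1):
    """Structured noise: step along the meta-policy direction instead of
    drawing independent Gaussian noise for every parameter."""
    z = rng.normal()                # one random coefficient shapes the whole step
    return base + scale * z * meta

exploration_policy = perturb(base_params, meta_params)
print(float(np.linalg.norm(exploration_policy - base_params)))
```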