Author name cluster

Xun Wang

Possible papers associated with this exact author name in Arrow. This page groups case-insensitive exact name matches and is not a full identity disambiguation profile.

33 papers
2 author rows

Possible papers

AAAI Conference 2026 Conference Paper

End-to-End Multi-Person Pose Estimation with Pose-Aware Video Transformer

  • Yonghui Yu
  • Jiahang Cai
  • Xun Wang
  • Wenwu Yang

Existing multi-person video pose estimation methods typically adopt a two-stage pipeline: detecting individuals in each frame, followed by temporal modeling for single-person pose estimation. This design relies on heuristic operations such as tracking, RoI cropping, and non-maximum suppression, limiting both accuracy and efficiency. In this paper, we present a fully end-to-end framework for multi-person 2D pose estimation in videos, effectively eliminating heuristic operations. A key challenge is to associate individuals across frames under complex and overlapping temporal trajectories. To address this, we introduce a novel Pose-Aware Video transformEr Network (PAVE-Net), which features a spatial encoder to model intra-frame relations and a spatiotemporal pose decoder to capture global dependencies across frames. To achieve accurate temporal association, we propose a pose-aware attention mechanism that enables each pose query to selectively aggregate features corresponding to the same individual across consecutive frames. Additionally, we explicitly model spatiotemporal dependencies among pose keypoints to improve accuracy. Notably, our approach is the first end-to-end method for multi-frame 2D human pose estimation. Extensive experiments show that PAVE-Net substantially outperforms prior image-based end-to-end methods, achieving a 6.0 mAP improvement on PoseTrack2017, and delivers accuracy competitive with state-of-the-art two-stage video-based approaches, while offering significant gains in efficiency.

AAAI Conference 2026 Conference Paper

FashionMAC: Deformation-Free Fashion Image Generation with Fine-Grained Model Appearance Customization

  • Rong Zhang
  • Jinxiao Li
  • Jingnan Wang
  • Zhiwen Zuo
  • Jianfeng Dong
  • Wei Li
  • Chi Wang
  • Weiwei Xu

Garment-centric fashion image generation aims to synthesize realistic and controllable human models dressing a given garment, which has attracted growing interest due to its practical applications in e-commerce. The key challenges of the task lie in two aspects: (1) faithfully preserving the garment details, and (2) gaining fine-grained controllability over the model's appearance. Existing methods typically require performing garment deformation in the generation process, which often leads to garment texture distortions. Also, they fail to control the fine-grained attributes of the generated models, due to the lack of specifically designed mechanisms. To address these issues, we propose FashionMAC, a novel diffusion-based deformation-free framework that achieves high-quality and controllable fashion showcase image generation. The core idea of our framework is to eliminate the need for performing garment deformation and directly outpaint the garment segmented from a dressed person, which enables faithful preservation of the intricate garment details. Moreover, we propose a novel region-adaptive decoupled attention (RADA) mechanism along with a chained mask injection strategy to achieve fine-grained appearance controllability over the synthesized human models. Specifically, RADA adaptively predicts the generated regions for each fine-grained text attribute and enforces the text attribute to focus on the predicted regions by a chained mask injection strategy, significantly enhancing the visual fidelity and the controllability. Extensive experiments validate the superior performance of our framework compared to existing state-of-the-art methods.

ICLR Conference 2025 Conference Paper

Bridging Context Gaps: Leveraging Coreference Resolution for Long Contextual Understanding

  • Yanming Liu 0003
  • Xinyue Peng
  • Jiannan Cao
  • Shi Bo
  • Yanxin Shen
  • Tianyu Du
  • Sheng Cheng
  • Xun Wang

Large language models (LLMs) have shown remarkable capabilities in natural language processing; however, they still face difficulties when tasked with understanding lengthy contexts and executing effective question answering. These challenges often arise due to the complexity and ambiguity present in longer texts. To enhance the performance of LLMs in such scenarios, we introduce the Long Question Coreference Adaptation (LQCA) method. This innovative framework focuses on coreference resolution tailored to long contexts, allowing the model to identify and manage references effectively. The LQCA method encompasses four key steps: resolving coreferences within sub-documents, computing the distances between mentions, defining a representative mention for coreference, and answering questions through mention replacement. By processing information systematically, the framework provides easier-to-handle partitions for LLMs, promoting better understanding. Experimental evaluations on a range of LLMs and datasets have yielded positive results, with notable improvements on OpenAI-o1-mini and GPT-4o models, highlighting the effectiveness of leveraging coreference resolution to bridge context gaps in question answering. Our code is public at https://github.com/OceannTwT/LQCA.
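
The last two steps named in the abstract (choosing a representative mention, then replacing mentions) can be illustrated with a minimal sketch. It assumes the coreference clusters have already been computed as character spans; the "longest mention wins" rule and all function names are hypothetical and are not taken from the LQCA code.

```python
from typing import List, Tuple

Span = Tuple[int, int]  # (start, end) character offsets, end exclusive


def pick_representative(text: str, cluster: List[Span]) -> str:
    """Hypothetical rule: the longest mention represents the cluster (earliest wins ties)."""
    start, end = max(cluster, key=lambda s: (s[1] - s[0], -s[0]))
    return text[start:end]


def replace_mentions(text: str, clusters: List[List[Span]]) -> str:
    """Rewrite every mention as its cluster's representative mention."""
    edits = []
    for cluster in clusters:
        rep = pick_representative(text, cluster)
        edits.extend((start, end, rep) for start, end in cluster)
    # Apply edits from right to left so earlier offsets stay valid.
    for start, end, rep in sorted(edits, reverse=True):
        text = text[:start] + rep + text[end:]
    return text


text = "Marie Curie won two Nobel Prizes. She was born in Warsaw."
she = text.index("She")
# One cluster: "Marie Curie" and "She" refer to the same entity.
clusters = [[(0, len("Marie Curie")), (she, she + len("She"))]]
resolved = replace_mentions(text, clusters)
print(resolved)
```

After replacement, a question about the second sentence can be answered without looking back for the antecedent of the pronoun, which is the intuition behind mention replacement.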

LORI Conference 2025 Conference Paper

Craig Interpolation Property in $\exists \Box $-Bundled Fragment of First-Order Modal Logic

  • Xun Wang

By extending the quantifier-free predicate logic (without equality, constant and function symbols) with a bundled modality \(\exists x\Box\), which packs the quantifier \(\exists x\) and the modality \(\Box\) together, we obtain a fragment of first-order modal logic, namely the \(\exists\Box\)-bundled fragment. In this paper, we prove that the Craig interpolation theorem holds for systems of the \(\exists\Box\)-bundled fragment based on K/D/T/4/S4, while it fails for the system based on S5.
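
For readers unfamiliar with the property, the standard textbook form of Craig interpolation for a logic \(L\) is sketched below. The symbols \(\varphi\), \(\psi\), \(\theta\) and \(\mathrm{Pred}(\cdot)\) (the set of predicate symbols occurring in a formula) are notation assumed here; the paper's exact formulation for the bundled fragment may differ in detail.

```latex
\[
\vdash_{L} \varphi \to \psi
\quad\Longrightarrow\quad
\exists\,\theta \;\bigl(\;
\vdash_{L} \varphi \to \theta,\quad
\vdash_{L} \theta \to \psi,\quad
\mathrm{Pred}(\theta) \subseteq \mathrm{Pred}(\varphi) \cap \mathrm{Pred}(\psi)
\;\bigr)
\]
```

The formula \(\theta\) is called an interpolant of \(\varphi\) and \(\psi\).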

AAAI Conference 2025 Conference Paper

DP-MemArc: Differential Privacy Transfer Learning for Memory Efficient Language Models

  • Yanming Liu
  • Xinyue Peng
  • Yuwei Zhang
  • Xiaolan Ke
  • Songhang Deng
  • Jiannan Cao
  • Chen Ma
  • Mengchen Fu

Large language models have repeatedly shown outstanding performance across diverse applications. However, deploying these models can inadvertently risk user privacy. The significant memory demands during training pose a major challenge in terms of resource consumption. This substantial size places a heavy load on memory resources, raising considerable practical concerns. In this paper, we introduce DP-MemArc, a novel training framework aimed at reducing the memory costs of large language models while emphasizing the protection of user data privacy. DP-MemArc incorporates side network or reversible network designs to support a variety of differential privacy memory-efficient fine-tuning schemes. Our approach not only achieves about a 2.5-fold improvement in memory efficiency but also ensures robust privacy protection, keeping user data secure and confidential. Extensive experiments have demonstrated that DP-MemArc effectively provides differential privacy-efficient fine-tuning across different task scenarios.

AAAI Conference 2025 Conference Paper

Dynamic Adapter with Semantics Disentangling for Cross-lingual Cross-modal Retrieval

  • Rui Cai
  • Zhiyu Dong
  • Jianfeng Dong
  • Xun Wang

Existing cross-modal retrieval methods typically rely on large-scale vision-language pair data. This makes it challenging to efficiently develop a cross-modal retrieval model for under-resourced languages of interest. Therefore, Cross-lingual Cross-modal Retrieval (CCR), which aims to align vision and the low-resource language (the target language) without using any human-labeled target-language data, has gained increasing attention. As a general parameter-efficient way, a common solution is to utilize adapter modules to transfer the vision-language alignment ability of Vision-Language Pretraining (VLP) models from a source language to a target language. However, these adapters are usually static once learned, making it difficult to adapt to target-language captions with varied expressions. To alleviate it, we propose Dynamic Adapter with Semantics Disentangling (DASD), whose parameters are dynamically generated conditioned on the characteristics of the input captions. Considering that the semantics and expression styles of the input caption largely influence how to encode it, we propose a semantic disentangling module to extract the semantic-related and semantic-agnostic features from the input, ensuring that generated adapters are well-suited to the characteristics of input caption. Extensive experiments on two image-text datasets and one video-text dataset demonstrate the effectiveness of our model for cross-lingual cross-modal retrieval, as well as its good compatibility with various VLP models.

ICML Conference 2025 Conference Paper

Efficient and Privacy-Preserving Soft Prompt Transfer for LLMs

  • Xun Wang
  • Jing Xu
  • Franziska Boenisch
  • Michael Backes 0001
  • Christopher A. Choquette-Choo
  • Adam Dziedzic

Prompting has become a dominant paradigm for adapting large language models (LLMs). While discrete (textual) prompts are widely used for their interpretability, soft (parameter) prompts have recently gained traction in APIs. This is because they can encode information from more training samples while minimizing the user’s token usage, leaving more space in the context window for task-specific input. However, soft prompts are tightly coupled to the LLM they are tuned on, limiting their generalization to other LLMs. This constraint is particularly problematic for efficiency and privacy: (1) tuning prompts on each LLM incurs high computational costs, especially as LLMs continue to grow in size. Additionally, (2) when the LLM is hosted externally, soft prompt tuning often requires sharing private data with the LLM provider. For instance, this is the case with the NVIDIA NeMo API. To address these issues, we propose POST (Privacy Of Soft prompt Transfer), a framework that enables private tuning of soft prompts on a small model and subsequently transfers these prompts to a larger LLM. POST uses knowledge distillation to derive a small model directly from the large LLM to improve prompt transferability, tunes the soft prompt locally, optionally with differential privacy guarantees, and transfers it back to the larger LLM using a small public dataset. Our experiments show that POST reduces computational costs, preserves privacy, and effectively transfers high-utility soft prompts.

TMLR Journal 2025 Journal Article

Mixed Sparsity Training: Achieving 4$\times$ FLOP Reduction for Transformer Pretraining

  • Pihe Hu
  • Shaolong Li
  • Xun Wang
  • Longbo Huang

Large language models (LLMs) have made significant strides in complex tasks, yet their widespread adoption is impeded by substantial computational demands. With hundreds of billions of parameters, transformer-based LLMs necessitate months of pretraining across a high-end GPU cluster. However, this paper reveals a compelling finding: transformers exhibit considerable redundancy in pretraining computations, which motivates our proposed solution, Mixed Sparsity Training (MST), an efficient pretraining method that can reduce about 75% of Floating Point Operations (FLOPs) while maintaining performance. MST integrates dynamic sparse training (DST) with Sparsity Variation (SV) and Hybrid Sparse Attention (HSA) during pretraining, involving three distinct phases: warm-up, ultra-sparsification, and restoration. The warm-up phase transforms the dense model into a sparse one, and the restoration phase reinstates connections. Throughout these phases, the model is trained with a dynamically evolving sparse topology and an HSA mechanism to maintain performance and minimize training FLOPs concurrently. Our experiment on GPT-2 showcases a FLOP reduction of $4\times$ without compromising performance.
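
The two figures quoted in the abstract describe the same saving: removing about 75% of the FLOPs leaves about 25% of the dense cost, which is a 4× reduction. Writing \(F_{\text{dense}}\) and \(F_{\text{MST}}\) for the dense and MST pretraining FLOPs (notation assumed here, not taken from the paper):

```latex
\[
\frac{F_{\text{dense}}}{F_{\text{MST}}}
= \frac{F_{\text{dense}}}{(1 - 0.75)\,F_{\text{dense}}}
= \frac{1}{0.25}
= 4
\]
```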

AAMAS Conference 2025 Conference Paper

Offline-to-Online Multi-Agent Reinforcement Learning with Offline Value Function Memory and Sequential Exploration

  • Hai Zhong
  • Xun Wang
  • Zhuoran Li
  • Longbo Huang

Offline-to-Online Reinforcement Learning has emerged as a powerful paradigm, leveraging offline data for initialization and online fine-tuning to enhance both sample efficiency and performance. However, most existing research has focused on single-agent settings, with limited exploration of the multi-agent extension, i.e., Offline-to-Online Multi-Agent Reinforcement Learning (O2O MARL). In O2O MARL, two critical challenges become more prominent as the number of agents increases: (i) the risk of unlearning pre-trained Q-values due to distributional shifts during the transition from offline-to-online phases, and (ii) the difficulty of efficient exploration in the large joint state-action space. To tackle these challenges, we propose a novel O2O MARL framework called Offline Value Function Memory with Sequential Exploration (OVMSE). First, we introduce the Offline Value Function Memory (OVM) mechanism to compute target Q-values, preserving knowledge gained during offline training, ensuring smoother transitions, and enabling efficient fine-tuning. Second, we propose a decentralized Sequential Exploration (SE) strategy tailored for O2O MARL, which effectively utilizes the pre-trained offline policy for exploration, thereby significantly reducing the joint state-action space to be explored. Extensive experiments on the StarCraft Multi-Agent Challenge (SMAC) demonstrate that OVMSE significantly outperforms existing baselines, achieving superior sample efficiency and overall performance.

NeurIPS Conference 2025 Conference Paper

OmniGen-AR: AutoRegressive Any-to-Image Generation

  • Junke Wang
  • Xun Wang
  • Qiushan Guo
  • Peize Sun
  • Weilin Huang
  • Zuxuan Wu
  • Yu-Gang Jiang

Autoregressive (AR) models have demonstrated strong potential in visual generation, offering competitive performance with simple architectures and optimization objectives. However, existing methods are typically limited to single-modality conditions, e.g., text or category labels, restricting their applicability in real-world scenarios that demand image synthesis from diverse forms of controls. In this work, we present OmniGen-AR, the first unified autoregressive framework for Any-to-Image generation. By discretizing various visual conditions through a shared visual tokenizer and text prompts with a text tokenizer, OmniGen-AR supports a broad spectrum of conditional inputs within a single model, including text (text-to-image generation), spatial signals (segmentation-to-image and depth-to-image), and visual context (image editing, frame prediction, and text-to-video generation). To mitigate the risk of information leakage from condition tokens to content tokens, we introduce Disentangled Causal Attention (DCA), which separates the full-sequence causal mask into condition causal attention and content causal attention. It serves as a training-time regularizer without affecting the standard next-token prediction during inference. With this design, OmniGen-AR achieves new state-of-the-art results across a range of benchmarks, e.g., 0.63 on GenEval and 80.02 on VBench, demonstrating its effectiveness in flexible and high-fidelity visual generation.

AAAI Conference 2025 Conference Paper

Safe Planner: Empowering Safety Awareness in Large Pre-Trained Models for Robot Task Planning

  • Siyuan Li
  • Feifan Liu
  • Lingfei Cui
  • Jiani Lu
  • Qinqin Xiao
  • Xirui Yang
  • Peng Liu
  • Kewu Sun

Robot task planning is an important problem for autonomous robots in long-horizon challenging tasks. As large pre-trained models have demonstrated superior planning ability, recent research investigates utilizing large models to achieve autonomous planning for robots in diverse tasks. However, since the large models are pre-trained with Internet data and lack the knowledge of real task scenes, large models as planners may make unsafe decisions that hurt the robots and the surrounding environments. To solve this challenge, we propose a novel Safe Planner framework, which empowers safety awareness in large pre-trained models to accomplish safe and executable planning. In this framework, we develop a safety prediction module to guide the high-level large model planner, and this safety module trained in a simulator can be effectively transferred to real-world tasks. The proposed Safe Planner framework is evaluated on both simulated environments and real robots. The experiment results demonstrate that Safe Planner not only achieves state-of-the-art task success rates, but also substantially improves safety during task execution.

JBHI Journal 2025 Journal Article

scSwinTNet: A Cell Type Annotation Method for Large-Scale Single-Cell RNA-Seq Data Based on Shifted Window Attention

  • Huanhuan Dai
  • Xiangyu Meng
  • Zhiyi Pan
  • Qing Yang
  • Haonan Song
  • Yuan Gao
  • Xun Wang

The annotation of cell types based on single-cell RNA sequencing (scRNA-seq) data is a critical downstream task in single-cell analysis, with significant implications for a deeper understanding of biological processes. Most analytical methods cluster cells by unsupervised clustering, which requires manual annotation for cell type determination. This procedure is time-consuming and non-repeatable. To accommodate the exponential growth of sequencing cells, reduce the impact of data bias, and integrate large-scale datasets for further improvement of type annotation accuracy, we proposed scSwinTNet. It is a pre-trained tool for annotating cell types in scRNA-seq data, which uses self-attention based on shifted windows and enables intelligent information extraction from gene data. We demonstrated the effectiveness and robustness of scSwinTNet by using 399,760 cells from human and mouse tissues. To the best of our knowledge, scSwinTNet is the first model to annotate cell types in scRNA-seq data using a pre-trained shifted window attention-based model. It does not require a priori knowledge and accurately annotates cell types without manual annotation.

TMLR Journal 2025 Journal Article

Synthetic Data (Almost) from Scratch: Generalized Instruction Tuning for Language Models

  • Haoran Li
  • Qingxiu Dong
  • Zhengyang Tang
  • Chaojun Wang
  • Xingxing Zhang
  • Haoyang Huang
  • Shaohan Huang
  • Xiaolong Huang

We introduce Generalized Instruction Tuning (called GLAN), a general and scalable method for instruction tuning of Large Language Models (LLMs). Unlike prior work that relies on seed examples or existing datasets to construct instruction-tuning data, GLAN exclusively utilizes a pre-curated taxonomy of human knowledge and capabilities as input and generates large-scale synthetic instruction data across all disciplines. Specifically, inspired by the systematic structure in human education system, we build the taxonomy by decomposing human knowledge and capabilities to various fields, sub-fields and ultimately, distinct disciplines semi-automatically, facilitated by LLMs. Subsequently, we generate a comprehensive list of subjects for every discipline and proceed to design a syllabus tailored to each subject, again utilizing LLMs. With the fine-grained key concepts detailed in every class session of the syllabus, we are able to generate diverse instructions with a broad coverage across the entire spectrum of human knowledge and skills. Extensive experiments on large language models (e.g., Mistral) demonstrate that GLAN excels in multiple dimensions from mathematical reasoning, coding, academic exams, logical reasoning to general instruction following without using task-specific training data of these tasks. In addition, GLAN allows for easy customization and new fields or skills can be added by simply incorporating a new node into our taxonomy. While promising, our approach may inherit biases or inaccuracies from LLM-generated data as in other synthetic data work and is primarily evaluated on exam-style benchmarks. Broader evaluations and data quality control are left for future work.

ICLR Conference 2025 Conference Paper

Tool-Planner: Task Planning with Clusters across Multiple Tools

  • Yanming Liu 0003
  • Xinyue Peng
  • Jiannan Cao
  • Shi Bo
  • Yuwei Zhang
  • Xuhong Zhang 0002
  • Sheng Cheng
  • Xun Wang

Large language models (LLMs) have demonstrated exceptional reasoning capabilities, enabling them to solve various complex problems. Recently, this ability has been applied to the paradigm of tool learning. Tool learning involves providing examples of tool usage and their corresponding functions, allowing LLMs to formulate plans and demonstrate the process of invoking and executing each tool. LLMs can address tasks that they cannot complete independently, thereby enhancing their potential across different tasks. However, this approach faces two key challenges. First, redundant error correction leads to unstable planning and long execution time. Second, designing a correct plan among multiple tools is also a challenge in tool learning. To address these issues, we propose Tool-Planner, a task-processing framework based on toolkits. Tool-Planner groups tools whose API functions serve the same purpose into a toolkit and allows LLMs to implement planning across the various toolkits. When a tool error occurs, the language model can reselect and adjust tools based on the toolkit. Experiments show that our approach demonstrates a high pass and win rate across different datasets and optimizes the planning scheme for tool learning in models such as GPT-4 and Claude 3, showcasing the potential of our method. Our code is public at https://github.com/OceannTwT/Tool-Planner.
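
The toolkit idea described in the abstract (group tools that serve the same function, then reselect within the group when one fails) can be sketched as follows. This is an illustration under stated assumptions, not the Tool-Planner implementation: the functionality labels are supplied by hand here, and every tool and function name is hypothetical.

```python
from typing import Callable, Dict, List

Tool = Callable[[str], str]


def group_by_function(tools: Dict[str, Tool], labels: Dict[str, str]) -> Dict[str, List[Tool]]:
    """Group tools into toolkits keyed by a functionality label (labels assumed given)."""
    toolkits: Dict[str, List[Tool]] = {}
    for name, tool in tools.items():
        toolkits.setdefault(labels[name], []).append(tool)
    return toolkits


def run_with_toolkit(toolkit: List[Tool], query: str) -> str:
    """Try each tool in the toolkit in order; on error, reselect the next one."""
    errors = []
    for tool in toolkit:
        try:
            return tool(query)
        except Exception as exc:  # any tool failure triggers reselection
            errors.append(f"{tool.__name__}: {exc}")
    raise RuntimeError("all tools in toolkit failed: " + "; ".join(errors))


# Two hypothetical tools with the same function; the first one always fails.
def weather_api_a(city: str) -> str:
    raise TimeoutError("service unavailable")


def weather_api_b(city: str) -> str:
    return f"Sunny in {city}"


toolkits = group_by_function(
    {"weather_api_a": weather_api_a, "weather_api_b": weather_api_b},
    {"weather_api_a": "weather", "weather_api_b": "weather"},
)
result = run_with_toolkit(toolkits["weather"], "Paris")
print(result)
```

Planning over toolkits rather than individual tools means a single failing API does not invalidate the plan; only the choice within the toolkit changes.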

AAAI Conference 2025 Conference Paper

Towards Ship License Plate Recognition in the Wild: A Large Benchmark and Strong Baseline

  • Baolong Liu
  • Ruiqing Yang
  • Roukai Huang
  • Wenhao Xu
  • Xin Pan
  • Chuanhuang Li
  • Bin Wang
  • Xun Wang

The paper targets the challenging task of Ship License Plate (SLP) recognition. Existing methods for SLP recognition are hampered by the scarcity of large and publicly available datasets, leading to evaluations on small and non-representative datasets. To alleviate it, we have built a large dataset, called SLP34K, which consists of 34,385 images collected by an intelligent traffic surveillance system. The dataset is carefully manually annotated with text labels and attributes, and presents high data diversity by multiple installation locations and long capturing period of the cameras. Additionally, we propose a simple yet effective SLP recognition baseline method. The baseline is equipped with a strong visual encoder that benefits from initial pre-training via self-supervised learning, followed by further refinement through our devised semantic enhancement module. Extensive experiments on SLP34K verify the effectiveness of our proposed baseline. Moreover, while our baseline is designed for SLP recognition, it can also be used for common scene text recognition and achieve state-of-the-art performance on seven mainstream scene text recognition datasets.

AAAI Conference 2024 Conference Paper

Continual Vision-Language Retrieval via Dynamic Knowledge Rectification

  • Zhenyu Cui
  • Yuxin Peng
  • Xun Wang
  • Manyu Zhu
  • Jiahuan Zhou

The recent large-scale pre-trained models like CLIP have aroused great concern in vision-language tasks. However, when required to match image-text data collected in a streaming manner, namely Continual Vision-Language Retrieval (CVRL), their performances are still limited due to the catastrophic forgetting of the learned old knowledge. To handle this issue, advanced methods are proposed to distill the affinity knowledge between images and texts from the old model to the new one for anti-forgetting. Unfortunately, existing approaches neglect the impact of incorrect affinity, which prevents the balance between the anti-forgetting of old knowledge and the acquisition of new knowledge. Therefore, we propose a novel framework called Dynamic Knowledge Rectification (DKR) that simultaneously achieves incorrect knowledge filtering and rectification. Specifically, we first filter the incorrect affinity knowledge calculated by the old model on the new data. Then, a knowledge rectification method is designed to rectify the incorrect affinities while preserving the correct ones. In particular, for the new data that can only be correctly retrieved by the new model, we rectify them with the corresponding new affinity to protect them from negative transfer. Additionally, for those that can not be retrieved by either the old or the new model, we introduce paired ground-truth labels to promote the acquisition of both old and new knowledge. Extensive experiments on several benchmark datasets demonstrate the effectiveness of our DKR and its superiority against state-of-the-art methods.

AAAI Conference 2024 Conference Paper

Robust Visual Imitation Learning with Inverse Dynamics Representations

  • Siyuan Li
  • Xun Wang
  • Rongchang Zuo
  • Kewu Sun
  • Lingfei Cui
  • Jishiyu Ding
  • Peng Liu
  • Zhe Ma

Imitation learning (IL) has achieved considerable success in solving complex sequential decision-making problems. However, current IL methods mainly assume that the environment for learning policies is the same as the environment for collecting expert datasets. Therefore, these methods may fail to work when there are slight differences between the learning and expert environments, especially for challenging problems with high-dimensional image observations. However, in real-world scenarios, it is rare to have the chance to collect expert trajectories precisely in the target learning environment. To address this challenge, we propose a novel robust imitation learning approach, where we develop an inverse dynamics state representation learning objective to align the expert environment and the learning environment. With the abstract state representation, we design an effective reward function, which thoroughly measures the similarity between behavior data and expert data not only element-wise, but also from the trajectory level. We conduct extensive experiments to evaluate the proposed approach under various visual perturbations and in diverse visual control tasks. Our approach can achieve a near-expert performance in most environments, and significantly outperforms the state-of-the-art visual IL methods and robust IL methods.

AAAI Conference 2024 Conference Paper

Simultaneous Optimization of Bid Shading and Internal Auction for Demand-Side Platforms

  • Yadong Xu
  • Bonan Ni
  • Weiran Shen
  • Xun Wang
  • Zichen Wang
  • Yinsong Xue
  • Pingzhong Tang

Online advertising has been one of the most important sources for industry's growth, where the demand-side platforms (DSP) play an important role via bidding to the ad exchanges on behalf of their advertiser clients. Since more and more ad exchanges have shifted from second to first price auctions, it is challenging for DSPs to adjust bidding strategy in the volatile environment. Recent studies on bid shading in first-price auctions may have limited performance due to relatively strong hypotheses about winning probability distribution. Moreover, these studies do not consider the incentive of advertiser clients, which can be crucial for a reliable advertising platform. In this work, we consider both the optimization of bid shading technique and the design of internal auction which is ex-post incentive compatible (IC) for the management of a DSP. Firstly, we prove that the joint design of bid shading and ex-post IC auction can be reduced to choosing one monotone bid function for each advertiser without loss of optimality. Then we propose a parameterized neural network to implement the monotone bid functions. With well-designed surrogate loss, the objective can be optimized in an end-to-end manner. Finally, our experimental results demonstrate the effectiveness and superiority of our algorithm.

JBHI Journal 2024 Journal Article

TBCA: Prediction of Transcription Factor Binding Sites Using a Deep Neural Network With Lightweight Attention Mechanism

  • Xun Wang
  • Qiao Lian
  • Peng Qu
  • Qing Yang

The identification of transcription factor binding sites (TFBSs) is crucial for understanding the regulatory mechanisms of gene expression, which contributes to unraveling cellular functions and disease development. Currently, the most common approach involves the use of deep learning techniques to predict TFBSs by combining sequence and shape features. Although significant progress has been made with these methods, the integration of local features extracted from DNA sequences and shapes with global features has not yet reached a sufficient level, and there is still significant room for improvement in the accuracy of prediction results. In this paper, we propose a novel framework based on convolution and attention mechanisms, referred to as TBCA, which combines DNA sequence information and shape information for predicting transcription factor binding sites. In this work, we employ a two-layer convolutional neural network (CNN) and a self-attention mechanism to extract complex sequence features from DNA. Moreover, we utilize a Fourier-transform-enhanced multi-head attention along with channel attention to extract high-order shape features of DNA. Finally, these high-order sequence and shape features are integrated into the channel dimension to achieve accurate TFBS prediction. Our research results demonstrate that TBCA exhibits superior predictive performance in 165 validated ChIP-seq datasets. Furthermore, the employed attention mechanisms can automatically learn important features at different positions and scales, enhancing the accuracy and robustness of feature representation. We also conduct an in-depth analysis of the contributions of five different shapes to site prediction, revealing that shape features can enhance the prediction of transcription factor DNA binding.

NeurIPS Conference 2024 Conference Paper

xRAG: Extreme Context Compression for Retrieval-augmented Generation with One Token

  • Xin Cheng
  • Xun Wang
  • Xingxing Zhang
  • Tao Ge
  • Si-Qing Chen
  • Furu Wei
  • Huishuai Zhang
  • Dongyan Zhao

This paper introduces xRAG, an innovative context compression method tailored for retrieval-augmented generation. xRAG reinterprets document embeddings in dense retrieval--traditionally used solely for retrieval--as features from the retrieval modality. By employing a modality fusion methodology, xRAG seamlessly integrates these embeddings into the language model representation space, effectively eliminating the need for their textual counterparts and achieving an extreme compression rate. In xRAG, the only trainable component is the modality bridge, while both the retriever and the language model remain frozen. This design choice allows for the reuse of offline-constructed document embeddings and preserves the plug-and-play nature of retrieval augmentation. Experimental results demonstrate that xRAG achieves an average improvement of over 10% across six knowledge-intensive tasks, adaptable to various language model backbones, ranging from a dense 7B model to an 8x7B Mixture of Experts configuration. xRAG not only significantly outperforms previous context compression methods but also matches the performance of uncompressed models on several datasets, while reducing overall FLOPs by a factor of 3.53. Our work pioneers new directions in retrieval-augmented generation from the perspective of multimodality fusion, and we hope it lays the foundation for future efficient and scalable retrieval-augmented systems.

NeurIPS Conference 2023 Conference Paper

Extensible Prompts for Language Models on Zero-shot Language Style Customization

  • Tao Ge
  • Hu Jing
  • Li Dong
  • Shaoguang Mao
  • Yan Xia
  • Xun Wang
  • Si-Qing Chen
  • Furu Wei

We propose eXtensible Prompt (X-Prompt) for prompting a large language model (LLM) beyond natural language (NL). X-Prompt instructs an LLM with not only NL but also an extensible vocabulary of imaginary words. Registering new imaginary words allows us to instruct the LLM to comprehend concepts that are difficult to describe with NL words, thereby making a prompt more descriptive. Also, these imaginary words are designed to be out-of-distribution (OOD) robust so that they can be (re)used like NL words in various prompts, distinguishing X-Prompt from soft prompts, which are designed to fit in-distribution data. We propose context-augmented learning (CAL) to learn imaginary words for general usability, enabling them to work properly in OOD (unseen) prompts. We experiment with X-Prompt on zero-shot language style customization as a case study. The promising results of X-Prompt demonstrate its potential to facilitate advanced interaction beyond the natural language interface, bridging the communication gap between humans and LLMs.

AAAI Conference 2023 Conference Paper

Hierarchical Contrast for Unsupervised Skeleton-Based Action Representation Learning

  • Jianfeng Dong
  • Shengkai Sun
  • Zhonglin Liu
  • Shujie Chen
  • Baolong Liu
  • Xun Wang

This paper targets unsupervised skeleton-based action representation learning and proposes a new Hierarchical Contrast (HiCo) framework. Different from the existing contrastive-based solutions that typically represent an input skeleton sequence into instance-level features and perform contrast holistically, our proposed HiCo represents the input into multiple-level features and performs contrast in a hierarchical manner. Specifically, given a human skeleton sequence, we represent it into multiple feature vectors of different granularities from both temporal and spatial domains via sequence-to-sequence (S2S) encoders and unified downsampling modules. Besides, the hierarchical contrast is conducted in terms of four levels: instance level, domain level, clip level, and part level. Moreover, HiCo is orthogonal to the S2S encoder, which allows us to flexibly embrace state-of-the-art S2S encoders. Extensive experiments on four datasets, i.e., NTU-60, NTU-120, PKU-I and PKU-II, show that HiCo achieves a new state-of-the-art for unsupervised skeleton-based action representation learning in two downstream tasks including action recognition and retrieval, and its learned action representation is of good transferability. Besides, we also show that our framework is effective for semi-supervised skeleton-based action recognition. Our code is available at https://github.com/HuiGuanLab/HiCo.

JBHI Journal 2023 Journal Article

TransFusionNet: Semantic and Spatial Features Fusion Framework for Liver Tumor and Vessel Segmentation Under JetsonTX2

  • Xun Wang
  • Xudong Zhang
  • Gan Wang
  • Ying Zhang
  • Xin Shi
  • Huanhuan Dai
  • Min Liu
  • Zixuan Wang

Liver cancer is one of the most common malignant diseases worldwide. Segmentation and reconstruction of liver tumors and vessels in CT images can assist physicians in preoperative planning and surgical intervention. In this paper, we introduce the TransFusionNet framework, which consists of a semantic feature extraction module, a local spatial feature extraction module, an edge feature extraction module, and a multi-scale feature fusion module to achieve fine-grained segmentation of liver tumors and vessels. In addition, we apply transfer learning, pre-training on public datasets and then fine-tuning the model to further improve the fit. Furthermore, we propose an intelligent quantization scheme to compress the model weights and achieve high-performance inference on JetsonTX2. The TransFusionNet framework achieves a mean IoU of 0.854 on the vessel segmentation task and a mean IoU of 0.927 on the liver tumor segmentation task. When profiling the computational performance of quantized inference, our quantized model achieves 4 TFLOPs on a node with an NVIDIA RTX3090 and 132 GFLOPs on JetsonTX2. This segmentation performance alleviates, to a certain extent, the accuracy and performance bottlenecks of automated segmentation.

EAAI Journal 2022 Journal Article

Geometrically interpretable Variance Hyper Rectangle learning for pattern classification

  • Jie Sun
  • Huamao Gu
  • Haoyu Peng
  • Yili Fang
  • Xun Wang

Many current intrinsically interpretable machine learning models can only handle data that are linear and low-dimensional, with relatively independent and often discrete-valued attributes, while models that are capable of handling high-dimensional nonlinear data, like deep learning, have very poor interpretability. Based on geometric characteristics, a new idea of accurately wrapping the data region with a minimum-volume geometry is proposed for pattern classification. The Variance Hyper Rectangle (VHR) model presented in this paper is a realization of this idea. The VHR model uses minimum-volume hyper rectangles, obtained through projection variance calculation, to wrap the regions occupied by a category of data, hence it has strong and clear geometric interpretability. In addition, the VHR model is well suited to large data volumes, as it approaches linear complexity in both time and space. Extensive qualitative and quantitative experiments are performed on seven real-world data sets, demonstrating that VHR outperforms the state-of-the-art interpretable methods while running quickly.

LORI Conference 2021 Conference Paper

Completeness Theorems for ∃☐-Fragment of First-Order Modal Logic

  • Xun Wang

The paper expands upon the work by Wang [4], who proposes a new framework based on quantifier-free predicate language extended by a new modality \(\exists x\Box\) and axiomatizes the logic over S5 frames. This paper gives the logics over K, D, T, 4, S4 frames with increasing and constant domains. We also provide a general strategy for proving completeness theorems for logics w.r.t. the increasing domain and logics w.r.t. the constant domain respectively.

AAAI Conference 2021 Conference Paper

Coupon Design in Advertising Systems

  • Weiran Shen
  • Pingzhong Tang
  • Xun Wang
  • Yadong Xu
  • Xiwang Yang

Online platforms sell advertisements via auctions (e.g., VCG and GSP auctions), and revenue maximization is one of the most important tasks for them. Many methods for increasing revenue have been proposed, such as reserve pricing, boosting, and coupons. The novelty of coupons rests on the fact that coupons are optional for advertisers while the others are compulsory. Recent studies on coupons have limited applications in advertising systems because they only focus on second price auctions and do not consider the combination with other methods. In this work, we study the coupon design problem for revenue maximization in the widely used VCG auction. Firstly, we examine the bidder strategies in the VCG auction with coupons. Secondly, we cast the coupon design problem into a learning framework and propose corresponding algorithms using the properties of the VCG auction. Then we further study how to combine coupons with reserve pricing in our framework. Finally, extensive experiments are conducted to demonstrate the effectiveness of our algorithms based on both synthetic data and industrial data.

NeurIPS Conference 2021 Conference Paper

Spatial Ensemble: a Novel Model Smoothing Mechanism for Student-Teacher Framework

  • Tengteng Huang
  • Yifan Sun
  • Xun Wang
  • Haotian Yao
  • Chi Zhang

Model smoothing is of central importance for obtaining a reliable teacher model in the student-teacher framework, where the teacher generates surrogate supervision signals to train the student. A popular model smoothing method is the Temporal Moving Average (TMA), which continuously averages the teacher parameters with the up-to-date student parameters. In this paper, we propose "Spatial Ensemble", a novel model smoothing mechanism in parallel with TMA. Spatial Ensemble randomly picks a small fragment of the student model to directly replace the corresponding fragment of the teacher model. Consequently, it stitches different fragments of historical student models into a unity, yielding the "Spatial Ensemble" effect. Spatial Ensemble obtains comparable student-teacher learning performance by itself and demonstrates valuable complementarity with temporal moving average. Their integration, named Spatial-Temporal Smoothing, brings general (sometimes significant) improvement to the student-teacher learning framework on a variety of state-of-the-art methods. For example, based on the self-supervised method BYOL, it yields +0.9% top-1 accuracy improvement on ImageNet, while based on the semi-supervised approach FixMatch, it increases the top-1 accuracy by around +6% on CIFAR-10 when only a few training labels are available. Codes and models are available at: https://github.com/tengteng95/Spatial_Ensemble.
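The two smoothing rules named in this abstract can be illustrated with a minimal Python sketch. It is not the authors' implementation: the function names, the `momentum` and `replace_prob` parameters, and the choice of treating each scalar parameter as a "fragment" are all assumptions made for illustration, since the abstract does not specify them.

```python
import random

def temporal_moving_average(teacher, student, momentum=0.99):
    """TMA: blend every teacher parameter with the current student parameter.

    `momentum` is a hypothetical name for the averaging weight.
    """
    return [momentum * t + (1.0 - momentum) * s for t, s in zip(teacher, student)]

def spatial_ensemble(teacher, student, replace_prob=0.1, rng=random):
    """Spatial Ensemble (sketch): copy a random subset of student parameters
    into the teacher and leave the remaining teacher parameters untouched.

    Assumption: a "fragment" is a single scalar parameter here; the abstract
    does not state the granularity. `replace_prob` is a hypothetical name.
    """
    return [s if rng.random() < replace_prob else t for t, s in zip(teacher, student)]

if __name__ == "__main__":
    teacher = [0.0, 0.0, 0.0, 0.0]
    student = [1.0, 2.0, 3.0, 4.0]
    print(temporal_moving_average(teacher, student, momentum=0.5))
    print(spatial_ensemble(teacher, student, replace_prob=0.5))
```

In this sketch TMA moves every parameter a little toward the student, whereas the spatial rule moves a few parameters all the way and leaves the rest unchanged; repeated over training steps, the teacher then holds parameters copied from different historical students.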

AAAI Conference 2020 Conference Paper

Channel Interaction Networks for Fine-Grained Image Categorization

  • Yu Gao
  • Xintong Han
  • Xun Wang
  • Weilin Huang
  • Matthew Scott

Fine-grained image categorization is challenging due to the subtle inter-class differences. We posit that exploiting the rich relationships between channels can help capture such differences since different channels correspond to different semantics. In this paper, we propose a channel interaction network (CIN), which models the channel-wise interplay both within an image and across images. For a single image, a self-channel interaction (SCI) module is proposed to explore channel-wise correlation within the image. This allows the model to learn the complementary features from the correlated channels, yielding stronger fine-grained features. Furthermore, given an image pair, we introduce a contrastive channel interaction (CCI) module to model the cross-sample channel interaction with a metric learning framework, allowing the CIN to distinguish the subtle visual differences between images. Our model can be trained efficiently in an end-to-end fashion without the need of multi-stage training and testing. Finally, comprehensive experiments are conducted on three publicly available benchmarks, where the proposed method consistently outperforms the state-of-the-art approaches, such as DFL-CNN (Wang, Morariu, and Davis 2018) and NTS (Yang et al. 2018).

AAAI Conference 2020 Conference Paper

HAL: Improved Text-Image Matching by Mitigating Visual Semantic Hubs

  • Fangyu Liu
  • Rongtian Ye
  • Xun Wang
  • Shuaipeng Li

The hubness problem widely exists in high-dimensional embedding space and is a fundamental source of error for cross-modal matching tasks. In this work, we study the emergence of hubs in Visual Semantic Embeddings (VSE) with application to text-image matching. We analyze the pros and cons of two widely adopted optimization objectives for training VSE and propose a novel hubness-aware loss function (HAL) that addresses previous methods’ defects. Unlike (Faghri et al. 2018), which simply takes the hardest sample within a minibatch, HAL takes all samples into account, using both local and global statistics to scale up the weights of “hubs”. We evaluate our method with various configurations of model architectures and datasets. The method exhibits exceptionally good robustness and brings consistent improvement on the task of text-image matching across all settings. Specifically, under the same model architectures as (Faghri et al. 2018) and (Lee et al. 2018), by switching only the learning objective, we report a maximum R@1 improvement of 7.4% on MS-COCO and 8.3% on Flickr30k.

AAAI Conference 2020 Short Paper

HARK: Harshness-Aware Sentiment Analysis Framework for Product Review (Student Abstract)

  • Ting Zhou
  • Xun Wang
  • Yili Fang

Sentiment analysis is a helpful mechanism for understanding market feedback on commodities from user comments. Each user writes comments according to his/her own preference, which we refer to as harshness. Existing methods mainly apply majority voting or its variants to directly infer the evaluation of products. However, because they ignore the harshness of users, these methods produce low-quality sentiment inference that is far from the result of an expert analysis report. To this end, we propose HARK, a harshness-aware product analysis framework. First, we employ a Bayesian-based model for sentiment analysis. Moreover, in order to infer the reliable sentiment concerning each product from all the comments, we present a probabilistic graphical model in which the harshness is incorporated. Extensive experimental evaluations show that the result of our method is more consistent with the expert evaluation than that of the state-of-the-art methods. Our method also outperforms the method that infers the final sentiment with the ground truth of comments but without involving the harshness of users.

AAAI Conference 2020 Conference Paper

MetaMT, a Meta Learning Method Leveraging Multiple Domain Data for Low Resource Machine Translation

  • Rumeng Li
  • Xun Wang
  • Hong Yu

Neural machine translation (NMT) models have achieved state-of-the-art translation quality when a large quantity of parallel corpora is available. However, their performance suffers significantly when it comes to domain-specific translations, in which training data are usually scarce. In this paper, we present a novel NMT model with a new word embedding transition technique for fast domain adaptation. We propose to split parameters in the model into two groups: model parameters and meta parameters. The former are used to model the translation while the latter are used to adjust the representational space to generalize the model to different domains. We mimic the domain adaptation of the machine translation model to low-resource domains using multiple translation tasks on different domains. A new training strategy based on meta-learning is developed along with the proposed model to update the model parameters and meta parameters alternately. Experiments on datasets of different domains show substantial improvements in NMT performance with a limited amount of data.

LORI Conference 2019 Conference Paper

A Logic of Knowing How with Skippable Plans

  • Xun Wang

The paper expands upon the work by Wang [16], who proposes a single-agent modal logic framework for reasoning about “knowing how”. This paper proposes a more flexible semantics for the knowing-how operator. According to this semantics, an agent knows how to achieve \(\varphi\) given \(\psi\) if there exists a finite linear plan such that it will end up with some \(\varphi\)-state from any \(\psi\)-state by executing the plan, either fully or skipping some non-executable steps. We give a sound and complete axiomatization of this logic. Finally we introduce a suitable notion of bisimulation for this logic.

JAIR Journal 2019 Journal Article

Multi-scale Hierarchical Residual Network for Dense Captioning

  • Yan Tian
  • Xun Wang
  • Jiachen Wu
  • Ruili Wang
  • Bailin Yang

Recent research on dense captioning based on recurrent neural networks and convolutional neural networks has made great progress. However, mapping from an image feature space to a description space is a nonlinear and multimodal task, which makes it difficult for current methods to get accurate results. In this paper, we put forward a novel approach for dense captioning based on hourglass-structured residual learning. Discriminant feature maps are obtained by incorporating densely connected networks and residual learning in our model. Finally, the performance of the approach is demonstrated on the Visual Genome V1.0 dataset and the region-labelled MS-COCO (Microsoft Common Objects in Context) dataset. The experimental results show that our approach outperforms most current methods.