Arrow Research search

Author name cluster

Xin Jiang

Possible papers associated with this exact author name in Arrow. This page groups case-insensitive exact name matches and is not a full identity disambiguation profile.

40 papers
2 author rows

Possible papers

40

AAAI Conference 2026 Conference Paper

Fine-Grained Image Retrieval via Dual-Vision Adaptation

  • Xin Jiang
  • Meiqi Cao
  • Hao Tang
  • Fei Shen
  • Zechao Li

Fine-Grained Image Retrieval (FGIR) faces challenges in learning discriminative visual representations to retrieve images with similar fine-grained features. Current leading FGIR solutions typically follow one of two regimes: enforcing pairwise similarity constraints in the semantic embedding space, or incorporating a localization sub-network to fine-tune the entire model. However, both regimes tend to overfit the training data while forgetting the knowledge gained from large-scale pre-training, thus reducing generalization ability. In this paper, we propose a Dual-Vision Adaptation (DVA) approach for FGIR, which guides the frozen pre-trained model to perform FGIR through collaborative sample and feature adaptation. Specifically, we design Object-Perceptual Adaptation, which modifies input samples to help the pre-trained model perceive critical objects and the elements within objects that are helpful for category prediction. Meanwhile, we propose In-Context Adaptation, which introduces a small set of parameters for feature adaptation without modifying the pre-trained parameters. This brings the FGIR task using these adapted features closer to the task solved during pre-training. Additionally, to balance retrieval efficiency and performance, we propose Discrimination Perception Transfer to transfer the discriminative knowledge from the object-perceptual adaptation to the image encoder via knowledge distillation. Extensive experiments show that DVA performs well on three fine-grained datasets.

AAAI Conference 2026 Conference Paper

ToolACE-R: Model-aware Iterative Training and Adaptive Refinement for Tool Learning

  • Xingshan Zeng
  • Weiwen Liu
  • Xu Huang
  • Zezhong Wang
  • Lingzhi Wang
  • Liangyou Li
  • Yasheng Wang
  • Lifeng Shang

Tool learning, which allows Large Language Models (LLMs) to leverage external tools for solving complex user tasks, has emerged as a promising avenue for extending model capabilities. However, existing approaches primarily focus on data synthesis for fine-tuning LLMs to invoke tools effectively, largely ignoring how to fully stimulate the potential of the model. In this paper, we propose ToolACE-R, a novel framework that includes both model-aware iterative training and adaptive refinement for tool learning. ToolACE-R features a model-aware iterative training procedure that progressively adjusts training samples based on the model's evolving capabilities to maximize its potential. Additionally, it incorporates a self-refinement training corpus that emphasizes LLMs' ability to iteratively refine their tool calls, optimizing performance without requiring external feedback. Furthermore, we introduce adaptive self-refinement for efficient test-time scaling, where the trained model can autonomously determine when to stop the process based on iterative self-refinement. We conduct extensive experiments across several benchmark datasets, showing that ToolACE-R achieves competitive performance compared to advanced LLMs, and that performance can be further improved efficiently through adaptive self-refinement. These results highlight the effectiveness and generalizability of ToolACE-R, offering a promising direction for more efficient and scalable tool learning.

ICLR Conference 2025 Conference Paper

Beyond Autoregression: Discrete Diffusion for Complex Reasoning and Planning

  • Jiacheng Ye
  • Jiahui Gao
  • Shansan Gong
  • Lin Zheng
  • Xin Jiang
  • Zhenguo Li
  • Lingpeng Kong

Autoregressive language models, despite their impressive capabilities, struggle with complex reasoning and long-term planning tasks. We introduce discrete diffusion models as a novel solution to these challenges. Through the lens of subgoal imbalance, we demonstrate how diffusion models effectively learn difficult subgoals that elude autoregressive approaches. We propose Multi-Granularity Diffusion Modeling (MGDM), which prioritizes subgoals based on difficulty during learning. On complex tasks like Countdown, Sudoku, and Boolean Satisfiability Problems, MGDM significantly outperforms autoregressive models without using search techniques. For instance, MGDM achieves 91.5% and 100% accuracy on Countdown and Sudoku, respectively, compared to 45.8% and 20.7% for autoregressive models. Our work highlights the potential of diffusion-based approaches in advancing AI capabilities for sophisticated language understanding and problem-solving tasks. All associated code is available at https://github.com/HKUNLP/diffusion-vs-ar.

AAAI Conference 2025 Conference Paper

CognitionCapturer: Decoding Visual Stimuli from Human EEG Signal with Multimodal Information

  • Kaifan Zhang
  • Lihuo He
  • Xin Jiang
  • Wen Lu
  • Di Wang
  • Xinbo Gao

Electroencephalogram (EEG) signals have attracted significant attention from researchers due to their non-invasive nature and high temporal sensitivity in decoding visual stimuli. However, most recent studies have focused solely on the relationship between EEG and image data pairs, neglecting the valuable "beyond-image-modality" information embedded in EEG signals. This results in the loss of critical multimodal information in EEG. To address this limitation, this paper proposes a unified framework that fully leverages multimodal data to represent EEG signals, named CognitionCapturer. Specifically, CognitionCapturer trains modality expert encoders for each modality to extract cross-modal information from the EEG modality. It then introduces a diffusion prior to map the EEG embedding space to the CLIP embedding space; combined with a pretrained generative model, the proposed framework can reconstruct visual stimuli with high semantic and structural fidelity. Notably, the framework does not require any fine-tuning of the generative models and can be extended to incorporate more modalities. Through extensive experiments, we demonstrate that CognitionCapturer outperforms state-of-the-art methods both qualitatively and quantitatively.

IROS Conference 2025 Conference Paper

Flipping Manipulation with a Two-Fingered Parallel-Jaw Gripper

  • Wenxi Liao
  • Shao Hu
  • Zhitong Liu
  • Xin Jiang

Industrial part reorientation remains a critical challenge in automated manufacturing workflows, particularly with parallel-jaw grippers lacking the dexterity for complex manipulations. This paper presents a systematic flipping strategy for structured environments. A quasi-static force equilibrium model is developed to characterize multi-contact manipulation systems, and stability criteria are derived through wrench space analysis, enabling A*-based optimal trajectory generation within the derived stable configuration space. To ensure persistent fingertip-object contact, adaptive impedance control dynamically adjusts gripper stiffness based on real-time force thresholds, preventing unintended detachment. Experimental validation demonstrates robust performance in two representative scenarios: 1) cube flipping on a compliant surface (84.3 g, 90% success over 50 trials), 2) vision-free continuous pivoting of an irregular part on a rigid substrate (56 g, 88% success over 50 trials). The methodology requires neither environmental modification nor expensive tactile sensing, showing promise for practical deployment in structured manufacturing systems.

ICLR Conference 2025 Conference Paper

Forewarned is Forearmed: Harnessing LLMs for Data Synthesis via Failure-induced Exploration

  • Qintong Li
  • Jiahui Gao
  • Sheng Wang
  • Renjie Pi
  • Xueliang Zhao
  • Chuan Wu
  • Xin Jiang
  • Zhenguo Li

Large language models (LLMs) have significantly benefited from training on diverse, high-quality task-specific data, leading to impressive performance across a range of downstream applications. Current methods often rely on human-annotated data or predefined task templates to direct powerful LLMs in synthesizing task-relevant data for effective model training. However, this dependence on manually designed components may constrain the scope of generated data, potentially overlooking critical edge cases or novel scenarios that could challenge the model. In this paper, we present a novel approach, ReverseGen, designed to automatically generate effective training samples that expose the weaknesses of LLMs. Specifically, we introduce a dedicated proposer trained to produce queries that lead target models to generate unsatisfactory responses. These failure-inducing queries are then used to construct training data, helping to address the models' shortcomings and improve overall performance. Our approach is flexible and can be applied to models of various scales (3B, 7B, and 8B). We evaluate ReverseGen on three key applications—safety, honesty, and math—demonstrating that our generated data is both highly effective and diverse. Models fine-tuned with ReverseGen-generated data consistently outperform those trained on human-annotated or general model-generated data, offering a new perspective on data synthesis for task-specific LLM enhancement.

AAAI Conference 2025 Conference Paper

IMAGDressing-v1: Customizable Virtual Dressing

  • Fei Shen
  • Xin Jiang
  • Xin He
  • Hu Ye
  • Cong Wang
  • Xiaoyu Du
  • Zechao Li
  • Jinhui Tang

Existing virtual try-on (VTON) methods provide only limited user control over garment attributes and generally overlook essential factors such as face, pose, and scene context. To address these limitations, we introduce the virtual dressing (VD) task, which aims to synthesize freely editable human images conditioned on fixed garments and optional user-defined inputs. We further propose a comprehensive affinity metric index (CAMI) to quantify the consistency between generated outputs and reference garments. We present IMAGDressing-v1, which leverages a garment-specific U-Net to integrate semantic features from CLIP and texture features from a VAE. To incorporate these garment features into a frozen denoising U-Net for flexible text-driven scene control, we employ a hybrid attention mechanism composed of frozen self-attention and trainable cross-attention layers. IMAGDressing-v1 seamlessly integrates with extension modules, such as ControlNet and IP-Adapter, enabling enhanced diversity and controllability. To alleviate data constraints, we introduce the Interactive Garment Pairing (IGPair) dataset, comprising over 300,000 garment–image pairs and a standardized data assembly pipeline. Extensive experiments demonstrate that IMAGDressing-v1 achieves state-of-the-art performance in controlled human image synthesis. The code and model will be available at https://github.com/muzishen/IMAGDressing.

IJCAI Conference 2025 Conference Paper

Not All Layers of LLMs Are Necessary During Inference

  • Siqi Fan
  • Xin Jiang
  • Xiang Li
  • Xuying Meng
  • Peng Han
  • Shuo Shang
  • Aixin Sun
  • Yequan Wang

Due to the large number of parameters, the inference phase of Large Language Models (LLMs) is resource-intensive. However, not all requests posed to LLMs are equally difficult to handle. Through analysis, we show that for some tasks, LLMs can achieve results comparable to the final output at some intermediate layers. That is, not all layers of LLMs are necessary during inference. If we can predict at which layer the inferred results match the final results (produced by evaluating all layers), we could significantly reduce the inference cost. To this end, we propose a simple yet effective algorithm named AdaInfer to adaptively terminate the inference process for an input instance. AdaInfer relies on easily obtainable statistical features and classic classifiers like SVM. Experiments on well-known LLMs like the Llama2 series and OPT show that AdaInfer achieves an average pruning ratio of 17.8%, and up to 43% on sentiment tasks, with nearly no performance drop (<1%). Because AdaInfer does not alter LLM parameters, LLMs incorporating AdaInfer maintain generalizability across tasks.
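
To make the early-exit idea concrete, here is a minimal sketch (not the authors' code): the synthetic per-layer features and toy labels below stand in for the paper's real hidden-state statistics, and only the decision loop mirrors the described mechanism.

```python
# Sketch of AdaInfer-style adaptive early exit. Assumptions: synthetic
# features replace real hidden-state statistics; labels mark whether an
# intermediate layer's prediction already matches the final output.
import numpy as np
from sklearn.svm import SVC

rng = np.random.default_rng(0)
n_layers, n_feats = 32, 4

# Train a classic classifier on (feature, converged?) pairs.
X_train = rng.normal(size=(2000, n_feats))
y_train = (X_train[:, 0] + 0.5 * X_train[:, 1] > 0).astype(int)  # toy labels
clf = SVC(kernel="rbf").fit(X_train, y_train)

def adaptive_inference(layer_features):
    """Run layers in order; stop at the first layer the classifier
    deems 'converged'. Returns the exit layer index."""
    for layer_idx, feats in enumerate(layer_features):
        if clf.predict(feats.reshape(1, -1))[0] == 1:
            return layer_idx  # early exit: remaining layers are skipped
    return n_layers - 1      # no early exit; use the full model

features_per_layer = rng.normal(size=(n_layers, n_feats))
print("exit at layer", adaptive_inference(features_per_layer))
```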

ICML Conference 2025 Conference Paper

SepLLM: Accelerate Large Language Models by Compressing One Segment into One Separator

  • Guoxuan Chen
  • Han Shi
  • Jiawei Li
  • Yihang Gao
  • Xiaozhe Ren
  • Yimeng Chen
  • Xin Jiang
  • Zhenguo Li

Large Language Models (LLMs) have exhibited exceptional performance across a spectrum of natural language processing tasks. However, their substantial sizes pose considerable challenges, particularly in computational demands and inference speed, due to their quadratic complexity. In this work, we have identified a key pattern: certain seemingly meaningless separator tokens (i.e., punctuation) contribute disproportionately to attention scores compared to semantically meaningful tokens. This observation suggests that the information of the segments between these separator tokens can be effectively condensed into the separator tokens themselves without significant information loss. Guided by this insight, we introduce SepLLM, a plug-and-play framework that accelerates inference by compressing these segments and eliminating redundant tokens. Additionally, we implement efficient kernels for training acceleration. Experimental results across training-free, training-from-scratch, and post-training settings demonstrate SepLLM's effectiveness. Notably, using the Llama-3-8B backbone, SepLLM achieves over 50% reduction in KV cache on the GSM8K-CoT benchmark while maintaining comparable performance. Furthermore, in streaming settings, SepLLM effectively processes sequences of up to 4 million tokens or more while maintaining consistent language modeling capabilities.
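
One plausible reading of the separator-compression idea, rendered as an attention mask: each query attends to its current segment in full but sees earlier segments only through their separator tokens. This is a sketch of my reading of the abstract; the paper's exact retention rules (e.g. for initial or neighboring tokens) are not reproduced here.

```python
# SepLLM-style attention mask sketch (assumption: only same-segment
# tokens and past separators are kept; real retention rules may differ).
import numpy as np

tokens = ["The", "cat", "sat", ".", "It", "purred", ",", "loudly", "."]
is_sep = np.array([t in {".", ","} for t in tokens])
n = len(tokens)

# Segment id: increments after every separator token.
seg = np.cumsum(np.concatenate([[0], is_sep[:-1]]))

mask = np.zeros((n, n), dtype=bool)
for q in range(n):
    for k in range(q + 1):                      # causal attention
        same_segment = seg[k] == seg[q]
        mask[q, k] = same_segment or is_sep[k]  # past segments via separators

print(mask.astype(int))
```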

IROS Conference 2025 Conference Paper

Vision Guided Cable Installation in Constraint Environments Utilizing Parametric Curve Representation

  • Xin Jiang
  • Huangtao Wei
  • Zhitong Liu
  • Wenxi Liao
  • Wei Ran

In this paper, a vision-based method is proposed for cable installation tasks in constrained environments. The main challenge of such tasks lies in the potential interference between the cable and surrounding obstacles. Model-based approaches are not well-suited for these industrial scenarios due to variations in the physical properties of workpieces. To address this, the proposed method integrates a potential field-based tip trajectory regulation with a shape deformation servo. In the shape deformation servo, a planner is employed to determine a feasible shape curve that avoids obstacles. This step is crucial, as an infeasible reference for shape control may result in unstable behavior. The effectiveness of the proposed method is validated through experiments. Notably, this approach does not rely on prior model information, making it highly adaptable for industrial deployment.

IROS Conference 2025 Conference Paper

Vision-Based Tactile Sensor Using Light-Conductive Plate for Enhanced Force Sensing Capability

  • Zhitong Liu
  • Wenxi Liao
  • Xin Jiang

In recent years, tactile sensors have become essential for robotic systems, particularly in tasks requiring high-precision interaction and manipulation. The Vision-Based Tactile Sensor (VBTS) represents a significant advancement in tactile sensing, utilizing cameras to monitor the deformation of soft materials at the sensor tip. Pressure applied to the sensor alters the light propagation path, thereby changing the image captured by the camera. By combining image processing and deep learning, VBTS provides highly accurate estimates of contact position and force, achieving micrometer-level resolution. This paper presents a novel VBTS design that leverages a light-conductive plate and a silicone membrane to enhance the sensor’s sensitivity to force perception. The soft, thin nature of the silicone membrane allows for precise detection of minimal forces, making it suitable for tasks involving highly deformable objects. Experimental results demonstrate the sensor’s capability in detecting contact areas and force distributions, which can be applied in diverse domains such as soft object assembly, medical assistance, and food processing. Moreover, the proposed VBTS outperforms traditional sensors by utilizing computationally efficient algorithms that maintain real-time performance without compromising resolution.

ICRA Conference 2024 Conference Paper

Autonomous Quilt Spreading for Caregiving Robots

  • Yuchun Guo
  • Zhiqing Lu
  • Yanling Zhou
  • Xin Jiang

In this work, we propose a novel strategy to ensure infants, who inadvertently displace their quilts during sleep, are promptly and accurately re-covered. Our approach is formulated into two subsequent steps: interference resolution and quilt spreading. By leveraging the DWPose human skeletal detection and the Segment Anything instance segmentation models, the proposed method can accurately recognize the states of the infant and the quilt over them, which involves addressing the interference resulting from an infant's limbs lying on part of the quilt. Building upon prior research, the EM*D deep learning model is employed to forecast quilt state transitions before and after quilt spreading actions. To improve the sensitivity of the network in distinguishing state variations of the handled quilt, we introduce an enhanced loss function that translates the voxelized quilt state into a more representative one. Both simulation and real-world experiments validate the efficacy of our method in spreading and recovering a quilt over an infant.

NeurIPS Conference 2024 Conference Paper

DAPE: Data-Adaptive Positional Encoding for Length Extrapolation

  • Chuanyang Zheng
  • Yihang Gao
  • Han Shi
  • Minbin Huang
  • Jingyao Li
  • Jing Xiong
  • Xiaozhe Ren
  • Michael Ng

Positional encoding plays a crucial role in transformers, significantly impacting model performance and length generalization. Prior research has introduced absolute positional encoding (APE) and relative positional encoding (RPE) to distinguish token positions in given sequences. However, both APE and RPE remain fixed after model training regardless of input data, limiting their adaptability and flexibility. Hence, we expect that the desired positional encoding should be data-adaptive and dynamically adjustable with the given attention. In this paper, we propose a Data-Adaptive Positional Encoding (DAPE) method, which adjusts dynamically and semantically based on input context and learned fixed priors. Experimental validation on real-world datasets (Arxiv, Books3, and CHE) demonstrates that DAPE enhances model performance in terms of trained length and length generalization, where the improvements are statistically significant. The model visualization suggests that our model can keep both local and anti-local information. Finally, we successfully train the model on sequence length 128 and achieve better performance at evaluation sequence length 8192, compared with other static positional encoding methods, revealing the benefit of the adaptive positional encoding method.
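
One plausible reading of the data-adaptive bias, sketched below with made-up shapes and a tiny MLP over the attention logits and a static prior; the paper's exact parameterization may differ.

```python
# Sketch of a data-adaptive positional bias (assumptions: the adaptive
# bias is an MLP applied elementwise to [content logits, static bias];
# the static prior here is an arbitrary ALiBi-like linear decay).
import numpy as np

rng = np.random.default_rng(0)
T, d = 8, 16
q = rng.normal(size=(T, d)); k = rng.normal(size=(T, d))

logits = q @ k.T / np.sqrt(d)                       # (T, T) content scores
rel = np.abs(np.arange(T)[:, None] - np.arange(T))  # relative distances
static_bias = -0.1 * rel                            # learned fixed prior

# Tiny MLP f([logits, static_bias]) -> adaptive bias, per position pair.
W1 = rng.normal(size=(2, 8)); W2 = rng.normal(size=(8, 1))
x = np.stack([logits, static_bias], axis=-1)        # (T, T, 2)
adaptive_bias = np.maximum(x @ W1, 0) @ W2          # (T, T, 1)

scores = logits + adaptive_bias[..., 0]             # data-adaptive encoding
attn = np.exp(scores) / np.exp(scores).sum(-1, keepdims=True)
print(attn.shape)
```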

AAAI Conference 2024 Conference Paper

Delving into Multimodal Prompting for Fine-Grained Visual Classification

  • Xin Jiang
  • Hao Tang
  • Junyao Gao
  • Xiaoyu Du
  • Shengfeng He
  • Zechao Li

Fine-grained visual classification (FGVC) involves categorizing fine subdivisions within a broader category, which poses challenges due to subtle inter-class discrepancies and large intra-class variations. However, prevailing approaches primarily focus on uni-modal visual concepts. Recent advancements in pre-trained vision-language models have demonstrated remarkable performance in various high-level vision tasks, yet the applicability of such models to FGVC tasks remains uncertain. In this paper, we aim to fully exploit the capabilities of cross-modal description to tackle FGVC tasks and propose a novel multimodal prompting solution, denoted as MP-FGVC, based on the contrastive language-image pre-training (CLIP) model. Our MP-FGVC comprises a multimodal prompting scheme and a multimodal adaptation scheme. The former includes a Subcategory-specific Vision Prompt (SsVP) and a Discrepancy-aware Text Prompt (DaTP), which explicitly highlight the subcategory-specific discrepancies from the perspectives of both vision and language. The latter aligns the vision and text prompting elements in a common semantic space, facilitating cross-modal collaborative reasoning through a Vision-Language Fusion Module (VLFM) for further improvement on FGVC. Moreover, we tailor a two-stage optimization strategy for MP-FGVC to fully leverage the pre-trained CLIP model and expedite efficient adaptation for FGVC. Extensive experiments conducted on four FGVC datasets demonstrate the effectiveness of our MP-FGVC.

NeurIPS Conference 2024 Conference Paper

Diffusion of Thought: Chain-of-Thought Reasoning in Diffusion Language Models

  • Jiacheng Ye
  • Shansan Gong
  • Liheng Chen
  • Lin Zheng
  • Jiahui Gao
  • Han Shi
  • Chuan Wu
  • Xin Jiang

Recently, diffusion models have garnered significant interest in the field of text processing due to their many potential advantages compared to conventional autoregressive models. In this work, we propose Diffusion-of-Thought (DoT), a novel approach that integrates diffusion models with Chain-of-Thought, a well-established technique for improving the reasoning ability of autoregressive language models. In contrast to autoregressive language models that make decisions in a left-to-right, token-by-token manner, DoT allows reasoning steps to diffuse over time through a diffusion language model and offers greater flexibility in trading off computation for reasoning performance. Our experimental results demonstrate the effectiveness of DoT in multi-digit multiplication, Boolean logic, and grade school math problems. In addition, DoT showcases promising self-correction abilities and benefits from existing reasoning-enhancing techniques like self-consistency decoding. Our findings contribute to the understanding and development of reasoning with diffusion language models.

NeurIPS Conference 2024 Conference Paper

Extracting Training Data from Molecular Pre-trained Models

  • Renhong Huang
  • Jiarong Xu
  • Zhiming Yang
  • Xiang Si
  • Xin Jiang
  • Hanyang Yuan
  • Chunping Wang
  • Yang Yang

Graph Neural Networks (GNNs) have significantly advanced the field of drug discovery, enhancing the speed and efficiency of molecular identification. However, training these GNNs demands vast amounts of molecular data, which has spurred the emergence of collaborative model-sharing initiatives. These initiatives facilitate the sharing of molecular pre-trained models among organizations without exposing proprietary training data. Despite the benefits, these molecular pre-trained models may still pose privacy risks. For example, malicious adversaries could perform data extraction attacks to recover private training data, thereby threatening commercial secrets and collaborative trust. This work, for the first time, explores the risks of extracting private training molecular data from molecular pre-trained models. This task is nontrivial because molecular pre-trained models are non-generative and exhibit a diversity of model architectures, which differs significantly from language and image models. To address these issues, we introduce a molecule generation approach and propose a novel, model-independent scoring function for selecting promising molecules. To efficiently reduce the search space of potential molecules, we further introduce a Molecule Extraction Policy Network for molecule extraction. Our experiments demonstrate that even with only query access to molecular pre-trained models, there is a considerable risk of extracting training data, challenging the assumption that model sharing alone provides adequate protection against data extraction attacks. Our code is publicly available at https://github.com/renH2/Molextract.

AAAI Conference 2024 Conference Paper

Measuring Task Similarity and Its Implication in Fine-Tuning Graph Neural Networks

  • Renhong Huang
  • Jiarong Xu
  • Xin Jiang
  • Chenglu Pan
  • Zhiming Yang
  • Chunping Wang
  • Yang Yang

The paradigm of pre-training and fine-tuning graph neural networks has attracted wide research attention. In previous studies, pre-trained models are viewed as universally versatile and applied to a diverse range of downstream tasks. In many situations, however, this practice results in limited or even negative transfer. This paper, for the first time, emphasizes the specific application scope of graph pre-trained models: not all downstream tasks can effectively benefit from a graph pre-trained model. In light of this, we introduce a measure, task consistency, to quantify the similarity between graph pre-training and downstream tasks. This measure assesses the extent to which downstream tasks can benefit from specific pre-training tasks. Moreover, a novel fine-tuning strategy, Bridge-Tune, is proposed to further diminish the impact of the difference between pre-training and downstream tasks. The key innovation in Bridge-Tune is an intermediate step that bridges pre-training and downstream tasks. This step takes the task differences into account and further refines the pre-trained model. The superiority of the presented fine-tuning strategy is validated via numerous experiments with different pre-trained models and downstream tasks.

AAAI Conference 2024 Conference Paper

Preparing Lessons for Progressive Training on Language Models

  • Yu Pan
  • Ye Yuan
  • Yichun Yin
  • Jiaxin Shi
  • Zenglin Xu
  • Ming Zhang
  • Lifeng Shang
  • Xin Jiang

The rapid progress of Transformers in artificial intelligence has come at the cost of increased resource consumption and greenhouse gas emissions due to growing model sizes. Prior work suggests using pretrained small models to improve training efficiency, but this approach may not be suitable for new model structures. On the other hand, training from scratch can be slow, and progressively stacking layers often fails to achieve significant acceleration. To address these challenges, we propose a novel method called Apollo, which prepares lessons for expanding operations by learning high-layer functionality during training of low layers. Our approach involves low-value-prioritized sampling (LVPS) to train different depths and weight sharing to facilitate efficient expansion. We also introduce an interpolation method for stable model depth extension. Experiments demonstrate that Apollo achieves state-of-the-art acceleration ratios, even rivaling methods using pretrained models, making it a universal and efficient solution for training deep models while reducing time, financial, and environmental costs.

AAAI Conference 2024 Conference Paper

Unsupervised Extractive Summarization with Learnable Length Control Strategies

  • Renlong Jie
  • Xiaojun Meng
  • Xin Jiang
  • Qun Liu

Unsupervised extractive summarization is an important technique in information extraction and retrieval. Compared with supervised methods, it does not require high-quality human-labelled summaries for training and thus can be easily applied to documents of different types, domains, or languages. Most existing unsupervised methods, including TextRank and PACSUM, rely on graph-based ranking of sentence centrality. However, this scorer cannot be directly applied in end-to-end training, and position-related prior assumptions are often needed to achieve good summaries. In addition, less attention has been paid to length-controllable extractors, where users can decide to summarize texts under a particular length constraint. This paper introduces an unsupervised extractive summarization model based on a siamese network, for which we develop a trainable bidirectional prediction objective between the selected summary and the original document. Unlike centrality-based ranking methods, our extractive scorer can be trained in an end-to-end manner, with no requirement for positional assumptions. In addition, we introduce a differentiable length control module that approximates a 0-1 knapsack solver for end-to-end length-controllable extraction. Experiments show that our unsupervised method largely outperforms the centrality-based baseline using the same sentence encoder. In terms of length control, our trainable knapsack module consistently outperforms a strong baseline that does not use end-to-end training. Human evaluation further evidences that our method performs best among the baselines in terms of relevance and consistency.
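
The length-control component can be pictured as a 0-1 knapsack over sentences. The sketch below solves that knapsack exactly with dynamic programming; the paper instead trains through a differentiable approximation, which is not shown here.

```python
# Toy length-controlled extraction as an exact 0-1 knapsack
# (the paper's differentiable approximation is omitted; scores,
# lengths, and the budget below are made-up inputs).
def extract_summary(scores, lengths, budget):
    """Pick sentences maximizing total score s.t. total length <= budget."""
    n = len(scores)
    best = [[0.0] * (budget + 1) for _ in range(n + 1)]
    for i in range(1, n + 1):
        for b in range(budget + 1):
            best[i][b] = best[i - 1][b]
            if lengths[i - 1] <= b:
                take = best[i - 1][b - lengths[i - 1]] + scores[i - 1]
                best[i][b] = max(best[i][b], take)
    # Backtrack the selected sentence indices.
    chosen, b = [], budget
    for i in range(n, 0, -1):
        if best[i][b] != best[i - 1][b]:
            chosen.append(i - 1)
            b -= lengths[i - 1]
    return sorted(chosen)

print(extract_summary([0.9, 0.4, 0.7, 0.2], [30, 10, 25, 5], 40))  # -> [0, 1]
```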

NeurIPS Conference 2023 Conference Paper

Better with Less: A Data-Active Perspective on Pre-Training Graph Neural Networks

  • Jiarong Xu
  • Renhong Huang
  • Xin Jiang
  • Yuxuan Cao
  • Carl Yang
  • Chunping Wang
  • Yang Yang

Pre-training on graph neural networks (GNNs) aims to learn transferable knowledge for downstream tasks with unlabeled data, and it has recently become an active research area. The success of graph pre-training models is often attributed to the massive amount of input data. In this paper, however, we identify the curse of big data phenomenon in graph pre-training: more training data do not necessarily lead to better downstream performance. Motivated by this observation, we propose a better-with-less framework for graph pre-training: fewer, but carefully chosen data are fed into a GNN model to enhance pre-training. The proposed pre-training pipeline is called the data-active graph pre-training (APT) framework, and is composed of a graph selector and a pre-training model. The graph selector chooses the most representative and instructive data points based on the inherent properties of graphs as well as predictive uncertainty. The proposed predictive uncertainty, as feedback from the pre-training model, measures the confidence level of the model in the data. When fed with the chosen data, on the other hand, the pre-training model grasps an initial understanding of the new, unseen data, and at the same time attempts to remember the knowledge learned from previous data. Therefore, the integration and interaction between these two components form a unified framework (APT), in which graph pre-training is performed in a progressive and iterative way. Experiment results show that the proposed APT is able to obtain an efficient pre-training model with fewer training data and better downstream performance.

AAAI Conference 2023 Conference Paper

KPT: Keyword-Guided Pre-training for Grounded Dialog Generation

  • Qi Zhu
  • Fei Mi
  • Zheng Zhang
  • Yasheng Wang
  • Yitong Li
  • Xin Jiang
  • Qun Liu
  • Xiaoyan Zhu

Incorporating external knowledge into the response generation process is essential to building more helpful and reliable dialog agents. However, collecting knowledge-grounded conversations is often costly, calling for a better pre-trained model for grounded dialog generation that generalizes well w.r.t. different types of knowledge. In this work, we propose KPT (Keyword-guided Pre-Training), a novel self-supervised pre-training method for grounded dialog generation without relying on extra knowledge annotation. Specifically, we use a pre-trained language model to extract the most uncertain tokens in the dialog as keywords. With these keywords, we construct two kinds of knowledge and pre-train a knowledge-grounded response generation model, aiming at handling two different scenarios: (1) the knowledge should be faithfully grounded; (2) it can be selectively used. For the former, the grounding knowledge consists of keywords extracted from the response. For the latter, the grounding knowledge is additionally augmented with keywords extracted from other utterances in the same dialog. Since the knowledge is extracted from the dialog itself, KPT can be easily performed on a large volume and variety of dialogue data. We considered three data sources (open-domain, task-oriented, conversational QA) with a total of 2.5M dialogues. We conduct extensive experiments on various few-shot knowledge-grounded generation tasks, including grounding on dialog acts, knowledge graphs, persona descriptions, and Wikipedia passages. Our comprehensive experiments and analyses demonstrate that KPT consistently outperforms state-of-the-art methods on these tasks with diverse grounding knowledge.
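
The keyword-extraction step can be pictured as follows. This toy sketch assumes the language model exposes a probability for each observed token and treats the least probable (most surprising) tokens as the most uncertain; the dialog and probabilities are invented for illustration.

```python
# Sketch of uncertainty-based keyword extraction (assumption: lower
# LM probability = higher uncertainty; real scoring may differ).
import numpy as np

tokens = ["i", "booked", "a", "table", "at", "quince", "for", "two"]
lm_probs = np.array([0.30, 0.05, 0.40, 0.10, 0.35, 0.001, 0.25, 0.08])

k = 3  # number of keywords to ground the response on
keywords = [tokens[i] for i in np.argsort(lm_probs)[:k]]
print(keywords)  # -> ['quince', 'booked', 'two']
```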

IJCAI Conference 2023 Conference Paper

Learning Summary-Worthy Visual Representation for Abstractive Summarization in Video

  • Zenan Xu
  • Xiaojun Meng
  • Yasheng Wang
  • Qinliang Su
  • Zexuan Qiu
  • Xin Jiang
  • Qun Liu

Multimodal abstractive summarization for videos (MAS) requires generating a concise textual summary describing the highlights of a video according to multimodal resources, in our case the video content and its transcript. Inspired by the success of large-scale generative pre-trained language models (GPLMs) in generating high-quality textual content (e.g., summaries), recent MAS methods have proposed to adapt the GPLM to this task by equipping it with visual information, which is often obtained through a general-purpose visual feature extractor. However, generally extracted visual features may overlook some summary-worthy visual information, which impedes model performance. In this work, we propose a novel approach to learning summary-worthy visual representations that facilitate abstractive summarization. Our method exploits summary-worthy information from both the cross-modal transcript data and the knowledge distilled from the pseudo summary. Extensive experiments on three public multimodal datasets show that our method outperforms all competing baselines. Furthermore, with the advantages of summary-worthy visual information, our model achieves significant improvements on small datasets or those with limited training data.

NeurIPS Conference 2023 Conference Paper

Response Length Perception and Sequence Scheduling: An LLM-Empowered LLM Inference Pipeline

  • Zangwei Zheng
  • Xiaozhe Ren
  • Fuzhao Xue
  • Yang Luo
  • Xin Jiang
  • Yang You

Large language models (LLMs) have revolutionized the field of AI, demonstrating unprecedented capacity across various tasks. However, the inference process for LLMs comes with significant computational costs. In this paper, we propose an efficient LLM inference pipeline that harnesses the power of LLMs. Our approach begins by tapping into the potential of LLMs to accurately perceive and predict the response length with minimal overhead. By leveraging this information, we introduce an efficient sequence scheduling technique that groups queries with similar response lengths into micro-batches. We evaluate our approach on real-world instruction datasets using the LLaMA-based model, and our results demonstrate an impressive 86% improvement in inference throughput without compromising effectiveness. Notably, our method is orthogonal to other inference acceleration techniques, making it a valuable addition to many existing toolkits (e.g., FlashAttention, quantization) for LLM inference.
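
The scheduling step itself is simple to illustrate: sort queries by predicted response length and cut them into micro-batches so that each batch wastes little compute on padding. In this sketch the LLM-based length predictor is stubbed out with fixed guesses.

```python
# Length-aware sequence scheduling sketch (the length predictor, an
# LLM in the paper, is replaced by hard-coded guesses below).
def schedule(queries, predicted_lengths, batch_size):
    order = sorted(range(len(queries)), key=lambda i: predicted_lengths[i])
    return [[queries[i] for i in order[s:s + batch_size]]
            for s in range(0, len(order), batch_size)]

queries = ["q0", "q1", "q2", "q3", "q4", "q5"]
pred = [120, 15, 400, 18, 110, 380]  # predicted response lengths in tokens
for batch in schedule(queries, pred, batch_size=2):
    print(batch)  # -> [q1, q3], then [q4, q0], then [q5, q2]
```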

NeurIPS Conference 2023 Conference Paper

Reusing Pretrained Models by Multi-linear Operators for Efficient Training

  • Yu Pan
  • Ye Yuan
  • Yichun Yin
  • Zenglin Xu
  • Lifeng Shang
  • Xin Jiang
  • Qun Liu

Training large models from scratch usually costs a substantial amount of resources. To address this problem, recent studies such as bert2BERT and LiGO have reused small pretrained models to initialize a large model (termed the "target model"), leading to a considerable acceleration in training. Despite the successes of these previous studies, they grew pretrained models by mapping partial weights only, ignoring potential correlations across the entire model. As we show in this paper, there are inter- and intra-interactions among the weights of both the pretrained and the target models. As a result, the partial mapping may not capture the complete information and can lead to inadequate growth. In this paper, we propose a method that linearly correlates each weight of the target model to all the weights of the pretrained model to further enhance acceleration. We utilize multi-linear operators to reduce computational and spatial complexity, enabling acceptable resource requirements. Experiments demonstrate that our method can save 76% of the computational cost on DeiT-base transferred from DeiT-small, outperforming bert2BERT by +12% and LiGO by +21%, respectively.

AAAI Conference 2022 Conference Paper

AutoBERT-Zero: Evolving BERT Backbone from Scratch

  • Jiahui Gao
  • Hang Xu
  • Han Shi
  • Xiaozhe Ren
  • Philip L. H. Yu
  • Xiaodan Liang
  • Xin Jiang
  • Zhenguo Li

Transformer-based pre-trained language models like BERT and its variants have recently achieved promising performance in various natural language processing (NLP) tasks. However, the conventional paradigm constructs the backbone by purely stacking manually designed global self-attention layers, introducing inductive bias and thus leading to sub-optimal results. In this work, we make the first attempt to automatically discover a novel pre-trained language model (PLM) backbone from scratch on a flexible search space containing the most fundamental operations. Specifically, we propose a well-designed search space which (i) contains primitive math operations at the intra-layer level to explore novel attention structures, and (ii) leverages convolution blocks as supplements to attention at the inter-layer level to better learn local dependency. To enhance the efficiency of finding promising architectures, we propose an Operation-Priority Neural Architecture Search (OP-NAS) algorithm, which optimizes both the search algorithm and the evaluation of candidate models. Specifically, we propose an Operation-Priority (OP) evolution strategy to facilitate model search by balancing exploration and exploitation. Furthermore, we design a Bi-branch Weight-Sharing (BIWS) training strategy for fast model evaluation. Extensive experiments show that the searched architecture (named AutoBERT-Zero) significantly outperforms BERT and its variants of different model capacities in various downstream tasks, proving the architecture's transfer and scaling abilities. Remarkably, AutoBERT-Zero-base outperforms RoBERTa-base (which uses much more data) and BERT-large (which has a much larger model size) by 2.4 and 1.4 points on the GLUE test set, respectively.

AAAI Conference 2022 Conference Paper

Blindfolded Attackers Still Threatening: Strict Black-Box Adversarial Attacks on Graphs

  • Jiarong Xu
  • Yizhou Sun
  • Xin Jiang
  • Yanhao Wang
  • Chunping Wang
  • Jiangang Lu
  • Yang Yang

Adversarial attacks on graphs have attracted considerable research interest. Existing works assume the attacker is either (partly) aware of the victim model, or able to send queries to it. These assumptions are, however, unrealistic. To bridge the gap between theoretical graph attacks and real-world scenarios, in this work, we propose a novel and more realistic setting: the strict black-box graph attack, in which the attacker has no knowledge about the victim model at all and is not allowed to send any queries. To design such an attack strategy, we first propose a generic graph filter to unify different families of graph-based models. The strength of attacks can then be quantified by the change in the graph filter before and after the attack. By maximizing this change, we are able to find an effective attack strategy, regardless of the underlying model. To solve this optimization problem, we also propose a relaxation technique and approximation theories to reduce the difficulty as well as the computational expense. Experiments demonstrate that, even with no exposure to the model, the Macro-F1 drops by 5.5% in node classification and 29.5% in graph classification, which is a significant result compared with existing works.
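
A brute-force toy version of the core idea: flip the edges that most change a generic graph filter (here the symmetrically normalized adjacency), with no access to the victim model. The paper solves a relaxed optimization; this greedy search is purely illustrative.

```python
# Strict black-box attack sketch: maximize the Frobenius-norm change
# of a normalized-adjacency filter under an edge-flip budget.
# (The normalized adjacency is one choice of "generic graph filter".)
import numpy as np

def norm_adj(A):
    d = A.sum(1)
    d_inv = np.zeros_like(d)
    d_inv[d > 0] = 1.0 / np.sqrt(d[d > 0])
    return d_inv[:, None] * A * d_inv[None, :]

def greedy_attack(A, budget):
    A = A.copy()
    base = norm_adj(A)
    for _ in range(budget):
        best, best_gain = None, -1.0
        for i in range(len(A)):
            for j in range(i + 1, len(A)):
                A[i, j] = A[j, i] = 1 - A[i, j]        # try flipping (i, j)
                gain = np.linalg.norm(norm_adj(A) - base)
                if gain > best_gain:
                    best, best_gain = (i, j), gain
                A[i, j] = A[j, i] = 1 - A[i, j]        # undo the trial flip
        i, j = best
        A[i, j] = A[j, i] = 1 - A[i, j]                # commit the best flip
    return A

A = np.array([[0, 1, 0, 0], [1, 0, 1, 0], [0, 1, 0, 1], [0, 0, 1, 0]], float)
print(greedy_attack(A, budget=1))
```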

IROS Conference 2022 Conference Paper

Fast and Safe Exploration via Adaptive Semantic Perception in Outdoor Environments

  • Zhihao Wang 0003
  • Lingxu Chen
  • Hongjin Chen
  • Haoyao Chen
  • Xin Jiang

Autonomous exploration in unknown environments is a fundamental task for robots. Existing approaches have mostly concentrated on exploration efficiency under the assumption of perfect state estimation, but drift in visual SLAM pose estimation occurs frequently and is detrimental to the robot's localization and exploration performance. In this paper, a perception-aware exploration (PAE) method is proposed for rapid and safe autonomous exploration in outdoor environments. Adaptive semantic information is proposed to improve the robustness of perception. Based on the perception module, both the selection of exploration goals, using a novel weighted information gain, and path planning can avoid areas with high localization uncertainty. In addition, thanks to the proposed pipeline, including scan-based frontier detection, kd-tree based map prediction, and a suboptimal frontier buffer strategy, the PAE planner can explore the environment with a high success rate and high efficiency. Several simulations are performed to verify the effectiveness of our methods.

NeurIPS Conference 2022 Conference Paper

Towards Efficient Post-training Quantization of Pre-trained Language Models

  • Haoli Bai
  • Lu Hou
  • Lifeng Shang
  • Xin Jiang
  • Irwin King
  • Michael R Lyu

Network quantization has gained increasing attention with the rapid growth of large pre-trained language models (PLMs). However, most existing quantization methods for PLMs follow quantization-aware training (QAT), which requires end-to-end training with full access to the entire dataset. Therefore, they suffer from slow training, large memory overhead, and data accessibility issues. In this paper, we study post-training quantization (PTQ) of PLMs, and propose module-wise quantization error minimization (MREM), an efficient solution to mitigate these issues. By partitioning the PLM into multiple modules, we minimize the reconstruction error incurred by quantization for each module. In addition, we design a new model parallel training strategy such that each module can be trained locally on separate computing devices without waiting for preceding modules, which brings nearly the theoretical training speed-up (e.g., 4× on 4 GPUs). Experiments on the GLUE and SQuAD benchmarks show that our proposed PTQ solution not only performs close to QAT, but also enjoys significant reductions in training time, memory overhead, and data consumption.
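
The module-wise setup can be sketched as follows. This toy version only measures each module's post-quantization reconstruction error on calibration data; MREM additionally minimizes that error with local training steps, which are omitted here, and the shapes and data are invented.

```python
# Module-wise PTQ sketch: fake-quantize each module's weights to int8
# and compute the per-module output reconstruction error.
import numpy as np

def quantize_int8(w):
    scale = np.abs(w).max() / 127.0
    return np.round(w / scale).clip(-127, 127) * scale  # fake-quantized

rng = np.random.default_rng(0)
calib = rng.normal(size=(64, 128))                 # calibration activations

for module_idx in range(4):                        # modules handled independently
    W = rng.normal(size=(128, 128))                # stand-in module weights
    Wq = quantize_int8(W)
    err = np.linalg.norm(calib @ Wq - calib @ W)   # module-wise recon. error
    print(f"module {module_idx}: reconstruction error {err:.3f}")
```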

AAAI Conference 2022 Conference Paper

UniMS: A Unified Framework for Multimodal Summarization with Knowledge Distillation

  • Zhengkun Zhang
  • Xiaojun Meng
  • Yasheng Wang
  • Xin Jiang
  • Qun Liu
  • Zhenglu Yang

With the rapid increase of multimedia data, a large body of literature has emerged on multimodal summarization, the majority of which aims to refine salient information from textual and visual modalities to output a pictorial summary with the most relevant images. Existing methods mostly focus on either extractive or abstractive summarization and rely on qualified image captions to build image references. We are the first to propose a Unified framework for Multimodal Summarization grounded on BART, UniMS, that integrates extractive and abstractive objectives as well as selecting the image output. Specifically, we adopt knowledge distillation from a vision-language pretrained model to improve image selection, which avoids any requirement on the existence and quality of image captions. In addition, we introduce a visual guided decoder to better integrate textual and visual modalities in guiding abstractive text generation. Results show that our best model achieves a new state-of-the-art result on a large-scale benchmark dataset. The newly involved extractive objective as well as the knowledge distillation technique are proven to bring a noticeable improvement to the multimodal summarization task.

AAAI Conference 2022 Conference Paper

Unsupervised Adversarially Robust Representation Learning on Graphs

  • Jiarong Xu
  • Yang Yang
  • Junru Chen
  • Xin Jiang
  • Chunping Wang
  • Jiangang Lu
  • Yizhou Sun

Unsupervised/self-supervised pre-training methods for graph representation learning have recently attracted increasing research interest, and they can be generalized to various downstream applications. Yet, the adversarial robustness of such pre-trained graph learning models remains largely unexplored. More importantly, most existing defense techniques for end-to-end graph representation learning methods require pre-specified label definitions, and thus cannot be directly applied to pre-training methods. In this paper, we propose an unsupervised defense technique to robustify pre-trained deep graph models, so that perturbations on the input graph can be successfully identified and blocked before the model is applied to different downstream tasks. Specifically, we introduce a mutual information-based measure, graph representation vulnerability (GRV), to quantify the robustness of graph encoders on the representation space. We then formulate an optimization problem to learn the graph representation by carefully balancing the trade-off between the expressive power and the robustness (i.e., GRV) of the graph encoder. The discrete nature of graph topology and the joint space of graph data make the optimization problem intractable to solve. To handle this difficulty and reduce computational expense, we further relax the problem and thus provide an approximate solution. Additionally, we explore a provable connection between the robustness of the unsupervised graph encoder and that of models on downstream tasks. Extensive experiments demonstrate that even without access to labels and tasks, our model is still able to enhance robustness against adversarial attacks on three downstream tasks (i.e., node classification, link prediction, and community detection) by an average of +16.5% compared with existing methods.

NeurIPS Conference 2022 Conference Paper

Wukong: A 100 Million Large-scale Chinese Cross-modal Pre-training Benchmark

  • Jiaxi Gu
  • Xiaojun Meng
  • Guansong Lu
  • Lu Hou
  • Niu Minzhe
  • Xiaodan Liang
  • Lewei Yao
  • Runhui Huang

Vision-Language Pre-training (VLP) models have shown remarkable performance on various downstream tasks. Their success heavily relies on the scale of pre-trained cross-modal datasets. However, the lack of large-scale datasets and benchmarks in Chinese hinders the development of Chinese VLP models and broader multilingual applications. In this work, we release a large-scale Chinese cross-modal dataset named Wukong, which contains 100 million Chinese image-text pairs collected from the web. Wukong aims to benchmark different multi-modal pre-training methods to facilitate VLP research and community development. Furthermore, we release a group of models pre-trained with various image encoders (ViT-B/ViT-L/SwinT) and also apply advanced pre-training techniques to VLP, such as locked-image text tuning, token-wise similarity in contrastive learning, and reduced-token interaction. Extensive experiments and a benchmarking of different downstream tasks, including a new human-verified image-text test dataset (the largest of its kind), are also provided. Experiments show that Wukong can serve as a promising Chinese pre-training dataset and benchmark for different cross-modal learning methods. For the zero-shot image classification task on 10 datasets, Wukong-ViT-L achieves an average accuracy of 73.03%. For the image-text retrieval task, it achieves a mean recall of 71.6% on AIC-ICC, which is 12.9% higher than WenLan 2.0. Our Wukong models are also benchmarked against other variants on multiple downstream datasets, e.g., Flickr8K-CN, Flickr-30K-CN, and COCO-CN. More information is available at https://wukong-dataset.github.io/wukong-dataset/.

IROS Conference 2021 Conference Paper

Consolidating Kinematic Models to Promote Coordinated Mobile Manipulations

  • Ziyuan Jiao
  • Zeyu Zhang 0001
  • Xin Jiang
  • David K. Han
  • Song-Chun Zhu
  • Yixin Zhu 0001
  • Hangxin Liu

We construct a Virtual Kinematic Chain (VKC) that readily consolidates the kinematics of the mobile base, the arm, and the object to be manipulated in mobile manipulations. Accordingly, a mobile manipulation task is represented by altering the state of the constructed VKC, which can be converted to a motion planning problem, formulated and solved by trajectory optimization. This new VKC perspective of mobile manipulation allows a service robot to (i) produce well-coordinated motions, suitable for complex household environments, and (ii) perform intricate multi-step tasks while interacting with multiple objects without an explicit definition of intermediate goals. In simulated experiments, we validate these advantages by comparing the VKC-based approach with baselines that solely optimize individual components. The results show that VKC-based joint modeling and planning promote task success rates and produce more efficient trajectories.

AAAI Conference 2021 Conference Paper

Continuous Self-Attention Models with Neural ODE Networks

  • Jing Zhang
  • Peng Zhang
  • Baiwen Kong
  • Junqiu Wei
  • Xin Jiang

Stacked self-attention models have received widespread attention due to their ability to capture global dependencies among words. However, stacking many layers and components introduces a huge number of parameters, leading to low parameter efficiency. In response to this issue, we propose a lightweight architecture named Continuous Self-Attention models with neural ODE networks (CSAODE). In CSAODE, continuous dynamical models (i.e., neural ODEs) are coupled with our proposed self-attention block to form a self-attention ODE solver. This solver continuously calculates and optimizes the hidden states via only one layer of parameters to improve parameter efficiency. In addition, we design a novel accelerated continuous dynamical model to reduce computing costs and integrate it into CSAODE. Moreover, since the original self-attention ignores local information, CSAODE makes use of N-gram convolution to encode local representations, and a fusion layer with only two trainable scalars is designed for generating sentence vectors. We perform a series of experiments on text classification, natural language inference (NLI), and text matching tasks. With fewer parameters, CSAODE outperforms state-of-the-art models on text classification tasks (e.g., a 1.3% accuracy improvement on the SUBJ task), and has competitive performance on NLI and text matching tasks as well.
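
The continuous-depth idea can be pictured as integrating a self-attention vector field with a fixed-step solver, reusing one set of parameters instead of stacking layers. The sketch below uses explicit Euler steps; shapes, scales, and the integration grid are made up, and the paper's actual solver and acceleration scheme are not reproduced.

```python
# Continuous self-attention sketch: h' = f(h) where f is one shared
# self-attention block, integrated with explicit Euler steps.
import numpy as np

rng = np.random.default_rng(0)
T, d = 6, 16
Wq, Wk, Wv = (rng.normal(size=(d, d)) * 0.1 for _ in range(3))

def attention_field(h):
    scores = (h @ Wq) @ (h @ Wk).T / np.sqrt(d)
    attn = np.exp(scores) / np.exp(scores).sum(-1, keepdims=True)
    return attn @ (h @ Wv)

h = rng.normal(size=(T, d))
dt, steps = 0.1, 10                  # fixed integration grid
for _ in range(steps):               # one parameter set, many "depths"
    h = h + dt * attention_field(h)  # explicit Euler step
print(h.shape)
```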

AAAI Conference 2021 Conference Paper

HopRetriever: Retrieve Hops over Wikipedia to Answer Complex Questions

  • Shaobo Li
  • Xiaoguang Li
  • Lifeng Shang
  • Xin Jiang
  • Qun Liu
  • Chengjie Sun
  • Zhenzhou Ji
  • Bingquan Liu

Collecting supporting evidence from large corpora of text (e.g., Wikipedia) poses a great challenge for open-domain Question Answering (QA). In particular, for multi-hop open-domain QA, scattered evidence pieces must be gathered together to support answer extraction. In this paper, we propose a new retrieval target, the hop, to collect hidden reasoning evidence from Wikipedia for complex question answering. Specifically, a hop in this paper is defined as the combination of a hyperlink and the corresponding outbound link document. The hyperlink is encoded as a mention embedding, which models the structured knowledge of how the outbound link entity is mentioned in the textual context, and the corresponding outbound link document is encoded as a document embedding representing the unstructured knowledge within it. Accordingly, we build HopRetriever, which retrieves hops over Wikipedia to answer complex questions. Experiments on the HotpotQA dataset demonstrate that HopRetriever outperforms previously published evidence retrieval methods by large margins. Moreover, our approach also yields quantifiable interpretations of the evidence collection process.
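
A minimal sketch of hop scoring as described in the abstract, with random vectors standing in for the real mention and document encoders; the dot-product scorer is an assumption for illustration.

```python
# Hop ranking sketch: score a question against each candidate hop,
# where a hop = [mention embedding ; document embedding].
import numpy as np

rng = np.random.default_rng(0)
d = 32
question = rng.normal(size=2 * d)        # stand-in question encoding

hops = []
for name in ["hop_A", "hop_B", "hop_C"]:
    mention_emb = rng.normal(size=d)     # how the entity is mentioned (structured)
    document_emb = rng.normal(size=d)    # the outbound link page (unstructured)
    hops.append((name, np.concatenate([mention_emb, document_emb])))

scores = {name: float(question @ emb) for name, emb in hops}
print(max(scores, key=scores.get), scores)
```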

AAAI Conference 2020 Conference Paper

Dialog State Tracking with Reinforced Data Augmentation

  • Yichun Yin
  • Lifeng Shang
  • Xin Jiang
  • Xiao Chen
  • Qun Liu

Neural dialog state trackers are generally limited by the lack of quantity and diversity of annotated training data. In this paper, we address this difficulty by proposing a reinforcement learning (RL) based framework for data augmentation that can generate high-quality data to improve the neural state tracker. Specifically, we introduce a novel contextual bandit generator to learn fine-grained augmentation policies that can generate new effective instances by choosing suitable replacements for a specific context. Moreover, by alternately learning between the generator and the state tracker, we can keep refining the generative policies to generate more high-quality training data for the neural state tracker. Experimental results on the WoZ and MultiWoZ (restaurant) datasets demonstrate that the proposed framework significantly improves performance over state-of-the-art models, especially with limited training data.

NeurIPS Conference 2020 Conference Paper

DynaBERT: Dynamic BERT with Adaptive Width and Depth

  • Lu Hou
  • Zhiqi Huang
  • Lifeng Shang
  • Xin Jiang
  • Xiao Chen
  • Qun Liu

Pre-trained language models like BERT, though powerful in many natural language processing tasks, are expensive in both computation and memory. To alleviate this problem, one approach is to compress them for specific tasks before deployment. However, recent works on BERT compression usually compress the large BERT model to a fixed smaller size and cannot fully satisfy the requirements of different edge devices with various hardware capabilities. In this paper, we propose a novel dynamic BERT model (abbreviated as DynaBERT), which can flexibly adjust the size and latency by selecting adaptive width and depth. The training process of DynaBERT includes first training a width-adaptive BERT and then allowing both adaptive width and depth, by distilling knowledge from the full-sized model to small sub-networks. Network rewiring is also used to keep the more important attention heads and neurons shared by more sub-networks. Comprehensive experiments under various efficiency constraints demonstrate that our proposed dynamic BERT (or RoBERTa) at its largest size has comparable performance to BERT-base (or RoBERTa-base), while at smaller widths and depths it consistently outperforms existing BERT compression methods. Code is available at https://github.com/huawei-noah/Pretrained-Language-Model/tree/master/DynaBERT.
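
Width adaptivity can be pictured as slicing the leading neurons of each layer, assuming importance-based rewiring has already sorted neurons so the most useful ones come first. A minimal sketch with made-up dimensions, not the authors' implementation:

```python
# Width-adaptive sub-network sketch: a sub-model at width ratio r keeps
# the first r-fraction of hidden neurons of a feed-forward layer.
import numpy as np

rng = np.random.default_rng(0)
W1 = rng.normal(size=(768, 3072))    # full-width feed-forward weights
W2 = rng.normal(size=(3072, 768))

def forward(x, width_ratio=1.0):
    h = int(3072 * width_ratio)      # adaptive hidden width
    hid = np.maximum(x @ W1[:, :h], 0.0)
    return hid @ W2[:h, :]           # output dimension is unchanged

x = rng.normal(size=(4, 768))
print(forward(x, 1.0).shape, forward(x, 0.25).shape)
```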

IJCAI Conference 2020 Conference Paper

On the Importance of Word and Sentence Representation Learning in Implicit Discourse Relation Classification

  • Xin Liu
  • Jiefu Ou
  • Yangqiu Song
  • Xin Jiang

Implicit discourse relation classification is one of the most difficult parts of shallow discourse parsing, as relation prediction without explicit connectives requires language understanding at both the text span level and the sentence level. Previous studies mainly focus on the interactions between two arguments. We argue that a powerful contextualized representation module, a bilateral multi-perspective matching module, and a global information fusion module are all important to implicit discourse analysis. We propose a novel model to combine these modules together. Extensive experiments show that our proposed model outperforms BERT and other state-of-the-art systems by around 8% on the PDTB dataset and around 16% on the CoNLL 2016 datasets. We also analyze the effectiveness of different modules in the implicit discourse relation classification task and demonstrate how different levels of representation learning can affect the results.

NeurIPS Conference 2020 Conference Paper

Unsupervised Text Generation by Learning from Search

  • Jingjing Li
  • Zichao Li
  • Lili Mou
  • Xin Jiang
  • Michael Lyu
  • Irwin King

In this work, we propose TGLS, a novel framework for unsupervised Text Generation by Learning from Search. We start by applying a strong search algorithm (in particular, simulated annealing) to a heuristically defined objective that (roughly) estimates the quality of sentences. Then, a conditional generative model learns from the search results while smoothing out the noise of search. The alternation between search and learning can be repeated for performance bootstrapping. We demonstrate the effectiveness of TGLS on two real-world natural language generation tasks, unsupervised paraphrasing and text formalization. Our model significantly outperforms unsupervised baseline methods on both tasks. In particular, it achieves performance comparable to strong supervised methods for paraphrase generation.

JAIR Journal 2019 Journal Article

Interpretable Charge Prediction for Criminal Cases with Dynamic Rationale Attention

  • Wenhan Chao
  • Xin Jiang
  • Zhunchen Luo
  • Yakun Hu
  • Wenjia Ma

Charge prediction, which aims to determine appropriate charges for criminal cases based on textual fact descriptions, is an important technology in the field of AI & Law. Previous works focus on improving prediction accuracy while ignoring interpretability, which limits the methods' applicability. In this work, we propose a deep neural framework to extract short but charge-decisive text snippets, called rationales, from the input fact description as the interpretation of charge prediction. To address the scarcity of rationale-annotated corpora, rationales are extracted in a reinforcement learning style, with the only supervision coming in the form of charge labels. We further propose a dynamic rationale attention mechanism to better utilize the information in extracted rationales and predict the charges. Experimental results show that besides providing interpretations of charge predictions, our approach can also capture subtle details that help charge prediction.

IJCAI Conference 2016 Conference Paper

Neural Generative Question Answering

  • Jun Yin
  • Xin Jiang
  • Zhengdong Lu
  • Lifeng Shang
  • Hang Li
  • Xiaoming Li

This paper presents an end-to-end neural network model, named Neural Generative Question Answering (GENQA), that can generate answers to simple factoid questions based on facts in a knowledge base. More specifically, the model is built on the encoder-decoder framework for sequence-to-sequence learning, equipped with the ability to query the knowledge base, and is trained on a corpus of question-answer pairs with their associated triples in the knowledge base. An empirical study shows that the proposed model can effectively deal with variations in questions and answers, and generate correct and natural answers by referring to facts in the knowledge base. The experiment on question answering demonstrates that the proposed model can outperform an embedding-based QA model as well as a neural dialogue model trained on the same data.