Arrow Research search

Author name cluster

Xu Sun

Possible papers associated with this exact author name in Arrow. This page groups case-insensitive exact name matches and is not a full identity disambiguation profile.

40 papers
2 author rows

Possible papers

40

AAAI Conference 2026 Conference Paper

TEMPLE: Incentivizing Temporal Understanding of Video Large Language Models via Progressive Pre-SFT Alignment

  • Shicheng Li
  • Lei Li
  • Kun Ouyang
  • Shuhuai Ren
  • Yuanxin Liu
  • Yuanxing Zhang
  • Fuzheng Zhang
  • Lingpeng Kong

Video Large Language Models (Video LLMs) have achieved significant success by adopting the paradigm of large-scale pre-training followed by supervised fine-tuning (SFT). However, existing approaches struggle with temporal reasoning due to weak temporal correspondence in the data and over-reliance on the next-token prediction paradigm, which collectively result in the absence of temporal supervision. To address these limitations, we propose TEMPLE (TEMporal Preference Learning), a systematic framework that enhances temporal reasoning capabilities through Direct Preference Optimization (DPO). To address temporal information scarcity in data, we introduce an automated pipeline for systematically constructing temporality-intensive preference pairs comprising three steps: selecting temporally rich videos, designing video-specific perturbation strategies, and evaluating model responses on clean and perturbed inputs. Complementing this data pipeline, we provide additional supervision signals via preference learning and propose a novel Progressive Pre-SFT Alignment strategy featuring two key innovations: a curriculum learning strategy that progressively increases perturbation difficulty to maximize data efficiency; and applying preference optimization before instruction tuning to incentivize fundamental temporal alignment. Extensive experiments demonstrate that our approach consistently improves Video LLM performance across multiple benchmarks with a relatively small set of self-generated DPO data. Our findings highlight TEMPLE as a scalable and efficient complement to SFT-based methods, paving the way for developing reliable Video LLMs.
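The DPO objective underlying TEMPLE's preference learning can be sketched as follows. This is a minimal illustration of the standard DPO loss on one clean-vs-perturbed preference pair; the function and variable names are hypothetical, not taken from the paper's code:

```python
import math

def dpo_loss(logp_chosen, logp_rejected,
             ref_logp_chosen, ref_logp_rejected, beta=0.1):
    """Standard DPO loss for one preference pair: push the policy to widen
    its log-probability margin between the preferred response (from the
    clean video) and the rejected response (from the perturbed video),
    measured relative to a frozen reference model."""
    margin = ((logp_chosen - ref_logp_chosen)
              - (logp_rejected - ref_logp_rejected))
    # -log sigmoid(beta * margin)
    return -math.log(1.0 / (1.0 + math.exp(-beta * margin)))

# A wider preference margin gives a lower loss.
easy = dpo_loss(-1.0, -5.0, -2.0, -2.0)  # policy already prefers the clean response
hard = dpo_loss(-5.0, -1.0, -2.0, -2.0)  # policy prefers the perturbed response
```

Because the margin is computed against a reference model, the policy is rewarded specifically for preferring clean-input responses more than the reference does, which is where the temporal supervision signal comes from.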

AAAI Conference 2025 Conference Paper

InstructAvatar: Text-Guided Emotion and Motion Control for Avatar Generation

  • Yuchi Wang
  • Junliang Guo
  • Jianhong Bai
  • Runyi Yu
  • Tianyu He
  • Xu Tan
  • Xu Sun
  • Jiang Bian

Recent talking avatar generation models have made strides in achieving realistic and accurate lip synchronization with the audio, but often fall short in controlling and conveying detailed expressions and emotions of the avatar, making the generated video less vivid and controllable. In this paper, we propose a text-guided approach for generating emotionally expressive 2D avatars, offering fine-grained control, improved interactivity, and generalizability to the resulting video. Our framework, named InstructAvatar, leverages a natural language interface to control the emotion as well as the facial motion of avatars. Technically, we utilize GPT-4V to design an automatic annotation pipeline, constructing an instruction-video paired training dataset. This is combined with a novel two-branch diffusion-based generator to predict avatars using both audio and text instructions simultaneously. Experimental results demonstrate that InstructAvatar produces results that align well with both conditions, and outperforms existing methods in fine-grained emotion control, lip-sync quality, and naturalness.

NeurIPS Conference 2025 Conference Paper

UVE: Are MLLMs Unified Evaluators for AI-Generated Videos?

  • Yuanxin Liu
  • Rui Zhu
  • Shuhuai Ren
  • Jiacong Wang
  • Haoyuan Guo
  • Xu Sun
  • Lu Jiang

With the rapid growth of video generative models (VGMs), it is essential to develop reliable and comprehensive automatic metrics for AI-generated videos (AIGVs). Existing methods either use off-the-shelf models optimized for other tasks or rely on human assessment data to train specialized evaluators. These approaches are constrained to specific evaluation aspects and are difficult to scale with the increasing demands for finer-grained and more comprehensive evaluations. To address this issue, this work investigates the feasibility of using multimodal large language models (MLLMs) as a unified evaluator for AIGVs, leveraging their strong visual perception and language understanding capabilities. To evaluate the performance of automatic metrics in unified AIGV evaluation, we introduce a benchmark called UVE-Bench. UVE-Bench collects videos generated by state-of-the-art VGMs and provides pairwise human preference annotations across 15 evaluation aspects. Using UVE-Bench, we extensively evaluate 18 MLLMs. Our empirical results suggest that while advanced MLLMs (e.g., Qwen2VL-72B and InternVL2.5-78B) still lag behind human evaluators, they demonstrate promising ability in unified AIGV evaluation, significantly surpassing existing specialized evaluation methods. Additionally, we conduct an in-depth analysis of key design choices that impact the performance of MLLM-driven evaluators, offering valuable insights for future research on AIGV evaluation.
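A pairwise human-preference protocol like UVE-Bench's can be scored with a simple agreement rate, as in this minimal sketch (the function name and the ±1 preference encoding are illustrative assumptions, not the benchmark's actual format):

```python
def pairwise_agreement(metric_scores, human_prefs):
    """Fraction of video pairs on which an automatic metric agrees with the
    human pairwise preference (+1: first video preferred, -1: second).
    A hypothetical scoring protocol for UVE-Bench-style evaluation."""
    correct = 0
    for (score_a, score_b), pref in zip(metric_scores, human_prefs):
        predicted = 1 if score_a > score_b else -1
        correct += int(predicted == pref)
    return correct / len(human_prefs)

# The metric agrees with humans on 2 of 3 pairs.
acc = pairwise_agreement([(0.9, 0.4), (0.2, 0.7), (0.6, 0.5)], [1, -1, -1])
```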

TMLR Journal 2024 Journal Article

Decentralized Decoupled Training for Federated Long-Tailed Learning

  • Wenkai Yang
  • Deli Chen
  • Hao Zhou
  • Fandong Meng
  • Jie Zhou
  • Xu Sun

In the real world, the data samples often follow a long-tailed distribution, which poses a great challenge for Federated Learning (FL). That is, when the data is decentralized and long-tailed, FL may produce a poorly-behaved global model that is severely biased towards the head classes with the majority of the training samples. To settle this issue, decoupled training has recently been introduced to FL. Decoupled training aims to re-balance the biased classifier after the normal instance-balanced training, and has achieved promising results in centralized long-tailed learning. The current study directly adopts the decoupled training idea on the server side by re-training the classifier on a set of pseudo features, due to the unavailability of a global balanced dataset in FL. Unfortunately, this practice restricts the capacity of decoupled training in federated long-tailed learning, as the low-quality pseudo features lead to a sub-optimal classifier. In this work, motivated by the distributed characteristic of FL, we propose a decentralized decoupled training mechanism that leverages the abundant real data stored locally on each client. Specifically, we integrate the local real data with the global gradient prototypes to form local balanced datasets, and thus re-balance the classifier during local training. Furthermore, we introduce a supplementary classifier in the training phase to help model the global data distribution, which addresses the problem of contradictory optimization goals caused by performing classifier re-balancing locally. Extensive experiments show that our method consistently outperforms the existing state-of-the-art methods in various settings. Our code is available at https://github.com/keven980716/Federated_Learning_Experiments.

EAAI Journal 2024 Journal Article

Uncertain remanufacturing reverse logistics network design in industry 5.0: Opportunities and challenges of digitalization

  • Hao Yu
  • Xu Sun

Remanufacturing, a crucial step of reverse logistics, focuses on restoring or enhancing the functionality of waste products. The challenge in planning an effective remanufacturing reverse logistics system lies in the uncertainties from various sources. In addition, the evolving industrial landscape in Industry 5.0 necessitates adaptability to technological advancements. This paper proposes an integrated and digitalized architecture for uncertain reverse logistics network design. A fuzzy optimization model is first formulated to identify potential network configurations under varying demand-satisfying and capacity constraints. These solutions are automatically converted and assessed in a dynamic simulation environment with practical operational logic under a set of real-world scenarios. Numerical experiments are performed to validate the method and show the advantages of integrating optimization with dynamic simulation on a digital platform for strategic network planning. The results, built upon previous research, indicate that while initial investments in technology might be substantial, they may lead to long-term reductions in both costs and emissions. Moreover, collaborative decision-making is essential to mitigate potential disruptions and cascading effects. Our research contributes to the development of a novel integrated decision-support architecture and underscores the role of digitalization and Industry 5.0 in future smart and sustainable reverse logistics planning.

NeurIPS Conference 2024 Conference Paper

Watch Out for Your Agents! Investigating Backdoor Threats to LLM-Based Agents

  • Wenkai Yang
  • Xiaohan Bi
  • Yankai Lin
  • Sishuo Chen
  • Jie Zhou
  • Xu Sun

Driven by the rapid development of Large Language Models (LLMs), LLM-based agents have been developed to handle various real-world applications, including finance, healthcare, and shopping. It is crucial to ensure the reliability and security of LLM-based agents in these applications. However, the safety issues of LLM-based agents are currently under-explored. In this work, we take the first step toward investigating one of the typical safety threats, the backdoor attack, against LLM-based agents. We first formulate a general framework of agent backdoor attacks, then present a thorough analysis of their different forms. Specifically, compared with traditional backdoor attacks on LLMs, which can only manipulate the user inputs and model outputs, agent backdoor attacks exhibit more diverse and covert forms: (1) From the perspective of the final attacking outcomes, the agent backdoor attacker can not only choose to manipulate the final output distribution, but also introduce malicious behavior in an intermediate reasoning step only, while keeping the final output correct. (2) Furthermore, the former category can be divided into two subcategories based on trigger locations, in which the backdoor trigger can either be hidden in the user query or appear in an intermediate observation returned by the external environment. We implement the above variations of agent backdoor attacks on two typical agent tasks, web shopping and tool utilization. Extensive experiments show that LLM-based agents suffer severely from backdoor attacks and that this backdoor vulnerability cannot be easily mitigated by current textual backdoor defense algorithms. This indicates an urgent need for further research on targeted defenses against backdoor attacks on LLM-based agents. Warning: This paper may contain biased content.

EAAI Journal 2023 Journal Article

A theory-guided deep-learning method for predicting power generation of multi-region photovoltaic plants

  • Jian Du
  • Jianqin Zheng
  • Yongtu Liang
  • Qi Liao
  • Bohong Wang
  • Xu Sun
  • Haoran Zhang
  • Maher Azaza

Recently, clean solar energy has attracted wide attention due to its excellent potential for electricity production. Highly accurate prediction of photovoltaic power generation (PVPG) is the basis of the production and transmission of electricity. However, current works neglect the regional correlation characteristics of PVPG, and few studies propose an effective framework that incorporates prior knowledge for more physically reasonable results. In this work, a hybrid deep learning framework is proposed for simultaneously capturing the spatial correlations among different regions and temporal dependency patterns of varying importance. Scientific theory and domain knowledge are incorporated into the deep learning model so that the predicted results possess physical reasonability. On this basis, the theory-guided and attention-based CNN-LSTM (TG-A-CNN-LSTM) is constructed for PVPG prediction. In the training process, data mismatch and boundary constraints are incorporated into the loss function, and a positivity constraint is utilized to restrict the output of the model. Once the network parameters are learned, the resulting TG-A-CNN-LSTM model produces predictions that obey the underlying physical laws. A real energy system spanning five regions is used to verify the accuracy of the proposed model. The results indicate that TG-A-CNN-LSTM achieves higher PVPG prediction precision than other prediction models, with an RMSE of 11.07, an MAE of 4.98, and an R² of 0.94. Moreover, the performance of the prediction models with sparse data is tested to illustrate the stability and robustness of TG-A-CNN-LSTM.
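The theory-guided loss described above, a data-mismatch term plus soft physical constraints, can be sketched generically. This is a hypothetical simplification for illustration: the term weights, the capacity bound, and the exact constraint forms are assumptions, not the paper's formulation:

```python
def theory_guided_loss(pred, target, lam_pos=1.0, lam_bound=1.0, capacity=1.0):
    """Data-mismatch term (MSE) plus soft physical constraints: predicted
    PV power should be non-negative and must not exceed plant capacity.
    A hypothetical sketch; the paper's exact terms and weights differ."""
    n = len(pred)
    mismatch = sum((p - t) ** 2 for p, t in zip(pred, target)) / n
    positivity = sum(max(-p, 0.0) ** 2 for p in pred) / n          # no negative power
    boundary = sum(max(p - capacity, 0.0) ** 2 for p in pred) / n  # capacity bound
    return mismatch + lam_pos * positivity + lam_bound * boundary

target = [0.2, 0.5, 0.8]
feasible = theory_guided_loss([0.2, 0.5, 0.8], target)    # physically consistent
violating = theory_guided_loss([-0.3, 0.5, 1.4], target)  # breaks both constraints
```

The soft penalties let the network be trained with ordinary gradient descent while still steering predictions toward the physically feasible region.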

NeurIPS Conference 2023 Conference Paper

Fed-FA: Theoretically Modeling Client Data Divergence for Federated Language Backdoor Defense

  • Zhiyuan Zhang
  • Deli Chen
  • Hao Zhou
  • Fandong Meng
  • Jie Zhou
  • Xu Sun

Federated learning algorithms enable neural network models to be trained across multiple decentralized edge devices without sharing private data. However, they are susceptible to backdoor attacks launched by malicious clients. Existing robust federated aggregation algorithms heuristically detect and exclude suspicious clients based on their parameter distances, but they are ineffective on Natural Language Processing (NLP) tasks. The main reason is that, although text backdoor patterns are obvious at the underlying dataset level, they are usually hidden at the parameter level, since injecting backdoors into texts with discrete feature space has less impact on the statistics of the model parameters. To settle this issue, we propose to identify backdoor clients by explicitly modeling the data divergence among clients in federated NLP systems. Through theoretical analysis, we derive the f-divergence indicator to estimate the client data divergence with aggregation updates and Hessians. Furthermore, we devise a dataset synthesization method with a Hessian reassignment mechanism guided by the diffusion theory to address the key challenge of inaccessible datasets in calculating clients' data Hessians. We then present the novel Federated F-Divergence-Based Aggregation (Fed-FA) algorithm, which leverages the f-divergence indicator to detect and discard suspicious clients. Extensive empirical results show that Fed-FA outperforms all the parameter distance-based methods in defending against backdoor attacks among various natural language backdoor attack scenarios.

NeurIPS Conference 2023 Conference Paper

FETV: A Benchmark for Fine-Grained Evaluation of Open-Domain Text-to-Video Generation

  • Yuanxin Liu
  • Lei Li
  • Shuhuai Ren
  • Rundong Gao
  • Shicheng Li
  • Sishuo Chen
  • Xu Sun
  • Lu Hou

Recently, open-domain text-to-video (T2V) generation models have made remarkable progress. However, the promising results are mainly shown by qualitative cases of generated videos, while the quantitative evaluation of T2V models still faces two critical problems. Firstly, existing studies lack fine-grained evaluation of T2V models on different categories of text prompts. Although some benchmarks have categorized the prompts, their categorization either only focuses on a single aspect or fails to consider the temporal information in video generation. Secondly, it is unclear whether the automatic evaluation metrics are consistent with human standards. To address these problems, we propose FETV, a benchmark for Fine-grained Evaluation of Text-to-Video generation. FETV is multi-aspect, categorizing the prompts based on three orthogonal aspects: the major content, the attributes to control, and the prompt complexity. FETV is also temporal-aware, introducing several temporal categories tailored for video generation. Based on FETV, we conduct comprehensive manual evaluations of four representative T2V models, revealing their pros and cons on different categories of prompts from different aspects. We also extend FETV as a testbed to evaluate the reliability of automatic T2V metrics. The multi-aspect categorization of FETV enables fine-grained analysis of the metrics' reliability in different scenarios. We find that existing automatic metrics (e.g., CLIPScore and FVD) correlate poorly with human evaluation. To address this problem, we explore several solutions to improve CLIPScore and FVD, and develop two automatic metrics that exhibit significantly higher correlation with humans than existing metrics. Benchmark page: https://github.com/llyx97/FETV.

NeurIPS Conference 2023 Conference Paper

Prompt Pre-Training with Twenty-Thousand Classes for Open-Vocabulary Visual Recognition

  • Shuhuai Ren
  • Aston Zhang
  • Yi Zhu
  • Shuai Zhang
  • Shuai Zheng
  • Mu Li
  • Alexander J. Smola
  • Xu Sun

This work proposes POMP, a prompt pre-training method for vision-language models. Being memory- and computation-efficient, POMP enables the learned prompt to condense semantic information for a rich set of visual concepts with over twenty thousand classes. Once pre-trained, the prompt, with its strong transferable ability, can be directly plugged into a variety of visual recognition tasks, including image classification, semantic segmentation, and object detection, to boost recognition performance in a zero-shot manner. Empirical evaluation shows that POMP achieves state-of-the-art performance on 21 datasets, e.g., 67.0% average accuracy on 10 classification datasets (+3.1% compared to CoOp) and 84.4 hIoU on open-vocabulary Pascal VOC segmentation (+6.9 compared to ZSSeg).

TMLR Journal 2023 Journal Article

When to Trust Aggregated Gradients: Addressing Negative Client Sampling in Federated Learning

  • Wenkai Yang
  • Yankai Lin
  • Guangxiang Zhao
  • Peng Li
  • Jie Zhou
  • Xu Sun

Federated Learning has become a widely-used framework which allows learning a global model on decentralized local datasets under the condition of protecting local data privacy. However, federated learning faces severe optimization difficulty when training samples are not independently and identically distributed (non-i.i.d.). In this paper, we point out that the client sampling practice plays a decisive role in the aforementioned optimization difficulty. We find that the negative client sampling will cause the merged data distribution of currently sampled clients heavily inconsistent with that of all available clients, and further make the aggregated gradient unreliable. To address this issue, we propose a novel learning rate adaptation mechanism to adaptively adjust the server learning rate for the aggregated gradient in each round, according to the consistency between the merged data distribution of currently sampled clients and that of all available clients. Specifically, we make theoretical deductions to find a meaningful and robust indicator that is positively related to the optimal server learning rate, which is supposed to minimize the Euclidean distance between the aggregated gradient given currently sampled clients and that if all clients could participate in the current round. We show that our proposed indicator can effectively reflect the merged data distribution of sampled clients, thus we utilize it for the server learning rate adaptation. Extensive experiments on multiple image and text classification tasks validate the great effectiveness of our method in various settings. Our code is available at https://github.com/lancopku/FedGLAD.
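The learning rate adaptation mechanism described above can be sketched with a deliberately simplified proxy: scale the server learning rate by the cosine similarity between the aggregated gradient of the sampled clients and the gradient under full participation. Note this full-participation gradient is not observable in practice; the paper instead derives an indicator from available quantities, so everything here is an illustrative assumption:

```python
import math

def adapted_server_lr(sampled_grad, full_grad, base_lr=1.0):
    """Scale the server learning rate by how well the sampled clients'
    aggregated gradient aligns with the (ideal) full-participation gradient:
    trust the aggregated update less when client sampling is unrepresentative.
    Illustrative proxy only; the paper's indicator avoids needing full_grad."""
    dot = sum(a * b for a, b in zip(sampled_grad, full_grad))
    norm = (math.sqrt(sum(a * a for a in sampled_grad))
            * math.sqrt(sum(b * b for b in full_grad)))
    cos = dot / norm if norm else 0.0
    return base_lr * max(cos, 0.0)  # never step against the aggregated gradient

aligned = adapted_server_lr([1.0, 0.0], [1.0, 0.0])  # representative sampling round
skewed = adapted_server_lr([1.0, 0.0], [0.0, 1.0])   # negative client sampling round
```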

IJCAI Conference 2022 Conference Paper

Rethinking the Promotion Brought by Contrastive Learning to Semi-Supervised Node Classification

  • Deli Chen
  • Yankai Lin
  • Lei Li
  • Xuancheng Ren
  • Peng Li
  • Jie Zhou
  • Xu Sun

Graph Contrastive Learning (GCL) has proven highly effective in promoting the performance of Semi-Supervised Node Classification (SSNC). However, existing GCL methods are generally transferred from other fields like CV or NLP, and their underlying working mechanism remains underexplored. In this work, we first deeply probe the working mechanism of GCL in SSNC, and find that the promotion brought by GCL is severely unevenly distributed: the improvement mainly comes from subgraphs with less annotated information, which is fundamentally different from contrastive learning in other fields. However, existing GCL methods generally ignore this uneven distribution of annotated information and apply GCL evenly to the whole graph. To remedy this issue and further improve GCL in SSNC, we propose the Topology InFormation gain-Aware Graph Contrastive Learning (TIFA-GCL) framework, which considers the annotated-information distribution across the graph in GCL. Extensive experiments on six benchmark graph datasets, including the enormous OGB-Products graph, show that TIFA-GCL brings a larger improvement than existing GCL methods in both transductive and inductive settings. Further experiments demonstrate the generalizability and interpretability of TIFA-GCL.

NeurIPS Conference 2022 Conference Paper

Retrieve, Reason, and Refine: Generating Accurate and Faithful Patient Instructions

  • Fenglin Liu
  • Bang Yang
  • Chenyu You
  • Xian Wu
  • Shen Ge
  • Zhangdaihong Liu
  • Xu Sun
  • Yang Yang

The "Patient Instruction" (PI), which contains critical instructional information provided both to carers and to the patient at the time of discharge, is essential for the patient to manage their condition outside hospital. An accurate and easy-to-follow PI can improve the self-management of patients, which can in turn reduce hospital readmission rates. However, writing an appropriate PI can be extremely time-consuming for physicians, and is subject to being incomplete or error-prone for (potentially overworked) physicians. Therefore, we propose a new task that can provide an objective means of avoiding incompleteness, while reducing clinical workload: the automatic generation of the PI, which is imagined as a document that the clinician can review, modify, and approve as necessary (rather than taking the human "out of the loop"). We build a benchmark clinical dataset and propose Re3Writer, which imitates the working patterns of physicians to first retrieve related working experience from historical PIs written by physicians, then reason over related medical knowledge. Finally, it refines the retrieved working experience and reasoned medical knowledge to extract useful information, which is used to generate the PI for a previously unseen patient according to their health records during hospitalization. Our experiments show that, using our method, the performance of 6 different models can be substantially boosted across all metrics, with up to 20%, 11%, and 19% relative improvements in BLEU-4, ROUGE-L, and METEOR, respectively. Meanwhile, we show results from human evaluations measuring its usefulness for clinical practice. The code is available at https://github.com/AI-in-Health/Patient-Instructions.

AAAI Conference 2022 Conference Paper

Well-Classified Examples Are Underestimated in Classification with Deep Neural Networks

  • Guangxiang Zhao
  • Wenkai Yang
  • Xuancheng Ren
  • Lei Li
  • Yunfang Wu
  • Xu Sun

The conventional wisdom behind learning deep classification models is to focus on badly-classified examples and ignore well-classified examples that are far from the decision boundary. For instance, when training with cross-entropy loss, examples with higher likelihoods (i.e., well-classified examples) contribute smaller gradients in back-propagation. However, we theoretically show that this common practice hinders representation learning, energy optimization, and margin growth. To counteract this deficiency, we propose to reward well-classified examples with additive bonuses to revive their contribution to the learning process. This counterexample theoretically addresses these three issues. We empirically support this claim by directly verifying the theoretical results or by significant performance improvement with our counterexample on diverse tasks, including image classification, graph classification, and machine translation. Furthermore, this paper shows that our idea extends to complex scenarios, such as imbalanced classification, OOD detection, and applications under adversarial attacks, because it resolves the same three issues. Code is available at https://github.com/lancopku/well-classified-examples-are-underestimated.
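The additive-bonus idea can be sketched with a deliberately simplified linear bonus. The bonus form here is a hypothetical stand-in for illustration, not the paper's actual formulation:

```python
import math

def encouraging_loss(p, bonus_weight=1.0):
    """Cross-entropy on the true class plus an additive bonus that rewards
    confident correct predictions, so well-classified examples keep
    contributing to learning. The linear bonus is a hypothetical
    simplification of the paper's additive-bonus idea."""
    ce = -math.log(p)            # standard cross-entropy term
    bonus = -bonus_weight * p    # reward (negative loss) grows with confidence
    return ce + bonus

# The combined loss keeps decreasing as confidence grows past the boundary,
# so already well-classified examples are still rewarded.
confident = encouraging_loss(0.99)
uncertain = encouraging_loss(0.60)
```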

NeurIPS Conference 2021 Conference Paper

Auto-Encoding Knowledge Graph for Unsupervised Medical Report Generation

  • Fenglin Liu
  • Chenyu You
  • Xian Wu
  • Shen Ge
  • Sheng Wang
  • Xu Sun

Medical report generation, which aims to automatically generate a long and coherent report for a given medical image, has been receiving growing research interest. Existing approaches mainly adopt a supervised manner and heavily rely on coupled image-report pairs. However, in the medical domain, building a large-scale image-report paired dataset is both time-consuming and expensive. To relax the dependency on paired data, we propose an unsupervised model, Knowledge Graph Auto-Encoder (KGAE), which accepts independent sets of images and reports in training. KGAE consists of a pre-constructed knowledge graph, a knowledge-driven encoder, and a knowledge-driven decoder. The knowledge graph works as the shared latent space bridging the visual and textual domains; the knowledge-driven encoder projects medical images and reports to the corresponding coordinates in this latent space, and the knowledge-driven decoder generates a medical report given a coordinate in this space. Since the knowledge-driven encoder and decoder can be trained with independent sets of images and reports, KGAE is unsupervised. The experiments show that the unsupervised KGAE generates desirable medical reports without using any image-report training pairs. Moreover, KGAE can also work in semi-supervised and supervised settings, accepting paired images and reports in training. By further fine-tuning with image-report pairs, KGAE consistently outperforms the current state-of-the-art models on two datasets.

AAAI Conference 2021 Conference Paper

Collaborative Group Learning

  • Shaoxiong Feng
  • Hongshen Chen
  • Xuancheng Ren
  • Zhuoye Ding
  • Kan Li
  • Xu Sun

Collaborative learning has successfully applied knowledge transfer to guide a pool of small student networks towards robust local minima. However, previous approaches typically struggle with drastically aggravated student homogenization when the number of students rises. In this paper, we propose Collaborative Group Learning, an efficient framework that aims to diversify the feature representation and conduct an effective regularization. Intuitively, similar to the human group study mechanism, we induce students to learn and exchange different parts of course knowledge as collaborative groups. First, each student is established by randomly routing on a modular neural network, which facilitates flexible knowledge communication between students due to random levels of representation sharing and branching. Second, to resist the student homogenization, students first compose diverse feature sets by exploiting the inductive bias from subsets of training data, and then aggregate and distill different complementary knowledge by imitating a random subgroup of students at each time step. Overall, the above mechanisms are beneficial for maximizing the student population to further improve the model generalization without sacrificing computational efficiency. Empirical evaluations on both image and text tasks indicate that our method significantly outperforms various state-of-the-art collaborative approaches whilst enhancing computational efficiency.

AAAI Conference 2021 Conference Paper

EQG-RACE: Examination-Type Question Generation

  • Xin Jia
  • Wenjie Zhou
  • Xu Sun
  • Yunfang Wu

Question Generation (QG) is an essential component of automatic intelligent tutoring systems, aiming to generate high-quality questions that facilitate reading practice and assessment. However, existing QG technologies encounter several key issues concerning the biased and unnatural language sources of datasets, which are mainly obtained from the Web (e.g., SQuAD). In this paper, we propose an innovative Examination-type Question Generation approach (EQG-RACE) to generate exam-like questions based on a dataset extracted from RACE. Two main strategies are employed in EQG-RACE for dealing with discrete answer information and reasoning over long contexts. A Rough Answer and Key Sentence Tagging scheme is utilized to enhance the input representations. An Answer-guided Graph Convolutional Network (AG-GCN) is designed to capture structural information by revealing the inter-sentence and intra-sentence relations. Experimental results show state-of-the-art performance of EQG-RACE, which is clearly superior to the baselines. In addition, our work establishes a new QG prototype with a reshaped dataset and QG method, providing an important benchmark for related research in future work. We will make our data and code publicly available for further research.

AAAI Conference 2021 Conference Paper

Exploring the Vulnerability of Deep Neural Networks: A Study of Parameter Corruption

  • Xu Sun
  • Zhiyuan Zhang
  • Xuancheng Ren
  • Ruixuan Luo
  • Liangyou Li

We argue that the vulnerability of model parameters is of crucial value to the study of model robustness and generalization, yet little research has been devoted to understanding this matter. In this work, we propose an indicator to measure the robustness of neural network parameters by exploiting their vulnerability via parameter corruption. The proposed indicator describes the maximum loss variation in the non-trivial worst-case scenario under parameter corruption. For practical purposes, we give a gradient-based estimation, which is far more effective than random corruption trials, which can hardly induce the worst accuracy degradation. Equipped with theoretical support and empirical validation, we are able to systematically investigate the robustness of different model parameters and reveal vulnerabilities of deep neural networks that have previously received little attention. Moreover, we can enhance the models accordingly with the proposed adversarial corruption-resistant training, which not only improves parameter robustness but also translates into accuracy gains.
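The gradient-based estimation can be illustrated on a toy quadratic loss. All functions below are hypothetical stand-ins; the point is only that a small fixed-norm corruption along the loss gradient degrades the loss far more than typical random corruption trials:

```python
import random

def loss(w):
    """Toy quadratic loss with different curvature per parameter."""
    return 2.0 * w[0] ** 2 + 0.5 * w[1] ** 2

def grad(w):
    """Analytic gradient of the toy loss."""
    return [4.0 * w[0], 1.0 * w[1]]

def corrupt(w, direction, eps=0.1):
    """Apply an eps-norm parameter corruption along a given direction."""
    norm = sum(d * d for d in direction) ** 0.5
    return [wi + eps * d / norm for wi, d in zip(w, direction)]

w = [1.0, 1.0]
base = loss(w)

# Gradient-based estimate: corrupt the parameters along the loss gradient.
grad_attack = loss(corrupt(w, grad(w))) - base

# Random corruption trials rarely find a comparably damaging direction.
random.seed(0)
rand_trials = [loss(corrupt(w, [random.gauss(0, 1), random.gauss(0, 1)])) - base
               for _ in range(100)]
```

To first order, the gradient direction maximizes the loss increase among all corruptions of the same norm, which is why it serves as a practical worst-case estimate.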

IJCAI Conference 2021 Conference Paper

Long-term, Short-term and Sudden Event: Trading Volume Movement Prediction with Graph-based Multi-view Modeling

  • Liang Zhao
  • Wei Li
  • Ruihan Bao
  • Keiko Harimoto
  • Yunfang Wu
  • Xu Sun

Trading volume movement prediction is key to a variety of financial applications. Despite its importance, there has been little research on this topic because it requires a comprehensive understanding of information from different sources. For instance, the relations between multiple stocks, recent transaction data, and suddenly released events are all essential for understanding the trading market. However, most previous methods only take the fluctuation information of the past few weeks into consideration, thus yielding poor performance. To handle this issue, we propose a graph-based approach that can jointly incorporate multi-view information, i.e., long-term stock trends, short-term fluctuations, and sudden-event information, into a temporal heterogeneous graph. Besides, our method is equipped with deep canonical analysis to highlight the correlations between different perspectives of fluctuation for better prediction. Experiment results show that our method outperforms strong baselines by a large margin.

AAAI Conference 2021 Conference Paper

Multi-View Feature Representation for Dialogue Generation with Bidirectional Distillation

  • Shaoxiong Feng
  • Xuancheng Ren
  • Kan Li
  • Xu Sun

Neural dialogue models suffer from low-quality responses when interacting with users in practice, demonstrating difficulty in generalization beyond the training data. Recently, knowledge distillation has been used to successfully regularize the student by transferring knowledge from the teacher. However, the teacher and the student are trained on the same dataset and tend to learn similar feature representations, whereas the most general knowledge should be found through their differences. The discovery of general knowledge is further hindered by unidirectional distillation, as the student must obey the teacher and may discard knowledge that is truly general but refuted by the teacher. To this end, we propose a novel training framework in which the learning of general knowledge is more in line with the idea of reaching consensus, i.e., finding common knowledge that benefits different yet all datasets through diversified learning partners. Concretely, the training task is divided into a group of subtasks with the same number of students. Each student assigned to one subtask is not only optimized on the allocated subtask but also imitates the multi-view feature representation aggregated from the other students (i.e., student peers), which induces students to capture common knowledge among different subtasks and alleviates over-fitting on the allocated subtasks. To further enhance generalization, we extend unidirectional distillation to bidirectional distillation, which encourages the student and its peers to co-evolve by exchanging complementary knowledge with each other. Empirical results and analysis demonstrate that our training framework effectively improves model generalization without sacrificing training efficiency.

NeurIPS Conference 2021 Conference Paper

Topology-Imbalance Learning for Semi-Supervised Node Classification

  • Deli Chen
  • Yankai Lin
  • Guangxiang Zhao
  • Xuancheng Ren
  • Peng Li
  • Jie Zhou
  • Xu Sun

The class imbalance problem, an important issue in learning node representations, has drawn increasing attention from the community. Although the imbalance considered by existing studies stems from the unequal quantity of labeled examples in different classes (quantity imbalance), we argue that graph data expose a unique source of imbalance arising from the asymmetric topological properties of the labeled nodes, i.e., labeled nodes are not equal in terms of their structural role in the graph (topology imbalance). In this work, we first probe the previously unknown topology-imbalance issue, including its characteristics, causes, and threats to semi-supervised node classification. We then provide a unified view that jointly analyzes the quantity- and topology-imbalance issues by considering the node influence shift phenomenon under the Label Propagation algorithm. In light of our analysis, we devise an influence conflict detection–based metric, Totoro, to measure the degree of graph topology imbalance, and propose a model-agnostic method, ReNode, to address the topology-imbalance issue by adaptively re-weighting the influence of labeled nodes based on their relative positions to class boundaries. Systematic experiments demonstrate the effectiveness and generalizability of our method in relieving the topology-imbalance issue and promoting semi-supervised node classification. Further analysis unveils the varied sensitivity of different graph neural networks (GNNs) to topology imbalance, which may serve as a new perspective for evaluating GNN architectures.

AAAI Conference 2021 Conference Paper

Towards Semantics-Enhanced Pre-Training: Can Lexicon Definitions Help Learning Sentence Meanings?

  • Xuancheng Ren
  • Xu Sun
  • Houfeng Wang
  • Qun Liu

Self-supervised pre-training techniques, albeit relying on large amounts of text, have enabled rapid growth in learning language representations for natural language understanding. However, as radically empirical models of sentences, they are subject to the input data distribution, inevitably incorporating data bias and reporting bias, which may lead to inaccurate understanding of sentences. To address this problem, we propose to adopt a human learner's approach: when we cannot make sense of a word in a sentence, we often consult the dictionary for its specific meanings; can the same work for empirical models? In this work, we try to inform pre-trained masked language models of word meanings for semantics-enhanced pre-training. To achieve a contrastive and holistic view of word meanings, a definition pair of two related words is presented to the masked language model so that the model can better associate a word with its crucial semantic features. Both intrinsic and extrinsic evaluations validate the proposed approach on semantics-oriented tasks, with an almost negligible increase in training data.

AAAI Conference 2020 Conference Paper

Measuring and Relieving the Over-Smoothing Problem for Graph Neural Networks from the Topological View

  • Deli Chen
  • Yankai Lin
  • Wei Li
  • Peng Li
  • Jie Zhou
  • Xu Sun

Graph Neural Networks (GNNs) have achieved promising performance on a wide range of graph-based tasks. Despite their success, one severe limitation of GNNs is the over-smoothing issue (indistinguishable representations of nodes in different classes). In this work, we present a systematic and quantitative study of the over-smoothing issue in GNNs. First, we introduce two quantitative metrics, MAD and MADGap, to measure the smoothness and over-smoothness of graph node representations, respectively. Then, we verify that smoothing is intrinsic to GNNs and that the critical factor leading to over-smoothness is the low information-to-noise ratio of the messages received by the nodes, which is partially determined by the graph topology. Finally, we propose two methods to alleviate the over-smoothing issue from the topological view: (1) MADReg, which adds a MADGap-based regularizer to the training objective; and (2) AdaEdge, which optimizes the graph topology based on the model predictions. Extensive experiments on 7 widely used graph datasets with 10 typical GNN models show that the two proposed methods effectively relieve the over-smoothing issue, thus improving the performance of various GNN models.
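A minimal sketch of a MAD-style measurement, assuming the metric averages cosine distances between representation pairs selected by a mask, with MADGap taken as the remote-pair MAD minus the neighbor-pair MAD. The toy embeddings and adjacency are hypothetical, not from the paper.

```python
import numpy as np

def mad(H, mask):
    """Mean Average Distance: average cosine distance between node
    representation pairs selected by `mask` (1 = include the pair)."""
    norms = np.linalg.norm(H, axis=1, keepdims=True)
    cos = (H @ H.T) / (norms * norms.T + 1e-12)
    dist = 1.0 - cos                      # cosine distance matrix
    per_node = (dist * mask).sum(1) / np.maximum(mask.sum(1), 1)
    valid = mask.sum(1) > 0
    return per_node[valid].mean()

# toy graph of 4 nodes in two clusters: adjacency defines "neighbor" pairs,
# its complement (minus self-loops) the "remote" pairs
H = np.array([[1.0, 0.0], [0.9, 0.1], [0.0, 1.0], [0.1, 0.9]])
A = np.array([[0, 1, 0, 0], [1, 0, 0, 0], [0, 0, 0, 1], [0, 0, 1, 0]], float)
remote = 1.0 - A - np.eye(4)
mad_gap = mad(H, remote) - mad(H, A)
print(mad_gap)  # positive: neighbors are smoother than remote pairs here
```

A healthy (not over-smoothed) embedding like this toy one yields a clearly positive gap; a collapsed embedding would drive the gap toward zero.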

NeurIPS Conference 2020 Conference Paper

Prophet Attention: Predicting Attention with Future Attention

  • Fenglin Liu
  • Xuancheng Ren
  • Xian Wu
  • Shen Ge
  • Wei Fan
  • Yuexian Zou
  • Xu Sun

Recently, attention-based models have been used extensively in many sequence-to-sequence learning systems. Especially for image captioning, attention-based models are expected to ground the correct image regions with properly generated words. However, at each time step of the decoding process, these models usually use the hidden state of the current input to attend to the image regions. Under this setting, they suffer from a "deviated focus" problem: they calculate the attention weights based on previous words instead of the one to be generated, impairing the performance of both grounding and captioning. In this paper, we propose Prophet Attention, which takes a form similar to self-supervision. In the training stage, this module utilizes future information to calculate the "ideal" attention weights over image regions. These calculated "ideal" weights are then used to regularize the "deviated" attention. In this manner, image regions are grounded with the correct words. The proposed Prophet Attention can be easily incorporated into existing image captioning models to improve their grounding and captioning performance. Experiments on the Flickr30k Entities and MSCOCO datasets show that Prophet Attention consistently outperforms baselines in both automatic metrics and human evaluations. It is worth noting that we set new state-of-the-art results on the two benchmark datasets and achieve first place on the leaderboard of the online MSCOCO benchmark in terms of the default ranking score, i.e., CIDEr-c40.
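The regularization step described above can be sketched minimally. The attention vectors and the squared-error form of the penalty are illustrative assumptions; the paper's exact loss may differ.

```python
import numpy as np

def prophet_regularizer(att_deviated, att_ideal):
    """Penalize disagreement between the attention computed at decoding time
    ("deviated", conditioned on previous words) and the attention computed
    with the future word available ("ideal")."""
    return float(np.mean((att_deviated - att_ideal) ** 2))

# two toy attention distributions over 4 image regions
a_dev = np.array([0.10, 0.60, 0.20, 0.10])
a_ideal = np.array([0.05, 0.80, 0.10, 0.05])
print(prophet_regularizer(a_dev, a_ideal))  # positive penalty, zero when aligned
```

Adding such a term to the captioning loss nudges the decoding-time attention toward the word actually being generated.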

AAAI Conference 2020 Conference Paper

Visual Agreement Regularized Training for Multi-Modal Machine Translation

  • Pengcheng Yang
  • Boxing Chen
  • Pei Zhang
  • Xu Sun

Multi-modal machine translation aims at translating the source sentence into a different language in the presence of a paired image. Previous work suggests that additional visual information provides only dispensable help to translation, needed in a few very special cases such as translating ambiguous words. To make better use of visual information, this work presents visual agreement regularized training. The proposed approach jointly trains the source-to-target and target-to-source translation models and encourages them to share the same focus on the visual information when generating semantically equivalent visual words (e.g., "ball" in English and "ballon" in French). Besides, a simple yet effective multi-head co-attention model is also introduced to capture interactions between visual and textual features. The results show that our approaches outperform competitive baselines by a large margin on the Multi30k dataset. Further analysis demonstrates that the proposed regularized training effectively improves the agreement of attention on the image, leading to better use of visual information.

IJCAI Conference 2019 Conference Paper

A Dual Reinforcement Learning Framework for Unsupervised Text Style Transfer

  • Fuli Luo
  • Peng Li
  • Jie Zhou
  • Pengcheng Yang
  • Baobao Chang
  • Xu Sun
  • Zhifang Sui

Unsupervised text style transfer aims to transfer the underlying style of text while keeping its main content unchanged, without parallel data. Most existing methods follow two steps: first separating the content from the original style, and then fusing the content with the desired style. However, the separation in the first step is challenging because content and style interact in subtle ways in natural language. Therefore, in this paper, we propose a dual reinforcement learning framework to directly transfer the style of the text via a one-step mapping model, without any separation of content and style. Specifically, we consider the learning of the source-to-target and target-to-source mappings as a dual task, and two rewards are designed based on this dual structure to reflect style accuracy and content preservation, respectively. In this way, the two one-step mapping models can be trained via reinforcement learning without any parallel data. Automatic evaluations show that our model outperforms the state-of-the-art systems by a large margin, with an average improvement of more than 10 BLEU points on two benchmark datasets. Human evaluations also validate the effectiveness of our model in terms of style accuracy, content preservation, and fluency. Our code and data, including the outputs of all baselines and our model, are available at https://github.com/luofuli/DualRL.

NeurIPS Conference 2019 Conference Paper

Aligning Visual Regions and Textual Concepts for Semantic-Grounded Image Representations

  • Fenglin Liu
  • Yuanxin Liu
  • Xuancheng Ren
  • Xiaodong He
  • Xu Sun

In vision-and-language grounding problems, fine-grained representations of the image are considered to be of paramount importance. Most current systems incorporate visual features and textual concepts as a sketch of an image. However, plainly inferred representations are usually undesirable in that they are composed of separate components whose relations are elusive. In this work, we aim at representing an image with a set of integrated visual regions and corresponding textual concepts, reflecting certain semantics. To this end, we build the Mutual Iterative Attention (MIA) module, which integrates correlated visual features and textual concepts, respectively, by aligning the two modalities. We evaluate the proposed approach on two representative vision-and-language grounding tasks, i.e., image captioning and visual question answering. In both tasks, the semantic-grounded image representations consistently boost the performance of the baseline models under all metrics. The results demonstrate that our approach is effective and generalizes well to a wide range of models for image-related applications. (The code is available at https://github.com/fenglinliu98/MIA.)

IJCAI Conference 2019 Conference Paper

Exploring and Distilling Cross-Modal Information for Image Captioning

  • Fenglin Liu
  • Xuancheng Ren
  • Yuanxin Liu
  • Kai Lei
  • Xu Sun

Recently, attention-based encoder-decoder models have been used extensively in image captioning. Yet current methods still have great difficulty achieving deep image understanding. In this work, we argue that such understanding requires visual attention to correlated image regions and semantic attention to coherent attributes of interest. To perform effective attention, we explore image captioning from a cross-modal perspective and propose the Global-and-Local Information Exploring-and-Distilling approach, which explores and distills the source information in vision and language. It globally provides the aspect vector, a spatial and relational representation of images based on caption contexts, through the extraction of salient region groupings and attribute collocations, and locally extracts the fine-grained regions and attributes in reference to the aspect vector for word selection. Our fully-attentive model achieves a CIDEr score of 129.3 in offline COCO evaluation with remarkable efficiency in terms of accuracy, speed, and parameter budget.

IJCAI Conference 2019 Conference Paper

Knowledgeable Storyteller: A Commonsense-Driven Generative Model for Visual Storytelling

  • Pengcheng Yang
  • Fuli Luo
  • Peng Chen
  • Lei Li
  • Zhiyi Yin
  • Xiaodong He
  • Xu Sun

The visual storytelling (VST) task aims at generating a reasonable and coherent paragraph-level story with an image stream as input. Different from a caption, which is a direct and literal description of image content, the story in the VST task tends to contain many imaginary concepts that do not appear in the images. This requires the AI agent to reason about and associate imaginary concepts based on implicit commonsense knowledge in order to generate a reasonable story describing the image stream. Therefore, in this work, we present a commonsense-driven generative model, which aims to introduce crucial commonsense from an external knowledge base for visual storytelling. Our approach first extracts a set of candidate knowledge graphs from the knowledge base. Then, an elaborately designed vision-aware directional encoding schema is adopted to effectively integrate the most informative commonsense. Besides, we strive to maximize the semantic similarity within the output during decoding to enhance the coherence of the generated text. Results show that our approach outperforms the state-of-the-art systems by a large margin, achieving a 29% relative improvement in CIDEr score. With additional commonsense and the semantic-relevance-based objective, the generated stories are more diverse and coherent.

AAAI Conference 2019 Conference Paper

Learning Personalized End-to-End Goal-Oriented Dialog

  • Liangchen Luo
  • Wenhao Huang
  • Qi Zeng
  • Zaiqing Nie
  • Xu Sun

Most existing works on dialog systems only consider conversation content while neglecting the personality of the user the bot is interacting with, which begets several unsolved issues. In this paper, we present a personalized end-to-end model in an attempt to leverage personalization in goal-oriented dialogs. We first introduce a PROFILE MODEL which encodes user profiles into distributed embeddings and refers to conversation history from other similar users. Then a PREFERENCE MODEL captures user preferences over knowledge base entities to handle the ambiguity in user requests. The two models are combined into the PERSONALIZED MEMN2N. Experiments show that the proposed model achieves qualitative performance improvements over state-of-the-art methods. As for human evaluation, it also outperforms other approaches in terms of task completion rate and user satisfaction.

AAAI Conference 2019 Conference Paper

LiveBot: Generating Live Video Comments Based on Visual and Textual Contexts

  • Shuming Ma
  • Lei Cui
  • Damai Dai
  • Furu Wei
  • Xu Sun

We introduce the task of automatic live commenting. Live commenting, also called "video barrage", is an emerging feature on online video sites that allows real-time comments from viewers to fly across the screen like bullets or roll along the right side of the screen. The live comments are a mixture of opinions about the video and chit-chat with other commenters. Automatic live commenting requires AI agents to comprehend the videos and interact with human viewers who also make comments, so it is a good testbed of an AI agent's ability to deal with both dynamic vision and language. In this work, we construct a large-scale live comment dataset with 2,361 videos and 895,929 live comments. Then, we introduce two neural models to generate live comments based on the visual and textual contexts, which achieve better performance than previous neural baselines such as the sequence-to-sequence model. Finally, we provide a retrieval-based evaluation protocol for automatic live commenting in which the model is asked to sort a set of candidate comments by log-likelihood score and is evaluated on metrics such as mean reciprocal rank. Putting it all together, we demonstrate the first "LiveBot". The datasets and code can be found at https://github.com/lancopku/livebot.
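The retrieval-based protocol described above reduces to a standard mean-reciprocal-rank computation over score-sorted candidate lists. The candidate comments and hit positions below are made up for illustration.

```python
def mean_reciprocal_rank(ranked_lists):
    """MRR over queries: each entry is a list of (candidate, is_correct)
    pairs already sorted by model score, best first."""
    total = 0.0
    for ranked in ranked_lists:
        for rank, (_, correct) in enumerate(ranked, start=1):
            if correct:
                total += 1.0 / rank
                break
    return total / len(ranked_lists)

# two toy videos: candidate comments sorted by (hypothetical) log-likelihood
runs = [
    [("nice shot!", True), ("lol", False), ("first", False)],   # hit at rank 1
    [("first", False), ("so cute", True), ("hello", False)],    # hit at rank 2
]
print(mean_reciprocal_rank(runs))  # (1/1 + 1/2) / 2 = 0.75
```

Higher MRR means the model ranks the ground-truth comment closer to the top of its candidate sort.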

NeurIPS Conference 2019 Conference Paper

Understanding and Improving Layer Normalization

  • Jingjing Xu
  • Xu Sun
  • Zhiyuan Zhang
  • Guangxiang Zhao
  • Junyang Lin

Layer normalization (LayerNorm) is a technique for normalizing the distributions of intermediate layers. It enables smoother gradients, faster training, and better generalization accuracy. However, it is still unclear where its effectiveness stems from. In this paper, our main contribution is to take a step further in understanding LayerNorm. Many previous studies believe that the success of LayerNorm comes from forward normalization. Unlike them, we find that the derivatives of the mean and variance are more important than forward normalization, as they re-center and re-scale the backward gradients. Furthermore, we find that the parameters of LayerNorm, including the bias and gain, increase the risk of over-fitting and do not work in most cases. Experiments show that a simple version of LayerNorm (LayerNorm-simple) without the bias and gain outperforms LayerNorm on four datasets. It obtains state-of-the-art performance on En-Vi machine translation. To address the over-fitting problem, we propose a new normalization method, Adaptive Normalization (AdaNorm), which replaces the bias and gain with a new transformation function. Experiments show that AdaNorm achieves better results than LayerNorm on seven out of eight datasets.
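The LayerNorm-simple variant described above is concrete enough to sketch directly: standard LayerNorm with the learnable bias and gain removed. This is a minimal NumPy version; the eps value is a common default, not taken from the paper.

```python
import numpy as np

def layer_norm_simple(x, eps=1e-5):
    """LayerNorm-simple: normalize each row to zero mean and unit variance,
    with the learnable bias and gain of standard LayerNorm removed."""
    mu = x.mean(axis=-1, keepdims=True)
    var = x.var(axis=-1, keepdims=True)
    return (x - mu) / np.sqrt(var + eps)

x = np.array([[1.0, 2.0, 3.0, 4.0], [10.0, 0.0, -10.0, 0.0]])
y = layer_norm_simple(x)
print(y.mean(-1), y.var(-1))  # each row is re-centered and re-scaled
```

In framework terms this corresponds to disabling the affine parameters of a LayerNorm layer (e.g. PyTorch's `elementwise_affine=False`), which is why the variant adds no parameters at all.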

IJCAI Conference 2018 Conference Paper

A Hierarchical End-to-End Model for Jointly Improving Text Summarization and Sentiment Classification

  • Shuming Ma
  • Xu Sun
  • Junyang Lin
  • Xuancheng Ren

Text summarization and sentiment classification both aim to capture the main ideas of a text but at different levels. Text summarization describes the text within a few sentences, while sentiment classification can be regarded as a special type of summarization that "summarizes" the text in an even more abstract fashion, i.e., as a sentiment class. Based on this idea, we propose a hierarchical end-to-end model for joint learning of text summarization and sentiment classification, where the sentiment classification label is treated as a further "summarization" of the text summarization output. Hence, the sentiment classification layer is put on top of the text summarization layer, and a hierarchical structure is derived. Experimental results on Amazon online review datasets show that our model achieves better performance than strong baseline systems on both abstractive summarization and sentiment classification.

AAAI Conference 2018 Conference Paper

Duplicate Question Identification by Integrating FrameNet With Neural Networks

  • Xiaodong Zhang
  • Xu Sun
  • Houfeng Wang

There are two major problems in duplicate question identification, namely the lexical gap and essential constituent matching. Previous methods either design various similarity features or learn representations via neural networks, which address the lexical gap but neglect essential constituent matching. In this paper, we focus on the essential constituent matching problem and use FrameNet-style semantic parsing to tackle it. Two approaches are proposed to integrate FrameNet parsing with neural networks. An ensemble approach combines a traditional model with manually designed features and a neural network model. An embedding approach converts frame parses to embeddings, which are combined with word embeddings at the input of the neural networks. Experiments on the Quora question pairs dataset demonstrate that the ensemble approach is more effective and outperforms all baselines.

AAAI Conference 2018 Conference Paper

Modeling Scientific Influence for Research Trending Topic Prediction

  • Chengyao Chen
  • Zhitao Wang
  • Wenjie Li
  • Xu Sun

With the growing volume of publications in the Computer Science (CS) discipline, tracking the evolution of research and predicting future trending research topics are of great importance for researchers to keep up with the rapid progress of the field. Within a research area, there are many top conferences that publish the latest research results. These conferences mutually influence each other and jointly promote the development of the research area. To predict the trending topics of mutually influenced conferences, we propose a correlated neural influence model, which can capture the sequential properties of research evolution in each individual conference and simultaneously discover the dependencies among different conferences. Experiments conducted on a scientific dataset including conferences in artificial intelligence and data mining show that our model consistently outperforms other state-of-the-art methods. We also demonstrate the interpretability and predictability of the proposed model by providing its answers to two questions of concern, i.e., what the next rising trending topics are and, for each conference, which peer is the most influential.

AAAI Conference 2017 Conference Paper

A Unified Model for Cross-Domain and Semi-Supervised Named Entity Recognition in Chinese Social Media

  • Hangfeng He
  • Xu Sun

Named entity recognition (NER) in Chinese social media is important but difficult because of the informality and strong noise of such text. Previous methods focus only on in-domain supervised learning, which is limited by scarce annotated data. However, there are ample corpora in formal domains and massive in-domain unannotated texts that can be used to improve the task. We propose a unified model that can learn from out-of-domain corpora and in-domain unannotated texts. The unified model contains two major functions: one for cross-domain learning and another for semi-supervised learning. The cross-domain learning function learns out-of-domain information based on domain similarity, while the semi-supervised learning function learns from in-domain unannotated texts by self-training. Both learning functions outperform existing methods for NER in Chinese social media. Finally, our unified model yields a nearly 11% absolute improvement over previously published results.

IROS Conference 2015 Conference Paper

Printing angle sensors for foldable robots

  • Xu Sun
  • Samuel M. Felton
  • Robert J. Wood
  • Sangbae Kim

Self-folding is a promising technique for assembling robots from flat sheets. However, existing implementations do not include reliable methods for sensing the folding angle, making feedback control impossible. In this paper, we present novel angle sensors for foldable robots and machines. They are inkjet-printed and fully integrated into the robots' laminates. This additional sensor layer tracks the angular motion of robot hinges, both to better guide robot assembly by folding and to enable more complicated tasks that require feedback control, making folded robots more capable in real-world applications. We introduce the fabrication process and property assessments, and demonstrate sensor performance by measuring the folding angles of a cube and controlling folds on a gripper.

ICRA Conference 2015 Conference Paper

Self-folding and self-actuating robots: A pneumatic approach

  • Xu Sun
  • Samuel M. Felton
  • Ryuma Niiyama
  • Robert J. Wood
  • Sangbae Kim

Self-assembling robots can be transported and deployed inexpensively and autonomously in remote and dangerous environments. In this paper, we introduce a novel self-assembling method based on a planar pneumatic system. Inflation of pouches translates into shape changes, turning a sheet of composite material into a complex robotic structure. This new method enables a flat origami-based robotic structure to self-fold to desired angles under pressure control. It allows a static joint to become dynamic and to self-actuate to reconfigure itself after the initial folding. Finally, the folded robot can unfold itself at the end of a robotic application. We believe this pneumatic approach provides an important toolkit for building more powerful and capable self-assembling robots.

NeurIPS Conference 2014 Conference Paper

Structure Regularization for Structured Prediction

  • Xu Sun

While there are many studies on weight regularization, the study of structure regularization is rare. Many existing systems for structured prediction focus on increasing the level of structural dependencies within the model. However, this trend could have been misdirected, because our study suggests that complex structures are actually harmful to generalization ability in structured prediction. To control structure-based overfitting, we propose a structure regularization framework via structure decomposition, which decomposes training samples into mini-samples with simpler structures, deriving a model with better generalization power. We show both theoretically and empirically that structure regularization can effectively control overfitting risk and lead to better accuracy. As a by-product, the proposed method can also substantially accelerate training. The method and the theoretical results apply to general graphical models with arbitrary structures. Experiments on well-known tasks demonstrate that our method can easily beat the benchmark systems on those highly competitive tasks, achieving record-breaking accuracies with substantially faster training speed.
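The structure-decomposition idea can be sketched as splitting one structured training sample into mini-samples with shorter dependency chains. The chunk size `k` and the POS-tagging example are illustrative assumptions, not the paper's experimental setup.

```python
def decompose(tokens, tags, k):
    """Structure decomposition: split one structured training sample into
    mini-samples of at most k positions, cutting long-range dependencies."""
    return [(tokens[i:i + k], tags[i:i + k]) for i in range(0, len(tokens), k)]

sent = ["The", "cat", "sat", "on", "the", "mat"]
tags = ["DET", "NOUN", "VERB", "ADP", "DET", "NOUN"]
minis = decompose(sent, tags, k=2)
print(len(minis))  # 3 mini-samples, each with a simpler chain structure
```

Training on the mini-samples rather than the full chain is what regularizes structure-based overfitting, and it also explains the speedup: inference cost grows with chain length.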

IJCAI Conference 2009 Conference Paper

  • Xu Sun
  • Takuya Matsuzaki
  • Daisuke Okanohara
  • Jun’ichi Tsujii

We propose a perceptron-style algorithm for fast discriminative training of structured latent variable models, and analyze its convergence properties. Our method extends the perceptron algorithm to learning tasks with latent dependencies, which may not be captured by traditional models. It relies on Viterbi decoding over latent variables, combined with simple additive updates. Compared to existing probabilistic models with latent variables, our method lowers the training cost significantly while achieving comparable or even superior classification accuracy.
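A toy sketch of the perceptron-style update with latent variables, using exhaustive search over a tiny latent set in place of Viterbi decoding (which plays the same argmax role over structured latent assignments). The feature map, data, labels, and latent values are all hypothetical.

```python
import itertools

def score(w, feats):
    return sum(w.get(f, 0.0) for f in feats)

def phi(x, h, y):
    # hypothetical feature map: joint indicators over input, latent value, label
    return [(x, h, y), (h, y)]

def latent_perceptron_step(w, x, gold_y, labels, latents):
    """One update: search (standing in for Viterbi) over latent assignments
    for the best-scoring prediction, then a simple additive update."""
    pred_y, pred_h = max(itertools.product(labels, latents),
                         key=lambda p: score(w, phi(x, p[1], p[0])))
    if pred_y != gold_y:
        # best latent assignment for the gold label
        gold_h = max(latents, key=lambda h: score(w, phi(x, h, gold_y)))
        for f in phi(x, gold_h, gold_y):
            w[f] = w.get(f, 0.0) + 1.0
        for f in phi(x, pred_h, pred_y):
            w[f] = w.get(f, 0.0) - 1.0

w = {}
data = [("sunny", "+"), ("rainy", "-"), ("sunny", "+"), ("rainy", "-")]
for _ in range(5):
    for x, y in data:
        latent_perceptron_step(w, x, y, labels=["+", "-"], latents=[0, 1])
preds = [max(itertools.product(["+", "-"], [0, 1]),
             key=lambda p: score(w, phi(x, p[1], p[0])))[0] for x, _ in data]
print(preds)
```

The updates are purely additive, which is the source of the training-cost advantage over probabilistic latent-variable models that require expectations over all latent assignments.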