Arrow Research search

Author name cluster

Ying Shen

Possible papers associated with this exact author name in Arrow. This page groups case-insensitive exact name matches and is not a full identity disambiguation profile.

31 papers
1 author row

Possible papers


AAAI Conference 2026 Conference Paper

SmartSplat: Feature-Smart Gaussians for Scalable Compression of Ultra-High-Resolution Images

  • Linfei Li
  • Lin Zhang
  • Zhong Wang
  • Ying Shen

Recent advances in generative AI have accelerated the production of ultra-high-resolution visual content. However, traditional image formats face significant limitations in efficient compression and real-time decoding, which restricts their applicability on end-user devices. Inspired by 3D Gaussian Splatting, 2D Gaussian image models have achieved notable progress in enhancing image representation efficiency and quality. Nevertheless, existing methods struggle to balance compression ratios and reconstruction fidelity in ultra-high-resolution scenarios. To address these challenges, we propose SmartSplat, a highly adaptive and feature-aware GS-based image compression framework that effectively supports arbitrary image resolutions and compression ratios. By leveraging image-aware features such as gradients and color variances, SmartSplat introduces a Gradient-Color Guided Variational Sampling strategy alongside an Exclusion-based Uniform Sampling scheme, significantly improving the non-overlapping coverage of Gaussian primitives in pixel space. Additionally, a Scale-Adaptive Gaussian Color Sampling method is proposed to enhance the initialization of Gaussian color attributes across scales. Through joint optimization of spatial layout, scale, and color initialization, SmartSplat can efficiently capture both local structures and global textures of images using a limited number of Gaussians, achieving superior reconstruction quality under high compression ratios. Extensive experiments on DIV8K and a newly created 16K dataset demonstrate that SmartSplat significantly outperforms state-of-the-art methods at comparable compression ratios and surpasses their compression limits, exhibiting strong scalability and practical applicability. This framework can effectively alleviate the storage and transmission burdens of ultra-high-resolution images, providing a robust foundation for future high-efficiency visual content processing.
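The core idea the abstract builds on, representing an image as a mixture of 2D Gaussian primitives, can be illustrated with a toy sketch. This is not SmartSplat's pipeline (no feature-aware sampling, no joint optimization of layout and scale); it fixes isotropic Gaussians on a grid and fits only their amplitudes to a small grayscale target by least squares, with all sizes and placements chosen arbitrarily for the demo:

```python
import numpy as np

def render_gaussians(h, w, centers, scales, amps):
    """Render a grayscale image as a weighted sum of isotropic 2D Gaussians."""
    ys, xs = np.mgrid[0:h, 0:w]
    img = np.zeros((h, w))
    for (cy, cx), s, a in zip(centers, scales, amps):
        img += a * np.exp(-((ys - cy) ** 2 + (xs - cx) ** 2) / (2 * s ** 2))
    return img

def fit_amplitudes(target, centers, scales):
    """Least-squares fit of per-Gaussian amplitudes to a target image."""
    h, w = target.shape
    ys, xs = np.mgrid[0:h, 0:w]
    basis = np.stack(
        [np.exp(-((ys - cy) ** 2 + (xs - cx) ** 2) / (2 * s ** 2)).ravel()
         for (cy, cx), s in zip(centers, scales)],
        axis=1)  # (H*W, N) design matrix, one column per Gaussian
    amps, *_ = np.linalg.lstsq(basis, target.ravel(), rcond=None)
    return amps

# Toy target: a smooth bump, approximated with a 6x6 grid of Gaussians.
h = w = 32
ys, xs = np.mgrid[0:h, 0:w]
target = np.exp(-((ys - 16.0) ** 2 + (xs - 16.0) ** 2) / 50.0)
grid = np.linspace(2, 30, 6)
centers = [(cy, cx) for cy in grid for cx in grid]
scales = np.full(36, 4.0)
amps = fit_amplitudes(target, centers, scales)
recon = render_gaussians(h, w, centers, scales, amps)
psnr = 10 * np.log10(1.0 / np.mean((recon - target) ** 2))
```

Replacing the fixed grid with gradient- and color-guided placement, and optimizing centers and anisotropic scales as well, is where methods like SmartSplat gain their compression efficiency.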

NeurIPS Conference 2025 Conference Paper

Atomic Thinking of LLMs: Decoupling and Exploring Mathematical Reasoning Abilities

  • Jiayi Kuang
  • Haojing Huang
  • Yinghui Li
  • Xinnian Liang
  • Zhikun Xu
  • Yangning Li
  • Xiaoyu Tan
  • Chao Qu

Large Language Models (LLMs) have demonstrated outstanding performance in mathematical reasoning. However, we argue that current large-scale reasoning models primarily rely on scaling up training datasets with diverse mathematical problems and long thinking chains, which raises the question of whether LLMs genuinely acquire mathematical concepts and reasoning principles or merely memorize the training data. In contrast, humans tend to break down complex problems into multiple fundamental atomic capabilities. Inspired by this, we propose a new paradigm for evaluating mathematical atomic capabilities. Our work categorizes atomic abilities into two dimensions: (1) field-specific abilities across four major mathematical fields (algebra, geometry, analysis, and topology), and (2) logical abilities at different levels, including conceptual understanding, forward multi-step reasoning with formal mathematical language, and counterexample-driven backward reasoning. We propose corresponding training and evaluation datasets for each atomic capability unit, and conduct extensive experiments on how different atomic capabilities influence one another, exploring strategies to elicit a required atomic capability. Evaluations of advanced models yield many interesting findings about how models perform on the various atomic capabilities and how those capabilities interact. Our findings highlight the importance of decoupling mathematical intelligence into atomic components, providing new insights into model cognition and guiding the development of training strategies toward a more efficient, transferable, and cognitively grounded paradigm of "atomic thinking".

NeurIPS Conference 2025 Conference Paper

CausalVTG: Towards Robust Video Temporal Grounding via Causal Inference

  • Qiyi Wang
  • Senda Chen
  • Ying Shen

Video Temporal Grounding (VTG) aims to localize relevant segments in untrimmed videos based on natural language queries and has seen notable progress in recent years. However, most existing methods suffer from two critical limitations. First, they are prone to learning superficial co-occurrence patterns, such as associating specific objects or phrases with certain events, induced by dataset biases, which ultimately degrades their semantic understanding abilities. Second, they typically assume that relevant segments always exist in the video, an assumption misaligned with real-world scenarios where queried content may be absent. Fortunately, causal inference offers a natural solution to these issues by disentangling dataset-induced biases and enabling counterfactual reasoning about query relevance. To this end, we propose CausalVTG, a novel framework that explicitly integrates causal reasoning into VTG. Specifically, we introduce a causality-aware disentangled encoder (CADE) based on front-door adjustment to mitigate confounding biases in the visual and textual modalities. To better capture temporal granularity, we design a multi-scale temporal perception module (MSTP) that reconstructs query-conditioned video features at multiple resolutions. Additionally, a counterfactual contrastive learning objective is employed to help the model discern whether a query is truly grounded in a video. Extensive experiments on five widely used benchmarks demonstrate that CausalVTG outperforms state-of-the-art methods, achieving higher localization precision under stricter IoU thresholds and more accurately identifying whether a query is truly grounded in the video. These results demonstrate both the effectiveness and generalizability of the proposed CausalVTG. The code is available at https://github.com/MxLearner/CausalVTG.
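The front-door adjustment that CADE builds on has a compact discrete form: P(y|do(x)) = Σ_m P(m|x) Σ_x' P(y|x',m) P(x'). A toy numeric check, using a made-up structural model with binary variables rather than the paper's video and query features, shows the formula recovering the interventional distribution from observational quantities alone:

```python
import numpy as np

# Toy structural model with a hidden confounder U:
#   U -> X, U -> Y, and X -> M -> Y (M is the front-door mediator).
p_u = np.array([0.6, 0.4])                      # P(u)
p_x_u = np.array([[0.8, 0.2],                   # P(x|u), rows indexed by u
                  [0.3, 0.7]])
p_m_x = np.array([[0.9, 0.1],                   # P(m|x), rows indexed by x
                  [0.2, 0.8]])
p_y_mu = np.array([[[0.7, 0.3], [0.4, 0.6]],    # P(y|m,u), indexed [m][u][y]
                   [[0.5, 0.5], [0.1, 0.9]]])

# Observed joint P(x, m, y) with U marginalized out (all the data gives us).
p_xmy = np.einsum('u,ux,xm,muy->xmy', p_u, p_x_u, p_m_x, p_y_mu)

p_x = p_xmy.sum(axis=(1, 2))                             # P(x)
p_m_given_x = p_xmy.sum(axis=2) / p_x[:, None]           # P(m|x)
p_y_given_xm = p_xmy / p_xmy.sum(axis=2, keepdims=True)  # P(y|x,m)
p_y_given_x = p_xmy.sum(axis=1) / p_x[:, None]           # naive, confounded

def front_door(x, y):
    """P(y | do(x)) via the front-door formula, observables only."""
    return sum(
        p_m_given_x[x, m]
        * sum(p_y_given_xm[xp, m, y] * p_x[xp] for xp in range(2))
        for m in range(2))

def truth(x, y):
    """Ground-truth interventional distribution (uses hidden U; for checking)."""
    return sum(
        p_u[u] * p_m_x[x, m] * p_y_mu[m, u, y]
        for u in range(2) for m in range(2))
```

In the paper's setting the analogues of X, M, and Y are high-dimensional features rather than binary variables; the toy only verifies the adjustment identity itself, and shows that the naive conditional P(y|x) is biased while the front-door estimate is not.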

NeurIPS Conference 2025 Conference Paper

Diversity as a Reward: Fine-Tuning LLMs on a Mixture of Domain-Undetermined Data

  • Zhenqing Ling
  • Daoyuan Chen
  • Liuyi Yao
  • Qianli Shen
  • Yaliang Li
  • Ying Shen

Fine-tuning large language models (LLMs) using diverse datasets is crucial for enhancing their overall performance across various domains. In practical scenarios, existing methods based on modeling the mixture proportions of data composition often struggle with data whose domain labels are missing, imprecise or non-normalized, while methods based on data selection usually encounter difficulties in balancing multi-domain performance. To address these challenges, in this work, we investigate the role of data diversity in enhancing the overall abilities of LLMs by empirically constructing contrastive data pools and theoretically deriving explanations. Building upon the insights gained, we propose a new method that gives the LLM a dual identity: an output model to cognitively probe and select data based on diversity reward, as well as an input model to be tuned with the selected data. Extensive experiments show that the proposed method notably boosts performance across domain-undetermined data and a series of foundational downstream tasks when applied to various advanced LLMs. We release our code and hope this study can shed light on the understanding of data diversity and advance feedback-driven data-model co-design for LLMs.
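As a loose illustration of selecting data for diversity (not the paper's method, which uses the LLM itself as a diversity-rewarded selector), greedy farthest-point selection over embedding vectors repeatedly picks the item farthest from everything chosen so far, which naturally covers distinct domains even when domain labels are missing. The embeddings below are synthetic stand-ins:

```python
import numpy as np

def select_diverse(embeddings, k):
    """Greedy farthest-point selection: repeatedly pick the item whose
    minimum distance to the already-selected set is largest (a simple
    stand-in for a 'diversity reward')."""
    chosen = [0]
    d = np.linalg.norm(embeddings - embeddings[0], axis=1)
    while len(chosen) < k:
        nxt = int(np.argmax(d))          # farthest from the selected set
        chosen.append(nxt)
        d = np.minimum(d, np.linalg.norm(embeddings - embeddings[nxt], axis=1))
    return chosen

# Three tight clusters of 10 points each; picking 3 diverse points
# should land one in every cluster.
rng = np.random.default_rng(0)
pts = np.concatenate([rng.normal(c, 0.05, size=(10, 2))
                      for c in ([0, 0], [5, 0], [0, 5])])
idx = select_diverse(pts, 3)
clusters = sorted(i // 10 for i in idx)
```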

AAAI Conference 2025 Conference Paper

EXCGEC: A Benchmark for Edit-Wise Explainable Chinese Grammatical Error Correction

  • Jingheng Ye
  • Shang Qin
  • Yinghui Li
  • Xuxin Cheng
  • Libo Qin
  • Hai-Tao Zheng
  • Ying Shen
  • Peng Xing

Existing studies explore the explainability of Grammatical Error Correction (GEC) in a limited scenario, where they ignore the interaction between corrections and explanations and have not established a corresponding comprehensive benchmark. To bridge the gap, this paper first introduces the task of EXplainable GEC (EXGEC), which focuses on the integral role of correction and explanation tasks. To facilitate the task, we propose EXCGEC, a tailored benchmark for Chinese EXGEC consisting of 8,216 explanation-augmented samples featuring the design of hybrid edit-wise explanations. We then benchmark several series of LLMs in multi-task learning settings, including post-explaining and pre-explaining. To promote the development of the task, we also build a comprehensive evaluation suite by leveraging existing automatic metrics and conducting human evaluation experiments to demonstrate the human consistency of the automatic metrics for free-text explanations. Our experiments reveal the effectiveness of evaluating free-text explanations using traditional metrics like METEOR and ROUGE, and the inferior performance of multi-task models compared to the pipeline solution, indicating the challenge of establishing positive transfer when learning both tasks jointly.

NeurIPS Conference 2025 Conference Paper

MindGYM: What Matters in Question Synthesis for Thinking-Centric Fine-Tuning?

  • Zhe Xu
  • Daoyuan Chen
  • Zhenqing Ling
  • Yaliang Li
  • Ying Shen

Large foundation models face challenges in acquiring transferable, structured thinking abilities, especially when supervised with rigid templates or crowd-annotated instruction datasets. Unlike prior approaches, we focus on a thinking-centric data synthesis paradigm that enables models to evolve through self-generated, cognitively guided data. We propose MindGYM, a structured and scalable framework for question synthesis, composed of: (1) Cognitive Thinking Process Injection, which infuses high-level reasoning objectives to shape the model’s synthesis behavior; (2) Seed Single-Hop Question Synthesis, generating atomic questions from diverse semantic types to encourage broader thinking; and (3) Challenging Multi-Hop QA Synthesis, composing more complex multi-hop questions based on QA seeds for deeper reasoning. Detailed analysis shows that synthetic data generated by our method achieves 16.7% higher average quality and 67.91% lower quality variance compared to baseline sources, highlighting that both high-quality and self-contained data are essential for effective, thinking-oriented fine-tuning. MindGYM improves performance on six reasoning benchmarks, achieving gains of up to 16% on MathVision using only 400 data samples, and generalizable improvements across different model sizes and architectures. MindGYM underscores the viability of self-challenging mechanisms in refining large model capabilities while minimizing human intervention and resource demands. Code and data are released to promote data-centric research into self-evolving foundation models driven by their internal reasoning capabilities.

YNIMG Journal 2025 Journal Article

Precuneus-to-hippocampus connectivity links LTP-like plasticity to cognitive function in subjective cognitive decline and mild cognitive impairment

  • Jie Song
  • Qian Lu
  • Shuai Zhang
  • Chuan He
  • Tianjiao Zhang
  • Hailang Yan
  • Han Yang
  • Huanping Wang

BACKGROUND: Disruptions in synaptic plasticity and alterations in effective connectivity (EC) involving the hippocampus and amygdala are hallmarks of early Alzheimer's disease (AD). However, the interplay between these neurophysiological changes and their relationships with cognitive functions in subjective cognitive decline (SCD) and mild cognitive impairment (MCI) remains poorly understood. METHODS: Transcranial magnetic stimulation (TMS) and resting-state functional magnetic resonance imaging (rs-fMRI) were used to assess long-term potentiation (LTP)-like plasticity and EC involving the amygdala and hippocampus in 34 individuals with SCD, 27 with MCI, and 35 healthy controls (HC). Between-group differences in cognitive performance, EC alterations, and LTP-like plasticity were examined, and their relationships were assessed via correlation and mediation analyses. RESULTS: Both SCD and MCI groups exhibited disrupted EC between the amygdala/hippocampus and the inferior occipital gyrus (IOG), inferior parietal lobule (IPL), medial frontal lobe (MFL), and precuneus. Also, both LTP-5min and LTP-10min were significantly reduced in the MCI group compared to the SCD and HC groups. Importantly, EC from the left hippocampus to the IPL and from the IPL, MFL, and precuneus to the hippocampus was correlated with memory and executive functions. Moreover, precuneus-to-hippocampus EC was positively correlated with LTP-10min and mediated the relationship between LTP-like plasticity and cognitive performance. CONCLUSIONS: This study provides novel evidence that precuneus-to-hippocampus EC mediates the link between synaptic plasticity and cognitive function in SCD and MCI, suggesting the precuneus-hippocampus pathway as a promising target for early diagnosis and intervention.
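The mediation analysis reported above is commonly run as a product-of-coefficients regression. A sketch on synthetic data (made-up effect sizes and sample size, not the study's measurements) shows the indirect effect of plasticity (X) on cognition (Y) through connectivity (M), and the exact OLS identity total = direct + indirect:

```python
import numpy as np

def ols_beta(X, y):
    """OLS coefficients with intercept; returns [intercept, slopes...]."""
    X = np.column_stack([np.ones(len(X)), X])
    return np.linalg.lstsq(X, y, rcond=None)[0]

# Synthetic data following the mediation pattern in the abstract:
# plasticity (X) -> precuneus-to-hippocampus EC (M) -> cognition (Y).
rng = np.random.default_rng(1)
x = rng.normal(size=500)                                  # LTP-like plasticity
m = 0.7 * x + rng.normal(scale=0.5, size=500)             # connectivity
y = 0.6 * m + 0.1 * x + rng.normal(scale=0.5, size=500)   # cognition

a = ols_beta(x[:, None], m)[1]                 # X -> M path
betas = ols_beta(np.column_stack([m, x]), y)
b, direct = betas[1], betas[2]                 # M -> Y path, direct X -> Y
indirect = a * b                               # mediated (indirect) effect
total = ols_beta(x[:, None], y)[1]             # total X -> Y effect
```

For the single-mediator linear model, the decomposition total = direct + a*b holds exactly for the OLS estimates, which makes a convenient sanity check.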

AAAI Conference 2025 Conference Paper

Representing Sounds as Neural Amplitude Fields: A Benchmark of Coordinate-MLPs and a Fourier Kolmogorov-Arnold Framework

  • Linfei Li
  • Lin Zhang
  • Zhong Wang
  • Fengyi Zhang
  • Zelin Li
  • Ying Shen

Although Coordinate-MLP-based implicit neural representations have excelled in representing radiance fields, 3D shapes, and images, their application to audio signals remains underexplored. To fill this gap, we investigate existing implicit neural representations, from which we extract 3 types of positional encoding and 16 commonly used activation functions. Through combinatorial design, we establish the first benchmark for Coordinate-MLPs in audio signal representations. Our benchmark reveals that Coordinate-MLPs require complex hyperparameter tuning and frequency-dependent initialization, limiting their robustness. To address these issues, we propose Fourier-ASR, a novel framework based on the Fourier series theorem and the Kolmogorov-Arnold representation theorem. Fourier-ASR introduces Fourier Kolmogorov-Arnold Networks (Fourier-KAN), which leverage periodicity and strong nonlinearity to represent audio signals, eliminating the need for additional positional encoding. Furthermore, a Frequency-adaptive Learning Strategy (FaLS) is proposed to enhance the convergence of Fourier-KAN by capturing high-frequency components and preventing overfitting of low-frequency signals. Extensive experiments conducted on natural speech and music datasets reveal that: (1) well-designed positional encoding and activation functions in Coordinate-MLPs can effectively improve audio representation quality; and (2) Fourier-ASR can robustly represent complex audio signals without extensive hyperparameter tuning. Looking ahead, the continuity and infinite resolution of implicit audio representations make our research highly promising for tasks such as audio compression, synthesis, and generation.
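The Fourier-series view that motivates Fourier-KAN can be illustrated with a plain least-squares fit of sin/cos harmonics to a toy periodic signal. This is not the learnable Fourier-KAN layer itself, only the underlying representation: periodic basis functions can capture an audio-like waveform with no positional encoding at all.

```python
import numpy as np

def fourier_design(t, n_harmonics, period=1.0):
    """Design matrix of sin/cos harmonics (plus a constant) at times t."""
    cols = [np.ones_like(t)]
    for k in range(1, n_harmonics + 1):
        w = 2 * np.pi * k / period
        cols += [np.sin(w * t), np.cos(w * t)]
    return np.stack(cols, axis=1)   # (len(t), 2*n_harmonics + 1)

# Audio-like toy signal: two tones plus a weak higher harmonic.
t = np.linspace(0, 1, 512, endpoint=False)
signal = (0.8 * np.sin(2 * np.pi * 3 * t)
          + 0.4 * np.cos(2 * np.pi * 7 * t)
          + 0.1 * np.sin(2 * np.pi * 15 * t))

A = fourier_design(t, n_harmonics=20)
coeffs, *_ = np.linalg.lstsq(A, signal, rcond=None)   # fit the series
recon = A @ coeffs
mse = np.mean((recon - signal) ** 2)
```

Because the toy signal lies exactly in the span of the basis, the fit is essentially exact; real audio has broadband content, which is where frequency-adaptive training such as FaLS becomes relevant.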

AAAI Conference 2025 Conference Paper

Towards Audio-Visual Navigation in Noisy Environments: A Large-Scale Benchmark Dataset and an Architecture Considering Multiple Sound-Sources

  • Zhanbo Shi
  • Lin Zhang
  • Linfei Li
  • Ying Shen

Audio-visual navigation has received considerable attention in recent years. However, the majority of related investigations have focused on single sound-source scenarios. Studies of multiple sound-source scenarios remain underexplored due to two limitations. First, the existing audio-visual navigation dataset contains only a limited number of audio samples, making it difficult to simulate diverse multiple sound-source environments. Second, existing navigation frameworks are mainly designed for single sound-source scenarios, so their performance is severely reduced in multiple sound-source scenarios. In this work, we attempt to fill these two research gaps. First, we establish a large-scale BEnchmark Dataset for Audio-Visual Navigation, namely BeDAViN. This dataset consists of 2,258 audio samples with a total duration of 10.8 hours, which is more than 33 times longer than the existing audio dataset employed in the audio-visual navigation task. Second, we propose a new Embodied Navigation framework for MUltiple Sound-Sources Scenarios called ENMuS3. There are two essential components in ENMuS3: the sound event descriptor and the multi-scale scene memory transformer. The former equips the agent with the ability to extract spatial and semantic features of the target sound-source among multiple sound-sources, while the latter provides the ability to track the target object effectively in noisy environments. Experimental results on our BeDAViN show that ENMuS3 strongly outperforms its counterparts with a significant improvement in success rates across diverse scenarios.

AAAI Conference 2025 Conference Paper

Zero-Shot Image Captioning with Multi-type Entity Representations

  • Delong Zeng
  • Ying Shen
  • Man Lin
  • Zihao Yi
  • Jiarui Ouyang

As data and computational resources continue to expand, incorporating a variety of knowledge during the pre-training phase enhances large models, providing them with strong zero-shot capabilities. Due to the alignment of modal features by visual language models, zero-shot image captioning no longer necessitates pre-training on paired image-text labeled data, enabling accurate text description generation for images not encountered before. While recent research focuses on methods utilizing entity retrieval as anchors to bridge the gap between different modalities, these approaches often fall short of thoroughly analyzing the impact of entity retrieval recall on the zero-shot generation capabilities. To address this issue, we propose MERCap, a zero-shot image captioning method employing Multi-type Entity representation Retrieval. More specifically, we first approximate image representation using the CLIP representation of text and Gaussian noise to address the modality gap. Then, we train a GPT-2 decoder to reconstruct text using entities as hard prompts and CLIP representations as soft prompts. Additionally, we construct a domain-specific entity set, assigning multiple representations to each entity and refining their representation vectors through contrastive learning. During inference, we retrieve entities and input them into the decoder to generate corresponding captions. Extensive experiments validate that our approach is efficient, achieving a new state-of-the-art level in cross-domain captioning and demonstrating strong competitiveness in in-domain captioning compared to existing methods.

AIIM Journal 2024 Journal Article

FIT-graph: A multi-grained evolutionary graph based framework for disease diagnosis

  • Zizhu Liu
  • Qing Cao
  • Nan Du
  • Huizhen Shu
  • Erheng Zhong
  • Nan Jiang
  • Qiaoran Chen
  • Ying Shen

Early assessment, with the help of machine learning methods, can aid clinicians in optimizing the diagnosis and treatment process, allowing patients to receive treatment within the critical time window. Due to the advantages of effective information organization and interpretable reasoning, knowledge graph-based methods have become one of the most widely used machine learning approaches for this task. However, due to a lack of effective organization and use of multi-granularity and temporal information, current knowledge graph-based approaches struggle to fully exploit the information contained in medical records, restricting their capacity to make high-quality diagnoses. To address these challenges, we study disease diagnosis applications in depth and propose a novel disease diagnosis framework named FIT-Graph. With novel medical multi-grained evolutionary graphs, FIT-Graph efficiently organizes the information extracted from various granularities and time stages, maximizing the retention of valuable information for disease inference and ensuring the comprehensiveness and validity of the final disease inference. We evaluate FIT-Graph against baselines on two real-world clinical datasets from cardiology and respiratory departments. The experimental results show that FIT-Graph outperforms the baseline model, improving performance by about 5% on multiple metrics.

NeurIPS Conference 2024 Conference Paper

Kaleido Diffusion: Improving Conditional Diffusion Models with Autoregressive Latent Modeling

  • Jiatao Gu
  • Ying Shen
  • Shuangfei Zhai
  • Yizhe Zhang
  • Navdeep Jaitly
  • Josh Susskind

Diffusion models have emerged as a powerful tool for generating high-quality images from textual descriptions. Despite their successes, these models often exhibit limited diversity in the sampled images, particularly when sampling with a high classifier-free guidance weight. To address this issue, we present Kaleido, a novel approach that enhances the diversity of samples by incorporating autoregressive latent priors. Kaleido integrates an autoregressive language model that encodes the original caption and generates latent variables, serving as abstract and intermediary representations for guiding and facilitating the image generation process. In this paper, we explore a variety of discrete latent representations, including textual descriptions, detection bounding boxes, object blobs, and visual tokens. These representations diversify and enrich the input conditions to the diffusion models, enabling more diverse outputs. Our experimental results demonstrate that Kaleido effectively broadens the diversity of the generated image samples from a given textual description while maintaining high image quality. Furthermore, we show that Kaleido adheres closely to the guidance provided by the generated latent variables, demonstrating its capability to effectively control and direct the image generation process.
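The classifier-free guidance weight referred to above combines conditional and unconditional score predictions by extrapolation; larger weights sharpen adherence to the condition at the cost of sample diversity, the trade-off Kaleido targets. A minimal sketch of the standard update, not Kaleido's autoregressive latent-prior machinery:

```python
import numpy as np

def cfg_combine(eps_uncond, eps_cond, w):
    """Classifier-free guidance: extrapolate from the unconditional score
    toward the conditional one. w=0 is unconditional, w=1 conditional,
    and w>1 over-emphasizes the condition (sharper but less diverse)."""
    return eps_uncond + w * (eps_cond - eps_uncond)

# Toy 2-D score predictions to show the extrapolation geometry.
eps_u = np.array([0.0, 0.0])
eps_c = np.array([1.0, 0.0])
guided = {w: cfg_combine(eps_u, eps_c, w) for w in (0.0, 1.0, 2.0, 7.5)}
```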

AAAI Conference 2024 Conference Paper

MULTISCRIPT: Multimodal Script Learning for Supporting Open Domain Everyday Tasks

  • Jingyuan Qi
  • Minqian Liu
  • Ying Shen
  • Zhiyang Xu
  • Lifu Huang

Automatically generating scripts (i.e., sequences of key steps described in text) from video demonstrations and reasoning about the subsequent steps are crucial for modern AI virtual assistants that guide humans through everyday tasks, especially unfamiliar ones. However, current methods for generative script learning rely heavily on well-structured preceding steps described in text and/or images or are limited to a certain domain, resulting in a disparity with real-world user scenarios. To address these limitations, we present a new benchmark challenge, MULTISCRIPT, with two new tasks on task-oriented multimodal script learning: (1) multimodal script generation, and (2) subsequent step prediction. For both tasks, the input consists of a target task name and a video illustrating what has been done to complete the target task, and the expected output is (1) a sequence of structured step descriptions in text based on the demonstration video, and (2) a single text description for the subsequent step, respectively. Built from WikiHow, MULTISCRIPT covers multimodal scripts in videos and text descriptions for over 6,655 human everyday tasks across 19 diverse domains. To establish baseline performance on MULTISCRIPT, we propose two knowledge-guided multimodal generative frameworks that incorporate task-related knowledge prompted from large language models such as Vicuna. Experimental results show that our proposed approaches significantly improve over the competitive baselines.

AAAI Conference 2021 Conference Paper

Imagine, Reason and Write: Visual Storytelling with Graph Knowledge and Relational Reasoning

  • Chunpu Xu
  • Min Yang
  • Chengming Li
  • Ying Shen
  • Xiang Ao
  • Ruifeng Xu

Visual storytelling is the task of generating a short story to describe an ordered image stream. Different from visual captions, stories contain not only factual descriptions, but also imaginary concepts that do not appear in the images. In this paper, we propose a novel imagine-reason-write generation framework (IRW) for visual storytelling, inspired by the logic of humans when they write a story. First, a multimodal imagining module is leveraged to learn the imaginative storyline explicitly, improving the coherence and reasonability of the generated story. Second, we employ a relational reasoning module to fully exploit the external knowledge (commonsense knowledge base) and task-specific knowledge (scene graph and event graph) with a relational reasoning method based on the storyline. In this way, we can effectively capture the most informative commonsense and visual relationships among objects in images, enhancing the diversity and informativeness of the generated story. Finally, we integrate the visual information and semantic (concept) information to generate human-like stories. Extensive experiments on a benchmark dataset (i.e., VIST) demonstrate that the proposed IRW framework substantially outperforms the state-of-the-art methods across multiple evaluation metrics.

AAAI Conference 2021 Conference Paper

Learning to Augment for Data-scarce Domain BERT Knowledge Distillation

  • Lingyun Feng
  • Minghui Qiu
  • Yaliang Li
  • Hai-Tao Zheng
  • Ying Shen

Although pre-trained language models such as BERT have achieved appealing performance in a wide range of natural language processing tasks, they are computationally expensive to deploy in real-time applications. A typical method is to adopt knowledge distillation to compress these large pre-trained models (teacher models) into small student models. However, for a target domain with scarce training data, the teacher can hardly pass useful knowledge to the student, which yields performance degradation for the student models. To tackle this problem, we propose a method to learn to augment for data-scarce domain BERT knowledge distillation, by learning a cross-domain manipulation scheme that automatically augments the target domain with the help of resource-rich source domains. Specifically, the proposed method generates samples acquired from a stationary distribution near the target data and adopts a reinforced selector to automatically refine the augmentation strategy according to the performance of the student. Extensive experiments demonstrate that the proposed method significantly outperforms state-of-the-art baselines on four different tasks, and for the data-scarce domains, the compressed student models even perform better than the original large teacher model, with much fewer parameters (only ∼13.3%), when only a few labeled examples are available.
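The teacher-to-student transfer being augmented here is typically trained with the standard soft-target distillation objective: a temperature-scaled KL term against the teacher plus a hard-label cross-entropy term. A generic numpy sketch of that objective (not the paper's augmentation or selector components):

```python
import numpy as np

def softmax(z, T=1.0):
    """Temperature-scaled softmax, numerically stabilized."""
    z = np.asarray(z, dtype=float) / T
    z = z - z.max(axis=-1, keepdims=True)
    e = np.exp(z)
    return e / e.sum(axis=-1, keepdims=True)

def distillation_loss(student_logits, teacher_logits, labels, T=2.0, alpha=0.5):
    """Standard KD objective:
    alpha * CE(student, hard labels) + (1-alpha) * T^2 * KL(teacher_T || student_T).
    The T^2 factor keeps soft-target gradients on the same scale as CE."""
    p_t = softmax(teacher_logits, T)
    p_s = softmax(student_logits, T)
    kl = np.sum(p_t * (np.log(p_t) - np.log(p_s)), axis=-1)
    ce = -np.log(softmax(student_logits)[np.arange(len(labels)), labels])
    return np.mean(alpha * ce + (1 - alpha) * T ** 2 * kl)

# Demo: a student that matches its teacher incurs almost no loss;
# a student that contradicts the teacher is penalized heavily.
s = np.array([[5.0, 0.0]])
t = np.array([[0.0, 5.0]])
labels = np.array([0])
loss_match = distillation_loss(s, s, labels)
loss_mismatch = distillation_loss(s, t, labels)
```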

AIIM Journal 2021 Journal Article

Practical fine-grained learning based anomaly classification for ECG image

  • Qing Cao
  • Nan Du
  • Li Yu
  • Ming Zuo
  • Jingsheng Lin
  • Nathan Liu
  • Erheng Zhong
  • Zizhu Liu

As a widely used vital sign within cardiology, electrocardiography (ECG) provides the basis for assessing heart function and diagnosing cardiovascular diseases. Automated anomaly detection for ECG plays an important role in improving patient diagnosis efficiency and reducing healthcare costs. In practice, due to the limits of electronics support or the medical system setting, images are a more common format for large-scale ECG storage in most clinical institutions. To guarantee an automated ECG detection model's scalability and practicality in clinical applications, taking good advantage of ECG images is crucial. However, existing discriminative models built on digital time-series signals fail to learn from images effectively for two reasons. First, the signals recorded on images have much lower resolution and higher noise, which makes it impractical to extract precise ECG signals following existing techniques. Meanwhile, the differences between abnormal signals are usually subtle, and they may be overwhelmed by the noise in the images as well. To this end, we design a novel neural framework that can be directly applied to massive ECG images to determine various types of cardiac abnormalities. It classifies fine-grained ECG images using a weakly supervised strategy that requires only image-level labels. By eliminating the need for part annotations, the proposed method can result in significant savings in annotation time and cost. The effectiveness of the method is demonstrated by experimental results on two real ECG datasets.

AAAI Conference 2020 Conference Paper

Attentive User-Engaged Adversarial Neural Network for Community Question Answering

  • Yuexiang Xie
  • Ying Shen
  • Yaliang Li
  • Min Yang
  • Kai Lei

We study the community question answering (CQA) problem that emerges with the advent of numerous community forums in the recent past. The task of finding appropriate answers to questions from informative but noisy crowdsourced answers is important yet challenging in practice. We present an Attentive User-engaged Adversarial Neural Network (AUANN), which interactively learns the context information of questions and answers, and enhances user engagement with the CQA task. A novel attentive mechanism is incorporated to model the semantic internal and external relations among questions, answers and user contexts. To handle the noise issue caused by introducing user context, we design a two-step denoise mechanism, including a coarse-grained selection process by similarity measurement, and a fine-grained selection process by applying an adversarial training module. We evaluate the proposed method on large-scale real-world datasets SemEval-2016 and SemEval-2017. Experimental results verify the benefits of incorporating user information, and show that our proposed model significantly outperforms the state-of-the-art methods.
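The coarse-grained selection step described above, filtering user context by similarity to the question, can be sketched as a cosine-similarity threshold over embedding vectors. The embeddings and threshold below are hypothetical; in AUANN they would come from the learned encoders:

```python
import numpy as np

def coarse_select(query_vec, context_vecs, threshold=0.5):
    """Coarse-grained denoising: keep only the user-context vectors whose
    cosine similarity to the question embedding exceeds a threshold."""
    q = query_vec / np.linalg.norm(query_vec)
    c = context_vecs / np.linalg.norm(context_vecs, axis=1, keepdims=True)
    return np.where(c @ q > threshold)[0]

# Hypothetical 2-D embeddings: contexts 0 and 2 align with the question,
# context 1 is off-topic noise that should be filtered out.
query = np.array([1.0, 0.0])
contexts = np.array([[0.9, 0.1],
                     [0.0, 1.0],
                     [0.8, 0.3]])
kept = coarse_select(query, contexts)
```

The fine-grained adversarial stage would then operate only on the surviving contexts.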

AAAI Conference 2020 Conference Paper

Be Relevant, Non-Redundant, and Timely: Deep Reinforcement Learning for Real-Time Event Summarization

  • Min Yang
  • Chengming Li
  • Fei Sun
  • Zhou Zhao
  • Ying Shen
  • Chenglin Wu

Real-time event summarization is an essential task in natural language processing and information retrieval areas. Despite the progress of previous work, generating relevant, non-redundant, and timely event summaries remains challenging in practice. In this paper, we propose a Deep Reinforcement learning framework for real-time Event Summarization (DRES), which shows promising performance for resolving all three challenges (i.e., relevance, non-redundancy, timeliness) in a unified framework. Specifically, we (i) devise a hierarchical cross-attention network with intra- and inter-document attentions to integrate important semantic features within and between the query and input document for better text matching. In addition, relevance prediction is leveraged as an auxiliary task to strengthen the document modeling and help to extract relevant documents; (ii) propose a multi-topic dynamic memory network to capture the sequential patterns of different topics belonging to the event of interest and temporally memorize the input facts from the evolving document stream, avoiding extracting redundant information at each time step; (iii) consider both historical dependencies and future uncertainty of the document stream for generating relevant and timely summaries by exploiting the reinforcement learning technique. Experimental results on two real-world datasets have demonstrated the advantages of the DRES model with significant improvement in generating relevant, non-redundant, and timely event summaries over state-of-the-art methods.

IJCAI Conference 2020 Conference Paper

Infobox-to-text Generation with Tree-like Planning based Attention Network

  • Yang Bai
  • Ziran Li
  • Ning Ding
  • Ying Shen
  • Hai-Tao Zheng

We study the problem of infobox-to-text generation that aims to generate a textual description from a key-value table. Representing the input infobox as a sequence, previous neural methods using end-to-end models without order-planning suffer from the problems of incoherence and inadaptability to disordered input. Recent planning-based models only implement static order-planning to guide the generation, which may cause error propagation between planning and generation. To address these issues, we propose a Tree-like PLanning based Attention Network (Tree-PLAN) which leverages both static order-planning and dynamic tuning to guide the generation. A novel tree-like tuning encoder is designed to dynamically tune the static order-plan for better planning by merging the most relevant attributes together layer by layer. Experiments conducted on two datasets show that our model outperforms previous methods on both automatic and human evaluation, and demonstrate that our model has better adaptability to disordered input.

AAAI Conference 2020 Conference Paper

Integrating Linguistic Knowledge to Sentence Paraphrase Generation

  • Zibo Lin
  • Ziran Li
  • Ning Ding
  • Hai-Tao Zheng
  • Ying Shen
  • Wei Wang
  • Cong-Zhi Zhao

Paraphrase generation aims to rewrite a text with different words while keeping the same meaning. Previous work performs the task based solely on the given dataset while ignoring the availability of external linguistic knowledge. However, it is intuitive that a model can generate more expressive and diverse paraphrases with the help of such knowledge. To fill this gap, we propose the Knowledge-Enhanced Paraphrase Network (KEPN), a transformer-based framework that leverages external linguistic knowledge to facilitate paraphrase generation. (1) The model integrates synonym information from the external linguistic knowledge into the paraphrase generator, which is used to guide the decision on whether to generate a new word or replace it with a synonym. (2) To locate the synonym pairs more accurately, we adopt an incremental encoding scheme to incorporate position information of each synonym. Besides, a multi-task architecture is designed to help the framework jointly learn the selection of synonym pairs and the generation of expressive paraphrases. Experimental results on both English and Chinese datasets show that our method significantly outperforms state-of-the-art approaches in terms of both automatic and human evaluation.

AAAI Conference 2020 Conference Paper

Interactive Dual Generative Adversarial Networks for Image Captioning

  • Junhao Liu
  • Kai Wang
  • Chunpu Xu
  • Zhou Zhao
  • Ruifeng Xu
  • Ying Shen
  • Min Yang

Image captioning is usually built on either generation-based or retrieval-based approaches. Both ways have certain strengths but suffer from their own limitations. In this paper, we propose an Interactive Dual Generative Adversarial Network (IDGAN) for image captioning, which mutually combines the retrieval-based and generation-based methods to learn a better image captioning ensemble. IDGAN consists of two generators and two discriminators, where the generation- and retrieval-based generators mutually benefit from each other's complementary targets that are learned from two dual adversarial discriminators. Specifically, the generation- and retrieval-based generators provide improved synthetic and retrieved candidate captions with informative feedback signals from the two respective discriminators, which are trained to distinguish the generated captions from the true captions and to assign top rankings to true captions, respectively, thus featuring the merits of both retrieval-based and generation-based approaches. Extensive experiments on the MSCOCO dataset demonstrate that the proposed IDGAN model significantly outperforms the compared methods for image captioning.

AAAI Conference 2020 Conference Paper

Joint Learning of Answer Selection and Answer Summary Generation in Community Question Answering

  • Yang Deng
  • Wai Lam
  • Yuexiang Xie
  • Daoyuan Chen
  • Yaliang Li
  • Min Yang
  • Ying Shen

Community question answering (CQA) has gained increasing popularity in both academia and industry in recent years. However, the redundancy and lengthiness of crowdsourced answers limit the performance of answer selection and lead to reading difficulties and misunderstandings for community users. To solve these problems, we tackle the tasks of answer selection and answer summary generation in CQA with a novel joint learning model. Specifically, we design a question-driven pointer-generator network, which exploits the correlation information between question-answer pairs to aid in attending to the essential information when generating answer summaries. Meanwhile, we leverage the answer summaries to alleviate noise in the original lengthy answers when ranking the relevancy degrees of question-answer pairs. In addition, we construct a new large-scale CQA corpus, WikiHowQA, which contains long answers for answer selection as well as reference summaries for answer summarization. The experimental results show that the joint learning method can effectively address the answer redundancy issue in CQA and achieves state-of-the-art results on both answer selection and text summarization tasks. Furthermore, the proposed model shows strong transferability and applicability for resource-poor CQA tasks that lack reference answer summaries.
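A pointer-generator network like the one this abstract builds on mixes a generation distribution over a fixed vocabulary with a copy distribution over source tokens. Below is a rough, generic sketch of that mixing step (the function name, toy uniform distribution, and weights are ours, not the paper's):

```python
import numpy as np

def pointer_generator_dist(p_gen, vocab_dist, attn_weights, src_ids):
    """Mix a generation distribution with a copy distribution.

    p_gen: scalar in [0, 1], probability of generating from the vocabulary.
    vocab_dist: (vocab_size,) softmax over the fixed vocabulary.
    attn_weights: (src_len,) attention over source tokens.
    src_ids: (src_len,) vocabulary ids of the source tokens.
    """
    final = p_gen * vocab_dist
    # Scatter-add copy probabilities onto the source token ids
    # (np.add.at correctly accumulates duplicate ids).
    np.add.at(final, src_ids, (1.0 - p_gen) * attn_weights)
    return final

vocab_dist = np.full(10, 0.1)         # uniform toy generation distribution
attn = np.array([0.5, 0.3, 0.2])      # attention over 3 source tokens
src = np.array([2, 2, 7])             # token id 2 appears twice in the source
dist = pointer_generator_dist(0.6, vocab_dist, attn, src)
assert abs(dist.sum() - 1.0) < 1e-9   # result is still a valid distribution
```

Tokens that appear in the source (here id 2) end up with probability mass from both the generator and the copier, which is what lets such models reproduce rare source words.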

IJCAI Conference 2020 Conference Paper

Triple-to-Text Generation with an Anchor-to-Prototype Framework

  • Ziran Li
  • Zibo Lin
  • Ning Ding
  • Hai-Tao Zheng
  • Ying Shen

Generating a textual description from a set of RDF triples is a challenging task in natural language generation. Recent neural methods, which often generate sentences from scratch, have become the mainstream for this task. However, due to the huge gap between the structured input and the unstructured output, the input triples alone are insufficient to decide an expressive and specific description. In this paper, we propose a novel anchor-to-prototype framework to bridge the gap between structured RDF triples and natural text. The model retrieves a set of prototype descriptions from the training data and extracts writing patterns from them to guide the generation process. Furthermore, to make a more precise use of the retrieved prototypes, we employ a triple anchor that aligns the input triples into groups so as to better match the prototypes. Experimental results on both English and Chinese datasets show that our method significantly outperforms the state-of-the-art baselines in terms of both automatic and manual evaluation, demonstrating the benefit of learning guidance from retrieved prototypes to facilitate triple-to-text generation.

AAAI Conference 2019 Short Paper

A Multi-Task Learning Approach for Answer Selection: A Study and a Chinese Law Dataset

  • Wenyu Du
  • Baocheng Li
  • Min Yang
  • Qiang Qu
  • Ying Shen

In this paper, we propose a Multi-Task learning approach for Answer Selection (MTAS), motivated by the fact that humans have no difficulty performing such a task because they possess capabilities spanning multiple domains (tasks). Specifically, MTAS consists of two key components: (i) a category classification model that learns rich category-aware document representations; (ii) an answer selection model that provides the matching scores of question-answer pairs. These two tasks work on a shared document encoding layer, and they cooperate to learn a high-quality answer selection system. In addition, a multi-head attention mechanism is proposed to learn important information from different representation subspaces at different positions. We manually annotate the first Chinese question answering dataset in the law domain (denoted as LawQA) to evaluate the effectiveness of our model. The experimental results show that our model MTAS consistently outperforms the compared methods.
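The multi-head attention mentioned in the abstract is the standard scaled dot-product formulation: project the input, split the projection into heads that attend in separate subspaces, then concatenate and re-project. A minimal NumPy sketch (dimensions and random weights are illustrative, not the paper's):

```python
import numpy as np

def softmax(x, axis=-1):
    e = np.exp(x - x.max(axis=axis, keepdims=True))
    return e / e.sum(axis=axis, keepdims=True)

def multi_head_attention(X, W_q, W_k, W_v, W_o, n_heads):
    """Scaled dot-product self-attention over n_heads subspaces."""
    seq_len, d_model = X.shape
    d_head = d_model // n_heads
    Q, K, V = X @ W_q, X @ W_k, X @ W_v
    # Split each projection into heads: (n_heads, seq_len, d_head).
    split = lambda M: M.reshape(seq_len, n_heads, d_head).transpose(1, 0, 2)
    Qh, Kh, Vh = split(Q), split(K), split(V)
    scores = Qh @ Kh.transpose(0, 2, 1) / np.sqrt(d_head)
    heads = softmax(scores) @ Vh                  # (n_heads, seq_len, d_head)
    concat = heads.transpose(1, 0, 2).reshape(seq_len, d_model)
    return concat @ W_o

rng = np.random.default_rng(0)
d_model, n_heads, seq_len = 8, 2, 5
X = rng.normal(size=(seq_len, d_model))
Ws = [rng.normal(size=(d_model, d_model)) for _ in range(4)]
out = multi_head_attention(X, *Ws, n_heads=n_heads)
assert out.shape == (seq_len, d_model)
```

Each head sees only a d_model/n_heads slice of the projection, which is what lets the layer attend to "different representation subspaces at different positions."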

AAAI Conference 2019 Conference Paper

Exploring Human-Like Reading Strategy for Abstractive Text Summarization

  • Min Yang
  • Qiang Qu
  • Wenting Tu
  • Ying Shen
  • Zhou Zhao
  • Xiaojun Chen

Recent artificial intelligence studies have witnessed great interest in abstractive text summarization. Although remarkable progress has been made by deep neural network based methods, generating plausible and high-quality abstractive summaries remains a challenging task. The human-like reading strategy is rarely explored in abstractive text summarization, yet it is able to improve the effectiveness of summarization by considering the process of reading comprehension and logical thinking. Motivated by the human-like reading strategy that follows a hierarchical routine, we propose a novel Hybrid learning model for Abstractive Text Summarization (HATS). The model consists of three major components, a knowledge-based attention network, a multi-task encoder-decoder network, and a generative adversarial network, which are consistent with the different stages of the human-like reading strategy. To verify the effectiveness of HATS, we conduct extensive experiments on two real-life datasets, the CNN/Daily Mail and Gigaword datasets. The experimental results demonstrate that HATS achieves impressive results on both datasets.

IJCAI Conference 2019 Conference Paper

Knowledge-enhanced Hierarchical Attention for Community Question Answering with Multi-task and Adaptive Learning

  • Min Yang
  • Lei Chen
  • Xiaojun Chen
  • Qingyao Wu
  • Wei Zhou
  • Ying Shen

In this paper, we propose Knowledge-enhanced Hierarchical Attention for community question answering with Multi-task learning and Adaptive learning (KHAMA). First, we propose a hierarchical attention network to fully fuse knowledge from input documents and a knowledge base (KB) by exploiting the semantic compositionality of the input sequences. The external factual knowledge helps recognize background knowledge (entity mentions and their relationships) and eliminate noisy information from long documents that have sophisticated syntactic and semantic structures. In addition, we build multiple CQA models with adaptive boosting and then combine these models to learn a more effective and robust CQA system. Furthermore, KHAMA is a multi-task learning model. It regards CQA as the primary task and question categorization as the auxiliary task, aiming at learning a category-aware document encoder and enhancing the quality of identifying essential information from long questions. Extensive experiments on two benchmarks demonstrate that KHAMA achieves substantial improvements over the compared methods.

AAAI Conference 2019 Short Paper

Learning Document Embeddings with Crossword Prediction

  • Junyu Luo
  • Min Yang
  • Ying Shen
  • Qiang Qu
  • Haixia Chai

In this paper, we propose a Document Embedding Network (DEN) to learn document embeddings in an unsupervised manner. Our model uses the encoder-decoder architecture as its backbone, which tries to reconstruct the input document from an encoded document embedding. Unlike the standard decoder for text reconstruction, we randomly block some words in the input document, and use the incomplete context information and the encoded document embedding to predict the blocked words in the document, inspired by the crossword game. Thus, our decoder can keep the balance between the known and unknown information, and consider both global and partial information when decoding the missing words. We evaluate the learned document embeddings on two tasks: document classification and document retrieval. The experimental results show that our model substantially outperforms the compared methods.
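The crossword-style corruption step described above can be illustrated on its own: randomly blank out words and keep their positions as prediction targets for the decoder. A toy sketch (the mask token, blocking ratio, and function name are our assumptions, not details from the paper):

```python
import random

def block_words(tokens, block_ratio=0.15, mask="<BLANK>", seed=0):
    """Randomly blank out words; a decoder would predict them from the
    remaining context plus the encoded document embedding."""
    rng = random.Random(seed)
    n_block = max(1, int(len(tokens) * block_ratio))
    blocked = set(rng.sample(range(len(tokens)), n_block))
    corrupted = [mask if i in blocked else t for i, t in enumerate(tokens)]
    targets = [(i, tokens[i]) for i in sorted(blocked)]  # (position, word) pairs
    return corrupted, targets

tokens = "the quick brown fox jumps over the lazy dog".split()
corrupted, targets = block_words(tokens)
assert corrupted.count("<BLANK>") == len(targets)
```

Because the decoder sees most of the context but not the blanked positions, it must rely on the document embedding to fill the gaps, which is what pushes useful global information into that embedding.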

AAAI Conference 2019 Conference Paper

Multi-Task Learning with Multi-View Attention for Answer Selection and Knowledge Base Question Answering

  • Yang Deng
  • Yuexiang Xie
  • Yaliang Li
  • Min Yang
  • Nan Du
  • Wei Fan
  • Kai Lei
  • Ying Shen

Answer selection and knowledge base question answering (KBQA) are two important tasks of question answering (QA) systems. Existing methods solve these two tasks separately, which requires a large amount of repetitive work and neglects the rich correlation information between tasks. In this paper, we tackle answer selection and KBQA simultaneously via multi-task learning (MTL), motivated by the following observations. First, both answer selection and KBQA can be regarded as ranking problems, one at the text level and the other at the knowledge level. Second, these two tasks can benefit each other: answer selection can incorporate external knowledge from a knowledge base (KB), while KBQA can be improved by learning contextual information from answer selection. To fulfill the goal of jointly learning these two tasks, we propose a novel multi-task learning scheme that utilizes multi-view attention learned from various perspectives to enable these tasks to interact with each other as well as learn more comprehensive sentence representations. Experiments conducted on several real-world datasets demonstrate the effectiveness of the proposed method, with improved performance on both answer selection and KBQA. In addition, the multi-view attention scheme proves effective in assembling attentive information from different representational perspectives.

AAAI Conference 2019 Conference Paper

Words Can Shift: Dynamically Adjusting Word Representations Using Nonverbal Behaviors

  • Yansen Wang
  • Ying Shen
  • Zhun Liu
  • Paul Pu Liang
  • Amir Zadeh
  • Louis-Philippe Morency

Humans convey their intentions through the usage of both verbal and nonverbal behaviors during face-to-face communication. Speaker intentions often vary dynamically depending on different nonverbal contexts, such as vocal patterns and facial expressions. As a result, when modeling human language, it is essential to not only consider the literal meaning of the words but also the nonverbal contexts in which these words appear. To better model human language, we first model expressive nonverbal representations by analyzing the fine-grained visual and acoustic patterns that occur during word segments. In addition, we seek to capture the dynamic nature of nonverbal intents by shifting word representations based on the accompanying nonverbal behaviors. To this end, we propose the Recurrent Attended Variation Embedding Network (RAVEN) that models the fine-grained structure of nonverbal subword sequences and dynamically shifts word representations based on nonverbal cues. Our proposed model achieves competitive performance on two publicly available datasets for multimodal sentiment analysis and emotion recognition. We also visualize the shifted word representations in different nonverbal contexts and summarize common patterns regarding multimodal variations of word representations.
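RAVEN's central idea, shifting a word representation by an attention-weighted summary of the accompanying nonverbal cues, can be sketched roughly as follows (the attention parameterization, mixing coefficient, and all names here are illustrative stand-ins for the paper's learned components):

```python
import numpy as np

def shift_word_embedding(word_vec, nonverbal_vecs, W_attn, alpha=0.5):
    """Shift a word embedding by an attention-weighted summary of the
    co-occurring nonverbal (visual/acoustic) subword features."""
    scores = nonverbal_vecs @ (W_attn @ word_vec)   # relevance of each cue
    weights = np.exp(scores - scores.max())
    weights /= weights.sum()                        # softmax over cues
    shift = weights @ nonverbal_vecs                # (d,) weighted cue summary
    return word_vec + alpha * shift                 # shifted representation

rng = np.random.default_rng(1)
d = 4
word = rng.normal(size=d)                           # base word embedding
cues = rng.normal(size=(3, d))                      # 3 nonverbal observations
W = rng.normal(size=(d, d))                         # toy attention parameters
shifted = shift_word_embedding(word, cues, W)
assert shifted.shape == (d,)
```

The same word thus gets a different representation depending on which nonverbal cues accompany it, which is the "words can shift" behavior the abstract describes.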

AIIM Journal 2018 Journal Article

An ontology-driven clinical decision support system (IDDAP) for infectious disease diagnosis and antibiotic prescription

  • Ying Shen
  • Kaiqi Yuan
  • Daoyuan Chen
  • Joël Colloc
  • Min Yang
  • Yaliang Li
  • Kai Lei

Background The available antibiotic decision-making systems were developed from a physician's perspective. However, because infectious diseases are common, many patients desire access to knowledge via a search engine. Although the use of antibiotics should, in principle, be subject to a doctor's advice, many patients take them without authorization, and some people cannot easily or rapidly consult a doctor. In such cases, a reliable antibiotic prescription support system is needed. Methods and results This study describes the construction and optimization of the sensitivity and specificity of a decision support system named IDDAP, which is based on ontologies for infectious disease diagnosis and antibiotic therapy. The ontology for this system was constructed by collecting existing ontologies associated with infectious diseases, syndromes, bacteria and drugs into the ontology's hierarchical conceptual schema. First, IDDAP identifies a potential infectious disease based on a patient's self-described disease state. Then, the system searches for and proposes an appropriate antibiotic therapy specifically adapted to the patient based on factors such as the patient's body temperature, infection sites, symptoms/signs, complications, antibacterial spectrum, contraindications, drug–drug interactions between the proposed therapy and previously prescribed medication, and the route of therapy administration. The constructed domain ontology contains 1,267,004 classes, 7,608,725 axioms, and 1,266,993 "SubClassOf" assertions that pertain to infectious diseases, bacteria, syndromes, anti-bacterial drugs and other relevant components.
The system includes 507 infectious diseases and their therapy methods in combination with 332 different infection sites, 936 relevant symptoms of the digestive, reproductive, neurological and other systems, 371 types of complications, 838,407 types of bacteria, 341 types of antibiotics, 1,504 pairs of reaction rates (antibacterial spectrum) between antibiotics and bacteria, 431 pairs of drug interaction relationships and 86 pairs of antibiotic-specific population contraindication relationships. Compared with the existing infectious disease-relevant ontologies in the field of knowledge comprehension, this ontology is more complete. Analysis of IDDAP's classification performance based on receiver operating characteristic (ROC) curve results (89.91%) revealed IDDAP's advantages when combined with our ontology. Conclusions and significance This study attempted to bridge the patient/caregiver gap by building a sophisticated application that uses artificial intelligence and machine learning computational techniques to perform data-driven decision-making at the point of primary care. The first level of decision-making is conducted by the IDDAP and provides the patient with a first-line therapy. Patients can then make a subjective judgment and, if any questions arise, should consult a physician for subsequent decisions, particularly in complicated cases or in cases in which the necessary information is not yet available in the knowledge base.