Arrow Research search

Author name cluster

Zhou Yu

Possible papers associated with this exact author name in Arrow. This page groups case-insensitive exact name matches and is not a full identity disambiguation profile.

30 papers
2 author rows

Possible papers

AAAI Conference 2026 Conference Paper

Semantic Volume: Quantifying and Detecting Both External and Internal Uncertainty in LLMs

  • Xiaomin Li
  • Zhou Yu
  • Ziji Zhang
  • Yingying Zhuang
  • Swair Shah
  • Narayanan Sadagopan
  • Anurag Beniwal

Large language models (LLMs) have demonstrated remarkable performance across diverse tasks by encoding vast amounts of factual knowledge. However, they are still prone to hallucinations, generating incorrect or misleading information, often accompanied by high uncertainty. Existing methods for hallucination detection primarily focus on quantifying internal uncertainty, which arises from missing or conflicting knowledge within the model. However, hallucinations can also stem from external uncertainty, where ambiguous user queries lead to multiple possible interpretations. In this work, we introduce Semantic Volume, a novel mathematical measure for quantifying both external and internal uncertainty in LLMs. Our approach perturbs queries and responses, embeds them in a semantic space, and computes the Gram matrix determinant of the embedding vectors, capturing their dispersion as a measure of uncertainty. Our framework provides a generalizable and unsupervised uncertainty detection method without requiring internal access to LLMs. We conduct extensive experiments on both external and internal uncertainty detection, demonstrating that our Semantic Volume method consistently outperforms existing baselines in both tasks. Additionally, we provide theoretical insights linking our measure to differential entropy, unifying and extending previous sampling-based uncertainty measures such as semantic entropy. Semantic Volume is shown to be a robust and interpretable approach to improving the reliability of LLMs by systematically detecting uncertainty in both user queries and model responses.
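
The Gram-determinant idea in the abstract above can be illustrated with a minimal sketch. This is not the authors' implementation: the function names, the `eps` ridge term added to keep the determinant positive, and the use of a log-determinant are all assumptions made for illustration, and the embeddings are hand-written toy vectors rather than outputs of a real embedding model.

```python
import math


def gram_matrix(vectors):
    """Return G with G[i][j] = <v_i, v_j> for a list of equal-length vectors."""
    return [[sum(a * b for a, b in zip(u, v)) for v in vectors] for u in vectors]


def log_det_spd(matrix):
    """Log-determinant of a symmetric positive-definite matrix via Cholesky."""
    n = len(matrix)
    chol = [[0.0] * n for _ in range(n)]
    log_det = 0.0
    for i in range(n):
        for j in range(i + 1):
            s = matrix[i][j] - sum(chol[i][k] * chol[j][k] for k in range(j))
            if i == j:
                if s <= 0.0:
                    raise ValueError("matrix is not positive definite")
                chol[i][i] = math.sqrt(s)
                log_det += 2.0 * math.log(chol[i][i])
            else:
                chol[i][j] = s / chol[j][j]
    return log_det


def dispersion_score(vectors, eps=1e-3):
    """Hypothetical score: log det(G + eps * I). Larger means more spread out.

    The eps ridge is an assumption; it keeps the determinant strictly
    positive when the vectors are (nearly) linearly dependent.
    """
    g = gram_matrix(vectors)
    for i in range(len(g)):
        g[i][i] += eps
    return log_det_spd(g)


# Toy embeddings: three near-identical vectors versus three orthogonal ones.
tight = [[1.0, 0.0, 0.0], [1.0, 0.01, 0.0], [1.0, 0.0, 0.01]]
spread = [[1.0, 0.0, 0.0], [0.0, 1.0, 0.0], [0.0, 0.0, 1.0]]
print(dispersion_score(tight), dispersion_score(spread))
```

In this toy setting the orthogonal set scores higher than the near-duplicate set, matching the intuition that a larger volume spanned by the embeddings signals more dispersed meanings.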

AAAI Conference 2026 Conference Paper

Sparse4DGS: 4D Gaussian Splatting for Sparse-Frame Dynamic Scene Reconstruction

  • Changyue Shi
  • Chuxiao Yang
  • Xinyuan Hu
  • Minghao Chen
  • Wenwen Pan
  • Yan Yang
  • Jiajun Ding
  • Zhou Yu

Dynamic Gaussian Splatting approaches have achieved remarkable performance for 4D scene reconstruction. However, these approaches rely on dense-frame video sequences for photorealistic reconstruction. In real-world scenarios, due to equipment constraints, sometimes only sparse frames are accessible. In this paper, we propose Sparse4DGS, the first method for sparse-frame dynamic scene reconstruction. We observe that dynamic reconstruction methods fail in both canonical and deformed spaces under sparse-frame settings, especially in areas with high texture richness. Sparse4DGS tackles this challenge by focusing on texture-rich areas. For the deformation network, we propose Texture-Aware Deformation Regularization, which introduces a texture-based depth alignment loss to regulate Gaussian deformation. For the canonical Gaussian field, we introduce Texture-Aware Canonical Optimization, which incorporates texture-based noise into the gradient descent process of canonical Gaussians. Extensive experiments show that when taking sparse frames as inputs, our method outperforms existing dynamic or few-shot techniques on NeRF-Synthetic, HyperNeRF, NeRF-DS, and our iPhone-4D datasets.

AAAI Conference 2026 Conference Paper

SRSplat: Feed-Forward Super-Resolution Gaussian Splatting from Sparse Multi-View Images

  • Xinyuan Hu
  • Changyue Shi
  • Chuxiao Yang
  • Minghao Chen
  • Jiajun Ding
  • Tao Wei
  • Chen Wei
  • Zhou Yu

Feed-forward 3D reconstruction from sparse, low-resolution (LR) images is a crucial capability for real-world applications, such as autonomous driving and embodied AI. However, existing methods often fail to recover fine texture details. This limitation stems from the inherent lack of high-frequency information in LR inputs. To address this, we propose SRSplat, a feed-forward framework that reconstructs high-resolution 3D scenes from only a few LR views. Our main insight is to compensate for the deficiency of texture information by jointly leveraging external high-quality reference images and internal texture cues. We first construct a scene-specific reference gallery, generated for each scene using Multimodal Large Language Models (MLLMs) and diffusion models. To integrate this external information, we introduce the Reference-Guided Feature Enhancement (RGFE) module, which aligns and fuses features from the LR input images and their reference twin image. Subsequently, we train a decoder to predict the Gaussian primitives using the multi-view fused feature obtained from RGFE. To further refine predicted Gaussian primitives, we introduce Texture-Aware Density Control (TADC), which adaptively adjusts Gaussian density based on the internal texture richness of the LR inputs. Extensive experiments demonstrate that our SRSplat outperforms existing methods on various datasets, including RealEstate10K, ACID, and DTU, and exhibits strong cross-dataset and cross-resolution generalization capabilities.

ICML Conference 2025 Conference Paper

A General Framework for Inference-time Scaling and Steering of Diffusion Models

  • Raghav Singhal
  • Zachary Horvitz
  • Ryan Teehan
  • Mengye Ren
  • Zhou Yu
  • Kathleen McKeown
  • Rajesh Ranganath

Diffusion models have demonstrated remarkable performance in generative modeling, but generating samples with specific desiderata remains challenging. Existing solutions, such as fine-tuning, best-of-n sampling, and gradient-based guidance, are expensive, inefficient, or limited in applicability. In this work, we propose Feynman-Kac (FK) steering, a framework that applies Feynman-Kac interacting particle systems to the inference-time steering of diffusion models with arbitrary reward functions. FK steering works by generating multiple trajectories, called particles, and resampling particles at intermediate steps based on scores computed using functions called potentials. Potentials are defined using rewards for intermediate states and are chosen such that a high score indicates the particle will yield a high-reward sample. We explore various choices of potentials, rewards, and samplers. Steering text-to-image models with a human preference reward, we find that FK steering outperforms fine-tuned models with just 2 particles. Moreover, FK steering a 0.8B parameter model outperforms a 2.6B model, achieving state-of-the-art performance on prompt fidelity. We also steer text diffusion models with rewards for text quality and rare attributes such as toxicity, and find that FK steering generates lower perplexity text and enables gradient-free control. Overall, inference-time scaling and steering of diffusion models, even training-free, provides significant quality and controllability benefits. Code is available.
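
The particle-resampling loop described in the abstract above can be sketched on a toy problem. This is only an illustration, not the paper's algorithm: the random-walk `step`, the quadratic `reward`, the potential `exp(lam * reward)`, and every function and parameter name are assumptions chosen to keep the example small.

```python
import math
import random


def resample(particles, potentials, rng):
    """Draw len(particles) particles with probability proportional to potentials."""
    return rng.choices(particles, weights=potentials, k=len(particles))


def steer(step, reward, init, num_particles=4, num_steps=10,
          resample_every=2, lam=1.0, seed=0):
    """Toy interacting-particle loop: advance all particles, periodically resample.

    step(x, t, rng) -> next state; reward(x) -> float; init(rng) -> first state.
    The potential exp(lam * reward) is one simple assumed choice.
    """
    rng = random.Random(seed)
    particles = [init(rng) for _ in range(num_particles)]
    for t in range(1, num_steps + 1):
        particles = [step(x, t, rng) for x in particles]
        if t % resample_every == 0:
            rewards = [reward(x) for x in particles]
            best = max(rewards)
            # Subtracting the max only rescales the weights; it avoids underflow.
            potentials = [math.exp(lam * (r - best)) for r in rewards]
            particles = resample(particles, potentials, rng)
    return particles


# Toy usage: a 1-D random walk steered toward the target value 2.0.
final = steer(
    step=lambda x, t, rng: x + rng.gauss(0.0, 1.0),
    reward=lambda x: -(x - 2.0) ** 2,
    init=lambda rng: 0.0,
)
print(final)
```

Particles with low intermediate reward tend to be dropped at each resampling step and replaced by copies of higher-reward ones, which is the gradient-free steering mechanism the abstract describes in words.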

NeurIPS Conference 2025 Conference Paper

An Effective Levelling Paradigm for Unlabeled Scenarios

  • Fangming Cui
  • Zhou Yu
  • Di Yang
  • Yuqiang Ren
  • Liang Xiao
  • Xinmei Tian

Advancements in direct-integration fine-tuning frameworks have underscored their potential to enhance the performance of labeled scenarios and tasks. To enhance the generalization of different categories in the same dataset, some methods have added visual loss to these frameworks for unlabeled scenarios. However, the performance of these methods through visual loss does not improve significantly in domain generalization and cross-dataset generalization tasks. This may be attributed to the uncoordinated learning of the two-modalities alignment and visual loss. To mitigate this issue of uncoordinated learning, we propose a novel method called Levelling Paradigm (LePa) to improve performance for unlabeled tasks or scenarios. The proposed LePa, designed as a plug-in module, dynamically constrains and coordinates multiple objective functions, thereby improving the generalization of these baseline methods. Comprehensive experiments have shown that our design can effectively address generalized scenarios and tasks.

AAAI Conference 2025 Conference Paper

Neural Networks Perform Sufficient Dimension Reduction

  • Shuntuo Xu
  • Zhou Yu

This paper investigates the connection between neural networks and sufficient dimension reduction (SDR), demonstrating that neural networks inherently perform SDR in regression tasks under appropriate rank regularizations. Specifically, the weights in the first layer span the central mean subspace. We establish the statistical consistency of the neural network-based estimator for the central mean subspace, underscoring the suitability of neural networks in addressing SDR-related challenges. Numerical experiments further validate our theoretical findings, and highlight the underlying capability of neural networks to facilitate SDR compared to the existing methods. Additionally, we discuss an extension to unravel the central subspace, broadening the scope of our investigation.

AAAI Conference 2025 Conference Paper

Probability-Density-aware Semi-supervised Learning

  • Shuyang Liu
  • Ruiqiu Zheng
  • Yunhang Shen
  • Zhou Yu
  • Ke Li
  • Xing Sun
  • Shaohui Lin

In semi-supervised learning (SSL), the cluster assumption is commonly accepted: features in different high-density regions belong to different categories. However, existing algorithms often ignore it, and it lacks a mathematical explanation. This paper first proposes a theorem that statistically explains the cluster assumption and proves that probability density can significantly help exploit this prior. A Probability-Density-Aware Measure (PM) is proposed based on the theorem to discern the similarity between neighboring points. The PM is deployed to improve Label Propagation, and a new pseudo-labeling algorithm, Probability-Density-Aware Label Propagation (PMLP), is proposed. We also prove that traditional first-order similarity pseudo-labeling can be viewed as a particular case of PMLP, which provides a comprehensive theoretical understanding of PMLP's superior performance. Extensive experiments demonstrate that PMLP achieves outstanding performance compared with other recent methods.

IJCAI Conference 2025 Conference Paper

Self-Classification Enhancement and Correction for Weakly Supervised Object Detection

  • Yufei Yin
  • Lechao Cheng
  • Wengang Zhou
  • Jiajun Deng
  • Zhou Yu
  • Houqiang Li

In recent years, weakly supervised object detection (WSOD) has attracted much attention due to its low labeling cost. The success of recent WSOD models is often ascribed to the two-stage multi-class classification (MCC) task, i.e., multiple instance learning and online classification refinement. Despite achieving non-trivial progress, these methods overlook potential classification ambiguities between these two MCC tasks and fail to leverage their unique strengths. In this work, we introduce a novel WSOD framework to ameliorate these two issues. For one thing, we propose a self-classification enhancement module that integrates intra-class binary classification (ICBC) to bridge the gap between the two distinct MCC tasks. The ICBC task enhances the network’s discrimination between positive and mis-located samples in a class-wise manner and forges a mutually reinforcing relationship with the MCC task. For another, we propose a self-classification correction algorithm during inference, which combines the results of both MCC tasks to effectively reduce misclassified predictions. Extensive experiments on the prevalent VOC 2007 & 2012 datasets demonstrate the superior performance of our framework.

NeurIPS Conference 2025 Conference Paper

When Thinking Fails: The Pitfalls of Reasoning for Instruction-Following in LLMs

  • Xiaomin Li
  • Zhou Yu
  • Zhiwei Zhang
  • Xupeng Chen
  • Ziji Zhang
  • Yingying Zhuang
  • Narayanan Sadagopan
  • Anurag Beniwal

Reasoning-enhanced large language models (RLLMs), whether explicitly trained for reasoning or prompted via chain-of-thought (CoT), have achieved state-of-the-art performance on many complex reasoning tasks. However, we uncover a surprising and previously overlooked phenomenon: explicit CoT reasoning can significantly degrade instruction-following accuracy. Evaluating 20+ models on two benchmarks, IFEval (with simple, rule-verifiable constraints) and ComplexBench (with complex, compositional constraints), we consistently observe performance drops when CoT prompting is applied. Through large-scale case studies and an attention-based analysis, we identify common patterns where reasoning either helps (e.g., with formatting or lexical precision) or hurts (e.g., by neglecting simple constraints or introducing unnecessary content). We propose a metric, constraint attention, to quantify model focus during generation and show that CoT reasoning often diverts attention away from instruction-relevant tokens. To mitigate these effects, we introduce and evaluate four strategies: in-context learning, self-reflection, self-selective reasoning, and classifier-selective reasoning. Our results demonstrate that selective reasoning strategies, particularly classifier-selective reasoning, can substantially recover lost performance. To our knowledge, this is the first work to systematically expose reasoning-induced failures in instruction-following and offer practical mitigation strategies.

ICML Conference 2024 Conference Paper

Do Models Explain Themselves? Counterfactual Simulatability of Natural Language Explanations

  • Yanda Chen
  • Ruiqi Zhong
  • Narutatsu Ri
  • Chen Zhao
  • He He
  • Jacob Steinhardt
  • Zhou Yu
  • Kathleen McKeown

Large language models (LLMs) are trained to imitate humans to explain human decisions. However, do LLMs explain themselves? Can they help humans build mental models of how LLMs process different inputs? To answer these questions, we propose to evaluate counterfactual simulatability of natural language explanations: whether an explanation can enable humans to precisely infer the model’s outputs on diverse counterfactuals of the explained input. For example, if a model answers "yes" to the input question "Can eagles fly?" with the explanation "all birds can fly", then humans would infer from the explanation that it would also answer "yes" to the counterfactual input "Can penguins fly?". If the explanation is precise, then the model’s answer should match humans’ expectations. We implemented two metrics based on counterfactual simulatability: precision and generality. We generated diverse counterfactuals automatically using LLMs. We then used these metrics to evaluate state-of-the-art LLMs (e.g., GPT-4) on two tasks: multi-hop factual reasoning and reward modeling. We found that LLMs’ explanations have low precision and that precision does not correlate with plausibility. Therefore, naively optimizing human approvals (e.g., RLHF) may be insufficient.

AAAI Conference 2024 Conference Paper

ParaGuide: Guided Diffusion Paraphrasers for Plug-and-Play Textual Style Transfer

  • Zachary Horvitz
  • Ajay Patel
  • Chris Callison-Burch
  • Zhou Yu
  • Kathleen McKeown

Textual style transfer is the task of transforming stylistic properties of text while preserving meaning. Target "styles" can be defined in numerous ways, ranging from single attributes (e.g. formality) to authorship (e.g. Shakespeare). Previous unsupervised style-transfer approaches generally rely on significant amounts of labeled data for only a fixed set of styles or require large language models. In contrast, we introduce a novel diffusion-based framework for general-purpose style transfer that can be flexibly adapted to arbitrary target styles at inference time. Our parameter-efficient approach, ParaGuide, leverages paraphrase-conditioned diffusion models alongside gradient-based guidance from both off-the-shelf classifiers and strong existing style embedders to transform the style of text while preserving semantic information. We validate the method on the Enron Email Corpus, with both human and automatic evaluations, and find that it outperforms strong baselines on formality, sentiment, and even authorship style transfer.

JMLR Journal 2024 Journal Article

Random Forest Weighted Local Fréchet Regression with Random Objects

  • Rui Qiu
  • Zhou Yu
  • Ruoqing Zhu

Statistical analysis is increasingly confronted with complex data from metric spaces. Petersen and Müller (2019) established a general paradigm of Fréchet regression with complex metric space valued responses and Euclidean predictors. However, the local approach therein involves nonparametric kernel smoothing and suffers from the curse of dimensionality. To address this issue, we propose a novel random forest weighted local Fréchet regression paradigm. The main mechanism of our approach relies on a locally adaptive kernel generated by random forests. Our first method uses these weights as the local average to solve the conditional Fréchet mean, while the second method performs local linear Fréchet regression, both significantly improving existing Fréchet regression methods. Based on the theory of infinite order U-processes and infinite order $M_{m_n}$-estimator, we establish the consistency, rate of convergence, and asymptotic normality for our local constant estimator, which covers the current large sample theory of random forests with Euclidean responses as a special case. Numerical studies show the superiority of our methods with several commonly encountered types of responses such as distribution functions, symmetric positive-definite matrices, and sphere data. The practical merits of our proposals are also demonstrated through the application to New York taxi data and human mortality data.

AAAI Conference 2024 Conference Paper

SPD-DDPM: Denoising Diffusion Probabilistic Models in the Symmetric Positive Definite Space

  • Yunchen Li
  • Zhou Yu
  • Gaoqi He
  • Yunhang Shen
  • Ke Li
  • Xing Sun
  • Shaohui Lin

Symmetric positive definite (SPD) matrices have shown important value and applications in statistics and machine learning, such as fMRI analysis and traffic prediction. Previous works on SPD matrices mostly focus on discriminative models, where predictions are made directly on E(X|y), where y is a vector and X is an SPD matrix. However, these methods struggle to handle large-scale data. In this paper, inspired by the denoising diffusion probabilistic model (DDPM), we propose a novel generative model, termed SPD-DDPM, by introducing a Gaussian distribution in the SPD space to estimate E(X|y). Moreover, our model can estimate p(X) unconditionally and flexibly without giving y. On the one hand, the model conditionally learns p(X|y) and utilizes the mean of samples to obtain E(X|y) as a prediction. On the other hand, the model unconditionally learns the probability distribution of the data p(X) and generates samples that conform to this distribution. Furthermore, we propose a new SPD net which is much deeper than the previous networks and allows for the inclusion of conditional factors. Experimental results on toy data and real taxi data demonstrate that our models effectively fit the data distribution both unconditionally and conditionally.

AAAI Conference 2023 Conference Paper

Adaptive Hierarchy-Branch Fusion for Online Knowledge Distillation

  • Linrui Gong
  • Shaohui Lin
  • Baochang Zhang
  • Yunhang Shen
  • Ke Li
  • Ruizhi Qiao
  • Bo Ren
  • Muqing Li

Online Knowledge Distillation (OKD) is designed to alleviate the dilemma that the high-capacity pre-trained teacher model is not available. However, existing methods mostly focus on improving the ensemble prediction accuracy from multiple students (a.k.a. branches) and often overlook the homogenization problem, which makes the student model saturate quickly and hurts performance. We assume that the intrinsic bottleneck of the homogenization problem comes from the identical branch architecture and coarse ensemble strategy. We propose a novel Adaptive Hierarchy-Branch Fusion framework for Online Knowledge Distillation, termed AHBF-OKD, which designs hierarchical branches and an adaptive hierarchy-branch fusion module to boost model diversity and aggregate complementary knowledge. Specifically, we first introduce hierarchical branch architectures to construct diverse peers by increasing the depth of branches monotonically on the basis of the target branch. To effectively transfer knowledge from the most complex branch to the simplest target branch, we propose an adaptive hierarchy-branch fusion module to create hierarchical teacher assistants recursively, which regards the target branch as the smallest teacher assistant. During training, the teacher assistant from the previous hierarchy is explicitly distilled by the teacher assistant and the branch from the current hierarchy. Thus, the importance scores for different branches are effectively and adaptively allocated to reduce branch homogenization. Extensive experiments demonstrate the effectiveness of AHBF-OKD on different datasets, including CIFAR-10/100 and ImageNet 2012. For example, on ImageNet 2012, the distilled ResNet-18 achieves a Top-1 error of 29.28%, which significantly outperforms the state-of-the-art methods. The source code is available at https://github.com/linruigong965/AHBF.

AAAI Conference 2023 Conference Paper

MIGA: A Unified Multi-Task Generation Framework for Conversational Text-to-SQL

  • Yingwen Fu
  • Wenjie Ou
  • Zhou Yu
  • Yue Lin

Conversational text-to-SQL is designed to translate multi-turn natural language questions into their corresponding SQL queries. Most advanced conversational text-to-SQL methods are incompatible with generative pre-trained language models (PLMs), such as T5. In this paper, we present a two-stage unified MultI-task Generation frAmework (MIGA) that leverages PLMs’ ability to tackle conversational text-to-SQL. In the pre-training stage, MIGA first decomposes the main task into several related sub-tasks and then unifies them into the same sequence-to-sequence (Seq2Seq) paradigm with task-specific natural language prompts to boost the main task from multi-task training. Later in the fine-tuning stage, we propose four SQL perturbations to alleviate the error propagation problem. MIGA tends to achieve state-of-the-art performance on two benchmarks (SparC and CoSQL). We also provide extensive analyses and discussions to shed light on some new perspectives for conversational text-to-SQL.

AAAI Conference 2021 Conference Paper

A Student-Teacher Architecture for Dialog Domain Adaptation Under the Meta-Learning Setting

  • Kun Qian
  • Wei Wei
  • Zhou Yu

Numerous new dialog domains are being created every day while collecting data for these domains is extremely costly since it involves human interactions. Therefore, it is essential to develop algorithms that can adapt to different domains efficiently when building data-driven dialog models. Most recent research on domain adaptation focuses on giving the model a better initialization, rather than optimizing the adaptation process. We propose an efficient domain adaptive task-oriented dialog system model, which incorporates a meta-teacher model to emphasize the different impacts between generated tokens with respect to the context. We first train our base dialog model and meta-teacher model adversarially in a meta-learning setting on rich-resource domains. The meta-teacher learns to quantify the importance of tokens under different contexts across different domains. During adaptation, the meta-teacher guides the dialog model to focus on important tokens in order to achieve better adaptation efficiency. We evaluate our model on two multi-domain datasets, MultiWOZ and Google Schema-Guided Dialogue, and achieve state-of-the-art performance.

AAAI Conference 2021 System Paper

ACAT-G: An Interactive Learning Framework for Assisted Response Generation

  • Xueyuan Lu
  • Saurav Sahay
  • Zhou Yu
  • Lama Nachman

In this paper, we introduce ACAT-G, an interactive dialogue learning framework that incorporates constant human feedback into fine-tuning language models in order to assist conditioned dialog generation. The system takes in a limited amount of input from a human and generates personalized response corresponding to the context of the conversation within natural dialog time-frame. By combining inspirations from online learning, reinforcement learning, and large scale language models, we expect this project to provide a foundation for human-in-the-loop conditional dialog generation tasks.

NeurIPS Conference 2021 Conference Paper

IconQA: A New Benchmark for Abstract Diagram Understanding and Visual Language Reasoning

  • Pan Lu
  • Liang Qiu
  • Jiaqi Chen
  • Tanglin Xia
  • Yizhou Zhao
  • Wei Zhang
  • Zhou Yu
  • Xiaodan Liang

Current visual question answering (VQA) tasks mainly consider answering human-annotated questions for natural images. However, aside from natural images, abstract diagrams with semantic richness are still understudied in visual understanding and reasoning research. In this work, we introduce a new challenge of Icon Question Answering (IconQA) with the goal of answering a question in an icon image context. We release IconQA, a large-scale dataset that consists of 107,439 questions and three sub-tasks: multi-image-choice, multi-text-choice, and filling-in-the-blank. The IconQA dataset is inspired by real-world diagram word problems that highlight the importance of abstract diagram understanding and comprehensive cognitive reasoning. Thus, IconQA requires not only perception skills like object recognition and text understanding, but also diverse cognitive reasoning skills, such as geometric reasoning, commonsense reasoning, and arithmetic reasoning. To facilitate potential IconQA models to learn semantic representations for icon images, we further release an icon dataset Icon645 which contains 645,687 colored icons on 377 classes. We conduct extensive user studies and blind experiments and reproduce a wide range of advanced VQA methods to benchmark the IconQA task. Also, we develop a strong IconQA baseline Patch-TRM that applies a pyramid cross-modal Transformer with input diagram embeddings pre-trained on the icon dataset. IconQA and Icon645 are available at https://iconqa.github.io.

AAAI Conference 2021 Conference Paper

Perception Score: A Learned Metric for Open-ended Text Generation Evaluation

  • Jing Gu
  • Qingyang Wu
  • Zhou Yu

Automatic evaluation for open-ended natural language generation tasks remains a challenge. We propose a learned evaluation metric: Perception Score. It utilizes a pre-trained model and considers context information for conditional generation. Perception Score assigns a holistic score along with uncertainty measurement. We conduct experiments on three open-ended conditional generation tasks and two open-ended unconditional generation tasks. Perception Score achieves state-of-the-art results on all the tasks consistently in terms of correlation with human evaluation scores.

AAAI Conference 2021 Conference Paper

TextGAIL: Generative Adversarial Imitation Learning for Text Generation

  • Qingyang Wu
  • Lei Li
  • Zhou Yu

Generative Adversarial Networks (GANs) for text generation have recently received many criticisms, as they perform worse than their MLE counterparts (Caccia et al. 2020; Tevet et al. 2019; Semeniuta, Severyn, and Gelly 2018). We suspect previous text GANs’ inferior performance is due to the lack of a reliable guiding signal in their discriminators. To address this problem, we propose a generative adversarial imitation learning framework for text generation that uses large pre-trained language models to provide more reliable reward guidance. As previous text GANs suffer from high variance of gradients, we apply contrastive discriminator, and proximal policy optimization (PPO) to stabilize and improve text generation performance. For evaluation, we conduct experiments on a diverse set of unconditional and conditional text generation tasks. Experimental results show that TextGAIL achieves better performance in terms of both quality and diversity than the MLE baseline. We also validate our intuition that TextGAIL’s discriminator demonstrates the capability of providing reasonable rewards with an additional task.

AAAI Conference 2020 Conference Paper

End-to-End Trainable Non-Collaborative Dialog System

  • Yu Li
  • Kun Qian
  • Weiyan Shi
  • Zhou Yu

End-to-end task-oriented dialog models have achieved promising performance on collaborative tasks where users willingly coordinate with the system to complete a given task. In non-collaborative settings such as negotiation and persuasion, however, users and systems do not share a common goal. As a result, compared to collaborative tasks, people use social content to build rapport and trust in these non-collaborative settings in order to advance their goals. To handle social content, we introduce a hierarchical intent annotation scheme, which can be generalized to different non-collaborative dialog tasks. Building upon TransferTransfo (Wolf et al. 2019), we propose an end-to-end neural network model to generate diverse coherent responses. Our model utilizes intent and semantic slots as the intermediate sentence representation to guide the generation process. In addition, we design a filter to select appropriate responses based on whether these intermediate representations fit the designed task and conversation constraints. Our non-collaborative dialog model guides users to complete the task while simultaneously keeping them engaged. We test our approach on our newly proposed ANTISCAM dataset and an existing PERSUASIONFORGOOD dataset. Both automatic and human evaluations suggest that our model outperforms multiple baselines in these two non-collaborative tasks.

AAAI Conference 2020 Conference Paper

Filling Conversation Ellipsis for Better Social Dialog Understanding

  • Xiyuan Zhang
  • Chengxi Li
  • Dian Yu
  • Samuel Davidson
  • Zhou Yu

The phenomenon of ellipsis is prevalent in social conversations. Ellipsis increases the difficulty of a series of downstream language understanding tasks, such as dialog act prediction and semantic role labeling. We propose to resolve ellipsis through automatic sentence completion to improve language understanding. However, automatic ellipsis completion can result in output which does not accurately reflect user intent. To address this issue, we propose a method which considers both the original utterance that has ellipsis and the automatically completed utterance in dialog act and semantic role labeling tasks. Specifically, we first complete user utterances to resolve ellipsis using an end-to-end pointer network model. We then train a prediction model using both utterances containing ellipsis and our automatically completed utterances. Finally, we combine the prediction results from these two utterances using a selection model that is guided by expert knowledge. Our approach improves dialog act prediction and semantic role labeling by 1.3% and 2.5% in F1 score respectively in social conversations. We also present an open-domain human-machine conversation dataset with manually completed user utterances and annotated semantic role labeling after manual completion.

AAAI Conference 2020 Conference Paper

Importance-Aware Learning for Neural Headline Editing

  • Qingyang Wu
  • Lei Li
  • Hao Zhou
  • Ying Zeng
  • Zhou Yu

Many social media news writers are not professionally trained. Therefore, social media platforms have to hire professional editors to adjust amateur headlines to attract more readers. We propose to automate this headline editing process through neural network models to provide more immediate writing support for these social media news writers. To train such a neural headline editing model, we collected a dataset which contains articles with original headlines and professionally edited headlines. However, it is expensive to collect a large number of professionally edited headlines. To solve this low-resource problem, we design an encoder-decoder model which leverages large-scale pre-trained language models. We further improve the pre-trained model's quality by introducing a headline generation task as an intermediate task before the headline editing task. Also, we propose Self Importance-Aware (SIA) loss to address the different levels of editing in the dataset by down-weighting the importance of easily classified tokens and sentences. With the help of Pre-training, Adaptation, and SIA, the model learns to generate headlines in the professional editor's style. Experimental results show that our method significantly improves the quality of headline editing compared with previous methods.

AAAI Conference 2020 Conference Paper

MOSS: End-to-End Dialog System Framework with Modular Supervision

  • Weixin Liang
  • Youzhi Tian
  • Chengcai Chen
  • Zhou Yu

A major bottleneck in training end-to-end task-oriented dialog systems is the lack of data. To utilize limited training data more efficiently, we propose Modular Supervision Network (MOSS), an encoder-decoder training framework that could incorporate supervision from various intermediate dialog system modules including natural language understanding, dialog state tracking, dialog policy learning and natural language generation. With only 60% of the training data, MOSS-all (i.e., MOSS with supervision from all four dialog modules) outperforms state-of-the-art models on CamRest676. Moreover, introducing modular supervision has even bigger benefits when the dialog task has a more complex dialog state and action space. With only 40% of the training data, MOSS-all outperforms the state-of-the-art model on a complex laptop network troubleshooting dataset, LaptopNetwork, that we introduced. LaptopNetwork consists of conversations between real customers and customer service agents in Chinese. Moreover, the MOSS framework can accommodate dialogs that have supervision from different dialog modules at both framework level and model level. Therefore, MOSS is extremely flexible to update in real-world deployment.

AAAI Conference 2020 Conference Paper

Task-Oriented Dialog Systems That Consider Multiple Appropriate Responses under the Same Context

  • Yichi Zhang
  • Zhijian Ou
  • Zhou Yu

Conversations have an intrinsic one-to-many property, which means that multiple responses can be appropriate for the same dialog context. In task-oriented dialogs, this property leads to different valid dialog policies towards task completion. However, none of the existing task-oriented dialog generation approaches takes this property into account. We propose a Multi-Action Data Augmentation (MADA) framework to utilize the one-to-many property to generate diverse appropriate dialog responses. Specifically, we first use dialog states to summarize the dialog history, and then discover all possible mappings from every dialog state to its different valid system actions. During dialog system training, we enable the current dialog state to map to all valid system actions discovered in the previous process to create additional state-action pairs. By incorporating these additional pairs, the dialog policy learns a balanced action distribution, which further guides the dialog model to generate diverse responses. Experimental results show that the proposed framework consistently improves dialog policy diversity, and results in improved response diversity and appropriateness. Our model obtains state-of-the-art results on MultiWOZ.
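The augmentation step described in this abstract can be illustrated with a minimal Python sketch. It is not the authors' implementation: it assumes dialog states and system actions are plain hashable values (here strings), and the function names `build_state_action_map` and `augment_pairs` are hypothetical, chosen only for illustration.

```python
from collections import defaultdict


def build_state_action_map(turns):
    """Collect, for every dialog state, all system actions observed with it.

    `turns` is an iterable of (state, action) pairs; this flat representation
    is an assumption made for the sketch.
    """
    mapping = defaultdict(set)
    for state, action in turns:
        mapping[state].add(action)
    return mapping


def augment_pairs(turns):
    """Pair each turn's state with every valid action discovered for it."""
    turns = list(turns)
    mapping = build_state_action_map(turns)
    augmented = []
    for state, _ in turns:
        # Sorted only to make the output order deterministic.
        for action in sorted(mapping[state]):
            augmented.append((state, action))
    return augmented


# Toy data (invented for illustration, not taken from MultiWOZ).
turns = [
    ("food=thai", "request_area"),
    ("food=thai", "inform_name"),
    ("food=chinese", "request_price"),
]
pairs = augment_pairs(turns)
```

In this toy example the state `food=thai` appears with two different actions, so each of its turns is expanded into two state-action pairs, while `food=chinese` keeps its single pair.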

AAAI Conference 2019 Conference Paper

ActivityNet-QA: A Dataset for Understanding Complex Web Videos via Question Answering

  • Zhou Yu
  • Dejing Xu
  • Jun Yu
  • Ting Yu
  • Zhou Zhao
  • Yueting Zhuang
  • Dacheng Tao

Recent developments in modeling language and vision have been successfully applied to image question answering. It is both crucial and natural to extend this research direction to the video domain for video question answering (VideoQA). Compared to the image domain where large-scale and fully annotated benchmark datasets exist, VideoQA datasets are limited to small scale, are automatically generated, etc. These limitations restrict their applicability in practice. Here we introduce ActivityNet-QA, a fully annotated and large-scale VideoQA dataset. The dataset consists of 58,000 QA pairs on 5,800 complex web videos derived from the popular ActivityNet dataset. We present a statistical analysis of our ActivityNet-QA dataset and conduct extensive experiments on it by comparing existing VideoQA baselines. Moreover, we explore various video representation strategies to improve VideoQA performance, especially for long videos.

AAAI Conference 2019 Conference Paper

Incorporating Structured Commonsense Knowledge in Story Completion

  • Jiaao Chen
  • Jianshu Chen
  • Zhou Yu

The ability to select an appropriate story ending is the first step towards perfect narrative comprehension. Story ending prediction requires not only the explicit clues within the context, but also the implicit knowledge (such as commonsense) to construct a reasonable and consistent story. However, most previous approaches do not explicitly use background commonsense knowledge. We present a neural story ending selection model that integrates three types of information: narrative sequence, sentiment evolution and commonsense knowledge. Experiments show that our model outperforms state-of-the-art approaches on a public dataset, ROCStory Cloze Task (Mostafazadeh et al. 2017), and the performance gain from adding the additional commonsense knowledge is significant.

IJCAI Conference 2018 Conference Paper

Open-Ended Long-form Video Question Answering via Adaptive Hierarchical Reinforced Networks

  • Zhou Zhao
  • Zhu Zhang
  • Shuwen Xiao
  • Zhou Yu
  • Jun Yu
  • Deng Cai
  • Fei Wu
  • Yueting Zhuang

Open-ended long-form video question answering is a challenging problem in visual information retrieval, which automatically generates the natural language answer from the referenced long-form video content according to the question. However, the existing video question answering works mainly focus on short-form video question answering, due to the lack of modeling the semantic representation of long-form video contents. In this paper, we consider the problem of long-form video question answering from the viewpoint of adaptive hierarchical reinforced encoder-decoder network learning. We propose the adaptive hierarchical encoder network to learn the joint representation of the long-form video contents according to the question with adaptive video segmentation. We then develop the reinforced decoder network to generate the natural language answer for open-ended video question answering. We construct a large-scale long-form video question answering dataset. The extensive experiments show the effectiveness of our method.

IJCAI Conference 2018 Conference Paper

Rethinking Diversified and Discriminative Proposal Generation for Visual Grounding

  • Zhou Yu
  • Jun Yu
  • Chenchao Xiang
  • Zhou Zhao
  • Qi Tian
  • Dacheng Tao

Visual grounding aims to localize an object in an image referred to by a textual query phrase. Various visual grounding approaches have been proposed, and the problem can be modularized into a general framework: proposal generation, multi-modal feature representation, and proposal ranking. Of these three modules, most existing approaches focus on the latter two, with the importance of proposal generation generally neglected. In this paper, we rethink the problem of what properties make a good proposal generator. We introduce diversity and discrimination simultaneously when generating proposals, and in doing so propose the Diversified and Discriminative Proposal Networks (DDPN) model. Based on the proposals generated by DDPN, we propose a high-performance baseline model for visual grounding and evaluate it on four benchmark datasets. Experimental results demonstrate that our model delivers significant improvements on all the tested datasets (e.g., 18.8% improvement on ReferItGame and 8.2% improvement on Flickr30k Entities over the existing state of the art, respectively).

IJCAI Conference 2017 Conference Paper

Learning Conversational Systems that Interleave Task and Non-Task Content

  • Zhou Yu
  • Alexander Rudnicky
  • Alan Black

Task-oriented dialog systems have been applied in various tasks, such as automated personal assistants, customer service providers and tutors. These systems work well when users have clear and explicit intentions that are well-aligned to the systems' capabilities. However, they fail if users' intentions are not explicit. To address this shortcoming, we propose a framework to interleave non-task content (i.e., everyday social conversation) into task conversations. When the task content fails, the system can still keep the user engaged with the non-task content. We trained a policy using reinforcement learning algorithms to promote long-turn conversation coherence and consistency, so that the system can have smooth transitions between task and non-task content. To test the effectiveness of the proposed framework, we developed a movie promotion dialog system. Experiments with human users indicate that a system that interleaves social and task content achieves a better task success rate and is also rated as more engaging compared to a pure task-oriented system.