Arrow Research search

Author name cluster

Yu Wu

Possible papers associated with this exact author name in Arrow. This page groups case-insensitive exact name matches and is not a full identity disambiguation profile.

39 papers
2 author rows

Possible papers

39

AAAI Conference 2026 Conference Paper

Text-based Aerial-Ground Person Retrieval

  • Xinyu Zhou
  • Yu Wu
  • Jiayao Ma
  • Wenhao Wang
  • Min Cao
  • Mang Ye

This work introduces Text-based Aerial-Ground Person Retrieval (TAG-PR), which aims to retrieve person images from heterogeneous aerial and ground views with textual descriptions. Unlike traditional Text-based Person Retrieval (T-PR), which focuses solely on ground-view images, TAG-PR offers greater practical significance and presents unique challenges due to the large viewpoint discrepancy across images. To support this task, we contribute: (1) the TAG-PEDES dataset, constructed from public benchmarks with automatically generated textual descriptions and enhanced by a diversified text generation paradigm to ensure robustness under view heterogeneity; and (2) TAG-CLIP, a novel retrieval framework that addresses view heterogeneity through a hierarchically-routed mixture-of-experts module to learn view-specific and view-agnostic features and a viewpoint decoupling strategy to decouple view-specific features for better cross-modal alignment. We evaluate the effectiveness of TAG-CLIP on both the proposed TAG-PEDES and existing T-PR benchmarks.

IROS Conference 2025 Conference Paper

An Online Motion Planning Framework for Navigating Torpedo-shaped Autonomous Underwater Vehicles in Unknown Underwater Environments

  • Tianyou Yu
  • Zhaoxuan Dong
  • Yu Wu
  • Xingjie Fu

Navigating unknown underwater environments is a significant challenge for autonomous underwater vehicles (AUVs), especially those with torpedo-like shapes. Lacking a prior map, these vehicles rely on real-time sensor data for perception. Although online motion planning addresses this challenge, many existing methods are primarily tested on more maneuverable robots, such as multicopters and ground vehicles, and do not account for the unique kinematics of torpedo-shaped AUVs, such as limited lateral movement, or the need for 3D motion planning. In this paper, we propose an online motion planning system specifically designed for torpedo-shaped AUVs to navigate 3D underwater terrain without prior environmental knowledge. The system employs a receding horizon planning framework to ensure safe navigation by replanning the trajectory when collisions are detected or the planning horizon is reached. For trajectory generation, a search-based method uses a 3D Dubins curve heuristic to guide the generation of an optimal 3D trajectory that adheres to the AUV's kinematic constraints. To further enhance safety and smoothness, gradient-based optimization is applied to refine the trajectory. Experiments in simulated environments validate the proposed method, demonstrating its ability to generate safe trajectories for AUVs in complex and unknown environments. We release our code as an open-source package.

NeurIPS Conference 2025 Conference Paper

BNMusic: Blending Environmental Noises into Personalized Music

  • Chi Zuo
  • Martin Møller
  • Pablo Martínez-Nuevo
  • Huayang Huang
  • Yu Wu
  • Ye Zhu

Acoustic masking is a conventional audio-engineering technique for reducing the annoyance of environmental noises by covering them up with other dominant yet less intrusive sounds. However, misalignment between the dominant sound and the noise, such as mismatched downbeats, often requires an excessive volume increase to achieve effective masking. Motivated by recent advances in cross-modal generation, in this work, we introduce an alternative to acoustic masking, aiming to reduce the noticeability of environmental noises by blending them into personalized music generated from user-provided text prompts. Following the paradigm of music generation using mel-spectrogram representations, we propose a Blending Noises into Personalized Music (BNMusic) framework with two key stages. The first stage synthesizes a complete piece of music in a mel-spectrogram representation that encapsulates the musical essence of the noise. In the second stage, we adaptively amplify the generated music segment to further reduce noise perception and enhance the blending effectiveness, while preserving auditory quality. Our experiments with comprehensive evaluations on MusicBench, EPIC-SOUNDS, and ESC-50 demonstrate the effectiveness of our framework, highlighting the ability to blend environmental noise with rhythmically aligned, adaptively amplified, and enjoyable music segments, minimizing the noticeability of the noise and thereby improving overall acoustic experiences. Project page: https://d-fas.github.io/BNMusic_page/.
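The adaptive amplification stage can be pictured with a simple level-matching sketch. The criterion below (RMS level of the music relative to the noise, with a gain cap) is an assumption for illustration only; the paper's actual amplification strategy is not specified in the abstract, and `music` and `noise` are assumed to be mono waveforms at the same sample rate.

```python
import numpy as np

def rms(x: np.ndarray) -> float:
    """Root-mean-square level of a mono waveform."""
    return float(np.sqrt(np.mean(np.square(x)) + 1e-12))

def adaptive_gain(music: np.ndarray, noise: np.ndarray,
                  margin_db: float = 3.0, max_gain_db: float = 12.0) -> np.ndarray:
    """Scale the generated music so its level sits margin_db above the noise,
    capped at max_gain_db to preserve auditory quality. Hypothetical criterion
    for illustration; not BNMusic's published procedure."""
    gain_db = 20.0 * np.log10(rms(noise) / rms(music)) + margin_db
    gain_db = float(np.clip(gain_db, 0.0, max_gain_db))
    return music * (10.0 ** (gain_db / 20.0))
```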

IROS Conference 2025 Conference Paper

DPSN: Dual Prior Knowledge Induced Tactile paving and Obstacle Joint Segmentation Network

  • Youqi Song
  • Wenqi Li
  • Zhao Zhang
  • Yu Wu
  • Zilong Jin
  • Changbo Wang
  • Gaoqi He

Accurate semantic segmentation of both tactile paving and obstacles is crucial for the safe mobility of visually impaired individuals. However, existing methods face two major challenges: (i) discontinuous segmentation fragments; (ii) inaccurate obstacle recognition. To address challenge (i), we propose incorporating appearance priors of complete tactile pavings to prevent the model from directly learning irregular ground truth masks. To tackle challenge (ii), we propose introducing cross-modal semantic priors to complement the semantic information of obstacles. We implement these strategies in the proposed Dual Prior knowledge induced tactile paving and obstacle joint Segmentation Network (DPSN). Based on a bilateral network architecture, DPSN merges obstacle category masks into tactile paving categories, constructing a complete tactile paving mask. Utilizing the complete mask, DPSN transfers appearance prior knowledge to detail features from boundary and structural perspectives. Concurrently, DPSN leverages the CLIP Text Encoder to guide visual feature decoding through attention mechanisms, transferring rich cross-modal semantic prior knowledge to the visual feature maps. Furthermore, we propose the TPO-Dataset, the first dataset for joint tactile paving and obstacle segmentation acquired from actual scenes. Experiments demonstrate that DPSN achieves state-of-the-art results on the TPO-Dataset, with relative gains of 27.16% in obstacle IoU and 30.53% in accuracy metrics compared to baseline methods. Notably, DPSN achieves real-time performance at 88.25 FPS at the maximum scale of 2048×512 resolution.

AAAI Conference 2025 Conference Paper

Efficient Robustness Evaluation via Constraint Relaxation

  • Chao Pan
  • Yu Wu
  • Ke Tang
  • Qing Li
  • Xin Yao

The study of enhancing model robustness against adversarial examples has become increasingly critical in the security of deep learning, leading to the development of numerous adversarial defense techniques. While these defense methods have shown promise in mitigating the impact of adversarial perturbations, evaluating their effectiveness remains a critical challenge. The recently introduced AutoAttack technique has been recognized as a standardized method for assessing model robustness. However, the computational demands of the AutoAttack method significantly limit its applicability, underscoring the urgent need for efficient evaluation techniques. To address this challenge, we propose a novel and efficient evaluation framework based on strategic constraint relaxation. Our key insight is that temporarily expanding the adversarial perturbation bounds during the attack process can help discover more effective adversarial examples. Based on this insight, we develop the Constraint Relaxation Attack (CR Attack) method, which systematically relaxes and resets perturbation constraints during optimization. Extensive experiments on 105 robust models show that CR Attack outperforms AutoAttack in both attack success rate and efficiency, reducing forward and backward propagation time by 38.3× and 15.9×, respectively. Through comprehensive analysis, we validate that the constraint relaxation mechanism is crucial for the method's effectiveness.
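The core idea described here, temporarily relaxing the perturbation bound during the attack and re-projecting afterwards, can be sketched as a variant of PGD. This is an illustrative reading of the abstract, not the authors' CR Attack code; the relaxation schedule, step sizes, and argument names are assumptions.

```python
import torch
import torch.nn.functional as F

def cr_attack_sketch(model, x, y, eps=8/255, alpha=2/255, steps=40,
                     relax_factor=2.0, relax_every=10):
    """PGD-style attack that periodically relaxes the L-inf bound to
    relax_factor * eps and later re-projects to the original bound.
    Sketch of the constraint-relaxation idea only; the schedule is assumed."""
    x_adv = x.clone().detach()
    for t in range(steps):
        # Use a relaxed bound for the first half of every relax_every-step cycle.
        bound = relax_factor * eps if (t % relax_every) < relax_every // 2 else eps
        x_adv.requires_grad_(True)
        loss = F.cross_entropy(model(x_adv), y)
        grad = torch.autograd.grad(loss, x_adv)[0]
        x_adv = x_adv.detach() + alpha * grad.sign()
        x_adv = torch.min(torch.max(x_adv, x - bound), x + bound).clamp(0.0, 1.0)
    # Final reset: the returned example must satisfy the true eps constraint.
    return torch.min(torch.max(x_adv, x - eps), x + eps).clamp(0.0, 1.0).detach()
```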

JBHI Journal 2025 Journal Article

Efficient Vulnerability Assessment of Multi-Bit Faults in Embedded Healthcare Software

  • Jinting Ren
  • Yu Wu
  • Hengyi Ren
  • Xing Li
  • Zhaoyang Han

With the growing use of wearable and implantable medical devices, ensuring the reliability of embedded software is crucial for patient safety. These devices are susceptible to soft errors, and traditional single-bit fault models often fail to account for multi-bit faults, leading to either insufficient protection or excessive overhead. We introduce V-FAME, a lightweight and efficient vulnerability analysis framework for embedded healthcare systems. V-FAME uses machine learning and a multi-bit fault model to identify fault-sensitive code regions quickly and accurately. Unlike existing tools that focus on instruction-level analysis, V-FAME uses a basic block-level classification approach to significantly prune the fault space. Our experiments show V-FAME achieves a speedup of over 6.16x while maintaining an accuracy of over 80%. This framework supports reliable and cost-effective fault mitigation in medical applications.

IJCAI Conference 2025 Conference Paper

Spotlighting Partially Visible Cinematic Language for Video-to-Audio Generation via Self-distillation

  • Feizhen Huang
  • Yu Wu
  • Yutian Lin
  • Bo Du

Video-to-Audio (V2A) Generation has achieved significant progress and plays a crucial role in film and video post-production. However, current methods overlook the cinematic language, a critical component of artistic expression in filmmaking. As a result, their performance deteriorates in scenarios where Foley targets are only partially visible. To address this challenge, we propose a simple self-distillation approach to extend V2A models to cinematic language scenarios. By simulating the cinematic language variations, the student model learns to align the video features of training pairs with the same audio-visual correspondences, enabling it to effectively capture the associations between sounds and partial visual information. Our method not only achieves impressive improvements under partial visibility across all evaluation metrics, but also enhances performance on the large-scale V2A dataset, VGGSound.

AAAI Conference 2025 Conference Paper

WebPilot: A Versatile and Autonomous Multi-Agent System for Web Task Execution with Strategic Exploration

  • Yao Zhang
  • Zijian Ma
  • Yunpu Ma
  • Zhen Han
  • Yu Wu
  • Volker Tresp

LLM-based autonomous agents often fail to execute complex web tasks that require dynamic interaction, largely due to the inherent uncertainty and complexity of these environments. Existing LLM-based web agents typically rely on rigid, expert-designed policies specific to certain states and actions, lacking the flexibility and generalizability needed to adapt to unseen tasks. In contrast, humans excel by exploring unknowns, continuously adapting strategies based on new observations, and resolving ambiguities through exploration. To emulate human-like adaptability, web agents need strategic exploration and complex decision-making. Monte Carlo Tree Search (MCTS) is well-suited for this, but classical MCTS struggles with vast action spaces, unpredictable state transitions, and incomplete information in web tasks. In light of this, we develop WebPilot, a multi-agent system with a dual optimization strategy that improves MCTS to better handle complex web environments. Specifically, the Global Optimization phase involves generating a high-level plan by breaking down tasks into manageable subtasks, continuously refining this plan through reflective analysis of new observations and previous subtask attempts, thereby focusing the search process and mitigating challenges posed by vast action spaces in classical MCTS. Subsequently, the Local Optimization phase executes each subtask using a tailored MCTS designed for complex environments, effectively addressing uncertainties and managing incomplete information by iteratively refining decisions based on new observations. Experimental results on WebArena and MiniWoB++ demonstrate the effectiveness of WebPilot. Notably, on WebArena, WebPilot achieves SOTA performance with GPT-4, achieving a 93% relative increase in success rate over the concurrent tree search-based method. WebPilot advances autonomous agents, enabling more reliable decision-making in practical environments.
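WebPilot builds on Monte Carlo Tree Search; for context, the generic UCT selection rule that classical MCTS uses is sketched below. This is standard MCTS machinery rather than WebPilot's tailored variant, and the child-node dictionary fields are hypothetical.

```python
import math

def uct_select(children, c=1.4):
    """Pick the child maximizing mean value plus a UCT exploration bonus.
    Each child is a dict with hypothetical fields 'visits' and 'value'.
    Generic MCTS selection shown for background; WebPilot augments MCTS with
    reflective planning and LLM-guided expansion not shown here."""
    total_visits = sum(ch["visits"] for ch in children)
    def score(ch):
        if ch["visits"] == 0:
            return float("inf")  # explore unvisited children first
        exploit = ch["value"] / ch["visits"]
        explore = c * math.sqrt(math.log(total_visits) / ch["visits"])
        return exploit + explore
    return max(children, key=score)
```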

SoCS Conference 2024 Conference Paper

From Space-Time to Space-Order: Directly Planning a Temporal Planning Graph by Redefining CBS (Extended Abstract)

  • Yu Wu
  • Rishi Veerapaneni
  • Jiaoyang Li 0001
  • Maxim Likhachev

The majority of multi-agent path finding (MAPF) methods compute collision-free space-time paths which require agents to be at a specific location at a specific discretized timestep. However, executing these space-time paths directly on robotic systems is infeasible due to real-time execution differences (e.g., delays) which can lead to collisions. To combat this, current methods translate the space-time paths into a temporal plan graph (TPG) that only requires that agents observe the order in which they navigate through locations where their paths cross. However, planning space-time paths and then post-processing them into a TPG does not reduce the required agent-to-agent coordination, which is fixed once the space-time paths are computed. To that end, we propose a novel algorithm, Space-Order CBS, that can directly plan a TPG and explicitly minimize coordination. Our main theoretical insight is our novel perspective on viewing a TPG as a set of space-visitation order paths where agents visit locations in relative orders (e.g., 1st vs. 2nd) as opposed to specific timesteps. We redefine conflicts and constraints to adapt CBS for space-order planning. We experimentally validate that Space-Order CBS can return TPGs which significantly reduce coordination, thereby reducing the amount of agent-agent communication and leading to more robustness to delays during execution.
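The abstract's central object, a temporal plan graph viewed as space-visitation order paths, can be made concrete with a small data structure. The sketch below is an illustration of the concept, not the authors' implementation; the field names are hypothetical.

```python
from dataclasses import dataclass, field

@dataclass
class SpaceOrderPath:
    """One agent's path as an ordered list of locations, with no timesteps."""
    agent: int
    locations: list  # e.g. [(0, 0), (0, 1), (1, 1)]

@dataclass
class TPG:
    """Temporal plan graph: per-agent space-order paths plus inter-agent
    ordering edges (a, i, b, j), meaning agent a must reach its i-th location
    before agent b enters its j-th location (they share that cell)."""
    paths: dict = field(default_factory=dict)      # agent -> SpaceOrderPath
    orderings: list = field(default_factory=list)  # list of (a, i, b, j)

    def coordination(self) -> int:
        """Amount of agent-to-agent coordination, counted here simply as the
        number of inter-agent ordering constraints."""
        return len(self.orderings)
```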

ICAPS Conference 2024 Conference Paper

MAPF in 3D Warehouses: Dataset and Analysis

  • Qian Wang
  • Rishi Veerapaneni
  • Yu Wu
  • Jiaoyang Li 0001
  • Maxim Likhachev

Recent works have made significant progress in multi-agent path finding (MAPF), with modern methods being able to scale to hundreds of agents, handle unexpected delays, work in groups, etc. The vast majority of these methods have focused on 2D "grid world" domains. However, modern warehouses often utilize multi-agent robotic systems that can move in 3D, enabling dense storage but resulting in a more complex multi-agent planning problem. Motivated by this, we introduce and experimentally analyze the application of MAPF to 3D warehouse management, and release the first open-source 3D MAPF dataset (see http://mapf.info/index.php/Main/Benchmarks). We benchmark two state-of-the-art MAPF methods, EECBS and MAPF-LNS2, and show how different hyper-parameters affect these methods across various 3D MAPF problems. We also investigate how the warehouse structure itself affects MAPF performance. Based on our experimental analysis, we find that a fast low-level search is critical for 3D MAPF, EECBS

JBHI Journal 2024 Journal Article

Redefining the Game: MVAE-DFDPnet's Low-Dimensional Embeddings for Superior Drug-Protein Interaction Predictions

  • Liang-Yong Xia
  • Yu Wu
  • Longfei Zhao
  • Leying Chen
  • Shiyi Zhang
  • Mengdi Wang
  • Jie Luo

Precisely predicting drug-protein interactions (DPIs) is pivotal for drug discovery and advancing precision medicine. A significant challenge in this domain is the high-dimensional and heterogeneous data characterizing drug and protein attributes, along with their intricate interactions. In our study, we introduce a novel deep learning architecture: the Multi-view Variational Auto-Encoder followed by a cascade Deep Forest (MVAE-DFDPnet). This framework adeptly learns ultra-low-dimensional embeddings for drugs and proteins. Notably, our t-SNE analysis reveals that two-dimensional embeddings can clearly define clusters corresponding to diverse drug classes and protein families. These ultra-low-dimensional embeddings likely contribute to the enhanced robustness and generalizability of our MVAE-DFDPnet. Our model surpasses current leading methods on benchmark datasets while operating in significantly reduced dimensional spaces. The model's resilience is further evidenced by its sustained accuracy in predicting interactions involving novel drugs, proteins, and drug classes. Additionally, we have corroborated several newly identified DPIs with experimental evidence from the scientific literature.

NeurIPS Conference 2024 Conference Paper

ROBIN: Robust and Invisible Watermarks for Diffusion Models with Adversarial Optimization

  • Huayang Huang
  • Yu Wu
  • Qian Wang

Watermarking generative content serves as a vital tool for authentication, ownership protection, and mitigation of potential misuse. Existing watermarking methods face the challenge of balancing robustness and concealment. They empirically inject a watermark that is both invisible and robust and passively achieve concealment by limiting the strength of the watermark, thus reducing the robustness. In this paper, we propose to explicitly introduce a watermark hiding process to actively achieve concealment, thus allowing the embedding of stronger watermarks. To be specific, we implant a robust watermark in an intermediate diffusion state and then guide the model to hide the watermark in the final generated image. We employ an adversarial optimization algorithm to produce the optimal hiding prompt guiding signal for each watermark. The prompt embedding is optimized to minimize artifacts in the generated image, while the watermark is optimized to achieve maximum strength. The watermark can be verified by reversing the generation process. Experiments on various diffusion models demonstrate the watermark remains verifiable even under significant image tampering and shows superior invisibility compared to other state-of-the-art robust watermarking methods.

NeurIPS Conference 2024 Conference Paper

RobIR: Robust Inverse Rendering for High-Illumination Scenes

  • Ziyi Yang
  • Yanzhen Chen
  • Xinyu Gao
  • Yazhen Yuan
  • Yu Wu
  • Xiaowei Zhou
  • Xiaogang Jin

Implicit representation has opened up new possibilities for inverse rendering. However, existing implicit neural inverse rendering methods struggle to handle strongly illuminated scenes with significant shadows and slight reflections. The existence of shadows and reflections can lead to an inaccurate understanding of the scene, making precise factorization difficult. To this end, we present RobIR, an implicit inverse rendering approach that uses ACES tone mapping and regularized visibility estimation to reconstruct accurate BRDF of the object. By accurately modeling the indirect radiance field, normal, visibility, and direct light simultaneously, we are able to accurately decouple environment lighting and the object's PBR materials without imposing strict constraints on the scene. Even in high-illumination scenes with shadows and specular reflections, our method can recover high-quality albedo and roughness with no shadow interference. RobIR outperforms existing methods in both quantitative and qualitative evaluations.
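The abstract leans on ACES tone mapping; for readers unfamiliar with it, the widely used single-curve approximation of the ACES filmic tone mapper (due to Narkowicz) is sketched below. This is background context only; RobIR may apply a different or fuller ACES transform than this approximation.

```python
import numpy as np

def aces_tonemap(x: np.ndarray) -> np.ndarray:
    """Narkowicz's common approximation of the ACES filmic curve, mapping
    linear HDR radiance to [0, 1]. Shown for context; not taken from RobIR."""
    a, b, c, d, e = 2.51, 0.03, 2.43, 0.59, 0.14
    return np.clip((x * (a * x + b)) / (x * (c * x + d) + e), 0.0, 1.0)
```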

NeurIPS Conference 2024 Conference Paper

Toward Real Ultra Image Segmentation: Leveraging Surrounding Context to Cultivate General Segmentation Model

  • Sai Wang
  • Yutian Lin
  • Yu Wu
  • Bo Du

Existing ultra image segmentation methods suffer from two major challenges, namely the scalability issue (i.e., they lack the stability and generality of standard segmentation models, as they are tailored to specific datasets) and the architectural issue (i.e., they are incompatible with real-world ultra image scenes, as they compromise between image size and computing resources). To tackle these issues, we revisit the classic sliding inference framework, upon which we propose a Surrounding Guided Segmentation framework (SGNet) for ultra image segmentation. SGNet leverages a larger area around each image patch to refine the general segmentation results of local patches. Specifically, we propose a surrounding context integration module to absorb surrounding context information and extract specific features that are beneficial to local patches. Note that SGNet can be seamlessly integrated into any general segmentation model. Extensive experiments on five datasets demonstrate that SGNet achieves competitive performance and consistent improvements across a variety of general segmentation models, surpassing traditional ultra image segmentation methods by a large margin.
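The sliding-inference-with-surrounding-context idea can be sketched at the pipeline level. The helpers `seg_model` and `ctx_model` below are hypothetical callables standing in for the general segmentation model and the surrounding context integration module; the real SGNet fuses context inside the network rather than as a post hoc step.

```python
import numpy as np

def sliding_inference(image, seg_model, ctx_model, patch=512, context=1024):
    """Segment each local patch, then refine it using a larger crop centred on
    the same patch. Illustrative sketch of surrounding-guided sliding inference."""
    H, W = image.shape[:2]
    out = np.zeros((H, W), dtype=np.int64)
    pad = (context - patch) // 2
    padded = np.pad(image, ((pad, pad), (pad, pad), (0, 0)), mode="reflect")
    for y in range(0, H, patch):
        for x in range(0, W, patch):
            local = image[y:y + patch, x:x + patch]
            surround = padded[y:y + context, x:x + context]  # centred on the patch
            logits = seg_model(local)              # coarse per-pixel logits
            logits = ctx_model(logits, surround)   # refine with surrounding context
            pred = logits.argmax(axis=-1)
            out[y:y + patch, x:x + patch] = pred[:local.shape[0], :local.shape[1]]
    return out
```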

NeurIPS Conference 2023 Conference Paper

Boundary Guided Learning-Free Semantic Control with Diffusion Models

  • Ye Zhu
  • Yu Wu
  • Zhiwei Deng
  • Olga Russakovsky
  • Yan Yan

Applying pre-trained generative denoising diffusion models (DDMs) for downstream tasks such as image semantic editing usually requires either fine-tuning DDMs or learning auxiliary editing networks in the existing literature. In this work, we present our BoundaryDiffusion method for efficient, effective and light-weight semantic control with frozen pre-trained DDMs, without learning any extra networks. As one of the first learning-free diffusion editing works, we start by seeking a more comprehensive understanding of the intermediate high-dimensional latent spaces by theoretically and empirically analyzing their probabilistic and geometric behaviors in the Markov chain. We then propose to further explore the critical step in the denoising trajectory that characterizes the convergence of a pre-trained DDM and introduce an automatic search method. Last but not least, in contrast to the conventional understanding that DDMs have relatively poor semantic behaviors (in generic latent spaces), we prove that the critical latent space we found already forms semantic subspace boundaries at the generic level in unconditional DDMs, which allows us to do controllable manipulation by guiding the denoising trajectory towards the targeted boundary via a single-step operation. We conduct extensive experiments on multiple DDM architectures (DDPM, iDDPM) and datasets (CelebA, CelebA-HQ, LSUN-church, LSUN-bedroom, AFHQ-dog) with different resolutions (64, 256), achieving superior or state-of-the-art performance in various task scenarios (image semantic editing, text-based editing, unconditional semantic control), demonstrating the effectiveness of our method.

ICML Conference 2023 Conference Paper

Magneto: A Foundation Transformer

  • Hongyu Wang 0009
  • Shuming Ma
  • Shaohan Huang
  • Li Dong 0004
  • Wenhui Wang 0003
  • Zhiliang Peng
  • Yu Wu
  • Payal Bajaj

A big convergence of model architectures across language, vision, speech, and multimodal is emerging. However, under the same name "Transformers", the above areas use different implementations for better performance, e.g., Post-LayerNorm for BERT, and Pre-LayerNorm for GPT and vision Transformers. We call for the development of Foundation Transformer for true general-purpose modeling, which serves as a go-to architecture for various tasks and modalities with guaranteed training stability. In this work, we introduce a Transformer variant, named Magneto, to fulfill the goal. Specifically, we propose Sub-LayerNorm for good expressivity, and the initialization strategy theoretically derived from DeepNet for stable scaling up. Extensive experiments demonstrate its superior performance and better stability than the de facto Transformer variants designed for various applications, including language modeling (i.e., BERT and GPT), machine translation, vision pretraining (i.e., BEiT), speech recognition, and multimodal pretraining (i.e., BEiT-3).
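A minimal PyTorch rendering of the Sub-LayerNorm idea as read from the abstract: an extra LayerNorm inside each sublayer, on top of the usual pre-norm. This is a sketch under that assumption; the exact placement in Magneto and its DeepNet-derived initialization may differ and are not shown.

```python
import torch
import torch.nn as nn

class SubLNFeedForward(nn.Module):
    """Feed-forward sublayer with a Sub-LayerNorm-style extra normalization:
    one LayerNorm before the sublayer (as in Pre-LN) and one before the output
    projection. Illustrative sketch, not the released Magneto implementation."""
    def __init__(self, d_model: int, d_ff: int):
        super().__init__()
        self.ln_in = nn.LayerNorm(d_model)
        self.fc1 = nn.Linear(d_model, d_ff)
        self.ln_inner = nn.LayerNorm(d_ff)  # the extra "sub" LayerNorm
        self.fc2 = nn.Linear(d_ff, d_model)

    def forward(self, x: torch.Tensor) -> torch.Tensor:
        return x + self.fc2(self.ln_inner(torch.relu(self.fc1(self.ln_in(x)))))
```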

NeurIPS Conference 2023 Conference Paper

Revisit Weakly-Supervised Audio-Visual Video Parsing from the Language Perspective

  • Yingying Fan
  • Yu Wu
  • Bo Du
  • Yutian Lin

We focus on the weakly-supervised audio-visual video parsing task (AVVP), which aims to identify and locate all the events in the audio and visual modalities. Previous works only concentrate on video-level overall label denoising across modalities, but overlook segment-level label noise, where adjacent video segments (i.e., 1-second video clips) may contain different events. However, recognizing events at the segment level is challenging because a segment's label could be any combination of events that occur in the video. To address this issue, we consider tackling AVVP from the language perspective, since language can freely describe how various events appear in each segment beyond fixed labels. Specifically, we design language prompts to describe all cases of event appearance for each video. Then, the similarity between language prompts and segments is calculated, and the event of the most similar prompt is regarded as the segment-level label. In addition, to deal with mislabeled segments, we propose to perform dynamic re-weighting on the unreliable segments to adjust their labels. Experiments show that our simple yet effective approach outperforms state-of-the-art methods by a large margin.
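The segment-labeling step described here, scoring each 1-second segment against a set of event prompts and taking the best match, can be sketched with a CLIP-style cosine similarity. The feature tensors and their provenance are assumptions for illustration; the paper additionally re-weights unreliable segments, which is omitted.

```python
import torch
import torch.nn.functional as F

def segment_labels(seg_feats: torch.Tensor, prompt_feats: torch.Tensor):
    """Assign each segment the event combination of its most similar prompt.
    seg_feats:    (S, D) per-segment audio/visual features
    prompt_feats: (P, D) features of language prompts describing event cases
    Returns the best-matching prompt index per segment plus the full similarity
    matrix. Illustrative sketch of the labeling idea from the abstract."""
    seg = F.normalize(seg_feats, dim=-1)
    prm = F.normalize(prompt_feats, dim=-1)
    sim = seg @ prm.t()          # (S, P) cosine similarities
    best = sim.argmax(dim=-1)    # best prompt per segment
    return best, sim
```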

NeurIPS Conference 2023 Conference Paper

RIO: A Benchmark for Reasoning Intention-Oriented Objects in Open Environments

  • Mengxue Qu
  • Yu Wu
  • Wu Liu
  • Xiaodan Liang
  • Jingkuan Song
  • Yao Zhao
  • Yunchao Wei

Intention-oriented object detection aims to detect desired objects based on specific intentions or requirements. For instance, when we desire to "lie down and rest", we instinctively seek out a suitable option such as a "bed" or a "sofa" that can fulfill our needs. Previous work in this area is limited either by the number of intention descriptions or by the affordance vocabulary available for intention objects. These limitations make it challenging to handle intentions in open environments effectively. To facilitate this research, we construct a comprehensive dataset called Reasoning Intention-Oriented Objects (RIO). In particular, RIO is specifically designed to incorporate diverse real-world scenarios and a wide range of object categories. It offers the following key features: 1) intention descriptions in RIO are represented as natural sentences rather than a mere word or verb phrase, making them more practical and meaningful; 2) the intention descriptions are contextually relevant to the scene, enabling a broader range of potential functionalities associated with the objects; 3) the dataset comprises a total of 40,214 images and 130,585 intention-object pairs. With the proposed RIO, we evaluate the ability of some existing models to reason intention-oriented objects in open environments.

NeurIPS Conference 2022 Conference Paper

Enabling Detailed Action Recognition Evaluation Through Video Dataset Augmentation

  • Jihoon Chung
  • Yu Wu
  • Olga Russakovsky

It is well-known in the video understanding community that human action recognition models suffer from background bias, i.e., over-relying on scene cues in making their predictions. However, it is difficult to quantify this effect using existing evaluation frameworks. We introduce the Human-centric Analysis Toolkit (HAT), which enables evaluation of learned background bias without the need for new manual video annotation. It does so by automatically generating synthetically manipulated videos and leveraging the recent advances in image segmentation and video inpainting. Using HAT we perform an extensive analysis of 74 action recognition models trained on the Kinetics dataset. We confirm that all these models focus more on the scene background than on the human motion; further, we demonstrate that certain model design decisions (such as training with fewer frames per video or using dense as opposed to uniform temporal sampling) appear to worsen the background bias. We open-source HAT to enable the community to design more robust and generalizable human action recognition models.

NeurIPS Conference 2022 Conference Paper

Two-Stream Network for Sign Language Recognition and Translation

  • Yutong Chen
  • Ronglai Zuo
  • Fangyun Wei
  • Yu Wu
  • Shujie Liu
  • Brian Mak

Sign languages are visual languages using manual articulations and non-manual elements to convey information. For sign language recognition and translation, the majority of existing approaches directly encode RGB videos into hidden representations. RGB videos, however, are raw signals with substantial visual redundancy, leading the encoder to overlook the key information for sign language understanding. To mitigate this problem and better incorporate domain knowledge, such as handshape and body movement, we introduce a dual visual encoder containing two separate streams to model both the raw videos and the keypoint sequences generated by an off-the-shelf keypoint estimator. To make the two streams interact with each other, we explore a variety of techniques, including bidirectional lateral connection, sign pyramid network with auxiliary supervision, and frame-level self-distillation. The resulting model is called TwoStream-SLR, which is competent for sign language recognition (SLR). TwoStream-SLR is extended to a sign language translation (SLT) model, TwoStream-SLT, by simply attaching an extra translation network. Experimentally, our TwoStream-SLR and TwoStream-SLT achieve state-of-the-art performance on SLR and SLT tasks across a series of datasets including Phoenix-2014, Phoenix-2014T, and CSL-Daily.

AAAI Conference 2020 Conference Paper

A Dataset for Low-Resource Stylized Sequence-to-Sequence Generation

  • Yu Wu
  • Yunli Wang
  • Shujie Liu

Low-resource stylized sequence-to-sequence (S2S) generation is in high demand. However, its development is hindered by datasets with limitations in scale and in automatic evaluation methods. We construct two large-scale, multiple-reference datasets for low-resource stylized S2S: the Machine Translation Formality Corpus (MTFC), which is easy to evaluate, and the Twitter Conversation Formality Corpus (TCFC), which tackles an important problem in chatbots. These datasets contain context-to-source-style parallel data, source-style-to-target parallel data, and non-parallel sentences in the target style to enable semi-supervised learning. We provide three baselines: the pivot-based method, the teacher-student method, and the back-translation method. We find that the pivot-based method performs worst, while the other two methods achieve the best scores on different metrics.

AAAI Conference 2020 Conference Paper

Bridging the Gap between Pre-Training and Fine-Tuning for End-to-End Speech Translation

  • Chengyi Wang
  • Yu Wu
  • Shujie Liu
  • Zhenglu Yang
  • Ming Zhou

End-to-end speech translation, a hot topic in recent years, aims to translate a segment of audio into a specific language with an end-to-end model. Conventional approaches employ multi-task learning and pre-training methods for this task, but they suffer from the huge gap between pre-training and fine-tuning. To address these issues, we propose a Tandem Connectionist Encoding Network (TCEN) which bridges the gap by reusing all subnets in fine-tuning, keeping the roles of subnets consistent, and pre-training the attention module. Furthermore, we propose two simple but effective methods to guarantee the speech encoder outputs and the MT encoder inputs are consistent in terms of semantic representation and sequence length. Experimental results show that our model leads to significant improvements in En-De and En-Fr translation irrespective of the backbones.

AAAI Conference 2020 Conference Paper

RobuTrans: A Robust Transformer-Based Text-to-Speech Model

  • Naihan Li
  • Yanqing Liu
  • Yu Wu
  • Shujie Liu
  • Sheng Zhao
  • Ming Liu

Recently, neural network based speech synthesis has achieved outstanding results, by which the synthesized audios are of excellent quality and naturalness. However, current neural TTS models suffer from the robustness issue, which results in abnormal audios (bad cases) especially for unusual text (unseen context). To build a neural model which can synthesize both natural and stable audios, in this paper, we make a deep analysis of why the previous neural TTS models are not robust, based on which we propose RobuTrans (Robust Transformer), a robust neural TTS model based on Transformer. Compared to TransformerTTS, our model first converts input texts to linguistic features, including phonemic features and prosodic features, and then feeds them to the encoder. In the decoder, the encoder-decoder attention is replaced with a duration-based hard attention mechanism, and the causal self-attention is replaced with a "pseudo non-causal attention" mechanism to model the holistic information of the input. Besides, the position embedding is replaced with a 1-D CNN, since it constrains the maximum length of synthesized audio. With these modifications, our model not only fixes the robustness problem, but also achieves an on-par MOS (4.36) with TransformerTTS (4.37) and Tacotron2 (4.37) on our general set.

AAAI Conference 2020 Conference Paper

Symbiotic Attention with Privileged Information for Egocentric Action Recognition

  • Xiaohan Wang
  • Yu Wu
  • Linchao Zhu
  • Yi Yang

Egocentric video recognition is a natural testbed for diverse interaction reasoning. Due to the large action vocabulary in egocentric video datasets, recent studies usually utilize a two-branch structure for action recognition, i.e., one branch for verb classification and the other branch for noun classification. However, the correlation between the verb and noun branches has been largely ignored. Besides, the two branches fail to exploit local features due to the absence of a position-aware attention mechanism. In this paper, we propose a novel Symbiotic Attention framework leveraging Privileged information (SAP) for egocentric video recognition. Finer position-aware object detection features can facilitate the understanding of the actor's interaction with the object. We introduce these features in action recognition and regard them as privileged information. Our framework enables mutual communication among the verb branch, the noun branch, and the privileged information. This communication process not only injects local details into global features, but also exploits implicit guidance about the spatio-temporal position of an on-going action. We introduce a novel symbiotic attention (SA) to enable effective communication. It first normalizes the detection-guided features on one branch to underline the action-relevant information from the other branch. SA adaptively enhances the interactions among the three sources. To further catalyze this communication, spatial relations are uncovered for the selection of the most action-relevant information. It identifies the most valuable and discriminative feature for classification. We validate the effectiveness of our SAP quantitatively and qualitatively. Notably, it achieves the state-of-the-art on two large-scale egocentric video datasets.

AAAI Conference 2019 Conference Paper

Dictionary-Guided Editing Networks for Paraphrase Generation

  • Shaohan Huang
  • Yu Wu
  • Furu Wei
  • Zhongzhi Luan

An intuitive way for a human to write paraphrase sentences is to replace words or phrases in the original sentence with their corresponding synonyms and make necessary changes to ensure the new sentences are fluent and grammatically correct. We propose a novel approach to modeling this process with dictionary-guided editing networks, which effectively conduct rewriting on the source sentence to generate paraphrase sentences. It jointly learns the selection of appropriate word-level and phrase-level paraphrase pairs in the context of the original sentence from an off-the-shelf dictionary as well as the generation of fluent natural language sentences. Specifically, the system retrieves a set of word-level and phrase-level paraphrase pairs derived from the Paraphrase Database (PPDB) for the original sentence, which is used to guide the decision of which words might be deleted or inserted with the soft attention mechanism under the sequence-to-sequence framework. We conduct experiments on two benchmark datasets for paraphrase generation, namely the MSCOCO and Quora datasets. The automatic evaluation results demonstrate that our dictionary-guided editing networks outperform the baseline methods. Human evaluation results indicate that the generated paraphrases are grammatically correct and relevant to the input sentence.

AAAI Conference 2019 Conference Paper

Response Generation by Context-Aware Prototype Editing

  • Yu Wu
  • Furu Wei
  • Shaohan Huang
  • Yunli Wang
  • Zhoujun Li
  • Ming Zhou

Open domain response generation has achieved remarkable progress in recent years, but sometimes yields short and uninformative responses. We propose a new paradigm, prototype-then-edit, for response generation, which first retrieves a prototype response from a pre-defined index and then edits the prototype response according to the differences between the prototype context and the current context. Our motivation is that the retrieved prototype provides a good starting point for generation because it is grammatical and informative, and the post-editing process further improves the relevance and coherence of the prototype. In practice, we design a context-aware editing model that is built upon an encoder-decoder framework augmented with an editing vector. We first generate an edit vector by considering lexical differences between a prototype context and the current context. After that, the edit vector and the prototype response representation are fed to a decoder to generate a new response. Experimental results on a large-scale dataset demonstrate that our new paradigm significantly increases the relevance, diversity and originality of generation results, compared to traditional generative models. Furthermore, our model outperforms retrieval-based methods in terms of relevance and originality.
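A toy rendering of the edit-vector construction described above: embed the words inserted into and deleted from the prototype context and combine them. This is an illustrative reading only; the paper's edit vector is built with attention inside the encoder-decoder, and `embed` is a hypothetical word-embedding lookup.

```python
import numpy as np

def edit_vector(proto_ctx: list, cur_ctx: list, embed: dict, dim: int = 300):
    """Toy edit vector: average embeddings of words inserted into the current
    context and of words deleted from the prototype context, then concatenate.
    Not the authors' exact attention-based formulation."""
    inserted = [w for w in cur_ctx if w not in proto_ctx]
    deleted = [w for w in proto_ctx if w not in cur_ctx]

    def avg(words):
        vecs = [embed[w] for w in words if w in embed]
        return np.mean(vecs, axis=0) if vecs else np.zeros(dim)

    return np.concatenate([avg(inserted), avg(deleted)])
```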

AAAI Conference 2018 Conference Paper

Hierarchical Recurrent Attention Network for Response Generation

  • Chen Xing
  • Yu Wu
  • Wei Wu
  • Yalou Huang
  • Ming Zhou

We study multi-turn response generation in chatbots where a response is generated according to a conversation context. Existing work has modeled the hierarchy of the context, but does not pay enough attention to the fact that words and utterances in the context are differentially important. As a result, they may lose important information in context and generate irrelevant responses. We propose a hierarchical recurrent attention network (HRAN) to model both the hierarchy and the importance variance in a unified framework. In HRAN, a hierarchical attention mechanism attends to important parts within and among utterances with word level attention and utterance level attention respectively. Empirical studies on both automatic evaluation and human judgment show that HRAN can significantly outperform state-of-the-art models for context based response generation.

AAAI Conference 2018 Conference Paper

Knowledge Enhanced Hybrid Neural Network for Text Matching

  • Yu Wu
  • Wei Wu
  • Can Xu
  • Zhoujun Li

Long text brings a big challenge to neural network based text matching approaches due to their complicated structures. To tackle the challenge, we propose a knowledge enhanced hybrid neural network (KEHNN) that leverages prior knowledge to identify useful information and filter out noise in long text and performs matching from multiple perspectives. The model fuses prior knowledge into word representations by knowledge gates and establishes three matching channels with words, sequential structures of text given by Gated Recurrent Units (GRUs), and knowledge enhanced representations. The three channels are processed by a convolutional neural network to generate high level features for matching, and the features are synthesized as a matching score by a multilayer perceptron. In this paper, we focus on exploring the use of taxonomy knowledge for text matching. Evaluation results from extensive experiments on public data sets of question answering and conversation show that KEHNN can significantly outperform state-of-the-art matching models and particularly improve matching accuracy on pairs with long text.
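The knowledge-gate fusion described above can be sketched in one plausible form: a sigmoid gate computed from the word and knowledge vectors blends knowledge into the word representation. This is a schematic reading of the abstract, not KEHNN's exact parameterization; `W` and `b` are hypothetical gate parameters.

```python
import torch

def knowledge_gate(word_emb: torch.Tensor, knowledge_emb: torch.Tensor,
                   W: torch.Tensor, b: torch.Tensor) -> torch.Tensor:
    """Blend prior knowledge into a word representation with an element-wise
    gate. word_emb, knowledge_emb: (..., d); W: (2d, d); b: (d,).
    One plausible form only, shown for illustration."""
    g = torch.sigmoid(torch.cat([word_emb, knowledge_emb], dim=-1) @ W + b)
    return g * word_emb + (1 - g) * knowledge_emb
```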

AAAI Conference 2018 Conference Paper

Neural Response Generation With Dynamic Vocabularies

  • Yu Wu
  • Wei Wu
  • Dejian Yang
  • Can Xu
  • Zhoujun Li

We study response generation for open domain conversation in chatbots. Existing methods assume that words in responses are generated from an identical vocabulary regardless of their inputs, which not only makes them vulnerable to generic patterns and irrelevant noise, but also causes a high cost in decoding. We propose a dynamic vocabulary sequence-to-sequence (DVS2S) model which allows each input to possess its own vocabulary in decoding. In training, vocabulary construction and response generation are jointly learned by maximizing a lower bound of the true objective with a Monte Carlo sampling method. In inference, the model dynamically allocates a small vocabulary for an input with the word prediction model, and conducts decoding only with the small vocabulary. Because of the dynamic vocabulary mechanism, DVS2S eludes many generic patterns and irrelevant words in generation, and enjoys efficient decoding at the same time. Experimental results on both automatic metrics and human annotations show that DVS2S can significantly outperform state-of-the-art methods in terms of response quality, while requiring only 60% of the decoding time of the most efficient baseline.
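The inference-time effect of a dynamic vocabulary can be sketched as masking the decoder's logits down to an input-specific word set. This sketch shows only that behaviour; in DVS2S the small vocabulary itself comes from a learned word-prediction model, and efficiency comes from computing only the allowed logits rather than masking a full softmax.

```python
import torch

def dynamic_vocab_logits(logits: torch.Tensor, allowed_ids: torch.Tensor) -> torch.Tensor:
    """Restrict decoding to a small, input-specific vocabulary by masking all
    other logits to -inf, so softmax/argmax only considers the allowed words.
    Illustrative sketch of the inference-time behaviour described above."""
    mask = torch.full_like(logits, float("-inf"))
    mask[..., allowed_ids] = 0.0
    return logits + mask
```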

AAAI Conference 2017 Conference Paper

Topic Aware Neural Response Generation

  • Chen Xing
  • Wei Wu
  • Yu Wu
  • Jie Liu
  • Yalou Huang
  • Ming Zhou
  • Wei-Ying Ma

We consider incorporating topic information into a sequence-to-sequence framework to generate informative and interesting responses for chatbots. To this end, we propose a topic aware sequence-to-sequence (TA-Seq2Seq) model. The model utilizes topics to simulate prior human knowledge that guides people to form informative and interesting responses in conversation, and leverages topic information in generation by a joint attention mechanism and a biased generation probability. The joint attention mechanism summarizes the hidden vectors of an input message as context vectors by message attention and synthesizes topic vectors by topic attention from the topic words of the message obtained from a pre-trained LDA model, with these vectors jointly affecting the generation of words in decoding. To increase the possibility of topic words appearing in responses, the model modifies the generation probability of topic words by adding an extra probability item to bias the overall distribution. Empirical studies on both automatic evaluation metrics and human annotations show that TA-Seq2Seq can generate more informative and interesting responses, significantly outperforming state-of-the-art response generation models.
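The biased generation probability, adding an extra probability item for topic words and renormalizing, can be sketched as follows. This is a schematic version of the idea in the abstract; the tensor shapes and the way the extra mass is computed are assumptions.

```python
import torch

def biased_topic_distribution(gen_probs: torch.Tensor, topic_probs: torch.Tensor,
                              topic_ids: torch.Tensor) -> torch.Tensor:
    """Add extra probability mass to topic words and renormalize.
    gen_probs:   (V,) standard decoder distribution over the vocabulary
    topic_probs: (K,) extra mass assigned to the K topic words of the message
    topic_ids:   (K,) vocabulary indices of those topic words
    Schematic rendering of TA-Seq2Seq's biased generation probability."""
    out = gen_probs.clone()
    out[topic_ids] = out[topic_ids] + topic_probs
    return out / out.sum()
```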

AAAI Conference 2016 Conference Paper

Improving Recommendation of Tail Tags for Questions in Community Question Answering

  • Yu Wu
  • Wei Wu
  • Zhoujun Li
  • Ming Zhou

We study tag recommendation for questions in community question answering (CQA). Tags, which represent the semantic summarization of questions, are useful for navigation and expert finding in CQA and can facilitate content consumption such as searching and mining on these web sites. The task is challenging, as both questions and tags are short and a large fraction of tags are tail tags that occur very infrequently. To solve these problems, we propose matching questions and tags not only by themselves, but also by similar questions and similar tags. The idea is then formalized as a model in which we calculate question-tag similarity using a linear combination of similarity with similar questions and tags, weighted by tag importance. Question similarity, tag similarity, and tag importance are learned in a supervised random walk framework by fusing multiple features. Our model thus can not only accurately identify question-tag similarity for head tags, but also improve the accuracy of recommendation of tail tags. Experimental results show that the proposed method significantly outperforms state-of-the-art methods on tag recommendation for questions. Particularly, it improves tail tag recommendation accuracy by a large margin.

IJCAI Conference 2015 Conference Paper

Inapproximability of Treewidth and Related Problems (Extended Abstract)

  • Yu Wu
  • Per Austrin
  • Toniann Pitassi
  • David Liu

Graphical models, such as Bayesian networks and Markov networks, play an important role in artificial intelligence and machine learning. Inference is a central problem to be solved on these networks. This, and other problems on these graph models, are often known to be hard to solve in general, but tractable on graphs with bounded treewidth. Therefore, finding or approximating the treewidth of a graph is a fundamental problem related to inference in graphical models. In this paper, we study the approximability of a number of graph problems: Treewidth and Pathwidth of graphs, Minimum Fill-In, and a variety of different graph layout problems such as Minimum Cut Linear Arrangement. We show that, assuming the Small Set Expansion Conjecture, all of these problems are NP-hard to approximate to within any constant factor in polynomial time. This paper is an extended abstract of the Journal of Artificial Intelligence Research article [Wu et al., 2014].

AAAI Conference 2015 Conference Paper

Mining Query Subtopics from Questions in Community Question Answering

  • Yu Wu
  • Wei Wu
  • Zhoujun Li
  • Ming Zhou

This paper proposes mining query subtopics from questions in community question answering (CQA). The subtopics are represented as a number of clusters of questions with keywords summarizing the clusters. The task is unique in that the subtopics from questions can not only facilitate user browsing in CQA search, but also describe aspects of queries from a question-answering perspective. The challenges of the task include how to group semantically similar questions and how to find keywords capable of summarizing the clusters. We formulate the subtopic mining task as a non-negative matrix factorization (NMF) problem and further extend the model of NMF to incorporate question similarity estimated from metadata of CQA into learning. Compared with existing methods, our method can jointly optimize question clustering and keyword extraction and encourage the former task to enhance the latter. Experimental results on large scale real world CQA datasets show that the proposed method significantly outperforms the existing methods in terms of keyword extraction, while achieving a comparable performance to the state-of-the-art methods for question clustering.
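A plain NMF baseline for the clustering-plus-keyword step looks like the sketch below: factorize the question-term matrix, assign each question to its strongest factor, and read keywords off the factor-term weights. The paper's model additionally incorporates question similarity from CQA metadata into the factorization, which this sketch omits.

```python
import numpy as np
from sklearn.feature_extraction.text import TfidfVectorizer
from sklearn.decomposition import NMF

def mine_subtopics(questions, n_subtopics=5, n_keywords=5):
    """Baseline subtopic mining with vanilla NMF (not the paper's extended model).
    Returns a cluster id per question and the top keywords per subtopic."""
    vec = TfidfVectorizer(stop_words="english")
    X = vec.fit_transform(questions)                   # (num_questions, num_terms)
    model = NMF(n_components=n_subtopics, init="nndsvda", random_state=0)
    W = model.fit_transform(X)                         # question-subtopic weights
    H = model.components_                              # subtopic-term weights
    terms = np.array(vec.get_feature_names_out())
    clusters = W.argmax(axis=1)
    keywords = [terms[np.argsort(h)[::-1][:n_keywords]].tolist() for h in H]
    return clusters, keywords
```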

TIST Journal 2013 Journal Article

A machine learning approach to college drinking prediction and risk factor identification

  • Jinbo Bi
  • Jiangwen Sun
  • Yu Wu
  • Howard Tennen
  • Stephen Armeli

Alcohol misuse is one of the most serious public health problems facing adolescents and young adults in the United States. National statistics show that nearly 90% of alcohol consumed by youth under 21 years of age involves binge drinking and 44% of college students engage in high-risk drinking activities. Conventional alcohol intervention programs, which aim at instilling either an alcohol reduction norm or a prohibition against underage drinking, have yielded little progress in controlling college binge drinking over the years. Existing alcohol studies are deductive: data are collected to investigate a psychological/behavioral hypothesis, and statistical analysis is applied to the data to confirm the hypothesis. Due to this confirmatory manner of analysis, the resulting statistical models are cohort-specific and typically fail to replicate on a different sample. This article presents two machine learning approaches for a secondary analysis of longitudinal data collected in college alcohol studies sponsored by the National Institute on Alcohol Abuse and Alcoholism. Our approach aims to discover knowledge, from multiwave cohort-sequential daily data, which may or may not align with the original hypothesis but quantifies predictive models with higher likelihood to generalize to new samples. We first propose a so-called temporally-correlated support vector machine to construct a classifier as a function of daily moods, stress, and drinking expectancies to distinguish days with nighttime binge drinking from days without for individual students. We then propose a combination of cluster analysis and feature selection, where cluster analysis is used to identify drinking patterns based on averaged daily drinking behavior and feature selection is used to identify risk factors associated with each pattern. We evaluate our methods on two cohorts of 530 total college students recruited during the Spring and Fall semesters, respectively. Cross validation on these two cohorts and further on 100 random partitions of the total students demonstrates that our methods improve model generalizability in comparison with traditional multilevel logistic regression. The discovered risk factors and the interaction of these factors delineated in our models can set a potential basis and offer insights for a new design of more effective college alcohol interventions.

ICML Conference 2013 Conference Paper

Learning Fair Representations

  • Richard S. Zemel
  • Yu Wu
  • Kevin Swersky
  • Toniann Pitassi
  • Cynthia Dwork

We propose a learning algorithm for fair classification that achieves both group fairness (the proportion of members in a protected group receiving positive classification is identical to the proportion in the population as a whole) and individual fairness (similar individuals should be treated similarly). We formulate fairness as an optimization problem of finding a good representation of the data with two competing goals: to encode the data as well as possible, while simultaneously obfuscating any information about membership in the protected group. We show positive results of our algorithm relative to other known techniques on three datasets. Moreover, we demonstrate several advantages of our approach. First, our intermediate representation can be used for other classification tasks (i.e., transfer learning is possible); second, we take a step toward learning a distance metric which can find important dimensions of the data for classification.
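Schematically, the optimization described above balances three competing terms in one weighted objective. The notation below is ours, following the structure implied by the abstract rather than the paper's exact equations: L_x penalizes poor reconstruction of the data from the representation, L_y penalizes classification error on the label, L_z penalizes differences between how the protected and unprotected groups are mapped to the representation, and the A coefficients are trade-off weights.

```latex
% Schematic combined objective for learning fair representations
% (notation ours; see the paper for the exact definition of each term).
L \;=\; A_z \, L_z \;+\; A_x \, L_x \;+\; A_y \, L_y
```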