Arrow Research search

Author name cluster

Yuwei Wu

Possible papers associated with this exact author name in Arrow. This page groups case-insensitive exact name matches and is not a full identity disambiguation profile.

29 papers
2 author rows

Possible papers

29

AAAI Conference 2026 Conference Paper

Composition-Incremental Learning for Compositional Generalization

  • Zhen Li
  • Yuwei Wu
  • Chenchen Jing
  • Che Sun
  • Chuanhao Li
  • Yunde Jia

Compositional generalization in computer vision has made substantial progress on pre-collected training data. Nonetheless, real-world data continually emerges, and the possible compositions are nearly infinite, long-tailed, and never fully observable in advance. An ideal model should therefore gradually improve its compositional generalization capability in an incremental manner. In this paper, we explore Composition-Incremental Learning for Compositional Generalization (CompIL) in the context of the compositional zero-shot learning (CZSL) task, where models need to continually learn new compositions with the aim of progressively improving their compositional generalization capability. To quantitatively evaluate CompIL, we develop a benchmark construction pipeline leveraging existing datasets, yielding MIT-States-CompIL and C-GQA-CompIL. Furthermore, we propose a pseudo-replay framework that uses a visual synthesizer to synthesize visual representations of learned compositions and a linguistic primitive distillation mechanism to keep primitive representations aligned across the learning process. Extensive experiments demonstrate the effectiveness of the proposed framework.

AAAI Conference 2026 Conference Paper

LongSplat: Online Generalizable 3D Gaussian Splatting from Long Sequence Images

  • Guichen Huang
  • Ruoyu Wang
  • Xiangjun Gao
  • Che Sun
  • Yuwei Wu
  • Shenghua Gao
  • Yunde Jia

3D Gaussian Splatting (3DGS) achieves high-fidelity novel view synthesis, but its application in online long-sequence scenarios is still restricted. Existing methods either rely on slow per-scene optimization or lack efficient frame-wise 3DGS updates, making them unsuitable for online long-sequence videos. In this paper, we propose LongSplat, an online real-time 3D Gaussian reconstruction framework designed for long-sequence image input. The core idea of LongSplat is to maintain a global 3DGS set and design a streaming 3DGS update mechanism that selectively compresses redundant historical Gaussians and introduces new Gaussians by comparing the current observations with the historical Gaussians. To achieve this goal, we design a Gaussian-Image Representation (GIR), which encodes 3D Gaussian parameters into a structured, image-like 2D format. GIR simultaneously enables identity-aware redundancy compression and the fusion of current-view and historical Gaussians, which are used for online reconstruction and adapt the model to long sequences without overwhelming memory or computational cost. Extensive experiments demonstrate that LongSplat achieves state-of-the-art efficiency-quality trade-offs in real-time novel view synthesis, delivering real-time reconstruction while reducing Gaussian counts by 44% compared to per-pixel prediction paradigms.
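
The abstract's Gaussian-Image Representation packs per-Gaussian parameters into a structured, image-like 2D grid. Below is a minimal sketch of that packing idea; the field layout, channel count, and grid size are assumptions for illustration, not LongSplat's actual GIR format.

```python
import numpy as np

def pack_gaussians_to_image(gaussians: dict, height: int, width: int) -> np.ndarray:
    """Pack per-Gaussian parameters into an image-like (H, W, C) grid.

    `gaussians` is assumed to hold arrays of shape (N, d):
    positions (N, 3), rotations (N, 4), scales (N, 3),
    opacities (N, 1), colors (N, 3) -- 14 channels total.
    Illustrative layout only, not the paper's actual GIR format.
    """
    fields = ["positions", "rotations", "scales", "opacities", "colors"]
    params = np.concatenate([gaussians[f] for f in fields], axis=1)  # (N, 14)
    n, c = params.shape
    assert n <= height * width, "grid too small for the Gaussian set"
    grid = np.zeros((height * width, c), dtype=params.dtype)
    grid[:n] = params                      # one pixel per Gaussian, rest left empty
    return grid.reshape(height, width, c)  # image-like 2D representation

# Example: 1,000 random Gaussians packed into a 32x32 grid
rng = np.random.default_rng(0)
g = {
    "positions": rng.normal(size=(1000, 3)),
    "rotations": rng.normal(size=(1000, 4)),
    "scales": rng.random((1000, 3)),
    "opacities": rng.random((1000, 1)),
    "colors": rng.random((1000, 3)),
}
gir = pack_gaussians_to_image(g, 32, 32)
print(gir.shape)  # (32, 32, 14)
```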

NeurIPS Conference 2025 Conference Paper

3D Visual Illusion Depth Estimation

  • Chengtang Yao
  • Zhidan Liu
  • Jiaxi Zeng
  • Lidong Yu
  • Yuwei Wu
  • Yunde Jia

3D visual illusion is a perceptual phenomenon where a two-dimensional plane is manipulated to simulate three-dimensional spatial relationships, making a flat artwork or object look three-dimensional in the human visual system. In this paper, we reveal that machine visual systems are also seriously fooled by 3D visual illusions, in both monocular and binocular depth estimation. To explore and analyze the impact of 3D visual illusions on depth estimation, we collect a large dataset containing almost 3k scenes and 200k images to train and evaluate SOTA monocular and binocular depth estimation methods. We also propose a 3D visual illusion depth estimation framework that uses common sense from a vision-language model to adaptively fuse depth from binocular disparity and monocular depth. Experiments show that SOTA monocular, binocular, and multi-view depth estimation approaches are all fooled by various 3D visual illusions, while our method achieves SOTA performance.
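
The fusion step described above combines binocular-disparity depth with monocular depth under external guidance. Here is a minimal sketch of confidence-weighted fusion; the per-pixel weight map stands in for the VLM-derived common sense, which is an assumption and not shown.

```python
import numpy as np

def fuse_depth(mono_depth: np.ndarray, stereo_depth: np.ndarray,
               stereo_weight: np.ndarray) -> np.ndarray:
    """Per-pixel convex combination of monocular and binocular depth.

    `stereo_weight` in [0, 1] plays the role of the guidance signal
    (e.g., low on flat artwork that only *looks* 3D); in the paper this
    guidance comes from a vision-language model, which is omitted here.
    """
    stereo_weight = np.clip(stereo_weight, 0.0, 1.0)
    return stereo_weight * stereo_depth + (1.0 - stereo_weight) * mono_depth

# Toy example: trust stereo everywhere except a "painted" illusion region
h, w = 4, 4
mono = np.full((h, w), 2.0)     # monocular depth (meters)
stereo = np.full((h, w), 5.0)   # binocular depth fooled by an illusion
weight = np.ones((h, w)); weight[1:3, 1:3] = 0.0   # illusion region
print(fuse_depth(mono, stereo, weight))
```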

NeurIPS Conference 2025 Conference Paper

Beyond the Seen: Bounded Distribution Estimation for Open-Vocabulary Learning

  • Xiaomeng Fan
  • Yuchuan Mao
  • Zhi Gao
  • Yuwei Wu
  • Jin Chen
  • Yunde Jia

Open-vocabulary learning requires modeling the data distribution in open environments, which consists of both seen-class and unseen-class data. Existing methods estimate the distribution in open environments using seen-class data, where the absence of unseen classes makes the estimation error inherently unidentifiable. Intuitively, learning beyond the seen classes is crucial for distribution estimation to bound the estimation error. We theoretically demonstrate that the distribution can be effectively estimated by generating unseen-class data, through which the estimation error is upper-bounded. Building on this theoretical insight, we propose a novel open-vocabulary learning method, which generates unseen-class data for estimating the distribution in open environments. The method consists of a class-domain-wise data generation pipeline and a distribution alignment algorithm. The data generation pipeline generates unseen-class data under the guidance of a hierarchical semantic tree and domain information inferred from the seen-class data, facilitating accurate distribution estimation. With the generated data, the distribution alignment algorithm estimates and maximizes the posterior probability to enhance generalization in open-vocabulary learning. Extensive experiments on 11 datasets demonstrate that our method outperforms baseline approaches by up to 14%, highlighting its effectiveness and superiority.

AAAI Conference 2025 Conference Paper

Consistency of Compositional Generalization Across Multiple Levels

  • Chuanhao Li
  • Zhen Li
  • Chenchen Jing
  • Xiaomeng Fan
  • Wenbo Ye
  • Yuwei Wu
  • Yunde Jia

Compositional generalization is the capability of a model to understand novel compositions composed of seen concepts. There are multiple levels of novel compositions, including the phrase-phrase, phrase-word, and word-word levels. Existing methods achieve promising compositional generalization, but the consistency of compositional generalization across multiple levels of novel compositions remains unexplored. Consistency means that a model should simultaneously generalize to a phrase-phrase level novel composition and to the phrase-word/word-word level novel compositions that can be derived from it. In this paper, we propose a meta-learning based framework for achieving consistent compositional generalization across multiple levels. The basic idea is to progressively learn compositions from simple to complex for consistency. Specifically, we divide the original training set into multiple validation sets based on compositional complexity, and introduce multiple meta-weight-nets to generate sample weights for samples in different validation sets. To fit the validation sets in order of increasing compositional complexity, we optimize the parameters of each meta-weight-net independently and sequentially in a multilevel optimization manner. We build a GQA-CCG dataset to quantitatively evaluate the consistency. Experimental results on visual question answering and temporal video grounding demonstrate the effectiveness of the proposed framework.

NeurIPS Conference 2025 Conference Paper

Iterative Tool Usage Exploration for Multimodal Agents via Step-wise Preference Tuning

  • Pengxiang Li
  • Zhi Gao
  • Bofei Zhang
  • Yapeng Mi
  • Xiaojian (Shawn) Ma
  • Chenrui Shi
  • Tao Yuan
  • Yuwei Wu

Multimodal agents, which integrate a controller (e.g., a vision language model) with external tools, have demonstrated remarkable capabilities in tackling complex multimodal tasks. Existing approaches for training these agents, both supervised fine-tuning and reinforcement learning, depend on extensive human-annotated task-answer pairs and tool trajectories. However, for complex multimodal tasks, such annotations are prohibitively expensive or impractical to obtain. In this paper, we propose an iterative tool usage exploration method for multimodal agents without any pre-collected data, namely SPORT, via step-wise preference optimization to refine the trajectories of tool usage. Our method enables multimodal agents to autonomously discover effective tool usage strategies through self-exploration and optimization, eliminating the bottleneck of human annotation. SPORT has four iterative components: task synthesis, step sampling, step verification, and preference tuning. We first synthesize multimodal tasks using language models. Then, we introduce a novel trajectory exploration scheme, where step sampling and step verification are executed alternately to solve synthesized tasks. In step sampling, the agent tries different tools and obtains corresponding results. In step verification, we employ a verifier to provide AI feedback to construct step-wise preference data. The data is subsequently used to update the controller for tool usage through preference tuning, producing a SPORT agent. By interacting with real environments, the SPORT agent gradually evolves into a more refined and capable system. Evaluation on the GTA and GAIA benchmarks shows that the SPORT agent achieves 6.41% and 3.64% improvements, underscoring the generalization and effectiveness introduced by our method.
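
Step verification yields preferred versus rejected tool-use steps, which are then used for preference tuning. Below is a minimal sketch of a generic DPO-style step-wise preference loss; this is an illustrative formulation, not SPORT's exact objective, and the log-probabilities are assumed to be summed over the tokens of a single step.

```python
import torch
import torch.nn.functional as F

def stepwise_dpo_loss(logp_chosen: torch.Tensor,
                      logp_rejected: torch.Tensor,
                      ref_logp_chosen: torch.Tensor,
                      ref_logp_rejected: torch.Tensor,
                      beta: float = 0.1) -> torch.Tensor:
    """DPO-style objective on step-level preference pairs.

    Each tensor holds the summed log-probability of one tool-use step
    under the policy being tuned (`logp_*`) or a frozen reference model
    (`ref_logp_*`). Verified-good steps are "chosen", bad ones "rejected".
    """
    chosen_margin = logp_chosen - ref_logp_chosen
    rejected_margin = logp_rejected - ref_logp_rejected
    return -F.logsigmoid(beta * (chosen_margin - rejected_margin)).mean()

# Toy batch of 3 step-level preference pairs
torch.manual_seed(0)
loss = stepwise_dpo_loss(torch.randn(3), torch.randn(3),
                         torch.randn(3), torch.randn(3))
print(loss.item())
```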

ICML Conference 2025 Conference Paper

Large Language Models are Demonstration Pre-Selectors for Themselves

  • Jiarui Jin
  • Yuwei Wu
  • Haoxuan Li 0001
  • Xiaoting He
  • Weinan Zhang 0001
  • Yiming Yang
  • Yong Yu 0001
  • Jun Wang 0012

In-context learning with large language models (LLMs) delivers strong few-shot performance by choosing few-shot demonstrations from the entire training dataset. However, previous few-shot in-context learning methods, which calculate similarity scores for choosing demonstrations, incur high computational costs by repeatedly retrieving from large-scale datasets for each query. This is due to their failure to recognize that not all demonstrations are equally informative, and many less informative demonstrations can be inferred from a core set of highly informative ones. To this end, we propose FEEDER (FEw yet Essential Demonstration prE-selectoR), a novel pre-selection framework that identifies a core subset of demonstrations containing the most informative examples. This subset, referred to as the FEEDER set, consists of demonstrations that capture both the "sufficiency" and "necessity" information needed to infer the entire dataset. Note that FEEDER is selected before few-shot in-context learning, enabling more efficient demonstration selection from a smaller set. To identify FEEDER, we propose a novel and effective tree-based algorithm. Once selected, it can replace the original dataset, leading to improved efficiency and prediction accuracy in few-shot in-context learning. Additionally, FEEDER also benefits LLM fine-tuning: we propose a bi-level optimization method that enables more efficient training without sacrificing performance as datasets become smaller. Our experiments on 6 text classification datasets, 1 reasoning dataset, and 1 semantic-parsing dataset, across 6 LLMs (ranging from 335M to 7B parameters), demonstrate that: (i) In few-shot inference, FEEDER achieves superior (or comparable) performance while utilizing only half the input training data. (ii) In fine-tuning, FEEDER significantly boosts the performance of LLMs.
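
FEEDER's actual selection uses a tree-based algorithm over sufficiency and necessity, which the abstract does not detail. As a rough, swapped-in illustration of the general pre-selection idea only, here is a greedy coverage heuristic over embedding similarity; it is not the paper's algorithm.

```python
import numpy as np

def greedy_coverage_preselect(embeddings: np.ndarray, k: int,
                              sim_threshold: float = 0.8) -> list[int]:
    """Pick up to k demonstrations so that every example is 'covered'
    (cosine similarity >= threshold) by some selected one.

    A greedy max-coverage heuristic standing in for FEEDER's tree-based
    sufficiency/necessity selection -- illustration only.
    """
    x = embeddings / np.linalg.norm(embeddings, axis=1, keepdims=True)
    sim = x @ x.T
    covered = np.zeros(len(x), dtype=bool)
    selected = []
    for _ in range(k):
        gains = (sim >= sim_threshold) & ~covered[None, :]
        best = int(gains.sum(axis=1).argmax())
        if gains[best].sum() == 0:
            break
        selected.append(best)
        covered |= gains[best]
    return selected

rng = np.random.default_rng(0)
demos = rng.normal(size=(200, 64))      # toy demonstration embeddings
core = greedy_coverage_preselect(demos, k=20)
print(len(core), "core demonstrations selected")
```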

IJCAI Conference 2025 Conference Paper

Multi-Sourced Compositional Generalization in Visual Question Answering

  • Chuanhao Li
  • Wenbo Ye
  • Zhen Li
  • Yuwei Wu
  • Yunde Jia

Compositional generalization is the ability to generalize to novel compositions of seen primitives, and has received much attention in vision-and-language (V&L) recently. Due to the multi-modal nature of V&L tasks, the primitives composing a composition come from different modalities, resulting in multi-sourced novel compositions. However, the generalization ability over multi-sourced novel compositions, i.e., multi-sourced compositional generalization (MSCG), remains unexplored. In this paper, we explore MSCG in the context of visual question answering (VQA), and propose a retrieval-augmented training framework to enhance the MSCG ability of VQA models by learning unified representations for primitives from different modalities. Specifically, semantically equivalent primitives are retrieved for each primitive in the training samples, and the retrieved features are aggregated with the original primitive to refine the model. This process helps the model learn consistent representations for the same semantic primitives across different modalities. To evaluate the MSCG ability of VQA models, we construct a new GQA-MSCG dataset based on the GQA dataset, in which samples include three types of novel compositions composed of primitives from different modalities. The GQA-MSCG dataset is available at https://github.com/NeverMoreLCH/MSCG.

IROS Conference 2025 Conference Paper

OVSG-SLAM: Open-Vocabulary Semantic Gaussian Splatting SLAM

  • Zhehang Liu
  • Shishen Li
  • Guichen Huang
  • Yuwei Wu

Most conventional semantic SLAM approaches concentrate on maintaining 3D semantic consistency while overlooking their reliance on predefined semantic categories, ultimately limiting flexibility in scene understanding. We propose Open-Vocabulary Semantic Gaussian Splatting SLAM (OVSG-SLAM), an approach that integrates multi-modal perception and 3D Gaussian splatting into a semantic SLAM framework. By combining the advantages of Segment Anything (SAM) for open-vocabulary 2D scene understanding with the powerful feature extraction capabilities of vision-language models, our method eliminates the reliance on predefined closed-set categories. Although Vision-Language Models (VLMs) provide open-vocabulary reasoning, integrating them with 3D semantic SLAM poses challenges such as embedding ambiguity and computational overhead. To address these challenges, we present a feature embedding strategy called differentiable identity-aware encoding, which reduces computational cost while ensuring accurate semantic mapping. Furthermore, instead of using a traditional semantic loss, we optimize the scene representation through an identity loss. Extensive experimental evaluations on the Replica and ScanNet datasets demonstrate that the proposed method achieves state-of-the-art performance in mapping, tracking and 3D semantic segmentation tasks.

NeurIPS Conference 2025 Conference Paper

Sekai: A Video Dataset towards World Exploration

  • Zhen Li
  • Chuanhao Li
  • Xiaofeng Mao
  • Shaoheng Lin
  • Ming Li
  • Shitian Zhao
  • Zhaopan Xu
  • Xinyue Li

Video generation techniques have made remarkable progress, promising to be the foundation of interactive world exploration. However, existing video generation datasets are not well-suited for world exploration training as they suffer from several limitations: limited locations, short duration, static scenes, and a lack of annotations about exploration and the world. In this paper, we introduce Sekai (meaning "world" in Japanese), a high-quality first-person view worldwide video dataset with rich annotations for world exploration. It consists of over 5,000 hours of walking or drone-view (FPV and UAV) videos from over 100 countries and regions across 750 cities. We develop an efficient and effective toolbox to collect, pre-process and annotate videos with location, scene, weather, crowd density, captions, and camera trajectories. Comprehensive analyses and experiments demonstrate the dataset's scale, diversity, annotation quality, and effectiveness for training video generation models. We believe Sekai will benefit the area of video generation and world exploration, and motivate valuable applications.

AAAI Conference 2025 Conference Paper

World Knowledge-Enhanced Reasoning Using Instruction-Guided Interactor in Autonomous Driving

  • Mingliang Zhai
  • Cheng Li
  • Zengyuan Guo
  • Ningrui Yang
  • Xiameng Qin
  • Sanyuan Zhao
  • Junyu Han
  • Ji Tao

Multi-modal Large Language Models (MLLMs) with extensive world knowledge have revitalized autonomous driving, particularly in reasoning tasks within perceivable regions. However, when faced with perception-limited areas (dynamic or static occlusion regions), MLLMs struggle to effectively integrate perception ability with world knowledge for reasoning. These perception-limited regions can conceal crucial safety information, especially for vulnerable road users. In this paper, we propose a framework that aims to improve autonomous driving performance under perception-limited conditions by enhancing the integration of perception capabilities and world knowledge. Specifically, we propose a plug-and-play instruction-guided interaction module that bridges modality gaps and significantly reduces the input sequence length, allowing it to adapt effectively to multi-view video inputs. Furthermore, to better integrate world knowledge with driving-related tasks, we have collected and refined a large-scale multi-modal dataset that includes 2 million natural-language QA pairs and 1.7 million grounding-task samples. To evaluate the model's utilization of world knowledge, we introduce an object-level risk assessment dataset comprising 200K QA pairs, where the questions necessitate multi-step reasoning leveraging world knowledge for resolution. Extensive experiments validate the effectiveness of our proposed method.

NeurIPS Conference 2024 Conference Paper

FIRE: A Dataset for Feedback Integration and Refinement Evaluation of Multimodal Models

  • Pengxiang Li
  • Zhi Gao
  • Bofei Zhang
  • Tao Yuan
  • Yuwei Wu
  • Mehrtash Harandi
  • Yunde Jia
  • Song-Chun Zhu

Vision language models (VLMs) have achieved impressive progress in diverse applications, becoming a prevalent research direction. In this paper, we build FIRE, a feedback-refinement dataset consisting of 1.1M multi-turn conversations derived from 27 source datasets, empowering VLMs to spontaneously refine their responses based on user feedback across diverse tasks. To scale up the data collection, FIRE is collected in two components: FIRE-100K and FIRE-1M, where FIRE-100K is generated by GPT-4V, and FIRE-1M is freely generated via models trained on FIRE-100K. Then, we build FIRE-Bench, a benchmark to comprehensively evaluate the feedback-refining capability of VLMs, which contains 11K feedback-refinement conversations as the test data, two evaluation settings, and a model to provide feedback for VLMs. We develop the FIRE-LLaVA model by fine-tuning LLaVA on FIRE-100K and FIRE-1M; it shows remarkable feedback-refining capability on FIRE-Bench and outperforms untrained VLMs by 50%, enabling more efficient user-agent interactions and underscoring the significance of the FIRE dataset.

AAAI Conference 2024 Conference Paper

Residual Hyperbolic Graph Convolution Networks

  • Yangkai Xue
  • Jindou Dai
  • Zhipeng Lu
  • Yuwei Wu
  • Yunde Jia

Hyperbolic graph convolutional networks (HGCNs) have demonstrated strong representational capability for modeling hierarchically structured graphs. However, as in general GCNs, over-smoothing may occur as the number of model layers increases, limiting the representation capability of most current HGCN models. In this paper, we propose residual hyperbolic graph convolutional networks (R-HGCNs) to address the over-smoothing problem. We introduce a hyperbolic residual connection function to overcome the over-smoothing problem, and also theoretically prove the effectiveness of the hyperbolic residual function. Moreover, we use product manifolds and HyperDrop to facilitate the R-HGCNs. The distinctive features of the R-HGCNs are as follows: (1) The hyperbolic residual connection preserves the initial node information in each layer and adds a hyperbolic identity mapping to prevent node features from becoming indistinguishable. (2) Product manifolds in R-HGCNs are set up with different origin points in different components to facilitate the extraction of feature information from a wider range of perspectives, which enhances the representational capability of R-HGCNs. (3) HyperDrop adds multiplicative Gaussian noise into hyperbolic representations, such that perturbations can be added to alleviate the over-fitting problem without deconstructing the hyperbolic geometry. Experimental results demonstrate the effectiveness of R-HGCNs under various graph convolution layers and different structures of product manifolds.
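
A hyperbolic residual connection cannot use ordinary vector addition, since the Poincaré ball is not closed under it. Below is a minimal sketch using Möbius addition as the residual operation; this is a standard choice in the hyperbolic-learning literature, and the paper's exact residual function may differ.

```python
import torch

def mobius_add(x: torch.Tensor, y: torch.Tensor, c: float = 1.0) -> torch.Tensor:
    """Mobius addition x (+)_c y on the Poincare ball of curvature -c."""
    x2 = (x * x).sum(dim=-1, keepdim=True)
    y2 = (y * y).sum(dim=-1, keepdim=True)
    xy = (x * y).sum(dim=-1, keepdim=True)
    num = (1 + 2 * c * xy + c * y2) * x + (1 - c * x2) * y
    den = 1 + 2 * c * xy + c ** 2 * x2 * y2
    return num / den.clamp_min(1e-15)

def hyperbolic_residual(h_layer: torch.Tensor, h_input: torch.Tensor,
                        c: float = 1.0) -> torch.Tensor:
    """Residual connection that keeps the result on the Poincare ball:
    the layer output is Mobius-added to the (preserved) input features."""
    return mobius_add(h_input, h_layer, c)

# Toy check: the combined points stay inside the unit ball
torch.manual_seed(0)
layer_out = torch.rand(4, 8) * 0.3
layer_in = torch.rand(4, 8) * 0.3
out = hyperbolic_residual(layer_out, layer_in)
print(out.norm(dim=-1))  # all norms < 1
```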

NeurIPS Conference 2024 Conference Paper

SearchLVLMs: A Plug-and-Play Framework for Augmenting Large Vision-Language Models by Searching Up-to-Date Internet Knowledge

  • Chuanhao Li
  • Zhen Li
  • Chenchen Jing
  • Shuo Liu
  • Wenqi Shao
  • Yuwei Wu
  • Ping Luo
  • Yu Qiao

Large vision-language models (LVLMs), such as the LLaVA series, are ignorant of up-to-date knowledge because they cannot be updated frequently due to the large amount of resources required, and therefore fail in many cases. For example, an LVLM released in January 2024 would not know the singer of the theme song for the new Detective Conan movie, which was not released until April 2024. To solve the problem, a promising solution motivated by retrieval-augmented generation (RAG) is to provide LVLMs with up-to-date knowledge via internet search during inference, i.e., internet-augmented generation (IAG), which is already integrated in some closed-source commercial LVLMs such as GPT-4V. However, the specific mechanics underpinning them remain a mystery. In this paper, we propose a plug-and-play framework for augmenting existing LVLMs in handling visual question answering (VQA) about up-to-date knowledge, dubbed SearchLVLMs. A hierarchical filtering model is trained to effectively and efficiently find the most helpful content from the websites returned by a search engine to prompt LVLMs with up-to-date knowledge. To train the model and evaluate our framework's performance, we propose a pipeline to automatically generate news-related VQA samples to construct a dataset, dubbed UDK-VQA. A multi-model voting mechanism is introduced to label the usefulness of websites/content for VQA samples to construct the training set. Experimental results demonstrate the effectiveness of our framework, outperforming GPT-4o by ~30% in accuracy.

IJCAI Conference 2023 Conference Paper

Fast-StrucTexT: An Efficient Hourglass Transformer with Modality-guided Dynamic Token Merge for Document Understanding

  • Mingliang Zhai
  • Yulin Li
  • Xiameng Qin
  • Chen Yi
  • Qunyi Xie
  • Chengquan Zhang
  • Kun Yao
  • Yuwei Wu

Transformers achieve promising performance in document understanding because of their high effectiveness, but they still suffer from quadratic computational complexity in the sequence length. General efficient Transformers are challenging to adapt directly to document modeling: they are unable to handle the layout representation in documents (e.g., word, line, and paragraph) at different granularity levels and struggle to achieve a good trade-off between efficiency and performance. To tackle these concerns, we propose Fast-StrucTexT, an efficient multi-modal framework based on the StrucTexT algorithm with an hourglass transformer architecture, for visual document understanding. Specifically, we design a modality-guided dynamic token merging block that makes the model learn multi-granularity representations and prune redundant tokens. Additionally, we present a multi-modal interaction module called Symmetry Cross-Attention (SCA) to handle multi-modal fusion and efficiently guide the token merging. The SCA allows one modality input to act as the query and compute cross-attention with another modality in a dual phase. Extensive experiments on the FUNSD, SROIE, and CORD datasets demonstrate that our model achieves state-of-the-art performance and almost 1.9x faster inference than the state-of-the-art methods.
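
The Symmetry Cross-Attention module lets each modality attend to the other in a dual phase. Here is a minimal sketch using standard multi-head cross-attention applied in both directions; the dimensions and the absence of the token-merging logic are assumptions, and this is not Fast-StrucTexT's implementation.

```python
import torch
import torch.nn as nn

class SymmetryCrossAttention(nn.Module):
    """Dual-phase cross-attention: text queries attend to visual tokens
    and visual queries attend to text tokens, illustrating the idea of
    a symmetric multi-modal interaction module."""

    def __init__(self, dim: int = 256, heads: int = 4):
        super().__init__()
        self.text_to_visual = nn.MultiheadAttention(dim, heads, batch_first=True)
        self.visual_to_text = nn.MultiheadAttention(dim, heads, batch_first=True)

    def forward(self, text: torch.Tensor, visual: torch.Tensor):
        # Phase 1: one modality as query against the other as key/value.
        t, _ = self.text_to_visual(query=text, key=visual, value=visual)
        # Phase 2: roles are swapped, making the interaction symmetric.
        v, _ = self.visual_to_text(query=visual, key=text, value=text)
        return t, v

sca = SymmetryCrossAttention()
text_tokens = torch.randn(2, 50, 256)     # (batch, text length, dim)
visual_tokens = torch.randn(2, 196, 256)  # (batch, visual tokens, dim)
t_out, v_out = sca(text_tokens, visual_tokens)
print(t_out.shape, v_out.shape)
```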

AAAI Conference 2023 Conference Paper

Learning Event-Relevant Factors for Video Anomaly Detection

  • Che Sun
  • Chenrui Shi
  • Yunde Jia
  • Yuwei Wu

Most video anomaly detection methods discriminate events that deviate from normal patterns as anomalies. However, these methods are prone to interferences from event-irrelevant factors, such as background textures and object scale variations, incurring an increased false detection rate. In this paper, we propose to explicitly learn event-relevant factors to eliminate the interferences from event-irrelevant factors on anomaly predictions. To this end, we introduce a causal generative model to separate the event-relevant factors and event-irrelevant ones in videos, and learn the prototypes of event-relevant factors in a memory augmentation module. We design a causal objective function to optimize the causal generative model and develop a counterfactual learning strategy to guide anomaly predictions, which increases the influence of the event-relevant factors. The extensive experiments show the effectiveness of our method for video anomaly detection.

AAAI Conference 2022 Conference Paper

Efficient Riemannian Meta-Optimization by Implicit Differentiation

  • Xiaomeng Fan
  • Yuwei Wu
  • Zhi Gao
  • Yunde Jia
  • Mehrtash Harandi

To solve optimization problems with nonlinear constraints, the recently developed Riemannian meta-optimization methods, which train neural networks as optimizers to perform optimization on Riemannian manifolds, show promise. A key challenge is their heavy computational and memory burden: computing the meta-gradient with respect to the optimizer involves a series of time-consuming derivatives and requires storing large computation graphs in memory. In this paper, we propose an efficient Riemannian meta-optimization method that decouples the complex computation scheme from the meta-gradient. We derive Riemannian implicit differentiation to compute the meta-gradient by establishing a link between Riemannian optimization and the implicit function theorem. As a result, updating our optimizer depends only on the final two iterations, which speeds up our method and reduces the memory footprint significantly. We theoretically study the computational load and memory footprint of our method for long optimization trajectories, and conduct an empirical study to demonstrate the benefits of the proposed method. Evaluations of three optimization problems on different Riemannian manifolds show that our method achieves state-of-the-art performance in terms of convergence speed and the quality of optima.

NeurIPS Conference 2022 Conference Paper

Hyperbolic Feature Augmentation via Distribution Estimation and Infinite Sampling on Manifolds

  • Zhi Gao
  • Yuwei Wu
  • Yunde Jia
  • Mehrtash Harandi

Learning in hyperbolic spaces has attracted growing attention recently, owing to their capability to capture hierarchical structures of data. However, existing learning algorithms in the hyperbolic space tend to overfit when limited data is given. In this paper, we propose a hyperbolic feature augmentation method that generates diverse and discriminative features in the hyperbolic space to combat overfitting. We employ a wrapped hyperbolic normal distribution to model augmented features, and use a neural ordinary differential equation module that benefits from meta-learning to estimate the distribution. This is to reduce the bias of estimation caused by the scarcity of data. We also derive an upper bound of the augmentation loss, which enables us to train a hyperbolic model by using an infinite number of augmentations. Experiments on few-shot learning and continual learning tasks show that our method significantly improves the performance of hyperbolic algorithms in scarce data regimes.
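
A wrapped normal on the Poincaré ball can be sampled by drawing Gaussian vectors in the tangent space at the origin and pushing them onto the manifold with the exponential map. Below is a minimal sketch of that sampling step; the curvature fixed to 1, the origin as base point, and the fixed standard deviation are assumptions, and the paper's meta-learned distribution estimation is not shown.

```python
import torch

def expmap0(v: torch.Tensor, c: float = 1.0) -> torch.Tensor:
    """Exponential map at the origin of the Poincare ball (curvature -c)."""
    norm = v.norm(dim=-1, keepdim=True).clamp_min(1e-15)
    return torch.tanh(c ** 0.5 * norm) * v / (c ** 0.5 * norm)

def sample_wrapped_normal(mean_tangent: torch.Tensor, std: float,
                          n_samples: int, c: float = 1.0) -> torch.Tensor:
    """Draw augmented features from a wrapped normal: sample Gaussian
    vectors in the tangent space at the origin, shift by a tangent-space
    mean, then map them onto the manifold with expmap0. Illustrative only."""
    eps = torch.randn(n_samples, mean_tangent.shape[-1]) * std
    return expmap0(mean_tangent + eps, c)

torch.manual_seed(0)
class_mean = torch.zeros(16)                     # tangent-space class mean
augmented = sample_wrapped_normal(class_mean, std=0.1, n_samples=32)
print(augmented.shape, augmented.norm(dim=-1).max())  # all norms < 1
```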

AAAI Conference 2022 Conference Paper

Learning the Dynamics of Visual Relational Reasoning via Reinforced Path Routing

  • Chenchen Jing
  • Yunde Jia
  • Yuwei Wu
  • Chuanhao Li
  • Qi Wu

Reasoning is a dynamic process. In cognitive theories, the dynamics of reasoning refers to reasoning states over time after successive state transitions. Modeling the cognitive dynamics is of utmost importance to simulate human reasoning capability. In this paper, we propose to learn the reasoning dynamics of visual relational reasoning by casting it as a path routing task. We present a reinforced path routing method that represents an input image via a structured visual graph and introduces a reinforcement learning based model to explore paths (sequences of nodes) over the graph based on an input sentence to infer reasoning results. By exploring such paths, the proposed method represents reasoning states clearly and characterizes state transitions explicitly to fully model the reasoning dynamics for accurate and transparent visual relational reasoning. Extensive experiments on referring expression comprehension and visual question answering demonstrate the effectiveness of our method.

AAAI Conference 2021 Conference Paper

Learning a Gradient-free Riemannian Optimizer on Tangent Spaces

  • Xiaomeng Fan
  • Zhi Gao
  • Yuwei Wu
  • Yunde Jia
  • Mehrtash Harandi

A principal way of addressing constrained optimization problems is to model them as problems on Riemannian manifolds. Recently, Riemannian meta-optimization has provided a promising way to solve constrained optimization problems by learning optimizers on Riemannian manifolds in a data-driven fashion, making it possible to design task-specific constrained optimizers. A close look at Riemannian meta-optimization reveals that learning optimizers on Riemannian manifolds requires differentiating through the nonlinear Riemannian optimization, which is complex and computationally expensive. In this paper, we propose a simple yet efficient Riemannian meta-optimization method that learns to optimize on tangent spaces of manifolds. In doing so, we present a gradient-free optimizer on tangent spaces, which takes the parameters of the model along with the training data as inputs, and generates the updated parameters directly. As a result, the constrained optimization is transformed from Riemannian manifolds to tangent spaces, where complex Riemannian operations (e.g., retraction operations) are removed from the optimizer, and learning the optimizer does not require differentiating through the Riemannian optimization. We empirically show that our method enables efficient learning of the optimizer while enjoying a good optimization trajectory in a data-driven manner.
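
The key move described above is to perform updates in a tangent space and map back to the manifold, rather than differentiating through repeated Riemannian operations. Here is a minimal sketch on the unit sphere, with a plain hand-written step standing in for the learned gradient-free optimizer; the stand-in update rule is an assumption, not the paper's optimizer.

```python
import numpy as np

def sphere_log(x, y):
    """Log map on the unit sphere: tangent vector at x pointing toward y."""
    cos_t = np.clip(np.dot(x, y), -1.0, 1.0)
    theta = np.arccos(cos_t)
    if theta < 1e-12:
        return np.zeros_like(x)
    u = y - cos_t * x
    return theta * u / np.linalg.norm(u)

def sphere_exp(x, v):
    """Exp map on the unit sphere: move from x along tangent vector v."""
    n = np.linalg.norm(v)
    if n < 1e-12:
        return x
    return np.cos(n) * x + np.sin(n) * v / n

# Optimize on the sphere by computing updates in the tangent space at the
# current point; in the paper the update comes from a learned network that
# outputs new parameters directly, which is replaced by a toy step here.
target = np.array([0.0, 0.0, 1.0])
x = np.array([1.0, 0.0, 0.0])
for _ in range(20):
    v = 0.3 * sphere_log(x, target)   # update computed entirely in the tangent space
    x = sphere_exp(x, v)              # single map back onto the manifold
print(x, np.linalg.norm(x))           # approaches the target, stays unit-norm
```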

AAAI Conference 2020 Conference Paper

DCMN+: Dual Co-Matching Network for Multi-Choice Reading Comprehension

  • Shuailiang Zhang
  • Hai Zhao
  • Yuwei Wu
  • Zhuosheng Zhang
  • Xi Zhou
  • Xiang Zhou

Multi-choice reading comprehension is a challenging task in which an answer must be selected from a set of candidate options given a passage and a question. Previous approaches usually only calculate a question-aware passage representation and ignore the passage-aware question representation when modeling the relationship between passage and question, which fails to effectively capture that relationship. In this work, we propose the dual co-matching network (DCMN), which models the relationship among passage, question, and answer options bidirectionally. Besides, inspired by how humans solve multi-choice questions, we integrate two reading strategies into our model: (i) passage sentence selection, which finds the most salient supporting sentences to answer the question, and (ii) answer option interaction, which encodes the comparison information between answer options. DCMN equipped with the two strategies (DCMN+) obtains state-of-the-art results on five multi-choice reading comprehension datasets from different domains: RACE, SemEval-2018 Task 11, ROCStories, COIN, and MCTest.
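
Dual co-matching computes both question-aware passage and passage-aware question representations. Below is a minimal sketch of one bidirectional matching block using attention in both directions; the pooling and option-scoring head are omitted, and the dimensions are assumptions rather than DCMN's actual configuration.

```python
import torch
import torch.nn.functional as F

def co_match(passage: torch.Tensor, question: torch.Tensor):
    """Bidirectional attention matching between a passage and a question.

    passage: (P, d), question: (Q, d). Returns question-aware passage
    features (P, d) and passage-aware question features (Q, d) -- the
    two directions, one of which single-direction models ignore.
    """
    scores = passage @ question.T                             # (P, Q) similarity
    q_aware_passage = F.softmax(scores, dim=1) @ question     # attend passage -> question
    p_aware_question = F.softmax(scores.T, dim=1) @ passage   # attend question -> passage
    return q_aware_passage, p_aware_question

torch.manual_seed(0)
p = torch.randn(120, 64)   # encoded passage tokens
q = torch.randn(20, 64)    # encoded question tokens
qp, pq = co_match(p, q)
print(qp.shape, pq.shape)  # torch.Size([120, 64]) torch.Size([20, 64])
```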

AAAI Conference 2020 Conference Paper

Overcoming Language Priors in VQA via Decomposed Linguistic Representations

  • Chenchen Jing
  • Yuwei Wu
  • Xiaoxun Zhang
  • Yunde Jia
  • Qi Wu

Most existing Visual Question Answering (VQA) models overly rely on language priors between questions and answers. In this paper, we present a novel method of language attention-based VQA that learns decomposed linguistic representations of questions and utilizes the representations to infer answers for overcoming language priors. We introduce a modular language attention mechanism to parse a question into three phrase representations: type representation, object representation, and concept representation. We use the type representation to identify the question type and the possible answer set (yes/no or specific concepts such as colors or numbers), and the object representation to focus on the relevant region of an image. The concept representation is verified with the attended region to infer the final answer. The proposed method decouples the language-based concept discovery and vision-based concept verification in the process of answer inference to prevent language priors from dominating the answering process. Experiments on the VQA-CP dataset demonstrate the effectiveness of our method.

AAAI Conference 2020 Conference Paper

Revisiting Bilinear Pooling: A Coding Perspective

  • Zhi Gao
  • Yuwei Wu
  • Xiaoxun Zhang
  • Jindou Dai
  • Yunde Jia
  • Mehrtash Harandi

Bilinear pooling has achieved state-of-the-art performance on fusing features in various machine learning tasks, owing to its ability to capture complex associations between features. Despite this success, bilinear pooling suffers from redundancy and burstiness issues, mainly due to the rank-one property of the resulting representation. In this paper, we prove that bilinear pooling is indeed a similarity-based coding-pooling formulation. This finding enables us to devise a new feature fusion algorithm, the factorized bilinear coding (FBC) method, to overcome the drawbacks of bilinear pooling. We show that FBC can generate compact and discriminative representations with substantially fewer parameters. Experiments on two challenging tasks, namely image classification and visual question answering, demonstrate that our method surpasses the bilinear pooling technique by a large margin.
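
Standard bilinear pooling fuses two feature vectors through their outer product, which is where the rank-one redundancy mentioned above comes from; FBC replaces this with a factorized coding scheme. The sketch below contrasts plain bilinear pooling with a generic low-rank factorized fusion; the factorization shown is illustrative and not the paper's exact FBC formulation.

```python
import torch
import torch.nn as nn

def bilinear_pool(x: torch.Tensor, y: torch.Tensor) -> torch.Tensor:
    """Plain bilinear pooling: flatten the outer product of two features.
    Output dimension is d_x * d_y, and the matrix x y^T has rank one."""
    return torch.einsum("bi,bj->bij", x, y).flatten(1)

class LowRankBilinearFusion(nn.Module):
    """Generic low-rank (factorized) bilinear fusion: project both inputs
    to a rank-r space and combine element-wise, giving far fewer parameters
    than full bilinear pooling. Illustrative stand-in for factorized coding."""

    def __init__(self, d_x: int, d_y: int, rank: int, d_out: int):
        super().__init__()
        self.u = nn.Linear(d_x, rank, bias=False)
        self.v = nn.Linear(d_y, rank, bias=False)
        self.out = nn.Linear(rank, d_out)

    def forward(self, x, y):
        return self.out(self.u(x) * self.v(y))

x, y = torch.randn(2, 512), torch.randn(2, 512)
print(bilinear_pool(x, y).shape)                              # (2, 262144) -- huge
print(LowRankBilinearFusion(512, 512, 64, 256)(x, y).shape)   # (2, 256)
```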

AAAI Conference 2020 Conference Paper

Semantics-Aware BERT for Language Understanding

  • Zhuosheng Zhang
  • Yuwei Wu
  • Hai Zhao
  • Zuchao Li
  • Shuailiang Zhang
  • Xi Zhou
  • Xiang Zhou

The latest work on language representations carefully integrates contextualized features into language model training, which has enabled a series of successes, especially in various machine reading comprehension and natural language inference tasks. However, existing language representation models, including ELMo, GPT and BERT, only exploit plain context-sensitive features such as character or word embeddings. They rarely consider incorporating structured semantic information, which can provide rich semantics for language representation. To promote natural language understanding, we propose to incorporate explicit contextual semantics from pre-trained semantic role labeling, and introduce an improved language representation model, Semantics-aware BERT (SemBERT), which is capable of explicitly absorbing contextual semantics over a BERT backbone. SemBERT keeps the convenient usability of its BERT precursor in a light fine-tuning way without substantial task-specific modifications. Compared with BERT, semantics-aware BERT is as simple in concept but more powerful. It obtains new state-of-the-art or substantially improved results on ten reading comprehension and language inference tasks.

AAAI Conference 2020 Conference Paper

SG-Net: Syntax-Guided Machine Reading Comprehension

  • Zhuosheng Zhang
  • Yuwei Wu
  • Junru Zhou
  • Sufeng Duan
  • Hai Zhao
  • Rui Wang

For machine reading comprehension, the capacity to effectively model the linguistic knowledge in detail-riddled and lengthy passages and to get rid of the noise is essential for improving performance. Traditional attentive models attend to all words without explicit constraint, which results in inaccurate concentration on some dispensable words. In this work, we propose using syntax to guide text modeling by incorporating explicit syntactic constraints into the attention mechanism for better linguistically motivated word representations. In detail, for a self-attention network (SAN) based Transformer encoder, we introduce a syntactic dependency of interest (SDOI) design into the SAN to form an SDOI-SAN with syntax-guided self-attention. The syntax-guided network (SG-Net) is then composed of this extra SDOI-SAN and the SAN from the original Transformer encoder through a dual contextual architecture for better linguistically inspired representation. To verify its effectiveness, the proposed SG-Net is applied to the typical pre-trained language model BERT, which is based on a Transformer encoder. Extensive experiments on popular benchmarks including SQuAD 2.0 and RACE show that the proposed SG-Net design helps achieve substantial performance improvement over strong baselines.
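
The SDOI design restricts self-attention so that each word attends only to its syntactic dependencies of interest. Here is a minimal single-head sketch of attention with a dependency mask; the mask construction from a parser is assumed and heavily simplified relative to SG-Net.

```python
import torch
import torch.nn.functional as F

def syntax_guided_attention(x: torch.Tensor, dep_mask: torch.Tensor) -> torch.Tensor:
    """Self-attention restricted by a syntactic dependency mask.

    x: (T, d) token representations; dep_mask: (T, T) with 1 where token i
    may attend to token j (e.g., j is a dependency ancestor of i, plus i
    itself) and 0 elsewhere. Simplified single-head version.
    """
    d = x.shape[-1]
    scores = (x @ x.T) / d ** 0.5
    scores = scores.masked_fill(dep_mask == 0, float("-inf"))
    return F.softmax(scores, dim=-1) @ x

torch.manual_seed(0)
tokens = torch.randn(5, 32)
# Toy dependency mask: each token attends to itself and to token 0 (the root).
mask = torch.eye(5)
mask[:, 0] = 1.0
print(syntax_guided_attention(tokens, mask).shape)  # torch.Size([5, 32])
```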

AAAI Conference 2018 Conference Paper

Deep Stereo Matching With Explicit Cost Aggregation Sub-Architecture

  • Lidong Yu
  • Yucheng Wang
  • Yuwei Wu
  • Yunde Jia

Deep neural networks have shown excellent performance for stereo matching. Many efforts focus on the feature extraction and similarity measurement of the matching cost computation step, while less attention is paid to cost aggregation, which is crucial for stereo matching. In this paper, we present a learning-based cost aggregation method for stereo matching via a novel sub-architecture in the end-to-end trainable pipeline. We reformulate cost aggregation as a learning process of generating and selecting cost aggregation proposals, which indicate possible cost aggregation results. The cost aggregation sub-architecture is realized by a two-stream network: one stream for the generation of cost aggregation proposals, the other for the selection of the proposals. The criterion for the selection is determined by the low-level structure information obtained from a light convolutional network. The two-stream network offers global-view guidance for the cost aggregation to rectify mismatching values stemming from the limited view of the matching cost computation. Comprehensive experiments on challenging datasets such as KITTI and Scene Flow show that our method outperforms the state-of-the-art methods.

AAAI Conference 2017 Conference Paper

Deep Manifold Learning of Symmetric Positive Definite Matrices with Application to Face Recognition

  • Zhen Dong
  • Su Jia
  • Chi Zhang
  • Mingtao Pei
  • Yuwei Wu

In this paper, we aim to construct a deep neural network which embeds high-dimensional symmetric positive definite (SPD) matrices into a more discriminative low-dimensional SPD manifold. To this end, we develop two types of basic layers: a 2D fully connected layer which reduces the dimensionality of the SPD matrices, and a symmetrically clean layer which achieves non-linear mapping. Specifically, we extend the classical fully connected layer so that it is suitable for SPD matrices, and we further show that SPD matrices remain symmetric positive definite after setting symmetric pairs of elements to zero. Finally, we complete the construction of the deep neural network for SPD manifold learning by stacking the two layers. Experiments on several face datasets demonstrate the effectiveness of the proposed method.
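
The 2D fully connected layer described above maps an SPD matrix to a lower-dimensional SPD matrix through a bilinear transformation. Below is a minimal sketch of such a layer; the symmetrically clean layer and training loop are omitted, and whether this parameterization matches the paper's exact layer is an assumption.

```python
import numpy as np

def spd_fc_layer(X: np.ndarray, W: np.ndarray) -> np.ndarray:
    """Bilinear map X -> W X W^T. If X is SPD and W has full row rank,
    the output is again SPD, with dimension equal to the number of rows of W."""
    return W @ X @ W.T

rng = np.random.default_rng(0)
A = rng.normal(size=(64, 64))
X = A @ A.T + 64 * np.eye(64)          # a 64x64 SPD matrix
W = rng.normal(size=(16, 64))          # reduces dimension 64 -> 16 (full row rank a.s.)
Y = spd_fc_layer(X, W)
print(Y.shape, bool(np.all(np.linalg.eigvalsh(Y) > 0)))  # (16, 16) True
```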

AAAI Conference 2017 Conference Paper

Efficient Object Instance Search Using Fuzzy Objects Matching

  • Tan Yu
  • Yuwei Wu
  • Sreyasee Bhattacharjee
  • Junsong Yuan

Recently, global features aggregated from the local convolutional features of convolutional neural networks have been shown to be much more effective than hand-crafted features for image retrieval. However, a global feature might not effectively capture the relevance between the query object and reference images in the object instance search task, especially when the query object is relatively small and multiple types of objects exist in the reference images. Moreover, object instance search requires localizing the object in the reference image, which may not be achievable through global representations. In this paper, we propose a Fuzzy Objects Matching (FOM) framework to effectively and efficiently capture the relevance between the query object and reference images in the dataset. In the proposed FOM scheme, object proposals are utilized to detect the potential regions of the query object in reference images. To achieve high search efficiency, we factorize the feature matrix of all the object proposals from one reference image into the product of a set of fuzzy objects and sparse codes. In addition, we refine the features of the generated fuzzy objects according to their neighborhoods in the feature space to generate more robust representations. The experimental results demonstrate that the proposed FOM framework significantly outperforms the state-of-the-art methods in precision with less memory and computational cost on three public datasets.