Arrow Research search

Author name cluster

Yang Wu

Possible papers associated with this exact author name in Arrow. This page groups case-insensitive exact name matches and is not a full identity disambiguation profile.

20 papers
2 author rows

Possible papers

ICML Conference 2025 Conference Paper

Boosting Virtual Agent Learning and Reasoning: A Step-Wise, Multi-Dimensional, and Generalist Reward Model with Benchmark

  • Bingchen Miao
  • Yang Wu
  • Minghe Gao
  • Qifan Yu
  • Wendong Bu
  • Wenqiao Zhang
  • Yunfei Li
  • Siliang Tang

The development of Generalist Virtual Agents (GVAs) has shown significant promise in autonomous task execution. However, current training paradigms face critical limitations, including reliance on outcome supervision and labor-intensive human annotations. To address these challenges, we propose Similar, a step-wise multi-dimensional generalist reward model, which offers fine-grained signals for agent training and can choose better actions for inference-time scaling. Specifically, we begin by systematically defining five dimensions for evaluating agent actions. Building on this framework, we design an MCTS-P algorithm to automatically collect and annotate step-wise, five-dimensional agent execution data. Using this data, we train Similar with our crafted Triple-M strategy. Furthermore, we introduce the first benchmark in the virtual agent domain for step-wise, multi-dimensional reward model training and evaluation, named SRM. This benchmark consists of two components: SRMTrain, which serves as the training set for Similar, and SRMEval, a manually selected test set for evaluating the reward model. Experimental results demonstrate that Similar, through its step-wise, multi-dimensional assessment and synergistic gain, provides GVAs with effective intermediate signals during both training and inference-time scaling. The code is available at https://github.com/antgroup/Similar.

JBHI Journal 2025 Journal Article

Clinically Generalizable Low-Dose CT Denoising for Pediatric Imaging via Enhanced Diffusion Posterior Sampling

  • Hongmei Tang
  • Qianhao Chen
  • Qiyang Zhang
  • Zhaoting Cheng
  • Yang Wu
  • Shuang Song
  • Hairong Zheng
  • Dong Liang

In total-body positron emission tomography and computed tomography (PET/CT) imaging, reducing the radiation dose of diagnostic CT scans is essential for minimizing overall radiation exposure, particularly in pediatric patients. Although deep learning-based denoising methods have shown promise in restoring low-dose CT (LDCT) to normal-dose CT (NDCT) quality, most approaches rely on structurally aligned paired data, which are difficult to acquire in clinical practice. Models trained on synthetic pairs often exhibit limited generalizability to real LDCT data. Unconditional diffusion models demonstrate outstanding generalizability, but fail to preserve structural fidelity. To address these challenges, we propose an enhanced diffusion posterior sampling (E-DPS) framework that combines a one-step denoiser U-Net with an unconditional diffusion model. Specifically, the U-Net estimator, trained on simulated LDCT–NDCT pairs, provides preliminary denoised outputs as structural constraints, whereas the diffusion model captures the prior distribution of NDCT images to enhance realism and generalizability. During inference, the U-Net predictions are integrated as constraints with tunable weights, thereby guiding diffusion posterior sampling. In addition, an intermediate-stage initialization strategy is introduced, significantly reducing the number of required sampling steps. Extensive experiments on simulated LDCT datasets across three dose levels demonstrate the superiority of our method, yielding average PSNR gains of +5.2% and +4.3% at unseen dose levels compared with state-of-the-art approaches. Moreover, on real LDCT images, E-DPS exhibits strong zero-shot generalizability, achieving better noise suppression while preserving anatomical detail. These results highlight the robustness and clinical potential of E-DPS for LDCT denoising.

NeurIPS Conference 2025 Conference Paper

Janus-Pro-R1: Advancing Collaborative Visual Comprehension and Generation via Reinforcement Learning

  • Kaihang Pan
  • Yang Wu
  • Wendong Bu
  • Shen Kai
  • Juncheng Li
  • Yingting Wang
  • Yunfei Li
  • Siliang Tang

Recent endeavors in Multimodal Large Language Models (MLLMs) aim to unify visual comprehension and generation. However, these two capabilities remain largely independent, as if they were two separate functions encapsulated within the same model. Consequently, visual comprehension does not enhance visual generation, and the reasoning mechanisms of LLMs have not been fully integrated to revolutionize image generation. In this paper, we propose to enable the collaborative co-evolution of visual comprehension and generation, advancing image generation into an iterative introspective process. We introduce a two-stage training approach: supervised fine-tuning gives the MLLM the foundational ability to generate genuine CoT for visual generation, while reinforcement learning activates its full potential via an exploration-exploitation trade-off. Ultimately, we unlock the Aha moment in visual generation, advancing MLLMs from text-to-image tasks to unified image generation. Extensive experiments demonstrate that our model not only excels in text-to-image generation and image editing, but also functions as a superior image semantic evaluator with enhanced visual comprehension capabilities. Project Page: https://janus-pro-r1.github.io.

NeurIPS Conference 2025 Conference Paper

Shortcutting Pre-trained Flow Matching Diffusion Models is Almost Free Lunch

  • Xu Cai
  • Yang Wu
  • Qianli Chen
  • Haoran Wu
  • Lichuan Xiang
  • Hongkai Wen

We present an ultra-efficient post-training method for shortcutting large-scale pre-trained flow matching diffusion models into efficient few-step samplers, enabled by novel velocity field self-distillation. While shortcutting in flow matching, originally introduced by shortcut models, offers flexible trajectory-skipping capabilities, it requires a specialized step-size embedding incompatible with existing models unless retraining from scratch, a process nearly as costly as pretraining itself. Our key contribution is thus imparting a more aggressive shortcut mechanism to standard flow matching models (e.g., Flux), leveraging a unique distillation principle that obviates the need for step-size embedding. Working on the velocity field rather than sample space and learning rapidly from self-guided distillation in an online manner, our approach trains efficiently, e.g., producing a 3-step Flux in less than one A100 day. Beyond distillation, our method can be incorporated into the pretraining stage itself, yielding models that inherently learn efficient, few-step flows without compromising quality. This capability also enables, to our knowledge, the first few-shot distillation method (e.g., 10 text-image pairs) for dozen-billion-parameter diffusion models, delivering state-of-the-art performance at almost free cost.

NeurIPS Conference 2025 Conference Paper

UniPixel: Unified Object Referring and Segmentation for Pixel-Level Visual Reasoning

  • Ye Liu
  • Zongyang Ma
  • Junfu Pu
  • Zhongang Qi
  • Yang Wu
  • Ying Shan
  • Chang Chen

Recent advances in Large Multi-modal Models (LMMs) have demonstrated their remarkable success as general-purpose multi-modal assistants, with particular focuses on holistic image- and video-language understanding. Conversely, less attention has been given to scaling fine-grained pixel-level understanding capabilities, where the models are expected to realize pixel-level alignment between visual signals and language semantics. Some previous studies have applied LMMs to related tasks such as region-level captioning and referring expression segmentation. However, these models are limited to performing either referring or segmentation tasks independently and fail to integrate these fine-grained perception capabilities into visual reasoning. To bridge this gap, we propose UniPixel, a large multi-modal model capable of flexibly comprehending visual prompt inputs and generating mask-grounded responses. Our model distinguishes itself by seamlessly integrating pixel-level perception with general visual understanding capabilities. Specifically, UniPixel processes visual prompts and generates relevant masks on demand, and performs subsequent reasoning conditioning on these intermediate pointers during inference, thereby enabling fine-grained pixel-level reasoning. The effectiveness of our approach has been verified on 10 benchmarks across a diverse set of tasks, including pixel-level referring/segmentation and object-centric understanding in images/videos. A novel PixelQA task that jointly requires referring, segmentation, and question answering is also designed to verify the flexibility of our method.

ICML Conference 2025 Conference Paper

What Limits Virtual Agent Application? OmniBench: A Scalable Multi-Dimensional Benchmark for Essential Virtual Agent Capabilities

  • Wendong Bu
  • Yang Wu
  • Qifan Yu
  • Minghe Gao
  • Bingchen Miao
  • Zhenkui Zhang
  • Kaihang Pan
  • Liyunfei

As multimodal large language models (MLLMs) advance, MLLM-based virtual agents have demonstrated remarkable performance. However, existing benchmarks face significant limitations, including uncontrollable task complexity, extensive manual annotation, and a lack of multidimensional evaluation. In response to these challenges, we introduce OmniBench, a self-generating, graph-based benchmark with an automated pipeline for synthesizing tasks of controllable complexity through subtask composition. To evaluate the diverse capabilities of virtual agents on the graph, we further present OmniEval, a multidimensional evaluation framework that includes subtask-level evaluation, graph-based metrics, and comprehensive tests across 10 capabilities. Our synthesized dataset contains 36k graph-structured tasks across 20 scenarios, achieving a 91% human acceptance rate. Training on our graph-structured data shows that it improves generalization across environments. We conduct multidimensional evaluations for virtual agents, revealing their performance across various capabilities and paving the way for future advancements. Our project is available at https://omni-bench.github.io.

NeurIPS Conference 2024 Conference Paper

E.T. Bench: Towards Open-Ended Event-Level Video-Language Understanding

  • Ye Liu
  • Zongyang Ma
  • Zhongang Qi
  • Yang Wu
  • Ying Shan
  • Chang W. Chen

Recent advances in Video Large Language Models (Video-LLMs) have demonstrated their great potential in general-purpose video understanding. To verify the significance of these models, a number of benchmarks have been proposed to diagnose their capabilities in different scenarios. However, existing benchmarks merely evaluate models through video-level question-answering, lacking fine-grained event-level assessment and task diversity. To fill this gap, we introduce E.T. Bench (Event-Level & Time-Sensitive Video Understanding Benchmark), a large-scale and high-quality benchmark for open-ended event-level video understanding. Categorized within a 3-level task taxonomy, E.T. Bench encompasses 7.3K samples under 12 tasks with 7K videos (251.4h total length) under 8 domains, providing comprehensive evaluations. We extensively evaluated 8 Image-LLMs and 12 Video-LLMs on our benchmark, and the results reveal that state-of-the-art models for coarse-level (video-level) understanding struggle to solve our fine-grained tasks, e.g., grounding event-of-interests within videos, largely due to the short video context length, improper time representations, and lack of multi-event training data. Focusing on these issues, we further propose a strong baseline model, E.T. Chat, together with an instruction-tuning dataset E.T. Instruct 164K tailored for fine-grained event-level understanding. Our simple but effective solution demonstrates superior performance in multiple scenarios.

ICML Conference 2024 Conference Paper

HGCN2SP: Hierarchical Graph Convolutional Network for Two-Stage Stochastic Programming

  • Yang Wu
  • Yifan Zhang 0001
  • Zhenxing Liang
  • Jian Cheng 0001

Two-stage Stochastic Programming (2SP) is a standard framework for modeling decision-making problems under uncertainty. While numerous methods exist, solving such problems with many scenarios remains challenging. Selecting representative scenarios is a practical method for accelerating solutions. However, current approaches typically rely on clustering or Monte Carlo sampling, failing to integrate scenario information deeply and overlooking the significant impact of the scenario order on solving time. To address these issues, we develop HGCN2SP, a novel model with a hierarchical graph designed for 2SP problems, encoding each scenario and modeling their relationships hierarchically. The model is trained in a reinforcement learning paradigm to utilize the feedback of the solver. The policy network is equipped with a hierarchical graph convolutional network for feature encoding and an attention-based decoder for scenario selection in proper order. Evaluation on two classic 2SP problems demonstrates that HGCN2SP provides high-quality decisions in a short computational time. Furthermore, HGCN2SP exhibits remarkable generalization capabilities in handling large-scale instances, even with a substantial number of variables or scenarios that were unseen during the training phase.

JBHI Journal 2024 Journal Article

Stimulus-Response Patterns: The Key to Giving Generalizability to Text-Based Depression Detection Models

  • Zhenyu Liu
  • Yang Wu
  • Haibo Zhang
  • Gang Li
  • Zhijie Ding
  • Bin Hu

Text content analysis for depression detection using machine learning techniques has become a prominent area of research. However, previous studies focused mainly on analyzing the textual content, neglecting the fundamental factors driving text generation. Consequently, existing models face the challenge of poor generalization to out-of-domain data as they struggle to capture the crucial features of depression. To address this, we propose a novel computational perspective of “stimulus-response patterns” that brings us closer to the essence of clinical diagnosis of depression. Adopting this computational perspective allows us to conceptually unify diverse datasets and generalize this perspective to common datasets in the field. We introduce the Stimulus-Response Patterns-aware Network (SRP-Net) as an exemplary approach within this computational perspective. To assess the performance of the SRP-Net, we constructed a multi-stimulus dataset and conducted experimental evaluations, demonstrating its exceptional cross-stimulus generalizability. Furthermore, we demonstrated the promising performance of SRP-Net in real medical scenarios and conducted an interpretability analysis of the stimulus-response patterns. Our research investigates the critical role of stimulus-response patterns in enhancing the generalizability of text-based depression detection models, which can potentially facilitate data-driven depression detection to approach the diagnostic accuracy of psychiatrists.

AAAI Conference 2024 Conference Paper

Towards Detailed Text-to-Motion Synthesis via Basic-to-Advanced Hierarchical Diffusion Model

  • Zhenyu Xie
  • Yang Wu
  • Xuehao Gao
  • Zhongqian Sun
  • Wei Yang
  • Xiaodan Liang

Text-guided motion synthesis aims to generate 3D human motion that not only precisely reflects the textual description but also reveals the motion details as much as possible. Pioneering methods explore the diffusion model for text-to-motion synthesis and obtain significant superiority. However, these methods conduct diffusion processes either on the raw data distribution or in a low-dimensional latent space, and typically suffer from modality inconsistency or scarce detail. To tackle this problem, we propose a novel Basic-to-Advanced Hierarchical Diffusion Model, named B2A-HDM, to collaboratively exploit low-dimensional and high-dimensional diffusion models for high-quality, detailed motion synthesis. Specifically, the basic diffusion model in low-dimensional latent space provides an intermediate denoising result that is consistent with the textual description, while the advanced diffusion model in high-dimensional latent space focuses on the subsequent detail-enhancing denoising process. Besides, we introduce a multi-denoiser framework for the advanced diffusion model to ease the learning of the high-dimensional model and fully explore the generative potential of the diffusion model. Quantitative and qualitative experimental results on two text-to-motion benchmarks (HumanML3D and KIT-ML) demonstrate that B2A-HDM can outperform existing state-of-the-art methods in terms of fidelity, modality consistency, and diversity.

NeurIPS Conference 2023 Conference Paper

Act As You Wish: Fine-Grained Control of Motion Diffusion Model with Hierarchical Semantic Graphs

  • Peng Jin
  • Yang Wu
  • Yanbo Fan
  • Zhongqian Sun
  • Wei Yang
  • Li Yuan

Most text-driven human motion generation methods employ sequential modeling approaches, e.g., transformer, to extract sentence-level text representations automatically and implicitly for human motion synthesis. However, these compact text representations may overemphasize the action names at the expense of other important properties and lack fine-grained details to guide the synthesis of subtly distinct motion. In this paper, we propose hierarchical semantic graphs for fine-grained control over motion generation. Specifically, we disentangle motion descriptions into hierarchical semantic graphs including three levels of motions, actions, and specifics. Such global-to-local structures facilitate a comprehensive understanding of motion description and fine-grained control of motion generation. Correspondingly, to leverage the coarse-to-fine topology of hierarchical semantic graphs, we decompose the text-to-motion diffusion process into three semantic levels, which correspond to capturing the overall motion, local actions, and action specifics. Extensive experiments on two benchmark human motion datasets, including HumanML3D and KIT, with superior performances, justify the efficacy of our method. More encouragingly, by modifying the edge weights of hierarchical semantic graphs, our method can continuously refine the generated motion, which may have a far-reaching impact on the community. Code and pre-trained weights are available at https://github.com/jpthu17/GraphMotion.

NeurIPS Conference 2023 Conference Paper

CL-NeRF: Continual Learning of Neural Radiance Fields for Evolving Scene Representation

  • Xiuzhe Wu
  • Peng Dai
  • Weipeng DENG
  • Handi Chen
  • Yang Wu
  • Yan-Pei Cao
  • Ying Shan
  • Xiaojuan Qi

Existing methods for adapting Neural Radiance Fields (NeRFs) to scene changes require extensive data capture and model retraining, which is both time-consuming and labor-intensive. In this paper, we tackle the challenge of efficiently adapting NeRFs to real-world scene changes over time using a few new images while retaining the memory of unaltered areas, focusing on the continual learning aspect of NeRFs. To this end, we propose CL-NeRF, which consists of two key components: a lightweight expert adaptor for adapting to new changes and evolving scene representations and a conflict-aware knowledge distillation learning objective for memorizing unchanged parts. We also present a new benchmark for evaluating Continual Learning of NeRFs with comprehensive metrics. Our extensive experiments demonstrate that CL-NeRF can synthesize high-quality novel views of both changed and unchanged regions with high training efficiency, surpassing existing methods in terms of reducing forgetting and adapting to changes. Code and benchmark will be made available.

AAAI Conference 2023 Conference Paper

Scene Graph to Image Synthesis via Knowledge Consensus

  • Yang Wu
  • Pengxu Wei
  • Liang Lin

In this paper, we study graph-to-image generation conditioned exclusively on scene graphs, in which we seek to disentangle the veiled semantics between knowledge graphs and images. While most existing research resorts to laborious auxiliary information such as object layouts or segmentation masks, it is also of interest to unveil the generality of the model with limited supervision, moreover, avoiding extra cross-modal alignments. To tackle this challenge, we delve into the causality of the adversarial generation process, and reason out a new principle to realize a simultaneous semantic disentanglement with an alignment on target and model distributions. This principle is named knowledge consensus, which explicitly describes a triangle causal dependency among observed images, graph semantics and hidden visual representations. The consensus also determines a new graph-to-image generation framework, carried on several adversarial optimization objectives. Extensive experimental results demonstrate that, even conditioned only on scene graphs, our model surprisingly achieves superior performance on semantics-aware image generation, without losing the competence on manipulating the generation through knowledge graphs.

YNIMG Journal 2022 Journal Article

A novel technology for in vivo detection of cell type-specific neural connection with AQP1-encoding rAAV2-retro vector and metal-free MRI

  • Ning Zheng
  • Mei Li
  • Yang Wu
  • Challika Kaewborisuth
  • Zhen Li
  • Zhu Gui
  • Jinfeng Wu
  • Aoling Cai

A mammalian brain contains numerous neurons with distinct cell types for complex neural circuits. Virus-based circuit tracing tools are powerful in tracking the interaction among the different brain regions. However, detecting brain-wide neural networks in vivo remains challenging since most viral tracing systems rely on postmortem optical imaging. We developed a novel approach that enables in vivo detection of brain-wide neural connections based on metal-free magnetic resonance imaging (MRI). The recombinant adeno-associated virus (rAAV) with retrograde ability, the rAAV2-retro, encoding the human water channel aquaporin 1 (AQP1) MRI reporter gene was generated to label neural connections. The mouse was micro-injected with the virus at the Caudate Putamen (CPU) region and subjected to detection with Diffusion-weighted MRI (DWI). The prominent structure of the CPU-connected network was clearly defined. In combination with a Cre-loxP system, rAAV2-retro expressing Cre-dependent AQP1 provides a CPU-connected network of specific type neurons. Here, we established a sensitive, metal-free MRI-based strategy for in vivo detection of cell type-specific neural connections in the whole brain, which could visualize the dynamic changes of neural networks in rodents and potentially in non-human primates.

IJCAI Conference 2021 Conference Paper

SiamRCR: Reciprocal Classification and Regression for Visual Object Tracking

  • Jinlong Peng
  • Zhengkai Jiang
  • Yueyang Gu
  • Yang Wu
  • Yabiao Wang
  • Ying Tai
  • Chengjie Wang
  • Weiyao Lin

Recently, most siamese network based trackers locate targets via object classification and bounding-box regression. Generally, they select the bounding-box with maximum classification confidence as the final prediction. This strategy may miss the right result due to the accuracy misalignment between classification and regression. In this paper, we propose a novel siamese tracking algorithm called SiamRCR, addressing this problem with a simple, light and effective solution. It builds reciprocal links between classification and regression branches, which can dynamically re-weight their losses for each positive sample. In addition, we add a localization branch to predict the localization accuracy, so that it can work as the replacement of the regression assistance link during inference. This branch makes the training and inference more consistent. Extensive experimental results demonstrate the effectiveness of SiamRCR and its superiority over the state-of-the-art competitors on GOT-10k, LaSOT, TrackingNet, OTB-2015, VOT-2018 and VOT-2019. Moreover, our SiamRCR runs at 65 FPS, far above the real-time requirement.
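
The misalignment this abstract describes can be illustrated with a toy example. The sketch below is not the SiamRCR method; it only shows, with made-up candidate boxes and scores, how the box with the highest classification confidence can differ from the box that overlaps the target best. All names and numbers are hypothetical.

```python
def iou(a, b):
    """Intersection-over-union of two boxes given as (x1, y1, x2, y2)."""
    ix1, iy1 = max(a[0], b[0]), max(a[1], b[1])
    ix2, iy2 = min(a[2], b[2]), min(a[3], b[3])
    inter = max(0.0, ix2 - ix1) * max(0.0, iy2 - iy1)
    area_a = (a[2] - a[0]) * (a[3] - a[1])
    area_b = (b[2] - b[0]) * (b[3] - b[1])
    return inter / (area_a + area_b - inter)


# Hypothetical target box and two candidate predictions.
ground_truth = (10.0, 10.0, 50.0, 50.0)
candidates = [
    (30.0, 30.0, 70.0, 70.0),  # poorly localized
    (12.0, 12.0, 52.0, 52.0),  # well localized
]
# Hypothetical classification confidences: the worse box scores higher.
cls_scores = [0.95, 0.80]

# Selection rule described in the abstract: take the max-confidence box.
best_by_cls = max(range(len(candidates)), key=lambda i: cls_scores[i])
# Oracle selection by overlap, available only when ground truth is known.
best_by_iou = max(
    range(len(candidates)), key=lambda i: iou(candidates[i], ground_truth)
)

print("picked by classification:", best_by_cls)
print("best by overlap:", best_by_iou)
```

With these numbers the two rules disagree, which is the gap that a predicted localization score is meant to close at inference time.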

IJCAI Conference 2020 Conference Paper

Beyond Intra-modality: A Survey of Heterogeneous Person Re-identification

  • Zheng Wang
  • Zhixiang Wang
  • Yinqiang Zheng
  • Yang Wu
  • Wenjun Zeng
  • Shin'ichi Satoh

An efficient and effective person re-identification (ReID) system relieves the users from painful and boring video watching and accelerates the process of video analysis. Recently, with the explosive demands of practical applications, a lot of research efforts have been dedicated to heterogeneous person re-identification (Hetero-ReID). In this paper, we provide a comprehensive review of state-of-the-art Hetero-ReID methods that address the challenge of inter-modality discrepancies. According to the application scenario, we classify the methods into four categories: low-resolution, infrared, sketch, and text. We begin with an introduction of ReID, and make a comparison between Homogeneous ReID (Homo-ReID) and Hetero-ReID tasks. Then, we describe and compare existing datasets for performing evaluations, and survey the models that have been widely employed in Hetero-ReID. We also summarize and compare the representative approaches from two perspectives, i.e., the application scenario and the learning pipeline. We conclude with a discussion of some future research directions. Follow-up updates are available at https://github.com/lightChaserX/Awesome-Hetero-reID

IROS Conference 2020 Conference Paper

Magnetically Actuated Pick-and-place Operations of Cellular Micro-rings for High-speed Assembly of Micro-scale Biological Tube

  • Yang Wu
  • Tao Sun 0001
  • Qing Shi
  • Huaping Wang
  • Qiang Huang 0002
  • Toshio Fukuda

Tissue engineering seeks to use modular tissue micro-rings to construct artificial biological microtubes as substitutes for autologous tissue tubes, alleviating the shortage of donor sources. However, because of the lack of effective assembly strategies, it is still challenging to achieve high-speed fabrication of biological microtubes with high cell density. In this paper, we propose a robotic-based magnetic assembly strategy to handle this challenge. We first encapsulated magnetic alginate microfibers into micro-rings formed by cell self-assembly to enhance the controllability. Afterwards, a 3D long-stroke manipulator with a visual servoing system was designed to achieve magnetic pick-and-place operations of micro-rings for 3D assembly. Moreover, we developed a mathematical model of the motion of a micro-ring in solution environments. Based on visual feedback, we analyzed the feasibility of automatic assembly and the following response of micro-rings to the moving magnets, which shows that our proposed method has great potential to achieve high-speed bio-assembly. Finally, we successfully assembled multiple micro-rings into a biological microtube with high cell density.

AAAI Conference 2019 Conference Paper

FRAME Revisited: An Interpretation View Based on Particle Evolution

  • Xu Cai
  • Yang Wu
  • Guanbin Li
  • Ziliang Chen
  • Liang Lin

FRAME (Filters, Random fields, And Maximum Entropy) is an energy-based descriptive model that synthesizes visual realism by capturing mutual patterns from structural input signals. The maximum likelihood estimation (MLE) is applied by default, yet it conventionally causes unstable training energy that wrecks the generated structures, a phenomenon that remains unexplained. In this paper, we provide a new theoretical insight to analyze FRAME from the perspective of particle physics, ascribing this weird phenomenon to a KL-vanishing issue. In order to stabilize the energy dissipation, we propose an alternative Wasserstein distance in discrete time based on the conclusion that the Jordan-Kinderlehrer-Otto (JKO) discrete flow approximates KL discrete flow when the time step size tends to 0. Besides, this metric can still maintain the model’s statistical consistency. Quantitative and qualitative experiments have been respectively conducted on several widely used datasets. The empirical studies have evidenced the effectiveness and superiority of our method.
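
For readers unfamiliar with the JKO scheme mentioned in this abstract, its textbook form is sketched below. This is general background rather than the paper's own formulation; the symbols (step size $\tau$, target density $\pi$, iterates $\rho_k$) are notational assumptions introduced here.

```latex
% One JKO step: minimize the KL energy plus a Wasserstein proximity term.
\rho_{k+1} \;=\; \operatorname*{arg\,min}_{\rho}
  \left\{ \mathrm{KL}(\rho \,\|\, \pi)
  \;+\; \frac{1}{2\tau}\, W_2^2(\rho, \rho_k) \right\}

% As \tau \to 0, the interpolated iterates converge to the
% Wasserstein gradient flow of the KL energy (a Fokker--Planck equation):
\partial_t \rho \;=\; \nabla \cdot \left( \rho \, \nabla \log \frac{\rho}{\pi} \right)
```

Here $W_2$ is the 2-Wasserstein distance, so each step trades a decrease in KL divergence against how far the density moves.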

IJCAI Conference 2018 Conference Paper

Knowledge-Embedded Representation Learning for Fine-Grained Image Recognition

  • Tianshui Chen
  • Liang Lin
  • Riquan Chen
  • Yang Wu
  • Xiaonan Luo

Humans can naturally understand an image in depth with the aid of rich knowledge accumulated from daily lives or professions. For example, to achieve fine-grained image recognition (e.g., categorizing hundreds of subordinate categories of birds) usually requires a comprehensive visual concept organization including category labels and part-level attributes. In this work, we investigate how to unify rich professional knowledge with deep neural network architectures and propose a Knowledge-Embedded Representation Learning (KERL) framework for handling the problem of fine-grained image recognition. Specifically, we organize the rich visual concepts in the form of knowledge graph and employ a Gated Graph Neural Network to propagate node message through the graph for generating the knowledge representation. By introducing a novel gated mechanism, our KERL framework incorporates this knowledge representation into the discriminative image feature learning, i.e., implicitly associating the specific attributes with the feature maps. Compared with existing methods of fine-grained image classification, our KERL framework has several appealing properties: i) The embedded high-level knowledge enhances the feature representation, thus facilitating distinguishing the subtle differences among subordinate categories. ii) Our framework can learn feature maps with a meaningful configuration that the highlighted regions finely accord with the nodes (specific attributes) of the knowledge graph. Extensive experiments on the widely used Caltech-UCSD bird dataset demonstrate the superiority of our KERL framework over existing state-of-the-art methods.

AAAI Conference 2018 Conference Paper

Temporal-Enhanced Convolutional Network for Person Re-Identification

  • Yang Wu
  • Jie Qiu
  • Jun Takamatsu
  • Tsukasa Ogasawara

We propose a new neural network called Temporal-enhanced Convolutional Network (T-CN) for video-based person reidentification. For each video sequence of a person, a spatial convolutional subnet is first applied to each frame for representing appearance information, and then a temporal convolutional subnet links small ranges of continuous frames to extract local motion information. Such spatial and temporal convolutions together construct our T-CN based representation. Finally, a recurrent network is utilized to further explore global dynamics, followed by temporal pooling to generate an overall feature vector for the whole sequence. In the training stage, a Siamese network architecture is adopted to jointly optimize all the components with losses covering both identification and verification. In the testing stage, our network generates an overall discriminative feature representation for each input video sequence (whose length may vary a lot) in a feed-forward way, and even a simple Euclidean distance based matching can generate good re-identification results. Experiments on the most widely used benchmark datasets demonstrate the superiority of our proposal, in comparison with the state-of-the-art.
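
The final matching step this abstract mentions (temporal pooling followed by Euclidean-distance matching) can be sketched in a few lines. This is only an illustration under stated assumptions: mean pooling is assumed because the abstract does not name the pooling type, the per-frame feature vectors are hand-written stand-ins for network outputs, and every name is hypothetical.

```python
import math


def temporal_mean_pool(frame_features):
    """Average per-frame feature vectors into one sequence-level vector."""
    n = len(frame_features)
    dim = len(frame_features[0])
    return [sum(f[d] for f in frame_features) / n for d in range(dim)]


def euclidean(u, v):
    """Euclidean distance between two equal-length vectors."""
    return math.sqrt(sum((a - b) ** 2 for a, b in zip(u, v)))


def match(query_seq, gallery):
    """Return the gallery identity whose pooled feature is nearest the query."""
    q = temporal_mean_pool(query_seq)
    return min(
        gallery,
        key=lambda pid: euclidean(q, temporal_mean_pool(gallery[pid])),
    )


# Toy 2-D features; sequences may have different lengths.
gallery = {
    "person_a": [[0.0, 1.0], [0.2, 0.8], [0.1, 0.9]],
    "person_b": [[1.0, 0.0], [0.9, 0.1]],
}
query = [[0.1, 1.0], [0.0, 0.8], [0.2, 0.9], [0.1, 0.9]]

best = match(query, gallery)
print(best)
```

Pooling over time is what lets sequences of different lengths be compared with a fixed-size vector.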