Arrow Research search

Author name cluster

Yizhou Wang

Possible papers associated with this exact author name in Arrow. This page groups case-insensitive exact name matches and is not a full identity disambiguation profile.

33 papers
2 author rows

Possible papers

33

IROS Conference 2025 Conference Paper

A Novel Robot Hand with Hoeckens Linkages and Soft Phalanges for Scooping and Self-Adaptive Grasping in Environmental Constraints

  • Wentao Guo
  • Yizhou Wang
  • Wenzeng Zhang

This paper presents a novel underactuated adaptive robotic hand, the Hoeckens-A Hand, which integrates the Hoeckens mechanism, a double-parallelogram linkage, and a specialized four-bar linkage to achieve three adaptive grasping modes: parallel pinching, asymmetric scooping, and enveloping grasping. The Hoeckens-A Hand requires only a single linear actuator, leveraging passive mechanical intelligence to ensure adaptability and compliance in unstructured environments. Specifically, the vertical motion of the Hoeckens mechanism introduces compliance, the double-parallelogram linkage ensures line contact at the fingertip, and the four-bar amplification system enables natural transitions between the grasping modes. Additionally, a mesh-textured silicone phalanx further enhances the hand's ability to envelop objects of various shapes and sizes. The study employs detailed kinematic analysis to optimize the push angle and design the linkage lengths for optimal performance. Simulations validated the design by analyzing fingertip motion and confirming smooth transitions between grasping modes, and the grasping force was analyzed using power equations to deepen understanding of the system's performance. Experimental validation with a 3D-printed prototype demonstrates the three grasping modes in various scenarios under environmental constraints, verifying the hand's grasping stability and broad applicability.

AAAI Conference 2025 Conference Paper

Autoregressive Sequence Modeling for 3D Medical Image Representation

  • Siwen Wang
  • Churan Wang
  • Fei Gao
  • Lixian Su
  • Fandong Zhang
  • Yizhou Wang
  • Yizhou Yu

Three-dimensional (3D) medical images, such as Computed Tomography (CT) and Magnetic Resonance Imaging (MRI), are essential for clinical applications. However, the need for diverse and comprehensive representations is particularly pronounced when considering the variability across different organs, diagnostic tasks, and imaging modalities. How to effectively interpret the intricate contextual information and extract meaningful insights from these images remains an open challenge to the community. While current self-supervised learning methods have shown potential, they often consider an image as a whole, thereby overlooking the extensive, complex relationships among local regions within one or multiple images. In this work, we introduce a pioneering method for learning 3D medical image representations through an autoregressive pre-training framework. Our approach sequences various 3D medical images based on spatial, contrast, and semantic correlations, treating them as interconnected visual tokens within a token sequence. By employing an autoregressive sequence modeling task, we predict the next visual token in the sequence, which allows our model to deeply understand and integrate the contextual information inherent in 3D medical images. Additionally, we implement a random startup strategy to avoid overestimating token relationships and to enhance the robustness of learning. The effectiveness of our approach is demonstrated by superior performance over competing methods on nine downstream tasks on public datasets.

NeurIPS Conference 2025 Conference Paper

Don't Just Chase “Highlighted Tokens” in MLLMs: Revisiting Visual Holistic Context Retention

  • Xin Zou
  • Di Lu
  • Yizhou Wang
  • Yibo Yan
  • Yuanhuiyi Lyu
  • Xu Zheng
  • Linfeng Zhang
  • Xuming Hu

Despite their powerful capabilities, multimodal large language models (MLLMs) suffer from considerable computational overhead due to their reliance on massive visual tokens. Recent studies have explored token pruning to alleviate this problem, which typically uses text-vision cross-attention or [CLS] attention to assess and discard redundant visual tokens. In this work, we identify a critical limitation of such attention-first pruning approaches, i.e., they tend to preserve semantically similar tokens, resulting in pronounced performance drops under high pruning rates. To this end, we propose HoloV, a simple yet effective, plug-and-play visual token pruning framework for efficient inference. Distinct from previous attention-first schemes, HoloV rethinks token retention from a holistic perspective. By adaptively distributing the pruning budget across different spatial crops, HoloV ensures that the retained tokens capture the global visual context rather than isolated salient features. This strategy minimizes representational collapse and maintains task-relevant information even under aggressive pruning. Experimental results demonstrate that our HoloV achieves superior performance across various tasks, MLLM architectures, and pruning ratios compared to SOTA methods. For instance, LLaVA-1.5 equipped with HoloV preserves 95.8% of the original performance after pruning 88.9% of visual tokens, achieving superior efficiency-accuracy trade-offs.
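The holistic retention idea in the abstract (distributing a fixed pruning budget across spatial crops and keeping the locally top-scoring tokens, rather than ranking all tokens globally) can be pictured with a toy sketch. This is not the authors' HoloV implementation; the function name, grid size, and saliency scores below are illustrative assumptions.

```python
import numpy as np

def holistic_prune(scores: np.ndarray, keep_ratio: float, grid: int = 2) -> np.ndarray:
    """Toy sketch: keep tokens per spatial crop instead of globally.

    scores: (H, W) saliency per visual token (e.g. attention weights);
    assumes H and W are divisible by `grid` for brevity.
    Returns a boolean (H, W) mask of retained tokens.
    """
    H, W = scores.shape
    mask = np.zeros_like(scores, dtype=bool)
    hs, ws = H // grid, W // grid
    for i in range(grid):
        for j in range(grid):
            crop = scores[i*hs:(i+1)*hs, j*ws:(j+1)*ws]
            k = max(1, int(round(keep_ratio * crop.size)))
            # keep only the top-k tokens inside this crop
            flat = crop.ravel()
            top = np.argpartition(flat, -k)[-k:]
            sub = np.zeros(crop.size, dtype=bool)
            sub[top] = True
            mask[i*hs:(i+1)*hs, j*ws:(j+1)*ws] = sub.reshape(crop.shape)
    return mask
```

With a global top-k, one highly salient region could absorb the entire budget; the per-crop allocation guarantees every spatial region keeps some representatives, which is the "holistic context" the abstract refers to.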

NeurIPS Conference 2025 Conference Paper

GeneMAN: Generalizable Single-Image 3D Human Reconstruction from Multi-Source Human Data

  • Wentao Wang
  • Hang Ye
  • Fangzhou Hong
  • Xue Yang
  • Jianfu Zhang
  • Yizhou Wang
  • Ziwei Liu
  • Liang Pan

Given a single in-the-wild human photo, it remains a challenging task to reconstruct a high-fidelity 3D human model. Existing methods face difficulties including a) the varying body proportions captured by in-the-wild human images; b) diverse personal belongings within the shot; and c) ambiguities in human postures and inconsistency in human textures. In addition, the scarcity of high-quality human data intensifies the challenge. To address these problems, we propose a Generalizable image-to-3D huMAN reconstruction framework, dubbed GeneMAN, built upon a comprehensive multi-source collection of high-quality human data, including 3D scans, multi-view videos, single photos, and our generated synthetic human data. GeneMAN encompasses three key modules. 1) Without relying on parametric human models (e.g., SMPL), GeneMAN first trains a human-specific text-to-image diffusion model and a view-conditioned diffusion model, serving as GeneMAN's 2D and 3D human priors for reconstruction, respectively. 2) With the help of the pretrained human prior models, the Geometry Initialization-&-Sculpting pipeline is leveraged to recover high-quality 3D human geometry from a single image. 3) To achieve high-fidelity 3D human textures, GeneMAN employs the Multi-Space Texture Refinement pipeline, consecutively refining textures in the latent and pixel spaces. Extensive experimental results demonstrate that GeneMAN can generate high-quality 3D human models from a single image input, outperforming prior state-of-the-art methods. Notably, GeneMAN shows much better generalizability in dealing with in-the-wild images, often yielding high-quality 3D human models in natural poses with common items, regardless of the body proportions in the input images.

IJCAI Conference 2025 Conference Paper

Human-Centric Foundation Models: Perception, Generation and Agentic Modeling

  • Shixiang Tang
  • Yizhou Wang
  • Lu Chen
  • Yuan Wang
  • Sida Peng
  • Dan Xu
  • Wanli Ouyang

Human understanding and generation are critical for modeling digital humans and humanoid embodiments. Recently, Human-centric Foundation Models (HcFMs), inspired by the success of generalist models such as large language and vision models, have emerged to unify diverse human-centric tasks into a single framework, surpassing traditional task-specific approaches. In this survey, we present a comprehensive overview of HcFMs by proposing a taxonomy that categorizes current approaches into four groups: (1) Human-centric Perception Foundation Models that capture fine-grained features for multi-modal 2D and 3D understanding; (2) Human-centric AIGC Foundation Models that generate high-fidelity, diverse human-related content; (3) Unified Perception and Generation Models that integrate these capabilities to enhance both human understanding and synthesis; and (4) Human-centric Agentic Foundation Models that extend beyond perception and generation to learn human-like intelligence and interactive behaviors for humanoid embodied tasks. We review state-of-the-art techniques and discuss emerging challenges and future research directions. This survey aims to serve as a roadmap for researchers and practitioners working towards more robust, versatile, and intelligent modeling of digital humans and embodiments. Website: https://github.com/HumanCentricModels/Awesome-Human-Centric-Foundation-Models/

ICML Conference 2025 Conference Paper

Look Twice Before You Answer: Memory-Space Visual Retracing for Hallucination Mitigation in Multimodal Large Language Models

  • Xin Zou 0001
  • Yizhou Wang
  • Yibo Yan
  • Yuanhuiyi Lyu
  • Kening Zheng
  • Sirui Huang
  • Junkai Chen
  • Peijie Jiang

Despite their impressive capabilities, Multimodal Large Language Models (MLLMs) are prone to hallucinations, i.e., generated content that is nonsensical or unfaithful to input sources. Unlike in LLMs, hallucinations in MLLMs often stem from the sensitivity of the text decoder to visual tokens, leading to a phenomenon akin to "amnesia" about visual information. To address this issue, we propose MemVR, a novel decoding paradigm inspired by common cognition: when the memory of an image seen a moment before is forgotten, people look at it again for factual answers. Following this principle, we treat visual tokens as supplementary evidence, re-injecting them into the MLLM through the Feed-Forward Network (FFN) as "key-value memory" at the middle trigger layer. This look-twice mechanism is triggered when the model exhibits high uncertainty during inference, effectively enhancing factual alignment. Comprehensive experimental evaluations demonstrate that MemVR significantly mitigates hallucination across various MLLMs and excels on general benchmarks without incurring additional time overhead.
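The trigger logic described above (re-inject visual evidence only when the decoder is uncertain) can be sketched with a small, framework-free example. All names, shapes, and the entropy threshold here are illustrative assumptions, not the paper's actual implementation.

```python
import numpy as np

def softmax(x: np.ndarray) -> np.ndarray:
    e = np.exp(x - x.max(-1, keepdims=True))
    return e / e.sum(-1, keepdims=True)

def memvr_step(hidden, logits, visual_tokens, W_inject, tau=2.0):
    """Toy sketch of the look-twice idea: when the next-token
    distribution is too uncertain (entropy above tau), re-inject
    pooled visual evidence into the hidden state at a middle layer.
    Shapes: hidden (d,), logits (V,), visual_tokens (n, dv),
    W_inject (dv, d) -- all hypothetical."""
    p = softmax(logits)
    entropy = -(p * np.log(p + 1e-9)).sum()
    if entropy > tau:  # trigger only under high uncertainty
        visual_memory = visual_tokens.mean(axis=0) @ W_inject
        hidden = hidden + visual_memory
    return hidden, entropy
```

A confident (peaked) distribution leaves the hidden state untouched, so the mechanism adds no work on easy steps, matching the abstract's claim of no extra time overhead in the common case.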

NeurIPS Conference 2025 Conference Paper

Reinforced Context Order Recovery for Adaptive Reasoning and Planning

  • Long Ma
  • Fangwei Zhong
  • Yizhou Wang

Modern causal language models, along with the rapidly developing family of discrete diffusion models, can now produce a wide variety of interesting and useful content. However, these families of models are predominantly trained to output tokens in a fixed (left-to-right) or random order, which may deviate from the logical order in which the tokens are originally generated. In this paper, we observe that current causal and diffusion models encounter difficulties on problems that require adaptive token generation orders to solve tractably, which we characterize with the $\mathcal{V}$-information framework. Motivated by this, we propose Reinforced Context Order Recovery (ReCOR), a reinforcement-learning-based framework that extracts adaptive, data-dependent token generation orders from text data without annotations. Self-supervised by token prediction statistics, ReCOR estimates the hardness of predicting every unfilled token and adaptively selects the next token during both training and inference. Experiments on challenging reasoning and planning datasets demonstrate the superior performance of ReCOR compared with baselines, sometimes outperforming oracle models supervised with the ground-truth order.
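The order-selection step the abstract describes can be pictured with a greedy stand-in: among the not-yet-generated positions, pick the one the model is currently most confident about. The per-position hardness scores are assumed here to be predictive entropies; the real ReCOR learns this selection with reinforcement learning rather than a fixed greedy rule.

```python
def next_position(entropies, filled):
    """Greedy stand-in for an adaptive order policy: return the index
    of the unfilled position with the lowest predictive entropy, i.e.,
    the token the model currently finds easiest to predict. Returns
    None when every position is already filled."""
    best, best_h = None, float("inf")
    for pos, h in enumerate(entropies):
        if pos not in filled and h < best_h:
            best, best_h = pos, h
    return best
```

A generation loop would repeatedly call this, fill the chosen position, and re-estimate the entropies conditioned on the tokens placed so far, so the order adapts to the data instead of being fixed left-to-right.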

IROS Conference 2024 Conference Paper

A Novel Geometrical Structure Robot Hand for Linear-parallel Pinching and Coupled Self-adaptive Hybrid Grasping

  • Shi Chen
  • Bihao Zhang
  • Kehan Feng
  • Yizhou Wang
  • Jiayun Li
  • Wenzeng Zhang

Current robot grippers capable of self-adaptive or coupled grasping often cannot perform linear-parallel pinching at the physical end of the gripper, a mode widely used in industrial applications. For this reason, this paper introduces a gripper with hybrid grasping modes: the LPCSA hand. It achieves three grasping modes: linear-parallel pinching, coupled grasping, and self-adaptive grasping. The design cleverly couples two Chebyshev linear mechanisms to enable flat movement at the end of the finger, and utilizes the deformability of a parallelogram to achieve self-adaptive grasping. Furthermore, the gripper uses an idle stroke and a special component to facilitate switching between the three modes. The linear-parallel pinching function is suitable for pinching objects of different sizes on a desktop, the self-adaptive grasping mode can adapt to objects of various shapes and sizes, and the coupled grasping mode enables fast grasping of irregular objects. This paper also analyzes the kinematics and dynamics of the LPCSA hand and, through experiments, demonstrates that the LPCSA hand has a wide grasping space and stable performance.

NeurIPS Conference 2024 Conference Paper

Richelieu: Self-Evolving LLM-Based Agents for AI Diplomacy

  • Zhenyu Guan
  • Xiangyu Kong
  • Fangwei Zhong
  • Yizhou Wang

Diplomacy is one of the most sophisticated activities in human society, involving complex interactions among multiple parties that require skills in social reasoning, negotiation, and long-term strategic planning. Previous AI agents have demonstrated their ability to handle multi-step games and large action spaces in multi-agent tasks. However, diplomacy involves a staggering magnitude of decision spaces, especially considering the negotiation stage required. While recent agents based on large language models (LLMs) have shown potential in various applications, they still struggle with extended planning periods in complex multi-agent settings. Leveraging recent technologies for LLM-based agents, we aim to explore AI's potential to create a human-like agent capable of executing comprehensive multi-agent missions by integrating three fundamental capabilities: 1) strategic planning with memory and reflection; 2) goal-oriented negotiation with social reasoning; and 3) augmenting memory through self-play games for self-evolution without a human in the loop. Project page: https://sites.google.com/view/richelieu-diplomacy.

NeurIPS Conference 2023 Conference Paper

BeaverTails: Towards Improved Safety Alignment of LLM via a Human-Preference Dataset

  • Jiaming Ji
  • Mickel Liu
  • Josef Dai
  • Xuehai Pan
  • Chi Zhang
  • Ce Bian
  • Boyuan Chen
  • Ruiyang Sun

In this paper, we introduce the BeaverTails dataset, aimed at fostering research on safety alignment in large language models (LLMs). This dataset uniquely separates annotations of helpfulness and harmlessness for question-answering pairs, thus offering distinct perspectives on these crucial attributes. In total, we have gathered safety meta-labels for 333,963 question-answer (QA) pairs and 361,903 pairs of expert comparison data for both the helpfulness and harmlessness metrics. We further showcase applications of BeaverTails in content moderation and reinforcement learning with human feedback (RLHF), emphasizing its potential for practical safety measures in LLMs. We believe this dataset provides vital resources for the community, contributing towards the safe development and deployment of LLMs. Our project page is available at the following URL: https://sites.google.com/view/pku-beavertails.

NeurIPS Conference 2023 Conference Paper

Causal Discovery from Subsampled Time Series with Proxy Variables

  • Mingzhou Liu
  • Xinwei Sun
  • Lingjing Hu
  • Yizhou Wang

Inferring causal structures from time series data is the central interest of many scientific inquiries. A major barrier to such inference is the problem of subsampling, i.e., the frequency of measurement is much lower than that of causal influence. To overcome this problem, numerous methods have been proposed, yet they were either limited to the linear case or failed to achieve identifiability. In this paper, we propose a constraint-based algorithm that can identify the entire causal structure from subsampled time series, without any parametric constraint. Our observation is that the challenge of subsampling arises mainly from hidden variables at the unobserved time steps. Meanwhile, every hidden variable has an observed proxy, which is essentially itself at some observable time in the future, benefiting from the temporal structure. Based on these observations, we can leverage the proxies to remove the bias induced by the hidden variables and hence achieve identifiability. Following this intuition, we propose a proxy-based causal discovery algorithm. Our algorithm is nonparametric and can achieve full causal identification. Its theoretical advantages are reflected in synthetic and real-world experiments.

NeurIPS Conference 2023 Conference Paper

ChimpACT: A Longitudinal Dataset for Understanding Chimpanzee Behaviors

  • Xiaoxuan Ma
  • Stephan Kaufhold
  • Jiajun Su
  • Wentao Zhu
  • Jack Terwilliger
  • Andres Meza
  • Yixin Zhu
  • Federico Rossano

Understanding the behavior of non-human primates is crucial for improving animal welfare, modeling social behavior, and gaining insights into distinctively human and phylogenetically shared behaviors. However, the lack of datasets on non-human primate behavior hinders in-depth exploration of primate social interactions, posing challenges to research on our closest living relatives. To address these limitations, we present ChimpACT, a comprehensive dataset for quantifying the longitudinal behavior and social relations of chimpanzees within a social group. Spanning from 2015 to 2018, ChimpACT features videos of a group of over 20 chimpanzees residing at the Leipzig Zoo, Germany, with a particular focus on documenting the developmental trajectory of one young male, Azibo. ChimpACT is both comprehensive and challenging, consisting of 163 videos with a cumulative 160,500 frames, each richly annotated with detection, identification, pose estimation, and fine-grained spatiotemporal behavior labels. We benchmark representative methods of three tracks on ChimpACT: (i) tracking and identification, (ii) pose estimation, and (iii) spatiotemporal action detection of the chimpanzees. Our experiments reveal that ChimpACT offers ample opportunities for both devising new methods and adapting existing ones to solve fundamental computer vision tasks applied to chimpanzee groups, such as detection, pose estimation, and behavior analysis, ultimately deepening our comprehension of communication and sociality in non-human primates.

AAAI Conference 2023 Conference Paper

RSPT: Reconstruct Surroundings and Predict Trajectory for Generalizable Active Object Tracking

  • Fangwei Zhong
  • Xiao Bi
  • Yudi Zhang
  • Wei Zhang
  • Yizhou Wang

Active Object Tracking (AOT) aims to maintain a specific relation between the tracker and object(s) by autonomously controlling the motion system of a tracker given observations. It is widely used in applications such as mobile robots and autonomous driving. However, building a generalizable active tracker that works robustly across various scenarios remains a challenge, particularly in unstructured environments with cluttered obstacles and diverse layouts. To realize this, we argue that the key is to construct a state representation that can model the geometric structure of the surroundings and the dynamics of the target. To this end, we propose a framework called RSPT that forms a structure-aware motion representation by Reconstructing Surroundings and Predicting the target Trajectory. Moreover, we further enhance the generalization of the policy network by training with an asymmetric dueling mechanism. Empirical results show that RSPT outperforms existing methods in unseen environments, especially those with cluttered obstacles and diverse layouts. We also demonstrate good sim-to-real transfer when deploying RSPT in real-world scenarios.

NeurIPS Conference 2023 Conference Paper

Social Motion Prediction with Cognitive Hierarchies

  • Wentao Zhu
  • Jason Qin
  • Yuke Lou
  • Hang Ye
  • Xiaoxuan Ma
  • Hai Ci
  • Yizhou Wang

Humans exhibit a remarkable capacity for anticipating the actions of others and planning their own actions accordingly. In this study, we strive to replicate this ability by addressing the social motion prediction problem. We introduce a new benchmark, a novel formulation, and a cognition-inspired framework. We present Wusi, a 3D multi-person motion dataset under the context of team sports, which features intense and strategic human interactions and diverse pose distributions. By reformulating the problem from a multi-agent reinforcement learning perspective, we incorporate behavioral cloning and generative adversarial imitation learning to boost learning efficiency and generalization. Furthermore, we take into account the cognitive aspects of the human social action planning process and develop a cognitive hierarchy framework to predict strategic human social interactions. We conduct comprehensive experiments to validate the effectiveness of our proposed dataset and approach.

AAAI Conference 2022 Conference Paper

Causal Intervention for Subject-Deconfounded Facial Action Unit Recognition

  • Yingjie Chen
  • Diqi Chen
  • Tao Wang
  • Yizhou Wang
  • Yun Liang

Subject-invariant facial action unit (AU) recognition remains challenging because the data distribution varies across subjects. In this paper, we propose a causal inference framework for subject-invariant facial action unit recognition. To illustrate the causal effects in the AU recognition task, we formulate the causalities among facial images, subjects, latent AU semantic relations, and estimated AU occurrence probabilities via a structural causal model. By constructing such a causal diagram, we clarify the causal effect among variables and propose a plug-in causal intervention module, CIS, to deconfound the confounder Subject in the causal diagram. Extensive experiments conducted on two commonly used AU benchmark datasets, BP4D and DISFA, show the effectiveness of our CIS, and the model with CIS inserted, CISNet, achieves state-of-the-art performance.

IJCAI Conference 2022 Conference Paper

Intrinsic Image Decomposition by Pursuing Reflectance Image

  • Tzu-Heng Lin
  • Pengxiao Wang
  • Yizhou Wang

Intrinsic image decomposition is a fundamental problem for many computer vision applications. While recent deep-learning-based methods have achieved very promising results on synthetic, densely labeled datasets, results on real-world datasets are still far from human-level performance. This is mostly because collecting dense supervision on a real-world dataset is impossible; only a sparse set of pairwise human judgements is typically available, and it is very difficult for models to learn in such settings. In this paper, we investigate the possibility of using only reflectance images for supervision during training, which greatly reduces the demand for labeled data. To achieve this goal, we investigate reflectance images in depth and find that they actually comprise two components: flat surfaces carrying low-frequency information, and boundaries carrying high-frequency details. We therefore propose to disentangle the learning of these two components of the reflectance images. We argue that through this procedure the reflectance images can be better modeled, and in the meantime the shading images, though not supervised, also achieve decent results. Extensive experiments show that our proposed network outperforms the current state of the art by a large margin on the most challenging real-world IIW dataset. We also find, surprisingly, that on the densely labeled datasets (MIT and MPI-Sintel), our network achieves state-of-the-art results on both reflectance and shading images even though we only apply supervision to the reflectance images during training.

AAAI Conference 2022 Conference Paper

LUNA: Localizing Unfamiliarity Near Acquaintance for Open-Set Long-Tailed Recognition

  • Jiarui Cai
  • Yizhou Wang
  • Hung-Min Hsu
  • Jenq-Neng Hwang
  • Kelsey Magrane
  • Craig S Rose

The predefined, artificially balanced training classes in object recognition have limited capability in modeling real-world scenarios, where objects follow imbalanced distributions with unknown classes. In this paper, we discuss a promising solution to the Open-set Long-Tailed Recognition (OLTR) task utilizing metric learning. First, we propose a distribution-sensitive loss, which weighs the tail classes more heavily to decrease the intra-class distance in the feature space. Building upon these concentrated feature clusters, a local-density-based metric is introduced, called Localizing Unfamiliarity Near Acquaintance (LUNA), to measure the novelty of a testing sample. LUNA is flexible with different cluster sizes and is reliable on the cluster boundary by considering neighbors of different properties. Moreover, contrary to most existing works that treat open-set detection as a simple binary decision, LUNA is a quantitative measurement with interpretable meaning. Our proposed method exceeds the state-of-the-art algorithm by 4-6% in closed-set recognition accuracy and by 4% in F-measure under the open-set setting on public benchmark datasets, including our newly introduced fine-grained OLTR dataset of marine species (MS-LT), the first naturally distributed OLTR dataset revealing the genuine genetic relationships of the classes.

NeurIPS Conference 2022 Conference Paper

MATE: Benchmarking Multi-Agent Reinforcement Learning in Distributed Target Coverage Control

  • Xuehai Pan
  • Mickel Liu
  • Fangwei Zhong
  • Yaodong Yang
  • Song-Chun Zhu
  • Yizhou Wang

We introduce the Multi-Agent Tracking Environment (MATE), a novel multi-agent environment that simulates target coverage control problems in the real world. MATE hosts an asymmetric cooperative-competitive game consisting of two groups of learning agents--"cameras" and "targets"--with opposing interests. Specifically, "cameras", a group of directional sensors, are mandated to actively control their directional perception areas to maximize the coverage rate of targets. On the other side, "targets" are mobile agents that aim to transport cargo between multiple randomly assigned warehouses while minimizing exposure to the camera sensor networks. To showcase the practicality of MATE, we benchmark multi-agent reinforcement learning (MARL) algorithms from different aspects, including cooperation, communication, scalability, robustness, and asymmetric self-play. We start by reporting results for cooperative tasks using MARL algorithms (MAPPO, IPPO, QMIX, MADDPG) and the results after augmenting with multi-agent communication protocols (TarMAC, I2C). We then evaluate the effectiveness of popular self-play techniques (PSRO, fictitious self-play) in an asymmetric zero-sum competitive game. This process of co-evolution between cameras and targets helps to realize a less exploitable camera network. We also observe the emergence of different roles among the target agents when incorporating I2C into target-target communication. MATE is written purely in Python and integrated with the OpenAI Gym API to enhance user-friendliness. Our project is released at https://github.com/UnrealTracking/mate.
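Since MATE follows the OpenAI Gym API, interaction takes the usual reset/step form. The stub environment below only imitates that interface so the loop is runnable; it is not the MATE package, and its observation, reward, and horizon are made-up placeholders.

```python
class StubCameraEnv:
    """Stand-in following the classic Gym API:
    reset() -> obs, step(action) -> (obs, reward, done, info)."""
    def __init__(self, horizon: int = 5):
        self.horizon = horizon
        self.t = 0

    def reset(self):
        self.t = 0
        return [0.0]  # dummy observation

    def step(self, action):
        self.t += 1
        reward = 1.0 if action == 1 else 0.0  # pretend coverage reward
        done = self.t >= self.horizon
        return [float(self.t)], reward, done, {}

def rollout(env, policy):
    """Run one episode with a callable policy(obs) -> action."""
    obs, total, done = env.reset(), 0.0, False
    while not done:
        obs, reward, done, _info = env.step(policy(obs))
        total += reward
    return total
```

Any agent exposing a `policy(obs) -> action` callable can be evaluated this way; in the real environment the observation and action spaces are those of the camera and target agents.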

IJCAI Conference 2022 Conference Paper

MemREIN: Rein the Domain Shift for Cross-Domain Few-Shot Learning

  • Yi Xu
  • Lichen Wang
  • Yizhou Wang
  • Can Qin
  • Yulun Zhang
  • Yun Fu

Few-shot learning aims to enable models to generalize to new categories (query instances) with only limited labeled samples (support instances) from each category. Metric-based mechanisms are a promising direction, comparing feature embeddings via different metrics, but they often fail to generalize to unseen domains due to the considerable domain gap. In this paper, we propose a novel framework, MemREIN, which considers Memorized, Restitution, and Instance Normalization for cross-domain few-shot learning. Specifically, an instance normalization algorithm is explored to alleviate feature dissimilarity, which provides initial model generalization ability. However, naively normalizing the features would lose fine-grained discriminative knowledge between different classes. To this end, a memorized module is further proposed to separate out the most refined knowledge and remember it. Then, a restitution module is utilized to restore discrimination ability from the learned knowledge. A novel reverse contrastive learning strategy is proposed to stabilize the distillation process. Extensive experiments on five popular benchmark datasets demonstrate that MemREIN well addresses the domain shift challenge and significantly improves performance, by up to 16.43% compared with state-of-the-art baselines.

AAAI Conference 2022 Conference Paper

MoCaNet: Motion Retargeting In-the-Wild via Canonicalization Networks

  • Wentao Zhu
  • Zhuoqian Yang
  • Ziang Di
  • Wayne Wu
  • Yizhou Wang
  • Chen Change Loy

We present a novel framework that brings the 3D motion retargeting task from controlled environments to in-the-wild scenarios. In particular, our method is capable of retargeting body motion from a character in a 2D monocular video to a 3D character without using any motion capture system or 3D reconstruction procedure. It is designed to leverage massive online videos for unsupervised training, requiring neither 3D annotations nor motion-body pairing information. The proposed method is built upon two novel canonicalization operations, structure canonicalization and view canonicalization. Trained with the canonicalization operations and the derived regularizations, our method learns to factorize a skeleton sequence into three independent semantic subspaces, i.e., motion, structure, and view angle. The disentangled representation enables motion retargeting from 2D to 3D with high precision. Our method achieves superior performance on motion transfer benchmarks with large body variations and challenging actions. Notably, the canonicalized skeleton sequence could serve as a disentangled and interpretable representation of human motion that benefits action analysis and motion retrieval.

NeurIPS Conference 2022 Conference Paper

Unsupervised Object Detection Pretraining with Joint Object Priors Generation and Detector Learning

  • Yizhou Wang
  • Meilin Chen
  • Shixiang Tang
  • Feng Zhu
  • Haiyang Yang
  • Lei Bai
  • Rui Zhao
  • Yunfeng Yan

Unsupervised pretraining methods for object detection aim to learn object discrimination and localization abilities from large amounts of images. Typically, recent works design pretext tasks that supervise the detector to predict defined object priors. They normally leverage heuristic methods to produce the object priors, e.g., selective search, which separates prior generation from detector learning and leads to sub-optimal solutions. In this work, we propose a novel object detection pretraining framework that generates object priors and learns detectors jointly, producing accurate object priors from the model itself. Specifically, region priors are extracted via attention maps from the encoder, which highlight foregrounds, and instance priors are the selected high-quality output bounding boxes of the detection decoder. By assuming objects are instances in the foreground, we can generate object priors from both region and instance priors. Moreover, our object priors are jointly refined along with the detector optimization. With better object priors as supervision, the model achieves better detection capability, which in turn promotes better object prior generation. Our method improves over competitive approaches by +1.3 AP and +1.7 AP in the 1% and 10% COCO low-data object detection regimes.

NeurIPS Conference 2020 Conference Paper

Learning Multi-Agent Coordination for Enhancing Target Coverage in Directional Sensor Networks

  • Jing Xu
  • Fangwei Zhong
  • Yizhou Wang

Maximum target coverage by adjusting the orientation of distributed sensors is an important problem in directional sensor networks (DSNs). The problem is challenging because targets usually move randomly while the coverage range of sensors is limited in both angle and distance. Sensors must therefore be coordinated to achieve good target coverage with low power consumption, e.g., no missing targets and little redundant coverage. To this end, we propose Hierarchical Target-oriented Multi-Agent Coordination (HiT-MAC), which decomposes the target coverage problem into two-level tasks: target assignment by a coordinator and tracking of the assigned targets by executors. Specifically, the coordinator periodically monitors the environment globally and allocates targets to each executor; in turn, each executor only needs to track its assigned targets. To learn HiT-MAC effectively with reinforcement learning, we further introduce several practical techniques, including a self-attention module and marginal contribution approximation for the coordinator, and a goal-conditional observation filter for the executor. Empirical results demonstrate the advantage of HiT-MAC in coverage rate, learning efficiency, and scalability compared to baselines. We also conduct an ablative analysis of the effectiveness of the introduced components in the framework.

AAAI Conference 2020 Conference Paper

Pose-Assisted Multi-Camera Collaboration for Active Object Tracking

  • Jing Li
  • Jing Xu
  • Fangwei Zhong
  • Xiangyu Kong
  • Yu Qiao
  • Yizhou Wang

Active Object Tracking (AOT) is crucial to many vision-based applications, e.g., mobile robots and intelligent surveillance. However, a number of challenges arise when deploying active tracking in complex scenarios, e.g., the target being frequently occluded by obstacles. In this paper, we extend single-camera AOT to a multi-camera setting, where cameras track a target in a collaborative fashion. To achieve effective collaboration among cameras, we propose a novel Pose-Assisted Multi-Camera Collaboration System, which enables a camera to cooperate with the others by sharing camera poses for active object tracking. In the system, each camera is equipped with two controllers and a switcher: the vision-based controller tracks targets based on observed images, and the pose-based controller moves the camera in accordance with the poses of the other cameras. At each step, the switcher decides which of the two controllers' actions to take according to the visibility of the target. The experimental results demonstrate that our system outperforms all the baselines and is capable of generalizing to unseen environments. The code and demo videos are available at https://sites.google.com/view/pose-assistedcollaboration.

NeurIPS Conference 2019 Conference Paper

L_DMI: A Novel Information-theoretic Loss Function for Training Deep Nets Robust to Label Noise

  • Yilun Xu
  • Peng Cao
  • Yuqing Kong
  • Yizhou Wang

Accurately annotating a large-scale dataset is notoriously expensive in both time and money. Although acquiring a low-quality-annotated dataset can be much cheaper, using such a dataset without particular treatment often badly damages the performance of trained models. Various methods have been proposed for learning with noisy labels. However, most methods only handle limited kinds of noise patterns, require auxiliary information or steps (e.g., knowing or estimating the noise transition matrix), or lack theoretical justification. In this paper, we propose a novel information-theoretic loss function, L_DMI, for training deep neural networks robust to label noise. The core of L_DMI is a generalized version of mutual information, termed Determinant-based Mutual Information (DMI), which is not only information-monotone but also relatively invariant. To the best of our knowledge, L_DMI is the first loss function that is provably robust to instance-independent label noise, regardless of noise pattern, and it can be applied to any existing classification neural network straightforwardly without any auxiliary information. In addition to the theoretical justification, we empirically show that L_DMI outperforms all other counterparts in classification tasks on both image and natural language datasets, including Fashion-MNIST, CIFAR-10, Dogs vs. Cats, and MR, with a variety of synthesized noise patterns and noise amounts, as well as on the real-world dataset Clothing1M.
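For intuition only (this is a sketch of the determinant-based idea, not the authors' released implementation), the loss can be written as the negative log-determinant of the empirical joint matrix between one-hot labels and predicted class distributions; the toy data, the 1e-12 stabilizer, and the specific probabilities below are illustrative assumptions:

```python
import numpy as np

def l_dmi(probs, labels, num_classes):
    """Sketch of a determinant-based mutual-information loss.

    probs: (N, C) softmax outputs; labels: (N,) integer labels.
    Returns -log|det(U)|, where U is the empirical joint matrix
    between one-hot labels and predicted class distributions.
    """
    one_hot = np.eye(num_classes)[labels]        # (N, C)
    U = one_hot.T @ probs / len(labels)          # empirical joint, (C, C)
    return -np.log(np.abs(np.linalg.det(U)) + 1e-12)

# toy check: informative predictions give a lower loss than
# uniform (uninformative) predictions, whose joint matrix is rank-1
rng = np.random.default_rng(0)
y = rng.integers(0, 3, size=300)
good = np.eye(3)[y] * 0.94 + 0.02                # near one-hot, rows sum to 1
uniform = np.full((300, 3), 1.0 / 3)
assert l_dmi(good, y, 3) < l_dmi(uniform, y, 3)
```

The determinant is what makes the loss insensitive to a class-conditional relabeling of the targets, which is the intuition behind robustness to instance-independent noise.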

AAAI Conference 2018 Conference Paper

RAN4IQA: Restorative Adversarial Nets for No-Reference Image Quality Assessment

  • Hongyu Ren
  • Diqi Chen
  • Yizhou Wang

Inspired by the free-energy brain theory (Friston, Kilner, and Harrison 2006), which implies that the human visual system (HVS) tends to reduce uncertainty and restore perceptual details upon seeing a distorted image (Friston 2010), we propose the restorative adversarial net (RAN), a GAN-based model for no-reference image quality assessment (NR-IQA). RAN, which mimics this process of the HVS, consists of three components: a restorator, a discriminator, and an evaluator. The restorator restores and reconstructs input distorted image patches, while the discriminator distinguishes the reconstructed patches from pristine distortion-free patches. After restoration, we observe that the perceptual distance between the restored and the distorted patches is monotonic with respect to the distortion level, and we define the Gain of Restoration (GoR) based on this phenomenon. The evaluator predicts a perceptual score by extracting feature representations from the distorted and restored patches to measure GoR. Eventually, the quality score of an input image is estimated as a weighted sum of the patch scores. Experimental results on Waterloo Exploration, LIVE, and TID2013 show the effectiveness and generalization ability of RAN compared to state-of-the-art NR-IQA models.

AAAI Conference 2017 Conference Paper

Learning Discriminative Activated Simplices for Action Recognition

  • Chenxu Luo
  • Chang Ma
  • Chunyu Wang
  • Yizhou Wang

We address the task of action recognition from a sequence of 3D human poses. This task is challenging, first because poses of the same class can have large intra-class variations, caused either by inaccurate 3D pose estimation or by varied performing styles. In addition, different actions, e.g., walking vs. jogging, may share similar poses, which makes the representation insufficiently discriminative to differentiate the actions. To solve these problems, we propose a novel representation of 3D poses by a mixture of Discriminative Activated Simplices (DAS). Each DAS consists of a few bases and represents pose data by their convex combinations. The discriminative power of DAS is realized first by learning discriminative bases across classes with a block-diagonal constraint enforced on the basis coefficient matrix. Second, the DAS provides a tight characterization of the pose manifolds, reducing the chance of generating overlapping DAS between similar classes. We justify the power of the model on benchmark datasets and observe consistent performance improvements.

TIST Journal 2016 Journal Article

Efficient Generalized Fused Lasso and Its Applications

  • Bo Xin
  • Yoshinobu Kawahara
  • Yizhou Wang
  • Lingjing Hu
  • Wen Gao

Generalized fused lasso (GFL) penalizes variables with ℓ1 norms based both on the variables and their pairwise differences. GFL is useful when applied to data where prior information is expressed using a graph over the variables. However, the existing GFL algorithms incur high computational costs and do not scale to high-dimensional problems. In this study, we propose a fast and scalable algorithm for GFL. Based on the fact that the fusion penalty is the Lovász extension of a cut function, we show that the key building block of the optimization is equivalent to recursively solving graph-cut problems. Thus, we use a parametric flow algorithm to solve GFL in an efficient manner. Runtime comparisons demonstrate a significant speedup compared to existing GFL algorithms. Moreover, the proposed optimization framework is very general; by designing different cut functions, we also discuss the extension of GFL to directed graphs. Exploiting the scalability of the proposed algorithm, we demonstrate applications to the diagnosis of Alzheimer’s disease (AD) and to video background subtraction (BS). In the AD problem, we formulate the diagnosis of AD as GFL-regularized classification. Our experimental evaluations demonstrate that the diagnosis performance is promising, and the selected critical voxels are well structured, i.e., connected, consistent across cross-validation, and in agreement with prior pathological knowledge. In the BS problem, GFL naturally models arbitrary foregrounds without a predefined grouping of the pixels. Even with simple background models, e.g., a sparse linear combination of former frames, we achieve state-of-the-art performance on several public datasets.
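For concreteness, the GFL objective described above can be written down directly: a squared-loss fit term, an ℓ1 penalty on the variables, and an ℓ1 penalty on pairwise differences along the edges of a prior graph. The chain graph, the weights `lam1`/`lam2`, and the candidate solutions below are illustrative assumptions, and no solver is shown (the paper's contribution is the parametric-flow optimization, not the objective itself):

```python
import numpy as np

def gfl_objective(x, y, edges, lam1, lam2):
    """Evaluate the generalized fused lasso objective (sketch):
    0.5 * ||y - x||^2  +  lam1 * sum_i |x_i|
                       +  lam2 * sum_{(i,j) in E} |x_i - x_j|."""
    fit = 0.5 * np.sum((y - x) ** 2)
    sparsity = lam1 * np.sum(np.abs(x))
    fusion = lam2 * sum(abs(x[i] - x[j]) for i, j in edges)
    return fit + sparsity + fusion

# chain graph over 5 variables: the fusion term favors
# piecewise-constant solutions over copying the noisy data
y = np.array([0.1, 0.2, 2.0, 2.1, 1.9])
edges = [(i, i + 1) for i in range(4)]
flat = np.array([0.0, 0.0, 2.0, 2.0, 2.0])   # piecewise-constant candidate
assert gfl_objective(flat, y, edges, 0.05, 0.5) < gfl_objective(y, y, edges, 0.05, 0.5)
```

Replacing the chain with an arbitrary edge set is exactly how graph-structured prior information enters the model.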

NeurIPS Conference 2016 Conference Paper

Maximal Sparsity with Deep Networks?

  • Bo Xin
  • Yizhou Wang
  • Wen Gao
  • David Wipf
  • Baoyuan Wang

The iterations of many sparse estimation algorithms are comprised of a fixed linear filter cascaded with a thresholding nonlinearity, which collectively resemble a typical neural network layer. Consequently, a lengthy sequence of algorithm iterations can be viewed as a deep network with shared, hand-crafted layer weights. It is therefore quite natural to examine the degree to which a learned network model might act as a viable surrogate for traditional sparse estimation in domains where ample training data is available. While the possibility of a reduced computational budget is readily apparent when a ceiling is imposed on the number of layers, our work primarily focuses on estimation accuracy. In particular, it is well-known that when a signal dictionary has coherent columns, as quantified by a large RIP constant, then most tractable iterative algorithms are unable to find maximally sparse representations. In contrast, we demonstrate both theoretically and empirically the potential for a trained deep network to recover minimal $\ell_0$-norm representations in regimes where existing methods fail. The resulting system, which can effectively learn novel iterative sparse estimation algorithms, is deployed on a practical photometric stereo estimation problem, where the goal is to remove sparse outliers that can disrupt the estimation of surface normals from a 3D scene.

AAAI Conference 2016 Conference Paper

Robust Complex Behaviour Modeling at 90Hz

  • Xiangyu Kong
  • Yizhou Wang
  • Tao Xiang

Modeling complex crowd behaviour for tasks such as rare event detection has received increasing interest. However, existing methods are limited because (1) they are sensitive to noise, often resulting in a large number of false alarms; and (2) they rely on elaborate models, leading to high computational cost and making them unsuitable for processing a large number of video inputs in real time. In this paper, we overcome these limitations by introducing a novel complex behaviour modeling framework, which consists of a Binarized Cumulative Directional (BCD) feature as representation, novel spatial and temporal context modeling via iterative correlation maximization, and a set of behaviour models, each being a simple Bernoulli distribution. Despite its simplicity, our experiments on three benchmark datasets show that the framework significantly outperforms the state-of-the-art for both temporal video segmentation and rare event detection. Importantly, it is extremely efficient, reaching 90 Hz on a normal PC platform using MATLAB.

IJCAI Conference 2015 Conference Paper

Quantized Correlation Hashing for Fast Cross-Modal Search

  • Botong Wu
  • Qiang Yang
  • Wei-Shi Zheng
  • Yizhou Wang
  • Jingdong Wang

Cross-modal hashing is designed to facilitate fast search across domains. In this work, we present a cross-modal hashing approach, called quantized correlation hashing (QCH), which takes into consideration both the quantization loss over domains and the relation between domains. Unlike previous approaches that optimize the quantizer independently of maximizing domain correlation, our approach optimizes both processes simultaneously. The underlying relation between the domains that describe the same objects is established by maximizing the correlation between the hash codes across the domains. The resulting multi-modal objective function is transformed into a unimodal formulation, which is optimized through an alternating procedure. Experimental results on three real-world datasets demonstrate that our approach outperforms state-of-the-art multi-modal hashing methods.

AAAI Conference 2015 Conference Paper

Stable Feature Selection from Brain sMRI

  • Bo Xin
  • Lingjing Hu
  • Yizhou Wang
  • Wen Gao

Neuroimage analysis usually involves learning thousands or even millions of variables using only a limited number of samples. In this regard, sparse models, e.g., the lasso, are applied to select the optimal features and achieve high diagnosis accuracy. The lasso, however, usually yields independent, unstable features. Stability, a manifestation of the reproducibility of statistical results under reasonable perturbations to the data and the model (Yu 2013), is an important focus in statistics, especially in the analysis of high-dimensional data. In this paper, we explore a nonnegative generalized fused lasso model for stable feature selection in the diagnosis of Alzheimer’s disease. In addition to sparsity, our model incorporates two important pathological priors: the spatial cohesion of lesion voxels and the positive correlation between the features and the disease labels. To optimize the model, we propose an efficient algorithm by proving a novel link between total variation and fast network flow algorithms via conic duality. Experiments show that the proposed nonnegative model performs much better at exploring the intrinsic structure of data via selecting stable features, compared with other state-of-the-art methods.

AAAI Conference 2014 Conference Paper

Efficient Generalized Fused Lasso and its Application to the Diagnosis of Alzheimer’s Disease

  • Bo Xin
  • Yoshinobu Kawahara
  • Yizhou Wang
  • Wen Gao

Generalized fused lasso (GFL) penalizes variables with L1 norms based both on the variables and their pairwise differences. GFL is useful when applied to data where prior information is expressed using a graph over the variables. However, the existing GFL algorithms incur high computational costs and do not scale to high-dimensional problems. In this study, we propose a fast and scalable algorithm for GFL. Based on the fact that the fusion penalty is the Lovász extension of a cut function, we show that the key building block of the optimization is equivalent to recursively solving parametric graph-cut problems. Thus, we use a parametric flow algorithm to solve GFL in an efficient manner. Runtime comparisons demonstrated a significant speed-up compared with the existing GFL algorithms. By exploiting the scalability of the proposed algorithm, we formulated the diagnosis of Alzheimer’s disease as GFL. Our experimental evaluations demonstrated that the diagnosis performance was promising and that the selected critical voxels were well structured, i.e., connected, consistent according to cross-validation, and in agreement with prior clinical knowledge.