Arrow Research search

Author name cluster

Qingmin Liao

Possible papers associated with this exact author name in Arrow. This page groups case-insensitive exact name matches and is not a full identity disambiguation profile.

23 papers
2 author rows

Possible papers

23

AAAI Conference 2026 Conference Paper

AgentSwift: Efficient LLM Agent Design via Value-Guided Hierarchical Search

  • Yu Li
  • Lehui Li
  • Zhihao Wu
  • Qingmin Liao
  • Jianye Hao
  • Kun Shao
  • Fengli Xu

Large language model (LLM) agents have demonstrated strong capabilities across diverse domains, yet automated agent design remains a significant challenge. Current automated agent design approaches are often constrained by limited search spaces that primarily optimize workflows but fail to integrate crucial human-designed components like memory, planning, and tool use. Furthermore, these methods are hampered by high evaluation costs, as evaluating even a single new agent on a benchmark can require tens of dollars. The difficulty of this exploration is further exacerbated by inefficient search strategies that struggle to navigate the large design space effectively, making the discovery of novel agents a slow and resource-intensive process. To address these challenges, we propose AgentSwift, a novel framework for automated agent design. We formalize a hierarchical search space that jointly models agentic workflow and composable functional components. This structure moves beyond optimizing workflows alone by co-optimizing functional components, which enables the discovery of more complex and effective agent architectures. To make exploration within this expansive space feasible, we mitigate high evaluation costs by training a value model on a high-quality dataset, generated via a novel strategy combining combinatorial coverage and balanced Bayesian sampling for low-cost evaluation. Guiding the entire process is a hierarchical Monte Carlo Tree Search (MCTS) strategy, which is informed by uncertainty to efficiently navigate the search space. Evaluated across a comprehensive set of seven benchmarks spanning embodied, math, web, tool, and game domains, AgentSwift discovers agents that achieve an average performance gain of 8.34% over both existing automated agent search methods and manually designed agents. Moreover, our framework exhibits steeper and more stable search trajectories.
By enabling the efficient, automated composition of workflow with functional components, AgentSwift provides a scalable methodology to explore complex agent designs. Our framework serves as a launchpad for researchers to rapidly prototype and discover powerful agent architectures without the impediment of prohibitive evaluation costs.
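The abstract does not spell out its hierarchical, uncertainty-informed MCTS variant; for orientation, a minimal sketch of the standard UCT selection rule that such tree searches build on (names and numbers here are illustrative, not the authors' code):

```python
import math

def uct_score(total_value, visits, parent_visits, c=1.4):
    """Upper Confidence Bound for Trees: mean value (exploitation)
    plus a visit-count bonus (exploration)."""
    if visits == 0:
        return float("inf")  # unvisited children are tried first
    return total_value / visits + c * math.sqrt(math.log(parent_visits) / visits)

def select_child(children):
    """Pick the child maximizing UCT. Each child is a dict holding a
    'value' (sum of returns) and a 'visits' count."""
    parent_visits = sum(ch["visits"] for ch in children) + 1
    return max(children, key=lambda ch: uct_score(ch["value"], ch["visits"], parent_visits))

children = [
    {"name": "a", "value": 3.0, "visits": 10},
    {"name": "b", "value": 2.0, "visits": 2},
    {"name": "c", "value": 0.0, "visits": 0},
]
best = select_child(children)  # the unvisited child wins via the infinite bonus
```

In the paper's setting, the per-node value would come from the learned value model rather than rollouts, which is what makes the search cheap.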

AAAI Conference 2026 Conference Paper

WeightFlow: Learning Stochastic Dynamics via Evolving Weight of Neural Network

  • Ruikun Li
  • Jiazhen Liu
  • Huandong Wang
  • Qingmin Liao
  • Yong Li

Modeling stochastic dynamics from discrete observations is a key interdisciplinary challenge. Existing methods often fail to estimate the continuous evolution of probability densities from trajectories or face the curse of dimensionality. To address these limitations, we present a novel paradigm: modeling dynamics directly in the weight space of a neural network by projecting the evolving probability distribution. We first theoretically establish the connection between dynamic optimal transport in measure space and an equivalent energy functional in weight space. Subsequently, we design WeightFlow, which organizes the neural network weights into a graph and learns its evolution via a graph controlled differential equation. Experiments on interdisciplinary datasets show that WeightFlow improves performance by an average of 43.02% over state-of-the-art methods, providing an effective and scalable solution for modeling high-dimensional stochastic dynamics.

AAAI Conference 2025 Conference Paper

DM-Adapter: Domain-Aware Mixture-of-Adapters for Text-Based Person Retrieval

  • Yating Liu
  • Zimo Liu
  • Xiangyuan Lan
  • Wenming Yang
  • Yaowei Li
  • Qingmin Liao

Text-based person retrieval (TPR) has gained significant attention as a fine-grained and challenging task that closely aligns with practical applications. Tailoring CLIP to the person domain is now an emerging research topic due to the abundant knowledge of vision-language pretraining, but challenges remain during fine-tuning: (i) previous full-model fine-tuning in TPR is computationally expensive and prone to overfitting; (ii) existing parameter-efficient transfer learning (PETL) for TPR lacks fine-grained feature extraction. To address these issues, we propose Domain-Aware Mixture-of-Adapters (DM-Adapter), which unifies Mixture-of-Experts (MOE) and PETL to enhance fine-grained feature representations while maintaining efficiency. Specifically, Sparse Mixture-of-Adapters is designed in parallel to MLP layers in both vision and language branches, where different experts specialize in distinct aspects of person knowledge to handle features more finely. To promote the router to exploit domain information effectively and alleviate the routing imbalance, Domain-Aware Router is then developed by building a novel gating function and injecting learnable domain-aware prompts. Extensive experiments show that our DM-Adapter achieves state-of-the-art performance, outperforming previous methods by a significant margin.

ICML Conference 2025 Conference Paper

Learning from Suboptimal Data in Continuous Control via Auto-Regressive Soft Q-Network

  • Jijia Liu
  • Feng Gao
  • Qingmin Liao
  • Chao Yu 0005
  • Yu Wang 0002

Reinforcement learning (RL) for continuous control often requires large amounts of online interaction data. Value-based RL methods can mitigate this burden by offering relatively high sample efficiency. Some studies further enhance sample efficiency by incorporating offline demonstration data to “kick-start” training, achieving promising results in continuous control. However, they typically compute the Q-function independently for each action dimension, neglecting interdependencies and making it harder to identify optimal actions when learning from suboptimal data, such as non-expert demonstrations and online-collected data during the training process. To address these issues, we propose Auto-Regressive Soft Q-learning (ARSQ), a value-based RL algorithm that models Q-values in a coarse-to-fine, auto-regressive manner. First, ARSQ decomposes the continuous action space into discrete spaces in a coarse-to-fine hierarchy, enhancing sample efficiency for fine-grained continuous control tasks. Next, it auto-regressively predicts dimensional action advantages within each decision step, enabling more effective decision-making in continuous control tasks. We evaluate ARSQ on two continuous control benchmarks, RLBench and D4RL, integrating demonstration data into online training. On D4RL, which includes non-expert demonstrations, ARSQ achieves an average 1.62× performance improvement over the SOTA value-based baseline. On RLBench, which incorporates expert demonstrations, ARSQ surpasses various baselines, demonstrating its effectiveness in learning from suboptimal online-collected data.
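The coarse-to-fine discretization the abstract describes can be illustrated generically: each level refines the previous bin, so L levels of branching B give B^L effective resolution per action dimension. A hypothetical sketch, not the paper's implementation:

```python
def decode_action(indices, low, high, branching):
    """Map a sequence of per-level bin indices (coarsest first) back to a
    continuous action: each index narrows the interval by 1/branching,
    and the final action is the midpoint of the remaining interval."""
    lo, hi = low, high
    for idx in indices:
        width = (hi - lo) / branching
        lo = lo + idx * width
        hi = lo + width
    return (lo + hi) / 2.0

# Two levels of binary refinement over [-1, 1]: 4 distinguishable actions,
# selected by two discrete (and hence Q-learnable) choices.
action = decode_action([1, 0], low=-1.0, high=1.0, branching=2)
```

The autoregressive part of ARSQ would then condition each level's (and each dimension's) discrete choice on the choices already made.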

NeurIPS Conference 2025 Conference Paper

LLM-Explorer: A Plug-in Reinforcement Learning Policy Exploration Enhancement Driven by Large Language Models

  • Qianyue Hao
  • Yiwen Song
  • Qingmin Liao
  • Jian Yuan
  • Yong Li

Policy exploration is critical in reinforcement learning (RL), where existing approaches include ε-greedy, Gaussian process, etc. However, these approaches utilize preset stochastic processes and are indiscriminately applied to all kinds of RL tasks without considering task-specific features that influence policy exploration. Moreover, during RL training, the evolution of such stochastic processes is rigid, typically incorporating only a decay in the variance and failing to adjust flexibly according to the agent's real-time learning status. Inspired by the analysis and reasoning capabilities of large language models (LLMs), we design LLM-Explorer to adaptively generate task-specific exploration strategies with LLMs, enhancing the policy exploration in RL. In our design, we sample the learning trajectory of the agent during the RL training in a given task and prompt the LLM to analyze the agent's current policy learning status and then generate a probability distribution for future policy exploration. Updating the probability distribution periodically, we derive a stochastic process specialized for the particular task and dynamically adjusted to adapt to the learning process. Our design is a plug-in module compatible with various widely applied RL algorithms, including the DQN series, DDPG, TD3, and any possible variants developed based on them. Through extensive experiments on the Atari and MuJoCo benchmarks, we demonstrate LLM-Explorer's capability to enhance RL policy exploration, achieving an average performance improvement of up to 37.27%. Our code is open-source at https://github.com/tsinghua-fib-lab/LLM-Explorer for reproducibility.
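For contrast with the adaptive strategies the paper generates, the rigid preset exploration it critiques — ε-greedy with a fixed decay schedule — looks like this (a generic sketch; all parameter values are illustrative):

```python
import random

def epsilon_greedy(q_values, epsilon, rng=random):
    """Preset exploration: with probability epsilon take a uniformly
    random action, otherwise the greedy (argmax-Q) action."""
    if rng.random() < epsilon:
        return rng.randrange(len(q_values))
    return max(range(len(q_values)), key=q_values.__getitem__)

def decayed_epsilon(step, eps_start=1.0, eps_end=0.05, decay_steps=10_000):
    """Rigid linear decay schedule, applied identically to every task
    regardless of the agent's actual learning progress."""
    frac = min(step / decay_steps, 1.0)
    return eps_start + frac * (eps_end - eps_start)

q = [0.1, 0.9, 0.3]
greedy_action = epsilon_greedy(q, epsilon=0.0)  # always argmax when epsilon is 0
```

The paper's plug-in replaces the fixed `decayed_epsilon` schedule with a distribution that an LLM re-derives periodically from the observed learning trajectory.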

AAAI Conference 2025 Conference Paper

Pose Magic: Efficient and Temporally Consistent Human Pose Estimation with a Hybrid Mamba-GCN Network

  • Xinyi Zhang
  • Qiqi Bao
  • Qinpeng Cui
  • Wenming Yang
  • Qingmin Liao

Current state-of-the-art (SOTA) methods in 3D Human Pose Estimation (HPE) are primarily based on Transformers. However, existing Transformer-based 3D HPE backbones often encounter a trade-off between accuracy and computational efficiency. To resolve the above dilemma, in this work, we leverage recent advances in state space models and utilize Mamba for high-quality and efficient long-range modeling. Nonetheless, Mamba still faces challenges in precisely exploiting local dependencies between joints. To address these issues, we propose a new attention-free hybrid spatiotemporal architecture named Hybrid Mamba-GCN (Pose Magic). This architecture introduces local enhancement with GCN by capturing relationships between neighboring joints, thus producing new representations to complement Mamba's outputs. By adaptively fusing representations from Mamba and GCN, Pose Magic demonstrates superior capability in learning the underlying 3D structure. To meet the requirements of real-time inference, we also provide a fully causal version. Extensive experiments show that Pose Magic achieves new SOTA results (0.9 mm drop) while saving 74.1% FLOPs. In addition, Pose Magic exhibits optimal motion consistency and the ability to generalize to unseen sequence lengths.

ICLR Conference 2025 Conference Paper

Predicting the Energy Landscape of Stochastic Dynamical System via Physics-informed Self-supervised Learning

  • Ruikun Li 0002
  • Huandong Wang
  • Qingmin Liao
  • Yong Li 0008

Energy landscapes play a crucial role in shaping dynamics of many real-world complex systems. System evolution is often modeled as particles moving on a landscape under the combined effect of energy-driven drift and noise-induced diffusion, where the energy governs the long-term motion of the particles. Estimating the energy landscape of a system has been a longstanding interdisciplinary challenge, hindered by the high operational costs or the difficulty of obtaining supervisory signals. Therefore, the question of how to infer the energy landscape in the absence of true energy values is critical. In this paper, we propose a physics-informed self-supervised learning method to learn the energy landscape from the evolution trajectories of the system. It first maps the system state from the observation space to a discrete landscape space by an adaptive codebook, and then explicitly integrates energy into the graph neural Fokker-Planck equation, enabling the joint learning of energy estimation and evolution prediction. Experimental results across interdisciplinary systems demonstrate that our estimated energy has a correlation coefficient above 0.9 with the ground truth, and evolution prediction accuracy exceeds the baseline by an average of 17.65%. The code is available at https://github.com/tsinghua-fib-lab/PESLA.
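The model class described here — particles on a landscape under energy-driven drift plus noise-induced diffusion — is the overdamped Langevin equation dx = -∇E(x) dt + √(2D) dW. A toy forward simulation on a double-well landscape (illustrative only; the landscape and constants are made up, and this is the generative model, not the paper's inference method):

```python
import math
import random

def energy(x):
    """Double-well landscape E(x) = (x^2 - 1)^2, with minima at x = ±1."""
    return (x * x - 1.0) ** 2

def drift(x, h=1e-5):
    """Drift is the negative energy gradient, here via central differences."""
    return -(energy(x + h) - energy(x - h)) / (2 * h)

def euler_maruyama(x0, steps=5000, dt=1e-3, noise=0.1, seed=0):
    """Simulate dx = -grad E(x) dt + sqrt(2*noise) dW with Euler-Maruyama."""
    rng = random.Random(seed)
    x = x0
    for _ in range(steps):
        x += drift(x) * dt + math.sqrt(2 * noise * dt) * rng.gauss(0.0, 1.0)
    return x

x_final = euler_maruyama(x0=0.5)  # relaxes toward a well near x = 1
```

The paper's inverse problem is recovering E from trajectories like these, without ever observing E itself.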

ICRA Conference 2025 Conference Paper

SAP-SLAM: Semantic-Assisted Perception SLAM with 3D Gaussian Splatting

  • Yuheng Yang
  • Yudong Lin
  • Wenming Yang
  • Guijin Wang
  • Qingmin Liao

The integration of 3D Gaussians has introduced a novel scene representation in Simultaneous Localization and Mapping (SLAM), characterized by explicit representation and differentiable rendering capabilities that enhance scene reconstruction and understanding. However, most current SLAM systems only exploit the basic representational capacity of 3D Gaussians, neglecting their potential to offer richer information and facilitate higher-dimensional scene comprehension. Furthermore, these systems often struggle with reconstruction when encountering rapid camera movements or missing depth. Drawing inspiration from 3D language fields, which explore the intrinsic relationships among scene objects, we propose SAP-SLAM, a dense SLAM system that combines high-fidelity reconstruction and advanced semantic understanding. Our approach leverages pre-trained visual models to extract semantic features, which are then fused, dimensionally reduced, and encoded into the 3D Gaussian model for optimization and rendering. The integration of these features improves the system's semantic comprehension and scene representation, ultimately enabling the creation of high-precision 3D semantic maps. Additionally, we introduce a semantic-guided Gaussian densification and pruning strategy, which uses semantic consistency to prioritize attention on poorly reconstructed areas, greatly improving performance in complex scenarios. SAP-SLAM achieves competitive results on both real-world and synthetic datasets, demonstrating superior capabilities in semantic understanding and reconstruction.

NeurIPS Conference 2025 Conference Paper

What Can RL Bring to VLA Generalization? An Empirical Study

  • Jijia Liu
  • Feng Gao
  • Bingwen Wei
  • Xinlei Chen
  • Qingmin Liao
  • Yi Wu
  • Chao Yu
  • Yu Wang

Large Vision-Language Action (VLA) models have shown significant potential for embodied AI. However, their predominant training via supervised fine-tuning (SFT) limits generalization due to susceptibility to compounding errors under distribution shifts. Reinforcement learning (RL) offers a path to overcome these limitations by optimizing for task objectives via trial-and-error, yet a systematic understanding of its specific generalization benefits for VLAs compared to SFT is lacking. To address this, our study introduces a comprehensive benchmark for evaluating VLA generalization and systematically investigates the impact of RL fine-tuning across diverse visual, semantic, and execution dimensions. Our extensive experiments reveal that RL fine-tuning, particularly with PPO, significantly enhances generalization in semantic understanding and execution robustness over SFT, while maintaining comparable visual robustness. We identify PPO as a more effective RL algorithm for VLAs than LLM-derived methods like DPO and GRPO. We also develop a simple recipe for efficient PPO training on VLAs, and demonstrate its practical utility for improving VLA generalization. The project page is at https://rlvla.github.io

NeurIPS Conference 2024 Conference Paper

AdaPKC: PeakConv with Adaptive Peak Receptive Field for Radar Semantic Segmentation

  • Teng Li
  • Liwen Zhang
  • Youcheng Zhang
  • Zijun Hu
  • Pengcheng Pi
  • Zongqing Lu
  • Qingmin Liao
  • Zhe Ma

Deep learning-based radar detection technology is receiving increasing attention in areas such as autonomous driving, UAV surveillance, and marine monitoring. Among recent efforts, PeakConv (PKC) provides a solution that retains the peak response characteristics of radar signals while exploiting the strengths of deep convolution, thereby improving radar semantic segmentation (RSS). However, due to its pre-set, fixed peak receptive field sampling rule, PKC still has limitations in dealing with problems such as inconsistent broadening of target frequency-domain responses and the non-homogeneous, time-varying characteristics of noise/clutter distributions. Therefore, this paper proposes the idea of an adaptive peak receptive field and upgrades PKC to AdaPKC accordingly. Beyond that, a novel fine-tuning technique to further boost the performance of AdaPKC-based RSS networks is presented. Through experimental verification on various real-measured radar data (including a publicly available low-cost millimeter-wave radar dataset for autonomous driving and a self-collected Ku-band surveillance radar dataset), we find that AdaPKC-based models surpass other SoTA methods in RSS tasks. The code is available at https://github.com/lihua199710/AdaPKC.

AAMAS Conference 2024 Conference Paper

LLM-Powered Hierarchical Language Agent for Real-time Human-AI Coordination

  • Jijia Liu
  • Chao Yu
  • Jiaxuan Gao
  • Yuqing Xie
  • Qingmin Liao
  • Yi Wu
  • Yu Wang

AI agents powered by Large Language Models (LLMs) have made significant advances, enabling them to assist humans in diverse complex tasks and leading to a revolution in human-AI coordination. LLM-powered agents typically require invoking LLM APIs and employing artificially designed complex prompts, which results in high inference latency. While this paradigm works well in scenarios with minimal interactive demands, such as code generation, it is unsuitable for highly interactive and real-time applications, such as gaming. Traditional gaming AI often employs small models or reactive policies, enabling fast inference but offering limited task completion and interaction abilities. In this work, we consider Overcooked as our testbed, where players can communicate with natural language and cooperate to serve orders. We propose a Hierarchical Language Agent (HLA) for human-AI coordination that provides strong reasoning abilities while maintaining real-time execution. In particular, HLA adopts a hierarchical framework and comprises three modules: a proficient LLM, referred to as Slow Mind, for intention reasoning and language interaction; a lightweight LLM, referred to as Fast Mind, for generating macro actions; and a reactive policy, referred to as Executor, for transforming macro actions into atomic actions. Human studies show that HLA outperforms other baseline agents, including slow-mind-only agents and fast-mind-only agents, with stronger cooperation abilities, faster responses, and more consistent language communications.

IJCAI Conference 2024 Conference Paper

Long-term Detection and Monitory of Chinese Urban Village Using Satellite Imagery

  • Yuming Lin
  • Xin Zhang
  • Yu Liu
  • Zhenyu Han
  • Qingmin Liao
  • Yong Li

Urban villages are areas filled with rural-like improvised structures in Chinese cities, usually housing the most vulnerable groups. Under the guidance of the Sustainable Development Goals (SDGs), the Chinese government initiated renewal and redevelopment projects, underscoring the meticulous mapping and segmentation of urban villages. Satellite imagery is advanced and efficient in identifying urban villages and monitoring changes, but traditional methods neglect the morphological diversity in season, shape, size, spacing, and layout of urban villages, which is unsatisfactory for long-term, wide-range data. Here, we design a targeted approach based on Tobler’s First Law of Geography, using curriculum labeling to handle morphological diversity and semi-automatically generate segmentations of urban village boundaries. Specifically, we use manually labeled data as seeds for pre-trained SegFormer models and incrementally fine-tune the model based on geographical proximity. Rigorous experimentation across five diverse cities substantiates the efficacy of our methodology: the IoU metric demonstrates a noteworthy improvement of over 119% over the baseline. Our final results cover 265,050 urban villages across 433 cities in China over the past 10 years, and the analysis reveals uneven redevelopment by geography and city scale. We further examine the within-city distribution and verify the urban scaling law associated with several socio-economic factors. Our method can be used nationwide to decide redevelopment priority and resource tilt, contributing to SDG 11.1 on affordable housing and upgrading slums. The code and dataset are available at https://github.com/tsinghua-fib-lab/LtCUV.

IJCAI Conference 2024 Conference Paper

Reschedule Diffusion-based Bokeh Rendering

  • Shiyue Yan
  • Xiaoshi Qiu
  • Qingmin Liao
  • Jing-Hao Xue
  • Shaojun Liu

Bokeh rendering for images shot with small apertures has drawn much attention in practice. Very recently, people have started to explore diffusion models for bokeh rendering, aiming to leverage the models' surging power of image generation. However, we can clearly observe two big issues with the images rendered by diffusion models: large fluctuation and severe color deviation. To address these issues, we propose in this paper a prior-aware sampling approach, which can adaptively control the noise scale through learned priors, and a prior-aware noise scheduling strategy, which can greatly reduce the number of inference steps without sacrificing performance. Extensive experiments show that our method can effectively alleviate the fluctuation problem of sampling results while ensuring similar color styles to the input image. In addition, our method outperforms state-of-the-art methods, sometimes even with only two steps of sampling. Our code is available at https://github.com/Loeiii/Reschedule-Diffusion-based-Bokeh-Rendering.

NeurIPS Conference 2024 Conference Paper

Taming Diffusion Prior for Image Super-Resolution with Domain Shift SDEs

  • Qinpeng Cui
  • Yixuan Liu
  • Xinyi Zhang
  • Qiqi Bao
  • Qingmin Liao
  • Li Wang
  • Tian Lu
  • Zicheng Liu

Diffusion-based image super-resolution (SR) models have attracted substantial interest due to their powerful image restoration capabilities. However, prevailing diffusion models often struggle to strike an optimal balance between efficiency and performance. Typically, they either neglect to exploit the potential of existing extensive pretrained models, limiting their generative capacity, or they necessitate dozens of forward passes starting from random noise, compromising inference efficiency. In this paper, we present DoSSR, a Domain Shift diffusion-based SR model that capitalizes on the generative powers of pretrained diffusion models while significantly enhancing efficiency by initiating the diffusion process with low-resolution (LR) images. At the core of our approach is a domain shift equation that integrates seamlessly with existing diffusion models. This integration not only improves the use of the diffusion prior but also boosts inference efficiency. Moreover, we advance our method by transitioning the discrete shift process to a continuous formulation, termed DoS-SDEs. This advancement leads to fast and customized solvers that further enhance sampling efficiency. Empirical results demonstrate that our proposed method achieves state-of-the-art performance on synthetic and real-world datasets, while notably requiring only 5 sampling steps. Compared to previous diffusion-prior-based methods, our approach achieves a remarkable speedup of 5-7 times, demonstrating its superior efficiency.

AAAI Conference 2024 Conference Paper

UV-SAM: Adapting Segment Anything Model for Urban Village Identification

  • Xin Zhang
  • Yu Liu
  • Yuming Lin
  • Qingmin Liao
  • Yong Li

Urban villages, defined as informal residential areas in or around urban centers, are characterized by inadequate infrastructures and poor living conditions, closely related to the Sustainable Development Goals (SDGs) on poverty, adequate housing, and sustainable cities. Traditionally, governments heavily depend on field survey methods to monitor the urban villages, which however are time-consuming, labor-intensive, and possibly delayed. Thanks to widely available and timely updated satellite images, recent studies develop computer vision techniques to detect urban villages efficiently. However, existing studies either focus on simple urban village image classification or fail to provide accurate boundary information. To accurately identify urban village boundaries from satellite images, we harness the power of the vision foundation model and adapt the Segment Anything Model (SAM) to urban village segmentation, named UV-SAM. Specifically, UV-SAM first leverages a small-sized semantic segmentation model to produce mixed prompts for urban villages, including mask, bounding box, and image representations, which are then fed into SAM for fine-grained boundary identification. Extensive experimental results on two datasets in China demonstrate that UV-SAM outperforms existing baselines, and identification results over multiple years show that both the number and area of urban villages are decreasing over time, providing deeper insights into the development trends of urban villages and shedding light on the use of vision foundation models for sustainable cities. The dataset and codes of this study are available at https://github.com/tsinghua-fib-lab/UV-SAM.

AAAI Conference 2023 Conference Paper

Dynamic Ensemble of Low-Fidelity Experts: Mitigating NAS “Cold-Start”

  • Junbo Zhao
  • Xuefei Ning
  • Enshu Liu
  • Binxin Ru
  • Zixuan Zhou
  • Tianchen Zhao
  • Chen Chen
  • Jiajin Zhang

Predictor-based Neural Architecture Search (NAS) employs an architecture performance predictor to improve the sample efficiency. However, predictor-based NAS suffers from the severe “cold-start” problem, since a large amount of architecture-performance data is required to get a working predictor. In this paper, we focus on exploiting information in cheaper-to-obtain performance estimations (i.e., low-fidelity information) to mitigate the large data requirements of predictor training. Despite the intuitiveness of this idea, we observe that using inappropriate low-fidelity information even damages the prediction ability and different search spaces have different preferences for low-fidelity information types. To solve the problem and better fuse beneficial information provided by different types of low-fidelity information, we propose a novel dynamic ensemble predictor framework that comprises two steps. In the first step, we train different sub-predictors on different types of available low-fidelity information to extract beneficial knowledge as low-fidelity experts. In the second step, we learn a gating network to dynamically output a set of weighting coefficients conditioned on each input neural architecture, which will be used to combine the predictions of different low-fidelity experts in a weighted sum. The overall predictor is optimized on a small set of actual architecture-performance data to fuse the knowledge from different low-fidelity experts to make the final prediction. We conduct extensive experiments across five search spaces with different architecture encoders under various experimental settings. For example, our method can improve the Kendall's Tau correlation coefficient between actual performance and predicted scores from 0.2549 to 0.7064 with only 25 actual architecture-performance data on NDS-ResNet. Our method can easily be incorporated into existing predictor-based NAS frameworks to discover better architectures.
Our method will be implemented in Mindspore (Huawei 2020), and the example code is published at https://github.com/A-LinCui/DELE.

AAAI Conference 2022 Conference Paper

Coarse-to-Fine Embedded PatchMatch and Multi-Scale Dynamic Aggregation for Reference-Based Super-resolution

  • Bin Xia
  • Yapeng Tian
  • Yucheng Hang
  • Wenming Yang
  • Qingmin Liao
  • Jie Zhou

Reference-based super-resolution (RefSR) has made significant progress in producing realistic textures using an external reference (Ref) image. However, existing RefSR methods obtain high-quality correspondence matchings at quadratic computational cost with respect to the input size, limiting their application. Moreover, these approaches usually suffer from scale misalignments between the low-resolution (LR) image and the Ref image. In this paper, we propose an Accelerated Multi-Scale Aggregation network (AMSA) for reference-based super-resolution, including a Coarse-to-Fine Embedded PatchMatch (CFE-PatchMatch) scheme and a Multi-Scale Dynamic Aggregation (MSDA) module. To improve matching efficiency, we design a novel Embedded PatchMatch scheme with random sample propagation, which involves end-to-end training with asymptotically linear computational cost in the input size. To further reduce computational cost and speed up convergence, we apply a coarse-to-fine strategy on Embedded PatchMatch, constituting CFE-PatchMatch. To fully leverage reference information across multiple scales and enhance robustness to scale misalignment, we develop the MSDA module consisting of Dynamic Aggregation and Multi-Scale Aggregation. Dynamic Aggregation corrects minor scale misalignment by dynamically aggregating features, and Multi-Scale Aggregation brings robustness to large scale misalignment by fusing multi-scale information. Experimental results show that the proposed AMSA achieves superior performance over state-of-the-art approaches on both quantitative and qualitative evaluations.

AAAI Conference 2022 Conference Paper

Efficient Non-local Contrastive Attention for Image Super-resolution

  • Bin Xia
  • Yucheng Hang
  • Yapeng Tian
  • Wenming Yang
  • Qingmin Liao
  • Jie Zhou

Non-Local Attention (NLA) brings significant improvement for Single Image Super-Resolution (SISR) by leveraging intrinsic feature correlation in natural images. However, NLA gives noisy information large weights and consumes quadratic computation resources with respect to the input size, limiting its performance and application. In this paper, we propose a novel Efficient Non-Local Contrastive Attention (ENLCA) to perform long-range visual modeling and leverage more relevant non-local features. Specifically, ENLCA consists of two parts, Efficient Non-Local Attention (ENLA) and Sparse Aggregation. ENLA adopts the kernel method to approximate the exponential function, achieving linear computational complexity. For Sparse Aggregation, we multiply inputs by an amplification factor to focus on informative features, yet the variance of the approximation increases exponentially; therefore, contrastive learning is applied to further separate relevant and irrelevant features. To demonstrate the effectiveness of ENLCA, we build an architecture called Efficient Non-Local Contrastive Network (ENLCN) by adding a few of our modules to a simple backbone. Extensive experimental results show that ENLCN reaches superior performance over state-of-the-art approaches on both quantitative and qualitative evaluations.
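The "kernel method to approximate the exponential function" is in the spirit of positive random features (as popularized by Performer-style linear attention), where E[φ(q)·φ(k)] = exp(q·k) so attention can be linearized. A numpy sketch of that generic construction — not the authors' ENLA code, and all sizes are illustrative:

```python
import numpy as np

def exp_kernel_features(x, w):
    """Positive random features satisfying E[phi(q) @ phi(k)] = exp(q @ k)
    when the rows of w are drawn i.i.d. from N(0, I_d)."""
    m = w.shape[0]
    return np.exp(x @ w.T - 0.5 * np.sum(x * x, axis=-1, keepdims=True)) / np.sqrt(m)

rng = np.random.default_rng(0)
d, m = 4, 20_000
q = rng.normal(size=d) * 0.3
k = rng.normal(size=d) * 0.3
w = rng.normal(size=(m, d))

exact = float(np.exp(q @ k))                       # the exponential similarity
approx = (exp_kernel_features(q[None], w)          # its random-feature estimate
          @ exp_kernel_features(k[None], w).T).item()
```

The practical payoff is that, with features computed once per query/key, attention costs O(n·m) instead of O(n²); the amplification trick in the abstract sharpens this kernel at the cost of estimator variance.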

JBHI Journal 2022 Journal Article

MDAN: Mirror Difference Aware Network for Brain Stroke Lesion Segmentation

  • Qiqi Bao
  • Shiyu Mi
  • Bowen Gang
  • Wenming Yang
  • Jie Chen
  • Qingmin Liao

Brain stroke lesion segmentation is of great importance for stroke rehabilitation neuroimaging analysis. Due to the large variance of stroke lesion shapes and similarities of tissue intensity distribution, it remains a challenging task. To help detect abnormalities, the anatomical symmetries of brain magnetic resonance (MR) images have been widely used as visual cues for clinical practices. However, most methods for brain image segmentation do not fully utilize structural symmetry information. This paper presents a novel mirror difference aware network (MDAN) for stroke lesion segmentation. The network uses an encoder-decoder architecture, aiming at holistically exploiting the symmetries of image features. Specifically, a differential feature augmentation (DFA) module is developed in the encoding path to highlight the semantically pathological asymmetries of features in abnormalities. In the DFA module, a Siamese contrastive supervised loss is designed to enhance discriminative features, and a mirror position-based difference augmentation (MDA) module is used to further magnify the discrepancy. Moreover, mirror feature fusion (MFF) modules are applied to efficiently fuse and transfer the information both of the original input and the horizontally flipped features to the decoding path. Extensive experiments on the Anatomical Tracings of Lesions After Stroke (ATLAS) dataset show the proposed MDAN outperforms the state-of-the-art methods.

AAAI Conference 2022 Conference Paper

Pose-Invariant Face Recognition via Adaptive Angular Distillation

  • Zhenduo Zhang
  • Yongru Chen
  • Wenming Yang
  • Guijin Wang
  • Qingmin Liao

Pose-invariant face recognition is a practically useful but challenging task. This paper introduces a novel method to learn pose-invariant feature representations without normalizing profile faces to frontal ones or learning disentangled features. We first design a novel strategy to learn pose-invariant feature embeddings by distilling the angular knowledge of frontal faces extracted by a teacher network to a student network, which enables the handling of faces with large pose variations. In this way, the features of faces across varying poses can cluster compactly for the same person, creating a pose-invariant face representation. Second, we propose a Pose-Adaptive Angular Distillation loss that mitigates the negative effect of the uneven distribution of face poses in the training dataset by paying more attention to samples with large pose variations. Extensive experiments on two challenging benchmarks (IJB-A and CFP-FP) show that our approach consistently outperforms existing methods.
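A pose-adaptive angular distillation objective of this kind can be sketched as a weighted cosine-alignment loss between student (profile) and teacher (frontal) embeddings, with larger-pose samples up-weighted. The exponential yaw-based weighting below is an illustrative assumption, not the paper's exact formulation:

```python
import numpy as np

def angular_distillation_loss(student_emb, teacher_emb, pose_yaw, gamma=1.0):
    # Align the *angle* between each student embedding and the teacher's
    # frontal-face embedding: minimize 1 - cos(student, teacher).
    s = student_emb / np.linalg.norm(student_emb, axis=1, keepdims=True)
    t = teacher_emb / np.linalg.norm(teacher_emb, axis=1, keepdims=True)
    cos_sim = (s * t).sum(axis=1)
    # Pose-adaptive weighting (assumed form): samples with larger yaw
    # get exponentially larger weight, countering their scarcity.
    weights = np.exp(gamma * np.abs(pose_yaw) / 90.0)
    return (weights * (1.0 - cos_sim)).mean()
```

When student and teacher embeddings already point in the same direction the loss vanishes, so gradients concentrate on poorly aligned, large-pose samples.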

ICML Conference 2021 Conference Paper

Group Fisher Pruning for Practical Network Compression

  • Liyang Liu
  • Shilong Zhang
  • Zhanghui Kuang
  • Aojun Zhou
  • Jing-Hao Xue
  • Xinjiang Wang
  • Yimin Chen
  • Wenming Yang

Network compression has been widely studied since it is able to reduce the memory and computation cost during inference. However, previous methods seldom deal with complicated structures like residual connections, group/depth-wise convolution, and feature pyramid networks, where the channels of multiple layers are coupled and need to be pruned simultaneously. In this paper, we present a general channel pruning approach that can be applied to various complicated structures. Particularly, we propose a layer grouping algorithm to find coupled channels automatically. Then we derive a unified metric based on Fisher information to evaluate the importance of a single channel and of coupled channels. Moreover, we find that inference speedup on GPUs correlates more with the reduction of memory than of FLOPs, and thus we employ the memory reduction of each channel to normalize the importance. Our method can be used to prune any structure, including those with coupled channels. We conduct extensive experiments on various backbones, including the classic ResNet and ResNeXt, the mobile-friendly MobileNetV2, and the NAS-based RegNet, on both image classification and object detection, the latter of which is under-explored in pruning. Experimental results validate that our method can effectively prune sophisticated networks, boosting inference speed without sacrificing accuracy.
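The pipeline the abstract describes can be sketched in three steps: a first-order Fisher-style importance score per channel, summing scores within a coupled group, and normalizing by memory cost. The score form and the helpers below are illustrative, not the authors' code:

```python
import numpy as np

def channel_importance(acts, grads):
    # acts, grads: (N, C, H, W) activations and dL/d(act) for one batch.
    # First-order Fisher-style score: removing channel c perturbs the
    # loss roughly by the grad-activation contraction over that channel,
    # so we square the per-sample contraction and average over the batch.
    contrib = (acts * grads).sum(axis=(2, 3))   # (N, C)
    return (contrib ** 2).mean(axis=0)          # (C,)

def coupled_importance(scores, groups):
    # Channels coupled across layers (e.g. through a residual add) must
    # be pruned together, so their scores are accumulated per group.
    return {g: sum(scores[i] for i in idxs) for g, idxs in groups.items()}

def normalize_by_memory(scores, mem_per_channel):
    # The paper observes GPU speedup tracks memory reduction more than
    # FLOPs, motivating normalization by each channel's memory cost.
    return scores / mem_per_channel
```

Channels (or coupled groups) with the lowest normalized score would then be pruned first.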

ICLR Conference 2021 Conference Paper

Towards Impartial Multi-task Learning

  • Liyang Liu
  • Yi Li 0050
  • Zhanghui Kuang
  • Jing-Hao Xue
  • Yimin Chen
  • Wenming Yang
  • Qingmin Liao
  • Wayne Zhang 0001

Multi-task learning (MTL) has been widely used in representation learning. However, naively training all tasks simultaneously may lead to the partial training issue, where specific tasks are trained more adequately than others. In this paper, we propose to learn multiple tasks impartially. Specifically, for the task-shared parameters, we optimize the scaling factors via a closed-form solution, such that the aggregated gradient (the sum of raw gradients weighted by the scaling factors) has equal projections onto the individual tasks. For the task-specific parameters, we dynamically weight the task losses so that all of them are kept at a comparable scale. Further, we find that the above gradient balance and loss balance are complementary, and thus propose a hybrid balance method to further improve performance. Our impartial multi-task learning (IMTL) can be trained end-to-end without any heuristic hyper-parameter tuning, and is general enough to be applied to all kinds of losses without any distributional assumption. Moreover, our IMTL converges to similar results even when the task losses are designed to have different scales, and thus it is scale-invariant. We extensively evaluate IMTL on the standard MTL benchmarks, including Cityscapes, NYUv2 and CelebA. It outperforms existing loss weighting methods under the same experimental settings.
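The gradient-balance condition (the aggregated gradient has equal projections onto every task's gradient direction, with weights summing to one) amounts to a small linear system in the scaling factors. The sketch below solves it numerically with NumPy rather than via the paper's closed-form expression:

```python
import numpy as np

def impartial_weights(task_grads):
    # task_grads: list of T flattened per-task gradients on the shared
    # parameters. Solve for alpha with sum(alpha) = 1 such that
    # g = sum_t alpha_t g_t satisfies g.u_s = g.u_t for all tasks,
    # where u_t = g_t / |g_t| is the task's gradient direction.
    G = np.stack(task_grads)                               # (T, P)
    U = G / np.linalg.norm(G, axis=1, keepdims=True)       # unit directions
    T = G.shape[0]
    A = np.zeros((T, T))
    b = np.zeros(T)
    for t in range(1, T):
        # Projection-equality constraint: g.(u_0 - u_t) = 0,
        # i.e. sum_s alpha_s * g_s.(u_0 - u_t) = 0.
        A[t - 1] = G @ (U[0] - U[t])
    A[T - 1] = 1.0                                         # sum-to-one row
    b[T - 1] = 1.0
    return np.linalg.solve(A, b)
```

For two orthogonal task gradients of norms 1 and 2, this yields weights 2/3 and 1/3: the smaller-gradient task is up-weighted so both tasks see the same projected progress.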

TIST Journal 2020 Journal Article

DeepApp

  • Tong Xia
  • Yong Li
  • Jie Feng
  • Depeng Jin
  • Qing Zhang
  • Hengliang Luo
  • Qingmin Liao

Smartphone mobile application (App) usage prediction, i.e., which Apps will be used next, is beneficial for user experience improvement. Through an in-depth analysis of a real-world dataset, we find that App usage is highly spatio-temporally correlated and personalized. Given its ability to model complex spatio-temporal contexts, we aim to apply deep learning to achieve high prediction accuracy. However, the personalization yields a problem: training one network for each individual suffers from data scarcity, yet training one deep neural network for all users often fails to uncover user preference. In this article, we propose a novel App usage prediction framework, named DeepApp, to achieve context-aware prediction via multi-task learning. To tackle the challenge of data scarcity, we train one general network for multiple users to share common patterns. To better utilize the spatio-temporal contexts, we supplement a location prediction task in the multi-task learning framework to learn spatio-temporal relations. As for the personalization, we add a user identification task to capture user preference. We evaluate DeepApp on a large-scale dataset through extensive experiments. Results demonstrate that DeepApp outperforms the state-of-the-art baseline by 6.44%.
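The multi-task objective described, a main App-usage loss plus auxiliary location-prediction and user-identification losses over a shared encoder, can be sketched as a weighted sum of cross-entropies. The task weights `w_loc` and `w_uid` are illustrative assumptions, not values from the paper:

```python
import numpy as np

def deepapp_style_loss(app_logits, app_y, loc_logits, loc_y,
                       uid_logits, uid_y, w_loc=0.5, w_uid=0.5):
    # Multi-task objective: one shared encoder produces logits for three
    # heads; the auxiliary location and user-ID tasks regularize the
    # shared representation toward spatio-temporal and personal patterns.
    def xent(logits, y):
        # Numerically stable softmax cross-entropy over integer labels.
        z = logits - logits.max(axis=1, keepdims=True)
        logp = z - np.log(np.exp(z).sum(axis=1, keepdims=True))
        return -logp[np.arange(len(y)), y].mean()

    return (xent(app_logits, app_y)
            + w_loc * xent(loc_logits, loc_y)
            + w_uid * xent(uid_logits, uid_y))
```

With uniform (all-zero) logits each term reduces to log of that head's class count, which gives a quick sanity check of the implementation.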