Arrow Research search

Author name cluster

Jiaming Liu

Possible papers associated with this exact author name in Arrow. This page groups case-insensitive exact name matches and is not a full identity disambiguation profile.

23 papers
2 author rows

Possible papers (23)

AAAI Conference 2026 Conference Paper

Decomposing the Neurons: Activation Sparsity via Mixture of Experts for Continual Test Time Adaptation

  • Rongyu Zhang
  • Aosong Cheng
  • Yulin Luo
  • Gaole Dai
  • Huanrui Yang
  • Jiaming Liu
  • Ran Xu
  • Li Du

Continual Test-Time Adaptation (CTTA), which aims to adapt the pre-trained model to ever-evolving target domains, emerges as an important task for vision models. As current vision models appear to be heavily biased towards texture, continuously adapting the model from one domain distribution to another can result in serious catastrophic forgetting. Drawing inspiration from the encoding characteristics of neuron activation in neural networks, we propose the Mixture-of-Activation-Sparsity-Experts (MoASE) for the CTTA task. Given the distinct reaction of neurons with low and high activation to domain-specific and agnostic features, MoASE decomposes the neural activation into high-activation and low-activation components in each expert with a Spatial Differentiable Dropout (SDD). Based on the decomposition, we devise a Domain-Aware Router (DAR) that utilizes domain information to adaptively weight experts that process the post-SDD sparse activations, and the Activation Sparsity Gate (ASG) that adaptively assigns feature selection thresholds of the SDD for different experts for more precise feature decomposition. Finally, we introduce a Homeostatic-Proximal (HP) loss to maintain update consistency between the teacher and student experts to prevent error accumulation. Extensive experiments substantiate that MoASE achieves state-of-the-art performance in both classification and segmentation tasks.
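A minimal PyTorch sketch of the general idea the abstract describes: splitting activations into high- and low-activation components around a learnable threshold and mixing several such experts with a router. This is an illustration only, not the MoASE implementation; all class, parameter, and shape names are hypothetical.

```python
# Illustrative sketch only -- not the MoASE implementation.
# Splits activations into high/low components around a learnable
# threshold (a stand-in for the Spatial Differentiable Dropout) and
# mixes several such experts with a softmax router.
import torch
import torch.nn as nn

class SparsityExpert(nn.Module):
    def __init__(self, dim: int):
        super().__init__()
        self.threshold = nn.Parameter(torch.tensor(0.5))  # hypothetical gate threshold
        self.high_branch = nn.Linear(dim, dim)             # processes high-activation part
        self.low_branch = nn.Linear(dim, dim)              # processes low-activation part

    def forward(self, x: torch.Tensor) -> torch.Tensor:
        # Soft mask: ~1 for activations above the threshold, ~0 below.
        mask = torch.sigmoid((x - self.threshold) * 10.0)
        high, low = x * mask, x * (1.0 - mask)
        return self.high_branch(high) + self.low_branch(low)

class MixtureOfSparsityExperts(nn.Module):
    def __init__(self, dim: int, num_experts: int = 4):
        super().__init__()
        self.experts = nn.ModuleList(SparsityExpert(dim) for _ in range(num_experts))
        self.router = nn.Linear(dim, num_experts)   # stand-in for a domain-aware router

    def forward(self, x: torch.Tensor) -> torch.Tensor:
        weights = torch.softmax(self.router(x.mean(dim=1)), dim=-1)   # (B, E)
        outputs = torch.stack([e(x) for e in self.experts], dim=-1)   # (B, T, D, E)
        return (outputs * weights[:, None, None, :]).sum(dim=-1)

x = torch.randn(2, 16, 32)          # (batch, tokens, dim)
print(MixtureOfSparsityExperts(32)(x).shape)   # torch.Size([2, 16, 32])
```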

AAAI Conference 2026 Conference Paper

EasyText: Controllable Diffusion Transformer for Multilingual Text Rendering

  • Runnan Lu
  • Yuxuan Zhang
  • Jiaming Liu
  • Haofan Wang
  • Yiren Song

Generating accurate multilingual text with diffusion models has long been desired but remains challenging. Recent methods have made progress in rendering text in a single language, but rendering arbitrary languages is still an under-explored area. This paper introduces EasyText, a text rendering framework based on DiT (Diffusion Transformer), which connects denoising latents with encoded multilingual character tokens. We propose character positioning encoding and position encoding interpolation techniques to achieve controllable and precise text rendering. Additionally, we construct a large-scale synthetic text image dataset with 1 million multilingual image-text annotations as well as a high-quality dataset of 20K annotated images, which are used for pretraining and fine-tuning respectively. Extensive experiments and evaluations demonstrate the effectiveness and advancement of our approach in multilingual text rendering, visual quality, and layout-aware text integration.

NeurIPS Conference 2025 Conference Paper

AC-DiT: Adaptive Coordination Diffusion Transformer for Mobile Manipulation

  • Sixiang Chen
  • Jiaming Liu
  • Siyuan Qian
  • Han Jiang
  • Zhuoyang Liu
  • Chenyang Gu
  • Xiaoqi Li
  • Chengkai Hou

Recently, mobile manipulation has attracted increasing attention for enabling language-conditioned robotic control in household tasks. However, existing methods still face challenges in coordinating the mobile base and manipulator, primarily due to two limitations. On the one hand, they fail to explicitly model the influence of the mobile base on manipulator control, which easily leads to error accumulation under high degrees of freedom. On the other hand, they treat the entire mobile manipulation process with the same visual observation modality (e.g., either all 2D or all 3D), overlooking the distinct multimodal perception requirements at different stages during mobile manipulation. To address this, we propose the Adaptive Coordination Diffusion Transformer (AC-DiT), which enhances mobile base and manipulator coordination for end-to-end mobile manipulation. First, since the motion of the mobile base directly influences the manipulator's actions, we introduce a mobility-to-body conditioning mechanism that guides the model to first extract base motion representations, which are then used as a context prior for predicting whole-body actions. This enables whole-body control that accounts for the potential impact of the mobile base's motion. Second, to meet the perception requirements at different stages of mobile manipulation, we design a perception-aware multimodal conditioning strategy that dynamically adjusts the fusion weights between various 2D visual images and 3D point clouds, yielding visual features tailored to the current perceptual needs. This allows the model to, for example, adaptively rely more on 2D inputs when semantic information is crucial for action prediction, while placing greater emphasis on 3D geometric information when precise spatial understanding is required. We empirically validate AC-DiT through extensive experiments on both simulated and real-world mobile manipulation tasks, demonstrating superior performance compared to existing methods.

NeurIPS Conference 2025 Conference Paper

Decompile-Bench: Million-Scale Binary-Source Function Pairs for Real-World Binary Decompilation

  • Hanzhuo Tan
  • Xiaolong Tian
  • Hanrui Qi
  • Jiaming Liu
  • Siyi Wang
  • Zuchen Gao
  • Qi Luo
  • Jing Li

Recent advances in LLM-based decompilers have shown that they can effectively convert low-level binaries into human-readable source code. However, there is still no comprehensive benchmark that provides large-scale binary-source function pairs, which is critical for advancing LLM decompilation technology. Creating accurate binary-source mappings is difficult because complex compilation settings and widespread function inlining obscure the correspondence between binaries and their original source code. Previous efforts have either relied on contest-style benchmarks, synthetic binary-source mappings that diverge significantly from real-world mappings, or partially matched binaries with only code lines or variable names, compromising the effectiveness of analyzing binary functionality. To alleviate these issues, we introduce Decompile-Bench, the first open-source dataset comprising two million binary-source function pairs condensed from 100 million collected function pairs, i.e., 450GB of binaries compiled from permissively licensed GitHub projects. For evaluation purposes, we also developed a benchmark, Decompile-Bench-Eval, including manually crafted binaries from the well-established HumanEval and MBPP, alongside compiled GitHub repositories released after 2025 to mitigate data leakage issues. We further explore commonly used evaluation metrics to provide a thorough assessment of the studied LLM decompilers and find that fine-tuning with Decompile-Bench yields a 20% improvement over previous benchmarks in terms of the re-executability rate. Our code and data have been released on HuggingFace and GitHub: https://github.com/anonepo/LLM4Decompile

NeurIPS Conference 2025 Conference Paper

Fast-in-Slow: A Dual-System VLA Model Unifying Fast Manipulation within Slow Reasoning

  • Hao Chen
  • Jiaming Liu
  • Chenyang Gu
  • Zhuoyang Liu
  • Renrui Zhang
  • Xiaoqi Li
  • Xiao He
  • Yandong Guo

Generalized policy and execution efficiency constitute the two critical challenges in robotic manipulation. While recent foundation policies benefit from the common-sense reasoning capabilities of internet-scale pretrained vision-language models (VLMs), they often suffer from low execution frequency. To mitigate this dilemma, dual-system approaches have been proposed to leverage a VLM-based System 2 module for handling high-level decision-making, and a separate System 1 action module for ensuring real-time control. However, existing designs maintain both systems as separate models, limiting System 1 from fully leveraging the rich pretrained knowledge from the VLM-based System 2. In this work, we propose Fast-in-Slow (FiS), a unified dual-system vision-language-action (VLA) model that embeds the System 1 execution module within the VLM-based System 2 by partially sharing parameters. This innovative paradigm not only enables high-frequency execution in System 1, but also facilitates coordination between multimodal reasoning and execution components within a single foundation model of System 2. Given their fundamentally distinct roles within FiS-VLA, we design the two systems to incorporate heterogeneous modality inputs alongside asynchronous operating frequencies, enabling both fast and precise manipulation. To enable coordination between the two systems, a dual-aware co-training strategy is proposed that equips System 1 with action generation capabilities while preserving System 2's contextual understanding to provide stable latent conditions for System 1. In evaluation, FiS-VLA outperforms previous state-of-the-art methods by 8% in simulation and 11% in real-world tasks in terms of average success rate, while achieving a 117.7 Hz control frequency with the action chunk size set to eight. Project web page: https://fast-in-slow.github.io.

ICLR Conference 2025 Conference Paper

IPDreamer: Appearance-Controllable 3D Object Generation with Complex Image Prompts

  • Bohan Zeng
  • Shanglin Li
  • Yutang Feng
  • Ling Yang 0006
  • Juan Zhang
  • Hong Li 0016
  • Jiaming Liu
  • Conghui He

Recent advances in 3D generation have been remarkable, with methods such as DreamFusion leveraging large-scale text-to-image diffusion-based models to guide 3D object generation. These methods enable the synthesis of detailed and photorealistic textured objects. However, the appearance of 3D objects produced by such text-to-3D models is often unpredictable, and it is hard for single-image-to-3D methods to deal with images lacking a clear subject, complicating the generation of appearance-controllable 3D objects from complex images. To address these challenges, we present IPDreamer, a novel method that captures intricate appearance features from complex Image Prompts and aligns the synthesized 3D object with these extracted features, enabling high-fidelity, appearance-controllable 3D object generation. Our experiments demonstrate that IPDreamer consistently generates high-quality 3D objects that align with both the textual and complex image prompts, highlighting its promising capability in appearance-controlled, complex 3D object generation.

AAAI Conference 2025 Conference Paper

LiDAR-LLM: Exploring the Potential of Large Language Models for 3D LiDAR Understanding

  • Senqiao Yang
  • Jiaming Liu
  • Renrui Zhang
  • Mingjie Pan
  • Ziyu Guo
  • Xiaoqi Li
  • Zehui Chen
  • Peng Gao

Recently, Large Language Models (LLMs) and Multimodal Large Language Models (MLLMs) have shown promise in instruction following and image understanding. While these models are powerful, they have not yet been developed to comprehend the more challenging 3D geometric and physical scenes, especially when it comes to the sparse outdoor LiDAR data. In this paper, we introduce LiDAR-LLM, which takes raw LiDAR data as input and harnesses the remarkable reasoning capabilities of LLMs to gain a comprehensive understanding of outdoor 3D scenes. The central insight of our LiDAR-LLM is the reformulation of 3D outdoor scene cognition as a language modeling problem, encompassing tasks such as 3D captioning, 3D grounding, 3D question answering, etc. Specifically, due to the scarcity of 3D LiDAR-text pairing data, we introduce a three-stage training strategy and generate relevant datasets, progressively aligning the 3D modality with the language embedding of LLM. Furthermore, we design a Position-Aware Transformer (PAT) to connect the 3D encoder with the LLM, which effectively bridges the modality gap and enhances the LLM's spatial orientation comprehension of visual features. Our experiments demonstrate that LiDAR-LLM effectively comprehends a wide range of instructions related to 3D scenes, achieving a 40.9 BLEU-1 score on the 3D captioning dataset, a Grounded Captioning accuracy of 63.1%, and a BEV mIoU of 14.3%.

AAAI Conference 2025 Conference Paper

Stable-Hair: Real-World Hair Transfer via Diffusion Model

  • Yuxuan Zhang
  • Qing Zhang
  • Yiren Song
  • Jichao Zhang
  • Hao Tang
  • Jiaming Liu

Current hair transfer methods struggle to handle diverse and intricate hairstyles, limiting their applicability in real-world scenarios. In this paper, we propose a novel diffusion-based hair transfer framework, named Stable-Hair, which robustly transfers a wide range of real-world hairstyles to user-provided faces for virtual hair try-on. To achieve this goal, our Stable-Hair framework is designed as a two-stage pipeline. In the first stage, we train a Bald Converter alongside stable diffusion to remove hair from the user-provided face images, resulting in bald images. In the second stage, we specifically design a Hair Extractor and a Latent IdentityNet to transfer the target hairstyle to the bald image with high detail and fidelity. The Hair Extractor is trained to encode reference images with the desired hairstyles, while the Latent IdentityNet ensures consistency in identity and background. To minimize color deviations between source images and transfer results, we introduce a novel Latent ControlNet architecture, which functions as both the Bald Converter and Latent IdentityNet. After training on our curated triplet dataset, our method accurately transfers highly detailed and high-fidelity hairstyles to the source images. Extensive experiments demonstrate that our approach achieves state-of-the-art performance compared to existing hair transfer methods.

AAAI Conference 2025 Conference Paper

VFM-Adapter: Adapting Visual Foundation Models for Dense Prediction with Dynamic Hybrid Operation Mapping

  • Zheng Chen
  • Yu Zeng
  • Zehui Chen
  • Hongzhi Gao
  • Lin Chen
  • Jiaming Liu
  • Feng Zhao

Although pre-trained large vision foundation models (VFM) yield superior results on various downstream tasks, full fine-tuning is often impractical due to its high computational cost and storage requirements. Recent advancements in parameter-efficient fine-tuning (PEFT) of VFM for image classification show significant promise. However, the application of PEFT techniques to dense prediction tasks remains largely unexplored. Our analysis of existing methods reveals that the underlying premise of utilizing low-rank parameter matrices, despite their efficacy in specific applications, may not be adequately suitable for dense prediction tasks. To this end, we propose a novel PEFT learning approach tailored for dense prediction tasks, namely VFM-Adapter. Specifically, the VFM-Adapter introduces a hybrid operation mapping technique that seamlessly integrates local information with global modeling into the adapter module, capitalizing on the distinct inductive biases inherent in different operations. Additionally, we dynamically generate parameters for the VFM-Adapter, enabling flexible feature extraction for specific inputs. To validate the efficacy of VFM-Adapter, we conduct extensive experiments across object detection, semantic segmentation, and instance segmentation tasks. Results on multiple benchmarks consistently demonstrate the superiority of our method over previous approaches. Notably, with only three percent of the trainable parameters of the SAM-Base backbone, our approach achieves competitive or even superior performance compared to full fine-tuning. The code will be available.
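For intuition about the kind of module the abstract describes, a hypothetical adapter sketch follows: a low-dimensional bottleneck whose hidden features pass through both a local branch (depthwise convolution over the token grid) and a global branch (pooled context). This is not the VFM-Adapter release; the names and the exact operator mix are assumptions.

```python
# Hypothetical hybrid local/global adapter sketch -- not the VFM-Adapter code.
import torch
import torch.nn as nn

class HybridAdapter(nn.Module):
    def __init__(self, dim: int, bottleneck: int = 32):
        super().__init__()
        self.down = nn.Linear(dim, bottleneck)
        self.local = nn.Conv2d(bottleneck, bottleneck, 3, padding=1, groups=bottleneck)
        self.global_fc = nn.Linear(bottleneck, bottleneck)
        self.up = nn.Linear(bottleneck, dim)

    def forward(self, tokens: torch.Tensor, h: int, w: int) -> torch.Tensor:
        # tokens: (B, H*W, dim) patch tokens from a frozen backbone
        z = self.down(tokens)                                     # (B, HW, b)
        grid = z.transpose(1, 2).reshape(z.shape[0], -1, h, w)    # (B, b, H, W)
        local = self.local(grid).flatten(2).transpose(1, 2)       # local inductive bias
        glob = self.global_fc(z.mean(dim=1, keepdim=True))        # global context
        return tokens + self.up(torch.relu(local + glob))         # residual adapter output

tokens = torch.randn(2, 14 * 14, 256)
print(HybridAdapter(256)(tokens, 14, 14).shape)   # torch.Size([2, 196, 256])
```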

AAAI Conference 2024 Conference Paper

Efficient Deweather Mixture-of-Experts with Uncertainty-Aware Feature-Wise Linear Modulation

  • Rongyu Zhang
  • Yulin Luo
  • Jiaming Liu
  • Huanrui Yang
  • Zhen Dong
  • Denis Gudovskiy
  • Tomoyuki Okuno
  • Yohei Nakata

The Mixture-of-Experts (MoE) approach has demonstrated outstanding scalability in multi-task learning, including low-level upstream tasks such as concurrent removal of multiple adverse weather effects. However, the conventional MoE architecture with parallel Feed Forward Network (FFN) experts leads to significant parameter and computational overheads that hinder its efficient deployment. In addition, the naive MoE linear router is suboptimal in assigning task-specific features to multiple experts, which limits its further scalability. In this work, we propose an efficient MoE architecture with weight sharing across the experts. Inspired by the idea of linear feature modulation (FM), our architecture implicitly instantiates multiple experts via learnable activation modulations on a single shared expert block. The proposed Feature Modulated Expert (FME) serves as a building block for the novel Mixture-of-Feature-Modulation-Experts (MoFME) architecture, which can scale up the number of experts with low overhead. We further propose an Uncertainty-aware Router (UaR) to assign task-specific features to different FM modules with well-calibrated weights. This enables MoFME to effectively learn diverse expert functions for multiple tasks. Experiments on the multi-deweather task show that our MoFME outperforms the state-of-the-art in image restoration quality by 0.1-0.2 dB while saving more than 74% of parameters and 20% inference time over the conventional MoE counterpart. Experiments on the downstream segmentation and classification tasks further demonstrate the generalizability of MoFME to real open-world applications.
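The core "implicit experts" idea lends itself to a compact sketch: one shared FFN whose hidden activations receive per-expert learnable scale and shift vectors (FiLM-style modulation) blended by router weights, so adding experts adds only modulation vectors rather than full FFNs. The snippet below is an assumption-laden illustration, not the MoFME code.

```python
# Illustrative sketch only -- not the MoFME release. One shared FFN is
# modulated by per-expert scale/shift vectors mixed by router weights.
import torch
import torch.nn as nn

class FeatureModulatedExperts(nn.Module):
    def __init__(self, dim: int, hidden: int, num_experts: int = 8):
        super().__init__()
        self.fc1 = nn.Linear(dim, hidden)        # single shared expert block
        self.fc2 = nn.Linear(hidden, dim)
        self.scale = nn.Parameter(torch.ones(num_experts, hidden))    # per-expert gamma
        self.shift = nn.Parameter(torch.zeros(num_experts, hidden))   # per-expert beta
        self.router = nn.Linear(dim, num_experts)                     # plain router stand-in

    def forward(self, x: torch.Tensor) -> torch.Tensor:
        # x: (batch, tokens, dim)
        w = torch.softmax(self.router(x.mean(dim=1)), dim=-1)          # (B, E)
        gamma, beta = w @ self.scale, w @ self.shift                   # (B, hidden)
        h = torch.relu(self.fc1(x))
        h = h * gamma[:, None, :] + beta[:, None, :]                   # feature modulation
        return self.fc2(h)

x = torch.randn(2, 16, 64)
print(FeatureModulatedExperts(64, 128)(x).shape)   # torch.Size([2, 16, 64])
```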

AAAI Conference 2024 Conference Paper

Exploring Sparse Visual Prompt for Domain Adaptive Dense Prediction

  • Senqiao Yang
  • Jiarui Wu
  • Jiaming Liu
  • Xiaoqi Li
  • Qizhe Zhang
  • Mingjie Pan
  • Yulu Gan
  • Zehui Chen

Visual prompts provide an efficient manner of addressing visual cross-domain problems. Previous works introduce domain prompts to tackle the classification Test-Time Adaptation (TTA) problem by placing image-level prompts on the input and fine-tuning prompts for each target domain. However, since the image-level prompts mask out continuous spatial details in the prompt-allocated region, they suffer from inaccurate contextual information and limited domain knowledge extraction, particularly when dealing with dense prediction TTA problems. To overcome these challenges, we propose a novel Sparse Visual Domain Prompts (SVDP) approach, which applies minimal trainable parameters (e.g., 0.1%) to pixels across the entire image and preserves more spatial information of the input. To better apply SVDP in extracting domain-specific knowledge, we introduce the Domain Prompt Placement (DPP) method to adaptively allocate trainable parameters of SVDP on the pixels with large distribution shifts. Furthermore, recognizing that each target domain sample exhibits a unique domain shift, we design a Domain Prompt Updating (DPU) strategy to optimize prompt parameters differently for each sample, facilitating efficient adaptation to the target domain. Extensive experiments were conducted on widely-used TTA and continual TTA benchmarks, and our proposed method achieves state-of-the-art performance in both semantic segmentation and depth estimation tasks.
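As a rough illustration of a sparse visual prompt (not the SVDP implementation; the scoring and ratio below are placeholders), the sketch adds trainable prompt values only to the top-k pixels ranked by a per-pixel shift score, leaving the rest of the image untouched.

```python
# Illustrative sketch only -- not the SVDP code.
import torch
import torch.nn as nn

class SparseVisualPrompt(nn.Module):
    def __init__(self, channels: int, height: int, width: int, ratio: float = 0.001):
        super().__init__()
        self.prompt = nn.Parameter(torch.zeros(channels, height, width))
        self.k = max(1, int(ratio * height * width))   # e.g. ~0.1% of pixels

    def forward(self, image: torch.Tensor, shift_score: torch.Tensor) -> torch.Tensor:
        # image: (B, C, H, W); shift_score: (B, H, W), larger = bigger distribution shift
        b, c, h, w = image.shape
        flat = shift_score.view(b, -1)
        topk = flat.topk(self.k, dim=-1).indices                      # pixels to prompt
        mask = torch.zeros_like(flat).scatter_(1, topk, 1.0).view(b, 1, h, w)
        return image + mask * self.prompt                              # prompt only where mask=1

img = torch.randn(2, 3, 64, 64)
score = torch.rand(2, 64, 64)
print(SparseVisualPrompt(3, 64, 64)(img, score).shape)   # torch.Size([2, 3, 64, 64])
```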

AAAI Conference 2024 Conference Paper

Leveraging Imagery Data with Spatial Point Prior for Weakly Semi-supervised 3D Object Detection

  • Hongzhi Gao
  • Zheng Chen
  • Zehui Chen
  • Lin Chen
  • Jiaming Liu
  • Shanghang Zhang
  • Feng Zhao

Training high-accuracy 3D detectors necessitates massive 3D annotations with 7 degrees of freedom, which is laborious and time-consuming. Therefore, point annotations have been proposed as a promising alternative for practical 3D detection: they are not only more accessible and less expensive but also provide strong spatial information for object localization. In this paper, we empirically discover that it is non-trivial to merely adapt Point-DETR to its 3D form, encountering two main bottlenecks: 1) it fails to encode a strong 3D prior into the model, and 2) it generates low-quality pseudo labels in distant regions due to the extreme sparsity of LiDAR points. To overcome these challenges, we introduce Point-DETR3D, a teacher-student framework for weakly semi-supervised 3D detection, designed to fully capitalize on point-wise supervision within a constrained instance-wise annotation budget. Different from Point-DETR, which encodes 3D positional information solely through a point encoder, we propose an explicit positional query initialization strategy to enhance the positional prior. Considering the low quality of pseudo labels at distant regions produced by the teacher model, we enhance the detector's perception by incorporating dense imagery data through a novel Cross-Modal Deformable RoI Fusion (D-RoI). Moreover, an innovative point-guided self-supervised learning technique is proposed to allow for fully exploiting point priors, even in student models. Extensive experiments on the representative nuScenes dataset demonstrate that our Point-DETR3D obtains significant improvements compared to previous works. Notably, with only 5% of labeled data, Point-DETR3D achieves over 90% performance of its fully supervised counterpart.

AAAI Conference 2024 Conference Paper

M3SOT: Multi-Frame, Multi-Field, Multi-Space 3D Single Object Tracking

  • Jiaming Liu
  • Yue Wu
  • Maoguo Gong
  • Qiguang Miao
  • Wenping Ma
  • Cai Xu
  • Can Qin

3D Single Object Tracking (SOT) stands as a forefront task in computer vision, proving essential for applications like autonomous driving. Sparse and occluded data in scene point clouds introduce variations in the appearance of tracked objects, adding complexity to the task. In this research, we unveil M3SOT, a novel 3D SOT framework, which synergizes multiple input frames (template sets), multiple receptive fields (continuous contexts), and multiple solution spaces (distinct tasks) in ONE model. Remarkably, M3SOT pioneers modeling temporality, contexts, and tasks directly from point clouds, revisiting a perspective on the key factors influencing SOT. To this end, we design a transformer-based network centered on point cloud targets in the search area, aggregating diverse contextual representations and propagating target cues by employing historical frames. As M3SOT spans varied processing perspectives, we've streamlined the network, trimming its depth and optimizing its structure, to ensure a lightweight and efficient deployment for SOT applications. We posit that, backed by practical construction, M3SOT sidesteps the need for complex frameworks and auxiliary components to deliver sterling results. Extensive experiments on benchmarks such as KITTI, nuScenes, and Waymo Open Dataset demonstrate that M3SOT achieves state-of-the-art performance at 38 FPS. Our code and models are available at https://github.com/ywu0912/TeamCode.git.

NeurIPS Conference 2024 Conference Paper

RoboMamba: Efficient Vision-Language-Action Model for Robotic Reasoning and Manipulation

  • Jiaming Liu
  • Mengzhen Liu
  • Zhenyu Wang
  • Pengju An
  • Xiaoqi Li
  • Kaichen Zhou
  • Senqiao Yang
  • Renrui Zhang

A fundamental objective in robot manipulation is to enable models to comprehend visual scenes and execute actions. Although existing Vision-Language-Action (VLA) models for robots can handle a range of basic tasks, they still face challenges in two areas: (1) insufficient reasoning ability to tackle complex tasks, and (2) high computational costs for VLA model fine-tuning and inference. The recently proposed state space model (SSM) known as Mamba demonstrates promising capabilities in non-trivial sequence modeling with linear inference complexity. Inspired by this, we introduce RoboMamba, an end-to-end robotic VLA model that leverages Mamba to deliver both robotic reasoning and action capabilities, while maintaining efficient fine-tuning and inference. Specifically, we first integrate the vision encoder with Mamba, aligning visual tokens with language embedding through co-training, empowering our model with visual common sense and robotic-related reasoning. To further equip RoboMamba with SE(3) pose prediction abilities, we explore an efficient fine-tuning strategy with a simple policy head. We find that once RoboMamba possesses sufficient reasoning capability, it can acquire manipulation skills with minimal fine-tuning parameters (0.1% of the model) and time. In experiments, RoboMamba demonstrates outstanding reasoning capabilities on general and robotic evaluation benchmarks. Meanwhile, our model showcases impressive pose prediction results in both simulation and real-world experiments, achieving inference speeds 3 times faster than existing VLA models.

NeurIPS Conference 2024 Conference Paper

StreamFlow: Streamlined Multi-Frame Optical Flow Estimation for Video Sequences

  • Shangkun Sun
  • Jiaming Liu
  • Huaxia Li
  • Guoqing Liu
  • Thomas H. Li
  • Wei Gao

Prior multi-frame optical flow methods typically estimate flow repeatedly in a pair-wise manner, leading to significant computational redundancy. To mitigate this, we implement a Streamlined In-batch Multi-frame (SIM) pipeline, specifically tailored to video inputs to minimize redundant calculations. It enables the simultaneous prediction of successive unidirectional flows in a single forward pass, boosting processing speed by 44.43% and reaching efficiencies on par with two-frame networks. Moreover, we investigate various spatiotemporal modeling methods for optical flow estimation within this pipeline. Notably, we propose a simple yet highly effective parameter-efficient Integrative spatiotemporal Coherence (ISC) modeling method, alongside a lightweight Global Temporal Regressor (GTR) to harness temporal cues. The proposed ISC and GTR bring powerful spatiotemporal modeling capabilities and significantly enhance accuracy, including in occluded areas, while adding modest computations to the SIM pipeline. Compared to the baseline, our approach, StreamFlow, achieves performance enhancements of 15.45% and 11.37% on the Sintel clean and final test sets respectively, with gains of 15.53% and 10.77% on occluded regions and only a 1.11% rise in latency. Furthermore, StreamFlow exhibits state-of-the-art cross-dataset testing results on Sintel and KITTI, demonstrating its robust cross-domain generalization capabilities. The code is available here.

NeurIPS Conference 2023 Conference Paper

Block Coordinate Plug-and-Play Methods for Blind Inverse Problems

  • Weijie Gan
  • Shirin Shoushtari
  • Yuyang Hu
  • Jiaming Liu
  • Hongyu An
  • Ulugbek Kamilov

Plug-and-play (PnP) prior is a well-known class of methods for solving imaging inverse problems by computing fixed-points of operators combining physical measurement models and learned image denoisers. While PnP methods have been extensively used for image recovery with known measurement operators, there is little work on PnP for solving blind inverse problems. We address this gap by presenting a new block-coordinate PnP (BC-PnP) method that efficiently solves this joint estimation problem by introducing learned denoisers as priors on both the unknown image and the unknown measurement operator. We present a new convergence theory for BC-PnP compatible with blind inverse problems by considering nonconvex data-fidelity terms and expansive denoisers. Our theory analyzes the convergence of BC-PnP to a stationary point of an implicit function associated with an approximate minimum mean-squared error (MMSE) denoiser. We numerically validate our method on two blind inverse problems: automatic coil sensitivity estimation in magnetic resonance imaging (MRI) and blind image deblurring. Our results show that BC-PnP provides an efficient and principled framework for using denoisers as PnP priors for jointly estimating measurement operators and images.
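A toy numerical sketch of the block-coordinate flavor of such a method, alternating prior-regularized updates over the unknown image and the unknown measurement-operator parameters, is given below. It uses a placeholder shrinkage "denoiser" and a contrived bilinear measurement model, so it illustrates the update structure only, not the paper's BC-PnP algorithm.

```python
# Toy block-coordinate PnP-style loop: alternating data-consistency
# gradient steps and placeholder prior steps on the image x and on the
# measurement-operator parameters theta. Not the BC-PnP implementation.
import numpy as np

rng = np.random.default_rng(0)
n = 64
x_true = rng.standard_normal(n)
theta_true = rng.standard_normal(n)            # e.g. unknown sensitivity/blur weights
A = lambda x, th: th * x                       # toy bilinear measurement model
y = A(x_true, theta_true) + 0.01 * rng.standard_normal(n)

def denoise(v, strength=0.001):
    # Placeholder prior step (simple shrinkage); a learned denoiser in practice.
    return np.sign(v) * np.maximum(np.abs(v) - strength, 0.0)

x, theta = np.zeros(n), np.ones(n)
gamma = 0.1
for it in range(200):
    # Block 1: image update (gradient on the data fit, then prior step).
    grad_x = theta * (A(x, theta) - y)
    x = denoise(x - gamma * grad_x)
    # Block 2: measurement-operator update.
    grad_th = x * (A(x, theta) - y)
    theta = denoise(theta - gamma * grad_th)

print("data residual:", float(np.linalg.norm(A(x, theta) - y)))
```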

IJCAI Conference 2023 Conference Paper

Open-world Semi-supervised Novel Class Discovery

  • Jiaming Liu
  • Yangqiming Wang
  • Tongze Zhang
  • Yulu Fan
  • Qinli Yang
  • Junming Shao

Traditional semi-supervised learning tasks assume that both labeled and unlabeled data follow the same class distribution, but realistic open-world scenarios are more complex, with unknown novel classes mixed into the unlabeled set. It is therefore challenging not only to recognize samples from known classes but also to discover an unknown number of novel classes within the unlabeled data. In this paper, we introduce a new open-world semi-supervised novel class discovery approach named OpenNCD, a progressive bi-level contrastive learning method over multiple prototypes. The proposed method is composed of two reciprocally enhanced parts. First, a bi-level contrastive learning method is introduced, which maintains pair-wise similarity at both the prototype and prototype-group levels for better representation learning. Then, a reliable prototype similarity metric is proposed based on the common representing instances. Prototypes with high similarities will be grouped progressively for known class recognition and novel class discovery. Extensive experiments on three image datasets are conducted and the results show the effectiveness of the proposed method in open-world scenarios, especially with scarce known classes and labels.

YNIMG Journal 2022 Journal Article

Environmental enrichment leads to behavioral circadian shifts enhancing brain-wide functional connectivity between sensory cortices and eliciting increased hippocampal spiking

  • Francis A.M. Manno
  • Ziqi An
  • Rachit Kumar
  • Junfeng Su
  • Jiaming Liu
  • Ed X. Wu
  • Jufang He
  • Yanqiu Feng

Environmental enrichment induces widespread neuronal changes, but the initiation of the cascade is unknown. We ascertained the critical period of divergence between environmental enriched (EE) and standard environment (SE) mice using continuous infrared (IR) videography, functional magnetic resonance imaging (fMRI), and neuron level calcium imaging. Naïve adult male mice (n = 285, C57BL/6J, postnatal day 60) were divided into SE and EE groups. We assessed the linear time-series of motion activity using a novel structural break test which examined the dataset for change in circadian and day-by-day motion activity. fMRI was used to map brain-wide response using a functional connectome analysis pipeline. Awake calcium imaging was performed on the dorsal CA1 pyramidal layer. We found the preeminent behavioral feature in EE was a forward shift in the circadian rhythm, prolongation of activity in the dark photoperiod, and overall decreased motion activity. The crepuscular period of dusk was seen as the critical period of divergence between EE and SE mice. The functional processes at dusk in EE included increased functional connectivity in the visual cortex, motor cortex, retrosplenial granular cortex, and cingulate cortex using seed-based analysis. Network based statistics found a modulated functional connectome in EE concentrated in two hubs: the hippocampal formation and isocortical network. These hubs experienced a higher node degree and significant enhanced edge connectivity. Calcium imaging revealed increased spikes per second and maximum firing rate in the dorsal CA1 pyramidal layer, in addition to location (anterior-posterior and medial-lateral) effect size differences between EE and SE. The emergence of functional-neuronal changes due to enrichment consisted of enhanced hippocampal-isocortex functional connectivity and CA1 neuronal increased spiking linked to a circadian shift during the dusk period. Future studies should explore the molecular consequences of enrichment inducing shifts in the circadian period.

NeurIPS Conference 2022 Conference Paper

Online Deep Equilibrium Learning for Regularization by Denoising

  • Jiaming Liu
  • Xiaojian Xu
  • Weijie Gan
  • Shirin Shoushtari
  • Ulugbek Kamilov

Plug-and-Play Priors (PnP) and Regularization by Denoising (RED) are widely-used frameworks for solving imaging inverse problems by computing fixed-points of operators combining physical measurement models and learned image priors. While traditional PnP/RED formulations have focused on priors specified using image denoisers, there is a growing interest in learning PnP/RED priors that are end-to-end optimal. The recent Deep Equilibrium Models (DEQ) framework has enabled memory-efficient end-to-end learning of PnP/RED priors by implicitly differentiating through the fixed-point equations without storing intermediate activation values. However, the dependence of the computational/memory complexity of the measurement models in PnP/RED on the total number of measurements leaves DEQ impractical for many imaging applications. We propose ODER as a new strategy for improving the efficiency of DEQ through stochastic approximations of the measurement models. We theoretically analyze ODER giving insights into its convergence and ability to approximate the traditional DEQ approach. Our numerical results suggest the potential improvements in training/testing complexity due to ODER on three distinct imaging applications.
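The stochastic-approximation idea can be illustrated with a toy RED-style loop in which each iteration forms the data-fidelity gradient from a random subset of measurements rather than all of them. The sketch below uses a placeholder denoiser and a toy linear operator; it is not the ODER implementation.

```python
# Toy RED-style iteration with a minibatch estimate of the data-fidelity
# gradient (the stochastic-approximation flavor described in the abstract).
import numpy as np

rng = np.random.default_rng(0)
n, m = 64, 512
A = rng.standard_normal((m, n)) / np.sqrt(m)     # toy measurement operator
x_true = rng.standard_normal(n)
y = A @ x_true

def denoise(v):
    return 0.9 * v                                # placeholder image prior

x, gamma, tau, batch = np.zeros(n), 0.2, 0.1, 64
for it in range(1000):
    idx = rng.choice(m, size=batch, replace=False)                  # random measurement block
    grad_data = (m / batch) * A[idx].T @ (A[idx] @ x - y[idx])      # minibatch estimate
    grad_red = tau * (x - denoise(x))                               # RED prior term
    x = x - gamma * (grad_data + grad_red)

print("relative error:", float(np.linalg.norm(x - x_true) / np.linalg.norm(x_true)))
```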

NeurIPS Conference 2021 Conference Paper

Recovery Analysis for Plug-and-Play Priors using the Restricted Eigenvalue Condition

  • Jiaming Liu
  • Salman Asif
  • Brendt Wohlberg
  • Ulugbek Kamilov

The plug-and-play priors (PnP) and regularization by denoising (RED) methods have become widely used for solving inverse problems by leveraging pre-trained deep denoisers as image priors. While the empirical imaging performance and the theoretical convergence properties of these algorithms have been widely investigated, their recovery properties have not previously been theoretically analyzed. We address this gap by showing how to establish theoretical recovery guarantees for PnP/RED by assuming that the solution of these methods lies near the fixed-points of a deep neural network. We also present numerical results comparing the recovery performance of PnP/RED in compressive sensing against that of recent compressive sensing algorithms based on generative models. Our numerical results suggest that PnP with a pre-trained artifact removal network provides significantly better results compared to the existing state-of-the-art methods.

ICRA Conference 2020 Conference Paper

Learning Affordance Space in Physical World for Vision-based Robotic Object Manipulation

  • Huadong Wu
  • Zhanpeng Zhang
  • Hui Cheng
  • Kai Yang
  • Jiaming Liu
  • Ziying Guo

What is a proper representation for objects in manipulation? What would a human try to perceive when manipulating a new object in a new environment? In fact, instead of focusing on texture and illumination, humans can infer the "affordance" [36] of objects from vision. Here "affordance" describes an object's intrinsic property that affords a particular type of manipulation. In this work, we investigate whether such affordance can be learned by a deep neural network. In particular, we propose an Affordance Space Perception Network (ASPN) that takes an image as input and outputs an affordance map. Different from existing works that infer the pixel-wise probability affordance map in image space, our affordance is defined in real-world space, which eliminates the need for hand-eye calibration. In addition, we extend the representation ability of affordance by defining it in a 3D affordance space and propose a novel training strategy to improve performance. Trained purely with simulation data, ASPN can achieve significant performance in the real world. It is a task-agnostic framework and can handle different objects, scenes and viewpoints. Extensive real-world experiments demonstrate the accuracy and robustness of our approach. We achieve success rates of 94.2% for singular-object pushing and 92.4% for multiple-object pushing. We also achieve success rates of 97.2% for singular-object grasping and 95.4% for multiple-object grasping, which outperform current state-of-the-art methods.

NeurIPS Conference 2019 Conference Paper

Block Coordinate Regularization by Denoising

  • Yu Sun
  • Jiaming Liu
  • Ulugbek Kamilov

We consider the problem of estimating a vector from its noisy measurements using a prior specified only through a denoising function. Recent work on plug-and-play priors (PnP) and regularization-by-denoising (RED) has shown the state-of-the-art performance of estimators under such priors in a range of imaging tasks. In this work, we develop a new block coordinate RED algorithm that decomposes a large-scale estimation problem into a sequence of updates over a small subset of the unknown variables. We theoretically analyze the convergence of the algorithm and discuss its relationship to the traditional proximal optimization. Our analysis complements and extends recent theoretical results for RED-based estimation methods. We numerically validate our method using several denoiser priors, including those based on convolutional neural network (CNN) denoisers.
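A toy sketch of a block-coordinate RED-style update, in which each iteration refreshes only one randomly chosen block of coordinates using the data-fidelity gradient plus a denoiser-residual term restricted to that block, is given below. The denoiser and measurement model are placeholders, not the paper's method.

```python
# Toy block-coordinate RED-style update with a placeholder denoiser.
import numpy as np

rng = np.random.default_rng(0)
n, m, block = 64, 128, 16
A = rng.standard_normal((m, n)) / np.sqrt(m)     # toy measurement operator
x_true = rng.standard_normal(n)
y = A @ x_true

def denoise(v):
    return 0.9 * v                               # placeholder denoising prior

x, gamma, tau = np.zeros(n), 0.2, 0.1
blocks = np.arange(n).reshape(-1, block)
for it in range(2000):
    i = blocks[rng.integers(len(blocks))]        # pick one coordinate block
    residual = A @ x - y
    grad_i = A[:, i].T @ residual                # data-fidelity gradient on the block
    red_i = tau * (x[i] - denoise(x)[i])         # RED term restricted to the block
    x[i] = x[i] - gamma * (grad_i + red_i)

print("relative error:", float(np.linalg.norm(x - x_true) / np.linalg.norm(x_true)))
```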

AAAI Conference 2017 Conference Paper

Sparse Deep Transfer Learning for Convolutional Neural Network

  • Jiaming Liu
  • Yali Wang
  • Yu Qiao

Extensive studies have demonstrated that the representations of convolutional neural networks (CNN), which are learned from a large-scale data set in the source domain, can be effectively transferred to a new target domain. However, compared to the source domain, the target domain often has limited data in practice. In this case, overfitting may significantly depress transferability, due to the model redundancy of the intensive CNN structures. To deal with this difficulty, we propose a novel sparse deep transfer learning approach for CNN. There are three main contributions in this work. First, we introduce a Sparse-SourceNet to reduce the redundancy in the source domain. Second, we introduce a Hybrid-TransferNet to improve the generalization ability and the prediction accuracy of transfer learning, by taking advantage of both model sparsity and implicit knowledge. Third, we introduce a Sparse-TargetNet, where we prune our Hybrid-TransferNet to obtain a highly compact, source-knowledge-integrated CNN in the target domain. To examine the effectiveness of our methods, we perform our sparse deep transfer learning approach on a number of benchmark transfer learning tasks. The results show that, compared to the standard fine-tuning approach, our proposed approach achieves a significant pruning rate on CNN while improving the accuracy of transfer learning.