Arrow Research search

Author name cluster

Guo-Jun Qi

Possible papers associated with this exact author name in Arrow. This page groups case-insensitive exact name matches and is not a full identity disambiguation profile.

13 papers
2 author rows

Possible papers

13

ICML Conference 2025 Conference Paper

DynaMind: Reasoning over Abstract Video Dynamics for Embodied Decision-Making

  • Ziru Wang
  • Mengmeng Wang 0005
  • Jade Dai
  • Teli Ma
  • Guo-Jun Qi
  • Yong Liu 0007
  • Guang Dai
  • Jingdong Wang 0001

Integrating natural language instructions and visual perception with decision-making is a critical challenge for embodied agents. Existing methods often struggle to balance the conciseness of language commands with the richness of video content. To bridge the gap between modalities, we propose extracting key spatiotemporal patterns from video that capture visual saliency and temporal evolution, referred to as dynamic representation. Building on this, we introduce DynaMind, a framework that enhances decision-making through dynamic reasoning. Specifically, we design an adaptive FrameScorer to evaluate video frames based on semantic consistency and visual saliency, assigning each frame an importance score. These scores are used to filter redundant video content and synthesize compact dynamic representations. Leveraging these representations, we predict critical future dynamics and apply a dynamic-guided policy to generate coherent and context-aware actions. Extensive results demonstrate that DynaMind significantly outperforms the baselines across several simulation benchmarks and real-world scenarios.

NeurIPS Conference 2025 Conference Paper

Reinforcing the Diffusion Chain of Lateral Thought with Diffusion Language Models

  • Zemin Huang
  • Zhiyang Chen
  • Zijun Wang
  • Tiancheng Li
  • Guo-Jun Qi

We introduce the Diffusion Chain of Lateral Thought (DCoLT), a reasoning framework for diffusion language models. DCoLT treats each intermediate step in the reverse diffusion process as a latent "thinking" action and optimizes the entire reasoning trajectory to maximize the reward on the correctness of the final answer with outcome-based Reinforcement Learning (RL). Unlike traditional Chain-of-Thought (CoT) methods that follow a causal, linear thinking process, DCoLT allows bidirectional, non-linear reasoning with no strict rule on grammatical correctness amid its intermediate steps of thought. We implement DCoLT on two representative Diffusion Language Models (DLMs). First, we choose SEDD as a representative continuous-time discrete diffusion model, where its concrete score derives a probabilistic policy to maximize the RL reward over the entire sequence of intermediate diffusion steps. We further consider the discrete-time masked diffusion language model -- LLaDA, and find that the order to predict and unmask tokens plays an essential role to optimize its RL action resulting from the ranking-based Unmasking Policy Module (UPM) defined by the Plackett-Luce model. Experiments on both math and code generation tasks show that using only public data and 16 H800 GPUs, DCoLT-reinforced DLMs outperform other DLMs trained by SFT or RL or even both. Notably, DCoLT-reinforced LLaDA boosts its reasoning accuracy by +9. 8%, +5. 7%, +11. 4%, +19. 5% on GSM8K, MATH, MBPP, and HumanEval.

AAAI Conference 2024 Conference Paper

BARET: Balanced Attention Based Real Image Editing Driven by Target-Text Inversion

  • Yuming Qiao
  • Fanyi Wang
  • Jingwen Su
  • Yanhao Zhang
  • Yunjie Yu
  • Siyu Wu
  • Guo-Jun Qi

Image editing approaches with diffusion models have been rapidly developed, yet their applicability are subject to requirements such as specific editing types (e.g., foreground or background object editing, style transfer), multiple conditions (e.g., mask, sketch, caption), and time consuming fine-tuning of diffusion models. For alleviating these limitations and realizing efficient real image editing, we propose a novel editing technique that only requires an input image and target text for various editing types including non-rigid edits without fine-tuning diffusion model. Our method contains three novelties: (I) Target-text Inversion Schedule (TTIS) is designed to fine-tune the input target text embedding to achieve fast image reconstruction without image caption and acceleration of convergence. (II) Progressive Transition Scheme applies progressive linear interpolation between target text embedding and its fine-tuned version to generate transition embedding for maintaining non-rigid editing capability. (III) Balanced Attention Module (BAM) balances the tradeoff between textual description and image semantics. By the means of combining self-attention map from reconstruction process and cross-attention map from transition process, the guidance of target text embeddings in diffusion process is optimized. In order to demonstrate editing capability, effectiveness and efficiency of the proposed BARET, we have conducted extensive qualitative and quantitative experiments. Moreover, results derived from user study and ablation study further prove the superiority over other methods.

NeurIPS Conference 2024 Conference Paper

One-Step Diffusion Distillation through Score Implicit Matching

  • Weijian Luo
  • Zemin Huang
  • Zhengyang Geng
  • J. Z. Kolter
  • Guo-Jun Qi

Despite their strong performances on many generative tasks, diffusion models require a large number of sampling steps in order to generate realistic samples. This has motivated the community to develop effective methods to distill pre-trained diffusion models into more efficient models, but these methods still typically require few-step inference or perform substantially worse than the underlying model. In this paper, we present Score Implicit Matching (SIM) a new approach to distilling pre-trained diffusion models into single-step generator models, while maintaining almost the same sample generation ability as the original model as well as being data-free with no need of training samples for distillation. The method rests upon the fact that, although the traditional score-based loss is intractable to minimize for generator models, under certain conditions we \emph{can} efficiently compute the \emph{gradients} for a wide class of score-based divergences between a diffusion model and a generator. SIM shows strong empirical performances for one-step generators: on the CIFAR10 dataset, it achieves an FID of 2. 06 for unconditional generation and 1. 96 for class-conditional generation. Moreover, by applying SIM to a leading transformer-based diffusion model, we distill a single-step generator for text-to-image (T2I) generation that attains an aesthetic score of 6. 42 with no performance decline over the original multi-step counterpart, clearly outperforming the other one-step generators including SDXL-TURBO of 5. 33, SDXL-LIGHTNING of 5. 34 and HYPER-SDXL of 5. 85. We will release this industry-ready one-step transformer-based T2I generator along with this paper.

IJCAI Conference 2024 Conference Paper

Zero-shot High-fidelity and Pose-controllable Character Animation

  • Bingwen Zhu
  • Fanyi Wang
  • Tianyi Lu
  • Peng Liu
  • Jingwen Su
  • Jinxiu Liu
  • Yanhao Zhang
  • Zuxuan Wu

Image-to-video (I2V) generation aims to create a video sequence from a single image, which requires high temporal coherence and visual fidelity. However, existing approaches suffer from inconsistency of character appearances and poor preservation of fine details. Moreover, they require a large amount of video data for training, which can be computationally demanding. To address these limitations, we propose PoseAnimate, a novel zero-shot I2V framework for character animation. PoseAnimate contains three key components: 1) a Pose-Aware Control Module (PACM) that incorporates diverse pose signals into text embeddings, to preserve character-independent content and maintain precise alignment of actions. 2) a Dual Consistency Attention Module (DCAM) that enhances temporal consistency and retains character identity and intricate background details. 3) a Mask-Guided Decoupling Module (MGDM) that refines distinct feature perception abilities, improving animation fidelity by decoupling the character and background. We also propose a Pose Alignment Transition Algorithm (PATA) to ensure smooth action transition. Extensive experiment results demonstrate that our approach outperforms the state-of-the-art training-based methods in terms of character consistency and detail fidelity. Moreover, it maintains a high level of temporal coherence throughout the generated animations.

AAAI Conference 2021 Conference Paper

A Unified Multi-Scenario Attacking Network for Visual Object Tracking

  • Xuesong Chen
  • Canmiao Fu
  • Feng Zheng
  • Yong Zhao
  • Hongsheng Li
  • Ping Luo
  • Guo-Jun Qi

Existing methods of adversarial attacks successfully generate adversarial examples to confuse Deep Neural Networks (DNNs) of image classification and object detection, resulting in wrong predictions. However, these methods are difficult to attack models of video object tracking, because the tracking algorithms could handle sequential information across video frames and the categories of targets tracked are normally unknown in advance. In this paper, we propose a Unified and Effective Network, named UEN, to attack visual object tracking models. There are several appealing characteristics of UEN: (1) UEN could produce various invisible adversarial perturbations according to different attack settings by using only one simple end-to-end network with three ingenious loss function; (2) UEN could generate general visible adversarial patch patterns to attack the advanced trackers in the real-world; (3) Extensive experiments show that UEN is able to attack many state-of-the-art trackers effectively (e. g. SiamRPN-based networks and DiMP) on popular tracking datasets including OTB100, UAV123, and GOT10K, making online real-time attacks possible. The attack results outperform the introduced baseline in terms of attacking ability and attacking efficiency.

AAAI Conference 2021 Conference Paper

Auto-Encoding Transformations in Reparameterized Lie Groups for Unsupervised Learning

  • Feng Lin
  • Haohang Xu
  • Houqiang Li
  • Hongkai Xiong
  • Guo-Jun Qi

Unsupervised training of deep representations has demonstrated remarkable potentials in mitigating the prohibitive expenses on annotating labeled data recently. Among them is predicting transformations as a pretext task to self-train representations, which has shown great potentials for unsupervised learning. However, existing approaches in this category learn representations by either treating a discrete set of transformations as separate classes, or using the Euclidean distance as the metric to minimize the errors between transformations. None of them has been dedicated to revealing the vital role of the geometry of transformation groups in learning representations. Indeed, an image must continuously transform along the curved manifold of a transformation group rather than through a straight line in the forbidden ambient Euclidean space. This suggests the use of geodesic distance to minimize the errors between the estimated and groundtruth transformations. Particularly, we focus on homographies, a general group of planar transformations containing the Euclidean, similarity and affine transformations as its special cases. To avoid an explicit computing of intractable Riemannian logarithm, we project homographies onto an alternative group of rotation transformations SR(3) with a tractable form of geodesic distance. Experiments demonstrate the proposed approach to Auto-Encoding Transformations exhibits superior performances on a variety of recognition problems.

ICLR Conference 2020 Conference Paper

PC-DARTS: Partial Channel Connections for Memory-Efficient Architecture Search

  • Yuhui Xu 0002
  • Lingxi Xie
  • Xiaopeng Zhang 0008
  • Xin Chen 0033
  • Guo-Jun Qi
  • Qi Tian 0001
  • Hongkai Xiong

Differentiable architecture search (DARTS) provided a fast solution in finding effective network architectures, but suffered from large memory and computing overheads in jointly training a super-net and searching for an optimal architecture. In this paper, we present a novel approach, namely Partially-Connected DARTS, by sampling a small part of super-net to reduce the redundancy in exploring the network space, thereby performing a more efficient search without comprising the performance. In particular, we perform operation search in a subset of channels while bypassing the held out part in a shortcut. This strategy may suffer from an undesired inconsistency on selecting the edges of super-net caused by sampling different channels. We solve it by introducing edge normalization, which adds a new set of edge-level hyper-parameters to reduce uncertainty in search. Thanks to the reduced memory cost, PC-DARTS can be trained with a larger batch size and, consequently, enjoy both faster speed and higher training stability. Experiment results demonstrate the effectiveness of the proposed method. Specifically, we achieve an error rate of 2.57% on CIFAR10 within merely 0.1 GPU-days for architecture search, and a state-of-the-art top-1 error rate of 24.2% on ImageNet (under the mobile setting) within 3.8 GPU-days for search. Our code has been made available at https://www.dropbox.com/sh/on9lg3rpx1r6dkf/AABG5mt0sMHjnEJyoRnLEYW4a?dl=0.

AAAI Conference 2019 Conference Paper

Learning to Adaptively Scale Recurrent Neural Networks

  • Hao Hu
  • Liqiang Wang
  • Guo-Jun Qi

Recent advancements in recurrent neural network (RNN) research have demonstrated the superiority of utilizing multiscale structures in learning temporal representations of time series. Currently, most of multiscale RNNs use fixed scales, which do not comply with the nature of dynamical temporal patterns among sequences. In this paper, we propose Adaptively Scaled Recurrent Neural Networks (ASRNN), a simple but efficient way to handle this problem. Instead of using predefined scales, ASRNNs are able to learn and adjust scales based on different temporal contexts, making them more flexible in modeling multiscale patterns. Compared with other multiscale RNNs, ASRNNs are bestowed upon dynamical scaling capabilities with much simpler structures, and are easy to be integrated with various RNN cells. The experiments on multiple sequence modeling tasks indicate AS- RNNs can efficiently adapt scales based on different sequence contexts and yield better performances than baselines without dynamical scaling abilities.

NeurIPS Conference 2018 Conference Paper

CapProNet: Deep Feature Learning via Orthogonal Projections onto Capsule Subspaces

  • Liheng Zhang
  • Marzieh Edraki
  • Guo-Jun Qi

In this paper, we formalize the idea behind capsule nets of using a capsule vector rather than a neuron activation to predict the label of samples. To this end, we propose to learn a group of capsule subspaces onto which an input feature vector is projected. Then the lengths of resultant capsules are used to score the probability of belonging to different classes. We train such a Capsule Projection Network (CapProNet) by learning an orthogonal projection matrix for each capsule subspace, and show that each capsule subspace is updated until it contains input feature vectors corresponding to the associated class. With low dimensionality of capsule subspace as well as an iterative method to estimate the matrix inverse, only a small negligible computing overhead is incurred to train the network. Experiment results on image datasets show the presented network can greatly improve the performance of state-of-the-art Resnet backbones by $10-20\%$ with almost the same computing cost.

ICML Conference 2017 Conference Paper

State-Frequency Memory Recurrent Neural Networks

  • Hao Hu 0010
  • Guo-Jun Qi

Modeling temporal sequences plays a fundamental role in various modern applications and has drawn more and more attentions in the machine learning community. Among those efforts on improving the capability to represent temporal data, the Long Short-Term Memory (LSTM) has achieved great success in many areas. Although the LSTM can capture long-range dependency in the time domain, it does not explicitly model the pattern occurrences in the frequency domain that plays an important role in tracking and predicting data points over various time cycles. We propose the State-Frequency Memory (SFM), a novel recurrent architecture that allows to separate dynamic patterns across different frequency components and their impacts on modeling the temporal contexts of input sequences. By jointly decomposing memorized dynamics into state-frequency components, the SFM is able to offer a fine-grained analysis of temporal sequences by capturing the dependency of uncovered patterns in both time and frequency domains. Evaluations on several temporal modeling tasks demonstrate the SFM can yield competitive performances, in particular as compared with the state-of-the-art LSTM models.

TIST Journal 2011 Journal Article

Image annotation by k NN-sparse graph-based label propagation over noisily tagged web images

  • Jinhui Tang
  • Richang Hong
  • Shuicheng Yan
  • Tat-Seng Chua
  • Guo-Jun Qi
  • Ramesh Jain

In this article, we exploit the problem of annotating a large-scale image corpus by label propagation over noisily tagged web images. To annotate the images more accurately, we propose a novel k NN-sparse graph-based semi-supervised learning approach for harnessing the labeled and unlabeled data simultaneously. The sparse graph constructed by datum-wise one-vs- k NN sparse reconstructions of all samples can remove most of the semantically unrelated links among the data, and thus it is more robust and discriminative than the conventional graphs. Meanwhile, we apply the approximate k nearest neighbors to accelerate the sparse graph construction without loosing its effectiveness. More importantly, we propose an effective training label refinement strategy within this graph-based learning framework to handle the noise in the training labels, by bringing in a dual regularization for both the quantity and sparsity of the noise. We conduct extensive experiments on a real-world image database consisting of 55,615 Flickr images and noisily tagged training labels. The results demonstrate both the effectiveness and efficiency of the proposed approach and its capability to deal with the noise in the training labels.