Author name cluster

Pichao Wang

Possible papers associated with this exact author name in Arrow. This page groups case-insensitive exact name matches and is not a full identity disambiguation profile.

12 papers

2 author rows

ICLR Conference 2025 Conference Paper

Bridging Information Asymmetry in Text-video Retrieval: A Data-centric Approach

Zechen Bai
Tianjun Xiao
Tong He 0002
Pichao Wang
Zheng Zhang 0001
Thomas Brox
Mike Zheng Shou

As online video content rapidly grows, the task of text-video retrieval (TVR) becomes increasingly important. A key challenge in TVR is the information asymmetry between video and text: videos are inherently richer in information, while their textual descriptions often capture only fragments of this complexity. This paper introduces a novel, data-centric framework to bridge this gap by enriching textual representations to better match the richness of video content. During training, videos are segmented into event-level clips and captioned to ensure comprehensive coverage. During retrieval, a large language model (LLM) generates semantically diverse queries to capture a broader range of possible matches. To enhance retrieval efficiency, we propose a query selection mechanism that identifies the most relevant and diverse queries, reducing computational cost while improving accuracy. Our method achieves state-of-the-art results across multiple benchmarks, demonstrating the power of data-centric approaches in addressing information asymmetry in TVR. This work paves the way for new research focused on leveraging data to improve cross-modal retrieval.

Details

NeurIPS Conference 2025 Conference Paper

SparseDiT: Token Sparsification for Efficient Diffusion Transformer

Shuning Chang
Pichao Wang
Jiasheng Tang
Fan Wang
Yi Yang

Diffusion Transformers (DiT) are renowned for their impressive generative performance; however, they are significantly constrained by considerable computational costs due to the quadratic complexity in self-attention and the extensive sampling steps required. While advancements have been made in expediting the sampling process, the underlying architectural inefficiencies within DiT remain underexplored. We introduce SparseDiT, a novel framework that implements token sparsification across spatial and temporal dimensions to enhance computational efficiency while preserving generative quality. Spatially, SparseDiT employs a tri-segment architecture that allocates token density based on feature requirements at each layer: Poolingformer in the bottom layers for efficient global feature extraction, Sparse-Dense Token Modules (SDTM) in the middle layers to balance global context with local detail, and dense tokens in the top layers to refine high-frequency details. Temporally, SparseDiT dynamically modulates token density across denoising stages, progressively increasing token count as finer details emerge in later timesteps. This synergy between SparseDiT’s spatially adaptive architecture and its temporal pruning strategy enables a unified framework that balances efficiency and fidelity throughout the generation process. Our experiments demonstrate SparseDiT’s effectiveness, achieving a 55\% reduction in FLOPs and a 175\% improvement in inference speed on DiT-XL with similar FID score on 512$\times$512 ImageNet, a 56\% reduction in FLOPs across video generation datasets, and a 69\% improvement in inference speed on PixArt-$\alpha$ on text-to-image generation task with a 0. 24 FID score decrease. SparseDiT provides a scalable solution for high-quality diffusion-based generation compatible with sampling optimization techniques. Code is available at https: //github. com/changsn/SparseDiT.

PDF Details

NeurIPS Conference 2024 Conference Paper

Diffusion-Inspired Truncated Sampler for Text-Video Retrieval

Jiamian Wang
Pichao Wang
Dongfang Liu
Qiang Guan
Sohail Dianat
MAJID RABBANI
Raghuveer Rao
Zhiqiang Tao

Prevalent text-to-video retrieval methods represent multimodal text-video data in a joint embedding space, aiming at bridging the relevant text-video pairs and pulling away irrelevant ones. One main challenge in state-of-the-art retrieval methods lies in the modality gap, which stems from the substantial disparities between text and video and can persist in the joint space. In this work, we leverage the potential of Diffusion models to address the text-video modality gap by progressively aligning text and video embeddings in a unified space. However, we identify two key limitations of existing Diffusion models in retrieval tasks: The L2 loss does not fit the ranking problem inherent in text-video retrieval, and the generation quality heavily depends on the varied initial point drawn from the isotropic Gaussian, causing inaccurate retrieval. To this end, we introduce a new Diffusion-Inspired Truncated Sampler (DITS) that jointly performs progressive alignment and modality gap modeling in the joint embedding space. The key innovation of DITS is to leverage the inherent proximity of text and video embeddings, defining a truncated diffusion flow from the fixed text embedding to the video embedding, enhancing controllability compared to adopting the isotropic Gaussian. Moreover, DITS adopts the contrastive loss to jointly consider the relevant and irrelevant pairs, not only facilitating alignment but also yielding a discriminatively structured embedding. Experiments on five benchmark datasets suggest the state-of-the-art performance of DITS. We empirically find that DITS can also improve the structure of the CLIP embedding space. Code is available at https: //github. com/Jiamian- Wang/DITS-text-video-retrieval

PDF Details DOI

NeurIPS Conference 2024 Conference Paper

Enhancing Motion in Text-to-Video Generation with Decomposed Encoding and Conditioning

Penghui Ruan
Pichao Wang
Divya Saxena
Jiannong Cao
Yuhui Shi

Despite advancements in Text-to-Video (T2V) generation, producing videos with realistic motion remains challenging. Current models often yield static or minimally dynamic outputs, failing to capture complex motions described by text. This issue stems from the internal biases in text encoding which overlooks motions, and inadequate conditioning mechanisms in T2V generation models. To address this, we propose a novel framework called DEcomposed MOtion (DEMO), which enhances motion synthesis in T2V generation by decomposing both text encoding and conditioning into content and motion components. Our method includes a content encoder for static elements and a motion encoder for temporal dynamics, alongside separate content and motion conditioning mechanisms. Crucially, we introduce text-motion and video-motion supervision to improve the model's understanding and generation of motion. Evaluations on benchmarks such as MSR-VTT, UCF-101, WebVid-10M, EvalCrafter, and VBench demonstrate DEMO's superior ability to produce videos with enhanced motion dynamics while maintaining high visual quality. Our approach significantly advances T2V generation by integrating comprehensive motion understanding directly from textual descriptions. Project page: https: //PR-Ryan. github. io/DEMO-project/

PDF Details DOI

NeurIPS Conference 2024 Conference Paper

One Token to Seg Them All: Language Instructed Reasoning Segmentation in Videos

Zechen Bai
Tong He
Haiyang Mei
Pichao Wang
Ziteng Gao
Joya Chen
Lei Liu
Zheng Zhang

We introduce VideoLISA, a video-based multimodal large language model designed to tackle the problem of language-instructed reasoning segmentation in videos. Leveraging the reasoning capabilities and world knowledge of large language models, and augmented by the Segment Anything Model, VideoLISA generates temporally consistent segmentation masks in videos based on language instructions. Existing image-based methods, such as LISA, struggle with video tasks due to the additional temporal dimension, which requires temporal dynamic understanding and consistent segmentation across frames. VideoLISA addresses these challenges by integrating a Sparse Dense Sampling strategy into the video-LLM, which balances temporal context and spatial detail within computational constraints. Additionally, we propose a One-Token-Seg-All approach using a specially designed token, enabling the model to segment and track objects across multiple frames. Extensive evaluations on diverse benchmarks, including our newly introduced ReasonVOS benchmark, demonstrate VideoLISA's superior performance in video object segmentation tasks involving complex reasoning, temporal understanding, and object tracking. While optimized for videos, VideoLISA also shows promising generalization to image segmentation, revealing its potential as a unified foundation model for language-instructed object segmentation. Code and model will be available at: https: //github. com/showlab/VideoLISA.

PDF Details DOI

AAAI Conference 2023 Conference Paper

Frequency Domain Disentanglement for Arbitrary Neural Style Transfer

Dongyang Li
Hao Luo
Pichao Wang
Zhibin Wang
Shang Liu
Fan Wang

Arbitrary neural style transfer has been a popular research topic due to its rich application scenarios. Effective disentanglement of content and style is the critical factor for synthesizing an image with arbitrary style. The existing methods focus on disentangling feature representations of content and style in the spatial domain where the content and style components are innately entangled and difficult to be disentangled clearly. Therefore, these methods always suffer from low-quality results because of the sub-optimal disentanglement. To address such a challenge, this paper proposes the frequency mixer (FreMixer) module that disentangles and re-entangles the frequency spectrum of content and style components in the frequency domain. Since content and style components have different frequency-domain characteristics (frequency bands and frequency patterns), the FreMixer could well disentangle these two components. Based on the FreMixer module, we design a novel Frequency Domain Disentanglement (FDD) framework for arbitrary neural style transfer. Qualitative and quantitative experiments verify that the proposed method can render better stylized results compared to the state-of-the-art methods.

PDF Details DOI

AAAI Conference 2023 Conference Paper

Head-Free Lightweight Semantic Segmentation with Linear Transformer

Bo Dong
Pichao Wang
Fan Wang

Existing semantic segmentation works have been mainly focused on designing effective decoders; however, the computational load introduced by the overall structure has long been ignored, which hinders their applications on resource-constrained hardwares. In this paper, we propose a head-free lightweight architecture specifically for semantic segmentation, named Adaptive Frequency Transformer (AFFormer). AFFormer adopts a parallel architecture to leverage prototype representations as specific learnable local descriptions which replaces the decoder and preserves the rich image semantics on high-resolution features. Although removing the decoder compresses most of the computation, the accuracy of the parallel structure is still limited by low computational resources. Therefore, we employ heterogeneous operators (CNN and vision Transformer) for pixel embedding and prototype representations to further save computational costs. Moreover, it is very difficult to linearize the complexity of the vision Transformer from the perspective of spatial domain. Due to the fact that semantic segmentation is very sensitive to frequency information, we construct a lightweight prototype learning block with adaptive frequency filter of complexity O(n) to replace standard self attention with O(n^2). Extensive experiments on widely adopted datasets demonstrate that AFFormer achieves superior accuracy while retaining only 3M parameters. On the ADE20K dataset, AFFormer achieves 41.8 mIoU and 4.6 GFLOPs, which is 4.4 mIoU higher than Segformer, with 45% less GFLOPs. On the Cityscapes dataset, AFFormer achieves 78.7 mIoU and 34.4 GFLOPs, which is 2.5 mIoU higher than Segformer with 72.5% less GFLOPs. Code is available at https://github.com/dongbo811/AFFormer.

PDF Details DOI

ICLR Conference 2022 Conference Paper

CDTrans: Cross-domain Transformer for Unsupervised Domain Adaptation

Tongkun Xu
Weihua Chen
Pichao Wang
Fan Wang 0019
Hao Li 0030
Rong Jin 0001

Unsupervised domain adaptation (UDA) aims to transfer knowledge learned from a labeled source domain to a different unlabeled target domain. Most existing UDA methods focus on learning domain-invariant feature representation, either from the domain level or category level, using convolution neural networks (CNNs)-based frameworks. One fundamental problem for the category level based UDA is the production of pseudo labels for samples in target domain, which are usually too noisy for accurate domain alignment, inevitably compromising the UDA performance. With the success of Transformer in various tasks, we find that the cross-attention in Transformer is robust to the noisy input pairs for better feature alignment, thus in this paper Transformer is adopted for the challenging UDA task. Specifically, to generate accurate input pairs, we design a two-way center-aware labeling algorithm to produce pseudo labels for target samples. Along with the pseudo labels, a weight-sharing triple-branch transformer framework is proposed to apply self-attention and cross-attention for source/target feature learning and source-target domain alignment, respectively. Such design explicitly enforces the framework to learn discriminative domain-specific and domain-invariant representations simultaneously. The proposed method is dubbed CDTrans (cross-domain transformer), and it provides one of the first attempts to solve UDA tasks with a pure transformer solution. Experiments show that our proposed method achieves the best performance on public UDA datasets, e.g. VisDA-2017 and DomainNet. Code and models are available at https://github.com/CDTrans/CDTrans.

Details

AAAI Conference 2022 Conference Paper

Scaled ReLU Matters for Training Vision Transformers

Pichao Wang
Xue Wang
Hao Luo
Jingkai Zhou
Zhipeng Zhou
Fan Wang
Hao Li
Rong Jin

Vision transformers (ViTs) have been an alternative design paradigm to convolutional neural networks (CNNs). However, the training of ViTs is much harder than CNNs, as it is sensitive to the training parameters, such as learning rate, optimizer and warmup epoch. The reasons for training difficulty are empirically analysed in the paper Early Convolutions Help Transformers See Better, and the authors conjecture that the issue lies with the patchify-stem of ViT models. In this paper, we further investigate this problem and extend the above conclusion: only early convolutions do not help for stable training, but the scaled ReLU operation in the convolutional stem (conv-stem) matters. We verify, both theoretically and empirically, that scaled ReLU in conv-stem not only improves training stabilization, but also increases the diversity of patch tokens, thus boosting peak performance with a large margin via adding few parameters and flops. In addition, extensive experiments are conducted to demonstrate that previous ViTs are far from being well trained, further showing that ViTs have great potential to be a better substitute of CNNs.

PDF Details

NeurIPS Conference 2022 Conference Paper

VTC-LFC: Vision Transformer Compression with Low-Frequency Components

Zhenyu Wang
Hao Luo
Pichao Wang
Feng Ding
Fan Wang
Hao Li

Although Vision transformers (ViTs) have recently dominated many vision tasks, deploying ViT models on resource-limited devices remains a challenging problem. To address such a challenge, several methods have been proposed to compress ViTs. Most of them borrow experience in convolutional neural networks (CNNs) and mainly focus on the spatial domain. However, the compression only in the spatial domain suffers from a dramatic performance drop without fine-tuning and is not robust to noise, as the noise in the spatial domain can easily confuse the pruning criteria, leading to some parameters/channels being pruned incorrectly. Inspired by recent findings that self-attention is a low-pass filter and low-frequency signals/components are more informative to ViTs, this paper proposes compressing ViTs with low-frequency components. Two metrics named low-frequency sensitivity (LFS) and low-frequency energy (LFE) are proposed for better channel pruning and token pruning. Additionally, a bottom-up cascade pruning scheme is applied to compress different dimensions jointly. Extensive experiments demonstrate that the proposed method could save 40% ～ 60% of the FLOPs in ViTs, thus significantly increasing the throughput on practical devices with less than 1% performance drop on ImageNet-1K.

PDF Details

AAAI Conference 2020 Conference Paper

R²MRF: Defocus Blur Detection via Recurrently Refining Multi-Scale Residual Features

Chang Tang
Xinwang Liu
Xinzhong Zhu
En Zhu
Kun Sun
Pichao Wang
Lizhe Wang
Albert Zomaya

Defocus blur detection aims to separate the in-focus and out-of-focus regions in an image. Although attracting more and more attention due to its remarkable potential applications, there are still several challenges for accurate defocus blur detection, such as the interference of background clutter, sensitivity to scales and missing boundary details of defocus blur regions. In order to address these issues, we propose a deep neural network which Recurrently Refines Multi-scale Residual Features (R2MRF) for defocus blur detection. We firstly extract multi-scale deep features by utilizing a fully convolutional network. For each layer, we design a novel recurrent residual refinement branch embedded with multiple residual refinement modules (RRMs) to more accurately detect blur regions from the input image. Considering that the features from bottom layers are able to capture rich low-level features for details preservation while the features from top layers are capable of characterizing the semantic information for locating blur regions, we aggregate the deep features from different layers to learn the residual between the intermediate prediction and the ground truth for each recurrent step in each residual refinement branch. Since the defocus degree is sensitive to image scales, we finally fuse the side output of each branch to obtain the final blur detection map. We evaluate the proposed network on two commonly used defocus blur detection benchmark datasets by comparing it with other 11 state-of-the-art methods. Extensive experimental results with ablation studies demonstrate that R2MRF consistently and significantly outperforms the competitors in terms of both efficiency and accuracy.

PDF Details

AAAI Conference 2018 Conference Paper

Cooperative Training of Deep Aggregation Networks for RGB-D Action Recognition

Pichao Wang
Wanqing Li
Jun Wan
Philip Ogunbona
Xinwang Liu

A novel deep neural network training paradigm that exploits the conjoint information in multiple heterogeneous sources is proposed. Speciﬁcally, in a RGB-D based action recognition task, it cooperatively trains a single convolutional neural network (named c-ConvNet) on both RGB visual features and depth features, and deeply aggregates the two kinds of features for action recognition. Differently from the conventional ConvNet that learns the deep separable features for homogeneous modality-based classiﬁcation with only one softmax loss function, the c-ConvNet enhances the discriminative power of the deeply learned features and weakens the undesired modality discrepancy by jointly optimizing a ranking loss and a softmax loss for both homogeneous and heterogeneous modalities. The ranking loss consists of intra-modality and cross-modality triplet losses, and it reduces both the intra-modality and crossmodality feature variations. Furthermore, the correlations between RGB and depth data are embedded in the c-ConvNet, and can be retrieved by either of the modalities and contribute to the recognition in the case even only one of the modalities is available. The proposed method was extensively evaluated on two large RGB-D action recognition datasets, ChaLearn LAP IsoGD and NTU RGB+D datasets, and one small dataset, SYSU 3D HOI, and achieved state-of-the-art results.

PDF Details