Arrow Research search

Author name cluster

Feng Zhao

Possible papers associated with this exact author name in Arrow. This page groups case-insensitive exact name matches and is not a full identity disambiguation profile.

40 papers
2 author rows

Possible papers (40)

AAAI Conference 2026 Conference Paper

Frequency-Aware Vision-Language Multimodality Generalization Network for Remote Sensing Image Classification

  • Junjie Zhang
  • Feng Zhao
  • Hanqiang Liu
  • Jun Yu

The booming remote sensing (RS) technology is giving rise to a novel multimodality generalization task, which requires the model to overcome data heterogeneity while possessing powerful cross-scene generalization ability. Moreover, most vision-language models usually describe surface materials using universal texts, lacking proprietary linguistic prior knowledge specific to different RS modalities. In this work, we formalize RS multimodality generalization (RSMG) as a learning paradigm, and propose a frequency-aware vision-language multimodality generalization network (FVMGN) for RS image classification. Specifically, a diffusion-based training-test-time augmentation (DTAug) strategy is designed to reconstruct multimodal land-cover distributions, enriching input information for FVMGN. Following that, to overcome multimodal heterogeneity, a multimodal wavelet disentanglement (MWDis) module is developed to learn cross-domain invariant features by resampling low- and high-frequency components in the frequency domain. Considering the characteristics of RS vision modalities, shared and proprietary class texts are designed as linguistic inputs for the transformer-based text encoder to extract diverse text features. For multimodal vision inputs, a spatial-frequency-aware image encoder (SFIE) is constructed to realize local-global feature reconstruction and representation. Finally, a multiscale spatial-frequency feature alignment (MSFFA) module is suggested to construct a unified semantic space, ensuring refined multiscale alignment of different text and vision features in the spatial and frequency domains. Extensive experiments show that FVMGN exhibits excellent multimodality generalization ability compared with state-of-the-art methods.

TMLR Journal 2026 Journal Article

InfoScale: Unleashing Training-free Variable-scaled Image Generation via Effective Utilization of Information

  • Guohui Zhang
  • Jiangtong Tan
  • Linjiang Huang
  • Zhonghang Yuan
  • Mingde Yao
  • Jie Huang
  • Feng Zhao

Diffusion models (DMs) have become dominant in visual generation but suffer a performance drop when tested on resolutions that differ from the training scale, whether lower or higher. Current training-free methods for DMs have shown promising results, primarily focusing on higher-resolution generation. However, most methods lack a unified analytical perspective for variable-scale generation, leading to suboptimal results. In fact, the key challenge in generating variable-scale images lies in the differing amounts of information across resolutions, which requires the information conversion procedure to vary with the target scale. In this paper, we investigate the issues of three critical aspects in DMs for a unified analysis in variable-scaled generation: dilated convolution, attention mechanisms, and initial noise. Specifically, 1) dilated convolution in DMs for higher-resolution generation loses high-frequency information. 2) Attention for variable-scaled image generation struggles to adjust information aggregation adaptively. 3) The spatial distribution of information in the initial noise is misaligned with the variable-scaled image. To solve the above problems, we propose InfoScale, an information-centric framework for variable-scaled image generation that effectively utilizes information from these three aspects correspondingly. For the information loss in 1), we introduce a Progressive Frequency Compensation module to compensate for high-frequency information lost by dilated convolution in higher-resolution generation. For the information aggregation inflexibility in 2), we introduce an Adaptive Information Aggregation module to adaptively aggregate information in lower-resolution generation and achieve an effective balance between local and global information in higher-resolution generation. For the information distribution misalignment in 3), we design a Noise Adaptation module to re-distribute information in the initial noise for variable-scaled generation. Our method is plug-and-play, and extensive experiments demonstrate its effectiveness in variable-scaled image generation.

AAAI Conference 2026 Conference Paper

MACoT: Synthesizing Chains of Thought for Small Models via Multi-Agent Collaboration

  • Guokai Tang
  • Feng Zhao

Small language models (SLMs) run quickly, consume little memory, and can be deployed on edge devices, making them especially appealing when compute or energy is limited. Because of these advantages, boosting SLMs' reasoning ability has become an important research goal. A common approach is to distill the long chains of thought (long-CoTs) produced by large reasoning models (LRMs) into SLMs, hoping to transfer the larger models’ strong reasoning ability. However, SLMs do not always benefit from distillation of long-CoTs. The lengthy and complex semantic steps and large amount of self-reflection content in long-CoTs may exceed the limited learning capabilities of SLMs, and the impact of self-reflection density on the performance of SLMs is unclear. To resolve this capacity mismatch, we propose MACoT, a multi-agent framework that synthesizes chains of thought (CoTs) that are more suitable for small models rather than compressing or pruning existing ones. Through the interactive collaboration among six types of agents, MACoT synthesizes semantically explicit, logically clear CoTs that efficiently activate a small model’s internal knowledge through a carefully designed output pattern. At the same time, the CoTs synthesized by our method can retain a small amount of self-reflection content, thereby matching the learning capability of the small model and maximizing its reasoning accuracy. We fine-tuned Qwen2.5-7B-Instruct using only 1879 synthetic CoTs, significantly improving its performance on mathematical reasoning tasks and generalizing well, outperforming models trained on 5x more data. Through experiments, we found that a modest level of self-reflection boosts small-model performance, whereas excessive reflection sharply degrades it, which shows that “teaching SLMs to think” hinges on aligning each CoT’s cognitive load with the model’s capacity.

AAAI Conference 2026 Conference Paper

Token Painter: Training-Free Text-Guided Image Inpainting via Mask Autoregressive Models

  • Longtao Jiang
  • Jie Huang
  • Mingfei Han
  • Lei Chen
  • Yongqiang Yu
  • Feng Zhao
  • Xiaojun Chang
  • Zhihui Li

Text-guided image inpainting aims to inpaint masked image regions based on a textual prompt while preserving the background. Although diffusion-based methods have become dominant, their property of modeling the entire image in latent space makes it challenging for the results to align well with prompt details and maintain a consistent background. To address these issues, we explore Mask AutoRegressive (MAR) models for this task. MAR naturally supports image inpainting by generating latent tokens corresponding to mask regions, enabling better local controllability without altering the background. However, directly applying MAR to this task makes the inpainting content either ignore the prompts or be disharmonious with the background context. Through analysis of the attention maps from the inpainting images, we identify the impact of background tokens on text tokens during the MAR generation, and leverage this to design Token Painter, a training-free text-guided image inpainting method based on MAR. Our approach introduces two key components: (1) Dual-Stream Encoder Information Fusion (DEIF), which fuses the semantic and context information from text and background in the frequency domain to produce novel guidance tokens, allowing MAR to generate text-faithful inpainting content while remaining harmonious with the background context. (2) Adaptive Decoder Attention Score Enhancing (ADAE), which adaptively enhances attention scores on guidance tokens and inpainting tokens to further improve the alignment with prompt details and the visual quality of the content. Extensive experiments demonstrate that our training-free method outperforms prior state-of-the-art methods across almost all metrics.

AAAI Conference 2026 Conference Paper

Towards Efficient and Robust Manipulation via Multi-Frame Vision-Language-Action Modeling

  • Hao Li
  • Shuai Yang
  • Yilun Chen
  • Xinyi Chen
  • Xiaoda Yang
  • Yang Tian
  • Hanqing Wang
  • Tai WANG

Recent vision-language-action (VLA) models built on pretrained vision-language models (VLMs) have demonstrated strong performance in robotic manipulation. However, these models remain constrained by the single-frame image paradigm and fail to fully leverage the temporal information offered by multi-frame histories, as directly feeding multiple frames into VLM backbones incurs substantial computational overhead and inference latency. We propose CronusVLA, a unified framework that extends single-frame VLA models to the multi-frame paradigm. CronusVLA follows a two-stage process: (1) Single-frame pretraining on large-scale embodied datasets with autoregressive prediction of action tokens, establishing an effective embodied vision-language foundation; (2) Multi-frame post-training, which adapts the prediction of the vision-language backbone from discrete tokens to learnable features, and aggregates historical information via feature chunking. CronusVLA effectively addresses the existing challenges of multi-frame modeling while enhancing performance. To evaluate the robustness under temporal and spatial disturbances, we introduce SimplerEnv-OR, a novel benchmark featuring 24 types of observational disturbances and 120 severity levels. Experiments across three embodiments in simulated and real-world environments demonstrate that CronusVLA achieves leading performance and superior robustness, with a 70.9% success rate on SimplerEnv, a 26.8% improvement over OpenVLA on LIBERO, and the highest robustness score on SimplerEnv-OR, showing the promise of efficient multi-frame adaptation for real-world VLA deployment.

JBHI Journal 2026 Journal Article

XAI Driven Intelligent IoMT Secure Data Management Framework

  • Wei Liu
  • Feng Zhao
  • Lewis Nkenyereye
  • Shalli Rani
  • Keqin Li
  • Jianhui Lv

The Internet of Medical Things (IoMT) has transformed traditional healthcare systems by enabling real-time monitoring, remote diagnostics, and data-driven treatment. However, security and privacy remain significant concerns for IoMT adoption due to the sensitive nature of medical data. Therefore, we propose an integrated framework leveraging blockchain and explainable artificial intelligence (XAI) to enable secure, intelligent, and transparent management of IoMT data. First, the traceability and tamper-resistance of blockchain are used to realize secure IoMT data transactions, which are formulated as a two-stage Stackelberg game. A dual-chain architecture is used to ensure the security and privacy protection of transactions: the main chain manages regular IoMT data transactions, while the side chain handles data trading activities aimed at resale. Simultaneously, perceptual hashing is used to realize data rights confirmation, maximally protecting the rights and interests of each participant in the transaction. Subsequently, medical time-series data is modeled using bidirectional simple recurrent units to detect anomalies and cyberthreats accurately while overcoming vanishing gradients. Lastly, an adversarial sample generation method based on local interpretable model-agnostic explanations is provided to evaluate, secure, and improve the anomaly detection model, as well as to make it more explainable and resilient to possible adversarial attacks. Simulation results are provided to illustrate the high performance of the integrated secure data management framework leveraging blockchain and XAI, compared with the benchmarks.
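The abstract mentions perceptual hashing for data-rights confirmation without naming a specific scheme. As a rough illustration only, the average-hash sketch below (a standard perceptual hash, not necessarily the one used in the paper; the function names are ours) shows how near-duplicate images could be matched for ownership checks.

```python
import numpy as np
from PIL import Image

def average_hash(path: str, hash_size: int = 8) -> int:
    """Average hash (aHash): a simple, standard perceptual hash. Illustrative
    only; the paper does not state which perceptual hash scheme it adopts."""
    img = Image.open(path).convert("L").resize((hash_size, hash_size))
    pixels = np.asarray(img, dtype=np.float32)
    bits = pixels > pixels.mean()            # 1 where brighter than average
    return int("".join("1" if b else "0" for b in bits.flatten()), 2)

def hamming_distance(h1: int, h2: int) -> int:
    """Small distance => the two images are perceptual near-duplicates."""
    return bin(h1 ^ h2).count("1")
```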

EAAI Journal 2025 Journal Article

Competitive dual-students using bi-level contrastive learning for semi-supervised medical image segmentation

  • Gang Hu
  • Feng Zhao
  • Essam H. Houssein

Semi-supervised image segmentation aims to train the neural network with a small number of labeled images and a large number of unlabeled images, which helps to alleviate the burden of having less manually labeled medical data. However, the Mean-Teacher (MT) model, a benchmark method for semi-supervised medical segmentation, leads to a performance bottleneck as its student model eventually converges to the teacher model. In addition, existing segmentation methods treat all pixels equally and underestimate the importance of indistinguishable and underrepresented pixels, failing to mine the potential information in these regions effectively. To address the above issues, this paper proposes a Competitive Dual-Student (CDS) framework incorporating bi-level contrastive learning. First, an additional competitive dual-student structure is added to the MT model, promoting knowledge sharing and complementarity among networks. Competitive instruction by the teacher, through feature information exchange and positive comparisons, reduces the accumulation of biased knowledge in the model and stimulates the potential for further optimization of the model as a whole. Furthermore, a bi-level contrastive learning scheme is designed. The high-level contrastive learning encourages the competitive dual students to learn high-quality features from each other by constructing reliability constraints. The low-level contrastive learning achieves deep mining and accurate processing of local edge features by introducing class prototypes of high-quality features from the teacher network. Finally, comprehensive experimental results on the left atrium, brain tumor segmentation 2019, and automated cardiac diagnosis challenge datasets indicate that the segmentation performance of the proposed CDS outperforms state-of-the-art methods. Code is released at https://github.com/FengZhao2001/CDS.

JBHI Journal 2025 Journal Article

Explainable AI for Medical Image Analysis in Medical Cyber-Physical Systems: Enhancing Transparency and Trustworthiness of IoMT

  • Wei Liu
  • Feng Zhao
  • Achyut Shankar
  • Carsten Maple
  • James Dinesh Peter
  • Byung-Gyu Kim
  • Adam Slowik
  • Bidare Divakarachari Parameshachari

This study explores the application of explainable artificial intelligence (XAI) in the context of medical image analysis within medical cyber-physical systems (MCPS) to enhance transparency and trustworthiness. Meanwhile, this study proposes an explainable framework that integrates machine learning and knowledge reasoning. The explainability of the model is realized when the framework's evolved target feature results and reasoning results agree and are relatively reliable. However, using these technologies also presents new challenges, including the need to ensure the security and privacy of patient data from the Internet of Medical Things (IoMT). Therefore, attack detection is an essential aspect of MCPS security. For the MCPS model with only sensor attacks, the necessary and sufficient conditions for detecting attacks are given based on the definition of sparse observability. The corresponding attack detector and state estimator are designed by assuming that some IoMT sensors are under protection. It is shown that the IoMT sensors under protection play an important role in improving the efficiency of attack detection and state estimation. The experimental results show that XAI in the context of medical image analysis within MCPS improves the accuracy of lesion classification, effectively removes low-quality medical images, and realizes the explainability of recognition results. This helps doctors understand the logic of the system's decision-making and choose whether to trust the results based on the explanation given by the framework.

NeurIPS Conference 2025 Conference Paper

Towards Better & Faster Autoregressive Image Generation: From the Perspective of Entropy

  • Xiaoxiao Ma
  • Feng Zhao
  • Pengyang Ling
  • Haibo Qiu
  • Zhixiang Wei
  • Hu Yu
  • Jie Huang
  • Zhixiong Zeng

In this work, we first revisit the sampling issues in current autoregressive (AR) image generation models and identify that image tokens, unlike text tokens, exhibit lower information density and non-uniform spatial distribution. Accordingly, we present an entropy-informed decoding strategy that facilitates higher autoregressive generation quality with faster synthesis speed. Specifically, the proposed method introduces two main innovations: 1) dynamic temperature control guided by spatial entropy of token distributions, enhancing the balance between content diversity, alignment accuracy, and structural coherence in both mask-based and scale-wise models, without extra computational overhead, and 2) entropy-aware acceptance rules in speculative decoding, achieving near-lossless generation at about 85% of the inference cost of conventional acceleration methods. Extensive experiments across multiple benchmarks using diverse AR image generation models demonstrate the effectiveness and generalizability of our approach in enhancing both generation quality and sampling speed.
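The abstract describes dynamic temperature control guided by the spatial entropy of token distributions but gives no formula. The sketch below is only a generic illustration of entropy-scaled sampling; the linear mapping, its bounds, and the function names are our assumptions, not the paper's actual rule.

```python
import torch
import torch.nn.functional as F

def entropy_scaled_temperature(logits: torch.Tensor,
                               t_min: float = 0.8,
                               t_max: float = 1.2) -> torch.Tensor:
    """Illustrative only: map the normalized entropy of each token distribution
    to a per-position sampling temperature in [t_min, t_max]."""
    probs = F.softmax(logits, dim=-1)                              # (..., vocab)
    entropy = -(probs * probs.clamp_min(1e-12).log()).sum(dim=-1)  # nats
    max_entropy = torch.log(torch.tensor(float(logits.shape[-1])))
    return t_min + (t_max - t_min) * (entropy / max_entropy)       # high entropy -> hotter

def sample_with_dynamic_temperature(logits: torch.Tensor) -> torch.Tensor:
    """Sample one token per position using the entropy-derived temperature."""
    temp = entropy_scaled_temperature(logits).unsqueeze(-1)
    probs = F.softmax(logits / temp, dim=-1)
    flat = probs.reshape(-1, probs.shape[-1])
    return torch.multinomial(flat, 1).reshape(probs.shape[:-1])
```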

AAAI Conference 2025 Conference Paper

VFM-Adapter: Adapting Visual Foundation Models for Dense Prediction with Dynamic Hybrid Operation Mapping

  • Zheng Chen
  • Yu Zeng
  • Zehui Chen
  • Hongzhi Gao
  • Lin Chen
  • Jiaming Liu
  • Feng Zhao

Although pre-trained large vision foundation models (VFMs) yield superior results on various downstream tasks, full fine-tuning is often impractical due to its high computational cost and storage requirements. Recent advancements in parameter-efficient fine-tuning (PEFT) of VFMs for image classification show significant promise. However, the application of PEFT techniques to dense prediction tasks remains largely unexplored. Our analysis of existing methods reveals that the underlying premise of utilizing low-rank parameter matrices, despite their efficacy in specific applications, may not be adequately suitable for dense prediction tasks. To this end, we propose a novel PEFT learning approach tailored for dense prediction tasks, namely VFM-Adapter. Specifically, the VFM-Adapter introduces a hybrid operation mapping technique that seamlessly integrates local information and global modeling into the adapter module, capitalizing on the distinct inductive biases inherent in different operations. Additionally, we dynamically generate parameters for the VFM-Adapter, enabling flexible feature extraction for specific inputs. To validate the efficacy of VFM-Adapter, we conduct extensive experiments across object detection, semantic segmentation, and instance segmentation tasks. Results on multiple benchmarks consistently demonstrate the superiority of our method over previous approaches. Notably, with only three percent of the trainable parameters of the SAM-Base backbone, our approach achieves competitive or even superior performance compared to full fine-tuning. The code will be available.

NeurIPS Conference 2025 Conference Paper

VideoMAR: Autoregressive Video Generation with Continuous Tokens

  • Hu Yu
  • Biao Gong
  • Hangjie Yuan
  • DanDan Zheng
  • Weilong Chai
  • Jingdong Chen
  • Kecheng Zheng
  • Feng Zhao

Mask-based autoregressive models have demonstrated promising image generation capability in continuous space; however, their potential for video generation remains under-explored. In this paper, we propose VideoMAR, a concise and efficient decoder-only autoregressive image-to-video model with continuous tokens, combining temporal frame-by-frame generation with spatial masked generation. We first identify temporal causality and spatial bi-directionality as the first principle of video AR models, and propose the next-frame diffusion loss for the integration of mask and video generation. Besides, the huge cost and difficulty of long-sequence autoregressive modeling remain a basic but crucial issue. To this end, we propose temporal short-to-long curriculum learning and spatial progressive resolution training, and employ a progressive temperature strategy at inference time to mitigate accumulation error. Furthermore, VideoMAR brings several unique capacities of language models to video generation. It inherently bears high efficiency due to simultaneous temporal-wise KV cache and spatial-wise parallel generation, and presents the capacity of spatial and temporal extrapolation via 3D rotary embeddings. On the VBench-I2V benchmark, VideoMAR surpasses the previous state-of-the-art (Cosmos I2V) while requiring significantly fewer parameters (9.3%), training data (0.5%), and GPU resources (0.2%).

NeurIPS Conference 2025 Conference Paper

VRAG-RL: Empower Vision-Perception-Based RAG for Visually Rich Information Understanding via Iterative Reasoning with Reinforcement Learning

  • Qiuchen Wang
  • Ruixue Ding
  • Yu Zeng
  • Zehui Chen
  • Lin Chen
  • Shihang Wang
  • Pengjun Xie
  • Fei Huang

Effectively retrieving, reasoning and understanding visually rich information remains a challenge for traditional Retrieval-Augmented Generation (RAG) methods. On the one hand, traditional text-based methods cannot handle visual-related information. On the other hand, current vision-based RAG approaches are often limited by fixed pipelines and frequently struggle to reason effectively due to the insufficient activation of the fundamental capabilities of models. As reinforcement learning (RL) has been proven to be beneficial for model reasoning, we introduce VRAG-RL, a novel RL framework tailored for complex reasoning across visually rich information. With this framework, VLMs interact with search engines, autonomously sampling single-turn or multi-turn reasoning trajectories with the help of visual perception tokens and undergoing continual optimization based on these samples. Our approach highlights key limitations of RL in RAG domains: (i) Prior multi-modal RAG approaches tend to merely incorporate images into the context, leading to insufficient reasoning token allocation and neglecting visual-specific perception; and (ii) When models interact with search engines, their queries often fail to retrieve relevant information due to the inability to articulate requirements, thereby leading to suboptimal performance. To address these challenges, we define an action space tailored for visually rich inputs, with actions including cropping and scaling, allowing the model to gather information from a coarse-to-fine perspective. Furthermore, to bridge the gap between users' original inquiries and the retriever, we employ a simple yet effective reward that integrates query rewriting and retrieval performance with a model-based reward. Our VRAG-RL optimizes VLMs for RAG tasks using specially designed RL strategies, aligning the model with real-world applications. Extensive experiments on diverse and challenging benchmarks show that our VRAG-RL outperforms existing methods by 20% (Qwen2.5-VL-7B) and 30% (Qwen2.5-VL-3B), demonstrating the effectiveness of our approach. The code is available at https://github.com/Alibaba-NLP/VRAG.

NeurIPS Conference 2024 Conference Paper

Are We on the Right Way for Evaluating Large Vision-Language Models?

  • Lin Chen
  • Jinsong Li
  • Xiaoyi Dong
  • Pan Zhang
  • Yuhang Zang
  • Zehui Chen
  • Haodong Duan
  • Jiaqi Wang

Large vision-language models (LVLMs) have recently achieved rapid progress, sparking numerous studies to evaluate their multi-modal capabilities. However, we dig into current evaluation works and identify two primary issues: 1) Visual content is unnecessary for many samples. The answers can be directly inferred from the questions and options, or from the world knowledge embedded in LLMs. This phenomenon is prevalent across current benchmarks. For instance, GeminiPro achieves 42.7% on the MMMU benchmark without any visual input, and outperforms the random choice baseline across six benchmarks by nearly 24% on average. 2) Unintentional data leakage exists in LLM and LVLM training. LLMs and LVLMs can still answer some visual-necessary questions without visual content, indicating the memorization of these samples within large-scale training data. For example, Sphinx-X-MoE gets 43.6% on MMMU without accessing images, surpassing its LLM backbone by 17.9%. Both problems lead to misjudgments of actual multi-modal gains and potentially misguide the study of LVLMs. To this end, we present MMStar, an elite vision-indispensable multi-modal benchmark comprising 1,500 samples meticulously selected by humans. MMStar benchmarks 6 core capabilities and 18 detailed axes, aiming to evaluate LVLMs' multi-modal capacities with carefully balanced and purified samples. These samples are first roughly selected from current benchmarks with an automated pipeline; human review is then involved to ensure that each curated sample exhibits visual dependency and minimal data leakage, and requires advanced multi-modal capabilities. Moreover, two metrics are developed to measure data leakage and actual performance gain in multi-modal training. We evaluate 16 leading LVLMs on MMStar to assess their multi-modal capabilities, and on 7 benchmarks with the proposed metrics to investigate their data leakage and actual multi-modal gain.

EAAI Journal 2024 Journal Article

Data and knowledge-driven dual surrogate-assisted multi-objective rough fuzzy clustering algorithm for image segmentation

  • Feng Zhao
  • Caini Lu
  • Hanqiang Liu

Most multi-objective clustering algorithms (MOCAs) do not fully utilize the spatial and edge information of an image in image segmentation. Moreover, objective evaluations are generally expensive for MOCAs, because the computation cost is related to the number of image pixels. Introducing approximate predictions from a surrogate model to replace extensive objective evaluations can improve the segmentation efficiency of MOCAs. However, accurately fitting objective functions using only a single surrogate is challenging. To resolve the above-mentioned issues, a data and knowledge-driven dual surrogate-assisted multi-objective rough fuzzy clustering algorithm (DK-DSMRFC) is proposed. First, an edge information-guided local neighborhood weighted filtering strategy is designed to obtain spatial information with rich image details. Second, three complementary clustering objective functions are constructed to recognize complex clustering structures, which focus on rough fuzzy intra-class compactness with multi-level image information, dual centroids-based inter-class separation, and neighborhood consistency, respectively. To efficiently optimize these objective functions, we construct a data and knowledge-driven dual surrogate-assisted evolutionary framework, in which a radial basis function is used as the principal surrogate model to predict objective functions, and a Kriging model is adopted as an assistant surrogate to provide uncertainty information about the predictions. Furthermore, a knowledge-induced multi-perspective infill sampling criterion is designed to promote exploration and exploitation. Finally, a rough fuzzy clustering validity index with spatial constraints and neighborhood consistency is constructed to select the optimal individual. The performance of the evolutionary framework is verified on benchmark functions. Experiments on images from four datasets confirm the effectiveness and robustness of DK-DSMRFC. Keywords: image segmentation, rough fuzzy clustering, surrogate-assisted multi-objective optimization, data and knowledge-driven optimization.

EAAI Journal 2024 Journal Article

Ensemble CART surrogate-assisted automatic multi-objective rough fuzzy clustering algorithm for unsupervised image segmentation

  • Feng Zhao
  • Zihan Tang
  • Zhilei Xiao
  • Hanqiang Liu
  • Jiulun Fan
  • Lu Li

Multi-objective clustering algorithms (MOCAs) are popular in unsupervised image segmentation due to their merit of meeting multiple segmentation requirements and the prospect of automatically estimating the number of clusters. However, most of them suffer from high time costs and are easily influenced by uncertainty when handling real complex images. To address these issues, we propose an ensemble classification and regression tree (CART) surrogate-assisted automatic multi-objective rough fuzzy clustering (ECS-AMRFC) algorithm for unsupervised image segmentation. Firstly, a cluster medoid-based encoding scheme is employed to represent solutions with different numbers of clusters while shortening the encoding length. Then, we design an ensemble CART as the surrogate model to significantly reduce the computational burden. Moreover, a surrogate model management strategy is proposed to accelerate the optimization and enhance the quality of surrogate modeling. To handle uncertainty in the data, we extend rough fuzzy clustering to MOCAs and construct three complementary objective functions to seek proper cluster medoids from multiple perspectives. In addition, the Gaussian kernel is introduced into the objective functions to handle image pixels that cannot be separated linearly in the feature space. Finally, a kernelized rough fuzzy clustering validity index is defined to automatically select the optimal solution without requiring any prior knowledge. Experiments show that ECS-AMRFC not only identifies the appropriate number of clusters on different kinds of images, but also obtains better segmentation results than state-of-the-art rough fuzzy clustering algorithms and automatic MOCAs.

NeurIPS Conference 2024 Conference Paper

GaussianCube: A Structured and Explicit Radiance Representation for 3D Generative Modeling

  • Bowen Zhang
  • Yiji Cheng
  • Jiaolong Yang
  • Chunyu Wang
  • Feng Zhao
  • Yansong Tang
  • Dong Chen
  • Baining Guo

We introduce a radiance representation that is both structured and fully explicit and thus greatly facilitates 3D generative modeling. Existing radiance representations either require an implicit feature decoder, which significantly degrades the modeling power of the representation, or are spatially unstructured, making them difficult to integrate with mainstream 3D diffusion methods. We derive GaussianCube by first using a novel densification-constrained Gaussian fitting algorithm, which yields high-accuracy fitting using a fixed number of free Gaussians, and then rearranging these Gaussians into a predefined voxel grid via Optimal Transport. Since GaussianCube is a structured grid representation, it allows us to use a standard 3D U-Net as our backbone in diffusion modeling without elaborate designs. More importantly, the high-accuracy fitting of the Gaussians allows us to achieve a high-quality representation with one to two orders of magnitude fewer parameters than previous structured representations of comparable quality. The compactness of GaussianCube greatly eases the difficulty of 3D generative modeling. Extensive experiments conducted on unconditional and class-conditioned object generation, digital avatar creation, and text-to-3D synthesis all show that our model achieves state-of-the-art generation results both qualitatively and quantitatively, underscoring the potential of GaussianCube as a highly accurate and versatile radiance representation for 3D generative modeling.
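The abstract says the fitted Gaussians are rearranged into a voxel grid via Optimal Transport but gives no algorithmic detail. As a simplified stand-in only (not the paper's solver), the sketch below assigns Gaussian centers to grid cells with a minimum-cost one-to-one matching; the function name and grid layout are our assumptions.

```python
import numpy as np
from scipy.optimize import linear_sum_assignment

def assign_gaussians_to_grid(centers: np.ndarray, grid_res: int = 8) -> np.ndarray:
    """Illustrative sketch: map N = grid_res**3 Gaussian centers onto the cells
    of a regular voxel grid by minimizing total squared distance (a Hungarian
    matching, used here as a simplified stand-in for Optimal Transport)."""
    # Cell centers of a grid_res^3 voxel grid spanning [-1, 1]^3.
    lin = (np.arange(grid_res) + 0.5) / grid_res * 2.0 - 1.0
    gx, gy, gz = np.meshgrid(lin, lin, lin, indexing="ij")
    cells = np.stack([gx, gy, gz], axis=-1).reshape(-1, 3)          # (N, 3)

    assert centers.shape == cells.shape, "needs exactly grid_res**3 Gaussians"
    cost = ((centers[:, None, :] - cells[None, :, :]) ** 2).sum(-1)  # (N, N)
    _, cell_idx = linear_sum_assignment(cost)
    return cell_idx   # cell_idx[i] is the grid cell assigned to Gaussian i
```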

AAAI Conference 2024 Conference Paper

Graph Reasoning Transformers for Knowledge-Aware Question Answering

  • Ruilin Zhao
  • Feng Zhao
  • Liang Hu
  • Guandong Xu

Augmenting Language Models (LMs) with structured knowledge graphs (KGs) aims to leverage structured world knowledge to enhance the capability of LMs to complete knowledge-intensive tasks. However, existing methods are unable to effectively utilize the structured knowledge in a KG due to their inability to capture the rich relational semantics of knowledge triplets. Moreover, the modality gap between natural language text and KGs has become a challenging obstacle when aligning and fusing cross-modal information. To address these challenges, we propose a novel knowledge-augmented question answering (QA) model, namely, Graph Reasoning Transformers (GRT). Different from conventional node-level methods, the GRT treats knowledge triplets as atomic knowledge units and utilizes a triplet-level graph encoder to capture triplet-level graph features. Furthermore, to alleviate the negative effect of the modality gap on joint reasoning, we propose representation alignment pretraining to align the cross-modal representations and introduce a cross-modal information fusion module with attention bias to enable fine-grained information fusion. Extensive experiments conducted on three knowledge-intensive QA benchmarks show that the GRT outperforms state-of-the-art KG-augmented QA systems, demonstrating the effectiveness and adaptability of our proposed model.

IJCAI Conference 2024 Conference Paper

KG-CoT: Chain-of-Thought Prompting of Large Language Models over Knowledge Graphs for Knowledge-Aware Question Answering

  • Ruilin Zhao
  • Feng Zhao
  • Long Wang
  • Xianzhi Wang
  • Guandong Xu

Large language models (LLMs) encounter challenges such as hallucination and factual errors in knowledge-intensive tasks. On the one hand, LLMs sometimes struggle to generate reliable answers based on their black-box parametric knowledge, due to the lack of responsible knowledge. On the other hand, fragmented knowledge facts extracted by knowledge retrievers fail to provide explicit and coherent reasoning paths for improving LLM reasoning. To address these challenges, we propose KG-CoT, a novel knowledge-augmented paradigm that leverages a small-scale step-by-step graph reasoning model to reason over knowledge graphs (KGs) and utilizes a reasoning path generation method to generate chains of reasoning with high confidence for large-scale LLMs. Extensive experiments demonstrate that our KG-CoT significantly improves the performance of LLMs on knowledge-intensive question answering tasks, such as multi-hop, single-hop, and open-domain question answering benchmarks, without fine-tuning the LLMs. KG-CoT outperforms CoT prompting as well as prior retrieval-augmented and knowledge base question answering baselines. Moreover, KG-CoT reduces the number of API calls and the associated cost, and generalizes to various LLM backbones in a lightweight plug-and-play manner.

AAAI Conference 2024 Conference Paper

Leveraging Imagery Data with Spatial Point Prior for Weakly Semi-supervised 3D Object Detection

  • Hongzhi Gao
  • Zheng Chen
  • Zehui Chen
  • Lin Chen
  • Jiaming Liu
  • Shanghang Zhang
  • Feng Zhao

Training high-accuracy 3D detectors necessitates massive 3D annotations with 7 degrees of freedom, which is laborious and time-consuming. Therefore, point annotations have been proposed as a promising alternative for practical 3D detection: they are not only more accessible and less expensive but also provide strong spatial information for object localization. In this paper, we empirically discover that it is non-trivial to merely adapt Point-DETR to its 3D form, encountering two main bottlenecks: 1) it fails to encode a strong 3D prior into the model, and 2) it generates low-quality pseudo labels in distant regions due to the extreme sparsity of LiDAR points. To overcome these challenges, we introduce Point-DETR3D, a teacher-student framework for weakly semi-supervised 3D detection, designed to fully capitalize on point-wise supervision within a constrained instance-wise annotation budget. Different from Point-DETR, which encodes 3D positional information solely through a point encoder, we propose an explicit positional query initialization strategy to enhance the positional prior. Considering the low quality of pseudo labels at distant regions produced by the teacher model, we enhance the detector's perception by incorporating dense imagery data through a novel Cross-Modal Deformable RoI Fusion (D-RoI). Moreover, an innovative point-guided self-supervised learning technique is proposed to allow for fully exploiting point priors, even in student models. Extensive experiments on the representative nuScenes dataset demonstrate that our Point-DETR3D obtains significant improvements compared to previous works. Notably, with only 5% of labeled data, Point-DETR3D achieves over 90% of the performance of its fully supervised counterpart.

EAAI Journal 2024 Journal Article

Lightweight anchor-free one-level feature indoor personnel detection method based on transformer

  • Feng Zhao
  • Yongheng Li
  • Hanqiang Liu
  • Junjie Zhang
  • Zhenglin Zhu

Owing to the development of deep learning, indoor personnel detection methods based on deep neural networks have been extensively investigated in recent years. However, more complex and deeper network structures may consume more computational resources, which seriously limits the deployment of large-scale deep neural networks on lightweight devices. In view of this, a lightweight anchor-free one-level feature indoor personnel detection method based on transformer (LAOF-IPDT) is proposed in this paper, which is deployed on two embedded devices and achieves good detection accuracy. In the feature extraction backbone, an enhanced cross-stage partial ghost convolution block with ghost convolution and channel shuffle is designed to extract shallow features. Additionally, to obtain more comprehensive global features in the high-level semantic structure, an embedded vision transformer cross-stage partial block is constructed by embedding a lightweight mobile-friendly vision transformer. For the path-aggregation neck, a lightweight feature pyramid network is proposed, which integrates multi-scale feature maps to obtain richer one-level feature representations. Subsequently, a dilated convolution group block is applied to expand the one-level feature receptive field and detection is accomplished using a one-level feature map. For the detection head, an anchor-free mechanism is applied to reduce the hyper-parameter interference of anchor boxes. Extensive experiments on four datasets indicated that the LAOF-IPDT outperforms other lightweight networks in terms of accuracy, speed, model parameters, and model size. For example, the frames per second of the LAOF-IPDT are 8.39 and 14.67 on CPU devices and Jetson Nano devices, respectively, and the mean average precision is 84.49%.

IJCAI Conference 2024 Conference Paper

PPTFormer: Pseudo Multi-Perspective Transformer for UAV Segmentation

  • Deyi Ji
  • Wenwei Jin
  • Hongtao Lu
  • Feng Zhao

The ascension of Unmanned Aerial Vehicles (UAVs) in various fields necessitates effective UAV image segmentation, which faces challenges due to the dynamic perspectives of UAV-captured images. Traditional segmentation algorithms falter as they cannot accurately mimic the complexity of UAV perspectives, and the cost of obtaining multi-perspective labeled datasets is prohibitive. To address these issues, we introduce the PPTFormer, a novel Pseudo Multi-Perspective Transformer network that revolutionizes UAV image segmentation. Our approach circumvents the need for actual multi-perspective data by creating pseudo perspectives for enhanced multi-perspective learning. The PPTFormer network boasts Perspective Decomposition, novel Perspective Prototypes, and a specialized encoder and decoder that together achieve superior segmentation results through Pseudo Multi-Perspective Attention (PMP Attention) and fusion. Our experiments demonstrate that PPTFormer achieves state-of-the-art performance across five UAV segmentation datasets, confirming its capability to effectively simulate UAV flight perspectives and significantly advance segmentation precision. This work presents a pioneering leap in UAV scene understanding and sets a new benchmark for future developments in semantic segmentation.

NeurIPS Conference 2024 Conference Paper

ShareGPT4Video: Improving Video Understanding and Generation with Better Captions

  • Lin Chen
  • Xilin Wei
  • Jinsong Li
  • Xiaoyi Dong
  • Pan Zhang
  • Yuhang Zang
  • Zehui Chen
  • Haodong Duan

We present the ShareGPT4Video series, aiming to facilitate the video understanding of large video-language models (LVLMs) and the video generation of text-to-video models (T2VMs) via dense and precise captions. The series comprises: 1) ShareGPT4Video, 40K GPT4V-annotated dense captions of videos with various lengths and sources, developed through a carefully designed data filtering and annotating strategy. 2) ShareCaptioner-Video, an efficient and capable captioning model for arbitrary videos, with 4.8M high-quality aesthetic videos annotated by it. 3) ShareGPT4Video-8B, a simple yet superb LVLM that reaches SOTA performance on three advanced video benchmarks. To achieve this, setting aside non-scalable and costly human annotation, we find that using GPT4V to caption video with a naive multi-frame or frame-concatenation input strategy leads to less detailed and sometimes temporally confused results. We argue that the challenge of designing a high-quality video captioning strategy lies in three aspects: 1) Inter-frame precise temporal change understanding. 2) Intra-frame detailed content description. 3) Frame-number scalability for arbitrary-length videos. To this end, we meticulously designed a differential video captioning strategy, which is stable, scalable, and efficient for generating captions for videos with arbitrary resolution, aspect ratio, and length. Based on it, we construct ShareGPT4Video, which contains 40K high-quality videos spanning a wide range of categories, and the resulting captions encompass rich world knowledge, object attributes, camera movements, and, crucially, detailed and precise temporal descriptions of events. Based on ShareGPT4Video, we further develop ShareCaptioner-Video, a superior captioner capable of efficiently generating high-quality captions for arbitrary videos. We annotated 4.8M aesthetically appealing videos with it and verified their effectiveness on a 10-second text-to-video generation task. For video understanding, we verified the effectiveness of ShareGPT4Video on several current LVLM architectures and present our superb new LVLM, ShareGPT4Video-8B. All the models, strategies, and annotations will be open-sourced, and we hope this project can serve as a pivotal resource for advancing both the LVLM and T2VM communities.

NeurIPS Conference 2023 Conference Paper

Deep Fractional Fourier Transform

  • Hu Yu
  • Jie Huang
  • Lingzhi Li
  • Man Zhou
  • Feng Zhao

Existing deep learning-based computer vision methods usually operate in the spatial and frequency domains, which are two orthogonal, individual perspectives for image processing. In this paper, we introduce a new spatial-frequency analysis tool, the Fractional Fourier Transform (FRFT), to provide comprehensive, unified spatial-frequency perspectives. The FRFT is a unified continuous spatial-frequency transform that simultaneously reflects an image's spatial and frequency representations, making it optimal for processing non-stationary image signals. We explore the properties of the FRFT for image processing and present a fast implementation of the 2D FRFT, which facilitates its widespread use. Based on these explorations, we introduce a simple yet effective operator, Multi-order FRactional Fourier Convolution (MFRFC), which exhibits the remarkable merits of processing images from more perspectives in the spatial-frequency plane. Our proposed MFRFC is a general and basic operator that can be easily integrated into various tasks for performance improvement. We experimentally evaluate the MFRFC on various computer vision tasks, including object detection, image classification, guided super-resolution, denoising, dehazing, deraining, and low-light enhancement. Our proposed MFRFC consistently outperforms baseline methods by significant margins across all tasks.
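The abstract refers to a fast 2D FRFT implementation without detail. As a small, slow reference only, one textbook definition of the discrete FRFT raises the unitary DFT matrix to a fractional power; the sketch below follows that definition (the function names are ours, and this is not the paper's fast algorithm).

```python
import numpy as np
from scipy.linalg import dft, fractional_matrix_power

def frft_1d(signal: np.ndarray, order: float) -> np.ndarray:
    """Illustrative 1D discrete FRFT: a fractional power of the unitary DFT
    matrix. order=1 recovers the ordinary DFT, order=0 the identity."""
    n = signal.shape[0]
    F = dft(n, scale="sqrtn")                 # unitary DFT matrix
    F_a = fractional_matrix_power(F, order)   # fractional matrix power
    return F_a @ signal

def frft_2d(image: np.ndarray, order: float) -> np.ndarray:
    """Separable 2D FRFT: apply the 1D transform along rows, then columns."""
    tmp = np.apply_along_axis(frft_1d, 0, image, order)
    return np.apply_along_axis(frft_1d, 1, tmp, order)
```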

EAAI Journal 2023 Journal Article

DGFaceNet: Lightweight and efficient face recognition

  • Feng Zhao
  • Peng Zhang
  • Ran Zhang
  • Mengwei Li

Face recognition has achieved great success due to the development of deep convolutional neural networks (DCNNs). However, complex DCNNs bring a large number of parameters as well as computational effort, which poses a significant challenge to resource-constrained embedded devices. Meanwhile, commonly used loss functions and lightweight networks are not sufficiently effective for face recognition. In this paper, we first investigate the impact of the number of similar features generated by inexpensive operations on model performance. It is shown that DCNNs can tolerate more similar features generated by cheap operations in the early stages of the network. We construct the Dynamic Ghost Bottleneck based on this idea, and DGFaceNet is composed of stacked Dynamic Ghost Bottlenecks. In addition, we propose a new class-margin-linear softmax loss function (CML-softmax) for lightweight networks. CML-softmax designs a quadratic function to replace the cosine function as the target logit, which allows better performance and convergence with low-dimensional outputs for face recognition. Meanwhile, CML-softmax introduces two margin functions to alleviate the class imbalance and softmax early saturation problems, respectively. Our method demonstrates competitive results on many validation datasets and popular large-scale benchmarks. Speed tests on embedded devices show that the actual inference of DGFaceNet is 11.08 times, 8.57 times, 2.75 times, and 2.82 times faster than ResNet-50, EfficientNet, MobileNetV2, and MobileFaceNet, respectively. DGFaceNet can significantly improve the running efficiency of the model on resource-constrained embedded devices while ensuring the model's performance.

NeurIPS Conference 2023 Conference Paper

FouriDown: Factoring Down-Sampling into Shuffling and Superposing

  • Qi Zhu
  • Man Zhou
  • Jie Huang
  • Naishan Zheng
  • Hongzhi Gao
  • Chongyi Li
  • Yuan Xu
  • Feng Zhao

Spatial down-sampling techniques, such as strided convolution, Gaussian down-sampling, and nearest-neighbor down-sampling, are essential in deep neural networks. In this study, we revisit the working mechanism of the spatial down-sampling family and analyze the biased effects caused by the static weighting strategy employed in previous approaches. To overcome this limitation, we propose a novel down-sampling paradigm in the Fourier domain, abbreviated as FouriDown, which unifies existing down-sampling techniques. Drawing inspiration from the signal sampling theorem, we parameterize the non-parametric, static-weighting down-sampling operator as a learnable and context-adaptive operator within a unified Fourier function. Specifically, we organize the corresponding frequency positions of the 2D plane in a physically-closed manner within a single channel dimension. We then perform point-wise channel shuffling based on an indicator that determines whether a channel's signal frequency bin is susceptible to aliasing, ensuring the consistency of the weighting parameter learning. FouriDown, as a generic operator, comprises four key components: 2D discrete Fourier transform, context shuffling rules, Fourier weighting-adaptively superposing rules, and 2D inverse Fourier transform. These components can be easily integrated into existing image restoration networks. To demonstrate the efficacy of FouriDown, we conduct extensive experiments on image de-blurring and low-light image enhancement. The results consistently show that FouriDown can provide significant performance improvements. We will make the code publicly available to facilitate further exploration and application of FouriDown.
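For context on what a Fourier-domain down-sampler does at all, the sketch below shows only the fixed, non-learnable baseline (cropping the low-frequency center of the spectrum), which is the kind of static-weighting scheme FouriDown generalizes with learnable shuffling and superposing. The function name and rescaling convention are our assumptions, not the paper's operator.

```python
import torch
import torch.fft as fft

def fourier_downsample(x: torch.Tensor, factor: int = 2) -> torch.Tensor:
    """Baseline sketch only: down-sample a (B, C, H, W) tensor by keeping the
    centered low-frequency block of its spectrum. FouriDown itself replaces
    this static cropping with learnable, context-adaptive rules."""
    b, c, h, w = x.shape
    spec = fft.fftshift(fft.fft2(x, norm="ortho"), dim=(-2, -1))
    nh, nw = h // factor, w // factor
    top, left = (h - nh) // 2, (w - nw) // 2
    cropped = spec[..., top:top + nh, left:left + nw]
    out = fft.ifft2(fft.ifftshift(cropped, dim=(-2, -1)), norm="ortho")
    return out.real / factor   # rescale so the mean intensity is preserved
```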

IJCAI Conference 2023 Conference Paper

Guided Patch-Grouping Wavelet Transformer with Spatial Congruence for Ultra-High Resolution Segmentation

  • Deyi Ji
  • Feng Zhao
  • Hongtao Lu

Most existing ultra-high resolution (UHR) segmentation methods struggle with the dilemma of balancing memory cost and local characterization accuracy, both of which are taken into account in our proposed Guided Patch-Grouping Wavelet Transformer (GPWFormer), which achieves impressive performance. In this work, GPWFormer is a Transformer (T)-CNN (C) mutual learning framework, where T takes the whole UHR image as input and harvests both local details and fine-grained long-range contextual dependencies, while C takes the downsampled image as input for learning the category-wise deep context. For the sake of high inference speed and low computation complexity, T partitions the original UHR image into patches and groups them dynamically, then learns the low-level local details with the lightweight multi-head Wavelet Transformer (WFormer) network. Meanwhile, fine-grained long-range contextual dependencies are also captured during this process, since patches that are far away in the spatial domain can still be assigned to the same group. In addition, masks produced by C are utilized to guide the patch grouping process, providing a heuristic for the decision. Moreover, congruence constraints between the two branches are exploited to maintain spatial consistency among the patches. Overall, we stack the multi-stage process in a pyramid way. Experiments show that GPWFormer outperforms existing methods with significant improvements on five benchmark datasets.

AAAI Conference 2023 Conference Paper

Intensity-Aware Loss for Dynamic Facial Expression Recognition in the Wild

  • Hanting Li
  • Hongjing Niu
  • Zhaoqing Zhu
  • Feng Zhao

Compared with the image-based static facial expression recognition (SFER) task, the dynamic facial expression recognition (DFER) task based on video sequences is closer to the natural expression recognition scene. However, DFER is often more challenging. One of the main reasons is that video sequences often contain frames with different expression intensities, especially for the facial expressions in the real-world scenarios, while the images in SFER frequently present uniform and high expression intensities. Nevertheless, if the expressions with different intensities are treated equally, the features learned by the networks will have large intra-class and small inter-class differences, which are harmful to DFER. To tackle this problem, we propose the global convolution-attention block (GCA) to rescale the channels of the feature maps. In addition, we introduce the intensity-aware loss (IAL) in the training process to help the network distinguish the samples with relatively low expression intensities. Experiments on two in-the-wild dynamic facial expression datasets (i.e., DFEW and FERV39k) indicate that our method outperforms the state-of-the-art DFER approaches. The source code will be available at https://github.com/muse1998/IAL-for-Facial-Expression-Recognition.
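The GCA block is described only as rescaling the channels of the feature maps. As a generic illustration of channel rescaling in that spirit (a squeeze-and-excitation style module; the paper's actual GCA design differs and the class name is ours), a minimal sketch might look like this:

```python
import torch
import torch.nn as nn

class ChannelRescale(nn.Module):
    """Illustrative channel-rescaling block: pool global context per channel,
    then predict per-channel weights used to rescale the feature map."""
    def __init__(self, channels: int, reduction: int = 8):
        super().__init__()
        self.pool = nn.AdaptiveAvgPool2d(1)            # global context per channel
        self.fc = nn.Sequential(
            nn.Conv2d(channels, channels // reduction, kernel_size=1),
            nn.ReLU(inplace=True),
            nn.Conv2d(channels // reduction, channels, kernel_size=1),
            nn.Sigmoid(),
        )

    def forward(self, x: torch.Tensor) -> torch.Tensor:
        weights = self.fc(self.pool(x))                # (B, C, 1, 1) channel weights
        return x * weights                             # rescale each channel
```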

AAAI Conference 2023 Conference Paper

Learning Semantic Degradation-Aware Guidance for Recognition-Driven Unsupervised Low-Light Image Enhancement

  • Naishan Zheng
  • Jie Huang
  • Man Zhou
  • Zizheng Yang
  • Qi Zhu
  • Feng Zhao

Low-light images suffer severe degradation of low lightness and noise corruption, causing unsatisfactory visual quality and visual recognition performance. To solve this problem while coping with the unavailability of paired datasets in a wide range of scenarios, unsupervised low-light image enhancement (ULLIE) techniques have been developed. However, these methods are primarily guided to alleviate the degradation effect on visual quality rather than at the semantic level, hence limiting their performance in visual recognition tasks. To this end, we propose to learn a Semantic Degradation-Aware Guidance (SDAG) that perceives the low-light degradation effect at the semantic level in a self-supervised manner, which is further utilized to guide the ULLIE methods. The proposed SDAG utilizes the low-light degradation factors as augmented signals to degrade the low-light images, and then captures their degradation effect at the semantic level. Specifically, our SDAG employs the subsequent pre-trained recognition model extractor to extract semantic representations, and then learns to self-reconstruct the enhanced low-light image and its augmented degraded images. By constraining the relative reconstruction effect between the original enhanced image and the augmented formats, our SDAG learns to be aware of the degradation effect at the semantic level in a relative comparison manner. Moreover, our SDAG is general and can be plugged into the training paradigm of existing ULLIE methods. Extensive experiments demonstrate its effectiveness in improving ULLIE approaches on downstream recognition tasks while maintaining competitive visual quality. Code will be available at https://github.com/zheng980629/SDAG.

NeurIPS Conference 2023 Conference Paper

Transition-constant Normalization for Image Enhancement

  • Jie Huang
  • Man Zhou
  • Jinghao Zhang
  • Gang Yang
  • Mingde Yao
  • Chongyi Li
  • Zhiwei Xiong
  • Feng Zhao

Normalization techniques that capture image style through statistical representations have become a popular component in deep neural networks. Although image enhancement can be considered a form of style transformation, there has been little exploration of how normalization affects enhancement performance. To fully leverage the potential of normalization, we present a novel Transition-Constant Normalization (TCN) for various image enhancement tasks. Specifically, it consists of two streams of normalization operations arranged under an invertible constraint, along with a feature sub-sampling operation that satisfies the normalization constraint. TCN enjoys several merits, including being parameter-free, plug-and-play, and incurring no additional computational cost. We provide various formats for utilizing TCN in image enhancement, including seamless integration with enhancement networks, incorporation into encoder-decoder architectures for downsampling, and implementation of efficient architectures. Through extensive experiments on multiple image enhancement tasks, such as low-light enhancement, exposure correction, SDR2HDR translation, and image dehazing, our TCN consistently demonstrates performance improvements. Besides, it shows broad applicability to other tasks, including pan-sharpening and medical segmentation. The code is available at https://github.com/huangkevinj/TCNorm.

IJCAI Conference 2022 Conference Paper

AutoAlign: Pixel-Instance Feature Aggregation for Multi-Modal 3D Object Detection

  • Zehui Chen
  • Zhenyu Li
  • Shiquan Zhang
  • Liangji Fang
  • Qinhong Jiang
  • Feng Zhao
  • Bolei Zhou
  • Hang Zhao

Object detection through either RGB images or LiDAR point clouds has been extensively explored in autonomous driving. However, it remains challenging to make these two data sources complementary and beneficial to each other. In this paper, we propose AutoAlign, an automatic feature fusion strategy for 3D object detection. Instead of establishing deterministic correspondence with the camera projection matrix, we model the mapping relationship between the image and point clouds with a learnable alignment map. This map enables our model to automate the alignment of non-homogenous features in a dynamic and data-driven manner. Specifically, a cross-attention feature alignment module is devised to adaptively aggregate pixel-level image features for each voxel. To enhance the semantic consistency during feature alignment, we also design a self-supervised cross-modal feature interaction module, through which the model can learn feature aggregation with instance-level feature guidance. Extensive experimental results show that our approach can lead to 2.3 mAP and 7.0 mAP improvements on the KITTI and nuScenes datasets, respectively. Notably, our best model reaches 70.9 NDS on the nuScenes testing leaderboard, achieving competitive performance among various state-of-the-art methods.

NeurIPS Conference 2022 Conference Paper

Deep Fourier Up-Sampling

  • Man Zhou
  • Hu Yu
  • Jie Huang
  • Feng Zhao
  • Jinwei Gu
  • Chen Change Loy
  • Deyu Meng
  • Chongyi Li

Existing convolutional neural networks widely adopt spatial down-/up-sampling for multi-scale modeling. However, spatial up-sampling operators (e.g., interpolation, transposed convolution, and un-pooling) rely heavily on local pixel attention and are incapable of exploring global dependency. In contrast, the Fourier domain naturally accords with global modeling according to the spectral convolution theorem. Unlike the spatial domain, where up-sampling is easily performed owing to the property of local similarity, up-sampling in the Fourier domain is more challenging as it does not follow such a local property. In this study, we propose a theoretically feasible Deep Fourier Up-Sampling (FourierUp) to solve these issues. We revisit the relationships between the spatial and Fourier domains and reveal the transform rules on the features of different resolutions in the Fourier domain, which provide key insights for FourierUp's designs. FourierUp is a generic operator consisting of three key components: a 2D discrete Fourier transform, Fourier dimension increase rules, and a 2D inverse Fourier transform, and it can be directly integrated with existing networks. Extensive experiments across multiple computer vision tasks, including object detection, image segmentation, image de-raining, image dehazing, and guided image super-resolution, demonstrate the consistent performance gains obtained by introducing our FourierUp. Code will be publicly available.
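
For intuition, the textbook form of frequency-domain up-sampling pads the centred spectrum with zeros, so every output pixel depends on the whole input rather than a local neighbourhood; the paper's actual Fourier dimension-increase rules differ, and the sketch below is only this simplified stand-in:

    # Simplified frequency-domain up-sampling by zero-padding the centred spectrum.
    # This is the textbook variant, not FourierUp's dimension-increase rules.
    import torch

    def fourier_upsample(x, scale=2):
        # x: (B, C, H, W) real-valued feature map.
        B, C, H, W = x.shape
        spec = torch.fft.fftshift(torch.fft.fft2(x), dim=(-2, -1))
        padded = torch.zeros(B, C, scale * H, scale * W, dtype=spec.dtype)
        top, left = (scale * H - H) // 2, (scale * W - W) // 2
        padded[:, :, top:top + H, left:left + W] = spec   # original spectrum at the centre
        up = torch.fft.ifft2(torch.fft.ifftshift(padded, dim=(-2, -1)))
        return (scale ** 2) * up.real                     # rescale to preserve amplitude

    x = torch.randn(1, 3, 32, 32)
    print(fourier_upsample(x).shape)  # torch.Size([1, 3, 64, 64])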

IJCAI Conference 2022 Conference Paper

MMNet: Muscle Motion-Guided Network for Micro-Expression Recognition

  • Hanting Li
  • Mingzhe Sui
  • Zhaoqing Zhu
  • Feng Zhao

Facial micro-expressions (MEs) are involuntary facial motions revealing people's real feelings and play an important role in the early intervention of mental illness, national security, and many human-computer interaction systems. However, existing micro-expression datasets are limited and usually pose challenges for training good classifiers. To model the subtle facial muscle motions, we propose a robust micro-expression recognition (MER) framework, namely the muscle motion-guided network (MMNet). Specifically, a continuous attention (CA) block is introduced to focus on modeling local subtle muscle motion patterns with little identity information, which differs from most previous methods that directly extract features from complete video frames containing much identity information. Besides, we design a position calibration (PC) module based on the vision transformer. By adding the position embeddings of the face generated by the PC module at the end of the two branches, the PC module helps add position information to the facial muscle motion-pattern features for MER. Extensive experiments on three public micro-expression datasets demonstrate that our approach outperforms state-of-the-art methods by a large margin. Code is available at https://github.com/muse1998/MMNet.
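
As a loose illustration of attending to motion rather than appearance (hypothetical layer names; not the official CA block), one could derive a spatial attention map from the difference between two frames and use it to re-weight features:

    # Illustrative motion-guided spatial attention; not the MMNet CA block itself.
    import torch
    import torch.nn as nn

    class MotionAttention(nn.Module):
        def __init__(self, channels=3):
            super().__init__()
            self.conv = nn.Conv2d(channels, 1, kernel_size=3, padding=1)

        def forward(self, onset_frame, apex_frame, feats):
            motion = apex_frame - onset_frame          # subtle muscle-motion signal, little identity info
            attn = torch.sigmoid(self.conv(motion))    # (B, 1, H, W) spatial attention map
            return feats * attn                        # re-weight features by motion saliency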

NeurIPS Conference 2022 Conference Paper

Panchromatic and Multispectral Image Fusion via Alternating Reverse Filtering Network

  • Keyu Yan
  • Man Zhou
  • Jie Huang
  • Feng Zhao
  • Chengjun Xie
  • Chongyi Li
  • Danfeng Hong

Panchromatic (PAN) and multi-spectral (MS) image fusion, known as pan-sharpening, refers to super-resolving low-resolution (LR) multi-spectral (MS) images in the spatial domain to generate the expected high-resolution (HR) MS images, conditioned on the corresponding high-resolution PAN images. In this paper, we present a simple yet effective alternating reverse filtering network for pan-sharpening. Inspired by classical reverse filtering, which reverses images to their status before filtering, we formulate pan-sharpening as an alternately iterative reverse filtering process, which fuses LR MS and HR PAN images in an interpretable manner. Different from existing model-driven methods that require well-designed priors and degradation assumptions, the reverse filtering process avoids dependency on pre-defined exact priors. To guarantee the stability and convergence of the iterative process via contraction mapping on a metric space, we develop a learnable multi-scale Gaussian kernel module instead of using specific filters. We demonstrate the theoretical feasibility of such formulations. Extensive experiments on diverse scenes thoroughly verify the performance of our method, which significantly outperforms the state of the art.
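
The classical reverse-filtering iteration the abstract refers to recovers a pre-filtered image via the fixed-point update x_{k+1} = x_k + (y - f(x_k)); the sketch below shows that update with a plain Gaussian blur standing in for the paper's learnable multi-scale Gaussian kernel module:

    # Classical reverse-filtering fixed-point iteration that the abstract builds on:
    # recover x from y = f(x) by x_{k+1} = x_k + (y - f(x_k)), assuming f behaves like
    # a contraction. A plain Gaussian blur stands in for the paper's learnable kernels.
    import torch
    import torch.nn.functional as F

    def gaussian_blur(x, sigma=1.0, ksize=7):
        ax = torch.arange(ksize, dtype=torch.float32) - ksize // 2
        g = torch.exp(-(ax ** 2) / (2 * sigma ** 2))
        k = g[:, None] * g[None, :]
        k = (k / k.sum()).view(1, 1, ksize, ksize).repeat(x.shape[1], 1, 1, 1)
        return F.conv2d(x, k, padding=ksize // 2, groups=x.shape[1])

    def reverse_filter(y, filt, steps=20):
        x = y.clone()
        for _ in range(steps):
            x = x + (y - filt(x))      # fixed-point update toward the pre-filtered image
        return x

    sharp = torch.rand(1, 3, 64, 64)
    blurred = gaussian_blur(sharp)
    restored = reverse_filter(blurred, gaussian_blur)
    print(F.l1_loss(restored, sharp).item())  # smaller than F.l1_loss(blurred, sharp).item()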

NeurIPS Conference 2022 Conference Paper

Roadblocks for Temporarily Disabling Shortcuts and Learning New Knowledge

  • Hongjing Niu
  • Hanting Li
  • Feng Zhao
  • Bin Li

Deep learning models have been found to rely on shortcuts, i.e., decision rules that perform well on standard benchmarks but fail when transferred to more challenging testing conditions. Such reliance may hinder deep learning models from learning other task-related features and seriously affect their performance and robustness. Although recent studies have revealed some characteristics of shortcuts, there are few investigations on how to help deep learning models solve shortcut problems. This paper proposes a framework to address this issue by setting up roadblocks on shortcuts. Specifically, roadblocks are placed by urging the model to learn to complete a gently modified task, ensuring that the previously learned knowledge, including shortcuts, is insufficient to complete the task. Therefore, the model trained on the modified task no longer over-relies on shortcuts. Extensive experiments demonstrate that the proposed framework significantly improves the training of networks on both synthetic and real-world datasets in terms of both classification accuracy and feature diversity. Moreover, the visualization results show that the mechanism behind our proposed method is consistent with our expectations. In summary, our approach can effectively disable shortcuts and thus learn more robust features.

JBHI Journal 2020 Journal Article

Deep Learning-Based Classification of Liver Cancer Histopathology Images Using Only Global Labels

  • Chunli Sun
  • Ao Xu
  • Dong Liu
  • Zhiwei Xiong
  • Feng Zhao
  • Weiping Ding

Liver cancer is a leading cause of cancer deaths worldwide due to its high morbidity and mortality. Histopathological image analysis (HIA) is a crucial step in the early diagnosis of liver cancer and is routinely performed manually. However, this process is time-consuming, error-prone, and easily affected by the expertise of pathologists. Recently, computer-aided methods have been widely applied to medical image analysis; however, current studies have not yet focused on the histopathological morphology of liver cancer due to its complex features and the insufficiency of training images with detailed annotations. This paper proposes a deep learning method for liver cancer histopathological image classification using only global labels. To compensate for the lack of detailed cancer region annotations in these images, patch features are extracted and fully utilized. Transfer learning is used to obtain the patch-level features, which are then combined with multiple-instance learning to acquire the image-level features for classification. The proposed method addresses the processing of large-scale images and the insufficiency of training samples in liver cancer histopathological image classification. It can distinguish and classify liver histopathological images as abnormal or normal with high accuracy, thus providing support for the early diagnosis of liver cancer.
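
A minimal sketch of the overall recipe (transfer-learned patch features pooled by multiple-instance learning into an image-level prediction); the backbone choice, max-pooling aggregation, and the torchvision weights argument (which requires a recent torchvision) are assumptions, not the paper's exact pipeline:

    # Illustrative patch-feature + multiple-instance-learning classifier.
    import torch
    import torch.nn as nn
    import torchvision.models as models

    class MILClassifier(nn.Module):
        def __init__(self, num_classes=2):
            super().__init__()
            backbone = models.resnet18(weights="IMAGENET1K_V1")   # transfer-learned patch extractor
            self.features = nn.Sequential(*list(backbone.children())[:-1])
            self.head = nn.Linear(512, num_classes)

        def forward(self, patches):
            # patches: (N_patches, 3, 224, 224) cropped from one whole-slide image
            f = self.features(patches).flatten(1)      # (N_patches, 512) patch-level features
            scores = self.head(f)                      # patch-level logits
            return scores.max(dim=0).values            # image-level logits via max-pooling MIL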

UAI Conference 2015 Conference Paper

Structure Learning Constrained by Node-Specific Degree Distribution

  • Jianzhu Ma
  • Feng Zhao
  • Jinbo Xu

We consider the problem of learning the structure of a Markov Random Field (MRF) when a node-specific degree distribution is provided. The problem setting is inspired by protein contact map (i.e., graph) prediction, in which the contact number (i.e., degree) of an individual residue (i.e., node) can be predicted without knowing the contact graph. We formulate this problem using maximum pseudo-likelihood plus a node-specific ℓ1 regularization derived from the predicted degree distribution. Intuitively, when a node has k predicted edges, we dynamically reduce the regularization coefficients of its k most probable edges to promote their occurrence. We then optimize the objective function using ADMM and an iterative maximum-cost bipartite matching algorithm. Our experimental results show that using the degree distribution as a constraint can lead to significant performance gains when the predicted degrees are accurate.
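
Read as a worked equation (notation mine; the paper's exact regularizer may differ), the objective sketched in the abstract has the form

    \max_{\Theta}\; \sum_{i} \log P\!\left(x_i \mid x_{-i};\, \Theta\right) \;-\; \sum_{i<j} \lambda_{ij}\, \lVert \theta_{ij} \rVert_1,

where the first term is the pseudo-likelihood and the coefficients \lambda_{ij} are dynamically lowered for the k_i edges incident to node i that are most probable under the predicted degree distribution, so those edges are penalized less and therefore promoted.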

AIJ Journal 2001 Journal Article

Influence-based model decomposition for reasoning about spatially distributed physical systems

  • Chris Bailey-Kellogg
  • Feng Zhao

Many important science and engineering applications, such as regulating the temperature distribution over a semiconductor wafer and controlling the noise from a photocopy machine, require interpreting distributed data and designing decentralized controllers for spatially distributed systems. Developing effective computational techniques for representing and reasoning about these systems, which are usually modeled with partial differential equations (PDEs), is one of the major challenge problems for qualitative and spatial reasoning research. This paper introduces a novel approach to decentralized control design, influence-based model decomposition, and applies it in the context of thermal regulation. Influence-based model decomposition uses a decentralized model, called an influence graph, as a key data abstraction representing influences of controls on distributed physical fields. It serves as the basis for novel algorithms for control placement and parameter design for distributed systems with large numbers of coupled variables. These algorithms exploit physical knowledge of locality, linear superposability, and continuity, encapsulated in influence graphs representing dependencies of field nodes on control nodes. The control placement design algorithms utilize influence graphs to decompose a problem domain so as to decouple the resulting regions. The decentralized control parameter optimization algorithms utilize influence graphs to efficiently evaluate thermal fields and to explicitly trade off computation, communication, and control quality. By leveraging the physical knowledge encapsulated in influence graphs, these control design algorithms are more efficient than standard techniques, and produce designs explainable in terms of problem structures.
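
An illustrative sketch of the decomposition idea (not the paper's algorithm): drop influence-graph edges whose coupling strength falls below a threshold and treat the remaining connected components as separately optimizable subparts; all names and the threshold are assumptions:

    # Illustrative influence-graph decomposition by thresholding coupling strengths.
    from collections import defaultdict

    def decompose(influences, threshold):
        # influences: dict mapping (node_a, node_b) -> coupling strength.
        graph = defaultdict(set)
        for (a, b), strength in influences.items():
            graph.setdefault(a, set())
            graph.setdefault(b, set())
            if strength >= threshold:        # keep only strong couplings
                graph[a].add(b)
                graph[b].add(a)
        seen, parts = set(), []
        for start in graph:
            if start in seen:
                continue
            stack, comp = [start], set()
            while stack:
                n = stack.pop()
                if n in comp:
                    continue
                comp.add(n)
                stack.extend(graph[n] - comp)
            seen |= comp
            parts.append(comp)
        return parts  # each part can be optimized separately, then recombined

    print(decompose({("t1", "t2"): 0.9, ("t2", "t3"): 0.05, ("t3", "t4"): 0.8}, 0.1))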

AAAI Conference 1999 Conference Paper

Influence-Based Model Decomposition

  • Christopher Bailey-Kellogg (Dartmouth College)
  • Feng Zhao (Xerox Palo Alto Research Center)

Recent rapid advances in MEMS and information processing technology have enabled a new generation of AI robotic systems -- so-called SmartMatter systems -- that are sensor rich and physically embedded. These systems range from decentralized control systems that regulate building temperature (smart buildings) to vehicle on-board diagnostic and control systems that interrogate large amounts of sensor data. One of the core tasks in the construction and operation of these SmartMatter systems is to synthesize optimal control policies using data-rich models for the systems and environment. Unfortunately, these models may contain thousands of coupled real-valued variables and are prohibitively expensive to reason about using traditional optimization techniques such as neural nets and genetic algorithms. This paper introduces a general mechanism for automatically decomposing a large model into smaller subparts so that these subparts can be separately optimized and then combined. The mechanism decomposes a model using an influence graph that records the coupling strengths among constituents of the model. This paper demonstrates the mechanism in an application of decentralized optimization for a temperature regulation problem. Performance data has shown that the approach is much more efficient than the standard discrete optimization algorithms and achieves comparable accuracy.

AIJ Journal 1994 Journal Article

Extracting and representing qualitative behaviors of complex systems in phase space

  • Feng Zhao

This paper presents a computational method for automatically analyzing qualitative behaviors of complex dynamical systems in phase space. To demonstrate this method, a program called MAPS has been constructed that understands qualitatively distinct features of a phase space and represents geometric information about these features in a dimension-independent description, using deep domain knowledge of dynamical systems theory. Given a dynamical system specified as a system of governing equations, MAPS incrementally extracts the qualitative information about the system in terms of a qualitative phase-space structure describing steady-state behaviors, stabilities, and transient properties. MAPS generates a high-level symbolic description of the system sensible to human beings and manipulable by other programs, through a combination of numerical, combinatorial, and geometric computations and spatial reasoning techniques. MAPS has successfully demonstrated its power in a difficult engineering domain of nonlinear control design.
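
One step of the kind of analysis MAPS automates can be illustrated numerically: locate a steady state of a given system of governing equations and classify its stability from the eigenvalues of the Jacobian. The damped-pendulum example and all names below are mine, not from the paper:

    # Illustrative numerical step toward a qualitative phase-space description:
    # find an equilibrium and classify its stability via Jacobian eigenvalues.
    import numpy as np
    from scipy.optimize import fsolve

    def f(x):  # x = [theta, omega]; damped pendulum used as an example system
        return np.array([x[1], -0.5 * x[1] - np.sin(x[0])])

    def jacobian(f, x, h=1e-6):
        n = len(x)
        J = np.zeros((n, n))
        for j in range(n):
            dx = np.zeros(n)
            dx[j] = h
            J[:, j] = (f(x + dx) - f(x - dx)) / (2 * h)   # central finite differences
        return J

    eq = fsolve(f, np.array([0.1, 0.0]))                  # steady state near the origin
    eigvals = np.linalg.eigvals(jacobian(f, eq))
    print(eq, "stable" if np.all(eigvals.real < 0) else "unstable")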

IJCAI Conference 1991 Conference Paper

Extracting and Representing Qualitative Behaviors of Complex Systems in Phase Spaces

  • Feng Zhao

We develop a qualitative method for understanding and representing phase space structures of complex systems. To demonstrate this method, a program called MAPS has been constructed that understands qualitatively different regions of a phase space and represents and extracts geometric shape information about these regions, using deep domain knowledge of dynamical system theory. Given a dynamical system specified as a system of governing equations, MAPS applies a successive sequence of operations to incrementally extract the qualitative information and generates a complete, high-level symbolic description of the phase space structure, through a combination of numerical, combinatorial, and geometric computations and spatial reasoning techniques. The high-level description is sensible to human beings and manipulable by other programs. We are currently applying the method to a difficult engineering design domain in which controllers for complex systems are to be automatically synthesized to achieve desired properties, based on the knowledge of the phase space "shapes" of the systems.