Arrow Research

Author name cluster

Fanyi Wang

Possible papers associated with this exact author name in Arrow. This page groups case-insensitive exact name matches and is not a full identity disambiguation profile.

7 papers
2 author rows

Possible papers (7)

AAAI Conference 2025 · Conference Paper

Overcoming Heterogeneous Data in Federated Medical Vision-Language Pre-training: A Triple-Embedding Model Selector Approach

  • Aowen Wang
  • Zhiwang Zhang
  • Dongang Wang
  • Fanyi Wang
  • Haotian Hu
  • Jinyang Guo
  • Yipeng Zhou
  • Chaoyi Pang

The scarcity of data in the medical field motivates collaborative training for medical vision-language pre-training (VLP) across different clients. Such collaborative training faces two challenges: First, medical data is privacy-sensitive and therefore cannot be shared directly across clients. Second, medical data distributions across institutions are typically heterogeneous, hindering local model alignment and representation capability. To overcome both challenges simultaneously, we propose a framework called personalized model selector with fused multimodal information (PMS-FM). The contribution of PMS-FM is two-fold: 1) PMS-FM uses embeddings to represent information in different formats, allowing for the fusion of multimodal data. 2) PMS-FM adapts to personalized data distributions by training multiple models; a model selector then identifies and selects the best-performing model for each individual client. Extensive experiments on multiple real-world medical datasets demonstrate the superior performance of PMS-FM over existing federated learning methods on different zero-shot classification tasks.
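
The selector step lends itself to a short illustration. Below is a minimal sketch of per-client model selection, assuming each client simply scores a pool of candidate models on local validation data; the function names and the scalar scoring interface are hypothetical, and the actual PMS-FM selector works on fused multimodal embeddings rather than a plain accuracy score.

    # Minimal per-client model selection sketch (hypothetical interface).
    from typing import Callable, List, Tuple

    def select_best_model(models: List[object],
                          evaluate: Callable[[object], float]) -> Tuple[int, object]:
        """Return the index and model with the highest local validation score."""
        scores = [evaluate(m) for m in models]  # e.g. zero-shot accuracy per candidate
        best = max(range(len(scores)), key=scores.__getitem__)
        return best, models[best]

    # Usage: each client scores the shared candidate pool on its own data, e.g.
    # idx, local_model = select_best_model(candidates, my_validation_accuracy)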

AAAI Conference 2024 · Conference Paper

BARET: Balanced Attention Based Real Image Editing Driven by Target-Text Inversion

  • Yuming Qiao
  • Fanyi Wang
  • Jingwen Su
  • Yanhao Zhang
  • Yunjie Yu
  • Siyu Wu
  • Guo-Jun Qi

Image editing approaches based on diffusion models have developed rapidly, yet their applicability is subject to requirements such as specific editing types (e.g., foreground or background object editing, style transfer), multiple conditions (e.g., mask, sketch, caption), and time-consuming fine-tuning of diffusion models. To alleviate these limitations and realize efficient real image editing, we propose a novel editing technique that only requires an input image and a target text for various editing types, including non-rigid edits, without fine-tuning the diffusion model. Our method contains three novelties: (I) Target-text Inversion Schedule (TTIS) is designed to fine-tune the input target text embedding to achieve fast image reconstruction without an image caption and to accelerate convergence. (II) A Progressive Transition Scheme applies progressive linear interpolation between the target text embedding and its fine-tuned version to generate transition embeddings, maintaining non-rigid editing capability. (III) A Balanced Attention Module (BAM) balances the tradeoff between textual description and image semantics. By combining the self-attention map from the reconstruction process and the cross-attention map from the transition process, the guidance of target text embeddings in the diffusion process is optimized. To demonstrate the editing capability, effectiveness, and efficiency of the proposed BARET, we conducted extensive qualitative and quantitative experiments. Moreover, results from a user study and an ablation study further prove its superiority over other methods.
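
The Progressive Transition Scheme in (II) reduces to a simple interpolation schedule, sketched below. This is a minimal sketch assuming both embeddings share one shape and that the schedule runs linearly from the fine-tuned (reconstruction-side) embedding to the raw target embedding; the endpoint ordering and step count are assumptions for illustration, not details taken from the paper.

    import torch

    def transition_embeddings(e_target: torch.Tensor,
                              e_finetuned: torch.Tensor,
                              num_steps: int) -> list:
        """Progressive linear interpolation from the fine-tuned (TTIS) text
        embedding back to the raw target text embedding."""
        denom = max(num_steps - 1, 1)
        return [torch.lerp(e_finetuned, e_target, step / denom)  # weight 0 -> e_finetuned
                for step in range(num_steps)]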

IROS Conference 2024 · Conference Paper

IC-FPS: Instance-Centroid Faster Point Sampling Framework for 3D Point-based Object Detection

  • Haotian Hu
  • Fanyi Wang
  • Yaonong Wang
  • Laifeng Hu
  • Zhiwang Zhang

3D object detection is one of the most important tasks in autonomous driving and robotics. Our research focuses on tackling the low efficiency of point-based methods, and we propose a novel Instance-Centroid Faster Point Sampling (IC-FPS) framework. We design a Neighboring Feature Diffusion Module (NFDM) to extract local features in order to efficiently distinguish the foreground from the background. Considering that the Farthest Point Sampling (FPS) strategy for downsampling is computationally intensive, we propose the Centroid-Instance Sampling Strategy (CISS). CISS samples center points in large-scale point clouds by rapidly sampling the centroid and instance points of foreground blocks. The proposed IC-FPS framework can be inserted into any point-based model and effectively replaces the first Set Abstraction (SA) layer. Extensive experiments on several public benchmarks demonstrate the superior performance of our proposed IC-FPS. On the Waymo dataset, IC-FPS significantly improves the performance of the benchmark model and increases inference speed by 3.8 times. Real-time detection with point-based methods is thereby realized for the first time, which is meaningful for industrial applications.
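
To make the sampling idea concrete, here is a heavily simplified sketch of centroid-guided foreground sampling. It assumes per-point instance labels are already available (in IC-FPS these would come from the learned foreground segmentation, not ground truth), and all names are illustrative rather than from the paper's code.

    import numpy as np

    def centroid_instance_sample(points: np.ndarray,
                                 fg_labels: np.ndarray,
                                 k_per_instance: int) -> np.ndarray:
        """Keep the centroid-nearest points of each foreground instance instead
        of running farthest point sampling over the full cloud.
        Assumes at least one foreground instance; label -1 marks background."""
        samples = []
        for inst in np.unique(fg_labels[fg_labels >= 0]):
            pts = points[fg_labels == inst]
            centroid = pts.mean(axis=0)
            order = np.argsort(np.linalg.norm(pts - centroid, axis=1))
            samples.append(pts[order[:k_per_instance]])
        return np.concatenate(samples, axis=0)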

ECAI Conference 2024 · Conference Paper

u-LLaVA: Unifying Multi-Modal Tasks via Large Language Model

  • Jinjin Xu
  • Liwu Xu
  • Yuzhe Yang 0001
  • Xiang Li 0179
  • Fanyi Wang
  • Yanchun Xie
  • Yi-Jie Huang
  • Yaqian Li

Recent advancements in multi-modal large language models (MLLMs) have led to substantial improvements in visual understanding, primarily driven by sophisticated modality alignment strategies. However, predominant approaches prioritize global or regional comprehension, with less focus on fine-grained, pixel-level tasks. To address this gap, we introduce u-LLaVA, an innovative unifying multi-task framework that integrates pixel, regional, and global features to refine the perceptual faculties of MLLMs. We commence by leveraging an efficient modality alignment approach, harnessing both image and video datasets to bolster the model’s foundational understanding across diverse visual contexts. Subsequently, a joint instruction tuning method with task-specific projectors and decoders for end-to-end downstream training is presented. Furthermore, this work contributes a novel mask-based multi-task dataset comprising 277K samples, crafted to challenge and assess the fine-grained perception capabilities of MLLMs. The overall framework is simple, effective, and achieves state-of-the-art performance across multiple benchmarks. We make model, data, and code publicly accessible at https://github.com/OPPOMKLab/u-LLaVA.
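
The task-specific projectors mentioned above follow a common modality-alignment pattern: a small MLP maps vision-encoder features into the LLM's token-embedding space. The sketch below illustrates that general pattern only; the class name and dimensions are assumptions, not taken from the u-LLaVA code base.

    import torch.nn as nn

    class VisionProjector(nn.Module):
        """Two-layer MLP projecting vision features (batch, tokens, vision_dim)
        into the LLM embedding space (batch, tokens, llm_dim)."""
        def __init__(self, vision_dim: int = 1024, llm_dim: int = 4096):
            super().__init__()
            self.net = nn.Sequential(
                nn.Linear(vision_dim, llm_dim),
                nn.GELU(),
                nn.Linear(llm_dim, llm_dim),
            )

        def forward(self, vision_feats):
            return self.net(vision_feats)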

IJCAI Conference 2024 · Conference Paper

Zero-shot High-fidelity and Pose-controllable Character Animation

  • Bingwen Zhu
  • Fanyi Wang
  • Tianyi Lu
  • Peng Liu
  • Jingwen Su
  • Jinxiu Liu
  • Yanhao Zhang
  • Zuxuan Wu

Image-to-video (I2V) generation aims to create a video sequence from a single image, which requires high temporal coherence and visual fidelity. However, existing approaches suffer from inconsistent character appearances and poor preservation of fine details. Moreover, they require a large amount of video data for training, which can be computationally demanding. To address these limitations, we propose PoseAnimate, a novel zero-shot I2V framework for character animation. PoseAnimate contains three key components: 1) a Pose-Aware Control Module (PACM) that incorporates diverse pose signals into text embeddings to preserve character-independent content and maintain precise alignment of actions; 2) a Dual Consistency Attention Module (DCAM) that enhances temporal consistency and retains character identity and intricate background details; 3) a Mask-Guided Decoupling Module (MGDM) that refines distinct feature perception abilities, improving animation fidelity by decoupling the character and background. We also propose a Pose Alignment Transition Algorithm (PATA) to ensure smooth action transitions. Extensive experimental results demonstrate that our approach outperforms state-of-the-art training-based methods in terms of character consistency and detail fidelity. Moreover, it maintains a high level of temporal coherence throughout the generated animations.
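
The PACM step of incorporating pose signals into text embeddings can be illustrated with a minimal fusion sketch. The additive projection below is an assumption for illustration only (the paper's module is more involved), and every name and dimension here is hypothetical.

    import torch
    import torch.nn as nn

    class PoseAwareTextConditioning(nn.Module):
        """Project a pose signal and add it to the text embedding that
        conditions the diffusion model (illustrative fusion only)."""
        def __init__(self, pose_dim: int, text_dim: int):
            super().__init__()
            self.pose_proj = nn.Linear(pose_dim, text_dim)

        def forward(self, text_emb: torch.Tensor, pose: torch.Tensor) -> torch.Tensor:
            # text_emb: (batch, seq, text_dim); pose: (batch, pose_dim)
            return text_emb + self.pose_proj(pose).unsqueeze(1)  # broadcast over tokens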

AAAI Conference 2023 · Conference Paper

GAM: Gradient Attention Module of Optimization for Point Clouds Analysis

  • Haotian Hu
  • Fanyi Wang
  • Zhiwang Zhang
  • Yaonong Wang
  • Laifeng Hu
  • Yanhao Zhang

In the point cloud analysis task, existing local feature aggregation descriptors (LFAD) do not fully utilize the neighborhood information of center points. Previous methods use only distance information to constrain the local aggregation process, which is easily affected by outlier points and cannot adequately fit the original geometry of the point cloud. This paper argues that fine-grained geometric information (FGGI) plays an important role in the aggregation of local features. Based on this, we propose a gradient-based local attention module, called the Gradient Attention Module (GAM), to address the above problem. GAM simplifies the extraction of neighborhood gradient information into an explicit representation using the Zenith Angle matrix and the Azimuth Angle matrix, which makes the module 35X faster. Comprehensive experiments on the ScanObjectNN, ShapeNet, S3DIS, ModelNet40, and KITTI datasets demonstrate the effectiveness, efficiency, and generalization of our newly proposed GAM for 3D point cloud analysis. In particular, on S3DIS, GAM achieves the best results among current point-based models, with mIoU/OA/mAcc of 74.4%/90.6%/83.2%.
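
The Zenith Angle and Azimuth Angle matrices have a standard spherical-coordinate reading, sketched below for a single neighborhood. This follows the usual definitions (zenith measured from the +z axis, azimuth in the x-y plane) and is an assumption about the paper's exact formulation; the function name is illustrative.

    import numpy as np

    def neighborhood_angles(center: np.ndarray, neighbors: np.ndarray, eps: float = 1e-8):
        """Explicit zenith/azimuth angles of neighbor directions around a center point."""
        offsets = neighbors - center                                # (k, 3) relative coords
        r = np.linalg.norm(offsets, axis=1) + eps                   # avoid divide-by-zero
        zenith = np.arccos(np.clip(offsets[:, 2] / r, -1.0, 1.0))   # angle from +z axis
        azimuth = np.arctan2(offsets[:, 1], offsets[:, 0])          # angle in x-y plane
        return zenith, azimuth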

IJCAI Conference 2023 · Conference Paper

Matting Moments: A Unified Data-Driven Matting Engine for Mobile AIGC in Photo Gallery

  • Yanhao Zhang
  • Fanyi Wang
  • Weixuan Sun
  • Jingwen Su
  • Peng Liu
  • Yaqian Li
  • Xinjie Feng
  • Zhengxia Zou

Image matting is a fundamental technique in visual understanding and has become one of the most significant capabilities of mobile phones. Despite advances in mobile storage and computing power, achieving diverse mobile Artificial Intelligence Generated Content (AIGC) applications remains a great challenge. To address this issue, we present an innovative demonstration of an automatic system called "Matting Moments" that enables automatic image editing based on matting models in different scenarios. Coupled with accurate and refined matting of subjects, our system provides visual element editing abilities and backend services for distribution and recommendation that respond to emotional expressions. Our system comprises three components: 1) photo content structuring, 2) a data-driven matting engine, and 3) AIGC functions for generation, which together automatically achieve diverse photo beautification in the gallery. This system offers a unified framework that guides consumers toward intelligent recommendations with beautifully generated contents, helping them enjoy the moments and memories of their present life.