Arrow Research

Author name cluster

Fuyun Wang

Possible papers associated with this exact author name in Arrow. This page groups case-insensitive exact name matches and is not a full identity disambiguation profile.

3 papers
1 author row

Possible papers (3)

AAAI 2025 · Conference Paper

Scene Graph-Grounded Image Generation

  • Fuyun Wang
  • Tong Zhang
  • Yuanzhi Wang
  • Xiaoya Zhang
  • Xin Liu
  • Zhen Cui

With the benefit of the explicit object-oriented reasoning capabilities of scene graphs, scene graph-to-image generation has made remarkable advancements in comprehending object coherence and interactive relations. Recent state-of-the-art methods typically predict scene layouts as an intermediate representation of the scene graph before synthesizing the image. Nevertheless, transforming a scene graph into an exact layout may restrict its representation capabilities, leading to discrepancies in interactive relationships (such as standing on, wearing, or covering) between the generated image and the input scene graph. In this paper, we propose a Scene Graph-Grounded Image Generation (SGG-IG) method to mitigate the above issues. Specifically, to enhance the scene graph representation, we design a masked auto-encoder module and a relation embedding learning module to integrate structural knowledge and contextual information of the scene graph in a masked self-supervised manner. Subsequently, to bridge the scene graph with visual content, we introduce a spatial constraint and an image-scene alignment constraint to capture the fine-grained visual correlation between the scene graph symbol representation and the corresponding image representation, thereby generating semantically consistent and high-quality images. Extensive experiments demonstrate the effectiveness of the method both quantitatively and qualitatively.
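The masked self-supervised idea in this abstract can be illustrated with a small sketch: scene-graph (subject, relation, object) triples are tokenized, a fraction of the tokens is replaced with a [MASK] id, and a transformer encoder is trained to reconstruct the masked positions. All module names, sizes, and the vocabulary below are illustrative assumptions, not the authors' implementation.

```python
# Minimal sketch of masked relation modeling over scene-graph triples,
# loosely following the abstract's mask self-supervision idea. The token
# vocabulary, layer sizes, and masking ratio are assumptions for illustration.
import torch
import torch.nn as nn

class MaskedRelationEncoder(nn.Module):
    def __init__(self, num_tokens=1000, dim=256, num_layers=4, num_heads=8):
        super().__init__()
        self.embed = nn.Embedding(num_tokens + 1, dim)  # last id reserved for [MASK]
        self.mask_id = num_tokens
        layer = nn.TransformerEncoderLayer(dim, num_heads, batch_first=True)
        self.encoder = nn.TransformerEncoder(layer, num_layers)
        self.head = nn.Linear(dim, num_tokens)

    def forward(self, triples, mask_prob=0.3):
        # triples: (B, T, 3) integer ids for (subject, relation, object)
        tokens = triples.flatten(1)                           # (B, 3T)
        mask = torch.rand_like(tokens.float()) < mask_prob    # positions to corrupt
        corrupted = tokens.masked_fill(mask, self.mask_id)
        hidden = self.encoder(self.embed(corrupted))          # (B, 3T, dim)
        logits = self.head(hidden)
        # Masked auto-encoding objective: reconstruct only the masked tokens.
        if mask.any():
            return nn.functional.cross_entropy(logits[mask], tokens[mask])
        return logits.sum() * 0.0

# Usage: loss = MaskedRelationEncoder()(torch.randint(0, 1000, (8, 12, 3)))
```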

NeurIPS 2025 · Conference Paper

Value Diffusion Reinforcement Learning

  • Xiaoliang Hu
  • Fuyun Wang
  • Tong Zhang
  • Zhen Cui

Model-free reinforcement learning (RL) combined with diffusion models has achieved significant progress in addressing complex continuous control tasks. However, a persistent challenge in RL remains the accurate estimation of Q-values, which critically governs the efficacy of policy optimization. Although recent advances employ parametric distributions to model value distributions for enhanced estimation accuracy, current methodologies predominantly rely on unimodal Gaussian assumptions or quantile representations. These constraints introduce distributional bias between the learned and true value distributions, particularly in tasks with a nonstationary policy, ultimately degrading performance. To address these limitations, we propose value diffusion reinforcement learning (VDRL), a novel model-free online RL method that utilizes the generative capacity of diffusion models to represent multimodal value distributions. The core innovation of VDRL lies in the use of the variational loss of the diffusion-based value distribution, which is theoretically proven to be a tight lower bound for the optimization objective under the KL-divergence measurement. Furthermore, we introduce double value diffusion learning with sample selection to enhance training stability and further improve value estimation accuracy. Extensive experiments conducted on the MuJoCo benchmark demonstrate that VDRL significantly outperforms state-of-the-art model-free online RL baselines, showcasing its effectiveness and robustness.
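The abstract's core idea, representing the value distribution with a diffusion model trained via a variational (noise-prediction) loss, can be sketched roughly as follows. The network architecture, noise schedule, and names are illustrative assumptions rather than the published VDRL implementation; Q(s, a) would then be estimated by averaging returns sampled from the learned distribution.

```python
# Minimal sketch of a diffusion-based value distribution: a denoiser learns
# p(return | state, action) with the standard DDPM noise-prediction loss,
# which serves as a variational bound on the value-distribution likelihood.
# Sizes, schedule, and names are assumptions, not the authors' code.
import torch
import torch.nn as nn

class ValueDenoiser(nn.Module):
    def __init__(self, state_dim, action_dim, hidden=256, T=50):
        super().__init__()
        self.T = T
        betas = torch.linspace(1e-4, 0.02, T)
        self.register_buffer("alphas_bar", torch.cumprod(1.0 - betas, dim=0))
        self.net = nn.Sequential(
            nn.Linear(state_dim + action_dim + 2, hidden), nn.SiLU(),
            nn.Linear(hidden, hidden), nn.SiLU(),
            nn.Linear(hidden, 1),
        )

    def forward(self, s, a, z_t, t):
        # Predict the noise added to the scalar return sample z_t at step t.
        t_emb = t.float().unsqueeze(-1) / self.T
        return self.net(torch.cat([s, a, z_t, t_emb], dim=-1))

    def loss(self, s, a, target_return):
        # DDPM-style variational loss on the value distribution:
        # diffuse the TD/return target and regress the injected noise.
        b = s.shape[0]
        t = torch.randint(0, self.T, (b,), device=s.device)
        noise = torch.randn(b, 1, device=s.device)
        ab = self.alphas_bar[t].unsqueeze(-1)
        z_t = ab.sqrt() * target_return + (1 - ab).sqrt() * noise
        return nn.functional.mse_loss(self(s, a, z_t, t), noise)
```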

NeurIPS 2024 · Conference Paper

MMM-RS: A Multi-modal, Multi-GSD, Multi-scene Remote Sensing Dataset and Benchmark for Text-to-Image Generation

  • Jialin Luo
  • Yuanzhi Wang
  • Ziqi Gu
  • Yide Qiu
  • Shuaizhen Yao
  • Fuyun Wang
  • Chunyan Xu
  • Wenhua Zhang

Recently, the diffusion-based generative paradigm has achieved impressive general image generation capabilities with text prompts due to its accurate distribution modeling and stable training process. However, generating diverse remote sensing (RS) images that are tremendously different from general images in terms of scale and perspective remains a formidable challenge due to the lack of a comprehensive remote sensing image generation dataset with various modalities, ground sample distances (GSD), and scenes. In this paper, we propose a Multi-modal, Multi-GSD, Multi-scene Remote Sensing (MMM-RS) dataset and benchmark for text-to-image generation in diverse remote sensing scenarios. Specifically, we first collect nine publicly available RS datasets and conduct standardization for all samples. To bridge RS images to textual semantic information, we utilize a large-scale pretrained vision-language model to automatically output text prompts and perform hand-crafted rectification, resulting in information-rich text-image pairs (including multi-modal images). In particular, we design methods to obtain images with different GSDs and various environments (e.g., low-light, foggy) from a single sample. With extensive manual screening and refined annotations, we ultimately obtain the MMM-RS dataset, which comprises approximately 2.1 million text-image pairs. Extensive experimental results verify that our proposed MMM-RS dataset allows off-the-shelf diffusion models to generate diverse RS images across various modalities, scenes, weather conditions, and GSDs. The dataset is available at https://github.com/ljl5261/MMM-RS.
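For readers who want to feed such text-image pairs into an off-the-shelf training loop, a minimal loader sketch is shown below. The images/metadata.jsonl layout and field names are assumptions made for illustration; consult the linked repository for the actual MMM-RS file format.

```python
# Minimal sketch of a text-image pair loader. Assumes (hypothetically) a root
# folder containing the images plus a metadata.jsonl file whose lines look like
# {"image": "relative/path.png", "caption": "..."} -- not the official layout.
import json
from pathlib import Path

from PIL import Image
from torch.utils.data import Dataset

class TextImagePairs(Dataset):
    def __init__(self, root, transform=None):
        self.root = Path(root)
        with open(self.root / "metadata.jsonl") as f:
            self.records = [json.loads(line) for line in f if line.strip()]
        self.transform = transform

    def __len__(self):
        return len(self.records)

    def __getitem__(self, idx):
        rec = self.records[idx]
        image = Image.open(self.root / rec["image"]).convert("RGB")
        if self.transform is not None:
            image = self.transform(image)
        return image, rec["caption"]

# Usage: pairs = TextImagePairs("mmm_rs"); image, caption = pairs[0]
```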