Arrow Research

Author name cluster

Ge Wu

Possible papers associated with this exact author name in Arrow. This page groups case-insensitive exact name matches and is not a full identity disambiguation profile.

5 papers
2 author rows

Possible papers (5)

NeurIPS 2025 Conference Paper

LEDiT: Your Length-Extrapolatable Diffusion Transformer without Positional Encoding

  • Shen Zhang
  • Siyuan Liang
  • Yaning Tan
  • Zhaowei Chen
  • Linze Li
  • Ge Wu
  • Yuhao Chen
  • Shuheng Li

Diffusion transformers (DiTs) struggle to generate images at resolutions higher than their training resolution. The primary obstacle is that explicit positional encodings (PEs), such as RoPE, must be extrapolated to unseen positions, which degrades performance when the inference resolution differs from the training resolution. In this paper, we propose a Length-Extrapolatable Diffusion Transformer (LEDiT) to overcome this limitation. LEDiT needs no explicit PEs, thereby avoiding PE extrapolation. The key innovation of LEDiT lies in the use of causal attention. We demonstrate that causal attention can implicitly encode global positional information and show that such information facilitates extrapolation. We further introduce a locality enhancement module, which captures fine-grained local information to complement the coarse-grained global positional information encoded by causal attention. Experimental results on both conditional and text-to-image generation tasks demonstrate that LEDiT supports up to 4$\times$ resolution scaling (e.g., from 256$\times$256 to 512$\times$512), achieving better image quality than state-of-the-art length extrapolation methods. We believe that LEDiT marks a departure from the standard RoPE-based methods and offers a promising insight into length extrapolation. Project page: https://shenzhang2145.github.io/ledit/
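To make the mechanism concrete, here is a minimal PyTorch sketch of causal self-attention over flattened image-patch tokens with no explicit positional encoding: the causal mask alone breaks permutation symmetry, which is what lets attention implicitly encode global position. The module and shapes are illustrative assumptions, not the authors' implementation.

```python
import torch
import torch.nn as nn
import torch.nn.functional as F

class CausalSelfAttention(nn.Module):
    """Causal attention over flattened image-patch tokens, with no RoPE or
    learned positional encoding: masking token j > i makes the layer
    position-aware, giving each token coarse global position for free."""

    def __init__(self, dim: int, num_heads: int):
        super().__init__()
        assert dim % num_heads == 0
        self.num_heads = num_heads
        self.qkv = nn.Linear(dim, 3 * dim)
        self.proj = nn.Linear(dim, dim)

    def forward(self, x: torch.Tensor) -> torch.Tensor:
        # x: (batch, num_tokens, dim), tokens in raster-scan order
        B, N, D = x.shape
        q, k, v = self.qkv(x).chunk(3, dim=-1)
        q, k, v = (t.view(B, N, self.num_heads, -1).transpose(1, 2)
                   for t in (q, k, v))
        # is_causal=True: token i attends only to tokens <= i
        out = F.scaled_dot_product_attention(q, k, v, is_causal=True)
        return self.proj(out.transpose(1, 2).reshape(B, N, D))
```

A quick shape check: `CausalSelfAttention(dim=384, num_heads=6)(torch.randn(2, 256, 384))` returns a `(2, 256, 384)` tensor; because no PE is baked in, the same module runs unchanged on longer token sequences at higher resolutions.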

ICRA 2025 Conference Paper

RE0: Recognize Everything with 3D Zero-Shot Instance Segmentation

  • Xiaohan Yan
  • Zijian Jiang
  • Yinghao Shuai
  • Nan Wang 0041
  • Xiaowei Song
  • Wenbo Ji
  • Ge Wu
  • Jinyu He

Recognizing objects in the 3D world is a significant challenge for robotics. Due to the lack of high-quality 3D data, directly training a general-purpose segmentation model in 3D is almost infeasible. Meanwhile, vision foundation models (VFMs) have revolutionized 2D computer vision with outstanding performance, making VFM-assisted 3D perception a promising direction. However, most existing VFM-assisted methods neither effectively address the 2D-3D inconsistency problem nor adequately provide corresponding semantic information for 3D instance objects. To address these two issues, this paper introduces RE0, a novel framework for 3D zero-shot instance segmentation. Given 3D point clouds and posed multi-view RGB-D images, we leverage 3D geometric information, projection relationships, and CLIP semantic features. Specifically, we use CropFormer to extract mask information from the multi-view posed images, combine it with projection relationships to assign point-level labels to each point in the point cloud, and achieve instance-level consistency through inter-frame information interaction. We then employ the projection relationships again to assign CLIP semantic features to the point cloud and to aggregate small-scale point clouds. Notably, RE0 requires no additional training and needs only a single inference pass each of CropFormer and CLIP. Experiments on ScanNet200 and ScanNet++ show that our method achieves higher-quality segmentation than previous zero-shot methods. Our code and demos are available at https://recognizeeverything.github.io/, with only one RTX 3090 GPU required.
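The core 2D-to-3D label transfer the abstract describes can be sketched as a camera projection with a depth-based occlusion check. This is a hedged illustration with assumed function and parameter names (`label_points_from_mask`, `depth_tol`), not the RE0 codebase.

```python
import numpy as np

def label_points_from_mask(points_world, pose_c2w, K, depth, mask, depth_tol=0.05):
    """Assign per-frame 2D instance-mask labels to 3D points by projection.

    points_world: (P, 3) scene points; pose_c2w: (4, 4) camera-to-world pose;
    K: (3, 3) intrinsics; depth, mask: (H, W) per-frame depth and mask maps.
    """
    w2c = np.linalg.inv(pose_c2w)
    pts_h = np.hstack([points_world, np.ones((len(points_world), 1))])
    pts_cam = (w2c @ pts_h.T).T[:, :3]               # world -> camera frame
    z = pts_cam[:, 2]
    z_safe = np.where(z > 0, z, np.inf)              # points behind camera -> filtered below
    uv = (K @ pts_cam.T).T                           # camera -> pixel coordinates
    u = np.round(uv[:, 0] / z_safe).astype(int)
    v = np.round(uv[:, 1] / z_safe).astype(int)
    H, W = depth.shape
    in_view = (z > 0) & (u >= 0) & (u < W) & (v >= 0) & (v < H)
    idx = np.where(in_view)[0]
    # occlusion check: the point's depth must match the sensor depth
    visible = idx[np.abs(depth[v[idx], u[idx]] - z[idx]) < depth_tol]
    labels = np.full(len(points_world), -1, dtype=int)   # -1 = unlabeled
    labels[visible] = mask[v[visible], u[visible]]
    return labels
```

Per-frame labels produced this way would still disagree across views; the inter-frame information interaction the abstract describes is what merges them into consistent instances.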

NeurIPS 2025 Conference Paper

Representation Entanglement for Generation: Training Diffusion Transformers Is Much Easier Than You Think

  • Ge Wu
  • Shen Zhang
  • Ruijing Shi
  • Shanghua Gao
  • Zhenyuan Chen
  • Lei Wang
  • Zhaowei Chen
  • Hongcheng Gao

REPA and its variants effectively mitigate training challenges in diffusion models by incorporating external visual representations from pretrained models, aligning the noisy hidden projections of denoising networks with clean image representations from foundation models. We argue that this external alignment, which is absent during the entire denoising inference process, falls short of fully harnessing the potential of discriminative representations. In this work, we propose a straightforward method called Representation Entanglement for Generation (REG), which entangles low-level image latents with a single high-level class token from pretrained foundation models for denoising. REG acquires the capability to produce coherent image-class pairs directly from pure noise, substantially improving both generation quality and training efficiency. This is accomplished with negligible additional inference overhead, requiring only one additional token for denoising (<0.5\% increase in FLOPs and latency). The inference process concurrently reconstructs both image latents and their corresponding global semantics, where the acquired semantic knowledge actively guides and enhances image generation. On ImageNet 256$\times$256, SiT-XL/2 + REG demonstrates remarkable convergence acceleration, achieving 63$\times$ and 23$\times$ faster training than SiT-XL/2 and SiT-XL/2 + REPA, respectively. More impressively, SiT-L/2 + REG trained for merely 400K iterations outperforms SiT-XL/2 + REPA trained for 4M iterations (10$\times$ longer). Code is available at: https://github.com/Martinser/REG
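A hedged sketch of the entanglement idea: append one projected class token (e.g., from a frozen foundation encoder such as DINOv2) to the noisy image-latent tokens and denoise the joint sequence. All names here (`EntangledDenoiser`, `cls_proj`) are assumptions for illustration, not the released API.

```python
import torch
import torch.nn as nn

class EntangledDenoiser(nn.Module):
    """Denoise image latents and one class token jointly. `backbone` is any
    token-sequence denoiser (a SiT/DiT-style transformer); `cls_proj` maps the
    foundation model's class token into the latent width."""

    def __init__(self, backbone: nn.Module, cls_dim: int, latent_dim: int):
        super().__init__()
        self.backbone = backbone
        self.cls_proj = nn.Linear(cls_dim, latent_dim)

    def forward(self, noisy_latents, noisy_cls, t):
        # noisy_latents: (B, N, latent_dim) noised image tokens
        # noisy_cls:     (B, 1, cls_dim) the noised foundation-model class token
        tokens = torch.cat([noisy_latents, self.cls_proj(noisy_cls)], dim=1)
        pred = self.backbone(tokens, t)        # one joint denoising pass
        return pred[:, :-1], pred[:, -1:]      # image prediction, class prediction
```

At inference, both the latent tokens and the class token would start from pure noise and be denoised together, which is how coherent image-class pairs emerge at nearly zero extra cost.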

JBHI 2021 Journal Article

Prediction of Three-Dimensional Radiotherapy Optimal Dose Distributions for Lung Cancer Patients With Asymmetric Network

  • Yan Shao
  • Xiaoying Zhang
  • Ge Wu
  • Qingtao Gu
  • Jiyong Wang
  • Yanchen Ying
  • Aihui Feng
  • Guotong Xie

The iterative design of radiotherapy treatment plans is time-consuming and labor-intensive. To provide guidance for treatment planning, an asymmetric network (A-Net) is proposed to predict the optimal 3D dose distribution for lung cancer patients. A-Net was trained and tested on 392 lung cancer cases with prescription doses of 50 Gy and 60 Gy. In A-Net, the encoder and decoder are asymmetric, which preserves input information while accommodating GPU memory limits. Squeeze-and-excitation (SE) units are used to improve the data-fitting ability, and a loss function that uses both the dose distribution and the prescription dose as ground truth is designed. In the experiments, A-Net was trained and tested separately on the 50 Gy and 60 Gy datasets; it achieves performance similar to HD-UNet and 3D-UNet on most metrics and slightly better on some. On the combined 50 Gy and 60 Gy dataset, A-Net outperforms the other two on most metrics. In conclusion, A-Net can accurately predict IMRT dose distributions across all three datasets (50 Gy, 60 Gy, and combined).
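Since the abstract highlights squeeze-and-excitation units, here is a generic SE block for 3D (volumetric) feature maps as a point of reference; it follows the standard SE formulation, not necessarily the paper's exact configuration.

```python
import torch
import torch.nn as nn

class SEBlock3D(nn.Module):
    """Standard squeeze-and-excitation unit for 3D feature maps."""

    def __init__(self, channels: int, reduction: int = 16):
        super().__init__()
        self.fc = nn.Sequential(
            nn.Linear(channels, channels // reduction),
            nn.ReLU(inplace=True),
            nn.Linear(channels // reduction, channels),
            nn.Sigmoid(),
        )

    def forward(self, x: torch.Tensor) -> torch.Tensor:
        # x: (B, C, D, H, W) volumetric features
        s = x.mean(dim=(2, 3, 4))                    # squeeze: global average pool
        w = self.fc(s).view(x.size(0), x.size(1), 1, 1, 1)
        return x * w                                 # excitation: channel reweighting
```

The squeeze step pools each channel to a scalar and the excitation step learns per-channel gates that reweight the feature map, letting the network emphasize informative channels cheaply.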