Arrow Research

Author name cluster

Ju Hu

Possible papers associated with this exact author name in Arrow. This page groups case-insensitive exact name matches and is not a full identity disambiguation profile.

6 papers
2 author rows

Possible papers (6)

ICLR 2025 · Conference Paper

Lightweight Predictive 3D Gaussian Splats

  • Junli Cao
  • Vidit Goel
  • Chaoyang Wang 0001
  • Anil Kag
  • Ju Hu
  • Sergei Korolev
  • Chenfanfu Jiang
  • Sergey Tulyakov

Recent approaches representing 3D objects and scenes using Gaussian splats show increased rendering speed across a variety of platforms and devices. While rendering such representations is indeed extremely efficient, storing and transmitting them is often prohibitively expensive. To represent large-scale scenes, one often needs to store millions of 3D Gaussians, which can occupy up to gigabytes of storage. This creates a significant practical barrier, preventing widespread adoption on resource-constrained devices. In this work, we propose a new representation that dramatically reduces the hard-drive footprint while offering similar or improved quality compared to standard 3D Gaussian splats. This representation leverages the inherent feature sharing among splats in close proximity using a hierarchical tree structure, in which only the parent splats need to be stored. We present a method for constructing tree structures from naturally unstructured point clouds. Additionally, we propose adaptive tree manipulation, which prunes redundant trees while spawning new ones from significant child splats during optimization. On the benchmark datasets, we achieve a 20x reduction in hard-drive footprint with improved fidelity compared to vanilla 3DGS, and a 2-5x reduction compared to existing compact solutions. More importantly, we demonstrate the practical application of our method in real-world rendering on mobile devices and AR glasses.
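
To make the parent-child idea concrete, below is a minimal, hypothetical sketch of predicting child splats from stored parents. The module name, feature dimension, child count, and attribute layout are all assumptions for illustration, not the authors' implementation.

```python
# Hypothetical sketch: store only parent splats on disk and regenerate child
# splat attributes from them with a tiny decoder. Shapes and names are assumed.
import torch
import torch.nn as nn

class ChildSplatDecoder(nn.Module):
    """Tiny MLP that expands one stored parent splat into K child splats."""
    def __init__(self, feat_dim=32, children_per_parent=4):
        super().__init__()
        self.k = children_per_parent
        # Each child gets: 3 (position offset) + 3 (scale) + 4 (rotation quat)
        # + 3 (color) + 1 (opacity) = 14 parameters.
        self.mlp = nn.Sequential(
            nn.Linear(feat_dim, 64), nn.ReLU(),
            nn.Linear(64, self.k * 14),
        )

    def forward(self, parent_pos, parent_feat):
        # parent_pos: (P, 3), parent_feat: (P, feat_dim)
        out = self.mlp(parent_feat).view(-1, self.k, 14)
        child_pos = parent_pos.unsqueeze(1) + out[..., :3]   # offsets from parent
        child_scale = torch.exp(out[..., 3:6])               # positive scales
        child_rot = nn.functional.normalize(out[..., 6:10], dim=-1)
        child_color = torch.sigmoid(out[..., 10:13])
        child_opacity = torch.sigmoid(out[..., 13:14])
        return child_pos, child_scale, child_rot, child_color, child_opacity

# Only parent positions + features ever hit the disk; children are recomputed.
decoder = ChildSplatDecoder()
parent_pos, parent_feat = torch.randn(1000, 3), torch.randn(1000, 32)
pos, scale, rot, color, opacity = decoder(parent_pos, parent_feat)
print(pos.shape)  # (1000, 4, 3): 4x more splats than stored parameters
```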

NeurIPS 2025 · Conference Paper

Preventing Shortcuts in Adapter Training via Providing the Shortcuts

  • Anujraaj Goyal
  • Guocheng Qian
  • Huseyin Coskun
  • Aarush Gupta
  • Himmy Tam
  • Daniil Ostashev
  • Ju Hu
  • Dhritiman Sagar

Adapter-based training has emerged as a key mechanism for extending the capabilities of powerful foundation image generators, enabling personalized and stylized text-to-image synthesis. These adapters are typically trained to capture a specific target attribute, such as subject identity, using single-image reconstruction objectives. However, because the input image inevitably contains a mixture of visual factors, adapters are prone to entangle the target attribute with incidental ones, such as pose, expression, and lighting. This spurious correlation problem limits generalization and obstructs the model's ability to adhere to the input text prompt. In this work, we uncover a simple yet effective solution: provide the very shortcuts we wish to eliminate during adapter training. In Shortcut-Rerouted Adapter Training, confounding factors are routed through auxiliary modules, such as ControlNet or LoRA, eliminating the incentive for the adapter to internalize them. The auxiliary modules are then removed during inference. When applied to tasks like facial and full-body identity injection, our approach improves generation quality, diversity, and prompt adherence. These results point to a general design principle in the era of large models: when seeking disentangled representations, the most effective path may be to establish shortcuts for what should NOT be learned.
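
A toy sketch of the rerouting pattern follows. The block, adapter, and auxiliary branch below are simplified stand-ins (plain linear layers rather than a diffusion UNet with ControlNet/LoRA), chosen only to show where the confound enters during training and how it is dropped at inference.

```python
# Toy sketch of shortcut rerouting, with assumed module names and shapes.
# During training, the confounding signal (e.g., pose) is injected through an
# auxiliary branch, so gradient descent has no incentive to pack pose into the
# identity adapter; the branch is simply removed at inference.
import torch
import torch.nn as nn

class ToyBlock(nn.Module):
    """Stand-in for one frozen block of a diffusion backbone."""
    def __init__(self, dim=64):
        super().__init__()
        self.proj = nn.Linear(dim, dim)
        for p in self.parameters():
            p.requires_grad_(False)  # backbone stays frozen

    def forward(self, x, identity_adapter, aux_branch=None,
                identity_emb=None, pose_emb=None):
        h = self.proj(x)
        h = h + identity_adapter(identity_emb)   # target attribute (trained)
        if aux_branch is not None:               # training-time shortcut
            h = h + aux_branch(pose_emb)         # confound routed here instead
        return h

dim = 64
block = ToyBlock(dim)
identity_adapter = nn.Linear(dim, dim)  # trained adapter
aux_branch = nn.Linear(dim, dim)        # ControlNet/LoRA stand-in

x, id_emb, pose_emb = torch.randn(2, dim), torch.randn(2, dim), torch.randn(2, dim)
train_out = block(x, identity_adapter, aux_branch, id_emb, pose_emb)
infer_out = block(x, identity_adapter, None, id_emb)  # shortcut removed
```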

NeurIPS 2024 · Conference Paper

BitsFusion: 1.99 bits Weight Quantization of Diffusion Model

  • Yang Sui
  • Yanyu Li
  • Anil Kag
  • Yerlan Idelbayev
  • Junli Cao
  • Ju Hu
  • Dhritiman Sagar
  • Bo Yuan

Diffusion-based image generation models have achieved great success in recent years by showing the capability of synthesizing high-quality content. However, these models contain a huge number of parameters, resulting in a significantly large model size. Saving and transferring them is a major bottleneck for various applications, especially those running on resource-constrained devices. In this work, we develop a novel weight quantization method that quantizes the UNet from Stable Diffusion v1.5 to $1.99$ bits, achieving a model with $7.9\times$ smaller size while exhibiting even better generation quality than the original one. Our approach includes several novel techniques, such as assigning optimal bits to each layer, initializing the quantized model for better performance, and improving the training strategy to dramatically reduce quantization error. Furthermore, we extensively evaluate our quantized model across various benchmark datasets and through human evaluation to demonstrate its superior generation quality.
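
The sketch below illustrates the general flavor of mixed-precision weight quantization averaging around 2 bits per weight. The fake-quantizer and the greedy, error-based bit assignment are illustrative assumptions, not the paper's actual algorithm.

```python
# Minimal fake-quantization sketch: per-layer bit widths chosen greedily under
# an average-bit budget. The sensitivity proxy (reconstruction error) and the
# greedy loop are illustrative assumptions, not BitsFusion's method.
import torch

def fake_quantize(w, bits):
    """Symmetric uniform quantization of a weight tensor to `bits` bits."""
    qmax = max(2 ** (bits - 1) - 1, 1)
    scale = w.abs().max().clamp(min=1e-8) / qmax
    return torch.clamp(torch.round(w / scale), -qmax, qmax) * scale

def assign_bits(weights, avg_budget=2.0, choices=(1, 2, 3, 4)):
    """Greedily give more bits to layers with the largest error reduction."""
    bits = [min(choices)] * len(weights)
    sizes = [w.numel() for w in weights]
    avg = lambda: sum(b * s for b, s in zip(bits, sizes)) / sum(sizes)
    while True:
        gains = []
        for i, w in enumerate(weights):
            cur = choices.index(bits[i])
            if cur + 1 == len(choices):          # already at max precision
                gains.append(-1.0)
                continue
            err_now = (w - fake_quantize(w, bits[i])).pow(2).mean()
            err_up = (w - fake_quantize(w, choices[cur + 1])).pow(2).mean()
            gains.append((err_now - err_up).item())
        best = max(range(len(weights)), key=lambda i: gains[i])
        if gains[best] <= 0:
            break
        cur = choices.index(bits[best])
        bits[best] = choices[cur + 1]
        if avg() > avg_budget:                   # undo if we blow the budget
            bits[best] = choices[cur]
            break
    return bits

layers = [torch.randn(256, 256), torch.randn(512, 512), torch.randn(64, 64)]
print(assign_bits(layers))  # per-layer bit widths averaging <= 2 bits
```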

NeurIPS 2023 · Conference Paper

LightSpeed: Light and Fast Neural Light Fields on Mobile Devices

  • Aarush Gupta
  • Junli Cao
  • Chaoyang Wang
  • Ju Hu
  • Sergey Tulyakov
  • Jian Ren
  • László Jeni

Real-time novel-view image synthesis on mobile devices is challenging due to limited computational power and storage. Volumetric rendering methods, such as NeRF and its derivatives, are not suitable for mobile devices due to the high computational cost of volumetric rendering. On the other hand, recent advances in neural light field representations have shown promising real-time view synthesis results on mobile devices. Neural light field methods learn a direct mapping from a ray representation to the pixel color. The current choice of ray representation is either stratified ray sampling or Plücker coordinates, overlooking the classic light slab (two-plane) representation, the preferred representation for interpolating between light field views. In this work, we find that the light slab is an efficient representation for learning a neural light field. More importantly, it is a lower-dimensional ray representation, enabling us to learn the 4D ray space using feature grids, which are significantly faster to train and render. Although mostly designed for frontal views, we show that the light-slab representation can be further extended to non-frontal scenes using a divide-and-conquer strategy. Our method achieves better rendering quality than prior light field methods and a significantly better trade-off between rendering quality and speed.
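
For reference, a minimal sketch of the light-slab (two-plane) parameterization the abstract describes: each ray is reduced to its intersections with two parallel planes, yielding a 4D coordinate. The plane placement at z = 0 and z = 1 is an arbitrary choice for illustration.

```python
# Classic light-slab parameterization: encode a ray by its intersections
# (u, v) and (s, t) with two parallel planes, a 4D input suitable for fast
# feature grids. Plane positions below are assumptions for illustration.
import torch

def ray_to_light_slab(origins, dirs, z_near=0.0, z_far=1.0):
    """Map rays (origin, direction) to 4D light-slab coordinates (u, v, s, t)."""
    # Solve origin_z + t * dir_z = z_plane for each plane.
    t_near = (z_near - origins[:, 2]) / dirs[:, 2]
    t_far = (z_far - origins[:, 2]) / dirs[:, 2]
    uv = origins[:, :2] + t_near[:, None] * dirs[:, :2]
    st = origins[:, :2] + t_far[:, None] * dirs[:, :2]
    return torch.cat([uv, st], dim=-1)  # (N, 4): lower-dim than 6D Plücker

rays_o = torch.tensor([[0.0, 0.0, -1.0]])
rays_d = torch.nn.functional.normalize(torch.tensor([[0.1, 0.2, 1.0]]), dim=-1)
print(ray_to_light_slab(rays_o, rays_d))  # a single 4D ray coordinate
```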

NeurIPS 2023 · Conference Paper

SnapFusion: Text-to-Image Diffusion Model on Mobile Devices within Two Seconds

  • Yanyu Li
  • Huan Wang
  • Qing Jin
  • Ju Hu
  • Pavlo Chemerys
  • Yun Fu
  • Yanzhi Wang
  • Sergey Tulyakov

Text-to-image diffusion models can create stunning images from natural language descriptions that rival the work of professional artists and photographers. However, these models are large, with complex network architectures and tens of denoising iterations, making them computationally expensive and slow to run. As a result, high-end GPUs and cloud-based inference are required to run diffusion models at scale. This is costly and has privacy implications, especially when user data is sent to a third party. To overcome these challenges, we present a generic approach that, for the first time, unlocks running text-to-image diffusion models on mobile devices in less than 2 seconds. We achieve this by introducing an efficient network architecture and improving step distillation. Specifically, we propose an efficient UNet by identifying redundancy in the original model and reducing the computation of the image decoder via data distillation. Further, we enhance step distillation by exploring training strategies and introducing regularization from classifier-free guidance. Our extensive experiments on MS-COCO show that our model with $8$ denoising steps achieves better FID and CLIP scores than Stable Diffusion v$1.5$ with $50$ steps. Our work democratizes content creation by bringing powerful text-to-image diffusion models to the hands of users.
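
A conceptual sketch of step distillation with classifier-free guidance on the teacher side is shown below. Every module and update rule here is a toy stand-in (the real method distills a UNet inside a latent diffusion pipeline with proper DDIM updates); it only illustrates the student matching two guided teacher steps in one.

```python
# Toy step-distillation sketch: the student learns to reproduce, in one step,
# what the teacher produces in two CFG-guided steps. All modules and the
# "0.1 * eps" update are placeholders, not a real diffusion sampler.
import torch
import torch.nn as nn

def guided_eps(model, x, t, text_emb, null_emb, w=7.5):
    """Classifier-free guidance: combine conditional and unconditional noise."""
    return (1 + w) * model(x, t, text_emb) - w * model(x, t, null_emb)

class TinyDenoiser(nn.Module):
    def __init__(self, dim=16):
        super().__init__()
        self.net = nn.Linear(dim * 2 + 1, dim)
    def forward(self, x, t, emb):
        t_feat = torch.full((x.shape[0], 1), float(t))
        return self.net(torch.cat([x, emb, t_feat], dim=-1))

dim = 16
teacher, student = TinyDenoiser(dim), TinyDenoiser(dim)
opt = torch.optim.Adam(student.parameters(), lr=1e-4)

x = torch.randn(4, dim)                      # noisy latents
text_emb, null_emb = torch.randn(4, dim), torch.zeros(4, dim)

with torch.no_grad():                        # teacher: two guided steps
    eps1 = guided_eps(teacher, x, t=10, text_emb=text_emb, null_emb=null_emb)
    x_mid = x - 0.1 * eps1                   # toy update rule, not real DDIM
    eps2 = guided_eps(teacher, x_mid, t=5, text_emb=text_emb, null_emb=null_emb)
    target = x_mid - 0.1 * eps2

pred = x - 0.2 * student(x, 10, text_emb)    # student: one bigger step
loss = nn.functional.mse_loss(pred, target)
loss.backward()
opt.step()
```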

NeurIPS 2022 · Conference Paper

EfficientFormer: Vision Transformers at MobileNet Speed

  • Yanyu Li
  • Geng Yuan
  • Yang Wen
  • Ju Hu
  • Georgios Evangelidis
  • Sergey Tulyakov
  • Yanzhi Wang
  • Jian Ren

Vision Transformers (ViT) have shown rapid progress in computer vision tasks, achieving promising results on various benchmarks. However, due to the massive number of parameters and model design, e.g., the attention mechanism, ViT-based models are generally several times slower than lightweight convolutional networks. Therefore, the deployment of ViT for real-time applications is particularly challenging, especially on resource-constrained hardware such as mobile devices. Recent efforts try to reduce the computational complexity of ViT through network architecture search or hybrid design with MobileNet blocks, yet the inference speed is still unsatisfactory. This leads to an important question: can transformers run as fast as MobileNet while obtaining high performance? To answer this, we first revisit the network architecture and operators used in ViT-based models and identify inefficient designs. Then we introduce a dimension-consistent pure transformer (without MobileNet blocks) as a design paradigm. Finally, we perform latency-driven slimming to obtain a series of final models dubbed EfficientFormer. Extensive experiments show the superiority of EfficientFormer in performance and speed on mobile devices. Our fastest model, EfficientFormer-L1, achieves $79.2\%$ top-1 accuracy on ImageNet-1K with only $1.6$ ms inference latency on iPhone 12 (compiled with CoreML), which runs as fast as MobileNetV2$\times 1.4$ ($1.6$ ms, $74.7\%$ top-1), and our largest model, EfficientFormer-L7, obtains $83.3\%$ accuracy with only $7.0$ ms latency. Our work proves that properly designed transformers can reach extremely low latency on mobile devices while maintaining high performance.
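
As a rough illustration of latency-driven slimming, the sketch below greedily removes the blocks with the worst importance-per-millisecond until a latency target is met. The block names, latencies, and importance scores are made-up placeholders; the paper derives these from a supernet and real on-device measurements.

```python
# Illustrative greedy latency-driven slimming: drop blocks with the lowest
# importance per millisecond until the model fits an on-device latency target.
def slim_to_latency(blocks, target_ms):
    """blocks: list of (name, latency_ms, importance). Returns (kept, latency)."""
    by_value = sorted(blocks, key=lambda b: b[2] / b[1])  # importance per ms
    total = sum(b[1] for b in blocks)
    removed = set()
    for name, lat, imp in by_value:
        if total <= target_ms:
            break
        total -= lat
        removed.add(name)
    return [b[0] for b in blocks if b[0] not in removed], total

# Made-up per-block numbers, purely to show the mechanism.
blocks = [
    ("stem",   0.3, 9.0),
    ("stage1", 0.4, 2.0),
    ("stage2", 0.5, 1.5),
    ("attn1",  0.6, 5.0),
    ("attn2",  0.7, 1.0),   # expensive and unimportant: removed first
]
kept, latency = slim_to_latency(blocks, target_ms=1.6)
print(kept, f"{latency:.1f} ms")  # ['stem', 'stage1', 'attn1'] 1.3 ms
```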