Arrow Research search

Author name cluster

Stella X. Yu

Possible papers associated with this exact author name in Arrow. This page groups case-insensitive exact name matches and is not a full identity disambiguation profile.

11 papers
2 author rows

Possible papers (11)

NeurIPS Conference 2025 Conference Paper

Normalize Filters! Classical Wisdom for Deep Vision

  • Gustavo Perez
  • Stella X. Yu

Classical image filters, such as those for averaging or differencing, are carefully normalized to ensure consistency and interpretability and to avoid artifacts like intensity shifts, halos, or ringing. In contrast, convolutional filters learned end-to-end in deep networks lack such constraints. Although they may resemble wavelets and blob/edge detectors, they are not normalized in the same way, or at all. Consequently, when images undergo atmospheric transfer, their responses become distorted, leading to incorrect outcomes. We address this limitation by proposing filter normalization, followed by learnable scaling and shifting, akin to batch normalization. This simple yet effective modification ensures that the filters are atmosphere-equivariant, enabling co-domain symmetry. By integrating classical filtering principles into deep learning (applicable to both convolutional neural networks and convolution-dependent vision transformers), our method achieves significant improvements on artificial and natural intensity variation benchmarks. Our ResNet34 could even outperform CLIP by a large margin. Our analysis reveals that unnormalized filters degrade performance, whereas filter normalization regularizes learning, promotes diversity, and improves robustness and generalization.
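
The operation the abstract describes (normalize each learned filter, then apply a learnable scale and shift, akin to batch normalization) can be sketched in a few lines. This is a minimal illustration, not the authors' released implementation, and the exact normalization they use may differ.

```python
import torch
import torch.nn as nn
import torch.nn.functional as F

class FilterNormalizedConv2d(nn.Module):
    """Conv2d whose filters are normalized before use, followed by a learnable
    per-channel scale and shift (akin to batch normalization). A sketch of the
    general idea; the paper's exact normalization may differ."""

    def __init__(self, in_ch, out_ch, kernel_size, stride=1, padding=0, eps=1e-5):
        super().__init__()
        self.weight = nn.Parameter(torch.randn(out_ch, in_ch, kernel_size, kernel_size))
        self.gamma = nn.Parameter(torch.ones(out_ch))   # learnable scale
        self.beta = nn.Parameter(torch.zeros(out_ch))   # learnable shift
        self.stride, self.padding, self.eps = stride, padding, eps

    def forward(self, x):
        w = self.weight
        # zero-mean, unit-norm each filter so responses stay comparable under intensity changes
        w = w - w.mean(dim=(1, 2, 3), keepdim=True)
        w = w / (w.flatten(1).norm(dim=1).view(-1, 1, 1, 1) + self.eps)
        out = F.conv2d(x, w, stride=self.stride, padding=self.padding)
        return out * self.gamma.view(1, -1, 1, 1) + self.beta.view(1, -1, 1, 1)
```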

NeurIPS Conference 2025 Conference Paper

Novel View Synthesis from A Few Glimpses via Test-Time Natural Video Completion

  • Yan Xu
  • Yixing Wang
  • Stella X. Yu

Given just a few glimpses of a scene, can you imagine the movie playing out as the camera glides through it? That’s the lens we take on sparse-input novel view synthesis, not only as filling spatial gaps between widely spaced views, but also as completing a natural video unfolding through space. We recast the task as test-time natural video completion, using powerful priors from pretrained video diffusion models to hallucinate plausible in-between views. Our zero-shot, generation-guided framework produces pseudo views at novel camera poses, modulated by an uncertainty-aware mechanism for spatial coherence. These synthesized frames densify supervision for 3D Gaussian Splatting (3D-GS) for scene reconstruction, especially in under-observed regions. An iterative feedback loop lets 3D geometry and 2D view synthesis inform each other, improving both the scene reconstruction and the generated views. The result is coherent, high-fidelity renderings from sparse inputs without any scene-specific training or fine-tuning. On LLFF, DTU, DL3DV, and MipNeRF-360, our method significantly outperforms strong 3D-GS baselines under extreme sparsity. Our project page is at https://decayale.github.io/project/SV2CGS.

ICML Conference 2025 Conference Paper

Test-Time Canonicalization by Foundation Models for Robust Perception

  • Utkarsh Singhal
  • Ryan Feng
  • Stella X. Yu
  • Atul Prakash 0001

Perception in the real world requires robustness to diverse viewing conditions. Existing approaches often rely on specialized architectures or training with predefined data augmentations, limiting adaptability. Taking inspiration from mental rotation in human vision, we propose FoCal, a test-time robustness framework that transforms the input into the most typical view. At inference time, FoCal explores a set of transformed images and chooses the one with the highest likelihood under foundation model priors. This test-time optimization boosts robustness while requiring no retraining or architectural changes. Applied to models like CLIP and SAM, it significantly boosts robustness across a wide range of transformations, including 2D and 3D rotations, contrast and lighting shifts, and day-night changes. We also explore potential applications in active vision. By reframing invariance as a test-time optimization problem, FoCal offers a general and scalable approach to robustness. Our code is available at: https://github.com/sutkarsh/focal.
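
The test-time procedure described above reduces to a small search loop: generate candidate views with a set of transformations, score each with a foundation-model prior, and keep the most typical one. The sketch below is a hedged illustration; `score_fn` and the candidate transforms are placeholders, not FoCal's actual prior or search strategy.

```python
import torch

@torch.no_grad()
def canonicalize(image, candidate_transforms, score_fn):
    """Test-time canonicalization sketch: apply each candidate transform, score
    the result with a foundation-model prior, and keep the most 'typical' view.
    `score_fn` stands in for a likelihood proxy (e.g., a CLIP-based score); the
    paper's actual prior and optimization procedure may differ."""
    best_view, best_score = image, score_fn(image)
    for transform in candidate_transforms:
        view = transform(image)
        score = score_fn(view)
        if score > best_score:
            best_view, best_score = view, score
    return best_view

# Usage sketch with hypothetical candidates and scorer:
# rotations = [lambda x, k=k: torch.rot90(x, k, dims=(-2, -1)) for k in range(1, 4)]
# canonical = canonicalize(img, rotations, score_fn=clip_typicality_score)
```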

ICLR Conference 2025 Conference Paper

Visually Consistent Hierarchical Image Classification

  • Seulki Park
  • Youren Zhang
  • Stella X. Yu
  • Sara Beery
  • Jonathan Huang

Hierarchical classification predicts labels across multiple levels of a taxonomy, e.g., from coarse-level Bird to mid-level Hummingbird to fine-level Green hermit, allowing flexible recognition under varying visual conditions. It is commonly framed as multiple single-level tasks, but each level may rely on different visual cues. Distinguishing Bird from Plant relies on global features like feathers or leaves, while separating Anna's hummingbird from Green hermit requires local details such as head coloration. Prior methods improve accuracy using external semantic supervision, but such statistical learning criteria fail to ensure consistent visual grounding at test time, resulting in incorrect hierarchical classification. We propose, for the first time, to enforce internal visual consistency by aligning fine-to-coarse predictions through intra-image segmentation. Our method outperforms zero-shot CLIP and state-of-the-art baselines on hierarchical classification benchmarks, achieving both higher accuracy and more consistent predictions. It also improves internal image segmentation without requiring pixel-level annotations.
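
For reference, the generic fine-to-coarse consistency constraint behind hierarchical classification can be written as a simple marginalization: a coarse class's probability is the sum of its fine-level children's probabilities. The sketch below shows only this generic constraint with a hypothetical child-to-parent mapping; the paper's contribution is to ground the alignment in intra-image segmentation, which is not shown here.

```python
import torch

def coarse_from_fine(fine_logits, parent_of):
    """Generic fine-to-coarse consistency: derive coarse-level probabilities by
    summing the probabilities of their fine-level children, so the two levels
    can never disagree. Illustrative only; the paper additionally grounds this
    alignment in intra-image segmentation."""
    num_fine = fine_logits.shape[-1]
    assert len(parent_of) == num_fine
    num_coarse = int(max(parent_of)) + 1
    # child -> parent indicator matrix built from the taxonomy
    mapping = torch.zeros(num_fine, num_coarse)
    mapping[torch.arange(num_fine), torch.tensor(parent_of)] = 1.0
    fine_probs = fine_logits.softmax(dim=-1)         # (B, num_fine)
    return fine_probs @ mapping.to(fine_logits)      # (B, num_coarse), consistent by construction

# e.g. parent_of = [0, 0, 1]: fine classes 0 and 1 roll up to coarse class 0, fine class 2 to coarse class 1
```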

ICLR Conference 2024 Conference Paper

Learning Hierarchical Image Segmentation For Recognition and By Recognition

  • Tsung-Wei Ke
  • Sangwoo Mo
  • Stella X. Yu

Large vision and language models learned directly through image-text associations often lack detailed visual substantiation, whereas image segmentation tasks are treated separately from recognition and learned with supervision, without interconnections between the two. Our key observation is that, while an image can be recognized in multiple ways, each has a consistent part-and-whole visual organization. Segmentation thus should be treated not as an end task to be mastered through supervised learning, but as an internal process that evolves with and supports the ultimate goal of recognition. We propose to integrate a hierarchical segmenter into the recognition process, and to train and adapt the entire model solely on image-level recognition objectives. We learn hierarchical segmentation for free alongside recognition, automatically uncovering part-to-whole relationships that not only underpin but also enhance recognition. Enhancing the Vision Transformer (ViT) with adaptive segment tokens and graph pooling, our model surpasses ViT in unsupervised part-whole discovery, semantic segmentation, image classification, and efficiency. Notably, our model (trained on unlabeled 1M ImageNet images) outperforms SAM (trained on 11M images and 1 billion masks) by an absolute 8% in mIoU on PartImageNet object segmentation.

NeurIPS Conference 2023 Conference Paper

ResoNet: Noise-Trained Physics-Informed MRI Off-Resonance Correction

  • Alfredo De Goyeneche Macaya
  • Shreya Ramachandran
  • Ke Wang
  • Ekin Karasan
  • Joseph Y. Cheng
  • Stella X. Yu
  • Michael Lustig

Magnetic Resonance Imaging (MRI) is a powerful medical imaging modality that offers diagnostic information without harmful ionizing radiation. Unlike optical imaging, MRI sequentially samples the spatial Fourier domain (k-space) of the image. Measurements are collected in multiple shots, or readouts, and in each shot, data along a smooth trajectory is sampled. Conventional MRI data acquisition relies on sampling k-space row-by-row in short intervals, which is slow and inefficient. More efficient, non-Cartesian sampling trajectories (e.g., spirals) use longer data readout intervals, but are more susceptible to magnetic field inhomogeneities, leading to off-resonance artifacts. Spiral trajectories cause off-resonance blurring in the image, and the mathematics of this blurring resembles that of optical blurring, where magnetic field variation corresponds to depth and readout duration to aperture size. Off-resonance blurring is a system issue with a physics-based, accurate forward model. We present a physics-informed deep learning framework for off-resonance correction in MRI, which is trained exclusively on synthetic, noise-like data with representative marginal statistics. Our approach allows for fat/water separation and is compatible with parallel imaging acceleration. Through end-to-end training using synthetic randomized data (i.e., noise-like images, coil sensitivities, field maps), we train the network to reverse off-resonance effects across diverse anatomies and contrasts without retraining. We demonstrate the effectiveness of our approach through results on phantom and in-vivo data. This work has the potential to facilitate the clinical adoption of non-Cartesian sampling trajectories, enabling efficient, rapid, and motion-robust MRI scans. Code is publicly available at: https://github.com/mikgroup/ResoNet.

IROS Conference 2023 Conference Paper

The Audio-Visual BatVision Dataset for Research on Sight and Sound

  • Amandine Brunetto
  • Sascha Hornauer
  • Stella X. Yu
  • Fabien Moutarde

Vision research has shown remarkable success in understanding our world, propelled by datasets of images and videos. Sensor data from radar, LiDAR, and cameras has supported research in robotics and autonomous driving for at least a decade. However, while visual sensors may fail in some conditions, sound has recently shown potential to complement sensor data. Simulated room impulse responses (RIR) in 3D apartment models have become a benchmark dataset for the community, fostering a range of audiovisual research. In simulation, depth is predictable from sound by learning bat-like perception with a neural network. Concurrently, the same was achieved in reality by using RGB-D images and echoes of chirping sounds. Biomimicking bat perception is an exciting new direction but needs dedicated datasets to explore its potential. Therefore, we collected the BatVision dataset to provide large-scale echoes in complex real-world scenes to the community. We equipped a robot with a speaker to emit chirps and a binaural microphone to record their echoes. Synchronized RGB-D images from the same perspective provide visual labels of traversed spaces. We sampled spaces ranging from modern US offices to historic French university grounds, indoor and outdoor, with large architectural variety. This dataset will allow research on robot echolocation, general audio-visual tasks, and sound phenomena unavailable in simulated data. We show promising results for audio-only depth prediction and show how state-of-the-art work developed for simulated data can also succeed on our dataset. Project page: https://amandinebtto.github.io/Batvision-Dataset/

ICLR Conference 2021 Conference Paper

Long-tailed Recognition by Routing Diverse Distribution-Aware Experts

  • Xudong Wang 0007
  • Long Lian
  • Zhongqi Miao
  • Ziwei Liu 0002
  • Stella X. Yu

Natural data are often long-tail distributed over semantic classes. Existing recognition methods tackle this imbalanced classification by placing more emphasis on the tail data, through class re-balancing/re-weighting or ensembling over different data groups, resulting in increased tail accuracies but reduced head accuracies. We take a dynamic view of the training data and provide a principled model bias and variance analysis as the training data fluctuates: existing long-tail classifiers invariably increase the model variance, and the head-tail model bias gap remains large, due to more and larger confusion with hard negatives for the tail. We propose a new long-tailed classifier called RoutIng Diverse Experts (RIDE). It reduces the model variance with multiple experts, reduces the model bias with a distribution-aware diversity loss, and reduces the computational cost with a dynamic expert routing module. RIDE outperforms the state of the art by 5% to 7% on the CIFAR100-LT, ImageNet-LT, and iNaturalist 2018 benchmarks. It is also a universal framework that is applicable to various backbone networks, long-tailed algorithms, and training mechanisms for consistent performance gains. Our code is available at: https://github.com/frank-xwang/RIDE-LongTailRecognition.
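
The multi-expert part of RIDE can be pictured as several lightweight classifier heads over a shared backbone whose logits are aggregated. The sketch below shows only that skeleton; RIDE's distribution-aware diversity loss and dynamic expert router are omitted, so treat this as an illustration rather than the paper's method.

```python
import torch
import torch.nn as nn

class MultiExpertHead(nn.Module):
    """Minimal multi-expert skeleton: several lightweight classifier heads share
    one backbone feature and their logits are averaged. RIDE's diversity loss
    and dynamic routing module are not reproduced here."""

    def __init__(self, feat_dim, num_classes, num_experts=3):
        super().__init__()
        self.experts = nn.ModuleList(
            [nn.Linear(feat_dim, num_classes) for _ in range(num_experts)]
        )

    def forward(self, feat):
        # per-expert logits: (num_experts, batch, num_classes)
        logits = torch.stack([expert(feat) for expert in self.experts], dim=0)
        return logits.mean(dim=0), logits  # ensemble prediction plus individual experts
```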

AAAI Conference 2021 Conference Paper

Tied Block Convolution: Leaner and Better CNNs with Shared Thinner Filters

  • Xudong Wang
  • Stella X. Yu

Convolution is the main building block of a convolutional neural network (CNN). We observe that an optimized CNN often has highly correlated filters as the number of channels increases with depth, reducing the expressive power of feature representations. We propose Tied Block Convolution (TBC) that shares the same thinner filter over equal blocks of channels and produces multiple responses with a single filter. The concept of TBC can also be extended to group convolution and fully connected layers, and can be applied to various backbone networks and attention modules. Our extensive experimentation on classification, detection, instance segmentation, and attention demonstrates that TBC is consistently leaner and significantly better than standard convolution and group convolution. On attention, with 64× fewer parameters, our TiedSE performs on par with the standard SE. On detection and segmentation, TBC can effectively handle highly overlapping instances, whereas standard CNNs often fail to accurately aggregate information in the presence of occlusion and result in multiple redundant partial object proposals. By sharing filters across channels, TBC reduces correlation and delivers a sizable gain of 6% in the average precision for object detection on MS-COCO when the occlusion ratio is 80%. Our code is publicly available.
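
The channel-block sharing that defines TBC is easy to express: split the input channels into equal blocks and run the same thin filter bank over every block. Below is a minimal sketch under that reading of the abstract; details such as bias handling and how TBC plugs into group convolution or SE modules follow the paper and are not reproduced here.

```python
import torch
import torch.nn as nn

class TiedBlockConv2d(nn.Module):
    """Sketch of Tied Block Convolution: split the channels into `blocks` equal
    groups and apply the *same* thin filter bank to every group, so one filter
    produces multiple responses. Minimal illustration; the paper's full design
    may differ in details."""

    def __init__(self, in_ch, out_ch, kernel_size, blocks=2, stride=1, padding=0):
        super().__init__()
        assert in_ch % blocks == 0 and out_ch % blocks == 0
        self.blocks = blocks
        self.conv = nn.Conv2d(in_ch // blocks, out_ch // blocks, kernel_size,
                              stride=stride, padding=padding, bias=False)

    def forward(self, x):
        b, c, h, w = x.shape
        # fold the channel blocks into the batch dimension, convolve once, unfold
        x = x.reshape(b * self.blocks, c // self.blocks, h, w)
        y = self.conv(x)
        return y.reshape(b, -1, y.shape[-2], y.shape[-1])
```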

ICLR Conference 2021 Conference Paper

Universal Weakly Supervised Segmentation by Pixel-to-Segment Contrastive Learning

  • Tsung-Wei Ke
  • Jyh-Jing Hwang
  • Stella X. Yu

Weakly supervised segmentation requires assigning a label to every pixel based on training instances with partial annotations such as image-level tags, object bounding boxes, labeled points, and scribbles. This task is challenging, as coarse annotations (tags, boxes) lack precise pixel localization, whereas sparse annotations (points, scribbles) lack broad region coverage. Existing methods tackle these two types of weak supervision differently: class activation maps are used to localize coarse labels and iteratively refine the segmentation model, whereas conditional random fields are used to propagate sparse labels to the entire image. We formulate weakly supervised segmentation as a semi-supervised metric learning problem, where pixels of the same (different) semantics need to be mapped to the same (distinctive) features. We propose four types of contrastive relationships between pixels and segments in the feature space, capturing low-level image similarity, semantic annotation, co-occurrence, and feature affinity. They act as priors; the pixel-wise feature can be learned from training images with any partial annotations in a data-driven fashion. In particular, unlabeled pixels in training images participate not only in data-driven grouping within each image, but also in discriminative feature learning within and across images. We deliver a universal weakly supervised segmenter with significant gains on Pascal VOC and DensePose. Our code is publicly available at https://github.com/twke18/SPML.
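
The pixel-to-segment contrastive idea boils down to an InfoNCE-style loss between pixel embeddings and segment embeddings, where the positive set for each pixel is defined by one of the four relationships above. The following sketch shows only that shared loss form with a hypothetical `positive_mask`; it is not the released SPML code.

```python
import torch
import torch.nn.functional as F

def pixel_segment_contrastive(pixel_feats, segment_feats, positive_mask, temperature=0.1):
    """InfoNCE-style pixel-to-segment loss: pull each pixel toward the segments
    marked positive for it and push it away from the rest. The paper combines
    four relationship types (low-level similarity, semantic annotation,
    co-occurrence, feature affinity); this shows only the shared loss form.
    `positive_mask` is a hypothetical (N_pixels, N_segments) boolean tensor."""
    pos = positive_mask.float()                          # (N_pixels, N_segments)
    p = F.normalize(pixel_feats, dim=1)                  # (N_pixels, D)
    s = F.normalize(segment_feats, dim=1)                # (N_segments, D)
    sim = (p @ s.t()) / temperature                      # scaled cosine similarities
    log_prob = sim - torch.logsumexp(sim, dim=1, keepdim=True)
    loss = -(log_prob * pos).sum(dim=1) / pos.sum(dim=1).clamp(min=1.0)
    return loss.mean()
```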

ICRA Conference 2020 Conference Paper

BatVision: Learning to See 3D Spatial Layout with Two Ears

  • Jesper Haahr Christensen
  • Sascha Hornauer
  • Stella X. Yu

Many species have evolved advanced non-visual perception while artificial systems fall behind. Radar and ultrasound complement camera-based vision but they are often too costly and complex to set up for very limited information gain. In nature, sound is used effectively by bats, dolphins, whales, and humans for navigation and communication. However, it is unclear how to best harness sound for machine perception. Inspired by bats' echolocation mechanism, we design a low-cost BatVision system that is capable of seeing the 3D spatial layout of space ahead by just listening with two ears. Our system emits short chirps from a speaker and records returning echoes through microphones in an artificial human pinnae pair. During training, we additionally use a stereo camera to capture color images for calculating scene depths. We train a model to predict depth maps and even grayscale images from the sound alone. During testing, our trained BatVision provides surprisingly good predictions of 2D visual scenes from two 1D audio signals. Such a sound-to-vision system would benefit robot navigation and machine vision, especially in low-light or no-light conditions. Our code and data are publicly available.
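
At its core, the learning task here is a translation from a two-channel echo representation (e.g., a binaural spectrogram) to a depth map. The sketch below is a deliberately small, hypothetical encoder-decoder to make that input/output contract concrete; the actual BatVision architecture in the paper differs.

```python
import torch
import torch.nn as nn

class EchoToDepth(nn.Module):
    """Illustrative audio-to-depth model in the spirit of BatVision: encode a
    two-channel echo spectrogram with a small CNN and decode it into a depth
    map. The layer sizes are hypothetical; the paper's architecture differs."""

    def __init__(self):
        super().__init__()
        self.encoder = nn.Sequential(
            nn.Conv2d(2, 32, 4, stride=2, padding=1), nn.ReLU(),   # binaural spectrogram in
            nn.Conv2d(32, 64, 4, stride=2, padding=1), nn.ReLU(),
        )
        self.decoder = nn.Sequential(
            nn.ConvTranspose2d(64, 32, 4, stride=2, padding=1), nn.ReLU(),
            nn.ConvTranspose2d(32, 1, 4, stride=2, padding=1),     # depth map out
        )

    def forward(self, spec):                 # spec: (B, 2, F, T)
        return self.decoder(self.encoder(spec))
```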