Arrow Research search

Author name cluster

Alex Wong

Possible papers associated with this exact author name in Arrow. This page groups case-insensitive exact name matches and is not a full identity disambiguation profile.

7 papers
2 author rows

Possible papers (7)

IROS 2025 Conference Paper

Data-Bootstrapped, Physics-Informed Framework for Object Rearrangement

  • Alex Wong
  • Zhiwei Dong

Object rearrangement, which involves arranging objects step by step to achieve tidy states, is critical in robotic applications. Progress in this area is often constrained by issues such as costly data collection and physically infeasible trajectory prediction. To address these challenges, we propose the Data-Bootstrapped, Physics-Informed Rearrangement (DPR) framework, which uses a transformer for sequential decision making. Specifically, DPR integrates Enhanced Data Generation with a Physics Reward Feedback Transformer. Enhanced Data Generation consists of Random Trajectory Reverse, which produces high-quality training data, and Bootstrapped Trajectory Synthesis, which leverages the transformer's sequence modeling to diversify training trajectories. To ensure the feasibility of the generated trajectories and to improve the transformer's performance, we incorporate a Physical Reward Feedback mechanism into the transformer. Experiments on ball and room rearrangement tasks show that DPR significantly outperforms existing methods in terms of both efficiency and effectiveness. Code will be released soon.
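One plausible reading of Random Trajectory Reverse — scrambling a tidy goal state with random actions, then reversing the sequence so demonstrations end at the goal — can be sketched in a few lines. This is our illustration, not the paper's API; the function names and the `scramble` interface are assumptions:

```python
import random

def random_trajectory_reverse(goal_state, scramble, steps, seed=None):
    """Generate a rearrangement demonstration by scrambling a tidy goal
    state with random actions and reversing the sequence, so the
    resulting trajectory ends at the goal."""
    rng = random.Random(seed)
    states = [goal_state]
    for _ in range(steps):
        states.append(scramble(states[-1], rng))
    return list(reversed(states))  # trajectory runs messy -> tidy
```

Any invertible scrambling action works; the reversed sequence is guaranteed to terminate in the tidy state, which is what makes such data cheap to produce.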

NeurIPS 2025 Conference Paper

STree: Speculative Tree Decoding for Hybrid State Space Models

  • Yangchao Wu
  • Zongyue Qin
  • Alex Wong
  • Stefano Soatto

Speculative decoding is a technique that leverages hardware concurrency to enable multiple steps of token generation in a single forward pass, improving the efficiency of large-scale autoregressive (AR) Transformer models. State-space models (SSMs) are already more efficient than AR Transformers, since their state summarizes all past data with no need to cache or re-process tokens in a sliding-window context. However, their state can also comprise thousands of tokens, so speculative decoding has recently been extended to SSMs. Existing approaches, however, do not leverage tree-based verification methods, since current SSMs lack the means to compute a token tree efficiently. We propose the first scalable algorithm to perform tree-based speculative decoding in state-space models (SSMs) and hybrid architectures of SSMs and Transformer layers. We exploit the structure of accumulated state transition matrices to facilitate tree-based speculative decoding with minimal overhead relative to current SSM implementations. Along with the algorithm, we describe a hardware-aware implementation that improves on the naive application of AR Transformer tree-based speculative decoding methods to SSMs. Furthermore, we outperform vanilla speculative decoding with SSMs even with a baseline drafting model and tree structure on three different benchmarks, opening up opportunities for further speed-ups in SSM and hybrid model inference. Code is available at: https://github.com/wyc1997/stree.
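For intuition, the generic tree-verification step (independent of the paper's SSM-specific state-transition machinery) can be sketched as: walk the draft tree, accept each drafted token that matches the target model's greedy choice, and emit one correction token from the target on the first mismatch. The data layout below is our own simplification:

```python
def verify_token_tree(tree, prefix, target_next):
    """Accept the longest root-to-node path in a draft token tree whose
    tokens all match the target model's greedy choice, then append one
    correction token from the target on the first mismatch.

    tree: maps a tuple of accepted tokens to the list of drafted children.
    target_next: the target model's greedy next-token function."""
    accepted = []
    while True:
        want = target_next(prefix + accepted)
        if want in tree.get(tuple(accepted), []):
            accepted.append(want)  # drafted token verified, descend
        else:
            accepted.append(want)  # target's own token ends the round
            return accepted
```

With greedy decoding this is lossless: the emitted tokens are exactly what the target model would have produced one step at a time, but several of them are verified per round.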

NeurIPS 2025 Conference Paper

TREND: Unsupervised 3D Representation Learning via Temporal Forecasting for LiDAR Perception

  • Runjian Chen
  • Hyoungseob Park
  • Bo Zhang
  • Wenqi Shao
  • Ping Luo
  • Alex Wong

Labeling LiDAR point clouds is notoriously time- and energy-consuming, which has spurred recent unsupervised 3D representation learning methods that alleviate the labeling burden in LiDAR perception via pretrained weights. Existing work focuses on either masked autoencoding or contrastive learning on LiDAR point clouds, neglecting the temporal LiDAR sequence that naturally accounts for object motion (and its semantics). Instead, we propose TREND, short for Temporal REndering with Neural fielD, which learns 3D representations by forecasting future observations in an unsupervised manner. TREND integrates forecasting into 3D pre-training through a Recurrent Embedding scheme that generates 3D embeddings across time and a Temporal LiDAR Neural Field, designed specifically for the LiDAR modality, to represent the 3D scene, with which we compute the loss using differentiable rendering. We evaluate TREND on 3D object detection and LiDAR semantic segmentation tasks on popular datasets, including Once, Waymo, NuScenes, and SemanticKITTI. TREND generally improves from-scratch models across datasets and tasks, bringing gains of 1.77% mAP on Once and 2.11% mAP on NuScenes, up to 400% more improvement than previous SOTA unsupervised 3D pre-training methods. Codes and models will be available.

NeurIPS 2024 Conference Paper

RSA: Resolving Scale Ambiguities in Monocular Depth Estimators through Language Descriptions

  • Ziyao Zeng
  • Yangchao Wu
  • Hyoungseob Park
  • Daniel Wang
  • Fengyu Yang
  • Stefano Soatto
  • Dong Lao
  • Byung-Woo Hong
  • Alex Wong

We propose a method for metric-scale monocular depth estimation. Inferring depth from a single image is an ill-posed problem due to the loss of scale from perspective projection during the image formation process. Any scale chosen is a bias, typically stemming from training on a dataset; hence, existing works have instead opted to use relative (normalized, inverse) depth. Our goal is to recover metric-scaled depth maps through a linear transformation. The crux of our method lies in the observation that certain objects (e.g., cars, trees, street signs) are typically found in or associated with certain types of scenes (e.g., outdoor). We explore whether language descriptions can be used to transform relative depth predictions into metric scale. Our method, RSA, takes as input a text caption describing the objects present in an image and outputs the parameters of a linear transformation that can be applied globally to a relative depth map to yield metric-scaled depth predictions. We demonstrate our method on recent general-purpose monocular depth models on indoor (NYUv2, VOID) and outdoor (KITTI) benchmarks. When trained on multiple datasets, RSA can serve as a general alignment module in zero-shot settings. Our method improves over common practices for aligning relative to metric depth and yields predictions comparable to an upper bound obtained by fitting relative depth to ground truth via a linear transformation. Code is available at: https://github.com/Adonis-galaxy/RSA.
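The alignment step itself — one global linear map from relative to metric depth — is easy to sketch. Below, `fit_linear_upper_bound` reproduces the least-squares "upper bound" baseline the abstract compares against; in RSA itself, the scale and shift would be predicted from the text caption rather than fit to ground truth. Function names here are ours:

```python
import numpy as np

def apply_rsa_alignment(relative_depth, scale, shift):
    """Map a relative depth map to metric depth with one global linear
    transform; in RSA, (scale, shift) come from a text caption."""
    return scale * relative_depth + shift

def fit_linear_upper_bound(relative_depth, gt_depth):
    """Least-squares fit of (scale, shift) against ground truth: the
    'upper bound' alignment the abstract mentions."""
    A = np.stack([relative_depth.ravel(),
                  np.ones(relative_depth.size)], axis=1)
    scale, shift = np.linalg.lstsq(A, gt_depth.ravel(), rcond=None)[0]
    return scale, shift
```

Because the transform is global (two scalars per image), the relative depth model's ordering of the scene is preserved and only its scale ambiguity is resolved.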

AAAI 2021 Conference Paper

Stereopagnosia: Fooling Stereo Networks with Adversarial Perturbations

  • Alex Wong
  • Mukund Mundhra
  • Stefano Soatto

We study the effect of adversarial perturbations of images on the estimates of disparity by deep learning models trained for stereo. We show that imperceptible additive perturbations can significantly alter the disparity map, and correspondingly the perceived geometry of the scene. These perturbations not only affect the specific model they are crafted for, but also transfer to models with different architectures trained with different loss functions. We show that, when used for adversarial data augmentation, our perturbations result in trained models that are more robust, without sacrificing overall accuracy of the model. This is unlike what has been observed in image classification, where adding the perturbed images to the training set makes the model less vulnerable to adversarial perturbations, but to the detriment of overall accuracy. We test our method using the most recent stereo networks and evaluate their performance on public benchmark datasets.

NeurIPS 2020 Conference Paper

Targeted Adversarial Perturbations for Monocular Depth Prediction

  • Alex Wong
  • Safa Cicek
  • Stefano Soatto

We study the effect of adversarial perturbations on the task of monocular depth prediction. Specifically, we explore the ability of small, imperceptible additive perturbations to selectively alter the perceived geometry of the scene. We show that such perturbations can not only globally re-scale the predicted distances from the camera, but also alter the prediction to match a different target scene. We also show that, when given semantic or instance information, perturbations can fool the network to alter the depth of specific categories or instances in the scene, and even remove them while preserving the rest of the scene. To understand the effect of targeted perturbations, we conduct experiments on state-of-the-art monocular depth prediction methods. Our experiments reveal vulnerabilities in monocular depth prediction networks, and shed light on the biases and context learned by them.
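The attack pattern behind such targeted perturbations is a projected gradient step toward a target output, constrained to an imperceptibly small L-infinity ball. A minimal sketch using a linear stand-in for the depth network (the paper attacks deep monocular models; this toy model only shows the pattern, and all names are ours):

```python
import numpy as np

def targeted_perturbation(W, x, target, eps=0.1, alpha=0.01, steps=50):
    """PGD-style targeted attack on a linear stand-in 'depth model'
    f(x) = W @ x: nudge delta so predictions move toward `target`,
    keeping the perturbation inside an L-infinity ball of radius eps."""
    delta = np.zeros_like(x)
    for _ in range(steps):
        residual = W @ (x + delta) - target
        grad = W.T @ residual  # gradient of 0.5 * ||f(x+delta) - target||^2
        delta = np.clip(delta - alpha * np.sign(grad), -eps, eps)
    return delta
```

The `eps` box is what keeps the perturbation imperceptible; against a real network the analytic gradient above would be replaced by backpropagation through the model.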

NeurIPS 2011 Conference Paper

Fast and Accurate k-means For Large Datasets

  • Michael Shindler
  • Alex Wong
  • Adam Meyerson

Clustering is a popular problem with many applications. We consider the k-means problem in the situation where the data is too large to be stored in main memory and must be accessed sequentially, such as from a disk, and where we must use as little memory as possible. Our algorithm is based on recent theoretical results, with significant improvements to make it practical. Our approach greatly simplifies a recently developed algorithm, both in design and in analysis, and eliminates large constant factors in the approximation guarantee, the memory requirements, and the running time. We then incorporate approximate nearest neighbor search to compute k-means in o(nk) time (where n is the number of data points; note that computing the cost, given a solution, takes Θ(nk) time). We show that our algorithm compares favorably to existing algorithms, both theoretically and experimentally, thus providing state-of-the-art performance in both theory and practice.
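For contrast, the naive single-pass baseline — online k-means with incremental mean updates — illustrates the sequential-access, O(k)-memory setting the paper targets, without its approximation guarantees or ANN acceleration:

```python
import numpy as np

def streaming_kmeans(stream, k):
    """One-pass online k-means: the first k points seed the centers, then
    each arriving point pulls its nearest center toward it with an
    incremental-mean update. Memory is O(k); data is read sequentially."""
    centers, counts = [], []
    for x in stream:
        x = np.array(x, dtype=float)
        if len(centers) < k:
            centers.append(x)
            counts.append(1)
            continue
        j = int(np.argmin([np.linalg.norm(x - c) for c in centers]))
        counts[j] += 1
        centers[j] += (x - centers[j]) / counts[j]  # running mean
    return np.array(centers)
```

This baseline has no quality guarantee (seeding from the first k points can be arbitrarily bad), which is precisely the gap the paper's facility-location-based approach with ANN search closes.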