Arrow Research

Author name cluster

Jia Deng

Possible papers associated with this exact author name in Arrow. This page groups case-insensitive exact name matches and is not a full identity disambiguation profile.

24 papers
2 author rows

Possible papers

24

ICLR Conference 2025 Conference Paper

Advancing LLM Reasoning Generalists with Preference Trees

  • Lifan Yuan
  • Ganqu Cui
  • Hanbin Wang
  • Ning Ding 0002
  • Xingyao Wang 0002
  • Boji Shan
  • Zeyuan Liu
  • Jia Deng

We introduce EURUS, a suite of large language models (LLMs) optimized for reasoning. Finetuned from Mistral-7B, Llama-3-8B, and Mixtral-8x22B, EURUS models achieve state-of-the-art results among open-source models on a diverse set of benchmarks covering mathematics, code generation, and logical reasoning problems. Notably, EURUX-8X22B outperforms GPT-3.5 Turbo in reasoning in a comprehensive benchmarking across 12 test sets covering five tasks. The strong performance of EURUS can be primarily attributed to ULTRAINTERACT, our newly curated large-scale, high-quality training dataset specifically designed for complex reasoning tasks. ULTRAINTERACT can be used in supervised fine-tuning, preference learning, and reward modeling. It pairs each instruction with a preference tree consisting of (1) reasoning chains with diverse planning strategies in a unified format, (2) multi-turn interaction trajectories with the environment and the critique, and (3) pairwise positive and negative responses to facilitate preference learning. ULTRAINTERACT allows us to conduct an in-depth exploration of preference learning for reasoning tasks. Our investigation reveals that some well-established preference learning algorithms may be less suitable for reasoning tasks than they are for general conversations. The hypothesis is that in reasoning tasks, the space of correct answers is much smaller than that of incorrect ones, so it is necessary to explicitly increase the reward of chosen data. Therefore, in addition to increasing the reward margin as many preference learning algorithms do, the absolute values of positive responses’ rewards should be positive and may serve as a proxy for performance. Inspired by this, we derive a novel reward modeling objective and empirically show that it leads to a stable reward modeling curve and better performance. Together with ULTRAINTERACT, we obtain a strong reward model.
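
The abstract's argument about absolute rewards suggests a concrete objective. Below is a minimal PyTorch sketch of one way to combine a standard Bradley-Terry margin term with a term that pushes chosen rewards positive; it illustrates the idea only, and the exact EURUS objective may differ.

    import torch.nn.functional as F

    def reward_modeling_loss(r_chosen, r_rejected):
        # r_chosen, r_rejected: (batch,) scalar rewards for paired responses.
        # Margin term: increase r_chosen - r_rejected, as in Bradley-Terry.
        l_margin = -F.logsigmoid(r_chosen - r_rejected)
        # Absolute term: push chosen rewards above zero (and rejected below),
        # since in reasoning the space of correct answers is small.
        l_abs = -F.logsigmoid(r_chosen) - F.logsigmoid(-r_rejected)
        return (l_margin + l_abs).mean()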

NeurIPS Conference 2025 Conference Paper

Evaluating Robustness of Monocular Depth Estimation with Procedural Scene Perturbations

  • Jack Nugent
  • Siyang Wu
  • Zeyu Ma
  • Beining Han
  • Meenal Parakh
  • Abhishek Joshi
  • Lingjie Mei
  • Alexander Raistrick

Recent years have witnessed substantial progress on monocular depth estimation, particularly as measured by the success of large models on standard benchmarks. However, performance on standard benchmarks does not offer a complete assessment, because most evaluate accuracy but not robustness. In this work, we introduce PDE (Procedural Depth Evaluation), a new benchmark which enables systematic evaluation of robustness to changes in 3D scene content. PDE uses procedural generation to create 3D scenes that test robustness to various controlled perturbations, including object, camera, material and lighting changes. Our analysis yields interesting findings on what perturbations are challenging for state-of-the-art depth models, which we hope will inform further research. Code and data are available at https://github.com/princeton-vl/proc-depth-eval.
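
As a sketch of the kind of robustness measurement the abstract describes, the snippet below compares a standard depth-accuracy metric between a clean render and its perturbed counterpart; the metric and scoring protocol here are illustrative assumptions, not the benchmark's official ones.

    import numpy as np

    def delta1(pred, gt):
        # Standard depth metric: fraction of pixels where the prediction is
        # within a factor of 1.25 of ground truth (assumes positive depths).
        ratio = np.maximum(pred / gt, gt / pred)
        return (ratio < 1.25).mean()

    def robustness_drop(pred_clean, gt_clean, pred_pert, gt_pert):
        # Accuracy lost when one controlled perturbation (object, camera,
        # material, or lighting change) is applied to the same 3D scene.
        return delta1(pred_clean, gt_clean) - delta1(pred_pert, gt_pert)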

NeurIPS Conference 2025 Conference Paper

InFlux: A Benchmark for Self-Calibration of Dynamic Intrinsics of Video Cameras

  • Erich Liang
  • Roma Bhattacharjee
  • Sreemanti Dey
  • Rafael Moschopoulos
  • Caitlin Wang
  • Michel Liao
  • Grace Tan
  • Andrew Wang

Accurately tracking camera intrinsics is crucial for achieving 3D understanding from 2D video. However, most 3D algorithms assume that camera intrinsics stay constant throughout a video, which is often not true for many real-world in-the-wild videos. A major obstacle in this field is a lack of dynamic camera intrinsics benchmarks: existing benchmarks typically offer limited diversity in scene content and intrinsics variation, and none provide per-frame intrinsic changes for consecutive video frames. In this paper, we present Intrinsics in Flux (InFlux), a real-world benchmark that provides per-frame ground truth intrinsics annotations for videos with dynamic intrinsics. Compared to prior benchmarks, InFlux captures a wider range of intrinsic variations and scene diversity, featuring 143K+ annotated frames from 386 high-resolution indoor and outdoor videos with dynamic camera intrinsics. To ensure accurate per-frame intrinsics, we build a comprehensive lookup table of calibration experiments and extend the Kalibr toolbox to improve its accuracy and robustness. Using our benchmark, we evaluate existing baseline methods for predicting camera intrinsics and find that most struggle to achieve accurate predictions on videos with dynamic intrinsics. For the dataset, code, videos, and submission, please visit https://influx.cs.princeton.edu/.
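
To make "per-frame intrinsics evaluation" concrete, here is a minimal sketch of one plausible error metric over a video with time-varying focal length; this metric is an assumption for illustration, not the benchmark's published protocol.

    import numpy as np

    def mean_relative_focal_error(pred_fx, gt_fx):
        # pred_fx, gt_fx: (num_frames,) focal lengths in pixels, one per frame.
        # With dynamic intrinsics the error must be computed frame by frame
        # rather than once per video.
        return float(np.mean(np.abs(pred_fx - gt_fx) / gt_fx))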

ICLR Conference 2025 Conference Paper

Neuron based Personality Trait Induction in Large Language Models

  • Jia Deng
  • Tianyi Tang
  • Yanbin Yin
  • Wenhao Yang
  • Xin Zhao 0018
  • Ji-Rong Wen

Large language models (LLMs) have become increasingly proficient at simulating various personality traits, an important capability for supporting related applications (e.g., role-playing). To further improve this capacity, in this paper, we present a neuron-based approach for personality trait induction in LLMs, with three major technical contributions. First, we construct PERSONALITYBENCH, a large-scale dataset for identifying and evaluating personality traits in LLMs. This dataset is grounded in the Big Five personality traits from psychology and designed to assess the generative capabilities of LLMs towards specific personality traits. Second, by leveraging PERSONALITYBENCH, we propose an efficient method for identifying personality-related neurons within LLMs by examining the opposite aspects of a given trait. Third, we develop a simple yet effective induction method that manipulates the values of these identified personality-related neurons, which enables fine-grained control over the traits exhibited by LLMs without training or modifying model parameters. Extensive experiments validate the efficacy of our neuron identification and trait induction methods. Notably, our approach achieves performance comparable to fine-tuned models, offering a more efficient and flexible solution for personality trait induction in LLMs.
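
Manipulating neuron values without touching weights is typically done with forward hooks. The sketch below shows one way this could look in PyTorch; the layer path and neuron indices are hypothetical placeholders, and the actual method's scaling scheme may differ.

    import torch

    def induce_trait(layer, neuron_ids, scale=5.0):
        # Scale the activations of identified personality-related neurons in a
        # single layer on every forward pass; no parameters are modified.
        ids = torch.tensor(neuron_ids)

        def hook(module, inputs, output):
            output[..., ids] = output[..., ids] * scale
            return output

        return layer.register_forward_hook(hook)

    # Hypothetical usage with a Llama-style model:
    # handle = induce_trait(model.model.layers[12].mlp.up_proj, [311, 4056])
    # ... generate text with the trait induced ...
    # handle.remove()  # restores the original behavior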

NeurIPS Conference 2025 Conference Paper

VLMs have Tunnel Vision: Evaluating Nonlocal Visual Reasoning in Leading VLMs

  • Shmuel Berman
  • Jia Deng

Vision Language Models (VLMs) excel at complex visual tasks such as VQA and chart understanding, yet recent work suggests they struggle with simple perceptual tests. We present an evaluation that tests vision-language models’ capacity for non-local visual reasoning: reasoning that requires chaining evidence collected from multiple, possibly distant, regions of an image. We isolate three distinct forms of non-local vision: comparative perception, which demands holding two images in working memory and comparing them; saccadic search, which requires making discrete, evidence-driven jumps to locate successive targets; and smooth visual search, which involves searching smoothly along a continuous contour. Flagship models (e.g., GPT-5, Gemini 2.5 Pro, Claude Sonnet 4), even those that perform well on prior primitive-vision benchmarks, fail these tests and barely exceed random accuracy on two variants of our tasks that are trivial for humans. Our structured evaluation suite allows us to test whether VLMs can perform visual algorithms similar to those humans use. Our findings show that despite gains in raw visual acuity, current models lack core visual reasoning capabilities.

ICLR Conference 2024 Conference Paper

Llemma: An Open Language Model for Mathematics

  • Zhangir Azerbayev
  • Hailey Schoelkopf
  • Keiran Paster
  • Marco Dos Santos
  • Stephen Marcus McAleer
  • Albert Q. Jiang
  • Jia Deng
  • Stella Biderman

We present Llemma, a large language model for mathematics. We continue pretraining Code Llama on the Proof-Pile-2, a mixture of scientific papers, web data containing mathematics, and mathematical code, yielding Llemma. On the MATH benchmark Llemma outperforms all known openly released models, as well as the unreleased Minerva model suite on an equi-parameter basis. Moreover, Llemma is capable of tool use and formal theorem proving without any finetuning. We openly release all artifacts, including 7 billion and 34 billion parameter models, the Proof-Pile-2, and code to replicate our experiments.

NeurIPS Conference 2023 Conference Paper

Deep Patch Visual Odometry

  • Zachary Teed
  • Lahav Lipson
  • Jia Deng

We propose Deep Patch Visual Odometry (DPVO), a new deep learning system for monocular Visual Odometry (VO). DPVO uses a novel recurrent network architecture designed for tracking image patches across time. Recent approaches to VO have significantly improved the state-of-the-art accuracy by using deep networks to predict dense flow between video frames. However, using dense flow incurs a large computational cost, making these previous methods impractical for many use cases. Despite this, it has been assumed that dense flow is important as it provides additional redundancy against incorrect matches. DPVO disproves this assumption, showing that it is possible to get the best accuracy and efficiency by exploiting the advantages of sparse patch-based matching over dense flow. DPVO introduces a novel recurrent update operator for patch-based correspondence coupled with differentiable bundle adjustment. On standard benchmarks, DPVO outperforms all prior work, including the learning-based state-of-the-art VO system DROID, using a third of the memory while running 3x faster on average. Code is available at https://github.com/princeton-vl/DPVO.

TMLR Journal 2023 Journal Article

Learning Symbolic Rules for Reasoning in Quasi-Natural Language

  • Kaiyu Yang
  • Jia Deng

Symbolic reasoning, rule-based symbol manipulation, is a hallmark of human intelligence. However, rule-based systems have had limited success competing with learning-based systems outside formalized domains such as automated theorem proving. We hypothesize that this is due to the manual construction of rules in past attempts. In this work, we take initial steps towards rule-based systems that can reason with natural language but without manually constructed rules. We propose MetaQNL, a "Quasi-Natural Language" that can express both formal logic and natural language sentences, and MetaInduce, a learning algorithm that induces MetaQNL rules from training data consisting of questions and answers, with or without intermediate reasoning steps. In addition, we introduce soft matching—a flexible mechanism for applying rules without rigid matching, overcoming a typical source of brittleness in symbolic reasoning. Our approach achieves state-of-the-art accuracies on multiple reasoning benchmarks; it learns compact models with much less data and produces not only answers but also checkable proofs. Further, experiments on two simple real-world datasets demonstrate that our method can handle noise and ambiguity.
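
A toy illustration of the difference between rigid and soft rule matching (variables written as $x; real MetaQNL variables can match multi-token spans and the actual scoring is more involved, so treat this as a sketch under simplifying assumptions):

    def rigid_match(pattern, sentence, bindings=None):
        # Unify pattern tokens against sentence tokens; '$'-prefixed tokens are
        # variables (one token each here, for simplicity). Any mismatch fails.
        bindings = dict(bindings or {})
        if len(pattern) != len(sentence):
            return None
        for p, s in zip(pattern, sentence):
            if p.startswith("$"):
                if bindings.setdefault(p, s) != s:
                    return None
            elif p != s:
                return None
        return bindings

    def soft_score(pattern, sentence):
        # Soft matching: score the fraction of agreeing non-variable tokens
        # instead of failing outright, so near-misses can still fire a rule.
        pairs = [(p, s) for p, s in zip(pattern, sentence) if not p.startswith("$")]
        return sum(p == s for p, s in pairs) / len(pairs) if pairs else 1.0

    rule = "every $x is a $y".split()
    print(rigid_match(rule, "every dog is a mammal".split()))  # binds $x, $y
    print(soft_score(rule, "every dog is an animal".split()))  # 0.67, not 0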

NeurIPS Conference 2023 Conference Paper

Siamese Masked Autoencoders

  • Agrim Gupta
  • Jiajun Wu
  • Jia Deng
  • Fei-Fei Li

Establishing correspondence between images or scenes is a significant challenge in computer vision, especially given occlusions, viewpoint changes, and varying object appearances. In this paper, we present Siamese Masked Autoencoders (SiamMAE), a simple extension of Masked Autoencoders (MAE) for learning visual correspondence from videos. SiamMAE operates on pairs of randomly sampled video frames and asymmetrically masks them. These frames are processed independently by an encoder network, and a decoder composed of a sequence of cross-attention layers is tasked with predicting the missing patches in the future frame. By masking a large fraction (95%) of patches in the future frame while leaving the past frame unchanged, SiamMAE encourages the network to focus on object motion and learn object-centric representations. Despite its conceptual simplicity, features learned via SiamMAE outperform state-of-the-art self-supervised methods on video object segmentation, pose keypoint propagation, and semantic part propagation tasks. SiamMAE achieves competitive results without relying on data augmentation, handcrafted tracking-based pretext tasks, or other techniques to prevent representational collapse.
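
The core of the method is the asymmetric masking of a frame pair, which is easy to sketch; the snippet below shows the patch-selection step only (encoder and cross-attention decoder omitted), with the shapes and the 95% ratio taken from the abstract.

    import torch

    def asymmetric_mask(past_tokens, future_tokens, mask_ratio=0.95):
        # past_tokens, future_tokens: (batch, num_patches, dim) patch embeddings
        # of two randomly sampled frames from the same video. The past frame is
        # kept intact; only a random 5% of future-frame patches stay visible.
        B, N, D = future_tokens.shape
        num_keep = int(N * (1 - mask_ratio))
        noise = torch.rand(B, N, device=future_tokens.device)
        keep = noise.argsort(dim=1)[:, :num_keep]
        visible = torch.gather(
            future_tokens, 1, keep.unsqueeze(-1).expand(-1, -1, D))
        return past_tokens, visible, keep  # keep: indices to reconstruct later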

NeurIPS Conference 2022 Conference Paper

Non-deep Networks

  • Ankit Goyal
  • Alexey Bochkovskiy
  • Jia Deng
  • Vladlen Koltun

Latency is of utmost importance in safety-critical systems. In neural networks, the lowest theoretical latency depends on the depth of the network. This raises the question: is it possible to build high-performing "non-deep" neural networks? We show that it is. To do so, we use parallel subnetworks instead of stacking one layer after another. This helps effectively reduce depth while maintaining high performance. By utilizing parallel substructures, we show, for the first time, that a network with a depth of just 12 can achieve top-1 accuracy over 80% on ImageNet, 96% on CIFAR10, and 81% on CIFAR100. We also show that a network with a low-depth (12) backbone can achieve an AP of 48% on MS-COCO. We analyze the scaling rules for our design and show how to increase performance without changing the network's depth. Finally, we provide a proof of concept for how non-deep networks could be used to build low-latency recognition systems. Code is available at https://github.com/imankgoyal/NonDeepNetworks.
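
The "parallel instead of deeper" idea can be sketched in a few lines of PyTorch. This shows only the structural idea; the paper's actual blocks (branch designs, fusion, downsampling) are more elaborate.

    import torch.nn as nn

    class ParallelBlock(nn.Module):
        # Several shallow branches process the same input side by side and are
        # fused by summation: capacity grows with width, not depth, so the
        # critical path (and hence theoretical latency) stays short.
        def __init__(self, channels, num_branches=3):
            super().__init__()
            self.branches = nn.ModuleList(
                nn.Sequential(
                    nn.Conv2d(channels, channels, 3, padding=1),
                    nn.BatchNorm2d(channels),
                    nn.SiLU(),
                )
                for _ in range(num_branches)
            )

        def forward(self, x):
            return sum(branch(x) for branch in self.branches)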

NeurIPS Conference 2021 Conference Paper

DROID-SLAM: Deep Visual SLAM for Monocular, Stereo, and RGB-D Cameras

  • Zachary Teed
  • Jia Deng

We introduce DROID-SLAM, a new deep learning based SLAM system. DROID-SLAM consists of recurrent iterative updates of camera pose and pixelwise depth through a Dense Bundle Adjustment layer. DROID-SLAM is accurate, achieving large improvements over prior work, and robust, suffering from substantially fewer catastrophic failures. Despite training on monocular video, it can leverage stereo or RGB-D video to achieve improved performance at test time. Our open-source code is available at https://github.com/princeton-vl/DROID-SLAM.

AAAI Conference 2021 Conference Paper

Dynamically Grown Generative Adversarial Networks

  • Lanlan Liu
  • Yuting Zhang
  • Jia Deng
  • Stefano Soatto

Recent work introduced progressive network growing as a promising way to ease the training of large GANs, but the model design and architecture-growing strategy remain under-explored and need manual design for different image data. In this paper, we propose a method to dynamically grow a GAN during training, optimizing the network architecture and its parameters together with automation. The method embeds architecture search techniques as an interleaving step with gradient-based training to periodically seek the optimal architecture-growing strategy for the generator and discriminator. It enjoys the benefits of both eased training because of progressive growing and improved performance because of a broader architecture design space. Experimental results demonstrate a new state of the art in image generation. Observations in the search procedure also provide constructive insights into GAN model design, such as generator-discriminator balance and convolutional layer choices.

AAAI Conference 2021 Conference Paper

Learning to Sit: Synthesizing Human-Chair Interactions via Hierarchical Control

  • Yu-Wei Chao
  • Jimei Yang
  • Weifeng Chen
  • Jia Deng

Recent progress on physics-based character animation has shown impressive breakthroughs on human motion synthesis, through imitating motion capture data via deep reinforcement learning. However, results have mostly been demonstrated on imitating a single distinct motion pattern, and do not generalize to interactive tasks that require flexible motion patterns due to varying human-object spatial configurations. To bridge this gap, we focus on one class of interactive tasks—sitting onto a chair. We propose a hierarchical reinforcement learning framework which relies on a collection of subtask controllers trained to imitate simple, reusable mocap motions, and a meta controller trained to execute the subtasks properly to complete the main task. We experimentally demonstrate the strength of our approach over different non-hierarchical and hierarchical baselines. We also show that our approach can be applied to motion prediction given an image input. A supplementary video can be found at https://youtu.be/3CeN0OGz2cA.
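
A minimal sketch of the two-level control loop the abstract describes; every interface here (env, the controllers, and their step/act/select methods) is a hypothetical simplification, not the paper's API.

    def run_episode(env, meta_controller, subtask_controllers, max_steps=1000):
        # The meta controller picks which pretrained subtask controller (e.g.,
        # walk, turn, sit) should act next; each subtask controller was trained
        # to imitate one simple, reusable mocap motion.
        state = env.reset()
        for _ in range(max_steps):
            k = meta_controller.select(state)           # choose a subtask
            action = subtask_controllers[k].act(state)  # low-level control
            state, done = env.step(action)
            if done:  # e.g., the character is seated, or has fallen
                break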

IJCAI Conference 2021 Conference Paper

RAFT: Recurrent All-Pairs Field Transforms for Optical Flow (Extended Abstract)

  • Zachary Teed
  • Jia Deng

We introduce Recurrent All-Pairs Field Transforms (RAFT), a new deep network architecture for optical flow. RAFT extracts per-pixel features, builds multi-scale 4D correlation volumes for all pairs of pixels, and iteratively updates a flow field through a recurrent unit that performs lookups on the correlation volumes. RAFT achieves state-of-the-art performance on the KITTI and Sintel datasets. In addition, RAFT has strong cross-dataset generalization as well as high efficiency in inference time, training speed, and parameter count.
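
The all-pairs correlation volume at the heart of RAFT is compact to write down. The sketch below mirrors the first level of the correlation pyramid (multi-scale pooling and the bilinear lookup are omitted); the sqrt-dim normalization follows the public implementation.

    import torch

    def all_pairs_correlation(fmap1, fmap2):
        # fmap1, fmap2: (batch, dim, H, W) per-pixel features of two frames.
        # Returns corr[b, i, j, k, l] = <fmap1[b, :, i, j], fmap2[b, :, k, l]>,
        # which the recurrent update operator repeatedly indexes via lookups.
        B, D, H, W = fmap1.shape
        f1 = fmap1.reshape(B, D, H * W)
        f2 = fmap2.reshape(B, D, H * W)
        corr = torch.einsum('bdm,bdn->bmn', f1, f2) / D ** 0.5
        return corr.reshape(B, H, W, H, W)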

NeurIPS Conference 2020 Conference Paper

Learning to Prove Theorems by Learning to Generate Theorems

  • Mingzhe Wang
  • Jia Deng

We consider the task of automated theorem proving, a key AI task. Deep learning has shown promise for training theorem provers, but there are limited human-written theorems and proofs available for supervised learning. To address this limitation, we propose to learn a neural generator that automatically synthesizes theorems and proofs for the purpose of training a theorem prover. Experiments on real-world tasks demonstrate that synthetic data from our approach improves the theorem prover and advances the state of the art of automated theorem proving in Metamath.

NeurIPS Conference 2020 Conference Paper

Rel3D: A Minimally Contrastive Benchmark for Grounding Spatial Relations in 3D

  • Ankit Goyal
  • Kaiyu Yang
  • Dawei Yang
  • Jia Deng

Understanding spatial relations (e.g., laptop on table) in visual input is important for both humans and robots. Existing datasets are insufficient as they lack large-scale, high-quality 3D ground truth information, which is critical for learning spatial relations. In this paper, we fill this gap by constructing Rel3D: the first large-scale, human-annotated dataset for grounding spatial relations in 3D. Rel3D enables quantifying the effectiveness of 3D information in predicting spatial relations on large-scale human data. Moreover, we propose minimally contrastive data collection---a novel crowdsourcing method for reducing dataset bias. The 3D scenes in our dataset come in minimally contrastive pairs: two scenes in a pair are almost identical, but a spatial relation holds in one and fails in the other. We empirically validate that minimally contrastive examples can diagnose issues with current relation detection models as well as lead to sample-efficient training. Code and data are available at https://github.com/princeton-vl/Rel3D.

NeurIPS Conference 2020 Conference Paper

Strongly Incremental Constituency Parsing with Graph Neural Networks

  • Kaiyu Yang
  • Jia Deng

Parsing sentences into syntax trees can benefit downstream applications in NLP. Transition-based parsers build trees by executing actions in a state transition system. They are computationally efficient, and can leverage machine learning to predict actions based on partial trees. However, existing transition-based parsers are predominantly based on the shift-reduce transition system, which does not align with how humans are known to parse sentences. Psycholinguistic research suggests that human parsing is strongly incremental—humans grow a single parse tree by adding exactly one token at each step. In this paper, we propose a novel transition system called attach-juxtapose. It is strongly incremental; it represents a partial sentence using a single tree; each action adds exactly one token into the partial tree. Based on our transition system, we develop a strongly incremental parser. At each step, it encodes the partial tree using a graph neural network and predicts an action. We evaluate our parser on Penn Treebank (PTB) and Chinese Treebank (CTB). On PTB, it outperforms existing parsers trained with only constituency trees; and it performs on par with state-of-the-art parsers that use dependency trees as additional training data. On CTB, our parser establishes a new state of the art. Code is available at https://github.com/princeton-vl/attach-juxtapose-parser.
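
To make the transition system concrete, here is one plausible encoding of its actions as a data structure; the field names are illustrative assumptions, not the paper's code.

    from dataclasses import dataclass
    from typing import Optional

    @dataclass
    class Action:
        # Exactly one action per input token, so a sentence of n tokens is
        # parsed in n steps. "attach" hangs the new token under a node on the
        # rightmost chain of the partial tree; "juxtapose" creates a fresh
        # parent whose children are an existing node and the new token.
        kind: str                           # "attach" or "juxtapose"
        target: int                         # node index on the rightmost chain
        parent_label: Optional[str] = None  # nonterminal for a created parent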

AAAI Conference 2018 Conference Paper

Dynamic Deep Neural Networks: Optimizing Accuracy-Efficiency Trade-Offs by Selective Execution

  • Lanlan Liu
  • Jia Deng

We introduce Dynamic Deep Neural Networks (D2NN), a new type of feed-forward deep neural network that allows selective execution. Given an input, only a subset of D2NN neurons are executed, and the particular subset is determined by the D2NN itself. By pruning unnecessary computation depending on input, D2NNs provide a way to improve computational efficiency. To achieve dynamic selective execution, a D2NN augments a feed-forward deep neural network (a directed acyclic graph of differentiable modules) with controller modules. Each controller module is a sub-network whose output is a decision that controls whether other modules can execute. A D2NN is trained end to end. Both regular and controller modules in a D2NN are learnable and are jointly trained to optimize both accuracy and efficiency. Such training is achieved by integrating backpropagation with reinforcement learning. With extensive experiments of various D2NN architectures on image classification tasks, we demonstrate that D2NNs are general and flexible, and can effectively optimize accuracy-efficiency trade-offs.
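
A skeletal PyTorch version of a controller-gated module, assuming the gated body preserves its input shape; the REINFORCE-based training of the discrete decision, and per-example (rather than per-batch) gating, are omitted for brevity.

    import torch
    import torch.nn as nn

    class GatedModule(nn.Module):
        # Pairs a regular module with a small controller sub-network whose
        # binary decision determines whether the module executes at all.
        def __init__(self, body, in_features):
            super().__init__()
            self.body = body
            self.controller = nn.Linear(in_features, 1)

        def forward(self, x):
            p = torch.sigmoid(self.controller(x.flatten(1))).mean()
            if p > 0.5:                    # one decision per batch, for brevity
                return self.body(x)
            return torch.zeros_like(x)     # skipped module contributes zeros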

NeurIPS Conference 2017 Conference Paper

Associative Embedding: End-to-End Learning for Joint Detection and Grouping

  • Alejandro Newell
  • Zhiao Huang
  • Jia Deng

We introduce associative embedding, a novel method for supervising convolutional neural networks for the task of detection and grouping. A number of computer vision problems can be framed in this manner, including multi-person pose estimation, instance segmentation, and multi-object tracking. Usually the grouping of detections is achieved with multi-stage pipelines; instead, we propose an approach that teaches a network to simultaneously output detections and group assignments. This technique can be easily integrated into any state-of-the-art network architecture that produces pixel-wise predictions. We show how to apply this method to multi-person pose estimation and report state-of-the-art performance on the MPII and MS-COCO datasets.
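
The supervision boils down to a pull/push loss over scalar "tag" embeddings, one per detection. The sketch below captures that structure; the exact weighting and the Gaussian push term in the paper differ in detail.

    import torch

    def associative_embedding_loss(tags, groups):
        # tags: (num_detections,) scalar embedding predicted per detection
        # groups: (num_detections,) ground-truth group id (e.g., person id)
        uniq = groups.unique()
        means = torch.stack([tags[groups == g].mean() for g in uniq])
        # Pull: tags in the same group should match their group's mean tag.
        pull = torch.stack([((tags[groups == g] - means[i]) ** 2).mean()
                            for i, g in enumerate(uniq)]).mean()
        # Push: mean tags of different groups should repel one another
        # (the diagonal contributes a constant with zero gradient).
        diff = means[:, None] - means[None, :]
        push = torch.exp(-diff ** 2).mean()
        return pull + push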

AAAI Conference 2017 Conference Paper

Fine-Grained Car Detection for Visual Census Estimation

  • Timnit Gebru
  • Jonathan Krause
  • Yilun Wang
  • Duyun Chen
  • Jia Deng
  • Li Fei-Fei

Targeted socio-economic policies require an accurate understanding of a country’s demographic makeup. To that end, the United States spends more than 1 billion dollars a year gathering census data such as race, gender, education, occupation and unemployment rates. Compared to the traditional method of collecting surveys across many years, which is costly and labor-intensive, data-driven, machine-learning-driven approaches are cheaper and faster, with the potential ability to detect trends in close to real time. In this work, we leverage the ubiquity of Google Street View images and develop a computer vision pipeline to predict income, per capita carbon emission, crime rates and other city attributes from a single source of publicly available visual data. We first detect cars in 50 million images across 200 of the largest US cities and train a model to predict demographic attributes using the detected cars. To facilitate our work, we have collected the largest and most challenging fine-grained dataset reported to date, consisting of over 2600 classes of cars comprising images from Google Street View and other web sources, classified by car experts to account for even the most subtle of visual differences. We use this data to construct the largest-scale fine-grained detection system reported to date. Our prediction results correlate well with ground truth income data (r=0.82), Massachusetts department of vehicle registration data, and sources investigating crime rates, income segregation, per capita carbon emission, and other market research. Finally, we learn interesting relationships between cars and neighbourhoods, allowing us to perform the first large-scale sociological analysis of cities using computer vision techniques.

NeurIPS Conference 2017 Conference Paper

Pixels to Graphs by Associative Embedding

  • Alejandro Newell
  • Jia Deng

Graphs are a useful abstraction of image content. Not only can graphs represent details about individual objects in a scene but they can capture the interactions between pairs of objects. We present a method for training a convolutional neural network such that it takes in an input image and produces a full graph definition. This is done end-to-end in a single stage with the use of associative embeddings. The network learns to simultaneously identify all of the elements that make up a graph and piece them together. We benchmark on the Visual Genome dataset, and demonstrate state-of-the-art performance on the challenging task of scene graph generation.

NeurIPS Conference 2017 Conference Paper

Premise Selection for Theorem Proving by Deep Graph Embedding

  • Mingzhe Wang
  • Yihe Tang
  • Jian Wang
  • Jia Deng

We propose a deep learning-based approach to the problem of premise selection: selecting mathematical statements relevant for proving a given conjecture. We represent a higher-order logic formula as a graph that is invariant to variable renaming but still fully preserves syntactic and semantic information. We then embed the graph into a vector via a novel embedding method that preserves the information of edge ordering. Our approach achieves state-of-the-art results on the HolStep dataset, improving the classification accuracy from 83% to 90.3%.

NeurIPS Conference 2016 Conference Paper

Single-Image Depth Perception in the Wild

  • Weifeng Chen
  • Zhao Fu
  • Dawei Yang
  • Jia Deng

This paper studies single-image depth perception in the wild, i.e., recovering depth from a single image taken in unconstrained settings. We introduce a new dataset “Depth in the Wild” consisting of images in the wild annotated with relative depth between pairs of random points. We also propose a new algorithm that learns to estimate metric depth using annotations of relative depth. Compared to the state of the art, our algorithm is simpler and performs better. Experiments show that our algorithm, combined with existing RGB-D data and our new relative depth annotations, significantly improves single-image depth perception in the wild.
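
Learning metric depth from ordinal annotations comes down to a ranking loss over annotated point pairs. The sketch below follows the standard formulation for this setup (a logistic ranking loss with a squared penalty for ties); details of the paper's exact loss may differ.

    import torch.nn.functional as F

    def relative_depth_loss(depth, a, b, label):
        # depth: (H, W) predicted depth map; a, b: (row, col) of the two
        # annotated points; label: +1 if a is farther than b, -1 if closer,
        # 0 if the pair was judged to be at roughly the same depth.
        za, zb = depth[a[0], a[1]], depth[b[0], b[1]]
        if label == 0:
            return (za - zb) ** 2               # ties: penalize any difference
        return F.softplus(-label * (za - zb))   # log(1 + exp(-label * diff))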

NeurIPS Conference 2011 Conference Paper

Fast and Balanced: Efficient Label Tree Learning for Large Scale Object Recognition

  • Jia Deng
  • Sanjeev Satheesh
  • Alexander Berg
  • Fei Li

We present a novel approach to efficiently learn a label tree for large-scale classification with many classes. The key contribution of the approach is a technique to simultaneously determine the structure of the tree and learn the classifiers for each node in the tree. This approach also allows fine-grained control over the efficiency-accuracy trade-off in designing a label tree, leading to more balanced trees. Experiments are performed on large-scale image classification with 10,184 classes and 9 million images. We demonstrate significant improvements in test accuracy and efficiency, with less training time and more balanced trees, compared to the previous state of the art by Bengio et al.
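
What makes a label tree efficient at test time is that prediction descends one classifier per level rather than evaluating one classifier per class. A minimal sketch of that inference path (the paper's contribution, learning the structure and node classifiers jointly, is not shown):

    class LabelTreeNode:
        def __init__(self, classifier=None, children=None, label=None):
            self.classifier = classifier  # callable: feature vector -> child index
            self.children = children or []
            self.label = label            # class label, set only at leaves

    def predict(node, x):
        # Descend from the root, evaluating one node classifier per level;
        # with a balanced tree this costs O(log num_classes) per example
        # instead of the O(num_classes) cost of one-vs-all classification.
        while node.children:
            node = node.children[node.classifier(x)]
        return node.label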