Arrow Research search

Author name cluster

Ce Liu

Possible papers associated with this exact author name in Arrow. This page groups case-insensitive exact name matches and is not a full identity disambiguation profile.

13 papers
1 author row

Possible papers

13

AAAI Conference 2025 Conference Paper

4D Diffusion for Dynamic Protein Structure Prediction with Reference and Motion Guidance

  • Kaihui Cheng
  • Ce Liu
  • Qingkun Su
  • Jun Wang
  • Liwei Zhang
  • Yining Tang
  • Yao Yao
  • Siyu Zhu

Protein structure prediction is pivotal for understanding the structure-function relationship of proteins, advancing biological research, and facilitating pharmaceutical development and experimental design. While deep learning methods and the expanded availability of experimental 3D protein structures have accelerated structure prediction, the dynamic nature of protein structures has received limited attention. This study introduces an innovative 4D diffusion model incorporating molecular dynamics (MD) simulation data to learn dynamic protein structures. Our approach is distinguished by the following components: (1) a unified diffusion model capable of generating dynamic protein structures, including both the backbone and side chains, utilizing atomic grouping and side-chain dihedral angle predictions; (2) a reference network that enhances structural consistency by integrating the latent embeddings of the initial 3D protein structures; and (3) a motion alignment module aimed at improving temporal structural coherence across multiple time steps. To our knowledge, this is the first diffusion-based model aimed at predicting protein trajectories across multiple time steps simultaneously. Validation on benchmark datasets demonstrates that our model exhibits high accuracy in predicting dynamic 3D structures of proteins containing up to 256 amino acids over 32 time steps, effectively capturing both local flexibility in stable states and significant conformational changes.

NeurIPS Conference 2023 Conference Paper

NAVI: Category-Agnostic Image Collections with High-Quality 3D Shape and Pose Annotations

  • Varun Jampani
  • Kevis-kokitsi Maninis
  • Andreas Engelhardt
  • Arjun Karpur
  • Karen Truong
  • Kyle Sargent
  • Stefan Popov
  • Andre Araujo

Recent advances in neural reconstruction enable high-quality 3D object reconstruction from casually captured image collections. Current techniques mostly analyze their progress on relatively simple image collections where SfM techniques can provide ground-truth (GT) camera poses. We note that SfM techniques tend to fail on in-the-wild image collections such as image search results with varying backgrounds and illuminations. To enable systematic research progress on 3D reconstruction from casual image captures, we propose `NAVI': a new dataset of category-agnostic image collections of objects with high-quality 3D scans along with per-image 2D-3D alignments providing near-perfect GT camera parameters. These 2D-3D alignments allow us to extract accurate derivative annotations such as dense pixel correspondences, depth and segmentation maps. We demonstrate the use of NAVI image collections on different problem settings and show that NAVI enables more thorough evaluations that were not possible with existing datasets. We believe NAVI is beneficial for systematic research progress on 3D reconstruction and correspondence estimation.

NeurIPS Conference 2022 Conference Paper

Coarse-to-Fine Vision-Language Pre-training with Fusion in the Backbone

  • Zi-Yi Dou
  • Aishwarya Kamath
  • Zhe Gan
  • Pengchuan Zhang
  • Jianfeng Wang
  • Linjie Li
  • Zicheng Liu
  • Ce Liu

Vision-language (VL) pre-training has recently received considerable attention. However, most existing end-to-end pre-training approaches either only aim to tackle VL tasks such as image-text retrieval, visual question answering (VQA) and image captioning that test high-level understanding of images, or only target region-level understanding for tasks such as phrase grounding and object detection. We present FIBER (Fusion-In-the-Backbone-based transformER), a new VL model architecture that can seamlessly handle both these types of tasks. Instead of having dedicated transformer layers for fusion after the uni-modal backbones, FIBER pushes multimodal fusion deep into the model by inserting cross-attention into the image and text backbones to better capture multimodal interactions. In addition, unlike previous work that is either only pre-trained on image-text data or on fine-grained data with box-level annotations, we present a two-stage pre-training strategy that uses both these kinds of data efficiently: (i) coarse-grained pre-training based on image-text data; followed by (ii) fine-grained pre-training based on image-text-box data. We conduct comprehensive experiments on a wide range of VL tasks, ranging from VQA, image captioning, and retrieval, to phrase grounding, referring expression comprehension, and object detection. Using deep multimodal fusion coupled with the two-stage pre-training, FIBER provides consistent performance improvements over strong baselines across all tasks, often outperforming methods using magnitudes more data. Code is released at https: //github. com/microsoft/FIBER.

TMLR Journal 2022 Journal Article

GIT: A Generative Image-to-text Transformer for Vision and Language

  • Jianfeng Wang
  • Zhengyuan Yang
  • Xiaowei Hu
  • Linjie Li
  • Kevin Lin
  • Zhe Gan
  • Zicheng Liu
  • Ce Liu

In this paper, we design and train a Generative Image-to-text Transformer, GIT, to unify vision-language tasks such as image/video captioning and question answering. While generative models provide a consistent network architecture between pre-training and fine-tuning, existing work typically contains complex structures (uni/multi-modal encoder/decoder) and depends on external modules such as object detectors/taggers and optical character recognition (OCR). In GIT, we simplify the architecture as one image encoder and one text decoder under a single language modeling task. We also scale up the pre-training data and the model size to boost the model performance. Without bells and whistles, our GIT establishes new state of the arts on numerous challenging benchmarks with a large margin. For instance, our model surpasses the human performance for the first time on TextCaps (138.2 vs. 125.5 in CIDEr). Furthermore, we present a new scheme of generation-based image classification and scene text recognition, achieving decent performance on standard benchmarks.

NeurIPS Conference 2022 Conference Paper

K-LITE: Learning Transferable Visual Models with External Knowledge

  • Sheng Shen
  • Chunyuan Li
  • Xiaowei Hu
  • Yujia Xie
  • Jianwei Yang
  • Pengchuan Zhang
  • Zhe Gan
  • Lijuan Wang

The new generation of state-of-the-art computer vision systems are trained from natural language supervision, ranging from simple object category names to descriptive captions. This form of supervision ensures high generality and usability of the learned visual models, based on the broad concept coverage achieved through large-scale data collection process. Alternatively, we argue that learning with external knowledge about images is a promising way which leverages a much more structured source of supervision and offers sample efficiency. In this paper, we propose K-LITE (Knowledge-augmented Language-Image Training and Evaluation), a simple strategy to leverage external knowledge for building transferable visual systems: In training, it enriches entities in natural language with WordNet and Wiktionary knowledge, leading to an efficient and scalable approach to learning image representations that uses knowledge about the visual concepts; In evaluation, the natural language is also augmented with external knowledge and then used to reference learned visual concepts (or describe new ones) to enable zero-shot and few-shot transfer of the pre-trained models. We study the performance of K-LITE on two important computer vision problems, image classification and object detection, benchmarking on 20 and 13 different existing datasets, respectively. The proposed knowledge-augmented models show significant improvement in transfer learning performance over existing methods. Our code is released at https: //github. com/microsoft/klite.

NeurIPS Conference 2022 Conference Paper

OmniVL: One Foundation Model for Image-Language and Video-Language Tasks

  • Junke Wang
  • DongDong Chen
  • Zuxuan Wu
  • Chong Luo
  • Luowei Zhou
  • Yucheng Zhao
  • Yujia Xie
  • Ce Liu

This paper presents OmniVL, a new foundation model to support both image-language and video-language tasks using one universal architecture. It adopts a unified transformer-based visual encoder for both image and video inputs, and thus can perform joint image-language and video-language pretraining. We demonstrate, for the first time, such a paradigm benefits both image and video tasks, as opposed to the conventional one-directional transfer (e. g. , use image-language to help video-language). To this end, we propose a \emph{decoupled} joint pretraining of image-language and video-language to effectively decompose the vision-language modeling into spatial and temporal dimensions and obtain performance boost on both image and video tasks. Moreover, we introduce a novel unified vision-language contrastive (UniVLC) loss to leverage image-text, video-text, image-label (e. g. , image classification), video-label (e. g. , video action recognition) data together, so that both supervised and noisily supervised pretraining data are utilized as much as possible. Without incurring extra task-specific adaptors, OmniVL can simultaneously support visual only tasks (e. g. , image classification, video action recognition), cross-modal alignment tasks (e. g. , image/video-text retrieval), and multi-modal understanding and generation tasks (e. g. , image/video question answering, captioning). We evaluate OmniVL on a wide range of downstream tasks and achieve state-of-the-art or competitive results with similar model size and data scale.

NeurIPS Conference 2022 Conference Paper

Visual Clues: Bridging Vision and Language Foundations for Image Paragraph Captioning

  • Yujia Xie
  • Luowei Zhou
  • Xiyang Dai
  • Lu Yuan
  • Nguyen Bach
  • Ce Liu
  • Michael Zeng

People say, "A picture is worth a thousand words". Then how can we get the rich information out of the image? We argue that by using visual clues to bridge large pretrained vision foundation models and language models, we can do so without any extra cross-modal training. Thanks to the strong zero-shot capability of foundation models, we start by constructing a rich semantic representation of the image (e. g. , image tags, object attributes / locations, captions) as a structured textual prompt, called visual clues, using a vision foundation model. Based on visual clues, we use large language model to produce a series of comprehensive descriptions for the visual content, which is then verified by the vision model again to select the candidate that aligns best with the image. We evaluate the quality of generated descriptions by quantitative and qualitative measurement. The results demonstrate the effectiveness of such a structured semantic representation.

NeurIPS Conference 2021 Conference Paper

Neural-PIL: Neural Pre-Integrated Lighting for Reflectance Decomposition

  • Mark Boss
  • Varun Jampani
  • Raphael Braun
  • Ce Liu
  • Jonathan Barron
  • Hendrik PA Lensch

Decomposing a scene into its shape, reflectance and illumination is a fundamental problem in computer vision and graphics. Neural approaches such as NeRF have achieved remarkable success in view synthesis, but do not explicitly perform decomposition and instead operate exclusively on radiance (the product of reflectance and illumination). Extensions to NeRF, such as NeRD, can perform decomposition but struggle to accurately recover detailed illumination, thereby significantly limiting realism. We propose a novel reflectance decomposition network that can estimate shape, BRDF, and per-image illumination given a set of object images captured under varying illumination. Our key technique is a novel illumination integration network called Neural-PIL that replaces a costly illumination integral operation in the rendering with a simple network query. In addition, we also learn deep low-dimensional priors on BRDF and illumination representations using novel smooth manifold auto-encoders. Our decompositions can result in considerably better BRDF and light estimates enabling more accurate novel view-synthesis and relighting compared to prior art. Project page: https: //markboss. me/publication/2021-neural-pil/

NeurIPS Conference 2021 Conference Paper

ViSER: Video-Specific Surface Embeddings for Articulated 3D Shape Reconstruction

  • Gengshan Yang
  • Deqing Sun
  • Varun Jampani
  • Daniel Vlasic
  • Forrester Cole
  • Ce Liu
  • Deva Ramanan

We introduce ViSER, a method for recovering articulated 3D shapes and dense3D trajectories from monocular videos. Previous work on high-quality reconstruction of dynamic 3D shapes typically relies on multiple camera views, strong category-specific priors, or 2D keypoint supervision. We show that none of these are required if one can reliably estimate long-range correspondences in a video, making use of only 2D object masks and two-frame optical flow as inputs. ViSER infers correspondences by matching 2D pixels to a canonical, deformable 3D mesh via video-specific surface embeddings that capture the pixel appearance of each surface point. These embeddings behave as a continuous set of keypoint descriptors defined over the mesh surface, which can be used to establish dense long-range correspondences across pixels. The surface embeddings are implemented as coordinate-based MLPs that are fit to each video via self-supervised losses. Experimental results show that ViSER compares favorably against prior work on challenging videos of humans with loose clothing and unusual poses as well as animals videos from DAVIS and YTVOS. Project page: viser-shape. github. io.

NeurIPS Conference 2020 Conference Paper

Supervised Contrastive Learning

  • Prannay Khosla
  • Piotr Teterwak
  • Chen Wang
  • Aaron Sarna
  • Yonglong Tian
  • Phillip Isola
  • Aaron Maschinot
  • Ce Liu

Contrastive learning applied to self-supervised representation learning has seen a resurgence in recent years, leading to state of the art performance in the unsupervised training of deep image models. Modern batch contrastive approaches subsume or significantly outperform traditional contrastive losses such as triplet, max-margin and the N-pairs loss. In this work, we extend the self-supervised batch contrastive approach to the fully-supervised setting, allowing us to effectively leverage label information. Clusters of points belonging to the same class are pulled together in embedding space, while simultaneously pushing apart clusters of samples from different classes. We analyze two possible versions of the supervised contrastive (SupCon) loss, identifying the best-performing formulation of the loss. On ResNet-200, we achieve top-1 accuracy of 81. 4% on the ImageNet dataset, which is 0. 8% above the best number reported for this architecture. We show consistent outperformance over cross-entropy on other datasets and two ResNet variants. The loss shows benefits for robustness to natural corruptions, and is more stable to hyperparameter settings such as optimizers and data augmentations. In reduced data settings, it outperforms cross-entropy significantly. Our loss function is simple to implement and reference TensorFlow code is released at https: //t. ly/supcon.

NeurIPS Conference 2014 Conference Paper

Deep Convolutional Neural Network for Image Deconvolution

  • Li Xu
  • Jimmy Ren
  • Ce Liu
  • Jiaya Jia

Many fundamental image-related problems involve deconvolution operators. Real blur degradation seldom complies with an deal linear convolution model due to camera noise, saturation, image compression, to name a few. Instead of perfectly modeling outliers, which is rather challenging from a generative model perspective, we develop a deep convolutional neural network to capture the characteristics of degradation. We note directly applying existing deep neural networks does not produce reasonable results. Our solution is to establish the connection between traditional optimization-based schemes and a neural network architecture where a novel, separable structure is introduced as a reliable support for robust deconvolution against artifacts. Our network contains two submodules, both trained in a supervised manner with proper initialization. They yield decent performance on non-blind image deconvolution compared to previous generative-model based methods.

NeurIPS Conference 2007 Conference Paper

Object Recognition by Scene Alignment

  • Bryan Russell
  • Antonio Torralba
  • Ce Liu
  • Rob Fergus
  • William Freeman

Current object recognition systems can only recognize a limited number of object categories; scaling up to many categories is the next challenge. We seek to build a system to recognize and localize many different object categories in complex scenes. We achieve this through a simple approach: by matching the input im- age, in an appropriate representation, to images in a large training set of labeled images. Due to regularities in object identities across similar scenes, the retrieved matches provide hypotheses for object identities and locations. We build a prob- abilistic model to transfer the labels from the retrieval set to the input image. We demonstrate the effectiveness of this approach and study algorithm component contributions using held-out test sets from the LabelMe database.

NeurIPS Conference 2006 Conference Paper

Analysis of Contour Motions

  • Ce Liu
  • William Freeman
  • Edward Adelson

A reliable motion estimation algorithm must function under a wide range of con- ditions. One regime, which we consider here, is the case of moving objects with contours but no visible texture. Tracking distinctive features such as corners can disambiguate the motion of contours, but spurious features such as T-junctions can be badly misleading. It is difficult to determine the reliability of motion from local measurements, since a full rank covariance matrix can result from both real and spurious features. We propose a novel approach that avoids these points al- together, and derives global motion estimates by utilizing information from three levels of contour analysis: edgelets, boundary fragments and contours. Boundary fragment are chains of orientated edgelets, for which we derive motion estimates from local evidence. The uncertainties of the local estimates are disambiguated after the boundary fragments are properly grouped into contours. The grouping is done by constructing a graphical model and marginalizing it using importance sampling. We propose two equivalent representations in this graphical model, re- versible switch variables attached to the ends of fragments and fragment chains, to capture both local and global statistics of boundaries. Our system is success- fully applied to both synthetic and real video sequences containing high-contrast boundaries and textureless regions. The system produces good motion estimates along with properly grouped and completed contours.