Arrow Research search

Author name cluster

Yi Jiang

Possible papers associated with this exact author name in Arrow. This page groups case-insensitive exact name matches and is not a full identity disambiguation profile.

35 papers
2 author rows

Possible papers


AAAI Conference 2026 Conference Paper

FlashVideo: Flowing Fidelity to Detail for Efficient High-Resolution Video Generation

  • Shilong Zhang
  • Wenbo Li
  • Shoufa Chen
  • Chongjian Ge
  • Peize Sun
  • Yifu Zhang
  • Yi Jiang
  • Zehuan Yuan

DiT models have achieved great success in text-to-video generation, leveraging their scalability in model capacity and data scale. High content and motion fidelity aligned with text prompts, however, often require large model parameters and a substantial number of function evaluations (NFEs). Realistic and visually appealing details are typically reflected in high-resolution outputs, further amplifying computational demands—especially for single-stage DiT models. To address these challenges, we propose a novel two-stage framework, FlashVideo, which strategically allocates model capacity and NFEs across stages to balance generation fidelity and quality. In the first stage, prompt fidelity is prioritized through a low-resolution generation process utilizing large parameters and sufficient NFEs to enhance computational efficiency. The second stage achieves a nearly straight ODE trajectory between low and high resolutions via flow matching, effectively generating fine details and fixing artifacts with minimal NFEs. To ensure a seamless connection between the two independently trained stages during inference, we carefully design degradation strategies during the second-stage training. Quantitative and visual results demonstrate that FlashVideo achieves state-of-the-art high-resolution video generation with superior computational efficiency. Additionally, the two-stage design enables users to preview the initial output and accordingly adjust the prompt before committing to full-resolution generation, thereby significantly reducing computational costs and wait times as well as enhancing commercial viability.

AAAI Conference 2026 Conference Paper

Multi-Level Blur-Aware Stable Diffusion for Region-Adaptive Defocus Deblurring

  • Xiaopan Li
  • Yi Jiang
  • Shiqian Wu
  • Shoulie Xie
  • Sos Agaian

Defocus blur, common in shallow depth-of-field photography, varies across image regions and is challenging to accurately estimate and restore. Existing deblurring methods often struggle to capture fine structural textures and do not effectively adapt to regional differences in blur. We propose Multi-Level Blur-Aware Stable Diffusion (MBSD), a novel framework that explicitly integrates regional blur recognition into a diffusion-based image restoration process. MBSD assigns blur-level labels to image patches using a Patch Blur Annotator (PBA), guiding a Multi-Scale Blur Estimator (MSBE) to predict soft blur probabilities and generate routing weights. These weights control a Blur-Adaptive Expert Mixer (BAEM), which adaptively combines features based on local blur severity. The features are then passed to a text-to-image diffusion model via a cross-attention mechanism, enabling region-specific restoration. Extensive experiments on public benchmarks demonstrate that MBSD delivers superior perceptual quality while maintaining competitive PSNR and SSIM, consistently outperforming state-of-the-art methods.
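As a rough illustration of the routing-weighted expert mixing the abstract describes, here is a minimal PyTorch sketch; the module name, expert design, and tensor shapes are assumptions for illustration, not MBSD's actual implementation.

```python
import torch
import torch.nn as nn

class SoftExpertMixer(nn.Module):
    """Toy blur-adaptive mixer: per-pixel routing weights blend the outputs of
    several expert branches (a generic sketch, not the paper's BAEM module)."""
    def __init__(self, dim=64, num_experts=3):
        super().__init__()
        self.experts = nn.ModuleList(
            nn.Conv2d(dim, dim, kernel_size=3, padding=1) for _ in range(num_experts)
        )

    def forward(self, feats, routing):
        # feats: (B, C, H, W) image features; routing: (B, E, H, W) soft blur probabilities
        outs = torch.stack([expert(feats) for expert in self.experts], dim=1)  # (B, E, C, H, W)
        return (routing.unsqueeze(2) * outs).sum(dim=1)                        # weighted blend

# usage with dummy tensors
mixed = SoftExpertMixer()(torch.randn(1, 64, 32, 32),
                          torch.softmax(torch.randn(1, 3, 32, 32), dim=1))
```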

AAAI Conference 2026 Conference Paper

Unveiling the Attribute Misbinding Threat in Identity-Preserving Models

  • Junming Fu
  • Jishen Zeng
  • Yi Jiang
  • Peiyu Zhuang
  • Baoying Chen
  • Siyu Lu
  • Jianquan Yang

Identity-preserving models have led to notable progress in generating personalized content. Unfortunately, such models also exacerbate risks when misused, for instance, by generating threatening content targeting specific individuals. This paper introduces the Attribute Misbinding Attack, a novel method that poses a threat to identity-preserving models by inducing them to produce Not-Safe-For-Work (NSFW) content. The attack's core idea involves crafting benign-looking textual prompts to circumvent text-filter safeguards and leverage a key model vulnerability: flawed attribute binding that stems from its internal attention bias. This results in misattributing harmful descriptions to a target identity and generating NSFW outputs. To facilitate the study of this attack, we present the Misbinding Prompt evaluation set, which examines the content generation risks of current state-of-the-art identity-preserving models across four risk dimensions: pornography, violence, discrimination, and illegality. Additionally, we introduce the Attribute Binding Safety Score (ABSS), a metric for concurrently assessing both content fidelity and safety compliance. Experimental results show that our Misbinding Prompt evaluation set achieves a 5.28% higher success rate in bypassing five leading text filters (including GPT-4o) compared to existing mainstream evaluation sets, while also demonstrating the highest proportion of NSFW content generation. The proposed ABSS metric enables a more comprehensive evaluation of identity-preserving models by concurrently assessing both content fidelity and safety compliance.

JBHI Journal 2026 Journal Article

USCNet: Transformer-Based Multimodal Fusion with Segmentation Guidance for Urolithiasis Classification

  • Changmiao Wang
  • Songqi Zhang
  • Yongquan Zhang
  • Yifei Wang
  • Liya Liu
  • Nannan Li
  • Xingzhi Li
  • Jiexin Pan

Kidney stone disease ranks among the most prevalent conditions in urology, and understanding the composition of these stones is essential for creating personalized treatment plans and preventing recurrence. Current methods for analyzing kidney stones depend on postoperative specimens, which prevents rapid classification before surgery. To overcome this limitation, we introduce a new approach called the Urinary Stone Segmentation and Classification Network (USCNet). This innovative method allows for precise preoperative classification of kidney stones by integrating Computed Tomography (CT) images with clinical data from Electronic Health Records (EHR). USCNet employs a Transformer-based multimodal fusion framework with CT-EHR attention and segmentation-guided attention modules for accurate classification. Moreover, a dynamic loss function is introduced to effectively balance the dual objectives of segmentation and classification. Experiments on an in-house kidney stone dataset show that USCNet demonstrates outstanding performance across all evaluation metrics, with its classification efficacy significantly surpassing existing mainstream methods. This study presents a promising solution for the precise preoperative classification of kidney stones, offering substantial clinical benefits. The source code has been made publicly available: https://github.com/fancccc/KidneyStoneSC.

AAAI Conference 2025 Conference Paper

Enhancing Adversarial Transferability with Adversarial Weight Tuning

  • Jiahao Chen
  • Zhou Feng
  • Rui Zeng
  • Yuwen Pu
  • Chunyi Zhou
  • Yi Jiang
  • Yuyou Gan
  • Jinbao Li

Deep neural networks (DNNs) are vulnerable to adversarial examples (AEs) that mislead the model while appearing benign to human observers. A critical concern is the transferability of AEs, which enables black-box attacks without direct access to the target model. However, many previous attacks fail to explain the intrinsic mechanism of adversarial transferability and lack a unified, representative metric for it. In this paper, we rethink the properties of transferable AEs and develop a novel metric that measures transferability from the perspective of generalization. Building on insights from this metric, we analyze the generalization of AEs across models with different architectures and prove that we can find a local perturbation to mitigate the gap between surrogate and target models. We further establish the inner connections between model smoothness and flat local maxima, both of which contribute to the transferability of AEs. Building on this, we propose a new adversarial attack algorithm, Adversarial Weight Tuning (AWT), which adaptively adjusts the parameters of the surrogate model using generated AEs to optimize flat local maxima and model smoothness simultaneously, without the need for extra data. AWT is a data-free tuning method that combines gradient-based and model-related attack methods to enhance the transferability of AEs. Extensive experiments on a variety of models with different architectures on ImageNet demonstrate that AWT yields superior performance over other attacks, with average increases of nearly 5% and 10% in attack success rates on CNN-based and Transformer-based models, respectively, compared to state-of-the-art attacks.

NeurIPS Conference 2025 Conference Paper

InfinityStar: Unified Spacetime AutoRegressive Modeling for Visual Generation

  • Jinlai Liu
  • Jian Han
  • Bin Yan
  • Hui Wu
  • Fengda Zhu
  • Xing Wang
  • Yi Jiang
  • Bingyue Peng

We introduce InfinityStar, a unified spacetime autoregressive framework for high-resolution image and dynamic video synthesis. Building on the recent success of autoregressive modeling in both vision and language, our purely discrete approach jointly captures spatial and temporal dependencies within a single architecture. This unified design naturally supports a variety of generation tasks such as text-to-image, text-to-video, image-to-video, and long-duration video synthesis via straightforward temporal autoregression. Through extensive experiments, InfinityStar scores 83.74 on VBench, outperforming all autoregressive models by large margins, even surpassing diffusion competitors like HunyuanVideo. Without extra optimizations, our model generates a 5s, 720p video approximately 10$\times$ faster than leading diffusion-based methods. To our knowledge, InfinityStar is the first discrete autoregressive video generator capable of producing industrial-level 720p videos. We release all code and models to foster further research in efficient, high-quality video generation.

NeurIPS Conference 2025 Conference Paper

UniTok: A Unified Tokenizer for Visual Generation and Understanding

  • Chuofan Ma
  • Yi Jiang
  • Junfeng Wu
  • Jihan Yang
  • Xin Yu
  • Zehuan Yuan
  • Bingyue Peng
  • Xiaojuan Qi

Visual generative and understanding models typically rely on distinct tokenizers to process images, presenting a key challenge for unifying them within a single framework. Recent studies attempt to address this by connecting the training of VQVAE (for autoregressive generation) and CLIP (for understanding) to build a unified tokenizer. However, directly combining these training objectives has been observed to cause severe loss conflicts. In this paper, we show that reconstruction and semantic supervision do not inherently conflict. Instead, the underlying bottleneck stems from the limited representational capacity of the discrete token space. Building on these insights, we introduce UniTok, a unified tokenizer featuring a novel multi-codebook quantization mechanism that effectively scales up the vocabulary size and bottleneck dimension. In terms of final performance, UniTok sets a new record of 0.38 rFID and 78.6\% zero-shot accuracy on ImageNet. Besides, UniTok can be seamlessly integrated into MLLMs to unlock native visual generation capability, without compromising the understanding performance. Additionally, we show that UniTok favors cfg-free generation, reducing gFID from 14.6 to 2.5 on the ImageNet 256$\times$256 benchmark. All codes and models have been made publicly available.
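To make the multi-codebook quantization idea concrete, below is a minimal PyTorch sketch that splits the latent channels into chunks and quantizes each chunk against its own codebook; the class name, codebook sizes, and shapes are assumptions for illustration, not UniTok's released code.

```python
import torch
import torch.nn as nn

class MultiCodebookQuantizer(nn.Module):
    """Toy multi-codebook quantizer: each channel chunk gets its own codebook,
    so the effective vocabulary grows multiplicatively (illustrative sketch)."""
    def __init__(self, dim=64, num_codebooks=8, codebook_size=4096):
        super().__init__()
        assert dim % num_codebooks == 0
        self.codebooks = nn.ModuleList(
            nn.Embedding(codebook_size, dim // num_codebooks) for _ in range(num_codebooks)
        )

    def forward(self, z):
        # z: (batch, tokens, dim) continuous encoder outputs
        quantized, indices = [], []
        for chunk, book in zip(z.chunk(len(self.codebooks), dim=-1), self.codebooks):
            dists = (chunk.unsqueeze(-2) - book.weight).pow(2).sum(-1)  # (B, T, codebook_size)
            idx = dists.argmin(dim=-1)
            quantized.append(book(idx))
            indices.append(idx)
        z_q = torch.cat(quantized, dim=-1)
        z_q = z + (z_q - z).detach()          # straight-through estimator for the encoder
        return z_q, torch.stack(indices, dim=-1)

# usage: quantize 256 tokens of a 64-d latent
z_q, codes = MultiCodebookQuantizer()(torch.randn(2, 256, 64))
```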

NeurIPS Conference 2024 Conference Paper

Efficiency for Free: Ideal Data Are Transportable Representations

  • Peng Sun
  • Yi Jiang
  • Tao Lin

Data, the seminal opportunity and challenge in modern machine learning, currently constrains the scalability of representation learning and impedes the pace of model evolution. In this work, we investigate the efficiency properties of data from both optimization and generalization perspectives. Our theoretical and empirical analysis reveals an unexpected finding: for a given task, utilizing a publicly available, task- and architecture-agnostic model (referred to as the 'prior model' in this paper) can effectively produce efficient data. Building on this insight, we propose the Representation Learning Accelerator (ReLA), which promotes the formation and utilization of efficient data, thereby accelerating representation learning. Utilizing a ResNet-18 pre-trained on CIFAR-10 as a prior model to inform ResNet-50 training on ImageNet-1K reduces computational costs by $50\%$ while maintaining the same accuracy as the model trained with the original BYOL, which requires $100\%$ cost. Our code is available at: https://github.com/LINs-lab/ReLA.

AAAI Conference 2024 Conference Paper

Incomplete Contrastive Multi-View Clustering with High-Confidence Guiding

  • Guoqing Chao
  • Yi Jiang
  • Dianhui Chu

Incomplete multi-view clustering has become an important research problem, since multi-view data with missing values are ubiquitous in real-world applications. Although great efforts have been made for incomplete multi-view clustering, several challenges remain: 1) most existing methods do not make full use of multi-view information to deal with missing values; 2) most methods employ only the consistent information within multi-view data and ignore the complementary information; 3) existing incomplete multi-view clustering methods treat incomplete multi-view representation learning and clustering as independent processes, which leads to a performance gap. In this work, we propose a novel Incomplete Contrastive Multi-View Clustering method with high-confidence guiding (ICMVC). First, we propose multi-view consistency relation transfer combined with a graph convolutional network to tackle the missing-values problem. Second, instance-level attention fusion and high-confidence guiding are proposed to exploit the complementary information, while instance-level contrastive learning on the latent representation is designed to employ the consistent information. Third, an end-to-end framework is proposed to integrate multi-view missing-value handling, multi-view representation learning, and clustering assignment for joint optimization. Experiments against state-of-the-art approaches demonstrate the effectiveness and superiority of our method. Our code is publicly available at https://github.com/liunian-Jay/ICMVC. The version with supplementary material can be found at http://arxiv.org/abs/2312.08697.
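As a generic sketch of the instance-level contrastive term the abstract mentions (the exact loss, temperature, and handling of missing views in ICMVC may differ), consider:

```python
import torch
import torch.nn.functional as F

def instance_contrastive_loss(z1, z2, temperature=0.5):
    """Generic cross-view instance contrastive loss: the same instance in the two
    views is a positive pair, every other pairing is a negative (illustrative only)."""
    z1, z2 = F.normalize(z1, dim=-1), F.normalize(z2, dim=-1)
    logits = z1 @ z2.t() / temperature                 # (N, N) cross-view cosine similarities
    targets = torch.arange(z1.size(0), device=z1.device)
    return 0.5 * (F.cross_entropy(logits, targets) + F.cross_entropy(logits.t(), targets))

# usage: 32 instances with 128-d latent representations from two views
loss = instance_contrastive_loss(torch.randn(32, 128), torch.randn(32, 128))
```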

NeurIPS Conference 2024 Conference Paper

OmniTokenizer: A Joint Image-Video Tokenizer for Visual Generation

  • Junke Wang
  • Yi Jiang
  • Zehuan Yuan
  • Binyue Peng
  • Zuxuan Wu
  • Yu-Gang Jiang

Tokenizer, serving as a translator to map the intricate visual data into a compact latent space, lies at the core of visual generative models. Based on the finding that existing tokenizers are tailored to either image or video inputs, this paper presents OmniTokenizer, a transformer-based tokenizer for joint image and video tokenization. OmniTokenizer is designed with a spatial-temporal decoupled architecture, which integrates window attention and causal attention for spatial and temporal modeling, respectively. To exploit the complementary nature of image and video data, we further propose a progressive training strategy, where OmniTokenizer is first trained on image data at a fixed resolution to develop the spatial encoding capacity and then jointly trained on image and video data at multiple resolutions to learn the temporal dynamics. OmniTokenizer, for the first time, handles both image and video inputs within a unified framework and proves the possibility of realizing their synergy. Extensive experiments demonstrate that OmniTokenizer achieves state-of-the-art (SOTA) reconstruction performance on various image and video datasets, e.g., 1.11 reconstruction FID on ImageNet and 42 reconstruction FVD on UCF-101, beating the previous SOTA methods by 13% and 26%, respectively. Additionally, we also show that when integrated with OmniTokenizer, both language model-based approaches and diffusion models can realize advanced visual synthesis performance, underscoring the superiority and versatility of our method.

NeurIPS Conference 2024 Conference Paper

Recognize Any Regions

  • Haosen Yang
  • Chuofan Ma
  • Bin Wen
  • Yi Jiang
  • Zehuan Yuan
  • Xiatian Zhu

Understanding the semantics of individual regions or patches of unconstrained images, such as open-world object detection, remains a critical yet challenging task in computer vision. Building on the success of powerful image-level vision-language (ViL) foundation models like CLIP, recent efforts have sought to harness their capabilities by either training a contrastive model from scratch with an extensive collection of region-label pairs or aligning the outputs of a detection model with image-level representations of region proposals. Despite notable progress, these approaches are plagued by computationally intensive training requirements, susceptibility to data noise, and deficiency in contextual information. To address these limitations, we explore the synergistic potential of off-the-shelf foundation models, leveraging their respective strengths in localization and semantics. We introduce a novel, generic, and efficient architecture, named RegionSpot, designed to integrate position-aware localization knowledge from a localization foundation model (e.g., SAM) with semantic information from a ViL model (e.g., CLIP). To fully exploit pretrained knowledge while minimizing training overhead, we keep both foundation models frozen, focusing optimization efforts solely on a lightweight attention-based knowledge integration module. Extensive experiments in open-world object recognition show that our RegionSpot achieves significant performance gain over prior alternatives, along with substantial computational savings (e.g., training our model with 3 million data in a single day using 8 V100 GPUs). RegionSpot outperforms GLIP-L by 2.9 in mAP on LVIS val set, with an even larger margin of 13.1 AP for more challenging and rare categories, and a 2.5 AP increase on ODinW. Furthermore, it exceeds GroundingDINO-L by 11.0 AP for rare categories on the LVIS minival set.
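A rough sketch of the kind of lightweight attention-based integration module the abstract describes, with the frozen backbones stubbed out as plain tensors; the class name, dimensions, and projection are assumptions, not RegionSpot's actual code.

```python
import torch
import torch.nn as nn

class RegionKnowledgeBridge(nn.Module):
    """Hypothetical sketch: region tokens from a frozen localization model attend to
    patch features from a frozen vision-language model; only this module is trained."""
    def __init__(self, region_dim=256, vl_dim=768, num_heads=8):
        super().__init__()
        self.proj = nn.Linear(region_dim, vl_dim)
        self.attn = nn.MultiheadAttention(vl_dim, num_heads, batch_first=True)
        self.norm = nn.LayerNorm(vl_dim)

    def forward(self, region_tokens, vl_patches):
        # region_tokens: (B, regions, region_dim), e.g. mask tokens from a SAM-like model
        # vl_patches:    (B, patches, vl_dim),     e.g. patch features from a CLIP-like model
        q = self.proj(region_tokens)
        fused, _ = self.attn(q, vl_patches, vl_patches)
        return self.norm(q + fused)   # region embeddings to score against text labels

# usage with dummy frozen features: 10 region proposals, 196 image patches
emb = RegionKnowledgeBridge()(torch.randn(2, 10, 256), torch.randn(2, 196, 768))
```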

NeurIPS Conference 2024 Conference Paper

Visual Autoregressive Modeling: Scalable Image Generation via Next-Scale Prediction

  • Keyu Tian
  • Yi Jiang
  • Zehuan Yuan
  • Bingyue Peng
  • Liwei Wang

We present Visual AutoRegressive modeling (VAR), a new generation paradigm that redefines autoregressive learning on images as coarse-to-fine "next-scale prediction" or "next-resolution prediction", diverging from the standard raster-scan "next-token prediction". This simple, intuitive methodology allows autoregressive (AR) transformers to learn visual distributions fast and generalize well: VAR, for the first time, makes GPT-style AR models surpass diffusion transformers in image generation. On the ImageNet 256x256 benchmark, VAR significantly improves the AR baseline, improving the Frechet inception distance (FID) from 18.65 to 1.73 and the inception score (IS) from 80.4 to 350.2, with around 20x faster inference speed. It is also empirically verified that VAR outperforms the Diffusion Transformer (DiT) in multiple dimensions including image quality, inference speed, data efficiency, and scalability. Scaling up VAR models exhibits clear power-law scaling laws similar to those observed in LLMs, with linear correlation coefficients near -0.998 as solid evidence. VAR further showcases zero-shot generalization ability in downstream tasks including image in-painting, out-painting, and editing. These results suggest VAR has initially emulated the two important properties of LLMs: scaling laws and zero-shot task generalization. We have released all models and codes to promote the exploration of AR/VAR models for visual generation and unified learning.
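To make "next-scale prediction" concrete, here is a self-contained toy loop in PyTorch: each step samples the entire token map at the next resolution, conditioned on upsampled versions of all earlier scales. The stand-in transformer, embedding table, and scale schedule are assumptions for illustration, not the released VAR code.

```python
import torch
import torch.nn.functional as F

VOCAB, DIM = 4096, 32
embed = torch.nn.Embedding(VOCAB, DIM)               # token id -> feature, stand-in for a VQ codebook

def toy_transformer(cond, n_tokens):
    # stand-in that returns random logits; a real model would attend over `cond`
    batch = 1 if cond is None else cond.size(0)
    return torch.randn(batch, n_tokens, VOCAB)

def generate_next_scale(scales=(1, 2, 4, 8, 16)):
    """Coarse-to-fine loop: predict the whole token map at each resolution in turn."""
    context = []                                      # feature maps generated so far
    for side in scales:
        cond = (torch.cat([F.interpolate(t, size=(side, side), mode="nearest")
                           for t in context], dim=1) if context else None)
        logits = toy_transformer(cond, side * side)   # (B, side*side, VOCAB)
        tokens = torch.multinomial(F.softmax(logits, dim=-1).flatten(0, 1), 1)
        tokens = tokens.view(logits.size(0), side, side)
        context.append(embed(tokens).permute(0, 3, 1, 2))   # (B, DIM, side, side)
    return context[-1]                                # finest-scale feature map

print(generate_next_scale().shape)                    # torch.Size([1, 32, 16, 16])
```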

NeurIPS Conference 2023 Conference Paper

CoDet: Co-occurrence Guided Region-Word Alignment for Open-Vocabulary Object Detection

  • Chuofan Ma
  • Yi Jiang
  • Xin Wen
  • Zehuan Yuan
  • Xiaojuan Qi

Deriving reliable region-word alignment from image-text pairs is critical to learn object-level vision-language representations for open-vocabulary object detection. Existing methods typically rely on pre-trained or self-trained vision-language models for alignment, which are prone to limitations in localization accuracy or generalization capabilities. In this paper, we propose CoDet, a novel approach that overcomes the reliance on pre-aligned vision-language space by reformulating region-word alignment as a co-occurring object discovery problem. Intuitively, by grouping images that mention a shared concept in their captions, objects corresponding to the shared concept shall exhibit high co-occurrence among the group. CoDet then leverages visual similarities to discover the co-occurring objects and align them with the shared concept. Extensive experiments demonstrate that CoDet has superior performances and compelling scalability in open-vocabulary detection, e.g., by scaling up the visual backbone, CoDet achieves 37.0 $AP^m_{novel}$ and 44.7 $AP^m_{all}$ on OV-LVIS, surpassing the previous SoTA by 4.2 $AP^m_{novel}$ and 9.8 $AP^m_{all}$. Code is available at https://github.com/CVMI-Lab/CoDet.

ICLR Conference 2023 Conference Paper

Designing BERT for Convolutional Networks: Sparse and Hierarchical Masked Modeling

  • Keyu Tian
  • Yi Jiang
  • Qishuai Diao
  • Chen Lin 0003
  • Liwei Wang
  • Zehuan Yuan

We identify and overcome two key obstacles in extending the success of BERT-style pre-training, or masked image modeling, to convolutional networks (convnets): (i) convolution operation cannot handle irregular, randomly masked input images; (ii) the single-scale nature of BERT pre-training is inconsistent with convnet’s hierarchical structure. For (i), we treat unmasked pixels as sparse voxels of 3D point clouds and use sparse convolution to encode. This is the first use of sparse convolution for 2D masked modeling. For (ii), we develop a hierarchical decoder to reconstruct images from multi-scale encoded features. Our method, called Sparse masKed modeling (SparK), is general: it can be used directly on any convolutional model without backbone modifications. We validate it on both classical (ResNet) and modern (ConvNeXt) models: on three downstream tasks, it surpasses both state-of-the-art contrastive learning and transformer-based masked modeling by similarly large margins (around +1.0%). The improvements on object detection and instance segmentation are more significant (up to +3.5%), validating the strong transferability of features learned. We also find SparK’s favorable scaling behavior by observing more gains on larger networks. All of these findings support the promising future of generative pre-training on convnets. Both codes and pre-trained models have been released at https://github.com/keyu-tian/SparK.

JBHI Journal 2022 Journal Article

DeepRayburst for Automatic Shape Analysis of Tree-Like Structures in Biomedical Images

  • Yi Jiang
  • Weixun Chen
  • Min Liu
  • Yaonan Wang
  • Erik Meijering

Precise quantification of tree-like structures from biomedical images, such as neuronal shape reconstruction and retinal blood vessel caliber estimation, is increasingly important in understanding normal function and pathologic processes in biology. Some handcrafted methods have been proposed for this purpose in recent years. However, each is designed for only a specific application. In this paper, we propose a shape analysis algorithm, DeepRayburst, that can be applied to many different applications based on a Multi-Feature Rayburst Sampling (MFRS) and a Dual Channel Temporal Convolutional Network (DC-TCN). Specifically, we first generate a Rayburst Sampling (RS) core containing a set of multidirectional rays. Then the MFRS is designed by extending each ray of the RS to multiple parallel rays which extract a set of feature sequences. A Gaussian kernel is then used to fuse these feature sequences into a single feature sequence. Furthermore, we design a DC-TCN to make the rays terminate on the surface of tree-like structures according to the fused feature sequence. Finally, by analyzing the distribution patterns of the terminated rays, the algorithm can serve multiple shape analysis applications of tree-like structures. Experiments on three different applications, including soma shape reconstruction, neuronal shape reconstruction, and vessel caliber estimation, confirm that the proposed method outperforms other state-of-the-art shape analysis methods, demonstrating its flexibility and robustness.

NeurIPS Conference 2022 Conference Paper

Rethinking Resolution in the Context of Efficient Video Recognition

  • Chuofan Ma
  • Qiushan Guo
  • Yi Jiang
  • Ping Luo
  • Zehuan Yuan
  • Xiaojuan Qi

In this paper, we empirically study how to make the most of low-resolution frames for efficient video recognition. Existing methods mainly focus on developing compact networks or alleviating temporal redundancy of video inputs to increase efficiency, whereas compressing frame resolution has rarely been considered a promising solution. A major concern is the poor recognition accuracy on low-resolution frames. We thus start by analyzing the underlying causes of performance degradation on low-resolution frames. Our key finding is that the major cause of degradation is not information loss in the down-sampling process, but rather the mismatch between network architecture and input scale. Motivated by the success of knowledge distillation (KD), we propose to bridge the gap between network and input size via cross-resolution KD (ResKD). Our work shows that ResKD is a simple but effective method to boost recognition accuracy on low-resolution frames. Without bells and whistles, ResKD considerably surpasses all competitive methods in terms of efficiency and accuracy on four large-scale benchmark datasets, i.e., ActivityNet, FCVID, Mini-Kinetics, Something-Something V2. In addition, we extensively demonstrate its effectiveness over state-of-the-art architectures, i.e., 3D-CNNs and Video Transformers, and scalability towards super low-resolution frames. The results suggest ResKD can serve as a general inference acceleration method for state-of-the-art video recognition. Our code will be available at https://github.com/CVMI-Lab/ResKD.
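What cross-resolution knowledge distillation generally looks like as a training objective, sketched in PyTorch; the temperature, weighting, and exact formulation here are assumptions, not necessarily the paper's recipe.

```python
import torch
import torch.nn.functional as F

def cross_resolution_kd_loss(student_logits, teacher_logits, labels,
                             temperature=4.0, alpha=0.5):
    """Generic KD objective: a student fed low-resolution frames matches the softened
    predictions of a teacher fed high-resolution frames, while also fitting the labels."""
    soft_teacher = F.softmax(teacher_logits / temperature, dim=-1)
    log_student = F.log_softmax(student_logits / temperature, dim=-1)
    kd = F.kl_div(log_student, soft_teacher, reduction="batchmean") * temperature ** 2
    ce = F.cross_entropy(student_logits, labels)
    return alpha * kd + (1.0 - alpha) * ce

# usage with dummy logits: 8 clips, 400 action classes
loss = cross_resolution_kd_loss(torch.randn(8, 400), torch.randn(8, 400),
                                torch.randint(0, 400, (8,)))
```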

JBHI Journal 2021 Journal Article

Deep Learning Methods for Lung Cancer Segmentation in Whole-Slide Histopathology Images—The ACDC@LungHP Challenge 2019

  • Zhang Li
  • Jiehua Zhang
  • Tao Tan
  • Xichao Teng
  • Xiaoliang Sun
  • Hong Zhao
  • Lihong Liu
  • Yang Xiao

Accurate segmentation of lung cancer in pathology slides is a critical step in improving patient care. We proposed the ACDC@LungHP (Automatic Cancer Detection and Classification in Whole-slide Lung Histopathology) challenge for evaluating different computer-aided diagnosis (CAD) methods on the automatic diagnosis of lung cancer. The ACDC@LungHP 2019 focused on segmentation (pixel-wise detection) of cancer tissue in whole slide imaging (WSI), using an annotated dataset of 150 training images and 50 test images from 200 patients. This paper reviews this challenge and summarizes the top 10 submitted methods for lung cancer segmentation. All methods were evaluated using precision, accuracy, sensitivity, specificity, and the DICE coefficient (DC). The DC ranged from 0.7354 $\pm$ 0.1149 to 0.8372 $\pm$ 0.0858. The DC of the best method was close to the inter-observer agreement (0.8398 $\pm$ 0.0890). All methods were based on deep learning and categorized into two groups: multi-model methods and single-model methods. In general, multi-model methods were significantly better (p $<$ 0.01) than single-model methods, with mean DCs of 0.7966 and 0.7544, respectively. Deep-learning-based methods could potentially help pathologists find suspicious regions for further analysis of lung cancer in WSI.

JMLR Journal 2019 Journal Article

SimpleDet: A Simple and Versatile Distributed Framework for Object Detection and Instance Recognition

  • Yuntao Chen
  • Chenxia Han
  • Yanghao Li
  • Zehao Huang
  • Yi Jiang
  • Naiyan Wang
  • Zhaoxiang Zhang

Object detection and instance recognition play a central role in many AI applications like autonomous driving, video surveillance and medical image analysis. However, training object detection models on large scale datasets remains computationally expensive and time consuming. This paper presents an efficient and open source object detection framework called SimpleDet which enables the training of state-of-the-art detection models on consumer grade hardware at large scale. SimpleDet covers a wide range of models including both high-performance and high-speed ones. SimpleDet is well-optimized for both low precision training and distributed training and achieves 70% higher throughput for the Mask R-CNN detector compared with existing frameworks. Codes, examples and documents of SimpleDet can be found at https://github.com/tusimple/simpledet.