Arrow Research search

Author name cluster

Bo Ren

Possible papers associated with this exact author name in Arrow. This page groups case-insensitive exact name matches and is not a full identity disambiguation profile.

13 papers
1 author row

Possible papers


NeurIPS Conference 2024 Conference Paper

Lighting Every Darkness with 3DGS: Fast Training and Real-Time Rendering for HDR View Synthesis

  • Xin Jin
  • Pengyi Jiao
  • Zheng-Peng Duan
  • Xingchao Yang
  • Chongyi Li
  • Chun-Le Guo
  • Bo Ren

Volumetric rendering-based methods, like NeRF, excel in HDR view synthesis from RAW images, especially for nighttime scenes. However, they suffer from long training times and cannot perform real-time rendering due to dense sampling requirements. The advent of 3D Gaussian Splatting (3DGS) enables real-time rendering and faster training. However, implementing RAW image-based view synthesis directly using 3DGS is challenging due to its inherent drawbacks: 1) in nighttime scenes, extremely low SNR leads to poor structure-from-motion (SfM) estimation in distant views; 2) the limited representation capacity of the spherical harmonics (SH) function is unsuitable for RAW linear color space; and 3) inaccurate scene structure hampers downstream tasks such as refocusing. To address these issues, we propose LE3D (Lighting Every darkness with 3DGS). Our method introduces Cone Scatter Initialization to enrich the estimation of SfM and replaces SH with a Color MLP to represent the RAW linear color space. Additionally, we introduce depth distortion and near-far regularizations to improve the accuracy of scene structure for downstream tasks. These designs enable LE3D to perform real-time novel view synthesis, HDR rendering, refocusing, and tone-mapping changes. Compared to previous volumetric rendering-based methods, LE3D reduces training time to 1% and improves rendering speed by up to 4,000 times for 2K resolution images in terms of FPS. Code and viewer can be found at https://srameo.github.io/projects/le3d.
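
A minimal sketch of the "Color MLP" idea mentioned in the abstract, as one might reconstruct it: a tiny network mapping a per-Gaussian feature vector plus view direction to non-negative linear RGB. The feature size, depth, and softplus output are assumptions, not the LE3D implementation:

```python
# Illustrative sketch only, not the authors' code: a per-Gaussian "Color MLP"
# of the kind the abstract describes as replacing spherical harmonics for RAW
# linear color. Feature/hidden sizes are arbitrary assumptions.
import torch
import torch.nn as nn

class ColorMLP(nn.Module):
    def __init__(self, feat_dim: int = 16, hidden: int = 64):
        super().__init__()
        self.net = nn.Sequential(
            nn.Linear(feat_dim + 3, hidden), nn.ReLU(inplace=True),
            nn.Linear(hidden, hidden), nn.ReLU(inplace=True),
            nn.Linear(hidden, 3),
        )

    def forward(self, gaussian_feat: torch.Tensor, view_dir: torch.Tensor) -> torch.Tensor:
        # gaussian_feat: (N, feat_dim) per-Gaussian features; view_dir: (N, 3) unit directions.
        x = torch.cat([gaussian_feat, view_dir], dim=-1)
        # Softplus keeps radiance non-negative, matching a linear (HDR) color space.
        return nn.functional.softplus(self.net(x))

colors = ColorMLP()(torch.randn(1024, 16),
                    nn.functional.normalize(torch.randn(1024, 3), dim=-1))
print(colors.shape)  # torch.Size([1024, 3])
```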

AAAI Conference 2023 Conference Paper

Adaptive Hierarchy-Branch Fusion for Online Knowledge Distillation

  • Linrui Gong
  • Shaohui Lin
  • Baochang Zhang
  • Yunhang Shen
  • Ke Li
  • Ruizhi Qiao
  • Bo Ren
  • Muqing Li

Online Knowledge Distillation (OKD) is designed to alleviate the dilemma that a high-capacity pre-trained teacher model is not available. However, the existing methods mostly focus on improving the ensemble prediction accuracy from multiple students (a.k.a. branches) and often overlook the homogenization problem that makes student models saturate quickly and hurts performance. We assume that the intrinsic bottleneck of the homogenization problem comes from the identical branch architecture and coarse ensemble strategy. We propose a novel Adaptive Hierarchy-Branch Fusion framework for Online Knowledge Distillation, termed AHBF-OKD, which designs hierarchical branches and an adaptive hierarchy-branch fusion module to boost model diversity and aggregate complementary knowledge. Specifically, we first introduce hierarchical branch architectures to construct diverse peers by monotonically increasing the depth of branches on the basis of the target branch. To effectively transfer knowledge from the most complex branch to the simplest target branch, we propose an adaptive hierarchy-branch fusion module to create hierarchical teacher assistants recursively, which regards the target branch as the smallest teacher assistant. During training, the teacher assistant from the previous hierarchy is explicitly distilled by the teacher assistant and the branch from the current hierarchy. Thus, importance scores are effectively and adaptively allocated to different branches to reduce branch homogenization. Extensive experiments demonstrate the effectiveness of AHBF-OKD on different datasets, including CIFAR-10/100 and ImageNet 2012. For example, on ImageNet 2012, the distilled ResNet-18 achieves a Top-1 error of 29.28%, which significantly outperforms the state-of-the-art methods. The source code is available at https://github.com/linruigong965/AHBF.
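
A hedged sketch of the recursive teacher-assistant fusion the abstract describes, where each newly fused assistant distills the previous, simpler one. The uniform fusion weights and plain KL distillation loss below are illustrative assumptions, not the authors' implementation:

```python
# Minimal sketch of hierarchical teacher-assistant fusion with distillation.
# Fusion weights `alphas` would be learnable in practice; here they are uniform.
import torch
import torch.nn.functional as F

def ahbf_style_losses(branch_logits, temperature: float = 4.0):
    """branch_logits: list ordered from the target (simplest) branch to the deepest branch."""
    assistants = [branch_logits[0]]  # the target branch is the smallest teacher assistant
    alphas = torch.softmax(torch.zeros(len(branch_logits) - 1), dim=0)
    kd_losses = []
    for i, logits in enumerate(branch_logits[1:]):
        # Fuse the current branch with the previous assistant to form a new assistant.
        fused = alphas[i] * logits + (1 - alphas[i]) * assistants[-1]
        # Distill the previous (simpler) assistant from the newly fused, more complex one.
        kd = F.kl_div(
            F.log_softmax(assistants[-1] / temperature, dim=-1),
            F.softmax(fused.detach() / temperature, dim=-1),
            reduction="batchmean",
        ) * temperature ** 2
        kd_losses.append(kd)
        assistants.append(fused)
    return sum(kd_losses)

loss = ahbf_style_losses([torch.randn(8, 100) for _ in range(3)])
```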

AAAI Conference 2023 Conference Paper

FoPro: Few-Shot Guided Robust Webly-Supervised Prototypical Learning

  • Yulei Qin
  • Xingyu Chen
  • Chao Chen
  • Yunhang Shen
  • Bo Ren
  • Yun Gu
  • Jie Yang
  • Chunhua Shen

Recently, webly supervised learning (WSL) has been studied to leverage numerous and accessible data from the Internet. Most existing methods focus on learning noise-robust models from web images while neglecting the performance drop caused by the differences between the web domain and the real-world domain. However, only by tackling this performance gap can we fully exploit the practical value of web datasets. To this end, we propose a Few-shot guided Prototypical (FoPro) representation learning method, which only needs a few labeled examples from reality and can significantly improve performance in the real-world domain. Specifically, we initialize each class center with few-shot real-world data as the "realistic" prototype. Then, the intra-class distance between web instances and "realistic" prototypes is narrowed by contrastive learning. Finally, we measure image-prototype distance with a learnable metric. Prototypes are polished by adjacent high-quality web images and involved in removing distant out-of-distribution samples. In experiments, FoPro is trained on web datasets guided by a few real-world examples and evaluated on real-world datasets. Our method achieves state-of-the-art performance on three fine-grained datasets and two large-scale datasets. Compared with existing WSL methods under the same few-shot settings, FoPro still excels in real-world generalization. Code is available at https://github.com/yuleiqin/fopro.
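
Roughly, the prototype idea can be sketched as follows: class centers are initialized from the mean of few-shot real-world embeddings, and web instances are pulled toward their class prototype. The temperature and the cross-entropy (InfoNCE-style) form are assumptions, not FoPro's exact loss:

```python
# Hedged sketch of few-shot-initialised prototypes plus a prototype-contrastive loss.
import torch
import torch.nn.functional as F

def init_prototypes(fewshot_emb, fewshot_labels, num_classes):
    # "Realistic" prototypes: mean embedding of the few-shot real-world examples per class.
    protos = torch.stack([fewshot_emb[fewshot_labels == c].mean(0) for c in range(num_classes)])
    return F.normalize(protos, dim=-1)

def proto_contrastive_loss(web_emb, web_labels, prototypes, tau: float = 0.1):
    # Pull each web instance toward its own class prototype, away from the others.
    logits = F.normalize(web_emb, dim=-1) @ prototypes.t() / tau  # (N, C) similarities
    return F.cross_entropy(logits, web_labels)

protos = init_prototypes(torch.randn(100, 128), torch.randint(0, 5, (100,)), num_classes=5)
loss = proto_contrastive_loss(torch.randn(32, 128), torch.randint(0, 5, (32,)), protos)
```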

AAAI Conference 2023 Conference Paper

Open-Vocabulary Multi-Label Classification via Multi-Modal Knowledge Transfer

  • Sunan He
  • Taian Guo
  • Tao Dai
  • Ruizhi Qiao
  • Xiujun Shu
  • Bo Ren
  • Shu-Tao Xia

Real-world recognition systems often encounter the challenge of unseen labels. To identify such unseen labels, multi-label zero-shot learning (ML-ZSL) focuses on transferring knowledge via a pre-trained textual label embedding (e.g., GloVe). However, such methods only exploit single-modal knowledge from a language model, while ignoring the rich semantic information inherent in image-text pairs. Instead, recently developed open-vocabulary (OV) based methods succeed in exploiting such information from image-text pairs in object detection, and achieve impressive performance. Inspired by the success of OV-based methods, we propose a novel open-vocabulary framework, named multi-modal knowledge transfer (MKT), for multi-label classification. Specifically, our method exploits multi-modal knowledge of image-text pairs based on a vision and language pre-training (VLP) model. To facilitate transferring the image-text matching ability of the VLP model, knowledge distillation is employed to guarantee the consistency of image and label embeddings, along with prompt tuning to further update the label embeddings. To further enable the recognition of multiple objects, a simple but effective two-stream module is developed to capture both local and global features. Extensive experimental results show that our method significantly outperforms state-of-the-art methods on public benchmark datasets.
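
An illustrative sketch of the embedding-consistency distillation described above: student label embeddings are scored against image embeddings for multi-label prediction while being kept close to a frozen VLP text encoder's embeddings. All tensor shapes, the temperature, and the cosine-distance distillation term are assumptions:

```python
# Hedged sketch, not MKT's implementation: multi-label scoring plus a consistency
# (distillation) term between student label embeddings and frozen teacher embeddings.
import torch
import torch.nn.functional as F

def mkt_style_losses(image_emb, student_label_emb, teacher_label_emb, targets, tau=0.07):
    # Multi-label scores: cosine similarity between the image and each label embedding.
    scores = F.normalize(image_emb, dim=-1) @ F.normalize(student_label_emb, dim=-1).t()
    cls_loss = F.binary_cross_entropy_with_logits(scores / tau, targets)
    # Distillation keeps student label embeddings consistent with the frozen VLP teacher.
    kd_loss = (1 - F.cosine_similarity(student_label_emb, teacher_label_emb, dim=-1)).mean()
    return cls_loss + kd_loss

loss = mkt_style_losses(torch.randn(8, 512), torch.randn(80, 512),
                        torch.randn(80, 512), torch.randint(0, 2, (8, 80)).float())
```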

AAAI Conference 2023 Conference Paper

TaCo: Textual Attribute Recognition via Contrastive Learning

  • Chang Nie
  • Yiqing Hu
  • Yanqiu Qu
  • Hao Liu
  • Deqiang Jiang
  • Bo Ren

As textual attributes like font are core design elements of document format and page style, automatic attribute recognition benefits comprehensive practical applications. Existing approaches already yield satisfactory performance in differentiating disparate attributes, but they still suffer in distinguishing similar attributes with only subtle differences. Moreover, their performance drops severely in real-world scenarios where unexpected and obvious imaging distortions appear. In this paper, we aim to tackle these problems by proposing TaCo, a contrastive framework for textual attribute recognition tailored toward the most common document scenes. Specifically, TaCo leverages contrastive learning to dispel the ambiguity trap arising from vague and open-ended attributes. To realize this goal, we design the learning paradigm from three perspectives: 1) generating attribute views, 2) extracting subtle but crucial details, and 3) exploiting valued view pairs for learning, to fully unlock the pre-training potential. Extensive experiments show that TaCo surpasses the supervised counterparts and advances the state of the art remarkably on multiple attribute recognition tasks. Online services of TaCo will be made available.

AAAI Conference 2023 Conference Paper

The Devil Is in the Frequency: Geminated Gestalt Autoencoder for Self-Supervised Visual Pre-training

  • Hao Liu
  • Xinghua Jiang
  • Xin Li
  • Antai Guo
  • Yiqing Hu
  • Deqiang Jiang
  • Bo Ren

The self-supervised Masked Image Modeling (MIM) schema, following the "mask-and-reconstruct" pipeline of recovering contents from a masked image, has recently captured increasing interest in the community, owing to its excellent ability to learn visual representations from unlabeled data. Aiming at learning representations with highly abstracted semantics, one group of works attempts to reconstruct non-semantic pixels with a large-ratio masking strategy, which may suffer from an "over-smoothing" problem, while others directly infuse semantics into targets in an off-line way that requires extra data. Different from them, we shift the perspective to the Fourier domain, which naturally has a global view, and present a new MIM method, termed Geminated Gestalt Autoencoder (Ge^2-AE), for visual pre-training. Specifically, we equip our model with geminated decoders in charge of reconstructing image contents in both pixel and frequency space, where each serves as not only a complement to but also a reciprocal constraint on the other. In this way, more robust representations can be learned by the pre-trained encoder, whose effectiveness is confirmed by experimental results on downstream recognition tasks. We also conduct several quantitative and qualitative experiments to investigate the learning behavior of our method. To the best of our knowledge, this is the first MIM work to address visual pre-training through the lens of the frequency domain.
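
The geminated (pixel plus frequency) reconstruction objective might be sketched as below, comparing one decoder in pixel space and the other against the image's 2D Fourier spectrum. The L1 losses and the amplitude/phase split are assumptions, not the paper's exact formulation:

```python
# Hedged sketch of a dual pixel-space and frequency-space reconstruction loss.
import torch
import torch.nn.functional as F

def geminated_losses(pixel_pred, freq_pred, target):
    # Pixel-space reconstruction from the first decoder.
    loss_pix = F.l1_loss(pixel_pred, target)
    # Frequency-space reconstruction from the second decoder, against the target's 2D FFT.
    tgt_fft = torch.fft.fft2(target, norm="ortho")
    loss_amp = F.l1_loss(freq_pred[..., 0], tgt_fft.abs())    # amplitude channel
    loss_pha = F.l1_loss(freq_pred[..., 1], tgt_fft.angle())  # phase channel
    return loss_pix + loss_amp + loss_pha

imgs = torch.randn(4, 3, 32, 32)
loss = geminated_losses(torch.randn_like(imgs), torch.randn(4, 3, 32, 32, 2), imgs)
```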

AAAI Conference 2022 Conference Paper

Comprehensive Regularization in a Bi-directional Predictive Network for Video Anomaly Detection

  • Chengwei Chen
  • Yuan Xie
  • Shaohui Lin
  • Angela Yao
  • Guannan Jiang
  • Wei Zhang
  • Yanyun Qu
  • Ruizhi Qiao

Video anomaly detection aims to automatically identify unusual objects or behaviours by learning from normal videos. Previous methods tend to use simplistic reconstruction or prediction constraints, which leads to the insufficiency of learned representations for normal data. As such, we propose a novel bi-directional architecture with three consistency constraints to comprehensively regularize the prediction task at the pixel-wise, cross-modal, and temporal-sequence levels. First, predictive consistency is proposed to consider the symmetry property of motion and appearance in forwards and backwards time, which ensures highly realistic appearance and motion predictions at the pixel-wise level. Second, association consistency considers the relevance between different modalities and uses one modality to regularize the prediction of the other. Finally, temporal consistency utilizes the relationship of the video sequence and ensures that the predictive network generates temporally consistent frames. During inference, the pattern of abnormal frames is unpredictable and will therefore cause higher prediction errors. Experiments show that our method outperforms advanced anomaly detectors and achieves state-of-the-art results on UCSD Ped2, CUHK Avenue, and ShanghaiTech datasets.
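
A loose sketch of how the three consistency terms named above (predictive, association, temporal) could be combined into one objective. The concrete loss forms, the flow-based motion modality, and the weights are placeholders, not the authors' formulation:

```python
# Illustrative combination of predictive, association, and temporal consistency terms.
import torch
import torch.nn.functional as F

def consistency_objective(fwd_frames, bwd_frames, gt_frames, pred_flow, flow_from_frames,
                          lam_assoc=0.1, lam_temp=0.1):
    # Predictive consistency: forward and backward frame predictions both match the targets.
    pred = F.l1_loss(fwd_frames, gt_frames) + F.l1_loss(bwd_frames, gt_frames)
    # Association consistency: one modality regularises the other, e.g. flow estimated
    # from the predicted frames should agree with the predicted flow.
    assoc = F.l1_loss(flow_from_frames(fwd_frames), pred_flow)
    # Temporal consistency: the predicted sequence evolves like the ground-truth sequence.
    temp = F.l1_loss(fwd_frames[:, 1:] - fwd_frames[:, :-1],
                     gt_frames[:, 1:] - gt_frames[:, :-1])
    return pred + lam_assoc * assoc + lam_temp * temp

frames = torch.randn(2, 5, 3, 64, 64)                # (batch, time, channels, H, W)
crude_flow = lambda f: f[:, 1:] - f[:, :-1]          # crude stand-in for a flow estimator
loss = consistency_objective(frames, frames.flip(1), frames, crude_flow(frames), crude_flow)
```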

AAAI Conference 2022 Conference Paper

Perceiving Stroke-Semantic Context: Hierarchical Contrastive Learning for Robust Scene Text Recognition

  • Hao Liu
  • Bin Wang
  • Zhimin Bao
  • Mobai Xue
  • Sheng Kang
  • Deqiang Jiang
  • Yinsong Liu
  • Bo Ren

We introduce Perceiving Stroke-Semantic Context (PerSec), a new approach to self-supervised representation learning tailored for the Scene Text Recognition (STR) task. Considering that scene text images carry both visual and semantic properties, we equip our PerSec with dual context perceivers which can contrast and learn latent representations from low-level stroke and high-level semantic contextual spaces simultaneously via hierarchical contrastive learning on unlabeled text image data. Experiments in un- and semi-supervised learning settings on STR benchmarks demonstrate that our proposed framework can yield a more robust representation for both CTC-based and attention-based decoders than other contrastive learning methods. To fully investigate the potential of our method, we also collect a dataset of 100 million unlabeled text images, named UTI-100M, covering 5 scenes and 4 languages. By leveraging hundred-million-level unlabeled data, our PerSec shows significant performance improvement when fine-tuning the learned representation on the labeled data. Furthermore, we observe that the representation learned by PerSec generalizes well, especially in scenarios with few labeled data.

AAAI Conference 2022 Conference Paper

Sequence-to-Action: Grammatical Error Correction with Action Guided Sequence Generation

  • Jiquan Li
  • Junliang Guo
  • Yongxin Zhu
  • Xin Sheng
  • Deqiang Jiang
  • Bo Ren
  • Linli Xu

The task of Grammatical Error Correction (GEC) has received remarkable attention with wide applications in Natural Language Processing (NLP) in recent years. While one of the key principles of GEC is to keep the correct parts unchanged and avoid over-correction, previous sequence-to-sequence (seq2seq) models generate results from scratch, which are not guaranteed to follow the original sentence structure and may suffer from the over-correction problem. In the meantime, the recently proposed sequence tagging models can overcome the over-correction problem by only generating edit operations, but are conditioned on human-designed language-specific tagging labels. In this paper, we combine the pros and alleviate the cons of both models by proposing a novel Sequence-to-Action (S2A) module. The S2A module jointly takes the source and target sentences as input, and is able to automatically generate a token-level action sequence before predicting each token, where each action is chosen from three options named SKIP, COPY and GENerate. Then the actions are fused with the basic seq2seq framework to provide final predictions. We conduct experiments on the benchmark datasets of both English and Chinese GEC tasks. Our model consistently outperforms the seq2seq baselines, while being able to significantly alleviate the over-correction problem as well as achieving better generality and diversity in the generation results compared to the sequence tagging models.
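
One way to picture the fusion described above, sketched under assumptions: a per-token action distribution over SKIP/COPY/GENerate mixes a point mass on the aligned source token with the generator's vocabulary distribution. This is an illustration, not the paper's exact S2A module:

```python
# Hedged sketch of fusing per-token action predictions with a seq2seq distribution.
import torch
import torch.nn.functional as F

SKIP, COPY, GEN = 0, 1, 2

def fuse_step(action_logits, gen_logits, src_token_id, vocab_size):
    """Mix the generator distribution with a point mass on the copied source token."""
    p_act = F.softmax(action_logits, dim=-1)           # (3,) probabilities over actions
    p_gen = F.softmax(gen_logits, dim=-1)               # (V,) generator distribution
    p_copy = F.one_hot(src_token_id, vocab_size).float()
    # COPY keeps the aligned source token; GEN trusts the generator; SKIP is folded
    # into COPY here (the token is left unchanged in the output) for simplicity.
    return (p_act[COPY] + p_act[SKIP]) * p_copy + p_act[GEN] * p_gen

dist = fuse_step(torch.randn(3), torch.randn(30000), torch.tensor(17), 30000)
print(dist.sum())  # ~1.0
```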

AAAI Conference 2022 Conference Paper

TDv2: A Novel Tree-Structured Decoder for Offline Mathematical Expression Recognition

  • Changjie Wu
  • Jun Du
  • Yunqing Li
  • Jianshu Zhang
  • Chen Yang
  • Bo Ren
  • Yiqing Hu

In recent years, tree decoders have become more popular than LaTeX string decoders in the field of handwritten mathematical expression recognition (HMER), as they can capture the hierarchical tree structure of mathematical expressions. However, previous tree decoders converted the tree structure labels into a fixed and ordered sequence, which could not make full use of the diversified expression of tree labels. In this study, we propose a novel tree decoder (TDv2) to fully utilize the tree structure labels. Compared with previous tree decoders, this new model does not require a fixed priority for different branches of a node during training and inference, which can effectively improve the model's generalization capability. The input and output of the model make full use of the tree structure label, so that there is no need to find the parent node in the decoding process, which simplifies decoding and adds a priori information to help predict the node. We verified the effectiveness of each part of the model through comprehensive ablation experiments and attention visualization analysis. On the authoritative CROHME 14/16/19 datasets, our method achieves state-of-the-art results.

AAAI Conference 2020 Conference Paper

Accurate Structured-Text Spotting for Arithmetical Exercise Correction

  • Yiqing Hu
  • Yan Zheng
  • Hao Liu
  • Deqiang Jiang
  • Yinsong Liu
  • Bo Ren

Correcting arithmetical exercises has long been a labor-intensive and time-consuming task for primary school teachers. To reduce their burden, we propose Arithmetical Exercise Checker (AEC), which is the first system that automatically evaluates all arithmetical expressions (AEs) on exercise images. The major challenge is that an AE is formed by printed and handwritten texts with particular arithmetical patterns (e.g., multi-line, fraction). Despite being part of the AE, handwritten texts usually lead to zigzag boundaries and tangled rows. What’s worse, an AE may be arithmetically incorrect, which makes the contextual information less valuable for recognition. To tackle these problems, we introduce integrated detection, recognition and evaluation branches by leveraging AEs’ intrinsic features, namely 1) indistinct boundaries, 2) locally relevant patterns and 3) globally irrelevant symbols. Experimental results demonstrate that AEC yields a 93.72% correction accuracy on 40 kinds of mainstream primary arithmetical exercises. So far, the online service of AEC processes 75,000 arbitrary exercises per day on average, and has already reduced the burden of over 1,000,000 users. AEC shows the benefits of implementing a vision-based system as a way to aid teachers in reducing repetitive tasks.

IJCAI Conference 2018 Conference Paper

Enhanced-alignment Measure for Binary Foreground Map Evaluation

  • Deng-Ping Fan
  • Cheng Gong
  • Yang Cao
  • Bo Ren
  • Ming-Ming Cheng
  • Ali Borji

The existing binary foreground map (FM) measures address various types of errors in either pixel-wise or structural ways. These measures consider pixel-level match or image-level information independently, while cognitive vision studies have shown that human vision is highly sensitive to both global information and local details in scenes. In this paper, we take a detailed look at current binary FM evaluation measures and propose a novel and effective E-measure (Enhanced-alignment measure). Our measure combines local pixel values with the image-level mean value in one term, jointly capturing image-level statistics and local pixel matching information. We demonstrate the superiority of our measure over the available measures on 4 popular datasets via 5 meta-measures, including ranking models for applications, demoting generic, random Gaussian noise maps, ground-truth switch, as well as human judgments. We find large improvements in almost all the meta-measures. For instance, in terms of application ranking, we observe improvement ranging from 9.08% to 19.65% compared with other popular measures.
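
A rough numerical sketch of "combining local pixel values with the image-level mean value in one term": bias matrices (local value minus image mean) from the foreground map and the ground truth are combined into an alignment map, enhanced, and averaged. This follows the enhanced-alignment formulation as commonly described; it is not the authors' reference code, which should be used for actual evaluation:

```python
# Hedged sketch of an enhanced-alignment style score for a binary foreground map.
import numpy as np

def e_measure(fm: np.ndarray, gt: np.ndarray, eps: float = 1e-8) -> float:
    fm = fm.astype(np.float64)
    gt = gt.astype(np.float64)
    # Bias matrices: local value minus the image-level mean (global statistics).
    phi_fm = fm - fm.mean()
    phi_gt = gt - gt.mean()
    # Alignment matrix combines local and global information, then is enhanced and averaged.
    align = 2.0 * phi_fm * phi_gt / (phi_fm ** 2 + phi_gt ** 2 + eps)
    return float(((align + 1.0) ** 2 / 4.0).mean())

print(e_measure(np.random.rand(64, 64) > 0.5, np.random.rand(64, 64) > 0.5))
```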

AAAI Conference 2018 Conference Paper

FLIC: Fast Linear Iterative Clustering With Active Search

  • Jiaxing Zhao
  • Bo Ren
  • Qibin Hou
  • Ming-Ming Cheng
  • Paul Rosin

In this paper, we reconsider the clustering problem for image over-segmentation from a new perspective. We propose a novel search algorithm named “active search” which explicitly considers neighboring continuity. Based on this search method, we design a back-and-forth traversal strategy and a “joint” assignment and update step to speed up the algorithm. Compared to earlier works, such as Simple Linear Iterative Clustering (SLIC) and its follow-ups, which use fixed search regions and perform the assignment and the update step separately, our novel scheme reduces the number of iterations required for convergence, and also improves the boundary sensitivity of the over-segmentation results. Extensive evaluations on the Berkeley segmentation benchmark verify that our method outperforms competing methods under various evaluation metrics. In particular, the lowest time cost is reported among existing methods (approximately 30 fps for a 481 × 321 image on a single CPU core). To facilitate the development of over-segmentation, the code will be made publicly available.
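
A grossly simplified sketch of the "active search" idea: candidate clusters for each pixel come from its already-labelled neighbours rather than a fixed window, traversal alternates direction row by row, and the gaining cluster centre is updated immediately (joint assignment and update). Real FLIC also includes a spatial term in the distance and other details; this is not the released code:

```python
# Conceptual sketch only: one back-and-forth "active search" sweep over an image.
import numpy as np

def active_search_pass(img, labels, centers, counts):
    """img: (H, W, C) floats; labels: (H, W) ints; centers: (K, C) floats; counts: (K,) ints."""
    h, w = img.shape[:2]
    for y in range(h):
        xs = range(w) if y % 2 == 0 else range(w - 1, -1, -1)  # back-and-forth traversal
        for x in xs:
            # Candidate clusters come from the pixel's current label and its 4-neighbours,
            # instead of a fixed search window around each cluster centre.
            cands = {int(labels[y, x])}
            for dy, dx in ((-1, 0), (1, 0), (0, -1), (0, 1)):
                if 0 <= y + dy < h and 0 <= x + dx < w:
                    cands.add(int(labels[y + dy, x + dx]))
            best = min(cands, key=lambda c: float(np.sum((img[y, x] - centers[c]) ** 2)))
            old = int(labels[y, x])
            if best != old:
                # "Joint" assignment and update: move the pixel and adjust the gaining
                # centre immediately rather than in a separate update step.
                labels[y, x] = best
                counts[best] += 1
                centers[best] += (img[y, x] - centers[best]) / counts[best]
                counts[old] -= 1
    return labels, centers, counts
```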