Arrow Research search

Author name cluster

Cheng Jin

Possible papers associated with this exact author name in Arrow. This page groups case-insensitive exact name matches and is not a full identity disambiguation profile.

31 papers
2 author rows

Possible papers

AAAI Conference 2026 Conference Paper

A-FloPS: Accelerating Diffusion Models via Adaptive Flow Path Sampler

  • Cheng Jin
  • Zhenyu Xiao
  • Yuantao Gu

Diffusion models deliver state-of-the-art generative performance across diverse modalities but remain computationally expensive due to their inherently iterative sampling process. Existing training-free acceleration methods typically improve numerical solvers for the reverse-time ODE, yet their effectiveness is fundamentally constrained by the inefficiency of the underlying sampling trajectories. We propose A-FloPS (Adaptive Flow Path Sampler), a principled, training-free framework that reparameterizes the sampling trajectory of any pre-trained diffusion model into a flow-matching form and augments it with an adaptive velocity decomposition. The reparameterization analytically maps diffusion scores to flow-compatible velocities, yielding integration-friendly trajectories without retraining. The adaptive mechanism further factorizes the velocity field into a linear drift term and a residual component whose temporal variation is actively suppressed, restoring the accuracy benefits of high-order integration even in extremely low-NFE regimes. Extensive experiments on conditional image generation and text-to-image synthesis show that A-FloPS consistently outperforms state-of-the-art training-free samplers in both sample quality and efficiency. Notably, with as few as 5 function evaluations, A-FloPS achieves substantially lower FID and generates sharper, more coherent images. The adaptive mechanism also improves native flow-based generative models, underscoring its generality. These results position A-FloPS as a versatile and effective solution for high-quality, low-latency generative modeling.

AAAI Conference 2026 Conference Paper

Advanced Black-Box Tuning of Large Language Models with Limited API Calls

  • Zhikang Xie
  • Weilin Wan
  • Peizhu Gong
  • Weizhong Zhang
  • Cheng Jin

Black-box tuning is an emerging paradigm for adapting large language models (LLMs) to better achieve desired behaviors, particularly when direct access to model parameters is unavailable. Current strategies, however, often present a dilemma of suboptimal extremes: either separately training a small proxy model and then using it to shift the predictions of the foundation model, which offers notable efficiency but often yields limited improvement; or making API calls to the foundation model in each tuning iteration, which entails prohibitive computational costs. In this paper, we argue that a more reasonable way to perform black-box tuning is to train the proxy model with limited API calls. The underlying intuition is based on two key observations: first, the training samples may exhibit correlations and redundancies, suggesting that the foundation model’s predictions can be estimated from previous calls; second, foundation models frequently demonstrate low accuracy on downstream tasks. Therefore, we propose a novel advanced black-box tuning method for LLMs with limited API calls. Our core strategy involves training a Gaussian Process (GP) surrogate model with "LogitMap Pairs" derived from querying the foundation model on a minimal but highly informative training subset. This surrogate can approximate the outputs of the foundation model to guide the training of the proxy model, thereby effectively reducing the need for direct queries to the foundation model. Extensive experiments verify that our approach elevates pre-trained language model accuracy from 55.92% to 86.85%, reducing the frequency of API queries to merely 1.38%. This significantly outperforms offline approaches that operate entirely without API access. Notably, our method also achieves comparable or superior accuracy to query-intensive approaches, while significantly reducing API costs. This offers a robust and high-efficiency paradigm for language model adaptation.

EAAI Journal 2026 Journal Article

Efficient and robust shoveling control system based on semantic elevation mapping for unmanned loaders

  • Guangda Chen
  • Zhiwen Zhang
  • Lin Cheng
  • Cheng Jin
  • Shunyi Yao
  • Yue Wang
  • Rong Xiong
  • Yingfeng Chen

Improving the automation of wheeled loaders is key to solving labor gaps and boosting safety in construction. This paper proposes an automatic shoveling system for unmanned loaders that, for the first time, balances safety, robustness, efficiency, and energy consumption. The system features automatic calibration of camera and light detection and ranging (LiDAR) using large segmentation models and nonlinear optimization, ensuring stability despite vibrations. A lightweight neural network performs semantic segmentation, and multi-frame point clouds are fused with a confidence algorithm for accurate pile segmentation. The shoveling point selection algorithm integrates semantic and elevation data to prioritize loader and environmental safety. Volume prediction initiates scooping, and a shoveling strategy balances robustness and efficiency. Extensive field tests conducted over two months with two types of loaders in three scenarios, totaling 2090 operations, demonstrate the system’s long-term stability, high bucket full rates, efficiency matching manual operations, and an 11% reduction in energy consumption. These results highlight the system’s potential to transform automated construction machinery.

AAAI Conference 2026 Conference Paper

Explore and Establish Synergistic Effects Between Weight Pruning and Coreset Selection in Neural Network Training

  • Weilin Wan
  • Fan Yi
  • Weizhong Zhang
  • Quan Zhou
  • Cheng Jin

Modern deep neural networks rely heavily on massive model weights and training samples, incurring substantial computational costs. Weight pruning and coreset selection are two emerging paradigms proposed to improve computational efficiency. In this paper, we first explore the interplay between redundant weights and training samples through a transparent analysis: redundant samples, particularly noisy ones, cause model weights to become unnecessarily overtuned to fit them, complicating the identification of irrelevant weights during pruning; conversely, irrelevant weights tend to overfit noisy data, undermining coreset selection effectiveness. To further investigate and harness this interplay in deep learning, we develop a Simultaneous Weight and Sample Tailoring mechanism (SWaST) that alternately performs weight pruning and coreset selection to establish a synergistic effect in training. During this investigation, we observe that when simultaneously removing a large number of weights and samples, a phenomenon we term critical double-loss can occur, where important weights and their supportive samples are mistakenly eliminated at the same time, leading to model instability and nearly irreversible degradation that cannot be recovered in subsequent training. Unlike classic machine learning models, this issue can arise in deep learning due to the lack of theoretical guarantees on the correctness of weight pruning and coreset selection, which explains why these paradigms are often developed independently. We mitigate this by integrating a state preservation mechanism into SWaST, enabling stable joint optimization. Extensive experiments reveal a strong synergy between pruning and coreset selection across varying prune rates and coreset sizes, delivering accuracy boosts of up to 17.83% alongside 10% to 90% FLOPs reductions.

AAAI Conference 2026 Conference Paper

Investigating Data Pruning for Pretraining Biological Foundation Models at Scale

  • Yifan Wu
  • Jiyue Jiang
  • Xichen Ye
  • Yiqi Wang
  • Chang Zhou
  • Yitao Xu
  • Jiayang Chen
  • He Hu

Biological foundation models (BioFMs), pretrained on large-scale biological sequences, have recently shown strong potential in providing meaningful representations for diverse downstream bioinformatics tasks. However, such models often rely on millions to billions of training sequences and billions of parameters, resulting in prohibitive computational costs and significant barriers to reproducibility and accessibility—particularly for academic labs. To address these challenges, we investigate the feasibility of data pruning for BioFM pretraining and propose a post-hoc influence-guided data pruning framework tailored to biological domains. Our approach first introduces a subset-based self-influence formulation that enables efficient estimation of sample importance at low computational cost. Built upon this, we propose two simple yet effective selection strategies: Top-k Influence (Top I) and Coverage-Centric Influence (CCI). Then, we empirically validate our method on two representative BioFMs: RNA-FM and ESM-C. For RNA, our framework consistently outperforms random selection baselines under an extreme pruning rate of over 99%, which displays our framework's effectiveness. Furthermore, we demonstrate the generalizability of our framework on protein-related tasks using ESM-C. Specifically, our coreset even outperforms random 10x subsets in both RNA and protein settings, revealing substantial redundancy in biological sequence datasets. These findings underscore the potential of influence-guided data pruning to substantially reduce the computational cost of BioFM pretraining, paving the way for more efficient, accessible, and sustainable biological AI research.

AAAI Conference 2025 Conference Paper

Achieving Ensemble-Like Performance in a Single Model: A Feature Diversification Framework for Image-Text Matching

  • Zhao Zhou
  • Yiqun Wang
  • Weizhong Zhang
  • Yingbin Zheng
  • Xiangcheng Du
  • Cheng Jin

Model ensembling is a widely used technique that enhances performance in image-text matching tasks by combining multiple models, each trained with different initializations. However, the inefficiencies associated with training several models and generating outputs from them constrain their practical applicability. In this paper, we argue that while the parameters of two randomly initialized models can differ significantly, their feature distributions can be similar at certain stages. By employing a proposed technique called cross-modal realignment, we demonstrate that features derived from differently initialized models maintain similarity at the feature extraction stage and can be effectively transformed by fine-tuning a small number of parameters. These findings provide an efficient way to achieve ensemble-like performance within a single model. Specifically, we propose a Feature Diversification Framework (FDF) that emulates the outputs of multiple model initializations to generate diverse features from a common shared feature. Firstly, we introduce feature conversion methods to transform shared features into a set of distinct features. Next, a realignment training strategy is presented to optimize negative pairs for realigning these transformed features, thereby enhancing their diversification to resemble the outputs of different models. Additionally, we propose a reweighting module that assigns weights to these features, enabling a weighted fusion approach for robust feature representation. Extensive experiments on the Flickr30K and MS-COCO datasets demonstrate the effectiveness and generalizability of our framework.

AAAI Conference 2025 Conference Paper

An Exemplar-based Framework for Chinese Text Recognition

  • Zhao Zhou
  • Xiangcheng Du
  • Yingbin Zheng
  • Xingjiao Wu
  • Cheng Jin

This paper introduces a novel exemplar-based framework for reading Chinese texts in natural scene or document images. We present the Deep Exemplar-based Chinese Text Recognizer, which is structured to first identify candidate characters as exemplars from each text-line, and subsequently recognize them by retrieving analogous exemplars from a database. With text-line level annotations, we design the exemplar discovery network to simultaneously recognize texts and capture individual character positions in a weak-supervision manner. The exemplar retrieval module is then crafted to identify the most similar exemplar and propagate the corresponding character label. This enables us to effectively rectify the misrecognized characters and boost the performance of scene text recognition. Experiments on four scenarios of Chinese texts demonstrate the effectiveness of our proposed framework.

ICML Conference 2025 Conference Paper

Angle Domain Guidance: Latent Diffusion Requires Rotation Rather Than Extrapolation

  • Cheng Jin
  • Zhenyu Xiao
  • Chutao Liu
  • Yuantao Gu

Classifier-free guidance (CFG) has emerged as a pivotal advancement in text-to-image latent diffusion models, establishing itself as a cornerstone technique for achieving high-quality image synthesis. However, under high guidance weights, where text-image alignment is significantly enhanced, CFG also leads to pronounced color distortions in the generated images. We identify that these distortions stem from the amplification of sample norms in the latent space. We present a theoretical framework that elucidates the mechanisms of norm amplification and anomalous diffusion phenomena induced by classifier-free guidance. Leveraging our theoretical insights and the latent space structure, we propose an Angle Domain Guidance (ADG) algorithm. ADG constrains magnitude variations while optimizing angular alignment, thereby mitigating color distortions while preserving the enhanced text-image alignment achieved at higher guidance weights. Experimental results demonstrate that ADG significantly outperforms existing methods, generating images that not only maintain superior text alignment but also exhibit improved color fidelity and better alignment with human perceptual preferences.
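
As a rough illustration of the norm amplification described in the abstract above, the sketch below applies the standard classifier-free guidance combination to a toy pair of vectors and compares it with a simple norm-preserving rescaling. The function names, the toy vectors and the rescaling step are assumptions made for this example only; the rescaling is a stand-in and is not the ADG algorithm, which the abstract does not specify in detail.

```python
import math


def norm(v):
    """Euclidean norm of a list of floats."""
    return math.sqrt(sum(x * x for x in v))


def cfg_combine(eps_uncond, eps_cond, weight):
    """Standard classifier-free guidance: extrapolate from the unconditional
    prediction toward the conditional one by the guidance weight."""
    return [u + weight * (c - u) for u, c in zip(eps_uncond, eps_cond)]


def rescale_to_cond_norm(eps_uncond, eps_cond, weight):
    """Hypothetical stand-in (NOT the paper's ADG): keep the direction of the
    guided vector but rescale it to the conditional prediction's norm.
    Assumes the guided vector is non-zero."""
    guided = cfg_combine(eps_uncond, eps_cond, weight)
    scale = norm(eps_cond) / norm(guided)
    return [scale * g for g in guided]


# Toy unit-norm predictions, chosen only for illustration.
eps_u = [1.0, 0.0]
eps_c = [0.0, 1.0]

for w in (1.0, 3.0, 7.5):
    plain = cfg_combine(eps_u, eps_c, w)
    fixed = rescale_to_cond_norm(eps_u, eps_c, w)
    print(f"w={w}: |cfg|={norm(plain):.3f}  |rescaled|={norm(fixed):.3f}")
```

In this toy case the extrapolated vector's norm grows with the guidance weight while the rescaled vector's norm stays fixed, which is the kind of magnitude effect the abstract attributes to high guidance weights.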

NeurIPS Conference 2025 Conference Paper

Complete Structure Guided Point Cloud Completion via Cluster- and Instance-Level Contrastive Learning

  • Yang Chen
  • Yirun Zhou
  • Weizhong Zhang
  • Cheng Jin

Point cloud completion, which aims to reconstruct missing parts from incomplete point clouds, is a pivotal task in 3D computer vision. Traditional supervised approaches often necessitate complete point clouds for training supervision, which are not readily accessible in real-world applications. Recent studies have attempted to mitigate this dependency by employing self-supervised mechanisms. However, these approaches frequently yield suboptimal results due to the absence of complete structure in the point cloud data during training. To address these issues, in this paper, we propose an effective framework to complete the point cloud under the guidance of self-learned complete structure. A key contribution of our work is the development of a novel self-supervised complete structure reconstruction module, which can learn the complete structure explicitly from incomplete point clouds and thus eliminate the reliance on training data from complete point clouds. Additionally, we introduce a contrastive learning approach at both the cluster- and instance-level to extract shape features guided by the complete structure and to capture style features, respectively. This dual-level learning design ensures that the generated point clouds are both shape-completed and detail-preserving. Extensive experiments on both synthetic and real-world datasets demonstrate that our approach significantly outperforms state-of-the-art self-supervised methods.

NeurIPS Conference 2025 Conference Paper

Computational Budget Should Be Considered in Data Selection

  • Weilin Wan
  • Weizhong Zhang
  • Cheng Jin

Data selection improves computational efficiency by choosing informative subsets of training samples. However, existing methods ignore the compute budget, treating data selection and importance evaluation independently of compute budget constraints. Yet empirical studies show no algorithm can consistently outperform others (or even random selection) across varying budgets. We therefore argue that compute budget must be integral to data-selection strategies, since different budgets impose distinct requirements on data quantity, quality, and distribution for effective training. To this end, we propose a novel Computational budget-Aware Data Selection (CADS) method and naturally formulate it into a bilevel optimization framework, where the inner loop trains the model within the constraints of the computational budget on some selected subset of training data, while the outer loop optimizes data selection based on model evaluation. Our technical contributions lie in addressing two main challenges in solving this bilevel optimization problem: the expensive Hessian matrix estimation for outer-loop gradients and the computational burden of achieving inner-loop optimality during iterations. To solve the first issue, we propose a probabilistic reparameterization strategy and compute the gradient using a Hessian-free policy gradient estimator. To address the second challenge, we transform the inner optimization problem into a penalty term in the outer objective, further discovering that we only need to estimate the minimum of a one-dimensional loss to calculate the gradient, significantly improving efficiency. To accommodate different data selection granularities, we present two complementary CADS variants: an example-level version (CADS-E) offering fine-grained control and a source-level version (CADS-S) aggregating samples into source groups for scalable, efficient selection without sacrificing effectiveness. Extensive experiments show that our method achieves performance gains of up to 14.42% over baselines in vision and language benchmarks. Additionally, CADS achieves a 3-20× speedup compared to conventional bilevel implementations, with acceleration correlating positively with compute budget size.
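
For readers unfamiliar with the term, a generic bilevel data-selection problem of the kind the abstract above describes can be written as follows. This is a textbook-style sketch rather than the paper's exact objective: the selection mask $m$, the per-sample loss $\ell$, the evaluation loss $\mathcal{L}_{\mathrm{eval}}$ and the budget-limited inner solution $\theta^{*}_{B}(m)$ are all notation assumed here for illustration.

```latex
% Outer problem: choose which of the N training samples to keep.
% Inner problem: train on the kept samples within a compute budget B.
\begin{aligned}
\min_{m \in \{0,1\}^{N}} \quad
  & \mathcal{L}_{\mathrm{eval}}\bigl(\theta^{*}_{B}(m)\bigr) \\
\text{s.t.} \quad
  & \theta^{*}_{B}(m) \approx \arg\min_{\theta}
    \sum_{i=1}^{N} m_i \, \ell(x_i, y_i; \theta)
    \quad \text{(trained with at most budget } B\text{)}
\end{aligned}
```

The two difficulties named in the abstract map onto this form: differentiating the outer loss through $\theta^{*}_{B}(m)$ normally involves Hessian terms, and solving the inner problem to optimality at every outer step is expensive.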

AAAI Conference 2025 Conference Paper

Expanding the Scope of Negatives: Boosting Image-Text Matching with Negatives Distribution Guided Learning

  • Zhao Zhou
  • Weizhong Zhang
  • Xiangcheng Du
  • Yingbin Zheng
  • Cheng Jin

Image-text matching is a crucial task that bridges visual and linguistic modalities. Recent research typically formulates it as the problem of maximizing the margin with the truly hardest negatives to enhance learning efficiency and avoid poor local optima. We argue that such a formulation can lead to a serious limitation: conventional trainers confine their horizon to the hardest negative examples, while other negative examples offer a range of semantic differences not present in the hardest negatives. In this paper, we propose an efficient negative distribution guided training framework for image-text matching to unlock the substantial room for improvement left by the above limitation. Rather than simply incorporating additional negative examples into the training objective, which could diminish both the leading role of the hardest negatives in training and the effect of large-margin learning in producing a robust matching model, our central idea is to supply the objective with distributional information on the entire set of negative examples. To be precise, we first construct the sample similarity matrix based on several pretrained models to extract the distributional information of the entire negative sample dataset. Then we encode it into a margin regularization module to smooth the similarity differences of all negatives. This enhancement facilitates the capture of fine-grained semantic differences and guides the main learning process by maximizing the margin with hard negative examples. Furthermore, we propose a hardest negative rectification module to address the instability in hardest negative selection based on predicted similarity and to correct erroneous hardest negatives. We evaluate our method in combination with several state-of-the-art image-text matching methods, and our quantitative and qualitative experiments demonstrate its significant generalizability and effectiveness.

AAAI Conference 2025 Conference Paper

Optimized Gradient Clipping for Noisy Label Learning

  • Xichen Ye
  • Yifan Wu
  • Weizhong Zhang
  • Xiaoqiang Li
  • Yifan Chen
  • Cheng Jin

Previous research has shown that constraining the gradient of the loss function w.r.t. model-predicted probabilities can enhance the model's robustness against noisy labels. These methods typically specify a fixed optimal threshold for gradient clipping through validation data to obtain the desired robustness against noise. However, this common practice overlooks the dynamic distribution of gradients from both clean and noisy-labeled samples at different stages of training, significantly limiting the model's capability to adapt to the variable nature of gradients throughout the training process. To address this issue, we propose a simple yet effective approach called Optimized Gradient Clipping (OGC), which dynamically adjusts the clipping threshold based on the ratio of noise gradients to clean gradients after clipping, estimated by modeling the distributions of clean and noisy samples. This approach allows us to modify the clipping threshold at each training step, effectively controlling the influence of noise gradients. Additionally, we provide statistical analysis to certify the noise-tolerance ability of OGC. Our extensive experiments across various types of label noise, including symmetric, asymmetric, instance-dependent, and real-world noise, demonstrate the effectiveness of our approach.
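
To make the fixed-threshold baseline mentioned in the abstract above concrete, the sketch below clips the gradient of the cross-entropy loss with respect to the predicted probability of the labeled class. It assumes plain cross-entropy, L = -log(p), whose derivative is -1/p; the function names and the threshold value are hypothetical, and the dynamic threshold update that OGC proposes is not shown because the abstract does not give its formula.

```python
def ce_grad_wrt_prob(p_true):
    """Derivative of the cross-entropy loss -log(p) with respect to p,
    where p is the predicted probability of the labeled class."""
    return -1.0 / p_true


def clipped_ce_grad(p_true, tau):
    """Fixed-threshold clipping: the gradient magnitude is capped at tau.
    The gradient is negative, so capping the magnitude means taking the
    larger of the raw gradient and -tau."""
    return max(ce_grad_wrt_prob(p_true), -tau)


# A confidently wrong prediction (p close to 0), as is typical for a
# mislabeled sample, has a very large raw gradient that clipping bounds.
tau = 10.0  # hypothetical fixed threshold
for p in (0.9, 0.5, 0.01):
    print(f"p={p}: raw={ce_grad_wrt_prob(p):.2f}  clipped={clipped_ce_grad(p, tau):.2f}")
```

With a fixed tau, samples whose labeled-class probability falls below 1/tau all contribute the same bounded gradient, which is the behavior the abstract argues should instead adapt over the course of training.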

NeurIPS Conference 2025 Conference Paper

Unified Multimodal Chain-of-Thought Reward Model through Reinforcement Fine-Tuning

  • Yibin Wang
  • Zhimin Li
  • Yuhang Zang
  • Chunyu Wang
  • Qinglin Lu
  • Cheng Jin
  • Jiaqi Wang

Recent advances in multimodal Reward Models (RMs) have shown significant promise in delivering reward signals to align vision models with human preferences. However, current RMs are generally restricted to providing direct responses or engaging in shallow reasoning processes with limited depth, often leading to inaccurate reward signals. We posit that incorporating explicit long chains of thought (CoT) into the reward reasoning process can significantly strengthen their reliability and robustness. Furthermore, we believe that once RMs internalize CoT reasoning, their direct response accuracy can also be improved through implicit reasoning capabilities. To this end, this paper proposes UnifiedReward-Think, the first unified multimodal CoT-based reward model, capable of multi-dimensional, step-by-step long-chain reasoning for both visual understanding and generation reward tasks. Specifically, we adopt an exploration-driven reinforcement fine-tuning approach to elicit and incentivize the model's latent complex reasoning ability: (1) We first use a small amount of image generation preference data to distill the reasoning process of GPT-4o, which is then used for the model's cold start to learn the format and structure of CoT reasoning. (2) Subsequently, by leveraging the model's prior knowledge and generalization capabilities, we prepare large-scale unified multimodal preference data to elicit the model's reasoning process across various vision tasks. During this phase, correct reasoning outputs are retained for rejection sampling to refine the model, (3) while incorrectly predicted samples are finally used for Group Relative Policy Optimization (GRPO) based reinforcement fine-tuning, enabling the model to explore diverse reasoning paths and optimize for correct and robust solutions. Extensive experiments confirm that incorporating long CoT reasoning significantly enhances the accuracy of reward signals. Notably, after mastering CoT reasoning, the model exhibits implicit reasoning capabilities, allowing it to surpass existing baselines even without explicit reasoning traces.

IJCAI Conference 2025 Conference Paper

Unleashing the Semantic Adaptability of Controlled Diffusion Model for Image Colorization

  • Xiangcheng Du
  • Zhao Zhou
  • Yanlong Wang
  • Yingbin Zheng
  • Xingjiao Wu
  • Peizhu Gong
  • Cheng Jin

Recent data-driven image colorization methods have leveraged pre-trained Text-to-Image (T2I) diffusion models as generative prior, while still suffering from unsatisfactory and inaccurate semantic-level color control. To address these issues, we propose a Semantic Adaptation method (SeAda) that enhances the prior while considering the semantic discrepancy between color and grayscale image pairs. The SeAda employs a semantic adapter to produce refined semantic embeddings and a controlled T2I diffusion model to create reasonably colored images. Specifically, the semantic adapter transfers the embedding from grayscale to color domain, while the diffusion model utilizes the refined embedding and prior knowledge to achieve realistic and diverse results. We also design a three-staged training strategy to improve semantic comprehension and prior integration for further performance improvement. Extensive experiments on public datasets demonstrate that our method outperforms existing state-of-the-art techniques, yielding superior performance in image colorization.

JBHI Journal 2024 Journal Article

Brain Age Prediction Based on Quantitative Susceptibility Mapping Using the Segmentation Transformer

  • Mingxing Chen
  • Yiqing Wang
  • Yuting Shi
  • Jie Feng
  • Ruimin Feng
  • Xiaojun Guan
  • Xiaojun Xu
  • Yuyao Zhang

The process of brain aging is intricate, encompassing significant structural and functional changes, including myelination and iron deposition in the brain. Brain age could act as a quantitative marker to evaluate the degree of the individual's brain evolution. Quantitative susceptibility mapping (QSM) is sensitive to variations in magnetically responsive substances such as iron and myelin, making it a favorable tool for estimating brain age. In this study, we introduce an innovative 3D convolutional network named Segmentation-Transformer-Age-Network (STAN) to predict brain age based on QSM data. STAN employs a two-stage network architecture. The first-stage network learns to extract informative features from the QSM data through segmentation training, while the second-stage network predicts brain age by integrating the global and local features. We collected QSM images from 712 healthy participants, with 548 for training and 164 for testing. The results demonstrate that the proposed method achieved high-accuracy brain age prediction with a mean absolute error (MAE) of 4.124 years and a coefficient of determination (R²) of 0.933. Furthermore, the gaps between the predicted brain age and the chronological age of Parkinson's disease patients were significantly higher than those of healthy subjects (P < 0.01). We thus believe that using QSM-based predicted brain age offers a more reliable and accurate phenotype, with the potential to serve as a biomarker to explore the process of advanced brain aging.
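
The two evaluation metrics reported in the abstract above, mean absolute error and the coefficient of determination, can be computed as in the following sketch. The ages used here are made-up toy values for illustration and have no connection to the study's data; the function names are likewise chosen for this example.

```python
def mean_absolute_error(y_true, y_pred):
    """Average absolute difference between true and predicted values."""
    return sum(abs(t - p) for t, p in zip(y_true, y_pred)) / len(y_true)


def r_squared(y_true, y_pred):
    """Coefficient of determination: 1 - (residual sum of squares /
    total sum of squares around the mean of the true values)."""
    mean_true = sum(y_true) / len(y_true)
    ss_res = sum((t - p) ** 2 for t, p in zip(y_true, y_pred))
    ss_tot = sum((t - mean_true) ** 2 for t in y_true)
    return 1.0 - ss_res / ss_tot


# Toy chronological ages and predicted brain ages (illustrative only).
chronological = [30.0, 50.0, 70.0]
predicted = [32.0, 49.0, 67.0]

print("MAE (years):", mean_absolute_error(chronological, predicted))
print("R^2:", r_squared(chronological, predicted))
```

The per-subject difference between predicted and chronological age is the "gap" the abstract compares between patient and healthy groups.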

AAAI Conference 2024 Conference Paper

FusionFormer: A Concise Unified Feature Fusion Transformer for 3D Pose Estimation

  • Yanlu Cai
  • Weizhong Zhang
  • Yuan Wu
  • Cheng Jin

Depth uncertainty is a core challenge in 3D human pose estimation, especially when the camera parameters are unknown. Previous methods try to reduce the impact of depth uncertainty by multi-view and/or multi-frame feature fusion to utilize more spatial and temporal information. However, they generally lead to marginal improvements and their performance still cannot match the camera-parameter-required methods. The reason is that their handcrafted fusion schemes cannot fuse the features flexibly, e.g., the multi-view and/or multi-frame features are fused separately. Moreover, the diverse and complicated fusion schemes make the principle for developing effective fusion schemes unclear and also raise the open problem of whether simpler and more elegant fusion schemes exist. To address these issues, this paper proposes an extremely concise unified feature fusion transformer (FusionFormer) with minimized handcrafted design for 3D pose estimation. FusionFormer fuses both the multi-view and multi-frame features in a unified fusion scheme, in which all the features are accessible to each other and thus can be fused flexibly. Experimental results on several mainstream datasets demonstrate that FusionFormer achieves state-of-the-art performance. To the best of our knowledge, this is the first camera-parameter-free method to outperform the existing camera-parameter-required methods, revealing the tremendous potential of camera-parameter-free models. These impressive experimental results together with our concise feature fusion scheme resolve the above open problem. Another appealing feature of FusionFormer we observe is that, benefiting from its effective fusion scheme, we can achieve impressive performance with smaller model size and fewer FLOPs.

AAAI Conference 2024 Conference Paper

Point Cloud Part Editing: Segmentation, Generation, Assembly, and Selection

  • Kaiyi Zhang
  • Yang Chen
  • Ximing Yang
  • Weizhong Zhang
  • Cheng Jin

Ideal part editing should guarantee the diversity of edited parts, the fidelity to the remaining parts, and the quality of the results. However, previous methods do not disentangle each part completely, which means the edited parts will affect the others, resulting in poor diversity and fidelity. In addition, some methods lack constraints between parts, which need manual selections of edited results to ensure quality. Therefore, we propose a four-stage process for point cloud part editing: Segmentation, Generation, Assembly, and Selection. Based on this process, we introduce SGAS, a model for part editing that employs two strategies: feature disentanglement and constraint. By independently fitting part-level feature distributions, we realize the feature disentanglement. By explicitly modeling the transformation from object-level distribution to part-level distributions, we realize the feature constraint. Considerable experiments on different datasets demonstrate the efficiency and effectiveness of SGAS on point cloud part editing. In addition, SGAS can be pruned to realize unsupervised part-aware point cloud generation and achieves state-of-the-art results.

NeurIPS Conference 2024 Conference Paper

Unleashing the Denoising Capability of Diffusion Prior for Solving Inverse Problems

  • Jiawei Zhang
  • Jiaxin Zhuang
  • Cheng Jin
  • Gen Li
  • Yuantao Gu

The recent emergence of diffusion models has significantly advanced the precision of learnable priors, presenting innovative avenues for addressing inverse problems. Previous works have endeavored to integrate diffusion priors into the maximum a posteriori estimation (MAP) framework and design optimization methods to solve the inverse problem. However, prevailing optimization-based algorithms primarily exploit the prior information within the diffusion models while neglecting their denoising capability. To bridge this gap, this work leverages the diffusion process to reframe noisy inverse problems as a two-variable constrained optimization task by introducing an auxiliary optimization variable that represents a 'noisy' sample at an equivalent denoising step. The projection gradient descent method is efficiently utilized to solve the corresponding optimization problem by truncating the gradient through the $\mu$-predictor. The proposed algorithm, termed ProjDiff, effectively harnesses the prior information and the denoising capability of a pre-trained diffusion model within the optimization framework. Extensive experiments on the image restoration tasks and source separation and partial generation tasks demonstrate that ProjDiff exhibits superior performance across various linear and nonlinear inverse problems, highlighting its potential for practical applications. Code is available at https://github.com/weigerzan/ProjDiff/.

AAAI Conference 2022 Conference Paper

Attention-Based Transformation from Latent Features to Point Clouds

  • Kaiyi Zhang
  • Ximing Yang
  • Yuan Wu
  • Cheng Jin

In point cloud generation and completion, previous methods for transforming latent features to point clouds are generally based on fully connected layers (FC-based) or folding operations (Folding-based). However, point clouds generated by FC-based methods are usually troubled by outliers and rough surfaces. For folding-based methods, the data flow is large, convergence is slow, and non-smooth surfaces are hard to generate. In this work, we propose AXform, an attention-based method to transform latent features to point clouds. AXform first generates points in an interim space, using a fully connected layer. These interim points are then aggregated to generate the target point cloud. AXform takes both parameter sharing and data flow into account, which gives it fewer outliers, fewer network parameters, and faster convergence. The points generated by AXform do not have the strong 2-manifold constraint, which improves the generation of non-smooth surfaces. When AXform is expanded to multiple branches for local generations, the centripetal constraint gives it the properties of self-clustering and space consistency, which further enables unsupervised semantic segmentation. We also adopt this scheme and design AXformNet for point cloud completion. Considerable experiments on different datasets show that our methods achieve state-of-the-art results.

AAAI Conference 2022 Conference Paper

Safe Distillation Box

  • Jingwen Ye
  • Yining Mao
  • Jie Song
  • Xinchao Wang
  • Cheng Jin
  • Mingli Song

Knowledge distillation (KD) has recently emerged as a powerful strategy to transfer knowledge from a pre-trained teacher model to a lightweight student, and has demonstrated its unprecedented success over a wide spectrum of applications. In spite of the encouraging results, the KD process per se poses a potential threat to network ownership protection, since the knowledge contained in network can be effortlessly distilled and hence exposed to a malicious user. In this paper, we propose a novel framework, termed as Safe Distillation Box (SDB), that allows us to wrap a pre-trained model in a virtual box for intellectual property protection. Specifically, SDB preserves the inference capability of the wrapped model to all users, but precludes KD from unauthorized users. For authorized users, on the other hand, SDB carries out a knowledge augmentation scheme to strengthen the KD performances and the results of the student model. In other words, all users may employ a model in SDB for inference, but only authorized users get access to KD from the model. The proposed SDB imposes no constraints over the model architecture, and may readily serve as a plug-and-play solution to protect the ownership of a pre-trained network. Experiments across various datasets and architectures demonstrate that, with SDB, the performance of an unauthorized KD drops significantly while that of an authorized gets enhanced, demonstrating the effectiveness of SDB.

IJCAI Conference 2022 Conference Paper

SpanConv: A New Convolution via Spanning Kernel Space for Lightweight Pansharpening

  • Zhi-Xuan Chen
  • Cheng Jin
  • Tian-Jing Zhang
  • Xiao Wu
  • Liang-Jian Deng

Standard convolution operations can effectively perform feature extraction and representation but result in high computational cost, largely due to the generation of the original convolution kernel corresponding to the channel dimension of the feature map, which will cause unnecessary redundancy. In this paper, we focus on kernel generation and present an interpretable span strategy, named SpanConv, for the effective construction of kernel space. Specifically, we first learn two navigated kernels with single channel as bases, then extend the two kernels by learnable coefficients, and finally span the two sets of kernels by their linear combination to construct the so-called SpanKernel. The proposed SpanConv is realized by replacing the plain convolution kernel with SpanKernel. To verify the effectiveness of SpanConv, we design a simple network with SpanConv. Experiments demonstrate that the proposed network significantly reduces parameters compared with benchmark networks for remote sensing pansharpening, while achieving competitive performance and excellent generalization. Code is available at https://github.com/zhi-xuan-chen/IJCAI-2022 SpanConv.

AAAI Conference 2021 Conference Paper

CPCGAN: A Controllable 3D Point Cloud Generative Adversarial Network with Semantic Label Generating

  • Ximing Yang
  • Yuan Wu
  • Kaiyi Zhang
  • Cheng Jin

Generative Adversarial Networks (GAN) are good at generating variant samples of complex data distributions. Generating a sample with certain properties is one of the major tasks in the real-world application of GANs. In this paper, we propose a novel generative adversarial network to generate 3D point clouds from random latent codes, named Controllable Point Cloud Generative Adversarial Network (CPCGAN). A two-stage GAN framework is utilized in CPCGAN and a sparse point cloud containing major structural information is extracted as the middle-level information between the two stages. With their help, CPCGAN has the ability to control the generated structure and generate 3D point clouds with semantic labels for points. Experimental results demonstrate that the proposed CPCGAN outperforms state-of-the-art point cloud GANs.

NeurIPS Conference 2020 Conference Paper

One-sample Guided Object Representation Disassembling

  • Zunlei Feng
  • Yongming He
  • Xinchao Wang
  • Xin Gao
  • Jie Lei
  • Cheng Jin
  • Mingli Song

The ability to disassemble the features of objects and background is crucial for many machine learning tasks, including image classification, image editing, visual concepts learning, and so on. However, existing (semi-)supervised methods all need a large amount of annotated samples, while unsupervised methods can't handle real-world images with complicated backgrounds. In this paper, we introduce the One-sample Guided Object Representation Disassembling (One-GORD) method, which only requires one annotated sample for each object category to learn disassembled object representation from unannotated images. For the annotated one-sample, we first adopt some data augmentation strategies to generate some synthetic samples, which can guide the disassembling of the object features and background features. For the unannotated images, two self-supervised mechanisms, dual-swapping and fuzzy classification, are introduced to disassemble object features from the background with the guidance of the annotated one-sample. What's more, we devise two metrics to evaluate the disassembling performance from the perspective of representation and image, respectively. Experiments demonstrate that One-GORD achieves competitive disassembling performance and can handle natural scenes with complicated backgrounds.

AAAI Conference 2019 Conference Paper

Fully Convolutional Video Captioning with Coarse-to-Fine and Inherited Attention

  • Kuncheng Fang
  • Lian Zhou
  • Cheng Jin
  • Yuejie Zhang
  • Kangnian Weng
  • Tao Zhang
  • Weiguo Fan

Automatically generating natural language descriptions for video is an extremely complicated and challenging task. To tackle the obstacles of the traditional LSTM-based model for video captioning, we propose a novel architecture to generate the optimal descriptions for videos, which focuses on constructing a new network structure that can generate sentences superior to the basic model with LSTM, and establishing special attention mechanisms that can provide more useful visual information for caption generation. This scheme discards the traditional LSTM, and exploits the fully convolutional network with coarse-to-fine and inherited attention designed according to the characteristics of fully convolutional structure. Our model can not only outperform the basic LSTM-based model, but also achieve performance comparable to that of state-of-the-art methods.

JBHI Journal 2018 Journal Article

Left Atrial Appendage Segmentation Using Fully Convolutional Neural Networks and Modified Three-Dimensional Conditional Random Fields

  • Cheng Jin
  • Jianjiang Feng
  • Lei Wang
  • Heng Yu
  • Jiang Liu
  • Jiwen Lu
  • Jie Zhou

Thrombosis has become a global disease threatening human health. The left atrial appendage (LAA) is a major source of thrombosis in patients with atrial fibrillation (AF). Positive correlation exists between LAA volume and AF risk. LAA morphology has been suggested to influence thromboembolic risk in AF patients and to help predict thromboembolic events in low-risk patient groups. Automatic segmentation of LAA can greatly help physicians diagnose AF. In consideration of the large anatomical variations of the LAA, we proposed a robust method for automatic LAA segmentation on computed tomographic angiography (CTA) data using fully convolutional neural networks with three-dimensional (3–D) conditional random fields (CRFs). After manual localization of ROI of LAA, we adopted the FCN in natural image segmentation and transferred their learned models by fine-tuning the networks to segment each 2–D LAA slice. Subsequently, we used a modified dense 3–D CRF that accounts for the 3–D spatial information and larger contextual information to refine the segmentations of all slices. Our method was evaluated on 150 sets of CTA data using five-fold cross validation. Compared with manual annotation, we obtained a mean dice overlap of $94.76\%$ and a mean volume overlap of $91.10\%$ with a computation time of less than 40 s per volume. Experimental results demonstrated the robustness of our method in dealing with large anatomical variations and computational efficiency for adoption in a daily clinical routine.
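
For readers unfamiliar with the Dice overlap reported in this abstract, the following minimal sketch computes the standard Dice coefficient, $2|A \cap B| / (|A| + |B|)$, between two binary masks. The flat-list mask representation and the function name `dice_overlap` are assumptions for illustration; the abstract does not specify the paper's evaluation code, and its "volume overlap" metric is not defined there, so it is not reproduced here.

```python
def dice_overlap(pred, truth):
    """Standard Dice coefficient for two equal-length binary masks (flat lists)."""
    if len(pred) != len(truth):
        raise ValueError("masks must have the same number of voxels")
    # Count voxels labeled foreground in both masks.
    intersection = sum(1 for p, t in zip(pred, truth) if p and t)
    # Sum of the foreground sizes of each mask.
    total = sum(1 for p in pred if p) + sum(1 for t in truth if t)
    if total == 0:
        # Both masks are empty; treating this as perfect agreement is a convention assumed here.
        return 1.0
    return 2.0 * intersection / total


# Example: the prediction marks two voxels, the reference marks one of them.
score = dice_overlap([1, 1, 0, 0], [1, 0, 0, 0])
```

In this example the masks share one foreground voxel and have three foreground voxels in total, so the score is 2/3.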

IJCAI Conference 2016 Conference Paper

HIEDS: A Generic and Efficient Approach to Hierarchical Dataset Summarization

  • Gong Cheng
  • Cheng Jin
  • Yuzhong Qu

The rapid growth of open data on the Web promotes the development of data portals that facilitate finding useful datasets. To help users quickly inspect a dataset found in a portal, we propose to summarize its contents and generate a hierarchical grouping of entities connected by relations. Our generic approach, called HIEDS, considers coverage of dataset, height of hierarchy, cohesion within groups, overlap between groups, and homogeneity of groups, and integrates these configurable factors into a combinatorial optimization problem to solve. We present an efficient solution, to serve users with dynamically configured summaries with acceptable latency. We systematically experiment with our approach on real-world RDF datasets.

AAAI Conference 2015 Conference Paper

Cross-Modal Image Clustering via Canonical Correlation Analysis

  • Cheng Jin
  • Wenhui Mao
  • Ruiqi Zhang
  • Yuejie Zhang
  • Xiangyang Xue

A new algorithm via Canonical Correlation Analysis (CCA) is developed in this paper to support more effective cross-modal image clustering for large-scale annotated image collections. It can be treated as a bi-media multimodal mapping problem and modeled as a correlation distribution over multimodal feature representations. It integrates the multimodal feature generation with the Locality Linear Coding (LLC) and co-occurrence association network, multimodal feature fusion with CCA, and accelerated hierarchical k-means clustering, which aims to characterize the correlations between the inter-related visual features in images and semantic features in captions, and measure their association degree more precisely. Very positive results were obtained in our experiments using a large quantity of public data.

IJCAI Conference 2013 Conference Paper

Automatic Name-Face Alignment to Enable Cross-Media News Retrieval

  • Yuejie Zhang
  • Wei Wu
  • Yang Li
  • Cheng Jin
  • Xiangyang Xue
  • Jianping Fan

A new algorithm is developed in this paper to support automatic name-face alignment for achieving more accurate cross-media news retrieval. We focus on extracting valuable information from large amounts of news images and their captions, where multi-level image-caption pairs are constructed for characterizing both significant names with higher salience and their cohesion with human faces extracted from news images. To remedy the lack of related information for rare names, Web mining is introduced to acquire extra multimodal information. We also emphasize an optimization mechanism based on our Improved Self-Adaptive Simulated Annealing Genetic Algorithm to verify the feasibility of alignment combinations. Our experiments have obtained very positive results.

IJCAI Conference 2011 Conference Paper

Fusion of Multiple Features and Supervised Learning for Chinese OOV Term Detection and POS Guessing

  • Yuejie Zhang
  • Lei Cen
  • Wei Wu
  • Cheng Jin
  • Xiangyang Xue

In this paper, to support more precise Chinese Out-of-Vocabulary (OOV) term detection and Part-of-Speech (POS) guessing, a unified mechanism is proposed and formulated based on the fusion of multiple features and supervised learning. Besides the traditional features, new features for statistical information and global contexts are introduced, as well as some constraints and heuristic rules that reveal the relationships among OOV term candidates. Our experiments on Chinese corpora from both People's Daily and SIGHAN 2005 achieve consistent results, which are better than those obtained by pure rule-based or statistics-based models. Experiments combining our model with Chinese monolingual retrieval on the TREC-9 data sets show that retrieval performance also improves clearly.

IJCAI Conference 2011 Conference Paper

Learning Inter-Related Statistical Query Translation Models for English-Chinese Bi-Directional CLIR

  • Yuejie Zhang
  • Lei Cen
  • Cheng Jin
  • Xiangyang Xue
  • Jianping Fan

To support more precise query translation for English-Chinese Bi-Directional Cross-Language Information Retrieval (CLIR), we have developed a novel framework by integrating a semantic network to characterize the correlations between multiple inter-related text terms of interest and learn their inter-related statistical query translation models. First, a semantic network is automatically generated from large-scale English-Chinese bilingual parallel corpora to characterize the correlations between a large number of text terms of interest. Second, the semantic network is exploited to learn the statistical query translation models for such text terms of interest. Finally, these inter-related query translation models are used to translate the queries more precisely and achieve more effective CLIR. Our experiments on a large amount of official public data have obtained very positive results.

AAAI Conference 2011 Conference Paper

Tracking User-Preference Varying Speed in Collaborative Filtering

  • Ruijiang Li
  • Bin Li
  • Cheng Jin
  • Xiangyang Xue
  • Xingquan Zhu

In real-world recommender systems, some users are easily influenced by new products, whereas others are unwilling to change their minds, so the preference varying speeds of users differ. Based on this observation, we propose a dynamic nonlinear matrix factorization model for collaborative filtering, which aims to improve the rating prediction performance as well as track the preference varying speeds of different users. We assume that user preference changes smoothly over time, and that the preference varying speeds of users are different. These two assumptions are incorporated into the proposed model as prior knowledge on user feature vectors, which can be learned efficiently by MAP estimation. The experimental results show that our method not only achieves state-of-the-art performance in the rating prediction task, but also provides an effective way to track user-preference varying speed.