Arrow Research search

Author name cluster

Kun Zhou

Possible papers associated with this exact author name in Arrow. This page groups case-insensitive exact name matches and is not a full identity disambiguation profile.

35 papers
1 author row

Possible papers

AAAI Conference 2026 Conference Paper

3DTeethSAM: Taming SAM2 for 3D Teeth Segmentation

  • Zhiguo Lu
  • Jianwen Lou
  • Mingjun Ma
  • Hairong Jin
  • Youyi Zheng
  • Kun Zhou

3D teeth segmentation, involving the localization of tooth instances and their semantic categorization in 3D dental models, is a critical yet challenging task in digital dentistry due to the complexity of real-world dentition. In this paper, we propose 3DTeethSAM, an adaptation of the Segment Anything Model 2 (SAM2) for 3D teeth segmentation. SAM2 is a pretrained foundation model for image and video segmentation that has proven to be a strong backbone in various downstream scenarios. To adapt SAM2 to 3D teeth data, we render images of 3D teeth models from predefined views, apply SAM2 for 2D segmentation, and reconstruct 3D results using 2D-3D projections. Since SAM2's performance depends on input prompts, its initial outputs often have deficiencies, and it is class-agnostic, we introduce three lightweight learnable modules: (1) a prompt embedding generator to derive prompt embeddings from image embeddings for accurate mask decoding, (2) a mask refiner to enhance SAM2's initial segmentation results, and (3) a mask classifier to categorize the generated masks. Additionally, we incorporate Deformable Global Attention Plugins (DGAP) into SAM2's image encoder, which improve both segmentation accuracy and training speed. Our method has been validated on the 3DTeethSeg benchmark, achieving an IoU of 91.90% on high-resolution 3D teeth meshes and establishing a new state of the art in the field.
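
To make the 2D-to-3D reconstruction step above concrete, the sketch below shows one plausible way to fuse per-view 2D labels back onto mesh vertices by majority voting. It is not the paper's code; the inputs (vertex_pixels, view_masks) are hypothetical stand-ins for the rendered views and SAM2's per-view outputs.

```python
import numpy as np

def fuse_view_labels(vertex_pixels, view_masks, num_classes):
    # vertex_pixels: per view, a (V, 2) array of integer (x, y) pixel coords,
    # with x = -1 when the vertex is occluded in that view.
    # view_masks: per view, an (H, W) label map produced by the 2D segmenter.
    # A simple majority vote over views gives the per-vertex label.
    V = vertex_pixels[0].shape[0]
    votes = np.zeros((V, num_classes), dtype=np.int64)
    for pix, mask in zip(vertex_pixels, view_masks):
        visible = np.flatnonzero(pix[:, 0] >= 0)
        labels = mask[pix[visible, 1], pix[visible, 0]]
        votes[visible, labels] += 1
    return votes.argmax(axis=1)

# Toy example: 3 vertices, 2 views, 2 classes (0 = gum, 1 = tooth).
masks = [np.array([[0, 1], [1, 1]]), np.array([[1, 1], [0, 0]])]
pixels = [np.array([[0, 0], [1, 1], [-1, -1]]), np.array([[0, 0], [1, 0], [1, 1]])]
print(fuse_view_labels(pixels, masks, num_classes=2))  # -> [0 1 0]
```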

AAAI Conference 2026 Conference Paper

Analyzing and Mitigating Object Hallucination: A Training Bias Perspective

  • Yifan Li
  • Kun Zhou
  • Xin Zhao
  • Lei Fang
  • Jirong Wen

Although scaling up training data has significantly improved the general multimodal capabilities of Large Vision-Language Models (LVLMs), they still suffer from the hallucination issue, generating text that is inconsistent with the visual input. This phenomenon motivates us to systematically investigate the role of training data in hallucination. We introduce a new benchmark, POPEv2, which consists of counterfactual images collected from the training data of LVLMs with certain objects masked. Through comprehensive evaluation on POPEv2, we find that current LVLMs suffer from training bias: they fail to fully leverage their training data and hallucinate more frequently on images seen during training. Specifically, they perform poorly on counterfactual images, often incorrectly answering “Yes” to questions about masked objects. To understand this issue, we conduct probing experiments on the models’ internal components, revealing that this training bias is primarily located in the language modeling (LM) head, which fails to correctly translate accurate visual representations into textual outputs. Based on these findings, we propose Obliviate, an efficient and lightweight unlearning method designed to mitigate object hallucination via training bias unlearning. Obliviate identifies the discrepancy between ground-truth labels and model outputs on the training data as a proxy for bias and adopts a parameter- and data-efficient fine-tuning strategy that only updates the LM head. Extensive experiments demonstrate the effectiveness of our approach. While only reusing the training data and updating approximately 2% of the parameters, Obliviate significantly reduces hallucination across both discriminative and generative tasks. Furthermore, it demonstrates strong scalability with respect to both model size (2B to 72B) and training data volume, and exhibits promising generalization to hallucination types beyond object-level hallucination.
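
The LM-head-only update at the core of Obliviate can be illustrated with a minimal sketch. The snippet below is not the released implementation; it uses GPT-2 as a stand-in for an LVLM and assumes batches whose label tokens encode the corrected, de-biased answers.

```python
import torch
from transformers import AutoModelForCausalLM

model = AutoModelForCausalLM.from_pretrained("gpt2")   # stand-in for an LVLM
for p in model.parameters():
    p.requires_grad = False
head = model.get_output_embeddings()                    # the LM head
for p in head.parameters():
    p.requires_grad = True   # note: GPT-2 ties this weight to the input embeddings;
                             # a real LVLM head would typically be untied

optimizer = torch.optim.AdamW(
    [p for p in model.parameters() if p.requires_grad], lr=1e-5)

def unlearning_step(batch):
    # batch["input_ids"] / batch["labels"]: tokenized prompts whose label tokens
    # encode the ground-truth answer (e.g. "No" for a question about a masked
    # object), serving as the proxy signal for the training bias described above.
    loss = model(input_ids=batch["input_ids"], labels=batch["labels"]).loss
    loss.backward()
    optimizer.step()
    optimizer.zero_grad()
    return loss.item()
```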

AAAI Conference 2026 Conference Paper

ElastoGen: 4D Generative Elastodynamics

  • Yutao Feng
  • Yintong Shang
  • Xiang Feng
  • Lei Lan
  • Shandian Zhe
  • Tianjia Shao
  • Hongzhi Wu
  • Kun Zhou

We present ElastoGen, a knowledge-driven AI model that generates physically accurate 4D elastodynamics. Unlike deep models that learn from video- or image-based observations, ElastoGen leverages the principles of physics and learns from established mathematical and optimization procedures. The core idea of ElastoGen is converting the differential equation, corresponding to the nonlinear force equilibrium, into a series of iterative local convolution-like operations, which naturally fit deep architectures. We carefully build our network module following this overarching design philosophy. ElastoGen is much more lightweight in terms of both training requirements and network scale than deep generative models. Because of its alignment with actual physical procedures, ElastoGen efficiently generates accurate dynamics for a wide range of hyperelastic materials and can be easily integrated with upstream and downstream deep modules to enable end-to-end 4D generation.

AAAI Conference 2026 Conference Paper

LR-AdaInSeg: Adaptive Instance Segmentation of Incomplete 3D Scenes Driven by Low-Rank Networks

  • Qin Zhang
  • Kun Zhou
  • Xulun Ye

3D full-scene segmentation technology has demonstrated great potential driven by large models, but it often faces challenges of incomplete scenes and identification of invisible classes in practical applications. To address this, we propose the LR-AdaInSeg method, which significantly enhances the model’s generalization ability in incomplete scenes through two key innovations: First, we design a Bayesian Low-Rank Module, which effectively solves the problem of feature space redundancy through dynamic optimization of the network structure, improving adaptability to incomplete scenes. Second, we combine graph contrastive clustering with the Low-Rank module, leveraging its robust feature representation capability to achieve accurate differentiation of invisible classes. In terms of implementation, we build a multi-scale feature extraction framework based on the 3D U-Net and utilize the 3D prompt points and their 2D masks as supervisory signals to achieve effective fusion of geometric and semantic information. Experiments show that our method achieves advanced performance on multiple benchmarks such as ScanNet, particularly excelling in handling incomplete scenes and invisible class objects.

AAAI Conference 2026 Conference Paper

Vision-G1: Towards General Reasoning Vision-Language Models via Reinforcement Learning

  • Yuheng Zha
  • Kun Zhou
  • Yujia Wu
  • Yushu Wang
  • Jie Feng
  • Zhi Xu
  • Shibo Hao
  • Zhengzhong Liu

Recent vision-language models (VLMs) show strong reasoning capabilities through training with reinforcement learning from verifiable rewards (RLVR). Despite their impressive capabilities, current VLMs focus on a limited range of reasoning tasks, such as mathematical and logical reasoning, due to the lack of readily available verifiable reward data in broader domains. As a result, these models struggle to generalize their reasoning abilities to the wide variety of challenges encountered in real-world environments. To address this limitation, we collect and assemble a comprehensive RL-ready visual reasoning training dataset encompassing 46 datasets across 13 dimensions of 5 domains, covering a wide range of realistic scenarios such as infographic reasoning, mathematical reasoning, spatial reasoning, and general science reasoning. Based on this dataset, we propose an influence function-based data filtering strategy and a multi-round data curriculum method to iteratively strengthen general visual reasoning abilities. Using this approach, we train a general reasoning VLM, namely Vision-G1. Our 7B model achieves state-of-the-art performance across nine visual reasoning benchmarks, surpassing previous similar-sized VLMs and even GPT-4o and Gemini-1.5 Flash.

NeurIPS Conference 2025 Conference Paper

Activation Control for Efficiently Eliciting Long Chain-of-thought Ability of Language Models

  • Zekai Zhao
  • Qi Liu
  • Kun Zhou
  • Zihan Liu
  • Yifei Shao
  • Zhiting Hu
  • Biwei Huang

Despite the remarkable reasoning performance, eliciting the long chain-of-thought (CoT) ability in large language models (LLMs) typically requires costly reinforcement learning or supervised fine-tuning on high-quality distilled data. We investigate the internal mechanisms behind this capability and show that a small set of high-impact activations in the last few layers largely governs the long-form reasoning attributes, e.g., output length and self-reflection. By simply amplifying these activations and adding "wait" tokens, the long CoT ability can be invoked without training, leading to significantly increased self-reflection rates and accuracy. In addition, we find that the activation changes follow predictable trajectories, i.e., a sharp rise after special tokens and a subsequent exponential decay. Based on these insights, we introduce a general training-free activation control technique. It utilizes a few contrastive examples to identify the relevant activations, and then incorporates simple analytic functions to adjust their values at inference time to elicit long CoTs. Extensive experiments have verified the effectiveness of our method in efficiently eliciting the long CoT ability of LLMs and improving their performance. Besides, we further propose a parameter-efficient fine-tuning method that trains only the last-layer activation amplification module and a few LoRA layers, outperforming LoRA on reasoning benchmarks with far fewer parameters. Our code and data will be publicly released.
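
The training-free control described above can be illustrated with a small forward-hook sketch. The hidden dimensions and gain below are hypothetical placeholders (in the paper they are identified from contrastive examples), and GPT-2 stands in for the reasoning LLM.

```python
import torch
from transformers import AutoModelForCausalLM, AutoTokenizer

model_name = "gpt2"                       # stand-in for the LLM under study
tok = AutoTokenizer.from_pretrained(model_name)
model = AutoModelForCausalLM.from_pretrained(model_name).eval()

reasoning_dims = [7, 42, 123]             # hypothetical high-impact hidden dimensions
gain = 2.0                                # hypothetical amplification factor

def amplify(module, inputs, output):
    hidden = output[0]                    # (batch, seq, hidden) states of the block
    hidden[..., reasoning_dims] *= gain   # scale the selected activations in place
    return output

handle = model.transformer.h[-1].register_forward_hook(amplify)   # last block only
prompt = "Question: 17 * 24 = ?\nLet's think step by step. Wait,"
ids = tok(prompt, return_tensors="pt").input_ids
print(tok.decode(model.generate(ids, max_new_tokens=64)[0]))
handle.remove()
```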

IJCAI Conference 2025 Conference Paper

AttentionDrag: Exploiting Latent Correlation Knowledge in Pre-trained Diffusion Models for Image Editing

  • Biao Yang
  • Muqi Huang
  • Yuhui Zhang
  • Yun Xiong
  • Kun Zhou
  • Xi Chen
  • Shiyang Zhou
  • Huishuai Bao

Traditional point-based image editing methods rely on iterative latent optimization or geometric transformations, which are either inefficient or fail to capture the semantic relationships within the image. These methods often overlook the powerful yet underutilized image editing capabilities inherent in pre-trained diffusion models. In this work, we propose a novel one-step point-based image editing method, named AttentionDrag, which leverages the inherent latent knowledge and feature correlations within pre-trained diffusion models for image editing tasks. This framework enables semantic consistency and high-quality manipulation without extensive re-optimization or retraining. Specifically, we reuse the latent correlation knowledge learned by the self-attention mechanism in the U-Net module during the DDIM inversion process to automatically identify and adjust relevant image regions, ensuring semantic validity and consistency. Additionally, AttentionDrag adaptively generates masks to guide the editing process, enabling precise and context-aware modifications with user-friendly interaction. Our results demonstrate performance that surpasses most state-of-the-art methods with significantly faster speeds, offering a more efficient and semantically coherent solution for point-based image editing tasks. Code is released at: https://github.com/GPlaying/AttentionDrag.

TMLR Journal 2025 Journal Article

Autonomous Imagination: Closed-Loop Decomposition of Visual-to-Textual Conversion in Visual Reasoning for Multimodal Large Language Models

  • Jingming Liu
  • Yumeng Li
  • Boyuan Xiao
  • Yichang Jian
  • Ziang Qin
  • Tianjia Shao
  • Yao-Xiang Ding
  • Kun Zhou

Under the pure textual modality, Large Language Models (LLMs) have demonstrated remarkable success in complex reasoning tasks by decomposing them into simpler sub-problems. However, Multimodal Large Language Models (MLLMs) still struggle with some seemingly straightforward visual tasks, such as counting and solving jigsaw puzzles. We argue that these tasks challenge the ability of visual-to-textual conversion, where MLLMs convert visual information perceived from the input scene to textual information for further reasoning and generating the answer. If the complexity of the visual input is beyond the perceptual capability of the MLLM, then without decomposing this conversion process, simply scaling inference-time reasoning cannot solve the task because it repeatedly encounters the same perceptual bottleneck. We propose an approach, autonomous imagination, that enables MLLMs to iteratively modify visual inputs (e.g., isolating objects, rearranging puzzle pieces) into intermediate visual states, decomposing visual-to-textual conversion into closed-loop visual modification steps. We show that, without any retraining, MLLMs can now solve tasks initially beyond their perceptual capability, highlighting that closed-loop visual modification can be an effective way of decomposing a visual reasoning task into solvable substeps. Our code and data are released at https://future-item.github.io/autoimagine-site/.

AAAI Conference 2025 Conference Paper

Enhancing Identity-Deformation Disentanglement in StyleGAN for One-Shot Face Video Re-Enactment

  • Qing Chang
  • Yao-Xiang Ding
  • Kun Zhou

The task of one-shot face video re-enactment aims at generating a target video of faces with the same identity as one source frame and the facial deformation of the driving video. To achieve high-quality generation, it is essential to precisely disentangle identity-related and identity-independent characteristics, while building expressive features that keep high-frequency facial details; both challenges remain unaddressed by existing approaches. To deal with them, we propose a two-stage generation model based on StyleGAN, whose key novel techniques lie in better disentangling identity and deformation codes in the latent space through identity-based modeling, and in manipulating intermediate StyleGAN features at the second stage to augment facial details of the generated targets. To further improve identity consistency, a data augmentation method is introduced during training to enhance key identity-related features such as hair and wrinkles. Extensive experimental results demonstrate the superiority of our approach compared to state-of-the-art methods.

AAAI Conference 2025 Conference Paper

GenesisTex2: Stable, Consistent and High-Quality Text-to-Texture Generation

  • Jiawei Lu
  • YingPeng Zhang
  • Zengjun Zhao
  • He Wang
  • Kun Zhou
  • Tianjia Shao

Large-scale text-guided image diffusion models have demonstrated remarkable results in text-to-image (T2I) generation. However, applying these models to synthesize textures for 3D geometries remains challenging due to the domain gap between 2D images and textures on a 3D surface. Early works that used a projecting-inpainting approach managed to preserve generation diversity, but often resulted in noticeable artifacts and style inconsistencies. While recent methods have attempted to address these inconsistencies, they often introduce other issues, such as blurring, over-saturation, or over-smoothing. To overcome these challenges, we propose a novel text-to-texture synthesis framework that takes advantage of pre-trained diffusion models. We introduce a local attention reweighing mechanism in the self-attention layers to guide the model to focus on spatially correlated patches across different views, thereby enhancing local details while preserving cross-view consistency. Additionally, we propose a novel latent space merge pipeline, which further ensures consistency across different viewpoints without sacrificing too much diversity. Our method significantly outperforms existing state-of-the-art techniques in terms of texture consistency and visual quality, while delivering results much faster than distillation-based methods. Importantly, our framework does not require additional training or fine-tuning, making it highly adaptable to a wide range of models available on public platforms.

IJCAI Conference 2025 Conference Paper

Low-Light Video Enhancement via Spatial-Temporal Consistent Decomposition

  • Xiaogang Xu
  • Kun Zhou
  • Tao Hu
  • Jiafei Wu
  • Ruixing Wang
  • Hao Peng
  • Bei Yu

Low-Light Video Enhancement (LLVE) seeks to restore dynamic or static scenes plagued by severe invisibility and noise. In this paper, we present an innovative video decomposition strategy that incorporates view-independent and view-dependent components to enhance the performance of LLVE. We leverage dynamic cross-frame correspondences for the view-independent term (which primarily captures intrinsic appearance) and impose a scene-level continuity constraint on the view-dependent term (which mainly describes the shading condition) to achieve consistent and satisfactory decomposition results. To further ensure consistent decomposition, we introduce a dual-structure enhancement network featuring a cross-frame interaction mechanism. By supervising different frames simultaneously, this network encourages them to exhibit matching decomposition features. This mechanism can seamlessly integrate with encoder-decoder single-frame networks, incurring minimal additional parameter costs. Extensive experiments are conducted on widely recognized LLVE benchmarks, covering diverse scenarios. Our framework consistently outperforms existing methods, establishing a new SOTA performance.

AAAI Conference 2025 Conference Paper

RETQA: A Large-Scale Open-Domain Tabular Question Answering Dataset for Real Estate Sector

  • Zhensheng Wang
  • Wenmian Yang
  • Kun Zhou
  • Yiquan Zhang
  • Weijia Jia

The real estate market relies heavily on structured data, such as property details, market trends, and price fluctuations. However, the lack of specialized Tabular Question Answering datasets in this domain limits the development of automated question-answering systems. To fill this gap, we introduce RETQA, the first large-scale open-domain Chinese Tabular Question Answering dataset for Real Estate. RETQA comprises 4,932 tables and 20,762 question-answer pairs across 16 sub-fields within three major domains: property information, real estate company finance information and land auction information. Compared with existing tabular question answering datasets, RETQA poses greater challenges due to three key factors: long-table structures, open-domain retrieval, and multi-domain queries. To tackle these challenges, we propose the SLUTQA framework, which integrates large language models with spoken language understanding tasks to enhance retrieval and answering accuracy. Extensive experiments demonstrate that SLUTQA significantly improves the performance of large language models on RETQA by in-context learning. RETQA and SLUTQA provide essential resources for advancing tabular question answering research in the real estate domain, addressing critical challenges in open-domain and long-table question-answering.

NeurIPS Conference 2025 Conference Paper

Revisiting Reinforcement Learning for LLM Reasoning from A Cross-Domain Perspective

  • Jorge (Zhoujun) Cheng
  • Shibo Hao
  • Tianyang Liu
  • Fan Zhou
  • Yutao Xie
  • Feng Yao
  • Yuexin Bian
  • Nilabjo Dey

Reinforcement learning (RL) has shown promise in enhancing large language model (LLM) reasoning, yet progress towards broader capabilities is limited by the availability of high-quality, multi-domain datasets. This work introduces a 92K RL-for-reasoning dataset designed to address this gap, covering six reasoning domains: Math, Code, Science, Logic, Simulation, and Tabular, each with corresponding verifiers. We build the dataset via a careful data-curation pipeline, including sourcing, deduplication, reward design, and domain-specific and difficulty-based filtering, to facilitate the systematic investigation of cross-domain RL generalization. Our study using this dataset suggests the efficacy of a simple mixed-domain RL training approach and reveals several key aspects affecting cross-domain transferability. We further train two models, at 7B and 32B scale, purely with RL on our curated data and observe largely improved performance over leading open RL reasoning model baselines, with gains of 7.3% and 7.8% respectively on an extensive 17-task, six-domain evaluation suite. We are releasing our dataset, code, and evaluation suite to the community, aiming to support further research and development of more general RL-enhanced reasoning models.

NeurIPS Conference 2025 Conference Paper

Towards General Continuous Memory for Vision-Language Models

  • Wenyi WU
  • Zixuan Song
  • Kun Zhou
  • Yifei Shao
  • Zhiting Hu
  • Biwei Huang

Language models (LMs) and their extension, vision-language models (VLMs), have achieved remarkable performance across various tasks. However, they still struggle with complex reasoning tasks that require multimodal or multilingual real-world knowledge. To support such capabilities, an external memory system that can efficiently provide relevant multimodal information is essential. Existing approaches generally concatenate image and text tokens into a long sequence as memory, which, however, may drastically increase context length and even degrade performance. In contrast, we propose using continuous memory, a compact set of dense embeddings, to more effectively and efficiently represent multimodal and multilingual knowledge. Our key insight is that a VLM can serve as its own continuous memory encoder. We empirically show that this design improves performance on complex multimodal reasoning tasks. Building on this, we introduce a data-efficient and parameter-efficient method to fine-tune the VLM into a memory encoder, requiring only 1.2% of the model's parameters and a small corpus of 15.6K self-synthesized samples. Our approach, CoMEM, utilizes the VLM's original capabilities to encode arbitrary multimodal and multilingual knowledge into just 8 continuous embeddings. Since the inference-time VLM remains frozen, our memory module is plug-and-play and can be flexibly integrated as needed. Extensive experiments across eight multimodal reasoning benchmarks demonstrate the effectiveness of our approach. Code and data are publicly released at https://github.com/WenyiWU0111/CoMEM.

NeurIPS Conference 2024 Conference Paper

JiuZhang3.0: Efficiently Improving Mathematical Reasoning by Training Small Data Synthesis Models

  • Kun Zhou
  • Beichen Zhang
  • Jiapeng Wang
  • Zhipeng Chen
  • Wayne X. Zhao
  • Jing Sha
  • Zhichao Sheng
  • Shijin Wang

Mathematical reasoning is an important capability of large language models (LLMs) for real-world applications. To enhance this capability, existing work either collects large-scale math-related texts for pre-training, or relies on stronger LLMs (e.g., GPT-4) to synthesize massive math problems. Both types of work generally lead to large costs in training or synthesis. To reduce the cost, based on openly available texts, we propose an efficient way to train a small LLM for math problem synthesis, so as to efficiently generate sufficient high-quality pre-training data. To achieve this, we create a dataset using GPT-4 to distill its data synthesis capability into the small LLM. Concretely, we craft a set of prompts based on human education stages to guide GPT-4 to synthesize problems covering diverse math knowledge and difficulty levels. Besides, we adopt a gradient-based influence estimation method to select the most valuable math-related texts. Both are fed into GPT-4 to create the knowledge distillation dataset for training the small LLM. We leverage it to synthesize 6 million math problems for pre-training our JiuZhang3.0 model. The whole process only needs to invoke the GPT-4 API 9.3k times and uses 4.6B data for training. Experimental results show that JiuZhang3.0 achieves state-of-the-art performance on several mathematical reasoning datasets, under both natural language reasoning and tool manipulation settings. Our code and data will be publicly released at https://github.com/RUCAIBox/JiuZhang3.0.

NeurIPS Conference 2024 Conference Paper

UPS: Unified Projection Sharing for Lightweight Single-Image Super-resolution and Beyond

  • Kun Zhou
  • Xinyu Lin
  • Zhonghang Liu
  • Xiaoguang Han
  • Jiangbo Lu

To date, transformer-based frameworks have demonstrated impressive results in single-image super-resolution (SISR). However, under practical lightweight scenarios, the complex interaction of deep image feature extraction and similarity modeling limits the performance of these methods, since they require simultaneous layer-specific optimization of both tasks. In this work, we introduce a novel Unified Projection Sharing algorithm (UPS) that decouples feature extraction from similarity modeling, achieving notable performance. To do this, we establish a unified projection space, defined by a learnable projection matrix, for similarity calculation across all self-attention layers. As a result, deep image feature extraction is still optimized per layer, while similarity modeling is carried out by projecting these image features onto the shared projection space. Extensive experiments demonstrate that our proposed UPS achieves state-of-the-art performance relative to leading lightweight SISR methods, as verified on various popular benchmarks. Moreover, our unified optimized projection space exhibits encouraging robustness on unseen data (degraded and depth images). Finally, UPS also demonstrates promising results across various image restoration tasks, including real-world and classic SISR, image denoising, and image deblocking.
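
A minimal sketch of the shared-projection idea follows, assuming a toy transformer in which one learnable projection matrix computes query/key similarity for every layer while value projections stay layer-specific. This is an illustration of the concept, not the released UPS code.

```python
import torch
import torch.nn as nn

class SharedProjectionAttention(nn.Module):
    def __init__(self, dim, shared_qk):
        super().__init__()
        self.shared_qk = shared_qk          # the one projection reused by every layer
        self.value = nn.Linear(dim, dim)    # feature extraction stays layer-specific
        self.out = nn.Linear(dim, dim)

    def forward(self, x):                   # x: (B, N, dim) token/patch features
        q = self.shared_qk(x)               # similarity is computed in the shared,
        k = self.shared_qk(x)               # unified projection space
        attn = torch.softmax(q @ k.transpose(-2, -1) / q.shape[-1] ** 0.5, dim=-1)
        return self.out(attn @ self.value(x))

dim, layers = 64, 4
shared_qk = nn.Linear(dim, 32, bias=False)  # the learnable unified projection matrix
blocks = nn.ModuleList([SharedProjectionAttention(dim, shared_qk) for _ in range(layers)])

x = torch.randn(2, 16, dim)                 # 2 images, 16 tokens each
for blk in blocks:
    x = x + blk(x)                          # residual connection per layer, as usual
print(x.shape)                              # torch.Size([2, 16, 64])
```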

IJCAI Conference 2023 Conference Paper

Diffusion Models for Non-autoregressive Text Generation: A Survey

  • Yifan Li
  • Kun Zhou
  • Wayne Xin Zhao
  • Ji-Rong Wen

Non-autoregressive (NAR) text generation has attracted much attention in the field of natural language processing; it greatly reduces inference latency but sacrifices generation accuracy. Recently, diffusion models, a class of latent variable generative models, have been introduced into NAR text generation, showing improved text generation quality. In this survey, we review the recent progress in diffusion models for NAR text generation. As background, we first present the general definition of diffusion models and of text diffusion models, and then discuss their merits for NAR generation. As the core content, we further introduce the two mainstream diffusion models in existing work on text diffusion, and review the key designs of the diffusion process. Moreover, we discuss the utilization of pre-trained language models (PLMs) for text diffusion models and introduce optimization techniques for text data. Finally, we discuss several promising directions and conclude this paper. Our survey aims to provide researchers with a systematic reference to related research on text diffusion models for NAR generation. We also present our collection of text diffusion models at https://github.com/RUCAIBox/Awesome-Text-Diffusion-Models.

NeurIPS Conference 2023 Conference Paper

Evaluating and Improving Tool-Augmented Computation-Intensive Math Reasoning

  • Beichen Zhang
  • Kun Zhou
  • Xilin Wei
  • Xin Zhao
  • Jing Sha
  • Shijin Wang
  • Ji-Rong Wen

Chain-of-thought (CoT) prompting and tool augmentation have been validated in recent work as effective practices for improving large language models (LLMs) at step-by-step reasoning on complex math-related tasks. However, most existing math reasoning datasets may not be able to fully evaluate and analyze the ability of LLMs to manipulate tools and perform reasoning, as they often require very few invocations of tools or lack annotations for evaluating intermediate reasoning steps, thus supporting only outcome evaluation. To address the issue, we construct CARP, a new Chinese dataset consisting of 4,886 computation-intensive algebra problems with formulated annotations on intermediate steps, facilitating the evaluation of the intermediate reasoning process. In CARP, we test four LLMs with CoT prompting, and find that they are all prone to making mistakes at the early steps of the solution, leading to incorrect answers. Based on this finding, we propose a new approach, namely DELI, that facilitates deliberation on reasoning steps with tool interfaces. In DELI, we first initialize a step-by-step solution based on retrieved exemplars, then iterate two deliberation procedures that check and refine the intermediate steps of the generated solution, from both tool manipulation and natural language reasoning perspectives, until the solutions converge or the maximum number of iterations is reached. Experimental results on CARP and six other datasets show that the proposed DELI mostly outperforms competitive baselines, and can further boost the performance of existing CoT methods. Our data and code are available at https://github.com/RUCAIBox/CARP.

AAAI Conference 2022 Conference Paper

Best-Buddy GANs for Highly Detailed Image Super-resolution

  • Wenbo Li
  • Kun Zhou
  • Lu Qi
  • Liying Lu
  • Jiangbo Lu

We consider the single image super-resolution (SISR) problem, where a high-resolution (HR) image is generated from a low-resolution (LR) input. Recently, generative adversarial networks (GANs) have become popular for hallucinating details. Most methods along this line rely on a predefined single-LR-single-HR mapping, which is not flexible enough for the ill-posed SISR task. Also, GAN-generated fake details may often undermine the realism of the whole image. We address these issues by proposing best-buddy GANs (Beby-GAN) for rich-detail SISR. Relaxing the rigid one-to-one constraint, we allow the estimated patches to dynamically seek trustworthy surrogates of supervision during training, which is beneficial to producing more reasonable details. Besides, we propose a region-aware adversarial learning strategy that directs our model to focus adaptively on generating details for textured areas. Extensive experiments justify the effectiveness of our method. An ultra-high-resolution 4K dataset is also constructed to facilitate future super-resolution research.
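
The relaxed one-to-one supervision can be sketched as a "best buddy" reconstruction term: each super-resolved patch is compared against its aligned HR patch and a few shifted neighbours, and the closest candidate supervises it. The patch size, shift range, and wrap-around rolling below are illustrative choices, not the paper's settings.

```python
import torch
import torch.nn.functional as F

def best_buddy_loss(sr, hr, patch=4, max_shift=2):
    # sr, hr: (B, C, H, W), same size. Each non-overlapping sr patch is supervised
    # by the most similar candidate among the aligned HR patch and its slightly
    # shifted neighbours, relaxing the rigid one-to-one pairing.
    sr_p = F.unfold(sr, patch, stride=patch)                     # (B, C*p*p, N)
    candidates = []
    for dy in range(-max_shift, max_shift + 1):
        for dx in range(-max_shift, max_shift + 1):
            shifted = torch.roll(hr, (dy, dx), dims=(2, 3))      # wrap-around shift (illustrative)
            hr_p = F.unfold(shifted, patch, stride=patch)
            candidates.append((sr_p - hr_p).abs().mean(dim=1))   # (B, N) L1 per patch
    return torch.stack(candidates).min(dim=0).values.mean()      # best buddy per patch

sr = torch.rand(1, 3, 32, 32, requires_grad=True)                # generator output (stand-in)
hr = torch.rand(1, 3, 32, 32)                                    # ground-truth HR image
loss = best_buddy_loss(sr, hr)
loss.backward()
print(float(loss))
```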

AAAI Conference 2022 Conference Paper

HoD-Net: High-Order Differentiable Deep Neural Networks and Applications

  • Siyuan Shen
  • Tianjia Shao
  • Kun Zhou
  • Chenfanfu Jiang
  • Feng Luo
  • Yin Yang

We introduce a deep architecture named HoD-Net to enable high-order differentiability for deep learning. HoD-Net is based on, and generalizes, the complex-step finite difference (CSFD) method. While similar to classic finite difference, CSFD approaches the derivative of a function from a higher-dimensional complex domain, leading to highly accurate and robust differentiation computation without numerical stability issues. This method can be coupled with backpropagation and adjoint perturbation methods for an efficient calculation of high-order derivatives. We show how this numerical scheme can be leveraged in challenging deep learning problems, such as high-order network training, deep learning-based physics simulation, and neural differential equations.
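
The CSFD principle that HoD-Net builds on is easy to demonstrate: a first derivative can be read off the imaginary part of a complex-perturbed evaluation, f'(x) ≈ Im(f(x + ih))/h, which avoids subtractive cancellation even for extremely small h. The NumPy snippet below shows the basic formula only, independent of the HoD-Net architecture.

```python
import numpy as np

def csfd_derivative(f, x, h=1e-30):
    # Complex-step finite difference: no subtraction of nearly equal numbers,
    # so h can be made tiny without losing precision.
    return np.imag(f(x + 1j * h)) / h

f = lambda x: np.exp(x) * np.sin(x)          # analytic test function
x = 0.7
exact = np.exp(x) * (np.sin(x) + np.cos(x))  # analytic derivative
print(csfd_derivative(f, x), exact)          # the two values agree to machine precision
```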

IJCAI Conference 2022 Conference Paper

Learning Implicit Body Representations from Double Diffusion Based Neural Radiance Fields

  • Guangming Yao
  • Hongzhi Wu
  • Yi Yuan
  • Lincheng Li
  • Kun Zhou
  • Xin Yu

In this paper, we present a novel double diffusion based neural radiance field, dubbed DD-NeRF, to reconstruct human body geometry and render the human body appearance in novel views from a sparse set of images. We first propose a double diffusion mechanism to achieve expressive representations of input images by fully exploiting human body priors and image appearance details at two levels. At the coarse level, we model the coarse human body poses and shapes via an unclothed 3D deformable vertex model as guidance. At the fine level, we present a multi-view sampling network to capture subtle geometric deformations and detailed image appearance, such as clothing and hair, from multiple input views. Considering the sparsity of the two-level features, we diffuse them into feature volumes in the canonical space to construct neural radiance fields. Then, we present a signed distance function (SDF) regression network to construct body surfaces from the diffused features. Thanks to our double diffused representations, our method can even synthesize novel views of unseen subjects. Experiments on various datasets demonstrate that our approach outperforms the state of the art in both geometric reconstruction and novel view synthesis.

AAAI Conference 2022 Conference Paper

Pose Guided Image Generation from Misaligned Sources via Residual Flow Based Correction

  • Jiawei Lu
  • He Wang
  • Tianjia Shao
  • Yin Yang
  • Kun Zhou

Generating new images with desired properties (e.g., new views/poses) from source images has been enthusiastically pursued recently, due to its wide range of potential applications. One way to ensure high-quality generation is to use multiple sources with complementary information, such as different views of the same object. However, as source images are often misaligned due to the large disparities among camera settings, strong assumptions have been made in the past with respect to the camera(s) and/or the object of interest, limiting the application of such techniques. Therefore, we propose a new general approach which models multiple types of variation among sources, such as view angles, poses, and facial expressions, in a unified framework, so that it can be employed on datasets of vastly different nature. We verify our approach on a variety of data including human bodies, faces, city scenes, and 3D objects. Both the qualitative and quantitative results demonstrate the better performance of our method compared with the state of the art.

NeurIPS Conference 2022 Conference Paper

Pre-Trained Model Reusability Evaluation for Small-Data Transfer Learning

  • Yao-Xiang Ding
  • Xi-Zhu Wu
  • Kun Zhou
  • Zhi-Hua Zhou

We study model reusability evaluation (MRE) for source pre-trained models: evaluating their transfer learning performance on new target tasks. In particular, we focus on the setting in which the target training datasets are small, making it difficult to produce reliable MRE scores from them. Under this situation, we propose synergistic learning for building the task-model metric, which can be realized by collecting a set of pre-trained models and asking a group of data providers to participate. We provide theoretical guarantees to show that the learned task-model metric distances can serve as trustworthy MRE scores, and propose synergistic learning algorithms and models for general learning tasks. Experiments show that the MRE models learned by synergistic learning can generate significantly more reliable MRE scores than existing approaches for small-data transfer learning.

IJCAI Conference 2021 Conference Paper

EmbedMask: Embedding Coupling for Instance Segmentation

  • Hui Ying
  • Zhaojin Huang
  • Shu Liu
  • Tianjia Shao
  • Kun Zhou

Current instance segmentation methods can be categorized into segmentation-based methods and proposal-based methods. The former performs segmentation first and then does clustering, while the latter detects objects first and then predicts the mask for each object proposal. In this work, we propose a single-stage method, named EmbedMask, that unifies both approaches by taking their advantages, so it can achieve good performance in instance segmentation and produce high-resolution masks at high speed. EmbedMask introduces two newly defined embeddings for mask prediction: pixel embeddings and proposal embeddings. During training, we enforce the pixel embedding to be close to its coupled proposal embedding if they belong to the same instance. During inference, pixels are assigned to the mask of a proposal if their embeddings are similar. This mechanism brings several benefits. First, the pixel-level clustering enables EmbedMask to generate high-resolution masks and avoids the complicated two-stage mask prediction. Second, the existence of proposal embeddings simplifies and strengthens the clustering procedure, so our method can achieve high speed and better performance than segmentation-based methods. Without bells and whistles, EmbedMask outperforms the state-of-the-art instance segmentation method Mask R-CNN on the challenging COCO dataset, obtaining more detailed masks at a higher speed.
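
A toy sketch of the embedding coupling described above (not the EmbedMask implementation): training pulls each pixel embedding toward the embedding of its proposal, and inference assigns a pixel to a proposal when the two embeddings are close enough. The hinge margin and distance threshold are illustrative choices.

```python
import torch

def coupling_loss(pixel_emb, proposal_emb, pixel_to_proposal, margin=0.5):
    # pixel_emb: (N, D); proposal_emb: (P, D); pixel_to_proposal: (N,) instance indices.
    target = proposal_emb[pixel_to_proposal]             # matched proposal per pixel
    dist = (pixel_emb - target).norm(dim=-1)
    return torch.clamp(dist - margin, min=0).mean()      # hinge: be closer than the margin

def assign_pixels(pixel_emb, proposal_emb, threshold=0.5):
    dist = torch.cdist(pixel_emb, proposal_emb)           # (N, P) pairwise distances
    best = dist.argmin(dim=-1)
    keep = dist.gather(1, best[:, None]).squeeze(1) < threshold
    return torch.where(keep, best, torch.full_like(best, -1))   # -1 = background

pixel_emb = torch.randn(100, 8)
proposal_emb = torch.randn(3, 8)
labels = torch.randint(0, 3, (100,))
print(coupling_loss(pixel_emb, proposal_emb, labels).item())
print(assign_pixels(pixel_emb, proposal_emb)[:10])
```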

AAAI Conference 2021 Conference Paper

In-game Residential Home Planning via Visual Context-aware Global Relation Learning

  • Lijuan Liu
  • Yin Yang
  • Yi Yuan
  • Tianjia Shao
  • He Wang
  • Kun Zhou

In this paper, we propose an effective global relation learning algorithm to recommend an appropriate location for a building unit during in-game customization of a residential home complex. Given a construction layout, we propose a visual context-aware graph generation network that learns the implicit global relations among the scene components and infers the location of a new building unit. The proposed network takes as input the scene graph and the corresponding top-view depth image. It provides location recommendations for a newly added building unit by learning an auto-regressive edge distribution conditioned on existing scenes. We also introduce a global graph-image matching loss to enhance the awareness of essential geometric semantics of the site. Qualitative and quantitative experiments demonstrate that the recommended locations reflect the implicit spatial rules of components in residential estates well, and that the approach is instructive and practical for locating building units in the 3D scene of a complex construction.

AAAI Conference 2021 Conference Paper

Neural Sentence Ordering Based on Constraint Graphs

  • Yutao Zhu
  • Kun Zhou
  • Jian-Yun Nie
  • Shengchao Liu
  • Zhicheng Dou

Sentence ordering aims at arranging a list of sentences in the correct order. Based on the observation that sentence order at different distances may rely on different types of information, we devise a new approach based on multi-granular orders between sentences. These orders form multiple constraint graphs, which are then encoded by Graph Isomorphism Networks and fused into sentence representations. Finally, sentence order is determined using the order-enhanced sentence representations. Our experiments on five benchmark datasets show that our method outperforms all existing baselines significantly, achieving a new state-of-the-art performance. The results demonstrate the advantage of considering multiple types of order information and using graph neural networks to integrate sentence content and order information for the task. Our code is available at https://github.com/DaoD/ConstraintGraph4NSO.

AAAI Conference 2021 Conference Paper

One-shot Face Reenactment Using Appearance Adaptive Normalization

  • Guangming Yao
  • Yi Yuan
  • Tianjia Shao
  • Shuang Li
  • Shanqi Liu
  • Yong Liu
  • Mengmeng Wang
  • Kun Zhou

The paper proposes a novel generative adversarial network for one-shot face reenactment, which can animate a single face image to a different pose-and-expression (provided by a driving image) while keeping its original appearance. The core of our network is a novel mechanism called appearance adaptive normalization, which can effectively integrate the appearance information from the input image into our face generator by modulating the feature maps of the generator using the learned adaptive parameters. Furthermore, we specially design a local net to reenact the local facial components (i.e., eyes, nose, and mouth) first, which is a much easier task for the network to learn and can in turn provide explicit anchors to guide our face generator to learn the global appearance and pose-and-expression. Extensive quantitative and qualitative experiments demonstrate the significant efficacy of our model compared with prior one-shot methods.
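
The appearance adaptive normalization mechanism can be sketched as an AdaIN-style block: a per-channel scale and shift predicted from an appearance code modulate instance-normalized generator features. Layer sizes below are illustrative, and this is a conceptual sketch rather than the paper's implementation.

```python
import torch
import torch.nn as nn

class AppearanceAdaptiveNorm(nn.Module):
    def __init__(self, channels, appearance_dim):
        super().__init__()
        self.norm = nn.InstanceNorm2d(channels, affine=False)
        self.to_scale = nn.Linear(appearance_dim, channels)   # learned adaptive parameters
        self.to_shift = nn.Linear(appearance_dim, channels)

    def forward(self, feat, appearance):
        # feat: (B, C, H, W) generator features; appearance: (B, appearance_dim)
        # code extracted from the source image.
        scale = self.to_scale(appearance)[:, :, None, None]
        shift = self.to_shift(appearance)[:, :, None, None]
        return self.norm(feat) * (1 + scale) + shift

block = AppearanceAdaptiveNorm(channels=64, appearance_dim=128)
feat = torch.randn(2, 64, 32, 32)
appearance = torch.randn(2, 128)
print(block(feat, appearance).shape)   # torch.Size([2, 64, 32, 32])
```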

AAAI Conference 2021 Conference Paper

Structure-aware Person Image Generation with Pose Decomposition and Semantic Correlation

  • Jilin Tang
  • Yi Yuan
  • Tianjia Shao
  • Yong Liu
  • Mengmeng Wang
  • Kun Zhou

In this paper, we tackle the problem of pose-guided person image generation, which aims to transfer a person image from the source pose to a novel target pose while maintaining the source appearance. Given the inefficiency of standard CNNs in handling large spatial transformations, we propose a structure-aware flow based method for high-quality person image generation. Specifically, instead of learning the complex overall pose changes of the human body, we decompose the human body into different semantic parts (e.g., head, torso, and legs) and apply different networks to predict the flow fields for these parts separately. Moreover, we carefully design the network modules to effectively capture the local and global semantic correlations of features within and among the human parts, respectively. Extensive experimental results show that our method can generate high-quality results under large pose discrepancy and outperforms state-of-the-art methods in both qualitative and quantitative comparisons.

NeurIPS Conference 2020 Conference Paper

LAPAR: Linearly-Assembled Pixel-Adaptive Regression Network for Single Image Super-resolution and Beyond

  • Wenbo Li
  • Kun Zhou
  • Lu Qi
  • Nianjuan Jiang
  • Jiangbo Lu
  • Jiaya Jia

Single image super-resolution (SISR) deals with the fundamental problem of upsampling a low-resolution (LR) image to its high-resolution (HR) version. The last few years have witnessed impressive progress propelled by deep learning methods. However, one critical challenge faced by existing methods is to strike a sweet spot between deep model complexity and resulting SISR quality. This paper addresses this pain point by proposing a linearly-assembled pixel-adaptive regression network (LAPAR), which casts the direct LR-to-HR mapping learning into a linear coefficient regression task over a dictionary of multiple predefined filter bases. Such a parametric representation renders our model highly lightweight and easy to optimize while achieving state-of-the-art results on SISR benchmarks. Moreover, based on the same idea, LAPAR is extended to tackle other restoration tasks, e.g., image denoising and JPEG image deblocking, and again yields strong performance.
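
The linearly-assembled regression can be sketched in a few lines: a fixed dictionary of filter bases is applied to the bicubically upsampled input, and per-pixel coefficients (predicted by the network in the real method) mix the filtered maps into the output. The tiny three-filter dictionary and the random coefficients below are placeholders for the learned parts, not LAPAR's actual dictionary.

```python
import torch
import torch.nn.functional as F

# Dictionary of predefined 3x3 filter bases (identity, blur, sharpen): (K, 3, 3), K = 3.
bases = torch.stack([
    torch.tensor([[0, 0, 0], [0, 1, 0], [0, 0, 0]], dtype=torch.float32),
    torch.full((3, 3), 1.0 / 9.0),
    torch.tensor([[0, -1, 0], [-1, 5, -1], [0, -1, 0]], dtype=torch.float32),
])

def assemble(lr, coeffs, scale=2):
    # lr: (B, C, h, w); coeffs: (B, K, H, W) per-pixel weights from the regression net.
    up = F.interpolate(lr, scale_factor=scale, mode="bicubic", align_corners=False)
    B, C, H, W = up.shape
    K = bases.shape[0]
    kernels = bases.repeat(C, 1, 1).unsqueeze(1)                   # (C*K, 1, 3, 3)
    filtered = F.conv2d(up, kernels, padding=1, groups=C)          # every base per channel
    filtered = filtered.view(B, C, K, H, W)
    return (filtered * coeffs.unsqueeze(1)).sum(dim=2)             # coefficient-weighted sum

lr = torch.rand(1, 3, 16, 16)
coeffs = torch.softmax(torch.randn(1, 3, 32, 32), dim=1)           # stand-in for network output
print(assemble(lr, coeffs).shape)                                  # torch.Size([1, 3, 32, 32])
```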

IJCAI Conference 2007 Conference Paper

Locality Sensitive Discriminant Analysis

  • Deng Cai
  • Xiaofei He
  • Kun Zhou
  • Jiawei Han
  • Hujun Bao

Linear Discriminant Analysis (LDA) is a popular data-analytic tool for studying the class relationship between data points. A major disadvantage of LDA is that it fails to discover the local geometrical structure of the data manifold. In this paper, we introduce a novel linear algorithm for discriminant analysis, called Locality Sensitive Discriminant Analysis (LSDA). When there are insufficient training samples, local structure is generally more important than global structure for discriminant analysis. By discovering the local manifold structure, LSDA finds a projection which maximizes the margin between data points from different classes in each local area. Specifically, the data points are mapped into a subspace in which nearby points with the same label are close to each other while nearby points with different labels are far apart. Experiments carried out on several standard face databases show a clear improvement over the results of LDA-based recognition.
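
A simplified sketch in the spirit of LSDA as summarized above (not the paper's exact objective): within each k-nearest-neighbour patch, distances to same-label neighbours should shrink and distances to different-label neighbours should grow, which reduces to a generalized eigenproblem between two locally built scatter matrices.

```python
import numpy as np
from scipy.linalg import eigh
from sklearn.neighbors import NearestNeighbors

def lsda_like_projection(X, y, k=5, dim=2, reg=1e-3):
    n, d = X.shape
    nbrs = NearestNeighbors(n_neighbors=k + 1).fit(X)
    idx = nbrs.kneighbors(X, return_distance=False)[:, 1:]   # drop the point itself
    Sw = np.zeros((d, d))    # local within-class scatter
    Sb = np.zeros((d, d))    # local between-class scatter
    for i in range(n):
        for j in idx[i]:
            diff = np.outer(X[i] - X[j], X[i] - X[j])
            if y[i] == y[j]:
                Sw += diff
            else:
                Sb += diff
    # maximize local between-class scatter relative to within-class scatter
    _, vecs = eigh(Sb, Sw + reg * np.eye(d))
    return vecs[:, ::-1][:, :dim]   # top eigenvectors form the projection

rng = np.random.default_rng(0)
X = np.vstack([rng.normal(0, 1, (50, 4)), rng.normal(3, 1, (50, 4))])
y = np.array([0] * 50 + [1] * 50)
W = lsda_like_projection(X, y)
print((X @ W).shape)   # (100, 2)
```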