Arrow Research search

Author name cluster

Kun Li

Papers possibly associated with this exact author name in Arrow. This page groups case-insensitive exact name matches; it is not a full identity-disambiguation profile.

35 papers
2 author rows

Possible papers

35

AAAI Conference 2026 Conference Paper

Can Molecular Evolution Mechanism Enhance Molecular Representation?

  • Kun Li
  • Longtao Hu
  • Jiameng Chen
  • Hongzhi Zhang
  • Yida Xiong
  • Xiantao Cai
  • Wenbin Hu
  • Jia Wu

Molecular evolution is the process of simulating the natural evolution of molecules in chemical space to explore potential molecular structures and properties. The relationships between similar molecules are often described through transformations such as adding, deleting, and modifying atoms and chemical bonds, reflecting specific evolutionary paths. Existing molecular representation methods mainly mine information directly from molecules, such as atomic-level structures and chemical bonds, often overlooking their evolutionary history. Consequently, we aim to explore the possibility of enhancing molecular representations by simulating the evolutionary process: we extract and analyze the changes along evolutionary pathways and explore combining them with existing molecular representations. To this end, this paper proposes the molecular evolutionary network (MEvoN) for molecular representation. First, we construct the MEvoN using molecules with small numbers of atoms and generate evolutionary paths using similarity calculations. Then, by modeling the atomic-level changes, MEvoN reveals their impact on molecular properties. Experimental results show that the MEvoN-based molecular property prediction method significantly improves the performance of traditional end-to-end algorithms by approximately 33% on both the QM7 and QM9 datasets.
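
As an illustration of the similarity-driven path construction described above, here is a minimal sketch using RDKit Morgan fingerprints and Tanimoto similarity; the helper build_evolution_paths is hypothetical and simplifies MEvoN to a tree in which each molecule points to its most similar smaller predecessor.

```python
from rdkit import Chem, DataStructs
from rdkit.Chem import AllChem

def build_evolution_paths(smiles_list):
    """Toy evolutionary-path construction: link each molecule to its
    most similar smaller predecessor by Tanimoto similarity."""
    mols = [Chem.MolFromSmiles(s) for s in smiles_list]
    fps = [AllChem.GetMorganFingerprintAsBitVect(m, 2, nBits=2048) for m in mols]
    # Order molecules by atom count; the network grows from small molecules.
    order = sorted(range(len(mols)), key=lambda i: mols[i].GetNumAtoms())
    parent = {}
    for rank, i in enumerate(order[1:], start=1):
        cands = order[:rank]  # all smaller (or equal-sized, earlier) molecules
        parent[i] = max(cands, key=lambda j: DataStructs.TanimotoSimilarity(fps[i], fps[j]))
    return parent  # edges of a rudimentary evolutionary tree
```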

AAAI Conference 2026 Conference Paper

CoMA-SLAM: Collaborative Multi-Agent Gaussian SLAM with Geometric Consistency

  • Lin Chen
  • Yongxin Su
  • Jvboxi Wang
  • Pengcheng Han
  • Zhenyu Xia
  • Shuhui Bu
  • Kun Li
  • Boni Hu

Although Gaussian scene representation has achieved remarkable success in tracking and mapping, most existing methods are confined to single-agent systems. Current multi-agent solutions typically rely on centralized architectures, which struggle to account for communication bandwidth constraints. Furthermore, the inherent depth ambiguity of 3D Gaussian splatting poses notable challenges in maintaining geometric consistency. To address these challenges, we introduce CoMA-SLAM, the first distributed multi-agent Gaussian SLAM framework. By leveraging 2D Gaussian surfels and a robust initialization strategy, CoMA-SLAM enhances tracking accuracy and geometric consistency. It efficiently manages communication bandwidth while dynamically scaling with the number of agents. Through the integration of intra- and inter-agent loop closure, distributed keyframe optimization, and submap-centric updates, our framework ensures global consistency and robust alignment. Synthetic and real-world experiments demonstrate that CoMA-SLAM outperforms state-of-the-art methods in pose accuracy, rendering fidelity, and geometric consistency while maintaining competitive efficiency across distributed multi-agent systems. Notably, by avoiding data transmission to a centralized server, our method reduces communication bandwidth by 99.8% compared to centralized approaches.

AAAI Conference 2026 Conference Paper

DLVINet: Advancing Dual-Lens Video Inpainting Beyond Parallax Constraints

  • Zhiliang Wu
  • Kun Li
  • Yunqiu Xu
  • Hehe Fan
  • Yi Yang

Dual-lens video inpainting aims to simultaneously restore missing or corrupted contents in videos captured by each lens of binocular systems. Although preliminary explorations have been conducted, existing methods still face two key challenges: limited exploitation of long-range reference information and inadequate modeling of inter-lens consistency in non-standard binocular systems. In this paper, we propose a novel dual-lens video inpainting framework named DLVINet, which addresses these challenges with two core components. Firstly, we develop a sparse spatial-temporal transformer (SSTT) that effectively utilizes the information from distant frames to complete the video contents of each lens individually. By employing sparse spatial-temporal attention with a channel selection mechanism, SSTT not only restores missing regions, but also avoids introducing redundant or irrelevant information. Furthermore, SSTT introduces a multi-scale feed-forward network to enrich the multi-scale representation of completed features. Secondly, we design a cross-lens texture transformer (CLTT) to model inter-lens consistency. By interacting with corresponding features between lenses under the guidance of cross-attention, CLTT captures global inter-lens correspondences. Such a design enables effective cross-view information modeling without being constrained by horizontal parallax, which is particularly critical for non-standard binocular systems. Extensive experiments demonstrate the effectiveness of our DLVINet.

AAAI Conference 2026 Conference Paper

GLOBA: Rethinking Parameter Conflicts in Model Merging

  • Zehao Liu
  • Kun Li
  • Wei Zhou

Model merging serves as a training-free technique that combines multiple task-specific models into a unified multi-task model, but parameter conflicts often lead to performance drops. Previous methods flatten weight matrices into one-dimensional vectors, losing the inherent structural information of their row and column spaces. We mathematically prove and experimentally validate that parameter conflicts arise from non-orthogonal components of task vectors, while orthogonal components are conflict-free. Furthermore, we find that non-orthogonal components can contain both harmful conflicts and beneficial synergies. To precisely locate parameter conflicts and extract orthogonal components, we propose GLOBA (GLObal Basis Analysis Framework), which projects task vectors onto a global basis to align them within a unified coordinate system and construct a task interaction matrix. Following energy-based pruning, we divide parameters into five types based on the orthogonal relationships between the row spaces and column spaces of task vectors. Experiments on three fine-tuned models (mathematics, coding, and instruction-following) using LLaMA-2-7B and LLaMA-2-13B demonstrate significant performance gains through selective retention of beneficial parameters and removal of conflicting ones.
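
A minimal sketch of the orthogonal/non-orthogonal split that the abstract attributes conflicts to, assuming plain NumPy and a pairwise (rather than global-basis) decomposition; orthogonal_split is a hypothetical helper, not the GLOBA API.

```python
import numpy as np

def orthogonal_split(tau_a, tau_b):
    """Split task vector tau_b into components inside / outside
    the column space of task vector tau_a (both 2D weight deltas)."""
    u, _, _ = np.linalg.svd(tau_a, full_matrices=False)
    overlap = u @ (u.T @ tau_b)   # non-orthogonal part: conflicts (or synergies) live here
    residual = tau_b - overlap    # orthogonal part: conflict-free by construction
    return overlap, residual
```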

AAAI Conference 2026 Conference Paper

InterCoser: Interactive 3D Character Creation with Disentangled Fine-Grained Features

  • Yi Wang
  • Jian Ma
  • Zhuo Su
  • Guidong Wang
  • Jingyu Yang
  • Yu-Kun Lai
  • Kun Li

This paper aims to interactively generate and edit disentangled 3D characters based on precise user instructions. Existing methods generate and edit 3D characters via rough and simple editing guidance and entangled representations, making it difficult to achieve precise and comprehensive control over fine-grained local editing and free clothing transfer for characters. To enable accurate and intuitive control over the generation and editing of high-quality 3D characters with freely interchangeable clothing, we propose a novel user-interactive approach for disentangled 3D character creation. Specifically, to achieve precise control over 3D character generation and editing, we introduce two user-friendly interaction approaches: a sketch-based layered character generation/editing method, which supports clothing transfer; and a 3D-proxy-based part-level editing method, enabling fine-grained disentangled editing. To enhance 3D character quality, we propose a 3D Gaussian reconstruction strategy guided by geometric priors, ensuring that 3D characters exhibit detailed local geometry and smooth global surfaces. Extensive experiments on both public datasets and in-the-wild data demonstrate that our approach not only generates high-quality disentangled 3D characters but also supports precise and fine-grained editing through user interaction.

AAAI Conference 2026 Conference Paper

Sequence-Free for Compound Protein Interaction Prediction

  • Hongzhi Zhang
  • Jiameng Chen
  • Kun Li
  • Yida Xiong
  • Xiantao Cai
  • Wenbin Hu
  • Jia Wu

The prediction of compound–protein interactions (CPIs) is crucial for drug discovery. Most existing CPI prediction models rely on protein sequence information as input. However, in early-stage drug development, particularly in phenotype-driven studies or compound-response analyses, proteins are often annotated only with functional labels, and their sequences remain undetermined. Consequently, current methods are inapplicable in such scenarios. Furthermore, our experiments find that even when large-scale perturbations are applied to protein sequences, the predictive performance of existing models does not decline significantly, indicating that the high investment in sequencing may not bring corresponding returns. To address these issues, we propose BioText-CPI, an inexpensive, protein-sequence-free framework for CPI prediction based on biomedical textual descriptions of proteins. First, during pre-training, we use contrastive learning to align the protein text and sequence modalities. We then add biological text descriptions of proteins to an existing public CPI dataset to construct a new CPI dataset. Finally, at prediction time, the sequence and the biomedical text description of a protein can be used as input either separately or together, meeting the application requirements of different scenarios. Experiments demonstrate that BioText-CPI achieves performance comparable to traditional methods when only the biomedical description of a protein is provided. Moreover, when both protein modalities are provided simultaneously, BioText-CPI achieves state-of-the-art performance across multiple scenarios.
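
The text-sequence alignment step can be illustrated with a standard symmetric InfoNCE (CLIP-style) objective; this is a generic sketch of contrastive alignment, not BioText-CPI's actual loss, and align_loss is a hypothetical name.

```python
import torch
import torch.nn.functional as F

def align_loss(text_emb, seq_emb, temperature=0.07):
    """Symmetric InfoNCE over a batch of paired (text, sequence) embeddings."""
    t = F.normalize(text_emb, dim=-1)
    s = F.normalize(seq_emb, dim=-1)
    logits = t @ s.T / temperature                      # (B, B) similarity matrix
    targets = torch.arange(logits.size(0), device=logits.device)
    # Matched pairs sit on the diagonal; penalize both matching directions.
    return 0.5 * (F.cross_entropy(logits, targets) + F.cross_entropy(logits.T, targets))
```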

AAAI Conference 2026 Conference Paper

VP-Bench: A Comprehensive Benchmark for Visual Prompting in Multimodal Large Language Models

  • Mingjie Xu
  • Jinpeng Chen
  • Yuzhi Zhao
  • Jason Chun Lok Li
  • Yue Qiu
  • Zekang Du
  • Mengyang Wu
  • Pingping Zhang

Multimodal Large Language Models (MLLMs) have enabled a wide range of advanced vision-language applications, including fine-grained object recognition and contextual understanding. When querying specific regions or objects in an image, human users naturally use "Visual Prompts" (VPs) such as bounding boxes to indicate the region of reference. However, no existing benchmark systematically evaluates the ability of MLLMs to interpret such VPs. This gap raises uncertainty about whether current MLLMs can effectively recognize VPs, an intuitive prompting method for humans, and utilize them to solve problems. To address this limitation, we introduce VP-Bench, aiming to assess MLLMs' capability in VP perception and utilization. VP-Bench employs a two-stage evaluation framework: Stage 1 examines models' ability to perceive VPs in natural scenes, utilizing 100K visualized prompts spanning 8 shapes and 355 attribute combinations. Stage 2 investigates the impact of VPs on downstream tasks, measuring their effectiveness in real-world problem-solving scenarios. Using VP-Bench, we evaluate 21 MLLMs, including proprietary systems (e.g., GPT-4o) and open-source models (e.g., InternVL-2.5 and Qwen2.5-VL). In addition, we conduct a comprehensive analysis of the factors influencing VP understanding, such as attribute variations and model scale. VP-Bench establishes a new reference framework for studying MLLMs' ability to comprehend and resolve grounded referring questions.
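
For intuition, a visual prompt of the bounding-box shape can be rendered with a few lines of Pillow; draw_box_prompt is a hypothetical helper, and VP-Bench's actual rendering pipeline (8 shapes, 355 attribute combinations) is far richer.

```python
from PIL import Image, ImageDraw

def draw_box_prompt(image, box, color="red", width=3):
    """Overlay a bounding-box visual prompt on a copy of the image.
    box = (x0, y0, x1, y1) in pixel coordinates."""
    out = image.copy()
    ImageDraw.Draw(out).rectangle(box, outline=color, width=width)
    return out

# prompted = draw_box_prompt(Image.open("scene.jpg"), (40, 60, 200, 180))
```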

AAAI Conference 2025 Conference Paper

BotSim: LLM-Powered Malicious Social Botnet Simulation

  • Boyu Qiao
  • Kun Li
  • Wei Zhou
  • Shilong Li
  • Qianqian Lu
  • Songlin Hu

Social media platforms like X (Twitter) and Reddit are vital to global communication. However, advancements in Large Language Model (LLM) technology give rise to social media bots with unprecedented intelligence. These bots adeptly simulate human profiles, conversations, and interactions, disseminating large amounts of false information and posing significant challenges to platform regulation. To better understand and counter these threats, we design BotSim, a malicious social botnet simulation powered by LLMs. BotSim mimics the information dissemination patterns of real-world social networks, creating a virtual environment composed of intelligent agent bots and real human users. In the temporal simulation constructed by BotSim, these advanced agent bots autonomously engage in social interactions such as posting and commenting, effectively modeling scenarios of information flow and user interaction. Building on the BotSim framework, we construct a highly human-like, LLM-driven bot dataset called BotSim-24 and benchmark multiple bot detection strategies against it. The experimental results indicate that detection methods effective on traditional bot datasets perform worse on BotSim-24, highlighting the urgent need for new detection strategies to address the cybersecurity threats posed by these advanced bots.

IROS Conference 2025 Conference Paper

CODE: COllaborative Visual-UWB SLAM for Online Large-Scale Metric DEnse Mapping

  • Lin Chen 0042
  • Xuan Jia
  • Shuhui Bu
  • Guangming Wang 0001
  • Kun Li
  • Zhenyu Xia
  • Xiaohan Li
  • Pengcheng Han

This paper presents a novel collaborative online dense mapping system for multiple Unmanned Aerial Vehicles (UAVs). The system confers two primary benefits: it facilitates simultaneous UAV co-localization and real-time dense map reconstruction, and it recovers the metric scale even in GNSS-denied conditions. To achieve these advantages, Ultra-wideband (UWB) measurements, monocular Visual Odometry (VO), and co-visibility observations are jointly employed to recover both relative positions and global UAV poses, thereby ensuring optimality at both local and global scales. In the proposed methodology, a two-stage optimization strategy reduces the optimization burden. Initially, relative Sim3 transformations among UAVs are swiftly estimated, with UWB measurements facilitating metric scale recovery in the absence of GNSS. Subsequently, a global pose optimization is performed to effectively mitigate cumulative drift. By integrating UWB, VO, and co-visibility data within this framework, both local geometric consistency and global pose accuracy are robustly maintained. Through comprehensive simulation and empirical real-world testing, we demonstrate that our system not only improves UAV positioning accuracy in challenging scenarios but also facilitates the high-quality, online integration of dense point clouds in large-scale areas. This research offers valuable contributions and practical techniques for precise, real-time map reconstruction using an autonomous UAV fleet, particularly in GNSS-denied environments.
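
The UWB-based metric scale recovery can be illustrated with a toy least-squares fit between scale-less VO baselines and metric UWB ranges, assuming the inter-UAV correspondences are already established; recover_metric_scale is a hypothetical helper, not part of the CODE system.

```python
import numpy as np

def recover_metric_scale(vo_baselines, uwb_ranges):
    """Closed-form least-squares scale s minimizing ||s * vo_baselines - uwb_ranges||^2.
    vo_baselines: scale-less inter-UAV distances from VO; uwb_ranges: metric UWB ranges."""
    vo_baselines = np.asarray(vo_baselines, dtype=float)
    uwb_ranges = np.asarray(uwb_ranges, dtype=float)
    return float(vo_baselines @ uwb_ranges / (vo_baselines @ vo_baselines))
```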

IJCAI Conference 2025 Conference Paper

Drafting and Revision: Advancing High-Fidelity Video Inpainting

  • Zhiliang Wu
  • Kun Li
  • Hehe Fan
  • Yi Yang

Video inpainting aims to fill the missing regions in a video with spatially and temporally coherent contents. Existing methods usually treat the missing contents as a whole and adopt a hybrid objective containing a reconstruction loss and an adversarial loss to train the model. However, these two kinds of loss focus on contents at different frequencies, so simply combining them may cause inter-frequency conflicts, leading the trained model to generate compromised results. Inspired by the common corrupted-painting restoration process of “drawing a draft first and then revising the details later”, this paper proposes a Drafting-and-Revision Completion Network (DRCN) for video inpainting. Specifically, we first design a Drafting Network that utilizes temporal information to complete the low-frequency semantic structure at low resolution. Then, a Revision Network is developed to hallucinate high-frequency details at high resolution by using the output of the Drafting Network. In this way, the adversarial loss and the reconstruction loss can be applied to high and low frequencies respectively, effectively mitigating inter-frequency conflicts. Furthermore, the Revision Network can be stacked in a pyramid manner to generate higher-resolution details, which provides a feasible solution for high-resolution video inpainting. Experiments show that DRCN achieves improvements of 7.43% and 12.64% in E_warp and LPIPS, and can handle higher-resolution videos on limited GPU memory.

TIST Journal 2025 Journal Article

JASRNet: Learning Joint Adaptive Sampling and Reconstruction for Depth Sensing

  • Chunyang Bi
  • Mingnuo Teng
  • Tianhao Xie
  • Kun Li
  • Jingyu Yang

Recent attempts to exploit irregular sampling strategies for depth sensing have shown prominent merits over uniform rectangular sampling in terms of depth reconstruction quality, particularly at low sampling rates. However, treating depth sampling and reconstruction separately forgoes the potential merits of joint optimization. In this article, we propose a joint adaptive depth sampling and reconstruction network, named JASRNet, for the RGB-D sensing configuration, to simultaneously optimize both the sampling and reconstruction of depth information in an end-to-end manner. The sampling sub-network infers the locations to sample according to a significance distribution generated from the associated RGB image, without any prior information about the underlying depth map. The depth reconstruction sub-network learns and then fuses global and local depth features with attention guidance, which helps to obtain more accurate depth reconstruction results at boundaries. A hybrid loss function is further proposed to promote sharp discontinuities in the reconstructed depth maps. Qualitative and quantitative results show that our method achieves better depth sensing quality than several state-of-the-art methods for various indoor and outdoor scenes.
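
A crude way to picture the sampling sub-network's job: pick the most significant locations from the RGB-derived map. The top-k selection below is non-differentiable, so it only approximates what the end-to-end-trained sub-network learns; sample_mask is a hypothetical helper.

```python
import torch

def sample_mask(significance, rate=0.05):
    """Select the top `rate` fraction of pixels from an RGB-derived significance map.
    significance: (H, W) tensor; returns a boolean mask of sampling locations."""
    h, w = significance.shape
    k = max(1, int(rate * h * w))
    idx = torch.topk(significance.flatten(), k).indices
    mask = torch.zeros(h * w, dtype=torch.bool)
    mask[idx] = True
    return mask.view(h, w)   # True where the depth sensor should sample
```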

AAAI Conference 2025 Conference Paper

Patch-level Sounding Object Tracking for Audio-Visual Question Answering

  • Zhangbin Li
  • Jinxing Zhou
  • Jing Zhang
  • Shengeng Tang
  • Kun Li
  • Dan Guo

Answering questions related to audio-visual scenes, i.e., the AVQA task, is becoming increasingly popular. A critical challenge is accurately identifying and tracking sounding objects related to the question along the timeline. In this paper, we present a new Patch-level Sounding Object Tracking (PSOT) method. It begins with a Motion-driven Key Patch Tracking (M-KPT) module, which relies on visual motion information to identify salient visual patches with significant movements that are more likely to relate to sounding objects and questions. We measure the patch-wise motion intensity map between neighboring video frames and utilize it to construct and guide a motion-driven graph network. Meanwhile, we design a Sound-driven KPT (S-KPT) module to explicitly track sounding patches. This module also involves a graph network, with the adjacency matrix regularized by the audio-visual correspondence map. The M-KPT and S-KPT modules are performed in parallel for each temporal segment, allowing balanced tracking of salient and sounding objects. Based on the tracked patches, we further propose a Question-driven KPT (Q-KPT) module to retain patches highly relevant to the question, ensuring the model focuses on the most informative clues. The audio-visual-question features are updated during the processing of these modules, which are then aggregated for final answer prediction. Extensive experiments on standard datasets demonstrate the effectiveness of our method, achieving competitive performance even compared to recent large-scale pretraining-based approaches.
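
The patch-wise motion intensity map that drives M-KPT can be approximated with a mean absolute inter-frame difference pooled onto a patch grid; patch_motion_intensity is a hypothetical sketch, not the paper's exact measure.

```python
import torch
import torch.nn.functional as F

def patch_motion_intensity(frame_a, frame_b, patch=16):
    """Per-patch motion intensity between two frames of shape (C, H, W):
    mean absolute difference, average-pooled to a (H/patch, W/patch) grid."""
    diff = (frame_a - frame_b).abs().mean(dim=0, keepdim=True)   # (1, H, W)
    return F.avg_pool2d(diff.unsqueeze(0), patch).squeeze()      # (H/patch, W/patch)
```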

IJCAI Conference 2025 Conference Paper

Prompt-Aware Controllable Shadow Removal

  • Kerui Chen
  • Zhiliang Wu
  • Wenjin Hou
  • Kun Li
  • Hehe Fan
  • Yi Yang

Shadow removal aims to restore the image content in shadowed regions. While deep learning-based methods have shown promising results, they still face key challenges: 1) uncontrolled removal of all shadows, or 2) controllable removal but heavily relies on precise shadow region masks. To address these issues, we introduce a novel paradigm: prompt-aware controllable shadow removal. Unlike existing approaches, our paradigm allows for targeted shadow removal from specific subjects based on user prompts (e. g. , dots, lines, or subject masks). This approach eliminates the need for shadow annotations and offers flexible, user-controlled shadow removal. Specifically, we propose an end-to-end learnable model, the Prompt-Aware Controllable Shadow Removal Network (PACSRNet). PACSRNet consists of two key modules: a prompt-aware module that generates shadow masks for the specified subject based on the user prompt, and a shadow removal module that uses the shadow prior from the first module to restore the content in the shadowed areas. Additionally, we enhance the shadow removal module by incorporating feature information from the prompt-aware module through a linear operation, providing prompt-guided support for shadow removal. Recognizing that existing shadow removal datasets lack diverse user prompts, we contribute a new dataset specifically designed for prompt-based controllable shadow removal. Extensive experimental results demonstrate the effectiveness and superiority of PACSRNet.

AAAI Conference 2025 Conference Paper

Prototypical Calibrating Ambiguous Samples for Micro-Action Recognition

  • Kun Li
  • Dan Guo
  • Guoliang Chen
  • Chunxiao Fan
  • Jingyuan Xu
  • Zhiliang Wu
  • Hehe Fan
  • Meng Wang

Micro-Action Recognition (MAR) has gained increasing attention due to its crucial role as a form of non-verbal communication in social interactions, with promising potential for applications in human communication and emotion analysis. However, current approaches often overlook the inherent ambiguity in micro-actions, which arises from the wide category range and subtle visual differences between categories. This oversight hampers the accuracy of micro-action recognition. In this paper, we propose a novel Prototypical Calibrating Ambiguous Network (PCAN) to unleash and mitigate the ambiguity of MAR. Firstly, we employ a hierarchical action-tree to identify ambiguous samples, categorizing them into distinct sets of false negatives and false positives, considering both body- and action-level categories. Secondly, we implement an ambiguous contrastive refinement module to calibrate these ambiguous samples by regulating the distance between ambiguous samples and their corresponding prototypes. This calibration process aims to pull false negative (FN) samples closer to their respective prototypes and push false positive (FP) samples apart from their affiliated prototypes. In addition, we propose a new prototypical diversity amplification loss to strengthen the model's capacity by amplifying the differences between different prototypes. Finally, we propose a prototype-guided rectification to rectify predictions by incorporating the representability of prototypes. Extensive experiments conducted on the benchmark dataset demonstrate the superior performance of our method compared to existing approaches.
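
The pull/push calibration around prototypes can be sketched as a simple cosine-similarity loss; calibrate is a hypothetical helper and omits the paper's action-tree mining and diversity-amplification terms.

```python
import torch
import torch.nn.functional as F

def calibrate(z_fn, z_fp, prototype, margin=0.5):
    """Pull false-negative features toward their class prototype;
    push false-positive features away beyond a similarity margin.
    z_fn, z_fp: (N, D) features; prototype: (D,) class prototype."""
    pull = (1 - F.cosine_similarity(z_fn, prototype.expand_as(z_fn))).mean()
    push = F.relu(F.cosine_similarity(z_fp, prototype.expand_as(z_fp)) - margin).mean()
    return pull + push
```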

AAAI Conference 2025 Conference Paper

ZeroMamba: Exploring Visual State Space Model for Zero-Shot Learning

  • Wenjin Hou
  • Dingjie Fu
  • Kun Li
  • Shiming Chen
  • Hehe Fan
  • Yi Yang

Zero-shot learning (ZSL) aims to recognize unseen classes by transferring semantic knowledge from seen classes to unseen ones, guided by semantic information. To this end, existing works have demonstrated remarkable performance by utilizing global visual features from Convolutional Neural Networks (CNNs) or Vision Transformers (ViTs) for visual-semantic interactions. Due to the limited receptive fields of CNNs and the quadratic complexity of ViTs, however, these visual backbones achieve suboptimal visual-semantic interactions. In this paper, motivated by the visual state space model (i.e., Vision Mamba), which is capable of capturing long-range dependencies and modeling complex visual dynamics, we propose a parameter-efficient ZSL framework called ZeroMamba to advance ZSL. Our ZeroMamba comprises three key components: Semantic-aware Local Projection (SLP), Global Representation Learning (GRL), and Semantic Fusion (SeF). Specifically, SLP integrates semantic embeddings to map visual features to local semantic-related representations, while GRL encourages the model to learn global semantic representations. SeF combines these two semantic representations to enhance the discriminability of semantic features. We incorporate these designs into Vision Mamba, forming an end-to-end ZSL framework. As a result, the learned semantic representations are better suited for classification. Through extensive experiments on four prominent ZSL benchmarks, ZeroMamba demonstrates superior performance, significantly outperforming the state-of-the-art (i.e., CNN-based and ViT-based) methods under both conventional ZSL (CZSL) and generalized ZSL (GZSL) settings.

ICRA Conference 2024 Conference Paper

AutoFusion: Autonomous Visual Geolocation and Online Dense Reconstruction for UAV Cluster

  • Yizhu Zhang
  • Shuhui Bu
  • Yifei Dong 0008
  • Yu Zhang 0197
  • Kun Li
  • Lin Chen 0042

Real-time dense reconstruction using Unmanned Aerial Vehicles (UAVs) is becoming increasingly popular in large-scale rescue and environmental monitoring tasks. However, due to the energy constraints of a single UAV, efficiency can be greatly improved through the collaboration of multiple UAVs. Nevertheless, when faced with unknown environments or the loss of Global Navigation Satellite System (GNSS) signal, most multi-UAV SLAM systems fail, making it hard to construct a globally consistent map. In this paper, we propose a real-time dense reconstruction system called AutoFusion for multiple UAVs, which robustly supports scenarios with lost global positioning and weak co-visibility. We propose a Visual Geolocation and Matching Network (VGMN), built on a graph convolutional neural network feature extractor, which can acquire geographic location information solely from images. We also present a real-time dense reconstruction framework for multi-UAV systems with autonomous visual geolocation. UAV agents send images and relative positions to the ground server, which processes the data using VGMN for multi-agent geolocation optimization, including initialization, pose graph optimization, and map fusion. Extensive experiments demonstrate that our system can efficiently and stably construct large-scale dense maps in real time with high accuracy and robustness.

IJCAI Conference 2024 Conference Paper

Contrastive Learning Drug Response Models from Natural Language Supervision

  • Kun Li
  • Xiuwen Gong
  • Jia Wu
  • Wenbin Hu

Deep learning-based drug response prediction (DRP) methods can accelerate the drug discovery process and reduce research and development costs. Despite their high accuracy, generating regression-aware representations remains challenging for mainstream approaches. For instance, the representations are often disordered, aggregated, and overlapping, and they fail to characterize distinct samples effectively. This results in poor representation during the DRP task, diminishing generalizability and potentially leading to substantial costs during drug discovery. In this paper, we propose CLDR, a contrastive learning framework with natural language supervision for DRP. CLDR converts regression labels into text, which is merged with the drug response caption as a second sample modality instead of the traditional modes, i.e., graphs and sequences. Simultaneously, a common-sense numerical knowledge graph is introduced to improve the continuous text representation. Our framework is validated using the Genomics of Drug Sensitivity in Cancer dataset, with average performance increases ranging from 7.8% to 31.4%. Furthermore, experiments demonstrate that the proposed CLDR effectively maps samples with distinct label values into a high-dimensional space. In this space, the sample representations are scattered, significantly alleviating feature overlap. The code is available at: https://github.com/DrugD/CLDR.
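
The label-to-text conversion at the heart of CLDR can be pictured with a one-line template; the wording below is hypothetical, as the paper's caption format and numerical knowledge graph are more elaborate.

```python
def response_caption(drug, cell_line, ic50):
    """Render a continuous regression label as text so it can act
    as a second modality for contrastive learning."""
    return f"The response of {cell_line} to {drug} has an IC50 of {ic50:.3f}."

# response_caption("Sorafenib", "A549", 2.417)
```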

AAAI Conference 2024 Conference Paper

EulerMormer: Robust Eulerian Motion Magnification via Dynamic Filtering within Transformer

  • Fei Wang
  • Dan Guo
  • Kun Li
  • Meng Wang

Video Motion Magnification (VMM) aims to break the resolution limit of human visual perception and reveal imperceptible minor motion that contains valuable information in the macroscopic domain. However, challenges arise in this task due to photon noise inevitably introduced by photographic devices and spatial inconsistency in amplification, leading to flickering artifacts in static fields and motion blur and distortion in dynamic fields of the video. Existing methods focus on explicit motion modeling without emphasizing prioritized denoising during the motion magnification process. This paper proposes a novel dynamic filtering strategy to achieve static-dynamic field adaptive denoising. Specifically, based on Eulerian theory, we separate texture and shape to extract motion representation through inter-frame shape differences, expecting to leverage these subdivided features to solve this task finely. Then, we introduce a novel dynamic filter that eliminates noise cues and preserves critical features in the motion magnification and amplification generation phases. Overall, our unified framework, EulerMormer, is a pioneering effort to equip learning-based VMM with a Transformer. The core of the dynamic filter lies in a global dynamic sparse cross-covariance attention mechanism that explicitly removes noise while preserving vital information, coupled with a multi-scale dual-path gating mechanism that selectively regulates the dependence on different frequency features to reduce spatial attenuation and complement motion boundaries. Extensive experiments demonstrate that EulerMormer achieves more robust video motion magnification from the Eulerian perspective, significantly outperforming state-of-the-art methods. The source code is available at https://github.com/VUT-HFUT/EulerMormer.
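
For context, classic linear Eulerian magnification amplifies the inter-frame difference; applied to the shape features the abstract describes, a bare-bones version (without EulerMormer's dynamic filtering) looks like this sketch.

```python
def magnify(shape_ref, shape_t, alpha):
    """Linear Eulerian magnification of the inter-frame shape difference.
    alpha: magnification factor; returns the magnified shape representation."""
    return shape_ref + (1 + alpha) * (shape_t - shape_ref)
```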

AAAI Conference 2024 Conference Paper

KeDuSR: Real-World Dual-Lens Super-Resolution via Kernel-Free Matching

  • Huanjing Yue
  • Zifan Cui
  • Kun Li
  • Jingyu Yang

Dual-lens super-resolution (SR) is a practical scenario for reference (Ref) based SR, utilizing the telephoto image (Ref) to assist the super-resolution of the low-resolution wide-angle image (LR input). Different from general RefSR, the Ref in dual-lens SR only covers the overlapped field of view (FoV) area. However, current dual-lens SR methods rarely utilize these specific characteristics and directly perform dense matching between the LR input and Ref. Due to the resolution gap between LR and Ref, the matching may miss the best-matched candidate and destroy the consistent structures in the overlapped FoV area. In contrast, we propose to first align the Ref with the center region (namely the overlapped FoV area) of the LR input by combining global warping and local warping, making the aligned Ref sharp and consistent. Then, we formulate the aligned Ref and LR center as value-key pairs, and the corner region of the LR as queries. In this way, we propose a kernel-free matching strategy by matching between the LR-corner (query) and LR-center (key) regions, so that the corresponding aligned Ref (value) can be warped to the corner region of the target. Our kernel-free matching strategy avoids the resolution gap between LR and Ref, which gives our network better generalization ability. In addition, we construct a DuSR-Real dataset with (LR, Ref, HR) triples, where the LR and HR are well aligned. Experiments on three datasets demonstrate that our method outperforms the second-best method by a large margin. Our code and dataset are available at https://github.com/ZifanCui/KeDuSR.
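
The kernel-free matching can be sketched as attention in which LR-corner queries match LR-center keys and retrieve aligned-Ref values; the soft attention below is an illustrative stand-in, as the paper's matching may be a hard nearest-neighbor lookup.

```python
import torch

def kernel_free_match(q_corner, k_center, v_ref):
    """Corner queries attend over center keys; aligned-Ref values are warped back.
    q_corner: (Nq, C); k_center, v_ref: (Nk, C). Matching LR against LR
    avoids the LR-Ref resolution gap."""
    attn = torch.softmax(q_corner @ k_center.T / q_corner.shape[-1] ** 0.5, dim=-1)
    return attn @ v_ref   # (Nq, C) features transferred to the corner region
```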

IJCAI Conference 2024 Conference Paper

Zero-shot Learning for Preclinical Drug Screening

  • Kun Li
  • Weiwei Liu
  • Yong Luo
  • Xiantao Cai
  • Jia Wu
  • Wenbin Hu

Conventional deep learning methods typically employ supervised learning for drug response prediction (DRP). This entails dependence on labeled response data from drugs for model training. However, practical applications in the preclinical drug screening phase demand that DRP models predict responses for novel compounds, often with unknown drug responses. This presents a challenge, rendering supervised deep learning methods unsuitable for such scenarios. In this paper, we propose a zero-shot learning solution for the DRP task in preclinical drug screening. Specifically, we propose a Multi-branch Multi-Source Domain Adaptation Test Enhancement Plug-in, called MSDA. MSDA can be seamlessly integrated with conventional DRP methods, learning invariant features from the prior response data of similar drugs to enhance real-time predictions of unlabeled compounds. The results of experiments on two large drug response datasets showed that MSDA efficiently predicts drug responses for novel compounds, leading to a general performance improvement of 5-10% in the preclinical drug screening phase. The significance of this solution resides in its potential to accelerate the drug discovery process, improve drug candidate assessment, and facilitate the success of drug discovery. The code is available at https://github.com/DrugD/MSDA.

AAAI Conference 2022 Conference Paper

AFDetV2: Rethinking the Necessity of the Second Stage for Object Detection from Point Clouds

  • Yihan Hu
  • Zhuangzhuang Ding
  • Runzhou Ge
  • Wenxin Shao
  • Li Huang
  • Kun Li
  • Qiang Liu

There have been two streams in 3D detection from point clouds: single-stage methods and two-stage methods. While the former are more computationally efficient, the latter usually provide better detection accuracy. By carefully examining the two-stage approaches, we have found that, if appropriately designed, the first stage can produce accurate box regression. In this scenario, the second stage mainly rescores the boxes so that boxes with better localization get selected. From this observation, we have devised a single-stage anchor-free network that can fulfill these requirements. This network, named AFDetV2, extends previous work by incorporating a self-calibrated convolution block in the backbone, a keypoint auxiliary supervision, and an IoU prediction branch in the multi-task head. We take a simple product of the predicted IoU score with the classification heatmap to form the final classification confidence. The enhanced backbone strengthens the box localization capability, and the rescoring approach effectively joins the object presence confidence and the box regression accuracy. As a result, the detection accuracy is drastically boosted in the single stage. To evaluate our approach, we have conducted extensive experiments on the Waymo Open Dataset and the nuScenes Dataset. We have observed that our AFDetV2 achieves state-of-the-art results on these two datasets, superior to all prior arts, including both single-stage and two-stage 3D detectors. AFDetV2 won 1st place in the Real-Time 3D Detection track of the Waymo Open Dataset Challenge 2021. In addition, a variant of our model, AFDetV2-Base, was named the “Most Efficient Model” by the Challenge Sponsor, showing superior computational efficiency. To demonstrate the generality of this single-stage method, we have also applied it to the first stage of two-stage networks. Without exception, the results show that with the strengthened backbone and the rescoring approach, the second-stage refinement is no longer needed.
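
The rescoring the abstract describes is literally a product; a minimal sketch, assuming per-location classification and IoU maps of the same shape:

```python
import torch

def rescore(cls_heatmap, iou_pred):
    """'Simple product' rescoring: combine object-presence confidence with
    box-localization quality. Both tensors share shape, e.g. (B, C, H, W)."""
    return cls_heatmap * iou_pred
```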

NeurIPS Conference 2022 Conference Paper

FOF: Learning Fourier Occupancy Field for Monocular Real-time Human Reconstruction

  • Qiao Feng
  • Yebin Liu
  • Yu-Kun Lai
  • Jingyu Yang
  • Kun Li

The advent of deep learning has led to significant progress in monocular human reconstruction. However, existing representations, such as parametric models, voxel grids, meshes and implicit neural representations, have difficulties achieving high-quality results and real-time speed at the same time. In this paper, we propose Fourier Occupancy Field (FOF), a novel, powerful, efficient and flexible 3D geometry representation, for monocular real-time and accurate human reconstruction. A FOF represents a 3D object with a 2D field orthogonal to the view direction, where at each 2D position the occupancy field of the object along the view direction is compactly represented with the first few terms of its Fourier series, which retains the topology and neighborhood relations in the 2D domain. A FOF can be stored as a multi-channel image, which is compatible with 2D convolutional neural networks and can bridge the gap between 3D geometries and 2D images. A FOF is very flexible and extensible, e.g., parametric models can be easily integrated into a FOF as a prior to generate more robust results. Meshes and our FOF can be easily inter-converted. Based on FOF, we design the first 30+ FPS high-fidelity real-time monocular human reconstruction framework. We demonstrate the potential of FOF on both public datasets and real captured data. The code is available for research purposes at http://cic.tju.edu.cn/faculty/likun/projects/FOF.
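
The per-pixel decoding of a FOF is just a truncated Fourier series evaluated along the view direction; a minimal sketch, with the coefficient layout and inside/outside threshold as assumptions:

```python
import numpy as np

def occupancy_at(coeffs, z):
    """Evaluate a truncated Fourier series at normalized depth z in [0, 1).
    coeffs: per-pixel coefficients laid out as [a0, a1, b1, ..., aN, bN]
    (layout assumed for illustration)."""
    n = (len(coeffs) - 1) // 2
    val = coeffs[0] / 2.0
    for k in range(1, n + 1):
        val += coeffs[2 * k - 1] * np.cos(2 * np.pi * k * z)
        val += coeffs[2 * k] * np.sin(2 * np.pi * k * z)
    return val   # thresholding (e.g., val > 0.5) recovers inside/outside
```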

NeurIPS Conference 2021 Conference Paper

Implicit Transformer Network for Screen Content Image Continuous Super-Resolution

  • Jingyu Yang
  • Sheng Shen
  • Huanjing Yue
  • Kun Li

Nowadays, there is an explosive growth of screen content due to the wide application of screen sharing, remote cooperation, and online education. To match the limited terminal bandwidth, high-resolution (HR) screen contents may be downsampled and compressed. At the receiver side, super-resolution (SR) of low-resolution (LR) screen content images (SCIs) is highly demanded by the HR display or by users who zoom in for detail observation. However, image SR methods mostly designed for natural images do not generalize well for SCIs due to the very different image characteristics as well as the requirement of SCI browsing at arbitrary scales. To this end, we propose a novel Implicit Transformer Super-Resolution Network (ITSRN) for SCI SR. For high-quality continuous SR at arbitrary ratios, pixel values at query coordinates are inferred from image features at key coordinates by the proposed implicit transformer, and an implicit position encoding scheme is proposed to aggregate similar neighboring pixel values to the query one. We construct benchmark SCI1K and SCI1K-compression datasets with LR and HR SCI pairs. Extensive experiments show that the proposed ITSRN significantly outperforms several competitive continuous and discrete SR methods for both compressed and uncompressed SCIs.

ICRA Conference 2021 Conference Paper

MDANet: Multi-Modal Deep Aggregation Network for Depth Completion

  • Yanjie Ke
  • Kun Li
  • Wei Yang 0011
  • Zhenbo Xu
  • Dayang Hao
  • Liusheng Huang
  • Gang Wang

Depth completion aims to recover a dense depth map from sparse depth data and a corresponding RGB image. However, due to the huge difference between the multi-modal input signals, a vanilla convolutional neural network with a simple fusion strategy cannot extract features from sparse data and aggregate multi-modal information effectively. To tackle this problem, we design a novel network architecture that takes full advantage of multi-modal features for depth completion. An effective pre-completion algorithm is first put forward to increase the density of the input depth map and to provide distribution priors. Moreover, to effectively fuse the image features and the depth features, we propose a multi-modal deep aggregation block that consists of multiple connection and aggregation pathways for deeper fusion. Furthermore, based on the intuition that semantic image features are beneficial for accurate contours, we introduce a deformable guided fusion layer to guide the generation of the dense depth map. The resulting architecture, called MDANet, outperforms all state-of-the-art methods on the popular KITTI Depth Completion Benchmark, meanwhile with fewer parameters than recent methods. The code of this work will be available at https://github.com/USTC-Keyanjie/MDANet_ICRA2021.

AAAI Conference 2021 Conference Paper

Proposal-Free Video Grounding with Contextual Pyramid Network

  • Kun Li
  • Dan Guo
  • Meng Wang

The challenge of video grounding - localizing activities in an untrimmed video via a natural language query - is to tackle the semantics of vision and language consistently along the temporal dimension. Most existing proposal-based methods are trapped by the computational cost of extensive candidate proposals. In this paper, we propose a novel proposal-free framework named Contextual Pyramid Network (CPNet) to investigate multi-scale temporal correlation in the video. Specifically, we propose a pyramid network to extract 2D contextual correlation maps at different temporal scales (T × T, T/2 × T/2, and T/4 × T/4), where the 2D correlation map (past → current & current ← future) is designed to model the relations between any two moments in the video. In other words, CPNet progressively replenishes the temporal contexts and refines the location of the queried activity by enlarging the temporal receptive fields. Finally, we implement a temporal self-attentive regression (i.e., proposal-free regression) to predict the activity boundary from the above hierarchical context-aware 2D correlation maps. Extensive experiments on ActivityNet Captions, Charades-STA, and TACoS datasets demonstrate that our approach outperforms state-of-the-art methods.
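
A 2D correlation map over T moments can be sketched as normalized pairwise feature similarity; correlation_map is a hypothetical helper and ignores CPNet's past/future directional design.

```python
import torch
import torch.nn.functional as F

def correlation_map(moment_feats):
    """moment_feats: (T, C) per-moment features.
    Returns a (T, T) map relating any two moments in the video."""
    f = F.normalize(moment_feats, dim=-1)
    return f @ f.T
```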

ICRA Conference 2021 Conference Paper

Robust SRIF-based LiDAR-IMU Localization for Autonomous Vehicles

  • Kun Li
  • Zhanpeng Ouyang
  • Lan Hu
  • Dayang Hao
  • Laurent Kneip

We present a tightly-coupled multi-sensor fusion architecture for autonomous vehicle applications, which achieves centimetre-level accuracy and high robustness in various scenarios. In order to realize robust and accurate point-cloud feature matching, we propose a novel method for extracting structural, highly discriminative features from LiDAR point clouds. For high-frequency motion prediction and noise propagation, we use incremental on-manifold IMU pre-integration. We also adopt a multi-frame sliding-window square root inverse filter, so that the system maintains numerically stable results under the premise of limited power consumption. To verify our methodology, we test the fusion algorithm in multiple applications and platforms equipped with a LiDAR-IMU system. Our results demonstrate that our fusion framework attains state-of-the-art localization accuracy, high robustness, and good generalization ability.

NeurIPS Conference 2020 Conference Paper

GPS-Net: Graph-based Photometric Stereo Network

  • Zhuokun Yao
  • Kun Li
  • Ying Fu
  • Haofeng Hu
  • Boxin Shi

Learning-based photometric stereo methods predict the surface normal either in a per-pixel or an all-pixel manner. Per-pixel methods explore the inter-image intensity variation of each pixel but ignore features from the intra-image spatial domain. All-pixel methods explore the intra-image intensity variation of each input image but pay less attention to the inter-image lighting variation. In this paper, we present a Graph-based Photometric Stereo Network, which unifies per-pixel and all-pixel processing to explore both inter-image and intra-image information. For per-pixel operation, we propose the Unstructured Feature Extraction Layer to connect an arbitrary number of input image-light pairs into graph structures, and introduce Structure-aware Graph Convolution filters to balance the input data by appropriately weighting shadows and specular highlights. For all-pixel operation, we propose the Normal Regression Network to make efficient use of the intra-image spatial information for predicting a surface normal map with rich details. Experimental results on the real-world benchmark show that our method achieves excellent performance under both sparse and dense lighting distributions.

IJCAI Conference 2020 Conference Paper

Rebalancing Expanding EV Sharing Systems with Deep Reinforcement Learning

  • Man Luo
  • Wenzhe Zhang
  • Tianyou Song
  • Kun Li
  • Hongming Zhu
  • Bowen Du
  • Hongkai Wen

Electric Vehicle (EV) sharing systems have recently experienced unprecedented growth across the world. One of the key challenges in their operation is vehicle rebalancing, i.e., repositioning the EVs across stations to better satisfy future user demand. This is particularly challenging in the shared EV context, because i) the range of EVs is limited while charging time is substantial, which constrains the rebalancing options; and ii) as a new mobility trend, most of the current EV sharing systems are still continuously expanding their station networks, i.e., the targets for rebalancing can change over time. To tackle these challenges, in this paper we model the rebalancing task as a Multi-Agent Reinforcement Learning (MARL) problem, which directly takes the range and charging properties of the EVs into account. We propose a novel approach of policy optimization with action cascading, which isolates the non-stationarity locally, and use two connected networks to solve the formulated MARL. We evaluate the proposed approach using a simulator calibrated with 1-year operation data from a real EV sharing system. Results show that our approach significantly outperforms the state-of-the-art, offering up to 14% gain in order satisfied rate and 12% increase in net revenue.

IROS Conference 2020 Conference Paper

Vision Global Localization with Semantic Segmentation and Interest Feature Points

  • Kai Li
  • Xudong Zhang
  • Kun Li
  • Shuo Zhang

In this work, we present a vision-only global localization architecture for autonomous vehicle applications, which achieves centimeter-level accuracy and high robustness in various scenarios. We first apply pixel-wise segmentation to the front-view mono camera and extract semantic features, e.g., pole-like objects, lane markings, and curbs, which are robust to illumination, viewing angles, and seasonal changes. For scenes without enough semantic information, we extract interest feature points on static backgrounds, such as the ground surface and buildings, assisted by our semantic segmentation. We create the visual global map with semantic feature map layers extracted from a LiDAR point-cloud semantic map and a point feature map layer built with fixed-pose SfM. A lumped Levenberg-Marquardt optimization solver is then applied to minimize the cost from the two types of observations. We further evaluate the accuracy and robustness of our method with road tests on Alibaba's autonomous delivery vehicles in multiple scenarios, as well as on the KAIST urban dataset.

ICRA Conference 2018 Conference Paper

Inverse Reinforcement Learning via Function Approximation for Clinical Motion Analysis

  • Kun Li
  • Mrinal Rath
  • Joel W. Burdick

This paper introduces a new method for inverse reinforcement learning in large state spaces, where the learned reward function can be used to control high-dimensional robot systems and analyze complex human movement. To avoid solving the computationally expensive reinforcement learning problems in reward learning, we propose a function approximation method to ensure that the Bellman Optimality Equation always holds, and then estimate a function to maximize the likelihood of the observed motion. The time complexity of the proposed method is linearly proportional to the cardinality of the action set, thus it can handle large state spaces efficiently. We test the proposed method in a simulated environment on reward learning, and show that it is more accurate than existing methods and significantly better in scalability. We also show that the proposed method can extend many existing methods to large state spaces. We then apply the method to evaluating the effect of rehabilitative stimulations on patients with spinal cord injuries based on the observed patient motions.
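
The trick of making the Bellman optimality equation hold by construction can be shown in tabular form: estimate Q directly, then read the reward off the equation; reward_from_q is a hypothetical sketch for small discrete problems, not the paper's function-approximation method.

```python
import numpy as np

def reward_from_q(Q, P, gamma):
    """Given Q: (S, A) and transition probabilities P: (A, S, S), recover
    the reward for which the Bellman optimality equation holds exactly:
        r(s, a) = Q(s, a) - gamma * E_{s'|s,a}[ max_a' Q(s', a') ]"""
    v_next = Q.max(axis=1)               # (S,) optimal state values
    return Q - gamma * (P @ v_next).T    # (A, S) expectation, transposed to (S, A)
```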

ICRA Conference 2017 Conference Paper

Clinical patient tracking in the presence of transient and permanent occlusions via geodesic feature

  • Kun Li
  • Joel W. Burdick

This paper develops a method to use RGB-D cameras to track the motions of a human spinal cord injury patient undergoing spinal stimulation and physical rehabilitation. Because clinicians must remain close to the patient during training sessions, the patient is usually under permanent and transient occlusions due to the training equipment and the movements of the attending clinicians. These occlusions can significantly degrade the accuracy of existing human tracking methods. To improve the data association problem in these circumstances, we present a new global feature based on the geodesic distances of surface mesh points to a set of anchor points. Transient occlusions are handled via a multi-hypothesis tracking framework. To evaluate the method, we simulated different occlusion sizes on a data set captured from a human in varying movement patterns, and compared the proposed feature with other tracking methods. The results show that the proposed method achieves robustness to both surface deformations and transient occlusions.
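
The geodesic anchor feature can be approximated with shortest paths on the mesh edge graph via SciPy; geodesic_feature is a hypothetical sketch, whereas the paper computes geodesic distances on the surface mesh itself.

```python
import numpy as np
from scipy.sparse import csr_matrix
from scipy.sparse.csgraph import dijkstra

def geodesic_feature(vertices, edges, anchors):
    """Per-vertex feature: graph-shortest-path distances to anchor vertices.
    vertices: (V, 3) float; edges: (E, 2) int vertex indices; anchors: index list."""
    w = np.linalg.norm(vertices[edges[:, 0]] - vertices[edges[:, 1]], axis=1)
    n = len(vertices)
    g = csr_matrix((w, (edges[:, 0], edges[:, 1])), shape=(n, n))
    d = dijkstra(g, directed=False, indices=anchors)   # (len(anchors), V)
    return d.T   # each row is one vertex's distance-to-anchors feature
```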

ICRA Conference 2014 Conference Paper

Robotic object manipulation with multilevel part-based model in RGB-D data

  • Kun Li
  • Max Q.-H. Meng

The performance of robotic object manipulation relies heavily on the selection of the object model. In this article, we develop a multilevel part-based object model by applying a latent support vector machine to train a hierarchical object structure. We implement our method with a robot arm and a depth sensor in the Robot Operating System, and then compare the recognition performance of this model with established methods on a point cloud dataset and show the manipulation performance of our model on three practical tasks. The results demonstrate that our robot recognizes and manipulates objects more accurately with this multilevel part-based object model.