Arrow Research search

Author name cluster

Zhen Li

Possible papers associated with this exact author name in Arrow. This page groups case-insensitive exact name matches and is not a full identity disambiguation profile.

59 papers
2 author rows

Possible papers

59

AAAI Conference 2026 Conference Paper

Cancer Survival Prediction by Cyclic Generation and Multi-grained Alignment

  • Yongqi Bu
  • Qinggang Niu
  • Zhen Li
  • Yanyu Xu
  • Jun Wang
  • Guoxian Yu

Cancer survival analysis with multimodal data is crucial for precise treatments and patient benefits. However, the following challenges hinder integrating histopathology and genomics: (i) multimodal data is not always complete, especially for the more costly genomics data; (ii) intricate interactions between different modalities are difficult to capture and understand. In response, we propose an end-to-end framework (CIMA) that coordinates Cyclic modality generation and Multi-grained multimodal Alignment. Specifically, CIMA designs a cyclic modality reconstruction module to reciprocally impute missing modalities and infer the interactions between them. Next, it introduces the multi-grained alignment module over the imputed data and interactions to mine fine-grained alignments between histopathology (slide patches) and genomics (biological pathways). CIMA then constructs the adaptive fusion module to leverage multimodal data and alignments for survival prediction. Extensive experiments on cancer benchmark datasets demonstrate that CIMA outperforms existing methods and exhibits good interpretability, providing valuable insights into intricate relationships between pathological phenotypes and biological pathways. Our code is released in the supplementary materials.

AAAI Conference 2026 Conference Paper

Composition-Incremental Learning for Compositional Generalization

  • Zhen Li
  • Yuwei Wu
  • Chenchen Jing
  • Che Sun
  • Chuanhao Li
  • Yunde Jia

Compositional generalization has achieved substantial progress in computer vision on pre-collected training data. Nonetheless, real-world data continually emerges, with possible compositions being nearly infinite, long-tailed, and not entirely visible. Thus, an ideal model is supposed to gradually improve the capability of compositional generalization in an incremental manner. In this paper, we explore Composition-Incremental Learning for Compositional Generalization (CompIL) in the context of the compositional zero-shot learning (CZSL) task, where models need to continually learn new compositions, intending to improve their compositional generalization capability progressively. To quantitatively evaluate CompIL, we develop a benchmark construction pipeline leveraging existing datasets, yielding MIT-States-CompIL and C-GQA-CompIL. Furthermore, we propose a pseudo-replay framework utilizing a visual synthesizer to synthesize visual representations of learned compositions and a linguistic primitive distillation mechanism to maintain aligned primitive representations across the learning process. Extensive experiments demonstrate the effectiveness of the proposed framework.

AAAI Conference 2026 Conference Paper

DriveFlow: Rectified Flow Adaptation for Robust 3D Object Detection in Autonomous Driving

  • Hongbin Lin
  • Yiming Yang
  • Chaoda Zheng
  • Yifan Zhang
  • Shuaicheng Niu
  • Zilu Guo
  • Yafeng Li
  • Gui Gui

In autonomous driving, vision-centric 3D object detection recognizes and localizes 3D objects from RGB images. However, due to high annotation costs and diverse outdoor scenes, training data often fails to cover all possible test scenarios, known as the out-of-distribution (OOD) issue. Training-free image editing offers a promising solution for improving model robustness by training data enhancement without any modifications to pre-trained diffusion models. Nevertheless, inversion-based methods often suffer from limited effectiveness and inherent inaccuracies, while recent rectified-flow-based approaches struggle to preserve objects with accurate 3D geometry. In this paper, we propose DriveFlow, a Rectified Flow Adaptation method for training data enhancement in autonomous driving based on pre-trained Text-to-Image flow models. Based on frequency decomposition, DriveFlow introduces two strategies to adapt noise-free editing paths derived from text-conditioned velocities. 1) High-Frequency Foreground Preservation: DriveFlow incorporates a high-frequency alignment loss for foreground to maintain precise 3D object geometry. 2) Dual-Frequency Background Optimization: DriveFlow also conducts dual-frequency optimization for background, balancing editing flexibility and semantic consistency. Comprehensive experiments validate the effectiveness and efficiency of DriveFlow, demonstrating comprehensive performance improvements on all categories across OOD scenarios.

AAAI Conference 2026 Conference Paper

MDK12-Bench: A Multi-Discipline Benchmark for Evaluating Reasoning in Multimodal Large Language Models

  • Pengfei Zhou
  • Xiaopeng Peng
  • Fanrui Zhang
  • Zhaopan Xu
  • Jiaxin Ai
  • Yansheng Qiu
  • Wangbo Zhao
  • Jiajun Song

Multimodal large language models (MLLMs), which integrate language and visual cues for problem-solving, are crucial for advancing artificial general intelligence (AGI). However, current benchmarks for measuring the intelligence of MLLMs suffer from limited scale, narrow coverage, and unstructured knowledge, offering only static and undifferentiated evaluations. To bridge this gap, we introduce MDK12-Bench, a large-scale multidisciplinary benchmark built from real-world K–12 exams spanning six disciplines with 141K instances and 6,225 knowledge points organized in a six-layer taxonomy. Covering five question formats with difficulty and year annotations, it enables comprehensive evaluation to capture the extent to which MLLMs perform over four dimensions: 1) difficulty levels, 2) temporal (cross-year) shifts, 3) contextual shifts, and 4) knowledge-driven reasoning. We propose a novel dynamic evaluation framework that introduces unfamiliar visual, textual, and question form shifts to challenge model generalization while improving benchmark objectivity and longevity by mitigating data contamination. We further evaluate knowledge-point reference-augmented generation (KP-RAG) to examine the role of knowledge in reasoning. Key findings reveal limitations in current MLLMs in multiple aspects and provide guidance for enhancing model reasoning, robustness, and AI-assisted education.

NeurIPS Conference 2025 Conference Paper

3D Gaussian Flats: Hybrid 2D/3D Photometric Scene Reconstruction

  • Maria Taktasheva
  • Lily Goli
  • Alessandro Fiorini
  • Zhen Li
  • Daniel Rebain
  • Andrea Tagliasacchi

Recent advances in radiance fields and novel view synthesis enable creation of realistic digital twins from photographs. However, current methods struggle with flat, texture-less surfaces, creating uneven and semi-transparent reconstructions, due to an ill-conditioned photometric reconstruction objective. Surface reconstruction methods solve this issue but sacrifice visual quality. We propose a novel hybrid 2D/3D representation that jointly optimizes constrained planar (2D) Gaussians for modeling flat surfaces and freeform (3D) Gaussians for the rest of the scene. Our end-to-end approach dynamically detects and refines planar regions, improving both visual fidelity and geometric accuracy. It achieves state-of-the-art depth estimation on ScanNet++ and ScanNetv2, and excels at mesh extraction without overfitting to a specific camera model, showing its effectiveness in producing high-quality reconstruction of indoor scenes.

NeurIPS Conference 2025 Conference Paper

AffordBot: 3D Fine-grained Embodied Reasoning via Multimodal Large Language Models

  • Xinyi Wang
  • Xun Yang
  • Yanlong Xu
  • Yuchen Wu
  • Zhen Li
  • Na Zhao

Effective human-agent collaboration in physical environments requires understanding not only what to act upon, but also where the actionable elements are and how to interact with them. Existing approaches often operate at the object level or disjointedly handle fine-grained affordance reasoning, lacking coherent, instruction-driven grounding and reasoning. In this work, we introduce a new task: Fine-grained 3D Embodied Reasoning, which requires an agent to predict, for each referenced affordance element in a 3D scene, a structured triplet comprising its spatial location, motion type, and motion axis, based on a task instruction. To solve this task, we propose AffordBot, a novel framework that integrates Multimodal Large Language Models (MLLMs) with a tailored chain-of-thought (CoT) reasoning paradigm. To bridge the gap between 3D input and 2D-compatible MLLMs, we render surround-view images of the scene and project 3D element candidates into these views, forming a rich visual representation aligned with the scene geometry. Our CoT pipeline begins with an active perception stage, prompting the MLLM to select the most informative viewpoint based on the instruction, before proceeding with step-by-step reasoning to localize affordance elements and infer plausible interaction motions. Evaluated on the SceneFun3D dataset, AffordBot achieves state-of-the-art performance, demonstrating strong generalization and physically grounded reasoning with only 3D point cloud input and MLLMs. Our code is available at https://github.com/hannahwxy/AffordBot.

NeurIPS Conference 2025 Conference Paper

AgentTTS: Large Language Model Agent for Test-time Compute-optimal Scaling Strategy in Complex Tasks

  • Fali Wang
  • Hui Liu
  • Zhenwei Dai
  • Jingying Zeng
  • Zhiwei Zhang
  • Zongyu Wu
  • Chen Luo
  • Zhen Li

Test-time scaling (TTS) enhances the performance of large language models (LLMs) by allocating additional compute resources during inference. However, existing research primarily investigates TTS in single-stage tasks, while many real-world problems are multi-stage complex tasks composed of sequences of heterogeneous subtasks, each requiring an LLM with a specific capability. Therefore, we study a novel problem: test-time compute-optimal scaling in multi-stage complex tasks, aiming to select suitable models and allocate budgets per subtask to maximize overall performance. TTS in multi-stage tasks introduces two fundamental challenges: (i) The combinatorial search space of model and budget allocations, combined with the high cost of inference, makes brute-force search impractical. (ii) The optimal model and budget allocations across subtasks are interdependent, increasing the complexity of the compute-optimal search. To address this gap, we conduct extensive pilot experiments on four tasks across six datasets, deriving three empirical insights characterizing the behavior of LLMs in multi-stage complex tasks. Informed by these insights, we propose AgentTTS, an LLM-agent-based framework that autonomously searches for compute-optimal allocations through iterative feedback-driven interactions with the execution environment. Experimental results demonstrate that AgentTTS significantly outperforms traditional and other LLM-based baselines in search efficiency, and shows improved robustness to varying training set sizes and enhanced interpretability.
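The allocation problem in this abstract can be made concrete with a toy brute-force baseline, which is exactly the approach the paper argues is impractical at scale and replaces with an LLM-agent search. All names here (`best_allocation`, `score`) are hypothetical illustrations, not the paper's API; `score` stands in for an empirical per-subtask performance estimate.

```python
from itertools import product

def best_allocation(subtasks, models, budgets, score, total_budget):
    """Exhaustively search a (model, budget) choice per subtask under a
    total compute budget, maximizing summed subtask scores. Illustrates
    the combinatorial search space; AgentTTS avoids this enumeration."""
    best, best_perf = None, float("-inf")
    for choice in product(product(models, budgets), repeat=len(subtasks)):
        if sum(b for _, b in choice) > total_budget:
            continue  # respect the overall compute budget
        perf = sum(score(t, m, b) for t, (m, b) in zip(subtasks, choice))
        if perf > best_perf:
            best, best_perf = choice, perf
    return best, best_perf
```

The search space grows as (|models| x |budgets|)^#subtasks, which is why feedback-driven search is needed once real inference costs are attached.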

AAAI Conference 2025 Conference Paper

Consistency of Compositional Generalization Across Multiple Levels

  • Chuanhao Li
  • Zhen Li
  • Chenchen Jing
  • Xiaomeng Fan
  • Wenbo Ye
  • Yuwei Wu
  • Yunde Jia

Compositional generalization is the capability of a model to understand novel compositions composed of seen concepts. There are multiple levels of novel compositions, including phrase-phrase level, phrase-word level, and word-word level. Existing methods achieve promising compositional generalization, but the consistency of compositional generalization across multiple levels of novel compositions remains unexplored. Consistency means that a model should simultaneously generalize to a phrase-phrase level novel composition and to the phrase-word/word-word level novel compositions that can be derived from it. In this paper, we propose a meta-learning based framework for achieving consistent compositional generalization across multiple levels. The basic idea is to progressively learn compositions from simple to complex for consistency. Specifically, we divide the original training set into multiple validation sets based on compositional complexity, and introduce multiple meta-weight-nets to generate sample weights for samples in different validation sets. To fit the validation sets in order of increasing compositional complexity, we optimize the parameters of each meta-weight-net independently and sequentially in a multilevel optimization manner. We build a GQA-CCG dataset to quantitatively evaluate the consistency. Experimental results on visual question answering and temporal video grounding demonstrate the effectiveness of the proposed framework.

JBHI Journal 2025 Journal Article

Highlighted Diffusion Model as Plug-In Priors for Polyp Segmentation

  • Yuhao Du
  • Yuncheng Jiang
  • Shuangyi Tan
  • Si-Qi Liu
  • Zhen Li
  • Guanbin Li
  • Xiang Wan

Automated polyp segmentation from colonoscopy images is crucial for colorectal cancer diagnosis. The accuracy of such segmentation, however, is challenged by two main factors. First, the variability in polyps' size, shape, and color, coupled with the scarcity of well-annotated data due to the need for specialized manual annotation, hampers the efficacy of existing deep learning methods. Second, concealed polyps often blend with adjacent intestinal tissues, leading to poor contrast that challenges segmentation models. Recently, diffusion models have been explored and adapted for polyp segmentation tasks. However, the significant domain gap between RGB-colonoscopy images and grayscale segmentation masks, along with the low efficiency of the diffusion generation process, hinders the practical implementation of these models. To mitigate these challenges, we introduce the Highlighted Diffusion Model Plus (HDM+), a two-stage polyp segmentation framework. This framework incorporates the Highlighted Diffusion Model (HDM) to provide explicit semantic guidance, thereby enhancing segmentation accuracy. In the first stage, the HDM is trained using highlighted ground-truth data, which emphasizes polyp regions while suppressing the background in the images. This approach reduces the domain gap by focusing on the image itself rather than on the segmentation mask. In the second stage, we employ the highlighted features from the trained HDM's U-Net model as plug-in priors for polyp segmentation, rather than generating highlighted images, thereby increasing efficiency. Extensive experiments conducted on six polyp segmentation benchmarks demonstrate the effectiveness of our approach.

IROS Conference 2025 Conference Paper

Implicit Disparity-Blur Alignment for Fast and Precise Autofocus in Robotic Microsurgical Imaging

  • Pan Fu
  • Zhen Li
  • Ming-Yang Zhang
  • Yu-Peng Zhai
  • Junzheng Wang
  • Wen-Hao He
  • Gui-Bin Bian

Creating an intelligent surgical environment requires not only advanced robotic systems but also optimized microscopic imaging. However, autofocus remains a fundamental challenge, with current methods suffering from slow iterative processes or directional ambiguity, which compromises real-time performance. This paper presents an implicit disparity-blur alignment approach for robotic microsurgical autofocus, integrating stereo geometry's monotonic depth cues with defocus characteristics for rapid convergence. A novel physics-guided dual-stream network is developed to encode implicit depth representations through hierarchical cross-pathway feature fusion, enabling reliable focus prediction without explicit stereo matching in blur-degraded regions. An ROI-aware attention module is proposed to dynamically optimize focus-critical regions, coupled with learnable physics-guided kernel learning for precise Z-offset estimation. The approach achieves a top directional accuracy of 94.85% and a single-pass focus error of 0.20 mm with an inference time of 53 ms on a surgical dataset, which outperforms state-of-the-art methods in reducing iteration count by 22.8% and inference time by 51.8%. An intelligent robotic microscope prototype is developed, with validation through ex vivo tests demonstrating its ability to enable fast and precise multi-region focusing for microsurgeries.

IJCAI Conference 2025 Conference Paper

Multi-Sourced Compositional Generalization in Visual Question Answering

  • Chuanhao Li
  • Wenbo Ye
  • Zhen Li
  • Yuwei Wu
  • Yunde Jia

Compositional generalization is the ability to generalize to novel compositions of seen primitives, and has received much attention in vision-and-language (V&L) recently. Due to the multi-modal nature of V&L tasks, the primitives composing compositions come from different modalities, resulting in multi-sourced novel compositions. However, the generalization ability over multi-sourced novel compositions, i.e., multi-sourced compositional generalization (MSCG), remains unexplored. In this paper, we explore MSCG in the context of visual question answering (VQA), and propose a retrieval-augmented training framework to enhance the MSCG ability of VQA models by learning unified representations for primitives from different modalities. Specifically, semantically equivalent primitives are retrieved for each primitive in the training samples, and the retrieved features are aggregated with the original primitive to refine the model. This process helps the model learn consistent representations for the same semantic primitives across different modalities. To evaluate the MSCG ability of VQA models, we construct a new GQA-MSCG dataset based on the GQA dataset, in which samples include three types of novel compositions composed of primitives from different modalities. The GQA-MSCG dataset is available at https://github.com/NeverMoreLCH/MSCG.

JBHI Journal 2025 Journal Article

Property-Guided Few-Shot Learning for Molecular Property Prediction With Dual-View Encoder and Relation Graph Learning Network

  • Lianwei Zhang
  • Dongjiang Niu
  • Beiyi Zhang
  • Qiang Zhang
  • Zhen Li

Molecular property prediction is an important task in drug discovery. However, experimental data for many drug molecules are limited, especially for novel molecular structures or rare diseases, which affects the accuracy of many deep learning methods that rely on large training datasets. To this end, we propose PG-DERN, a novel few-shot learning model for molecular property prediction. A dual-view encoder is introduced to learn a meaningful molecular representation by integrating information from nodes and subgraphs. Next, a relation graph learning module is proposed to construct a relation graph based on the similarity between molecules, which improves the efficiency of information propagation and the accuracy of property prediction. In addition, we use a MAML-based meta-learning strategy to learn well-initialized meta-parameters. To guide the tuning of meta-parameters, a property-guided feature augmentation module is designed to transfer information from similar properties to the novel property, improving the comprehensiveness of the feature representation of molecules with the novel property. A series of comparative experiments on four benchmark datasets demonstrates that the proposed PG-DERN outperforms state-of-the-art methods.

NeurIPS Conference 2025 Conference Paper

Sekai: A Video Dataset towards World Exploration

  • Zhen Li
  • Chuanhao Li
  • Xiaofeng Mao
  • Shaoheng Lin
  • Ming Li
  • Shitian Zhao
  • Zhaopan Xu
  • Xinyue Li

Video generation techniques have made remarkable progress, promising to be the foundation of interactive world exploration. However, existing video generation datasets are not well-suited for world exploration training as they suffer from some limitations: limited locations, short duration, static scenes, and a lack of annotations about exploration and the world. In this paper, we introduce Sekai (meaning "world" in Japanese), a high-quality first-person view worldwide video dataset with rich annotations for world exploration. It consists of over 5,000 hours of walking or drone-view (FPV and UAV) videos from over 100 countries and regions across 750 cities. We develop an efficient and effective toolbox to collect, pre-process and annotate videos with location, scene, weather, crowd density, captions, and camera trajectories. Comprehensive analyses and experiments demonstrate the dataset's scale, diversity, annotation quality, and effectiveness for training video generation models. We believe Sekai will benefit the area of video generation and world exploration, and motivate valuable applications.

IROS Conference 2025 Conference Paper

SkyVLN: Vision-and-Language Navigation and NMPC Control for UAVs in Urban Environments

  • Tianshun Li
  • Tianyi Huai
  • Zhen Li
  • Yichun Gao
  • Haoang Li
  • Xinhu Zheng

Unmanned Aerial Vehicles (UAVs) have emerged as versatile tools across various sectors, driven by their mobility and adaptability. This paper introduces SkyVLN, a novel framework integrating vision-and-language navigation (VLN) with Nonlinear Model Predictive Control (NMPC) to enhance UAV autonomy in complex urban environments. Unlike traditional navigation methods, SkyVLN leverages Large Language Models (LLMs) to interpret natural language instructions and visual observations, enabling UAVs to navigate through dynamic 3D spaces with improved accuracy and robustness. We present a multimodal navigation agent equipped with a fine-grained spatial verbalizer and a history path memory mechanism. These components allow the UAV to disambiguate spatial contexts, handle ambiguous instructions, and backtrack when necessary. The framework also incorporates an NMPC module for dynamic obstacle avoidance, ensuring precise trajectory tracking and collision prevention. To validate our approach, we developed a high-fidelity 3D urban simulation environment using AirSim, featuring realistic imagery and dynamic urban elements. Extensive experiments demonstrate that SkyVLN significantly improves navigation success rates and efficiency, particularly in new and unseen environments.

NeurIPS Conference 2025 Conference Paper

SQS: Enhancing Sparse Perception Models via Query-based Splatting in Autonomous Driving

  • Haiming Zhang
  • Yiyao Zhu
  • Wending Zhou
  • Xu Yan
  • Yingjie Cai
  • Bingbing Liu
  • Shuguang Cui
  • Zhen Li

Sparse Perception Models (SPMs) adopt a query-driven paradigm that forgoes explicit dense BEV or volumetric construction, enabling highly efficient computation and accelerated inference. In this paper, we introduce SQS, a novel query-based splatting pre-training specifically designed to advance SPMs in autonomous driving. SQS introduces a plug-in module that predicts 3D Gaussian representations from sparse queries during pre-training, leveraging self-supervised splatting to learn fine-grained contextual features through the reconstruction of multi-view images and depth maps. During fine-tuning, the pre-trained Gaussian queries are seamlessly integrated into downstream networks via query interaction mechanisms that explicitly connect pre-trained queries with task-specific queries, effectively accommodating the diverse requirements of occupancy prediction and 3D object detection. Extensive experiments on autonomous driving benchmarks demonstrate that SQS delivers considerable performance gains across multiple query-based 3D perception tasks, notably in occupancy prediction and 3D object detection, outperforming prior state-of-the-art pre-training approaches by a significant margin (i.e., +1.3 mIoU on occupancy prediction and +1.0 NDS on 3D detection).

AAAI Conference 2025 Conference Paper

Topo2Seq: Enhanced Topology Reasoning via Topology Sequence Learning

  • Yiming Yang
  • Yueru Luo
  • Bingkun He
  • Erlong Li
  • Zhipeng Cao
  • Chao Zheng
  • Shuqi Mei
  • Zhen Li

Extracting lane topology from perspective views (PV) is crucial for planning and control in autonomous driving. This approach extracts potential drivable trajectories for self-driving vehicles without relying on high-definition (HD) maps. However, the unordered nature and weak long-range perception of the DETR-like framework can result in misaligned segment endpoints and limited topological prediction capabilities. Inspired by the learning of contextual relationships in language models, the connectivity relations in roads can be characterized as explicit topology sequences. In this paper, we introduce Topo2Seq, a novel approach for enhancing topology reasoning via topology sequence learning. The core concept of Topo2Seq is randomized order prompt-to-sequence learning between a lane segment decoder and a topology sequence decoder. The dual-decoder branches simultaneously learn the lane topology sequences extracted from the Directed Acyclic Graph (DAG) and the lane graph containing geometric information. Randomized order prompt-to-sequence learning extracts unordered key points from the lane graph predicted by the lane segment decoder, which are then fed into the prompt design of the topology sequence decoder to reconstruct an ordered and complete lane graph. In this way, the lane segment decoder learns powerful long-range perception and accurate topological reasoning from the topology sequence decoder. Notably, the topology sequence decoder is only introduced during training and does not affect inference efficiency. Experimental evaluations on the OpenLane-V2 dataset demonstrate the state-of-the-art performance of Topo2Seq in topology reasoning.

AAAI Conference 2025 Conference Paper

VQA4CIR: Boosting Composed Image Retrieval with Visual Question Answering

  • Chun-Mei Feng
  • Yang Bai
  • Tao Luo
  • Zhen Li
  • Salman Khan
  • Wangmeng Zuo
  • Rick Siow Mong Goh
  • Yong Liu

Although progress has been made in Composed Image Retrieval (CIR), we empirically find that a certain percentage of failure retrieval results are not consistent with their relative captions. To address this issue, this work provides a Visual Question Answering (VQA) perspective to boost the performance of CIR. The resulting VQA4CIR is a post-processing approach and can be directly plugged into existing CIR methods. Given the top-C retrieved images by a CIR method, VQA4CIR aims to decrease the adverse effect of the failure retrieval results being inconsistent with the relative caption. To find the retrieved images inconsistent with the relative caption, we resort to the "QA generation → VQA" self-verification pipeline. For QA generation, we suggest fine-tuning an LLM (e.g., LLaMA) to generate several pairs of questions and answers from each relative caption. We then fine-tune an LVLM (e.g., LLaVA) to obtain the VQA model. By feeding the retrieved image and question to the VQA model, one can find the images inconsistent with the relative caption when the answer by VQA is inconsistent with the answer in the QA pair. Consequently, the CIR performance can be boosted by modifying the ranks of inconsistently retrieved images. Experimental results show that our proposed method outperforms state-of-the-art CIR methods on the CIRR and Fashion-IQ datasets.
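The "QA generation → VQA" self-verification idea in this abstract reduces to a small re-ranking step. The sketch below assumes a hypothetical `vqa_answer(image, question)` callable standing in for the fine-tuned LVLM, and QA pairs already generated from the relative caption; it is an illustration of the pipeline, not the paper's released code.

```python
def rerank_by_vqa(retrieved, qa_pairs, vqa_answer):
    """Demote retrieved images whose VQA answers contradict the
    caption-derived QA pairs, preserving the original order within
    the consistent and inconsistent groups."""
    consistent, inconsistent = [], []
    for image in retrieved:
        ok = all(vqa_answer(image, q) == a for q, a in qa_pairs)
        (consistent if ok else inconsistent).append(image)
    # inconsistent images are pushed to the back of the ranking
    return consistent + inconsistent
```

In the paper the comparison between the VQA model's answer and the stored answer is what flags an inconsistent retrieval; exact string equality here is a simplification.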

AAAI Conference 2024 Conference Paper

CrossBind: Collaborative Cross-Modal Identification of Protein Nucleic-Acid-Binding Residues

  • Linglin Jing
  • Sheng Xu
  • Yifan Wang
  • Yuzhe Zhou
  • Tao Shen
  • Zhigang Ji
  • Hui Fang
  • Zhen Li

Accurate identification of protein nucleic-acid-binding residues poses a significant challenge with important implications for various biological processes and drug design. Many typical computational methods for protein analysis rely on a single model that could ignore either the semantic context of the protein or the global 3D geometric information. Consequently, these approaches may result in incomplete or inaccurate protein analysis. To address the above issue, in this paper, we present CrossBind, a novel collaborative cross-modal approach for identifying binding residues by exploiting both protein geometric structure and its sequence prior knowledge extracted from a large-scale protein language model. Specifically, our multi-modal approach leverages a contrastive learning technique and atom-wise attention to capture the positional relationships between atoms and residues, thereby incorporating fine-grained local geometric knowledge, for better binding residue prediction. Extensive experimental results demonstrate that our approach outperforms the next-best state-of-the-art methods, GraphSite and GraphBind, on DNA and RNA datasets by 10.8/17.3% in terms of the harmonic mean of precision and recall (F1 Score) and 11.9/24.8% in Matthews correlation coefficient (MCC), respectively. We release the code at https://github.com/BEAM-Labs/CrossBind.
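The two metrics this abstract reports, F1 (the harmonic mean of precision and recall) and the Matthews correlation coefficient, follow directly from confusion-matrix counts; a minimal reference computation:

```python
import math

def f1_and_mcc(tp, fp, fn, tn):
    """F1 score and Matthews correlation coefficient from the four
    confusion-matrix counts of a binary residue classifier."""
    precision = tp / (tp + fp)
    recall = tp / (tp + fn)
    f1 = 2 * precision * recall / (precision + recall)
    mcc = (tp * tn - fp * fn) / math.sqrt(
        (tp + fp) * (tp + fn) * (tn + fp) * (tn + fn)
    )
    return f1, mcc
```

Unlike F1, MCC accounts for true negatives, which matters for binding-residue prediction where non-binding residues dominate.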

JBHI Journal 2024 Journal Article

ECC-PolypDet: Enhanced CenterNet With Contrastive Learning for Automatic Polyp Detection

  • Yuncheng Jiang
  • Zixun Zhang
  • Yiwen Hu
  • Guanbin Li
  • Xiang Wan
  • Song Wu
  • Shuguang Cui
  • Silin Huang

Accurate polyp detection is critical for early colorectal cancer diagnosis. Although remarkable progress has been achieved in recent years, the complex colon environment and concealed polyps with unclear boundaries still pose severe challenges in this area. Existing methods either involve computationally expensive context aggregation or lack prior modeling of polyps, resulting in poor performance in challenging cases. In this paper, we propose the Enhanced CenterNet with Contrastive Learning (ECC-PolypDet), a two-stage training & end-to-end inference framework that leverages images and bounding box annotations to train a general model and fine-tune it based on the inference score to obtain a final robust model. Specifically, we conduct Box-assisted Contrastive Learning (BCL) during training to minimize the intra-class difference and maximize the inter-class difference between foreground polyps and backgrounds, enabling our model to capture concealed polyps. Moreover, to enhance the recognition of small polyps, we design the Semantic Flow-guided Feature Pyramid Network (SFFPN) to aggregate multi-scale features and the Heatmap Propagation (HP) module to boost the model's attention on polyp targets. In the fine-tuning stage, we introduce the IoU-guided Sample Re-weighting (ISR) mechanism to prioritize hard samples by adaptively adjusting the loss weight for each sample during fine-tuning. Extensive experiments on six large-scale colonoscopy datasets demonstrate the superiority of our model compared with previous state-of-the-art detectors.

AAAI Conference 2024 Conference Paper

RadOcc: Learning Cross-Modality Occupancy Knowledge through Rendering Assisted Distillation

  • Haiming Zhang
  • Xu Yan
  • Dongfeng Bai
  • Jiantao Gao
  • Pan Wang
  • Bingbing Liu
  • Shuguang Cui
  • Zhen Li

3D occupancy prediction is an emerging task that aims to estimate the occupancy states and semantics of 3D scenes using multi-view images. However, image-based scene perception encounters significant challenges in achieving accurate prediction due to the absence of geometric priors. In this paper, we address this issue by exploring cross-modal knowledge distillation in this task, i.e., we leverage a stronger multi-modal model to guide the visual model during training. In practice, we observe that directly applying features or logits alignment, proposed and widely used in bird's-eye-view (BEV) perception, does not yield satisfactory results. To overcome this problem, we introduce RadOcc, a Rendering assisted distillation paradigm for 3D Occupancy prediction. By employing differentiable volume rendering, we generate depth and semantic maps in perspective views and propose two novel consistency criteria between the rendered outputs of teacher and student models. Specifically, the depth consistency loss aligns the termination distributions of the rendered rays, while the semantic consistency loss mimics the intra-segment similarity guided by vision foundation models (VFMs). Experimental results on the nuScenes dataset demonstrate the effectiveness of our proposed method in improving various 3D occupancy prediction approaches, e.g., it enhances our baseline by 2.2% mIoU and achieves 50% on the Occ3D benchmark.
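The depth consistency idea described here, aligning the per-ray termination distributions of teacher and student, can be illustrated with a KL divergence over depth bins. This is a toy sketch of the concept, not the paper's exact loss; the function name and the use of plain lists are illustrative assumptions.

```python
import math

def depth_consistency_loss(teacher_w, student_w, eps=1e-8):
    """KL divergence between normalized teacher and student ray
    termination weights over depth bins (one ray, for illustration)."""
    t_sum = sum(teacher_w) + eps
    s_sum = sum(student_w) + eps
    kl = 0.0
    for tw, sw in zip(teacher_w, student_w):
        t, s = tw / t_sum, sw / s_sum  # normalize to distributions
        kl += t * (math.log(t + eps) - math.log(s + eps))
    return kl
```

The loss is zero when the student's termination distribution matches the teacher's, and grows as the student places mass at the wrong depths; in practice this would be averaged over all rendered rays.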

NeurIPS Conference 2024 Conference Paper

SearchLVLMs: A Plug-and-Play Framework for Augmenting Large Vision-Language Models by Searching Up-to-Date Internet Knowledge

  • Chuanhao Li
  • Zhen Li
  • Chenchen Jing
  • Shuo Liu
  • Wenqi Shao
  • Yuwei Wu
  • Ping Luo
  • Yu Qiao

Large vision-language models (LVLMs), such as the LLaVA series, are ignorant of up-to-date knowledge because the large amount of resources required prevents them from being updated frequently, and they therefore fail in many cases. For example, an LVLM released in January 2024 would not know the singer of the theme song for the new Detective Conan movie, which was not released until April 2024. To solve the problem, a promising solution motivated by retrieval-augmented generation (RAG) is to provide LVLMs with up-to-date knowledge via internet search during inference, i.e., internet-augmented generation (IAG), which is already integrated in some closed-source commercial LVLMs such as GPT-4V. However, the specific mechanics underpinning them remain a mystery. In this paper, we propose a plug-and-play framework for augmenting existing LVLMs in handling visual question answering (VQA) about up-to-date knowledge, dubbed SearchLVLMs. A hierarchical filtering model is trained to effectively and efficiently find the most helpful content from the websites returned by a search engine to prompt LVLMs with up-to-date knowledge. To train the model and evaluate our framework's performance, we propose a pipeline to automatically generate news-related VQA samples to construct a dataset, dubbed UDK-VQA. A multi-model voting mechanism is introduced to label the usefulness of website/content for VQA samples to construct the training set. Experimental results demonstrate the effectiveness of our framework, outperforming GPT-4o by approximately 30% in accuracy.
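
The hierarchical filtering model can be pictured as repeated top-k selection, first over websites and then over content chunks within the survivors. A toy sketch (the scorer here is a placeholder; in the paper it is a trained model):

```python
def filter_topk(items, score, k):
    """One stage of a hierarchical filter: keep the k items that the
    relevance scorer ranks highest."""
    return sorted(items, key=score, reverse=True)[:k]
```

Applying this twice, once to rank websites and once to rank chunks inside the kept websites, gives the hierarchical behavior the abstract describes.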

NeurIPS Conference 2024 Conference Paper

Towards Flexible 3D Perception: Object-Centric Occupancy Completion Augments 3D Object Detection

  • Chaoda Zheng
  • Feng Wang
  • Naiyan Wang
  • Shuguang Cui
  • Zhen Li

While 3D object bounding box (bbox) representation has been widely used in autonomous driving perception, it lacks the ability to capture the precise details of an object's intrinsic geometry. Recently, occupancy has emerged as a promising alternative for 3D scene perception. However, constructing a high-resolution occupancy map remains infeasible for large scenes due to computational constraints. Recognizing that foreground objects only occupy a small portion of the scene, we introduce object-centric occupancy as a supplement to object bboxes. This representation not only provides intricate details for detected objects but also enables higher voxel resolution in practical applications. We advance the development of object-centric occupancy perception from both data and algorithm perspectives. On the data side, we construct the first object-centric occupancy dataset from scratch using an automated pipeline. From the algorithmic standpoint, we introduce a novel object-centric occupancy completion network equipped with an implicit shape decoder that manages dynamic-size occupancy generation. This network accurately predicts the complete object-centric occupancy volume for inaccurate object proposals by leveraging temporal information from long sequences. Our method demonstrates robust performance in completing object shapes under noisy detection and tracking conditions. Additionally, we show that our occupancy features significantly enhance the detection results of state-of-the-art 3D object detectors, especially for incomplete or distant objects in the Waymo Open Dataset.

AAAI Conference 2024 Conference Paper

WeakPCSOD: Overcoming the Bias of Box Annotations for Weakly Supervised Point Cloud Salient Object Detection

  • Jun Wei
  • S. Kevin Zhou
  • Shuguang Cui
  • Zhen Li

Point cloud salient object detection (PCSOD) is a newly proposed task in 3D dense segmentation. However, the acquisition of accurate 3D dense annotations comes at a high cost, severely limiting the progress of PCSOD. To address this issue, we propose the first weakly supervised PCSOD (named WeakPCSOD) model, which relies solely on cheap 3D bounding box annotations. In WeakPCSOD, we extract noise-free supervision from coarse 3D bounding boxes while mitigating shape biases inherent in box annotations. To achieve this, we introduce a novel mask-to-box (M2B) transformation and a color consistency (CC) loss. The M2B transformation, from a shape perspective, disentangles predictions from labels, enabling the extraction of noiseless supervision from labels while preserving object shapes independently of the box bias. From an appearance perspective, we further introduce the CC loss to provide dense supervision, which mitigates the non-unique predictions stemming from weak supervision and substantially reduces prediction variability. Furthermore, we employ a self-training (ST) strategy to enhance performance by utilizing high-confidence pseudo labels. Notably, the M2B transformation, CC loss, and ST strategy can be seamlessly integrated into any model and incur no additional computational cost at inference. Extensive experiments demonstrate the effectiveness of our WeakPCSOD model, which is even comparable to fully supervised models utilizing dense annotations.

AAAI Conference 2024 Conference Paper

X4D-SceneFormer: Enhanced Scene Understanding on 4D Point Cloud Videos through Cross-Modal Knowledge Transfer

  • Linglin Jing
  • Ying Xue
  • Xu Yan
  • Chaoda Zheng
  • Dong Wang
  • Ruimao Zhang
  • Zhigang Wang
  • Hui Fang

The field of 4D point cloud understanding is rapidly developing with the goal of analyzing dynamic 3D point cloud sequences. However, it remains a challenging task due to the sparsity and lack of texture in point clouds. Moreover, the irregularity of point clouds poses a difficulty in aligning temporal information within video sequences. To address these issues, we propose a novel cross-modal knowledge transfer framework, called X4D-SceneFormer. This framework enhances 4D-Scene understanding by transferring texture priors from RGB sequences using a Transformer architecture with temporal relationship mining. Specifically, the framework is designed with a dual-branch architecture, consisting of a 4D point cloud transformer and a Gradient-aware Image Transformer (GIT). The GIT combines visual texture and temporal correlation features to offer rich semantics and dynamics for better point cloud representation. During training, we employ multiple knowledge transfer techniques, including temporal consistency losses and masked self-attention, to strengthen the knowledge transfer between modalities. This leads to enhanced performance during inference using single-modal 4D point cloud inputs. Extensive experiments demonstrate the superior performance of our framework on various 4D point cloud video understanding tasks, including action recognition, action segmentation and semantic segmentation. The results achieve 1st places, i.e., 85.3% (+7.9%) accuracy and 47.3% (+5.0%) mIoU for 4D action segmentation and semantic segmentation, on the HOI4D challenge, outperforming previous state-of-the-art by a large margin. We release the code at https://github.com/jinglinglingling/X4D.

NeurIPS Conference 2023 Conference Paper

Amazon-M2: A Multilingual Multi-locale Shopping Session Dataset for Recommendation and Text Generation

  • Wei Jin
  • Haitao Mao
  • Zheng Li
  • Haoming Jiang
  • Chen Luo
  • Hongzhi Wen
  • Haoyu Han
  • Hanqing Lu

Modeling customer shopping intentions is a crucial task for e-commerce, as it directly impacts user experience and engagement. Thus, accurately understanding customer preferences is essential for providing personalized recommendations. Session-based recommendation, which utilizes customer session data to predict their next interaction, has become increasingly popular. However, existing session datasets have limitations in terms of item attributes, user diversity, and dataset scale. As a result, they cannot comprehensively capture the spectrum of user behaviors and preferences. To bridge this gap, we present the Amazon Multilingual Multi-locale Shopping Session Dataset, namely Amazon-M2. It is the first multilingual dataset consisting of millions of user sessions from six different locales, where the major languages of products are English, German, Japanese, French, Italian, and Spanish. Remarkably, the dataset can help us enhance personalization and understanding of user preferences, which can benefit various existing tasks as well as enable new tasks. To test the potential of the dataset, we introduce three tasks in this work: (1) next-product recommendation, (2) next-product recommendation with domain shifts, and (3) next-product title generation. With the above tasks, we benchmark a range of algorithms on our proposed dataset, drawing new insights for further research and practice. In addition, based on the proposed dataset and tasks, we hosted a competition in the KDD CUP 2023 (https://www.aicrowd.com/challenges/amazon-kdd-cup-23-multilingual-recommendation-challenge), which attracted thousands of users and submissions. The winning solutions and the associated workshop can be accessed at our website https://kddcup23.github.io/.

AAAI Conference 2023 Conference Paper

CowClip: Reducing CTR Prediction Model Training Time from 12 Hours to 10 Minutes on 1 GPU

  • Zangwei Zheng
  • Pengtai Xu
  • Xuan Zou
  • Da Tang
  • Zhen Li
  • Chenguang Xi
  • Peng Wu
  • Leqi Zou

The click-through rate (CTR) prediction task is to predict whether a user will click on the recommended item. As mind-boggling amounts of data are produced online daily, accelerating CTR prediction model training is critical to ensuring an up-to-date model and reducing the training cost. One approach to increase the training speed is to apply large batch training. However, as shown in computer vision and natural language processing tasks, training with a large batch easily suffers from the loss of accuracy. Our experiments show that previous scaling rules fail in the training of CTR prediction neural networks. To tackle this problem, we first theoretically show that different frequencies of ids make it challenging to scale hyperparameters when scaling the batch size. To stabilize the training process in a large batch size setting, we develop the adaptive Column-wise Clipping (CowClip). It enables an easy and effective scaling rule for the embeddings, which keeps the learning rate unchanged and scales the L2 loss. We conduct extensive experiments with four CTR prediction networks on two real-world datasets and successfully scale the batch size to 128 times the original without accuracy loss. In particular, for CTR prediction model DeepFM training on the Criteo dataset, our optimization framework enlarges the batch size from 1K to 128K with over 0.1% AUC improvement and reduces training time from 12 hours to 10 minutes on a single V100 GPU. Our code is available at github.com/bytedance/LargeBatchCTR.
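
CowClip's key idea is that embedding gradients should be clipped column-wise, with the bound tied to how often each id occurs. A simplified sketch (the paper derives its clipping threshold differently; the `ratio` parameter and the linear `id_count` scaling here are illustrative assumptions):

```python
def cow_clip(grad_col, id_count, ratio=1.0, eps=1e-12):
    """Sketch of column-wise clipping: scale one embedding column's
    gradient so its norm does not exceed a bound proportional to the
    id's frequency, keeping rare-id updates stable under large batches."""
    norm = sum(g * g for g in grad_col) ** 0.5
    bound = ratio * id_count
    scale = min(1.0, bound / (norm + eps))
    return [g * scale for g in grad_col]
```

A rare id (small `id_count`) gets a tight bound, so a single large batch cannot blow up its embedding; frequent ids are clipped less aggressively.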

JBHI Journal 2023 Journal Article

Deep Learning Identifies Intelligible Predictors of Poor Prognosis in Chronic Kidney Disease

  • Ping Liang
  • Jiannan Yang
  • Weilan Wang
  • Guanjie Yuan
  • Min Han
  • Qingpeng Zhang
  • Zhen Li

Early diagnosis and prediction of chronic kidney disease (CKD) progress within a given duration are critical to ensure personalized treatment, which could improve patients' quality of life and prolong survival time. In this study, we explore the intelligibility of machine-learning and deep-learning models on end-stage renal disease (ESRD) prediction, based on readily-accessible clinical and laboratory features of patients suffering from CKD. Eight machine learning models were used to predict whether a patient suffering from CKD would progress to ESRD within three years based on demographics, clinical, and comorbidity information. LASSO, random forest, and XGBoost were used to identify the most significant markers. In addition, we introduced four advanced attribution methods to the deep learning model to enhance model intelligibility. The deep learning model achieved an AUC-ROC of 0.8991, which was significantly higher than that of baseline models. The interpretation generated by deep learning with attribution methods, random forest, and XGBoost was consistent with clinical knowledge, whereas LASSO-based interpretation was inconsistent. Hematuria, proteinuria, potassium, and the urine albumin-to-creatinine ratio were positively associated with the progression of CKD, while eGFR and urine creatinine were negatively associated. In conclusion, deep learning with attribution algorithms could identify intelligible features of CKD progression. Our model identified a number of critical but under-reported features, which may be novel markers for CKD progression. This study provides physicians with solid data-driven evidence for using machine learning for CKD clinical management and treatment.

AAAI Conference 2023 Conference Paper

Fair-CDA: Continuous and Directional Augmentation for Group Fairness

  • Rui Sun
  • Fengwei Zhou
  • Zhenhua Dong
  • Chuanlong Xie
  • Lanqing Hong
  • Jiawei Li
  • Rui Zhang
  • Zhen Li

In this work, we propose Fair-CDA, a fine-grained data augmentation strategy for imposing fairness constraints. We use a feature disentanglement method to extract the features highly related to the sensitive attributes. Then we show that group fairness can be achieved by regularizing the models on transition paths of sensitive features between groups. By adjusting the perturbation strength in the direction of the paths, our proposed augmentation is controllable and auditable. To alleviate the accuracy degradation caused by fairness constraints, we further introduce a calibrated model to impute labels for the augmented data. Our proposed method does not assume any data generative model and ensures good generalization for both accuracy and fairness. Experimental results show that Fair-CDA consistently outperforms state-of-the-art methods on widely-used benchmarks, e.g., Adult, CelebA and MovieLens. Especially, Fair-CDA obtains an 86.3% relative improvement for fairness while maintaining the accuracy on the Adult dataset. Moreover, we evaluate Fair-CDA in an online recommendation system to demonstrate the effectiveness of our method in terms of accuracy and fairness.

AAAI Conference 2023 Conference Paper

Geometry-Aware Network for Domain Adaptive Semantic Segmentation

  • Yinghong Liao
  • Wending Zhou
  • Xu Yan
  • Zhen Li
  • Yizhou Yu
  • Shuguang Cui

Measuring and alleviating the discrepancies between the synthetic (source) and real scene (target) data is the core issue for domain adaptive semantic segmentation. Though recent works have introduced depth information in the source domain to reinforce the geometric and semantic knowledge transfer, they cannot extract the intrinsic 3D information of objects, including positions and shapes, merely based on 2D estimated depth. In this work, we propose a novel Geometry-Aware Network for Domain Adaptation (GANDA), leveraging more compact 3D geometric point cloud representations to shrink the domain gaps. In particular, we first utilize the auxiliary depth supervision from the source domain to obtain the depth prediction in the target domain to accomplish structure-texture disentanglement. Beyond depth estimation, we explicitly exploit 3D topology on the point clouds generated from RGB-D images for further coordinate-color disentanglement and pseudo-label refinement in the target domain. Moreover, to improve the 2D classifier in the target domain, we perform domain-invariant geometric adaptation from source to target and unify the 2D semantic and 3D geometric segmentation results in two domains. Note that our GANDA is plug-and-play in any existing UDA framework. Qualitative and quantitative results demonstrate that our model outperforms state-of-the-art methods on GTA5->Cityscapes and SYNTHIA->Cityscapes.

AAAI Conference 2023 Conference Paper

MMTN: Multi-Modal Memory Transformer Network for Image-Report Consistent Medical Report Generation

  • Yiming Cao
  • Lizhen Cui
  • Lei Zhang
  • Fuqiang Yu
  • Zhen Li
  • Yonghui Xu

Automatic medical report generation is an essential task in applying artificial intelligence to the medical domain, which can lighten the workloads of doctors and promote clinical automation. The state-of-the-art approaches employ Transformer-based encoder-decoder architectures to generate reports for medical images. However, they do not fully explore the relationships between multi-modal medical data, and generate inaccurate and inconsistent reports. To address these issues, this paper proposes a Multi-modal Memory Transformer Network (MMTN) to cope with multi-modal medical data for generating image-report consistent medical reports. On the one hand, MMTN reduces the occurrence of image-report inconsistencies by designing a unique encoder to associate and memorize the relationship between medical images and medical terminologies. On the other hand, MMTN utilizes the cross-modal complementarity of the medical vision and language for the word prediction, which further enhances the accuracy of generating medical reports. Extensive experiments on three real datasets show that MMTN achieves significant effectiveness over state-of-the-art approaches on both automatic metrics and human evaluation.

JBHI Journal 2023 Journal Article

Predicting Drug-Target Affinity by Learning Protein Knowledge From Biological Networks

  • Wenjian Ma
  • Shugang Zhang
  • Zhen Li
  • Mingjian Jiang
  • Shuang Wang
  • Nianfan Guo
  • Yuanfei Li
  • Xiangpeng Bi

Predicting drug-target affinity (DTA) is a crucial step in the process of drug discovery. Efficient and accurate prediction of DTA would greatly reduce the time and economic cost of new drug development, which has encouraged the emergence of a large number of deep learning-based DTA prediction methods. In terms of the representation of target proteins, current methods can be classified into 1D sequence-based and 2D protein graph-based methods. However, both approaches focus only on the inherent properties of the target protein, neglecting the broad prior knowledge regarding protein interactions that has been clearly elucidated in past decades. Aiming at the above issue, this work presents an end-to-end DTA prediction method named MSF-DTA (Multi-Source Feature Fusion-based Drug-Target Affinity). The contributions can be summarized as follows. First, MSF-DTA adopts a novel "neighboring feature"-based protein representation. Instead of utilizing only the inherent features of a target protein, MSF-DTA gathers additional information for the target protein from its biologically related "neighboring" proteins in PPI (i.e., protein-protein interaction) and SSN (i.e., sequence similarity) networks to get prior knowledge. Second, the representation was learned using an advanced graph pre-training framework, VGAE, which could not only gather node features but also learn topological connections, therefore contributing to a richer protein representation and benefiting the downstream DTA prediction task. This study provides a new perspective for the DTA prediction task, and evaluation results demonstrated that MSF-DTA obtained superior performances compared to current state-of-the-art methods.
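
The "neighboring feature" idea can be sketched as mean aggregation over a protein's neighbors in the PPI or sequence-similarity graph (the paper learns this representation with VGAE; plain averaging here is a deliberate simplification):

```python
def neighbor_feature(protein, features, edges):
    """Sketch: average the feature vectors of proteins adjacent to the
    target in a PPI / sequence-similarity network, as prior knowledge
    beyond the protein's own inherent features."""
    nbrs = [v for u, v in edges if u == protein]
    nbrs += [u for u, v in edges if v == protein]
    if not nbrs:  # isolated node: fall back to its own features
        return list(features[protein])
    dim = len(features[protein])
    return [sum(features[n][d] for n in nbrs) / len(nbrs) for d in range(dim)]
```

A graph encoder like VGAE would additionally learn the topology rather than averaging uniformly, which is what the abstract credits for the richer representation.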

YNIMG Journal 2023 Journal Article

Single-subject cortical morphological brain networks: Phenotypic associations and neurobiological substrates

  • Zhen Li
  • Junle Li
  • Ningkai Wang
  • Yating Lv
  • Qihong Zou
  • Jinhui Wang

Although single-subject morphological brain networks provide an important way for human connectome studies, their roles and origins are poorly understood. Combining cross-sectional and repeated structural magnetic resonance imaging scans from adults, children and twins with behavioral and cognitive measures and brain-wide transcriptomic, cytoarchitectonic and chemoarchitectonic data, this study examined phenotypic associations and neurobiological substrates of single-subject morphological brain networks. We found that single-subject morphological brain networks explained inter-individual variance and predicted individual outcomes in Motor and Cognition domains, and distinguished individuals from each other. The performance can be further improved by integrating different morphological indices for network construction. Low-moderate heritability was observed for single-subject morphological brain networks with the highest heritability for sulcal depth-derived networks and higher heritability for inter-module connections. Furthermore, differential roles of genetic, cytoarchitectonic and chemoarchitectonic factors were observed for single-subject morphological brain networks. Cortical thickness-derived networks were related to the three factors with contributions from genes enriched in membrane and transport related functions, genes preferentially located in supragranular and granular layers, overall thickness in the molecular layer and thickness of wall in the infragranular layers, and metabotropic glutamate receptor 5 and dopamine transporter; fractal dimension-, gyrification index- and sulcal depth-derived networks were only associated with the chemoarchitectonic factor with contributions from different sets of neurotransmitter receptors. Most results were reproducible across different parcellation schemes and datasets. 
Altogether, this study demonstrates phenotypic associations and neurobiological substrates of single-subject morphological brain networks, which provide intermediate endophenotypes to link molecular and cellular architecture and behavior and cognition.

NeurIPS Conference 2023 Conference Paper

Small Total-Cost Constraints in Contextual Bandits with Knapsacks, with Application to Fairness

  • Evgenii Chzhen
  • Christophe Giraud
  • Zhen Li
  • Gilles Stoltz

We consider contextual bandit problems with knapsacks [CBwK], a problem where at each round, a scalar reward is obtained and vector-valued costs are suffered. The learner aims to maximize the cumulative rewards while ensuring that the cumulative costs are lower than some predetermined cost constraints. We assume that contexts come from a continuous set, that costs can be signed, and that the expected reward and cost functions, while unknown, may be uniformly estimated---a typical assumption in the literature. In this setting, total cost constraints had so far to be at least of order $T^{3/4}$, where $T$ is the number of rounds, and were even typically assumed to depend linearly on $T$. We are however motivated to use CBwK to impose a fairness constraint of equalized average costs between groups: the budget associated with the corresponding cost constraints should be as close as possible to the natural deviations, of order $\sqrt{T}$. To that end, we introduce a dual strategy based on projected-gradient-descent updates, that is able to deal with total-cost constraints of the order of $\sqrt{T}$ up to poly-logarithmic terms. This strategy is more direct and simpler than existing strategies in the literature. It relies on a careful, adaptive, tuning of the step size.
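
The dual strategy above can be sketched as projected gradient steps on the Lagrange multipliers of the cost constraints: each multiplier rises when the incurred cost overshoots the per-round budget, and is projected back onto the nonnegative orthant. The careful adaptive step-size tuning the abstract stresses is omitted in this toy version:

```python
def dual_update(lmbda, cost, budget_rate, step):
    """Sketch of a projected-gradient dual update for CBwK: raise the
    multiplier of an overspent constraint, decay it otherwise, and
    project onto lambda >= 0."""
    return [max(0.0, l + step * (c - b))
            for l, c, b in zip(lmbda, cost, budget_rate)]
```

At each round the learner would then pick the arm maximizing estimated reward minus the lambda-weighted estimated costs, which is what ties the dual variables back to the primal decisions.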

YNIMG Journal 2022 Journal Article

A novel technology for in vivo detection of cell type-specific neural connection with AQP1-encoding rAAV2-retro vector and metal-free MRI

  • Ning Zheng
  • Mei Li
  • Yang Wu
  • Challika Kaewborisuth
  • Zhen Li
  • Zhu Gui
  • Jinfeng Wu
  • Aoling Cai

A mammalian brain contains numerous neurons with distinct cell types for complex neural circuits. Virus-based circuit tracing tools are powerful in tracking the interaction among the different brain regions. However, detecting brain-wide neural networks in vivo remains challenging since most viral tracing systems rely on postmortem optical imaging. We developed a novel approach that enables in vivo detection of brain-wide neural connections based on metal-free magnetic resonance imaging (MRI). The recombinant adeno-associated virus (rAAV) with retrograde ability, the rAAV2-retro, encoding the human water channel aquaporin 1 (AQP1) MRI reporter gene was generated to label neural connections. The mouse was micro-injected with the virus at the Caudate Putamen (CPU) region and subjected to detection with Diffusion-weighted MRI (DWI). The prominent structure of the CPU-connected network was clearly defined. In combination with a Cre-loxP system, rAAV2-retro expressing Cre-dependent AQP1 provides a CPU-connected network of specific type neurons. Here, we established a sensitive, metal-free MRI-based strategy for in vivo detection of cell type-specific neural connections in the whole brain, which could visualize the dynamic changes of neural networks in rodents and potentially in non-human primates.

NeurIPS Conference 2022 Conference Paper

AMOS: A Large-Scale Abdominal Multi-Organ Benchmark for Versatile Medical Image Segmentation

  • Yuanfeng Ji
  • Haotian Bai
  • Chongjian Ge
  • Jie Yang
  • Ye Zhu
  • Ruimao Zhang
  • Zhen Li
  • Lingyan Zhang

Despite the considerable progress in automatic abdominal multi-organ segmentation from CT/MRI scans in recent years, a comprehensive evaluation of the models' capabilities is hampered by the lack of a large-scale benchmark from diverse clinical scenarios. Constrained by the high cost of collecting and labeling 3D medical data, most of the deep learning models to date are driven by datasets with a limited number of organs of interest or samples, which still limits the power of modern deep models and makes it difficult to provide a fully comprehensive and fair estimate of various methods. To mitigate the limitations, we present AMOS, a large-scale, diverse, clinical dataset for abdominal organ segmentation. AMOS provides 500 CT and 100 MRI scans collected from multi-center, multi-vendor, multi-modality, multi-phase, multi-disease patients, each with voxel-level annotations of 15 abdominal organs, providing challenging examples and a test-bed for studying robust segmentation algorithms under diverse targets and scenarios. We further benchmark several state-of-the-art medical segmentation models to evaluate the status of the existing methods on this new challenging dataset. We have made our datasets, benchmark servers, and baselines publicly available, and hope to inspire future research. Information can be found at https://amos22.grand-challenge.org.

JMLR Journal 2022 Journal Article

An Error Analysis of Generative Adversarial Networks for Learning Distributions

  • Jian Huang
  • Yuling Jiao
  • Zhen Li
  • Shiao Liu
  • Yang Wang
  • Yunfei Yang

This paper studies how well generative adversarial networks (GANs) learn probability distributions from finite samples. Our main results establish the convergence rates of GANs under a collection of integral probability metrics defined through Hölder classes, including the Wasserstein distance as a special case. We also show that GANs are able to adaptively learn data distributions that have low-dimensional structures or Hölder densities, when the network architectures are chosen properly. In particular, for distributions concentrated around a low-dimensional set, we show that the learning rates of GANs do not depend on the high ambient dimension, but on the lower intrinsic dimension. Our analysis is based on a new oracle inequality decomposing the estimation error into the generator and discriminator approximation error and the statistical error, which may be of independent interest.

AAAI Conference 2022 Conference Paper

Contact-Distil: Boosting Low Homologous Protein Contact Map Prediction by Self-Supervised Distillation

  • Qin Wang
  • Jiayang Chen
  • Yuzhe Zhou
  • Yu Li
  • Liangzhen Zheng
  • Sheng Wang
  • Zhen Li
  • Shuguang Cui

Accurate protein contact map prediction (PCMP) is essential for precise protein structure estimation and further biological studies. Recent works achieve significant performance on this task with high quality multiple sequence alignment (MSA). However, PCMP accuracy drops dramatically when only poor MSA (e.g., absolute MSA count less than 10) is available. Therefore, in this paper, we propose Contact-Distil to improve low homologous PCMP accuracy through knowledge distillation on a self-supervised model. Particularly, two pre-trained transformers are exploited to learn the high quality and low quality MSA representation in parallel for the teacher and student model correspondingly. Besides, the co-evolution information is further extracted from pure sequence through a pretrained ESM-1b model, which provides auxiliary knowledge to improve student performance. Extensive experiments show Contact-Distil outperforms previous state-of-the-art methods by large margins on the CAMEO-L dataset for low homologous PCMP, i.e., around 13.3% and 9.5% improvements against AlphaFold2 and MSA Transformer respectively when the MSA count is less than 10.

NeurIPS Conference 2022 Conference Paper

Contextual Bandits with Knapsacks for a Conversion Model

  • Zhen Li
  • Gilles Stoltz

We consider contextual bandits with knapsacks, with an underlying structure between rewards generated and cost vectors suffered. We do so motivated by sales with commercial discounts. At each round, given the stochastic i.i.d. context $\mathbf{x}_t$ and the arm picked $a_t$ (corresponding, e.g., to a discount level), a customer conversion may be obtained, in which case a reward $r(a_t, \mathbf{x}_t)$ is gained and vector costs $\mathbf{c}(a_t, \mathbf{x}_t)$ are suffered (corresponding, e.g., to losses of earnings). Otherwise, in the absence of a conversion, the reward and costs are null. The reward and costs achieved are thus coupled through the binary variable measuring conversion or the absence thereof. This underlying structure between rewards and costs is different from the linear structures considered by Agrawal and Devanur [2016] (but we show that the techniques introduced in the present article may also be applied to the case of these linear structures). The adaptive policies exhibited in this article solve at each round a linear program based on upper-confidence estimates of the probabilities of conversion given $a$ and $\mathbf{x}$. This kind of policy is most natural and achieves a regret bound of the typical order $(\mathrm{OPT}/B) \smash{\sqrt{T}}$, where $B$ is the total budget allowed, $\mathrm{OPT}$ is the optimal expected reward achievable by a static policy, and $T$ is the number of rounds.
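
The per-round linear program is fed by upper-confidence estimates of the conversion probability for each (arm, context) cell. A minimal sketch of such an estimate (the confidence width in the paper differs; this uses a generic Hoeffding-style bonus):

```python
import math

def ucb_conversion(successes, pulls, t):
    """Sketch: optimistic estimate of a conversion probability, as the
    empirical mean plus a confidence bonus that shrinks with pulls."""
    if pulls == 0:
        return 1.0  # never tried: fully optimistic
    mean = successes / pulls
    bonus = math.sqrt(2.0 * math.log(max(t, 2)) / pulls)
    return min(1.0, mean + bonus)
```

These optimistic probabilities multiply the known per-arm rewards and costs in the linear program, which is how the coupling through the conversion variable enters the policy.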

NeurIPS Conference 2022 Conference Paper

Divide and Contrast: Source-free Domain Adaptation via Adaptive Contrastive Learning

  • Ziyi Zhang
  • Weikai Chen
  • Hui Cheng
  • Zhen Li
  • Siyuan Li
  • Liang Lin
  • Guanbin Li

We investigate a practical domain adaptation task, called source-free domain adaptation (SFUDA), where the source pretrained model is adapted to the target domain without access to the source data. Existing techniques mainly leverage self-supervised pseudo-labeling to achieve class-wise global alignment [1] or rely on local structure extraction that encourages the feature consistency among neighborhoods [2]. While impressive progress has been made, both lines of methods have their own drawbacks – the “global” approach is sensitive to noisy labels while the “local” counterpart suffers from the source bias. In this paper, we present Divide and Contrast (DaC), a new paradigm for SFUDA that strives to connect the good ends of both worlds while bypassing their limitations. Based on the prediction confidence of the source model, DaC divides the target data into source-like and target-specific samples, where either group of samples is treated with tailored goals under an adaptive contrastive learning framework. Specifically, the source-like samples are utilized for learning global class clustering thanks to their relatively clean labels. The more noisy target-specific data are harnessed at the instance level for learning the intrinsic local structures. We further align the source-like domain with the target-specific samples using a memory bank-based Maximum Mean Discrepancy (MMD) loss to reduce the distribution mismatch. Extensive experiments on VisDA, Office-Home, and the more challenging DomainNet have verified the superior performance of DaC over current state-of-the-art approaches. The code is available at https://github.com/ZyeZhang/DaC.git.
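
The "divide" step can be sketched as a confidence threshold on the source model's predictions (the fixed cutoff `tau` is a simplification of the paper's adaptive scheme):

```python
def divide(confidences, tau=0.9):
    """Sketch of DaC's division: high-confidence target samples are
    treated as source-like, the rest as target-specific."""
    source_like = [i for i, c in enumerate(confidences) if c >= tau]
    target_specific = [i for i, c in enumerate(confidences) if c < tau]
    return source_like, target_specific
```

The two index sets then receive different objectives, class-level clustering for the source-like group and instance-level contrastive learning for the target-specific group, as the abstract describes.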

NeurIPS Conference 2022 Conference Paper

Let Images Give You More: Point Cloud Cross-Modal Training for Shape Analysis

  • Xu Yan
  • Heshen Zhan
  • Chaoda Zheng
  • Jiantao Gao
  • Ruimao Zhang
  • Shuguang Cui
  • Zhen Li

Although recent point cloud analysis achieves impressive progress, the paradigm of representation learning from a single modality gradually meets its bottleneck. In this work, we take a step towards more discriminative 3D point cloud representation using 2D images, which inherently contain richer appearance information, e.g., texture, color, and shade. Specifically, this paper introduces a simple but effective point cloud cross-modality training (PointCMT) strategy, which utilizes view-images, i.e., rendered or projected 2D images of the 3D object, to boost point cloud classification. In practice, to effectively acquire auxiliary knowledge from view-images, we develop a teacher-student framework and formulate the cross-modal learning as a knowledge distillation problem. Through novel feature and classifier enhancement criteria, PointCMT eliminates the distribution discrepancy between different modalities and effectively avoids potential negative transfer. Note that PointCMT efficiently improves the point-only representation without any architecture modification. Extensive experiments verify significant gains on various datasets based on several backbones, i.e., equipped with PointCMT, PointNet++ and PointMLP achieve state-of-the-art performance on two benchmarks, i.e., 94.4% and 86.7% accuracy on ModelNet40 and ScanObjectNN, respectively.

JBHI Journal 2022 Journal Article

Space Squeeze Reasoning and Low-Rank Bilinear Feature Fusion for Surgical Image Segmentation

  • Zhen-Liang Ni
  • Gui-Bin Bian
  • Zhen Li
  • Xiao-Hu Zhou
  • Rui-Qi Li
  • Zeng-Guang Hou

Surgical image segmentation is critical for surgical robot control and computer-assisted surgery. In the surgical scene, the local features of objects are highly similar, and the illumination interference is strong, which makes surgical image segmentation challenging. To address the above issues, a bilinear squeeze reasoning network is proposed for surgical image segmentation. In it, the space squeeze reasoning module is proposed, which adopts height pooling and width pooling to squeeze global contexts in the vertical and horizontal directions, respectively. The similarity between each horizontal position and each vertical position is calculated to encode long-range semantic dependencies and establish the affinity matrix. The feature maps are also squeezed from both the vertical and horizontal directions to model channel relations. Guided by channel relations, the affinity matrix is expanded to the same size as the input features. It captures long-range semantic dependencies from different directions, helping address the local similarity issue. Besides, a low-rank bilinear fusion module is proposed to enhance the model’s ability to recognize similar features. This module is based on the low-rank bilinear model to capture the inter-layer feature relations. It integrates the location details from low-level features and semantic information from high-level features. Various semantics can be represented more accurately, which effectively improves feature representation. The proposed network achieves state-of-the-art performance on cataract image segmentation dataset CataSeg and robotic image segmentation dataset EndoVis 2018.

IJCAI Conference 2021 Conference Paper

Adaptive Residue-wise Profile Fusion for Low Homologous Protein Secondary Structure Prediction Using External Knowledge

  • Qin Wang
  • Jun Wei
  • Boyuan Wang
  • Zhen Li
  • Sheng Wang
  • Shuguang Cui

Protein secondary structure prediction (PSSP) is essential for protein function analysis. However, for low homologous proteins, PSSP suffers from insufficient input features. In this paper, we explicitly import external self-supervised knowledge for low homologous PSSP under the guidance of residue-wise (amino-acid-wise) profile fusion. In practice, we first demonstrate the superiority of the profile over the Position-Specific Scoring Matrix (PSSM) for low homologous PSSP. Based on this observation, we introduce novel self-supervised BERT features as the pseudo profile, which implicitly involves the residue distribution in all natively discovered sequences as complementary features. Furthermore, a novel residue-wise attention is specially designed to adaptively fuse different features (i.e., the original low-quality profile and the BERT-based pseudo profile), which not only takes full advantage of each feature but also avoids noise disturbance. Besides, a feature consistency loss is proposed to accelerate model learning at multiple semantic levels. Extensive experiments confirm that our method outperforms the state of the art (i.e., by 4.7% for extremely low homologous cases on the BC40 dataset).

IJCAI Conference 2021 Conference Paper

Local Representation is Not Enough: Soft Point-Wise Transformer for Descriptor and Detector of Local Features

  • Zihao Wang
  • Xueyi Li
  • Zhen Li

Significant progress has been witnessed for the descriptor and detector of local features, but there still exist several challenging and intractable limitations, such as insufficient localization accuracy and non-discriminative description, especially in repetitive- or blank-texture regions, which have not been well addressed. The coarse feature representation and limited receptive field are considered the main causes of these limitations. To address these issues, we propose a novel Soft Point-Wise Transformer for Descriptor and Detector, simultaneously mining long-range intrinsic and cross-scale dependencies of local features. Furthermore, our model leverages distinct transformers based on the soft point-wise attention, substantially decreasing the memory and computation complexity, especially for high-resolution feature maps. In addition, a multi-level decoder is constructed to guarantee high detection accuracy and discriminative description. Extensive experiments demonstrate that our model outperforms the existing state-of-the-art methods on the image matching and visual localization benchmarks.

IJCAI Conference 2021 Conference Paper

PointLIE: Locally Invertible Embedding for Point Cloud Sampling and Recovery

  • Weibing Zhao
  • Xu Yan
  • Jiantao Gao
  • Ruimao Zhang
  • Jiayan Zhang
  • Zhen Li
  • Song Wu
  • Shuguang Cui

Point Cloud Sampling and Recovery (PCSR) is critical for massive real-time point cloud collection and processing, since raw data usually requires large storage and computation. This paper addresses a fundamental problem in PCSR: How to downsample the dense point cloud with arbitrary scales while preserving the local topology of discarded points in a case-agnostic manner (i.e., without additional storage for point relationships)? We propose a novel Locally Invertible Embedding (PointLIE) framework to unify point cloud sampling and upsampling into one single framework through bi-directional learning. Specifically, PointLIE decouples the local geometric relationships between discarded points and sampled points by progressively encoding the neighboring offsets into a latent variable. Once the latent variable is forced to obey a pre-defined distribution in the forward sampling path, the recovery can be achieved effectively through inverse operations. Taking the recover-pleasing sampled points and a latent embedding randomly drawn from the specified distribution as inputs, PointLIE can theoretically guarantee the fidelity of reconstruction and outperform state-of-the-art methods quantitatively and qualitatively.

AAAI Conference 2021 Conference Paper

PSSM-Distil: Protein Secondary Structure Prediction (PSSP) on Low-Quality PSSM by Knowledge Distillation with Contrastive Learning

  • Qin Wang
  • Boyuan Wang
  • Zhenlei Xu
  • Jiaxiang Wu
  • Peilin Zhao
  • Zhen Li
  • Sheng Wang
  • Junzhou Huang

Protein secondary structure prediction (PSSP) is an essential task in computational biology. To achieve accurate PSSP, the standard and vital feature engineering step is to use a multiple sequence alignment (MSA) for Position-Specific Scoring Matrix (PSSM) extraction. However, when only a low-quality PSSM can be obtained due to poor sequence homology, previous PSSP accuracy (merely around 65%) is far from practical usage for subsequent tasks. In this paper, we propose a novel PSSM-Distil framework for PSSP on low-quality PSSM, which not only enhances the PSSM feature at a lower level but also aligns the feature distribution at a higher level. In practice, PSSM-Distil first exploits the proteins with high-quality PSSM to train a teacher network for PSSP in a fully supervised way. Under the guidance of the teacher network, the low-quality PSSM and the corresponding student network with low discriminating capacity are effectively resolved by feature enhancement through EnhanceNet and distribution alignment through knowledge distillation with contrastive learning. Further, our PSSM-Distil supports input from a pre-trained protein sequence language BERT model to provide auxiliary information, which is designed to address the extremely low-quality PSSM cases, i.e., no homologous sequence. Extensive experiments demonstrate the proposed PSSM-Distil outperforms state-of-the-art models on PSSP by 6% on average and nearly 8% in extremely low-quality cases on the public benchmarks BC40 and CB513.

AAAI Conference 2021 Conference Paper

Sparse Single Sweep LiDAR Point Cloud Segmentation via Learning Contextual Shape Priors from Scene Completion

  • Xu Yan
  • Jiantao Gao
  • Jie Li
  • Ruimao Zhang
  • Zhen Li
  • Rui Huang
  • Shuguang Cui

LiDAR point cloud analysis is a core task for 3D computer vision, especially for autonomous driving. However, due to the severe sparsity and noise interference in the single-sweep LiDAR point cloud, accurate semantic segmentation is nontrivial to achieve. In this paper, we propose a novel sparse LiDAR point cloud semantic segmentation framework assisted by learned contextual shape priors. In practice, an initial semantic segmentation (SS) of a single-sweep point cloud can be achieved by any appealing network and then flows into the semantic scene completion (SSC) module as the input. By merging multiple frames in the LiDAR sequence as supervision, the optimized SSC module learns the contextual shape priors from sequential LiDAR data, completing the sparse single-sweep point cloud to a dense one. Thus, it inherently improves SS optimization through fully end-to-end training. Besides, a Point-Voxel Interaction (PVI) module is proposed to further enhance the knowledge fusion between the SS and SSC tasks, i.e., promoting the interaction of the incomplete local geometry of the point cloud and the complete voxel-wise global structure. Furthermore, the auxiliary SSC and PVI modules can be discarded during inference without extra burden for SS. Extensive experiments confirm that our JS3C-Net achieves superior performance on both the SemanticKITTI and SemanticPOSS benchmarks, i.e., 4% and 3% improvements, respectively.

IJCAI Conference 2020 Conference Paper

BARNet: Bilinear Attention Network with Adaptive Receptive Fields for Surgical Instrument Segmentation

  • Zhen-Liang Ni
  • Gui-Bin Bian
  • Guan-An Wang
  • Xiao-Hu Zhou
  • Zeng-Guang Hou
  • Xiao-Liang Xie
  • Zhen Li
  • Yu-Han Wang

Surgical instrument segmentation is crucial for computer-assisted surgery. Different from common object segmentation, it is more challenging due to the large illumination variation and scale variation in surgical scenes. In this paper, we propose a bilinear attention network with adaptive receptive fields to address these two issues. To deal with the illumination variation, the bilinear attention module models global contexts and semantic dependencies between pixels by capturing second-order statistics. With them, semantic features in challenging areas can be inferred from their neighbors, and the distinction of various semantics can be boosted. To adapt to the scale variation, our adaptive receptive field module aggregates multi-scale features and selects receptive fields adaptively. Specifically, it models the semantic relationships between channels to choose feature maps with appropriate scales, changing the receptive field of subsequent convolutions. The proposed network achieves the best performance, 97.47% mean IoU, on Cata7. It also takes first place on EndoVis 2017, exceeding the second place by 10.10% mean IoU.

NeurIPS Conference 2018 Conference Paper

Deep Neural Nets with Interpolating Function as Output Activation

  • Bao Wang
  • xiyang luo
  • Zhen Li
  • Wei Zhu
  • Zuoqiang Shi
  • Stanley Osher

We replace the output layer of deep neural nets, typically the softmax function, with a novel interpolating function, and we propose end-to-end training and testing algorithms for this new architecture. Compared to classical neural nets with the softmax function as output activation, the surrogate with an interpolating function as output activation combines advantages of both deep and manifold learning. The new framework demonstrates the following major advantages: First, it is better suited to cases with insufficient training data. Second, it significantly improves the generalization accuracy on a wide variety of networks. The algorithm is implemented in PyTorch, and the code is available at https://github.com/BaoWangMath/DNN-DataDependentActivation.

IJCAI Conference 2016 Conference Paper

Protein Secondary Structure Prediction Using Cascaded Convolutional and Recurrent Neural Networks

  • Zhen Li
  • Yizhou Yu

Protein secondary structure prediction is an important problem in bioinformatics. Inspired by the recent successes of deep neural networks, in this paper, we propose an end-to-end deep network that predicts protein secondary structures from integrated local and global contextual features. Our deep architecture leverages convolutional neural networks with different kernel sizes to extract multiscale local contextual features. In addition, considering long-range dependencies existing in amino acid sequences, we set up a bidirectional neural network consisting of gated recurrent units to capture global contextual features. Furthermore, multi-task learning is utilized to predict secondary structure labels and amino-acid solvent accessibility simultaneously. Our proposed deep network demonstrates its effectiveness by achieving state-of-the-art performance, i.e., 69.7% Q8 accuracy on the public benchmark CB513, 76.9% Q8 accuracy on CASP10, and 73.1% Q8 accuracy on CASP11. Our model and results are publicly available.

ICRA Conference 2013 Conference Paper

A novel one-motor driven robot that jumps and walks

  • Jun Zhang 0030
  • Guangming Song
  • Guifang Qiao
  • Zhen Li
  • Weiguo Wang
  • Aiguo Song

This paper presents a 10 cm × 5 cm × 5 cm, 52 g one-motor driven robot. One DC motor with a driving gear drives two driven gears to implement the functions of jumping and walking. Two one-way bearings mounted on the inner races of the two driven gears are used to switch between jumping and walking when the motor rotates clockwise and anticlockwise respectively. The jumping energy is obtained by compressing and releasing two torsion springs using a cylindrical cam with quick return characteristics. Two disk cams drive two forelegs with elastic joints to step forward one after another to implement the walking locomotion pattern. Two connecting rods link the forelegs and the rear legs on the left and right sides of the robot to transmit motions from forelegs to rear legs. The jumping and walking performances of the robot are tested. Experimental results show that the proposed robot can jump more than 33 cm high at a takeoff angle of 71.2° and it can walk forward at 1.43 mm/s.

IROS Conference 2012 Conference Paper

Self-righting, steering and takeoff angle adjusting for a jumping robot

  • Jun Zhang 0030
  • Guangming Song
  • Zhen Li
  • Guifang Qiao
  • Hongtao Sun
  • Aiguo Song

This paper presents a 9 cm × 7 cm × 12 cm, 154 g jumping robot with self-righting, steering, and takeoff angle adjusting capabilities. The quick energy releasing function of the jumping mechanism is implemented by using an eccentric cam. The self-righting, steering, and takeoff angle adjusting capabilities are achieved by adding a rotatable pole leg. The pole leg can prop up the body of the robot when it falls down. The pole leg can also steer the robot to turn at a step of about 24°. By adjusting the center of mass (COM), the robot can jump at different takeoff angles. Experimental results show that the constructed robot can jump more than 88 cm high at a takeoff angle of 82.7° and it can continuously jump to overcome stairs.

NeurIPS Conference 2011 Conference Paper

Learning to Search Efficiently in High Dimensions

  • Zhen Li
  • Huazhong Ning
  • LiangLiang Cao
  • Tong Zhang
  • Yihong Gong
  • Thomas Huang

High dimensional similarity search in large scale databases becomes an important challenge due to the advent of the Internet. For such applications, specialized data structures are required to achieve computational efficiency. Traditional approaches relied on algorithmic constructions that are often data independent (such as Locality Sensitive Hashing) or weakly dependent (such as kd-trees, k-means trees). While supervised learning algorithms have been applied to related problems, those proposed in the literature mainly focused on learning hash codes optimized for compact embedding of the data rather than search efficiency. Consequently such an embedding has to be used with linear scan or another search algorithm. Hence learning to hash does not directly address the search efficiency issue. This paper considers a new framework that applies supervised learning to directly optimize a data structure that supports efficient large scale search. Our approach takes both search quality and computational cost into consideration. Specifically, we learn a boosted search forest that is optimized using pair-wise similarity labeled examples. The output of this search forest can be efficiently converted into an inverted indexing data structure, which can leverage modern text search infrastructure to achieve both scalability and efficiency. Experimental results show that our approach significantly outperforms the state-of-the-art learning-to-hash methods (such as spectral hashing), as well as state-of-the-art high dimensional search algorithms (such as LSH and k-means trees).

IS Journal 2010 Journal Article

A Comparative Study of Mobile-Based Landmark Recognition Techniques

  • Kim-Hui Yap
  • Tao Chen
  • Zhen Li
  • Kui Wu

Mobile-based landmark recognition is becoming increasingly appealing due to the proliferation of mobile devices coupled with improving processing techniques, imaging capability, and networking infrastructure. This article provides a general overview of existing mobile-based and nonmobile-based landmark recognition systems and their differences. We discuss content and context analysis and compare landmark classification methods. We also present the experimental results of our own mobile landmark recognition evaluations based on content analysis, context analysis, and integrated content-context analysis.

KER Journal 2006 Journal Article

Enabling dynamic composition and coordination for autonomic Grid applications using the Rudder Agent framework

  • Zhen Li
  • Manish Parashar

This paper introduces Rudder, a peer-to-peer agent framework for supporting autonomic applications in decentralized distributed environments. The framework provides agents to discover, select, and compose elements, and defines agent interaction and negotiation protocols to enable appropriate application behaviors to be negotiated and enacted dynamically. The implementations of these protocols as well as agent coordination and negotiation activities are supported by Comet, a scalable decentralized coordination substrate. The operation and experimental evaluation of Rudder is presented.