Arrow Research search

Author name cluster

Jiaxin Li

Possible papers associated with this exact author name in Arrow. This page groups case-insensitive exact name matches and is not a full identity disambiguation profile.

18 papers
2 author rows

Possible papers

18

AAAI Conference 2026 Conference Paper

DensiCrafter: Physically-Constrained Generation and Fabrication of Self-Supporting Hollow Structures

  • Shengqi Dang
  • Fu Chai
  • Jiaxin Li
  • Chao Yuan
  • Wei Ye
  • Nan Cao

The rise of 3D generative models has enabled automatic 3D geometry and texture synthesis from multimodal inputs (e.g., text or images). However, these methods often ignore physical constraints and manufacturability considerations. In this work, we address the challenge of producing 3D designs that are both lightweight and self-supporting. We present DensiCrafter, a framework for generating lightweight, self-supporting 3D hollow structures by optimizing the density field. Starting from coarse voxel grids produced by Trellis, we interpret them as continuous density fields and introduce three differentiable, physically constrained, simulation-free loss terms. Additionally, a mass regularization penalizes unnecessary material, while a restricted optimization domain preserves the outer surface. Our method integrates seamlessly with pretrained Trellis-based models (e.g., Trellis, DSO) without any architectural changes. In extensive evaluations, we achieve up to a 43% reduction in material mass on the text-to-3D task. Compared to state-of-the-art baselines, our method improves stability while maintaining high geometric fidelity. Real-world 3D-printing experiments confirm that our hollow designs can be reliably fabricated and remain self-supporting.
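
As a rough illustration of the density-field idea, the sketch below performs one masked gradient step with a mass penalty. It is a minimal PyTorch example under assumed inputs (the `density` grid, `surface_mask`, and `physics_loss_fn` stand-in are hypothetical), not the authors' implementation.

```python
import torch

def hollowing_step(density, surface_mask, physics_loss_fn, lambda_mass=1e-3, lr=0.05):
    """One masked gradient step on a continuous density field of shape (D, H, W).

    density         -- voxel densities in [0, 1], a leaf tensor with requires_grad=True
    surface_mask    -- boolean grid; True where the outer surface must be preserved
    physics_loss_fn -- stand-in for the differentiable, simulation-free physics terms
    """
    loss_physics = physics_loss_fn(density)
    # Mass regularization: penalize total material to encourage hollow interiors.
    loss_mass = lambda_mass * density.sum()
    loss = loss_physics + loss_mass
    loss.backward()

    with torch.no_grad():
        grad = density.grad
        # Restricted optimization domain: do not update voxels on the outer surface.
        grad[surface_mask] = 0.0
        density -= lr * grad
        density.clamp_(0.0, 1.0)
        density.grad.zero_()
    return float(loss)
```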

AAAI Conference 2026 Conference Paper

Pansharpening for Thin-Cloud Contaminated Remote Sensing Images: A Unified Framework and Benchmark Dataset

  • Songcheng Du
  • Yang Zou
  • Jiaxin Li
  • Mingxuan Liu
  • Ying Li
  • Changjing Shang
  • Qiang Shen

Pansharpening under thin-cloud conditions is a practically significant yet rarely addressed task, challenged by simultaneous spatial resolution degradation and cloud-induced spectral distortions. Existing methods often address cloud removal and pansharpening sequentially, leading to cumulative errors and suboptimal performance due to the lack of joint degradation modeling. To address these challenges, we propose a Unified Pansharpening Model with Thin Cloud Removal (Pan-TCR), an end-to-end framework that integrates physical priors. Motivated by theoretical analysis in the frequency domain, we design a frequency-decoupled restoration (FDR) block that disentangles the restoration of multispectral image (MSI) features into amplitude and phase components, each guided by complementary degradation-robust prompts: the near-infrared (NIR) band amplitude for cloud-resilient restoration, and the panchromatic (PAN) phase for high-resolution structural enhancement. To ensure coherence between the two components, we further introduce an interactive inter-frequency consistency (IFC) module, enabling cross-modal refinement that enforces consistency and robustness across frequency cues. Furthermore, we introduce the first real-world thin-cloud contaminated pansharpening dataset (PanTCR-GF2), comprising paired clean and cloudy PAN-MSI images, to enable robust benchmarking under realistic conditions. Extensive experiments on real-world and synthetic datasets demonstrate the superiority and robustness of Pan-TCR, establishing a new benchmark for pansharpening under realistic atmospheric degradations.
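
The amplitude/phase split at the heart of the frequency-decoupled restoration can be pictured with the toy snippet below; the additive "prompt" corrections are placeholders for the learned, prompt-guided blocks and are assumptions, not the paper's FDR design.

```python
import torch

def frequency_decouple(msi_feat, nir_amp_prompt, pan_phase_prompt):
    """Toy amplitude/phase decoupling for (B, C, H, W) multispectral features.

    The two prompt tensors stand in for the NIR-amplitude and PAN-phase guidance
    and must broadcast against the rFFT spectrum of `msi_feat`.
    """
    spec = torch.fft.rfft2(msi_feat, norm="ortho")
    amplitude, phase = spec.abs(), spec.angle()

    # Restore amplitude under cloud-robust NIR guidance and phase under PAN guidance
    # (simple additive corrections here; the paper uses learned, prompt-guided blocks).
    amplitude = amplitude + nir_amp_prompt
    phase = phase + pan_phase_prompt

    restored = torch.polar(amplitude, phase)   # recombine into a complex spectrum
    return torch.fft.irfft2(restored, s=msi_feat.shape[-2:], norm="ortho")
```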

AAAI Conference 2025 Conference Paper

FloNa: Floor Plan Guided Embodied Visual Navigation

  • Jiaxin Li
  • Weiqi Huang
  • Zan Wang
  • Wei Liang
  • Huijun Di
  • Feng Liu

Humans naturally rely on floor plans to navigate in unfamiliar environments, as they are readily available, reliable, and provide rich geometrical guidance. However, existing visual navigation settings overlook this valuable prior knowledge, leading to limited efficiency and accuracy. To close this gap, we introduce a novel navigation task: Floor Plan Visual Navigation (FloNa), the first attempt to incorporate floor plans into embodied visual navigation. While the floor plan offers significant advantages, two key challenges emerge: (1) handling the spatial inconsistency between the floor plan and the actual scene layout for collision-free navigation, and (2) aligning observed images with the floor plan sketch despite their distinct modalities. To address these challenges, we propose FloDiff, a novel diffusion policy framework incorporating a localization module to facilitate alignment between the current observation and the floor plan. We further collect 20k navigation episodes across 117 scenes in the iGibson simulator to support training and evaluation. Extensive experiments demonstrate the effectiveness and efficiency of our framework in unfamiliar scenes using floor plan knowledge.
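
A highly simplified sketch of what a floor-plan-conditioned diffusion policy with an auxiliary localization head might look like is given below; the module names, dimensions, and heads are illustrative assumptions and do not reflect the released FloDiff code.

```python
import torch
import torch.nn as nn

class ToyFloDiff(nn.Module):
    """Illustrative sketch only: an observation encoder output and a floor-plan
    encoder output are fused; a localization head predicts the agent pose on the
    plan, and a denoising head predicts the noise on a short action trajectory."""

    def __init__(self, obs_dim=512, plan_dim=512, act_dim=2, horizon=8, hidden=256):
        super().__init__()
        self.fuse = nn.Sequential(nn.Linear(obs_dim + plan_dim, hidden), nn.ReLU())
        self.localize = nn.Linear(hidden, 3)   # (x, y, yaw) on the floor plan
        self.denoise = nn.Sequential(
            nn.Linear(hidden + horizon * act_dim + 1, hidden), nn.ReLU(),
            nn.Linear(hidden, horizon * act_dim))

    def forward(self, obs_feat, plan_feat, noisy_actions, t):
        # obs_feat: (B, obs_dim), plan_feat: (B, plan_dim),
        # noisy_actions: (B, horizon, act_dim), t: (B, 1) diffusion timestep
        ctx = self.fuse(torch.cat([obs_feat, plan_feat], dim=-1))
        pose = self.localize(ctx)              # auxiliary localization output
        eps = self.denoise(torch.cat([ctx, noisy_actions.flatten(1), t], dim=-1))
        return pose, eps.view_as(noisy_actions)  # predicted noise for the policy
```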

ICML Conference 2025 Conference Paper

HetSSNet: Spatial-Spectral Heterogeneous Graph Learning Network for Panchromatic and Multispectral Images Fusion

  • Mengting Ma
  • Yizhen Jiang
  • Mengjiao Zhao
  • Jiaxin Li
  • Wei Zhang 0243

Remote sensing pansharpening aims to reconstruct spatial-spectral properties during the fusion of panchromatic (PAN) images and low-resolution multi-spectral (LR-MS) images, finally generating high-resolution multi-spectral (HRMS) images. In the mainstream modeling strategies, i.e., CNNs and Transformers, the input images are treated as an equal-sized grid of pixels in Euclidean space, which limits them when facing remote sensing images with irregular ground objects. Graphs are a more flexible structure; however, two major challenges arise when modeling spatial-spectral properties with graphs: 1) constructing a customized graph structure for spatial-spectral relationship priors; 2) learning a unified spatial-spectral representation through the graph. To address these challenges, we propose the spatial-spectral heterogeneous graph learning network, named HetSSNet. Specifically, HetSSNet initially constructs the heterogeneous graph structure for pansharpening, which explicitly describes pansharpening-specific relationships. Subsequently, a basic relationship pattern generation module is designed to extract multiple relationship patterns from the heterogeneous graph. Finally, a relationship pattern aggregation module is exploited to collaboratively learn a unified spatial-spectral representation across different relationships among nodes, with adaptive importance learning from local and global perspectives. Extensive experiments demonstrate the significant superiority and generalization of HetSSNet.
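
To make the graph-construction challenge concrete, the snippet below builds two edge types, one from PAN-derived features and one from LR-MS-derived features, with a plain k-NN rule; this is an illustrative assumption, not HetSSNet's actual relationship priors.

```python
import torch

def build_heterogeneous_edges(pan_feat, ms_feat, k=8):
    """Toy construction of two edge types for a pansharpening graph.

    pan_feat: (N, Dp) per-node features from the PAN image.
    ms_feat:  (N, Dm) per-node features from the upsampled LR-MS image.
    Returns {'spatial': (2, E1), 'spectral': (2, E2)} edge-index tensors.
    """
    def knn_edges(feat):
        dist = torch.cdist(feat, feat)                         # (N, N) pairwise distances
        idx = dist.topk(k + 1, largest=False).indices[:, 1:]   # drop the self-match
        src = torch.arange(feat.size(0)).repeat_interleave(k)
        return torch.stack([src, idx.reshape(-1)])

    return {"spatial": knn_edges(pan_feat),    # spatial-detail relations from PAN
            "spectral": knn_edges(ms_feat)}    # spectral relations from LR-MS
```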

AAAI Conference 2025 Conference Paper

Modality-Aware Shot Relating and Comparing for Video Scene Detection

  • Jiawei Tan
  • Hongxing Wang
  • Kang Dang
  • Jiaxin Li
  • Zhilong Ou

Video scene detection involves assessing whether each shot and its surroundings belong to the same scene. Achieving this requires meticulously correlating multi-modal cues, e.g., visual entity and place modalities, among shots and comparing semantic changes around each shot. However, most methods treat multi-modal semantics equally and do not examine contextual differences between the two sides of a shot, leading to sub-optimal detection performance. In this paper, we propose the Modality-Aware Shot Relating and Comparing approach (MASRC), which relates shots according to the characteristics of their visual entity and place modalities and compares multi-shot similarities so that scene changes are explicitly encoded. Specifically, to fully harness the potential of the visual entity and place modalities in modeling shot relations, we mine long-term shot correlations from entity semantics while simultaneously revealing short-term shot correlations from place semantics. In this way, we can learn distinctive shot features that consolidate coherence within scenes and amplify distinguishability across scenes. Once equipped with distinctive shot features, we further encode the relations between the preceding and succeeding shots of each target shot by similarity convolution, aiding the identification of scene-ending shots. We validate the broad applicability of the proposed components in MASRC. Extensive experimental results on public benchmark datasets demonstrate that MASRC significantly advances video scene detection.
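
The idea of comparing the two sides of each shot can be sketched as follows; the windowed cosine-similarity score below is a toy stand-in for the paper's similarity convolution, with the window size as an assumption.

```python
import torch
import torch.nn.functional as F

def boundary_scores(shot_feats, window=4):
    """Toy comparison of the contexts on either side of each shot.

    shot_feats: (N, D) shot features.
    Returns a score per shot; low cross-window similarity suggests a scene boundary.
    """
    feats = F.normalize(shot_feats, dim=-1)
    n = feats.size(0)
    scores = torch.zeros(n)
    for i in range(n):
        prev = feats[max(0, i - window):i]       # preceding context
        nxt = feats[i + 1:i + 1 + window]        # succeeding context
        if len(prev) == 0 or len(nxt) == 0:
            continue
        # Mean pairwise similarity between the two sides of shot i.
        scores[i] = (prev @ nxt.T).mean()
    return 1.0 - scores   # higher score = larger semantic change around the shot
```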

NeurIPS Conference 2025 Conference Paper

Open-Vocabulary Part Segmentation via Progressive and Boundary-Aware Strategy

  • Xinlong Li
  • Di Lin
  • Shaoyiyi Gao
  • Jiaxin Li
  • Ruonan Liu
  • Qing Guo

Open-vocabulary part segmentation (OVPS) struggles with structurally connected boundaries due to the inherent conflict between continuous image features and discrete classification mechanisms. To address this, we propose PBAPS, a novel training-free framework specifically designed for OVPS. PBAPS leverages structural knowledge of object-part relationships to guide a progressive segmentation from objects to fine-grained parts. To further improve accuracy at challenging boundaries, we introduce a Boundary-Aware Refinement (BAR) module that identifies ambiguous boundary regions by quantifying classification uncertainty, enhances the discriminative features of these ambiguous regions using high-confidence context, and adaptively refines part prototypes to better align with the specific image. Experiments on Pascal-Part-116, ADE20K-Part-234, and PartImageNet demonstrate that PBAPS significantly outperforms state-of-the-art methods, achieving 46.35% mIoU and 34.46% bIoU on Pascal-Part-116. Our code is available at https://github.com/TJU-IDVLab/PBAPS.
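
Quantifying classification uncertainty to flag ambiguous boundary regions can be illustrated with a short entropy-based sketch; the logit layout and quantile threshold are assumptions, not the BAR module itself.

```python
import torch

def ambiguous_boundary_mask(part_logits, quantile=0.9):
    """Flag pixels whose part assignment is highly uncertain.

    part_logits: (P, H, W) similarity logits of each pixel to P part prototypes.
    Returns a boolean (H, W) mask that is True in ambiguous regions.
    """
    probs = part_logits.softmax(dim=0)
    # Per-pixel entropy of the part distribution as an uncertainty measure.
    entropy = -(probs * probs.clamp_min(1e-8).log()).sum(dim=0)   # (H, W)
    threshold = torch.quantile(entropy.flatten(), quantile)
    return entropy >= threshold
```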

NeurIPS Conference 2025 Conference Paper

StreamForest: Efficient Online Video Understanding with Persistent Event Memory

  • Xiangyu Zeng
  • Kefan Qiu
  • Qingyu Zhang
  • Xinhao Li
  • Jing Wang
  • Jiaxin Li
  • Ziang Yan
  • Kun Tian

Multimodal Large Language Models (MLLMs) have recently achieved remarkable progress in video understanding. However, their effectiveness in real-time streaming scenarios remains limited due to storage constraints of historical visual features and insufficient real-time spatiotemporal reasoning. To address these challenges, we propose StreamForest, a novel architecture specifically designed for streaming video understanding. Central to StreamForest is the Persistent Event Memory Forest, a memory mechanism that adaptively organizes video frames into multiple event-level tree structures. This process is guided by penalty functions based on temporal distance, content similarity, and merge frequency, enabling efficient long-term memory retention under limited computational resources. To enhance real-time perception, we introduce a Fine-grained Spatiotemporal Window, which captures detailed short-term visual cues to improve current scene perception. Additionally, we present OnlineIT, an instruction-tuning dataset tailored for streaming video tasks. OnlineIT significantly boosts MLLM performance in both real-time perception and future prediction. To evaluate generalization in practical applications, we introduce ODV-Bench, a new benchmark focused on real-time streaming video understanding in autonomous driving scenarios. Experimental results demonstrate that StreamForest achieves state-of-the-art performance, with accuracies of 77.3% on StreamingBench, 60.5% on OVBench, and 55.6% on OVO-Bench. In particular, even under extreme visual token compression (limited to 1024 tokens), the model retains 96.8% of its average accuracy across eight benchmarks relative to the default setting. These results underscore the robustness, efficiency, and generalizability of StreamForest for streaming video understanding.
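
A toy version of a merge score combining temporal distance, content similarity, and merge frequency is sketched below; the penalty forms and weights are assumptions, not the paper's actual functions.

```python
import torch
import torch.nn.functional as F

def merge_score(feat_a, feat_b, t_a, t_b, merges_a, merges_b,
                w_time=0.1, w_freq=0.05):
    """Hypothetical score for merging two memory nodes into one event node.

    feat_a, feat_b     -- frame/event feature vectors
    t_a, t_b           -- their timestamps
    merges_a, merges_b -- how often each node has already been merged
    Higher score = better candidates for merging.
    """
    similarity = F.cosine_similarity(feat_a, feat_b, dim=-1)   # content similarity
    time_penalty = w_time * abs(t_a - t_b)                     # temporal distance
    freq_penalty = w_freq * (merges_a + merges_b)              # discourage over-merging
    return similarity - time_penalty - freq_penalty
```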

IJCAI Conference 2025 Conference Paper

TSTAI: A Time-varying Brain Effective Connectivity Network Construction Method Combining with Brain Active Information

  • Qi Chen
  • Zhiqiong Wang
  • Jiaxin Li
  • Jinying Tao
  • Junchang Xin

More accurate construction of brain effective connectivity networks remains a great challenge for accurate auxiliary diagnosis of brain diseases and in-depth exploration of brain function. However, existing methods consider only higher-order or non-stationary assumptions, rather than constructing higher-order and non-stationary networks simultaneously. Among existing methods, Bayesian network methods demonstrate superior network structure learning ability. In this work, the forward-backward search (FBS) method is optimized by using brain active information and extended into a higher-order network structure learning method, called TSTAI. Firstly, in the process of non-stationary network structure learning, a two-stage strategy is used to search for change points. Then, in the process of learning the higher-order network structure, the FBS method is combined with two kinds of brain active information to improve the condition set filtering process and the scoring function, respectively. Finally, a pruning strategy is used to reduce the search space. Extensive experiments on simulated and real data demonstrate the effectiveness of TSTAI. In comparison with state-of-the-art higher-order network construction methods, the proposed method achieves improvements of 3.6% and 17.4%, respectively, in network construction accuracy.
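
The two-stage change-point idea (a coarse scan followed by local refinement) can be illustrated with the NumPy sketch below; the score function, window size, and threshold are assumptions, and this is not the TSTAI algorithm itself.

```python
import numpy as np

def two_stage_changepoints(ts, coarse_step=20, win=40, z_thresh=3.0):
    """Illustrative two-stage change-point search over a (T, R) time series
    (T time points, R regions): stage 1 scans coarsely for candidate points,
    stage 2 refines each candidate at single-sample resolution."""
    def shift_score(t):
        a, b = ts[max(0, t - win):t], ts[t:t + win]
        if len(a) < 2 or len(b) < 2:
            return 0.0
        # Standardized difference of window means, averaged over regions.
        pooled = np.sqrt((a.var(0) + b.var(0)) / 2 + 1e-8)
        return float(np.mean(np.abs(a.mean(0) - b.mean(0)) / pooled))

    candidates = [t for t in range(win, len(ts) - win, coarse_step)
                  if shift_score(t) > z_thresh]
    refined = [max(range(max(win, c - coarse_step), min(len(ts) - win, c + coarse_step)),
                   key=shift_score) for c in candidates]
    return sorted(set(refined))
```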

ICLR Conference 2024 Conference Paper

BENO: Boundary-embedded Neural Operators for Elliptic PDEs

  • Haixin Wang 0003
  • Jiaxin Li
  • Anubhav Dwivedi
  • Kentaro Hara
  • Tailin Wu

Elliptic partial differential equations (PDEs) are a major class of time-independent PDEs that play a key role in many scientific and engineering domains such as fluid dynamics, plasma physics, and solid mechanics. Recently, neural operators have emerged as a promising technique to solve elliptic PDEs more efficiently by directly mapping the input to solutions. However, existing networks typically neglect complex geometries and inhomogeneous boundary values present in the real world. Here we introduce Boundary-Embedded Neural Operators (BENO), a novel neural operator architecture that embeds complex geometries and inhomogeneous boundary values into the solving of elliptic PDEs. Inspired by the classical Green's function, BENO consists of two Graph Neural Networks (GNNs) for the interior source term and the boundary values, respectively. Furthermore, a Transformer encoder maps the global boundary geometry into a latent vector that influences each message-passing layer of the GNNs. We test our model and strong baselines extensively on elliptic PDEs with complex boundary conditions. We show that all existing baseline methods fail to learn the solution operator. In contrast, our model, endowed with the boundary-embedded architecture, outperforms state-of-the-art neural operators and strong baselines by an average of 60.96%.
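
A heavily simplified sketch of the described architecture, two message-passing branches plus a Transformer-encoded boundary latent injected at each layer, is given below; all layer sizes, the dense adjacency, and the readout are illustrative assumptions, not the BENO implementation.

```python
import torch
import torch.nn as nn

class ToyBENO(nn.Module):
    """Sketch: one branch for interior source nodes, one for boundary nodes,
    both conditioned on a Transformer encoding of the boundary geometry."""

    def __init__(self, dim=64, layers=3):
        super().__init__()
        self.embed_src = nn.Linear(1, dim)
        self.embed_bnd = nn.Linear(3, dim)          # e.g. (x, y, boundary value)
        self.boundary_enc = nn.TransformerEncoder(
            nn.TransformerEncoderLayer(d_model=dim, nhead=4, batch_first=True),
            num_layers=2)
        self.msg_src = nn.ModuleList(nn.Linear(2 * dim, dim) for _ in range(layers))
        self.msg_bnd = nn.ModuleList(nn.Linear(2 * dim, dim) for _ in range(layers))
        self.readout = nn.Linear(2 * dim, 1)

    def propagate(self, h, adj, layers, z):
        for lin in layers:
            agg = adj @ h                            # aggregate neighbor features
            h = torch.relu(lin(torch.cat([h, agg], -1))) + z   # inject boundary latent per layer
        return h

    def forward(self, src, bnd, adj):
        # src: (N, 1) source term, bnd: (M, 3) boundary nodes, adj: (N, N) normalized adjacency
        z = self.boundary_enc(self.embed_bnd(bnd).unsqueeze(0)).mean(dim=1)  # (1, dim) boundary latent
        h_src = self.propagate(self.embed_src(src), adj, self.msg_src, z)
        h_bnd = self.propagate(self.embed_bnd(bnd), torch.eye(bnd.size(0)), self.msg_bnd, z)
        pooled_bnd = h_bnd.mean(0, keepdim=True).expand_as(h_src)
        return self.readout(torch.cat([h_src, pooled_bnd], -1))   # per-node solution estimate
```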

AAMAS Conference 2024 Conference Paper

JDRec: Practical Actor-Critic Framework for Online Combinatorial Recommender System

  • Xin Zhao
  • Jiaxin Li
  • Zhiwei Fang
  • Yuchen Guo
  • Jinyuan Zhao
  • Jie He
  • Wenlong Chen
  • Changping Peng

In the realm of online recommendation systems, the Combinatorial Recommender (CR) system stands out for its unique approach. It presents users with a list of items on a result page, where user behavior is simultaneously influenced by contextual information and the items listed. Formulated as a combinatorial optimization problem, the objective of the CR system is to maximize the recommendation reward across the entire list of items. Despite the significant potential of CR systems, developing a practical and efficient model poses substantial challenges, stemming from the dynamic nature of online environments and the pressing need for personalized recommendations. To tackle these challenges, we decompose the overarching problem into two sub-problems: list generation and list evaluation. We propose novel and pragmatic model architectures for each sub-problem, aiming to concurrently enhance both effectiveness and efficiency. To further adapt the CR system to online scenarios, we integrate a bootstrap algorithm into an actor-critic reinforcement framework. This approach, called the JD Recommender System (JDRec), is designed to continuously refine the recommendation model through sustained user interaction, ensuring the system's adaptability and relevance. The proposed JDRec framework, tested through rigorous offline and online experiments, has shown promising results. It has been successfully deployed in online JD recommendation systems, yielding a notable improvement in click-through rate of 2.6% and augmenting the total value of the platform by 5.03%. In addition, we release the large-scale dataset used in our work to facilitate further research.
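
The list-generation / list-evaluation decomposition can be pictured with two toy modules, an actor that scores candidate items and a critic that predicts a list-level reward; the architectures below are assumptions for illustration, not the deployed JDRec models.

```python
import torch
import torch.nn as nn

class ListActor(nn.Module):
    """Toy list-generation actor: scores candidates given user context, picks top-k."""
    def __init__(self, item_dim=32, ctx_dim=32, hidden=64):
        super().__init__()
        self.net = nn.Sequential(nn.Linear(item_dim + ctx_dim, hidden), nn.ReLU(),
                                 nn.Linear(hidden, 1))

    def forward(self, ctx, items, k=5):
        # ctx: (ctx_dim,), items: (M, item_dim)
        x = torch.cat([items, ctx.expand(items.size(0), -1)], dim=-1)
        scores = self.net(x).squeeze(-1)
        return scores.topk(k).indices            # indices of the recommended list

class ListCritic(nn.Module):
    """Toy list-evaluation critic: predicts the reward of a whole list, so the
    actor can be trained against it in an actor-critic loop."""
    def __init__(self, item_dim=32, ctx_dim=32, hidden=64):
        super().__init__()
        self.net = nn.Sequential(nn.Linear(item_dim + ctx_dim, hidden), nn.ReLU(),
                                 nn.Linear(hidden, 1))

    def forward(self, ctx, list_items):
        # Mean-pool the listed items, then regress the expected list-level reward.
        pooled = list_items.mean(dim=0)
        return self.net(torch.cat([pooled, ctx], dim=-1))
```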

IROS Conference 2024 Conference Paper

Visual Loop Closure Detection with Thorough Temporal and Spatial Context Exploitation

  • Jiaxin Li
  • Zan Wang
  • Huijun Di
  • Jian Li
  • Wei Liang 0008

Despite advancements in visual Simultaneous Localization and Mapping (SLAM), prevailing visual Loop Closure Detection (LCD) methods primarily rely on computationally intensive image similarity comparisons, neglecting temporal-spatial context during long-term exploration. To address this issue, we propose TOSA, a novel visual LCD algorithm harnessing TempOral and SpAtial context for efficient LCD. Specifically, as the agent explores through time, our approach recurrently updates a latent feature incorporating historical information via a Long Short-Term Memory (LSTM) module. Upon receiving a query frame, TOSA seamlessly fuses the latent feature with the query feature to predict the candidates’ distribution, thus averting intensive similarity computation. Additionally, TOSA integrates a temporal-spatial convolution for candidate refinement by thoroughly exploiting the temporal consistency and spatial correlation to enhance selected candidates, further boosting the performance. Extensive experiments across four standard datasets showcase the superiority of our method over existing state-of-the-art techniques, demonstrating the effectiveness of utilizing rich temporal-spatial contexts.
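
A minimal sketch of the recurrent-context idea, where an LSTM keeps a latent history feature that is fused with the query to score loop candidates directly, is shown below; the feature dimensions and the linear scorer are assumptions, not the released TOSA model.

```python
import torch
import torch.nn as nn

class ToyTOSA(nn.Module):
    """Sketch: fold past frame features into an LSTM state, then fuse that state
    with a query feature to predict a distribution over loop-closure candidates,
    avoiding exhaustive pairwise similarity search."""

    def __init__(self, feat_dim=256, hidden=256, num_candidates=1000):
        super().__init__()
        self.lstm = nn.LSTMCell(feat_dim, hidden)
        self.scorer = nn.Linear(feat_dim + hidden, num_candidates)
        self.state = None

    def observe(self, frame_feat):
        # frame_feat: (B, feat_dim); recurrently update the latent history feature.
        self.state = self.lstm(frame_feat, self.state)

    def query(self, query_feat):
        h, _ = self.state
        fused = torch.cat([query_feat, h], dim=-1)
        return self.scorer(fused).softmax(dim=-1)   # distribution over candidates
```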

IROS Conference 2022 Conference Paper

A Deep-Learning-based System for Indoor Active Cleaning

  • Yike Yun
  • Linjie Hou
  • Zijian Feng
  • Wei Jin
  • Yang Liu
  • Heng Wang
  • Ruonan He
  • Weitao Guo

Cleaning public areas such as commercial complexes is challenging due to their sophisticated surroundings and the vast variety of real-life dirt. Robots are required to distinguish types of dirt and apply corresponding cleaning strategies. In this work, we propose an active-cleaning framework that utilizes deep-learning methods for both solid waste detection and liquid stain segmentation. Our system consists of four components: a Perception module integrated with deep-learning models, a Post-processing module for projection, a Tracking module for map localization, and a Planning and Control module for cleaning strategies. Compared with classic approaches, our vision-based system significantly improves cleaning efficiency. Besides, we release the largest real-world indoor hybrid dirt cleaning dataset (HD10K), containing 10K labeled images, together with a track-level evaluation metric for better cleaning performance measurement. The proposed deep-learning-based system is verified with extensive experiments on our dataset and deployed to Gaussian Robotics robots operating globally. The dataset is available at: https://gaussianopensource.github.io/projects/active_cleaning.

ICRA Conference 2019 Conference Paper

Discrete Rotation Equivariance for Point Cloud Recognition

  • Jiaxin Li
  • Yingcai Bi
  • Gim Hee Lee

Despite the recent active research on processing point clouds with deep networks, little attention has been paid to the sensitivity of these networks to rotations. In this paper, we propose a deep learning architecture that achieves discrete SO(2)/SO(3) rotation equivariance for point cloud recognition. Specifically, rotating an input point cloud by elements of a rotation group corresponds to shuffling the feature vectors generated by our approach. The equivariance is easily reduced to invariance by eliminating the permutation with operations such as maximum or average pooling. Our method can be directly applied to any existing point-cloud-based network, resulting in significant improvements in performance for rotated inputs. We show state-of-the-art results in classification tasks on various datasets under both SO(2) and SO(3) rotations. In addition, we further analyze the necessary conditions for applying our approach to PointNet [1]-based networks.
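
The reduction from discrete rotation equivariance to invariance by pooling over a rotation group can be sketched as follows; the z-axis rotation group and the generic shared `encoder` are assumptions for illustration, not the paper's network.

```python
import math
import torch

def rotate_z(points, angle):
    """Rotate an (N, 3) point cloud about the z-axis by `angle` radians."""
    c, s = math.cos(angle), math.sin(angle)
    R = torch.tensor([[c, -s, 0.0], [s, c, 0.0], [0.0, 0.0, 1.0]], dtype=points.dtype)
    return points @ R.T

def invariant_features(points, encoder, group_size=4):
    """Encode every rotated copy with a shared encoder, then pool over the group.

    Rotating the input by a group element only permutes the rows of `feats`,
    so the max-pooled output is invariant. `encoder` is any shared network
    mapping (N, 3) -> (D,).
    """
    feats = torch.stack([
        encoder(rotate_z(points, 2.0 * math.pi * k / group_size))
        for k in range(group_size)
    ])                                   # (group_size, D): equivariant feature stack
    return feats.max(dim=0).values       # rotation-invariant descriptor
```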

IROS Conference 2017 Conference Paper

Deep learning for 2D scan matching and loop closure

  • Jiaxin Li
  • Huangying Zhan
  • Ben M. Chen
  • Ian D. Reid 0001
  • Gim Hee Lee

Although 2D LiDAR based Simultaneous Localization and Mapping (SLAM) is a relatively mature topic nowadays, the loop closure problem remains challenging due to the lack of distinctive features in 2D LiDAR range scans. Existing research can be roughly divided into correlation-based approaches, e.g., scan-to-submap matching, and feature-based methods, e.g., bag-of-words (BoW). In this paper, we solve loop closure detection and relative pose transformation using 2D LiDAR within an end-to-end deep learning framework. The algorithm is verified with simulation data and on an Unmanned Aerial Vehicle (UAV) flying in an indoor environment. The loop detection ConvNet alone achieves an accuracy of 98.2% in loop closure detection. With a verification step using the scan matching ConvNet, the false positive rate drops to around 0.001%. The proposed approach processes 6000 pairs of raw LiDAR scans per second on an Nvidia GTX1080 GPU.
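
A minimal stand-in for a loop-detection ConvNet over a pair of 2D range scans is sketched below; the 1-D convolutional layout is an assumption for illustration, not the paper's architecture.

```python
import torch
import torch.nn as nn

class LoopDetectNet(nn.Module):
    """Sketch: stack two range scans as channels of a 1-D signal and classify
    the pair as loop / non-loop."""

    def __init__(self):
        super().__init__()
        self.conv = nn.Sequential(
            nn.Conv1d(2, 32, kernel_size=7, padding=3), nn.ReLU(),
            nn.MaxPool1d(2),
            nn.Conv1d(32, 64, kernel_size=5, padding=2), nn.ReLU(),
            nn.AdaptiveAvgPool1d(1),
        )
        self.fc = nn.Linear(64, 1)

    def forward(self, scan_a, scan_b):
        # scan_a, scan_b: (B, L) range readings from two candidate poses
        x = torch.stack([scan_a, scan_b], dim=1)    # (B, 2, L)
        logit = self.fc(self.conv(x).squeeze(-1))
        return torch.sigmoid(logit)                 # probability of loop closure
```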