Arrow Research

Author name cluster

Bin Zhao

Possible papers associated with this exact author name in Arrow. This page groups case-insensitive exact name matches and is not a full identity disambiguation profile.

29 papers
2 author rows

Possible papers (29)

AAAI Conference 2026 · Conference Paper

CLUHCS: Dual-View Contrastive Learning Enabled Unsupervised Heterogeneous Community Search with Meta-Path Behavior Modeling

  • Xiaoqin Xie
  • Bin Zhao
  • Mingzhu Chang
  • Shuai Han
  • Wu Yang

Existing community search methods rely heavily on labeled data or predefined structures and thus fail to capture obscure, dynamic community boundaries in open-world heterogeneous networks, leading to poor adaptability. They also neglect behavioral patterns, resulting in poor search performance. To address these issues, this work formally defines the unsupervised behavior-driven community search problem for heterogeneous graphs and designs a dual-view Contrastive Learning-based Unsupervised framework for Heterogeneous graph Community Search (CLUHCS). CLUHCS uses a relation view to encode local community cohesion and a meta-path view to capture global behavior semantics. A PathSim averaging strategy generates positive samples and self-supervised signals, eliminating label dependency entirely. Contrastive training then learns community representations automatically and addresses the ambiguity of open community boundaries. Furthermore, by capturing behavior patterns, the meta-path behavior modeling flexibly characterizes the formation mechanism of heterogeneous communities. Experiments on three datasets verify the effectiveness and efficiency of CLUHCS: it improves F1-score by 52.7% over the supervised baseline FCS-HGNN and by 41.5% over the unsupervised method TransZero.
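
The abstract references a PathSim averaging strategy without defining it. For orientation, the standard PathSim measure (Sun et al., VLDB 2011) between objects x and y under a symmetric meta-path P is sketched below; how CLUHCS averages these scores across meta-paths to form positive samples is specific to the paper.

```latex
% Standard PathSim: the count of meta-path instances between x and y,
% normalized by the self-similarity path counts of x and y.
\mathrm{PathSim}(x, y) =
  \frac{2 \, \lvert \{ p_{x \rightsquigarrow y} \} \rvert}
       {\lvert \{ p_{x \rightsquigarrow x} \} \rvert + \lvert \{ p_{y \rightsquigarrow y} \} \rvert}
```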

AAAI Conference 2026 · Conference Paper

FreeGaussian: Annotation-free Control of Articulated Objects via 3D Gaussian Splats with Flow Derivatives

  • Qizhi Chen
  • Delin Qu
  • Junli Liu
  • Yiwen Tang
  • Haoming Song
  • Dong Wang
  • Yuan Yuan
  • Bin Zhao

Reconstructing controllable Gaussian splats for articulated objects from monocular video is especially challenging due to its inherently insufficient constraints. Existing methods address this by relying on dense masks and manually defined control signals, limiting their real-world applications. In this paper, we propose an annotation-free method, FreeGaussian, which mathematically disentangles camera egomotion and articulated movements via flow derivatives. By establishing a connection between 2D flows and 3D Gaussian dynamic flow, our method enables optimization and continuity of dynamic Gaussian motions from flow priors without any control signals. Furthermore, we introduce a 3D spherical vector controlling scheme, which represents the state as a 3D Gaussian trajectory, thereby eliminating the need for complex 1D control signal calculations and simplifying controllable Gaussian modeling. Extensive experiments on articulated objects demonstrate the state-of-the-art visual performance and precise, part-aware controllability of our method.
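
As a rough illustration of the ego-motion/articulation split the abstract describes, the sketch below separates observed 2D flow into a camera-induced component (a standard pinhole warp, assuming known depth, intrinsics K, and relative pose R, t) and a dynamic residual. This is a generic decomposition for orientation, not the paper's flow-derivative formulation.

```python
import numpy as np

def rigid_flow(depth, K, R, t):
    """Camera-induced 2D flow for a static scene under relative pose (R, t)."""
    h, w = depth.shape
    u, v = np.meshgrid(np.arange(w), np.arange(h))
    pix = np.stack([u, v, np.ones_like(u)], -1).reshape(-1, 3).T  # 3 x N homogeneous pixels
    pts = (np.linalg.inv(K) @ pix) * depth.reshape(1, -1)         # back-project to 3D
    proj = K @ (R @ pts + t[:, None])                             # move camera, reproject
    uv2 = (proj[:2] / proj[2:]).T.reshape(h, w, 2)
    return uv2 - np.stack([u, v], -1).astype(float)

# The articulation-induced residual is then: observed_flow - rigid_flow(depth, K, R, t)
```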

AAAI Conference 2026 · Conference Paper

MindSight: A Bio-Inspired Neural Architecture for Visual Restoration via Cortical Electrical Stimulation

  • Yongjie Zou
  • Haonan Niu
  • Bin Zhao
  • Guoliang Yi
  • Mengchuanzhi Yang
  • Jiawei Ju
  • Jiapeng Yin
  • Chengyu T. Li

Visual impairment is a common condition worldwide, and cortical electrical stimulation is one approach to aiding visual restoration. However, existing methods suffer from limited precision, flexibility, and generalization in generating the desired visual perception. In this paper, we propose a novel deep learning-based algorithm for cortical electrical stimulation, named "MindSight," aimed at enhancing the clarity and accuracy of induced visual perceptions. Our framework introduces three key innovations: (1) a differentiable biophysical model simulating cortical state transitions under electrical stimulation, enabling end-to-end training; (2) a dual-path training architecture combining neural decoding fidelity with phosphene simulation constraints; (3) an attention-guided background-gated network for input filtration, and a multi-channel activation constraint to ensure the effectiveness of electrical stimulation. We validated our approach through novel experiments with macaque monkeys, demonstrating superior performance in visual perception tasks. These results highlight the potential of our approach in assisting individuals with visual impairments.

ICRA Conference 2025 · Conference Paper

APA-BI: Adaptive Partition Aggregation and Bidirectional Integration for UAV-View Geo-Localization

  • Xichen Zhang
  • Shuying Zhao
  • Yunzhou Zhang
  • Fawei Ge
  • Bin Zhao
  • Yizhong Zhang

The task of UAV-view geo-localization is to match a query image with database images to estimate the query image's current geographic location. This is particularly useful in environments where GPS is unavailable or the device fails. Although deep learning methods have made substantial progress in UAV-view geo-localization, they still face challenges in improving feature distinguishability. For instance, some feature aggregation methods do not consider semantic integrity, and robust elements in the image are not given enough attention. This paper proposes a UAV-view geo-localization method (APA-BI) to tackle the above issues. Specifically, we propose an adaptive partition aggregation method to ensure feature integrity at the semantic level by increasing the receptive field of the classifier module. At the same time, we design a bidirectional integration module to further enhance feature distinguishability by extracting robust tubular topological structures from images. Experimental results on public datasets demonstrate that APA-BI achieves impressive retrieval accuracy and outperforms most state-of-the-art methods. Moreover, test results of APA-BI in real-world scenarios also show excellent performance.

JBHI Journal 2025 · Journal Article

CrossMatch: Enhance Semi-Supervised Medical Image Segmentation With Perturbation Strategies and Knowledge Distillation

  • Bin Zhao
  • Chunshi Wang
  • Shuxue Ding

Semi-supervised learning for medical image segmentation presents a unique challenge of efficiently using limited labeled data while leveraging abundant unlabeled data. Despite advancements, existing methods often do not fully exploit the potential of the unlabeled data for enhancing model robustness and accuracy. In this paper, we introduce CrossMatch, a novel framework that integrates knowledge distillation with dual perturbation strategies, image-level and feature-level, to improve the model's learning from both labeled and unlabeled data. CrossMatch employs multiple encoders and decoders to generate diverse data streams, which undergo self-knowledge distillation to enhance the consistency and reliability of predictions across varied perturbations. Our method significantly surpasses other state-of-the-art techniques in standard benchmarks by effectively minimizing the gap between training on labeled and unlabeled data and improving edge accuracy and generalization in medical image segmentation. The efficacy of CrossMatch is demonstrated through extensive experimental validations, showing remarkable performance improvements without increasing computational costs.
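
A minimal sketch of the dual-perturbation training loop described above, assuming a segmentation model that exposes hypothetical encode/decode methods and a strong-augmentation function. It illustrates the general recipe (weak-to-strong pseudo-labeling plus feature-noise self-distillation), not the authors' exact losses.

```python
# Sketch only: `student.encode`/`student.decode` and `strong_aug` are
# hypothetical names, not CrossMatch's API.
import torch
import torch.nn.functional as F

def consistency_step(student, labeled, labels, unlabeled, strong_aug, opt):
    # Supervised loss on the labeled batch.
    sup_loss = F.cross_entropy(student(labeled), labels)

    # Image-level perturbation: pseudo-label the weak view, supervise the strong view.
    with torch.no_grad():
        pseudo = student(unlabeled).argmax(dim=1)
    img_loss = F.cross_entropy(student(strong_aug(unlabeled)), pseudo)

    # Feature-level perturbation: add noise to encoder features and distill
    # the clean prediction into the perturbed one (self-knowledge distillation).
    feats = student.encode(unlabeled)
    clean = student.decode(feats)
    noisy = student.decode(feats + 0.1 * torch.randn_like(feats))
    kd_loss = F.kl_div(noisy.log_softmax(dim=1),
                       clean.softmax(dim=1).detach(), reduction="batchmean")

    loss = sup_loss + img_loss + kd_loss
    opt.zero_grad(); loss.backward(); opt.step()
    return loss.item()
```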

TMLR Journal 2025 · Journal Article

Decentralized Transformers with Centralized Aggregation are Sample-Efficient Multi-Agent World Models

  • Yang Zhang
  • Chenjia Bai
  • Bin Zhao
  • Junchi Yan
  • Xiu Li
  • Xuelong Li

Learning a world model for model-free Reinforcement Learning (RL) agents can significantly improve sample efficiency by learning policies in imagination. However, building a world model for Multi-Agent RL (MARL) can be particularly challenging due to the scalability issue across different numbers of agents in a centralized architecture, and the non-stationarity issue in a decentralized architecture stemming from the inter-dependency among agents. To address both challenges, we propose a novel world model for MARL that learns decentralized local dynamics for scalability, combined with a centralized representation aggregation from all agents. We cast the dynamics learning as an auto-regressive sequence modeling problem over discrete tokens by leveraging the expressive Transformer architecture, in order to model complex local dynamics across different agents and provide accurate and consistent long-term imaginations. As the first Transformer-based world model for multi-agent systems, we introduce a Perceiver Transformer as an effective solution to enable centralized representation aggregation within this context. Extensive results on the StarCraft Multi-Agent Challenge (SMAC) and MAMuJoCo demonstrate superior sample efficiency and overall performance compared to strong model-free approaches and existing model-based methods.
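
The sketch below illustrates the core of a Perceiver-style centralized aggregation, where a small set of learned latents cross-attends to all agents' tokens; dimensions and naming are illustrative assumptions, not the paper's exact architecture.

```python
import torch
import torch.nn as nn

class CentralizedAggregator(nn.Module):
    def __init__(self, dim=128, n_latents=8, n_heads=4):
        super().__init__()
        # A small set of learned latent queries attends to all agents' tokens,
        # so cost scales with n_latents rather than quadratically in n_agents.
        self.latents = nn.Parameter(torch.randn(n_latents, dim))
        self.cross_attn = nn.MultiheadAttention(dim, n_heads, batch_first=True)

    def forward(self, agent_tokens):           # (batch, n_agents, dim)
        b = agent_tokens.shape[0]
        q = self.latents.unsqueeze(0).expand(b, -1, -1)
        pooled, _ = self.cross_attn(q, agent_tokens, agent_tokens)
        return pooled                          # (batch, n_latents, dim)

# Each agent's decentralized dynamics model would then condition on `pooled`
# in addition to its own local token history.
agg = CentralizedAggregator()
print(agg(torch.randn(2, 5, 128)).shape)       # torch.Size([2, 8, 128])
```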

AAAI Conference 2025 · Conference Paper

Learning 2D Invariant Affordance Knowledge for 3D Affordance Grounding

  • Xianqiang Gao
  • Pingrui Zhang
  • Delin Qu
  • Dong Wang
  • Zhigang Wang
  • Yan Ding
  • Bin Zhao

3D Object Affordance Grounding aims to predict the functional regions on a 3D object and has laid the foundation for a wide range of applications in robotics. Recent advances tackle this problem via learning a mapping between 3D regions and a single human-object interaction image. However, the geometric structure of the 3D object and the object in the human-object interaction image are not always consistent, leading to poor generalization. To address this issue, we propose to learn generalizable invariant affordance knowledge from multiple human-object interaction images within the same affordance category. Specifically, we introduce the Multi-Image Guided Invariant-Feature-Aware 3D Affordance Grounding (MIFAG) framework. It grounds 3D object affordance regions by identifying common interaction patterns across multiple human-object interaction images. First, the Invariant Affordance Knowledge Extraction Module (IAM) uses an iterative updating strategy to gradually extract aligned affordance knowledge from multiple images and integrate it into an affordance dictionary. Then, the Affordance Dictionary Adaptive Fusion Module (ADM) learns comprehensive point cloud representations that consider all affordance candidates in multiple images. In addition, we construct the Multi-Image and Point Affordance (MIPA) benchmark, on which our method outperforms existing state-of-the-art methods across various experimental comparisons.

ICRA Conference 2025 · Conference Paper

VSS-SLAM: Voxelized Surfel Splatting for Geometrically Accurate SLAM

  • Xuanhua Chen
  • Yunzhou Zhang
  • Zhiyao Zhang
  • Guoqing Wang
  • Bin Zhao
  • Xingshuo Wang

Visual Simultaneous Localization and Mapping (SLAM) helps robots estimate their poses and perceive the environment in unknown settings. Recent work has demonstrated that implicit neural radiance fields and 3D Gaussian Splatting (3DGS) offer higher fidelity scene representation than traditional map representations. We propose VSS-SLAM, which utilizes voxelized surfels as the map representation for incremental mapping in unknown environments. This representation effectively addresses the issue of redundant and disordered primitives encountered in previous methods, thereby enhancing geometric accuracy during reconstruction. Specifically, our approach divides the scene using voxels and stores geometric and appearance information in feature vectors at the voxel vertices. Before rendering, these feature vectors are decoded to generate the corresponding surfels. Additionally, we align camera poses through image and depth rendering. Extensive experiments on the Replica and TUM-RGBD datasets demonstrate that VSS-SLAM delivers high-fidelity reconstruction and accurate pose estimation in both simulated and real-world environments. Source code will soon be available.

IROS Conference 2024 · Conference Paper

Calibration-Free Vision-Assisted Container Loading of RTG Cranes

  • Jianbing Yang
  • Yuanzhe Wang
  • Hao Jiang
  • Bin Zhao
  • Yiming Li
  • Danwei Wang

Vision-assisted container loading of Rubber Tyred Gantry (RTG) cranes faces two primary challenges. First, the uncertainty inherent in Convolutional Neural Network (CNN) based detection hinders its direct application in the safety-critical operation of such heavy-duty machinery. Second, sensor calibration introduces additional complexities and errors into the system. However, existing studies have not adequately addressed these challenges. Motivated by this gap, this paper proposes an integrated approach for target detection and alignment control in container loading of RTG cranes. To ensure reliable target marker identification, a heuristic post-processing algorithm is developed as a complement to CNN-based foreground segmentation, thereby ensuring safety during the container handling process. On this basis, a pixel-based control scheme is designed to align the container with the target markers, which eliminates the need for offline or online sensor calibration. The proposed approach has been successfully implemented on a real RTG crane manufactured by Shanghai Zhenhua Heavy Industries Co., Ltd. (ZPMC) and validated at the Port of Ningbo, China. Experimental results demonstrate the superiority of the proposed approach over current manual operations in port industries, highlighting its potential for crane automation.
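
A toy sketch of what calibration-free, pixel-based alignment control can look like: velocities are commanded directly from the pixel offset between the detected target marker and a reference point, so no camera-to-crane calibration is needed. The gain, deadband, and velocity interface are hypothetical placeholders, not the paper's controller.

```python
def alignment_command(marker_uv, image_size, kp=0.002, deadband_px=5):
    """Map a marker's pixel offset from the image center to velocity commands."""
    u, v = marker_uv
    cu, cv = image_size[0] / 2.0, image_size[1] / 2.0
    eu, ev = u - cu, v - cv
    # Inside the deadband we declare alignment and stop.
    if abs(eu) <= deadband_px and abs(ev) <= deadband_px:
        return 0.0, 0.0, True
    # Proportional control directly on pixel error; the sign convention
    # (image axes vs. crane axes) must be set per installation.
    return -kp * eu, -kp * ev, False

vx, vy, aligned = alignment_command((660, 370), (1280, 720))
print(vx, vy, aligned)
```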

AAAI Conference 2024 · Conference Paper

Color Event Enhanced Single-Exposure HDR Imaging

  • Mengyao Cui
  • Zhigang Wang
  • Dong Wang
  • Bin Zhao
  • Xuelong Li

Single-exposure high dynamic range (HDR) imaging aims to reconstruct the wide-range intensities of a scene by using its single low dynamic range (LDR) image, thus providing significant efficiency. Existing methods pay high attention to restoring the luminance by inverting the tone-mapping process, while the color in the over-/under-exposed areas cannot be well restored due to the information loss of the single LDR image. To address this issue, we introduce color events into the imaging pipeline, which record asynchronous pixel-wise color changes in a high dynamic range, enabling edge-like scene perception under challenging lighting conditions. Specifically, we propose a joint framework that incorporates color events and a single LDR image to restore both the content and color of an HDR image, where an exposure-aware transformer (EaT) module is designed to propagate the informative hints, provided by the normally exposed LDR regions and the event streams, to the missing areas. In this module, an exposure-aware mask is estimated to suppress distractive information and strengthen the restoration of the over-/under-exposed regions. To our knowledge, we are the first to use color events to enhance single-exposure HDR imaging. We also contribute corresponding datasets, consisting of synthesized datasets and a real-world dataset collected by a DAVIS346-color camera. The datasets can be found at https://www.kaggle.com/datasets/mengyaocui/ce-hdr. Extensive experiments demonstrate the effectiveness of the proposed method.

IROS Conference 2024 · Conference Paper

HSS-SLAM: Human-in-the-Loop Semantic SLAM Represented by Superquadrics

  • Yulong Li
  • Yunzhou Zhang
  • Bin Zhao
  • Zhiyao Zhang
  • You Shen
  • Tengda Zhang
  • Guolu Chen

The advancement of object detection algorithms has catalyzed the development of object-level semantic SLAM. However, due to missed and false detections, object-level semantic SLAM fails to represent the objects within the scene adequately. Therefore, this paper proposes a novel object-level semantic SLAM termed HSS-SLAM. We incorporate a human-in-the-loop into our method, establishing an interaction module that lets humans edit and rectify semantic information. Additionally, to minimize the manual correction workload, a lightweight and intuitive method for semantic extension is proposed, augmenting the semantic richness of the global map with only a few operations. Furthermore, our method adopts superquadrics for object representation, enabling detailed descriptions of various object shapes. This mitigates the limitation of conventional semantic mapping, where objects are difficult to distinguish due to the reliance on a single-shape representation. Subsequently, precise estimation of superquadric parameters and camera poses is achieved through joint optimization. Extensive experiments conducted on the TUM RGB-D and Scenes V2 datasets demonstrate that the proposed approach exhibits competitive performance, surpassing current methods in both object representation and camera localization accuracy.
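
For reference, object shapes in a superquadric representation are commonly described by the standard inside-outside function below (size parameters a1..a3, shape exponents e1, e2); the paper's joint optimization over these parameters and camera poses is not reproduced here.

```python
# The standard superquadric inside-outside function (Barr, 1981).
def superquadric_F(x, y, z, a1, a2, a3, e1, e2):
    """F < 1 inside, F = 1 on the surface, F > 1 outside."""
    xy = (abs(x / a1) ** (2.0 / e2) + abs(y / a2) ** (2.0 / e2)) ** (e2 / e1)
    return xy + abs(z / a3) ** (2.0 / e1)

# e1 = e2 = 1 recovers an ellipsoid: a point on the unit sphere gives F = 1.
print(superquadric_F(1.0, 0.0, 0.0, 1, 1, 1, 1, 1))  # 1.0
```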

ICRA Conference 2024 · Conference Paper

L-VIWO: Visual-Inertial-Wheel Odometry based on Lane Lines

  • Bin Zhao
  • Yunzhou Zhang
  • Junjie Huang
  • Xichen Zhang
  • Zeyu Long
  • Yulong Li

To achieve precise localization for autonomous vehicles and mitigate accumulated drift error in odometry, this paper proposes L-VIWO, a Visual-Inertial-Wheel Odometry based on lane lines. The method exploits the lateral constraints provided by lane lines to correct incrementally accumulated pose errors. First, we introduce a lane line tracking method that enables multi-frame tracking of the same lane line, thereby obtaining multi-frame data for each lane line. Then, we use the multi-frame data and the curvature characteristics of adjacent lane lines to optimize the positions of the lane line sample points, building a reliable lane line map. Finally, we use the local lane line map to correct the vehicle's position. Based on the corrected position and the prior pose from odometry, we build a graph optimization model to optimize the vehicle's pose. Localization experiments on the KAIST dataset demonstrate that the proposed method effectively enhances odometry localization accuracy.

NeurIPS Conference 2024 · Conference Paper

Learning an Actionable Discrete Diffusion Policy via Large-Scale Actionless Video Pre-Training

  • Haoran He
  • Chenjia Bai
  • Ling Pan
  • Weinan Zhang
  • Bin Zhao
  • Xuelong Li

Learning a generalist embodied agent capable of completing multiple tasks poses challenges, primarily stemming from the scarcity of action-labeled robotic datasets. In contrast, a vast amount of human videos exist, capturing intricate tasks and interactions with the physical world. Promising prospects arise for using actionless human videos for pre-training and transferring the knowledge to facilitate robot policy learning through limited robot demonstrations. However, this remains challenging due to the domain gap between humans and robots. Moreover, it is difficult to extract useful information representing the dynamic world from human videos, because of their noisy and multimodal data structure. In this paper, we introduce a novel framework to tackle these challenges, which leverages a unified discrete diffusion to combine generative pre-training on human videos with policy fine-tuning on a small number of action-labeled robot videos. We start by compressing both human and robot videos into unified video tokens. In the pre-training stage, we employ a discrete diffusion model with a mask-and-replace diffusion strategy to predict future video tokens in the latent space. In the fine-tuning stage, we harness the imagined future videos to guide low-level action learning with a limited set of robot data. Experiments demonstrate that our method generates high-fidelity future videos for planning and enhances the fine-tuned policies, outperforming previous state-of-the-art approaches.
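
A minimal sketch of a mask-and-replace corruption step for discrete video tokens, in the spirit of absorbing-state discrete diffusion; the linear schedule and the MASK/random split ratio are illustrative assumptions, not the paper's exact settings.

```python
import torch

def corrupt(tokens, t, T, vocab, mask_id, replace_frac=0.1):
    """Corrupt integer tokens at diffusion step t of T; mask_id is an extra token."""
    p = t / T                                   # corruption grows with timestep
    u = torch.rand_like(tokens, dtype=torch.float)
    out = tokens.clone()
    out[u < p * (1 - replace_frac)] = mask_id   # most corrupted tokens -> MASK
    rnd = torch.randint_like(tokens, vocab)
    sel = (u >= p * (1 - replace_frac)) & (u < p)
    out[sel] = rnd[sel]                         # a few -> random vocabulary tokens
    return out

x = torch.randint(0, 512, (2, 16))
print(corrupt(x, t=50, T=100, vocab=512, mask_id=512).shape)  # torch.Size([2, 16])
```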

NeurIPS Conference 2024 · Conference Paper

LiveScene: Language Embedding Interactive Radiance Fields for Physical Scene Control and Rendering

  • Delin Qu
  • Qizhi Chen
  • Pingrui Zhang
  • Xianqiang Gao
  • Bin Zhao
  • Zhigang Wang
  • Dong Wang
  • Xuelong Li

This paper scales object-level reconstruction to complex scenes, advancing interactive scene reconstruction. We introduce two datasets, OmniSim and InterReal, featuring 28 scenes with multiple interactive objects. To tackle the challenge of inaccurate interactive motion recovery in complex scenes, we propose LiveScene, a scene-level language-embedded interactive radiance field that efficiently reconstructs and controls multiple objects. By decomposing the interactive scene into local deformable fields, LiveScene enables separate reconstruction of individual object motions, reducing memory consumption. Additionally, our interaction-aware language embedding localizes individual interactive objects, allowing for arbitrary control using natural language. Our approach demonstrates significant superiority in novel view synthesis, interactive scene control, and language grounding performance through extensive experiments. Project page: https://livescenes.github.io.

AAAI Conference 2024 · Conference Paper

Point-PEFT: Parameter-Efficient Fine-Tuning for 3D Pre-trained Models

  • Yiwen Tang
  • Ray Zhang
  • Zoey Guo
  • Xianzheng Ma
  • Bin Zhao
  • Zhigang Wang
  • Dong Wang
  • Xuelong Li

The popularity of pre-trained large models has revolutionized downstream tasks across diverse fields, such as language, vision, and multi-modality. To minimize the adaptation cost for downstream tasks, many Parameter-Efficient Fine-Tuning (PEFT) techniques have been proposed for language and 2D image pre-trained models. However, specialized PEFT methods for 3D pre-trained models remain under-explored. To this end, we introduce Point-PEFT, a novel framework for adapting point cloud pre-trained models with minimal learnable parameters. Specifically, for a pre-trained 3D model, we freeze most of its parameters and only tune the newly added PEFT modules on downstream tasks, which consist of a Point-prior Prompt and a Geometry-aware Adapter. The Point-prior Prompt adopts a set of learnable prompt tokens, for which we propose to construct a memory bank with domain-specific knowledge and utilize parameter-free attention to enhance the prompt tokens. The Geometry-aware Adapter aims to aggregate point cloud features within spatial neighborhoods to capture fine-grained geometric information through local interactions. Extensive experiments indicate that Point-PEFT can achieve better performance than full fine-tuning on various downstream tasks while using only 5% of the trainable parameters, demonstrating the efficiency and effectiveness of our approach. Code is released at https://github.com/Ivan-Tang-3D/Point-PEFT.
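
A minimal sketch of the freeze-and-tune recipe described above: the pre-trained backbone is frozen and only a small inserted module (plus a task head) trains. The generic bottleneck adapter here is an assumption for illustration; the paper's Point-prior Prompt and Geometry-aware Adapter carry additional structure (memory bank, local geometric aggregation).

```python
import torch.nn as nn

class Adapter(nn.Module):
    """Generic residual bottleneck adapter (illustrative, not the paper's module)."""
    def __init__(self, dim, bottleneck=16):
        super().__init__()
        self.down = nn.Linear(dim, bottleneck)
        self.act = nn.GELU()
        self.up = nn.Linear(bottleneck, dim)

    def forward(self, x):
        return x + self.up(self.act(self.down(x)))  # residual connection

def make_peft(backbone, dim=384):
    for p in backbone.parameters():      # freeze everything pre-trained
        p.requires_grad = False
    adapter = Adapter(dim)               # only this (and a task head) trains
    trainable = sum(p.numel() for p in adapter.parameters())
    total = trainable + sum(p.numel() for p in backbone.parameters())
    print(f"trainable fraction: {trainable / total:.2%}")
    return adapter
```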

AAAI Conference 2024 · Conference Paper

X4D-SceneFormer: Enhanced Scene Understanding on 4D Point Cloud Videos through Cross-Modal Knowledge Transfer

  • Linglin Jing
  • Ying Xue
  • Xu Yan
  • Chaoda Zheng
  • Dong Wang
  • Ruimao Zhang
  • Zhigang Wang
  • Hui Fang

The field of 4D point cloud understanding is rapidly developing, with the goal of analyzing dynamic 3D point cloud sequences. However, it remains a challenging task due to the sparsity and lack of texture in point clouds. Moreover, the irregularity of point clouds makes it difficult to align temporal information within video sequences. To address these issues, we propose a novel cross-modal knowledge transfer framework, called X4D-SceneFormer. This framework enhances 4D scene understanding by transferring texture priors from RGB sequences using a Transformer architecture with temporal relationship mining. Specifically, the framework is designed with a dual-branch architecture, consisting of a 4D point cloud transformer and a Gradient-aware Image Transformer (GIT). The GIT combines visual texture and temporal correlation features to offer rich semantics and dynamics for better point cloud representation. During training, we employ multiple knowledge transfer techniques, including temporal consistency losses and masked self-attention, to strengthen the knowledge transfer between modalities. This leads to enhanced performance during inference using single-modal 4D point cloud inputs. Extensive experiments demonstrate the superior performance of our framework on various 4D point cloud video understanding tasks, including action recognition, action segmentation, and semantic segmentation. The results achieve 1st place on the HOI4D challenge, i.e., 85.3% (+7.9%) accuracy for 4D action segmentation and 47.3% (+5.0%) mIoU for semantic segmentation, outperforming previous state-of-the-art methods by a large margin. We release the code at https://github.com/jinglinglingling/X4D.

NeurIPS Conference 2023 · Conference Paper

Cross-Domain Policy Adaptation via Value-Guided Data Filtering

  • Kang Xu
  • Chenjia Bai
  • Xiaoteng Ma
  • Dong Wang
  • Bin Zhao
  • Zhen Wang
  • Xuelong Li
  • Wei Li

Generalizing policies across different domains with dynamics mismatch poses a significant challenge in reinforcement learning. For example, a robot learns the policy in a simulator, but when it is deployed in the real world, the dynamics of the environment may be different. Given the source and target domain with dynamics mismatch, we consider the online dynamics adaptation problem, in which case the agent can access sufficient source domain data while online interactions with the target domain are limited. Existing research has attempted to solve the problem from the dynamics discrepancy perspective. In this work, we reveal the limitations of these methods and explore the problem from the value difference perspective via a novel insight on the value consistency across domains. Specifically, we present the Value-Guided Data Filtering (VGDF) algorithm, which selectively shares transitions from the source domain based on the proximity of paired value targets across the two domains. Empirical results on various environments with kinematic and morphology shifts demonstrate that our method achieves superior performance compared to prior approaches.
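
A sketch of value-guided filtering as the abstract describes it: source transitions are shared only when their value target is close to the target-domain value target for the same state-action pair. The top-quantile acceptance rule here is an assumption, not necessarily the paper's criterion.

```python
import numpy as np

def filter_source_batch(v_target_src, v_target_tgt, keep_frac=0.25):
    """Both arrays hold per-transition value targets under the two domains."""
    gap = np.abs(v_target_src - v_target_tgt)
    threshold = np.quantile(gap, keep_frac)  # keep the closest keep_frac fraction
    return gap <= threshold                  # boolean mask over the source batch

mask = filter_source_batch(np.random.rand(512), np.random.rand(512))
print(mask.mean())                           # ~0.25 of transitions shared
```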

NeurIPS Conference 2023 · Conference Paper

Diffusion Model is an Effective Planner and Data Synthesizer for Multi-Task Reinforcement Learning

  • Haoran He
  • Chenjia Bai
  • Kang Xu
  • Zhuoran Yang
  • Weinan Zhang
  • Dong Wang
  • Bin Zhao
  • Xuelong Li

Diffusion models have demonstrated highly expressive generative capabilities in vision and NLP. Recent studies in reinforcement learning (RL) have shown that diffusion models are also powerful in modeling complex policies or trajectories in offline datasets. However, these works have been limited to single-task settings where a generalist agent capable of addressing multi-task predicaments is absent. In this paper, we aim to investigate the effectiveness of a single diffusion model in modeling large-scale multi-task offline data, which can be challenging due to diverse and multimodal data distributions. Specifically, we propose the Multi-Task Diffusion Model (MTDiff), a diffusion-based method that incorporates Transformer backbones and prompt learning for generative planning and data synthesis in multi-task offline settings. MTDiff leverages the vast amounts of knowledge available in multi-task data and performs implicit knowledge sharing among tasks. For generative planning, we find MTDiff outperforms state-of-the-art algorithms across 50 tasks on Meta-World and 8 maps on Maze2D. For data synthesis, MTDiff generates high-quality data for testing tasks given a single demonstration as a prompt, which enhances the low-quality datasets for even unseen tasks.

NeurIPS Conference 2022 · Conference Paper

Point-M2AE: Multi-scale Masked Autoencoders for Hierarchical Point Cloud Pre-training

  • Renrui Zhang
  • Ziyu Guo
  • Peng Gao
  • Rongyao Fang
  • Bin Zhao
  • Dong Wang
  • Yu Qiao
  • Hongsheng Li

Masked Autoencoders (MAE) have shown great potential in self-supervised pre-training for language and 2D image transformers. However, it remains an open question how to exploit masked autoencoding for learning 3D representations of irregular point clouds. In this paper, we propose Point-M2AE, a strong Multi-scale MAE pre-training framework for hierarchical self-supervised learning of 3D point clouds. Unlike the standard transformer in MAE, we modify the encoder and decoder into pyramid architectures to progressively model spatial geometries and capture both fine-grained and high-level semantics of 3D shapes. For the encoder that downsamples point tokens by stages, we design a multi-scale masking strategy to generate consistent visible regions across scales, and adopt a local spatial self-attention mechanism during fine-tuning to focus on neighboring patterns. By multi-scale token propagation, the lightweight decoder gradually upsamples point tokens with complementary skip connections from the encoder, which further promotes reconstruction from a global-to-local perspective. Extensive experiments demonstrate the state-of-the-art performance of Point-M2AE for 3D representation learning. With a frozen encoder after pre-training, Point-M2AE achieves 92.9% accuracy for linear SVM on ModelNet40, even surpassing some fully trained methods. By fine-tuning on downstream tasks, Point-M2AE achieves 86.43% accuracy on ScanObjectNN, +3.36% over the second-best, and largely benefits few-shot classification, part segmentation, and 3D object detection with the hierarchical pre-training scheme. Code is available at https://github.com/ZrrSkywalker/Point-M2AE.
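
A sketch of one way to realize the consistent multi-scale masking described above: sample the visible set once at the coarsest scale, then let each finer-scale token inherit visibility from its nearest coarse token, keeping visible regions spatially aligned across the pyramid. Nearest-neighbor assignment stands in for the encoder's actual downsampling correspondence.

```python
import numpy as np

def multi_scale_visible(coarse_xyz, fine_xyz, mask_ratio=0.8, seed=0):
    """Return visibility masks for the coarse and fine scales."""
    rng = np.random.default_rng(seed)
    n = len(coarse_xyz)
    visible_coarse = np.zeros(n, dtype=bool)
    visible_coarse[rng.choice(n, int(n * (1 - mask_ratio)), replace=False)] = True
    # Each fine point inherits visibility from its nearest coarse point.
    d = np.linalg.norm(fine_xyz[:, None] - coarse_xyz[None, :], axis=-1)
    return visible_coarse, visible_coarse[d.argmin(axis=1)]

vc, vf = multi_scale_visible(np.random.rand(64, 3), np.random.rand(512, 3))
print(vc.sum(), vf.sum())   # visible token counts at each scale
```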

ICRA Conference 2020 · Conference Paper

A Continuum Manipulator with Closed-form Inverse Kinematics and Independently Tunable Stiffness

  • Bin Zhao
  • Lingyun Zeng
  • Baibo Wu
  • Kai Xu 0001

Continuum manipulators can accomplish various tasks in confined spaces, benefiting from their compliant structures and improved dexterity. Confined and unstructured spaces may require both enhanced stiffness of a continuum manipulator for precision and payload, and compliance for safe interaction. Thus, studies have been consistently dedicated to designing continuum or articulated manipulators with tunable stiffness to adapt to different operating conditions. This paper presents a continuum manipulator with independently tunable stiffness, where the stiffness variation does not affect the movement of the manipulator's end-effector. Moreover, the proposed continuum manipulator is found to have analytical inverse kinematics. The design concept, analytical kinematics, system construction, and experimental characterizations are presented. The results show that the manipulator's stiffness can be increased up to 3.61 times its minimal value, demonstrating the effectiveness of the proposed idea.
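
The paper's specific structure is what admits its analytical inverse kinematics; as a generic point of reference, the widely used single-segment constant-curvature model also has a closed-form inverse, sketched below (not the paper's derivation).

```python
from math import atan2, hypot

def constant_curvature_ik(x, y, z):
    """Recover bending plane phi, bending angle theta, and arc length L
    from a reachable tip position (x, y, z) of one constant-curvature segment."""
    phi = atan2(y, x)                    # bending plane angle
    r = hypot(x, y)                      # in-plane radial offset
    kappa = 2.0 * r / (r * r + z * z)    # arc curvature (standard result)
    theta = atan2(z * kappa, 1.0 - kappa * r) if kappa > 0 else 0.0
    length = z if kappa == 0.0 else theta / kappa
    return phi, theta, length

# A quarter-circle arc of radius 1: tip (1, 0, 1) -> theta = pi/2, L = pi/2.
print(constant_curvature_ik(1.0, 0.0, 1.0))
```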

IJCAI Conference 2019 · Conference Paper

Travel Time Estimation without Road Networks: An Urban Morphological Layout Representation Approach

  • Wuwei Lan
  • Yanyan Xu
  • Bin Zhao

Travel time estimation is a crucial task for not only personal travel scheduling but also city planning. Previous methods focus on modeling road segments or sub-paths and then summing up for a final prediction; these have recently been replaced by deep neural models with end-to-end training. Usually, these methods are based on explicit feature representations, including spatio-temporal features, traffic states, etc. Here, we argue that the local traffic condition is closely tied to the land use and built environment, i.e., metro stations, arterial roads, intersections, commercial areas, residential areas, etc., yet the relation is time-varying and too complicated to model explicitly and efficiently. Thus, this paper proposes an end-to-end multi-task deep neural model, named Deep Image to Time (DeepI2T), to learn the travel time mainly from built environment images, a.k.a. morphological layout images, and shows new state-of-the-art performance on real-world datasets in two cities. Moreover, our model is designed to tackle both path-aware and path-blind scenarios in the testing phase. This work opens up new opportunities for using publicly available morphological layout images as a substantial source of information in multiple geography-related smart city applications.

IROS Conference 2018 · Conference Paper

Continuum Manipulator with Redundant Backbones and Constrained Bending Curvature for Continuously Variable Stiffness

  • Bin Zhao
  • Weihao Zhang
  • Zhaoyu Zhang
  • Xiangyang Zhu
  • Kai Xu 0001

Snake-like manipulators can navigate and perform manipulation in confined spaces. Their recent implementations in surgical robots have attracted considerable attention. These slender manipulators usually possess either a hyper-redundant articulated vertebrate structure or a continuum one. Primary design considerations usually converge to a balance between proper workspace and acceptable stiffness. Efforts have hence been constantly made to achieve higher or adjustable stiffness for a manipulator to widen its applications. This paper presents a simple continuum manipulator design with variable stiffness based on redundantly arranged elastic backbones and continuously constrained bending curvature. The design concepts, kinematics, a preliminary formulation for stiffness adjustment, system construction, and experimental characterizations are elaborated. The results show that the manipulator's stiffness can be increased up to 4.71 times the value without the curvature-constraining rod, indicating the efficacy of the proposed idea.

IJCAI Conference 2018 · Conference Paper

Video Captioning with Tube Features

  • Bin Zhao
  • Xuelong Li
  • Xiaoqiang Lu

Visual features play an important role in the video captioning task. Since video content is mainly composed of the activities of salient objects, current approaches that focus only on global frame features while paying less attention to salient objects produce captions of limited quality. To tackle this problem, in this paper we design an object-aware feature for video captioning, denoted as the tube feature. First, Faster R-CNN is employed to extract object regions in frames, and a tube generation method is developed to connect regions from different frames that belong to the same object. After that, an encoder-decoder architecture is constructed for video caption generation. Specifically, the encoder is a bi-directional LSTM, which is utilized to capture the dynamic information of each tube. The decoder is a single LSTM extended with an attention model, which enables our approach to adaptively attend to the most correlated tubes when generating the caption. We evaluate our approach on two benchmark datasets: MSVD and Charades. The experimental results demonstrate the effectiveness of the tube feature in the video captioning task.
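
A sketch of tube generation by greedy IoU linking, one standard way to connect per-frame detections of the same object across frames; the paper's exact linking rule may differ.

```python
def iou(a, b):
    """Intersection-over-union of two boxes given as (x1, y1, x2, y2)."""
    ix1, iy1 = max(a[0], b[0]), max(a[1], b[1])
    ix2, iy2 = min(a[2], b[2]), min(a[3], b[3])
    inter = max(0.0, ix2 - ix1) * max(0.0, iy2 - iy1)
    area = lambda r: (r[2] - r[0]) * (r[3] - r[1])
    return inter / (area(a) + area(b) - inter + 1e-9)

def link_tubes(frames, thr=0.3):
    """frames: list of per-frame box lists -> list of tubes (lists of boxes)."""
    tubes = [[b] for b in frames[0]]
    for boxes in frames[1:]:
        used = set()
        for tube in tubes:
            # Extend each tube with the best-overlapping unused box.
            cands = [(iou(tube[-1], b), j) for j, b in enumerate(boxes) if j not in used]
            if cands:
                score, j = max(cands)
                if score >= thr:
                    tube.append(boxes[j]); used.add(j)
        tubes += [[b] for j, b in enumerate(boxes) if j not in used]  # start new tubes
    return tubes

print(len(link_tubes([[(0, 0, 10, 10)], [(1, 1, 11, 11)], [(2, 2, 12, 12)]])))  # 1
```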

IJCAI Conference 2017 · Conference Paper

MAM-RNN: Multi-level Attention Model Based RNN for Video Captioning

  • Xuelong Li
  • Bin Zhao
  • Xiaoqiang Lu

Visual information is quite important for the task of video captioning. However, videos contain a lot of uncorrelated content, which may interfere with generating a correct caption. We therefore attempt to exploit the visual features that are most correlated with the caption. In this paper, a Multi-level Attention Model based Recurrent Neural Network (MAM-RNN) is proposed, where the MAM is utilized to encode the visual features and the RNN works as the decoder to generate the video caption. During generation, the proposed approach is able to adaptively attend to the salient regions in each frame and to the frames correlated with the caption. Experimental results on two benchmark datasets, i.e., MSVD and Charades, show the excellent performance of the proposed approach.
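
For orientation, a generic soft-attention level (over regions within a frame, or over frames) takes the standard form below; the paper stacks two such levels, and its exact parameterization may differ.

```latex
% Bahdanau-style soft attention over features v_i given decoder state h_{t-1}.
e_i = w^\top \tanh(W_h h_{t-1} + W_v v_i), \qquad
\alpha_i = \frac{\exp(e_i)}{\sum_j \exp(e_j)}, \qquad
c_t = \sum_i \alpha_i v_i
```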

NeurIPS Conference 2011 · Conference Paper

Large-Scale Category Structure Aware Image Categorization

  • Bin Zhao
  • Fei Li
  • Eric Xing

Most previous research on image categorization has focused on medium-scale datasets, while large-scale image categorization with millions of images from thousands of categories remains a challenge. With the emergence of structured large-scale datasets such as ImageNet, rich information about the conceptual relationships between images, such as a tree hierarchy among various image categories, becomes available. As human cognition of the complex visual world benefits from underlying semantic relationships between object classes, we believe a machine learning system can and should leverage such information as well for better performance. In this paper, we employ such semantic relatedness among image categories for large-scale image categorization. Specifically, a category hierarchy is utilized to properly define the loss function and select a common set of features for related categories. An efficient optimization method based on proximal approximation and an accelerated parallel gradient method is introduced. Experimental results on a subset of ImageNet containing 1.2 million images from 1000 categories demonstrate the effectiveness and promise of our proposed approach.