Arrow Research search

Author name cluster

Junbo Zhang

Possible papers associated with this exact author name in Arrow. This page groups case-insensitive exact name matches and is not a full identity disambiguation profile.

22 papers
2 author rows

Possible papers

22

JBHI Journal 2026 Journal Article

MsGA: Gestational Age Estimation with Multi-plane Unified Measurements Driven by Anatomic Segmentation

  • Mingjun Huang
  • Junbo Zhang
  • Wei Hu
  • Chao Sun
  • Xiantao Cai
  • Bo Du

An accurate estimation of gestational age is critical for prenatal care and clinical decision-making. Existing ultrasound-based gestational age estimation methods are limited by the insufficient information representation capacity of conventional medical segmentation models, noise interference in ultrasound images, and inter-observer variability in traditional geometry-based measurement methods. To address these challenges, we propose the MsGA model to estimate gestational age with multi-plane unified measurements driven by anatomic segmentation. In the anatomic segmentation stage, a lightweight and high-performance LGF-UNet module is proposed, which utilizes the Deep Patch Embedding module to expand the receptive field, the Local-Global Fusion Transformer block to enhance local-global feature fusion, and the Focusing Attention Bottleneck module to suppress ultrasound noise via an adaptive threshold. In the measurement stage, a Point Regression module is introduced to refine biometric landmark localization. Furthermore, we create a fully annotated ultrasound plane dataset for the estimation of gestational age across various gestational stages. Extensive experiments on the dataset have demonstrated the effectiveness of the whole model and each module. Our MsGA model is superior to existing models with fewer parameters and achieves state-of-the-art performance on the Gestational Age Estimation task.

AAAI Conference 2025 Conference Paper

AirRadar: Inferring Nationwide Air Quality in China with Deep Neural Networks

  • Qiongyan Wang
  • Yutong Xia
  • Siru Zhong
  • Weichuang Li
  • Yuankai Wu
  • Shifen Cheng
  • Junbo Zhang
  • Yu Zheng

Monitoring real-time air quality is essential for safeguarding public health and fostering social progress. However, the widespread deployment of air quality monitoring stations is constrained by their significant costs. To address this limitation, we introduce AirRadar, a deep neural network designed to accurately infer real-time air quality in locations lacking monitoring stations by utilizing data from existing ones. By leveraging learnable mask tokens, AirRadar reconstructs air quality features in unmonitored regions. Specifically, it operates in two stages: first capturing spatial correlations and then adjusting for distribution shifts. We validate AirRadar’s efficacy using a year-long dataset from 1,085 monitoring stations across China, demonstrating its superiority over multiple baselines, even with varying degrees of unobserved data.

IROS Conference 2025 Conference Paper

Benchmarking Long-Horizon Mobile Manipulation in Multi-Room Dynamic Environments

  • Junbo Zhang
  • Kaisheng Ma

Long-horizon reasoning and task execution are crucial for complex mobile manipulation tasks in household environments. Existing benchmarks and methods primarily focus on single-room or single-object mobile manipulation scenarios, limiting the scope of long-horizon planning and scene-level understanding. To address this gap, we introduce a novel benchmark for long-horizon mobile manipulation in multi-room household environments. Our task requires agents to follow a sequence of language instructions, each directing the movement of specific objects across receptacles and rooms. In this task, we investigate the role of long-term memory by constructing a hierarchical scene graph that captures the relationships between objects, furniture, and rooms. This scene graph-based memory is dynamically updated as the agent explores the environment, which effectively aligns the scene information with the targets and environmental context specified in the language instructions. Additionally, we benchmark the proposed task in dynamic environments where objects can be relocated during task execution, simulating real-world scenarios. Our results demonstrate that the scene graph-based memory significantly improves the agent’s performance in long-horizon mobile manipulation tasks. Moreover, dynamically updating the state of objects within the scene graph enables the agent to better adapt to dynamic conditions.

IJCAI Conference 2025 Conference Paper

Non-collective Calibrating Strategy for Time Series Forecasting

  • Bin Wang
  • Yongqi Han
  • Minbo Ma
  • Tianrui Li
  • Junbo Zhang
  • Feng Hong
  • Yanwei Yu

Deep learning-based approaches have demonstrated significant advancements in time series forecasting. Despite these ongoing developments, the complex dynamics of time series make it challenging to establish the rule of thumb for designing the golden model architecture. In this study, we argue that refining existing advanced models through a universal calibrating strategy can deliver substantial benefits with minimal resource costs, as opposed to elaborating and training a new model from scratch. We first identify a multi-target learning conflict in the calibrating process, which arises when optimizing variables across time steps, leading to the underutilization of the model's learning capabilities. To address this issue, we propose an innovative calibrating strategy called Socket+Plug (SoP). This approach retains an exclusive optimizer and early-stopping monitor for each predicted target within each Plug while keeping the fully trained Socket backbone frozen. The model-agnostic nature of SoP allows it to directly calibrate the performance of any trained deep forecasting models, regardless of their specific architectures. Extensive experiments on various time series benchmarks and a spatio-temporal meteorological ERA5 dataset demonstrate the effectiveness of SoP, achieving up to a 22% improvement even when employing a simple MLP as the Plug (highlighted in Figure 1).

TIST Journal 2024 Journal Article

Exploring the Distributed Knowledge Congruence in Proxy-data-free Federated Distillation

  • Zhiyuan Wu
  • Sheng Sun
  • Yuwei Wang
  • Min Liu
  • Quyang Pan
  • Junbo Zhang
  • Zeju Li
  • Qingxiang Liu

Federated learning (FL) is a privacy-preserving machine learning paradigm in which the server periodically aggregates local model parameters from clients without assembling their private data. Constrained communication and personalization requirements pose severe challenges to FL. Federated distillation (FD) is proposed to simultaneously address the above two problems, which exchanges knowledge between the server and clients, supporting heterogeneous local models while significantly reducing communication overhead. However, most existing FD methods require a proxy dataset, which is often unavailable in reality. A few recent proxy-data-free FD approaches can eliminate the need for additional public data, but suffer from remarkable discrepancy among local knowledge due to client-side model heterogeneity, leading to ambiguous representation on the server and inevitable accuracy degradation. To tackle this issue, we propose a proxy-data-free FD algorithm based on distributed knowledge congruence (FedDKC). FedDKC leverages well-designed refinement strategies to narrow local knowledge differences into an acceptable upper bound, so as to mitigate the negative effects of knowledge incongruence. Specifically, from perspectives of peak probability and Shannon entropy of local knowledge, we design kernel-based knowledge refinement (KKR) and searching-based knowledge refinement (SKR) respectively, and theoretically guarantee that the refined-local knowledge can satisfy an approximately-similar distribution and be regarded as congruent. Extensive experiments conducted on three common datasets demonstrate that our proposed FedDKC significantly outperforms the state-of-the-art on various heterogeneous settings while evidently improving the convergence speed.

IROS Conference 2024 Conference Paper

MG-VLN: Benchmarking Multi-Goal and Long-Horizon Vision-Language Navigation with Language Enhanced Memory Map

  • Junbo Zhang
  • Kaisheng Ma

Vision-Language Navigation (VLN) with high-level language instructions is a crucial task in robotics. Existing VLN benchmarks, such as the REVERIE challenge, which has single-goal instructions and limited navigation steps, do not fully encapsulate the complexity of real-world navigation, which often requires multi-objective and long-horizon navigation. To address this, we propose a new benchmark task: Multi-Goal and Long-Horizon Vision-Language Navigation (MG-VLN), extending the REVERIE benchmark to encompass multi-objective and long-horizon navigation scenarios with sequences of high-level instructions. This task aims to provide a simulation benchmark to guide the design of lifelong and long-horizon navigation robots. To initiate the exploration in this newly proposed task, we first investigate the role of long-term memory in improving navigation performance by leveraging environmental information gathered during previous sub-goals. Additionally, we examine the types of knowledge that most effectively enrich this long-term memory. Specifically, we integrate the visual contents with linguistic knowledge such as object categories, visual captions, and object attributes/relationships. Our findings indicate that: 1) the explicit long-term memory map significantly enhances navigation performance in multi-goal and long-horizon scenarios; 2) incorporating object attribute and relationship information is the most advantageous for aligning environmental cues with high-level instructions.

AAAI Conference 2023 Conference Paper

AirFormer: Predicting Nationwide Air Quality in China with Transformers

  • Yuxuan Liang
  • Yutong Xia
  • Songyu Ke
  • Yiwei Wang
  • Qingsong Wen
  • Junbo Zhang
  • Yu Zheng
  • Roger Zimmermann

Air pollution is a crucial issue affecting human health and livelihoods, as well as one of the barriers to economic growth. Forecasting air quality has become an increasingly important endeavor with significant social impacts, especially in emerging countries. In this paper, we present a novel Transformer termed AirFormer to predict nationwide air quality in China, with an unprecedented fine spatial granularity covering thousands of locations. AirFormer decouples the learning process into two stages: 1) a bottom-up deterministic stage that contains two new types of self-attention mechanisms to efficiently learn spatio-temporal representations; 2) a top-down stochastic stage with latent variables to capture the intrinsic uncertainty of air quality data. We evaluate AirFormer with 4-year data from 1,085 stations in Chinese Mainland. Compared to prior models, AirFormer reduces prediction errors by 5%∼8% on 72-hour future predictions. Our source code is available at https://github.com/yoshall/airformer.

ICLR Conference 2023 Conference Paper

Autoencoders as Cross-Modal Teachers: Can Pretrained 2D Image Transformers Help 3D Representation Learning?

  • Runpei Dong
  • Zekun Qi
  • Linfeng Zhang 0001
  • Junbo Zhang
  • Jianjian Sun
  • Zheng Ge
  • Li Yi 0001
  • Kaisheng Ma

The success of deep learning heavily relies on large-scale data with comprehensive labels, which is more expensive and time-consuming to fetch in 3D compared to 2D images or natural languages. This promotes the potential of utilizing models pretrained with data other than 3D as teachers for cross-modal knowledge transferring. In this paper, we revisit masked modeling in a unified fashion of knowledge distillation, and we show that foundational Transformers pretrained with 2D images or natural languages can help self-supervised 3D representation learning through training Autoencoders as Cross-Modal Teachers (ACT). The pretrained Transformers are transferred as cross-modal 3D teachers using discrete variational autoencoding self-supervision, during which the Transformers are frozen with prompt tuning for better knowledge inheritance. The latent features encoded by the 3D teachers are used as the target of masked point modeling, wherein the dark knowledge is distilled to the 3D Transformer students as foundational geometry understanding. Our ACT pretrained 3D learner achieves state-of-the-art generalization capacity across various downstream benchmarks, e.g., 88.21% overall accuracy on ScanObjectNN. Codes have been released at https://github.com/RunpeiDong/ACT.

AAAI Conference 2023 Conference Paper

AutoSTL: Automated Spatio-Temporal Multi-Task Learning

  • Zijian Zhang
  • Xiangyu Zhao
  • Hao Miao
  • Chunxu Zhang
  • Hongwei Zhao
  • Junbo Zhang

Spatio-temporal prediction plays a critical role in smart city construction. Jointly modeling multiple spatio-temporal tasks can further promote an intelligent city life by integrating their inseparable relationship. However, existing studies fail to address this joint learning problem well, as they generally solve tasks individually or with a fixed task combination. The challenges lie in the tangled relation between different properties, the demand for supporting flexible combinations of tasks, and the complex spatio-temporal dependency. To cope with the problems above, we propose an Automated Spatio-Temporal multi-task Learning (AutoSTL) method to handle multiple spatio-temporal tasks jointly. Firstly, we propose a scalable architecture consisting of advanced spatio-temporal operations to exploit the complicated dependency. Shared modules and a feature fusion mechanism are incorporated to further capture the intrinsic relationship between tasks. Furthermore, our model automatically allocates the operations and fusion weight. Extensive experiments on benchmark datasets verified that our model achieves state-of-the-art performance. To the best of our knowledge, AutoSTL is the first automated spatio-temporal multi-task learning method.

AAAI Conference 2023 Conference Paper

Language-Assisted 3D Feature Learning for Semantic Scene Understanding

  • Junbo Zhang
  • Guofan Fan
  • Guanghan Wang
  • Zhengyuan Su
  • Kaisheng Ma
  • Li Yi

Learning descriptive 3D features is crucial for understanding 3D scenes with diverse objects and complex structures. However, it is usually unknown whether important geometric attributes and scene context obtain enough emphasis in an end-to-end trained 3D scene understanding network. To guide 3D feature learning toward important geometric attributes and scene context, we explore the help of textual scene descriptions. Given some free-form descriptions paired with 3D scenes, we extract the knowledge regarding the object relationships and object attributes. We then inject the knowledge to 3D feature learning through three classification-based auxiliary tasks. This language-assisted training can be combined with modern object detection and instance segmentation methods to promote 3D semantic scene understanding, especially in a label-deficient regime. Moreover, the 3D feature learned with language assistance is better aligned with the language features, which can benefit various 3D-language multimodal tasks. Experiments on several benchmarks of 3D-only and 3D-language tasks demonstrate the effectiveness of our language-assisted 3D feature learning. Code is available at https://github.com/Asterisci/Language-Assisted-3D.

AAAI Conference 2023 Conference Paper

Spatio-Temporal Self-Supervised Learning for Traffic Flow Prediction

  • Jiahao Ji
  • Jingyuan Wang
  • Chao Huang
  • Junjie Wu
  • Boren Xu
  • Zhenhe Wu
  • Junbo Zhang
  • Yu Zheng

Robust prediction of citywide traffic flows at different time periods plays a crucial role in intelligent transportation systems. While previous work has made great efforts to model spatio-temporal correlations, existing methods still suffer from two key limitations: i) Most models collectively predict all regions' flows without accounting for spatial heterogeneity, i.e., different regions may have skewed traffic flow distributions. ii) These models fail to capture the temporal heterogeneity induced by time-varying traffic patterns, as they typically model temporal correlations with a shared parameterized space for all time periods. To tackle these challenges, we propose a novel Spatio-Temporal Self-Supervised Learning (ST-SSL) traffic prediction framework which enhances the traffic pattern representations to be reflective of both spatial and temporal heterogeneity, with auxiliary self-supervised learning paradigms. Specifically, our ST-SSL is built over an integrated module with temporal and spatial convolutions for encoding the information across space and time. To achieve the adaptive spatio-temporal self-supervised learning, our ST-SSL first performs the adaptive augmentation over the traffic flow graph data at both attribute- and structure-levels. On top of the augmented traffic graph, two SSL auxiliary tasks are constructed to supplement the main traffic prediction task with spatial and temporal heterogeneity-aware augmentation. Experiments on four benchmark datasets demonstrate that ST-SSL consistently outperforms various state-of-the-art baselines. Since spatio-temporal heterogeneity widely exists in practical datasets, the proposed framework may also cast light on other spatial-temporal applications. Model implementation is available at https://github.com/Echo-Ji/ST-SSL.

AAAI Conference 2023 Conference Paper

Win-Win: A Privacy-Preserving Federated Framework for Dual-Target Cross-Domain Recommendation

  • Gaode Chen
  • Xinghua Zhang
  • Yijun Su
  • Yantong Lai
  • Ji Xiang
  • Junbo Zhang
  • Yu Zheng

Cross-domain recommendation (CDR) aims to alleviate the data sparsity by transferring knowledge from an informative source domain to the target domain, which inevitably raises stern challenges to data privacy and transferability during the transfer process. A small number of recent CDR works have investigated privacy protection, but they still fall short of satisfying practical requirements (e.g., limited privacy-preserving ability) and of preventing the potential risk of negative transfer. To address the above challenging problems, we propose a novel and unified privacy-preserving federated framework for dual-target CDR, namely P2FCDR. We design P2FCDR as a peer-to-peer federated network architecture to ensure the local data storage and privacy protection of business partners. Specifically, for the special knowledge transfer process in CDR under federated settings, we initialize an optimizable orthogonal mapping matrix to learn the embedding transformation across domains and adopt the local differential privacy technique on the transformed embedding before exchanging across domains, which provides more reliable privacy protection. Furthermore, we exploit the similarity between in-domain and cross-domain embedding, and develop a gated selecting vector to refine the information fusion for more accurate dual transfer. Extensive experiments on three real-world datasets demonstrate that P2FCDR significantly outperforms the state-of-the-art methods and effectively protects data privacy.

IS Journal 2021 Journal Article

Federated Digital Gateway: Methodologies, Tools, and Applications

  • Yang Liu
  • Ruolan Wang
  • Shishuai Du
  • Junbo Zhang
  • Yu Zheng

Federated machine learning (FML) is a new machine learning paradigm that is focused on training distributed models, where data are scattered in different places known as data silos, only necessary modeling information (not raw data) is exchanged, and data privacy and security are protected during the modeling. This research area has been growing fast during the past years, but the vision of making it a practical solution is still not fulfilled. Motivated by this, here we introduce an intelligent architecture, termed Federated Digital Gateway. It is designed to help algorithm engineers to easily deploy FML methods for real-life tasks. It provides different modules such as secure communication tools, database interface, authentication center, account system, and user interface. This architecture has been shown to function smoothly in two real-world applications. Overall, the federated digital gateway is practical and deployable for applying federated learning to solve real-life tasks.

AAAI Conference 2021 Conference Paper

Traffic Flow Forecasting with Spatial-Temporal Graph Diffusion Network

  • Xiyue Zhang
  • Chao Huang
  • Yong Xu
  • Lianghao Xia
  • Peng Dai
  • Liefeng Bo
  • Junbo Zhang
  • Yu Zheng

Accurate forecasting of citywide traffic flow has been playing a critical role in a variety of spatial-temporal mining applications, such as intelligent traffic control and public risk assessment. While previous work has made significant efforts to learn traffic temporal dynamics and spatial dependencies, two key limitations exist in current models. First, only the neighboring spatial correlations among adjacent regions are considered in most existing methods, and the global inter-region dependency is ignored. Additionally, these methods fail to encode the complex traffic transition regularities that are time-dependent and multi-resolution in nature. To tackle these challenges, we develop a new traffic prediction framework, the Spatial-Temporal Graph Diffusion Network (ST-GDN). In particular, ST-GDN is a hierarchically structured graph neural architecture which learns not only the local region-wise geographical dependencies, but also the spatial semantics from a global perspective. Furthermore, a multi-scale attention network is developed to empower ST-GDN with the capability of capturing multi-level temporal dynamics. Experiments on several real-life traffic datasets demonstrate that ST-GDN outperforms different types of state-of-the-art baselines. Source code is available at https://github.com/jill001/ST-GDN.

IJCAI Conference 2018 Conference Paper

GeoMAN: Multi-level Attention Networks for Geo-sensory Time Series Prediction

  • Yuxuan Liang
  • Songyu Ke
  • Junbo Zhang
  • Xiuwen Yi
  • Yu Zheng

Numerous sensors have been deployed in different geospatial locations to continuously and cooperatively monitor the surrounding environment, such as the air quality. These sensors generate multiple geo-sensory time series, with spatial correlations between their readings. Forecasting geo-sensory time series is of great importance yet very challenging as it is affected by many complex factors, i.e., dynamic spatio-temporal correlations and external factors. In this paper, we predict the readings of a geo-sensor over several future hours by using a multi-level attention-based recurrent neural network that considers multiple sensors' readings, meteorological data, and spatial data. More specifically, our model consists of two major parts: 1) a multi-level attention mechanism to model the dynamic spatio-temporal dependencies; 2) a general fusion module to incorporate the external factors from different domains. Experiments on two types of real-world datasets, viz., air quality data and water quality data, demonstrate that our method outperforms nine baseline methods.

AAAI Conference 2018 Conference Paper

When Will You Arrive? Estimating Travel Time Based on Deep Neural Networks

  • Dong Wang
  • Junbo Zhang
  • Wei Cao
  • Jian Li
  • Yu Zheng

Estimating the travel time of any path (denoted by a sequence of connected road segments) in a city is of great importance to traffic monitoring, route planning, ridesharing, taxi/Uber dispatching, etc. However, it is a very challenging problem, affected by diverse complex factors, including spatial correlations, temporal dependencies, and external conditions (e.g., weather, traffic lights). Prior work usually focuses on estimating the travel times of individual road segments or sub-paths and then summing up these times, which leads to an inaccurate estimation because such approaches do not consider road intersections/traffic lights, and local errors may accumulate. To address these issues, we propose an end-to-end Deep learning framework for Travel Time Estimation (called DeepTTE) that estimates the travel time of the whole path directly. More specifically, we present a geo-convolution operation by integrating the geographic information into the classical convolution, capable of capturing spatial correlations. By stacking a recurrent unit on the geo-convolution layer, our DeepTTE can capture the temporal dependencies as well. A multi-task learning component is given on top of DeepTTE, which learns to estimate the travel time of both the entire path and each local path simultaneously during the training phase. Extensive experiments on two trajectory datasets show our DeepTTE significantly outperforms the state-of-the-art methods.
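The geo-convolution idea in this abstract (embedding raw coordinates, then convolving along the path so each output summarizes a local sub-path) can be illustrated with a minimal sketch. All shapes, layer sizes, and weight names below are hypothetical stand-ins, not DeepTTE's actual implementation.

```python
import numpy as np

# Hedged sketch of a "geo-convolution": project each (lat, lng) point to a
# feature vector, then run a valid 1D convolution along the path dimension.
rng = np.random.default_rng(1)
path = rng.random((20, 2))            # 20 (lat, lng) points along one trip (hypothetical data)

W_embed = rng.random((2, 16))         # project 2-d coordinates to 16-d features
feats = np.tanh(path @ W_embed)       # (20, 16)

kernel_size, out_ch = 3, 8
W_conv = rng.random((kernel_size, 16, out_ch))

# Valid 1D convolution: each output row mixes a 3-point sub-path.
out = np.stack([
    np.tanh(sum(feats[i + k] @ W_conv[k] for k in range(kernel_size)))
    for i in range(len(feats) - kernel_size + 1)
])
print(out.shape)  # (18, 8)
```

In the paper's pipeline a recurrent unit would then consume these per-sub-path features to capture temporal dependencies along the trip.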

AAAI Conference 2017 Conference Paper

Deep Spatio-Temporal Residual Networks for Citywide Crowd Flows Prediction

  • Junbo Zhang
  • Yu Zheng
  • Dekang Qi

Forecasting the flow of crowds is of great importance to traffic management and public safety, and very challenging as it is affected by many complex factors, such as inter-region traffic, events, and weather. We propose a deep-learning-based approach, called ST-ResNet, to collectively forecast the inflow and outflow of crowds in each and every region of a city. We design an end-to-end structure of ST-ResNet based on unique properties of spatio-temporal data. More specifically, we employ the residual neural network framework to model the temporal closeness, period, and trend properties of crowd traffic. For each property, we design a branch of residual convolutional units, each of which models the spatial properties of crowd traffic. ST-ResNet learns to dynamically aggregate the output of the three residual neural networks based on data, assigning different weights to different branches and regions. The aggregation is further combined with external factors, such as weather and day of the week, to predict the final traffic of crowds in each and every region. Experiments on two types of crowd flows in Beijing and New York City (NYC) demonstrate that the proposed ST-ResNet outperforms six well-known methods.
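The per-branch, per-region aggregation this abstract describes can be sketched in a few lines: each of the three temporal branches (closeness, period, trend) gets its own learnable weight map, so every region blends the branches differently. The shapes and names below are illustrative assumptions, not the paper's exact architecture.

```python
import numpy as np

# Hedged sketch of ST-ResNet's branch fusion over a city grid.
rng = np.random.default_rng(0)
H, W = 32, 32                       # hypothetical city grid size

closeness = rng.random((H, W))      # output of the "closeness" branch
period    = rng.random((H, W))      # output of the "period" branch
trend     = rng.random((H, W))      # output of the "trend" branch

# One learnable weight map per branch: element-wise, so each region
# gets its own mix of the three temporal properties.
W_c, W_p, W_t = (rng.random((H, W)) for _ in range(3))

external = rng.random((H, W))       # e.g. weather / day-of-week features projected to the grid

fused = W_c * closeness + W_p * period + W_t * trend
prediction = np.tanh(fused + external)  # tanh keeps outputs bounded, matching min-max scaled flows
print(prediction.shape)  # (32, 32)
```

Training would learn `W_c`, `W_p`, `W_t` jointly with the branches; here they are random placeholders just to show the data flow.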

IJCAI Conference 2016 Conference Paper

ST-MVL: Filling Missing Values in Geo-Sensory Time Series Data

  • Xiuwen Yi
  • Yu Zheng
  • Junbo Zhang
  • Tianrui Li

Many sensors have been deployed in the physical world, generating massive geo-tagged time series data. In reality, we usually lose readings of sensors at some unexpected moments because of sensor or communication errors. Those missing readings not only affect real-time monitoring but also compromise the performance of further data analysis. In this paper, we propose a spatio-temporal multi-view-based learning (ST-MVL) method to collectively fill missing readings in a collection of geo-sensory time series data, considering 1) the temporal correlation between readings at different timestamps in the same series and 2) the spatial correlation between different time series. Our method combines empirical statistic models, consisting of Inverse Distance Weighting and Simple Exponential Smoothing, with data-driven algorithms, comprised of User-based and Item-based Collaborative Filtering. The former models handle the general missing cases based on empirical assumptions derived from history data over a long period, standing for two global views from a spatial and temporal perspective respectively. The latter algorithms deal with special cases where empirical assumptions may not hold, based on recent contexts of data, denoting two local views from a spatial and temporal perspective respectively. The predictions of the four views are aggregated to a final value in a multi-view learning algorithm. We evaluate our method based on Beijing air quality and meteorological data, finding our model's advantages beyond ten baseline approaches.
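Two of the four views this abstract names have standard closed forms, which a short sketch can make concrete: Inverse Distance Weighting fills a missing reading from other sensors' simultaneous readings, and Simple Exponential Smoothing fills it from the same sensor's recent history. The data values, parameters, and the uniform aggregation weights below are illustrative assumptions; ST-MVL learns its aggregation weights rather than averaging.

```python
import numpy as np

def idw(readings, distances, power=2.0):
    """Spatial global view: weight neighbor readings by 1 / distance**power."""
    w = 1.0 / np.power(distances, power)
    return float(np.sum(w * readings) / np.sum(w))

def ses(history, alpha=0.4):
    """Temporal global view: exponentially down-weight older readings."""
    weights = np.array([alpha * (1 - alpha) ** k for k in range(len(history))])
    return float(np.sum(weights * history[::-1]) / np.sum(weights))  # newest first

# A missing PM2.5 reading estimated from each view (hypothetical numbers),
# then combined; uniform 0.5/0.5 weights stand in for the learned ones.
spatial  = idw(np.array([80.0, 95.0, 110.0]), np.array([1.0, 2.0, 4.0]))
temporal = ses(np.array([90.0, 92.0, 88.0, 91.0]))
estimate = 0.5 * spatial + 0.5 * temporal
```

The two collaborative-filtering views would be built analogously from recent spatial and temporal contexts before the final multi-view aggregation.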