Arrow Research search

Author name cluster

Tong Wang

Possible papers associated with this exact author name in Arrow. This page groups case-insensitive exact name matches and is not a full identity disambiguation profile.

31 papers
2 author rows

Possible papers

31

AAAI Conference 2026 Conference Paper

Think Before You Segment: An Object-aware Reasoning Agent for Referring Audio-Visual Segmentation

  • Jinxing Zhou
  • Yanghao Zhou
  • Mingfei Han
  • Tong Wang
  • Xiaojun Chang
  • Hisham Cholakkal
  • Rao Muhammad Anwer

Referring Audio-Visual Segmentation (Ref-AVS) aims to segment target objects in audible videos based on given reference expressions. Prior works typically rely on learning latent embeddings via multimodal fusion to prompt a tunable SAM/SAM2 decoder for segmentation, which requires strong pixel-level supervision and lacks interpretability. From a novel perspective of explicit reference understanding, we propose TGS-Agent, which decomposes the task into a Think-Ground-Segment process, mimicking the human reasoning procedure by first identifying the referred object through multimodal analysis, followed by coarse-grained grounding and precise segmentation. To this end, we first propose Ref-Thinker, a multimodal language model capable of reasoning over textual, visual, and auditory cues. We construct an instruction-tuning dataset with explicit object-aware think-answer chains for Ref-Thinker fine-tuning. The object description inferred by Ref-Thinker is used as an explicit prompt for Grounding-DINO and SAM2, which perform grounding and segmentation without relying on pixel-level supervision. Additionally, we introduce R2-AVSBench, a new benchmark with linguistically diverse and reasoning-intensive references for better evaluating model generalization. Our approach achieves state-of-the-art results on both standard Ref-AVSBench and proposed R2-AVSBench.

ICML Conference 2025 Conference Paper

Clients Collaborate: Flexible Differentially Private Federated Learning with Guaranteed Improvement of Utility-Privacy Trade-off

  • Yuecheng Li
  • Lele Fu
  • Tong Wang
  • Jian Lou 0001
  • Bin Chen 0011
  • Lei Yang 0030
  • Jian Shen
  • Zibin Zheng

To defend against privacy leakage of user data, differential privacy is widely used in federated learning, but it is not free. The addition of noise randomly disrupts the semantic integrity of the model, and this disturbance accumulates with increased communication rounds. In this paper, we introduce a novel federated learning framework with rigorous privacy guarantees, named FedCEO, designed to strike a trade-off between model utility and user privacy by letting clients "Collaborate with Each Other". Specifically, we perform efficient tensor low-rank proximal optimization on stacked local model parameters at the server, demonstrating its capability to flexibly truncate high-frequency components in spectral space. This capability implies that our FedCEO can effectively recover the disrupted semantic information by smoothing the global semantic space for different privacy settings and continuous training processes. Moreover, we improve the SOTA utility-privacy trade-off bound by an order of $\sqrt{d}$, where $d$ is the input dimension. We illustrate our theoretical results with experiments on representative datasets and observe significant performance improvements and strict privacy guarantees under different privacy settings. The code is available at https://github.com/6lyc/FedCEO_Collaborate-with-Each-Other.
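The spectral-smoothing step described in this abstract can be illustrated with a minimal sketch. The paper applies tensor low-rank proximal optimization; the version below is only a loose stand-in that hard-truncates singular values of the stacked client-parameter matrix, and the names `smooth_client_params` and `keep_ratio` are illustrative, not from the paper.

```python
import numpy as np

def smooth_client_params(stacked, keep_ratio=0.5):
    """Truncate the smallest singular values of the stacked parameter
    matrix -- a rough stand-in for low-rank spectral smoothing."""
    U, s, Vt = np.linalg.svd(stacked, full_matrices=False)
    k = max(1, int(len(s) * keep_ratio))
    s[k:] = 0.0  # drop the small (high-frequency) spectral components
    return U @ np.diag(s) @ Vt

rng = np.random.default_rng(0)
clean = rng.normal(size=(4, 16))                          # 4 clients, 16 params each
noisy = clean + rng.normal(scale=0.5, size=clean.shape)   # DP-style noise
smoothed = smooth_client_params(noisy, keep_ratio=0.5)
```

Truncating the tail of the spectrum plays the role of removing the high-frequency disturbance that the added noise injects into the aggregated parameters.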

ICRA Conference 2025 Conference Paper

Efficient Cross-Boundary Grasping in Stacked Clutter with Single-Visual Mapping Multi-Step

  • Yudong Luo
  • Tong Wang
  • Feiyu Xie
  • Na Zhao 0008
  • Xianping Fu
  • Yantao Shen 0001

In logistics applications, the vision-based technology for grasping target objects in the air is relatively mature. However, when operating across air and water, such as grasping marine products from the water, the visual information collected by the camera is disturbed by ripples and bubbles on the water surface, resulting in low grasping efficiency. Therefore, we introduce a single-visual-mapping multi-step (SVMMS) grasping strategy to achieve cross-medium operations involving stacked objects. Specifically, we design a multifunctional integrated Deep Q-learning-based network model that extracts visual features from the scene to effectively detect stacked objects and output their hierarchical relationships. Moreover, we quantify the underlying relationship between motion logic and changes in RGB-D observations during action execution to help the robot achieve efficient and collision-free operations. Our approach also incorporates a time-series design with prioritized experience replay to globally optimize the action sequence. Additionally, we propose a novel sim2real method that combines domain randomization to address the difference in object sizes between simulation and the real world. Extensive experiments in both simulated and physical environments show that SVMMS-Grasp significantly outperforms existing methods in terms of task success rate, stability, and operational efficiency.

ICRA Conference 2025 Conference Paper

Explore the LiDAR-Camera Dynamic Adjustment Fusion for 3D Object Detection

  • Yiran Yang
  • Xu Gao
  • Tong Wang
  • Xin Hao
  • Yifeng Shi
  • Xiao Tan 0001
  • Xiaoqing Ye

Camera and LiDAR serve as informative sensors for accurate and robust autonomous driving systems. However, these sensors often exhibit heterogeneous natures, resulting in distributional modality gaps that present significant challenges for fusion. To address this, a robust fusion technique is crucial, particularly for enhancing 3D object detection. In this paper, we introduce a dynamic adjustment technology aimed at aligning modal distributions and learning effective modality representations to enhance the fusion process. Specifically, we propose a triphase domain aligning module. This module adjusts the feature distributions from both the camera and LiDAR, bringing them closer to the ground truth domain and minimizing differences. Additionally, we explore improved representation acquisition methods for dynamic fusion, which include modal interaction and specialty enhancement. Finally, we apply an adaptive learning technique that merges semantic and geometric information for dynamic instance optimization. Extensive experiments on the nuScenes dataset show competitive performance against state-of-the-art approaches. Our code will be released in the future.

JBHI Journal 2025 Journal Article

Multi-Task Adaptive Resolution Network for Lymph Node Metastasis Diagnosis From Whole Slide Images of Colorectal Cancer

  • Tong Wang
  • Su-Jin Shin
  • Mingkang Wang
  • Qi Xu
  • Guiyang Jiang
  • Fengyu Cong
  • Jeonghyun Kang
  • Hongming Xu

Automated detection of lymph node metastasis (LNM) holds great potential to alleviate the workload of doctors and reduce misinterpretations. Despite the practical successes achieved, effectively addressing the highly complex and heterogeneous tumor microenvironment remains an open and challenging problem, especially when tumor subtypes intermingle and are difficult to delineate. In this paper, we propose a multi-task adaptive resolution network, named MAR-Net, for LNM detection and subtyping in complex mixed-type cancers. Specifically, we construct a resolution-aware module to mine heterogeneous diagnostic information, which exploits the multi-scale pyramid information and adaptively combines multi-resolution structured features for comprehensive representation. Additionally, we adopt a multi-task learning approach that simultaneously addresses LNM detection and subtyping, reducing model instability during optimization and improving performance across both tasks. More importantly, to rectify the potential misclassification of tumor subtypes, we elaborately design a hierarchical subtyping refinement (HSR) algorithm that leverages a generic segmentation model informed by pathologists' prior knowledge. Evaluations have been conducted on three private and one public cancer datasets (554 WSIs, 4.8 million patches). Our experimental results demonstrate that the proposed method consistently achieves superior performance compared to the state-of-the-art methods, achieving 0.5% to 3.2% higher AUC in LNM detection and 3.8% to 4.4% higher AUC in LNM subtyping.

NeurIPS Conference 2025 Conference Paper

ProtoPairNet: Interpretable Regression through Prototypical Pair Reasoning

  • Rose Gurung
  • Ronilo Ragodos
  • Chiyu Ma
  • Tong Wang
  • Chaofan Chen

We present Prototypical Pair Network (ProtoPairNet), a novel interpretable architecture that combines deep learning with case-based reasoning to predict continuous targets. While prototype-based models have primarily addressed image classification with discrete outputs, extending these methods to continuous targets, such as regression, poses significant challenges. Existing architectures which rely heavily on one-to-one comparison with prototypes lack the directional information necessary for continuous predictions. Our method redefines the role of prototypes in such tasks by incorporating prototypical pairs into the reasoning process. Predictions are derived based on the input's relative dissimilarities to these pairs, leveraging an intuitive geometric interpretation. Our method further reduces the complexity of the reasoning process by relying on the single most relevant pair of prototypes, rather than all prototypes in the model as was done in prior works. Our model is versatile enough to be used in both vision-based regression and continuous control in reinforcement learning. Our experiments demonstrate that ProtoPairNet achieves performance on par with its black-box counterparts across these tasks. Comprehensive analyses confirm the meaningfulness of prototypical pairs and the faithfulness of our model’s interpretations, and extensive user studies highlight our model's improved interpretability over existing methods.
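One plausible reading of "relative dissimilarities to a prototype pair" is inverse-distance interpolation between the pair's target values. The sketch below, with the hypothetical function `pair_predict`, illustrates that geometric intuition only; it is not the paper's actual architecture.

```python
import numpy as np

def pair_predict(x, p1, y1, p2, y2):
    """Interpolate between a prototype pair's target values using
    the input's relative dissimilarity to each prototype."""
    d1 = np.linalg.norm(x - p1)
    d2 = np.linalg.norm(x - p2)
    if d1 + d2 == 0:
        return (y1 + y2) / 2
    # closer to p1 -> prediction pulled toward y1, and vice versa
    return (d2 * y1 + d1 * y2) / (d1 + d2)

p1, p2 = np.array([0.0, 0.0]), np.array([1.0, 1.0])
pred = pair_predict(np.array([0.5, 0.5]), p1, 10.0, p2, 20.0)
```

Because the pair carries two target values, it provides the directional information that a single prototype cannot, which is the gap the abstract highlights for regression.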

NeurIPS Conference 2025 Conference Paper

RayFusion: Ray Fusion Enhanced Collaborative Visual Perception

  • Shaohong Wang
  • Lu Bin
  • Xinyu Xiao
  • Hanzhi Zhong
  • Bowen Pang
  • Tong Wang
  • Zhiyu Xiang
  • Hangguan Shan

Collaborative visual perception methods have gained widespread attention in the autonomous driving community in recent years due to their ability to address sensor limitation problems. However, the absence of explicit depth information often makes it difficult for camera-based perception systems, e.g., 3D object detection, to generate accurate predictions. To alleviate the ambiguity in depth estimation, we propose RayFusion, a ray-based fusion method for collaborative visual perception. Using ray occupancy information from collaborators, RayFusion reduces redundancy and false positive predictions along camera rays, enhancing the detection performance of purely camera-based collaborative perception systems. Comprehensive experiments show that our method consistently outperforms existing state-of-the-art models, substantially advancing the performance of collaborative visual perception. Our code will be made publicly available.

NeurIPS Conference 2025 Conference Paper

Self-Assembling Graph Perceptrons

  • Jialong Chen
  • Tong Wang
  • Bowen Deng
  • Luonan Chen
  • Zibin Zheng
  • Chuan Chen

Inspired by the workings of biological brains, humans have designed artificial neural networks (ANNs), sparking profound advancements across various fields. However, the biological brain possesses high plasticity, enabling it to develop simple, efficient, and powerful structures to cope with complex external environments. In contrast, the superior performance of ANNs often relies on meticulously crafted architectures, which can make them vulnerable when handling complex inputs. Moreover, overparameterization often characterizes the most advanced ANNs. This paper explores the path toward building streamlined and plastic ANNs. Firstly, we introduce the Graph Perceptron (GP), which extends the most fundamental ANN, the Multi-Layer Perceptron (MLP). Subsequently, we incorporate a self-assembly mechanism on top of GP called Self-Assembling Graph Perceptron (SAGP). During training, SAGP can autonomously adjust the network's number of neurons and synapses and their connectivity. SAGP achieves comparable or even superior performance with only about 5% of the size of an MLP. We also demonstrate the SAGP's advantages in enhancing model interpretability and feature selection.

AAAI Conference 2025 Conference Paper

THESAURUS: Contrastive Graph Clustering by Swapping Fused Gromov-Wasserstein Couplings

  • Bowen Deng
  • Tong Wang
  • Lele Fu
  • Sheng Huang
  • Chuan Chen
  • Tao Zhang

Graph node clustering is a fundamental unsupervised task. Existing methods typically train an encoder through self-supervised learning and then apply K-means to the encoder output. Some methods use this clustering result directly as the final assignment, while others initialize centroids based on this initial clustering and then finetune both the encoder and these learnable centroids. However, due to their reliance on K-means, these methods inherit its drawbacks when the cluster separability of encoder output is low, facing challenges from the Uniform Effect and Cluster Assimilation. We summarize three reasons for the low cluster separability in existing methods: (1) lack of contextual information prevents discrimination between similar nodes from different clusters; (2) training tasks are not sufficiently aligned with the downstream clustering task; (3) the cluster information in the graph structure is not appropriately exploited. To address these issues, we propose conTrastive grapH clustEring by SwApping fUsed gRomov-wasserstein coUplingS (THESAURUS). Our method introduces semantic prototypes to provide contextual information, and employs a cross-view assignment prediction pretext task that aligns well with the downstream clustering task. Additionally, it utilizes Gromov-Wasserstein Optimal Transport (GW-OT) along with the proposed prototype graph to thoroughly exploit cluster information in the graph structure. To adapt to diverse real-world data, THESAURUS updates the prototype graph and the prototype marginal distribution in OT by using momentum. Extensive experiments demonstrate that THESAURUS achieves higher cluster separability than the prior art, effectively mitigating the Uniform Effect and Cluster Assimilation issues.

NeurIPS Conference 2024 Conference Paper

Improving Decision Sparsity

  • Yiyang Sun
  • Tong Wang
  • Cynthia Rudin

Sparsity is a central aspect of interpretability in machine learning. Typically, sparsity is measured in terms of the size of a model globally, such as the number of variables it uses. However, this notion of sparsity is not particularly relevant for decision making; someone subjected to a decision does not care about variables that do not contribute to the decision. In this work, we dramatically expand a notion of decision sparsity called the Sparse Explanation Value (SEV) so that its explanations are more meaningful. SEV considers movement along a hypercube towards a reference point. By allowing flexibility in that reference and by considering how distances along the hypercube translate to distances in feature space, we can derive sparser and more meaningful explanations for various types of function classes. We present cluster-based SEV and its variant tree-based SEV, introduce a method that improves credibility of explanations, and propose algorithms that optimize decision sparsity in machine learning models.
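The "movement along a hypercube towards a reference point" idea admits a brute-force sketch: the smallest number of reference features that must be switched to the query's values before the model predicts positive. The name `sev_plus` and the exact definition here are simplified assumptions, not the paper's formulation.

```python
from itertools import combinations

def sev_plus(predict, query, reference):
    """Smallest number of reference features that must be switched
    to the query's values before the model predicts positive."""
    n = len(query)
    for k in range(n + 1):
        for subset in combinations(range(n), k):
            point = list(reference)
            for i in subset:
                point[i] = query[i]  # move along one hypercube edge
            if predict(point) == 1:
                return k
    return None  # even the full query is not predicted positive

# toy model: positive iff feature 0 exceeds 0.5
model = lambda x: int(x[0] > 0.5)
k = sev_plus(model, query=[0.9, 0.9, 0.9], reference=[0.0, 0.0, 0.0])
```

A small value means the decision hinges on only a few features relative to the chosen reference, which is the per-decision notion of sparsity the abstract contrasts with global model size.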

AAAI Conference 2024 Conference Paper

Inspecting Prediction Confidence for Detecting Black-Box Backdoor Attacks

  • Tong Wang
  • Yuan Yao
  • Feng Xu
  • Miao Xu
  • Shengwei An
  • Ting Wang

Backdoor attacks have been shown to be a serious security threat against deep learning models, and various defenses have been proposed to detect whether a model is backdoored or not. However, as indicated by a recent black-box attack, existing defenses can be easily bypassed by implanting the backdoor in the frequency domain. To this end, we propose a new defense DTInspector against black-box backdoor attacks, based on a new observation related to the prediction confidence of learning models. That is, to achieve a high attack success rate with a small amount of poisoned data, backdoor attacks usually render a model exhibiting statistically higher prediction confidences on the poisoned samples. We provide both theoretical and empirical evidence for the generality of this observation. DTInspector then carefully examines the prediction confidences of data samples, and decides the existence of backdoor using the shortcut nature of backdoor triggers. Extensive evaluations on six backdoor attacks, four datasets, and three advanced attacking types demonstrate the effectiveness of the proposed defense.

AAAI Conference 2024 Conference Paper

Semantic-Aware Autoregressive Image Modeling for Visual Representation Learning

  • Kaiyou Song
  • Shan Zhang
  • Tong Wang

The development of autoregressive modeling (AM) in computer vision lags behind natural language processing (NLP) in self-supervised pre-training. This is mainly caused by the challenge that images are not sequential signals and lack a natural order when applying autoregressive modeling. In this study, inspired by human beings’ way of grasping an image, i.e., focusing on the main object first, we present a semantic-aware autoregressive image modeling (SemAIM) method to tackle this challenge. The key insight of SemAIM is to autoregressively model images from the semantic patches to the less semantic patches. To this end, we first calculate a semantic-aware permutation of patches according to their feature similarities and then perform the autoregression procedure based on the permutation. In addition, considering that the raw pixels of patches are low-level signals and are not ideal prediction targets for learning high-level semantic representation, we also explore utilizing the patch features as the prediction targets. Extensive experiments are conducted on a broad range of downstream tasks, including image classification, object detection, and instance/semantic segmentation, to evaluate the performance of SemAIM. The results demonstrate SemAIM achieves state-of-the-art performance compared with other self-supervised methods. Specifically, with ViT-B, SemAIM achieves 84.1% top-1 accuracy for fine-tuning on ImageNet, 51.3% AP and 45.4% AP for object detection and instance segmentation on COCO, which outperforms the vanilla MAE by 0.5%, 1.0%, and 0.5%, respectively. Code is available at https://github.com/skyoux/SemAIM.
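The semantic-aware permutation can be sketched in miniature. The paper orders patches by feature similarities; the scoring rule below (cosine similarity to the mean patch feature) and the name `semantic_order` are assumptions for illustration, not the paper's exact procedure.

```python
import numpy as np

def semantic_order(patch_feats):
    """Rank patch indices from most to least 'semantic', scored here
    (an assumption) by cosine similarity to the mean patch feature."""
    mean = patch_feats.mean(axis=0)
    sims = patch_feats @ mean / (
        np.linalg.norm(patch_feats, axis=1) * np.linalg.norm(mean) + 1e-8)
    return np.argsort(-sims)  # autoregress over patches in this order

# two 'object-like' patches and one outlier background patch
feats = np.array([[1.0, 0.0], [1.0, 0.0], [0.0, 1.0]])
order = semantic_order(feats)
```

Autoregression then proceeds along this order, predicting patches that resemble the dominant content first and the less semantic ones last.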

NeurIPS Conference 2023 Conference Paper

DropPos: Pre-Training Vision Transformers by Reconstructing Dropped Positions

  • Haochen Wang
  • Junsong Fan
  • Yuxi Wang
  • Kaiyou Song
  • Tong Wang
  • Zhao-Xiang Zhang

As it is empirically observed that Vision Transformers (ViTs) are quite insensitive to the order of input tokens, the need for an appropriate self-supervised pretext task that enhances the location awareness of ViTs is becoming evident. To address this, we present DropPos, a novel pretext task designed to reconstruct Dropped Positions. The formulation of DropPos is simple: we first drop a large random subset of positional embeddings and then the model classifies the actual position for each non-overlapping patch among all possible positions solely based on their visual appearance. To avoid trivial solutions, we increase the difficulty of this task by keeping only a subset of patches visible. Additionally, considering there may be different patches with similar visual appearances, we propose position smoothing and attentive reconstruction strategies to relax this classification problem, since it is not necessary to reconstruct their exact positions in these cases. Empirical evaluations of DropPos show strong capabilities. DropPos outperforms supervised pre-training and achieves competitive results compared with state-of-the-art self-supervised alternatives on a wide range of downstream benchmarks. This suggests that explicitly encouraging spatial reasoning abilities, as DropPos does, indeed contributes to the improved location awareness of ViTs. The code is publicly available at https://github.com/Haochen-Wang409/DropPos.

NeurIPS Conference 2023 Conference Paper

Efficiently incorporating quintuple interactions into geometric deep learning force fields

  • Zun Wang
  • Guoqing Liu
  • Yichi Zhou
  • Tong Wang
  • Bin Shao

Machine learning force fields (MLFFs) have instigated a groundbreaking shift in molecular dynamics (MD) simulations across a wide range of fields, such as physics, chemistry, biology, and materials science. Incorporating higher order many-body interactions can enhance the expressiveness and accuracy of models. Recent models have achieved this by explicitly including up to four-body interactions. However, five-body interactions, which have relevance in various fields, are still challenging to incorporate efficiently into MLFFs. In this work, we propose the quintuple network (QuinNet), an end-to-end graph neural network that efficiently expresses many-body interactions up to five-body interactions with ab initio accuracy. By analyzing the topology of diverse many-body interactions, we design the model architecture to efficiently and explicitly represent these interactions. We evaluate QuinNet on public datasets of small molecules, such as MD17 and its revised version, and show that it is compatible with other state-of-the-art models on these benchmarks. Moreover, QuinNet surpasses many leading models on larger and more complex molecular systems, such as MD22 and Chignolin, without increasing the computational complexity. We also use QuinNet as a force field for molecular dynamics (MD) simulations to demonstrate its accuracy and stability, and conduct an ablation study to elucidate the significance of five-body interactions. We open source our implementation at https://github.com/Zun-Wang/QuinNet.

NeurIPS Conference 2023 Conference Paper

Geometric Transformer with Interatomic Positional Encoding

  • Yusong Wang
  • Shaoning Li
  • Tong Wang
  • Bin Shao
  • Nanning Zheng
  • Tie-Yan Liu

The widespread adoption of Transformer architectures in various data modalities has opened new avenues for applications in molecular modeling. Nevertheless, it remains unclear whether Transformer-based architectures can perform molecular modeling as well as equivariant GNNs. In this paper, by designing Interatomic Positional Encoding (IPE) that parameterizes atomic environments as the Transformer's positional encodings, we propose Geoformer, a novel geometric Transformer to effectively model molecular structures for various molecular property prediction tasks. We evaluate Geoformer on several benchmarks, including the QM9 dataset and the recently proposed Molecule3D dataset. Compared with both Transformers and equivariant GNN models, Geoformer outperforms the state-of-the-art (SoTA) algorithms on QM9, and achieves the best performance on Molecule3D for both random and scaffold splits. By introducing IPE, Geoformer paves the way for molecular geometric modeling based on Transformer architecture. Codes are available at https://github.com/microsoft/AI2BMD/tree/Geoformer.

JMLR Journal 2023 Journal Article

ProtoryNet - Interpretable Text Classification Via Prototype Trajectories

  • Dat Hong
  • Tong Wang
  • Stephen Baek

We propose a novel interpretable deep neural network for text classification, called ProtoryNet, based on a new concept of prototype trajectories. Motivated by the prototype theory in modern linguistics, ProtoryNet makes a prediction by finding the most similar prototype for each sentence in a text sequence and feeding an RNN backbone with the proximity of each sentence to the corresponding active prototype. The RNN backbone then captures the temporal pattern of the prototypes, which we refer to as prototype trajectories. Prototype trajectories enable intuitive and fine-grained interpretation of the reasoning process of the RNN model, in resemblance to how humans analyze texts. We also design a prototype pruning procedure to reduce the total number of prototypes used by the model for better interpretability. Experiments on multiple public datasets demonstrate that ProtoryNet achieves higher accuracy than the baseline prototype-based deep neural net and narrows the performance gap when compared to state-of-the-art black-box models. In addition, after prototype pruning, the resulting ProtoryNet models need only around 20 or fewer prototypes for all datasets, which significantly benefits interpretability. Furthermore, we report survey results indicating that human users find ProtoryNet more intuitive and easier to understand compared to other prototype-based methods.
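The trajectory construction can be sketched as follows. The function `prototype_trajectory` and the `exp(-distance)` proximity are illustrative assumptions; the paper's model learns the prototypes and feeds the proximities into an RNN, which this sketch omits.

```python
import numpy as np

def prototype_trajectory(sentence_embs, prototypes):
    """Map each sentence to (index of nearest prototype, proximity);
    the resulting sequence is the 'trajectory' fed to the RNN."""
    traj = []
    for s in sentence_embs:
        d = np.linalg.norm(prototypes - s, axis=1)
        i = int(d.argmin())
        traj.append((i, float(np.exp(-d[i]))))  # proximity in (0, 1]
    return traj

prototypes = np.array([[0.0, 0.0], [10.0, 10.0]])   # e.g. 'negative', 'positive'
sentences = np.array([[0.1, 0.0], [9.9, 10.0]])     # toy sentence embeddings
traj = prototype_trajectory(sentences, prototypes)
```

Reading off the sequence of prototype indices gives the sentence-by-sentence rationale that the abstract describes as resembling how humans analyze texts.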

TMLR Journal 2022 Journal Article

Direct Molecular Conformation Generation

  • Jinhua Zhu
  • Yingce Xia
  • Chang Liu
  • Lijun Wu
  • Shufang Xie
  • Yusong Wang
  • Tong Wang
  • Tao Qin

Molecular conformation generation aims to generate three-dimensional coordinates of all the atoms in a molecule and is an important task in bioinformatics and pharmacology. Previous methods usually first predict the interatomic distances, the gradients of interatomic distances, or the local structures (e.g., torsion angles) of a molecule, and then reconstruct its 3D conformation. How to directly generate the conformation without the above intermediate values is not fully explored. In this work, we propose a method that directly predicts the coordinates of atoms: (1) the loss function is invariant to roto-translation of coordinates and permutation of symmetric atoms; (2) the newly proposed model adaptively aggregates the bond and atom information and iteratively refines the coordinates of the generated conformation. Our method achieves the best results on the GEOM-QM9 and GEOM-Drugs datasets. Further analysis shows that our generated conformations have closer properties (e.g., HOMO-LUMO gap) to the ground-truth conformations. In addition, our method improves molecular docking by providing better initial conformations. All the results demonstrate the effectiveness of our method and the great potential of the direct approach. The code is released at https://github.com/DirectMolecularConfGen/DMCG.
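A standard way to make a coordinate loss invariant to roto-translation, point (1) above, is to align the prediction to the reference with the Kabsch algorithm before measuring error. The sketch below shows that ingredient only; it does not handle the symmetric-atom permutations the paper's full loss also covers.

```python
import numpy as np

def aligned_rmsd(pred, ref):
    """RMSD after optimal rotation and translation (Kabsch), so the
    loss is invariant to rigid motions of the predicted conformation."""
    pred = pred - pred.mean(axis=0)          # remove translation
    ref = ref - ref.mean(axis=0)
    U, _, Vt = np.linalg.svd(pred.T @ ref)   # optimal rotation via SVD
    d = np.sign(np.linalg.det(U @ Vt))       # avoid improper reflections
    R = U @ np.diag([1.0, 1.0, d]) @ Vt
    return np.sqrt(((pred @ R - ref) ** 2).sum(axis=1).mean())

ref = np.array([[0., 0., 0.], [1., 0., 0.], [0., 1., 0.], [0., 0., 1.]])
# rotate 90 degrees about z and translate: the aligned loss vanishes
Rz = np.array([[0., -1., 0.], [1., 0., 0.], [0., 0., 1.]])
moved = ref @ Rz.T + np.array([5., -3., 2.])
loss = aligned_rmsd(moved, ref)
```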

NeurIPS Conference 2022 Conference Paper

ProtoX: Explaining a Reinforcement Learning Agent via Prototyping

  • Ronilo Ragodos
  • Tong Wang
  • Qihang Lin
  • Xun Zhou

While deep reinforcement learning has proven to be successful in solving control tasks, the "black-box" nature of an agent has received increasing concerns. We propose a prototype-based post-hoc policy explainer, ProtoX, that explains a black-box agent by prototyping the agent's behaviors into scenarios, each represented by a prototypical state. When learning prototypes, ProtoX considers both visual similarity and scenario similarity. The latter is unique to the reinforcement learning context since it explains why the same action is taken in visually different states. To teach ProtoX about visual similarity, we pre-train an encoder using contrastive learning via self-supervised learning to recognize states as similar if they occur close together in time and receive the same action from the black-box agent. We then add an isometry layer to allow ProtoX to adapt scenario similarity to the downstream task. ProtoX is trained via imitation learning using behavior cloning, and thus requires no access to the environment or agent. In addition to explanation fidelity, we design different prototype shaping terms in the objective function to encourage better interpretability. We conduct various experiments to test ProtoX. Results show that ProtoX achieved high fidelity to the original black-box agent while providing meaningful and understandable explanations.

JMLR Journal 2021 Journal Article

Hybrid Predictive Models: When an Interpretable Model Collaborates with a Black-box Model

  • Tong Wang
  • Qihang Lin

Interpretable machine learning has become a strong competitor for black-box models. However, the possible loss of the predictive performance for gaining understandability is often inevitable, especially when it needs to satisfy users with diverse backgrounds or high standards for what is considered interpretable. This tension puts practitioners in a dilemma of choosing between high accuracy (black-box models) and interpretability (interpretable models). In this work, we propose a novel framework for building a Hybrid Predictive Model that integrates an interpretable model with any pre-trained black-box model to combine their strengths. The interpretable model substitutes the black-box model on a subset of data where the interpretable model is most competent, gaining transparency at a low cost of the predictive accuracy. We design a principled objective function that considers predictive accuracy, model interpretability, and model transparency (defined as the percentage of data processed by the interpretable substitute). Under this framework, we propose two hybrid models, one substituting with association rules and the other with linear models, and design customized training algorithms for both models. We test the hybrid models on structured data and text data where interpretable models collaborate with various state-of-the-art black-box models. Results show that hybrid models obtain an efficient trade-off between transparency and predictive performance, characterized by Pareto frontiers. Finally, we apply the proposed model on a real-world patient dataset for predicting cardiovascular disease and propose multi-model Pareto frontiers to assist model selection in real applications.
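The routing-plus-transparency mechanics can be sketched minimally. The helper names (`hybrid_predict`, `transparency`, `covers`) and the toy one-dimensional models are illustrative assumptions; the paper learns the coverage region and substitute jointly under its objective.

```python
def hybrid_predict(x, covers, interpretable, black_box):
    """Route an input to the interpretable substitute when it is
    covered; otherwise defer to the black-box model."""
    return interpretable(x) if covers(x) else black_box(x)

def transparency(data, covers):
    """Fraction of inputs handled by the interpretable substitute."""
    return sum(covers(x) for x in data) / len(data)

covers = lambda x: x < 0.5     # hypothetical interpretable region
interp = lambda x: 0           # simple rule applied on that region
black_box = lambda x: 1        # opaque model used everywhere else
data = [0.1, 0.4, 0.6, 0.9]
t = transparency(data, covers)
```

Growing the covered region raises transparency but eventually costs accuracy, which is exactly the Pareto frontier the abstract describes.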

ICML Conference 2019 Conference Paper

Gaining Free or Low-Cost Interpretability with Interpretable Partial Substitute

  • Tong Wang

This work addresses the situation where a black-box model with good predictive performance is chosen over its interpretable competitors, and we show interpretability is still achievable in this case. Our solution is to find an interpretable substitute on a subset of data where the black-box model is overkill or nearly overkill while leaving the rest to the black-box. This transparency is obtained at minimal cost or no cost to the predictive performance. Under this framework, we develop a Hybrid Rule Sets (HyRS) model that uses decision rules to capture the subspace of data where the rules are as accurate, or almost as accurate, as the black-box model. To train a HyRS, we devise an efficient search algorithm that iteratively finds the optimal model and exploits theoretically grounded strategies to reduce computation. Our framework is agnostic to the black-box during training. Experiments on structured and text data show that HyRS obtains an effective trade-off between transparency and predictive performance.

NeurIPS Conference 2019 Conference Paper

Metalearned Neural Memory

  • Tsendsuren Munkhdalai
  • Alessandro Sordoni
  • Tong Wang
  • Adam Trischler

We augment recurrent neural networks with an external memory mechanism that builds upon recent progress in metalearning. We conceptualize this memory as a rapidly adaptable function that we parameterize as a deep neural network. Reading from the neural memory function amounts to pushing an input (the key vector) through the function to produce an output (the value vector). Writing to memory means changing the function; specifically, updating the parameters of the neural network to encode desired information. We leverage training and algorithmic techniques from metalearning to update the neural memory function in one shot. The proposed memory-augmented model achieves strong performance on a variety of learning problems, from supervised question answering to reinforcement learning.
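The read-as-forward-pass, write-as-parameter-update scheme can be sketched with a linear memory function. Note the hedge: the paper performs the write in one shot with metalearned update rules, whereas this toy `NeuralMemory` uses a few steps of plain gradient descent on a linear map, and all names are illustrative.

```python
import numpy as np

class NeuralMemory:
    """A linear 'memory function' M(k) = W k: reading is a forward
    pass, writing is a gradient update that binds key to value."""
    def __init__(self, dim, lr=0.5):
        self.W = np.zeros((dim, dim))
        self.lr = lr

    def read(self, key):
        return self.W @ key

    def write(self, key, value, steps=20):
        for _ in range(steps):
            err = self.W @ key - value              # prediction error
            self.W -= self.lr * np.outer(err, key)  # grad of 0.5*||err||^2

mem = NeuralMemory(dim=3)
k = np.array([1.0, 0.0, 0.0])          # key vector
v = np.array([0.2, 0.7, -0.1])         # value vector to store
mem.write(k, v)
recalled = mem.read(k)
```

After the write, pushing the key through the function recovers the stored value, which is the sense in which "changing the function" implements memory.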

AAAI Conference 2018 Conference Paper

A Semantic QA-Based Approach for Text Summarization Evaluation

  • Ping Chen
  • Fei Wu
  • Tong Wang
  • Wei Ding

Many Natural Language Processing and Computational Linguistics applications involve the generation of new texts based on existing texts, such as summarization, text simplification and machine translation. However, a serious problem has haunted these applications for decades: how to automatically and accurately assess the quality of their output. In this paper, we present some preliminary results on one especially useful and challenging problem in NLP system evaluation – how to pinpoint content differences between two text passages (especially large passages such as articles and books). Our idea is intuitive and very different from existing approaches. We treat one text passage as a small knowledge base and ask it a large number of questions to exhaustively identify all content points in it. By comparing the correctly answered questions from two text passages, we can compare their content precisely. An experiment using the 2007 DUC summarization corpus clearly shows promising results.

NeurIPS Conference 2018 Conference Paper

Multi-value Rule Sets for Interpretable Classification with Feature-Efficient Representations

  • Tong Wang

We present the Multi-value Rule Set (MRS) for interpretable classification with feature-efficient representations. Compared to rule sets built from single-value rules, MRS adopts a more generalized form of association rules that allows multiple values in a condition. Rules of this form are more concise than classical single-value rules in capturing and describing patterns in data. Our formulation also pursues higher efficiency of feature utilization, which reduces possible costs in data collection and storage. We propose a Bayesian framework for formulating an MRS model and develop an efficient inference method for learning a maximum a posteriori (MAP) model, incorporating theoretically grounded bounds to iteratively reduce the search space and improve search efficiency. Experiments on synthetic and real-world data demonstrate that MRS models have significantly smaller complexity and fewer features than baseline models while being competitive in predictive accuracy.
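The multi-value condition described in this abstract, one condition admitting a set of values for a feature, can be illustrated as follows. This is a hypothetical sketch of the rule form only, not the paper's Bayesian learning procedure, and the features and values are invented.

```python
# A multi-value rule maps each feature to a SET of allowed values, so one
# rule covers what several single-value rules would need to express.

def mrs_rule_fires(x, rule):
    """rule: dict mapping feature name -> set of allowed values."""
    return all(x[feat] in allowed for feat, allowed in rule.items())

# One multi-value rule: "color in {red, blue} AND size in {S, M}"
rule = {"color": {"red", "blue"}, "size": {"S", "M"}}

print(mrs_rule_fires({"color": "red", "size": "M"}, rule))    # True
print(mrs_rule_fires({"color": "green", "size": "S"}, rule))  # False
```

Expressed with single-value rules, the same coverage would require four conjunctions (one per color/size combination), which is the conciseness the abstract refers to.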

JMLR Journal 2017 Journal Article

A Bayesian Framework for Learning Rule Sets for Interpretable Classification

  • Tong Wang
  • Cynthia Rudin
  • Finale Doshi-Velez
  • Yimin Liu
  • Erica Klampfl
  • Perry MacNeille

We present a machine learning algorithm for building classifiers that are composed of a small number of short rules. These are restricted disjunctive normal form models. An example of a classifier of this form is as follows: If $X$ satisfies (condition $A$ AND condition $B$) OR (condition $C$) OR $\cdots$, then $Y=1$. Models of this form have the advantage of being interpretable to human experts, since they produce a set of rules that concisely describe a specific class. We present two probabilistic models with prior parameters that the user can set to encourage the model to have a desired size and shape, to conform with a domain-specific definition of interpretability. We provide a scalable MAP inference approach and develop theoretical bounds to reduce computation by iteratively pruning the search space. We apply our method (Bayesian Rule Sets, BRS) to characterize and predict user behavior with respect to in-vehicle context-aware personalized recommender systems. Our method has a major advantage over classical associative classification methods and decision trees in that it does not greedily grow the model.
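The restricted-DNF form the abstract gives as an example, (A AND B) OR (C), evaluates as follows. This is a sketch of how such a learned rule set classifies, with hypothetical conditions; it is not the paper's MAP inference algorithm.

```python
# A rule set predicts Y = 1 when ANY rule fires, where a rule is a
# conjunction (AND) of conditions: disjunctive normal form.

def rule_set_predict(x, rule_set):
    """rule_set: list of rules; each rule is a list of condition functions."""
    return int(any(all(cond(x) for cond in rule) for rule in rule_set))

# Hypothetical classifier: "If (age < 30 AND student) OR (income > 100k), then Y = 1"
rule_set = [
    [lambda x: x["age"] < 30, lambda x: x["student"]],   # condition A AND condition B
    [lambda x: x["income"] > 100_000],                   # condition C
]

print(rule_set_predict({"age": 25, "student": True, "income": 20_000}, rule_set))   # 1
print(rule_set_predict({"age": 45, "student": False, "income": 50_000}, rule_set))  # 0
```

The user-settable priors mentioned in the abstract would, in this picture, bias learning toward few rules (short outer list) and short rules (short inner lists).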

IJCAI Conference 2017 Conference Paper

Predicting the Quality of Short Narratives from Social Media

  • Tong Wang
  • Ping Chen
  • Boyang Li

An important and difficult challenge in building computational models for narratives is the automatic evaluation of narrative quality. Quality evaluation connects narrative understanding and generation, as generation systems need to evaluate their own products. To circumvent difficulties in acquiring annotations, we employ upvotes in social media as an approximate measure of story quality. We collected 54,484 answers from a crowd-powered question-and-answer website, Quora, and then used active learning to build a classifier that labeled 28,320 answers as stories. To predict the number of upvotes without the use of social network features, we create neural networks that model textual regions and the interdependence among regions, which serve as strong benchmarks for future research. To the best of our knowledge, this is the first large-scale study of automatic evaluation of narrative quality.

AAAI Conference 2016 Conference Paper

Text Simplification Using Neural Machine Translation

  • Tong Wang
  • Ping Chen
  • John Rochford
  • Jipeng Qiang

Text simplification (TS) is the technique of reducing the lexical, syntactical complexity of text. Existing automatic TS systems can simplify text only by lexical simplification or by manually defined rules. Neural Machine Translation (NMT) is a recently proposed approach for Machine Translation (MT) that is receiving a lot of research interest. In this paper, we regard original English and simplified English as two languages, and apply a NMT model–Recurrent Neural Network (RNN) encoder-decoder on TS to make the neural network to learn text simplification rules by itself. Then we discuss challenges and strategies about how to apply a NMT model to the task of text simplification.