Arrow Research

Author name cluster

Can Zhang

Possible papers associated with this exact author name in Arrow. This page groups case-insensitive exact name matches and is not a full identity disambiguation profile.

6 papers
2 author rows

Possible papers (6)

ICML 2025 Conference Paper

FedSMU: Communication-Efficient and Generalization-Enhanced Federated Learning through Symbolic Model Updates

  • Xinyi Lu
  • Hao Zhang
  • Chenglin Li
  • Weijia Lu
  • Zhifei Yang 0005
  • Wenrui Dai
  • Xiaodong Zhang
  • Xiaofeng Ma

Significant communication overhead and client data heterogeneity pose an important challenge to the current federated learning (FL) paradigm. Existing compression-based and optimization-based FL algorithms typically focus on addressing either the model compression challenge or the data heterogeneity issue individually, rather than tackling both. In this paper, we observe that by symbolizing the client model updates to be uploaded (i.e., normalizing the magnitude of each model parameter update at local clients), the model heterogeneity, which essentially stems from data heterogeneity, can be mitigated, thereby helping to improve the generalization performance of the globally aggregated model at the server. Inspired by this observation, and further motivated by the success of the Lion optimizer in achieving optimal performance on most tasks in centralized learning, we propose a new FL algorithm, called FedSMU, which simultaneously reduces the communication overhead and alleviates the data heterogeneity issue. Specifically, FedSMU splits the standard Lion optimizer into local updates and global execution, so that only the symbol (sign) of each client model update is communicated between client and server. We theoretically prove the convergence of FedSMU in general non-convex settings. Through extensive experimental evaluations on several benchmark datasets, we demonstrate that FedSMU not only reduces the communication overhead but also achieves better generalization performance than other compression-based and optimization-based baselines.
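
As a rough illustration of the idea the abstract describes (not the paper's exact algorithm), the sketch below splits a Lion-style step so that a client uploads only the sign of its update direction while the momentum buffer stays local, and the server aggregates the signs and executes the global step. The function names, hyperparameters, and the plain sign-averaging rule are assumptions for illustration.

```python
import numpy as np

def client_sign_update(grad, momentum, beta1=0.9, beta2=0.99):
    """Client side of a Lion-style step: form the interpolated update
    direction and keep only its sign (1 bit per parameter to upload).
    The momentum buffer never leaves the client."""
    sign_update = np.sign(beta1 * momentum + (1 - beta1) * grad)
    new_momentum = beta2 * momentum + (1 - beta2) * grad
    return sign_update, new_momentum

def server_global_step(weights, sign_updates, lr=1e-4, weight_decay=0.0):
    """Server side: aggregate the clients' sign vectors and execute the
    global update with decoupled weight decay, as in Lion."""
    avg_sign = np.mean(sign_updates, axis=0)
    return weights - lr * (avg_sign + weight_decay * weights)
```

Since each parameter's contribution is compressed to a sign, the uplink cost drops to one bit per parameter regardless of update magnitude, which is where the communication saving comes from.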

AAAI 2025 Conference Paper

Leveraging Consistent Spatio-Temporal Correspondence for Robust Visual Odometry

  • Zhaoxing Zhang
  • Junda Cheng
  • Gangwei Xu
  • Xiaoxiang Wang
  • Can Zhang
  • Xin Yang

Recent approaches to visual odometry (VO) have significantly improved performance by using deep networks to predict optical flow between video frames. However, existing methods still suffer from noisy and inconsistent flow matching, making it difficult to handle challenging scenarios and long-sequence estimation. To overcome these challenges, we introduce Spatio-Temporal Visual Odometry (STVO), a novel deep network architecture that effectively leverages inherent spatio-temporal cues to enhance the accuracy and consistency of multi-frame flow matching. With more accurate and consistent flow matching, STVO achieves better pose estimation through bundle adjustment (BA). Specifically, STVO introduces two innovative components: 1) a Temporal Propagation Module that utilizes multi-frame information to extract and propagate temporal cues across adjacent frames, maintaining temporal consistency; and 2) a Spatial Activation Module that utilizes geometric priors from depth maps to enhance spatial consistency while filtering out excessive noise and incorrect matches. STVO achieves state-of-the-art performance on the TUM-RGBD, EuRoC MAV, ETH3D, and KITTI Odometry benchmarks. Notably, it improves accuracy by 77.8% on the ETH3D benchmark and 38.9% on the KITTI Odometry benchmark over the previous best methods.
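
The Spatial Activation Module is only described at a high level; as a toy sketch of the kind of depth-prior filtering it gestures at, the snippet below keeps a flow match only when the depth-induced reprojection of a pixel lands near where the flow says it moved. The camera model, interfaces, and threshold are illustrative assumptions, not the paper's implementation.

```python
import numpy as np

def filter_matches_with_depth(pts, flow, depth, K, T_rel, thresh=2.0):
    """pts: (N, 2) pixel coords in frame 1; flow: (N, 2) predicted flow;
    depth: (N,) depths in frame 1; K: 3x3 intrinsics; T_rel: 4x4 relative
    pose from frame 1 to frame 2. Returns a boolean mask of matches that
    are geometrically consistent with the depth prior."""
    ones = np.ones((pts.shape[0], 1))
    rays = (np.linalg.inv(K) @ np.hstack([pts, ones]).T).T   # back-project pixels
    X = rays * depth[:, None]                                # 3D points in camera 1
    Xh = np.hstack([X, ones]) @ T_rel.T                      # transform to camera 2
    proj = (K @ Xh[:, :3].T).T
    proj = proj[:, :2] / proj[:, 2:3]                        # project to pixels
    err = np.linalg.norm(proj - (pts + flow), axis=1)        # vs. flow target
    return err < thresh
```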

NeurIPS 2025 Conference Paper

Seeing is Believing? Mitigating OCR Hallucinations in Multimodal Large Language Models

  • Zhentao He
  • Can Zhang
  • Ziheng Wu
  • Zhenghao Chen
  • Yufei Zhan
  • Yifan Li
  • Zhao Zhang
  • Xian Wang

Recent advancements in multimodal large language models (MLLMs) have enhanced document understanding by integrating textual and visual information. However, existing models fall short in real-world scenarios, particularly under visual degradation (e.g., blur, occlusion, low contrast). Under such conditions, the current response paradigm often fails to adequately perceive visual degradation and ambiguity, leading to overreliance on linguistic priors or misaligned visual-textual reasoning. This difficulty in recognizing uncertainty frequently results in the generation of hallucinatory content, especially when a precise answer is not feasible. To demonstrate and analyze this problem, we propose KIE-HVQA, the first benchmark dedicated to evaluating OCR hallucination in degraded document understanding. The dataset includes test samples spanning identity cards, invoices, and prescriptions, with simulated real-world degradations and pixel-level annotations of OCR reliability. This setup evaluates a model's capacity, under degraded input, to distinguish reliable visual information and answer accordingly, thereby highlighting the challenge of avoiding hallucination on uncertain data. To achieve vision-faithful reasoning and thereby avoid these issues, we further introduce a Group Relative Policy Optimization (GRPO)-based framework featuring a novel reward mechanism. By incorporating self-awareness of visual uncertainty, and an analysis method that initiates refusal to answer to increase task difficulty, within our supervised fine-tuning and reinforcement learning framework, we successfully mitigate hallucinations in ambiguous regions. Experiments on Qwen2.5-VL demonstrate that our 7B-parameter model achieves a ~28% absolute improvement in hallucination-free accuracy over GPT-4o on KIE-HVQA, with no significant performance drop on standard tasks, highlighting both effectiveness and robustness. This work advances the development of reliable MLLMs for real-world document analysis by addressing critical challenges in visual-linguistic alignment under degradation.
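
The reward mechanism itself is not spelled out in the abstract; the following is a toy sketch in its spirit: reward correct answers on legible fields, give credit for explicit refusals on unreadable fields, penalize hallucinated content, and normalize rewards group-relatively as GRPO does. The scoring values and the `[REFUSE]` token are assumptions, not the paper's design.

```python
import numpy as np

def field_reward(prediction, ground_truth, legible):
    """Toy per-field reward: penalize hallucination, credit calibrated refusal."""
    if prediction == "[REFUSE]":
        return 0.5 if not legible else -0.2   # refusal is right only when unreadable
    if prediction == ground_truth:
        return 1.0 if legible else -0.5       # "correct" guess on unreadable text is luck
    return -1.0                               # hallucinated content

def grpo_advantages(rewards):
    """GRPO-style group-relative advantages: normalize rewards within the
    group of responses sampled for the same prompt."""
    r = np.asarray(rewards, dtype=float)
    return (r - r.mean()) / (r.std() + 1e-8)
```

A reward shaped this way makes blind guessing on degraded fields a losing strategy in expectation, which is the behavioral change the paper reports.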

YNIMG 2023 Journal Article

Distinct inter-brain synchronization patterns underlying group decision-making under uncertainty with partners in different interpersonal relationships

  • Hanxuan Zhao
  • Can Zhang
  • Ruiwen Tao
  • Haijun Duan
  • Sihua Xu

Humans may behave differently when making decisions with friends versus strangers. Whether the interpersonal relationship and the characteristics of the individuals in the group affect group decision-making under uncertainty during real-time interaction remains unknown. Using the turn-based Balloon Analogue Risk Task (BART), the present study examined group decision-making propensity under uncertainty with partners in different interpersonal relationships and of different interpersonal orientations. The corresponding inter-brain synchronization (IBS) patterns in the prefrontal cortex (PFC) were uncovered with the fNIRS-based hyperscanning approach. Behavioral results identified that dyads in the friend group exhibited an uncertainty-averse propensity compared with the stranger group. The fNIRS results showed that feedback-related IBS at the left inferior frontal gyrus (l-IFG) and medial frontopolar cortex (mFPC) during different feedback types was modulated by interpersonal relationship. IBS across all PFC channels during positive and negative feedback, respectively, predicted the decision-making propensity under uncertainty in the stranger and friend groups using a support vector machine (SVM) algorithm. A moderating role of social value orientation (SVO) was also verified in the mediation effect of dyad closeness on decision-making propensity under uncertainty via IBS at the right lateral frontopolar cortex (r-FPC). These findings demonstrate disparate behavioral responses and inter-brain synchronization patterns underlying group decision-making under uncertainty with partners in different interpersonal relationships.
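
As a minimal illustration of the kind of SVM analysis the abstract mentions (predicting a decision-making propensity score from per-channel IBS features), here is a scikit-learn sketch on toy data. The feature layout, kernel choice, and cross-validation scheme are assumptions, not the study's actual pipeline.

```python
import numpy as np
from sklearn.svm import SVR
from sklearn.model_selection import cross_val_score
from sklearn.pipeline import make_pipeline
from sklearn.preprocessing import StandardScaler

rng = np.random.default_rng(0)
X = rng.normal(size=(40, 22))   # 40 dyads x 22 hypothetical PFC channels of IBS values
y = rng.normal(size=40)         # decision-making propensity score per dyad (toy data)

# Standardize channel features, then fit a kernel SVM regressor.
model = make_pipeline(StandardScaler(), SVR(kernel="rbf"))
scores = cross_val_score(model, X, y, cv=5, scoring="r2")
print("cross-validated R^2 per fold:", scores)
```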

AAAI 2021 Conference Paper

Non-Autoregressive Coarse-to-Fine Video Captioning

  • Bang Yang
  • Yuexian Zou
  • Fenglin Liu
  • Can Zhang

It is encouraging to see the progress made in bridging videos and natural language. However, mainstream video captioning methods suffer from slow inference due to the sequential manner of autoregressive decoding, and tend to generate generic descriptions due to insufficient training of visual words (e.g., nouns and verbs) and an inadequate decoding paradigm. In this paper, we propose a non-autoregressive decoding-based model with a coarse-to-fine captioning procedure to alleviate these defects. In our implementation, we employ a bi-directional self-attention-based network as the language model to achieve an inference speedup, and on this basis we decompose the captioning procedure into two stages with different focuses. Specifically, given that visual words determine the semantic correctness of captions, we design a mechanism for generating visual words that not only promotes the training of scene-related words but also captures relevant details from videos to construct a coarse-grained sentence "template". Thereafter, we devise dedicated decoding algorithms that fill in the "template" with suitable words and modify inappropriate phrasing via iterative refinement to obtain a fine-grained description. Extensive experiments on two mainstream video captioning benchmarks, i.e., MSVD and MSR-VTT, demonstrate that our approach achieves state-of-the-art performance, generates diverse descriptions, and attains high inference efficiency.
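
The fill-and-refine decoding the abstract describes resembles mask-predict-style iterative refinement; the sketch below shows that generic pattern rather than the paper's exact algorithm: start from a fully masked "template", predict all positions in parallel with a bi-directional model, then repeatedly re-mask and re-predict the least confident tokens. The model signature and the linear mask-annealing schedule are assumptions.

```python
import torch

@torch.no_grad()
def iterative_refine(model, visual_feats, length, mask_id, steps=5):
    """Mask-predict-style decoding: parallel prediction plus
    confidence-based re-masking, annealing the mask count to zero."""
    tokens = torch.full((1, length), mask_id, dtype=torch.long)
    for t in range(steps):
        logits = model(tokens, visual_feats)        # bi-directional LM, one pass
        probs, preds = logits.softmax(-1).max(-1)   # per-position confidence
        tokens = preds
        n_mask = length * (steps - 1 - t) // steps  # fewer masked slots each round
        if n_mask > 0:
            worst = probs[0].topk(n_mask, largest=False).indices
            tokens[0, worst] = mask_id              # re-predict these next round
    return tokens
```

Because every position is predicted in one forward pass per round, the number of model calls is the fixed `steps` rather than the caption length, which is the source of the inference speedup.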

IJCAI 2021 Conference Paper

RR-Net: Injecting Interactive Semantics in Human-Object Interaction Detection

  • Dongming Yang
  • Yuexian Zou
  • Can Zhang
  • Meng Cao
  • Jie Chen

Human-Object Interaction (HOI) detection aims to learn how humans interact with surrounding objects. The latest end-to-end HOI detectors lack relation reasoning, leaving them unable to learn HOI-specific interactive semantics for their predictions. In this paper, we therefore propose novel relation reasoning for HOI detection. We first present a progressive Relation-aware Frame, which brings a new structure and parameter-sharing pattern to interaction inference. On top of this frame, an Interaction Intensifier Module and a Correlation Parsing Module are carefully designed, in which: a) interactive semantics from humans are exploited and passed to objects to intensify interactions, and b) interactive correlations among humans, objects, and interactions are integrated to promote predictions. Based on these modules, we construct an end-to-end trainable framework named Relation Reasoning Network (RR-Net). Extensive experiments show that RR-Net sets a new state-of-the-art on both the V-COCO and HICO-DET benchmarks, improving over the baseline by about 5.5% and 9.8% relative, respectively, validating that this first effort at exploring relation reasoning and integrating interactive semantics brings a clear improvement to end-to-end HOI detection.
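
The module internals are not specified in the abstract; as a toy sketch of the human-to-object message passing it gestures at, the snippet below projects human appearance features and adds them to their paired object features to "intensify" interaction cues. The dimensions, the additive form, and the pairing interface are all assumptions for illustration.

```python
import torch
import torch.nn as nn

class HumanToObjectIntensifier(nn.Module):
    """Toy intensifier: pass a learned message from each human to its
    paired object so object features carry interaction evidence."""
    def __init__(self, dim=256):
        super().__init__()
        self.proj = nn.Linear(dim, dim)

    def forward(self, human_feats, object_feats, pair_index):
        """human_feats: (H, dim); object_feats: (O, dim);
        pair_index: (O,) long tensor, index of the human paired with
        each object. Returns intensified object features, shape (O, dim)."""
        msg = self.proj(human_feats)[pair_index]  # gather each object's human cue
        return object_feats + torch.relu(msg)     # additive intensification
```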