Arrow Research search

Author name cluster

Le Wang

Possible papers associated with this exact author name in Arrow. This page groups case-insensitive exact name matches and is not a full identity disambiguation profile.

25 papers
2 author rows

Possible papers

25

AAAI Conference 2026 Conference Paper

FeTS: A Feature-Aware Framework for Time Series Forecasting

  • Le Wang
  • Jianyong Chen
  • Songbai Liu

Time series forecasting faces a fundamental challenge: the uneven distribution of predictive importance in time series data, where certain time points and feature combinations carry disproportionate predictive power. As a result, uniform processing methods that treat all data alike inevitably fall short of optimal performance. To address this problem, we propose FeTS, a feature-aware framework that comprehensively learns temporal features through two key components: (i) Adaptive Feature Extraction (AdaFE), which dynamically discovers the most important features within each temporal patch and extracts them on the fly, yielding sharper and more focused local representations; and (ii) Dual-Scale Feed-Forward Network (DSFFN), which strategically integrates fine-grained local features with global long-term dependencies to achieve richer dual-scale representation learning. Extensive experiments on eight benchmark datasets demonstrate that FeTS achieves state-of-the-art performance in time series forecasting tasks, offering a novel solution to the challenge of uneven predictive importance in forecasting.
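
The patch-level selection of salient features that the abstract attributes to AdaFE can be illustrated with a small, hedged sketch. The module below is a hypothetical stand-in, not the paper's implementation: it scores positions inside each temporal patch with a linear layer and keeps only the top-k, assuming "features" here means positions within a patch and that hard top-k masking is an acceptable simplification.

import torch
import torch.nn as nn

class PatchTopKSelector(nn.Module):
    """Illustrative patch-wise feature selection; sizes and scoring rule are assumptions."""
    def __init__(self, patch_len: int, k: int):
        super().__init__()
        self.k = k
        self.scorer = nn.Linear(patch_len, patch_len)  # score each position within a patch

    def forward(self, x: torch.Tensor) -> torch.Tensor:
        # x: (batch, num_patches, patch_len) patches of a univariate series
        scores = self.scorer(x)
        topk = scores.topk(self.k, dim=-1).indices
        mask = torch.zeros_like(scores).scatter_(-1, topk, 1.0)
        return x * mask  # keep only the k most salient positions per patch

x = torch.randn(8, 12, 16)                   # 8 series, 12 patches of length 16
print(PatchTopKSelector(16, k=4)(x).shape)   # torch.Size([8, 12, 16])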

AAAI Conference 2026 Conference Paper

HumanSense: From Multimodal Perception to Empathetic Context-Aware Responses Through Reasoning MLLMs

  • Zheng Qin
  • Ruobing Zheng
  • Yabing Wang
  • Tianqi Li
  • Yi Yuan
  • Jingdong Chen
  • Le Wang

While Multimodal Large Language Models (MLLMs) show immense promise for achieving truly human-like interactions, progress is hindered by the lack of fine-grained evaluation frameworks for human-centered scenarios, encompassing both the understanding of complex human intentions and the provision of empathetic, context-aware responses. Here we introduce HumanSense, a comprehensive benchmark designed to evaluate the human-centered perception and interaction capabilities of MLLMs, with a particular focus on deep understanding of extended multimodal contexts and the formulation of rational feedback. Our evaluation reveals that leading MLLMs still have considerable room for improvement, particularly for advanced interaction-oriented tasks. Supplementing visual input with audio and text information yields substantial improvements, and Omni-modal models show advantages on these tasks. Furthermore, grounded in the observation that appropriate feedback stems from a contextual analysis of the interlocutor's needs and emotions, we posit that reasoning ability serves as the key to unlocking it. We devise a multi-stage, modality-progressive reinforcement learning approach, resulting in HumanSense-Omni-Reasoning, which substantially enhances performance on higher-level understanding and interactive tasks. Additionally, we observe that successful reasoning processes appear to exhibit consistent thought patterns. By designing corresponding prompts, we also enhance the performance of non-reasoning models in a training-free manner.

JBHI Journal 2026 Journal Article

Spec-ViT: A Vision Transformer With Wavelet for Anti-Aliasing and Denoising in Medical Image Classification

  • Xiong Zhang
  • Le Wang
  • Yanying Rao
  • Yuzheng Su
  • Fayaz Ali Dharejo
  • Radu Timofte
  • Guo-jun Mao
  • Moath Alathbah

Medical image analysis remains challenging due to inherent limitations in imaging modalities, where structural aliasing and noise artifacts persistently compromise diagnostic accuracy. While convolutional neural networks (CNNs) and vision transformers (ViTs) have achieved remarkable progress in feature extraction, their inherent sampling mechanisms and spectral biases often exacerbate these high-frequency distortions, leading to suboptimal lesion characterization. To address this critical limitation, we propose Spec-ViT, a novel wavelet-based anti-aliasing Transformer architecture that synergistically integrates adaptive spectral purification with hierarchical attentive learning. The Wavelet Antialiasing Module (WAM) first applies a learnable smoothing factor in the wavelet domain to suppress high-frequency artifacts, while preserving clinically relevant low-frequency structures and fine diagnostic details. Building upon this spectral foundation, the Lightweight Enhanced Attention (LEA) refines feature representations through a dual-path mechanism, coupling channel-spatial attention with global multi-head self-attention to enhance lesion context modeling. Finally, the Smoothed Convolutional Gate (SCG) further sharpens local discriminability through depthwise convolution and adaptive Swish gating, completing a coherent pipeline from frequency-aware purification to global-local attentive analysis. Extensive experiments on five benchmark medical image classification datasets demonstrate that Spec-ViT consistently outperforms both baseline and state-of-the-art methods, achieving up to 84.04% accuracy on the Pediatric Pneumonia Chest X-rays dataset in particular.
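
The operation WAM is described as performing, attenuating high-frequency wavelet content with a learnable smoothing factor, can be sketched with a single-level Haar transform. This is a minimal illustration under assumed design choices (one decomposition level, a single scalar factor shared across the three high-frequency sub-bands), not the paper's WAM.

import torch
import torch.nn as nn

class LearnableHaarSmoothing(nn.Module):
    """Single-level Haar DWT, scale the high-frequency sub-bands by a learnable
    factor in (0, 1), then invert the transform. Purely illustrative."""
    def __init__(self):
        super().__init__()
        self.logit = nn.Parameter(torch.zeros(1))  # smoothing factor = sigmoid(logit)

    def forward(self, x):
        # x: (B, C, H, W) with even H and W
        a = x[..., 0::2, 0::2]; b = x[..., 0::2, 1::2]
        c = x[..., 1::2, 0::2]; d = x[..., 1::2, 1::2]
        ll = (a + b + c + d) / 2                   # low-frequency sub-band
        lh = (a + b - c - d) / 2
        hl = (a - b + c - d) / 2
        hh = (a - b - c + d) / 2
        s = torch.sigmoid(self.logit)              # learnable smoothing factor
        lh, hl, hh = s * lh, s * hl, s * hh
        # inverse Haar transform
        a = (ll + lh + hl + hh) / 2
        b = (ll + lh - hl - hh) / 2
        c = (ll - lh + hl - hh) / 2
        d = (ll - lh - hl + hh) / 2
        out = torch.zeros_like(x)
        out[..., 0::2, 0::2] = a; out[..., 0::2, 1::2] = b
        out[..., 1::2, 0::2] = c; out[..., 1::2, 1::2] = d
        return out

print(LearnableHaarSmoothing()(torch.randn(2, 3, 224, 224)).shape)  # torch.Size([2, 3, 224, 224])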

IROS Conference 2025 Conference Paper

An insect-scale multimodal amphibious piezoelectric robot

  • Le Wang
  • Xin Wang
  • Hanlin Wang
  • Xiqing Zuo
  • Chao Xu

Miniature amphibious robots are capable of performing various tasks in complex terrestrial and aquatic environments due to their superior adaptability. However, the mobility of existing miniature amphibious robots in such environments is limited by their complex locomotion systems and single mode of motion. This work presents a novel insect-scale amphibious robot powered by a single piezoelectric actuator. A prototype of the robot is fabricated and preliminarily tested. By exploiting the different vibration modes of the piezoelectric actuator, the robot achieves movement in an amphibious environment. The robot employs the acoustic flow generated by the higher-order mode to achieve rapid motion at the water surface. In addition, the robot attains forward and backward motion on the ground by means of the friction force between the driving feet and the ground. The findings of this study offer significant insights into the development of amphibious robots that exhibit enhanced flexibility and adaptability. These insights lay the foundation for the future applications of such robots in narrow amphibious settings.

AAAI Conference 2025 Conference Paper

Diversifying Query: Region-Guided Transformer for Temporal Sentence Grounding

  • Xiaolong Sun
  • Liushuai Shi
  • Le Wang
  • Sanping Zhou
  • Kun Xia
  • Yabing Wang
  • Gang Hua

Temporal sentence grounding is a challenging task that aims to localize the moment spans relevant to a language description. Although recent DETR-based models have achieved notable progress by leveraging multiple learnable moment queries, they suffer from overlapped and redundant proposals, leading to inaccurate predictions. We attribute this limitation to the lack of task-related guidance for the learnable queries to serve a specific mode. Furthermore, the complex solution space generated by variable and open-vocabulary language descriptions complicates optimization, making it harder for the learnable queries to adaptively distinguish themselves from one another and leading to even more severely overlapped proposals. To address this limitation, we present the Region-Guided TRansformer (RGTR) for temporal sentence grounding, which introduces regional guidance to increase query diversity and eliminate overlapped proposals. Instead of using learnable queries, RGTR adopts a set of anchor pairs as moment queries to introduce explicit regional guidance. Each moment query takes charge of moment prediction for a specific temporal region, which reduces the optimization difficulty and ensures the diversity of the proposals. In addition, we design an IoU-aware scoring head to improve proposal quality. Extensive experiments demonstrate the effectiveness of RGTR, outperforming state-of-the-art methods on three public benchmarks and exhibiting good generalization and robustness on out-of-distribution splits.
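
The IoU-aware scoring head mentioned in the abstract is naturally supervised by the temporal IoU between a predicted span and the ground-truth moment. A minimal helper for that quantity is sketched below, assuming spans are (start, end) pairs on a normalized time axis.

import torch

def temporal_iou(pred: torch.Tensor, gt: torch.Tensor) -> torch.Tensor:
    # pred, gt: (..., 2) spans given as (start, end) on a normalized time axis
    inter = (torch.minimum(pred[..., 1], gt[..., 1])
             - torch.maximum(pred[..., 0], gt[..., 0])).clamp(min=0)
    union = (pred[..., 1] - pred[..., 0]) + (gt[..., 1] - gt[..., 0]) - inter
    return inter / union.clamp(min=1e-6)

print(temporal_iou(torch.tensor([0.2, 0.6]), torch.tensor([0.3, 0.7])))  # tensor(0.6000)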

NeurIPS Conference 2025 Conference Paper

DynaRend: Learning 3D Dynamics via Masked Future Rendering for Robotic Manipulation

  • Jingyi Tian
  • Le Wang
  • Sanping Zhou
  • Sen Wang
  • Gang Hua

Learning generalizable robotic manipulation policies remains a key challenge due to the scarcity of diverse real-world training data. While recent approaches have attempted to mitigate this through self-supervised representation learning, most either rely on 2D vision pretraining paradigms such as masked image modeling, which primarily focus on static semantics or scene geometry, or utilize large-scale video prediction models that emphasize 2D dynamics, thus failing to jointly learn the geometry, semantics, and dynamics required for effective manipulation. In this paper, we present DynaRend, a representation learning framework that learns 3D-aware and dynamics-informed triplane features via masked reconstruction and future prediction using differentiable volumetric rendering. By pretraining on multi-view RGB-D video data, DynaRend jointly captures spatial geometry, future dynamics, and task semantics in a unified triplane representation. The learned representations can be effectively transferred to downstream robotic manipulation tasks via action value map prediction. We evaluate DynaRend on two challenging benchmarks, RLBench and Colosseum, as well as in real-world robotic experiments, demonstrating substantial improvements in policy success rate, generalization to environmental perturbations, and real-world applicability across diverse manipulation tasks.
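
The triplane representation the abstract refers to is typically queried by projecting each 3D point onto three axis-aligned feature planes and fusing the sampled features, as sketched below; the plane resolution, channel count, and the choice to sum the three samples are illustrative assumptions rather than DynaRend's exact design.

import torch
import torch.nn.functional as F

def query_triplane(planes: torch.Tensor, pts: torch.Tensor) -> torch.Tensor:
    # planes: (3, C, R, R) feature planes for the XY, XZ, YZ planes
    # pts: (P, 3) query points with coordinates in [-1, 1]
    coords = [pts[:, [0, 1]], pts[:, [0, 2]], pts[:, [1, 2]]]
    feats = []
    for plane, uv in zip(planes, coords):
        grid = uv.reshape(1, 1, -1, 2)                   # (1, 1, P, 2)
        sampled = F.grid_sample(plane.unsqueeze(0), grid,
                                mode='bilinear', align_corners=True)
        feats.append(sampled.squeeze(0).squeeze(1).t())  # (P, C)
    return sum(feats)                                    # fused per-point feature

planes = torch.randn(3, 32, 64, 64)
pts = torch.rand(100, 3) * 2 - 1
print(query_triplane(planes, pts).shape)  # torch.Size([100, 32])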

AAAI Conference 2025 Conference Paper

RefDetector: A Simple Yet Effective Matching-based Method for Referring Expression Comprehension

  • Yabing Wang
  • Zhuotao Tian
  • Zheng Qin
  • Sanping Zhou
  • Le Wang

Despite the rapid and substantial advancements in object detection, it continues to face limitations imposed by pre-defined category sets. Current methods for visual grounding primarily focus on how to better leverage the visual backbone to generate text-tailored visual features, which may require adjusting the parameters of the entire model. Besides, some early methods, i.e., matching-based methods, build upon and extend the functionality of existing object detectors by enabling them to localize an object based on free-form linguistic expressions, which have good application potential. However, the potential of the matching-based approach has not been fully realized due to inadequate exploration. In this paper, we first analyze the limitations that exist in the current matching-based method (i.e., the mismatch problem and complicated fusion mechanisms), and then present a simple yet effective matching-based method, namely RefDetector. To tackle the above issues, we devise a simple heuristic rule to generate proposals with improved referent recall. Additionally, we introduce a straightforward vision-language interaction module that eliminates the need for intricate manually-designed mechanisms. Moreover, we explore visual grounding based on the modern detector DETR, and achieve significant performance improvement. Extensive experiments on three REC benchmark datasets, i.e., RefCOCO, RefCOCO+, and RefCOCOg, validate the effectiveness of the proposed method.

NeurIPS Conference 2025 Conference Paper

SAMPO: Scale-wise Autoregression with Motion Prompt for Generative World Models

  • Sen Wang
  • Jingyi Tian
  • Le Wang
  • Zhimin Liao
  • Huaiyi Dong
  • Kun Xia
  • Sanping Zhou
  • Wei Tang

World models allow agents to simulate the consequences of actions in imagined environments for planning, control, and long-horizon decision-making. However, existing autoregressive world models struggle with visually coherent predictions due to disrupted spatial structure, inefficient decoding, and inadequate motion modeling. In response, we propose Scale-wise Autoregression with Motion PrOmpt (SAMPO), a hybrid framework that combines visual autoregressive modeling for intra-frame generation with causal modeling for next-frame generation. Specifically, SAMPO integrates temporal causal decoding with bidirectional spatial attention, which preserves spatial locality and supports parallel decoding within each scale. This design significantly enhances both temporal consistency and rollout efficiency. To further improve dynamic scene understanding, we devise an asymmetric multi-scale tokenizer that preserves spatial details in observed frames and extracts compact dynamic representations for future frames, optimizing both memory usage and model performance. Additionally, we introduce a trajectory-aware motion prompt module that injects spatiotemporal cues about object and robot trajectories, focusing attention on dynamic regions and improving temporal consistency and physical realism. Extensive experiments show that SAMPO achieves competitive performance in action-conditioned video prediction and model-based control, improving generation quality with 4.4× faster inference. We also evaluate SAMPO's zero-shot generalization and scaling behavior, demonstrating its ability to generalize to unseen tasks and benefit from larger model sizes.

NeurIPS Conference 2024 Conference Paper

Referencing Where to Focus: Improving Visual Grounding with Referential Query

  • Yabing Wang
  • Zhuotao Tian
  • Qingpei Guo
  • Zheng Qin
  • Sanping Zhou
  • Ming Yang
  • Le Wang

Visual Grounding aims to localize the referring object in an image given a natural language expression. Recent advancements in DETR-based visual grounding methods have attracted considerable attention, as they directly predict the coordinates of the target object without relying on additional efforts, such as pre-generated proposal candidates or pre-defined anchor boxes. However, existing research primarily focuses on designing a stronger multi-modal decoder, which typically generates learnable queries by random initialization or by using linguistic embeddings. This vanilla query generation approach inevitably increases the learning difficulty for the model, as it does not involve any target-related information at the beginning of decoding. Furthermore, these methods only use the deepest image feature during the query learning process, overlooking the importance of features from other levels. To address these issues, we propose a novel approach, called RefFormer. It consists of a query adaption module that can be seamlessly integrated into CLIP and generates the referential query to provide prior context for the decoder, along with a task-specific decoder. By incorporating the referential query into the decoder, we can effectively mitigate the learning difficulty of the decoder and accurately concentrate on the target object. Additionally, our proposed query adaption module can also act as an adapter, preserving the rich knowledge within CLIP without the need to tune the parameters of the backbone network. Extensive experiments demonstrate the effectiveness and efficiency of our proposed method, outperforming state-of-the-art approaches on five visual grounding benchmarks.

AAAI Conference 2024 Conference Paper

Temporal Correlation Vision Transformer for Video Person Re-Identification

  • Pengfei Wu
  • Le Wang
  • Sanping Zhou
  • Gang Hua
  • Changyin Sun

Video Person Re-Identification (Re-ID) is a task of retrieving persons from multi-camera surveillance systems. Despite the progress made in leveraging spatio-temporal information in videos, occlusion in dense crowds still hinders further progress. To address this issue, we propose a Temporal Correlation Vision Transformer (TCViT) for video person Re-ID. TCViT consists of a Temporal Correlation Attention (TCA) module and a Learnable Temporal Aggregation (LTA) module. The TCA module is designed to reduce the impact of non-target persons by relative state, while the LTA module is used to aggregate frame-level features based on their completeness. Specifically, TCA is a parameter-free module that first aligns frame-level features to restore semantic coherence in videos and then enhances the features of the target person according to temporal correlation. Additionally, unlike previous methods that treat each frame equally with a pooling layer, LTA introduces a lightweight learnable module to weigh and aggregate frame-level features under the guidance of a classification score. Extensive experiments on four prevalent benchmarks demonstrate that our method achieves state-of-the-art performance in video Re-ID.
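
The score-guided aggregation attributed to LTA, weighting frame-level features by how confidently they are classified, can be sketched with a small module; the layer sizes, the use of raw frame logits as guidance, and the softmax weighting are assumptions rather than the paper's exact design.

import torch
import torch.nn as nn

class ScoreGuidedAggregation(nn.Module):
    """Hypothetical sketch: per-frame weights predicted from frame-level
    classification scores are used to average frame features into a clip feature."""
    def __init__(self, num_classes: int):
        super().__init__()
        self.weight_head = nn.Linear(num_classes, 1)

    def forward(self, frame_feats, frame_logits):
        # frame_feats: (B, T, D), frame_logits: (B, T, num_classes)
        w = torch.softmax(self.weight_head(frame_logits), dim=1)  # (B, T, 1) frame weights
        return (w * frame_feats).sum(dim=1)                        # (B, D) clip-level feature

feats, logits = torch.randn(4, 8, 256), torch.randn(4, 8, 751)
print(ScoreGuidedAggregation(751)(feats, logits).shape)  # torch.Size([4, 256])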

JBHI Journal 2023 Journal Article

E2EGI: End-to-End Gradient Inversion in Federated Learning

  • Zhaohua Li
  • Le Wang
  • Guangyao Chen
  • Zhiqiang Zhang
  • Muhammad Shafiq
  • Zhaoquan Gu

A plethora of healthcare data is produced every day due to the proliferation of prominent technologies such as the Internet of Medical Things (IoMT). Digital-driven smart devices like wearable watches, wristbands and bracelets are utilized extensively in modern healthcare applications. Mining valuable information from the data distributed at the owners' level is useful, but it is challenging to preserve data privacy. Federated learning (FL) has swiftly surged in popularity due to its efficacy in dealing with privacy vulnerabilities. Recent studies have demonstrated that Gradient Inversion Attacks (GIA) can reconstruct input data from leaked gradients, but previous work has only achieved this in very limited scenarios, such as when the label repetition rate of the target sample is low and batch sizes are smaller than 48. In this paper, a novel method of End-to-End Gradient Inversion (E2EGI) is proposed. Compared to the state-of-the-art method, E2EGI's Minimum Loss Combinatorial Optimization (MLCO) has the ability to realize reconstructed samples with higher similarity, and the Distributed Gradient Inversion algorithm can implement GIA with batch sizes of 8 to 256 on deep network models (such as ResNet-50) and ImageNet datasets. A new Label Reconstruction algorithm is developed that relies only on the gradient information of the target model, which can achieve a label reconstruction accuracy of 81% in one batch sample with a label repetition rate of 96%, a 27% improvement over the state-of-the-art method. This proposed work can underpin data security assessments for healthcare federated learning.
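
The gradient-matching loop that gradient inversion attacks build on can be sketched generically: optimize a dummy input so that its gradients match the leaked ones. The toy model, loss, optimizer, and the assumption that the label is already known are simplifications, and none of E2EGI's MLCO or distributed machinery is reproduced here.

import torch
import torch.nn as nn

model = nn.Sequential(nn.Flatten(), nn.Linear(32 * 32 * 3, 10))
criterion = nn.CrossEntropyLoss()

# "Leaked" gradients from one victim sample (synthetic here).
x_true = torch.randn(1, 3, 32, 32)
y_true = torch.tensor([3])
true_grads = torch.autograd.grad(criterion(model(x_true), y_true),
                                 model.parameters())

# Reconstruct: start from noise and minimize the gradient-matching loss.
x_dummy = torch.randn_like(x_true, requires_grad=True)
y_dummy = y_true  # assume the label is known / reconstructed separately
opt = torch.optim.Adam([x_dummy], lr=0.1)
for _ in range(200):
    opt.zero_grad()
    dummy_grads = torch.autograd.grad(criterion(model(x_dummy), y_dummy),
                                      model.parameters(), create_graph=True)
    loss = sum(((dg - tg) ** 2).sum() for dg, tg in zip(dummy_grads, true_grads))
    loss.backward()
    opt.step()
print(loss.item())  # gradient-matching loss after optimization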

AAAI Conference 2023 Conference Paper

Multi-Stream Representation Learning for Pedestrian Trajectory Prediction

  • Yuxuan Wu
  • Le Wang
  • Sanping Zhou
  • Jinghai Duan
  • Gang Hua
  • Wei Tang

Forecasting the future trajectory of pedestrians is an important task in computer vision with a range of applications, from security cameras to autonomous driving. It is very challenging because pedestrians not only move individually across time but also interact spatially, and the spatial and temporal information is deeply coupled with one another in a multi-agent scenario. Learning such complex spatio-temporal correlation is a fundamental issue in pedestrian trajectory prediction. Inspired by the way the hippocampus processes and integrates spatio-temporal information to form memories, we propose a novel multi-stream representation learning module to learn complex spatio-temporal features of pedestrian trajectories. Specifically, we learn temporal, spatial and cross spatio-temporal correlation features in three respective pathways and then adaptively integrate these features with learnable weights by a gated network. Besides, we leverage a sparse attention gate to select informative interactions and correlations brought by complex spatio-temporal modeling and reduce the complexity of our model. We evaluate our proposed method on two commonly used datasets, i.e., ETH-UCY and SDD, and the experimental results demonstrate that our method achieves state-of-the-art performance. Code: https://github.com/YuxuanIAIR/MSRL-master
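
The gated integration of the three pathway features described above can be sketched with a small module; the gate architecture and feature dimensions are illustrative assumptions, not the paper's exact design.

import torch
import torch.nn as nn

class GatedFusion(nn.Module):
    """Hypothetical sketch: softmax weights from a learned gate mix the
    temporal, spatial, and cross spatio-temporal pathway features."""
    def __init__(self, dim: int, num_streams: int = 3):
        super().__init__()
        self.gate = nn.Linear(num_streams * dim, num_streams)

    def forward(self, streams):
        # streams: list of (B, D) features, one per pathway
        stacked = torch.stack(streams, dim=1)                           # (B, S, D)
        weights = torch.softmax(self.gate(torch.cat(streams, -1)), -1)  # (B, S)
        return (weights.unsqueeze(-1) * stacked).sum(dim=1)             # (B, D)

streams = [torch.randn(4, 64) for _ in range(3)]
print(GatedFusion(64)(streams).shape)  # torch.Size([4, 64])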

AAAI Conference 2022 Conference Paper

Complementary Attention Gated Network for Pedestrian Trajectory Prediction

  • Jinghai Duan
  • Le Wang
  • Chengjiang Long
  • Sanping Zhou
  • Fang Zheng
  • Liushuai Shi
  • Gang Hua

Pedestrian trajectory prediction is crucial in many practical applications due to the diversity of pedestrian movements, such as social interactions and individual motion behaviors. With similar observable trajectories and social environments, different pedestrians may make completely different future decisions. However, most existing methods only focus on the frequent modal of the trajectory and thus are difficult to generalize to the peculiar scenario, which leads to a decline in multimodal fitting ability when facing similar scenarios. In this paper, we propose a complementary attention gated network (CAGN) for pedestrian trajectory prediction, in which a dual-path architecture including normal and inverse attention is proposed to capture both frequent and peculiar modals in spatial and temporal patterns, respectively. Specifically, a complementary block is proposed to guide normal and inverse attention, which are then summed with learnable weights by a gated network to obtain the attention features. Finally, multiple trajectory distributions are estimated based on the fused spatio-temporal attention features due to the multimodality of future trajectories. Experimental results on benchmark datasets, i.e., ETH and UCY, demonstrate that our method outperforms state-of-the-art methods by 13.8% in Average Displacement Error (ADE) and 10.4% in Final Displacement Error (FDE). Code will be available at https://github.com/jinghaiD/CAGN
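
A hypothetical sketch of the dual-path idea, one path following what standard attention highlights (the frequent modal) and the other emphasizing what it suppresses (the peculiar modal), mixed by a learned gate, is given below; the scoring function and gate form are assumptions, not CAGN's exact block.

import torch
import torch.nn as nn

class ComplementaryAttention(nn.Module):
    def __init__(self, dim: int):
        super().__init__()
        self.score = nn.Linear(dim, 1)
        self.gate = nn.Parameter(torch.zeros(1))  # mixing weight via sigmoid

    def forward(self, x):
        # x: (B, N, D) features over neighbors / time steps
        s = self.score(x)                          # (B, N, 1)
        normal = torch.softmax(s, dim=1)           # emphasizes high-scoring elements
        inverse = torch.softmax(-s, dim=1)         # emphasizes what normal attention suppresses
        g = torch.sigmoid(self.gate)
        attn = g * normal + (1 - g) * inverse
        return (attn * x).sum(dim=1)               # (B, D) aggregated feature

x = torch.randn(4, 8, 32)
print(ComplementaryAttention(32)(x).shape)  # torch.Size([4, 32])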

JBHI Journal 2022 Journal Article

Continuous Seizure Detection Based on Transformer and Long-Term iEEG

  • Yulin Sun
  • Weipeng Jin
  • Xiaopeng Si
  • Xingjian Zhang
  • Jiale Cao
  • Le Wang
  • Shaoya Yin
  • Dong Ming

Automatic seizure detection algorithms are necessary for patients with refractory epilepsy. Many excellent algorithms have achieved good results in seizure detection. Still, most of them are based on discontinuous intracranial electroencephalogram (iEEG) and ignore the impact of different channels on detection. This study aimed to evaluate the proposed algorithm using continuous, long-term iEEG to show its applicability in clinical routine. In this study, we introduced the ability of the transformer network to calculate the attention between the channels of input signals into seizure detection. We proposed an end-to-end model that included convolution and transformer layers. The model did not need feature engineering or format transformation of the original multi-channel time series. Through evaluation on two datasets, we demonstrated experimentally that the transformer layer could improve the performance of the seizure detection algorithm. For the SWEC-ETHZ iEEG dataset, we achieved 97.5% event-based sensitivity, 0.06/h FDR, and 13.7 s latency. For the TJU-HH iEEG dataset, we achieved 98.1% event-based sensitivity, 0.22/h FDR, and 9.9 s latency. In addition, statistics showed that the model allocated more attention to the channels close to the seizure onset zone within 20 s after the seizure onset, which improved the explainability of the model. This paper provides a new method to improve the performance and explainability of automatic seizure detection.
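
A generic convolution-plus-transformer layout along the lines the abstract describes is sketched below. Note the hedges: the paper emphasizes attention between channels, whereas this sketch applies self-attention over time steps after temporal convolution, and every hyperparameter (kernel sizes, depth, the mean-pooled binary readout) is an assumption rather than the paper's configuration.

import torch
import torch.nn as nn

class ConvTransformerDetector(nn.Module):
    def __init__(self, n_channels: int, d_model: int = 128):
        super().__init__()
        self.conv = nn.Sequential(
            nn.Conv1d(n_channels, d_model, kernel_size=7, stride=2, padding=3),
            nn.ReLU(),
            nn.Conv1d(d_model, d_model, kernel_size=7, stride=2, padding=3),
            nn.ReLU(),
        )
        layer = nn.TransformerEncoderLayer(d_model, nhead=8, batch_first=True)
        self.encoder = nn.TransformerEncoder(layer, num_layers=2)
        self.head = nn.Linear(d_model, 2)  # seizure / non-seizure

    def forward(self, x):
        # x: (B, n_channels, T) raw multi-channel iEEG window
        h = self.conv(x).transpose(1, 2)   # (B, T', d_model)
        h = self.encoder(h).mean(dim=1)    # temporal average pooling
        return self.head(h)

x = torch.randn(2, 64, 1024)                 # 64 electrodes, 1024 samples
print(ConvTransformerDetector(64)(x).shape)  # torch.Size([2, 2])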

AAAI Conference 2022 Conference Paper

Learning Disentangled Classification and Localization Representations for Temporal Action Localization

  • Zixin Zhu
  • Le Wang
  • Wei Tang
  • Ziyi Liu
  • Nanning Zheng
  • Gang Hua

A common approach to Temporal Action Localization (TAL) is to generate action proposals and then perform action classification and localization on them. For each proposal, existing methods universally use a shared proposal-level representation for both tasks. However, our analysis indicates that this shared representation focuses on the most discriminative frames for classification, e.g., "take-offs" rather than "run-ups" in distinguishing "high jump" and "long jump", while frames most relevant to localization, such as the start and end frames of an action, are largely ignored. In other words, such a shared representation cannot simultaneously handle both classification and localization tasks well, and it makes precise TAL difficult. To address this challenge, this paper disentangles the shared representation into classification and localization representations. The disentangled classification representation focuses on the most discriminative frames, and the disentangled localization representation focuses on the action phase as well as the action start and end. Our model can be divided into two sub-networks, i.e., the disentanglement network and the context-based aggregation network. The disentanglement network is an autoencoder that learns orthogonal hidden variables for classification and localization. The context-based aggregation network aggregates the classification and localization representations by modeling local and global contexts. We evaluate our proposed method on two popular benchmarks for TAL, where it outperforms all state-of-the-art methods.

AAAI Conference 2022 Conference Paper

Social Interpretable Tree for Pedestrian Trajectory Prediction

  • Liushuai Shi
  • Le Wang
  • Chengjiang Long
  • Sanping Zhou
  • Fang Zheng
  • Nanning Zheng
  • Gang Hua

Understanding the multiple socially-acceptable future behaviors is an essential task for many vision applications. In this paper, we propose a tree-based method, termed Social Interpretable Tree (SIT), to address this multi-modal prediction task, where a hand-crafted tree is built from the prior information of the observed trajectory to model multiple future trajectories. Specifically, a path in the tree from the root to a leaf represents an individual possible future trajectory. SIT employs a coarse-to-fine optimization strategy, in which the tree is first built by high-order velocity to balance the complexity and coverage of the tree and then optimized greedily to encourage multimodality. Finally, a teacher-forcing refining operation is used to predict the final fine trajectory. Compared with prior methods which leverage implicit latent variables to represent possible future trajectories, the path in the tree can explicitly explain the rough moving behaviors (e.g., go straight and then turn right), and thus provides better interpretability. Despite the hand-crafted tree, the experimental results on the ETH-UCY and Stanford Drone datasets demonstrate that our method is capable of matching or exceeding the performance of state-of-the-art methods. Interestingly, the experiments show that the raw built tree without training outperforms many prior deep neural network based approaches. Meanwhile, our method presents sufficient flexibility in long-term prediction and different best-of-K predictions.

AAAI Conference 2021 Conference Paper

ACSNet: Action-Context Separation Network for Weakly Supervised Temporal Action Localization

  • Ziyi Liu
  • Le Wang
  • Qilin Zhang
  • Wei Tang
  • Junsong Yuan
  • Nanning Zheng
  • Gang Hua

The goal of Weakly-supervised Temporal Action Localization (WS-TAL) is to localize all action instances in an untrimmed video with only video-level supervision. Due to the lack of frame-level annotations during training, current WS-TAL methods rely on attention mechanisms to localize the foreground snippets or frames that contribute to the video-level classification task. This strategy frequently confuses context with the actual action in the localization results. Separating action and context is a core problem for precise WS-TAL, but it is very challenging and has been largely ignored in the literature. In this paper, we introduce an Action-Context Separation Network (ACSNet) that explicitly takes context into account for accurate action localization. It consists of two branches (i.e., the Foreground-Background branch and the Action-Context branch). The Foreground-Background branch first distinguishes foreground from background within the entire video, while the Action-Context branch further separates the foreground into action and context. We associate video snippets with two latent components (i.e., a positive component and a negative component), and their different combinations can effectively characterize foreground, action and context. Furthermore, we introduce extended labels with auxiliary context categories to facilitate the learning of action-context separation. Experiments on THUMOS14 and ActivityNet v1.2/v1.3 datasets demonstrate that ACSNet outperforms existing state-of-the-art WS-TAL methods by a large margin.
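
The attention mechanism that, per the abstract, current WS-TAL methods rely on can be sketched as a simple attention-MIL classifier over snippet features; this illustrates the common baseline the paper improves on, not ACSNet's action-context separation, and the dimensions are assumptions.

import torch
import torch.nn as nn

class AttentionMIL(nn.Module):
    """Snippet-level attention pools snippet features into a video-level
    prediction trained with only video-level labels (the generic WS-TAL baseline)."""
    def __init__(self, dim: int, num_classes: int):
        super().__init__()
        self.attn = nn.Linear(dim, 1)
        self.cls = nn.Linear(dim, num_classes)

    def forward(self, snippets):
        # snippets: (B, T, D) per-snippet features of an untrimmed video
        a = torch.softmax(self.attn(snippets), dim=1)   # (B, T, 1) foreground attention
        video_feat = (a * snippets).sum(dim=1)          # (B, D)
        return self.cls(video_feat), a.squeeze(-1)      # video-level logits, attention

snippets = torch.randn(2, 100, 2048)
logits, attn = AttentionMIL(2048, 20)(snippets)
print(logits.shape, attn.shape)  # torch.Size([2, 20]) torch.Size([2, 100])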

AAAI Conference 2021 Conference Paper

Weakly Supervised Temporal Action Localization Through Learning Explicit Subspaces for Action and Context

  • Ziyi Liu
  • Le Wang
  • Wei Tang
  • Junsong Yuan
  • Nanning Zheng
  • Gang Hua

Weakly-supervised Temporal Action Localization (WS-TAL) methods learn to localize the temporal starts and ends of action instances in a video under only video-level supervision. Existing WS-TAL methods rely on deep features learned for action recognition. However, due to the mismatch between classification and localization, these features cannot distinguish the frequently co-occurring contextual background, i.e., the context, from the actual action instances. We term this challenge action-context confusion, and it adversely affects action localization accuracy. To address this challenge, we introduce a framework that learns two feature subspaces, respectively for actions and their context. By explicitly accounting for action visual elements, the action instances can be localized more precisely without distraction from the context. To facilitate the learning of these two feature subspaces with only video-level categorical labels, we leverage the predictions from both spatial and temporal streams for snippet grouping. In addition, an unsupervised learning task is introduced to make the proposed module focus on mining temporal information. The proposed approach outperforms state-of-the-art WS-TAL methods on three benchmarks, i.e., the THUMOS14, ActivityNet v1.2 and v1.3 datasets.

IS Journal 2020 Journal Article

Joint Intelligence Ranking by Federated Multiplicative Update

  • Chi Zhang
  • Yu Liu
  • Le Wang
  • Yuehu Liu
  • Li Li
  • Nanning Zheng

The joint intelligence ranking of intelligent systems like autonomous driving is of great importance for building a more general, extensive, and universally accepted intelligence evaluation scheme. However, due to issues such as privacy security and industry or area competition, the integration of isolated test results may face considerable difficulty in information security and encrypted model training. To address this, we derive the federated multiplicative update (FMU) algorithm with boundary constraints to solve the nonnegative matrix factorization based joint intelligence ranking. An encrypted learning process is developed to replace the original computation steps in multiplicative update algorithms. With favorable properties for fast convergence and the secure exchange of variables, the proposed framework outperforms previous work on both real and simulated data. Further experimental analysis reveals that the introduced federated mechanism does not harm overall time efficiency.
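
The multiplicative updates that FMU federates are, in their centralized form, the standard Lee-Seung rules for nonnegative matrix factorization. The sketch below shows only those baseline updates; the encrypted, federated exchange of intermediate terms that is the paper's contribution is not reproduced here.

import numpy as np

def nmf_multiplicative(V, rank, iters=200, eps=1e-9):
    # Factor a nonnegative matrix V (m x n) as W @ H, W (m x rank), H (rank x n),
    # using the classical multiplicative updates for the Frobenius objective.
    m, n = V.shape
    rng = np.random.default_rng(0)
    W, H = rng.random((m, rank)), rng.random((rank, n))
    for _ in range(iters):
        H *= (W.T @ V) / (W.T @ W @ H + eps)
        W *= (V @ H.T) / (W @ H @ H.T + eps)
    return W, H

V = np.abs(np.random.default_rng(1).random((30, 20)))
W, H = nmf_multiplicative(V, rank=5)
print(np.linalg.norm(V - W @ H) / np.linalg.norm(V))  # small relative reconstruction error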

AAAI Conference 2020 Conference Paper

Ladder Loss for Coherent Visual-Semantic Embedding

  • Mo Zhou
  • Zhenxing Niu
  • Le Wang
  • Zhanning Gao
  • Qilin Zhang
  • Gang Hua

For visual-semantic embedding, the existing methods normally treat the relevance between queries and candidates in a bipolar way – relevant or irrelevant, and all “irrelevant” candidates are uniformly pushed away from the query by an equal margin in the embedding space, regardless of their various proximity to the query. This practice disregards relatively discriminative information and could lead to suboptimal ranking in the retrieval results and poorer user experience, especially in the long-tail query scenario where a matching candidate may not necessarily exist. In this paper, we introduce a continuous variable to model the relevance degree between queries and multiple candidates, and propose to learn a coherent embedding space, where candidates with higher relevance degrees are mapped closer to the query than those with lower relevance degrees. In particular, the new ladder loss is proposed by extending the triplet loss inequality to a more general inequality chain, which implements variable push-away margins according to respective relevance degrees. In addition, a proper Coherent Score metric is proposed to better measure the ranking results including those “irrelevant” candidates. Extensive experiments on multiple datasets validate the efficacy of our proposed method, which achieves significant improvement over existing state-of-the-art methods.
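
The extension of the triplet inequality into a chain with level-dependent margins can be illustrated with a small sketch. Grouping candidates into discrete relevance levels, using cosine distance, and penalizing only consecutive levels are simplifying assumptions; the paper's actual ladder loss and Coherent Score metric are not reproduced here.

import torch
import torch.nn.functional as F

def ladder_loss(query, cand, levels, base_margin=0.2):
    # query: (D,) embedding; cand: (N, D) candidate embeddings
    # levels: (N,) integer relevance levels, 0 = most relevant
    d = 1 - F.cosine_similarity(query.unsqueeze(0), cand)  # (N,) cosine distances
    loss = query.new_zeros(())
    for l in range(int(levels.max())):
        closer, farther = d[levels == l], d[levels == l + 1]
        if len(closer) == 0 or len(farther) == 0:
            continue
        margin = base_margin * (l + 1)  # larger push-away margin for less relevant levels
        # every (closer, farther) pair must be separated by this level's margin
        loss = loss + torch.relu(closer[:, None] - farther[None, :] + margin).mean()
    return loss

q = torch.randn(128)
c = torch.randn(10, 128)
lv = torch.tensor([0, 0, 1, 1, 1, 2, 2, 2, 2, 2])
print(ladder_loss(q, c, lv))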

AAAI Conference 2019 Conference Paper

Video Imprint Segmentation for Temporal Action Detection in Untrimmed Videos

  • Zhanning Gao
  • Le Wang
  • Qilin Zhang
  • Zhenxing Niu
  • Nanning Zheng
  • Gang Hua

We propose a temporal action detection by spatial segmentation framework, which simultaneously categorizes actions and temporally localizes action instances in untrimmed videos. The core idea is the conversion of the temporal detection task into a spatial semantic segmentation task. First, the video imprint representation is employed to capture the spatial/temporal interdependences within/among frames and represent them as spatial proximity in a feature space. Subsequently, the obtained imprint representation is spatially segmented by a fully convolutional network. With such segmentation labels projected back to the video space, both temporal action boundary localization and per-frame spatial annotation can be obtained simultaneously. The proposed framework is robust to the variable lengths of untrimmed videos due to the underlying fixed-size imprint representation. The efficacy of the framework is validated on two public action detection datasets.