Arrow Research search

Author name cluster

Meng Wang

Possible papers associated with this exact author name in Arrow. This page groups case-insensitive exact name matches and is not a full identity disambiguation profile.

98 papers
2 author rows

Possible papers

AAAI Conference 2026 Conference Paper

AdaptCLIP: Adapting CLIP for Universal Visual Anomaly Detection

  • Bin-Bin Gao
  • Yue Zhou
  • Jiangtao Yan
  • Yuezhi Cai
  • Weixi Zhang
  • Meng Wang
  • Jun Liu
  • Yong Liu

Universal visual anomaly detection aims to identify anomalies from novel or unseen vision domains without additional fine-tuning, which is critical in open scenarios. Recent studies have demonstrated that pre-trained vision-language models like CLIP exhibit strong generalization with just zero or a few normal images. However, existing methods struggle to design prompt templates, handle complex token interactions, or require fine-tuning on target domains, resulting in limited flexibility. In this work, we present a simple yet effective AdaptCLIP based on two key insights. First, adaptive visual and textual representations should be learned alternately rather than jointly. Second, comparative learning between query and normal image prompt should incorporate both contextual and aligned residual features, rather than relying solely on residual features. AdaptCLIP treats CLIP models as a foundational service, adding only three simple adapters, visual adapter, textual adapter, and prompt-query adapter, at its input or output ends. AdaptCLIP supports zero-/few-shot generalization across domains and provides a training-free approach on target domains once trained on a base dataset. AdaptCLIP achieves state-of-the-art performance on 12 anomaly detection benchmarks from industrial and medical domains, significantly outperforming existing competitive methods.

EAAI Journal 2026 Journal Article

An electroencephalogram signal analysis method based on dual self-supervised graph diffusion recurrent network

  • Sunan Ge
  • Shuang Wang
  • Rui Zhang
  • Xueqing Zhao
  • Xinshi
  • Meng Wang
  • Tao Wu

Neurological disease diagnosis and emotion recognition based on electroencephalogram (EEG) signals have been widely applied in numerous fields by revealing the complex operational mechanisms of the human brain. However, existing EEG signal analysis methods are hindered by label noise and the limited scale of labeled data samples, making it difficult to effectively learn the distribution characteristics of the data and identify the heterogeneity of EEG signals. Therefore, this paper proposes a dual self-supervised graph diffusion recurrent network (DSGDRN) method for representation learning of unlabeled EEG signals, reducing biases and noise effects caused by manual annotation and improving the ability to recognize individual differences. First, to capture the natural geometric features of EEG signals and the dynamic connection information within the brain, distance graph structures and correlation graph structures are respectively used for feature expression. Second, a dual self-supervised algorithm is employed to represent hidden states as a learnable function, enhancing the expressive power of the graph recurrent diffusion network and its ability to recognize contextual information. Finally, during the testing process, a dual learning strategy with continuous adaptive adjustment of hidden state parameters is adopted to improve the applicability of EEG signal analysis in real-world scenarios. Experimental results demonstrate that, compared with existing methods, the proposed method exhibits superior performance in neurological disease diagnosis and emotion detection, indicating its effective representation learning capabilities in fields such as neurological disease analysis and emotion recognition.

AAAI Conference 2026 Conference Paper

Benchmarking Trustworthiness in Multimodal LLMs for Video Understanding

  • Youze Wang
  • Zijun Chen
  • Ruoyu Chen
  • Shishen Gu
  • Wenbo Hu
  • Jiayang Liu
  • Yinpeng Dong
  • Hang Su

Recent advancements in multimodal large language models for video understanding (videoLLMs) have enhanced their capacity to process complex spatiotemporal data. However, challenges such as factual inaccuracies, harmful content, biases, hallucinations, and privacy risks compromise their reliability. This study introduces Trust-videoLLMs, the first comprehensive benchmark evaluating 23 state-of-the-art videoLLMs (5 commercial, 18 open-source) across five critical dimensions: truthfulness, robustness, safety, fairness, and privacy. Comprising 30 tasks with adapted, synthetic, and annotated videos, the framework assesses spatiotemporal risks, temporal consistency and cross-modal impact. Results reveal significant limitations in dynamic scene comprehension, cross-modal perturbation resilience and real-world risk mitigation. While open-source models occasionally outperform, proprietary models generally exhibit superior credibility, though scaling does not consistently improve performance. These findings underscore the need for enhanced training data diversity and robust multimodal alignment. Trust-videoLLMs provides a publicly available, extensible toolkit for standardized trustworthiness assessments, addressing the critical gap between accuracy-focused benchmarks and demands for robustness, safety, fairness, and privacy.

AAAI Conference 2026 Conference Paper

FilmWeaver: Weaving Consistent Multi-Shot Videos with Cache-Guided Autoregressive Diffusion

  • Xiangyang Luo
  • Qingyu Li
  • Xiaokun Liu
  • Wenyu Qin
  • Miao Yang
  • Meng Wang
  • Pengfei Wan
  • Di Zhang

Current video generation models perform well at single-shot synthesis but struggle with multi-shot videos, facing critical challenges in maintaining character and background consistency across shots and flexibly generating videos of arbitrary length and shot count. To address these limitations, we introduce FilmWeaver, a novel framework designed to generate consistent, multi-shot videos of arbitrary length. First, it employs an autoregressive diffusion paradigm to achieve arbitrary-length video generation. To address the challenge of consistency, our key insight is to decouple the problem into inter-shot consistency and intra-shot coherence. We achieve this through a dual-level cache mechanism: a shot memory caches keyframes from preceding shots to maintain character and scene identity, while a temporal memory retains a history of frames from the current shot to ensure smooth, continuous motion. The proposed framework allows for flexible, multi-round user interaction to create multi-shot videos. Furthermore, due to this decoupled design, our method demonstrates high versatility by supporting downstream tasks such as multi-concept injection and video extension. To facilitate the training of our consistency-aware method, we also developed a comprehensive pipeline to construct a high-quality multi-shot video dataset. Extensive experimental results demonstrate that our method surpasses existing approaches on metrics for both consistency and aesthetic quality, opening up new possibilities for creating more consistent, controllable, and narrative-driven video content.

AAAI Conference 2026 Conference Paper

RecCocktail: A Generalizable and Efficient Framework for LLM-Based Recommendation

  • Min Hou
  • Chenxi Bai
  • Le Wu
  • Hao Liu
  • Kai Zhang
  • Weiwen Liu
  • Richang Hong
  • Ruiming Tang

Large Language Models (LLMs) have achieved remarkable success in recent years, owing to their impressive generalization capabilities and rich world knowledge. To capitalize on the potential of using LLMs as recommender systems, mainstream approaches typically focus on two paradigms. The first paradigm designs multi-domain or multi-task instruction data for generalizable recommendation, so as to align LLMs with general recommendation areas and deal with cold-start recommendation. The second paradigm focuses on enhancing domain-specific recommendation tasks, improving performance in warm recommendation scenarios. While most previous works treat these two paradigms separately, we argue that they have complementary advantages, and combining them can yield better results. In this paper, we propose a generalizable and efficient LLM-based recommendation framework, RecCocktail. Our approach begins with fine-tuning a "base spirit" LoRA module using domain-general recommendation instruction data to align the LLM with recommendation knowledge. Next, given users' behavior in a specific domain, we construct a domain-specific "ingredient" LoRA module. We then provide an entropy-guided adaptive merging method to mix the "base spirit" and the "ingredient" in the weight space. Notably, RecCocktail combines the advantages of the two existing paradigms without introducing additional time or space overhead during the inference phase. Moreover, RecCocktail is efficient and plug-and-play, as the "base spirit" LoRA is trained only once, and any domain-specific "ingredient" can be efficiently mixed in with only domain-specific fine-tuning. Extensive experiments on multiple datasets under both warm and cold-start recommendation scenarios validate the effectiveness and generality of the proposed RecCocktail.

EAAI Journal 2026 Journal Article

Robust traffic sign detection in real-world harsh conditions: A pioneering benchmark dataset and attention-based methodology

  • Fengping Wang
  • Jie Bai
  • Meng Wang
  • Baobao Liu
  • Haiwei Xue
  • Jun Chen

Traffic sign detection systems suffer from dramatic performance deterioration under harsh environmental conditions such as rain, snow, fog, and glare, thereby exposing a crucial vulnerability in model generalization and safety-critical autonomous systems. This paper proposes Harsh Environment Robust Attention (HERA), a novel image restoration mechanism specifically devised for challenging environmental scenarios. Firstly, we introduce an Adaptive Residual Block that dynamically adjusts channel weights to enhance restoration capability for blurred and low-contrast images. Secondly, a hybrid attention mechanism integrating Channel Attention (CA) and Pixel Attention (PA) layers achieves comprehensive global-local feature fusion, thereby enhancing feature representation in target regions. Thirdly, a Multi-Scale Feature Fusion Module is employed to capture semantic and texture information across different scales, substantially boosting detection robustness in harsh conditions. To address domain-specific data scarcity hindering evaluation of traffic sign detection models in adverse environments, we present Traffic signs in Harsh envirOnments for Recognition (THOR), to our knowledge the first-of-its-kind benchmark dataset specifically curated to bridge this research gap. This benchmark comprises 2963 annotated real-world images across four harsh environments: rain, snow, fog, and glare. Experimental results on the THOR dataset show that integrating HERA into multiple You Only Look Once (YOLO) backbones consistently improves detection performance under adverse weather conditions, achieving up to 9.7% mean Average Precision (mAP) improvement on YOLO version 12 (YOLOv12) and comparable gains across other models. The THOR dataset and HERA implementation are publicly available at: https://github.com/Violet-nizikou/Robust-Traffic-Sign-Detection-in-Real-World-Harsh-Conditions.

AAAI Conference 2026 Conference Paper

Semantic Alignment of Malicious Question Based on Contrastive Semantic Networks and Data Augmentation (Abstract Reprint)

  • Xinyan Wang
  • Jinshuo Liu
  • Juan Deng
  • Meng Wang
  • Qian Deng
  • Youcheng Yan
  • Lina Wang
  • Yunsong Ma

The identification and filtration of malicious texts in social media environments represent a significant technical challenge aimed at protecting users from online violence and disinformation. This complexity stems from the diversity and innovativeness of social media texts, which include unique expressions and special sentence structures. Particularly, malicious texts in interrogative forms pose alignment challenges with traditional corpora due to existing methods’ failure to exploit the text’s deep global semantic representations. This issue is compounded by the scant research on Chinese texts, leading to inefficiencies in recognition accuracy. To mitigate these challenges, we introduce an innovative framework based on a Global Contrastive Semantic Network (GCSN), designed to enhance malicious text recognition efficiency and accuracy by deeply learning global semantic knowledge. It comprises an encoder for global semantic information modelling and a graph-matching network for semantic similarity evaluation between question pairs, enabling the accurate identification and filtering of malicious texts with complex structures. Furthermore, we introduce a semantic consistency-based data augmentation method (COMBINE), using real-world data to generate balanced positive and negative samples, enriching the dataset and enhancing the model’s ability to distinguish semantic consistency through contrastive learning. Experimental validation on two Chinese datasets demonstrates our model’s exceptional performance, affirming its application value in social media malicious text recognition. Our code is available at https://github.com/Wxy13131313131/GCSN-COMBINE

AAAI Conference 2026 Conference Paper

Sparse-Scale Transformer with Bidirectional Awareness for Time Series Forecasting

  • Ying Liu
  • Bo Liu
  • Sheng Huang
  • Gang Luo
  • Wenbo Hu
  • Meng Wang
  • Richang Hong

Time series forecasting (TSF) plays a crucial role in many real-world applications, such as weather prediction and economic planning. While Transformer-based models have shown strong capabilities in modeling long-range dependencies, effectively capturing the multi-scale temporal dynamics inherent in time series remains a major challenge. Existing methods often adopt time-windows of varying sizes, which may introduce noisy or irrelevant representations when mismatched with the underlying temporal patterns, potentially leading to overfitting. In this paper, we propose Sparse-Scale Transformer (SSformer) with Bidirectional Awareness for Time Series Forecasting to enhance the multi-scale modeling for time series. Specifically, we propose a novel Sparse-Scale Convolution (SSC) block that imposes sparsity on scales to obtain the informative representations by evaluating the intra-scale segment similarity of time series, and utilizes scale-specific convolutions to extract local patterns. Furthermore, we design a Bidirectional-Scale Interaction (BSI) block to explicitly model scale correlations in both coarse-to-fine and fine-to-coarse directions. Finally, scale predictions are ensembled to fully exploit the complementary forecasting capabilities across scales. Extensive experiments on various real-world datasets demonstrate that SSformer achieves state-of-the-art performance with superior efficiency.

AAAI Conference 2026 Conference Paper

Towards Non-Stationary Time Series Forecasting with Temporal Stabilization and Frequency Differencing

  • Junkai Lu
  • Peng Chen
  • Chenjuan Guo
  • Yang Shu
  • Meng Wang
  • Bin Yang

Time series forecasting is critical for decision making across dynamic domains such as energy, finance, transportation, and cloud computing. However, real-world time series often exhibit non-stationarity, including temporal distribution shifts and spectral variability, which poses significant challenges for existing long-term time series forecasting methods. In this paper, we propose DTAF, a dual-branch framework that addresses non-stationarity in both the temporal and frequency domains. For the temporal domain, the Temporal Stabilizing Fusion (TFS) module employs a non-stationary mix of experts (MOE) filter to disentangle and suppress temporal non-stationary patterns while preserving long-term dependencies. For the frequency domain, the Frequency Wave Modeling (FWM) module applies frequency differencing to dynamically highlight components with significant spectral shifts. By fusing the complementary outputs of TFS and FWM, DTAF generates robust forecasts that adapt to both temporal and frequency domain non-stationarity. Extensive experiments on multiple real-world benchmarks demonstrate that DTAF outperforms state-of-the-art baselines, yielding significant improvements in forecasting accuracy under non-stationary conditions.

NeurIPS Conference 2025 Conference Paper

3D Gaussian Splatting based Scene-independent Relocalization with Unidirectional and Bidirectional Feature Fusion

  • Junyi Wang
  • Yuze Wang
  • Wantong Duan
  • Meng Wang
  • Yue Qi

Visual localization is a critical component across various domains. The recent emergence of novel scene representations, such as 3D Gaussian Splatting (3D GS), introduces new opportunities for advancing localization pipelines. In this paper, we propose a novel 3D GS-based framework for RGB-based, scene-independent camera relocalization, with three main contributions. First, we design a two-stage pipeline that fully exploits 3D GS. The pipeline consists of an initial stage, which utilizes 2D-3D correspondences between image pixels and 3D Gaussians, followed by pose refinement using the image rendered by 3D GS. Second, we introduce a 3D GS-based Relocalization Network, termed GS-RelocNet, to establish correspondences for initial camera pose estimation. Additionally, we present a refinement network that further optimizes the camera pose. Third, we propose a unidirectional 2D-3D feature fusion module and a bidirectional image feature fusion module, integrated into GS-RelocNet and the refinement network, respectively, to enhance feature sharing across the two stages. Experimental results on the public 7 Scenes, Cambridge Landmarks, TUM RGB-D and Bonn datasets demonstrate state-of-the-art performance. Furthermore, the beneficial effects of the two feature fusion modules and pose refinement are also highlighted. In summary, we believe that the proposed framework can be a novel universal localization pipeline for further research.

AIIM Journal 2025 Journal Article

A cell-interacting and multi-correcting method for automatic circulating tumor cells detection

  • Xuan Zhang
  • Rensheng Lai
  • Ling Bai
  • Jianxin Ji
  • Ruihao Qin
  • Lihong Jiang
  • Bin Meng
  • Ying Zhang

Sensitive detection of circulating tumor cells (CTCs) from peripheral blood can serve as an effective tool in the early diagnosis and prognosis of cancer. Many methods based on modern object detectors have been proposed in recent years for automatic abnormal cell detection in slide images. Although these methods can also be applied to CTC detection, several practical difficulties lead to suboptimal performance, such as accurately capturing CTCs among a large number of mixed cells and distinguishing CTCs from CTC-like cells with similar visual characteristics. Here, we develop a new cell-interacting and multi-correcting detector called CMD, and apply H&E-stained slide images to detect CTCs automatically for the first time. Specifically, the proposed method incorporates two task-oriented novel modules: (1) a self-attention module for aggregating feature interactions between cells and allowing the model to pay more attention to key abnormal cells, (2) a hard sample mining sampler for progressively correcting predictions of cells with ambiguous classification boundaries. Experiments conducted on a multi-center dataset of 1247 annotated slide images confirm the superiority of our method over state-of-the-art cell detection methods. The results of the ablation experiments also confirm the effectiveness of the two modules. The source code of this paper is available at https://github.com/zx333445/CMD.

AAAI Conference 2025 Conference Paper

Cognitive Bias and Reassignment: Who Can Contribute High Quality LLM Data

  • Yunfan Gao
  • Yun Xiong
  • Zhongyuan Hu
  • Yiming Zhang
  • Meng Wang
  • Haofen Wang

In recent years, the rapid development of Large Language Models has highlighted the urgent need for large-scale, high-quality, and diverse data. We have launched an LLM data co-creation platform aimed at bringing together a wide range of participants to contribute data. Within six months, the platform has attracted over 10,000 participants who contributed more than 150,000 data entries across more than 200 tasks. An observable user cohort was constructed around the question, "Who is the best data contributor?" along with sub-questions concerning user preferences, task competence, and more. Through a detailed analysis of data contributors, this paper reveals several data collection patterns related to human factors. It reveals that contributors who provide high-quality data often do not meet initial expectations, as their behavior exhibits typical characteristics of the Dunning-Kruger effect. This paper examined the cognitive bias between users' self-assessment and actual abilities, where individuals tend to overestimate their capabilities in certain tasks, leading to a decreased willingness to continue contributing and a consequent waste of human resources. To address this issue, we propose a task reassignment method based on multi-task fine-tuning of small language models (SLMs) to better align user groups with appropriate task types. After the reallocation, we observed a significant increase in user engagement and platform benefits, along with improved overall platform efficiency. The versatility of this method makes it applicable to broader data collection scenarios.

NeurIPS Conference 2025 Conference Paper

Contrastive Learning with Data Misalignment: Feature Purity, Training Dynamics and Theoretical Generalization Guarantees

  • Jiawei Sun
  • Shuai Zhang
  • Hongkang Li
  • Meng Wang

Contrastive learning is a powerful framework for learning discriminative representations from image-text pairs. Despite its success, its theoretical foundations, especially when the image-text pair exhibits misalignment, remain underexplored. This paper provides the first theoretical analysis of contrastive learning under data misalignment, proving how the ground-truth modality-paired features are amplified while spurious features are suppressed through the training dynamics analysis. Specifically, we study two nonlinear encoders trained jointly with a contrastive loss and demonstrate that noisy (or misaligned) data pairs result in mixed representations and degrade the model's generalization ability. In contrast, recaptioning and filtering improve the data alignment, which in turn purifies the features learned by neurons and subsequently enhances generalization. Our analysis identifies feature purity as a key factor in the success of contrastive learning and offers insights into how data quality and training procedures impact representation learning and downstream generalization. Theoretical insights are supported by experiments on standard benchmarks.

NeurIPS Conference 2025 Conference Paper

EgoBlind: Towards Egocentric Visual Assistance for the Blind

  • Junbin Xiao
  • Nanxin Huang
  • Hao Qiu
  • Zhulin Tao
  • Xun Yang
  • Richang Hong
  • Meng Wang
  • Angela Yao

We present EgoBlind, the first egocentric VideoQA dataset collected from blind individuals to evaluate the assistive capabilities of contemporary multimodal large language models (MLLMs). EgoBlind comprises 1,392 first-person videos from the daily lives of blind and visually impaired individuals. It also features 5,311 questions directly posed or verified by the blind to reflect their in-situation needs for visual assistance. Each question has an average of 3 manually annotated reference answers to reduce subjectiveness. Using EgoBlind, we comprehensively evaluate 16 advanced MLLMs and find that all models struggle. The best performers achieve an accuracy near 60%, which is far behind human performance of 87.4%. To guide future advancements, we identify and summarize major limitations of existing MLLMs in egocentric visual assistance for the blind and explore heuristic solutions for improvement. With these efforts, we hope that EgoBlind will serve as a foundation for developing effective AI assistants to enhance the independence of the blind and visually impaired. Data and code are available at https://github.com/doc-doc/EgoBlind.

AAAI Conference 2025 Conference Paper

FakeDiffer: Distributional Disparity Learning on Differentiated Reconstruction for Face Forgery Detection

  • Bo Wang
  • Zhao Zhang
  • Suiyi Zhao
  • Xianming Ye
  • Haijun Zhang
  • Meng Wang

Existing face forgery detection methods achieve promising performance when training and testing forgery data are from identical manipulation types, while they fail to generalize well to unseen samples. In this paper, we experimentally investigate and find that the poor generalization of the methods mainly arises from their overfitting on the known fake patterns. Excessively focused on seen fakes, those detectors fail to effectively learn image-intrinsic information and the distributional disparity between real and fake images. To address this issue, we redefine fake learning as real-fake distributional disparity learning. We propose a novel deepfake detection framework that learns distributional disparity based on the differentiated reconstruction of real and fake images for improved generalization. Specifically, distributional disparity learning on differentiated reconstruction of the real and fake images forces the model to learn image-invariant intrinsic representations. The reconstruction of real and fake images forces the decoders to learn the distribution of real and fake images, respectively. Moreover, to avoid the influence of specialization on the known fake patterns, we further propose information interaction learning on the encoded intrinsic information and the pixel disparity between the input image and its reconstruction to distinguish face forgeries, even unknown ones. Extensive experiments on large-scale benchmark datasets demonstrate the effectiveness of addressing the overfitting issue of the classification network, and verify the superior performance of our method.

IJCAI Conference 2025 Conference Paper

From End-to-end to Step-by-step: Learning to Abstract via Abductive Reinforcement Learning

  • Zilong Wang
  • Jiongda Wang
  • Xiaoyong Chen
  • Meng Wang
  • Ming Ma
  • Zhipeng Wang
  • Zhenyu Zhou
  • Tianming Yang

Abstraction is a critical technique in general problem-solving, allowing complex tasks to be decomposed into smaller, manageable sub-tasks. While traditional symbolic planning relies on predefined primitive symbols to construct structured abstractions, its reliance on formal representations limits applicability to real-world tasks. On the other hand, reinforcement learning excels at learning end-to-end policies directly from sensory inputs in unstructured environments but struggles with compositional generalization in complex tasks with delayed rewards. In this paper, we propose Abductive Abstract Reinforcement Learning (A2RL), a novel neuro-symbolic RL framework bridging the two paradigms based on Abductive Learning (ABL), enabling RL agents to learn abstractions directly from raw sensory inputs without predefined symbols. A2RL induces a finite state machine to represent high-level, step-by-step procedures, where each abstract state corresponds to a sub-algebra of the original Markov Decision Process (MDP). This approach not only bridges the gap between symbolic abstraction and sub-symbolic learning but also provides a natural mechanism for the emergence of new symbols. Experiments show that A2RL can mitigate the delayed reward problem and improve the generalization capability compared to traditional end-to-end RL methods.

NeurIPS Conference 2025 Conference Paper

Learning Temporal 3D Semantic Scene Completion via Optical Flow Guidance

  • Meng Wang
  • Fan Wu
  • Ruihui Li
  • Qin Yunchuan
  • Zhuo Tang
  • Li Ken Li

3D Semantic Scene Completion (SSC) provides comprehensive scene geometry and semantics for autonomous driving perception, which is crucial for enabling accurate and reliable decision-making. However, existing SSC methods are limited to capturing sparse information from the current frame or naively stacking multi-frame temporal features, thereby failing to acquire effective scene context. These approaches ignore critical motion dynamics and struggle to achieve temporal consistency. To address the above challenges, we propose a novel temporal SSC method FlowScene: Learning Temporal 3D Semantic Scene Completion via Optical Flow Guidance. By leveraging optical flow, FlowScene can integrate motion, different viewpoints, occlusions, and other contextual cues, thereby significantly improving the accuracy of 3D scene completion. Specifically, our framework introduces two key components: (1) a Flow-Guided Temporal Aggregation module that aligns and aggregates temporal features using optical flow, capturing motion-aware context and deformable structures; and (2) an Occlusion-Guided Voxel Refinement module that injects occlusion masks and temporally aggregated features into 3D voxel space, adaptively refining voxel representations for explicit geometric modeling. Experimental results demonstrate that FlowScene achieves state-of-the-art performance, with mIoU of 17.70 and 20.81 on the SemanticKITTI and SSCBench-KITTI-360 benchmarks.

AAAI Conference 2025 Conference Paper

MOL-Mamba: Enhancing Molecular Representation with Structural & Electronic Insights

  • Jingjing Hu
  • Dan Guo
  • Zhan Si
  • Deguang Liu
  • Yunfeng Diao
  • Jing Zhang
  • Jinxing Zhou
  • Meng Wang

Molecular representation learning plays a crucial role in various downstream tasks, such as molecular property prediction and drug design. To accurately represent molecules, Graph Neural Networks (GNNs) and Graph Transformers (GTs) have shown potential in the realm of self-supervised pretraining. However, existing approaches often overlook the relationship between molecular structure and electronic information, as well as the internal semantic reasoning within molecules. This omission of fundamental chemical knowledge in graph semantics leads to incomplete molecular representations, missing the integration of structural and electronic data. To address these issues, we introduce MOL-Mamba, a framework that enhances molecular representation by combining structural and electronic insights. MOL-Mamba consists of an Atom & Fragment Mamba-Graph (MG) for hierarchical structural reasoning and a Mamba-Transformer (MT) fuser for integrating molecular structure and electronic correlation learning. Additionally, we propose a Structural Distribution Collaborative Training and E-semantic Fusion Training framework to further enhance molecular representation learning. Extensive experiments demonstrate that MOL-Mamba outperforms state-of-the-art baselines across eleven chemical-biological molecular datasets.

NeurIPS Conference 2025 Conference Paper

One Stone with Two Birds: A Null-Text-Null Frequency-Aware Diffusion Models for Text-Guided Image Inpainting

  • Haipeng Liu
  • Yang Wang
  • Meng Wang

Text-guided image inpainting aims at reconstructing the masked regions as per text prompts, where the longstanding challenges lie in preserving the unmasked regions while achieving semantic consistency between the unmasked and inpainted masked regions. Previous methods failed to address both challenges at once, typically remedying only one of them. Such facts, as we observed, stem from the entanglement of the hybrid (e.g., mid-and-low) frequency bands that encode varied image properties, which exhibit different robustness to text prompts during the denoising process. In this paper, we propose a null-text-null frequency-aware diffusion model, dubbed NTN-Diff, for text-guided image inpainting, by decomposing the semantic consistency across masked and unmasked regions into the consistencies as per each frequency band, while preserving the unmasked regions, to circumvent the two challenges in a row. Based on the diffusion process, we further divide the denoising process into early (high-level noise) and late (low-level noise) stages, where the mid-and-low frequency bands are disentangled during the denoising process. As observed, the stable mid-frequency band is progressively denoised to be semantically aligned during the text-guided denoising process, which, meanwhile, serves as the guidance to the null-text denoising process to denoise the low-frequency band for the masked regions, followed by a subsequent text-guided denoising process at the late stage, to achieve semantic consistency for the mid-and-low frequency bands across masked and unmasked regions, while preserving the unmasked regions. Extensive experiments validate the superiority of NTN-Diff over state-of-the-art text-guided diffusion models. Our code can be accessed from https://github.com/htyjers/NTN-Diff.

AAAI Conference 2025 Conference Paper

PhysDiff: Physiology-based Dynamicity Disentangled Diffusion Model for Remote Physiological Measurement

  • Wei Qian
  • Gaoji Su
  • Dan Guo
  • Jinxing Zhou
  • Xiaobai Li
  • Bin Hu
  • Shengeng Tang
  • Meng Wang

Recent works on remote PhotoPlethysmoGraphy (rPPG) estimation typically use techniques like CNNs and Transformers to encode implicit features from facial videos for prediction. These methods learn to directly map facial videos to the static values of rPPG signals, overlooking the inherent dynamic characteristics of rPPG sequence. Moreover, the rPPG signal is extremely weak and highly susceptible to interference from various sources of noise, including illumination conditions, head movements, and variations in skin tone. To address these limitations, we propose a Physiology-based dynamicity disentangled diffusion (PhysDiff) model particularly designed for robust rPPG estimation. PhysDiff leverages the diffusion model to learn the distribution of quasi-periodic rPPG signal and uses a dynamicity disentanglement strategy to capture two dynamic characteristics in temporal rPPG signal, i.e., trend and amplitude. This disentanglement is motivated by the underlying dynamic physiological processes of vasodilation and vasoconstriction, ensuring a more precise representation of the rPPG signal. The disentangled components are then used as pivotal conditions in the proposed spatial-temporal hybrid denoiser for rPPG reconstruction. Besides, we introduce a periodicity-based multi-hypothesis selection strategy in model inference, which compares the natural periodicity of multiple generated rPPG hypotheses and selects the most favorable one as the final prediction. Extensive experiments on four datasets demonstrate that our PhysDiff significantly outperforms prior methods on both intra-dataset and cross-dataset testing.

AAAI Conference 2025 Conference Paper

Prototypical Calibrating Ambiguous Samples for Micro-Action Recognition

  • Kun Li
  • Dan Guo
  • Guoliang Chen
  • Chunxiao Fan
  • Jingyuan Xu
  • Zhiliang Wu
  • Hehe Fan
  • Meng Wang

Micro-Action Recognition (MAR) has gained increasing attention due to its crucial role as a form of non-verbal communication in social interactions, with promising potential for applications in human communication and emotion analysis. However, current approaches often overlook the inherent ambiguity in micro-actions, which arises from the wide category range and subtle visual differences between categories. This oversight hampers the accuracy of micro-action recognition. In this paper, we propose a novel Prototypical Calibrating Ambiguous Network (PCAN) to unleash and mitigate the ambiguity of MAR. Firstly, we employ a hierarchical action-tree to identify the ambiguous sample, categorizing them into distinct sets of ambiguous samples of false negatives and false positives, considering both body- and action-level categories. Secondly, we implement an ambiguous contrastive refinement module to calibrate these ambiguous samples by regulating the distance between ambiguous samples and their corresponding prototypes. This calibration process aims to pull false negative (FN) samples closer to their respective prototypes and push false positive (FP) samples apart from their affiliated prototypes. In addition, we propose a new prototypical diversity amplification loss to strengthen the model's capacity by amplifying the differences between different prototypes. Finally, we propose a prototype-guided rectification to rectify prediction by incorporating the representability of prototypes. Extensive experiments conducted on the benchmark dataset demonstrate the superior performance of our method compared to existing approaches.

JAIR Journal 2025 Journal Article

Semantic Alignment of Malicious Question Based on Contrastive Semantic Networks and Data Augmentation

  • Xinyan Wang
  • Jinshuo Liu
  • Juan Deng
  • Meng Wang
  • Qian Deng
  • Youcheng Yan
  • Lina Wang
  • Yunsong Ma

The identification and filtration of malicious texts in social media environments represent a significant technical challenge aimed at protecting users from online violence and disinformation. This complexity stems from the diversity and innovativeness of social media texts, which include unique expressions and special sentence structures. Particularly, malicious texts in interrogative forms pose alignment challenges with traditional corpora due to existing methods' failure to exploit the text's deep global semantic representations. This issue is compounded by the scant research on Chinese texts, leading to inefficiencies in recognition accuracy. To mitigate these challenges, we introduce an innovative framework based on a Global Contrastive Semantic Network (GCSN), designed to enhance malicious text recognition efficiency and accuracy by deeply learning global semantic knowledge. It comprises an encoder for global semantic information modelling and a graph-matching network for semantic similarity evaluation between question pairs, enabling the accurate identification and filtering of malicious texts with complex structures. Furthermore, we introduce a semantic consistency-based data augmentation method (COMBINE), using real-world data to generate balanced positive and negative samples, enriching the dataset and enhancing the model's ability to distinguish semantic consistency through contrastive learning. Experimental validation on two Chinese datasets demonstrates our model's exceptional performance, affirming its application value in social media malicious text recognition. Our code is available at https://github.com/Wxy13131313131/GCSN-COMBINE

TIST Journal 2025 Journal Article

Talking-DiSSM: Enhancing Temporal Consistency in Talking Face Video Generation with Bidirectional SSMs

  • Zhen Xiao
  • Xueliang Liu
  • Jinlin Guo
  • Jun He
  • Richang Hong
  • Meng Wang

Generating temporally smooth and high-resolution videos is a crucial objective in talking face generation tasks. Diffusion-based generative models have emerged as a prime choice for these tasks due to their ability to produce high-quality outputs. To mitigate the impact of stochasticity in the diffusion process, recent research has predominantly utilized self-attention layers to extract temporal features, ensuring temporal consistency in the generated videos. However, self-attention mechanisms have computational complexity that scales quadratically with video length, leading to high computational costs. This limitation poses significant challenges when attempting to generate longer video sequences using diffusion models. To address this challenge, we propose Talking-DiSSM, an end-to-end method for generating audio-driven talking face videos using State-Space Models (SSMs). This novel framework for conditional video diffusion modeling integrates Bidirectional State-Space Models (Bi-SSM) as temporal modeling modules with linear complexity, effectively capturing complex sequential temporal information and intra-batch sequential interdependencies in videos. Additionally, we employ a simple yet effective batch-overlapped sampling strategy to process input video clips, constructing inter-batch correlations while incorporating reference face clips and landmarks as conditions to ensure stability in the generation process. Extensive experiments demonstrate that Talking-DiSSM generates temporally consistent, high-quality, and identity-preserving talking face videos synchronized with the driving audio, achieving state-of-the-art results compared to existing models.

TMLR Journal 2025 Journal Article

Theoretical Learning Performance of Graph Networks: the Impact of Jumping Connections and Layer-wise Sparsification

  • Jiawei Sun
  • Hongkang Li
  • Meng Wang

Jumping connections enable Graph Convolutional Networks (GCNs) to overcome over-smoothing, while graph sparsification reduces computational demands by selecting a submatrix of the graph adjacency matrix during neighborhood aggregation. Learning GCNs with graph sparsification has shown empirical success across various applications, but a theoretical understanding of the generalization guarantees remains limited, with existing analyses ignoring either graph sparsification or jumping connections. This paper presents the first learning dynamics and generalization analysis of GCNs with jumping connections using graph sparsification. Our analysis demonstrates that the generalization accuracy of the learned model closely approximates the highest achievable accuracy within a broad class of target functions dependent on the proposed sparse effective adjacency matrix $A^*$. Thus, graph sparsification maintains generalization performance when $A^*$ accurately models data correlations. We reveal that jumping connections lead to different sparsification requirements across layers. In a two-hidden-layer GCN, the generalization is more affected by the sparsified matrix deviations from $A^*$ of the first layer than the second layer. To the best of our knowledge, this marks the first theoretical characterization of jumping connections' role in sparsification requirements. We validate our theoretical results on benchmark datasets in deep GCNs.

AAAI Conference 2025 Conference Paper

Thinking in Granularity: Dynamic Quantization for Image Super-Resolution by Intriguing Multi-Granularity Clues

  • Mingshen Wang
  • Zhao Zhang
  • Feng Li
  • Ke Xu
  • Kang Miao
  • Meng Wang

Dynamic quantization has attracted rising attention in image super-resolution (SR) as it expands the potential of heavy SR models onto mobile devices while preserving competitive performance. Most current methods explore layer-to-bit configuration upon varying local regions, adaptively allocating the bit to each layer and patch. Despite the benefits, they still fall short in the tradeoff of SR accuracy and quantization efficiency. Apart from this, adapting the quantization level for each layer individually can disturb the original inter-layer relationships, thus diminishing the representation capability of quantized models. In this work, we propose Granular-DQ, which takes advantage of multi-granularity clues and local patch statistics, achieving a distinctive patch-wise and layer-invariant dynamic quantization paradigm. Specifically, Granular-DQ initiates by developing a granularity-bit controller to apprehend the coarse-to-fine granular representations of local patches, matching their proportional contribution to the entire image to determine the proper bit-width allocation. On this premise, we investigate the interrelationships between bit-width and information density within high-bit patches, establishing a soft gate that enables further fine-grained dynamic bit adaption. Extensive experiments validate the superiority of Granular-DQ in the trade-off between efficiency and accuracy over recent state-of-the-art methods on various SR models.

IJCAI Conference 2025 Conference Paper

Towards Micro-Action Recognition with Limited Annotations: An Asynchronous Pseudo Labeling and Training Approach

  • Yan Zhang
  • Lechao Cheng
  • Yaxiong Wang
  • Zhun Zhong
  • Meng Wang

Micro-Action Recognition (MAR) aims to classify subtle human actions in video. However, annotating MAR datasets is particularly challenging due to the subtlety of actions. To this end, we introduce the setting of Semi-Supervised MAR (SSMAR), where only a part of samples are labeled. We first evaluate traditional Semi-Supervised Learning (SSL) methods to SSMAR and find that these methods tend to overfit on inaccurate pseudo-labels, leading to error accumulation and degraded performance. This issue primarily arises from the common practice of directly using the predictions of classifier as pseudo-labels to train the model. To solve this issue, we propose a novel framework, called Asynchronous Pseudo Labeling and Training (APLT), which explicitly separates the pseudo-labeling process from model training. Specifically, we introduce a semi-supervised clustering method during the offline pseudo-labeling phase to generate more accurate pseudo-labels. Moreover, a self-adaptive thresholding strategy is proposed to dynamically filter noisy labels of different classes. We then build a memory-based prototype classifier based on the filtered pseudo-labels, which is fixed and used to guide the subsequent model training phase. By alternating the two pseudo-labeling and model training phases in an asynchronous manner, the model can not only be learned with more accurate pseudo-labels but also avoid the overfitting issue. Experiments on three MAR datasets show that our APLT largely outperforms state-of-the-art SSL methods. For instance, APLT improves accuracy by 14.5% over FixMatch on the MA-12 dataset when using only 50% labeled data. Code is available at https://github.com/zy-hfut/APLT

ICLR Conference 2025 Conference Paper

Training Nonlinear Transformers for Chain-of-Thought Inference: A Theoretical Generalization Analysis

  • Hongkang Li
  • Songtao Lu
  • Pin-Yu Chen
  • Xiaodong Cui
  • Meng Wang

Chain-of-Thought (CoT) is an efficient prompting method that enables the reasoning ability of large language models by augmenting the query using multiple examples with multiple intermediate steps. Despite the empirical success, the theoretical understanding of how to train a Transformer to achieve the CoT ability remains less explored. This is primarily due to the technical challenges involved in analyzing the nonconvex optimization on nonlinear attention models. To the best of our knowledge, this work provides the first theoretical study of training Transformers with nonlinear attention to obtain the CoT generalization capability so that the resulting model can inference on unseen tasks when the input is augmented by examples of the new task. We first quantify the required training samples and iterations to train a Transformer model towards CoT ability. We then prove the success of its CoT generalization on unseen tasks with distribution-shifted testing data. Moreover, we theoretically characterize the conditions for an accurate reasoning output by CoT even when the provided reasoning examples contain noises and are not always accurate. In contrast, in-context learning (ICL), which can be viewed as one-step CoT without intermediate steps, may fail to provide an accurate output when CoT does. These theoretical findings are justified through experiments.

AAAI Conference 2025 Conference Paper

VLScene: Vision-Language Guidance Distillation for Camera-Based 3D Semantic Scene Completion

  • Meng Wang
  • Huilong Pi
  • Ruihui Li
  • Yunchuan Qin
  • Zhuo Tang
  • Kenli Li

Camera-based 3D semantic scene completion (SSC) provides dense geometric and semantic perception for autonomous driving. However, images provide limited information, making the model susceptible to geometric ambiguity caused by occlusion and perspective distortion. Existing methods often lack explicit semantic modeling between objects, limiting their perception of 3D semantic context. To address these challenges, we propose a novel method VLScene: Vision-Language Guidance Distillation for Camera-based 3D Semantic Scene Completion. The key insight is to use the vision-language model to introduce high-level semantic priors to provide the object spatial context required for 3D scene understanding. Specifically, we design a vision-language guidance distillation process to enhance image features, which can effectively capture semantic knowledge from the surrounding environment and improve spatial context reasoning. In addition, we introduce a geometric-semantic sparse awareness mechanism to propagate geometric structures in the neighborhood and enhance semantic information through contextual sparse interactions. Experimental results demonstrate that VLScene achieves rank-1st performance on challenging benchmarks (SemanticKITTI and SSCBench-KITTI-360), yielding remarkable mIoU scores of 17.52 and 19.10, respectively.

AAAI Conference 2024 Conference Paper

A Dual-Way Enhanced Framework from Text Matching Point of View for Multimodal Entity Linking

  • Shezheng Song
  • Shan Zhao
  • Chengyu Wang
  • Tianwei Yan
  • Shasha Li
  • Xiaoguang Mao
  • Meng Wang

Multimodal Entity Linking (MEL) aims at linking ambiguous mentions with multimodal information to entity in Knowledge Graph (KG) such as Wikipedia, which plays a key role in many applications. However, existing methods suffer from shortcomings, including modality impurity such as noise in raw image and ambiguous textual entity representation, which puts obstacles to MEL. We formulate multimodal entity linking as a neural text matching problem where each multimodal information (text and image) is treated as a query, and the model learns the mapping from each query to the relevant entity from candidate entities. This paper introduces a dual-way enhanced (DWE) framework for MEL: (1) our model refines queries with multimodal data and addresses semantic gaps using cross-modal enhancers between text and image information. Besides, DWE innovatively leverages fine-grained image attributes, including facial characteristic and scene feature, to enhance and refine visual features. (2) By using Wikipedia descriptions, DWE enriches entity semantics and obtains more comprehensive textual representation, which reduces between textual representation and the entities in KG. Extensive experiments on three public benchmarks demonstrate that our method achieves state-of-the-art (SOTA) performance, indicating the superiority of our model. The code is released on https://github.com/season1blue/DWE.

TIST Journal 2024 Journal Article

Efficiently Gluing Pre-Trained Language and Vision Models for Image Captioning

  • Peipei Song
  • Yuanen Zhou
  • Xun Yang
  • Daqing Liu
  • Zhenzhen Hu
  • Depeng Wang
  • Meng Wang

Vision-and-language pre-training models have achieved impressive performance for image captioning. But most of them are trained with millions of paired image-text data and require huge memory and computing overhead. To alleviate this, we try to stand on the shoulders of large-scale pre-trained language models (PLM) and pre-trained vision models (PVM) and efficiently connect them for image captioning. There are two major challenges: one is that language and vision modalities have different semantic granularity (e.g., a noun may cover many pixels), and the other is that the semantic gap still exists between the pre-trained language and vision models. To this end, we design a lightweight and efficient connector to glue PVM and PLM, which holds a criterion of selection-then-transformation. Specifically, in the selection phase, we treat each image as a set of patches instead of pixels. We select salient image patches and cluster them into visual regions to align with text. Then, to effectively reduce the semantic gap, we propose to map the selected image patches into text space through spatial and channel transformations. With training on image captioning datasets, the connector learns to bridge the semantic granularity and semantic gap via backpropagation, preparing for the PLM to generate descriptions. Experimental results on the MSCOCO and Flickr30k datasets demonstrate that our method yields comparable performance to existing works. By solely training the small connector, we achieve a CIDEr performance of 132.2% on the MSCOCO Karpathy test split. Moreover, our findings reveal that fine-tuning the PLM can further enhance performance potential, resulting in a CIDEr score of 140.6%. Code and models are available at https://github.com/YuanEZhou/PrefixCap.

AAAI Conference 2024 Conference Paper

EulerMormer: Robust Eulerian Motion Magnification via Dynamic Filtering within Transformer

  • Fei Wang
  • Dan Guo
  • Kun Li
  • Meng Wang

Video Motion Magnification (VMM) aims to break the resolution limit of human visual perception capability and reveal the imperceptible minor motion that contains valuable information in the macroscopic domain. However, challenges arise in this task due to photon noise inevitably introduced by photographic devices and spatial inconsistency in amplification, leading to flickering artifacts in static fields and motion blur and distortion in dynamic fields in the video. Existing methods focus on explicit motion modeling without emphasizing prioritized denoising during the motion magnification process. This paper proposes a novel dynamic filtering strategy to achieve static-dynamic field adaptive denoising. Specifically, based on Eulerian theory, we separate texture and shape to extract motion representation through inter-frame shape differences, expecting to leverage these subdivided features to solve this task finely. Then, we introduce a novel dynamic filter that eliminates noise cues and preserves critical features in the motion magnification and amplification generation phases. Overall, our unified framework, EulerMormer, is a pioneering effort to first equip with Transformer in learning-based VMM. The core of the dynamic filter lies in a global dynamic sparse cross-covariance attention mechanism that explicitly removes noise while preserving vital information, coupled with a multi-scale dual-path gating mechanism that selectively regulates the dependence on different frequency features to reduce spatial attenuation and complement motion boundaries. We demonstrate extensive experiments that EulerMormer achieves more robust video motion magnification from the Eulerian perspective, significantly outperforming state-of-the-art methods. The source code is available at https://github.com/VUT-HFUT/EulerMormer.

NeurIPS Conference 2024 Conference Paper

FasMe: Fast and Sample-efficient Meta Estimator for Precision Matrix Learning in Small Sample Settings

  • Xiao Tan
  • Yiqin Wang
  • Yangyang Shen
  • Dian Shen
  • Meng Wang
  • Peibo Duan
  • Beilun Wang

Precision matrix estimation is a ubiquitous task featuring numerous applications such as rare disease diagnosis and neural connectivity exploration. However, this task becomes challenging in small sample settings, where the number of samples is significantly less than the number of dimensions, leading to unreliable estimates. Previous approaches either fail to perform well in small sample settings or suffer from inefficient estimation processes, even when incorporating meta-learning techniques. To this end, we propose a novel approach FasMe for Fast and Sample-efficient Meta Precision Matrix Learning, which first extracts meta-knowledge through a multi-task learning diagram. Then, meta-knowledge constraints are applied using a maximum determinant matrix completion algorithm for the novel task. As a result, we reduce the sample size requirements to $O(\log p/K)$ per meta-training task and $O(\log\vert \mathcal{G}\vert)$ for the meta-testing task. Moreover, the hereby proposed model only needs $O(p \log\epsilon^{-1})$ time and $O(p)$ memory for converging to an $\epsilon$-accurate solution. On multiple synthetic and biomedical datasets, FasMe is at least ten times faster than the four baselines while promoting prediction accuracy in small sample settings.

AAAI Conference 2024 Conference Paper

KPA-Tracker: Towards Robust and Real-Time Category-Level Articulated Object 6D Pose Tracking

  • Liu Liu
  • Anran Huang
  • Qi Wu
  • Dan Guo
  • Xun Yang
  • Meng Wang

Our life is populated with articulated objects. Current category-level articulation estimation works largely focus on predicting part-level 6D poses on static point cloud observations. In this paper, we tackle the problem of category-level online robust and real-time 6D pose tracking of articulated objects, where we propose KPA-Tracker, a novel 3D KeyPoint based Articulated object pose Tracker. Given an RGB-D image or a partial point cloud at the current frame as well as the estimated per-part 6D poses from the last frame, our KPA-Tracker can effectively update the poses with learned 3D keypoints between the adjacent frames. Specifically, we first canonicalize the input point cloud and formulate the pose tracking as an inter-frame pose increment estimation task. To learn consistent and separate 3D keypoints for every rigid part, we build KPA-Gen that outputs the high-quality ordered 3D keypoints in an unsupervised manner. During pose tracking on the whole video, we further propose a keypoint-based articulation tracking algorithm that mines keyframes as reference for accurate pose updating. We provide extensive experiments on validating our KPA-Tracker on various datasets ranging from synthetic point cloud observation to real-world scenarios, which demonstrates the superior performance and robustness of the KPA-Tracker. We believe that our work has the potential to be applied in many fields including robotics, embodied intelligence and augmented reality. All the datasets and codes are available at https://github.com/hhhhhar/KPA-Tracker.

AAMAS Conference 2024 Conference Paper

Mastering Robot Control through Point-based Reinforcement Learning with Pre-training

  • Yihong Chen
  • Cong Wang
  • Tianpei Yang
  • Meng Wang
  • Yingfeng Chen
  • Jifei Zhou
  • Chaoyi Zhao
  • Xinfeng Zhang

Visual-based Reinforcement Learning (RL) has gained prominence in robotics decision-making due to its significant potential. However, the prevalent utilization of images in visual-based RL lacks explicit descriptions of object structures and spatial configurations in scenes, thereby limiting the overall efficiency and robustness of RL in robot control. Additionally, training an RL policy solely using visual observations from scratch is typically sample-inefficient, rendering it impractical for real-world application. To address these challenges, this paper proposes a novel method, called Pre-training on Point-based RL (P2RL), which takes the point cloud representations of scenes as states and preserves the intricate spatial details between objects. To further enhance efficiency, we leverage the pre-training method to bolster the perception ability of the network. Key factors in the pre-training process are systematically examined to optimize downstream RL training. Experimental results demonstrate the superior robustness and efficiency of P2RL compared to the state-of-the-art image-based RL method, especially in evaluations involving untrained scenes.

TIST Journal 2024 Journal Article

Mitigating Recommendation Biases via Group-Alignment and Global-Uniformity in Representation Learning

  • Miaomiao Cai
  • Min Hou
  • Lei Chen
  • Le Wu
  • Haoyue Bai
  • Yong Li
  • Meng Wang

Collaborative Filtering (CF) plays a crucial role in modern recommender systems, leveraging historical user-item interactions to provide personalized suggestions. However, CF-based methods often encounter biases due to imbalances in training data. This phenomenon makes CF-based methods tend to prioritize recommending popular items and performing unsatisfactorily on inactive users. Existing works address this issue by rebalancing training samples, reranking recommendation results, or making the modeling process robust to the bias. Despite their effectiveness, these approaches can compromise accuracy or be sensitive to weighting strategies, making them challenging to train. Therefore, exploring how to mitigate these biases remains in urgent demand. In this article, we deeply analyze the causes and effects of the biases and propose a framework to alleviate biases in recommendation from the perspective of representation distribution, namely Group-Alignment and Global-Uniformity Enhanced Representation Learning for Debiasing Recommendation (AURL). Specifically, we identify two significant problems in the representation distribution of users and items, namely group-discrepancy and global-collapse. These two problems directly lead to biases in the recommendation results. To this end, we propose two simple but effective regularizers in the representation space, respectively named group-alignment and global-uniformity. The goal of group-alignment is to bring the representation distribution of long-tail entities closer to that of popular entities, while global-uniformity aims to preserve the information of entities as much as possible by evenly distributing representations. Our method directly optimizes both the group-alignment and global-uniformity regularization terms to mitigate recommendation biases. Please note that AURL applies to arbitrary CF-based recommendation backbones. Extensive experiments on three real datasets and various recommendation backbones verify the superiority of our proposed framework. The results show that AURL not only outperforms existing debiasing models in mitigating biases but also improves recommendation performance to some extent.

AAAI Conference 2024 Conference Paper

Object-Aware Adaptive-Positivity Learning for Audio-Visual Question Answering

  • Zhangbin Li
  • Dan Guo
  • Jinxing Zhou
  • Jing Zhang
  • Meng Wang

This paper focuses on the Audio-Visual Question Answering (AVQA) task that aims to answer questions derived from untrimmed audible videos. To generate accurate answers, an AVQA model is expected to find the most informative audio-visual clues relevant to the given questions. In this paper, we propose to explicitly consider fine-grained visual objects in video frames (object-level clues) and explore the multi-modal relations (i.e., the object, audio, and question) in terms of feature interaction and model optimization. For the former, we present an end-to-end object-oriented network that adopts a question-conditioned clue discovery module to concentrate audio/visual modalities on respective keywords of the question and designs a modality-conditioned clue collection module to highlight closely associated audio segments or visual objects. For model optimization, we propose an object-aware adaptive-positivity learning strategy that selects the highly semantic-matched multi-modal pair as positivity. Specifically, we design two object-aware contrastive loss functions to identify the highly relevant question-object pairs and audio-object pairs, respectively. These selected pairs are constrained to have larger similarity values than the mismatched pairs. The positivity-selecting process is adaptive as the positivity pairs selected in each video frame may be different. These two object-aware objectives help the model understand which objects are exactly relevant to the question and which are making sounds. Extensive experiments on the MUSIC-AVQA dataset demonstrate the proposed method is effective in finding favorable audio-visual clues and also achieves new state-of-the-art question-answering performance. The code is available at https://github.com/zhangbin-ai/APL.

IJCAI Conference 2024 Conference Paper

OSIC: A New One-Stage Image Captioner Coined

  • Bo Wang
  • Zhao Zhang
  • Mingbo Zhao
  • Xiaojie Jin
  • Mingliang Xu
  • Meng Wang

Mainstream image captioning models are usually two-stage captioners, i.e., encoding the region features by a pre-trained detector and then feeding them into a language model to generate the captions. However, such a two-stage procedure will lead to a task-based information gap that decreases the performance, because the region features in the detection task are suboptimal representations and cannot provide all the necessary information for subsequent captions generation. Besides, the region features are usually represented from the last layer of the detectors that lose the local details of images. In this paper, we propose a novel One-Stage Image Captioner (OSIC) with dynamic multi-sight learning, which directly transforms the images into descriptive sentences in one stage for eliminating the information gap. Specifically, to obtain rich features, multi-level features are captured by Swin Transformer, and then fed into a novel dynamic multi-sight embedding module to exploit both the global structure and local texture of input images. To enhance the global modeling capacity of the visual encoder, we propose a new dual-dimensional refining to non-locally model the features interaction. As a result, OSIC can directly obtain rich semantic information to improve the captioner. Extensive comparisons on the benchmark MS-COCO, Flickr8K and Flickr30K datasets verified the superior performance of our method.

NeurIPS Conference 2024 Conference Paper

Temporal Sentence Grounding with Relevance Feedback in Videos

  • Jianfeng Dong
  • Xiaoman Peng
  • Daizong Liu
  • Xiaoye Qu
  • Xun Yang
  • Cuizhu Bao
  • Meng Wang

As a widely explored multi-modal task, Temporal Sentence Grounding in videos (TSG) endeavors to retrieve a specific video segment matched with a given query text from a video. The traditional paradigm for TSG generally assumes that relevant segments always exist within a given video. However, this assumption is restrictive and unrealistic in real-world applications where the existence of a query-related segment is uncertain, easily resulting in erroneous grounding. Motivated by the research gap and practical application, this paper introduces a new task, named Temporal Sentence Grounding with Relevance Feedback (TSG-RF) in videos, which accommodates the possibility that a video may or may not include a segment related to the query. This task entails localizing precise video segments that semantically align with the query text when such content is present, while delivering definitive feedback on the non-existence of related segments when absent. Moreover, we propose a novel Relation-aware Temporal Sentence Grounding (RaTSG) network for addressing this challenging task. This network first reformulates the TSG-RF task as a foreground-background detection problem by investigating whether the query-related semantics exist in both frame and video levels. Then, a multi-granularity relevance discriminator is exploited to produce precise video-query relevance feedback and a relation-aware segment grounding module is employed to selectively conduct the grounding process, dynamically adapting to the presence or absence of query-related segments in videos. To validate our RaTSG network, we reconstruct two popular TSG datasets, establishing a rigorous benchmark for TSG-RF. Experimental results demonstrate the effectiveness of our proposed RaTSG for the TSG-RF task. Our source code is available at https://github.com/HuiGuanLab/RaTSG.

IJCAI Conference 2024 Conference Paper

Towards Proactive Interactions for In-Vehicle Conversational Assistants Utilizing Large Language Models

  • Huifang Du
  • Xuejing Feng
  • Jun Ma
  • Meng Wang
  • Shiyu Tao
  • YiJie Zhong
  • Yuan-Fang Li
  • Haofen Wang

Research demonstrates that the proactivity of in-vehicle conversational assistants (IVCAs) can help to reduce distractions and enhance driving safety, better meeting users' cognitive needs. However, existing IVCAs struggle with user intent recognition and context awareness, which leads to suboptimal proactive interactions. Large language models (LLMs) have shown potential for generalizing to various tasks with prompts, but their application in IVCAs and their use for proactive interaction remain under-explored. This raises questions about how LLMs improve proactive interactions for IVCAs and influence user perception. To investigate these questions systematically, we establish a framework with five proactivity levels across two dimensions (assumption and autonomy) for IVCAs. According to the framework, we propose a "Rewrite + ReAct + Reflect" strategy, aiming to empower LLMs to fulfill the specific demands of each proactivity level when interacting with users. Both feasibility and subjective experiments are conducted. The LLM outperforms the state-of-the-art model in success rate and achieves satisfactory results for each proactivity level. Subjective experiments with 40 participants validate the effectiveness of our framework and show that the proactivity level with strong assumptions and user confirmation is most appropriate.

AAAI Conference 2023 Conference Paper

DC-Former: Diverse and Compact Transformer for Person Re-identification

  • Wen Li
  • Cheng Zou
  • Meng Wang
  • Furong Xu
  • Jianan Zhao
  • Ruobing Zheng
  • Yuan Cheng
  • Wei Chu

In the person re-identification (ReID) task, it is still challenging to learn discriminative representations by deep learning, due to limited data. Generally speaking, a model performs better as the amount of data increases. The addition of similar classes strengthens the ability of the classifier to identify similar identities, thereby improving the discrimination of the representation. In this paper, we propose a Diverse and Compact Transformer (DC-Former) that can achieve a similar effect by splitting the embedding space into multiple diverse and compact subspaces. A compact embedding subspace helps the model learn more robust and discriminative embeddings to identify similar classes, and the fusion of these diverse embeddings, which contain more fine-grained information, can further improve ReID performance. Specifically, multiple class tokens are used in a vision transformer to represent multiple embedding spaces. Then, a self-diverse constraint (SDC) is applied to these spaces to push them away from each other, which makes each embedding space diverse and compact. Further, a dynamic weight controller (DWC) is designed to balance the relative importance among them during training. The experimental results of our method are promising, surpassing previous state-of-the-art methods on several commonly used person ReID benchmarks. Our code is available at https://github.com/ant-research/Diverse-and-Compact-Transformer.

NeurIPS Conference 2023 Conference Paper

Disentangling Cognitive Diagnosis with Limited Exercise Labels

  • Xiangzhi Chen
  • Le Wu
  • Fei Liu
  • Lei Chen
  • Kun Zhang
  • Richang Hong
  • Meng Wang

Cognitive diagnosis is an important task in intelligence education, which aims at measuring students’ proficiency in specific knowledge concepts. Given a fully labeled exercise-concept matrix, most existing models focused on mining students' response records for cognitive diagnosis. Despite their success, due to the huge cost of labeling exercises, a more practical scenario is that limited exercises are labeled with concepts. Performing cognitive diagnosis with limited exercise labels is under-explored and remains pretty much open. In this paper, we propose Disentanglement based Cognitive Diagnosis (DCD) to address the challenges of limited exercise labels. Specifically, we utilize students' response records to model student proficiency, exercise difficulty and exercise label distribution. Then, we introduce two novel modules - group-based disentanglement and limited-labeled alignment modules - to disentangle the factors relevant to concepts and align them with real limited labels. Particularly, we introduce the tree-like structure of concepts with negligible cost for group-based disentangling, as concepts of different levels exhibit different independence relationships. Extensive experiments on widely used benchmarks demonstrate the superiority of our proposed model.

AAAI Conference 2023 Conference Paper

Fair Representation Learning for Recommendation: A Mutual Information Perspective

  • Chen Zhao
  • Le Wu
  • Pengyang Shao
  • Kun Zhang
  • Richang Hong
  • Meng Wang

Recommender systems have been widely used in recent years. By exploiting historical user-item interactions, recommender systems can model the personalized potential interests of users and have been applied to a wide range of scenarios. Despite their impressive performance, most of them may be subject to unwanted biases related to sensitive attributes (e.g., race and gender), leading to unfairness. An intuitive idea to alleviate this problem is to ensure that there is no mutual information between recommendation results and sensitive attributes. However, enforcing independence conditions alone improves fairness while causing an obvious degradation of recommendation accuracy, which is not a desired result. To this end, in this paper, we re-define recommendation fairness with a novel two-fold mutual information objective. Specifically, we define fairness as mutual information minimization between embeddings and sensitive information, and mutual information maximization between embeddings and non-sensitive information. Then, a flexible Fair Mutual Information (FairMI) framework is designed to achieve this goal. FairMI first employs a sensitive attribute encoder to capture sensitive information in the data. Then, based on results from the sensitive attribute encoder, an interest encoder is developed to generate sensitive-free embeddings, which are expected to contain rich non-sensitive information of input data. Moreover, we propose novel mutual information (upper/lower) bounds with contrastive information estimation for model optimization. Extensive experiments over two real-world datasets demonstrate the effectiveness of our proposed FairMI in reducing unfairness and improving recommendation accuracy simultaneously.

AAAI Conference 2023 Conference Paper

MCL: Multi-Granularity Contrastive Learning Framework for Chinese NER

  • Shan Zhao
  • Chengyu Wang
  • Minghao Hu
  • Tianwei Yan
  • Meng Wang

Recently, researchers have applied the word-character lattice framework to integrate word information, which has become very popular for Chinese named entity recognition (NER). However, prior approaches fuse word information through different variants of encoders such as Lattice LSTM or Flat-Lattice Transformer, and are still not data-efficient enough to fully grasp the deep cross-granularity interaction and the important word information from the lexicon. In this paper, we go beyond the typical lattice structure and propose a novel Multi-Granularity Contrastive Learning framework (MCL) that aims to optimize the inter-granularity distribution distance and emphasize the critical matched words in the lexicon. By carefully combining cross-granularity contrastive learning and bi-granularity contrastive learning, the network can explicitly leverage lexicon information on the initial lattice structure, and further provide denser cross-granularity interactions, thus significantly improving model performance. Experiments on four Chinese NER datasets show that MCL obtains state-of-the-art results while considering model efficiency. The source code of the proposed method is publicly available at https://github.com/zs50910/MCL

JBHI Journal 2023 Journal Article

Metabolic Anomaly Appearance Aware U-Net for Automatic Lymphoma Segmentation in Whole-Body PET/CT Scans

  • Tianyu Shi
  • Huiyan Jiang
  • Meng Wang
  • Zhaoshuo Diao
  • Guoxu Zhang
  • Yu-dong Yao

Positron emission tomography-computed tomography (PET/CT) is an essential imaging instrument for lymphoma diagnosis and prognosis. PET/CT image based automatic lymphoma segmentation is increasingly used in the clinical community. U-Net-like deep learning methods have been widely used for PET/CT in this task. However, their performance is limited by the lack of sufficient annotated data, due to the existence of tumor heterogeneity. To address this issue, we propose an unsupervised image generation scheme to improve the performance of another independent supervised U-Net for lymphoma segmentation by capturing metabolic anomaly appearance (MAA). Firstly, we propose an anatomical-metabolic consistency generative adversarial network (AMC-GAN) as an auxiliary branch of U-Net. Specifically, AMC-GAN learns normal anatomical and metabolic information representations using co-aligned whole-body PET/CT scans. In the generator of AMC-GAN, we propose a complementary attention block to enhance the feature representation of low-intensity areas. Then, the trained AMC-GAN is used to reconstruct the corresponding pseudo-normal PET scans to capture MAAs. Finally, combined with the original PET/CT images, MAAs are used as the prior information for improving the performance of lymphoma segmentation. Experiments are conducted on a clinical dataset containing 191 normal subjects and 53 patients with lymphomas. The results demonstrate that the anatomical-metabolic consistency representations obtained from unlabeled paired PET/CT scans can be helpful for more accurate lymphoma segmentation, which suggests the potential of our approach to support physician diagnosis in practical clinical applications.

NeurIPS Conference 2023 Conference Paper

On the Convergence and Sample Complexity Analysis of Deep Q-Networks with $\epsilon$-Greedy Exploration

  • Shuai Zhang
  • Hongkang Li
  • Meng Wang
  • Miao Liu
  • Pin-Yu Chen
  • Songtao Lu
  • Sijia Liu
  • Keerthiram Murugesan

This paper provides a theoretical understanding of deep Q-Network (DQN) with the $\epsilon$-greedy exploration in deep reinforcement learning. Despite the tremendous empirical achievement of the DQN, its theoretical characterization remains underexplored. First, the exploration strategy is either impractical or ignored in the existing analysis. Second, in contrast to conventional Q-learning algorithms, the DQN employs the target network and experience replay to acquire an unbiased estimation of the mean-square Bellman error (MSBE) utilized in training the Q-network. However, the existing theoretical analysis of DQNs lacks convergence analysis or bypasses the technical challenges by deploying a significantly overparameterized neural network, which is not computationally efficient. This paper provides the first theoretical convergence and sample complexity analysis of the practical setting of DQNs with $\epsilon$-greedy policy. We prove an iterative procedure with decaying $\epsilon$ converges to the optimal Q-value function geometrically. Moreover, a higher level of $\epsilon$ values enlarges the region of convergence but slows down the convergence, while the opposite holds for a lower level of $\epsilon$ values. Experiments justify our established theoretical insights on DQNs.
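As a rough illustration of the exploration rule named in this abstract, the following is a minimal sketch of $\epsilon$-greedy action selection with a decaying $\epsilon$. The geometric schedule, the function names, and all default values are assumptions chosen for illustration; they are not taken from the paper.

```python
import random


def epsilon_schedule(step, eps_start=1.0, eps_end=0.05, decay=0.99):
    """Hypothetical schedule: shrink epsilon geometrically, floored at eps_end."""
    return max(eps_end, eps_start * (decay ** step))


def epsilon_greedy(q_values, epsilon, rng=random):
    """With probability epsilon pick a uniformly random action, else the greedy one."""
    if rng.random() < epsilon:
        return rng.randrange(len(q_values))
    # Greedy choice: index of the largest estimated Q-value.
    return max(range(len(q_values)), key=lambda a: q_values[a])
```

A larger `epsilon` means more random exploration, and a smaller one means more exploitation of the current Q-value estimates, which is the trade-off the abstract discusses.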

AAAI Conference 2023 Conference Paper

Rethinking Data-Free Quantization as a Zero-Sum Game

  • Biao Qian
  • Yang Wang
  • Richang Hong
  • Meng Wang

Data-free quantization (DFQ) recovers the performance of a quantized network (Q) without accessing the real data; instead, it generates fake samples via a generator (G) that learns from a full-precision network (P). However, this sample generation process is totally independent of Q and fails to consider the adaptability of the generated samples, i.e., whether they are beneficial or adversarial, over the learning process of Q, resulting in non-negligible performance loss. Building on this, several crucial questions impel us to revisit DFQ: how to measure and exploit the sample adaptability to Q under varied bit-width scenarios, and how to generate samples with desirable adaptability to benefit the quantized network? In this paper, we answer the above questions from a game-theory perspective, casting DFQ as a zero-sum game between two players, a generator and a quantized network, and further propose an Adaptability-aware Sample Generation (AdaSG) method. Technically, AdaSG reformulates DFQ as a dynamic maximization-vs-minimization game process anchored on the sample adaptability. The maximization process aims to generate samples with desirable adaptability; this sample adaptability is then reduced by the minimization process after calibrating Q for performance recovery. The Balance Gap is defined to guide the stationarity of the game process to maximally benefit Q. The theoretical analysis and empirical studies verify the superiority of AdaSG over the state of the art. Our code is available at https://github.com/hfutqian/AdaSG.

JBHI Journal 2023 Journal Article

SCL-Net: Structured Collaborative Learning for PET/CT Based Tumor Segmentation

  • Meng Wang
  • Huiyan Jiang
  • Tianyu Shi
  • Yu-dong Yao

Collaborative learning methods for medical image segmentation are often variants of UNet, where the constructions of classifiers depend on each other and their outputs are supervised independently. However, they cannot explicitly ensure that optimizing auxiliary classifier heads leads to improved segmentation by the target classifier. To resolve this problem, we propose a structured collaborative learning (SCL) method, which consists of a context-aware structured classifier population generation (CA-SCPG) module, where the feature propagation of the target classifier path is directly enhanced by the outputs of auxiliary classifiers via a lightweight high-level context-aware dense connection (HLCA-DC) mechanism, and a knowledge-aware structured classifier population supervision (KA-SCPS) module, where the auxiliary classifiers are properly supervised under the guidance of the target classifier's segmentations. Specifically, SCL is built on a recurrent-dense-siamese decoder (RDS-Decoder), which consists of multiple siamese-decoder paths. CA-SCPG enhances the feature propagation of the decoder paths by HLCA-DC, which densely reuses the target-class predictions output by previous decoder paths as inputs to the later decoder paths. KA-SCPS supervises the classifier heads simultaneously with the KA-SCPS loss, which consists of a generalized weighted cross-entropy loss for deep class-imbalanced learning and a novel knowledge-aware Dice loss (KA-DL). KA-DL is a weighted Dice loss that broadcasts knowledge learned by the target classifier to the other classifier heads, harmonizing the learning process of the classifier population. Experiments are performed on PET/CT volumes with malignant melanoma, lymphoma, or lung cancer. Experimental results demonstrate the superiority of our SCL when compared to the state-of-the-art methods and baselines.

TIST Journal 2023 Journal Article

Toward Balancing the Efficiency and Effectiveness in k-Facility Relocation Problem

  • Hu Wang
  • Hui Li
  • Meng Wang
  • Jiangtao Cui

Facility Relocation (FR), which is an effort to reallocate the placement of facilities to adapt to changes in urban planning, has a remarkable impact on many areas. Existing solutions fail to guarantee the result quality when relocating k > 1 facilities. As the k-FR problem is NP-complete and is neither submodular nor non-decreasing, the traditional greedy algorithm cannot be directly applied. We propose to transform k-FR into another facility placement problem, which is submodular and non-decreasing. We prove that the optimal solutions of both problems are equivalent. Accordingly, we present the first approximate solution to k-FR, FR2FP. Our extensive comparison of FR2FP and the state-of-the-art solution shows that FR2FP, although it provides an approximation guarantee, does not necessarily give superior results. The comparison motivates us to present an advanced approximate solution, FR2FP-ex. Moreover, based on Lagrangian relaxation, we develop an algorithm that can adjust the approximation ratio. Extensive experiments verify that FR2FP-ex demonstrates the best result quality and is very close to the optimal solution. In addition, we also unveil the scenarios in which the state-of-the-art solution would fail. We further generalize the k-FR problem, considering the budget for relocation and the cost of each facility. We also present corresponding approximate solutions to the new problem and prove the approximation ratio.
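The abstract's point that a greedy algorithm becomes applicable once the objective is submodular and non-decreasing can be illustrated with a generic sketch. The code below is not FR2FP; it is the textbook greedy rule applied to a coverage objective (a standard monotone submodular function), and the facility names and covered-customer sets are hypothetical.

```python
def greedy_max_coverage(candidates, k):
    """Pick up to k candidates, each time taking the largest marginal coverage gain.

    candidates: dict mapping a (hypothetical) facility name to the set of
    customers it would cover.
    """
    chosen, covered = [], set()
    remaining = dict(candidates)
    for _ in range(min(k, len(remaining))):
        # Sort names first so that ties are broken deterministically.
        best = max(sorted(remaining), key=lambda name: len(remaining[name] - covered))
        if not remaining[best] - covered:
            break  # no candidate adds anything new
        chosen.append(best)
        covered |= remaining.pop(best)
    return chosen, covered


sites = {"f1": {1, 2, 3}, "f2": {3, 4}, "f3": {4, 5, 6, 7}}
picked, covered = greedy_max_coverage(sites, k=2)
```

For monotone submodular objectives under a cardinality constraint, this greedy rule is known to achieve a (1 - 1/e) approximation, which is why recasting a problem in that form is useful.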

NeurIPS Conference 2023 Conference Paper

Towards Efficient Pre-Trained Language Model via Feature Correlation Distillation

  • Kun Huang
  • Xin Guo
  • Meng Wang

Knowledge Distillation (KD) has emerged as a promising approach for compressing large Pre-trained Language Models (PLMs). The performance of KD relies on how effectively the knowledge is formulated and transferred from the teacher model to the student model. Prior work mainly focuses on directly aligning output features from the transformer block, which may impose overly strict constraints on the student model's learning process and complicate the training process by introducing extra parameters and computational cost. Moreover, our analysis indicates that the different relations within self-attention, as adopted in other works, involve more computational complexity and can easily be constrained by the number of heads, potentially leading to suboptimal solutions. To address these issues, we propose a novel approach that builds relationships directly from output features. Specifically, we introduce token-level and sequence-level relations concurrently to fully exploit the knowledge from the teacher model. Furthermore, we propose a correlation-based distillation loss to alleviate the exact match properties inherent in traditional KL divergence or MSE loss functions. Our method, dubbed FCD, presents a simple yet effective method to compress various architectures (BERT, RoBERTa, and GPT) and model sizes (base-size and large-size). Extensive experimental results demonstrate that our distilled, smaller language models significantly surpass existing KD methods across various NLP tasks.

ICML Conference 2022 Conference Paper

A Difference Standardization Method for Mutual Transfer Learning

  • Haoqing Xu
  • Meng Wang
  • Beilun Wang

In many real-world applications, mutual transfer learning is the paradigm that each data domain can potentially be a source or target domain. This is quite different from transfer learning tasks where the source and target are known a priori. However, previous studies about mutual transfer learning either suffer from high computational complexity or oversimplified hypothesis. To overcome these challenges, in this paper, we propose the Difference Standardization method (DiffS) for mutual transfer learning. Specifically, we put forward a novel distance metric between domains, the standardized domain difference, to obtain fast structure recovery and accurate parameter estimation simultaneously. We validate the method's performance using both synthetic and real-world data. Compared to previous methods, DiffS demonstrates a speed-up of approximately 3000 times that of similar methods and achieves the same accurate learnability structure estimation.

AIJ Journal 2022 Journal Article

Bayesian feature interaction selection for factorization machines

  • Yifan Chen
  • Yang Wang
  • Pengjie Ren
  • Meng Wang
  • Maarten de Rijke

Factorization machines are a generic supervised method for a wide range of tasks in the field of artificial intelligence, such as prediction, inference, etc., which can effectively model feature interactions. However, handling combinations of features is expensive due to the exponential growth of feature interactions with the order. In nature, not all feature interactions are equally useful for prediction. Recently, a large number of methods that perform feature interaction selection have attracted great attention because of their effectiveness at filtering out useless feature interactions. Current feature interaction selection methods suffer from the following limitations: (1) they assume that all users share the same feature interactions; and (2) they select pairwise feature interactions only. In this paper, we propose novel Bayesian variable selection methods, targeting feature interaction selection for factorization machines, which effectively reduce the number of interactions. We study personalized feature interaction selection to account for individual preferences, and further extend the model to investigate higher-order feature interaction selection on higher-order factorization machines. We provide empirical evidence for the advantages of the proposed Bayesian feature interaction selection methods using different prediction tasks.

JBHI Journal 2022 Journal Article

HD-RDS-UNet: Leveraging Spatial-Temporal Correlation Between the Decoder Feature Maps for Lymphoma Segmentation

  • Meng Wang
  • Huiyan Jiang
  • Tianyu Shi
  • Yu-dong Yao

Lymphoma is a cancer that originates in the lymphatic system. Clinically, automatic and accurate lymphoma segmentation is critical yet challenging. Recently, UNet-like architectures have been widely used for medical image segmentation. Pure UNet-like architectures can model the spatial correlation between the feature maps very well, but they discard the critical temporal correlation. Some prior works combine UNet with recurrent neural networks (RNNs) to utilize the spatial and temporal correlation simultaneously. However, it is inconvenient to incorporate some advanced techniques proposed for UNet into RNNs, which hampers their further improvement. In this paper, we propose a recurrent dense siamese decoder architecture, which simulates RNNs and can densely utilize the spatial-temporal correlation between the decoder feature maps following a "UNet" approach. We combine it with a modified hyper dense encoder. Therefore, the proposed model is a UNet with a hyper dense encoder and a recurrent dense siamese decoder (HD-RDS-UNet). To stabilize the training process, we propose a weighted Dice loss with stable gradient and self-adaptive parameters. We perform patient-independent five-fold cross-validation on 3D volumes collected from whole-body PET/CT scans of patients with lymphomas. The experimental results show that the volume-wise average Dice score and sensitivity are 85.58% and 94.63%, respectively. The patient-wise average Dice score and sensitivity are 85.85% and 95.01%, respectively. The different configurations of HD-RDS-UNet consistently show superiority in the performance comparison. Besides, a trained HD-RDS-UNet can be easily pruned, resulting in significantly reduced inference time and memory usage, while keeping very good segmentation performance.
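For readers unfamiliar with the two metrics reported here, the following is a minimal sketch of how a Dice score and sensitivity are computed from binary masks. It shows the standard metric definitions only, not the paper's weighted Dice loss; the function name and the flat-list mask representation are assumptions for illustration.

```python
def dice_and_sensitivity(pred, target):
    """Compute Dice = 2TP / (2TP + FP + FN) and sensitivity = TP / (TP + FN).

    pred and target are equal-length sequences of 0/1 voxel labels.
    """
    tp = sum(1 for p, t in zip(pred, target) if p and t)
    fp = sum(1 for p, t in zip(pred, target) if p and not t)
    fn = sum(1 for p, t in zip(pred, target) if not p and t)
    # Convention assumed here: two empty masks count as a perfect match.
    dice = 2 * tp / (2 * tp + fp + fn) if (tp + fp + fn) else 1.0
    sensitivity = tp / (tp + fn) if (tp + fn) else 1.0
    return dice, sensitivity


dice, sens = dice_and_sensitivity([1, 1, 0, 0], [1, 0, 1, 0])
```

In this toy case there is one true positive, one false positive, and one false negative, so both values come out to 0.5.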

YNIMG Journal 2022 Journal Article

Insights from an autism imaging biomarker challenge: Promises and threats to biomarker discovery

  • Nicolas Traut
  • Katja Heuer
  • Guillaume Lemaître
  • Anita Beggiato
  • David Germanaud
  • Monique Elmaleh
  • Alban Bethegnies
  • Laurent Bonnasse-Gahot

MRI has been extensively used to identify anatomical and functional differences in Autism Spectrum Disorder (ASD). Yet, many of these findings have proven difficult to replicate because studies rely on small cohorts and are built on many complex, undisclosed, analytic choices. We conducted an international challenge to predict ASD diagnosis from MRI data, where we provided preprocessed anatomical and functional MRI data from > 2,000 individuals. Evaluation of the predictions was rigorously blinded. 146 challengers submitted prediction algorithms, which were evaluated at the end of the challenge using unseen data and an additional acquisition site. On the best algorithms, we studied the importance of MRI modalities, brain regions, and sample size. We found evidence that MRI could predict ASD diagnosis: the 10 best algorithms reliably predicted diagnosis with AUC∼0.80 - far superior to what can be currently obtained using genotyping data in cohorts 20-times larger. We observed that functional MRI was more important for prediction than anatomical MRI, and that increasing sample size steadily increased prediction accuracy, providing an efficient strategy to improve biomarkers. We also observed that despite a strong incentive to generalise to unseen data, model development on a given dataset faces the risk of overfitting: performing well in cross-validation on the data at hand, but not generalising. Finally, we were able to predict ASD diagnosis on an external sample added after the end of the challenge (EU-AIMS), although with a lower prediction accuracy (AUC=0.72). This indicates that despite being based on a large multisite cohort, our challenge still produced biomarkers fragile in the face of dataset shifts.

JBHI Journal 2022 Journal Article

Speckle Noise Reduction for OCT Images Based on Image Style Transfer and Conditional GAN

  • Yi Zhou
  • Kai Yu
  • Meng Wang
  • Yuhui Ma
  • Yuanyuan Peng
  • Zhongyue Chen
  • Weifang Zhu
  • Fei Shi

Raw optical coherence tomography (OCT) images typically are of low quality because speckle noise blurs retinal structures, severely compromising visual quality and degrading performances of subsequent image analysis tasks. In our previous study (Ma et al., 2018), we have developed a Conditional Generative Adversarial Network (cGAN) for speckle noise removal in OCT images collected by several commercial OCT scanners, which we collectively refer to as scanner T. In this paper, we improve the cGAN model and apply it to our in-house OCT scanner (scanner B) for speckle noise suppression. The proposed model consists of two steps: 1) We train a Cycle-Consistent GAN (CycleGAN) to learn style transfer between two OCT image datasets collected by different scanners. The purpose of the CycleGAN is to leverage the ground truth dataset created in our previous study. 2) We train a mini-cGAN model based on the PatchGAN mechanism with the ground truth dataset to suppress speckle noise in OCT images. After training, we first apply the CycleGAN model to convert raw images collected by scanner B to match the style of the images from scanner T, and subsequently use the mini-cGAN model to suppress speckle noise in the style transferred images. We evaluate the proposed method on a dataset collected by scanner B. Experimental results show that the improved model outperforms our previous method and other state-of-the-art models in speckle noise removal, retinal structure preservation and contrast enhancement.

AAAI Conference 2022 Conference Paper

Width & Depth Pruning for Vision Transformers

  • Fang Yu
  • Kun Huang
  • Meng Wang
  • Yuan Cheng
  • Wei Chu
  • Li Cui

Transformer models have demonstrated their promising potential and achieved excellent performance on a series of computer vision tasks. However, the huge computational cost of vision transformers hinders their deployment and application to edge devices. Recent works have proposed to find and remove the unimportant units of vision transformers. Despite achieving remarkable results, these methods take one dimension of network width into consideration and ignore network depth, which is another important dimension for pruning vision transformers. Therefore, we propose a Width & Depth Pruning (WDPruning) framework that reduces both width and depth dimensions simultaneously. Specifically, for width pruning, a set of learnable pruning-related parameters is used to adaptively adjust the width of transformer. For depth pruning, we introduce several shallow classifiers by using the intermediate information of the transformer blocks, which allows images to be classified by shallow classifiers instead of the deeper classifiers. In the inference period, all of the blocks after shallow classifiers can be dropped so they don't bring additional parameters and computation. Experimental results on benchmark datasets demonstrate that the proposed method can significantly reduce the computational costs of mainstream vision transformers such as DeiT and Swin Transformer with a minor accuracy drop. In particular, on ILSVRC-12, we achieve over 22% pruning ratio of FLOPs by compressing DeiT-Base, even with an increase of 0.14% Top-1 accuracy.

AAAI Conference 2021 Conference Paper

Leveraging Table Content for Zero-shot Text-to-SQL with Meta-Learning

  • Yongrui Chen
  • Xinnan Guo
  • Chaojie Wang
  • Jian Qiu
  • Guilin Qi
  • Meng Wang
  • Huiying Li

Single-table text-to-SQL aims to transform a natural language question into a SQL query according to one single table. Recent work has made promising progress on this task by pretrained language models and a multi-submodule framework. However, zero-shot table, that is, the invisible table in the training set, is currently the most critical bottleneck restricting the application of existing approaches to real-world scenarios. Although some work has utilized auxiliary tasks to help handle zero-shot tables, expensive extra manual annotation limits their practicality. In this paper, we propose a new approach for the zero-shot text-to-SQL task which does not rely on any additional manual annotations. Our approach consists of two parts. First, we propose a new model that leverages the abundant information of table content to help establish the mapping between questions and zero-shot tables. Further, we propose a simple but efficient meta-learning strategy to train our model. The strategy utilizes the two-step gradient update to force the model to learn a generalization ability towards zero-shot tables. We conduct extensive experiments on a public open-domain text-to-SQL dataset WikiSQL and a domain-specific dataset ESQL. Compared to existing approaches using the same pre-trained model, our approach achieves significant improvements on both datasets. Compared to the larger pre-trained model and the tabular-specific pre-trained model, our approach is still competitive. More importantly, on the zero-shot subsets of both the datasets, our approach further increases the improvements.

AAAI Conference 2021 Conference Paper

Making the Relation Matters: Relation of Relation Learning Network for Sentence Semantic Matching

  • Kun Zhang
  • Le Wu
  • Guangyi Lv
  • Meng Wang
  • Enhong Chen
  • Shulan Ruan

Sentence semantic matching is one of the fundamental tasks in natural language processing, which requires an agent to determine the semantic relation among input sentences. Recently, deep neural networks have achieved impressive performance in this area, especially BERT. Despite their effectiveness, most of these models treat output labels as meaningless one-hot vectors, underestimating the semantic information and guidance of relations that these labels reveal, especially for tasks with a small number of labels. To address this problem, we propose a Relation of Relation Learning Network (R2-Net) for sentence semantic matching. Specifically, we first employ BERT to encode the input sentences from a global perspective. Then a CNN-based encoder is designed to capture keywords and phrase information from a local perspective. To fully leverage labels for better relation information extraction, we introduce a self-supervised relation of relation classification task for guiding R2-Net to consider more about relations. Meanwhile, a triplet loss is employed to distinguish the intra-class and inter-class relations in a finer granularity. Empirical experiments on two sentence semantic matching tasks demonstrate the superiority of our proposed model.
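
The triplet loss mentioned in the abstract above can be illustrated with a minimal sketch. This is a generic margin-based triplet loss over plain Python lists, not the R2-Net implementation; the function names, the Euclidean distance, and the default margin of 1.0 are assumptions made for illustration.

```python
import math


def euclidean(u, v):
    """Euclidean distance between two equal-length vectors."""
    return math.sqrt(sum((a - b) ** 2 for a, b in zip(u, v)))


def triplet_loss(anchor, positive, negative, margin=1.0):
    """Generic margin-based triplet loss (illustrative, hypothetical helper).

    The loss is zero once the negative is at least `margin` farther
    from the anchor than the positive is.
    """
    d_pos = euclidean(anchor, positive)
    d_neg = euclidean(anchor, negative)
    return max(0.0, d_pos - d_neg + margin)


if __name__ == "__main__":
    # Positive is close, negative is far: the margin is already satisfied.
    print(triplet_loss([0.0, 0.0], [0.0, 1.0], [3.0, 4.0]))
    # Roles swapped: the loss is positive.
    print(triplet_loss([0.0, 0.0], [3.0, 4.0], [0.0, 1.0]))
```

In this form, same-class pairs are pulled together and different-class pairs are pushed apart until the margin is met.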

YNICL Journal 2021 Journal Article

Multisite schizophrenia classification by integrating structural magnetic resonance imaging data with polygenic risk score

  • Ke Hu
  • Meng Wang
  • Yong Liu
  • Hao Yan
  • Ming Song
  • Jun Chen
  • Yunchun Chen
  • Huaning Wang

Previous brain structural magnetic resonance imaging studies reported that patients with schizophrenia have brain structural abnormalities, which have been used to discriminate schizophrenia patients from normal controls. However, most existing studies identified schizophrenia patients at a single site, and the genetic features closely associated with highly heritable schizophrenia were not considered. In this study, we performed standardized feature extraction on brain structural magnetic resonance images and on genetic data to separate schizophrenia patients from normal controls. A total of 1010 participants, 508 schizophrenia patients and 502 normal controls, were recruited from 8 independent sites across China. Classification experiments were carried out using different machine learning methods and input features. We tested a support vector machine, logistic regression, and an ensemble learning strategy using 3 feature sets of interest: (1) imaging features: gray matter volume, (2) genetic features: polygenic risk scores, and (3) a fusion of imaging features and genetic features. The performance was assessed by leave-one-site-out cross-validation. Finally, some important brain and genetic features were identified. We found that the models with both imaging and genetic features as input performed better than models with either alone. The average accuracy of the classification models with the best performance in the cross-validation was 71.6%. The genetic feature that measured the cumulative risk of the genetic variants most associated with schizophrenia contributed the most to the classification. Our work took the first step toward considering both structural brain alterations and genome-wide genetic factors in a large-scale multisite schizophrenia classification. Our findings may provide insight into the underlying pathophysiology and risk mechanisms of schizophrenia.
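
The leave-one-site-out cross-validation named in the abstract above can be sketched as follows. This is only an illustration of the splitting scheme under assumed inputs: the function name and the per-participant list of site labels are hypothetical and not taken from the study.

```python
def leave_one_site_out(sites):
    """Yield (held_out_site, train_indices, test_indices) for each site.

    `sites` is an assumed list giving the recruitment site of each
    participant; every fold tests on one site and trains on the rest.
    """
    for held_out in sorted(set(sites)):
        train = [i for i, s in enumerate(sites) if s != held_out]
        test = [i for i, s in enumerate(sites) if s == held_out]
        yield held_out, train, test


if __name__ == "__main__":
    # Toy example with three sites and five participants.
    toy_sites = ["A", "A", "B", "C", "B"]
    for site, train_idx, test_idx in leave_one_site_out(toy_sites):
        print(site, train_idx, test_idx)
```

With 8 sites, as in the study, this scheme would produce 8 folds, each evaluating a model on participants from a site it never saw during training.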

AAAI Conference 2021 Conference Paper

Partial-Label and Structure-constrained Deep Coupled Factorization Network

  • Yan Zhang
  • Zhao Zhang
  • Yang Wang
  • Zheng Zhang
  • Li Zhang
  • Shuicheng Yan
  • Meng Wang

In this paper, we technically propose an enriched prior guided framework, called Dual-constrained Deep Semi-Supervised Coupled Factorization Network (DS2CF-Net), for discovering hierarchical coupled data representation. To extract hidden deep features, DS2CF-Net is formulated as a partial-label and geometrical structure-constrained framework. Specifically, DS2CF-Net designs a deep factorization architecture using multilayers of linear transformations, which can coupled update both the basis vectors and new representations in each layer. To enable learned deep representations and coefficients to be discriminative, we also consider enriching the supervised prior by joint deep coefficients-based label prediction and then incorporate the enriched prior information as additional label and structure constraints. The label constraint can enable the intra-class samples to have same coordinate in feature space, and the structure constraint forces the coefficients in each layer to be block-diagonal so that the enriched prior using the self-expressive label propagation are more accurate. Our network also integrates the adaptive dual-graph learning to retain the local structures of both data and feature manifolds in each layer. Extensive experiments on image datasets demonstrate the effectiveness of DS2CF-Net for representation learning and clustering.

AAAI Conference 2021 Conference Paper

Proposal-Free Video Grounding with Contextual Pyramid Network

  • Kun Li
  • Dan Guo
  • Meng Wang

The challenge of video grounding - localizing activities in an untrimmed video via a natural language query - is to tackle the semantics of vision and language consistently along the temporal dimension. Most existing proposal-based methods are trapped by computational cost with extensive candidate proposals. In this paper, we propose a novel proposal-free framework named Contextual Pyramid Network (CPNet) to investigate multi-scale temporal correlation in the video. Specifically, we propose a pyramid network to extract 2D contextual correlation maps at different temporal scales (T ∗ T, T/2 ∗ T/2, T/4 ∗ T/4), where the 2D correlation map (past → current & current ← future) is designed to model all the relations of any two moments in the video. In other words, CPNet progressively replenishes the temporal contexts and refines the location of queried activity by enlarging the temporal receptive fields. Finally, we implement a temporal self-attentive regression (i.e., proposal-free regression) to predict the activity boundary from the above hierarchical context-aware 2D correlation maps. Extensive experiments on ActivityNet Captions, Charades-STA, and TACoS datasets demonstrate that our approach outperforms state-of-the-art methods.

IJCAI Conference 2021 Conference Paper

Reward-Constrained Behavior Cloning

  • Zhaorong Wang
  • Meng Wang
  • Jingqi Zhang
  • Yingfeng Chen
  • Chongjie Zhang

Deep reinforcement learning (RL) has demonstrated success in challenging decision-making/control tasks. However, RL methods, which solve tasks through maximizing the expected reward, may generate undesirable behaviors due to inferior local convergence or incompetent reward design. These undesirable behaviors of agents may not reduce the total reward but destroy the user experience of the application. For example, in the autonomous driving task, the policy actuated by speed reward behaves much more sudden brakes while human drivers generally don’t do that. To overcome this problem, we present a novel method named Reward-Constrained Behavior Cloning (RCBC) which synthesizes imitation learning and constrained reinforcement learning. RCBC leverages human demonstrations to induce desirable or human-like behaviors and employs lower-bound reward constraints for policy optimization to maximize the expected reward. Empirical results on popular benchmark environments show that RCBC learns significantly more human-desired policies with performance guarantees which meet the lower-bound reward constraints while performing better than or as well as baseline methods in terms of reward maximization.

NeurIPS Conference 2021 Conference Paper

Why Lottery Ticket Wins? A Theoretical Perspective of Sample Complexity on Sparse Neural Networks

  • Shuai Zhang
  • Meng Wang
  • Sijia Liu
  • Pin-Yu Chen
  • Jinjun Xiong

The lottery ticket hypothesis (LTH) states that learning on a properly pruned network (the winning ticket) has improved test accuracy over the original unpruned network. Although LTH has been justified empirically in a broad range of deep neural network (DNN) involved applications like computer vision and natural language processing, the theoretical validation of the improved generalization of a winning ticket remains elusive. To the best of our knowledge, our work, for the first time, characterizes the performance of training a pruned neural network by analyzing the geometric structure of the objective function and the sample complexity to achieve zero generalization error. We show that the convex region near a desirable model with guaranteed generalization enlarges as the neural network model is pruned, indicating the structural importance of a winning ticket. Moreover, as the algorithm for training a pruned neural network is specified as an (accelerated) stochastic gradient descent algorithm, we theoretically show that the number of samples required for achieving zero generalization error is proportional to the number of the non-pruned weights in the hidden layer. With a fixed number of samples, training a pruned neural network enjoys a faster convergence rate to the desired model than training the original unpruned one, providing a formal justification of the improved generalization of the winning ticket. Our theoretical results are acquired from learning a pruned neural network of one hidden layer, while experimental results are further provided to justify the implications in pruning multi-layer neural networks.

AAAI Conference 2020 System Paper

Automatic Car Damage Assessment System: Reading and Understanding Videos as Professional Insurance Inspectors

  • Wei Zhang
  • Yuan Cheng
  • Xin Guo
  • Qingpei Guo
  • Jian Wang
  • Qing Wang
  • Chen Jiang
  • Meng Wang

We demonstrate a car damage assessment system in car insurance field based on artificial intelligence techniques, which can exempt insurance inspectors from checking cars on site and help people without professional knowledge to evaluate car damages when accidents happen. Unlike existing approaches, we utilize videos instead of photos to interact with users to make the whole procedure as simple as possible. We adopt object and video detection and segmentation techniques in computer vision, and take advantage of multiple frames extracted from videos to achieve high damage recognition accuracy. The system uploads video streams captured by mobile devices, recognizes car damage on the cloud asynchronously and then returns damaged components and repair costs to users. The system evaluates car damages and returns results automatically and effectively in seconds, which reduces laboratory costs and decreases insurance claim time significantly.

TIST Journal 2020 Journal Article

Deep Neighborhood Component Analysis for Visual Similarity Modeling

  • Xueliang Liu
  • Xun Yang
  • Meng Wang
  • Richang Hong

Learning effective visual similarity is an essential problem in multimedia research. Despite the promising progress made in recent years, most existing approaches learn visual features and similarities in two separate stages, which inevitably limits their performance. Once useful information has been lost in the feature extraction stage, it can hardly be recovered later. This article proposes a novel end-to-end approach for visual similarity modeling, called deep neighborhood component analysis, which discriminatively trains deep neural networks to jointly learn visual features and similarities. Specifically, we first formulate a metric learning objective that maximizes the intra-class correlations and minimizes the inter-class correlations under the neighborhood component analysis criterion, and then train deep convolutional neural networks to learn a nonlinear mapping that projects visual instances from original feature space to a discriminative and neighborhood-structure-preserving embedding space, thus resulting in better performance. We conducted extensive evaluations on several widely used and challenging datasets, and the impressive results demonstrate the effectiveness of our proposed approach.

TIST Journal 2020 Journal Article

FROST

  • Meng Wang
  • Hui Li
  • Jiangtao Cui
  • Sourav S. Bhowmick
  • Ping Liu

The facility relocation (FR) problem, which aims to optimize the placement of facilities to accommodate the changes of users’ locations, has a broad spectrum of applications. Despite the significant progress made by existing solutions to the FR problem, they all assume each user is stationary and represented as a single point. Unfortunately, in reality, objects (e.g., people, animals) are mobile. For example, a car-sharing user picks up a vehicle from a station close to where he or she is currently located. Consequently, these efforts may fail to identify a superior solution to the FR problem. In this article, for the first time, we take into account the movement history of users and introduce a novel FR problem, called motion-fr, to address the preceding limitation. Specifically, we present a framework called frost to address it. frost comprises two exact algorithms: index based and index free. The former is designed to address the scenario when facilities and objects are known a priori, whereas the latter solves the motion-fr problem by jettisoning this assumption. Further, we extend the index-based algorithm to solve the general k - motion-fr problem, which aims to relocate k inferior facilities. We devise an approximate solution due to NP-hardness of the problem. Experimental study over both real-world and synthetic datasets demonstrates the superiority of our framework in comparison to state-of-the-art FR techniques in efficiency and effectiveness.

AAAI Conference 2020 Short Paper

Generative Adversarial Imitation Learning from Failed Experiences (Student Abstract)

  • Jiacheng Zhu
  • Jiahao Lin
  • Meng Wang
  • Yingfeng Chen
  • Changjie Fan
  • Chong Jiang
  • Zongzhang Zhang

Imitation learning provides a family of promising methods that learn policies from expert demonstrations directly. As a model-free and on-line imitation learning method, generative adversarial imitation learning (GAIL) generalizes well to unseen situations and can handle complex problems. In this paper, we propose a novel variant of GAIL called GAIL from failed experiences (GAILFE). GAILFE allows an agent to utilize failed experiences in the training process. Moreover, a constrained optimization objective is formalized in GAILFE to balance learning from given demonstrations and from self-generated failed experiences. Empirically, compared with GAIL, GAILFE can improve sample efficiency and learning speed over different tasks.

IJCAI Conference 2020 Conference Paper

Learning to Discretely Compose Reasoning Module Networks for Video Captioning

  • Ganchao Tan
  • Daqing Liu
  • Meng Wang
  • Zheng-Jun Zha

Generating natural language descriptions for videos, i.e., video captioning, essentially requires step-by-step reasoning along the generation process. For example, to generate the sentence “a man is shooting a basketball”, we need to first locate and describe the subject “man”, next reason out the man is “shooting”, then describe the object “basketball” of shooting. However, existing visual reasoning methods designed for visual question answering are not appropriate to video captioning, for it requires more complex visual reasoning on videos over both space and time, and dynamic module composition along the generation process. In this paper, we propose a novel visual reasoning approach for video captioning, named Reasoning Module Networks (RMN), to equip the existing encoder-decoder framework with the above reasoning capacity. Specifically, our RMN employs 1) three sophisticated spatio-temporal reasoning modules, and 2) a dynamic and discrete module selector trained by a linguistic loss with a Gumbel approximation. Extensive experiments on MSVD and MSR-VTT datasets demonstrate the proposed RMN outperforms the state-of-the-art methods while providing an explicit and explainable generation process. Our code is available at https://github.com/tgc1997/RMN.

AAAI Conference 2020 Conference Paper

Learning to Match on Graph for Fashion Compatibility Modeling

  • Xun Yang
  • Xiaoyu Du
  • Meng Wang

Understanding the mix-and-match relationships between items receives increasing attention in the fashion industry. Existing methods have primarily learned visual compatibility from dyadic co-occurrence or co-purchase information of items to model the item-item matching interaction. Despite effectiveness, rich extra-connectivities between compatible items, e.g., user-item interactions and item-item substitutable relationships, which characterize the structural properties of items, have been largely ignored. This paper presents a graph-based fashion matching framework named Deep Relational Embedding Propagation (DREP), aiming to inject the extra-connectivities between items into the pairwise compatibility modeling. Specifically, we first build a multi-relational item-item-user graph which encodes diverse item-item and user-item relationships. Then we compute structured representations of items by an attentive relational embedding propagation rule that performs messages propagation along edges of the relational graph. This leads to expressive modeling of higher-order connectivity between items and also better representation of fashion items. Finally, we predict pairwise compatibility based on a compatibility metric learning module. Extensive experiments show that DREP can significantly improve the performance of state-of-the-art methods.

IJCAI Conference 2020 Conference Paper

Multi-Scale Spatial-Temporal Integration Convolutional Tube for Human Action Recognition

  • Haoze Wu
  • Jiawei Liu
  • Xierong Zhu
  • Meng Wang
  • Zheng-Jun Zha

Applying multi-scale representations leads to consistent performance improvements on a wide range of image recognition tasks. However, with the addition of the temporal dimension in video domain, directly obtaining layer-wise multi-scale spatial-temporal features will add a lot of extra computational cost. In this work, we propose a novel and efficient Multi-Scale Spatial-Temporal Integration Convolutional Tube (MSTI) aiming at achieving accurate recognition of actions with lower computational cost. It firstly extracts multi-scale spatial and temporal features through the multi-scale convolution block. Considering the interaction of different-scales representations and the interaction of spatial appearance and temporal motion, we employ the cross-scale attention weighted blocks to perform feature recalibration by integrating multi-scale spatial and temporal features. An end-to-end deep network, MSTI-Net, is also presented based on the proposed MSTI tube for human action recognition. Extensive experimental results show that our MSTI-Net significantly boosts the performance of existing convolution networks and achieves state-of-the-art accuracy on three challenging benchmarks, i.e., UCF-101, HMDB-51 and Kinetics-400, with much fewer parameters and FLOPs.

AAAI Conference 2020 Conference Paper

One-Shot Learning for Long-Tail Visual Relation Detection

  • Weitao Wang
  • Meng Wang
  • Sen Wang
  • Guodong Long
  • Lina Yao
  • Guilin Qi
  • Yang Chen

The aim of visual relation detection is to provide a comprehensive understanding of an image by describing all the objects within the scene, and how they relate to each other, in form; for example, . This ability is vital for image captioning, visual question answering, and many other applications. However, visual relationships have long-tailed distributions and, thus, the limited availability of training samples is hampering the practicability of conventional detection approaches. With this in mind, we designed a novel model for visual relation detection that works in one-shot settings. The embeddings of objects and predicates are extracted through a network that includes a feature-level attention mechanism. Attention alleviates some of the problems with feature sparsity, and the resulting representations capture more discriminative latent features. The core of our model is a dual graph neural network that passes and aggregates the context information of predicates and objects in an episodic training scheme to improve recognition of the one-shot predicates and then generate the triplets. To the best of our knowledge, we are the first to center on the viability of one-shot learning for visual relation detection. Extensive experiments on two newly-constructed datasets show that our model significantly improved the performance of two tasks PredCls and SGCls from 2.8% to 12.2% compared with state-of-the-art baselines.

IJCAI Conference 2020 Conference Paper

Quadratic Sparse Gaussian Graphical Model Estimation Method for Massive Variables

  • Jiaqi Zhang
  • Meng Wang
  • Qinchi Li
  • Sen Wang
  • Xiaojun Chang
  • Beilun Wang

We consider the problem of estimating a sparse Gaussian Graphical Model with a special graph topological structure and more than a million variables. Most previous scalable estimators still contain expensive calculation steps (e.g., matrix inversion or Hessian matrix calculation) and become infeasible in high-dimensional scenarios, where p (number of variables) is larger than n (number of samples). To overcome this challenge, we propose a novel method, called Fast and Scalable Inverse Covariance Estimator by Thresholding (FST). FST first obtains a graph structure by applying a generalized threshold to the sample covariance matrix. Then, it solves multiple block-wise subproblems via element-wise thresholding. By using matrix thresholding instead of matrix inversion as the computational bottleneck, FST reduces its computational complexity to a much lower order of magnitude (O(p^2)). We show that FST obtains the same sharp convergence rate O(√(log max{p, n}/n)) as other state-of-the-art methods. We validate the method empirically, on multiple simulated datasets and one real-world dataset, and show that FST is two times faster than the four baselines while achieving a lower error rate under both Frobenius-norm and max-norm.
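
The element-wise thresholding of a sample covariance matrix described in the abstract above can be sketched as follows. This is not the FST estimator itself: soft-thresholding is assumed here as one common choice of thresholding operator, and the function names, the choice to leave the diagonal untouched, and the toy matrix are all illustrative assumptions.

```python
def soft_threshold(x, lam):
    """Shrink x toward zero by lam; values within [-lam, lam] become 0."""
    if x > lam:
        return x - lam
    if x < -lam:
        return x + lam
    return 0.0


def threshold_covariance(S, lam):
    """Soft-threshold the off-diagonal entries of a square matrix S.

    Illustrative only: the diagonal is kept unchanged (an assumption),
    and the cost is O(p^2) for a p-by-p matrix since each entry is
    visited once.
    """
    p = len(S)
    return [
        [S[i][j] if i == j else soft_threshold(S[i][j], lam) for j in range(p)]
        for i in range(p)
    ]


if __name__ == "__main__":
    # Toy 3x3 sample covariance matrix (made-up values).
    S = [
        [1.0, 0.3, -0.05],
        [0.3, 1.0, 0.6],
        [-0.05, 0.6, 1.0],
    ]
    for row in threshold_covariance(S, 0.1):
        print(row)
```

Entries that are thresholded to zero correspond to pairs of variables treated as unconnected in the resulting sparsity pattern, which is what makes thresholding far cheaper than matrix inversion.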

IJCAI Conference 2020 Conference Paper

Recurrent Relational Memory Network for Unsupervised Image Captioning

  • Dan Guo
  • Yang Wang
  • Peipei Song
  • Meng Wang

Unsupervised image captioning with no annotations is an emerging challenge in computer vision, where the existing arts usually adopt GAN (Generative Adversarial Networks) models. In this paper, we propose a novel memory-based network rather than GAN, named Recurrent Relational Memory Network (R2M). Unlike complicated and sensitive adversarial learning that non-ideally performs for long sentence generation, R2M implements a concepts-to-sentence memory translator through two-stage memory mechanisms: fusion and recurrent memories, correlating the relational reasoning between common visual concepts and the generated words for long periods. R2M encodes visual context through unsupervised training on images, while enabling the memory to learn from irrelevant textual corpus via supervised fashion. Our solution enjoys less learnable parameters and higher computational efficiency than GAN-based methods, which heavily bear parameter sensitivity. We experimentally validate the superiority of R2M than state-of-the-arts on all benchmark datasets.

AAAI Conference 2020 Conference Paper

Revisiting Graph Based Collaborative Filtering: A Linear Residual Graph Convolutional Network Approach

  • Lei Chen
  • Le Wu
  • Richang Hong
  • Kun Zhang
  • Meng Wang

Graph Convolutional Networks (GCNs) are state-of-the-art graph based representation learning models by iteratively stacking multiple layers of convolution aggregation operations and non-linear activation operations. Recently, in Collaborative Filtering (CF) based Recommender Systems (RS), by treating the user-item interaction behavior as a bipartite graph, some researchers model higher-layer collaborative signals with GCNs. These GCN based recommender models show superior performance compared to traditional works. However, these models suffer from training difficulty with non-linear activations for large user-item graphs. Besides, most GCN based models could not model deeper layers due to the over smoothing effect with the graph convolution operation. In this paper, we revisit GCN based CF models from two aspects. First, we empirically show that removing non-linearities would enhance recommendation performance, which is consistent with the theories in simple graph convolutional networks. Second, we propose a residual network structure that is specifically designed for CF with user-item interaction modeling, which alleviates the over smoothing problem in graph convolution aggregation operation with sparse user-item interaction data. The proposed model is a linear model and it is easy to train, scale to large datasets, and yield better efficiency and effectiveness on two real datasets. We publish the source code at https://github.com/newlei/LR-GCCF.

TCS Journal 2020 Journal Article

Translating Xd-C programs to MSVL programs

  • Meng Wang
  • Cong Tian
  • Nan Zhang
  • Zhenhua Duan
  • Chenguang Yao

C language is one of the most popular languages for software systems. In order to verify safety, reliability and security properties of such systems written in C, a tool UMC4M for runtime verification at code level based on Modeling, Simulation and Verification Language (MSVL) and its compiler MC is employed. To do so, a C program P has to be translated to an MSVL program M and the negation of a desired property Q is also translated to an MSVL program M', then “M and M'” is compiled and executed armed with MC. Whether P violates Q is checked by evaluating whether there exists an acceptable execution of new MSVL program “M and M'”. Therefore, how to translate a C program to an MSVL program is a critical issue. However, in general, C is of complicated structures with goto statement. In this paper, we confine the syntax of C in a suitable subset called Xd-C without loss of expressiveness. Further, we present a translation algorithm from an Xd-C program to an MSVL program based on translation algorithms for expressions and statements. Moreover, the equivalences between expressions and statements involved in Xd-C and MSVL programs are inductively proved. Subsequently, the equivalence between the original Xd-C program and the translated MSVL program is also proved. In addition, the proposed approach has been implemented by a tool called C2M. A benchmark of experiments including 13 real-world Xd-C programs is conducted. The results show that C2M works effectively.

IJCAI Conference 2020 Conference Paper

Unsupervised Vehicle Re-identification with Progressive Adaptation

  • Jinjia Peng
  • Yang Wang
  • Huibing Wang
  • Zhao Zhang
  • Xianping Fu
  • Meng Wang

Vehicle re-identification (reID) aims at identifying vehicles across different non-overlapping cameras views. The existing methods heavily relied on well-labeled datasets for ideal performance, which inevitably causes fateful drop due to the severe domain bias between the training domain and the real-world scenes; worse still, these approaches required full annotations, which is labor-consuming. To tackle these challenges, we propose a novel Progressive Adaptation Learning method for vehicle reID, named PAL, which infers from the abundant data without annotations. For PAL, a data adaptation module is employed for source domain, which generates the images with similar data distribution to unlabeled target domain as “pseudo target samples”. These pseudo samples are combined with the unlabeled samples that are selected by a dynamic sampling strategy to make training faster. We further proposed a weighted label smoothing (WLS) loss, which considers the similarity between samples with different clusters to balance the confidence of pseudo labels. Comprehensive experimental results validate the advantages of PAL on both VehicleID and VeRi-776 dataset.

IJCAI Conference 2019 Conference Paper

Approximate Optimal Transport for Continuous Densities with Copulas

  • Jinjin Chi
  • Jihong Ouyang
  • Ximing Li
  • Yang Wang
  • Meng Wang

Optimal Transport (OT) formulates a powerful framework by comparing probability distributions, and it has increasingly attracted great attention within the machine learning community. However, it suffers from severe computational burden, due to the intractable objective with respect to the distributions of interest. Especially, there still exist very few attempts for continuous OT, i.e., OT for comparing continuous densities. To this end, we develop a novel continuous OT method, namely Copula OT (Cop-OT). The basic idea is to transform the primal objective of continuous OT into a tractable form with respect to the copula parameter, which can be efficiently solved by stochastic optimization with less time and memory requirements. Empirical results on real applications of image retrieval and synthetic data demonstrate that our Cop-OT can gain more accurate approximations to continuous OT values than the state-of-the-art baselines.

IJCAI Conference 2019 Conference Paper

Connectionist Temporal Modeling of Video and Language: a Joint Model for Translation and Sign Labeling

  • Dan Guo
  • Shengeng Tang
  • Meng Wang

Online sign interpretation suffers from challenges presented by hybrid semantics learning among sequential variations of visual representations, sign linguistics, and textual grammars. This paper proposes a Connectionist Temporal Modeling (CTM) network for sentence translation and sign labeling. To acquire short-term temporal correlations, a Temporal Convolution Pyramid (TCP) module is performed on 2D CNN features to realize (2D+1D)=pseudo 3D' CNN features. CTM aligns the pseudo 3D' with the original 3D CNN clip features and fuses them. Next, we implement a connectionist decoding scheme for long-term sequential learning. Here, we embed dynamic programming into the decoding scheme, which learns temporal mapping among features, sign labels, and the generated sentence directly. The solution using dynamic programming to sign labeling is considered as pseudo labels. Finally, we utilize the pseudo supervision cues in an end-to-end framework. A joint objective function is designed to measure feature correlation, entropy regularization on sign labeling, and probability maximization on sentence decoding. The experimental results using the RWTH-PHOENIX-Weather and USTC-CSL datasets demonstrate the effectiveness of the proposed approach.

IJCAI Conference 2019 Conference Paper

Dense Temporal Convolution Network for Sign Language Translation

  • Dan Guo
  • Shuo Wang
  • Qi Tian
  • Meng Wang

The sign language translation (SLT) which aims at translating a sign language video into natural language is a weakly supervised task, given that there is no exact mapping relationship between visual actions and textual words in a sentence label. To align the sign language actions and translate them into the respective words automatically, this paper proposes a dense temporal convolution network, termed DenseTCN which captures the actions in hierarchical views. Within this network, a temporal convolution (TC) is designed to learn the short-term correlation among adjacent features and further extended to a dense hierarchical structure. In the kth TC layer, we integrate the outputs of all preceding layers together: (1) The TC in a deeper layer essentially has larger receptive fields, which captures long-term temporal context by the hierarchical content transition. (2) The integration addresses the SLT problem by different views, including embedded short-term and extended long-term sequential learning. Finally, we adopt the CTC loss and a fusion strategy to learn the feature-wise classification and generate the translated sentence. The experimental results on two popular sign language benchmarks, i.e., PHOENIX and USTC-ConSents, demonstrate the effectiveness of our proposed method in terms of various measurements.

IJCAI Conference 2019 Conference Paper

Dual Visual Attention Network for Visual Dialog

  • Dan Guo
  • Hui Wang
  • Meng Wang

Visual dialog is a challenging task, which involves multi-round semantic transformations between vision and language. This paper aims to address cross-modal semantic correlation for visual dialog. Motivated by that Vg (global vision), Vl (local vision), Q (question) and H (history) have inseparable relevances, the paper proposes a novel Dual Visual Attention Network (DVAN) to realize (Vg, Vl, Q, H) → A. DVAN is a three-stage query-adaptive attention model. In order to acquire accurate A (answer), it first explores the textual attention, which imposes the question on history to pick out related context H'. Then, based on Q and H', it implements respective visual attentions to discover related global image visual hints Vg' and local object-based visual hints Vl'. Next, a dual crossing visual attention is proposed. Vg' and Vl' are mutually embedded to learn the complementary of visual semantics. Finally, the attended textual and visual features are combined to infer the answer. Experimental results on the VisDial v0.9 and v1.0 datasets validate the effectiveness of the proposed approach.

TIST Journal 2019 Journal Article

Motion-Aware Compression and Transmission of Mesh Animation Sequences

  • Bailin Yang
  • Luhong Zhang
  • Frederick W. B. Li
  • Xiaoheng Jiang
  • Zhigang Deng
  • Meng Wang
  • Mingliang Xu

With the increasing demand in using 3D mesh data over networks, supporting effective compression and efficient transmission of meshes has caught lots of attention in recent years. This article introduces a novel compression method for 3D mesh animation sequences, supporting user-defined and progressive transmissions over networks. Our motion-aware approach starts with clustering animation frames based on their motion similarities, dividing a mesh animation sequence into fragments of varying lengths. This is done by a novel temporal clustering algorithm, which measures motion similarity based on the curvature and torsion of a space curve formed by corresponding vertices along a series of animation frames. We further segment each cluster based on mesh vertex coherence, representing topological proximity within an object under certain motion. To produce a compact representation, we perform intra-cluster compression based on Graph Fourier Transform (GFT) and Set Partitioning In Hierarchical Trees (SPIHT) coding. Optimized compression results can be achieved by applying GFT due to the proximity in vertex position and motion. We adapt SPIHT to support progressive transmission and design a mechanism to transmit mesh animation sequences with user-defined quality. Experimental results show that our method can obtain a high compression ratio while maintaining a low reconstruction error.

IJCAI Conference 2019 Conference Paper

Personalized Multimedia Item and Key Frame Recommendation

  • Le Wu
  • Lei Chen
  • Yonghui Yang
  • Richang Hong
  • Yong Ge
  • Xing Xie
  • Meng Wang

When recommending or advertising items to users, an emerging trend is to present each multimedia item with a key frame image (e.g., the poster of a movie). As each multimedia item can be represented as multiple fine-grained visual images (e.g., related images of the movie), personalized key frame recommendation is necessary in these applications to attract users' unique visual preferences. However, previous personalized key frame recommendation models relied on users' fine-grained image behavior of multimedia items (e.g., user-image interaction behavior), which is often not available in real scenarios. In this paper, we study the general problem of joint multimedia item and key frame recommendation in the absence of the fine-grained user-image behavior. We argue that the key challenge of this problem lies in discovering users' visual profiles for key frame recommendation, as most recommendation models would fail without any users' fine-grained image behavior. To tackle this challenge, we leverage users' item behavior by projecting users (items) in two latent spaces: a collaborative latent space and a visual latent space. We further design a model to discern both the collaborative and visual dimensions of users, and model how users make decisive item preferences from these two spaces. As a result, the learned user visual profiles could be directly applied for key frame recommendation. Finally, experimental results on a real-world dataset clearly show the effectiveness of our proposed model on the two recommendation tasks.

AAAI Conference 2019 Conference Paper

TransNFCM: Translation-Based Neural Fashion Compatibility Modeling

  • Xun Yang
  • Yunshan Ma
  • Lizi Liao
  • Meng Wang
  • Tat-Seng Chua

Identifying mix-and-match relationships between fashion items is an urgent task in a fashion e-commerce recommender system. It will significantly enhance user experience and satisfaction. However, due to the challenges of inferring the rich yet complicated set of compatibility patterns in a large e-commerce corpus of fashion items, this task is still underexplored. Inspired by the recent advances in multi-relational knowledge representation learning and deep neural networks, this paper proposes a novel Translation-based Neural Fashion Compatibility Modeling (TransNFCM) framework, which jointly optimizes fashion item embeddings and category-specific complementary relations in a unified space via an end-to-end learning manner. TransNFCM places items in a unified embedding space where a category-specific relation (category-comp-category) is modeled as a vector translation operating on the embeddings of compatible items from the corresponding categories. In this way, we not only capture the specific notion of compatibility conditioned on a specific pair of complementary categories, but also preserve the global notion of compatibility. We also design a deep fashion item encoder which exploits the complementary characteristic of visual and textual features to represent the fashion products. To the best of our knowledge, this is the first work that uses category-specific complementary relations to model the category-aware compatibility between items in a translation-based embedding space. Extensive experiments demonstrate the effectiveness of TransNFCM over the state-of-the-arts on two real-world datasets.

IJCAI Conference 2018 Conference Paper

Fine-grained Image Classification by Visual-Semantic Embedding

  • Huapeng Xu
  • Guilin Qi
  • Jingjing Li
  • Meng Wang
  • Kang Xu
  • Huan Gao

This paper investigates a challenging problem, which is known as fine-grained image classification (FGIC). Different from conventional computer vision problems, FGIC suffers from the large intraclass diversities and subtle inter-class differences. Existing FGIC approaches are limited to explore only the visual information embedded in the images. In this paper, we present a novel approach which can use handy prior knowledge from either structured knowledge bases or unstructured text to facilitate FGIC. Specifically, we propose a visual-semantic embedding model which explores semantic embedding from knowledge bases and text, and further trains a novel end-to-end CNN framework to linearly map image features to a rich semantic embedding space. Experimental results on a challenging large-scale UCSD Bird-200-2011 dataset verify that our approach outperforms several state-of-the-art methods with significant advances.

AAAI Conference 2018 Conference Paper

Hierarchical LSTM for Sign Language Translation

  • Dan Guo
  • Wengang Zhou
  • Houqiang Li
  • Meng Wang

Continuous Sign Language Translation (SLT) is a challenging task due to its specific linguistics under sequential gesture variation without word alignment. Current hybrid HMM and CTC (Connectionist temporal classification) based models are proposed to solve frame or word level alignment. They may fail to tackle the cases with messing word order corresponding to visual content in sentences. To solve the issue, this paper proposes a hierarchical-LSTM (HLSTM) encoder-decoder model with visual content and word embedding for SLT. It tackles different granularities by conveying spatio-temporal transitions among frames, clips and viseme units. It firstly explores spatio-temporal cues of video clips by 3D CNN and packs appropriate visemes by online key clip mining with adaptive variable-length. After pooling on recurrent outputs of the top layer of HLSTM, a temporal attention-aware weighting mechanism is proposed to balance the intrinsic relationship among viseme source positions. At last, another two LSTM layers are used to separately recurse viseme vectors and translate semantic. After preserving original visual content by 3D CNN and the top layer of HLSTM, it shortens the encoding time step of the bottom two LSTM layers with less computational complexity while attaining more nonlinearity. Our proposed model exhibits promising performance on signer-independent test with seen sentences and also outperforms the comparison algorithms on unseen sentences.

YNIMG Journal 2018 Journal Article

Optimal referencing for stereo-electroencephalographic (SEEG) recordings

  • Guangye Li
  • Shize Jiang
  • Sivylla E. Paraskevopoulou
  • Meng Wang
  • Yang Xu
  • Zehan Wu
  • Liang Chen
  • Dingguo Zhang

Stereo-electroencephalography (SEEG) is an intracranial recording technique in which depth electrodes are inserted in the brain as part of presurgical assessments for invasive brain surgery. SEEG recordings can tap into neural signals across the entire brain and thereby sample both cortical and subcortical sites. However, even though signal referencing is important for proper assessment of SEEG signals, no previous study has comprehensively evaluated the optimal referencing method for SEEG. In our study, we recorded SEEG data from 15 human subjects during a motor task, referencing them against the average of two white matter contacts (monopolar reference). We then subjected these signals to 5 different re-referencing approaches: common average reference (CAR), gray-white matter reference (GWR), electrode shaft reference (ESR), bipolar reference, and Laplacian reference. The results from three different signal quality metrics suggest the use of the Laplacian re-reference for study of local population-level activity and low-frequency oscillatory activity.

TIST Journal 2017 Journal Article

Learning User Attributes via Mobile Social Multimedia Analytics

  • Liqiang Nie
  • Luming Zhang
  • Meng Wang
  • Richang Hong
  • Aleksandr Farseev
  • Tat-Seng Chua

Learning user attributes from mobile social media is a fundamental basis for many applications, such as personalized and targeting services. A large and growing body of literature has investigated the user attributes learning problem. However, far too little attention has been paid to jointly consider the dual heterogeneities of user attributes learning by harvesting multiple social media sources. In particular, user attributes are complementarily and comprehensively characterized by multiple social media sources, including footprints from Foursquare, daily updates from Twitter, professional careers from LinkedIn, and photo posts from Instagram. On the other hand, attributes are inter-correlated in a complex way rather than independent of each other, and highly related attributes may share similar feature sets. Towards this end, we proposed a unified model to jointly regularize the source consistency and graph-constrained relatedness among tasks. As a byproduct, it is able to learn the attribute-specific and attribute-sharing features via graph-guided fused lasso penalty. Besides, we have theoretically demonstrated its optimization. Extensive evaluations on a real-world dataset thoroughly demonstrated the effectiveness of our proposed model.

TIST Journal 2017 Journal Article

Visual Classification of Furniture Styles

  • Zhenhen Hu
  • Yonggang Wen
  • Luoqi Liu
  • Jianguo Jiang
  • Richang Hong
  • Meng Wang
  • Shuicheng Yan

Furniture style describes the discriminative appearance characteristics of furniture. It plays an important role in real-world indoor decoration. In this article, we explore the furniture style features and study the problem of furniture style classification. Differing from traditional object classification, furniture style classification aims at classifying different furniture in terms of the “style” that describes its appearance (e.g., American style, Gothic style, Rococo style, etc.) rather than the “kind” that is more related to its functional structure (e.g., bed, desk, etc.). To pursue efficient furniture style features, we construct a novel dataset of furniture styles that contains 16 common style categories and implement three strategies with respect to two categories of classification, that is, handcrafted classification and learning-based classification. First, we follow the typical image classification pipeline to extract the handcrafted features and train the classifier by support vector machine. Then we use the convolutional neural network to extract learning-based features from training images. To obtain comprehensive furniture style features, we finally combine the handcrafted image classification pipeline and the learning-based network. We experimentally evaluate the performances of handcrafted features and learning-based features of each strategy, and the results show the superiority of learning-based features and also the comprehensiveness of handcrafted features.

IJCAI Conference 2016 Conference Paper

A Relaxed Ranking-Based Factor Model for Recommender System from Implicit Feedback

  • Huayu Li
  • Richang Hong
  • Defu Lian
  • Zhiang Wu
  • Meng Wang
  • Yong Ge

Implicit feedback based recommendation has recently been an important task with the accumulated user-item interaction data. However, it is very challenging to produce recommendations from implicit feedback due to the sparseness of data and the lack of negative feedback/rating. Although various factor models have been proposed to tackle this problem, they either focus on rating prediction that may lead to inaccurate top-k recommendations or are dependent on the sampling of negative feedback that often results in bias. To this end, we propose a Relaxed Ranking-based Factor Model, RRFM, to relax pairwise ranking into an SVM-like task, where positive and negative feedbacks are separated by the soft boundaries, and their non-separate property is employed to capture the characteristic of unobserved data. A smooth and scalable algorithm is developed to solve group- and instance-level optimization and parameter estimation. Extensive experiments based on real-world datasets demonstrate the effectiveness and advantage of our approach.

IJCAI Conference 2016 Conference Paper

Empirical Risk Minimization for Metric Learning Using Privileged Information

  • Xun Yang
  • Meng Wang
  • Luming Zhang
  • Dacheng Tao

Traditional metric learning methods usually make decisions based on a fixed threshold, which may result in a suboptimal metric when the inter-class and inner-class variations are complex. To address this issue, in this paper we propose an effective metric learning method by exploiting privileged information to relax the fixed threshold under the empirical risk minimization framework. Privileged information describes useful high-level semantic information that is only available during training. Our goal is to improve the performance by incorporating privileged information to design a locally adaptive decision function. We jointly learn two distance metrics by minimizing the empirical loss penalizing the difference between the distance in the original space and that in the privileged space. The distance in the privileged space functions as a locally adaptive decision threshold, which can guide the decision making like a teacher. We optimize the objective function using the Accelerated Proximal Gradient approach to obtain a global optimum solution. Experiment results show that by leveraging privileged information, our proposed method can achieve satisfactory performance.

YNIMG Journal 2016 Journal Article

Regional homogeneity of intrinsic brain activity correlates with auditory-motor processing of vocal pitch errors

  • Zhiqiang Guo
  • Xiyan Huang
  • Meng Wang
  • Jeffery A. Jones
  • Zhengjia Dai
  • Weifeng Li
  • Peng Liu
  • Hanjun Liu

It has been well documented that speakers produce rapid compensatory vocal adjustments for errors they perceive in their auditory feedback. The fact that they differ greatly in the degree to which they compensate for perceived errors, however, has received much less attention. The present study investigated whether intrinsic brain activity during resting can predict an individual's behavioral and cortical responses in compensating for pitch-shifted auditory feedback during vocalization. This relationship was investigated by correlating the regional homogeneity (ReHo) of resting-state fMRI signals with the vocal compensation and event-related potentials (N1 and P2) in response to pitch shifts of −200 and −500 cents. Behaviorally, the magnitudes of vocal compensation were significantly correlated with the ReHo values in the right supplementary motor area (SMA) for both −200 and −500 cents, the right primary motor cortex (M1) for −200 cents, and the left premotor cortex (PMC) for −500 cents. For both pitch shift sizes, there were significant correlations between ReHo and N1 amplitude in the left inferior frontal gyrus (IFG), right superior temporal gyrus (STG), bilateral M1, and left SMA. Significant correlations between ReHo and P2 amplitude were observed in the bilateral IFG, right STG, left SMA and M1 for −200 and −500 cents, the left PMC for −200 cents, and the right SMA for −500 cents. These findings provide the first evidence that regional homogeneity of intrinsic brain activity can predict behavioral and cortical responses in compensating for pitch errors in voice auditory feedback.

IS Journal 2016 Journal Article

Trust Agent-Based Behavior Induction in Social Networks

  • Lei Li
  • Jianping He
  • Meng Wang
  • Xindong Wu

The essence of social networks is that they can influence people's public opinions, and group behaviors form quickly. Negative group behavior influences societal stability significantly, but existing behavior-induction approaches are too simple and inefficient. To automatically and efficiently induce behavior in social networks, this article introduces trust agents and designs their features according to group behavior features. In addition, a dynamics control mechanism can be generated to coordinate participant behaviors in social networks to avoid a specific restricted negative group behavior.

TIST Journal 2015 Journal Article

Robust Multiview Feature Learning for RGB-D Image Understanding

  • Zheng-Jun Zha
  • Yang Yang
  • Jinhui Tang
  • Meng Wang
  • Tat-Seng Chua

The availability of massive RGB-depth (RGB-D) images poses a compelling need for effective RGB-D content understanding techniques. RGB-D images provide synchronized information from multiple views (e.g., color and depth) of real-world objects and scenes. This work proposes learning compact and discriminative features from the multiple views of RGB-D content toward effective feature representation for RGB-D image understanding. In particular, a robust multiview feature learning approach is developed, which exploits the intrinsic relations among multiple views. The feature learning in multiple views is jointly optimized in an integrated formulation. The joint optimization essentially exploits the intrinsic relations among the views, leading to effective features and making the learning process robust to noises. The feature learning function is formulated as a robust nonnegative graph embedding function over multiple graphs in various views. The graphs characterize the local geometric and discriminating structure of the multiview data. The joint sparsity in ℓ1-norm graph embedding and ℓ2,1-norm data factorization further enhances the robustness of feature learning. We derive an efficient computational solution for the proposed approach and provide rigorous theoretical proof with regard to its convergence. We apply the proposed approach to two RGB-D image understanding tasks: RGB-D object classification and RGB-D scene categorization. We conduct extensive experiments on two real-world RGB-D image datasets. The experimental results have demonstrated the effectiveness of the proposed approach.

IJCAI Conference 2015 Conference Paper

Saliency Detection with a Deeper Investigation of Light Field

  • Jun Zhang
  • Meng Wang
  • Jun Gao
  • Yi Wang
  • Xudong Zhang
  • Xindong Wu

Although the light field has been recently recognized as helpful in saliency detection, it is not comprehensively explored yet. In this work, we propose a new saliency detection model with light field data. The idea behind the proposed model originates from the following observations. (1) People can distinguish regions at different depth levels via adjusting the focus of eyes. Similarly, a light field image can generate a set of focal slices focusing at different depth levels, which suggests that a background can be weighted by selecting the corresponding slice. We show that background priors encoded by light field focusness have advantages in eliminating background distraction and enhancing the saliency by weighting the light field contrast. (2) Regions at closer depth ranges tend to be salient, while those far in the distance mostly belong to the backgrounds. We show that foreground objects can be easily separated from similar or cluttered backgrounds by exploiting their light field depth. Extensive evaluations on the recently introduced Light Field Saliency Dataset (LFSD) [Li et al., 2014], including studies of different light field cues and comparisons with Li et al.’s method (the only reported light field saliency detection approach to our knowledge) and the 2D/3D state-of-the-art approaches extended with light field depth/focusness information, show that the investigated light field properties are complementary with each other and lead to improvements on 2D/3D models, and our approach produces superior results in comparison with the state-of-the-art.

IJCAI Conference 2013 Conference Paper

Online Group Feature Selection

  • Jing Wang
  • Zhong-Qiu Zhao
  • Xuegang Hu
  • Yiu-ming Cheung
  • Meng Wang
  • Xindong Wu

Online feature selection with dynamic features has become an active research area in recent years. However, in some real-world applications such as image analysis and email spam filtering, features may arrive by groups. Existing online feature selection methods evaluate features individually, while existing group feature selection methods cannot handle online processing. Motivated by this, we formulate the online group feature selection problem, and propose a novel selection approach for this problem. Our proposed approach consists of two stages: online intra-group selection and online inter-group selection. In the intra-group selection, we use spectral analysis to select discriminative features in each group when it arrives. In the inter-group selection, we use Lasso to select a globally optimal subset of features. This 2-stage procedure continues until there are no more features to come or some predefined stopping conditions are met. Extensive experiments conducted on benchmark and real-world data sets demonstrate that our proposed approach outperforms other state-of-the-art online feature selection methods.

TIST Journal 2011 Journal Article

Active learning in multimedia annotation and retrieval

  • Meng Wang
  • Xian-Sheng Hua

Active learning is a machine learning technique that selects the most informative samples for labeling and uses them as training data. It has been widely explored in multimedia research community for its capability of reducing human annotation effort. In this article, we provide a survey on the efforts of leveraging active learning in multimedia annotation and retrieval. We mainly focus on two application domains: image/video annotation and content-based image retrieval. We first briefly introduce the principle of active learning and then we analyze the sample selection criteria. We categorize the existing sample selection strategies used in multimedia annotation and retrieval into five criteria: risk reduction, uncertainty, diversity, density and relevance. We then introduce several classification models used in active learning-based multimedia annotation and retrieval, including semi-supervised learning, multilabel learning and multiple instance learning. We also provide a discussion on several future trends in this research direction. In particular, we discuss cost analysis of human annotation and large-scale interactive multimedia annotation.

TIST Journal 2010 Journal Article

Accessible image search for colorblindness

  • Meng Wang
  • Bo Liu
  • Xian-Sheng Hua

This article introduces an intelligent system that accommodates colorblind users in image search. Color plays an important role in the human perception and recognition of images. However, there are about 8% of men and 0.8% of women suffering from colorblindness. We show that the existing image search techniques cannot provide satisfactory results for these users since many images will not be well perceived by them due to the loss of color information. To deal with this difficulty, we introduce a system named Accessible Image Search (AIS) to accommodate these users. Different from the general image search scheme that aims at returning more relevant results, AIS further takes into account the colorblind accessibilities of the returned results, that is, the image qualities in the eyes of colorblind users. The system contains three components: accessibility assessment, accessibility improvement, and color indication. The accessibility assessment component measures the accessibility scores of images, and consequently different reranking methods can be performed to prioritize images with high accessibilities. In the accessibility improvement component, we propose an efficient recoloring algorithm to modify the colors of the images such that they can be better perceived by colorblind users. Color indication aims to indicate the name of the interesting color in an image. We evaluate the introduced system with more than 60 queries and 20 anonymous colorblind users, and the empirical results demonstrate its effectiveness and usefulness.

IJCAI Conference 2009 Conference Paper

  • Zheng-Jun Zha
  • Tao Mei
  • Meng Wang
  • Zengfu Wang
  • Xian-Sheng Hua

Most of the existing metric learning methods are accomplished by exploiting pairwise constraints over the labeled data and frequently suffer from the insufficiency of training examples. To learn a robust distance metric from few labeled examples, prior knowledge from unlabeled examples as well as the metrics previously derived from auxiliary data sets can be useful. In this paper, we propose to leverage such auxiliary knowledge to assist distance metric learning, which is formulated following the regularized loss minimization principle. Two algorithms are derived on the basis of manifold regularization and log-determinant divergence regularization technique, respectively, which can simultaneously exploit label information (i.e., the pairwise constraints over labeled data), unlabeled examples, and the metrics derived from auxiliary data sets. The proposed methods directly manipulate the auxiliary metrics and require no raw examples from the auxiliary data sets, which make them efficient and flexible. We conduct extensive evaluations to compare our approaches with a number of competing approaches on a face recognition task. The experimental results show that our approaches can derive reliable distance metrics from limited training examples and thus are superior in terms of accuracy and labeling efforts.