Arrow Research search

Author name cluster

Yongdong Zhang

Possible papers associated with this exact author name in Arrow. This page groups case-insensitive exact name matches and is not a full identity disambiguation profile.

49 papers
1 author row

Possible papers (49)

AAAI 2026 · Conference Paper

In-Token Rationality Optimization: Towards Accurate and Concise LLM Reasoning via Self-Feedback

  • Mingye Zhu
  • Yi Liu
  • Zheren Fu
  • Quan Wang
  • Yongdong Zhang

Training Large Language Models (LLMs) for chain-of-thought reasoning presents a significant challenge: supervised fine-tuning on a single "golden" rationale hurts generalization as it penalizes equally valid alternatives, whereas reinforcement learning with verifiable rewards struggles with credit assignment and prohibitive computational cost. To tackle these limitations, we introduce InTRO (In-Token Rationality Optimization), a new framework that enables both token-level exploration and self-feedback for accurate and concise reasoning. Instead of directly optimizing an intractable objective over all valid reasoning paths, InTRO leverages correction factors (token-wise importance weights estimated from the information discrepancy between the generative policy and its answer-conditioned counterpart) for informative next-token selection. This approach allows the model to perform token-level exploration and receive self-generated feedback within a single forward pass, ultimately encouraging accurate and concise rationales. Across six math-reasoning benchmarks, InTRO consistently outperforms other baselines, raising solution accuracy by up to 20% relative to the base model. Its chains of thought are also notably more concise, exhibiting reduced verbosity. Beyond this, InTRO enables cross-domain transfer, successfully adapting to out-of-domain reasoning tasks beyond mathematics, demonstrating robust generalization.
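
As a rough illustration of the abstract's core mechanism, the sketch below computes token-wise correction factors as importance ratios between an answer-conditioned policy and the generative policy, and uses them to weight a likelihood objective. The function names and the exact estimator are assumptions for illustration; the paper's formulation may differ in detail.

```python
import torch

def correction_factors(logp_policy: torch.Tensor,
                       logp_conditioned: torch.Tensor) -> torch.Tensor:
    # Per-token importance weights w_t = q(y_t | y_<t, answer) / p(y_t | y_<t),
    # computed in log space for numerical stability. Inputs are the per-token
    # log-probabilities of one sampled rationale, shape (seq_len,).
    return torch.exp(logp_conditioned - logp_policy)

def intro_style_loss(logp_policy: torch.Tensor,
                     logp_conditioned: torch.Tensor) -> torch.Tensor:
    # Weight the policy's own log-likelihood by detached correction factors,
    # so informative tokens receive a larger gradient signal in a single pass.
    w = correction_factors(logp_policy, logp_conditioned).detach()
    return -(w * logp_policy).mean()
```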

AAAI 2026 · Conference Paper

SpaceVLLM: Endowing Multimodal Large Language Model with Spatio-Temporal Video Grounding Capability

  • Jiankang Wang
  • Zhihan Zhang
  • Zhihang Liu
  • Yang Li
  • Jiannan Ge
  • Hongtao Xie
  • Yongdong Zhang

Multimodal Large Language Models (MLLMs) have shown remarkable progress in temporal or spatial localization tasks, but struggle with joint spatio-temporal video grounding (STVG). We identify two key bottlenecks hindering this capability: (1) the sheer number of visual tokens makes long-range and fine-grained visual modeling challenging; (2) generating a long sequence of bounding boxes in text makes it hard to accurately align each box with its specific video frame. Distinct from prior efforts that rely on attaching complex modules, we argue for a more elegant paradigm that unlocks the inherent potential of MLLMs and leverages their strengths. To this end, we propose SpaceVLLM, an MLLM equipped with spatio-temporal video grounding capabilities. Specifically, we propose Spatio-Temporal Aware Queries, interleaved with video frames, to guide the MLLM in capturing both static appearance and dynamic motion features. We further present a lightweight Query-Guided Space Head that maps queries to precise spatial coordinates, bypassing the need for direct textual coordinate generation and enabling the MLLM to focus on video understanding. To further facilitate research in this area, we propose an automated data synthesis pipeline to construct the V-STG dataset, comprising 110K STVG instances. Extensive experiments show that SpaceVLLM achieves state-of-the-art performance on STVG benchmarks and maintains strong performance on various video understanding tasks, validating our approach's effectiveness.

AAAI 2026 · Conference Paper

SparseRM: A Lightweight Preference Modeling with Sparse Autoencoder

  • Dengcan Liu
  • Jiahao Li
  • Zheren Fu
  • Yi Tu
  • Jiajun Li
  • Zhendong Mao
  • Yongdong Zhang

Reward models (RMs) are a core component in the post-training of large language models (LLMs), serving as proxies for human preference evaluation and guiding model alignment. However, training reliable RMs under limited resources remains challenging due to the reliance on large-scale preference annotations and the high cost of fine-tuning LLMs. To address this, we propose SparseRM, which leverages a Sparse Autoencoder (SAE) to extract preference-relevant information encoded in model representations, enabling the construction of a lightweight and interpretable reward model. SparseRM first employs the SAE to decompose LLM representations into interpretable directions that capture preference-relevant features. The representations are then projected onto these directions to compute alignment scores, which quantify the strength of each preference feature in the representations. A simple reward head aggregates these scores to predict preference scores. Experiments on three preference modeling tasks show that SparseRM achieves superior performance over most mainstream RMs while using less than 1% of the trainable parameters. Moreover, it integrates seamlessly into downstream alignment pipelines, highlighting its potential for efficient alignment.
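
A minimal sketch of the described pipeline, assuming the preference-relevant SAE decoder directions have already been selected offline. The class name, shapes, and pooling choice are illustrative, not the paper's exact design.

```python
import torch
import torch.nn as nn

class SparseRMHead(nn.Module):
    def __init__(self, directions: torch.Tensor):
        super().__init__()
        # Frozen, unit-normalised preference-relevant SAE directions, (k, d_model).
        self.register_buffer(
            "dirs", directions / directions.norm(dim=-1, keepdim=True))
        # The reward head is the only trainable component (< 1% of parameters).
        self.head = nn.Linear(directions.shape[0], 1)

    def forward(self, hidden: torch.Tensor) -> torch.Tensor:
        # hidden: (batch, d_model) pooled LLM representation.
        scores = hidden @ self.dirs.T          # alignment score per direction
        return self.head(scores).squeeze(-1)   # scalar preference score
```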

AAAI 2025 · Conference Paper

ELDER: Enhancing Lifelong Model Editing with Mixture-of-LoRA

  • Jiaang Li
  • Quan Wang
  • Zhongnan Wang
  • Yongdong Zhang
  • Zhendong Mao

Large language models (LLMs) require model editing to efficiently update specific knowledge within them and avoid factual errors. Most model editing methods are designed solely for single-time use and exhibit a significant forgetting effect in lifelong editing scenarios, where sequential edits are conducted over time. Previous approaches manage sequential edits by freezing original parameters and discretely allocating new parameters for each knowledge update. However, these methods lack robustness to minor input variations due to the discrete mapping between data and parameters. To overcome this challenge, we propose ELDER, a novel approach that creates a continuous association between data and adapters. ELDER integrates multiple LoRAs through a router network and is trained to establish a smooth data-adapter association, thereby enhancing edit robustness and generalization on semantically equivalent inputs. To ensure that inputs containing the same knowledge are processed by the same LoRAs, we design a novel loss to guide the model to link LoRA allocations with edit knowledge. Furthermore, we propose a deferral mechanism to retain the original LLM capabilities post-edit. Extensive experiments on GPT-2 XL and LLaMA2-7B demonstrate that ELDER effectively edits models in the lifelong setting, outperforming eight baselines while exhibiting strong scalability and preserving LLMs' general abilities on downstream tasks.
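
The continuous data-adapter association can be pictured as a router producing a soft mixture over several LoRA deltas. The sketch below is a generic mixture-of-LoRA layer under assumed shapes and names, not ELDER's exact architecture or its allocation loss.

```python
import torch
import torch.nn as nn

class MixtureOfLoRA(nn.Module):
    def __init__(self, d_in: int, d_out: int, n_adapters: int = 4, rank: int = 8):
        super().__init__()
        self.base = nn.Linear(d_in, d_out)         # pretrained weight, kept frozen
        self.base.requires_grad_(False)
        self.router = nn.Linear(d_in, n_adapters)  # smooth data-adapter association
        self.A = nn.Parameter(torch.randn(n_adapters, d_in, rank) * 0.01)
        self.B = nn.Parameter(torch.zeros(n_adapters, rank, d_out))

    def forward(self, x: torch.Tensor) -> torch.Tensor:
        # x: (batch, d_in). Soft routing instead of a discrete adapter choice.
        gate = torch.softmax(self.router(x), dim=-1)               # (batch, n)
        delta = torch.einsum("bi,nir,nro->bno", x, self.A, self.B)
        return self.base(x) + (gate.unsqueeze(-1) * delta).sum(dim=1)
```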

NeurIPS 2025 · Conference Paper

Leveraging Importance Sampling to Detach Alignment Modules from Large Language Models

  • Yi Liu
  • Dianqing Liu
  • Mingye Zhu
  • Junbo Guo
  • Yongdong Zhang
  • Zhendong Mao

The widespread adoption of large language models (LLMs) across industries has increased the demand for high-quality and customizable outputs. However, traditional alignment methods often require retraining large pretrained models, making it difficult to quickly adapt and optimize LLMs for diverse applications. To address this limitation, we propose a novel Residual Alignment Model (RAM) that formalizes the alignment process as a type of importance sampling. In this framework, the unaligned upstream model serves as the proposal distribution, while the alignment process is framed as secondary sampling based on an autoregressive alignment module that acts as an estimator of the importance weights. This design enables a natural detachment of the alignment module from the target aligned model, improving flexibility and scalability. Based on this model, we derive an efficient sequence-level training strategy for the alignment module, which operates independently of the proposal module. Additionally, we develop a resampling algorithm with iterative token-level decoding to address the common first-token latency issue in comparable methods. Experimental evaluations on two leading open-source LLMs across diverse tasks, including instruction following, domain adaptation, and preference optimization, demonstrate that our approach consistently outperforms baseline models.
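
Conceptually, the upstream model proposes tokens and a detached alignment module supplies importance weights. A much-simplified decoding step under that reading might look like the following; the paper's sequence-level training and resampling algorithm are more involved, and `beta` is an assumed temperature.

```python
import torch

def aligned_sample(logits_proposal: torch.Tensor,
                   logits_align: torch.Tensor,
                   beta: float = 1.0) -> torch.Tensor:
    # Treat the frozen upstream LLM as the proposal distribution and the
    # alignment module's distribution as an importance-weight estimator:
    # p_aligned(y) proportional to p_proposal(y) * w(y)^beta, in log space.
    log_p = torch.log_softmax(logits_proposal, dim=-1)
    log_w = torch.log_softmax(logits_align, dim=-1)
    probs = torch.softmax(log_p + beta * log_w, dim=-1)
    return torch.multinomial(probs, num_samples=1)
```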

NeurIPS 2025 · Conference Paper

Leveraging Robust Optimization for LLM Alignment under Distribution Shifts

  • Mingye Zhu
  • Yi Liu
  • Zheren Fu
  • Yongdong Zhang
  • Zhendong Mao

Preference alignment methods are increasingly critical for steering large language models (LLMs) to generate outputs consistent with human values. While recent approaches often rely on synthetic data generated by LLMs for scalability and cost-efficiency reasons, this reliance can introduce distributional shifts that undermine the nuanced representation of human preferences needed for desirable outputs. In this paper, we propose a novel distribution-aware optimization framework that improves preference alignment despite such shifts. Our approach first leverages well-learned classifiers to assign a calibration value to each training sample, quantifying its alignment with the target human-preferred distribution. These values are then incorporated into a robust optimization objective that minimizes the worst-case loss over regions of the data space most relevant to human preferences. By explicitly focusing optimization on the target distribution, our approach mitigates the impact of distributional mismatch and improves the generation of responses that better reflect intended values.
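
One way to read the objective: each sample's loss is reweighted by how well the sample matches the target human-preferred distribution, with weights concentrating on the worst-case region. The sketch below uses a softmax (KL-regularised) worst case over calibration-shifted losses; the paper's exact objective may differ, and `tau` is an assumed temperature.

```python
import torch

def calibrated_robust_loss(per_sample_loss: torch.Tensor,
                           calibration: torch.Tensor,
                           tau: float = 1.0) -> torch.Tensor:
    # calibration: higher values mean the sample better matches the target
    # human-preferred distribution (e.g., scored by a well-learned classifier).
    # Softmax weights emphasise high-loss, high-calibration samples,
    # approximating a worst-case loss over the relevant data region.
    w = torch.softmax((per_sample_loss.detach() + calibration) / tau, dim=0)
    return (w * per_sample_loss).sum()
```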

IJCAI 2024 · Conference Paper

Aggregation and Purification: Dual Enhancement Network for Point Cloud Few-shot Segmentation

  • Guoxin Xiong
  • Yuan Wang
  • Zhaoyang Li
  • Wenfei Yang
  • Tianzhu Zhang
  • Xu Zhou
  • Shifeng Zhang
  • Yongdong Zhang

Point cloud few-shot semantic segmentation (PC-FSS) aims to segment objects within query samples of new categories given only a handful of annotated support samples. Although PC-FSS offers better category generalization than the fully supervised paradigm, the prevalent and significant scene discrepancies, which can be systematically summarized as intra-semantic diversity and semantic inconsistency, pose substantial challenges to the area. In this work, we design a novel Dual Enhancement Network (DENet) to comprehensively tackle the different kinds of scene discrepancies in a coherent and synergistic framework. The proposed DENet enjoys several merits. First, we design a mutual aggregation module to reconcile the intrinsic tension between support prototypes and query point features, diminishing intra-semantic diversity in a bidirectional manner. Second, a consistent purification strategy is introduced to eliminate ambiguous prototypes, thereby reducing the mismatches brought by semantic inconsistency. Extensive experiments on S3DIS and ScanNet under different settings demonstrate that DENet significantly outperforms previous state-of-the-art methods.

NeurIPS 2024 · Conference Paper

Boosting Semi-Supervised Scene Text Recognition via Viewing and Summarizing

  • Yadong Qu
  • Yuxin Wang
  • Bangbang Zhou
  • Zixiao Wang
  • Hongtao Xie
  • Yongdong Zhang

Existing scene text recognition (STR) methods struggle to recognize challenging texts, especially artistic and severely distorted characters. The limitation lies in the insufficient exploration of character morphologies, including the monotonousness of widely used synthetic training data and the sensitivity of the model to character morphologies. To address these issues, inspired by the human learning process of viewing and summarizing, we facilitate the contrastive learning-based STR framework in a self-motivated manner by leveraging synthetic and real unlabeled data without any human cost. In the viewing process, to compensate for the simplicity of synthetic data and enrich character morphology diversity, we propose an Online Generation Strategy to generate background-free samples with diverse character styles. By excluding background noise distractions, the model is encouraged to focus on character morphology and generalize the ability to recognize complex samples when trained with only simple synthetic data. To boost the summarizing process, we theoretically demonstrate the derivation error in the previous character contrastive loss, which mistakenly causes sparsity in the intra-class distribution and exacerbates ambiguity on challenging samples. Therefore, a new Character Unidirectional Alignment Loss is proposed to correct this error and unify the representation of the same characters in all samples by aligning the character features in the student model with the reference features in the teacher model. Extensive experiment results show that our method achieves SOTA performance (94.7% and 70.9% average accuracy on common benchmarks and Union14M-Benchmark). Code will be available.
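
The unidirectional alignment can be sketched as pulling each student character feature toward a detached teacher reference for the same character, with no symmetric term that would spread intra-class features. The concrete distance used here (cosine) is an assumption for illustration.

```python
import torch
import torch.nn.functional as F

def char_unidirectional_alignment(student_feats: torch.Tensor,
                                  teacher_feats: torch.Tensor) -> torch.Tensor:
    # student_feats, teacher_feats: (n_chars, d) features of the same character
    # instances from student and teacher models. The teacher side is detached,
    # so gradients flow only into the student (hence "unidirectional").
    target = teacher_feats.detach()
    return (1.0 - F.cosine_similarity(student_feats, target, dim=-1)).mean()
```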

AAAI 2024 · Conference Paper

Bootstrapping Large Language Models for Radiology Report Generation

  • Chang Liu
  • Yuanhe Tian
  • Weidong Chen
  • Yan Song
  • Yongdong Zhang

Radiology report generation (RRG) aims to automatically generate a free-text description from a specific clinical radiograph, e.g., chest X-ray images. Existing approaches tend to perform RRG with specific models trained from scratch on public yet limited data, which often leads to inferior performance owing to inefficient capabilities in both aligning visual and textual features and generating informative reports accordingly. Currently, large language models (LLMs) offer a promising solution to text generation with their power in learning from big data, especially for cross-modal scenarios such as RRG. However, most existing LLMs are pre-trained on general data and, when applied to RRG, suffer from the same problem as conventional approaches, caused by the knowledge gap between the general and medical domains. Therefore, in this paper, we propose an approach to bootstrapping LLMs for RRG with an in-domain instance induction and a coarse-to-fine decoding process. Specifically, the in-domain instance induction process learns to align the LLM from general texts to radiology reports through contrastive learning. The coarse-to-fine decoding performs a text elevating process for the reports from the ranker, further enhanced with visual features and refinement prompts. Experimental results on two prevailing RRG datasets, namely IU X-Ray and MIMIC-CXR, demonstrate the superiority of our approach over previous state-of-the-art solutions. Further analyses illustrate that, for the LLM, the induction process enables it to better align with the medical domain, and the coarse-to-fine generation allows it to conduct more precise text generation.

NeurIPS 2024 · Conference Paper

Homology Consistency Constrained Efficient Tuning for Vision-Language Models

  • Huatian Zhang
  • Lei Zhang
  • Yongdong Zhang
  • Zhendong Mao

Efficient transfer learning has shown remarkable performance in tuning large-scale vision-language models (VLMs) toward downstream tasks with limited data resources. The key challenge of efficient transfer lies in adjusting image-text alignment to be task-specific while preserving pre-trained general knowledge. However, existing methods adjust image-text alignment merely on a set of observed samples, e.g., a dataset or an external knowledge base, which cannot guarantee that the correspondence of general concepts between the image and text latent manifolds is preserved, and thereby yields weak generalization of the adjusted alignment. In this work, we propose a Homology Consistency (HC) constraint for efficient transfer on VLMs, which explicitly constrains the correspondence of image and text latent manifolds through structural equivalence based on persistent homology in downstream tuning. Specifically, we build simplicial complexes on top of the data to mimic the topology of the latent manifolds, track the persistence of homology classes of topological features across multiple scales, and guide the directions of persistence tracks in the image and text manifolds to coincide with each other, with an additional deviating perturbation. For practical application, we tailor the implementation of our proposed HC constraint to two main paradigms of adapter tuning. Extensive experiments on few-shot learning over 11 datasets and on domain generalization demonstrate the effectiveness and robustness of our method.

NeurIPS 2024 · Conference Paper

MILP-StuDio: MILP Instance Generation via Block Structure Decomposition

  • Haoyang Liu
  • Jie Wang
  • Wanbo Zhang
  • Zijie Geng
  • Yufei Kuang
  • Xijun Li
  • Yongdong Zhang
  • Bin Li

Mixed-integer linear programming (MILP) is one of the most popular mathematical formulations, with numerous applications. In practice, improving the performance of MILP solvers often requires a large amount of high-quality data, which can be challenging to collect. Researchers thus turn to generation techniques to produce additional MILP instances. However, existing approaches do not take into account the specific block structures in the constraint coefficient matrices (CCMs) of MILPs, which are closely related to the problem formulations. Consequently, they are prone to generating computationally trivial or infeasible instances, since they disrupt the block structures and thus the problem formulations. To address this challenge, we propose a novel MILP generation framework, called Block Structure Decomposition (MILP-StuDio), to generate high-quality instances by preserving the block structures. Specifically, MILP-StuDio begins by identifying the blocks in CCMs and decomposing the instances into block units, which serve as the building blocks of MILP instances. We then design three operators to construct new instances by removing, substituting, and appending block units in the original instances, enabling us to generate instances with flexible sizes. An appealing feature of MILP-StuDio is its strong ability to preserve the feasibility and computational hardness of the generated instances. Experiments on commonly used benchmarks demonstrate that instances generated by MILP-StuDio reduce the solving time of learning-based solvers by over 10%.

NeurIPS 2024 · Conference Paper

MotionGS: Exploring Explicit Motion Guidance for Deformable 3D Gaussian Splatting

  • Ruijie Zhu
  • Yanzhe Liang
  • Hanzhi Chang
  • Jiacheng Deng
  • Jiahao Lu
  • Wenfei Yang
  • Tianzhu Zhang
  • Yongdong Zhang

Dynamic scene reconstruction is a long-term challenge in the field of 3D vision. Recently, the emergence of 3D Gaussian Splatting has provided new insights into this problem. Although subsequent efforts rapidly extend static 3D Gaussians to dynamic scenes, they often lack explicit constraints on object motion, leading to optimization difficulties and performance degradation. To address the above issues, we propose a novel deformable 3D Gaussian splatting framework called MotionGS, which explores explicit motion priors to guide the deformation of 3D Gaussians. Specifically, we first introduce an optical flow decoupling module that decouples optical flow into camera flow and motion flow, corresponding to camera movement and object motion respectively. The motion flow can then effectively constrain the deformation of 3D Gaussians, thus simulating the motion of dynamic objects. Additionally, a camera pose refinement module is proposed to alternately optimize 3D Gaussians and camera poses, mitigating the impact of inaccurate camera poses. Extensive experiments on monocular dynamic scenes validate that MotionGS surpasses state-of-the-art methods and exhibits significant superiority in both qualitative and quantitative results. Project page: https://ruijiezhu94.github.io/MotionGS_page.

AAAI 2024 · Conference Paper

Task-Adaptive Prompted Transformer for Cross-Domain Few-Shot Learning

  • Jiamin Wu
  • Xin Liu
  • Xiaotian Yin
  • Tianzhu Zhang
  • Yongdong Zhang

Cross-Domain Few-Shot Learning (CD-FSL) aims at recognizing samples in novel classes from unseen domains that are vastly different from the training classes, with few labeled samples. However, the large domain gap between training and novel classes makes previous FSL methods perform poorly. To address this issue, we propose MetaPrompt, a Task-adaptive Prompted Transformer model for CD-FSL, which jointly exploits prompt learning and the parameter generation framework. The proposed MetaPrompt enjoys several merits. First, a task-conditioned prompt generator is established upon attention mechanisms. It can flexibly produce a task-adaptive prompt of arbitrary length for unseen tasks by selectively gathering task characteristics from the contextualized support embeddings. Second, the task-adaptive prompt is attached to the Vision Transformer to facilitate fast task adaptation, steering the task-agnostic representation to incorporate task knowledge. To the best of our knowledge, this is the first work to exploit a prompt-based parameter generation mechanism for CD-FSL. Extensive experimental results on the Meta-Dataset benchmark demonstrate that our method achieves superior results against state-of-the-art methods.

NeurIPS 2024 · Conference Paper

Towards Next-Generation Logic Synthesis: A Scalable Neural Circuit Generation Framework

  • Zhihai Wang
  • Jie Wang
  • Qingyue Yang
  • Yinqi Bai
  • Xing Li
  • Lei Chen
  • Jianye Hao
  • Mingxuan Yuan

Logic Synthesis (LS) aims to generate an optimized logic circuit satisfying a given functionality, and generally consists of circuit translation and optimization. It is a challenging and fundamental combinatorial optimization problem in integrated circuit design. Traditional LS approaches rely on manually designed heuristics, while machine learning recently offers a promising path toward next-generation logic synthesis via neural circuit generation and optimization. In this paper, we first revisit the application of differentiable neural architecture search (DNAS) methods to circuit generation and find from extensive experiments that existing DNAS methods struggle to exactly generate circuits, scale poorly to large circuits, and exhibit high sensitivity to hyper-parameters. We then provide three major insights into these challenges from extensive empirical analysis: 1) DNAS tends to overfit to too many skip-connections, consequently wasting a significant portion of the network's expressive capabilities; 2) DNAS suffers from a structural bias between the network architecture and the circuit's inherent structure, leading to inefficient search; 3) the learning difficulty of different input-output examples varies significantly, leading to severely imbalanced learning. To address these challenges in a systematic way, we propose a novel regularized triangle-shaped circuit network generation framework, which leverages our key insights for completely accurate and scalable circuit generation. Furthermore, we propose an evolutionary algorithm assisted by a reinforcement learning agent restarting technique for efficient and effective neural circuit optimization. Extensive experiments on four different circuit benchmarks demonstrate that our method can precisely generate circuits with up to 1200 nodes. Moreover, our synthesized circuits significantly outperform the state-of-the-art results from several competitive winners of the IWLS 2022 and 2023 competitions.

NeurIPS 2023 · Conference Paper

A Deep Instance Generative Framework for MILP Solvers Under Limited Data Availability

  • Zijie Geng
  • Xijun Li
  • Jie Wang
  • Xiao Li
  • Yongdong Zhang
  • Feng Wu

In the past few years, there has been an explosive surge in the use of machine learning (ML) techniques to address combinatorial optimization (CO) problems, especially mixed-integer linear programs (MILPs). Despite these achievements, the limited availability of real-world instances often leads to sub-optimal decisions and biased solver assessments, which motivates a suite of synthetic MILP instance generation techniques. However, existing methods either rely heavily on expert-designed formulations or struggle to capture the rich features of real-world instances. To tackle this problem, we propose G2MILP, the first deep generative framework for MILP instances. Specifically, G2MILP represents MILP instances as bipartite graphs and applies a masked variational autoencoder to iteratively corrupt and replace parts of the original graphs to generate new ones. The appealing feature of G2MILP is that it can learn to generate novel and realistic MILP instances without prior expert-designed formulations, while simultaneously preserving the structures and computational hardness of real-world datasets. Thus the generated instances can facilitate downstream tasks for enhancing MILP solvers under limited data availability. We design a suite of benchmarks to evaluate the quality of the generated MILP instances. Experiments demonstrate that our method can produce instances that closely resemble real-world datasets in terms of both structures and computational hardness. The deliverables are released at https://miralab-ustc.github.io/L2O-G2MILP.

AAAI 2023 · Conference Paper

Exploring Stroke-Level Modifications for Scene Text Editing

  • Yadong Qu
  • Qingfeng Tan
  • Hongtao Xie
  • Jianjun Xu
  • Yuxin Wang
  • Yongdong Zhang

Scene text editing (STE) aims to replace text with a desired one while preserving the background and styles of the original text. However, due to complicated background textures and various text styles, existing methods fall short in generating clear and legible edited text images. In this study, we attribute the poor editing performance to two problems: 1) Implicit decoupling structure. Previous methods that edit the whole image have to learn different translation rules for background and text regions simultaneously. 2) Domain gap. Due to the lack of edited real scene text images, the network can only be well trained on synthetic pairs and performs poorly on real-world images. To handle the above problems, we propose a novel network for MOdifying Scene Text images at strokE Level (MOSTEL). Firstly, we generate stroke guidance maps to explicitly indicate the regions to be edited. Unlike the implicit approach of directly modifying all pixels at the image level, such explicit instructions filter out distractions from the background and guide the network to focus on the editing rules of text regions. Secondly, we propose Semi-supervised Hybrid Learning to train the network with both labeled synthetic images and unpaired real scene text images, so that the STE model is adapted to real-world data distributions. Moreover, two new datasets (Tamper-Syn2k and Tamper-Scene) are proposed to fill the gap in public evaluation datasets. Extensive experiments demonstrate that our MOSTEL outperforms previous methods both qualitatively and quantitatively. Datasets and code will be available at https://github.com/qqqyd/MOSTEL.

IJCAI 2023 · Conference Paper

Linguistic More: Taking a Further Step toward Efficient and Accurate Scene Text Recognition

  • Boqiang Zhang
  • Hongtao Xie
  • Yuxin Wang
  • Jianjun Xu
  • Yongdong Zhang

Vision models have gained increasing attention due to their simplicity and efficiency in the Scene Text Recognition (STR) task. However, lacking the perception of linguistic knowledge, recent vision models suffer from two problems: (1) the pure vision-based query results in attention drift, which usually causes poor recognition and is summarized as the linguistic insensitive drift (LID) problem in this paper; (2) the visual feature is suboptimal for recognition in some vision-missing cases (e.g., occlusion). To address these issues, we propose a Linguistic Perception Vision model (LPV), which explores the linguistic capability of the vision model for accurate text recognition. To alleviate the LID problem, we introduce a Cascade Position Attention (CPA) mechanism that obtains high-quality and accurate attention maps through step-wise optimization and linguistic information mining. Furthermore, a Global Linguistic Reconstruction Module (GLRM) is proposed to improve the representation of visual features by perceiving linguistic information in the visual space, gradually converting visual features into semantically rich ones during the cascade process. Different from previous methods, our method obtains SOTA results while keeping low complexity (92.4% accuracy with only 8.11M parameters). Code is available at https://github.com/CyrilSterling/LPV.

NeurIPS 2023 · Conference Paper

MomentDiff: Generative Video Moment Retrieval from Random to Real

  • Pandeng Li
  • Chen-Wei Xie
  • Hongtao Xie
  • Liming Zhao
  • Lei Zhang
  • Yun Zheng
  • Deli Zhao
  • Yongdong Zhang

Video moment retrieval pursues an efficient and generalized solution to identify the specific temporal segments within an untrimmed video that correspond to a given language description. To achieve this goal, we provide a generative diffusion-based framework called MomentDiff, which simulates a typical human retrieval process from random browsing to gradual localization. Specifically, we first diffuse the real span to random noise, and learn to denoise the random noise to the original span with the guidance of similarity between text and video. This allows the model to learn a mapping from arbitrary random locations to real moments, enabling the ability to locate segments from random initialization. Once trained, MomentDiff can sample random temporal segments as initial guesses and iteratively refine them to generate an accurate temporal boundary. Different from discriminative works (e.g., those based on learnable proposals or queries), MomentDiff with randomly initialized spans can resist the temporal location biases of datasets. To evaluate the influence of temporal location biases, we propose two "anti-bias" datasets with location distribution shifts, named Charades-STA-Len and Charades-STA-Mom. The experimental results demonstrate that our efficient framework consistently outperforms state-of-the-art methods on three public benchmarks, and exhibits better generalization and robustness on the proposed anti-bias datasets. The code, model, and anti-bias evaluation datasets will be released publicly.
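
The "random to real" idea reads as standard DDPM noising applied to a normalised temporal span. A forward-noising sketch under that assumption, with the conditioned denoiser omitted:

```python
import torch

def diffuse_span(span: torch.Tensor, t: int,
                 alphas_cumprod: torch.Tensor):
    # span: (batch, 2) normalised (center, width) of the real moment.
    # Standard DDPM forward process; the learned denoiser (not shown)
    # would be conditioned on text-video similarity to recover the span.
    noise = torch.randn_like(span)
    a_bar = alphas_cumprod[t]
    noisy = a_bar.sqrt() * span + (1.0 - a_bar).sqrt() * noise
    return noisy, noise
```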

NeurIPS 2022 · Conference Paper

Bridging the Gap Between Vision Transformers and Convolutional Neural Networks on Small Datasets

  • Zhiying Lu
  • Hongtao Xie
  • Chuanbin Liu
  • Yongdong Zhang

There remains an extreme performance gap between Vision Transformers (ViTs) and Convolutional Neural Networks (CNNs) when training from scratch on small datasets, which has been attributed to the lack of inductive bias. In this paper, we further consider this problem and point out two weaknesses of ViTs in inductive biases, namely spatial relevance and diverse channel representation. First, on the spatial aspect, objects are locally compact and relevant, so fine-grained features need to be extracted from a token and its neighbors; the lack of data, however, hinders ViTs from attending to spatial relevance. Second, on the channel aspect, representation exhibits diversity on different channels, but the scarce data cannot enable ViTs to learn strong enough representations for accurate recognition. To this end, we propose the Dynamic Hybrid Vision Transformer (DHVT) as a solution that enhances the two inductive biases. On the spatial aspect, we adopt a hybrid structure in which convolution is integrated into the patch embedding and multi-layer perceptron modules, forcing the model to capture token features as well as their neighboring features. On the channel aspect, we introduce a dynamic feature aggregation module in the MLP and a brand new "head token" design in the multi-head self-attention module to help re-calibrate channel representation and make different channel-group representations interact with each other. The fusion of weak channel representations forms a representation strong enough for classification. With this design, we successfully eliminate the performance gap between CNNs and ViTs, and our DHVT achieves a series of state-of-the-art performances with a lightweight model: 85.68% on CIFAR-100 with 22.8M parameters and 82.3% on ImageNet-1K with 24.0M parameters. Code is available at https://github.com/ArieSeirack/DHVT.

TIST 2022 · Journal Article

Dynamic-Aware Federated Learning for Face Forgery Video Detection

  • Ziheng Hu
  • Hongtao Xie
  • Lingyun Yu
  • Xingyu Gao
  • Zhihua Shang
  • Yongdong Zhang

The spread of face forgery videos is a serious threat to information credibility, calling for effective detection algorithms to identify them. Most existing methods have assumed a shared or centralized training set. In practice, however, data may be distributed across devices of different enterprises and cannot be centralized for sharing due to security and privacy restrictions. In this article, we propose a Federated Learning face forgery detection framework that trains a global model collaboratively while keeping data on local devices. To make the detection model more robust, we propose a novel Inconsistency-Capture module (ICM) to capture the dynamic inconsistencies between adjacent frames of face forgery videos. The ICM contains two parallel branches. The first branch takes the whole face of adjacent frames as input to calculate a global inconsistency representation. The second branch focuses only on the inter-frame variation of critical regions to capture the local inconsistency. To the best of our knowledge, this is the first work to apply federated learning to face forgery video detection trained with decentralized data. Extensive experiments show that the proposed framework achieves competitive performance compared with existing methods that are trained with centralized data, while providing a higher level of security and privacy.
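
The federated side of the framework can be grounded with plain FedAvg as the server step; the ICM lives inside each client's local model and is untouched by aggregation. A generic sketch (not the paper's exact protocol):

```python
import torch

def fedavg(client_states: list) -> dict:
    # client_states: list of state_dicts trained locally on private data.
    # The server only ever sees parameters, never the raw forgery videos.
    global_state = {}
    for key in client_states[0]:
        global_state[key] = torch.stack(
            [s[key].float() for s in client_states]).mean(dim=0)
    return global_state
```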

IJCAI 2022 · Conference Paper

MFAN: Multi-modal Feature-enhanced Attention Networks for Rumor Detection

  • Jiaqi Zheng
  • Xi Zhang
  • Sanchuan Guo
  • Quan Wang
  • Wenyu Zang
  • Yongdong Zhang

Rumor spreaders are increasingly taking advantage of multimedia content to attract and mislead news consumers on social media. Although recent multimedia rumor detection models have exploited both textual and visual features for classification, they do not integrate the social structure features simultaneously, which have shown promising performance for rumor identification. It is challenging to combine the heterogeneous multi-modal data in consideration of their complex relationships. In this work, we propose a novel Multi-modal Feature-enhanced Attention Networks (MFAN) for rumor detection, which makes the first attempt to integrate textual, visual, and social graph features in one unified framework. Specifically, it considers both the complement and alignment relationships between different modalities to achieve better fusion. Moreover, it takes into account the incomplete links in the social network data due to data collection constraints and proposes to infer hidden links to learn better social graph features. The experimental results show that MFAN can detect rumors effectively and outperform state-of-the-art methods.

AAAI 2022 · Conference Paper

Show Your Faith: Cross-Modal Confidence-Aware Network for Image-Text Matching

  • Huatian Zhang
  • Zhendong Mao
  • Kun Zhang
  • Yongdong Zhang

Image-text matching bridges vision and language and is a crucial task in the field of multi-modal intelligence. The key challenge lies in how to measure image-text relevance accurately as matching evidence. Most existing works aggregate the local semantic similarities of matched region-word pairs as the overall relevance, and they typically assume that the matched pairs are equally reliable. However, although a region-word pair is locally matched across modalities, it may be inconsistent or unreliable from the global perspective of the image-text pair, resulting in inaccurate relevance measurement. In this paper, we propose a novel Cross-Modal Confidence-Aware Network to infer the matching confidence that indicates the reliability of matched region-word pairs, which is combined with the local semantic similarities to refine the relevance measurement. Specifically, we first calculate the matching confidence via the relevance between the semantics of image regions and the complete described semantics of the image, with the text as a bridge. Further, to richly express the region semantics, we extend each region to its visual context in the image. Then, local semantic similarities are weighted with the inferred confidence to filter out unreliable matched pairs during aggregation. Comprehensive experiments show that our method achieves state-of-the-art performance on the Flickr30K and MSCOCO benchmarks.
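
The refinement step amounts to a confidence-weighted aggregation of local similarities. A minimal sketch, assuming per-pair confidences in [0, 1] have already been inferred:

```python
import torch

def confidence_weighted_relevance(local_sims: torch.Tensor,
                                  confidence: torch.Tensor) -> torch.Tensor:
    # local_sims, confidence: (n_pairs,). Pairs judged unreliable from the
    # global image-text view get low confidence and contribute little to
    # the overall image-text relevance score.
    return (confidence * local_sims).sum() / confidence.sum().clamp(min=1e-6)
```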

AAAI 2021 · Conference Paper

Deep Metric Learning with Self-Supervised Ranking

  • Zheren Fu
  • Yan Li
  • Zhendong Mao
  • Quan Wang
  • Yongdong Zhang

Deep metric learning aims to learn a deep embedding space where similar objects are pushed together and dissimilar objects are pulled apart. Existing approaches typically use inter-class characteristics, e.g., class-level information or instance-level similarity, to obtain the semantic relevance of data points and achieve a large margin between different classes in the embedding space. However, intra-class characteristics, e.g., the local manifold structure or relative relationships within the same class, are usually overlooked in the learning process. Hence the data structure cannot be fully exploited, the output embeddings are limited in retrieval, and, more importantly, the retrieval results lack a good ranking. This paper presents a novel self-supervised ranking auxiliary framework that captures intra-class as well as inter-class characteristics for better metric learning. Our method defines specific transform functions to simulate intra-class local structure changes in the initial image domain, and formulates a self-supervised learning procedure to fully exploit this property and preserve it in the embedding space. Extensive experiments on three standard benchmarks show that our method significantly improves upon and outperforms state-of-the-art methods in both retrieval and ranking performance by 2%-4%.

IJCAI 2021 · Conference Paper

Dynamic Inconsistency-aware DeepFake Video Detection

  • Ziheng Hu
  • Hongtao Xie
  • Yuxin Wang
  • Jiahong Li
  • Zhongyuan Wang
  • Yongdong Zhang

The spread of DeepFake videos causes a serious threat to information security, calling for effective detection methods to distinguish them. However, the performance of recent frame-based detection methods is limited by their ignorance of the inter-frame inconsistency of fake videos. In this paper, we propose a novel Dynamic Inconsistency-aware Network to handle this inconsistency problem, which uses a Cross-Reference module (CRM) to capture both global and local inter-frame inconsistencies. The CRM contains two parallel branches. The first branch takes faces from adjacent frames as input and calculates a structure similarity map for a global inconsistency representation. The second branch focuses only on the inter-frame variation of independent critical regions, which captures the local inconsistency. To the best of our knowledge, this is the first work to fully exploit inter-frame inconsistency information from both global and local perspectives. Compared with existing methods, our model provides more accurate and robust detection on the FaceForensics++, DFDC-preview and Celeb-DFv2 datasets.

AAAI 2021 · Conference Paper

Query-Memory Re-Aggregation for Weakly-supervised Video Object Segmentation

  • Fanchao Lin
  • Hongtao Xie
  • Yan Li
  • Yongdong Zhang

Weakly-supervised video object segmentation (WVOS) is an emerging video task that tracks and segments the target given a simple bounding box label. However, existing WVOS methods are still unsatisfactory in either speed or accuracy, since they only use the exemplar frame to guide the prediction and neglect the reference from other frames. To solve this problem, we propose a novel Re-Aggregation based framework, which uses feature matching to efficiently find the target and captures temporal dependencies from multiple frames to guide the segmentation. Based on a two-stage structure, our framework builds an information-symmetric matching process to achieve robust aggregation. In each stage, we design a Query-Memory Aggregation (QMA) module to gather features from past frames and perform bidirectional aggregation to adaptively weight the aggregated features, which relieves the latent misguidance in unidirectional aggregation. To exploit the information from different aggregation stages, we propose a novel coarse-fine constraint by using the Cascaded Refinement Module (CRM) to combine the predictions from different stages and further boost the performance. Experimental results on three benchmarks show that our method achieves state-of-the-art performance in WVOS (e.g., an overall score of 84.7% on the DAVIS 2016 validation set).

AAAI 2021 · Conference Paper

Semantic-guided Reinforced Region Embedding for Generalized Zero-Shot Learning

  • Jiannan Ge
  • Hongtao Xie
  • Shaobo Min
  • Yongdong Zhang

Generalized Zero-Shot Learning (GZSL) aims to recognize images from either the seen or unseen domain, mainly by learning a joint embedding space to associate image features with the corresponding category descriptions. Recent methods have shown that localizing important object regions can effectively bridge the semantic-visual gap. However, these all rely on one-off visual localizers that lack interpretability and flexibility. In this paper, we propose a novel Semantic-guided Reinforced Region Embedding (SR2E) network that can localize important objects in the long-term interest to construct a semantic-visual embedding space. SR2E consists of a Reinforced Region Module (R2M) and a Semantic Alignment Module (SAM). First, without annotated bounding boxes as supervision, R2M encodes the semantic category guidance into reward and punishment criteria to teach the localizer serialized region searching. Besides, R2M explores different action spaces along the serialized searching path to avoid locally optimal localization, thereby generating discriminative visual features with less redundancy. Second, SAM preserves the semantic relationship in visual features via semantic-visual alignment and designs a domain detector to alleviate domain confusion. Experiments on four public benchmarks demonstrate that the proposed SR2E is an effective GZSL method with a reinforced embedding space, obtaining an average improvement of 6.1%.

IJCAI 2020 · Conference Paper

Bilinear Graph Neural Network with Neighbor Interactions

  • Hongmin Zhu
  • Fuli Feng
  • Xiangnan He
  • Xiang Wang
  • Yan Li
  • Kai Zheng
  • Yongdong Zhang

Graph Neural Network (GNN) is a powerful model to learn representations and make predictions on graph data. Existing efforts on GNN have largely defined the graph convolution as a weighted sum of the features of the connected nodes to form the representation of the target node. Nevertheless, the weighted-sum operation assumes the neighbor nodes are independent of each other and ignores the possible interactions between them. When such interactions exist, for instance when the co-occurrence of two neighbor nodes is a strong signal of the target node's characteristics, existing GNN models may fail to capture the signal. In this work, we argue for the importance of modeling the interactions between neighbor nodes in GNN. We propose a new graph convolution operator, which augments the weighted sum with pairwise interactions of the representations of neighbor nodes. We term this framework Bilinear Graph Neural Network (BGNN), which improves GNN representation ability with bilinear interactions between neighbor nodes. In particular, we specify two BGNN models named BGCN and BGAT, based on the well-known GCN and GAT, respectively. Empirical results on three public benchmarks of semi-supervised node classification verify the effectiveness of BGNN: BGCN (BGAT) outperforms GCN (GAT) by 1.6% (1.5%) in classification accuracy. Codes are available at https://github.com/zhuhm1996/bgnn.
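
The bilinear term sums element-wise products over all neighbor pairs, which can be computed in linear time with the usual square-of-sums identity. The sketch below shows just that aggregation; layer weights, self-loops, and GCN/GAT-style normalisation are omitted.

```python
import torch

def bilinear_neighbor_aggregate(h: torch.Tensor) -> torch.Tensor:
    # h: (n_neighbors, d) representations of one node's neighbors.
    # sum_{i<j} h_i * h_j  ==  0.5 * ((sum_i h_i)^2 - sum_i h_i^2),
    # so pairwise interactions cost O(n*d) instead of O(n^2 * d).
    s = h.sum(dim=0)
    pairwise = 0.5 * (s * s - (h * h).sum(dim=0))
    n = h.shape[0]
    return pairwise / max(n * (n - 1) // 2, 1)   # average over pairs
```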

AAAI 2020 · Conference Paper

CircleNet for Hip Landmark Detection

  • Hai Wu
  • Hongtao Xie
  • Chuanbin Liu
  • Zheng-Jun Zha
  • Jun Sun
  • Yongdong Zhang

Landmark detection plays a critical role in the diagnosis of Developmental Dysplasia of the Hip (DDH). Heatmap and anchor-based object detection techniques can obtain reasonable results, but they have limitations in both robustness and precision given the complexity and inhomogeneity of hip X-ray images. In this paper, we propose a much simpler and more efficient framework called CircleNet, which improves the accuracy of landmark detection by predicting landmarks and their corresponding radii. Using CircleNet, we not only constrain the relationships between landmarks but also integrate landmark detection and object detection into an end-to-end framework. To capture the long-range dependencies of landmarks in DDH images, we propose a new context modeling framework, named the Local Non-Local (LNL) block, which combines the benefits of the non-local block with lightweight computation. We construct a professional DDH dataset for the first time and evaluate CircleNet on it; to our knowledge, the dataset has the largest number of DDH X-ray images in the world. Our results show that CircleNet achieves state-of-the-art results for landmark detection on the dataset, with a large margin of 1.8 average pixels over current methods. The dataset and source code will be publicly available.

AAAI 2020 · Conference Paper

Filtration and Distillation: Enhancing Region Attention for Fine-Grained Visual Categorization

  • Chuanbin Liu
  • Hongtao Xie
  • Zheng-Jun Zha
  • Lingfeng Ma
  • Lingyun Yu
  • Yongdong Zhang

Delicate attention to discriminative regions plays a critical role in Fine-Grained Visual Categorization (FGVC). Unfortunately, most existing attention models perform poorly in FGVC due to two pivotal limitations in discriminative region proposal and region-based feature learning. 1) The discriminative regions are predominantly located based on filter responses over the images, which cannot be directly optimized with a performance metric. 2) Existing methods train the region-based feature extractor as a one-hot classification task individually, neglecting the knowledge from the entire object. To address the above issues, in this paper we propose a novel "Filtration and Distillation Learning" (FDL) model to enhance the region attention of discriminative parts for FGVC. Firstly, a Filtration Learning (FL) method is put forward for discriminative part region proposal based on the matchability between proposal and prediction. Specifically, we utilize the proposing-predicting matchability as the performance metric of the Region Proposal Network (RPN), thus enabling direct optimization of the RPN to select the most discriminative regions. Secondly, for distillation, object-based feature learning and region-based feature learning are formulated as "teacher" and "student", which furnishes better supervision for region-based feature learning. Accordingly, our FDL can enhance the region attention effectively, and the overall framework can be trained end-to-end without either object or part annotations. Extensive experiments verify that FDL yields state-of-the-art performance under the same backbone compared with the most competitive approaches on several FGVC tasks.

NeurIPS 2020 · Conference Paper

Hierarchical Granularity Transfer Learning

  • Shaobo Min
  • Hongtao Xie
  • Hantao Yao
  • Xuran Deng
  • Zheng-Jun Zha
  • Yongdong Zhang

In the real world, object categories usually form a hierarchical granularity tree. Nowadays, most researchers focus on recognizing categories at a specific granularity, e.g., basic-level or sub(ordinate)-level. Compared with basic-level categories, sub-level categories provide more valuable information, but their training annotations are harder to acquire. Therefore, an attractive problem is how to transfer the knowledge learned from basic-level annotations to sub-level recognition. In this paper, we introduce a new task, named Hierarchical Granularity Transfer Learning (HGTL), to recognize sub-level categories with basic-level annotations and semantic descriptions for hierarchical categories. Different from other recognition tasks, HGTL has a serious granularity gap, i.e., the two granularities share an image space but have different category domains, which impedes knowledge transfer. To this end, we propose a novel Bi-granularity Semantic Preserving Network (BigSPN) to bridge the granularity gap for robust knowledge transfer. Explicitly, BigSPN constructs specific visual encoders for different granularities, which are aligned with a shared semantic interpreter via a novel subordinate entropy loss. Experiments on three benchmarks with hierarchical granularities show that BigSPN is an effective framework for Hierarchical Granularity Transfer Learning.

AAAI 2020 · Conference Paper

Learning Hierarchy-Aware Knowledge Graph Embeddings for Link Prediction

  • Zhanqiu Zhang
  • Jianyu Cai
  • Yongdong Zhang
  • Jie Wang

Knowledge graph embedding, which aims to represent entities and relations as low-dimensional vectors (or matrices, tensors, etc.), has been shown to be a powerful technique for predicting missing links in knowledge graphs. Existing knowledge graph embedding models mainly focus on modeling relation patterns such as symmetry/antisymmetry, inversion, and composition. However, many existing approaches fail to model semantic hierarchies, which are common in real-world applications. To address this challenge, we propose a novel knowledge graph embedding model, namely Hierarchy-Aware Knowledge Graph Embedding (HAKE), which maps entities into the polar coordinate system. HAKE is inspired by the fact that concentric circles in the polar coordinate system can naturally reflect hierarchy. Specifically, the radial coordinate aims to model entities at different levels of the hierarchy, and entities with smaller radii are expected to be at higher levels; the angular coordinate aims to distinguish entities at the same level of the hierarchy, and these entities are expected to have roughly the same radii but different angles. Experiments demonstrate that HAKE can effectively model the semantic hierarchies in knowledge graphs, and significantly outperforms existing state-of-the-art methods on benchmark datasets for the link prediction task.
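
The polar-coordinate idea translates into a two-part distance: a modulus part for hierarchy levels and a phase part for entities at the same level. A sketch following the published modulus-phase form, with an illustrative mixing weight:

```python
import torch

def hake_score(h_mod, r_mod, t_mod, h_phase, r_phase, t_phase, lam=0.5):
    # Radial part: modulus embeddings model the level in the hierarchy.
    radial = torch.norm(h_mod * r_mod - t_mod, p=2, dim=-1)
    # Angular part: phase embeddings separate entities at the same level.
    angular = torch.norm(torch.sin((h_phase + r_phase - t_phase) / 2.0),
                         p=1, dim=-1)
    return -(radial + lam * angular)   # higher score = more plausible triple
```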

IJCAI 2020 · Conference Paper

Overcoming Language Priors with Self-supervised Learning for Visual Question Answering

  • Xi Zhu
  • Zhendong Mao
  • Chunxiao Liu
  • Peng Zhang
  • Bin Wang
  • Yongdong Zhang

Most Visual Question Answering (VQA) models suffer from the language prior problem, which is caused by inherent data biases. Specifically, VQA models tend to answer questions (e.g., what color is the banana?) based on high-frequency answers (e.g., yellow) while ignoring image contents. Existing approaches tackle this problem by creating delicate models or introducing additional visual annotations to reduce question dependency and strengthen image dependency. However, they are still subject to the language prior problem since the data biases have not been fundamentally addressed. In this paper, we introduce a self-supervised learning framework to solve this problem. Concretely, we first automatically generate labeled data to balance the biased data, and then propose a self-supervised auxiliary task that utilizes the balanced data to help the VQA model overcome language priors. Our method can compensate for the data biases by generating balanced data without introducing external annotations. Experimental results show that our method achieves state-of-the-art performance, improving the overall accuracy from 49.50% to 57.59% on the most commonly used benchmark, VQA-CP v2. In other words, we can increase the performance of annotation-based methods by 16% without using external annotations. Our code is available on GitHub.

AAAI 2019 · Conference Paper

A Two-Stream Mutual Attention Network for Semi-Supervised Biomedical Segmentation with Noisy Labels

  • Shaobo Min
  • Xuejin Chen
  • Zheng-Jun Zha
  • Feng Wu
  • Yongdong Zhang

Learning-based methods suffer from a deficiency of clean annotations, especially in biomedical segmentation. Although many semi-supervised methods have been proposed to provide extra training data, automatically generated labels are usually too noisy to retrain models effectively. In this paper, we propose a Two-Stream Mutual Attention Network (TSMAN) that weakens the influence of back-propagated gradients caused by incorrect labels, thereby rendering the network robust to unclean data. The proposed TSMAN consists of two sub-networks that are connected by three types of attention models in different layers. The target of each attention model is to indicate potentially incorrect gradients in a certain layer for both sub-networks by analyzing their inferred features using the same input. To achieve this purpose, the attention models are designed based on a propagation analysis of noisy gradients at different layers. This allows the attention models to effectively discover incorrect labels and weaken their influence during the parameter-updating process. By exchanging multi-level features within the two-stream architecture, the effects of noisy labels in each sub-network are reduced by decreasing the noisy gradients. Furthermore, a hierarchical distillation is developed to provide reliable pseudo labels for unlabeled data, which further boosts the performance of TSMAN. Experiments using both the HVSMR 2016 and BRATS 2015 benchmarks demonstrate that our semi-supervised learning framework surpasses state-of-the-art fully-supervised results.

IJCAI 2019 · Conference Paper

Boundary Perception Guidance: A Scribble-Supervised Semantic Segmentation Approach

  • Bin Wang
  • Guojun Qi
  • Sheng Tang
  • Tianzhu Zhang
  • Yunchao Wei
  • Linghui Li
  • Yongdong Zhang

Semantic segmentation suffers from the fact that densely annotated masks are expensive to obtain. To tackle this problem, we aim at learning to segment by leveraging only scribbles, which are much easier to collect, for supervision. To fully exploit the limited pixel-level annotations from scribbles, we present a novel Boundary Perception Guidance (BPG) approach, which consists of two basic components, i.e., prediction refinement and boundary regression. Specifically, the prediction refinement progressively produces a better segmentation by adopting an iterative upsampling and semantic feature enhancement strategy. In the boundary regression, we employ class-agnostic edge maps for supervision to effectively guide the segmentation network in localizing the boundaries between different semantic regions, leading to finer-grained feature-map representations for semantic segmentation. The experimental results on PASCAL VOC 2012 demonstrate that the proposed BPG achieves an mIoU of 73.2% without a fully connected Conditional Random Field (CRF) and 76.0% with CRF, setting the new state of the art in the literature.

IJCAI 2019 · Conference Paper

DSRN: A Deep Scale Relationship Network for Scene Text Detection

  • Yuxin Wang
  • Hongtao Xie
  • Zilong Fu
  • Yongdong Zhang

Nowadays, scene text detection has become increasingly important and popular. However, the large variance of text scale remains the main challenge and limits detection performance in most previous methods. To address this problem, we propose an end-to-end architecture called Deep Scale Relationship Network (DSRN) that maps multi-scale convolution features onto a scale-invariant space to obtain uniform activation of multi-size text instances. Firstly, we develop a Scale-transfer module to transfer the multi-scale feature maps to a unified dimension. Due to the heterogeneity of features, simply concatenating feature maps with multi-scale information would limit detection performance. Thus we propose a Scale Relationship module to aggregate the multi-scale information through bi-directional convolution operations. Finally, to further reduce miss-detected instances, a novel Recall Loss is proposed to force the network to focus more on miss-detected text instances by up-weighting poorly classified examples. Compared with previous approaches, DSRN efficiently handles the large-variance scale problem without complex hand-crafted hyperparameter settings (e.g., the scale of default boxes) or complicated post-processing. On standard datasets including ICDAR2015 and MSRA-TD500, the proposed algorithm achieves state-of-the-art performance with impressive speed (8.8 FPS on ICDAR2015 and 13.3 FPS on MSRA-TD500).

IJCAI Conference 2019 Conference Paper

Learning to Draw Text in Natural Images with Conditional Adversarial Networks

  • Shancheng Fang
  • Hongtao Xie
  • Jianjun Chen
  • Jianlong Tan
  • Yongdong Zhang

In this work, we propose an entirely learning-based method to automatically synthesize text sequences in natural images leveraging conditional adversarial networks. As vanilla GANs struggle to capture structured text patterns, directly employing GANs for text image synthesis typically results in illegible images. Therefore, we design a two-stage architecture to generate repeated characters in images. Firstly, a character generator attempts to synthesize local character appearance independently, so that legible characters in sequence can be obtained. To achieve style consistency across characters, we propose a novel style loss based on variance minimization. Secondly, we design a pixel-manipulation word generator constrained by self-regularization, which learns to convert local characters into plausible word images. Experiments on the SVHN, ICDAR, and IIIT5K datasets demonstrate that our method is able to synthesize visually appealing text images. Besides, we also show that the high-quality images synthesized by our method can be used to boost the performance of a scene text recognition algorithm.
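
The variance-minimization idea can be sketched in essentially one line: penalize the variance of per-character style features so all characters in a word share one style. The style feature extractor and shapes below are assumptions:

# Sketch of a variance-minimization style loss: characters of one word
# should have near-identical style features. The extractor producing
# char_style_feats is assumed and not part of the paper's description.
import torch

def style_variance_loss(char_style_feats):
    """char_style_feats: (num_chars, feat_dim) features of one word."""
    return char_style_feats.var(dim=0, unbiased=False).mean()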

IJCAI Conference 2019 Conference Paper

Semi-supervised User Profiling with Heterogeneous Graph Attention Networks

  • Weijian Chen
  • Yulong Gu
  • Zhaochun Ren
  • Xiangnan He
  • Hongtao Xie
  • Tong Guo
  • Dawei Yin
  • Yongdong Zhang

Aiming to represent user characteristics and personal interests, the task of user profiling plays an increasingly important role in many real-world applications, e.g., e-commerce and social network platforms. By exploiting data such as texts and user behaviors, most existing solutions address user profiling as a classification task, where each user is formulated as an individual data instance. Nevertheless, a user's profile is not only reflected in her/his affiliated data, but can also be inferred from other users, e.g., users with similar co-purchase behaviors in e-commerce, friends in social networks, etc. In this paper, we approach user profiling in a semi-supervised manner, developing a generic solution based on heterogeneous graph learning. On the graph, nodes represent the entities of interest (e.g., users, items, attributes of items, etc.), and edges represent the interactions between entities. Our heterogeneous graph attention networks (HGAT) method learns a representation for each entity by accounting for the graph structure, and exploits an attention mechanism to discriminate the importance of each neighboring entity. Through such a learning scheme, HGAT can leverage both unsupervised information and the limited labels of users to build the predictor. Extensive experiments on a real-world e-commerce dataset verify the effectiveness and rationality of our HGAT for user profiling.
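
The core layer can be sketched as type-specific projections followed by softmax attention over neighbors. The dense adjacency, the scoring function, and all dimensions below are assumptions for illustration, not the paper's exact design:

# Minimal single-layer sketch of heterogeneous graph attention: each node
# type gets its own projection into a shared space, and learned attention
# scores weight the neighbors. A dense adjacency keeps the sketch short.
import torch
import torch.nn as nn
import torch.nn.functional as F

class HeteroAttentionLayer(nn.Module):
    def __init__(self, in_dims, out_dim):
        super().__init__()
        # one projection per node type, e.g. {"user": 64, "item": 32}
        self.proj = nn.ModuleDict({t: nn.Linear(d, out_dim)
                                   for t, d in in_dims.items()})
        self.attn = nn.Linear(2 * out_dim, 1)

    def forward(self, feats, types, adj):
        # project every node with its type-specific transform
        h = torch.stack([self.proj[types[i]](feats[i])
                         for i in range(len(feats))])
        n = h.size(0)
        pair = torch.cat([h.unsqueeze(1).expand(n, n, -1),
                          h.unsqueeze(0).expand(n, n, -1)], dim=-1)
        scores = self.attn(pair).squeeze(-1)
        scores = scores.masked_fill(adj == 0, float("-inf"))
        alpha = torch.nan_to_num(F.softmax(scores, dim=1))  # isolated rows -> 0
        return F.elu(alpha @ h)

Projecting every type into one shared space is what lets a single attention function compare users, items, and item attributes as neighbors of the same node.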

IJCAI Conference 2018 Conference Paper

Distortion-aware CNNs for Spherical Images

  • Qiang Zhao
  • Chen Zhu
  • Feng Dai
  • Yike Ma
  • Guoqing Jin
  • Yongdong Zhang

Convolutional neural networks are widely used in computer vision applications. Although they have achieved great success, these networks cannot be applied to 360° spherical images directly due to the varying distortion effects. In this paper, we present a distortion-aware convolutional network for spherical images. For each pixel, our network samples a non-regular grid based on its distortion level, and convolves the sampled grid using square kernels shared by all pixels. The network successively approximates large image patches from different tangent planes of the viewing sphere with small local sampling grids, thus improving computational efficiency. Our method also deals with the boundary problem, an inherent issue for spherical images. To evaluate our method, we apply our network to spherical image classification problems based on transformed MNIST and CIFAR-10 datasets. Compared with the baseline method, our method achieves much better performance. We also analyze variants of our network.
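
For equirectangular projections, the per-pixel non-regular grid can be approximated by widening the sampling grid with latitude (roughly by 1/cos(lat)), so each grid covers a near-constant patch on the tangent plane. The sketch below shows the sampling step only; the exact geometry, clamping, and spacing are assumptions, and the paper's construction may differ.

# Sketch of distortion-aware sampling on an equirectangular image: each
# output pixel gathers a 3x3 grid whose horizontal spacing grows toward
# the poles. A shared square kernel can then convolve the gathered grid.
import math
import torch
import torch.nn.functional as F

def latitude_scaled_grid(h, w, ky=3, kx=3):
    # normalized pixel centers in grid_sample's [-1, 1] convention
    ys, xs = torch.meshgrid(torch.linspace(-1, 1, h),
                            torch.linspace(-1, 1, w), indexing="ij")
    lat = ys * (math.pi / 2)                        # latitude per row
    sx = (2.0 / w) / torch.cos(lat).clamp_min(0.1)  # widen spacing near poles
    sy = torch.full_like(sx, 2.0 / h)               # vertical spacing uniform
    offs_y, offs_x = torch.meshgrid(torch.arange(ky) - ky // 2,
                                    torch.arange(kx) - kx // 2, indexing="ij")
    gx = xs[..., None] + sx[..., None] * offs_x.flatten()
    gy = ys[..., None] + sy[..., None] * offs_y.flatten()
    return torch.stack([gx, gy], dim=-1)            # (h, w, ky*kx, 2)

def distortion_aware_sample(feat, grid):
    b, c, h, w = feat.shape
    g = grid.view(1, h, w * grid.size(2), 2).expand(b, -1, -1, -1)
    samp = F.grid_sample(feat, g, align_corners=True)
    return samp.view(b, c, h, w, -1)   # per-pixel sampled neighborhood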

IJCAI Conference 2018 Conference Paper

High Resolution Feature Recovering for Accelerating Urban Scene Parsing

  • Rui Zhang
  • Sheng Tang
  • Luoqi Liu
  • Yongdong Zhang
  • Jintao Li
  • Shuicheng Yan

Both accuracy and speed are equally important in urban scene parsing. Most existing methods mainly focus on improving parsing accuracy, ignoring the problem of low inference speed due to large-sized inputs and high-resolution feature maps. To tackle this issue, we propose a High Resolution Feature Recovering (HRFR) framework to accelerate a given parsing network. A Super-Resolution Recovering module is employed to recover features of the large original-sized images from the features of a down-sampled input. Therefore, our framework can combine the advantages of (1) the fast speed of networks with down-sampled input and (2) the high accuracy of networks with large original-sized input. Additionally, we employ auxiliary intermediate supervision and boundary region re-weighting to facilitate the optimization of the network. Extensive experiments on two challenging datasets, Cityscapes and CamVid, well demonstrate the effectiveness of the proposed HRFR framework, which accelerates scene parsing inference by about 3.0x using 1/2 down-sampled input with negligible accuracy reduction.
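
The recovery idea reduces to training a small module whose upsampled low-resolution features match the features the backbone would compute at full resolution. A minimal sketch, with the module shape and the L2 matching loss as assumptions:

# Sketch of super-resolution feature recovering: run the backbone on a
# 1/2-downsampled image, then recover full-resolution features with a
# light residual module supervised to match full-resolution features.
import torch
import torch.nn as nn
import torch.nn.functional as F

class FeatureRecovery(nn.Module):
    def __init__(self, channels):
        super().__init__()
        self.refine = nn.Sequential(
            nn.Conv2d(channels, channels, 3, padding=1),
            nn.ReLU(inplace=True),
            nn.Conv2d(channels, channels, 3, padding=1))

    def forward(self, low_res_feat):
        up = F.interpolate(low_res_feat, scale_factor=2,
                           mode="bilinear", align_corners=False)
        return up + self.refine(up)      # residual recovery

def recovery_loss(recovered, full_res_feat):
    # full_res_feat comes from a teacher pass on the original-sized input,
    # needed only during training
    return F.mse_loss(recovered, full_res_feat)

At inference only the downsampled pass and the recovery module run, which is where the speedup comes from.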

IJCAI Conference 2018 Conference Paper

Multi-Level Policy and Reward Reinforcement Learning for Image Captioning

  • Anan Liu
  • Ning Xu
  • Hanwang Zhang
  • Weizhi Nie
  • Yuting Su
  • Yongdong Zhang

Image captioning is one of the most challenging hallmarks of AI, due to its complexity in visual and natural language understanding. As it is essentially a sequential prediction task, recent advances in image captioning use Reinforcement Learning (RL) to better explore the dynamics of word-by-word generation. However, existing RL-based image captioning methods mainly rely on a single policy network and reward function that do not fit well with the multi-level (word and sentence) and multi-modal (vision and language) nature of the task. To this end, we propose a novel multi-level policy and reward RL framework for image captioning. It contains two modules: 1) a Multi-Level Policy Network that adaptively fuses the word-level policy and the sentence-level policy for word generation; and 2) a Multi-Level Reward Function that collaboratively leverages both a vision-language reward and a language-language reward to guide the policy. Further, we propose a guidance term to bridge the policy and the reward for RL optimization. Extensive experiments and analysis on MSCOCO and Flickr30k show that the proposed framework achieves competitive performance across different evaluation metrics.
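
The reward side can be sketched as self-critical REINFORCE with a mixed reward; the mixing weight and both scorers below are assumptions rather than the paper's exact formulation:

# Sketch of a policy-gradient step whose reward mixes a language-language
# score (e.g. CIDEr against references) with a vision-language similarity
# score. The baseline is the reward of a greedy decode (self-critical).
import torch

def mixed_reward_loss(log_probs, lang_reward, vis_reward,
                      baseline, beta=0.5):
    """log_probs: (batch, seq_len) log-probs of the sampled caption words;
    rewards and baseline: (batch,) sequence-level scores."""
    reward = beta * lang_reward + (1.0 - beta) * vis_reward
    advantage = (reward - baseline).detach()
    return -(advantage.unsqueeze(1) * log_probs).sum(1).mean()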

AAAI Conference 2017 Conference Paper

Image Caption with Global-Local Attention

  • Linghui Li
  • Sheng Tang
  • Lixi Deng
  • Yongdong Zhang
  • Qi Tian

Image captioning is becoming increasingly important in the field of artificial intelligence. Most existing methods based on the CNN-RNN framework suffer from the problems of object missing and misprediction due to the mere use of global representations at the image level. To address these problems, in this paper, we propose a global-local attention (GLA) method that integrates local representations at the object level with global representations at the image level through an attention mechanism. Thus, our proposed method can predict salient objects more precisely with high recall while concurrently retaining context information at the image level. Therefore, our proposed GLA method can generate more relevant sentences, and achieves state-of-the-art performance on the well-known Microsoft COCO caption dataset across several popular metrics.
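
The fusion step can be sketched as attention over object-level features with the image-level feature as the query; the dimensions, the scoring function, and the feature extractors are all assumed:

# Sketch of global-local fusion: attend over per-object features using
# the global image feature as query, then concatenate the attended local
# context with the global vector for the caption decoder.
import torch
import torch.nn as nn
import torch.nn.functional as F

class GlobalLocalAttention(nn.Module):
    def __init__(self, dim):
        super().__init__()
        self.score = nn.Linear(2 * dim, 1)

    def forward(self, global_feat, object_feats):
        # global_feat: (b, d); object_feats: (b, k, d)
        q = global_feat.unsqueeze(1).expand_as(object_feats)
        alpha = F.softmax(self.score(torch.cat([q, object_feats], -1)), dim=1)
        local = (alpha * object_feats).sum(1)
        return torch.cat([global_feat, local], dim=-1)   # (b, 2d)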

IJCAI Conference 2017 Conference Paper

Sequential Prediction of Social Media Popularity with Deep Temporal Context Networks

  • Bo Wu
  • Wen-Huang Cheng
  • Yongdong Zhang
  • Qiushi Huang
  • Jintao Li
  • Tao Mei

Prediction of popularity has profound impact for social media, since it offers opportunities to reveal individual preference and public attention from evolutionary social systems. Previous research, although achieving promising results, neglects one distinctive characteristic of social data, i.e., sequentiality. For example, the popularity of online content is generated over time with sequential post streams of social media. To investigate the sequential prediction of popularity, we propose a novel prediction framework called Deep Temporal Context Networks (DTCN) that takes both temporal context and temporal attention into account. Our DTCN contains three main components, from embedding and learning to predicting. With a joint embedding network, we obtain a unified deep representation of multi-modal user-post data in a common embedding space. Then, based on the embedded data sequence over time, temporal context learning attempts to recurrently learn two adaptive temporal contexts for sequential popularity. Finally, a novel temporal attention is designed to predict new popularity (the popularity of a new user-post pair) with temporal coherence across multiple time scales. Experiments on our released image dataset with about 600K Flickr photos demonstrate that DTCN outperforms state-of-the-art deep prediction algorithms, with an average relative performance improvement of 21.51% in popularity prediction (Spearman Ranking Correlation).
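
The temporal-attention step can be sketched as the new post's embedding querying the embedded history of past posts; the dot-product scoring, the regressor head, and all dimensions are assumptions:

# Sketch of temporal attention for sequential popularity prediction: the
# new user-post embedding attends over the embedded post history, and the
# attended context feeds a popularity regressor.
import torch
import torch.nn as nn
import torch.nn.functional as F

class TemporalAttentionPredictor(nn.Module):
    def __init__(self, dim):
        super().__init__()
        self.out = nn.Linear(2 * dim, 1)

    def forward(self, history, query):
        # history: (b, t, d) embedded past posts; query: (b, d) new post
        scores = torch.bmm(history, query.unsqueeze(2)).squeeze(2) \
                 / history.size(-1) ** 0.5
        ctx = (F.softmax(scores, 1).unsqueeze(2) * history).sum(1)
        return self.out(torch.cat([ctx, query], dim=-1)).squeeze(-1)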

TIST Journal 2017 Journal Article

Sparse Online Learning of Image Similarity

  • Xingyu Gao
  • Steven C. H. Hoi
  • Yongdong Zhang
  • Jianshe Zhou
  • Ji Wan
  • Zhenyu Chen
  • Jintao Li
  • Jianke Zhu

Learning image similarity plays a critical role in real-world multimedia information retrieval applications, especially in Content-Based Image Retrieval (CBIR) tasks, in which an accurate retrieval of visually similar objects largely relies on an effective image similarity function. Crafting a good similarity function is very challenging because visual contents of images are often represented as feature vectors in high-dimensional spaces, for example, via bag-of-words (BoW) representations, and traditional rigid similarity functions, for example, cosine similarity, are often suboptimal for CBIR tasks. In this article, we address this fundamental problem, that is, learning to optimize image similarity with sparse and high-dimensional representations from large-scale training data, and propose a novel scheme of Sparse Online Learning of Image Similarity (SOLIS). In contrast to many existing image-similarity learning algorithms that are designed to work with low-dimensional data, SOLIS is able to learn image similarity from large-scale image data in sparse and high-dimensional spaces. Our encouraging results show that the proposed technique achieves highly competitive accuracy compared with state-of-the-art approaches while enjoying significant advantages in computational efficiency, model sparsity, and retrieval scalability, making it more practical for real-world multimedia retrieval applications.
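
One common formulation in this family is a bilinear similarity S(x, y) = xᵀWy learned online from triplets, with L1 soft-thresholding keeping W sparse; the abstract does not spell out SOLIS's exact updates, so the sketch below is illustrative, and a real high-dimensional implementation would exploit sparse vectors throughout:

# Sketch of sparse online similarity learning: hinge update on a triplet
# (x, x_pos, x_neg) followed by soft-thresholding for sparsity.
# Learning rate, L1 strength, and margin are assumptions.
import numpy as np

def solis_style_step(W, x, x_pos, x_neg, lr=0.1, l1=1e-4, margin=1.0):
    loss = margin - x @ W @ x_pos + x @ W @ x_neg
    if loss > 0:                                # hinge is active
        W = W + lr * (np.outer(x, x_pos) - np.outer(x, x_neg))
    # soft-threshold to keep the similarity matrix sparse
    return np.sign(W) * np.maximum(np.abs(W) - l1, 0.0)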

AAAI Conference 2016 Conference Paper

News Verification by Exploiting Conflicting Social Viewpoints in Microblogs

  • Zhiwei Jin
  • Juan Cao
  • Yongdong Zhang
  • Jiebo Luo

Fake news spreading in social media severely jeopardizes the veracity of online content. Fortunately, with the interactive and open features of microblogs, skeptical and opposing voices against fake news always arise along with it. The conflicting information, ignored by existing studies, is crucial for news verification. In this paper, we take advantage of this “wisdom of crowds” information to improve news verification by mining conflicting viewpoints in microblogs. First, we discover conflicting viewpoints in news tweets with a topic model method. Based on identified tweets’ viewpoints, we then build a credibility propagation network of tweets linked with supporting or opposing relations. Finally, with iterative deduction, the credibility propagation on the network generates the final evaluation result for news. Experiments conducted on a real-world data set show that the news verification performance of our approach significantly outperforms those of the baseline approaches.
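
The propagation can be sketched as iteratively averaging credibility scores over signed (supporting = +1, opposing = -1) edges until they stabilize; the damping factor, the seeding scheme, and the squashing are assumptions:

# Sketch of credibility propagation on a signed tweet network: supporting
# neighbors pull credibility up, opposing neighbors push it down.
# Seeds (init) might come from an initial per-tweet classifier.
import numpy as np

def propagate_credibility(signed_adj, init, alpha=0.85, iters=50):
    """signed_adj: (n, n) with entries in {-1, 0, +1}; init: (n,) seeds."""
    deg = np.abs(signed_adj).sum(1, keepdims=True)
    norm = np.divide(signed_adj, deg,
                     out=np.zeros_like(signed_adj, dtype=float),
                     where=deg > 0)
    c = init.astype(float)
    for _ in range(iters):
        c = alpha * (norm @ c) + (1 - alpha) * init
    return np.tanh(c)     # squash to a bounded credibility score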

AAAI Conference 2016 Conference Paper

Unfolding Temporal Dynamics: Predicting Social Media Popularity Using Multi-scale Temporal Decomposition

  • Bo Wu
  • Tao Mei
  • Wen-Huang Cheng
  • Yongdong Zhang

Time information plays a crucial role in social media popularity. Existing research on popularity prediction, though effective, ignores temporal information, which is highly related to user-item associations, and thus often achieves only limited success. An essential way forward is to consider all these factors (user, item, and time), which capture the dynamic nature of photo popularity. In this paper, we present a novel approach that factorizes popularity into user-item context and time-sensitive context to explore the mechanism of dynamic popularity. The user-item context provides a holistic view of popularity, while the time-sensitive context captures the temporally dynamic nature of popularity. Accordingly, we develop two kinds of time-sensitive features, namely user activeness variability and photo prevalence variability. To predict photo popularity, we propose a novel framework named Multi-scale Temporal Decomposition (MTD), which decomposes the popularity matrix in latent spaces based on contextual associations. Specifically, the proposed MTD models time-sensitive context on different time scales, which is beneficial for automatically learning temporal patterns. Based on experiments conducted on a real-world dataset with 1.29M photos from Flickr, our proposed MTD achieves a prediction accuracy of 79.8% and outperforms the best three state-of-the-art methods with a relative improvement of 9.6% on average.

TIST Journal 2015 Journal Article

Community Discovery from Social Media by Low-Rank Matrix Recovery

  • Jinfeng Zhuang
  • Tao Mei
  • Steven C. H. Hoi
  • Xian-Sheng Hua
  • Yongdong Zhang

The pervasive usage and reach of social media have attracted a surge of attention in the multimedia research community. Community discovery from social media has therefore become an important yet challenging issue. However, due to the subjective generating process, the explicitly observed communities (e.g., group-user and user-user relationship) are often noisy and incomplete in nature. This paper presents a novel approach to discovering communities from social media, including the group membership and user friend structure, by exploring a low-rank matrix recovery technique. In particular, we take Flickr as one exemplary social media platform. We first model the observed indicator matrix of the Flickr community as a summation of a low-rank true matrix and a sparse error matrix. We then formulate an optimization problem by regularizing the true matrix to coincide with the available rich context and content (i.e., photos and their associated tags). An iterative algorithm is developed to recover the true community indicator matrix. The proposed approach leads to a variety of social applications, including community visualization, interest group refinement, friend suggestion, and influential user identification. The evaluations on a large-scale testbed, consisting of 4,919 Flickr users, 1,467 interest groups, and over five million photos, show that our approach opens a new yet effective perspective to solve social network problems with sparse learning technique. Despite being focused on Flickr, our technique can be applied in any other social media community.
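
The core recovery step, before the context and content regularizers the paper adds, is the classic low-rank plus sparse split M ≈ L + S. A simple alternating proximal sketch, with thresholds and iteration count as assumptions:

# Sketch of low-rank matrix recovery: singular-value thresholding for the
# low-rank community structure L, entrywise soft-thresholding for the
# sparse noise S. This is the generic decomposition, not the paper's full
# regularized objective.
import numpy as np

def soft(x, t):
    return np.sign(x) * np.maximum(np.abs(x) - t, 0.0)

def recover_low_rank(M, tau=1.0, lam=0.1, iters=100):
    L = np.zeros_like(M, dtype=float)
    S = np.zeros_like(M, dtype=float)
    for _ in range(iters):
        U, sig, Vt = np.linalg.svd(M - S, full_matrices=False)
        L = (U * soft(sig, tau)) @ Vt       # singular-value thresholding
        S = soft(M - L, lam)                # sparse error update
    return L, S

The recovered L is a denoised, completed community-indicator matrix, which is what downstream applications such as friend suggestion read from.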

IJCAI Conference 2015 Conference Paper

Online Learning to Rank for Content-Based Image Retrieval

  • Ji Wan
  • Pengcheng Wu
  • Steven C. H. Hoi
  • Peilin Zhao
  • Xingyu Gao
  • Dayong Wang
  • Yongdong Zhang
  • Jintao Li

A major challenge in Content-Based Image Retrieval (CBIR) is to bridge the semantic gap between low-level image contents and high-level semantic concepts. Although researchers have investigated a variety of retrieval techniques using different types of features and distance functions, no single best retrieval solution can fully tackle this challenge. In a real-world CBIR task, it is often highly desired to combine multiple types of different feature representations and diverse distance measures in order to close the semantic gap. In this paper, we investigate a new framework of learning to rank for CBIR, which aims to seek the optimal combination of different retrieval schemes by learning from large-scale training data in CBIR. We first formulate the problem formally as a learning to rank task, which can be solved in general by applying the existing batch learning to rank algorithms from text information retrieval (IR). To further address the scalability towards large-scale online CBIR applications, we present a family of online learning to rank algorithms, which are significantly more efficient and scalable than classical batch algorithms for large-scale online CBIR. Finally, we conduct an extensive set of experiments, in which encouraging results show that our technique is effective, scalable and promising for large-scale CBIR.
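
A standard member of the online pairwise learning-to-rank family the paper builds on is a passive-aggressive update on score differences; the per-scheme feature map and hyperparameters below are assumptions for illustration:

# Sketch of an online pairwise learning-to-rank step for combining
# retrieval schemes: w weights the scheme-wise similarity scores, and a
# hinge update enforces that relevant images outrank irrelevant ones.
# phi_* is assumed to stack the per-scheme similarity scores for a query.
import numpy as np

def rank_update(w, phi_rel, phi_irr, C=1.0, margin=1.0):
    """phi_rel / phi_irr: scheme scores for a relevant / irrelevant image."""
    diff = phi_rel - phi_irr
    loss = max(0.0, margin - w @ diff)
    if loss > 0:
        tau = min(C, loss / (diff @ diff + 1e-12))   # PA-I style step size
        w = w + tau * diff
    return w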

TIST Journal 2014 Journal Article

A Unified Geolocation Framework for Web Videos

  • Yicheng Song
  • Yongdong Zhang
  • Juan Cao
  • Jinhui Tang
  • Xingyu Gao
  • Jintao Li

In this article, we propose a unified geolocation framework to automatically determine where on the earth a web video was shot. We analyze different social, visual, and textual relationships from a real-world dataset and find four relationships with apparent geographic clues that can be used for web video geolocation. Then, the geolocation process is formulated as an optimization problem that simultaneously takes the social, visual, and textual relationships into consideration. The optimization problem is solved by an iterative procedure, which can be interpreted as a propagation of geographic information across the web video social network. Extensive experiments on a real-world dataset clearly demonstrate the effectiveness of our proposed framework, with geolocation accuracy higher than that of state-of-the-art approaches.

AAAI Conference 2014 Conference Paper

SOML: Sparse Online Metric Learning with Application to Image Retrieval

  • Xingyu Gao
  • Steven C.H. Hoi
  • Yongdong Zhang
  • Ji Wan
  • Jintao Li

Image similarity search plays a key role in many multimedia applications, where multimedia data (such as images and videos) are usually represented in a high-dimensional feature space. In this paper, we propose a novel Sparse Online Metric Learning (SOML) scheme for learning sparse distance functions from large-scale high-dimensional data and explore its application to image retrieval. In contrast to many existing distance metric learning algorithms that are often designed for low-dimensional data, the proposed algorithms are able to learn sparse distance metrics from high-dimensional data in an efficient and scalable manner. Our experimental results show that the proposed method achieves accuracy better than or at least comparable to that of state-of-the-art non-sparse distance metric learning approaches, while enjoying a significant advantage in computational efficiency and sparsity, making it more practical for real-world applications.