Arrow Research search

Author name cluster

Jun Yu

Possible papers associated with this exact author name in Arrow. This page groups case-insensitive exact name matches and is not a full identity disambiguation profile.

46 papers
2 author rows

Possible papers

46

AAAI Conference 2026 Conference Paper

Frequency-Aware Vision-Language Multimodality Generalization Network for Remote Sensing Image Classification

  • Junjie Zhang
  • Feng Zhao
  • Hanqiang Liu
  • Jun Yu

The booming remote sensing (RS) technology is giving rise to a novel multimodality generalization task, which requires the model to overcome data heterogeneity while possessing powerful cross-scene generalization ability. Moreover, most vision-language models usually describe surface materials using universal texts, lacking proprietary linguistic prior knowledge specific to different RS modalities. In this work, we formalize RS multimodality generalization (RSMG) as a learning paradigm, and propose a frequency-aware vision-language multimodality generalization network (FVMGN) for RS image classification. Specifically, a diffusion-based training-test-time augmentation (DTAug) strategy is designed to reconstruct multimodal land-cover distributions, enriching input information for FVMGN. Following that, to overcome multimodal heterogeneity, a multimodal wavelet disentanglement (MWDis) module is developed to learn cross-domain invariant features by resampling low- and high-frequency components in the frequency domain. Considering the characteristics of RS vision modalities, shared and proprietary class texts are designed as linguistic inputs for the transformer-based text encoder to extract diverse text features. For multimodal vision inputs, a spatial-frequency-aware image encoder (SFIE) is constructed to realize local-global feature reconstruction and representation. Finally, a multiscale spatial-frequency feature alignment (MSFFA) module is suggested to construct a unified semantic space, ensuring refined multiscale alignment of different text and vision features in spatial and frequency domains. Extensive experiments show that FVMGN achieves excellent multimodality generalization ability compared with state-of-the-art methods.
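
The MWDis module above separates low- and high-frequency content before resampling it. As a rough, generic illustration of that kind of frequency split (not the paper's actual module), the sketch below applies a single-level 2D wavelet transform with PyWavelets; the wavelet choice and patch size are arbitrary assumptions.

```python
import numpy as np
import pywt  # PyWavelets

def wavelet_split(image, wavelet="haar"):
    """Split a 2D image into a low-frequency approximation and
    high-frequency detail bands via a single-level 2D DWT."""
    cA, (cH, cV, cD) = pywt.dwt2(image, wavelet)
    low = cA                         # coarse content, often more domain-invariant
    high = np.stack([cH, cV, cD])    # horizontal / vertical / diagonal details
    return low, high

low, high = wavelet_split(np.random.rand(64, 64))   # toy 64x64 patch
print(low.shape, high.shape)                        # (32, 32) (3, 32, 32)
```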

AAAI Conference 2026 Conference Paper

Knowledge Completes the Vision: A Multimodal Entity-aware Retrieval-Augmented Generation Framework for News Image Captioning

  • Xiaoxing You
  • Qiang Huang
  • Lingyu Li
  • Chi Zhang
  • Xiaopeng Liu
  • Min Zhang
  • Jun Yu

News image captioning aims to produce journalistically informative descriptions by combining visual content with contextual cues from associated articles. Despite recent advances, existing methods struggle with three key challenges: (1) incomplete information coverage, (2) weak cross-modal alignment, and (3) suboptimal visual-entity grounding. To address these issues, we introduce MERGE, the first Multimodal Entity-aware Retrieval-augmented GEneration framework for news image captioning. MERGE constructs an entity-centric multimodal knowledge base (EMKB) that integrates textual, visual, and structured knowledge, enabling enriched background retrieval. It improves cross-modal alignment through a multistage hypothesis-caption strategy and enhances visual-entity matching via dynamic retrieval guided by image content. Extensive experiments on GoodNews and NYTimes800k show that MERGE significantly outperforms state-of-the-art baselines, with CIDEr gains of +6.84 and +1.16 in caption quality, and F1-score improvements of +4.14 and +2.64 in named entity recognition. Notably, MERGE also generalizes well to the unseen Visual News dataset, achieving +20.17 in CIDEr and +6.22 in F1-score, demonstrating strong robustness and domain adaptability.

AAAI Conference 2026 Conference Paper

Partially Shared Concept Bottleneck Models

  • Delong Zhao
  • Qiang Huang
  • Di Yan
  • Yiqun Sun
  • Jun Yu

Concept Bottleneck Models (CBMs) enhance interpretability by introducing a layer of human-understandable concepts between inputs and predictions. While recent methods automate concept generation using Large Language Models (LLMs) and Vision-Language Models (VLMs), they still face three fundamental challenges: poor visual grounding, concept redundancy, and the absence of principled metrics to balance predictive accuracy and concept compactness. We introduce PS-CBM, a Partially Shared CBM framework that addresses these limitations through three core components: (1) a multimodal concept generator that integrates LLM-derived semantics with exemplar-based visual cues; (2) a Partially Shared Concept Strategy that merges concepts based on activation patterns to balance specificity and compactness; and (3) Concept-Efficient Accuracy (CEA), a post-hoc metric that jointly captures both predictive accuracy and concept compactness. Extensive experiments on eleven diverse datasets show that PS-CBM consistently outperforms state-of-the-art CBMs, improving classification accuracy by 1.0%–7.4% and CEA by 2.0%–9.5%, while requiring significantly fewer concepts. These results underscore PS-CBM’s effectiveness in achieving both high accuracy and strong interpretability.
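
For readers unfamiliar with the concept bottleneck setup itself, a minimal generic sketch of the plain CBM structure (features, then concept scores, then label logits) is shown below; the layer sizes are placeholders, and PS-CBM's partially shared concept strategy and CEA metric are not modeled here.

```python
import torch
import torch.nn as nn

class ConceptBottleneck(nn.Module):
    """Generic concept bottleneck: features -> concept scores -> label logits.
    Interpretability comes from supervising and inspecting the concept layer."""
    def __init__(self, feat_dim: int, n_concepts: int, n_classes: int):
        super().__init__()
        self.to_concepts = nn.Linear(feat_dim, n_concepts)  # concept predictor
        self.to_labels = nn.Linear(n_concepts, n_classes)   # interpretable head

    def forward(self, feats):
        concepts = torch.sigmoid(self.to_concepts(feats))   # concept activations in [0, 1]
        return self.to_labels(concepts), concepts

model = ConceptBottleneck(feat_dim=512, n_concepts=64, n_classes=10)
logits, concepts = model(torch.randn(8, 512))
print(logits.shape, concepts.shape)  # torch.Size([8, 10]) torch.Size([8, 64])
```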

AAAI Conference 2026 Conference Paper

Sparse4DGS: 4D Gaussian Splatting for Sparse-Frame Dynamic Scene Reconstruction

  • Changyue Shi
  • Chuxiao Yang
  • Xinyuan Hu
  • Minghao Chen
  • Wenwen Pan
  • Yan Yang
  • Jiajun Ding
  • Zhou Yu

Dynamic Gaussian Splatting approaches have achieved remarkable performance for 4D scene reconstruction. However, these approaches rely on dense-frame video sequences for photorealistic reconstruction. In real-world scenarios, due to equipment constraints, sometimes only sparse frames are accessible. In this paper, we propose Sparse4DGS, the first method for sparse-frame dynamic scene reconstruction. We observe that dynamic reconstruction methods fail in both canonical and deformed spaces under sparse-frame settings, especially in areas with high texture richness. Sparse4DGS tackles this challenge by focusing on texture-rich areas. For the deformation network, we propose Texture-Aware Deformation Regularization, which introduces a texture-based depth alignment loss to regulate Gaussian deformation. For the canonical Gaussian field, we introduce Texture-Aware Canonical Optimization, which incorporates texture-based noise into the gradient descent process of canonical Gaussians. Extensive experiments show that when taking sparse frames as inputs, our method outperforms existing dynamic or few-shot techniques on NeRF-Synthetic, HyperNeRF, NeRF-DS, and our iPhone-4D datasets.

AAAI Conference 2025 Conference Paper

A Similarity Paradigm Through Textual Regularization Without Forgetting

  • Fangming Cui
  • Jan Fong
  • Rongfei Zeng
  • Xinmei Tian
  • Jun Yu

Prompt learning has emerged as a promising method for adapting pre-trained visual-language models (VLMs) to a range of downstream tasks. While optimizing the context can be effective for improving performance on specific tasks, it can often lead to poor generalization performance on unseen classes or datasets sampled from different distributions. This may be attributed to the fact that textual prompts tend to overfit downstream data distributions, leading to the forgetting of generalized knowledge derived from hand-crafted prompts. In this paper, we propose a novel method called Similarity Paradigm with Textual Regularization (SPTR) for prompt learning without forgetting. SPTR is a two-pronged, inseparable framework built on hand-crafted prompts. 1) To avoid forgetting general textual knowledge, we introduce optimal transport as a textual regularization that keeps the tuned textual features close to the hand-crafted features. 2) To continuously exploit the general ability of multiple hand-crafted prompts, we propose a similarity paradigm over natural and adversarial alignment scores to improve model robustness for generalization. Both modules share a common objective in addressing generalization issues, aiming to maximize the generalization capability derived from multiple hand-crafted prompts. Four representative tasks (i.e., non-generalization few-shot learning, base-to-novel generalization, cross-dataset generalization, domain generalization) across 11 datasets demonstrate that SPTR outperforms existing prompt learning methods.

NeurIPS Conference 2025 Conference Paper

A Token is Worth over 1,000 Tokens: Efficient Knowledge Distillation through Low-Rank Clone

  • Jitai Hao
  • Qiang Huang
  • Hao Liu
  • Xinyan Xiao
  • Zhaochun Ren
  • Jun Yu

Training high-performing Small Language Models (SLMs) remains computationally expensive, even with knowledge distillation and pruning from larger teacher models. Existing approaches often face three key challenges: (1) information loss from hard pruning, (2) inefficient alignment of representations, and (3) underutilization of informative activations, particularly from Feed-Forward Networks (FFNs). To address these challenges, we introduce Low-Rank Clone (LRC), an efficient pre-training method that constructs SLMs aspiring to behavioral equivalence with strong teacher models. LRC trains a set of low-rank projection matrices that jointly enable soft pruning by compressing teacher weights, and activation clone by aligning student activations, including FFN signals, with those of the teacher. This unified design maximizes knowledge transfer while removing the need for explicit alignment modules. Extensive experiments with open-source teachers such as Llama-3.2-3B-Instruct and Qwen2.5-3B/7B-Instruct show that LRC matches or surpasses the performance of state-of-the-art models trained on trillions of tokens--using only 20B tokens, achieving over 1,000× greater training efficiency. Our codes and model checkpoints are available at https://github.com/CURRENTF/LowRankClone and https://huggingface.co/JitaiHao/LRC-4B-Base.
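
As a loose illustration of the two ingredients named above, low-rank projections that compress teacher weights and an activation-clone loss, here is a minimal PyTorch-style sketch; the dimensions, the single shared projection, and the plain MSE loss are assumptions for illustration, not LRC's actual design.

```python
import torch
import torch.nn as nn
import torch.nn.functional as F

# Hypothetical sizes: teacher hidden width 3072, student hidden width 1024.
d_teacher, d_student = 3072, 1024

# A single low-rank projection mapping teacher space into student space.
proj = nn.Linear(d_teacher, d_student, bias=False)

# "Soft pruning": derive a student weight by compressing a teacher weight
# through the projection instead of hard-dropping rows or columns.
W_teacher = torch.randn(d_teacher, d_teacher)          # e.g. an FFN weight
W_student = proj.weight @ W_teacher @ proj.weight.T    # (d_student, d_student)

# "Activation clone": align student activations with projected teacher activations.
h_teacher = torch.randn(8, d_teacher)                  # teacher FFN activations
h_student = torch.randn(8, d_student, requires_grad=True)
clone_loss = F.mse_loss(h_student, proj(h_teacher))
clone_loss.backward()
print(W_student.shape, clone_loss.item())
```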

JBHI Journal 2025 Journal Article

Adapter-Enhanced Hierarchical Cross-Modal Pre-Training for Lightweight Medical Report Generation

  • Ting Yu
  • Wangwen Lu
  • Yan Yang
  • Weidong Han
  • Qingming Huang
  • Jun Yu
  • Ke Zhang

Automatic medical report generation is an emerging field that aims to transform medical images into descriptive, clinically relevant narratives, potentially reducing the workload for radiologists significantly. Despite substantial progress, the increasing model parameter size and corresponding marginal performance gains have limited further development and application. To address this challenge, we introduce an Adapter-enhanced Hierarchical cross-modal Pre-training (AHP) strategy for lightweight medical report generation. This approach significantly reduces the pre-trained model's parameter size while maintaining superior report generation performance through our proposed spatial adapters. To further address the issue of inadequate representation of visual space details, we employ a convolutional stem combined with hierarchical injectors and extractors, fully integrating with traditional Vision Transformers to achieve more comprehensive visual representations. Additionally, our cross-modal pre-training model effectively handles the inherent complex visual-textual relationships in medical imaging. Extensive experiments on multiple datasets, including IU X-Ray, MIMIC-CXR, and bladder pathology, demonstrate our model's exceptional generalization and transfer performance in downstream medical report generation tasks, highlighting AHP's potential in significantly reducing model parameters while enhancing report generation accuracy and efficiency.

ICLR Conference 2025 Conference Paper

Classic but Everlasting: Traditional Gradient-Based Algorithms Converge Fast Even in Time-Varying Multi-Player Games

  • Yanzheng Chen 0001
  • Jun Yu

Last-iterate convergence behaviours of well-known algorithms are intensively investigated in various games, such as two-player bilinear zero-sum games. However, most known last-iterate convergence properties rely on strict settings where the underlying games must have time-invariant payoffs. Besides, the few known attempts on games with time-varying payoffs are limited to two-player bilinear time-varying zero-sum games and strictly monotone games. By contrast, in other time-varying games, the last-iterate behaviours of two classic algorithms, i.e., the extra gradient (EG) and optimistic gradient (OG) algorithms, remain underexplored, especially their convergence rates in multi-player games. In this paper, we investigate the last-iterate behaviours of EG and OG algorithms for convergent perturbed games, which extend upon the usual model of time-invariant games and incorporate external factors, such as vanishing noises. Using the recently proposed notion of the tangent residual (or its modifications) as the potential function of games and the measure of proximity to the Nash equilibrium, we prove that the last-iterate convergence rates of EG and OG algorithms for perturbed games on bounded convex closed sets are $O(1/\sqrt{T})$ if such games converge to monotone games at rates fast enough, and that such a result holds true for certain unconstrained perturbed games. With this result, we address an open question asking for the last-iterate convergence rate of EG and OG algorithms in constrained and time-varying settings. The above convergence rates are similar to known tight results on corresponding time-invariant games.
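
For reference, the extra gradient and optimistic gradient updates discussed above are commonly written as follows (one standard formulation with projection and a game operator; the step size and notation are generic, not the paper's exact setup):

```latex
% Extra gradient (EG), with projection \Pi_{\mathcal{X}} onto the feasible set
% and game operator F (the stacked gradients of all players):
x_{t+1/2} = \Pi_{\mathcal{X}}\left(x_t - \eta\, F(x_t)\right), \qquad
x_{t+1}   = \Pi_{\mathcal{X}}\left(x_t - \eta\, F(x_{t+1/2})\right)

% Optimistic gradient (OG), reusing the previous gradient as a prediction:
x_{t+1}   = \Pi_{\mathcal{X}}\left(x_t - 2\eta\, F(x_t) + \eta\, F(x_{t-1})\right)
```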

JBHI Journal 2025 Journal Article

Consistency Conditioned Memory Augmented Dynamic Diagnosis Model for Medical Visual Question Answering

  • Ting Yu
  • Binhui Ge
  • Shuhui Wang
  • Yan Yang
  • Qingming Huang
  • Jun Yu

Medical Visual Question Answering (Med-VQA) holds immense promise as an invaluable medical assistance aid, offering timely diagnostic outcomes based on medical images and accompanying questions, thereby supporting medical professionals in making accurate clinical decisions. However, Med-VQA is still in its infancy, with existing solutions falling short in imitating human diagnostic processes and ensuring result consistency. To address these challenges, we propose a Consistency Conditioned Memory augmented Dynamic diagnosis model (CoCoMeD), incorporating two core components: a dynamic memory diagnosis engine and a consistency-conditioned enforcer. The dynamic memory diagnosis engine enables intricate diagnostic interactions by retaining vital visual cues from medical images and iteratively updating pertinent memories. This dynamic reasoning capability mirrors the cognitive processes observed in skilled medical diagnosticians, thus effectively enhancing the model's ability to reason over diverse medical visual facts and patient-specific questions. Moreover, to strengthen diagnostic coherence, the consistency-conditioned enforcer imposes coherence constraints linking interrelated questions with identical medical facts, ensuring the credibility and reliability of its diagnostic outcomes. Additionally, we present C-SLAKE, an extended Med-VQA dataset encompassing diverse medical image types, and categorized diagnostic question-answer pairs for consistent Med-VQA evaluation on rich medical sources. Comprehensive experiments on DME and C-SLAKE showcase CoCoMeD's superior performance and potential to advance trustworthy multi-source medical question answering.

ICLR Conference 2025 Conference Paper

Deep Kernel Relative Test for Machine-generated Text Detection

  • Yiliao Song
  • Zhenqiao Yuan
  • Shuhai Zhang
  • Zhen Fang 0001
  • Jun Yu
  • Feng Liu 0003

Recent studies demonstrate that the two-sample test can effectively detect machine-generated texts (MGTs) with excellent adaptation ability to texts generated by newer LLMs. However, two-sample-test-based detection relies on the assumption that human-written texts (HWTs) must follow the distribution of seen HWTs. As a result, it tends to make mistakes in identifying HWTs that deviate from the seen HWT distribution, limiting its use in sensitive areas like academic integrity verification. To address this issue, we propose to employ a non-parametric kernel relative test to detect MGTs by testing whether it is statistically significant that the distribution of a text to be tested is closer to the distribution of HWTs than to the MGTs' distribution. We further develop a kernel optimisation algorithm for the relative test to select the best kernel that can enhance the testing capability for MGT detection. As the relative test does not assume that a text to be tested must belong exclusively to either MGTs or HWTs, it can largely reduce the false-positive error compared to the two-sample test, offering significant advantages in practice. Extensive experiments demonstrate the superior performance of our method compared to state-of-the-art non-parametric and parametric detectors. The code and demo are available at https://github.com/xLearn-AU/R-Detect.
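
To make the relative-test idea concrete, here is a small numpy sketch that scores a text by comparing its kernel MMD to an HWT reference set against its MMD to an MGT reference set; the Gaussian kernel, bandwidth, and zero-threshold decision are placeholders rather than the paper's optimized deep kernel or its significance test.

```python
import numpy as np

def gaussian_kernel(X, Y, sigma=1.0):
    """Pairwise Gaussian kernel between rows of X and Y."""
    d2 = ((X[:, None, :] - Y[None, :, :]) ** 2).sum(-1)
    return np.exp(-d2 / (2 * sigma ** 2))

def mmd2(X, Y, sigma=1.0):
    """Biased estimate of the squared MMD between samples X and Y."""
    return (gaussian_kernel(X, X, sigma).mean()
            + gaussian_kernel(Y, Y, sigma).mean()
            - 2 * gaussian_kernel(X, Y, sigma).mean())

def relative_score(text_emb, hwt_emb, mgt_emb, sigma=1.0):
    """Positive score: the text looks closer to the MGT reference than to the HWT one."""
    return mmd2(text_emb, hwt_emb, sigma) - mmd2(text_emb, mgt_emb, sigma)

rng = np.random.default_rng(0)
hwt = rng.normal(0.0, 1.0, (100, 16))   # embeddings of known human-written texts
mgt = rng.normal(0.5, 1.0, (100, 16))   # embeddings of known machine-generated texts
text = rng.normal(0.5, 1.0, (20, 16))   # embeddings of the text under test
print("flag as machine-generated:", relative_score(text, hwt, mgt) > 0)
```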

NeurIPS Conference 2025 Conference Paper

EditInfinity: Image Editing with Binary-Quantized Generative Models

  • Jiahuan Wang
  • Yuxin Chen
  • Jun Yu
  • Guangming Lu
  • Wenjie Pei

Adapting pretrained diffusion-based generative models for text-driven image editing with negligible tuning overhead has demonstrated remarkable potential. A classical adaptation paradigm, as followed by these methods, first infers the generative trajectory inversely for a given source image by image inversion, then performs image editing along the inferred trajectory guided by the target text prompts. However, the performance of image editing is heavily limited by the approximation errors introduced during image inversion by diffusion models, which arise from the absence of exact supervision in the intermediate generative steps. To circumvent this issue, we investigate the parameter-efficient adaptation of binary-quantized generative models for image editing, and leverage their inherent characteristic that the exact intermediate quantized representations of a source image are attainable, enabling more effective supervision for precise image inversion. Specifically, we propose EditInfinity, which adapts Infinity, a binary-quantized generative model, for image editing. We propose an efficient yet effective image inversion mechanism that integrates text prompting rectification and image style preservation, enabling precise image inversion. Furthermore, we devise a holistic smoothing strategy which allows our EditInfinity to perform image editing with high fidelity to source images and precise semantic alignment to the text prompts. Extensive experiments on the PIE-Bench benchmark across add, change, and delete editing operations, demonstrate the superior performance of our model compared to state-of-the-art diffusion-based baselines. Code available at: https://github.com/yx-chen-ust/EditInfinity.

NeurIPS Conference 2025 Conference Paper

Enhancing LLM Planning for Robotics Manipulation through Hierarchical Procedural Knowledge Graphs

  • Jiacong Zhou
  • Jiaxu Miao
  • xianyun wang
  • Jun Yu

Large Language Models (LLMs) have shown promising planning capabilities for robotic manipulation, significantly advancing the development of embodied intelligence. However, existing LLM-driven robotic manipulation approaches excel at simple pick-and-place tasks but are insufficient for complex manipulation tasks due to inaccurate procedural knowledge. Besides, for embodied intelligence, deploying a large-scale LLM is energy-consuming and inefficient, which limits its real-world application. To address the above problems, we propose Hierarchical Procedural Knowledge Graphs (HP-KG) to enhance LLMs for complex robotic planning while significantly reducing the demand for LLM scale in robotic manipulation. Considering that complex real-world tasks require multiple steps, and each step is composed of robot-understandable atomic actions, we design a hierarchical knowledge graph structure to model the relationships between tasks, steps, and actions. This design bridges the gap between human instructions and robotic manipulation actions. To construct HP-KG, we develop an automatic knowledge graph construction framework powered by LLM-based multi-agents, which eliminates costly manual efforts while maintaining high-quality graph structures. The resulting HP-KG encompasses over 40k activity steps across more than 6k household tasks, spanning diverse everyday scenarios. Extensive experiments demonstrate that small-scale LLMs (7B) enhanced by our HP-KG achieve significantly improved planning capabilities, surpassing even standalone 72B LLMs. Encouragingly, our approach remains effective on the most powerful GPT-4o model.

AAAI Conference 2025 Conference Paper

Fine-grained Adaptive Visual Prompt for Generative Medical Visual Question Answering

  • Ting Yu
  • Zixuan Tong
  • Jun Yu
  • Ke Zhang

Medical Visual Question Answering (MedVQA) serves as an automated medical assistant, capable of answering patient queries and aiding physician diagnoses based on medical images and questions. Recent advancements have shown that incorporating Large Language Models (LLMs) into MedVQA tasks significantly enhances the capability for answer generation. However, for tasks requiring fine-grained organ-level precise localization, relying solely on language prompts struggles to accurately locate relevant regions within medical images due to substantial background noise. To address this challenge, we explore the use of visual prompts in MedVQA tasks for the first time and propose fine-grained adaptive visual prompts to enhance generative MedVQA. Specifically, we introduce an Adaptive Visual Prompt Creator that adaptively generates region-level visual prompts based on image characteristics of various organs, providing fine-grained references for LLMs during answer retrieval and generation from the medical domain, thereby improving the model's precise cross-modal localization capabilities on original images. Furthermore, we incorporate a Hierarchical Answer Generator with Parameter-Efficient Fine-Tuning (PEFT) techniques, significantly enhancing the model's understanding of spatial and contextual information with minimal parameter increase, promoting the alignment of representation learning with the medical space. Extensive experiments on VQA-RAD, SLAKE, and DME datasets validate the effectiveness of our proposed method, demonstrating its potential in generative MedVQA.

NeurIPS Conference 2025 Conference Paper

From Pixels to Views: Learning Angular-Aware and Physics-Consistent Representations for Light Field Microscopy

  • Feng He
  • Guodong Tan
  • Qiankun Li
  • Jun Yu
  • Quan Wen

Light field microscopy (LFM) has become an emerging tool in neuroscience for large-scale neural imaging in vivo, with XLFM (eXtended Light Field Microscopy) notable for its single-exposure volumetric imaging, broad field of view, and high temporal resolution. However, learning-based 3D reconstruction in XLFM remains underdeveloped due to two core challenges: the absence of standardized datasets and the lack of methods that can efficiently model its angular–spatial structure while remaining physically grounded. We address these challenges by introducing three key contributions. First, we construct the XLFM-Zebrafish benchmark, a large-scale dataset and evaluation suite for XLFM reconstruction. Second, we propose Masked View Modeling for Light Fields (MVM-LF), a self-supervised task that learns angular priors by predicting occluded views, improving data efficiency. Third, we formulate the Optical Rendering Consistency Loss (ORC Loss), a differentiable rendering constraint that enforces alignment between predicted volumes and their PSF-based forward projections. On the XLFM-Zebrafish benchmark, our method improves PSNR by 7.7% over state-of-the-art baselines. Code and datasets are publicly available at: https://github.com/hefengcs/XLFM-Former.

JBHI Journal 2025 Journal Article

Multi-Scale Group Agent Attention-Based Graph Convolutional Decoding Networks for 2D Medical Image Segmentation

  • Zhichao Wang
  • Lin Guo
  • Shuchang Zhao
  • Shiqing Zhang
  • Xiaoming Zhao
  • Jiangxiong Fang
  • Guoyu Wang
  • Hongsheng Lu

Automated medical image segmentation plays a crucial role in assisting doctors in diagnosing diseases. Feature decoding is a critical yet challenging issue for medical image segmentation. To address this issue, this work proposes a novel feature decoding network, called multi-scale group agent attention-based graph convolutional decoding networks (MSGAA-GCDN), to learn local-global features in graph structures for 2D medical image segmentation. The proposed MSGAA-GCDN combines graph convolutional network (GCN) and a lightweight multi-scale group agent attention (MSGAA) mechanism to represent features globally and locally within a graph structure. Moreover, in skip connections a simple yet efficient attention-based upsampling convolution fusion (AUCF) module is designed to enhance encoder-decoder feature fusion in both channel and spatial dimensions. Extensive experiments are conducted on three typical medical image segmentation tasks, namely Synapse abdominal multi-organs, Cardiac organs, and Polyp lesions. Experimental results demonstrate that the proposed MSGAA-GCDN outperforms the state-of-the-art methods, and the designed MSGAA is a lightweight yet effective attention architecture. The proposed MSGAA-GCDN can be easily taken as a plug-and-play decoder cascaded with other encoders for general medical image segmentation tasks.

NeurIPS Conference 2025 Conference Paper

Towards Generalizable Detector for Generated Image

  • Qianshu Cai
  • Chao Wu
  • Yonggang Zhang
  • Jun Yu
  • Xinmei Tian

The effective detection of generated images is crucial to mitigate potential risks associated with their misuse. Despite significant progress, a fundamental challenge remains: ensuring the generalizability of detectors. To address this, we propose a novel perspective on understanding and improving generated image detection, inspired by the human cognitive process: Humans identify an image as unnatural based on specific patterns because these patterns lie outside the space spanned by those of natural images. This is intrinsically related to out-of-distribution (OOD) detection, which identifies samples whose semantic patterns (i.e., labels) lie outside the semantic pattern space of in-distribution (ID) samples. By treating patterns of generated images as OOD samples, we demonstrate that models trained merely over natural images bring guaranteed generalization ability under mild assumptions. This transforms the generalization challenge of generated image detection into the problem of fitting natural image patterns. Based on this insight, we propose a generalizable detection method through the lens of ID energy. Theoretical results capture the generalization risk of the proposed method. Experimental results across multiple benchmarks demonstrate the effectiveness of our approach.
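
The "ID energy" view above is in the spirit of energy-based OOD scoring; as a generic illustration only (not necessarily the paper's exact score), a logsumexp energy over classifier logits can be computed as follows, with the threshold choice being an assumption.

```python
import torch

def energy_score(logits: torch.Tensor, temperature: float = 1.0) -> torch.Tensor:
    """Energy of a sample from classifier logits; lower energy ~ more ID (natural)."""
    return -temperature * torch.logsumexp(logits / temperature, dim=-1)

# Hypothetical usage with a classifier trained only on natural images:
logits = torch.randn(4, 1000)        # stand-in for model(image_batch)
scores = energy_score(logits)
threshold = 0.0                      # would be chosen on held-out natural images
is_flagged = scores > threshold      # high energy -> treated as generated/OOD
print(scores, is_flagged)
```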

TMLR Journal 2024 Journal Article

A Multilinear Least-Squares Formulation for Sparse Tensor Canonical Correlation Analysis

  • Jun Yu
  • Zhaoming Kong
  • Kun Chen
  • Xin Zhang
  • Yong Chen
  • Lifang He

Tensor data are becoming important recently in various applications, e.g., image and video recognition, which pose new challenges for data modeling and analysis approaches, such as high-order relations of large complexity, varying data scale and gross noise. In this paper, we consider the problem of sparse canonical correlation analysis for arbitrary tensor data. Although several methods have been proposed for this task, there are still limitations hindering its practical applications. To this end, we present a general Sparse Tensor Canonical Correlation Analysis (gSTCCA) method from a multilinear least-squares perspective. Specifically, we formulate the problem as a constrained multilinear least-squares problem with tensor-structured sparsity regularization based on CANDECOMP/PARAFAC (CP) decomposition. Then we present a divide-and-conquer deflation approach to tackle the problem by successive rank-one tensor estimation of the residual tensors, where the overall model is broken up into a set of unconstrained linear least-squares problems that can be efficiently solved. Through extensive experiments conducted on five different datasets for recognition tasks, we demonstrate that the proposed method achieves promising performance compared to the SOTA vector- and tensor-based canonical correlation analysis methods in terms of classification accuracy, model sparsity, and robustness to missing and noisy data. The code is publicly available at https://github.com/junfish/gSTCCA.

AAAI Conference 2024 Conference Paper

BCLNet: Bilateral Consensus Learning for Two-View Correspondence Pruning

  • Xiangyang Miao
  • Guobao Xiao
  • Shiping Wang
  • Jun Yu

Correspondence pruning aims to establish reliable correspondences between two related images and recover relative camera motion. Existing approaches often employ a progressive strategy to handle the local and global contexts, with a prominent emphasis on transitioning from local to global, resulting in the neglect of interactions between different contexts. To tackle this issue, we propose a parallel context learning strategy that involves acquiring bilateral consensus for the two-view correspondence pruning task. In our approach, we design a distinctive self-attention block to capture global context and process it in parallel with the established local context learning module, which enables us to simultaneously capture both local and global consensuses. By combining these local and global consensuses, we derive the required bilateral consensus. We also design a recalibration block, reducing the influence of erroneous consensus information and enhancing the robustness of the model. The culmination of our efforts is the Bilateral Consensus Learning Network (BCLNet), which efficiently estimates camera pose and identifies inliers (true correspondences). Extensive experimental results demonstrate that our network not only surpasses state-of-the-art methods on benchmark datasets but also showcases robust generalization abilities across various feature extraction techniques. Notably, BCLNet obtains significant improvement gains over the second-best method on the unknown outdoor dataset, and markedly accelerates model training.

IJCAI Conference 2024 Conference Paper

Dialogue Cross-Enhanced Central Engagement Attention Model for Real-Time Engagement Estimation

  • Jun Yu
  • Keda Lu
  • Ji Zhao
  • Zhihong Wei
  • Iek-Heng Chu
  • Peng Chang

Real-time engagement estimation has been an important research topic in human-computer interaction in recent years. The emergence of the NOvice eXpert Interaction (NOXI) dataset, enriched with frame-wise engagement annotations, has catalyzed a surge in research efforts in this domain. Existing feature sequence partitioning methods for ultra-long videos have encountered challenges including insufficient information utilization and repetitive inference. Moreover, those studies focus mainly on the target participants’ features without taking into account those of the interlocutor. To address these issues, we propose the center-based sliding window method to obtain feature subsequences. The core of these subsequences is modeled using our innovative Central Engagement Attention Model (CEAM). Additionally, we introduce the dialogue cross-enhanced module that effectively incorporates the interlocutor’s features via cross-attention. Our proposed method outperforms the current best model, achieving a substantial gain of 1.5% in concordance correlation coefficient (CCC) and establishing a new state-of-the-art result. Our source codes and model checkpoints are available at https://github.com/wujiekd/Dialogue-Cross-Enhanced-CEAM.

AAAI Conference 2024 Conference Paper

Graph Context Transformation Learning for Progressive Correspondence Pruning

  • Junwen Guo
  • Guobao Xiao
  • Shiping Wang
  • Jun Yu

Most existing correspondence pruning methods concentrate only on gathering as much context information as possible while neglecting effective ways to utilize such information. To tackle this dilemma, in this paper we propose the Graph Context Transformation Network (GCT-Net), which enhances context information to conduct consensus guidance for progressive correspondence pruning. Specifically, we design the Graph Context Enhance Transformer, which first generates the graph network and then transforms it into multi-branch graph contexts. Moreover, it employs self-attention and cross-attention to magnify the characteristics of each graph context, emphasizing the unique as well as the shared essential information. To further apply the recalibrated graph contexts to the global domain, we propose the Graph Context Guidance Transformer. This module adopts a confidence-based sampling strategy to temporarily screen high-confidence vertices for guiding accurate classification by searching for global consensus between the screened vertices and the remaining ones. Extensive experimental results on outlier removal and relative pose estimation clearly demonstrate the superior performance of GCT-Net compared to state-of-the-art methods across outdoor and indoor datasets.

NeurIPS Conference 2024 Conference Paper

Learnability Matters: Active Learning for Video Captioning

  • Yiqian Zhang
  • Buyu Liu
  • Jun Bao
  • Qiang Huang
  • Min Zhang
  • Jun Yu

This work focuses on active learning for video captioning. In particular, we propose to address the learnability problem in active learning, which has been brought up by collective outliers in video captioning and neglected in the literature. To start with, we conduct a comprehensive study of collective outliers, exploring their hard-to-learn property and concluding that ground truth inconsistency is one of the main causes. Motivated by this, we design a novel active learning algorithm that takes three complementary aspects, namely learnability, diversity, and uncertainty, into account. Ideally, learnability is reflected by ground truth consistency. Under the active learning scenario where ground truths are not available until human involvement, we measure the consistency on estimated ground truths, where predictions from off-the-shelf models are utilized as approximations to ground truths. These predictions are further used to estimate sample frequency and reliability, evincing the diversity and uncertainty respectively. With the help of our novel caption-wise active learning protocol, our algorithm is capable of leveraging knowledge from humans in a more effective and intelligent manner. Results on publicly available video captioning datasets with diverse video captioning models demonstrate that our algorithm outperforms SOTA active learning methods by a large margin, e.g., we achieve about 103% of full performance on CIDEr with 25% of human annotations on MSR-VTT.

AAAI Conference 2024 Conference Paper

Multi-Domain Deep Learning from a Multi-View Perspective for Cross-Border E-commerce Search

  • Yiqian Zhang
  • Yinfu Feng
  • Wen-Ji Zhou
  • Yunan Ye
  • Min Tan
  • Rong Xiao
  • Haihong Tang
  • Jiajun Ding

Building click-through rate (CTR) and conversion rate (CVR) prediction models for cross-border e-commerce search requires modeling the correlations among multi-domains. Existing multi-domain methods would suffer severely from poor scalability and low efficiency when the number of domains increases. To this end, we propose a Domain-Aware Multi-view mOdel (DAMO), which is domain-number-invariant, to effectively leverage cross-domain relations from a multi-view perspective. Specifically, instead of working in the original feature space defined by different domains, DAMO maps everything to a new low-rank multi-view space. To achieve this, DAMO firstly extracts multi-domain features in an explicit feature-interactive manner. These features are parsed to a multi-view extractor to obtain view-invariant and view-specific features. Then a multi-view predictor inputs these two sets of features and outputs view-based predictions. To enforce view-awareness in the predictor, we further propose a lightweight view-attention estimator to dynamically learn the optimal view-specific weights w.r.t. a view-guided loss. Extensive experiments on public and industrial datasets show that compared with state-of-the-art models, our DAMO achieves better performance with lower storage and computational costs. In addition, deploying DAMO to a large-scale cross-border e-commerce platform leads to 1.21%, 1.76%, and 1.66% improvements over the existing CGC-based model in the online AB-testing experiment in terms of CTR, CVR, and Gross Merchandise Value, respectively.

ICML Conference 2024 Conference Paper

Network Tight Community Detection

  • Jiayi Deng
  • Xiaodong Yang
  • Jun Yu
  • Jun Liu
  • Zhaiming Shen
  • Danyang Huang
  • Huimin Cheng

Conventional community detection methods often categorize all nodes into clusters. However, the presumed community structure of interest may only be valid for a subset of nodes (named as ‘tight nodes’), while the rest of the network may consist of noninformative “scattered nodes”. For example, a protein-protein network often contains proteins that do not belong to specific biological functional modules but are involved in more general processes, or act as bridges between different functional modules. Forcing each of these proteins into a single cluster introduces unwanted biases and obscures the underlying biological implication. To address this issue, we propose a tight community detection (TCD) method to identify tight communities excluding scattered nodes. The algorithm enjoys a strong theoretical guarantee of tight node identification accuracy and is scalable for large networks. The superiority of the proposed method is demonstrated by various synthetic and real experiments.

IJCAI Conference 2023 Conference Paper

Actor-Multi-Scale Context Bidirectional Higher Order Interactive Relation Network for Spatial-Temporal Action Localization

  • Jun Yu
  • Yingshuai Zheng
  • Shulan Ruan
  • Qi Liu
  • Zhiyuan Cheng
  • Jinze Wu

The key to video action detection lies in the understanding of interaction between persons and background objects in a video. Current methods usually employ object detectors to extract objects directly or use grid features to represent objects in the environment, which underestimate the great potential of multi-scale context information (e.g., objects and scenes of different sizes). How to exactly represent the multi-scale context and make full utilization of it still remains an unresolved challenge for spatial-temporal action localization. In this paper, we propose a novel Actor-Multi-Scale Context Bidirectional Higher Order Interactive Relation Network (AMCRNet) that extracts multi-scale context through multiple pooling layers with different sizes. Specifically, we develop an Interactive Relation Extraction module to model the higher-order relation between the target person and the context (e.g., other persons and objects). Along this line, we further propose a History Feature Bank and Interaction method to achieve better performance by modeling such relation across continuing video clips. Extensive experimental results on AVA2.2 and UCF101-24 demonstrate the superiority and rationality of our proposed AMCRNet.

NeurIPS Conference 2023 Conference Paper

FlatMatch: Bridging Labeled Data and Unlabeled Data with Cross-Sharpness for Semi-Supervised Learning

  • Zhuo Huang
  • Li Shen
  • Jun Yu
  • Bo Han
  • Tongliang Liu

Semi-Supervised Learning (SSL) has been an effective way to leverage abundant unlabeled data with extremely scarce labeled data. However, most SSL methods are commonly based on instance-wise consistency between different data transformations. Therefore, the label guidance on labeled data is hard to be propagated to unlabeled data. Consequently, the learning process on labeled data is much faster than on unlabeled data, which is likely to fall into a local minimum that does not favor unlabeled data, leading to sub-optimal generalization performance. In this paper, we propose FlatMatch which minimizes a cross-sharpness measure to ensure consistent learning performance between the two datasets. Specifically, we increase the empirical risk on labeled data to obtain a worst-case model which is a failure case needing to be enhanced. Then, by leveraging the richness of unlabeled data, we penalize the prediction difference (i.e., cross-sharpness) between the worst-case model and the original model so that the learning direction is beneficial to generalization on unlabeled data. Therefore, we can calibrate the learning process without being limited to insufficient label information. As a result, the mismatched learning performance can be mitigated, further enabling the effective exploitation of unlabeled data and improving SSL performance. Through comprehensive validation, we show FlatMatch achieves state-of-the-art results in many SSL settings.
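
A rough PyTorch sketch of the mechanism described above, perturbing the model toward higher labeled-data loss and then measuring prediction disagreement on unlabeled data, is given below as a diagnostic; the SAM-style perturbation, radius, and KL measure are illustrative assumptions, not FlatMatch's exact training objective.

```python
import torch
import torch.nn.functional as F

def cross_sharpness(model, x_lab, y_lab, x_unlab, rho=0.05):
    """Diagnostic: ascend the labeled loss to reach a worst-case model, then measure
    how much its unlabeled predictions disagree with the original model's."""
    params = [p for p in model.parameters() if p.requires_grad]
    loss_lab = F.cross_entropy(model(x_lab), y_lab)
    grads = torch.autograd.grad(loss_lab, params)
    scale = rho / (torch.sqrt(sum((g ** 2).sum() for g in grads)) + 1e-12)

    with torch.no_grad():
        p_orig = F.softmax(model(x_unlab), dim=-1)       # original predictions
        for p, g in zip(params, grads):                  # perturb toward higher labeled loss
            p.add_(scale * g)
        logp_worst = F.log_softmax(model(x_unlab), dim=-1)
        for p, g in zip(params, grads):                  # restore the original weights
            p.sub_(scale * g)
        return F.kl_div(logp_worst, p_orig, reduction="batchmean")
```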

JMLR Journal 2023 Journal Article

Importance Sparsification for Sinkhorn Algorithm

  • Mengyu Li
  • Jun Yu
  • Tao Li
  • Cheng Meng

The Sinkhorn algorithm has been used pervasively to approximate the solution to optimal transport (OT) and unbalanced optimal transport (UOT) problems. However, its practical application is limited due to the high computational complexity. To alleviate the computational burden, we propose a novel importance sparsification method, called Spar-Sink, to efficiently approximate entropy-regularized OT and UOT solutions. Specifically, our method employs natural upper bounds for unknown optimal transport plans to establish effective sampling probabilities, and constructs a sparse kernel matrix to accelerate Sinkhorn iterations, reducing the computational cost of each iteration from $O(n^2)$ to $\widetilde{O}(n)$ for a sample of size $n$. Theoretically, we show the proposed estimators for the regularized OT and UOT problems are consistent under mild regularity conditions. Experiments on various synthetic data demonstrate Spar-Sink outperforms mainstream competitors in terms of both estimation error and speed. A real-world echocardiogram data analysis shows Spar-Sink can effectively estimate and visualize cardiac cycles, from which one can identify heart failure and arrhythmia. To evaluate the numerical accuracy of cardiac cycle prediction, we consider the task of predicting the end-systole time point using the end-diastole one. Results show Spar-Sink performs as well as the classical Sinkhorn algorithm, requiring significantly less computational time.
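
To make the sparsification idea concrete, the following numpy sketch runs Sinkhorn iterations on a thresholded (sparsified) Gibbs kernel; keeping only the largest entries is a crude stand-in for Spar-Sink's importance-sampling probabilities, and the regularization and iteration counts are arbitrary.

```python
import numpy as np
from scipy.sparse import csr_matrix

def sparse_sinkhorn(M, a, b, reg=0.05, keep=0.1, n_iter=200):
    """Entropic OT via Sinkhorn scaling on a sparsified Gibbs kernel.
    M: cost matrix, a/b: marginals, keep: fraction of kernel entries retained."""
    K = np.exp(-M / reg)
    thresh = np.quantile(K, 1.0 - keep)           # crude stand-in for importance sampling
    K = csr_matrix(np.where(K >= thresh, K, 0.0))
    u, v = np.ones_like(a), np.ones_like(b)
    for _ in range(n_iter):
        u = a / np.maximum(K @ v, 1e-300)         # row scaling
        v = b / np.maximum(K.T @ u, 1e-300)       # column scaling
    return (u[:, None] * K.toarray()) * v[None, :]  # approximate transport plan

n = 200
x, y = np.sort(np.random.rand(n)), np.sort(np.random.rand(n))
M = (x[:, None] - y[None, :]) ** 2
P = sparse_sinkhorn(M, np.full(n, 1 / n), np.full(n, 1 / n))
print(P.sum())  # close to 1 when the marginals are approximately matched
```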

NeurIPS Conference 2023 Conference Paper

InstanT: Semi-supervised Learning with Instance-dependent Thresholds

  • Muyang Li
  • Runze Wu
  • Haoyu Liu
  • Jun Yu
  • Xun Yang
  • Bo Han
  • Tongliang Liu

Semi-supervised learning (SSL) has been a fundamental challenge in machine learning for decades. The primary family of SSL algorithms, known as pseudo-labeling, involves assigning pseudo-labels to confident unlabeled instances and incorporating them into the training set. Therefore, the selection criteria of confident instances are crucial to the success of SSL. Recently, there has been growing interest in the development of SSL methods that use dynamic or adaptive thresholds. Yet, these methods typically apply the same threshold to all samples, or use class-dependent thresholds for instances belonging to a certain class, while neglecting instance-level information. In this paper, we propose the study of instance-dependent thresholds, which has the highest degree of freedom compared with existing methods. Specifically, we devise a novel instance-dependent threshold function for all unlabeled instances by utilizing their instance-level ambiguity and the instance-dependent error rates of pseudo-labels, so instances that are more likely to have incorrect pseudo-labels will have higher thresholds. Furthermore, we demonstrate that our instance-dependent threshold function provides a bounded probabilistic guarantee for the correctness of the pseudo-labels it assigns.
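
A schematic of how instance-dependent thresholds would gate pseudo-labels is sketched below; the toy threshold function based on prediction ambiguity is a placeholder and does not implement InstanT's estimator of instance-level ambiguity or pseudo-label error rates.

```python
import numpy as np

def toy_threshold(p, base=0.9):
    """Placeholder instance-dependent threshold: flatter (more ambiguous)
    predictions get a stricter confidence requirement."""
    top2 = np.sort(p)[-2:]
    ambiguity = 1.0 - (top2[1] - top2[0])       # in [0, 1]
    return min(base + 0.1 * ambiguity, 0.99)

def select_pseudo_labels(probs):
    """probs: (n_unlabeled, n_classes) softmax outputs of the current model."""
    selected = []
    for i, p in enumerate(probs):
        conf, label = p.max(), int(p.argmax())
        if conf >= toy_threshold(p):            # per-instance gate, not a global cutoff
            selected.append((i, label))
    return selected

probs = np.random.dirichlet(np.ones(10) * 0.3, size=100)
print(len(select_pseudo_labels(probs)), "instances pseudo-labeled")
```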

AAAI Conference 2023 Conference Paper

Knowledge-Constrained Answer Generation for Open-Ended Video Question Answering

  • Yao Jin
  • Guocheng Niu
  • Xinyan Xiao
  • Jian Zhang
  • Xi Peng
  • Jun Yu

Open-ended Video question answering (open-ended VideoQA) aims to understand video content and question semantics to generate the correct answers. Most of the best-performing models define the problem as a discriminative task of multi-label classification. In real-world scenarios, however, it is difficult to define a candidate set that includes all possible answers. In this paper, we propose a Knowledge-constrained Generative VideoQA Algorithm (KcGA) with an encoder-decoder pipeline, which enables out-of-domain answer generation through an adaptive external knowledge module and a multi-stream information control mechanism. We use ClipBERT to extract the video-question features, extract frame-wise object-level external knowledge from a commonsense knowledge base and compute context-aware episode memory units via an attention-based GRU to form the external knowledge features, and exploit a multi-stream information control mechanism to fuse video-question and external knowledge features such that semantic complementation and alignment are well achieved. We evaluate our model on two open-ended benchmark datasets to demonstrate that we can effectively and robustly generate high-quality answers without restrictions of training data.

AAAI Conference 2023 Conference Paper

ShiftDDPMs: Exploring Conditional Diffusion Models by Shifting Diffusion Trajectories

  • Zijian Zhang
  • Zhou Zhao
  • Jun Yu
  • Qi Tian

Diffusion models have recently exhibited remarkable abilities to synthesize striking image samples since the introduction of denoising diffusion probabilistic models (DDPMs). Their key idea is to disrupt images into noise through a fixed forward process and learn its reverse process to generate samples from noise in a denoising way. For conditional DDPMs, most existing practices relate conditions only to the reverse process and fit it to the reversal of the unconditional forward process. We find this will limit the condition modeling and generation in a small time window. In this paper, we propose a novel and flexible conditional diffusion model by introducing conditions into the forward process. We utilize extra latent space to allocate an exclusive diffusion trajectory for each condition based on some shifting rules, which will disperse condition modeling to all timesteps and improve the learning capacity of the model. We formulate our method, which we call ShiftDDPMs, and provide a unified point of view on existing related methods. Extensive qualitative and quantitative experiments on image synthesis demonstrate the feasibility and effectiveness of ShiftDDPMs.

NeurIPS Conference 2023 Conference Paper

Subclass-Dominant Label Noise: A Counterexample for the Success of Early Stopping

  • Yingbin Bai
  • Zhongyi Han
  • Erkun Yang
  • Jun Yu
  • Bo Han
  • Dadong Wang
  • Tongliang Liu

In this paper, we empirically investigate a previously overlooked and widespread type of label noise, subclass-dominant label noise (SDN). Our findings reveal that, during the early stages of training, deep neural networks can rapidly memorize mislabeled examples in SDN. This phenomenon poses challenges in effectively selecting confident examples using conventional early stopping techniques. To address this issue, we delve into the properties of SDN and observe that long-trained representations are superior at capturing the high-level semantics of mislabeled examples, leading to a clustering effect where similar examples are grouped together. Based on this observation, we propose a novel method called NoiseCluster that leverages the geometric structures of long-trained representations to identify and correct SDN. Our experiments demonstrate that NoiseCluster outperforms state-of-the-art baselines on both synthetic and real-world datasets, highlighting the importance of addressing SDN in learning with noisy labels. The code is available at https://github.com/tmllab/2023_NeurIPS_SDN.

JMLR Journal 2022 Journal Article

Learning from Noisy Pairwise Similarity and Unlabeled Data

  • Songhua Wu
  • Tongliang Liu
  • Bo Han
  • Jun Yu
  • Gang Niu
  • Masashi Sugiyama

SU classification employs similar (S) data pairs (two examples belong to the same class) and unlabeled (U) data points to build a classifier, which can serve as an alternative to standard supervised classifiers that require data points with class labels. SU classification is advantageous because, in the era of big data, more attention has been paid to data privacy. Datasets with specific class labels are often difficult to obtain in real-world classification applications regarding privacy-sensitive matters, such as politics and religion, which can be a bottleneck in supervised classification. Fortunately, similarity labels do not reveal the explicit information and inherently protect privacy, e.g., collecting answers to “With whom do you share the same opinion on issue $\mathcal{I}$?” instead of “What is your opinion on issue $\mathcal{I}$?”. Nevertheless, SU classification still has an obvious limitation: respondents might answer these questions in a manner that is viewed favorably by others instead of answering truthfully. Therefore, there exist some dissimilar data pairs labeled as similar, which significantly degenerates the performance of SU classification. In this paper, we study how to learn from noisy similar (nS) data pairs and unlabeled (U) data, which is called nSU classification. Specifically, we carefully model the similarity noise and estimate the noise rate by using the mixture proportion estimation technique. Then, a clean classifier can be learned by minimizing a denoised and unbiased classification risk estimator, which only involves the noisy data. Moreover, we further derive a theoretical generalization error bound for the proposed method. Experimental results demonstrate the effectiveness of the proposed algorithm on several benchmark datasets.

AAAI Conference 2021 Conference Paper

Deep Graph-neighbor Coherence Preserving Network for Unsupervised Cross-modal Hashing

  • Jun Yu
  • Hao Zhou
  • Yibing Zhan
  • Dacheng Tao

Unsupervised cross-modal hashing (UCMH) has become a hot topic recently. Current UCMH focuses on exploring data similarities. However, current UCMH methods calculate the similarity between two data points mainly by relying on their cross-modal features. These methods suffer from inaccurate similarity problems that result in a suboptimal retrieval Hamming space, because the cross-modal features between the data are not sufficient to describe the complex data relationships, such as situations where two data points have different feature representations but share the inherent concepts. In this paper, we devise a deep graph-neighbor coherence preserving network (DGCPN). Specifically, DGCPN stems from graph models and explores graph-neighbor coherence by consolidating the information between data and their neighbors. DGCPN regulates comprehensive similarity preserving losses by exploiting three types of data similarities (i.e., the graph-neighbor coherence, the coexistent similarity, and the intra- and inter-modality consistency) and designs a half-real and half-binary optimization strategy to reduce the quantization errors during hashing. Essentially, DGCPN addresses the inaccurate similarity problem by exploring and exploiting the data’s intrinsic relationships in a graph. We conduct extensive experiments on three public UCMH datasets. The experimental results demonstrate the superiority of DGCPN, e.g., by improving the mean average precision from 0.722 to 0.751 on MIRFlickr-25K using 64-bit hashing codes to retrieve texts from images. We will release the source code package and the trained model at https://github.com/Atmegal/DGCPN.

IJCAI Conference 2021 Conference Paper

Weakly Supervised Dense Video Captioning via Jointly Usage of Knowledge Distillation and Cross-modal Matching

  • Bofeng Wu
  • Guocheng Niu
  • Jun Yu
  • Xinyan Xiao
  • Jian Zhang
  • Hua Wu

This paper proposes an approach to Dense Video Captioning (DVC) without pairwise event-sentence annotation. First, we adopt the knowledge distilled from relevant and well-solved tasks to generate high-quality event proposals. Then we incorporate contrastive loss and cycle-consistency loss, typically applied to cross-modal retrieval tasks, to build semantic matching between the proposals and sentences, which are eventually used to train the caption generation module. In addition, the parameters of the matching module are initialized via pre-training based on annotated images to improve the matching performance. Extensive experiments on the ActivityNet-Caption dataset reveal the significance of distillation-based event proposal generation and cross-modal retrieval-based semantic matching to weakly supervised DVC, and demonstrate the superiority of our method to existing state-of-the-art methods.

NeurIPS Conference 2020 Conference Paper

Sufficient dimension reduction for classification using principal optimal transport direction

  • Cheng Meng
  • Jun Yu
  • Jingyi Zhang
  • Ping Ma
  • Wenxuan Zhong

Sufficient dimension reduction is used pervasively as a supervised dimension reduction approach. Most existing sufficient dimension reduction methods are developed for data with a continuous response and may have unsatisfactory performance for categorical responses, especially binary responses. To address this issue, we propose a novel estimation method of the sufficient dimension reduction subspace (SDR subspace) using optimal transport. The proposed method, named principal optimal transport direction (POTD), estimates the basis of the SDR subspace using the principal directions of the optimal transport coupling between the data respecting different response categories. The proposed method also reveals the relationship among three seemingly irrelevant topics, i.e., sufficient dimension reduction, support vector machine, and optimal transport. We study the asymptotic properties of POTD and show that in the cases when the class labels contain no error, POTD estimates the SDR subspace exclusively. Empirical studies show POTD outperforms most of the state-of-the-art linear dimension reduction methods.
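
As a self-contained illustration of the general recipe the abstract describes, an OT coupling between the two classes followed by the leading directions of coupling-weighted displacements, the sketch below uses an entropic Sinkhorn solver and an eigendecomposition; both are simplifications and not the paper's POTD estimator.

```python
import numpy as np

def sinkhorn_coupling(X0, X1, reg=0.5, n_iter=300):
    """Entropic OT plan between two point clouds with uniform weights."""
    C = ((X0[:, None, :] - X1[None, :, :]) ** 2).sum(-1)
    K = np.exp(-C / (reg * C.max()))            # cost rescaled for numerical stability
    a = np.full(len(X0), 1 / len(X0))
    b = np.full(len(X1), 1 / len(X1))
    u, v = np.ones(len(X0)), np.ones(len(X1))
    for _ in range(n_iter):
        u = a / (K @ v)
        v = b / (K.T @ u)
    return u[:, None] * K * v[None, :]

def principal_transport_directions(X0, X1, k=1):
    """Leading directions of the coupling-weighted displacements X1[j] - X0[i]."""
    P = sinkhorn_coupling(X0, X1)
    diffs = X1[None, :, :] - X0[:, None, :]               # (n0, n1, d)
    M = np.einsum("ij,ijk,ijl->kl", P, diffs, diffs)      # weighted second moment
    _, eigvecs = np.linalg.eigh(M)
    return eigvecs[:, -k:]                                 # top-k directions

rng = np.random.default_rng(1)
X0 = rng.normal(0.0, 1.0, (150, 5))
X1 = rng.normal(0.0, 1.0, (150, 5))
X1[:, 0] += 2.0                                  # the two classes differ along axis 0
print(principal_transport_directions(X0, X1).ravel())  # roughly +-(1, 0, 0, 0, 0)
```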

IJCAI Conference 2020 Conference Paper

Weakly Supervised Local-Global Relation Network for Facial Expression Recognition

  • Haifeng Zhang
  • Wen Su
  • Jun Yu
  • Zengfu Wang

To extract crucial local features and enhance the complementary relation between local and global features, this paper proposes a Weakly Supervised Local-Global Relation Network (WS-LGRN), which uses the attention mechanism to deal with part location and feature fusion problems. Firstly, the Attention Map Generator quickly finds the local regions-of-interest under the supervision of image-level labels. Secondly, bilinear attention pooling is employed to generate and refine local features. Thirdly, the Relational Reasoning Unit is designed to model the relation among all features before making classification. The weighted fusion mechanism in the Relational Reasoning Unit makes the model benefit from the complementary advantages between different features. In addition, contrastive losses are introduced for local and global features to increase the inter-class dispersion and intra-class compactness at different granularities. Experiments on lab-controlled and real-world facial expression datasets show that WS-LGRN achieves state-of-the-art performance, which demonstrates its superiority in FER.

AAAI Conference 2019 Conference Paper

ActivityNet-QA: A Dataset for Understanding Complex Web Videos via Question Answering

  • Zhou Yu
  • Dejing Xu
  • Jun Yu
  • Ting Yu
  • Zhou Zhao
  • Yueting Zhuang
  • Dacheng Tao

Recent developments in modeling language and vision have been successfully applied to image question answering. It is both crucial and natural to extend this research direction to the video domain for video question answering (VideoQA). Compared to the image domain, where large-scale and fully annotated benchmark datasets exist, existing VideoQA datasets are small in scale or automatically generated, which restricts their applicability in practice. Here we introduce ActivityNet-QA, a fully annotated and large-scale VideoQA dataset. The dataset consists of 58,000 QA pairs on 5,800 complex web videos derived from the popular ActivityNet dataset. We present a statistical analysis of our ActivityNet-QA dataset and conduct extensive experiments on it by comparing existing VideoQA baselines. Moreover, we explore various video representation strategies to improve VideoQA performance, especially for long videos.

IJCAI Conference 2018 Conference Paper

Open-Ended Long-form Video Question Answering via Adaptive Hierarchical Reinforced Networks

  • Zhou Zhao
  • Zhu Zhang
  • Shuwen Xiao
  • Zhou Yu
  • Jun Yu
  • Deng Cai
  • Fei Wu
  • Yueting Zhuang

Open-ended long-form video question answering is a challenging problem in visual information retrieval: it requires automatically generating a natural language answer from the referenced long-form video content according to the question. However, existing video question answering work mainly focuses on short-form videos, owing to the lack of models for the semantic representation of long-form video content. In this paper, we consider the problem of long-form video question answering from the viewpoint of adaptive hierarchical reinforced encoder-decoder network learning. We propose an adaptive hierarchical encoder network to learn the joint representation of the long-form video content according to the question with adaptive video segmentation. We then develop a reinforced decoder network to generate the natural language answer for open-ended video question answering. We construct a large-scale long-form video question answering dataset. Extensive experiments show the effectiveness of our method.

IJCAI Conference 2018 Conference Paper

Rethinking Diversified and Discriminative Proposal Generation for Visual Grounding

  • Zhou Yu
  • Jun Yu
  • Chenchao Xiang
  • Zhou Zhao
  • Qi Tian
  • Dacheng Tao

Visual grounding aims to localize an object in an image referred to by a textual query phrase. Various visual grounding approaches have been proposed, and the problem can be modularized into a general framework: proposal generation, multi-modal feature representation, and proposal ranking. Of these three modules, most existing approaches focus on the latter two, while the importance of proposal generation is generally neglected. In this paper, we rethink what properties make a good proposal generator. We introduce diversity and discrimination simultaneously when generating proposals, and in doing so propose the Diversified and Discriminative Proposal Networks model (DDPN). Based on the proposals generated by DDPN, we propose a high-performance baseline model for visual grounding and evaluate it on four benchmark datasets. Experimental results demonstrate that our model delivers significant improvements on all the tested datasets (e.g., 18.8% improvement on ReferItGame and 8.2% improvement on Flickr30k Entities over the existing state of the art).
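
To make the three-module framework concrete, here is a hedged sketch of the generic proposal-ranking stage only: each candidate box is scored by the similarity between its visual feature and the encoded query phrase. This illustrates the framework the abstract modularizes, not DDPN itself; the feature dimensions and cosine-similarity scoring are illustrative assumptions.

```python
# A generic proposal-ranking sketch (framework illustration, not DDPN).
import torch

def rank_proposals(proposal_feats, phrase_feat):
    """proposal_feats: (N, d) features of N candidate boxes; phrase_feat: (d,) query embedding.
    Returns proposal indices sorted from best to worst match."""
    p = torch.nn.functional.normalize(proposal_feats, dim=-1)
    q = torch.nn.functional.normalize(phrase_feat, dim=-1)
    scores = p @ q                            # cosine similarity per proposal
    return torch.argsort(scores, descending=True)

feats = torch.randn(100, 256)                 # e.g. 100 proposals from a detector
query = torch.randn(256)                      # encoded query phrase
print(rank_proposals(feats, query)[:5])       # top-5 proposal indices
```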

IJCAI Conference 2017 Conference Paper

Improving Stochastic Block Models by Incorporating Power-Law Degree Characteristic

  • Maoying Qiao
  • Jun Yu
  • Wei Bian
  • Qiang Li
  • Dacheng Tao

Stochastic block models (SBMs) provide a statistical way of modeling network data, especially for representing clusters or community structures. However, most block models do not consider complex characteristics of networks such as the scale-free property, making them incapable of handling the degree variation of vertices that is ubiquitous in real networks. To address this issue, we introduce degree decay variables into the SBM, termed the power-law degree SBM (PLD-SBM), to model the varying probability of connections between node pairs. The scale-free property is approximated by a power-law degree characteristic. This allows PLD-SBM to correct the distortion of the degree distribution in the SBM and thus improves the performance of cluster prediction. Experiments on both simulated networks and two real-world networks, the Adolescent Health data and the political blogs network, demonstrate the validity of the motivation behind PLD-SBM and its practical superiority.
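
The key idea above is letting edge probabilities depend on per-node degree parameters as well as block membership, so that hub-like vertices are possible. The sketch below illustrates that idea with a generic degree-corrected SBM generator and heavy-tailed degree propensities; it is not the exact PLD-SBM formulation, and the particular edge-probability form is an assumption for illustration.

```python
# A generic degree-aware SBM sampler (illustration of the idea, not PLD-SBM itself).
import numpy as np

def sample_degree_corrected_sbm(z, B, theta, rng):
    """z: (n,) block labels; B: (K, K) block connectivity; theta: (n,) degree propensities.
    Returns a symmetric adjacency matrix sampled edge by edge."""
    n = len(z)
    A = np.zeros((n, n), dtype=int)
    for i in range(n):
        for j in range(i + 1, n):
            p = min(1.0, theta[i] * theta[j] * B[z[i], z[j]])  # degree-modulated edge prob.
            A[i, j] = A[j, i] = rng.random() < p
    return A

rng = np.random.default_rng(0)
z = rng.integers(0, 2, size=100)                  # two blocks
B = np.array([[0.8, 0.05], [0.05, 0.8]])          # dense within, sparse between blocks
theta = rng.pareto(2.5, size=100) + 1             # heavy-tailed degree propensities
theta /= theta.mean()
A = sample_degree_corrected_sbm(z, B, theta, rng)
print(A.sum(axis=1)[:10])                         # first few node degrees
```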

ICML Conference 2015 Conference Paper

Bayesian and Empirical Bayesian Forests

  • Matt Taddy
  • Chun-Sheng Chen
  • Jun Yu
  • Mitch Wyle

We derive ensembles of decision trees through a nonparametric Bayesian model, allowing us to view such ensembles as samples from a posterior distribution. This insight motivates a class of Bayesian Forest (BF) algorithms that provide small gains in performance and large gains in interpretability. Based on the BF framework, we show that the high-level tree hierarchy is stable in large samples. This motivates an empirical Bayesian Forest (EBF) algorithm for building approximate BFs on massive distributed datasets, and we show that EBFs outperform sub-sampling based alternatives by a large margin.
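
The posterior-sample view of a forest described above corresponds to fitting each tree under Bayesian-bootstrap (random exponential) observation weights rather than multinomial resampling. Here is a hedged scikit-learn sketch of that construction; it is an illustration of the idea, not the paper's implementation, and hyperparameters such as `min_samples_leaf` are arbitrary.

```python
# A Bayesian-bootstrap forest sketch: each tree sees the full data with exponential weights.
import numpy as np
from sklearn.tree import DecisionTreeRegressor

def bayesian_forest(X, y, n_trees=50, seed=0):
    rng = np.random.default_rng(seed)
    trees = []
    for _ in range(n_trees):
        w = rng.exponential(scale=1.0, size=len(y))   # Bayesian-bootstrap weights
        t = DecisionTreeRegressor(min_samples_leaf=5)
        t.fit(X, y, sample_weight=w)
        trees.append(t)
    return trees

rng = np.random.default_rng(1)
X = rng.normal(size=(500, 3))
y = X[:, 0] ** 2 + rng.normal(scale=0.1, size=500)
forest = bayesian_forest(X, y)
preds = np.mean([t.predict(X[:5]) for t in forest], axis=0)  # posterior-mean prediction
print(preds.round(2))
```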

AAAI Conference 2014 Conference Paper

A Latent Variable Model for Discovering Bird Species Commonly Misidentified by Citizen Scientists

  • Jun Yu
  • Rebecca Hutchinson
  • Weng-Keen Wong

Data quality is a common source of concern for large-scale citizen science projects like eBird. In the case of eBird, a major cause of poor-quality data is the misidentification of bird species by inexperienced contributors. A proactive approach to improving data quality is to discover commonly misidentified bird species and to teach inexperienced birders the differences between these species. To accomplish this goal, we develop a latent variable graphical model that can identify groups of bird species that are often confused for each other by eBird participants. Our model is a multi-species extension of the classic occupancy-detection model in the ecology literature. This multi-species extension requires a structure learning step as well as a computationally expensive parameter learning stage, which we make efficient through a variational approximation. We show that our model can not only discover groups of misidentified species, but by including these misidentifications in the model, it can also achieve more accurate predictions of both species occupancy and detection.
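
For readers unfamiliar with the base model being extended, here is a hedged sketch of the classic single-species occupancy-detection likelihood: a site is occupied with probability psi, and each visit to an occupied site yields a detection with probability p. The multi-species, misidentification-aware extension in the paper adds structure beyond this; the code below only illustrates the standard building block, and assumes no false positives at unoccupied sites.

```python
# Single-species occupancy-detection likelihood for one site's detection history.
import numpy as np

def occupancy_likelihood(detections, psi, p):
    """detections: (n_visits,) 0/1 detection history at one site.
    Marginal likelihood, summing over the latent occupied/unoccupied state."""
    d = np.asarray(detections)
    lik_occupied = psi * np.prod(p ** d * (1 - p) ** (1 - d))
    lik_unoccupied = (1 - psi) * (0.0 if d.any() else 1.0)  # assumes no false positives
    return lik_occupied + lik_unoccupied

print(occupancy_likelihood([0, 1, 0], psi=0.6, p=0.4))  # detected on one of three visits
```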

AAAI Conference 2014 Conference Paper

HC-Search for Multi-Label Prediction: An Empirical Study

  • Janardhan Rao Doppa
  • Jun Yu
  • Chao Ma
  • Alan Fern
  • Prasad Tadepalli

Multi-label learning concerns learning multiple, overlapping, and correlated classes. In this paper, we adapt a recent structured prediction framework called HC-Search for multi-label prediction problems. One of the main advantages of this framework is that its training is sensitive to the loss function, unlike other multi-label approaches that either assume a specific loss function or require a manual adaptation to each loss function. We empirically evaluate our instantiation of the HC-Search framework along with many existing multi-label learning algorithms on a variety of benchmarks by employing diverse task loss functions. Our results demonstrate that the performance of existing algorithms tends to be very similar in most cases, and that the HC-Search approach is comparable to and often better than all the other algorithms across different loss functions.
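
The loss-sensitivity point above is easy to see numerically: the same multi-label predictions can score quite differently under different task losses. The hedged sketch below uses two standard scikit-learn metrics as stand-ins for the diverse task losses mentioned in the abstract; the toy label matrices are made up for illustration.

```python
# Comparing the same multi-label predictions under two different task losses.
import numpy as np
from sklearn.metrics import hamming_loss, f1_score

y_true = np.array([[1, 0, 1, 0],
                   [0, 1, 1, 0],
                   [1, 1, 0, 1]])
y_pred = np.array([[1, 0, 0, 0],
                   [0, 1, 1, 1],
                   [1, 0, 0, 1]])

print("Hamming loss:", hamming_loss(y_true, y_pred))                 # per-label error rate
print("Example-F1 :", f1_score(y_true, y_pred, average='samples'))   # per-instance F1, averaged
```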