Author name cluster

Qing Yu

Possible papers associated with this exact author name in Arrow. This page groups case-insensitive exact name matches and is not a full identity disambiguation profile.

7 papers

1 author row

TMLR Journal 2025 Journal Article

Generalized Out-of-Distribution Detection and Beyond in Vision Language Model Era: A Survey

Atsuyuki Miyai
Jingkang Yang
Jingyang Zhang
Yifei Ming
Yueqian Lin
Qing Yu
Go Irie
Shafiq Joty

Detecting out-of-distribution (OOD) samples is crucial for ensuring the safety of machine learning systems and has shaped the field of OOD detection. Meanwhile, several other problems are closely related to OOD detection, including anomaly detection (AD), novelty detection (ND), open set recognition (OSR), and outlier detection (OD). To unify these problems, a generalized OOD detection framework was proposed, taxonomically categorizing these five problems. However, Vision Language Models (VLMs) such as CLIP have significantly changed the paradigm and blurred the boundaries between these fields, again confusing researchers. In this survey, we first present a generalized OOD detection v2, encapsulating the evolution of these fields in the VLM era. Our framework reveals that, with some field inactivity and integration, the demanding challenges have become OOD detection and AD. Then, we highlight the significant shift in the definition, problem settings, and benchmarks; we thus feature a comprehensive review of the methodology for OOD detection and related tasks to clarify their relationship to OOD detection. Finally, we explore the advancements in the emerging Large Vision Language Model (LVLM) era, such as GPT-4V. We conclude with open challenges and future directions. The resource is available at https://github.com/AtsuMiyai/Awesome-OOD-VLM.

PDF Details

AAAI Conference 2025 Conference Paper

ReMoGPT: Part-Level Retrieval-Augmented Motion-Language Models

Qing Yu
Mikihiro Tanaka
Kent Fujiwara

Generation of 3D human motion holds significant importance in the creative industry. While recent notable advances have been made in generating common motions, existing methods struggle to generate diverse and rare motions due to the complexity of motions and limited training data. This work introduces ReMoGPT, a unified motion-language generative model that solves a wide range of motion-related tasks by incorporating a multi-modal retrieval mechanism into the generation process to address the limitations of existing models, namely diversity and generalizability. We propose to focus on body-part-level motion features to enable fine-grained text-motion retrieval and locate suitable references from the database to conduct generation. Then, the motion-language generative model is trained with prompt-based question-and-answer tasks designed for different motion-relevant problems. We incorporate the retrieved samples into the prompt, and then perform instruction tuning of the motion-language model, to learn from task feedback and produce promising results with the help of fine-grained multi-modal retrieval. Extensive experiments validate the efficacy of ReMoGPT, showcasing its superiority over existing state-of-the-art methods. The framework performs well on multiple motion tasks, including motion retrieval, generation, and captioning.

PDF Details DOI

NeurIPS Conference 2025 Conference Paper

Who You Are Matters: Bridging Interests and Social Roles via LLM-Enhanced Logic Recommendation

Qing Yu
Xiaobei Wang
Shuchang Liu
Xiaoyu Yang
Xueliang Wang
Chang Meng
Shanshan Wu
Bin Wen

Recommender systems filter contents/items valuable to users by inferring preferences from user features and historical behaviors. Mainstream approaches follow the learning-to-rank paradigm, which focus on discovering and modeling item topics (e. g. , categories), and capturing user preferences on these topics based on historical interactions. However, this paradigm often neglects the modeling of user characteristics and their social roles, which are logical confounders influencing the correlated interest and user preference transition. To bridge this gap, we introduce the user role identification task and the behavioral logic modeling task that aim to explicitly model user roles and learn the logical relations between item topics and user social roles. We show that it is possible to explicitly solve these tasks through an efficient integration framework of Large Language Model (LLM) and recommendation systems, for which we propose TagCF. On the one hand, TagCF exploits the (Multi-modal) LLM's world knowledge and logic inference ability to extract realistic tag-based virtual logic graphs that reveal dynamic and expressive knowledge of users, refining our understanding of user behaviors. On the other hand, TagCF presents empirically effective integration modules that take advantage of the extracted tag-logic information, augmenting the recommendation performance. We conduct both online experiments and offline experiments with industrial and public datasets as verification of TagCF's effectiveness, and we empirically show that the user role modeling strategy is potentially a better choice than the modeling of item topics. Additionally, we provide evidence that the extracted logic graphs are empirically a general and transferable knowledge that can benefit a wide range of recommendation tasks. Our code is available in https: //github. com/Code2Q/TagCF.

PDF Details

AAAI Conference 2023 Conference Paper

Frame-Level Label Refinement for Skeleton-Based Weakly-Supervised Action Recognition

Qing Yu
Kent Fujiwara

In recent years, skeleton-based action recognition has achieved remarkable performance in understanding human motion from sequences of skeleton data, which is an important medium for synthesizing realistic human movement in various applications. However, existing methods assume that each action clip is manually trimmed to contain one specific action, which requires a significant amount of effort for annotation. To solve this problem, we consider a novel problem of skeleton-based weakly-supervised temporal action localization (S-WTAL), where we need to recognize and localize human action segments in untrimmed skeleton videos given only the video-level labels. Although this task is challenging due to the sparsity of skeleton data and the lack of contextual clues from interaction with other objects and the environment, we present a frame-level label refinement framework based on a spatio-temporal graph convolutional network (ST-GCN) to overcome these difficulties. We use multiple instance learning (MIL) with video-level labels to generate the frame-level predictions. Inspired by advances in handling the noisy label problem, we introduce a label cleaning strategy of the frame-level pseudo labels to guide the learning process. The network parameters and the frame-level predictions are alternately updated to obtain the final results. We extensively evaluate the effectiveness of our learning approach on skeleton-based action recognition benchmarks. The state-of-the-art experimental results demonstrate that the proposed method can recognize and localize action segments of the skeleton data.

PDF Details DOI

NeurIPS Conference 2023 Conference Paper

LoCoOp: Few-Shot Out-of-Distribution Detection via Prompt Learning

Atsuyuki Miyai
Qing Yu
Go Irie
Kiyoharu Aizawa

We present a novel vision-language prompt learning approach for few-shot out-of-distribution (OOD) detection. Few-shot OOD detection aims to detect OOD images from classes that are unseen during training using only a few labeled in-distribution (ID) images. While prompt learning methods such as CoOp have shown effectiveness and efficiency in few-shot ID classification, they still face limitations in OOD detection due to the potential presence of ID-irrelevant information in text embeddings. To address this issue, we introduce a new approach called $\textbf{Lo}$cal regularized $\textbf{Co}$ntext $\textbf{Op}$timization (LoCoOp), which performs OOD regularization that utilizes the portions of CLIP local features as OOD features during training. CLIP's local features have a lot of ID-irrelevant nuisances ($\textit{e. g. }$, backgrounds), and by learning to push them away from the ID class text embeddings, we can remove the nuisances in the ID class text embeddings and enhance the separation between ID and OOD. Experiments on the large-scale ImageNet OOD detection benchmarks demonstrate the superiority of our LoCoOp over zero-shot, fully supervised detection methods and prompt learning methods. Notably, even in a one-shot setting -- just one label per class, LoCoOp outperforms existing zero-shot and fully supervised detection methods. The code is available via https: //github. com/AtsuMiyai/LoCoOp.

PDF Details

AAAI Conference 2022 Conference Paper

Self-Labeling Framework for Novel Category Discovery over Domains

Qing Yu
Daiki Ikami
Go Irie
Kiyoharu Aizawa

Unsupervised domain adaptation (UDA) has been highly successful in transferring knowledge acquired from a label-rich source domain to a label-scarce target domain. Open-set domain adaptation (open-set DA) and universal domain adaptation (UniDA) have been proposed as solutions to the problem concerning the presence of additional novel categories in the target domain. Existing open-set DA and UniDA approaches treat all novel categories as one unified unknown class and attempt to detect this unknown class during the training process. However, the features of the novel categories learned by these methods are not discriminative. This limits the applicability of UDA in the further classification of these novel categories into their original categories, rather than assigning them to a single unified class. In this paper, we propose a selflabeling framework to cluster all target samples, including those in the “unknown” categories. We train the network to learn the representations of target samples via self-supervised learning (SSL) and to identify the seen and unseen (novel) target-sample categories simultaneously by maximizing the mutual information between labels and input data. We evaluated our approach under different DA settings and concluded that our method generally outperformed existing ones by a wide margin.

PDF Details

YNIMG Journal 2017 Journal Article

Occipital, parietal, and frontal cortices selectively maintain task-relevant features of multi-feature objects in visual working memory

Qing Yu
Won Mok Shim

Previous studies have shown that information held in visual working memory is represented in the occipital, parietal, and frontal cortices. However, less is known about whether the mnemonic information of multi-feature objects is modulated by task demand in the parietal and frontal regions. To address this question, we asked participants to remember either color or orientation of one of the two colored gratings for a delay. Using fMRI and an inverted encoding model, we reconstructed population-level, feature-selective responses in the occipital, parietal and frontal cortices during memory maintenance. We found that not only orientation but also color information can be maintained in higher-order parietal and frontal cortices as well as the early visual cortex when it was cued to be remembered. Conversely, neither the task-irrelevant feature of the cued object, nor any feature of the uncued object was maintained in the occipital, parietal, or frontal cortices. These results suggest a highly selective mechanism of visual working memory that maintains task-relevant features only.

Details DOI