Author name cluster

Yu Kong

Possible papers associated with this exact author name in Arrow. This page groups case-insensitive exact name matches and is not a full identity disambiguation profile.

7 papers

1 author row

NeurIPS Conference 2025 Conference Paper

IndustryEQA: Pushing the Frontiers of Embodied Question Answering in Industrial Scenarios

Yifan Li
Yuhang Chen
Anh Dao
Lichi Li
Zhongyi Cai
Zhen Tan
Tianlong Chen
Yu Kong

Existing Embodied Question Answering (EQA) benchmarks primarily focus on household environments, often overlooking safety-critical aspects and reasoning processes pertinent to industrial settings. This drawback limits the evaluation of agent readiness for real-world industrial applications. To bridge this gap, we introduce IndustryEQA, the first benchmark dedicated to evaluating embodied agent capabilities within safety-critical industrial warehouse scenarios. Built upon the NVIDIA Isaac Sim platform, IndustryEQA provides high-fidelity episodic memory videos featuring diverse industrial assets, dynamic human agents, and carefully designed hazardous situations inspired by real-world safety guidelines. The benchmark includes rich annotations covering six categories: equipment safety, human safety, object recognition, attribute recognition, temporal understanding, and spatial understanding. It also provides additional reasoning evaluation based on these categories. Specifically, it comprises 971 question-answer pairs generated from small warehouse scenarios and 373 pairs from large ones, incorporating scenarios with and without humans. We further propose a comprehensive evaluation framework, including various baseline models, to assess their general perception and reasoning abilities in industrial environments. IndustryEQA aims to steer EQA research towards developing more robust, safety-aware, and practically applicable embodied agents for complex industrial environments.

IJCAI Conference 2024 Conference Paper

A Survey of Multimodal Sarcasm Detection

Shafkat Farabi
Tharindu Ranasinghe
Diptesh Kanojia
Yu Kong
Marcos Zampieri

Sarcasm is a rhetorical device that is used to convey the opposite of the literal meaning of an utterance. Sarcasm is widely used on social media and other forms of computer-mediated communication, motivating the use of computational models to identify it automatically. While the clear majority of approaches to sarcasm detection have been carried out on text only, sarcasm detection often requires additional information present in tonality, facial expression, and contextual images. This has led to the introduction of multimodal models, opening the possibility of detecting sarcasm in multiple modalities such as audio, images, text, and video. In this paper, we present the first comprehensive survey on multimodal sarcasm detection - henceforth MSD - to date. We survey papers published between 2018 and 2023 on the topic, and discuss the models and datasets used for this task. We also present future research directions in MSD.

AAAI Conference 2022 Conference Paper

A Dynamic Meta-Learning Model for Time-Sensitive Cold-Start Recommendations

Krishna Prasad Neupane
Ervine Zheng
Yu Kong
Qi Yu

We present a novel dynamic recommendation model that focuses on users who interacted in the past but have recently become relatively inactive. Making effective recommendations to these time-sensitive cold-start users is critical to maintaining the user base of a recommender system. Because their recent interactions are sparse, it is challenging to capture these users’ current preferences precisely. Relying solely on their historical interactions may also lead to outdated recommendations misaligned with their recent interests. The proposed model leverages historical and current user-item interactions and dynamically factorizes a user’s (latent) preference into time-specific and time-evolving representations that jointly affect user behaviors. These latent factors further interact with an optimized item embedding to achieve accurate and timely recommendations. Experiments on real-world data help demonstrate the effectiveness of the proposed time-sensitive cold-start recommendation model.

IJCAI Conference 2020 Conference Paper

Few-shot Human Motion Prediction via Learning Novel Motion Dynamics

Chuanqi Zang
Mingtao Pei
Yu Kong

Human motion prediction is a task where we anticipate future motion based on past observation. Previous approaches rely on access to large datasets of skeleton data, and thus are difficult to generalize to novel motion dynamics with limited training data. In our work, we propose a novel approach named Motion Prediction Network (MoPredNet) for few-shot human motion prediction. MoPredNet can be adapted to predicting new motion dynamics using limited data, and it elegantly captures long-term dependency in motion dynamics. Specifically, MoPredNet dynamically selects the most informative poses in the streaming motion data as masked poses. In addition, MoPredNet improves its encoding capability of motion dynamics by adaptively learning spatio-temporal structure from the observed poses and masked poses. We also propose to adapt MoPredNet to novel motion dynamics based on accumulated motion experiences and limited novel motion dynamics data. Experimental results show that our method achieves better performance than state-of-the-art methods in motion prediction.

AAAI Conference 2018 Conference Paper

Action Prediction From Videos via Memorizing Hard-to-Predict Samples

Yu Kong
Shangqian Gao
Bin Sun
Yun Fu

Action prediction based on video is an important problem in the computer vision field with many applications, such as preventing accidents and criminal activities. It is challenging to predict actions at the early stage because of the large variations between early observed videos and complete ones. In addition, intra-class variations confuse the predictors. In this paper, we propose a mem-LSTM model to predict actions in the early stage, in which a memory module is introduced to record several "hard-to-predict" samples and a variety of early observations. Our method uses a Convolutional Neural Network (CNN) and Long Short-Term Memory (LSTM) to model partially observed video input. We augment the LSTM with a memory module to remember challenging video instances. With the memory module, our mem-LSTM model not only achieves impressive performance in the early stage but also makes predictions without prior knowledge of the observation ratio. Information in future frames is also utilized through a bi-directional LSTM layer. Experiments on the UCF-101 and Sports-1M datasets show that our method outperforms state-of-the-art methods.

IJCAI Conference 2017 Conference Paper

Multi-Stream Deep Similarity Learning Networks for Visual Tracking

Kunpeng Li
Yu Kong
Yun Fu

Visual tracking has achieved remarkable success in recent decades, but it remains a challenging problem due to appearance variations over time and complex cluttered backgrounds. In this paper, we adopt a tracking-by-verification scheme to overcome these challenges by determining the patch in the subsequent frame that is most similar to the target template and distinct from the background context. A multi-stream deep similarity learning network is proposed to learn the similarity comparison model. The loss function of our network encourages the distance between a positive patch in the search region and the target template to be smaller than that between the positive patch and the background patches. Within the learned feature space, even if the distance between positive patches becomes large because of appearance change or interference from background clutter, our method can use the relative distance to distinguish the target robustly. In addition, the learned model is used directly for tracking with no need for model updating or parameter fine-tuning, and it can run at 45 fps on a single GPU. Our tracker achieves state-of-the-art performance on the visual tracking benchmark compared with other recent real-time-speed trackers, and shows better capability in handling background clutter, occlusion and appearance change.
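
The relative-distance objective described in this abstract can be illustrated with a generic margin-based hinge loss. This is only a minimal sketch under stated assumptions, not the authors' network or exact loss: the function names, the Euclidean metric, the `margin` parameter and the averaging over background patches are all hypothetical choices made for illustration.

```python
import math


def euclidean(a, b):
    """Euclidean distance between two equal-length feature vectors."""
    return math.sqrt(sum((x - y) ** 2 for x, y in zip(a, b)))


def relative_distance_loss(template, positive, backgrounds, margin=1.0):
    """Mean hinge loss over background patches (illustrative assumption).

    Each term drops to zero once the positive patch is closer to the
    template than to that background patch by at least `margin`.
    """
    d_pos = euclidean(positive, template)
    terms = [
        max(0.0, d_pos - euclidean(positive, bg) + margin)
        for bg in backgrounds
    ]
    return sum(terms) / len(terms)


# Toy 2-D features: one distant background patch, one close to the positive.
loss = relative_distance_loss(
    template=[0.0, 0.0],
    positive=[0.1, 0.0],
    backgrounds=[[3.0, 4.0], [0.5, 0.0]],
)
print(loss)
```

Only the nearby background patch contributes to the loss here, which matches the idea that the relative distance, rather than the absolute one, separates the target from clutter.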

AAAI Conference 2017 Conference Paper

Sparse Subspace Clustering by Learning Approximation ℓ0 Codes

Jun Li
Yu Kong
Yun Fu

Subspace clustering has been widely applied to detect meaningful clusters in high-dimensional data spaces. A main challenge in subspace clustering is to quickly calculate a “good” affinity matrix. ℓ0, ℓ1, ℓ2 or nuclear norm regularization is used to construct the affinity matrix in many subspace clustering methods because of their theoretical guarantees and empirical success. However, they suffer from the following problems: (1) ℓ2 and nuclear norm regularization require very strong assumptions to guarantee a subspace-preserving affinity; (2) although ℓ1 regularization can be guaranteed to give a subspace-preserving affinity under certain conditions, it needs more time to solve a large-scale convex optimization problem; (3) ℓ0 regularization can yield a tradeoff between computationally efficient and subspace-preserving affinity by using the orthogonal matching pursuit (OMP) algorithm, but this still takes more time to search the solution in OMP when the number of data points is large. In order to overcome these problems, we first propose a learned OMP (LOMP) algorithm to learn a single hidden neural network (SHNN) to fast approximate the ℓ0 code. We then exploit a sparse subspace clustering method based on ℓ0 code which is fast computed by SHNN. Two sufficient conditions are presented to guarantee that our method can give a subspace-preserving affinity. Experiments on handwritten digit and face clustering show that our method not only quickly computes the ℓ0 code, but also outperforms the relevant subspace clustering methods in clustering results. In particular, our method achieves the state-of-the-art clustering accuracy (94.32%) on MNIST.
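
For readers unfamiliar with the orthogonal matching pursuit (OMP) step that this abstract builds on, the following is a minimal sketch of plain, textbook OMP in pure Python. It is not the learned LOMP/SHNN approximation proposed in the paper. The helper names (`dot`, `solve`, `omp`), the list-of-atoms dictionary layout and the unit-norm assumption are illustrative choices.

```python
def dot(u, v):
    """Inner product of two equal-length vectors."""
    return sum(a * b for a, b in zip(u, v))


def solve(A, b):
    """Solve a small square system A x = b by Gaussian elimination
    with partial pivoting."""
    n = len(b)
    M = [row[:] + [b[i]] for i, row in enumerate(A)]
    for col in range(n):
        pivot = max(range(col, n), key=lambda r: abs(M[r][col]))
        M[col], M[pivot] = M[pivot], M[col]
        for r in range(col + 1, n):
            f = M[r][col] / M[col][col]
            for c in range(col, n + 1):
                M[r][c] -= f * M[col][c]
    x = [0.0] * n
    for i in reversed(range(n)):
        tail = sum(M[i][j] * x[j] for j in range(i + 1, n))
        x[i] = (M[i][n] - tail) / M[i][i]
    return x


def omp(atoms, y, sparsity):
    """Greedy sparse coding of y with at most `sparsity` non-zeros.

    atoms: list of dictionary vectors (assumed unit-norm); y: target vector.
    """
    residual, support, coef = y[:], [], []
    for _ in range(sparsity):
        # Select the unused atom most correlated with the residual.
        scores = [0.0 if j in support else abs(dot(a, residual))
                  for j, a in enumerate(atoms)]
        best = max(range(len(atoms)), key=scores.__getitem__)
        if scores[best] < 1e-12:
            break
        support.append(best)
        # Least-squares refit on the selected atoms (normal equations).
        gram = [[dot(atoms[i], atoms[j]) for j in support] for i in support]
        rhs = [dot(atoms[i], y) for i in support]
        coef = solve(gram, rhs)
        residual = [yi - sum(c * atoms[j][k] for c, j in zip(coef, support))
                    for k, yi in enumerate(y)]
    code = [0.0] * len(atoms)
    for c, j in zip(coef, support):
        code[j] = c
    return code


atoms = [[1.0, 0.0, 0.0], [0.0, 1.0, 0.0], [0.0, 0.0, 1.0]]
print(omp(atoms, [0.0, 2.0, -0.5], sparsity=1))
```

In sparse subspace clustering the dictionary would typically be the other data points, and the magnitudes of the resulting codes are used to build the affinity matrix. The per-point greedy search above is the cost the abstract says grows with the number of data points.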
