Arrow Research search

Author name cluster

Roger Zimmermann

Possible papers associated with this exact author name in Arrow. This page groups case-insensitive exact name matches and is not a full identity disambiguation profile.

24 papers
2 author rows

Possible papers

24

TMLR Journal 2026 Journal Article

Make Your LVLM KV Cache More Lightweight

  • Xihao Chen
  • Yangyang Guo
  • Roger Zimmermann

Key-Value (KV) cache has become a de facto component of modern Large Vision-Language Models (LVLMs) for inference. While it enhances decoding efficiency in Large Language Models (LLMs), its direct adoption in LVLMs introduces substantial GPU memory overhead due to the large number of vision tokens processed during the prefill stage. To tackle this problem, we propose LightKV, a novel approach that reduces KV cache size by exploiting the redundancy among vision-token embeddings. Guided by text prompts, LightKV employs cross-modality message passing to aggregate informative messages across vision tokens and progressively compress them during prefill. This prompt-aware guidance distinguishes our method from prior vision-only compression strategies. We evaluate LightKV on eight open-source LVLMs across eight public benchmark datasets, e.g., MME and SeedBench. Experimental results demonstrate that with only 55% of the original vision tokens, LightKV (a) halves the vision-token KV cache size, (b) reduces computation by up to 40%, and (c) preserves general-purpose performance while significantly outperforming existing baselines.
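
To make the prompt-aware idea concrete, here is a minimal PyTorch sketch of text-guided KV-cache reduction. It is not the authors' LightKV implementation: it simply ranks cached vision-token entries by the attention they receive from text-prompt queries and keeps a fixed fraction, whereas the paper describes progressive message-passing compression during prefill. All names (`prune_vision_kv`, `keep_ratio`) are illustrative.

```python
import torch

def prune_vision_kv(v_keys, v_vals, text_queries, keep_ratio=0.55):
    """Toy prompt-guided KV-cache pruning: rank vision-token cache entries
    by their average attention from text-prompt queries, keep the top share.

    v_keys, v_vals: (n_vis, d) cached keys/values for vision tokens
    text_queries:   (n_txt, d) query vectors from the text prompt
    """
    d = v_keys.shape[-1]
    # Cross-modal relevance: text queries attending over vision keys.
    attn = torch.softmax(text_queries @ v_keys.T / d**0.5, dim=-1)
    scores = attn.mean(dim=0)                   # (n_vis,) per-token relevance
    k = max(1, int(keep_ratio * v_keys.shape[0]))
    idx = scores.topk(k).indices.sort().values  # keep tokens in original order
    return v_keys[idx], v_vals[idx]

# Usage: shrink a 576-token vision KV cache guided by an 8-token prompt.
vk, vv = torch.randn(576, 64), torch.randn(576, 64)
tq = torch.randn(8, 64)
pk, pv = prune_vision_kv(vk, vv, tq)
print(pk.shape)  # torch.Size([316, 64])
```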

AAAI Conference 2025 Conference Paper

Few-Shot Incremental Learning via Foreground Aggregation and Knowledge Transfer for Audio-Visual Semantic Segmentation

  • Jingqiao Xiu
  • Mengze Li
  • Zongxin Yang
  • Wei Ji
  • Yifang Yin
  • Roger Zimmermann

Audio-Visual Semantic Segmentation (AVSS) has gained significant attention in the multi-modal domain, aiming to segment video objects that produce specific sounds in the corresponding audio. Despite notable progress, existing methods still struggle to handle new classes not included in the original training set. To this end, we introduce Few-Shot Incremental Learning (FSIL) to the AVSS task, which seeks to seamlessly integrate new classes with limited incremental samples while preserving the knowledge of old classes. Two challenges arise in this new setting: (1) To reduce labeling costs, old classes within the incremental samples are treated as background, similar to silent objects. Training the model directly with background annotations may worsen the loss of distinctive knowledge about old classes, such as their outlines and sounds. (2) Most existing models adopt early cross-modal fusion with a single-tower design, incorporating more characteristics into class representations, which impedes knowledge transfer between classes based on similarity. To address these issues, we propose a Few-shot Incremental learning framework via class-centric foregrouNd aggreGation and dual-tower knowlEdge tRansfer (FINGER) for the AVSS task, which comprises two targeted modules: (1) The class-centric foreground aggregation gathers class-specific features for each foreground class while disregarding background features. The background class is excluded during training and inferred from the foreground predictions. (2) The dual-tower knowledge transfer postpones cross-modal fusion to separately conduct knowledge transfer for each modality. Extensive experiments validate the effectiveness of the FINGER model, significantly surpassing state-of-the-art methods.

ICLR Conference 2025 Conference Paper

Generalized Video Moment Retrieval

  • You Qin
  • Qilong Wu
  • Yicong Li 0004
  • Wei Ji 0008
  • Li Li 0091
  • Pengcheng Cai
  • Lina Wei
  • Roger Zimmermann

In this paper, we introduce the Generalized Video Moment Retrieval (GVMR) framework, which extends traditional Video Moment Retrieval (VMR) to handle a wider range of query types. Unlike conventional VMR systems, which are often limited to simple, single-target queries, GVMR accommodates both non-target and multi-target queries. To support this expanded task, we present the NExT-VMR dataset, derived from the YFCC100M collection, featuring diverse query scenarios to enable more robust model evaluation. Additionally, we propose BCANet, a transformer-based model incorporating the novel Boundary-aware Cross Attention (BCA) module. The BCA module enhances boundary detection and uses cross-attention to achieve a comprehensive understanding of video content in relation to queries. BCANet accurately predicts temporal video segments based on natural language descriptions, outperforming traditional models in both accuracy and adaptability. Our results demonstrate the potential of the GVMR framework, the NExT-VMR dataset, and BCANet to advance VMR systems, setting a new standard for future multimedia information retrieval research.
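
As a rough illustration of boundary prediction with cross attention (not the paper's BCA module), the sketch below lets frame features attend to query tokens and emits per-frame start/end logits; per-frame logits can encode zero, one, or several moments, which is what the generalized setting requires. Module and parameter names are hypothetical.

```python
import torch
import torch.nn as nn

class CrossAttnMomentHead(nn.Module):
    """Toy moment-retrieval head: text-conditioned cross attention over
    frame features, then per-frame start/end boundary logits."""
    def __init__(self, d=256, heads=4):
        super().__init__()
        self.cross = nn.MultiheadAttention(d, heads, batch_first=True)
        self.boundary = nn.Linear(d, 2)   # start / end logits per frame

    def forward(self, frames, query):
        # frames: (B, T, d) video features; query: (B, L, d) text tokens
        fused, _ = self.cross(frames, query, query)  # frames attend to text
        logits = self.boundary(fused)                # (B, T, 2)
        return logits[..., 0], logits[..., 1]

frames, query = torch.randn(2, 64, 256), torch.randn(2, 12, 256)
start, end = CrossAttnMomentHead()(frames, query)
print(start.shape, end.shape)  # torch.Size([2, 64]) torch.Size([2, 64])
```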

ICML Conference 2025 Conference Paper

Moirai-MoE: Empowering Time Series Foundation Models with Sparse Mixture of Experts

  • Xu Liu 0014
  • Juncheng Liu
  • Gerald Woo
  • Taha Aksu
  • Yuxuan Liang 0002
  • Roger Zimmermann
  • Chenghao Liu
  • Junnan Li 0001

Achieving effective unified pretraining on large time series corpora remains an open challenge in developing time series foundation models. Existing methods, such as Moirai, introduce multiple projection layers for time series of different frequencies to account for high data heterogeneity. We identify major drawbacks to this human-imposed frequency-level model specialization. First, frequency is not a reliable indicator for grouping pretraining data. Second, time series can display varied distributions even within a short window. Frequency-level specialization overlooks the diversity at this granularity. To address these issues, this paper introduces Moirai-MoE, excluding human-defined data groupings while delegating the modeling of diverse time series patterns to the sparse mixture of experts (MoE) within Transformers. With this design, Moirai-MoE eliminates reliance on heuristics and enables automatic token-level specialization. Extensive evaluations on 39 datasets demonstrate the superiority of Moirai-MoE over state-of-the-art foundation models. This study also conducts comprehensive model analyses to explore the inner workings of time series MoE foundation models.
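
The token-level specialization rests on a standard sparse mixture-of-experts feed-forward layer. The self-contained PyTorch sketch below (illustrative names, not the Moirai-MoE code) routes each token to its top-k experts, so specialization is learned per token rather than imposed by frequency groupings.

```python
import torch
import torch.nn as nn

class SparseMoE(nn.Module):
    """Token-level top-k mixture of experts: a learned router sends each
    token to k experts, replacing any fixed per-frequency projection."""
    def __init__(self, d=128, n_experts=8, k=2):
        super().__init__()
        self.router = nn.Linear(d, n_experts)
        self.experts = nn.ModuleList(
            nn.Sequential(nn.Linear(d, 4 * d), nn.GELU(), nn.Linear(4 * d, d))
            for _ in range(n_experts)
        )
        self.k = k

    def forward(self, x):                        # x: (n_tokens, d)
        gates = self.router(x)                   # (n_tokens, n_experts)
        w, idx = torch.topk(gates, self.k, dim=-1)
        w = torch.softmax(w, dim=-1)             # weights over chosen experts
        out = torch.zeros_like(x)
        for slot in range(self.k):
            for e, expert in enumerate(self.experts):
                sel = idx[:, slot] == e          # tokens routed to expert e
                if sel.any():
                    out[sel] += w[sel, slot].unsqueeze(-1) * expert(x[sel])
        return out

x = torch.randn(32, 128)
print(SparseMoE()(x).shape)  # torch.Size([32, 128])
```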

NeurIPS Conference 2025 Conference Paper

Self-Perturbed Anomaly-Aware Graph Dynamics for Multivariate Time-Series Anomaly Detection

  • Jinyu Cai
  • Yuan Xie
  • Glynnis Lim
  • Yifang Yin
  • Roger Zimmermann
  • See-Kiong Ng

Detecting anomalies in multivariate time-series data is an essential task across various domains, yet there are unresolved challenges such as (1) severe class imbalance between normal and anomalous data due to rare anomaly availability in the real world; (2) limited adaptability of the static graph-based methods to dynamically changing inter-variable correlations; and (3) neglect of subtle anomalies due to overfitting to normal patterns in reconstruction-based methods. To tackle these issues, we propose Self-Perturbed Anomaly-Aware Graph Dynamics (SPAGD), a framework for time-series anomaly detection. SPAGD employs a self-perturbation module that generates self-perturbed time series from the reconstruction process of normal ones, which provide auxiliary signals to alleviate class imbalance during training. Concurrently, an anomaly-aware graph construction module is proposed to dynamically adjust the graph structure by leveraging the reconstruction residuals of self-perturbed time series, thereby emphasizing the inter-variable disruptions induced by anomalous candidates. A unified spatio-temporal anomaly detection module then integrates both spatial and temporal convolutions to train a classifier that distinguishes normal time series from the auxiliary self-perturbed samples. Extensive experiments across multiple benchmark datasets demonstrate the effectiveness of SPAGD compared to state-of-the-art baselines.
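
One plausible reading of the self-perturbation module (a loose sketch, not the paper's design) is to amplify an autoencoder's reconstruction residual on normal windows, yielding pseudo-anomalous auxiliary samples for the classifier; `alpha` and all module names are hypothetical.

```python
import torch
import torch.nn as nn

class SelfPerturb(nn.Module):
    """Toy self-perturbation: reconstruct normal windows with an autoencoder,
    then push each window along its reconstruction residual to create
    auxiliary pseudo-anomalies that offset class imbalance."""
    def __init__(self, n_vars=8, d=32, alpha=2.0):
        super().__init__()
        self.enc = nn.Linear(n_vars, d)
        self.dec = nn.Linear(d, n_vars)
        self.alpha = alpha

    def forward(self, x):                       # x: (B, T, n_vars), normal data
        recon = self.dec(torch.tanh(self.enc(x)))
        residual = x - recon                    # where reconstruction struggles
        perturbed = x + self.alpha * residual   # amplified into a pseudo-anomaly
        return recon, perturbed

x = torch.randn(4, 100, 8)
recon, aux = SelfPerturb()(x)
print(aux.shape)  # torch.Size([4, 100, 8])
```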

AAAI Conference 2025 Conference Paper

Through the Dual-Prism: A Spectral Perspective on Graph Data Augmentation for Graph Classifications

  • Yutong Xia
  • Runpeng Yu
  • Yuxuan Liang
  • Xavier Bresson
  • Xinchao Wang
  • Roger Zimmermann

Graph Neural Networks (GNNs) have become the preferred tool to process graph data, with their efficacy being boosted through graph data augmentation techniques. Despite the evolution of augmentation methods, issues like graph property distortions and restricted structural changes persist. This leads to the question: Is it possible to develop more property-conserving and structure-sensitive augmentation methods? Through a spectral lens, we investigate the interplay between graph properties, their augmentation, and their spectral behavior, and find that keeping the low-frequency eigenvalues unchanged can preserve the critical properties at a large scale when generating augmented graphs. These observations inform our introduction of the Dual-Prism (DP) augmentation method, comprising DP-Noise and DP-Mask, which adeptly retains essential graph properties while diversifying augmented graphs. Extensive experiments validate the efficiency of our approach, providing a new and promising direction for graph data augmentation.
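
The low-frequency-preserving idea can be shown directly. The NumPy sketch below is a toy analogue of DP-Noise, assuming the spectrum in question is that of the graph Laplacian: it eigendecomposes the Laplacian, perturbs only the highest-frequency eigenvalues, and reconstructs a weighted augmented graph. `keep_low` and `sigma` are illustrative parameters.

```python
import numpy as np

def dp_noise(adj, keep_low=0.7, sigma=0.1, seed=0):
    """Toy spectral augmentation: keep the low-frequency Laplacian
    eigenvalues fixed and add Gaussian noise to the high-frequency ones.

    adj: symmetric (n, n) adjacency matrix; returns a weighted graph.
    """
    rng = np.random.default_rng(seed)
    lap = np.diag(adj.sum(axis=1)) - adj          # graph Laplacian L = D - A
    vals, vecs = np.linalg.eigh(lap)              # eigenvalues in ascending order
    cut = int(keep_low * len(vals))               # low frequencies stay untouched
    vals = vals.copy()
    vals[cut:] += sigma * rng.standard_normal(len(vals) - cut)
    lap_aug = vecs @ np.diag(vals) @ vecs.T       # reconstruct the Laplacian
    return np.diag(np.diag(lap_aug)) - lap_aug    # back to (weighted) adjacency

adj = np.array([[0, 1, 1, 0],
                [1, 0, 1, 0],
                [1, 1, 0, 1],
                [0, 0, 1, 0]], dtype=float)
print(np.round(dp_noise(adj), 3))
```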

ICLR Conference 2024 Conference Paper

Graph Lottery Ticket Automated

  • Guibin Zhang
  • Kun Wang 0056
  • Wei Huang 0034
  • Yanwei Yue
  • Yang Wang 0015
  • Roger Zimmermann
  • Aojun Zhou
  • Dawei Cheng

Graph Neural Networks (GNNs) have emerged as the leading deep learning models for graph-based representation learning. However, the training and inference of GNNs on large graphs remain resource-intensive, impeding their utility in real-world scenarios and curtailing their applicability in deeper and more sophisticated GNN architectures. To address this issue, the Graph Lottery Ticket (GLT) hypothesis assumes that a GNN with random initialization harbors a pair of core subgraph and sparse subnetwork that can yield performance comparable to, and efficiency higher than, the original dense network and complete graph. Although GLT offers a new paradigm for GNN training and inference, existing GLT algorithms rely heavily on trial-and-error pruning-rate tuning and scheduling, and adhere to an irreversible pruning paradigm that lacks elasticity. Worse still, current methods suffer from scalability issues when applied to deep GNNs, as they maintain the same topology across all layers. These challenges hinder the integration of GLT into deeper and larger-scale GNN contexts. To bridge this critical gap, this paper introduces an Adaptive, Dynamic, and Automated framework for identifying Graph Lottery Tickets (AdaGLT). Our proposed method derives its key advantages and addresses the above limitations through three aspects: 1) it tailors layer-adaptive sparse structures for various datasets and GNNs, endowing it with the capability to facilitate deeper GNNs; 2) it integrates the pruning and training processes, thereby achieving a dynamic workflow encompassing both pruning and restoration; 3) it automatically captures graph lottery tickets across diverse sparsity levels, obviating the need for extensive pruning-parameter tuning. More importantly, we provide rigorous theoretical proofs that AdaGLT mitigates over-smoothing and obtains improved sparse structures in deep GNN scenarios. Extensive experiments demonstrate that AdaGLT outperforms state-of-the-art competitors across multiple graph datasets of various scales and types, particularly in scenarios involving deep GNNs.
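
As a loose illustration of reversible, trainable sparsification (emphatically not the AdaGLT algorithm itself), the sketch below keeps edge scores and a per-layer threshold as trainable parameters, so an edge suppressed early in training can later recover; all names are hypothetical.

```python
import torch
import torch.nn as nn

class ReversibleEdgeMask(nn.Module):
    """Toy layer-wise edge mask: scores and a threshold are trained jointly
    with the GNN, so a 'pruned' edge can be restored later instead of
    being removed irreversibly."""
    def __init__(self, num_edges, temperature=0.1):
        super().__init__()
        self.scores = nn.Parameter(torch.zeros(num_edges))
        self.threshold = nn.Parameter(torch.tensor(0.0))
        self.temperature = temperature

    def forward(self, edge_weight):              # edge_weight: (num_edges,)
        mask = torch.sigmoid((self.scores - self.threshold) / self.temperature)
        return edge_weight * mask                # soft, differentiable sparsity

mask = ReversibleEdgeMask(num_edges=10)
print(mask(torch.ones(10)))  # all ~0.5 before any training
```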

AAAI Conference 2024 Conference Paper

Panoptic Scene Graph Generation with Semantics-Prototype Learning

  • Li Li
  • Wei Ji
  • Yiming Wu
  • Mengze Li
  • You Qin
  • Lina Wei
  • Roger Zimmermann

Panoptic Scene Graph Generation (PSG) parses objects and predicts their relationships (predicates) to connect human language and visual scenes. However, different language preferences of annotators and semantic overlaps between predicates lead to biased predicate annotations in the dataset, i.e., different predicates for the same object pairs. Biased predicate annotations make it hard for PSG models to construct a clear decision plane among predicates, which greatly hinders their real-world application. To address this intrinsic bias, we propose a novel framework named ADTrans to adaptively transfer biased predicate annotations into informative and unified ones. To ensure consistency and accuracy during the transfer process, we propose to observe the invariance degree of representations in each predicate class, and to learn unbiased prototypes of predicates with different intensities. Meanwhile, we continuously measure the distribution changes between each representation and its prototype, and constantly screen potentially biased data. Finally, with the unbiased predicate-prototype representation embedding space, biased annotations are easily identified. Experiments show that ADTrans significantly improves the performance of benchmark models, achieves new state-of-the-art performance, and shows great generalization and effectiveness on multiple datasets. Our code is released at https://github.com/lili0415/PSG-biased-annotation.
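
A minimal sketch of the prototype step, under the simplifying assumption that a class prototype is the mean of its representations and that screening flags samples far from their own prototype (the paper's invariance-degree weighting is omitted; names are illustrative):

```python
import torch

def predicate_prototypes(feats, labels, n_classes):
    """Toy prototype learning and bias screening: each predicate prototype is
    the mean of its class representations, and samples unusually far from
    their own prototype are flagged as potentially biased annotations."""
    protos = torch.zeros(n_classes, feats.shape[-1])
    for c in range(n_classes):
        protos[c] = feats[labels == c].mean(dim=0)
    dist = (feats - protos[labels]).norm(dim=-1)  # sample-to-prototype gap
    suspect = dist > dist.mean() + dist.std()     # crude screening rule
    return protos, suspect

feats = torch.randn(200, 64)                      # predicate representations
labels = torch.randint(0, 10, (200,))             # 10 predicate classes
protos, suspect = predicate_prototypes(feats, labels, 10)
print(protos.shape, int(suspect.sum()))
```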

IJCAI Conference 2024 Conference Paper

Predicting Carpark Availability in Singapore with Cross-Domain Data: A New Dataset and A Data-Driven Approach

  • Huaiwu Zhang
  • Yutong Xia
  • Siru Zhong
  • Kun Wang
  • Zekun Tong
  • Qingsong Wen
  • Roger Zimmermann
  • Yuxuan Liang

The increasing number of vehicles highlights the need for efficient parking space management. Predicting real-time Parking Availability (PA) can help mitigate traffic congestion and the corresponding social problems, a pressing issue in densely populated cities like Singapore. In this study, we aim to collectively predict future PA across Singapore with complex factors from various domains. The contributions of this paper are as follows: (1) A New Dataset: We introduce the SINPA dataset, containing a year's worth of PA data from 1,687 parking lots in Singapore, enriched with various spatial and temporal factors. (2) A Data-Driven Approach: We present DeepPA, a novel deep-learning framework, to collectively and efficiently predict future PA across thousands of parking lots. (3) Extensive Experiments and Deployment: DeepPA demonstrates a 9.2% reduction in prediction error for up to 3-hour forecasts compared to existing advanced models. Furthermore, we implement DeepPA in a practical web-based platform that provides real-time PA predictions to aid drivers and inform urban planning in Singapore. We release the dataset and source code at https://github.com/yoshall/SINPA.

AAAI Conference 2024 Conference Paper

SOGDet: Semantic-Occupancy Guided Multi-View 3D Object Detection

  • Qiu Zhou
  • Jinming Cao
  • Hanchao Leng
  • Yifang Yin
  • Yu Kun
  • Roger Zimmermann

In the field of autonomous driving, accurate and comprehensive perception of the 3D environment is crucial. Bird's Eye View (BEV) based methods have emerged as a promising solution for 3D object detection using multi-view images as input. However, existing 3D object detection methods often ignore the physical context in the environment, such as sidewalks and vegetation, resulting in sub-optimal performance. In this paper, we propose a novel approach called SOGDet (Semantic-Occupancy Guided Multi-view 3D Object Detection) that leverages a 3D semantic-occupancy branch to improve the accuracy of 3D object detection. In particular, the physical context modeled by semantic occupancy helps the detector to perceive the scenes in a more holistic view. Our SOGDet is flexible to use and can be seamlessly integrated with most existing BEV-based methods. To evaluate its effectiveness, we apply this approach to several state-of-the-art baselines and conduct extensive experiments on the exclusive nuScenes dataset. Our results show that SOGDet consistently enhances the performance of three baseline methods in terms of nuScenes Detection Score (NDS) and mean Average Precision (mAP). This indicates that combining 3D object detection with 3D semantic occupancy leads to a more comprehensive perception of the 3D environment, thereby helping to build more robust autonomous driving systems. The code is available at: https://github.com/zhouqiu/SOGDet.

ECAI Conference 2024 Conference Paper

Unveiling Learner Dynamics: The ECLIPSE Dataset and NeuralGaze Framework for Prolonged Engagement Assessment in Online Learning

  • Avinash Anand
  • Avni Mittal
  • Laavanaya Dhawan
  • Mahisha Ramesh
  • Juhi Krishnamurthy
  • Naman Lal
  • Raj Jaiswal
  • Pijush Bhuyan

Understanding student engagement in online education is crucial for optimizing learning outcomes. This paper introduces the ECLIPSE dataset (Extended Classroom Learning Insights via Prolonged Student Engagement), comprising 10,110 annotated images from 55-minute, 30-minute, and 20-minute online lectures. Annotations cover four affective states: engagement, boredom, confusion, and frustration. ECLIPSE enables the investigation of learner attention dynamics over extended periods, overcoming the limitations of short-duration datasets. We establish benchmarks for ECLIPSE using models such as EfficientNet, Vision Transformer, Residual Attention Network, and GLAMOR-Net. We propose NeuralGaze, a novel framework integrating Neural Cellular Automata (NCA) with self-attention mechanisms, demonstrating superior accuracy in engagement-level assessment compared to basic single-frame models. Furthermore, we introduce CG-SwT, a content-guided Swin Transformer model, which significantly outperforms the baseline ViT model on the ECLIPSE dataset (with F1-score improvements of 21.12%, 12.5%, 16.77%, and 15.41% for engagement, boredom, frustration, and confusion, respectively). Our methods surpass existing single-frame engagement prediction baselines on both the EngageNet and DAiSEE datasets by significant margins (7.4% and 6.2%, respectively). The code and dataset will be made publicly available.

AAAI Conference 2023 Conference Paper

AirFormer: Predicting Nationwide Air Quality in China with Transformers

  • Yuxuan Liang
  • Yutong Xia
  • Songyu Ke
  • Yiwei Wang
  • Qingsong Wen
  • Junbo Zhang
  • Yu Zheng
  • Roger Zimmermann

Air pollution is a crucial issue affecting human health and livelihoods, as well as one of the barriers to economic growth. Forecasting air quality has become an increasingly important endeavor with significant social impacts, especially in emerging countries. In this paper, we present a novel Transformer termed AirFormer to predict nationwide air quality in China, with an unprecedented fine spatial granularity covering thousands of locations. AirFormer decouples the learning process into two stages: 1) a bottom-up deterministic stage that contains two new types of self-attention mechanisms to efficiently learn spatio-temporal representations; 2) a top-down stochastic stage with latent variables to capture the intrinsic uncertainty of air quality data. We evaluate AirFormer with four years of data from 1,085 stations in the Chinese mainland. Compared to prior models, AirFormer reduces prediction errors by 5%∼8% on 72-hour future predictions. Our source code is available at https://github.com/yoshall/airformer.

NeurIPS Conference 2023 Conference Paper

Deciphering Spatio-Temporal Graph Forecasting: A Causal Lens and Treatment

  • Yutong Xia
  • Yuxuan Liang
  • Haomin Wen
  • Xu Liu
  • Kun Wang
  • Zhengyang Zhou
  • Roger Zimmermann

Spatio-Temporal Graph (STG) forecasting is a fundamental task in many real-world applications. Spatio-Temporal Graph Neural Networks have emerged as the most popular method for STG forecasting, but they often struggle with temporal out-of-distribution (OoD) issues and dynamic spatial causation. In this paper, we propose a novel framework called CaST to tackle these two challenges via causal treatments. Concretely, leveraging a causal lens, we first build a structural causal model to decipher the data generation process of STGs. To handle the temporal OoD issue, we employ back-door adjustment via a novel disentanglement block to separate the temporal environments from the input data. Moreover, we utilize front-door adjustment and adopt edge-level convolution to model the ripple effect of causation. Experimental results on three real-world datasets demonstrate the effectiveness of CaST, which consistently outperforms existing methods with good interpretability. Our source code is available at https://github.com/yutong-xia/CaST.

NeurIPS Conference 2023 Conference Paper

LargeST: A Benchmark Dataset for Large-Scale Traffic Forecasting

  • Xu Liu
  • Yutong Xia
  • Yuxuan Liang
  • Junfeng Hu
  • Yiwei Wang
  • Lei Bai
  • Chao Huang
  • Zhenguang Liu

Road traffic forecasting plays a critical role in smart city initiatives and has experienced significant advancements thanks to the power of deep learning in capturing non-linear patterns of traffic data. However, the promising results achieved on current public datasets may not be applicable to practical scenarios due to limitations within these datasets. First, their limited sizes may not reflect the real-world scale of traffic networks. Second, the temporal coverage of these datasets is typically short, posing hurdles in studying long-term patterns and acquiring sufficient samples for training deep models. Third, these datasets often lack adequate metadata for sensors, which compromises the reliability and interpretability of the data. To mitigate these limitations, we introduce the LargeST benchmark dataset. It encompasses a total of 8,600 sensors in California with a 5-year time coverage and includes comprehensive metadata. Using LargeST, we perform in-depth data analysis to extract data insights, benchmark well-known baselines in terms of their performance and efficiency, and identify challenges as well as opportunities for future research. We release the datasets and baseline implementations at: https://github.com/liuxu77/LargeST.

AAAI Conference 2021 Conference Paper

A Spatial Regulated Patch-Wise Approach for Cervical Dysplasia Diagnosis

  • Ying Zhang
  • Yifang Yin
  • Zhenguang Liu
  • Roger Zimmermann

Cervical dysplasia diagnosis via visual investigation is a challenging problem. Recent approaches use deep learning techniques to extract features and require downsampling high-resolution cervical screening images to smaller sizes for training. Such a reduction may result in the loss of visual details that appear weakly and locally within a cervical image. To overcome this challenge, our work divides an image into patches and then represents it using patch features. We aggregate patch patterns into an image feature in a weighted manner by considering the patch–image relationship. The weights are visualized as a heatmap to explain where the diagnosis results come from. We further introduce a spatial regulator to guide the classifier to focus on the cervix region and to adjust the weight distribution, without requiring any manual annotations of the cervix region. A novel iterative algorithm is designed to refine the regulator, which is able to capture the variations in cervix center locations and shapes. Experiments on an 18-year real-world dataset indicate a minimum of 3.47%, 4.59%, and 8.54% improvements over the state-of-the-art in accuracy, F1, and recall measures, respectively.
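
The weighted patch-to-image aggregation can be sketched as simple attention pooling; this is a generic reconstruction rather than the authors' model, and it omits the spatial regulator entirely. Names are illustrative.

```python
import torch
import torch.nn as nn

class PatchAttnPool(nn.Module):
    """Toy patch aggregation: score each patch feature, combine patches with
    softmax weights; the weights double as an explanatory heatmap."""
    def __init__(self, d=512):
        super().__init__()
        self.score = nn.Linear(d, 1)

    def forward(self, patches):                   # patches: (B, n_patches, d)
        w = torch.softmax(self.score(patches).squeeze(-1), dim=-1)  # (B, n)
        image_feat = (w.unsqueeze(-1) * patches).sum(dim=1)         # (B, d)
        return image_feat, w                      # reshape w to visualize

feats = torch.randn(2, 8 * 8, 512)                # an image cut into 8x8 patches
img, heat = PatchAttnPool()(feats)
print(img.shape, heat.view(2, 8, 8).shape)
```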

AAAI Conference 2021 Conference Paper

Enhanced Audio Tagging via Multi- to Single-Modal Teacher-Student Mutual Learning

  • Yifang Yin
  • Harsh Shrivastava
  • Ying Zhang
  • Zhenguang Liu
  • Rajiv Ratn Shah
  • Roger Zimmermann

Recognizing ongoing events based on acoustic clues has been a critical yet challenging problem that has attracted significant research attention in recent years. Joint audio-visual analysis can improve event detection accuracy but may not always be feasible, as in many real-world scenarios only audio recordings are available. To address these challenges, we present a novel visual-assisted teacher-student mutual learning framework for robust sound event detection from audio recordings. Our model adopts a multi-modal teacher network based on both acoustic and visual clues, and a single-modal student network based on acoustic clues only. Conventional teacher-student learning performs unsatisfactorily for knowledge transfer from a multi-modality network to a single-modality network. We thus present a mutual learning framework by introducing a single-modal transfer loss and a cross-modal transfer loss to collaboratively learn the audio-visual correlations between the two networks. Our proposed solution takes advantage of joint audio-visual analysis in training while maximizing the feasibility of the model in use cases. Our extensive experiments on the DCASE17 and DCASE18 sound event detection datasets show that our proposed method outperforms state-of-the-art audio tagging approaches.
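
A minimal sketch of the student-side objective, assuming both transfer terms are simple distance losses on predictions and intermediate features respectively (the paper's exact losses, their weights, and the teacher-side update that makes the learning mutual are all omitted):

```python
import torch
import torch.nn.functional as F

def student_losses(student_logits, teacher_logits, student_feat, teacher_feat,
                   targets):
    """Toy combination of a tagging loss, a single-modal transfer loss on
    predictions, and a cross-modal transfer loss on intermediate features."""
    task = F.binary_cross_entropy_with_logits(student_logits, targets)
    single_modal = F.mse_loss(torch.sigmoid(student_logits),
                              torch.sigmoid(teacher_logits).detach())
    cross_modal = F.mse_loss(student_feat, teacher_feat.detach())
    return task + single_modal + cross_modal

s_log, t_log = torch.randn(4, 17), torch.randn(4, 17)  # 17 event classes
s_f, t_f = torch.randn(4, 128), torch.randn(4, 128)
y = torch.randint(0, 2, (4, 17)).float()
print(student_losses(s_log, t_log, s_f, t_f, y))
```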

IJCAI Conference 2021 Conference Paper

Modeling Trajectories with Neural Ordinary Differential Equations

  • Yuxuan Liang
  • Kun Ouyang
  • Hanshu Yan
  • Yiwei Wang
  • Zekun Tong
  • Roger Zimmermann

Recent advances in location-acquisition techniques have generated massive spatial trajectory data. Recurrent Neural Networks (RNNs) are modern tools for modeling such trajectory data. After revisiting RNN-based methods for trajectory modeling, we expose two common critical drawbacks in the existing uses. First, RNNs are discrete-time models that only update the hidden states upon the arrival of new observations, which makes them an awkward fit for learning real-world trajectories with continuous-time dynamics. Second, real-world trajectories are never perfectly accurate due to unexpected sensor noise. Most RNN-based approaches are deterministic and thereby vulnerable to such noise. To tackle these challenges, we devise a novel method entitled TrajODE for more natural modeling of trajectories. It combines the continuous-time characteristic of Neural Ordinary Differential Equations (ODE) with the robustness of stochastic latent spaces. Extensive experiments on the task of trajectory classification demonstrate the superiority of our framework against the RNN counterparts.
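
The continuous-time half of the design can be illustrated with an ODE-RNN-style encoder: between irregularly spaced fixes, a learned vector field evolves the hidden state (fixed-step Euler here, standing in for a proper ODE solver), and each observation then updates it. This sketch omits TrajODE's stochastic latent space; all names are hypothetical.

```python
import torch
import torch.nn as nn

class ODETrajEncoder(nn.Module):
    """Toy continuous-time trajectory encoder: evolve the hidden state with a
    learned ODE between observations, then fold each one in with a GRU cell."""
    def __init__(self, d_in=2, d_h=32, euler_steps=4):
        super().__init__()
        self.f = nn.Sequential(nn.Linear(d_h, d_h), nn.Tanh(),
                               nn.Linear(d_h, d_h))    # learned dynamics dh/dt
        self.cell = nn.GRUCell(d_in, d_h)
        self.steps = euler_steps

    def forward(self, xs, ts):                    # xs: (T, B, d_in), ts: (T,)
        h = xs.new_zeros(xs.shape[1], self.f[0].in_features)
        for i in range(xs.shape[0]):
            if i > 0:                             # evolve h across the time gap
                dt = (ts[i] - ts[i - 1]) / self.steps
                for _ in range(self.steps):
                    h = h + dt * self.f(h)        # explicit Euler step
            h = self.cell(xs[i], h)               # absorb the new observation
        return h

xs = torch.randn(5, 3, 2)                         # 5 GPS fixes, batch of 3
ts = torch.tensor([0.0, 0.4, 0.5, 1.7, 2.0])      # irregular timestamps
print(ODETrajEncoder()(xs, ts).shape)             # torch.Size([3, 32])
```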

AAAI Conference 2020 Conference Paper

Harnessing GANs for Zero-Shot Learning of New Classes in Visual Speech Recognition

  • Yaman Kumar
  • Dhruva Sahrawat
  • Shubham Maheshwari
  • Debanjan Mahata
  • Amanda Stent
  • Yifang Yin
  • Rajiv Ratn Shah
  • Roger Zimmermann

Visual Speech Recognition (VSR) is the process of recognizing or interpreting speech by watching the lip movements of the speaker. Recent machine learning based approaches model VSR as a classification problem; however, the scarcity of training data leads to error-prone systems with very low accuracies in predicting unseen classes. To solve this problem, we present a novel approach to zero-shot learning by generating new classes using Generative Adversarial Networks (GANs), and show how the addition of unseen class samples increases the accuracy of a VSR system by a significant margin of 27% and allows it to handle speaker-independent out-of-vocabulary phrases. We also show that our models are language agnostic and therefore capable of seamlessly generating, using English training data, videos for a new language (Hindi). To the best of our knowledge, this is the first work to show empirical evidence of the use of GANs for generating training samples of unseen classes in the domain of VSR, hence facilitating zero-shot learning. We make the added videos for new classes publicly available along with our code.

AAAI Conference 2020 Conference Paper

Towards Comprehensive Recommender Systems: Time-Aware Unified Recommendations Based on Listwise Ranking of Implicit Cross-Network Data

  • Dilruk Perera
  • Roger Zimmermann

The abundance of information in web applications makes recommendation essential for both users and applications. Despite the effectiveness of existing recommender systems, we find two major limitations that reduce their overall performance: (1) an inability to provide timely recommendations for both new and existing users by considering the dynamic nature of user preferences, and (2) not being fully optimized for the ranking task when using implicit feedback. Therefore, we propose a novel deep learning based unified cross-network solution to mitigate cold-start and data sparsity issues and provide timely recommendations for new and existing users. Furthermore, we consider the ranking problem under implicit feedback as a classification task, and propose a generic personalized listwise optimization criterion for implicit data to effectively rank a list of items. We illustrate our cross-network model using Twitter auxiliary information for recommendations on the YouTube target network. Extensive comparisons against multiple time-aware and cross-network baselines show that the proposed solution is superior in terms of accuracy, novelty, and diversity. Furthermore, experiments conducted on the popular MovieLens dataset suggest that the proposed listwise ranking method outperforms existing state-of-the-art ranking techniques.
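
The "ranking as classification" idea admits a compact sketch: softmax the score list and apply cross-entropy against the user's normalized implicit-feedback vector. This is a generic listwise criterion under that reading of the abstract, not the paper's exact loss; names are illustrative.

```python
import torch
import torch.nn.functional as F

def listwise_implicit_loss(scores, interacted):
    """Toy listwise criterion for implicit feedback: treat ranking a user's
    item list as classification, pushing probability mass onto interactions.

    scores:     (B, n_items) predicted relevance per user
    interacted: (B, n_items) binary implicit feedback (1 = interacted)
    """
    log_p = F.log_softmax(scores, dim=-1)        # distribution over the list
    target = interacted / interacted.sum(-1, keepdim=True).clamp(min=1)
    return -(target * log_p).sum(-1).mean()      # cross-entropy over lists

scores = torch.randn(4, 100, requires_grad=True)
interacted = (torch.rand(4, 100) < 0.05).float()
print(listwise_implicit_loss(scores, interacted))
```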

AAAI Conference 2019 Short Paper

Lipper: Speaker Independent Speech Synthesis Using Multi-View Lipreading

  • Khwaja Mohd. Salik
  • Swati Aggarwal
  • Yaman Kumar
  • Rajiv Ratn Shah
  • Rohit Jain
  • Roger Zimmermann

Lipreading is the process of understanding and interpreting speech by observing a speaker's lip movements. In the past, most work in lipreading has been limited to classifying silent videos into a fixed number of text classes. However, this limits the applications of lipreading, since human language cannot be bound to a fixed set of words or languages. The aim of this work is to reconstruct intelligible acoustic speech signals from silent videos of a person in poses that Lipper has never seen before. Lipper, therefore, is a vocabulary- and language-agnostic, speaker-independent, and near real-time model that handles a variety of speaker poses. The model leverages silent video feeds from multiple cameras recording a subject to generate intelligible speech for a speaker. It uses a deep learning based STCNN+BiGRU architecture to achieve this goal. We evaluate speech reconstruction in speaker-independent scenarios and demonstrate the speech output by overlaying the audio reconstructed by Lipper on the corresponding videos.

AAAI Conference 2019 Conference Paper

Lipper: Synthesizing Thy Speech Using Multi-View Lipreading

  • Yaman Kumar
  • Rohit Jain
  • Khwaja Mohd. Salik
  • Rajiv Ratn Shah
  • Yifang Yin
  • Roger Zimmermann

Lipreading has many potential applications, such as in surveillance and video conferencing. Despite this, most work on building lipreading systems has been limited to classifying silent videos into classes representing text phrases. However, there are multiple problems with making lipreading a text-based classification task, such as its dependence on a particular language and vocabulary mapping. Thus, in this paper we propose a multi-view lipreading-to-audio system, namely Lipper, which models it as a regression task. The model takes silent videos as input and produces speech as the output. With multi-view silent videos, we observe an improvement over single-view speech reconstruction results. We show this by presenting an exhaustive set of experiments for speaker-dependent, out-of-vocabulary, and speaker-independent settings. Further, we compare the delay values of Lipper with other speechreading systems to show the real-time nature of the audio produced. We also conduct a user study on the produced audio to assess its level of comprehensibility.
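
A minimal sketch of a video-to-speech regressor in the spirit of the STCNN+BiGRU design (layer sizes, feature dimensions, and the audio representation are all assumptions, not Lipper's actual architecture):

```python
import torch
import torch.nn as nn

class VideoToSpeech(nn.Module):
    """Toy STCNN+BiGRU regressor: 3D convolutions over lip-crop frames, a
    bidirectional GRU over time, then per-frame regression to audio features
    (e.g., spectrogram bins) instead of text-class logits."""
    def __init__(self, audio_bins=128):
        super().__init__()
        self.stcnn = nn.Sequential(
            nn.Conv3d(1, 16, (3, 5, 5), padding=(1, 2, 2)), nn.ReLU(),
            nn.AdaptiveAvgPool3d((None, 4, 4)))
        self.gru = nn.GRU(16 * 16, 64, batch_first=True, bidirectional=True)
        self.head = nn.Linear(128, audio_bins)

    def forward(self, video):                     # video: (B, 1, T, H, W)
        z = self.stcnn(video)                     # (B, 16, T, 4, 4)
        z = z.permute(0, 2, 1, 3, 4).flatten(2)   # (B, T, 256)
        h, _ = self.gru(z)                        # (B, T, 128)
        return self.head(h)                       # (B, T, audio_bins)

video = torch.randn(2, 1, 25, 48, 48)             # 25 frames of 48x48 lip crops
print(VideoToSpeech()(video).shape)               # torch.Size([2, 25, 128])
```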

AAAI Conference 2019 Short Paper

Mind Your Language: Abuse and Offense Detection for Code-Switched Languages

  • Raghav Kapoor
  • Yaman Kumar
  • Kshitij Rajput
  • Rajiv Ratn Shah
  • Ponnurangam Kumaraguru
  • Roger Zimmermann

In multilingual societies like the Indian subcontinent, the use of code-switched languages is popular and convenient for users. In this paper, we study offense and abuse detection in the code-switched pair of Hindi and English (i.e., Hinglish), the most widely spoken such pair. The task is made difficult by the non-fixed grammar, vocabulary, semantics, and spellings of the Hinglish language. We apply transfer learning and build an LSTM-based model for hate speech classification. This model surpasses the performance of the current best models, establishing itself as the state-of-the-art in the unexplored domain of Hinglish offensive text classification. We also release our model and the trained embeddings for research purposes.
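
A minimal sketch of the transfer-learning setup, assuming it amounts to an LSTM classifier over pretrained, frozen word embeddings (the stand-in embedding matrix and all names below are hypothetical):

```python
import torch
import torch.nn as nn

class HateLSTM(nn.Module):
    """Toy transfer learning: an LSTM classifier over pretrained word
    embeddings that are loaded once and kept frozen."""
    def __init__(self, pretrained, n_classes=3):
        super().__init__()
        self.emb = nn.Embedding.from_pretrained(pretrained, freeze=True)
        self.lstm = nn.LSTM(pretrained.shape[1], 64, batch_first=True)
        self.out = nn.Linear(64, n_classes)

    def forward(self, tokens):                    # tokens: (B, L) word ids
        _, (h, _) = self.lstm(self.emb(tokens))
        return self.out(h[-1])                    # classify from final state

emb = torch.randn(5000, 100)                      # stand-in pretrained matrix
model = HateLSTM(emb)
print(model(torch.randint(0, 5000, (2, 20))).shape)  # torch.Size([2, 3])
```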

IJCAI Conference 2018 Conference Paper

LSTM Networks for Online Cross-Network Recommendations

  • Dilruk Perera
  • Roger Zimmermann

Cross-network recommender systems use auxiliary information from multiple source networks to create holistic user profiles and improve recommendations in a target network. However, we find two major limitations in existing cross-network solutions that reduce overall recommender performance. Existing models (1) fail to capture complex non-linear relationships in user interactions, and (2) are designed for offline settings and hence are not updated online with incoming interactions to capture the dynamics of the recommender environment. We propose a novel multi-layered Long Short-Term Memory (LSTM) network based online solution to mitigate these issues. The proposed model contains three main extensions to the standard LSTM: first, an attention-gated mechanism to capture long-term user preference changes; second, a higher-order interaction layer to alleviate data sparsity; third, time-aware LSTM cell gates to capture irregular time intervals between user interactions. We illustrate our solution using auxiliary information from Twitter and Google Plus to improve recommendations on YouTube. Extensive experiments show that the proposed model consistently outperforms the state-of-the-art in terms of accuracy, diversity, and novelty.
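
The third extension lends itself to a short sketch: a toy time-aware LSTM cell that decays the memory state by a learned function of the elapsed interval before the usual update. This is one common construction, not necessarily the paper's; all names are illustrative.

```python
import torch
import torch.nn as nn

class TimeAwareLSTMCell(nn.Module):
    """Toy time-aware cell: decay the LSTM memory by the elapsed time between
    interactions, so irregular gaps weaken stale preferences."""
    def __init__(self, d_in, d_h):
        super().__init__()
        self.cell = nn.LSTMCell(d_in, d_h)
        self.decay = nn.Linear(1, d_h)

    def forward(self, x, state, dt):              # dt: (B, 1) elapsed time
        h, c = state
        gamma = torch.exp(-torch.relu(self.decay(dt)))  # decay factor in (0, 1]
        return self.cell(x, (h, gamma * c))       # decay memory, then update

cell = TimeAwareLSTMCell(8, 16)
h = c = torch.zeros(2, 16)
h, c = cell(torch.randn(2, 8), (h, c), torch.tensor([[0.5], [12.0]]))
print(h.shape)  # torch.Size([2, 16])
```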

AAAI Conference 2018 Conference Paper

Multi-Modal Multi-Task Learning for Automatic Dietary Assessment

  • Qi Liu
  • Yue Zhang
  • Zhenguang Liu
  • Ye Yuan
  • Li Cheng
  • Roger Zimmermann

We investigate the task of automatic dietary assessment: given meal images and descriptions uploaded by real users, our task is to automatically rate the meals and deliver advisory comments for improving users’ diets. To address this practical yet challenging problem, which is multi-modal and multi-task in nature, an end-to-end neural model is proposed. In particular, comprehensive meal representations are obtained from images, descriptions and user information. We further introduce a novel memory network architecture to store meal representations and reason over the meal representations to support predictions. Results on a real-world dataset show that our method outperforms two strong image captioning baselines significantly.
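
The memory read can be sketched as attention over stored meal representations; the version below is a minimal, hypothetical single-hop read (illustrative names, no multi-task heads):

```python
import torch
import torch.nn as nn

class MealMemoryRead(nn.Module):
    """Toy memory-network read: a meal representation queries stored meal
    memories by scaled dot-product attention, and the retrieved context
    supports the rating prediction."""
    def __init__(self, d=64, n_ratings=5):
        super().__init__()
        self.out = nn.Linear(2 * d, n_ratings)

    def forward(self, query, memory):             # query: (B, d), memory: (M, d)
        attn = torch.softmax(query @ memory.T / memory.shape[-1] ** 0.5, dim=-1)
        context = attn @ memory                   # (B, d) retrieved evidence
        return self.out(torch.cat([query, context], dim=-1))

q, mem = torch.randn(2, 64), torch.randn(50, 64)
print(MealMemoryRead()(q, mem).shape)             # torch.Size([2, 5])
```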