Author name cluster

Celso de Melo

Possible papers associated with this exact author name in Arrow. This page groups case-insensitive exact name matches and is not a full identity disambiguation profile.

7 papers

1 author row

NeurIPS Conference 2025 Conference Paper

Aha! - Predicting What Matters Next: Online Highlight Detection Without Looking Ahead

Aiden Chang
Celso de Melo
Stephanie Lukin

Real-time understanding of continuous video streams is essential for intelligent agents operating in high-stakes environments, including autonomous vehicles, surveillance drones, and disaster response robots. Yet, most existing video understanding and highlight detection methods assume access to the entire video during inference, making them unsuitable for online or streaming scenarios. In particular, current models optimize for offline summarization, failing to support step-by-step reasoning needed for real-time decision-making. We introduce Aha, an autoregressive highlight detection framework that predicts the relevance of each video frame against a task described in natural language. Without accessing future video frames, Aha utilizes a multimodal vision-language model and lightweight, decoupled heads trained on a large, curated dataset of human-centric video labels. To enable scalability, we introduce the Dynamic SinkCache mechanism that achieves constant memory usage across infinite-length streams without degrading performance on standard benchmarks. This encourages the hidden representation to capture high-level task objectives, enabling effective frame-level rankings for informativeness, relevance, and uncertainty with respect to the natural language task. Aha achieves state-of-the-art (SOTA) performance on highlight detection benchmarks, surpassing even prior offline, full-context approaches and video-language models by +5. 9\% on TVSum and +8. 3\% on Mr. Hisum in mAP (mean Average Precision). We explore Aha’s potential for real-world robotics applications given a task-oriented natural language input and a continuous, robot-centric video. Both experiments demonstrate Aha's potential effectiveness as a real-time reasoning module for downstream planning and long-horizon understanding.

PDF Details

NeurIPS Conference 2025 Conference Paper

Bisecle: Binding and Separation in Continual Learning for Video Language Understanding

Yue Tan
Xiaoqian Hu
Hao Xue
Celso de Melo
Flora Salim

Frontier vision-language models (VLMs) have made remarkable improvements in video understanding tasks. However, real-world videos typically exist as continuously evolving data streams (e. g. , dynamic scenes captured by wearable glasses), necessitating models to continually adapt to shifting data distributions and novel scenarios. Considering the prohibitive computational costs of fine-tuning models on new tasks, usually, a small subset of parameters is updated while the bulk of the model remains frozen. This poses new challenges to existing continual learning frameworks in the context of large multimodal foundation models, i. e. , catastrophic forgetting and update conflict. While the foundation models struggle with parameter-efficient continual learning, the hippocampus in the human brain has evolved highly efficient mechanisms for memory formation and consolidation. Inspired by the rapid Bi nding and pattern se paration mechanisms in the hippocampus, in this work, we propose Bisecle for video-language c ontinual le arning, where a multi-directional supervision module is used to capture more cross-modal relationships and a contrastive prompt learning scheme is designed to isolate task-specific knowledge to facilitate efficient memory storage. Binding and separation processes further strengthen the ability of VLMs to retain complex experiences, enabling robust and efficient continual learning in video understanding tasks. We perform a thorough evaluation of the proposed Bisecle, demonstrating its ability to mitigate forgetting and enhance cross-task generalization on several VideoQA benchmarks.

PDF Details

NeurIPS Conference 2025 Conference Paper

SpatialReasoner: Towards Explicit and Generalizable 3D Spatial Reasoning

Wufei Ma
Yu-Cheng Chou
Qihao Liu
Xingrui Wang
Celso de Melo
Jianwen Xie
Alan Yuille

Despite recent advances on multi-modal models, 3D spatial reasoning remains a challenging task for state-of-the-art open-source and proprietary models. Recent studies explore data-driven approaches and achieve enhanced spatial reasoning performance by fine-tuning models on 3D-related visual question-answering data. However, these methods typically perform spatial reasoning in an implicit manner and often fail on questions that are trivial to humans, even with long chain-of-thought reasoning. In this work, we introduce SpatialReasoner, a novel large vision-language model (LVLM) that addresses 3D spatial reasoning with explicit 3D representations shared between multiple stages--3D perception, computation, and reasoning. Explicit 3D representations provide a coherent interface that supports advanced 3D spatial reasoning and improves the generalization ability to novel question types. Furthermore, by analyzing the explicit 3D representations in multi-step reasoning traces of SpatialReasoner, we study the factual errors and identify key shortcomings of current LVLMs. Results show that our SpatialReasoner achieves improved performance on a variety of spatial reasoning benchmarks, outperforming Gemini 2. 0 by 9. 2% on 3DSRBench, and generalizes better when evaluating on novel 3D spatial reasoning questions. Our study bridges the 3D parsing capabilities of prior visual foundation models with the powerful reasoning abilities of large language models, opening new directions for 3D spatial reasoning.

PDF Details

NeurIPS Conference 2024 Conference Paper

ViLCo-Bench: VIdeo Language COntinual learning Benchmark

Tianqi Tang
Shohreh Deldari
Hao Xue
Celso de Melo
Flora Salim

Video language continual learning involves continuously adapting to information from video and text inputs, enhancing a model’s ability to handle new tasks while retaining prior knowledge. This field is a relatively under-explored area, and establishing appropriate datasets is crucial for facilitating communication and research in this field. In this study, we present the first dedicated benchmark, ViLCo-Bench, designed to evaluate continual learning models across a range of video-text tasks. The dataset comprises ten-minute-long videos and corresponding language queries collected from publicly available datasets. Additionally, we introduce a novel memory-efficient framework that incorporates self-supervised learning and mimics long-term and short-term memory effects. This framework addresses challenges including memory complexity from long video clips, natural language complexity from open queries, and text-video misalignment. We posit that ViLCo-Bench, with greater complexity compared to existing continual learning benchmarks, would serve as a critical tool for exploring the video-language domain, extending beyond conventional class-incremental tasks, and addressing complex and limited annotation issues. The curated data, evaluations, and our novel method are available at https: //github. com/cruiseresearchgroup/ViLCo.

PDF Details DOI

AAAI Conference 2014 Conference Paper

The Importance of Cognition and Affect for Artificially Intelligent Decision Makers

Celso de Melo
Jonathan Gratch
Peter Carnevale

Agency – the capacity to plan and act – and experience – the capacity to sense and feel – are two critical aspects that determine whether people will perceive non-human entities, such as autonomous agents, to have a mind. There is evidence that the absence of either can reduce cooperation. We present an experiment that tests the necessity of both for cooperation with agents. In this experiment we manipulated people’s perceptions about the cognitive and affective abilities of agents, when engaging in the ultimatum game. The results indicated that people offered more money to agents that were perceived to make decisions according to their intentions (high agency), rather than randomly (low agency). Additionally, the results showed that people offered more money to agents that expressed emotion (high experience), when compared to agents that did not (low experience). We discuss the implications of this agencyexperience theoretical framework for the design of artificially intelligent decision makers.

PDF Details

AAMAS Conference 2012 Conference Paper

Bayesian Model of the Social Effects of Emotion in Decision-Making in Multiagent Systems

Celso de Melo
Peter Carnevale
Stephen Read
Dimitrios Antos
Jonathan Gratch

Research in the behavioral sciences suggests that emotion can serve important social functions and that, more than a simple manifestation of internal experience, emotion displays communicate one's beliefs, desires and intentions. In a recent study we have shown that, when engaged in the iterated prisoner's dilemma with agents that display emotion, people infer, from the emotion displays, how the agent is appraising the ongoing interaction (e. g. , is the situation favorable to the agent? Does it blame me for the current state-of-affairs? ). From these appraisals people, then, infer whether the agent is likely to cooperate in the future. In this paper we propose a Bayesian model that captures this social function of emotion. The model supports probabilistic predictions, from emotion displays, about how the counterpart is appraising the interaction which, in turn, lead to predictions about the counterpart's intentions. The model's parameters were learned using data from the empirical study. Our evaluation indicated that considering emotion displays improved the model's ability to predict the counterpart's intentions, in particular, how likely it was to cooperate in a social dilemma. Using data from another empirical study where people made inferences about the counterpart's likelihood of cooperation in the absence of emotion displays, we also showed that the model could, from information about appraisals alone, make appropriate inferences about the counterpart's intentions. Overall, the paper suggests that appraisals are valuable for computational models of emotion interpretation. The relevance of these results for the design of multiagent systems where agents, human or not, can convey or recognize emotion is discussed.

PDF

AAAI Conference 2011 Conference Paper

The Influence of Emotion Expression on Perceptions of Trustworthiness in Negotiation

Dimitrios Antos
Celso de Melo
Jonathan Gratch
Barbara Grosz

When interacting with computer agents, people make inferences about various characteristics of these agents, such as their reliability and trustworthiness. These perceptions are signiﬁcant, as they inﬂuence people’s behavior towards the agents, and may foster or inhibit repeated interactions between them. In this paper we investigate whether computer agents can use the expression of emotion to inﬂuence human perceptions of trustworthiness. In particular, we study human-computer interactions within the context of a negotiation game, in which players make alternating offers to decide on how to divide a set of resources. A series of negotiation games between a human and several agents is then followed by a “trust game. ” In this game people have to choose one among several agents to interact with, as well as how much of their resources they will trust to it. Our results indicate that, among those agents that displayed emotion, those whose expression was in accord with their actions (strategy) during the negotiation game were generally preferred as partners in the trust game over those whose emotion expressions and actions did not mesh. Moreover, we observed that when emotion does not carry useful new information, it fails to strongly inﬂuence human decision-making behavior in a negotiation setting.

PDF Details