Arrow Research search

Author name cluster

Daniel McDuff

Possible papers associated with this exact author name in Arrow. This page groups case-insensitive exact name matches and is not a full identity disambiguation profile.

19 papers
2 author rows

Possible papers

19

NeurIPS Conference 2025 Conference Paper

RADAR: Benchmarking Language Models on Imperfect Tabular Data

  • Ken Gu
  • Zhihan Zhang
  • Kate Lin
  • Yuwei Zhang
  • Akshay Paruchuri
  • Hong Yu
  • Mehran Kazemi
  • Kumar Ayush

Language models (LMs) are increasingly being deployed to perform autonomous data analyses. However, their data awareness—the ability to recognize, reason over, and appropriately handle data artifacts such as missing values, outliers, and logical inconsistencies—remains underexplored. These artifacts are especially common in real-world tabular data and, if mishandled, can significantly compromise the validity of analytical conclusions. To address this gap, we present RADAR, a benchmark for systematically evaluating data-aware reasoning on tabular data. We develop a framework to simulate data artifacts via programmatic perturbations to enable targeted evaluation of model behavior. RADAR comprises 2,980 table-query pairs, grounded in real-world data spanning 9 domains and 5 data artifact types. In addition to evaluating artifact handling, RADAR systematically varies table size to study how reasoning performance holds up as tables grow. Our evaluation reveals that, despite decent performance on tables without data artifacts, frontier models degrade significantly when data artifacts are introduced, exposing critical gaps in their capacity for robust, data-aware analysis. Designed to be flexible and extensible, RADAR supports diverse perturbation types and controllable table sizes, offering a valuable resource for advancing tabular reasoning.
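
The artifact-simulation idea above can be pictured with a short, purely illustrative sketch (this is not RADAR's code; the column names, rates, and outlier scale are made up): programmatic perturbations take a clean table and inject artifacts such as missing values and outliers before a model is asked to analyze it.

```python
# Illustrative sketch of programmatic table perturbation (not RADAR's implementation).
# Column names, perturbation rates, and the outlier scale are hypothetical.
import numpy as np
import pandas as pd

rng = np.random.default_rng(0)

def inject_missing(df: pd.DataFrame, column: str, rate: float = 0.10) -> pd.DataFrame:
    """Blank out a random fraction of one column."""
    out = df.copy()
    out.loc[rng.random(len(out)) < rate, column] = np.nan
    return out

def inject_outliers(df: pd.DataFrame, column: str, rate: float = 0.05, scale: float = 10.0) -> pd.DataFrame:
    """Multiply a random fraction of one numeric column by a large factor."""
    out = df.copy()
    mask = rng.random(len(out)) < rate
    out.loc[mask, column] = out.loc[mask, column] * scale
    return out

clean = pd.DataFrame({"age": rng.uniform(18, 90, size=200),
                      "income": rng.normal(50_000, 12_000, size=200)})
perturbed = inject_outliers(inject_missing(clean, "age"), "income")
```

A table-query pair in this spirit would present `perturbed` rather than `clean` to the model and check whether its answer acknowledges or corrects the injected artifacts.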

ICLR Conference 2025 Conference Paper

Scaling Wearable Foundation Models

  • Girish Narayanswamy
  • Xin Liu 0034
  • Kumar Ayush
  • Yuzhe Yang 0003
  • Xuhai Xu
  • Shun Liao
  • Jake Garrison
  • Shyam A. Tailor

Wearable sensors have become ubiquitous thanks to a variety of health tracking features. The resulting continuous and longitudinal measurements from everyday life generate large volumes of data. However, making sense of these observations for scientific and actionable insights is non-trivial. Inspired by the empirical success of generative modeling, where large neural networks learn powerful representations from vast amounts of text, image, video, or audio data, we investigate the scaling properties of wearable sensor foundation models across compute, data, and model size. Using a dataset of up to 40 million hours of in-situ heart rate, heart rate variability, accelerometer, electrodermal activity, skin temperature, and altimeter per-minute data from over 165,000 people, we create LSM, a multimodal foundation model built on the largest wearable-signals dataset with the most extensive range of sensor modalities to date. Our results establish the scaling laws of LSM for tasks such as imputation, interpolation and extrapolation across both time and sensor modalities. Moreover, we highlight how LSM enables sample-efficient downstream learning for tasks including exercise and activity recognition.

NeurIPS Conference 2025 Conference Paper

SensorLM: Learning the Language of Wearable Sensors

  • Yuwei Zhang
  • Kumar Ayush
  • Siyuan Qiao
  • A. Ali Heydari
  • Girish Narayanswamy
  • Max Xu
  • Ahmed Metwally
  • Jinhua Xu

We present SensorLM, a family of sensor-language foundation models that enable wearable sensor data understanding with natural language. Despite its pervasive nature, aligning and interpreting sensor data with language remains challenging due to the lack of paired, richly annotated sensor-text descriptions in uncurated, real-world wearable data. We introduce a hierarchical caption generation pipeline designed to capture statistical, structural, and semantic information from sensor data. This approach enabled the curation of the largest sensor-language dataset to date, comprising over 59.7 million hours of data from more than 103,000 people. Furthermore, SensorLM extends prominent multimodal pretraining architectures (e.g., CLIP, CoCa) and recovers them as specific variants within a generic architecture. Extensive experiments on real-world tasks in human activity analysis and healthcare verify the superior performance of SensorLM over state-of-the-art in zero-shot recognition, few-shot learning, and cross-modal retrieval. SensorLM also demonstrates intriguing capabilities including scaling behaviors, label efficiency, sensor captioning, and zero-shot generalization to unseen tasks. Code is available at https://github.com/Google-Health/consumer-health-research/tree/main/sensorlm.

NeurIPS Conference 2024 Conference Paper

MDAgents: An Adaptive Collaboration of LLMs for Medical Decision-Making

  • Yubin Kim
  • Chanwoo Park
  • Hyewon Jeong
  • Yik S. Chan
  • Xuhai Xu
  • Daniel McDuff
  • Hyeonhoon Lee
  • Marzyeh Ghassemi

Foundation models are becoming valuable tools in medicine. Yet despite their promise, the best way to leverage Large Language Models (LLMs) in complex medical tasks remains an open question. We introduce a novel multi-agent framework, named Medical Decision-making Agents (MDAgents), that helps to address this gap by automatically assigning a collaboration structure to a team of LLMs. The assigned solo or group collaboration structure is tailored to the medical task at hand, a simple emulation inspired by the way real-world medical decision-making processes are adapted to tasks of different complexities. We evaluate our framework and baseline methods using state-of-the-art LLMs across a suite of real-world medical knowledge and clinical diagnosis benchmarks, including a comparison of LLMs' medical complexity classification against human physicians. MDAgents achieved the best performance in seven out of ten benchmarks on tasks requiring an understanding of medical knowledge and multi-modal reasoning, showing a significant improvement of up to 4.2% (p < 0.05) compared to previous methods' best performances. Ablation studies reveal that MDAgents effectively determines medical complexity to optimize for efficiency and accuracy across diverse medical tasks. Notably, the combination of moderator review and external medical knowledge in group collaboration resulted in an average accuracy improvement of 11.8%. Our code can be found at https://github.com/mitmedialab/MDAgents.

ICML Conference 2024 Conference Paper

Position: Standardization of Behavioral Use Clauses is Necessary for the Adoption of Responsible Licensing of AI

  • Daniel McDuff
  • Tim Korjakow
  • Scott Cambo
  • Jesse Josua Benjamin
  • Jenny Lee
  • Yacine Jernite
  • Carlos Muñoz Ferrandis
  • Aaron Gokaslan

Growing concerns over negligent or malicious uses of AI have increased the appetite for tools that help manage the risks of the technology. In 2018, licenses with behavioral-use clauses (commonly referred to as Responsible AI Licenses) were proposed to give developers a framework for releasing AI assets while requiring their users to mitigate negative applications. As of the end of 2023, on the order of 40,000 software and model repositories have adopted responsible AI licenses. Notable models licensed with behavioral use clauses include BLOOM (language), LLaMA2 (language), Stable Diffusion (image), and GRID (robotics). This paper explores why and how these licenses have been adopted, and why and how they have been adapted to fit particular use cases. We use a mixed-methods methodology of qualitative interviews, clustering of license clauses, and quantitative analysis of license adoption. Based on this evidence we take the position that responsible AI licenses need standardization to avoid confusing users or diluting their impact. At the same time, customization of behavioral restrictions is also appropriate in some contexts (e.g., medical domains). We advocate for “standardized customization” that can meet users’ needs and can be supported via tooling.

TMLR Journal 2024 Journal Article

The Responsible Foundation Model Development Cheatsheet: A Review of Tools & Resources

  • Shayne Longpre
  • Stella Biderman
  • Alon Albalak
  • Hailey Schoelkopf
  • Daniel McDuff
  • Sayash Kapoor
  • Kevin Klyman
  • Kyle Lo

Foundation model development attracts a rapidly expanding body of contributors, scientists, and applications. To help shape responsible development practices, we introduce the Foundation Model Development Cheatsheet: a growing collection of 250+ tools and resources spanning text, vision, and speech modalities. We draw on a large body of prior work to survey resources (e.g. software, documentation, frameworks, guides, and practical tools) that support informed data selection, processing, and understanding; precise and limitation-aware artifact documentation; efficient model training; advance awareness of the environmental impact from training; careful model evaluation of capabilities, risks, and claims; as well as responsible model release, licensing, and deployment practices. We hope this curated collection of resources helps guide more responsible development. The process of curating this list enabled us to review the AI development ecosystem, revealing which tools are critically missing, misused, or over-used in existing practices. We find that (i) tools for data sourcing, model evaluation, and monitoring are critically under-serving ethical and real-world needs, (ii) evaluations for model safety, capabilities, and environmental impact all lack reproducibility and transparency, (iii) text and particularly English-centric analyses continue to dominate over multilingual and multi-modal analyses, and (iv) evaluation of systems, rather than just models, is needed so that capabilities and impact are assessed in context.

NeurIPS Conference 2023 Conference Paper

rPPG-Toolbox: Deep Remote PPG Toolbox

  • Xin Liu
  • Girish Narayanswamy
  • Akshay Paruchuri
  • Xiaoyu Zhang
  • Jiankai Tang
  • Yuzhe Zhang
  • Roni Sengupta
  • Shwetak Patel

Camera-based physiological measurement is a fast-growing field of computer vision. Remote photoplethysmography (rPPG) utilizes imaging devices (e.g., cameras) to measure the peripheral blood volume pulse (BVP) via photoplethysmography, and enables cardiac measurement via webcams and smartphones. However, the task is non-trivial, with important pre-processing, modeling, and post-processing steps required to obtain state-of-the-art results. Replication of results and benchmarking of new models are critical for scientific progress; however, as with many other applications of deep learning, reliable codebases are not easy to find or use. We present a comprehensive toolbox, rPPG-Toolbox, containing unsupervised and supervised rPPG models with support for public benchmark datasets, data augmentation, and systematic evaluation: https://github.com/ubicomplab/rPPG-Toolbox.
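
For readers new to rPPG, the underlying measurement principle can be illustrated with the classic green-channel baseline sketched below. This is a deliberate simplification for intuition only, not one of the toolbox's models: average the green pixel intensity over the (assumed skin) region in each frame, then read the pulse rate off the dominant frequency of that trace.

```python
# Classic green-channel rPPG baseline, shown only to illustrate the principle.
# The synthetic "video" and the cardiac band limits are illustrative assumptions.
import numpy as np

def green_trace(video: np.ndarray) -> np.ndarray:
    """Average green intensity per frame for a video of shape (T, H, W, 3)."""
    return video[..., 1].mean(axis=(1, 2))

def pulse_rate_bpm(trace: np.ndarray, fps: float) -> float:
    """Dominant frequency of the zero-mean trace within ~0.7-3 Hz, in beats per minute."""
    x = trace - trace.mean()
    freqs = np.fft.rfftfreq(len(x), d=1.0 / fps)
    power = np.abs(np.fft.rfft(x)) ** 2
    band = (freqs >= 0.7) & (freqs <= 3.0)   # 42-180 bpm
    return float(freqs[band][np.argmax(power[band])] * 60.0)

fps = 30.0
pulse = 0.05 * np.sin(2 * np.pi * 1.2 * np.arange(300) / fps)              # 1.2 Hz -> 72 bpm
video = np.random.rand(300, 72, 72, 3) + pulse[:, None, None, None] * np.array([0, 1, 0])
print(round(pulse_rate_bpm(green_trace(video), fps)))                      # ~72
```

Real pipelines add face detection, detrending, motion handling, and learned models, which is exactly the machinery the toolbox packages up.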

ICLR Conference 2023 Conference Paper

SimPer: Simple Self-Supervised Learning of Periodic Targets

  • Yuzhe Yang 0003
  • Xin Liu 0034
  • Jiang Wu
  • Silviu Borac
  • Dina Katabi
  • Ming-Zher Poh
  • Daniel McDuff

From human physiology to environmental evolution, important processes in nature often exhibit meaningful and strong periodic or quasi-periodic changes. Due to their inherent label scarcity, learning useful representations for periodic tasks with limited or no supervision is of great benefit. Yet, existing self-supervised learning (SSL) methods overlook the intrinsic periodicity in data, and fail to learn representations that capture periodic or frequency attributes. In this paper, we present SimPer, a simple contrastive SSL regime for learning periodic information in data. To exploit the periodic inductive bias, SimPer introduces customized augmentations, feature similarity measures, and a generalized contrastive loss for learning efficient and robust periodic representations. Extensive experiments on common real-world tasks in human behavior analysis, environmental sensing, and healthcare domains verify the superior performance of SimPer compared to state-of-the-art SSL methods, highlighting its intriguing properties including better data efficiency, robustness to spurious correlations, and generalization to distribution shifts.
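
One concrete way to picture the "customized augmentations" mentioned above is speed (frequency) resampling: playing a periodic signal back faster or slower shifts its dominant frequency while keeping its shape, producing views whose targets differ in a controlled way. The sketch below shows only that augmentation idea; it is an illustrative assumption, not SimPer's actual augmentation, similarity measure, or loss.

```python
# Illustrative speed/frequency augmentation for periodic signals (not SimPer's code).
import numpy as np

def speed_augment(x: np.ndarray, speed: float) -> np.ndarray:
    """Resample a 1-D signal so that, read at the original sampling rate, its dominant
    frequency is multiplied by `speed`. The output is shorter for speed > 1 and
    longer for speed < 1."""
    n = len(x)
    m = max(2, int(n / speed))
    positions = np.linspace(0, n - 1, m)   # read the whole signal with m samples
    return np.interp(positions, np.arange(n), x)

fs = 100.0
t = np.arange(0, 10, 1 / fs)
signal = np.sin(2 * np.pi * 1.2 * t)       # ~1.2 Hz, like a resting pulse
faster_view = speed_augment(signal, 1.5)   # ~1.8 Hz view of the same waveform
slower_view = speed_augment(signal, 0.75)  # ~0.9 Hz view
```

A periodicity-aware contrastive objective can then ask the learned features of such views to respect their known frequency relationships rather than treating every augmented view as an identical positive.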

IROS Conference 2022 Conference Paper

COMPASS: Contrastive Multimodal Pretraining for Autonomous Systems

  • Shuang Ma
  • Sai Vemprala
  • Wenshan Wang
  • Jayesh K. Gupta
  • Yale Song
  • Daniel McDuff
  • Ashish Kapoor

Learning representations that generalize across tasks and domains is challenging yet necessary for autonomous systems. Although task-driven approaches are appealing, designing models specific to each application can be difficult in the face of limited data, especially when dealing with highly variable multimodal input spaces arising from different tasks in different environments. We introduce the first general-purpose pretraining pipeline, COntrastive Multimodal Pretraining for AutonomouS Systems (COMPASS), to overcome the limitations of task-specific models and existing pretraining approaches. COMPASS constructs a multimodal graph by considering the essential information for autonomous systems and the properties of different modalities. Through this graph, multimodal signals are connected and mapped into two factorized spatio-temporal latent spaces: a “motion pattern space” and a “current state space.” By learning from multimodal correspondences in each latent space, COMPASS creates state representations that model necessary information such as temporal dynamics, geometry, and semantics. We pretrain COMPASS on a large-scale multimodal simulation dataset, TartanAir [1], and evaluate it on drone navigation, vehicle racing, and visual odometry tasks. The experiments indicate that COMPASS can tackle all three scenarios and can also generalize to unseen environments and real-world data. Our code implementation can be found at https://github.com/microsoft/COMPASS

AAAI Conference 2022 Conference Paper

DOC2PPT: Automatic Presentation Slides Generation from Scientific Documents

  • Tsu-Jui Fu
  • William Yang Wang
  • Daniel McDuff
  • Yale Song

Creating presentation materials requires complex multimodal reasoning skills to summarize key concepts and arrange them in a logical and visually pleasing manner. Can machines learn to emulate this laborious process? We present a novel task and approach for document-to-slide generation. Solving this involves document summarization, image and text retrieval, and slide structure to arrange key elements in a form suitable for presentation. We propose a hierarchical sequence-to-sequence approach to tackle our task in an end-to-end manner. Our approach exploits the inherent structures within documents and slides and incorporates paraphrasing and layout prediction modules to generate slides. To help accelerate research in this domain, we release a dataset of about 6K paired documents and slide decks used in our experiments. We show that our approach outperforms strong baselines and produces slides with rich content and aligned imagery.

NeurIPS Conference 2022 Conference Paper

SCAMPS: Synthetics for Camera Measurement of Physiological Signals

  • Daniel McDuff
  • Miah Wander
  • Xin Liu
  • Brian Hill
  • Javier Hernandez
  • Jonathan Lester
  • Tadas Baltrusaitis

The use of cameras and computational algorithms for noninvasive, low-cost and scalable measurement of physiological (e.g., cardiac and pulmonary) vital signs is very attractive. However, diverse data representing a range of environments, body motions, illumination conditions and physiological states is laborious, time-consuming and expensive to obtain. Synthetic data have proven a valuable tool in several areas of machine learning, yet are not widely available for camera measurement of physiological states. Synthetic data offer "perfect" labels (e.g., without noise and with precise synchronization), labels that may not be possible to obtain otherwise (e.g., precise pixel-level segmentation maps) and provide a high degree of control over variation and diversity in the dataset. We present SCAMPS, a dataset of synthetics containing 2,800 videos (1.68M frames) with aligned cardiac and respiratory signals and facial action intensities. The RGB frames are provided alongside segmentation maps and precise descriptive statistics about the underlying waveforms, including inter-beat interval, heart rate variability, and pulse arrival time. Finally, we present baseline results training on these synthetic data and testing on real-world datasets to illustrate generalizability.

ICLR Conference 2021 Conference Paper

Active Contrastive Learning of Audio-Visual Video Representations

  • Shuang Ma
  • Zhaoyang Zeng
  • Daniel McDuff
  • Yale Song

Contrastive learning has been shown to produce generalizable representations of audio and visual data by maximizing the lower bound on the mutual information (MI) between different views of an instance. However, obtaining a tight lower bound requires a sample size exponential in MI and thus a large set of negative samples. We can incorporate more samples by building a large queue-based dictionary, but there are theoretical limits to performance improvements even with a large number of negative samples. We hypothesize that random negative sampling leads to a highly redundant dictionary that results in suboptimal representations for downstream tasks. In this paper, we propose an active contrastive learning approach that builds an actively sampled dictionary with diverse and informative items, which improves the quality of negative samples and improves performance on tasks where there is high mutual information in the data, e.g., video classification. Our model achieves state-of-the-art performance on challenging audio and visual downstream benchmarks including UCF101, HMDB51 and ESC50.
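
For context on the "sample size exponential in MI" remark: the standard InfoNCE objective used throughout this line of work lower-bounds mutual information, and because the loss is non-negative the estimate can never exceed the log of the number of samples compared against. The bound below is the generic result (in the style of van den Oord et al.), not a formula specific to this paper.

```latex
% Generic InfoNCE bound: a critic f scores one positive pair (x, y)
% against N - 1 negatives y_j drawn from the same batch or dictionary.
\[
  \mathcal{L}_{\mathrm{InfoNCE}}
    = -\,\mathbb{E}\!\left[\log
        \frac{\exp f(x, y)}{\sum_{j=1}^{N} \exp f(x, y_j)}\right],
  \qquad
  I(X; Y) \;\ge\; \log N - \mathcal{L}_{\mathrm{InfoNCE}}.
\]
% The loss is non-negative, so the bound saturates at log N; tightly estimating a
% large MI therefore needs on the order of e^{I(X;Y)} negatives, which is why large
% queue-based dictionaries -- and better-chosen negatives -- matter.
```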

NeurIPS Conference 2021 Conference Paper

Contrastive Learning of Global and Local Video Representations

  • Shuang Ma
  • Zhaoyang Zeng
  • Daniel McDuff
  • Yale Song

Contrastive learning has delivered impressive results for various tasks in the self-supervised regime. However, existing approaches optimize for learning representations specific to downstream scenarios, i.e., global representations suitable for tasks such as classification or local representations for tasks such as detection and localization. While they produce satisfactory results in the intended downstream scenarios, they often fail to generalize to tasks that they were not originally designed for. In this work, we propose to learn video representations that generalize to both the tasks which require global semantic information (e.g., classification) and the tasks that require local fine-grained spatio-temporal information (e.g., localization). We achieve this by optimizing two contrastive objectives that together encourage our model to learn global-local visual information given audio signals. We show that the two objectives mutually improve the generalizability of the learned global-local representations, significantly outperforming their disjointly learned counterparts. We demonstrate our approach on various tasks including action/sound classification, lipreading, deepfake detection, event and sound localization.

JBHI Journal 2021 Journal Article

Guest Editorial: Camera-Based Monitoring for Pervasive Healthcare Informatics

  • Wenjin Wang
  • Steffen Leonhardt
  • Lionel Tarassenko
  • Caifeng Shan
  • Daniel McDuff

The papers in this special section focus on camera-based monitoring for pervasive healthcare informatics. Measuring physiological signals from the human face and body using video cameras is an emerging research topic that has grown rapidly in the last decade. Remote cameras (in both visible and infrared wavelengths) can be used to measure vital signs from a human body based on skin optics or body movements, thereby avoiding mechanical contact with the skin. Camera-based health monitoring will bring a rich set of compelling healthcare applications that directly improve upon contact-based monitoring solutions and impact people’s care experience and quality of life in various scenarios, such as in hospital care units, sleep/senior centers, assisted-living homes, telemedicine and e-health, baby/elderly care at home, fitness and sports, driver monitoring in automotive applications, cardiac/respiratory gating for MRI/CT, AR/VR-based therapy and clinical training, etc.

ICRA Conference 2021 Conference Paper

Modeling Affect-based Intrinsic Rewards for Exploration and Learning

  • Dean Zadok
  • Daniel McDuff
  • Ashish Kapoor

Positive affect has been linked to increased interest, curiosity and satisfaction in human learning. In reinforcement learning, extrinsic rewards are often sparse and difficult to define; intrinsically motivated learning can help address these challenges. We argue that positive affect is an important intrinsic reward that effectively helps drive exploration that is useful in gathering experiences. We present a novel approach leveraging a task-independent reward function trained on spontaneous smile behavior that reflects the intrinsic reward of positive affect. To evaluate our approach, we trained several downstream computer vision tasks on data collected with our policy and several baseline methods. We show that the policy based on our affective rewards successfully increases the duration of episodes and the area explored, and reduces collisions. The result is faster learning for several downstream computer vision tasks.

NeurIPS Conference 2020 Conference Paper

Multi-Task Temporal Shift Attention Networks for On-Device Contactless Vitals Measurement

  • Xin Liu
  • Josh Fromm
  • Shwetak Patel
  • Daniel McDuff

Telehealth and remote health monitoring have become increasingly important during the SARS-CoV-2 pandemic, and it is widely expected that this will have a lasting impact on healthcare practices. These tools can help reduce the risk of exposing patients and medical staff to infection, make healthcare services more accessible, and allow providers to see more patients. However, objective measurement of vital signs is challenging without direct contact with a patient. We present a video-based and on-device optical cardiopulmonary vital sign measurement approach. It leverages a novel multi-task temporal shift convolutional attention network (MTTS-CAN) and enables real-time cardiovascular and respiratory measurements on mobile platforms. We evaluate our system on an Advanced RISC Machine (ARM) CPU and achieve state-of-the-art accuracy while running at over 150 frames per second, which enables real-time applications. Systematic experimentation on large benchmark datasets reveals that our approach leads to substantial (20%-50%) reductions in error and generalizes well across datasets.
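
The "temporal shift" ingredient named in MTTS-CAN comes from the temporal shift module family of video architectures: part of the channels in each frame are swapped with the previous or next frame so that ordinary 2-D convolutions can see short-range temporal context. The sketch below shows that core operation only; the channel fractions, tensor layout, and zero padding are common defaults used for illustration, not the paper's implementation.

```python
# Minimal temporal-shift operation (TSM-style), for illustration only.
import numpy as np

def temporal_shift(frames: np.ndarray, fold_div: int = 8) -> np.ndarray:
    """Shift 1/fold_div of the channels one frame forward and one frame backward.

    `frames` has shape (T, H, W, C); shifted-out positions are zero-padded.
    """
    t, h, w, c = frames.shape
    fold = c // fold_div
    out = np.zeros_like(frames)
    out[:-1, ..., :fold] = frames[1:, ..., :fold]                   # pull from the next frame
    out[1:, ..., fold:2 * fold] = frames[:-1, ..., fold:2 * fold]   # pull from the previous frame
    out[..., 2 * fold:] = frames[..., 2 * fold:]                    # leave the rest untouched
    return out

clip = np.random.rand(16, 36, 36, 32).astype(np.float32)   # 16 frames, 36x36 pixels, 32 channels
mixed = temporal_shift(clip)   # each frame now carries information from its neighbours
```

Because the shift itself costs no parameters and almost no compute, it suits the on-device, real-time constraint the abstract emphasizes.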

NeurIPS Conference 2019 Conference Paper

Characterizing Bias in Classifiers using Generative Models

  • Daniel McDuff
  • Shuang Ma
  • Yale Song
  • Ashish Kapoor

Models that are learned from real-world data are often biased because the data used to train them is biased. This can propagate systemic human biases that exist and ultimately lead to inequitable treatment of people, especially minorities. To characterize bias in learned classifiers, existing approaches rely on human oracles labeling real-world examples to identify the "blind spots" of the classifiers; these are ultimately limited due to the human labor required and the finite nature of existing image examples. We propose a simulation-based approach for interrogating classifiers using generative adversarial models in a systematic manner. We incorporate a progressive conditional generative model for synthesizing photo-realistic facial images and Bayesian Optimization for an efficient interrogation of independent facial image classification systems. We show how this approach can be used to efficiently characterize racial and gender biases in commercial systems.

JBHI Journal 2019 Journal Article

Wearable Motion-Based Heart Rate at Rest: A Workplace Evaluation

  • Javier Hernandez
  • Daniel McDuff
  • Karen Quigley
  • Pattie Maes
  • Rosalind W. Picard

This paper studies the feasibility of using low-cost motion sensors to provide opportunistic heart rate assessments from ballistocardiographic signals during restful periods of daily life. Three wearable devices were used to capture peripheral motions at specific body locations (head, wrist, and trouser pocket) of 15 participants during five regular workdays each. Three methods were implemented to extract heart rate from motion data, and their performance was compared to that obtained with an FDA-cleared device. With a total of 1358 h of naturalistic sensor data, our results show that providing accurate heart rate estimations from peripheral motion signals is possible during relatively “still” moments. In our real-life workplace study, the head-mounted device yielded the most frequent assessments (22.98% of the time under 5 beats per minute of error) followed by the smartphone in the pocket (5.02%) and the wrist-worn device (3.48%). Most importantly, accurate assessments were automatically detected by using a custom threshold based on the device jerk. Due to the pervasiveness and low cost of wearable motion sensors, this paper demonstrates the feasibility of providing opportunistic large-scale low-cost samples of resting heart rate.
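
As a generic illustration of how a resting heart rate can be read out of a motion (ballistocardiographic) signal like those described above, one can band-pass the signal to plausible cardiac frequencies and count peaks during a still window. This is a common-sense baseline with assumed filter settings, not one of the paper's three methods.

```python
# Generic heart-rate-from-motion baseline (not the paper's methods).
# Filter band, filter order, and minimum peak spacing are assumptions.
import numpy as np
from scipy.signal import butter, filtfilt, find_peaks

def heart_rate_bpm(signal: np.ndarray, fs: float) -> float:
    """Estimate beats per minute from a 1-D motion signal sampled at fs Hz."""
    b, a = butter(3, [0.7, 3.0], btype="bandpass", fs=fs)   # keep ~42-180 bpm
    cardiac = filtfilt(b, a, signal)
    peaks, _ = find_peaks(cardiac, distance=int(fs / 3))    # peaks >= 1/3 s apart
    duration_min = len(signal) / fs / 60.0
    return len(peaks) / duration_min

fs = 100.0                                   # e.g. a 100 Hz accelerometer axis
t = np.arange(0, 30, 1 / fs)
demo = 0.02 * np.sin(2 * np.pi * 1.2 * t) + 0.005 * np.random.randn(len(t))
print(round(heart_rate_bpm(demo, fs)))       # ~72 bpm for this synthetic signal
```

The paper's jerk-based threshold plays the role of deciding when such an estimate is trustworthy, i.e. when the wearer is still enough for the cardiac component to dominate the motion signal.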

IJCAI Conference 2016 Conference Paper

Driver Frustration Detection from Audio and Video in the Wild

  • Irman Abdić
  • Lex Fridman
  • Daniel McDuff
  • Erik Marchi
  • Bryan Reimer
  • Björn Schuller

We present a method for detecting driver frustration from both video and audio streams captured during the driver's interaction with an in-vehicle voice-based navigation system. The video is of the driver's face when the machine is speaking, and the audio is of the driver's voice when he or she is speaking. We analyze a dataset of 20 drivers that contains 596 audio epochs (audio clips, with duration from 1 sec to 15 sec) and 615 video epochs (video clips, with duration from 1 sec to 45 sec). The dataset is balanced across 2 age groups, 2 vehicle systems, and both genders. The model was subject-independently trained and tested using 4-fold cross-validation. We achieve an accuracy of 77.4% for detecting frustration from a single audio epoch and 81.2% for detecting frustration from a single video epoch. We then treat the video and audio epochs as a sequence of interactions and use decision fusion to characterize the trade-off between decision time and classification accuracy, which improved the prediction accuracy to 88.5% after 9 epochs.