Arrow Research search

Author name cluster

Daniel McDuff

Possible papers associated with this exact author name in Arrow. This page groups case-insensitive exact name matches and is not a full identity disambiguation profile.

19 papers
2 author rows

Possible papers

19

NeurIPS Conference 2025 Conference Paper

RADAR: Benchmarking Language Models on Imperfect Tabular Data

  • Ken Gu
  • Zhihan Zhang
  • Kate Lin
  • Yuwei Zhang
  • Akshay Paruchuri
  • Hong Yu
  • Mehran Kazemi
  • Kumar Ayush

Language models (LMs) are increasingly being deployed to perform autonomous data analyses. However, their data awareness—the ability to recognize, reason over, and appropriately handle data artifacts such as missing values, outliers, and logical inconsistencies—remains underexplored. These artifacts are especially common in real-world tabular data and, if mishandled, can significantly compromise the validity of analytical conclusions. To address this gap, we present RADAR, a benchmark for systematically evaluating data-aware reasoning on tabular data. We develop a framework to simulate data artifacts via programmatic perturbations to enable targeted evaluation of model behavior. RADAR comprises 2,980 table-query pairs, grounded in real-world data spanning 9 domains and 5 data artifact types. In addition to evaluating artifact handling, RADAR systematically varies table size to study how reasoning performance holds up as tables grow. Our evaluation reveals that, despite decent performance on tables without data artifacts, frontier models degrade significantly when data artifacts are introduced, exposing critical gaps in their capacity for robust, data-aware analysis. Designed to be flexible and extensible, RADAR supports diverse perturbation types and controllable table sizes, offering a valuable resource for advancing tabular reasoning.
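
The artifact-simulation idea above can be pictured with a short, purely illustrative sketch (this is not RADAR's code; the column names, rates, and outlier scale are made up): programmatic perturbations take a clean table and inject artifacts such as missing values and outliers before a model is asked to analyze it.

```python
# Illustrative sketch of programmatic table perturbation (not RADAR's implementation).
# Column names, perturbation rates, and the outlier scale are hypothetical.
import numpy as np
import pandas as pd

rng = np.random.default_rng(0)

def inject_missing(df: pd.DataFrame, column: str, rate: float = 0.10) -> pd.DataFrame:
    """Blank out a random fraction of one column."""
    out = df.copy()
    out.loc[rng.random(len(out)) < rate, column] = np.nan
    return out

def inject_outliers(df: pd.DataFrame, column: str, rate: float = 0.05, scale: float = 10.0) -> pd.DataFrame:
    """Multiply a random fraction of one numeric column by a large factor."""
    out = df.copy()
    mask = rng.random(len(out)) < rate
    out.loc[mask, column] = out.loc[mask, column] * scale
    return out

clean = pd.DataFrame({"age": rng.uniform(18, 90, size=200),
                      "income": rng.normal(50_000, 12_000, size=200)})
perturbed = inject_outliers(inject_missing(clean, "age"), "income")
```

A table-query pair in this spirit would present `perturbed` rather than `clean` to the model and check whether its answer acknowledges or corrects the injected artifacts.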

ICLR Conference 2025 Conference Paper

Scaling Wearable Foundation Models

  • Girish Narayanswamy
  • Xin Liu 0034
  • Kumar Ayush
  • Yuzhe Yang 0003
  • Xuhai Xu
  • Shun Liao
  • Jake Garrison
  • Shyam A. Tailor

Wearable sensors have become ubiquitous thanks to a variety of health tracking features. The resulting continuous and longitudinal measurements from everyday life generate large volumes of data. However, making sense of these observations for scientific and actionable insights is non-trivial. Inspired by the empirical success of generative modeling, where large neural networks learn powerful representations from vast amounts of text, image, video, or audio data, we investigate the scaling properties of wearable sensor foundation models across compute, data, and model size. Using a dataset of up to 40 million hours of in-situ heart rate, heart rate variability, accelerometer, electrodermal activity, skin temperature, and altimeter per-minute data from over 165,000 people, we create LSM, a multimodal foundation model built on the largest wearable-signals dataset with the most extensive range of sensor modalities to date. Our results establish the scaling laws of LSM for tasks such as imputation, interpolation and extrapolation across both time and sensor modalities. Moreover, we highlight how LSM enables sample-efficient downstream learning for tasks including exercise and activity recognition.

NeurIPS Conference 2025 Conference Paper

SensorLM: Learning the Language of Wearable Sensors

  • Yuwei Zhang
  • Kumar Ayush
  • Siyuan Qiao
  • A. Ali Heydari
  • Girish Narayanswamy
  • Max Xu
  • Ahmed Metwally
  • Jinhua Xu

We present SensorLM, a family of sensor-language foundation models that enable wearable sensor data understanding with natural language. Despite its pervasive nature, aligning and interpreting sensor data with language remains challenging due to the lack of paired, richly annotated sensor-text descriptions in uncurated, real-world wearable data. We introduce a hierarchical caption generation pipeline designed to capture statistical, structural, and semantic information from sensor data. This approach enabled the curation of the largest sensor-language dataset to date, comprising over 59.7 million hours of data from more than 103,000 people. Furthermore, SensorLM extends prominent multimodal pretraining architectures (e.g., CLIP, CoCa) and recovers them as specific variants within a generic architecture. Extensive experiments on real-world tasks in human activity analysis and healthcare verify the superior performance of SensorLM over state-of-the-art in zero-shot recognition, few-shot learning, and cross-modal retrieval. SensorLM also demonstrates intriguing capabilities including scaling behaviors, label efficiency, sensor captioning, and zero-shot generalization to unseen tasks. Code is available at https://github.com/Google-Health/consumer-health-research/tree/main/sensorlm.

NeurIPS Conference 2024 Conference Paper

MDAgents: An Adaptive Collaboration of LLMs for Medical Decision-Making

  • Yubin Kim
  • Chanwoo Park
  • Hyewon Jeong
  • Yik S. Chan
  • Xuhai Xu
  • Daniel McDuff
  • Hyeonhoon Lee
  • Marzyeh Ghassemi

Foundation models are becoming valuable tools in medicine. Yet despite their promise, the best way to leverage Large Language Models (LLMs) in complex medical tasks remains an open question. We introduce a novel multi-agent framework, named Medical Decision-making Agents (MDAgents), that helps to address this gap by automatically assigning a collaboration structure to a team of LLMs. The assigned solo or group collaboration structure is tailored to the medical task at hand, a simple emulation inspired by the way real-world medical decision-making processes are adapted to tasks of different complexities. We evaluate our framework and baseline methods using state-of-the-art LLMs across a suite of real-world medical knowledge and clinical diagnosis benchmarks, including a comparison of LLMs' medical complexity classification against human physicians. MDAgents achieved the best performance in seven out of ten benchmarks on tasks requiring an understanding of medical knowledge and multi-modal reasoning, showing a significant improvement of up to 4.2% (p < 0.05) compared to previous methods' best performances. Ablation studies reveal that MDAgents effectively determines medical complexity to optimize for efficiency and accuracy across diverse medical tasks. Notably, the combination of moderator review and external medical knowledge in group collaboration resulted in an average accuracy improvement of 11.8%. Our code can be found at https://github.com/mitmedialab/MDAgents.

ICML Conference 2024 Conference Paper

Position: Standardization of Behavioral Use Clauses is Necessary for the Adoption of Responsible Licensing of AI

  • Daniel McDuff
  • Tim Korjakow
  • Scott Cambo
  • Jesse Josua Benjamin
  • Jenny Lee
  • Yacine Jernite
  • Carlos Muñoz Ferrandis
  • Aaron Gokaslan

Growing concerns over negligent or malicious uses of AI have increased the appetite for tools that help manage the risks of the technology. In 2018, licenses with behavioral-use clauses (commonly referred to as Responsible AI Licenses) were proposed to give developers a framework for releasing AI assets while requiring their users to mitigate negative applications. As of the end of 2023, on the order of 40,000 software and model repositories have adopted responsible AI licenses. Notable models licensed with behavioral use clauses include BLOOM (language), LLaMA2 (language), Stable Diffusion (image), and GRID (robotics). This paper explores why and how these licenses have been adopted, and why and how they have been adapted to fit particular use cases. We use a mixed-methods methodology of qualitative interviews, clustering of license clauses, and quantitative analysis of license adoption. Based on this evidence we take the position that responsible AI licenses need standardization to avoid confusing users or diluting their impact. At the same time, customization of behavioral restrictions is also appropriate in some contexts (e.g., medical domains). We advocate for “standardized customization” that can meet users’ needs and can be supported via tooling.

TMLR Journal 2024 Journal Article

The Responsible Foundation Model Development Cheatsheet: A Review of Tools & Resources

  • Shayne Longpre
  • Stella Biderman
  • Alon Albalak
  • Hailey Schoelkopf
  • Daniel McDuff
  • Sayash Kapoor
  • Kevin Klyman
  • Kyle Lo

Foundation model development attracts a rapidly expanding body of contributors, scientists, and applications. To help shape responsible development practices, we introduce the Foundation Model Development Cheatsheet: a growing collection of 250+ tools and resources spanning text, vision, and speech modalities. We draw on a large body of prior work to survey resources (e.g. software, documentation, frameworks, guides, and practical tools) that support informed data selection, processing, and understanding; precise and limitation-aware artifact documentation; efficient model training; advance awareness of the environmental impact from training; careful model evaluation of capabilities, risks, and claims; as well as responsible model release, licensing, and deployment practices. We hope this curated collection of resources helps guide more responsible development. The process of curating this list enabled us to review the AI development ecosystem, revealing which tools are critically missing, misused, or over-used in existing practices. We find that (i) tools for data sourcing, model evaluation, and monitoring are critically under-serving ethical and real-world needs, (ii) evaluations for model safety, capabilities, and environmental impact all lack reproducibility and transparency, (iii) text and particularly English-centric analyses continue to dominate over multilingual and multi-modal analyses, and (iv) evaluation of systems, rather than just models, is needed so that capabilities and impact are assessed in context.

NeurIPS Conference 2023 Conference Paper

rPPG-Toolbox: Deep Remote PPG Toolbox

  • Xin Liu
  • Girish Narayanswamy
  • Akshay Paruchuri
  • Xiaoyu Zhang
  • Jiankai Tang
  • Yuzhe Zhang
  • Roni Sengupta
  • Shwetak Patel

Camera-based physiological measurement is a fast-growing field of computer vision. Remote photoplethysmography (rPPG) utilizes imaging devices (e.g., cameras) to measure the peripheral blood volume pulse (BVP) via photoplethysmography, and enables cardiac measurement via webcams and smartphones. However, the task is non-trivial, with important pre-processing, modeling, and post-processing steps required to obtain state-of-the-art results. Replication of results and benchmarking of new models are critical for scientific progress; however, as with many other applications of deep learning, reliable codebases are not easy to find or use. We present a comprehensive toolbox, rPPG-Toolbox, containing unsupervised and supervised rPPG models with support for public benchmark datasets, data augmentation, and systematic evaluation: https://github.com/ubicomplab/rPPG-Toolbox.
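
For readers new to rPPG, the underlying measurement principle can be illustrated with the classic green-channel baseline sketched below. This is a deliberate simplification for intuition only, not one of the toolbox's models: average the green pixel intensity over the (assumed skin) region in each frame, then read the pulse rate off the dominant frequency of that trace.

```python
# Classic green-channel rPPG baseline, shown only to illustrate the principle.
# The synthetic "video" and the cardiac band limits are illustrative assumptions.
import numpy as np

def green_trace(video: np.ndarray) -> np.ndarray:
    """Average green intensity per frame for a video of shape (T, H, W, 3)."""
    return video[..., 1].mean(axis=(1, 2))

def pulse_rate_bpm(trace: np.ndarray, fps: float) -> float:
    """Dominant frequency of the zero-mean trace within ~0.7-3 Hz, in beats per minute."""
    x = trace - trace.mean()
    freqs = np.fft.rfftfreq(len(x), d=1.0 / fps)
    power = np.abs(np.fft.rfft(x)) ** 2
    band = (freqs >= 0.7) & (freqs <= 3.0)   # 42-180 bpm
    return float(freqs[band][np.argmax(power[band])] * 60.0)

fps = 30.0
pulse = 0.05 * np.sin(2 * np.pi * 1.2 * np.arange(300) / fps)              # 1.2 Hz -> 72 bpm
video = np.random.rand(300, 72, 72, 3) + pulse[:, None, None, None] * np.array([0, 1, 0])
print(round(pulse_rate_bpm(green_trace(video), fps)))                      # ~72
```

Real pipelines add face detection, detrending, motion handling, and learned models, which is exactly the machinery the toolbox packages up.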

ICLR Conference 2023 Conference Paper

SimPer: Simple Self-Supervised Learning of Periodic Targets

  • Yuzhe Yang 0003
  • Xin Liu 0034
  • Jiang Wu
  • Silviu Borac
  • Dina Katabi
  • Ming-Zher Poh
  • Daniel McDuff

From human physiology to environmental evolution, important processes in nature often exhibit meaningful and strong periodic or quasi-periodic changes. Due to their inherent label scarcity, learning useful representations for periodic tasks with limited or no supervision is of great benefit. Yet, existing self-supervised learning (SSL) methods overlook the intrinsic periodicity in data, and fail to learn representations that capture periodic or frequency attributes. In this paper, we present SimPer, a simple contrastive SSL regime for learning periodic information in data. To exploit the periodic inductive bias, SimPer introduces customized augmentations, feature similarity measures, and a generalized contrastive loss for learning efficient and robust periodic representations. Extensive experiments on common real-world tasks in human behavior analysis, environmental sensing, and healthcare domains verify the superior performance of SimPer compared to state-of-the-art SSL methods, highlighting its intriguing properties including better data efficiency, robustness to spurious correlations, and generalization to distribution shifts.
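
One concrete way to picture the "customized augmentations" mentioned above is speed (frequency) resampling: playing a periodic signal back faster or slower shifts its dominant frequency while keeping its shape, producing views whose targets differ in a controlled way. The sketch below shows only that augmentation idea; it is an illustrative assumption, not SimPer's actual augmentation, similarity measure, or loss.

```python
# Illustrative speed/frequency augmentation for periodic signals (not SimPer's code).
import numpy as np

def speed_augment(x: np.ndarray, speed: float) -> np.ndarray:
    """Resample a 1-D signal so that, read at the original sampling rate, its dominant
    frequency is multiplied by `speed`. The output is shorter for speed > 1 and
    longer for speed < 1."""
    n = len(x)
    m = max(2, int(n / speed))
    positions = np.linspace(0, n - 1, m)   # read the whole signal with m samples
    return np.interp(positions, np.arange(n), x)

fs = 100.0
t = np.arange(0, 10, 1 / fs)
signal = np.sin(2 * np.pi * 1.2 * t)       # ~1.2 Hz, like a resting pulse
faster_view = speed_augment(signal, 1.5)   # ~1.8 Hz view of the same waveform
slower_view = speed_augment(signal, 0.75)  # ~0.9 Hz view
```

A periodicity-aware contrastive objective can then ask the learned features of such views to respect their known frequency relationships rather than treating every augmented view as an identical positive.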

IROS Conference 2022 Conference Paper

COMPASS: Contrastive Multimodal Pretraining for Autonomous Systems

  • Shuang Ma
  • Sai Vemprala
  • Wenshan Wang
  • Jayesh K. Gupta
  • Yale Song
  • Daniel McDuff
  • Ashish Kapoor

Learning representations that generalize across tasks and domains is challenging yet necessary for autonomous systems. Although task-driven approaches are appealing, designing models specific to each application can be difficult in the face of limited data, especially when dealing with highly variable multimodal input spaces arising from different tasks in different environments. We introduce the first general-purpose pretraining pipeline, COntrastive Multimodal Pretraining for AutonomouS Systems (COMPASS), to overcome the limitations of task-specific models and existing pretraining approaches. COMPASS constructs a multimodal graph by considering the essential information for autonomous systems and the properties of different modalities. Through this graph, multimodal signals are connected and mapped into two factorized spatio-temporal latent spaces: a “motion pattern space” and a “current state space.” By learning from multimodal correspondences in each latent space, COMPASS creates state representations that model necessary information such as temporal dynamics, geometry, and semantics. We pretrain COMPASS on a large-scale multimodal simulation dataset, TartanAir [1], and evaluate it on drone navigation, vehicle racing, and visual odometry tasks. The experiments indicate that COMPASS can tackle all three scenarios and can also generalize to unseen environments and real-world data. Our code implementation can be found at https://github.com/microsoft/COMPASS

AAAI Conference 2022 Conference Paper

DOC2PPT: Automatic Presentation Slides Generation from Scientific Documents

  • Tsu-Jui Fu
  • William Yang Wang
  • Daniel McDuff
  • Yale Song

Creating presentation materials requires complex multimodal reasoning skills to summarize key concepts and arrange them in a logical and visually pleasing manner. Can machines learn to emulate this laborious process? We present a novel task and approach for document-to-slide generation. Solving this involves document summarization, image and text retrieval, and slide structure to arrange key elements in a form suitable for presentation. We propose a hierarchical sequence-to-sequence approach to tackle our task in an end-to-end manner. Our approach exploits the inherent structures within documents and slides and incorporates paraphrasing and layout prediction modules to generate slides. To help accelerate research in this domain, we release a dataset of about 6K paired documents and slide decks used in our experiments. We show that our approach outperforms strong baselines and produces slides with rich content and aligned imagery.

NeurIPS Conference 2022 Conference Paper

SCAMPS: Synthetics for Camera Measurement of Physiological Signals

  • Daniel McDuff
  • Miah Wander
  • Xin Liu
  • Brian Hill
  • Javier Hernandez
  • Jonathan Lester
  • Tadas Baltrusaitis

The use of cameras and computational algorithms for noninvasive, low-cost and scalable measurement of physiological (e.g., cardiac and pulmonary) vital signs is very attractive. However, diverse data representing a range of environments, body motions, illumination conditions and physiological states is laborious, time-consuming and expensive to obtain. Synthetic data have proven a valuable tool in several areas of machine learning, yet are not widely available for camera measurement of physiological states. Synthetic data offer "perfect" labels (e.g., without noise and with precise synchronization), labels that may not be possible to obtain otherwise (e.g., precise pixel-level segmentation maps) and provide a high degree of control over variation and diversity in the dataset. We present SCAMPS, a dataset of synthetics containing 2,800 videos (1.68M frames) with aligned cardiac and respiratory signals and facial action intensities. The RGB frames are provided alongside segmentation maps and precise descriptive statistics about the underlying waveforms, including inter-beat interval, heart rate variability, and pulse arrival time. Finally, we present baseline results training on these synthetic data and testing on real-world datasets to illustrate generalizability.

ICLR Conference 2021 Conference Paper

Active Contrastive Learning of Audio-Visual Video Representations

  • Shuang Ma
  • Zhaoyang Zeng
  • Daniel McDuff
  • Yale Song

Contrastive learning has been shown to produce generalizable representations of audio and visual data by maximizing the lower bound on the mutual information (MI) between different views of an instance. However, obtaining a tight lower bound requires a sample size exponential in MI and thus a large set of negative samples. We can incorporate more samples by building a large queue-based dictionary, but there are theoretical limits to performance improvements even with a large number of negative samples. We hypothesize that random negative sampling leads to a highly redundant dictionary that results in suboptimal representations for downstream tasks. In this paper, we propose an active contrastive learning approach that builds an actively sampled dictionary with diverse and informative items, which improves the quality of negative samples and improves performance on tasks where there is high mutual information in the data, e.g., video classification. Our model achieves state-of-the-art performance on challenging audio and visual downstream benchmarks including UCF101, HMDB51 and ESC50.
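
For context on the "sample size exponential in MI" remark: the standard InfoNCE objective used throughout this line of work lower-bounds mutual information, and because the loss is non-negative the estimate can never exceed the log of the number of samples compared against. The bound below is the generic result (in the style of van den Oord et al.), not a formula specific to this paper.

```latex
% Generic InfoNCE bound: a critic f scores one positive pair (x, y)
% against N - 1 negatives y_j drawn from the same batch or dictionary.
\[
  \mathcal{L}_{\mathrm{InfoNCE}}
    = -\,\mathbb{E}\!\left[\log
        \frac{\exp f(x, y)}{\sum_{j=1}^{N} \exp f(x, y_j)}\right],
  \qquad
  I(X; Y) \;\ge\; \log N - \mathcal{L}_{\mathrm{InfoNCE}}.
\]
% The loss is non-negative, so the bound saturates at log N; tightly estimating a
% large MI therefore needs on the order of e^{I(X;Y)} negatives, which is why large
% queue-based dictionaries -- and better-chosen negatives -- matter.
```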

NeurIPS Conference 2021 Conference Paper

Contrastive Learning of Global and Local Video Representations

  • Shuang Ma
  • Zhaoyang Zeng
  • Daniel McDuff
  • Yale Song

Contrastive learning has delivered impressive results for various tasks in the self-supervised regime. However, existing approaches optimize for learning representations specific to downstream scenarios, i.e., global representations suitable for tasks such as classification or local representations for tasks such as detection and localization. While they produce satisfactory results in the intended downstream scenarios, they often fail to generalize to tasks that they were not originally designed for. In this work, we propose to learn video representations that generalize to both the tasks which require global semantic information (e.g., classification) and the tasks that require local fine-grained spatio-temporal information (e.g., localization). We achieve this by optimizing two contrastive objectives that together encourage our model to learn global-local visual information given audio signals. We show that the two objectives mutually improve the generalizability of the learned global-local representations, significantly outperforming their disjointly learned counterparts. We demonstrate our approach on various tasks including action/sound classification, lipreading, deepfake detection, event and sound localization.

JBHI Journal 2021 Journal Article

Guest Editorial: Camera-Based Monitoring for Pervasive Healthcare Informatics

  • Wenjin Wang
  • Steffen Leonhardt
  • Lionel Tarassenko
  • Caifeng Shan
  • Daniel McDuff

The papers in this special section focus on camera-based monitoring for pervasive healthcare informatics. Measuring physiological signals from the human face and body using video cameras is an emerging research topic that has grown rapidly in the last decade. Remote cameras (in both visible and infrared wavelengths) can be used to measure vital signs from a human body based on skin optics or body movements, thereby avoiding mechanical contact with the skin. Camera-based health monitoring will bring a rich set of compelling healthcare applications that directly improve upon contact-based monitoring solutions and impact people’s care experience and quality of life in various scenarios, such as in hospital care units, sleep/senior centers, assisted-living homes, telemedicine and e-health, baby/elderly care at home, fitness and sports, driver monitoring in automotive applications, cardiac/respiratory gating for MRI/CT, AR/VR-based therapy and clinical training, etc.

ICRA Conference 2021 Conference Paper

Modeling Affect-based Intrinsic Rewards for Exploration and Learning

  • Dean Zadok
  • Daniel McDuff
  • Ashish Kapoor

Positive affect has been linked to increased interest, curiosity and satisfaction in human learning. In reinforcement learning, extrinsic rewards are often sparse and difficult to define; intrinsically motivated learning can help address these challenges. We argue that positive affect is an important intrinsic reward that effectively helps drive exploration that is useful in gathering experiences. We present a novel approach leveraging a task-independent reward function trained on spontaneous smile behavior that reflects the intrinsic reward of positive affect. To evaluate our approach, we trained several downstream computer vision tasks on data collected with our policy and several baseline methods. We show that the policy based on our affective rewards successfully increases the duration of episodes and the area explored, and reduces collisions. The result is faster learning for several downstream computer vision tasks.

NeurIPS Conference 2020 Conference Paper

Multi-Task Temporal Shift Attention Networks for On-Device Contactless Vitals Measurement

  • Xin Liu
  • Josh Fromm
  • Shwetak Patel
  • Daniel McDuff

Telehealth and remote health monitoring have become increasingly important during the SARS-CoV-2 pandemic, and it is widely expected that this will have a lasting impact on healthcare practices. These tools can help reduce the risk of exposing patients and medical staff to infection, make healthcare services more accessible, and allow providers to see more patients. However, objective measurement of vital signs is challenging without direct contact with a patient. We present a video-based and on-device optical cardiopulmonary vital sign measurement approach. It leverages a novel multi-task temporal shift convolutional attention network (MTTS-CAN) and enables real-time cardiovascular and respiratory measurements on mobile platforms. We evaluate our system on an Advanced RISC Machine (ARM) CPU and achieve state-of-the-art accuracy while running at over 150 frames per second, which enables real-time applications. Systematic experimentation on large benchmark datasets reveals that our approach leads to substantial (20%-50%) reductions in error and generalizes well across datasets.
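
The "temporal shift" ingredient named in MTTS-CAN comes from the temporal shift module family of video architectures: part of the channels in each frame are swapped with the previous or next frame so that ordinary 2-D convolutions can see short-range temporal context. The sketch below shows that core operation only; the channel fractions, tensor layout, and zero padding are common defaults used for illustration, not the paper's implementation.

```python
# Minimal temporal-shift operation (TSM-style), for illustration only.
import numpy as np

def temporal_shift(frames: np.ndarray, fold_div: int = 8) -> np.ndarray:
    """Shift 1/fold_div of the channels one frame forward and one frame backward.

    `frames` has shape (T, H, W, C); shifted-out positions are zero-padded.
    """
    t, h, w, c = frames.shape
    fold = c // fold_div
    out = np.zeros_like(frames)
    out[:-1, ..., :fold] = frames[1:, ..., :fold]                   # pull from the next frame
    out[1:, ..., fold:2 * fold] = frames[:-1, ..., fold:2 * fold]   # pull from the previous frame
    out[..., 2 * fold:] = frames[..., 2 * fold:]                    # leave the rest untouched
    return out

clip = np.random.rand(16, 36, 36, 32).astype(np.float32)   # 16 frames, 36x36 pixels, 32 channels
mixed = temporal_shift(clip)   # each frame now carries information from its neighbours
```

Because the shift itself costs no parameters and almost no compute, it suits the on-device, real-time constraint the abstract emphasizes.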

NeurIPS Conference 2019 Conference Paper

Characterizing Bias in Classifiers using Generative Models

  • Daniel McDuff
  • Shuang Ma
  • Yale Song
  • Ashish Kapoor

Models that are learned from real-world data are often biased because the data used to train them is biased. This can propagate systemic human biases that exist and ultimately lead to inequitable treatment of people, especially minorities. To characterize bias in learned classifiers, existing approaches rely on human oracles labeling real-world examples to identify the "blind spots" of the classifiers; these are ultimately limited due to the human labor required and the finite nature of existing image examples. We propose a simulation-based approach for interrogating classifiers using generative adversarial models in a systematic manner. We incorporate a progressive conditional generative model for synthesizing photo-realistic facial images and Bayesian Optimization for an efficient interrogation of independent facial image classification systems. We show how this approach can be used to efficiently characterize racial and gender biases in commercial systems.

JBHI Journal 2019 Journal Article

Wearable Motion-Based Heart Rate at Rest: A Workplace Evaluation

  • Javier Hernandez
  • Daniel McDuff
  • Karen Quigley
  • Pattie Maes
  • Rosalind W. Picard

This paper studies the feasibility of using low-cost motion sensors to provide opportunistic heart rate assessments from ballistocardiographic signals during restful periods of daily life. Three wearable devices were used to capture peripheral motions at specific body locations (head, wrist, and trouser pocket) of 15 participants during five regular workdays each. Three methods were implemented to extract heart rate from motion data, and their performance was compared to that obtained with an FDA-cleared device. With a total of 1358 h of naturalistic sensor data, our results show that providing accurate heart rate estimations from peripheral motion signals is possible during relatively “still” moments. In our real-life workplace study, the head-mounted device yielded the most frequent assessments (22.98% of the time under 5 beats per minute of error) followed by the smartphone in the pocket (5.02%) and the wrist-worn device (3.48%). Most importantly, accurate assessments were automatically detected by using a custom threshold based on the device jerk. Due to the pervasiveness and low cost of wearable motion sensors, this paper demonstrates the feasibility of providing opportunistic large-scale low-cost samples of resting heart rate.
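
As a generic illustration of how a resting heart rate can be read out of a motion (ballistocardiographic) signal like those described above, one can band-pass the signal to plausible cardiac frequencies and count peaks during a still window. This is a common-sense baseline with assumed filter settings, not one of the paper's three methods.

```python
# Generic heart-rate-from-motion baseline (not the paper's methods).
# Filter band, filter order, and minimum peak spacing are assumptions.
import numpy as np
from scipy.signal import butter, filtfilt, find_peaks

def heart_rate_bpm(signal: np.ndarray, fs: float) -> float:
    """Estimate beats per minute from a 1-D motion signal sampled at fs Hz."""
    b, a = butter(3, [0.7, 3.0], btype="bandpass", fs=fs)   # keep ~42-180 bpm
    cardiac = filtfilt(b, a, signal)
    peaks, _ = find_peaks(cardiac, distance=int(fs / 3))    # peaks >= 1/3 s apart
    duration_min = len(signal) / fs / 60.0
    return len(peaks) / duration_min

fs = 100.0                                   # e.g. a 100 Hz accelerometer axis
t = np.arange(0, 30, 1 / fs)
demo = 0.02 * np.sin(2 * np.pi * 1.2 * t) + 0.005 * np.random.randn(len(t))
print(round(heart_rate_bpm(demo, fs)))       # ~72 bpm for this synthetic signal
```

The paper's jerk-based threshold plays the role of deciding when such an estimate is trustworthy, i.e. when the wearer is still enough for the cardiac component to dominate the motion signal.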

IJCAI Conference 2016 Conference Paper

Driver Frustration Detection from Audio and Video in the Wild

  • Irman Abdić
  • Lex Fridman
  • Daniel McDuff
  • Erik Marchi
  • Bryan Reimer
  • Björn Schuller

We present a method for detecting driver frustration from both video and audio streams captured during the driver's interaction with an in-vehicle voice-based navigation system. The video is of the driver's face when the machine is speaking, and the audio is of the driver's voice when he or she is speaking. We analyze a dataset of 20 drivers that contains 596 audio epochs (audio clips, with duration from 1 sec to 15 sec) and 615 video epochs (video clips, with duration from 1 sec to 45 sec). The dataset is balanced across 2 age groups, 2 vehicle systems, and both genders. The model was subject-independently trained and tested using 4-fold cross-validation. We achieve an accuracy of 77.4% for detecting frustration from a single audio epoch and 81.2% for detecting frustration from a single video epoch. We then treat the video and audio epochs as a sequence of interactions and use decision fusion to characterize the trade-off between decision time and classification accuracy, which improved the prediction accuracy to 88.5% after 9 epochs.