BigBio: A Framework for Data-Centric Biomedical Natural Language Processing

Jason Fries; Leon Weber; Natasha Seelam; Gabriel Altay; Debajyoti Datta; Samuele Garda; Sunny Kang; Rosaline Su; Wojciech Kusa; Samuel Cahyawijaya; Fabio Barth; Simon Ott; Matthias Samwald; Stephen Bach; Stella Biderman; Mario Sänger; Bo Wang; Alison Callahan; Daniel León Periñán; Théo Gigant; Patrick Haller; Jenny Chim; Jose Posada; John Giorgi; Karthik Rangasai Sivaraman; Marc Pàmies; Marianna Nezhurina; Robert Martin; Michael Cullan; Moritz Freidank; Nathan Dahlberg; Shubhanshu Mishra; Shamik Bose; Nicholas Broad; Yanis Labrak; Shlok Deshmukh; Sid Kiblawi; Ayush Singh; Minh Chien Vu; Trishala Neeraj; Jonas Golde; Albert Villanova del Moral; Benjamin Beilharz

NeurIPS 2022

Conference Paper · Datasets and Benchmarks Track · Artificial Intelligence · Machine Learning

Abstract

Training and evaluating language models increasingly requires the construction of meta-datasets: diverse collections of curated data with clear provenance. Natural language prompting has recently led to improved zero-shot generalization by transforming existing, supervised datasets into a variety of novel instruction tuning tasks, highlighting the benefits of meta-dataset curation. While successful in general-domain text, translating these data-centric approaches to biomedical language modeling remains challenging, as labeled biomedical datasets are significantly underrepresented in popular data hubs. To address this challenge, we introduce BigBio, a community library of 126+ biomedical NLP datasets, currently covering 13 task categories and 10+ languages. BigBio facilitates reproducible meta-dataset curation via programmatic access to datasets and their metadata, and is compatible with current platforms for prompt engineering and end-to-end few/zero-shot language model evaluation. We discuss our process for task schema harmonization, data auditing, and contribution guidelines, and outline two illustrative use cases: zero-shot evaluation of biomedical prompts and large-scale, multi-task learning. BigBio is an ongoing community effort and is available at https://github.com/bigscience-workshop/biomedical

Context

Venue: Annual Conference on Neural Information Processing Systems
Paper id: 966254487975706670
