Specialized Healthsheet for Healthcare Datasets
Negar Rostamzadeh, Subhrajit Roy, Diana Mincu, Andrew Smart, Lauren Wilcox, Mahima Pushkarna, Razvan Amironesei, Jessica Schrouff, Madeleine Elish, Nyalleng Moorosi, Berk Ustun, Noah Broestl, Katherine Heller
Abstract: Machine learning (ML) approaches have shown promising results in a variety of healthcare applications. Data plays a vital role in the development of ML-based healthcare systems that directly impact human lives. Many of the ethical issues with healthcare applications of ML can be traced back to structural inequalities that are reflected in the way we collect and process data. Developing guidelines for improving documentation practices in the creation, use, and maintenance of ML healthcare datasets is therefore of critical importance. In this work, we introduce Healthsheet, an adaptation and expansion of the original datasheet questionnaire for healthcare-specific applications. Healthsheet addresses the collection and use of sensitive attributes, dataset versioning and maintenance, privacy, data collection context, and health-related devices. As part of the development process of Healthsheet, we worked with three publicly available healthcare datasets as case studies, each with a different type of structured data: Electronic Health Records (EHR), multiple sclerosis (MS) clinical trial data, and smartphone-based performance outcome measures.