The Dataset Nutrition Label enhances the context, contents, and legibility of datasets. Drawing on the analogy of the Nutrition Facts Label on food, the Label highlights the ‘ingredients’ of a dataset to help shed light on whether the dataset is healthy for a particular statistical use case. The goal of the Label is to mitigate harms caused by statistical systems (automated decision-making systems, artificial intelligence, advanced analytics) by providing at-a-glance information about the dataset that is mapped to a set of common use cases.
The Dataset Nutrition Label is intended to be leveraged by both dataset owners and data practitioners to inform conversations about dataset quality. For dataset owners, the Label provides standardized scaffolding in the form of questions and processes to surface relevant information about a dataset, particularly information about intended use or potential use. For data practitioners, the Label acts as a framework that brings transparency to a dataset along several axes, ultimately supporting the decision-making process about whether to use a dataset and for what use. By bringing further transparency to the dataset publishing and selection processes, we hope to enable more intentional dataset usage, thus driving higher quality models.
In late 2020, the Data Nutrition Project team published our most recent methodology alongside a set of Label prototypes built in collaboration with both dataset owners and subject matter experts.
The Dataset Nutrition Label is a standardized framework for assessing dataset quality. The Label standard is a project of the Data Nutrition Project, a non-profit group that creates tools and practices around dataset quality to encourage responsible model development.
We believe that algorithm developers want to build responsible and smart statistical models, but that a key step is missing from the standard way these models are built: interrogating the dataset for imbalances or other problems, and ascertaining whether it is the right dataset for the model.
Similar to the FDA’s nutrition label for food, the Dataset Nutrition Label aims to highlight the key ingredients in a dataset in addition to qualitative information that describes the dataset and its composition, collection, and management. The Dataset Nutrition Label also includes Alerts about the dataset that are relevant for particular intended modeling objectives. Data scientists can leverage the Dataset Nutrition Label to make better, informed decisions about which datasets to use for their specific use cases, thus driving better statistical models and artificial intelligence.
The Dataset Nutrition Label highlights business and research questions, or Use Cases, for which the dataset may be relevant. The Use Cases included in our current prototypes were identified by the Data Nutrition Project alongside subject matter experts and the original dataset owners.
The overview badges highlight a standard set of critical information about every dataset in a way that is immediately relevant and comprehensible. The icons serve as shorthand for binary and non-binary answers covered more deeply in the Dataset Info pane.
The badges include:
On the Dataset Nutrition Label, a Modeling Objective indicates a statistical task that the data can be leveraged to address. For example, the ISIC 2018 dataset could be leveraged to train a statistical model toward the Modeling Objective ‘Identify diagnosis in lesion images’.
The Data Nutrition Project team worked closely with subject matter experts and dataset owners to identify the most relevant or common Use Cases and Modeling Objectives for each dataset. These are not meant to be exhaustive, but rather indicative of the most common known or intended uses for the data. We recommend that data practitioners consult with subject matter experts to investigate the best way to approach Use Cases and Objectives not included on the Label.
Alerts are dataset notifications that Label users may wish to take into account when they are using the dataset (or deciding whether to use the dataset). These highlight issues, restrictions, and other relevant information about the data that might not be obvious to someone unfamiliar with the dataset.
Each Alert is presented on a color scale to indicate whether there is a known and accessible method to mitigate the content of the Alert. For example, missing data cannot be mitigated (no known mitigation - red), but usage restrictions can be mitigated by following the rules of the license (mitigation known - yellow).
There are two types of Alerts: FYI-only, which are generally auto-generated from answers to questions in the Dataset Info section, and Alerts with various mitigation strategies that are tied to specific Predictions. The creator of the Dataset Nutrition Label is responsible for identifying and documenting Alerts, Modeling Objectives, and Use Cases. For the prototype Labels, the Data Nutrition Project worked with the dataset owners to create this information.
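To make the distinctions above concrete, here is a minimal sketch of how an Alert might be represented in code. The field names, class, and example values are illustrative assumptions, not part of the official Label specification; the color rule simply encodes the scale described above (red when no mitigation is known, yellow when one is documented).

```python
from dataclasses import dataclass
from typing import Optional

@dataclass
class Alert:
    """Illustrative Alert record; fields are hypothetical, not the official spec."""
    title: str
    description: str
    fyi_only: bool = False            # True if auto-generated from Dataset Info answers
    mitigation: Optional[str] = None  # known mitigation strategy, if any

    @property
    def severity_color(self) -> str:
        """Red when no mitigation is known, yellow when one is documented."""
        return "yellow" if self.mitigation else "red"

# A restriction with a known mitigation (follow the license) -> yellow
license_alert = Alert(
    title="Usage restrictions",
    description="Dataset is released under a restrictive license (example).",
    mitigation="Follow the terms of the license for any downstream use.",
)

# Missing data with no known mitigation -> red
missing_data_alert = Alert(
    title="Missing data",
    description="Some records lack values for key fields (example).",
)
```

A real Label would also tie mitigation-type Alerts to specific Predictions and Modeling Objectives; this sketch only captures the color scale and the FYI/mitigation distinction.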
No. There’s no such thing as a perfect dataset! The purpose of the Alerts is to drive awareness of known issues so that data practitioners can address these issues as they see fit. Our hope is that users of the Label will leverage these Alerts to compare datasets in similar domains, make informed decisions about which dataset to use for which purpose, and help drive mitigation strategies to limit the harm of known issues on model quality and output.
There are many types of Alerts. Some are general (e.g. license information), while others are specific to particular communities of people (e.g. individual-level data including gender or race information). In the latter case, we use the term "Potentials for Harm" to categorize which communities or domains are particularly relevant to the Alert content. We hope that practitioners will pay special attention to these indicators when they are building statistical models that affect people.
You can think of the Dataset Info section as the "ingredients" of the dataset. We believe this information can help data practitioners determine whether to leverage the dataset for a particular use case, and if so, how to use it (and ways it should not be used). The information is organized in several categories: Description, Composition, Provenance, Collection, and Management. The questions are drawn from the insightful work of many teams, most centrally Datasheets for Datasets. We also drew from work published by AI Global, data.world, and DrivenData, and received feedback from colleagues at the Department of Education, AI Global, and Memorial Sloan Kettering.
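As a rough illustration of how the Dataset Info "ingredients" are organized, the structure below groups question-and-answer pairs under the five categories named above. The specific questions and answers are invented examples for illustration, not the Label's actual question set.

```python
# Hypothetical Dataset Info record; category names come from the Label's
# methodology, but the questions and answers here are illustrative only.
dataset_info = {
    "Description": {
        "What is the dataset about?": "Example: dermoscopic images of skin lesions.",
    },
    "Composition": {
        "What does each record contain?": "Example: one image plus a diagnosis field.",
    },
    "Provenance": {
        "Who created the dataset?": "Example: a clinical research consortium.",
    },
    "Collection": {
        "How was the data collected?": "Example: aggregated from partner clinics.",
    },
    "Management": {
        "How is the dataset maintained?": "Example: versioned annual releases.",
    },
}

# The five categories act as the top-level sections of the Dataset Info pane.
categories = list(dataset_info)
```

In practice, a Label creator would answer the full question set for each category; the point of the structure is that each answer lives under exactly one of these five sections.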
The Data Nutrition Project is working on building tools that will facilitate the process of creating a Label. In the meantime, we encourage you to read our paper about the methodology (so you can create your own or something similar!), or get in touch to talk about a possible collaboration.
Please don’t be shy! You can contact us at email@example.com.