The Data Nutrition Project

Empowering data scientists and policymakers with practical tools to improve AI outcomes

Our Mission

We believe that technology should help us move forward without mirroring societal biases.

The Data Nutrition Project team:

  1. Creates tools and practices that encourage responsible AI development
  2. Partners across disciplines to drive broader change
  3. Builds inclusion and equity into our work

Want to get involved? Contact Us!

The Problem

Garbage in, Garbage out

Incomplete, misunderstood, and historically problematic data can negatively influence AI algorithms.

Algorithms matter, and so does the data they’re trained on. To improve the accuracy and fairness of algorithms that determine everything from navigation directions to mortgage approvals, we need to make it easier for practitioners to quickly assess the viability and fitness of datasets they intend to train AI algorithms on.

There’s a missing step in the AI development pipeline: assessing datasets based on standard quality measures that are both qualitative and quantitative. We are working on packaging these measures into an easy-to-use Dataset Nutrition Label.


The Dataset Nutrition Label

A "nutrition label" for datasets.

The Data Nutrition Project aims to create a standard label for interrogating datasets.

Our belief is that deeper transparency into dataset health can lead to better data decisions, which in turn lead to better AI.

Founded in 2018 through the Assembly Fellowship, The Data Nutrition Project takes inspiration from nutritional labels on food, aiming to build labels that highlight the key ingredients in a dataset such as metadata and demographic representation, as well as unique or anomalous features regarding distributions, missing data, and comparisons to other "ground truth" datasets.
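As a rough illustration of the kind of quantitative "ingredients" mentioned above, the sketch below computes a missing-data rate and a simple representation breakdown for one column. It is not the project's code or methodology: the function names, the toy records, and the choice to treat `None` and empty strings as missing are all assumptions made for the example.

```python
from collections import Counter

def missing_rate(records, column):
    """Fraction of records where `column` is absent, None, or an empty string."""
    if not records:
        return 0.0
    missing = sum(1 for r in records if r.get(column) in (None, ""))
    return missing / len(records)

def representation(records, column):
    """Share of each observed value of `column`, ignoring missing entries."""
    counts = Counter(
        r[column] for r in records if r.get(column) not in (None, "")
    )
    total = sum(counts.values())
    return {value: n / total for value, n in counts.items()} if total else {}

# Hypothetical toy records; a real dataset would be loaded from storage.
records = [
    {"age_group": "18-34", "income": 42000},
    {"age_group": "18-34", "income": None},
    {"age_group": "65+", "income": 31000},
    {"age_group": None, "income": 58000},
]

print(missing_rate(records, "income"))
print(representation(records, "age_group"))
```

Numbers like these only hint at a dataset's fitness for a task; they would sit alongside qualitative context rather than replace it.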

Building on the modular framework first presented in our 2018 prototype and refined in our second generation Label (2020), and drawing on feedback from data scientists and dataset owners, we have adjusted the Label to support a common user journey: a data scientist looking for a dataset with a particular purpose in mind. The third generation Dataset Nutrition Label now provides information about a dataset including its intended use and other known uses; the process of cleaning, managing, and curating the data; ethical and/or technical reviews; the inclusion of subpopulations in the dataset; and a series of potential risks or limitations in the dataset. You may also want to read here about the second generation (2020) label, which informed the third generation label.

Third Generation Dataset Nutrition Label (2022)
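
To make the categories of information listed for the third generation label more concrete, here is a minimal sketch of how they might be held in a record. This is a hypothetical structure for illustration only: the class name, field names, and the `empty_sections` helper are assumptions and do not reflect the project's actual label schema or label maker.

```python
from dataclasses import dataclass, field, fields
from typing import List

@dataclass
class LabelSketch:
    """Hypothetical container mirroring the label's information categories."""
    dataset_name: str
    intended_use: str = ""
    other_known_uses: List[str] = field(default_factory=list)
    curation_process: str = ""  # cleaning, managing, and curating the data
    reviews: List[str] = field(default_factory=list)  # ethical and/or technical
    subpopulations: List[str] = field(default_factory=list)
    risks_and_limitations: List[str] = field(default_factory=list)

    def empty_sections(self) -> List[str]:
        """Names of sections that have not been filled in yet."""
        return [f.name for f in fields(self) if not getattr(self, f.name)]

# Hypothetical, partially completed label.
label = LabelSketch(
    dataset_name="example-survey",
    intended_use="Exploratory analysis of survey responses",
    risks_and_limitations=["Small sample for some age groups"],
)

print(label.empty_sections())
```

Listing unfilled sections is one simple way a documentation workflow could flag gaps before a dataset is shared.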

Research

Published and Related Works

DNP is a research organization as well as a product development team.

Alongside development of the tool, we have been doing ongoing research into the broader landscape of tools and practices designed to address problems in underlying data, whether due to the data itself, the data collection practices, or the dataset documentation.

We take inspiration from related initiatives such as Datasheets for Datasets [Gebru et al], A Nutrition Label for Rankings [Yang et al], and Data Statements for Natural Language Processing [Bender, Friedman]. We have been heartened to see that this area of work has inspired related projects at some of the large platforms and research initiatives, including Apple’s Privacy Labels, Google’s Model Cards, IBM’s AI FactSheets 360, and the Partnership on AI’s About ML Project. For more information about the broad and growing landscape of research related to bias in AI, we recommend the excellent [Morley et al] and [Mehrabi et al] papers, both of which give useful overviews of methods related to bias and fairness.

Recent Publications

The CLeAR Documentation Framework for AI Transparency, Harvard Kennedy School Report (2024)
Quality Measures for Humanitarian Data, in collaboration with the United Nations Humanitarian Data Exchange (2023)
Comment: FTC Trade Regulation Rule on Commercial Surveillance and Data Security, in collaboration with Berkman Klein Center at Harvard University (2022)
The Dataset Nutrition Label (2nd Gen): Leveraging Context to Mitigate Harms in Artificial Intelligence, presented at NeurIPS 2020: Workshop on Dataset Curation and Security (2020)

Engagement

Projects, Events and Collaborations

Recent Engagements

Innovation in Regulatory Science awardee, The Burroughs Wellcome Fund (2023)
Awarded for developing an independent audit framework for artificial intelligence in medicine in collaboration with Dr. Rotemberg from Memorial Sloan Kettering Cancer Center
Infrastructure Grant Awardee, Mozilla Foundation (2023)
Awarded for explorations of the AI auditing landscape through convening experts in a closed-door, facilitated session
Putting Science into Standards (PSIS) Program, the European Commission’s Joint Research Centre (JRC), CEN, and CENELEC (2022)
Participation in programming around dataset standards
Digital Humanity Award, Prix Ars Electronica (2022)
International arts-science honor awarded to the Data Nutrition Project for the design and release of the second generation of the Dataset Nutrition Label

Collaboration with ASL Citizen Dataset Team

Services

Demystifying how AI Perpetuates Systemic Biases

We believe that building artificial intelligence is as much about learning as it is about technical implementation.

To that end, the Data Nutrition Project offers certification, consulting, and educational services that address data quality and transparency. We offer these services to organizations of all sizes and types, from small technical teams to larger, non-technical communities. Our services include:

workshop presentation
Photo Credit: Jess Benjamin

Label-as-a-Service

Creating nutrition labels for datasets helps ensure thoughtful usage of public data. It also serves as a form of transparency and trust-building with the public when releasing products built on proprietary data. If you would like to create a Dataset Nutrition Label for a proprietary dataset, DNP can work with you to build a certified Dataset Nutrition Label.

Data System Consulting

When documentation is created at the end of an algorithmic decision-making process, you run the risk of discovering too late that the data used may present unintended risks. One way to address this issue is to incorporate the concepts behind the Dataset Nutrition Labeling process into your dataset definition and collection processes. If you build these concepts into your process, you will end up with better-quality data and will be able to produce documentation quickly and easily. We offer strategic consulting to help organizations and teams sustainably embed these responsible data practices into their product development.

Professional Development Workshops

When building artificial intelligence systems, it is important to understand both the technical tools required and the social context in which the system sits. Through our educational offerings, the Data Nutrition Project trains organizations of all sizes — from small technical teams to large, non-technical communities — to understand and approach AI from a sociotechnical perspective.



If you'd like to find out more about our services, please reach out! Contact Us

Our Team

We are a group of researchers and technologists working together to tackle the challenges of ethics and governance of Artificial Intelligence.

Kasia Chmielinski

Project Lead
Technologist and product leader focused on building data-driven systems. Current Digital Civil Society Practitioner Fellow (Stanford University) and Affiliate at the Berkman Klein Center (Harvard University). Previously at McKinsey & Company, the US Digital Service, MIT Media Lab, and Google. Native Bostonian, enthusiastic cyclist. Avid bird-watcher.

Sarah Newman

Research Lead
Director of Art & Education at metaLAB at Harvard. Interested in interrelations within complex systems. Creates interactive art installations that explore social and cultural dimensions of new tech; runs critical and creative workshops on AI. Persuaded by the power of metaphors. Avid sheller.

Matt Taylor

Data Science & Workshop Facilitation
Freelance learning experience designer and facilitator, with a background in AI implementation. Previously worked as an engineer in natural language processing, moderation tool development, and creative coding platform development. Currently creating learning experiences in STEAM for young people, and demystifying AI for all people. Also spends time developing tech tools for mutual aid orgs, and organizing tech workers for social justice. Seasoned pun specialist.

Jessica Yurkofsky

Design Research Collaborator
Designer, technologist, and librarian focused on visual communication and experimental pedagogy. Principal at Harvard's metaLAB, with a background in Sociology and Urban Planning. Lives in the woods in Vermont. Dedicated builder of cardboard models and drawer of cartoons.

Chris Kranzinger

Data Science Advisor
Data Scientist, Economist, and ML enthusiast combining data science and economics to inform strategic decision making and study questions around trust and safety in AI. Former McCloy Fellow at Harvard and current Sr. Applied Scientist at Uber with a background in Engineering and Economics.

Carine Teyrouz

UX Design Collaborator
NN/g certified product designer, UX mentor & facilitator. Holds a master’s degree in design & web project management. Currently working with startups & SMEs on applying design thinking methodologies to create intuitive and easy-to-use digital products.

HG King

Software Engineering Collaborator
NYC-based software engineer and consultant, with a focus on driving innovation and business value using an Agile and scrappy approach to build products with clients. Occasionally helps artists with technology. Interested in technology and sustainability, specifically how technology is connected to modern social issues in the form of problems or solutions.

Audrey Chang

Research Collaborator
Undergraduate studying statistics and sociology at Harvard to advocate for responsible AI-related innovation, with special consideration to combatting technology’s reproduction of current societal inequities. Previously, researched cancer biology at Stanford and materials science at Harvard. Keen on exploring the design of physical and social spaces. Crafty naturalist. Bay Area native.

Kemi Thomas

Software Engineering Collaborator
Full-stack engineer passionate about building REST API applications and making people’s lives easier. Primary focus in the NERD stack (Node.js, Express, React, Databases using SQL), but open to other technologies. Background in journalism and associate production, most recently at a top 20 news station.

Collaborators


Dr. Veronica Rotemberg

Research Collaborator
Dr. Veronica Rotemberg is a dermatologist at Memorial Sloan Kettering Cancer Center (MSK). Dr. Rotemberg directs the imaging informatics program in the dermatology service and sees patients at high risk for skin cancer. She received her MD-PhD from Duke University with a PhD in biomedical engineering focusing on elasticity imaging. She leads the AI working group for the International Skin Imaging Collaboration and is especially interested in imaging standards and evaluating challenges and biases of AI when applied to clinical settings.

Responsible AI Institute

Research Collaborator
Responsible AI Institute is a non-profit building tangible governance tools to address growing concerns about AI. Their mission is to catalyze the practical and responsible design, development, and use of AI. Their tools have been among the first to demonstrate how to turn responsible AI principles into action. Bringing extensive experience in responsible AI policy and the development of AI systems for industry, Responsible AI Institute is uniquely positioned to partner with organizations across public and private sectors to guide and inform responsible AI governance around the world.

Michael Sherman

Children's Book Illustrator
Artist, illustrator, and educator in NYC, focusing on the nexus of individuality and community. Rhode Island School of Design graduate, Cill Rialaig Arts Centre resident, Lower Manhattan Cultural Council grantee, and Northwest Review contributor. His current project, “Meta-morphic: a series of 1000 head drawings,” uses drawing as a tool to examine the stewardship and ownership of images and icons. Also known as papa to his two young kids.

Former Collaborators

Humanity Innovation Labs

UX Design Collaborator

JustFix.nyc

Label Research Collaborator

Alums

Sarah Holland

Research & Public Policy

Ahmed Hosny

Data Science

Josh Joseph

AI Research

Erica Luzzi

Research & Design Collaborator

Serena Oduro

AI Policy Collaborator

Chelsea Qiu

Research & Design

Frequently Asked Questions

A few questions you might have

Q. What inspired this project?

We believe that engineers want to build responsible and smart AI models, but that there is a key step missing in the way these models are built. This step is to interrogate the dataset for a variety of imbalances or problems it may have, and ascertain if it is the right dataset for the model. We are inspired by the FDA's Nutrition Facts label in that it provides basic yet powerful facts that highlight issues in an accessible way. We aspire to do the same for datasets.

Q. Where can I see the Dataset Nutrition Label and learn about the methodology?

You can take a look at the Dataset Nutrition Label here. Older versions (Label, paper) are also still available online.

Q. Who is the intended beneficiary of this work?

Our primary audience for the Dataset Nutrition Label is the data science and developer community building models. An additional audience for our labels is researchers and journalists who want to better understand a particular dataset. We believe that broad, interdisciplinary engagement is required to shift the industry toward better standards of dataset quality and dataset documentation. Thus, we also engage with educators, policymakers, and researchers on the best ways to amplify and highlight the potential of the Dataset Nutrition Label and the importance of data interrogation before model creation. If you’re interested in getting involved, please contact us.

Q. How will this project scale?

We believe that the Data Nutrition Project addresses a broad need in the model development ecosystem, and that the project will scale to address that need. We are still refining the process for label validation and we expect to share more about our approach to that process later this year.

Q. Whom have you been speaking with?

We have been speaking with researchers in academia, practitioners at large technology companies, individual data scientists, organizations, and government institutions that host or open datasets to the public. If you’re interested in getting involved, please contact us.

Q. Is your work open source?

Some of it is! You can view the Dataset Nutrition Label code here, and our label maker code will be open sourced in the future.

Support Us

DNP is a 501(c)(3) nonprofit initiative. We are happy to welcome more people into the fold, whether you are a policymaker, scientist, engineer, designer, or just a curious member of the public. We’d love to hear from you.
