The Data Nutrition Project

Empowering data scientists and policymakers with practical tools to improve AI outcomes

Our Mission

We believe that technology should help us move forward without mirroring societal biases

The Data Nutrition Project team:

  1. Creates tools and practices that encourage responsible AI development
  2. Partners across disciplines to drive broader change
  3. Builds inclusion and equity into our work

Want to get involved? Contact Us!

The Problem

Garbage in, Garbage out

Incomplete, misunderstood, and historically problematic data can negatively influence AI algorithms.

Algorithms matter, and so does the data they’re trained on. To improve the accuracy and fairness of algorithms that determine everything from navigation directions to mortgage approvals, we need to make it easier for practitioners to quickly assess the viability and fitness of datasets they intend to train AI algorithms on.

There’s a missing step in the AI development pipeline: assessing datasets based on standard quality measures that are both qualitative and quantitative. We are working on packaging these measures into an easy-to-use Dataset Nutrition Label.
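To make the quantitative side of these measures concrete, here is a minimal, hypothetical sketch in Python (using pandas) of the kinds of automated checks a practitioner might run on a candidate dataset before training. This is not the project's actual tooling; the column names and thresholds are illustrative assumptions only.

    # Hypothetical sketch of simple quantitative dataset checks; not DNP's actual tooling.
    import pandas as pd

    def quick_dataset_checks(df: pd.DataFrame, label_col: str, demographic_col: str) -> dict:
        """Compute a few illustrative 'label ingredients' for a candidate dataset."""
        return {
            # Share of missing values per column (incompleteness).
            "missing_fraction": df.isna().mean().to_dict(),
            # Class balance of the prediction target (imbalance).
            "label_distribution": df[label_col].value_counts(normalize=True).to_dict(),
            # Representation of subpopulations (demographic coverage).
            "demographic_representation": df[demographic_col].value_counts(normalize=True).to_dict(),
            # Basic metadata: row count and duplicate rate.
            "n_rows": len(df),
            "duplicate_fraction": float(df.duplicated().mean()),
        }

    # Example usage with illustrative column names:
    # df = pd.read_csv("loan_applications.csv")
    # print(quick_dataset_checks(df, label_col="approved", demographic_col="region"))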


The Tool

A "nutrition label" for datasets.

The Data Nutrition Project aims to create a standard label for interrogating datasets.

Our belief is that deeper transparency into dataset health can lead to better data decisions, which in turn lead to better AI.

Founded in 2018 through the Assembly Fellowship, The Data Nutrition Project takes inspiration from nutritional labels on food, aiming to build labels that highlight the key ingredients in a dataset, such as metadata and demographic representation, as well as unique or anomalous features regarding distributions, missing data, and comparisons to other "ground truth" datasets.

Building on the modular framework first presented in our 2018 prototype and refined in our 2nd Generation Label (2020), and drawing on feedback from data scientists and dataset owners, we have adjusted the Label to support a common user journey: a data scientist looking for a dataset with a particular purpose in mind. The third generation Dataset Nutrition Label now provides information about a dataset including its intended use and other known uses, the process of cleaning, managing, and curating the data, ethical and/or technical reviews, the inclusion of subpopulations in the dataset, and a series of potential risks or limitations. You may also want to read here about the second generation (2020) label, which informed the third generation label.
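As a purely illustrative sketch (the official Label format may differ), the fields described above could be captured in a simple structured record like the following; all field names and example values here are hypothetical.

    # Hypothetical, simplified representation of the third-generation Label fields;
    # field names and example values are illustrative, not the official schema.
    dataset_label = {
        "dataset": "Example Housing Loans 2020",
        "intended_use": "Research on lending disparities",
        "other_known_uses": ["Academic fairness studies"],
        "data_management": {
            "cleaning": "Deduplicated records; standardized dates",
            "curation": "Annual refresh by the dataset owner",
        },
        "reviews": {"ethical_review": True, "technical_review": False},
        "subpopulations": {"reported": ["age", "region"], "not_reported": ["income"]},
        "risks_and_limitations": [
            "Underrepresents rural applicants",
            "Missing values concentrated in older records",
        ],
    }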

Third Generation Dataset Nutrition Label (2022)

Our Research

Published and Related Works

DNP is a research organization as well as a product development team.

Alongside development of the tool, we have been conducting ongoing research into the broader landscape of tools and practices designed to address problems in underlying data, whether those problems stem from the data itself, the data collection practices, or the dataset documentation.

Since 2018, we have seen a confluence of initiatives arise in the domain of tools to combat bias in data. Both to understand the unique offering of our Label and to learn from others so that we do not reinvent the wheel, we have been tracking related research, the development of new and related tools, and the general trajectory of labeling as an intervention. The exercise of mapping the space for our internal use as a team has proved invaluable in articulating a clear and growing need for effective dataset documentation and algorithmic auditing. You can learn more about our work and its position in the landscape in our published white papers [2018, 2020].

We take inspiration from related initiatives such as Datasheets for Datasets [Gebru et al.], A Nutrition Label for Rankings [Yang et al.], and Data Statements for Natural Language Processing [Bender & Friedman], and have been heartened to see that this area of work has inspired some of the large platforms and research initiatives to launch their own related projects, including Apple’s Privacy Labels, Google’s Model Cards, IBM’s AI FactSheets 360, and the Partnership on AI’s ABOUT ML project. For more information about the broad and growing landscape of research related to bias in AI, we recommend the excellent [Morley et al.] and [Mehrabi et al.] papers, both of which give useful overviews of methods related to bias and fairness.

Recent Outreach
  • Policy Lab on AI and Bias, Penn Law School (2023)
  • Howard/Mathematica Summer Institute in Computational Social Science (2023)
  • Comment in collaboration with Berkman Klein Center: FTC Trade Regulation Rule on Commercial Surveillance and Data Security (2022)
  • Putting Science into Standards (PSIS) Program, the European Commission’s Joint Research Centre (JRC) with CEN and CENELEC (2022)
  • Understanding Bias and Fairness in AI-enabled Healthcare Software, Duke-Margolis Center for Health Policy (2021)
  • NeurIPS 2020: Workshop on Dataset Curation and Security, Poster session & paper
Consumer Reports virtual panel on Coded Bias (December 2020), featuring Amira Dhalla, Kasia Chmielinski, Joy Buolamwini, Shalini Kantayya, and Jade Magnus

Workshops & Facilitation

Demystifying how AI Perpetuates Systemic Biases

We believe that building artificial intelligence is as much about learning as it is about technical implementation.

Through our workshops, the Data Nutrition Project brings a curriculum of awareness to organizations of all sizes and types - from small technical teams to larger, non-technical communities.

Example: Demystifying AI

This workshop is a brief, non-technical overview of how Artificial Intelligence (AI) algorithms work. Participants move through an experiential activity in which they get to “be the algorithm,” and then reflect on how bias is perpetuated in that process. We also tie this experience to current industry themes and examples, and discuss the complexities of building tools that mitigate the issue.

We have facilitated this workshop at conferences, as well as at local events for community organizers. This workshop is great for community groups looking to better understand how AI works, and how it is used in tools that we all use on a daily basis. It's also helpful for tech professionals who do not code, such as designers, project managers, etc.

Contact Us to find out more about ongoing workshops!

Workshop presentation (Photo Credit: Jess Benjamin)
DNP member presenting (Photo Credit: Jess Benjamin)

Our Team

We are a group of researchers and technologists working together to tackle the challenges of ethics and governance of Artificial Intelligence.

Kasia Chmielinski

Project Lead
Technologist and product leader focused on building data-driven systems. Current Digital Civil Society Practitioner Fellow (Stanford University) and Affiliate at the Berkman Klein Center (Harvard University). Previously at McKinsey & Company, the US Digital Service, MIT Media Lab, and Google. Native Bostonian, enthusiastic cyclist. Avid bird-watcher.

Sarah Newman

Research Lead
Director of Art & Education at metaLAB at Harvard. Interested in interrelations within complex systems. Creates interactive art installations that explore social and cultural dimensions of new tech; runs critical and creative workshops on AI. Persuaded by the power of metaphors. Avid sheller.

Matt Taylor

Data Science & Workshop Facilitation
Freelance learning experience designer and facilitator, with a background in AI implementation. Previously worked as an engineer in natural language processing, moderation tool development, and creative coding platform development. Currently creating learning experiences in STEAM for young people, and demystifying AI for all people. Also spends time developing tech tools for mutual aid orgs, and organizing tech workers for social justice. Seasoned pun specialist.

Kemi Thomas

Software Engineering Collaborator
Full-stack engineer passionate about building REST API applications and making people’s lives easier. Primary focus on the NERD stack (Node.js, Express, React, Databases using SQL), but open to other technologies. Background in journalism and associate production, most recently at a top-20 news station.

Chris Kranzinger

Data Science Advisor
Data Scientist, Economist, and ML enthusiast combining data science and economics to inform strategic decision-making and to study questions around trust and safety in AI. Former McCloy Fellow at Harvard and current Sr. Applied Scientist at Uber with a background in Engineering and Economics.

Carine Teyrouz

UX Design Collaborator
NN/g certified product designer, UX mentor & facilitator. Holds a master’s degree in design & web project management. Currently working with startups & SMEs on applying design thinking methodologies to create intuitive and easy-to-use digital products.

HG King

Software Engineering Collaborator
NYC-based software engineer and consultant, with a focus on driving innovation and business value using an Agile and scrappy approach to build products with clients. Occasionally helps artists with technology. Interested in technology and sustainability, specifically how technology is connected to modern social issues in the form of problems or solutions.

Audrey Chang

Research Collaborator
Undergraduate studying statistics and sociology at Harvard, advocating for responsible AI-related innovation with particular attention to combatting technology’s reproduction of current societal inequities. Previously researched cancer biology at Stanford and materials science at Harvard. Keen on exploring the design of physical and social spaces. Crafty naturalist. Bay Area native.

Collaborators


Dr. Veronica Rotemberg

Research Collaborator
Dr. Veronica Rotemberg is a dermatologist at Memorial Sloan Kettering Cancer Center (MSK). Dr. Rotemberg directs the imaging informatics program in the dermatology service and sees patients at high risk for skin cancer. She received her MD-PhD from Duke University with a PhD in biomedical engineering focusing on elasticity imaging. She leads the AI working group for the International Skin Imaging Collaboration and is especially interested in imaging standards and evaluating challenges and biases of AI when applied to clinical settings.

Responsible AI Institute

Research Collaborator
Responsible AI Institute is a non-profit building tangible governance tools to address growing concerns about AI. Their mission is to catalyze the practical and responsible design, development, and use of AI. Their tools have been among the first to demonstrate how to turn responsible AI principles into action. Bringing extensive experience in responsible AI policy and the development of AI systems for industry, Responsible AI Institute is uniquely positioned to partner with organizations across public and private sectors to guide and inform responsible AI governance around the world.

Michael Sherman

Children's Book Illustrator
Artist, illustrator, and educator in NYC, focusing on the nexus of individuality and community. Rhode Island School of Design graduate, Cill Rialaig Arts Centre resident, Lower Manhattan Cultural Council grantee, and Northwest Review contributor. His current project, “Meta-morphic: a series of 1000 head drawings,” uses drawing as a tool to examine the stewardship and ownership of images and icons. Also known as papa to his two young kids.

Former Collaborators

Alums


Sarah Holland

Research & Public Policy

Ahmed Hosny

Data Science

Josh Joseph

AI Research

Erica Luzzi

Research & Design Collaborator

Serena Oduro

AI Policy Collaborator

Chelsea Qiu

Research & Design

Jessica Yurkofsky

Design Research Collaborator

Frequently Asked Questions

A few questions you might have

Q. What inspired this project?

We believe that engineers want to build responsible and smart AI models, but that there is a key step missing in the way these models are built. This step is to interrogate the dataset for the variety of imbalances or problems it may have, and to ascertain whether it is the right dataset for the model. We are inspired by the FDA's Nutrition Facts label, which provides basic yet powerful facts that highlight issues in an accessible way. We aspire to do the same for datasets.

Q. Where can I see the Dataset Nutrition Label and learn about the methodology?

You can take a look at the Dataset Nutrition Label here. Older versions (Label, paper) are also still available online.

Q. Who is the intended beneficiary of this work?

Our primary audience for the Dataset Nutrition Label is the data science and developer community who are building models. An additional audience for our labels is researchers or journalists who want to better understand a particular dataset. We believe that broad, interdisciplinary engagement is required to shift the industry toward better standards of dataset quality and dataset documentation. Thus, we also engage with educators, policymakers, and researchers on the best ways to amplify and highlight the potential of the Dataset Nutrition Label and the importance of data interrogation before model creation. If you’re interested in getting involved, please contact us.

Q. How will this project scale?

We believe that the Data Nutrition Project addresses a broad need in the model development ecosystem, and that the project will scale to address that need. We are still refining the process for label validation and we expect to share more about our approach to that process later this year.

Q. Whom have you been speaking with?

We have been speaking with researchers in academia, practitioners at large technology companies, individual data scientists, organizations, and government institutions that host or open datasets to the public. If you’re interested in getting involved, please contact us.

Q. Is your work open source?

Some of it is! You can view the Dataset Nutrition Label code here, and our label maker code will be open sourced in the future.

Contact

DNP is a 501(c)(3) nonprofit initiative. We are happy to welcome more people into the fold, whether you are a policymaker, scientist, engineer, designer, or simply a curious member of the public. We’d love to hear from you.
