The Data Nutrition Project

Empowering data scientists and policymakers with practical tools to improve AI outcomes

Our Mission

We believe that technology should help us move forward without mirroring existing systemic injustice

The Data Nutrition Project team:

  1. Creates tools and practices that encourage responsible AI development
  2. Partners across disciplines to drive broader change
  3. Builds inclusion and equity into our work

Want to get involved? Contact Us!

The Problem

Garbage in, Garbage out

Incomplete, misunderstood, and historically problematic data can negatively influence AI algorithms.

Algorithms matter, and so does the data they’re trained on. To improve the accuracy and fairness of algorithms that determine everything from navigation directions to mortgage approvals, we need to make it easier for practitioners to quickly assess the viability and fitness of datasets they intend to train AI algorithms on.

There’s a missing step in the AI development pipeline: assessing datasets against standard quality measures that are both qualitative and quantitative. We are working on packaging these measures into an easy-to-use Dataset Nutrition Label.
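To make the idea of quantitative measures concrete, here is a minimal, hypothetical sketch of the kind of checks a practitioner might run before training a model. The function name, columns, and fields are purely illustrative and are not part of the Dataset Nutrition Label's actual implementation.

```python
# Illustrative sketch only: a few of the quantitative checks a dataset
# "nutrition label" might summarize. Column names and the function are
# hypothetical, not the Dataset Nutrition Project's actual tooling.
import pandas as pd

def quick_dataset_checks(df: pd.DataFrame, label_col: str) -> dict:
    """Compute simple health indicators for a tabular dataset."""
    return {
        # Share of missing values per column
        "missing_fraction": df.isna().mean().to_dict(),
        # Class balance of the outcome variable
        "label_balance": df[label_col].value_counts(normalize=True).to_dict(),
        # Basic distribution summary for numeric columns
        "numeric_summary": df.describe().to_dict(),
        # Duplicate rows can silently inflate apparent sample size
        "duplicate_rows": int(df.duplicated().sum()),
    }

# Example usage with a hypothetical loans dataset:
# df = pd.read_csv("loans.csv")
# print(quick_dataset_checks(df, label_col="approved"))
```

Checks like these cover only the quantitative side; the Label also aims to surface qualitative context (provenance, collection practices, intended use) that no single script can compute.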


The Tool

A "nutrition label" for datasets.

The Data Nutrition Project aims to create a standard label for interrogating datasets.

Our belief is that deeper transparency into dataset health can lead to better data decisions, which in turn lead to better AI.

Founded in 2018 through the Assembly Fellowship, The Data Nutrition Project takes inspiration from nutritional labels on food, aiming to build labels that highlight the key ingredients in a dataset, such as metadata and populations, as well as unique or anomalous features regarding distributions, missing data, and comparisons to other ‘ground truth’ datasets.

Building on the ‘modular’ framework first presented in our 2018 prototype, and informed by feedback from data scientists and dataset owners, we have further adjusted the Label to support a common user journey: a data scientist looking for a dataset with a particular purpose in mind. The second-generation Dataset Nutrition Label now provides targeted information about a dataset based on its intended use case, including alerts and flags that are pertinent to that particular use. Read more about the methodology behind the second generation in our most recent white paper.
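As a rough illustration of what a use-case-targeted label entry could contain, here is a hypothetical sketch in Python. The field names, alert text, and numbers are invented for illustration and do not reflect the actual second-generation Label schema.

```python
# Illustrative sketch only: one way a use-case-targeted label entry could
# be represented. All fields and values below are hypothetical.
example_label = {
    "dataset": "City mortgage applications, 2010-2018",
    "intended_use_case": "Training a mortgage-approval risk model",
    "alerts": [
        {
            "type": "missing_data",
            "detail": "Income field is missing for 12% of records, "
                      "concentrated in one borough.",
        },
        {
            "type": "population_mismatch",
            "detail": "Applicant demographics differ from current census "
                      "figures; consider re-weighting or a newer dataset.",
        },
    ],
    "flags": ["historical_bias_risk", "temporal_drift"],
}

if __name__ == "__main__":
    # Print the alerts a data scientist would see for this use case
    for alert in example_label["alerts"]:
        print(f"[{alert['type']}] {alert['detail']}")
```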


Our Research

Published and Related Works

DNP is a research organization as well as a product development team.

Alongside development of the tool, we have been conducting ongoing research into the broader landscape of tools and practices designed to address problems in underlying data, whether those problems stem from the data itself, the data collection practices, or the dataset documentation.

Since 2018, we have seen a confluence of initiatives arise in the domain of tools to combat bias in data. To understand the unique offering of our Label, and to learn from others so that we do not reinvent the wheel, we have been tracking related research, the development of new and related tools, and the general trajectory of labeling as an intervention. The exercise of mapping the space for our internal use as a team has proved invaluable in articulating a clear and growing need for effective dataset documentation and algorithmic auditing. You can learn more about our work and its position in the landscape in our published white papers [2018, 2020, NeurIPS draft].

We take inspiration from related initiatives such as Datasheets for Datasets [Gebru et al], A Nutritional Label for Rankings [Yang et al], and Data Statements for Natural Language Processing [Bender & Friedman], and have been heartened to see that this area of work has inspired some of the large platforms and research initiatives to launch their own related projects, including Apple’s Privacy Labels, Google’s Model Cards, IBM’s AI FactSheets 360, and the Partnership on AI’s ABOUT ML project. For more information about the broad and growing landscape of research related to bias in AI, we recommend the excellent [Morley et al] and [Mehrabi et al] papers, both of which give useful overviews of methods related to bias and fairness.

Recent Talks (2020)
  • NeurIPS 2020: Workshop on Dataset Curation and Security, Poster session & paper
  • Office of the Chief Technology Officer, US Department of Education
  • DRIVE/2020, ‘Bias in, Bias out’
  • The Berkman Klein Center at Harvard University, Fellows Presentation
  • The Harvard Kennedy School, Lecture for Product Management & Society Class
  • Consumer Reports Virtual Panel, ‘Building a Movement for Algorithmic Justice’
  • Machine Learning for Social Good, Poster Session
  • INDUSTRY, ‘Considering Ethical Product Development’
  • GoodSystems AI Workshop, University of Texas at Austin
Consumer Reports virtual panel on Coded Bias (December 2020), featuring Amira Dhalla, Kasia Chmielinski, Joy Buolamwini, Shalini Kantayya, and Jade Magnus

Workshops & Facilitation

Demystifying how AI Perpetuates Systemic Biases

We believe that building artificial intelligence is as much about learning as it is about technical implementation.

Through our workshops, the Data Nutrition Project brings a curriculum of awareness to organizations of all sizes and types, from small technical teams to larger, non-technical communities.

Example: Demystifying AI

This workshop is a brief, non-technical overview of how Artificial Intelligence (AI) algorithms work. Participants move through an experiential activity in which they get to “be the algorithm”, and then reflect on how bias is perpetuated in that process. We also tie this experience to current industry themes and examples, and discuss the complexities of building tools that mitigate the issue.

We have facilitated this workshop at conferences, as well as at local events for community organizers. This workshop is great for community groups looking to better understand how AI works, and how it is used in tools that we all use on a daily basis. It's also helpful for tech professionals who do not code, such as designers, project managers, etc.

Contact Us to find out more about ongoing workshops!

Workshop presentation and DNP member presenting. Photo Credit: Jess Benjamin

Our Team

We are a group of researchers and technologists working together to tackle the challenges of ethics and governance of Artificial Intelligence as a part of the Assembly program at the Berkman Klein Center at Harvard University & MIT Media Lab.

Kasia Chmielinski

Project Lead
Technologist at McKinsey working to drive impact in healthcare through advanced analytics. Current Affiliate at the Berkman Klein Center at Harvard University and Digital Lab Fellow at Consumer Reports. Previously at the US Digital Service (The White House) and the MIT Media Lab. Native Bostonian, born cyclist. Avid bird-watcher.

Sarah Newman

Research & Strategy
Director of Art & Education at metaLAB at Harvard, Fellow at the Berkman Klein Center, Program Design Co-Lead for Harvard Assembly Fellowships. Studies new technologies and their effects on people. Creates interactive art installations that explore social and cultural dimensions of new tech, runs research workshops with creative materials. Former AI Grant Fellow, Rockefeller Bellagio AI Resident. Persuaded by the power of metaphors.

Josh Joseph

AI Research
Chief Intelligence Architect for MIT's Quest for Intelligence. Previously Chief Science Officer at Alpha Features, an alternative data distribution platform, and co-founder of a proprietary trading company based on machine-learning-driven strategy discovery and fully autonomous trading. Has done a variety of consulting work across finance, life sciences, and robotics. PhD in Aero/Astro from MIT, on modeling and planning in the presence of complex dynamics. BS in Applied Mathematics and Mechanical Engineering from RIT. Spends too much time arguing about consciousness. Terrible improviser.

Matt Taylor

Data Science & Workshop Facilitation
Freelance learning experience designer and facilitator, with a background in AI implementation. Previously worked as an engineer in natural language processing, moderation tool development, and creative coding platform development. Currently creating learning experiences in STEAM for young people, and demystifying AI for all people. Also spends time developing tech tools for mutual aid orgs, and organizing tech workers for social justice. Seasoned pun specialist.

Kemi Thomas

Software Engineering Collaborator
Full-stack engineer passionate about building REST API applications and making people’s lives easier. Primary focus in the NERD stack (Node.js, Express, React, Databases using SQL), but open to other technologies. Background in journalism and associate production, most recently at a top 20 news station.

Jessica Yurkofsky

Design Research Collaborator
Designer, technologist, and librarian focused on visual communication and experimental pedagogy. Principal at Harvard's metaLAB, with a background in Sociology and Urban Planning. Lives in the woods in Vermont. Dedicated builder of cardboard models and drawer of cartoons.

Collaborators


Humanity Innovation Labs

User Experience Research & Design Collaborator
HIL is an agile consultancy offering exploratory research and design services for proofs of concept in wearables, spanning digital experiences and physical devices. We work in the ambiguous space of emerging technologies and use qualitative and quantitative methods to drive design. We work within the health and fitness, medical, and industrial sectors.

JustFix.nyc

Research & Data Collaborator
JustFix.nyc co-designs and builds tools for tenants, housing organizers, and legal advocates fighting displacement in New York City. Our mission is to galvanize a 21st century tenant movement working towards housing for all — and we think the power of data and technology should be accessible to those fighting this fight.

Dr. Veronica Rotemberg

Research Collaborator
Dr. Veronica Rotemberg is a dermatologist at Memorial Sloan Kettering Cancer Center (MSK). Dr. Rotemberg directs the imaging informatics program in the dermatology service and sees patients at high risk for skin cancer. She received her MD-PhD from Duke University with a PhD in biomedical engineering focusing on elasticity imaging. She leads the AI working group for the International Skin Imaging Collaboration and is especially interested in imaging standards and evaluating challenges and biases of AI when applied to clinical settings.

AI Global

Research Collaborator
AI Global is a non-profit building tangible governance tools to address growing concerns about AI. Their mission is to catalyze the practical and responsible design, development, and use of AI. Their tools have been among the first to demonstrate how to turn responsible AI principles into action. Bringing extensive experience in responsible AI policy and the development of AI systems for industry, AI Global is uniquely positioned to partner with organizations across public and private sectors to guide and inform responsible AI governance around the world.

Alums


Sarah Holland

Research & Public Policy

Ahmed Hosny

Data Science

Chelsea Qiu

Research & Design

Frequently Asked Questions

A few questions you might have

Q. Where can I see the Dataset Nutrition Label and learn about the methodology?

You can take a look at the Dataset Nutrition Label here and the corresponding methodology paper here. Older versions (Label, paper) are also still available online.

Q. What inspired this project?

We believe that algorithm developers want to build responsible and smart AI models, but that a key step is missing from the standard way these models are built: interrogating the dataset for imbalances and other problems, and ascertaining whether it is the right dataset for the model. We are inspired by the FDA's Nutrition Facts label, which provides basic yet powerful facts that highlight issues in an accessible way. We aspire to do the same for datasets.

Q. Whom have you been speaking with?

We have been speaking with researchers in academia, practitioners at large technology companies, individual data scientists, organizations, and government institutions that host or open datasets to the public. If you’re interested in getting involved, please contact us.

Q. Is your work open source?

Yes. You can view the Dataset Nutrition Label code here.

Q. Who is the intended beneficiary of this work?

The primary audience for the Dataset Nutrition Label is the data science and developer community building AI models. However, we believe that a larger conversation must take place in order to shift the industry. Thus, we are also engaging with educators, policymakers, and researchers on the best ways to amplify and highlight the potential of the Dataset Nutrition Label and the importance of data interrogation before model creation. If you’re interested in getting involved, please contact us.

Q. How will this project scale?

We believe that the Data Nutrition Project addresses a broad need in the model development ecosystem, and that the project will scale to address that need. Feedback on our prototype and opportunities to build additional prototypes on more datasets will certainly help us make strides.

Q. Is this a Harvard/MIT project?

This is a project of Assembly, a program run by the MIT Media Lab and the Berkman Klein Center.

Supported By:

MIT Media Lab
Berkman Klein Center
Assembly
AI Ethics and Governance

Contact

The Data Nutrition Project is a cross-industry collective. We are happy to welcome more people into the fold, whether you are a policymaker, scientist, engineer, designer, or just a curious member of the public. We’d love to hear from you.
