Data distribution debugging in machine learning pipelines

S. Grafberger; P. Groth; J. Stoyanovich; S. Schelter

doi:https://doi.org/10.1007/s00778-021-00726-w

Authors	S. Grafberger P. Groth J. Stoyanovich S. Schelter
Publication date	09-2022
Journal	VLDB Journal
Volume | Issue number	31 | 5
Pages (from-to)	1103-1126
Organisations	Faculty of Science (FNWI) - Informatics Institute (IVI)
Abstract	Machine learning (ML) is increasingly used to automate impactful decisions, and the risks arising from this widespread use are garnering attention from policy makers, scientists, and the media. ML applications are often brittle with respect to their input data, which leads to concerns about their correctness, reliability, and fairness. In this paper, we describe mlinspect, a library that helps diagnose and mitigate technical bias that may arise during preprocessing steps in an ML pipeline. We refer to these problems collectively as data distribution bugs. The key idea is to extract a directed acyclic graph representation of the dataflow from a preprocessing pipeline and to use this representation to automatically instrument the code with predefined inspections. These inspections are based on a lightweight annotation propagation approach to propagate metadata such as lineage information from operator to operator. In contrast to existing work, mlinspect operates on declarative abstractions of popular data science libraries like estimator/transformer pipelines and does not require manual code instrumentation. We discuss the design and implementation of the mlinspect library and give a comprehensive end-to-end example that illustrates its functionality.
Document type	Article
Language	English
Published at	https://doi.org/10.1007/s00778-021-00726-w
Other links	https://www.scopus.com/pages/publications/85123897845
Downloads	Data distribution debugging in machine learning pipelines (Final published version)
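
The abstract's key idea (a dataflow DAG whose operators carry annotations forward so that inspections can run after each step) can be illustrated with a small self-contained sketch. This is a toy model under stated assumptions, not mlinspect's API: the names Node, source, select, histogram and run_with_inspection, the example records, and the choice of a per-operator histogram as the inspection are all hypothetical and introduced only for illustration.

```python
from collections import Counter
from dataclasses import dataclass, field
from typing import Callable


@dataclass
class Node:
    """One operator in a toy dataflow DAG (hypothetical, not mlinspect's API)."""
    name: str
    fn: Callable[[list], list]   # maps the parents' outputs to this node's rows
    parents: list = field(default_factory=list)


def source(name, records):
    """Source operator: annotate each record with its lineage (source, index)."""
    rows = [((name, i), rec) for i, rec in enumerate(records)]
    return Node(name, lambda _inputs: rows)


def select(name, parent, predicate):
    """Filter operator: surviving rows keep their annotation unchanged."""
    def fn(inputs):
        return [(ann, rec) for ann, rec in inputs[0] if predicate(rec)]
    return Node(name, fn, [parent])


def histogram(rows, column):
    """Inspection: relative frequency of each value of `column`."""
    counts = Counter(rec[column] for _ann, rec in rows)
    total = sum(counts.values())
    return {v: c / total for v, c in counts.items()} if total else {}


def run_with_inspection(sink, column):
    """Evaluate the DAG ending at `sink`; run the inspection after every operator."""
    results, report = {}, {}

    def evaluate(node):
        if node.name not in results:
            inputs = [evaluate(p) for p in node.parents]
            results[node.name] = node.fn(inputs)
            report[node.name] = histogram(results[node.name], column)
        return results[node.name]

    evaluate(sink)
    return report, results


# Made-up example data: a filter on income shifts the age_group distribution.
people = [
    {"age_group": "young", "income": 10},
    {"age_group": "young", "income": 20},
    {"age_group": "old", "income": 50},
    {"age_group": "old", "income": 60},
]
src = source("people", people)
flt = select("income_filter", src, lambda rec: rec["income"] >= 20)
report, results = run_with_inspection(flt, "age_group")

# Lineage annotations identify which input rows the filter removed.
kept = {ann for ann, _rec in results["income_filter"]}
dropped = [ann for ann, _rec in results["people"] if ann not in kept]
```

Comparing report["people"] with report["income_filter"] shows how the proportion of each age group changes across the filter, and dropped lists the lineage identifiers of the rows that caused the change. In this sketch the pipeline is built by hand as a DAG; the paper's contribution, as the abstract states, is to extract that representation automatically from existing pipeline code and instrument it without manual changes.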