GitTables 1M

Creators
Contributors
Publication date 04-05-2022
Description
GitTables 1M (https://gittables.github.io) is a corpus of currently 1M relational tables extracted from CSV files in GitHub repositories that are associated with a license allowing distribution. We aim to grow this to at least 10M tables. Each parquet file in this corpus represents a table with the original content (e.g. values and header) as extracted from the corresponding CSV file. Table columns are enriched with annotations corresponding to more than 2K semantic types from Schema.org and DBpedia (provided as metadata of the parquet file). These column annotations consist of, for example, semantic types, hierarchical relations to other types, and descriptions. We believe GitTables can facilitate many use cases, among which:
  • Data integration, search and validation
  • Data visualization and analysis recommendation
  • Schema analysis and completion for e.g. database or knowledge base design
If you have questions, the paper, documentation, and contact details are provided on the website: https://gittables.github.io. We recommend using Zenodo's API to easily download the full dataset (i.e. all zipped topic subsets).
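The download recommendation can be sketched in Python as follows. This is a minimal sketch, not an official client: the record ID is taken from the DOI listed on this page, while the endpoint layout (`/api/records/<id>`) and the response fields (`files`, `key`, `links.self`) are assumptions about Zenodo's records API that should be checked against its current documentation. The helper names `list_files`, `fetch_record`, and `download_all` are hypothetical.

```python
import json
import urllib.request
from pathlib import Path

RECORD_ID = "6517052"  # from the DOI 10.5281/zenodo.6517052
# Assumed endpoint layout of Zenodo's records API.
API_URL = "https://zenodo.org/api/records/{record_id}"


def list_files(record: dict) -> list[tuple[str, str]]:
    """Return (filename, download_url) pairs from a record's JSON.

    Assumes each entry in record["files"] carries a "key" (filename)
    and a "links"/"self" download URL; adjust if the response differs.
    """
    return [(f["key"], f["links"]["self"]) for f in record.get("files", [])]


def fetch_record(record_id: str = RECORD_ID) -> dict:
    """Fetch the record metadata as a dictionary."""
    with urllib.request.urlopen(API_URL.format(record_id=record_id)) as resp:
        return json.load(resp)


def download_all(dest: str = "gittables", record_id: str = RECORD_ID) -> None:
    """Download every file of the record into dest, skipping existing files."""
    out = Path(dest)
    out.mkdir(parents=True, exist_ok=True)
    for name, url in list_files(fetch_record(record_id)):
        target = out / name
        if target.exists():
            continue  # already downloaded in an earlier run
        urllib.request.urlretrieve(url, str(target))
```

Calling `download_all("gittables")` would then place each zipped topic subset in a local `gittables` directory; nothing is downloaded until that function is called.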
Publisher Zenodo
Organisations
  • Faculty of Science (FNWI) - Informatics Institute (IVI)
Document type Dataset
Related publication GitTables: A Large-Scale Corpus of Relational Tables
DOI https://doi.org/10.5281/zenodo.6517052
Other links https://zenodo.org/records/6517052