DKRZ data catalog#

The DKRZ data catalog is a bottom-up collection of catalogs generated by DKRZ data hosts to allow DKRZ users to find and make use of their data.

The DKRZ data catalog repository contains all tools needed to create and host the DKRZ data catalog and to build services around it.

Features#

  • All DKRZ users can easily contribute to make their data more FAIR

  • Catalog entries become visible and machine-readable for the whole world

  • Data sources can be configured so that they load easily into the work environment

  • Catalogs can be collected and updated both through self-maintenance and through automated services

Become part of the DKRZ data catalog and extend your potential data user community!

Usage#

The repository is fully open, so the catalog can be loaded from the web via GitLab's raw link.

APIs and GUIs#

  • You can open the catalog via intake.open_catalog("https://gitlab.dkrz.de/data-infrastructure-services/intake-catalog/-/raw/main/dkrz.yaml"). A short link for this URL is planned for the near future.

  • This repo is also cloned under /pool/data/catalog on Levante’s file system
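
As a minimal sketch, the two access routes above can be combined: prefer the clone on Levante and fall back to the raw GitLab link. The helper names `catalog_location` and `open_dkrz_catalog` are hypothetical. The sketch assumes that `dkrz.yaml` sits at the top level of the clone under /pool/data/catalog, which is not stated above, and that the intake package is installed.

```python
from pathlib import Path

# Raw link of the top-level catalog, as given above.
CATALOG_URL = (
    "https://gitlab.dkrz.de/data-infrastructure-services/"
    "intake-catalog/-/raw/main/dkrz.yaml"
)
# Clone of the repository on Levante's file system.
LEVANTE_CLONE = Path("/pool/data/catalog")


def catalog_location(clone_dir=LEVANTE_CLONE, url=CATALOG_URL):
    """Return the local dkrz.yaml if it exists, otherwise the raw URL.

    Assumption: dkrz.yaml is located at the top level of the clone.
    """
    local = Path(clone_dir) / "dkrz.yaml"
    return str(local) if local.is_file() else url


def open_dkrz_catalog():
    """Open the catalog with intake (third-party package, imported lazily)."""
    import intake

    return intake.open_catalog(catalog_location())
```

On a machine without the clone, `catalog_location()` returns the raw link, so `open_dkrz_catalog()` loads the catalog from the web.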

Design#

Catalogs are YAML files that should be openable with a catalog tool such as intake. They are organized in a flat hierarchy, and each catalog should point to its sub- and parent catalogs.

At the top level, the catalog distinguishes between:

  • data lake:

    • Purpose: An unordered, open collection of data sources that allows easy ingestion of new data sources by merge requests but can contain redundancies.

    • Policy: Sublevel catalogs that are reached through links to other GitLab repositories are not checked for links that point outside gitlab.dkrz.de.

  • data pool:

    • Purpose: A mirror of the structure of DKRZ's data pool (/pool/data). It comes with requirements set by the /pool/data services and with support from catalog services, which rely on these requirements to harvest catalogs.

    • Policy: All subcatalogs are part of gitlab.dkrz.de.
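
The two policies differ in whether links may leave gitlab.dkrz.de. A minimal sketch of such a host check follows. The function names are hypothetical, and this is not necessarily how the repository's own tools verify links.

```python
from urllib.parse import urlparse

DKRZ_GITLAB_HOST = "gitlab.dkrz.de"


def is_on_dkrz_gitlab(link):
    """Return True if `link` is a URL whose host is gitlab.dkrz.de."""
    # urlparse lower-cases the hostname; links without a host yield None.
    return urlparse(link).hostname == DKRZ_GITLAB_HOST


def links_outside_dkrz_gitlab(links):
    """Return the links that do not point to gitlab.dkrz.de."""
    return [link for link in links if not is_on_dkrz_gitlab(link)]
```

Under the data pool policy, `links_outside_dkrz_gitlab` would be expected to return an empty list for all subcatalog links.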

Contribute#

  • Data lake:

    • Open a merge request or an issue and choose one of the two options:

      • Provide a link

        • Add a file to lake/sources that contains a single line with a link to another catalog in another DKRZ GitLab repository.

      • Become project maintainer

        • Add a catalog to lake/sources with the metadata entries listed under Requirements below

        • Maintain the catalog within this DKRZ data catalog repository

    • Requirements:

      • Specify in the metadata section of the linked catalog:

        • account (example: bm1344)

        • relation (example: [MPI-M, NextGEMs, ICON])

      • Referenced subcatalogs must be located in the institutional GitLab. Catalogs referenced in these subcatalogs can point to any further URL.

    • Result:

      • For each label in ‘{relation}’, an entry ‘{account}’ is generated in lake/{relation}/main.yaml

  • Data pool:

    • DKRZ mirrors the directory structure of /pool/data. A crawler searches for /pool/data/SUBDIR/main.yaml and adds it to the pool catalog.

    • Requirements:

      • Readable data

      • A README file /pool/data/SUBDIR/README* including license

    • Note that all WLA projects can become part of /pool/data by requesting a link for their data directory at /pool/data/LINKTODIR
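
The data lake result described above can be sketched as a mapping from the metadata of a contributed catalog to the generated entries. This is a minimal illustration, not the repository's actual implementation. The function name `lake_entries` and the returned dictionary layout are assumptions.

```python
REQUIRED_KEYS = ("account", "relation")


def lake_entries(metadata):
    """Map each generated catalog path to the entry name it receives.

    For each label in metadata["relation"], an entry named after
    metadata["account"] belongs in lake/<label>/main.yaml.
    """
    missing = [key for key in REQUIRED_KEYS if key not in metadata]
    if missing:
        raise ValueError(f"missing metadata entries: {', '.join(missing)}")
    account = metadata["account"]
    return {f"lake/{label}/main.yaml": account for label in metadata["relation"]}
```

With the example values from the requirements, the account bm1344 would be entered in lake/MPI-M/main.yaml, lake/NextGEMs/main.yaml and lake/ICON/main.yaml.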

Tools#

scripts/create_pool_catalog.ipynb

For each subdirectory in /pool/data/, a crawler searches for /pool/data/SUBDIR/main.yaml and adds it and all of its subcatalogs to the DKRZ catalog if the requirements are fulfilled.
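
The search step can be sketched as follows. This is a simplified illustration and not the notebook's actual code. The function name `find_pool_catalogs` is hypothetical, and the sketch only checks that main.yaml is readable and that a README* file exists. It does not verify that the README states a license, and it does not follow subcatalogs.

```python
import os
from pathlib import Path


def find_pool_catalogs(pool_root="/pool/data"):
    """Return the main.yaml paths of subdirectories that pass the checks.

    Checked here: main.yaml exists and is readable, and a README* file exists.
    Not checked here: the license statement inside the README.
    """
    found = []
    for subdir in sorted(Path(pool_root).iterdir()):
        try:
            if not subdir.is_dir():  # follows links such as /pool/data/LINKTODIR
                continue
            catalog = subdir / "main.yaml"
            has_readme = any(p.is_file() for p in subdir.glob("README*"))
            if catalog.is_file() and os.access(catalog, os.R_OK) and has_readme:
                found.append(catalog)
        except PermissionError:
            # Unreadable directories do not meet the "readable data" requirement.
            continue
    return found
```

Each returned path could then be added to the pool catalog as one entry per SUBDIR.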

Notes#

  • This repo replaces the old intake-esm repository