DKRZ data catalog#
The DKRZ data catalog is a bottom-up collection of catalogs generated by DKRZ data hosts to allow DKRZ users to find and make use of their data.
The DKRZ data catalog repository contains all the tools needed to create and host the DKRZ data catalog and to build services around it.
Features#
All DKRZ users can easily contribute to make their data more FAIR
Catalog entries become visible and machine-readable for the whole world
Data sources can be configured so that they load directly into the work environment
Catalogs can be collected and updated both through self-maintenance and through automated services
Become part of the DKRZ data catalog and extend your potential data user community!
Usage#
The repository is fully open, so the catalog can be loaded from the web via GitLab's raw link.
APIs and GUIs#
You can open the catalog via
intake.open_catalog("https://gitlab.dkrz.de/data-infrastructure-services/intake-catalog/-/raw/main/dkrz.yaml")
A short link for this is planned for the near future. This repo is also cloned under /pool/data/catalog on Levante's file system.
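As a minimal sketch of the two access paths above, the following helper prefers the local clone on Levante and falls back to the raw GitLab link. The function names catalog_location and open_dkrz_catalog are hypothetical, and the sketch assumes that the clone contains a top-level dkrz.yaml like the repository does, which this page does not confirm.

```python
from pathlib import Path

# URL and clone directory as given in the text above.
CATALOG_URL = (
    "https://gitlab.dkrz.de/data-infrastructure-services/"
    "intake-catalog/-/raw/main/dkrz.yaml"
)
LOCAL_CLONE = "/pool/data/catalog"


def catalog_location(local_clone=LOCAL_CLONE, url=CATALOG_URL):
    """Return the local catalog path if it exists, else the web URL.

    Assumption: the clone holds a top-level dkrz.yaml, like the repository.
    """
    local_file = Path(local_clone) / "dkrz.yaml"
    return str(local_file) if local_file.is_file() else url


def open_dkrz_catalog(**kwargs):
    """Open the catalog with intake (imported lazily; must be installed)."""
    import intake

    return intake.open_catalog(catalog_location(**kwargs))


# Usage (requires intake and, outside Levante, network access):
#   cat = open_dkrz_catalog()
#   print(list(cat))  # names of the top-level entries
```

Resolving the location separately from opening the catalog keeps the fallback logic usable even where intake is not installed.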
Design#
Catalogs are YAML files that should be openable by a catalog software tool like intake. They are nested in a flat hierarchy and should point to sub- and parent catalogs.
At the top level, the catalog distinguishes between
data lake:
Purpose: An unordered, open collection of data sources that allows easy ingestion of new data sources by merge requests but can contain redundancies.
Policy: Sublevel catalogs reached through these links to other GitLab repositories are not checked for links that point outside gitlab.dkrz.de.
data pool:
Purpose: A mirror of the structure of DKRZ's data pool (/pool/data). It is subject to the requirements set by the /pool/data services and is supported by catalog services that exploit these requirements to harvest catalogs.
Policy: All subcatalogs are part of gitlab.dkrz.de.
Contribute#
Data lake:
Open a merge request or an issue and choose one of the two options:
Provide a link
Add a file to lake/sources with only one line, which contains a link to another catalog in another DKRZ GitLab repository.
Become project maintainer
Add a catalog to lake/sources with the metadata entries listed under Requirements below.
Maintain the catalog within this DKRZ data catalog repository.
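For the "Provide a link" option, a check along the following lines could be run before opening a merge request. This is only a sketch: the function name read_link_file is hypothetical, and treating gitlab.dkrz.de as the only valid host is an assumption drawn from the policies in the Design section, not a documented rule of the repository.

```python
from pathlib import Path
from urllib.parse import urlparse

# Assumption: "DKRZ GitLab" means this host, as named in the Design policies.
ALLOWED_HOST = "gitlab.dkrz.de"


def read_link_file(path):
    """Return the single catalog link stored in a lake/sources file.

    Raises ValueError if the file does not hold exactly one non-empty line
    or if the link does not point to the allowed host.
    """
    text = Path(path).read_text()
    lines = [line.strip() for line in text.splitlines() if line.strip()]
    if len(lines) != 1:
        raise ValueError(f"{path}: expected exactly one line, found {len(lines)}")
    link = lines[0]
    if urlparse(link).hostname != ALLOWED_HOST:
        raise ValueError(f"{path}: link must point to {ALLOWED_HOST}")
    return link
```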
Requirements:
Specify in the metadata section of the linked catalog:
account (example: bm1344)
relation (example: [MPI-M, NextGEMs, ICON])
Referenced subcatalogs must be in the institutional GitLab. Catalogs referenced in these subcatalogs can point to any further URL.
Result:
For each label in ‘{relation}’, an entry ‘{account}’ is generated in lake/{relation}/main.yaml.
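The mapping from metadata to generated entries can be illustrated as follows. This is a sketch of the rule stated above, not the repository's actual generator; the function name plan_lake_entries and the plain-dict input format are assumptions made for illustration.

```python
from collections import defaultdict


def plan_lake_entries(catalogs):
    """Map each lake/{relation}/main.yaml to the account entries it receives.

    `catalogs` is an iterable of metadata dicts with the keys `account`
    (str) and `relation` (list of str), as in the requirements above.
    """
    plan = defaultdict(list)
    for meta in catalogs:
        for label in meta["relation"]:
            target = f"lake/{label}/main.yaml"
            # Avoid duplicate entries if a label is listed twice.
            if meta["account"] not in plan[target]:
                plan[target].append(meta["account"])
    return dict(plan)


# Example values taken from the requirements above.
example = {"account": "bm1344", "relation": ["MPI-M", "NextGEMs", "ICON"]}
for target, accounts in plan_lake_entries([example]).items():
    print(target, accounts)
```

With the example metadata, the account bm1344 is listed once in each of the three relation catalogs.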
Data pool:
DKRZ mirrors the directory structure of /pool/data. A crawler searches for /pool/data/SUBDIR/main.yaml and adds it to the pool catalog.
Requirements:
Readable data
A README file /pool/data/SUBDIR/README* that includes the license
Note that all WLA projects can become part of /pool/data by requesting a link for their data directory at /pool/data/LINKTODIR
Tools#
scripts/create_pool_catalog.ipynb
For each subdirectory in /pool/data/, a crawler searches for /pool/data/SUBDIR/main.yaml and adds it and all subcatalogs to DKRZ's catalog if the requirements are fulfilled.
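The crawl step can be sketched as below. This is not the code of scripts/create_pool_catalog.ipynb; the function name find_pool_catalogs is hypothetical, and the sketch only checks the requirements that can be tested on the file system (readable directory, main.yaml, a README* file). Whether the README actually states a license is not verified here.

```python
import os
from pathlib import Path


def find_pool_catalogs(pool_root="/pool/data"):
    """Return main.yaml paths of subdirectories that meet the requirements.

    Checked here: the subdirectory is readable, contains main.yaml, and has
    at least one README* file. License content is not inspected.
    """
    found = []
    for subdir in sorted(Path(pool_root).iterdir()):
        # is_dir() follows symlinks, so linked project directories count too.
        if not subdir.is_dir() or not os.access(subdir, os.R_OK | os.X_OK):
            continue
        main_yaml = subdir / "main.yaml"
        has_readme = any(p.is_file() for p in subdir.glob("README*"))
        if main_yaml.is_file() and has_readme:
            found.append(main_yaml)
    return found
```

The returned paths would then be the candidates to add to the pool catalog.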
Notes#
This repo replaces the old intake-esm repository