Record Details

Dataset metadata of known Dataverse installations


 
Title: Dataset metadata of known Dataverse installations

Identifier: https://doi.org/10.7910/DVN/DCDKZQ

Creator: Gautier, Julian

Publisher: Harvard Dataverse

Description:

This dataset contains the metadata of the datasets published in 77 Dataverse installations, information about each installation's metadata blocks, and the list of standard licenses that dataset depositors can apply to the datasets they publish in the 36 installations running more recent versions of the Dataverse software.

The data is useful for reporting on the quality of dataset and file-level metadata within and across Dataverse installations. Curators and other researchers can use this dataset to explore how well Dataverse software and the repositories using the software help depositors describe data.

How the metadata was downloaded

The dataset metadata and metadata block JSON files were downloaded from each installation on October 2 and 3, 2022, using a Python script kept in a GitHub repo at https://github.com/jggautier/dataverse-scripts/blob/main/other_scripts/get_dataset_metadata_of_all_installations.py. Some Dataverse software APIs require an API token from an account on the installation. To get metadata from installations that require this, I created a CSV file with two columns: one named "hostname", listing the URL of each installation where I was able to create an account, and another named "apikey", listing my accounts' API tokens. The Python script expects and uses the API tokens in this CSV file to get metadata and other information from installations that require them.
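
For illustration, here is a minimal sketch of how such a CSV file can drive authenticated API calls. The filename is hypothetical, the "hostname" and "apikey" column names follow the description above, and the Search API query is just one example; this is not the author's script:

import csv
import requests

# Read the two-column CSV described above: "hostname" and "apikey".
# The filename here is hypothetical.
with open("installation_api_tokens.csv", newline="") as f:
    tokens = {row["hostname"]: row["apikey"] for row in csv.DictReader(f)}

for hostname, api_key in tokens.items():
    # Dataverse APIs accept an account's API token in the X-Dataverse-key header.
    response = requests.get(
        f"{hostname}/api/search",
        params={"q": "*", "type": "dataset", "per_page": 10},
        headers={"X-Dataverse-key": api_key},
    )
    response.raise_for_status()
    for item in response.json()["data"]["items"]:
        print(item["global_id"])  # each dataset's persistent identifier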

How the files are organized


├── csv_files_with_metadata_from_most_known_dataverse_installations
│   ├── author(citation).csv
│   ├── basic.csv
│   ├── contributor(citation).csv
│   ├── ...
│   └── topic_classification(citation).csv
├── dataverse_json_metadata_from_each_known_dataverse_installation
│   ├── Abacus_2022.10.02_17.11.19.zip
│   │   ├── dataset_pids_Abacus_2022.10.02_17.11.19.csv
│   │   ├── Dataverse_JSON_metadata_2022.10.02_17.11.19
│   │   │   ├── hdl_11272.1_AB2_0AQZNT_v1.0.json
│   │   │   └── ...
│   │   └── metadatablocks_v5.6
│   │       ├── astrophysics_v5.6.json
│   │       ├── biomedical_v5.6.json
│   │       ├── citation_v5.6.json
│   │       ├── ...
│   │       └── socialscience_v5.6.json
│   ├── ACSS_Dataverse_2022.10.02_17.26.19.zip
│   ├── ADA_Dataverse_2022.10.02_17.26.57.zip
│   ├── Arca_Dados_2022.10.02_17.44.35.zip
│   ├── ...
│   └── World_Agroforestry_-_Research_Data_Repository_2022.10.02_22.59.36.zip
├── dataset_pids_from_most_known_dataverse_installations.csv
├── licenses_used_by_dataverse_installations.csv
└── metadatablocks_from_most_known_dataverse_installations.csv



This dataset contains two directories and three top-level CSV files.

One directory, "csv_files_with_metadata_from_most_known_dataverse_installations", contains 18 CSV files that contain the values from common metadata fields of all 77 Dataverse installations. For example, author(citation)_2022.10.02-2022.10.03.csv contains the "Author" metadata for all published, non-deaccessioned, versions of all datasets in the 77 installations, where there's a row for each author name, affiliation, identifier type and identifier.

The other directory, "dataverse_json_metadata_from_each_known_dataverse_installation", contains 77 zipped files, one for each of the 77 Dataverse installations whose dataset metadata I was able to download using Dataverse APIs. Each zip file contains a CSV file and two sub-directories:


  • The CSV file contains the persistent IDs and URLs of each published dataset in the Dataverse installation, as well as a column indicating whether the Python script was able to download the Dataverse JSON metadata for each dataset. For installations running Dataverse software versions whose Search APIs report each dataset's owning Dataverse collection name and alias, the CSV file also records the Dataverse collection (within the installation) in which each dataset was published.
  • One sub-directory contains a JSON file for each of the installation's published, non-deaccessioned dataset versions. The JSON files contain the metadata in the "Dataverse JSON" metadata schema (see the sketch after this list).
  • The other sub-directory contains information about the metadata models (the "metadata blocks" in JSON files) that the installation was using when the dataset metadata was downloaded. I saved them so that they can be used when extracting metadata from the Dataverse JSON files.
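
As referenced above, here is a minimal sketch of reading one of the Dataverse JSON files. It assumes the standard layout of the Dataverse JSON export, in which dataset fields sit under datasetVersion and metadataBlocks; the filename is taken from the tree above:

import json

with open("hdl_11272.1_AB2_0AQZNT_v1.0.json") as f:
    metadata = json.load(f)

# Citation-block fields are a list of objects with "typeName" and "value" keys.
citation_fields = metadata["datasetVersion"]["metadataBlocks"]["citation"]["fields"]
for field in citation_fields:
    if field["typeName"] == "title":
        print(field["value"])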

The dataset_pids_from_most_known_dataverse_installations.csv file contains the dataset PIDs of all published datasets in the 77 Dataverse installations, with a column indicating whether the Python script was able to download each dataset's metadata. It is the union of the "dataset_pids_..." files in the 77 zip files.
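
That union could be reproduced with a few lines of pandas, for example (a sketch assuming the zip files have first been unpacked in place):

import glob
import pandas as pd

# Gather the per-installation "dataset_pids_..." CSV files and stack them.
paths = glob.glob(
    "dataverse_json_metadata_from_each_known_dataverse_installation/*/dataset_pids_*.csv"
)
union = pd.concat((pd.read_csv(path) for path in paths), ignore_index=True)
union.to_csv("dataset_pids_union.csv", index=False)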

The licenses_used_by_dataverse_installations.csv file contains information about the licenses that some installations let depositors choose when creating datasets. When I collected this data, 36 installations were running versions of the Dataverse software that allow depositors to choose a license or data use agreement from a dropdown menu in the dataset deposit form. For more information, see https://guides.dataverse.org/en/5.11.1/user/dataset-management.html#choosing-a-license.

The metadatablocks_from_most_known_dataverse_installations.csv file contains the metadata block names, field names, and child field names (if the field is a compound field) of the 77 Dataverse installations' metadata blocks. It is useful for comparing each installation's dataset metadata model (the metadata fields and metadata blocks that each installation uses). The CSV file was created with a Python script at https://github.com/jggautier/dataverse-scripts/blob/main/other_scripts/get_csv_file_with_metadata_block_fields_of_all_installations.py, which takes as input the directories and files created by the get_dataset_metadata_of_all_installations.py script.
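
For example, here is a hedged sketch of one such comparison with pandas. The column positions below are placeholders, not the file's documented layout; substitute the CSV's actual headers:

import pandas as pd

blocks = pd.read_csv("metadatablocks_from_most_known_dataverse_installations.csv")

# Placeholders: assume one column identifies the installation and another the
# field name; check the header row for the real column names.
installation_col = blocks.columns[0]
field_col = blocks.columns[2]

# Count how many installations' metadata models include each field.
coverage = blocks.groupby(field_col)[installation_col].nunique()
print(coverage.sort_values(ascending=False).head(10))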

Known errors

The metadata of two datasets from one of the known installations could not be downloaded because the datasets' pages and metadata could not be accessed with the Dataverse APIs.

About metadata blocks

Read about the Dataverse software's metadata blocks system at http://guides.dataverse.org/en/latest/admin/metadatacustomization.html

 
Subject: Computer and Information Science; dataset metadata; dataverse; metadatablocks

Language: English

Date: 2022-10

Contributor: Gautier, Julian

Type: machine-readable text