Prepare the data for deposit
When you share research data through a data repository, you enable others to review research results and to reuse the data in future studies. This requires that the data are organized and presented in a self-explanatory way. Certified repositories (including SND) apply a quality review process, which requires that deposited data meet certain minimum requirements before they are published. The curators will work with you to ensure that the published dataset is reusable, so you will need to be contactable and take an active part in the review process.
Here are some useful recommendations:
- Datasets that contain personal information can only be described and shared via SND's research data catalogue if your research organization and SND have an agreement for this purpose. Here you can read more about which organizations offer their researchers this possibility and how it works in practice.
- Datasets that contain personal data cannot usually be shared with open access through SND's research data catalogue; such datasets must have restricted access.
- If you belong to an organization that does not offer the possibility of sharing sensitive data via SND, you need to ensure that the data are anonymous. You can read more about anonymization here.
- Data files should be saved in a standard, open, and non-proprietary format (see SND’s page Choosing a file format and SND's guides to good research data management).
- File and folder names should be meaningful and consistent. File names with sequential numbers or codes should be explained, for instance in a README file.
- Datasets that consist of several files should be structured in a way that is intuitive to other users. The structure and file relations can be described in a README file.
- When datasets consist of many files, it is often best to pack them into .zip archives for easier download, which can also help reduce file sizes. You can also consider splitting the entire dataset and publishing it as several separate datasets. The datasets can be marked as “related” in SND’s catalogue.
- Files should be cleared of irrelevant information. This includes variables that are not described, calculated variables that can be reconstructed from the primary data, or administrative data. Colours for text and formulas should be removed.
- If the file format supports variable-level metadata, then by all means include relevant metadata in the data files (e.g., variable names and codes for variable values for tabular data, or information about coding standard or the meaning of different formatting etc. for textual data). The important thing is that such information is saved with the data files; the exact format is secondary.
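The recommendation above to pack many files into a .zip archive can be sketched in Python. This is a minimal illustration using only the standard library, not an SND tool; the function name pack_dataset and the folder layout in the usage note are hypothetical.

```python
import zipfile
from pathlib import Path


def pack_dataset(folder: str, archive_path: str) -> list[str]:
    """Pack every file under `folder` into a compressed .zip archive.

    Returns the relative paths that were stored in the archive.
    """
    root = Path(folder)
    # Collect the files before the archive is created, so the archive
    # never tries to include itself if it is written inside `folder`.
    files = sorted(p for p in root.rglob("*") if p.is_file())
    members = []
    # ZIP_DEFLATED compresses the content, which can reduce file sizes.
    with zipfile.ZipFile(archive_path, "w", compression=zipfile.ZIP_DEFLATED) as archive:
        for path in files:
            # Store paths relative to the dataset root so the folder
            # structure is preserved when the archive is unpacked.
            relative = path.relative_to(root).as_posix()
            archive.write(path, arcname=relative)
            members.append(relative)
    return members
```

Calling pack_dataset("my_dataset", "my_dataset.zip") would, under these assumptions, produce one archive that keeps the folder structure, including any README file stored in the dataset folder.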
Metadata is structured information used to describe and categorize digital information. Metadata makes it easier for users to search, find, and understand research material.
- You create metadata by describing a dataset using SND’s documentation tool DORIS.
- Mandatory fields represent the minimum level of metadata that SND requires. But additional information makes it easier for other users to find the dataset and understand the files’ contents.
- Metadata should be as precise as possible. If the data are from field work in Colombia and Peru, enter Colombia and Peru on the “Geographic coverage” tab, rather than just South America.
- Link to articles or other publications which describe or are based on the study data. You can also link to other related resources.
Relevant documentation must be appended to the data description so that other researchers can understand and reuse the data. Give careful thought to what kind of documentation is needed to understand the data. Examples include:
- Variable lists with explanations of the contents in each variable
- Questionnaires or surveys
- Interview forms, including interview instructions
- Code lists and code books
- An inventory of the data material
- Links to articles or other publications
- Method descriptions or technical reports
- Information about how the data have been processed
- Syntax for derived variables
- End of project reports
- Instructions for how to manage the data in custom-developed software
- Fieldwork diaries or log books.
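One of the items above, an inventory of the data material, can be generated rather than typed by hand. The following is a rough sketch that assumes a simple inventory of relative file paths and file sizes is enough; the function names and the two column names are hypothetical, and SND does not prescribe this layout.

```python
import csv
from pathlib import Path


def build_inventory(folder: str) -> list[dict]:
    """List every file under `folder` with its relative path and size."""
    root = Path(folder)
    rows = []
    for path in sorted(root.rglob("*")):
        if path.is_file():
            rows.append(
                {
                    "file": path.relative_to(root).as_posix(),
                    "size_bytes": path.stat().st_size,
                }
            )
    return rows


def write_inventory(rows: list[dict], out_path: str) -> None:
    """Write the inventory as a CSV file, an open and plain-text format."""
    with open(out_path, "w", newline="", encoding="utf-8") as handle:
        writer = csv.DictWriter(handle, fieldnames=["file", "size_bytes"])
        writer.writeheader()
        writer.writerows(rows)
```

The resulting file only lists what exists; a short description of each file's contents still has to be added manually.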
SND has no specific requirements for how documentation should be presented. What constitutes documentation, and how it is formatted, varies across research areas and within disciplines. From SND’s perspective, what matters most is the content of the documentation. If there is no existing, completed documentation, relevant information should be collected into a README file (see an example developed by Cornell).
Simply citing a published article or a report associated with the research data is rarely sufficient as documentation. Even if there is an open-access article that describes how the data were collected or created, you should include a README file that explains how the contents of the data files relate to what is described in the article. A typical README file for a tabular dataset will, for example, list all the columns in the data file, describe how they link to the method description, state the variables’ units or the values of categorical variables, and explain quality codes for missing values.
Also, keep in mind that someone from a different research discipline may want to reuse your research data, so the documentation should preferably be comprehensible to other research groups. Defining acronyms, abbreviations, and method descriptions – even those that are so common in your discipline that they usually need no definition – can assist researchers from other fields.
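For the tabular README described above, a starting skeleton can be generated from the header row of a data file. This is only a sketch: it assumes a UTF-8 encoded CSV file whose first row holds the column names, and the function name and the placeholder fields are hypothetical, not an SND or Cornell template.

```python
import csv
from pathlib import Path


def readme_skeleton(csv_path: str) -> str:
    """Return README text with one entry to fill in per column of a CSV file."""
    path = Path(csv_path)
    with path.open(newline="", encoding="utf-8") as handle:
        # The first row is assumed to contain the column names;
        # an empty file yields an empty column list.
        header = next(csv.reader(handle), [])
    lines = [f"File: {path.name}", "", "Columns"]
    for name in header:
        lines.append(f"- {name}")
        # Placeholders for the details a README should state per column.
        lines.append("    Description: TODO")
        lines.append("    Unit or categories: TODO")
        lines.append("    Missing-value codes: TODO")
    return "\n".join(lines) + "\n"
```

The generated text is a checklist, not documentation: each TODO has to be replaced with a description written for readers outside your own discipline, with acronyms and abbreviations spelled out.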