Folder structure, file names, and versioning
A carefully planned folder structure is fundamental to well-organized research material. A folder structure with intelligible folder names and an intuitive design should therefore already be in place when data collection begins.
Organize the project files in folders whose names are meaningful even to a person with limited knowledge of the project – such as a new colleague or a stakeholder who wants to know what is happening in the project. Each folder should:
- follow a structure with folders and subfolders that correspond to the project design and workflow
- have a self-explanatory name that is only as long as is necessary
- have a unique name – avoid assigning the same name to a folder and a subfolder.
The folder structure shows everyone involved in the project where information can be found, and provides a template for how to save and organize the project data. If data are collected several times, there can be a folder for each round of collection, with standardized names that state what was collected, the collection context, and the date.
In the top folder of the folder structure, you may want to add a plain-text file in .txt format (a ReadMe file) that describes the structure and explains the reasoning behind your decisions on file names and file versioning. If you later have to change the folder structure, document the change in this file.
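As an illustration only, the following Python sketch sets up such a structure with a ReadMe file in the top folder. The round names, stage names, and ReadMe wording are hypothetical examples, not a recommended standard; adapt them to the design and workflow of your own project.

```python
from pathlib import Path

# Hypothetical collection rounds, named by what was collected and when.
ROUNDS = ["interviews_2024-03", "interviews_2024-09"]
# Hypothetical workflow stages; adapt these to the project design.
STAGES = ["raw", "cleaned", "analysis"]

# Example ReadMe wording (an assumption, not a prescribed template).
README_TEXT = (
    "Folder structure\n"
    "One folder per collection round, named <content>_<year-month>.\n"
    "Each round has one subfolder per workflow stage, named <round>_<stage>.\n"
    "\n"
    "Changes to the structure\n"
    "(record the date and the reason for every change here)\n"
)


def create_structure(root: Path) -> list[Path]:
    """Create the round and stage folders under root and return the stage folders."""
    created = []
    for round_name in ROUNDS:
        for stage in STAGES:
            # Prefix the stage with the round so that no two folders share a name.
            folder = root / round_name / f"{round_name}_{stage}"
            folder.mkdir(parents=True, exist_ok=True)
            created.append(folder)
    readme = root / "README.txt"
    if not readme.exists():  # never overwrite an existing description
        readme.write_text(README_TEXT, encoding="utf-8")
    return created
```

Calling create_structure(Path("my_project")) a second time is harmless: existing folders are left in place and an existing ReadMe file is not overwritten.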
A research project can quickly accumulate a very large number of files, so you should decide on a file name convention in advance. Doing so will simplify the work during data collection and processing, and will make it easier to find files in the folder structure. This is even more important if several people are going to create and give names to files in the project.
A file name should:
- be unique not only in its own folder, but preferably in the entire project. If a file should fall out of its original folder, the file name should provide enough information to know which folder it belongs to
- give some idea of the contents of the file
- be fairly short
- contain the file version number.
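One possible way to apply these rules consistently is to generate file names with a small helper instead of typing them by hand. The Python sketch below is a minimal example; the function names, the underscore and hyphen separators, and the project name are assumptions made for illustration.

```python
import re


def slug(text: str) -> str:
    """Lowercase the text and replace every run of other characters with a hyphen."""
    return re.sub(r"[^a-z0-9]+", "-", text.lower()).strip("-")


def build_file_name(project: str, content: str, version: int, extension: str) -> str:
    """Combine project, content description, and version into one file name."""
    if version < 1:
        raise ValueError("version numbers start at 1")
    # The project prefix keeps the name unique across the whole project,
    # so a misplaced file can still be traced back to where it belongs.
    parts = [slug(project), slug(content), f"v{version:02d}"]
    return "_".join(parts) + "." + extension.lstrip(".")


# Hypothetical project and content description.
print(build_file_name("Dialect Study", "Speaker 1 glossary", 2, "wav"))
# prints: dialect-study_speaker-1-glossary_v02.wav
```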
One way to keep track of changes to files and datasets is to create versions of the data files. The first version of the data is usually the result of the data collection, followed by versions of processed or cleaned files, up until a final version. Every newly saved version of the data should be given a new version number (e.g. v01, v02, v03, etc.), and preferably also the file creation date.
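The numbering can be automated so that nobody has to look up the latest version by hand. The Python sketch below assumes a hypothetical naming pattern of the form name_vNN_YYYY-MM-DD.extension; the pattern and the function name are illustrative choices, not a fixed convention.

```python
import re
from datetime import date
from pathlib import Path

# Assumed pattern: <stem>_v<number>_<YYYY-MM-DD>, e.g. speaker1_glossary_v02_2024-03-05
VERSION_PATTERN = re.compile(
    r"^(?P<stem>.+)_v(?P<version>\d+)_(?P<date>\d{4}-\d{2}-\d{2})$"
)


def next_version_name(existing_names, stem: str, extension: str, today: date) -> str:
    """Return the file name for the next version of stem, dated today."""
    versions = []
    for name in existing_names:
        match = VERSION_PATTERN.match(Path(name).stem)
        if match and match["stem"] == stem:
            versions.append(int(match["version"]))
    # Start at v01 when no earlier version exists.
    next_version = max(versions, default=0) + 1
    return f"{stem}_v{next_version:02d}_{today.isoformat()}.{extension}"
```

In practice, existing_names could be the names of the files already in the folder, for example [p.name for p in folder.iterdir()].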
With a file versioning structure, you can easily find the latest version of a data file, and see what has been done in various versions, so you can tell where a particular file is in the workflow. It is also recommended that you keep a list or log of changes (in the work file itself, or in a separate document), where you document the changes for each respective file, version by version. That way, it is transparent which actions and changes were made when, and it will be easier to backtrack and find something that was present in a previous version, but which has later been deleted or changed.
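A change log kept in a separate document can be as simple as a CSV file with one row per saved version. The Python sketch below shows one way to maintain such a log; the column names and the functions log_change and history are assumptions chosen for this example.

```python
import csv
from datetime import date
from pathlib import Path

# Assumed log columns; extend them to fit the project.
LOG_FIELDS = ["file", "version", "date", "author", "change"]


def log_change(log_path: Path, file_name: str, version: str,
               author: str, change: str, when: date) -> None:
    """Append one entry to the change log, writing the header on first use."""
    is_new = not log_path.exists()
    with log_path.open("a", newline="", encoding="utf-8") as handle:
        writer = csv.DictWriter(handle, fieldnames=LOG_FIELDS)
        if is_new:
            writer.writeheader()
        writer.writerow({
            "file": file_name,
            "version": version,
            "date": when.isoformat(),
            "author": author,
            "change": change,
        })


def history(log_path: Path, file_name: str) -> list[dict]:
    """Return all logged changes for one file, in the order they were recorded."""
    with log_path.open(newline="", encoding="utf-8") as handle:
        return [row for row in csv.DictReader(handle) if row["file"] == file_name]
```

Because entries are only ever appended, the log itself preserves the order in which the changes were made.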
One form of versioning is to use a so-called "executable file", in which all changes are written and then applied to a locked file containing the original data. This is a common method for statistical analysis applications such as Stata, R, and SAS. If you choose this method, you should make the executable file (or at least the part of the file that adjusts the dataset) accessible. You should also document your code carefully, so that other users can understand it. Note that some high-ranking journals request that the analysis code be made accessible.
An important reason for file versioning is provenance: the documentation of the origin and history of a data object. You should be able to account for what has been done with the material, step by step, in case someone later questions the project’s conclusions and data. Versioning makes it possible to go back and view the actions and data processing.
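The same idea can be sketched in Python, although the text above refers to Stata, R, and SAS. In this minimal example, the original file is set to read-only and every adjustment lives in the script, which writes its result to a new file. The column name duration_ms and the two cleaning steps are hypothetical, and the sketch assumes a well-formed CSV file with a header row.

```python
import csv
import stat
from pathlib import Path


def lock(path: Path) -> None:
    """Make the original data file read-only for owner, group, and others."""
    path.chmod(stat.S_IRUSR | stat.S_IRGRP | stat.S_IROTH)


def clean(rows: list[dict]) -> list[dict]:
    """Apply the documented cleaning steps to the rows, in order."""
    cleaned = []
    for row in rows:
        # Step 1: remove stray whitespace around every value.
        row = {key: (value or "").strip() for key, value in row.items()}
        # Step 2: drop rows without a measurement (hypothetical column).
        if row["duration_ms"]:
            cleaned.append(row)
    return cleaned


def run(raw_path: Path, out_path: Path) -> int:
    """Read the locked original, clean it, and write the result to a new file."""
    with raw_path.open(newline="", encoding="utf-8") as handle:
        reader = csv.DictReader(handle)
        fieldnames = reader.fieldnames
        rows = clean(list(reader))
    # The original is only ever opened for reading; output goes to out_path.
    with out_path.open("w", newline="", encoding="utf-8") as handle:
        writer = csv.DictWriter(handle, fieldnames=fieldnames)
        writer.writeheader()
        writer.writerows(rows)
    return len(rows)
```

Since the script can be rerun on the untouched original at any time, the script itself documents every change made to the data.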
Examples of file names and versions
Regardless of whether you work alone on a project or in a project group with several people, there should be rules for how to name files. That way, it is easier to find the right file at a later point in time.
Imagine that you have the following data files in a project:
So what are we seeing here? In which order have the files been collected? What do they contain? How do they relate to one another? Is Peter S the same person as speaker1?
Now assume that the files in the example above had been named like this:
Now we can see that all of the files contain speaker 1’s reading of the words in a glossary. The files are different versions: the original file at the top, followed by cleaned versions of the original, and the final version at the bottom. Whether you keep a separate folder for each speaker (one folder for speaker 1, another for speaker 2, etc.) or keep all of the speakers together with one version in each folder (e.g. all of the original files, or all first versions, in one folder) depends on what works best for your specific project.
When you start a project, remember to:
- decide on guidelines for the versioning, folder structure, and file naming for the entire project
- appoint one person who is responsible for making sure that the naming and versioning guidelines are followed
- update the guidelines as needed and document the changes.