Folder structure, file names, and versioning
- Folder structure
- File names
- Versioning using file names
- Version control
- Programmatic versioning
A carefully planned folder structure is fundamental for a well-organised research material. This means that there should already be a folder structure in place when the data collection begins, with intelligible folder names and an intuitive design.
It is recommended that the project files are organised in folders with names that mean something significant, even to a person with limited knowledge of the project – such as a new colleague or a stakeholder who wants to know what happens in the project.
- follow a structure with folders and subfolders that correspond to the project design and workflow
- have a self-explanatory name that is only as long as is necessary
- have a unique name – avoid assigning the same name to a folder and a subfolder.
The folder structure gives an overall picture of which information can be found where to all who are involved in the project and provides a template for how to save and organise the project data. If data are collected several times, there can be folders for each round of collections, with standardised names of what is collected, the collection context, and date.
In the top folder of the folder structure, you may want to add a .txt format file (a ReadMe file) with a description of the structure and what your thoughts were regarding the decisions on file names and file versioning. If you later have to change the folder structure, you will document it in this file.
A research project can quickly accumulate a very large number of files, so you should decide on a file name convention in advance. Doing so will simplify the work during data collection and processing, and will make it easier to find files in the folder structure. This is even more important if several people are going to create and give names to files in the project.
A file name should:
- be unique not only in its own folder, but preferably in the entire project. If a file should fall out of its original folder, the file name should provide enough information to know which folder it belongs to
- give some idea of the contents of the file
- be fairly short
- contain the file version number.
Versioning using file names
Research workflows often require modification of data files, sometimes several iterations. After data are collected, they need to be cleaned-up and processed, until the final dataset is created.
The simplest way to track such modifications is to create “versions” (local working copies) of the data files. That is, the data files are not modified, but rather the results of successive processing iterations are saved as new files. The original version of the data is usually from the data collection, and every new saved version of the data is given a new version number (e.g., v01, v02, v03, etc.) or labelled with the file creation date (ISO format, e.g., 2023-08-28). With a system for naming different versions of files, you can easily find the latest version of a data file, document the various versions, and tell where a particular file is in the workflow.
Imagine that you have the following data files in a folder:
What are you seeing here? In which order have the files been collected? What do they contain? How do they relate to one another? Is Peter S the same person as speaker1?
Now assume that the files were named:
Now you can see that all of the files contain speaker1’s reading of the words in a glossary. The files are different versions: the original file at the top, followed by cleaned versions of the original, and the final version at the bottom.
Provenance is the documentation of the origin and history of a data object.
Ensure provenance by keeping a log of changes, where you document when and how each file version was created. Your workflow will then be documented if someone later questions the project’s data or conclusions, or if you yourself need to backtrack and implement changes in the processing chain.
Additionally, documenting when each file version was finalized will make it easier to recover files from backup – such as by restoring a previous version of a OneDrive file.
Version control systems offer a more sophisticated way to version data. Files, together with changes and updates to them, are compiled into a central data structure called a “repository” or "repo”. Every change to a data file can be documented, and changes to multiple files can be packaged together. Multiple users can collaborate on changing files, with provision for merging concurrent changes and resolving any overlaps, and users can create local copies and test changes.
Common open-source tools for version control of source code and text-based data files (.txt, .csv, .md) are Git and Subversion. Version control of application-specific files (e.g., .xlxs) may require a commercial solution.
Cloud-based platforms that offer hosting and collaboration tools – known colloquially as “code repositories” – include GitHub, BitBucket, GitLab. A version control repo can be stored locally, however, either on your own computer if only you will use it or managed by a Git/Subversion server for use by your research group.
An alternative to versioning using filenames is to create a script or executable that reads the original datafile and modifies the data. This is a common method for statistical analysis applications such as STATA, R, and SAS.
You should document your code carefully, so other users can understand the processing steps. You can also publish the script along with the data: note that some high-ranking journals request that the analysis code is made accessible.
When you start a project, remember to:
- decide on guidelines for the versioning, folder structure, and file naming for the entire project
- appoint one person who is responsible for making sure that the naming and versioning guidelines are followed
- update the guidelines as needed and document the changes.