The Findability and Reusability of research data can be enhanced by producing and maintaining detailed metadata and documentation. An accurate description of useful and relevant facts about the data allows it to be found, understood and reused. Comprehensive information about the data (metadata) enables evaluation of its quality and validation of the research findings.
When creating metadata, use standard terminology or descriptions from your discipline, while also considering how to describe your data to people outside your discipline.
Metadata standards and methods of organising and documenting data vary across disciplines. Conventions should be defined at the start of each project and used consistently. When entering metadata fields such as keywords, think about which terminology would best enhance the discoverability of your data. Is there terminology that would make your data discoverable to people who are less familiar with it? The following elements should be recorded as relevant when documenting research data:
| General Overview | |
|---|---|
| Title | A name for the dataset or the name of the project |
| Creator | Names and contact details of the organisations or people who created the data, and their unique identifiers (ORCID, MQ OneID) |
| Date | Any key dates associated with the data. This may include project start and end dates, the time period covered, or any other important dates |
| Method | Information on how the data was generated, such as the specific equipment or software used (including model and version numbers), formulae, algorithms or methodologies |
| Processing | Information on how the data has been transformed, altered or processed (e.g. normalisation) |
| Source | Citations for any data used in the research that was obtained or derived from other sources. At a minimum this should include the creator, the year, the title of the dataset and some form of access information |
| Content Description | |
|---|---|
| Subjects | Keywords or phrases describing the subject or content of the data |
| Location | Descriptions of relevant geographic information. This could be city names, region names, countries or more precise geographic coordinates |
| Variable List | A list of all variables in the data files, where applicable. This could also be captured in a codebook |
| Code List | Explanation of codes or abbreviations used in either the file names or the variables in the data files (e.g. '999 indicates a missing value in the data') |
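A documented code list also lets analysis code handle such codes explicitly. As a minimal sketch (the file name, column name and code value here are hypothetical), the "999 means missing" convention above could be applied when loading a CSV with pandas:

```python
import pandas as pd

# Per the dataset's code list, 999 marks a missing value in the 'age'
# column. Declaring it at load time converts those codes to NaN instead
# of silently treating them as real measurements.
df = pd.read_csv("survey_data.csv", na_values={"age": [999]})

print(df["age"].isna().sum(), "missing values recoded from 999")
```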
| Technical Description | |
|---|---|
| File Inventory | A list of all the files that make up the dataset, including extensions (e.g. 'photo1023.jpeg', 'participant12.pdf') |
| File Formats | File formats of the data (e.g. SPSS, CSV, HTML, PDF, JPEG, JSON, XML) |
| File Structure | Organisation of the data file(s) and layout of the variables, where applicable |
| Version | Information on the different versions of the dataset that exist, if relevant |
| Software | Names and version numbers of any special-purpose software packages required to use, create, view or analyse the data |
| Access | |
|---|---|
| Rights | Any known intellectual property rights, statutory rights, licences or restrictions on the use of the data |
| Access Information | Where and how the data can be accessed |
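One lightweight way to record these elements is a machine-readable "sidecar" file stored alongside the data. The sketch below (all names and values are hypothetical) writes such a record as JSON using Python:

```python
import json

# Hypothetical metadata record covering the elements listed above.
metadata = {
    "title": "Harbour Sediment Sampling Project",
    "creator": "A. Researcher (ORCID: 0000-0000-0000-0000)",
    "date": {"start": "2023-02-01", "end": "2023-11-30"},
    "method": "Grab sampler model XYZ; readings logged with LoggerSoft 4.2",
    "processing": "Concentrations normalised to dry weight",
    "source": "Derived in part from Smith (2021), 'Baseline Sediment Survey'",
    "subjects": ["marine sediment", "heavy metals", "estuarine ecology"],
    "location": "Sydney Harbour, NSW, Australia",
    "file_formats": ["CSV", "JPEG"],
    "version": "1.0",
    "software": "R 4.3.1 with tidyverse 2.0.0",
    "rights": "CC BY 4.0",
    "access": "Available on request from the project data custodian",
}

# Store the record next to the data files it describes.
with open("dataset_metadata.json", "w", encoding="utf-8") as f:
    json.dump(metadata, f, indent=2, ensure_ascii=False)
```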
Metadata is “data about data”. It should be captured for all data collected, generated or collated by researchers. The information expected to describe the characteristics of data differs among disciplines.
There are some general metadata standards used across broad fields. Dublin Core is a general standard that aids the discovery, distribution and management of data. It comprises elements such as Title, Creator, Subject, Date and Type, and can be used to describe many different types of content. Several disciplines have their own distinctive ways of structuring metadata, and directories that list appropriate standards for diverse disciplines are available online.
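To make the Dublin Core elements concrete, here is a minimal sketch (the title, creator, subjects and date are hypothetical) that writes a simple Dublin Core record as XML using Python's standard library:

```python
import xml.etree.ElementTree as ET

# Dublin Core element namespace (DCMI elements 1.1).
DC_NS = "http://purl.org/dc/elements/1.1/"
ET.register_namespace("dc", DC_NS)

record = ET.Element("metadata")
# Hypothetical values for a handful of core elements.
for element, value in {
    "title": "Urban Bird Call Recordings 2023",
    "creator": "A. Researcher",
    "subject": "bioacoustics; urban ecology",
    "date": "2023-09-15",
    "type": "Dataset",
}.items():
    ET.SubElement(record, f"{{{DC_NS}}}{element}").text = value

ET.ElementTree(record).write("dc_record.xml", encoding="utf-8",
                             xml_declaration=True)
```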
Some data repositories have minimum metadata requirements that must be met. If your domain doesn't have a required metadata standard or vocabulary, you can choose a widely used standard or vocabulary, or you can use the metadata standard adopted by the Macquarie University Research Data Repository.
Below are some examples of the various standards:
| Discipline | Metadata Standard |
|---|---|
| General | Dublin Core |
| Archaeology and Heritage | |
| Arts | Categories for the Description of Works of Art (CDWA); Visual Resources Association Core (VRA Core) |
| Astronomy | Astronomy Visualization Metadata (AVM) |
| Biology | Darwin Core |
| Ecology | Ecological Metadata Language (EML) |
| Geographic | Content Standard for Digital Geospatial Metadata (CSDGM) |
| Social Sciences | Data Documentation Initiative (DDI) |
Organising your data, either manually (for example, by setting up a file management and versioning system) or, preferably, by implementing a reproducible workflow and version control system on an online data platform (such as Git, GitHub or LabArchives), will help you record changes to your data and increase reproducibility, transparency and efficiency.
It may seem basic, but careful organisation of the following elements is essential for data management:
Develop or adopt a naming convention built from elements that are meaningful to the project (for example, project name or abbreviation, content description, date and version number).
File naming recommendations:

- Keep names short but descriptive, and apply the convention consistently across the project.
- Avoid spaces and special characters; separate elements with hyphens or underscores.
- Use an ISO 8601 date format (YYYYMMDD or YYYY-MM-DD) so files sort chronologically.
- Include a two-digit version indicator (e.g. v01, v02) for files that will be revised.

Examples of GOOD file names:

'20230915_mq-birdcalls_site-a_v02.csv'

OR

'interview-transcript_p12_2023-09-15_v01.docx'

Examples of BAD file names:

'final version 2 (new).docx', 'data.xlsx', 'Meeting notes!!.doc'
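A convention is easiest to keep when it can be checked automatically. The sketch below (the pattern and file names are hypothetical, matching the first example convention above) validates names with a regular expression:

```python
import re

# Hypothetical convention: YYYYMMDD_project_description_vNN.ext
PATTERN = re.compile(
    r"^\d{8}"          # date, e.g. 20230915
    r"_[a-z0-9-]+"     # project name or abbreviation
    r"_[a-z0-9-]+"     # content description
    r"_v\d{2}"         # two-digit version number
    r"\.[a-z0-9]+$"    # file extension
)

for name in ["20230915_mq-birdcalls_site-a_v02.csv",
             "final version 2 (new).docx"]:
    status = "OK " if PATTERN.match(name) else "BAD"
    print(f"{status} {name}")
```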
Conventions should also be used to organise folders or directories, allowing for coherence and consistency. This will ensure it is easy to locate, organise, navigate and understand the context of all files and versions. Use rules that make the most sense for your data. Some options could be to organise folders by project stage, data type, date or collection period, or researcher.
Consider an appropriate folder hierarchy. The number of levels or sub-folders in the directory should be kept small, generally to a maximum depth of three; an example is sketched below.
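As a minimal sketch (the project and folder names are hypothetical), a three-level hierarchy like the one described above can be laid out, and even scaffolded programmatically:

```python
from pathlib import Path

# Hypothetical project laid out to a maximum depth of three levels:
# project / category / sub-folder
FOLDERS = [
    "harbour-project/data/raw",
    "harbour-project/data/processed",
    "harbour-project/docs/ethics",
    "harbour-project/analysis/figures",
]

for folder in FOLDERS:
    # parents=True creates intermediate levels; exist_ok avoids
    # errors when the scaffold is re-run.
    Path(folder).mkdir(parents=True, exist_ok=True)
    print("created", folder)
```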
A dataset will undergo changes during a project as records or files are corrected, modified or otherwise changed. Changes may alter the conclusions drawn from a dataset. To uphold the reliability of the results, it is vital to record any changes to a dataset and to know which version of the data was used in the research. It is often also necessary to be able to return to an earlier version.
Versioning can be accomplished in various ways. The best approaches are automated or semi-automated, using software designed for the purpose. For example, spreadsheet-type data can be edited using OpenRefine (taught at Macquarie in Data Carpentry workshops), which keeps a record of all changes made to the data. Similarly, editing RAW photos in Adobe Lightroom Classic intrinsically retains the original photo, with changes recorded in file metadata or an XMP sidecar file (although note that separating changes made at different times requires further action, e.g. creating a 'virtual copy' for each distinct batch of edits). Software version control systems like Git can also be used for data version control, and shell scripts can be written to automate incremental backups. Shell and Git are taught at Macquarie in introductory Software Carpentry workshops.
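Where a shell script would do, the same idea can be sketched in Python (a hypothetical snapshot helper, not any of the tools named above): copy the current data file into a snapshots folder under a timestamped name before each editing session:

```python
import shutil
from datetime import datetime
from pathlib import Path

def snapshot(data_file: str, snapshot_dir: str = "snapshots") -> Path:
    """Copy data_file into snapshot_dir under a timestamped name."""
    src = Path(data_file)
    dest_dir = Path(snapshot_dir)
    dest_dir.mkdir(exist_ok=True)
    stamp = datetime.now().strftime("%Y%m%d-%H%M%S")
    dest = dest_dir / f"{src.stem}_{stamp}{src.suffix}"
    shutil.copy2(src, dest)  # copy2 preserves file timestamps
    return dest

# Take a snapshot before editing begins (hypothetical file name).
print("saved", snapshot("survey_data.csv"))
```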
Office software, like MS O365, also maintains a 'version history' of documents and spreadsheets, but this history should be treated as a fallback only and reviewed regularly, as certain actions may reset it.
Likewise, SharePoint, CloudStor and similar general-purpose file storage systems keep old versions of files that are overwritten (e.g. if you download a file, edit it and upload it again), but they cull old versions according to a schedule that may change at any time, so researchers should familiarise themselves with the relevant culling schedule (e.g. for CloudStor). Use of these systems is encouraged but, as with office software versioning, they are best seen as a fallback whose limitations are understood.
If no automated or semi-automated approach to versioning can be implemented, manual versioning is possible, but it must be done purposefully. Many projects that rely on manual versioning discover too late that a version they need was never saved. Nevertheless, manual versioning is better than nothing.
Manual approaches to versioning usually contain two important features: