
Managing Research Data @MQ

Guidance on Management of Research Data at Macquarie University


The findability and reusability of research data can be enhanced by producing and maintaining detailed metadata and documentation. An accurate description of useful and relevant facts about the data allows it:

  • to be discovered and recognised,
  • to be connected to future associated data or publications,
  • to be contextualized in time and space.

Comprehensive information about the data (metadata) enables evaluation of its quality and validation of the research findings.

When creating metadata, use standard terminology or descriptions from your discipline, while also considering how to describe your data to people outside your discipline area.

Metadata and documenting data

Metadata standards and methods of organising and documenting data vary across disciplines. Conventions should be defined at the start of each project and used consistently. When entering metadata fields such as keywords, think about which terminology would best enhance the discoverability of your data. Is there terminology that would make your data discoverable to people who are less familiar with it? The following elements should be recorded as relevant when documenting research data:

General Overview

  • Title: The name of the dataset or the name of the project
  • Creator: Names and contact details of the organisations or people who created the data, and their unique identifiers (ORCID, MQ OneID)
  • Date: Any key dates associated with the data. This may include project start and end dates, the time period covered, or any other important dates associated with the data
  • Method: Information on how the data was generated, such as specific equipment or software used (including model and version numbers), formulae, algorithms or methodologies
  • Processing: Information on how the data has been transformed, altered or processed (e.g. normalisation)
  • Source: Citations to any data used in the research that was obtained or derived from other sources. At minimum, include the creator, the year, the title of the dataset and some form of access information

Content Description

  • Subjects: Keywords or phrases describing the subject or content of the data
  • Location: Descriptions of relevant geographic information. This could be city names, region names, countries or more precise geographic coordinates
  • Variable List: A list of all variables in the data files, where applicable. This could also be captured in a codebook
  • Code List: Explanation of codes or abbreviations used in either the file names or the variables in the data files (e.g. '999 indicates a missing value in the data')

Technical Description

  • File Inventory: A list of all the files that make up the dataset, including extensions (e.g. 'photo1023.jpeg', 'participant12.pdf')
  • File Formats: File formats of the data (e.g. SPSS, CSV, HTML, PDF, JPEG, JSON, XML)
  • File Structure: Organisation of the data file(s) and layout of the variables, where applicable
  • Version: Information on the different versions of the dataset that exist, if relevant
  • Software: Names and version numbers of any special-purpose software packages required to use, create, view or analyse the data
  • Rights: Any known intellectual property rights, statutory rights, licences or restrictions on use of the data
  • Access Information: Where and how the data can be accessed
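As a sketch, the elements above could be captured in a simple key-value structure. All field names and values below are hypothetical examples, not a prescribed MQ schema:

```python
# A hypothetical metadata record covering the elements listed above.
# Field names and values are illustrative only, not a prescribed schema.
metadata = {
    "title": "Harbour Water Quality Survey",  # hypothetical dataset
    "creator": "Jane Doe (ORCID: 0000-0000-0000-0000)",
    "date": {"project_start": "2019-06-01", "project_end": "2019-12-20"},
    "method": "Samples analysed with a spectrometer (model and version recorded)",
    "processing": "Readings normalised to a 0-1 scale",
    "subjects": ["water quality", "marine ecology"],
    "location": "Sydney Harbour, NSW, Australia",
    "code_list": {"999": "missing value"},
    "file_formats": ["CSV", "JPEG"],
    "rights": "CC BY 4.0",
    "access_information": "Available on request from the project lead",
}

# A few fields a descriptive record should rarely omit.
REQUIRED = {"title", "creator", "date", "subjects", "rights"}

def missing_fields(record):
    """Return any required descriptive fields not present in the record."""
    return REQUIRED - record.keys()

print(missing_fields(metadata))  # -> set()
```

Keeping the record alongside the data files (for example, as a README or a small JSON file) makes it easy to check completeness before deposit.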


Metadata Standards or Vocabularies

Metadata is “data about data”. Metadata should be captured for all data collected, generated or collated by researchers. The information expected to be captured to describe the characteristics of data differs among disciplines.

There are some general metadata standards used across broad fields. Dublin Core is a general standard that aids the discovery, distribution and management of data. It comprises elements such as Title, Creator, Subject, Date and Type, and can be used to describe many different types of content. Several disciplines have their own distinctive ways of structuring metadata. Two excellent resources that list appropriate standards for diverse disciplines are:

  1. UK Digital Curation Centre
  2. RDA Metadata Standards Directory

Some data repositories have minimum metadata requirements that must be met. If your domain doesn't have a required metadata standard or vocabulary, you can choose a widely used standard or vocabulary, or use the metadata standard used by the Macquarie University Research Data Repository.
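As an illustration, a minimal Dublin Core record can be written as element/value pairs and serialised to simple XML. The record values below are hypothetical; the element names and namespace follow the Dublin Core element set:

```python
# Hypothetical Dublin Core record; element names follow the DC element set.
dc_record = {
    "title": "Harbour Water Quality Survey",
    "creator": "Jane Doe",
    "subject": "water quality; marine ecology",
    "date": "2019-06-15",
    "type": "Dataset",
}

def to_dc_xml(record):
    """Serialise a record as simple Dublin Core XML elements."""
    lines = ['<metadata xmlns:dc="http://purl.org/dc/elements/1.1/">']
    for element, value in record.items():
        lines.append(f"  <dc:{element}>{value}</dc:{element}>")
    lines.append("</metadata>")
    return "\n".join(lines)

print(to_dc_xml(dc_record))
```

A repository ingest form typically maps onto elements like these, so recording them early saves rework at deposit time.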

Below are some examples of the various standards:

  • General: Dublin Core (DC); Metadata Object Description Schema (MODS); Metadata Encoding and Transmission Standard (METS)
  • Archaeology and Heritage: Archaeological Data Service (UK) Guides to Good Practice; Getty Institute 'Setting the Stage'; Categories for the Description of Works of Art (CDWA); Visual Resources Association (VRA Core)
  • Astronomy: Astronomy Visualization Metadata (AVM)
  • Biology: Darwin Core
  • Ecology: Ecological Metadata Language (EML)
  • Geographic: Content Standard for Digital Geospatial Metadata (CSDGM)
  • Social Sciences: Data Documentation Initiative (DDI)

Organising Data

Organising your data, either manually (for example, by setting up a file management and versioning system) or, preferably, by implementing a reproducible workflow and version control system in an online data platform (such as Git, GitHub or LabArchives), will help you record changes to your data and increase reproducibility, transparency and efficiency.

It may seem basic, but careful organisation of the following elements is essential for data management:

  1. File naming
  2. Folders and directory structures
  3. File versioning
  4. File formats

Develop or utilise a naming convention using aspects that have meaning to the project.

File Naming Recommendations:

  • Files should be named consistently.
  • Names should be concise but descriptive, ideally between 25 and 50 characters.
  • Use sentence case or underscores, rather than periods, spaces or slashes.
  • Include the date of creation in "YYYYMMDD" format. Placing the date at the front of the name allows files to be sorted chronologically.
  • Add a version number.
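The recommendations above can be sketched as a small helper that assembles a name from its parts; the initials and description used are hypothetical:

```python
from datetime import date

def make_file_name(initials, description, version, extension, created=None):
    """Build a file name as YYYYMMDD_initials_description_Vn.ext."""
    created = created or date.today()
    stamp = created.strftime("%Y%m%d")  # date first, so files sort chronologically
    return f"{stamp}_{initials}_{description}_V{version}.{extension}"

print(make_file_name("JD", "original", 1, "tiff", created=date(2019, 6, 15)))
# -> 20190615_JD_original_V1.tiff
```

Generating names programmatically like this keeps the convention consistent across a whole project.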

Examples of GOOD file names:

  • 20190615_JD_original_V1.tiff
  • 20190616_JD_original_V2.tiff
  • 20190615_JD_cropped.jpeg
  • 20190615_JD_edited.jpeg

or, equivalently, without underscores:

  • 20190615JDOriginalV1.tiff
  • 20190615JDOriginalV2.tiff
  • 20190615JDCropped.jpeg
  • 20190615JDEdited.jpeg

Examples of BAD file names:

  • survey.doc
  • Document1.docx
  • Scan.pdf
  • Lit rev,bib, appendices.docx
  • output NVB>3.0.xml

Conventions should be used to organise folders or directories that allow for coherency and consistency.

  • Coherency - anyone using the folders should be able to recognise that there is a system and understand what it means.
  • Consistency - anyone using the folders should create folder names in line with the system, and keep the relevant files in the appropriate folders.

This will ensure it is easy to locate, organise, navigate and understand the context of all files and versions. Use rules that make the most sense for your data. Some options could be:

  • Project
  • Date
  • Analysis
  • Location


Consider an appropriate folder hierarchy. The number of levels or sub-folders in the directory should be kept small, generally to a maximum depth of three.
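A hypothetical layout keeping to three levels (project / date / analysis stage) could be created like this; the folder names are illustrative only:

```python
import tempfile
from pathlib import Path

# Hypothetical three-level layout: project / date / analysis stage.
root = Path(tempfile.mkdtemp()) / "harbour_survey"
for day in ("20190615", "20190616"):
    for stage in ("raw", "processed"):
        (root / day / stage).mkdir(parents=True, exist_ok=True)

# List the resulting structure relative to the project root.
for path in sorted(p.relative_to(root) for p in root.rglob("*")):
    print(path)
```

Scripting the structure once, rather than creating folders ad hoc, keeps the hierarchy consistent as the project grows.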



File Versioning

A dataset will undergo changes during a project as records or files are corrected, modified or otherwise changed. Changes may alter the conclusions drawn from a dataset. To uphold the reliability of the results, it is vital to record any changes to a dataset and to know which version of the data was used in the research. It is also often vital to be able to return to an earlier version.

Keeping a data history, in which earlier versions of records or files can be retrieved, is a key element of data provenance and research integrity.

Versioning can be accomplished in various ways. The best approaches are automated or semi-automated, using software designed for the purpose. For example, spreadsheet-type data can be edited using OpenRefine (taught at Macquarie in Data Carpentry workshops), which keeps a record of all changes made to the data. Similarly, editing RAW photos in Adobe Lightroom Classic intrinsically retains the original photo, with changes recorded in file metadata or an XMP sidecar file (although note that to separate changes made at different times, further action is required, e.g. creating a 'virtual copy' for each distinct batch of edits). Software version control systems such as Git can also be used for data version control, and shell scripts can be written to automate incremental backups. The shell and Git are taught at Macquarie in introductory Software Carpentry workshops.

Office software, such as Microsoft 365, also maintains a 'version history' of documents and spreadsheets, but this history should be considered a fallback only and reviewed regularly, as certain actions may reset it.

Likewise, SharePoint, CloudStor and similar general-purpose file storage systems keep old versions of files that are overwritten (e.g. if you download a file, edit it and upload it again), but they cull old versions according to a schedule that may change at any time (researchers should familiarise themselves with such culling schedules, e.g. for CloudStor). Use of these systems is encouraged but, as with office software versioning, they are best seen as a fallback, with an understanding of their limitations.

If no automated or semi-automated approach to versioning can be implemented, manual versioning is possible, but it must be done purposefully. Many projects that rely on manual versioning find that a version they later need was never saved. Nevertheless, manual versioning is better than nothing.

Manual approaches to versioning usually contain two important features:

  1. A sequential number added to the file name to indicate which version of the file it is
  2. A change table in each document recording versions, dates, authors and details of changes to the file
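The first feature above can be assisted with a small helper that finds the next free version number in a folder; the file names used here are hypothetical:

```python
import re
import tempfile
from pathlib import Path

def next_version_name(directory, stem, extension):
    """Return the next available '<stem>_Vn.<extension>' name in a directory."""
    pattern = re.compile(rf"{re.escape(stem)}_V(\d+)\.{re.escape(extension)}")
    versions = [
        int(match.group(1))
        for path in Path(directory).glob(f"{stem}_V*.{extension}")
        if (match := pattern.fullmatch(path.name))
    ]
    return f"{stem}_V{max(versions, default=0) + 1}.{extension}"

# Demonstration with hypothetical files in a temporary folder.
folder = Path(tempfile.mkdtemp())
(folder / "survey_V1.csv").touch()
(folder / "survey_V2.csv").touch()
print(next_version_name(folder, "survey", "csv"))  # -> survey_V3.csv
```

Scanning for the highest existing number, rather than remembering it, avoids accidentally overwriting an earlier version.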


  • Disciplinary metadata (Digital Curation Centre, UK)
  • Data versioning (ANDS) Provides information about data version control and why it is importance for researchers
  • Git Git is a free and open-source distributed version control system designed to handle everything from small to large projects.
  • Version Control and Authenticity  A good outline of the best practice of version control and examples of file versions.