FAIR Management of the Data
Making data findable, including provision for metadata
In general, data related to FAMILY can be found on the FAMILY website (available from the sixth month of the project).
Are the data produced and/or used in the project discoverable with metadata, identifiable and locatable by means of a standard identification mechanism (e.g. persistent and unique identifiers such as Digital Object Identifiers)?
All data produced by FAMILY will be discoverable through detailed, descriptive metadata. Where possible, the data will be associated with persistent identifiers, such as DOIs.
What naming conventions do you follow?
FAMILY will use standardized and harmonized variable names linked to metadata in a data dictionary.
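As an illustration of how a data dictionary can link cohort-specific column names to harmonized variable names, consider the minimal sketch below. The variable names, the fields of each entry, and the cohort mapping are all hypothetical and are not taken from any FAMILY cohort or from the actual FAMILY data dictionary.

```python
from dataclasses import dataclass


@dataclass(frozen=True)
class VariableEntry:
    """One row of a hypothetical data dictionary."""
    name: str          # harmonized variable name
    description: str   # human-readable meaning
    unit: str          # measurement unit or "category"
    dtype: str         # expected data type


# Illustrative entries only (assumed, not from the FAMILY dictionary).
DATA_DICTIONARY = {
    "age_years": VariableEntry("age_years", "Age at assessment", "years", "float"),
    "sex": VariableEntry("sex", "Sex recorded at birth", "category", "str"),
}

# Hypothetical mapping from one cohort's local column names to harmonized names.
COHORT_A_MAPPING = {"AGE": "age_years", "Gender": "sex"}


def harmonize_columns(columns, mapping, dictionary):
    """Return (harmonized names, local names with no dictionary entry)."""
    renamed, unmapped = [], []
    for col in columns:
        target = mapping.get(col)
        if target is not None and target in dictionary:
            renamed.append(target)
        else:
            unmapped.append(col)
    return renamed, unmapped


renamed, unmapped = harmonize_columns(
    ["AGE", "Gender", "bmi_raw"], COHORT_A_MAPPING, DATA_DICTIONARY
)
print(renamed, unmapped)
```

Reporting unmapped columns separately, rather than silently dropping them, makes it easier to spot variables that still need a harmonization decision.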
Will search keywords be provided that optimize possibilities for re-use?
Do you provide clear version numbers?
All files will be marked with explicit dates (YYYY-MM-DD) and version numbers, where appropriate.
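One way such a convention could be enforced is sketched below. The pattern `<stem>_<YYYY-MM-DD>_v<N>.<ext>` is an assumption chosen for illustration; the plan itself only states that files carry a date and a version number, not how they are arranged in the file name.

```python
import re
from datetime import date

# Assumed pattern: <stem>_<YYYY-MM-DD>_v<version>.<extension>
FILENAME_PATTERN = re.compile(
    r"^(?P<stem>[a-z0-9_]+)_(?P<date>\d{4}-\d{2}-\d{2})_v(?P<version>\d+)"
    r"\.(?P<ext>[A-Za-z0-9]+)$"
)


def build_filename(stem, day, version, ext):
    """Compose a file name carrying an ISO 8601 date and a version number."""
    return f"{stem}_{day.isoformat()}_v{version}.{ext}"


def parse_filename(filename):
    """Split a file name into its parts; raise ValueError if it does not conform."""
    match = FILENAME_PATTERN.match(filename)
    if match is None:
        raise ValueError(f"File name does not follow the convention: {filename}")
    return {
        "stem": match["stem"],
        "date": date.fromisoformat(match["date"]),  # also rejects impossible dates
        "version": int(match["version"]),
        "ext": match["ext"],
    }


name = build_filename("cohort_a_phenotypes", date(2024, 3, 1), 2, "csv")
parsed = parse_filename(name)
print(name, parsed["version"])
```

Writing the date as year, month, day means that an alphabetical listing of files with the same stem is also a chronological one.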
What metadata will be created?
The metadata created will follow the standard for each data type.
Making data openly accessible
Which data produced and/or used in the project will be made openly available as the default? If certain datasets cannot be shared (or need to be shared under restrictions), explain why, clearly separating legal and contractual reasons from voluntary restrictions. Note that in multi-beneficiary projects it is also possible for specific beneficiaries to keep their data closed if relevant provisions are made in the consortium agreement and are in line with the reasons for opting out.
Because the consortium will work with sensitive, pseudo-anonymized human data, the principles of open science must be balanced against the need for data protection and privacy. Standard Operating Procedures (SOPs) for human data are specific to each cohort, so procedures for data sharing and data accessibility may differ between cohorts. FAMILY members who want to use pseudo-anonymized raw data will need to request access from the data access committee through a Data access / Publication request form (Appendix 2) and sign the appropriate data transfer agreement. All data transfer and availability procedures will comply with the guidelines of the corresponding ethics committee. Once access is granted, the data will be available through the DRE to the consortium members who requested it.
Because processed, anonymized data are no longer confidential, they will be available to all consortium members.
All FAMILY results, pipelines and scientific publications will be made openly available to every consortium member through the DRE and the FAMILY intranet and will be published in open-access journals and repositories (as far as possible).
For a detailed description of the data access and result dissemination permissions see Appendix 3.
What methods or software tools are needed to access the data? Is documentation about the software needed to access the data included? Is it possible to include the relevant software (e.g. in open source code)?
To access the data, FAMILY members will need access to the DRE, where all software needed for data access and analysis will be pre-installed. The specific software tools will depend on the data and analysis types. The use of open-source software and code will be encouraged. Documentation about the software will be included when appropriate (e.g. when the software is developed within FAMILY and no documentation is yet available online).
Where will the data and associated metadata, documentation and code be deposited? Preference should be given to certified repositories which support open access where possible. Have you explored appropriate arrangements with the identified repository?
Where relevant, FAMILY’s methodology, code and documentation will be made available on GitHub and on the FAMILY website.
If there are restrictions on use, how will access be provided?
See section 3.2.1.
Is there a need for a data access committee?
As stated in sections 2.3 and 2.4 of this document, the data analyzed within the FAMILY consortium comes from several sources. Thus, a data access committee will be created in order to accommodate the SOPs of each sharing party.
Are there well described conditions for access (i.e. a machine readable license)?
This will be explained in the appendix since it depends on the cohorts and will be overseen by the data access committee.
How will the identity of the person accessing the data be ascertained?
Each consortium member has a personal and non-transferable username and password for both the FAMILY intranet and the DRE.
Making data interoperable
Are the data produced in the project interoperable, that is allowing data exchange and re-use between researchers, institutions, organizations, countries, etc. (i.e. adhering to standards for formats, as much as possible compliant with available (open) software applications, and in particular facilitating re-combinations with different datasets from different origins)?
The DRE provides a flexible, scalable cloud-based platform where researchers can access and work with the data, methods and tools that are available in FAMILY. The environment is secure and self-service, supports real-time collaboration, provides data and process audit trails, and is compliant with all laws and regulations (D4LS, Feb 2017). The DRE operates on the Microsoft Azure platform, and the hardware is located within the EU. Microsoft Azure respects the intellectual property (IP) of the researcher. The DRE enables FAMILY researchers to collaborate on research projects in a safe, yet flexible compute and storage environment. The architecture of the DRE allows researchers to work within the boundaries of data management rules and regulations. Alongside the DRE, consortium partners will have access to the Dutch National Supercomputer (“Snellius”) when high-performance computing is required. Where a partner’s data privacy protections prohibit the use of these resources, Singularity containers will be used to guard against the introduction of pipeline- and platform-dependent biases.
What data and metadata vocabularies, standards or methodologies will you follow to make your data interoperable? Will you be using standard vocabularies for all data types present in your data set, to allow inter-disciplinary interoperability?
Harmonization of datasets will build on existing efforts and plans already in place with large, EU-funded consortia using similar data types and structures (e.g. LifeCycles [79] and Early Cause [80]), which jump-starts the harmonization process. In nonstandard cases where differing measures or instruments have been used, a team of experts from the different sites will inventory and evaluate all phenotypic data collected across the consortium in order to identify the best options for harmonization.
To facilitate sharing and long-term inter-disciplinary use of FAMILY’s data, the following formats will be used: pdf, txt, csv, sql, dat (SPSS), RData, DICOM, NIfTI. All files will be marked with explicit dates (YYYY-MM-DD) and version numbers, where appropriate, and provenance information will be documented in the Knowledge Base. FAMILY will use standardized variable names linked to metadata in a data dictionary. Where possible, metadata will be embedded in the files themselves (e.g. as attributes within RData structures). Industry-standard data structures will be used for brain imaging data, and standardized processing pipelines will be applied to imaging and -omics data, in many cases within Singularity containers, so that processing is consistent and reproducible across all data.
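To make the idea of documented provenance concrete, the sketch below builds a small provenance record for one file, including a SHA-256 checksum so that later copies can be verified against the original. The record fields, the example values, and the function names are assumptions for illustration; the actual structure of the Knowledge Base is not described in this plan.

```python
import hashlib
import json
from datetime import datetime, timezone


def sha256_of_bytes(data: bytes) -> str:
    """Return the hexadecimal SHA-256 digest of the given bytes."""
    return hashlib.sha256(data).hexdigest()


def make_provenance_record(filename, content, source, pipeline, pipeline_version):
    """Build a provenance entry (hypothetical schema) for one data file."""
    return {
        "file": filename,
        "sha256": sha256_of_bytes(content),
        "source": source,                      # e.g. the contributing cohort
        "pipeline": pipeline,                  # processing pipeline applied
        "pipeline_version": pipeline_version,
        "recorded_at": datetime.now(timezone.utc).isoformat(),
    }


# Illustrative content only; in practice the bytes would be read from the file.
content = b"id,age_years\n1,9.5\n"
record = make_provenance_record(
    "cohort_a_phenotypes_2024-03-01_v2.csv",
    content,
    source="cohort_a",
    pipeline="phenotype_harmonization",
    pipeline_version="1.0.0",
)
print(json.dumps(record, indent=2))
```

Storing the record as plain JSON keeps it readable without specialized software, in line with the preference for open formats.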
In case it is unavoidable that you use uncommon or generate project specific ontologies or vocabularies, will you provide mappings to more commonly used ontologies?
Increase data re-use
How will the data be licensed to permit the widest re-use possible?
Data will not be licensed. However, there will be a long-term data re-use plan at the end of the FAMILY project.
When will the data be made available for re-use? If an embargo is sought to give time to publish or seek patents, specify why and how long this will apply, bearing in mind that research data should be made available as soon as possible.
The data will be made available for re-use as soon as it is processed, analyzed and inspected to make sure it complies with the quality standards.
Are the data produced and/or used in the project usable by third parties, in particular after the end of the project? If the re-use of some data is restricted, explain why. How long is it intended that the data remains re-usable?
A long-term data (re)use model and financial plan will be developed. Re-use will fall into two general categories: uses that do not require access to the original data, and uses that do. For the former, normative models will be made available via standard open access platforms (e.g. GitHub) for broad dissemination and use. For the latter, the DRE will be used. Resources will be identified to maintain storage of the data within the DRE, and a plan will be outlined for coordinating data requests with each consortium partner. Importantly, this strategy also allows new groups to incorporate their data into the DRE, expanding the potential of this resource.
In accordance with local guidelines and regulations, participant data will be retained for at least 10 years after completion of the FAMILY studies, at the site where they were originally obtained and preferably also in the DRE (if local legislation allows transfer of data to the DRE).
Are data quality assurance processes described?
Each partner is responsible for the quality control of the data collected within its own cohort.
Standard quality control procedures will be applied to processed data.