Data Collection and Management

Overview

Teaching: 30 min
Exercises: 20 min

Questions

How do we effectively manage study data?

Objectives

By the end of this episode you will be able to discuss important practices for data collection, management, and storage in epidemiological studies.

In epidemiological studies, data is used to address the what, who, where, when, and how questions of an investigation: What is the problem? Who has been affected? Where did the event occur? When did it happen? How are the observed events and the exposure related? Epidemiological data is used to determine the cause of an event and to establish appropriate intervention measures. Data from epidemiological studies reveals the state and course of events related to a disease in individuals who have been exposed to one or more agents. It is important to identify the right type of data required for a given study and the possible sources of that data. The scope and objectives of the study are essential in determining the data to be collected.

Sources of data

There are several sources of epidemiological data. These include, but are not limited to, case report forms completed by health care providers, hospital records or summaries, population-based surveys (de-identified data from a sample population), medical devices, laboratory tests, public health records, and disease registries.

Epidemiological data sources fall into two main categories: primary and secondary data. Primary data is collected by an investigator for their specific study, usually through questionnaires administered to individuals or through group surveys. Collecting primary data can be costly and time-consuming, so it is important to first establish whether pre-existing data relevant to the study is available and accessible. Secondary data has already been collected by others for different purposes, for example patient medical records or medical data held by a department of health. Secondary data is often obtained from an existing database and is relatively easy to work with because it is structured. However, transferring secondary data to a different database system can be challenging: differences in format may lead to missing values or incomplete information being transferred. When the secondary data has a different structure from the current study data, it first has to be transformed, harmonized, and integrated to form a unified database for analysis.
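The transform, harmonize, and integrate step described above can be sketched in a few lines of Python. This is a minimal illustration, not a prescribed method: the field names, the two date formats, the sex codes, and the sample records are all hypothetical and would need to be replaced by the actual schemas of the sources being combined.

```python
from datetime import datetime

# Hypothetical mappings: source field name -> unified field name.
HOSPITAL_MAP = {"pid": "participant_id", "dob": "birth_date", "gender": "sex"}
SURVEY_MAP = {"id": "participant_id", "date_of_birth": "birth_date", "sex": "sex"}

# Hypothetical date formats and sex codes used by the two sources.
DATE_FORMATS = ("%Y-%m-%d", "%d/%m/%Y")
SEX_CODES = {"m": "male", "male": "male", "f": "female", "female": "female"}

UNIFIED_FIELDS = ("participant_id", "birth_date", "sex")


def parse_date(value):
    """Return an ISO date string, or None if the value cannot be parsed."""
    for fmt in DATE_FORMATS:
        try:
            return datetime.strptime(value, fmt).date().isoformat()
        except (TypeError, ValueError):
            continue
    return None


def harmonize(record, mapping):
    """Rename fields, normalise values, and make missing values explicit."""
    renamed = {mapping[k]: v for k, v in record.items() if k in mapping}
    out = {field: renamed.get(field) for field in UNIFIED_FIELDS}
    out["birth_date"] = parse_date(out["birth_date"])
    sex = out["sex"]
    out["sex"] = SEX_CODES.get(sex.strip().lower()) if isinstance(sex, str) else None
    return out


# Made-up records from two sources with different structures.
hospital = [{"pid": "H001", "dob": "1980-02-14", "gender": "F"}]
survey = [
    {"id": "S001", "date_of_birth": "03/11/1975", "sex": "male"},
    {"id": "S002", "sex": "m"},  # birth date was never collected
]

# Integrate both sources into one list that shares a single schema.
unified = [harmonize(r, HOSPITAL_MAP) for r in hospital]
unified += [harmonize(r, SURVEY_MAP) for r in survey]
```

Every record in `unified` carries the same three fields, and a value that was absent or unparseable in the source is stored as `None` rather than silently dropped, so incomplete information stays visible after the transfer.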

Data quality and management

Data management involves all activities pertaining to the transfer and maintenance of data in a central database. Effective data management produces an accurate and reusable database of information. The quality of the data in a database directly affects the strength of the inferences that can be drawn from it: poor-quality data contains errors and random noise that weaken those inferences. Several software platforms for data capture and entry, such as Research Electronic Data Capture (REDCap), ease data collection and database development by providing a user-friendly interface for capturing data. The data capture system can be hosted at each participating site in the study and at a central coordinating centre that maintains the system. Data quality checks must be implemented at site level and during transfer to the central database to ensure the validity and integrity of the data collected.
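As a rough sketch of what checks at site level and during transfer could look like, the Python example below validates each record before it leaves a site and then compares a record count and checksum on both sides of the transfer. It is independent of any particular capture platform and does not use the REDCap API; the required fields, the age range, and the sample batch are assumptions made for illustration.

```python
import hashlib
import json

REQUIRED_FIELDS = ("participant_id", "site", "age")  # hypothetical schema


def validate_record(record):
    """Return a list of problems found in one record (empty if it passes)."""
    problems = []
    for field in REQUIRED_FIELDS:
        if record.get(field) in (None, ""):
            problems.append(f"missing {field}")
    age = record.get("age")
    if isinstance(age, (int, float)) and not 0 <= age <= 120:
        problems.append("age out of range")
    return problems


def batch_fingerprint(records):
    """Record count plus a SHA-256 digest of a canonical JSON encoding."""
    payload = json.dumps(records, sort_keys=True, separators=(",", ":"))
    return len(records), hashlib.sha256(payload.encode("utf-8")).hexdigest()


# Made-up batch captured at one site.
site_batch = [
    {"participant_id": "P001", "site": "A", "age": 34},
    {"participant_id": "P002", "site": "A", "age": 250},
    {"participant_id": "", "site": "A", "age": 51},
]

# Site-level check: only records without problems are sent on.
rejected = {i: validate_record(r) for i, r in enumerate(site_batch) if validate_record(r)}
accepted = [r for r in site_batch if not validate_record(r)]

# Transfer check: the central database recomputes the fingerprint on arrival.
sent = batch_fingerprint(accepted)
received = batch_fingerprint(json.loads(json.dumps(accepted)))  # simulated transfer
transfer_ok = sent == received
```

Rejected records are kept together with the reason for rejection so that the site can correct them, and a mismatch between `sent` and `received` would signal that records were lost or altered in transit.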

It is of paramount importance that the data collected meets the objective of surveillance and is of high quality. Several concepts have been used to assess the quality of data in research studies; the dimensions most often considered are:

Accuracy: Is the data collected correct, and does it reflect the true state of an event?
Completeness: Is the information coverage comprehensive? For example, is it known what proportion of the population has a given health state and what proportion does not?
Reliability: The data should be consistent and comparable to other sources.
Validity: The data should make sense in comparison to similar existing data, i.e., is the data plausible?
Timeliness: Does the data reflect the true state of an event at a given time point?
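Some of these dimensions can be turned into simple numerical indicators. The Python sketch below computes three of them on a made-up set of case records. The definitions used here are narrow operational assumptions rather than standard measures: completeness as the share of records with a field filled in, validity as the share of values inside a plausible range, and timeliness as the share of cases reported within a chosen number of days after onset.

```python
from datetime import date

# Made-up case records; None marks a missing value.
records = [
    {"id": "P001", "age": 34, "onset": date(2024, 3, 1), "reported": date(2024, 3, 3)},
    {"id": "P002", "age": None, "onset": date(2024, 3, 2), "reported": date(2024, 3, 20)},
    {"id": "P003", "age": 130, "onset": date(2024, 3, 5), "reported": date(2024, 3, 6)},
    {"id": "P004", "age": 58, "onset": None, "reported": date(2024, 3, 9)},
]


def share(flags):
    """Proportion of True values, or None when there is nothing to assess."""
    flags = list(flags)
    return sum(flags) / len(flags) if flags else None


def completeness(rows, field):
    """Share of rows in which the field is present."""
    return share(r[field] is not None for r in rows)


def validity(rows, field, low, high):
    """Share of non-missing values that fall inside a plausible range."""
    return share(low <= r[field] <= high for r in rows if r[field] is not None)


def timeliness(rows, max_delay_days):
    """Share of rows with both dates that were reported within the allowed delay."""
    delays = [
        (r["reported"] - r["onset"]).days
        for r in rows
        if r["onset"] is not None and r["reported"] is not None
    ]
    return share(0 <= d <= max_delay_days for d in delays)


age_completeness = completeness(records, "age")   # 3 of 4 ages are present
age_validity = validity(records, "age", 0, 120)   # 2 of 3 present ages are plausible
report_timeliness = timeliness(records, 7)        # 2 of 3 datable cases within 7 days
```

Accuracy and reliability are harder to compute from a single data set, because they require comparison with the true state of events or with another source.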

Data Privacy and Protection

Epidemiological data is obtained from study participants, and protecting the personal information of all participants is of utmost importance. To comply with ethical standards for studies involving human subjects, care must be taken to avoid any risk of disclosing personal information and to maintain confidentiality at all times. The study should have guidelines in place that ensure adherence to the general governing principles for data protection. Only de-identified data should be stored in the database that may be shared and used for research. Personal information should be stored only in special circumstances, such as the need for follow-up; it must be kept separate from the study data, with access limited to authorized personnel. It is very important to prevent re-identification of study subjects. Access to study data needs to be restricted and granted only to authorized people who meet the set requirements for data access. Researchers intending to use epidemiological data need to obtain ethics clearance from the relevant body, both to avoid legal repercussions and to ensure that the data is used for the correct purposes.
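One way to keep personal information separate from study data is to split each record on entry and link the two parts only through a random study identifier. The Python sketch below illustrates the idea under stated assumptions: the list of direct identifiers and the sample record are hypothetical, and the linkage table stands in for a separate store with restricted access.

```python
import secrets

# Hypothetical list of fields treated as direct identifiers.
DIRECT_IDENTIFIERS = ("name", "national_id", "phone")


def deidentify(record, linkage):
    """Split one record into study data and a separately stored linkage entry."""
    study_id = secrets.token_hex(8)  # random, carries no personal information
    linkage[study_id] = {k: record[k] for k in DIRECT_IDENTIFIERS if k in record}
    study = {k: v for k, v in record.items() if k not in DIRECT_IDENTIFIERS}
    study["study_id"] = study_id
    return study


# The linkage table would be stored apart from the research database,
# accessible only to authorized personnel (for example, for follow-up).
linkage_table = {}

# Made-up participant record.
raw = {
    "name": "Jane Doe",
    "national_id": "X123",
    "phone": "555-0100",
    "age": 42,
    "diagnosis": "malaria",
}

study_record = deidentify(raw, linkage_table)
```

Removing direct identifiers is only a first step. Combinations of remaining fields, such as age, location, and a rare diagnosis, may still allow re-identification, so what is released should be reviewed against the study's data protection guidelines.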

Key Points

The scope and objectives of the study are essential in determining the data to be collected.

There are two main categories of epidemiological data sources: primary and secondary data.

Primary data is collected by an investigator for their specific study.

Secondary data is that which has already been collected by others for different purposes.

Data management involves all activities pertaining to the transfer and maintenance of data in a central database.

The dimensions of data quality are Accuracy, Completeness, Reliability, Validity, and Timeliness.

A study should have in place guidelines that ensure adherence to the general governing principles for data protection.

It is very important to ensure that re-identification of study subjects is prevented.
