Missing data

Revision as of 20:25, 31 May 2013

Editor-In-Chief: C. Michael Gibson, M.S., M.D. [1]; Gonzalo Romero, M.D. [2]

Overview

In statistics, missing data refers to the absence of recorded data for a given variable. Missing data are frequent in clinical research. They are an important source of bias and reduce the consistency (precision or reproducibility) of a study. They can substantially affect the conclusions of a study, potentially leading to invalid results being drawn from the data.

Classification of missing data

Missing data can be classified into 3 categories, depending on their relationship with the independent or dependent (outcome) variables:

  1. Missing completely at random (MCAR)
  2. Missing at random (MAR)
  3. Missing not at random (MNAR)

Missing completely at random (MCAR)

Missingness is independent of both observed and unobserved data, and is therefore not related to the independent variables or the outcome.

Missing at random (MAR)

Missingness is not related to the outcome but is related to the independent variables (for example age, race, gender). Note that this does not correspond to the everyday notion of 'random': the probability of a value being missing depends on the observed values (independent variables), not on the missing values themselves. It may still influence the results if the independent variable is related to the outcome.

Example: Older patients dropping out of an intervention because of their physical condition (difficulty walking to the center for follow-up), which is not related to the outcome.

Missing not at random (MNAR)

Missingness is related to the outcome. It is considered the worst type of missing data because the dropouts are related to the therapy or intervention under investigation. The pattern of missing data is related to unobserved data, which makes it impossible to use other values from the dataset to predict the missing values.
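The three mechanisms can be contrasted with a small simulation. The sketch below is illustrative only: the variable names, the linear relationship between age and outcome, and the dropout probabilities are all assumptions made for the example, not values from any study.

```python
import random
import statistics

def simulate(mechanism, n=20000, seed=0):
    """Return (mean of all outcomes, mean of observed outcomes) for one mechanism."""
    rng = random.Random(seed)
    full, observed = [], []
    for _ in range(n):
        age = rng.uniform(20, 80)                # independent variable, always observed
        outcome = 0.5 * age + rng.gauss(0, 10)   # hypothetical outcome that depends on age
        if mechanism == "MCAR":
            p_missing = 0.3                      # same chance of dropout for everyone
        elif mechanism == "MAR":
            p_missing = 0.6 if age > 60 else 0.1      # depends only on observed age
        elif mechanism == "MNAR":
            p_missing = 0.6 if outcome > 30 else 0.1  # depends on the outcome itself
        else:
            raise ValueError(f"unknown mechanism: {mechanism}")
        full.append(outcome)
        if rng.random() >= p_missing:            # the value is recorded
            observed.append(outcome)
    return statistics.mean(full), statistics.mean(observed)

for name in ("MCAR", "MAR", "MNAR"):
    true_mean, observed_mean = simulate(name)
    print(name, round(true_mean, 2), round(observed_mean, 2))
```

In this setup the observed mean stays close to the full-sample mean under MCAR. Under MAR it drifts because age is related to the outcome, but age is recorded and can be used to correct for it. Under MNAR the drift cannot be corrected from the recorded variables alone.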

Handling missing data

  1. Methods based on completely recorded units, such as complete-case analysis and available-case analysis
  2. Weighted methods
  3. Methods based on single imputation
  4. Other methods

Complete case analysis (CCA)

Analyzes only the subjects who completed the study.

Advantages:

  • Does not require manipulating the data

Disadvantages:

  • Reduces study power, which increases the likelihood of type II error.
  • May bias the results, because dropout increases the risk of imbalanced groups.
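As a minimal sketch, complete case analysis amounts to dropping every record that has a missing value. The record layout and field names below are hypothetical, and `None` is assumed to mark a missing value.

```python
def complete_cases(records):
    """Keep only the records in which no field is missing (None)."""
    return [r for r in records if all(v is not None for v in r.values())]

# Hypothetical trial records: one baseline and one final measurement per subject.
records = [
    {"id": 1, "baseline": 140, "final": 128},
    {"id": 2, "baseline": 150, "final": None},   # dropped out before the final visit
    {"id": 3, "baseline": 135, "final": 130},
]

analyzed = complete_cases(records)
print(len(analyzed), "of", len(records), "subjects analyzed")
```

The shrinking sample size is what reduces power, and if dropout differs between arms the remaining groups may no longer be comparable.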

Last Observation Carried Forward (LOCF)

Impute the missing data with the value of the last observation with available data.

Advantages:

  • Historically widely accepted by the FDA
  • It is very simple
  • Includes all subjects (ITT analysis)
  • Mimics real-life scenarios (as many patients in real clinical practice do not adhere to treatment)

Disadvantages:

  • Can lead to biased estimates, because it assumes that a patient's outcome stays at the last observed value after dropout (neither improving nor worsening); this can also decrease the power of a study.
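A minimal sketch of LOCF for one subject's series of visits, assuming the visits are in chronological order and that `None` marks a missing measurement (both are assumptions of this example):

```python
def locf(visits):
    """Fill each missing visit with the most recent observed value."""
    filled, last = [], None
    for value in visits:
        if value is not None:
            last = value          # remember the latest observed value
        filled.append(last)       # stays None if nothing has been observed yet
    return filled

print(locf([120, 118, None, None, 115, None]))
# → [120, 118, 118, 118, 115, 115]
```

Values missing before the first observation are left as `None` here, since there is nothing to carry forward.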

Mean Substitution

  • Impute the missing data using the mean of the non-missing values.
  • A very liberal approach: it greatly decreases the standard deviation, reducing the noise and amplifying the signal, which increases the probability of type I error.
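The shrinking standard deviation can be seen in a short sketch. The data are made up for illustration, and `None` is assumed to mark a missing value.

```python
import statistics

def mean_substitute(values):
    """Replace each missing value with the mean of the non-missing values."""
    observed = [v for v in values if v is not None]
    mean = statistics.mean(observed)
    return [mean if v is None else v for v in values]

values = [4.0, 6.0, None, 8.0, None, 10.0]
filled = mean_substitute(values)
print(filled)                                                # → [4.0, 6.0, 7.0, 8.0, 7.0, 10.0]
print(statistics.stdev([v for v in values if v is not None]))  # about 2.58
print(statistics.stdev(filled))                              # 2.0
```

Every imputed value sits exactly at the mean, so the spread of the completed data is smaller than the spread of the observed data.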

Regression Substitution

  • Build a regression model with baseline characteristics as predictors of the outcome using the available data. Then use the model to predict the outcome for patients with missing values.

Advantages:

  • Has the potential to reduce bias by using all available data to estimate the response for subjects with missing values

Disadvantages:

  • Requires specialized statistical software
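A minimal sketch of regression substitution with a single baseline predictor, fitted by ordinary least squares. Using one predictor and a straight line is a simplifying assumption for the example; in practice several baseline characteristics would usually enter the model.

```python
def fit_line(xs, ys):
    """Ordinary least squares fit of y = intercept + slope * x."""
    n = len(xs)
    mean_x, mean_y = sum(xs) / n, sum(ys) / n
    sxx = sum((x - mean_x) ** 2 for x in xs)
    sxy = sum((x - mean_x) * (y - mean_y) for x, y in zip(xs, ys))
    slope = sxy / sxx
    return slope, mean_y - slope * mean_x

def regression_substitute(baseline, outcome):
    """Predict each missing outcome (None) from the subject's baseline value."""
    pairs = [(b, o) for b, o in zip(baseline, outcome) if o is not None]
    slope, intercept = fit_line([b for b, _ in pairs], [o for _, o in pairs])
    return [intercept + slope * b if o is None else o
            for b, o in zip(baseline, outcome)]

# Hypothetical data: the third subject has no recorded outcome.
print(regression_substitute([1.0, 2.0, 3.0, 4.0], [2.0, 4.0, None, 8.0]))
```

The model is fitted only on subjects with a recorded outcome and then applied to the others, so observed values are never altered.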

Increase random variability

  • Use statistical software to generate a set of random values for the outcome and then replace missing data by randomly selecting values from this set.

Disadvantages:

  • Requires familiarity with statistics
  • Involves specific software
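One possible reading of this approach is sketched below. It assumes the random values are drawn from a normal distribution with the mean and standard deviation of the observed outcomes; the text does not specify the distribution, so that choice, the pool size, and the function name are assumptions of the example.

```python
import random
import statistics

def random_draw_impute(values, pool_size=1000, seed=0):
    """Fill missing values (None) by sampling from a pool of simulated outcomes."""
    rng = random.Random(seed)
    observed = [v for v in values if v is not None]
    mean, sd = statistics.mean(observed), statistics.stdev(observed)
    # Assumed model: outcomes are roughly normal around the observed mean.
    pool = [rng.gauss(mean, sd) for _ in range(pool_size)]
    return [rng.choice(pool) if v is None else v for v in values]

completed = random_draw_impute([4.0, 6.0, None, 8.0, None, 10.0])
print(completed)
```

Unlike mean substitution, the imputed values vary around the mean, so the variability of the completed data is not artificially reduced.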

Baseline Carried Forward

  • Assume that all participants with missing data return to their baseline status.
  • Replace missing data with the baseline data from each patient.

Disadvantages:

  • It might underestimate the effects of treatment (type II error)
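A minimal sketch, assuming one baseline and one follow-up value per patient and `None` for a missing follow-up (the data are hypothetical):

```python
def bocf(baseline, followup):
    """Replace each missing follow-up value with the same patient's baseline value."""
    return [b if f is None else f for b, f in zip(baseline, followup)]

baseline = [30, 28, 35]
followup = [22, None, 27]
print(bocf(baseline, followup))   # → [22, 28, 27]
```

The imputed patient shows a change of zero from baseline, which is why the method tends to pull the estimated treatment effect toward no effect.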

Worst Case Scenario

  • Replace the missing data with the worst possible outcome.

Advantages:

  • If the results are positive, they can be trusted.

Disadvantages:

  • It cannot be used in studies with a high number of dropouts
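A minimal sketch in which the analyst supplies the worst possible value. The 0 to 100 scale where higher is better is a hypothetical example, and `None` is assumed to mark a missing value.

```python
def worst_case_impute(values, worst_value):
    """Replace each missing value (None) with the worst possible outcome."""
    return [worst_value if v is None else v for v in values]

# Hypothetical 0-100 score where higher is better, so the worst outcome is 0.
print(worst_case_impute([72, None, 65, None], worst_value=0))   # → [72, 0, 65, 0]
```

With many dropouts, a large share of the data would be replaced by this extreme value, which is why the approach is limited to studies with few missing observations.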

Multiple Imputations

  • Each missing value is replaced by a simulated value; this is repeated several times (3-10 times), producing several completed datasets.

Advantages:

  • Incorporates the uncertainty due to missing data
  • Has a standard deviation and standard error closer to those that would be obtained with a complete sample
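The sketch below illustrates the idea for estimating a mean and pools the results with Rubin's rules (total variance = average within-imputation variance + (1 + 1/m) × between-imputation variance). It is deliberately simplified: imputed values are drawn from a normal distribution with the observed mean and standard deviation, and the uncertainty in those two parameters is ignored, which proper multiple imputation software would also account for. The function name and data are hypothetical.

```python
import random
import statistics

def multiple_imputation_mean(values, m=5, seed=0):
    """Return (pooled mean, pooled standard error) over m imputed datasets."""
    rng = random.Random(seed)
    observed = [v for v in values if v is not None]
    mean, sd = statistics.mean(observed), statistics.stdev(observed)
    n = len(values)
    estimates, within = [], []
    for _ in range(m):
        # Simplified imputation model: draw from N(observed mean, observed sd).
        completed = [rng.gauss(mean, sd) if v is None else v for v in values]
        estimates.append(statistics.mean(completed))
        within.append(statistics.variance(completed) / n)  # squared SE of the mean
    pooled = statistics.mean(estimates)
    w = statistics.mean(within)          # within-imputation variance
    b = statistics.variance(estimates)   # between-imputation variance
    total = w + (1 + 1 / m) * b          # Rubin's rules
    return pooled, total ** 0.5

estimate, std_error = multiple_imputation_mean([4.0, 6.0, None, 8.0, None, 10.0])
print(estimate, std_error)
```

The between-imputation term is what carries the uncertainty due to the missing values into the final standard error.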

Maximum likelihood

  • Use the available values to find the parameter estimates that best fit the already observed data.
  • Does not impute missing data, but uses the known characteristics of the individual to better estimate the unknown parameters of the incomplete variable

Disadvantages:

  • Need to find the most appropriate variable to use

Preventing missing data

Although these methods can help make the analysis of a dataset with missing data as valid as possible, it is best to prevent missing data in the first place:

  • Run-in phase
  • Enrichment design: selecting the best responders
  • Flexible dose (titration) studies
  • Selection of a target population that will respond to treatment
  • Add-on design
  • Adding endpoints
  • Reducing follow-up periods
  • Allowing rescue medication
  • Defining outcomes that can be ascertained without a participant visit (for instance, death)
  • Randomized withdrawal to define long-term efficacy