Missing data
Latest revision as of 23:50, 25 October 2019
Editor-In-Chief: C. Michael Gibson, M.S., M.D.; Gonzalo Romero, M.D.
Slide Set: File:Missing Data.pdf
Overview
In statistics, missing data refers to the absence of recorded data for a given variable. Missing data is frequent in clinical research. It is an important source of bias and reduces the consistency (precision or reproducibility) of a study. It can also affect the study's conclusions, potentially leading to invalid results.
Classification of missing data
Missing data can be classified into 3 categories, depending on its relationship with the independent or dependent (outcome) variables:
- Missing completely at random (MCAR)
- Missing at random (MAR)
- Missing not at random (MNAR)
Missing completely at random (MCAR)
The missingness is independent of both observed and unobserved data, and is therefore not related to the independent variables or the outcome.
Examples:
- Loss of study files
- Equipment malfunction
- Terrible weather
- Data not entered correctly
Missing at random (MAR)
The missingness is not related to the outcome but is related to the independent variables (for example age, race, gender). Note that this does not correspond to the general notion of 'random': the probability of a value being missing depends on the observed values (independent variables), not on the missing values themselves. It may still influence the results if the independent variable is related to the outcome.
Example: Older patients dropping out of an intervention because of their physical condition (for example, difficulty walking to the center for follow-up), which is not related to the outcome.
Missing not at random (MNAR)
The missingness is related to the outcome. It is considered the worst type of missing data because the dropouts are related to the therapy or intervention under investigation. The pattern of missing data is related to unobserved data, which makes it impossible to use other values from the dataset to predict the missing values.
Example: Respondents with high income are less likely to report their income.
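The three mechanisms can be illustrated with a small simulation. This is only a sketch under invented assumptions: the variable names, the distributions, and the missingness probabilities (20%, 10%, 40%) are hypothetical and chosen for illustration, not taken from any study.

```python
import random

def simulate_missingness(n=10_000, seed=0):
    """Return rows of (age, income, mcar, mar, mnar); None marks a missing value."""
    rng = random.Random(seed)
    rows = []
    for _ in range(n):
        age = rng.uniform(20, 80)      # always observed (independent variable)
        income = rng.gauss(50, 10)     # the variable that may go missing
        # MCAR: every value has the same 20% chance of being missing.
        mcar = None if rng.random() < 0.2 else income
        # MAR: the chance of missingness depends only on the observed age.
        mar = None if rng.random() < (0.4 if age > 60 else 0.1) else income
        # MNAR: the chance of missingness depends on the unobserved value itself.
        mnar = None if rng.random() < (0.4 if income > 55 else 0.1) else income
        rows.append((age, income, mcar, mar, mnar))
    return rows

def observed_mean(values):
    """Mean of the values that are not missing."""
    kept = [v for v in values if v is not None]
    return sum(kept) / len(kept)

rows = simulate_missingness()
true_mean = observed_mean([r[1] for r in rows])
mcar_mean = observed_mean([r[2] for r in rows])
mar_mean = observed_mean([r[3] for r in rows])
mnar_mean = observed_mean([r[4] for r in rows])
```

In this toy setup age is generated independently of income, so the observed means under MCAR and MAR stay close to the true mean, while the MNAR mean is pulled downward because high incomes are preferentially missing.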
Missing values
Missing values could be due to:
- Withdrawal of consent
- Loss to follow-up
- Discontinuation of study drug due to:
  - Adverse effect
  - Lack of efficacy
- Death due to:
  - Cause-specific (auto accident)
  - Composite outcome (AIDS-defining illness)
  - Related to outcome of interest
Handling missing data
- Complete Case Analysis - CCA (Listwise Deletion)
- Available Case Analysis
- Weighted Complete Case Analysis
- Single Imputation (replacement of missing values)
Complete case analysis (CCA)
Analyzes only subjects who completed the study.
Advantages:
- Does not require manipulating the data
Disadvantages:
- Decreased study power, which increases the risk of type II error
- Biased results: the dropout rate increases the risk of imbalanced groups
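A minimal sketch of complete case analysis (listwise deletion), assuming records are stored as dictionaries with None marking a missing value. The field names and numbers are hypothetical.

```python
def complete_cases(records):
    """Listwise deletion: keep only records with no missing (None) field."""
    return [r for r in records if all(v is not None for v in r.values())]

records = [
    {"id": 1, "age": 54, "baseline": 140, "outcome": 128},
    {"id": 2, "age": 61, "baseline": 150, "outcome": None},   # dropped: no outcome
    {"id": 3, "age": None, "baseline": 135, "outcome": 130},  # dropped: no age
    {"id": 4, "age": 47, "baseline": 145, "outcome": 132},
]

kept = complete_cases(records)
mean_outcome = sum(r["outcome"] for r in kept) / len(kept)
```

Half of the subjects are discarded here, which illustrates the loss of power: only subjects 1 and 4 contribute to the analysis.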
Available Case Analysis
A variant of complete case analysis in which all or part of the data is used, depending on the given analysis: each analysis includes every case that has the data it needs. Example: incomplete cases are used for the baseline analysis but NOT for the outcome analysis.
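A sketch of the idea, using hypothetical records in which None marks a missing value: each summary uses whichever cases have that particular field, so the sample size differs between analyses.

```python
def available_mean(records, field):
    """Mean of one field over all records where that field is observed."""
    values = [r[field] for r in records if r[field] is not None]
    return sum(values) / len(values), len(values)

records = [
    {"id": 1, "baseline": 140, "outcome": 128},
    {"id": 2, "baseline": 150, "outcome": None},  # used for baseline only
    {"id": 3, "baseline": 135, "outcome": 130},
    {"id": 4, "baseline": 145, "outcome": 132},
]

baseline_mean, n_baseline = available_mean(records, "baseline")
outcome_mean, n_outcome = available_mean(records, "outcome")
```

Subject 2 contributes to the baseline summary (n = 4) but not to the outcome summary (n = 3).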
Weighted Complete Case Analysis
Used in surveys. Responses are weighted according to the likelihood of response.
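One common way to do this, shown here as an assumption rather than as the method the sources prescribe, is to weight each responder by the inverse of the response rate in their stratum. The strata, field names, and values below are hypothetical.

```python
from collections import defaultdict

def weighted_complete_case_mean(records, stratum_key, value_key):
    """Weight each responder by 1 / (response rate of their stratum)."""
    totals = defaultdict(int)
    responders = defaultdict(int)
    for r in records:
        totals[r[stratum_key]] += 1
        if r[value_key] is not None:
            responders[r[stratum_key]] += 1

    weighted_sum = 0.0
    weight_total = 0.0
    for r in records:
        if r[value_key] is None:
            continue  # non-responders carry no value, only shape the weights
        stratum = r[stratum_key]
        weight = totals[stratum] / responders[stratum]
        weighted_sum += weight * r[value_key]
        weight_total += weight
    return weighted_sum / weight_total

# All 4 young subjects respond; only 2 of 4 old subjects respond.
records = (
    [{"group": "young", "score": 10} for _ in range(4)]
    + [{"group": "old", "score": 20} for _ in range(2)]
    + [{"group": "old", "score": None} for _ in range(2)]
)

unweighted = sum(r["score"] for r in records if r["score"] is not None) / 6
weighted = weighted_complete_case_mean(records, "group", "score")
```

The unweighted responder mean is about 13.3, while the weighted mean is 15.0, which is what the full sample would give if the old non-responders resembled the old responders.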
Single Imputation
- Mean/Median Substitution Method
- Last Observation Carried Forward (LOCF)
- Regression Substitution Method
- Stochastic Regression imputation
- Increased Random Variability Method
- Worst Case Scenario Method
- Baseline Carried Forward Method
- Hot and Cold Deck Imputation
Mean Substitution
Impute the missing data using the mean of the non-missing values.
Advantages:
- Simple
- Potential to reduce bias by using all study data to estimate response for missing subjects
Disadvantages:
- Significantly decreased standard deviation (variance)
- Increased Type I error
- Overestimation
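A small sketch of mean substitution on hypothetical values (None marks a missing value). It also shows the reduced standard deviation listed among the disadvantages.

```python
import statistics

def mean_impute(values):
    """Replace each missing value with the mean of the observed values."""
    observed = [v for v in values if v is not None]
    mean = statistics.mean(observed)
    return [mean if v is None else v for v in values]

values = [4.0, None, 6.0, 8.0, None, 10.0]
imputed = mean_impute(values)

sd_observed = statistics.stdev([v for v in values if v is not None])
sd_imputed = statistics.stdev(imputed)
```

The imputed values sit exactly at the mean, so the sample standard deviation drops from about 2.58 (observed values only) to 2.0 (after imputation).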
Last Observation Carried Forward (LOCF)
Impute the missing data with the value of the last observation with available data.
Advantages:
- Historically one of the imputation methods most commonly accepted by regulators such as the FDA
- It is very simple
- Includes all subjects (ITT analysis)
- Mimics real-life scenarios (as many patients in real clinical practice are not compliant with the treatment)
Disadvantages:
- Can lead to biased estimates, because it assumes that a patient's outcome stays unchanged after dropout (the patient neither improves nor worsens); it can also decrease the power of a study.
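A minimal sketch of LOCF for one subject's visits, assuming the visits are in chronological order and None marks a missed visit. The numbers are hypothetical.

```python
def locf(visits):
    """Fill each missing visit with the most recent observed value."""
    filled = []
    last = None
    for value in visits:
        if value is not None:
            last = value
        filled.append(last)  # stays None if nothing has been observed yet
    return filled

completed = locf([120, 115, None, None])
no_prior = locf([None, 118, None])
```

Here `completed` is `[120, 115, 115, 115]`: the value at dropout is frozen for all later visits. A value missing before any observation, as in `no_prior`, cannot be filled by this method.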
Regression Substitution
Build a regression model with baseline characteristics as predictors of the outcome using the available data. Then use the model to predict the outcome for patients with missing values.
Advantages:
- Has the potential to reduce bias by using all data to estimate the response for missing subjects
Disadvantages:
- Requires specialized statistical software
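A sketch of regression substitution with a single baseline predictor, fitted by ordinary least squares on the complete cases. Using only one predictor and a straight line is a simplifying assumption, and the data are hypothetical.

```python
def fit_line(xs, ys):
    """Ordinary least squares fit of y = intercept + slope * x."""
    n = len(xs)
    mean_x = sum(xs) / n
    mean_y = sum(ys) / n
    sxx = sum((x - mean_x) ** 2 for x in xs)
    sxy = sum((x - mean_x) * (y - mean_y) for x, y in zip(xs, ys))
    slope = sxy / sxx
    return slope, mean_y - slope * mean_x

def regression_impute(baseline, outcome):
    """Predict missing outcomes from baseline using the complete cases."""
    complete = [(b, o) for b, o in zip(baseline, outcome) if o is not None]
    slope, intercept = fit_line([b for b, _ in complete], [o for _, o in complete])
    return [intercept + slope * b if o is None else o
            for b, o in zip(baseline, outcome)]

baseline = [1.0, 2.0, 3.0, 4.0, 5.0]
outcome = [2.0, 4.0, None, 8.0, 10.0]
imputed = regression_impute(baseline, outcome)
```

The complete cases lie on the line y = 2x, so the missing outcome for baseline 3.0 is imputed as 6.0. Stochastic regression imputation would additionally add a random residual to each predicted value.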
Increased Random Variability
Use statistical software to generate a set of random values for the outcome and then replace missing data by randomly selecting values from this set.
Advantages:
- Potential to reduce bias by using all study data to estimate response for missing subjects
Disadvantages:
- Complicated statistical model (requires specific training)
- Not commonly used
- Might be questioned by reviewers
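One possible reading of this method is sketched below. It assumes the random values are drawn from a normal distribution with the mean and standard deviation of the observed outcomes; that distributional choice, the pool size, and the data are assumptions made for illustration.

```python
import random
import statistics

def random_draw_impute(values, pool_size=1000, seed=0):
    """Fill missing values by sampling from a pool of simulated outcomes."""
    rng = random.Random(seed)
    observed = [v for v in values if v is not None]
    mean = statistics.mean(observed)
    sd = statistics.stdev(observed)
    # Pool of random outcome values (assumed normal, see lead-in).
    pool = [rng.gauss(mean, sd) for _ in range(pool_size)]
    return [rng.choice(pool) if v is None else v for v in values]

values = [4.0, None, 6.0, 8.0, None, 10.0]
imputed = random_draw_impute(values)
```

Unlike mean substitution, the imputed values vary around the mean, so the variability of the completed data is not artificially compressed.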
Worst Case Scenario
Replace the missing data with the worst possible outcome.
Advantage:
- If the results are positive, they can be trusted.
Disadvantage:
- It cannot be used in studies with a high number of dropouts
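A minimal sketch for a binary outcome, assuming 1 means treatment success and 0 means failure, so that failure is the worst possible outcome. The coding and the data are hypothetical.

```python
def worst_case_impute(outcomes, worst=0):
    """Replace every missing outcome with the worst possible value."""
    return [worst if o is None else o for o in outcomes]

treatment_arm = [1, 1, None, 1, 0]
completed = worst_case_impute(treatment_arm)
success_rate = sum(completed) / len(completed)
```

The dropout is counted as a failure, giving a success rate of 0.6 instead of the 0.75 seen among completers. With many dropouts the imputed failures dominate the estimate, which is why the method is unsuitable in that setting.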
Baseline Carried Forward
Assume that all participants with missing data resume their baseline status. Replace missing data with the baseline data from each patient.
Advantages:
- Simple
Disadvantages:
- It might underestimate the effects of treatment (type II error)
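A short sketch of baseline carried forward, using hypothetical measurements where None marks a missing final value and a lower value means improvement.

```python
def baseline_carried_forward(baseline, final):
    """Replace each missing final value with that patient's baseline value."""
    return [b if f is None else f for b, f in zip(baseline, final)]

baseline = [150, 160, 155]
final = [130, None, 140]

completed = baseline_carried_forward(baseline, final)
changes = [f - b for b, f in zip(baseline, completed)]
mean_change = sum(changes) / len(changes)
```

The patient with missing data contributes a change of 0, so the mean change shrinks from -17.5 among completers to about -11.7, illustrating how the treatment effect may be underestimated.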
Hot and Cold Deck Imputation
Replaces individual missing data items with reported data from another person or household with similar characteristics.
- Hot deck: a missing value is replaced with the value from a case with similar characteristics in the same dataset
- Cold deck: the replacement value is taken from another dataset
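A sketch of hot deck imputation in which the donor is the case with the closest age in the same dataset. Nearest-neighbor matching on a single variable is an assumption made here for simplicity (donors are often drawn at random from a group of similar cases), and the records are hypothetical.

```python
def hot_deck_impute(records, match_key, value_key):
    """Fill each missing value from the most similar donor in the same dataset."""
    donors = [r for r in records if r[value_key] is not None]
    completed = []
    for record in records:
        if record[value_key] is None:
            # Donor: the observed case closest on the matching variable.
            donor = min(donors, key=lambda d: abs(d[match_key] - record[match_key]))
            record = {**record, value_key: donor[value_key]}
        completed.append(record)
    return completed

records = [
    {"age": 30, "score": 5},
    {"age": 32, "score": None},
    {"age": 60, "score": 9},
    {"age": 58, "score": None},
]
completed = hot_deck_impute(records, "age", "score")
```

The 32-year-old receives the score of the 30-year-old donor and the 58-year-old receives the score of the 60-year-old donor. A cold deck version would take `donors` from a separate dataset instead.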
Multiple Imputation
Each missing value is replaced by a simulated value, and the process is repeated several times (3-10 times).
Advantages:
- Has a standard deviation and standard error closer to those obtained with a complete sample
- Incorporates the uncertainty due to the missing data
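A simplified sketch of the procedure for estimating a mean. Each completed dataset is analyzed separately and the results are pooled with Rubin's rules (total variance = average within-imputation variance + (1 + 1/m) times the between-imputation variance). Drawing the imputations from a normal distribution fitted to the observed values is an assumption made here; a full implementation would also account for uncertainty in the fitted parameters.

```python
import random
import statistics

def multiple_imputation_mean(values, m=5, seed=0):
    """Estimate a mean from m imputed datasets, pooled with Rubin's rules."""
    rng = random.Random(seed)
    observed = [v for v in values if v is not None]
    mean = statistics.mean(observed)
    sd = statistics.stdev(observed)

    estimates = []   # mean of each completed dataset
    variances = []   # squared standard error of each mean
    for _ in range(m):
        completed = [rng.gauss(mean, sd) if v is None else v for v in values]
        estimates.append(statistics.mean(completed))
        variances.append(statistics.variance(completed) / len(completed))

    pooled = statistics.mean(estimates)
    within = statistics.mean(variances)
    between = statistics.variance(estimates)
    total = within + (1 + 1 / m) * between
    return pooled, within, between, total

values = [4.0, None, 6.0, 8.0, None, 10.0]
pooled, within, between, total = multiple_imputation_mean(values)
standard_error = total ** 0.5
```

The between-imputation term is what carries the uncertainty due to the missing data: the total variance is never smaller than the within-imputation variance alone.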
Maximum likelihood
Use the available values to find the parameter estimates that best fit the already observed data.
Disadvantages:
- Does not impute missing data but uses the known characteristics of the individual to better estimate the unknown parameters of the incomplete variable
- Requires finding the most appropriate variables to use
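As an illustration, consider the simplest textbook case: a fully observed variable x, an incomplete variable y, bivariate normal data, and missingness that depends only on x. Under these assumptions the maximum likelihood estimate of the mean of y is the complete-case mean of y, adjusted by the regression slope times the difference between the mean of x in all cases and in complete cases. The data below are hypothetical.

```python
def ml_mean_of_incomplete(xs, ys):
    """ML estimate of mean(y) when x is complete and y is partly missing."""
    complete = [(x, y) for x, y in zip(xs, ys) if y is not None]
    cx = [x for x, _ in complete]
    cy = [y for _, y in complete]
    mean_cx = sum(cx) / len(cx)
    mean_cy = sum(cy) / len(cy)
    sxx = sum((x - mean_cx) ** 2 for x in cx)
    sxy = sum((x - mean_cx) * (y - mean_cy) for x, y in complete)
    slope = sxy / sxx
    mean_x_all = sum(xs) / len(xs)
    # No value is imputed; x from the incomplete cases only sharpens the estimate.
    return mean_cy + slope * (mean_x_all - mean_cx)

xs = [1.0, 2.0, 3.0, 4.0, 5.0, 6.0]
ys = [2.0, 4.0, 6.0, None, None, None]   # y is missing for the larger x values

complete_case_mean = sum(y for y in ys if y is not None) / 3
ml_mean = ml_mean_of_incomplete(xs, ys)
```

The complete-case mean of y is 4.0, while the maximum likelihood estimate is 7.0, because the cases with missing y are known to have larger x.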
Missing Data Prevention
Although these methods can help analyze a dataset with missing data as validly as possible, it is best to prevent missing data in the first place. Strategies include:
- Run-in phase
- Enrichment design - selecting best responders
- Flexible dose (titration) studies
- Selection of target population who will respond to treatment, knowing the population
- Add-on design
- Adding endpoints
- Reducing follow-up periods
- Allowing rescue medication
- Defining outcomes that can be ascertained without a participant visit (for instance, death)
- Randomized withdrawal to define long-term efficacy
- Sending letters to subjects to encourage follow-up while protecting confidentiality
References
Haukoos J S, Newgard C D. Advanced Statistics: Missing Data in Clinical Research: An Introduction and Conceptual Framework. Society for Academic Emergency Medicine. 2007
Myers W R. Handling Missing Data In Clinical Trials: An Overview. Drug Information Journal, Vol. 34, pp. 525–533. 2000