DSF Discussion /
Methods for dealing with missing values in datasets
27 Nov 2017 11:05:01 PM
Missing data can occur because of nonresponse: no information is provided for one or more items or for a whole unit ("subject"). Some items are more likely to generate a nonresponse than others.
Suppose variable Y has some missing values. We will say that these values are MCAR if the probability of missing data on Y is unrelated to the value of Y itself or to the values of any other variable in the data set. On another hand, missing value (y) neither depends on x nor y.
The probability of missing data on Y is unrelated to the value of Y after controlling for other variables in the analysis (say X). on another hand, Missing value (y) depends on x, but not y.
Missing values do depend on unobserved values. On another hand, the probability of a missing value depends on the variable that is missing.
We can distinguish between two main patterns of missingness. On the one hand, data are missing monotone if we can observe a pattern among the missing values. Note that it may be necessary to reorder variables and/or individuals. On the other hand, data are missing arbitrarily if there is not a way to order the variables to observe a clear pattern (SAS Institute, 2005).
If a case has missing data for any of the variables, then simply exclude that case from the analysis. It is usually the default in statistical packages. (Briggs et al.,2003). In this case, rows containing missing variables are deleted
Analysis with all cases in which the variables of interest are present. On another hand, only the missing observations are ignored and analysis is done on variables present.
Mean, median and mode are the most popular averaging techniques, which are used to infer missing values. Approaches ranging from global average for the variable to averages based on groups are usually considered. On simply way Replace missing value with sample mean or mode.
Suppose we are estimating a regression model with multiple independent variables. One of them, X, has missing values. We select those cases with complete information and regress X on all the other independent variables. Then, we use the estimated equation to predict X for those cases it is missing. (Graham, 2009) (Allison, 2001) and (Briggs et al., 2003).
We can use this method to get the variance-covariance matrix for the variables in the model based on all the available data points, and then use the obtained variance- covariance matrix to estimate our regression model (Schafer, 1997). On another hand, Estimate: value that is most likely to have resulted in the observed data.
The imputed values are draws from a distribution, so they inherently contain some variation. Thus, multiple imputation (MI) solves the limitations of single imputation by introducing an additional form of error based on variation in the parameter estimates across the imputation, which is called “between imputation error”. It replaces each missing item with two or more acceptable values, representing a distribution of possibilities (Allison, 2001).
Ignore or treat them? The answer would depend on the percentage of those missing values in the dataset, the variables affected by missing values, whether those missing values are a part of dependent or the independent variables, etc. Missing Value treatment becomes important since the data insights or the performance of your predictive model could be impacted if the missing values are not appropriately handled. In conclusion: Assumptions and patterns of missingness are used to determine which methods can be used to deal with missing data
Message cannot be blank.
Very nice review, thanks Amer
I have worked once with sparse matrices with missing data, and I think your article is relevant
Complete your membership listing and tell others about your interests, experience and qualifications with a Personal Profile page.
Your Personal Profile page is missing information about your experience and qualifications that other members would find interesting. Click here to update.