Missing data is one of the most common problems in real-world data science. It can arise during data collection in several ways. For example, people may not answer every question in a survey, or the data might be assembled from multiple sources that do not share identical data types and that leave some fields empty (or NULL).
How do we deal with this missing data?
There is no straightforward answer to how missing data should be handled and, as in most cases, “It Depends”. However, there are a few basic techniques commonly used to deal with missing data, which I would like to discuss briefly, with an illustration on the Ames Housing Dataset using the first 2 techniques described below.
1. Deletions:
Deletion is best suited to data that is “Missing Completely at Random (MCAR)”, because dropping such values does not bias the remaining sample. Below are the common types of deletion:
- List-Wise: The complete row is deleted whenever any of its values is missing, so the sample size is reduced. Rows highlighted in colour will be removed below.
- Pair-Wise: Only the missing observations are ignored in the analysis, and the remaining variables of the row are still used. Cells highlighted in colour will be ignored in the below data.
- Deleting Columns: If more than about 90% of a column's values are missing, the column is usually dropped, since it would contribute little to the model.
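The three deletion strategies above can be sketched with pandas. This is a minimal illustration on a toy DataFrame invented for the purpose, not the Ames Housing data; the column names and the 0.9 threshold variable are assumptions chosen to mirror the text.

```python
import numpy as np
import pandas as pd

# Toy data, invented for illustration only.
df = pd.DataFrame({
    "age": np.arange(20, 40, dtype=float),
    "income": np.arange(20) * 2.0 + 30,
    "mostly_missing": np.nan,
})
df.loc[[1, 5], "age"] = np.nan
df.loc[2, "income"] = np.nan
df.loc[0, "mostly_missing"] = 1.0  # 19 of 20 values remain missing

# List-wise deletion: drop every row with a missing value in the
# columns under analysis.
listwise = df[["age", "income"]].dropna()

# Pair-wise deletion: each statistic uses the rows that are complete for
# the columns it involves. pandas' corr() and mean() skip NaN this way.
pairwise_corr = df[["age", "income"]].corr()
age_mean_available = df["age"].mean()  # uses the 18 observed ages

# Column deletion: drop columns whose share of missing values is too high.
threshold = 0.9  # mirrors the 90% rule of thumb in the text
missing_share = df.isna().mean()
reduced = df.loc[:, missing_share <= threshold]

print(len(df), len(listwise))   # rows before and after list-wise deletion
print(list(reduced.columns))    # columns that survive the threshold
```

List-wise deletion shrinks the sample for every later analysis, while pair-wise deletion keeps more data at the cost of each statistic being computed on a slightly different set of rows.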
2. Imputation:
This is a method of filling in the missing data with estimated values.
- Averaging techniques: Mean, Median and Mode are the most common and basic imputation techniques used for missing data. Approaches range from taking the overall average of the variable to taking averages within groups defined by another variable.
Example: Consider 2 variables, OS_Type and Bugs_detected, where Bugs_detected has missing data. In this scenario it makes sense to fill each missing value with the average for its OS_Type instead of the global average of Bugs_detected.
- Predictive Techniques: Models can be built to predict the missing values, assuming the missingness is not completely at random. Here we divide our dataset into two parts: rows where the variable is observed and rows where it is missing. The observed rows become the training dataset, and the rows with missing values become the dataset on which the model predicts.
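Both imputation approaches above can be sketched on the OS_Type / Bugs_detected example. The values below are invented, and Test_hours is a hypothetical predictor column added only so the predictive variant has something to learn from. The single-variable linear fit is a stand-in for whatever model one would actually choose.

```python
import numpy as np
import pandas as pd

# Toy data, invented for illustration. Test_hours is a hypothetical column.
df = pd.DataFrame({
    "OS_Type": ["Linux", "Linux", "Linux", "Windows", "Windows", "Windows"],
    "Test_hours": [1.0, 2.0, 3.0, 4.0, 5.0, 6.0],
    "Bugs_detected": [2.0, np.nan, 6.0, 20.0, np.nan, 30.0],
})

# Global average: every gap gets the same value.
global_fill = df["Bugs_detected"].fillna(df["Bugs_detected"].mean())

# Grouped average: each gap gets the mean of its own OS_Type.
group_mean = df.groupby("OS_Type")["Bugs_detected"].transform("mean")
group_fill = df["Bugs_detected"].fillna(group_mean)

# Predictive imputation: train on rows where the target is observed,
# then predict the target for rows where it is missing.
known = df[df["Bugs_detected"].notna()]
unknown = df[df["Bugs_detected"].isna()]
slope, intercept = np.polyfit(
    known["Test_hours"].to_numpy(), known["Bugs_detected"].to_numpy(), deg=1
)
model_fill = df["Bugs_detected"].copy()
model_fill.loc[unknown.index] = slope * unknown["Test_hours"] + intercept

print(pd.DataFrame({"global": global_fill, "group": group_fill, "model": model_fill}))
```

In this toy data the two operating systems have very different bug counts, so the global average lands far from either group, which is the motivation for grouping given in the example above.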
3. KNN Method (K Nearest Neighbour):
In this method, the k nearest neighbours of an incomplete record are selected based on a distance measure, and their average is used as the estimate. In other words, a missing value is imputed from the most similar records for the given variable. This can be used for both continuous and categorical data (for categorical data the most frequent value among the neighbours takes the place of the average). However, it is a very time-consuming process on large datasets, and the choice of k is critical.
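The idea can be sketched in a few lines of NumPy. The function name `knn_impute`, the toy matrix, and the choice of a root-mean-square distance over the columns two rows have in common are all assumptions made for this illustration; library implementations may define the distance differently.

```python
import numpy as np

def knn_impute(X, k=2):
    """Fill NaNs in a 2-D array with the mean of the k nearest rows.

    Distance between two rows is the root-mean-square difference over
    the columns that both rows have observed (an illustrative choice).
    """
    X = np.asarray(X, dtype=float)
    filled = X.copy()
    for i, j in zip(*np.where(np.isnan(X))):
        candidates = []
        for r in range(X.shape[0]):
            if r == i or np.isnan(X[r, j]):
                continue  # a neighbour must have the target column observed
            shared = ~np.isnan(X[i]) & ~np.isnan(X[r])
            if not shared.any():
                continue  # no common columns, so no distance can be computed
            dist = np.sqrt(np.mean((X[i, shared] - X[r, shared]) ** 2))
            candidates.append((dist, X[r, j]))
        if candidates:
            candidates.sort(key=lambda c: c[0])
            filled[i, j] = np.mean([value for _, value in candidates[:k]])
    return filled

# Toy data, invented for illustration: the last row is missing column 2.
X = np.array([
    [1.0, 2.0, 10.0],
    [1.1, 2.1, 12.0],
    [8.0, 9.0, 50.0],
    [1.05, 2.05, np.nan],
])
result = knn_impute(X, k=2)
print(result[3, 2])
```

The nested loop compares each incomplete row with every other row, which is where the running time noted above comes from. Because the distance mixes columns, features on very different scales would normally be standardised first.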
4. Time-Series Specific Methods:
- Last Observed Value Forward (LOVF) or Next Observed Value Backwards (NOVB): These are also known as last observation carried forward (LOCF) and next observation carried backward (NOCB). They can be used for repeated observations of a variable whose value changes slowly. The assumption is that the response either stayed constant from the last observed value or already equalled the next observed value.
- Linear Interpolation: For time-series data with a linear trend, this technique can be used for imputing missing data. It is appropriate when no seasonality is observed in the data.
- Seasonality and Interpolation: This is a more advanced technique for when seasonality along with trend is observed in the data.
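The first two time-series techniques map directly onto pandas methods. The sketch below uses a short daily series invented for illustration; the seasonal variant is not shown, since it depends on how the seasonal component is modelled.

```python
import numpy as np
import pandas as pd

# Toy daily series, invented for illustration only.
idx = pd.date_range("2021-01-01", periods=6, freq="D")
s = pd.Series([10.0, np.nan, np.nan, 16.0, np.nan, 20.0], index=idx)

# Last observed value carried forward.
lovf = s.ffill()

# Next observed value carried backward.
novb = s.bfill()

# Linear interpolation between the neighbouring observed values.
# method="linear" treats the points as equally spaced, which holds here.
linear = s.interpolate(method="linear")

print(pd.DataFrame({"original": s, "lovf": lovf, "novb": novb, "linear": linear}))
```

Forward and backward fill produce step-shaped series, while interpolation follows the trend between observations, which is why it suits trending data without seasonality.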
Check the below article to understand more about seasonality.
The diagram below summarizes the techniques described above: