I covered Python basics in my previous post; now let's move on to exploratory data analysis using Python.
As mentioned in my earlier post, much of Python's power comes from its libraries. Below is a brief introduction to the most commonly used Python libraries for data science.
- Pandas – Used for data manipulation and analysis; it offers data structures and operations for manipulating numerical tables and time-series data.
- NumPy – A basic scientific-computing package that provides useful features for operations on n-dimensional arrays and matrices in Python.
- SciPy – Contains modules for optimization, linear algebra, integration, interpolation, special functions, FFT, signal and image processing, ODE solvers, and other tasks common in science and engineering.
- Scikit-Learn – Built on top of NumPy and SciPy and designed specifically for machine learning. It provides algorithms for many standard machine-learning tasks such as regression, classification, and clustering.
- Matplotlib – A simple Python library for generating good visualizations.
- Seaborn – Another visualization library, built on Matplotlib, which provides a high-level interface for attractive statistical graphics.
- Statsmodels – As the name suggests, it helps users explore data by estimating statistical models and performing statistical tests and analysis.
Now let's try exploratory analysis of data using Python. I used the BigMart dataset, which you can download from here.
Select BigMart3_train.csv; the next screen displays the CSV contents. Right-click the Raw button at the top and choose "Save as" to download it to your system.
Open Jupyter Notebook and import the required libraries
# Import libraries
import pandas as pd
import numpy as np
import matplotlib.pyplot as plt
import seaborn as sb
#this is required to get the plots displayed
%matplotlib inline
Read the dataset to a variable
#read the csv file
df = pd.read_csv('C:/Hari Docs/Dataset/BigMart3_Train.csv')
#display the first 10 records of the dataset
df.head(10)
#find the shape of the data
df.shape
This shows that there are 8523 rows and 12 columns.
#Find out how many columns or variables are present in this dataset
df.columns
What do we understand from the columns ?
We can broadly classify the columns into two categories: those that give information about the Item and those that give information about the Outlet. All of these influence Item_Outlet_Sales, so we will explore them further.
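As a quick sketch, this split can be done programmatically from the column names. The list below is what the BigMart train file contains; adjust it if your copy differs.

```python
# Columns of the BigMart train file, grouped by their prefix
cols = ['Item_Identifier', 'Item_Weight', 'Item_Fat_Content', 'Item_Visibility',
        'Item_Type', 'Item_MRP', 'Outlet_Identifier', 'Outlet_Establishment_Year',
        'Outlet_Size', 'Outlet_Location_Type', 'Outlet_Type', 'Item_Outlet_Sales']
target = 'Item_Outlet_Sales'   # the variable we want to predict
item_cols = [c for c in cols if c.startswith('Item_') and c != target]
outlet_cols = [c for c in cols if c.startswith('Outlet_')]
print(len(item_cols), len(outlet_cols))   # 6 item columns, 5 outlet columns
```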
# Let's get a general idea of how the data looks for the numeric columns
df.describe()
From the above output we can understand:
- Item_Weight has missing values
- Item_Visibility ranges from 0 to 0.32, which we can think of as a fraction of display area. Item visibility should not really be zero, since every item is displayed somewhere in the outlet; zero values likely need treatment similar to missing values.
- Outlet_Establishment_Year ranges from 1985 to 2009. This can probably be converted to the age of the outlet, which may work better when modelling the dataset.
- Item_Outlet_Sales is the target variable to be predicted.
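A quick sketch of how to verify these points in code. The tiny frame below stands in for df, and using 2013 as the reference year for outlet age is my assumption, not something stated in the dataset.

```python
import pandas as pd
import numpy as np

# Tiny synthetic stand-in for df (the real one comes from read_csv)
sample = pd.DataFrame({
    'Item_Weight': [9.3, np.nan, 17.5],
    'Item_Visibility': [0.0, 0.016, 0.09],
    'Outlet_Establishment_Year': [1999, 1985, 2009],
})
missing = sample.isnull().sum()                    # missing values per column
zero_vis = (sample['Item_Visibility'] == 0).sum()  # zero-visibility rows behave like missing data
sample['Outlet_Age'] = 2013 - sample['Outlet_Establishment_Year']  # assumed reference year
```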
Categorical variables (Non-Numeric):
cat = df.dtypes[df.dtypes == 'object'].index
df[cat].describe()
From the above output we can understand:
- There are 1559 unique Items by looking at Item_Identifier description
- Items are classified into 16 types by looking at Item_Type
- The values in this dataset are from 10 different Outlets
- Outlets are of 3 different sizes(Outlet_Size)
- Outlet_Size has missing data
- Outlet_Location_Type has 3 different values
- Outlet is classified into 4 types (Outlet_Type)
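The unique counts behind these observations can be read off directly with nunique(); a small sketch on a synthetic frame:

```python
import pandas as pd

# Synthetic stand-in; in the real data Item_Identifier has 1559 unique values,
# Outlet_Identifier has 10, and so on.
sample = pd.DataFrame({
    'Item_Identifier': ['FDA15', 'DRC01', 'FDA15'],
    'Outlet_Identifier': ['OUT049', 'OUT018', 'OUT049'],
    'Outlet_Size': ['Medium', None, 'Medium'],
})
cat = sample.dtypes[sample.dtypes == 'object'].index
uniques = sample[cat].nunique()   # NaN is not counted, so missing data does not add a category
```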
Now we will try to understand a little more using visualizations.
Data Exploration using Visualization:
Univariate Analysis: This means exploring and finding more information about a single variable at a time.
Let us first try some visualizations for numeric variables:
Histograms give us a quick way of seeing the distribution of a variable.
#histogram of all the numeric variables
df.hist()
Another way to look at the distribution is through the density plots.
#density plot
df.plot(kind='density', subplots=True, layout=(3,3), sharex=False)
We can see that Item_Outlet_Sales and Item_Visibility have roughly exponential, right-skewed distributions.
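One numeric cross-check of that visual impression is skewness: a right-skewed series has a large positive skew. A sketch on toy numbers; with the real data you would call df['Item_Outlet_Sales'].skew():

```python
import pandas as pd

# A small right-skewed series: most values small, one long right tail
s = pd.Series([1, 1, 2, 2, 3, 50])
print(s.skew())   # clearly positive, confirming the right skew
```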
Box plots summarize the distribution: the box spans the 25th to the 75th percentile, with a line drawn at the median (middle value). The whiskers give us an idea of the spread of the data, and values more than 1.5 times the interquartile range beyond the quartiles are drawn outside the whiskers. These are potential outliers, which need a different treatment (something similar to missing-value treatment).
#box plot
df.plot(kind='box', subplots=True, layout=(3,3), sharex=False, sharey=False)
From the output we can see that Item_Visibility and Item_Outlet_Sales have potential outliers. However, Item_Outlet_Sales is the 'target' variable, which needs to be predicted from the remaining variables, so we will leave it as it is.
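The 1.5 × IQR rule that the box plot uses can be applied directly to flag those points; a sketch on toy values (with the real data, s would be df['Item_Visibility']):

```python
import pandas as pd

s = pd.Series([0.01, 0.02, 0.03, 0.05, 0.30])   # toy values with one extreme point
q1, q3 = s.quantile(0.25), s.quantile(0.75)
iqr = q3 - q1
lower, upper = q1 - 1.5 * iqr, q3 + 1.5 * iqr   # whisker limits
outliers = s[(s < lower) | (s > upper)]         # the points a box plot draws individually
```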
For categorical (non-numeric) variables, count plots showing the frequency of each category are very helpful in exploring the variables.
From the above output we can see that 'Fruits and Vegetables' and 'Snack Foods' have the most records in the dataset, while at the extreme right 'Sea Food' has comparatively very little data. We might need to consider combining such rare categories with related Item_Type values.
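One way that combining could be done, sketched with value_counts and a frequency threshold (the toy series, the threshold, and the 'Others' label are all illustrative choices):

```python
import pandas as pd

s = pd.Series(['Snack Foods', 'Fruits and Vegetables', 'Snack Foods',
               'Fruits and Vegetables', 'Seafood'])
counts = s.value_counts()                    # category frequencies behind the plot
rare = counts[counts < 2].index              # categories below a chosen threshold
combined = s.where(~s.isin(rare), 'Others')  # fold rare categories into 'Others'
```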
The data is distributed almost equally across all the outlets.
Next, we will try to find relationships between multiple variables using multivariate visualizations.
sb.barplot(x ='Outlet_Identifier', y = 'Item_Outlet_Sales', data=df)
We can see that the outlet ‘OUT027’ outperforms remaining outlets in sales.
sb.barplot(x ='Outlet_Location_Type', y = 'Item_Outlet_Sales', data=df)
Among outlet locations, Tier 2 and Tier 3 have better sales compared to Tier 1.
sb.barplot(x ='Outlet_Size', y = 'Item_Outlet_Sales', data=df)
Medium sized outlets have better sales than the remaining.
From the data we know that OUT027 is a Medium sized outlet.
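That fact can be checked by filtering the frame; a sketch on synthetic rows (with the real df, the same .loc expression applies):

```python
import pandas as pd

sample = pd.DataFrame({
    'Outlet_Identifier': ['OUT027', 'OUT049', 'OUT027'],
    'Outlet_Size': ['Medium', 'High', 'Medium'],
})
# Each outlet should report a single size; .unique() confirms it
size = sample.loc[sample['Outlet_Identifier'] == 'OUT027', 'Outlet_Size'].unique()
```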
For finding correlations between numeric variables, we can use a scatter plot or a heat map.
corr = df.corr(numeric_only=True)  # numeric_only skips the text columns (pandas >= 1.5)
sb.heatmap(corr, vmax=1., square=False)
A value of +1.0 denotes a strong positive correlation, whereas -1.0 denotes a strong negative correlation.
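To read off which features correlate most with the target, the target's column of the correlation matrix can be sorted. A sketch on synthetic numbers with a deliberately strong linear link; the column names echo the dataset, but the values are illustrative.

```python
import numpy as np
import pandas as pd

rng = np.random.default_rng(0)
frame = pd.DataFrame({'Item_MRP': rng.random(50)})
# Synthetic linear relationship with a little noise
frame['Item_Outlet_Sales'] = 3 * frame['Item_MRP'] + rng.normal(0, 0.1, 50)
corr = frame.corr(numeric_only=True)
# Correlations with the target, strongest first (drop the target's self-correlation of 1.0)
target_corr = corr['Item_Outlet_Sales'].drop('Item_Outlet_Sales').sort_values(ascending=False)
```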
These are some basics of data exploration using Python. I will try to write another post on treating outliers and missing values, and on feature engineering.