I covered Python basics in my previous post, and now let's move on to understanding how to do exploratory data analysis using Python.

As mentioned in my earlier post, the power of Python comes from its libraries. Below I would like to give a brief introduction to the most commonly used Python libraries for data science.

  • Pandas – Used for data manipulation and analysis; it offers data structures and operations for manipulating numerical tables and time series data.
  • Numpy – A basic scientific computation package that provides useful features for operations on n-dimensional arrays and matrices in Python.
  • Scipy – Contains modules for optimization, linear algebra, integration, interpolation, special functions, FFT, signal and image processing, ODE solvers and other tasks common in science and engineering.
  • Scikit-Learn – Built on top of SciPy and designed specifically for machine learning. It has algorithms for many standard machine learning tasks such as regression, clustering and classification.
  • Matplotlib – A simple Python library for generating good visualizations.
  • Seaborn – Another visualization library, based on matplotlib. It provides a high-level interface for attractive statistical graphs.
  • Statsmodels – As the name suggests, it helps users conduct data exploration through the estimation of statistical models and the performance of statistical tests and analysis.

Now let's try exploratory analysis of the data using Python. I used the BigMart dataset, which you can download from here.

Select BigMart3_train.csv, and on the next screen the CSV contents are displayed. Right-click the Raw button at the top and select Save As to download the file to your system.

Open Jupyter Notebook and import the required libraries

# Import libraries
import pandas as pd
import numpy as np
import matplotlib.pyplot as plt
import seaborn as sb     # used for the statistical plots later in this post
# this magic command is required to get the plots displayed inline
%matplotlib inline

Read the dataset into a variable

#read the csv file 
df = pd.read_csv('C:/Hari Docs/Dataset/BigMart3_Train.csv')
#display the first 10 records of the dataset
df.head(10)
#find the shape of the data
df.shape

This shows that the dataset has 8523 rows and 12 columns.

#Find out which columns or variables are present in this dataset
df.columns

What do we understand from the columns?
We can broadly classify the columns into two categories: those that give information about the item and those that give information about the outlet. All of this information influences Item_Outlet_Sales, so we will explore it further.
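To see this split quickly, we can group the column names by their prefix. This is just a convenience sketch (note that the target, Item_Outlet_Sales, also happens to start with 'Item'):

#split the column names by prefix to see the item vs outlet grouping
#(Item_Outlet_Sales, the target, also starts with 'Item')
item_cols = [c for c in df.columns if c.startswith('Item')]
outlet_cols = [c for c in df.columns if c.startswith('Outlet')]
print(item_cols)
print(outlet_cols)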

Descriptive Analysis:

Numeric Columns:

# Let's get a general idea of what the data looks like for the numeric columns
df.describe()

From the above output we can understand:

  • Item_Weight has missing values (its count is lower than the number of rows)
  • Item_Visibility is between 0 and 0.32, which we can think of as a percentage. Item visibility cannot really be zero, because every item would be displayed somewhere in the outlet, so the zeros likely indicate missing data
  • Outlet_Establishment_Year has values from 1985 to 2009. This could probably be converted to the age of the outlet, which may be more useful for modelling (see the sketch after this list)
  • Item_Outlet_Sales is the target variable to be predicted
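We can verify these observations with a couple of quick checks. The outlet-age derivation below assumes a reference year purely for illustration:

#count missing values per column
df.isnull().sum()
#count the suspicious zero values in Item_Visibility
(df['Item_Visibility'] == 0).sum()
#derive outlet age from the establishment year
#(2013 is an assumed reference year, used here only for illustration)
df['Outlet_Age'] = 2013 - df['Outlet_Establishment_Year']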

Categorical variables (Non-Numeric):

#select the categorical (object-typed) columns and describe them
cat = df.dtypes[df.dtypes == 'object'].index
df[cat].describe()

From the above output we can understand:

  • There are 1559 unique items, from the Item_Identifier description
  • Items are classified into 16 types (Item_Type)
  • The values in this dataset come from 10 different outlets
  • Outlets come in 3 different sizes (Outlet_Size)
  • Outlet_Size has missing data (checked in the snippet below)
  • Outlet_Location_Type has 3 different values
  • Outlets are classified into 4 types (Outlet_Type)
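We can double-check these counts directly; for example:

#number of unique values in each categorical column
for col in cat:
    print(col, ':', df[col].nunique(), 'unique values')
#confirm that Outlet_Size has missing entries
df['Outlet_Size'].isnull().sum()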

Now we will try to understand a little more using visualizations.

Data Exploration using Visualization:

Univariate Analysis: This means exploring and finding more information about a single variable at a time.

Let us first try some visualizations for numeric variables:

Histograms give us a quick way of seeing the distribution of a variable.

#histogram of all the numeric variables
df.hist()

Another way to look at the distribution is through the density plots.

#density plot
df.plot(kind='density', subplots=True, layout=(3,3), sharex=False)

We can see that Item_Outlet_Sales and Item_Visibility have roughly exponential, right-skewed distributions.

Box plots summarize the distribution by drawing a line at the median (the middle value) and a box spanning the 25th and 75th percentiles. The whiskers give us an idea of the spread of the data, and any values more than 1.5 times the interquartile range (IQR) beyond the box edges are shown outside the whiskers. These are potential outliers that may need special treatment (something similar to missing-value treatment).

#box plot
df.plot(kind='box', subplots=True, layout=(3,3), sharex=False, sharey=False)

From the output we can see that Item_Visibility and Item_Outlet_Sales have potential outliers. However, Item_Outlet_Sales is the 'target' variable that needs to be predicted from the remaining variables, so we will leave it as it is.
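If we want to flag those potential outliers explicitly, we can apply the 1.5 × IQR rule described above. Here is a minimal sketch for Item_Visibility:

#flag potential outliers in Item_Visibility using the 1.5 * IQR rule
q1 = df['Item_Visibility'].quantile(0.25)
q3 = df['Item_Visibility'].quantile(0.75)
iqr = q3 - q1
outliers = df[(df['Item_Visibility'] < q1 - 1.5 * iqr) |
              (df['Item_Visibility'] > q3 + 1.5 * iqr)]
len(outliers)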

For categorical (non-numeric) variables, bar plots are very helpful for exploring the data.

df['Item_Type'].value_counts().plot(kind='bar')

From the above output we can see that ‘Fruits and Vegetables’ and ‘Snack Foods’ have the most records in the dataset, while at the extreme right ‘Seafood’ has comparatively very few. We might consider combining such low-frequency categories into broader Item_Type groups; a sketch follows.
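One possible sketch for such a grouping is to map rare Item_Type values into an 'Others' bucket (the threshold of 100 records is an arbitrary, illustrative choice):

#combine Item_Type categories with few records into an 'Others' bucket
#(the threshold of 100 is an arbitrary illustrative choice)
counts = df['Item_Type'].value_counts()
rare = counts[counts < 100].index
df['Item_Type_Grouped'] = df['Item_Type'].replace(list(rare), 'Others')
df['Item_Type_Grouped'].value_counts()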

df['Outlet_Identifier'].value_counts().plot(kind='bar')

The data is distributed almost equally across all the outlets.

Multivariate Analysis:

We will try to find the correlations between multiple variables using multivariate visualizations.

Categorical data:

sb.barplot(x ='Outlet_Identifier', y = 'Item_Outlet_Sales', data=df)

We can see that the outlet ‘OUT027’ outperforms the remaining outlets in sales.

sb.barplot(x ='Outlet_Location_Type', y = 'Item_Outlet_Sales', data=df)

Among the outlet locations, Tier 2 and Tier 3 have better sales compared to Tier 1.

sb.barplot(x ='Outlet_Size', y = 'Item_Outlet_Sales', data=df)

Medium-sized outlets have better sales than the rest.
From the data we know that OUT027 is a Medium-sized outlet (verified below).
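We can confirm that directly from the data:

#check the recorded size of outlet OUT027
df[df['Outlet_Identifier'] == 'OUT027']['Outlet_Size'].unique()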

Numeric data:

To find correlations between numeric variables, we can use scatter plots or heat maps.

plt.scatter(df['Item_MRP'],df['Item_Outlet_Sales'])
plt.show()

We can also use a heatmap to understand the correlations between the different numeric variables.

#compute pairwise correlations between the numeric columns only
corr = df.select_dtypes(include=[np.number]).corr()
sb.heatmap(corr, vmax=1., square=False)

+1.0 denotes a strong positive correlation, whereas -1.0 represents a strong negative correlation.
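Rather than reading the strengths off the colours alone, we can also sort the target's column of the correlation matrix:

#correlations of the numeric variables with the target, sorted
corr['Item_Outlet_Sales'].sort_values(ascending=False)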

These are some basics of data exploration using Python. I will try to write another post on how to treat outliers and missing values, and on feature engineering.

-Hari Mindi


2 Comments

Rob · April 5, 2018 at 8:58 am

I think one of the key things often missed in articles like this is “why would you?”. It’s all very well skimming over the basics of such a topic, but when someone learns these basics they then say “that is cool, but I can do all that in Excel quickly and pass it around for others who are likely more familiar with Excel than they are with Python or R”.
A key element in helping people learn data analytics with Python is explaining/demonstrating what bringing such knowledge into your toolkit provides over using familiar tools like Excel on its own.

Data science itself is vastly changing, people are realising it’s no longer fair to expect one person to be an expert in all these tools, have expert domain knowledge, and expert statistical knowledge. Teams are needed who can work together to bring all these skills together, reducing the chances of errors and improving the speed and quality of output.

    harimindi · April 5, 2018 at 8:24 pm

    Thank you for the comment. Even I had the same question when I started learning Python for data science, and data analysis in particular. What I understood from different articles online was that Excel is good for basic analysis, and viewing data and pivots are helpful with most managers. However, it does not seem to be great at handling large datasets or missing data. Of course, I know learning some VBA might be helpful in such cases, but Python offers easier learning and better options for performing these tasks.
    I might not be completely correct because, as my blog states, I am trying to learn data science and will take every comment as feedback to improve my learning curve.
