Exploratory Data Analysis using Pandas Profiling

In this article and its accompanying web app, we are going to talk about data science in its most literal sense: analyzing data, visualizing and representing it, and summarizing what was collected. One of my first projects in data science was exploratory data analysis (EDA) using different libraries. Here I’ll walk you through one such analysis using the Pandas Profiling report.

If you’ve taken any courses on statistics (and by that I mean the more advanced ones that touch on probability distributions, Gaussian functions and the normal distribution), you will have come across data analysis at some point or other. Exploratory data analysis is an approach to analyzing data sets to summarize their main characteristics, often using statistical graphics and other data visualization methods. It was promoted by the American mathematician John Tukey (also known for his work on the Fast Fourier Transform algorithm, the Tukey range test, the Tukey lambda distribution and more). EDA is, however, distinct from initial data analysis (IDA), which focuses more narrowly on checking the assumptions required for model fitting and hypothesis testing.

The objectives of EDA are to:

 

  • Suggest hypotheses about the causes of observed phenomena
  • Assess assumptions on which statistical inference will be based
  • Support the selection of appropriate statistical tools and techniques
  • Provide a basis for further data collection through surveys or experiments


There are many excellent books on exploratory data analysis, starting with the original ‘Exploratory Data Analysis’ by John Tukey, published in 1977; ‘Exploratory Data Analysis with MATLAB’ by Angel R Martinez, Jeffrey Solka and Wendy L Martinez; ‘Exploratory Data Analysis using R’ by Ronald K Pearson; ‘Graphical Exploratory Data Analysis’ by A. G. W. Steyn, Rolf Stumpf, and Stephen Henry Charles Du Toit; ‘Exploratory Data Analysis in Empirical Research: Proceedings of the 25th Annual Conference of the Gesellschaft Für Klassifikation E.V., University of Munich, March 14–16, 2001’; and many more. Many high quality YouTube videos have also been made on data analysis.


Coming to the app: it takes a dataset supplied by the user, runs exploratory data analysis on it using the pandas profiling report, and displays the outcome as a large set of statistics, such as the number of variables, number of observations, total memory size, variable types, interactions, correlations and missing values, plus descriptive statistics including mean, standard deviation, skewness and much more. We have also provided a sample dataset to run the app with. It is the Pima Indian Diabetes Dataset (I have already published an app and an article related to it).
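To get a feel for what those overview numbers mean, here is a rough sketch of how a few of them can be computed by hand in plain pandas. This is not how pandas profiling works internally; the helper name `dataset_overview` and the toy data are made up for illustration.

```python
import numpy as np
import pandas as pd


def dataset_overview(df: pd.DataFrame) -> dict:
    """Collect a few of the overview figures a profiling report starts with."""
    return {
        "n_variables": df.shape[1],
        "n_observations": df.shape[0],
        "missing_cells": int(df.isna().sum().sum()),
        "memory_bytes": int(df.memory_usage(deep=True).sum()),
        "variable_types": df.dtypes.astype(str).to_dict(),
    }


# Hypothetical toy data, not the real Pima dataset
toy = pd.DataFrame({
    "glucose": [148.0, 85.0, np.nan, 89.0],
    "age": [50, 31, 32, 21],
})

overview = dataset_overview(toy)
print(overview["n_variables"], overview["n_observations"], overview["missing_cells"])

# Pairwise Pearson correlations between the numeric columns
print(toy.corr())
```

The profiling report gathers all of this (and a lot more) in one call, which is the main reason to use it.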


Pandas profiling is an open source Python module that helps us do quick and efficient EDA. However, it is not the best choice for large datasets, because generating the report takes significantly longer as the data grows. For numerical variables it gives us an in-depth analysis covering quantile and descriptive statistics: it displays the quartile values, which describe how the ordered values are distributed above and below the median, and shows the interquartile range, standard deviation, coefficient of variation, mean absolute deviation and skewness.
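If you do need to profile a larger dataset, one common workaround is to profile a random sample and switch on the report's minimal mode. The sketch below assumes that approach; the helper names `sample_for_profiling` and `build_report` and the `max_rows` threshold are my own illustration, while `minimal=True` is an option of `ProfileReport` that turns off the most expensive computations.

```python
import pandas as pd


def sample_for_profiling(df: pd.DataFrame, max_rows: int = 10_000, seed: int = 0) -> pd.DataFrame:
    """Return df unchanged if it is small, else a reproducible random sample."""
    if len(df) <= max_rows:
        return df
    return df.sample(n=max_rows, random_state=seed)


def build_report(df: pd.DataFrame):
    """Profile a (possibly sampled) frame with the expensive sections turned off."""
    # Imported lazily so the sampling helper also works without the profiling package
    from pandas_profiling import ProfileReport
    return ProfileReport(sample_for_profiling(df), minimal=True)


# Hypothetical large frame, just to show the sampling step
big = pd.DataFrame({"x": range(50_000)})
small = sample_for_profiling(big, max_rows=1_000)
print(len(small))  # 1000
```

Keep in mind that a sample only approximates the full dataset, so rare values and extreme outliers may not show up in the report.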

For those of you who are not familiar with the statistical terms, here are a few definitions:

  • A quartile is a type of quantile: the three quartile values divide the ordered data points into four parts, or quarters, of more-or-less equal size.
  • The interquartile range, also called the midspread, middle 50%, or H‑spread, is a measure of statistical dispersion equal to the difference between the 75th and 25th percentiles, that is, between the upper and lower quartiles.
  • The standard deviation is a measure of the amount of variation or dispersion of a set of values. A low standard deviation indicates that the values tend to be close to the mean of the set, while a high standard deviation indicates that the values are spread out over a wider range.
  • The mean absolute deviation of a dataset is the average distance between each data point and the mean. It gives us an idea about the variability in a dataset.
  • Skewness is a measure of the asymmetry of the probability distribution of a real-valued random variable about its mean.
  • The coefficient of variation (CV) is the ratio of the standard deviation to the mean, so it expresses the dispersion of the data points relative to the size of the mean.
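The definitions above can be checked by hand on a small list of numbers. The sketch below uses only Python's standard library so that each formula stays visible; the function name `describe` and the sample values are made up, and I have assumed the sample standard deviation (dividing by n - 1) and the population form of skewness, which may differ slightly from what a given version of the report shows.

```python
import statistics


def describe(values):
    """Return the dispersion measures defined above for a list of numbers."""
    n = len(values)
    mean = statistics.mean(values)
    # method="inclusive" interpolates linearly between the sorted data points
    q1, median, q3 = statistics.quantiles(values, n=4, method="inclusive")
    std = statistics.stdev(values)  # sample standard deviation (divides by n - 1)
    # Mean absolute deviation: average distance of each point from the mean
    mad = sum(abs(x - mean) for x in values) / n
    # Population skewness: third central moment divided by sigma cubed
    pop_std = statistics.pstdev(values)
    skew = sum((x - mean) ** 3 for x in values) / (n * pop_std ** 3)
    return {
        "q1": q1,
        "median": median,
        "q3": q3,
        "iqr": q3 - q1,
        "std": std,
        "cv": std / mean,  # coefficient of variation
        "mad": mad,
        "skewness": skew,
    }


stats = describe([2, 4, 4, 4, 5, 5, 7, 9])
print(stats["iqr"], stats["mad"], stats["skewness"])  # 1.5 1.5 0.65625
```

The positive skewness here reflects the longer tail on the right: the values 7 and 9 sit further from the mean of 5 than anything on the left does.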

We have deployed the app using Streamlit, an open source Python framework that lets data science teams build and deploy web apps fairly easily. Its hosting service is one of the best I’ve used, and it’s great for quick and easy deployment of web apps. The app is written in Python.

Link to the web app: https://share.streamlit.io/skillcode-ml/eda2/main/app.py

 

Explanation of the code and how you can make it yourself!

This is one of those applications of data analytics that requires very little code. You can easily accomplish EDA using the pandas profiling library within 20 lines of code. Yes, it’s that simple. After importing the packages, we just need to add the code for uploading a file and then run pandas profiling on it.

 

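import numpy as np
import pandas as pd
import streamlit as st
# Note: newer releases of this package are published as ydata-profiling
# (imported as ydata_profiling)
from pandas_profiling import ProfileReport
from streamlit_pandas_profiling import st_profile_report
from PIL import Image, ImageFilter, ImageEnhance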

Next we add the file uploader, which takes a CSV file as user input. The sidebar also links to a sample dataset, the Pima Indian Diabetes Dataset.

# Sidebar: a heading, the CSV uploader, and a link to the sample dataset
st.sidebar.header('1. Upload your CSV dataset')
uploaded_file = st.sidebar.file_uploader("Upload your input CSV file for EDA", type=["csv"])
st.sidebar.markdown("""
[Sample CSV input file](https://github.com/pranav-coder2005/Diabetes_detector/blob/main/diabetes.csv)
""")


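if uploaded_file is not None:
    # Cache the parsed CSV so it is not re-read on every rerun of the script.
    # The file is passed in as an argument so a new upload gets its own cache entry.
    # (st.cache is the older caching API; recent Streamlit releases use st.cache_data.)
    @st.cache
    def load_csv(file):
        return pd.read_csv(file)
    df = load_csv(uploaded_file)
    # explorative=True enables the more detailed analysis sections of the report
    pr = ProfileReport(df, explorative=True)
    st.header('**Input Data Frame**')
    st.write(df)
    st.write('---')
    st.header('**Pandas Profiling Report**')
    st_profile_report(pr)
else:
    st.info('Waiting for CSV file to be uploaded.')
    if st.button('Sample Dataset'):
        # Example data: 100 rows of uniform random numbers in five columns
        @st.cache
        def load_data():
            return pd.DataFrame(
                np.random.rand(100, 5),
                columns=['a', 'b', 'c', 'd', 'e']
            )
        df = load_data()
        pr = ProfileReport(df, explorative=True)
        st.header('**Input Data Frame**')
        st.write(df)
        st.write('---')
        st.header('**Pandas Profiling Report**')
        st_profile_report(pr)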

And to cap it off, I’ve added the logo of Team Skillocity and a link to this article. That’s it for this one, and I’ll see you soon with another application of Machine Learning and Data Science. Hasta pronto!
