Mahavir Dabas

ABC of EDA with Haberman's Survival Dataset

Updated: Oct 13, 2020



Introduction


Exploratory Data Analysis sounds like a very cool term until one gets to know what it really is. Often referred to as the "hello world!" of Machine Learning, EDA is probably the very first thing you will encounter when you decide to dive deep into analytics.


Assume that I give you 30 minutes to make one of the most important life decisions one can take in today's time: deciding which smartphone to buy. Whatever you do in the next half hour before passing the final verdict, be it checking online reviews, comparing specifications, or even watching an unboxing video, is what we call "Exploratory Data Analysis".


EDA is not exactly a scientific tool but more of a philosophy that believes in analyzing the data with an unbiased mind open to all possibilities. This standard procedure helps one gain insights about the data before subjecting it to complicated machine learning algorithms. To make this conversation more intuitive, I have included sections of the analysis that I did on one of the most basic datasets available on Kaggle, Haberman's Cancer Survival Dataset. So let's get started.


The Dataset


Haberman's dataset contains data from a study conducted at the University of Chicago's Billings Hospital between 1958 and 1970 on patients who had undergone surgery for breast cancer. The dataset is available on Kaggle. I have made use of Python and its fantastic libraries (Pandas, NumPy, Matplotlib, Seaborn, etc.) for my analysis. One important thing: this article is not a coding tutorial but an overview of a mechanism (Machine Learning != Coding in Python), so if you are feeling overwhelmed by some of the code snippets, ignore them and try to focus on what we are trying to achieve through them.


Objective- Always keep the following mantra in mind: "Random analysis through random operations will give you random results". We should be crystal clear about the objective of our analysis, and each operation we perform should ideally inch us towards it. For example, the motive behind this particular analysis is to predict the chances of survival of a cancer patient.


First, we need to import the data from the .csv file onto our data frame.

import pandas as pd
import numpy as np
import matplotlib.pyplot as plt
import seaborn as sns

# the .csv has no header row, so we supply the column names ourselves
haberman = pd.read_csv("haberman.csv", header=None)
haberman.columns = ["age", "year", "axillary_nodes", "survival_status"]

# recode the label: 1 = survived 5 years or longer, 2 = died within 5 years
haberman.survival_status = haberman.survival_status.map({1: "yes", 2: "no"})

Now our data frame, which originally had no column names and a numeric 1/2 label, carries readable column names and a "yes"/"no" survival status, which is much easier to comprehend.
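To see the effect of the modification for yourself, a quick peek at the first few rows does the job; this check is my own addition, not part of the original post-

haberman.head()  # first five rows: age, year, axillary_nodes, survival_status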

Before continuing the analysis let me familiarise you with some keywords related to a dataset which I will be using from here on. Any dataset has the following characteristics-

  1. Datapoint/Vector/Observation- Each row in the data frame is a data point or an observation. It is one of the instances that has been recorded. If there are n features in a data point, we call it an n-dimensional vector.

  2. Feature/ Input Variable/ Independent Variable- The component of the datapoint that is used as an indicator for prediction. Our dataset has 3 features: the age of the patient, the year of the surgery, and the number of positive axillary lymph nodes detected.

  3. Label/ Class/ Output Variable/ Dependent Variable- The one thing we are ready to enter this mess for: the value that we need to predict once we are done with everything. The label is predicted on the basis of the features and is therefore called the dependent variable. Here, the chances of survival of a cancer patient act as our label (a tiny feature/label split is sketched after this list).
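To make the feature/label distinction concrete, here is a minimal sketch; the names X and Y are my own illustration and nothing later in the analysis depends on them-

X = haberman[["age", "year", "axillary_nodes"]]  # features / independent variables
Y = haberman["survival_status"]                  # label / dependent variable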

To identify how many datapoints of each class are present, we run a simple command-

g = haberman.groupby("survival_status").size()  # datapoints per class
g

We can observe that in our dataset there are 225 instances of the patient surviving and 81 instances of the patient succumbing to the disease. We can conclude that ours is an imbalanced dataset, which is a common observation in the case of medical records.
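To express the imbalance as proportions, a quick check (my own addition, not from the original analysis) does the trick-

haberman.survival_status.value_counts(normalize=True)  # roughly 73.5% yes, 26.5% no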

Note that the total observations present (306) can be found out with this simple command-

haberman.shape

Now that we are done familiarising with our data, let's move forward to the real stuff.


Multivariate Analysis-3-D Scatter Plot


Since our dataset has 3 features, why not quickly see what a 3-D scatter plot of it looks like? Maybe we'll be done with our problem in this very step.

import plotly.express as px

# interactive 3-D scatter: one axis per feature, points coloured by class
fig = px.scatter_3d(data_frame=haberman, x="age", y="year", z="axillary_nodes", color="survival_status")
fig.show()

As you can see for yourself, no procedure ever devised by humanity can help us make sense of this mess. So, to answer the previous question: no, we won't be done with our problem in the very first step!


Bivariate Analysis


Bivariate analysis refers to finding the relationship between 2 feature variables and using it to predict the target variable of interest. The most commonly used technique in bivariate analysis is the 2-D scatter plot. Pair plots are the collection of 2-D scatter plots of all possible pairs that can be generated from our features.


The motive behind using the pair plots would be to identify those features which result in the best distinction between the 2 class labels-survival status: yes and survival status: no. I used seaborn for plotting the pair plots-

sns.set_style("whitegrid")
# note: older seaborn versions call this parameter size; newer ones use height
sns.pairplot(haberman, hue="survival_status", height=3)
plt.show()


Note: Only consider the graphs above the diagonal, as the ones below it are the same graphs with the axes reversed, which won't affect our observations.

Note: The general workflow while dealing with pair plots is to select the graphs with the least overlap between the various classes and then choose the features involved in those graphs for further classification, to achieve the best possible results.


Observations from Pair Plots-

  1. Age vs Year of Operation- Highly overlapping and thus discarded.

  2. Year of Operation vs Number of Axillary Nodes-Highly overlapping and thus discarded.

  3. Age vs Number of Axillary Nodes- Overlapping, but better than the above-mentioned cases. Therefore, we'll select these 2 features for univariate analysis.

Limitation of Pair Plots- When the number of features becomes very large, it is not possible to use pair plots for classification in an efficient manner, as we would have to go through a huge number of plots to identify our best features. In such cases, we use mathematical techniques like PCA and t-SNE (the topic of discussion of the next article!) for dimensionality reduction.
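As a small taste of what dimensionality reduction looks like in code, here is a minimal sketch using scikit-learn's PCA on our 3 features; this block is my own illustration and not part of the original analysis-

from sklearn.decomposition import PCA

features = haberman[["age", "year", "axillary_nodes"]]
pca = PCA(n_components=2)                # compress 3 features into 2 components
projected = pca.fit_transform(features)  # shape: (306, 2)
print(pca.explained_variance_ratio_)     # fraction of variance each component retains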


Conclusion- The age of the patient and the Number of Axillary Nodes are the best features to predict the chances of survival.


Univariate Analysis


Univariate analysis is perhaps the simplest form of statistical analysis. The key fact is that only one variable is involved.


One thing that has become extremely relevant in today's world is the concept of diplomacy, which is also ingrained in the bodies and minds of all the data scientists around the world. Instead of actually predicting what will happen, we always prefer talking about the chances (diplomacy!). And the moment we start talking about chances is when probability kicks in.


Probability Density Function- It represents the density of data points over a continuous range, helping us determine the likelihood of a particular outcome. You can think of it as what you get by smoothing the edges of a histogram.
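Here is a minimal sketch of that smoothing idea, using a kernel density estimate from SciPy; this block is my own illustration and not part of the original analysis-

from scipy.stats import gaussian_kde

ages = haberman.age.values
kde = gaussian_kde(ages)                        # smooth the empirical distribution of ages
xs = np.linspace(ages.min(), ages.max(), 200)
plt.hist(ages, bins=20, density=True, alpha=0.4, label="histogram")
plt.plot(xs, kde(xs), label="smoothed PDF (KDE)")
plt.legend()
plt.show()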


PDF for Age of Patient


# note: newer seaborn versions renamed size to height, and sns.histplot/sns.displot supersede sns.distplot
sns.FacetGrid(haberman, hue="survival_status", height=5)\
    .map(sns.distplot, "age")\
    .add_legend()
plt.title("pdf for survival status on the basis of age")
plt.show()

Observations-

  1. We can see that the PDF of survival status on the basis of age is highly overlapping for both classes between the ages of 40 and 65 years (the areas under the curves are almost equal).

  2. Therefore, between the ages of 40 and 65, the percentage of patients with survival status yes and the percentage with survival status no are almost the same; thus, this feature is not suitable for classification.


PDF for Year of Operation


sns.FacetGrid(haberman, hue="survival_status", height=5)\
    .map(sns.distplot, "year")\
    .add_legend()
plt.title("pdf for survival status on the basis of year of operation")
plt.show()

Observations-

  1. We can see that the PDF for survival status on the basis of the year of operation is again highly overlapping for both classes between the years 1958 and 1968.

  2. Therefore, between 1958 and 1968, the percentage of patients with survival status yes and the percentage with survival status no are almost the same; thus, this feature is also not suitable for classification.

PDF for Number of Axillary Nodes


sns.FacetGrid(haberman, hue="survival_status", height=5)\
    .map(sns.distplot, "axillary_nodes")\
    .add_legend()
plt.title("pdf for survival status on the basis of number of axillary nodes")
plt.show()

Observations-

  1. The PDF for survival status on the basis of the number of axillary nodes has the least overlap between the 2 label classes and is our best shot at making the best possible classification.

  2. Thus, we can make use of the following if-else conditions to get a fairly accurate result-

if number of axillary nodes >= 0 and <= 3 --> survival chance = high
if number of axillary nodes > 3 --> survival chance = low

Detailed Univariate Analysis on the basis of Number of Axillary Nodes


Now that we know our best feature for classification, we have the luxury of going for a more detailed analysis to get some concrete results.


# boolean masks for the two classes
y = haberman.survival_status == "yes"
n = haberman.survival_status == "no"
haberman[y]  # patients who survived
haberman[n]  # patients who did not survive

For the patients who survived (Survival Status: YES)


count, bin_edges = np.histogram(haberman[y].axillary_nodes, bins=20, density=True)
print(count)
print(bin_edges)
pdf = count / sum(count)  # normalise the bin counts into a PDF
cdf = np.cumsum(pdf)      # running sum of the PDF gives the CDF
plt.plot(bin_edges[1:], pdf, 'g', label="PDF FOR SURVIVAL STATUS YES")
plt.plot(bin_edges[1:], cdf, 'b', label="CDF FOR SURVIVAL STATUS YES")
plt.xlabel("NUMBER OF AXILLARY NODES")
plt.title("pdf and cdf for number of axillary nodes-survival status:YES")
plt.legend()
plt.show()

Observations-

  1. The blue line shows the CDF, which is nothing but the cumulative sum of the PDF values up to that point, for all the patients who survived (a tiny numeric illustration follows this list).

  2. From the blue line, we can clearly see that roughly 85% of the patients who survived had fewer than 5 axillary nodes.

  3. If axillary nodes >= 0 and <= 5 --> survival status = high; this statement has an 85% chance of being correct.

  4. We can also see from the CDF that once the number of axillary nodes reaches 10, the chances of survival become very low: only 10% of the patients who survived had more than 10 nodes.
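As promised, a tiny numeric illustration of the PDF-to-CDF relationship, with made-up values rather than dataset numbers-

pdf = np.array([0.5, 0.3, 0.15, 0.05])  # fraction of datapoints in each bin
cdf = np.cumsum(pdf)                    # array([0.5, 0.8, 0.95, 1.0])
# reading the CDF: 80% of all the values fall within the first two bins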

For the patients who did not survive (Survival Status: NO)


count, bin_edges = np.histogram(haberman[n].axillary_nodes, bins=20, density=True)
print(count)
print(bin_edges)
pdf = count / sum(count)
cdf = np.cumsum(pdf)
plt.plot(bin_edges[1:], pdf, 'g', label="PDF FOR SURVIVAL STATUS NO")
plt.plot(bin_edges[1:], cdf, 'b', label="CDF FOR SURVIVAL STATUS NO")
plt.xlabel("NUMBER OF AXILLARY NODES")
plt.title("pdf and cdf for number of axillary nodes-survival status:NO")
plt.legend()
plt.show()


Observations-

  1. We can see from the CDF that of all the patients who did not survive, more than 60% had more than 3 axillary nodes present in their body.

  2. If axillary nodes >= 3 --> survival status = low; this statement has a 60% chance of being correct.

Combining the above 2 plots


count, bin_edges = np.histogram(haberman[y].axillary_nodes, bins=20, density=True)
pdf = count / sum(count)
cdf = np.cumsum(pdf)
plt.plot(bin_edges[1:], pdf, 'g', label="PDF FOR SURVIVAL STATUS YES")
plt.plot(bin_edges[1:], cdf, 'b', label="CDF FOR SURVIVAL STATUS YES")
count, bin_edges = np.histogram(haberman[n].axillary_nodes, bins=20, density=True)
pdf = count / sum(count)
cdf = np.cumsum(pdf)
plt.plot(bin_edges[1:], pdf, 'r', label="PDF FOR SURVIVAL STATUS NO")
plt.plot(bin_edges[1:], cdf, 'y', label="CDF FOR SURVIVAL STATUS NO")
plt.xlabel("NUMBER OF AXILLARY NODES")
plt.title("pdf and cdf for number of axillary nodes for both survival statuses")
plt.legend()
plt.show()

Final Observations from Univariate Analysis-

We can use the following if-else conditions to predict the chances of survival through univariate analysis (a small sketch implementing them follows the list)-

  1. If the number of axillary nodes is between 0 and 3 --> survival chance = high; this statement is about 75% accurate, since of the patients who survived, 75% had 3 or fewer axillary nodes and 25% had more.

  2. If the number of axillary nodes is greater than 3 --> survival chance = low; this statement is about 60% accurate, since of the patients who did not survive, 60% had more than 3 axillary nodes and 40% had 3 or fewer.
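Wrapped up as code, the rule looks like the sketch below; the helper name predict_survival is my own, and this is just the univariate rule of thumb derived above, not a trained model-

def predict_survival(nodes):
    # univariate rule of thumb: few positive nodes --> high chance of survival
    return "yes" if nodes <= 3 else "no"

predictions = haberman.axillary_nodes.apply(predict_survival)
print((predictions == haberman.survival_status).mean())  # fraction of correct predictions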

Basic Statistics

  1. Mean- The average value of all the data points. The whole data is spread around this central value.

  2. Standard Deviation- The measure of the spread of the data around the central value.

  3. Median- The middle value (another measure of central tendency) when all the data points are arranged from least to greatest. The median is preferred over the mean for the simple reason that it is least affected by outliers or extreme values in the dataset; it does not move much until half of the data points become corrupted (a quick demonstration follows this list).
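A quick demonstration of that robustness, with made-up numbers rather than dataset values-

clean = np.array([1, 2, 3, 4, 5])
corrupt = np.array([1, 2, 3, 4, 500])        # one extreme value
print(np.mean(clean), np.mean(corrupt))      # 3.0 vs 102.0, the mean shifts wildly
print(np.median(clean), np.median(corrupt))  # 3.0 vs 3.0, the median stays put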

print("for all the patients who survived: \n")
#to check the central value
print("mean number of axillary nodes: {} \n".format(np.mean(haberman[y].axillary_nodes)))
#to check the spread about the central value
print("standard deviation from the mean number of axillary nodes: {} \n".format(np.std(haberman[y].axillary_nodes)))
#to check the central value which is least affected by outliers
print("the middle value or the median number of axillary nodes: {} \n".format(np.median(haberman[y].axillary_nodes)))
                                                                                                                                                                                                                                        
print("for all the patients who did not survive: \n")
#to check the central value
print("mean number of axillary nodes: {} \n".format(np.mean(haberman[n].axillary_nodes)))
#to check the spread about the central value
print("standard deviation from the mean number of axillary nodes: {} \n".format(np.std(haberman[n].axillary_nodes)))
#to check the central value which is least affected by outliers
print("the middle value or the median number of axillary nodes: {} \n".format(np.median(haberman[y].axillary_nodes)))

Observations-

  1. All the patients who survived had an average of 2.8 axillary nodes with a deviation of 5.8.

  2. All the patients who did not survive had an average of 7.5 axillary nodes with a deviation of 9.1.


Advanced Statistics


  1. Percentile- Generally communicated as the percentage of total values that fall below a particular value in a dataset. For example, if I say that 45 marks represent the 85th percentile in a test, it implies that 85 percent of the students who took the test scored less than 45 marks.

  2. Quantile- A fancy term for the values at the 0th, 25th, 50th, 75th, and 100th percentiles (strictly speaking, these are the quartiles). The 50th percentile is nothing but the median.

  3. Interquartile Range (IQR)- The difference between the values at the 75th and the 25th percentiles. Some people also use it as a proxy for the standard deviation when analyzing spread, as 50 percent of all the values lie in this range.

print("for all the patients who survived: \n")
print("number of axillary nodes at 75th percentile: {} \n".format(np.percentile(haberman[y].axillary_nodes,75)))
print("the quantiles are nothing but the values at 0th,25th,50th,75th percentile")
print("the quantiles for the number of axillary nodes:")
print(np.percentile(haberman[y].axillary_nodes,np.arange(0,100,25)))
print("the inter quantile range is the difference of the value at the 75th and the 25th percentile")
print("the inter quantile range for number  of axillary nodes in patients who survived is:")
x=np.percentile(haberman[y].axillary_nodes,75)
y=np.percentile(haberman[y].axillary_nodes,25)
print(x-y)
print("for all the patients who did not survive: \n")
print("number of axillary nodes at 75th percentile: {} \n".format(np.percentile(haberman[n].axillary_nodes,75)))
print("the quantiles are nothing but the values at 0th,25th,50th,75th percentile")
print("the quantiles for the number of axillary nodes:")
print(np.percentile(haberman[n].axillary_nodes,np.arange(0,100,25)))
print("the inter quantile range is the difference of the value at the 75th and the 25th percentile")
print("the inter quantile range for number  of axillary nodes in patients who did not survive is:")
x=np.percentile(haberman[n].axillary_nodes,75)
y=np.percentile(haberman[n].axillary_nodes,25)
print(x-y)

Box Plots- A special plot for Quantiles


One of the major problems with histograms and PDFs is that they can only tell us about the likelihood of a particular value occurring; they give us no direct clue about the likelihood of occurrence below a particular point.

This problem is solved to some extent by the CDF, but what if we want more intuition? This is where the Box Plot comes into the picture, a plot specifically designed to deal with quantiles.

sns.boxplot(x="survival_status",y="axillary_nodes",data=haberman)
plt.title("box plot for number of axillary nodes")
plt.show()          

Consider the orange-coloured box plot-

  1. The lower line represents the 25th percentile.

  2. The middle line represents the 50th (median) percentile.

  3. The upper line represents the 75th percentile.

  4. The height of the box represents the Interquartile Range.

  5. The extended vertical lines are called the Whiskers; they stretch up to 1.5 times the Interquartile Range beyond the box (a quick computation of these limits is sketched after this list).
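As a sanity check, these box and whisker boundaries can be computed by hand; a minimal sketch, reusing the mask n defined earlier for the patients who did not survive-

q25 = np.percentile(haberman[n].axillary_nodes, 25)
q75 = np.percentile(haberman[n].axillary_nodes, 75)
iqr = q75 - q25
lower, upper = q25 - 1.5 * iqr, q75 + 1.5 * iqr  # whisker limits; points beyond them are drawn as outliers
print(q25, q75, iqr, lower, upper)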


Violin Plots- The Best Of Both Worlds!


Till now we have talked about Histograms/PDFs for discussing data density and Box Plots for discussing quantiles. What if I tell you that there is another kind of plot which combines them both?

sns.violinplot(x="survival_status",y="axillary_nodes",data=haberman)
plt.title("violin plot for number of axillary nodes")
plt.show() 

  1. The box inside the violin acts as the box plot.

  2. The outer violin with varying density acts as the PDF. If you observe the violin with a tilted head, it will look exactly like the bell-shaped curves we saw while discussing Probability Density Functions!


Summarising Analysis


  1. Never forget the motivation behind any analysis, as the performance of EDA is aligned with its objective.

  2. It is a good practice to jot down the results of each investigative method in simple and plain English upon its completion so that anyone can comprehend it.

Conclusion


With the various analysis tools and techniques discussed in this article, we have successfully explored Haberman's Cancer Survival Dataset by drawing conclusive results from it.

