Breast Cancer EDA and Prediction
A Thorough Walkthrough of Exploratory Data Analysis Techniques on the Breast Cancer Dataset Using Python to Build a Simple Malignant-vs-Benign Classifier
The impetus for this blog and the resulting cancer classification model is to provide a glimpse into the potential of machine learning in the healthcare industry. Healthcare continues to learn valuable lessons from the success of machine learning in other industries, using predictive analytics (also known as “health forecasting”) to improve patient diagnosis, care, chronic disease management, hospital administration and supply chain efficiency. [1]
This classification project is an introduction to exploratory data analysis (EDA) of the Breast Cancer dataset, with the goal of determining which attributes are needed to develop a simple predictive cancer classification model. The model estimates whether cells are malignant cancer cells or not, ultimately automating the prognosis for a patient based on several attributes such as radius, area, texture, perimeter, fractal dimension and so on.
1. Understanding the Dataset
The dataset comes from the health-focused Data for Social Good: Women Coders’ Bootcamp, organized by Artificial Intelligence for Development in collaboration with UNDP Nepal. Its features are computed from a digitized image of a fine needle aspirate (FNA) of a breast mass.
You can easily download the dataset from the link below:
Attribute Information:
- ID number
- Diagnosis (M = malignant, B = benign)
Ten real-valued features are computed for each cell nucleus:
- radius (mean of distances from center to points on the perimeter)
- texture (standard deviation of gray-scale values)
- perimeter
- area
- smoothness (local variation in radius lengths)
- compactness (perimeter² / area - 1.0)
- concavity (severity of concave portions of the contour)
- concave points (number of concave portions of the contour)
- symmetry
- fractal dimension (“coastline approximation” - 1)
The mean, standard error and “worst” or largest (mean of the three largest values) of these features were computed for each image, resulting in 30 features. For instance, field 3 is Mean Radius, field 13 is Radius SE, field 23 is Worst Radius.
Note: Just a reminder! It’s always a great idea to do a bit of background research on your dataset and the problem at hand before you delve into EDA. I did so with ‘nodes’ and here’s what I found.
Background: Positive axillary lymph nodes are small, bean-shaped organs located in the armpit (axilla), which act as filters along the lymph fluid channels. As lymph fluid leaves the breast and eventually returns to the bloodstream, the lymph nodes catch and trap cancer cells before they reach other parts of the body. Thus, having cancer cells in the lymph nodes under your arm suggests an increased risk of the cancer spreading.
When lymph nodes are free of cancer, test results are negative. However, if cancer cells are detected in axillary lymph nodes they are deemed positive.
This article addresses the following questions:
- Which factors contribute most to classifying a breast tumor as malignant?
- Are there any regular patterns in the cell measurements that could help detect potential breast cancer cases going forward?
2. Exploratory Data Analysis
Below, you’ll find details on how to set up your environment to reproduce this EDA. The required libraries and their purposes are:
- Pandas for manipulating the dataset
- NumPy for mathematical and statistical calculations on the dataset
- Matplotlib and Seaborn for visualization
# Import the necessary packages
import pandas as pd
import matplotlib.pyplot as plt
import seaborn as sns
import numpy as np
# Load the dataset
data = pd.read_csv("breast_cancer.csv")
Data Details
data.shape
Output
(569, 31)
The dataset has 569 rows and 31 columns.
data.info()
Output
Observations:
- There are no missing values or duplicate rows
- All columns are numeric except the categorical diagnosis column
- The ID column can be dropped because it is not used in the analysis (a short sketch follows below)
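A minimal sketch of these checks and the ID drop, assuming the identifier column is named id (adjust it to match your copy of the CSV):
# Verify there are no missing or duplicate values, then drop the identifier.
# The column name "id" is an assumption and may differ in your CSV.
print(data.isnull().sum().sum())   # total missing values (expected: 0)
print(data.duplicated().sum())     # duplicate rows (expected: 0)
data = data.drop(columns=["id"])   # the ID carries no predictive information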
Data Visualization
We are going to visualize our dataset to see what information we can extract from it.
Observation:
We have 357 benign cases and 212 malignant cases, so the dataset is imbalanced. We could use various re-sampling techniques such as under-sampling, over-sampling, or SMOTE; choose the one appropriate for the problem. A minimal count plot is sketched below.
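This sketch assumes the target column is named diagnosis with values M (malignant) and B (benign):
# Visualize the class balance; "diagnosis" is assumed to be the target column
sns.countplot(x="diagnosis", data=data)
plt.title("Diagnosis class distribution")
plt.show()
print(data["diagnosis"].value_counts())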
Radius — Diagnosis
Cells with a large radius tend to be malignant cancer cells.
Area — Diagnosis
Cells with a larger area are more likely to be malignant cancer cells.
Perimeter — Diagnosis
Cells with a larger perimeter are more likely to be malignant cancer cells.
Smoothness — Diagnosis
Higher smoothness gives some indication of malignant cancer, but the overlap with benign cells is still roughly fifty-fifty.
Texture — Diagnosis
The greater the cell texture, the higher the possibility of malignant cancer.
Symmetry — Diagnosis
The greater the cell symmetry, the higher the possibility of malignant cancer, but the overlap with benign cells is still roughly fifty-fifty.
Concavity — Diagnosis
The greater the concavity of the cell contour, the stronger the indication of malignant cancer.
Fractal Dimension — Diagnosis
The greater the fractal dimension of the cell, the stronger the indication of malignant cancer.
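The per-feature observations above come from distribution plots of each measurement split by diagnosis. A minimal sketch of how such plots can be produced; column names such as radius_mean are assumptions based on the common naming of this dataset:
# Plot the distribution of selected mean features, split by diagnosis.
# Column names such as "radius_mean" are assumptions; adjust them if your
# CSV uses a different naming convention.
features = ["radius_mean", "area_mean", "perimeter_mean", "smoothness_mean",
            "texture_mean", "symmetry_mean", "concavity_mean",
            "fractal_dimension_mean"]
fig, axes = plt.subplots(2, 4, figsize=(20, 8))
for ax, feature in zip(axes.ravel(), features):
    sns.kdeplot(data=data, x=feature, hue="diagnosis", fill=True, ax=ax)
    ax.set_title(f"{feature} by diagnosis")
plt.tight_layout()
plt.show()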
3. Prediction
For prediction, we tried four models and compared them on recall. You could optimize for precision instead; it depends on what matters most for the use case. In this case we prioritize recall because we do not want the model to predict a benign result for a patient whose cells are actually malignant (a false negative).
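The original post does not show the training code, so the sketch below is only one reasonable way to reproduce the comparison with scikit-learn and XGBoost; the split, scaling, and hyperparameters are assumptions, and the exact scores will differ from the tables that follow:
# Train four classifiers and compare precision, recall and accuracy.
# The preprocessing choices below are assumptions, not the author's exact setup.
from sklearn.model_selection import train_test_split
from sklearn.preprocessing import StandardScaler
from sklearn.linear_model import LogisticRegression
from sklearn.ensemble import RandomForestClassifier
from sklearn.svm import SVC
from sklearn.metrics import precision_score, recall_score, accuracy_score
from xgboost import XGBClassifier
X = data.drop(columns=["diagnosis"])
y = (data["diagnosis"] == "M").astype(int)   # 1 = malignant, 0 = benign
X_train, X_test, y_train, y_test = train_test_split(
    X, y, test_size=0.2, stratify=y, random_state=42)
scaler = StandardScaler()                    # scaling helps LR and SVC
X_train = scaler.fit_transform(X_train)
X_test = scaler.transform(X_test)
models = {
    "Logistic Regression": LogisticRegression(max_iter=1000),
    "Random Forest": RandomForestClassifier(random_state=42),
    "XGB": XGBClassifier(eval_metric="logloss"),
    "SVC": SVC(),
}
for name, model in models.items():
    model.fit(X_train, y_train)
    pred = model.predict(X_test)
    print(f"{name}: precision={precision_score(y_test, pred):.2f}, "
          f"recall={recall_score(y_test, pred):.2f}, "
          f"accuracy={accuracy_score(y_test, pred):.2f}")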
Logistic Regression
|Precision|Recall|Accuracy|
|---------|------|--------|
| 97% | 95% | 97% |
Logistic Regression achieves 95% recall and 97% precision.
Random Forest
|Precision|Recall|Accuracy|
|---------|------|--------|
| 97% | 90% | 95% |
The Random Forest model achieves 90% recall and 97% precision.
XGB
|Precision|Recall|Accuracy|
|---------|------|--------|
| 97% | 88% | 92% |
The XGB model achieves 88% recall and 97% precision.
SVC
|Precision|Recall|Accuracy|
|---------|------|--------|
| 100% | 90% | 96% |
The SVC model achieves 90% recall and 100% precision.
SVC has the best precision at 100%, but here we are looking for the highest recall.
After trying the four models, we found that Logistic Regression has the highest recall at 95%, so we select the Logistic Regression model for the final predictions.
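To look at the selected Logistic Regression model in more detail, one option (reusing the objects from the earlier training sketch, which is an assumed setup) is a confusion matrix and classification report:
# Inspect the chosen Logistic Regression model; reuses X_test/y_test and the
# "models" dict from the earlier sketch.
from sklearn.metrics import classification_report, confusion_matrix
log_reg = models["Logistic Regression"]
y_pred = log_reg.predict(X_test)
print(confusion_matrix(y_test, y_pred))
print(classification_report(y_test, y_pred, target_names=["benign", "malignant"]))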
4. Conclusion
- After the EDA, we have six features that can be used as a reference for classifying tumor cells as malignant:
a) Radius
b) Perimeter
c) Area
d) Compactness
e) Concavity
f) Concave Points
- SVC has the highest precision, and Logistic Regression has the best recall.