Breast Cancer EDA and Prediction

Lukas Kristianto
6 min read · Dec 9, 2022


A Thorough Walkthrough of Exploratory Data Analysis Techniques on the Breast Cancer Dataset, Using Python to Build a Simple Classifier for Malignant vs. Benign Tumors

Getty Images

The impetus for this blog and the resulting cancer classification model is to provide a glimpse into the potential of machine learning in the healthcare industry. Healthcare continues to learn valuable lessons from the success of machine learning in other industries to jumpstart the utility of predictive analytics (also known as “health forecasting”) and to improve patient diagnosis, care, chronic disease management, hospital administration and supply chain efficiency. [1]

This classification project is an introduction to exploratory data analysis (EDA) of the Breast Cancer dataset, used to determine the attributes needed to develop a simple predictive cancer classification model. The model estimates whether cells are malignant or benign, ultimately automating part of the diagnosis for a patient based on several attributes such as radius, area, texture, perimeter, fractal dimension and so on.

1. Understanding Dataset

The dataset comes from the Health for Social Good: Women Coders’ Bootcamp, organized by Artificial Intelligence for Development in collaboration with UNDP Nepal. The features are computed from a digitized image of a fine needle aspirate (FNA) of a breast mass.

You can easily download the dataset from the link below:

Attribute Information:

  • ID number
  • Diagnosis (M = malignant, B = benign)

Ten real-valued features are computed for each cell nucleus:

  1. radius (mean of distances from center to points on the perimeter)
  2. texture (standard deviation of gray-scale values)
  3. perimeter
  4. area
  5. smoothness (local variation in radius lengths)
  6. compactness (perimeter² / area − 1.0)
  7. concavity (severity of concave portions of the contour)
  8. concave points (number of concave portions of the contour)
  9. symmetry
  10. fractal dimension (“coastline approximation” − 1)

The mean, standard error and “worst” or largest (mean of the three largest values) of these features were computed for each image, resulting in 30 features. For instance, field 3 is Mean Radius, field 13 is Radius SE, field 23 is Worst Radius.
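
As a quick illustration, the 30 column names can be generated programmatically. This is only a minimal sketch assuming the common “<feature>_mean” / “<feature>_se” / “<feature>_worst” naming used by many CSV versions of this dataset; your file may name columns slightly differently (for example, “concave points” with a space).

# Sketch: build the 30 expected feature names, assuming the common
# "<feature>_mean" / "<feature>_se" / "<feature>_worst" naming convention.
base_features = [
    "radius", "texture", "perimeter", "area", "smoothness",
    "compactness", "concavity", "concave_points", "symmetry",
    "fractal_dimension",
]
feature_names = [f"{feat}_{stat}" for stat in ("mean", "se", "worst")
                 for feat in base_features]
print(len(feature_names))  # 30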

Note: Just a reminder! It’s always a great idea to do a bit of background research on your dataset and the problem at hand before you delve into EDA. I did so with ‘nodes’ and here’s what I found.

Background: Positive axillary lymph nodes are small, bean-shaped organs located in the armpit (axilla), which act as filters along the lymph fluid channels. As lymph fluid leaves the breast and eventually returns to the bloodstream, the lymph nodes catch and trap cancer cells before they reach other parts of the body. Thus, having cancer cells in the lymph nodes under your arm suggests an increased risk of the cancer spreading.

When lymph nodes are free of cancer, test results are negative. However, if cancer cells are detected in axillary lymph nodes they are deemed positive.

This article addresses the following questions:

  • Which features contribute most to classifying a tumor as malignant?
  • Are there any measurable patterns that could help detect potential breast cancer cases going forward?

2. Exploratory Data Analysis

Below, you’ll find details on how to set up your environment to repeat this EDA process. The required libraries and their purposes are:

  • Pandas is used for manipulating the dataset
  • NumPy is used for mathematical and statistical calculations on the dataset
  • Matplotlib and Seaborn are used for visualization
# Import the necessary packages 
import pandas as pd
import matplotlib.pyplot as plt
import seaborn as sns
import numpy as np
# Load the dataset
data = pd.read_csv("breast_cancer.csv")

Data Details

data.shape

Output

(569, 31)

The dataset has 569 rows and 31 columns.

data.info()

Output

Breast Cancer Dataset Info

Observations:

  1. There are no missing values or duplicate rows
  2. The data contains 30 numeric features and 1 categorical feature (diagnosis)
  3. The ID column can be dropped because it is not used in the analysis (see the sketch below)
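
These observations can be verified with a few quick checks; here is a minimal sketch, assuming the identifier column in the CSV is named “id”:

# Quick checks behind the observations above
print(data.isnull().sum().sum())    # total missing values (expected: 0)
print(data.duplicated().sum())      # duplicate rows (expected: 0)
print(data.dtypes.value_counts())   # numeric vs. categorical columns

# Drop the ID column, which carries no predictive information
# (assumes it is named "id"; errors="ignore" skips it if already absent)
data = data.drop(columns=["id"], errors="ignore")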

Data Visualization

We will visualize the dataset to see what information we can extract from it.
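
The plots in this section were made with Seaborn. As a minimal sketch, the diagnosis count plot below could be produced like this (assuming the target column is named “diagnosis”):

# Count of benign (B) vs. malignant (M) diagnoses
sns.countplot(x="diagnosis", data=data)
plt.title("Diagnosis Count")
plt.show()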

Diagnosis Plot

Observation:
We have 357 benign cases and 212 malignant cases, so our dataset is imbalanced. We could use various re-sampling algorithms such as under-sampling, over-sampling, or SMOTE; choose the one that best fits the problem.
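
As an illustration, over-sampling with SMOTE from the imbalanced-learn package might look like the sketch below; it is not applied in the rest of this walkthrough, and it assumes X holds the features and y the diagnosis labels.

# Sketch of SMOTE over-sampling with imbalanced-learn (not used later on)
from imblearn.over_sampling import SMOTE

X = data.drop(columns=["diagnosis"])
y = data["diagnosis"]

sm = SMOTE(random_state=42)
X_resampled, y_resampled = sm.fit_resample(X, y)
print(y_resampled.value_counts())   # both classes are now the same size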

Radius — Diagnosis

Radius — Diagnosis Percentage

Cells with a larger radius are more likely to be malignant.
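
Each of the feature-versus-diagnosis plots in this section follows the same pattern. A minimal sketch for the radius plot above, assuming the column is named “radius_mean”, might be:

# Distribution of mean radius split by diagnosis
sns.histplot(data=data, x="radius_mean", hue="diagnosis", kde=True)
plt.title("Radius vs. Diagnosis")
plt.show()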

Area — Diagnosis

Area — Diagnosis Percentage

Cells with a larger area have a higher chance of being malignant.

Perimeter — Diagnosis

Perimeter — Diagnosis Percentage

Cells with a larger perimeter have a higher chance of being malignant.

Smoothness — Diagnosis

Smoothness — Diagnosis Percentage

Higher smoothness shows some association with malignancy, but the split between malignant and benign is still roughly fifty-fifty.

Texture — Diagnosis

Texture — Diagnosis Percentage

The greater the cell texture, the higher the possibility of malignancy.

Symmetry — Diagnosis

Symmetry — Diagnosis Percentage

The greater the cell symmetry, the higher the possibility of malignancy, although the split with benign cases is still roughly fifty-fifty.

Concavity — Diagnosis

Concavity — Diagnosis Percentage

The greater the concavity of the cell contour, the stronger the indication of malignancy.

Fractal Dimension — Diagnosis

The greater the fractal dimension of the cell, the stronger the indication of malignancy.

3. Prediction

For prediction, we tried 4 models and compared them on recall. You could optimize for precision instead; it depends on the use case. Here we care most about recall because the worst outcome is the model predicting a benign tumor for a patient who actually has a malignant one (a false negative).
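
A minimal sketch of the setup behind this comparison is shown below. The exact preprocessing and hyperparameters behind the reported numbers are not listed in this article, so treat this as an illustrative baseline: the diagnosis is encoded so that malignant is the positive class, the data is split into train and test sets, and each model is scored on recall, precision and accuracy.

# Illustrative baseline for the four-model comparison; exact preprocessing
# and hyperparameters behind the reported numbers are not shown here.
from sklearn.model_selection import train_test_split
from sklearn.preprocessing import StandardScaler
from sklearn.linear_model import LogisticRegression
from sklearn.ensemble import RandomForestClassifier
from sklearn.svm import SVC
from sklearn.metrics import recall_score, precision_score, accuracy_score
from xgboost import XGBClassifier

X = data.drop(columns=["diagnosis"])
y = (data["diagnosis"] == "M").astype(int)   # 1 = malignant (positive class)

X_train, X_test, y_train, y_test = train_test_split(
    X, y, test_size=0.2, stratify=y, random_state=42
)

# Scale the features; this mainly helps Logistic Regression and SVC
scaler = StandardScaler()
X_train_scaled = scaler.fit_transform(X_train)
X_test_scaled = scaler.transform(X_test)

models = {
    "Logistic Regression": LogisticRegression(max_iter=1000),
    "Random Forest": RandomForestClassifier(random_state=42),
    "XGB": XGBClassifier(eval_metric="logloss"),
    "SVC": SVC(),
}

for name, model in models.items():
    model.fit(X_train_scaled, y_train)
    y_pred = model.predict(X_test_scaled)
    print(f"{name}: recall={recall_score(y_test, y_pred):.2f}, "
          f"precision={precision_score(y_test, y_pred):.2f}, "
          f"accuracy={accuracy_score(y_test, y_pred):.2f}")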

Logistic Regression

Logistic Regression Confusion Matrix
|Precision|Recall|Accuracy|
|---------|------|--------|
| 97% | 95% | 97% |

Logistic Regression achieves 95% recall and 97% precision.
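
The confusion matrix above can be reproduced with scikit-learn; here is a minimal sketch, assuming the fitted models dictionary and scaled test set from the sketch in the previous section:

# Confusion matrix and report for the fitted Logistic Regression model
from sklearn.metrics import ConfusionMatrixDisplay, classification_report

log_reg = models["Logistic Regression"]
ConfusionMatrixDisplay.from_estimator(log_reg, X_test_scaled, y_test)
plt.title("Logistic Regression Confusion Matrix")
plt.show()

print(classification_report(y_test, log_reg.predict(X_test_scaled)))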

Random Forest

Random Forest Confusion Matrix
|Precision|Recall|Accuracy|
|---------|------|--------|
| 97% | 90% | 95% |

The Random Forest model achieves 90% recall and 97% precision.

XGB

|Precision|Recall|Accuracy|
|---------|------|--------|
| 97% | 88% | 92% |

The XGB model achieves 88% recall and 97% precision.

SVC

|Precision|Recall|Accuracy|
|---------|------|--------|
| 100% | 90% | 96% |

The SVC model achieves 90% recall and 100% precision.

SVC has the best precision at 100%, but here we are looking for the highest recall.

After trying the 4 models, we found that Logistic Regression has the highest recall at 95%. We select the Logistic Regression model for the next prediction.
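
As a final illustration, the selected model can then classify a new record; the sketch below reuses a held-out test row as a stand-in for a new patient and assumes the scaler and log_reg objects from the earlier sketches.

# Classify a single new sample with the selected Logistic Regression model
new_sample = X_test.iloc[[0]]                 # stand-in for a new record
prediction = log_reg.predict(scaler.transform(new_sample))
print("Malignant" if prediction[0] == 1 else "Benign")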

4. Conclusion

  1. After the EDA, we have 6 features that can be used as a reference for classifying tumor cells as malignant:
    a) Radius
    b) Perimeter
    c) Area
    d) Compactness
    e) Concavity
    f) Concave points
  2. SVC has the highest precision, and Logistic Regression has the best recall.


Written by Lukas Kristianto

Senior Software Engineer Android and Artificial Intelligence
