Breast Cancer Detection Web App

 

Among the many applications of machine learning, one is of particular interest to me: disease detection. Using machine learning to detect disease has the potential to help a large number of people around the world, and advances in machine learning and computer vision over the past few years have transformed medicine, finance, biotechnology and more. Disease detection with machine learning and computer vision already has a number of applications in the medical sector, and its use is expected to keep growing as we develop better methods and models. The value of machine learning in healthcare is its ability to process huge datasets beyond the scope of human capability, and then reliably convert analysis of that data into clinical insights that aid physicians in planning and providing care, ultimately leading to better outcomes, lower costs of care, and increased patient satisfaction.

Many leading tech companies and universities have been doing research on the use of AI in the medical sector. For example, Google has developed a machine learning algorithm to help identify cancerous tumors on mammograms. Stanford is using a deep learning algorithm to identify skin cancer. Such revolutionary and pioneering research motivates enthusiasts of machine learning and computer vision like me to study more and more about these practices and the methods used to develop them. 

I am here with an example of a disease detection app which predicts whether a breast tumor is cancerous based upon a number of features such as radius, age, texture, perimeter, area, smoothness, compactness, concavity, concave points, symmetry and fractal dimension. I will explain more about these parameters shortly. This web app is based on machine learning and uses the Random Forest Classifier algorithm. The app is written mostly in Python.

We have deployed the app using Streamlit, an open source framework that allows data science teams to build and deploy web apps fairly easily. It’s one of the best tools I’ve used for quick and easy deployment of web apps. The web app uses interactive graphs to display the outcome and to compare the input parameters given by the user with the values of other patients (both cancerous and non-cancerous cases). It also reports the accuracy of the model on the test set, which is around 90-95%.

A value of 0 on the graphs represents a benign (non-cancerous) tumor and a value of 1 represents a malignant (cancerous) tumor. This web app was a learning experience for us and has improved our knowledge of machine learning significantly. We hope to deploy more apps in the future and share them with you. Feel free to build on this project, and don’t hesitate to send us any suggestions. The link for the Breast Cancer Detection web app is as follows: https://share.streamlit.io/braxtonova/cancer/main/app.py

About the dataset: The dataset used is the Wisconsin Breast Cancer dataset created by researchers at the University of Wisconsin. It consists of the following parameters: radius (mean of distances from center to points on the perimeter), texture (standard deviation of gray-scale values), perimeter, area, smoothness (local variation in radius lengths), compactness (perimeter^2 / area – 1.0), concavity (severity of concave portions of the contour), concave points (number of concave portions of the contour), symmetry and fractal dimension (“coastline approximation” – 1). For those of you who are not familiar with the terms in statistics, my article about Exploratory Data Analysis can be a good starting point. 
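
To make the compactness definition above concrete, here is a minimal sketch that applies the formula perimeter^2 / area - 1.0 to two simple shapes. The function name `compactness` and the shapes are illustrative assumptions, not part of the app, and the values stored in the dataset may be scaled differently.

```python
import math

def compactness(perimeter: float, area: float) -> float:
    """Compactness as described for this dataset: perimeter^2 / area - 1.0."""
    if area <= 0:
        raise ValueError("area must be positive")
    return perimeter ** 2 / area - 1.0

# A circle of radius 10 (illustrative values, not taken from the dataset)
r = 10.0
circle = compactness(2 * math.pi * r, math.pi * r ** 2)

# A 10 x 40 rectangle: a more elongated outline gives a larger value
rectangle = compactness(2 * (10 + 40), 10 * 40)

print(circle, rectangle)
```

A circle gives the smallest value this formula can produce (4π - 1), so the further an outline is from circular, the larger its compactness.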

I will give a brief idea of contours and the coastline paradox (one of my favorite mathematical paradoxes) in this article. In layman’s terms, a contour is an outline representing or bounding the shape or form of something. In calculus and linear algebra, however, we state this as a line joining points on a diagram at which some property has the same value. A contour line (also isoline, isopleth, or isarithm) of a function of two variables is a curve along which the function has a constant value, so that the curve joins points of equal value. It is a plane section of the three-dimensional graph of the function f(x, y) parallel to the (x, y)-plane. On a map, contour lines are curved, straight, or a mixture of both, and describe the intersection of a real or hypothetical surface with one or more horizontal planes. I’d also like to mention contour integration, which is a method of evaluating certain integrals along paths in the complex plane. Contour integration is closely tied to complex analysis, through applications of the residue theorem, the Cauchy integral formula, etc.

I could talk about these all day, but let’s move on to the coastline paradox. The coastline paradox revolves around the seemingly simple notion that the coastline of a landmass does not have a well-defined length. This results from the fractal-curve-like properties of coastlines, i.e., the fact that a coastline typically has a fractal dimension (which in fact makes the notion of length inapplicable). The first recorded observation of this phenomenon was by Lewis Fry Richardson, and it was expanded upon by Benoit Mandelbrot. The measured length of the coastline depends on the method used to measure it and the degree of cartographic generalization.
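
The dependence of measured length on the measuring method can be sketched with the Koch curve, a textbook fractal used here as a stand-in for a real coastline (an assumption for illustration; the app does not do this). Each refinement divides the ruler by 3 and multiplies the measured length by 4/3, so the length keeps growing as the ruler shrinks.

```python
import math

def ruler_size(iterations: int, base_length: float = 1.0) -> float:
    """Length of one straight segment after `iterations` refinements."""
    return base_length / 3 ** iterations

def koch_length(iterations: int, base_length: float = 1.0) -> float:
    """Total length of the Koch curve when measured with that ruler."""
    return base_length * (4 / 3) ** iterations

for n in range(6):
    print(f"ruler = {ruler_size(n):.5f}   measured length = {koch_length(n):.5f}")

# Similarity dimension of the Koch curve: 4 copies, each scaled down by 3
koch_dimension = math.log(4) / math.log(3)
print(f"fractal dimension = {koch_dimension:.4f}")
```

The dimension comes out between 1 (a smooth line) and 2 (a filled plane), which is the sense in which a ragged outline has a fractal dimension.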

 

 

 

Disclaimer: This is just a learning project based on one particular dataset so please do not depend on it to actually know if you have breast cancer or not. It might still be a false positive or false negative. A doctor is still the best fit for the determination of such diseases.

Breast Cancer Awareness Month, also referred to in the United States as National Breast Cancer Awareness Month, is an annual international health campaign organized by major breast cancer charities every October to increase awareness of the disease and to raise funds for research into its cause, prevention, diagnosis, treatment and cure. National Breast Cancer Awareness Month was founded in 1985 as a partnership between the American Cancer Society and the pharmaceutical division of Imperial Chemical Industries (now a part of AstraZeneca). Its aim was to promote mammography as the most effective weapon in the fight against breast cancer. Let’s support this initiative and promote awareness of this disease among the masses.

Note: Some of you have mentioned that the prediction is always a malignant tumor. That might be the case because the dataset contains relatively few benign data points. However, if you vary the values of texture and radius, you should see the prediction come out as benign in certain cases.
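
One quick way to check whether class imbalance could explain the behaviour described in the note above is to count how often each label occurs. The sketch below uses a made-up list of labels and a hypothetical helper `class_balance`; with the app's data the labels would come from the Outcome column instead.

```python
from collections import Counter

def class_balance(labels):
    """Return the fraction of samples belonging to each class."""
    counts = Counter(labels)
    total = sum(counts.values())
    return {label: count / total for label, count in counts.items()}

# Made-up labels for illustration only: 0 = benign, 1 = malignant
labels = [1, 1, 1, 0, 1, 1, 0, 1]
balance = class_balance(labels)
print(balance)
```

If one class makes up most of the data, a classifier can score well simply by favouring that class, so it is worth looking at this before trusting the predictions.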

Explanation of the Code and How You Can Make This Yourself!

Here, I am going to go through the code in a concise and simple manner so that people with even minimal experience in programming or data science can follow along and benefit from it. As mentioned before, this app has been written in Python and deployed on Streamlit. I’ve also used the Random Forest Classifier algorithm for this particular problem.

Alright, so let’s finally get started. First up, I’ve imported the Python packages and libraries that I’ve used for this app. More information about them is available on the project template of SkillTools.

 

 

# Web app framework
import streamlit as st

# Data handling
import numpy as np
import pandas as pd
from PIL import Image

# Plotting
import matplotlib.pyplot as plt
import plotly.figure_factory as ff
import seaborn as sns

# Machine learning
from sklearn.ensemble import RandomForestClassifier
from sklearn.metrics import accuracy_score
from sklearn.model_selection import train_test_split

After this I have included a short description of the app as a string, which mentions the dataset source and the developers. We then need to load our dataset and define some headings so that users know what they are looking at.

# Load the dataset from data.csv
df = pd.read_csv(r'data.csv')

# Titles and a statistical summary of the training data
st.sidebar.header('Patient Data')
st.subheader('Training Dataset')
st.write(df.describe())

After this we need to split our data into training and test sets. For the purpose of this app, I’ve used 80% of the data for training and 20% for testing.

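# Features: every column except the label
x = df.drop(['Outcome'], axis = 1)
# Label: selected by name so it does not depend on column order
y = df['Outcome']
# Hold out 20% of the rows for testing; random_state makes the split repeatable
x_train, x_test, y_train, y_test = train_test_split(x,y, test_size = 0.2, random_state = 0)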

Once we’re done with this, we need to define the user report and the user report data based on the parameters in the training dataset. For this particular dataset the parameters are Age, Radius, Texture, Perimeter, Area, Smoothness, Compactness, Concavity, Concave Points, Symmetry and Fractal Dimension. We also need to specify the range of values for each parameter so that the user can change them using the sliders in the sidebar.

def user_report():
  # Each slider takes (label, minimum, maximum, default value)
  Age = st.sidebar.slider('Age', 0,100, 54)
  Radius = st.sidebar.slider('Radius', 0,30, 15 )
  Texture = st.sidebar.slider('Texture', 0,40, 20 )
  Perimeter = st.sidebar.slider('Perimeter', 40,200, 92 )
  Area = st.sidebar.slider('Area', 140,2600, 650 )
  Smoothness = st.sidebar.slider('Smoothness', 0.0,0.25, 0.1 )
  Compactness = st.sidebar.slider('Compactness', 0.0,0.4, 0.1 )
  Concavity = st.sidebar.slider('Concavity', 0.0,0.5, 0.1 )
  Concave_points = st.sidebar.slider('Concave points', 0.0,0.25, 0.05 )
  Symmetry = st.sidebar.slider('Symmetry', 0.0,0.4, 0.2 )
  Fractal_Dimension = st.sidebar.slider('Fractal Dimension', 0.0,0.1, 0.06 )

  # Collect the slider values into a one-row DataFrame for the model
  user_report_data = {
      'Age':Age,
      'Radius':Radius,
      'Texture':Texture,
      'Perimeter':Perimeter,
      'Area':Area,
      'Smoothness':Smoothness,
      'Compactness':Compactness,
      'Concavity':Concavity,
      'Concave_points':Concave_points,
      'Symmetry':Symmetry,
      'Fractal_Dimension':Fractal_Dimension,
  }
  report_data = pd.DataFrame(user_report_data, index=[0])
  return report_data





user_data = user_report()
st.subheader('Patient Data')
st.write(user_data)

Now here’s the part where we create the Random Forest Classifier, fit it to the training data and run the model on the user’s input.

rf  = RandomForestClassifier()
rf.fit(x_train, y_train)
user_result = rf.predict(user_data)
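
As a side note, the classifier above is created without a fixed seed, so the trained forest (and the reported accuracy) can change between runs. The sketch below is one possible variation, not the app's code: it fixes `random_state` and also asks the model for class probabilities with `predict_proba`. The synthetic data from `make_classification` is only a stand-in for data.csv, and all variable names here are illustrative.

```python
from sklearn.datasets import make_classification
from sklearn.ensemble import RandomForestClassifier
from sklearn.model_selection import train_test_split

# Synthetic stand-in for the real dataset: 11 numeric features, binary label
X, y = make_classification(n_samples=300, n_features=11, random_state=0)
X_train, X_test, y_train, y_test = train_test_split(
    X, y, test_size=0.2, random_state=0
)

# Fixing random_state makes the trained forest the same on every run
model = RandomForestClassifier(n_estimators=100, random_state=0)
model.fit(X_train, y_train)

# Predict one sample and look at how confident the forest is
one_patient = X_test[:1]
label = model.predict(one_patient)[0]
probabilities = model.predict_proba(one_patient)[0]
print(label, probabilities)
```

Showing the probability next to the label gives the user a sense of how borderline a prediction is, which a bare 0 or 1 hides.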

Now we finally come to my favourite part of these web apps: visualizations. I have been experimenting with a number of visualization libraries, and a few of them really stand out for me, so I use them often in my apps. As a convention, I’ve used blue for healthy patients and red for unhealthy patients.

st.title('Graphical Patient Report')



if user_result[0]==0:
  color = 'blue'
else:
  color = 'red'

We start off with Radius and code its visualization. Here I’ve plotted a seaborn scatterplot with Age on the x axis and the Radius values on the y axis. I have used the Purples palette and have scaled the axes according to the data. A value of 0 represents a healthy case whereas a value of 1 represents an unhealthy case.

st.header('Radius Value Graph (Yours vs Others)')
fig_Radius = plt.figure()
ax3 = sns.scatterplot(x = 'Age', y = 'Radius', data = df, hue = 'Outcome' , palette='Purples')
ax4 = sns.scatterplot(x = user_data['Age'], y = user_data['Radius'], s = 150, color = color)
plt.xticks(np.arange(0,100,5))
plt.yticks(np.arange(0,50,5))
plt.title('0 - Healthy & 1 - Unhealthy')
st.pyplot(fig_Radius)
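
Since the same kind of plot is needed for every parameter, one way to avoid copy-pasting is a small helper that takes the column name as an argument. This is only a sketch under assumptions: `feature_figure` is a hypothetical name, it uses plain matplotlib instead of seaborn, and the tiny DataFrames are made-up rows standing in for the real data.

```python
import matplotlib
matplotlib.use("Agg")  # headless backend so the sketch runs without a display
import matplotlib.pyplot as plt
import pandas as pd

def feature_figure(df, user_data, feature, color):
    """Scatter Age against `feature`, with the user's own point drawn on top."""
    fig, ax = plt.subplots()
    # Dataset points, shaded by outcome (0 or 1)
    ax.scatter(df["Age"], df[feature], c=df["Outcome"], cmap="Purples",
               edgecolors="grey")
    # The user's point, larger and in the colour chosen from the prediction
    ax.scatter(user_data["Age"], user_data[feature], s=150, color=color)
    ax.set_xlabel("Age")
    ax.set_ylabel(feature)
    ax.set_title("0 - Healthy & 1 - Unhealthy")
    return fig

# Made-up rows for illustration only
df = pd.DataFrame({
    "Age": [40, 55, 63],
    "Radius": [12.0, 18.5, 14.2],
    "Texture": [15.0, 22.1, 19.3],
    "Outcome": [0, 1, 0],
})
user_data = pd.DataFrame({"Age": [54], "Radius": [15], "Texture": [20]})

figures = {
    name: feature_figure(df, user_data, name, "blue")
    for name in ["Radius", "Texture"]
}
```

In the app, each figure would then be shown with a header and `st.pyplot(fig)`, just like the Radius graph.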

Now that we are done with one parameter, we can easily do the same for the other parameters. Just reuse the above code snippet with the name of each other parameter in place of Radius and you are good to go. I will leave this as an exercise for you, and if you have any queries about it, please do ask. After completing the visualizations for all the parameters, we are finally ready to display the outcome and the prediction. I have given the outcome in the form of a user report.

st.subheader('Your Report: ')
# Turn the predicted label (0 or 1) into a message for the user
output=''
if user_result[0]==0:
  output = 'Congratulations, you do not have Breast Cancer'
else:
  output = 'Unfortunately, you do have Breast Cancer'
st.title(output)

st.subheader('Accuracy: ')
st.write(str(accuracy_score(y_test, rf.predict(x_test))*100)+'%')
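
Accuracy alone does not say how many of the mistakes are false positives and how many are false negatives, which matters a lot for a medical prediction. The sketch below counts them by hand in plain Python; `confusion_counts` is a hypothetical helper and the two label lists are made up for illustration (in the app they would be `y_test` and `rf.predict(x_test)`).

```python
def confusion_counts(y_true, y_pred, positive=1):
    """Count true/false positives and negatives for a binary classifier."""
    counts = {"tp": 0, "fp": 0, "tn": 0, "fn": 0}
    for actual, predicted in zip(y_true, y_pred):
        if predicted == positive:
            counts["tp" if actual == positive else "fp"] += 1
        else:
            counts["fn" if actual == positive else "tn"] += 1
    return counts

# Made-up labels: 1 = malignant, 0 = benign
y_true = [1, 0, 1, 1, 0, 0, 1, 0]
y_pred = [1, 0, 0, 1, 0, 1, 1, 0]

counts = confusion_counts(y_true, y_pred)
accuracy = (counts["tp"] + counts["tn"]) / len(y_true)
# Recall: the share of malignant cases the model actually caught
recall = counts["tp"] / (counts["tp"] + counts["fn"])
print(counts, accuracy, recall)
```

A missed malignant case (a false negative) is usually the costlier error here, so recall is worth reporting alongside accuracy.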

Next, I have credited the owners and authorities in charge of this dataset and have adhered to its license, which in this case is Attribution-NonCommercial-ShareAlike 4.0 International (CC BY-NC-SA 4.0). I have also mentioned where I obtained the dataset (the UCI Machine Learning Repository) and have cited its original creators for their commendable work.

To cap off this web app, I’ve included the disclaimer that I give for all my biotechnology and medical applications of data science: this is an application based on one particular dataset, so we cannot use it universally. I have also attached the logo of Skillocity at the end.

So that’s it for this web app. I’ll see you soon with another fun application of machine learning and data science, along with some interesting insights. Hasta pronto!
