With more and more free datasets available on the internet, we can have some fun trying to predict things.
Here I am using a dataset from Kaggle (you can find it here: https://www.kaggle.com/rashikrahmanpritom/heart-attack-analysis-prediction-dataset).
It is the “Heart Attack Analysis & Prediction Dataset”: a dataset with variables such as the age, sex and cholesterol of anonymous people, along with their chance of having a heart attack.
I want to answer 3 questions with this dataset:
- Does cholesterol really increase the risk of having a heart attack?
- Are age and sex determining risk factors?
- Can we predict the chance of having a heart attack?
So let’s jump into it!
1. The cholesterol indicator
Cholesterol is an interesting subject within the scientific community. We sometimes hear that cholesterol can have a strong negative impact on health. But we also hear that cholesterol is not as important as we might think. Either way, it’s a fascinating and complicated debate.
I don’t have the knowledge to understand the debate and take a position, but I can check whether, in this dataset, cholesterol has a strong impact on the heart.
There are slightly more people with a chance of having a heart attack among those with higher cholesterol, but it doesn’t look like a huge difference.
With this dataset, I am not sure we can conclude anything about the importance of cholesterol in heart attacks.
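For readers who want to try this kind of comparison themselves, here is a minimal sketch with pandas. It is not necessarily how the code in my repo does it. The column names `chol` and `output`, the helper name `risk_rate_by_cholesterol` and the 240 mg/dl cut-off are all assumptions made for the illustration, and the toy rows are made up rather than taken from the dataset.

```python
import pandas as pd


def risk_rate_by_cholesterol(df, threshold=240, chol_col="chol", target_col="output"):
    """Share of rows flagged at risk, below vs. at or above a cholesterol threshold.

    Column names and the threshold are assumptions for this sketch.
    """
    high = df[chol_col] >= threshold
    return {
        "below": float(df.loc[~high, target_col].mean()),
        "at_or_above": float(df.loc[high, target_col].mean()),
    }


# Made-up toy rows, only to show the call (not real data).
toy = pd.DataFrame({"chol": [200, 210, 250, 260], "output": [0, 1, 1, 1]})
print(risk_rate_by_cholesterol(toy))  # {'below': 0.5, 'at_or_above': 1.0}
```

If the two rates come out close to each other on the real data, that would match the "no huge difference" impression above.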
What about age and sex?
This one is also a great question. Is there a population that is more likely to have a heart attack based on age and sex? We would want to say yes, but let’s see with this dataset.
As the data are anonymous, sex is encoded as 0/1, which stands for male/female, but we do not know whether 0 is male or female, and the same goes for 1.
It doesn’t really matter here, as we just want to see whether sex is a key factor in heart attacks.
Clearly, sex seems to play a determining role here!
More surprisingly, age does not make a huge difference in this dataset. It could be because we do not have enough entries (for example, there are almost no entries under 30 or over 70). Or maybe age just isn’t a factor!
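The same kind of check can be sketched for sex and age. This is only an illustration under assumptions: the column names `sex`, `age` and `output`, the helper names and the age bands are hypothetical choices, and the toy rows are made up. Reporting the number of rows per group next to the rate makes it easy to spot bands with too few entries to say anything.

```python
import pandas as pd


def add_age_band(df, age_col="age", bins=(0, 40, 50, 60, 70, 120)):
    """Return a copy of df with an 'age_band' column (left-closed intervals)."""
    out = df.copy()
    out["age_band"] = pd.cut(out[age_col], bins=list(bins), right=False)
    return out


def risk_rate_by_group(df, group_col, target_col="output"):
    """Risk rate and number of rows for each value of group_col."""
    summary = df.groupby(group_col, observed=False)[target_col].agg(["mean", "count"])
    return summary.rename(columns={"mean": "risk_rate", "count": "n"})


# Made-up toy rows, only to show the calls (not real data).
toy = pd.DataFrame({
    "age": [35, 45, 55, 65],
    "sex": [0, 0, 1, 1],
    "output": [1, 1, 0, 1],
})
print(risk_rate_by_group(toy, "sex"))
print(risk_rate_by_group(add_age_band(toy), "age_band"))
```

A band with a very small `n` should not be read as evidence either way, which is exactly the caveat about the under-30 and over-70 entries.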
Prediction time!
Wouldn’t it be great if we could predict the likelihood of being a victim of a heart attack?
It’s time to try!
I won’t go into the details here, but just so you know, I used a Random Forest Classifier as the prediction model. You can find more info here: https://medium.com/machine-learning-101/chapter-5-random-forest-classifier-56dc7425c3e1
After training, the model has a precision of 90%!
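To give an idea of what training such a model looks like, here is a minimal sketch with scikit-learn. It is not the exact code from my repo: the helper name `train_and_score`, the 80/20 split, the number of trees and the synthetic stand-in data from `make_classification` are all assumptions made for the illustration. It reports both accuracy and precision, since the two are easy to mix up.

```python
from sklearn.datasets import make_classification
from sklearn.ensemble import RandomForestClassifier
from sklearn.metrics import accuracy_score, precision_score
from sklearn.model_selection import train_test_split


def train_and_score(X, y, seed=0):
    """Fit a random forest on a train split and score it on a held-out split."""
    X_train, X_test, y_train, y_test = train_test_split(
        X, y, test_size=0.2, stratify=y, random_state=seed
    )
    model = RandomForestClassifier(n_estimators=200, random_state=seed)
    model.fit(X_train, y_train)
    y_pred = model.predict(X_test)
    scores = {
        "accuracy": accuracy_score(y_test, y_pred),
        "precision": precision_score(y_test, y_pred, zero_division=0),
    }
    return model, scores


# Synthetic stand-in data (not the heart dataset), only to make the sketch runnable.
X, y = make_classification(n_samples=300, n_features=13, random_state=0)
model, scores = train_and_score(X, y)
print(scores)
```

Scoring on rows the model never saw during training is the important part: a score computed on the training rows themselves would look much better than it deserves.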
This is not that high (it could be better), but it is already a good start. Now the really interesting question is: what are the most important features for predicting those heart attacks?
The 3 most important features here are:
- cp: the chest pain type
- oldpeak: the ST depression induced by exercise relative to rest
- thalach: the maximum heart rate achieved
So you might monitor those to prevent heart attacks ;)
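A ranking like the one above can be read from a fitted random forest through its `feature_importances_` attribute. Here is a minimal sketch, again on synthetic stand-in data: the helper name `rank_features` and the feature names are hypothetical, and this is not necessarily how the code in my repo produces the ranking.

```python
from sklearn.datasets import make_classification
from sklearn.ensemble import RandomForestClassifier


def rank_features(model, feature_names):
    """Pair each feature name with its importance, most important first."""
    pairs = zip(feature_names, model.feature_importances_)
    return sorted(((name, float(value)) for name, value in pairs),
                  key=lambda pair: pair[1], reverse=True)


# Synthetic stand-in data with hypothetical feature names (not the heart dataset).
X, y = make_classification(n_samples=300, n_features=5, n_informative=2, random_state=0)
names = [f"feature_{i}" for i in range(5)]
model = RandomForestClassifier(n_estimators=100, random_state=0).fit(X, y)

for name, score in rank_features(model, names)[:3]:
    print(f"{name}: {score:.3f}")
```

These impurity-based importances say which features the model leaned on, not which ones cause heart attacks. scikit-learn also offers `permutation_importance` in `sklearn.inspection` as a cross-check.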
We can have some fun with open-source datasets from the internet and Python.
However, don’t take it too seriously. First, we do not have enough entries to draw firm conclusions from our results. Second, the model is not precise enough for us to be sure of what we obtained.
But I hope this made you want to try it yourself!
You can find the whole code on my GitHub profile: https://github.com/ClemHuriaux/Heart-attacks