Data Science: A Crash Course with Python and Machine Learning

Artificial intelligence and data science are often used interchangeably, though they are not always the same thing. “Data scientist” is, half-jokingly, a fancy term for statistician. We’ll get to the details a bit later, but first let’s talk about what data science and machine learning actually are.

If you’d rather use TypeScript with machine learning, check out our post: Face Detection and Match with TypeScript. If you want tools to detect faces with Python, we have a post for that too.

As an oversimplification, data science is a mixture of computer science, statistics/mathematics, and domain knowledge of the particular field that machine learning is being applied to. We will dive into each of these with an example.

A Major Medical Problem

In our example doctors are doing some research work to differentiate between malignant and benign tumor cells. The tumor cells in a benign tumor aren’t going to affect the rest of the cells; basically they’re harmless. In contrast, a malignant tumor would be the one that causes all the problems; it affects other cells. Currently, finding the malignant cells in the body consumes a lot of doctors’ time; they have to test the patients quite a few times and reconfirm their results just to be sure.

Wouldn’t it be great if an AI model could find those malignant cells within a few minutes? The AI model would process images taken of the patient’s body cells, extract the useful features from each image, and return a prediction that, for a well-trained model, could be as accurate as a human expert’s, or more so.

A Right Proper Team for Data Science

We see in this example that we need three kinds of expertise to build out the AI:

  • A domain expert. The domain expert can verify whether the predictions made by our AI are correct. This person also guides the entire team onto the correct path by telling them whether they are looking for the answers in the right place.
  • Developers. Typically the domain expert can’t code or model the AI on their own, since coding and using the proper tools isn’t their expertise. An experienced development team can assess the most cost-, time-, and labor-efficient methods for developing this AI. In other words, they are the people who put together each piece of the puzzle, scale it, and make it usable by many other people.
  • Mathematicians. Usually the mathematics/statistics part is handed over to the team of programmers, so in practice these two roles often don’t need to be separate categories.

This is a classic example of image processing applied to tumor cells: we give the AI data in the form of pictures, it processes those pictures by applying various techniques, and finally it outputs the desired results.

Deep Dive Into the Buzzwords

This article is for people who want to get started with the buzzwords data science and machine learning. Mainstream media, articles, and tech blogs make it seem as if data science is strictly associated with computer science or software engineering, but it really is not. If you read a bit about what these cool words mean, you will discover that these things have been around you for a very long time. How is that? Well, let me take you on a journey through how you train a machine learning model and where data science comes into play.

Programming for Data Science

Before we start with the technical demo, you should know at least one programming language, preferably Python, which is used heavily in data science.

Get Your Tools Ready

Our journey begins with Python and some of its most powerful libraries. Different people use different libraries and frameworks, but these are the classic ones:

  • NumPy handles manipulation of data stored in rows and columns. It can perform bulk operations on large arrays, such as searching, modifying, and deleting elements.
  • Matplotlib is usually used alongside NumPy: NumPy arranges the data in rows and columns, and Matplotlib lets us visualize and analyze it. Being able to visualize data properly and analyze it through different kinds of graphs is one of the most essential parts of data science, and a person who can do that with Matplotlib is getting started with data analysis and data mining.
  • Pandas is the third. Python is a general-purpose language rather than a domain-specific one like R or MATLAB, which is why it relies on these libraries. Before Pandas, Python was not a preferred language for analyzing most data sets. Pandas provides data modeling and analysis tools strong enough that it may eventually rival R, much as R has taken over a lot of data visualization work from MATLAB and Octave.
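As a rough illustration of the first and third bullets, here is a minimal sketch using the first five rows of the salary table that appears later in this post. The variable names are ours and only illustrative.

```python
import numpy as np
import pandas as pd

# First five rows of the salary table shown later in this post.
years = np.array([1.1, 1.3, 1.5, 2.0, 2.2])
salary = np.array([39343, 46205, 37731, 43525, 39891])

# NumPy: search, modify, and delete elements of an array.
above_40k = np.where(salary > 40000)[0]  # indices of salaries over 40000
raised = salary * 1.05                   # element-wise modification (new array)
without_first = np.delete(salary, 0)     # copy with the first element removed

# pandas: the same data as a labeled table.
df = pd.DataFrame({"YearsExperience": years, "Salary": salary})
mean_salary = df["Salary"].mean()

print(above_40k)    # [1 3]
print(mean_salary)  # 41339.0
```

Note that NumPy operations such as `np.delete` return new arrays rather than changing the original in place.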

Apply Your Data Science Tools to a Model

Once we have imported, analyzed, and modeled a data set, we can train and test our model. Conveniently, we can use Scikit-Learn (Sklearn) for this. Sklearn provides the statistical operations we run on our data set, such as clustering, classification, and linear regression. Which operation we choose depends on the kind of prediction we have to make.

For instance, if we want the prediction to be a binary answer, either option 1 or option 2 and nothing else, we would choose classification to train our machine learning model. A good example is spam email detection: an email either is spam or it isn’t.
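To make the binary case concrete, here is a small sketch of a classifier. The data is a hypothetical toy set (one feature per email, the number of suspicious words), and logistic regression is just one of several classifiers Sklearn offers; the post itself does not name one.

```python
from sklearn.linear_model import LogisticRegression

# Hypothetical toy data: one feature per email, the count of suspicious words.
X = [[0], [1], [2], [8], [9], [10]]
y = [0, 0, 0, 1, 1, 1]  # 0 = not spam, 1 = spam

classifier = LogisticRegression()
classifier.fit(X, y)

# Classify two unseen emails: one with 1 suspicious word, one with 9.
predictions = classifier.predict([[1], [9]])
print(predictions)
```

The model can only ever answer 0 or 1 here, which is what makes this a classification problem rather than a regression problem.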

Likewise, if you want your prediction to be a continuous numeric value rather than a category, you can apply linear regression after analyzing your data set. During training, we split the useful data set into training data and testing data. On the training data we apply classification, clustering, or regression; for linear regression, this means finding the line that best fits the plotted data points. Finally, we use the trained model to predict values for the testing data, compare those predictions with the real test values, and so measure the accuracy of our trained model.
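For a single feature, "best fitting a line" has a simple closed form. The sketch below computes it by hand in plain Python so the idea is visible; `fit_line` is a hypothetical helper of ours, not a library function, and Sklearn does the equivalent work for us later.

```python
def fit_line(xs, ys):
    """Return (slope, intercept) of the least-squares line through the points."""
    n = len(xs)
    mean_x = sum(xs) / n
    mean_y = sum(ys) / n
    # Slope = covariance(x, y) / variance(x); the 1/n factors cancel out.
    sxy = sum((x - mean_x) * (y - mean_y) for x, y in zip(xs, ys))
    sxx = sum((x - mean_x) ** 2 for x in xs)
    slope = sxy / sxx
    intercept = mean_y - slope * mean_x
    return slope, intercept


# Points that lie exactly on y = 2x + 1, so the fit should recover that line.
slope, intercept = fit_line([1, 2, 3, 4], [3, 5, 7, 9])
print(slope, intercept)  # 2.0 1.0
```

Real data never sits exactly on a line, so the fitted line is the one that minimizes the sum of squared vertical distances to the points.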

Coding for Data Science

Below, we take a data set of workers’ salaries versus their years of experience, and we train our model to see how well it predicts the results, using a scatter plot.

import numpy as np
import matplotlib.pyplot as plt
import pandas as pd
from sklearn.model_selection import train_test_split 
from sklearn.linear_model import LinearRegression   

We see in the code snippet above that all the necessary machine learning libraries are imported.

data = pd.read_csv('Salary_Data.csv')
x = data.iloc[:, :-1].values
y = data.iloc[:, 1].values   

Here we import the salary data set, which I downloaded from Kaggle, where it is listed under linear regression based models. x holds every column except the last (the years of experience), and y holds the salary column.
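The two `iloc` slices behave slightly differently, which matters because Sklearn expects a 2-D feature array. As a sketch, assuming a stand-in DataFrame built from the first five rows of the table in this post instead of the real CSV file:

```python
import pandas as pd

# Stand-in for Salary_Data.csv: the first five rows of the table in this post.
data = pd.DataFrame({
    "YearsExperience": [1.1, 1.3, 1.5, 2.0, 2.2],
    "Salary": [39343, 46205, 37731, 43525, 39891],
})

x = data.iloc[:, :-1].values  # every column except the last, kept 2-D
y = data.iloc[:, 1].values    # the second column only, 1-D

print(x.shape)  # (5, 1)
print(y.shape)  # (5,)
```

The slice `:-1` keeps x as a table with one column, while the plain index `1` flattens y into a simple list of values.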

x_train, x_test, y_train, y_test = train_test_split(x, y, test_size=0.33, random_state=0)

Training and Testing

You see here we split our data into two sections: training data and testing data. With test_size=0.33, roughly one third of the rows are held out for testing. As mentioned before, training data is used for training our model, while testing data is used to see whether our model is making correct predictions.
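To see what the split does to the row counts, here is a sketch on hypothetical stand-in data. The 30-row size is an assumption made for illustration; the post does not state how many rows the CSV has.

```python
import numpy as np
from sklearn.model_selection import train_test_split

# Hypothetical stand-in data: 30 rows, one feature.
x = np.arange(30, dtype=float).reshape(-1, 1)
y = 2.0 * x.ravel()

x_train, x_test, y_train, y_test = train_test_split(
    x, y, test_size=0.33, random_state=0
)
print(len(x_train), len(x_test))  # 20 10

# Using the same random_state again reproduces exactly the same split.
x_train2, x_test2, _, _ = train_test_split(x, y, test_size=0.33, random_state=0)
same_split = np.array_equal(x_test, x_test2)
print(same_split)  # True
```

Fixing `random_state` is what makes the results repeatable from run to run; without it, the rows are shuffled differently each time.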

Index  YearsExperience  Salary
0      1.1              39343
1      1.3              46205
2      1.5              37731
3      2                43525
4      2.2              39891
5      2.9              56642
6      3                60150
7      3.2              54445
8      3.2              64445
9      3.7              57189
10     3.9              63218
11     4                55794
12     4                56957
13     4.1              57081

Above are the first 14 rows of the data set as a table.

Below is the data set we split for using separately on our machine learning model.

Training X Axis  Training Y Axis
2.9              56642
5.1              66029
3.2              64445
4.5              61111
8.2              113812
6.8              91738
1.3              46205
10.5             121872
3                60150
2.2              39891
5.9              81363
6                93940
3.7              57189

regressor = LinearRegression()
regressor.fit(x_train, y_train)

Now we fit a linear regression model to the training data, and then we can plot the predictions that our model makes.
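After fitting, the learned line can be read directly off the model. The sketch below fits only the 13 training pairs listed above, so its numbers will differ somewhat from a model trained on the full training split.

```python
import numpy as np
from sklearn.linear_model import LinearRegression

# The training pairs listed above (years of experience, salary).
x_train = np.array([2.9, 5.1, 3.2, 4.5, 8.2, 6.8, 1.3,
                    10.5, 3.0, 2.2, 5.9, 6.0, 3.7]).reshape(-1, 1)
y_train = np.array([56642, 66029, 64445, 61111, 113812, 91738, 46205,
                    121872, 60150, 39891, 81363, 93940, 57189])

regressor = LinearRegression()
regressor.fit(x_train, y_train)

slope = regressor.coef_[0]        # salary change per extra year of experience
intercept = regressor.intercept_  # predicted salary at zero years
print(f"salary ~ {slope:.0f} * years + {intercept:.0f}")

# Predict for a hypothetical worker with 5 years of experience.
predicted = regressor.predict(np.array([[5.0]]))[0]
print(round(predicted))
```

A prediction is nothing more than the slope times the input plus the intercept, which is why a single straight line appears on the plots.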

Figure: Salary vs Experience (training set)

This is the training data set. You can tell that salary rises roughly linearly with years of experience. Next we see our model making some predictions.

Figure: Salary vs Experience (prediction set)

This scatter plot shows the predictions our model made against the testing data. The values below compare the actual testing values with our model’s predicted values.
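The post does not show the prediction and plotting code itself, so here is one way it might look. This is a sketch on a handful of rows taken from the tables in this post; the colors, labels, and output file name are our own choices, and the plot is saved to a file so the snippet also runs without a display.

```python
import matplotlib
matplotlib.use("Agg")  # render to a file so the sketch runs without a display
import matplotlib.pyplot as plt
import numpy as np
from sklearn.linear_model import LinearRegression

# A few (years, salary) rows taken from the tables in this post.
x_train = np.array([[1.3], [2.2], [2.9], [3.0], [3.2], [3.7]])
y_train = np.array([46205, 39891, 56642, 60150, 64445, 57189])
x_test = np.array([[1.5], [3.9], [4.1]])
y_test = np.array([37731, 63218, 57081])

regressor = LinearRegression().fit(x_train, y_train)
y_pred = regressor.predict(x_test)  # one predicted salary per test row

# Actual test values as points, the model's predictions as a line.
plt.scatter(x_test.ravel(), y_test, color="red", label="actual")
plt.plot(x_test.ravel(), y_pred, color="blue", label="predicted")
plt.title("Salary vs Experience (test set)")
plt.xlabel("Years of Experience")
plt.ylabel("Salary")
plt.legend()
plt.savefig("salary_vs_experience.png")  # hypothetical output file name
```

In an interactive session you would typically call `plt.show()` instead of saving to a file.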

Testing Values (actual)  Predicted Values
37731                    40835.1
122391                   123079
57081                    65134.6
63218                    63265.4
116969                   115603
109431                   108126
112635                   116537
55794                    64200
83088                    76349.7
101302                   100649

As you can see, the predicted values are close to the actual ones: most fall within about 4,000 of the real salary and the largest miss is about 8,400, so our model is working just fine!
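To put a number on how close the predictions are, we can compute the average and worst-case error by hand. This sketch uses the ten value pairs listed above, treating the first list as the actual test salaries and the second as the model's predictions.

```python
# Actual test salaries and the model's predictions, copied from the lists above.
actual = [37731, 122391, 57081, 63218, 116969,
          109431, 112635, 55794, 83088, 101302]
predicted = [40835.1, 123079, 65134.6, 63265.4, 115603,
             108126, 116537, 64200, 76349.7, 100649]

# Absolute difference between each prediction and the real value.
errors = [abs(a - p) for a, p in zip(actual, predicted)]
mean_abs_error = sum(errors) / len(errors)
worst_error = max(errors)

print(f"mean absolute error: {mean_abs_error:.0f}")  # about 3426
print(f"largest error: {worst_error:.0f}")           # 8406
```

Sklearn ships the same kind of measurement in its `sklearn.metrics` module (for example `mean_absolute_error`), which is the more usual choice once the data lives in arrays.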
