Course notes: Linear regression modelling

Preamble

Notes from a series of machine learning videos; the final video covers a COVID-19 cases predictor.

Required Files

  • houses_to_rent.csv (Video 6)
  • coronaCases.csv – derived from total_cases.csv (Video 8)

Required Packages

  • scikit-learn (the package to install; it is imported as sklearn)
  • matplotlib – 3.3.0
  • numpy – 1.19.1

Notes

Video 1

Images to numbers: feature extraction (edges, corners, shapes, etc.), done before training. AKA manual feature extraction. The features are tables of numbers, i.e. matrices.

ML process:

  • Training
    • Data Collection
    • Feature extraction
    • Training (or rules generation)
    • Model
  • Testing
    • Predict

 

No need to do feature extraction in some cases, because the data are already features. For example, salary modelling. Some conversion may be required for features which aren’t numbers, e.g. gender: male = 1, female = 0; job type (for a fixed number of jobs) can be converted to numbers as well.
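A minimal sketch of this kind of conversion (the values and job categories here are made up):

# Map categorical features to numbers (hypothetical data)
gender_map = {'male': 1, 'female': 0}
job_map = {'engineer': 0, 'teacher': 1, 'nurse': 2}  # fixed set of job types

rows = [('male', 'teacher'), ('female', 'engineer')]
features = [[gender_map[g], job_map[j]] for g, j in rows]
print(features)  # [[1, 1], [0, 0]]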

 

Type of ML

  • Supervised – labelled images/data, e.g. cats and dogs
  • Unsupervised – unlabelled data – the machine separates it by itself, using similarity of the images
  • Reinforcement – best policy – reward and penalty learning

 

Algorithms:

  • Supervised
    • Linear regression
    • Logistic regression
    • Support vector machine
    • KNN
    • Neural networks
  • Unsupervised
    • K-means
    • C-means
    • HCA
    • Apriori
  • Reinforcement
    • Q learning
    • SARSA

Supervised is the most common method

  • Supervised
    • Classification
      • Fixed number of classes – pets
    • Regression
      • Continuous value – salary
  • Unsupervised
    • Clustering
      • Similar to classification
      • Finite number of classes, but unlabelled (machine separates images itself)
      • Three groups: Apples, bananas and yellow mangos, or by colour (two groups): apples and then bananas and yellow mangos together
    • Association
      • Pancake mix: chocolate and maple syrup are often bought together

Assigning categories to the algorithms:

  • Supervised
    • Linear regression – regression
    • Logistic regression – classification (despite the name)
    • Support vector machine – classification and regression
    • KNN – classification and regression
    • Neural networks – classification and regression
  • Unsupervised
    • K-means – clustering
    • C-means – clustering
    • HCA – clustering
    • Apriori – association
  • Reinforcement
    • Q learning
    • SARSA

 

Summary

  • ML mimics learning process of humans
  • Features to rules
  • Three types:
    • Supervised
      • Classification
      • Regression
    • Unsupervised
      • Clustering
      • Association
    • Reinforcement

Video 2

Linear regression is one of the most basic types of ML algorithms.

 

Generalisation:

125 mg/dl = diabetes

heavy object = more pain

Generalisation of numbers gives equations, and these equations can be used to predict new data.

Polynomial equations:

  • linear (x) – no curves, crosses the x-axis at most once
  • quadratic (x²) – one curve in the graph, crosses the x-axis at most twice
  • cubic (x³) – two curves in the graph, crosses the x-axis at most three times
  • etc.

The order is:

  • the maximum power of x
  • one more than the number of curves (turning points) in the graph
  • the maximum number of times the x-axis can be crossed

Examples:

Scatter plot in a straight line – it is linear: use a first-order equation to represent it.

Scatter plot in a single curved line – it is quadratic: use a second-order equation to represent it.

Scatter plot with multiple curves – it is cubic (or higher): use a third-order (or higher) equation to represent it.

Example using house prices:

  • Size (sq.ft.) – X – independent variable
  • Price ($) – Y – dependent variable

Now plot price against size, make a best fit line, see the trend, and it is linear (i.e. straight line) – therefore we can use linear regression

Regression is finding a line that will fit the points with the least amount of errors.

y = mx + c

We already know x and y, so the machine learning will try to find the gradient (m) and the intercept (c). c is the minimum price of the house, excluding the size consideration. It is also where the line intercepts the Y axis.

How does the machine know where the best fit line is? There needs to be some sort of feedback:

Subtract the original value from the predicted value, which gives the error for a single point, then add all of the errors to get the total error. However, positive and negative errors can cancel, so the total could come out as zero even for a bad fit, which would be misleading.

Squaring the errors before adding them removes the cancellation effect, as all the squared errors are positive. However, this sum simply grows with the number of points, so it is still not a fair measure.

The solution is to take the mean of the sum of the squares. This is the Mean Squared Error (MSE).

The lower the value of the MSE the better.
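A minimal numpy sketch of the MSE calculation (the data values are made up):

import numpy as np

yActual = np.array([200, 250, 300])     # original values
yPredicted = np.array([210, 240, 310])  # values from a candidate best fit line
errors = yActual - yPredicted           # per-point errors, positive or negative
mse = np.mean(errors ** 2)              # square first (no cancellation), then average
print(mse)  # 100.0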

For each best fit line that the machine finds, it determines the MSE and compares it with other best fit lines, in order to find the “best” best fit line. Now m and c can be determined.

What if there are more variables? For example, size (m1, x1), number of bedrooms (m2, x2) and number of bathrooms (m3, x3)?

Use multiple linear regression.

y = c + m1x1 + m2x2 + m3x3

Difficult to visualise as there are multiple dimensions involved. However, the backend principles are the same as the single linear regression model.
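A minimal sketch of a prediction with such a model, using made-up gradients and intercept:

import numpy as np

c = 50000                         # intercept: base price
m = np.array([300, 10000, 5000])  # gradients m1, m2, m3
xHouse = np.array([120, 3, 2])    # size, bedrooms, bathrooms
y = c + np.dot(m, xHouse)         # y = c + m1*x1 + m2*x2 + m3*x3
print(y)  # 126000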

 

Video 3

New Project: MachineLearning

New file: SingleLinearRegression/LinearRegressionTemperature.py

 

x is Celsius and y is Fahrenheit.

We already know that the formula is

F = 1.8*C + 32

and we want to see how well the machine can find m and c itself.

# LinearRegressionTemperature.py
import random

import sklearn
import matplotlib.pyplot as plt
import numpy as np

bNoise = True
maxCelsius = 10
# maxCelsius = 38
# maxCelsius = 39  # When you are sick and have a temperature
# maxCelsius = 50
# maxCelsius = 100

# Celsius
x = range(0, maxCelsius)
print(x)  # x is a range object, not yet a list
x = list(x)
print(f'X: {x}')  # x is now a list

# Fahrenheit with noise
if bNoise:
    y = [1.8 * c + 32 + random.randint(-3, 3) for c in x]  # add noise of ±3
else:
    # Fahrenheit
    y = [1.8 * c + 32 for c in x]

print(f'Y: {y}')

plt.plot(x, y, '-*r')
plt.show()

 

Video 4

Reshape the data for sklearn

x = np.array(x).reshape(-1, 1)
y = np.array(y).reshape(-1, 1)

This makes a list of lists, basically: each flat list becomes a 2-D column array with one sample per row, which is the shape sklearn expects.
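Concretely:

import numpy as np

print(np.array([0, 1, 2]).reshape(-1, 1))
# [[0]
#  [1]
#  [2]]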

Split 20% for testing

xTrain, xTest, yTrain, yTest = sklearn.model_selection.train_test_split(x, y, test_size=0.2)

I found that

# xTrain, xTest, yTrain, yTest = sklearn.model_selection.train_test_split(x, y, test_size=0.2)

gives an error:

AttributeError: module 'sklearn' has no attribute 'model_selection'

The solution is to import directly

from sklearn.model_selection import train_test_split

and change line to

xTrain, xTest, yTrain, yTest = train_test_split(x, y, test_size=0.2)

or

from sklearn import model_selection

and

xTrain, xTest, yTrain, yTest = model_selection.train_test_split(x, y, test_size=0.2)

Now the shape of the training data

print(f'xTrain.shape: {xTrain.shape}')

shows that we have 8 out of 10 values for training

xTrain.shape: (8, 1)

Now the model, first import

from sklearn import linear_model

and then

model = linear_model.LinearRegression()

and fit the data

model.fit(xTrain, yTrain)

and get the accuracy

accuracy = model.score(xTest, yTest)

Also m and c

print(f'coefs: {model.coef_}')
print(f'intercept: {model.intercept_}')

You may get multiple coefficients (when using multiple linear regression), but only one intercept. Because y was reshaped into a column, coef_ is a 2-D array and intercept_ a 1-D array, which is why the coefficients print with nested brackets but the intercept with single brackets.

# LinearRegressionTemperature.py
import random

# import sklearn
# from sklearn.model_selection import train_test_split
from sklearn import model_selection
from sklearn import linear_model
import matplotlib.pyplot as plt
import numpy as np

bNoise = False
maxCelsius = 10
# maxCelsius = 38
# maxCelsius = 39  # When you are sick and have a temperature
# maxCelsius = 50
# maxCelsius = 100

# Celsius
x = range(0, maxCelsius)
print(x)  # x is a range object, not yet a list
x = list(x)
print(f'X: {x}')  # x is now a list

if bNoise:
    # Fahrenheit with noise
    y = [1.8 * c + 32 + random.randint(-3, 3) for c in x]
else:
    # Fahrenheit
    y = [1.8 * c + 32 for c in x]

print(f'Y: {y}')

# plt.plot(x, y, '-*r')
# plt.show()

x = np.array(x).reshape(-1, 1)
y = np.array(y).reshape(-1, 1)

print(f'Y reshaped:\n{y}')

# xTrain, xTest, yTrain, yTest = sklearn.model_selection.train_test_split(x, y, test_size=0.2)
xTrain, xTest, yTrain, yTest = model_selection.train_test_split(x, y, test_size=0.2)
# xTrain, xTest, yTrain, yTest = train_test_split(x, y, test_size=0.2)
print(f'xTrain.shape: {xTrain.shape}')

model = linear_model.LinearRegression()
model.fit(xTrain, yTrain)
print(f'coef: {model.coef_}')
print(f'intercept: {model.intercept_}')

accuracy = model.score(xTest, yTest)
print(f'Accuracy: {accuracy}')
print(f'Accuracy: {round(accuracy*100, 2)} %')

Video 5

Noise

We need more values. With noise of ±3, around 50 values are really needed (at the very least more than ten) for a reliable fit.
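A quick way to see this is to refit the model for different sample counts and compare the scores (a sketch reusing the code above; the exact scores will vary from run to run):

import random
import numpy as np
from sklearn import linear_model, model_selection

for n in [10, 50, 100]:
    x = np.array(range(n)).reshape(-1, 1)
    y = np.array([1.8 * c + 32 + random.randint(-3, 3) for c in range(n)]).reshape(-1, 1)
    xTrain, xTest, yTrain, yTest = model_selection.train_test_split(x, y, test_size=0.2)
    model = linear_model.LinearRegression().fit(xTrain, yTrain)
    print(n, round(model.score(xTest, yTest) * 100, 2))  # score settles near 100 as n grows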

 

Multiple Linear Regression – Video 6

 

Rent of houses

  • number of rooms
  • bathrooms
  • parking spaces

Sources of data:

UCI Machine Learning repository and Kaggle:

 

Required files:

houses_to_rent.csv

https://www.kaggle.com/rubenssjr/brasilian-houses-to-rent?select=houses_to_rent.csv

You have to log in to Kaggle to download. Get it from here instead, without having to log in to even more dubious websites: https://github.com/MadoDoctor/EDA-Project-Houses-to-rent

Move the CSV file into a new directory MultipleLinearRegression

Required packages

  • matplotlib
  • sklearn
  • pandas

New file: MultipleLinearRegression/HouseRent.py

Read the data in, using pandas

data = pd.read_csv('houses_to_rent.csv', sep=',')

You may need to add “id” to the first column header, as it is unnamed:

<bound method NDFrame.head of Unnamed: 0 city area ... property tax fire insurance total
0 0 1 240 ... R$1,000 R$121 R$9,121
1 1 0 64 ... R$122 R$11 R$1,493
2 2 1 443 ... R$1,417 R$89 R$12,680
3 3 1 73 ... R$150 R$16 R$2,116

You can refer to the columns by index, but it is clearer to specify column names. Note that the names must exactly match the headers in the CSV file; the order can differ, and you don’t have to select all of the columns:

data = data[['city', 'rooms', 'bathroom', 'parking spaces', 'fire insurance', 'furniture', 'rent amount']]

compare with

,city,area,rooms,bathroom,parking spaces,floor,animal,furniture,hoa,rent amount,property tax,fire insurance,total

Now the number of columns is reduced.

Now, process the data.

Note that city has already been converted into a number (0 or 1), but furniture has not, so we need to convert it. We also need to remove the R$ prefix and the thousands commas.

For the R$, write a lambda that slices off the first two characters and apply it with Series.map():

data['rent amount'] = data['rent amount'].map(lambda i: i[2:])

replace the comma with nothing (an empty string)

data['rent amount'] = data['rent amount'].map(lambda i: i[2:].replace(',', ''))

finally convert from string to int

data['rent amount'] = data['rent amount'].map(lambda i: int(i[2:].replace(',','')))

Do the same for the fire insurance (but there is no comma to remove)

data['fire insurance'] = data['fire insurance'].map(lambda i: int(i[2:]))
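To check what these lambdas do to a single value (a quick sanity check, not from the video):

value = 'R$1,000'
print(value[2:])                        # '1,000' -- R$ sliced off
print(int(value[2:].replace(',', '')))  # 1000 -- comma removed, converted to int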

Now for the furniture conversion to a number

from sklearn import preprocessing

Then

le = preprocessing.LabelEncoder()
data['furniture'] = le.fit_transform(data['furniture'])

now we have a 0 or 1 for furniture
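A minimal sketch of what the encoder does: it assigns integers to the sorted unique labels, so 'furnished' (alphabetically first) becomes 0:

from sklearn import preprocessing

le = preprocessing.LabelEncoder()
print(le.fit_transform(['furnished', 'not furnished', 'furnished']))  # [0 1 0]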

 

Next video: check for NAN and missing data, then split for training and testing, create model and check the final testing score.

Multiple Linear Regression – Video 7

Check for Not a Number (NaN) values

Print a heading

print('-'*30); print('Checking Data'); print('-'*30)

Look for null data

print(data.isnull().sum())

If there are any, you can drop the entire row from the data set

data = data.dropna()

If we only have a few NaNs then this isn’t a problem. However, if there are a lot, and/or we don’t have much data to begin with, then dropping rows can become an issue. Instead of dropping, you can fill the NaNs with the column mean, or with zero.

data = data.fillna(data.mean())

or

data = data.fillna(0)

Note that a column containing a NaN is held as float rather than int (pandas cannot store NaN in a plain integer column), so after fillna() or dropna() such a column shows up as float. Using the city column as an example: if there is no NaN, the column stays int.
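A quick demonstration of the dtype change (not from the video):

import numpy as np
import pandas as pd

print(pd.Series([1, 2, 3]).dtype)       # int64
print(pd.Series([1, 2, np.nan]).dtype)  # float64 -- NaN forces float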

Split the data. But first we need to specify the input (x) and the output (y): x is all of the columns except the rent, and y is the rent

x = np.asarray(data.drop(['rent amount'], 1))
y = np.asarray(data['rent amount'])

The , 1 is the axis argument: axis=1 means a column is dropped (axis=0 would drop rows).
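Passing the axis as a keyword, or using the columns argument, is clearer and avoids the FutureWarning shown below:

x = np.asarray(data.drop(['rent amount'], axis=1))
# or, equivalently
x = np.asarray(data.drop(columns=['rent amount']))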

A warning

FutureWarning: In a future version of pandas all arguments of DataFrame.drop except for the argument 'labels' will be keyword-only
x = np.asarray(data.drop(['rent amount'], 1))
Splitting

Now import

from sklearn import preprocessing, model_selection

Split the data for training and testing

xTrain, xTest, yTrain, yTest = model_selection.train_test_split(x, y, test_size=0.2)

or

xTrain, xTest, yTrain, yTest = model_selection.train_test_split(x, y, test_size=0.2, random_state=10)
Training

Now import again

from sklearn import preprocessing, model_selection, linear_model

and train

model = linear_model.LinearRegression()
model.fit(xTrain, yTrain)

Print the gradients and intercept (m and c). As we have six inputs, we will have six gradients

print(f'coefs: {model.coef_}')
print(f'intercept: {model.intercept_}')

Using these we could literally write out the formula, but it is more sensible to use the model to predict

Get the score

accuracy = model.score(xTest, yTest)

and print

print(f'Accuracy: {round(accuracy*100, 3)} %')

Manually test

Time 18:00

Test using

testValues = model.predict(xTest)

which is similar to

accuracy = model.score(xTest, yTest)

but gives the individual predicted values rather than a single summary score

 

Check the actual value and the predicted value and compare to find the deviation.

error = []
for i, testValue in enumerate(testValues):
    error.append(yTest[i] - testValue)
    print(f'Actual: {yTest[i]} Prediction: {int(testValue)} Error: {int(error[i])} Perc: {round(100*int(error[i])/yTest[i],1)}%')
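To condense the per-row errors into a single number, you could also print the mean absolute error after the loop (a small addition, not from the video; assumes numpy is imported as np):

print(f'Mean absolute error: {round(float(np.mean(np.abs(error))), 2)}')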

Full code

# HouseRent.py

import pandas as pd
from sklearn import preprocessing, model_selection, linear_model
import numpy as np


# Load data
print('-'*30); print('Importing Data'); print('-'*30)

data = pd.read_csv('houses_to_rent.csv', sep=',')
print(data.head())

data = data[['city', 'rooms', 'bathroom', 'parking spaces', 'fire insurance', 'furniture', 'rent amount']]
print(data.head())

# Process data
# data['rent amount'] = data['rent amount'].map(lambda i: i[2:])
# print(data.head())
# This gives an error, because of the comma
# data['rent amount'] = data['rent amount'].map(lambda i: int(i[2:]))
# data['rent amount'] = data['rent amount'].map(lambda i: i[2:].replace(',', ''))
# print(data.head())
data['rent amount'] = data['rent amount'].map(lambda i: int(i[2:].replace(',', '')))

print(data.head())
data['fire insurance'] = data['fire insurance'].map(lambda i: int(i[2:]))
print(data.head())
le = preprocessing.LabelEncoder()
data['furniture'] = le.fit_transform(data['furniture'])
print(data.head())

# Check Data
print('-'*30); print('Checking Null Data'); print('-'*30)
print(data.isnull().sum())
data = data.dropna()
# data = data.fillna(data.mean())
# data = data.fillna(0)
print(data.isnull().sum())
print(data.head())

# Split Data
print('-'*30); print('Splitting Data'); print('-'*30)

x = np.asarray(data.drop(['rent amount'], axis=1))  # axis=1: drop a column, not rows
y = np.asarray(data['rent amount'])

print('X', x.shape)
print('Y', y.shape)

# xTrain, xTest, yTrain, yTest = model_selection.train_test_split(x, y, test_size=0.2)
xTrain, xTest, yTrain, yTest = model_selection.train_test_split(x, y, test_size=0.2, random_state=10)

print(f'xTrain: {xTrain.shape}')
print(f'xTest: {xTest.shape}')

# Training
print('-'*30); print('Training'); print('-'*30)

model = linear_model.LinearRegression()
model.fit(xTrain, yTrain)

accuracy = model.score(xTest, yTest)
print(f'coefs: {model.coef_}')
print(f'intercept: {model.intercept_}')
print(f'Accuracy: {round(accuracy*100, 3)} %')

# Manual Evaluation
print('-'*30); print('Manual Evaluation'); print('-'*30)
testValues = model.predict(xTest)
print(f'testValues shape: {testValues.shape}')

error = []
for i, testValue in enumerate(testValues):
    error.append(yTest[i] - testValue)
    print(f'Actual: {yTest[i]} Prediction: {int(testValue)} Error: {int(error[i])} Perc: {round(100*int(error[i])/yTest[i],1)}%')

Corona Virus Cases Predictor – Video 8

Corona Virus Predictor using Machine Learning with Python (2020)

Links:

Difference between polynomial regression and multiple linear regression: polynomial regression has only one independent variable, x, from which the other terms (x², x³, etc.) are derived, whereas MLR has multiple, entirely independent variables (x1, x2, x3, etc.) from the beginning. However, the forms of the equations are the same, with one intercept and multiple gradients: y = c + m1*x + m2*x² (polynomial) versus y = c + m1*x1 + m2*x2 (MLR).

In the MultipleLinearRegression directory, New file: CoronaPredictor.py

The CSV coronaCases.csv is formed by taking just the World column from total_cases.csv, renaming it cases, and adding a simple id column, which is the number of days since the start – up to 234 days (22,151,281 cases at 8/19/2020).

I actually ended up with 233 days and a slightly different value for cases (22,206,623).

# CoronaPredictor.py

import pandas as pd

# Load data
data = pd.read_csv('coronaCases.csv', sep=',')  # the derived file with id and cases columns
data = data[['id', 'cases']]
print(data.head())

Time 9:00

Prepare data

# Prepare data
print('-'*30); print('Prepare data'); print('-'*30)
x = np.array(data['id']).reshape(-1, 1)
y = np.array(data['cases']).reshape(-1, 1)

Create features

Polynomials have x, x squared, x cubed, etc., so we need to make id, id squared, id cubed, etc.

from sklearn.preprocessing import PolynomialFeatures

and generate features up to degree 2 (i.e. x and x squared):

# Features
print('-'*30); print('Prepare Features'); print('-'*30)
polyFeature = PolynomialFeatures(degree=2)
x = polyFeature.fit_transform(x)
print(x)

If you wanted x cubed then the line would be

polyFeature = PolynomialFeatures(degree=3)
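To see what the transform actually produces: for degree=2 each value x becomes the row [1, x, x²], where the leading 1 is a bias column:

import numpy as np
from sklearn.preprocessing import PolynomialFeatures

x = np.array([1, 2, 3]).reshape(-1, 1)
print(PolynomialFeatures(degree=2).fit_transform(x))
# [[1. 1. 1.]
#  [1. 2. 4.]
#  [1. 3. 9.]]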

Model for training

15:50

In this case we are not going to split the data: we want to use all of it for training, to fit the model as well as possible.

# Training
print('-'*30); print('Training'); print('-'*30)
model = linear_model.LinearRegression()
model.fit(x, y)
accuracy = model.score(x, y)
print(f'Accuracy: {round(100*accuracy, 3)} %')
Prediction: Of past dates
y0 = model.predict(x)
plt.plot(y0, '--b')
plt.show()

As the match isn’t perfect, we can add extra terms to the polynomial

# polyFeature = PolynomialFeatures(degree=3)
# polyFeature = PolynomialFeatures(degree=4)

Now the accuracy increases from 99.4 % to 99.9 %

We can even do a straight line, by making degree=1:

polyFeature = PolynomialFeatures(degree=1)

Now accuracy is 82%

Now, where do you stop adding terms? It is possible to overfit the data, in which case we won’t get good predictions for new values. Also, note that we are testing with the same data that we used for training.

Further prediction: in the future

22:00

Now to ask the model what will happen in 5 days or 10 days

# Prediction
print('-'*30); print('Prediction'); print('-'*30)
days = 2
print(f'Prediction of cases after {days} days ', end='')
print(round(int(model.predict(polyFeature.fit_transform([[233 + days]]))) / 1000000, 2), 'million')

The line is horrific, but this is the core of it:

model.predict(polyFeature.fit_transform([[233+days]]))

Note the difference in the model.predict() arguments. Earlier we could call model.predict(x) directly, because x had already been transformed by polyFeature.fit_transform(); a new raw input (a day number) has to be transformed in the same way before being passed to predict().
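Since polyFeature has already been fitted, the idiomatic call for new inputs is transform() rather than fit_transform() (for PolynomialFeatures the result is the same either way):

newDay = [[233 + days]]  # a raw day number, same shape as one row of the training input
prediction = model.predict(polyFeature.transform(newDay))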

Plotting

x1 = np.array(list(range(1, maxDaysInData+days))).reshape(-1,1)

Note the reshaping

y1 = model.predict(polyFeature.fit_transform(x1))

plt.plot(y1, '--g')
plt.show()

 

Full code

# CoronaPredictor.py
import numpy as np
import pandas as pd
import matplotlib.pyplot as plt
from sklearn.preprocessing import PolynomialFeatures
from sklearn import linear_model

# Load data
# data = pd.read_csv('total_cases.csv', sep=',')
data = pd.read_csv('coronaCases.csv', sep=',')
data = data[['id', 'cases']]
print('-'*30); print('Head'); print('-'*30)
print(data.head())
# Prepare data
print('-'*30); print('Prepare data'); print('-'*30)
x = np.array(data['id']).reshape(-1, 1)
y = np.array(data['cases']).reshape(-1, 1)

plt.plot(y, '-*r')
# plt.show()

# Features
print('-'*30); print('Prepare Features'); print('-'*30)
# polyFeature = PolynomialFeatures(degree=1)
polyFeature = PolynomialFeatures(degree=2)
# polyFeature = PolynomialFeatures(degree=3)
# polyFeature = PolynomialFeatures(degree=4)
x = polyFeature.fit_transform(x)
print(x)

# Training
print('-'*30); print('Training'); print('-'*30)
model = linear_model.LinearRegression()
model.fit(x, y)
accuracy = model.score(x, y)
print(f'Accuracy: {round(100*accuracy, 3)} %')
# Testing with the same data
y0 = model.predict(x)
plt.plot(y0, '--b')
# plt.show()


# Prediction
print('-'*30); print('Prediction'); print('-'*30)
# days = 2
days = 50
maxDaysInData = 233
print(f'Prediction of cases after {days} days: ', end='')
# print(round(int(model.predict(polyFeature.fit_transform([[233+days]])))/1000000, 2), 'million')
print(round(int(model.predict(polyFeature.fit_transform([[maxDaysInData+days]])))/1000000, 2), 'million')
# Basis
# model.predict(polyFeature.fit_transform([[233+days]]))

x1 = np.array(list(range(1, maxDaysInData+days))).reshape(-1,1)

y1 = model.predict(polyFeature.fit_transform(x1))

plt.plot(y1, '--g')
plt.show()

This is the end, my friend.

 
