# Course notes: Linear regression modelling

## Preamble

Notes from:

and

and finally, COVID!!

Also (for background)

## Required Packages

• `sklearn` – 0.0
• `matplotlib` – 3.3.0
• `numpy` – 1.19.1

## Notes

### Video 1

Images to numbers: feature extraction (edges, corners, shapes, etc.), done before training. Also known as manual feature extraction. The features are tables of numbers, i.e. matrices.

ML process:

• Training
  • Data collection
  • Feature extraction
  • Training (or rules generation)
  • Model
• Testing
  • Predict

Feature extraction is not needed in some cases, because the data are already features. For example, salary modelling. Some conversion may be required for features which aren't numbers, e.g. gender: male = 1, female = 0; job type (for a fixed number of jobs) can be converted to numbers as well.
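A minimal sketch of such a manual conversion (the job types and salary value here are made up for illustration):

```python
# Manual encoding of non-numeric features, as described above.
gender_map = {'male': 1, 'female': 0}
job_map = {'engineer': 0, 'teacher': 1, 'doctor': 2}  # hypothetical fixed set of job types

row = {'gender': 'female', 'job': 'teacher', 'salary': 50000}
features = [gender_map[row['gender']], job_map[row['job']]]
print(features)  # [0, 1]
```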

Type of ML

• Supervised – labelled images/data, e.g. cats and dogs
• Unsupervised – unlabelled data – the machine separates it by itself, using similarity of images
• Reinforcement – best policy – reward and penalty learning

Algorithms:

• Supervised
  • Linear regression
  • Logistic regression
  • Support vector machine
  • KNN
  • Neural networks
• Unsupervised
  • K-means
  • C-means
  • HCA
  • Apriori
• Reinforcement
  • Q learning
  • SARSA

Supervised is the most common method

• Supervised
  • Classification
    • Fixed number of classes – pets
  • Regression
    • Continuous value – salary
• Unsupervised
  • Clustering
    • Similar to classification
    • Finite number of classes, but unlabelled (the machine separates the images itself)
    • Three groups: apples, bananas and yellow mangos; or by colour (two groups): apples, and then bananas and yellow mangos together
  • Association
    • Pancake mix: chocolate and maple syrup are often bought together

Assigning categories to the algorithms:

• Supervised
  • Linear regression – regression
  • Logistic regression – classification (despite the name)
  • Support vector machine – classification and regression
  • KNN – classification and regression
  • Neural networks – classification and regression
• Unsupervised
  • K-means – clustering
  • C-means – clustering
  • HCA – clustering
  • Apriori – association
• Reinforcement
  • Q learning
  • SARSA

Summary

• ML mimics the learning process of humans
• Features to rules
• Three types:
  • Supervised
    • Classification
    • Regression
  • Unsupervised
    • Clustering
    • Association
  • Reinforcement

### Video 2

Linear regression is one of the most basic types of ML algorithms.

Generalisation:

125 mg/dl = diabetes

heavy object = more pain

Generalisation of numbers gives equations, and these equations can be used to predict new data.

Polynomial equations:

• linear (x) – no curves, crosses the x-axis once
• quadratic (x²) – one curve in the graph, crosses the x-axis up to twice
• cubic (x³) – two curves in the graph, crosses the x-axis up to three times
• etc.

The order is:

• the maximum power of x
• the number of curves in the graph
• the maximum number of times the x-axis is crossed

Examples:

scatter plot – if the points lie on a straight line, it is linear; use a first-order equation to represent the scatter plot

scatter plot – if the points lie on a single curve, it is quadratic; use a second-order equation to represent the scatter plot

scatter plot – if the points lie on a line with multiple curves, it is cubic; use a third-order equation to represent the scatter plot
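This "match the order to the shape" idea can be sketched with `np.polyfit` (not used in the course, but handy for checking which order a scatter plot calls for; the data here is an assumed exact quadratic):

```python
import numpy as np

# Fit first- and second-order polynomials to clearly curved data.
x = np.array([0.0, 1.0, 2.0, 3.0, 4.0])
y = x**2 + 1.0  # one curve in the plot, so a second-order fit should be exact

c1 = np.polyfit(x, y, 1)  # first order: cannot match the curve
c2 = np.polyfit(x, y, 2)  # second order: recovers the coefficients [1, 0, 1]

print(np.round(c1, 3))
print(np.round(c2, 3))  # ~ [1. 0. 1.]
```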

Example using house prices:

• Size (sq.ft.) – X – independent variable
• Price ($) – Y – dependent variable

Now plot price against size, make a best-fit line and see the trend; it is linear (i.e. a straight line), therefore we can use linear regression.

Regression is finding a line that fits the points with the least amount of error.

`y=mx+c`

We already know `x` and `y`, so the machine learning will try to find the gradient (`m`) and the intercept (`c`). `c` is the minimum price of the house, excluding the size consideration. It is also where the line intercepts the Y axis.

How does the machine know where the best fit line is? There need to be some sort of feedback:

Subtract original value from the predicted value, which will give the error for a single point. Then add all of the errors to get the total error. However, this could give zero, when adding negative and positive errors (cancellation effect), and zero would not be correct.

Squaring the errors first and then adding them removes the cancellation effect, as all errors become positive (due to the squaring). However, the total just keeps growing as more points are added, so this is still not quite right.

The solution is to take the mean of the sum of the squares. This is the Mean Squared Error (MSE).

The lower the value of the MSE the better.

For each best fit line that the machine finds, it determines the MSE and compares it with other best fit lines, in order to find the “best” best fit line. Now `m` and `c` can be determined.
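The MSE feedback described above can be sketched in a few lines (the sample data is assumed to follow y = 1.8x + 32):

```python
# Compare two candidate lines y = m*x + c by their Mean Squared Error.
xs = [0, 1, 2, 3, 4]
ys = [32, 33.8, 35.6, 37.4, 39.2]  # assumed sample data following y = 1.8x + 32

def mse(m, c):
    errors = [y - (m * x + c) for x, y in zip(xs, ys)]
    return sum(e * e for e in errors) / len(errors)

print(mse(1.8, 32))  # ~0: the true line
print(mse(2.0, 30))  # larger: a worse candidate line
```

The machine's job is to search for the `m` and `c` that minimise this value.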

What if there are more variables? For example, size (`m1`, `x1`), number of bedrooms (`m2`, `x2`) and number of bathrooms (`m3`, `x3`)?

Use multiple linear regression.

`y = c + m1x1 + m2x2 + m3x3`

Difficult to visualise as there are multiple dimensions involved. However, the backend principles are the same as the single linear regression model.

### video 3

New Project: `MachineLearning`

New file: `SingleLinearRegression/LinearRegressionTemperature.py`

x is centigrade and y is Fahrenheit

We already know that the formula is

`F = 1.8*C + 32`

and we want to see how well the machine can find `m` and `c` itself.

```python
# LinearRegressionTemperature.py
import random

import sklearn
import matplotlib.pyplot as plt
import numpy as np

bNoise = True
maxCelsius = 10
# maxCelsius = 38
# maxCelsius = 39  # When you are sick and have a temperature
# maxCelsius = 50
# maxCelsius = 100

# Celsius
x = range(0, maxCelsius)
print(x)  # x is a range object
x = list(x)
print(f'X: {x}')  # x is now a list

if bNoise:
    # Fahrenheit with noise
    y = [1.8 * c + 32 + random.randint(-3, 3) for c in x]
else:
    # Fahrenheit
    y = [1.8 * c + 32 for c in x]

print(f'Y: {y}')

plt.plot(x, y, '-*r')
plt.show()
```

### Video 4

Reshape the data for `sklearn`

```python
x = np.array(x).reshape(-1, 1)
y = np.array(y).reshape(-1, 1)
```

Make a list of lists, basically.
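A quick sketch of what the reshape does: a flat list becomes a column vector (one value per row), which is the 2-D shape `sklearn` expects for a single feature.

```python
import numpy as np

x = [0, 1, 2]
x = np.array(x).reshape(-1, 1)  # -1 means "infer this dimension"
print(x.shape)     # (3, 1)
print(x.tolist())  # [[0], [1], [2]]
```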

Split 20% for testing

`xTrain, xTest, yTrain, yTest = sklearn.model_selection.train_test_split(x, y, test_size=0.2)`

I found that

`# xTrain, xTest, yTrain, yTest = sklearn.model_selection.train_test_split(x, y, test_size=0.2)`

gives an error:

`AttributeError: module 'sklearn' has no attribute 'model_selection'`

The solution is to import directly

`from sklearn.model_selection import train_test_split`

and change line to

`xTrain, xTest, yTrain, yTest = train_test_split(x, y, test_size=0.2)`

or

`from sklearn import model_selection`

and

`xTrain, xTest, yTrain, yTest = model_selection.train_test_split(x, y, test_size=0.2)`

Now the shape of the training data

`print(f'xTrain.shape: {xTrain.shape}')`

shows that we have 8 out of 10 values for training

`xTrain.shape: (8, 1)`

Now the model, first import

`from sklearn import linear_model`

and then

`model = linear_model.LinearRegression()`

and fit the data

`model.fit(xTrain, yTrain)`

and get the accuracy

`accuracy = model.score(xTest, yTest)`

Also `m` and `c`

```python
print(f'coefs: {model.coef_}')
print(f'intercept: {model.intercept_}')
```

You may get multiple coefficients (when using multiple linear regression), but only one intercept. That is why `coef_` prints with double brackets (a list of lists), while `intercept_` prints with single brackets.
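A small sketch (with noise-free Celsius data, as above) showing those shapes:

```python
import numpy as np
from sklearn import linear_model

# With a 2-D y, coef_ has shape (n_targets, n_features) and intercept_ has
# shape (n_targets,), which explains the double vs single brackets when printed.
x = np.arange(10).reshape(-1, 1)
y = (1.8 * np.arange(10) + 32).reshape(-1, 1)

model = linear_model.LinearRegression()
model.fit(x, y)
print(model.coef_.shape)       # (1, 1) -- list of lists
print(model.intercept_.shape)  # (1,)   -- single list
```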

```python
# LinearRegressionTemperature.py
import random

# import sklearn
# from sklearn.model_selection import train_test_split
from sklearn import model_selection
from sklearn import linear_model
import matplotlib.pyplot as plt
import numpy as np

bNoise = False
maxCelsius = 10
# maxCelsius = 38
# maxCelsius = 39  # When you are sick and have a temperature
# maxCelsius = 50
# maxCelsius = 100

# Celsius
x = range(0, maxCelsius)
print(x)  # x is a range object
x = list(x)
print(f'X: {x}')  # x is now a list

if bNoise:
    # Fahrenheit with noise
    y = [1.8 * c + 32 + random.randint(-3, 3) for c in x]
else:
    # Fahrenheit
    y = [1.8 * c + 32 for c in x]

print(f'Y: {y}')

# plt.plot(x, y, '-*r')
# plt.show()

x = np.array(x).reshape(-1, 1)
y = np.array(y).reshape(-1, 1)

print(f'Y reshaped:\n{y}')

# xTrain, xTest, yTrain, yTest = sklearn.model_selection.train_test_split(x, y, test_size=0.2)
xTrain, xTest, yTrain, yTest = model_selection.train_test_split(x, y, test_size=0.2)
# xTrain, xTest, yTrain, yTest = train_test_split(x, y, test_size=0.2)
print(f'xTrain.shape: {xTrain.shape}')

model = linear_model.LinearRegression()
model.fit(xTrain, yTrain)
print(f'coef: {model.coef_}')
print(f'intercept: {model.intercept_}')

accuracy = model.score(xTest, yTest)
print(f'Accuracy: {accuracy}')
print(f'Accuracy: {round(accuracy*100, 2)}')
```

### Video 5

Noise

We need more values. For noise of ±3 we really need around 50 values (at the very least more than ten).

### Multiple Linear regression – Video 6

Rent of houses

• number of rooms
• bathrooms
• parking spaces

Sources of data:

UCI Machine Learning Repository and Kaggle.

#### Required files:

`houses_to_rent.csv`

https://www.kaggle.com/rubenssjr/brasilian-houses-to-rent?select=houses_to_rent.csv

Move the CSV file into a new directory `MultipleLinearRegression`

#### Required packages

• `matplotlib`
• `sklearn`
• `pandas`

New file: `MultipleLinearRegression/HouseRent.py`

Read the data in, using pandas

`data = pd.read_csv('houses_to_rent.csv',sep=',')`

You may need to add “id” to the first column header as it is unnamed

```
<bound method NDFrame.head of      Unnamed: 0  city  area  ...  property tax  fire insurance     total
0             0     1   240  ...       R$1,000           R$121   R$9,121
1             1     0    64  ...         R$122            R$11   R$1,493
2             2     1   443  ...       R$1,417            R$89  R$12,680
3             3     1    73  ...         R$150            R$16   R$2,116
```

You can refer to the columns by index, or specify column names. Note that the names must match the header in the CSV file exactly. The ordering can be different, and you can select just a subset of the columns:

`data = data[['city', 'rooms', 'bathroom', 'parking spaces', 'fire insurance', 'furniture', 'rent amount']]`

compare with

`,city,area,rooms,bathroom,parking spaces,floor,animal,furniture,hoa,rent amount,property tax,fire insurance,total`

Now the number of columns is reduced.

Now, process the data.

Note that `city` has already been converted into a number (0 or 1), but `furniture` has not, so we need to convert it. We also need to remove the `R$` prefix and the commas for the thousands.

For the `R$`, write a lambda that slices off the first two characters and use `pandas.Series.map()`:

`data['rent amount'] = data['rent amount'].map(lambda i: i[2:])`

replace the comma with nothing (an empty string)

`data['rent amount'] = data['rent amount'].map(lambda i: i[2:].replace(',', ''))`

finally convert from string to int

`data['rent amount'] = data['rent amount'].map(lambda i: int(i[2:].replace(',','')))`

Do the same for the fire insurance (but there is no comma to remove)

`data['fire insurance'] = data['fire insurance'].map(lambda i: int(i[2:]))`
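As an aside (not from the video), the same cleaning can be done without a lambda, using pandas' vectorised `.str` methods; a sketch with assumed sample values:

```python
import pandas as pd

# Strip the 'R$' prefix and the thousands commas in one vectorised pass.
rent = pd.Series(['R$1,000', 'R$122', 'R$1,417'])  # assumed sample values
rent = rent.str[2:].str.replace(',', '', regex=False).astype(int)
print(rent.tolist())  # [1000, 122, 1417]
```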

Now for the furniture conversion to a number

`from sklearn import preprocessing`

Then

```python
le = preprocessing.LabelEncoder()
data['furniture'] = le.fit_transform(data['furniture'])
```

now we have a 0 or 1 for furniture

Next video: check for NAN and missing data, then split for training and testing, create model and check the final testing score.

### Multiple Linear regression – Video 7

Check for Not A Number (NAN)

`print('-'*30); print('Checking Data'); print('-'*30)`

Look for null data

`print(data.isnull().sum())`

If there is, then you can drop the entire line from the data set

`data = data.dropna()`

If we only have a few NaNs then this isn't a problem. However, if there are a lot and/or we don't have much data to begin with, then dropping rows can become an issue. Instead of dropping NaNs, you can replace them with the column mean, or with zero.

`data = data.fillna(data.mean())`

or

`data = data.fillna(0)`

Note that when a column contains a NaN, it is stored as `float` rather than `int` (take the `city` column as an example), and it stays `float` even after `fillna()` or `dropna()`. If the column never had a NaN, it is left as `int`.
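A minimal sketch of this dtype behaviour, using a made-up `city`-like column:

```python
import pandas as pd

# A column with a missing value is stored as float (NaN is a float),
# and filling the NaN does not convert it back to int.
city = pd.Series([1, 0, None])
print(city.dtype)            # float64 -- NaN forces float
print(city.fillna(0).dtype)  # float64 -- still float after filling

no_nan = pd.Series([1, 0, 1])
print(no_nan.dtype)          # int64 -- no NaN, stays int
```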

Split the data. But first we need to specify the input (`x`) and the output (`y`). `x` is all of the data, except the rent, and `y` is the rent

```python
x = np.asarray(data.drop(['rent amount'], 1))
y = np.asarray(data['rent amount'])
```

The `, 1` is the axis along which to drop: `axis=1` means drop a column (`axis=0` would drop rows).
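A tiny sketch of the axis argument, with a hypothetical two-column frame:

```python
import pandas as pd

# axis=1 drops a column by name; axis=0 (the default) would drop a row by label.
df = pd.DataFrame({'rooms': [2, 3], 'rent amount': [1000, 1500]})
x = df.drop(['rent amount'], axis=1)
print(list(x.columns))  # ['rooms']
```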

A warning appears:

```
FutureWarning: In a future version of pandas all arguments of DataFrame.drop except for the argument 'labels' will be keyword-only
x = np.asarray(data.drop(['rent amount'], 1))
```

The fix is to pass the axis as a keyword: `data.drop(['rent amount'], axis=1)`.

##### Splitting

Now import

`from sklearn import preprocessing, model_selection`

Split the data for training and testing

`xTrain, xTest, yTrain, yTest = model_selection.train_test_split(x, y, test_size=0.2)`

or

`xTrain, xTest, yTrain, yTest = model_selection.train_test_split(x, y, test_size=0.2, random_state=10)`

##### Training

Now import again

`from sklearn import preprocessing, model_selection, linear_model`

and train

```model = linear_model.LinearRegression()
model.fit(xTrain, yTrain)```

Print the gradients and intercept (`m` and `c`). As we have six inputs, we will have six gradients:

```python
print(f'coefs: {model.coef_}')
print(f'intercept: {model.intercept_}')
```

Using these we can either literally write out the formula, or we can use the model to predict (which is more sensible)

Get the score

`accuracy = model.score(xTest, yTest)`

and print

`print(f'Accuracy: {round(accuracy*100, 3)} %')`

#### Manually test

Time 18:00

Test using

`testValues = model.predict(xTest)`

which is similar to

`accuracy = model.score(xTest, yTest)`

but gives more details

Check the actual value and the predicted value and compare to find the deviation.

```python
error = []
for i, testValue in enumerate(testValues):
    error.append(yTest[i] - testValue)
    print(f'Actual: {yTest[i]} Prediction: {int(testValue)} Error: {int(error[i])} Perc: {round(100*int(error[i])/yTest[i], 1)}%')
```

Full code

```python
# HouseRent.py

import pandas as pd
from sklearn import preprocessing, model_selection, linear_model
import numpy as np

print('-'*30); print('Importing Data'); print('-'*30)

data = pd.read_csv('houses_to_rent.csv', sep=',')
data = data[['city', 'rooms', 'bathroom', 'parking spaces', 'fire insurance', 'furniture', 'rent amount']]

# Process data
# data['rent amount'] = data['rent amount'].map(lambda i: i[2:])
# This gives an error, because of the comma
# data['rent amount'] = data['rent amount'].map(lambda i: int(i[2:]))
# data['rent amount'] = data['rent amount'].map(lambda i: i[2:].replace(',', ''))
data['rent amount'] = data['rent amount'].map(lambda i: int(i[2:].replace(',', '')))

data['fire insurance'] = data['fire insurance'].map(lambda i: int(i[2:]))
le = preprocessing.LabelEncoder()
data['furniture'] = le.fit_transform(data['furniture'])

# Check Data
print('-'*30); print('Checking Null Data'); print('-'*30)
print(data.isnull().sum())
data = data.dropna()
# data = data.fillna(data.mean())
# data = data.fillna(0)
print(data.isnull().sum())

# Split Data
print('-'*30); print('Splitting Data'); print('-'*30)

x = np.asarray(data.drop(['rent amount'], axis=1))
y = np.asarray(data['rent amount'])

print('X', x.shape)
print('Y', y.shape)

# xTrain, xTest, yTrain, yTest = model_selection.train_test_split(x, y, test_size=0.2)
xTrain, xTest, yTrain, yTest = model_selection.train_test_split(x, y, test_size=0.2, random_state=10)

print(f'xTrain: {xTrain.shape}')
print(f'xTest: {xTest.shape}')

# Training
print('-'*30); print('Training'); print('-'*30)

model = linear_model.LinearRegression()
model.fit(xTrain, yTrain)

accuracy = model.score(xTest, yTest)
print(f'coefs: {model.coef_}')
print(f'intercept: {model.intercept_}')
print(f'Accuracy: {round(accuracy*100, 3)} %')

# Manual Evaluation
print('-'*30); print('Manual Evaluation'); print('-'*30)
testValues = model.predict(xTest)
print(f'testValues shape: {testValues.shape}')

error = []
for i, testValue in enumerate(testValues):
    error.append(yTest[i] - testValue)
    print(f'Actual: {yTest[i]} Prediction: {int(testValue)} Error: {int(error[i])} Perc: {round(100*int(error[i])/yTest[i], 1)}%')
```

### Corona Virus Cases Predictor – Video 8

Corona Virus Predictor using Machine Learning with Python (2020)

Difference between polynomial regression and multiple linear regression: a polynomial has only one input, `x`, which can be squared, cubed, etc., whereas MLR has multiple different inputs (`a`, `b`, `c`, etc.). So the difference is that the polynomial's terms are all derived from the one input `x`, whereas MLR has entirely different and independent inputs from the beginning. However, the forms of the equations are the same, with one intercept and multiple gradients.
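The relationship above can be seen directly by printing what `PolynomialFeatures` produces (a minimal sketch with two made-up sample values):

```python
import numpy as np
from sklearn.preprocessing import PolynomialFeatures

# degree=2 derives the columns [1, x, x^2] from the single input x; the
# linear model then treats them exactly like independent MLR columns.
x = np.array([[2], [3]])
polyFeature = PolynomialFeatures(degree=2)
expanded = polyFeature.fit_transform(x)
print(expanded)
# [[1. 2. 4.]
#  [1. 3. 9.]]
```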

In the `MultipleLinearRegression` directory, New file: `CoronaPredictor.py`

The CSV `coronaCases.csv` is formed by taking just the World column from `total_cases.csv`, renaming it `cases`, and adding a simple `id` column, which is actually the number of days since the start (up to 234 days; 22,151,281 cases @ 8/19/2020).

I actually ended up with 233 days and a slightly different value for cases (22,206,623).

```python
# CoronaPredictor.py

import pandas as pd

data = pd.read_csv('coronaCases.csv')
data = data[['id', 'cases']]
```

Time 9:00

Prepare data

```python
# Prepare data
print('-'*30); print('Prepare data'); print('-'*30)
x = np.array(data['id']).reshape(-1, 1)
y = np.array(data['cases']).reshape(-1, 1)
```

Create features

Polynomials have x, x², x³, etc., so we need to make id, id², id³, etc.

`from sklearn.preprocessing import PolynomialFeatures`

and use degree 2 (i.e. the terms 1, x and x²):

```python
# Features
print('-'*30); print('Prepare Features'); print('-'*30)
polyFeature = PolynomialFeatures(degree=2)
x = polyFeature.fit_transform(x)
print(x)
```

If you wanted x cubed as well, then the line would be

`polyFeature = PolynomialFeatures(degree=3)`

Model for training

15:50

In this case we are not going to split the data because we want to use all of the data in order to push the model to train it as well as possible.

```python
# Training
print('-'*30); print('Training'); print('-'*30)
model = linear_model.LinearRegression()
model.fit(x, y)
accuracy = model.score(x, y)
print(f'Accuracy: {round(100*accuracy, 3)} %')
```

##### Prediction: of past dates

```python
y0 = model.predict(x)
plt.plot(y0, '--b')
plt.show()
```

As the match isn't perfect, we can add an extra term (or terms) to the polynomial:

```python
# polyFeature = PolynomialFeatures(degree=3)
# polyFeature = PolynomialFeatures(degree=4)
```

Now the accuracy increased from 99.4 % to 99.9 %.

We can even do a straight line, by making `degree=1`:

`polyFeature = PolynomialFeatures(degree=1)`

Now accuracy is 82%

Where do you stop adding terms? It is possible to overfit the data, and then we won't get good predictions. Also, we are testing with the same data that we used to train.
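A sketch of one way to decide where to stop: hold back a test set and watch the test score as the degree grows (this is my own illustration on an assumed noisy quadratic, not the corona data).

```python
import numpy as np
from sklearn.preprocessing import PolynomialFeatures
from sklearn import linear_model, model_selection

# Noisy quadratic data: the "right" degree is 2.
rng = np.random.default_rng(0)
x = np.arange(50, dtype=float).reshape(-1, 1)
y = 3 * x**2 + 5 * x + rng.normal(0, 50, size=(50, 1))

xTrain, xTest, yTrain, yTest = model_selection.train_test_split(
    x, y, test_size=0.2, random_state=0)

scores = {}
for degree in (1, 2, 6):
    polyFeature = PolynomialFeatures(degree=degree)
    model = linear_model.LinearRegression()
    model.fit(polyFeature.fit_transform(xTrain), yTrain)
    scores[degree] = model.score(polyFeature.transform(xTest), yTest)
    print(degree, round(scores[degree], 4))
```

An under-fitted degree scores poorly on the held-out data; once the test score stops improving (or starts dropping), extra terms are only fitting the noise.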

##### Further prediction: in the future

22:00

Now to ask the model what will happen in 5 days or 10 days

```python
# Prediction
print('-'*30); print('Prediction'); print('-'*30)
days = 2
print(f'Prediction of cases after {days} days ', end='')
print(round(int(model.predict(polyFeature.fit_transform([[233 + days]]))) / 1000000, 2), 'million')
```

The line is unwieldy, but this is its core:

`model.predict(polyFeature.fit_transform([[233+days]]))`

Note the difference in the `model.predict()` arguments. Before, we used just `model.predict(x)` without `polyFeature.fit_transform`, because `x` had already been transformed into polynomial features; for new values we have to transform them first.

Plotting

`x1 = np.array(list(range(1, maxDaysInData+days))).reshape(-1,1)`

Note the reshaping

```python
y1 = model.predict(polyFeature.fit_transform(x1))

plt.plot(y1, '--g')
plt.show()
```

## Full code

```python
# CoronaPredictor.py
import numpy as np
import pandas as pd
import matplotlib.pyplot as plt
from sklearn.preprocessing import PolynomialFeatures
from sklearn import linear_model

data = pd.read_csv('coronaCases.csv')
data = data[['id', 'cases']]

# Prepare data
print('-'*30); print('Prepare data'); print('-'*30)
x = np.array(data['id']).reshape(-1, 1)
y = np.array(data['cases']).reshape(-1, 1)

plt.plot(y, '-*r')
# plt.show()

# Features
print('-'*30); print('Prepare Features'); print('-'*30)
# polyFeature = PolynomialFeatures(degree=1)
polyFeature = PolynomialFeatures(degree=2)
# polyFeature = PolynomialFeatures(degree=3)
# polyFeature = PolynomialFeatures(degree=4)
x = polyFeature.fit_transform(x)
print(x)

# Training
print('-'*30); print('Training'); print('-'*30)
model = linear_model.LinearRegression()
model.fit(x, y)
accuracy = model.score(x, y)
print(f'Accuracy: {round(100*accuracy, 3)} %')
# Testing with the same data
y0 = model.predict(x)
plt.plot(y0, '--b')
# plt.show()

# Prediction
print('-'*30); print('Prediction'); print('-'*30)
# days = 2
days = 50
maxDaysInData = 233
print(f'Prediction of cases after {days} days ', end='')
# print(round(int(model.predict(polyFeature.fit_transform([[233+days]])))/1000000, 2), 'million')
print(round(int(model.predict(polyFeature.fit_transform([[maxDaysInData+days]])))/1000000, 2), 'million')
# Basis
# model.predict(polyFeature.fit_transform([[233+days]]))

x1 = np.array(list(range(1, maxDaysInData+days))).reshape(-1, 1)

y1 = model.predict(polyFeature.fit_transform(x1))

plt.plot(y1, '--g')
plt.show()
```

This is the end my friend