Hands-on Assignment: Implementing Linear Regression with Gradient Descent
April 17, 2018
Overview
In this hands-on assignment, we’ll apply linear regression with gradient descent to predict the progression of diabetes in patients. The tutorial will guide you through the process of implementing linear regression with gradient descent in Python, from the ground up.
We’ll first load the dataset and train a linear regression model using scikit-learn, a Python machine learning library. We will use the results from scikit-learn as a test for whether or not our own implementation is correct. Then, we’ll implement linear regression using gradient descent ourselves. Let’s get started!
Project Template on Google Colaboratory
Work on this project directly in-browser via Google Colaboratory. The link above is a starter template that you can save to your own Google Drive and work on. Google Colab is a free tool that lets you run small Machine Learning projects through your web browser. You should read this 1 min tutorial if you’re unfamiliar with Google Colaboratory.
Dataset
Below is the code for loading and splitting the dataset. A description of the dataset is available at the following link (Diabetes Data). The dataset is already included with scikit-learn (read more here).
###############################################################################
## Import stuff
import numpy as np
from sklearn import datasets, linear_model, metrics
###############################################################################
## Load the diabetes dataset
diabetes = datasets.load_diabetes()
diabetes_X = diabetes.data # matrix of dimensions 442x10
# Split the data into training/testing sets
diabetes_X_train = diabetes_X[:-20]
diabetes_X_test = diabetes_X[-20:]
# Split the targets into training/testing sets
diabetes_y_train = diabetes.target[:-20]
diabetes_y_test = diabetes.target[-20:]
###############################################################################
Data description: The dataset includes information about 442 diabetes patients. For each patient, there are 10 features or input variables: age, sex, body mass index, average blood pressure, and six blood serum measurements. The blood serum measurements are Total Cholesterol (TC), Low-Density Lipoproteins (LDL), High-Density Lipoproteins (HDL), the TC/HDL ratio, LTG (possibly the log of serum triglycerides level), and Glucose. The target is a quantitative measure of disease progression one year after baseline.
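If you’d like to inspect these details yourself, scikit-learn ships a plain-text description alongside the data (the DESCR attribute is standard scikit-learn API, though its exact wording varies between versions):
###############################################################################
## Inspect the bundled dataset description
print(diabetes.DESCR)            # human-readable summary of features and target
print(diabetes.data.shape)       # (442, 10)
print(diabetes.target.shape)     # (442,)
###############################################################################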
How to check for mistakes
As mentioned previously, we’ll use the results from scikit-learn as a test for whether or not our implementation is correct. Since scikit-learn is a machine learning library, linear regression is available as a ready-made model that can be trained by simply calling the fit method on it. You can see the documentation here: sklearn.linear_model.LinearRegression — documentation
###############################################################################
## Scikit learn
# Create linear regression object
regr = linear_model.LinearRegression()
# Train the model using the training sets
regr.fit(diabetes_X_train, diabetes_y_train)
# Make predictions using the testing set
diabetes_y_pred = regr.predict(diabetes_X_test)
# The coefficients
print('Coefficients: \n', regr.coef_)
# The mean squared error
mean_squared_error = metrics.mean_squared_error(diabetes_y_test, diabetes_y_pred)
print("Mean squared error: %.2f" % mean_squared_error)
print("="*120)
###############################################################################
This produces the following output:
Coefficients: 
 [ 3.03499549e-01 -2.37639315e+02  5.10530605e+02  3.27736980e+02
 -8.14131709e+02  4.92814588e+02  1.02848452e+02  1.84606489e+02
  7.43519617e+02  7.60951722e+01]
Mean squared error: 2004.57
We can use the above value of the mean squared error (2004.57) as the target value for our own implementation’s mean squared error. We can also compare the coefficients after training our model to check whether we get the same results. Note that the numbers might not match exactly, but as long as they match reasonably well (say, within 1%), we should be fine.
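If you’d rather automate this comparison than eyeball it, here is a minimal sketch. It assumes our implementation stores its trained weights in a NumPy array W and its test-set error in mean_squared_error, as in the template below:
###############################################################################
## Sanity check (a sketch): compare our results against scikit-learn's
sklearn_mse = metrics.mean_squared_error(diabetes_y_test, diabetes_y_pred)
# rtol=0.01 accepts a relative difference of up to 1%
print("MSE within 1%%: %s" % np.isclose(mean_squared_error, sklearn_mse, rtol=0.01))
print("Coefficients within 1%%: %s" % np.allclose(regr.coef_, W, rtol=0.01))
###############################################################################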
Implementing Linear Regression with Gradient Descent
Finally, it’s time to implement linear regression ourselves. Here’s a template for you to get started.
###############################################################################
## Our own implementation
# train
X = diabetes_X_train
y = diabetes_y_train
# train: initialize
W = ...
b = ...
learning_rate = ...
epochs = ...
# train: gradient descent
for i in range(epochs):
    # calculate predictions
    ...
    # calculate error and cost (mean squared error)
    ...
    # calculate gradients
    ...
    # update parameters
    ...
    # diagnostic output
    if i % 5000 == 0: print("Epoch %d: %f" % (i, mean_squared_error))
# test
X = diabetes_X_test
y = diabetes_y_test
# calculate predictions + calculate error and cost (same code as above)
...
print('Coefficients: \n', W)
print("Mean squared error: %.2f" % mean_squared_error)
print("="*120)
###############################################################################
Here are some hints and guidance to help you along the way.
- W is of dimension 10 (since we have 10 features), and b is a scalar.
- Use the formulas in the Linear Regression tutorial for predictions, errors and costs.
- Use the formulas in the Gradient Descent tutorial for the gradients and the update rule.
- For gradient descent, the gradient $G = \nabla_W J(W)$ of the mean squared error cost $J(W, b) = \frac{1}{n}\sum_{i=1}^{n}(\hat{y}_i - y_i)^2$ is given by $\nabla_W J = \frac{2}{n} X^T (\hat{y} - y)$ for the weights and $\frac{\partial J}{\partial b} = \frac{2}{n}\sum_{i=1}^{n}(\hat{y}_i - y_i)$ for the bias, where $\hat{y} = XW + b$ and $n$ is the number of training examples (see the sketch below).
- Keep experimenting with different values of the learning rate and the number of epochs until the training mean squared error is barely decreasing (say, by less than 0.0001 per epoch); at that point training has effectively converged.
It’s okay if you get stuck. The assignment has a number of moving pieces, and it’s okay to peek at the solution (included) after struggling for a bit. Just make sure you don’t look at the entire solution in one go. Only check the parts that you could not solve, and then go back to trying to work it out on your own.
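If you’ve struggled for a while and want a concrete reference for the vectorized formulas before opening the full notebook, here is a minimal sketch of one possible training loop. It is not the official solution: the learning rate and epoch count below are illustrative starting points you will still need to tune, and the variable names simply follow the template above.
###############################################################################
## One possible gradient descent loop (a sketch, not the official solution)
# X, y are the training set, as set up in the template above
n = X.shape[0]                            # number of training examples
W = np.zeros(X.shape[1])                  # one weight per feature
b = 0.0
learning_rate = 0.5                       # illustrative; tune this yourself
epochs = 100000                           # illustrative; tune this yourself
for i in range(epochs):
    y_pred = X.dot(W) + b                 # predictions: y_hat = XW + b
    error = y_pred - y                    # per-example errors
    mean_squared_error = np.mean(error ** 2)
    grad_W = (2.0 / n) * X.T.dot(error)   # gradient of the cost w.r.t. W
    grad_b = (2.0 / n) * np.sum(error)    # gradient of the cost w.r.t. b
    W = W - learning_rate * grad_W        # update rule
    b = b - learning_rate * grad_b
    if i % 5000 == 0: print("Epoch %d: %f" % (i, mean_squared_error))
###############################################################################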
Solution on Google Colaboratory
The complete notebook with all the cells executed is available via Google Colaboratory using the link above. Google Colab is a free tool that lets you run small Machine Learning experiments through your browser. You should read this 1 min tutorial if you’re unfamiliar with Google Colaboratory.
The solution is also available here: Solution to Hands-on Assignment: Implementing Linear Regression with Gradient Descent. The solution uses the numpy package, which provides functions for vector and matrix operations such as dot products. You can learn more about numpy in this tutorial: Introduction to NumPy.
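As a quick illustration of the kind of numpy operations involved, here is a toy example (the numbers are made up purely for demonstration):
###############################################################################
## Toy example of the numpy operations used in the solution
A = np.array([[1.0, 2.0], [3.0, 4.0]])   # a 2x2 matrix
v = np.array([10.0, 20.0])               # a vector of length 2
print(A.dot(v))          # matrix-vector product -> [ 50. 110.]
print(A.T.dot(v))        # transpose, then product -> [ 70. 100.]
print(np.mean(v ** 2))   # mean of squares -> 250.0
###############################################################################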
The output produced by the solution is as follows:
Epoch 0: 29468.227258
Epoch 5000: 3048.187264
Epoch 10000: 2941.413239
Epoch 15000: 2927.457390
Epoch 20000: 2924.752271
Epoch 25000: 2923.794789
Epoch 30000: 2923.195048
Epoch 35000: 2922.693787
Epoch 40000: 2922.230650
Epoch 45000: 2921.788828
Epoch 50000: 2921.362712
Epoch 55000: 2920.949917
Epoch 60000: 2920.549099
Epoch 65000: 2920.159313
Epoch 70000: 2919.779804
Epoch 75000: 2919.409942
Epoch 80000: 2919.049181
Epoch 85000: 2918.697046
Epoch 90000: 2918.353121
Epoch 95000: 2918.017037
Coefficients: 
 [   3.66165332 -234.66440146  519.39514482  325.58168427 -176.11598458
  -16.43251228 -180.05636948  108.06092022  502.78742918   78.96999166]
Mean squared error: 1993.53
Discussion
Notice that the mean squared error matches to within about 0.5%. Out of the 10 coefficients, coefficients 2, 3, 4 and 10 match, but the others don’t. This is because several of the features are not independent of one another, so there are multiple coefficient combinations that produce nearly the same predictions. For example, feature 8 (TC/HDL) is derived directly from feature 5 (TC) and feature 7 (HDL): feature 8 = feature 5 / feature 7. When features carry overlapping information like this, the model can shift weight between them with little effect on its predictions.
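If you want to see this dependence in the data, you can inspect pairwise correlations between the feature columns (a quick check; note that numpy indices are 0-based, so columns 4, 6 and 7 correspond to features 5, 7 and 8):
###############################################################################
## Inspect correlations between the serum-related features
corr = np.corrcoef(diabetes_X, rowvar=False)   # 10x10 correlation matrix
print(corr[4, 7])   # feature 5 (TC) vs feature 8 (TC/HDL)
print(corr[6, 7])   # feature 7 (HDL) vs feature 8 (TC/HDL)
###############################################################################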