STAT 39000: Project 8 — Spring 2022
Motivation: Machine learning and AI are huge buzzwords in industry, and two of the most popular tools surrounding said topics are the pytorch and tensorflow libraries — JAX is another tool by Google growing in popularity. These tools are libraries used to build and use complex models. If available, they can take advantage of GPUs to speed up parallelizable code by a hundred or even thousand fold.
Context: This is the first in a series of 4-5 projects focused on pytorch (and perhaps JAX). The purpose of these projects is to give you exposure to these tools, some basic functionality, and to show why they are useful, without needing any special math or statistics background.
Scope: Python, pytorch
Dataset(s)
The following questions will use the following dataset(s):
- 
/depot/datamine/data/sim/train.csv
- 
/depot/datamine/data/sim/test.csv
Questions
Question 1
While data frames are a great way to work with data, they are not the only way. Many high performance parts of code are written using a library like numpy or pytorch. These libraries are optimized to be extremely efficient with computation. Multiplying, transposing, inversing matrices can take time, and these libraries can make you code run blazingly fast.
This project is the first project in a series of projects focused on the pytorch library. It is difficult to understand why a library like pytorch is useful without introducing some math. This series of projects will involve some math, however, only at a very high level. Some intuition will be presented as notes, but what is really needed is the ability to read some formulas, and perform the appropriate computations. Throughout this series of projects, we will do our best to ensure that math or statistics is not at all a barrier to completing these projects and getting familiar with pytorch. If it does end up an issue, please post in Piazza and we will do our best to address any issues as soon as possible.
This first project will start slowly, and only focus on the numpy -like functionality of pytorch. We’ve provided you with a set of 100 observations. 75 of the observations are in the train.csv file, 25 are in the test.csv file. We will build a regression model using the data in the train.csv file. In addition, we will calculate some other statistics. Finally, we will (optionally) test our model out on new data in the test.csv dataset.
Start by reading the train.csv file into a pytorch tensor.
|  | 
| A tensor is just a n-dimensional array. | 
Use matplotlib or plotly to plot the data on a scatterplot — x_train on the x-axis, and y_train on the y-axis. After talking to your colleague, you agreed that the data is clearly following a 2nd order polynomial. Something like:
$y = \beta_0 + \beta_1x + \beta_2x^2$
Our goal will be to estimate the values of $\beta_0$, $\beta_1$, and $\beta_2$ using the data in x_train and y_train. Then, we will have a model that could look something like:
$y = 1.2 + .4x + 2.2x^2$
Then, for any given value of x, we can use our model to predict the value of y.
- 
Code used to solve this problem. 
- 
Output from running the code. 
Question 2
In order to build our model, we need to estimate our parameters: $\beta_0$, $\beta_1$, and $\beta_2$. Luckily, linear regression has a closed form solution, so we can calculate these values directly with the following equation.
$\hat{\beta} = (X^{T} X)^{-1} X^{T} y$
What do these symbols mean? X is a matrix (or tensor), where each column is a term in the polynomial, and each row is an observation. So, for our polynomial, if our X data was simply: 1, 2, 3, 4, the X matrix (or design matrix) would be the following:
1, 1, 1 1, 2, 4 1, 3, 9 1, 4, 16
Here, the first column is the constant term, the second column is the term of x, the third column is the term of $x^2$, and so on.
When we raise the matrix to the "T" this means to transpose the matrix. The transpose of X, for example, would look like:
1, 1, 1, 1 1, 2, 3, 4 1, 3, 9, 16
When we raise the matrix to the "-1" this means to invert the matrix.
Finally, placing these matrices next to each other means we need to perform matrix multiplication.
pytorch has built in functions to do all of these operations: torch.mm, mat.T, and torch.inverse.
Lastly, y is the tensor containing the observations in y_train.
| Tensors must be the correct dimensions before they can be multiplied together using  When doing matrix multiplication, it is important that the tensors are aligned properly. A 4x1 matrix would be a matrix that has 4 rows and 1 column (the first number always represents the number of rows, the second always represents the number of columns). In order to multiply 2 matrices together, the number of columns in the first matrix must equal the number of rows in the second matrix. The resulting matrix would then have the number of rows as the first matrix, and the number of columns of the second matrix. So, if we multiplied a 4x3 matrix with a 3x5 matrix, the result would be a 4x5 matrix. These rules are important, because the tensors must be the correct shape (correct number of rows and columns) before we perform matrix multiplication, otherwise we will get an error. The  | 
Start by creating a new tensor called x_mat that is 75 rows and 3 columns. The first column should be filled with 1’s (using torch.ones(x_train.shape[0]).reshape(75,1)), the second column should be the values in x_train, the third column should be the values in x_train squared. Use torch.cat to combine the 75x1 tensors into a single 75x3 tensor (x_mat).
| Make sure you reshape all of your tensors to be 75x1 before you use  | 
| Operations like addition and subtraction are vectorized. For example, the following would result in a 75x1 tensor of 2’s. The following would result in a 1x75 tensor of .5’s.  | 
| Remember, in Python, you can use: to raise a number to a power. For example $2^3$ would be  | 
| To get the transpose of a tensor 2 dimension tensor in  | 
Calculate our estimates for $\beta_0$, $\beta_1$, and $\beta_2$, and save the values in a tensor called betas. The following should be the successful result.
tensor([[ 4.3677],
        [-1.7885],
        [ 0.4840]], dtype=torch.float64)
Now that you know the values for $\beta_0$, $\beta_1$, and $\beta_2$, what is our model (as an equation)? It should be:
$y = 4.3677-1.7885x+.4840x^2$
- 
Code used to solve this problem. 
- 
Output from running the code. 
Question 3
That is pretty cool, and very fast. Now, for any given value of x, we can predict a value of y. Of course, we could write a predict function that accepts a value x, and returns our prediction y, and apply that function to each of the x values in our x_train tensor, however, this can be accomplished even faster and more flexibly using matrix multiplication — simply use the following formula:
$\hat{y} = X\hat{\beta}$
Where X is the x_mat tensor from earlier, and $\hat{\beta}$ is the betas tensor from question (2). Use torch.mm to multiply the two matrices together. Save the resulting tensor to a variable called y_predictions. Finally, create side by side scatterplots. In the first scatterplot, put the values in x_train on the x-axis and the values of y_train on the y-axis. In the second scatterplot put the values of x_train on the x-axis, and your predictions (y_predictions) on the y-axis.
Very cool! Your model should be killing it (after all, we generated this data to follow a known distribution).
- 
Code used to solve this problem. 
- 
Output from running the code. 
Question 4
To better understand our model, let us create one of the most common forms of uncertainty quantification, confidence intervals. Confidence intervals (95% confidence intervals) show you the range of values (for each x) where we are 95% confident that the average value y for a given x is within the range.
The formula is the following:
$\hat{y_h} \pm t_{(\alpha/2, n-p)} * \sqrt{MSE * diag(x_h(X^{T} X)^{-1} x_h^{T})}$
$MSE = \frac{1}{n-p}\sum_{i=1}^{n}(Y_i - \hat{Y_i})^2$
Since we are calculating the 95% confidence interval for the values of x in our x_train tensor, we can simplify this to:
$\hat{Y} \pm 1.993464 * \sqrt{MSE * diag(X(X^{T} X)^{-1} X^{T})}$
$\frac{1}{72}\sum_{i=1}^{n}(Y_i - \hat{Y_i})^2$
Where:
- 
$\hat{Y}$ is our y_predictionstensor from question (3).
- 
$Y_i$ is the value of y for the ith value of y_train.
- 
$\hat{Y_i}$ is the value of y for the ith value of y_predictions.You could simply sum the results of subtracting the y_predictionstensor from they_traintensor, squared. You don’t need any loop.
- 
p is the number of parameters in our model (3, the constant, the x, and the $x^2$). 
- 
n is the number of observations in our data set (75). 
| The "diag" part of the formula indicates that we want the diagonal of the resulting matrix. The diagonal of a given nxn matrix is the value at location (1,1), (2,2), (3,3), …, (n,n). So, for instance, the diagonal of the following matrix is: 1, 5, 9 matrix 1,2,3 4,5,6 7,8,9 In   | 
| You can use  | 
| The value for MSE should be 135.5434. The first 5 values of the  upper tensor([[171.3263],
        [ 91.9131],
        [ 83.3474],
        [ 63.8171],
        [ 63.0524]], dtype=torch.float64)The first 5 values of the  lower tensor([[140.6660],
        [ 76.2350],
        [ 69.1461],
        [ 52.7601],
        [ 52.1101]], dtype=torch.float64) | 
- 
Code used to solve this problem. 
- 
Output from running the code. 
Question 5
Create a scatterplot of x_train on the x-axis, and y_predictions on the y-axis. Add the confidence intervals to the plot.
Great! It is unsurprising that our model is a great fit.
| In this project, we explored a well known model using simulated data from a known distribution. It is pretty boring, but boring can also make things a bit easier to understand. To give a bit of perspective, this project focused on tensor operations so you could get used to  Lastly, in more complex situations, you may not have formulas to calculate confidence intervals and other uncertaintly quantification measures. We will use SLURM in combination with  | 
- 
Code used to solve this problem. 
- 
Output from running the code. 
| Please make sure to double check that your submission is complete, and contains all of your code and output before submitting. If you are on a spotty internet connect ion, it is recommended to download your submission after submitting it to make sure what you think you submitted, was what you actually submitted. In addition, please review our submission guidelines before submitting your project. |