5.6: Using Excel and R for a Regression Analysis

Last updated
Save as PDF

Page ID: 5730

$ \newcommand{\vecs}[1]{\overset { \scriptstyle \rightharpoonup} {\mathbf{#1}} } $ $ \newcommand{\vecd}[1]{\overset{-\!-\!\rightharpoonup}{\vphantom{a}\smash {#1}}} $$\newcommand{\id}{\mathrm{id}}$ $ \newcommand{\Span}{\mathrm{span}}$ $ \newcommand{\kernel}{\mathrm{null}\,}$ $ \newcommand{\range}{\mathrm{range}\,}$ $ \newcommand{\RealPart}{\mathrm{Re}}$ $ \newcommand{\ImaginaryPart}{\mathrm{Im}}$ $ \newcommand{\Argument}{\mathrm{Arg}}$ $ \newcommand{\norm}[1]{\| #1 \|}$ $ \newcommand{\inner}[2]{\langle #1, #2 \rangle}$ $ \newcommand{\Span}{\mathrm{span}}$ $\newcommand{\id}{\mathrm{id}}$ $ \newcommand{\Span}{\mathrm{span}}$ $ \newcommand{\kernel}{\mathrm{null}\,}$ $ \newcommand{\range}{\mathrm{range}\,}$ $ \newcommand{\RealPart}{\mathrm{Re}}$ $ \newcommand{\ImaginaryPart}{\mathrm{Im}}$ $ \newcommand{\Argument}{\mathrm{Arg}}$ $ \newcommand{\norm}[1]{\| #1 \|}$ $ \newcommand{\inner}[2]{\langle #1, #2 \rangle}$ $ \newcommand{\Span}{\mathrm{span}}$$\newcommand{\AA}{\unicode[.8,0]{x212B}}$

Although the calculations in this chapter are relatively straightforward—consisting, as they do, mostly of summations—it can be quite tedious to work through problems using nothing more than a calculator. Both Excel and R include functions for completing a linear regression analysis and for visually evaluating the resulting model.

5.6.1 Excel

Let’s use Excel to fit the following straight-line model to the data in Example 5.9.

\[y = β_0 + β_1x\]

Enter the data into a spreadsheet, as shown in Figure 5.15. Depending upon your needs, there are many ways that you can use Excel to complete a linear regression analysis. We will consider three approaches here.

Fig 5.15.png

Figure 5.15 Portion of a spreadsheet containing data from Example 5.9 (Cstd = C_std; Sstd = S_std).

Use Excel’s Built-In Functions

If all you need are values for the slope, β₁, and the y-intercept, β₀, you can use the following functions:

=intercept(known_y’s, known_x’s)
=slope(known_y’s, known_x’s)

where known_y’s is the range of cells containing the signals (y), and known_x’s is the range of cells containing the concentrations (x). For example, clicking on an empty cell and entering

=slope(B2:B7, A2:A7)

returns Excel’s exact calculation for the slope (120.705 714 3).

Use Excel’s Data Analysis Tools

To obtain the slope and the y-intercept, along with additional statistical details, you can use the data analysis tools in the Analysis ToolPak. The ToolPak is not a standard part of Excel’s instillation. To see if you have access to the Analysis ToolPak on your computer, select Tools from the menu bar and look for the Data Analysis... option. If you do not see Data Analysis..., select Add-ins... from the Tools menu. Check the box for the Analysis ToolPak and click on OK to install them.

Note

Once you install the Analysis ToolPak, it will continue to load each time you launch Excel.

Select Data Analysis... from the Tools menu, which opens the Data Analysis window. Scroll through the window, select Regression from the available options, and press OK. Place the cursor in the box for Input Y range and then click and drag over cells B1:B7. Place the cursor in the box for Input X range and click and drag over cells A1:A7. Because cells A1 and B1 contain labels, check the box for Labels. Select the radio button for Output range and click on any empty cell; this is where Excel will place the results. Clicking OK generates the information shown in Figure 5.16.

Note

Including labels is a good idea. Excel’s summary output uses the x-axis label to identify the slope.

Figure 5.16 Output from Excel’s Regression command in the Analysis ToolPak. See the text for a discussion of how to interpret the information in these tables.

There are three parts to Excel’s summary of a regression analysis. At the top of Figure 5.16 is a table of Regression Statistics. The standard error is the standard deviation about the regression, s_r. Also of interest is the value for Multiple R, which is the model’s correlation coefficient, r, a term with which you may already by familiar. The correlation coefficient is a measure of the extent to which the regression model explains the variation in y. Values of r range from –1 to +1. The closer the correlation coefficient is to ±1, the better the model is at explaining the data. A correlation coefficient of 0 means that there is no relationship between x and y. In developing the calculations for linear regression, we did not consider the correlation coefficient. There is a reason for this. For most straight-line calibration curves the correlation coefficient will be very close to +1, typically 0.99 or better. There is a tendency, however, to put too much faith in the correlation coefficient’s significance, and to assume that an r greater than 0.99 means the linear regression model is appropriate. Figure 5.17 provides a counterexample. Although the regression line has a correlation coefficient of 0.993, the data clearly shows evidence of being curvilinear. The take-home lesson here is: don’t fall in love with the correlation coefficient!

Figure 5.17 Example of fitting a straight-line to curvilinear data.

The second table in Figure 5.16 is entitled ANOVA, which stands for analysis of variance. We will take a closer look at ANOVA in Chapter 14. For now, it is sufficient to understand that this part of Excel’s summary provides information on whether the linear regression model explains a significant portion of the variation in the values of y. The value for F is the result of an F-test of the following null and alternative hypotheses.

H₀: regression model does not explain the variation in y
H_A: regression model does explain the variation in y

Note

See Section 4.6.2 and Section 4.6.3 for a review of the F-test.

The value in the column for Significance F is the probability for retaining the null hypothesis. In this example, the probability is 2.5×10^–6%, suggesting that there is strong evidence for accepting the regression model. As is the case with the correlation coefficient, a small value for the probability is a likely outcome for any calibration curve, even when the model is inappropriate. The probability for retaining the null hypothesis for the data in Figure 5.17, for example, is 9.0×10^–7%.

The third table in Figure 5.16 provides a summary of the model itself. The values for the model’s coefficients—the slope, β₁, and the y-intercept, β₀—are identified as intercept and with your label for the x-axis data, which in this example is Cstd. The standard deviations for the coefficients, s_b0 and s_b1, are in the column labeled Standard error. The column t Stat and the column P-value are for the following t-tests.

slope H₀: β₁ = 0, H_A: β₁ ≠ 0
y-intercept H₀: β₀ = 0, H_A: β₀ ≠ 0

Note

See Section 4.6.1 for a review of the t-test.

The results of these t-tests provide convincing evidence that the slope is not zero, but no evidence that the y-intercept significantly differs from zero. Also shown are the 95% confidence intervals for the slope and the y-intercept (lower 95% and upper 95%).

Program the Formulas Yourself

A third approach to completing a regression analysis is to program a spreadsheet using Excel’s built-in formula for a summation

=sum(first cell:last cell)

and its ability to parse mathematical equations. The resulting spreadsheet is shown in Figure 5.18.

Figure 5.18 Spreadsheet showing the formulas for calculating the slope and the y-intercept for the data in Example 5.9. The cells with the shading contain formulas that you must enter. Enter the formulas in cells C3 to C7, and cells D3 to D7. Next, enter the formulas for cells A9 to D9. Finally, enter the formulas in cells F2 and F3. When you enter a formula, Excel replaces it with the resulting calculation. The values in these cells should agree with the results in Example 5.9. You can simplify the entering of formulas by copying and pasting. For example, enter the formula in cell C2. Select Edit: Copy, click and drag your cursor over cells C3 to C7, and select Edit: Paste. Excel automatically updates the cell referencing.

Using Excel to Visualize the Regression Model

You can use Excel to examine your data and the regression line. Begin by plotting the data. Organize your data in two columns, placing the x values in the left-most column. Click and drag over the data and select Insert: Chart... from the main menu. This launches Excel’s Chart Wizard. Select xy-chart, choosing the option without lines connecting the points. Click on Next and work your way through the screens, tailoring the plot to meet your needs. To add a regression line to the chart, click on the chart and select Chart: Add Trendline... from the main men. Pick the straight-line model and click OK to add the line to your chart. By default, Excel displays the regression line from your first point to your last point. Figure 5.19 shows the result for the data in Figure 5.15.

Figure 5.19 Example of an Excel scatterplot showing the data and a regression line.

Note

Excel’s default options for xy-charts do not make for particularly attractive scientific figures. For example, Excel automatically adds grid lines parallel to the x-axis, which is a common practice in business charts. You can deselect them using the Grid lines tab in the Chart Wizard. Excel also defaults to a gray background. To remove this, just double-click on the chart’s background and select none in the resulting pop-up window.

Excel also will create a plot of the regression model’s residual errors. To create the plot, build the regression model using the Analysis ToolPak, as described earlier. Clicking on the option for Residual plots creates the plot shown in Figure 5.20.

Figure 5.20 Example of Excel’s plot of a regression model’s residual errors.

Limitations to Using Excel for a Regression Analysis

Excel’s biggest limitation for a regression analysis is that it does not provide a function for calculating the uncertainty when predicting values of x. In terms of this chapter, Excel can not calculate the uncertainty for the analyte’s concentration, C_A, given the signal for a sample, S_samp. Another limitation is that Excel does not include a built-in function for a weighted linear regression. You can, however, program a spreadsheet to handle these calculations.

Practice Exercise 5.6

Use Excel to complete the regression analysis in Practice Exercise 5.4.

Click here to review your answer to this exercise.

5.6.2 R

Let’s use Excel to fit the following straight-line model to the data in Example 5.9.

\[y = β_0 + β_1x\]

Entering Data and Creating the Regression Model

To begin, create objects containing the concentration of the standards and their corresponding signals.

> conc = c(0, 0.1, 0.2, 0.3, 0.4, 0.5)
> signal = c(0, 12.36, 24.83, 35.91, 48.79, 60.42)

The command for creating a straight-line linear regression model is

\[\ce{lm}(y \sim x)\]

where y and x are the objects containing our data. To access the results of the regression analysis, we assign them to an object using the following command

> model = lm(signal ~ conc)

where model is the name we assign to the object.

Note

You can choose any name for the object containing the results of the regression analysis.

As you might guess, lm is short for linear model.

Evaluating the Linear Regression Model

To evaluate the results of a linear regression we need to examine the data and the regression line, and to review a statistical summary of the model. To examine our data and the regression line, we use the plot command, which takes the following general form

plot(x, y, optional arguments to control style)

where x and y are objects containing our data, and the abline command

abline(object, optional arguments to control style)

where object is the object containing the results of the linear regression. Entering the commands

> plot(conc, signal, pch = 19, col = “blue”, cex = 2)
> abline(model, col = “red”)

creates the plot shown in Figure 5.21.

Note

The name abline comes from the following common form for writing the equation of a straight-line.

\[y = \ce{a} + \ce{b}x\]

Figure 5.21 Example of a regression plot in R showing the data and the regression line. You can customize your plot by adjusting the plot command’s optional arguments. The argument pch controls the symbol used for plotting points, the argument col allows you to select a color for the points or the line, and the argument cex sets the size for the points. You can use the command help(plot) to learn more about the options for plotting data in R.

To review a statistical summary of the regression model, we use the summary command.

> summary(model)

The resulting output, shown in Figure 5.22, contains three sections.

Figure 5.22 The summary of R’s regression analysis. See the text for a discussion of how to interpret the information in the output’s three sections.

The first section of R’s summary of the regression model lists the residual errors. To examine a plot of the residual errors, use the command

> plot(model, which=1)

which produces the result shown in Figure 5.23. Note that R plots the residuals against the predicted (fitted) values of y instead of against the known values of x. The choice of how to plot the residuals is not critical, as you can see by comparing Figure 5.23 to Figure 5.20. The line in Figure 5.23 is a smoothed fit of the residuals.

Note

The reason for including the argument which=1 is not immediately obvious. When you use R’s plot command on an object created by the lm command, the default is to create four charts summarizing the model’s suitability. The first of these charts is the residual plot; thus, which=1 limits the output to this plot.

Figure 5.23 Example showing R’s plot of a regression model’s residual error.

The second section of Figure 5.22 provides the model’s coefficients—the slope, β₁, and the y-intercept, β₀—along with their respective standard deviations (Std. Error). The column t value and the column Pr(>|t|) are for the following t-tests.

slope H₀: β₁ = 0, H_A: β₁ ≠ 0
y-intercept H₀: β₀ = 0, H_A: β₀ ≠ 0

Note

See Section 4.6.1 for a review of the t-test.

The results of these t-tests provide convincing evidence that the slope is not zero, but no evidence that the y-intercept significantly differs from zero.

The last section of the regression summary provides the standard deviation about the regression (residual standard error), the square of the correlation coefficient (multiple R-squared), and the result of an F-test on the model’s ability to explain the variation in the y values. For a discussion of the correlation coefficient and the F-test of a regression model, as well as their limitations, refer to the section on using Excel’s data analysis tools.

Note

See Section 4.6.2 and Section 4.6.3 for a review of the F-test.

Predicting the Uncertainty in C_A Given S_samp

Unlike Excel, R includes a command for predicting the uncertainty in an analyte’s concentration, C_A, given the signal for a sample, S_samp. This command is not part of R’s standard installation. To use the command you need to install the “chemCal” package by entering the following command (note: you will need an internet connection to download the package).

> install.packages(“chemCal”)

After installing the package, you will need to load the functions into R using the following command. (note: you will need to do this step each time you begin a new R session as the package does not automatically load when you start R).

> library(“chemCal”)

Note

You need to install a package once, but you need to load the package each time you plan to use it. There are ways to configure R so that it automatically loads certain packages; see An Introduction to R for more information (click here to view a PDF version of this document).

The command for predicting the uncertainty in C_A is inverse.predict, which takes the following form for an unweighted linear regression

inverse.predict(object, newdata, alpha = value)

where object is the object containing the regression model’s results, newdata is an object containing values for S_samp, and value is the numerical value for the significance level. Let’s use this command to complete Example 5.11. First, we create an object containing the values of S_samp

> sample = c(29.32, 29.16, 29.51)

and then we complete the computation using the following command

> inverse.predict(model, sample, alpha = 0.05)

producing the result shown in Figure 5.24. The analyte’s concentration, C_A, is given by the value $Prediction, and its standard deviation, s_C_A, is shown as $`Standard Error`. The value for $Confidence is the confidence interval, ±ts_C_A, for the analyte’s concentration, and $`Confidence Limits` provides the lower limit and upper limit for the confidence interval for C_A.

Figure 5.24 Output from R’s command for predicting the analyte’s concentration, C_A, from the sample’s signal, S_samp.

Using R for a Weighted Linear Regression

R’s command for an unweighted linear regression also allows for a weighted linear regression by including an additional argument, weights, whose value is an object containing the weights.

\[\ce{lm}(y \sim x,\: weights = object)\]

Let’s use this command to complete Example 5.12. First, we need to create an object containing the weights, which in R are the reciprocals of the standard deviations in y, (s_yi)^–2. Using the data from Example 5.12, we enter

> syi=c(0.02, 0.02, 0.07, 0.13, 0.22, 0.33)
> w=1/syi^2

to create the object containing the weights. The commands

> modelw = lm(signal ~ conc, weights = w)
> summary(modelw)

generate the output shown in Figure 5.25. Any difference between the results shown here and the results shown in Example 5.12 are the result of round-off errors in our earlier calculations.

Note

You may have noticed that this way of defining weights is different than that shown in equation 5.28. In deriving equations for a weighted linear regression, you can choose to normalize the sum of the weights to equal the number of points, or you can choose not to—the algorithm in R does not normalize the weights.

Figure 5.25: The summary of R’s regression analysis for a weighted linear regression. The types of information shown here is identical to that for the unweighted linear regression in Figure 5.22.

Exercise 5.7

Use Excel to complete the regression analysis in Practice Exercise 5.4.

Click here to review your answer to this exercise.

Contributors

David Harvey (DePauw University)

5.6.1 Excel

Use Excel’s Built-In Functions

Use Excel’s Data Analysis Tools

Program the Formulas Yourself

Using Excel to Visualize the Regression Model

Limitations to Using Excel for a Regression Analysis

5.6.2 R

Entering Data and Creating the Regression Model

Evaluating the Linear Regression Model

Predicting the Uncertainty in CA Given Ssamp

Using R for a Weighted Linear Regression

Contributors

Predicting the Uncertainty in C_A Given S_samp