Regression Analysis and Data
Correlation and regression are techniques which are used to see whether a relationship exists between two or more different sets of data.

Learning Objectives:
To identify, by diagram, whether a possible relationship exists between two variables;
To quantify the strength of association between variables using the correlation coefficient;
To show how a relationship can be expressed as an equation;
To identify linear equations when written and when graphed;
To examine regression, a widely used linear model, and to consider its uses and limitations.

Scatter Diagrams
Interpretation of r

r = +1: Perfect positive correlation exists between the data. If x is known, y can be predicted exactly.
+0.8 ≤ r < +1: Strong positive correlation exists between the data. As x increases, y increases.
+0.4 ≤ r < +0.8: Moderate positive correlation exists between the data. As x increases, y increases.
-0.4 < r < +0.4: Very little correlation exists between the data.
-0.8 < r ≤ -0.4: Moderate negative correlation exists between the data. As x increases, y decreases.
-1 < r ≤ -0.8: Strong negative correlation exists between the data. As x increases, y decreases.
r = -1: Perfect negative correlation exists between the data.
If x is known, y can be predicted exactly.

Regression

Regression is a technique which builds a straight-line relationship between two sets of data. This relationship is of the form y = a + bx, where a and b are found by the following formulae:

$$b = \frac{n\sum xy - \sum x \sum y}{n\sum x^2 - (\sum x)^2}$$
EXCEL: =SLOPE(Y DATA, X DATA)

$$a = \frac{\sum y}{n} - b\,\frac{\sum x}{n}$$
EXCEL: =INTERCEPT(Y DATA, X DATA)

Example 4.5: Calculation of a and b

To calculate a and b, use the summary values from the correlation calculation, i.e.
$\sum y = 255$, $\sum x = 80$, $\sum x^2 = 756$, $\sum y^2 = 7097$, $\sum xy = 2289$, $n = 10$

SLOPE:
$$b = \frac{n\sum xy - \sum x \sum y}{n\sum x^2 - (\sum x)^2} = \frac{(10 \times 2289) - (80 \times 255)}{(10 \times 756) - (80)^2} = \frac{22890 - 20400}{7560 - 6400} = \frac{2490}{1160}$$
so b = 2.1465517

INTERCEPT:
$$a = \frac{\sum y}{n} - b\,\frac{\sum x}{n} = \frac{255}{10} - 2.1465517 \times \frac{80}{10} = 25.5 - 17.172414 = 8.327586$$

The final answers (rounded to three decimal places) are: a = 8.328 and b = 2.147 (note that 3 decimal places were chosen as the data supplied were in thousands and hundreds).

These give the linear regression equation y = 8.328 + 2.147x or, if preferred, sales = 8.328 + 2.147 × advertising expenditure.

Forecasts

Forecasts may be made using the resulting model. If the x (independent) value used falls within the range of the original data, the forecast is known as interpolation. E.g. advertising expenditure = 700 (inside the original range), i.e. x = 7, giving y = 8.328 + 2.147 × 7 = 23.357, i.e. 23,357 sales are forecast.

If the x value falls outside the bounds of the original data, the forecast is known as extrapolation and care must be taken in its use. E.g. expenditure = 1800, so x = 18 and y = 8.328 + 2.147 × 18 = 46.974, i.e. 46,974 sales are forecast.

Coefficient of Determination

The coefficient of determination ($r^2$) is another measure which may be used to assess the appropriateness of a regression model.
This is found by squaring Pearson's correlation coefficient and then expressing the result as a percentage. The resulting figure describes the percentage of the variation in the y data which can be attributed to the variation in the x data. In the Sales and Advertising Costs example, r = 0.948, so $r^2$ = 0.899. So it may be said that 89.9% of the variation in sales of the products can be attributed to variation in the levels of advertising expenditure.

Rank Correlation, i.e. Spearman's

Used to assess evidence of a relationship between two sets of data, at least one of which has been ranked in some way.
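
Returning to the Sales and Advertising example: the slope, intercept, forecasts and coefficient of determination quoted earlier can all be checked from the summary values alone. The following is a minimal Python sketch rather than part of the original notes. The names fit_line, pearson_r and forecast are illustrative, and the formula used for r is the standard summary-value form of Pearson's coefficient, which is assumed here because this section does not state it.

```python
from math import sqrt

# Summary values quoted in Example 4.5
# (x = advertising expenditure in hundreds, y = sales in thousands).
n = 10
sum_x, sum_y = 80, 255
sum_x2, sum_y2 = 756, 7097
sum_xy = 2289


def fit_line(n, sum_x, sum_y, sum_x2, sum_xy):
    """Return (a, b) for the line y = a + b*x from summary totals."""
    b = (n * sum_xy - sum_x * sum_y) / (n * sum_x2 - sum_x ** 2)
    a = sum_y / n - b * sum_x / n
    return a, b


def pearson_r(n, sum_x, sum_y, sum_x2, sum_y2, sum_xy):
    """Pearson's r from summary totals (standard form, assumed here)."""
    numerator = n * sum_xy - sum_x * sum_y
    denominator = sqrt((n * sum_x2 - sum_x ** 2) * (n * sum_y2 - sum_y ** 2))
    return numerator / denominator


a, b = fit_line(n, sum_x, sum_y, sum_x2, sum_xy)
a3, b3 = round(a, 3), round(b, 3)  # rounded as in the worked example


def forecast(x):
    """Forecast sales (thousands) using the rounded coefficients."""
    return a3 + b3 * x


r = pearson_r(n, sum_x, sum_y, sum_x2, sum_y2, sum_xy)
r_squared = r ** 2

print(f"b = {b3:.3f}, a = {a3:.3f}")          # b = 2.147, a = 8.328
print(f"x = 7  -> y = {forecast(7):.3f}")     # interpolation: 23.357
print(f"x = 18 -> y = {forecast(18):.3f}")    # extrapolation: 46.974
print(f"r = {r:.3f}, r^2 = {r_squared:.3f}")  # r = 0.948, r^2 = 0.899
```

The forecasts use the rounded coefficients so that the figures match the worked example; using the unrounded values of a and b would change the later decimal places slightly.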