## Question 877:

1. Q877_1_q877-regression.xlsx
a.    We see that the largest correlation with the dependent variable is between sales (y) and personal income (x3), at .964. The correlation between outlets and income is also strong, at .825. When two independent variables are highly correlated with each other, it leads to a problem called multicollinearity: the method by which the regression equation is solved becomes less precise, which can give a misleading picture of which variables are actually the better predictors. Ideally you want low correlations among your x's and high correlations between each x and y.
b.    The amount of variation explained by a regression equation is expressed by r-squared. That value isn't shown in the correlation matrix, but we can run a regression analysis by selecting "Regression" from the Data Analysis package, with the sales data as the Y range and all 5 X's as the X range. I get an r-squared value of 99.4% and an adjusted r-squared of 98.7%, which tells us these variables account for almost all of the variation in sales.
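What Excel's Regression tool computes for r-squared and adjusted r-squared can be reproduced directly. This is a minimal sketch on made-up numbers, not the workbook's data: fit y on the X columns by least squares, then compare residual variation to total variation:

```python
import numpy as np

# Invented data for illustration only: y and two predictor columns.
y = np.array([12.0, 15.5, 13.8, 18.2, 20.1, 17.4, 16.0, 19.3])
X = np.column_stack([
    np.ones(8),                                  # intercept column
    [3, 4, 3, 5, 6, 5, 4, 6],                    # x1 (e.g. outlets)
    [2.9, 3.6, 3.2, 4.1, 4.6, 4.0, 3.7, 4.4],    # x2 (e.g. income)
])

beta, *_ = np.linalg.lstsq(X, y, rcond=None)     # OLS coefficients
resid = y - X @ beta
ss_res = float(resid @ resid)                    # unexplained variation
ss_tot = float(((y - y.mean()) ** 2).sum())      # total variation

n, k = len(y), X.shape[1] - 1                    # n observations, k predictors
r2 = 1 - ss_res / ss_tot
adj_r2 = 1 - (1 - r2) * (n - 1) / (n - k - 1)    # penalizes extra predictors
print(round(r2, 3), round(adj_r2, 3))
```

Adjusted r-squared is the value to watch when comparing models with different numbers of predictors, since plain r-squared can only go up as columns are added.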
c.    Once we've run the regression we can look at the significance of the predictors. In the p-value column, X1 (retail outlets) and X5 (number of supervisors) have p-values well above the .05 significance level, and X4 (average age of automobiles) is also just above it at .081. Therefore, only the regression coefficients for X2 (# registered) and X3 (personal income) are significantly different from 0.
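The p-value column comes from each coefficient's t statistic, beta divided by its standard error; Excel then converts |t| to a two-tailed p-value. A sketch on the same invented numbers as above (a rule-of-thumb cutoff of |t| > 2 roughly corresponds to p < .05 here; Excel uses the exact t distribution):

```python
import numpy as np

# Invented data, as before -- not the workbook's figures.
y = np.array([12.0, 15.5, 13.8, 18.2, 20.1, 17.4, 16.0, 19.3])
X = np.column_stack([
    np.ones(8),
    [3, 4, 3, 5, 6, 5, 4, 6],                    # x1
    [2.9, 3.6, 3.2, 4.1, 4.6, 4.0, 3.7, 4.4],    # x2
])

beta, *_ = np.linalg.lstsq(X, y, rcond=None)
resid = y - X @ beta
df = len(y) - X.shape[1]                         # residual degrees of freedom
s2 = float(resid @ resid) / df                   # residual variance estimate
cov = s2 * np.linalg.inv(X.T @ X)                # coefficient covariance matrix
se = np.sqrt(np.diag(cov))                       # standard errors
t_stats = beta / se

for name, t in zip(["intercept", "x1", "x2"], t_stats):
    flag = "keep" if abs(t) > 2 else "candidate to drop"
    print(f"{name}: t = {t:.2f} ({flag})")
```

A small |t| (large p-value) means the data are consistent with that coefficient being 0, i.e. the predictor adds nothing once the others are in the model.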
d.    Following up on the results from c, we'd only want to keep significant regression coefficients in the regression equation. Both retail outlets (X1) and number of supervisors (X5) have p-values well above .05, so they are good candidates to remove.
e.    Rerunning the regression with outlets and supervisors removed gives an r-squared value (the same thing as the coefficient of determination) of 99.4%, exactly what it was before. We lost essentially no explanatory power by removing those two predictors.
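The drop-and-refit step can be demonstrated numerically. This sketch uses simulated data (not the workbook's): one column genuinely drives y and one is pure noise, so removing the noise column barely changes r-squared:

```python
import numpy as np

# Simulated illustration: x_signal drives y, x_noise is irrelevant
# (standing in for a predictor like "number of supervisors").
rng = np.random.default_rng(0)
n = 30
x_signal = rng.normal(size=n)
x_noise = rng.normal(size=n)
y = 3.0 * x_signal + rng.normal(scale=0.1, size=n)

def r_squared(X, y):
    """Plain r-squared from an OLS fit of y on X."""
    beta, *_ = np.linalg.lstsq(X, y, rcond=None)
    resid = y - X @ beta
    return 1 - float(resid @ resid) / float(((y - y.mean()) ** 2).sum())

full = np.column_stack([np.ones(n), x_signal, x_noise])
reduced = np.column_stack([np.ones(n), x_signal])

r2_full, r2_reduced = r_squared(full, y), r_squared(reduced, y)
print(round(r2_full, 4), round(r2_reduced, 4))
```

R-squared can never increase when a column is dropped, so "exactly the same to three digits" is the best outcome one can hope for: it means the dropped columns carried no usable signal.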
f.    The histogram and stem-and-leaf display show an approximately symmetrical distribution, which suggests the data are reasonably normally distributed.
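For reference, a stem-and-leaf display like Excel add-ins produce is easy to build by hand. The residual values below are hypothetical; stems are the tens place (using floor division so negatives sort correctly) and leaves are the units offset within each stem:

```python
import math
from collections import defaultdict

# Hypothetical residuals, roughly symmetric about 0.
residuals = [-21, -17, -12, -8, -5, -3, -1, 0, 2, 4, 6, 9, 13, 16, 22]

stems = defaultdict(list)
for r in residuals:
    stem = math.floor(r / 10)           # floor keeps negative stems ordered
    stems[stem].append(r - 10 * stem)   # leaf: offset within the stem

for stem in sorted(stems):
    leaves = "".join(str(leaf) for leaf in sorted(stems[stem]))
    print(f"{stem:>3} | {leaves}")
```

A symmetric, single-peaked shape in this display (most leaves near the middle stems, tapering toward both ends) is what "approximately normal" looks like.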
g.    In a plot of the residuals against the fitted values, we'd look for patterns such as "fanning," where the residuals spread wider (or narrower) as the fitted values increase, forming a cone or fan shape. We'd also look for oscillation, where the residuals swing sharply back and forth around 0, and for extreme outliers, i.e. values more than 3 standard deviations out. None of these patterns appear, suggesting no violations of the regression assumptions, so the conclusions drawn from the equation stand.
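Those three visual checks can each be given a rough numeric proxy. This sketch uses simulated, well-behaved residuals (not the workbook's): correlation between |residual| and fitted value flags fanning, the rate of consecutive sign changes flags oscillation, and standardized residuals beyond 3 flag outliers:

```python
import numpy as np

# Simulated "healthy" residuals against an increasing fitted series.
rng = np.random.default_rng(1)
fitted = np.linspace(10, 50, 40)
resid = rng.normal(scale=2.0, size=40)

# (1) Fanning: corr(|resid|, fitted) near 0 means spread is not growing.
fan = np.corrcoef(np.abs(resid), fitted)[0, 1]

# (2) Oscillation: for independent residuals, signs flip about half the time;
#     a flip rate near 1 would mean systematic back-and-forth swings.
signs = np.sign(resid)
flips = float(np.mean(signs[1:] != signs[:-1]))

# (3) Outliers: standardized residuals beyond 3 standard deviations.
outliers = int(np.sum(np.abs(resid / resid.std()) > 3))

print(round(fan, 2), round(flips, 2), outliers)
```

Values like a near-zero fan statistic, a flip rate near 0.5, and zero outliers are the numeric counterpart of a residual plot that "looks fine."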

See the attached Excel file for the output of the two regression equations.