ISA 225
Planted 2021-08-17
Notes for Principles of Business Analytics  ISA 225
Prepping resources:
 Video
 W3 Schools Statistics
Review of Statistical Inference
 Population:
 N Size
 μ Mean
 𝜎 Standard Deviation
 P Ratio (proportion)
 Sample:
 n Size
 p̂ Ratio (proportion) (sample portion size / sample size)
 x̄ Mean
 Portion size: the number of sample members in the category of interest (the numerator of p̂)
Checks
 Checks:
 Independent:
 ✅ Randomly selected
 ✅ 10% condition: n < 10% of the population size N
 Normality:
 ✅ Success/Failure:
 ✅ np >= 10
 ✅ n(1-p) >= 10
 ✅ Expect at least 10 successes and 10 failures
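A minimal Python sketch of these checks (the function name and inputs are illustrative, not from the course):

```python
# Hedged sketch of the independence and normality conditions above.
# n: sample size, N: population size, p_hat: sample proportion.
def check_conditions(n, N, p_hat):
    ten_percent = n < 0.10 * N        # 10% condition: n < 10% of N
    successes = n * p_hat >= 10       # np >= 10
    failures = n * (1 - p_hat) >= 10  # n(1 - p) >= 10
    return ten_percent and successes and failures

print(check_conditions(n=100, N=10_000, p_hat=0.3))  # True: all conditions hold
```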
Common Z-Values

| α | Confidence Level | Z* |
| --- | --- | --- |
| 0.10 | 90% | 1.645 |
| 0.05 | 95% | 1.960 ≈ 2 |
| 0.01 | 99% | 2.576 |
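These critical values can be reproduced with Python's standard library (a quick sketch):

```python
from statistics import NormalDist

# Two-tailed critical value: z* = inv_cdf(1 - alpha/2) on the standard normal.
def z_star(alpha):
    return NormalDist().inv_cdf(1 - alpha / 2)

for a in (0.10, 0.05, 0.01):
    print(f"alpha={a}: z* = {z_star(a):.3f}")  # 1.645, 1.960, 2.576
```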
Determining Sample Size
\[sampleSize = {sampleRatio(1-sampleRatio) \times\left({zIndex \over{marginOfError}}\right)^2 }\] \[n = {p̂(1-p̂) \times\left({Z^* \over{ME}}\right)^2 }\]
\[sampleSize = {\left({zIndex \times{populationStandardDeviation} \over{marginOfError}}\right)^2}\] \[n = {\left({Z^* \times{𝜎} \over{ME}}\right)^2}\]
\[sampleSize = {\left({zIndex \times{sampleStandardDeviation} \over{marginOfError}}\right)^2}\] \[n = {\left({Z^* \times{s} \over{ME}}\right)^2}\]
Notes
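A hedged Python sketch of the sample-size formulas above (function names are illustrative; results are rounded up since n must be a whole count):

```python
import math

# n for estimating a proportion: n = p_hat(1 - p_hat) * (z*/ME)^2
def n_for_proportion(p_hat, z_star, me):
    return math.ceil(p_hat * (1 - p_hat) * (z_star / me) ** 2)

# n for estimating a mean: n = (z* * sigma / ME)^2 (same form with s)
def n_for_mean(sigma, z_star, me):
    return math.ceil((z_star * sigma / me) ** 2)

# e.g. 95% confidence, p_hat = 0.5 (most conservative), 3% margin of error:
print(n_for_proportion(0.5, 1.96, 0.03))  # 1068
```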
The t-distribution describes the statistical properties of sample means estimated from small samples; the standard normal distribution is used for large samples.
Hypothesis test for population proportion and mean
A hypothesis is a claim about a population parameter (proportion, mean)
Steps to compute a hypothesis test:
 State hypothesis
 Calculate test statistics
 Find p-value
 Make conclusions based on the p-value
The null hypothesis, H_{o}, is the starting assumption (nothing has changed).
\[H_o: populationParameter = claimedValue\]
The alternative hypothesis, H_{a}, is the claim that the population parameter differs from the value in the null hypothesis. It can take these forms depending on what you want to test:
Left-tailed hypothesis test: \(H_a: populationParameter \lt claimedValue\)
Right-tailed hypothesis test: \(H_a: populationParameter \gt claimedValue\)
Two-tailed hypothesis test: \(H_a: populationParameter \neq claimedValue\)
Step 2: Calculate the Test Statistics
 Test statistics about population proportion P (One-prop Z-test)
 \[Zscore = {p̂ - P \over \sqrt{P(1-P) \over n} }\]
 Test statistics about population mean μ (One-sample test)
 When the population 𝜎 Standard Deviation is known: (One-sample Z-test)
 \[Zscore = {sampleMean - claimedValue \over{populationStandardDeviation \over \sqrt{sampleSize}} }\]
 \[Zscore = {x̄ - μ_0 \over{𝜎 \over \sqrt{n}} }\]
 When the population 𝜎 Standard Deviation is unknown (One-sample t-test / Student's t-test)
 \[t = {sampleMean - claimedValue \over{sampleStdDev \over \sqrt{sampleSize}} }\]
 \[t = {x̄ - μ_0 \over{s \over \sqrt{n}} }\]
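The one-sample t statistic can be sketched in Python (the numbers below are hypothetical; the p-value would then come from a t-table with df = n − 1, or from technology such as scipy.stats.t.sf):

```python
import math

# One-sample t statistic when sigma is unknown: t = (x_bar - mu_0) / (s / sqrt(n))
def t_statistic(x_bar, mu_0, s, n):
    return (x_bar - mu_0) / (s / math.sqrt(n))

# Hypothetical numbers: claimed mean 100, sample mean 103, s = 8, n = 25.
t = t_statistic(103, 100, 8, 25)
print(round(t, 3))  # 1.875, compared against the t-table at df = 24
```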
Step 4: Make conclusion based on the p-value
Compare the p-value with the significance level α (always set before the test). The smaller α is, the stronger the evidence required to reject the null hypothesis, and the lower the chance of a Type I error.
 Type I errors: the null hypothesis is true, but we reject it (false positive)
 Type II errors: the null hypothesis is false, but we fail to reject it (false negative)
If p-value < α, then reject the null hypothesis; we have enough evidence to support H_{a}.
If p-value > α, then do not reject the null hypothesis; we do not have enough evidence to support H_{a}.
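A complete one-prop Z-test can be sketched with the standard library (the proportion and counts below are hypothetical):

```python
from statistics import NormalDist
import math

# One-prop Z-test (two-tailed). p0 is the claimed population proportion.
def one_prop_z_test(p_hat, p0, n):
    se = math.sqrt(p0 * (1 - p0) / n)             # SE uses the claimed p0
    z = (p_hat - p0) / se
    p_value = 2 * (1 - NormalDist().cdf(abs(z)))  # two-tailed p-value
    return z, p_value

z, p = one_prop_z_test(p_hat=0.56, p0=0.50, n=400)
print(f"z = {z:.2f}, p-value = {p:.4f}")
alpha = 0.05
print("reject H0" if p < alpha else "fail to reject H0")
```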
Comparing Two Population Parameters
Two-Sample t-test (comparing two population means)
 State hypothesis
 Check assumptions and calculate test statistics
 Find p-value based on test statistics
 Make conclusion based on the p-value
Since population standard deviations are unknown, we use the standard errors instead:
\[t = {(ȳ_1 - ȳ_2) - (μ_1 - μ_2) \over\sqrt{ {s_1^2 \over{ n_1 }} + {s_2^2 \over{n_2}} } }\]
Confidence Interval for Difference between Two Population Means
Two-sample Z-interval (when \(𝜎_1\) and \(𝜎_2\) are known)
\[{(ȳ_1 - ȳ_2) \pm Z^* \times \sqrt{ {𝜎_1^2 \over{ n_1 }} + {𝜎_2^2 \over{n_2}} } }\]
Two-sample t-interval (when \(𝜎_1\) and \(𝜎_2\) are unknown)
\[{(ȳ_1 - ȳ_2) \pm t^* \times \sqrt{ {s_1^2 \over{ n_1 }} + {s_2^2 \over{n_2}} } }\]
The \(t^*_{df, \alpha/2}\) here depends on the confidence level 100(1-α)% and the calculated df.
Interpretation of the C.I. is similar to the one-sample test
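The two-sample t statistic and t-interval above can be sketched in Python (all numbers are hypothetical, and t* is passed in from the t-table rather than computed):

```python
import math

# Two-sample t statistic under H0: mu1 - mu2 = 0, using standard errors
# in place of the unknown population standard deviations.
def two_sample_t(y1_bar, y2_bar, s1, s2, n1, n2):
    se = math.sqrt(s1**2 / n1 + s2**2 / n2)  # standard error of the difference
    return (y1_bar - y2_bar) / se

# Two-sample t-interval; t_star comes from the t-table for the chosen
# confidence level and the calculated df.
def two_sample_t_interval(y1_bar, y2_bar, s1, s2, n1, n2, t_star):
    se = math.sqrt(s1**2 / n1 + s2**2 / n2)
    diff = y1_bar - y2_bar
    return diff - t_star * se, diff + t_star * se

# Hypothetical numbers: two groups of 30 with the sample stats below.
t = two_sample_t(52.1, 48.4, 6.0, 5.5, 30, 30)
lo, hi = two_sample_t_interval(52.1, 48.4, 6.0, 5.5, 30, 30, t_star=2.002)
print(f"t = {t:.2f}, 95% CI: ({lo:.2f}, {hi:.2f})")
```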
Chi-Square Tests
 One variable?
 Goodness of Fit Test
 H_{0}: model fits data
 H_{a}: model does not fit data
 Two variables?
 Test for independence
 H_{0}: variables are independent
 H_{a}: variables are not independent
Goodness-of-Fit Tests (one variable)
A χ2 goodness of fit test is applied when you have one categorical variable from a single population.
 State the hypothesis:
 H_{0}: model fits. (hypothesized model fits the sample we collected)
 H_{a}: model doesn’t fit. (hypothesized model doesn’t fit the sample we collected)
 Assumptions and Test Statistics:
 Assumptions:
 Counted Data Condition – The data must be counts for the categories of a single categorical variable.
 Independence Assumption – The counts should be independent of each other.
 Randomization Condition – The counted individuals should be a random sample of the population.
 Sample Size Assumption – We expect at least 5 individuals per cell.
 Test statistics:
 \[{\chi^2 = \sum_{allCells} {(Obs - Exp)^2\over{Exp} } }\]
 Find p-value based on the test statistics
 df = (#cells − 1); use the χ2 table: fix the row for this df, then locate the test statistic to find the corresponding p-value, which is the right-tail probability of the test statistic.
 (or by technology) p-value = P(χ2 > test statistic)
 Make conclusions based on the p-value
 If p-value < α, reject H_{0}, which means the hypothesized model doesn’t fit the sampled data.
 If p-value > α, fail to reject H_{0}; we do not have significant evidence to say the model doesn’t fit the sampled data.
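The χ2 statistic is straightforward to compute by hand; a sketch with a hypothetical die-fairness check (the p-value would then come from the χ2 table at df = #cells − 1):

```python
# Chi-square goodness-of-fit statistic: sum of (Obs - Exp)^2 / Exp over cells.
def chi_square_stat(observed, expected):
    return sum((o - e) ** 2 / e for o, e in zip(observed, expected))

# Hypothetical die-fairness check: 60 rolls, expect 10 per face.
obs = [8, 12, 9, 11, 6, 14]
exp = [10] * 6
stat = chi_square_stat(obs, exp)
print(round(stat, 2))  # 4.2, with df = 6 - 1 = 5
```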
Chi-Square Test for Independence (two variables)
 State the hypothesis:
 H_{0}: variables are independent.
 H_{a}: variables are not independent.
 Assumptions and Test Statistics:
 Assumptions:
 Counted Data Condition – The data must be counts for the combinations of two categorical variables.
 Randomization Condition – The counted individuals should be a random sample of the population.
 Sample Size Assumption – We expect at least 5 individuals per cell.
 Test statistics:
 \[{\chi^2 = \sum_{allCells} {(Obs - Exp)^2\over{Exp} } }\]
 Expected counts are computed assuming H_{0} is true, i.e., that the variables are independent:
 \[{Exp_{ij} = {totalRow_i \times totalCol_j \over{tableTotal}} }\]
 Find p-value based on the test statistics
 df = (#rows − 1) × (#cols − 1); use the χ2 table: fix the row for this df, then locate the test statistic to find the corresponding p-value, which is the right-tail probability of the test statistic.
 p-value = P(χ2 > test statistic)
 Make conclusions based on the p-value
 If p-value < α, reject H_{0}, which means the two variables are not independent.
 If p-value > α, fail to reject H_{0}; we do not have significant evidence to say the two variables are not independent.
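The expected-count formula can be sketched in Python (the 2×2 observed table is hypothetical):

```python
# Expected counts under independence: Exp_ij = (row_i total * col_j total) / table total
def expected_counts(table):
    row_totals = [sum(row) for row in table]
    col_totals = [sum(col) for col in zip(*table)]
    grand = sum(row_totals)
    return [[r * c / grand for c in col_totals] for r in row_totals]

# Hypothetical 2x2 table of observed counts:
obs = [[30, 20],
       [10, 40]]
for row in expected_counts(obs):
    print(row)  # [20.0, 30.0] then [20.0, 30.0]
```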
Simple regression (linear)
Sample regression line: \(\hat{y} = b_0 + b_1 x\)
 ŷ the predicted value of response variable (y), when x is given as a specific value.
 b_{0} the sample yintercept
 b_{1} the sample slope

r the correlation coefficient
 value from −1 to 1
 the closer to 0, the weaker the relationship

r^{2} the proportion of the observed variation in y that can be accounted for by x (i.e., by the model in x).
 shows how well the model fits the data
 value from 0 to 1
 the closer it is to 1, the stronger the regression relationship.
 \({e = y - \hat{y}}\) the residual, the difference between the predicted (ŷ) and observed (y) values
 Ɛ the population residual (error term)
 μ_{y} the population mean of y at a given value of x
 𝛽_{0} the population mean value of Y when X = 0
 𝛽_{1} the population mean value of Y for each unit increase in X
Step 1: State the hypothesis
 \[H_o: \beta_1 = 0\]
 \[H_a: \beta_1 \ne 0\]
Step 2: Test statistics
 df = n − 2
 Se is called “Root Mean Squared Error”
 \[{t = {b_1 - \beta_1 \over{SE(b_1)}}}\]
 Confidence interval = \({b_1 \pm t^*_{df,{\alpha\over{2}}} \times SE(b_1) }\)
Regression Assumptions
 Linearity Assumption: scatterplot looks like a linear relationship
 Independence Assumption: randomly selected
 Equal Variance Assumption: the scatterplot is equally spread out with no clumping, and the spread around the zero line in the residual plot is reasonably consistent
 Normal Population Assumption: the residuals satisfy the Nearly Normal Condition (check with a histogram or normal probability plot of the residuals)