# ISA 225

## Notes for Principles of Business Analytics - ISA 225

Prepping resources:

- Video
- W3 Schools Statistics

## Review of Statistical Inference

- Population:
  - `N` – size
  - `μ` – mean
  - `𝜎` – standard deviation
  - `P` – proportion

- Sample:
  - `n` – size
  - `p̂` – sample proportion (number of successes / sample size)
  - `x̄` – sample mean

### Checks

- Independence:
  - ✅ Randomly selected
  - ✅ 10% condition: n < 10% of the population

- Normality:
  - ✅ Success/Failure condition:
    - np ≥ 10
    - n(1-p) ≥ 10
    - i.e., expect at least 10 successes and 10 failures

### Common Z-Values

| α | Confidence Level | Z* |
|---|---|---|
| 0.10 | 90% | 1.645 |
| 0.05 | 95% | 1.960 ≈ 2 |
| 0.01 | 99% | 2.576 |
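
These critical values come from the standard normal distribution; a quick way to reproduce them (assuming `scipy` is installed):

```python
# Z* is the (1 - alpha/2) quantile of the standard normal distribution:
# for a two-sided confidence interval, alpha/2 goes in each tail.
from scipy.stats import norm

for alpha, level in [(0.10, "90%"), (0.05, "95%"), (0.01, "99%")]:
    z_star = norm.ppf(1 - alpha / 2)
    print(f"alpha = {alpha:.2f}  {level}  Z* = {z_star:.3f}")
```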

### Determining Sample Size

For a proportion:

\[n = \hat{p}(1-\hat{p}) \times \left({Z^* \over ME}\right)^2\]

For a mean, when the population standard deviation \(𝜎\) is known:

\[n = \left({Z^* \times 𝜎 \over ME}\right)^2\]

For a mean, when only the sample standard deviation \(s\) is available:

\[n = \left({Z^* \times s \over ME}\right)^2\]

Round n up to the next whole number.

## Notes

The t-distribution describes the statistical properties of sample means that are estimated from small samples; the standard normal distribution is used for large samples.
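
The sample-size formulas above can be sketched in plain Python (the function names are illustrative, not from the course):

```python
import math

def sample_size_proportion(p_hat, z_star, margin_of_error):
    """n = p-hat * (1 - p-hat) * (Z*/ME)^2, rounded up to a whole person."""
    n = p_hat * (1 - p_hat) * (z_star / margin_of_error) ** 2
    return math.ceil(n)

def sample_size_mean(sd, z_star, margin_of_error):
    """n = (Z* * sigma / ME)^2; also used with s when sigma is unknown."""
    n = (z_star * sd / margin_of_error) ** 2
    return math.ceil(n)

# Example: 95% confidence, p-hat = 0.5 (most conservative), ME = 3 points
print(sample_size_proportion(0.5, 1.96, 0.03))  # → 1068
```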

## Hypothesis test for population proportion and mean

A hypothesis is a claim about a population parameter (proportion, mean)

Steps to compute a hypothesis test:

- State hypothesis
- Calculate test statistics
- Find p-value
- Make conclusions based on p-value

The null hypothesis, H_{o}, is the starting assumption (nothing has changed).

The alternative hypothesis, H_{a}, is the claim that the population parameter differs from the value in the null hypothesis. It can take three forms depending on what you want to test:

Left-tailed hypothesis test:

\(H_a: populationParameter \lt claimedValue\)

Right-tailed hypothesis test:

\(H_a: populationParameter \gt claimedValue\)

Two-tailed hypothesis test:

\(H_a: populationParameter \neq claimedValue\)

### Step 2: Calculate the Test Statistics

- Test statistic for a population proportion `P` (one-prop Z-test):

\[Z = {\hat{p} - p_0 \over \sqrt{p_0(1-p_0) \over n}}\]

- Test statistic for a population mean `μ` (one-sample test):
  - When the population standard deviation `𝜎` is known (one-sample Z-test):

\[Z = {\bar{x} - \mu_0 \over {𝜎 \over \sqrt{n}}}\]

  - When the population standard deviation `𝜎` is unknown (one-sample t-test / Student's t-test):

\[t = {\bar{x} - \mu_0 \over {s \over \sqrt{n}}}\]

### Step 4: Make conclusion based on p-value

Compare the p-value with the significance level `α` (always set before the test). The smaller `α`, the stronger the evidence required to reject the null hypothesis.

- Type I error: the null hypothesis is true, but we reject it (false positive)
- Type II error: the null hypothesis is false, but we fail to reject it (false negative)

If p-value < α, reject the null hypothesis; we have enough evidence to support H_{a}.

If p-value ≥ α, do not reject the null hypothesis; we do not have enough evidence to support H_{a}.
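
Steps 2–4 for a one-proportion Z-test can be sketched with `scipy` (the function name and counts are made up for illustration):

```python
from scipy.stats import norm

def one_prop_z_test(successes, n, p0):
    """Two-tailed one-proportion Z-test of H0: P = p0 vs Ha: P != p0."""
    p_hat = successes / n
    z = (p_hat - p0) / (p0 * (1 - p0) / n) ** 0.5
    p_value = 2 * norm.sf(abs(z))   # two-tailed: double the right-tail area
    return z, p_value

# Example: 560 successes in 1000 trials, testing H0: P = 0.5
z, p_value = one_prop_z_test(560, 1000, 0.5)
alpha = 0.05
decision = "reject H0" if p_value < alpha else "fail to reject H0"
print(f"z = {z:.3f}, p = {p_value:.4f}: {decision}")
```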

## Comparing Two Population Parameters

### Two Sample t-test (comparing two population means)

- State hypothesis
- Check assumptions and calculate test statistics
- Find p-value based on test statistics
- Make conclusion based on p-value

Since population standard deviations are unknown, we use the standard errors instead:

\[t = {(\bar{y}_1 - \bar{y}_2) - (\mu_1 - \mu_2) \over \sqrt{ {s_1^2 \over n_1} + {s_2^2 \over n_2} }}\]

### Confidence Interval for Difference between Two Population Means

**Two-sample Z-interval (when \(𝜎_1\) and \(𝜎_2\) are known):**

\[(\bar{y}_1 - \bar{y}_2) \pm z^* \sqrt{ {𝜎_1^2 \over n_1} + {𝜎_2^2 \over n_2} }\]

**Two-sample t-interval (when \(𝜎_1\) and \(𝜎_2\) are unknown):**

\[(\bar{y}_1 - \bar{y}_2) \pm t^*_{df, \alpha/2} \sqrt{ {s_1^2 \over n_1} + {s_2^2 \over n_2} }\]

The \(t^*_{df, \alpha/2}\) here depends on the confidence level 100(1-α)% and the calculated df.
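
A sketch of the two-sample t statistic and confidence interval using the standard error above with the Welch–Satterthwaite df (the function name and sample data are mine):

```python
from scipy import stats

def two_sample_t(y1, y2, confidence=0.95):
    """Welch two-sample t statistic, df, and CI for mu1 - mu2."""
    n1, n2 = len(y1), len(y2)
    m1, m2 = sum(y1) / n1, sum(y2) / n2
    s1_sq = sum((x - m1) ** 2 for x in y1) / (n1 - 1)
    s2_sq = sum((x - m2) ** 2 for x in y2) / (n2 - 1)
    se = (s1_sq / n1 + s2_sq / n2) ** 0.5
    t = (m1 - m2) / se                      # H0: mu1 - mu2 = 0
    # Welch-Satterthwaite degrees of freedom
    df = se ** 4 / ((s1_sq / n1) ** 2 / (n1 - 1) + (s2_sq / n2) ** 2 / (n2 - 1))
    t_star = stats.t.ppf(1 - (1 - confidence) / 2, df)
    ci = (m1 - m2 - t_star * se, m1 - m2 + t_star * se)
    return t, df, ci

# Small made-up samples; cross-check against scipy's built-in Welch test
a = [5.1, 4.9, 5.4, 5.0, 5.2]
b = [4.6, 4.8, 4.5, 4.9]
t, df, ci = two_sample_t(a, b)
print(t, df, ci)
```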

### Interpretation of C.I. is similar to one-sample test

## Chi-Square Tests

- One variable?
  - Goodness of Fit Test
  - H_{0}: model fits data
  - H_{a}: model does not fit data

- Two variables?
  - Test for independence
  - H_{0}: variables are independent
  - H_{a}: variables are not independent

### Goodness-of-Fit Tests (one variable)

A χ2 goodness of fit test is applied when you have one categorical variable from a single population.

- State the hypothesis:
  - H_{0}: model fits. (the hypothesized model fits the sample we collected)
  - H_{a}: model doesn’t fit. (the hypothesized model doesn’t fit the sample we collected)
- Assumptions and test statistics:
  - Assumptions:
    - Counted Data Condition – the data must be counts for the categories of a single categorical variable.
    - Independence Assumption – the counts should be independent of each other.
    - Randomization Condition – the counted individuals should be a random sample of the population.
    - Sample Size Assumption – we expect at least 5 individuals per cell.
  - Test statistics:
    - \[{\chi^2 = \sum_{allCells} {(Obs - Exp)^2\over{Exp} } }\]
- Find the p-value based on the test statistics
  - df = (# of cells - 1); use the χ2 table: fix the row for df, then use the test statistics to find the corresponding p-value, which is the right-tail probability of the test statistics.
  - (or by technology) p-value = P(χ2 > test statistics)

- Make conclusions based on the p-value
  - If **p-value < α, reject H_{0}**, which means the hypothesized model doesn’t fit the sampled data.
  - If **p-value > α, fail to reject H_{0}**; we do not have significant evidence to say the model doesn’t fit the sampled data.
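
The goodness-of-fit steps above can be sketched with `scipy.stats.chisquare` (the counts are made up; a fair-die model is assumed for illustration):

```python
from scipy.stats import chisquare

observed = [18, 22, 16, 25, 20, 19]        # 120 die rolls (made-up counts)
expected = [sum(observed) / 6] * 6         # fair-die model: 20 per face

# chisquare computes sum((Obs - Exp)^2 / Exp) and its right-tail p-value
chi2, p_value = chisquare(observed, expected)
print(f"chi2 = {chi2:.3f}, df = {len(observed) - 1}, p = {p_value:.4f}")
# Here p > alpha = 0.05: fail to reject H0, no evidence the model doesn't fit
```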

### Chi-Square test for Independence (two variables)

- State the hypothesis:
  - H_{0}: variables are independent.
  - H_{a}: variables are not independent.
- Assumptions and test statistics:
  - Assumptions:
    - Counted Data Condition – the data must be counts for the categories of two categorical variables.
    - Randomization Condition – the counted individuals should be a random sample of the population.
    - Sample Size Assumption – we expect at least 5 individuals per cell.
  - Test statistics:
    - \[{\chi^2 = \sum_{allCells} {(Obs - Exp)^2\over{Exp} } }\]
    - Assuming H_{0} is true (the variables are independent), the expected counts are:
    - \[{Exp_{ij} = {totalRow_i \times totalCol_j \over tableTotal} }\]
- Find the p-value based on the test statistics
  - df = (# of rows - 1) × (# of cols - 1); use the χ2 table: fix the row for df, then use the test statistics to find the corresponding p-value, which is the right-tail probability of the test statistics.
  - p-value = P(χ2 > test statistics)

- Make conclusions based on the p-value
  - If **p-value < α, reject H_{0}**, which means the two variables are not independent.
  - If **p-value > α, fail to reject H_{0}**; we do not have significant evidence to say the two variables are not independent.
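
For the independence test, `scipy.stats.chi2_contingency` computes the expected counts \(Exp_{ij}\) and the statistic in one call (the table below is made up):

```python
from scipy.stats import chi2_contingency

# Rows: two groups; columns: three response categories (made-up counts)
table = [[30, 20, 10],
         [20, 30, 10]]

chi2, p_value, df, expected = chi2_contingency(table)
print(f"chi2 = {chi2:.3f}, df = {df}, p = {p_value:.4f}")
print(expected)   # Exp_ij = (row_i total * col_j total) / table total
```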

## Simple regression (linear)

Sample regression line:

- `ŷ` – the predicted value of the response variable (y) when x is given a specific value
- `b_0` – the sample y-intercept
- `b_1` – the sample slope
- `r` – the correlation coefficient
  - value from -1 to 1
  - the closer to 0, the weaker the relationship
- `r^2` – the proportion of the observed variation in y that can be accounted for by x (modeled by x)
  - shows how well the model fits the data
  - value from 0 to 1
  - the closer to 1, the stronger the regression relationship
- \({e = y - \hat{y}}\) – the residual, the difference between the observed (y) and predicted (ŷ) values
- `Ɛ` – the population mean residual
- `μ_y` – the population mean of y at a given value of x
- `𝛽_0` – the population mean value of Y when X = 0
- `𝛽_1` – the change in the population mean value of Y for each unit increase in X

### Step 1: State the hypothesis

- \[H_o: \beta_1 = 0\]
- \[H_a: \beta_1 \ne 0\]

### Step 2: Test statistics

- df=n-2
- Se is called “Root Mean Squared Error”
- \[{t = {b_1-\beta_1\over{SE(b_1)}}}\] (with \(\beta_1 = 0\) under the null hypothesis)
- Confidence interval = \({b_1 \pm t^*_{df,{\alpha\over{2}}} \times SE(b_1) }\)
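
`scipy.stats.linregress` reports the slope, its standard error, and the two-sided p-value for the slope test above (the data here are made up):

```python
from scipy import stats

# Made-up (x, y) data with a rough linear trend
x = [1, 2, 3, 4, 5, 6, 7, 8]
y = [2.1, 2.9, 4.2, 4.8, 6.1, 6.8, 8.2, 8.9]

res = stats.linregress(x, y)              # fits y-hat = b0 + b1 * x
t = res.slope / res.stderr                # t = (b1 - 0) / SE(b1) under H0
t_star = stats.t.ppf(0.975, len(x) - 2)   # df = n - 2, 95% confidence
ci = (res.slope - t_star * res.stderr, res.slope + t_star * res.stderr)
print(f"b1 = {res.slope:.3f}, t = {t:.2f}, p = {res.pvalue:.2e}, CI = {ci}")
```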

### Regression Assumptions

- Linearity Assumption: scatterplot looks like a linear relationship
- Independence Assumption: randomly selected
- Equal Variance Assumption: scatterplot equally spread out, no clumping and spread around the line in residual plot is reasonably consistent at line 0
- Normal Population Assumption: the residuals satisfy the Nearly Normal Condition and