## Seminar Slides: The Bayesian Approach to Data Analysis | Fridtjof Thomas

Please see below the slides from the seminar on May 15th, 2019 by Fridtjof Thomas.

**P-values: What they are and what they are not** looks in detail at good examples of using p-values and how to interpret them. After reviewing widely understood problems with p-values, attention is drawn to regularly encountered uses of p-values where the correct interpretation is less clear. Furthermore, we demonstrate why p-values are not meaningful measures of support for specific hypotheses. Guidance on good statistical data analysis is based on the American Statistical Association's statement on p-values. A hierarchy of scientific evidence compiled by the Oxford Centre for Evidence-Based Medicine is reviewed to re-emphasize the role of thoughtful statistical analysis in scientific and medical discovery.

Presented April 30th, 2019.

The positive and negative predictive values of a diagnostic test depend not only on its sensitivity and specificity, but also on the prevalence of the disease (or the pre-test probability of disease). This interactive display illustrates that inter-dependence.

Move the sensitivity, specificity, and disease prevalence sliders, and watch the positive and negative predictive value sliders change in response. The mosaic plot on the right shows the positive tests (T+, in maroon) and negative tests (T-, in orange) in a population.

Diseased individuals (D+) are on the right column and non-diseased (D-) on the left column.
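The relationship the display illustrates follows directly from Bayes' theorem. As a minimal sketch (the numbers below are illustrative, not taken from the display):

```
# Positive and negative predictive values from sensitivity, specificity,
# and prevalence, via Bayes' theorem.
predictive_values <- function(sens, spec, prev) {
  ppv <- (sens * prev) / (sens * prev + (1 - spec) * (1 - prev))
  npv <- (spec * (1 - prev)) / (spec * (1 - prev) + (1 - sens) * prev)
  c(PPV = ppv, NPV = npv)
}

# With 90% sensitivity and 90% specificity but only 10% prevalence,
# half of all positive tests are false positives (PPV = 0.5).
predictive_values(sens = 0.9, spec = 0.9, prev = 0.1)
```

Dragging the prevalence slider toward zero drives the PPV down in exactly this way, no matter how good the test's sensitivity and specificity are.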

If you’re an avid SAS user, you’re likely very familiar with SAS macros. SAS macros are a key tool for creating efficient and concise code. Although you cannot use macros in R, R offers other features like functions and loops that can perform the same tasks as SAS macros.

In SAS, if we wanted to run multiple linear regressions using different predictor variables, we could use a simple SAS macro to iterate over the independent variables. In R, we can simplify this even more by making use of the apply() function. The apply() function comes from the R base package and is one of many members of the apply() family. Its members (lapply(), sapply(), mapply(), etc.) differ in the data structures of their inputs and outputs.
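As a small sketch of that difference: lapply() always returns a list, while sapply() simplifies the result to a vector or matrix when it can.

```
x <- list(a = 1:5, b = 6:10)

lapply(x, mean)  # returns a list: $a is 3, $b is 8
sapply(x, mean)  # simplifies to a named numeric vector: a = 3, b = 8
```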

apply(X, MARGIN, FUN, ...) takes three main arguments.

- X is an array or matrix.
- MARGIN indicates whether the function should be applied over rows (MARGIN = 1) or columns (MARGIN = 2).
- FUN indicates what function should be applied. Any R function can be used, including user-defined functions.
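As a quick illustration of the MARGIN argument, here apply() computes row and column sums of a small matrix:

```
m <- matrix(1:6, nrow = 2)  # 2 rows, 3 columns

apply(m, 1, sum)  # MARGIN = 1: one sum per row    -> 9 12
apply(m, 2, sum)  # MARGIN = 2: one sum per column -> 3 7 11
```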

In this example, we will use the R dataset mtcars (first 6 rows shown below).

```
data(mtcars)
head(mtcars)
```

```
## mpg cyl disp hp drat wt qsec vs am gear carb
## Mazda RX4 21.0 6 160 110 3.90 2.620 16.46 0 1 4 4
## Mazda RX4 Wag 21.0 6 160 110 3.90 2.875 17.02 0 1 4 4
## Datsun 710 22.8 4 108 93 3.85 2.320 18.61 1 1 4 1
## Hornet 4 Drive 21.4 6 258 110 3.08 3.215 19.44 1 0 3 1
## Hornet Sportabout 18.7 8 360 175 3.15 3.440 17.02 0 0 3 2
## Valiant 18.1 6 225 105 2.76 3.460 20.22 1 0 3 1
```

Back to the question presented earlier, how can we iterate over variables to run multiple regressions with different predictor variables?

The following apply function takes the dataset mtcars and subsets the variables cyl (number of cylinders), disp (displacement), and wt (weight) as the variables we want to apply the function to. We specify the margin as 2 so it iterates over the 3 columns. Finally, we specify a user defined function that takes the independent variable as a parameter and outputs the summary statistics of a linear model where mpg (miles per gallon) is the outcome variable.

```
apply(mtcars[, c("cyl", "disp", "wt")], 2, function(ind) {
  summary(lm(mpg ~ ind, data = mtcars))
})
```

```
## $cyl
##
## Call:
## lm(formula = mpg ~ ind, data = mtcars)
##
## Residuals:
## Min 1Q Median 3Q Max
## -4.9814 -2.1185 0.2217 1.0717 7.5186
##
## Coefficients:
## Estimate Std. Error t value Pr(>|t|)
## (Intercept) 37.8846 2.0738 18.27 < 2e-16 ***
## ind -2.8758 0.3224 -8.92 6.11e-10 ***
## ---
## Signif. codes: 0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
##
## Residual standard error: 3.206 on 30 degrees of freedom
## Multiple R-squared: 0.7262, Adjusted R-squared: 0.7171
## F-statistic: 79.56 on 1 and 30 DF, p-value: 6.113e-10
##
##
## $disp
##
## Call:
## lm(formula = mpg ~ ind, data = mtcars)
##
## Residuals:
## Min 1Q Median 3Q Max
## -4.8922 -2.2022 -0.9631 1.6272 7.2305
##
## Coefficients:
## Estimate Std. Error t value Pr(>|t|)
## (Intercept) 29.599855 1.229720 24.070 < 2e-16 ***
## ind -0.041215 0.004712 -8.747 9.38e-10 ***
## ---
## Signif. codes: 0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
##
## Residual standard error: 3.251 on 30 degrees of freedom
## Multiple R-squared: 0.7183, Adjusted R-squared: 0.709
## F-statistic: 76.51 on 1 and 30 DF, p-value: 9.38e-10
##
##
## $wt
##
## Call:
## lm(formula = mpg ~ ind, data = mtcars)
##
## Residuals:
## Min 1Q Median 3Q Max
## -4.5432 -2.3647 -0.1252 1.4096 6.8727
##
## Coefficients:
## Estimate Std. Error t value Pr(>|t|)
## (Intercept) 37.2851 1.8776 19.858 < 2e-16 ***
## ind -5.3445 0.5591 -9.559 1.29e-10 ***
## ---
## Signif. codes: 0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
##
## Residual standard error: 3.046 on 30 degrees of freedom
## Multiple R-squared: 0.7528, Adjusted R-squared: 0.7446
## F-statistic: 91.38 on 1 and 30 DF, p-value: 1.294e-10
```

Let’s say we want to examine product sales of 3 products sold over 100 days. Our goal is to have sold 45 units of each product. We can use a for loop to create new dummy variables that indicate if we sold 45 or more units that day.

First, we create a sample data set from randomly generated integers:

```
set.seed(22)
product1 <- sample(30:50, 100, replace = TRUE)
product2 <- sample(40:60, 100, replace = TRUE)
product3 <- sample(35:48, 100, replace = TRUE)
sales <- as.data.frame(cbind(product1, product2, product3))
head(sales)
```

```
## product1 product2 product3
## 1 36 43 35
## 2 39 43 41
## 3 50 54 37
## 4 40 52 48
## 5 47 41 36
## 6 45 46 44
```

Then we can use a for loop to iterate over the variable names in the dataset. The paste function allows us to create a new variable name containing the old variable name and the condition. We can then assign a 0 or 1 to the new variable depending on whether the sales goal of 45 or more units was met.

```
for (p in names(sales)) {
  sales[[paste(p, ">45", sep = "")]] <- as.numeric(sales[[p]] >= 45)
}
print(head(sales))
```

```
## product1 product2 product3 product1>45 product2>45 product3>45
## 1 36 43 35 0 0 0
## 2 39 43 41 0 0 0
## 3 50 54 37 1 1 0
## 4 40 52 48 0 1 1
## 5 47 41 36 1 0 0
## 6 45 46 44 1 1 0
```

In general, it is more efficient to use one of the apply() functions when possible instead of a for loop. For loops in R are generally slower for large data sets, especially if you are repeatedly adding new values to a data frame using functions like cbind. It is better to preallocate a new matrix or data frame for the loop to fill. By preallocating space, you prevent R from having to copy and expand the object on every iteration.
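To make the preallocation point concrete, here is a sketch contrasting the two patterns; the growing version re-copies the vector on every iteration, while the preallocated version writes in place:

```
n <- 10000

# Slow pattern: the vector is copied and extended on every iteration.
grow <- c()
for (i in 1:n) grow <- c(grow, i^2)

# Faster pattern: allocate the full length once, then fill by index.
fill <- numeric(n)
for (i in 1:n) fill[i] <- i^2

identical(grow, fill)  # TRUE: same result, very different cost
```

For larger n, timing the two loops (e.g. with system.time()) makes the difference obvious.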

One way of getting around a for loop in our previous example is by using the ifelse function. The benefit of ifelse is that it is vectorized, meaning the condition is applied to a whole vector at once rather than one value at a time.

The ifelse function will read in a vector, check a condition, and then assign one value if the condition is true and a different value if false.

```
sales$product1Met <- ifelse(sales$product1 >= 45, 1, 0)
sales$product2Met <- ifelse(sales$product2 >= 45, 1, 0)
sales$product3Met <- ifelse(sales$product3 >= 45, 1, 0)
head(sales)
```

```
## product1 product2 product3 product1Met product2Met product3Met
## 1 36 43 35 0 0 0
## 2 39 43 41 0 0 0
## 3 50 54 37 1 1 0
## 4 40 52 48 0 1 1
## 5 47 41 36 1 0 0
## 6 45 46 44 1 1 0
```

While this can get repetitive when you are creating many new variables, in many cases the ifelse function is a sufficient option.
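If the repetition does become a concern, the three ifelse() calls can be collapsed into a single vectorized step over the product columns, which combines the ideas above (a sketch, using the sales data frame created earlier):

```
# Apply the same ifelse() condition to each product column at once,
# assigning all three new indicator columns in one step.
products <- c("product1", "product2", "product3")
sales[paste0(products, "Met")] <-
  lapply(sales[products], function(x) ifelse(x >= 45, 1, 0))
head(sales)
```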