Seminar Slides: The Bayesian Approach to Data Analysis | Fridtjof Thomas

Please see below the slides from the seminar on May 15th, 2019 by Fridtjof Thomas.

 

Seminar Slides Bayesian Approach to Data Analysis

Seminar Slides: P-Value Discussion | Dr. Saunak Sen

Please see below for the slides from Dr. Sen’s seminar discussion.

p-value-discussion

 

P-values – What they are and what they are not – Seminar Presentation Slides – Fridtjof Thomas

This seminar, “P-values: What they are and what they are not,” looks in detail at good examples of using p-values and how to interpret them. After reviewing widely understood problems with p-values, it draws attention to regularly encountered uses of p-values where the correct interpretation is less clear. Furthermore, we demonstrate why p-values are not meaningful measures of support for specific hypotheses. Guidance on good statistical data analysis is based on the American Statistical Association’s statement on p-values. A hierarchy of scientific evidence compiled by the Oxford Centre for Evidence-Based Medicine is reviewed to re-emphasize the role of thoughtful statistical analyses in scientific and medical discovery.

Presentation Slides- PDF

 

Presented April 30th, 2019.

Anatomy of a Diagnostic Test – an R Shiny Example by Saunak Sen

The positive and negative predictive values of a diagnostic test depend not only on its sensitivity and specificity, but also on the prevalence of the disease (or the pre-test probability of disease).  This interactive display illustrates that interdependence.

Move the sensitivity, specificity, and disease prevalence sliders, and watch the positive and negative predictive value sliders change in response.  The mosaic plot on the right shows the positive tests (T+, in maroon) and negative tests (T-, in orange) in a population.

Diseased individuals (D+) are in the right column and non-diseased individuals (D-) are in the left column.
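The dependence described above follows from Bayes' rule. As a rough sketch (the function name predictive_values and the example inputs are illustrative assumptions, not taken from the Shiny app), the two predictive values can be computed directly from sensitivity, specificity, and prevalence:

```r
# Hypothetical helper: predictive values of a test via Bayes' rule.
# sens = P(T+ | D+), spec = P(T- | D-), prev = P(D+)
predictive_values <- function(sens, spec, prev) {
  # P(D+ | T+): true positives over all positive tests
  ppv <- sens * prev / (sens * prev + (1 - spec) * (1 - prev))
  # P(D- | T-): true negatives over all negative tests
  npv <- spec * (1 - prev) / (spec * (1 - prev) + (1 - sens) * prev)
  c(ppv = ppv, npv = npv)
}

# A fairly accurate test applied to a rare disease (1% prevalence)
pv <- predictive_values(sens = 0.9, spec = 0.9, prev = 0.01)
print(round(pv, 3))
##   ppv   npv 
## 0.083 0.999
```

Even with 90% sensitivity and specificity, only about 8% of positive tests come from diseased individuals when the disease is this rare, which is the effect the prevalence slider makes visible.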

Equivalent of SAS Macros in R – Loops and Functions by Courtney Gale

If you’re an avid SAS user, you’re likely very familiar with SAS macros. SAS macros are a key component of efficient and concise SAS code. Although R has no direct equivalent of the SAS macro facility, it offers features such as functions and loops that can perform the same tasks.

Using apply() to loop over variables

In SAS, if we wanted to run multiple linear regressions using different predictor variables, we could use a simple SAS macro to iterate over the independent variables. In R, we can simplify this even more by using the apply() function. apply() is part of base R and is one member of the apply family of functions. The members of this family (which also includes lapply(), sapply(), mapply(), etc.) differ in the data structures they accept as input and return as output.

apply(X, MARGIN, FUN, …) takes three main arguments.

  • X is an array or matrix (a data frame is coerced to a matrix).
  • MARGIN indicates whether the function is applied over rows (MARGIN = 1) or columns (MARGIN = 2).
  • FUN is the function to apply. Any R function can be used, including those created by the user.
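To make the margin argument concrete, here is a minimal sketch on a small matrix (the matrix m and the result names are made up for illustration):

```r
# A 2 x 3 matrix filled column by column:
#      [,1] [,2] [,3]
# [1,]    1    3    5
# [2,]    2    4    6
m <- matrix(1:6, nrow = 2)

row_sums <- apply(m, 1, sum)  # margin 1: one result per row
col_sums <- apply(m, 2, sum)  # margin 2: one result per column

print(row_sums)
## [1]  9 12
print(col_sums)
## [1]  3  7 11
```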

In this example, we will use the R dataset mtcars (first 6 rows shown below).

data(mtcars)
head(mtcars)
##                    mpg cyl disp  hp drat    wt  qsec vs am gear carb
## Mazda RX4         21.0   6  160 110 3.90 2.620 16.46  0  1    4    4
## Mazda RX4 Wag     21.0   6  160 110 3.90 2.875 17.02  0  1    4    4
## Datsun 710        22.8   4  108  93 3.85 2.320 18.61  1  1    4    1
## Hornet 4 Drive    21.4   6  258 110 3.08 3.215 19.44  1  0    3    1
## Hornet Sportabout 18.7   8  360 175 3.15 3.440 17.02  0  0    3    2
## Valiant           18.1   6  225 105 2.76 3.460 20.22  1  0    3    1

Returning to the question posed earlier: how can we iterate over variables to run multiple regressions with different predictor variables?

The following apply() call takes the dataset mtcars and selects the variables cyl (number of cylinders), disp (displacement), and wt (weight) as the variables we want to apply the function to. We specify the margin as 2 so that it iterates over the 3 columns. Finally, we specify a user-defined function that takes the independent variable as a parameter and returns the summary of a linear model in which mpg (miles per gallon) is the outcome variable.

apply(mtcars[, c("cyl", "disp", "wt")], 2, 
      function(ind) {summary(lm(mpg ~ ind, data = mtcars))})
## $cyl
## 
## Call:
## lm(formula = mpg ~ ind, data = mtcars)
## 
## Residuals:
##     Min      1Q  Median      3Q     Max 
## -4.9814 -2.1185  0.2217  1.0717  7.5186 
## 
## Coefficients:
##             Estimate Std. Error t value Pr(>|t|)    
## (Intercept)  37.8846     2.0738   18.27  < 2e-16 ***
## ind          -2.8758     0.3224   -8.92 6.11e-10 ***
## ---
## Signif. codes:  0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
## 
## Residual standard error: 3.206 on 30 degrees of freedom
## Multiple R-squared:  0.7262, Adjusted R-squared:  0.7171 
## F-statistic: 79.56 on 1 and 30 DF,  p-value: 6.113e-10
## 
## 
## $disp
## 
## Call:
## lm(formula = mpg ~ ind, data = mtcars)
## 
## Residuals:
##     Min      1Q  Median      3Q     Max 
## -4.8922 -2.2022 -0.9631  1.6272  7.2305 
## 
## Coefficients:
##              Estimate Std. Error t value Pr(>|t|)    
## (Intercept) 29.599855   1.229720  24.070  < 2e-16 ***
## ind         -0.041215   0.004712  -8.747 9.38e-10 ***
## ---
## Signif. codes:  0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
## 
## Residual standard error: 3.251 on 30 degrees of freedom
## Multiple R-squared:  0.7183, Adjusted R-squared:  0.709 
## F-statistic: 76.51 on 1 and 30 DF,  p-value: 9.38e-10
## 
## 
## $wt
## 
## Call:
## lm(formula = mpg ~ ind, data = mtcars)
## 
## Residuals:
##     Min      1Q  Median      3Q     Max 
## -4.5432 -2.3647 -0.1252  1.4096  6.8727 
## 
## Coefficients:
##             Estimate Std. Error t value Pr(>|t|)    
## (Intercept)  37.2851     1.8776  19.858  < 2e-16 ***
## ind          -5.3445     0.5591  -9.559 1.29e-10 ***
## ---
## Signif. codes:  0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
## 
## Residual standard error: 3.046 on 30 degrees of freedom
## Multiple R-squared:  0.7528, Adjusted R-squared:  0.7446 
## F-statistic: 91.38 on 1 and 30 DF,  p-value: 1.294e-10
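One side effect of the approach above is that every coefficient row is labeled ind rather than with the actual variable name. A possible alternative, sketched here under the assumption that looping over the column names is acceptable, builds each formula with reformulate() from base R and fits the models with lapply(). The object names predictors, fits, and slopes are illustrative.

```r
data(mtcars)

# Loop over variable names instead of column values
predictors <- c("cyl", "disp", "wt")

fits <- lapply(predictors, function(v) {
  # reformulate("wt", response = "mpg") produces the formula mpg ~ wt
  lm(reformulate(v, response = "mpg"), data = mtcars)
})
names(fits) <- predictors

# Extract the slope from each fitted model
slopes <- sapply(fits, function(f) coef(f)[[2]])
print(round(slopes, 4))

# Coefficient rows now carry the real variable name
print(rownames(coef(summary(fits$wt))))
```

The slopes should agree with the estimates in the output above; calling summary(fits$wt) gives the same model summary, with wt in place of ind.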

Using a for loop to iterate over variable names

Let’s say we want to examine the sales of 3 products over 100 days. Our goal is to sell at least 45 units of each product per day. We can use a for loop to create new dummy variables that indicate whether we sold 45 or more units on a given day.

First, we create a sample data set from randomly generated integers:

set.seed(22)
product1 <- sample(30:50, 100, replace = TRUE)
product2 <- sample(40:60, 100, replace = TRUE)
product3 <- sample(35:48, 100, replace = TRUE)

sales <- as.data.frame(cbind(product1, product2, product3))
head(sales)
##   product1 product2 product3
## 1       36       43       35
## 2       39       43       41
## 3       50       54       37
## 4       40       52       48
## 5       47       41       36
## 6       45       46       44

Then we can use a for loop to iterate over the variable names in the dataset. The paste() function lets us build a new variable name from the old variable name and the condition. We then assign a 0 or 1 to the new variable depending on whether the sales goal of 45 or more was met.

for (p in names(sales)) {
  # Name each indicator after the condition it encodes (>= 45, not > 45)
  sales[[paste(p, ">=45", sep = "")]] <- as.numeric(sales[[p]] >= 45)
}

print(head(sales))
##   product1 product2 product3 product1>=45 product2>=45 product3>=45
## 1       36       43       35            0            0            0
## 2       39       43       41            0            0            0
## 3       50       54       37            1            1            0
## 4       40       52       48            0            1            1
## 5       47       41       36            1            0            0
## 6       45       46       44            1            1            0

Problems with using for loops in R

When possible, it is generally preferable to use one of the apply() functions or a vectorized function instead of a for loop. For loops in R can be slow on large data sets, especially if each iteration adds new values to a data frame using functions like cbind(). If you do need a loop, it is better to preallocate a matrix or data frame of the final size for the loop to fill. Preallocating space prevents R from having to copy and expand the object on every iteration.
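The difference between growing and preallocating can be sketched as follows. The vector names and the choice of n are arbitrary; both loops produce the same values, and only the memory handling differs.

```r
n <- 1000

# Growing: each c() call allocates a new, longer vector and copies the old one
grown <- c()
for (i in seq_len(n)) {
  grown <- c(grown, i^2)
}

# Preallocating: the vector is created once at full length and filled in place
prealloc <- numeric(n)
for (i in seq_len(n)) {
  prealloc[i] <- i^2
}

print(identical(grown, prealloc))
## [1] TRUE
```

For larger n, wrapping each loop in system.time() is one way to compare the two approaches on your own machine.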

The ifelse() function

One way to avoid the for loop in our previous example is to use the ifelse() function. The benefit of ifelse() is that it is vectorized, meaning the condition is applied to a whole vector at once rather than to one value at a time.

The ifelse() function takes a vector, checks a condition on each element, and returns one value where the condition is true and a different value where it is false. The code below starts again from the original three-column sales data frame.

sales$product1Met <- ifelse(sales$product1 >= 45, 1, 0)
sales$product2Met <- ifelse(sales$product2 >= 45, 1, 0)
sales$product3Met <- ifelse(sales$product3 >= 45, 1, 0)

head(sales)
##   product1 product2 product3 product1Met product2Met product3Met
## 1       36       43       35           0           0           0
## 2       39       43       41           0           0           0
## 3       50       54       37           1           1           0
## 4       40       52       48           0           1           1
## 5       47       41       36           1           0           0
## 6       45       46       44           1           1           0

Although this becomes repetitive when you are creating many new variables, the ifelse() function is often a sufficient option.
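When the repetition does become a burden, one possible way to combine the two ideas is to apply the vectorized comparison to every column with lapply(). This is only a sketch: the small data frame sales_demo and the names goal and met are made up for illustration and are not the sales data generated earlier.

```r
# Small illustrative data set (not the randomly generated sales data)
sales_demo <- data.frame(
  product1 = c(36, 50, 45),
  product2 = c(43, 54, 41)
)
goal <- 45

# One vectorized comparison per column; as.integer() turns TRUE/FALSE into 1/0
met <- as.data.frame(lapply(sales_demo, function(x) as.integer(x >= goal)))
names(met) <- paste0(names(sales_demo), "Met")

sales_demo <- cbind(sales_demo, met)
print(sales_demo)
##   product1 product2 product1Met product2Met
## 1       36       43           0           0
## 2       50       54           1           1
## 3       45       41           1           0
```

Because the new columns are built in a single step and bound on once, this avoids both the repeated ifelse() lines and the repeated growth of the data frame inside a loop.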