Module 1 In-Class Activities
PLS 206 — Applied Multivariate Modeling
No complete code is provided. Each activity gives you a starting point — you write the rest. There are no wrong answers, only things to discover.
Not all activities are meant for a single class session — pick based on time and where the class needs to slow down or speed up.
Activity 1 · Meet Your Data
Time: 10 minutes · Goal: Before you model anything, you need to know your data. This activity builds the habit of looking before fitting.
The dataset

```r
wine <- read.csv("winedata.csv")
```
Your tasks
Work through these in order. Each one should be a single line of R.
1. How big is the dataset?

```r
# How many rows and columns?
```

2. What are the variables?

```r
# Show variable names, types, and a few values
```

3. Are there any missing values?

```r
# Check for NAs — how many per column?
```

4. What does the outcome variable look like?

The column `quality` is the wine quality score (0–10). Get its summary statistics and make a quick plot of its distribution.

```r
# Summary statistics for quality
# Plot its distribution
```

5. Which variable is most correlated with quality?

Compute the correlation of each numeric predictor with `quality`. You can do this one at a time or with `cor()` on the whole data frame.

```r
# Your code here
```

6. Write one sentence:
“Based on the data, I think the best single predictor of wine quality is _______ because _______.”
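If task 5 stalls, here is one possible one-liner approach — a sketch that assumes the CSV has already loaded as `wine` and that the predictors are numeric (column names in your file may differ):

```r
# A sketch, assuming the data loaded as `wine` with a numeric `quality` column
num  <- wine[sapply(wine, is.numeric)]                    # keep numeric columns only
cors <- cor(num, use = "complete.obs")[, "quality"]       # correlations with quality
sort(cors[names(cors) != "quality"], decreasing = TRUE)   # drop quality vs. itself
```

`use = "complete.obs"` drops rows with NAs so the correlations still compute if task 3 turned up missing values.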
Discuss
- Did everyone pick the same predictor?
- Does correlation = causation here?
- What would you want to know about how `quality` was measured?
Activity 2 · Vectors & Data Frames from Scratch
Time: 8 minutes · Goal: Build a data frame by hand, manipulate it, and make a plot — no CSV, no built-in dataset.
Setup
Imagine you measured plant height (cm) for two treatments — watered normally and drought-stressed — with 5 plants each:
| Plant | Treatment | Height |
|---|---|---|
| 1–5 | Control | 22, 18, 25, 21, 19 |
| 6–10 | Drought | 14, 11, 16, 10, 13 |
Your tasks
1. Create a data frame called `plants` with columns `plant_id`, `treatment`, and `height`.

```r
# Build it with c() and data.frame()
```

2. Add a new column `height_m` that converts height to meters.

```r
# Add column using $
```

3. Subset to only drought plants and compute their mean height.

```r
# Subset, then mean()
```

4. Make a boxplot comparing the two treatments using base R `boxplot()` or ggplot2.

```r
library(ggplot2)
# Your plot
```

5. Run a t-test. Is the height difference significant?

```r
# t.test()
```

Extension: Add a third treatment (`heat_stress`, heights: 16, 12, 18, 11, 15) and redo the comparison with ANOVA.
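One way tasks 1–3 and 5 could come together — a sketch using the table above (the variable names are only suggestions):

```r
# Build the data frame by hand from the table in the activity
plants <- data.frame(
  plant_id  = 1:10,
  treatment = rep(c("Control", "Drought"), each = 5),
  height    = c(22, 18, 25, 21, 19, 14, 11, 16, 10, 13)
)
plants$height_m <- plants$height / 100                 # cm to m
mean(plants$height[plants$treatment == "Drought"])     # 12.8
t.test(height ~ treatment, data = plants)              # Welch two-sample t-test
```

The formula interface `height ~ treatment` saves you from subsetting the two groups by hand before the t-test.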
Activity 3 · The Anscombe Quartet
Time: 10 minutes · Goal: Prove to yourself that summary statistics lie. Always plot your data.
Step 1 — summarize without plotting
R has the Anscombe Quartet built in. Start by computing summary statistics only — no plots yet.
```r
data(anscombe)
head(anscombe)
```

For each of the four (x, y) pairs (x1/y1 through x4/y4), compute:
- Mean of x and y
- SD of x and y
- Correlation between x and y
- Slope from `lm(y ~ x)`

```r
# e.g. mean(anscombe$x1), cor(anscombe$x1, anscombe$y1), etc.
```

Write down the four sets of results. What do you notice?
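If you'd rather tabulate all four pairs in one pass, here is one compact option (an `sapply` over the dataset index — not the only way to do it):

```r
data(anscombe)
stats <- sapply(1:4, function(i) {
  x <- anscombe[[paste0("x", i)]]
  y <- anscombe[[paste0("y", i)]]
  c(mean_x = mean(x), mean_y = mean(y),
    sd_x   = sd(x),   sd_y   = sd(y),
    r      = cor(x, y),
    slope  = unname(coef(lm(y ~ x))[2]))
})
round(stats, 2)   # compare the four columns
```

All four datasets share mean(x) = 9, mean(y) ≈ 7.50, r ≈ 0.82, and slope ≈ 0.50 — which is exactly why Step 2 matters.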
Step 2 — now plot them
Make a 2×2 panel of scatterplots, one per dataset, each with a regression line.
```r
library(ggplot2)
library(tidyr)
library(dplyr)

# Reshape to long format
# Hint: you want columns: dataset, x, y
# Then:
# ggplot(long_df, aes(x, y)) +
#   geom_point() +
#   geom_smooth(method = "lm", se = FALSE) +
#   facet_wrap(~dataset)
```

If reshaping is taking too long, just make 4 separate plots with base R: `plot(anscombe$x1, anscombe$y1)` etc.
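If the reshape is the sticking point, one tidyr option uses `names_pattern` with the `.value` sentinel to split `x1…y4` into an x column, a y column, and a dataset label (a sketch — other reshapes work too):

```r
library(tidyr)
library(dplyr)
library(ggplot2)

data(anscombe)
long_df <- anscombe |>
  mutate(obs = row_number()) |>
  pivot_longer(-obs,
               names_to = c(".value", "dataset"),
               names_pattern = "([xy])(\\d)")

ggplot(long_df, aes(x, y)) +
  geom_point() +
  geom_smooth(method = "lm", se = FALSE) +
  facet_wrap(~ dataset)
```

`".value"` tells `pivot_longer()` that the first capture group ("x" or "y") names an output column rather than becoming data.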
Discuss
- Which dataset violates the assumptions of linear regression most severely?
- Which one has an outlier that’s completely controlling the slope?
- What would you do differently for each dataset if this were your real data?
Activity 4 · Spot the Bug
Time: 8 minutes · Goal: Build debugging instincts. Reading error messages is a skill — practice it.
The script
This script has 5 bugs. Find them all without running the code first. Then run it and fix them one at a time.
```r
# Load data
data(iris)

# Compute mean sepal length by species
tapply(iris$Sepal.Length, iris$species, mean)

# Subset to setosa only
setosa <- iris[iris$Species = "setosa", ]

# Run a t-test: does setosa sepal length differ from 5.0?
t.test(setosa$Sepal.Length, mu = 5, alternative = "two.tailed")

# Correlation between sepal length and petal length
cor_result <- cor.test(iris$Sepal.Length, iris$Petal.Length)
cat("r =", round(cor_result$estimate, 3),
    "p =", round(cor_result$pvalue, 4))

# Plot
library(ggplot2)
ggplot(iris, aes(x = Sepal.Length, y = Petal.Length, color = Species)) +
  geom_point
```

The bugs involve: column name case, comparison operator, argument string value, object element name, and a missing function call.
After you fix it
- What was the hardest bug to spot? Why?
- Which error message was most helpful? Which was most cryptic?
- What habit would have prevented each bug?
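Once the class has found all five, one corrected version for reference (each fix marked — treat this as a sketch, not the only valid style):

```r
library(ggplot2)

data(iris)

# Fix 1: column names are case-sensitive — Species, not species
tapply(iris$Sepal.Length, iris$Species, mean)

# Fix 2: comparison uses ==, not =
setosa <- iris[iris$Species == "setosa", ]

# Fix 3: the argument value is "two.sided", not "two.tailed"
t.test(setosa$Sepal.Length, mu = 5, alternative = "two.sided")

# Fix 4: the list element is p.value, not pvalue
cor_result <- cor.test(iris$Sepal.Length, iris$Petal.Length)
cat("r =", round(cor_result$estimate, 3),
    "p =", round(cor_result$p.value, 4))

# Fix 5: geom_point needs its parentheses to be called
ggplot(iris, aes(x = Sepal.Length, y = Petal.Length, color = Species)) +
  geom_point()
```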
Activity 5 · Interpret the Output
Time: 8 minutes · Goal: Read statistical output cold — no code, no context, just results. This is what peer review looks like.
Output A
```
        Welch Two Sample t-test

t = 3.142, df = 37.6, p-value = 0.003
alternative hypothesis: true difference in means is not equal to 0
95 percent confidence interval:
 0.821 3.679
sample estimates:
mean of x mean of y
    12.45     10.20
```
Answer without looking anything up:
- What is the null hypothesis being tested?
- What is your conclusion at α = 0.05?
- What is the estimated difference between groups?
- Is the effect large or small? What’s missing that would help you answer that?
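On the last question: the output reports the group means but not the group SDs or sample sizes, which is exactly what a standardized effect size needs. With hypothetical values filled in (the SDs and n below are made up purely for illustration — the output does not report them), Cohen's d could be sketched as:

```r
# Hypothetical values — the printed output does NOT report SDs or n
m1 <- 12.45; m2 <- 10.20   # group means (from the output)
s1 <- 2.8;   s2 <- 2.6     # group SDs (assumed for illustration)
n1 <- 20;    n2 <- 20      # group sizes (assumed for illustration)

# Pooled SD, then standardized mean difference
s_pooled <- sqrt(((n1 - 1) * s1^2 + (n2 - 1) * s2^2) / (n1 + n2 - 2))
(m1 - m2) / s_pooled       # Cohen's d
```

A significant p-value tells you the difference is unlikely under the null; only an effect size tells you whether it is big enough to care about.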
Output B
```
        Pearson's product-moment correlation

t = 8.943, df = 58, p-value < 0.001
95 percent confidence interval:
 0.621 0.843
sample estimates:
      cor
   0.7612
```
Answer:
- Is this a strong, moderate, or weak correlation?
- What does the confidence interval tell you that the p-value doesn’t?
- A student writes: “x causes y with 76% probability.” What’s wrong with this?
Activity 6 · The Confidence Interval Game
Time: 8 minutes · Goal: Make “95% confidence” real. You’ll generate 100 confidence intervals and count how many contain the true mean.
Your task
Fill in the blanks and run the code:
```r
set.seed(...)   # pick any number — don't tell your neighbor
true_mean <- 10
true_sd   <- 3
n         <- 30    # sample size per interval
n_sims    <- 100

results <- replicate(n_sims, {
  samp <- rnorm(..., mean = ..., sd = ...)
  ci <- t.test(samp)$conf.int
  c(lower = ci[1], upper = ci[2])
})
results <- as.data.frame(t(results))
results$captured <- results$lower < true_mean & results$upper > true_mean
sum(results$captured)
```

Plot them
```r
results$sim <- 1:n_sims
plot(range(c(results$lower, results$upper)), c(1, n_sims),
     type = "n", xlab = "Value", ylab = "Simulation",
     main = paste0(sum(results$captured), " / 100 intervals captured true mean"))
abline(v = true_mean, lwd = 2, lty = 2)
for (i in 1:n_sims) {
  segments(results$lower[i], i, results$upper[i], i,
           col = ifelse(results$captured[i], "steelblue", "red"), lwd = 1.2)
}
```

Discuss
- How many did you get? Shout it out.
- What does the spread of answers across the class tell you?
- Now change `n` to 10, then 100. What changes and what doesn’t?