Module 1 In-Class Activities
PLS 206 — Applied Multivariate Modeling
No complete code is provided. Each activity gives you a starting point — you write the rest. There are no wrong answers, only things to discover.
Not all activities are meant for a single class session — pick based on time and where the class needs to slow down or speed up.
Activity 1 · Meet Your Data
Time: 10 minutes · Goal: Before you model anything, you need to know your data. This activity builds the habit of looking before fitting.
The dataset

```r
wine <- read.csv("winedata.csv")
```
Your tasks
Work through these in order. Each one should be a single line of R.
1. How big is the dataset?

```r
# How many rows and columns?
```

2. What are the variables?

```r
# Show variable names, types, and a few values
```

3. Are there any missing values?

```r
# Check for NAs — how many per column?
```

4. What does the outcome variable look like?

The column `quality` is the wine quality score (0–10). Get its summary statistics and make a quick plot of its distribution.

```r
# Summary statistics for quality
# Plot its distribution
```

5. Which variable is most correlated with quality?

Compute the correlation of each numeric predictor with `quality`. You can do this one at a time or with `cor()` on the whole data frame.

```r
# Your code here
```

6. Write one sentence:
“Based on the data, I think the best single predictor of wine quality is _______ because _______.”
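If task 5 stalls, here is one possible one-liner approach — a sketch that assumes the CSV has already loaded as `wine` and that the predictors are numeric (column names in your file may differ):

```r
# A sketch, assuming the data loaded as `wine` with a numeric `quality` column
num  <- wine[sapply(wine, is.numeric)]                    # keep numeric columns only
cors <- cor(num, use = "complete.obs")[, "quality"]       # correlations with quality
sort(cors[names(cors) != "quality"], decreasing = TRUE)   # drop quality vs. itself
```

`use = "complete.obs"` drops rows with NAs so the correlations still compute if task 3 turned up missing values.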
Discuss
- Did everyone pick the same predictor?
- Does correlation = causation here?
- What would you want to know about how `quality` was measured?
Activity 2 · Vectors & Data Frames from Scratch
Time: 8 minutes · Goal: Build a data frame by hand, manipulate it, and make a plot — no CSV, no built-in dataset.
Setup
Imagine you measured plant height (cm) for two treatments — watered normally and drought-stressed — with 5 plants each:
| Plant | Treatment | Height |
|---|---|---|
| 1–5 | Control | 22, 18, 25, 21, 19 |
| 6–10 | Drought | 14, 11, 16, 10, 13 |
Your tasks
1. Create a data frame called `plants` with columns `plant_id`, `treatment`, and `height`.

```r
# Build it with c() and data.frame()
```

2. Add a new column `height_m` that converts height to meters.

```r
# Add column using $
```

3. Subset to only drought plants and compute their mean height.

```r
# Subset, then mean()
```

4. Make a boxplot comparing the two treatments using base R `boxplot()` or ggplot2.

```r
library(ggplot2)
# Your plot
```

5. Run a t-test. Is the height difference significant?

```r
# t.test()
```

Extension: Add a third treatment (`heat_stress`, heights: 16, 12, 18, 11, 15) and redo the comparison with ANOVA.
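One way tasks 1–3 and 5 could come together — a sketch using the table above (the variable names are only suggestions):

```r
# Build the data frame by hand from the table in the activity
plants <- data.frame(
  plant_id  = 1:10,
  treatment = rep(c("Control", "Drought"), each = 5),
  height    = c(22, 18, 25, 21, 19, 14, 11, 16, 10, 13)
)
plants$height_m <- plants$height / 100                 # cm to m
mean(plants$height[plants$treatment == "Drought"])     # 12.8
t.test(height ~ treatment, data = plants)              # Welch two-sample t-test
```

The formula interface `height ~ treatment` saves you from subsetting the two groups by hand before the t-test.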
Activity 3 · The Anscombe Quartet
Time: 10 minutes · Goal: Prove to yourself that summary statistics lie. Always plot your data.
Step 1 — summarize without plotting
R has the Anscombe Quartet built in. Start by computing summary statistics only — no plots yet.
```r
data(anscombe)
head(anscombe)
```

For each of the four (x, y) pairs (x1/y1 through x4/y4), compute:
- Mean of x and y
- SD of x and y
- Correlation between x and y
- Slope from `lm(y ~ x)`

```r
# e.g. mean(anscombe$x1), cor(anscombe$x1, anscombe$y1), etc.
```

Write down the four sets of results. What do you notice?
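If you'd rather tabulate all four pairs in one pass, here is one compact option (an `sapply` over the dataset index — not the only way to do it):

```r
data(anscombe)
stats <- sapply(1:4, function(i) {
  x <- anscombe[[paste0("x", i)]]
  y <- anscombe[[paste0("y", i)]]
  c(mean_x = mean(x), mean_y = mean(y),
    sd_x   = sd(x),   sd_y   = sd(y),
    r      = cor(x, y),
    slope  = unname(coef(lm(y ~ x))[2]))
})
round(stats, 2)   # compare the four columns
```

All four datasets share mean(x) = 9, mean(y) ≈ 7.50, r ≈ 0.82, and slope ≈ 0.50 — which is exactly why Step 2 matters.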
Step 2 — now plot them
Make a 2×2 panel of scatterplots, one per dataset, each with a regression line.
```r
library(ggplot2)
library(tidyr)
library(dplyr)

# Reshape to long format
# Hint: you want columns: dataset, x, y
# Then:
# ggplot(long_df, aes(x, y)) +
#   geom_point() +
#   geom_smooth(method = "lm", se = FALSE) +
#   facet_wrap(~dataset)
```

If reshaping is taking too long, just make 4 separate plots with base R: `plot(anscombe$x1, anscombe$y1)` etc.
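If the reshape is the sticking point, one tidyr option uses `names_pattern` with the `.value` sentinel to split `x1…y4` into an x column, a y column, and a dataset label (a sketch — other reshapes work too):

```r
library(tidyr)
library(dplyr)
library(ggplot2)

data(anscombe)
long_df <- anscombe |>
  mutate(obs = row_number()) |>
  pivot_longer(-obs,
               names_to = c(".value", "dataset"),
               names_pattern = "([xy])(\\d)")

ggplot(long_df, aes(x, y)) +
  geom_point() +
  geom_smooth(method = "lm", se = FALSE) +
  facet_wrap(~ dataset)
```

`".value"` tells `pivot_longer()` that the first capture group ("x" or "y") names an output column rather than becoming data.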
Discuss
- Which dataset violates the assumptions of linear regression most severely?
- Which one has an outlier that’s completely controlling the slope?
- What would you do differently for each dataset if this were your real data?
Activity 4 · Spot the Bug
Time: 8 minutes · Goal: Build debugging instincts. Reading error messages is a skill — practice it.
The script
This script has 5 bugs. Find them all without running the code first. Then run it and fix them one at a time.
```r
# Load data
data(iris)

# Compute mean sepal length by species
tapply(iris$Sepal.Length, iris$species, mean)

# Subset to setosa only
setosa <- iris[iris$Species = "setosa", ]

# Run a t-test: does setosa sepal length differ from 5.0?
t.test(setosa$Sepal.Length, mu = 5, alternative = "two.tailed")

# Correlation between sepal length and petal length
cor_result <- cor.test(iris$Sepal.Length, iris$Petal.Length)
cat("r =", round(cor_result$estimate, 3),
    "p =", round(cor_result$pvalue, 4))

# Plot
library(ggplot2)
ggplot(iris, aes(x = Sepal.Length, y = Petal.Length, color = Species)) +
  geom_point
```

The bugs involve: column name case, comparison operator, argument string value, object element name, and a missing function call.
After you fix it
- What was the hardest bug to spot? Why?
- Which error message was most helpful? Which was most cryptic?
- What habit would have prevented each bug?
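Once the class has found all five, one corrected version for reference (each fix marked — treat this as a sketch, not the only valid style):

```r
library(ggplot2)

data(iris)

# Fix 1: column names are case-sensitive — Species, not species
tapply(iris$Sepal.Length, iris$Species, mean)

# Fix 2: comparison uses ==, not =
setosa <- iris[iris$Species == "setosa", ]

# Fix 3: the argument value is "two.sided", not "two.tailed"
t.test(setosa$Sepal.Length, mu = 5, alternative = "two.sided")

# Fix 4: the list element is p.value, not pvalue
cor_result <- cor.test(iris$Sepal.Length, iris$Petal.Length)
cat("r =", round(cor_result$estimate, 3),
    "p =", round(cor_result$p.value, 4))

# Fix 5: geom_point needs its parentheses to be called
ggplot(iris, aes(x = Sepal.Length, y = Petal.Length, color = Species)) +
  geom_point()
```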
Activity 5 · Interpret the Output
Time: 8 minutes · Goal: Read statistical output cold — no code, no context, just results. This is what peer review looks like.
Output A
```
        Welch Two Sample t-test

t = 3.142, df = 37.6, p-value = 0.003
alternative hypothesis: true difference in means is not equal to 0
95 percent confidence interval:
 0.821 3.679
sample estimates:
mean of x mean of y
    12.45     10.20
```
Answer without looking anything up:
- What is the null hypothesis being tested?
- What is your conclusion at α = 0.05?
- What is the estimated difference between groups?
- Is the effect large or small? What’s missing that would help you answer that?
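On the last question: the output reports the group means but not the group SDs or sample sizes, which is exactly what a standardized effect size needs. With hypothetical values filled in (the SDs and n below are made up purely for illustration — the output does not report them), Cohen's d could be sketched as:

```r
# Hypothetical values — the printed output does NOT report SDs or n
m1 <- 12.45; m2 <- 10.20   # group means (from the output)
s1 <- 2.8;   s2 <- 2.6     # group SDs (assumed for illustration)
n1 <- 20;    n2 <- 20      # group sizes (assumed for illustration)

# Pooled SD, then standardized mean difference
s_pooled <- sqrt(((n1 - 1) * s1^2 + (n2 - 1) * s2^2) / (n1 + n2 - 2))
(m1 - m2) / s_pooled       # Cohen's d
```

A significant p-value tells you the difference is unlikely under the null; only an effect size tells you whether it is big enough to care about.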
Output B
```
        Pearson's product-moment correlation

t = 8.943, df = 58, p-value < 0.001
95 percent confidence interval:
 0.621 0.843
sample estimates:
      cor
   0.7612
```
Answer:
- Is this a strong, moderate, or weak correlation?
- What does the confidence interval tell you that the p-value doesn’t?
- A student writes: “x causes y with 76% probability.” What’s wrong with this?
Activity 6 · The Confidence Interval Game
Time: 8 minutes · Goal: Make “95% confidence” real. You’ll generate 100 confidence intervals and count how many contain the true mean.
Your task
Fill in the blanks and run the code:
```r
set.seed(...)   # pick any number — don't tell your neighbor
true_mean <- 10
true_sd   <- 3
n         <- 30    # sample size per interval
n_sims    <- 100

results <- replicate(n_sims, {
  samp <- rnorm(..., mean = ..., sd = ...)
  ci <- t.test(samp)$conf.int
  c(lower = ci[1], upper = ci[2])
})
results <- as.data.frame(t(results))
results$captured <- results$lower < true_mean & results$upper > true_mean
sum(results$captured)
```

Plot them
```r
results$sim <- 1:n_sims
plot(range(c(results$lower, results$upper)), c(1, n_sims),
     type = "n", xlab = "Value", ylab = "Simulation",
     main = paste0(sum(results$captured), " / 100 intervals captured true mean"))
abline(v = true_mean, lwd = 2, lty = 2)
for (i in 1:n_sims) {
  segments(results$lower[i], i, results$upper[i], i,
           col = ifelse(results$captured[i], "steelblue", "red"), lwd = 1.2)
}
```

Discuss
- How many did you get? Shout it out.
- What does the spread of answers across the class tell you?
- Now change `n` to 10, then 100. What changes and what doesn’t?