Problem Set Key: R Stats Intro

PLS 206 — Applied Multivariate Modeling

Author

Grey Monroe

Published

January 1, 2026

Tip

Try the problems yourself before consulting this key. Working through errors is where most of the learning happens.

Part 1: Create and Manipulate a Data Frame

Q1. set.seed() fixes the starting point of R’s random number generator. Using the same seed ensures that code producing random values gives identical results every time it is run — essential for reproducible analyses.

?set.seed
# or
help(set.seed)
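
As a minimal sketch of what reproducibility means here, resetting the seed to the same value replays the same draws. The seed value 42 and the variable names are arbitrary choices for illustration, not part of the problem set.

```r
# Seed value 42 is arbitrary (an assumption for illustration)
set.seed(42)
first_draw <- rnorm(3)

set.seed(42)              # reset to the same starting point
second_draw <- rnorm(3)

identical(first_draw, second_draw)   # TRUE: same seed, same sequence

third_draw <- rnorm(3)    # no reset, so the generator keeps advancing
identical(first_draw, third_draw)    # FALSE
```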

Q2. Create df and find the sum of all values.

set.seed(1522)
df <- data.frame(
  yield = rnorm(100, mean = 100, sd = 5),
  temp  = rnorm(100, mean = 10,  sd = 1)
)
sum(df)

Q3. Mean of yield:

mean(df$yield)

Q4. 8th value of yield:

df$yield[8]
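
The calls in Q2 to Q4 can be cross-checked against equivalent forms. The sketch below rebuilds `df` exactly as in Q2 and shows that `sum()` on a data frame adds every cell, and that the 8th yield value can be reached in several ways. The extra object names (`total`, `eighth`) are illustrative only.

```r
set.seed(1522)
df <- data.frame(
  yield = rnorm(100, mean = 100, sd = 5),
  temp  = rnorm(100, mean = 10,  sd = 1)
)

# sum(df) adds all 200 cells, so it should match the sum of the column sums
total <- sum(df)
all.equal(total, sum(colSums(df)))       # TRUE

# Three equivalent ways to pull the 8th value of yield
eighth <- df$yield[8]
identical(eighth, df[8, "yield"])        # TRUE
identical(eighth, df[["yield"]][8])      # TRUE
```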

Part 2: Import and Examine the Wine Dataset

wine <- read.csv("winedata.csv", header = TRUE)
str(wine)

Q5. Dimensions:

dim(wine)   # 178 rows, 14 columns

Q6. Object type:

class(wine)   # "data.frame"

Q7. Variable types:

str(wine)
# a) Categorical:          Cultivar
# b) Quantitative integer: Mg or Proline
# c) Quantitative numeric: Alcohol, Ash, Color, Flav, OD, etc.
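
If the `str()` output is hard to scan, one option is to list the class of every column at once. The sketch below uses a small made-up data frame rather than `winedata.csv`, so the values are assumptions; only the column names echo the wine data.

```r
# Toy data frame; values are made up for illustration, not taken from winedata.csv
toy <- data.frame(
  Cultivar = c("A", "B", "A"),
  Mg       = c(100L, 95L, 110L),   # the L suffix makes these integers
  Alcohol  = c(13.2, 12.8, 14.1)
)

col_types <- sapply(toy, class)    # one class label per column
col_types
```

In R 4.0 or later, text columns are stored as character by default; earlier versions convert them to factor unless `stringsAsFactors = FALSE` is set. On the real data the same call would be `sapply(wine, class)`.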

Part 3: Identify and Fix Common Coding Errors

Q8. Bug: wrong capitalization (DF vs df).

# Buggy
DF[1:5, 1:2]

# Fixed
df[1:5, 1:2]

Q9. Bug: missing comma between arguments in rbind().

# Buggy
dataframe <- rbind(vector.a vector.b)

# Fixed
vector.a  <- c(0.2, 0.5, 0.8, 0.01, 0.03)
vector.b  <- c(12, 7, 5, 4, 14)
dataframe <- rbind(vector.a, vector.b)

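Q10. Bug: the column is referenced without the $ operator (wine Color instead of wine$Color), and the brackets lack the trailing comma that marks a row subset.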

# Buggy
color5 <- wine[wine Color >= 5]

# Fixed
color5 <- subset(wine, Color >= 5)
# or equivalently:
color5 <- wine[wine$Color >= 5, ]
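
The two fixes agree whenever the condition has no missing values, which may well be the case for the wine data. They differ when it does, as this sketch on made-up data shows. The toy values and object names are assumptions for illustration.

```r
# Toy data; values are made up for illustration
toy <- data.frame(
  Cultivar = c("A", "B", "C", "A"),
  Color    = c(4.2, 5.0, 7.8, NA)
)

by_subset  <- subset(toy, Color >= 5)
by_bracket <- toy[toy$Color >= 5, ]

nrow(by_subset)    # 2: subset() drops the row where the condition is NA
nrow(by_bracket)   # 3: bracket indexing keeps it as a row of NAs

# Wrapping the condition in which() makes the bracket form drop NAs as well
by_which <- toy[which(toy$Color >= 5), ]
nrow(by_which)     # 2
```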

Part 4: Calculate Basic Summary Statistics

Q11. Mean of all quantitative columns — Proline has the highest mean.

wine_means <- apply(wine[, 2:14], 2, mean)
wine_means
which.max(wine_means)   # Proline

Q12. SD of all quantitative columns — NonFlavPhenols has the lowest SD.

wine_sds <- apply(wine[, 2:14], 2, sd)
wine_sds
which.min(wine_sds)   # NonFlavPhenols
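
An alternative to `apply()` for column summaries is `sapply()`, which loops over the columns of a data frame directly. The sketch below uses made-up numbers, so the values are assumptions; it also shows how to extract just the column name from `which.max()` or `which.min()`.

```r
# Toy numeric data; values are made up for illustration
toy <- data.frame(
  Alcohol = c(13.2, 12.8, 14.1),
  Mg      = c(100, 95, 110),
  Proline = c(1065, 1050, 1185)
)

means_apply  <- apply(toy, 2, mean)    # converts to a matrix first
means_sapply <- sapply(toy, mean)      # iterates over columns directly
all.equal(means_apply, means_sapply)   # TRUE

# which.max() / which.min() return a named position; names() keeps only the label
top_mean  <- names(which.max(means_sapply))     # "Proline"
lowest_sd <- names(which.min(sapply(toy, sd)))  # "Alcohol"
```

Note that means and standard deviations on the raw scale depend on each variable's units, so a column can rank highest simply because it is measured in larger numbers.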

Part 5: Subset the Dataset

Q13. Reduce to selected columns and display the top rows.

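# By column name (preferred: more readable and robust to column order)
wine_reduce <- wine[, c("Cultivar", "Alcohol", "Ash", "Color", "Flav", "Mg", "OD")]

# Equivalent using column indices, provided these positions match the
# column order in your copy of the file (check with names(wine))
wine_reduce <- wine[, c(1, 2, 4, 5, 6, 9, 11)]

head(wine_reduce)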
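
One way to confirm that a set of indices matches a set of column names is to look the names up with `match()`. The sketch below uses a hypothetical four-column layout; the real column order in `winedata.csv` may differ.

```r
# Hypothetical column layout; the real order in winedata.csv may differ
toy <- data.frame(
  Cultivar = c("A", "B"),
  Alcohol  = c(13.2, 12.8),
  Malic    = c(1.7, 1.8),
  Ash      = c(2.4, 2.1)
)
wanted <- c("Cultivar", "Alcohol", "Ash")

# Positions of the wanted columns in the data frame
idx <- match(wanted, names(toy))
idx                                     # 1 2 4

# Selecting by those positions reproduces the by-name result
identical(toy[, wanted], toy[, idx])    # TRUE
```

On the real data, `match(names(wine_reduce), names(wine))` would report which positions were actually selected.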

Part 6: Examine Correlations Between Variables

Q14 & Q15. Scatter plot matrix — strongest correlations (by absolute value):

1. Flav & OD: r = 0.787
2. Color & Alcohol: r = 0.546
3. OD & Color: r = −0.429

library(ggplot2)
library(GGally)

ggpairs(wine_reduce, columns = 2:7)
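
Reading correlations off a plot matrix can be checked numerically. The sketch below ranks all variable pairs by absolute correlation using base R only. It runs on simulated placeholder variables (`x`, `y`, `z`), not the wine columns, and the seed and sample size are arbitrary assumptions.

```r
# Simulated data; x, y, z are placeholders, not wine columns
set.seed(1)
n <- 50
x <- rnorm(n)
toy <- data.frame(
  x = x,
  y = x + rnorm(n, sd = 0.5),   # built to be strongly related to x
  z = rnorm(n)                  # independent noise
)

r <- cor(toy)                                       # full correlation matrix
pairs_idx <- which(upper.tri(r), arr.ind = TRUE)    # each pair listed once

ranked <- data.frame(
  var1 = rownames(r)[pairs_idx[, "row"]],
  var2 = colnames(r)[pairs_idx[, "col"]],
  r    = r[pairs_idx]
)
ranked <- ranked[order(abs(ranked$r), decreasing = TRUE), ]
ranked
```

For the wine data the starting point would be `cor(wine_reduce[, 2:7])`, which excludes the categorical `Cultivar` column.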

Part 7: Examine Data by Cultivar

Q16. Sample size per cultivar:

table(wine_reduce$Cultivar)
# Barbera: 48   Barolo: 59   Gringnolino: 71

Q17. Colored scatter plot matrix by cultivar:

ggpairs(wine_reduce, aes(col = Cultivar))

Q18. Mean by cultivar:

aggregate(. ~ Cultivar, data = wine_reduce, FUN = mean)

# a) Clearly different across cultivars: Alcohol, Color, Flav, OD
# b) Mostly overlapping:                Ash, Mg
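
Whether group means are "clearly different" also depends on how much each group varies, so it can help to look at group standard deviations next to the means. The sketch below does this on made-up data; the cultivar labels and values are assumptions for illustration.

```r
# Toy data; cultivar labels and values are made up for illustration
toy <- data.frame(
  Cultivar = c("A", "A", "B", "B", "B"),
  Alcohol  = c(13.0, 14.0, 12.0, 12.5, 13.0),
  Ash      = c(2.4, 2.6, 2.2, 2.3, 2.4)
)

table(toy$Cultivar)   # counts per group: A = 2, B = 3

# Mean of every numeric column within each cultivar
group_means <- aggregate(. ~ Cultivar, data = toy, FUN = mean)
group_means

# Same call with sd gives the spread around each group mean
group_sds <- aggregate(. ~ Cultivar, data = toy, FUN = sd)
group_sds
```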

Part 8: Flav vs. OD

Q19. The overall correlation between Flav and OD is positive (~0.79), but this masks very different within-cultivar patterns:

# Overall
cor(wine_reduce$Flav, wine_reduce$OD)

# By cultivar
by(wine_reduce[, c("Flav", "OD")], wine_reduce$Cultivar,
   function(x) cor(x$Flav, x$OD))

# Barbera:     r ≈ −0.430
# Barolo:      r ≈ −0.089
# Gringnolino: r ≈ +0.580

The overall positive correlation is driven primarily by Gringnolino. Within Barbera the relationship is moderately negative, and within Barolo it is close to flat. This is an example of Simpson’s Paradox, in which a trend seen in the pooled data weakens, disappears, or reverses within subgroups.
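
The mechanism can be reproduced in a stylized simulation. This is a sketch on simulated data, not the wine measurements: three hypothetical groups whose centers rise together while `x` and `y` move in opposite directions inside each group. The group names, centers, noise level, and seed are all assumptions chosen to make the effect easy to see.

```r
# Simulated data, not the wine measurements
set.seed(206)
group  <- rep(c("g1", "g2", "g3"), each = 30)
center <- rep(c(0, 5, 10), each = 30)                   # group centers rise together
within <- rep(seq(-1, 1, length.out = 30), times = 3)   # position inside each group

sim <- data.frame(
  group = group,
  x = center + within,
  y = center - within + rnorm(90, sd = 0.2)   # opposite direction within a group
)

pooled_r <- cor(sim$x, sim$y)                   # positive: dominated by group centers
within_r <- sapply(split(sim, sim$group),
                   function(g) cor(g$x, g$y))   # negative in every group
pooled_r
within_r
```

The pooled correlation is positive because differences between group centers dominate, even though the relationship inside every group is negative. The wine data are less extreme, since one cultivar shows a positive within-group correlation.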
