Problem Set Key: R Stats Intro

PLS 206 — Applied Multivariate Modeling

Author

Grey Monroe

Published

January 1, 2026

Tip

Try the problems yourself before consulting this key. Working through errors is where most of the learning happens.


Part 1: Create and Manipulate a Data Frame

Q1. set.seed() fixes the starting point of R’s random number generator. Using the same seed ensures that code producing random values gives identical results every time it is run — essential for reproducible analyses.

?set.seed
# or
help(set.seed)
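
A minimal sketch of what Q1 describes, using an arbitrary seed (42 is illustrative, not the seed from the problem set): resetting the seed replays the same draws, while drawing again without resetting continues the random stream.

```r
# Draw three values after fixing the seed
set.seed(42)
first_draw <- rnorm(3)

# Resetting to the same seed replays exactly the same values
set.seed(42)
second_draw <- rnorm(3)
identical(first_draw, second_draw)   # TRUE

# Without resetting, the generator moves on to new values
third_draw <- rnorm(3)
identical(first_draw, third_draw)    # FALSE
```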

Q2. Create df and find the sum of all values.

set.seed(1522)
df <- data.frame(
  yield = rnorm(100, mean = 100, sd = 5),
  temp  = rnorm(100, mean = 10,  sd = 1)
)
sum(df)

Q3. Mean of yield:

mean(df$yield)

Q4. 8th value of yield:

df$yield[8]
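
As a quick sanity check on Q2 to Q4: sum() applied to an all-numeric data frame adds every cell, so it should match the column sums added together. The sketch below rebuilds df so that it runs on its own; the names total and by_column are illustrative.

```r
set.seed(1522)
df <- data.frame(
  yield = rnorm(100, mean = 100, sd = 5),
  temp  = rnorm(100, mean = 10,  sd = 1)
)

total     <- sum(df)                        # every cell in the data frame
by_column <- sum(df$yield) + sum(df$temp)   # same total, one column at a time
all.equal(total, by_column)                 # TRUE

# $ extracts a column as a plain vector, so [8] picks a single element
length(df$yield[8])                         # 1
```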

Part 2: Import and Examine the Wine Dataset

wine <- read.csv("winedata.csv", header = TRUE)
str(wine)

Q5. Dimensions:

dim(wine)   # 178 rows, 14 columns

Q6. Object type:

class(wine)   # "data.frame"

Q7. Variable types:

str(wine)
# a) Categorical:          Cultivar
# b) Quantitative integer: Mg or Proline
# c) Quantitative numeric: Alcohol, Ash, Color, Flav, OD, etc.
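
To read the types off programmatically rather than scanning the str() output, one option is sapply() with class(). This is a sketch on a hypothetical three-row stand-in for wine (the values and cultivar labels are invented for illustration); the same call should work on the real data frame.

```r
# Hypothetical stand-in for the wine data (values invented)
toy <- data.frame(
  Cultivar = c("A", "B", "A"),
  Mg       = c(100L, 98L, 127L),      # the L suffix makes these integers
  Alcohol  = c(14.2, 13.2, 13.1),
  stringsAsFactors = FALSE
)

# One class per column, returned as a named character vector
col_classes <- sapply(toy, class)
col_classes

# Names of the integer-valued columns
names(col_classes)[col_classes == "integer"]   # "Mg"
```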

Part 3: Identify and Fix Common Coding Errors

Q8. Bug: wrong capitalization. R is case-sensitive, so DF and df refer to different objects.

# Buggy
DF[1:5, 1:2]

# Fixed
df[1:5, 1:2]

Q9. Bug: missing comma between arguments in rbind().

# Buggy
dataframe <- rbind(vector.a vector.b)

# Fixed
vector.a  <- c(0.2, 0.5, 0.8, 0.01, 0.03)
vector.b  <- c(12, 7, 5, 4, 14)
dataframe <- rbind(vector.a, vector.b)
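
One thing worth noticing about the fixed code: despite the object's name, rbind() on two numeric vectors returns a matrix, not a data frame. A short sketch of the check, with as.data.frame() as one possible conversion (the name as_df is illustrative):

```r
vector.a  <- c(0.2, 0.5, 0.8, 0.01, 0.03)
vector.b  <- c(12, 7, 5, 4, 14)
dataframe <- rbind(vector.a, vector.b)

is.matrix(dataframe)       # TRUE
is.data.frame(dataframe)   # FALSE
dim(dataframe)             # 2 5  (one row per vector)

# Convert if a data frame is actually needed
as_df <- as.data.frame(dataframe)
is.data.frame(as_df)       # TRUE
```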

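Q10. Bug: missing $ operator between wine and Color, and missing comma after the row condition (without the comma, the brackets select columns rather than rows).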

# Buggy
color5 <- wine[wine Color >= 5]

# Fixed
color5 <- subset(wine, Color >= 5)
# or equivalently:
color5 <- wine[wine$Color >= 5, ]
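
The two fixes agree whenever Color has no missing values. If it does contain NA, they differ: subset() drops rows where the condition is NA, while bracket indexing keeps them as all-NA rows. A sketch on hypothetical toy data (the id column and the values are invented for illustration):

```r
# Hypothetical data with one missing Color value
toy <- data.frame(id = 1:4, Color = c(6.1, NA, 3.2, 5.0))

by_bracket <- toy[toy$Color >= 5, ]          # NA condition yields an all-NA row
by_subset  <- subset(toy, Color >= 5)        # NA condition is treated as FALSE
by_which   <- toy[which(toy$Color >= 5), ]   # which() also drops the NA

nrow(by_bracket)   # 3
nrow(by_subset)    # 2
nrow(by_which)     # 2
```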

Part 4: Calculate Basic Summary Statistics

Q11. Mean of all quantitative columns — Proline has the highest mean.

wine_means <- apply(wine[, 2:14], 2, mean)
wine_means
which.max(wine_means)   # Proline

Q12. SD of all quantitative columns — NonFlavPhenols has the lowest SD.

wine_sds <- apply(wine[, 2:14], 2, sd)
wine_sds
which.min(wine_sds)   # NonFlavPhenols
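
apply() converts the data frame to a matrix before looping over columns. An alternative that skips the conversion is colMeans() for the means and vapply() for the standard deviations. The sketch below uses a hypothetical three-column stand-in (values invented) rather than the real wine data:

```r
# Hypothetical stand-in for the quantitative wine columns (values invented)
toy <- data.frame(
  Alcohol        = c(13, 14, 12),
  Proline        = c(1000, 800, 600),
  NonFlavPhenols = c(0.3, 0.4, 0.2)
)

toy_means <- colMeans(toy)                 # column means
toy_sds   <- vapply(toy, sd, numeric(1))   # one sd per column, type-checked

names(which.max(toy_means))   # "Proline"
names(which.min(toy_sds))     # "NonFlavPhenols"

# Same means as the apply() approach
all.equal(toy_means, apply(toy, 2, mean))  # TRUE
```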

Part 5: Subset the Dataset

Q13. Reduce to selected columns and display the top rows.

# By column name (preferred — more readable)
wine_reduce <- wine[, c("Cultivar", "Alcohol", "Ash", "Color", "Flav", "Mg", "OD")]

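# Equivalent using column indices (assumes this column order in winedata.csv;
# confirm with names(wine) before relying on positions)
wine_reduce <- wine[, c(1, 2, 4, 5, 6, 9, 11)]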

head(wine_reduce)
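
Rather than counting positions by hand, match() can look the indices up from the names, which also makes it easy to confirm that the name-based and index-based subsets agree. A sketch on a hypothetical four-column stand-in (column values invented; keep, idx, by_name, and by_index are illustrative names):

```r
# Hypothetical stand-in with a column that should be dropped
toy <- data.frame(
  Cultivar  = c("A", "B"),
  Alcohol   = c(13.1, 12.4),
  MalicAcid = c(1.7, 2.1),
  Ash       = c(2.4, 2.2)
)

keep     <- c("Cultivar", "Alcohol", "Ash")
by_name  <- toy[, keep]

idx      <- match(keep, names(toy))   # positions of the kept columns: 1 2 4
by_index <- toy[, idx]

identical(by_name, by_index)          # TRUE
```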

Part 6: Examine Correlations Between Variables

Q14 & Q15. Scatter plot matrix — strongest correlations (by absolute value):

  1. Flav & OD: r = 0.787
  2. Color & Alcohol: r = 0.546
  3. OD & Color: r = −0.429

library(ggplot2)
library(GGally)

ggpairs(wine_reduce, columns = 2:7)
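
The ranking can also be computed instead of read off the plot. The sketch below builds the correlation matrix with cor(), keeps each pair once via the upper triangle, and sorts by absolute value. It runs on simulated data with hypothetical columns a to d, not on the wine data; on the real data the input would be the numeric columns of wine_reduce.

```r
# Simulated data: b tracks a closely, c is weakly opposed, d is unrelated
set.seed(1)
x   <- rnorm(50)
toy <- data.frame(
  a = x,
  b = x + rnorm(50, sd = 0.3),
  c = -x + rnorm(50, sd = 2),
  d = rnorm(50)
)

r     <- cor(toy)                                # full correlation matrix
pairs <- which(upper.tri(r), arr.ind = TRUE)     # each pair once, no diagonal

ranked <- data.frame(
  var1 = rownames(r)[pairs[, "row"]],
  var2 = colnames(r)[pairs[, "col"]],
  r    = r[upper.tri(r)]
)

# Strongest relationships first, regardless of sign
ranked <- ranked[order(abs(ranked$r), decreasing = TRUE), ]
head(ranked, 3)
```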

Part 7: Examine Data by Cultivar

Q16. Sample size per cultivar:

table(wine_reduce$Cultivar)
# Barbera: 48   Barolo: 59   Gringnolino: 71

Q17. Colored scatter plot matrix by cultivar:

ggpairs(wine_reduce, mapping = aes(colour = Cultivar))

Q18. Mean by cultivar:

aggregate(. ~ Cultivar, data = wine_reduce, FUN = mean)

# a) Clearly different across cultivars: Alcohol, Color, Flav, OD
# b) Mostly overlapping:                Ash, Mg
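
To see what table() and the formula form of aggregate() return, here is a sketch on a hypothetical five-row data set (group labels and values invented). The `. ~ Cultivar` formula means "every other column, grouped by Cultivar".

```r
# Hypothetical data: Flav differs between groups, Ash does not
toy <- data.frame(
  Cultivar = c("A", "A", "B", "B", "B"),
  Flav     = c(1, 3, 2, 4, 6),
  Ash      = c(2.0, 2.4, 2.1, 2.3, 2.2)
)

group_sizes <- table(toy$Cultivar)    # A: 2, B: 3
group_means <- aggregate(. ~ Cultivar, data = toy, FUN = mean)
group_means
# Flav means are 2 (A) and 4 (B); Ash means are 2.2 in both groups
```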

Part 8: Flav vs. OD

Q19. The overall correlation between Flav and OD is positive (~0.79), but this masks very different within-cultivar patterns:

# Overall
cor(wine_reduce$Flav, wine_reduce$OD)

# By cultivar
by(wine_reduce[, c("Flav", "OD")], wine_reduce$Cultivar,
   function(x) cor(x$Flav, x$OD))

# Barbera:     r ≈ −0.430
# Barolo:      r ≈ −0.089
# Gringnolino: r ≈ +0.580

The overall positive correlation is driven primarily by Gringnolino. Within Barolo the relationship is essentially flat (r ≈ −0.09), and within Barbera it is moderately negative (r ≈ −0.43). This is a clear example of Simpson’s Paradox, where a trend in the pooled data reverses or disappears within subgroups.
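
The same pattern can be reproduced with a small constructed data set. This sketch uses hypothetical groups A and B (not the wine cultivars): within each group x and y are perfectly negatively related, yet pooling them gives a positive correlation because group B sits higher on both variables.

```r
# Constructed data: y falls as x rises within each group,
# but group B is shifted upward on both x and y
toy <- data.frame(
  group = rep(c("A", "B"), each = 5),
  x     = c(1:5, 11:15),
  y     = c(5:1, 15:11)
)

# Pooled correlation ignores the grouping
pooled_r <- cor(toy$x, toy$y)

# Within-group correlations, one per group
within_r <- sapply(split(toy, toy$group), function(g) cor(g$x, g$y))

pooled_r > 0        # TRUE: positive when pooled
all(within_r < 0)   # TRUE: negative inside every group
```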