Problem Set Key: R Stats Intro
PLS 206: Applied Multivariate Modeling
Try the problems yourself before consulting this key. Working through errors is where most of the learning happens.
Part 1: Create and Manipulate a Data Frame
Q1. set.seed() fixes the starting point of R’s random number generator. Using the same seed ensures that code producing random values gives identical results every time it is run, which is essential for reproducible analyses.
# Open the help page for set.seed()
?set.seed
# or
help(set.seed)
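The effect of the seed is easy to check directly. The following is a minimal sketch; the seed value 42 and the object names draw1, draw2, and draw3 are arbitrary choices for illustration and are not part of the problem set.

```r
# Same seed, same draws: resetting the seed restarts the generator
set.seed(42)
draw1 <- rnorm(3)

set.seed(42)
draw2 <- rnorm(3)

# Drawing again WITHOUT resetting the seed continues the sequence
draw3 <- rnorm(3)

identical(draw1, draw2)  # TRUE
identical(draw1, draw3)  # FALSE
```

This is why the key calls set.seed() immediately before building df: anyone who runs the same lines in the same order gets the same 100 values.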
Q2. Create df and find the sum of all values.
set.seed(1522)
df <- data.frame(
  yield = rnorm(100, mean = 100, sd = 5),
  temp = rnorm(100, mean = 10, sd = 1)
)
sum(df)
Q3. Mean of yield:
mean(df$yield)
Q4. 8th value of yield:
df$yield[8]
Part 2: Import and Examine the Wine Dataset
wine <- read.csv("winedata.csv", header = TRUE)
str(wine)
Q5. Dimensions:
dim(wine) # 178 rows, 14 columns
Q6. Object type:
class(wine) # "data.frame"
Q7. Variable types:
str(wine)
# a) Categorical: Cultivar
# b) Quantitative integer: Mg or Proline
# c) Quantitative numeric: Alcohol, Ash, Color, Flav, OD, etc.
Part 3: Identify and Fix Common Coding Errors
Q8. Bug: wrong capitalization (DF vs df).
# Buggy
DF[1:5, 1:2]
# Fixed
df[1:5, 1:2]
Q9. Bug: missing comma between arguments in rbind().
# Buggy
dataframe <- rbind(vector.a vector.b)
# Fixed
vector.a <- c(0.2, 0.5, 0.8, 0.01, 0.03)
vector.b <- c(12, 7, 5, 4, 14)
dataframe <- rbind(vector.a, vector.b) # note: rbind() on two vectors returns a matrix, despite the name
Q10. Bug: missing $ operator, and no comma after the row condition inside the brackets.
# Buggy
color5 <- wine[wine Color >= 5]
# Fixed
color5 <- subset(wine, Color >= 5)
# or equivalently:
color5 <- wine[wine$Color >= 5, ]
Part 4: Calculate Basic Summary Statistics
Q11. Mean of all quantitative columns: Proline has the highest mean.
wine_means <- apply(wine[, 2:14], 2, mean)
wine_means
which.max(wine_means) # Proline
Q12. SD of all quantitative columns: NonFlavPhenols has the lowest SD.
wine_sds <- apply(wine[, 2:14], 2, sd)
wine_sds
which.min(wine_sds) # NonFlavPhenols
Part 5: Subset the Dataset
Q13. Reduce to selected columns and display the top rows.
# By column name (preferred — more readable)
wine_reduce <- wine[, c("Cultivar", "Alcohol", "Ash", "Color", "Flav", "Mg", "OD")]
# Equivalent using column indices (confirm the positions with names(wine) first)
wine_reduce <- wine[, c(1, 2, 4, 5, 6, 9, 11)]
head(wine_reduce)
Part 6: Examine Correlations Between Variables
Q14 & Q15. Scatter plot matrix — strongest correlations (by absolute value):
- Flav & OD: r = 0.787
- Color & Alcohol: r = 0.546
- OD & Color: r = −0.429
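The r values listed above can be cross-checked numerically rather than read off the plot panels. The sketch below is one way to do that; top_correlations() is a hypothetical helper written for this key, and the built-in iris data stands in for the wine data so the example runs without winedata.csv.

```r
# Hypothetical helper: rank variable pairs by absolute Pearson correlation
top_correlations <- function(data, n = 3) {
  r <- cor(data)                              # full correlation matrix
  idx <- which(upper.tri(r), arr.ind = TRUE)  # each pair once, no diagonal
  pairs <- data.frame(
    var1 = rownames(r)[idx[, "row"]],
    var2 = colnames(r)[idx[, "col"]],
    r    = r[idx]
  )
  # Sort by strength of association, ignoring sign
  pairs <- pairs[order(abs(pairs$r), decreasing = TRUE), ]
  rownames(pairs) <- NULL
  head(pairs, n)
}

# Stand-in data: the four numeric columns of the built-in iris data set
top_iris <- top_correlations(iris[, 1:4])
top_iris
```

With the wine data loaded, top_correlations(wine_reduce[, 2:7]) should return the same ranking as the list above, assuming wine_reduce has Cultivar in column 1 and numeric variables in columns 2 to 7.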
library(ggplot2)
library(GGally)
ggpairs(wine_reduce, columns = 2:7)
Part 7: Examine Data by Cultivar
Q16. Sample size per cultivar:
table(wine_reduce$Cultivar)
# Barbera: 48 Barolo: 59 Grignolino: 71
Q17. Colored scatter plot matrix by cultivar:
ggpairs(wine_reduce, aes(col = Cultivar))
Q18. Mean by cultivar:
aggregate(. ~ Cultivar, data = wine_reduce, FUN = mean)
# a) Clearly different across cultivars: Alcohol, Color, Flav, OD
# b) Mostly overlapping: Ash, Mg
Part 8: Flav vs. OD
Q19. The overall correlation between Flav and OD is positive (~0.79), but this masks very different within-cultivar patterns:
# Overall
cor(wine_reduce$Flav, wine_reduce$OD)
# By cultivar
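by(wine_reduce[, c("Flav", "OD")], wine_reduce$Cultivar,
   function(x) cor(x$Flav, x$OD))
# Barbera: r ≈ −0.430
# Barolo: r ≈ −0.089
# Grignolino: r ≈ +0.580
Within cultivars, only Grignolino shows a clear positive relationship. Within Barbera the relationship is moderately negative (r ≈ −0.43), and within Barolo it is essentially flat (r ≈ −0.09). The cultivars also differ in mean Flav and OD (Q18), and pooling groups with different means can by itself inflate the overall correlation. This is a clear example of Simpson’s Paradox, where a trend seen in the pooled data reverses or disappears within subgroups.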
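The mechanism is easier to see in a simulation where the group structure is known. The sketch below uses made-up data, not the wine data: the group labels A, B, and C, the centers, the slope of -0.8, and the seed are all assumptions chosen for illustration.

```r
set.seed(1)  # arbitrary seed so the simulated values are reproducible

# Three hypothetical groups whose centers rise together in x and y,
# while y falls with x inside each group
n <- 50
centers <- c(A = 0, B = 5, C = 10)
sim <- do.call(rbind, lapply(names(centers), function(g) {
  x <- rnorm(n, mean = centers[[g]], sd = 1)
  y <- centers[[g]] - 0.8 * (x - centers[[g]]) + rnorm(n, sd = 0.5)
  data.frame(group = g, x = x, y = y)
}))

# Pooled correlation: dominated by the differences between group centers
overall_r <- cor(sim$x, sim$y)

# Within-group correlations: each reflects the negative within-group slope
within_r <- sapply(split(sim, sim$group), function(d) cor(d$x, d$y))

overall_r  # positive
within_r   # all negative
```

Here the reversal is complete because every group has a negative slope. The wine data show a milder version, with one cultivar positive, one negative, and one near zero.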