Problem Set: R Stats Intro

PLS 206 — Applied Multivariate Modeling

Author

Grey Monroe

Published

January 1, 2026

Note

Due: Wednesday of Module 2 by 11:59 PM — submitted through Canvas.

The R scripts from Module 1 lectures demonstrate the code you will need. Consult them as you work through this assignment.

Submission Instructions

Upload two files to Canvas:

1. Written answers: a .pdf document with your written responses, screenshots, and interpretations.
2. R script: a .R file with all the code you used to produce your answers.

List the first and last names of any collaborators in both files.

Naming convention:

File             Name
R script         PSRStatsIntro_emailID.R
Written answers  PSRStatsIntro_emailID.pdf

where emailID is the part of your UC Davis email before the @.

Example: Grey Monroe would submit PSRStatsIntro_gmonroe.R and PSRStatsIntro_gmonroe.pdf.

Part 1: Create and Manipulate a Data Frame

Q1. Use ?set.seed or Google to look up the set.seed() function. Explain in your own words what it does and why it matters for reproducible research.
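
One way to explore the behavior before writing your answer is to reset the seed and compare draws. This is only a minimal sketch: the seed value 42 and the variable names are arbitrary choices for illustration and are not part of the assignment.

```r
# Draw three random normal values after setting a seed
set.seed(42)
first_draw <- rnorm(3)

# Reset the same seed and draw again
set.seed(42)
second_draw <- rnorm(3)

# Draw once more WITHOUT resetting the seed
third_draw <- rnorm(3)

identical(first_draw, second_draw)  # TRUE: same seed, same sequence
identical(first_draw, third_draw)   # FALSE: the generator has moved on
```

Your written answer should still explain, in your own words, why this matters when someone else reruns your script.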

Q2. Create a data frame called df using the code below. Then find the sum of all values in df.

set.seed(1522)   # replace 1522 with your own favorite number
df <- data.frame(
  yield = rnorm(100, mean = 100, sd = 5),
  temp  = rnorm(100, mean = 10,  sd = 1)
)

Q3. Find the mean of yield.

Q4. Find the 8th value of yield.
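
Q2 through Q4 rely on a few base R idioms. The sketch below shows them on a small made-up data frame (toy, with hypothetical columns height and weight) rather than on df, so the numbers here are not the answers.

```r
# A tiny illustrative data frame (not part of the assignment)
toy <- data.frame(
  height = c(150, 160, 170),
  weight = c(50, 60, 70)
)

sum(toy)          # sum over every cell of an all-numeric data frame: 660
mean(toy$height)  # mean of a single column: 160
toy$height[2]     # second value of a single column: 160
```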

Part 2: Import and Examine the Wine Dataset

This dataset contains the results of a chemical analysis of 3 cultivars of wine from the same region of Italy. Thirteen chemicals were measured for each sample. It was modified from a dataset freely available through the UCI Machine Learning Repository (https://archive.ics.uci.edu/dataset/109/wine).

Download winedata.csv from the course data folder (../data/winedata.csv) and read it into R:

wine <- read.csv("winedata.csv", header = TRUE)
str(wine)

Tip

Use str() to inspect your data — it shows variable types and a preview of values, which is more informative than View() for checking your import.
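
As a rough illustration of the inspection functions that pair well with str(), the sketch below uses R's built-in iris dataset as a stand-in, since it needs no download. The same calls should apply to wine once it is imported.

```r
# iris ships with base R, so no import step is needed
str(iris)            # variable types plus a preview of values

dim(iris)            # rows, then columns: 150 5
class(iris)          # "data.frame"
sapply(iris, class)  # the class of each column, one entry per column
```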

Q5. What are the dimensions of the wine dataset (rows × columns)?

Q6. What type of object is wine?

Q7. Provide the name of one column in wine that contains:

a) a categorical variable
b) a quantitative integer variable
c) a quantitative numeric variable

Part 3: Identify and Fix Common Coding Errors

Each chunk below contains a single bug. Fix the error and provide the corrected code in your R script. Write one sentence describing each bug in your written answer document.

A useful reference: Interpreting Common R Errors (https://warin.ca/posts/rcourse-howto-interpretcommonerrors/)

Q8.

DF[1:5, 1:2]   # subset rows 1–5 and columns 1–2 of the data frame from Part 1

Q9.

vector.a  <- c(0.2, 0.5, 0.8, 0.01, 0.03)
vector.b  <- c(12, 7, 5, 4, 14)
dataframe <- rbind(vector.a vector.b)   # combine into a data frame

Q10.

color5 <- wine[wine Color >= 5]   # subset to rows where Color >= 5

Part 4: Calculate Basic Summary Statistics

Q11. Write code to obtain the mean of all quantitative columns in wine. Which column has the highest mean?

Q12. Write code to obtain the standard deviation of all quantitative columns in wine. Which column has the lowest standard deviation?
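
One possible pattern for column-wise summaries, shown on the built-in iris dataset rather than wine: select the numeric columns first, then apply the summary function to each. The object names (is_num, col_means, col_sds) are illustrative, and other approaches such as colMeans() would work too.

```r
# Flag which columns are numeric (Species is a factor, so it is excluded)
is_num <- sapply(iris, is.numeric)

# Apply a summary function to each numeric column
col_means <- sapply(iris[, is_num], mean)
col_sds   <- sapply(iris[, is_num], sd)

col_means
col_sds

names(which.max(col_means))  # column with the highest mean: "Sepal.Length"
names(which.min(col_sds))    # column with the lowest SD: "Sepal.Width"
```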

Part 5: Subset the Dataset

Reduce wine to retain only these columns: Cultivar, Alcohol, Ash, Color, Flav, Mg, and OD. Save this as a new object.

Q13. Use a function to display the top rows of your reduced data frame. Paste a screenshot of the output in your written answer document.
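
Selecting columns by name and previewing the result might look like the sketch below. It again uses the built-in iris dataset as a stand-in, and iris_small is an illustrative object name.

```r
# Keep only the named columns, in the order listed
iris_small <- iris[, c("Species", "Sepal.Length", "Petal.Length")]

head(iris_small)     # first six rows by default
head(iris_small, 3)  # or request a specific number of rows
```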

Part 6: Examine Correlations Between Variables

Load ggplot2 and GGally, then use ggpairs() to create a scatter plot matrix for the quantitative variables in your reduced dataset.

library(ggplot2)
library(GGally)

# ggpairs on columns 2 through 7 (the continuous variables)

Q14. Paste a screenshot of your scatter plot matrix in your written answer document.

Q15. Which 3 pairs of variables are most strongly correlated? (Consider the absolute value of correlations — include both positive and negative relationships.)
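
To cross-check what you read off the plot, the correlations can also be computed and ranked numerically. The following is one possible base R sketch on the built-in iris dataset; the object and column names (cor_mat, pairs_tbl, var1, var2, r) are illustrative assumptions.

```r
# Correlation matrix of the four numeric iris columns
cor_mat <- cor(iris[, 1:4])

# Blank out the diagonal and upper triangle so each pair appears once
cor_mat[upper.tri(cor_mat, diag = TRUE)] <- NA

# Reshape to one row per pair and drop the blanked entries
pairs_tbl <- as.data.frame(as.table(cor_mat), stringsAsFactors = FALSE)
pairs_tbl <- pairs_tbl[!is.na(pairs_tbl$Freq), ]
names(pairs_tbl) <- c("var1", "var2", "r")

# Rank by absolute correlation so strong negative pairs are not missed
pairs_tbl <- pairs_tbl[order(abs(pairs_tbl$r), decreasing = TRUE), ]
head(pairs_tbl, 3)
```

Sorting on abs(r) rather than r is the key step: it treats a correlation of -0.8 as stronger than one of 0.5.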

Part 7: Examine Data by Cultivar

Q16. Write code to obtain the sample size for each of the three cultivars. List the values in your written answer.
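
Counting observations per group is commonly done with table(). A minimal sketch on the built-in iris dataset, where Species plays the role that Cultivar plays in wine (group_counts is an illustrative name):

```r
# Count the number of rows in each level of a grouping column
group_counts <- table(iris$Species)
group_counts  # 50 rows for each of the three species
```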

Q17. Create a second ggpairs() scatter plot matrix that includes:

Data for each cultivar shown in a different color
Scatter plots for all pairs of quantitative variables
Box plots and density plots
Correlations with both the overall value and the per-cultivar value

Paste a screenshot of your plot in your written answer document.
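
A grouped scatter plot matrix can be sketched as follows on the built-in iris dataset, assuming ggplot2 and GGally are installed. Including the grouping column among the plotted columns is what adds the box plot panels. The per-group correlation labels depend on GGally's default panel functions, so your output may look different depending on the installed version.

```r
library(ggplot2)
library(GGally)

# Map color to the grouping column; columns 1:5 include Species itself,
# so the matrix mixes scatter, density, and box plot panels
pairs_plot <- ggpairs(
  iris,
  columns = 1:5,
  mapping = aes(color = Species)
)

pairs_plot  # printing the object draws the plot
```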

Q18. Write code to find the mean value by cultivar for each of the 6 quantitative chemicals in your reduced dataset. Then examine these means along with the plots to identify:

a) A variable where the cultivar values appear clearly different
b) A variable where the cultivar values are mostly overlapping
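
Group means for several columns at once can be computed with aggregate(). The sketch below uses the built-in iris dataset, with Species as the grouping column. The formula `. ~ Species` means "every other column, grouped by Species", and group_means is an illustrative name.

```r
# Mean of every numeric column within each level of Species
group_means <- aggregate(. ~ Species, data = iris, FUN = mean)
group_means
```

Comparing the rows of such a table against the box plots is one way to judge whether the groups differ clearly or mostly overlap.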

Part 8: Flav vs. OD

Q19. Using the second scatter plot matrix from Q17, find the correlations between Flav and OD for each cultivar and the overall correlation.

Describe: How do the per-cultivar correlations compare to the overall value? Is the relationship consistent across cultivars, or does one cultivar primarily drive the overall pattern?
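
The overall and per-group correlations can also be checked numerically. The sketch below does so on the built-in iris dataset, chosen because its pooled and within-group correlations disagree in sign. The object names are illustrative, and the pattern in wine may be different.

```r
# Correlation across all rows, ignoring the grouping
overall_r <- cor(iris$Sepal.Length, iris$Sepal.Width)

# Correlation computed separately within each species
by_group_r <- sapply(
  split(iris, iris$Species),
  function(g) cor(g$Sepal.Length, g$Sepal.Width)
)

overall_r   # negative when all species are pooled
by_group_r  # positive within every species
```

When pooled and within-group values diverge like this, differences between the group means are shaping the overall correlation, which is the kind of pattern Q19 asks you to look for.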

Try the problems yourself before looking. The key is available for independent learners following along outside the course: Problem Set Key (ps-r-stats-intro-key.html)
