Assignment_03 PDF

School

University of Texas, San Antonio

Course

1403

Subject

Statistics

Date

Apr 30, 2024

Uploaded by MasterFlower1294 on coursehero.com

10/11/23, 8:58 PM Assignment_03

Biostatistics with R
Assignment 3: Exploring Relationships

Assignment Setup

Run the next cell to load the necessary R packages for this assignment and print out your current working directory.

In [1]:
    print(paste("My current working directory is", getwd()))

[1] "My current working directory is /Users/megancuevas/STATS1403/R assignments/Assignment_03"

Visualizing and Summarizing Relationships Between Variables

In the previous assignment, we focused on using graphs and summary statistics to explore the distribution of individual variables. This assignment is dedicated to using graphs and summary statistics to investigate relationships between two or more variables. Our objective is to develop a high-level understanding of the type and strength of relationships between variables.

Note that at this point, we are not drawing formal conclusions about whether a relationship exists or, if it does, whether it is strong. We do that formally later in this course. Here, we explore the observed data to detect possible relationships and use summary statistics to measure the strength of those relationships.

Relationships Between Two Numerical Random Variables

We start our discussion of relationships between numerical variables by looking at a data set based on a study conducted by Dr. Fisher from the Human Performance Research Center at Brigham Young University. This observational study measured percent body fat as the target variable, along with several explanatory variables such as age, weight, height, and abdomen circumference, for a sample of 252 men. The collected data set is stored in a comma-separated values (CSV) text file called bodyfat.csv.

Example 1: Use read.csv() to read in the bodyfat.csv data file

The R code in the cell below uses the command read.csv() to read a text file called bodyfat.csv. In a text data file, the individual data values are usually separated from one another in one of three ways:

- Data values are separated by a comma.
- Data values are separated by a blank space.
- Data values are separated by a tab.

If you aren't sure which separator was used in a particular data file, you can open the file in a word processor (e.g. MS Word) or text editor to take a quick look at the data. However, this trick might not work very well if your data file is really large.

You tell read.csv() how the data in a particular file is stored using the sep argument. In the example below, setting sep=',' tells read.csv() that the separator is a comma. (A comma is also the default for read.csv(), so sep is only strictly required for other separators.) As the data is read from your hard disk, the result is assigned to a new dataframe called bfat_df.

To verify that you read the data file correctly, we use R's head() command to print out the first 6 records in the bfat_df dataframe. If the output doesn't look right, it probably means that your separator was wrong and you need to try a different one.

The other argument in the read.csv() command is header= . If you set header=TRUE , read.csv() treats the first line of the file as the label for each data column. If you set header=FALSE , read.csv() treats the first line as data and assigns default column names (V1, V2, and so on).

In [2]:
    # Read in bodyfat data file
    bfat_df <- read.csv("bodyfat.csv", header = TRUE, sep = ',')
    head(bfat_df)
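As a minimal sketch of how the sep and header arguments change what read.csv() returns, the cell below writes a tiny tab-separated file to a temporary location and reads it back three ways. The file and the names tmp_file, good_df, wrong_sep_df, and no_header_df are made up for illustration and are not part of the assignment; the few data values are taken from the first two bodyfat records.

```r
# Hypothetical demo file: three tab-separated columns with a header row
tmp_file <- tempfile(fileext = ".txt")
writeLines(c("age\tweight\theight",
             "23\t154.25\t67.75",
             "22\t173.25\t72.25"), tmp_file)

# Correct separator: the file is split into three columns and two records
good_df <- read.csv(tmp_file, header = TRUE, sep = "\t")
print(dim(good_df))

# Wrong separator: no commas are found, so every line lands in a single column
wrong_sep_df <- read.csv(tmp_file, header = TRUE, sep = ",")
print(ncol(wrong_sep_df))

# header = FALSE: the label row is read as data and columns get default names
no_header_df <- read.csv(tmp_file, header = FALSE, sep = "\t")
print(names(no_header_df))
```

A column count of one after reading is a quick sign that the separator was wrong.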

You should see the first 6 records in the bfat_df dataframe:

case  brozek  siri  density  age  weight  height  neck  chest  abdomen  hip    thigh  knee
1     12.6    12.3  1.0708   23   154.25  67.75   36.2  93.1   85.2     94.5   59.0   37.3
2     6.9     6.1   1.0853   22   173.25  72.25   38.5  93.6   83.0     98.7   58.7   37.3
3     24.6    25.3  1.0414   22   154.00  66.25   34.0  95.8   87.9     99.2   59.6   38.9
4     10.9    10.4  1.0751   26   184.75  72.25   37.4  101.8  86.4     101.2  60.1   37.3
5     27.8    28.7  1.0340   24   184.25  71.25   34.4  97.3   100.0    101.9  63.2   42.2
6     20.6    20.9  1.0502   24   210.25  74.75   39.0  104.5  94.4     107.8  66.0   42.0

If you receive an error, it is probably because the file bodyfat.csv is not in the current working directory of your Jupyter Notebook.

Exercise 1: Read in the USmelanoma.csv data file

In the cell below, write the R code to read the data file called USmelanoma.csv and create a new dataframe called USmel_df . In this data file, a comma is used as the separator and the file does contain a header.

In [3]:
    # Insert your code for Exercise 1 here
    USmel_df <- read.csv("USmelanoma.csv", header = TRUE, sep = ',')
    head(USmel_df)

If your code is correct you should see the following:

mortality  latitude  longitude  ocean
219        33.0      87.0       yes
160        34.5      112.0      no
170        35.0      92.5       no
182        37.5      119.5      yes
149        39.0      105.5      no
159        41.8      72.8       yes

If you get an error, check to see if you spelled the name of the data file correctly.
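One way to diagnose a "file not found" error before calling read.csv() is to check whether the file is visible from the working directory. The sketch below assumes a hypothetical helper named read_if_present and a deliberately nonexistent file name; neither is part of the assignment.

```r
# Hypothetical helper: read a comma-separated file only if R can find it
read_if_present <- function(file_name) {
  if (!file.exists(file_name)) {
    # Report where R was looking so the path or spelling can be corrected
    message("Cannot find '", file_name, "' from ", getwd())
    return(NULL)
  }
  read.csv(file_name, header = TRUE, sep = ",")
}

# A made-up file name that should not exist, to show the failure path
missing_df <- read_if_present("no_such_file_for_demo.csv")
print(is.null(missing_df))

# List the CSV files that are actually in the working directory
print(list.files(pattern = "\\.csv$"))
```

If the file you expect is missing from the list, either move it into the working directory or correct the spelling of the file name.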

Example 2A: Creating a Simple X-Y Scatterplot using R's plot command

R offers several graphics libraries for displaying data in a graphic format. In this assignment we will use what is called base graphics. These are graphing commands that come with R. To generate more elaborate graphical plots, packages such as ggplot2 can be used after they have been downloaded and installed.

In the cell below, we will investigate the relationship between two of the variables in our bfat_df data: (1) abdomen size (circumference) and (2) a measure of body fat called Siri.

Our first step is to extract just the data we want to visualize from all of the other data stored in our bfat_df dataframe. One way to do this is to use the dollar sign $ operator. As shown below, we can use the $ operator to extract the abdominal measurements and create a new variable called ab_X using the command ab_X <- bfat_df$abdomen . We will use this variable for our X values. Similarly, we will use the command siri_Y <- bfat_df$siri to create a new variable to hold our Y values.

NOTE: It is absolutely essential when using the $ operator that you spell the name of the variable EXACTLY as it appears in the variable's column header, including any capitalization.

Once we have generated our X and Y values, generating an XY plot is fairly easy using R's plot() command. The plot() command only needs the values for the X and Y variables and a value for the type of plot. In this example, we use the argument type = "p" to plot the data as points.

In [4]:
    # Example 2A: Simple X-Y Plot
    # Use $ operator to extract data from specific columns in bfat_df dataframe
    ab_X <- bfat_df$abdomen   # Let x be the abdomen measurements
    siri_Y <- bfat_df$siri    # Let y be the Siri measurements
    # Use base graphics for XY plot
    plot(ab_X, siri_Y, type = 'p')
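The note about exact spelling can be illustrated with a small sketch. It assumes a stand-in dataframe, demo_df, built from only the first six abdomen and siri values printed by head(bfat_df), so it runs without the bodyfat.csv file; the names demo_df and bad_X are made up for this illustration.

```r
# Stand-in data: first six abdomen and siri values from the head(bfat_df) output
demo_df <- data.frame(
  abdomen = c(85.2, 83.0, 87.9, 86.4, 100.0, 94.4),
  siri    = c(12.3, 6.1, 25.3, 10.4, 28.7, 20.9)
)

# Correct spelling returns the column as a numeric vector
ab_X <- demo_df$abdomen
print(length(ab_X))

# Wrong capitalization does not raise an error here; it quietly returns NULL
bad_X <- demo_df$Abdomen
print(is.null(bad_X))

# names() lists the exact column spellings to use with $
print(names(demo_df))
```

Because a misspelled column comes back as NULL rather than an error, the problem often only surfaces later, for example when plot() complains about its inputs. Checking names() first avoids that.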

Example 2B: Creating an X-Y Scatterplot with a Regression Line

By adding a few lines of R code we can improve the X-Y plot generated by the previous code cell by adding a Line of Best Fit, also called a Regression Line. In order to add a Regression Line we need to perform a type of mathematical analysis on our X and Y data called a Linear Regression Analysis. This can be done in R very easily just by using the command lm() , which stands for linear model.

As shown in the code cell below, we can perform a linear regression analysis by simply using the command r1_model <- lm(siri_Y ~ ab_X) . Note that the Y and X variables are separated by a tilde ~ . This is R's way of saying, "create a linear model of Y as a function of X". The regression data generated by the lm() command is stored in a new variable called r1_model .

All we have to do to add a Regression Line to an X-Y plot is to follow the plotting code with the command abline(r1_model) . The command abline() simply adds a line to the XY plot using the data provided in the argument, in this case the output of the lm() command.

You should also note that we have improved our XY plot by adding specific labels for the X and Y axes using xlab="X label name" and ylab="Y label name" respectively.

In [18]:
    # Example 2B: X-Y Plot with a Regression Line
    # Use $ operator to extract data from specific columns in bfat_df dataframe
    ab_X <- bfat_df$abdomen   # Let x be the abdomen measurements
    siri_Y <- bfat_df$siri    # Let y be the Siri measurements
    # Compute the linear regression line and store the data in r1_model
    r1_model <- lm(siri_Y ~ ab_X)
    # Use base graphics for XY plot
    plot(ab_X, siri_Y, type = 'p',
         xlab = "Abdomen Circumference (cm)",
         ylab = "Siri Body Fat (%)")
    # Add the regression line to the plot
    abline(r1_model)
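To see what abline() actually receives from lm() , the fitted intercept and slope can be pulled out with coef() . The sketch below is only illustrative: it assumes a stand-in dataframe, demo_df, holding the first six abdomen and siri values from the head(bfat_df) output, so its coefficients will differ from a fit on all 252 records. The names demo_df, line_coefs, and pred_90, and the 90 cm input, are made up for this example.

```r
# Stand-in data: first six abdomen and siri values from the head(bfat_df) output
demo_df <- data.frame(
  abdomen = c(85.2, 83.0, 87.9, 86.4, 100.0, 94.4),
  siri    = c(12.3, 6.1, 25.3, 10.4, 28.7, 20.9)
)
ab_X <- demo_df$abdomen
siri_Y <- demo_df$siri

# Fit siri as a linear function of abdomen, as in Example 2B
r1_model <- lm(siri_Y ~ ab_X)

# coef() returns the intercept and slope of the fitted line
line_coefs <- coef(r1_model)
print(line_coefs)

# Height of the fitted line at an illustrative abdomen value of 90 cm
pred_90 <- line_coefs[["(Intercept)"]] + line_coefs[["ab_X"]] * 90
print(pred_90)
```

The slope is the estimated change in Siri body fat for each additional centimeter of abdomen circumference, and abline(r1_model) draws exactly this intercept-and-slope line on the existing plot.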
