R-Basics: Part 3 - All About Indexing
- Subset a vector based on properties of another vector.
- Use multiple logical operators to index vectors.
- Extract the indices of vector elements satisfying one or more logical conditions.
- Extract the indices of vector elements matching with another vector.
- Determine which elements in one vector are present in another vector.
- Wrangle data tables using functions in the dplyr package.
- Modify a data table by adding or changing columns.
- Subset rows in a data table.
- Subset columns in a data table.
- Perform a series of operations using the pipe operator.
- Create data frames.
- Plot data in scatter plots, box plots, and histograms.
- Sample Q/As
R provides a powerful and convenient way of indexing vectors: we can use logical operators to index a vector.
If we compare a vector with a single number, R performs the test for each entry and returns a logical vector.
Let’s see some examples:
index <- murder_rate < 0.71
murders$state[index]
[1] "Hawaii" "Iowa" "New Hampshire" "North Dakota"
[5] "Vermont"
index1 <- murder_rate > 0.71
murders$state[index1]
[1] "Alabama" "Alaska" "Arizona"
[4] "Arkansas" "California" "Colorado"
[7] "Connecticut" "Delaware" "District of Columbia"
[10] "Florida" "Georgia" "Idaho"
[13] "Illinois" "Indiana" "Kansas"
[16] "Kentucky" "Louisiana" "Maine"
[19] "Maryland" "Massachusetts" "Michigan"
[22] "Minnesota" "Mississippi" "Missouri"
[25] "Montana" "Nebraska" "Nevada"
[28] "New Jersey" "New Mexico" "New York"
[31] "North Carolina" "Ohio" "Oklahoma"
[34] "Oregon" "Pennsylvania" "Rhode Island"
[37] "South Carolina" "South Dakota" "Tennessee"
[40] "Texas" "Utah" "Virginia"
[43] "Washington" "West Virginia" "Wisconsin"
[46] "Wyoming"
Counting the TRUE entries using the sum() function
This works because the logical vector is coerced to numeric: TRUE becomes 1 and FALSE becomes 0.
sum(index)
[1] 5
sum(index1)
[1] 46
So 5 states have a murder rate below 0.71 per 100,000, and 46 (including the District of Columbia) have a rate above 0.71.
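As a small self-contained illustration of the same idea, here is a sketch on a made-up vector of rates. The names and values are hypothetical and are not taken from the murders data.

```r
# Hypothetical rates per 100,000 for four made-up places
rates <- c(A = 0.5, B = 2.8, C = 0.6, D = 3.4)

# Comparing a vector with a single number tests every entry
low <- rates < 0.71
low

# A logical vector can subset another vector of the same length
low_names <- names(rates)[low]
low_names

# TRUE is coerced to 1 and FALSE to 0, so sum() counts the TRUE entries
n_low <- sum(low)
n_low
```

The logical vector has one entry per element of rates, which is why it can be reused to pick out the matching names.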
Sometimes we need the data points that meet more than one condition. R’s logical operators let us combine conditions.
Combining two or more conditions using the & operator
For example, suppose somebody wants to live in the mountains but wants the murder rate to be no higher than 1. The West region has a lot of mountains, so let’s see which Western states have a murder rate of at most 1.
west <- murders$region == "West"
safe <- murder_rate <= 1
# Defining the index
index2 <- safe & west
murders$state[index2]
[1] "Hawaii" "Idaho" "Oregon" "Utah" "Wyoming"
Based on the results, five Western states have murder rates lower than or equal to 1 per 100,000.
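The way & combines two logical vectors entry by entry can be sketched on toy data. The state labels, regions, and rates below are made up for illustration.

```r
# Hypothetical toy data: region and rate for five made-up states
state  <- c("S1", "S2", "S3", "S4", "S5")
region <- c("West", "South", "West", "West", "Northeast")
rate   <- c(0.8, 0.6, 1.0, 2.5, 0.4)

west <- region == "West"
safe <- rate <= 1

# & is TRUE only at positions where both conditions are TRUE
both <- west & safe
picked <- state[both]
picked
```

S2 and S5 are safe but not in the West, and S4 is in the West but not safe, so only S1 and S3 remain.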
Indexing Functions
- The function which() gives us the indices of the entries of a logical vector that are TRUE.
which(x) # returns indices that are TRUE
[1] 2 4 5
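The definition of x is not shown above. The sketch below assumes a logical vector whose TRUE entries sit in positions 2, 4, and 5, which is consistent with the output but may not be the exact vector that was used.

```r
# Assumed definition of x (not shown in the text above)
x <- c(FALSE, TRUE, FALSE, TRUE, TRUE)

# which() returns the positions of the TRUE entries
idx <- which(x)
idx
```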
For example, to determine the murder rate in Florida we may do the following:
index <- which(murders$state == "Florida")
index
[1] 10
murder_rate[index]
[1] 3.398069
- The function match() looks for entries in a vector and returns the index needed to access them. For example, if we want to obtain the indices and subsequent murder rates of Montana, North Carolina, and Maryland, we do:
index <- match(c("Montana", "North Carolina", "Maryland"), murders$state)
index
[1] 27 34 21
murders$state[index]
[1] "Montana" "North Carolina" "Maryland"
murder_rate[index]
[1] 1.212838 2.999324 5.074866
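One detail worth knowing about match() is what happens when a value is not found. A minimal sketch on a hypothetical lookup table (the names and rates are made up):

```r
# Hypothetical lookup table
state <- c("Alpha", "Beta", "Gamma", "Delta")
rate  <- c(1.2, 3.0, 5.1, 0.7)

# match() returns, for each element of the first vector,
# the position of its first match in the second vector
idx <- match(c("Gamma", "Alpha", "Omega"), state)
idx          # "Omega" is not in state, so its index is NA

found <- rate[idx]
found        # indexing with NA gives NA for that entry
```

The result keeps the order of the first argument, so the rates come back in the order we asked for them, not in table order.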
- We use the %in% operator if we want to know whether or not each element of a first vector is in a second vector.
It is perhaps the most useful of the three. It tells us which items in one vector are also in another vector. For example, I am creating three vectors x, y, and z and running some tests.
x <- c("k", "l", "m", "n", "o", "p", "q", "r", "s")
y <- c("k", "l", "m","t","u","v")
z <- c("e", "f", "g","o", "p", "q", "r","v")
# Now let's check
y %in% x
y %in% z
z %in% x
TRUE means that the element of the first vector is also in the second vector; FALSE means that it is not.
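The results of the three checks are not printed above. Re-running them on the same x, y, and z should give the values noted in the comments; the result names y_in_x, y_in_z, and z_in_x are introduced here only for illustration.

```r
x <- c("k", "l", "m", "n", "o", "p", "q", "r", "s")
y <- c("k", "l", "m", "t", "u", "v")
z <- c("e", "f", "g", "o", "p", "q", "r", "v")

# The result always has the length of the vector on the left-hand side
y_in_x <- y %in% x   # TRUE TRUE TRUE FALSE FALSE FALSE
y_in_z <- y %in% z   # FALSE FALSE FALSE FALSE FALSE TRUE
z_in_x <- z %in% x   # FALSE FALSE FALSE TRUE TRUE TRUE TRUE FALSE

y_in_x
y_in_z
z_in_x
```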
I want to check if Orlando, Jackson, Purchase, and Georgia are states. I can use the %in% operator.
c("Orlando","Jackson","Purchase", "Georgia") %in% murders$state
Orlando, Jackson, and Purchase are not states but Georgia is.
Basic Data Wrangling using dplyr package
We can load the dplyr package the following way:
library(dplyr)
- To change a data table by adding a new column, or changing an existing one, we use the mutate() function.
The mutate() function takes the data frame as its first argument and the name and value of the new variable as its second argument. For example, we created a vector called murder_rate previously, but it is not part of our data frame. Now let’s add the rate as a new column of the murders data frame.
# Checking the first few rows of the murders data set
head(murders)
state abb region population total
1 Alabama AL South 4779736 135
2 Alaska AK West 710231 19
3 Arizona AZ West 6392017 232
4 Arkansas AR South 2915918 93
5 California CA West 37253956 1257
6 Colorado CO West 5029196 65
# Adding a new column named m_rate
murders <- mutate(murders, m_rate = total / population * 100000)
# Checking for changes
head(murders)
state abb region population total m_rate
1 Alabama AL South 4779736 135 2.824424
2 Alaska AK West 710231 19 2.675186
3 Arizona AZ West 6392017 232 3.629527
4 Arkansas AR South 2915918 93 3.189390
5 California CA West 37253956 1257 3.374138
6 Colorado CO West 5029196 65 1.292453
The second table has a new column named m_rate. How is this possible? The mutate() function looked up total and population among the columns of the murders data frame and did the calculation there, so we did not need objects with those names in the workspace.
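To see that the column lookup happens inside the data frame, here is a sketch on a hypothetical two-row table. A workspace object named total is defined on purpose; the column of the same name takes precedence inside mutate().

```r
library(dplyr)

# Hypothetical toy table
df <- data.frame(state = c("S1", "S2"),
                 population = c(1000000, 500000),
                 total = c(30, 5))

# A workspace object with the same name as a column
total <- 999

# Inside mutate(), total refers to the column df$total, not to 999
df <- mutate(df, m_rate = total / population * 100000)
df$m_rate
```

The rates come out as 3 and 1 per 100,000, which is what the column values give.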
- To filter the data by subsetting rows, we use the function filter().
Suppose we want only the states with a murder rate of 0.71 per 100,000 or lower. The filter() function takes the data table as its first argument and the condition as its second.
filter(murders, m_rate <= 0.71)
state abb region population total m_rate
1 Hawaii HI West 1360301 7 0.5145920
2 Iowa IA North Central 3046355 21 0.6893484
3 New Hampshire NH Northeast 1316470 5 0.3798036
4 North Dakota ND North Central 672591 4 0.5947151
5 Vermont VT Northeast 625741 2 0.3196211
We can see that five states have a murder rate of 0.71 or lower, and we get the entire row for each of them.
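filter() also accepts several conditions at once; conditions separated by commas must all hold. A minimal sketch on a made-up table (all values hypothetical):

```r
library(dplyr)

# Hypothetical toy table
df <- data.frame(state  = c("S1", "S2", "S3"),
                 region = c("West", "South", "West"),
                 m_rate = c(0.5, 0.6, 2.0))

# Keep rows that satisfy both conditions
low_west <- filter(df, m_rate <= 0.71, region == "West")
low_west
```

S2 has a low rate but is not in the West, and S3 is in the West but has a high rate, so only S1 is kept.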
- To subset the data by selecting specific columns, we use the select() function.
Some data tables contain hundreds of columns, and we may need only a few of them. I am going to create a new object, my_data, by subsetting the murders data table with the select() function.
my_data <- select(murders, abb, region, m_rate)
head(my_data)
abb region m_rate
1 AL South 2.824424
2 AK West 2.675186
3 AZ West 3.629527
4 AR South 3.189390
5 CA West 3.374138
6 CO West 1.292453
# Now, let's filter the states with a murder rate of 0.90 or lower
filter(my_data, m_rate <= 0.90)
abb region m_rate
1 HI West 0.5145920
2 ID West 0.7655102
3 IA North Central 0.6893484
4 ME Northeast 0.8280881
5 NH Northeast 0.3798036
6 ND North Central 0.5947151
7 UT West 0.7959810
8 VT Northeast 0.3196211
9 WY West 0.8871131
The resulting table has just 3 columns and is stored in an object named my_data. It is a much smaller and more manageable table than murders. There are 9 states with a murder rate lower than or equal to 0.90.
- We can perform a series of operations by sending the results of one function to another function using the pipe operator, %>%.
Let’s put these functions together. With the pipe operator we can chain several dplyr operations in one statement. For example, we can take some data, select the required columns, and filter some rows, without creating an intermediate table like my_data above.
murders %>%                          # take the murders data table
  select(state, region, m_rate) %>%  # keep the state, region, and m_rate columns
  filter(m_rate <= 0.90)             # keep the rows where m_rate <= 0.90
state region m_rate
1 Hawaii West 0.5145920
2 Idaho West 0.7655102
3 Iowa North Central 0.6893484
4 Maine Northeast 0.8280881
5 New Hampshire Northeast 0.3798036
6 North Dakota North Central 0.5947151
7 Utah West 0.7959810
8 Vermont Northeast 0.3196211
9 Wyoming West 0.8871131
Here we go. We got the same nine states, this time with the full state names because we selected state instead of abb.
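The pipe simply passes the result on its left as the first argument of the function on its right. A sketch on hypothetical data comparing the nested call with the piped version:

```r
library(dplyr)

# Hypothetical toy table
df <- data.frame(state  = c("S1", "S2", "S3"),
                 abb    = c("A", "B", "C"),
                 m_rate = c(0.5, 1.5, 0.8))

# Nested form: read from the inside out
nested <- filter(select(df, state, m_rate), m_rate <= 0.90)

# Piped form: read from top to bottom
piped <- df %>%
  select(state, m_rate) %>%
  filter(m_rate <= 0.90)

all.equal(nested, piped)
```

Both forms return the same table; the piped one is usually easier to read once there are more than two steps.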
Creating Data Frames
- We can use the data.frame() function to create data frames. I am going to create a data frame named test_scores with three columns and four rows. The first column is names, followed by the pretest and posttest scores.
test_scores <- data.frame(names = c("Sushila", "Aanishma", "Nikita", "Arjun"),
                          pretest = c(85, 98, 74, 69),
                          posttest = c(95, 99, 86, 79))
test_scores
names pretest posttest
1 Sushila 85 95
2 Aanishma 98 99
3 Nikita 74 86
4 Arjun 69 79
Yay! We just created a data frame with 3 columns and 4 rows.
Formerly, the data.frame() function turned characters into factors by default. To avoid this, we could set the stringsAsFactors argument to FALSE. As of R 4.0, it is no longer necessary to include the stringsAsFactors argument, because R no longer turns characters into factors by default.
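The effect of the argument can be checked directly. The sketch below sets stringsAsFactors explicitly in both calls so that it behaves the same on any R version; the data frames as_chr and as_fct are made up for illustration.

```r
# Same toy data, built with and without conversion to factors
as_chr <- data.frame(names = c("A", "B"), score = c(1, 2),
                     stringsAsFactors = FALSE)
as_fct <- data.frame(names = c("A", "B"), score = c(1, 2),
                     stringsAsFactors = TRUE)

class(as_chr$names)   # "character"
class(as_fct$names)   # "factor"
```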
Basic Plots
Exploratory data analysis is perhaps one of the biggest strengths of R. We can quickly go from idea to data to data visualization without much effort.
- We can create a simple scatterplot using the function plot().
population_in_millions <- murders$population/10^6
total_gun_murders <- murders$total
plot(population_in_millions, total_gun_murders)
The plot suggests a roughly linear relationship between population and total gun murders.
- Histograms are powerful graphical summaries that give us a general overview of the values we have. In R, they can be produced using the hist() function.
[1] "District of Columbia"
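The calls that drew the histogram and produced the state name above are not shown. A sketch on made-up rates, assuming the extreme value was located with which.max() (the vector rates and its values are hypothetical):

```r
# Hypothetical rates per 100,000; S3 is an outlier
rates <- c(S1 = 2.8, S2 = 0.5, S3 = 16.5, S4 = 3.4, S5 = 1.3)

# Draw the histogram of the rates
hist(rates)

# which.max() gives the position of the largest value,
# which we use to look up the corresponding name
top <- names(rates)[which.max(rates)]
top
```

An isolated bar far to the right of a histogram is a hint to look up which observation produced it.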
- Boxplots provide a more compact summary of a distribution than a histogram and are more useful for comparing distributions. They can be produced using the boxplot() function.
boxplot(m_rate ~ region, data = murders)
We can see that the South has a higher murder rate than the other regions.
Sample Q/As
For these questions we need the heights data set included in the dslabs package. Let’s load the library, get the data ready, and set the number of digits for the rest of the analyses.
library(dslabs)
data(heights)
options(digits = 3) # report 3 significant digits for all answers
head(heights)
sex height
1 Male 75
2 Male 70
3 Male 68
4 Male 74
5 Male 61
6 Female 65
summary(heights)
sex height
Female:238 Min. :50.0
Male :812 1st Qu.:66.0
Median :68.5
Mean :68.3
3rd Qu.:71.0
Max. :82.7
The data set has only two variables, sex and height. Of the 1,050 data points, 238 are females and 812 are males. The mean height is 68.3 inches (the median is 68.5).
Q.1. First, determine the average height in this dataset. Then create a logical vector ind with the indices for those individuals who are above average height. How many individuals in the dataset are above average height?
# Calculating the average height of the sample
average_height <- mean(heights$height)
# Indexing the individuals who are taller than the average height
ind <- heights$height > average_height
# heights$height[ind] would print every above-average height; omitted to keep the output short.
sum(ind)
[1] 532
There are 532 individuals in the data set who are taller than average.
Q.2. How many individuals in the dataset are above average height and are female?
# Logical vector marking the females
female_sample <- heights$sex == "Female"
# Logical vector marking the individuals who are taller than average
taller_female <- heights$height > mean(heights$height)
# Females who are taller than average
indf <- female_sample & taller_female
sum(indf)
[1] 31
There are a total of 31 females who are taller than the average height.
Q.3. If you use mean() on a logical (TRUE/FALSE) vector, it returns the proportion of observations that are TRUE. What proportion of individuals in the dataset are female?
mean(heights$sex == "Female")
[1] 0.227
The result shows that 22.7% of the individuals in the dataset are female (a proportion of 0.227).
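Because mean() of a logical vector is a proportion between 0 and 1, it has to be multiplied by 100 to read it as a percentage. A tiny sketch on a made-up vector:

```r
# Hypothetical sample of five individuals
sex <- c("Female", "Male", "Male", "Female", "Male")

# 2 of the 5 entries are TRUE, so the proportion is 2/5
prop_female <- mean(sex == "Female")
prop_female

# Convert the proportion to a percentage
pct_female <- 100 * prop_female
pct_female
```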
Q.4. Determine the minimum height in the heights dataset.
[1] 50
The minimum height among the sample is 50 inches.
Q.5. Use the match() function to determine the index of the first individual with the minimum height.
[1] 1032
The first person with the minimum height in the data set is located in the 1032nd row of the height column.
Q.6. Subset the sex column of the dataset by the index in 5 to determine the individual’s sex.
[1] Male
Levels: Female Male
The 1032nd person, who has a height of 50 inches, is male.
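The code behind the answers to Q.4 through the first Q.6 is not shown above. One way to get such answers, sketched on a hypothetical five-person sample (the vectors height and sex below are made up, not the heights data):

```r
# Hypothetical sample; the minimum height occurs twice
height <- c(70, 61, 50, 65, 50)
sex    <- factor(c("Male", "Female", "Male", "Female", "Female"))

# Minimum height
min_height <- min(height)
min_height

# match() returns the position of the first match only
first_min <- match(min_height, height)
first_min

# Use that index to look up the sex of the same individual
sex_of_first <- sex[first_min]
sex_of_first
```

Even though 50 appears at positions 3 and 5, match() reports only position 3, which is what "the first individual with the minimum height" asks for.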
Q.6. Determine the maximum height.
[1] 82.7
The maximum height in the sample is 82.7 inches.
Which integer values are between the maximum and minimum heights? For example, if the minimum height is 10.2 and the maximum height is 20.8, your answer should be x <- 11:20 to capture the integers in between those values. (If either the maximum or minimum height are integers, include those values too.)
Q.7. Write code to create a vector x that includes the integers between the minimum and maximum heights (as numbers).
x <- 50:82 # 83 is above the maximum of 82.7, so the range stops at 82
Q.8. How many of the integers in x are NOT heights in the dataset?
sum(!(x %in% heights$height))
[1] 3
There are three integers in x that are not heights in heights$height.
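Instead of typing the endpoints by hand, the range can be derived from the data with ceiling() and floor(), which avoids overshooting the maximum. A sketch on hypothetical heights:

```r
# Hypothetical heights: minimum 50, maximum 56.7
height <- c(50, 52.5, 53, 55, 56.7)

# Smallest integer >= min and largest integer <= max
x <- ceiling(min(height)):floor(max(height))
x

# Count the integers in x that never occur as a height
missing_count <- sum(!(x %in% height))
missing_count
```

Here x runs from 50 to 56, and the integers 51, 52, 54, and 56 do not occur in height, so the count is 4.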
Using the heights dataset, create a new column of heights in centimeters named ht_cm. Recall that 1 inch = 2.54 centimeters. Save the resulting dataset as heights2.
heights <- mutate(heights, ht_cm = height*2.54)
head(heights)
sex height ht_cm
1 Male 75 190
2 Male 70 178
3 Male 68 173
4 Male 74 188
5 Male 61 155
6 Female 65 165
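The dataset now has the third column I just created. Note that I saved the result back into heights rather than into a new object named heights2, so the code that follows keeps using heights.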
Q.9. What is the height in centimeters of the 18th individual (index 18)?
[1] 163
The height of the 18th individual in the data set is 163 cm.
Q.10. What is the mean height in centimeters?
[1] 174
The mean height of the sample in the data set is 174 centimeters.
Create a data frame females by filtering the heights2 data to contain only female individuals.
females <- filter(heights, sex == "Female")
head(females)
sex height ht_cm
1 Female 65 165
2 Female 66 168
3 Female 62 157
4 Female 66 168
5 Female 64 163
6 Female 60 152
The new data frame females has been created, and it contains only the female individuals.
Q.11. How many females are in the heights2 dataset?
[1] 238
There are a total of 238 females in the data set.
Q.12. What is the mean height of the females in centimeters?
[1] 165
The mean height of the females in the data set is 165 cm.
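The code behind the answers to Q.9 through Q.12 is not shown above. The same steps can be sketched on a hypothetical four-row table; the names toy and toy2 and all values are made up, and toy_females plays the role of females.

```r
library(dplyr)

# Hypothetical toy version of the heights data (inches)
toy <- data.frame(sex = c("Male", "Female", "Female", "Male"),
                  height = c(70, 60, 65, 72))

# Add a column in centimeters (1 inch = 2.54 cm)
toy2 <- mutate(toy, ht_cm = height * 2.54)

# Height in cm of one individual, selected by index
second_cm <- toy2$ht_cm[2]
second_cm

# Mean height in cm over everyone
mean(toy2$ht_cm)

# Keep only the females, then count them and average their height
toy_females <- filter(toy2, sex == "Female")
n_females <- nrow(toy_females)
mean_female_cm <- mean(toy_females$ht_cm)
n_females
mean_female_cm
```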
The olive dataset in dslabs contains the composition, in percent, of eight fatty acids found in the lipid fraction of 572 Italian olive oils:
data(olive)
head(olive)
region area palmitic palmitoleic stearic oleic linoleic
1 Southern Italy North-Apulia 10.75 0.75 2.26 78.2 6.72
2 Southern Italy North-Apulia 10.88 0.73 2.24 77.1 7.81
3 Southern Italy North-Apulia 9.11 0.54 2.46 81.1 5.49
4 Southern Italy North-Apulia 9.66 0.57 2.40 79.5 6.19
5 Southern Italy North-Apulia 10.51 0.67 2.59 77.7 6.72
6 Southern Italy North-Apulia 9.11 0.49 2.68 79.2 6.78
linolenic arachidic eicosenoic
1 0.36 0.60 0.29
2 0.31 0.61 0.29
3 0.31 0.63 0.29
4 0.50 0.78 0.35
5 0.50 0.80 0.46
6 0.51 0.70 0.44
The data is ready. Now let’s answer the questions.
Q.13. Plot the percent palmitic acid versus palmitoleic acid in a scatterplot. What relationship do you see?
palmitic_acid <- olive$palmitic
palmitoleic_acid <- olive$palmitoleic
plot(palmitic_acid, palmitoleic_acid)
We see a positive linear relationship between these variables.
Q.13. Create a histogram of the percentage of eicosenoic acid in olive.
hist(olive$eicosenoic)
The most common value of eicosenoic acid is below 0.05%.
Q.14. Make a boxplot of palmitic acid percentage in olive with separate distributions for each region. Which region has the highest median palmitic acid percentage? Which region has the most variable palmitic acid percentage?
boxplot(palmitic ~ region, data = olive)
Southern Italy has the highest median palmitic acid percentage, and the same region has the most variable data.
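Reading medians and spread off a boxplot can be double-checked numerically. A sketch on made-up percentages for three hypothetical groups, using tapply() with median() and IQR() (the interquartile range is one possible measure of variability, chosen here as an assumption because it matches the height of the box):

```r
# Hypothetical percentages for three made-up groups
pct <- c(10, 11, 12,  14, 10, 18,  9, 9.5, 10)
grp <- rep(c("A", "B", "C"), each = 3)

# Side-by-side boxplots, one per group
boxplot(pct ~ grp)

# Median and interquartile range per group
med    <- tapply(pct, grp, median)
spread <- tapply(pct, grp, IQR)
med
spread

# Group with the highest median and group with the widest box
top_median <- names(med)[which.max(med)]
top_spread <- names(spread)[which.max(spread)]
top_median
top_spread
```

In this toy data group B has both the highest median and the largest interquartile range.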
