R-Basics: Part 3 - All about Indexing

Overview

Subset a vector based on properties of another vector.
Use multiple logical operators to index vectors.
Extract the indices of vector elements satisfying one or more logical conditions.
Extract the indices of vector elements matching with another vector.
Determine which elements in one vector are present in another vector.
Wrangle data tables using functions in the dplyr package.
Modify a data table by adding or changing columns.
Subset rows in a data table.
Subset columns in a data table.
Perform a series of operations using the pipe operator.
Create data frames.
Plot data in scatter plots, box plots, and histograms.
Sample Q/As

Indexing

R provides a powerful and convenient way of indexing vectors. We can use logical operators to index a vector.

If we compare a vector with a single number, R performs the test for each entry and returns a logical vector of the same length.

Let’s see some examples:

index <- murder_rate < 0.71
#index
murders$state[index]

[1] "Hawaii"        "Iowa"          "New Hampshire" "North Dakota" 
[5] "Vermont"

#Or
index1 <- murder_rate > 0.71
#index1
murders$state[index1]

 [1] "Alabama"              "Alaska"               "Arizona"             
 [4] "Arkansas"             "California"           "Colorado"            
 [7] "Connecticut"          "Delaware"             "District of Columbia"
[10] "Florida"              "Georgia"              "Idaho"               
[13] "Illinois"             "Indiana"              "Kansas"              
[16] "Kentucky"             "Louisiana"            "Maine"               
[19] "Maryland"             "Massachusetts"        "Michigan"            
[22] "Minnesota"            "Mississippi"          "Missouri"            
[25] "Montana"              "Nebraska"             "Nevada"              
[28] "New Jersey"           "New Mexico"           "New York"            
[31] "North Carolina"       "Ohio"                 "Oklahoma"            
[34] "Oregon"               "Pennsylvania"         "Rhode Island"        
[37] "South Carolina"       "South Dakota"         "Tennessee"           
[40] "Texas"                "Utah"                 "Virginia"            
[43] "Washington"           "West Virginia"        "Wisconsin"           
[46] "Wyoming"
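
To make the element-wise test concrete, here is a minimal sketch on a small made-up vector. The names `rates` and `places` are hypothetical and are not part of the murders data.

```r
# Hypothetical toy data, not taken from the murders data set
rates  <- c(0.5, 2.8, 0.7, 3.4)
places <- c("A", "B", "C", "D")

# Comparing a vector with a single number tests every entry
low <- rates < 0.71
low
# [1]  TRUE FALSE  TRUE FALSE

# A logical vector used as an index keeps only the TRUE positions
places[low]
# [1] "A" "C"
```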

Counting TRUE entries using the sum() function

This works because the logical vector is coerced to numeric: TRUE becomes 1 and FALSE becomes 0.

sum(index)

[1] 5

sum(index1)

[1] 46

This confirms that 5 states have murder rates below 0.71 per 100,000, and 46 (counting the District of Columbia) have murder rates above 0.71 per 100,000.
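
The coercion can be seen directly in a short sketch. The vector `flags` is a made-up example, not part of the murders data.

```r
# Hypothetical logical vector
flags <- c(TRUE, FALSE, TRUE, TRUE)

# TRUE becomes 1 and FALSE becomes 0
as.numeric(flags)
# [1] 1 0 1 1

n_true  <- sum(flags)   # number of TRUE entries
n_false <- sum(!flags)  # number of FALSE entries
n_true
# [1] 3
n_false
# [1] 1
```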

Sometimes we need the data points that meet more than one condition. Here’s the list of the logical operators.

knitr::include_graphics("C:/Users/nirma/Documents/GitHub/Practice/logical vectors.JPG")

Setting two or more conditions using the & operator

For example, suppose somebody wants to live in the mountains but also wants the murder rate to be no more than 1 per 100,000. The West region has a lot of mountains, so let’s see which Western states have a murder rate of 1 or lower.

west <- murders$region == "West"
safe <- murder_rate <= 1

#Defining the index
index2 <- safe & west
murders$state[index2]

[1] "Hawaii"  "Idaho"   "Oregon"  "Utah"    "Wyoming"

Based on the results, five Western states have murder rates lower than or equal to 1 per 100,000.
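
The same idea extends to the other logical operators. The sketch below uses made-up `region` and `rate` vectors (hypothetical values) to compare `&`, `|`, and `!`.

```r
# Hypothetical toy data
region <- c("West", "South", "West", "Northeast")
rate   <- c(0.8, 0.9, 2.5, 3.1)

west <- region == "West"
safe <- rate <= 1

both          <- west & safe   # TRUE only where both conditions hold
either        <- west | safe   # TRUE where at least one condition holds
safe_not_west <- !west & safe  # safe entries outside the West

both
# [1]  TRUE FALSE FALSE FALSE
either
# [1]  TRUE  TRUE  TRUE FALSE
safe_not_west
# [1] FALSE  TRUE FALSE FALSE
```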

Indexing Functions

The function which() gives us the indices of the entries of a logical vector that are TRUE.

x <- c(FALSE, TRUE, FALSE, TRUE, TRUE, FALSE)
which(x)    # returns indices that are TRUE

[1] 2 4 5

For example, to determine the murder rate in Florida we may do the following:

index <- which(murders$state == "Florida")
index

[1] 10

murder_rate[index]

[1] 3.398069
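
which() also works with combined conditions, which is one way to extract the indices of elements satisfying more than one test. A minimal sketch with made-up vectors (the values are hypothetical):

```r
# Hypothetical toy data
region <- c("West", "South", "West", "Northeast")
rate   <- c(0.8, 0.9, 2.5, 3.1)

# Indices where a single condition holds
high_idx <- which(rate > 1)
high_idx
# [1] 3 4

# Indices where two conditions hold at once
west_safe_idx <- which(region == "West" & rate <= 1)
west_safe_idx
# [1] 1

# The indices can then be used to subset any parallel vector
rate[west_safe_idx]
# [1] 0.8
```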

The function match() looks for entries in a vector and returns the index needed to access them. For example, if we want to obtain the indices and subsequent murder rates of Montana, North Carolina, and Maryland, we do:

index <- match(c("Montana", "North Carolina", "Maryland"), murders$state)
index

[1] 27 34 21

murders$state[index]

[1] "Montana"        "North Carolina" "Maryland"

murder_rate[index]

[1] 1.212838 2.999324 5.074866
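
Two details of match() are worth seeing in a small sketch: the result follows the order of the first argument, and a value that is not found gives NA. The vector `states` and the name "Atlantis" are made up for illustration.

```r
# Hypothetical lookup vector
states <- c("Alabama", "Alaska", "Arizona")

# The result follows the order of the first argument;
# a value that is missing from states gives NA
idx <- match(c("Arizona", "Alabama", "Atlantis"), states)
idx
# [1]  3  1 NA

# Indexing with NA returns NA for that position
states[idx]
# [1] "Arizona" "Alabama" NA
```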

We use the %in% operator if we want to know whether or not each element of a first vector is in a second vector.

It is perhaps the most useful of these tools. It tells us which items in one vector are also in another vector. For example, I create three vectors x, y, and z and run some tests.

x <- c("k", "l", "m", "n", "o", "p", "q", "r", "s")
y <- c("k", "l", "m","t","u","v")
z <- c("e", "f", "g","o", "p", "q", "r","v")
# Now let's check
y %in% x

[1]  TRUE  TRUE  TRUE FALSE FALSE FALSE

y %in% z

[1] FALSE FALSE FALSE FALSE FALSE  TRUE

z %in% x

[1] FALSE FALSE FALSE  TRUE  TRUE  TRUE  TRUE FALSE

TRUE means that the element of the first vector is also in the second vector; FALSE means that it is not.

I want to check if Orlando, Jackson, Purchase, and Georgia are states. I can use the %in% operator.

c("Orlando","Jackson","Purchase", "Georgia") %in% murders$state

[1] FALSE FALSE FALSE  TRUE

Orlando, Jackson, and Purchase are not states but Georgia is.
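
The three tools can be compared side by side. This sketch uses short made-up vectors `x` and `y` (hypothetical values) to show what each one returns.

```r
# Hypothetical toy vectors
x <- c("k", "l", "m", "n")
y <- c("m", "z", "k")

in_x <- y %in% x          # is each element of y found in x?
in_x
# [1]  TRUE FALSE  TRUE

pos_in_y <- which(y %in% x)  # positions in y of the matching elements
pos_in_y
# [1] 1 3

pos_in_x <- match(y, x)      # positions in x where each element of y sits
pos_in_x
# [1]  3 NA  1

missing_from_x <- y[!(y %in% x)]  # elements of y that are not in x
missing_from_x
# [1] "z"
```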

Basic Data Wrangling using the dplyr package

We can load the dplyr package the following way:

library(dplyr)

To change a data table by adding a new column, or changing an existing one, we use the mutate() function.

The mutate() function takes the data frame as its first argument and the name and value of the new variable as its second argument. For example, we previously created a vector called murder_rate, but it is not part of our data frame. Now let’s add it as a new column to the murders data frame.

#Checking the first few cases of murders data set
head(murders)

       state abb region population total
1    Alabama  AL  South    4779736   135
2     Alaska  AK   West     710231    19
3    Arizona  AZ   West    6392017   232
4   Arkansas  AR  South    2915918    93
5 California  CA   West   37253956  1257
6   Colorado  CO   West    5029196    65

#Adding a new column named m_rate
murders <- mutate(murders, m_rate = total / population * 100000)
#Checking for changes
head(murders)

       state abb region population total   m_rate
1    Alabama  AL  South    4779736   135 2.824424
2     Alaska  AK   West     710231    19 2.675186
3    Arizona  AZ   West    6392017   232 3.629527
4   Arkansas  AR  South    2915918    93 3.189390
5 California  CA   West   37253956  1257 3.374138
6   Colorado  CO   West    5029196    65 1.292453

The second table has a new column named m_rate. How is this possible? The mutate() function looked for the columns named ‘total’ and ‘population’ in the ‘murders’ data frame and carried out the calculation. It found them in the data frame, not in the workspace.
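
The lookup behaviour is easier to see next to its base R equivalent. This is a minimal sketch on a made-up data frame `toy` (hypothetical values); only the column names mirror the murders table.

```r
library(dplyr)

# Hypothetical toy data; column names mirror the murders table
toy <- data.frame(
  state      = c("A", "B"),
  population = c(1000000, 500000),
  total      = c(20, 5)
)

# mutate() finds total and population inside toy,
# so the columns can be written without the toy$ prefix
toy <- mutate(toy, rate = total / population * 100000)

# Base R equivalent, where the data frame must be named explicitly
toy$rate_base <- toy$total / toy$population * 100000

toy$rate
# [1] 2 1
```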

To filter the data by subsetting rows, we use the function filter().

Suppose we want only the rows for states with a murder rate of 0.71 per 100,000 or lower. The filter() function takes the data table as its first argument and the condition as its second.

filter(murders, m_rate <= 0.71)

          state abb        region population total    m_rate
1        Hawaii  HI          West    1360301     7 0.5145920
2          Iowa  IA North Central    3046355    21 0.6893484
3 New Hampshire  NH     Northeast    1316470     5 0.3798036
4  North Dakota  ND North Central     672591     4 0.5947151
5       Vermont  VT     Northeast     625741     2 0.3196211

We can see that five states have murder rates of 0.71 or lower. We get the entire row for each of these five states.

To subset the data by selecting specific columns, we use the select() function.

Some data tables contain hundreds of columns, and we may need only a few of them. I am going to create a new object ‘my_data’ by subsetting the murders data table with the help of the ‘select()’ function.

my_data <- select(murders, abb, region, m_rate)
head(my_data)

  abb region   m_rate
1  AL  South 2.824424
2  AK   West 2.675186
3  AZ   West 3.629527
4  AR  South 3.189390
5  CA   West 3.374138
6  CO   West 1.292453

# Now, let's filter the states with a murder rate of 0.90 or lower
filter(my_data, m_rate <= 0.90)

  abb        region    m_rate
1  HI          West 0.5145920
2  ID          West 0.7655102
3  IA North Central 0.6893484
4  ME     Northeast 0.8280881
5  NH     Northeast 0.3798036
6  ND North Central 0.5947151
7  UT          West 0.7959810
8  VT     Northeast 0.3196211
9  WY          West 0.8871131

The ‘my_data’ object has just 3 columns, which makes it a much smaller and more manageable table than the ‘murders’ table. After filtering, we see that 9 states have murder rates equal to or lower than 0.90.

We can perform a series of operations by sending the results of one function to another function using the pipe operator, %>%.

Let’s put all these functions together. Using the dplyr package, we can chain several steps into one expression with the pipe operator. For example, we can take some data, ‘select’ the required columns, and ‘filter’ some rows. Doing so, we avoid creating an intermediate data table like my_data above.

murders %>% #take murders data table
  select(state,region,m_rate) %>% #select state, region, and m_rate columns
  filter(m_rate <= 0.90) #keep the rows that have m_rate <= 0.90

          state        region    m_rate
1        Hawaii          West 0.5145920
2         Idaho          West 0.7655102
3          Iowa North Central 0.6893484
4         Maine     Northeast 0.8280881
5 New Hampshire     Northeast 0.3798036
6  North Dakota North Central 0.5947151
7          Utah          West 0.7959810
8       Vermont     Northeast 0.3196211
9       Wyoming          West 0.8871131

Here we go. We get the same nine states, this time with the state column shown instead of abb.
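
To see what the pipe is doing, it can help to write the same steps as a nested call. The sketch below uses a made-up data frame `toy` (hypothetical values) and compares the two forms.

```r
library(dplyr)

# Hypothetical toy data
toy <- data.frame(
  state  = c("A", "B", "C"),
  region = c("West", "South", "West"),
  rate   = c(0.5, 2.0, 0.8)
)

# Piped form: the result of each step becomes
# the first argument of the next function
piped <- toy %>%
  select(state, rate) %>%
  filter(rate <= 0.9)

# Nested form of the same steps, read from the inside out
nested <- filter(select(toy, state, rate), rate <= 0.9)

piped$state
# [1] "A" "C"
```

Both forms keep the same rows; the piped version simply reads in the order the steps are carried out.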

Creating Data Frames

We can use the data.frame() function to create data frames. I am going to create a data frame named ‘test_scores’ with three columns and four rows. The first column is the ‘names’ column, followed by the ‘pretest’ and ‘posttest’ scores.

test_scores <- data.frame(names = c("Sushila", "Aanishma", "Nikita", "Arjun"), pretest = c(85, 98, 74, 69),
                          posttest = c(95, 99, 86, 79))
test_scores

     names pretest posttest
1  Sushila      85       95
2 Aanishma      98       99
3   Nikita      74       86
4    Arjun      69       79

Yay. We just created a data frame with 3 columns and 4 rows.

Formerly, the data.frame() function turned characters into factors by default. To avoid this, we could use the stringsAsFactors argument and set it to FALSE. As of R 4.0, it is no longer necessary to include the stringsAsFactors argument, because R no longer turns characters into factors by default.
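
The effect of the argument can be checked with class(). In this sketch, `df_chr` and `df_fct` are made-up names, and the argument is set explicitly in both calls so the result does not depend on the R version.

```r
# Keep character columns as characters (the default since R 4.0)
df_chr <- data.frame(names = c("a", "b"), stringsAsFactors = FALSE)

# Ask for the old behaviour explicitly
df_fct <- data.frame(names = c("a", "b"), stringsAsFactors = TRUE)

class(df_chr$names)
# [1] "character"
class(df_fct$names)
# [1] "factor"
```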

Basic Plots

Exploratory data analysis is one of the great strengths of R. We can quickly go from idea to data to data visualization without much effort.

We can create a simple scatterplot using the function plot().

population_in_millions <- murders$population/10^6
total_gun_murders <- murders$total
plot(population_in_millions, total_gun_murders)

It is easy to see that there is a positive, roughly linear relationship between population and the total gun murders.

Histograms are powerful graphical summaries that give us a general overview of the types of values we have. In R, they can be produced using the hist() function.

hist(murder_rate)

To find the state with the highest murder rate, we can combine which.max() with indexing:

murders$state[which.max(murders$m_rate)]

[1] "District of Columbia"

Boxplots provide a more compact summary of a distribution than a histogram and are more useful for comparing distributions. They can be produced using the boxplot() function.

boxplot(m_rate ~ region, data = murders)

We can see that the South has higher murder rates than the other regions.

Sample Q/As

For these questions we need the heights data set included in the dslabs package. Let’s load the library, get the data ready, and set the number of significant digits for the rest of the analyses.

library(dslabs)
data(heights)
options(digits = 3) # Report 3 significant digits for all answers
head(heights)

     sex height
1   Male     75
2   Male     70
3   Male     68
4   Male     74
5   Male     61
6 Female     65

summary(heights)

     sex          height    
 Female:238   Min.   :50.0  
 Male  :812   1st Qu.:66.0  
              Median :68.5  
              Mean   :68.3  
              3rd Qu.:71.0  
              Max.   :82.7

The data set has only two variables, ‘sex’ and ‘height’. Of all the data points, 238 are females and 812 are males. The mean height is 68.3 inches (the median is 68.5 inches).

Q.1. First, determine the average height in this dataset. Then create a logical vector ind with the indices for those individuals who are above average height. How many individuals in the dataset are above average height?

# calculating the average height of the sample
average_height <- mean(heights$height)
#indexing the data for the individuals who are higher than average height
ind <- heights$height > average_height
# heights$height[ind] #Gives all the heights higher than average. I don't want to make my paper messy. 
sum(ind)

[1] 532

There are 532 individuals in the data set who are taller than average.

Q.2. How many individuals in the dataset are above average height and are female?

# Calculating Female only Population
female_sample <- heights$sex == "Female"
taller_female <- heights$height > mean(heights$height)
# calculating above average tall females
indf <- female_sample & taller_female
sum(indf)

[1] 31

There are a total of 31 females who are taller than the average height.

Q.3. If you use mean() on a logical (TRUE/FALSE) vector, it returns the proportion of observations that are TRUE. What proportion of individuals in the dataset are female?

mean(heights$sex == "Female")

[1] 0.227

The result shows that a proportion of 0.227, or 22.7%, of the individuals in the dataset are females.
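
The step from a proportion to a percentage can be sketched on a tiny made-up vector. Here `sex_toy` is a hypothetical example, not the heights data.

```r
# Hypothetical toy vector
sex_toy <- c("Female", "Male", "Male", "Male")

# mean() of a logical vector is the share of TRUE entries
prop_female <- mean(sex_toy == "Female")
prop_female
# [1] 0.25

# Multiply by 100 to report it as a percentage
pct_female <- 100 * prop_female
pct_female
# [1] 25
```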

Q.4. Determine the minimum height in the heights dataset.

min(heights$height)

[1] 50

The minimum height among the sample is 50 inches.

Q.5. Use the match() function to determine the index of the first individual with the minimum height.

match(50,heights$height)

[1] 1032

The first person with the minimum height in the data set is located in the 1032nd row of the height column.
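
As a side note, base R offers other ways to reach the same index. The sketch below uses a made-up vector `h` (hypothetical values) to compare match() with which.min() and which().

```r
# Hypothetical toy heights; the minimum appears twice
h <- c(61, 50, 72, 50)

first_by_match <- match(min(h), h)  # first position holding the minimum
first_by_which <- which.min(h)      # also the first position of the minimum
all_min        <- which(h == min(h))  # every position holding the minimum

first_by_match
# [1] 2
first_by_which
# [1] 2
all_min
# [1] 2 4
```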

Q.6. Subset the sex column of the dataset by the index in 5 to determine the individual’s sex.

heights$sex[match(50,heights$height)]

[1] Male
Levels: Female Male

The 1032nd person, with a height of 50 inches, is a male.

Q.6. Determine the maximum height.

max(heights$height)

[1] 82.7

The maximum height in the sample is 82.7 inches.

Which integer values are between the maximum and minimum heights? For example, if the minimum height is 10.2 and the maximum height is 20.8, your answer should be x <- 11:20 to capture the integers in between those values. (If either the maximum or minimum height are integers, include those values too.)

Q.7. Write code to create a vector x that includes the integers between the minimum and maximum heights (as numbers).

x <- 50:82 # the maximum height is 82.7, so 82 is the largest integer in range

Q.8. How many of the integers in x are NOT heights in the dataset?

sum(!(x %in% heights$height))

[1] 3

There are three integers in x that are not heights in heights$height.
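
The integer range can also be built from the minimum and maximum rather than typed by hand. This sketch uses the 10.2 and 20.8 example from the question; the names `lo`, `hi`, `ints`, and `vals` are hypothetical, as are the values in `vals`.

```r
# Example limits from the question text
lo <- 10.2
hi <- 20.8

# Round the minimum up and the maximum down to stay inside the range
ints <- ceiling(lo):floor(hi)
ints
# [1] 11 12 13 14 15 16 17 18 19 20

# Hypothetical observed values, to count integers that never occur
vals <- c(11, 12.5, 13, 20)
n_missing <- sum(!(ints %in% vals))
n_missing
# [1] 7
```

If either limit is already an integer, ceiling() and floor() leave it unchanged, so it is included as the question asks.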

Using the heights dataset, create a new column of heights in centimeters named ht_cm. Recall that 1 inch = 2.54 centimeters. Save the resulting dataset as heights2.

heights <- mutate(heights, ht_cm = height*2.54)
head(heights)

     sex height ht_cm
1   Male     75   190
2   Male     70   178
3   Male     68   173
4   Male     74   188
5   Male     61   155
6 Female     65   165

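The dataset now has a third column, ht_cm. Note that I saved the result back to heights rather than to a new heights2 object, so the code below uses heights.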

Q.9. What is the height in centimeters of the 18th individual (index 18)?

heights$ht_cm[18]

[1] 163

The height of the 18th individual in the data set is 163 cm.

Q.10. What is the mean height in centimeters?

mean(heights$ht_cm)

[1] 174

The mean height of the sample in the data set is 174 centimeters.

Create a data frame females by filtering the heights2 data to contain only female individuals.

females <- filter(heights, sex == "Female")
head(females)

     sex height ht_cm
1 Female     65   165
2 Female     66   168
3 Female     62   157
4 Female     66   168
5 Female     64   163
6 Female     60   152

The new data frame females has been created, and it contains only the rows for female individuals.

Q.11. How many females are in the heights2 dataset?

nrow(females)

[1] 238

There are a total of 238 females in the data set.

Q.12. What is the mean height of the females in centimeters?

mean(females$ht_cm)

[1] 165

The mean height of the females in the data set is 165 cm.

The olive dataset in dslabs contains composition in percentage of eight fatty acids found in the lipid fraction of 572 Italian olive oils:

data(olive)
head(olive)

          region         area palmitic palmitoleic stearic oleic linoleic
1 Southern Italy North-Apulia    10.75        0.75    2.26  78.2     6.72
2 Southern Italy North-Apulia    10.88        0.73    2.24  77.1     7.81
3 Southern Italy North-Apulia     9.11        0.54    2.46  81.1     5.49
4 Southern Italy North-Apulia     9.66        0.57    2.40  79.5     6.19
5 Southern Italy North-Apulia    10.51        0.67    2.59  77.7     6.72
6 Southern Italy North-Apulia     9.11        0.49    2.68  79.2     6.78
  linolenic arachidic eicosenoic
1      0.36      0.60       0.29
2      0.31      0.61       0.29
3      0.31      0.63       0.29
4      0.50      0.78       0.35
5      0.50      0.80       0.46
6      0.51      0.70       0.44

The data is loaded. Now for the questions.

Q.13. Plot the percent palmitic acid versus palmitoleic acid in a scatterplot. What relationship do you see?

palmitic_acid <- olive$palmitic
palmitoleic_acid <- olive$palmitoleic
plot(palmitic_acid,palmitoleic_acid)

We see a positive linear relationship between these variables.

Q.13. Create a histogram of the percentage of eicosenoic acid in olive.

hist(olive$eicosenoic)

The most common value of eicosenoic acid is below 0.05%.

Q.14. Make a boxplot of palmitic acid percentage in olive with separate distributions for each region. Which region has the highest median palmitic acid percentage? Which region has the most variable palmitic acid percentage?

boxplot(palmitic ~ region, data = olive)

Southern Italy has the highest median palmitic acid percentage, and the same region has the most variable data.
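
A reading taken from a boxplot can be checked numerically with tapply(). This is a minimal sketch on made-up data (the groups `grp` and values `val` are hypothetical), not the olive results.

```r
# Hypothetical toy data with two groups
grp <- c("N", "N", "N", "S", "S", "S")
val <- c(10, 11, 12, 9, 13, 17)

# Median per group: the line inside each box
med_by_grp <- tapply(val, grp, median)
med_by_grp
#  N  S 
# 11 13 

# Interquartile range per group: one measure of spread
iqr_by_grp <- tapply(val, grp, IQR)
iqr_by_grp
# N S 
# 1 4 
```

The same two calls, applied to a column and a grouping variable of a real data frame, would give the numbers behind each box.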
