R-Basic: Part II- concatenate, class(), names(), multi entry vector, vector coercion etc.

 

Overview

  • Create numeric and character vectors.
  • Name the elements of a vector.
  • Generate numeric sequences.
  • Access specific elements or parts of a vector.
  • Coerce data into different data types as needed.
  • Sort vectors in ascending and descending order.
  • Extract the indices of the sorted elements from the original vector.
  • Find the maximum and minimum elements, as well as their indices, in a vector.
  • Rank the elements of a vector in increasing order.
  • Perform arithmetic between a vector and a single number.
  • Perform arithmetic between two vectors of the same length.
  • Work through some sample Q/As.

Vector aka Variable

The most basic unit for storing data in R is the vector. A data set often has multiple variables, and each variable can be stored as a vector.

  • Vectors can be created with the function c(), which stands for concatenate.
country <- c("Bangladesh", "Cameroon", "Croatia")
codes <- c(Bangladesh = 880, Cameroon = 237, Croatia = 385)
codes
Bangladesh   Cameroon    Croatia 
       880        237        385 
class(codes)
[1] "numeric"

We get the same result if we put the names in quotes:

codes1 <- c("Bangladesh" = 880, "Cameroon" = 237, "Croatia" = 385)
codes1
Bangladesh   Cameroon    Croatia 
       880        237        385 
class(codes1)
[1] "numeric"

We can also use the names() function to assign names to the codes:

names(codes) <- country
codes
Bangladesh   Cameroon    Croatia 
       880        237        385 

Generating Sequences with the Function seq()

seq(10,18) #Generates all the numbers between 10 and 18 inclusive
[1] 10 11 12 13 14 15 16 17 18
#Or
10:18
[1] 10 11 12 13 14 15 16 17 18
seq(10,30,3) # Starts at 10 and goes up to at most 30 in steps of 3
[1] 10 13 16 19 22 25 28
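A small sketch of other ways to call seq(), using the base R arguments by and length.out. The variable names and values are only illustrative.

```r
# Step by 3, stopping at the last value that does not exceed 30
by_three <- seq(10, 30, by = 3)
by_three
# [1] 10 13 16 19 22 25 28

# Ask for a fixed number of evenly spaced values instead of a step size
five_points <- seq(0, 100, length.out = 5)
five_points
# [1]   0  25  50  75 100

# A decreasing sequence needs a negative step
countdown <- seq(18, 10, by = -2)
countdown
# [1] 18 16 14 12 10
```

Naming the arguments (by, length.out) makes the intent clearer than relying on their position.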

Subsetting

We use [] to access elements of a vector. For example:

# We can subset the first element of 'country' and the third element of the 'codes' vector above
country[1]
[1] "Bangladesh"
codes[3]
Croatia 
    385 

We can access more than one element by using a multi-entry vector as an index:

codes[c(2,3)]
Cameroon  Croatia 
     237      385 
country[c(1,2)]
[1] "Bangladesh" "Cameroon"  
# Or we can simply use this syntax
codes[1:2]
Bangladesh   Cameroon 
       880        237 

If elements have names, we can access them using the names

codes["Cameroon"]
Cameroon 
     237 
# If we want to get two or more 
codes[c("Cameroon","Croatia")]
Cameroon  Croatia 
     237      385 
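A few more subsetting forms can be sketched on the same named vector. The vector is redefined here so the snippet runs on its own. Negative indices and double brackets are base R features that the text above does not cover, so treat this as an optional extra.

```r
codes <- c(Bangladesh = 880, Cameroon = 237, Croatia = 385)

# A negative index drops the element at that position
codes[-1]
# Cameroon  Croatia 
#      237      385 

# Position and name refer to the same element
identical(codes[2], codes["Cameroon"])
# [1] TRUE

# Double brackets return the bare value without its name
codes[["Croatia"]]
# [1] 385
```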

Vector Coercion

In general, coercion is an attempt by R to be flexible with data types by guessing what was meant when an entry does not match the expected type.

# Creating a vector with mixed numeric and character elements
a <- c(4,237, "Cameroon")
a
[1] "4"        "237"      "Cameroon"
class(a)
[1] "character"

We don’t get an error message when we create or print the vector a, because R converts all of its elements to the same class, character in this case. R coerced the data into character strings.

In addition, R has functions that force a specific coercion, such as as.numeric(), as.vector(), as.character(), and as.integer(). For example, the vector a is a character vector, and I can force it into a numeric or integer vector.

# As numeric
a <- as.numeric(a)
Warning: NAs introduced by coercion
a
[1]   4 237  NA
class(a)
[1] "numeric"
# As integer
a <- as.integer(a)
class(a)
[1] "integer"

These functions are quite useful in practice because many public data sets store numbers as character vectors. In R, NA represents missing data. Coercion can also produce NAs: when R tries to coerce a value and cannot, the result is NA. For example, when I converted the vector a to numeric above, I received a warning that NAs were introduced. Printing the vector confirmed it: "Cameroon" had become NA. As data scientists we encounter many missing values in our data sets, and if we don’t know what they mean or how to deal with them, our task becomes much harder.
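The coercion rules and the resulting NAs can be sketched as follows. The vector names raw and converted are made up for illustration. suppressWarnings() only hides the warning, it does not change the result.

```r
# Mixed input: R promotes everything to the most flexible type present
class(c(TRUE, 2L))         # logical + integer  -> "integer"
class(c(2L, 3.5))          # integer + double   -> "numeric"
class(c(3.5, "Cameroon"))  # double + character -> "character"

# Converting back to numbers can produce NA
raw <- c("4", "237", "Cameroon")
converted <- suppressWarnings(as.numeric(raw))
converted
# [1]   4 237  NA

# is.na() flags the entries that could not be converted
is.na(converted)
# [1] FALSE FALSE  TRUE
raw[is.na(converted)]
# [1] "Cameroon"
```

Checking is.na() right after a conversion is a quick way to see which original entries were not valid numbers.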

Some basic Q/A which can be easily solved using R

  1. To find the solutions to an equation of the form ax^2 + bx + c = 0, use the quadratic formula: x = (−b ± √(b^2 − 4ac))/(2a). What are the two solutions to 2x^2 − x − 4 = 0?
a <- 2; b <- -1; c <- -4 # Coefficients of 2x^2 - x - 4
(-b + sqrt(b^2 - 4*a*c))/(2*a) # Larger solution
[1] 1.686141
(-b - sqrt(b^2 - 4*a*c))/(2*a) # Smaller solution
[1] -1.186141
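The same calculation can be wrapped in a small function so it can be reused for other coefficients. This is only a sketch: quadratic_roots() is a hypothetical helper, not a base R function, and it handles real solutions only.

```r
# quadratic_roots() is a made-up helper for illustration
quadratic_roots <- function(a, b, c) {
  discriminant <- b^2 - 4 * a * c
  if (discriminant < 0) {
    stop("no real solutions: the discriminant is negative")
  }
  # The argument named c does not break the call c(1, -1):
  # when R looks up a function, it skips objects that are not functions
  (-b + c(1, -1) * sqrt(discriminant)) / (2 * a)
}

roots <- quadratic_roots(2, -1, -4)
roots
# [1]  1.686141 -1.186141

# Plugging the roots back in should give values very close to zero
2 * roots^2 - roots - 4
```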
  2. Use R to compute log base 4 of 1024. You can use the help() function to learn how to use arguments to change the base of the log() function.
log(1024, base = 4)
[1] 5
#Or
log(1024,4)
[1] 5
#Or
log(x = 1024, base = 4)
[1] 5
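As a cross-check, the change-of-base rule gives the same answer without the base argument. This is a sketch using only base R functions.

```r
# log_b(x) = log(x) / log(b), for any common base of the two logs
log(1024) / log(4)    # natural logs
# [1] 5
log2(1024) / log2(4)  # base-2 logs: 10 / 2
# [1] 5

# Sanity check: 4 raised to the answer gives back 1024
4^5
# [1] 1024
```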
  3. Load the dslabs package and get ready to explore the movielens data set.

3a. How many rows are in the data set?
3b. How many different variables are in the data set?

library(dslabs)
data(movielens)
summary(movielens)
    movieId          title                year     
 Min.   :     1   Length:100004      Min.   :1902  
 1st Qu.:  1028   Class :character   1st Qu.:1987  
 Median :  2406   Mode  :character   Median :1995  
 Mean   : 12549                      Mean   :1992  
 3rd Qu.:  5418                      3rd Qu.:2001  
 Max.   :163949                      Max.   :2016  
                                     NA's   :7     
                  genres          userId        rating        timestamp        
 Drama               : 7757   Min.   :  1   Min.   :0.500   Min.   :7.897e+08  
 Comedy              : 6748   1st Qu.:182   1st Qu.:3.000   1st Qu.:9.658e+08  
 Comedy|Romance      : 3973   Median :367   Median :4.000   Median :1.110e+09  
 Drama|Romance       : 3462   Mean   :347   Mean   :3.544   Mean   :1.130e+09  
 Comedy|Drama        : 3272   3rd Qu.:520   3rd Qu.:4.000   3rd Qu.:1.296e+09  
 Comedy|Drama|Romance: 3204   Max.   :671   Max.   :5.000   Max.   :1.477e+09  
 (Other)             :71588                                                    
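summary() is verbose for questions 3a and 3b. Base R's dim(), nrow(), and ncol() report the size directly. The sketch below uses a small made-up data frame (toy_movies is hypothetical) so that it runs without dslabs. Applied to movielens, the same calls should agree with the summary above, which reports a length of 100004 for title.

```r
# A tiny stand-in data frame with invented values
toy_movies <- data.frame(
  title  = c("A", "B", "C"),
  year   = c(1995, 2001, 2016),
  rating = c(4.0, 3.5, 5.0)
)

nrow(toy_movies)   # number of rows
# [1] 3
ncol(toy_movies)   # number of variables (columns)
# [1] 3
dim(toy_movies)    # rows and columns together
# [1] 3 3
names(toy_movies)  # the variable names
# [1] "title"  "year"   "rating"
```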

3c. What is the variable type of title?
3d. What is the variable type of genres?

class(movielens$genres)
[1] "factor"
class(movielens$title)
[1] "character"

We already know we can use the levels() function to determine the levels of a factor. A different function, nlevels(), may be used to determine the number of levels of a factor.

  4. Use this function to determine how many levels are in the factor genres in the movielens data frame.
nlevels(movielens$genres)
[1] 901
# If we want to see the names
head(levels(movielens$genres))
[1] "(no genres listed)"                        
[2] "Action"                                    
[3] "Action|Adventure"                          
[4] "Action|Adventure|Animation"                
[5] "Action|Adventure|Animation|Children"       
[6] "Action|Adventure|Animation|Children|Comedy"
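To see what levels() and nlevels() report without loading dslabs, here is a sketch on a small made-up factor (toy_genres is hypothetical).

```r
toy_genres <- factor(c("Drama", "Comedy", "Drama", "Action", "Comedy"))

class(toy_genres)
# [1] "factor"

# Levels are the distinct values, sorted alphabetically by default
levels(toy_genres)
# [1] "Action" "Comedy" "Drama" 
nlevels(toy_genres)
# [1] 3

# Internally each entry is stored as the integer code of its level
as.integer(toy_genres)
# [1] 3 2 3 1 2
```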

Sorting

Sorting, as self-explanatory as it is, refers to putting elements in a desired order.

  • The function sort() sorts a vector in increasing order.
  • The function order() produces the indices needed to obtain the sorted vector, e.g. a result of 2 3 1 5 4 means the sorted vector will be produced by listing the 2nd, 3rd, 1st, 5th, and then 4th item of the original vector.
  • The function rank() gives us the ranks of the items in the original vector.
  • The function max() returns the largest value, while which.max() returns the index of the largest value. The functions min() and which.min() work similarly for minimum values.

Let’s start with a simple vector b:

b <- c(1,5,8,3,2,9,85,17,43,55)
sort(b) # Puts the values of b in increasing order
 [1]  1  2  3  5  8  9 17 43 55 85

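Now let’s rearrange the values using an index. The code further below does three things:

  1. It creates a vector called index that contains the positions (indices) that put b in increasing order,
  2. it uses index to subset b, and
  3. it prints the values of b in sorted order.

Confusing? The table below lays this out by hand. SN is the position of each value in b, Order is the sorted vector, SN_Position is the original position of each sorted value (what order() returns), and Rank is the rank of each value of b. Compare the SN and SN_Position columns: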
library(kableExtra)
tbl <- data.frame(
  b = c(1, 5, 8, 3, 2, 9, 85, 17, 43, 55),
  SN = c(1,2,3,4,5,6,7,8,9,10),
  Order = c(1,2,3,5,8,9,17,43,55,85),
  SN_Position = c(1,5,4,2,3,6,8,9,10,7),
  Rank = c(1, 4, 5, 3, 2, 6, 10, 7, 8, 9)
)
tbl %>%
  kbl() %>%
  kable_classic_2(full_width = FALSE, html_font = "Cambria") %>%
  column_spec(2, bold = TRUE, color = "yellow", background = "#D7261E") %>%
  column_spec(4, bold = TRUE, color = "yellow", background = "#D7261E")
 b SN Order SN_Position Rank
 1  1     1           1    1
 5  2     2           5    4
 8  3     3           4    5
 3  4     5           2    3
 2  5     8           3    2
 9  6     9           6    6
85  7    17           8   10
17  8    43           9    7
43  9    55          10    8
55 10    85           7    9

Now, let’s use R to produce the same output. The function order() returns the indices that sort its argument. Likewise, rank() gives the rank of the 1st entry, 2nd entry, 3rd entry, etc., as shown above.

index <- order(b)
b[index]
 [1]  1  2  3  5  8  9 17 43 55 85
order(b)
 [1]  1  5  4  2  3  6  8  9 10  7
rank(b)
 [1]  1  4  5  3  2  6 10  7  8  9
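The relationships between sort(), order(), and rank() can be checked directly. The sketch below redefines b so it runs on its own, and it also shows descending order through the decreasing argument.

```r
b <- c(1, 5, 8, 3, 2, 9, 85, 17, 43, 55)

# Indexing b by order(b) reproduces sort(b)
identical(b[order(b)], sort(b))
# [1] TRUE

# When there are no ties, rank(b) equals order(order(b))
all(rank(b) == order(order(b)))
# [1] TRUE

# Descending order: set decreasing = TRUE in sort() or order()
b_desc <- sort(b, decreasing = TRUE)
b_desc
#  [1] 85 55 43 17  9  8  5  3  2  1
identical(b[order(b, decreasing = TRUE)], b_desc)
# [1] TRUE

# Largest and smallest values together with their positions in b
c(max(b), which.max(b))  # 85 sits at position 7
c(min(b), which.min(b))  # 1 sits at position 1
```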

Let’s practice with real data:

data(murders)
sort(murders$total) # Puts the values of 'total' in increasing order
 [1]    2    4    5    5    7    8   11   12   12   16   19   21   22   27   32
[16]   36   38   53   63   65   67   84   93   93   97   97   99  111  116  118
[31]  120  135  142  207  219  232  246  250  286  293  310  321  351  364  376
[46]  413  457  517  669  805 1257

The totals range from a minimum of 2 up to a maximum of 1257. If I want to see the first 10 states as they were entered in the murders data set, I can simply run:

murders$state[1:10]
 [1] "Alabama"              "Alaska"               "Arizona"             
 [4] "Arkansas"             "California"           "Colorado"            
 [7] "Connecticut"          "Delaware"             "District of Columbia"
[10] "Florida"             
murders$abb[1:10]
 [1] "AL" "AK" "AZ" "AR" "CA" "CO" "CT" "DE" "DC" "FL"

It looks like the data were entered in alphabetical order by state name, starting with Alabama. The abbreviations line up with the state names one for one.

Let’s order the states by increasing murder total.

index <- order(murders$total)
murders$abb[index]
 [1] "VT" "ND" "NH" "WY" "HI" "SD" "ME" "ID" "MT" "RI" "AK" "IA" "UT" "WV" "NE"
[16] "OR" "DE" "MN" "KS" "CO" "NM" "NV" "AR" "WA" "CT" "WI" "DC" "OK" "KY" "MA"
[31] "MS" "AL" "IN" "SC" "TN" "AZ" "NJ" "VA" "NC" "MD" "OH" "MO" "LA" "IL" "GA"
[46] "MI" "PA" "NY" "FL" "TX" "CA"

It shows that Vermont has the fewest total murders, while California has the most.

If we are only interested in the highest and lowest murder totals, we can use more direct functions: max() and min().

max(murders$total)
[1] 1257
min(murders$total)
[1] 2

The results show that the highest and lowest murder totals were 1257 and 2. We can also find the positions of the corresponding states in the data set.

m_max <- which.max(murders$total)
m_max
[1] 5
m_min <- which.min(murders$total)
m_min
[1] 46

The results show that the state with the highest total is in the 5th position and the state with the lowest total is in the 46th position. We can look up the names of these states:

murders$state[m_max]
[1] "California"
murders$abb[m_min]
[1] "VT"

California had the highest total murders and Vermont the fewest.

What does this mean? Is California the most dangerous state? Probably not. The high total may simply reflect a large population. Let’s check whether California is the most populous state.

murders$state[which.max(murders$population)] # Name of the state with the largest population
[1] "California"
max(murders$population) # The largest population
[1] 37253956

Yes, it is California. A fairer comparison therefore uses the per-capita murder rate rather than total murders:

murder_rate <- (murders$total/murders$population)*100000
murders$state[which.max(murder_rate)]
[1] "District of Columbia"
max(murder_rate)
[1] 16.45275
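The rate calculation relies on element-wise vector arithmetic: dividing two vectors of the same length pairs up their entries, and multiplying by a single number applies it to every entry. A sketch with made-up numbers (the regions A, B, C and their values are invented):

```r
# Invented totals and populations for three hypothetical regions
toy_total      <- c(A = 10,  B = 80,   C = 60)
toy_population <- c(A = 1e5, B = 16e5, C = 3e5)

# Vector / vector is element-wise; * 100000 rescales every entry
toy_rate <- toy_total / toy_population * 100000
toy_rate
#  A  B  C 
# 10  5 20 

# The largest total and the largest rate belong to different regions
names(which.max(toy_total))
# [1] "B"
names(which.max(toy_rate))
# [1] "C"
```

As with California, the region with the most events is not the one with the highest rate once population is taken into account.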

Once I calculate the per-capita murder rate, California no longer has the highest value. The highest rate belongs to the District of Columbia, with 16.45275 murders per 100,000 people.

I can now order all the states in decreasing murder_rate order.

murders$state[order(murder_rate, decreasing = TRUE)]
 [1] "District of Columbia" "Louisiana"            "Missouri"            
 [4] "Maryland"             "South Carolina"       "Delaware"            
 [7] "Michigan"             "Mississippi"          "Georgia"             
[10] "Arizona"              "Pennsylvania"         "Tennessee"           
[13] "Florida"              "California"           "New Mexico"          
[16] "Texas"                "Arkansas"             "Virginia"            
[19] "Nevada"               "North Carolina"       "Oklahoma"            
[22] "Illinois"             "Alabama"              "New Jersey"          
[25] "Connecticut"          "Ohio"                 "Alaska"              
[28] "Kentucky"             "New York"             "Kansas"              
[31] "Indiana"              "Massachusetts"        "Nebraska"            
[34] "Wisconsin"            "Rhode Island"         "West Virginia"       
[37] "Washington"           "Colorado"             "Montana"             
[40] "Minnesota"            "South Dakota"         "Oregon"              
[43] "Wyoming"              "Maine"                "Utah"                
[46] "Idaho"                "Iowa"                 "North Dakota"        
[49] "Hawaii"               "New Hampshire"        "Vermont"             

California has the 14th highest murder rate. It is not even in the top 10.
