R-Basic: Part II- concatenate, class(), names(), multi entry vector, vector coercion etc.
Overview
- Create numeric and character vectors.
- Name the elements of a vector.
- Generate numeric sequences.
- Access specific elements or parts of a vector.
- Coerce data into different data types as needed.
- Sort vectors in ascending and descending order.
- Extract the indices of the sorted elements from the original vector.
- Find the maximum and minimum elements, as well as their indices, in a vector.
- Rank the elements of a vector in increasing order.
- Perform arithmetic between a vector and a single number.
- Perform arithmetic between two vectors of the same length.
- Work through some sample Q/As.
Vector aka Variable
Vectors are the most basic units R offers for storing data. A data set often has multiple variables, and each variable is typically stored as a vector.
- Vectors can be created with c(), which stands for concatenate.
country <- c("Bangladesh", "Cameroon", "Croatia")
codes <- c(Bangladesh = 880, Cameroon = 237, Croatia = 385)
codes
Bangladesh Cameroon Croatia
880 237 385
class(codes)
[1] "numeric"
We get the same result if we put the names in quotes:
codes1 <- c("Bangladesh" = 880, "Cameroon" = 237, "Croatia" = 385)
codes1
Bangladesh Cameroon Croatia
880 237 385
class(codes1)
[1] "numeric"
We can also use the names() function to assign names to the codes:
names(codes) <- country
codes
Bangladesh Cameroon Croatia
880 237 385
Generating Sequences with the function ‘seq’
seq(10,18) #Generates all the integers from 10 to 18, inclusive
[1] 10 11 12 13 14 15 16 17 18
#Or
10:18
[1] 10 11 12 13 14 15 16 17 18
seq(10,30,3) #Starts at 10 and counts up in steps of 3 without going past 30
[1] 10 13 16 19 22 25 28
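seq() also accepts named arguments, which can make the intent easier to read. The following is a minimal sketch using the base R arguments by and length.out, plus seq_along(); the variable names are only for illustration.

```r
# Step from 10 towards 30 in increments of 3 (same values as seq(10, 30, 3))
by_three <- seq(10, 30, by = 3)
by_three

# Ask for a fixed number of evenly spaced values instead of a step size
five_points <- seq(0, 1, length.out = 5)
five_points

# seq_along() generates the positions 1..n of an existing vector
positions <- seq_along(by_three)
positions
```

Note that 30 itself is not produced by the first call, because 28 + 3 would overshoot the upper limit.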
Subsetting
We use [] to access elements of a vector. For example:
# We can subset the first element of 'country' and the third element of the 'codes' vector above
country[1]
[1] "Bangladesh"
codes[3]
Croatia
385
We can access more than one element by using a multi-entry vector as an index
codes[c(2,3)]
Cameroon Croatia
237 385
country[c(1,2)]
[1] "Bangladesh" "Cameroon"
# Or we can simply use this syntax
codes[1:2]
Bangladesh Cameroon
880 237
If elements have names, we can access them using the names
codes["Cameroon"]
Cameroon
237
# If we want to get two or more
codes[c("Cameroon","Croatia")]
Cameroon Croatia
237 385
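Two further indexing styles are worth knowing alongside positions and names: negative indices and logical conditions. The sketch below re-creates the codes vector from above so it can run on its own; the threshold of 300 is an arbitrary value chosen for illustration.

```r
codes <- c(Bangladesh = 880, Cameroon = 237, Croatia = 385)

# A negative index drops the element at that position
without_first <- codes[-1]
without_first

# A logical condition keeps only the elements for which it is TRUE
large_codes <- codes[codes > 300]
large_codes

# names() returns the labels, which can be subset the same way
names(codes)[codes > 300]
```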
Vector Coercion
In general, coercion is an attempt by R to be flexible with data types: when an entry does not match the expected type, R guesses what was meant and converts it.
# Creating a vector with mixed numeric and character elements
a <- c(4,237, "Cameroon")
a
[1] "4" "237" "Cameroon"
class(a)
[1] "character"
We don’t get any error message when we create or print the vector a, because R converts all the elements to a single class, character in this case. In other words, R coerced the numbers into character strings.
In addition, R has functions that force a specific coercion, such as as.numeric(), as.vector(), as.character(), and as.integer(). For example, a is a character vector, and I can force it into a numeric or an integer vector.
# As numeric
a <- as.numeric(a)
Warning: NAs introduced by coercion
a
[1] 4 237 NA
class(a)
[1] "numeric"
# As integer
a <- as.integer(a)
class(a)
[1] "integer"
These functions are quite useful in practice because many public data sets store numbers as character vectors. In R, NA represents missing data, and coercion can produce NAs as well: when R cannot convert a value, it returns NA. For example, when I converted the vector ‘a’ to numeric above, I received a warning that NAs were introduced by coercion. Printing the vector confirmed it: “Cameroon” had been replaced by NA. As data scientists we encounter many missing values in our data sets, and if we don’t know what they mean or how to deal with them, our task becomes much harder.
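Once NAs have appeared, the usual next steps are to find them and to keep them from breaking later calculations. The sketch below uses only base R functions (is.na(), suppressWarnings(), and the na.rm argument of sum()); the variable names raw and num are illustrative.

```r
raw <- c("4", "237", "Cameroon")

# suppressWarnings() silences the coercion warning when it is expected
num <- suppressWarnings(as.numeric(raw))

# TRUE marks the entries that could not be converted
is.na(num)

# The original strings that turned into NA
raw[is.na(num)]

# na.rm = TRUE tells sum() to skip missing values
sum(num, na.rm = TRUE)
```

Without na.rm = TRUE, sum(num) would itself return NA, because any arithmetic involving a missing value is missing.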
Some basic Q/As that can be easily solved using R
- To find the solutions of an equation of the form ax^2 + bx + c = 0, use the quadratic formula: x = (−b ± √(b^2 − 4ac))/(2a). What are the two solutions of 2x^2 − x − 4 = 0?
a <- 2; b <- -1; c <- -4
(-b + sqrt(b^2 - 4*a*c))/(2*a) #Larger root of the quadratic
[1] 1.686141
(-b - sqrt(b^2 - 4*a*c))/(2*a) #Smaller root of the quadratic
[1] -1.186141
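The same calculation can be wrapped in a small function so it can be reused for other coefficients. This is only a sketch: quadratic_roots is a hypothetical helper, not a base R function, and its third argument is called c0 rather than c so that it does not share a name with the base function c().

```r
# Hypothetical helper: real roots of a*x^2 + b*x + c0 = 0
quadratic_roots <- function(a, b, c0) {
  disc <- b^2 - 4 * a * c0            # the discriminant
  if (disc < 0) stop("no real roots")
  # Multiplying by c(1, -1) evaluates the + and - cases in one step
  (-b + c(1, -1) * sqrt(disc)) / (2 * a)
}

roots <- quadratic_roots(2, -1, -4)
round(roots, 6)

# Check: plugging the roots back in should give (almost exactly) zero
2 * roots^2 - roots - 4
```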
- Use R to compute log base 4 of 1024. You can use the help() function to learn how to use arguments to change the base of the log() function.
log(1024, base = 4)
[1] 5
#Or
log(1024,4)
[1] 5
#Or
log(x = 1024, base = 4)
[1] 5
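The base argument is a shortcut for the change-of-base rule, log_b(x) = log(x)/log(b). A short sketch that checks the rule numerically; all.equal() is used instead of == because floating-point results can differ in the last digits.

```r
# Change of base: divide the natural log of x by the natural log of the base
manual  <- log(1024) / log(4)
builtin <- log(1024, base = 4)
all.equal(manual, builtin)

# Raising the base to the result recovers the original number
4^builtin
```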
- Load the dslabs package and get ready to explore the movielens data set.
3a. How many rows are in the data set? 3b. How many different variables are in the data set?
library(dslabs)
data(movielens)
summary(movielens)
    movieId          title                year
 Min.   :     1   Length:100004      Min.   :1902
 1st Qu.:  1028   Class :character   1st Qu.:1987
 Median :  2406   Mode  :character   Median :1995
 Mean   : 12549                      Mean   :1992
 3rd Qu.:  5418                      3rd Qu.:2001
 Max.   :163949                      Max.   :2016
                                     NA's   :7
                  genres          userId        rating        timestamp
 Drama               : 7757   Min.   :  1   Min.   :0.500   Min.   :7.897e+08
 Comedy              : 6748   1st Qu.:182   1st Qu.:3.000   1st Qu.:9.658e+08
 Comedy|Romance      : 3973   Median :367   Median :4.000   Median :1.110e+09
 Drama|Romance       : 3462   Mean   :347   Mean   :3.544   Mean   :1.130e+09
 Comedy|Drama        : 3272   3rd Qu.:520   3rd Qu.:4.000   3rd Qu.:1.296e+09
 Comedy|Drama|Romance: 3204   Max.   :671   Max.   :5.000   Max.   :1.477e+09
 (Other)             :71588
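summary() only reveals the row count indirectly (the Length entry under title) and the variable count by the number of columns it prints. The functions nrow(), ncol(), and dim() report these counts directly. The sketch below uses a small made-up data frame called toy so that it runs without dslabs; the same calls could be applied to movielens.

```r
# Hypothetical toy data frame standing in for a larger data set
toy <- data.frame(
  title  = c("A", "B", "C"),
  year   = c(1995, 2001, 2016),
  rating = c(4.0, 3.5, 5.0)
)

nrow(toy)   # number of rows (observations)
ncol(toy)   # number of columns (variables)
dim(toy)    # both at once: rows first, then columns
str(toy)    # compact overview of each column and its type
```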
3c. What is the variable type of title? 3d. What is the variable type of genres?
class(movielens$genres)
[1] "factor"
class(movielens$title)
[1] "character"
We already know we can use the levels() function to determine the levels of a factor. A different function, nlevels(), may be used to determine the number of levels of a factor.
- Use this function to determine how many levels are in the factor genres in the movielens data frame.
nlevels(movielens$genres)
[1] 901
# If we want to see the names
head(levels(movielens$genres))
[1] "(no genres listed)"
[2] "Action"
[3] "Action|Adventure"
[4] "Action|Adventure|Animation"
[5] "Action|Adventure|Animation|Children"
[6] "Action|Adventure|Animation|Children|Comedy"
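To see how levels() and nlevels() behave on something small enough to check by eye, here is a sketch with a made-up factor. The genre labels are invented for illustration, and table() is added to show how often each level occurs.

```r
# Hypothetical toy factor with made-up genre labels
genres <- factor(c("Drama", "Comedy", "Drama", "Action", "Comedy", "Drama"))

levels(genres)    # the distinct categories
nlevels(genres)   # how many distinct categories there are
table(genres)     # how many times each category occurs
```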
Sorting
Sorting means arranging the elements of a vector in a desired order.
- The function sort() sorts a vector in increasing order.
- The function order() produces the indices needed to obtain the sorted vector, e.g. a result of 2 3 1 5 4 means the sorted vector will be produced by listing the 2nd, 3rd, 1st, 5th, and then 4th item of the original vector.
- The function rank() gives the rank of each item in the original vector, i.e. the position it would take in the sorted vector.
- The function max() returns the largest value, while which.max() returns the index of the largest value. The functions min() and which.min() work similarly for minimum values.
Let’s start with a simple simulated vector ‘b’:
b <- c(1,5,8,3,2,9,85,17,43,55)
sort(b) #Puts the values of b in increasing order
[1] 1 2 3 5 8 9 17 43 55 85
Now, let’s rearrange the values using an index. Here is what the index code after the table does:
- I create a vector called ‘index’ that contains the positions (indices) of the elements of b, listed in sorted order,
- I use it to index b, and
- I print the values of b in increasing order. Confusing? Compare the Order column (the sorted values) with the SN_Position column (the original position of each sorted value):
library(kableExtra)
tbl <- data.frame(
b = c(1, 5, 8, 3, 2, 9, 85, 17, 43, 55),
SN = c(1,2,3,4,5,6,7,8,9,10),
Order = c(1,2,3,5,8,9,17,43,55,85),
SN_Position = c(1,5,4,2,3,6,8,9,10,7),
Rank = c(1, 4, 5, 3, 2, 6, 10, 7, 8, 9)
)
tbl %>%
kbl() %>%
kable_classic_2(full_width = FALSE, html_font = "Cambria") %>%
# Highlight the 2nd (SN) and 4th (SN_Position) columns
column_spec(2, bold = TRUE, color = "yellow", background = "#D7261E") %>%
column_spec(4, bold = TRUE, color = "yellow", background = "#D7261E")
b | SN | Order | SN_Position | Rank |
---|---|---|---|---|
1 | 1 | 1 | 1 | 1 |
5 | 2 | 2 | 5 | 4 |
8 | 3 | 3 | 4 | 5 |
3 | 4 | 5 | 2 | 3 |
2 | 5 | 8 | 3 | 2 |
9 | 6 | 9 | 6 | 6 |
85 | 7 | 17 | 8 | 10 |
17 | 8 | 43 | 9 | 7 |
43 | 9 | 55 | 10 | 8 |
55 | 10 | 85 | 7 | 9 |
Now, let’s use R to reproduce the same output. The function order() returns the indices that sort its input vector (the SN_Position column above). Likewise, rank() gives the rank of the 1st entry, 2nd entry, 3rd entry, etc. (the Rank column above).
index <- order(b)
b[index]
[1] 1 2 3 5 8 9 17 43 55 85
order(b)
[1] 1 5 4 2 3 6 8 9 10 7
rank(b)
[1] 1 4 5 3 2 6 10 7 8 9
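The relationship between sort(), order(), and rank() can be checked directly. The sketch below reuses the vector b and also shows descending order through the decreasing argument of sort(); the name rebuilt is illustrative.

```r
b <- c(1, 5, 8, 3, 2, 9, 85, 17, 43, 55)

# Sorting directly and indexing by order() give the same vector
identical(sort(b), b[order(b)])

# Descending order: largest value first
sort(b, decreasing = TRUE)

# rank() says where each original entry lands in the sorted vector,
# so writing b[i] into position rank(b)[i] rebuilds sort(b)
rebuilt <- numeric(length(b))
rebuilt[rank(b)] <- b
identical(rebuilt, sort(b))

# Index of the largest and smallest values
which.max(b)
which.min(b)
```

In short, order() answers "which original position goes into each sorted slot", while rank() answers "which sorted slot does each original position go to".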
Let’s do some activities with real data:
data(murders)
sort(murders$total) #Puts the values of 'total' in increasing order
[1] 2 4 5 5 7 8 11 12 12 16 19 21 22 27 32
[16] 36 38 53 63 65 67 84 93 93 97 97 99 111 116 118
[31] 120 135 142 207 219 232 246 250 286 293 310 321 351 364 376
[46] 413 457 517 669 805 1257
The values in the total column range from 2 all the way up to 1257. If I want to see the first 10 states in the order they were entered in the ‘murders’ data set, I can simply run:
murders$state[1:10]
[1] "Alabama" "Alaska" "Arizona"
[4] "Arkansas" "California" "Colorado"
[7] "Connecticut" "Delaware" "District of Columbia"
[10] "Florida"
murders$abb[1:10]
[1] "AL" "AK" "AZ" "AR" "CA" "CO" "CT" "DE" "DC" "FL"
It looks like the data were entered in alphabetical order by state name, starting from Alabama, and the state abbreviations line up with the names.
Let’s order the states by increasing murder total.
index <- order(murders$total)
murders$abb[index]
[1] "VT" "ND" "NH" "WY" "HI" "SD" "ME" "ID" "MT" "RI" "AK" "IA" "UT" "WV" "NE"
[16] "OR" "DE" "MN" "KS" "CO" "NM" "NV" "AR" "WA" "CT" "WI" "DC" "OK" "KY" "MA"
[31] "MS" "AL" "IN" "SC" "TN" "AZ" "NJ" "VA" "NC" "MD" "OH" "MO" "LA" "IL" "GA"
[46] "MI" "PA" "NY" "FL" "TX" "CA"
This shows that Vermont has the fewest total murders, while California has the most.
If we are interested only in the highest and lowest murder totals, we can use more direct functions, namely max() and min().
max(murders$total)
[1] 1257
min(murders$total)
[1] 2
The results show that the highest and lowest murder totals were 1257 and 2. We can also locate the positions of the corresponding states in the data set.
m_max <- which.max(murders$total)
m_max
[1] 5
m_min <- which.min(murders$total)
m_min
[1] 46
The results show that the state with the highest total is in the 5th position and the state with the lowest total is in the 46th position. We can retrieve the names of those states with:
murders$state[m_max]
[1] "California"
murders$abb[m_min]
[1] "VT"
California had the highest total murders and Vermont the fewest.
What does this mean? Is California the most dangerous state? Probably not. The high total may simply reflect a large population. Let’s check whether California is the most populous state.
murders$state[which.max(murders$population)] #Gives the name of the state with the largest population
[1] "California"
max(murders$population) #Gives the largest population
[1] 37253956
Yes, it is California. A fairer comparison therefore uses the per-capita murder rate rather than the total number of murders.
murder_rate <- (murders$total/murders$population)*100000
murders$state[which.max(murder_rate)]
[1] "District of Columbia"
max(murder_rate)
[1] 16.45275
When I calculated the per-capita murder rate, California turned out not to have the highest rate. That distinction goes to the District of Columbia, with 16.45275 murders per 100,000 people.
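The rate calculation relies on two kinds of vector arithmetic: element-by-element division of two vectors of the same length, and multiplication of a vector by a single number. A minimal sketch with three made-up regions (A, B, C) and invented counts, used only to make the arithmetic easy to follow:

```r
# Hypothetical regions with made-up counts, for illustration only
total      <- c(A = 100, B = 20, C = 50)
population <- c(A = 5e6, B = 2e5, C = 1e6)

# Two vectors of equal length: division pairs up elements by position.
# A vector and a single number: every element is multiplied by 100000.
rate <- total / population * 100000
rate

names(rate)[which.max(total)]  # region with the largest total
names(rate)[which.max(rate)]   # region with the highest per-capita rate
```

As with the real data, the region with the largest total is not the one with the highest rate.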
I can now order all the states in decreasing murder_rate order.
murders$state[order(murder_rate, decreasing = TRUE)]
[1] "District of Columbia" "Louisiana" "Missouri"
[4] "Maryland" "South Carolina" "Delaware"
[7] "Michigan" "Mississippi" "Georgia"
[10] "Arizona" "Pennsylvania" "Tennessee"
[13] "Florida" "California" "New Mexico"
[16] "Texas" "Arkansas" "Virginia"
[19] "Nevada" "North Carolina" "Oklahoma"
[22] "Illinois" "Alabama" "New Jersey"
[25] "Connecticut" "Ohio" "Alaska"
[28] "Kentucky" "New York" "Kansas"
[31] "Indiana" "Massachusetts" "Nebraska"
[34] "Wisconsin" "Rhode Island" "West Virginia"
[37] "Washington" "Colorado" "Montana"
[40] "Minnesota" "South Dakota" "Oregon"
[43] "Wyoming" "Maine" "Utah"
[46] "Idaho" "Iowa" "North Dakota"
[49] "Hawaii" "New Hampshire" "Vermont"
California has the 14th highest murder rate, so it is not even in the top 10.
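Instead of counting down the printed list, the position of a particular entry can be looked up in code. The sketch below uses made-up rates for four invented regions; match() and rank() are base R functions, and the names rate and ranked are illustrative.

```r
# Hypothetical rates for made-up regions
rate <- c(A = 2, B = 10, C = 5, D = 7)

# Region names sorted from the highest to the lowest rate
ranked <- names(rate)[order(rate, decreasing = TRUE)]
ranked

# match() returns the position of a value within a vector
match("C", ranked)

# Equivalent: rank the negated rates so the largest rate gets rank 1
rank(-rate)["C"]
```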