RPubs - Sample (), Simulation, and Apply Functions

This tutorial covers a few simple R tools for sampling and repetition:

1. the sample() function
2. for loops
3. the apply() function

When working with R, there are many reasons to have R repeat a task, such as running through every row of a data set and performing some calculation. Or maybe we want to run a simulation that repeatedly draws samples from some population, or run an iterated experiment in order to estimate a statistic, for example the sample mean.

Let’s Look at the sample() command.

First, I define a vector named ‘states’ that stores the names of three U.S. states as character elements.

states<-c("Florida", "Georgia", "Alabama")
states

[1] "Florida" "Georgia" "Alabama"

Simulating the Sample

Once the vector is created, I use the sample() function to draw a random sample from it. The function takes the data to draw from (x), the number of draws (size), and whether each value is put back after it is drawn (replace).

sample(x=states, size=12, replace=TRUE)

 [1] "Alabama" "Alabama" "Georgia" "Georgia" "Florida" "Georgia" "Florida"
 [8] "Florida" "Georgia" "Florida" "Alabama" "Georgia"

I now have a new simulated data set with 12 elements drawn from the vector I created earlier. Because the draws are made with replacement, every element has an equal chance of being drawn each time, regardless of the previous draws.
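The equal-chance claim can be checked empirically. The sketch below is only an illustration: the seed, the number of draws (30000), and the names draws and shares are assumptions, not part of the original tutorial.

```r
# Hypothetical check: draw many times with replacement and compare
# the observed share of each state with the expected 1/3.
set.seed(1)  # assumption: fixed seed so the sketch is repeatable
states <- c("Florida", "Georgia", "Alabama")
draws <- sample(x = states, size = 30000, replace = TRUE)

# table() counts each state; prop.table() turns counts into shares.
shares <- prop.table(table(draws))
print(round(shares, 3))
```

Each share should land close to 0.333, and the gap tends to shrink as size grows.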

I can also choose not to replace the values once they are drawn. In that case, the sample size cannot exceed 3, because there are only 3 states in the vector. I tried all three possible sizes for demonstration purposes.

sample(x=states, size=3, replace=FALSE)

[1] "Georgia" "Florida" "Alabama"

sample(x=states, size=2, replace=FALSE)

[1] "Florida" "Alabama"

sample(x=states, size=1, replace=FALSE)

[1] "Alabama"

I got three different outputs. The first one selected 3 states, the second 2 states, and the last 1 state.

Lesson Learned: If we put each drawn value back (replace = TRUE), the sample can be as large as we want. If we don’t, the sample size is limited to the number of elements in the vector.
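To see the limit in action, here is a minimal sketch of what happens when the requested size exceeds the number of elements. The tryCatch() wrapper and the message string are illustrative choices added here so the script keeps running; they are not part of the original tutorial.

```r
states <- c("Florida", "Georgia", "Alabama")

# Without replacement, asking for 4 values from 3 elements raises an error.
# tryCatch() captures it and returns a short message of our own instead.
outcome <- tryCatch(
  sample(x = states, size = 4, replace = FALSE),
  error = function(e) "error: sample larger than population"
)
print(outcome)

# With replacement, the same size works.
ok <- sample(x = states, size = 4, replace = TRUE)
length(ok)
```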

a<-(1:5)
a

[1] 1 2 3 4 5

sample(x=a, size=100, replace = TRUE)

  [1] 3 1 2 3 2 5 2 2 1 2 1 2 3 1 2 3 4 1 2 5 2 5 2 5 3 4 1 2 1 2 2 3 4 3 5 2 3
 [38] 5 1 1 2 2 4 4 3 1 5 5 1 4 5 5 5 3 2 2 4 4 4 3 5 5 2 4 2 4 3 4 5 1 2 5 1 3
 [75] 5 5 1 3 1 3 1 5 2 3 3 3 2 4 1 4 5 1 4 4 4 5 5 2 4 2

sample(x=a, size=5, replace=FALSE)

[1] 1 2 5 3 4

sample(x=a, size=4, replace=FALSE)

[1] 4 1 2 3

sample(x=a, size=3, replace=FALSE)

[1] 5 2 3

sample(x=a, size=2, replace=FALSE)

[1] 4 1

sample(x=a, size=1, replace=FALSE)

[1] 4

a<-(50:80)
a

 [1] 50 51 52 53 54 55 56 57 58 59 60 61 62 63 64 65 66 67 68 69 70 71 72 73 74
[26] 75 76 77 78 79 80

sample(x=a, size=200, replace = TRUE)

  [1] 58 79 70 69 66 79 58 56 80 74 76 74 57 57 61 80 78 80 64 50 68 51 76 67 60
 [26] 56 63 69 74 74 58 80 60 61 74 78 65 52 79 74 59 54 63 52 62 53 63 61 71 74
 [51] 56 52 56 50 67 57 55 73 65 80 60 62 72 51 64 71 58 56 58 76 51 74 56 50 65
 [76] 54 71 61 64 64 64 59 50 69 60 79 73 75 66 64 60 69 56 65 71 67 67 72 75 51
[101] 66 55 69 51 78 65 77 58 64 68 80 74 69 53 60 72 72 68 58 80 71 73 51 75 61
[126] 61 59 64 57 54 68 63 74 66 60 54 77 67 55 61 76 52 55 75 60 62 70 75 60 68
[151] 53 71 52 61 52 68 66 54 77 69 77 75 74 76 56 75 77 52 76 67 57 73 59 71 76
[176] 58 50 53 62 75 65 73 59 68 57 60 67 50 74 59 57 70 68 63 74 69 56 64 75 70

sample(x=a, size=30, replace=FALSE)

 [1] 62 69 68 70 60 58 71 80 76 73 50 65 75 74 78 59 51 55 79 54 63 77 64 57 52
[26] 66 72 67 61 56

sample(x=a, size=21, replace=FALSE)

 [1] 79 56 62 52 54 73 51 63 53 70 74 71 69 68 67 72 60 50 80 66 65

sample(x=a, size=11, replace=FALSE)

 [1] 62 53 69 76 51 50 72 66 80 65 73

sample(x=a, size=6, replace=FALSE)

[1] 54 74 52 64 78 60

sample(x=a, size=1, replace=FALSE)

[1] 64

Now Let’s Flip Some Coins

Imagine that 1 refers to heads and 0 to tails.

A. Single Flip

coinflip <- (0:1)
One_flip <- sample(x = coinflip, size = 1)
One_flip

[1] 1

plot(One_flip)

B. Five Flips

five_flips <- sample(x = coinflip, size = 5, replace = T)
print(five_flips)

[1] 0 0 1 1 1

barplot(five_flips)

C. Ten Flips

ten_flips <- sample(x = coinflip, size = 10, replace = T)
print(ten_flips)

 [1] 1 0 1 0 0 0 1 0 0 0

barplot(ten_flips)

D. Hundred Flips

hundred_flips <- sample(x = coinflip, size = 100, replace = T)
print(hundred_flips)

  [1] 0 0 0 1 0 1 1 0 0 0 0 1 1 0 1 0 1 1 0 1 0 1 0 0 1 1 0 1 1 1 0 1 0 1 1 0 0
 [38] 0 0 1 0 1 0 1 0 1 0 1 0 1 1 1 1 1 1 0 1 1 1 0 0 1 0 1 1 0 1 0 1 0 1 0 1 0
 [75] 1 1 1 0 0 0 1 1 1 0 1 1 1 1 1 1 1 1 0 0 0 1 0 0 1 1

hist(hundred_flips)

E. Thousand Flips

thousand_flips <- sample(x = coinflip, size = 1000, replace = T)
summary(thousand_flips)

   Min. 1st Qu.  Median    Mean 3rd Qu.    Max. 
  0.000   0.000   1.000   0.508   1.000   1.000

hist(thousand_flips)

F. One Hundred Thousand Flips

OHT_flips <- sample(x = coinflip, size = 100000, replace = T)
summary(OHT_flips)

   Min. 1st Qu.  Median    Mean 3rd Qu.    Max. 
 0.0000  0.0000  0.0000  0.4956  1.0000  1.0000

hist(OHT_flips)
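Since heads is coded as 1 and tails as 0, the mean of the flips is the share of heads. The sketch below compares that share across sample sizes; the seed, the sizes vector, and the name heads_share are assumptions made for illustration.

```r
set.seed(42)  # assumption: fixed seed so the sketch is repeatable
coinflip <- 0:1
sizes <- c(10, 100, 1000, 100000)

# For each size, flip the coin that many times and record the share of heads.
heads_share <- sapply(sizes, function(n) {
  mean(sample(x = coinflip, size = n, replace = TRUE))
})
names(heads_share) <- sizes
print(heads_share)
```

The share for the largest sample should sit very close to 0.5, while the small samples can wander noticeably.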

How About Rolling Some Dice?

I am going to roll a six-sided die.

dice_roll <- (1:6)

A. Hundred Rolls

hundred_rolls <- sample(x = dice_roll, size = 100, replace = TRUE)
str(hundred_rolls)

 int [1:100] 2 5 6 4 6 2 3 6 1 1 ...

hist(hundred_rolls)

B. 1000 Rolls

thousand_rolls <- sample(x = dice_roll, size = 1000, replace = TRUE)
str(thousand_rolls)

 int [1:1000] 5 6 4 1 5 1 4 2 4 2 ...

hist(thousand_rolls)

C. 10000 Rolls

Tthousand_rolls <- sample(x = dice_roll, size = 10000, replace = TRUE)
str(Tthousand_rolls)

 int [1:10000] 4 1 2 4 1 1 4 2 1 4 ...

hist(Tthousand_rolls)

Finally, Let’s Check Powerball Numbers for Today’s Draw

powerball <- (1:74)
sample(x = powerball, size = 6, replace = F)# Six Draws

[1] 59 23 53 45 35 48

many_draws <- sample(x = powerball, size = 10000, replace = TRUE) # Ten Thousand Draws (named many_draws to avoid masking the base function c())
hist(many_draws)

So the sample() command lets us draw from a finite set of elements, by default with an equal probability of drawing each element. Note, however, that we can also sample from other kinds of probability distributions, for example the normal distribution.

Let’s take an example: I am going to generate 100 random numbers from the standard normal distribution.

norm <- rnorm(n=100)
head(norm)

[1]  1.1880658  0.7176062 -0.8077974 -0.2422761  1.3265662  1.5531266

I can draw a much larger sample of random normal numbers and create a density plot to see how it looks.

plot(density(rnorm(n=10000000)))
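By default rnorm() draws from a normal distribution with mean 0 and standard deviation 1; its mean and sd arguments change that. The following is a small sketch with illustrative values (mean 50, sd 10) that are assumptions, not taken from the tutorial.

```r
set.seed(123)  # assumption: fixed seed so the sketch is repeatable

z <- rnorm(n = 10000)                      # standard normal: mean 0, sd 1
x <- rnorm(n = 10000, mean = 50, sd = 10)  # shifted and rescaled normal

# The sample mean and sd should be close to the values requested.
c(mean = mean(z), sd = sd(z))
c(mean = mean(x), sd = sd(x))
```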

The plot shows the familiar bell shape of a Gaussian distribution. With the sample() command under our belt, let’s learn the for loop.

For Loop

The basic idea of a for loop is to tell R to loop through some operations a set number of times, in order to perform a task that would otherwise be tedious to repeat by hand. Let’s start with a simple coin-tossing experiment. First, we create a vector with two elements, "Head" and "Tail", and call it cflip.

cflip <- c("Head", "Tail")  # Creates a two-sided coin
toss <- c()                 # Creates an empty vector to hold the results

# Now, let's create a for loop
for (i in 1:100) {
  toss[i] <- sample(x = cflip, size = 1)  # size 1: each iteration stores a single toss in toss[i]
}
print(toss)  # prints all 100 tosses

  [1] "Head" "Tail" "Head" "Head" "Tail" "Tail" "Head" "Head" "Head" "Tail"
 [11] "Head" "Tail" "Tail" "Head" "Tail" "Head" "Tail" "Head" "Tail" "Tail"
 [21] "Tail" "Head" "Head" "Tail" "Head" "Head" "Tail" "Head" "Tail" "Tail"
 [31] "Tail" "Tail" "Tail" "Tail" "Head" "Head" "Head" "Tail" "Tail" "Tail"
 [41] "Head" "Tail" "Tail" "Tail" "Tail" "Head" "Tail" "Tail" "Head" "Tail"
 [51] "Tail" "Head" "Tail" "Tail" "Head" "Head" "Tail" "Head" "Tail" "Head"
 [61] "Tail" "Tail" "Head" "Tail" "Tail" "Tail" "Tail" "Tail" "Head" "Head"
 [71] "Tail" "Head" "Tail" "Head" "Tail" "Head" "Tail" "Head" "Tail" "Tail"
 [81] "Tail" "Head" "Tail" "Head" "Tail" "Tail" "Head" "Head" "Head" "Tail"
 [91] "Tail" "Tail" "Tail" "Tail" "Head" "Head" "Head" "Tail" "Head" "Head"

table(toss) # Gives the summary of total heads and tails

toss
Head Tail 
  43   57

In the loop above, i is the index that counts the iterations. It runs from 1 to 100, so the loop performs 100 simulated tosses.
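The same loop pattern can estimate a statistic such as the sample mean, as mentioned at the start of the tutorial. This is a hedged sketch: the die population, the sample size of 30, the 1000 repetitions, and the variable names are all assumptions chosen for illustration.

```r
set.seed(2024)   # assumption: fixed seed so the sketch is repeatable
population <- 1:6  # faces of a six-sided die
n_sims <- 1000

# Preallocate the result vector instead of growing it inside the loop.
sample_means <- numeric(n_sims)

for (i in seq_len(n_sims)) {
  rolls <- sample(x = population, size = 30, replace = TRUE)
  sample_means[i] <- mean(rolls)  # store the mean of this sample
}

mean(sample_means)  # should be close to the true mean of 3.5
```

Preallocating with numeric(n_sims) is a design choice: it avoids resizing the vector on every iteration, which matters more as the number of simulations grows.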

Can we take it to Powerball?

Let’s see:

powerball <- c(1:74)  # Any number between 1 and 74
draw <- c()           # Creates an empty vector

# Now, let's create a for loop
for (i in 1:600) {  # 600 iterations
  draw[i] <- sample(x = powerball, size = 1)  # size 1: draw[i] can hold only one number; with size = 6, R warns and keeps just the first
}
summary(draw)  # summarizes the 600 drawn numbers

   Min. 1st Qu.  Median    Mean 3rd Qu.    Max. 
   1.00   19.00   38.00   38.39   58.00   74.00

table(draw) # Gives the summary of total draws and frequency

draw
 1  2  3  4  5  6  7  8  9 10 11 12 13 14 15 16 17 18 19 20 21 22 23 24 25 26 
 8 10  9  6 13  7  6  3  9 12  7  7  6  5  6 10 11  5 12  1  6  5 11  6 10 10 
27 28 29 30 31 32 33 34 35 36 37 38 39 40 41 42 43 44 45 46 47 48 49 50 51 52 
 9 11  8  4  8 12  8  8  8  7 11  6  6  9 10  6  7  6 10  5  7 10  7  3 12 11 
53 54 55 56 57 58 59 60 61 62 63 64 65 66 67 68 69 70 71 72 73 74 
 4  5 14  6  9 12 10 10  8  5  7  6  7 11  8  9 12 16  9 10  5  7
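A single vector element can hold only one value, so keeping all six numbers of each draw needs a different container. One option, sketched below under assumed names (draws, n_draws) and an assumed seed, is a matrix with one row per draw.

```r
set.seed(7)  # assumption: fixed seed so the sketch is repeatable
powerball <- 1:74
n_draws <- 600

# One row per draw, six columns for the six numbers.
draws <- matrix(NA_integer_, nrow = n_draws, ncol = 6)

for (i in seq_len(n_draws)) {
  # Without replacement, so no number repeats within a draw.
  draws[i, ] <- sample(x = powerball, size = 6, replace = FALSE)
}

head(draws, 3)  # first three draws
dim(draws)      # 600 rows, 6 columns
```

table(draws) would then count 3600 numbers in total rather than 600.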

Let’s Sample Some Information from a Population

Please note that this is an entirely fictional data set. We will create a matrix called ‘results’ with three variables: marital status, state of residence, and income level. Each participant is either ‘Married’ or ‘Single’, has one of four income levels (1 through 4), and lives in one of three states: Florida, Alabama, or Georgia.

To complete one row, I have to call sample() three times (once for marital status, once for state, and once for income). Thus, within the for loop I use results[i, 1], results[i, 2], and results[i, 3], meaning the ith row and the first, second, or third column.

state <- c("Florida", "Georgia", "Alabama")
m_status <- c("Married", "Single")
income <- 1:4
results <- matrix (nrow = 100, ncol = 3, data = NA)#Creates a matrix with 100 rows and 3 columns and it doesn't have any values, yet
colnames(results) <- c("m_status", "state", "income")
head(results)

     m_status state income
[1,]       NA    NA     NA
[2,]       NA    NA     NA
[3,]       NA    NA     NA
[4,]       NA    NA     NA
[5,]       NA    NA     NA
[6,]       NA    NA     NA

for (i in 1:100) {
  results[i, 1] <- sample(m_status, size = 1)
  results[i, 2] <- sample(state, size = 1)
  results[i, 3] <- sample(income, size = 1)
}
head(results)

     m_status  state     income
[1,] "Married" "Georgia" "2"   
[2,] "Married" "Alabama" "3"   
[3,] "Single"  "Florida" "4"   
[4,] "Single"  "Florida" "3"   
[5,] "Married" "Alabama" "4"   
[6,] "Married" "Georgia" "3"

str(results)

 chr [1:100, 1:3] "Married" "Married" "Single" "Single" "Married" "Married" ...
 - attr(*, "dimnames")=List of 2
  ..$ : NULL
  ..$ : chr [1:3] "m_status" "state" "income"

We can see that the matrix is now full. It has all the variables we wanted, and every value was selected randomly. To generate data like this, we fill in one column at a time within each row, and the loop repeats that process for every row. Note that a matrix can hold only one type of data, so the income values are stored as character strings ("2", "3", ...) alongside the text columns, as the str() output shows.
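If income should stay numeric, a data frame is one alternative, because each column keeps its own type. The sketch below is not the author's method: the name survey, the seed, and the vectorized calls (one sample() per column instead of a loop) are assumptions made for illustration.

```r
set.seed(99)  # assumption: fixed seed so the sketch is repeatable
n <- 100

# One vectorized sample() call per column replaces the row-by-row loop.
survey <- data.frame(
  m_status = sample(c("Married", "Single"), size = n, replace = TRUE),
  state = sample(c("Florida", "Georgia", "Alabama"), size = n, replace = TRUE),
  income = sample(1:4, size = n, replace = TRUE),
  stringsAsFactors = FALSE
)

str(survey)          # income is integer, the other columns are character
mean(survey$income)  # numeric summaries now work on income
```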

Apply Function

Why? The apply function is useful for summarizing large amounts of data or applying a function over our data set.

Let’s look at an example using the ‘results’ matrix we just created. We use the apply function together with the table command to summarize the information in each column.

A call to apply starts with the function name followed by parentheses. X = results is the data set I am interested in. MARGIN says whether to work over rows or columns: MARGIN = 1 applies the function to each row, and MARGIN = 2 to each column. Finally, FUN = table means the table function is applied to each column, producing a frequency count of its values.

apply(X = results, MARGIN = 2, FUN = table)

$m_status

Married  Single 
     49      51 

$state

Alabama Florida Georgia 
     29      37      34 

$income

 1  2  3  4 
20 29 24 27

We see that 49 participants are married and 51 are single, how many participants live in each state, and how the income levels are distributed.

$m_status holds the counts for marital status, $state for the state of residence, and $income for the income distribution.
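The difference between MARGIN = 1 and MARGIN = 2 is easiest to see on a small numeric matrix. The matrix m and the result names below are made up for this sketch.

```r
# A 2 x 3 matrix, filled column by column:
#      [,1] [,2] [,3]
# [1,]    1    3    5
# [2,]    2    4    6
m <- matrix(1:6, nrow = 2, ncol = 3)

row_totals <- apply(X = m, MARGIN = 1, FUN = sum)   # one value per row: 9 12
col_means <- apply(X = m, MARGIN = 2, FUN = mean)   # one value per column: 1.5 3.5 5.5

row_totals
col_means
```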

THANKS
