R-Basic Part VII: The str() Function, Simulation, and R Profiling

 Highlights

  • The str() function
  • Simulation - Generating Random Numbers
  • Simulation - Sample Linear Model
  • Simulation - Random Sampling
  • R Profiling

The str() function

  • It’s a diagnostic function that can be used as an alternative to the summary() function
  • Outputs a compact display of the variables in an array or a dataset
  • Provides a one-line summary of key information per object or variable

Let’s look at the arguments the str() function takes:

str
function (object, ...) 
UseMethod("str")
<bytecode: 0x0000000015fb51e8>
<environment: namespace:utils>

So, it takes an object, and it may take other optional arguments (e.g., max.level, digits.d, etc.) as needed. An object can be anything, ranging from a function to a large dataset. Let’s call str() on the glm function and see what it gives:

str(glm)
function (formula, family = gaussian, data, weights, subset, na.action, 
    start = NULL, etastart, mustart, offset, control = list(...), model = TRUE, 
    method = "glm.fit", x = FALSE, y = TRUE, singular.ok = TRUE, contrasts = NULL, 
    ...)  

The output is the compact form of the glm function. It shows that glm takes a formula as its first argument, followed by family, data, weights, subset, and the other arguments that go into the glm function.

Now, let’s create a vector b of 50 data points drawn from a normal distribution with a mean of 5 and a standard deviation of 2, and pass it to the summary() and str() functions.

b <- rnorm(50, 5, 2)
summary(b)
   Min. 1st Qu.  Median    Mean 3rd Qu.    Max. 
  1.974   3.745   4.905   4.972   6.219   9.368 

We get a rough sense of the minimum and maximum values, along with the mean, median, and 1st and 3rd quartiles. Now, let’s call str() on b:

str(b)
 num [1:50] 2.7 6.52 5.08 2.61 6.61 ...

We got a one-line summary of the vector b, which gives us slightly different information: b holds a total of 50 numeric values, and the first five are shown.

We can also use the str() function on factor variables:

k <- gl(3, 5, labels = c("Male", "Female", "Other")) # Create 3 different categories and include 5 values in each of those categories
str(k)
 Factor w/ 3 levels "Male","Female",..: 1 1 1 1 1 2 2 2 2 2 ...

As expected, k is a factor with three levels (the first two shown are Male and Female, with the corresponding integer codes 1 and 2).

Finally, let’s use this function on the built-in npk data frame:

str(npk)
'data.frame':   24 obs. of  5 variables:
 $ block: Factor w/ 6 levels "1","2","3","4",..: 1 1 1 1 2 2 2 2 3 3 ...
 $ N    : Factor w/ 2 levels "0","1": 1 2 1 2 2 2 1 1 1 2 ...
 $ P    : Factor w/ 2 levels "0","1": 2 2 1 1 1 2 1 2 2 2 ...
 $ K    : Factor w/ 2 levels "0","1": 2 1 1 2 1 2 2 1 1 2 ...
 $ yield: num  49.5 62.8 46.8 57 59.8 58.5 55.5 56 62.8 55.8 ...

The results show a one-line summary of every variable in the dataset: block is a factor with six levels, N, P, and K are two-level factors, and yield is numeric.

# Just because I wanted to use the new in-built R-pipe
library(tidyverse) # tidyr and ggplot2 are attached as part of the tidyverse
group_mean <- npk |>
  group_by(block, N) |>
  summarize(yield_mean = mean(yield, na.rm = TRUE))

# Let's check using this plot
group_mean |>
  ggplot(aes(x = reorder(block, -yield_mean), y = yield_mean)) +
  geom_bar(stat = "identity", aes(fill = N)) +
  labs(
    title = "Fig 1. Average Yield Based on Blocks and Use of Nitrogen",
    x = "Blocks",
    y = "Average Yield"
  ) +
  scale_x_discrete(labels = c(
    "1" = "Block_1",
    "2" = "Block_2",
    "3" = "Block_3",
    "4" = "Block_4",
    "5" = "Block_5",
    "6" = "Block_6",
    "7" = "Block_7"
  )) +
  theme(
    plot.title = element_text(hjust = 0.5),
    axis.text.x = element_text(angle = 50, hjust = 1),
    axis.text = element_text(size = rel(1), colour = "chocolate"),
    legend.position = "top"
  )

I remember struggling a lot trying to find ways to calculate means for different groups and subgroups at once. Even a Google search doesn't always turn up the desired answer. It turns out I had already come across this information many times while taking a foundation course in R; I had either overlooked it or simply lacked the context to use it. I recently re-discovered it and am trying to make it available on my blog, where I can easily find, use, and reference it as needed.

A similar (though not identical) outcome can be achieved in a two-step process using the **split()** and **sapply()** functions. My previous blog post detailed the use of these functions, and you can access it via this link (https://ghimirenirmal.blogspot.com/2022/02/r-basic-part-vi-loop-functions-and.html). Let me refresh my memory:
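
Here is a minimal sketch of that two-step approach applied to the same npk data:

# Step 1: split the yield values into groups defined by block and N
s <- split(npk$yield, list(npk$block, npk$N))
# Step 2: apply mean() to each group
sapply(s, mean)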

Simulation: Generating Random Numbers

Simulation is one of the most important topics in statistics and many other applications. It provides us with data to model research problems, to practice, or to investigate areas of research that require data with specific characteristics. Here’s a list of some of the functions that help us generate random numbers:

  • rnorm: generates random Normal variates with a given mean and standard deviation. General Form: rnorm(n, mean = 0, sd = 1)
  • dnorm: evaluates the Normal density (with a given mean/SD) at a point. General Form: dnorm(x, mean = 0, sd = 1, log = FALSE)
  • pnorm: evaluates the cumulative distribution function for a Normal distribution. General Form: pnorm(q, mean = 0, sd = 1, lower.tail = TRUE, log.p = FALSE)
  • qnorm: evaluates the quantile function. General Form: qnorm(p, mean = 0, sd = 1, lower.tail = TRUE, log.p = FALSE)
  • rpois: generates random Poisson variates with a given rate (lambda). General Form: rpois(n, lambda). And remember there are dpois(), ppois(), & qpois() functions as well, equivalent to dnorm(), pnorm(), & qnorm(), respectively.

There are four functions associated with every probability distribution. They are prefixed with

  • d for density
  • r for random number generation
  • p for cumulative distribution
  • q for quantile function

So, every distribution has these four types of functions. For the gamma distribution, we have dgamma, pgamma, qgamma, and rgamma, and they take the following general structures, respectively:

rbind(str(dgamma), str(pgamma), str(qgamma), str(rgamma))
function (x, shape, rate = 1, scale = 1/rate, log = FALSE)  
function (q, shape, rate = 1, scale = 1/rate, lower.tail = TRUE, log.p = FALSE)  
function (p, shape, rate = 1, scale = 1/rate, lower.tail = TRUE, log.p = FALSE)  
function (n, shape, rate = 1, scale = 1/rate)  
NULL
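For example, we can draw a few random gamma variates (the shape and rate values here are arbitrary choices of mine):

set.seed(10)
rgamma(5, shape = 2, rate = 1) # five random gamma draws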

If Φ is the cumulative distribution function for a standard normal distribution, then

pnorm(q) = Φ(q), and qnorm(p) = Φ⁻¹(p)
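We can verify this inverse relationship directly:

qnorm(pnorm(1.96)) # recovers 1.96, since qnorm() inverts pnorm()
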
If I want to create 10 random numbers, I can simply write the following code:

rnorm(10)
 [1]  0.5897101  0.2034802  0.7790009 -1.0128614 -1.7023167  1.1966805
 [7] -1.8556855 -0.8117258  0.3582758 -1.2090540

We got 10 numbers randomly generated by R. If we don’t specify the parameters, R generates the numbers with a mean of 0 and an sd of 1. However, we can customize them to our needs. The code below generates 10 random numbers with a mean of 5 and a standard deviation of 1.

summary(rnorm(10, 5, 1))
   Min. 1st Qu.  Median    Mean 3rd Qu.    Max. 
  3.916   4.447   5.015   5.068   5.494   6.731 

Any time we simulate random numbers for any kind of distribution, we want to set a seed, which can be done using the set.seed() function. When we do so, the numbers are not truly random (they are called pseudo-random numbers), but they appear random. The idea is that someone else can generate the same set of numbers with the same seed, which helps replicate research studies. Let’s take some examples:

# A. with set.seed(-1)
set.seed(-1)
seed_minus_one <- rnorm(6) # Generate 6 random numbers

# B. without set.seed
without_seed <- rnorm(6)

# C. with set.seed(-1)
set.seed(-1)
minus_one_again <- rnorm(6)

# D. with different seed
set.seed(1)
seed_plus_one <- rnorm(6)

# Putting them together in a single data frame
data.frame(seed_minus_one, without_seed, minus_one_again, seed_plus_one)
  seed_minus_one without_seed minus_one_again seed_plus_one
1    -0.03342665   -0.6179153     -0.03342665    -0.6264538
2     2.47146059   -0.5492364      2.47146059     0.1836433
3    -0.70119879   -1.5258090     -0.70119879    -0.8356286
4    -0.36324120   -0.4930835     -0.36324120     1.5952808
5    -0.27257217   -0.2591490     -0.27257217     0.3295078
6     0.81849252   -0.3036328      0.81849252    -0.8204684

We can see that the values in sets A and B are different, because B was drawn without resetting the seed. When I reset the seed to -1 (set C), I got exactly the same numbers as in set A. And when I set a different seed (set D), I got a completely different set of random numbers.

Generating Random Variables for Other Probability Distributions

We use the rpois function to generate Poisson data. Here’s an example of a Poisson distribution of 10 random numbers with a rate (λ) of 1. The sample mean is going to be roughly equal to the rate.

summary(rpois(10, 1))
   Min. 1st Qu.  Median    Mean 3rd Qu.    Max. 
    0.0     1.0     1.0     1.6     2.0     4.0 
# Poisson distribution of 10 numbers with the rate of 2
rpois_lambda_2 <- rpois(10, 2)
# Poisson distribution of 10 numbers with the rate of 20
rpois_lambda_20 <- rpois(10, 20)
# Putting them together
data.frame(rpois_lambda_2, rpois_lambda_20)
   rpois_lambda_2 rpois_lambda_20
1               2              19
2               0              24
3               1              23
4               1              22
5               0              24
6               1              23
7               4              20
8               1              11
9               2              22
10              2              24
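The earlier claim that the mean is roughly equal to the rate is easy to check with a larger sample:

mean(rpois(100000, 2)) # very close to 2, by the law of large numbers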

I can also evaluate the cumulative distribution function for the Poisson distribution in the following way:

# What is the probability of getting a Poisson value <= 2, 4, or 6 when lambda is 2?
poisson_2 <- ppois(2, 2)
poisson_4 <- ppois(4, 2)
poisson_6 <- ppois(6, 2)
data.frame(poisson_2, poisson_4, poisson_6)
  poisson_2 poisson_4 poisson_6
1 0.6766764  0.947347 0.9954662

As can be seen, the probability of getting a value <= 2 when the rate is 2 is about 68%, and the cumulative probability keeps increasing as we raise the cutoff value.
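Since ppois() is the cumulative distribution function, each of these values equals the running sum of the dpois() point probabilities:

sum(dpois(0:2, lambda = 2)) # identical to ppois(2, 2): 0.6766764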

Simulating a Linear Model

We can simulate the required values for a linear model. For example, I am going to simulate from the model y = 1.5 + 2x + ε, where ε ~ N(0, 2²):

set.seed(-23)
predictor <- rnorm(150) # create a predictor vector x of 150 standard normal values

error <- rnorm(150, 0, 2) # the error vector: 150 random numbers with a mean of 0 and a standard deviation of 2

dependent <- 1.5 + 2 * predictor + error # the dependent variable y: intercept 1.5, plus 2 times the predictor, plus error


# Plotting predictor and the dependent variables
plot(predictor, dependent)

When we plot the predictor against the dependent variable, we see a clear linear relationship. To extend the example: what happens when we change the predictor variable from a normal to a binomial distribution?

set.seed(-23)
predictor_binary <- rbinom(150, 1, 0.5) # 150 Bernoulli draws: size = 1, prob = 0.5

dependent1 <- 1.5 + 2 * predictor_binary + error
plot(predictor_binary, dependent1)

This plot looks different because the predictor is binary. However, we still see a clear linear relationship between the predictor_binary and dependent1 variables.
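
As an optional sanity check (not in the original post), fitting a linear model should roughly recover the simulated intercept (1.5) and slope (2):

coef(lm(dependent ~ predictor)) # estimates should be near 1.5 and 2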

Simulating a GLM

Now, let’s simulate a slightly more complicated model, say a generalized linear model (GLM). Here we want the dependent variable to be a count rather than a continuous variable, so the outcome distribution will be Poisson rather than normal: log(μ) = 1 + 0.8x, with y drawn from Poisson(μ).

set.seed(-5)
predictor_glm <- rnorm(150) # a vector of 150 random normal values
log_mu <- 1 + 0.8 * predictor_glm # the linear predictor: the log of the Poisson mean
dependent_glm <- rpois(150, exp(log_mu)) # Poisson counts with mean exp(log_mu)

plot(predictor_glm, dependent_glm)

We can still see a clear positive relationship: as the value of the predictor variable increases, the dependent value increases as well (exponentially, since the mean is exp(1 + 0.8x)).
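
Again as an optional check (not in the original post), a Poisson glm() should roughly recover the simulated coefficients of 1 and 0.8:

coef(glm(dependent_glm ~ predictor_glm, family = poisson)) # near 1 and 0.8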

Random Sampling

The sample function draws randomly from a specified set of (scalar) objects, allowing us to sample from an arbitrary distribution. Here are the arguments the function sample takes:

str(sample)
function (x, size, replace = FALSE, prob = NULL)  

Where,

  • x: either a vector of one or more elements from which to choose, or a positive integer.
  • size: a non-negative integer giving the number of items to choose.
  • replace: should sampling be with replacement?
  • prob: a vector of probability weights for obtaining the elements of the vector being sampled (illustrated below).
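
Since the prob argument rarely gets an example, here is one of my own: a weighted coin that lands on Head about 70% of the time.

set.seed(2)
sample(c("Head", "Tail"), 10, replace = TRUE, prob = c(0.7, 0.3))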

Sampling 4 random numbers from a set of 1 through 10

set.seed(2)
sample(1:10, 4)
[1] 5 6 9 1

Sampling 5 random lower case letters from the English alphabet

set.seed(2)
sample(letters, 5)
[1] "u" "o" "f" "x" "h"

Sampling 8 random upper case letters from the English alphabet

set.seed(2)
sample(LETTERS, 8)
[1] "U" "O" "F" "X" "H" "Q" "Z" "L"

Sampling all 10 numbers from a list of 10 without replacement, which produces a random permutation

sample(1:10)
 [1]  9  2  3  1  8 10  6  4  7  5

Sampling 10 numbers from a list of 10 with replacement (notice the repeated values)

sample(1:10, replace = TRUE)
 [1] 7 1 6 9 4 6 9 8 6 3
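The same idea extends to sampling rows of a data frame; here is a small example of my own using the npk data:

set.seed(2)
npk[sample(nrow(npk), 5), ] # five randomly chosen rows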

R Profiler

The R profiler is a handy tool when we are working with a large dataset. We run the profiler when a program does not run as quickly as we want. Here’s some more information:

  • Profiling is a systematic way to examine how much time is spent in different parts of a program
  • It is useful when we are trying to optimize our code
  • Often, code runs fine the first few times, but when we run it hundreds or thousands of times, it starts to slow down
  • Profiling is better than guessing at why the program is not performing fast enough

General Principles of Optimization

  • Design first, then optimize
  • Remember: premature optimization can do more harm than good
  • Measure (collect data), don’t guess. The way we collect data is by profiling.

Using system.time()

  • Takes an arbitrary R expression as input and returns the amount of time taken to evaluate the expression
  • Computes the time (in seconds) needed to execute an expression
    • If there’s an error, gives time until the error occurred
  • Returns an object of class proc_time
    • user time: time charged to the CPU for this expression
    • elapsed time: wall-clock time
  • Usually, the user time and elapsed time are relatively close for straightforward computing tasks
  • Elapsed time may be greater than user time if the CPU spends a lot of time waiting around
  • Elapsed time may be smaller than the user time if the computer has multiple cores/processors
    • Multi-threaded BLAS libraries (vecLib/Accelerate, ATLAS, ACML, MKL)
    • Parallel processing via the parallel package

An example:

# Elapsed time > user time: the CPU spends much of the time waiting on the network
system.time(readLines("https://www.ucf.edu/"))
   user  system elapsed 
   0.14    0.03    0.48 

One more example:

new_func <- function(new) {
  i <- 1:new
  1 / outer(i - 1, i, "+") # an n x n Hilbert-like matrix
}
m <- new_func(1500) # a 1500 x 1500 matrix (renamed from t, which masks base::t)
system.time(svd(m)) # time the singular value decomposition
   user  system elapsed 
   9.05    0.00    9.06 

Using the R Profiler: Rprof()

  • The Rprof() function starts the profiler in R
    • R must be compiled with profiler support (but this is usually the case)
  • The summaryRprof() function summarizes the output from Rprof()
  • We should not use system.time() and Rprof() together
  • Rprof() keeps track of the function call stack at regularly sampled intervals and tabulates how much time is spent in each function
  • Default sampling interval is 0.02 seconds
  • Note: If a program runs very quickly, the profiler is not useful
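
A minimal sketch of the workflow (the output file name here is arbitrary):

Rprof("profile.out") # start the profiler; samples are written to the file
invisible(svd(new_func(1000))) # some work worth profiling
Rprof(NULL) # stop the profiler
summaryRprof("profile.out") # tabulate how much time was spent in each function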
