In this article, we take a deep dive into the sample() function in R, with examples.
What is the sample() function in R?
Sampling is the process of selecting a small subset from the available population. This small subset is used as a representation of the entire population. Sampling is helpful because it saves the time and resources that would otherwise be used to collect data from the entire population.
However, selecting the right sample to represent the entire population can be challenging. In a diverse population, it is important not to over- or under-represent certain groups as it results in selection bias.
Simple random sampling addresses this by selecting elements from the population at random, so that every element has an equal chance of being included in the sample. It is one of the most commonly used sample selection techniques in statistics and research.
The sample() function in R generates random samples from a specified set of elements.
Following is the syntax of the sample() function:
sample(x, size, replace = FALSE, prob = NULL)
Let's understand the parameters:
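- x: The population from which the sample is to be selected, given as a vector. If x is a single number n that is 1 or greater, sample() draws from 1:n instead.
- size: The size of the sample, i.e., the number of elements to be chosen. It defaults to length(x).
- replace: A logical value (TRUE or FALSE) that defines whether the same element can be chosen more than once.
- prob: A vector of weights, one per element of x, that is used to influence the chances of selecting an element. The weights do not need to sum to 1.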
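Some of these argument behaviors are easy to miss, so here is a short sketch of them. The variable names and the seed are illustrative and are not part of the original examples.

```r
set.seed(123)  # arbitrary seed, only so the sketch is repeatable

letters_pool <- c("a", "b", "c", "d", "e")

# With only x given, size defaults to length(x): a random permutation
shuffled <- sample(letters_pool)

# A single number n >= 1 is treated as the population 1:n
three_of_ten <- sample(10, size = 3)

# size may exceed length(x) only when replace = TRUE
with_repeats <- sample(letters_pool, size = 8, replace = TRUE)

# Without replacement, asking for more elements than exist is an error
too_many <- tryCatch(
  sample(letters_pool, size = 8),
  error = function(e) "error"
)
```

The single-number behavior is worth remembering when x is computed at run time: a vector that happens to have length one, such as c(10), is sampled as 1:10 rather than as the single value 10.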
Conceptually, when replace = FALSE, sample() behaves like drawing elements from the population one at a time at random and setting each drawn element aside so that it cannot be picked again. This continues until the sample contains the specified number of elements. When replace = TRUE, every draw is made from the full population, so the same element can appear more than once.
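One way to picture sampling without replacement is as a loop that draws one element at a time and removes it from the pool. The sketch below is only a mental model: draw_one_at_a_time() is a hypothetical helper, and it is not how sample() is implemented internally.

```r
# Hypothetical helper for illustration only; not R's internal algorithm
draw_one_at_a_time <- function(population, size) {
  stopifnot(size <= length(population))
  remaining <- seq_along(population)  # positions still in the pool
  chosen <- integer(size)             # positions drawn so far
  for (i in seq_len(size)) {
    pick <- sample.int(length(remaining), 1)  # uniform pick from the pool
    chosen[i] <- remaining[pick]
    remaining <- remaining[-pick]             # remove it so it cannot repeat
  }
  population[chosen]
}

set.seed(7)  # arbitrary seed
population <- c(1, 1, 2, 2, 2, 2, 2, 3, 4, 4)
manual_sample <- draw_one_at_a_time(population, 5)
print(manual_sample)
```

Working with positions rather than values means duplicate values in the population, such as the five 2s here, are treated as separate elements.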
Let’s see an example of the same:
# Create a population
population <- c(1, 1, 2, 2, 2, 2, 2, 3, 4, 4)
# Randomly sample 5 elements without replacement
sampled_elements <- sample(population, size = 5, replace = FALSE)
# Display the result
print(sampled_elements)
Output:
2 3 2 1 2
When I rerun the code, I get a different output, for example:
4 2 1 1 2
Because every element has the same chance of being drawn, values that occur more often in the population, such as 2 here, tend to appear more often in the sample. Over many draws, this keeps the sample roughly proportional to the population, so no group is systematically over- or under-represented.
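A single sample of five elements is too small to show this clearly, so a quick simulation can help. The sketch below repeats the draw many times and compares how often each value turns up with its share in the population. The seed and the number of repetitions are arbitrary choices.

```r
set.seed(2024)  # arbitrary seed
population <- c(1, 1, 2, 2, 2, 2, 2, 3, 4, 4)

# Draw 10,000 samples of size 5 and pool all drawn values
draws <- replicate(10000, sample(population, size = 5, replace = FALSE))

# Share of each value among all drawn elements
sample_share <- prop.table(table(draws))

# Share of each value in the population, for comparison
population_share <- prop.table(table(population))

print(round(sample_share, 2))
print(population_share)
```

The two tables should agree closely, which is the sense in which simple random sampling is representative on average, even though any individual sample can deviate.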
Incorporating Probability in the sample() function
By passing a probability vector to the sample() function, you can change the likelihood of selecting different elements from the population. Here is a use case where weighting is helpful: cheaper products should be selected more often than expensive ones.
# Create a vector of product prices
product_prices <- c(10, 20, 30, 40, 50)
# Define weights based on prices (cheaper products are more likely to be selected)
# The weights need not sum to 1; sample() rescales them
probabilities <- c(0.5, 0.4, 0.3, 0.2, 0.1)
# Randomly sample 3 products without replacement based on probability weights
sampled_products <- sample(product_prices, size = 3, replace = FALSE, prob = probabilities)
# Display the result
print(sampled_products)
Output:
30 40 20
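To check how the weights translate into selection chances, the following sketch draws a large sample with replacement and compares the observed frequencies with the rescaled weights. The sample size and seed are arbitrary, and replace = TRUE is used here only so that every draw uses the same probabilities.

```r
set.seed(99)  # arbitrary seed
product_prices <- c(10, 20, 30, 40, 50)
weights <- c(0.5, 0.4, 0.3, 0.2, 0.1)  # sums to 1.5, not 1

# With replacement, each draw uses the same rescaled probabilities
draws <- sample(product_prices, size = 100000, replace = TRUE, prob = weights)

observed <- as.numeric(prop.table(table(draws)))
expected <- weights / sum(weights)

print(round(rbind(observed, expected), 3))
```

Without replacement, the weights are applied among the items still remaining at each draw, so the chance that an item ends up in the sample is not simply proportional to its weight.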
Ensuring Reproducibility in Sample Generation
Reproducibility means that anyone who runs the same code gets the same result. In R, calling set.seed() before sample() makes the same sample come out each time the code is run.
This works because the values produced by a random number generator are not truly random. They are pseudorandom: a deterministic sequence determined by an initial value called the seed. By setting a fixed seed, we can ensure that the same pseudorandom numbers are generated on each rerun.
Let’s see this with an example:
# Population of students
students <- c("Alex", "Sara", "Max", "Emily", "Leo", "Grace", "Owen", "Lila", "Eli", "Mia")
# Set a random seed for reproducibility
set.seed(42)
# Randomly select 2 students for an assignment
sampled_students <- sample(students, size = 2, replace = FALSE)
# Display the result
print(sampled_students)
Output:
"Alex" "Leo"
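The same result is generated every time you run the code, provided the R version and random number generator settings are the same. The default sampling algorithm changed in R 3.6.0, so older versions can return a different sample for the same seed.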
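A detail that is easy to overlook is that the seed must be set again before each draw that should repeat. The sketch below, with illustrative variable names, shows that resetting the seed reproduces the sample, while a further call without a reset simply continues the pseudorandom sequence.

```r
students <- c("Alex", "Sara", "Max", "Emily", "Leo",
              "Grace", "Owen", "Lila", "Eli", "Mia")

set.seed(42)
first_run <- sample(students, size = 2)

set.seed(42)  # reset to the same seed
second_run <- sample(students, size = 2)

# No reset here: this draw continues the sequence and is not tied to first_run
third_run <- sample(students, size = 2)

identical(first_run, second_run)  # TRUE
```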
The sample() function is straightforward for simple data. But if you need a more advanced sampling technique or are working with large datasets, it may not be the best tool for the job. Stratified sampling is one example of a more advanced technique.
What is Stratified Sampling?
Stratified sampling is the process of dividing the population into subgroups (strata) and sampling from each subgroup separately. This helps ensure that the sample has a good mix and that each subgroup is sufficiently represented.
Here is a use case where stratified sampling is helpful:
# Load the dplyr library
library(dplyr)
# To make it reproducible
set.seed(1)
# Create a simple dataset with student grades and exam scores
students <- data.frame(
  student_id = 1:100,
  grade = rep(c('A', 'B', 'C', 'D'), each = 25),
  exam_score = runif(100, min = 60, max = 90)
)
# To create a project team with 3 students of each grade
# (in dplyr 1.0.0 and later, slice_sample(n = 3) is the recommended
# replacement for sample_n(), which still works)
stratified_sample <- students %>%
  group_by(grade) %>%
  sample_n(size = 3, replace = FALSE)
# Display the stratified sample
print(stratified_sample)
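The output is a grouped tibble with 12 rows: three randomly selected students from each of the four grades.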
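The same idea can also be expressed with sample() alone, which shows how stratified sampling reduces to one simple random sample per subgroup. This is a minimal base R sketch offered as an alternative to the dplyr pipeline above; pick_rows() is a hypothetical helper introduced only for illustration.

```r
set.seed(1)
students <- data.frame(
  student_id = 1:100,
  grade = rep(c("A", "B", "C", "D"), each = 25),
  exam_score = runif(100, min = 60, max = 90)
)

# Hypothetical helper: pick n random rows from one data frame
pick_rows <- function(df, n) df[sample(nrow(df), n), ]

# Split into one data frame per grade, sample each, then recombine
by_grade <- split(students, students$grade)
stratified_base <- do.call(rbind, lapply(by_grade, pick_rows, n = 3))

# Each grade should contribute exactly 3 students
print(table(stratified_base$grade))
```

Note that pick_rows() assumes every subgroup has at least n rows; a smaller subgroup would make sample() stop with an error because it cannot draw that many rows without replacement.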
Conclusion
In a nutshell, the sample() function in R is easy to use and provides a quick way of obtaining random samples from a population. The generated samples can be enhanced according to our requirements by assigning weights to elements of the population, deciding if the sampling is done with replacement or not, and obtaining a stratified sample to account for the diversity in the population.