When working with large datasets in R, one of the most important tasks is summarization: extracting key figures, aggregating data, and drawing relevant statistical conclusions. In this article, we will look at the summarise() and mean() functions in R, see how to summarise counts, and work through summarising data by group.
Why Summarize Data?
Before getting into the technical details, let us look at why summarization is such a crucial step in data analysis and the data science process. Summarizing helps with:
- Data Exploration: Summarized data provides a quick overview, enabling analysts to understand the distribution and characteristics of variables.
- Pattern Recognition: Aggregating data helps identify trends and patterns, making it easier to draw meaningful conclusions.
- Communicating Insights: Summarized data is more accessible to share with stakeholders, as it condenses complex information into a digestible format.
Now let us learn about the 'summarise' function in R programming.
Exploring the summarise() Function
The dplyr package in R is an extensive toolset for data manipulation that includes the summarise() function, which we will use for summarising data.
Basic Syntax of summarise
The basic syntax of the summarise function is as follows.
summarise(data, new_variable = aggregation_function(existing_variable))
Here, data is the dataset, new_variable is the name of the summary column to be created, existing_variable is the column being summarised, and aggregation_function is the function used for summarization, such as mean().
Calculating Mean within summarise() Function
Let's start with a simple example of how to use summarise to calculate the mean of a numerical variable. Let us create a dataset named 'data' with numeric variables.
Code:
library(dplyr)
data <- data.frame(numeric_variable = c(1, 9, 10, 16, 29, 30, 65, 74, 90, 94))
result <- summarise(data, mean_value = mean(numeric_variable))
print(result)
Output:
  mean_value
1       41.8
In this example, mean() calculates the average of the numeric_variable column inside summarise(). The result is stored in a new column named mean_value.
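The later examples in this article call summarise() through the %>% pipe rather than passing the data frame directly. As a minimal sketch, assuming the same data frame as above, the two forms can be compared like this (the names direct and piped are illustrative only):

```r
library(dplyr)

# Same data as in the example above
data <- data.frame(numeric_variable = c(1, 9, 10, 16, 29, 30, 65, 74, 90, 94))

# Direct call: the data frame is the first argument
direct <- summarise(data, mean_value = mean(numeric_variable))

# Piped call: %>% passes `data` as the first argument of summarise()
piped <- data %>% summarise(mean_value = mean(numeric_variable))

# Check that both forms produce the same one-row result
print(isTRUE(all.equal(direct, piped)))
```

Which form to use is largely a matter of style; the pipe tends to read better once several steps are chained together.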
Summarizing Multiple Variables
You can also use the summarise() function to compute several summary statistics at once. For example, to calculate the mean and standard deviation of numeric_variable, we can do the following.
Code:
result <- summarise(data, mean_value = mean(numeric_variable), sd_value = sd(numeric_variable))
print(result)
Output:
  mean_value sd_value
1       41.8 35.50211
Now, the result contains both the mean and the standard deviation of the specified variable.
The mean() Function in R
The mean function in R is a simple but effective tool for calculating the average of numeric values. When combined with the summarise() function, it becomes a critical component in data summarization tasks.
The basic syntax of the mean function is:
mean(x, na.rm = FALSE)
- x: A numeric (or logical) vector containing the values to average. To average a data frame column, pass the column itself rather than the whole data frame.
- na.rm: A logical value indicating whether NA values should be removed (default is FALSE).
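The effect of na.rm is easiest to see on a small vector with a missing value. The sketch below uses a made-up vector x and illustrative variable names; it needs only base R:

```r
# Hypothetical vector with one missing value
x <- c(4, 8, NA, 12)

# Default (na.rm = FALSE): a single NA makes the result NA
mean_with_na <- mean(x)

# na.rm = TRUE drops the NA first, then averages 4, 8 and 12
mean_without_na <- mean(x, na.rm = TRUE)

print(mean_with_na)
print(mean_without_na)
```

The same argument can be passed when mean() is called inside summarise(), which is often necessary with real-world data that contains gaps.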
Leveraging mean() with summarise
Now, let's use mean() on several columns within a single summarise() call.
Code:
library(dplyr)
data <- data.frame(numeric_variable = c(1, 9, 10, 16, 29, 30, 65, 74, 90, 94), another_variable=c(1, 9, 10, 16, 37, 30, 56, 78, 79, 99))
result <- summarise(data, mean_value = mean(numeric_variable), mean_another = mean(another_variable))
print(result)
Output:
  mean_value mean_another
1       41.8         41.5
In this example, we calculate the mean of both numeric_variable and another_variable using mean() within a single summarise() call.
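Writing one mean() call per column becomes repetitive as the number of columns grows. One alternative, not covered in the original example, is dplyr's across() helper (available in dplyr 1.0.0 and later). The sketch below assumes the same two-column data frame; the name result_across and the naming pattern "mean_{.col}" are illustrative choices:

```r
library(dplyr)

# Same data as in the example above
data <- data.frame(
  numeric_variable = c(1, 9, 10, 16, 29, 30, 65, 74, 90, 94),
  another_variable = c(1, 9, 10, 16, 37, 30, 56, 78, 79, 99)
)

# across() applies mean() to each selected column;
# .names controls how the output columns are named
result_across <- summarise(
  data,
  across(c(numeric_variable, another_variable), mean, .names = "mean_{.col}")
)

print(result_across)
```

The result has one column per selected input column, named mean_numeric_variable and mean_another_variable here.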
Counting with summarise in R
Counting occurrences is another crucial component of data summarization. The n() function from dplyr, used inside summarise(), lets us count the observations in a dataset.
The basic syntax of the n() function is:
n()
This function returns the number of observations in the current group.
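When no grouping has been applied, the "current group" is simply the whole data frame. A minimal sketch, using a hypothetical data frame called scores:

```r
library(dplyr)

# Hypothetical data frame with five rows
scores <- data.frame(score = c(10, 20, 30, 40, 50))

# With no grouping, n() counts every row in the data frame
result_n <- summarise(scores, count = n())

print(result_n)
```

Here count equals the total number of rows, the same value nrow(scores) would give.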
Combining mean and count in summarise
When summarising data, it is common to compute both means and counts. Let's take an example where we want to determine both the average and the number of observations for a variable called temperature.
Code:
library(dplyr)
my_data <- data.frame(
  location = c("Seattle", "Portland", "San Francisco", "Seattle", "Portland"),
  temperature = c(53, 57, 62, 55, 58)
)
result <- my_data %>%
  group_by(location) %>%
  summarise(mean_temp = mean(temperature), count = n())
print(result)
Output:
# A tibble: 3 × 3
  location      mean_temp count
1 Portland           57.5     2
2 San Francisco      62       1
3 Seattle            54       2
In this code sample, we use group_by to group the data by location, and then summarise to determine the mean temperature (mean_temp) as well as the number of observations for each location.
Summarizing Data by Group in R
Data is frequently organized into categories, and summarising within those categories often reveals more than a single overall figure. When used with summarise, the group_by() function from dplyr is an effective tool for this task.
The basic syntax of the group_by function is:
group_by(data, grouping_variable)
Here, data is the dataset, and grouping_variable is the variable by which the data should be grouped.
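It can help to see what group_by() does on its own before summarise() is applied. The sketch below uses a hypothetical scores data frame with a team column; group_vars() is a dplyr helper that reports the current grouping variables:

```r
library(dplyr)

# Hypothetical data: two teams with a few scores each
scores <- data.frame(
  team  = c("A", "A", "B", "B", "B"),
  score = c(10, 20, 30, 40, 50)
)

# group_by() keeps every row; it only records how the rows are grouped
grouped <- group_by(scores, team)
print(nrow(grouped))        # same number of rows as `scores`
print(group_vars(grouped))  # name of the grouping variable

# summarise() then returns one row per group
team_means <- summarise(grouped, mean_score = mean(score))
print(team_means)
```

In other words, group_by() changes how later verbs behave rather than changing the data itself.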
Multiple Grouping Variables
You can also group by several variables at once. For example, suppose you want to summarise data by both product_category and region.
Code:
library(dplyr)
sales_data <- data.frame(
  product_category = c("Electronics", "Electronics", "Clothing", "Furniture", "Furniture"),
  region = c("West", "East", "West", "South", "North"),
  sales_amount = c(1000, 500, 700, 1500, 2000),
  item_price = c(200, 150, 50, 300, 400)
)
result <- sales_data %>%
  group_by(product_category, region) %>%
  summarise(total_sales = sum(sales_amount), avg_price = mean(item_price))
print(result)
Output:
# A tibble: 5 × 4
# Groups:   product_category [3]
  product_category region total_sales avg_price
1 Clothing         West           700        50
2 Electronics      East           500       150
3 Electronics      West          1000       200
4 Furniture        North         2000       400
5 Furniture        South         1500       300
This code groups the data by product_category and region, then computes the total sales and the average item price for each combination of product category and region.
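The "# Groups: product_category [3]" line in the output shows that the result is still grouped: by default, summarise() drops only the last grouping variable (region here). If an ungrouped result is preferred, the .groups argument of summarise() can be set explicitly. A minimal sketch, using a trimmed copy of the sales data and illustrative result names:

```r
library(dplyr)

# Trimmed copy of the sales data from the example above
sales_data <- data.frame(
  product_category = c("Electronics", "Electronics", "Clothing", "Furniture", "Furniture"),
  region = c("West", "East", "West", "South", "North"),
  sales_amount = c(1000, 500, 700, 1500, 2000)
)

# "drop_last" matches the default here: only `region` is dropped
still_grouped <- sales_data %>%
  group_by(product_category, region) %>%
  summarise(total_sales = sum(sales_amount), .groups = "drop_last")

# "drop" removes all grouping from the result
ungrouped <- sales_data %>%
  group_by(product_category, region) %>%
  summarise(total_sales = sum(sales_amount), .groups = "drop")

print(group_vars(still_grouped))
print(group_vars(ungrouped))
```

Setting .groups explicitly also suppresses the informational message that summarise() otherwise prints about the grouping of its output.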
Conclusion
Summarizing data is an important phase in the data analysis process, and R, together with the dplyr package, has excellent capabilities for doing so. The summarise function, when used with functions such as mean() and n(), enables analysts to extract useful figures from datasets quickly, whether that means computing averages, counting occurrences, or summarising data by group. Understanding the summarise function will allow you to derive valuable insights, make informed decisions, and effectively communicate your findings to others.