
Conditional Data Operation using case_when in R

  • Dec 26, 2023
  • 6 Minutes Read
  • By Abhisek Ganguly

In data analysis and programming, conditional operations let us derive new values from existing data according to a set of conditions. In R, the case_when function from the dplyr package is a convenient tool for expressing such multi-branch conditions. In this article, we will look at the syntax, usage, and advantages of case_when.

What is the case_when Function?

The case_when function is part of dplyr, a popular package in the tidyverse ecosystem that provides tools for data manipulation. It is a vectorized form of a chain of if / else if statements: you list conditions together with the value to return for each, and the function evaluates them for every element of the input.

The basic syntax of case_when is as follows.

case_when(
  condition_1 ~ result_1,
  condition_2 ~ result_2,
  ...
  condition_n ~ result_n,
  TRUE ~ default_result
)

 

Each condition_i is a logical expression, and result_i is the value returned when that condition is TRUE. Conditions are evaluated in order, and for each element the first condition that is TRUE determines the result. If none of the conditions is met, the optional TRUE ~ default_result line supplies a default value; without it, unmatched elements are returned as NA.
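To make the evaluation order concrete, here is a minimal sketch. The vector x and the labels are made up for illustration and are not part of the examples that follow.

```r
library(dplyr)

# Hypothetical input, including one missing value
x <- c(5, 15, 25, NA)

# Conditions are checked top to bottom; the first TRUE one wins.
ordered <- case_when(
  x > 10 ~ "above 10",
  x > 20 ~ "above 20",  # never reached: x > 10 has already matched
  TRUE ~ "other"        # catches 5 and also NA
)

# Without a default branch, unmatched elements become NA.
no_default <- case_when(
  x > 10 ~ "above 10"
)

print(ordered)
print(no_default)
```

Note that 25 is labelled "above 10", not "above 20", because the broader condition comes first. Putting the most specific condition at the top avoids this. Recent dplyr releases (1.1.0 and later, to our knowledge) also accept a .default argument as an alternative to the TRUE ~ line.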

Now let us dive into exploring some practical examples to illustrate the versatility of the case_when function.

1. Categorizing Data

Assume that you have a dataset with a numerical variable that represents the ages of different people, and you want to add a new variable that assigns each person to an age group. The case_when function simplifies this kind of categorization.

Code:

library(dplyr)

data <- data.frame(id = 1:5, age = c(25, 35, 18, 42, 55))

data <- data %>%
  mutate(age_group = case_when(
    age < 18 ~ "Under 18",
    age >= 18 & age < 35 ~ "18-34",
    age >= 35 & age < 50 ~ "35-49",
    age >= 50 ~ "50 and above",
    TRUE ~ "Unknown"
  ))

print(data)

 

Output:

  id age    age_group
1  1  25        18-34
2  2  35        35-49
3  3  18        18-34
4  4  42        35-49
5  5  55 50 and above

 

In this example, case_when creates a new variable (age_group) from the age column, using the categories "Under 18," "18-34," "35-49," and "50 and above." No one in the sample data is younger than 18, so "Under 18" does not appear in the output. The four conditions cover every numeric age, so the TRUE ~ "Unknown" line only takes effect when age is NA, in which case none of the comparisons is TRUE.
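Because case_when stops at the first condition that is TRUE, the lower bound of each range is already implied by the branches above it. The following sketch is an equivalent, shorter way to write the same grouping; the people data frame is a hypothetical sample that adds an under-18 and a missing age so that every branch is exercised.

```r
library(dplyr)

# Hypothetical sample: one minor, one missing age
people <- data.frame(id = 1:4, age = c(12, 25, 60, NA))

people <- people %>%
  mutate(age_group = case_when(
    is.na(age) ~ "Unknown",   # handle missing ages explicitly
    age < 18 ~ "Under 18",
    age < 35 ~ "18-34",       # reached only when age >= 18
    age < 50 ~ "35-49",       # reached only when age >= 35
    TRUE ~ "50 and above"
  ))

print(people)
```

Testing is.na(age) first makes the handling of missing values explicit, which allows the final TRUE branch to serve as the "50 and above" category.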

2. Handling Missing Values

Missing values are commonplace in data analysis. With case_when you can treat them explicitly by testing for them with is.na() and handling them in a branch of their own, alongside the conditions for non-missing values.

Code:

data <- data.frame(id = 1:7, test_score = c(85, NA, 92, 78, NA, 60, 75))

data <- data %>%
    mutate(imputed_score = case_when(
        is.na(test_score) ~ "Missing",
        test_score < 60 ~ "Fail",
        test_score >= 60 & test_score < 80 ~ "Pass",
        test_score >= 80 ~ "High Pass"
    ))


print(data)

 

Output:

  id test_score imputed_score
1  1         85     High Pass
2  2         NA       Missing
3  3         92     High Pass
4  4         78          Pass
5  5         NA       Missing
6  6         60          Pass
7  7         75          Pass

 

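In this example, case_when labels missing test scores and categorizes the rest. If a test score is NA, the new column contains the text "Missing"; the score is not replaced with 0 or any other number, so despite its name, imputed_score holds labels rather than imputed values. The remaining scores are categorized as "Fail," "Pass," or "High Pass" based on predefined thresholds. The is.na() branch is needed because a comparison such as NA < 60 evaluates to NA, which case_when treats as not matched; without that branch, the missing rows would get NA in the new column.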
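If the goal is to fill in a numeric value rather than a label, the same pattern applies with numeric results on the right-hand side. The sketch below replaces missing scores with the median of the observed scores; median imputation is only one possible strategy, chosen here for illustration, and the names scores and median_score are our own.

```r
library(dplyr)

scores <- data.frame(id = 1:7, test_score = c(85, NA, 92, 78, NA, 60, 75))

# Median of the observed (non-missing) scores
median_score <- median(scores$test_score, na.rm = TRUE)

scores <- scores %>%
  mutate(imputed_score = case_when(
    is.na(test_score) ~ median_score,  # fill missing values
    TRUE ~ test_score                  # keep observed values as they are
  ))

print(scores)
```

All results in a single case_when call should have a compatible type. Here both branches return numbers, so the new column stays numeric.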

3. Creating Dummy Variables

Creating dummy variables is a common preprocessing step in machine learning workflows, because many algorithms require categorical data to be encoded as numbers. The case_when function can generate these 0/1 indicator columns, one condition per category.

Code:

data <- data.frame(id = 1:5, department = c("HR", "IT", "Finance", "HR", "Marketing"))

data <- data %>%
  mutate(
    is_hr = case_when(department == "HR" ~ 1, TRUE ~ 0),
    is_it = case_when(department == "IT" ~ 1, TRUE ~ 0),
    is_finance = case_when(department == "Finance" ~ 1, TRUE ~ 0),
    is_marketing = case_when(department == "Marketing" ~ 1, TRUE ~ 0)
  )

print(data)

 

Output:

  id department is_hr is_it is_finance is_marketing
1  1         HR     1     0          0            0
2  2         IT     0     1          0            0
3  3    Finance     0     0          1            0
4  4         HR     1     0          0            0
5  5  Marketing     0     0          0            1

 

In this example, case_when creates one dummy variable per department. Each dummy variable is 1 when its condition is satisfied and 0 otherwise.
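For a simple two-way split like this, converting the logical comparison directly to an integer is a possible alternative. The sketch below compares the two approaches on one column; the names staff, is_hr_case and is_hr_cast are made up for the comparison.

```r
library(dplyr)

staff <- data.frame(
  id = 1:5,
  department = c("HR", "IT", "Finance", "HR", "Marketing")
)

staff <- staff %>%
  mutate(
    is_hr_case = case_when(department == "HR" ~ 1, TRUE ~ 0),  # numeric 1/0
    is_hr_cast = as.integer(department == "HR")                # integer 1/0
  )

print(staff)
```

The two columns agree for this data. They differ when department is NA: the case_when version falls through to the default and returns 0, while as.integer() returns NA, so the choice depends on how missing categories should be treated.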

Advanced Applications of case_when

The case_when function is useful not only for simple conditions but also for more involved transformations such as recoding values, creating categorical bins, or assigning weights based on detailed conditions. The next two examples show thresholds computed from the data and conditions that combine several variables.

1. Dynamic Thresholds

Sometimes the thresholds should depend on the data rather than being fixed in advance. Because the conditions in case_when are ordinary R expressions, they can refer to values computed from the dataset, such as a median, so the categories adapt when the data changes.

Code:

data <- data.frame(id = 1:8, revenue = c(150000, 220000, 90000, 120000, 180000, 300000, 250000, 80000))

median_revenue <- median(data$revenue)

data <- data %>%
  mutate(performance_category = case_when(
    revenue > median_revenue * 1.5 ~ "High Performer",
    revenue > median_revenue ~ "Moderate Performer",
    TRUE ~ "Low Performer"
  ))

print(data)

 

Output:

  id revenue performance_category
1  1  150000        Low Performer
2  2  220000   Moderate Performer
3  3   90000        Low Performer
4  4  120000        Low Performer
5  5  180000   Moderate Performer
6  6  300000       High Performer
7  7  250000       High Performer
8  8   80000        Low Performer

 

In this instance, businesses are categorized according to their revenue using case_when. The thresholds are derived from the data: the median revenue here is 165,000, so a business is a "High Performer" above 1.5 times the median (247,500), a "Moderate Performer" above the median, and a "Low Performer" otherwise. If the revenue figures change, the median and therefore the categories adjust automatically.
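The same idea works with other summary statistics. As a sketch, the thresholds below come from the first and third quartiles computed with base R's quantile(); the band names and the sales data frame are assumptions made for this illustration.

```r
library(dplyr)

sales <- data.frame(
  id = 1:8,
  revenue = c(150000, 220000, 90000, 120000, 180000, 300000, 250000, 80000)
)

# Quartile thresholds computed from the data (default quantile method)
q25 <- quantile(sales$revenue, probs = 0.25, names = FALSE)
q75 <- quantile(sales$revenue, probs = 0.75, names = FALSE)

sales <- sales %>%
  mutate(revenue_band = case_when(
    revenue >= q75 ~ "Top quartile",
    revenue <= q25 ~ "Bottom quartile",
    TRUE ~ "Middle"
  ))

print(sales)
```

Computing the thresholds once, outside mutate(), keeps the conditions readable and makes the cut-off values easy to inspect.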

2. Complex Logical Conditions

By combining several criteria with logical operators, case_when lets you express conditions that depend on more than one variable at once.

Code:

data <- data.frame(
  id = 1:6,
  temperature = c(28, 15, 22, 35, 18, 27),
  humidity = c(60, 80, 50, 30, 75, 40)
)

data <- data %>%
  mutate(weather_condition = case_when(
    temperature > 30 & humidity > 70 ~ "Hot and Humid",
    temperature < 20 & humidity > 60 ~ "Cold and Humid",
    temperature > 25 & temperature < 30 & humidity < 60 ~ "Warm and Dry",
    TRUE ~ "Unknown"
  ))

print(data)

 

Output:

  id temperature humidity weather_condition
1  1          28       60           Unknown
2  2          15       80    Cold and Humid
3  3          22       50           Unknown
4  4          35       30           Unknown
5  5          18       75    Cold and Humid
6  6          27       40      Warm and Dry

 

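The case_when function is used in this example to classify the weather based on temperature and humidity. Each condition combines comparisons with & (AND); the | (OR) operator can be used in the same way. Rows that match none of the combinations, including row 1, where a humidity of exactly 60 fails the humidity < 60 test, fall through to "Unknown." No row in the sample data is both above 30 degrees and above 70 humidity, so "Hot and Humid" does not appear in the output.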
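The example above uses only &. The following sketch adds | to the same temperature and humidity values; the alert categories are hypothetical and chosen only to show how the operators combine.

```r
library(dplyr)

readings <- data.frame(
  id = 1:6,
  temperature = c(28, 15, 22, 35, 18, 27),
  humidity = c(60, 80, 50, 30, 75, 40)
)

readings <- readings %>%
  mutate(alert = case_when(
    # Parentheses group the OR before the AND is applied
    (temperature > 30 | temperature < 20) & humidity > 70 ~ "Extreme and Humid",
    temperature > 30 | humidity > 70 ~ "Alert",
    TRUE ~ "Normal"
  ))

print(readings)
```

In R, & binds more tightly than |, so the parentheses in the first condition are required to get the intended grouping. Rows 2 and 5 match the first branch, row 4 matches the second, and the rest are "Normal."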

Conclusion

R's case_when function gives programmers and data analysts a clear way to express multi-branch conditional logic. Whether the task is classifying data, handling missing values, creating dummy variables, or combining several logical conditions, it keeps each condition next to its result, which makes data manipulation code easier to read and maintain. Becoming comfortable with case_when is a useful step toward more efficient data analysis workflows in R.


About The Author
Abhisek Ganguly
Passionate machine learning enthusiast with a deep love for computer science, dedicated to pushing the boundaries of AI through academic research and sharing knowledge through teaching.