# Logistic Regression in R (With Code Examples)

Logistic regression is a powerful tool for analyzing and predicting binary outcomes, and it is one of the most widely used techniques in statistical modelling. Knowing how to run a logistic regression in the R programming language is an important skill for anyone working in data science or research. In this article, we'll walk through a logistic regression analysis in R, with a focus on the glm function and how it's used for binary logistic regression.

## Basics of Logistic Regression

Logistic regression is a statistical method used when the dependent variable is binary, meaning it has only two possible outcomes. These outcomes are often coded as 0 and 1, representing, for instance, failure and success, presence and absence, or yes and no. The fundamental idea behind logistic regression is to model the probability that a given instance belongs to a particular category.
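Concretely, logistic regression passes a linear combination of the predictors through the logistic (sigmoid) function, p = 1 / (1 + e^(-z)), which squeezes any real number into the interval (0, 1) so it can be read as a probability. A minimal sketch in base R:

```r
# The logistic (sigmoid) function maps any real number to a probability in (0, 1)
sigmoid <- function(z) 1 / (1 + exp(-z))

sigmoid(0)    # 0.5: a linear predictor of 0 corresponds to a 50% probability
sigmoid(4)    # close to 1: large positive values mean the outcome is very likely
sigmoid(-4)   # close to 0: large negative values mean it is very unlikely

# R's built-in plogis() computes the same quantity
all.equal(sigmoid(2), plogis(2))  # TRUE
```

This is the curve that glm fits behind the scenes when we ask for a binomial family.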

R's glm (generalized linear model) function is essential for fitting logistic regression models. It has the following syntax:

```
model <- glm(formula, data = dataset, family = "binomial")
```

Here, "formula" specifies the relationship between the predictor variables and the response variable, "data" refers to the dataset, and "family" is set to "binomial" to indicate logistic regression.
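As a quick illustration of this syntax before we move to a real dataset, here is a minimal fit on R's built-in mtcars data, modelling the binary transmission indicator (am, coded 0/1) from fuel economy (mpg); the choice of variables here is purely illustrative:

```r
# Minimal glm() example on the built-in mtcars dataset:
# model the binary transmission indicator (am) from miles per gallon (mpg)
toy_model <- glm(am ~ mpg, data = mtcars, family = "binomial")

coef(toy_model)  # intercept and slope, both on the log-odds scale
```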

### Understanding the Formula

The formula in the glm function is crucial in defining the relationship between the predictor variables and the response variable. The formula follows the pattern:

```
response_variable ~ predictor_variable1 + predictor_variable2 + ... + predictor_variableN
```

For example, if we are examining the relationship between a student's exam success (1 for success, 0 for failure) and the number of hours spent studying and the type of study materials used, the formula might look like:

```
success ~ hours_studied + type_of_material
```

Let's start by preparing the data for analysis.

## Data Preparation

It is important to properly prepare the data before getting into logistic regression analysis. We need to make sure that the dependent variable is coded as a binary outcome (0 or 1) and that the feature variables have appropriate types. Convert categorical variables to factors using the "factor" function, ensuring that R treats them as categorical during model fitting.

Before we jump into code, let's ensure that the necessary packages are installed. We'll need the "titanic" package for the dataset, along with "dplyr" for data manipulation and "broom" for tidying up model outputs.

```
# Install and load necessary packages
install.packages("titanic")
install.packages("dplyr")
install.packages("broom")

library(titanic)
library(dplyr)
library(broom)
```

Now that our packages are installed, let's load the Titanic dataset and look at its structure using the "str" function.

```
# Load the Titanic dataset
data("titanic_train")
titanic_data <- titanic_train

# Explore the structure of the dataset
str(titanic_data)
```

Output:

```
'data.frame':	891 obs. of  12 variables:
 $ PassengerId: int  1 2 3 4 5 6 7 8 9 10 ...
 $ Survived   : int  0 1 1 1 0 0 0 0 1 1 ...
 $ Pclass     : int  3 1 3 1 3 3 1 3 3 2 ...
 $ Name       : chr  "Braund, Mr. Owen Harris" "Cumings, Mrs. John Bradley (Florence Briggs Thayer)" "Heikkinen, Miss. Laina" "Futrelle, Mrs. Jacques Heath (Lily May Peel)" ...
 $ Sex        : chr  "male" "female" "female" "female" ...
 $ Age        : num  22 38 26 35 35 NA 54 2 27 14 ...
 $ SibSp      : int  1 1 0 1 0 0 0 3 0 1 ...
 $ Parch      : int  0 0 0 0 0 0 0 1 2 0 ...
 $ Ticket     : chr  "A/5 21171" "PC 17599" "STON/O2. 3101282" "113803" ...
 $ Fare       : num  7.25 71.28 7.92 53.1 8.05 ...
 $ Cabin      : chr  "" "C85" "" "C123" ...
 $ Embarked   : chr  "S" "C" "S" "S" ...
```

This dataset contains various details about passengers, but for our logistic regression model, we'll focus on three key elements: survival status ("Survived"), passenger class ("Pclass"), and age ("Age").

## Data Preprocessing

The next step is to preprocess the data so it is ready for analysis. We'll handle missing values, select the relevant columns, and convert passenger class to a factor.

```
# Data preprocessing
# Select relevant columns: Survived, Pclass, Age
selected_columns <- c("Survived", "Pclass", "Age")
titanic_data <- select(titanic_data, all_of(selected_columns))

# Remove rows with missing values
titanic_data <- na.omit(titanic_data)

# Convert Pclass to a factor
titanic_data$Pclass <- as.factor(titanic_data$Pclass)

# Display the first few rows of the preprocessed data
head(titanic_data)
```

Output:

```
  Survived Pclass Age
1        0      3  22
2        1      1  38
3        1      3  26
4        1      1  35
5        0      3  35
7        0      1  54
```

The dataset now contains only the variables needed for our analysis. Passenger class has been converted to a factor, and rows containing missing values have been removed. Note that row 6 is gone because its Age was missing.

## Fitting a Logistic Regression Model

Using the "glm" function, we'll model the probability of survival based on passenger class and age.

```
# Fitting a logistic regression model
logreg_model <- glm(Survived ~ Pclass + Age, data = titanic_data, family = "binomial")

# View the summary of the model
summary(logreg_model)
```

Output:

```
Call:
glm(formula = Survived ~ Pclass + Age, family = "binomial", data = titanic_data)

Deviance Residuals:
    Min       1Q   Median       3Q      Max
-2.1524  -0.8466  -0.6083   1.0031   2.3929

Coefficients:
             Estimate Std. Error z value Pr(>|z|)
(Intercept)  2.296012   0.317629   7.229 4.88e-13 ***
Pclass2     -1.137533   0.237578  -4.788 1.68e-06 ***
Pclass3     -2.469561   0.240182 -10.282  < 2e-16 ***
Age         -0.041755   0.006736  -6.198 5.70e-10 ***
---
Signif. codes:  0 ‘***’ 0.001 ‘**’ 0.01 ‘*’ 0.05 ‘.’ 0.1 ‘ ’ 1

(Dispersion parameter for binomial family taken to be 1)

    Null deviance: 964.52  on 713  degrees of freedom
Residual deviance: 827.16  on 710  degrees of freedom
AIC: 835.16

Number of Fisher Scoring iterations: 4
```

Here we fit a logistic regression model with survival as the response variable and passenger class and age as predictors. The summary reports the coefficients, standard errors, z-values, and p-values. Each coefficient gives the change in the log-odds of the outcome for a one-unit increase in the corresponding predictor, holding the others fixed. For categorical variables, each coefficient compares one category to the reference category: here, Pclass2 and Pclass3 are both compared to first class.
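Because the coefficients are on the log-odds scale, exponentiating them gives odds ratios, which are often easier to communicate. A sketch using the coefficient estimates reported in the summary above (equivalently, you could run `exp(coef(logreg_model))` on the fitted model):

```r
# Convert the reported log-odds coefficients to odds ratios
coefs <- c(intercept = 2.296012, Pclass2 = -1.137533,
           Pclass3 = -2.469561, Age = -0.041755)
odds_ratios <- exp(coefs)
round(odds_ratios, 3)

# The Age odds ratio, exp(-0.041755) ≈ 0.959, says each additional year of age
# multiplies the odds of survival by about 0.96 (a ~4% drop), holding class fixed.
# The Pclass3 odds ratio, exp(-2.469561) ≈ 0.085, says a third-class passenger has
# roughly 8.5% the odds of survival of a first-class passenger of the same age.
```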

## Assessing Model Performance

Evaluating the performance of the logistic regression model is crucial to ensure its reliability and generalizability. Common techniques include examining the confusion matrix; calculating accuracy, precision, and recall; and plotting the receiver operating characteristic (ROC) curve and its area under the curve (AUC).

Here is the code for the confusion matrix and ROC curve.

```
# Assessing model performance
# Predicted probabilities on the data used to fit the model
predicted_values <- predict(logreg_model, newdata = titanic_data, type = "response")
predicted_classes <- ifelse(predicted_values > 0.5, 1, 0)

# Confusion matrix
conf_matrix <- table(observed = titanic_data$Survived, predicted = predicted_classes)
conf_matrix

# ROC curve (install.packages("pROC") if it is not already installed)
library(pROC)
roc_curve <- roc(titanic_data$Survived, predicted_values)
plot(roc_curve, main = "ROC Curve", col = "blue", lwd = 2)
```

Confusion Matrix

```
        predicted
observed   0   1
       0 344  80
       1 138 152
```

ROC Curve Plot

The ROC curve and confusion matrix give us insight into the model's accuracy and overall performance. Note that here we evaluated the model on the same data used to fit it, so these numbers are optimistic; for an honest estimate of performance on new, unseen data, hold out a test set or use cross-validation. From the confusion matrix we can also calculate other important quantities, such as the model's precision and recall, which are valuable in research and development settings.
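Accuracy, precision, and recall can be computed directly from the counts in the confusion matrix above (344 true negatives, 80 false positives, 138 false negatives, 152 true positives):

```r
# Counts taken from the confusion matrix above
tn <- 344; fp <- 80; fn <- 138; tp <- 152

accuracy  <- (tp + tn) / (tp + tn + fp + fn)  # overall proportion classified correctly
precision <- tp / (tp + fp)                   # of predicted survivors, how many survived
recall    <- tp / (tp + fn)                   # of actual survivors, how many were flagged

round(c(accuracy = accuracy, precision = precision, recall = recall), 3)
# accuracy ≈ 0.695, precision ≈ 0.655, recall ≈ 0.524
```

The low recall shows the model misses many actual survivors at the default 0.5 cutoff, a trade-off the ROC curve lets you explore across thresholds.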

## Conclusion

Logistic regression in R is an efficient and powerful way to model binary outcomes. By understanding the fundamentals of logistic regression, data preparation, model fitting, coefficient interpretation, and model evaluation, you can learn a great deal about the relationships in your data. It opens up many opportunities for statistics-driven decision making as a data scientist.