What is Predicting Loan Default in R?

In this article, we will learn how to predict loan default in R using logistic regression.

Predicting loan default is a common task in the financial industry and can be approached with various machine learning techniques in R. Many people depend on loans, for reasons such as paying for education, buying a house, or getting through financial difficulties. Loans are a major source of income for banks and also their major source of financial risk. To decide whether a person is eligible for a loan, a bank has to check many aspects of the application, and machine learning and deep learning now help the banking sector screen candidates in less time. Here’s a step-by-step guide on how to build a predictive model for loan default using R:

Table of Contents

Objectives Of Predicting Loan Default
Dataset used for Predicting Loan Default in R
Importing required libraries
Loading DataSet
Data Preprocessing and Exploration
Checking the summary of the data
Visualizing the data and getting raw information
Visualizing education-wise default
Model building for Predicting Loan Default
Mapping accuracy for Predicting Loan Default model
Predicting Loan Default using model

Objectives Of Predicting Loan Default

The goal of this loan default prediction is to estimate the probability that a customer defaults, using machine learning techniques.
Understand the customer’s behavior before granting the loan.
Help make more informed credit scoring decisions and reduce the rate of defaults.

So, here we will use logistic regression to predict loan default in R from key features such as age, education, income, and credit debt.
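
As a brief aside on what logistic regression computes: the model forms a linear combination of the features and passes it through the logistic function to obtain a probability between 0 and 1. The sketch below uses made-up values for that linear combination, purely for illustration; the helper name `logistic` is hypothetical, and `plogis()` is base R's built-in equivalent.

```r
# Logistic (inverse-logit) function: maps any real number into (0, 1)
logistic <- function(z) 1 / (1 + exp(-z))

# Hypothetical linear predictor values (not taken from the dataset)
z <- c(-2, 0, 2)

probs <- logistic(z)
print(round(probs, 3))

# Base R's plogis() computes the same quantity
print(all.equal(probs, plogis(z)))
```

A linear predictor of 0 corresponds to a probability of exactly 0.5, which is why 0.5 is a natural default cut-off when turning probabilities into classes.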

Dataset used for Predicting Loan Default in R

Here we use the bank-loan.csv data to build this model. The data set contains 9 columns (8 features and the target, default):

age: the person’s age (integer column).
ed: the person’s education level (categorical: “secondary”, “primary”, “tertiary”, “unknown”), already encoded in the dataset as the integers 1, 2, 3, 4.
employ: how many years the person has been working (integer column).
address: how many years the person has been living at their current address (integer column).
income: the person’s income, in thousands.
debtinc: the person’s debt-to-income ratio, as a percentage.
creddebt: the person’s credit debt (in thousands, like income).
othdebt: the person’s other debt (in thousands, like income).
default: whether the person has defaulted (binary: “yes”, “no”), already encoded as 1 and 0 in the data set.

Importing required libraries

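Here we load the required libraries. Each install.packages() call only needs to be run once per machine; the library() calls are needed in every new R session.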

R

# Installing packages
# Package for creating plots and visualizations
install.packages("ggplot2")
library(ggplot2) 
# Package for regularized regression methods
install.packages("glmnet")  
library(glmnet)   
# Collection of packages for modeling and machine learning
install.packages("tidymodels") 
library(tidymodels) 
# Package for training and evaluating predictive models
install.packages("caret")  
library(caret)     
# Package for various statistical tools and functions
install.packages("rcompanion") 
library(rcompanion) 
# Package for ROC analysis
install.packages("pROC")       
library(pROC)
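
Since re-running every `install.packages()` call on each execution is slow, one possible alternative is to check which packages are missing first. This is only a sketch: the variable names `pkgs`, `is_installed`, and `missing_pkgs` are hypothetical, and the actual install call is left commented out so that the snippet has no side effects.

```r
# Packages used in this tutorial (same list as above)
pkgs <- c("ggplot2", "glmnet", "tidymodels", "caret", "rcompanion", "pROC")

# requireNamespace() returns FALSE (instead of raising an error)
# when a package is not installed
is_installed <- vapply(pkgs, requireNamespace, logical(1), quietly = TRUE)
missing_pkgs <- pkgs[!is_installed]

if (length(missing_pkgs) > 0) {
  message("Not installed yet: ", paste(missing_pkgs, collapse = ", "))
  # Uncomment to install only what is missing:
  # install.packages(missing_pkgs)
} else {
  message("All packages are already installed.")
}
```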

Loading DataSet

R

df.bank = read.csv("bank-loan.csv") # df.bank is just a naming convention
head(df.bank)

Output:

  age ed employ address income debtinc  creddebt  othdebt default
1  41  3     17      12    176     9.3 11.359392 5.008608       1
2  27  1     10       6     31    17.3  1.362202 4.000798       0
3  40  1     15      14     55     5.5  0.856075 2.168925       0
4  41  1     15      14    120     2.9  2.658720 0.821280       0
5  24  2      2       0     28    17.3  1.787436 3.056564       1
6  41  2      5       5     25    10.2  0.392700 2.157300       0
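
The six rows printed above suggest how the debt columns relate to each other: creddebt plus othdebt appears to equal income multiplied by debtinc and divided by 100. The sketch below checks this on those six rows only, typed in by hand; whether it holds for every row of bank-loan.csv is an assumption that is not verified here. The names `head_rows`, `implied_debt`, and `reported_debt` are hypothetical.

```r
# The six rows printed by head(df.bank) above, typed in by hand
head_rows <- data.frame(
  income   = c(176, 31, 55, 120, 28, 25),
  debtinc  = c(9.3, 17.3, 5.5, 2.9, 17.3, 10.2),
  creddebt = c(11.359392, 1.362202, 0.856075, 2.658720, 1.787436, 0.392700),
  othdebt  = c(5.008608, 4.000798, 2.168925, 0.821280, 3.056564, 2.157300)
)

# Total debt implied by the ratio: income * debtinc / 100
implied_debt <- head_rows$income * head_rows$debtinc / 100

# Total debt obtained by adding the two debt columns
reported_debt <- head_rows$creddebt + head_rows$othdebt

# Show both side by side and check that they agree up to rounding
print(data.frame(implied_debt, reported_debt))
print(max(abs(implied_debt - reported_debt)) < 1e-6)
```

If the relationship holds across the full data, debtinc, creddebt, and othdebt carry overlapping information, which is worth keeping in mind when reading the coefficients of a model that includes all three.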

Data Preprocessing and Exploration

R

# Convert the default column to a factor
df.bank$default = as.factor(df.bank$default)
str(df.bank)

# Count the NA values in each column
colSums(is.na(df.bank))

# Drop the rows with NA values and confirm none remain
df = na.omit(df.bank)
colSums(is.na(df))

Output:

'data.frame':    850 obs. of  9 variables:
 $ age     : int  41 27 40 41 24 41 39 43 24 36 ...
 $ ed      : int  3 1 1 1 2 2 1 1 1 1 ...
 $ employ  : int  17 10 15 15 2 5 20 12 3 0 ...
 $ address : int  12 6 14 14 0 5 9 11 4 13 ...
 $ income  : int  176 31 55 120 28 25 67 38 19 25 ...
 $ debtinc : num  9.3 17.3 5.5 2.9 17.3 10.2 30.6 3.6 24.4 19.7 ...
 $ creddebt: num  11.359 1.362 0.856 2.659 1.787 ...
 $ othdebt : num  5.009 4.001 2.169 0.821 3.057 ...
 $ default : Factor w/ 2 levels "0","1": 2 1 1 1 2 1 1 1 2 1 ...

     age       ed   employ  address   income  debtinc creddebt  othdebt  default 
       0        0        0        0        0        0        0        0      150 

     age       ed   employ  address   income  debtinc creddebt  othdebt  default 
       0        0        0        0        0        0        0        0        0

There are 150 NA values in the default column. Since we have no instructions about how to handle them, we simply drop those rows and work with the remaining 700 observations.
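
Dropping is one option; another is to set the unlabeled rows aside so they can be scored once a model has been fitted. A minimal sketch on a small made-up data frame follows (the values and the names `toy`, `labeled`, and `unlabeled` are hypothetical and do not come from bank-loan.csv).

```r
# Small made-up data frame with some missing labels
toy <- data.frame(
  age     = c(41, 27, 40, 24, 36),
  income  = c(176, 31, 55, 28, 25),
  default = factor(c(1, 0, NA, 1, NA))
)

# Rows with a known outcome: usable for training and evaluation
labeled <- toy[!is.na(toy$default), ]

# Rows without an outcome: not usable for fitting, but can be scored later
unlabeled <- toy[is.na(toy$default), ]

print(nrow(labeled))
print(nrow(unlabeled))

# na.omit() keeps the same rows as `labeled` when `default`
# is the only column containing NA values
print(identical(rownames(na.omit(toy)), rownames(labeled)))
```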

Checking the summary of the data

R

# Checking the summary of the data
summary(df.bank)

Output:

      age              ed            employ          address           income      
 Min.   :20.00   Min.   :1.000   Min.   : 0.000   Min.   : 0.000   Min.   : 13.00  
 1st Qu.:29.00   1st Qu.:1.000   1st Qu.: 3.000   1st Qu.: 3.000   1st Qu.: 24.00  
 Median :34.00   Median :1.000   Median : 7.000   Median : 7.000   Median : 35.00  
 Mean   :35.03   Mean   :1.705   Mean   : 8.566   Mean   : 8.372   Mean   : 46.68  
 3rd Qu.:41.00   3rd Qu.:2.000   3rd Qu.:13.000   3rd Qu.:12.000   3rd Qu.: 55.75  
 Max.   :56.00   Max.   :4.000   Max.   :33.000   Max.   :34.000   Max.   :446.00  
    debtinc         creddebt          othdebt         default   
 Min.   : 0.10   Min.   : 0.0117   Min.   : 0.04558   0   :517  
 1st Qu.: 5.10   1st Qu.: 0.3822   1st Qu.: 1.04594   1   :183  
 Median : 8.70   Median : 0.8851   Median : 2.00324   NA's:150  
 Mean   :10.17   Mean   : 1.5768   Mean   : 3.07879             
 3rd Qu.:13.80   3rd Qu.: 1.8984   3rd Qu.: 3.90300             
 Max.   :41.30   Max.   :20.5613   Max.   :35.19750

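The summary shows no obviously implausible values in the predictors, so we can continue with the logistic regression model. Note that summary() was run on df.bank, so the 150 NA values in default still appear; the cleaned data frame df contains 517 non-defaulters (0) and 183 defaulters (1).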

Visualizing the data and getting raw information

R

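# Stack two plots vertically in one plotting window
par(mfrow = c(2,1))
# Number of observations per education level
barplot(table(df$ed), col = c("lightgreen", "yellow","orange","blue"),
        ylab = "Number of observations",
        xlab = "Education")
# Number of observations per default status
barplot(table(df$default), col = c("pink","lightblue"),
        ylab = "Number of observations", xlab = "Default")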

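Output: two bar plots, one showing the number of observations per education level and one showing the number of observations per default status.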

Visualizing education-wise default

R

ggplot(df, aes(ed, fill = default)) +
  geom_bar()

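Output: a stacked bar chart of the number of observations per education level, with bars filled by default status.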
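
A bar chart of counts makes it hard to compare default rates between education levels that have very different numbers of observations. One way to see the rates directly is a row-wise proportion table. The sketch below uses a small made-up data frame rather than bank-loan.csv, and the names `toy`, `counts`, and `rates` are hypothetical.

```r
# Made-up data: education level (1-4) and default flag (0/1)
toy <- data.frame(
  ed      = c(1, 1, 1, 1, 2, 2, 2, 3, 3, 4),
  default = factor(c(0, 0, 0, 1, 0, 1, 0, 1, 0, 1))
)

# Counts of each default status within each education level
counts <- table(ed = toy$ed, default = toy$default)
print(counts)

# Row-wise proportions: each row sums to 1, so the "1" column
# is the default rate for that education level
rates <- prop.table(counts, margin = 1)
print(round(rates, 2))
```

With ggplot2, a similar view can be obtained by passing `position = "fill"` to `geom_bar()`, which rescales every bar to the same height.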

Model building for Predicting Loan Default

R

set.seed(421)
# 80/20 train/test split, stratified on the target
split = initial_split(df, prop = 0.8, strata = default)
train = split %>% 
  training()
val_test= split %>% 
  testing()

# Model-1: logistic regression on all features
log_lr = glm(default ~ ., family = "binomial", data = train)
summary(log_lr)

Output:

Call:
glm(formula = default ~ ., family = "binomial", data = train)

Coefficients:
             Estimate Std. Error z value Pr(>|z|)    
(Intercept) -1.932219   0.692795  -2.789  0.00529 ** 
age          0.050381   0.019602   2.570  0.01017 *  
ed           0.121287   0.141056   0.860  0.38987    
employ      -0.267234   0.037417  -7.142 9.20e-13 ***
address     -0.120499   0.025812  -4.668 3.04e-06 ***
income      -0.006960   0.009751  -0.714  0.47537    
debtinc      0.056472   0.034475   1.638  0.10141    
creddebt     0.625906   0.125897   4.972 6.64e-07 ***
othdebt      0.071951   0.087457   0.823  0.41067    
---
Signif. codes:  0 ‘***’ 0.001 ‘**’ 0.01 ‘*’ 0.05 ‘.’ 0.1 ‘ ’ 1

(Dispersion parameter for binomial family taken to be 1)

    Null deviance: 642.05  on 558  degrees of freedom
Residual deviance: 449.36  on 550  degrees of freedom
AIC: 467.36

Number of Fisher Scoring iterations: 6

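This output lists the coefficients, their significance, and measures of model fit. At the 0.05 level, employ, address, and creddebt are highly significant and age is significant, while ed, income, debtinc, and othdebt are not. The negative coefficients of employ and address mean that longer employment and longer residence at the same address are associated with lower odds of default, whereas the positive coefficient of creddebt means that more credit debt is associated with higher odds. The residual deviance (449.36) is well below the null deviance (642.05), so the features explain a substantial part of the variation in default; the AIC (467.36) is useful for comparing this model against alternatives.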
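
Logistic regression coefficients are on the log-odds scale, so exponentiating them gives odds ratios, which are often easier to read. The sketch below copies the rounded estimates and deviances from the output above into plain vectors (the names `estimates`, `odds_ratios`, `lr_stat`, `lr_df`, and `p_value` are hypothetical); with the fitted model at hand, `exp(coef(log_lr))` would give the unrounded odds ratios.

```r
# Coefficient estimates copied from the summary(log_lr) output above
estimates <- c(
  age = 0.050381, ed = 0.121287, employ = -0.267234, address = -0.120499,
  income = -0.006960, debtinc = 0.056472, creddebt = 0.625906, othdebt = 0.071951
)

# Odds ratio: multiplicative change in the odds of default for a
# one-unit increase in the feature, with the other features held fixed
odds_ratios <- exp(estimates)
print(round(odds_ratios, 3))

# Likelihood-ratio test of the fitted model against the intercept-only
# model, using the deviances and degrees of freedom printed above
lr_stat <- 642.05 - 449.36
lr_df   <- 558 - 550
p_value <- pchisq(lr_stat, df = lr_df, lower.tail = FALSE)
print(p_value)
```

An odds ratio above 1 (as for creddebt) means the odds of default rise with the feature, and one below 1 (as for employ) means they fall.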

Mapping accuracy for Predicting Loan Default model

R

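# No newdata is given, so these are fitted probabilities for the training data
prd_Val = predict(log_lr, type='response')
# Classify as default (1) when the predicted probability exceeds 0.5
prd_default = ifelse(prd_Val > 0.5, 1, 0)
# Confusion matrix: predicted vs. actual default on the training data
cnf_m = table(prd=prd_default, act=train$default)
confusionMatrix(cnf_m)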

Output:

Confusion Matrix and Statistics

   act
prd   0   1
  0 373  72
  1  40  74
                                         
               Accuracy : 0.7996         
                 95% CI : (0.764, 0.8321)
    No Information Rate : 0.7388         
    P-Value [Acc > NIR] : 0.0004695      
                                         
                  Kappa : 0.4413         
                                         
 Mcnemar's Test P-Value : 0.0033981      
                                         
            Sensitivity : 0.9031         
            Specificity : 0.5068         
         Pos Pred Value : 0.8382         
         Neg Pred Value : 0.6491         
             Prevalence : 0.7388         
         Detection Rate : 0.6673         
   Detection Prevalence : 0.7961         
      Balanced Accuracy : 0.7050         
                                         
       'Positive' Class : 0
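
The headline statistics in this output can be reproduced from the four cell counts alone. The sketch below types the confusion matrix in by hand (the names `cnf`, `accuracy`, `sensitivity`, `specificity`, and `no_info_rate` are hypothetical) and follows the convention shown in the output, where class 0 is the 'Positive' class.

```r
# Cell counts from the confusion matrix above
# (rows = predicted class, columns = actual class)
cnf <- matrix(c(373, 40, 72, 74), nrow = 2,
              dimnames = list(prd = c("0", "1"), act = c("0", "1")))

# Accuracy: correctly classified observations over all observations
accuracy <- sum(diag(cnf)) / sum(cnf)

# Sensitivity: share of actual class-0 rows predicted as 0
sensitivity <- cnf["0", "0"] / sum(cnf[, "0"])

# Specificity: share of actual class-1 rows (defaulters) predicted as 1
specificity <- cnf["1", "1"] / sum(cnf[, "1"])

# No-information rate: accuracy of always predicting the majority class
no_info_rate <- max(colSums(cnf)) / sum(cnf)

print(round(c(accuracy = accuracy, sensitivity = sensitivity,
              specificity = specificity, no_info_rate = no_info_rate), 4))
```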

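The accuracy of the model is about 80% (0.7996). Note that it is measured on the training data, because predict() was called without newdata, so it is likely to be an optimistic estimate of performance on unseen customers. Also, the 'Positive' class here is 0 (no default): the sensitivity of 0.9031 means the model identifies most non-defaulters, while the specificity of 0.5068 means it catches only about half of the actual defaulters (74 of 146).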
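
To estimate how a model performs on customers it has not seen, the same steps can be repeated on a held-out set. The sketch below is self-contained and therefore uses simulated data with only two features; all values and the names `sim`, `sim_train`, `sim_test`, and `fit` are hypothetical, and the split is done with base R rather than `initial_split()`.

```r
set.seed(421)

# Simulated stand-in for the loan data (hypothetical values, two features)
n <- 500
sim <- data.frame(
  employ   = rpois(n, 8),
  creddebt = rexp(n, rate = 0.6)
)
true_prob   <- plogis(-0.5 - 0.25 * sim$employ + 0.6 * sim$creddebt)
sim$default <- factor(rbinom(n, size = 1, prob = true_prob), levels = c(0, 1))

# 80/20 split using base R
train_idx <- sample(seq_len(n), size = 0.8 * n)
sim_train <- sim[train_idx, ]
sim_test  <- sim[-train_idx, ]

fit <- glm(default ~ ., family = "binomial", data = sim_train)

# newdata = sim_test makes predict() score rows the model has not seen
test_prob <- predict(fit, newdata = sim_test, type = "response")
test_pred <- factor(ifelse(test_prob > 0.5, 1, 0), levels = c(0, 1))

# Confusion matrix and accuracy on the held-out rows
test_cnf <- table(prd = test_pred, act = sim_test$default)
test_accuracy <- sum(diag(test_cnf)) / sum(test_cnf)
print(test_cnf)
print(test_accuracy)
```

With the objects from this article, the equivalent call would be `predict(log_lr, newdata = val_test, type = "response")`, compared against `val_test$default`.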

Predicting Loan Default using model

R

# Creating a data frame with a single observation
single_observation <- data.frame(
  age = 41,
  ed = 3,
  employ = 17,
  address = 12,
  income = 176,
  debtinc = 9.3,
  creddebt = 11.359,
  othdebt = 5.009
)

# Predicting default value for the single observation
prd_observation <- predict(log_lr, newdata = single_observation, type = 'response')
prd_default_observation <- ifelse(prd_observation > 0.5, 1, 0)

# View the predicted default value
print(prd_default_observation)

Output:

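The model predicted a default (1). Based on the values of the predictor variables provided, the model estimates the probability of default to be above 0.5, so the customer is classified as likely to default on their debt obligations. These values are those of the first row of the dataset (with creddebt and othdebt rounded), whose actual default value is 1, so the prediction matches the recorded outcome for this observation.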
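
To see where such a prediction comes from, the probability can be recomputed by hand from the coefficients printed in the model summary. This is a sketch using the rounded estimates copied from that output, so the result only approximates what `predict()` returns; the names `coefs`, `x`, `log_odds`, and `prob_default` are hypothetical.

```r
# Coefficients copied from the summary(log_lr) output earlier in the article
coefs <- c(
  intercept = -1.932219, age = 0.050381, ed = 0.121287, employ = -0.267234,
  address = -0.120499, income = -0.006960, debtinc = 0.056472,
  creddebt = 0.625906, othdebt = 0.071951
)

# The single observation from above, with a leading 1 for the intercept
x <- c(intercept = 1, age = 41, ed = 3, employ = 17, address = 12,
       income = 176, debtinc = 9.3, creddebt = 11.359, othdebt = 5.009)

# Linear predictor (log-odds), then the logistic function for a probability
log_odds <- sum(coefs * x[names(coefs)])
prob_default <- plogis(log_odds)
print(prob_default)

# Same 0.5 cut-off as used in the article
print(ifelse(prob_default > 0.5, 1, 0))
```

Reporting the probability alongside the 0/1 class keeps more information, since it shows how far the observation lies from the cut-off.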