Predicting Loan Default in R

Predicting loan default is a common task in the financial industry and can be approached with various machine learning techniques in R. Many people depend on loans for education, buying a house, or getting through financial difficulties. Loans are a major source of income for banks, and also their major source of financial risk. Deciding whether an applicant is eligible for a loan means checking many factors, and machine learning and deep learning now help banks screen candidates in less time. Here’s a step-by-step guide to building a predictive model for loan default using R:

Table of Contents

  • Objectives Of Predicting Loan Default
  • Dataset used for Predicting Loan Default in R
  • Importing required libraries
  • Loading DataSet
  • Data Preprocessing and Exploration
  • Checking the summary of the data
  • Visualizing the data and getting raw information
  • Visualizing default by education level
  • Model building for Predicting Loan Default
  • Mapping accuracy for Predicting Loan Default model
  • Predicting Loan Default using model

Objectives Of Predicting Loan Default

  1. The goal of this “loan defaulter prediction” is to estimate the probability that a customer defaults, using machine learning techniques.
  2. Understand the customer’s behavior before granting the loan.
  3. Help make more informed credit scoring decisions and reduce the rate of defaulters.

So, here we will use logistic regression to predict loan default in R from key features such as age, education, income, and credit debt.

Dataset used for Predicting Loan Default in R

Here we use the bank-loan.csv data to build this model. The dataset contains 9 columns (eight predictors and the target, default), which are as follows:

  1. age: Age of the person (integer column)
  2. ed: The person’s education level (categorical: “secondary”, “primary”, “tertiary”, “unknown”), already encoded in the dataset as the integers 1, 2, 3, 4.
  3. employ: Number of years the person has been working (integer column)
  4. address: Number of years the person has lived at the current address (integer column)
  5. income: The person’s income, in thousands.
  6. debtinc: The person’s debt-to-income ratio, expressed as a percentage.
  7. creddebt: Amount of the person’s credit debt, in thousands.
  8. othdebt: Amount of the person’s other debt, in thousands.
  9. default: Has the person defaulted? (binary: “yes”, “no”), already encoded as 1 and 0 in the dataset.
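
As a quick consistency check on these column definitions, the sketch below re-enters the six rows shown in the head() output later in this article and compares debtinc with the ratio implied by the two debt columns. The names first_rows and implied_debtinc are illustrative and not part of the original code. If the values agree, that suggests creddebt and othdebt are debt amounts in the same units as income, and that debtinc is a percentage.

```r
# First six rows of bank-loan.csv, copied from the head() output
first_rows <- data.frame(
  income   = c(176, 31, 55, 120, 28, 25),
  debtinc  = c(9.3, 17.3, 5.5, 2.9, 17.3, 10.2),
  creddebt = c(11.359392, 1.362202, 0.856075, 2.658720, 1.787436, 0.392700),
  othdebt  = c(5.008608, 4.000798, 2.168925, 0.821280, 3.056564, 2.157300)
)

# Debt-to-income ratio implied by the two debt columns, as a percentage
first_rows$implied_debtinc <- with(first_rows, 100 * (creddebt + othdebt) / income)

# Show the recorded and implied ratios side by side
print(first_rows[, c("debtinc", "implied_debtinc")])

# TRUE when the two columns agree up to rounding
ratios_match <- isTRUE(all.equal(first_rows$debtinc,
                                 first_rows$implied_debtinc,
                                 tolerance = 1e-6))
print(ratios_match)
```

Only six rows are checked here, so this is a sanity check rather than proof that the relationship holds for the whole file.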

Importing required libraries

Here we install and load the required libraries.

R
# Install each package (needed only once), then load it
# Package for creating plots and visualizations
install.packages("ggplot2")
library(ggplot2) 
# Package for regularized regression methods
install.packages("glmnet")  
library(glmnet)   
# Collection of packages for modeling and machine learning
install.packages("tidymodels") 
library(tidymodels) 
# Package for training and evaluating predictive models
install.packages("caret")  
library(caret)     
# Package for various statistical tools and functions
install.packages("rcompanion") 
library(rcompanion) 
# Package for ROC analysis
install.packages("pROC")       
library(pROC)      
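
Calling install.packages() on every run reinstalls packages that are already present. One possible alternative, sketched below, is to install only what is missing. The helper name missing_packages is hypothetical, and the install line is left commented out so the snippet has no side effects.

```r
# Packages used in this article
required <- c("ggplot2", "glmnet", "tidymodels", "caret", "rcompanion", "pROC")

# Hypothetical helper: return the packages that are not installed yet
missing_packages <- function(pkgs) {
  installed <- vapply(pkgs, requireNamespace, logical(1), quietly = TRUE)
  pkgs[!installed]
}

to_install <- missing_packages(required)
print(to_install)

# Uncomment to install whatever is missing
# if (length(to_install) > 0) install.packages(to_install)
```

After that, each package still needs to be attached with library() as shown above.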

Loading DataSet

R
df.bank = read.csv("bank-loan.csv") # df.bank is just a naming convention
head(df.bank)

Output:

  age ed employ address income debtinc  creddebt  othdebt default
1  41  3     17      12    176     9.3 11.359392 5.008608       1
2  27  1     10       6     31    17.3  1.362202 4.000798       0
3  40  1     15      14     55     5.5  0.856075 2.168925       0
4  41  1     15      14    120     2.9  2.658720 0.821280       0
5  24  2      2       0     28    17.3  1.787436 3.056564       1
6  41  2      5       5     25    10.2  0.392700 2.157300       0

Data Preprocessing and Exploration

R
# Convert the default column to a factor
df.bank$default = as.factor(df.bank$default)
str(df.bank)

# Count the NA values in each column
colSums(is.na(df.bank))

# Drop the rows that contain NA values
df = na.omit(df.bank)
colSums(is.na(df))

Output:

'data.frame':    850 obs. of  9 variables:
 $ age     : int  41 27 40 41 24 41 39 43 24 36 ...
 $ ed      : int  3 1 1 1 2 2 1 1 1 1 ...
 $ employ  : int  17 10 15 15 2 5 20 12 3 0 ...
 $ address : int  12 6 14 14 0 5 9 11 4 13 ...
 $ income  : int  176 31 55 120 28 25 67 38 19 25 ...
 $ debtinc : num  9.3 17.3 5.5 2.9 17.3 10.2 30.6 3.6 24.4 19.7 ...
 $ creddebt: num  11.359 1.362 0.856 2.659 1.787 ...
 $ othdebt : num  5.009 4.001 2.169 0.821 3.057 ...
 $ default : Factor w/ 2 levels "0","1": 2 1 1 1 2 1 1 1 2 1 ...

     age       ed   employ  address   income  debtinc creddebt  othdebt  default
       0        0        0        0        0        0        0        0      150

     age       ed   employ  address   income  debtinc creddebt  othdebt  default
       0        0        0        0        0        0        0        0        0

There are 150 NA values in the default column. Since we have no instructions on how to handle them, we simply drop those rows and work with the remaining 700 observations.
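
Dropping the rows is one option. Another, sketched here on a small made-up data frame, is to set the rows with a missing outcome aside so they can be scored once a model exists. The names toy, labeled, and unlabeled are illustrative, and the values are not taken from bank-loan.csv.

```r
# Toy data frame standing in for df.bank (values are made up)
toy <- data.frame(
  age     = c(41, 27, 40, 24, 36),
  income  = c(176, 31, 55, 28, 25),
  default = factor(c(1, 0, NA, 1, NA))
)

# Count the missing values in each column
print(colSums(is.na(toy)))

# Rows with a known outcome can be used for training
labeled <- toy[!is.na(toy$default), ]

# Rows without an outcome are kept aside instead of being discarded
unlabeled <- toy[is.na(toy$default), ]

print(nrow(labeled))
print(nrow(unlabeled))
```

When default is the only column with missing values, as in this dataset, labeled contains the same rows as na.omit() would return.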

Checking the summary of the data

R
# Checking the summary of the data
summary(df.bank)

Output:

      age              ed            employ          address           income
 Min.   :20.00   Min.   :1.000   Min.   : 0.000   Min.   : 0.000   Min.   : 13.00
 1st Qu.:29.00   1st Qu.:1.000   1st Qu.: 3.000   1st Qu.: 3.000   1st Qu.: 24.00
 Median :34.00   Median :1.000   Median : 7.000   Median : 7.000   Median : 35.00
 Mean   :35.03   Mean   :1.705   Mean   : 8.566   Mean   : 8.372   Mean   : 46.68
 3rd Qu.:41.00   3rd Qu.:2.000   3rd Qu.:13.000   3rd Qu.:12.000   3rd Qu.: 55.75
 Max.   :56.00   Max.   :4.000   Max.   :33.000   Max.   :34.000   Max.   :446.00
    debtinc         creddebt          othdebt         default
 Min.   : 0.10   Min.   : 0.0117   Min.   : 0.04558   0   :517
 1st Qu.: 5.10   1st Qu.: 0.3822   1st Qu.: 1.04594   1   :183
 Median : 8.70   Median : 0.8851   Median : 2.00324   NA's:150
 Mean   :10.17   Mean   : 1.5768   Mean   : 3.07879
 3rd Qu.:13.80   3rd Qu.: 1.8984   3rd Qu.: 3.90300
 Max.   :41.30   Max.   :20.5613   Max.   :35.19750

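The summary, which is computed on the original df.bank, shows no missing values in the predictors. The only NA values are the 150 in default that were dropped above. Among the labeled rows, 517 customers did not default and 183 did, so we can continue with a logistic regression model.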
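
The class counts in the summary are worth turning into proportions, because they set the baseline any model has to beat. The sketch below does that arithmetic using the counts printed above. The variable names are illustrative.

```r
# Counts of the labeled outcomes, copied from the summary() output
default_counts <- c(no_default = 517, default = 183)

# Share of each class among the 700 labeled rows
class_share <- prop.table(default_counts)
print(round(class_share, 3))

# Accuracy obtained by always predicting the majority class
baseline_accuracy <- max(class_share)
print(round(baseline_accuracy, 3))
```

Roughly a quarter of the labeled customers defaulted, so a model that always predicts "no default" is already right about three times out of four on the labeled data.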

Visualizing the data and getting raw information

R
# Draw the two bar plots one above the other
par(mfrow = c(2, 1))
barplot(table(df$ed), col = c("lightgreen", "yellow", "orange", "blue"),
        ylab = "Number of observations",
        xlab = "Education")
barplot(table(df$default), col = c("pink", "lightblue"),
        ylab = "Number of observations", xlab = "Default")

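Output: [Figure: two bar plots, the number of observations per education level and the number of observations per default status]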

Visualizing default by education level

R
ggplot(df, aes(ed, fill = default)) +
  geom_bar()

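Output: [Figure: bar chart of observation counts by education level, with each bar split by default status]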
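
A stacked count chart makes it hard to compare default rates across education levels, because the bars have different heights. A table of row-wise proportions shows the rates directly. The sketch below uses a small made-up data frame, so the numbers are illustrative and not taken from bank-loan.csv.

```r
# Toy data standing in for df (values are made up for illustration)
toy <- data.frame(
  ed      = c(1, 1, 1, 1, 2, 2, 2, 3, 3, 4),
  default = factor(c(0, 0, 0, 1, 0, 1, 0, 1, 1, 0))
)

# Cross-tabulate education level against default status
counts <- table(ed = toy$ed, default = toy$default)

# Row-wise proportions: share of each outcome within an education level
rates <- prop.table(counts, margin = 1)
print(round(rates, 2))
```

With the real data, the same two calls on df$ed and df$default would give the default rate per education level. In ggplot2, geom_bar(position = "fill") draws these proportions as bars of equal height.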

Model building for Predicting Loan Default

R
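set.seed(421)
# Stratified 80/20 split so both parts keep a similar share of defaulters
split = initial_split(df, prop = 0.8, strata = default)
train = split %>% 
  training()
val_test = split %>% 
  testing()

# Model-1: logistic regression on all predictors
log_lr = glm(default ~ ., family = "binomial", data = train)
summary(log_lr)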

Output:

Call:
glm(formula = default ~ ., family = "binomial", data = train)

Coefficients:
             Estimate Std. Error z value Pr(>|z|)
(Intercept) -1.932219   0.692795  -2.789  0.00529 **
age          0.050381   0.019602   2.570  0.01017 *
ed           0.121287   0.141056   0.860  0.38987
employ      -0.267234   0.037417  -7.142 9.20e-13 ***
address     -0.120499   0.025812  -4.668 3.04e-06 ***
income      -0.006960   0.009751  -0.714  0.47537
debtinc      0.056472   0.034475   1.638  0.10141
creddebt     0.625906   0.125897   4.972 6.64e-07 ***
othdebt      0.071951   0.087457   0.823  0.41067
---
Signif. codes: 0 ‘***’ 0.001 ‘**’ 0.01 ‘*’ 0.05 ‘.’ 0.1 ‘ ’ 1

(Dispersion parameter for binomial family taken to be 1)

Null deviance: 642.05 on 558 degrees of freedom
Residual deviance: 449.36 on 550 degrees of freedom
AIC: 467.36

Number of Fisher Scoring iterations: 6

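This output reports the coefficients, their significance, and the overall fit of the model. The predictors employ, address, and creddebt are significant at the 0.001 level and age at the 0.05 level, while ed, income, debtinc, and othdebt are not significant at the 0.05 level. The residual deviance (449.36) is well below the null deviance (642.05), so the predictors explain a substantial part of the variation in default, and the AIC (467.36) can be used to compare this model with alternatives. Each coefficient is on the log-odds scale: a positive value (for example creddebt) raises the predicted probability of default, and a negative value (for example employ) lowers it.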
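
Log-odds coefficients are easier to read as odds ratios. The sketch below exponentiates the estimates copied from the summary output above. The vector name coefs is illustrative.

```r
# Coefficient estimates copied from the summary(log_lr) output
coefs <- c(
  age = 0.050381, ed = 0.121287, employ = -0.267234, address = -0.120499,
  income = -0.006960, debtinc = 0.056472, creddebt = 0.625906, othdebt = 0.071951
)

# Exponentiating a log-odds coefficient gives an odds ratio:
# the factor by which the odds of default change per one-unit increase
odds_ratios <- exp(coefs)
print(round(odds_ratios, 2))
```

Read this way, each additional year of employment multiplies the odds of default by about 0.77, and each additional unit of creddebt multiplies them by about 1.87, holding the other predictors fixed. With the fitted model in memory, exp(coef(log_lr)) gives the same figures without copying them by hand.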

Mapping accuracy for Predicting Loan Default model

R
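# Predicted probabilities on the training data (no newdata is supplied)
prd_Val = predict(log_lr, type = 'response')
# Classify as default (1) when the probability exceeds 0.5
prd_default = ifelse(prd_Val > 0.5, 1, 0)
# Rows are predictions, columns are actual values
cnf_m = table(prd = prd_default, act = train$default)
confusionMatrix(cnf_m)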

Output:

Confusion Matrix and Statistics

   act
prd   0   1
  0 373  72
  1  40  74

Accuracy : 0.7996
95% CI : (0.764, 0.8321)
No Information Rate : 0.7388
P-Value [Acc > NIR] : 0.0004695

Kappa : 0.4413

Mcnemar's Test P-Value : 0.0033981

Sensitivity : 0.9031
Specificity : 0.5068
Pos Pred Value : 0.8382
Neg Pred Value : 0.6491
Prevalence : 0.7388
Detection Rate : 0.6673
Detection Prevalence : 0.7961
Balanced Accuracy : 0.7050

'Positive' Class : 0
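
The main statistics in this output can be reproduced from the four cell counts, which helps when reading them with class 0 (no default) as the positive class. The sketch below rebuilds the table by hand. The variable names are illustrative.

```r
# Cell counts copied from the confusion matrix above
# (rows = predicted class, columns = actual class)
cm <- matrix(c(373, 40, 72, 74), nrow = 2,
             dimnames = list(prd = c("0", "1"), act = c("0", "1")))

# Share of all predictions that are correct
accuracy <- sum(diag(cm)) / sum(cm)

# Share of the largest actual class (the "No Information Rate")
no_info_rate <- max(colSums(cm)) / sum(cm)

# Share of actual non-defaulters predicted as 0 (reported as Sensitivity)
recall_no_default <- cm["0", "0"] / sum(cm[, "0"])

# Share of actual defaulters predicted as 1 (reported as Specificity)
recall_default <- cm["1", "1"] / sum(cm[, "1"])

# Share of predicted defaulters who actually defaulted (Neg Pred Value)
precision_default <- cm["1", "1"] / sum(cm["1", ])

print(round(c(accuracy = accuracy, no_info_rate = no_info_rate,
              recall_no_default = recall_no_default,
              recall_default = recall_default,
              precision_default = precision_default), 4))
```

If default should be treated as the positive class instead, caret's confusionMatrix() accepts a positive argument, for example confusionMatrix(cnf_m, positive = "1").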

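The model reaches an accuracy of about 80% on the training data, compared with a no-information rate of about 74%. Because the positive class here is 0 (no default), the specificity of 0.5068 means that only about half of the actual defaulters are identified at the 0.5 threshold.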
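
The confusion matrix above is computed on the same rows the model was fitted on, so it is likely to be somewhat optimistic. A fairer estimate comes from rows the model has not seen. The sketch below illustrates the idea on simulated data using base R only. The data, the coefficients used to simulate it, and all variable names are assumptions made for this example, and it uses a simple random split rather than the stratified split used in this article.

```r
set.seed(421)

# Simulated stand-in for the loan data (not the real bank-loan.csv)
n <- 500
sim <- data.frame(
  employ   = rpois(n, 8),
  creddebt = rexp(n, rate = 0.6)
)

# Assumed relationship: longer employment lowers risk, more debt raises it
p_default <- plogis(-0.5 - 0.25 * sim$employ + 0.6 * sim$creddebt)
sim$default <- factor(rbinom(n, size = 1, prob = p_default), levels = c(0, 1))

# Simple random 80/20 split
train_idx <- sample(seq_len(n), size = 0.8 * n)
sim_train <- sim[train_idx, ]
sim_test  <- sim[-train_idx, ]

# Fit on the training rows only
fit <- glm(default ~ employ + creddebt, family = "binomial", data = sim_train)

# Score the held-out rows, then apply the 0.5 cut-off
test_prob <- predict(fit, newdata = sim_test, type = "response")
test_pred <- factor(ifelse(test_prob > 0.5, 1, 0), levels = c(0, 1))

# Confusion matrix and accuracy on the held-out rows
test_cm <- table(prd = test_pred, act = sim_test$default)
test_accuracy <- sum(diag(test_cm)) / sum(test_cm)
print(test_cm)
print(test_accuracy)
```

In this article's code, the equivalent step would be predict(log_lr, newdata = val_test, type = "response"), with the predictions compared against val_test$default.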

Predicting Loan Default using model

R
# Create a data frame with a single observation
single_observation <- data.frame(
  age = 41,
  ed = 3,
  employ = 17,
  address = 12,
  income = 176,
  debtinc = 9.3,
  creddebt = 11.359,
  othdebt = 5.009
)

# Predicting default value for the single observation
prd_observation <- predict(log_lr, newdata = single_observation, type = 'response')
prd_default_observation <- ifelse(prd_observation > 0.5, 1, 0)

# View the predicted default value
print(prd_default_observation)

Output:

1

The model predicted a default (1). Given the values of the predictor variables provided, the predicted probability of default exceeds the 0.5 threshold, so the customer is classified as likely to default on their debt obligations.
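
To see where this prediction comes from, the probability can be recomputed by hand from the coefficients printed in the model summary. The sketch below multiplies each coefficient by the matching predictor value, sums the results to get the log-odds, and converts that to a probability with the logistic function. The variable names are illustrative, and the coefficients are the rounded values from the output, so the result may differ slightly from predict().

```r
# Coefficients copied from the summary(log_lr) output
coefs <- c(
  "(Intercept)" = -1.932219, age = 0.050381, ed = 0.121287,
  employ = -0.267234, address = -0.120499, income = -0.006960,
  debtinc = 0.056472, creddebt = 0.625906, othdebt = 0.071951
)

# Predictor values of the single observation (1 for the intercept term)
x <- c(
  "(Intercept)" = 1, age = 41, ed = 3, employ = 17, address = 12,
  income = 176, debtinc = 9.3, creddebt = 11.359, othdebt = 5.009
)

# Linear predictor: the log-odds of default
log_odds <- sum(coefs * x[names(coefs)])

# The logistic function maps log-odds to a probability
prob_default <- plogis(log_odds)

print(round(log_odds, 3))
print(round(prob_default, 3))
```

This gives a probability of roughly 0.78, which is above the 0.5 cut-off and therefore matches the predicted class of 1. The large creddebt value contributes most of the positive log-odds.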