Factors Determining Borrower Cost by Preston Hall

Univariate Plots Section

The original data-set had 81 variables and 113,937 observations. I narrowed my data-set down to 17 variables.

I will try to determine what factors influence borrower cost. I selected the BorrowerAPR as an estimate of total borrower cost, as it measures the effective rate after fees. Prosper Score and Credit Score were selected as I suspect they will heavily influence the loan costs. I selected variables related to employment, income and loan amount as they should be considered when determining a borrower's ability to repay the loan. Inquiries in the last six months and delinquencies were selected to see how any negative information would be factored into loan prices. Finally, I selected the loan origination quarter to see if interest rates had fluctuated over the time frame of the data-set.

The borrower APR ranges from 0.7% to 51.2% with a median of 21.0%. There is a spike in the number of loans priced near 35.8%.

##    Min. 1st Qu.  Median    Mean 3rd Qu.    Max.    NA's 
## 0.00653 0.15630 0.20980 0.21880 0.28380 0.51230      25

Most common APRs:

## 
## 0.35797 0.35643 0.37453 
##    3672    1644    1260
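
As a minimal sketch of how a summary and a "most common values" table like the ones above can be produced in base R, the following uses a small invented vector in place of the real BorrowerAPR column (the values and the variable name `apr` are illustrative assumptions, not the Prosper data):

```r
# Toy stand-in for loanSummary$BorrowerAPR; the values are invented for illustration.
apr <- c(0.35797, 0.35797, 0.35797, 0.35643, 0.35643, 0.20980, 0.15630, NA)

# summary() reports the quartiles and the mean, and counts NA's separately.
apr_summary <- summary(apr)
print(apr_summary)

# table() counts each distinct APR (NA is dropped by default);
# sorting in decreasing order puts the most common values first.
top_aprs <- sort(table(apr), decreasing = TRUE)[1:2]
print(top_aprs)
```

A frequency table of this kind is what reveals a pricing spike that a histogram with wide bins might blur.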

The prosper score histogram shows a relatively normal distribution.

The credit scores show a relatively normal distribution with a median of 699.

##    Min. 1st Qu.  Median    Mean 3rd Qu.    Max.    NA's 
##    19.0   679.0   699.0   704.6   739.0   899.0     591

The median available bankcard credit is $4,100, with a heavily right-skewed distribution (the mean is $11,210 and the maximum is $646,300). Transforming the available credit (plus one, to include borrowers with zero available credit) by log10 produces an approximately normal distribution, with an additional spike for borrowers with no available credit.

##    Min. 1st Qu.  Median    Mean 3rd Qu.    Max.    NA's 
##       0     880    4100   11210   13180  646300    7544
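
The "plus one" in the log10 transformation matters because log10(0) is negative infinity, which would drop zero-credit borrowers from a histogram. A small sketch with invented values (the vector `credit` is a hypothetical stand-in for AvailableBankcardCredit):

```r
# Toy stand-in for loanSummary$AvailableBankcardCredit (invented values).
credit <- c(0, 0, 880, 4100, 13180, 646300)

# Without the shift, borrowers with zero credit map to -Inf.
raw_log <- log10(credit)

# With the shift, zero credit maps to exactly 0 and every value stays finite.
shifted_log <- log10(credit + 1)

print(raw_log)
print(shifted_log)
```

The zero-credit borrowers then appear as a separate spike at 0 on the transformed axis, which matches the spike described above.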

The debt-to-income ratio is roughly bell-shaped around a median of 0.22, with a long right tail and several outliers reaching as high as 10.01.

##    Min. 1st Qu.  Median    Mean 3rd Qu.    Max.    NA's 
##   0.000   0.140   0.220   0.276   0.320  10.010    8554

The median loan size was $6,500.

##    Min. 1st Qu.  Median    Mean 3rd Qu.    Max. 
##    1000    4000    6500    8337   12000   35000

The number of loans originated increased sharply in the latter part of the data-set.

Univariate Analysis

What is the structure of your dataset?

I narrowed the original data-set from 81 variables to 17 by eliminating variables that were similar to other variables in the data-set and variables that were not related to my inquiry of what determines borrowing cost.

My narrowed data-set originally included three factor variables. I then added a fourth (a factor version of the Prosper Score) in order to create additional box plot visualizations. The remaining variables are numeric or integer.

What is/are the main feature(s) of interest in your dataset?

I am interested in measuring the main factors in determining borrower cost, for which I will use the BorrowerAPR as a measure. I suspect credit score and income to be two of the primary factors in determining borrower cost. I also anticipate that the Prosper Score will be a primary factor. However, I suspect that the Prosper Score and Credit Score will be so highly correlated that they may be effectively redundant.

What other features in the dataset do you think will help support your investigation into your feature(s) of interest?

Delinquencies, home ownership status and employment duration may also help determine borrower costs. It is also possible that rates changed over time, thus the loan closing date could be a factor.

Of the features you investigated, were there any unusual distributions? Did you perform any operations on the data to tidy, adjust, or change the form of the data? If so, why did you do this?

I converted the Prosper score into a new ProsperScoreFactor variable so I could use various visualizations on the data. I converted the Income range factor variable into a number, so I could perform correlation calculations. I ordered the Income Range factor variable so that it would be easier to understand in the visualizations. I also modified the loan origination quarter variable so that it could be sorted in chronological order.
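
The conversions described above could be sketched in base R roughly as follows. The toy data frame, the exact income-range labels, the "Q3 2013" quarter format and the `QuarterSortable` column name are assumptions for illustration; only ProsperScoreFactor and IncomeRangeNum are names used in this report.

```r
# Toy data; labels and quarter format are assumed, not taken from the real data-set.
loans <- data.frame(
  ProsperScore = c(2, 7, 10, 5),
  IncomeRange = c("$25,000-49,999", "$100,000+", "$1-24,999", "Not employed"),
  LoanOriginationQuarter = c("Q3 2013", "Q1 2014", "Q4 2009", "Q1 2012"),
  stringsAsFactors = FALSE
)

# Factor copy of the numeric score, usable as a grouping variable in box plots.
loans$ProsperScoreFactor <- factor(loans$ProsperScore)

# Order the income ranges from lowest to highest instead of alphabetically.
income_levels <- c("Not employed", "$1-24,999", "$25,000-49,999",
                   "$50,000-74,999", "$75,000-99,999", "$100,000+")
loans$IncomeRange <- factor(loans$IncomeRange, levels = income_levels, ordered = TRUE)

# Integer codes of the ordered factor, suitable for correlation calculations.
loans$IncomeRangeNum <- as.integer(loans$IncomeRange)

# Rewrite "Q3 2013" as "2013 Q3" so that a plain sort is chronological.
loans$QuarterSortable <- sub("^(Q[1-4]) ([0-9]{4})$", "\\2 \\1",
                             loans$LoanOriginationQuarter)

print(loans[order(loans$QuarterSortable), ])
```

Note that the integer codes treat the income ranges as equally spaced, which is a simplification worth keeping in mind when reading correlations computed from IncomeRangeNum.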

Bivariate Plots Section

The Borrower APR shows the strongest correlations with Prosper score, credit score, loan amount and available credit.

The correlation between loan amount and prosper score is likely due to a self-selection bias, where the most creditworthy borrowers are the only group that can borrow large dollar amounts and are also most likely to demand and receive lower rates.

APR Comparisons

As expected, there is an inverse correlation between interest rate and Prosper score, with a correlation coefficient of -0.67. The higher the Prosper Score (indicating lower risk), the lower the interest rate.

## 
##  Pearson's product-moment correlation
## 
## data:  loanSummary$ProsperScore and loanSummary$BorrowerAPR
## t = -260.93, df = 84851, p-value < 2.2e-16
## alternative hypothesis: true correlation is not equal to 0
## 95 percent confidence interval:
##  -0.6709351 -0.6634688
## sample estimates:
##        cor 
## -0.6672187
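
Output of this form comes from R's `cor.test()`. As a hedged illustration on invented numbers (the vectors `score` and `apr` below are toy stand-ins, not the Prosper data), the sketch also shows that incomplete pairs are dropped before testing, which is why the degrees of freedom differ between tests that involve Prosper score and tests that do not:

```r
# Toy stand-ins for ProsperScore and BorrowerAPR; the last score is missing.
score <- c(1, 2, 3, 4, 5, 6, 7, 8, 9, 10, NA)
apr <- c(0.36, 0.33, 0.31, 0.30, 0.26, 0.24, 0.21, 0.17, 0.13, 0.09, 0.25)

# Pearson's product-moment correlation is the default method.
result <- cor.test(score, apr)

print(result$estimate)   # sample correlation coefficient
print(result$conf.int)   # 95 percent confidence interval
print(result$parameter)  # degrees of freedom = complete pairs - 2
```

With ten complete pairs out of eleven, the test runs on 8 degrees of freedom.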

There appears to be a correlation between credit ratings and interest rates. This relationship is what I anticipated, with lower rates for borrowers with higher credit ratings and a correlation coefficient of -0.43.

## 
##  Pearson's product-moment correlation
## 
## data:  loanSummary$CreditScoreRangeUpper and loanSummary$BorrowerAPR
## t = -160.21, df = 113340, p-value < 2.2e-16
## alternative hypothesis: true correlation is not equal to 0
## 95 percent confidence interval:
##  -0.4344422 -0.4249487
## sample estimates:
##        cor 
## -0.4297073

Borrowers with higher incomes tend to have lower borrowing costs.

The following plot shows the rise and retraction of the mean and median interest rates through the data-set’s time frame.
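
The per-quarter means and medians behind such a plot can be computed with `tapply()`. This is only a sketch on invented rates; the `Quarter` column is assumed to already be in a sortable "YYYY Qn" form:

```r
# Toy loans with a sortable quarter label (invented values).
loans <- data.frame(
  Quarter = c("2009 Q4", "2009 Q4", "2011 Q2", "2011 Q2", "2011 Q2",
              "2013 Q4", "2013 Q4"),
  BorrowerAPR = c(0.20, 0.24, 0.30, 0.26, 0.34, 0.18, 0.22),
  stringsAsFactors = FALSE
)

# One mean and one median APR per quarter, ignoring missing rates.
mean_apr <- tapply(loans$BorrowerAPR, loans$Quarter, mean, na.rm = TRUE)
median_apr <- tapply(loans$BorrowerAPR, loans$Quarter, median, na.rm = TRUE)

by_quarter <- data.frame(
  Quarter = names(mean_apr),
  MeanAPR = as.numeric(mean_apr),
  MedianAPR = as.numeric(median_apr)
)
print(by_quarter)
```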

Prosper Score Analysis

Since Prosper score seems to have a strong influence on borrowing cost, what influences the Prosper score?

How strong is the relationship between credit rating and prosper score? They are moderately correlated, with a correlation coefficient of 0.37. However, there are several outliers and the Prosper Score does not have as strong a linear relationship with the Credit Score as I anticipated. The median Credit Score is unchanged for Prosper Scores 2 through 5 and 6 through 8.

## 
##  Pearson's product-moment correlation
## 
## data:  loanSummary$ProsperScore and loanSummary$CreditScoreRangeUpper
## t = 115.93, df = 84851, p-value < 2.2e-16
## alternative hypothesis: true correlation is not equal to 0
## 95 percent confidence interval:
##  0.3639411 0.3755582
## sample estimates:
##       cor 
## 0.3697641

Stated monthly income has a weak relationship with the Prosper Score, with a correlation coefficient of only 0.08.

## 
##  Pearson's product-moment correlation
## 
## data:  StatedMonthlyIncome and ProsperScore
## t = 24.069, df = 84851, p-value < 2.2e-16
## alternative hypothesis: true correlation is not equal to 0
## 95 percent confidence interval:
##  0.07566126 0.08902693
## sample estimates:
##       cor 
## 0.0823478

Prosper score and available bankcard credit have a moderate correlation coefficient of 0.31.

## 
##  Pearson's product-moment correlation
## 
## data:  loanSummary$ProsperScore and loanSummary$AvailableBankcardCredit
## t = 96.29, df = 84851, p-value < 2.2e-16
## alternative hypothesis: true correlation is not equal to 0
## 95 percent confidence interval:
##  0.3077796 0.3199109
## sample estimates:
##       cor 
## 0.3138581

Bivariate Analysis

Talk about some of the relationships you observed in this part of the investigation. How did the feature(s) of interest vary with other features in the dataset?

APR appears to be correlated with the Prosper Score, credit score, loan amount and available bank credit, but does not appear to be affected by employment duration.

Did you observe any interesting relationships between the other features (not the main feature(s) of interest)?

The Prosper score is correlated with the credit score and available credit but (surprisingly) does not appear to be affected by stated monthly income.

What was the strongest relationship you found?

The strongest relationship was, not surprisingly, between the Borrower APR and the Prosper Score. This makes sense as the Prosper score is prominent on the website, and is thus likely the primary consideration in the investors'/lenders' risk-weighting decisions.

Multivariate Plots Section

APR Analysis

Borrowers in higher income ranges tend to have lower APRs and higher credit scores. Additionally, borrowers with higher income ranges benefit from a steeper regression line toward lower APRs, meaning that as their credit scores increase they benefit from a greater reduction in rates than those with lower incomes would.

Each level of Prosper score tends to have a similar slope toward lower APRs as credit scores increase. Borrowers with higher prosper scores tend to have lower APRs and higher credit scores.

## 
##  Pearson's product-moment correlation
## 
## data:  CreditScoreRangeUpper and BorrowerAPR
## t = -160.21, df = 113340, p-value < 2.2e-16
## alternative hypothesis: true correlation is not equal to 0
## 95 percent confidence interval:
##  -0.4344422 -0.4249487
## sample estimates:
##        cor 
## -0.4297073

As with the prior plots, higher income ranges tend to have higher prosper scores and lower APRs.

## 
##  Pearson's product-moment correlation
## 
## data:  ProsperScore and BorrowerAPR
## t = -260.93, df = 84851, p-value < 2.2e-16
## alternative hypothesis: true correlation is not equal to 0
## 95 percent confidence interval:
##  -0.6709351 -0.6634688
## sample estimates:
##        cor 
## -0.6672187

The dashed line below represents the median APR; the colored lines represent the median APRs for each Prosper Score.

The median APRs tend to follow the overall median APR for each income level, with the exception of unemployed borrowers.

There is a weak upward trend between stated monthly income and credit score. The relationship is not as strong as I anticipated, with a correlation coefficient of 0.11. The correlation coefficient between the Prosper score and stated monthly income is an even weaker 0.08.

## 
##  Pearson's product-moment correlation
## 
## data:  loanSummary$CreditScoreRangeUpper and loanSummary$StatedMonthlyIncome
## t = 36.54, df = 113340, p-value < 2.2e-16
## alternative hypothesis: true correlation is not equal to 0
## 95 percent confidence interval:
##  0.1021433 0.1136511
## sample estimates:
##       cor 
## 0.1079008
## 
##  Pearson's product-moment correlation
## 
## data:  loanSummary$StatedMonthlyIncome and loanSummary$ProsperScore
## t = 24.069, df = 84851, p-value < 2.2e-16
## alternative hypothesis: true correlation is not equal to 0
## 95 percent confidence interval:
##  0.07566126 0.08902693
## sample estimates:
##       cor 
## 0.0823478

Borrowers with more available bankcard credit tend to have lower APRs and are more likely to be homeowners. The correlation coefficient between available bankcard credit and APR is -0.35.

## 
##  Pearson's product-moment correlation
## 
## data:  loanSummary$AvailableBankcardCredit and loanSummary$BorrowerAPR
## t = -121.44, df = 106390, p-value < 2.2e-16
## alternative hypothesis: true correlation is not equal to 0
## 95 percent confidence interval:
##  -0.3541924 -0.3436378
## sample estimates:
##        cor 
## -0.3489261

Prosper Score Analysis

While borrowers who are homeowners seem to have higher credit scores in general, home ownership does not appear to have a significant effect on the Prosper score.

Similar to the APR, borrowers with higher incomes and credit scores tend to have higher Prosper scores.

Borrowers with higher available bank credit tend to have higher income ranges and higher Prosper scores.

## 
##  Pearson's product-moment correlation
## 
## data:  ProsperScore and AvailableBankcardCredit
## t = 96.29, df = 84851, p-value < 2.2e-16
## alternative hypothesis: true correlation is not equal to 0
## 95 percent confidence interval:
##  0.3077796 0.3199109
## sample estimates:
##       cor 
## 0.3138581

Multivariate Analysis

Talk about some of the relationships you observed in this part of the investigation. Were there features that strengthened each other in terms of looking at your feature(s) of interest?

There is a relationship between the APR, Prosper Score, credit score and income range. Higher income ranges tend to have higher credit scores and lower APRs, and lower income ranges tend to have lower credit scores and higher APRs.

Were there any interesting or surprising interactions between features?

There did not appear to be a relationship between APR, Credit Score and stated monthly income (the numerical variable). This was surprising because there did appear to be a relationship between APR, Credit Score and income range (the factor variable).

This is likely due to discrepancies between the Stated Monthly Income and Income Range. The chart below shows dashed lines that indicate the breakpoints between each income range (the red bottom line indicates the $24,999 break point, etc.). Each income group has outliers that exceed the income level, some by multiple levels.
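
One way to quantify such discrepancies, sketched here on invented rows, is to annualize the stated monthly income, bin it with `cut()` at the same breakpoints, and compare the implied range with the recorded one. The multiplication by 12, the range labels, and the `ImpliedRange` and `Mismatch` columns are assumptions for illustration:

```r
# Toy loans; the third row states far more income than its recorded range allows.
loans <- data.frame(
  StatedMonthlyIncome = c(1500, 3000, 9000, 5000),
  IncomeRange = c("$1-24,999", "$25,000-49,999", "$25,000-49,999", "$50,000-74,999"),
  stringsAsFactors = FALSE
)

# Upper edges of each income range (annual dollars); intervals are (lower, upper].
breaks <- c(0, 24999, 49999, 74999, 99999, Inf)
labels <- c("$1-24,999", "$25,000-49,999", "$50,000-74,999",
            "$75,000-99,999", "$100,000+")

# Range implied by the annualized stated monthly income.
loans$ImpliedRange <- cut(loans$StatedMonthlyIncome * 12,
                          breaks = breaks, labels = labels)

# Flag loans whose recorded range disagrees with the implied one.
loans$Mismatch <- as.character(loans$ImpliedRange) != loans$IncomeRange
print(loans)
```

A cross-tabulation such as `table(loans$IncomeRange, loans$ImpliedRange)` would then show how many levels apart the disagreements are.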


Final Plots and Summary

Plot One

##                       BorrowerAPR ProsperScore CreditScoreRangeUpper
## BorrowerAPR             1.0000000   -0.6672187            -0.5258881
## ProsperScore           -0.6672187    1.0000000             0.3697641
## CreditScoreRangeUpper  -0.5258881    0.3697641             1.0000000

Description One

The first plot shows that borrowers with high Prosper scores are concentrated in the lower right quadrant, with lower APRs and higher credit scores. Conversely, borrowers with lower Prosper scores are concentrated in the upper left, with higher APRs and lower credit scores. This trend is reflected in the correlations: borrower APR has a strong negative correlation with Prosper score (-0.67), and credit score has a negative correlation with APR (-0.53).
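
The matrix reports -0.53 between APR and credit score, while the pairwise test in the bivariate section reported -0.43. A plausible explanation (an assumption, since the call that produced the matrix is not shown) is that the matrix was computed only on rows where all three variables are present, which excludes loans without a Prosper score. The toy example below, with invented values, shows how the `use` argument of `cor()` can change a coefficient in this way:

```r
# Toy data: the two rows without a Prosper score run against the overall trend.
toy <- data.frame(
  BorrowerAPR = c(0.35, 0.30, 0.25, 0.20, 0.15, 0.10, 0.28, 0.12),
  ProsperScore = c(2, 4, 5, 7, 9, 10, NA, NA),
  CreditScoreRangeUpper = c(640, 660, 700, 720, 760, 800, 780, 650)
)

# Each pair of columns uses every row where that pair is present.
pairwise <- cor(toy, use = "pairwise.complete.obs")

# Only rows with no missing value in any column are used.
complete <- cor(toy, use = "complete.obs")

print(pairwise["BorrowerAPR", "CreditScoreRangeUpper"])
print(complete["BorrowerAPR", "CreditScoreRangeUpper"])
```

In this toy case the complete-case coefficient is the more strongly negative of the two, the same direction as the gap between -0.53 and -0.43.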

The plot also shows roughly parallel linear regression lines for each of the Prosper score levels, meaning that for a given increase in credit score the reduction in APR is approximately equal across Prosper score levels.
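
Whether regression lines are parallel can also be checked numerically rather than by eye, by comparing a model with one shared slope against a model that lets the slope vary by group. The sketch below uses synthetic data with two hypothetical score groups ("low" and "high") constructed to share a slope; it illustrates the technique and is not a result from the Prosper data:

```r
# Synthetic data: two groups with the same slope but different intercepts.
credit <- rep(c(640, 680, 720, 760, 800), times = 2)
group <- factor(rep(c("low", "high"), each = 5))
noise <- c(0.002, -0.001, 0.001, -0.002, 0.000,
           -0.001, 0.002, -0.002, 0.001, 0.000)
apr <- 0.80 - 0.0008 * credit - ifelse(group == "high", 0.08, 0) + noise

# Additive model: one shared slope, separate intercepts (parallel lines).
parallel <- lm(apr ~ credit + group)

# Interaction model: the slope is allowed to differ between groups.
separate <- lm(apr ~ credit * group)

# The interaction coefficient is the difference in slopes between the groups.
slope_gap <- coef(separate)[["credit:grouplow"]]
print(slope_gap)

# F-test of whether the extra slope term improves the fit.
print(anova(parallel, separate))
```

A slope gap near zero relative to the shared slope, together with a non-significant F-test, is consistent with parallel lines.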

Plot Two

##                         IncomeRangeNum AvailableBankcardCredit
## IncomeRangeNum               1.0000000               0.1928268
## AvailableBankcardCredit      0.1928268               1.0000000
## ProsperScore                 0.1949341               0.3126430
##                         ProsperScore
## IncomeRangeNum             0.1949341
## AvailableBankcardCredit    0.3126430
## ProsperScore               1.0000000

Description Two

The second plot uses the square root transformation of available bankcard credit (plus one to include borrowers with zero bankcard credit). The plot shows that borrowers with more available bankcard credit tend to have higher Prosper scores and also tend to have higher income ranges. Conversely, those with little available bankcard credit tend to have lower Prosper scores and be in the lower income ranges.

Available bankcard credit has a moderate correlation coefficient of 0.31 with Prosper score and a weaker 0.19 correlation coefficient with the income ranges.

Plot Three

##                       ProsperScore CreditScoreRangeUpper IncomeRangeNum
## ProsperScore             1.0000000             0.3718331      0.1949341
## CreditScoreRangeUpper    0.3718331             1.0000000      0.1478913
## IncomeRangeNum           0.1949341             0.1478913      1.0000000

Description Three

The final plot shows that the Prosper score is influenced by credit score and, to a lesser degree, income range. Borrowers in the higher income ranges are concentrated near the top of the plot (highest Prosper scores). Borrowers in the lower income ranges (lighter colors) are more likely to be in the lower left of the plot, with lower Prosper and credit scores. Additionally, the right side of the chart shows that as credit scores approach and exceed the 800 level, they move higher up the Y axis, toward higher Prosper scores.


Reflection

The most influential factor in determining borrowing costs appears to be the Prosper score, followed by credit score and available bankcard credit. A future line of inquiry would be analyzing what factors predict loan default. It would be interesting to determine the characteristics of low- versus high-risk borrowers and to see how those same factors are priced into the loans.