Friday, July 26, 2013

Building a Loan Model

Supervised learning is simple in concept. Given some characteristics of a thing (input variables), we want to predict some other characteristic (output variable). For example, let's say we have a database of people's purchase histories and their genders. Trying to guess the gender of somebody outside the database, based on their purchase history, is a supervised learning problem.
Categorical prediction/guessing is known as classification.
Numerical prediction/guessing is known as estimation.

In the case of Lending Club, we want to estimate the profitability of a loan. Roughly speaking, we want to reduce all of the characteristics given to us about the borrower into two numbers: return and risk (here defined to be the variance of return).

More mathematically, if x denotes the characteristics of a loan that we know at the time we choose whether to invest, and y is the characteristic(s) we want to predict, then we want to build a model/function f so that
f(x) is a good estimate of y.

Let's take a look at the returns distribution of all completed notes as of earlier this year. Pretty sure this isn't a textbook distribution.

I think it's safe to say that running a single regression is not going to be the best way to approach this problem. In other words, we can't construct f in one step easily.

Rather, the idea is to create a function g that outputs some other variable of interest z, and then plug x and z into another function h; i.e., g(x) = z and h(x, z) = y.
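The two-stage structure can be sketched as follows. Both functions here are hypothetical placeholders (the grades, rates, and payoffs are made up), meant only to show the shape of the composition, not the actual model:

```python
def g(x):
    # Stage one: estimate an intermediate quantity z from the raw
    # features (here, a toy default probability keyed off loan grade).
    return 0.1 if x["grade"] == "A" else 0.3

def h(x, z):
    # Stage two: combine the features and the intermediate estimate
    # into the final target y (here, a toy expected payoff per $1).
    payoff_if_paid = 1 + x["interest_rate"]   # toy payoff if fully paid
    payoff_if_default = 0.4                   # toy recovery if defaulted
    return (1 - z) * payoff_if_paid + z * payoff_if_default

x = {"grade": "A", "interest_rate": 0.08}
y = h(x, g(x))
```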

In this case, we probably want to predict whether the loan defaults (whether the borrower fails to make payments) or not. We assign a certain probability of this event occurring. Then, we consider how much money we will get if the borrower defaults, and how much money we will get if the borrower pays fully, and we take the probability-weighted average of the two.
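Numerically, the probability-weighted average looks like this. All three numbers are hypothetical, chosen only to illustrate the arithmetic:

```python
p_default = 0.10          # estimated probability the borrower defaults
payoff_if_paid = 1.15     # total received per $1 if the loan is fully paid
payoff_if_default = 0.40  # expected recovery per $1 if the borrower defaults

# Probability-weighted average of the two scenarios.
expected_return = p_default * payoff_if_default + (1 - p_default) * payoff_if_paid
```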





Why is this approach better? Because it takes into account knowledge that we have about how loans work.
Our knowledge includes:
-Given a fully-paid, on-time loan, the amount of money we will receive is an exact number based solely on the interest rate of the loan (and whether the term length is 3 or 5 years).
-The only case in which we will have a negative return is if the borrower defaults (the reverse is not true).

Put another way, if we already know for a fact that the relationship between a and b is something like a=cos(b^2), we don't want to estimate the relationship or its effects. That would be a waste of time and an unnecessary loss of accuracy. In this case, if the following expression is set to an initial investment of $1, and n is fixed, then P is a function of r.
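Assuming the expression in question is the standard annuity (amortization) formula, the total payment P per $1 invested works out like this for a fixed-rate, fully amortizing loan with monthly rate r and n monthly payments:

```python
def total_if_fully_paid(r: float, n: int) -> float:
    """Total cash received per $1 invested on a fully amortizing loan,
    given monthly interest rate r and term of n monthly payments.
    Uses the standard annuity payment formula."""
    payment = r * (1 + r) ** n / ((1 + r) ** n - 1)
    return n * payment

# e.g. a 3-year note at 12% APR (1% per month)
P = total_if_fully_paid(0.01, 36)
```

With n fixed, P depends only on r, exactly as the text says.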




We might imagine the interest rate as being a characteristic of the borrower that affects their likelihood to act in a certain way. The interest rate also mathematically affects the terms of the loan. We want to eliminate the second effect as much as possible for the machine learning component because we can put it back in later. It will only get in the way if we leave it in.
In fact, if we leave it in, the model could predict impossible results. If a fully paid grade A loan returns $1.10 for every $1 invested and a fully paid grade G loan returns $1.40, then the model must be able to predict returns anywhere between $0 and $1.40. But that means it could predict that a given grade A loan will return, say, $1.30, which makes no sense because that loan's return is capped at $1.10. Why would the borrower pay more than they're required to? Better to express the return as a fraction of the full payment, to avoid such a possibility.
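The normalization argument can be made concrete. If the model predicts a fraction in [0, 1] of each loan's own full payment, the prediction can never exceed what the borrower could actually pay (numbers here are hypothetical):

```python
full_payment_grade_a = 1.10   # per $1 invested, if a grade A loan is fully paid
predicted_fraction = 0.95     # hypothetical model output, constrained to [0, 1]

# Scale the fraction back by this loan's own cap to get dollars.
predicted_return = predicted_fraction * full_payment_grade_a

# Capped by construction: no impossible $1.30 prediction for a $1.10 loan.
assert predicted_return <= full_payment_grade_a
```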

The best part of this process is that it can be iterated. The above model incorporates only default risk. The following model also incorporates call risk (the chance that a borrower repays the loan early).
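Extending the weighted average to three scenarios is mechanical. All probabilities and payoffs below are hypothetical illustrations:

```python
# Scenario probabilities: default, early repayment ("call"), full payment.
p_default, p_early = 0.10, 0.15
p_full = 1 - p_default - p_early

v_default = 0.40   # recovery per $1 if the borrower defaults
v_early = 1.05     # payoff per $1 if repaid early (less interest collected)
v_full = 1.15      # payoff per $1 if paid on schedule

# Probability-weighted average over all three scenarios.
expected_value = (p_default * v_default
                  + p_early * v_early
                  + p_full * v_full)
```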


Essentially, we are mapping out each possible scenario and how much value we derive from it.

I'll leave my specific implementation to a later post, as it's more technical, and to avoid spoilers for those who might want to approach the data without bias towards my results.
