Machine Lending: July 2013

Friday, July 26, 2013

Building a Loan Model

Supervised learning is simple in concept. Given some characteristics of something (input variables), we want to predict some other characteristic (output variable). For example, let's say we have a database full of people's purchase histories and their gender. Trying to guess the gender of somebody outside of the database, based on their purchase history, is a supervised learning problem.

Categorical prediction/guessing is known as classification.

Numerical prediction/guessing is known as estimation (or regression).

In the case of Lending Club, we want to estimate the profitability of a loan. Roughly speaking, we want to reduce all of the characteristics given to us about the borrower into two numbers: return and risk (here defined to be the variance of return).

More mathematically, if x is the set of characteristics of a loan that we know at the time we choose whether to invest, and y is the characteristic(s) we want, then we want to build a model/function f so that

f(x) is a good estimate of y.

Let's take a look at the returns distribution of all completed notes as of earlier this year. Pretty sure this isn't a textbook distribution.

I think it's safe to say that running a single regression is not going to be the best way to approach this problem. In other words, we can't construct f in one step easily.

The idea rather is to create a function g that outputs some other variable of interest z, then plug x and z into another function h. i.e. g(x)=z and h(x,z)=y.

In this case, we probably want to predict whether the loan defaults (whether the borrower fails to make payments) or not. We assign a certain probability of this event occurring. Then, we consider how much money we will get if the borrower defaults, and how much money we will get if the borrower pays fully, and we take the probability-weighted average of the two.
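As a minimal sketch of the second step, here is what h might look like once some classifier g has produced a default probability z. The dictionary keys and all the numbers are made up for illustration; they are not from my actual model or from Lending Club data.

```python
def h(x, z):
    """Combine known loan terms x with a predicted default probability z.

    x is a dict with two hypothetical keys (amounts per $1 invested):
      "value_if_paid":    amount received if the loan is paid in full
      "value_if_default": amount received if the borrower defaults
    z is the default probability produced by some classifier g(x).
    """
    if not 0.0 <= z <= 1.0:
        raise ValueError("default probability must be between 0 and 1")
    # Probability-weighted average of the two outcomes.
    return z * x["value_if_default"] + (1.0 - z) * x["value_if_paid"]


# Made-up numbers, for illustration only.
loan = {"value_if_paid": 1.15, "value_if_default": 0.40}
estimate = h(loan, 0.10)  # 0.10 * 0.40 + 0.90 * 1.15
print(round(estimate, 4))
```

The point of the split is that only z has to be learned from data; everything else in h is arithmetic we already know.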

Why is this approach better? Because it takes into account knowledge that we have about how loans work.

Our knowledge includes:

-Given a fully-paid, on-time loan, the amount of money we will receive is an exact number based solely on the interest rate of the loan (and whether the term length is 3 or 5 years).

-The only case in which we will have a negative return is if the borrower defaults (the reverse is not true).

Put another way, if we already know for a fact that the relationship between a and b is something like a=cos(b^2), we don't want to estimate the relationship or its effects. That would be a waste of time and an unnecessary loss of accuracy. In this case, if the following expression is set to an initial investment of $1, and n is fixed, then P is a function of r.
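To make that concrete, here is a sketch that assumes the relationship in question is the standard level-payment (annuity) formula, with the annual rate divided by 12 to get the monthly rate r. That is my assumption about how the loans amortize, not something taken from Lending Club's documentation. With the initial investment set to $1 and n fixed, the monthly payment P depends only on r.

```python
def monthly_payment(annual_rate, n_months, principal=1.0):
    """Level monthly payment P for a fully amortizing loan.

    Assumes monthly compounding at annual_rate / 12 (an assumption).
    Solves principal = P * (1 - (1 + r) ** -n) / r for P.
    """
    r = annual_rate / 12.0
    if r == 0:
        return principal / n_months
    return principal * r / (1.0 - (1.0 + r) ** -n_months)


def total_repaid(annual_rate, n_months, principal=1.0):
    """Total cash received if every payment is made on time."""
    return monthly_payment(annual_rate, n_months, principal) * n_months


# A 12% loan over 3 years, per $1 invested (illustrative rate).
print(round(monthly_payment(0.12, 36), 5))
print(round(total_repaid(0.12, 36), 4))
```

Since this part is exact, there is nothing for a model to learn here; it can simply be computed and plugged in.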

We might imagine the interest rate as being a characteristic of the borrower that affects their likelihood to act in a certain way. The interest rate also mathematically affects the terms of the loan. We want to eliminate the second effect as much as possible for the machine learning component because we can put it back in later. It will only get in the way if we leave it in.

In fact, it is possible that if we leave it in, the model would predict impossible results. If a fully paid grade A loan has a return of $1.1 for every $1 invested and a fully paid grade G loan has a return of $1.4, it must be possible for the model to predict that a loan will return anywhere between $0 and $1.4. But this means that the model could potentially predict that a given grade A loan will return, say, $1.3, yet this result makes no sense because the loan is theoretically capped at $1.1. Why would the borrower pay more than they're required to? Better to express the return as a fraction of the full payment, to avoid such a possibility.
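A small sketch of that rescaling, using the $1.1 and $1.4 figures from the example above. The function names are hypothetical, and clipping to [0, 1] is just one simple way to enforce the cap.

```python
def to_fraction(amount_received, full_payment):
    """Express a loan's payout as a fraction of its contractual full payment."""
    return amount_received / full_payment


def from_fraction(fraction, full_payment):
    """Convert a predicted fraction back to dollars per $1 invested.

    The fraction is clipped to [0, 1], so the result can never exceed
    what the borrower is actually required to pay.
    """
    clipped = min(max(fraction, 0.0), 1.0)
    return clipped * full_payment


GRADE_A_FULL = 1.1  # fully paid grade A loan, per $1 invested
GRADE_G_FULL = 1.4  # fully paid grade G loan, per $1 invested

# Even an overshooting prediction of 1.2 stays within each loan's cap.
print(from_fraction(1.2, GRADE_A_FULL))
print(from_fraction(1.2, GRADE_G_FULL))
```

The model is then trained on fractions, and the interest-rate arithmetic is put back in only at the end.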

The best part of this process is that it can be iterated. The above model incorporates only default risk. The following model also incorporates call risk (the chance that a borrower repays the loan early).

Essentially, we are mapping out each possible scenario and how much value we derive from it.
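A rough sketch of that scenario mapping: each scenario gets a probability and a payout, and from those we can compute the two numbers we care about, return and risk (variance of return). The three scenarios and all the numbers below are invented for illustration.

```python
def summarize(scenarios):
    """Return (expected value, variance) for (probability, payout) pairs.

    Payouts are per $1 invested. The probabilities must sum to 1.
    """
    total_p = sum(p for p, _ in scenarios)
    if abs(total_p - 1.0) > 1e-9:
        raise ValueError("scenario probabilities must sum to 1")
    mean = sum(p * v for p, v in scenarios)
    variance = sum(p * (v - mean) ** 2 for p, v in scenarios)
    return mean, variance


# Hypothetical scenarios: (probability, payout per $1 invested).
scenarios = [
    (0.10, 0.40),  # borrower defaults
    (0.20, 1.05),  # borrower repays early (less interest collected)
    (0.70, 1.15),  # borrower pays in full, on time
]
mean, variance = summarize(scenarios)
print(round(mean, 4), round(variance, 6))
```

Adding another kind of risk just means adding another row, which is why the approach iterates so easily.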

I'll leave my specific implementation to a later post, as it's more technical, and to avoid spoilers for those who might want to approach the data with no bias towards my results.

Saturday, July 20, 2013

Lending Club: Automatic Updates

I've set up my bot to automatically publish its results whenever new loans come in. Have a look on the top bar.
I will probably add more columns later.

Friday, July 19, 2013

Book Review: Trading and Exchanges

A friend lent me Trading and Exchanges by Larry Harris over the summer. I found it to be a fantastic read. It classifies pretty much every type of trader out there, explaining why they're in the market and how they accomplish their objective.

I'll try to give a summary of the main types of traders mentioned and possibly an analogy for them.

Utilitarian traders are traders who have some intrinsic interest in a good. For example, chicken farmers are natural buyers of corn, and corn farmers are natural sellers. In the stock market, companies and retirees are natural sellers of stock and investors are natural buyers. They're the reason markets exist: so that those who value a good the least can sell it to those who value it the most.

Gamblers are a type of utilitarian trader who trade for entertainment. They tend to lose money.

Value traders are traders who somehow estimate the future value of a good. For example, stock analysts look at earnings projections and news reports and try to infer where stock prices are headed. Here's an example of how a value trader might provide value. Say there is a town with a huge water supply. Water is cheap, so people use it inefficiently: water gun fights, leaving the tap on, fountains, etc. A scientist predicts a drought, so he builds a huge reservoir, fills it with water, and later sells the water for a high price during the drought. People will accuse him of price gouging and launch expletives at him, but he actually did something socially useful: he kept the town from consuming water when it valued water little, and sold it to the same town later, when it valued water highly.

Frontrunners are traders who figure out/guess that somebody else wants to buy something. Legally, frontrunning occurs when a broker trades ahead of his client's order, which is illegal. Other forms of anticipation are completely legal. In any case they're not socially beneficial. Let's say you're at the store, and you want to buy the last carton of eggs in the store. Somebody knows you really need the eggs, so he runs in front of you (hence the term), buys the eggs, and sells them to you at a higher price. Kind of a dick move.

Technical analysts are traders who try to front-run uninformed investors, especially gamblers. At the same time, technical analysis is mostly BS so they're essentially gamblers themselves. They don't provide social value, except in the money that they lose on average.

Bluffers are traders who spread misinformation in the news or online and back it up with huge purchases or sales that make the news seem plausible. They essentially manufacture bubbles or crashes. A great recent example is possibly the hacker who used the AP Twitter account to say that a bomb had exploded in the White House.

Some Syrians claimed credit for this. If they did it, it's quite likely that they made trades to profit from the panic.

It's not socially beneficial, obviously.

Liquidity providers: Liquidity is roughly "the ability to buy or sell goods near the market price in a timely manner", or very short-term supply and demand elasticity. You can read more in the link.

Market makers/dealers are traders who post bids and offers. They make money off of the spread. That is, they might offer to buy a stock at $49 and sell for $51. An obvious example is a gold or silver coin shop. A large part of their job is to try to protect themselves from adverse selection by value traders, since if they trade with value traders and are unable to rebalance their inventory before prices move, they lose money. If they're good enough at knowing a value trader's intentions, they might even frontrun them. They provide social value by moving goods through time. Without market makers, buyers will often have to wait a long and uncertain amount of time before finding sellers, and vice versa. Therefore, market makers provide liquidity.

Arbitrageurs are traders who "try to buy/sell similar things at two different prices at the same time".

For example, let's say that gold costs around $1000 an ounce in New York and $1001 an ounce in India.

An arbitrageur will sell gold in India and buy the same amount in New York. After the gap closes, he has made a profit of about $1 per ounce. He forces the gap closed by repeating this action. If the gap doesn't close he will have to physically move the gold from New York to India.

Truck drivers are essentially arbitrageurs, because they drive goods from a place where they are less valued to a place where they are more valued.

Arbitrageurs are socially useful because they combine liquidity across different markets. More exotic types of arbitrage exist. Quants on Wall Street buy and sell bundles of stocks that are in some sense similar to each other in order to make small profits.

There are a few other trader types but they're not as interesting for the average trader/investor.

Thursday, July 18, 2013

Lending Club: Adverse Selection

In my last post I claimed that competitiveness in P2P lending is probably not an issue. I now am of the opinion that it is. Let's have a look at this post. We can see that institutional investors are indeed using API investing to grab loans quickly. It of course makes no sense that they would have set up investing bots to grab notes unless they were interested in patiently cherry-picking on a daily basis.

On my end, I was initially surprised to see that while there were many examples of well-performing loans in historical data, there were none to be found still in funding.

Therefore it becomes essential to create a bot to automatically invest in loans. However, API access is sadly (and rather unfairly) limited to institutional investors and bloggers.
My workaround for this unfortunate situation was to use Twitter's Streaming API, pointed here. PeerCube's blogger has API access, and has set up some sort of script that provides pretty much instant notifications about the appearance of new loans. This is great, since without it I would probably have to resort to constantly querying the Lending Club website during release times and worry about an IP address ban. So I just tell my bot to download the new loan data and analyze it whenever the Twitter account updates. I'm sure the API investors still have a huge advantage in speed, but it's the best I can do.

I do wonder what kind of models the institutional investors use, and to what extent I am being exposed to adverse selection. I can say for sure that the uninformed investor is suffering a great deal of adverse selection, as all the higher grade loans that remain have poor scores according to my model. This means they have two strikes against them: my model says they're bad, and the fact that they're still unfunded means that the institutional investors' models say they're bad. As a result, uninformed investors will tend to see substantially lower returns than what the Lending Club website says is an average return.
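Here is a toy illustration of the selection effect. The returns are made up, and it assumes the fast investors can rank loans perfectly, which is of course an exaggeration; the point is only the direction of the effect.

```python
def leftover_after_cherry_picking(loan_returns, n_taken):
    """Loans still available after the n_taken best ones are grabbed first."""
    ranked = sorted(loan_returns, reverse=True)
    return ranked[n_taken:]


def mean(values):
    return sum(values) / len(values)


# Hypothetical eventual returns of seven newly listed loans.
pool = [0.12, 0.10, 0.08, 0.06, 0.02, -0.05, -0.20]
leftovers = leftover_after_cherry_picking(pool, 3)

print(round(mean(pool), 4))       # average if loans were picked at random
print(round(mean(leftovers), 4))  # average of what the slow investor sees
```

The advertised average is computed over the whole pool, but the slow investor only ever chooses from the leftovers.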

My guess is that high risk/return loans are either great or terrible, and low risk/low return loans are mediocre, in which case the uninformed investor would be better off sticking with the safer loans, so as to avoid the selection effect. I haven't and don't intend to try to gather data to test this guess. I'll explain theoretical justifications for it in a future post, though.

Sunday, July 7, 2013

Lending Club Bot: Overview

Introduction
Lending Club is a peer-to-peer lending website. It receives and screens loan applications from borrowers, places them into grades (low-interest, low-risk to high-interest, high-risk), and posts these loans online along with information on the borrower. Investors deposit money, then buy shares in $25 increments in the loans. The borrowers make monthly payments for 3 or 5 year terms. For the borrowers, it is a place to get better rates than elsewhere. For investors, it is a place to get higher returns than in the stock or bond market.

Data on past loans is made available on the website, which opens up the possibility of various types of statistical analysis. I became interested in doing some analysis about a year ago and have worked on the project over the course of around 6 months.

Evaluation
What made the project attractive to me? The website advertises fairly good results for average investors. In other words, just buying loans at random has a decent average rate of return, given diversification. So a statistical edge only improves on this rate of return.

Actually, to be more precise, we need to consider the concept of uninformed and informed investors. In the stock market, the average person is an uninformed investor. They don't know anything special about any stock price, so the best they can do is invest in index funds. That is, they invest in a pool of many stocks to avoid the risks associated with individual stocks ("diversification"), with the assumption that the stock market as a whole tends to go up. Therefore, stock indices are often used as a sort of benchmark for the performance of other investments.
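To see why spreading money across many investments helps, here is a back-of-the-envelope sketch. It assumes equal weights and independent returns with the same variance, which real stocks or loans only approximate, and the variance figure is made up.

```python
import math


def portfolio_variance(single_variance, n_assets):
    """Variance of the average return of n equally weighted assets.

    Assumes the assets are independent and share the same variance;
    correlated assets would diversify less well than this.
    """
    if n_assets < 1:
        raise ValueError("need at least one asset")
    return single_variance / n_assets


single_variance = 0.04  # hypothetical: a standard deviation of 20%
for n in (1, 4, 100):
    std = math.sqrt(portfolio_variance(single_variance, n))
    print(n, round(std, 4))
```

The expected return is unchanged by diversifying; only the spread around it shrinks.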

Informed investors/traders know something about stock prices due to expertise or research. They aim to get returns exceeding the benchmark. They don't tend to diversify as much, because they are confident they can pick winners and losers. Moreover, they usually only have an edge on a small set of securities. For example, stock analysts tend to focus on a single stock, and would only trade on that one stock.

Uninformed or informed, an important rule of investing or trading is to not compete against somebody better informed. As the poker saying goes, "If after X minutes at the table you don't know who the sucker is, it's you". In order for a trader to get above-average returns, somebody else must get below-average returns. In the context of Lending Club, this would happen when well-informed investors snatch up good loans quickly, leaving bad loans for others.

Though I can't say for sure, it seems that Lending Club investors tend to be uninformed. Most bloggers and forumites seem to advocate intuitive criteria for investing. Some stats bloggers have done good work on various aspects of Prosper (another P2P lending site) and Lending Club data, but I have yet to come across any mention of a full investment model. There are some institutional investors, and undoubtedly there are some quiet investors with good picking strategies, but heuristically it seems unlikely that investing on Lending Club is currently competitive. There's not really that much room for profit for big players; the total amount of outstanding loans these days is less than $10 million. There's no way to borrow on margin and leverage (i.e. I would like to invest $500 in loans with $100 in my account, but I can't, whereas I could on the stock market). Therefore, I doubt that any group of investors is systematically snatching up all of the good loans.

Risks
What risks are present, and can they be avoided or mitigated?

1. Individual risk: Any given borrower can default (fail to repay the loan) or repay the loan early (reducing the interest return). This is actually a non-issue since it is what is considered by the model.
2. Financial risk: Some macroeconomic effect could cause interest rates to go up or a large number of borrowers to default all at once. This can be mitigated by some sort of hedge on the stock/bond market.
3. Model risk: I screw up my model and it gives me bad results, perhaps because of overfitting. I can guard against this by backtesting.
4. Institutional risk: Lending Club goes bankrupt and all outstanding loans are left hanging. I believe this is no longer an issue because Lending Club has an agreement with another company that will service the loans if this happens.
5. Institutional risk (again): Lending Club suddenly lets the quality of its loans drop. I don't think this is likely since they have been careful to reject most loans, even during a time of rapid expansion.

Sounds promising. More later.
