Machine Lending: 2013

Friday, September 20, 2013

The Costs of Search

Imagine you are selling an old bike on craigslist. You list it for $100. A few days go by, and nobody responds. Finally, somebody sends you a message saying they'll buy it for $90. You'd rather get more, but at the same time you can't count on them waiting, so you have to decide whether to accept or reject their bid.

In another scenario, you might be trying to buy a bike. If you buy it new at the store this month, you can use a coupon before it expires or the store runs out of inventory. But you might be able to get it used on craigslist or eBay.

In both scenarios, you are presented with a certain trade that may not be optimal. You have to balance several requirements. It's more likely you should accept if you really need the cash/good urgently. You should also accept quickly if it takes a lot of effort to stay in the market or evaluate each deal. On the other hand, you want to wait when you think that you're likely to find a much better deal later. You're also more willing to wait if you don't have to worry about the possibility that you'll never be able to trade at all.

In the Lending Club market, several of these factors come into play. New loans come in and out. The loans stay on the platform for at most 2 weeks, less if they are fully funded before then. You might want to wait to see if other people show interest in the loan, since that could be a sign that it's a good loan. But then you're a step behind them. If you're analyzing loans manually, it takes effort to look at each loan. Doing automated modeling and investing solves these problems.

However, with an automated model you still have the issue of money sitting around un-invested. That is, you have to somehow balance the lost interest against the expected gain of investing in a better loan tomorrow (or next week) versus an ok one today.

Now, while effort and lost interest are both factors causing you to prefer investing sooner rather than later, they are actually quite different from the perspective of returns to scale.

That is, say in one situation you have $10000 to invest and in the other you have $100. If it takes a lot of effort to analyze loans, then you will search more when you have to invest $10000 than when you have to invest $100. A 1% improvement is $100 instead of $1. But if it's no effort, then there's no difference in how you act between the two.

What about lost interest? Since interest is just expressed as a fraction of principal, it doesn't matter whether it's $10000 or $100, right? However, the problem is again the finite nature of the market.

Another example: Let's say you are planning a small outing for 10 people to Las Vegas. You would just have everyone take a bus there and think nothing more of it. If you had to plan for 10000 people, you would start to question whether or not there are enough buses going from where you are to Las Vegas in a timely manner.

Now, let's say you expect there to be one good-enough loan per week, with principal around $1000. In this case, the larger investor is at a disadvantage, because the majority of his money is sure to be lying around, whereas the smaller investor can easily make his purchase fully. In stock market terms, we would say that larger traders have a larger market impact. This is a disadvantage for them. There isn't enough supply for their demand (or demand for their supply), so they either have to wait or accept a worse deal.

From an effort perspective, there aren't really increasing or diminishing returns to scale. That is to say, you'll probably have to look at roughly 100 times more loans to invest 100 times the money. However, from the perspective of interest lost, size becomes much more annoying due to scale. Taking the above example, if you can only purchase $1000 of loans a week, and you have $10000, the last $1000 will have to wait 10 weeks before it is invested. If you have $100000, you'll have to wait 100 weeks to be fully invested (by that time you'll be almost 2/3 done with your first few loans). The problem basically gets worse at a roughly quadratic rate, whereas the effort problem only gets worse at a linear rate.
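To make the quadratic claim concrete, here is a minimal sketch that counts "idle dollar-weeks" (dollars sitting uninvested, summed over weeks). The function name and the assumption that exactly $1000 of good loans appears each week are mine, purely for illustration.

```python
def idle_dollar_weeks(capital, weekly_capacity):
    """Return (weeks until fully invested, total idle dollar-weeks).

    Assumes a fixed amount of good loans (weekly_capacity) becomes
    available once per week, and that cash waits a full week for it.
    """
    idle = capital
    total = 0
    weeks = 0
    while idle > 0:
        total += idle                      # this cash sat idle for the week
        idle -= min(idle, weekly_capacity)  # then part of it gets invested
        weeks += 1
    return weeks, total


small = idle_dollar_weeks(10_000, 1_000)
large = idle_dollar_weeks(100_000, 1_000)
print(small, large)
```

Under these assumptions, $10000 takes 10 weeks and accumulates 55,000 idle dollar-weeks, while $100000 takes 100 weeks and accumulates 5,050,000. Ten times the money means roughly a hundred times the lost interest.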

The upshot of this is that you should probably have slightly lower standards when initially investing your money if you're putting in a lot at once. Otherwise you'll have a huge backlog. Once you've done that, you can raise your standards when reinvesting; since money is constantly trickling in and out, you won't have to worry about market impact.

Friday, September 13, 2013

Diversification? No thanks

Diversification is critical when investing in the stock market. While the mathematical and historical background of diversification as risk management is much too rich to discuss here, the general idea is pretty much uncontroversial. I'd like to start with a simple model.

Let's say there are two investors, Alice and Bob. They each have a lottery ticket that will independently be revealed to be worth $0 or $100 the next day, with 50-50 odds. Let's say they are risk averse, so they would prefer to have $50 for sure. If they were to agree to split the total winnings, both would be better off, as each would have a 1/4 chance of gaining $0, a 1/4 chance of gaining $100, and a 1/2 chance of getting $50. Moreover, since this is a better outcome for both, it is quite possible that somebody who facilitates this transaction could take a small fee. This person would be able to benefit from the trade even if they are not risk averse themselves, because there are other people who are. In fact, if this person did not care about risk, they could buy both tickets for say $49 each, and everyone would be happier for it.
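The Alice and Bob example can be checked by enumerating the four equally likely outcomes. This is only a sketch; the helper name `mean_and_variance` is mine.

```python
from itertools import product

TICKET = [0, 100]  # each ticket pays $0 or $100 with equal odds


def mean_and_variance(payoffs):
    """Mean and variance of a list of equally likely payoffs."""
    mean = sum(payoffs) / len(payoffs)
    var = sum((x - mean) ** 2 for x in payoffs) / len(payoffs)
    return mean, var


# Alice keeps her own ticket.
alone = mean_and_variance(TICKET)

# Alice and Bob pool both tickets and split the total evenly.
split = mean_and_variance([(a + b) / 2 for a, b in product(TICKET, TICKET)])

print(alone)  # same expected payoff...
print(split)  # ...but half the variance
```

Splitting leaves the expected payoff at $50 and cuts the variance from 2500 to 1250, which is why two risk-averse people both prefer the deal.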

This is roughly a metaphor for the stock market. Selling shares to many investors is critical for ventures that involve risks measured in billions of dollars. It's why entrepreneurs dream of IPOs (or more recently, just buyouts): it's when ideas and potential turn into cold, hard cash. While their ideas may be great, they do not want to be eternally riding a financial roller coaster that could leave them broke at a moment's notice. But the story is the same for the average investor. It would not make sense for them to randomly sink all of their wealth into one stock, either.

Why not? There are a few assumptions that underlie the idea that diversification is good.
1. There is no obvious way to pick winning and losing stocks. This is a very mild form of the efficient market hypothesis. While there are many people who claim to be able to do so, most are either lucky and/or making their money ripping off gullible investors with worthless tips.
2. It is easy to diversify. The proliferation of index funds actually took a long time, as people had to learn to accept #1 and not try to beat the market. Index funds have since become so successful that they are the tail that wags the dog, which I will elaborate on later. Their management fees are now generally in the 0.1% range, and the existence of transaction fees means that it's cheaper to buy index funds than even a small handful of stocks.
3. The investor can take on an appropriate amount of risk easily. In some situations, the ability to leverage is critical. If a trader discovers a mispricing of 1 cent in a stock or bond or currency, it's not helpful unless that trader can make a large enough bet to make a significant profit. The trader needs to borrow money in order to do so. As a whole, the financial system is probably overly leveraged, and the average investor has a pretty small appetite for risk, so in the stock market, I'd say this condition is pretty much satisfied.
4. Stocks do not move in complete lock-step with each other. If this were the case, diversification would be pointless. Systemic risk (i.e. the whole market crashes at once) is scary because it means diversification does not apply in all situations.

Moving onto the Lending Club market, I decided to examine a couple of approaches to portfolio selection. I used the two studied in this paper.

The first approach is known as Sharpe Ratio Maximization. The idea here is to fix a certain amount of money that will be invested, and find the portfolio (mix of investments) of this size that has the highest ratio of return over risk. The idea here is that, given a certain amount of return, we want to take the least risk possible, or given a certain amount of risk we're willing to take, we want to maximize our return. This is done by investing a fraction or multiple of our total wealth in the portfolio, depending on exactly how aggressive we are.

The second approach is known by several names, such as Geometric Mean Maximization or Growth Optimal. It seeks to maximize the average rate of return, which is not the same as maximizing the average return. In other words, a 20% gain is not twice as good as a 10% gain: 1.1*1.1>1.2. Here, risk mathematically ends up reducing the average rate of return.
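A small sketch of the difference between average return and average rate of return. The two return sequences are made-up numbers chosen to match the 1.1*1.1 > 1.2 comparison above.

```python
import math


def arithmetic_mean(returns):
    """Plain average of per-period returns."""
    return sum(returns) / len(returns)


def geometric_mean_return(returns):
    """Per-period rate that compounds to the same final wealth."""
    growth = math.prod(1 + r for r in returns)
    return growth ** (1 / len(returns)) - 1


steady = [0.10, 0.10]  # 10% twice: wealth grows by 1.1 * 1.1 = 1.21
lumpy = [0.20, 0.00]   # 20% once, then nothing: wealth grows by 1.20

print(arithmetic_mean(steady), arithmetic_mean(lumpy))
print(geometric_mean_return(steady), geometric_mean_return(lumpy))
```

Both sequences have the same arithmetic mean, but the lumpy one compounds more slowly. That gap is the sense in which risk reduces the average rate of return.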

Isn't it the case that both approaches will yield the same solution? If return is good and risk is bad for the growth optimal portfolio, doesn't it mean that to be optimal is to have the most return and the least risk? The result is that the growth optimal portfolio would be a multiple of the Sharpe-Ratio maximized portfolio.
This is true, but note that in Lending Club, there is no leverage. You can't borrow money from the company to invest in more loans. But would I even want to do it in the first place? Yes.

After doing my modeling, I found that in the high-grades (A1-5), there were some loans with very low probabilities of default, like less than 1%. Given interest rates of 6+%, these had very good return/risk ratios.
On the other hand, there were loans in the E-F range with probabilities of default around 10%, and interest rates of 25%. This sounds needlessly dangerous in comparison, but I also optimized for time of default, so the return/risk ratios were decent.

It turns out that the two approaches diverged significantly when I ran my optimizer on it (I used Python's OpenOpt library and the formulas from the paper to do this). The Sharpe Ratio maximizer naturally diversified as much as possible, investing in basically every loan with a positive return.

The growth optimizer selected, from perhaps a thousand loans, 3 high-risk loans. It gets worse: the first loan got 80% of the portfolio, the second 19%, and the last 1%. When I saw this, I decided that I must have made a programming mistake. But when I thought about it some more, I realized this was the correct outcome. Why did it seem to almost completely disregard diversification?

The answer lies in the fixed size of the portfolio. Not caring about leverage, the Sharpe ratio maximizer piles heavily into the best high-grade loans, creating a very low risk portfolio with decent returns. Meanwhile, the growth optimizer cares a lot about leverage. Because the returns are so high and the risks only moderate, the optimizer ends up being very aggressive and accepts lower return/risk ratios in exchange for higher returns.
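The divergence can be reproduced with a toy two-loan example. This is not my actual optimizer (which used OpenOpt and the formulas from the paper); it is a brute-force grid search, and every number in it is hypothetical: a "safe" loan and a "risky" loan, each with a single-period payoff, a 50% loss on default, independent outcomes, and no leverage.

```python
import math
from itertools import product

# Hypothetical one-period loans as (probability, return) outcomes.
SAFE = [(0.99, 0.06), (0.01, -0.50)]
RISKY = [(0.90, 0.25), (0.10, -0.50)]


def portfolio_outcomes(w_risky):
    """Joint (probability, return) outcomes, assuming independent loans."""
    return [
        (p_s * p_r, (1 - w_risky) * r_s + w_risky * r_r)
        for (p_s, r_s), (p_r, r_r) in product(SAFE, RISKY)
    ]


def sharpe(w_risky):
    """Mean return divided by standard deviation (risk-free rate taken as 0)."""
    outs = portfolio_outcomes(w_risky)
    mean = sum(p * r for p, r in outs)
    var = sum(p * (r - mean) ** 2 for p, r in outs)
    return mean / math.sqrt(var)


def log_growth(w_risky):
    """Expected log growth, the quantity a growth-optimal portfolio maximizes."""
    return sum(p * math.log(1 + r) for p, r in portfolio_outcomes(w_risky))


# No leverage: the weight on the risky loan is restricted to [0, 1].
grid = [i / 100 for i in range(101)]
best_sharpe_w = max(grid, key=sharpe)
best_growth_w = max(grid, key=log_growth)
print(best_sharpe_w, best_growth_w)
```

With these made-up numbers the Sharpe maximizer keeps only about a sixth of the money in the risky loan, while the growth optimizer hits the no-leverage ceiling and puts everything in it.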

Personally, I think the growth optimal strategy is pretty reasonable, so I went along with it. I'll explain some more justifications later. But first:

Let's consider the 4 assumptions I stated earlier. Do they apply to the Lending Club market?

1. Is not true. As explained in previous posts, it is much easier to pick winners and losers in Lending Club. Therefore it is realistic to plan to beat the average.
2. Is not true in the sense that #1 is not true. We are faced with investing only in a small fraction of loans deemed to be the cream of the crop. Since we discarded the inferior loans, we thus have less ability to diversify.
3. I can't leverage. If you somehow can, it would change your strategy significantly.
4. Lending Club actually fits this criterion, since whether or not people pay back their loans is generally pretty independent of each other. With regard to risk that affects all Lending Club loans, the solution is to diversify between the stock market and Lending Club (and hope that there isn't much correlation between the two).

In the end, I decided to just do the greedy approach of setting a high threshold of expected return and investing as much as possible in any loan that fit the criterion. I didn't do any sort of data analysis on predicting the future supply of loans, so I just set a threshold so that a qualifying loan would only come in about once a week on average. In other words, I did not do any risk avoidance at all.
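One way such a threshold could be picked from past data is sketched below. This is an assumption for illustration, not a description of what I actually did: `past_scores` stands for a hypothetical list of model-predicted returns for loans listed over `weeks_observed` weeks.

```python
def weekly_threshold(past_scores, weeks_observed):
    """Pick a cutoff so that about one past loan per week would have qualified."""
    ranked = sorted(past_scores, reverse=True)
    # Keep as many top loans as there were weeks, within the data we have.
    k = min(max(weeks_observed, 1), len(ranked))
    return ranked[k - 1]


# Hypothetical predicted returns for loans seen over two weeks.
scores = [0.05, 0.12, 0.08, 0.15, 0.03, 0.11]
threshold = weekly_threshold(scores, weeks_observed=2)
qualifying = [s for s in scores if s >= threshold]
print(threshold, qualifying)
```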

The justification for this is that the real-life situation is different than the portfolio optimization testing that I did. In the testing, I used a large set of loans from a half-year window, but here, loans come and go quickly (especially the good ones), so there's a tradeoff between leaving cash lying around waiting for a great loan versus taking a pretty good loan immediately.

Lastly, there are a few factors that naturally cause diversification. I found that smaller loans performed better (and are thus more likely to be selected by my algorithm); combining this with competition from other investors means that I am simply unable to invest as large a fraction of my money into one loan as I would like, so I'm actually forced to buy several loans to invest all of my money. The other factor is that the loans are repaid monthly, meaning that small amounts of money are constantly coming in and being reinvested in different loans.

So in summary, actively trying to diversify on Lending Club isn't really worth it. There are many other considerations that take precedence.

Friday, July 26, 2013

Building a Loan Model

Supervised Learning is simple in concept. Given some characteristics of something (input variables), we want to predict some other characteristic (output variable). For example, let's say we have a database full of people's purchase histories and their gender. Trying to guess the gender of somebody outside of the database, based on their purchase history, is a supervised learning problem.

Categorical prediction/guessing is known as classification.

Numerical prediction/guessing is known as estimation.

In the case of Lending Club, we want to estimate the profitability of a loan. Roughly speaking, we want to reduce all of the characteristics given to us about the borrower into two numbers: return and risk (here defined to be the variance of return).

More mathematically, if x is the characteristics of a loan that we know when we can choose whether to invest or not, and y is the characteristic(s) we want, then we want to build a model/function f so that

f(x) is a good estimate of y.

Let's take a look at the returns distribution of all completed notes as of earlier this year. Pretty sure this isn't a textbook distribution.

I think it's safe to say that running a single regression is not going to be the best way to approach this problem. In other words, we can't construct f in one step easily.

The idea rather is to create a function g that outputs some other variable of interest z, then plug x and z into another function h. i.e. g(x)=z and h(x,z)=y.

In this case, we probably want to predict whether the loan defaults (whether the borrower fails to make payments) or not. We assign a certain probability of this event occurring. Then, we consider how much money we will get if the borrower defaults, and how much money we will get if the borrower pays fully, and we take the probability-weighted average of the two.
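The two-step structure g(x)=z, h(x,z)=y can be sketched as follows. The function `g` here is a made-up placeholder for whatever fitted classifier produces the default probability, and the loan fields and numbers are hypothetical.

```python
import math


def g(x):
    """Hypothetical stand-in for a fitted model: probability of default given x."""
    score = -3.0 + 0.1 * x["interest_rate_pct"]
    return 1 / (1 + math.exp(-score))


def h(x, z):
    """Probability-weighted average of the two payoffs, per $1 invested."""
    return z * x["payoff_if_default"] + (1 - z) * x["payoff_if_paid"]


# Hypothetical loan: payoffs are total dollars received per $1 invested.
loan = {"interest_rate_pct": 12.0, "payoff_if_default": 0.4, "payoff_if_paid": 1.2}

p_default = g(loan)
expected_payoff = h(loan, p_default)
print(p_default, expected_payoff)
```

Only `g` needs to be learned from data; `h` is just arithmetic on things we already know about how loans work.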

Why is this approach better? Because it takes into account knowledge that we have about how loans work.

Our knowledge includes:

-Given a fully-paid, on-time loan, the amount of money we will receive is an exact number based solely on the interest rate of the loan (and whether the term length is 3 or 5 years).

-The only case in which we will have a negative return is if the borrower defaults (the reverse is not true).

Put another way, if we already know for a fact that the relationship between a and b is something like a=cos(b^2), we don't want to estimate the relationship or its effects. That would be a waste of time and an unnecessary loss of accuracy. In this case, if the following expression is set to an initial investment of $1, and n is fixed, then P is a function of r.
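The expression referred to above is not reproduced in this text. As a sketch, assuming it is the standard fixed-payment amortization formula with periodic interest rate r, n payments, and payment P, the relationship looks like this (the function name is mine):

```python
def monthly_payment(r, n, principal=1.0):
    """Fixed payment P that repays `principal` over n periods at periodic rate r.

    Standard amortization formula: P = principal * r / (1 - (1 + r) ** -n).
    """
    if r == 0:
        return principal / n  # no interest: just split the principal evenly
    return principal * r / (1 - (1 + r) ** -n)


# With $1 invested and n fixed at 36 months, P depends only on r.
for annual_rate in (0.06, 0.12, 0.25):
    print(annual_rate, monthly_payment(annual_rate / 12, 36))
```

Because this part is exact, there is nothing for the machine learning step to estimate here.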

We might imagine the interest rate as being a characteristic of the borrower that affects their likelihood to act in a certain way. The interest rate also mathematically affects the terms of the loan. We want to eliminate the second effect as much as possible for the machine learning component because we can put it back in later. It will only get in the way if we leave it in.

In fact, it is possible that if we leave it in, the model would predict impossible results. If a fully paid grade A loan has a return of $1.1 for every $1 invested and a fully paid grade G loan has a return of $1.4, it must be possible for the model to predict that a loan will return anywhere between $0 and $1.4. But this means that the model could potentially predict that a given grade A loan will return, say, $1.3, yet this result makes no sense because the loan is theoretically capped at $1.1. Why would the borrower pay more than they're required to? Better to express the return as a fraction of the full payment, to avoid such a possibility.
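A minimal sketch of that normalization, with hypothetical helper names: the model is trained on the fraction of the full payment actually received, and its prediction is converted back to dollars afterwards.

```python
def to_fraction(total_received, full_payment):
    """Training target: share of the contractual full payment actually received."""
    return total_received / full_payment


def to_dollars(predicted_fraction, full_payment):
    """Convert a predicted fraction back to dollars per $1 invested."""
    # Clip to [0, 1] so the result can never exceed the loan's cap.
    clipped = min(max(predicted_fraction, 0.0), 1.0)
    return clipped * full_payment


# Grade A loan from the example: full payment is $1.1 per $1 invested.
print(to_fraction(0.55, 1.1))  # about half of what was owed came back
print(to_dollars(1.2, 1.1))    # an over-optimistic prediction is capped at 1.1
```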

The best part of this process is that it can be iterated. The above model incorporates only default risk. The following model incorporates also call risk (the chance that a borrower repays the loan early).

Essentially, we are mapping out each possible scenario and how much value we derive from it.
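Mapping out scenarios can be sketched as a small table of (name, probability, value) rows, from which return and risk (variance) follow directly. The three scenarios and all of their numbers are hypothetical.

```python
# Hypothetical scenarios: (name, probability, dollars received per $1 invested).
scenarios = [
    ("default", 0.10, 0.40),
    ("early repayment", 0.20, 1.05),
    ("paid in full", 0.70, 1.20),
]


def summarize(rows):
    """Return (expected value, variance) over mutually exclusive scenarios."""
    total_p = sum(p for _, p, _ in rows)
    if abs(total_p - 1.0) > 1e-9:
        raise ValueError("scenario probabilities must sum to 1")
    mean = sum(p * v for _, p, v in rows)
    var = sum(p * (v - mean) ** 2 for _, p, v in rows)
    return mean, var


mean, var = summarize(scenarios)
print(mean, var)
```

Adding another kind of risk is then just a matter of adding another row and re-estimating the probabilities.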

I'll leave my specific implementation to a later post, as it's more technical, and to avoid spoilers for those who might want to approach the data with no bias towards my results.

Saturday, July 20, 2013

Lending Club: Automatic Updates

I've set up my bot to automatically publish its results whenever new loans come in. Have a look on the top bar.
I will probably add more columns later.

Friday, July 19, 2013

Book Review: Trading and Exchanges

A friend lent me Trading and Exchanges by Larry Harris over the summer. I found it to be a fantastic read. It pretty much classifies any type of trader out there, explaining what reason they're in the market and how they accomplish their objective.

I'll try to give a summary of the main types of traders mentioned and possibly an analogy for them.

Utilitarian traders are traders who have some intrinsic interest in a good. For example, chicken farmers are natural buyers of corn, and corn farmers are natural sellers. In the stock market, companies and retirees are natural sellers of stock and investors are natural buyers. They're the reason markets exist: so that those who value a good the least can sell it to those who value it the most.

Gamblers are a type of utilitarian trader who trade for entertainment. They tend to lose money.

Value traders are traders who somehow estimate the future value of a good. For example, stock analysts look at earnings projections and news reports and try to infer where stock prices are headed. Here's an example of how a value trader might provide value. Say there is a town with a huge water supply. Water is cheap so people use it inefficiently, for example in water gun fights, leaving the tap on, fountains, etc. A scientist predicts a drought so he builds a huge reservoir, fills it with water, and later sells it for a high price during the drought. People will criticize him for price gouging and launch expletives at him, but he actually did something socially useful: he kept the water from being consumed when the town valued it little, and sold it later when the town valued it highly.

Frontrunners are traders who figure out/guess that somebody else wants to buy something. Legally, frontrunning occurs when a broker trades ahead of his client's order, which is illegal. Other forms of anticipation are completely legal. In any case they're not socially beneficial. Let's say you're at the store, and you want to buy the last carton of eggs in the store. Somebody knows you really need the eggs, so he runs in front of you (hence the term), buys the eggs, and sells them to you at a higher price. Kind of a dick move.

Technical analysts are traders who try to front-run uninformed investors, especially gamblers. At the same time, technical analysis is mostly BS so they're essentially gamblers themselves. They don't provide social value, except in the money that they lose on average.

Bluffers are traders who spread misinformation in the news or online and back it up with huge purchases or sales that make the news seem feasible. They essentially manufacture bubbles or crashes. A great recent example is possibly the hacker who used the AP Twitter account to say that a bomb had exploded in the White House.

Some Syrians claimed credit for this. If they did it, it's quite likely that they made trades to profit from the panic.

It's not socially beneficial, obviously.

Liquidity providers: Liquidity is roughly "the ability to buy or sell goods near the market price in a timely manner", or very short-term supply and demand elasticity. You can read more in the link.

Market makers/dealers are traders who post bids and offers. They make money off of the spread. That is, they might offer to buy a stock at $49 and sell for $51. An obvious example is a gold or silver coin shop. A large part of their job is to try to protect themselves from adverse selection by value traders, since if they trade with value traders and are unable to rebalance their inventory before prices move, they lose money. If they're good enough at knowing a value trader's intentions, they might even frontrun them. They provide social value by moving goods through time. Without market makers, buyers will often have to wait a long and uncertain amount of time before finding sellers, and vice versa. Therefore, market makers provide liquidity.

Arbitrageurs are traders who "try to buy/sell similar things at two different prices at the same time".

For example, let's say that gold costs around $1000 an ounce in New York and $1001 an ounce in India.

An arbitrageur will sell gold in India and buy the same amount in New York. After the gap closes, he has made a profit of about $1 per ounce. He forces the gap closed by repeating this action. If the gap doesn't close he will have to physically move the gold from New York to India.

Truck drivers are essentially arbitrageurs, because they drive goods from a place where they are less valued to a place where they are more valued.

Arbitrageurs are socially useful because they combine liquidity across different markets. More exotic types of arbitrage exist. Quants on Wall Street buy and sell bundles of stocks that are in some sense similar to each other in order to make small profits.

There are a few other trader types but they're not as interesting for the average trader/investor.

Thursday, July 18, 2013

Lending Club: Adverse Selection

In my last post I claimed that competitiveness in P2P lending is probably not an issue. I now am of the opinion that it is. Let's have a look at this post. We can see that institutional investors are indeed using API investing to grab loans quickly. It of course makes no sense that they would have set up investing bots to grab notes unless they were interested in patiently cherry-picking on a daily basis.

On my end, I was initially surprised to see that while there were many examples of well-performing loans in historical data, there were none to be found still in funding.

Therefore it becomes essential to create a bot to automatically invest in loans. However, API access is sadly (and rather unfairly) limited to institutional investors and bloggers.
My workaround for this unfortunate situation was to use Twitter's Stream API, pointed here. PeerCube's blogger has API access, and has set up some sort of script that provides pretty much instant notifications about the appearance of new loans. This is great since without this I would probably have to resort to constantly querying the Lending Club website during release times and having to worry about an IP address ban. So I just tell my bot to download the new loan data and analyze it whenever the Twitter account updates. I'm sure the API investors still have a huge advantage in speed, but it's the best I can do.

I do wonder what kind of models the institutional investors use, and how much adverse selection I am being exposed to. I can say for sure that the uninformed investor is suffering a great deal of adverse selection, as all the higher grade loans that remain have poor scores according to my model. This means they have two strikes against them: my model says they're bad, and the fact that they're still unfunded means that the institutional investors' models say they're bad. As a result, investors will tend to see substantially lower returns than what the Lending Club website says is an average return.

My guess is that high risk/return loans are either great or terrible, and low risk/low return loans are mediocre, in which case the uninformed investor would be better off sticking with the safer loans, so as to avoid the selection effect. I haven't and don't intend to try to gather data to test this guess. I'll explain theoretical justifications for it in a future post, though.

Sunday, July 7, 2013

Lending Club Bot: Overview

Introduction
Lending Club is a peer-to-peer lending website. It receives and screens loan applications from borrowers, places them into grades (low-interest, low-risk to high-interest, high-risk), and posts these loans online along with information on the borrower. Investors deposit money, then buy shares in $25 increments in the loans. The borrowers make monthly payments for 3 or 5 year terms. For the borrowers, it is a place to get better rates than elsewhere. For investors, it is a place to get higher returns than in the stock or bond market.

Data on past loans is made available on the website, which opens up the possibility of various types of statistical analysis. I became interested in doing some analysis about a year ago and have worked on the project over the course of around 6 months.

Evaluation
What made the project attractive to me? The website advertises fairly good results for average investors. In other words, just buying loans at random has a decent average rate of return, given diversification. So a statistical edge only improves on this rate of return.

Actually, to be more precise, we need to consider the concept of uninformed and informed investors. In the stock market, the average person is an uninformed investor. They don't know anything special about any stock price, so the best they can do is invest in index funds. That is, they invest in a pool of many stocks to avoid the risks associated with individual stocks ("diversification"), with the assumption that the stock market as a whole tends to go up. Therefore, stock indices are often used as a sort of benchmark for the performance of other investments.

Informed investors/traders know something about stock prices due to expertise or research. They aim to get returns exceeding the benchmark. They don't tend to diversify as much, because they are confident they can pick winners and losers. Moreover, they usually only have an edge on a small set of securities. For example, stock analysts tend to focus on a single stock, and would only trade on that one stock.

Uninformed or informed, an important rule of investing or trading is to not compete against somebody better informed. As the poker saying goes, "If after X minutes at the table you don't know who the sucker is, it's you". In order for a trader to get above-average returns, somebody else must get below-average returns. In the context of Lending Club, this would happen when well-informed investors snatch up good loans quickly, leaving bad loans for others.

Though I can't say for sure, it seems that Lending Club investors tend to be uninformed. Most bloggers and forumites seem to advocate intuitive criteria for investing. Some stats bloggers have done good work on various aspects of Prosper (another P2P lending site) and Lending Club data, but I have yet to come across any mention of a full investment model. There are some institutional investors, and undoubtedly there are some quiet investors with good picking strategies, but heuristically it seems unlikely that investing on Lending Club is currently competitive. There's not really that much room for profit for big players; the total amount of outstanding loans these days is less than $10 million. There's no way to borrow on margin and leverage (i.e. I would like to invest $500 in loans with $100 in my account, but I can't, whereas I could on the stock market). Therefore, I doubt that any group of investors is systematically snatching up all of the good loans.

Risks
What risks are present, and can they be avoided or mitigated?

1. Individual risk: Any given borrower can default (fail to repay the loan) or repay the loan early (reducing the interest return). This is actually a non-issue, since it is exactly what the model accounts for.
2. Financial risk: Some macroeconomic effect could cause interest rates to go up or a large number of borrowers to default all at once. This can be mitigated by some sort of hedge on the stock/bond market.
3. Model risk: I screw up my model and it gives me bad results, perhaps because of overfitting. I can eliminate this by backtesting.
4. Institutional risk: Lending Club goes bankrupt and all outstanding loans are left hanging. I believe this is no longer an issue because Lending Club has an agreement with another company that will service the loans if this happens.
5. Institutional risk (again): Lending Club suddenly lets the quality of its loans drop. I don't think this is likely since they have been careful to reject most loans, even during a time of rapid expansion.

Sounds promising. More later.
