Excel-based Rent Prediction Using Linear Regression and Cross-Validation.
As I write this (mid 2014), "big data" is a popular topic in computer science and even the popular media outlets are talking about it. I've reviewed a few books in this site that relate to this topic [Supercrunchers and The Sgnal and The Noise] and these have made me very curious about the mathematical "machinery" involved in data mining. Well, a couple of books, and several online machine learning and statistics courses later, I feel prepared to write an article illustrating a practical situation where these data mining techniques proved useful. The problem is determining the appropriate rent value for a property given a set of data points 'mined' from Craigslist. To keep this first article relatively simple, I used the 'grandfather' of all prediction techniques - linear regression (it's roots go back to the early 1800s with French mathematician Adrien-Marie Legendre). In spite of its age and simplicity, regression is still a powerful method, especially when you add non-linear transformations and multiple input features to the mix. This article also shows how one can, in a principled manner, select a model that is just complex enough to perform well in-sample, while avoiding "over fitting" the data (bad out-of-sample performance). In all this, I used Microsoft Excel exclusively. Some Excel VBA programming is needed for the cross-validation step, but could be excluded if cross-validation is not used. Let's look at how this is done.
The first step in any data mining exercise is to collect the actual data. I wrote a small VBA procedure in Excel that can extract the prices and number of rooms out of the Craigslist search results. Table 1 shows a portion of it, ordered by square footage. I recommend that you fine-tune your search in Craigslist as much as possible so it is not 'contaminated' with properties in different zip codes. Also be sure to delete repeat listings (this is a step I also automated in Excel). The data I collected is for a relatively upscale zip code in the Sacramento region, and I was able to get a total of 70 data points after deleting repeats. As a reference, the mean rent turned-out to be about $1500 but this of course has a huge spread (Table 1 is just the lower portion of the data). The lowest rent in the data set was $675 while the highest was $2750. In Table 1, I added two extra columns that are the square of the square-footage and the square of the number of rooms respectively. This is the "non-linear transformation" I mentioned in the introduction. This simple step allows us to construct richer (sometimes 'too rich') models as we will see shortly.
Table 1 - Rent Data
Figure 1 is a 3D plot representing the data. The depth is the number of rooms whereas the horizontal axis represents square footage. Not surprisingly, the rent tends to increase with the square footage and the number of rooms. We will articulate this relation more precisely by building linear regression models.
Figure 1 - Rent as a function of Square Footage and Number of Rooms
We have four input 'features' at our disposal: [Square Footage, # Rooms, Square Footage^2, # Rooms^2]. Linear regression gives us a practical statistical method to combine those features into a function (formula) that predicts the rent price. It does this by minimizing the error between this function and the points in the data set that are used as input (the 'training set'). For technical reasons that we won't get into in this article, we often use "square error" (square of the distance between a rent prediction and the actually observed rent in the dataset summed over all the data points). This procedure may seem abstract but it becomes clearer with some examples. Let's start with the simplest model; one that relies solely on the Square footage.
In this simplest of cases, we'll make an assumption that the house rent is simply linearly proportional to the square footage plus some constant. In mathematical terms:
Rent Prediction = m * Square Footage + b
Where the constants m and b are the model parameters that the linear regression procedure calculates for us. Once we know m and b, its' a trivial matter to plug them into the formula above and out comes a predicted rent for any square footage we enter. Very simple. The 'harder' part of determinine the 'best' m and b is handled by Excel. In this one-dimensional linear regression case, there are many ways to do this calculation in Excel, but I decided to use the LINEST function. This LINEST function can also handle multiple input "features", so it will be used later for the more complex models. LINEST is an Excel array function, and this just means that it returns not only a single value, but a set of values. These include the regression parameters (m and b), and also a very useful set of statistics that give you a sense of how accurate these parameters are. Please refer to the Excel online help for a more detailed explanation of how this function works. You will normally enter a formula such as:
=LINEST([Price Range] ,[Input Features Range],TRUE,TRUE)
And press Ctrl+Shift+Enter (this is needed for any array formula in Excel). Note that [Price Range] and [Input Features Range] are the ranges of cells containing the Rents and the Input Features respectively. For this first model, the [Input Features Range] is just the first column of values in Table 1 (i.e the Square Footage values).
Figure 2 shows the results returned by LINEST in a spreadsheet. On the right side is a picture showing how the results are organized in the array (this table is part of Excel's online help). The first row contains the regression parameters (m and b in this case) whereas the other four rows contain the statistics. One of the statistics (highlighted in darker green) is the sum of the square errors (ssresid). This is a measure of the so called in-sample error (Ein). It's a measure of how well this (simplistic) model approximated the actual rents when evaluated for each of the houses in the table. The 'sey' value (standard error of the y (output) variable) gives us a sense of how good (how bad?) the estimate is. In this case the standard error is $181. This standard error is pretty high. The actual expected range is usually a multiple of this standard error (say +/- 2*sey for a 95% confidence). (Technical note: the prediction interval should be calculated using a student T distribution and will be even larger than this. However, this simplified formula still give us a first-order "reality check" on how good (or poor) our model is).
Figure 2 - One dimensional linear model using LINEST
The high value of sey and Ein should give us some pause. It's one thing to make a prediction; It's quite another matter to get a handle on the accuracy of that prediction. (I wish we got more of that in the media when we hear economists forecasting growth numbers for example, but I guess that would reveal how poor the predictions are in reality...). This is where the so called "out-of-sample" error Eout comes into the picture.
Eout (lighter green towards the bottom of the LINEST array in Figure 2), is an estimate of the error out-of-sample. In other words, how well the model is expected to perform when confronted with new data, not in the sample provided by Table 1. This error is estimated by calculating a regression model using all but one point and then evaluating the error at the remaining point using that model. This process is repeated for each of the points in the data-set and the errors are summed together. It turns-out that this processs (know as Leave-One-Out Cross-Validation LOOCV) results in a good estimate of what the error will be out-of-sample. Unfortunately, Excel doesn't compute this directly, so I wrote a VBA subroutine to implemet this (may be the subject of another article). The result is shown as Eout in the following tables. Figure 2 shows that Eout is not too far from Ein in relative terms but it is clearly larger as one would expect...
Figure 3 shows a plot with all the observed points and the linear regression predictions (red line) overlaid. It should be apparent that there is quite a bit of variance in the data and that a simple 'line' can only do so much in predicting such behavior. This model give us a very rough estimate of the rent. Perhaps we can do better by adding other factors that also affect the rent value? In the next model we will add the number of rooms to the mix and see what happens.
Figure 3 - One dimensional linear model vs Data Set
It's been my experience that while most people have heard or used the one-dimensional linear regression model, not as many are familiar with the extension to multiple input 'features' (i.e multi-dimensional models). The extension is actually very natural in linear Algebra terms, and is no more complicated than inverting a matrix constructed from the input features and multiplying by the output vector. Fortunately, the Excel LINEST function supports regression with multiple input features, and does all of the algebra for us "in the background". As a practical matter, adding more features amounts to simply adding more columns to the [Input Features Range] in the formula. The array returned by LINEST is wider (more columns as it computes more parameters) but remains the same height (5 rows). See Figure 4.
Figure 4 - LINEST Using Two Input features (#Rooms and Square Footage)
Judging by the value of Ein in Figure 4 compared to the Ein in the previous model (Figure 2), we conclude there's a modest increase in performance. Just as importantly, the predicted Eout also decreased so this suggests we are on the right path. But why stop at two features?
Assuming that rents increase linearly with number of rooms and square footage may be too rigid of a model. It's not hard to imagine that the behavior may not be so linear near the bottom end of the rent scale or near the top. Perhaps as square footage increases, the rents increase more slowly. This is where the "non-linear" transformation plays a role. By adding the square of the #Rooms and SqFtg to the input features, the linear regression will be able to fit a model that can have a quadratic component to it, so that the relation is "curved" rather than "straight". We will start by adding only the square of the SqFtg and keep an eye on the Ein and Eout values. See Figure 5.
Figure 5 - LINEST Using Three Input features (#Rooms, Square Footage, (Square Footage)^2 )
We observe again a modest gain in the In-Sample performance (Ein decreases). Interestingly, the Out-Of-Sample performance Eout also decreases but the gains in Eout seem to be "flattening out"... Let's see how far we can take this.
Figure 6 shows the LINEST and Error results with four features (adding the square of the number of rooms as a feature).
Figure 6 - LINEST Using Four Input features (#Rooms, Square Footage, (Square Footage)^2, (#Rooms)^2 )
Notice how the in-sample error Ein improved very little. More significantly, the out-of-sample Eout estimate actually increased!
Table 2 and the graph in Figure 7 summarize what we have observed above.
Table 2 - Ein vs Eout Summary
Figure 7 - Ein vs Eout and # Features
In trying to fit a model to data, one always runs the risk of "over-fitting" the data. This is the point where increased model complexity actually decreases the model's performance. It's easy to fool ourselves into believing a model is better than it is by looking only at the in-sample performance. This is why techniques such as cross-validation are crucial in machine learning and data mining. It keeps us honest. In this particular example, it seems like the model with three input 'features' (Square Footage, # Rooms, (Square Footage)^2) is the best compromise between performance and complexity.
As I noted above, even this three-input feature model is a relatively poor predictor, in the sense that the output standard error (s.e.) is pretty high at close to +/- $169. Remember that this s.e. would be multiplied by a factor of more than 2 for a 95% confidence interval. Therefore, I would use this model only as a means to get a 'rough' estimate of the Rent for a property.
With some additional work, one may be able to 'scrape' more data from a Craigslist listing that could improve the model. Reasonable candidates would be items such as "Yard Size", "Distance to Schools", "Age of the Property", etc. Some of these items cannot be extracted directly from a Craigslist listing, though the task is not impossible and some Companies/Entities may already have access to this information. As it stands, this model is an interesting, and hopefully instructive example of how regression can be used for some very practical purposes, while keeping in mind the limitations of such a model.
Comments, questions, suggestions? You can reach me at: contact (at sign) paulorenato (dot) com