Last night, Zillow launched Rent Zestimates, which are estimated monthly rental prices for properties. They are produced by a set of models built from public property data and rental listing information on Zillow. Property types with a Rent Zestimate include apartments, single-family homes, condos, co-ops, townhomes and all other types of homes. We estimate rental prices for more than 90 million properties (we launched with 98 million, to be exact), with an overall accuracy, as measured by the median absolute error, of 10% for the month of February 2011. Creating these millions of estimates is done with the help of our Seattle neighbors, Amazon Web Services; it takes about four hours and costs about as much in computer processing time as six Starbucks lattes (also a Seattle neighbor!).
Mathematically, a basic model for a Rent Zestimate may be expressed as a functional relationship between the dependent variable and the independent variables:
y = f(x1, x2, .., xn)+e
where y is the dependent variable; x1, x2, .., xn are the independent variables and e is an error term. The dependent variable, y, represents the monthly rental listing price and all of the independent variables, x1, x2, .., xn are property attributes such as location, living space, lot size, and number of bathrooms. Thus, f is a model (or function) that relates a set of home attributes to its rental listing price.
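To make the notation concrete, here is a minimal sketch of what fitting such an f could look like. It assumes a single attribute (living space) and an ordinary least-squares line purely for illustration; the post does not say which model family is actually used, and the data values and the helper name `fit_line` are hypothetical.

```python
# Hypothetical toy data: (living space in sq ft, monthly rental listing price in USD).
listings = [(600, 1050.0), (800, 1250.0), (1000, 1650.0), (1200, 1850.0)]


def fit_line(points):
    """Fit y = a + b*x by ordinary least squares (an assumed stand-in for f)."""
    n = len(points)
    mean_x = sum(x for x, _ in points) / n
    mean_y = sum(y for _, y in points) / n
    sxx = sum((x - mean_x) ** 2 for x, _ in points)
    sxy = sum((x - mean_x) * (y - mean_y) for x, y in points)
    b = sxy / sxx
    a = mean_y - b * mean_x
    return a, b


a, b = fit_line(listings)


def f(x):
    """Estimated rental listing price for a home with living space x."""
    return a + b * x


# The error term e is whatever the model leaves unexplained: e = y - f(x).
residuals = [y - f(x) for x, y in listings]
```

In this toy setting, `f(900)` gives the estimated rent for a 900 sq ft home, and `residuals` plays the role of e for each listing.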
We train (or build) a set of models on training datasets partitioned by geographical regions. The training dataset is a table
A=[y, x1, x2, .., xn]
where y is a column holding the actual rental listing prices of properties whose attributes are x1, x2, .., xn. Our training dataset contains information on unique properties that are on the market for rent in a given month. Each time we create a new Rent Zestimate, we train a new set of models with properties that are on the market during the month, and we use them to estimate rental price for every property on Zillow.
For every geographical region that has sufficient training data, we split the training data into two subsets. We use the first subset to train the models. We then use these models to estimate the rental prices for all properties in the second dataset. We call the second subset the hold-out dataset. It provides a sample of properties with known rental listing prices for measuring the model’s accuracy. The model’s accuracy is based on comparing the model’s estimates to the actual listing prices among properties in this hold-out dataset. Note that for the final Rent Zestimates shown on the live site, we utilize a model trained on the full dataset versus just a subset. Only in computing our accuracy do we utilize the subsets and only do so here in order to compute a valid accuracy from an out-of-sample set of data.
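The split described above might be sketched as follows. The random shuffle, the 20% hold-out fraction, and the function name `split_holdout` are assumptions made for illustration; the post does not state how the two subsets are chosen or how large each one is.

```python
import random


def split_holdout(rows, holdout_fraction=0.2, seed=0):
    """Return (training_rows, holdout_rows) for one region.

    The fraction and the random shuffle are illustrative assumptions.
    """
    shuffled = list(rows)
    random.Random(seed).shuffle(shuffled)
    n_holdout = int(len(shuffled) * holdout_fraction)
    return shuffled[n_holdout:], shuffled[:n_holdout]


# Hypothetical property IDs standing in for one region's training table.
region_rows = list(range(100))
train_rows, holdout_rows = split_holdout(region_rows)
```

Models fitted on `train_rows` never see `holdout_rows`, which is what makes the resulting accuracy figures out-of-sample.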
If z is the Rent Zestimate for a home in the hold-out dataset described above, then the percent error of the estimate is e = 100*(z - y)/y, where y is the actual rental listing price. We can compute this error for every property in the hold-out dataset and combine the results into a table
B=[e, PropertyID, County, MSA, State]
where e holds the percent errors for the unique set of properties identified in the PropertyID column, and the County, MSA and State columns identify the county, metro and state of each property.
Then, we can calculate accuracy metrics across counties, metros or states. Two key metrics we track are the median absolute error [median (abs (e))] and the percent of estimates within x% of the rent price [100*count (abs (e) < x)/count (e)]. At the national level, the entire column of e is used for both the median error and the percent within x% of rent price.
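As a rough illustration, the two metrics can be computed per metro and nationally like this. The rows, metro names, and function names are hypothetical, and only the metro column of the error table is kept for brevity.

```python
from collections import defaultdict
from statistics import median

# Hypothetical hold-out rows: (metro, Rent Zestimate z, actual listing price y).
holdout = [
    ("Metro A", 1100.0, 1000.0),
    ("Metro A", 950.0, 1000.0),
    ("Metro B", 1200.0, 1000.0),
    ("Metro B", 1020.0, 1000.0),
]


def percent_error(z, y):
    """Percent error e = 100*(z - y)/y."""
    return 100.0 * (z - y) / y


def median_abs_error(errors):
    """median(abs(e))"""
    return median(abs(e) for e in errors)


def pct_within(errors, x):
    """100*count(abs(e) < x)/count(e)"""
    return 100.0 * sum(1 for e in errors if abs(e) < x) / len(errors)


# Group the percent errors by metro.
errors_by_metro = defaultdict(list)
for metro, z, y in holdout:
    errors_by_metro[metro].append(percent_error(z, y))

by_metro = {
    metro: (median_abs_error(errs), pct_within(errs, 10))
    for metro, errs in errors_by_metro.items()
}

# The national figures use the entire column of errors.
all_errors = [e for errs in errors_by_metro.values() for e in errs]
national_median = median_abs_error(all_errors)
national_within_10 = pct_within(all_errors, 10)
```

Grouping by county or state works the same way, with a different key in place of the metro.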
The table below shows the accuracy metrics for the nation and the top 30 metros as of February 2011. The Homes with Rent Zestimates column shows the total number of unique homes having Rent Zestimates. Nationally, we have 98.7 million Rent Zestimates for the United States with a median absolute error of 10%; 49.7% of Zestimates are within 10% of rental listing prices and 74.6% of rental listing prices are within Zestimate ranges.
| Metro | Homes with Rent Zestimates | Within 5% of Rent Price | Within 10% of Rent Price | Within 20% of Rent Price | Within Zestimate Range | Median Error |
|---|---|---|---|---|---|---|
| Dallas-Fort Worth, TX | 2,023,276 | 33.9% | 56.4% | 81.5% | 78.3% | 8.3% |
| Kansas City, MO | 735,794 | 29.0% | 47.3% | 74.0% | 72.6% | 10.7% |
| Las Vegas, NV | 716,817 | 36.6% | 59.8% | 82.4% | 73.7% | 7.7% |
| Los Angeles, CA | 3,022,385 | 31.1% | 53.5% | 77.0% | 79.4% | 9.0% |
| Miami-Fort Lauderdale, FL | 2,579,957 | 27.4% | 47.1% | 70.5% | 75.9% | 10.7% |
| Minneapolis-St Paul, MN | 1,146,620 | 35.6% | 58.6% | 83.7% | 78.0% | 7.9% |
| New York, NY | 4,744,071 | 22.9% | 41.4% | 66.7% | 72.5% | 12.7% |
| San Antonio, TX | 732,393 | 29.6% | 53.4% | 81.3% | 75.0% | 9.2% |
| San Diego, CA | 880,616 | 30.0% | 53.7% | 77.8% | 81.6% | 9.0% |
| San Francisco, CA | 1,254,830 | 30.4% | 53.1% | 78.4% | 74.5% | 8.9% |
| St. Louis, MO | 1,027,585 | 30.3% | 51.0% | 75.8% | 77.8% | 9.5% |
We implemented the Rent Zestimation process as a software application taking input from Zillow databases and producing an output table with about 100 million rows:
D=[Property ID, Rent Zestimate, Lower Confidence Interval, Upper Confidence Interval]
where the Property ID column holds the Zillow internal reference ID of each property, Rent Zestimate holds the point estimate of the rent listing price, and Lower Confidence Interval and Upper Confidence Interval hold the lower and upper bounds of the estimated rent listing price, respectively.
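The "Within Zestimate Range" metric reported in the accuracy table can be derived from the lower and upper interval columns of this output once actual listing prices are known. The sketch below assumes the bounds are inclusive and uses hypothetical rows; the function name `pct_within_range` is illustrative, and the post does not say how boundary cases are treated.

```python
# Hypothetical rows: (property_id, rent_zestimate, lower, upper, actual_listing_price).
rows = [
    (1, 1500.0, 1350.0, 1650.0, 1400.0),  # actual price inside the range
    (2, 2000.0, 1800.0, 2200.0, 2300.0),  # actual price above the range
    (3, 900.0, 800.0, 1000.0, 1000.0),    # on the upper bound, counted as inside (assumption)
    (4, 1200.0, 1100.0, 1300.0, 1050.0),  # actual price below the range
]


def pct_within_range(rows):
    """Percent of actual listing prices falling inside [lower, upper]."""
    hits = sum(1 for _, _, lower, upper, actual in rows if lower <= actual <= upper)
    return 100.0 * hits / len(rows)


coverage = pct_within_range(rows)
```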
We deploy this software application into a production environment on the Amazon Web Services (AWS) cloud. A complete run takes four hours using four Amazon EC2 instances of the Extra-Large-High-CPU type. This type of machine costs $1.16/hr. Thus, it costs us about $19 to produce 100 million Rent Zestimates, which is about the same as a 3-D movie ticket or about 5 gallons of gasoline in New York City today.
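The cost figure follows directly from the numbers quoted above. A quick check of the arithmetic, assuming every instance is billed for all four hours of the run:

```python
instances = 4                    # EC2 instances used per run
hours_per_run = 4                # wall-clock hours per run
price_per_instance_hour = 1.16   # USD, as quoted in the post

# Assumes each instance is billed for the full duration of the run.
total_cost = instances * hours_per_run * price_per_instance_hour

estimates = 100_000_000          # about 100 million Rent Zestimates
cost_per_million = total_cost / (estimates / 1_000_000)

print(f"Total: ${total_cost:.2f}, per million estimates: ${cost_per_million:.4f}")
```

That works out to $18.56 per run, or a little under 19 cents per million estimates.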