In setting out to create a new home price index, a major problem Zillow sought to overcome in existing indices was their inability to deal with the changing composition of properties sold in one time period versus another time period. Both a median sale price index and a repeat sales index are vulnerable to such biases (see the analysis here for an example of how influential the bias can be). For example, if expensive homes sell at a disproportionately higher rate than less expensive homes in one time period, a median sale price index will characterize this market as experiencing price appreciation relative to the prior period of time even if the true value of homes is unchanged between the two periods.
The ideal home price index would be based on sale prices for the same set of homes in each time period, so that the sales mix never differs across periods. This approach of using a constant basket of goods is widely used; common examples are commodity price indices and the consumer price index. Unfortunately, unlike commodities and consumer goods, for which we can observe prices in all time periods, we can’t observe prices on the same set of homes in all time periods because not all homes are sold in every time period.
The innovation that Zillow developed in 2005 was a way of approximating this ideal home price index by leveraging the valuations Zillow creates on all homes (called Zestimates). Instead of actual sale prices on every home, the index is created from estimated sale prices on every home. While there is some estimation error associated with each estimated sale price (which we report here), this error is just as likely to be above the actual sale price of a home as below (in statistical terms, this is referred to as minimal systematic error). Because of this fact, the distribution of actual sale prices for homes sold in a given time period looks very similar to the distribution of estimated sale prices for this same set of homes. But, importantly, Zillow has estimated sale prices not just for the homes that sold, but for all homes even if they didn’t sell in that time period. From this data, a comprehensive and robust benchmark of home value trends can be computed which is immune to the changing mix of properties that sell in different periods of time (see Dorsey et al. (2010) for another recent discussion of this approach).
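The composition effect described above can be illustrated with a small sketch. All values are made up, and the "estimates" are assumed to be error-free purely for illustration: true home values are held constant across two periods, yet the median of sold homes moves because the sales mix changes, while the median over all homes does not.

```python
from statistics import median

# Hypothetical market whose true home values are identical in both periods.
cheap = [200_000] * 60       # 60 lower-priced homes
expensive = [500_000] * 40   # 40 higher-priced homes
all_homes = cheap + expensive

# Only the mix of homes that happen to sell differs between the periods.
sold_p1 = cheap[:8] + expensive[:2]   # period 1: mostly cheaper homes sell
sold_p2 = cheap[:3] + expensive[:7]   # period 2: mostly expensive homes sell

# A median sale price index only sees the homes that sold.
median_sales_p1 = median(sold_p1)
median_sales_p2 = median(sold_p2)

# A median over estimated values of every home is unaffected by the sales mix.
median_all_p1 = median(all_homes)
median_all_p2 = median(all_homes)

print(median_sales_p1, median_sales_p2)  # differs although values are unchanged
print(median_all_p1, median_all_p2)      # identical in both periods
```

In practice the estimates carry error, so the second pair of medians would only be stable to the extent that the estimation error has little systematic component.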
Each Zillow Home Value Index (ZHVI) is a time series tracking the monthly median home value in a particular geographical region. In general, each ZHVI time series begins in April 1996. We generate the ZHVI at eight geographic levels: neighborhood, ZIP code, city, congressional district, county, metropolitan area, state and the nation.
Estimated sale prices (Zestimates) are computed based on proprietary statistical and machine learning models. These models begin the estimation process by subdividing all of the homes in the United States into micro-regions, or subsets of homes either near one another or similar in physical attributes to one another. Within each micro-region, the models observe recent sale transactions and learn the relative contribution of various home attributes in predicting the sale price. These home attributes include physical facts about the home and land, prior sale transactions, tax assessment information and geographic location. Based on the patterns learned, these models can then estimate sale prices on homes that have not yet sold.
The sale transactions from which the models learn patterns include all full-value, arms-length sales that are not foreclosure re-sales. The purpose of the Zestimate is to give consumers an indication of the fair value of a home under the assumption that it is sold as a conventional, non-foreclosure sale. Similarly, the purpose of the Zillow Home Value Index is to give consumers insight into the home value trends for homes that are not being sold out of foreclosure status. Zillow research indicates that homes sold as foreclosures have typical discounts relative to non-foreclosure sales of between twenty and forty percent, depending on the foreclosure saturation of the market. This is not to say that the Zestimate is not influenced by foreclosure re-sales. Zestimates are, in fact, influenced by foreclosure sales but the pathway of this influence is through the downward pressure foreclosure sales put on non-foreclosure sale prices. It is the price signal observed in the latter that we are attempting to measure and, in turn, predict with the Zestimate.
Within each region, we calculate the ZHVI for various subsets of homes (or market segments) so as to afford greater insight into what is happening in a particular market. All market segments are shown in the table below. Only residential properties are included in the ZHVI calculation. Non-residential properties, such as office buildings, shopping centers and farms, are not included.
One very useful form of market segmentation that we produce is based on the distribution of home values within the metropolitan area. Here we assign properties to one of three tiers based on their Zestimates on a particular date: top, middle or bottom tier. The thresholds for the price tiers vary from metro to metro and are determined by the distribution of home values in each metro. Since Zestimates are time dependent, a property may belong to different price tiers at different dates. To reduce tier switching, we exclude properties near the boundaries of price tiers when assigning tiers. Thus, the Zestimate counts for the three tiers do not sum to the number of Zestimates for the “All Homes” market segment.
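The tiering logic can be sketched as follows. The source does not state the actual thresholds or how wide the excluded boundary zone is, so the tercile cut points and the 5% relative buffer below are assumptions made only for illustration, as are the function name `assign_tier` and the sample values.

```python
from statistics import quantiles

def assign_tier(z, low_cut, high_cut, buffer=0.05):
    """Return 'bottom', 'middle' or 'top', or None if z is near a tier boundary.

    Assumption: a home is excluded when its value lies within +/- buffer
    (as a fraction of the cut point) of either cut point.
    """
    for cut in (low_cut, high_cut):
        if abs(z - cut) <= buffer * cut:
            return None
    if z < low_cut:
        return "bottom"
    if z < high_cut:
        return "middle"
    return "top"

# Made-up Zestimates for one metro on one date.
metro_zestimates = [100_000, 110_000, 120_000, 148_000, 200_000,
                    210_000, 220_000, 300_000, 310_000, 320_000]

# Assumed thresholds: the two tercile cut points of the metro's distribution.
low_cut, high_cut = quantiles(metro_zestimates, n=3)

tiers = [assign_tier(z, low_cut, high_cut) for z in metro_zestimates]
tiered_count = sum(t is not None for t in tiers)

# Homes near a boundary are dropped, so the tiers can cover fewer homes
# than the full "All Homes" segment.
print(tiered_count, "of", len(metro_zestimates), "homes assigned to a tier")
```

Because the cut points are recomputed per metro and per date, the same home can land in a different tier (or in the excluded zone) from one month to the next.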
| i | Market Segment | Number of Zestimates | Description |
|---|---|---|---|
| 1 | All Homes | 83.0 M | Single family + condominium + cooperative |
| 2 | Single Family | 73.8 M | Single family only |
| 3 | Condo | 9.2 M | Condominium + cooperative only |
| 4 | Studio | 34.5 M | 0 Bedroom |
| 5 | 1 Bedroom | 1.7 M | 1 Bedroom |
| 6 | 2 Bedroom | 11.4 M | 2 Bedroom |
| 7 | 3 Bedroom | 28.9 M | 3 Bedroom |
| 8 | 4 Bedroom | 12.1 M | 4 Bedroom |
| 9 | 5+Bedroom | 3.3 M | 5 Bedroom or more |
| 10 | Top Tier | 26.5 M | Top price tier among homes within the same metropolitan area |
| 11 | Middle Tier | 26.5 M | Middle price tier among homes within the same metropolitan area |
| 12 | Bottom Tier | 26.5 M | Bottom price tier among homes within the same metropolitan area |
Using the estimated market value of every home as represented in the Zestimate, the main steps in the construction of the ZHVI are as follows:
Let $t$ be a discrete time variable taking a value at the end of each month. Let $H(t)$ be an $M \times N$ matrix with each element $h_{ij}(t)$ representing the number of homes at time $t$ for the $i$-th market segment in the $j$-th geographical region, where $M$ is the total number of market segments and $N$ is the total number of unique regions having a minimum required number of Zestimates. Currently, we have $M = 12$ and $N = 57{,}022$. Geographical regions include national, state, metro, county, city, ZIP code, neighborhood and congressional district. The Number of Zestimates column in Table 1 above gives $h_{ij}(t)$ for each market segment $i$ when $j$ is the national region and $t$ is October 2011.
Let $z_{ij}(t)$ be the vector of Zestimates of all homes at time $t$, of length $h_{ij}(t)$, for the $i$-th market segment and $j$-th region. The raw median Zestimate, $r_{ij}(t)$, for the $i$-th market segment and $j$-th region is defined as:
$$r_{ij}(t) = \operatorname{Median}\left(z_{ij}(t)\right)$$
$r_{ij}(t)$ is an element of the $M \times N$ matrix $R(t)$. In order to ensure reliability and stability, we only compute $r_{ij}(t)$ when $h_{ij}(t)$ is above some minimum threshold. For October 2011, there are a total of 384,485 combinations of market segment and region for which the median could be computed:
$$\operatorname{Count}\{\, r_{ij}(t) \neq \mathrm{NA} : i = 1, \dots, M;\ j = 1, \dots, N \,\} = 384{,}485$$
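A minimal sketch of the raw median step, assuming a hypothetical minimum of three homes per cell (the source does not state the actual threshold) and made-up Zestimates. `None` plays the role of NA.

```python
from statistics import median

MIN_HOMES = 3  # hypothetical threshold; the actual value is not given in the text

def raw_median(zestimates, min_homes=MIN_HOMES):
    """Return the raw median r_ij(t), or None (NA) when too few homes exist."""
    if len(zestimates) < min_homes:
        return None
    return median(zestimates)

# z[(segment, region)] -> Zestimates of all homes in that cell at time t.
z = {
    ("All Homes", "County A"): [210_000, 180_000, 250_000, 305_000],
    ("Condo", "County A"): [150_000],  # too few homes, so no median is computed
}

# R holds one raw median per (segment, region) cell.
R = {cell: raw_median(values) for cell, values in z.items()}

# Number of cells with a computable median, analogous to the count above.
computable = sum(r is not None for r in R.values())
print(R, computable)
```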
Table 2 shows the number of regions with a computable raw median, by region level and market segment. For example, we have usable data to calculate raw medians in up to 2,451 counties for the single-family home market segment.
| Market Segment | National | State | MSA | County | Congressional District | City | Neighborhood | Zip |
|---|---|---|---|---|---|---|---|---|
| All Homes | 1 | 49 | 844 | 2,451 | 430 | 21,009 | 8,609 | 22,454 |
| Single Family | 1 | 49 | 844 | 2,451 | 430 | 20,937 | 7,947 | 22,266 |
| Condo | 1 | 49 | 455 | 795 | 408 | 3,996 | 3,029 | 6,431 |
| Studio | 1 | 49 | 819 | 2,256 | 430 | 15,102 | 3,780 | 16,705 |
| 1 Bedroom | 1 | 49 | 480 | 937 | 411 | 2,218 | 1,093 | 3,610 |
| 2 Bedroom | 1 | 49 | 779 | 1,949 | 430 | 13,021 | 5,593 | 15,488 |
| 3 Bedroom | 1 | 49 | 708 | 1,663 | 429 | 9,221 | 3,870 | 11,908 |
| 4 Bedroom | 1 | 49 | 734 | 1,676 | 429 | 8,722 | 3,451 | 11,573 |
| 5+Bedroom | 1 | 49 | 581 | 1,143 | 427 | 4,059 | 1,614 | 6,574 |
| Top Tier | 1 | 49 | 842 | 1,527 | 429 | 11,911 | 4,030 | 14,090 |
| Middle Tier | 1 | 49 | 842 | 1,553 | 429 | 13,482 | 4,849 | 15,728 |
| Bottom Tier | 1 | 49 | 842 | 1,515 | 428 | 12,078 | 5,255 | 14,386 |
| Total | 12 | 588 | 8,770 | 19,916 | 5,110 | 135,756 | 53,120 | 161,213 |
Zestimate errors are both time and region dependent. While the errors produced by the Zestimate algorithm are generally equally distributed above and below the actual sale price, there can be some residual systematic error detected once more historical sales are known (systematic error here is defined as the median raw error being slightly greater or less than zero). In this event, raw median Zestimates are adjusted through the use of a correction factor in the manner described below.
Let $u_{ij}(t)$ be the median home value free of systematic error. Then, the raw median Zestimate can be expressed in terms of $u_{ij}(t)$ as:
$$r_{ij}(t) = \{1 + b_j(t)\}\, u_{ij}(t)$$
where $b_j(t)$ is the systematic error in Zestimates representing the median fluctuation of Zestimates above or below the actual sold prices within the time window centered around $t$ for the $j$-th region. We calculate the Zestimate systematic error as:
$$b_j(t) = \operatorname{Median}\left(\frac{z_j(t-1) - s_j(t)}{s_j(t)}\right)$$
where $s_j(t)$ is a vector of sale prices and $z_j(t-1)$ are Zestimates corresponding to the same properties as $s_j(t)$ but with the estimated sale price taken from the period immediately prior to the actual sale (to ensure that the estimate has not been influenced by the sale). The vector of sales, $s_j(t)$, is obtained through the following approach:
After computing $b_j(t)$, the adjusted median of Zestimates is an $M \times N$ matrix $U(t)$ calculated as:
$$u_{ij}(t) = \frac{r_{ij}(t)}{1 + b_j(t)}$$
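The correction can be traced with a small worked example. The prices, prior-period Zestimates and raw median below are made up, and the function names are illustrative only.

```python
from statistics import median

def systematic_error(prior_zestimates, sale_prices):
    """Median relative error of prior-period Zestimates against sale prices."""
    return median([(z - s) / s for z, s in zip(prior_zestimates, sale_prices)])

def adjusted_median(raw_median, b):
    """Remove the systematic error b from a raw median Zestimate."""
    return raw_median / (1 + b)

# Made-up sales in one region, with each home's Zestimate from the prior period.
sale_prices = [200_000, 300_000, 400_000]
prior_zestimates = [204_000, 309_000, 404_000]  # relative errors: 2%, 3%, 1%

b = systematic_error(prior_zestimates, sale_prices)  # median error is about 0.02
u = adjusted_median(255_000, b)                      # about 250,000

print(round(b, 4), round(u))
```

Here the Zestimates run about 2% high, so a raw median of 255,000 is scaled down to roughly 250,000. The same regional factor is applied to every market segment in that region, since the error term carries only the region index.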
We apply a simple three-month moving average to $U(t)$ to filter out noise in the data:
$$\mathrm{MA}(t) = \frac{U(t) + U(t-1) + U(t-2)}{3}$$
The resultant M by N matrix MA(t) is a smooth estimate of the median home value free of residual systematic error. The smoothing may be less necessary for large regions such as the nation and states, where large data sets are available, but it is applied at all levels for consistency.
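A minimal sketch of the trailing three-month moving average for a single series, using made-up adjusted medians. How the first two months are treated is not stated in the text; leaving them undefined (`None`) is an assumption here.

```python
def three_month_ma(u):
    """Trailing three-month average; None where fewer than three months exist."""
    return [
        None if t < 2 else (u[t] + u[t - 1] + u[t - 2]) / 3
        for t in range(len(u))
    ]

# Made-up adjusted medians for four consecutive months.
u = [250_000, 253_000, 247_000, 256_000]
ma = three_month_ma(u)

print(ma)  # the month-to-month swings in u are damped in ma
```

Because the window is trailing rather than centered, each smoothed value uses only data available at that month, at the cost of a small lag relative to the unsmoothed series.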
Home sales are affected by seasons within the same year. Adjusting for seasonality is desirable so that the trend is more apparent for ease of comparison and forecasting. Since Zestimates and the ZHVI depend on sale prices, the time series MA(t) does contain some seasonality. We remove this seasonality using a seasonal-trend decomposition procedure (STL) based on the Loess method developed by Cleveland et al. (1990). STL is a filtering procedure for decomposing a time series into seasonal, trend, and remainder components:
$$\mathrm{MA}(t) = S(t) + T(t) + RE(t)$$
where $S(t)$, $T(t)$ and $RE(t)$ are the seasonal, trend and remainder components respectively. We remove seasonality by adding the trend and remainder components to form the seasonally adjusted ZHVI:
$$\mathrm{ZHVI}(t) = T(t) + RE(t)$$
The remainder component, $RE(t)$, represents irregular features in the time series, which we preserve.
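The additive identity above means the seasonally adjusted series equals the smoothed series minus its seasonal component. The sketch below illustrates that identity only. It is not STL: to stay within the standard library it swaps in a simpler classical additive decomposition (a centered 12-month moving average for the trend and per-month averages for the seasonal component), applied to a synthetic series with a known seasonal pattern.

```python
from statistics import mean

PERIOD = 12  # monthly data with yearly seasonality

def centered_trend(series):
    """Centered 2x12 moving average; None where the window does not fit."""
    half = PERIOD // 2
    trend = [None] * len(series)
    for t in range(half, len(series) - half):
        window = series[t - half : t + half + 1]  # 13 points
        # Half weight on both end points so the window spans exactly one year.
        total = 0.5 * window[0] + sum(window[1:-1]) + 0.5 * window[-1]
        trend[t] = total / PERIOD
    return trend

def seasonal_component(series):
    """Per-month mean of the detrended series, centered to average zero.

    Requires at least two full years of data so every month is observed.
    """
    trend = centered_trend(series)
    by_month = [[] for _ in range(PERIOD)]
    for t, (y, tr) in enumerate(zip(series, trend)):
        if tr is not None:
            by_month[t % PERIOD].append(y - tr)
    effects = [mean(values) for values in by_month]
    offset = mean(effects)
    return [effects[t % PERIOD] - offset for t in range(len(series))]

def seasonally_adjust(series):
    """Subtract the seasonal component, leaving trend plus remainder."""
    seasonal = seasonal_component(series)
    return [y - s for y, s in zip(series, seasonal)]

# Synthetic input: a linear trend plus a repeating pattern that sums to zero.
pattern = [3, 1, -2, -4, -1, 0, 2, 4, 1, -1, -2, -1]
series = [200_000 + 500 * t + 1_000 * pattern[t % PERIOD] for t in range(48)]

adjusted = seasonally_adjust(series)
print(adjusted[:3])
```

On this synthetic input the adjusted series recovers the underlying linear trend. STL differs in that it estimates trend and seasonal components with iterated Loess smoothing, which lets the seasonal pattern evolve over time and makes the fit more robust to outliers.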
The time series matrix ZHVI(t) has the same dimension as H(t) which is M by N (as noted, 12 x 57,022). While this theoretically could produce more than 680,000 different time series, in practice many time series are eliminated because of data sparseness or temporal volatility. The general logic determining whether a ZHVI time series for a particular combination of region and market segment will be suppressed from the publicly available data set includes the following elements:
Applying the suppression criteria above, there are 213,641 unique deliverable ZHVI time series for the report period ending October 2011. Table 3 below shows the count of regional time series by region level and market segment. For example, there are 589 time series at the county level for the single-family home variant of the ZHVI.
| Market Segment | National | State | MSA | County | Congressional District | City | Neighborhood | Zip |
|---|---|---|---|---|---|---|---|---|
| All Homes | 1 | 35 | 156 | 589 | 382 | 8,337 | 4,720 | 10,236 |
| Single Family | 1 | 35 | 156 | 589 | 382 | 8,320 | 4,447 | 10,143 |
| Condo | 1 | 35 | 151 | 472 | 369 | 3,339 | 2,366 | 5,414 |
| Studio | 1 | 35 | 156 | 585 | 382 | 5,931 | 2,245 | 7,589 |
| 1 Bedroom | 1 | 35 | 145 | 448 | 372 | 1,752 | 943 | 2,961 |
| 2 Bedroom | 1 | 35 | 156 | 589 | 382 | 7,068 | 3,839 | 9,251 |
| 3 Bedroom | 1 | 35 | 156 | 578 | 381 | 5,627 | 2,960 | 7,850 |
| 4 Bedroom | 1 | 35 | 156 | 582 | 381 | 5,958 | 2,733 | 8,200 |
| 5+Bedroom | 1 | 35 | 156 | 553 | 381 | 3,196 | 1,362 | 5,281 |
| Top Tier | 1 | 35 | 156 | 573 | 382 | 6,885 | 2,697 | 8,778 |
| Middle Tier | 1 | 35 | 156 | 575 | 382 | 7,675 | 3,421 | 9,647 |
| Bottom Tier | 1 | 35 | 156 | 571 | 381 | 6,779 | 3,336 | 8,806 |
| Total | 12 | 420 | 1,856 | 6,704 | 4,557 | 70,867 | 35,069 | 94,156 |
ZHVIs for all geographic regions and market segments are updated at the end of every month. Since there is variable latency in Zillow’s receipt of transactional data from public records, Zillow’s estimate of residual systematic error can change as new transactions arrive. Historical estimates of systematic error are recalculated monthly and incorporated into revised ZHVI time series. As a result, there can be restatements of the ZHVI for up to three years from the initial reporting date.
Prior to the release of the ZHVI data for October 2011, the geographic coverage of the data used as input into the ZHVI was significantly smaller than it is currently and the ZHVI methodology was somewhat different than the approach described herein. With the October 2011 data, the Zestimate footprint used as input to the ZHVI increased from 66 million homes (in 700 counties) to more than 83 million homes (in close to 3,000 counties). This expanded geographic coverage gives Zillow a much more comprehensive view of national and local housing markets. Since areas of the United States not previously covered by the ZHVI are now covered in the index, the ZHVI computed with the new coverage footprint has somewhat different trends at the national level than the previous version. See the interactive map below for differences between the older and current data footprints.
The goal of Zillow’s revision of the ZHVI methodology was to make the index more transparent and more responsive to rapidly changing local market conditions. Table 4 details the differences between the old and new methodologies. The core differences are: 1) a different method for computing medians for larger geographic areas (we now calculate the national and state levels as the median of Zestimates rather than the weighted mean over a set of counties); and 2) the replacement of a single approach that achieved both smoothing and seasonal adjustment with separate approaches for each task (we now apply a three-month moving average to the time series and then seasonally adjust the result, rather than applying smoothing splines, which potentially remove too much true variability in the underlying time series).
While the larger data footprint and revised methodology were both introduced in October 2011, Zillow re-estimated all ZHVIs historically with the new footprint and methodology so there is full historical continuity.
| | Old methodology | New methodology |
|---|---|---|
| National level | Weighted mean from approximately 700 county level ZHVIs + imputed ZHVIs for the missing counties | Median of Zestimates among all homes within each market segment (e.g. 83 million homes for the “All Homes” market segment) |
| State level | Weighted mean from available county level ZHVIs + imputed ZHVIs for the missing counties within the state | Median of Zestimates, same as all other regional levels |
| Smoothing | Apply smoothing splines to every time series | Simple 3-month moving average |
| Seasonality adjustment | Implicitly provided by the smoothing splines | STL method |
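The difference between the two aggregation approaches in the table can be seen in a toy example with two made-up counties. The old methodology's weights are not specified in the text, so weighting county medians by home count is an assumption made only for this sketch.

```python
from statistics import median

# Made-up Zestimates for two counties of different size and price level.
counties = {
    "A": [100_000, 120_000, 140_000],
    "B": [300_000, 500_000, 700_000, 900_000, 1_100_000],
}

# Old-style aggregate: weighted mean of county-level medians
# (weights assumed to be the number of homes in each county).
total_homes = sum(len(values) for values in counties.values())
weighted_mean = sum(len(v) * median(v) for v in counties.values()) / total_homes

# New-style aggregate: median of Zestimates pooled across all homes.
pooled_median = median([z for values in counties.values() for z in values])

print(weighted_mean, pooled_median)
```

The two aggregates need not agree: a mean of medians is pulled toward high-value counties, whereas the pooled median reflects the value of the typical home across the whole area.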
Figures 1, 2 and 3 below show comparisons of three time series: ZHVI-New, ZHVI-Old, and the Case-Shiller Home Price Index (HPI). These three indices are presented for three regions: the US national market, the 20 metropolitan markets included in the Case-Shiller 20-City Composite HPI and the 10 metropolitan markets included in the Case-Shiller 10-City Composite HPI. Here, the ZHVIs reported for the Composite-10 and Composite-20 markets are based on exactly the same counties used by Fiserv/CSW to compute their composite HPIs. Each ZHVI is scaled to a value of 100 in March 2000 so as to match the base year for the Case-Shiller HPIs.
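Rescaling a dollar-valued series so that it equals 100 at a base date is a simple proportional transformation. The sketch below uses made-up ZHVI values and a hypothetical helper named `rebase`.

```python
def rebase(series, base_key, base_value=100.0):
    """Scale a {date: value} series so that series[base_key] maps to base_value."""
    base = series[base_key]
    return {date: base_value * value / base for date, value in series.items()}

# Made-up median home values keyed by month.
zhvi = {"2000-03": 120_000, "2006-06": 210_000, "2011-10": 150_000}
indexed = rebase(zhvi, "2000-03")

print(indexed)
```

After rescaling, a value of 175 reads directly as "75% above the base-period level", which is what makes the ZHVI comparable to an index that is published in index points rather than dollars.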
For the US national market, the ZHVI-Old is in good agreement with the Case-Shiller HPI prior to the market peak in 2006. However, after the peak, the ZHVI-Old diverges from the Case-Shiller HPI due in large part, we believe, to the fact that the Case-Shiller index includes foreclosure re-sales whereas the ZHVI does not. As noted here, we also believe the inclusion of foreclosure re-sales in the Case-Shiller index leads to increased temporal volatility after mid-2009.
The ZHVI-New trend differs substantially from both the older ZHVI and the Case-Shiller HPI at the national level (Figure 1) but tracks both quite well for the Composite-20 and -10 markets (Figures 2 and 3). We know that most of the difference between the ZHVI-New and ZHVI-Old at the national level is attributable to the difference in the size of the housing footprint represented (the former covers 17 million more homes than the latter). This suggests that the Case-Shiller national footprint may be much closer to the ZHVI-Old footprint than to the larger, more comprehensive ZHVI-New footprint.
This explanation also seems consistent with the different trends shown in the various indices during the housing boom from 2001 to 2007. The ZHVI-New shows a less dramatic housing boom and bust because it covers a broader market, including many American markets away from the large, coastal metro areas that typified the housing boom. Homes in these markets experienced neither the unsustainable home value appreciation nor the subsequent strong correction.
The use of a larger footprint of data also results in a difference in the timing of the market peak at the national level, with Case-Shiller and ZHVI-Old marking the peak in the second quarter of 2006 whereas ZHVI-New records the market peak a full year later, in the second quarter of 2007. Again, this seems consistent with the ZHVI-New tracking a larger portion of the housing market, much of which lagged the events unfolding in the epicenters of the housing bust such as California, Florida, Arizona and Nevada.
Again, looking at both the Composite-20 and -10 (and therefore controlling for geography), both the trends and timing of market peaks align quite closely between ZHVI-New, ZHVI-Old and Case-Shiller HPI. The Case-Shiller HPI is still subject to more volatility in the post-2009 period for reasons which we believe, again, are related to its inclusion of foreclosure re-sales.
Figure 1: Comparing ZHVI-Old, ZHVI-New and the Case-Shiller Home Price Index for the United States.
Figure 2: Comparing ZHVI-Old, ZHVI-New and the Case-Shiller Home Price Index for the 20-City Composite.
Figure 3: Comparing ZHVI-Old, ZHVI-New and the Case-Shiller Home Price Index for the 10-City Composite.
The new Zestimate coverage (red) and the old coverage (blue) are shown in the interactive map below:
Cleveland, R.B., Cleveland, W.S., McRae, J.E., & Terpenning, I. (1990). STL: A Seasonal-Trend Decomposition Procedure Based on Loess. Journal of Official Statistics, 6, 3–73.
Dorsey, R.E., Hu, H., Mayer, W.J., & Wang, H. (2010). Hedonic versus repeat-sales housing price indexes for measuring the recent boom-bust cycle. Journal of Housing Economics, 19(2), 75–93.