Forecasting Home Value with AutoARIMA

11 minute read

Top 10 Most Profitable ZIP Codes for Real Estates Investment

Introduction

“Of all the ways the ultra-rich made their fortunes, real estate outpaced every other method 3 to 1”, wrote Liz Brumer-Smith of Millionarces. Like any other investments, market knowledge is important for investing in real estates. Fortunately, with how widely available data is in today’s society, there are plenty of data that can be used to help make real estate investment decisions. In this exercise, future pricing of single family homes were prediced using data from Zillow.

This exercise was originally done as part of a course assignment for Big Data Analytics (IST 718) at Syracuse University.

Set Up

This notebook uses geopandas and pmdarima. It’s designed to run on Google Colab.

This notebook also uses two custom helpers; one to handle the data pull, and one to handle the times series.

Both of these custom helpers use multiprocessing to accelerate computation.

Install Libraries to Google Colab

geopandas is a geographical DataFrame manipulator and analytical library. contextily allows basemaps to be added to geospatial visuals. pmdarima is a time series analysis library.

The Runtime may have to be restarted after installing these libraries.

%pip install --upgrade geopandas
%pip install --upgrade pyshp
%pip install --upgrade shapely
%pip install --upgrade rtree
%pip install --upgrade matplotlib
%pip install --upgrade contextily

%pip install --upgrade pmdarima
%pip install --upgrade numpy

Cloning from Github

This clones the help functions from Github. This also clone the data needed to expedite the script.

!git clone https://github.com/lokdoesdata/zillow-forecast.git
import sys
sys.path.append(r'/content/zillow-forecast')

Import Libraries

%matplotlib inline
from helper import geom_data
from helper.time_series import TimeSeries
import pandas as pd
import geopandas as gpd
import matplotlib.pyplot as plt
import contextily as ctx

Data

The main dataset used in this exercise is from Zillow’s Housing Price Index by ZIP Code. This data set contains monthly housing price by ZIP Code. At the time of the analysis, there are data from January 1996 to March 2020.

An initial data inspection does not shows that the Zillow data set has a lot of incorrect state assigned to the ZIP codes. For example, ZIP code 00601 is assigned to Mississippi when it belongs to Puerto Rico. This mainly affect ZIP codes outside of the 50 states. However, external dataset can be used to correct them.

Two geodatabases from Esri were used:

USA ZIP Code Areas which contains ZIP Code boundaries from TomTom (December 2019) and 2018 total population estimates from Esri demographics team.
USA ZIP Code Points which contains ZIP Code points from TomTom December 2019 and 2018 total population estimates from Esri demographics team. This file is used for single site ZIP Codes.

One shapefile from the United States’ Census:

Metropolitan and Micropolitan Statistical Area (MSA) from the 2010 Census (latest version at the time of analysis).

Zillow Data

The Zillow dataset was downloaded directly from Zillow.

df_zillow = pd.read_csv(r'https://files.zillowstatic.com/research/public/Zip/Zip_Zhvi_SingleFamilyResidence.csv')

df_zillow.drop([
    'RegionID', 
    'SizeRank', 
    'RegionType', 
    'StateName', 
    'State', 
    'City', 
    'Metro', 
    'CountyName'
], axis=1, inplace=True)

df_zillow.rename({'RegionName': 'ZIP Code'}, axis=1, inplace=True)

df_zillow['ZIP Code'] = [str(z).zfill(5) for z in df_zillow['ZIP Code']]

Geographical Data

The geographical information for each ZIP code is processed using a helper module.

This could take a while due to the volume of data. This is also quite computational heavy.

gdf_zip_code = geom_data.get_zip_code_gdf()

gdf_zip_code.rename({
    'ZIP_CODE': 'ZIP Code',
    'PO_NAME': 'PO Name',
    'STATE': 'State',
    'POPULATION': 'Pop',
    'SQMI': 'Sq Mi',
    'NAME': 'MSA'
}, axis=1, inplace=True)

gdf_zip_code = gdf_zip_code[['ZIP Code', 'PO Name', 'State', 'Pop', 'Sq Mi', 'MSA', 'x', 'y', 'geometry']]

gdf_zip_code = gdf_zip_code.merge(df_zillow, on='ZIP Code')

del df_zillow

Converting the coordinate reference system to web mercator projection

gdf_zip_code = gdf_zip_code.to_crs(epsg=3857)

Exploratory Analysis

Choropleth map of home values by ZIP code

Arkansas was used as an illustration of a choropleth map of home values by ZIP code. The bright yellow spot in the center of Arkansas is ZIP code 72223. It has the highest median home value in Arkansas as of the end of 2018. This is a ZIP code in Little Rock, AR.

ax = gdf_zip_code[gdf_zip_code['State']=='AR'].plot(column='2018-12-31', figsize=(7, 7), legend=True, edgecolor='k', alpha=0.5)
ax.axis('off')
ax.text(x=0.5, y=1.1, s="Arkansas' Housing Value at YE 2018", fontsize=16, weight='bold', ha='center', va='bottom', transform=ax.transAxes)
ctx.add_basemap(ax)

Arkansas's Map — Home values of Arkansas by ZIP code

Average home value by MSA

The average home value over time for Arkansas’ metropolitan and micropolitan areas with Fayetteville-Springdale-Rogers, Hot Springs, Little Rock, Mountain Home, and Searcy are highlighted. The non-highlighted lines are the average home values by MSA for the rest of Arkansas. The housing prices in Fayetteville-Springdale-Rogers, Hot Springs, Little Rock, and Mountain Home are a step above the rest of the state of Arkansas.

df_AR_MSA = gdf_zip_code[(gdf_zip_code['State']=='AR') & (gdf_zip_code['MSA']!='N/A')][['MSA'] + pd.date_range(start='1/1/1996', end='12/31/2018', freq='M').astype(str).tolist()].groupby('MSA').mean().T
df_AR_MSA.index = pd.to_datetime(df_AR_MSA.index)

color_dict = {
    'Little Rock-North Little Rock-Conway': 'red', 
    'Hot Springs': 'orange',
    'Fayetteville-Springdale-Rogers': 'green', 
    'Searcy': 'blue', 
    'Mountain Home': 'violet'
}

f2, ax2 = plt.subplots(figsize=(7, 5))

for col in df_AR_MSA:
    ax2.plot(
        df_AR_MSA.index, 
        df_AR_MSA[col], 
        color=color_dict.get(col, 'k'), 
        alpha=1 if col in color_dict.keys() else 0.1,
        label=col if col in color_dict.keys() else ''
    )

ax2.spines.right.set_visible(False)
ax2.spines.top.set_visible(False)
ax2.yaxis.set_major_formatter('${x:,.0f}')
ax2.yaxis.set_tick_params(which='major')
ax2.legend(loc='lower right', ncol=2)

ax2.text(x=0.5, y=1.1, s="Arkansas' Average House Value by MSA", fontsize=16, weight='bold', ha='center', va='bottom', transform=ax2.transAxes)

Arkansas's MSA Time Series — Changes in Home values of Arkansas MSA

Time Series Forecasting

Model

The pmdarima package, and its stepwise approach, was used to train a SARIMA time series model for each ZIP Code. The general steps used to train each model is highlighted below:

Determine the order of differencing (d) using a KPSS test.
Determine the optimal auto regression (p) and moving average (d) with a maximum of five periods.
With seasonal period (m) sets at 12 for monthly, determine the order of seasonal differencing (D) using a Canova-Hansen test.
Determine the optimal seasonal auto regression (P) and seasonal moving average (Q) with a maximum of two periods.
Select the optimal model based on Akaike information criterion (AIC), while ensuring that the model is numerically stable.

This was done with a train/test split with 2016 to 2018 data as training data and 2019 data as testing data. As such, only ZIP code with data between that time period were used in the analysis.

ts = TimeSeries(gdf_zip_code, '1/31/2019')
df_forecast = ts.forecast_all_states()

Results

The top 10 optimal seasonal ARIMA model is shown below:

df_forecast.model_order.value_counts().head(10)

ARIMA Order	Number of ZIP Codes
ARIMA(0,1,0)(0,0,0)[12] intercept	12,185
ARIMA(0,2,0)(0,0,0)[12]	2,087
ARIMA(1,1,1)(0,0,0)[12] intercept	1,282
ARIMA(1,1,0)(0,0,0)[12] intercept	1,029
ARIMA(0,1,0)(0,0,1)[12] intercept	932
ARIMA(0,1,1)(0,0,0)[12] intercept	652
ARIMA(0,1,0)(1,0,0)[12] intercept	644
ARIMA(0,1,0)(0,0,0)[12]	619
ARIMA(0,1,0)(2,0,0)[12] intercept	529
ARIMA(2,1,2)(1,0,1)[12] intercept	523

In summary, majority of the models are non-stationary, requiring differencing transformation. About half of the ZIP codes’ housing price is affected by seasonality.

ARIMA Example

Using ZIP Code 19124 as an example. The ACF plot shows a slow and steady decay, which suggests that the time series is non-stationary. A KPSSS test was also conducted and confirmed that two order of differencing is appropriate for this time series.

AutoARIMA determined that SARIMA(0,2,0)(0,0,0[12]) was the best model for this ZIP Code based on AIC. Please note that AIC cannot be compared directly for models with different orders.

Result

The profit and the ROI were calculated based on December 2018 actual housing value and December 2019 forecasted housing value.

df_investment = gdf_zip_code[['ZIP Code', 'x', 'y', 'PO Name', 'State', 'Pop', '2018-12-31']].merge(
    df_forecast[['zip_code', 'test_lci_2019-12-31', 'test_pred_2019-12-31', 'test_uci_2019-12-31']].rename({'zip_code':'ZIP Code'}, axis=1),
    on='ZIP Code'
)

df_investment.rename(
    {
        'Pop': 'Population', 
        '2018-12-31': 'Before',
        'test_pred_2019-12-31': 'After',
        'test_lci_2019-12-31': 'After (Lower Bound)',
        'test_uci_2019-12-31': 'After (Upper Bound)',
    }, axis=1, inplace=True
)

df_investment['Profit'] = df_investment['After'] - df_investment['Before']
df_investment['ROI'] = df_investment['Profit']/df_investment['Before']
df_investment['Population'].fillna(0, inplace=True)
df_investment['Population'] = df_investment['Population'].astype(int)
df_investment['Name'] = [f'{p}, {s} {z}' for (p, s, z) in zip(df_investment['PO Name'], df_investment['State'], df_investment['ZIP Code'])]

df_investment = df_investment[['Name', 'x', 'y', 'Population', 'Profit', 'ROI', 'Before', 'After', 'After (Lower Bound)', 'After (Upper Bound)']]

Investment Opportunity

ZIP Code that provides the best real estate investment opportunity can be a challenging question. There are different ways to evaluate this and they come with different considerations.

Simple Helper Function

A simple helper functionw as created to help visualize the investment opportunity

# Simple helper function for plotting

def investment_plot(df, title):
    import seaborn as sns
    import matplotlib.pyplot as plt
    from matplotlib.lines import Line2D

    df = df.copy()
    df.reset_index(inplace=True, drop=True)

    f, ax = plt.subplots(figsize=(7, 7))
    ax.axis('off')

    with sns.axes_style('whitegrid'):
        sns.despine(left=True, bottom=True)
        for idx, row in df.iterrows():
            idx = idx*2
            ax.text(s='${:,.0f}k'.format(round(row['Before']/1000,1)), x=row['Before'], y=(-idx-0.1), horizontalalignment='right', verticalalignment='top')
            ax.text(s='${:,.0f}k'.format(round(row['After']/1000,1)), x=row['After'], y=(-idx-0.1), horizontalalignment='left', verticalalignment='top')
            ax.text(s='{} ({}% ROI)'.format(row['Name'], round(row['ROI']*100,1)), x=(row['Before']+row['After'])/2, y=-idx+0.15, horizontalalignment='center', verticalalignment='bottom')

            sns.scatterplot(x = [row['Before']], y = [-idx], color='red', ax=ax)
            sns.scatterplot(x = [row['After']], y = [-idx], color='green', ax=ax)
            sns.lineplot(x = [row['Before'], row['After']], y = [-idx, -idx], ax=ax, lw=2, color='black')
            
            ax.set_title(title, fontdict=dict(fontsize=16))

        legend_elements = [
            Line2D([0], [0], marker='o', color='w', label='Before', markerfacecolor='red'), 
            Line2D([0], [0], marker='o', color='w', label='After', markerfacecolor='green'),
        ]
        ax.legend(handles=legend_elements, loc='lower right')


    return(df)

Return on Investment (ROI)

Based on return on investment, the three zip codes that provided the best real estate investment opportunity are:

Bend, TX 76824
Ardenvoir, WA 98811
Philadelphia, PA 19132

All three ZIP codes ziped over 40% ROI. However, looking at the population of Bend, and Ardenvoir with unrecorded population, suggested that there are not enough people in those ZIP codes for them to be worthwhile investment. Philadelphia, while having significant population, have limited profitability with only $16k per house.

investment_plot(df_investment.sort_values('ROI', ascending=False).head(10), 'Top 10 Zip Code by ROI')

Name	x	y	Population	Profit	ROI	Before	After	After (Lower Bound)	After (Upper Bound)
Bend, TX 76824	-98.523425	31.108534	0	142,730.32	0.602316	236,969.00	379699.32	366,535.97	392,862.68
Ardenvoir, WA 98811	-120.358272	47.731422	0	90,864.00	0.447093	203,233.00	294097.00	231,446.67	356,747.33
Philadelphia, PA 19132	-75.167859	39.996737	37,140	16,768.11	0.438577	38,233.00	55001.11	42,324.08	67,678.13
Knoxville, TN 37915	-83.901482	35.971976	6,421	31,392.00	0.380246	82,557.00	113949.00	87,467.16	140,430.84
Mesilla Park, NM 88047	-106.726674	32.223562	2,211	98,176.63	0.374985	261,815.00	359991.63	314,907.79	405,075.47
Hines, IL 60141	-87.839290	41.862194	278	19,378.00	0.370686	52,276.00	71654.00	57,427.14	85,880.87
Indianapolis, IN 46201	-86.106321	39.772938	31,897	25,879.17	0.369407	70,056.00	95935.17	82,527.93	109,342.40
Bond, CO 80423	-106.694352	39.905940	179	179,999.12	0.364169	494,273.00	674272.12	646,032.16	702,512.07
Boelus, NE 68820	-98.710625	41.102433	450	58,812.00	0.361608	162,640.00	221452.00	165,519.05	277,384.95
Columbus, OH 43205	-82.968058	39.957867	12,999	60,365.22	0.360963	167,234.00	227599.22	210,427.75	244,770.69

Top 10 ZIP by ROI — Top 10 ZIP Codes for Real Estates Investment based on ROI

Profit

From a purely profit standpoint, the three most profitable ZIP Codes are:

Beverly Hills, CA 90210
Atherton, CA 94027
Santa Monica, CA 90402

About half a million of profit is made per house in those three ZIP Code. However, those houses required signficiant upfront captial to purchase with close to $4M per house for the cheapest of the three ZIP Codes. This could be proven to be a real challenge for some investers without the necessary captials.

One additional note is that California made up of all the top 10 ZIP Codes in terms of profit.

investment_plot(df_investment.sort_values('Profit', ascending=False).head(10), 'Top 10 Zip Code by Profit')

Name	x	y	Population	Profit	ROI	Before	After	After (Lower Bound)	After (Upper Bound)
Beverly Hills, CA 90210	-118.399289	34.081401	24,230	543,504	0.103698	5,241,242	5,784,746	5,685,539	5,883,954
Atherton, CA 94027	-122.192663	37.458278	7,027	509,630	0.082372	6,186,961	6,696,591	6,481,279	6,911,902
Santa Monica, CA 90402	-118.507079	34.031337	12,447	409,184	0.103770	3,943,195	4,352,379	4,261,438	4,443,320
Stinson Beach, CA 94970	-122.649718	37.902736	721	408,392	0.132606	3,079,741	3,488,133	3,384,142	3,592,124
Beverly Hills, CA 90212	-118.399073	34.063506	12,610	356,692	0.123623	2,885,330	3,242,022	3,177,961	3,306,082
Palo Alto, CA 94301	-122.153657	37.443679	18,461	342,138	0.076721	4,459,512	4,801,650	4,512,463	5,090,837
Los Angeles, CA 90020	-118.301614	34.066172	40,967	332,700	0.094559	3,518,436	3,851,136	3,251,457	4,450,815
West Hollywood, CA 90069	-118.382874	34.089240	21,120	319,686	0.120307	2,657,244	2,976,930	2,918,801	3,035,058
Beverly Hills, CA 90211	-118.382430	34.066743	9,219	309,528	0.144513	2,141,866	2,451,394	2,159,287	2,743,501
Santa Monica, CA 90403	-118.492328	34.028155	25,766	304,666	0.130743	2,330,269	2,634,935	2,592,932	2,676,939

Top 10 ZIP by Profit — Top 10 ZIP Codes for Real Estates Investment based on Profit

When the ZIP codes with most raw profit are ZIP Codes with the most ROI are stacked on top of one another, it clearly shows that getting lesser ROI in a more expensive area is a better investment than higher ROI in a less expensive area.

Name	x	y	Population	Profit	ROI	Before	After	After (Lower Bound)	After (Upper Bound)
Beverly Hills, CA 90210	-118.399289	34.081401	24,230	543,504	0.103698	5,241,242	5,784,746	5,685,539	5,883,954
Atherton, CA 94027	-122.192663	37.458278	7,027	509,630	0.082372	6,186,961	6,696,591	6,481,279	6,911,902
Santa Monica, CA 90402	-118.507079	34.031337	12,447	409,184	0.103770	3,943,195	4,352,379	4,261,438	4,443,320
Stinson Beach, CA 94970	-122.649718	37.902736	721	408,392	0.132606	3,079,741	3,488,133	3,384,142	3,592,124
Beverly Hills, CA 90212	-118.399073	34.063506	12,610	356,692	0.123623	2,885,330	3,242,022	3,177,961	3,306,082
Bend, TX 76824	-98.523425	31.108534	0	142,730.32	0.602316	236,969.00	379699.32	366,535.97	392,862.68
Ardenvoir, WA 98811	-120.358272	47.731422	0	90,864.00	0.447093	203,233.00	294097.00	231,446.67	356,747.33
Philadelphia, PA 19132	-75.167859	39.996737	37,140	16,768.11	0.438577	38,233.00	55001.11	42,324.08	67,678.13
Knoxville, TN 37915	-83.901482	35.971976	6,421	31,392.00	0.380246	82,557.00	113949.00	87,467.16	140,430.84
Mesilla Park, NM 88047	-106.726674	32.223562	2,211	98,176.63	0.374985	261,815.00	359991.63	314,907.79	405,075.47

Combined top 10 — Side-by-side of the Top 10 ZIP codes by ROI And Profit

Population as a factor

Population can be a sign of potential home buyers in a ZIP code. While profitability is important, they need to be in area with potential home buyers.

gdf_investment_top1k = df_investment.sort_values('Profit', ascending=False).head(1000)
gdf_investment_top1k = gpd.GeoDataFrame(
    gdf_investment_top1k, 
    geometry=gpd.points_from_xy(gdf_investment_top1k.x, gdf_investment_top1k.y)
)
gdf_investment_top1k = gdf_investment_top1k.set_crs(4326).to_crs(3857)

Most profitable ZIP codes

The most profitable ZIP codes are shown in the figure below. California looks to be filled with profitable ZIP codes, followed by New York and its surrounding areas.

ax3 = gdf_investment_top1k.sort_values('Profit').plot(
    column='Profit', figsize=(8, 3), legend=True, edgecolor='k', alpha=0.5)

ax3.axis('off')
ax3.text(x=0.5, y=1.1, s='Most profitable ZIP codes', fontsize=16, weight='bold', ha='center', va='bottom', transform=ax3.transAxes)
ctx.add_basemap(ax3)

Top 1,000 most profitable zip codes — Top 1,000 most profitable ZIP codes for Real Estates investment

Population of the most profitable ZIP codes

The population of the most profitable ZIP codes in the above figure is shown below. Both California and New York have sizeable population.

ax4 = gdf_investment_top1k.sort_values('Population').plot(
    column='Population', figsize=(8, 3), legend=True, edgecolor='k', alpha=0.5)

ax4.axis('off')
ax4.text(x=0.5, y=1.1, s='Population of most profitable ZIP codes', fontsize=16, weight='bold', ha='center', va='bottom', transform=ax4.transAxes)
ctx.add_basemap(ax4)

Conclusion

While return on investment is an important metric for any type of investment, it is important to not overlook at the potential profit that can be made in a market. In a simplified way, there are two main factors to determine the size of the market; how much does a home cost, and how many homes can be sold? Using the median housing price and the population data, it was determined that if there is enough initial capital, California is the best place to invest for real estates.

Share on

Twitter Facebook LinkedIn

Lok Ngan