  • American Community Survey
  • Case-Shiller House Price Index (HPI)
  • Census 2007
  • Construction of Housing Units
  • Market Value of 1 month rent in a Room
  • Vacancies
  • Mortgage Rates
  • Federal Housing Finance Agency HPI

Cleaning and Analysis

To facilitate sharing data, we conducted both data cleaning and analysis with the open source statistical software R, which is available free of charge at http://www.r-project.org . R is considered a standard among statisticians, and it has several advantages for this project. We are able to manipulate extremely large data sets (>2GB) on a normal desktop. It also allows us to produce impressive graphics with minimal coding.

Clean Data is...

  • Consistent : In a few data sets, county names change over the course of a few years. This affects how we compare yearly data.
  • Concise : Some data sets contained far more information than we needed. For example, the American Community Survey contains over 200 questions. We were only interested in the answer to one of those questions.
  • Complete : One of the data sets that was collected was missing around 80% of the data.
  • Correct : We must assume that the data we collect is not corrupt and was recorded properly. Some smaller data sets contained unusual observations. We used our own discretion when deciding which data sets were correct.
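
The four properties above can be checked with a few lines of base R. The following is a minimal sketch on a made-up data frame: the column names (`county`, `year`, `rent`, `extra`), the values, and the rename table are assumptions for illustration and are not taken from the project's data sets or code.

```r
# Toy data with hypothetical columns and made-up values.
raw <- data.frame(
  county = c("Dade", "Miami-Dade", "Merced", "Merced"),
  year   = c(1996, 1998, 1996, 1998),
  rent   = c(510, NA, 420, 450),
  extra  = c("a", "b", "c", "d"),
  stringsAsFactors = FALSE
)

# Consistent: map older county names onto one current name.
renames <- c("Dade" = "Miami-Dade")
old <- raw$county %in% names(renames)
raw$county[old] <- unname(renames[raw$county[old]])

# Concise: keep only the columns needed for the analysis.
clean <- raw[, c("county", "year", "rent")]

# Complete: share of missing values in each column.
missing_share <- colMeans(is.na(clean))

# Correct: flag implausible values for manual review instead of silently dropping them.
suspicious <- clean[!is.na(clean$rent) & clean$rent <= 0, ]
```

Here `missing_share` reports that a quarter of the `rent` values are missing, which is the kind of summary that would reveal a data set missing around 80% of its data.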

Cleaning Process

1. First we start with "dirty" data. (Fig. 1)

2. Next we must download the data. A section of download code is shown below. (Fig. 2)

3. Once we have the data, we clean it as best we can according to the rules describing clean data above. A section of cleaning code is shown below. (Fig. 3)

4. Now that the data has been cleaned, it may look like the top part of the data below. (Fig. 4)

5. With clean data, we are able to explore it. The code below (Fig. 5) is the command used to produce the plot in Fig. 6.

6. With R we are able to produce complex plots with a minimal amount of code. (Fig. 6)
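
The six steps can be sketched end to end in base R. This is an illustrative outline and not the project's actual code: the URL in the commented download lines is a placeholder, and the area name, column names, and vacancy values are made up so that the sketch runs offline.

```r
# Step 2 (download) would normally look like the two commented lines below.
# The URL is a placeholder, not a real data source.
# download.file("https://example.org/vacancies.csv", destfile = "vacancies.csv")
# dirty <- read.csv("vacancies.csv", stringsAsFactors = FALSE)

# Steps 1-2: a small stand-in for the downloaded "dirty" file (made-up values).
dirty <- read.csv(text = "area,year,vacancy_rate
AREA A,2005,11.2
Area A,2006,n/a
area a,2007,12.9", stringsAsFactors = FALSE)

# Step 3: clean. Make area names consistent and turn "n/a" into a real missing value.
clean <- dirty
clean$area <- tolower(clean$area)
clean$vacancy_rate[clean$vacancy_rate == "n/a"] <- NA
clean$vacancy_rate <- as.numeric(clean$vacancy_rate)

# Step 4: look at the top part of the cleaned data.
head(clean)

# Steps 5-6: a single command produces the plot.
plot(clean$year, clean$vacancy_rate, type = "b",
     xlab = "Year", ylab = "Vacancy rate (%)")
```

Keeping the download, cleaning, and plotting steps in one script means anyone can rerun the whole process from the raw file.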

Interesting Findings

Location, Location, Location...

The data graphed (Fig. 7 and Fig. 8) is from the Federal Housing Finance Agency (FHFA) house price index (HPI). Both of these graphs analyze when the HPI peaked for each metropolitan statistical area (MSA).

Looking at both graphs, we believe that the timing of the peak is very significant. If a state's HPI peaked earlier than 2006 or later than 2007, it was not as greatly affected. This also supports the claim that California and Florida were impacted the most.

In Figure 7, you can see that both California and Florida peaked around the same time. The graph shows in what year each MSA reached its maximum housing price.

In Figure 8, every point is an MSA, labeled by state. The graph plots the time of peak HPI against the percent change in HPI from the maximum HPI to the HPI in the first quarter of 2009. It shows that if HPI peaked between 2006 and 2007, then that state typically experienced a much larger percent change in HPI.
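
The two quantities plotted in Figure 8 can be computed per MSA as in the sketch below. It assumes the percent change is measured relative to the peak, that is, 100 × (HPI in 2009 Q1 − peak HPI) / peak HPI. The MSA names, column names, and index values are made up for illustration and are not FHFA data.

```r
# Made-up index values for two hypothetical MSAs; time is in decimal years,
# so 2009.00 stands for the first quarter of 2009.
hpi <- data.frame(
  msa  = rep(c("MSA A", "MSA B"), each = 4),
  time = rep(c(2005.00, 2006.50, 2008.00, 2009.00), times = 2),
  hpi  = c(150, 200, 160, 120,   # MSA A peaks at 2006.50
           110, 118, 120, 117),  # MSA B peaks at 2008.00
  stringsAsFactors = FALSE
)

# For each MSA: find the peak, then compare it with the 2009 Q1 value.
peak_summary <- do.call(rbind, lapply(split(hpi, hpi$msa), function(d) {
  peak  <- d[which.max(d$hpi), ]
  final <- d$hpi[d$time == 2009.00]
  data.frame(
    msa        = peak$msa,
    peak_time  = peak$time,
    pct_change = 100 * (final - peak$hpi) / peak$hpi,
    stringsAsFactors = FALSE
  )
}))

print(peak_summary)
```

In this toy example, MSA A peaks in mid-2006 and falls 40% by 2009 Q1, while MSA B peaks in 2008 and falls only 2.5%, which mirrors the pattern described above.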

Merced, CA

The city with the greatest percent change in the FHFA HPI was Merced, CA. This is very unusual for a small city. Further research into Merced showed that the University of California, Merced finished construction in late 2005. Using both Figures 9 and 10, we hypothesize that construction increased because of the need for housing for UC Merced students and employees.

Myth Busters

After discovering Merced, CA, we decided to look more closely at college towns. Contrary to popular belief, college towns were not greatly impacted by the housing crisis. They were affected more by their location than by being a "college town". (Fig. 11)

Other Explorations

  • Vacation Spots : Are areas where people own a second home more affected?
  • Renting vs. Owning : Is it better to rent or own a house?
  • Migration : Are cities that experienced massive population change affected?
  • Gross Domestic Product : Can we categorize a certain city by industry? Is there a relationship between cities that were hit by the housing crisis?

Communication and Future Work

It is extremely important that all of our data cleaning and findings are reproducible. We've made both the data and programming code available to the public through our PFUG's GitHub repository at http://github.com/hadley/data-housing-crisis . GitHub is a website that tracks changes made to data and code by multiple individuals.

GitHub is advantageous to both our research group and the general public. First, we are able to freely store large amounts of data. It also allows us to work on the same data without having to e-mail changes back and forth. In addition, others can view and download our data for free. We hope that by keeping the code transparent and self-replicating, others are able to easily build on our work.

We would like to develop a website that will allow users to easily access the data they are interested in, which would otherwise be a daunting task with a data set of this size. Because our analysis and findings also involve large amounts of information (such as construction price time series for each US metropolitan area), we are exploring interactive graphical methods for displaying this information. Our future research will involve using the internet application Many Eyes, http://manyeyes.alphaworks.ibm.com , and then eventually the program Protovis, http://vis.stanford.edu/protovis , to create this website.

Acknowledgements

This Connexions module describes work conducted as part of Rice University's VIGRE program, supported by National Science Foundation grant DMS-0739420.

Source:  OpenStax, The art of the pfug. OpenStax CNX. Jun 05, 2013 Download for free at http://cnx.org/content/col10523/1.34