[Editor's Note: This post was authored originally in 2006. Edgar will be writing an update shortly, but some things have definitely changed in that time.]
As a web developer, I’m constantly having to deal with zip codes, with almost terrifyingly frequency. These experiences no longer leave me in cold sweat; instead, I’ve warmed to the concept as it’s nice to be able to validate someone’s zip code based on the city and state they entered. (Or to autopopulate their city and state based on a zip code.) This can be a major headache saver if you’re doing bulk or CASS-certified mailings as the cost increases for each unverified address. Ensuring the accuracy of your data at the data entry point is always a good idea and is a relatively simple endeavor. Follow us after the jump for a look into what can be done about zips.

Further, if you’re offering services to people based on location, zip codes provide a nice means by which you can estimate distance—like when you enter your zip code over the phone to find movie theaters near to you.

Now, you can buy commercial products with zip codes and latitude and longitude information for various prices, with high accuracy and more granularity in address positions. But their licenses are too prohibitive for some uses and not at all cost-effective if you want to use them in an open source application.

There are also several web sites that do what we want, but each of them have prohibitive terms of service or expensive web service subscription fees. If you don’t care about this, or you’d like to hack it out for yourself, or perhaps just enjoy pillaging a web service until they banninate your IP, you can try any of the following:

Even the USPS has a “Web 2.0″-ish web service that does much of what we want, but even they want your dolla dolla bills y’all.

So what’s our solution? We roll our own.

To get things going and to be able to perform city/state zip lookups or distance calculations, we need at a minimum a MySQL database table full of zip codes, city names, state names, latitudes and longitudes for comparison.

Enter TIGER and Gazetteer.

The US Census Bureau has a data product called TIGER. TIGER stands for Topologically Integrated Geographic Encoding and Referencing system, and allows for cross checking various information by geography. Most importantly, they provide a dBase database of zip codes, latitude and longitude, PONAME, state code and county code in Federal Information Processing Standards (FIPS).

This data is old, compiled from the 1996 census and updated some for the 2000 census. Since its purpose was for the census, the data may leave out some rural areas. Furthermore, the latitude and longitude locations are estimates and may not even be within the zip code. SURPRISE! However, as it is cheap as free, it may be the best option for many projects. You can download the TIGER file from
http://www.census.gov/geo/www/tiger/zip1999.html
, and you can download several of the important Gazetteer files from http://www.census.gov/geo/www/gazetteer/places2k.html.
Note: These files are all for the 2000 census. Future census data will probably be posted online, barring apocalypse, and in that case you should check for newer files as this article may not be updated. There are also some additional resources for free zip code databases that are of reasonable quality/timeliness.

The Gazetteer files have more to do with the census. They include water and land area, populations, names and FIPS codes of various different regions of varying atomicity. Together, there are five files of which zcta5.txt contains zip code to latitude/longitude information. The others provide state, county, county subdivision, and “place” which is “incorporated place or Census Designated Place”. You can get these files from http://www.census.gov/geo/www/gazetteer/places2k.html. You may also download a phpMyAdmin export file TIGER.tgz that may contain the information you need.

Validating Zip Codes

Validating zip codes is pretty easy once we have our database. We just compare
the given zip code to the city (either through the gazetteer city name or tiger poname) and state.

SELECT 
    zip_codes.poname AS city,
    states.name AS state,
    states.abbreviation AS abrv
FROM  
    zip_codes
LEFT JOIN 
    states ON states.fips_code = zip_codes.state_fips_code
WHERE 
    zip_code =  {your_zip_code}

Were we very ambitious we might even look up the city and state, find all zip codes in there, preform a check of the minimum Levenshtein distance from each zip code within the city, compare that from the Levenshtein of the city and state from the one indicated by the zip code, and come up with a very smart guess as to what they meant to type! But that’s for another post.

Now, one more thing before moving on: zip codes are crazy. They exist at the will of the USPS for them to organize distribution of mail. You may have notions about them, like “maybe they’re convex,” or “perhaps they’re bound within a single municipality”: These notions are wrong! And you are bad and wrong for thinking them.

Beware: Zip codes can extend between multiple cities or even states, so when doing zip code validation, be open to the possibility that the zip code they entered may belong to two different places—and that the data you have may be out of date (for which I blame President Bush). [The views expressed in this blog post do not *necessarily* represent the views of management. But they line up pretty nicely here.—Ed]

Estimating Distances

Validating zip codes was cake, but we want more! We can find which zip codes are nearer than others by simply taking the planar distance between coordinates in latitude and longitude, but the numbers we get are not the actual distance between the zip codes. No, for the real deal we want to be able to determine (read: estimate to a reasonable degree of accuracy) the distance between zip codes—terrestrial distance, as we would travel. To do that we need to calculate the great circle distance. Here come the math…

First, let’s define some constants.

  • The radius of the Earth, R = 6367 km = 3956 mi (choose the one which has the units you want the result in)
  • Conversion from radians to degrees, c = 180/pi = 57.295780

Let [math resolution="110"](\phi_1,\lambda_1)[/math] and [math resolution="110"](\phi_2,\lambda_2)[/math] represent the latitude and longitude ordered pairs of each location. We can find the distance [math resolution="110"]\Delta\sigma[/math] between the two points as:

[equation title="great circle distance"]\Delta\sigma=\arccos\left\{\sin\phi_1\sin\phi_2+\cos\phi_1\cos\phi_2\cos\Delta\lambda\right\}[/equation]

To convert this into miles we add conversion factors and arrive at the following SQL query, where our_latitude and our_longitude are constants relating to our position.

SELECT 
    {R} *
    ACOS(
        SIN( {our_latitude} / {c} ) * SIN( tiger.latitude / {c} ) +
        COS( {our_latitude} / {c} ) * COS( tiger.latitude / {c} ) *
        COS( tiger.longitude / {c} - {our_longitude} / {c} )
    )
    AS distance FROM tiger

Wikipedia notes that arc cosine has large roundoff errors for small differences in distance, and recommends using the Haversine function (below) for small distances (small, as in not opposite sides of the Earth):

[equation title="Haversine function"]\Delta\sigma = 2 \arcsin \left\{ \sqrt{ \sin^2 \left( \frac{ \phi_2-\phi_1 }{2} \right) + \cos{\phi_1} \ cos{\phi_2} \ sin^2 \left( \frac{\Delta\lambda}{2} \right)} \right \}[/equation]

SELECT 
    {R} * 2 *
    ASIN(
        SQRT(
            POW( SIN( ( tiger.latitude - {our_latitude} ) / ( {c} * 2 )), 2) 
            +
            COS( tiger.latitude / {c} ) * COS( {our_latitude} / {c} *
            POW( SIN( ( tiger.longitude - {our_longitude} ) / ( {c} * 2 )), 2)
        )
    )
    AS distance FROM tiger

Discussion

Since our dataset is zip code mappings to coordinates in latitude and longitude, and since those coordinates are only guaranteed to be somewhat near the interior of the zip code region (concave zip codes may have GPS coordinates outside their interior), numerical errors are considerably less of a concern to us than to, say, those who program Tomahawk cruise missiles. We’re never going to get “great” accuracy. We’ll instead end up with “perfectly serviceable” quality.

Newer versions of MySQL support R-Tree indexes on coordinates which allow for more sophisticated geometrical queries. Want to find zip codes that have a nontrivial component within your city? Get the boundaries of your city in OpenGIS format and limit the zip codes in your search to those within the city. There’s a wealth of power and synthesis on said power in GIS implementations that rely on R-Trees. Sadly MySQL’s implementation of R-Tree indexing is weaker than PostgreSQL, so it might make sense for you to jump ship to that DB depending on your requirements.

For more information:

Posted in: Development, How To