Zip Codes in Web Apps: A Tutorial on Validating Cities & Calculating Distance
By Edgar Hassler
On June 4th, 2008
[Editor's Note: This post was authored originally in 2006. Edgar will be writing an update shortly, but some things have definitely changed in that time.]
As a web developer, I’m constantly having to deal with zip codes, with almost terrifyingly frequency. These experiences no longer leave me in cold sweat; instead, I’ve warmed to the concept as it’s nice to be able to validate someone’s zip code based on the city and state they entered. (Or to autopopulate their city and state based on a zip code.) This can be a major headache saver if you’re doing bulk or CASS-certified mailings as the cost increases for each unverified address. Ensuring the accuracy of your data at the data entry point is always a good idea and is a relatively simple endeavor. Follow us after the jump for a look into what can be done about zips.
Further, if you’re offering services to people based on location, zip codes provide a nice means by which you can estimate distance—like when you enter your zip code over the phone to find movie theaters near to you.
Now, you can buy commercial products with zip codes and latitude and longitude information for various prices, with high accuracy and more granularity in address positions. But their licenses are too prohibitive for some uses and not at all cost-effective if you want to use them in an open source application.
There are also several web sites that do what we want, but each of them have prohibitive terms of service or expensive web service subscription fees. If you don’t care about this, or you’d like to hack it out for yourself, or perhaps just enjoy pillaging a web service until they banninate your IP, you can try any of the following:
Even the USPS has a “Web 2.0″-ish web service that does much of what we want, but even they want your dolla dolla bills y’all.
So what’s our solution? We roll our own.
To get things going and to be able to perform city/state zip lookups or distance calculations, we need at a minimum a MySQL database table full of zip codes, city names, state names, latitudes and longitudes for comparison.
Enter TIGER and Gazetteer.
The US Census Bureau has a data product called TIGER. TIGER stands for Topologically Integrated Geographic Encoding and Referencing system, and allows for cross checking various information by geography. Most importantly, they provide a dBase database of zip codes, latitude and longitude, PONAME, state code and county code in Federal Information Processing Standards (FIPS).
This data is old, compiled from the 1996 census and updated some for the 2000 census. Since its purpose was for the census, the data may leave out some rural areas. Furthermore, the latitude and longitude locations are estimates and may not even be within the zip code. SURPRISE! However, as it is cheap as free, it may be the best option for many projects. You can download the TIGER file from
http://www.census.gov/geo/www/tiger/zip1999.html, and you can download several of the important Gazetteer files from http://www.census.gov/geo/www/gazetteer/places2k.html.
Note: These files are all for the 2000 census. Future census data will probably be posted online, barring apocalypse, and in that case you should check for newer files as this article may not be updated. There are also some additional resources for free zip code databases that are of reasonable quality/timeliness.
The Gazetteer files have more to do with the census. They include water and land area, populations, names and FIPS codes of various different regions of varying atomicity. Together, there are five files of which zcta5.txt contains zip code to latitude/longitude information. The others provide state, county, county subdivision, and “place” which is “incorporated place or Census Designated Place”. You can get these files from http://www.census.gov/geo/www/gazetteer/places2k.html. You may also download a phpMyAdmin export file TIGER.tgz that may contain the information you need.
Validating Zip Codes
Validating zip codes is pretty easy once we have our database. We just compare
the given zip code to the city (either through the gazetteer city name or tiger poname) and state.
SELECT
zip_codes.poname AS city,
states.name AS state,
states.abbreviation AS abrv
FROM
zip_codes
LEFT JOIN
states ON states.fips_code = zip_codes.state_fips_code
WHERE
zip_code = {your_zip_code}
Were we very ambitious we might even look up the city and state, find all zip codes in there, preform a check of the minimum Levenshtein distance from each zip code within the city, compare that from the Levenshtein of the city and state from the one indicated by the zip code, and come up with a very smart guess as to what they meant to type! But that’s for another post.
Now, one more thing before moving on: zip codes are crazy. They exist at the will of the USPS for them to organize distribution of mail. You may have notions about them, like “maybe they’re convex,” or “perhaps they’re bound within a single municipality”: These notions are wrong! And you are bad and wrong for thinking them.
Beware: Zip codes can extend between multiple cities or even states, so when doing zip code validation, be open to the possibility that the zip code they entered may belong to two different places—and that the data you have may be out of date (for which I blame President Bush). [The views expressed in this blog post do not *necessarily* represent the views of management. But they line up pretty nicely here.—Ed]
Estimating Distances
Validating zip codes was cake, but we want more! We can find which zip codes are nearer than others by simply taking the planar distance between coordinates in latitude and longitude, but the numbers we get are not the actual distance between the zip codes. No, for the real deal we want to be able to determine (read: estimate to a reasonable degree of accuracy) the distance between zip codes—terrestrial distance, as we would travel. To do that we need to calculate the great circle distance. Here come the math…
First, let’s define some constants.
- The radius of the Earth, R = 6367 km = 3956 mi (choose the one which has the units you want the result in)
- Conversion from radians to degrees, c = 180/pi = 57.295780
Let [math resolution="110"](\phi_1,\lambda_1)[/math] and [math resolution="110"](\phi_2,\lambda_2)[/math] represent the latitude and longitude ordered pairs of each location. We can find the distance [math resolution="110"]\Delta\sigma[/math] between the two points as:
[equation title="great circle distance"]\Delta\sigma=\arccos\left\{\sin\phi_1\sin\phi_2+\cos\phi_1\cos\phi_2\cos\Delta\lambda\right\}[/equation]
To convert this into miles we add conversion factors and arrive at the following SQL query, where our_latitude and our_longitude are constants relating to our position.
SELECT
{R} *
ACOS(
SIN( {our_latitude} / {c} ) * SIN( tiger.latitude / {c} ) +
COS( {our_latitude} / {c} ) * COS( tiger.latitude / {c} ) *
COS( tiger.longitude / {c} - {our_longitude} / {c} )
)
AS distance FROM tiger
Wikipedia notes that arc cosine has large roundoff errors for small differences in distance, and recommends using the Haversine function (below) for small distances (small, as in not opposite sides of the Earth):
[equation title="Haversine function"]\Delta\sigma = 2 \arcsin \left\{ \sqrt{ \sin^2 \left( \frac{ \phi_2-\phi_1 }{2} \right) + \cos{\phi_1} \ cos{\phi_2} \ sin^2 \left( \frac{\Delta\lambda}{2} \right)} \right \}[/equation]
SELECT
{R} * 2 *
ASIN(
SQRT(
POW( SIN( ( tiger.latitude - {our_latitude} ) / ( {c} * 2 )), 2)
+
COS( tiger.latitude / {c} ) * COS( {our_latitude} / {c} *
POW( SIN( ( tiger.longitude - {our_longitude} ) / ( {c} * 2 )), 2)
)
)
AS distance FROM tiger
Discussion
Since our dataset is zip code mappings to coordinates in latitude and longitude, and since those coordinates are only guaranteed to be somewhat near the interior of the zip code region (concave zip codes may have GPS coordinates outside their interior), numerical errors are considerably less of a concern to us than to, say, those who program Tomahawk cruise missiles. We’re never going to get “great” accuracy. We’ll instead end up with “perfectly serviceable” quality.
Newer versions of MySQL support R-Tree indexes on coordinates which allow for more sophisticated geometrical queries. Want to find zip codes that have a nontrivial component within your city? Get the boundaries of your city in OpenGIS format and limit the zip codes in your search to those within the city. There’s a wealth of power and synthesis on said power in GIS implementations that rely on R-Trees. Sadly MySQL’s implementation of R-Tree indexing is weaker than PostgreSQL, so it might make sense for you to jump ship to that DB depending on your requirements.
For more information:
- http://www.movable-type.co.uk/scripts/GIS-FAQ-5.1.html
- http://en.wikipedia.org/wiki/Great-circle_distance
Tagged with: census, coordinates, geocoding, GIS, great circle distance, latitude, longitude, mysql, openGIS, postal codes, r-tree, web development, zip codes
Posted in: Development, How To












Nice post Edgar. I’ve had a bit of experience with distance and mapping applications and I have to say that unless you are willing to pay for an solution, a practical system is difficult to develop.
I think TIGER lat/lng zip code data is nice but not very useful for applications that require comparisons of distances for actual locations (i.e. stores, homes, movie theaters, etc.) from a given starting lat/lng. Most people might want to find distances from their home to the nearest drug store or even to list all drug stores within a given radius. From a development perspective, this is problematic because given a single starting point, there can be any number of target points within a certain search result set. Now throw in the variable that the start point maybe chosen at random by the user and you could be left with a very high number of point-to-point comparisons for distance. Doing these calculations on the fly is rather expensive and depending on the scale of your deployed application, even caching (MySQL or higher) can prove rather useless given the variability of start and end points.
The ability to find a result set given a radius can be especially difficult given large data sets. It is impractical to compare the distance from every point to every other point, even if your starting lat/lng remains constant. What I have worked with and maintained in the past is the creation of a bounding box to gather all points within a given radius. It is relatively inexpensive to calculate the top, bottom, left, and right lat/lng values of given start point and to query only the points that fit within that “box” then calculate the distance of each. On a small scale(.5-10 mile radius), this works pretty well, but as you start to increase this distance, your result set begins to get less accurate due to the inclusion of points beyond your desired radius. The distance from the center to the top, bottom, left, and right points of the “box” remain accurate but as the radius distance increases the corners of the “box” extend further and further beyond the desired radius.
For all of the radius searches, you’re looking at about 78% area accuracy (22% falls outside of the circle and into the square areas of the bounding box). But the accuracy falls a lot more when you work with the linear distances from the center of your radius. The distance from center, along the diagonals of the search box to one of the corners (the furthest possible point), the numbers start looking like…
radius(in miles) | dist to box corner
0.25 | ~0.35 => 0.10 miles extra
0.50 | ~0.71 => 0.21 miles extra
0.75 | ~1.06 => 0.31 miles extra
1.00 | ~1.41 => 0.41 miles extra
1.25 | ~1.76 => 0.49 miles extra
1.50 | ~2.12 => 0.62 miles extra
1.75 | ~2.47 => 0.72 miles extra
2.00 | ~2.83 => 0.82 miles extra
2.50 | ~3.54 => 1.04 miles extra
The numbers show that the searchable radius can be, at its absolute worst, 40% off for all radius distances. Maybe this level of error is acceptable but probably not for most of us.
This does not even include the apsect of mapping. Getting lat/lng data and distances from points of interest is good but are usually only useful when mapped for the user. Unfortunately, lat/lng data does not always correlate well with all map rendering applications. For instance, Google maps uses different map data vendors like Europa and Tele Atlas depending on if you are using their maps.google.com application or the development API. I can tell you from experience that the same lat/lng can and often DOES map differently on each.
So in sum…just pay for it ;)
Nice article Here is a similar solution that contains a simple yet effective zip code radius search algorithm. The solution contains a Stored Procedure that accepts a zip code and mile radius and returns all zip codes within that radius. (Example: Give me all the zip codes within a 10-mile radius of 10023). The Solution comes with and a dataload script that will create a SQL Server database, the Stored Procedure containing the algorithm, and load over 42,000 zip codes.
http://www.spikesolutions.net/ViewSolution.aspx?ID=1e7d197c-7441-40f4-8e87-5ecc3ef5ab60
You say “Even the USPS has a “Web 2.0″-ish web service that does much of what we want” but I can’t seem to find a web service offered by the USPS that returns ZIP codes within a given radius. All I can find is address/ZIP lookups. Can you point me to the location that you found the USPS service?
Matthew, not really, no. There was something like an API (required an API key and dolla dolla bills) a while ago but I don’t notice it on their current site either. But thanks for reading anyway.
-Edgar
you are missing a closing paren somewhere.