Optimising IP Geolocation Data

This article explains how to optimise the IP to Country data files used to create the SQL database of an IP geolocator.

1. Introduction

IP Geolocators find geographical information for a website visitor using the IP address of their computer. A typical IP Geolocator looks up the IP address in an SQL database created from an IP to Country data file which lists the IP addresses assigned to each country. Each entry in the file describes a range of IP addresses from a start address to an end address. A typical file contains many entries for each country. The following example illustrates the format of the webhosting.info IP to Country data file:

"980090880","980156415","APNIC","1140739200","AU","AUS","AUSTRALIA"

This line of quoted, comma-separated values (CSV) contains the following information:

  • start of the IP address range in numerical format (980090880);
  • end of the IP address range in numerical format (980156415);
  • registry name (APNIC);
  • date of registration in unix epoch seconds (1140739200);
  • two-letter country code (AU);
  • three-letter country code (AUS); and
  • country name (AUSTRALIA).

IP to Country data files present IP addresses in the numerical IPv4 Internet network address format, which is just a long integer. For example, the numerical form of IP address 58.139.16.233 is 982192361. Many programming languages provide a function for calculating the numerical form of an IP address from the dotted form of the address. For example, the ip2long function performs this conversion in PHP:

$dottedIP = "58.139.16.233";
$numericalIP = ip2long($dottedIP); // returns 982192361

To find the country of a website visitor’s IP address, an IP Geolocator first calculates the numerical form of the dotted IP address, and then retrieves the SQL database record whose IP address range contains the numerical form of the visitor’s IP address. For example, the following IP to Country data file entry tells us that IP address 58.139.16.233 is assigned to Malaysia because its numerical form, 982192361, lies in the range 982188032 to 982253567:

"982188032","982253567","APNIC","1116979200","MY","MYS","MALAYSIA"

To store the information in the above entry in an SQL database, we need a table with one field for each pieces of information. For example, the following SQL statement creates a table called ip2country with the seven fields required to hold each piece of information (plus one for the primary key):

create table ip2country (
  ipFrom       int(10) unsigned not null, -- Start of IP address range.
  ipTo         int(10) unsigned not null, -- End of IP address range.
  registry     varchar(20) not null,      -- Registry name.
  date         int(10) unsigned not null, -- Date of registration.
  countryCode2 char(2) not null,          -- Two-letter ISO 3166-1-alpha-2 country code.
  countryCode3 char(3) not null,          -- Three-letter ISO 3166-1-alpha-3 country code.
  countryName  varchar(50) not null,      -- Country name.
  primary key(ipFrom, ipTo)
);

Not all of these fields are required, however, because each entry in an IP to Country data file contains unnecessary and redundant information. Furthermore, not all of the entries in the file are required because some are redundant. We can remove the deadwood with optimization.

2. Optimization

We can reduce the size of the SQL database required to hold the entries in an IP to Country data file with the following three optimizations:

  1. remove the unnecessary registry name and registration date;
  2. remove the redundant three-letter country code and country name; and
  3. collapse sequentially-numbered entries into a single entry.

I’ll use the following twenty entries to demonstrate these three optimizations:

"979599360","979632127","APNIC","1160611200","CN","CHN","CHINA"
"979632128","979763199","APNIC","1113523200","CN","CHN","CHINA"
"979763200","979894271","APNIC","1116288000","KR","KOR","KOREA REPUBLIC OF"
"979894272","980025343","APNIC","1114128000","AU","AUS","AUSTRALIA"
"980025344","980090879","APNIC","1114128000","AU","AUS","AUSTRALIA"
"980090880","980156415","APNIC","1140739200","AU","AUS","AUSTRALIA"
"980156416","980287487","APNIC","1140739200","AU","AUS","AUSTRALIA"
"980287488","980353023","APNIC","1140739200","AU","AUS","AUSTRALIA"
"980353024","980418559","APNIC","1163116800","AU","AUS","AUSTRALIA"
"980418560","980549631","APNIC","1114646400","JP","JPN","JAPAN"
"980549632","980680703","APNIC","1169078400","TW","TWN","TAIWAN PROVINCE OF CHINA"
"980680704","980942847","APNIC","1115769600","CN","CHN","CHINA"
"980942848","981467135","APNIC","1114646400","KR","KOR","KOREA REPUBLIC OF"
"981467136","981991423","APNIC","1115769600","CN","CHN","CHINA"
"981991424","982056959","APNIC","1116806400","TH","THA","THAILAND"
"982056960","982122495","APNIC","1144022400","TH","THA","THAILAND"
"982122496","982155263","APNIC","1116892800","JP","JPN","JAPAN"
"982155264","982171647","APNIC","1140480000","JP","JPN","JAPAN"
"982171648","982188031","APNIC","1154563200","KR","KOR","KOREA REPUBLIC OF"
"982188032","982253567","APNIC","1116979200","MY","MYS","MALAYSIA"

The first optimization removes the unnecessary information from each entry.

2.1 Removing Unnecessary Information

The registry name and date of registration are unnecessary for looking up the country of an IP address. By discarding this information, we can remove two fields from each SQL database record:

create table ip2country (
  ipFrom       int(10) unsigned not null, -- Start of IP address range.
  ipTo         int(10) unsigned not null, -- End of IP address range.
  <span style="text-decoration: line-through;">registry     varchar(20) not null,      -- Registry name.</span>
  <span style="text-decoration: line-through;">date         int(10) unsigned not null, -- Date of registration.</span>
  countryCode2 char(2) not null,          -- Two-letter ISO 3166-1-alpha-2 country code.
  countryCode3 char(3) not null,          -- Three-letter ISO 3166-1-alpha-3 country code.
  countryName  varchar(50) not null,      -- Country name.
  primary key(ipFrom, ipTo)
);

After removing the registration date and the name of the registry, the twenty example entries look like this:

"979599360","979632127","CN","CHN","CHINA"
"979632128","979763199","CN","CHN","CHINA"
"979763200","979894271","KR","KOR","KOREA REPUBLIC OF"
"979894272","980025343","AU","AUS","AUSTRALIA"
"980025344","980090879","AU","AUS","AUSTRALIA"
"980090880","980156415","AU","AUS","AUSTRALIA"
"980156416","980287487","AU","AUS","AUSTRALIA"
"980287488","980353023","AU","AUS","AUSTRALIA"
"980353024","980418559","AU","AUS","AUSTRALIA"
"980418560","980549631","JP","JPN","JAPAN"
"980549632","980680703","TW","TWN","TAIWAN PROVINCE OF CHINA"
"980680704","980942847","CN","CHN","CHINA"
"980942848","981467135","KR","KOR","KOREA REPUBLIC OF"
"981467136","981991423","CN","CHN","CHINA"
"981991424","982056959","TH","THA","THAILAND"
"982056960","982122495","TH","THA","THAILAND"
"982122496","982155263","JP","JPN","JAPAN"
"982155264","982171647","JP","JPN","JAPAN"
"982171648","982188031","KR","KOR","KOREA REPUBLIC OF"
"982188032","982253567","MY","MYS","MALAYSIA"

These entries still contain redundant country information, which we’ll remove with the second optimization.

2.2 Removing Redundant Information

The three-letter country code and country name are redundant because this information can be looked up using the two-letter country code, which uniquely identifies each country. By reducing the country information to the two-letter country code, we can remove two more fields from each SQL database record:

create table ip2country (
  ipFrom       int(10) unsigned not null, -- Start of IP address range.
  ipTo         int(10) unsigned not null, -- End of IP address range.
  <span style="text-decoration: line-through;">registry     varchar(20) not null,      -- Registry name.</span>
  <span style="text-decoration: line-through;">date         int(10) unsigned not null, -- Date of registration.</span>
  countryCode2 char(2) not null,          -- Two-letter ISO 3166-1-alpha-2 country code.
  <span style="text-decoration: line-through;">countryCode3 char(3) not null,          -- Three-letter ISO 3166-1-alpha-3 country code.</span>
  <span style="text-decoration: line-through;">countryName  varchar(50) not null,      -- Country name.</span>
  primary key(ipFrom, ipTo)
);

The twenty example entries now look like this:

"979599360","979632127","CN"
"979632128","979763199","CN"
"979763200","979894271","KR"
"979894272","980025343","AU"
"980025344","980090879","AU"
"980090880","980156415","AU"
"980156416","980287487","AU"
"980287488","980353023","AU"
"980353024","980418559","AU"
"980418560","980549631","JP"
"980549632","980680703","TW"
"980680704","980942847","CN"
"980942848","981467135","KR"
"981467136","981991423","CN"
"981991424","982056959","TH"
"982056960","982122495","TH"
"982122496","982155263","JP"
"982155264","982171647","JP"
"982171648","982188031","KR"
"982188032","982253567","MY"

We’ve now minimized the information in each entry. To complete the optimization, we need to remove the redundant entries from the IP to Country data file.

2.3 Removing Redundant Entries

So far, we’ve halved the number of fields required to store each data file entry in an SQL database record from eight to four:

create table ip2country (
  ipFrom       int(10) unsigned not null, -- Start of IP address range.
  ipTo         int(10) unsigned not null, -- End of IP address range.
  countryCode2 char(2) not null,          -- Two-letter ISO 3166-1-alpha-2 country code.
  primary key(ipFrom, ipTo)
);

The final optimization minimizes the number of database records by collapsing entries with sequentially-numbered address ranges into a single entry. For example, the following two entries describe IP address ranges assigned to Japan, which has the two-letter country code JP:

"982122496","982155263","JP"
"982155264","982171647","JP"

The second entry follows on sequentially from the first because the start address of the second entry, 982155264, is one more than the end address of the first entry, 982155263:

"982122496","982155263","JP"
"982155264","982171647","JP"

Together, these two entries describe a range of IP addresses from 982122496 to 982171647, which is the range from the start address of the first entry to the end address of the second entry:

"982122496","982155263","JP"
"982155264","982171647","JP"

We can collapse the second entry into the first by setting the end address of the first entry to the end address of the second entry:

"982122496","982171647","JP"

We can then delete the second entry which will reduce the number of SQL database records by one.

To maximize this optimization, we collapse as many sequentially-numbered entries as possible into a single entry. For example, the following sequentially-numbered entries describe IP addresses assigned to Australia:

"979894272","980025343","AU"
"980025344","980090879","AU"
"980090880","980156415","AU"
"980156416","980287487","AU"
"980287488","980353023","AU"
"980353024","980418559","AU"

These six entries describe a range of addresses starting at 979894272, the start address of the first entry, to 980418559, the end address of the sixth entry:

"979894272","980025343","AU"
"980025344","980090879","AU"
"980090880","980156415","AU"
"980156416","980287487","AU"
"980287488","980353023","AU"
"980353024","980418559","AU"

By collapsing these six entries into one, we can save another five SQL database records:

"979894272","980418559","AU"

Note that we can only collapse sequentially-numbered entries assigned to the same country. For example, we can’t collapse the following sequentially-numbered entries because the first entry is assigned to China (CH) whereas the second entry is assigned to Thailand (TH):

"981467136","981991423","CN"
"981991424","982056959","TH"

After applying all three optimizations, the number of example entries decreases from twenty to twelve, which is a decrease of 40%:

"979599360","979763199","CN"
"979763200","979894271","KR"
"979894272","980418559","AU"
"980418560","980549631","JP"
"980549632","980680703","TW"
"980680704","980942847","CN"
"980942848","981467135","KR"
"981467136","981991423","CN"
"981991424","982122495","TH"
"982122496","982171647","JP"
"982171648","982188031","KR"
"982188032","982253567","MY"

Download

The optimiseIP2Country.php script is a command-line utility that implements the optimizations described in this article. You can download this script from github.