Where on earth do GHCNv2 data come from?

Three years ago, I decided to see the span and availability of data in the GHCNv2 by plotting the data from each location individually and created animations of temperature anomalies included in various temperature data sets.

The question of where on earth the data come from has been on my mind during this time, but I just recently got motivated again after E.M. Smith kindly mentioned my visualizations on his blog.

So, I decided to write some code to import the GHCN data into a SQLite database and plot for each month the locations of all stations which have non-missing data for that month.

Importing the data into a database like that allows me to avoid re-parsing the data every time I wonder about something. It takes my old clunker of a laptop about 15 minutes to go through everything, but then I can ask any question I want using standard SQL syntax.

Please see my blog post Dude, where is my thermometer? if you would like to comment on this. The code and explanation follows.

Download source code and associated files.

All code is Copyright © 2010 A. Sinan Unur and is provided under the terms of the GNU General Public License.

Database Schema

The SQL code does a little bit more work than is absolutely necessary so as to allow for future flexibility. In particular, it sets up lookup tables that translate the documentation embedded in NOAA's v2.read.inv.f.

DROP TABLE 'airstn';
DROP TABLE 'country';
DROP TABLE 'inventory';
DROP TABLE 'popcat';
DROP TABLE 'stloc';
DROP TABLE 'stveg';
DROP TABLE 'temperature';
DROP TABLE 'topo';

/* see ftp://ftp.ncdc.noaa.gov/pub/data/ghcn/v2/v2.temperature.readme */
CREATE TABLE 'temperature' (
    country  CHAR(3) NOT NULL,
    station  CHAR(5) NOT NULL,
    modifier CHAR(3) NOT NULL,
    dupeno   CHAR(1) NOT NULL,
    year     INTEGER NOT NULL,
    month    INTEGER NOT NULL,
    celsius  INTEGER NULL,
    CONSTRAINT 'temperature_pk' 

    PRIMARY KEY (country, station, modifier, dupeno, year, month),

    CONSTRAINT 'station_fk' FOREIGN KEY (country, station, modifier) 
    REFERENCES inventory(country, station, modifier),
    
    CONSTRAINT 't_country_fk' FOREIGN KEY (country) REFERENCES country(code)
);

/* see ftp://ftp.ncdc.noaa.gov/pub/data/ghcn/v2/v2.temperature.inv */

CREATE TABLE 'inventory' (
    country     CHAR(3)     NOT NULL,
    station     CHAR(5)     NOT NULL,
    modifier    CHAR(3)     NOT NULL,
    stname      VARCHAR(30) NOT NULL,
    coord_lat   INTEGER     NOT NULL,
    coord_long  INTEGER     NOT NULL,
    elev        INTEGER     NOT NULL,
    elevg       INTEGER     NOT NULL,
    popcat      CHAR(1)     NOT NULL,
    pop         INTEGER     NOT NULL,
    topo        CHAR(2)     NOT NULL,
    stveg       CHAR(2)     NOT NULL,
    stloc       CHAR(2)     NOT NULL,
    loc         INTEGER     NOT NULL,
    airstn      CHAR(1)     NOT NULL,
    towndis     INTEGER     NOT NULL,
    grveg       CHAR(16)    NOT NULL,

    CONSTRAINT 'inventory_pk' PRIMARY KEY (country, station, modifier),
    CONSTRAINT 'airstn_fk'    FOREIGN KEY (airstn)  REFERENCES airstn(category),
    CONSTRAINT 'popcat_fk'    FOREIGN KEY (popcat)  REFERENCES popcat(category),
    CONSTRAINT 'i_country_fk' FOREIGN KEY (country) REFERENCES country(code),
    CONSTRAINT 'stloc_fk'     FOREIGN KEY (stloc)   REFERENCES stloc(category),
    CONSTRAINT 'stveg_fk'     FOREIGN KEY (stveg)   REFERENCES stveg(category),
    CONSTRAINT 'topo_fk'      FOREIGN KEY (topo)    REFERENCES topo(category)
);

CREATE TABLE 'country' (code CHAR(3) PRIMARY KEY, name VARCHAR NOT NULL);

CREATE TABLE 'popcat' (
    category CHAR(1) PRIMARY KEY,
    description VARCHAR(60) NOT NULL
);

INSERT INTO popcat(category, description) 
VALUES ('R', 'rural (not associated with a town of >10,000 population)' );

INSERT INTO popcat(category, description) 
VALUES ('S', 'associated with a small town (10,000-50,000)' );

INSERT INTO popcat(category, description) 
VALUES ('U', 'associated with an urban area (>50,000)' );

CREATE TABLE 'topo' (
    category CHAR(2) PRIMARY KEY, 
    description VARCHAR(60) NOT NULL
);

INSERT INTO topo(category, description) VALUES ('FL', 'flat');
INSERT INTO topo(category, description) VALUES ('HI', 'hilly');
INSERT INTO topo(category, description) VALUES ('MT', 'mountain top');
INSERT INTO topo(category, description) VALUES ('MV', 
    'mountainous valley or at least not on the top of a mountain');

CREATE TABLE 'stveg' (
    category CHAR(2) PRIMARY KEY,
    description VARHCAR(16) NOT NULL
);

INSERT INTO 'stveg' (category, description) VALUES ('CL', 'clear or open');
INSERT INTO 'stveg' (category, description) VALUES ('DE', 'desert');
INSERT INTO 'stveg' (category, description) VALUES ('FO', 'forested');
INSERT INTO 'stveg' (category, description) VALUES ('IC', 'ice');
INSERT INTO 'stveg' (category, description) VALUES ('MA', 'marsh');
INSERT INTO 'stveg' (category, description) VALUES ('xx', 
    'No information regarding vegetation near station' );

/* stloc: A station may be all three but only labeled with one 
   with the priority IS, CO, then LA.  If none of the above: no.*/

CREATE TABLE 'stloc' (
    category CHAR(2),
    description VARCHAR(120) NOT NULL
);

INSERT INTO 'stloc' (category, description) VALUES ('CO', 
    'The station is within 30 km of the coast');

INSERT INTO 'stloc' (category, description) VALUES ('IS', 
    'The station is on an island smaller than 100 kmsq or narrower than 10 km in width at the point of the station');

INSERT INTO 'stloc' (category, description) VALUES ('LA', 
    'The station is next to a large (>25 kmsq) lake');

INSERT INTO 'stloc' (category, description) VALUES ('no', 
    'Not an island / not coastal / not next to a lake');

CREATE TABLE 'airstn' (
    category CHAR(1) PRIMARY KEY,
    description VARCHAR(40)
);

INSERT INTO 'airstn' (category, description) VALUES ('A', 
    'The station is at an airport');

INSERT INTO 'airstn' (category, description) VALUES ('x', 
    'The station is not at an airport' );

Running sqlite3 ghcn.db < ghcn.sql on the command line will then generate an empty database containing this schema.

Perl script to parse and import the data

The script does the bare minimum necessary to parse and import the data. It does use named capture feature of Perl regular expressions introduced in Perl 5.10 so it will not run with earlier perls.

#!/usr/bin/perl

use strict; use warnings;
use autodie;
use DBI;
use Config::Std;
use YAML;

local $| = 1;

my ($config_file) = @ARGV;
$config_file //= 'ghcn.ini';

read_config $config_file => my %settings;

my $dbh = DBI->connect(
    sprintf($settings{dbi}{conn}, $settings{dbi}{db}),
    @{$settings{dbi}}{qw(user pass)},
    { RaiseError => 1, AutoCommit => 0}
);

import_countries   ($dbh, \%settings);
import_inventory   ($dbh, \%settings);
import_temperature ($dbh, \%settings);
create_indexes     ($dbh);

sub import_inventory {
    my ($dbh, $settings) = @_;

    my $re = qr/^@{ $settings{inventory}{field} }/x;

    my @columns = @{ $settings{inventory}{column} };

    open my $in, '<', $settings{source}{inventory};

    my $sth = $dbh->prepare(sprintf(
            q{INSERT INTO inventory(%s) VALUES(%s)},
            join(',', @columns),
            join(',', ('?') x @columns),
        )
    );

    while ( my $record = <$in> ) {
        if ( $record =~ $re ) {
            $sth->execute(@+{ @columns });
        }
        else {
            chomp $record;
            die $record;
        }
    }

    $dbh->commit;
    return;
}

sub import_countries {
    my ($dbh, $settings) = @_;

    open my $in, '<', $settings{source}{country_codes};

    my $sth = $dbh->prepare(
        q{INSERT INTO country(code, name) VALUES (?,?)}
    );

    while ( my $record = <$in> ) {
        my ($code, $name) = split ' ', $record, 2;
        $name =~ s/\s+\z//;
        $sth->execute($code, $name);
    }

    $dbh->commit;
    return;
}

sub import_temperature {
    my ($dbh, $settings) = @_;

    open my $in, '<', $settings{source}{mean};

    my $re = qr/^@{ $settings{temperature}{field} }/x;

    my @columns = @{ $settings{temperature}{column} };

    my $sth = $dbh->prepare(sprintf(
            q{INSERT INTO temperature (%s) VALUES(%s)},
            join(',', @columns),
            join(',', ('?') x @columns),
        )
    );

    while ( my $record = <$in> ) {
        if ( $record =~ $re ) {
            my %record;
            my %save = %+;
            @record{ @columns } = @save{ @columns };
            for my $m ( 1 .. 12 ) {
                $record{ month } = $m;
                $record{ celsius } = $save{ sprintf('m%02d', $m) };
                $sth->execute(@[email protected]});
            }
            unless ( $. % 5000 ) {
                print "$. ";
                $dbh->commit;
            }
        }
        else {
            chomp $record;
            die $record;
        }
    }
    $dbh->commit;
    return;
}

sub create_indexes {
    my ($dbh) = @_;

    $dbh->do(q{CREATE INDEX idx_t_y  ON temperature(year)});
    $dbh->do(q{CREATE INDEX idx_t_m  ON temperature(month)});
    $dbh->do(q{CREATE INDEX idx_t_ym ON temperature(year,month)});
    $dbh->commit;
    return;
}

The patterns and other sundry settings are defined in the settings file ghcn.ini.

Settings file

[source]
mean = v2.mean.20100515
inventory = v2.temperature.inv.20100416
country_codes = v2.country.codes

[map]
img = etopo-landmask.png

; r, g, b and alpha 
marker_color = 255
marker_color = 0
marker_color = 255
marker_color = 64

; r, g, b and alpha
text_color = 0
text_color = 0
text_color = 204
text_color = 0

[dbi]
conn = dbi:SQLite:dbname=%s
db = ghcn.db
user = 
pass = 

[temperature]
field = (?<country>      [0-9] {3}   )
field = (?<station>      [0-9] {5}   )
field = (?<modifier>     [0-9] {3}   )
field = (?<dupeno>       [0-9] {1}   )
field = (?<year>         [0-9] {4}   )
field = [ ]* (?<m01>  -?  [0-9] {1,4} ) 
field = [ ]* (?<m02>  -?  [0-9] {1,4} ) 
field = [ ]* (?<m03>  -?  [0-9] {1,4} ) 
field = [ ]* (?<m04>  -?  [0-9] {1,4} ) 
field = [ ]* (?<m05>  -?  [0-9] {1,4} ) 
field = [ ]* (?<m06>  -?  [0-9] {1,4} ) 
field = [ ]* (?<m07>  -?  [0-9] {1,4} ) 
field = [ ]* (?<m08>  -?  [0-9] {1,4} ) 
field = [ ]* (?<m09>  -?  [0-9] {1,4} ) 
field = [ ]* (?<m10>  -?  [0-9] {1,4} ) 
field = [ ]* (?<m11>  -?  [0-9] {1,4} ) 
field = [ ]* (?<m12>  -?  [0-9] {1,4} ) 

column = country
column = station
column = modifier
column = dupeno
column = year
column = month
column = celsius

[inventory]
field = (?<country>         [0-9]{3} )
field = (?<station>         [0-9]{5} )
field = (?<modifier>        [0-9]{3} )
field = [ ]* (?<stname>     .{30} )
field = [ ]* (?<coord_lat>  -?[0-9]{1,3} [.] [0-9]{1,2} )
field = [ ]* (?<coord_long> -?[0-9]{1,3} [.] [0-9]{1,2} )
field = [ ]* (?<elev>       (?: -?[0-9]{1,4} | -999 ))
field = [ ]* (?<elevg>      (?: -?[0-9]{1,4} | -999 ))
field = (?<popcat>          [RSU] )
field = [ ]* (?<pop>        [0-9] {1,5} | -9 )
field = (?<topo>            (?:FL) | (?:HI) | (?:M[TV]) )
field = (?<stveg>           (?:CL) | (?:DE) | (?:FO) | (?:IC) | (?:MA) | (?:xx) )
field = (?<stloc>           (?:CO) | (?:IS) | (?:LA) | (?:no) )
field = [ ]* (?<loc>        [0-9]{1,2} | -9 )
field = (?<airstn>          (?: A | x ) )
field = [ ]* (?<towndis>         [0-9]{1,2} | -9 )
field = (?<grveg>           .{16} )

column = country
column = station
column = modifier
column = stname
column = coord_lat
column = coord_long
column = elev
column = elevg
column = popcat
column = pop
column = topo
column = stveg
column = stloc
column = loc
column = airstn
column = towndis
column = grveg

Importing the data

Assuming you have downloaded the requisite files v2.mean.Z, v2.temperature.inv, and v2.country.codes from the FTP site, and decompressed the temperature database (7-Zip works fine if you do not have the Unix compress program), you will get to wait a little bit while the script does its thing. If everything worked as expected, number of entries in the temperature table should be 12 times the number of lines in v2.mean. As of May 20, 2010, I get:

sqlite> select count(*) from temperature;
7166616

Script to graph the temperature locations

All the hard work is done: All that remains is to query the temperature table in the database by month and year for station identifiers with non-missing data, look up their coordinates, convert those coordinates to a point on the bitmap and plot them. I plotted the points using more than just one pixel and with transparency beause of the resolution limitation of the 640x320 bitmap I used. Using transparency means locations with many stations whose coordinates map to roughly the same pixel location on the bitmap are darker in the output.

For the map image, I used once again the venerable etopo-landmask image (resized to 640x320) which you can find on Dave Pape's web site.

#!/usr/bin/perl

use strict; use warnings;
use autodie;
use Config::Std;
use DBI;
use File::Slurp;
use File::Spec::Functions qw( catfile );
use GD;
GD::Image->trueColor(1);

use YAML;

local $| = 1;

my ($config_file) = @ARGV;
$config_file //= 'ghcn.ini';

read_config $config_file => my %settings;

my $MAP_FILE     = $settings{map}{img};
my @MARKER_COLOR = map { 0 + $_ } @{ $settings{map}{marker_color} };
my @TEXT_COLOR   = map { 0 + $_ } @{ $settings{map}{text_color} };

my $dbh = DBI->connect(
    sprintf($settings{dbi}{conn}, $settings{dbi}{db}),
    @{$settings{dbi}}{qw(user pass)},
    { RaiseError => 1, AutoCommit => 0}
);

my $stations = $dbh->selectall_hashref(
    q{SELECT
        country,
        station,
        modifier,
        coord_lat AS y,
        coord_long AS x
      FROM inventory
    },
    [ qw( country station modifier )]
);

my $years = $dbh->selectcol_arrayref(
    q{SELECT DISTINCT year FROM temperature ORDER BY year}
);

my $sth = $dbh->prepare(q{
    SELECT country, station, modifier FROM temperature
    WHERE year = ? AND month = ? AND celsius <> -9999
});

for my $year ( @$years ) {
    for my $month ( 1 .. 12 ) {
        plot_month($year, $month, $sth, $stations);
    }
}

sub plot_month {
    my ($year, $month, $sth, $stations) = @_;

    $sth->execute($year, $month);

    my $map = GD::Image->new($MAP_FILE);

    my $marker_color = $map->colorAllocateAlpha(@MARKER_COLOR);
    my $text_color = $map->colorAllocateAlpha(@TEXT_COLOR);

    $map->string(gdMediumBoldFont,
        0.78125 * $map->width,
        0.9 * $map->height,
        "$year/$month",
        $text_color
    );

    while ( my $station = $sth->fetchrow_hashref ) {
        my ($c, $s, $m) = @$station{qw(country station modifier)};

        my $xf = $map->width / 360;
        my $yf = $map->height / 180;

        my $x = 180 + $stations->{$c}{$s}{$m}{x};
        $x *= $xf;

        my $y = 90 - $stations->{$c}{$s}{$m}{y};
        $y *= $yf;

        $map->filledRectangle(
            sprintf('%.0f', $x - $xf/2),
            sprintf('%.0f', $y - $yf/2),
            sprintf('%.0f', $x + $xf/2),
            sprintf('%.0f', $y + $yf/2),
            $marker_color,
        );
    }

    my $output_file = catfile(
        frames => sprintf '%04d%02d.png', $year, $month
    );
    write_file $output_file, { binmode => ':raw'}, \ $map->png;
    return;
}

I generated the animation using VirtualDub using 15 frames (i.e. months) per second.

When you watch the video, two features are worth paying attention. First, notice the pattern of locations where temperature stations pop up as we progress through the 1700s to the 1800s and 1900s. It is important to remember that the temperature measurements we have today are not the result of a well designed scientific experiment, but a by product of where humans lived and traveled and observed the weather and recorded their observations in a form that was preserved. This feature of the data invalidates most naive classical statistical methods. There is a reason econometrics exists: Dealing with historical data is hard.

Second, make sure to pay attention to the decline in the number of locations with data in the last two decades. I find that very curious. It simply does not compute: Why are station data disappearing from these data sets even as their importance to use is increasing?

If you would like to comment, please see my blog post Dude, where is my thermometer?.