Dying thermometers in the GHCN-v2

It is no secret that there has been a marked decline in the number of stations whose monthly mean temperatures are included in the GHCN-v2. I recently mapped the spatial distribution of temperature stations contributing to the GHCN-v2 over time.

The patterns were interesting enough that I decided to figure out exactly which stations disappeared from the GHCN-v2 data set. (The fact that a station's records are not included in the data set does not mean that the station no longer exists.)

I used the database I put together earlier to try to answer that question.

All code presented on this page is licensed under the GNU General Public License.

Classification by grid

I excluded data from January 2007 onwards from this procedure due to some odd patterns I observed when comparing means from v2.mean to v2.mean_adj. I considered a station record to be a survivor if it contained mean temperatures for at least 60 months out of the 180 between January 1992 and December 2006. A station record that contained fewer than 60 months of mean temperatures in that period was classified as a zombie.
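The rule above boils down to a single threshold test. Here is a minimal sketch of it in isolation; the helper name classify and the constant $MIN_MONTHS are made up for illustration and do not appear in the script below.

```perl
#!/usr/bin/perl

use strict; use warnings;

# Threshold from the text: at least 60 of the 180 months
# between January 1992 and December 2006.
my $MIN_MONTHS = 60;

# classify() is a hypothetical helper: it maps a count of months
# with a mean temperature to one of the two labels used in this post.
sub classify {
    my ($nobs) = @_;
    return $nobs < $MIN_MONTHS ? 'zombie' : 'survivor';
}

# Show the labels on both sides of the boundary.
printf "%3d months: %s\n", $_, classify($_) for 0, 59, 60, 180;
```

Note that the boundary is inclusive: a record with exactly 60 months counts as a survivor.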

In anticipation of questions I might ask myself in the future, I decided to group the zombies and survivors by grid.

zombie-classify-grid.pl

#!/usr/bin/perl

use strict; use warnings;
use autodie;

use DBI;
use File::Slurp;
use JSON;
use POSIX qw( floor );

my $dbh = DBI->connect('dbi:SQLite:ghcn.db', undef, undef, {
    AutoCommit => 0, RaiseError => 1
});

my $sth = $dbh->prepare(q{
    SELECT
        inventory.station AS station,
        inventory.modifier AS modifier,
        coord_lat AS y,
        coord_long AS x,
        nobs
    FROM (
        -- Rows with a non-missing mean per station record, 1992 through 2006
        SELECT
            station, modifier, count(*) AS nobs
        FROM temperature
        WHERE
            year > 1991 AND
            year < 2007 AND
            mean <> -9999 AND
            country NOT IN (700, 701, 800)
        GROUP BY station, modifier
    ) AS freq
    JOIN inventory ON
        freq.station  = inventory.station AND
        freq.modifier = inventory.modifier
});

$sth->execute;

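my %grid;

# Southwest corners of the 5-degree grid cells (for reference; not used below).
my @longs = map { 5 * $_ } (-36 .. 35 );
my @lats  = map { 5 * $_ } (-18 .. 17 );

while ( my $station = $sth->fetchrow_hashref ) {
    # Snap each coordinate down to the southwest corner of its cell.
    # (Rounding with sprintf before floor would pick the nearest
    # multiple of 5 instead.)
    my $x = 5 * floor( $station->{x} / 5 );
    my $y = 5 * floor( $station->{y} / 5 );

    # Fewer than 60 months of data in 1992-2006 makes a zombie.
    my $type = $station->{nobs} < 60 ? 'zombie' : 'survivor';

    # Stringify the identifiers so they are encoded as JSON strings.
    push @{ $grid{"$x|$y"}{$type}{stations} },
        [ map { "$_" } $station->{station}, $station->{modifier} ];
}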

write_file 'grid.json', encode_json( \%grid );

I saved the results in JSON format to cache them for the next step, as it takes quite a while to run this script (probably due to the fact that I am not very good with SQL).
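For anyone who wants to work with the cached file, here is a rough sketch of reading that structure back and counting stations by type. It uses the core JSON::PP module rather than JSON, and it decodes a small inline sample instead of grid.json so that it runs on its own. The station identifiers in the sample and the helper count_by_type are made up for illustration.

```perl
#!/usr/bin/perl

use strict; use warnings;

use JSON::PP;

# Hypothetical sample in the same shape the script writes to grid.json:
# cell key => type => { stations => [ [ station, modifier ], ... ] }
my $json = q({
    "-75|40": {
        "survivor": { "stations": [ ["11111", "0"] ] },
        "zombie":   { "stations": [ ["22222", "0"], ["33333", "1"] ] }
    },
    "10|45": {
        "zombie":   { "stations": [ ["44444", "0"] ] }
    }
});

# count_by_type() is a hypothetical helper: it totals the number of
# station records of each type over all grid cells.
sub count_by_type {
    my ($grid) = @_;
    my %count;
    for my $cell ( keys %$grid ) {
        for my $type ( keys %{ $grid->{$cell} } ) {
            $count{$type} += @{ $grid->{$cell}{$type}{stations} };
        }
    }
    return \%count;
}

my $count = count_by_type( decode_json($json) );
print "$_: $count->{$_}\n" for sort keys %$count;
```

To use it on the real cache, read the contents of grid.json into $json instead of the inline sample.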

I make no claims that this list is correct in any sense. It is what I came up with using arbitrary criteria and possibly buggy code with a data set I do not fully understand. In particular, I did not know how to deal with duplicates, so I did not control for them. For example, if three duplicates of a given station and modifier combination each contain 30 records covering exactly the same months, that counts as 90 records, and the station is classified as a survivor. I have not checked whether, or by how much, the results would change if duplicates were treated as separate series.
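To make the duplicates caveat concrete, here is a small sketch that contrasts counting every record with counting distinct months. This is not what the script above does. It assumes each record carries a duplicate number, a year, and a month, which is my guess at the layout and not something I have verified against the database. The data and both helper names are made up.

```perl
#!/usr/bin/perl

use strict; use warnings;

# Hypothetical records for one station and modifier combination:
# three duplicates, each covering the same 30 months from January 1992.
# Each record is [ duplicate number, year, month ].
my @rows;
for my $dup ( 0 .. 2 ) {
    for my $i ( 0 .. 29 ) {
        push @rows, [ $dup, 1992 + int( $i / 12 ), 1 + $i % 12 ];
    }
}

# Counts every record, the way count(*) over all duplicates would.
sub count_records {
    my ($rows) = @_;
    return scalar @$rows;
}

# Counts each (year, month) pair once, no matter how many duplicates have it.
sub count_distinct_months {
    my ($rows) = @_;
    my %seen;
    $seen{"$_->[1]-$_->[2]"} = 1 for @$rows;
    return scalar keys %seen;
}

printf "records: %d, distinct months: %d\n",
    count_records( \@rows ), count_distinct_months( \@rows );
```

With these made-up rows, the record count clears the 60-month threshold while the distinct-month count does not, which is exactly the situation described above.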

With those caveats, you can download the list of survivor and zombie stations I came up with:

Classification of stations in the GHCN-v2 by availability of records 1992–2006. The file is in CSV format but has a .txt extension so it can be viewed in your browser without starting a spreadsheet program. The first column is the grid cell, which is identified by its longitude and latitude separated by a pipe character. Grid cells are identified by the coordinates of their southwest corner and are 5° wide by 5° tall. The second column specifies the type of the stations that follow. If there are no stations of a given type in a given grid cell, there is no record for that type in that cell. Consequently, grid cells with no stations do not appear in the file.
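The grid identifiers described above can be computed and decoded along the following lines. This is a minimal sketch that assumes cells are keyed by their southwest corner, as the description says. The helpers cell_key and cell_bounds are hypothetical names and the sample coordinates are arbitrary.

```perl
#!/usr/bin/perl

use strict; use warnings;

use POSIX qw( floor );

my $CELL = 5;    # cell width and height in degrees

# cell_key() is a hypothetical helper: it snaps a longitude and latitude
# down to the southwest corner of the enclosing cell and joins the two
# values with a pipe, longitude first.
sub cell_key {
    my ( $lon, $lat ) = @_;
    return join '|', map { $CELL * floor( $_ / $CELL ) } $lon, $lat;
}

# cell_bounds() is a hypothetical helper: it turns a key back into the
# edges of the cell it identifies.
sub cell_bounds {
    my ($key) = @_;
    my ( $west, $south ) = split /\|/, $key;
    return {
        west  => $west,
        south => $south,
        east  => $west + $CELL,
        north => $south + $CELL,
    };
}

my $key = cell_key( -73.97, 40.78 );
my $b   = cell_bounds($key);
print "$key covers longitude $b->{west} to $b->{east}, ",
    "latitude $b->{south} to $b->{north}\n";
```

Using floor instead of rounding matters for negative coordinates: a longitude of -73.97 falls in the cell whose western edge is -75, not -70.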

I hope this information will be useful to others exploring the GHCN-v2 data set. Let me know of any errors you find.