MATLAB: Handling NetCDF files & HadISST data

This is a rehash of a post from my old blog that proved popular. It is a set of basic instructions for handling NetCDF files in MATLAB - something that can be very handy in climate science.

There are various instrumental records (thermometer- and satellite-based measurements) of global temperature variability, distinguished by the parameter measured and the grid used (sea-surface temperatures, marine air temperatures, land temperatures, combined land-sea; 5°x5° gridded, 1°x1° gridded and so on). Most of these are open for public use and require citation in scientific publications. Each data set carries its own inherent errors, so choose carefully according to the question you want to answer.

Recently, I've been working with the Hadley Centre Sea Ice and Sea Surface Temperature (HadISST) data set. It provides global, 1°x1° gridded sea-surface temperature (SST) data from 1870 to the present (updated on the 2nd of every month). The SSTs are stored in a three-dimensional array (longitude, latitude and time). It is structured thus:

----------------------------------------------------------------
      |     |     |     |
      | DAY | MON | YR  |
      |_____|_____|_____|_____________________________
 90N  |(1,1)                                          |
      |                                               |
      |                                               |
      |                                               |
      |                                               |
      |(1,90)                                         |
 Equ  |                                               |
      |(1,91)                                         |
      |                                               |
      |                                               |
      |                                               |
      |                                               |
 90S  |(1,180)_______________________________(360,180)|
     180W                    0                     180E
----------------------------------------------------------------

For my purposes, I needed the complete SST time series from 1870 to the present, but for only one 1°x1° grid point. As you can see, this requires some (basic) manipulation of the given data set.
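To make the single-grid-point idea concrete, here is a minimal sketch of turning a longitude/latitude pair into array indices. It uses a random array in place of the real SSTs, the target location is hypothetical, and the index formulas assume the layout in the diagram above (cell (1,1) at 180W/90N, one degree per cell), so check them against the file's own coordinate variables before relying on them.

```matlab
% Synthetic stand-in for the SST array: 360 lon x 180 lat x nMonths.
nMonths = 24;
sst = rand(360, 180, nMonths);

% Hypothetical target location (degrees east, degrees north).
targetLon = -61.5;   % 61.5W
targetLat =  13.5;   % 13.5N

% Assumed layout, following the diagram: column 1 starts at 180W and
% row 1 starts at 90N, each cell spanning one degree.
lonIdx = min(floor(targetLon + 180) + 1, 360);   % 1..360, west to east
latIdx = min(floor(90 - targetLat) + 1, 180);    % 1..180, north to south

% Pull out every time step for that one cell as a column vector.
series = squeeze(sst(lonIdx, latIdx, :));
```

The `min` calls only guard the edge cases of exactly 180E and 90S, which would otherwise land one index past the end of the array.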

The problem is that many data sets are distributed as ASCII text in .txt files. These are tough to work with on a non-Linux system, and it takes a long time to edit and prepare them for statistical or computational use in any software. The good news is that most data sets are also provided in NetCDF (.nc) format. The Network Common Data Form (netCDF) is a set of software libraries and self-describing, machine-independent data formats that support the creation, access, and sharing of array-oriented scientific data. The project was initiated by the University Corporation for Atmospheric Research (UCAR). I couldn't find a simple method online for manipulating these big files (~400 MB to 4 GB), whether as .txt or .nc.

Without going into the intricacies of netCDF libraries and formats, here is the easiest way of manipulating netCDF (and hence, global temperature) data sets in basic MATLAB (no fancy toolboxes required!):

  • Download the netCDF (.nc) version of the data set.
  • If you have a later version of MATLAB, there are built-in functions capable of reading netCDF files; otherwise you can download the required functions/libraries here.
  • Open the data file with the netcdf.open function, which returns a file identifier. Pass the NC_NOWRITE mode to open it read-only (you typically don't want to tamper with the original .nc file).
  • Inspect the file with the netcdf.inq family of functions (netcdf.inq, netcdf.inqVar), which report the variables that the creator of the file defined and the dimensions of each variable (if available, you can use ncinfo instead).
  • Assign the complete data of the particular variable you want (usually the last variable in the netCDF file - the one holding every single data point in the array) to a MATLAB array.
  • This new array takes the dimensions of the complete data set.
  • Now you are ready to go - you can manage the huge data set through simple array manipulation.
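On MATLAB releases that ship the high-level functions (ncinfo, ncread), the steps above can be sketched in a few lines. This is an illustration rather than a recipe tested on HadISST itself: it builds a tiny stand-in file first so that it runs anywhere, and the variable name 'sst' and the dimension names are assumptions. The start/count form of ncread is the interesting part, since it reads one grid cell through time without loading the whole array.

```matlab
% Build a tiny stand-in file so the sketch runs without HadISST.nc.
% The variable name 'sst' and the dimension names are assumptions.
fname = [tempname '.nc'];
demo  = reshape(1:60, 4, 3, 5);          % 4 lon x 3 lat x 5 time steps
nccreate(fname, 'sst', 'Dimensions', {'longitude', 4, 'latitude', 3, 'time', 5});
ncwrite(fname, 'sst', demo);

% Inspect the file: variable names and the size of the first variable.
info = ncinfo(fname);
disp({info.Variables.Name});
disp(info.Variables(1).Size);

% Read the whole variable...
full = ncread(fname, 'sst');

% ...or only the cell at (lon index 2, lat index 3) for all time steps.
% start = [2 3 1], count = [1 1 Inf] means "to the end" along time.
series = squeeze(ncread(fname, 'sst', [2 3 1], [1 1 Inf]));

delete(fname);                           % clean up the temporary file
```

For a file of several hundred megabytes, the partial read is usually the gentler option on memory.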

For example:

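had = netcdf.open('HadISST.nc','NC_NOWRITE'); % open read-only; returns a file ID
[varname, xtype, varDimIDs, varAtts] = netcdf.inqVar(had,4) % '4' is a zero-based variable ID (SST here); no semicolon, so the details are displayed
varid = netcdf.inqVarID(had,varname); % look the ID up again by name
data = netcdf.getVar(had,varid); % this is the full data set.
netcdf.close(had); % release the file once the data is in memory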
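Hard-coding the ID 4 assumes a particular variable order in the file. A slightly more defensive variant, sketched below, lists every variable with its ID and then fetches the one you want by name. The sketch writes its own two-variable stand-in file so that it runs without HadISST.nc; the variable names 'longitude' and 'sst' are assumptions, so substitute whatever names the listing prints for your file.

```matlab
% Create a small stand-in file with the low-level API (names are assumptions).
fname = [tempname '.nc'];
ncid = netcdf.create(fname, 'CLOBBER');
lonDim  = netcdf.defDim(ncid, 'longitude', 4);
timeDim = netcdf.defDim(ncid, 'time', 3);
lonVar  = netcdf.defVar(ncid, 'longitude', 'NC_DOUBLE', lonDim);
sstVar  = netcdf.defVar(ncid, 'sst', 'NC_DOUBLE', [lonDim timeDim]);
netcdf.endDef(ncid);
netcdf.putVar(ncid, lonVar, [-1.5 -0.5 0.5 1.5]);
netcdf.putVar(ncid, sstVar, reshape(1:12, 4, 3));
netcdf.close(ncid);

% Reopen read-only and list every variable with its zero-based ID.
ncid = netcdf.open(fname, 'NC_NOWRITE');
[numDims, nvars, numGlobalAtts, unlimDimID] = netcdf.inq(ncid);
names = cell(1, nvars);
for varid = 0:nvars-1
    names{varid+1} = netcdf.inqVar(ncid, varid);   % first output is the name
    fprintf('%d: %s\n', varid, names{varid+1});
end

% Fetch the data by name instead of by a hard-coded number.
sstId = netcdf.inqVarID(ncid, 'sst');
data  = netcdf.getVar(ncid, sstId);
netcdf.close(ncid);
delete(fname);                                     % clean up the temporary file
```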

In the case of the HadISST data set, changing the variable ID (i.e. the 4 in the second line) yields different variables (0 - longitude, 1 - latitude, 2 - time, 3 - specific months, 4 - SST). The order may differ between versions of the file, so check it against the output of netcdf.inqVar. Since you ultimately want to work with the SSTs, variable ID 4 yields the complete SST data set. From there it is a matter of simple array manipulation to obtain the subset you require, be it a particular time slice, a particular range of latitudes, a single spatial point or a single month's global data. To work out which indices correspond to a particular location or date, it is useful to call netcdf.inqVar and netcdf.getVar on the variables other than SST (i.e. the coordinate variables) as well. Once you gather more experience with this basic method, you can look at netCDF handling toolboxes (I would recommend mexcdf - particularly, nc_dump comes in handy).
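The kinds of subsets mentioned above each come down to one line of indexing. Here is a minimal sketch on a random stand-in array. The assumptions are labelled in the comments: a (longitude, latitude, time) layout with row 1 at 90N, a first time step that falls in January, and flagged cells (land, sea ice) stored as large negative numbers. Check the variable's attributes in your own file for the actual fill values.

```matlab
% Synthetic stand-in for the SST array: 360 lon x 180 lat x 36 months.
% Assumption: the first time step is a January.
sst = rand(360, 180, 36);

% Assumption: flagged cells (land, sea ice) hold large negative numbers.
sst(1, 1, 1) = -1000;                 % plant one fake flag value for the demo
sst(sst < -100) = NaN;                % mask flags so they do not skew statistics

timeSlice   = sst(:, :, 14);          % global field at the 14th time step
latBand     = sst(:, 67:114, :);      % rows 67..114: about 24N to 24S in this layout
januaries   = sst(:, :, 1:12:end);    % every 12th step: all Januaries
pointSeries = squeeze(sst(119, 77, :));   % one grid cell through time

% Unweighted mean of each global field, ignoring masked cells
% (not area-weighted, so only a rough summary).
globalMean = squeeze(mean(mean(sst, 1, 'omitnan'), 2, 'omitnan'));
```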

The resulting array can easily be written to any required format (.xls, .xlsx, .dat, .xml etc.) from MATLAB. This is probably the easiest way of extracting data from a global temperature database in a form fit for use in programs such as Excel or SigmaPlot. Any corrections/suggestions for improving this method are welcome.
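As a small illustration of the export step, the sketch below writes a stand-in time series to a CSV file and reads it back. It assumes MATLAB R2019a or later, where writematrix and readmatrix are available; on older releases, functions such as dlmwrite or xlswrite play the same role. The values and the temporary file name are made up for the demo.

```matlab
% Stand-in time series: a time-step index in column 1, a value in column 2.
series = [(1:5)', (1:5)' * 0.5];

% Write to a temporary CSV file. Changing the extension (e.g. to .xlsx)
% switches the output format.
outFile = [tempname '.csv'];
writematrix(series, outFile);

% Read it back to confirm that the round trip preserved the numbers.
readBack = readmatrix(outFile);
delete(outFile);                      % clean up the temporary file
```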