Using GES DISC & other NASA data: Part I


It may be unknown to some, but accessing satellite data has never been easier (in theory). It is a most useful source of data; however, learning the lay of the land is time-consuming and, quite frankly, frustrating. One of the main issues is knowing where to look for data and, once you see how much is available, where to start.

NASA’s Earth Science Data Systems (ESDS) Program includes twelve Distributed Active Archive Centers (DAACs) that provide access to oceanographic, socioeconomic, atmospheric, hydrological and geological data. The focus of this article (and website) is energy meteorology, so the two DAACs we will look at are ASDC (Atmospheric Science Data Center) and GES DISC (Goddard Earth Sciences Data and Information Services Center). Although other remote retrieval sources exist, such as the Copernicus Space Component Data Access system (CSCDA), I have found that NASA data offers more spatial, temporal and parameter options, and often includes datasets from Copernicus as well.

In this article, we’ll explore the types of data and how to filter through the many options, specifically in GES DISC. We’ll also have a look at the differences in file hierarchy and how to quickly view the selected data set. In part II (and maybe even part III) of this article series, I’ll take you through the steps of plotting satellite data with Python in meaningful ways.

Finding the right data (in the right way)

There are several options for searching data in GES DISC. We can simply search for specific terms such as “carbon dioxide” or “CO2”, which both return the same 143 datasets. We can further specify a date range (in UTC) and limit our search to a bounding box of specific coordinates. The bounding box will not significantly reduce the number of datasets returned, but it becomes a crucial parameter once we have selected a specific dataset.
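To make the role of the date range and bounding box concrete, here is a minimal Python sketch of the same kind of filtering applied to a list of granule records. The `Granule` record, its fields and the sample values are hypothetical and exist only for illustration; this is not how GES DISC implements its search.

```python
from dataclasses import dataclass
from datetime import datetime, timezone


@dataclass
class Granule:
    """Hypothetical granule record: a name, a UTC start time and a footprint."""
    name: str
    start: datetime
    west: float
    south: float
    east: float
    north: float


def overlaps_bbox(g, west, south, east, north):
    """True if the footprint intersects the search box (no dateline crossing)."""
    return (g.west <= east and g.east >= west
            and g.south <= north and g.north >= south)


def search(granules, start, end, bbox):
    """Keep granules that start within [start, end] and touch the bounding box."""
    return [g for g in granules
            if start <= g.start <= end and overlaps_bbox(g, *bbox)]


# Made-up records, purely to exercise the filter.
utc = timezone.utc
granules = [
    Granule("a", datetime(2020, 1, 1, 0, 30, tzinfo=utc), 10.0, -35.0, 35.0, -20.0),
    Granule("b", datetime(2020, 1, 1, 12, 0, tzinfo=utc), -80.0, 30.0, -60.0, 45.0),
    Granule("c", datetime(2020, 2, 1, 0, 30, tzinfo=utc), 15.0, -30.0, 30.0, -25.0),
]

southern_africa = (16.0, -35.0, 33.0, -22.0)  # west, south, east, north (degrees)
hits = search(
    granules,
    datetime(2020, 1, 1, tzinfo=utc),
    datetime(2020, 1, 31, tzinfo=utc),
    southern_africa,
)
print([g.name for g in hits])
```

Only record "a" survives: "b" lies outside the box and "c" falls outside the date range. Within a single dataset, this is why the bounding box matters so much: it decides which granules you end up downloading.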

The second option is searching by category (which includes terms such as atmospheric chemistry, clouds, and precipitation), either before or after our initial search for “CO2”. We can also search by category (or subject) and then by measurement (which includes terms such as absorption, carbon dioxide, and surface temperature). Further filters are source, processing level, project, temporal resolution, and spatial resolution.

We can also be more specific: are we looking for carbon dioxide surface concentrations, total column abundance, or a vertical profile? (I will elaborate on these subjects in another post.) A good place to start would be to: 1) download a single file from the selected dataset and have a quick look at it, or 2) read the dataset documentation (which you will have to do eventually).

Let’s select the dataset AIRS/Aqua L2 Support Retrieval (AIRS-only) V006 (AIRS2SUP 006). Clicking on the title takes us to a page that summarises the dataset (also called a product), supplies documentation on the quality of the data, and offers several ways to access the data. The data access options are: 1) online archive, 2) Earthdata search, 3) Simple Subset Wizard (SSW), 4) OPeNDAP, and 5) GES DISC’s “Subset/Get Data”.

The online archive is basically a set of directories containing the data from the specific dataset. It is ideal for bulk downloads via FTP (File Transfer Protocol) and wget. I won’t be getting into this data access option, but GES DISC has a how-to for wget. I would not recommend this option if you are still only exploring datasets. OPeNDAP is similar to the online archive.

The Earthdata search supplies a more visual platform for searching for a subset (also known as a granule) within a dataset. Because the spatial and temporal search parameters are separated, it can take a significant amount of scrolling to find the correct granule. However, if you would like an aesthetic experience, this option could work for you.

The Simple Subset Wizard (SSW) really is a simple way to narrow down your search within a dataset. You can select a dataset keyword (AIRS2SUP 006 in this case, usually auto-completed when you navigate from the SSW button on the product page), a date range, and a spatial bounding box that you can complete either with coordinates or via a map.

Finally, we have the Subset/Get Data option, which is very similar to the Simple Subset Wizard. I have found this the most intuitive way to search for and download subsets. You will see a pop-up window with options to change the download method, date range, (spatial) region, and in some cases the file format. You can also see the estimated size of the results, which is often humongous before the subset search is refined.

The two download method options are downloading the original files (essentially from the online archive), or a file subset that can be used to retrieve bulk subsets via FTP. Changing the download option to the “file subset” will also allow changing the file format in this case.

Changing the date range is intuitive, and refining the region again comes with several options, including a rectangular or circular bounding box, or a place marker.

Once the required refinements have been made to the subset selection, “Get Data” lists all the subsets within the selected spatial and temporal range, along with important documentation for the specific dataset. From here, you can either download a list that only includes the links (you can use this with wget again), or click on each file link to download the files separately. You can click anywhere outside the pop-up window to go back to the product page. At the bottom of the product page, you will also find a “History” tab that lists previously accessed datasets, from which you can navigate back to some of the datasets you explored earlier.
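If you prefer to handle the downloaded list of links in Python rather than with wget, a first step could look like the sketch below. It assumes the list is plain text with one URL per line and that the file name is the last part of the URL path, which may not hold for subsetted links that carry their parameters in a query string. The URLs are placeholders, and the actual download (which typically also needs an Earthdata login) is left out.

```python
from pathlib import PurePosixPath
from urllib.parse import unquote, urlparse


def parse_link_list(text):
    """Return (url, filename) pairs from a text with one URL per line."""
    pairs = []
    for line in text.splitlines():
        url = line.strip()
        if not url or url.startswith("#"):
            continue  # skip blank lines and comments
        filename = PurePosixPath(unquote(urlparse(url).path)).name
        pairs.append((url, filename))
    return pairs


# Placeholder links; a real list from "Get Data" points at GES DISC servers.
sample = """
https://example.org/data/2020/001/granule_001.hdf
https://example.org/data/2020/001/granule_002.hdf
"""

for url, filename in parse_link_list(sample):
    print(filename, "<-", url)
```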

Understanding the file types

If the same point is reached for another dataset, such as Sentinel-5P TROPOMI Total Ozone Column 1-Orbit L2 5.5km x 3.5km, the file format will show netCDF rather than HDF-EOS. Other formats and file extensions you will come across include HDF-EOS5, HDF5, HE5, H5, and NC. All of these file formats are hierarchical (or array-oriented); in other words, large amounts of data are stored and organised within a single file. It also means that the file contents are linked to coordinates such as time, longitude and latitude.
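The extension does not always tell you which container format a file really uses (netCDF-4 files, for instance, are HDF5 files underneath). A rough way to check is to look at the first few bytes of the file. The sketch below is only a quick heuristic: it assumes the signature sits at the very start of the file, which is the usual case but not guaranteed for HDF5, and the labels are my own.

```python
import os
import tempfile

# Leading bytes ("magic numbers") of the formats discussed here.
SIGNATURES = [
    (b"\x89HDF\r\n\x1a\n", "HDF5 (also HDF-EOS5 and netCDF-4)"),
    (b"\x0e\x03\x13\x01", "HDF4 (also HDF-EOS2)"),
    (b"CDF\x01", "netCDF classic"),
    (b"CDF\x02", "netCDF 64-bit offset"),
]


def sniff_format(path):
    """Guess the container format from the first bytes of a file."""
    with open(path, "rb") as f:
        head = f.read(8)
    for magic, label in SIGNATURES:
        if head.startswith(magic):
            return label
    return "unknown"


# Demonstration on a throwaway file that only contains an HDF5 signature.
with tempfile.TemporaryDirectory() as tmp:
    demo = os.path.join(tmp, "demo.h5")
    with open(demo, "wb") as f:
        f.write(b"\x89HDF\r\n\x1a\n")
    result = sniff_format(demo)

print(result)
```

Knowing the container matters later on, because it determines which viewer or Python library can open the file.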

The specific HDF (Hierarchical Data Format) type also indicates whether the data within the file represents a swath, grid or point. It is generally less important to know which type a file represents than to understand what its spatial coverage looks like. This is, however, easily determined by viewing a single subset in an HDF viewer (which I will get to later in this article). As shown below, the spatial coverage can vary appreciably.

Another important characteristic of each subset file is its file name, which adheres to a very specific naming convention. This convention is usually found in the read-me or user guide of each dataset. From the file name, it can also be inferred what the spatial coverage will look like.

The AIRS (Atmospheric InfraRed Sounder) dataset referred to earlier, for example, follows the file naming convention AIRS.yyyy.mm.dd.ggg.Lev.productType.vm.m.r.b.GproductionTimeStamp.hdf, where “ggg” indicates a granule number. So we can expect that the data included within the subset is a square within a swath. In some cases (such as MLS-Aura) no granule or orbit number is included, implying that several orbits are included within a single file.
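Because the naming convention is this regular, the file name can be split into its parts programmatically. The sketch below translates the AIRS pattern quoted above into a regular expression. The example file name is made up to fit that pattern (including the placeholder product type "ProdType"), and the assumption that the version and production time stamp consist only of digits is mine; check the dataset documentation for the real values.

```python
import re

# AIRS.yyyy.mm.dd.ggg.Lev.productType.vm.m.r.b.GproductionTimeStamp.hdf
AIRS_NAME = re.compile(
    r"^AIRS\.(?P<year>\d{4})\.(?P<month>\d{2})\.(?P<day>\d{2})"
    r"\.(?P<granule>\d{3})\.(?P<level>[^.]+)\.(?P<product>[^.]+)"
    r"\.v(?P<version>\d+\.\d+\.\d+\.\d+)\.G(?P<production>\d+)\.hdf$"
)


def parse_airs_name(filename):
    """Split an AIRS-style file name into a dictionary of its fields."""
    match = AIRS_NAME.match(filename)
    if match is None:
        raise ValueError(f"not an AIRS-style file name: {filename}")
    fields = match.groupdict()
    fields["granule"] = int(fields["granule"])  # granule number as an integer
    return fields


# Made-up name that follows the convention; not a real granule.
example = "AIRS.2020.01.15.120.L2.ProdType.v6.0.0.0.G20016000000.hdf"
fields = parse_airs_name(example)
print(fields["year"], fields["month"], fields["day"], fields["granule"])
```

Sorting or filtering a folder full of downloads by date or granule number then becomes a matter of comparing these fields.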

Exploring individual files will soon show that although all datasets have a hierarchical structure, these structures may differ slightly, changing the way in which data is accessed for plotting (even in Python).

Visualising satellite data

If the objective is simpler analysis or visualisation, NASA offers a platform called Giovanni, which includes time-averaged maps, scatter plots and animations. The advantage of Giovanni is that you do not have to download any data (so essentially, you don’t have to understand anything in this article) and the platform combines several subsets and/or swaths to supply a visualisation for the specified region and date. The disadvantages, however, are limited access to GES DISC data, and in some cases, lower spatial resolutions and/or long processing times.

If you would like to quickly view a specific subset, however, an HDF viewer is a better option. The two I often use are HDFView and Panoply, depending on the purpose of the visualisation and the type of data.

Above I opened files from MLS/Aura Level 2 Diagnostics, Geophysical Parameter Grid V004 (ML2DGG) in both HDFView and Panoply. These files have relatively large structures and, although they load almost instantaneously in HDFView, there is a considerable waiting time in Panoply. Both viewers show the structure tree; in Panoply the file attributes are listed on the right (in HDFView they can be viewed under HDFEOS/ADDITIONAL/FILE_ATTRIBUTES). The advantage of Panoply, however, is the ability to search within a file, which is especially helpful if you are working with a file with many sub-directories.

The hierarchical structure of these files is similar to a folder structure. So if we want to refer to a specific parameter, we can use sub-directories such as “HDFEOS/SWATHS/O3-StdProd_column/Data_Fields/L2gpValue” (this is how data is accessed when plotting with Python). In other words, the data parameter “L2gpValue” is within a folder called “Data_Fields” that is within a folder called “O3-StdProd_column”, and so forth.

There are, of course, advantages and disadvantages to both HDF viewers. The principal disadvantage of Panoply is its inability to read some H5 files, such as those from ACOS (Atmospheric CO2 Observations from Space):
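The folder analogy can be sketched in plain Python with nested dictionaries standing in for the groups of a file. This is only an illustration of how such a path is resolved step by step; the stand-in "file" and its numbers are made up and are not real ozone values.

```python
def lookup(tree, path):
    """Walk a nested dict along a slash-separated path, like folders on disk."""
    node = tree
    for part in path.strip("/").split("/"):
        node = node[part]  # raises KeyError if a group or dataset is missing
    return node


# Stand-in for a file: groups are dicts, the dataset is a short list.
fake_file = {
    "HDFEOS": {
        "SWATHS": {
            "O3-StdProd_column": {
                "Data_Fields": {"L2gpValue": [1.0, 2.0, 3.0]},
            }
        }
    }
}

path = "HDFEOS/SWATHS/O3-StdProd_column/Data_Fields/L2gpValue"
values = lookup(fake_file, path)
print(values)
```

For HDF5-based files, a library such as h5py accepts the same kind of slash-separated path directly when indexing an open file, so the path you read off the viewer's structure tree is the one you would pass in.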

If we open the same file in HDFView, we can see the file structure and individual parameter result tables:

The table above shows that the columns correspond to the pressure levels, while each row is the CO2 profile across all pressure levels at a specific latitude and longitude. We can select a row/column by clicking on a row/column button; the row/column will be highlighted. We can plot this row/column by clicking the table button in the results table, selecting the row radio button and clicking “Ok”. The resulting line plot cannot be saved through HDFView, so if you need this plot you will have to use a screen grabber or snipping tool. In the figure below, the pressure or altitude levels are on the x-axis and the concentration at each level is on the y-axis.

HDFView can only plot linear rows or columns, while Panoply also provides geographical plots (such as those from Giovanni) for certain datasets. Panoply further allows plot font and colour changes. However, due to file structure variation, Panoply can read some files but not plot the data within them. Unfortunately, if Panoply cannot plot a parameter, the table values cannot be accessed either.

So, it is fair to say that you will probably need both viewers if you are planning to access different datasets.

Plots in Panoply

In the previous section, plotting in HDFView was briefly illustrated. In this section, we’ll have a look at plotting in Panoply.

As previously mentioned: 1) a georeferenced plot in Panoply is only possible if the file hierarchy is suitable, and 2) the type of “scatter” in the plot is predetermined by the dataset. We first have a look at an AIRS VMR (Volume Mixing Ratio) NH3 (Ammonia) dataset from GES DISC. As you can see below, this dataset is relatively simple with only one level of directory: lat, long, nh3_vmr, press_level, time and time_bnds. If we double click on any of the parameters except nh3_vmr, a single linear plotting option is provided since these parameters are each essentially a list.

Double clicking on press_level, we get this window where we can change (in some cases) the variable and axis if required:

We can toggle between “file” (the plot) and “array” (table), which in this case is just a list with each x-value equal to the y-value.

The plot above doesn’t tell us much, but a similar line plot can be produced for more comprehensive parameters, such as the VMR:

You can see that I changed the titles, grids, colours, scales and labels on these plots for better visualisation. One peculiar thing about Panoply is that changing the plot characteristics can be finicky; often some parameters seem fixed. If this is the case, make sure that you have only one plotting window open and that everything has loaded successfully, and then just try again. Or switch to the “Array” tab and switch back, which often reactivates some of the options as well.

Now we will plot georeferenced colour contour plots by double-clicking on nh3_vmr and selecting the top radio button. If you select the first item in the drop-down list, “Latitude-Vertical”, you will see a plot similar to the previous contour plot on the right.

Selecting the second item “Longitude-Latitude” will provide a true georeferenced map plot:

The VMR scale is, however, not suitable as it stands. Under “Scale” we change the “Tick Format” to “%.3E”, and under “Labels” we replace the exponent “E” with a superscripted 10 by selecting the “Exponent” tick box. We may also change the colours and decide to select a discrete colour bar. We shorten the scale caption by selecting the “Custom” radio button under “Scale”. We also customise the title to include all the specifics of the plot. To save the plot, you can go to “File” and “Save Image As”, rather than using a screen capture as in HDFView.
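If the “%.3E” tick format is unfamiliar: it follows the usual printf conventions, meaning three decimals in scientific notation with an upper-case E. The sketch below reproduces that formatting in Python and then rewrites the exponent as a power of ten, similar in spirit to what the “Exponent” tick box does. The function names and the plain-text "x 10^" notation are my own choices for illustration.

```python
def tick_label(value, fmt="%.3E"):
    """Format a value printf-style, three decimals in scientific notation."""
    return fmt % value


def exponent_label(value, fmt="%.3E"):
    """Rewrite the 'E' notation of a tick label as a power of ten."""
    mantissa, exponent = tick_label(value, fmt).split("E")
    return f"{mantissa} x 10^{int(exponent)}"


print(tick_label(2.5e-9))      # 2.500E-09
print(exponent_label(2.5e-9))  # 2.500 x 10^-9
```

For very small quantities such as a volume mixing ratio, this keeps the tick labels short and comparable across the colour bar.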

You can also change the pressure the plot represents under “Array(s)” by selecting a pressure level from the drop-down list at “Atmospheric layer equivalent pressure”. You can zoom in, add overlays and change the projection – in this regard, Panoply is relatively flexible but it requires patience and time to figure out all the options. With large datasets like this, there is often not a single set of plotting parameters that will work in all situations.
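Selecting a pressure level from the drop-down list amounts to slicing the array along its pressure dimension. Below is a minimal sketch of that idea in plain Python. It assumes a [time][press_level][lat][lon] layout, which is my assumption rather than something read from the file, and the pressure levels and values are made up.

```python
def nearest_level_index(levels, target):
    """Index of the pressure level closest to the requested value."""
    return min(range(len(levels)), key=lambda i: abs(levels[i] - target))


# Illustrative pressure levels in hPa (not read from a real file).
press_level = [1000.0, 925.0, 850.0, 700.0]

# Tiny made-up array laid out as [time][press_level][lat][lon].
nh3_vmr = [[
    [[1.0, 1.1], [1.2, 1.3]],  # 1000 hPa
    [[2.0, 2.1], [2.2, 2.3]],  # 925 hPa
    [[3.0, 3.1], [3.2, 3.3]],  # 850 hPa
    [[4.0, 4.1], [4.2, 4.3]],  # 700 hPa
]]

idx = nearest_level_index(press_level, 900.0)
layer = nh3_vmr[0][idx]  # lat x lon slice at the first time step
print(press_level[idx], layer)
```

The resulting two-dimensional slice is what ends up on the longitude-latitude map for the chosen level.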

It should be noted that, due to the previously mentioned limitations of Panoply, plotting in Python is still preferred, despite the slight increase in complexity. However, for quick visualisation of satellite data, Panoply is a reasonable solution. Below is an example of GES DISC data plotted in Python with Matplotlib.

Georeferenced plots of CO (carbon monoxide) column amount in the troposphere for the various seasons over specific regions in southern Africa.

Visualisation of satellite data in Python will be covered in a future article. In the meantime, you can have a look at the Coding resources page where links to the HDF viewers and general Python resources are supplied. You can also have a look at the comprehensive examples of access and visualisation of NASA HDF5 files with Python and MATLAB by HDF-EOS.
