Including new observational data

In this tutorial we will see how to load observational data onto IMAGINE.

Both observational and simulated data are manipulated within IMAGINE through observable dictionaries. There are three types of these: Measurements, Simulations and Covariances, which can store multiple entries of observational, simulated and covariance (either real or mock) data, respectively. Appending data to an ObservableDict requires the data to comply with a number of format requirements; therefore, we recommend the use of one of the Dataset classes.
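
For reference, Measurements and Covariances are imported from the imagine.observables subpackage later in this tutorial; it is assumed here that Simulations is exposed in the same place. A minimal preview of the pattern used throughout the tutorial:

from imagine.observables import Measurements, Simulations, Covariances

mea = Measurements()  # will hold observational data
cov = Covariances()   # will hold the corresponding covariances
sim = Simulations()   # filled with mock data produced by IMAGINE Simulators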

HEALPix Datasets

Let us illustrate how to prepare an IMAGINE dataset with the Faraday depth map obtained by Oppermann et al. 2012 (arXiv:1111.6186).

The following snippet will download the data (a ~4MB FITS file) to RAM and open it.

[1]:
import requests, io
from astropy.io import fits

download = requests.get('https://wwwmpa.mpa-garching.mpg.de/ift/faraday/2012/faraday.fits')
raw_dataset = fits.open(io.BytesIO(download.content))
raw_dataset.info()
Filename: <class '_io.BytesIO'>
No.    Name                              Ver    Type           Cards   Dimensions     Format
  0    PRIMARY                             1    PrimaryHDU         7   ()
  1    TEMPERATURE                         1    BinTableHDU       17   196608R x 1C   [E]
  2    signal uncertainty                  1    BinTableHDU       17   196608R x 1C   [E]
  3    Faraday depth                       1    BinTableHDU       17   196608R x 1C   [E]
  4    Faraday uncertainty                 1    BinTableHDU       17   196608R x 1C   [E]
  5    galactic profile                    1    BinTableHDU       17   196608R x 1C   [E]
  6    angular power spectrum of signal    1    BinTableHDU       12   384R x 1C      [E]

Now we will feed this into an IMAGINE Dataset. This requires converting the data into a plain numpy array of floats. To allow this notebook to run on a laptop with limited memory, we will also reduce the size of the arrays (keeping only one value in every 256).

[2]:
from imagine.observables import FaradayDepthHEALPixDataset
import numpy as np
from astropy import units as u

# Adjusts the data to the right format
fd_raw = raw_dataset[3].data.astype(float)
sigma_fd_raw = raw_dataset[4].data.astype(float)
# Makes it small, to save memory
fd_raw = fd_raw[::256]
sigma_fd_raw = sigma_fd_raw[::256]
# We need to include units in the data
# (this avoids later errors and inconsistencies)
fd_raw *= u.rad/u.m/u.m
sigma_fd_raw *= u.rad/u.m/u.m
# Loads into a Dataset
dset = FaradayDepthHEALPixDataset(data=fd_raw, error=sigma_fd_raw)

One important assumption in the previous code block is that the covariance matrix is diagonal, i.e. that the errors in FD are uncorrelated. If this is not the case, instead of initializing the FaradayDepthHEALPixDataset with data and error, one should initialize it with data and cov, where the latter is the full covariance matrix.
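
As a minimal sketch of the latter case (built here purely for illustration from the diagonal errors already loaded, and assuming the covariance matrix carries the square of the data units):

# Sketch only: a full (N, N) covariance matrix built by hand.
# Here it is simply diagonal, i.e. equivalent to passing error=sigma_fd_raw,
# but any genuine (N, N) covariance matrix could be used in its place.
sigma = sigma_fd_raw.value.ravel()                  # errors as a plain 1D array
cov_matrix = np.diag(sigma**2) * (u.rad/u.m/u.m)**2

dset_cov = FaradayDepthHEALPixDataset(data=fd_raw, cov=cov_matrix)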

To keep things organised and useful, we strongly recommend creating a personalised dataset class and making it available to the rest of the consortium through the imagine-datasets GitHub repository. An example of such a class is the following:

[3]:
import requests, io
import numpy as np
from astropy.io import fits
from astropy import units as u
from imagine.observables import FaradayDepthHEALPixDataset

class FaradayDepthOppermann2012(FaradayDepthHEALPixDataset):
    def __init__(self, skip=None):
        # Fetches and reads the data
        download = requests.get('https://wwwmpa.mpa-garching.mpg.de/ift/faraday/2012/faraday.fits')
        raw_dataset = fits.open(io.BytesIO(download.content))
        # Adjusts the data to the right format
        fd_raw = raw_dataset[3].data.astype(float)
        sigma_fd_raw = raw_dataset[4].data.astype(float)
        # Includes units in the data
        fd_raw *= u.rad/u.m/u.m
        sigma_fd_raw *= u.rad/u.m/u.m

        # If requested, makes it small, to save memory in this example
        if skip is not None:
            fd_raw = fd_raw[::skip]
            sigma_fd_raw = sigma_fd_raw[::skip]
        # Loads into the Dataset
        super().__init__(data=fd_raw, error=sigma_fd_raw)

With this pre-programmed, anyone will be able to load this dataset into the pipeline by simply doing:

[4]:
dset = FaradayDepthOppermann2012(skip=256)

Once we have a dataset, we can load it into Measurements and Covariances objects (which will be discussed further down).

[5]:
from imagine.observables import Measurements, Covariances

# Creates the instances
mea = Measurements()
cov = Covariances()

# Appends the data
mea.append(dataset=dset)
cov.append(dataset=dset)

Tabular Datasets

So far, we looked into datasets comprising HEALPix maps. One may also want to work with tabular datasets. We exemplify this by fetching and preparing the RM catalogue of Mao et al. 2010 (arXiv:1003.4519). In the case of this particular dataset, we can import the data from VizieR using the astroquery library.

[6]:
import astroquery
from astroquery.vizier import Vizier

# Fetches the relevant catalogue from Vizier
# (see https://astroquery.readthedocs.io/en/latest/vizier/vizier.html for details)
catalog = Vizier.get_catalogs('J/ApJ/714/1170')[0]
catalog[:3] # Shows only first rows
[6]:
Table length=3
  RAJ2000      DEJ2000    GLON    GLAT    RM   e_RM     PI       I    S5.2  f_S5.2  NVSS
  "h:m:s"      "d:m:s"     deg     deg  rad/m2 rad/m2   mJy     mJy
  bytes11      bytes11  float32 float32  int16  int16 float32 float64 bytes3 bytes1 bytes4
13 07 08.33  +24 47 00.7   0.21   85.76     -3      4    7.77  131.49    Yes         NVSS
13 35 48.14  +20 10 16.0   0.86   77.70      3      5    8.72   71.47     No      b  NVSS
13 24 14.48  +22 13 13.1   1.33   81.08     -6      6    7.62  148.72    Yes         NVSS

The procedure for converting this to an IMAGINE dataset is the following:

[7]:
from imagine.observables import TabularDataset
dset_tab = TabularDataset(catalog, name='fd', tag=None,
                          units= u.rad/u.m/u.m,
                          data_column='RM', error_column='e_RM',
                          lat_column='GLAT', lon_column='GLON')

catalog must be a dictionary-like object (e.g. dict, astropy.table.Table, pandas.DataFrame), and the data_column, error_column, lat_column and lon_column arguments specify the key/column name used to retrieve the relevant data from catalog. The name argument specifies the type of measurement that is being stored. This has to agree with the requirements of the Simulator you will use. Some standard available observable names are:

  • ‘fd’ - Faraday depth
  • ‘sync’ - Synchrotron emission, needs the tag to be interpreted
    • tag = ‘I’ - Total intensity
    • tag = ‘Q’ - Stokes Q
    • tag = ‘U’ - Stokes U
    • tag = ‘PI’ - Polarised intensity
    • tag = ‘PA’ - Polarisation angle
  • ‘dm’ - Dispersion measure

The units are provided as an astropy.units.Unit object and are converted internally to the requirements of the specific Simulator being used.
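
For concreteness, the sketch below constructs a TabularDataset directly from a plain Python dictionary. The column names (my_fd, my_fd_err, glon, glat) are arbitrary choices for this example and the values are copied from the first rows of the catalogue above; whether the coordinate columns must carry explicit units or are assumed to be in degrees may depend on the IMAGINE version, so units are attached here for safety.

import numpy as np
from astropy import units as u
from imagine.observables import TabularDataset

# Toy catalogue stored in an ordinary dict, purely for illustration
toy_catalog = {'my_fd': np.array([-3., 3., -6.]),              # Faraday rotation measures
               'my_fd_err': np.array([4., 5., 6.]),            # associated errors
               'glon': np.array([0.21, 0.86, 1.33])*u.deg,     # Galactic longitudes
               'glat': np.array([85.76, 77.70, 81.08])*u.deg}  # Galactic latitudes

dset_toy = TabularDataset(toy_catalog, name='fd', tag=None,
                          units=u.rad/u.m/u.m,
                          data_column='my_fd', error_column='my_fd_err',
                          lat_column='glat', lon_column='glon')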

Again, the procedure can be packed and distributed to the community in a (very short!) personalised class:

[8]:
from astroquery.vizier import Vizier
from imagine.observables import TabularDataset

class FaradayRotationMao2010(TabularDataset):
    def __init__(self):
        # Fetches the catalogue
        catalog = Vizier.get_catalogs('J/ApJ/714/1170')[0]
        # Loads it into the TabularDataset (the catalogue object already contains units)
        super().__init__(catalog, name='fd', units=catalog['RM'].unit,
                         data_column='RM', error_column='e_RM',
                         lat_column='GLAT', lon_column='GLON')
[9]:
dset_tab = FaradayRotationMao2010() # ta-da!

Measurements and Covariances

Again, we can include these observables in our Measurements and Covariances objects. These are dictionary-like objects, i.e. given a key, one can access the corresponding item.

[10]:
mea.append(dataset=dset_tab)
cov.append(dataset=dset_tab)

print('Measurement keys:')
for k in mea.keys():
    print('\t',k)
print('\nCovariance keys:')
for k in cov.keys():
    print('\t',k)
Measurement keys:
         ('fd', 'nan', '8', 'nan')
         ('fd', 'nan', 'tab', 'nan')

Covariance keys:
         ('fd', 'nan', '8', 'nan')
         ('fd', 'nan', 'tab', 'nan')

The keys follow a strict convention: (data-name, str(data-freq), Nside/'tab', str(ext))

  • If the data are independent of frequency, data-freq is set to ‘nan’; otherwise it is a string giving the frequency in GHz.
  • The third value in the key-tuple is the HEALPix Nside (for maps) or the string ‘tab’ for tabular data.
  • Finally, the last value, ext, can be ‘I’, ‘Q’, ‘U’, ‘PI’, ‘PA’, ‘nan’ or other customised tags, depending on the nature of the observable.
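
Following this convention, one can also retrieve entries by writing out the key explicitly, for instance using the keys printed above:

# Retrieving entries through explicit keys (these match the keys printed above)
fd_map = mea[('fd', 'nan', '8', 'nan')]    # the HEALPix (Nside=8) Faraday depth map
fd_tab = mea[('fd', 'nan', 'tab', 'nan')]  # the tabular Faraday rotation catalogue
print(fd_map.data.shape, fd_tab.data.shape)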

An alternative way of including data in an observable dictionary is to choose the key explicitly and adjust the data shape by hand. One can see how this is handled by the Dataset object in the following cell:

[11]:
# This is how HEALPix data can be included without the mediation of Datasets:
mea.append(name=dset.key, new_data=dset.data)
cov.append(name=dset.key, new_data=dset.cov)
# This is what Dataset is doing:
print('The key used in the "name" arg was:', dset.key)
print('The shape of data used in the "new_data" arg was:', dset.data.shape)
The key used in the "name" arg was: ('fd', 'nan', '8', 'nan')
The shape of data used in the "new_data" arg was: (1, 768)

But what exactly is stored in mea and cov? Each entry is handled by an Observable object. Here we illustrate this with the tabular dataset previously defined.

[12]:
print(type(mea[dset_tab.key]))
print('mea.data:', repr(mea[dset_tab.key].data))
print('mea.data.shape:', mea[dset_tab.key].data.shape)
print('mea.unit', repr(mea[dset_tab.key].unit))
print('mea.coords (coordinates dict -- for tabular datasets only):\n',
      mea[dset_tab.key].coords)
print('mea.dtype:', mea[dset_tab.key].dtype)
print('\n\ncov type',type(cov[dset_tab.key]))
print('cov.data type:', type(cov[dset_tab.key].data))
print('cov.data.shape:', cov[dset_tab.key].data.shape)
print('cov.dtype:', cov[dset_tab.key].dtype)
<class 'imagine.observables.observable.Observable'>
mea.data: array([[ -3.,   3.,  -6.,   0.,   4.,  -6.,   5.,  -1.,  -6.,   1., -14.,
         14.,   3.,  10.,  12.,  16.,  -3.,   5.,  12.,   6.,  -1.,   5.,
          5.,  -4.,  -1.,   9.,  -1., -13.,   4.,   5.,  -5.,  17., -10.,
          8.,   0.,   1.,   5.,   9.,   5., -12., -17.,   4.,   2.,  -1.,
         -3.,   3.,  16.,  -1.,  -1.,   3.]])
mea.data.shape: (1, 50)
mea.unit Unit("rad / m2")
mea.coords (coordinates dict -- for tabular datasets only):
 {'type': 'galactic', 'lon': <Quantity [ 0.21,  0.86,  1.33,  1.47,  2.1 ,  2.49,  3.11,  3.93,  4.17,
            5.09,  5.09,  5.25,  5.42,  6.36, 12.09, 12.14, 13.76, 13.79,
           14.26, 14.9 , 17.1 , 20.04, 20.56, 20.67, 20.73, 20.85, 21.83,
           22.  , 22.13, 22.24, 22.8 , 23.03, 24.17, 24.21, 24.28, 25.78,
           25.87, 28.84, 28.9 , 30.02, 30.14, 30.9 , 31.85, 34.05, 34.37,
           34.54, 35.99, 37.12, 37.13, 37.2 ] deg>, 'lat': <Quantity [85.76, 77.7 , 81.08, 81.07, 83.43, 82.16, 84.8 , 78.53, 85.81,
           79.09, 81.5 , 79.69, 80.61, 80.85, 79.87, 81.64, 77.15, 77.12,
           82.52, 84.01, 80.26, 85.66, 82.42, 77.73, 78.94, 80.75, 80.77,
           80.77, 82.72, 80.24, 77.17, 82.19, 82.62, 81.42, 79.25, 85.63,
           81.66, 78.74, 78.72, 79.92, 89.52, 77.32, 87.09, 86.11, 85.04,
           85.05, 78.73, 79.1 , 80.74, 84.35] deg>}
mea.dtype: measured


cov type <class 'imagine.observables.observable.Observable'>
cov.data type: <class 'numpy.ndarray'>
cov.data.shape: (50, 50)
cov.dtype: covariance

The Dataset object may also automatically distribute the data across different nodes if one is running the code with MPI parallelisation – a strong reason to stick to Datasets rather than appending data directly.