Open in Colab: https://colab.research.google.com/github/casangi/cngi_prototype/blob/master/docs/data_structures.ipynb


Data Structures

An overview of the new image and visibility data structures and how to convert from the casacore-based file formats.

[1]:
# Installation
import os
print("installing casa6 + cngi (takes a minute or two)...")
os.system("apt-get install libgfortran3")
os.system("pip install casatasks==6.3.0.48")
os.system("pip install casadata")
os.system("pip install cngi-prototype==0.0.91")

# Retrieve and extract demonstration datasets
print('retrieving MS tarfiles...')
!gdown -q --id 15HfB4rJKqEH7df088Ge5YLrCTXBIax6R
!gdown -q --id 1N9QSs2Hbhi-BrEHx5PA54WigXt8GGgx1
print('extracting MS tarfiles...')
!tar -xf M100.ms.tar
!tar -xzf sis14_twhya_calibrated_flagged.ms.tar.gz

print('complete')
installing casa6 + cngi (takes a minute or two)...
retrieving MS tarfiles...
extracting MS tarfiles...
complete

MeasurementSet Conversion

CNGI uses an xarray dataset and the zarr storage format to hold the contents of the MS. This provides several advantages over the old MS structure, including:

1. Easier access and manipulation of data with numpy-like mathematics
2. N-dim visibility cubes (time, baseline, chan, pol) instead of interlaced rows of variable shape
3. Natively parallel and scalable operations
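A toy sketch (not the actual conversion code, all shapes made up) of what advantage 2 means: an MS stores interlaced rows keyed by (time, baseline), while the xds stores the same values as a dense 4-D cube.

```python
import numpy as np

# MS-style storage: one row per (time, baseline) pair, each row
# holding a (chan, pol) block of visibilities.
n_time, n_baseline, n_chan, n_pol = 3, 2, 4, 2
rows = np.arange(n_time * n_baseline * n_chan * n_pol, dtype=complex)
rows = rows.reshape(n_time * n_baseline, n_chan, n_pol)

# xds-style storage: a single DATA cube indexed by all four dimensions.
cube = rows.reshape(n_time, n_baseline, n_chan, n_pol)
print(cube.shape)  # (3, 2, 4, 2)
```

With the cube form, operations like averaging over time or selecting a channel range become single numpy-style slices instead of row bookkeeping.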

Data Description IDs

The conversion process will translate an MS directory on disk into one or more Xarray Datasets (xds) and (optionally) store them to a Zarr directory on disk. To properly support dimension expansion from interlaced rows, each MS SPW/Pol combination, denoted by Data Description ID (DDI), must be processed and partitioned individually, along with the various subtables.

We begin by inspecting the structure of the M100 MeasurementSet to determine how best to convert it.

[2]:
from cngi.conversion import describe_ms

describe_ms('M100.ms')

[2]:
spw_id pol_id rows times baselines chans pols size_MB
ddi
0 0 0 32464 568 154 240 2 361
1 1 0 32464 568 154 240 2 361
2 2 0 32464 568 154 240 2 361
3 3 0 31408 568 142 240 2 333

This demonstration MS has four DDIs, corresponding to four different SPWs. In this case they are of similarly modest size.

Let’s convert two of them now using the default settings.

[3]:
from cngi.conversion import convert_ms

mxds = convert_ms('M100.ms', ddis=[0,2])
Completed ddi 0  process time 10.62 s
Completed ddi 2  process time 10.09 s

Larger numbers of baselines and channels will consume more memory during conversion. If a particular DDI shape is too large for the host machine memory, a smaller chunk size along the time axis may be needed.
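The memory footprint of a single DATA chunk can be estimated from the chunk shape and the 16 bytes per complex128 visibility. This is back-of-the-envelope arithmetic only; the shapes below mirror the chunkings used in this notebook, and real conversion memory use will be higher due to overheads.

```python
import math

def chunk_size_mb(chunk_shape, data_shape, bytes_per_element=16):
    # an effective chunk never exceeds the actual data extent
    effective = [min(c, s) for c, s in zip(chunk_shape, data_shape)]
    return math.prod(effective) * bytes_per_element / 1e6

# (times, baselines, chans, pols) for ddi 0 of the demonstration MS
shape = (568, 154, 240, 2)
print(chunk_size_mb((100, 400, 32, 1), shape))  # chunking seen above
print(chunk_size_mb((50, 400, 32, 1), shape))   # halved time chunks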

Let’s pretend this is the case for the other two DDIs and use a smaller chunk size for them, making sure to append to, and not overwrite, our first two.

[4]:
mxds = convert_ms('M100.ms', ddis=[1,3], chunks=(50,400,32,1), append=True)
Completed ddi 1  process time 12.99 s
Completed ddi 3  process time 12.24 s

Finally, let’s convert the subtables within the original MS. They are now referred to as the “global” data in the Zarr directory, as their contents apply to all of the DDI partitions.

Some subtable columns may give errors during conversion if the casacore table system cannot read them as numpy arrays. The resulting Xarray dataset will omit these columns. This is a known issue.

[5]:
mxds = convert_ms('M100.ms', ddis=['global'], append=True)
Completed subtables  process time 0.87 s

If we are comfortable with leaving things at default, we can convert the entire contents of an MS at once. The TWHya MS is small and safe to convert without much worry, but we will give it a shorter output filename.

[6]:
from cngi.conversion import convert_ms

mxds = convert_ms('sis14_twhya_calibrated_flagged.ms', outfile='twhya.vis.zarr')
Completed ddi 0  process time 23.69 s
Completed subtables  process time 1.21 s

Xarray/Zarr Partitions

After converting an MS to the new Xarray / Zarr based format, we have a <filename>.vis.zarr directory on disk. We can inspect its contents to see if the conversion was successful.

[7]:
from cngi.dio import describe_vis

describe_vis('M100.vis.zarr')

[7]:
spw_id pol_id times baselines chans pols size_MB
xds
xds0 0 0 568 154 240 2 1011
xds1 1 0 568 154 240 2 1011
xds2 2 0 568 154 240 2 1011
xds3 3 0 568 142 240 2 932

The four SPWs are now contained in four separate partitions. There are no more rows, only the four dimensions of (time, baseline, channel, polarization) to describe each field.

When we go to open and use the new format for subsequent CNGI operations, we will refer to the specific visibility xarray dataset (xds) that we want to use.

MeasurementSet v3 Schema

The conversion process attempts to keep the same column names, definitions, and relationships from the original MS structure whenever possible. This means that, for example, the DATA column of the main table in the MS is still called DATA in the Xarray Dataset, but it is now a data variable within the dataset. Similarly the column names of the various subtables are reflected as data variable names in the new xarray datasets.

As part of the evolution to new data structures, the opportunity was taken to also update to the new MSv3 schema. This means that when converting current CASA MSv2 datasets, certain columns are translated or dropped per the MSv3 definition.

While CNGI typically only operates on a single visibility partition, we can open and inspect the entire zarr directory contents in a manner similar to how the entire MS could be opened and inspected previously. This is done by constructing an xarray dataset of datasets, referred to as the master xarray dataset (mxds).

[8]:
from cngi.dio import read_vis

mxds = read_vis('M100.vis.zarr')

print(mxds)
overwrite_encoded_chunks True
<xarray.Dataset>
Dimensions:           (antenna_ids: 27, feed_ids: 108, field_ids: 48, observation_ids: 4, polarization_ids: 1, source_ids: 4, spw_ids: 4, state_ids: 24)
Coordinates:
  * antenna_ids       (antenna_ids) int64 0 1 2 3 4 5 6 ... 20 21 22 23 24 25 26
    antennas          (antenna_ids) <U16 'CM01' 'DV01' 'DV03' ... 'PM01' 'PM03'
  * field_ids         (field_ids) int64 0 1 2 3 4 5 6 7 ... 41 42 43 44 45 46 47
    fields            (field_ids) <U16 'M100' 'M100' 'M100' ... 'M100' 'M100'
  * feed_ids          (feed_ids) int64 0 0 0 0 0 0 0 0 0 0 ... 0 0 0 0 0 0 0 0 0
  * observation_ids   (observation_ids) int64 0 1 2 3
    observations      (observation_ids) <U16 'T.B.D.' 'T.B.D.' 'T.B.D.' 'T.B.D.'
  * polarization_ids  (polarization_ids) int64 0
  * source_ids        (source_ids) int64 0 0 0 0
    sources           (source_ids) <U16 'M100' 'M100' 'M100' 'M100'
  * spw_ids           (spw_ids) int64 0 1 2 3
  * state_ids         (state_ids) int64 0 1 2 3 4 5 6 7 ... 17 18 19 20 21 22 23
Data variables:
    *empty*
Attributes: (12/17)
    xds2:             <xarray.Dataset>\nDimensions:         (baseline: 154, c...
    xds3:             <xarray.Dataset>\nDimensions:         (baseline: 142, c...
    xds0:             <xarray.Dataset>\nDimensions:         (baseline: 154, c...
    xds1:             <xarray.Dataset>\nDimensions:         (baseline: 154, c...
    ANTENNA:          <xarray.Dataset>\nDimensions:        (antenna_id: 27, d...
    ASDM_ANTENNA:     <xarray.Dataset>\nDimensions:         (d0: 13, d1: 3)\n...
    ...               ...
    POLARIZATION:     <xarray.Dataset>\nDimensions:       (d0: 1, d1: 2, d2: ...
    PROCESSOR:        <xarray.Dataset>\nDimensions:   (d0: 3)\nCoordinates:\n...
    SOURCE:           <xarray.Dataset>\nDimensions:             (d0: 4, d1: 2...
    SPECTRAL_WINDOW:  <xarray.Dataset>\nDimensions:             (d1: 240, d2:...
    STATE:            <xarray.Dataset>\nDimensions:   (state_id: 24)\nCoordin...
    WEATHER:          <xarray.Dataset>\nDimensions:                 (d0: 409,...

The mxds coordinates describe the principal keys to different tables in the MSv3 schema. The attributes section holds references to each individual xarray dataset visibility partition and subtable of global data.

Inspecting the FIELD subtable shows fields matching the same columns as the MSv3 schema.

[9]:
print(mxds.FIELD)
<xarray.Dataset>
Dimensions:        (d1: 1, d2: 2, field_id: 48)
Coordinates:
  * field_id       (field_id) int64 0 1 2 3 4 5 6 7 ... 40 41 42 43 44 45 46 47
    source_id      (field_id) int64 dask.array<chunksize=(48,), meta=np.ndarray>
Dimensions without coordinates: d1, d2
Data variables:
    CODE           (field_id) <U16 dask.array<chunksize=(48,), meta=np.ndarray>
    DELAYDIR_REF   (field_id) int64 dask.array<chunksize=(48,), meta=np.ndarray>
    DELAY_DIR      (field_id, d1, d2) float64 dask.array<chunksize=(48, 1, 2), meta=np.ndarray>
    NAME           (field_id) <U16 dask.array<chunksize=(48,), meta=np.ndarray>
    NUM_POLY       (field_id) int64 dask.array<chunksize=(48,), meta=np.ndarray>
    PHASEDIR_REF   (field_id) int64 dask.array<chunksize=(48,), meta=np.ndarray>
    PHASE_DIR      (field_id, d1, d2) float64 dask.array<chunksize=(48, 1, 2), meta=np.ndarray>
    REFDIR_REF     (field_id) int64 dask.array<chunksize=(48,), meta=np.ndarray>
    REFERENCE_DIR  (field_id, d1, d2) float64 dask.array<chunksize=(48, 1, 2), meta=np.ndarray>
    TIME           (field_id) datetime64[ns] dask.array<chunksize=(48,), meta=np.ndarray>

The main table of the MSv3 schema has been divided into the four visibility xarray dataset (xds) partitions that CNGI functions operate on. Inspecting an xds partition shows fields that correspond to the columns of the main table. A difference arises from the expansion of the time and baseline dimensions from what used to be rows.

Note that in cases where a different number of baselines exist at each time step within a single SPW of the originating MS, the resulting XDS will have the maximum number of baselines set in that dimension with NaN padding added as necessary.
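A toy numpy example of what this padding implies for downstream math; the made-up array stands in for a (time, baseline) slice of DATA where the second time step observed one fewer baseline.

```python
import numpy as np

# Second time step padded with NaN during conversion.
data = np.array([[1 + 0j, 2 + 0j, 3 + 0j],
                 [4 + 0j, 5 + 0j, np.nan]])

valid = ~np.isnan(data)          # mask of real samples
print(valid.sum(axis=1))         # baselines per time step -> [3 2]
print(np.nanmean(np.abs(data)))  # NaN-aware statistics -> 3.0
```

Plain `np.mean` would return NaN here; the NaN-aware reductions (`np.nanmean`, `np.nansum`, etc.) skip the padding.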

Visibility Dataset Structure

The visibility xarray dataset (xds) structure has four main components:

1. dimensions
2. coordinates
3. data variables
4. attributes

Dimensions define the shape of the other components and allow indexing into other components by integer location within each dimension (i.e., channel 5). Note that dimensions may be printed alphabetically by Jupyter; the actual order in the data itself may differ. Referring to a dimension by its name eliminates the need to remember what order things are in.

Coordinates define the world values of dimensions and other indices within the dataset. This allows indexing into other components by actual value (i.e., the channel at 100 GHz). Note that in many cases the real world value is itself just an integer index (i.e., the baseline), but time and channel frequency are particularly useful.

Data variables are the columns of the main table. They typically have the same data type and meaning as defined in the MSv3 schema. They are stored as Dask arrays, which allow numpy-like operations that are parallel, scalable, and support larger-than-memory data sizes. NaN values are used to pad and flag areas with no valid data, and consequently all mathematics must be smart enough to properly ignore NaNs in computations.

Attributes are used to hold units, reference frames, and any other metadata associated with the dataset. They can be any python type or object when in memory, but only serializable types may be written to disk.
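The difference between indexing by dimension position and by coordinate value can be sketched with a toy xarray dataset; the shapes and frequencies below are made up, not read from the converted data.

```python
import numpy as np
import xarray as xr

# DATA over (time, chan) with a frequency coordinate on chan,
# mimicking the xds layout in miniature.
freqs = np.linspace(1.137e11, 1.156e11, 4)
xds = xr.Dataset(
    {"DATA": (("time", "chan"), np.arange(8.0).reshape(2, 4))},
    coords={"chan": freqs},
)

by_index = xds.DATA.isel(chan=2)                         # integer position
by_value = xds.DATA.sel(chan=1.15e11, method="nearest")  # world value
print(float(by_index[0]), float(by_value[0]))            # same channel
```

`isel` addresses the dimension, `sel` the coordinate; both select the same channel here without knowing where `chan` sits in the dimension order.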

Each xds comes from a separate zarr partition and corresponds to a particular spw and polarization combination (as denoted by coordinate values). Here we can see one of the four xds partitions from the previous conversion of the demonstration MS.

[10]:
print(mxds.xds0)
<xarray.Dataset>
Dimensions:         (baseline: 154, chan: 240, pol: 2, pol_id: 1, spw_id: 1, time: 568, uvw_index: 3)
Coordinates:
  * baseline        (baseline) int64 0 1 2 3 4 5 6 ... 148 149 150 151 152 153
  * chan            (chan) float64 1.137e+11 1.137e+11 ... 1.156e+11 1.156e+11
    chan_width      (chan) float64 dask.array<chunksize=(32,), meta=np.ndarray>
    effective_bw    (chan) float64 dask.array<chunksize=(32,), meta=np.ndarray>
  * pol             (pol) int64 9 12
  * pol_id          (pol_id) int64 0
    resolution      (chan) float64 dask.array<chunksize=(32,), meta=np.ndarray>
  * spw_id          (spw_id) int64 0
  * time            (time) datetime64[ns] 2011-08-10T19:38:17.856000900 ... 2...
Dimensions without coordinates: uvw_index
Data variables: (12/17)
    ANTENNA1        (baseline) int64 dask.array<chunksize=(154,), meta=np.ndarray>
    ANTENNA2        (baseline) int64 dask.array<chunksize=(154,), meta=np.ndarray>
    ARRAY_ID        (time, baseline) int64 dask.array<chunksize=(100, 154), meta=np.ndarray>
    DATA            (time, baseline, chan, pol) complex128 dask.array<chunksize=(100, 154, 32, 1), meta=np.ndarray>
    DATA_WEIGHT     (time, baseline, chan, pol) float64 dask.array<chunksize=(100, 154, 32, 1), meta=np.ndarray>
    EXPOSURE        (time, baseline) float64 dask.array<chunksize=(100, 154), meta=np.ndarray>
    ...              ...
    OBSERVATION_ID  (time, baseline) int64 dask.array<chunksize=(100, 154), meta=np.ndarray>
    PROCESSOR_ID    (time, baseline) int64 dask.array<chunksize=(100, 154), meta=np.ndarray>
    SCAN_NUMBER     (time, baseline) int64 dask.array<chunksize=(100, 154), meta=np.ndarray>
    STATE_ID        (time, baseline) int64 dask.array<chunksize=(100, 154), meta=np.ndarray>
    TIME_CENTROID   (time, baseline) float64 dask.array<chunksize=(100, 154), meta=np.ndarray>
    UVW             (time, baseline, uvw_index) float64 dask.array<chunksize=(100, 154, 3), meta=np.ndarray>
Attributes: (12/13)
    bbc_no:           1
    corr_product:     [[0, 0], [1, 1]]
    data_groups:      [{'0': {'data': 'DATA', 'flag': 'FLAG', 'id': '0', 'uvw...
    freq_group:       0
    freq_group_name:
    if_conv_chain:    0
    ...               ...
    name:
    net_sideband:     2
    num_chan:         240
    num_corr:         2
    ref_frequency:    113730081250.0
    total_bandwidth:  1875000000.0

The chunk size of the Dask array based Data variables affects the smallest unit of data on disk that may be loaded and processed by a worker. A larger number of smaller chunks provides more units of work to go around in a parallel processing environment, with less memory needed for each worker. However, an overhead cost of scheduling and managing each unit of work creates a point of diminishing returns.
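The tradeoff can be quantified by counting work units: halving the time chunk size doubles the number of chunks covering the same data. This is pure arithmetic over xds0's dimensions, not a cngi call.

```python
import math

def n_chunks(shape, chunk_shape):
    # number of chunks needed to tile each dimension, multiplied out
    return math.prod(math.ceil(s / c) for s, c in zip(shape, chunk_shape))

shape = (568, 154, 240, 2)                 # (time, baseline, chan, pol)
print(n_chunks(shape, (100, 154, 32, 1)))  # coarser time chunking -> 96
print(n_chunks(shape, (50, 154, 32, 1)))   # halved time chunks -> 192
```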

We converted partitions 0 and 2 differently than partitions 1 and 3, using a smaller chunk size on the time dimension for the latter two. We can see the effect of this in the Dask structure.

[11]:
mxds.xds0.DATA.chunks
[11]:
((100, 100, 100, 100, 100, 68),
 (154,),
 (32, 32, 32, 32, 32, 32, 32, 16),
 (1, 1))
[12]:
mxds.xds1.DATA.chunks
[12]:
((50, 50, 50, 50, 50, 50, 50, 50, 50, 50, 50, 18),
 (154,),
 (32, 32, 32, 32, 32, 32, 32, 16),
 (1, 1))

Image Conversion

Image data is converted and stored in a manner very similar to MeasurementSet data. The same xarray and zarr frameworks are used, with the primary difference being the schema layout of contents within the xarray dataset structure.

Images may have a number of supporting products (residual, pb, psf, taylor terms, masks, etc.) in their own separate directories. The xarray dataset and the zarr storage format are capable of storing all image products (of the same shape) together. The resulting single CNGI data structure is more convenient to work with than the original disparate directories.

As a convenience, the conversion routines return an xarray dataset (xds) object attached to the zarr directory holding the converted data. The same xds can be retrieved later using the read_image() function, so conversion is only necessary once.

Create images with tclean

First we need to create some images from the downloaded MeasurementSet using the CASA6 tclean task. We will create a cube image, a continuum image, and an MTMFS image with Taylor terms.

[13]:
!rm -fr twhya_*
[14]:
from casatasks import tclean

print('creating cube')
tclean(vis='sis14_twhya_calibrated_flagged.ms', imagename='twhya_cube', field='5', spw='',
       specmode='cube', cell=1, pbcor=True, imsize=100, nchan=10, deconvolver='hogbom',
       niter=10, savemodel='modelcolumn', usemask='auto-multithresh')

print('creating continuum')
tclean(vis='sis14_twhya_calibrated_flagged.ms', imagename='twhya_cont', field='5', spw='',
       specmode='mfs', cell=1, pbcor=True, imsize=100, deconvolver='hogbom',
       niter=10, savemodel='modelcolumn', usemask='auto-multithresh')

print('creating mtmfs')
tclean(vis='sis14_twhya_calibrated_flagged.ms', imagename='twhya_mtmfs', field='5', spw='',
       specmode='mfs', cell=1, pbcor=True, imsize=100, deconvolver='mtmfs',
       nterms=5, niter=10, savemodel='modelcolumn', usemask='auto-multithresh')

print('complete')
creating cube
creating continuum
creating mtmfs
complete

Cube Images

The previous tclean call produced a cube image with 10 channels along with a number of supporting image products (same filename with a different extension). Some of these image products have additional embedded masks.

[15]:
!ls -d twhya_cube.*
twhya_cube.image        twhya_cube.mask   twhya_cube.pb   twhya_cube.residual
twhya_cube.image.pbcor  twhya_cube.model  twhya_cube.psf  twhya_cube.sumwt

The CNGI conversion function will merge these individual products and any embedded masks into a single xarray dataset stored in a single img.zarr directory.

Note that the .mask image product will be renamed to ‘automask’ in the xarray dataset.

[16]:
from cngi.conversion import convert_image

xds = convert_image('twhya_cube.image')

print(xds.dims)
print(xds.data_vars)
converting Image...
processed image in 1.0922942 seconds
Frozen(SortedKeysDict({'l': 100, 'm': 100, 'time': 1, 'chan': 10, 'pol': 1}))
Data variables:
    AUTOMASK           (l, m, time, chan, pol) float64 dask.array<chunksize=(100, 100, 1, 1, 1), meta=np.ndarray>
    IMAGE              (l, m, time, chan, pol) float64 dask.array<chunksize=(100, 100, 1, 1, 1), meta=np.ndarray>
    IMAGE_MASK0        (l, m, time, chan, pol) bool dask.array<chunksize=(100, 100, 1, 1, 1), meta=np.ndarray>
    IMAGE_PBCOR        (l, m, time, chan, pol) float64 dask.array<chunksize=(100, 100, 1, 1, 1), meta=np.ndarray>
    IMAGE_PBCOR_MASK0  (l, m, time, chan, pol) bool dask.array<chunksize=(100, 100, 1, 1, 1), meta=np.ndarray>
    MODEL              (l, m, time, chan, pol) float64 dask.array<chunksize=(100, 100, 1, 1, 1), meta=np.ndarray>
    PB                 (l, m, time, chan, pol) float64 dask.array<chunksize=(100, 100, 1, 1, 1), meta=np.ndarray>
    PB_MASK0           (l, m, time, chan, pol) bool dask.array<chunksize=(100, 100, 1, 1, 1), meta=np.ndarray>
    PSF                (l, m, time, chan, pol) float64 dask.array<chunksize=(100, 100, 1, 1, 1), meta=np.ndarray>
    RESIDUAL           (l, m, time, chan, pol) float64 dask.array<chunksize=(100, 100, 1, 1, 1), meta=np.ndarray>
    RESIDUAL_MASK0     (l, m, time, chan, pol) bool dask.array<chunksize=(100, 100, 1, 1, 1), meta=np.ndarray>
    SUMWT              (time, chan, pol) float64 dask.array<chunksize=(1, 1, 1), meta=np.ndarray>

Continuum Images

Continuum image products are merged in the same manner as cube images. The main difference in the output is, of course, the number of channels.

[17]:
xds = convert_image('twhya_cont.image')

print(xds.dims)
print(xds.data_vars)
converting Image...
processed image in 0.64222145 seconds
Frozen(SortedKeysDict({'l': 100, 'm': 100, 'time': 1, 'chan': 1, 'pol': 1}))
Data variables:
    AUTOMASK           (l, m, time, chan, pol) float64 dask.array<chunksize=(100, 100, 1, 1, 1), meta=np.ndarray>
    IMAGE              (l, m, time, chan, pol) float64 dask.array<chunksize=(100, 100, 1, 1, 1), meta=np.ndarray>
    IMAGE_MASK0        (l, m, time, chan, pol) bool dask.array<chunksize=(100, 100, 1, 1, 1), meta=np.ndarray>
    IMAGE_PBCOR        (l, m, time, chan, pol) float64 dask.array<chunksize=(100, 100, 1, 1, 1), meta=np.ndarray>
    IMAGE_PBCOR_MASK0  (l, m, time, chan, pol) bool dask.array<chunksize=(100, 100, 1, 1, 1), meta=np.ndarray>
    MODEL              (l, m, time, chan, pol) float64 dask.array<chunksize=(100, 100, 1, 1, 1), meta=np.ndarray>
    PB                 (l, m, time, chan, pol) float64 dask.array<chunksize=(100, 100, 1, 1, 1), meta=np.ndarray>
    PB_MASK0           (l, m, time, chan, pol) bool dask.array<chunksize=(100, 100, 1, 1, 1), meta=np.ndarray>
    PSF                (l, m, time, chan, pol) float64 dask.array<chunksize=(100, 100, 1, 1, 1), meta=np.ndarray>
    RESIDUAL           (l, m, time, chan, pol) float64 dask.array<chunksize=(100, 100, 1, 1, 1), meta=np.ndarray>
    RESIDUAL_MASK0     (l, m, time, chan, pol) bool dask.array<chunksize=(100, 100, 1, 1, 1), meta=np.ndarray>
    SUMWT              (time, chan, pol) float64 dask.array<chunksize=(1, 1, 1), meta=np.ndarray>

MTMFS Images

Multi-Term Multi-Frequency Synthesis images may contain Taylor-term expansions in place of the channel dimension. In CASA6 these are stored as a number of individual image directories with a .ttN extension.

[18]:
!ls -d twhya_mtmfs.*
twhya_mtmfs.alpha            twhya_mtmfs.model.tt1  twhya_mtmfs.psf.tt8
twhya_mtmfs.alpha.error      twhya_mtmfs.model.tt2  twhya_mtmfs.residual.tt0
twhya_mtmfs.alpha.pbcor      twhya_mtmfs.model.tt3  twhya_mtmfs.residual.tt1
twhya_mtmfs.beta             twhya_mtmfs.model.tt4  twhya_mtmfs.residual.tt2
twhya_mtmfs.beta.pbcor       twhya_mtmfs.pb.tt0     twhya_mtmfs.residual.tt3
twhya_mtmfs.image.tt0        twhya_mtmfs.pb.tt1     twhya_mtmfs.residual.tt4
twhya_mtmfs.image.tt0.pbcor  twhya_mtmfs.pb.tt2     twhya_mtmfs.sumwt.tt0
twhya_mtmfs.image.tt1        twhya_mtmfs.pb.tt3     twhya_mtmfs.sumwt.tt1
twhya_mtmfs.image.tt1.pbcor  twhya_mtmfs.pb.tt4     twhya_mtmfs.sumwt.tt2
twhya_mtmfs.image.tt2        twhya_mtmfs.psf.tt0    twhya_mtmfs.sumwt.tt3
twhya_mtmfs.image.tt2.pbcor  twhya_mtmfs.psf.tt1    twhya_mtmfs.sumwt.tt4
twhya_mtmfs.image.tt3        twhya_mtmfs.psf.tt2    twhya_mtmfs.sumwt.tt5
twhya_mtmfs.image.tt3.pbcor  twhya_mtmfs.psf.tt3    twhya_mtmfs.sumwt.tt6
twhya_mtmfs.image.tt4        twhya_mtmfs.psf.tt4    twhya_mtmfs.sumwt.tt7
twhya_mtmfs.image.tt4.pbcor  twhya_mtmfs.psf.tt5    twhya_mtmfs.sumwt.tt8
twhya_mtmfs.mask             twhya_mtmfs.psf.tt6
twhya_mtmfs.model.tt0        twhya_mtmfs.psf.tt7

The CNGI conversion will merge these individual .ttN directories into the channel dimension of the corresponding image product.

[19]:
xds = convert_image('twhya_mtmfs.image')

print(xds.dims)
print(xds.data_vars)
converting Image...
processed image in 0.97542477 seconds
Frozen(SortedKeysDict({'l': 100, 'm': 100, 'time': 1, 'chan': 5, 'pol': 1}))
Data variables:
    AUTOMASK           (l, m, time, chan, pol) float64 dask.array<chunksize=(100, 100, 1, 1, 1), meta=np.ndarray>
    IMAGE              (l, m, time, chan, pol) float64 dask.array<chunksize=(100, 100, 1, 1, 1), meta=np.ndarray>
    IMAGE_MASK0        (l, m, time, chan, pol) bool dask.array<chunksize=(100, 100, 1, 1, 1), meta=np.ndarray>
    IMAGE_PBCOR        (l, m, time, chan, pol) float64 dask.array<chunksize=(100, 100, 1, 1, 1), meta=np.ndarray>
    IMAGE_PBCOR_MASK0  (l, m, time, chan, pol) bool dask.array<chunksize=(100, 100, 1, 1, 1), meta=np.ndarray>
    MODEL              (l, m, time, chan, pol) float64 dask.array<chunksize=(100, 100, 1, 1, 1), meta=np.ndarray>
    PB                 (l, m, time, chan, pol) float64 dask.array<chunksize=(100, 100, 1, 1, 1), meta=np.ndarray>
    PB_MASK0           (l, m, time, chan, pol) bool dask.array<chunksize=(100, 100, 1, 1, 1), meta=np.ndarray>
    PSF                (l, m, time, chan, pol) float64 dask.array<chunksize=(100, 100, 1, 1, 1), meta=np.ndarray>
    RESIDUAL           (l, m, time, chan, pol) float64 dask.array<chunksize=(100, 100, 1, 1, 1), meta=np.ndarray>
    RESIDUAL_MASK0     (l, m, time, chan, pol) bool dask.array<chunksize=(100, 100, 1, 1, 1), meta=np.ndarray>
    SUMWT              (time, chan, pol) float64 dask.array<chunksize=(1, 1, 1), meta=np.ndarray>

Image Dataset Structure

The image xarray dataset (xds) structure has four main components:

1. dimensions
2. coordinates
3. data variables
4. attributes

Dimensions define the shape of the other components and allow indexing into other components by integer location within each dimension (i.e., channel 5). Note that dimensions may be printed alphabetically by Jupyter; the actual order in the data itself may differ. Referring to a dimension by its name eliminates the need to remember what order things are in.

Coordinates define the world values of dimensions and other indices within the dataset. This allows indexing into other components by actual value (i.e., the channel at 100 GHz). Note that coordinates may be multi-dimensional products of underlying dimensions. This allows the storage of spherical right-ascension / declination pairs for each Cartesian pixel value in the image (l / m dimensions).

Data variables are the image products and masks. They are stored as Dask arrays, which allow numpy-like operations that are parallel, scalable, and support larger-than-memory data sizes. NaN values are used to mask areas with no valid data, and consequently all mathematics must be smart enough to properly ignore NaNs in computations.

Attributes are used to hold units, reference frames, and any other metadata associated with the dataset. They can be any python type or object when in memory, but only serializable types may be written to disk.
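One use of the 2-D world coordinates is selecting image pixels by sky position. A toy numpy sketch follows, with made-up linear grids standing in for the dataset's right_ascension and declination coordinates.

```python
import numpy as np

# Pretend world-coordinate grids over the (l, m) pixel axes (radians).
l = np.linspace(0.0002424, -0.0002376, 100)
m = np.linspace(-0.0002424, 0.0002376, 100)
ra, dec = np.meshgrid(l, m, indexing="ij")

# Boolean mask of pixels within a small radius of sky position (0, 0).
radius = 1e-4
inside = ra**2 + dec**2 < radius**2
print(inside.shape, int(inside.sum()), "pixels selected")
```

The same pattern applied to the real xds (e.g. masking an image product where the coordinate condition holds) works because the 2-D coordinates share the l/m dimensions with the data variables.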

A time dimension has been inserted in the converted image data. This is a placeholder for future time-domain image handling in CNGI. CASA6 does not currently produce these types of images.

Inspecting the cube xds shows the 10 frequency channels of the original image, shared by all the image products.

[20]:
from cngi.dio import read_image

xds = read_image('twhya_cube.img.zarr')

print(xds)
print('\nChannel Frequencies: ', xds.chan.values)
<xarray.Dataset>
Dimensions:            (chan: 10, l: 100, m: 100, pol: 1, time: 1)
Coordinates:
  * chan               (chan) float64 3.725e+11 3.725e+11 ... 3.725e+11
    declination        (l, m) float64 dask.array<chunksize=(100, 100), meta=np.ndarray>
  * l                  (l) float64 0.0002424 0.0002376 ... -0.0002327 -0.0002376
  * m                  (m) float64 -0.0002424 -0.0002376 ... 0.0002327 0.0002376
  * pol                (pol) float64 1.0
    right_ascension    (l, m) float64 dask.array<chunksize=(100, 100), meta=np.ndarray>
  * time               (time) datetime64[ns] 2012-11-19T07:56:26.544000626
Data variables:
    AUTOMASK           (l, m, time, chan, pol) float64 dask.array<chunksize=(100, 100, 1, 1, 1), meta=np.ndarray>
    IMAGE              (l, m, time, chan, pol) float64 dask.array<chunksize=(100, 100, 1, 1, 1), meta=np.ndarray>
    IMAGE_MASK0        (l, m, time, chan, pol) bool dask.array<chunksize=(100, 100, 1, 1, 1), meta=np.ndarray>
    IMAGE_PBCOR        (l, m, time, chan, pol) float64 dask.array<chunksize=(100, 100, 1, 1, 1), meta=np.ndarray>
    IMAGE_PBCOR_MASK0  (l, m, time, chan, pol) bool dask.array<chunksize=(100, 100, 1, 1, 1), meta=np.ndarray>
    MODEL              (l, m, time, chan, pol) float64 dask.array<chunksize=(100, 100, 1, 1, 1), meta=np.ndarray>
    PB                 (l, m, time, chan, pol) float64 dask.array<chunksize=(100, 100, 1, 1, 1), meta=np.ndarray>
    PB_MASK0           (l, m, time, chan, pol) bool dask.array<chunksize=(100, 100, 1, 1, 1), meta=np.ndarray>
    PSF                (l, m, time, chan, pol) float64 dask.array<chunksize=(100, 100, 1, 1, 1), meta=np.ndarray>
    RESIDUAL           (l, m, time, chan, pol) float64 dask.array<chunksize=(100, 100, 1, 1, 1), meta=np.ndarray>
    RESIDUAL_MASK0     (l, m, time, chan, pol) bool dask.array<chunksize=(100, 100, 1, 1, 1), meta=np.ndarray>
    SUMWT              (time, chan, pol) float64 dask.array<chunksize=(1, 1, 1), meta=np.ndarray>
Attributes: (12/19)
    axisnames:            ['Right Ascension', 'Declination', 'Time', 'Frequen...
    axisunits:            ['rad', 'rad', 'datetime64[ns]', 'Hz', '']
    commonbeam:           [1.2625414355032467, 1.1318049125886214, -84.517746...
    commonbeam_units:     ['arcsec', 'arcsec', 'deg']
    date_observation:     2012/11/19/07
    direction_reference:  j2000
    ...                   ...
    restoringbeam:        [1.2625414355032467, 1.1318049125886214, -84.517746...
    spectral__reference:  lsrk
    telescope:            alma
    telescope_position:   [2.22514e+06m, -5.44031e+06m, -2.48103e+06m] (itrf)
    unit:                 Jy/beam
    velocity__type:       radio

Channel Frequencies:  [3.72520023e+11 3.72520633e+11 3.72521243e+11 3.72521854e+11
 3.72522464e+11 3.72523074e+11 3.72523685e+11 3.72524295e+11
 3.72524905e+11 3.72525516e+11]

The MTMFS image is a continuum image with each Taylor term stored in the channel dimension. This means that the channel coordinate values will all be the same, with the index position within the channel dimension corresponding to the Taylor polynomial coefficient.
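The Taylor-term planes represent the standard MTMFS expansion I(nu) = sum_n tt_n * ((nu - nu0) / nu0)**n about the reference frequency nu0. A minimal sketch of evaluating it for one pixel, with hypothetical term values:

```python
# nu0 is the reference frequency (the repeated chan value above);
# tt holds made-up tt0, tt1, tt2 values for a single pixel.
nu0 = 3.7263694e11
tt = [1.0, 0.5, 0.1]

def mtmfs_flux(nu, tt, nu0):
    x = (nu - nu0) / nu0
    return sum(t * x**n for n, t in enumerate(tt))

print(mtmfs_flux(nu0, tt, nu0))        # at nu0 only tt0 contributes -> 1.0
print(mtmfs_flux(1.1 * nu0, tt, nu0))  # 1 + 0.5*0.1 + 0.1*0.01 = 1.051
```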

[21]:
from cngi.dio import read_image

xds = read_image('twhya_mtmfs.img.zarr')

print(xds)
print('\nChannel Frequencies: ', xds.chan.values)
<xarray.Dataset>
Dimensions:            (chan: 5, l: 100, m: 100, pol: 1, time: 1)
Coordinates:
  * chan               (chan) float64 3.726e+11 3.726e+11 ... 3.726e+11
    declination        (l, m) float64 dask.array<chunksize=(100, 100), meta=np.ndarray>
  * l                  (l) float64 0.0002424 0.0002376 ... -0.0002327 -0.0002376
  * m                  (m) float64 -0.0002424 -0.0002376 ... 0.0002327 0.0002376
  * pol                (pol) float64 1.0
    right_ascension    (l, m) float64 dask.array<chunksize=(100, 100), meta=np.ndarray>
  * time               (time) datetime64[ns] 2012-11-19T07:56:26.544000626
Data variables:
    AUTOMASK           (l, m, time, chan, pol) float64 dask.array<chunksize=(100, 100, 1, 1, 1), meta=np.ndarray>
    IMAGE              (l, m, time, chan, pol) float64 dask.array<chunksize=(100, 100, 1, 1, 1), meta=np.ndarray>
    IMAGE_MASK0        (l, m, time, chan, pol) bool dask.array<chunksize=(100, 100, 1, 1, 1), meta=np.ndarray>
    IMAGE_PBCOR        (l, m, time, chan, pol) float64 dask.array<chunksize=(100, 100, 1, 1, 1), meta=np.ndarray>
    IMAGE_PBCOR_MASK0  (l, m, time, chan, pol) bool dask.array<chunksize=(100, 100, 1, 1, 1), meta=np.ndarray>
    MODEL              (l, m, time, chan, pol) float64 dask.array<chunksize=(100, 100, 1, 1, 1), meta=np.ndarray>
    PB                 (l, m, time, chan, pol) float64 dask.array<chunksize=(100, 100, 1, 1, 1), meta=np.ndarray>
    PB_MASK0           (l, m, time, chan, pol) bool dask.array<chunksize=(100, 100, 1, 1, 1), meta=np.ndarray>
    PSF                (l, m, time, chan, pol) float64 dask.array<chunksize=(100, 100, 1, 1, 1), meta=np.ndarray>
    RESIDUAL           (l, m, time, chan, pol) float64 dask.array<chunksize=(100, 100, 1, 1, 1), meta=np.ndarray>
    RESIDUAL_MASK0     (l, m, time, chan, pol) bool dask.array<chunksize=(100, 100, 1, 1, 1), meta=np.ndarray>
    SUMWT              (time, chan, pol) float64 dask.array<chunksize=(1, 1, 1), meta=np.ndarray>
Attributes: (12/19)
    axisnames:            ['Right Ascension', 'Declination', 'Time', 'Frequen...
    axisunits:            ['rad', 'rad', 'datetime64[ns]', 'Hz', '']
    commonbeam:           [1.2610454559326172, 1.131009578704834, -84.4926986...
    commonbeam_units:     ['arcsec', 'arcsec', 'deg']
    date_observation:     2012/11/19/07
    direction_reference:  j2000
    ...                   ...
    restoringbeam:        [1.2610454559326172, 1.131009578704834, -84.4926986...
    spectral__reference:  lsrk
    telescope:            alma
    telescope_position:   [2.22514e+06m, -5.44031e+06m, -2.48103e+06m] (itrf)
    unit:                 Jy/beam
    velocity__type:       radio

Channel Frequencies:  [3.7263694e+11 3.7263694e+11 3.7263694e+11 3.7263694e+11 3.7263694e+11]

As with the visibility data, the chunk size of the Dask array based data variables determines the smallest unit of data on disk that may be loaded and processed by a worker, with the same scheduling-overhead tradeoff described earlier.

The CNGI conversion function allows for tailoring the chunk size beyond the default values used in this overview.

[22]:
xds.IMAGE.chunks
[22]:
((100,), (100,), (1,), (1, 1, 1, 1, 1), (1,))