HOW-TO Use SourceCollections in Astro-WISE

The SourceCollection classes extend the concepts of data lineage and data pulling to catalogs, for example for sample selection and parameter derivation.

This HOW-TO is not yet complete and should be considered a draft. For background information and details see the thesis of Hugo Buddelmeijer.

Overview

A SourceCollection is a ProcessTarget for source catalogs. A SourceCollection represents both a catalog of sources with attributes and the operation to create the data of this catalog. Sources are identified by their SLID-SID combination and attributes by their name. The operations range from selection of sources to calculation of attributes and are applied to other persistent objects, often other SourceCollections. Each operator corresponds to a separate persistent class that is derived from the base SourceCollection class.
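The dual role described above, a catalog and the operation that produces it, can be illustrated with a small stand-alone toy model. This sketch does not use the Astro-WISE classes: ToyCollection, ToyFilter and their methods are hypothetical names chosen for illustration, and a catalog is simply a dictionary keyed by a (SLID, SID) pair.

```python
# Toy model, not the Astro-WISE implementation: every class and method
# name below is hypothetical.

class ToyCollection:
    """A catalog plus the operation that creates it."""

    def __init__(self, parent=None):
        self.parent = parent
        self.data = None  # {(SLID, SID): {attribute name: value}}

    def is_made(self):
        return self.data is not None

    def operation(self, parent_data):
        # The base class passes the parent catalog through unchanged.
        return dict(parent_data)

    def make(self):
        # Process the parent first, then apply this object's operation.
        if self.is_made():
            return
        if self.parent is not None:
            self.parent.make()
            self.data = self.operation(self.parent.data)
        else:
            self.data = self.operation({})


class ToyFilter(ToyCollection):
    """An operator that selects sources; one derived class per operator."""

    def __init__(self, parent, predicate):
        super().__init__(parent)
        self.predicate = predicate

    def operation(self, parent_data):
        return {sid: row for sid, row in parent_data.items()
                if self.predicate(row)}


root = ToyCollection()
root.data = {(1, 0): {'DEC': 10.5}, (1, 1): {'DEC': 11.5}}
subset = ToyFilter(root, lambda row: row['DEC'] < 11)
made_before = subset.is_made()
subset.make()
```

In this toy, defining the subset costs nothing; the catalog data only exists after make() has been called.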

An Astro-WISE Session

Using the SourceCollection classes is demonstrated by showing an example Astro-WISE session.

Bootstrapping

A session would usually start with retrieving an existing SourceCollection. This demo session starts with creating a new SourceCollection because this makes the pulling examples more predictable.

For now, a SourceList has to be wrapped in a SourceListWrapper in order to use it with the SourceCollection classes. Ultimately, the SourceCollection classes will be better integrated with the SourceList and other *List classes.

from astro.main.SourceList import SourceList
from astro.main.sourcecollection.SourceCollection \
  import SourceCollection
from astro.main.sourcecollection.SourceListWrapper \
  import SourceListWrapper

# Fetch a SourceList
sl = (SourceList.SLID == 1575051)[0]

# Create a SourceCollection from the SourceList
slw = SourceListWrapper()
slw.set_sourcelist(sl)

# The SourceCollection can be made persistent if wanted.
slw.commit()
print(slw.SCID)
# 100511

Pulling One Sample

SourceLists created from an image only contain photometric attributes. The SourceCollection classes allow attributes derived from those to be created by pulling them. This example shows how to pull a subset of the sources with newly derived attributes.

# Continue with 'slw' from above.

# Define a criterion to select the required sources.
query = ' "DEC" < 11 '

# Define the requested attributes.
# In this case a comoving distance and the inverse concentration index.
attributes = ['R', 'iC']

# Pull a SourceCollection that has the requested data.
#
# The information system will search for existing SourceCollections
# that can be used to fulfill the request. New SourceCollections
# are created if no suitable ones are found.
scnew = slw.derive(attributes=attributes, query=query)

# New SourceCollections are created to be as general as possible
# in order to facilitate reuse. In this case, the SourceCollection
# calculating 'R' will be created to calculate the attribute for all the
# sources of slw, not only for the sources with a declination below 11.

# Check whether the SourceCollection has been processed.
scnew.is_made()

# If False is returned, the SourceCollection has to be processed.
scnew.make()

# The is_made() and make() functions only apply to the parts of the
# dependency tree that are required to build the target SourceCollection.
# In this case, 'R' will only be calculated for the sources with DEC < 11.

Loading Catalog Data

The catalog data of a SourceCollection can be accessed by loading it into a TableConverter object or by sending it over SAMP.

# Load the catalog data into a TableConverter.
scnew.load_data()

# Interface with the catalog data.
print(scnew.data.attribute_order)
# ['SLID', 'SID', 'R', 'iC']
print(scnew.data.attributes['R'])
# {'length': 1, 'ucd': '', 'null': '', 'name': 'R', 'format': 'float64'}
print(scnew.data.data['R'])
# [  562.54038472  1905.82573128   397.30116968 ...,    12.93537333

# Or send the catalog data over SAMP.
from astro.services.samp.Samp import Samp
s = Samp()
s.broadcast(scnew)

# Highlight or select and broadcast some sources in Topcat or Aladin.
# Retrieve the SLIDs/SIDs or the row in the TableConverter.
s.highlightedSource(scnew)
# (1575051, 812)
s.selectedSources(scnew)
# [(1575051, 843), (1575051, 847), (1575051, 848)]
s.highlighted(scnew)
# 812
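The relation between a row in the table and a SLID/SID pair can be made concrete with a stand-alone sketch. The dictionary below only mimics a column-oriented table with made-up values; it is not a TableConverter, and row_of and source_at are hypothetical helper names.

```python
# Stand-alone illustration with made-up values; the dictionary only
# mimics a column-oriented table and is not a TableConverter.
columns = {
    'SLID': [1575051, 1575051, 1575051],
    'SID': [812, 843, 847],
    'R': [562.5, 1905.8, 397.3],
}


def row_of(columns, slid, sid):
    """Return the row index of the source (slid, sid)."""
    pairs = list(zip(columns['SLID'], columns['SID']))
    return pairs.index((slid, sid))


def source_at(columns, row):
    """Return the (SLID, SID) pair stored in the given row."""
    return (columns['SLID'][row], columns['SID'][row])


row = row_of(columns, 1575051, 843)
distance = columns['R'][row]
```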

Pulling More Samples

An important feature of the SourceCollections is that only those parts of a dependency tree that are required to build the final target SourceCollection are processed. This is demonstrated with the following example.

# Continue with 'slw' from before.

# Pull attributes for a subset of the sources.
scnew1 = slw.derive(query=' "DEC" < 11 ', attributes=['R'])
scnew1.is_made()
# Should return True because this part of the SourceCollection has already
# been processed in the previous example. If not, make it:
scnew1.make()

# Pull attributes for a smaller subset of the sources
scnew2 = slw.derive(query=' "DEC" < 10 ', attributes=['R'])
scnew2.is_made()
# True, because this has already been processed.

# Pull data for a larger subset of the sources
scnew3 = slw.derive(query=' "DEC" < 12 ', attributes=['R'])
scnew3.is_made()
# False, because this part of the SourceCollection
# has not been processed completely yet.
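The behaviour of the three requests above can be mimicked with a toy calculator that remembers for which sources an attribute has been calculated. This is a sketch under simplifying assumptions, not the Astro-WISE implementation: ToyCalculator and its methods are hypothetical, and the selection criterion is reduced to an upper limit on DEC.

```python
# Toy model with hypothetical names; not the Astro-WISE implementation.

class ToyCalculator:
    """Calculates one attribute and remembers for which sources."""

    def __init__(self, catalog, function):
        self.catalog = catalog      # {source id: {'DEC': value}}
        self.function = function
        self.values = {}            # results calculated so far

    def select(self, dec_limit):
        return [sid for sid, row in self.catalog.items()
                if row['DEC'] < dec_limit]

    def is_made(self, dec_limit):
        return all(sid in self.values for sid in self.select(dec_limit))

    def make(self, dec_limit):
        # Only calculate what the request needs and does not have yet.
        for sid in self.select(dec_limit):
            if sid not in self.values:
                self.values[sid] = self.function(self.catalog[sid])


catalog = {1: {'DEC': 9.5}, 2: {'DEC': 10.5}, 3: {'DEC': 11.5}}
calc = ToyCalculator(catalog, lambda row: 2.0 * row['DEC'])

# Process the attribute for the sources with DEC < 11 only.
calc.make(11)
```

After this, a request for DEC < 10 is already satisfied, while a request for DEC < 12 still has one source left to process.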

Storing Catalog Data

New catalog data is created in the examples above. Storing the attribute values prevents them from being calculated again.

# Continue with the datasets from above.

# Commit the SourceCollections that need to be saved. This will recursively
# commit the dependency tree as well.
scnew2.commit()

# Store the source data that has been calculated. This will store the catalog data
# in an optimal way. Although scnew2 only represents a subset of the
# sources, all the calculated attributes are stored: the 'R' values for the sources
# in scnew1 will be stored as well.
scnew2.store_data()

# The demo session ends here.

Pushing SourceCollections

The derive() function above creates a hierarchy of SourceCollections. This section shows how this hierarchy can be created manually.

It is preferable to use the data pulling functions and let the information system determine what needs to be created automatically. This is less work and facilitates reuse.

Nonetheless, not all SourceCollections can be created by pulling data, so it is useful to know how to create them manually.

Import

First import all relevant classes.

# astro.main classes
from astro.main.SourceList import SourceList
# Virtual base SourceCollection
from astro.main.sourcecollection.SourceCollection \
  import SourceCollection
# All derived operator SourceCollections
from astro.main.sourcecollection.SourceListWrapper \
  import SourceListWrapper
from astro.main.sourcecollection.FilterSources \
  import FilterSources
from astro.main.sourcecollection.SelectSources \
  import SelectSources
from astro.main.sourcecollection.SelectAttributes \
  import SelectAttributes
from astro.main.sourcecollection.ConcatenateAttributes \
  import ConcatenateAttributes
from astro.main.sourcecollection.AttributeCalculator \
  import AttributeCalculator, AttributeCalculatorDefinition

Bootstrap

Start with a new SourceCollection.

# Fetch a SourceList
sl = (SourceList.SLID == 1575051)[0]

# Create a SourceCollection from the SourceList
slw = SourceListWrapper()
slw.set_sourcelist(sl)

Calculate Attributes

An AttributeCalculator SourceCollection is used for the derivation of new source attributes from existing attributes. The calculation that is performed by an AttributeCalculator SourceCollection is given by an AttributeCalculatorDefinition object. Creating AttributeCalculator SourceCollections by pushing requires some manual work:

# Find all AttributeCalculatorDefinition objects that can be used to
# calculate comoving distances.
acdsR = AttributeCalculatorDefinition.get_acds_by_attribute('R')

# Pick the first one.
acdR = acdsR[0]
acdR.name
# 'Comoving Distance Calculator'

# See which attributes are required by the ACD.
acdR.input_attribute_names
# ['redshift', 'RA', 'DEC', 'HTM']

# Which are all available in 'slw':
[a in slw.get_attribute_names() for a in acdR.input_attribute_names]
# [True, True, True, True]

# The required attributes have to be selected with a SelectAttributes:
sa1 = SelectAttributes()
sa1.parent_collection = slw
sa1.selected_attributes = ['redshift', 'RA', 'DEC', 'HTM']

# And the AttributeCalculator can be initialized. The 'AC' property of an ACD
# object is a class derived from the AttributeCalculator class that uses this
# definition.
ac1 = acdR.AC()
ac1.parent_collection = sa1

Similarly for the inverse concentration:

acdsiC = AttributeCalculatorDefinition.get_acds_by_attribute('iC')
acdiC = acdsiC[0]
all([a in slw.get_attribute_names() for a in acdiC.input_attribute_names])

sa2 = SelectAttributes()
sa2.parent_collection = slw
sa2.selected_attributes = [a for a in acdiC.input_attribute_names]

ac2 = acdiC.AC()
ac2.parent_collection = sa2

Selecting Sources

Selecting sources is either done with a FilterSources SourceCollection or with a SelectSources SourceCollection.

# A FilterSources represents a subset of the parent SourceCollection
# by evaluation of a selection criterion.
fs = FilterSources()
fs.parent_collection = slw
fs.set_query(' "DEC" < 11 ')

# Alternatively, a SelectSources SourceCollection can be used to select
# a subset that is explicitly listed by another SourceCollection.
# First create a SourceCollection with only source identifiers.
sa3 = SelectAttributes()
sa3.parent_collection = fs
# Use this to specify the selected sources of a SelectSources SourceCollection
ss = SelectSources()
ss.parent_collection = slw
ss.selected_sources = sa3
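The difference between the two kinds of selection can be shown with plain dictionaries. This is a toy illustration only; filter_sources and select_sources are hypothetical functions and not part of Astro-WISE.

```python
# Toy illustration with plain dictionaries; not the Astro-WISE classes.
parent = {
    (1575051, 1): {'DEC': 10.2},
    (1575051, 2): {'DEC': 11.4},
    (1575051, 3): {'DEC': 10.9},
}


def filter_sources(catalog, predicate):
    """Subset defined by evaluating a criterion on the attributes."""
    return {sid: row for sid, row in catalog.items() if predicate(row)}


def select_sources(catalog, identifiers):
    """Subset defined by an explicit collection of identifiers."""
    wanted = set(identifiers)
    return {sid: row for sid, row in catalog.items() if sid in wanted}


filtered = filter_sources(parent, lambda row: row['DEC'] < 11)
# Keep only the identifiers, as a catalog without attributes would.
identifiers = list(filtered)
selected = select_sources(parent, identifiers)
```

Both routes describe the same sources here; the filter has to evaluate the criterion, whereas the explicit selection only needs the list of identifiers.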

Combining All Attributes

A ConcatenateAttributes SourceCollection is used to combine the attributes from the AttributeCalculators for the sources in the FilterSources.

# First select no attributes from the FilterSources
sa3 = SelectAttributes()
sa3.parent_collection = fs
# Select the comoving distance
sa4 = SelectAttributes()
sa4.parent_collection = ac1
sa4.selected_attributes = ['R']
# Select the inverse concentration
sa5 = SelectAttributes()
sa5.parent_collection = ac2
sa5.selected_attributes = ['iC']

# Combine all the attributes
ca = ConcatenateAttributes()
ca.parent_collections = [sa3, sa4, sa5]

# And we're done
scnew = ca
scnew.make()
scnew.load_data()
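Conceptually, the combination above joins attribute columns on the source identifiers. A minimal stand-alone sketch of that idea follows; it assumes catalogs are dictionaries keyed by (SLID, SID), the values are made up, and concatenate_attributes is a hypothetical function.

```python
# Toy sketch of combining attributes for a fixed set of sources.

def concatenate_attributes(sources, *catalogs):
    """Combine attribute columns for the given source identifiers."""
    combined = {}
    for sid in sources:
        row = {}
        for catalog in catalogs:
            row.update(catalog[sid])
        combined[sid] = row
    return combined


# The subset of sources, without any attributes.
subset = [(1575051, 1), (1575051, 3)]
# Attribute catalogs that cover more sources than the subset.
distances = {
    (1575051, 1): {'R': 562.5},
    (1575051, 2): {'R': 1905.8},
    (1575051, 3): {'R': 397.3},
}
concentrations = {
    (1575051, 1): {'iC': 0.41},
    (1575051, 2): {'iC': 0.52},
    (1575051, 3): {'iC': 0.38},
}

combined = concatenate_attributes(subset, distances, concentrations)
```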

Other Operators

SourceCollection classes not yet shown in this HOW-TO are

  • ConcatenateSources, to combine the sources of different SourceCollections.
  • RenameAttributes, to rename the attributes of the parent SourceCollection.
  • RelabelSources, to give sources of the parent SourceCollection a new SLID-SID combination by specifying an AssociateList.
from astro.main.AssociateList import AssociateList
from astro.main.sourcecollection.SourceCollection \
  import SourceCollection
from astro.main.sourcecollection.RelabelSources \
  import RelabelSources
from astro.main.sourcecollection.ConcatenateSources \
  import ConcatenateSources
from astro.main.sourcecollection.RenameAttributes \
  import RenameAttributes

# Fetch a SourceCollection and AssociateList
sc = (SourceCollection.SCID == 100161)[0]
al = (AssociateList.ALID == 472781)[0]

# Create a RelabelSources
rs = RelabelSources()
rs.parent_collection = sc
rs.associatelist = al

# 'rs' and 'sc' now represent 'different' sources, which
# can be concatenated. (This is not really useful though.)
cs = ConcatenateSources()
cs.parent_collections = [sc, rs]

# And the attributes can be renamed.
ra = RenameAttributes()
ra.parent_collection = cs
ra.attributes_old = ['MAG_ISO', 'MAGERR_ISO']
ra.attributes_new = ['MAG_B', 'MAGERR_B']
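What these three operators do to a catalog can be sketched with dictionaries. The functions below are hypothetical stand-ins, the association is reduced to a mapping from old to new identifiers, and rejecting duplicate identifiers is a choice made for this sketch rather than documented behaviour.

```python
# Toy stand-ins for relabeling, concatenating and renaming.

def relabel_sources(catalog, association):
    """Give every source the new identifier listed in the association."""
    return {association[sid]: row for sid, row in catalog.items()}


def concatenate_sources(*catalogs):
    """Combine the sources of several catalogs into one."""
    combined = {}
    for catalog in catalogs:
        overlap = combined.keys() & catalog.keys()
        if overlap:
            raise ValueError('duplicate sources: %s' % sorted(overlap))
        combined.update(catalog)
    return combined


def rename_attributes(catalog, old, new):
    """Rename the attributes in 'old' to the names in 'new'."""
    mapping = dict(zip(old, new))
    return {sid: {mapping.get(name, name): value
                  for name, value in row.items()}
            for sid, row in catalog.items()}


sc = {
    (10, 1): {'MAG_ISO': 20.1, 'MAGERR_ISO': 0.05},
    (10, 2): {'MAG_ISO': 21.3, 'MAGERR_ISO': 0.08},
}
association = {(10, 1): (20, 1), (10, 2): (20, 2)}

rs = relabel_sources(sc, association)
cs = concatenate_sources(sc, rs)
ra = rename_attributes(cs, ['MAG_ISO', 'MAGERR_ISO'], ['MAG_B', 'MAGERR_B'])
```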

The SourceCollectionTree in the Background

The automatic creation of new SourceCollections is managed by non-persistent SourceCollectionTree objects. For example, the derive() function, but also the is_made, make and load_data functions use a SourceCollectionTree. It is instructive to discuss the SourceCollectionTree class, even though it is not often required to interface with one directly.

Derive

The derive() function above uses a SourceCollectionTree to pull SourceCollections. This example shows what happens inside this function.

from astro.main.sourcecollection.SourceCollection \
  import SourceCollection
from astro.main.sourcecollection.SourceCollectionTree \
  import SourceCollectionTree

# Same start as above.
slw = (SourceCollection.SCID == 100511)[0]
query = ' "DEC" < 11 '
attributes = ['R', 'iC']

# Create a SourceCollectionTree from the SourceCollection. The SCT traverses
# the dependency tree and keeps the progenitors of the given SC in memory.
# A Pass SC is created with the given SC as parent, which is used as the end
# node of the tree. Later functions of the SCT will replace SCs in the tree,
# but never this end node.
sct = SourceCollectionTree(slw)

# Apply the selection criterion. There are two things that can happen:
# 1) The SCT discovers an existing SC that represents the requested sources
#    and will create a SelectSources SC to select the requested sources.
# 2) The SCT cannot find suitable SCs and creates a new FilterSources SC with
#    the end node as parent and the given selection criterion as query. The
#    query is not evaluated; the exact composition of sources is unknown.
# In both cases, the created SC is placed between the Pass node at the end of
# the tree and its parent.
sct.apply_filter(query=query)

# Select the attributes. For every attribute the SCT will search for existing
# SCs that represent the attribute for the requested set of sources. A
# hierarchy of SelectAttributes and ConcatenateAttributes SCs is created that
# provides the right attributes. The SCT will try to instantiate new
# AttributeCalculator SCs if there are no existing SCs that contain a
# requested attribute.
sct.apply_attribute_selection(attributes)

# The end node of the tree now represents the requested catalog. This is
# a Pass SC, so its parent is returned by the .derive() function.
scnew = sct.sourcecollection.parent_collection
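The role of the fixed end node can be illustrated with a toy linked structure. This is only a sketch of the idea described in the comments above: Node and ToyTree are hypothetical classes, and the real SourceCollectionTree does considerably more (searching for existing SourceCollections, for instance).

```python
# Toy sketch of a tree with a fixed pass-through end node.

class Node:
    def __init__(self, label, parent=None):
        self.label = label
        self.parent = parent


class ToyTree:
    def __init__(self, collection):
        # The end node is a pass-through node that is never replaced.
        self.end = Node('Pass', parent=collection)

    def insert(self, label):
        """Place a new node between the end node and its parent."""
        node = Node(label, parent=self.end.parent)
        self.end.parent = node
        return node

    def result(self):
        # The requested catalog is the parent of the end node.
        return self.end.parent

    def chain(self):
        """List the labels from the end node back to the root."""
        labels = []
        node = self.end
        while node is not None:
            labels.append(node.label)
            node = node.parent
        return labels


tree = ToyTree(Node('SourceListWrapper'))
end_node = tree.end
tree.insert('FilterSources')
tree.insert('SelectAttributes')
```

Because every new node is placed directly below the end node, code that holds a reference to the end node always sees the latest state of the request.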

Visualizing a SourceCollectionTree

The dependency tree of a SourceCollection can be visualized with a SourceCollectionTree (Figure 1).

sct = SourceCollectionTree(scnew)
sct.make_dot_graph('howtotree1')

Figure 1: Dependency tree generated by the SourceCollectionTree.
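For readers unfamiliar with Graphviz, the sketch below shows how a dependency tree can be written down in the DOT language. It is a stand-alone toy: the node names are illustrative, to_dot is a hypothetical function, and nothing is assumed here about the exact output of make_dot_graph().

```python
# Toy sketch: 'tree' maps each node name to the names of its parents.
tree = {
    'ConcatenateAttributes': ['SelectAttributes_R', 'SelectAttributes_iC'],
    'SelectAttributes_R': ['AttributeCalculator_R'],
    'SelectAttributes_iC': ['AttributeCalculator_iC'],
    'AttributeCalculator_R': ['SourceListWrapper'],
    'AttributeCalculator_iC': ['SourceListWrapper'],
    'SourceListWrapper': [],
}


def to_dot(tree, name='dependencies'):
    """Return the tree as text in the Graphviz DOT language."""
    lines = ['digraph %s {' % name]
    for node in sorted(tree):
        lines.append('  "%s";' % node)
        for parent in tree[node]:
            # Edges point from a parent to the node that depends on it.
            lines.append('  "%s" -> "%s";' % (parent, node))
    lines.append('}')
    return '\n'.join(lines)


dot_text = to_dot(tree)
```

The resulting text can be saved to a file and rendered with the Graphviz dot program.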

Making Catalog Data

The functions of the SourceCollection class that handle the catalog data, is_made, make and load_data, can be called either with optimization or without. With optimization (optimize=True parameter, the default) a SourceCollectionTree is created and the respective function of this object is called to perform the required action in the optimal way.

# Continuing with 'scnew' from earlier

# Calling is_made without optimization will return True because scnew is a
# ConcatenateAttributes SC, which does not have to be processed.
scnew.is_made(optimize=False)

# Calling is_made with optimization is equal to the following:
# Create a SourceCollectionTree
sct = SourceCollectionTree(scnew)
# Optimize the tree for loading catalog data, by placing the selection of
# sources and attributes early in the tree.
sct.optimize_for_load()
# Now the entire optimized tree can be checked.
sct.is_made()
# This will recursively call 'is_made(optimize=False)' on the SCs in the
# optimized tree.

# Similarly, calling 'scnew.make()' is identical to:
sct = SourceCollectionTree(scnew)
sct.optimize_for_load()
sct.make()
# This will recursively call 'make(optimize=False)' on the SCs in the
# optimized tree that are not yet made.

# Finally, 'scnew.load_data()' is identical to:
sct = SourceCollectionTree(scnew)
sct.optimize_for_load()
sct.load_data()
# This will load the catalog data of the end node of the tree in an
# optimal way. It will not necessarily load the catalog data
# for all the other SCs in the tree.

Figure 2: Optimized dependency tree of Figure 1, generated by the SourceCollectionTree. The exact shape of the tree will differ depending on what part of the SourceCollections has already been processed.
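Why it helps to place the selection of sources early in the tree can be seen in a toy comparison. The sketch below is not the optimization code of the SourceCollectionTree; the functions are hypothetical and the distance calculation is a placeholder that merely counts its calls.

```python
# Toy comparison of selecting sources late versus early.
calls = {'count': 0}


def comoving_distance(row):
    # Placeholder calculation; it only counts how often it is called.
    calls['count'] += 1
    return 1000.0 * row['redshift']


catalog = {
    1: {'DEC': 9.5, 'redshift': 0.1},
    2: {'DEC': 10.5, 'redshift': 0.2},
    3: {'DEC': 11.5, 'redshift': 0.3},
    4: {'DEC': 12.5, 'redshift': 0.4},
}


def calculate_then_filter(catalog, limit):
    """Calculate the attribute for all sources, then select."""
    values = {sid: comoving_distance(row) for sid, row in catalog.items()}
    return {sid: r for sid, r in values.items()
            if catalog[sid]['DEC'] < limit}


def filter_then_calculate(catalog, limit):
    """Select the sources first, then calculate the attribute."""
    subset = {sid: row for sid, row in catalog.items()
              if row['DEC'] < limit}
    return {sid: comoving_distance(row) for sid, row in subset.items()}


late = calculate_then_filter(catalog, 11)
calls_late = calls['count']
calls['count'] = 0
early = filter_then_calculate(catalog, 11)
calls_early = calls['count']
```

Both orders give the same catalog, but selecting first requires the calculation for two sources instead of four.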

Store Catalog Data

Storing catalog data with store_data() in an optimized way works differently from the functions above. In practice, all the attributes of the AttributeCalculators in the dependency tree of the SourceCollection will be stored. No catalog data will be stored as part of the SourceCollection itself, unless it is an AttributeCalculator too.

# Store the catalog data in an optimized way.
scnew.store_data()

# This is equivalent to
sct = SourceCollectionTree(scnew)
sct.store_data()

Finding SourceCollections

There are several moments when the SourceCollectionTree will search for existing SourceCollections in order to fulfill a data pulling request. The SourceCollectionTree will create a list of all SourceCollections that could be used for a particular purpose. This list is subsequently ranked according to a key function and the SourceCollection with the highest rank is selected. Any SourceCollection with a positive key value would be suitable. The key_functions class property of the SourceCollectionTree is a dictionary that holds these key functions. The keys of the dictionary are:

  • find_selection: Used to rank SourceCollections that represent the sources selected by a given selection criterion.
  • find_attribute: Used to rank SourceCollections that provide a given attribute for a specific set of sources.
  • find_attribute_new_calculators: Used to rank new AttributeCalculator SourceCollections that provide a given attribute.
  • find_sources: Used to rank SourceCollections that represent the source identifiers of a given SourceCollection.

The selection process can be influenced by overloading these functions:

# Start with a SourceCollection.
slw = (SourceCollection.SCID == 100511)[0]

# Create an AttributeCalculator to calculate 'R' without lambda.
acd = AttributeCalculatorDefinition.get_acds_by_attribute('R')[0]
ac2 = acd.AC()
ac2.parent_collection = slw
ac2.set_process_parameter('omega_m', 1.0)
ac2.set_process_parameter('omega_l', 0.0)

# Create an SCT and manually ensure that the relevant SCs are tracked.
sct = SourceCollectionTree(slw)
sct.track_children_auto(cache=True)
sct.track_tree(ac2)

# Search for the comoving distance. An AC with the wrong omega_m is selected.
sc1=sct.find_attribute('R')
print("#sc1 omega_m", sc1.get_process_parameter('omega_m'))
#sc1 omega_m 0.3

# Define a new key function to find the correct AC.
from astro.main.sourcecollection.SourceCollectionTree \
  import key_find_attribute
def mykey(scd):
    # First retrieve the default ranking.
    tkey = key_find_attribute(scd)
    # The 'scd' is a dictionary, the key 'sc' points to the actual SC.
    sc = scd['sc']
    # Reduce the key value of SCs that are not an AttributeCalculator
    if not isinstance(sc, AttributeCalculator):
        tkey -= 10**9
    # and reduce the key value of the ACs that use another omega_m.
    elif sc.get_process_parameter('omega_m') != 1.0:
        tkey -= 10**9
    return tkey

# Set the new key function.
SourceCollectionTree.key_functions['find_attribute'] = mykey

# Search for the comoving distance again, the preferred AC is found.
sc2=sct.find_attribute('R')
print("#sc2 omega_m", sc2.get_process_parameter('omega_m'))
#sc2 omega_m 1.0
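The ranking mechanism itself is simple enough to sketch without Astro-WISE. In the toy below, candidates are dictionaries, select_best is a hypothetical function, and the default key is invented for illustration; only the rules that the highest rank wins and that the key value must be positive are taken from the description above.

```python
# Toy sketch of ranking candidates with a replaceable key function.

def select_best(candidates, key):
    """Return the highest ranked candidate with a positive key value."""
    ranked = sorted(candidates, key=key, reverse=True)
    if ranked and key(ranked[0]) > 0:
        return ranked[0]
    return None


candidates = [
    {'name': 'ac_default', 'omega_m': 0.3, 'made': True},
    {'name': 'ac_no_lambda', 'omega_m': 1.0, 'made': False},
]


def default_key(candidate):
    # Hypothetical default: prefer candidates that are already processed.
    return 2 if candidate['made'] else 1


def custom_key(candidate):
    # Start from the default rank, then penalize an unwanted omega_m.
    rank = default_key(candidate)
    if candidate['omega_m'] != 1.0:
        rank -= 10**9
    return rank


first = select_best(candidates, default_key)
second = select_best(candidates, custom_key)
```

Subtracting a large number, rather than returning a fixed value, keeps the default ordering among the remaining candidates intact.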

AttributeCalculatorDefinitions

The calculation that is performed by an AttributeCalculator SourceCollection is described by an AttributeCalculatorDefinition object. These objects can be created by any scientist and shared with others.

The code of an AttributeCalculatorDefinition is stored in a file on the dataserver. This Python file contains a new AttributeCalculator class that is derived from the one in the code base. The create_from_file() method of the AttributeCalculatorDefinition class can be used to create a new definition from this class. Auxiliary files can be used by wrapping everything in a tarball.

The procedure for this is too long to list in this document; see demo 17 for an example.

cd $AWEPIPE/astro/experimental/SourceCollection/demos/demo17

SAMP Interaction and Query Driven Visualization

A design goal of the SourceCollection classes was to be able to use them interactively over SAMP. New SAMP messages are designed to allow query driven visualization in a more declarative way than is possible with other information systems.

The SAMP interaction is described in HOW-TO Use SAMP in Astro-WISE. The Query Driven Visualization is described in HOW-TO Use Query Driven Visualization in Astro-WISE.