HOW-TO Use SourceCollections in Astro-WISE¶
The SourceCollection classes extend the concepts of data lineage and data pulling to catalogs, for example for sample selection and parameter derivation.
This HOW-TO is not yet complete and should be considered a draft. For background information and details see the thesis of Hugo Buddelmeijer.
Overview¶
A SourceCollection is a ProcessTarget for source catalogs. A
SourceCollection represents both a catalog of sources with attributes
and the operation to create the data of this catalog. Sources are
identified by their SLID-SID combination and attributes by their
name. The operations range from selection of sources to calculation of
attributes and are applied to other persistent objects, often other
SourceCollections. Each operator corresponds to a separate persistent
class that is derived from the base SourceCollection class.
An Astro-WISE Session¶
Using the SourceCollection classes is demonstrated by showing an example Astro-WISE session.
Bootstrapping¶
A session would usually start with retrieving an existing SourceCollection. This demo session starts with creating a new SourceCollection because this makes the pulling examples more predictable.
For now it is required to wrap a SourceList in a SourceListWrapper class in order to use it with the SourceCollections. Ultimately, the SourceCollection classes will be better integrated with the SourceList and other *List classes.
from astro.main.SourceList import SourceList
from astro.main.sourcecollection.SourceCollection \
import SourceCollection
from astro.main.sourcecollection.SourceListWrapper \
import SourceListWrapper
# Fetch a SourceList
sl = (SourceList.SLID == 1575051)[0]
# Create a SourceCollection from the SourceList
slw = SourceListWrapper()
slw.set_sourcelist(sl)
# The SourceCollection can be made persistent if wanted.
slw.commit()
print(slw.SCID)
# 100511
Pulling One Sample¶
SourceLists created from an image only contain photometric attributes. The SourceCollection classes allow attributes derived from those to be created by pulling them. This example shows how to pull a subset of the sources with newly derived attributes.
# Continue with 'slw' from above.
# Define a criterion to select the required sources.
query = ' "DEC" < 11 '
# Define the requested attributes.
# In this case a comoving distance and the inverse concentration index.
attributes = ['R', 'iC']
# Pull a SourceCollection that has the requested data.
#
# The information system will search for existing SourceCollections
# that can be used to fulfill the request. New SourceCollections
# are created if no suitable ones are found.
scnew = slw.derive(attributes=attributes, query=query)
# New SourceCollections are created to be as general as possible
# in order to facilitate reuse. In this case, the SourceCollection
# calculating 'R' will be created to calculate the attribute for all the
# sources of slw, not only for the sources with a declination below 11.
# Check whether the SourceCollection has been processed.
scnew.is_made()
# If False is returned, the SourceCollection has to be processed.
scnew.make()
# The is_made() and make() functions only apply to the parts of the
# dependency tree that are required to build the target SourceCollection.
# In this case, 'R' will only be calculated for the sources with DEC < 11.
Loading Catalog Data¶
The catalog data of a SourceCollection can be accessed by loading it into a TableConverter object or by sending it over SAMP.
# Load the catalog data into a TableConverter.
scnew.load_data()
# Interface with the catalog data.
print(scnew.data.attribute_order)
# ['SLID', 'SID', 'R', 'iC']
print(scnew.data.attributes['R'])
# {'length': 1, 'ucd': '', 'null': '', 'name': 'R', 'format': 'float64'}
print(scnew.data.data['R'])
# [ 562.54038472 1905.82573128 397.30116968 ..., 12.93537333
# Or send the catalog data over SAMP.
from astro.services.samp.Samp import Samp
s = Samp()
s.broadcast(scnew)
# Highlight or select and broadcast some sources in Topcat or Aladin.
# Retrieve the SLIDs/SIDs or the row in the TableConverter.
s.highlightedSource(scnew)
# (1575051, 812)
s.selectedSources(scnew)
# [(1575051, 843), (1575051, 847), (1575051, 848)]
s.highlighted(scnew)
# 812
Pulling More Samples¶
An important feature of the SourceCollections is that only those parts of a dependency tree that are required to build the final target SourceCollection are processed. This is demonstrated with the following example.
# Continue with 'slw' from before.
# Pull attributes for a subset of the sources.
scnew1 = slw.derive(query=' "DEC" < 11 ', attributes=['R'])
scnew1.is_made()
# Should return True because this part of the SourceCollection has already
# been processed in the previous example. If not, make it:
scnew1.make()
# Pull attributes for a smaller subset of the sources
scnew2 = slw.derive(query=' "DEC" < 10 ', attributes=['R'])
scnew2.is_made()
# True, because this has already been processed.
# Pull data for a larger subset of the sources
scnew3 = slw.derive(query=' "DEC" < 12 ', attributes=['R'])
scnew3.is_made()
# False, because this part of the SourceCollection
# has not been processed completely yet.
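How the information system decides that a narrower selection is already made is not shown in this HOW-TO. As a rough illustration only, the sketch below models the behaviour with a hypothetical ToyCalculator class that remembers for which sources an attribute has been calculated. The class, its methods and the declination values are assumptions made for this example and are not part of Astro-WISE.

```python
class ToyCalculator:
    """Toy stand-in for an attribute-calculating SourceCollection (hypothetical)."""

    def __init__(self, declinations):
        # Map of source id -> declination for all sources of the parent.
        self.declinations = dict(declinations)
        # Source ids for which the attribute has been calculated.
        self.processed = set()

    def select(self, max_dec):
        # Source ids that satisfy the criterion "DEC" < max_dec.
        return {sid for sid, dec in self.declinations.items() if dec < max_dec}

    def is_made(self, max_dec):
        # Made only if every selected source has already been processed.
        return self.select(max_dec) <= self.processed

    def make(self, max_dec):
        # Mark the selected sources as processed.
        self.processed |= self.select(max_dec)


calc = ToyCalculator({1: 9.5, 2: 10.5, 3: 11.5, 4: 12.5})
calc.make(11)
print(calc.is_made(11))  # True
print(calc.is_made(10))  # True
print(calc.is_made(12))  # False
```

A smaller selection is a subset of what was processed, so it counts as made. A larger selection contains unprocessed sources, so it does not.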
Storing Catalog Data¶
New catalog data is created in the examples above. Storing the attribute values prevents them from being calculated again.
# Continue with the datasets from above.
# Commit the SourceCollection that needs to be saved. This will recursively
# commit the dependency tree as well.
scnew2.commit()
# Store the source data that has been calculated. This will store the catalog data
# in an optimal way. Although scnew2 only represents a subset of the
# sources, all the calculated attributes are stored: the 'R' value for the sources
# in scnew1 will be stored as well.
scnew2.store_data()
# The demo session ends here.
Pushing SourceCollections¶
The derive() function above creates a hierarchy of
SourceCollections. This section shows how this hierarchy can be created
manually.
It is preferable to use the data pulling functions and let the information system determine what needs to be created automatically. This is less work and facilitates reuse.
Nonetheless, not all SourceCollections can be created through pulling data, so it is useful to know how to create them manually.
Import¶
First import all relevant classes.
# astro.main classes
from astro.main.SourceList import SourceList
# Virtual base SourceCollection
from astro.main.sourcecollection.SourceCollection \
import SourceCollection
# All derived operator SourceCollections
from astro.main.sourcecollection.SourceListWrapper \
import SourceListWrapper
from astro.main.sourcecollection.FilterSources \
import FilterSources
from astro.main.sourcecollection.SelectSources \
import SelectSources
from astro.main.sourcecollection.SelectAttributes \
import SelectAttributes
from astro.main.sourcecollection.ConcatenateAttributes \
import ConcatenateAttributes
from astro.main.sourcecollection.AttributeCalculator \
import AttributeCalculator, AttributeCalculatorDefinition
Bootstrap¶
Start with a new SourceCollection.
# Fetch a SourceList
sl = (SourceList.SLID == 1575051)[0]
# Create a SourceCollection from the SourceList
slw = SourceListWrapper()
slw.set_sourcelist(sl)
Calculate Attributes¶
An AttributeCalculator SourceCollection is used for the derivation of new source attributes from existing attributes. The calculation that is performed by an AttributeCalculator SourceCollection is given by an AttributeCalculatorDefinition object. The creation of AttributeCalculator SourceCollections in a pushing way requires some handwork:
# Find all AttributeCalculatorDefinition objects that can be used to
# calculate comoving distances.
acdsR = AttributeCalculatorDefinition.get_acds_by_attribute('R')
# Pick the first one.
acdR = acdsR[0]
print(acdR.name)
# Comoving Distance Calculator
# See which attributes are required by the ACD.
print(acdR.input_attribute_names)
# ['redshift', 'RA', 'DEC', 'HTM']
# These are all available in 'slw':
print([a in slw.get_attribute_names() for a in acdR.input_attribute_names])
# [True, True, True, True]
# The required attributes have to be selected with a SelectAttributes:
sa1 = SelectAttributes()
sa1.parent_collection = slw
sa1.selected_attributes = ['redshift', 'RA', 'DEC', 'HTM']
# Now the AttributeCalculator can be initialized. The 'AC' property of an ACD
# object is a class derived from the AttributeCalculator class that uses this
# definition.
ac1 = acdR.AC()
ac1.parent_collection = sa1
Similarly for the inverse concentration:
acdsiC = AttributeCalculatorDefinition.get_acds_by_attribute('iC')
acdiC = acdsiC[0]
all([a in slw.get_attribute_names() for a in acdiC.input_attribute_names])
sa2 = SelectAttributes()
sa2.parent_collection = slw
sa2.selected_attributes = [a for a in acdiC.input_attribute_names]
ac2 = acdiC.AC()
ac2.parent_collection = sa2
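The code of the 'Comoving Distance Calculator' itself is not shown in this HOW-TO. To give an idea of the kind of calculation such a definition might wrap, the sketch below integrates the standard textbook expression for the line-of-sight comoving distance, D_C = (c / H0) * integral of dz / E(z), with E(z) = sqrt(omega_m (1+z)^3 + omega_k (1+z)^2 + omega_l). The function name, the default parameter values (including H0 = 70 km/s/Mpc) and the use of Simpson's rule are assumptions for this example, not the Astro-WISE implementation.

```python
import math

C_KM_S = 299792.458  # speed of light in km/s


def comoving_distance(z, omega_m=0.3, omega_l=0.7, h0=70.0, steps=1000):
    """Line-of-sight comoving distance in Mpc (hypothetical helper)."""
    omega_k = 1.0 - omega_m - omega_l

    def inv_e(x):
        # 1 / E(z) for the given density parameters.
        return 1.0 / math.sqrt(
            omega_m * (1 + x) ** 3 + omega_k * (1 + x) ** 2 + omega_l
        )

    # Composite Simpson's rule needs an even number of intervals.
    n = steps + (steps % 2)
    h = z / n
    total = inv_e(0.0) + inv_e(z)
    for i in range(1, n):
        total += (4 if i % 2 else 2) * inv_e(i * h)
    return C_KM_S / h0 * total * h / 3.0


# Check against the closed form for omega_m = 1, omega_l = 0:
# D_C = (2 c / H0) * (1 - 1 / sqrt(1 + z))
numeric = comoving_distance(1.0, omega_m=1.0, omega_l=0.0)
analytic = 2 * C_KM_S / 70.0 * (1 - 1 / math.sqrt(2.0))
print(abs(numeric - analytic) < 1e-6)  # True
```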
Selecting Sources¶
Selecting sources is either done with a FilterSources SourceCollection or with a SelectSources SourceCollection.
# A FilterSources represents a subset of the parent SourceCollection
# by evaluation of a selection criterion.
fs = FilterSources()
fs.parent_collection = slw
fs.set_query(' "DEC" < 11 ')
# Alternatively, a SelectSources SourceCollection can be used to select
# a subset that is explicitly listed by another SourceCollection.
# First create a SourceCollection with only source identifiers.
sa3 = SelectAttributes()
sa3.parent_collection = fs
# Use this to specify the selected sources of a SelectSources SourceCollection
ss = SelectSources()
ss.parent_collection = slw
ss.selected_sources = sa3
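The difference between the two ways of selecting can be illustrated without the information system. In the sketch below a catalog is a plain dictionary keyed by (SLID, SID); the helper functions filter_sources and select_sources are hypothetical stand-ins that only mimic the idea of FilterSources (evaluate a criterion) and SelectSources (use an explicit list of identifiers). The declination values are made up.

```python
# Toy catalog: (SLID, SID) -> attributes.
catalog = {
    (1575051, 1): {"DEC": 10.2},
    (1575051, 2): {"DEC": 11.7},
    (1575051, 3): {"DEC": 9.8},
}


def filter_sources(catalog, predicate):
    """Select sources by evaluating a criterion on their attributes."""
    return {key: row for key, row in catalog.items() if predicate(row)}


def select_sources(catalog, identifiers):
    """Select the sources that are explicitly listed by identifier."""
    wanted = set(identifiers)
    return {key: row for key, row in catalog.items() if key in wanted}


filtered = filter_sources(catalog, lambda row: row["DEC"] < 11)
selected = select_sources(catalog, filtered.keys())
print(sorted(filtered) == sorted(selected))  # True
```

Both routes end up with the same sources here. The explicit list can be evaluated once and reused, whereas the criterion has to be evaluated against the attribute values.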
Combining All Attributes¶
A ConcatenateAttributes SourceCollection is used to combine the attributes from the AttributeCalculators for the sources in the FilterSources.
# First select no attributes from the FilterSources
sa3 = SelectAttributes()
sa3.parent_collection = fs
# Select the comoving distance
sa4 = SelectAttributes()
sa4.parent_collection = ac1
sa4.selected_attributes = ['R']
# Select the inverse concentration
sa5 = SelectAttributes()
sa5.parent_collection = ac2
sa5.selected_attributes = ['iC']
# Combine all the attributes
ca = ConcatenateAttributes()
ca.parent_collections = [sa3, sa4, sa5]
# And we're done
scnew = ca
scnew.make()
scnew.load_data()
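Conceptually, the hierarchy above combines one identifier-only collection with one collection per attribute. The sketch below shows that idea on plain dictionaries; concatenate_attributes is a hypothetical helper, and the attribute values are made-up numbers, not real catalog data.

```python
def concatenate_attributes(identifiers, columns):
    """Return one row per listed source, holding every named attribute."""
    return {
        ident: {name: values[ident] for name, values in columns.items()}
        for ident in identifiers
    }


# Identifiers of the selected sources (the role of sa3 above).
selected = [(1575051, 1), (1575051, 3)]
# Attribute columns available for all sources (the roles of sa4 and sa5).
columns = {
    "R": {(1575051, 1): 562.5, (1575051, 2): 1905.8, (1575051, 3): 397.3},
    "iC": {(1575051, 1): 0.41, (1575051, 2): 0.35, (1575051, 3): 0.38},
}

table = concatenate_attributes(selected, columns)
print(table[(1575051, 3)])  # {'R': 397.3, 'iC': 0.38}
```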
Other Operators¶
SourceCollection classes not yet shown in this HOW-TO are
- ConcatenateSources, to combine the sources of different SourceCollections.
- RenameAttributes, to rename the attributes of the parent SourceCollection.
- RelabelSources, to give sources of the parent SourceCollection a new SLID-SID combination by specifying an AssociateList.
from astro.main.AssociateList import AssociateList
from astro.main.sourcecollection.SourceCollection \
import SourceCollection
from astro.main.sourcecollection.RelabelSources \
import RelabelSources
from astro.main.sourcecollection.ConcatenateSources \
import ConcatenateSources
from astro.main.sourcecollection.RenameAttributes \
import RenameAttributes
# Fetch a SourceCollection and AssociateList
sc = (SourceCollection.SCID == 100161)[0]
al = (AssociateList.ALID == 472781)[0]
# Create a RelabelSources
rs = RelabelSources()
rs.parent_collection = sc
rs.associatelist = al
# 'rs' and 'sc' now represent 'different' sources, which
# can be concatenated. (This is not really useful though.)
cs = ConcatenateSources()
cs.parent_collections = [sc, rs]
# And the attributes can be renamed.
ra = RenameAttributes()
ra.parent_collection = cs
ra.attributes_old = ['MAG_ISO', 'MAGERR_ISO']
ra.attributes_new = ['MAG_B', 'MAGERR_B']
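The two lists given to the RenameAttributes above are paired by position. A minimal sketch of that pairing on a single row is given below; rename_attributes is a hypothetical helper, and the magnitude values are made up for the example.

```python
def rename_attributes(row, attributes_old, attributes_new):
    """Return a copy of row with the old attribute names replaced by the new ones."""
    if len(attributes_old) != len(attributes_new):
        raise ValueError("attributes_old and attributes_new must have equal length")
    mapping = dict(zip(attributes_old, attributes_new))
    # Attributes that are not listed keep their name.
    return {mapping.get(name, name): value for name, value in row.items()}


row = {"SLID": 1575051, "SID": 812, "MAG_ISO": 18.2, "MAGERR_ISO": 0.03}
renamed = rename_attributes(row, ["MAG_ISO", "MAGERR_ISO"], ["MAG_B", "MAGERR_B"])
print(sorted(renamed))  # ['MAGERR_B', 'MAG_B', 'SID', 'SLID']
```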
The SourceCollectionTree in the Background¶
The automatic creation of new SourceCollections is managed by
non-persistent SourceCollectionTree objects. For example, the
derive() function, but also the is_made, make and load_data
functions use a SourceCollectionTree. It is instructive to
discuss the SourceCollectionTree class, even though it is not often
required to interface with one directly.
Derive¶
The derive() function above uses a SourceCollectionTree to pull
SourceCollections. This example shows what happens inside this function.
from astro.main.sourcecollection.SourceCollection \
import SourceCollection
from astro.main.sourcecollection.SourceCollectionTree \
import SourceCollectionTree
# Same start as above.
slw = (SourceCollection.SCID == 100511)[0]
query = ' "DEC" < 11 '
attributes = ['R', 'iC']
# Create a SourceCollectionTree from the SourceCollection. The SCT traverses
# the dependency tree and keeps the progenitors of the given SC in memory.
# A Pass SC is created with the given SC as parent, which is used as the end
# node of the tree. Later functions of the SCT will replace SCs in the tree,
# but never this end node.
sct = SourceCollectionTree(slw)
# Apply the selection criterion. There are two things that can happen:
# 1) The SCT discovers an existing SC that represents the requested sources
# and will create a SelectSources SC to select the requested sources.
# 2) The SCT cannot find suitable SCs and creates a new FilterSources SC with
# the end node as parent and the given selection criterion as query. The
# query is not evaluated; the exact composition of sources is unknown.
# In both cases, the created SC is placed between the Pass node at the end of
# the tree and its parent.
sct.apply_filter(query=query)
# Select the attributes. For every attribute the SCT will search for existing
# SCs that represent the attribute for the requested set of sources. A
# hierarchy of SelectAttributes and ConcatenateAttributes SCs is created that
# provides the right attributes. The SCT will try to instantiate new
# AttributeCalculator SCs if there are no existing SCs that contain a
# requested attribute.
sct.apply_attribute_selection(attributes)
# The end node of the tree now represents the requested catalog. This is
# a Pass SC, so its parent is returned by the .derive() function.
scnew = sct.sourcecollection.parent_collection
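The role of the Pass end node can be sketched with a small linked structure. The Node and ToyTree classes below are hypothetical and only mimic the behaviour described in the comments above: new nodes are placed between the end node and its parent, and the requested catalog is the parent of the end node.

```python
class Node:
    """Toy stand-in for a SourceCollection with a single parent (hypothetical)."""

    def __init__(self, label, parent=None):
        self.label = label
        self.parent = parent


class ToyTree:
    def __init__(self, start):
        # The end node is a 'Pass' node that is never replaced.
        self.end = Node("Pass", parent=start)

    def insert_before_end(self, label):
        # Place the new node between the end node and its current parent.
        new = Node(label, parent=self.end.parent)
        self.end.parent = new
        return new

    def result(self):
        # The requested catalog is the parent of the Pass node.
        return self.end.parent

    def lineage(self):
        # Labels from the end node back to the start of the tree.
        labels, node = [], self.end
        while node is not None:
            labels.append(node.label)
            node = node.parent
        return labels


tree = ToyTree(Node("SourceListWrapper"))
tree.insert_before_end("FilterSources")
tree.insert_before_end("ConcatenateAttributes")
print(tree.lineage())
# ['Pass', 'ConcatenateAttributes', 'FilterSources', 'SourceListWrapper']
print(tree.result().label)  # ConcatenateAttributes
```

Keeping a fixed end node means the tree always has a stable handle, whatever is inserted or replaced behind it.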
Visualizing a SourceCollectionTree¶
The dependency tree of a SourceCollection can be visualized with a SourceCollectionTree (Figure 1).
sct = SourceCollectionTree(scnew)
sct.make_dot_graph('howtotree1')
Making Catalog Data¶
The functions of the SourceCollection class that handle the catalog
data, is_made, make and load_data, can be called either with
optimization or without. With optimization (optimize=True parameter,
the default) a SourceCollectionTree is created and the respective
function of this object is called to perform the required action in the
optimal way.
# Continuing with 'scnew' from earlier
# Calling is_made without optimization will return True because scnew is a
# ConcatenateAttributes SC, which does not have to be processed.
scnew.is_made(optimize=False)
# Calling is_made with optimization is equal to the following:
# Create a SourceCollectionTree
sct = SourceCollectionTree(scnew)
# Optimize the tree for loading catalog data, by placing the selection of
# sources and attributes early in the tree.
sct.optimize_for_load()
# Now the entire optimized tree can be checked.
sct.is_made()
# This will recursively call 'is_made(optimize=False)' on the SCs in the
# optimized tree.
# Similarly, calling 'scnew.make()' is identical to:
sct = SourceCollectionTree(scnew)
sct.optimize_for_load()
sct.make()
# This will recursively call 'make(optimize=False)' on the SCs in the
# optimized tree that are not yet made.
# Finally, 'scnew.load_data()' is identical to:
sct = SourceCollectionTree(scnew)
sct.optimize_for_load()
sct.load_data()
# This will load the catalog data of the end node of the tree in the
# most optimal way. It will not necessarily load the catalog data
# for all the other SCs in the tree.
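The difference between checking a single node and checking the whole tree can be sketched as follows. This toy model leaves out the optimization step entirely and only shows the recursion; ToyNode, tree_is_made and tree_make are hypothetical names, not Astro-WISE functions.

```python
class ToyNode:
    """Toy stand-in for a SourceCollection in a dependency tree (hypothetical)."""

    def __init__(self, name, parents=(), needs_processing=True):
        self.name = name
        self.parents = list(parents)
        # Nodes that need no processing of their own count as made.
        self.made = not needs_processing


def tree_is_made(node):
    # The tree is made only if the node and all its progenitors are made.
    return node.made and all(tree_is_made(p) for p in node.parents)


def tree_make(node, log):
    # Make the progenitors first, then the node itself if still required.
    for parent in node.parents:
        tree_make(parent, log)
    if not node.made:
        node.made = True
        log.append(node.name)


wrapper = ToyNode("slw", needs_processing=False)
ac_r = ToyNode("ac1", parents=[wrapper])
ac_ic = ToyNode("ac2", parents=[wrapper])
concat = ToyNode("ca", parents=[ac_r, ac_ic], needs_processing=False)

print(concat.made)           # True: the node itself needs no processing
print(tree_is_made(concat))  # False: its calculators are not made yet
log = []
tree_make(concat, log)
print(log)                   # ['ac1', 'ac2']
print(tree_is_made(concat))  # True
```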
Store Catalog Data¶
Storing catalog data with store_data() in an optimized way works
differently from the functions above. In practice, all the attributes of
the AttributeCalculators in the dependency tree of the SourceCollection
will be stored. No catalog data will be stored as part of the
SourceCollection itself, unless it is an AttributeCalculator too.
# Store the catalog data in an optimized way.
scnew.store_data()
# This is equivalent to
sct = SourceCollectionTree(scnew)
sct.store_data()
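Which nodes end up holding stored data can be pictured as a walk over the dependency tree that picks out the calculators. The sketch below is an assumption about the general idea only; ToySC and calculators_in_tree are hypothetical names and say nothing about how store_data() is actually implemented.

```python
class ToySC:
    """Toy stand-in for a SourceCollection (hypothetical)."""

    def __init__(self, name, parents=(), is_calculator=False):
        self.name = name
        self.parents = list(parents)
        self.is_calculator = is_calculator


def calculators_in_tree(node):
    """Names of all calculator nodes in the dependency tree, each listed once."""
    found = []
    stack = [node]
    visited = set()
    while stack:
        current = stack.pop()
        if id(current) in visited:
            continue
        visited.add(id(current))
        if current.is_calculator:
            found.append(current.name)
        stack.extend(current.parents)
    return sorted(found)


slw_toy = ToySC("slw")
ac1_toy = ToySC("ac1", parents=[slw_toy], is_calculator=True)
ac2_toy = ToySC("ac2", parents=[slw_toy], is_calculator=True)
ca_toy = ToySC("ca", parents=[ac1_toy, ac2_toy])

print(calculators_in_tree(ca_toy))  # ['ac1', 'ac2']
```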
Finding SourceCollections¶
There are several moments when the SourceCollectionTree will search for
existing SourceCollections in order to fulfill a data pulling request.
The SourceCollectionTree will create a list of all SourceCollections
that could be used for a particular purpose. This list is subsequently
ranked according to a key function and the SourceCollection with the
highest rank is selected. Any SourceCollection with a positive key value
would be suitable. The key_functions class property of the
SourceCollectionTree is a dictionary that holds these key functions. The
keys of the dictionary are:
- find_selection: Used to rank SourceCollections that represent the sources selected by a given selection criterion.
- find_attribute: Used to rank SourceCollections that provide a given attribute for a specific set of sources.
- find_attribute_new_calculators: Used to rank new AttributeCalculator SourceCollections that provide a given attribute.
- find_sources: Used to rank SourceCollections that represent the source identifiers of a given SourceCollection.
The selection process can be influenced by overloading these functions:
# Start with a SourceCollection.
slw = (SourceCollection.SCID == 100511)[0]
# Create an AttributeCalculator to calculate 'R' without lambda.
acd = AttributeCalculatorDefinition.get_acds_by_attribute('R')[0]
ac2 = acd.AC()
ac2.parent_collection = slw
ac2.set_process_parameter('omega_m', 1.0)
ac2.set_process_parameter('omega_l', 0.0)
# Create an SCT and manually ensure that the relevant SCs are tracked.
sct = SourceCollectionTree(slw)
sct.track_children_auto(cache=True)
sct.track_tree(ac2)
# Search for the comoving distance. An AC with the wrong omega_m is selected.
sc1=sct.find_attribute('R')
print("#sc1 omega_m", sc1.get_process_parameter('omega_m'))
#sc1 omega_m 0.3
# Define a new key function to find the correct AC.
from astro.main.sourcecollection.SourceCollectionTree \
import key_find_attribute
def mykey(scd):
    # First retrieve the default ranking.
    tkey = key_find_attribute(scd)
    # The 'scd' is a dictionary, the key 'sc' points to the actual SC.
    sc = scd['sc']
    # Reduce the key value of SCs that are not an AttributeCalculator
    if not isinstance(sc, AttributeCalculator):
        tkey -= 10**9
    # and reduce the key value of the ACs that use another omega_m.
    elif sc.get_process_parameter('omega_m') != 1.0:
        tkey -= 10**9
    return tkey
# Set the new key function.
SourceCollectionTree.key_functions['find_attribute'] = mykey
# Search for the comoving distance again, the preferred AC is found.
sc2=sct.find_attribute('R')
print("#sc2 omega_m", sc2.get_process_parameter('omega_m'))
#sc2 omega_m 1.0
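The ranking step described in this section (collect candidates, rank them with a key function, take the highest-ranked one with a positive key value) can be sketched in a few lines. The select_best helper, the constant default key and the dictionary layout of the candidates are assumptions for this example; only the 'sc' key of the candidate dictionary is taken from the session above.

```python
def select_best(candidates, key):
    """Return the highest-ranked candidate with a positive key value, or None."""
    suitable = [c for c in candidates if key(c) > 0]
    return max(suitable, key=key) if suitable else None


def default_key(scd):
    # Hypothetical default ranking: every candidate gets the same positive score.
    return 1


def prefer_omega_m_one(scd):
    # Start from the default ranking and penalize unwanted candidates.
    tkey = default_key(scd)
    if scd["sc"]["omega_m"] != 1.0:
        tkey -= 10**9
    return tkey


candidates = [{"sc": {"omega_m": 0.3}}, {"sc": {"omega_m": 1.0}}]

# With equal scores, max() returns the first candidate.
print(select_best(candidates, default_key)["sc"]["omega_m"])         # 0.3
print(select_best(candidates, prefer_omega_m_one)["sc"]["omega_m"])  # 1.0
```

Subtracting a large constant rather than returning a fixed negative number keeps the relative order of the default ranking among the penalized candidates.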
AttributeCalculatorDefinitions¶
The calculation that is performed by an AttributeCalculator SourceCollection is described by an AttributeCalculatorDefinition object. These objects can be created by any scientist and shared with others.
The code of an AttributeCalculatorDefinition is stored in a file on the
dataserver. This (python) file contains a new AttributeCalculator class
that is derived from the one in the code base. The create_from_file()
method of the AttributeCalculatorDefinition class can be used to create
a new definition from this class. Auxiliary files can be used by
wrapping everything in a tarball.
The procedure for this is too long to list in this document; see demo 17 for an example.
cd $AWEPIPE/astro/experimental/SourceCollection/demos/demo17
SAMP Interaction and Query Driven Visualization¶
A design goal of the SourceCollection classes was to be able to use them interactively over SAMP. New SAMP messages are designed to allow query driven visualization in a more declarative way than is possible with other information systems.
The SAMP interaction is described in HOW-TO Use SAMP in Astro-WISE. The Query Driven Visualization is described in HOW-TO Use Query Driven Visualization in Astro-WISE.