# astro.main.sourcecollection package

## astro.main.sourcecollection.ACPSC module

class astro.main.sourcecollection.ACPSC.AttributeCalculatorParameterSourceCollection

An AttributeCalculatorParameter that is a SourceCollection.

description

Description of the process parameter.

name

Name of this process parameter.

object_id

The object identifier

The object identifier is an attribute shared by all persistent instances. It is the primary key by which object identity is established.

ptype

Type/class of the parameter.

value

Value of the parameter if it is a SourceCollection [None]

## astro.main.sourcecollection.AttributeCalculator module

class astro.main.sourcecollection.AttributeCalculator.AttributeCalculator(sourcelist_data=None)

The AttributeCalculator is a SourceCollection to calculate new attributes for the sources in the parent SourceCollection.

TODO MT M: [DOCUMENTATION] [ATTRIBUTECALCULATOR] Improve this docstring.

SCID

SourceCollection identifier [None]

acd_attributes = None
acd_flags = None
acd_input_attribute_names = None
acd_name = None
acd_parameters = None
all_data_stored

Flag to indicate whether all data has been stored in sourcelist_data, 0 means no (or unknown), 1 means yes

attribute_columns

Column names of the attributes in the sourcelist_data

attribute_names

Names of the attributes corresponding to the attribute_columns

change(prop, value, index=None)

Function to set a property. Used for SAMP connectivity and in the SourceCollectionTree.

TODO MT M: [ATTRIBUTECALCULATOR] The way process parameters are handled is a bit ad hoc. Perhaps this should be improved.

check_pre_make(relations=None)

Do some checks before making the data.

TODO MT M: [ATTRIBUTECALCULATOR] Actually use this function properly.

copy(relations=None, copy_parents=True, copy_parameters=True, copy_cache=False, copy_supercache=False)
creation_date

Date this object was created [None]

definition

The Definition of this AttributeCalculator instance.

get_attributes_auto(cache=False)
get_attributes_full_auto(cache=False)
get_export(relations=None)
get_onthefly_dependencies(config=None)

get_onthefly_dependencies() is overloaded from OnTheFly in order to add any dependencies from the process_parameters.

get_parents_tree()
get_process_parameter(pkey)

Retrieves the value of a process parameter based on its name.

get_properties_lineage()
get_source_relations(relations=None, quick=False)

Returns a SetRelations object to hold relations between SourceCollections. An AttributeCalculator describes the exact same sources as its parent.

initialize_existing()

Fetches the corresponding AC definition and upgrades the class.

This function is called when instantiating a persistent AC.

initialize_new()

Initializes the AC.process_parameters from the ACD.calculator_parameters.

This function is called when a new AttributeCalculator is created by instantiating the AC of an ACD.

is_valid

Manual/external flag to disqualify bad data (SuperFlag) [None]

load_data_parent(optimize=True, copy=True)

Most AttributeCalculators will need the data from their parent to be executed. This helper function can provide it.

TODO MT H: [ATTRIBUTECALCULATOR] [DATAFORMAKE] This should not be necessary, the SCT should take care of this.

make(optimize=True)

This is a virtual make() method for an AttributeCalculator instance.

The AttributeCalculatorDefinition should provide either:

• an entire make() function, or
• a calculate_attributes() or calculate_attributes_vector() function.

The default make() of the AttributeCalculator is a wrapper around the calculate_attributes_vector() function.
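As a rough illustration of the wrapping described above, the sketch below uses a hypothetical stand-in base class (`BaseCalculator`) whose `make()` delegates to `calculate_attributes_vector()`. The data layout (a dict mapping attribute names to lists of values), the class names, and the method signatures are assumptions made for this example; the real AttributeCalculator operates on SourceCollection data and may differ.

```python
class BaseCalculator:
    """Hypothetical stand-in for the AttributeCalculator base class."""

    def __init__(self, parent_data):
        # Assumed layout: attribute name -> list of values, one per source.
        self.parent_data = parent_data
        self.result = None

    def calculate_attributes_vector(self, data):
        # Virtual: a definition is expected to supply this calculation.
        raise NotImplementedError("the definition must provide the calculation")

    def make(self, optimize=True):
        # Default make(): a thin wrapper around the vectorised calculation.
        self.result = self.calculate_attributes_vector(self.parent_data)
        return self.result


class ColourCalculator(BaseCalculator):
    """Hypothetical definition that derives one new attribute."""

    def calculate_attributes_vector(self, data):
        return {"G_MINUS_R": [g - r for g, r in zip(data["MAG_G"], data["MAG_R"])]}


calc = ColourCalculator({"MAG_G": [20.0, 19.0], "MAG_R": [19.5, 18.0]})
colours = calc.make()
```

A definition that needs full control would override `make()` itself instead of the vector function.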

make_n(optimize=True)

New make() function that works with old ACDs for which the original make() function does not accept the optimize keyword.

E.g. ACD 100031 for calculating comoving distances.

mandatory_dependencies = (('parent_collection', 1),)
name

Name of the SourceCollection [None]

object_id

The object identifier

The object identifier is an attribute shared by all persistent instances. It is the primary key by which object identity is established.

parent_collection

Parent SourceCollections [None].

process_parameters

Process parameters

process_status

A flag indicating the processing status [None]

quality_flags

Automatic/internal quality flag [None]

classmethod register_definition()

Register the AttributeCalculatorDefinition. This is used by AttributeCalculators on CVS.

TODO: Perhaps place this functionality in a metaclass? http://redmine.hpc.rug.nl/redmine/issues/142

set_definition(definition)

Sets the definition of a new AttributeCalculator.

set_onthefly_attributes(attributes)

set_onthefly_attributes() of the AttributeCalculator() is overloaded to set a suitable AttributeCalculatorDefinition. There might be better ways to set this though, see get_onthefly_dependencies_with_attributes().

set_process_parameter(pkey, value)

Sets the value of a process parameter based on its name.
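The name-based lookup behind get_process_parameter() and set_process_parameter() might look roughly like the sketch below. `Param` is a hypothetical stand-in for an AttributeCalculatorParameter, the functions take the parameter list explicitly rather than reading it from an instance, and raising `KeyError` for an unknown name is an assumption, not documented behaviour.

```python
from dataclasses import dataclass
from typing import Any


@dataclass
class Param:
    """Hypothetical stand-in for an AttributeCalculatorParameter."""
    name: str
    value: Any = None


def get_process_parameter(process_parameters, pkey):
    # Linear search: process parameters are stored as a list, not a dict.
    for parameter in process_parameters:
        if parameter.name == pkey:
            return parameter.value
    raise KeyError(pkey)  # assumed behaviour for unknown names


def set_process_parameter(process_parameters, pkey, value):
    for parameter in process_parameters:
        if parameter.name == pkey:
            parameter.value = value
            return
    raise KeyError(pkey)  # assumed behaviour for unknown names


params = [Param("z_max", 0.5), Param("cosmology", "flat")]
set_process_parameter(params, "z_max", 0.8)
```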

sourcelist_data

Optional SourceList containing the data described by this SourceCollection

sourcelist_sources

Optional SourceList containing the sources described by this SourceCollection

class astro.main.sourcecollection.AttributeCalculator.AttributeCalculatorDefinition(pathname='', **kw)

An AttributeCalculatorDefinition is a persistent object that contains the calculation that is performed by an AttributeCalculator SourceCollection.

• Code:

The code to perform the calculation is stored in the file attached to an AttributeCalculatorDefinition.

• Meta data:

An AttributeCalculatorDefinition contains meta data about what attributes are calculated and what input is required.

AC = None
ACDID

AttributeCalculator Type ID

attribute_names

Attributes that are calculated with an AC of this type.

calculates_on_the_fly

Determine whether the data can be calculated on the fly in SQL without having to call the make() method. This is the case if the get_query_self function is set by the definition.

TODO LT M: There should be a more generic way to determine whether data of any SC can be derived on the fly. This generic method should also better distinguish between Python and SQL.

calculator_parameters

Process parameters that can be set for an AC of this type.

commit()

Ensures that an ACDID is set when committing a new ACD.

classmethod create_from_file(filename, ACDID=None, version=None)

Creates a new AttributeCalculatorDefinition from a file.

classmethod create_from_python(pythonpath, ACDID=None, version=None)

Create an AttributeCalculatorDefinition from a Python file in source repository.

creation_date

Date this object was defined [None]

filename

The name of the associated file [None]

flags

Flags to describe the calculator.

classmethod get_acd(acdid, version=None)

Returns the ACD that belongs to the given acdid and version. If no version is given, the latest is returned.

classmethod get_acdids_by_attribute(attribute)

Returns a list of (acdid, version) tuples for the ACDs that can calculate the given attribute. This query is required because the DBProperties class does not allow .contains() for .attribute_names. See get_acds_by_attribute().

classmethod get_acds_by_attribute(attribute)

Returns a list of ACDs that can calculate the given attribute.

get_export()

Export function to export meta data about the ACD over SAMP. Experimental.

get_latest_version()

globalname

The name used to store and retrieve file to and from Storage [None]

import_code(move=True)

Retrieves the file with the calculation code from the fileserver.

The ACD uses the contents of this file to create a new Python class derived from the base AttributeCalculator class. This derived class provides the make()/get_query_self() functionality that is virtual in the base class.

The contents of this file are described in detail in the body of the function. Currently, the file can be one of:

1. A Python file with a calculate_attributes_vector() or calculate_attributes() function.
2. A Python file with a new class definition.
3. A .tar.gz file containing such a Python file and auxiliary files.

Note: The derived class is called ‘AttributeCalculator’ as well, to ensure that the same database class can be used for all different ACDs.
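The idea of turning a file with a calculate_attributes_vector() function into a derived class that is also named 'AttributeCalculator' can be sketched as follows. This is not the actual import_code() implementation, which is only described in the function body; the base class here is a minimal stand-in, `derive_calculator` and `ACD_SOURCE` are hypothetical names, and the function signature inside the file is assumed. Note that `exec` runs arbitrary code, so this pattern is only appropriate for trusted files.

```python
class AttributeCalculator:
    """Minimal stand-in for the base class; make() is virtual in spirit."""

    def calculate_attributes_vector(self, data):
        raise NotImplementedError

    def make(self, data):
        return self.calculate_attributes_vector(data)


# Assumed contents of a definition file (option 1 in the list above).
ACD_SOURCE = '''
def calculate_attributes_vector(data):
    return {"DOUBLE_X": [2 * x for x in data["X"]]}
'''


def derive_calculator(source):
    """Build a derived class from the code in `source` (hypothetical helper)."""
    namespace = {}
    exec(compile(source, "<acd>", "exec"), namespace)
    func = namespace["calculate_attributes_vector"]
    # The derived class deliberately reuses the name 'AttributeCalculator'.
    return type(
        "AttributeCalculator",
        (AttributeCalculator,),
        {"calculate_attributes_vector": staticmethod(func)},
    )


Derived = derive_calculator(ACD_SOURCE)
doubled = Derived().make({"X": [1, 2]})
```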

TODO [ATTRIBUTECALCULATOR]:

• This process is a bit complex, and it might be worthwhile to investigate better ways to achieve the same thing.
• Unit tests using this mechanism will break coverage.py.
• Path to cache retrieved ACD code?

import_code_generic()

Called from either import_code() or import_code_python().

import_code_python()

Imports the code from a python path.

self.filename is for example ‘PYTHON:astro.main.GaAPAttributeCalculator’

input_attribute_names

Attributes that have to be present in the parent of an AC of this type. [None]

is_valid

Manual/external flag to disqualify bad calculator definitions [None]

name

Name of this type, e.g. ‘kCorrect’

object_id

The object identifier

The object identifier is an attribute shared by all persistent instances. It is the primary key by which object identity is established.

version

Version string of the code

class astro.main.sourcecollection.AttributeCalculator.AttributeCalculatorParameter

AttributeCalculatorParameter objects are used to store the process_parameters of AttributeCalculators.

A list of this base class is used as the persistent property of the AttributeCalculator and AttributeCalculatorDefinition classes. Derived classes are used to have a property of a specific type.

copy()

Creates a copy of the ACP.

classmethod create(name, ptype, value, description)
description

Description of the process parameter.

get_value()
name

Name of this process parameter.

object_id

The object identifier

The object identifier is an attribute shared by all persistent instances. It is the primary key by which object identity is established.

ptype

Type/class of the parameter.

set_value(value)

Set the value of this parameter. This function can be overloaded by derived parameter objects.
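A minimal sketch of how a derived parameter class could overload set_value() is shown below. `Parameter` and `FloatParameter` are hypothetical stand-ins, and coercing the value with `float()` is an assumption about what a typed parameter might do, not the documented behaviour of AttributeCalculatorParameterFloat.

```python
class Parameter:
    """Hypothetical stand-in for AttributeCalculatorParameter."""

    def __init__(self, name, ptype, value=None, description=""):
        self.name = name
        self.ptype = ptype
        self.description = description
        self.value = None
        if value is not None:
            self.set_value(value)

    def set_value(self, value):
        # Base behaviour: store the value unchanged.
        self.value = value

    def get_value(self):
        return self.value


class FloatParameter(Parameter):
    """Hypothetical derived parameter that always stores a float."""

    def set_value(self, value):
        # Overloaded: coerce so the stored value matches the declared type.
        self.value = float(value)


z_max = FloatParameter("z_max", "float", "0.5", "Maximum redshift")
```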

class astro.main.sourcecollection.AttributeCalculator.AttributeCalculatorParameterFloat

An AttributeCalculatorParameter that represents a float.

description

Description of the process parameter.

name

Name of this process parameter.

object_id

The object identifier

The object identifier is an attribute shared by all persistent instances. It is the primary key by which object identity is established.

ptype

Type/class of the parameter.

value

Value of the parameter if it is a float

class astro.main.sourcecollection.AttributeCalculator.AttributeCalculatorParameterInteger

An AttributeCalculatorParameter that represents an integer.

description

Description of the process parameter.

name

Name of this process parameter.

object_id

The object identifier

The object identifier is an attribute shared by all persistent instances. It is the primary key by which object identity is established.

ptype

Type/class of the parameter.

value

Value of the parameter if it is an int

class astro.main.sourcecollection.AttributeCalculator.AttributeCalculatorParameterObject

An AttributeCalculatorParameter that represents a persistent object.

Since it is not possible to simply have an objid as value, it is stored as a string.

Specialized classes could be created in the future for often used objects, e.g. an AttributeCalculatorParameterSourceList.

Using a generic ‘Object’ parameter is not always a good idea because there is no data lineage in the strict sense of the word.

description

Description of the process parameter.

get_bases(cc)

Get the base classes of a class in order to check whether it matches the required ptype.
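One way to check a value against a required ptype through its base classes is sketched below. It assumes that ptype is stored as a class name string and walks Python's method resolution order; the actual get_bases() may collect the bases differently. The two classes are empty stand-ins defined only for the example.

```python
def get_bases(cc):
    """Return the names of cc and all of its base classes, in MRO order."""
    return [base.__name__ for base in cc.__mro__]


def matches_ptype(value, ptype):
    # Assumption: ptype is the name of a class the value must derive from.
    return ptype in get_bases(type(value))


class SourceCollection:
    """Empty stand-in for illustration."""


class External(SourceCollection):
    """Empty stand-in for illustration."""
```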

get_value()

Try to fetch the object corresponding to the object_id if it is not set yet.

name

Name of this process parameter.

object_id

The object identifier

The object identifier is an attribute shared by all persistent instances. It is the primary key by which object identity is established.

ptype

Type/class of the parameter.

set_value(value)

Sets both the object as well as the object_id.

value

Value of the parameter if it is an object

class astro.main.sourcecollection.AttributeCalculator.AttributeCalculatorParameterSourceCollection

An AttributeCalculatorParameter that is a SourceCollection.

description

Description of the process parameter.

name

Name of this process parameter.

object_id

The object identifier

The object identifier is an attribute shared by all persistent instances. It is the primary key by which object identity is established.

ptype

Type/class of the parameter.

value

Value of the parameter if it is a SourceCollection [None]

class astro.main.sourcecollection.AttributeCalculator.AttributeCalculatorParameterString

An AttributeCalculatorParameter that represents a string.

description

Description of the process parameter.

name

Name of this process parameter.

object_id

The object identifier

The object identifier is an attribute shared by all persistent instances. It is the primary key by which object identity is established.

ptype

Type/class of the parameter.

value

Value of the parameter if it is a string

## astro.main.sourcecollection.ConcatenateAttributes module

Concatenates all the attributes in the parent SourceCollections. That is, this SourceCollection describes the union of the attributes of the parents.

class astro.main.sourcecollection.ConcatenateAttributes.ConcatenateAttributes(sourcelist_data=None)

Concatenates all the attributes in the parent SourceCollections. That is, this SourceCollection describes the union of the attributes of the parents.

The parents should have non-overlapping sets of attributes. This can be ensured by using SelectAttributes or RenameAttributes SourceCollections.

The set of sources of a ConcatenateAttributes is the cross section (intersection) of the sets of sources of the parents.

SCID

SourceCollection identifier [None]

all_data_stored

Flag to indicate whether all data has been stored in sourcelist_data, 0 means no (or unknown), 1 means yes

attribute_columns

Column names of the attributes in the sourcelist_data

attribute_names

Names of the attributes corresponding to the attribute_columns

creation_date

Date this object was created [None]

static find_sc(parent_collections, name=None, is_valid=None)

Find this SourceCollection.

TODO: Integrate with OnTheFly exist()

TODO: Find transient SCs.

get_attribute_origin(attribute, cache=False)

Returns a list of SourceCollections higher in the tree that represent the same attribute. That is, every SLID-SID-attribute combination of these SourceCollections will result in the same numerical value. The first in the list is earliest in the tree. The last will be this SourceCollection.

get_attributes_full(cache=False)

The set of attributes of a ConcatenateAttributes is the union of the sets of the parents.
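The combined effect (union of attributes, cross section of sources) can be illustrated with plain dictionaries. In this sketch each parent is assumed to be a dict keyed by (SLID, SID) whose values map attribute names to values; `concatenate_attributes` is a hypothetical helper, and rejecting overlapping attributes with `ValueError` is an assumption based on the requirement that parents have non-overlapping attribute sets.

```python
def concatenate_attributes(parents):
    """Join parents on (SLID, SID); parents: list of {(SLID, SID): {name: value}}."""
    # Only sources present in every parent survive (cross section).
    common = set(parents[0])
    for parent in parents[1:]:
        common &= set(parent)

    result = {}
    for key in sorted(common):
        row = {}
        for parent in parents:
            overlap = row.keys() & parent[key].keys()
            if overlap:
                raise ValueError("overlapping attributes: %s" % sorted(overlap))
            row.update(parent[key])  # union of the attributes
        result[key] = row
    return result


photometry = {(1, 1): {"MAG": 17.2}, (1, 2): {"MAG": 18.1}}
astrometry = {(1, 2): {"RA": 10.0, "DEC": -1.0}, (1, 3): {"RA": 11.0, "DEC": -2.0}}
combined = concatenate_attributes([photometry, astrometry])
```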

get_parents_tree()
get_parents_with_clauses()
get_query_self(nowhere=False)

The SQL query of a ConcatenateAttributes is a join on SLID-SID of the queries of all the parents.

TODO MT M: [OPERATOR] Find a better way to choose which parent to use to join the other parents on. If possible, one that describes exactly the right set of sources.

TODO: Ensure that this works the same way as load_data_python() w.r.t. the existence of duplicate sources in the parents.

static get_sc(parent_collections, name=None, is_valid=None, force_name=False, force_creation=False)

Find this SourceCollection and create it if necessary.

TODO: Integrate with OnTheFly get_onthefly()

TODO: Integrate with init?

get_source_relations(relations=None, quick=False)

Returns a SetRelations object to store relations between SourceCollections.

A ConcatenateAttributes is a cross section of its parents:

1. The first parent is added and set as the previous cross section.
2. The next parent is added.
3. A new set is added as the cross section between the previous cross section and the last added parent.
4. Repeat steps 2 and 3 until all parents are added.
5. This CA is the same as the last created cross section.
6. Remove all temporary cross sections.
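The procedure for building the cross section can be sketched with Python sets. This only mirrors the order of the steps; the real method records the relations in a SetRelations object, whose API is not shown in this document. `cross_section_of_parents` is a hypothetical name and sources are assumed to be (SLID, SID) tuples.

```python
def cross_section_of_parents(parent_sources):
    """Fold the parents' source sets into one cross section."""
    parents = [frozenset(p) for p in parent_sources]
    if not parents:
        raise ValueError("at least one parent is required")

    temporaries = []
    previous = parents[0]             # first parent is the initial cross section
    for parent in parents[1:]:        # add the next parent
        previous = previous & parent  # new cross section with the last added parent
        temporaries.append(previous)

    result = previous                 # the CA equals the last cross section
    temporaries.clear()               # drop the intermediate cross sections
    return result


a = {(1, 1), (1, 2), (1, 3)}
b = {(1, 2), (1, 3)}
c = {(1, 3), (2, 1)}
shared = cross_section_of_parents([a, b, c])
```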
is_valid

Manual/external flag to disqualify bad data (SuperFlag) [None]

load_data_python()
mandatory_dependencies = (('parent_collections', 1),)
modify_integrate_parents(relations=None)

If one of the parents of this ConcatenateAttributes is a ConcatenateAttributes itself, it can be integrated.

If the parents are SelectAttributes that have the same parents, they can be combined.

modify_move_up(relations=None)

A ConcatenateAttributes can move up through a SelectSources.

modify_remove_dependencies(relations=None)

If a ConcatenateAttributes has a parent which does not add any attributes to it, and the sources of this CA are completely described by the other input, then the non-contributing parent can be removed.

modify_remove_parent(relations=None)

If a ConcatenateAttributes has a parent which does not add any attributes to it, and the sources of this CA are completely described by the other input, then the non-contributing parent can be removed.

modify_remove_self(relations=None)

If this CA has only one parent, this CA can be removed.

name

Name of the SourceCollection [None]

object_id

The object identifier

The object identifier is an attribute shared by all persistent instances. It is the primary key by which object identity is established.

parent_collections

Input list of parent SourceCollections [None]

process_status

A flag indicating the processing status [None]

quality_flags

Automatic/internal quality flag [None]

sourcelist_data

Optional SourceList containing the data described by this SourceCollection

sourcelist_sources

Optional SourceList containing the sources described by this SourceCollection

## astro.main.sourcecollection.ConcatenateSources module

class astro.main.sourcecollection.ConcatenateSources.ConcatenateSources(sourcelist_data=None)

Concatenates the sources of the parent SourceCollections. The attributes of the parent_collections should be identical.
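A small sketch of this operation on plain Python data is given below. Each parent is assumed to be a pair of attribute names and rows keyed by (SLID, SID); `concatenate_sources` is a hypothetical helper, and letting a later parent overwrite a duplicate source is an assumption, since the handling of duplicates is not specified here.

```python
def concatenate_sources(parents):
    """parents: list of (attribute_names, {(SLID, SID): {name: value}}) pairs."""
    reference = set(parents[0][0])
    result = {}
    for names, rows in parents:
        # The parents must describe identical sets of attributes.
        if set(names) != reference:
            raise ValueError("parents have different attributes")
        result.update(rows)  # assumption: later parents win on duplicates
    return result


field_a = (["RA", "DEC"], {(1, 1): {"RA": 1.0, "DEC": 2.0}})
field_b = (["DEC", "RA"], {(2, 1): {"RA": 3.0, "DEC": 4.0}})
merged = concatenate_sources([field_a, field_b])
```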

This is the sources counterpart of the ConcatenateAttributes.

SCID

SourceCollection identifier [None]

all_data_stored

Flag to indicate whether all data has been stored in sourcelist_data, 0 means no (or unknown), 1 means yes

attribute_columns

Column names of the attributes in the sourcelist_data

attribute_names

Names of the attributes corresponding to the attribute_columns

check(parse=False)
creation_date

Date this object was created [None]

static find_sc(parent_collections, name=None, is_valid=None)

Find this SourceCollection.

TODO: Integrate with OnTheFly exist()

TODO: Find transient SCs.

get_attributes_full(cache=False)
get_parents_tree()
get_parents_tree_keys()
get_parents_with_clauses()
get_query_self()
get_query_self_O()
static get_sc(parent_collections, name=None, is_valid=None, force_name=False)

Find this SourceCollection and create it if necessary.

TODO: Integrate with OnTheFly get_onthefly()

TODO: Integrate with init?

get_source_relations(relations=None, quick=False)

Returns a SetRelations object to hold the relations between SourceCollections.

A ConcatenateSources is a union of its parents:

1. The first parent is added and set as the previous union.
2. The next parent is added.
3. A new set is added as the union between the previous union and the last added parent.
4. Repeat steps 2 and 3 until all parents are added.
5. This CS is the same as the last created union.
6. Remove all temporary unions.

For now this function is disabled. A CS is mainly used to concatenate a large number of smaller SCs. This will make the matrices of the SetRelations too large. This might be alleviated by using sparse matrices, e.g. in scipy.

TODO MT H: [RELATIONS] Do something sensible, e.g. do create the relations when there are only a handful of parents.

is_valid

Manual/external flag to disqualify bad data (SuperFlag) [None]

load_data_python()

Loads the data on the python layer.

mandatory_dependencies = (('parent_collections', 1),)
modify_integrate_parents(relations=None)

If one of the parents of this ConcatenateSources is a ConcatenateSources itself, it can be integrated.

TODO MT L: [MODIFICATIONS] This function should only integrate one parent, not as many as possible.

name

Name of the SourceCollection [None]

object_id

The object identifier

The object identifier is an attribute shared by all persistent instances. It is the primary key by which object identity is established.

parent_collections

Input list of parent SourceCollections [None]

process_status

A flag indicating the processing status [None]

quality_flags

Automatic/internal quality flag [None]

sourcelist_data

Optional SourceList containing the data described by this SourceCollection

sourcelist_sources

Optional SourceList containing the sources described by this SourceCollection

## astro.main.sourcecollection.External module

External SourceCollection.

class astro.main.sourcecollection.External.External(sourcelist_data=None)

This SourceCollection has no parents and has to have all its data stored in its .sourcelist_data. An External SourceCollection should be used for external data (e.g. not derived from a Frame).

The External is also used when optimizing SourceCollectionTrees for the retrieval of data.

When creating an External SourceCollection, ensure that the data is stored in the database with store_data() before committing the External. Otherwise the SourceCollection can probably not be used.

TODO: Ensure that the data is properly stored before the External is being committed. Perhaps create a ‘set_data’ function?

SCID

SourceCollection identifier [None]

add_sourcelist_data_to_relations = False
all_data_stored

Flag to indicate whether all data has been stored in sourcelist_data, 0 means no (or unknown), 1 means yes

attribute_columns

Column names of the attributes in the sourcelist_data

attribute_names

Names of the attributes corresponding to the attribute_columns

copy(relations=None, copy_parents=True, copy_parameters=True, copy_cache=False, copy_supercache=False)
creation_date

Date this object was created [None]

get_catalog()

Gets the Catalog object that this External corresponds to, if any.

get_filter()

get_filter is overloaded in External because some Externals can be made from Catalog objects. In that case, try to get the filter from the Catalog.

get_object()

get_object is overloaded in External because some Externals can be made from Catalog objects. In that case, try to get the OBJECT from the Catalog.

get_slids()

The SLIDs for an External are difficult to get when the .sourcelist_data contains a SLIDorg column. This means that all the SLIDorg values should be checked.

This should not happen for persistent SourceCollections, because such an External usually does not have these columns. However, it does with transient objects, e.g. when modifying the tree to load data.

Nevertheless, this is only a ‘convenience’ function, and should not be relied on. Use load_sources() to get the exact set of SLIDs.

get_source_progenitors(identifier)

Returns a list of objects and identifiers that represent the progenitors of the source.

Identifier should be a (SLID, SID) combination.

get_source_progenitors is overloaded in External because some Externals can be made from Catalog objects. In that case, it would be best to use the Catalog. However, implementing get_source_progenitors() is difficult for Catalogs. Therefore, the frame is returned directly instead.

There might be edge-cases where this function does not yet work.

get_source_relations(relations=None, quick=False)

Returns a SetRelations object to hold the relations between SourceCollections.

The sources of an External are described by its .sourcelist_sources.

Note that this could be used for all SourceLists that have their data stored, which will happen automatically when they are converted to an External.

is_valid

Manual/external flag to disqualify bad data (SuperFlag) [None]

make_from_fits(filename, row_identifiers=None)

Creates an External from a FITS file.

row_identifiers is a tuple with the columns that together provide a unique identifier of each row. Support for identifiers other than (SLID, SID) is experimental and not necessarily implemented in all SourceCollection classes.

make_from_tableconverter(tableconverter)

Creates an External from a TableConverter.

Experimental.

mandatory_dependencies = ()
modify_remove_dependencies(relations=None)

This function is empty to overload the base function because that would try to create a new External.

name

Name of the SourceCollection [None]

object_id

The object identifier

The object identifier is an attribute shared by all persistent instances. It is the primary key by which object identity is established.

origin

Origin of the SourceCollection [None]

process_status

A flag indicating the processing status [None]

quality_flags

Automatic/internal quality flag [None]

set_sourcelist(sl)

Sets .sourcelist_data, .sourcelist_sources, and .all_data_stored.

sourcelist_data

Optional SourceList containing the data described by this SourceCollection

sourcelist_sources

Optional SourceList containing the sources described by this SourceCollection

store_data(relations=None, optimize=False, tablename=None)

Overloaded to prevent optimization when storing the data of an External SourceCollection. Ultimately this should not be necessary, since the optimization should either do something sensible, or nothing at all. Nonetheless, at the moment it is easier to simply turn optimization off by default.

## astro.main.sourcecollection.FilterSources module

Filters the sources of the parent SourceCollection by applying a selection criterion that references the attributes of the parent.

class astro.main.sourcecollection.FilterSources.FilterSources(sourcelist_data=None)

Filters the sources of the parent SourceCollection by applying a selection criterion that references the attributes of the parent.

    # TODO MT L: [OPERATOR] [DATAMODEL] perhaps determine the query length
    # automatically, like:
    sqlql = "SELECT DATA_LENGTH FROM DBA_TAB_COLS WHERE (TABLE_NAME = '%s') AND (COLUMN_NAME = '%s')" % (
        FilterSources._dmlname(), FilterSources.query._dmlname())
    c = database.cursor()
    c.execute(sqlql)
    d = c.fetchall()
    c.close()
    FilterSources.max_query_length = d[0][0]

    # Or fix this:
    FilterSources.query._sqltype
    # 'VARCHAR2(297)'
    # Which is incorrect.

SCID

SourceCollection identifier [None]

all_data_stored

Flag to indicate whether all data has been stored in sourcelist_data, 0 means no (or unknown), 1 means yes

allow_moveup_through_ss = False
attribute_columns

Column names of the attributes in the sourcelist_data

attribute_names

Names of the attributes corresponding to the attribute_columns

copy(relations=None, copy_parents=True, copy_parameters=True, copy_cache=False, copy_supercache=False)
create_htm_query(htmc, level)

Convenience function to create a filter on HTM, selecting all sources that are within the htm trixel of the given level that contains htmc.

This function can be used to parallelize a tree of SourceCollections.

TODO LT M: [MODIFICATIONS] Implement the tree parallelization code.

classmethod create_htm_query_from_sl(sourcelist, level=9)

TODO: Move to SourceList?

TODO: Check Python 2/3 compatibility of longs.

creation_date

Date this object was created [None]

debug_attributes = None
get_attribute_origin(attribute, cache=False)

Returns a list of SourceCollections higher in the tree that represent the same attribute. That is, every SLID-SID-attribute combination of these SourceCollections will result in the same numerical value. The first in the list is earliest in the tree. The last will be this SourceCollection.

get_attributes_full(cache=False)
get_attributes_query(query=None, cache=False)

Returns the attributes that are used in the query.

self.query is used unless a query is given.
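Because attributes in a query are written in double quotes, extracting them can be approximated with a regular expression. The function below is a standalone sketch with the same name as the method, not its actual implementation; returning the names in order of first appearance without duplicates is an assumption.

```python
import re


def get_attributes_query(query):
    """Return the double-quoted attribute names used in an SQL WHERE clause."""
    attributes = []
    for name in re.findall(r'"([^"]+)"', query):
        if name not in attributes:  # keep the first occurrence only
            attributes.append(name)
    return attributes


used = get_attributes_query('"DEC" < 10 + 1 AND "RA" BETWEEN 93.0*2 AND 187')
```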

get_parents_tree()
get_parents_tree_keys()
get_parents_with_clauses()
get_query_self(nowhere=False)
get_source_relations(relations=None, quick=False)

Returns a SetRelations object to hold the relations between SourceCollections. A FilterSources will usually describe a subset of its parent, but can also describe the same set or an empty set.

get_tapurl_dict()

Warning: Experimental code for IVOA TAP interoperation.

Returns dictionary to build TAP query.

is_valid

Manual/external flag to disqualify bad data (SuperFlag) [None]

load_data_python()

Loads the data on the python layer. This function parses the SQL query and evaluates it in Python.

‘query’ should be an SQL WHERE clause using AND, OR, BETWEEN, +-*/ and (), with all attributes in double quotes. Examples:

    "DEC" < 10 + 1 AND "RA" BETWEEN 93.0*2 AND 187
    "MAG_ISO" < 17

Math functions etc. are not supported.
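To give a feel for evaluating such a WHERE clause in Python, the sketch below rewrites the clause into a Python expression and evaluates it per row. It is not the parser used by load_data_python(); `where_to_python` and `filter_rows` are hypothetical helpers, rows are assumed to be dicts, BETWEEN operands must not contain spaces, `<>` is not handled, and `eval` should only be used on trusted queries.

```python
import re

_BETWEEN = re.compile(r'("[^"]+")\s+BETWEEN\s+(\S+)\s+AND\s+(\S+)', re.IGNORECASE)
_ATTRIBUTE = re.compile(r'"([^"]+)"')


def where_to_python(query):
    """Translate a restricted SQL WHERE clause into a Python expression."""
    # "X" BETWEEN a AND b  ->  (a <= "X" <= b)
    expr = _BETWEEN.sub(r'(\2 <= \1 <= \3)', query)
    # A single '=' is SQL equality; leave <=, >=, != and == untouched.
    expr = re.sub(r'(?<![<>!=])=(?!=)', '==', expr)
    # "X"  ->  row['X']
    expr = _ATTRIBUTE.sub(lambda m: 'row[%r]' % m.group(1), expr)
    expr = re.sub(r'\bAND\b', 'and', expr, flags=re.IGNORECASE)
    expr = re.sub(r'\bOR\b', 'or', expr, flags=re.IGNORECASE)
    return expr.strip()


def filter_rows(rows, query):
    """Return the rows (dicts) for which the WHERE clause is true."""
    code = compile(where_to_python(query), '<query>', 'eval')
    # No builtins are exposed, so math functions are unavailable here too.
    return [row for row in rows if eval(code, {'__builtins__': {}}, {'row': row})]


rows = [
    {"DEC": 5.0, "RA": 186.5},
    {"DEC": 12.0, "RA": 186.5},
    {"DEC": 5.0, "RA": 190.0},
]
selected = filter_rows(rows, '"DEC" < 10 + 1 AND "RA" BETWEEN 93.0*2 AND 187')
```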

mandatory_dependencies = (('parent_collection', 1),)
max_query_length = 3972
modify_move_up(relations=None)

Returns a SourceCollection that can be substituted with the original FilterSources, which has the FilterSources operation higher in the dependency tree.

TODO MT H: [MODIFICATIONS] Implement this function properly.

modify_remove_dependencies(relations=None, indexed=None, force=False)

Returns a SelectSources that selects the same sources as this FilterSources. The .selected_sources of the SS will be:

• an External, if this FS has a .sourcelist_sources, or
• a SelectAttributes that selects no attributes from this FS.

A SelectSources is only created if the selection is performed on an attribute that is indexed in the database or when force=True.

TODO MT M: [OPERATOR] This function should have no restrictions on the creation of the SelectSources. The caller (a SourceCollectionTree) should decide whether the original FS should be replaced with the generated SS.

modify_remove_self(relations=None)

Returns the parent if this FilterSources selects all the sources of the parent.

name

Name of the SourceCollection [None]

object_id

The object identifier

The object identifier is an attribute shared by all persistent instances. It is the primary key by which object identity is established.

parent_collection

Parent SourceCollections [None]

parents_tree_keys = ['parent_collection']
process_status

A flag indicating the processing status [None]

quality_flags

Automatic/internal quality flag [None]

query

Selection query, as SQL WHERE clause.

set_query(query)

Sets the query to be used and performs sanity checks on the query.

TODO MT M: [OPERATOR] Provide better feedback on the sanity checks.

• Ensure all attributes are in double quotes.
• Ensure all attributes are described in the parent SourceCollection.
• If this is not the case, the derive() function should have been
• Ensure that the query is parsable/executable.
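Some of these checks could be approximated as in the sketch below. It is not the set_query() implementation; `check_query` is a hypothetical helper that returns a list of problem descriptions, the keyword list covers only AND, OR and BETWEEN, and the length limit reuses the max_query_length value documented for this class.

```python
import re

MAX_QUERY_LENGTH = 3972  # FilterSources.max_query_length as documented here
KEYWORDS = {"AND", "OR", "BETWEEN"}


def check_query(query, parent_attributes):
    """Return a list of human-readable problems found in the query."""
    problems = []
    if len(query) > MAX_QUERY_LENGTH:
        problems.append("query is too long")

    # Bare words outside double quotes that are not keywords are treated
    # as attributes that should have been quoted.
    outside_quotes = re.sub(r'"[^"]*"', " ", query)
    for word in re.findall(r"\b[A-Za-z_]\w*\b", outside_quotes):
        if word.upper() not in KEYWORDS:
            problems.append("attribute %s is not in double quotes" % word)

    # Every quoted attribute must be described by the parent.
    for name in re.findall(r'"([^"]+)"', query):
        if name not in parent_attributes:
            problems.append("attribute %s is not in the parent" % name)
    return problems


ok = check_query('"DEC" < 10 AND "RA" BETWEEN 1 AND 2', {"RA", "DEC"})
```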
sourcelist_data

Optional SourceList containing the data described by this SourceCollection

sourcelist_sources

Optional SourceList containing the sources described by this SourceCollection

exception astro.main.sourcecollection.FilterSources.FilterSourcesError(message)

Errors for FilterSources.

## astro.main.sourcecollection.OnTheFly module

This is the astro OnTheFly implementation for SourceCollections.

Not all Target Processing functionality of the common and astro OnTheFly classes has been implemented. OnTheFly-like functionality is implemented in the SourceCollectionTree class. The SourceCollectionTree should become integrated here.

class astro.main.sourcecollection.OnTheFly.OnTheFly

The SourceCollection OnTheFly class defines these methods:

• exist
• is_flagged
• onthefly_after_make
• onthefly_init_attributes
• onthefly_use_dependency
• onthefly_get_config

classmethod find_latest_sourcecollection_on_attribute(attribute_desired=None, sourcelist_universe=None)

Workaround for _exist() because this query is not possible:

    scs = (SourceCollection.sourcelist_sources == sourcelist_universe) & SourceCollection.attribute_names.contains(attribute_desired)
    sc = scs.max('creation_date')
    return sc
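The selection that the query above is meant to express can be written in plain Python over in-memory records. This is only an illustration of the logic, not the database workaround itself: `find_latest_on_attribute` is a hypothetical helper, and the records are assumed to be dicts with the keys shown.

```python
from datetime import date


def find_latest_on_attribute(collections, attribute_desired, sourcelist_universe):
    """Return the newest record that has the attribute for the given universe."""
    candidates = [
        sc for sc in collections
        if sc["sourcelist_sources"] == sourcelist_universe
        and attribute_desired in sc["attribute_names"]
    ]
    # max() with default=None returns None when nothing matches.
    return max(candidates, key=lambda sc: sc["creation_date"], default=None)


collections = [
    {"name": "old", "sourcelist_sources": 7, "attribute_names": ["RA", "DEC"],
     "creation_date": date(2012, 1, 1)},
    {"name": "new", "sourcelist_sources": 7, "attribute_names": ["RA", "MAG"],
     "creation_date": date(2013, 6, 1)},
    {"name": "other", "sourcelist_sources": 8, "attribute_names": ["RA"],
     "creation_date": date(2014, 1, 1)},
]
latest = find_latest_on_attribute(collections, "RA", 7)
```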

get_onthefly_dependencies(config=None)

Overloaded to prevent the SCID from being accessed.

TODO: Find a better way to achieve this. See
https://redmine.hpc.rug.nl/redmine/issues/776
is_flagged()

Check if the object is flagged. Returns True if any flag is set, False if no flag is set.

Copied from astro OnTheFly.

classmethod onthefly_get_config(obj=None)

Return the SourceCollection OnTheFlyConfig object.

set_dependency(dep_str, dep_new)

Experimental code to improve the OnTheFly handling of dependencies that are lists, that is, a link_list_property. This function replaces the setattr() in OnTheFly_make.set_onthefly_dependencies().

A specific dependency in a list can be specified by adding a pipe character (|) and the 0-based index of the dependency.

    # No regridded frames have been set yet, so setting the 0-th
    # one should add one to the list.
    regrid = RegriddedFrame()
    coadd.set_dependency('regridded_frames|0', regrid)
    assert coadd.regridded_frames[0] == regrid, "Bad regrid 1."
    assert len(coadd.regridded_frames) == 1, "Wrong number of regrids."

    # The existing dependency is replaced.
    regrid2 = RegriddedFrame()
    coadd.set_dependency('regridded_frames|0', regrid2)
    assert coadd.regridded_frames[0] == regrid2, "Bad regrid 2"
    assert len(coadd.regridded_frames) == 1, "Wrong number of regrids."

    # Normal dependencies still work as they should.
    reduced = ReducedScienceFrame()
    regrid.set_dependency('reduced', reduced)
    assert regrid.reduced == reduced, "Bad reduced."
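The pipe syntax described above can be mimicked with a small free function. This is a sketch under stated assumptions, not the astro code: it operates on any object with list attributes, and raising IndexError when the index skips past the end of the list is an assumption, since the documented examples only cover appending at the end and replacing an existing entry.

```python
def set_dependency(obj, dep_str, dep_new):
    """Set a dependency; 'name|i' addresses element i of a list dependency."""
    if "|" not in dep_str:
        # Plain dependency: behaves like setattr().
        setattr(obj, dep_str, dep_new)
        return
    name, index_str = dep_str.split("|", 1)
    index = int(index_str)
    dependencies = getattr(obj, name)
    if index < len(dependencies):
        dependencies[index] = dep_new   # replace the existing dependency
    elif index == len(dependencies):
        dependencies.append(dep_new)    # extend the list by one element
    else:
        # Assumption: gaps in the list are not allowed.
        raise IndexError("index %d leaves a gap in %s" % (index, name))
```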

set_onthefly_attributes(attributes)

Sets the onthefly attributes. Experimental function.

The role of set_onthefly_attributes can be seen as follows: exist(attributes) searches instances based on a set of attributes. set_onthefly_attributes(attributes) sets these attributes that exist() can search on.

Often this is currently done in make() through self.copy_attributes(), e.g. in ReducedScienceFrame. However, these attributes can/should already be set before make() for several reasons.

1) Some ProcessTargets like SourceCollections do not have to be ‘made’ in order to use and even commit them. For these it is essential to set these attributes.

2) The attributes might be necessary in order to run make. E.g. some SourceCollection classes are defined by an attribute called ‘query’ which contains a criterion for sources that need to be met to be in the collection. This perhaps blurs the distinction between ‘attributes’ and ‘process_parameters’ though.

3) It would allow objects to be ‘found’ before they are made. E.g. when testrun=True in OnTheFly. At the moment special cache constructions are necessary to achieve the same functionality, which at some point would become superfluous.

class astro.main.sourcecollection.OnTheFly.OnTheFlyConfig(derived=None, **kwargs)
classmethod after_uptodate_parameters(obj, parameters_differ)

Called after uptodate_parameters is done. Checks the SOURCE_CODE_VERSION between object and class.

Copied from astro OnTheFly.

class astro.main.sourcecollection.OnTheFly.OnTheFly_make(cls, attributes, object_id, config, parent=None, parent_attr=None)

OnTheFly_make is overloaded to provide experimental set_onthefly_dependencies() functionality.

onthefly_make(target_class)

Set the dependencies of the object, set the parameters and call the make method of the requested class

set_onthefly_dependencies(new_object)

Determine every dependency and set it on new_object. For internal use only.

astro.main.sourcecollection.OnTheFly.no_check_transient(func)

Decorator to skip check() on transient SourceCollections.

astro.main.sourcecollection.OnTheFly.no_getattribute_overloading(func)

Decorator to turn the overloading of SourceCollection.__getattribute__ off for the duration of the function.
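The general shape of such a decorator is sketched below. How the real decorator switches the overloading off is not documented here, so the `_overloading` flag and the `Demo` class are purely hypothetical; the sketch only shows the pattern of disabling a behaviour for the duration of one call and restoring it afterwards.

```python
import functools


def no_overloading(func):
    """Run `func` with the instance's (hypothetical) overloading flag off."""
    @functools.wraps(func)
    def wrapper(self, *args, **kwargs):
        previous = self._overloading
        self._overloading = False
        try:
            return func(self, *args, **kwargs)
        finally:
            # Restore the flag even if func raises.
            self._overloading = previous
    return wrapper


class Demo(object):
    """Toy class; `_overloading` is not an astro attribute."""

    def __init__(self):
        self._overloading = True

    @no_overloading
    def raw_state(self):
        return self._overloading
```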

## astro.main.sourcecollection.Pass module¶

The Pass SourceCollection.

class astro.main.sourcecollection.Pass.Pass(sourcelist_data=None)

This SourceCollection does nothing: it selects all sources and attributes from its single parent. A Pass should only be used as a transient object.

SCID

SourceCollection identifier [None]

all_data_stored

Flag to indicate whether all data has been stored in sourcelist_data, 0 means no (or unknown), 1 means yes

attribute_columns

Column names of the attributes in the sourcelist_data

attribute_names

Names of the attributes corresponding to the attribute_columns

creation_date

Date this object was created [None]

get_attribute_origin(attribute, cache=False)

Returns a list of SourceCollections higher in the tree that represent the same attribute. That is, every SLID-SID-attribute combination of these SourceCollections will result in the same numerical value. The first in the list is earliest in the tree. The last will be this SourceCollection.
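The ordering of the returned list can be illustrated with a toy tree. This is a minimal sketch, assuming a chain of single-parent nodes that pass attributes through unchanged (as a Pass does); `Node` and `attribute_origin` are hypothetical names, not astro classes.

```python
class Node(object):
    """Toy stand-in for a SourceCollection with at most one parent."""

    def __init__(self, name, attribute_names, parent=None):
        self.name = name
        self.attribute_names = attribute_names
        self.parent = parent


def attribute_origin(node, attribute):
    """Return the nodes that carry `attribute`, earliest in the tree first."""
    chain = []
    current = node
    while current is not None and attribute in current.attribute_names:
        chain.append(current)
        current = current.parent
    chain.reverse()  # earliest first, `node` itself last
    return chain
```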

get_attributes_full(cache=False)
get_parents_tree()
get_parents_with_clauses()
get_query_count_sources()

Returns a query to count the sources.

This function assumes that all the data is available, and should therefore only be used by the SourceCollection classes.

TODO MT L: [NOCAT] Perhaps add an ‘optimize’ argument to make it
possible to let users call this function.
get_query_self()
get_source_relations(relations=None, quick=False)

Returns a SetRelations object to hold relations between SourceCollections. A Pass describes the exact same sources as its parent.

is_valid

Manual/external flag to disqualify bad data (SuperFlag) [None]

load_data_python()

Loads the data on the python layer.

mandatory_dependencies = (('parent_collection', 1),)
name

Name of the SourceCollection [None]

object_id

The object identifier

The object identifier is an attribute shared by all persistent instances. It is the prime key, by which object identity is established

parent_collection

Parent SourceCollections [None]

process_status

A flag indicating the processing status [None]

quality_flags

Automatic/internal quality flag [None]

sourcelist_data

Optional SourceList containing the data described by this SourceCollection

sourcelist_sources

Optional SourceList containing the sources described by this SourceCollection

## astro.main.sourcecollection.ProcessTarget module¶

Special SourceCollection version of the ProcessTarget class.

The astro OnTheFly and ProcessTarget are specific for the Frame classes. Separate classes are therefore created to get the same functionality for the SourceCollection classes.

class astro.main.sourcecollection.ProcessTarget.ProcessTarget

The ProcessTarget mixin encapsulates the notion of a make-able object.

It is somewhat unclear what a make-able object is with respect to SourceCollections. Nevertheless, the make() and associated functions can be defined for SourceCollections, so deriving from this class is useful.

See the astro ProcessTarget class for more details. Much of the code in this class is copied from there. This is necessary because the OnTheFly class has to be overloaded for SourceCollections, and therefore the ProcessTarget class as well.

STATUS_COMPARE = 2
STATUS_INSPECT = 3
STATUS_MAKE = 0
STATUS_VERIFY = 1
creation_date = <common.database.DBMeta.persistent object>
derive_timestamp()

Set the creation_date attribute

classmethod get_qcflags()

Return a list of attribute names of QCflag() objects.

classmethod get_qcflags_dict()

Return a dictionary of flags settable for this object. The key is the name, the value is a tuple of (index, docstring).

get_qcflags_set()

Return a list of names of flags that have been set.

get_qcflags_set_dict()

Return a dictionary of flags that have been set. The key is the name, the value is a tuple of (index, docstring).
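The (index, docstring) layout of the flag dictionaries can be illustrated as follows. This is a sketch under a loud assumption: it treats the index as a bit position in an integer quality_flags value, which is not confirmed by the text above. The flag names and the helper functions are hypothetical.

```python
def flags_set(quality_flags, flags_dict):
    """Return the sorted names of the flags whose bit is set.

    Assumption: `index` in (index, docstring) is a bit position.
    """
    return sorted(
        name for name, (index, _doc) in flags_dict.items()
        if quality_flags & (1 << index)
    )


def is_ok(quality_flags):
    """True if no quality control flag has been set."""
    return quality_flags == 0


# Hypothetical flag definitions, for illustration only.
EXAMPLE_FLAGS = {
    "TOO_FEW_SOURCES": (0, "Fewer sources than expected."),
    "BAD_QUERY": (2, "The query could not be parsed."),
}
```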

ignore_quality_flags = ''
is_compared()

Return true if the object has been compared

is_inspected()

Return true if the object has been inspected

is_ok()

Return true if no quality control flags have been set.

is_valid = <common.database.DBMeta.persistent object>
is_verified()

Return true if the object has been verified

process_status = <common.database.DBMeta.persistent object>
quality_flags = <common.database.DBMeta.persistent object>
classmethod select(**searchterms)

Class method to select objects from the database.

• check_quality - (Default 1) Exclude invalidated data (quality_flags!=0)
• check_validity - (Default 1) Exclude invalidated data (is_valid=0)

set_compared()

Set the process status to indicate that the object has been compared.

set_inspected()

Set the process status to indicate that the object has been inspected.

set_made()

Set the process status to indicate that the object has been made.

set_verified()

Set the process status to indicate that the object has been verified.

## astro.main.sourcecollection.RelabelSources module¶

class astro.main.sourcecollection.RelabelSources.RelabelSources(sourcelist_data=None)

Relabels the sources of the parent SourceCollection. This is the sources counterpart of the RenameAttributes.

The associatelist must only contain two sourcelists and must contain every source a maximum of one time.

TODO MT H: [OPERATOR] [DATAMODEL] In retrospect it would have been very useful to store which other SourceCollection the new SLID/SID combinations come from. From a conceptual viewpoint this is not required because it is implicit in the Association. Nonetheless, as it is, it is difficult to determine which set of sources this SourceCollection contains. Therefore it is proposed to store this other SourceCollection as the .input_collection persistent property. Furthermore this would make it possible to use AssociateLists with more than two input_sourcelist. It should refer to the SourceCollection that is returned by get_other_sourcecollection().

TODO LT H: [OPERATOR] The AssociateList and SourceCollections should be integrated better in the long run. E.g. by allowing SourceCollections to be associated. However, this is not a trivial process.

SCID

SourceCollection identifier [None]

all_data_stored

Flag to indicate whether all data has been stored in sourcelist_data, 0 means no (or unknown), 1 means yes

associatelist

The AssociateList containing the relabeling information

attribute_columns

Column names of the attributes in the sourcelist_data

attribute_names

Names of the attributes corresponding to the attribute_columns

copy(relations=None, copy_parents=True, copy_parameters=True, copy_cache=False, copy_supercache=False)
creation_date

Date this object was created [None]

static find_sc(parent_collection, associatelist=None, name=None, is_valid=None)

Find this SourceCollection.

TODO: Integrate with OnTheFly exist()

TODO: Find transient SCs.

get_attributes_full(cache=False)
get_onthefly_dependencies(config=None)

get_onthefly_dependencies() is overloaded because ‘scother’ should also be returned even though it is not a ‘proper’ dependency (yet).

get_other_sourcecollection()

Returns the SourceCollection that contains the sources to which the sources are mapped.

get_parents_tree()
get_parents_tree_keys()
get_parents_with_clauses()
get_query_self()
static get_sc(parent_collection, associatelist=None, name=None, is_valid=None, force_name=False)

Find this SourceCollection and create it if necessary.

TODO: Integrate with OnTheFly get_onthefly()

TODO: Integrate with init?

get_slids()
get_source_relations(relations=None, quick=False)

Returns a SetRelations object to hold the relations between SourceCollections. A RelabelSources describes a subset of the other SourceCollection that was used in the association.

is_valid

Manual/external flag to disqualify bad data (SuperFlag) [None]

load_data_associatelist()

Retrieves the data of the AssociateList.

load_data_cache()

Also load the data from the AssociateList if it is cached. From a ProcessTarget perspective this should be done by the AssociateList itself.

load_data_python()

Loads the data on the python layer.

TODO LT M: [OPERATOR] [DATAFORMAKE]
• How should the SCT know that it has to load the AL data? Currently it does not do this. The AL could be listed in the get_onthefly_dependencies().
• Support for different row_identifiers.
mandatory_dependencies = (('parent_collection', 1), ('associatelist', 1))
modify_external_subset_parent(relations=None)

Substitutes the parent with smaller ones in case they are Externals.

This is still experimental code and only used by the SourceCollectionTree in case its .convertRS flag is set to True.

TODO: This only works if the parent indeed has all the sources
mentioned in the AssociateList. This does not have to be the case.
name

Name of the SourceCollection [None]

object_id

The object identifier

The object identifier is an attribute shared by all persistent instances. It is the prime key, by which object identity is established

parent_collection

Parent SourceCollections [None]

process_status

A flag indicating the processing status [None]

quality_flags

Automatic/internal quality flag [None]

sourcelist_data

Optional SourceList containing the data described by this SourceCollection

sourcelist_sources

Optional SourceList containing the sources described by this SourceCollection

store_data_cache(relations=None, optimize=True)

Caches the catalog data of this SourceCollection in a file. Also stores the cached AssociateList data if relevant.

[EXPERIMENTAL]: Local caching of catalog data is still experimental!

## astro.main.sourcecollection.RenameAttributes module¶

Renames the attributes of the parent SourceCollection. This is the attributes counterpart of a RelabelSources.

class astro.main.sourcecollection.RenameAttributes.RenameAttributes(sourcelist_data=None)

Renames the attributes of the parent SourceCollection. This is the attributes counterpart of a RelabelSources.

In the current implementation (and in AW in general), an attribute is identified by its name. Therefore the decision to rename attributes should not be taken lightly. It is often better to avoid this if possible.

Reasons to rename an attribute:

• The attribute is named incorrectly.
• Different attributes with the same name have to be combined with a CA:
  • To compare different methods that calculate the same attribute (e.g. density estimation).
  • To compare different process parameters in the calculation of the attribute (e.g. aperture size).
  • To compare attributes with generic names (e.g. MAG_ISO in different filters).
• Because an AttributeCalculator expects the attribute under a different name.

SCID

SourceCollection identifier [None]

all_data_stored

Flag to indicate whether all data has been stored in sourcelist_data, 0 means no (or unknown), 1 means yes

attribute_columns

Column names of the attributes in the sourcelist_data

attribute_names

Names of the attributes corresponding to the attribute_columns

attributes_new

New attribute names

attributes_old

Old attribute names
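The relation between the old and new attribute names can be sketched as a simple mapping. This assumes that attributes_old and attributes_new are paired by position, which the descriptions above suggest but do not state; `build_rename_dict` and `rename` are hypothetical helpers, not the create_rename_dict() method of this class.

```python
def build_rename_dict(attributes_old, attributes_new):
    """Pair old and new names by position (assumed pairing)."""
    if len(attributes_old) != len(attributes_new):
        raise ValueError("old and new attribute lists differ in length")
    return dict(zip(attributes_old, attributes_new))


def rename(attribute_names, rename_dict):
    """Apply the mapping; attributes without an entry keep their name."""
    return [rename_dict.get(name, name) for name in attribute_names]
```

For example, renaming MAG_ISO to MAG_ISO_R leaves every other attribute of the parent untouched.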

copy(relations=None, copy_parents=True, copy_parameters=True, copy_cache=False, copy_supercache=False)
create_rename_dict(cache=False)
creation_date

Date this object was created [None]

static find_sc(parent_collection, attributes_old=None, attributes_new=None, name=None, is_valid=None)

Find this SourceCollection.

TODO: Integrate with OnTheFly exist()

TODO: Find transient SCs.

get_attribute_origin(attribute, cache=False)

Returns a list of SourceCollections higher in the tree that represent the same attribute. That is, every SLID-SID-attribute combination of these SourceCollections will result in the same numerical value. The first in the list is earliest in the tree. The last will be this SourceCollection.

get_attributes_full(cache=False)
get_parents_tree()
get_parents_tree_keys()
get_parents_with_clauses()
get_query_self(nowhere=False)
static get_sc(parent_collection, attributes_old=None, attributes_new=None, name=None, is_valid=None, force_name=False)

Find this SourceCollection and create it if necessary.

TODO: Integrate with OnTheFly get_onthefly()

TODO: Integrate with init?

TODO: Properly handle SLID/SID

get_source_relations(relations=None, quick=False)

Returns a SetRelations object to hold relations between SourceCollections. A RenameAttributes represents the exact same sources as its parent.

is_valid

Manual/external flag to disqualify bad data (SuperFlag) [None]

load_data_python()

Loads the data on the python layer.

mandatory_dependencies = (('parent_collection', 1),)
modify_integrate_parents(relations=None)

If the parent of a RenameAttributes is another RenameAttributes then the parent can be integrated into this one.

TODO MT M: [MODIFICATIONS] Implement this function. Check with modify_remove_dependencies.

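Since this function is still marked as a TODO, the following is only one possible way to merge two consecutive renames, shown on plain dictionaries. `integrate_renames` is a hypothetical helper; it assumes that the parent's rename is applied first and that each rename is given as an old-name to new-name mapping.

```python
def integrate_renames(parent_map, child_map):
    """Compose two rename maps: parent applied first, then child."""
    produced_by_parent = set(parent_map.values())
    combined = {}
    for old, intermediate in parent_map.items():
        # Follow the name through the child's rename, if any.
        combined[old] = child_map.get(intermediate, intermediate)
    for old, new in child_map.items():
        # Names renamed by the child only (not produced by the parent).
        if old not in produced_by_parent and old not in combined:
            combined[old] = new
    # Drop entries that end up renaming an attribute to itself.
    return {old: new for old, new in combined.items() if old != new}
```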
modify_move_up(relations=None)

Returns a copy of the parent with a copy of this RA as parent if the parent is a SelectSources, in order to move the SS down in the tree.

modify_remove_dependencies(relations=None)

Returns an External SourceCollection if the parent of this RA is an External itself. The .attribute_names and .attribute_columns are set such that the attributes are correctly renamed.

modify_remove_self(relations=None)

Returns the parent if this RA does not rename any attributes.

name

Name of the SourceCollection [None]

object_id

The object identifier

The object identifier is an attribute shared by all persistent instances. It is the prime key, by which object identity is established

parent_collection

Parent SourceCollections [None]

process_status

A flag indicating the processing status [None]

quality_flags

Automatic/internal quality flag [None]

sourcelist_data

Optional SourceList containing the data described by this SourceCollection

sourcelist_sources

Optional SourceList containing the sources described by this SourceCollection

## astro.main.sourcecollection.SelectAttributes module¶

SelectAttributes selects a set of attributes from the parent SourceCollection.

class astro.main.sourcecollection.SelectAttributes.SelectAttributes(sourcelist_data=None)

Selects a set of attributes from the parent SourceCollection.

This is the attributes counterpart of the SelectSources.

SCID

SourceCollection identifier [None]

all_data_stored

Flag to indicate whether all data has been stored in sourcelist_data, 0 means no (or unknown), 1 means yes

attribute_columns

Column names of the attributes in the sourcelist_data

attribute_names

Names of the attributes corresponding to the attribute_columns

copy(relations=None, copy_parents=True, copy_parameters=True, copy_cache=False, copy_supercache=False)
creation_date

Date this object was created [None]

static find_sc(parent_collection, selected_attributes=None, name=None, is_valid=None)

Find this SourceCollection.

TODO: Integrate with OnTheFly exist()

TODO: Find transient SCs.

get_attribute_origin(attribute, cache=False)

Returns a list of SourceCollections higher in the tree that represent the same attribute. That is, every SLID-SID-attribute combination of these SourceCollections will result in the same numerical value. The first in the list is earliest in the tree. The last will be this SourceCollection.

get_attributes_full(cache=False)
get_attributes_parents(cache=False)
get_parents_tree()
get_parents_tree_keys()
get_parents_with_clauses()
get_query_self()
static get_sc(parent_collection, selected_attributes=None, name=None, is_valid=None, force_name=False)

Find this SourceCollection and create it if necessary.

TODO: Integrate with OnTheFly get_onthefly()

TODO: Integrate with init?

get_source_relations(relations=None, quick=False)

Returns a SetRelations object to hold relations between SourceCollections. A SelectAttributes represents the exact same sources as its parent.

is_valid

Manual/external flag to disqualify bad data (SuperFlag) [None]

load_data_python()

Loads the data on the python layer.

mandatory_dependencies = (('parent_collection', 1),)
modify_integrate_parents(relations=None)

Returns a new SourceCollection that describes the same data as this SC but depends on the grandparent(s) of this SC.

• A new SelectAttributes whose parent is the grandparent of this SA is returned if:
  • this SA selects only the SLID and SID of the parent and the parent is a SelectAttributes, Pass, AttributeCalculator or RenameAttrs.
  • the parent is a RenameAttributes but none of the renamed attributes are selected by this SA.
• Two sequential SelectAttributes are integrated into one.

If the parent has sourcelist_data, but not all_data_stored, then it is not possible to copy the sourcelist_data, because if data is going to be stored, it should be stored for all the attributes.

modify_move_up(relations=None)

Returns a new SourceCollection that can replace this SourceCollection, but with the SelectAttributes placed higher in the dependency tree.

A SelectAttributes will not be moved up through an SC with a .sourcelist_data but not .all_data_stored since it is not possible to store subsets of attributes, only of sources.

modify_remove_dependencies(relations=None)

Returns an External if the parent of this SA is an External as well.

A copy of the parent is created with the .attribute_names and .attribute_columns set to match the selection of this SA.

modify_remove_self(relations=None)

Returns the parent if this SelectAttributes selects all the attributes of the parent.

name

Name of the SourceCollection [None]

object_id

The object identifier

The object identifier is an attribute shared by all persistent instances. It is the prime key, by which object identity is established

parent_collection

Parent SourceCollections [None]

process_status

A flag indicating the processing status [None]

quality_flags

Automatic/internal quality flag [None]

selected_attributes

Names of the selected attributes

sourcelist_data

Optional SourceList containing the data described by this SourceCollection

sourcelist_sources

Optional SourceList containing the sources described by this SourceCollection

## astro.main.sourcecollection.SelectSources module¶

class astro.main.sourcecollection.SelectSources.SelectSources(sourcelist_data=None)

A SelectSources SourceSelection selects the sources of the parent that are also described by the .selected_sources SourceCollection.

This is the sources counterpart of the SelectAttributes.

SCID

SourceCollection identifier [None]

all_data_stored

Flag to indicate whether all data has been stored in sourcelist_data, 0 means no (or unknown), 1 means yes

attribute_columns

Column names of the attributes in the sourcelist_data

attribute_names

Names of the attributes corresponding to the attribute_columns

copy(relations=None, copy_parents=True, copy_parameters=True, copy_cache=False, copy_supercache=False)
creation_date

Date this object was created [None]

get_attribute_origin(attribute, cache=False)

Returns a list of SourceCollections higher in the tree that represent the same attribute. That is, every SLID-SID-attribute combination of these SourceCollections will result in the same numerical value. The first in the list is earliest in the tree. The last will be this SourceCollection.

get_attributes_full(cache=False, nowhere=True)
get_parents_tree()
get_parents_tree_keys()
get_parents_with_clauses()
get_query_self(hint=False, nowhere=True)
get_query_with_clauses(with_clauses=None, nowhere=True)

The SS overloads this function to optimize the generated queries.

The query_self() of a SS contains a join between the queries of the parent and the selected_sources. The database should loop (perform a full table scan) on the smallest of these two, which usually is the .selected_sources. The query optimizer of the database does not always do this properly. However, this can be enforced by not specifying a WHERE clause in the query of the parent if it is an External.

get_source_relations(relations=None, quick=False)

Returns a SetRelations object to hold the relations between SourceCollections.

A SelectSources returns the cross section between its parent and the selected_sources.
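In terms of source identifiers, the selection amounts to an intersection. The sketch below works on plain lists of (SLID, SID) tuples, which is an illustration only; `select_sources` is a hypothetical function and says nothing about how the SetRelations object represents this.

```python
def select_sources(parent_sources, selected_sources):
    """Keep the parent's (SLID, SID) pairs that also occur in the selection.

    The order of the parent is preserved.
    """
    selected = set(selected_sources)
    return [source for source in parent_sources if source in selected]
```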

is_valid

Manual/external flag to disqualify bad data (SuperFlag) [None]

load_data_python()

Loads the data on the python layer.

mandatory_dependencies = (('parent_collection', 1), ('selected_sources', 1))
modify_integrate_parents(relations=None)

Returns a new SelectSources with the original grandparent as parent if the original parent was either a FilterSources or another SelectSources but the sources of the original SelectSources are entirely determined by the selected_sources.

modify_move_up(relations=None)

A SelectSources can move up through most other SourceSelections.

This is especially useful for moving through a SC with sourcelist_data but not yet all_data_stored to limit the amount of data that has to be created.

modify_remove_dependencies(relations=None)

Returns a new SelectAttributes that selects no attributes from the .selected_sources if this SC describes no attributes and all the sources are described by the selected_sources.

This is useful when we have an AttributeCalculator with partially stored data and we want to know which sources it describes.

TODO MT M: [OPERATOR] Perhaps there is a better way to handle this.

modify_remove_self(relations=None)

Returns the parent if all sources are selected.

name

Name of the SourceCollection [None]

object_id

The object identifier

The object identifier is an attribute shared by all persistent instances. It is the prime key, by which object identity is established

parent_collection

Parent SourceCollections [None]

process_status

A flag indicating the processing status [None]

quality_flags

Automatic/internal quality flag [None]

selected_sources

SourceCollection that describes the sources that should be selected from the parent.

sourcelist_data

Optional SourceList containing the data described by this SourceCollection

sourcelist_sources

Optional SourceList containing the sources described by this SourceCollection

## astro.main.sourcecollection.SourceCollection module¶

SourceCollections are persistent objects to handle catalog data.

class astro.main.sourcecollection.SourceCollection.SourceCollection(sourcelist_data=None)

SourceCollections are persistent objects used to handle catalog data.

The sources of a SourceCollection are identified by their SLID-SID combination. Attributes of sources are identified by a string.

More than the data itself, they represent the operation that is used to create the catalogs. The instantiation of the data is delayed until required to fulfill a data pulling request. Data is only stored if it cannot be derived on the fly or for performance reasons.

For example, the data of a SourceCollection can be stored as the data of another SourceSelection. Therefore a SourceCollection can be seen as ‘describing’ data, since it does not actually have to ‘contain’ any data.

This SourceCollection class should be seen as a ‘base’ or ‘virtual’ class from which the other SourceCollection classes are derived. The other SourceCollection classes correspond to a specific operation that is performed on other persistent objects to instantiate the catalog data.

The data pulling algorithms and the lazy instantiation rely heavily on the paradigm of full data lineage. Each SourceCollection class performs a very specific operation with well-defined behaviour, in order to maximize the amount of information about the catalog data that can be inferred from the metadata only.
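The idea of describing an operation and instantiating the data only on request can be sketched with a toy class. `LazyCollection` is a hypothetical illustration, not part of astro: it assumes a single parent, represents the operation as a plain function on rows, and requires the root of the chain to have stored data.

```python
class LazyCollection(object):
    """Toy collection that describes data rather than containing it."""

    def __init__(self, operation=None, parent=None, stored=None):
        self.operation = operation  # function applied to the parent's rows
        self.parent = parent
        self.stored = stored        # data kept only when it was stored
        self.evaluations = 0

    def load(self):
        """Return the rows, deriving them from the parent when not stored."""
        if self.stored is not None:
            return self.stored
        self.evaluations += 1
        return self.operation(self.parent.load())


base = LazyCollection(stored=[1, 2, 3, 4])
even = LazyCollection(lambda rows: [r for r in rows if r % 2 == 0], parent=base)
```

Creating `even` does not touch any data; the filter only runs when `even.load()` is called.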

• Data caching properties

The properties below are merely for caching (/storing) of data, they do not have a conceptual meaning. They are only required when the data of this SourceCollection is stored.

By design, it should be possible to decide to store data at any moment. Therefore it should be possible to change these properties on persistent objects. This does not violate the persistent objects philosophy: changing these properties does not change the data represented by the SourceCollection.

At the moment it is not yet possible to change these attributes. Therefore it has to be decided whether the data will be stored before a SourceCollection is committed. Note that this does not mean that all the data should be stored, only the subsets that are actually requested.

• sourcelist_data: A SourceList that contains the catalog data of this SourceCollection. Only the subsets of the data that have been required are stored. SourceCollections that describe subsets of the same larger SourceCollection should share their .sourcelist_data. Therefore, this SourceList might contain a subset or superset of the sources that are represented by the SourceCollection. In any case, the sourcelist_data will contain all the described attributes. If there is a SLIDorg and SIDorg column in the sourcelist_data, then these represent the actual SLID and SID of the sources. In principle, the sourcelist_data should be set only once and should be considered unchangeable.
• sourcelist_sources: A SourceList that contains only the source identifiers that correspond to this SourceCollection. If the sourcelist_sources contains the SLIDorg/SIDorg attributes, then these contain the SLID/SID tuples of the sources, otherwise the SLID/SID columns themselves. Any other columns are ignored, so the same SourceList can be used for both the sourcelist_data and the sourcelist_sources (if all data is stored). The sourcelist_sources should always contain the exact set of sources, not more, not less.
• all_data_stored: A flag to indicate whether all the data is stored. A value of 1 indicates that all data is stored. A value of 0 means that it is unclear whether all the data is stored. It should be checked with is_made() whether more data needs to be made() or whether it can be retrieved.
• attribute_names: Contains the names of the attributes of this SourceCollection.
• attribute_columns: Describes in which column of sourcelist_data the attributes are stored.

Not all columns that are included in the sourcelist_data have to be part of the SourceCollection:

• Obligatory attributes such as A, B, POSANG.
• Attributes that are only included for the benefit of database indices/partitions, HTM, DEC.
• Not selected attributes, e.g. a SelectAttributes SourceCollection might share the sourcelist_data of its parent.

The name of the included columns does not have to correspond to the attribute name:

• Renamed column names, e.g. with a RenameAttributes SourceCollection.
• Abstracted column names, e.g. Gert’s arbitrary column names.
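How attribute_names and attribute_columns relate can be sketched as a lookup from attribute name to stored column. This assumes that the two lists are paired by position, as the property descriptions suggest; `stored_columns` and the column names in the usage note are hypothetical.

```python
def stored_columns(attribute_names, attribute_columns, requested):
    """Return the sourcelist_data columns that hold the requested attributes."""
    column_of = dict(zip(attribute_names, attribute_columns))
    missing = [name for name in requested if name not in column_of]
    if missing:
        raise KeyError("attributes not described: %s" % ", ".join(missing))
    return [column_of[name] for name in requested]
```

With attribute names `["MAG_R", "MAG_I"]` stored in columns `["COL_17", "COL_18"]`, requesting `MAG_I` resolves to `COL_18`, while other columns of the shared sourcelist_data are never exposed.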

• The SourceCollection class

The functions in this base class are either:

• common functionality (get_query_alias),
• functions that should be overloaded by derived classes (get_source_relations),
• functions for External SourceCollections with all_data_stored (get_query_self).

• TODOs

The TODO comments in the classes are labeled:

• ST = Short Term: for small things that are noted but could not be fixed immediately.
• MT = Medium Term: things that require thought/experimenting/discussion to be fixed.
• LT = Long Term: complex issues, e.g. that require database changes, changes in other classes, etc.

L = Low importance, M = Medium importance, H = High importance

TODO LT H:

• Using an actual SourceList for the above two data storage mechanisms is not optimal. It might be better to design separate data structures for this in the future. (Or make the SourceList a SourceCollection itself.)
• The sourcelist_sources, sourcelist_data and all_data_stored cannot yet be changed for persistent objects.

SCID

SourceCollection identifier [None]

action(action, oproperty=None)

Perform actions over SAMP. Only ‘make’ is currently supported.

TODO MT M [SAMP]: Perhaps let the SCT handle this action.

after_init()
all_data_stored

Flag to indicate whether all data has been stored in sourcelist_data, 0 means no (or unknown), 1 means yes

append_data_python(tclarge=None, tcsmall=None, keys=None)

Appends the new data in .data to .datasuper.

TODO MT L: [TABLECONVERTER] Perhaps put this functionality in
TableConverter itself.
attribute_columns

Column names of the attributes in the sourcelist_data

attribute_names

Names of the attributes corresponding to the attribute_columns

can_derive_data_on_the_fly()

Returns True if this operator can be evaluated on the fly in SQL, that is, if retrieving the data does not require make() to be run, only load_data(). In principle this is determined by the operator and is independent of the described data.

In practice, only some AttributeCalculators return False.

TODO LT M: [SQLPYTHON]
• This function is basically a placeholder; a more advanced function is needed.
• Implement this function in the derived classes.
• Differentiate between Python and SQL. (e.g. for FilterSources, AC)
change(prop, value, index=None)

Function to set a property. Used for SAMP connectivity and in the SourceCollectionTree.

TODO MT M: [SCCHANGE] What should happen when
the requested change cannot be made? return False? raise Exception?
TODO MT M: [SCCHANGE] What to return when the
change is successful? return True?
check(parse=False)

Performs some sanity checks on the SourceCollection. Returns True if everything is okay, raises an Exception otherwise.

The query from get_query_full is parsed by a database cursor if parse=True.

This function raises an Exception instead of returning False because it should only be called on SourceCollections that ought to pass the checks. If any of the checks fail, it is assumed that the process that created the SourceCollection is faulty.

copy(relations=None, copy_parents=True, copy_parameters=True, copy_cache=False, copy_supercache=False)

Returns a copy of this SourceCollection. What should be copied depends on the reason the copy is made.

The following properties might be copied:
• The parents: parent_sourcecollection, parent_sourcecollections
• Persistent properties representing process_parameters: query, selected_sources, selected_attributes, etc. Copying these is the responsibility of derived classes.
• The cached data for this particular SC: sourcelist_sources, data, sources
• The cached data that might represent a superset of data: sourcelist_data, datasuper, attribute_names, attribute_columns, cache_attributes_sourcelist
• The copy and the original might be set as equal in the SetRelations.

There are four reasons to copy a SourceCollection:
• To change a process_parameter, e.g. a selection criterion. Copy: parents, process_parameters. Cannot be set as equal in the relations, except for some operators.
• To use the same operator on other, unrelated, parents. Copy: parents, process_parameters. Cannot be set as equal in the relations.
• To switch parents with other parents that are conceptually the same, when optimizing trees. Copy: parents, process_parameters, cache, supercache. Can be set as equal in the relations.
• To reorganize the tree by changing the sources of this SC. Copy: parents, process_parameters, supercache. Cannot be set as equal in the relations.
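
The four reasons above can be written down as keyword arguments for copy(). This is only a sketch: the reason labels and the helper `copy_arguments` are hypothetical, while the keyword names are taken from the copy() signature.

```python
# Hypothetical reason labels mapped to the keyword arguments of copy().
COPY_ARGUMENTS = {
    "change_parameter": {
        "copy_parents": True, "copy_parameters": True,
        "copy_cache": False, "copy_supercache": False,
    },
    "reuse_on_other_parents": {
        "copy_parents": True, "copy_parameters": True,
        "copy_cache": False, "copy_supercache": False,
    },
    "switch_equivalent_parents": {
        "copy_parents": True, "copy_parameters": True,
        "copy_cache": True, "copy_supercache": True,
    },
    "reorganize_sources": {
        "copy_parents": True, "copy_parameters": True,
        "copy_cache": False, "copy_supercache": True,
    },
}

# Whether the copy and the original may be set as equal in the SetRelations.
MAY_BE_SET_EQUAL = {
    "change_parameter": False,  # except for some operators
    "reuse_on_other_parents": False,
    "switch_equivalent_parents": True,
    "reorganize_sources": False,
}


def copy_arguments(reason):
    """Return a fresh dict of copy() keyword arguments for the given reason."""
    return dict(COPY_ARGUMENTS[reason])
```

A caller holding a SourceCollection `sc` could then write `sc.copy(**copy_arguments("reorganize_sources"))`.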
create_empty_datasuper(row_identifiers=None)

Creates an empty TableConverter that will be used as .datasuper.

If .datasuper already exists, the created TableConverter is returned but not set as .datasuper.

create_empty_sourcelist_data(filename='temp.fits', tablename=None, boundingbox='auto', slname=None, fltr='auto', obj=None)

Creates an empty SourceList to be used as .sourcelist_data. The .datasuper will be created as well.

The SLID and SID of the sources will be stored in the SLIDorg and SIDorg columns. The SOURCELIST*SOURCES**05 table is used because this table has an index on the SLIDorg-SIDorg combination.

A FITS file that contains the attributes but no sources is created and ingested. The resulting SourceList is only committed when data has to be ingested or when this SC is committed.

creation_date

Date this object was created [None]

derive(attributes=None, query='')

Returns a new SourceCollection that describes the requested attributes for the sources that match the given query. A SourceCollectionTree is created to which the selection of sources and attributes is applied. Note that the attributes do not necessarily have to be included in this original SourceCollection.

This function is still experimental, it might change in the future. TODO MT M: [NOCAT] Finalize the design of this function.

derive_timestamp()

Derives a timestamp for the creation of this SourceCollection.

This is called in __init__(), instead of traditionally in make(), because SourceCollections do not always have to be made() to be used.

get_alias_relations()

Returns the alias to be used in the SetRelations.

classmethod get_alias_sourcelist(sl)

Returns an alias for a SourceList, for use in SQL queries or SetRelations.

get_attribute_names(cache=False)

Returns the names of the attributes.

get_attribute_origin(attribute, cache=False)

Returns a list of SourceCollections higher in the tree that represent the same attribute. That is, every SLID-SID-attribute combination of these SourceCollections will result in the same numerical value. The first in the list is earliest in the tree. The last will be this SourceCollection.

get_attributes(cache=False)

Returns a list of dictionaries with meta data about the attributes.

Keys of the dictionaries:
• name: The name (and identifier) of the attribute.
• format: A format string that can be used by numpy.dtype()
• ucd: Unified Content Descriptor (not always filled properly)
• null: A null value, usually numpy.nan (not always present)
• length: The length for multi length cells, only used for strings.

The dictionaries have the same structure as those used in the TableConverter class.
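
As an illustration of the keys listed above, the sketch below builds a few invented attribute dictionaries and reduces them to (name, format) pairs, which is the list form that numpy.dtype() accepts for structured types. The sample values and the helper `dtype_spec` are assumptions; in particular, how 'length' combines with 'format' for strings is not specified here, so the example simply repeats the length in the format string.

```python
import math

# Invented example entries using the documented keys.
attributes = [
    {"name": "SLID", "format": "int64"},
    {"name": "RA", "format": "float64", "ucd": "pos.eq.ra", "null": math.nan},
    {"name": "FLAGNAME", "format": "S8", "length": 8},
]


def dtype_spec(attribute_list):
    """Build a list of (name, format) pairs from attribute dictionaries."""
    return [(attribute["name"], attribute["format"]) for attribute in attribute_list]


spec = dtype_spec(attributes)
```

With NumPy installed, `numpy.dtype(spec)` would turn this list into a structured dtype.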

TODO LT L: [NOCAT] Use ‘ucd’ and ‘null’ properly.

get_attributes_full(cache=False)

Returns a TableConverter like attribute list with extra keys.

Should only be used by SourceCollection functions. Other classes and awe-prompt users should use get_attributes().

This function should be overloaded by the derived classes. The version in this base class is essentially the variant of the External.

get_bounding_box(approximate=True)

Returns a BoundingBox that encompasses this SourceCollection. For some SourceCollection instances it can be difficult to determine the exact bounding box, for example for a FilterSources for which the selection criterion has not been evaluated yet. Setting ‘approximate’ to True should yield a sensible bounding box. Bounding boxes that are too large should be preferred over boxes that are too small.

SourceCollections (and instances of other catalog classes) conceptually have a bounding box as well. The best way to store this bounding box is probably through the BoundingBox class, however this class currently only supports frames, no catalogs.

Some problems are expected with the current implementation of bounding boxes in Astro-WISE. For example, an almost full sky catalog will have ulRA almost at 360 and urRA just above 0. The htm module (e.g. used by the AssociateList) will always consider the ‘smallest’ boundingbox. Therefore it will treat such a catalog as being very small around RA=0 instead of very large.
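
The ambiguity described above can be made concrete with a few lines of arithmetic. This is a sketch with a hypothetical helper, `ra_widths`; it does not use the BoundingBox class or the htm module.

```python
def ra_widths(ra_edge_a, ra_edge_b):
    """Return the (small, large) RA widths in degrees between two box edges."""
    direct = abs(ra_edge_a - ra_edge_b) % 360.0
    wrapped = 360.0 - direct
    return min(direct, wrapped), max(direct, wrapped)


# Edges of an almost full-sky catalog: one just below 360, one just above 0.
small, large = ra_widths(359.9, 0.1)
```

The two edges alone are consistent with a box about 0.2 degrees wide and with one about 359.8 degrees wide. Code that always picks the smaller interpretation will treat the full-sky catalog as a narrow strip around RA=0.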

Not yet fully implemented.

get_children(cache=False, checktree=True)

Returns other SourceCollections that have this SourceCollection as a parent.

This does not include ‘non-parents’, because its goal is to find SourceCollections that describe data derived from this object.

get_children_persistent(cache=False, checktree=True)

Returns persistent SourceCollections that have this SourceCollection as a parent.

get_earliest_catalog_source_progenitors(identifier, _done=None)

Uses get_source_progenitors recursively to find the earliest catalogs that contributed to this source. Usually these will be normal SourceLists whose sources are created by running SExtractor on a frame. However, these could also be External SourceCollections for which no frame exists.

get_export(relations=None, children=False)

Returns an ExportObject that can be used to send meta data of this SourceCollection over SAMP.

get_export_class_name()

Return the class name to export.

classmethod get_export_sourcelist(sl)

What the SourceList would return if it had a get_export function. Except that it would add a frame.

TODO ST M: [SAMP] Use the ExportDict for this function.

get_filter()

Retrieve the filter of this SourceCollection if applicable.

static get_new_tableconverter_id()

Returns a new identifier for a TableConverter in order to cache its data. This has to be a locally unique identifier. For now, this is simply a new SCID.

[EXPERIMENTAL]: Local caching of catalog data is still experimental!

TODO: Ensure that a locally unique identifier is available for anonymous users. Perhaps an identifier based on lineage. Ultimately, this can all be incorporated in a ‘local’ backend.

get_object()

For lack of an ‘OBJECT’ property, try to get the OBJECT from the dependencies. This can then be stored as the OBJECT of the sourcelist_data.

get_onthefly_dependencies(config=None)

get_onthefly_dependencies() is overloaded from OnTheFly in order to remove the data caching dependencies, because these are the things that get made.

get_parents_tree()

Returns the persistent dependencies that the SourceCollectionTree should handle. Currently these are only SourceCollections.

Despite the name of this function, this also includes non-parents, such as the selected_sources of SelectSources.

This function will be deprecated in favor of get_dependencies_flat(), which will return all dependencies, not just SourceCollections.

get_parents_tree_keys()

Returns the keys of persistent dependencies that the SourceCollectionTree should handle. Currently these are only SourceCollections.

Despite the name of this function, this also includes non-parents, such as the selected_sources of SelectSources.

This function should be overloaded by derived classes, because the overhead of this base function is large.

get_parents_with_clauses()

Returns the dependencies whose queries should be included in the full query as WITH clauses. Despite the name this also includes non-parents such as the selected_sources from the SelectSources.

The derived SourceCollection classes should provide this function themselves, because this base function is too time-consuming.

get_properties_lineage()

Returns a dictionary with as keys the names of the persistent properties that are part of the data lineage and their values as values. That is, the progenitors and process parameters.

Two SourceCollections of the same class of which all these properties have identical values represent the same catalog.

get_query_alias(nowhere=False)

Returns a string to be used as the alias of the WITH clause that represents this SourceCollection in created SQL queries.

The .get_scid() function is used to ensure that the alias is unique.

‘nowhere’ is a query optimization parameter which is only to be set by the SelectSources operator.

get_query_count_sources()

Returns a query to count the sources. This is a helper function which is used to determine whether the all_data_stored flag of a SourceCollection can be set. It should not be called manually.

TODO MT L: [NOCAT] Perhaps add an ‘optimize’ argument to make it
possible to have the user call this function.
get_query_full()

Returns the entire SQL query that combines the operators from the SourceCollections higher in the tree.

get_query_new_sources(relations=None, scsources=None)

Returns a query that selects the SLID/SID combinations that are described by this SC, but that are not yet stored in sourcelist_data. Furthermore it returns queries to count the number of stored sources and the number of described sources.

‘scsources’ is a SourceCollection that describes the sources of this SourceCollection and is provided by the SourceCollectionTree. The .sourcelist_sources should be set if ‘scsources’ is not given.

This function should only be called through is_made(). TODO MT L: [SCVSSCT] Perhaps put this code in the SCT entirely.

get_query_self(nowhere=False)

Gets the query for this specific SourceCollection class/operator. This function should be implemented in the derived classes. The base function in this class works only for Externals and SourceListWrappers.

‘nowhere’ is an optimization parameter that should only be set by the SelectSources class.

get_query_with_clauses(with_clauses=None)

Retrieves the SQL queries that should be used as WITH clauses in the full query. This function is called by get_query_full; it should not be necessary to call this function manually.
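
To illustrate how per-node queries can be chained through WITH clauses, here is a minimal sketch. It is not the package's query builder: `build_full_query`, the aliases and the table name `sources` are invented, and the real aliases come from get_query_alias().

```python
def build_full_query(with_clauses, final_alias):
    """Combine (alias, query) pairs into one statement selecting from final_alias."""
    if not with_clauses:
        raise ValueError("at least one WITH clause is required")
    parts = ["%s AS (%s)" % (alias, query) for alias, query in with_clauses]
    return "WITH " + ", ".join(parts) + " SELECT * FROM " + final_alias


# Invented example: the second clause refers to the first by its alias.
clauses = [
    ("SC1", "SELECT SLID, SID, RA FROM sources"),
    ("SC2", "SELECT * FROM SC1 WHERE RA > 10"),
]
full_query = build_full_query(clauses, "SC2")
```

Each clause can refer to earlier ones by alias, which is why unique aliases matter.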

get_scid()

Returns the globally unique positive SCID if set, else returns the temporary negative tSCID, which is unique to this session only.

This function is preferred above accessing the SCID directly since that will use the database Sequence to create a new SCID. See also __getattribute__() and the section above about ‘Identifiers’.
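
The fallback from a persistent to a temporary identifier can be sketched as follows. This is a toy model under stated assumptions: the class `IdentifiedNode` and its counter are hypothetical and do not reproduce the tSCID mechanism or the database Sequence.

```python
import itertools


class IdentifiedNode:
    """Toy stand-in with a persistent id and a lazily assigned temporary id."""

    # Session-wide counter for temporary identifiers (hypothetical mechanism).
    _temporary_counter = itertools.count(1)

    def __init__(self, scid=None):
        self.scid = scid    # positive and globally unique when set
        self._tscid = None  # negative, assigned on first request

    def get_scid(self):
        """Return the persistent id if set, else a stable negative temporary id."""
        if self.scid is not None:
            return self.scid
        if self._tscid is None:
            self._tscid = -next(IdentifiedNode._temporary_counter)
        return self._tscid
```

The sign tells a caller whether an identifier is meaningful outside the current session.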

get_slids()

Returns the SLIDs that the sources might have by walking up the tree. This function should be implemented by the derived classes.

This is a convenience function; it should not be relied upon and is not guaranteed to be correct. To get the exact SLIDs, use .load_sources().

For example, for an External with a SLIDorg column it is only possible to get the SLIDs by going through all the sources. This is not done.

Furthermore, a FilterSources with a parent with multiple SLIDs might select sources of one specific SLID. This can only be checked by evaluating the query, which is not done.

get_source(slid, sid)

Returns a dictionary with attribute names as keys and as values the attribute values that correspond to the given slid-sid. Currently only applicable for SourceCollections of which the data has been loaded.

get_source_progenitors(identifier)

Returns a list of objects and identifiers that represent the progenitors of the source.

Identifier should be a (SLID, SID) combination.

Derived classes might need to overload this function.

There might be edge-cases where this function does not yet work.

get_source_python(slid, sid)

Returns a dictionary with attribute names as keys and as values the attribute values that correspond to the given slid-sid, as kept in the .data TableConverter.

get_source_relations(relations=None, quick=False)

Adds the alias of this SourceCollection and its parents to the given SetRelations ‘relations’ and enforces the relations between them.

‘relations’ is a SetRelations object that contains information about the logical relations between the sources described by a set of SourceCollections. If none is given, a new one is created.

This function will recursively call itself on the parents of this SourceCollection, unless quick=True and the parents are already added to the relations. Then it adds this SourceCollection itself.

Subsequently, the known relations between this SourceCollection and its parents are enforced.

The get_alias_relations() function is used to get an alias that is used in the SetRelations object.

This function should be overloaded in derived classes.

get_source_relations_sourcelist(sl, relations=None, quick=False)

Similar to get_source_relations, only for SourceLists instead of SourceCollections.

This function should only be used for SourceLists that are used in a SourceListWrapper or persistent External.

get_sources_mask(tclarge=None, tcsmall=None, keys=None)

Creates a mask that can be used on ‘tclarge’ in order to select only those sources that are also in ‘tcsmall’.

‘keys’ is a list of attributes that uniquely define a source. By default this is [‘SLID’, ‘SID’] or the ‘row_identifiers’ of the TCs. This is useful for SourceCollections that are not yet stored in the database, e.g. for the experimental TAP functionality.

This function is used:
• to create .data out of .datasuper and .source_identifiers
• to append sources from .data to .datasuper

TODO MT L: [TABLECONVERTER] Perhaps put something like this in
TableConverter itself.
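
The masking idea can be sketched with plain Python rows instead of TableConverters. The function `sources_mask` below is a hypothetical stand-in, and the sample rows are invented.

```python
def sources_mask(large_rows, small_rows, keys=("SLID", "SID")):
    """Return one boolean per row of large_rows: True if its key is in small_rows."""
    wanted = {tuple(row[key] for key in keys) for row in small_rows}
    return [tuple(row[key] for key in keys) in wanted for row in large_rows]


datasuper = [
    {"SLID": 1, "SID": 0, "MAG": 18.2},
    {"SLID": 1, "SID": 1, "MAG": 19.5},
    {"SLID": 2, "SID": 0, "MAG": 17.1},
]
source_identifiers = [{"SLID": 1, "SID": 1}, {"SLID": 2, "SID": 0}]

# Select from the superset only the rows whose identifiers are requested.
mask = sources_mask(datasuper, source_identifiers)
data = [row for row, keep in zip(datasuper, mask) if keep]
```

Collecting the keys of the small table in a set keeps the lookup cost per row of the large table constant.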
get_tapurl()

Warning: Experimental code for IVOA TAP interoperation. Gets the url for a TAP query.

get_tapurl_dict()

Warning: Experimental code for IVOA TAP interoperation.

Returns a dictionary to build a TAP query. Should be implemented in derived classes. Currently only in:
• ExternalTAP (experimental)
• FilterSources

TODO MT H [TAP]:
• Finish design of ExternalTAP and make it a proper persistent class.
• Implement this function in at least RA and SA.

TODO MT M [TAP]:
• Document this dictionary, see ExternalTAP

get_tree_part(sc)

Returns a list of SourceCollections that are in between self (inclusive) and ‘sc’ (inclusive) in the dependency tree of this SC.

ingest_data()

Ingests the .datasuper as new rows in .sourcelist_data. Sources that are already stored are skipped.

is_made(optimize=True, relations=None, scsources=None)

Returns True if no data has to be made in order to load the data into Python. .make() should be called if False is returned.

From the awe-prompt, this function should be called on the SC that describes the data that is actually required, and with optimize=True.

This will create a SourceCollectionTree that will be optimized to limit the amount of data that has to be made. This SCT will subsequently check whether this subset of data indeed is made by calling is_made with optimize=False on all its nodes.

If optimize=False, no recursion is done:
• If the data is stored in .data or .datasuper and .source_identifiers: True
• If all_data_stored in the database: True
• If this SourceCollection can_derive_data_on_the_fly(): True. Note that this intentionally also returns True even if the parents are not made, because this cannot be checked with optimize=False.
• If sourcelist_data but not all_data_stored, check whether all data is stored anyway. If so, set all_data_stored and return True.
• Otherwise return False.

If optimize=True:
• A SourceCollectionTree is created.
• The SCT is optimized for loading data.
• The is_made of the SCT is returned (which calls this function with optimize=False)
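
The order of the checks for optimize=False can be sketched as a plain function. This is an illustration, not the actual method: the function name, its arguments and the `count_check` callable are hypothetical, and the node state is passed in as booleans.

```python
def is_made_unoptimized(has_loaded_data, all_data_stored, derivable_on_the_fly,
                        has_sourcelist_data, count_check):
    """Return (made, all_data_stored) following the documented order of checks.

    count_check is a hypothetical callable that reports whether every
    described source is already stored; it is only called when needed.
    """
    if has_loaded_data:
        return True, all_data_stored
    if all_data_stored:
        return True, True
    if derivable_on_the_fly:
        # Intentionally True even if the parents are not made.
        return True, all_data_stored
    if has_sourcelist_data and count_check():
        return True, True  # the flag is set as a side effect of the check
    return False, all_data_stored
```

The check that needs a database round trip comes last, so it only runs when the cheaper conditions fail.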

‘scsources’ can contain an SC that describes the sources of this SC. scsources typically is created with SCT.get_sourcecollection_sources() and is passed by the SCT.

TODO MT L: [SCVSSCT] Perhaps let the SCT do the part for which
scsources is required.
is_valid

Manual/external flag to disqualify bad data (SuperFlag) [None]

load_associatelist_data_cache(associatelist)

Retrieves the association information of an AssociateList from cache.

[EXPERIMENTAL]: Local caching of catalog data is still experimental!

load_data(optimize=True, relations=None)

Loads the data into the .data TableConverter by evaluating the operator of this SourceCollection either in Python or on SQL.

This function only evaluates operators that do not require new data to be made, that is, .is_made() should return True. .make() should be called to evaluate operators that create new data.

This function can be called with or without optimization. With optimization a SourceCollectionTree is created which is reorganized to load the data in an optimal way.

• Without optimization:

Evaluation of the operator is first attempted on the Python layer by calling load_data_python(). If this fails the evaluation is attempted on SQL by calling load_data_sql().

• With optimization:

A SourceCollectionTree is created which is reorganized to load the data as efficiently as possible, which is subsequently done.

E.g., this process will move SelectSources up into the tree. As a result, a SourceCollection that originally required data to be made is replaced with a copy that describes a subset of the data that is already stored and thus does not have to be made anymore.
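
The unoptimized path, Python first and SQL as fallback, can be sketched as below. How a failed Python evaluation is signalled is not documented here, so this sketch assumes either a False return value or a NotImplementedError; the function name and the two callables are hypothetical.

```python
def load_without_optimization(load_python, load_sql):
    """Try the Python loader first and fall back to SQL; return the layer used."""
    try:
        if load_python():
            return "python"
    except NotImplementedError:
        # Assumed failure signal for operators with no Python evaluation.
        pass
    if load_sql():
        return "sql"
    return None
```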

TODO MT H: [SCVSSCT] Most of the functionality here should be placed
in the SourceCollectionTree. The .load_data_nn() below is an attempt to do this and will ultimately replace this function.
load_data_cache()

Loads the data from a local cache file.

[EXPERIMENTAL]: Local caching of catalog data is still experimental!

load_data_file()

Load the data from a file on the dataserver.

load_data_kids(optimize=True, depth=0)

Specific functionality to load KiDS tiles. Useful for SourceCollections whose names start with “KIDS_”. This function traverses the tree in forward direction and tries to load data for each SourceCollection, since it is usually needed anyway.

load_data_nn(optimize=True, relations=None)

Experimental new version of load_data() where more functionality is offloaded to the SourceCollectionTree. In particular the call to optimize the tree is moved to the SCT.

load_data_python()

Loads the data by evaluating the operator in Python. This requires the data of the parent SourceCollections to be available in their .data property.

Derived classes should overload this function. Not all operators can be evaluated in Python.

load_data_python_cache()

Creates .data from .datasuper and .source_identifiers.

load_data_sql(optimize=True, relations=None)

Loads the data from the database into Python. The query to do so is generated by get_query_full(), which tries to combine queries of parent SourceCollections if necessary.

load_data_tap()

Warning: Experimental code for IVOA TAP interoperation. Loads the data over TAP

TODO MT H [TAP]:
• Rename attribute_tapcolumns to attribute_names
• Ensure .sourcelist_data and .datasuper are the same as parent.

TODO MT M [TAP]:
• Make ‘async’ work.
• Implement optimization somehow
• Error checking

load_sources(relations=None, optimize=True)

Loads only the source identifiers that belong to this SourceCollection as the .source_identifiers TableConverter. This is done by creating a SelectAttributes that selects no attributes with this SourceCollection as parent and loading the data of that SC.

make(optimize=True)

Makes the data required for this SourceCollection. Only data that cannot be derived on the fly has to be made. In practice these are only some AttributeCalculators, but other operators might be defined in the future.

Without optimization:
• The data from the parent SourceCollections is used to create the data of this SourceCollection.
• The data is kept in .data and appended to .datasuper.
• The data is not ingested into the database.
• False is returned if the data of the parents is not available.

With optimization:
• A SourceCollectionTree is created from this SourceCollection.
• This SCT is optimized for the creation of data.
• .make(optimize=False) is called on the SourceCollections in the tree

TODO MT M: [ATTRIBUTECALCULATOR] Not all AttributeCalculatorDefinitions
follow the above guidelines, those need to be updated.
mandatory_dependencies = ()
modify_integrate_parents(relations=None)

This function tries to integrate the parents of this SourceCollection into it. It returns a new SourceCollection that can replace this one, which depends on one or more of its grandparents.

Only one parent should be integrated at a time. Parents with partially stored data should not be integrated. Parents with all data stored should only be integrated if the stored data can be used by the new SourceCollection.

Derived classes should overload this function. E.g. a SelectAttributes can integrate a parent SelectAttributes.

modify_move_up(relations=None)

This function tries to move the SourceCollection up in the tree. Very simplified, the procedure is as follows: copy this SC and its parent and switch the parents of the copy.

Derived classes should overload this function.
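
The simplified procedure can be sketched on a toy single-parent chain. `ToyOperator` and `move_up` are hypothetical and ignore everything a real implementation must check, such as the relations and whether the two operators commute.

```python
class ToyOperator:
    """Minimal single-parent node; not the real SourceCollection class."""

    def __init__(self, name, parent=None):
        self.name = name
        self.parent = parent

    def copy(self):
        return ToyOperator(self.name, self.parent)

    def lineage(self):
        """Names from the earliest ancestor down to this node."""
        names, node = [], self
        while node is not None:
            names.append(node.name)
            node = node.parent
        return names[::-1]


def move_up(node):
    """Return the new end node after swapping copies of node and its parent."""
    parent = node.parent
    if parent is None or parent.parent is None:
        return node  # nothing to move past
    moved = node.copy()
    moved.parent = parent.parent  # the copy now sits above the old parent
    lowered = parent.copy()
    lowered.parent = moved        # the parent copy becomes the end node
    return lowered


raw = ToyOperator("raw")
calculator = ToyOperator("calculator", raw)
selection = ToyOperator("selection", calculator)
new_end = move_up(selection)
```

Working on copies leaves the original chain untouched, so the reorganized chain can be discarded if it turns out not to help.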

modify_remove_dependencies(relations=None)

This function should return an object that has fewer dependencies than the original, possibly none. The described data is still exactly the same, but information about its creation might be partially lost.

The ‘relations’ argument should always be used when calling this function, because the information necessary to infer the exact set of sources might be lost after the dependencies are removed.

This function should not be used on SourceCollections that are part of a tree that will be made persistent.

This base function tries to create an External that describes the same data as this SourceCollection.

modify_remove_self(relations=None)

The parent of this SourceCollection will be returned if this SourceCollection is essentially a Pass operator.

This function should not create new SourceCollections.

modify_split_cached(relations=None)

Splits the SourceCollection into a part that is already retrieved in .datasuper, and a part that is not.

Requires a .sourcelist_data, .datasuper and .source_identifiers.

[EXPERIMENTAL]: This modification is still experimental, and not yet
used by the SourceCollectionTree.
modify_substitute_progenitors(progenitors_substitution)

Creates a copy of this SourceCollection where any progenitors that are listed in progenitors_old are replaced by those in progenitors_new.

name

Name of the SourceCollection [None]

object_id

The object identifier

The object identifier is an attribute shared by all persistent instances. It is the prime key, by which object identity is established

patch_datasuper()

Helper function to prepare datasuper for being ingested.
• Adds ‘HTM’, ‘RA’ and ‘DEC’ attributes if they are not included.
• Retrieves the HTM/RA/DEC of the original source to be used as value for these attributes.
• Checks whether the source has already been ingested and places the new SID in SIDnew, if available.

This function is called by ingest_data().

This function is not particularly fast, but if the data is going to be ingested, then .ingest_sources() will likely take more time anyway.

process_status

A flag indicating the processing status [None]

quality_flags

Automatic/internal quality flag [None]

retrieve_associatelist_data(associatelist)

Retrieves the association information of an AssociateList.

retrieve_cached_information()

Retrieves information that is cached. In particular the .datasuper_cacheid, .data_cacheid, .sources_cacheid.

[EXPERIMENTAL]: Local caching of catalog data is still experimental!

retrieve_tableconverter_data(tableid)

Retrieves cached tableconverter data.

[EXPERIMENTAL]: Local caching of catalog data is still experimental!

set_onthefly_attributes(attributes)
sourcelist_data

Optional SourceList containing the data described by this SourceCollection

sourcelist_sources

Optional SourceList containing the sources described by this SourceCollection

sources_from_data()

Creates a .source_identifiers from .data by copying the SLID and SID attributes.

store_cached_information()

[EXPERIMENTAL]: Local caching of catalog data is still experimental!

store_data(relations=None, optimize=True, tablename=None)

Stores the data of this SourceCollection in the database.

With optimization a SourceCollectionTree is created, which stores the data in the most optimal way, preventing duplications of data. In practice this means that the data is stored for all SourceCollections in the dependency tree which have a .sourcelist_data, which are mostly AttributeCalculators.

Without optimization, the sources in .datasuper are ingested in the .sourcelist_data of this SC, which is created if it does not exist.

This function should only be called with optimize=False manually in case the user wants to overrule the optimization behaviour.

store_data_cache(relations=None, optimize=True)

Caches the catalog data of this SourceCollection in a file.

[EXPERIMENTAL]: Local caching of catalog data is still experimental!

store_sources(relations=None, optimize=True)

Caches the sources of this SourceCollection in a SourceList.

TODO MT M: [DETERMINESOURCES] With optimize=True this function should decide on the optimal way to store the sources. Currently it does not do that yet. Options are:
• Find an existing SourceList that can be used as sourcelist_sources.
• If this does not exist, try to find the most suitable SC to store the sources. E.g., if this is a SA, with an FS as parent, it might be better to store the sources of the FS.
tSCID = 0
tSCID_iterator = <map object>
tSCID_seed = 400820
walk_tree_backward(sc, yielded=None)

Yields the SourceCollections in the dependency tree of this SourceCollection, as identified by get_parents_tree(). This function starts at this end node, going backwards.

walk_tree_forward(sc, yielded=None)

Yields the SourceCollections in the dependency tree of this SourceCollection, as identified by get_parents_tree(). This function starts at the most ‘raw’ data, going forwards.
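
The two traversal orders can be sketched with generators over a toy dependency graph. `ToyNode` and both functions are hypothetical; the `yielded` argument plays the role suggested by the signatures above, preventing a shared ancestor from being yielded twice.

```python
class ToyNode:
    """Minimal node with any number of parents."""

    def __init__(self, name, parents=()):
        self.name = name
        self.parents = list(parents)


def walk_backward(node, yielded=None):
    """Yield node first, then its ancestors, each exactly once."""
    if yielded is None:
        yielded = set()
    if id(node) in yielded:
        return
    yielded.add(id(node))
    yield node
    for parent in node.parents:
        yield from walk_backward(parent, yielded)


def walk_forward(node, yielded=None):
    """Yield ancestors before the nodes that depend on them."""
    if yielded is None:
        yielded = set()
    if id(node) in yielded:
        return
    yielded.add(id(node))
    for parent in node.parents:
        yield from walk_forward(parent, yielded)
    yield node


# A diamond: both 'left' and 'right' depend on the same raw data.
raw = ToyNode("raw")
left = ToyNode("left", [raw])
right = ToyNode("right", [raw])
end = ToyNode("end", [left, right])
```

Forward order is the natural one for making or loading data, because every node is visited after all of its dependencies.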

exception astro.main.sourcecollection.SourceCollection.SourceCollectionError(message)

Error class for SourceCollections.

## astro.main.sourcecollection.SourceCollectionTree module¶

class astro.main.sourcecollection.SourceCollectionTree.SourceCollectionTree(sc, relations=None)

Bases: object

A SourceCollectionTree keeps a high-level overview on a group of SourceCollections. The SourceCollections will usually be in a tree, hence the name, but this is not strictly necessary. The SourceCollectionTree itself is not persistent.

The primary function of an SCT is to manipulate transient trees, e.g. optimizing them to load data or simplifying them when pulling data.

The SourceCollectionTree assumes that it is in full control of the tree. That is, SourceCollections in the tree should only be modified through the SourceCollectionTree.

add_attribute(attribute, origin=None, sc=None, allreadysearched=None)

Adds the ‘attribute’ from ‘origin’ to ‘sc’.
• ‘sc’ should be a Pass SourceCollection that can be modified. This should be the end node of the tree, unless called from the SourceCollectionTree code itself.
• ‘origin’ is a SourceCollection that should contain the attribute. A suitable SourceCollection is searched for if none is given.
• ‘allreadysearched’ is a set of attributes that have already been searched for, to prevent infinite loops.

A SelectAttributes is created that selects the attribute from ‘origin’. Subsequently a ConcatenateAttributes is created with as parents the old parent of ‘sc’ and the new SelectAttributes. This ConcatenateAttributes is used as new parent of ‘sc’.

add_tracked_to_relations()

Helper function to add several tracked SCs to the relations.

It only adds SCs that describe the same as their parents or CAs, and only if their parents are added to the relations.

TODO MT M: [SCTTRACKING] Deciding what to add to the relations and
what not should be done in a more general way. Without explicitly listing any classes and such that there will not be problems down the road if an SC is not added.
apply_attribute_selection(attributes=None, sc=None, autotrack=True)

Apply a selection of attributes to the given SourceCollection ‘sc’, which usually is the end node in the tree.

TODO MT H [NOCAT]: See whether a suitable selection already exists.
Implement a find_attributes function?
apply_filter(query, sc=None, autotrack=True)

Applies the selection criterion ‘query’ to the SourceCollection sc.

This is done by inserting a SelectSources with an existing FilterSources as .selected_sources in between the given SourceCollection and its parent if a suitable FS is found. Otherwise a new FilterSources is created which is inserted directly in between the given SourceCollection and its parent.

Untracked persistent FilterSources are automatically added to the .sourcecollections if ‘autotrack’ is set.

At the moment it is only possible to use this function on the end node. This is difficult to change, since this modifies the data that the given SourceCollection represents. Therefore the data of all SCs that have ‘sc’ in their dependency tree changes as well. Such SourceCollections should not exist because the end node should not have children.

TODO LT M: [NOCAT] This is what the fixlevels should make possible.

change_parameter(sc, prop, value, index=None)

Changes a parameter of a SourceCollection in the tree. This will not change any SourceCollection, but replaces the parts of the tree that need to be changed.

check()

Runs check on all SourceCollections in the tree.

combine_all_parallel_selectsources()

Keep combining parallel SelectSources until no more can be combined.

combine_parallel_selectsources()

Searches for two parallel SelectSources that select the same sources from the same parent and combines them into one.

convert_all_parents_relabelsources()

Tries to replace the parents of all RelabelSources.

convert_parents_relabelsources()

Tries to replace the parent of a RelabelSources with a copy that only represents the subset that is actually required for the RelabelSources. This is experimental code that is only called when the .convertRS flag is set.

copy_tree(try_to_make_external=None)

Creates a copy of the entire tree, except the Pass at the end.

The SourceCollections are automatically converted to Externals if possible when try_to_make_external is set.

find_attribute(attribute, sc=None, key=None, newcalculators='auto', allreadysearched=None)

Tries to find a SourceCollection that has the requested attribute for the sources in ‘sc’, which is usually the end node of the tree.

The ‘key’ function is used to rank all found SourceCollections that can provide the required attribute. The highest ranking SourceCollection is returned and the rest are cached.

New AttributeCalculators are initialized if no suitable existing SourceCollections are found. This is done by searching the AttributeCalculatorDefinitions that can calculate the attribute. This might require other attributes, which are searched for recursively.

The newcalculators parameter can be set as follows:
• False, never instantiate new ACs
• ‘auto’, only instantiate new ACs if no suitable existing ones can be found
• ‘force’, always instantiate new ACs, even if suitable ones are found

This function is called by add_attribute() if no origin for the attribute is given.
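
The ranking and the three newcalculators policies can be sketched as follows. This is a simplified illustration, not the actual implementation: `pick_provider`, the candidate dictionaries and the `rank` field are hypothetical, and letting existing candidates compete with new ones in ‘force’ mode is an assumption.

```python
from typing import Callable, Dict, List, Optional, Tuple, Union


def pick_provider(
    existing: List[Dict],
    make_new: Callable[[], List[Dict]],
    key: Callable[[Dict], float],
    newcalculators: Union[bool, str] = "auto",
) -> Tuple[Optional[Dict], List[Dict]]:
    """Return (best candidate, remaining candidates kept as a cache)."""
    candidates = list(existing)
    # False: never create; 'auto': create only when nothing exists;
    # 'force': always create.
    if newcalculators == "force" or (newcalculators == "auto" and not candidates):
        candidates += make_new()
    ranked = sorted(candidates, key=key, reverse=True)  # highest rank first
    if not ranked:
        return None, []
    return ranked[0], ranked[1:]


def make_new() -> List[Dict]:
    return [{"name": "new_ac", "rank": 3}]


def by_rank(scd: Dict) -> float:
    return scd["rank"]


existing = [{"name": "stored", "rank": 2}, {"name": "derived", "rank": 1}]

best_auto, cache_auto = pick_provider(existing, make_new, by_rank, "auto")
best_force, _ = pick_provider(existing, make_new, by_rank, "force")
best_never, _ = pick_provider([], make_new, by_rank, False)
best_fallback, _ = pick_provider([], make_new, by_rank, "auto")
```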

find_attribute_new_calculators(attribute, sc=None, allreadysearched=None, key=None)

Initializes new AttributeCalculators.

find_children_in_tree(sc)

Returns a list of SourceCollections that have the given sc as parent.

find_selection(query, sc=None, key=None)

Returns a FilterSources that has an equivalent query as ‘query’ and with a parent that describes the same sources as ‘sc’.

TODO MT H: [QUERYPARSING] Better parse equivalent queries.

find_sourcecollection_by_parameters(scclass, parents, **parameters)

Finds a SourceCollection by specifying its parameters.
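
A lookup by class, parents and keyword parameters could look roughly like the sketch below. The `ToySC` and `ToyFilter` classes and the `find_by_parameters` function are hypothetical; comparing parents by identity and returning None when nothing matches are assumptions, since the original text does not specify either.

```python
_MISSING = object()


class ToySC:
    """Hypothetical stand-in for a SourceCollection."""
    def __init__(self, parents=(), **parameters):
        self.parents = list(parents)
        for name, value in parameters.items():
            setattr(self, name, value)


class ToyFilter(ToySC):
    """Hypothetical stand-in for a FilterSources."""


def find_by_parameters(candidates, scclass, parents, **parameters):
    """Return the first candidate of the given class with matching parents and parameters."""
    parents = list(parents)
    for sc in candidates:
        same_parents = len(sc.parents) == len(parents) and all(
            a is b for a, b in zip(sc.parents, parents)
        )
        same_parameters = all(
            getattr(sc, name, _MISSING) == value
            for name, value in parameters.items()
        )
        if isinstance(sc, scclass) and same_parents and same_parameters:
            return sc
    return None  # assumption: no match is reported as None


base = ToySC()
f1 = ToyFilter([base], query="MAG < 20")
f2 = ToyFilter([base], query="MAG < 22")
found = find_by_parameters([base, f1, f2], ToyFilter, [base], query="MAG < 22")
missing = find_by_parameters([base, f1, f2], ToyFilter, [base], query="MAG < 18")
```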

find_sourcecollection_sources(sc=None, key=None, cache=True)

Returns an existing SourceCollection that describes only the sources of the given SourceCollection.

find_tracked_sc(sc)

Returns the metadata dictionary of tracked SourceCollection ‘sc’.

‘sc’ can either be a SourceCollection or an SCID.

get_export(sc=None)

Returns an ExportDict of the sc to be used with SAMP. Any options for the attributes in the corresponding tracked dictionary are added as well.

get_sourcecollection_sources(sc=None)

Returns a SourceCollection that describes the sources of this SC. Uses find_sourcecollection_sources() to determine if one exists.

get_status(sc=None)

Returns a more elaborate status indication that can be used over SAMP.

TODO MT M [SAMP] [SCT]: Implement this properly. Currently it only returns ‘ok’ or ‘unknown’.
insert_filtersources(query, sc=None)

Creates a new FilterSources that is inserted between the given SourceCollection ‘sc’ and its parent.

The given SourceCollection must be a Pass node, and must be the end node of the tree. See apply_filter() for details.

Attributes that are mentioned in the query are automatically added to the given SourceCollection before creating the FilterSources.

insert_selectsources(selected_sources, sc=None)

Adds a SelectSources between the given SourceCollection ‘sc’ and its parent. The SourceCollection ‘selected_sources’ is used as the .selected_sources of the SelectSources.

This function is called by apply_filter() when pulling data.

Currently this is only allowed for the end node of the tree, see apply_filter() for details.

integrate_all_parents()

Keep integrating parents until no more parents can be integrated.

integrate_parents()

Try to find a SourceCollection that can integrate its parent(s) into itself.

is_made(sc=None, checksources=True)

Checks whether the entire tree has been made.

is_made_node(sc=None, checksources=True)

Test whether one node is made.

key_functions = {'find_attribute': key_find_attribute, 'find_attribute_new_calculators': key_find_attribute_new_calculators, 'find_selection': key_find_selection, 'find_sources': key_find_sources}
load_data(sc=None, checkmade=True)

Loads the data of the end node of the tree into Python.

load_data_nn(sc=None, optimize=True, checkmade=True)

Loads the data of a node of the tree into Python.

Optimization should be done within the tree, not outside it.

load_data_python(sc=None, optimize=False, checkmade=False)
make(sc=None)

Makes the data of the SCs in the tree recursively.

TODO MT M [SCT]: Allow partial make of the tree.

TODO MT M [SCT]: Perform optimization steps in this function, not in the caller function.
make_dot_graph(name='collection%02i')
make_node()

Makes the data of one node.

move_all_filtersources_up()

Keep moving FilterSources’ up as long as possible.

The tree is simplified as much as possible after each move.

move_all_selectattributes_up()

Keeps calling move_selectattributes_up() until it fails.

.simplify_tree_full() is called after each move. This procedure might remove entire branches from the tree when an SA is moved up through a ConcatenateAttributes and one of the parents does not describe any of the selected attributes.
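
The branch removal mentioned above can be illustrated with a small sketch. It assumes a concatenation of parents that each describe a list of attribute names; `push_selection_up` and the attribute names are hypothetical and only mimic the idea, not the actual tree manipulation.

```python
from typing import Dict, List


def push_selection_up(parents: Dict[str, List[str]], selected: List[str]) -> Dict[str, List[str]]:
    """Split a selection of attributes over the parents of a concatenation.

    `parents` maps a parent name to the attributes it describes. Parents
    that describe none of the selected attributes are dropped entirely.
    """
    pushed = {}
    for name, attributes in parents.items():
        kept = [a for a in attributes if a in selected]
        if kept:
            pushed[name] = kept
    return pushed


parents = {"photometry": ["MAG_ISO", "MAG_AUTO"], "astrometry": ["RA", "DEC"]}
result = push_selection_up(parents, ["MAG_AUTO"])
```

Here the hypothetical "astrometry" parent disappears because it contributes nothing to the selection, which corresponds to a whole branch being removed from the tree.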

move_all_selectsources_down()

Keep moving SelectSources down as long as possible.

The tree is simplified as much as possible after each move.

move_all_selectsources_up()

Keep moving SelectSources’ up as long as possible.

The tree is simplified as much as possible after each move.

move_filtersources_up()

Tries to move one FilterSources one level up the tree. Starts at the end of the tree.

move_selectattributes_up()

Tries to move one SelectAttributes one level up the tree by calling move_up(). Starts at the end of the tree.

move_selectsources_down()

Tries to move one SelectSources one level down in the tree.

TODO MT M: [PARENTPARAM] This function should only search for SCs that have an SS as parent, not as process_parameter. Currently there is no explicit functionality for this.
TODO MT M: [MODIFICATIONS] Less should be hardcoded here:
• Any class with a SS as parent should be asked to move up, the class knows best whether this is doable.
• Except for SelectSources, an SS can always move up through another SS. This does not make sense in this context and should be checked here.
• A precaution should be made when moving an SS down through an SC that has a sourcelist_data. When doing this, the original SS should also remain in place, because otherwise the SC with the sl_data might store data that should not be stored.
• This leaves SS’s that could be moved down but should not. There currently is no mechanism to prevent this.
move_selectsources_up()

Tries to move one SelectSources one level up the tree, starting at the end of the tree.

optimize_for_load()

Optimizes the tree for loading data. In essence, it tries to place SelectAttributes and SelectSources as early in the tree and in every branch they are applicable to.

This function is also used to check whether data is made or to make new data. A temporary copy of the tree is created that should be discarded after use.

optimize_for_load_inner()
patch_relations()

Checks whether the .relations is still up to date with the current available knowledge of the tree. Currently only checks whether SCs are not empty.

print_tree(sourcecollection=None, indent=0, indent_width=3, max_list_length=5, print_sourcelists=False, print_queries=False)

Prints the hierarchy of a SourceCollection.

• sourcecollection (mandatory) = SourceCollection to be printed
• indent (0) = indentation of the whole tree (typically not set by user)
• indent_width (3) = indentation step width (typically not set by user)
• max_list_length (5) = maximum number of SourceCollections printed in a list (if None, no maximum)
• print_sourcelists (False) = print the hierarchy of the SourceLists as well

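
How the indent, indent_width and max_list_length parameters interact can be sketched with a recursive formatter. `ToyNode` and `format_tree` are hypothetical; in particular, the "... (n more)" marker for truncated lists is an assumption and may differ from what print_tree() actually prints.

```python
from typing import List, Optional


class ToyNode:
    """Hypothetical node with a name and a list of parents."""
    def __init__(self, name: str, parents: Optional[List["ToyNode"]] = None):
        self.name = name
        self.parents = parents or []


def format_tree(node: ToyNode, indent: int = 0, indent_width: int = 3,
                max_list_length: Optional[int] = 5) -> List[str]:
    """Return one line per node, indenting each level of parents further."""
    lines = [" " * indent + node.name]
    if max_list_length is None:
        shown = node.parents
    else:
        shown = node.parents[:max_list_length]
    for parent in shown:
        lines += format_tree(parent, indent + indent_width,
                             indent_width, max_list_length)
    hidden = len(node.parents) - len(shown)
    if hidden > 0:
        lines.append(" " * (indent + indent_width) + f"... ({hidden} more)")
    return lines


leaf = ToyNode("SourceListWrapper")
tree = ToyNode("Pass", [ToyNode("FilterSources", [leaf])])
wide = ToyNode("ConcatenateAttributes",
               [ToyNode("p0"), ToyNode("p1"), ToyNode("p2")])

tree_lines = format_tree(tree)
wide_lines = format_tree(wide, max_list_length=2)
print("\n".join(tree_lines))
```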
remove_all_dependencies()

Keep calling remove_dependencies() until it fails to remove dependencies.

remove_all_sourcecollections()

Keep removing SourceCollections until no more can be removed.

remove_dependencies()

Walks the tree forwards to search for a SourceCollection which can be modified with modify_remove_dependencies().

is_made() is called with a suitable ‘scsources’ on a SourceCollection before attempting a modification, because this might set all_data_stored, which is used to determine whether the SourceCollection can be converted into an External.

TODO MT L: [SCVSSCT] Perhaps the attempt to create an External should be performed in the SCT. This would remove the need for the scsources etc.
remove_sourcecollection()

Tries to remove a SourceCollection from the tree because the effect of its operator is identical to that of a Pass SourceCollection.

replace_attribute(attribute, origin, sc=None)

Replaces an attribute of an SC in the tree.

replace_in_tree(scold, scnew)

Searches the entire tree for occurrences of scold, and replaces them with scnew. This function exists to ensure that all persistent properties that used to point to the old SourceCollection will point to the new SourceCollection.

TODO MT M: [MODIFICATIONS] What should this function return, taking into account that:
• No changes have to be made if the scold is never referred to. Is this a success or a failure?
• Some changes might succeed, while some might not. Is this a success or a failure?

It is not clear that there is one single answer to these questions that covers all the cases for which this function is called.

TODO MT M: [MODIFICATIONS] Use the modify_substitute_progenitors() function from the SourceCollection class.
replace_tree_part(scold, scnew)

Replaces ‘scold’ in the tree with ‘scnew’. However, scnew represents a different catalog. Therefore every SC that depends on scold will need to be replaced as well, all the way to the parents of the final Pass node of the tree.
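
Replacing a part of the tree without modifying existing SourceCollections, as replace_tree_part() and change_parameter() describe, resembles a copy-on-write update. The sketch below assumes a simplified chain with a single parent per node; `Step` and `replace_part` are hypothetical names and not part of the package.

```python
from dataclasses import dataclass, replace
from typing import Optional


@dataclass(frozen=True)
class Step:
    """Hypothetical immutable node: an operator applied to one parent."""
    name: str
    parent: Optional["Step"] = None


def replace_part(node: Step, old: Step, new: Step) -> Step:
    """Return a tree in which `old` is swapped for `new`.

    Nodes that depend on `old` are rebuilt as copies; untouched nodes are reused.
    """
    if node is old:
        return new
    if node.parent is None:
        return node
    new_parent = replace_part(node.parent, old, new)
    if new_parent is node.parent:
        return node                      # nothing below changed, reuse as-is
    return replace(node, parent=new_parent)


source = Step("SourceListWrapper")
selection = Step("FilterSources", source)
end = Step("SelectAttributes", selection)

other_selection = Step("FilterSources-v2", source)
new_end = replace_part(end, selection, other_selection)
```

The original `end` still points at the original `selection`, while `new_end` is a copy that depends on the replacement. Both share the untouched `source` node.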

scidsskip = set()
selectsources_from_all_filtersources()

Tries to create SelectSources from all FilterSources, depending on .SSfromFS.

TODO MT L: [DOCUMENTATION] Properly describe the policies of this function depending on SSfromFS.
set_graph_after_replace(value=True)
simplify_all_attributes(*args, **kwargs)
simplify_attributes()

Convenience function to simplify a SelectAttributes or ConcatenateAttributes. This function is called on trees that are created while pulling data and might be committed later.

simplify_tree()

Convenience function to simplify the tree with the purpose of loading data. It is assumed that the tree will be discarded after use; this function should therefore not be used on trees that need to be committed.

simplify_tree_full()

Keep simplifying the tree until it cannot be simplified further.

sourcecollection_preferences = []
sourcecollections = {}
store_data(sc=None)

Stores the data of SourceCollections in the tree in the most optimal way.

Currently this means:
• The catalog data of SourceCollections that have a .sourcelist_data, but whose .all_data_stored is False, is stored.
• The catalog data of External SourceCollections that is not yet stored is stored.

Also stores the data of an External

swap_filtersources(fs, sc=None)

Swaps the parent of ‘sc’ with the FilterSources ‘fs’ if they describe the same attributes. If necessary a SelectAttributes is added as well.

track_all_concatenateattribute_children(cache=True, skipped=None, addrelations=True)

Tracks all ConcatenateAttribute SourceCollections of which all parents are already being tracked.

track_all_equal_children(cache=True, addrelations=True)

Adds all persistent children that describe the same sources as SourceCollections already in the tree.

track_all_filtersources_children(cache=True, addrelations=True)

Adds all FilterSources’ that have one of the tracked SourceCollections as parents to .sourcecollections.

track_children_auto(cache=True, addrelations=True)

Adds persistent children of SourceCollections in the tree. Determines which children to add automatically.

track_concatenateattribute_children(sc, cache=True, skipped=None, addrelations=True, alsoselectsources=True)

Tries to find SourceCollections with the given SourceCollection ‘sc’ as parent to add them to .sourcecollections.

Adds SelectSources of which the selected_sources are tracked as well.

track_equal_children(sc, cache=True, skipped=None, addrelations=True)

Searches for SourceCollections that have ‘sc’ as parent and describe the same set of sources and adds them to .sourcecollections.

track_filtersources_children(sc, cache=True, skipped=None, addrelations=True)

Searches for FilterSources’ that have SourceCollection ‘sc’ as parent and adds them to the tracked SourceCollections.

track_sc(sc)

Adds SourceCollection ‘sc’ to the tracked SCs in .sourcecollections.

track_tree(sc, addrelations=True)

Track the SourceCollections in the dependency tree of the given SourceCollection ‘sc’.

tracked_scids()

Returns a list of the SCIDs that are tracked by this SCT.

tracked_scs()

Returns a set of SourceCollections that are tracked by this SCT.

untrack_sc(sc)

Removes SourceCollection ‘sc’ from the tracked SourceCollections.

This is used when tracking SourceCollections temporarily, e.g. to find a suitable FilterSources that describes a required set of sources.
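
Temporary tracking can be sketched with a context manager that untracks only what it tracked itself. `ToyTracker` and `temporarily_tracked` are hypothetical; the real track_sc() and untrack_sc() take a SourceCollection rather than an explicit SCID, so the signatures below are simplifications.

```python
from contextlib import contextmanager


class ToyTracker:
    """Hypothetical stand-in that keeps tracked collections by SCID."""
    def __init__(self):
        self.sourcecollections = {}

    def track_sc(self, scid, sc):
        self.sourcecollections[scid] = {"sc": sc}

    def untrack_sc(self, scid):
        self.sourcecollections.pop(scid, None)

    def tracked_scids(self):
        return list(self.sourcecollections)


@contextmanager
def temporarily_tracked(tracker, scid, sc):
    """Track `sc` for the duration of the block, unless it was tracked already."""
    already = scid in tracker.sourcecollections
    if not already:
        tracker.track_sc(scid, sc)
    try:
        yield tracker
    finally:
        if not already:
            tracker.untrack_sc(scid)


tracker = ToyTracker()
tracker.track_sc(1, "base")
with temporarily_tracked(tracker, 2, "candidate"):
    inside = tracker.tracked_scids()
after = tracker.tracked_scids()
```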

astro.main.sourcecollection.SourceCollectionTree.key_find_attribute(scd)

Rank SourceCollections that contain a found attribute.

astro.main.sourcecollection.SourceCollectionTree.key_find_attribute_new_calculators(scd)

Rank new AttributeCalculators that contain a found attribute.

astro.main.sourcecollection.SourceCollectionTree.key_find_selection(scd)

Rank SourceCollections that contain a proper selection.

astro.main.sourcecollection.SourceCollectionTree.key_find_sources(scd)

Rank SourceCollections that describe the same set of sources.

## astro.main.sourcecollection.SourceListWrapper module¶

Wrapper around the SourceList. In principle there should be a one-to-one correspondence between a SourceList and a SourceList SourceCollection.

class astro.main.sourcecollection.SourceListWrapper.SourceListWrapper(sourcelist_data=None)

Wrapper around the SourceList. In principle there should be a one-to-one correspondence between a SourceList and a SourceList SourceCollection.

TODO LT M: [OPERATOR] Better integrate the SL with the SLW.

SCID

SourceCollection identifier [None]

all_data_stored

Flag to indicate whether all data has been stored in sourcelist_data, 0 means no (or unknown), 1 means yes

attribute_columns

Column names of the attributes in the sourcelist_data

attribute_names

Names of the attributes corresponding to the attribute_columns

copy(relations=None, copy_parents=True, copy_parameters=True, copy_cache=False, copy_supercache=False)
creation_date

Date this object was created [None]

detection_catalog = None
static find_sc(sourcelist, name=None, is_valid=None, attribute_names=None, attribute_columns=None)

Find this SourceCollection.

TODO: Integrate with OnTheFly exist()

TODO: Find transient SCs.

get_attributes_full(cache=False)

Get all the attributes.

get_object()

get_object is overloaded in SourceListWrapper to get the OBJECT from the SourceList if necessary.

static get_sc(sourcelist, attribute_names=None, attribute_columns=None, name=None, is_valid=None, force_name=False)

Find this SourceCollection and create it if necessary.

TODO: Integrate with OnTheFly get_onthefly()

TODO: Integrate with init?

get_slids()

Returns the SLID of the SourceList.

The same problems as with an External arise if the SourceList has a SLIDorg column. However, this should not be the case for normal SourceLists.

get_source_progenitors(identifier)

Returns a list of objects and identifiers that represent the progenitors of the source.

Identifier should be a (SLID, SID) combination.

Returns the SourceList + SID.

get_source_relations(relations=None, quick=False)

Returns a SetRelations object to hold the relations between SourceCollections.

A SourceListWrapper contains the same sources as its .sourcelist.

get_source_relations_sourcelist(sl=None, relations=None, quick=False)

Adds the sourcelist to the relations.

TODO: This function does not behave as it should in case of the presence of SLIDorg/SIDorg in the sourcelist.
is_valid

Manual/external flag to disqualify bad data (SuperFlag) [None]

load_data_python(row_identifiers=None)

Loads the data of the SourceList in case its catalog file is available.

mandatory_dependencies = (('sourcelist', 1),)
name

Name of the SourceCollection [None]

object_id

The object identifier

The object identifier is an attribute shared by all persistent instances. It is the prime key, by which object identity is established

process_status

A flag indicating the processing status [None]

quality_flags

Automatic/internal quality flag [None]

set_sourcelist(sl)

Sets the SourceList and the corresponding sourcelist_data etc.

sourcelist

SourceList that corresponds to this SourceCollection.

sourcelist_data

Optional SourceList containing the data described by this SourceCollection

sourcelist_sources

Optional SourceList containing the sources described by this SourceCollection

## astro.main.sourcecollection.TableConverterSC module¶

class astro.main.sourcecollection.TableConverterSC.TableConverterSC

This class is no longer necessary, the functionality is either incorporated in the astro.util.TableConverter or in SourceCollection.py. However, it is imported by some of the AttributeCalculators, so it cannot be removed.

store_data()
update_SIDorg()

## astro.main.sourcecollection.aweimports module¶

automatic imports for the interpreter

## Module contents¶

SourceCollection Hierarchy

This package contains the SourceCollection classes which extend the Astro-WISE principles of data pulling, data lineage and persistent data to catalog data.

A SourceCollection object represents a tabular structure where each row represents a source (identified by SLID/SID combination) and each column an attribute. The data represented by a SourceCollection is defined by an operation on other SourceCollections or other data objects. Every subclass of the SourceCollection represents a specific operator.
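
The tabular model described above can be made concrete with a toy example. This sketch is not the package's implementation: it represents a table as a plain dictionary keyed by (SLID, SID), and `filter_sources` and `select_attributes` are hypothetical functions that only mimic the FilterSources and SelectAttributes operators. The attribute names and values are invented.

```python
from typing import Callable, Dict, List, Tuple

SourceId = Tuple[int, int]                 # (SLID, SID)
Table = Dict[SourceId, Dict[str, float]]   # one row per source


def filter_sources(table: Table, predicate: Callable[[Dict[str, float]], bool]) -> Table:
    """Keep the rows (sources) for which the predicate holds."""
    return {sid: row for sid, row in table.items() if predicate(row)}


def select_attributes(table: Table, names: List[str]) -> Table:
    """Keep only the listed columns (attributes)."""
    return {sid: {n: row[n] for n in names} for sid, row in table.items()}


catalog: Table = {
    (101, 0): {"RA": 10.0, "DEC": -1.0, "MAG": 19.5},
    (101, 1): {"RA": 10.1, "DEC": -1.2, "MAG": 21.0},
}
bright = filter_sources(catalog, lambda row: row["MAG"] < 20)
positions = select_attributes(bright, ["RA", "DEC"])
```

Each function takes a table and returns a new one, which mirrors the idea that the data of a SourceCollection is defined by an operation on other SourceCollections.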

TODO MT M: [NOCAT] Improve on this documentation.

List of Classes with Persistent Attributes:

• SourceCollection
  persistent properties:
  • SCID
  • parent_collection, parent_collections
  • name
  • creation_date
  auxiliary persistent properties:
  • sourcelist_data, link to a SL
  • sourcelist_sources, link to a SL
  • attribute_columns
  • attribute_names
  • all_data_stored, flag to indicate whether it has been determined that all data is stored

• ConcatenateAttributes

• ConcatenateSources

• SelectAttributes
  • selected_attributes

• SelectSources
  • selected_sources, another SC that contains the selection of sources

• FilterSources
  • query, string representation of selection criterion (WHERE clause of SQL)

• RenameAttributes
  • attributes_old
  • attributes_new

• RelabelSources
  • associatelist, link to the AssociateList that contains the association

• SourceListWrapper
  • sourcelist

• External

• Pass

• AttributeCalculator
  • definition, a link to the AttributeCalculatorDefinition object responsible for the calculation
  • process_parameters