Distributed Processing

Definitions and Use Cases

Distributed processing means that certain recipes, like processes, can or will be executed on specially devoted processors. A recipe like process is a process which is started from the shell prompt (or via a gui) and takes only directions via arguments on the command line.

A data server is a server which can retrieve image files from the Astro-WISE image database, provide a local copy (cache) and store new images.

A database server (read Oracle server) is a server which provides the metadata for the astronomical images and the astronomical sourcelists.

A user is a person which has read and write access to (parts of) the database and therefore has a database username and password.

A visitor is only allowed to browse (parts of) the database and therefore does NOT (need to) have a database username and password.

A user can work in the Astro-WISE Environment in different ways:

  • Via one of the AW compute centres: the user does not have to install anything on his own workstation. He can only run the standard recipes and browse the database. The processes are run on specified machines of one of the AW compute centres (e.g. OmegaCEN).
  • Via a local install on one network node in a LAN. The local install should be done preferably by an expert according to the awe installation instructions. The user will hook up to the closest database server (Groningen, Munich, Napels) and the closest data server. If the user is not in the same domain as one of the data server centers (e.g. in Leiden) a caching data server could (should) be installed. He uses the standard Python code.
  • As above, but now the user wishes to modify the standard Python code. He can do a CVS checkout of the awe stuff and make local modifications. He does not have to install all the Python bindings since they are installed on a central node of the LAN. This is typical for experienced users or top-level software developers.
  • Via a complete local install on the user’s own workstation. The user has to install everything from Python to SWarp and should make his own Python bindings to for example eclipse and oracle. This is only for advanced software developers !

Certain processes are best performed on dedicated machines, for example

  • This task is used to create calibrated science frames. It runs best in parallel on a dedicated server (i.e. like the hpc of Groningen University).
  • This task is used to coadd a set of regridded frames. It runs best on a machine which has a lot of memory.
  • This task is used to create sourcelists from science or coadded frames. It runs best on a machine which is closely coupled to the database server.

Requirements

It is obvious that different types of users require different degrees of freedom. A user of type 1 has the least influence on his process: it is completely directed by some settings from the ‘management’ which the user cannot change.

A user of type 2 has the means to modify (and so influence) the Astro-WISE Environment by directing certain processes to specific machines, deviating from the default settings.

Users of type 3 want their modified Python code executed (on a special machine) so they should be allowed to ship their code to that special machine so that it gets executed.

Users of type 4 are on their own. Since they have modified non-Pythoncode it cannot be executed on a different node. They should always do their testing on the local node where non-Python things are adapted or setup their own system of remote modified nodes.

Available tasks

The following tasks should be executed on a cluster:
ReadNoise, Bias, HotPixels, ColdPixels, DomeFlat, TwilightFlat, FringeFlat, NightSkyFlat, MasterFlat, Gain, Zeropoint, Illumination, Science, Reduce.
The following tasks could/should be executed on a separate, single node:
Coadd, SourceList, AssociateList

User interface

From the above it is clear that users of type 1-3 at least need a web-interface to the different recipes and users of type 3-4 need also a command-line interface. Users of type 3 should be able to upload their modified code to the web-interface in order to get it executed.

The web-interface should provide an interface to all available recipes. The user simply clicks on the wanted recipe and is then prompted for the wanted input to that recipe. The user can also select on which processing machine the recipe is to be executed (can depend on type of user) with smart defaults provided of course. The user gets notified via the web-interface of the status of the job.

The web-interface might be extended with some kind of scripting-language to allow for loops etc. so if for example a user wants to make 300 sourcelists from a preselected set of coadded frames he should not have to click 300 times on recipe SourceList and supply the input. Even an interface to a (limited) version of the awe-prompt might be considered.

The user should also be able to download the products of interrest, like the coaddedregriddedframes just produced by recipe Coadd or the sourcelists produced by recipe SourceList. The output should be in FITS, VOTable or Python format.

Users of type 3-4 need also a command line interface to run a recipe (or maybe even a tool to run a recipe from the awe-prompt). A command line interface to run a recipe could look like:

Code shipping

Python code shipping can be done by zipping the whole local awe tree and sending it to the node on which the execution should take place. Python 2.3 has the nice feature of the writepy function in the PyZipFile module. Within a few seconds the code can be packed into a zip file and shipped elsewhere, where the packed (or unzipped) zip file can be placed in the PYTHONPATH i.e.//[1.5ex]


In case a user submits his own code the context of the job is switched to MyDB. This ensures that only results from legal/accepted/committed code is allowed to polute the database.

Processing Units

A processing unit is a unique virtual processor which consists of:

  • A frontend which accepts the jobs, submits the jobs for execution, monitors the jobs and reports the results back to the clients. It also accepts the Python code from the client and uses this instead of the standard code.

    The frontend allows a process manager to reconfigure it on the fly. A frontend is designated by its network name and port, it might also have an alias.

  • A backend which executes the jobs. It can be a set of parallel nodes of a beowulf cluster or a single machine. The backend can have its own queueing system or make use of the locally implemented queueing system (i.e. as on the Groningen HPC).

The awrecipe script starts a specified recipe and decides whether it needs to split the jobs in a set op parallel jobs or one job or a combination thereof. It sends the jobs to the frontend of a processing unit and waits until all its submitted jobs have finished and creates then a logfile from all the output received from the processing units (i.e. the individual logfiles generated by all individual processes).

The precursor to the awrecipe script is the generic_runner script which starts the Cluster_Manager.

The backend which executes the jobs could be configured in the following way: It starts the Cluster_Manager (or submits a job to a queue which does the same) and waits for the Cluster_Manager to report back that it is up and ready to recieve jobs (via unique keys handed out by the backend to the Cluster_Manager). When connection has been established the backend sends the jobs to the Cluster_Manager and waits for the Cluster_Manager to report that all jobs have been executed (or not in case of errors etc.). Then the Cluster_Manager quits after sending the complete logs to the backend.

The above configuration has the big advantage that it works for locally managed beowulf clusters as well as beowulf clusters with queueing systems like the Groningen HPC. And we can use the Cluster_Manager code with a slight change in frontend to allow for contact between processor unit backend and Cluster_Manager.

Problems

Problems encountered sofar:

  • The writepy function only ships binary Python code so the configuration files which are in the awe tree don’t get shipped. Solutions: migrate the configuration files from text to Python code or (easier) ship the configuration files as well.

  • The user specific AWE settings which also contains the context specification (i.e. in $HOME/.awe/Environment.cfg must be shipped. This could very wel be put in the same zip file.

  • Copying the shell environment variables which overrule the AWEconfiguration settings is also necessary. They could very well be stored in the same zip file.
    • The last 2 items must also be performed if the local code is NOT modified (user type 2).
    • When the user wants to use the web-interface he or she has to supply all the information via this interface, i.e. like username, context etc. (or perhaps upload his or her Environment.cfg?).
  • The recipe to be executed elsewhere can only be started from the command line since the PYTHONPATH has to be set before running the recipe. This is in fact not a problem but makes it incompatible with the current taskservers. It might be necessary to create a new server thingy.
    This problem is solved if we modify the Cluster_Manager so that it communicates with the backend of the processing unit as discussed above. Each time a job is submitted the Cluster_Manager gets started up from the shipped code and then runs the shipped code.
  • Since username/password combinations and other sensitive information are going to be ‘piped’ over the internet to remote taskservers secure connections are necessary.

  • Some parallel clusters will have a job queueing system (like the Groningen HPC). A taskserver must also be able to queue the jobs on such a system. Maybe it is possible (and allowed) to run a taskserver on the frontend machine of the new HPC.
    This problem is solved if we modify the Cluster_Manager so that it communicates with the backend of the processing unit as discussed above. Then it would suffice to have a processing unit running in our domain which submits jobs to the queue via a remote script. Note that we also need to ship the Python code of the Cluster_Manager

Target Processing

Introduction

Definitions

What is reprocessing ?

  • When an object is made / retrieved only that part of the object and its dependencies are (re)made which are not up to date or do not exist. Parts of the object and its dependencies which are up to date (and so exist) can be reused and should not be remade.

What is Target Progressing ?

  • Target processing is (re)processing the target because the target is requested. A request is made for a certain target, when the corresponding object exists, this is returned, otherwise it will be made.

In general the following scheme can be used when an object is asked for:

Exist() ?
Yes -> UpToDate() ?
    Yes -> Use existing object
    No -> ReMake()
No -> Make()

When an object is asked for, first it is checked whether it already exists. If it exists it is checked if the object is up to date together with all its dependencies. If it exists and it is up to date, the existing object is used, otherwise the object is (re)made.

Methods

To implement this in the existing object model a method must de defined which will act as the entry point, which is called ‘when an object is asked for’. Furthermore a method for testing if an object exists and whether it is up to date should be added to each class. So for target- and reprocessing we add three functions to the each ‘makable’ class:

  • exist([parameters])
    
  • uptodate(date=date)
    
  • get([parameters],[switches])
    

The exist function checks whether or not an object of that class already exists for the given parameters. The uptodate function checks if the given object and all its dependencies are up to date, for the given date. The get function checks for existence of the requested object, and returns it or makes it.

Parameters

The parameter list consists of :

[parameters] : date = date,
               chip_name = chip_name,
               filter_name = filter_name,
               instrument_name = instrument_name,
               object_name = object_name,
               exptime = exptime,
               process_par = {proces_parameters}

The first parameters control which object will be selected , but not how it was (or should be) processed. The last parameter specifies the process parameter(s), this can be the process parameters of the current object or one of its dependencies. The number of process parameters is variable, and should be given as a dictionairy. Following are a few examples how to specify process parameters of dependant classes of the ScienceFrame class.

  • Example 1; to specify the maximum_abs_mean of the BiasFrame used to construct the ReducedScienceFrame for a asked ScienceFrame:

    ScienceFrame.ReducedScienceFrame.BiasFrame.process_params.MAXIMUM_ABS_MEAN = 2
    
  • Example 2; the rejection_threshold of the HotPixelMap of the WeightFrame:

    ScienceFrame.WeightFrame.HotPixelMap.process_params.REJECTION_THRESHOLD = 2
    

The process parameters are used by the current object or passed on to its dependencies, when passing on the object removes its own class name. The proces parameters can not be passed as individual arguments to the get method because Python does not allow dots in parameter names.

Switches

Furthermore it is possible to specify verbose output and a test-run with switches. The verbose switch (verbose=1) specifies verbose output. The test-run switch (testrun=1) specifies to go through all dependencies but not to actually build anything. This will give an overview of the dependencies of the object and what must be (re)build.

Example, ScienceFrame

Following is the (semi) source code of the get function for the ScienceFrame class, in this example the switches and process parameters method are not implemented:

def get([parameters]):

    # check for existence and up to date
    scienceframe = ScienceFrame.exist(parameters)
    if scienceframe :
        if scienceframe.uptodate() :
            return scienceframe
    else:
        # science frame does not exist, make it
        scienceframe = ScienceFrame()
        science.set_filename()

    # get the raw frame
    raw_science_frame = SelectRawFrame(parameters)

    # get all calibration objects
    gainlin = GainLinearity.get(parameters)
    photom = PhotometricParameters.get(parameters)

    # retrieve all calibration files
    gainlin.retrieve()
    photom.retrieve()

    # get the weight frame
    weight = WeightFrame.get(parameters)

    # get the reduced frame
    reduced = ReducedFrame.get(parameters)

    science.reduced = reduced
    science.weight = weight
    science.gainlin = gainlin
    science.photom = photom
    science.make()

    science.store()
    science.commit()
    return science

Example, HotPixelMap

For the HotPixelMap class the exist, update and get function are written out, including the switches. The exist function:

def exist(cls, date='1980-12-31', chip_name='aChip'):

    hot = HotPixelMap.select(date=date, chip=chip_name)
    return hot

exist = classmethod(exist)

The uptodate function:

def uptodate(self,date='', verbose=0):

    ret = 1
    # check self for being up to date
    if(date == '') :
        date = self.bias.raw_bias_frames[0].DATE_OBS - datetime.timedelta(0.5)
        date = datetime.datetime(date.year, date.month, date.day)
        print "No date specified for HotPixelMap.uptodate, using date=%s" % (date)
    else:
        date = str_to_datetime(date)

    if( HotPixelMap.exist( date=date, chip_name=self.chip.name ) != self ) : ret = 0

    # check depencies for being up to date
    if not self.bias.uptodate(date=date,verbose=verbose) : ret = 0

    if( verbose ) : print 'HotPixelMap.uptodate %d' % (ret)

    return ret

The get function:

def get(cls, date='1980-12-31', chip_name='aChip', verbose=0, testrun=0):

    # first check on existance and being uptodate
    hotpixelmap = HotPixelMap.exist(date=date, chip_name=chip_name)
    if hotpixelmap:
        if hotpixelmap.uptodate(date=date, verbose=verbose):
            if( verbose ) : print 'HotPixelMap exists and is up to date; chip %s, date %s' % (chip_name, date)
            return hotpixelmap
        else:
            if( verbose ) : print 'HotPixelMap exists, but is not up to date; chip %s, date %s' % (chip_name, date)
    else :
        # make a hot pixel map
        if( verbose ) : print 'HotPixelMap does not exist; chip %s, date %s' % (chip_name, date)
        hotpixelmap = HotPixelMap()

    # get all the dependencies
    bias = BiasFrame.get(date=date, chip_name=chip_name, verbose=verbose, testrun=testrun)

    if not bias :
        msg = 'BiasFrame.get failed to return bias frame for; chip %s, date %s' % (chip_name, date)
        raise HotPixelMapError, msg

    # if its a test run, stop here
    if( testrun ) : return None

    bias.retrieve()

    hotpixelmap = HotPixelMap()
    hotpixelmap.bias = bias
    hotpixelmap.set_filename()
    hotpixelmap.make()
    hotpixelmap.store()
    hotpixelmap.commit()

    return hotpixelmap

get = classmethod(get)

Example, Theoretical

Example with object A, which has a dependency on object B and C, where object C depends on object D. All the objects exist, object D is out of date, there is a newer version D* which could be used by object C.

Original object tree:

object A
    object B
    object C
        object D

There is a request for object A:

Does A exist ? (Yes)
    Is A up to date ?
        Is B up to date ? (Yes)
        Is C up to date ?
            Is D up to date ?
            No
        No (C is not up to date)
    No (A is not up to date)

Object A does exist, but is not up to date because one of its dependencies (D) is not up to date. Lets rebuild A:

make A
    make B (return the same B, because B is up to date)
    make C (after making D, becomes C*)
        make D (D* replaces D)

This results in the new object A*, with new object C* and D*:

object A*
    object B
    object C*
        object D*

Usage

Here are some simple usage examples.

  • Get the ScienceFrame with date 20-oct-2000 and chip ccd50:

    ScienceFrame.get(date='2000-10-20', chip_name='ccd50')
    

    when nessecary this will make the requested ScienceFrame and all its dependencies.

  • Is my ScienceFrame up to date for date 12-feb-2003:

    myScienceFrame.uptodate(date='2003-02-12')
    

    also checks whether all the dependencies of myScienceFrame are up to date.

  • Is there a ScienceFrame for date 4-apr-2000 and chip ccd50:

    ScienceFrame.exist(date='2000-04-04', chip_name='ccd50')
    

    returns the ScienceFrame if it exists, else None.

Problems

  • How to do distributed processing ? How to (automatically) split the work into sub tasks and distribute these to different cpu’s (nodes on a cluster)?
  • Most RawFrames have an 1:N relation with their parent object, 1 BiasFrame has N RawBiasFrames. The RawFrame does only know if it is part of the set of N RawFrames, not if the set of N RawFrames is complete. The parent object should check if the set is complete. This is not a ‘problem’, mere an exception of how up to date is determined.

Graphical User Interface

This chapter describes all the available user interfaces of the Astro-WISE system.

awe-prompt

The Astro-WISE Environment (AWE) prompt provides the maximum control over the system on the lowest level. This prompt is based on the Python prompt, and should be used by type 2 users and higher. There is no GUI avaible for this interface, it is solely text based. Details on installation and usage can be found in the Astro-WISE Environment - User and Development manual.

There is an online (html) version of this prompt. Although this should not be used for heavy calculations because it is running the http server and not a dedicated machine. This can be used for quick inspections and simple calculations when there is no local awe-prompt available.

Calibration Timestamps

The calibration timestamps editor (or calts) is an html viewer and editor of the calibration timestamps. It is accessible from anywhere (given an internet connection) and usable with any browser. Viewing is possible for anyone, for editing a database account is needed. The timestamp information, when a certain calibartion is valid, can be viewed in table and grpahical form for a selection of instrument, chip, filter and dates. Furthermore it is possible to adjust the super user flag, disabling calibration and raw files.

DB viewer

The DB viewer is a html viewer of the Oracle database, almost all information stored in the database can be view with this GUI.

Recipes

This section will describe the GUI of the available recipes.