Meeting at Leicester to discuss XMM pipeline 22/4/02 (STH & DWE)
----------------------------------------------------------------

Mike Denby
----------

Leicester (and the rest of the SSC) are responsible for producing a pipeline, processed data products, and some analysis tools. All of these are made available to the guest observers.

XMM was launched in 2000, but it was a year before data got to Leicester (one year of backlog). Managed by ESA (ESTEC/Vilspa). Raw (FITS-like) data is sent to Leicester from Vilspa, where it is processed: a calibration file is generated, images are made, and the source detection algorithm is run. Sources are sent to Strasbourg for cross-correlation/identification, and the results come back to Leicester the next day. The data are then sent to one member of the consortium for manual eyeballing - they are required to fill out a log form checking for standard 'known' problems, e.g. hot columns, a bad astrometric solution, etc. Results are sent back to Leicester as and when. Final products are then made and sent back to Vilspa, where they are packaged and sent out to the guest observers.

They deal with approximately 1 Gb per day using two dedicated ISDN lines (2x64 kbit/s) - see the back-of-the-envelope check at the end of this section. The timescale to process one observation is a week - hence the need to parallelise.

Suggested that formal programming practices were useful - version control, bug reports. ESTEC provides the software infrastructure (GUIs etc.).

Before launch simulated data were provided, but the real data turned out to be very different; in the end the usefulness of the simulated data was limited. The software ended up more CPU-intensive than expected (e.g. new, complex algorithms developed for the generation of the calibration matrix) and ESA provided an upgrade to the computer. One of the problems was that the file formats kept changing during development (Logica?). (This seemed to be a reactive system - the moral is to be flexible.)

1600 observations so far (1 Gb each) - two years' worth of data. Similar to the CMT data rate. (Might be software management problems between JAC/IoA? Can't remember why I wrote this comment **STH <- because we are feeding some pared-down version of our software to the summit - there will be a requirement to provide documentation and supply cross-platform compatibility. Also our software will need to interface with the existing ORCADR suite.)

Perhaps it might be an idea to do as much of the reduction as possible at Hawaii before sending the data to Cambridge. (This need not necessarily be run by JAC, since logging on from Cambridge and activating aspects of the pipeline should be doable, cf. CMT.) It's important to overspecify the computers so that you can re-reduce the data. ** <- STH.

MD explained in some detail the horror story of having a year's worth of backlogged data to reduce when it finally arrived in Leicester in a format they could run the pipeline on. The backlog conspired with two other factors to push them to an untenable position: 1) increased complexity in the data reduction algorithms brought their benchmark down from being significantly over-specced to only being able to process the data at the rate it arrived; 2) this was the first time they had seen real data, so further development of the pipeline was required. The situation was eased by significantly upgrading the hardware.

Since it might be difficult to reduce non-survey observations automatically (observers wanting to do their own thing), it might be an idea to make these types of observation difficult to set up (or have defaults that are the way we want to observe). If observers persist, perhaps give them a warning that the observations might have problematic reductions.
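Back-of-the-envelope check of the link capacity mentioned above (our own arithmetic, not a figure quoted at the meeting; it assumes the quoted 1 Gb/day means roughly a gigabyte):

    # Rough check of the XMM data link quoted above: 2 x 64 kbit/s ISDN
    # against ~1 Gb/day.  Only the line rate and daily volume come from the
    # notes; everything else is assumption.
    line_rate_bps   = 2 * 64e3              # two ISDN B-channels, bits/s
    seconds_per_day = 86400.0

    capacity = line_rate_bps / 8 * seconds_per_day       # bytes/day
    print("capacity : %.2f GB/day" % (capacity / 1e9))   # ~1.38 GB/day

    daily_volume = 1e9                                    # ~1 Gb/day as quoted
    print("usage    : %.0f%%" % (100 * daily_volume / capacity))  # ~72%

i.e. the two lines are running at roughly three-quarters of capacity just to keep up with the daily rate.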
Benchmarking is very important. MD advised a factor of 10 over-specification of the computing, i.e. 24 hours of data should be reducible in 2-3 hours. For XMM, 4x was the requirement with 10x as the goal. The goal was abandoned early on as the calibration became more CPU-intensive; eventually they were at 1x (this is when the upgrade occurred).

Eyeball checking does add inertia, and uniformity of DQC when eyeballing is a problem - there is a need to lead the checkers to some extent. A large Perl script was written by someone in Germany (forgot the name) - it uses Tcl/Tk with POW, with a number of forced tick-boxes and check lists.

Computer system: 24-CPU Solaris UltraSPARC, 24 Gb memory (OPUS system?). Run with a central server and the remaining CPUs acting as clients. Data arrive as a tarball - when complete, the pipeline procedure is triggered. A blackboard system (see below) maintains the status of the processing in context; the database only provides information on observation id, coordinates, instrument, exposure time and so on. This system allows recovery of the pipeline at any stage, since intermediate part-processed files are kept on disk. This is desirable if, for example, system management tasks need to be run or the CPUs need to be powered down. Background tasks constantly check the blackboard files for data ready for the next stage in the pipeline - when some become available the required module is triggered. The status can be monitored via a GUI. The pipeline takes 12 hours to build on Solaris! More like 2-3 hours on a Linux PC.

The "blackboard" system is home-grown: a series of directories with ASCII files and flags showing status, so that they can control the processing, i.e. they do not use a commercial database for this aspect of the pipeline. (A minimal sketch of such a scheme is given at the end of this section.)

Manpower at SSC. Rôles:

  Project Scientist  - overall responsibility
  Project Manager    - admin., organizes meetings, minutes etc.
  Operations Manager - day-to-day management of the processing
  Systems Manager    - computers - very important
  Software Manager   - big rôle in testing, maintenance/upgrade of the pipeline
  Developers         - (130 tasks are used in the pipeline)
  Pipeline operator

These are not individual people: some of these rôles are carried out by the same person and some of them are shared (matrix management?). E.g. they have a dedicated systems manager who also does most of the handle-turning for the processing and the status checking. At the weekly meetings there are ~11 people present; most of these are contributing advisors rather than people involved in the day-to-day running of the pipeline. How many people are essential? A minimum of 2 (operator and systems manager), but during development many more.

DQC is strongly linked to the science: look at the science from the data to see if there are problems with it.

Software Problem Reporting (SPR): web-based bug reports which send out emails to the correct person. Might be very useful.

Each night a script runs to update an ASCII file (not a database) with keywords from the files. (I think this was to do with status, but am not sure.)

"If you started from scratch, what would you do differently?": Keep the system as simple as possible. Do not use O2 as your database; consider whether you need a formal database at all - can you cope with ASCII log files etc.? The use of web-based tools to distribute information is good since it stops people phoning up all the time. All outside contact goes through ESA as the first layer.
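A minimal sketch of the blackboard idea described above - one directory per observation, with empty flag files recording which stages are done, and a background task polling for work. The stage names, paths and polling interval are invented for illustration; the real SSC system is a much larger Perl/C++ suite.

    # Minimal sketch of a "blackboard": status lives in flag files on disk,
    # not in a database, so processing can be stopped and resumed at any stage.
    import time
    from pathlib import Path

    BLACKBOARD = Path("/data/blackboard")      # one sub-directory per observation (invented path)
    STAGES = ["ingested", "calibrated", "imaged", "sources_detected"]

    def next_stage(obs_dir):
        """Return the first stage whose 'done' flag is missing, or None if finished."""
        for stage in STAGES:
            if not (obs_dir / (stage + ".done")).exists():
                return stage
        return None

    def run_stage(obs_dir, stage):
        # ... call the real processing module for this stage here ...
        (obs_dir / (stage + ".done")).touch()  # record completion on the blackboard

    if __name__ == "__main__":
        while True:                            # background task polling the blackboard
            for obs_dir in sorted(p for p in BLACKBOARD.iterdir() if p.is_dir()):
                stage = next_stage(obs_dir)
                if stage is not None:
                    run_stage(obs_dir, stage)
            time.sleep(60)

Because all of the state is in flag files on disk, recovery after a power failure or a deliberate shutdown is just a matter of restarting the poller - which seems to be the property MD was emphasising.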
Most of the "system" is written in Perl/C++ (but also F90 and Tcl/Tk) on Solaris (forced by ESA, along with a freeze on the OS version). A useful tool was an email exploder for asking/answering questions; it seems that a (moderated) newsgroup might perform the same task. They would have preferred to use Linux and lots of PCs, since it would have been more efficient in terms of money.

Meetings to approve changes were important, but not too many of them, since they slow things down - a sort of oversight committee. Most of the pipeline problems were picked up by the internal tests.

Reprocessing of the data can be done by the guest observer via a script which is produced by the pipeline. This enables the GOs to reprocess their data using the latest calibrations.

Mike D suggested we should talk to Simon Rosen, who has built a summary database for the XID programme.

The pipeline is 250,000 lines of code in 200 "packages". An example of a package is: make an exposure map.

Julian Osborne
--------------

Write a document describing the end data product very early on. This is useful since it forces you to think about what is needed.

Suggested a simulator to test the entirety of the pipeline.

Showed SPR - we will probably need something like this, though more for documentation reasons than for nagging!

Test harness - a package/script which comes with every code release and enables the person building the pipeline to test it. (A rough sketch of the idea is given at the end of these notes.)

The biggest problem was not achieving the 10x overspec of the computing power needed. This came from not knowing the benchmarks early enough.

Suggested: testing, visualization, scientific checking/calibration.

Julian also warned about comparing the hardware lifetime to the lifetime of the mission - don't spend all the cash upfront.

Nice simple structure for packages: each is in a directory with DEPEND, DISTRIBUTION, ChangeLog and so on - probably standard practice for programmers. They are forced to the ESA programming standard. Do ESO have the same? Bet they do.

Want two things from Julian:
1) Pipeline Module Dependency Chart - much simpler than his original pipeline flowchart, which alarmed people. He did not use commercial software to step from flowchart to software - all was done manually. They use packagemaker to aid code release.
2) Product specifications.

NB - note the ISO filename rules. 27.3. A good filename can help with bookkeeping; watch out for case-sensitivity issues.

DWE's Overall impressions
-------------------------

Overspecification of the computing
Version control/bug reporting
Regular meetings
Testing/simulations

DWE's other thoughts
--------------------

We will have to make the system robust enough to cope with failures, i.e. reprocessing should be relatively easy. With large amounts of data I can see the pipeline operator sometimes making mistakes (e.g. using the wrong calibration files, doing tasks out of order), or we could get a power failure. Detecting and recovering from these problems should be painless.

STH's thoughts
--------------

Arggghhhh. Where is the pub?
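Rough sketch of the "test harness shipped with every package" idea from Julian's section, added while writing up. The task name, file names and layout are invented; a real harness would come with, and know about, its own package.

    # Minimal per-package test harness: run the package's task on a small
    # reference input and compare the result against a stored reference output.
    # Names and paths are invented for illustration.
    import subprocess
    import sys
    from pathlib import Path

    PKG     = Path(__file__).resolve().parent
    TASK    = PKG / "bin" / "make_exposure_map"       # the package's executable (hypothetical)
    REF_IN  = PKG / "test" / "reference_input.fits"
    REF_OUT = PKG / "test" / "reference_output.fits"
    NEW_OUT = PKG / "test" / "latest_output.fits"

    def main():
        subprocess.run([str(TASK), str(REF_IN), str(NEW_OUT)], check=True)
        # Crude byte-wise comparison; a real harness would compare header
        # keywords and pixel values to within a tolerance.
        if NEW_OUT.read_bytes() == REF_OUT.read_bytes():
            print("PASS:", TASK.name)
            return 0
        print("FAIL:", TASK.name)
        return 1

    if __name__ == "__main__":
        sys.exit(main())

The point is that whoever is building the pipeline can check each package on its own before it goes anywhere near real data.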