Meeting at Leicester to discuss XMM pipeline 22/4/02 (STH & DWE)
----------------------------------------------------------------

Mike Denby
----------

Leicester (and the rest of the SSC) are responsible for producing a pipeline, processed data products, and some analysis tools. All of these are made available to the guest observers.

XMM was launched in 2000, but it was a year before data got to Leicester (one year of backlog). Managed by ESA (ESTEC/Vilspa). Raw (FITS-like) data is sent to Leicester from Vilspa, where it is processed: a calibration file is generated, images are made, and the source detection algorithm is run. Sources are sent to Strasbourg for cross-correlation/identification, and the results come back to Leicester the next day. The data are then sent to one member of the consortium for manual eyeballing - they are required to fill out a log form checking for standard 'known' problems, e.g. hot columns, a bad astrometric solution, etc. Results are sent back to Leicester as and when. Final products are then made and sent back to Vilspa, where they are packaged and sent out to the guest observers.

They deal with approximately 1 Gb per day using two dedicated ISDN lines (2x64 kbit/s) - see the back-of-the-envelope check at the end of this section. The timescale to process one observation is a week - hence the need to parallelise.

Suggested that formal programming practices were useful - version control, bug reports. ESTEC provides the software infrastructure (GUIs etc.).

Before launch simulated data were provided, but the real data turned out to be very different; in the end the usefulness of the simulated data was limited. The software ended up more CPU-intensive than expected (e.g. new, complex algorithms developed for the generation of the calibration matrix) and ESA provided an upgrade to the computer. One of the problems was that the file formats kept changing during development (Logica?). (This seemed to be a reactive system - the moral is to be flexible.)

1600 observations so far (1 Gb each) - two years' worth of data. Similar to the CMT data rate. (Might be software management problems between JAC/IoA? Can't remember why I wrote this comment **STH <- because we are feeding some pared-down version of our software to the summit - there will be a requirement to provide documentation and supply cross-platform compatibility. Also our software will need to interface with the existing ORCADR suite.)

Perhaps it might be an idea to do as much of the reduction as possible at Hawaii before sending the data to Cambridge. (This need not necessarily be run by JAC, since logging on from Cambridge and activating aspects of the pipeline should be doable, cf. CMT.) It's important to overspecify the computers so that you can re-reduce the data. ** <- STH.

MD explained in some detail the horror story of having a year's worth of backlogged data to reduce when it finally arrived in Leicester in a format they could run the pipeline on. The backlog conspired with two other factors to push them to an untenable position: 1) increased complexity in the data reduction algorithms brought their benchmark down from being significantly over-specced to only being able to process the data at the rate it arrived; 2) this was the first time they had seen real data, so further development of the pipeline was required. The situation was eased by significantly upgrading the hardware.

Since it might be difficult to reduce non-survey observations automatically (observers wanting to do their own thing), it might be an idea to make these types of observation difficult to set up (or have defaults that are the way we want to observe). If observers persist, perhaps give them a warning that the observations might have problematic reductions.
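Back-of-the-envelope check of the link capacity mentioned above (our own arithmetic, not a figure quoted at the meeting; it assumes the quoted 1 Gb/day means roughly a gigabyte):

    # Rough check of the XMM data link quoted above: 2 x 64 kbit/s ISDN
    # against ~1 Gb/day.  Only the line rate and daily volume come from the
    # notes; everything else is assumption.
    line_rate_bps   = 2 * 64e3              # two ISDN B-channels, bits/s
    seconds_per_day = 86400.0

    capacity = line_rate_bps / 8 * seconds_per_day       # bytes/day
    print("capacity : %.2f GB/day" % (capacity / 1e9))   # ~1.38 GB/day

    daily_volume = 1e9                                    # ~1 Gb/day as quoted
    print("usage    : %.0f%%" % (100 * daily_volume / capacity))  # ~72%

i.e. the two lines are running at roughly three-quarters of capacity just to keep up with the daily rate.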
Benchmarking is very important. MD advised a factor of 10 over-specification of the computing, i.e. 24 hours of data should be reducible in 2-3 hours. For XMM, 4x was the requirement with 10x as the goal. The goal was abandoned early on as the calibration became more CPU-intensive; eventually they were at 1x (this is when the upgrade occurred).

Eyeball checking does add inertia, and uniformity of DQC when eyeballing is a problem - there is a need to lead the checkers to some extent. A large Perl script was written by someone in Germany (forgot the name) - it uses Tcl/Tk with POW, with a number of forced tick-boxes and check lists.

Computer system: 24-CPU Solaris UltraSPARC, 24 Gb memory (OPUS system?). Run with a central server and the remaining CPUs acting as clients. Data arrive as a tarball - when complete, the pipeline procedure is triggered. A blackboard system (see below) maintains the status of the processing in context; the database only provides information on observation id, coordinates, instrument, exposure time and so on. This system allows recovery of the pipeline at any stage, since intermediate part-processed files are kept on disk. This is desirable if, for example, system management tasks need to be run or the CPUs need to be powered down. Background tasks constantly check the blackboard files for data ready for the next stage in the pipeline - when some become available the required module is triggered. The status can be monitored via a GUI. The pipeline takes 12 hours to build on Solaris! More like 2-3 hours on a Linux PC.

The "blackboard" system is home-grown: a series of directories with ASCII files and flags showing status, so that they can control the processing, i.e. they do not use a commercial database for this aspect of the pipeline. (A minimal sketch of such a scheme is given at the end of this section.)

Manpower at SSC. Rôles:

  Project Scientist  - overall responsibility
  Project Manager    - admin., organizes meetings, minutes etc.
  Operations Manager - day-to-day management of the processing
  Systems Manager    - computers - very important
  Software Manager   - big rôle in testing, maintenance/upgrade of the pipeline
  Developers         - (130 tasks are used in the pipeline)
  Pipeline operator

These are not individual people: some of these rôles are carried out by the same person and some of them are shared (matrix management?). E.g. they have a dedicated systems manager who also does most of the handle-turning for the processing and the status checking. At the weekly meetings there are ~11 people present; most of these are contributing advisors rather than people involved in the day-to-day running of the pipeline. How many people are essential? A minimum of 2 (operator and systems manager), but during development many more.

DQC is strongly linked to the science: look at the science from the data to see if there are problems with it.

Software Problem Reporting (SPR): web-based bug reports which send out emails to the correct person. Might be very useful.

Each night a script runs to update an ASCII file (not a database) with keywords from the files. (I think this was to do with status, but am not sure.)

"If you started from scratch, what would you do differently?": Keep the system as simple as possible. Do not use O2 as your database; consider whether you need a formal database at all - can you cope with ASCII log files etc.? The use of web-based tools to distribute information is good since it stops people phoning up all the time. All outside contact goes through ESA as the first layer.
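A minimal sketch of the blackboard idea described above - one directory per observation, with empty flag files recording which stages are done, and a background task polling for work. The stage names, paths and polling interval are invented for illustration; the real SSC system is a much larger Perl/C++ suite.

    # Minimal sketch of a "blackboard": status lives in flag files on disk,
    # not in a database, so processing can be stopped and resumed at any stage.
    import time
    from pathlib import Path

    BLACKBOARD = Path("/data/blackboard")      # one sub-directory per observation (invented path)
    STAGES = ["ingested", "calibrated", "imaged", "sources_detected"]

    def next_stage(obs_dir):
        """Return the first stage whose 'done' flag is missing, or None if finished."""
        for stage in STAGES:
            if not (obs_dir / (stage + ".done")).exists():
                return stage
        return None

    def run_stage(obs_dir, stage):
        # ... call the real processing module for this stage here ...
        (obs_dir / (stage + ".done")).touch()  # record completion on the blackboard

    if __name__ == "__main__":
        while True:                            # background task polling the blackboard
            for obs_dir in sorted(p for p in BLACKBOARD.iterdir() if p.is_dir()):
                stage = next_stage(obs_dir)
                if stage is not None:
                    run_stage(obs_dir, stage)
            time.sleep(60)

Because all of the state is in flag files on disk, recovery after a power failure or a deliberate shutdown is just a matter of restarting the poller - which seems to be the property MD was emphasising.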
Most of the "system" is written in Perl/C++ (but also F90 and Tcl/Tk) on Solaris (forced by ESA, along with a freeze on the OS version). A useful tool was an email exploder for asking/answering questions; it seems that a (moderated) newsgroup might perform the same task. They would have preferred to use Linux and lots of PCs, since it would have been more efficient in terms of money.

Meetings to approve changes were important, but not too many of them, since they slow things down - a sort of oversight committee. Most of the pipeline problems were picked up by the internal tests.

Reprocessing of the data can be done by the guest observer via a script which is produced by the pipeline. This enables the GOs to reprocess their data using the latest calibrations.

Mike D suggested we should talk to Simon Rosen, who has built a summary database for the XID programme.

The pipeline is 250,000 lines of code in 200 "packages". An example of a package is: make an exposure map.

Julian Osborne
--------------

Write a document describing the end data product very early on. This is useful since it forces you to think about what is needed.

Suggested a simulator to test the entirety of the pipeline.

Showed SPR - we will probably need something like this, though more for documentation reasons than for nagging!

Test harness - a package/script which comes with every code release and enables the person building the pipeline to test it. (A rough sketch of the idea is given at the end of these notes.)

The biggest problem was not achieving the 10x overspec of the computing power needed. This came from not knowing the benchmarks early enough.

Suggested: testing, visualization, scientific checking/calibration.

Julian also warned about comparing the hardware lifetime to the lifetime of the mission - don't spend all the cash upfront.

Nice simple structure for packages: each is in a directory with DEPEND, DISTRIBUTION, ChangeLog and so on - probably standard practice for programmers. They are forced to the ESA programming standard. Do ESO have the same? Bet they do.

Want two things from Julian:
1) Pipeline Module Dependency Chart - much simpler than his original pipeline flowchart, which alarmed people. He did not use commercial software to step from flowchart to software - all was done manually. They use packagemaker to aid code release.
2) Product specifications.

NB - note the ISO filename rules. 27.3. A good filename can help with bookkeeping; watch out for case-sensitivity issues.

DWE's Overall impressions
-------------------------

Overspecification of the computing
Version control/bug reporting
Regular meetings
Testing/simulations

DWE's other thoughts
--------------------

We will have to make the system robust enough to cope with failures, i.e. reprocessing should be relatively easy. With large amounts of data I can see the pipeline operator sometimes making mistakes (e.g. using the wrong calibration files, doing tasks out of order), or we could get a power failure. Detecting and recovering from these problems should be painless.

STH's thoughts
--------------

Arggghhhh. Where is the pub?
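Rough sketch of the "test harness shipped with every package" idea from Julian's section, added while writing up. The task name, file names and layout are invented; a real harness would come with, and know about, its own package.

    # Minimal per-package test harness: run the package's task on a small
    # reference input and compare the result against a stored reference output.
    # Names and paths are invented for illustration.
    import subprocess
    import sys
    from pathlib import Path

    PKG     = Path(__file__).resolve().parent
    TASK    = PKG / "bin" / "make_exposure_map"       # the package's executable (hypothetical)
    REF_IN  = PKG / "test" / "reference_input.fits"
    REF_OUT = PKG / "test" / "reference_output.fits"
    NEW_OUT = PKG / "test" / "latest_output.fits"

    def main():
        subprocess.run([str(TASK), str(REF_IN), str(NEW_OUT)], check=True)
        # Crude byte-wise comparison; a real harness would compare header
        # keywords and pixel values to within a tolerance.
        if NEW_OUT.read_bytes() == REF_OUT.read_bytes():
            print("PASS:", TASK.name)
            return 0
        print("FAIL:", TASK.name)
        return 1

    if __name__ == "__main__":
        sys.exit(main())

The point is that whoever is building the pipeline can check each package on its own before it goes anywhere near real data.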