From nch@roe.ac.uk Mon Jan  9 21:17:39 2006
Date: Mon, 9 Jan 2006 11:00:59 +0000 (GMT)
From: Nigel Hambly <nch@roe.ac.uk>
To: WFCAM Science Archive Team -- Eckhard Sutorius <etws@roe.ac.uk>,
     Johann Bryant <jb@roe.ac.uk>, Mike Read <mar@roe.ac.uk>,
     Nigel Hambly <nch@roe.ac.uk>, Nicholas Cross <njc@roe.ac.uk>,
     Bob Mann <rgm@roe.ac.uk>, Ross Collins <rsc@roe.ac.uk>
Cc: CCs for WSA weekly meeting minutes distribution -- Andrew Lawrence
    <al@roe.ac.uk>, Andy Adamson <a.adamson@jach.hawaii.edu>,
     John Taylor <jdt@roe.ac.uk>, Jim Emerson <j.p.emerson@qmul.ac.uk>,
     Malcolm Stewart <jms@roe.ac.uk>, Martin Hill <mch@roe.ac.uk>,
     Mike Irwin <mike@ast.cam.ac.uk>, Peredur Williams <pmw@roe.ac.uk>,
     Stephen Warren <s.j.warren@ic.ac.uk>
Subject: WFAU WSA weekly project meeting minutes: 6th January 2006

Minutes of WFCAM Science Archive meeting:    6th January 2006
-------------------------------------------------------------
-------------------------------------------------------------

Present:       NCH, ETWS, MAR, AL, RSC, PMW
Apologies:     JPE, JDT, MCH, RGM, JMS, NJC, JB

DONM: 10am, Friday 13th January 2006, plate library

Happy New Year to all our readers.


Actions discharged:
-------------------

ACTION: NJC to send some examples of apparently inconsistent image
         attribute sets to JAC.
Discharged; reply from Paul Hirst received and it seems that these
are all cases of early teething problems in the dataflow system
upstream, rather than any sinister bugs in pipeline/archive.

ACTION: NCH & RSC to chase up and fix the pairing code bug.
Discharged; bug fixed; pairing tested and found to be OK.


Actions partly discharged but continuing:
-----------------------------------------

ACTION: JDT to set up a Datascope Launcher for the WSA on the AG
         workbench pages.
Continues.


Actions carried forward from 09/12/05 meeting:
----------------------------------------------

ACTION: NCH to investigate the usefulness of a NAS solution for medium
         term archive mass storage.
  - CONTINUES; NCH reported a useful chat with Eclipse Computing, who
have been supplying and installing NAS boxes for some of their customers.

ACTION: ETWS and NCH to sort out image DQC schema and ingest
  - CONTINUES; NCH noted that he and RSC had been working hard on this,
implementing a procedure to add in new attributes to existing metadata
tables and populate them from all FITS files held in the archive.


Specific points and new actions:
--------------------------------

Project management:

NCH welcomed everybody back for the New Year with single malt 
(courtesy Eclipse Computing) and shortbread.

PMW noted that there is no VDMT in January, but the Q1 plan is being
finalised given fresh input from UKIDSS and the requirement to
produce an EDR by the end of Jan. NCH noted that this is likely to
result in a fire-fight for most of Jan; hence "planned" work will
likely spill over into Feb/Mar.

With reference to the "things to do" list itemised in the last
minutes:

1) Check that the merged source list pairing is producing clean lists
NCH and RSC have done this, fixing small bugs where necessary; NCH has
also suggested to UKIDSS that the pairing radius be reduced and is
taking advice from survey heads and CASU on the optimum choice.

2) Catch incorrectly reobserved frames as NOT multiple epoch revisits
    and generally make frame merging a little more intelligent over
    choice of frame when there is more than one
Needs to be implemented in the general iteration of the software for
the next data release.
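
As an illustration of the pairing-radius question in item 1, here is a
minimal sketch of radius-limited nearest-neighbour pairing. It is not the
WSA pairing code: the function names are hypothetical, the search is brute
force, and the radii in the usage note are illustrative values only.

```python
import math


def angular_sep_arcsec(ra1, dec1, ra2, dec2):
    """Great-circle separation in arcsec (haversine form); inputs in degrees."""
    r1, d1, r2, d2 = map(math.radians, (ra1, dec1, ra2, dec2))
    h = (math.sin((d2 - d1) / 2) ** 2
         + math.cos(d1) * math.cos(d2) * math.sin((r2 - r1) / 2) ** 2)
    return math.degrees(2 * math.asin(math.sqrt(h))) * 3600.0


def pair_sources(list_a, list_b, radius_arcsec):
    """For each (ra, dec) in list_a, return the index of the nearest
    list_b source within radius_arcsec, or None if there is none."""
    pairs = []
    for ra_a, dec_a in list_a:
        best, best_sep = None, radius_arcsec
        for j, (ra_b, dec_b) in enumerate(list_b):
            sep = angular_sep_arcsec(ra_a, dec_a, ra_b, dec_b)
            if sep <= best_sep:
                best, best_sep = j, sep
        pairs.append(best)
    return pairs
```

With two sources 1.8 arcsec apart, pair_sources(a, b, 2.0) pairs them
and pair_sources(a, b, 1.0) does not, which is the kind of trade-off a
reduced radius implies in crowded fields.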

QC:
a) Catch frames with bad channels / large spurious detection numbers
b) Catch frames with moon ghosts
c) Ensure all available DQC is propagated into metadata tables (e.g.
    sky sub scale parsed from FITS comment).
d) Implement pre-release QC in collaboration with SJW.
All going ahead as part of general QC which is being discussed and
implemented this week in consultation with SJW and SD.

Wish list:
a) Include SQL query in results set output in appropriate format(s)
    and echo to the results page
d) > 100 columns for formatted HTML summary results
e) larger limit to results set size
g) document/implement default flux estimator as Aper3 as being
    generally appropriate
h) Guide for simple usage to prevent "frequently made cock-ups",
    e.g. careful filtering when using merged source tables
i) Sexagesimal format on results page and consistent orientation of
    thumbnails
j) Dictionary translation between pipeline FITS keys and table
    attributes
k) Reduce source pairing radius for GPS, as per WG wishes
Actions ongoing on MAR, ETWS, NCH on all of the above.

b) Client-end query tool for scripting
c) Seeing in arcsec as a new attribute
f) petro (kron/hall?) radius in merged source table
At this stage, these are seen as lower priority...
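
For wish list item i), the decimal-degree to sexagesimal conversion can be
sketched as below. This is only an illustration under assumptions: the
function name and the output precision (0.01 s in RA, 0.1 arcsec in Dec)
are hypothetical choices, not the format adopted for the results page.

```python
def to_sexagesimal(ra_deg, dec_deg):
    """Format decimal degrees as 'HH:MM:SS.ss +DD:MM:SS.s'."""
    # Work in integer hundredths of a second of time so that rounding
    # can never produce a seconds field of 60.00.
    ra_cs = round((ra_deg % 360.0) / 15.0 * 3600.0 * 100) % (24 * 3600 * 100)
    h, rem = divmod(ra_cs, 3600 * 100)
    m, cs = divmod(rem, 60 * 100)

    # Same idea for declination, in integer tenths of an arcsecond.
    sign = "-" if dec_deg < 0 else "+"
    dec_ds = round(abs(dec_deg) * 3600.0 * 10)
    d, rem = divmod(dec_ds, 3600 * 10)
    dm, ds = divmod(rem, 60 * 10)

    return f"{h:02d}:{m:02d}:{cs / 100:05.2f} {sign}{d:02d}:{dm:02d}:{ds / 10:04.1f}"
```

For example, to_sexagesimal(180.0, -30.5) returns
"12:00:00.00 -30:30:00.0".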


WFCAM update:

See the stop press pages linked from the TWiki for latest news.


Comments and issues arising from CASU fortnightly minutes:

No new minutes as of 6/1/06 AM.


Networking:

A power hiccup (see below) meant that the Xmas transfer from CASU was
interrupted; however ETWS reported that as at 6/1/06, all data that
is OK_TO_COPY has been transferred from CASU (2.3TB) at a healthy
13.6 MB/s (in Jan at least; logging info from Dec lost due to power
outage). ETWS noted that there are rather a lot of directories that
are flagged as "checked" on the processing status page and yet do
not have OK_TO_COPY files present; these will not (of course) get
transferred until the semaphore files are written in by CASU.
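
The mismatch between "checked" directories and OK_TO_COPY semaphores could
be listed with something like the sketch below. It assumes the semaphore is
a file literally named OK_TO_COPY inside each directory and that the list
of "checked" directory names is obtained separately; the function name and
that layout are assumptions, not the actual transfer scripts.

```python
from pathlib import Path

SEMAPHORE = "OK_TO_COPY"  # assumed semaphore file name


def split_by_semaphore(root, checked_dirs):
    """Split 'checked' directory names into those that contain the
    semaphore file (ready to transfer) and those that do not (waiting)."""
    ready, waiting = [], []
    for name in checked_dirs:
        if (Path(root) / name / SEMAPHORE).is_file():
            ready.append(name)
        else:
            waiting.append(name)
    return ready, waiting
```

The "waiting" list is then the set of directories to query with CASU.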

ETWS reported no response from Chicago re: SDSS-DR3, so will nudge
them to see what's going on.

ACTION: ETWS to nudge UChicago re: transfer of BestDR3.


WSA Operations:

NCH noted that JB had reported the fall-out from the Dec 23rd power
outage; no operations could take place over the Xmas break because
the DBMS load server was zapped and out of commission.


Hardware:

NCH reported JB's notes from the Dec 23rd power outage:
"There has been a power outage (externally caused) at 4:30am (ish) this
  morning, power was not restored until about 8am meaning that everything
  has gone down and some of it tried to come back, current information:

  djoser - Came up fine with no problems that I can see... drive four (the
  one that failed earlier this week) is fine.
  sneferu - Came up fine with no problems that I can see.  Needed powering
  up when I came in.
  khufu - Came up fine with no problems that I can see.  Needed powering up
  when I came in.
  thoth - Seems to be up and running.  Difficult to confirm as I've no way
  to access thoth that I know of (cluanie being unusable doesn't help).
  amenhotep - Wasn't responding when I first came in but a reboot seems to
  have cleared it, no problems arose when it came back up and there are no
  problems on it that I can see.
  ahmose - Was up but not responding properly when I first came in.  It had
  disk problems but I couldn't do anything because it wouldn't login.  A
  reboot let me do something but didn't succeed because the issue we had with
  the NVRAM for the Raid controller (same as last time this happened I
  think) happened again, I restored the disk copy of the Raid config to the
  NVRAM but three of the disks had died and they were dropping as quickly as
  I put them online again (meaning rebuilding the disks was failing to work
  either).  Since this seems to be the Raid with the OS on it the system is
  unbootable and keeps beeping due to the failed disks.  I have taken the
  executive decision that since it is not the public facing server, I have
  only one spare disk (I think, maybe two actually but certainly not three)
  and it is not in use (any running jobs will have gone during the power
  outage) over the holidays that it is officially out of commission until
  the end of the holidays.  More info on this later for Nigel and Perry as
  I'm going to contact Ian about getting a couple of extra spare drives in
  for after the holiday so that the duds can be replaced and we can start a
  rebuild of the system.  The good news however is that the backup of the
  WSA on ahmose should have finished verifying by around 9 or 10pm last
  night and was certainly already taped when I left last night so the
  backups for ahmose are as complete as you could hope them to be! :)  I
  have submitted a ticket about getting Ahmose rebuilt after the new disks
  are in.

  As regards ongoing work:
  WSA - as I said above this was backed up yesterday after the WORLDR2 cu19
  script was run so we have the data as of late Wednesday night (I think it
  was).
  WORLDR2 - I do not know if the copies that I started last night from Ahmose
  to Amenhotep finished or not, however if they did it might not actually
  help as the data on G: (I think, might have been I:) was yet to be copied
  over (about 8GB of data), so WORLDR2 is not yet attachable on amenhotep,
  however all the data on J: (the probably failed RAID on Ahmose), H: and
  either the other of G: or I: were copied over (about 300GB) so it will not
  take long to get WORLDR2 running on Amenhotep once we can boot Ahmose
  again.  WORLDR2 did not yet get backed up to tape (this was going to
  happen today)."

NCH reported further that ahmose has now *nearly* been sorted out: it
is back up and running with SQL Server available on the network and
the main WSA DB available e.g. from amenhotep; BUT it reboots when
any user tries to log in in multi-user mode. Compute Support suspect
a corrupted system file. Investigations continue; this has not impacted
the release schedule so far (Xmas break limited curation activities
anyway). Hopefully the system can be fully restored on Mon 9th.


Software:

Nothing to report due to Xmas break.


Survey Data Release:

NCH noted that UKIDSS GPS survey head Phil Lucas had enquired as to
aperture photometry corrections. Investigations have led to the
discovery that the aperture correction bug (found in August last year)
was only fixed for the LAS (and non-survey datasets); it will need to
be fixed in the release and ingest DBs, and the SV community will need
to be informed.

ACTION: NCH to fix the aperture correction bug in release and ingest
         DBs, and inform the SV community.

World release has been delayed by the power outage; again, this should
be sorted out early next week.

NCH noted that UKIDSS Consortium Survey Scientist SJW and Science
Verifier SD would be visiting next week to thrash out the details of
the quality control procedure for the impending data release.


Non-Survey Data Release:

No news this week.


Astrogrid deployment:

AL noted the possibility of an AG talk/demo at ROE to be given by 
himself.


Miscellaneous:

Nothing further this week.