Below is information for collaborators or users from major projects using the UW-Madison NMI Build & Test Lab.
Current WP2 deliverables/issues, in priority order:
*NOTE*: The goal here is not necessarily to duplicate the Savannah or Drupal issue queues by enumerating all WP2 issues, but rather to summarize the highest-priority items currently being worked on and their status. This page should be able to serve as a de facto agenda for WP2 technical calls.
Tasks:
More details: https://savannah.cern.ch/bugs/index.php?30029
Status: fixed in Metronome 2.4.3
Tasks:
Status: Metronome web status pages do not work correctly for multiple platforms of the same name. See here.
Status: Job migration is now working from CERN->UW, but the retry policy needs tuning and testing, and monitoring needs to be improved.
Tasks:
Status: A working system is in place, but running VMs are self-managed and can hang if corrupted by root tests. Long-term, we need to deploy a Host_Condor->VMware->Guest_Condor solution.
Tasks:
Status: first draft written by Marian
Tasks:
https://savannah.cern.ch/bugs/?26314
Status: Not yet investigated.
Tasks:
Status: The ETICS WS on nmi-0084, set up by Marian, is now working. nmi-0032, set up by Becky, has a working Metronome install and a non-working WS; the error has been sent to Marian for review. A format problem was identified with the host certificates; the database and the Installation Guide have both been updated.
Tasks:
Please update before all JRA1 calls and link to associated Savannah tickets.
Tasks:
Get automatic migration working, reliably, in production, so ETICS users’ jobs will automatically run remotely on exotic platforms they do not have in their local pool. (Remember that for now, the purpose is not to allow users to run remotely in order to get extra cycles on platforms that they do have in-house.)
Target platforms (as of ?when?, according to ?who?) include
Right now we’re stuck at step 0, and need to demonstrate that the core migration architecture is stable.
The automated testing shouldn’t involve gLite jobs at all, at least for now. The first step is to verify that the software we’re providing (Condor + Metronome) works. If other applications are making that difficult, we need to cut them out of our testing process until our own tests are 100% clean.
Once our software stack is “clean” (see below) then we can more confidently go back and suggest that any problems may be with gLite. But I wouldn’t be comfortable doing that until we can demonstrate that our software works reliably (via exerciser results).
That’s the approach to use in general. Don’t add anything to the exerciser mix until the simple things work 100%. Then add one variable or software stack component at a time until it works 100%. Repeat until we’re replicating something as close as possible to a production ETICS job on the production ETICS infrastructure. But don’t start there.
If this necessitates two testbeds, so be it. A “testbed” is really nothing more than a distinct Condor/Metronome configuration on each end, so having more than one of them should not be a big problem (even if you only end up being able to get one machine on each end).
As for what “clean” or “failed” means: anything other than the job running to completion and having its results sent back whole within a predefined, reasonable amount of time should be considered a failure. The time limit will depend on the test (hello world vs. ETICS jobs, dedicated CPU slot vs. not, etc.).
But the whole point is to start understanding and categorizing the types of failures and their frequency. The desired outcome is a report like:
Over the past month, 23% of jobs (1000) failed, with a min/median/max of 0%/7%/100% per night.
Of all failed jobs:
52% (520) timed out due to matchmaking failures
20% (200) timed out due to ETICS Condor pool configuration bugs
12% (120) failed due to NMI Lab machine failures
8% (80) timed out due to unavailable resources
4% (40) failed due to ETICS network configuration bugs
4% (40) are unknown.
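As a rough illustration of how such a report could be produced from exerciser results, here is a minimal Python sketch. It is not part of Metronome or Condor: the `JobResult` record, its field names, and the `summarize` helper are all hypothetical, and it assumes each job's failure cause has already been assigned by whoever triaged it. A job counts as clean only if it ran to completion, its results came back whole, and it finished within the time limit set for that kind of test; everything else is a failure.

```python
import statistics
from collections import Counter
from dataclasses import dataclass
from typing import Optional


@dataclass
class JobResult:
    """One exerciser job (hypothetical record; all field names are assumptions)."""
    night: str                 # label for the nightly run the job belongs to
    completed: bool            # job ran to completion
    results_intact: bool       # results were sent back whole
    elapsed_s: float           # wall-clock time taken, in seconds
    timeout_s: float           # predefined limit for this kind of test
    failure_cause: Optional[str] = None  # category assigned during triage, if any


def is_clean(job: JobResult) -> bool:
    """Clean means: completed, results whole, and within the time limit."""
    return job.completed and job.results_intact and job.elapsed_s <= job.timeout_s


def summarize(jobs):
    """Return overall and per-night failure rates plus a count per failure cause."""
    failed = [j for j in jobs if not is_clean(j)]
    total_by_night = Counter(j.night for j in jobs)
    failed_by_night = Counter(j.night for j in failed)
    # Fraction of failed jobs for each night that ran at least one job.
    rates = [failed_by_night[n] / total_by_night[n] for n in total_by_night]
    # Failures nobody has categorized yet are reported as "unknown".
    causes = Counter(j.failure_cause or "unknown" for j in failed)
    return {
        "total": len(jobs),
        "failed": len(failed),
        "failure_rate": len(failed) / len(jobs) if jobs else 0.0,
        "nightly_min": min(rates, default=0.0),
        "nightly_median": statistics.median(rates) if rates else 0.0,
        "nightly_max": max(rates, default=0.0),
        "causes": dict(causes),
    }
```

The "unknown" bucket is deliberate: shrinking it over time is a reasonable measure of how well the failure modes are understood.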
All tasks are Becky’s unless indicated otherwise.
The last step might take a while. :)
Then, in some order, we can:
and
We can discuss which, if any, of these additional steps should be done on top of the initial testbed infrastructure and which should be done in parallel with it as separate, new exerciser instances.
This page summarizes the current NMI “plate” with respect to TeraGrid, i.e., the top items TeraGrid wants or needs from the NMI Lab. Each item should reference specific Drupal tickets for details; the goal here is to summarize the most important active issues, their priorities, and their current status.
Status: blocked on the following tasks…