ETICS
Current WP2 deliverables/issues, in priority order:
*NOTE*: the goal here is not necessarily to duplicate the Savannah or Drupal issue queues by enumerating all WP2 issues, but rather to summarize the highest-priority items currently being worked on and their status. This page should be able to serve as a de facto agenda for WP2 technical calls.
Active Issues & Deliverables
Condor does not recover gracefully from network/file transfer problems.
Tasks:
- [Marian] Reduce MAX_JOBS_RUNNING to 500 instead of 1999.
- [Marian] ? Set MAX_CONCURRENT_DOWNLOADS and MAX_CONCURRENT_UPLOADS to throttle file transfers in case of problems? Default is 10.
- [Marian] ? Set JOB_START_DELAY = 5 to give Condor some time to start up before jobs start. Default is 2 seconds; it may be less than that now.
- [Marian] Run “condor_hold tomcat4”, i.e. for all users, as soon as Condor starts up.
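Once the settings above are agreed, a quick check that they actually landed in a configuration file could look like the sketch below. It is a minimal illustration only: `parse_settings` and `check_settings` are hypothetical helper names, the parser handles plain `NAME = value` lines and ignores macro expansion and included files, and no target is checked for the two transfer throttles because the task list does not propose one.

```python
import re

# Values proposed in the task list above (proposals, not verified settings).
PROPOSED = {"MAX_JOBS_RUNNING": "500", "JOB_START_DELAY": "5"}

# Matches simple "NAME = value" assignments and trims surrounding whitespace.
_ASSIGNMENT = re.compile(r"^\s*([A-Za-z_][A-Za-z0-9_.]*)\s*=\s*(.*?)\s*$")


def parse_settings(text):
    """Return {NAME: value} for simple assignment lines; the last one wins."""
    settings = {}
    for line in text.splitlines():
        if line.lstrip().startswith("#"):
            continue  # skip comment lines
        match = _ASSIGNMENT.match(line)
        if match:
            # Names are upper-cased so that lookups ignore case.
            settings[match.group(1).upper()] = match.group(2)
    return settings


def check_settings(text):
    """Return (name, found, wanted) for each proposed value that does not match."""
    found = parse_settings(text)
    return [
        (name, found.get(name), wanted)
        for name, wanted in PROPOSED.items()
        if found.get(name) != wanted
    ]
```

The function takes the file contents as a string, so it can be pointed at whichever file under /etc/condor holds the local overrides. For the values Condor actually uses after all files are merged, `condor_config_val MAX_JOBS_RUNNING` is the more reliable check.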
Co-scheduling – is it set up properly? How to tell, simple sanity check. Better documentation!
- INFO: All Condor configuration files are located under the /etc/condor directory.
- [Becky] Send Marian a tarball of a simple // test to run.
- [Becky] Document this test location.
- [Becky] Debug problem with runid 1226 on etics-test.
Condor exiting prematurely, before the build results get transferred
More details: https://savannah.cern.ch/bugs/index.php?30029
Status: fixed in Metronome 2.4.3
Tasks:
- [Marian] confirm that the problem is solved in Metronome 2.4.3
Co-scheduling Issues
Status: Metronome web status pages do not work correctly for multiple platforms of the same name. See here.
- [Becky] investigate & fix
Job Migration Issues
Status: job migration is now working from CERN->UW, but retry policy needs tweaking/testing, and monitoring needs to be improved.
Tasks:
- [Becky] double-check that CERN policy exists to retry migrated jobs locally if they don’t run in N minutes; set N to be 15-20 if it’s currently higher
- [Todd M.] deploy “exerciser” to submit known-good heartbeat jobs from CERN->UW, and email people if they fail (status)
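The retry rule in the first task can be stated as a small decision function, which may help when agreeing on N. This is only a sketch of the intended behaviour, not the actual CERN policy expression: `should_retry_locally` and its parameters are hypothetical names, and the 15-minute default is taken from the lower end of the range suggested above.

```python
from datetime import datetime, timedelta

# N from the task list; 15 to 20 minutes is the suggested range.
RETRY_AFTER = timedelta(minutes=15)


def should_retry_locally(migrated_at, started_at, now, retry_after=RETRY_AFTER):
    """Return True if a migrated job has not started within retry_after.

    migrated_at: when the job was handed to the remote site
    started_at:  when it began running remotely, or None if it has not
    now:         the current time
    """
    if started_at is not None:
        return False  # the job is running (or ran) remotely; leave it alone
    return now - migrated_at >= retry_after
```

Passing `now` in explicitly, rather than reading the clock inside the function, keeps the rule easy to test with fixed timestamps.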
Condor/VMWare/Metronome integration @ CERN
Status: Working system is in place, but running VMs are self-managed and can hang if corrupted by root tests; long-term, we need to deploy a Host_Condor->VMWare->Guest_Condor solution
Tasks:
- [Peter] document Host_Condor->VMWare->Guest_Condor deployment
ETICS Server Installation Guide
Status: first draft written by Marian
Tasks:
- [Becky] provide feedback/contributions based on UW-Madison deployment experience & expertise
Production Bug: Condor jobs failing, evicted on ETICS production node
https://savannah.cern.ch/bugs/?26314
Status: Not yet investigated.
Tasks:
- [Marian] reconfirm that the problem is still happening and provide Becky with a recent example
- [Becky] investigate and understand the problem
ETICS @ UW
Status: The ETICS WS on nmi-0084, set up by Marian, is now working. nmi-0032, set up by Becky, has a working Metronome install and a non-working WS. Error has been sent to Marian for review. Format problem identified with the host certificates: updated in the database and also the Installation Guide.
Tasks:
- [Becky] re-deploy latest version of ETICS (WA and WS) on nmi-0032 (our production ETICS WS machine-to-be) and report any problems via Savannah
