ETICS All-Hands meeting (May 2007)
Summary of notes from the morning technical meeting, May 22
- We all agree on the design and plan outlined here, from the meeting at CERN last week.
- Any synchronization of co-scheduled tasks is up to the user, using ETICS get and set attr commands (which in turn will invoke the Metronome get/set commands).
- ETICS will work around Condor/Metronome’s special node 0 semantics (when node 0 exits, Condor kills all other nodes) by inserting code into each node’s finalize task that waits for a global “finalize” attribute to appear before exiting. Each node script polls for it.
- UW-Madison should clean up and formally support a “freeze” option that freezes the state of a machine on which a job has failed, so that it can be debugged. We should make sure it is available in Metronome independently of root jobs (for which it is currently implemented).
- There are long-term policy issues to discuss regarding user prioritization, freezing capability, dedicated resources, etc.
- Action item: deploy Condor 6.9.1 or 6.9.3-pre on ETICS to use dedicated/temp users per CPU slot. This will address two outstanding ETICS issues: incomplete process cleanup, and potential interference between two concurrent jobs running on a single machine.
- We must recognize that Condor’s parallel scheduling policy is different from normal scheduling and doesn’t take priorities into account. Projects competing for co-scheduled resources will have problems.
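
As an illustration of the node 0 workaround noted above, here is a minimal sketch of the polling loop a node's finalize task might run. It is not the actual ETICS code: `wait_for_finalize` and its parameters are hypothetical names, and `get_attr` stands in for whatever wrapper calls the ETICS get-attr command, whose real name and syntax are not given in these notes.

```python
import time
from typing import Callable, Optional


def wait_for_finalize(
    get_attr: Callable[[str], Optional[str]],
    attr_name: str = "finalize",
    poll_interval: float = 5.0,
    timeout: Optional[float] = None,
    sleep: Callable[[float], None] = time.sleep,
) -> bool:
    """Block until the shared attribute is set.

    get_attr is an assumed wrapper around the ETICS get-attr command;
    it should return a non-empty value once the attribute exists.
    Returns True when the attribute is seen, False if timeout expires.
    """
    deadline = None if timeout is None else time.monotonic() + timeout
    while True:
        # A non-empty value means some node has announced finalization.
        if get_attr(attr_name):
            return True
        # Give up if an optional timeout was requested and has passed.
        if deadline is not None and time.monotonic() >= deadline:
            return False
        sleep(poll_interval)


if __name__ == "__main__":
    # Stand-in attribute store for demonstration only; in a real job the
    # value would be set elsewhere through the ETICS set-attr command.
    shared = {"finalize": "1"}
    print(wait_for_finalize(shared.get, poll_interval=0.1))
```

Every node, including node 0, would call something like this as the last step of its finalize task, so that node 0 does not exit (and trigger Condor's cleanup of the other nodes) before the global attribute has been set. The optional timeout is an addition of this sketch, not something decided at the meeting.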
