Community-Specific Information

Below is information for collaborators or users from major projects using the UW-Madison NMI Build & Test Lab.

ETICS

Current WP2 deliverables/issues, in priority order:

*NOTE*: the goal here is not necessarily to duplicate the Savannah or Drupal issue queues by enumerating all WP2 issues, but rather to summarize the highest-priority items currently being worked on and their status. This page should be able to serve as a de facto agenda for WP2 technical calls.

Active Issues & Deliverables

Condor does not recover gracefully from network/file transfer problems.

Tasks:

Co-scheduling – is it set up properly? How to tell, simple sanity check. Better documentation!

Condor exiting prematurely, before the build results get transferred

More details: https://savannah.cern.ch/bugs/index.php?30029

Status: fixed in Metronome 2.4.3

Tasks:

Co-scheduling Issues

Status: Metronome web status pages do not work correctly for multiple platforms of the same name. See here.

Job Migration Issues

Status: job migration is now working from CERN->UW, but retry policy needs tweaking/testing, and monitoring needs to be improved.

Tasks:

Condor/VMWare/Metronome integration @ CERN

Status: Working system is in place, but running VMs are self-managed and can hang if corrupted by root tests; long-term, we need to deploy a Host_Condor->VMWare->Guest_Condor solution

Tasks:

ETICS Server Installation Guide

Status: first draft written by Marian

Tasks:

Production Bug: Condor jobs failing, evicted on ETICS production node

https://savannah.cern.ch/bugs/?26314

Status: Not yet investigated.

Tasks:

ETICS @ UW

Status: The ETICS WS on nmi-0084, set up by Marian, is now working. nmi-0032, set up by Becky, has a working Metronome install and a non-working WS. Error has been sent to Marian for review. Format problem identified with the host certificates: updated in the database and also the Installation Guide.

Tasks:

ETICS2

ETICS2 project notes

ETICS2 Organizational Details

Put all relevant organizational details here

  • Mailing lists
  • Meeting/call info and times

JRA1 Wiki

Summary of active issues per deliverable

Please update before all JRA1 calls and link to associated Savannah tickets.


General Issues

The post_all step of gLite builds sits idle in the queue for ~8 hours, then takes only ~8 min to run

Tasks:

  • [Becky] Look at Condor configuration that may prevent the post_all jobs from starting
  • [ETICS] Make sure that the number of jobs in the Condor queue (as reported by condor_q) does not exceed the value returned by `condor_config_val MAX_JOBS_RUNNING`
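
The check in the second task could be scripted roughly as follows. This is a minimal sketch, not existing lab tooling: the function names `read_max_jobs_running` and `queue_within_limit` are hypothetical, and obtaining the current queue length (for example by counting jobs listed by condor_q) is left to the caller.

```python
import subprocess


def read_max_jobs_running() -> int:
    """Ask the local Condor configuration for MAX_JOBS_RUNNING.

    Assumes condor_config_val is on PATH. Raises if the command fails
    or if its output is not a plain integer (e.g. the macro is undefined).
    """
    out = subprocess.run(
        ["condor_config_val", "MAX_JOBS_RUNNING"],
        check=True,
        capture_output=True,
        text=True,
    ).stdout
    return int(out.strip())


def queue_within_limit(jobs_in_queue: int, max_jobs_running: int) -> bool:
    """True if the queue length does not exceed the configured limit."""
    return jobs_in_queue <= max_jobs_running
```

Keeping the comparison separate from the Condor call means the policy can be tested on a machine without Condor installed.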


Job Migration

Status


Web Service


IPv6


Virtualization


Co-scheduling


JRA1 Deliverables Timeline

  • Slides presented at kickoff meeting (Mar 17-19, 2008, CERN)

Deliverables Summary

  • Cross-site migration (now -> month 3)
  • Web Service Interface specification (month 3)
  • IPv6 compliance analysis (month 6)
  • Virtualization (month 9)
  • Co-scheduling (month 12)

Cross-site migration

Cross-site migration improvements (now -> month 3)

Purpose

Get automatic migration working, reliably, in production, so ETICS users’ jobs will automatically run remotely on exotic platforms they do not have in their local pool. (Remember that for now, the purpose is not to allow users to run remotely in order to get extra cycles on platforms that they do have in-house.)

Target platforms (as of ?when?, according to ?who?) include

  • x86_cent_4.3
  • macos_10.4
  • x86 & x86_64 debian_4
  • x86_rhap_5
  • x86_rhas_4
  • x86_RHclones_4

Plan

Right now we’re stuck at step 0, and need to demonstrate that the core migration architecture is stable.

The automated testing shouldn’t involve gLite jobs at all, at least for now. The first step is to verify that the software we’re providing (Condor + Metronome) works. If other applications are making that difficult, we need to cut them out of our testing process until our own tests are 100% clean.

Once our software stack is “clean” (see below) then we can more confidently go back and suggest that any problems may be with gLite. But I wouldn’t be comfortable doing that until we can demonstrate that our software works reliably (via exerciser results).

That’s the approach to use in general. Don’t add anything to the exerciser mix until the simple things work 100%. Then add one variable or software stack component at a time until it works 100%. Repeat until we’re replicating something as close as possible to a production ETICS job on the production ETICS infrastructure. But don’t start there.

If this necessitates two testbeds, so be it. A “testbed” is really nothing more than a distinct Condor/Metronome configuration on each end, so having more than one of them should not be a big problem (even if you only end up being able to get one machine on each end).

As for what “clean” or “failed” means, anything other than the job successfully running to completion and having its results sent back whole within a predefined reasonable amount of time (which will depend on the test — hello world vs ETICS jobs, dedicated CPU slot vs. not, etc.) should be considered a failure.
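
The definition above can be written down as a single predicate. The sketch below is illustrative only: the `JobOutcome` record and its field names are hypothetical, and the time limit is passed in per test because the text says it depends on the kind of job.

```python
from dataclasses import dataclass


@dataclass
class JobOutcome:
    """What the exerciser would need to record about one migrated job."""
    ran_to_completion: bool
    results_returned_whole: bool
    elapsed_seconds: float


def is_clean(outcome: JobOutcome, time_limit_seconds: float) -> bool:
    """A job is clean only if it finished, its results came back whole,
    and it did so within the time limit. Anything else is a failure."""
    return (
        outcome.ran_to_completion
        and outcome.results_returned_whole
        and outcome.elapsed_seconds <= time_limit_seconds
    )
```

For example, a hello world job that completes with intact results but overruns its limit is still counted as a failure, which is the strict reading the paragraph above asks for.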

But the whole point is to start understanding and categorizing the types of failures and their frequency. The desired outcome is a report like:

Over the past month, 23% of jobs (1000) failed, with a min/median/max of 0%/7%/100% per night.
Of all failed jobs:
52% (520) timed out due to matchmaking failures
20% (200) timed out due to ETICS Condor pool configuration bugs
12% (120) failed due to NMI Lab machine failures
8% (80) timed out due to unavailable resources
4% (40) failed due to ETICS network configuration bugs
4% (40) are unknown.
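
A report in this shape could be produced from a list of categorized failures along the following lines. This is a sketch under assumptions: the function names are hypothetical, each failed job is assumed to have already been assigned exactly one cause label, and per-night failure rates are assumed to be available as percentages.

```python
from collections import Counter
from statistics import median


def failure_breakdown(causes):
    """Return [(cause, count, percent_of_failures)], most common first.

    `causes` holds one label per failed job. An empty input yields [].
    """
    counts = Counter(causes)
    total = sum(counts.values())
    if total == 0:
        return []
    return [(cause, n, 100.0 * n / total) for cause, n in counts.most_common()]


def nightly_failure_stats(nightly_rates):
    """Return (min, median, max) of the per-night failure percentages."""
    return min(nightly_rates), median(nightly_rates), max(nightly_rates)
```

Feeding in the sample figures from the report above (520 matchmaking timeouts out of 1000 failures, and so on) reproduces its percentages; the "unknown" label is just another cause, so shrinking it over time is visible in the same output.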

Tasks

All tasks are Becky’s unless indicated otherwise.

Current

  • identify what we currently believe is the “best” combination of Condor and Metronome versions for stable job migration
    • this is a smaller and distinct task from fully-populating a compatibility matrix — you can create one, but just fill in what we do know, and leave what we don’t know blank for now
  • identify what submit host at each site will be used for the initial testbed — ideally both should be dedicated to the exerciser for now
  • set up the testbed: install the aforementioned software on the aforementioned machines (where schedd A=etics and schedd B=batlab)
  • the exerciser is currently running from lxb2180.cern.ch -> nmi-etics.cs.wisc.edu — is lxb2180 dedicated to the exerciser?
  • confirm that the testbed supports migration by manually submitting a trivial (e.g., hello world) job from A to B, for a single platform, to a dedicated exerciser CPU slot that will always run
  • develop an exerciser to automatically submit jobs like the one above, and to track and email their success/failure
    • check the exerciser, emailer, and any/all support files into CVS
  • Send daily and weekly reports showing the results of migrated jobs. Reports should clearly indicate whether migration is working and should identify failures. Reports are sent daily and weekly to the nmi-staff list and Marian.
    • Expand list to include all of SA1, Miron?
  • identify and categorize every job failure until the testbed is 100% working, or the remaining failures are 100% understood and not critical, or not due to the Condor/Metronome software stack

The last step might take a while. :)

Future

Then, in some order, we can:

  • run the exerciser from the production ETICS schedd rather than the artificial testbed schedd
  • make the trivial job into a realistic ETICS job.
  • run the job on non-dedicated CPU slots (to test the matchmaking infrastructure and policy)
  • run the job on more than one platform
  • experiment with advanced policy and configuration on the testbed machines. Document known good (and bad) policies and concerns.
  • etc.

and

  • improve exerciser, nmi_resource_advertiser, and other related Metronome infrastructure to make job migration simpler to implement and debug

We can discuss which if any of these additional steps should be done on top of the initial testbed infrastructure or done in parallel with it as separate, new exerciser instances.

TeraGrid

This page is to summarize the current NMI “plate” with respect to TeraGrid. I.e., what are the top items TeraGrid wants/needs from the NMI Lab? Each item should reference specific Drupal tickets for details — the goal here is to summarize the most important active issues, their priorities, and their current status.

Using TeraGrid resources for Condor nightly builds

Status: blocked on the following tasks…

Miscellaneous Problems

Defensively set file permissions
Rebuild GCC on Stolaris