Metronome Users

First Annual NMI Build and Test Workshop: Tuesday, April 29th in Madison, WI

On Tuesday, April 29th, we will hold the first annual NMI Build and Test Workshop at UW-Madison in conjunction with Condor Week 2008. The workshop will consist of a dedicated track of tutorials and talks by users and developers regarding Metronome, the NMI Lab, and the build and test needs of NSF projects and collaborators in general.

Further details and a full agenda are available here.

Please mark the date, and let us know if you have thoughts or ideas for what you’d like to see included!

Terminology

Unfortunately, we (the Metronome developers) have been inconsistent over time in our own terminology for some aspects of the Metronome framework and software. This may be reflected even on this site, although we’re working to make our documentation and reference materials more consistent.

For the time being, the following is the most common vocabulary used for the Metronome workflow:

A Metronome “Run”

A user submits a Metronome run to execute a build or test workflow on one or more platforms.

A run is described in a run specification file which is passed to nmi_submit.

A Metronome “Task”

Each run consists of a number of Metronome tasks whose individual success/failure/duration/exe-host/output are tracked and recorded in the DB.

Although the actual operations performed by each task are based on user-provided scripts or executables, most tasks have a predefined name and location in their run’s workflow. Users may, however, optionally declare an arbitrary number of custom-named tasks to run on each specified platform.

In addition to recording the results of individual tasks in a run, Metronome also records the success/failure/duration of certain meta-tasks. Each meta-task represents the collective result of a predefined set of related tasks in the run. A platform_job meta-task for each specified platform in the run represents the collective success or failure of all the tasks which executed on that platform. Additionally, if users choose to declare their own custom-named tasks on a platform, the collective success or failure of just those user-defined tasks is recorded in a remote_task meta-task.

A Condor Job

All platform-independent tasks in a run are executed on the submit host by Metronome inside their own individual Condor jobs; all tasks specific to a given platform are executed on a remote machine inside a single Condor job. All the jobs used to execute a run are represented in a single Condor DAG.

Results Archive

In principle, no NMI build or test artifact should be irreplaceable. The NMI database stores the full specification needed to reproduce all past builds or tests, and an NMI facility should continue to maintain the necessary platforms and prerequisites for as long as reproducibility is required.

In practice, however, there are three reasons to archive the full output of old runs: convenience, efficiency and “insurance”. It is convenient for developers to be able to examine the detailed output of a build or test after it completes (e.g., to perform a “post-mortem”). It can be more efficient to use the archived output of one build as the input to multiple test runs, rather than re-building the initial software from scratch each time it is needed. And it can be prudent to save the full results of important builds as insurance, in case a software, hardware, or administrative error unexpectedly renders them irreproducible in the future.

NMI provides a mechanism for archiving old run results to address these concerns. However, our focus is on convenience and efficiency, and so depending on the degree of “insurance” desired, projects may wish to keep their own additional archives of completed builds or tests.

There are two classes of archive in the NMI framework: a full archive and a metadata-only archive. It’s worth reiterating that the NMI DB stores indefinitely the full specification of each run and the essential outcome of each of its tasks (e.g., return code, execution time, etc.). The archives we are discussing here are the file-based data and metadata corresponding to the run’s submission, input, execution, and results.

The length of time these file archives are kept is site-dependent, and is usually a function of available disk space and the rate at which new results are being generated. The NMI software assumes that a given run’s results are to be stored only temporarily unless they are explicitly “pinned” by their owner. Pinned runs are kept until the specified expiration of their pin.

This means the results of each run are in one of the four states below. The present archival state of each run is stored in the database.

State / Retention Goal (Which Runs, How Long)
—————
a) full archive / most recent runs, until disk needed
b) pinned full archive / all pinned runs, until pin expires
c) metadata-only archive / all runs, forever(1)
d) no longer archived / none (only as a result of data loss)

Valid state transitions are:
a -> b|c|d (after a pin, cleaning, or data loss, respectively)
b -> a|d (after an unpin or data loss)
c -> d (after a data loss)

Definitions:

  • “metadata” means NMI spec files, condor submit & log files, task stdout/stderr files, etc.
  • “full archive” means metadata and all input & results tarballs
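
The four archival states and their valid transitions can be sketched as a small lookup table. This is only an illustration of the rules listed above, not the NMI implementation: the names `ArchiveState`, `TRANSITIONS`, `next_state`, and the event strings are hypothetical.

```python
from enum import Enum


class ArchiveState(Enum):
    FULL = "a"           # full archive
    PINNED_FULL = "b"    # pinned full archive
    METADATA_ONLY = "c"  # metadata-only archive
    NOT_ARCHIVED = "d"   # no longer archived (only after data loss)


# Valid transitions keyed by (current state, event).
# The event names are illustrative labels, not NMI identifiers.
TRANSITIONS = {
    (ArchiveState.FULL, "pin"): ArchiveState.PINNED_FULL,
    (ArchiveState.FULL, "clean"): ArchiveState.METADATA_ONLY,
    (ArchiveState.FULL, "data_loss"): ArchiveState.NOT_ARCHIVED,
    (ArchiveState.PINNED_FULL, "unpin"): ArchiveState.FULL,
    (ArchiveState.PINNED_FULL, "data_loss"): ArchiveState.NOT_ARCHIVED,
    (ArchiveState.METADATA_ONLY, "data_loss"): ArchiveState.NOT_ARCHIVED,
}


def next_state(state, event):
    """Return the state reached from `state` on `event`, or raise ValueError."""
    try:
        return TRANSITIONS[(state, event)]
    except KeyError:
        raise ValueError(f"invalid transition: {event!r} from {state.name}") from None
```

For example, `next_state(ArchiveState.FULL, "clean")` yields `ArchiveState.METADATA_ONLY`, while attempting to pin a run whose full archive has already been cleaned raises `ValueError`.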

In addition, in the UW-NMI B&T Lab we unofficially “back up” the full archive of each run being cleaned, and only permanently delete them as disk is needed (again, in reverse chronological order). These backups are not performed as part of the NMI framework, are not tracked in the DB, and there is no automated way to retrieve them — they are simply a failsafe in case the disk cleaner malfunctions or we need to retrieve a just-cleaned run someone forgot to pin.

(1) This policy may be untenable, depending on the size of metadata, the volume of runs, and the growth of available disk, but for the moment has been possible at UW-Madison. As a result, the NMI B&T software currently provides no automated means to “clean” the metadata of old builds.

identity

An optional attribute whose arbitrary string value can be used to identify the run’s owner, if distinct from the submitting user.

The identity string will be stored in the DB for the run, and can be displayed on the web status pages in place of the submitting user (by setting the RUN_USER_IDENTITY_COLUMN), but does not affect the user account under which the job actually executes on computing resources.

svn (Subversion)

This method fetches files from a Subversion repository. It requires an additional url command in the input specification file, and checks out that URL. For example, a specification containing:

method = svn
url = svn-method://svn-host/svn-path

will run a command equivalent to svn co svn-method://svn-host/svn-path.

url

This method fetches one or more files from a web server. It requires an additional url command, which specifies the URL of the file to be downloaded. For example:

method = url
url = http://cs.wisc.edu/condor/nmi/nmi-releases/nmi-2.2.7.tar.gz

The url method also supports the recursive command to download entire directory trees.

NOTE: the url method does not currently provide direct support for websites requiring (basic HTTP) authentication. However, since the present implementation of Metronome relies on wget for URL retrieval, websites requiring (basic HTTP) authentication can in fact be accessed through simple use of a wgetrc file (.wgetrc in the submitter’s home directory by default). See the wget documentation for details. This is not a supported feature of Metronome.

Reproducibility of Builds

Things To Look Out For:

  • don’t check source code out of the “head” of a CVS trunk or branch — it isn’t a fixed target, and so future builds submitted using the same input spec may check out different source and fail or produce different results.
  • avoid less reliable, non-archival input methods, such as scp. Instead, use cvs or another revision control system. This is just as important for your build scripts as it is for the source code you’re building. If you don’t treat your build scripts as code, and archive, tag, and store them accordingly, it will be exceedingly difficult to reproduce old builds in the future.

Return Codes

Run Result Codes

The result code stored in the DB for a Metronome run is the number of failed tasks in that run. If the run had no failed tasks, Metronome will store a return code of 0. If the run is in a special state, the Metronome result code will be negative, as follows:

Run Result Code / Key
—————
null / running
>0 / completed and failed
0 / completed successfully
-1 / removed
-1015 / Incomplete. This is a run-terminal condition, set by the monitor when the run finishes but does not complete. (Specifically, when computing the run's result code, if any component task is incomplete, the whole run is incomplete.)

NOTE: the nmi_run_status command-line tool interprets the value stored in the Metronome DB along with other information, and returns its own set of status codes for each run.
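
As a rough illustration of the key above, a run result code read from the DB could be turned into a readable string as follows. The function name `describe_run_result` is hypothetical, and `None` is assumed here to stand in for a null DB value; this is not how nmi_run_status computes its own status codes.

```python
def describe_run_result(code):
    """Map a run result code (None for a DB null) to a short description."""
    if code is None:
        return "running"
    if code == 0:
        return "completed successfully"
    if code > 0:
        # A positive code is the number of failed tasks in the run.
        return f"completed and failed ({code} failed task(s))"
    if code == -1:
        return "removed"
    if code == -1015:
        return "incomplete"
    # Any other negative value is not covered by the documented key.
    return f"unknown run result code {code}"
```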

Task Result Codes

The result code reported by Metronome (and stored in the DB) for each Metronome task is the Unix return code of the user-specified task script. If the script had no return code because it was killed by a signal, the Metronome result code will be the negative integer corresponding to the signal number. For example, -9 means SIGKILL.

For some Metronome or Condor-level failures (e.g., a job removed from the queue prior to the task script even running), the Metronome result code will be a special negative value beyond the range of signals (e.g., -1001).

These special negative Metronome result codes are documented here. The web portal does not yet translate all of them from numbers into human-readable terms.

Task Return Code / Key / Description
—————
0 to 255 / Normal Exit / Return code of user-specified task executable
-1 to -31 / Killed by Signal / Negative integer corresponding to the Unix signal number
-32 / Execution failure / The user-specified task executable could not be executed or timed out
-1001 / Submission failure / DAGMan error code for a job submission failure
-1002 / Removed / The task's Condor job was manually removed from the queue "out from under" Metronome
-1003 / Interrupted / Tasks are given this result code if they were interrupted while executing and no return code was available (e.g., if the Metronome wrapper overseeing the task was killed by an external entity, or if the remote resource crashed). This result should be temporary and non-fatal: once the task is automatically re-executed, its new status will replace this one.
-1015 / Incomplete / This means that the task neither succeeded nor failed. However, unlike Interrupted, this is a run-terminal condition, generated by the run monitor for a task if the run's DAG terminates before that task has a result.
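
Since the web portal does not yet translate every code, a client-side translation might look like the sketch below. It only encodes the table above; the names `SPECIAL_TASK_CODES` and `describe_task_result` are hypothetical, and signal names come from Python's standard `signal` module, so they reflect the platform running the script rather than the execute host.

```python
import signal

# Special negative codes taken from the documented key.
SPECIAL_TASK_CODES = {
    -32: "execution failure",
    -1001: "submission failure",
    -1002: "removed",
    -1003: "interrupted",
    -1015: "incomplete",
}


def describe_task_result(code):
    """Translate a task result code into a short human-readable string."""
    if 0 <= code <= 255:
        return f"normal exit (return code {code})"
    if -31 <= code <= -1:
        try:
            # Signal numbering is platform-dependent; this uses the local table.
            name = signal.Signals(-code).name
        except ValueError:
            name = f"signal {-code}"
        return f"killed by {name}"
    return SPECIAL_TASK_CODES.get(code, f"unknown task result code {code}")
```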

Meta-Tasks

Special meta-tasks (e.g., platform_job, or remote_task when sub-tasks have been declared) are given return codes corresponding to their total number of failed sub-tasks. If all of the tasks within the meta-task succeeded (i.e., returned 0), Metronome will assign the meta-task a return code of 0.
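
The counting rule can be sketched in a few lines. This assumes that any sub-task with a non-zero result code counts as failed; how Metronome itself treats interrupted or incomplete sub-tasks is not stated here, and the function name `meta_task_result` is hypothetical.

```python
def meta_task_result(sub_task_results):
    """Return the number of sub-tasks whose result code is non-zero.

    Assumption: "failed" means any non-zero result code.
    """
    return sum(1 for code in sub_task_results if code != 0)
```

With this rule, `meta_task_result([0, 0, 0])` is 0, and a meta-task whose sub-tasks returned 0, 1, and -9 gets a result of 2.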

Mailing Lists & Support

Open Mailing Lists

  • uw-nmi-announce – This moderated list is reserved for occasional announcements of important events to UW NMI build & test system users (e.g., system reboots and downtime, major new features or upgrades, etc.). We will try to keep traffic to a minimum.
  • nmi-users – This newer list is intended as an open forum for NMI B&T Software users to discuss questions, issues, etc. with NMI B&T Lab staff and one another.

Direct Support

  • nmi-support@cs.wisc.edu – This address is for questions or support requests for the Metronome developers and NMI Lab staff.

nmi_testsforrun

This command returns all the NMI test runids that use a given build submission as input. Given either a gid or a runid, the command prints each test runid on its own line.

Command usage:

nmi_testsforrun <runid|gid> [--nmiconf=<path>]

Using a runid:

$ nmi_testsforrun 32419
32419
32424
32426
32427

Using a gid:

$ nmi_testsforrun cndrauto_nmi-s001.cs.wisc.edu_1159503304_1509
32419
32424
32426
32427