Metronome Users Reference Manual

This section contains reference documentation for users of the Metronome Software.

Introduction

What is Metronome?

Metronome (formerly The NMI Build & Test System) is a distributed, multi-platform framework designed to provide automated software building and testing capabilities to a variety of grid computing projects.

We believe that software isn’t reliable unless it’s regularly built and tested. Doing so requires not only a significant number of CPU cycles, but often a variety of unusual and difficult-to-maintain platforms, and a framework for automating, tracking, and monitoring the entire process.

Our goal is to provide an implementation of this framework utilizing proven grid computing tools as a foundation, as well as to support the growing number of Metronome Facilities internationally, including our own NMI Lab at the University of Wisconsin-Madison.

Our LISA 2006 paper below provides more details on the framework and its implementation, including how it differs from some other common build or test frameworks.

Terminology

Unfortunately, we (the Metronome developers) have been inconsistent over time in our own terminology for some aspects of the Metronome framework and software. This may be reflected even on this site, although we’re working to make our documentation and reference materials more consistent.

For the time being, the following is the most common vocabulary used for the Metronome workflow:

A Metronome “Run”

A user submits a Metronome run to execute a build or test workflow on one or more platforms.

A run is described in a run specification file which is passed to nmi_submit.

A Metronome “Task”

Each run consists of a number of Metronome tasks whose individual success/failure/duration/exe-host/output are tracked and recorded in the DB.

Although the actual operations performed by each task are based on user-provided scripts or executables, most tasks have a predefined name and location in their run’s workflow. Users may, however, optionally declare an arbitrary number of custom-named tasks to run on each specified platform.

In addition to recording the results of individual tasks in a run, Metronome also records the success/failure/duration of certain meta-tasks. Each meta-task represents the collective result of a predefined set of related tasks in the run. A platform_job meta-task for each specified platform in the run represents the collective success or failure of all the tasks which executed on that platform. Additionally, if users choose to declare their own custom-named tasks on a platform, the collective success or failure of just those user-defined tasks is recorded in a remote_task meta-task.

A Condor Job

All platform-independent tasks in a run are executed on the submit host by Metronome inside their own individual Condor jobs; all tasks specific to a given platform are executed on a remote machine inside a single Condor job. All the jobs used to execute a run are represented in a single Condor DAG.

Acknowledgments

This documentation is the product of many people’s work, in particular Parag Mhashilkar, who wrote the original NMI Build & Test System documentation, and NCSA’s Michael Bletzinger, who is responsible for a second generation of documentation which formed the basis of the website you see today. In addition to writing a number of badly-needed tutorials and reference docs, Michael served as a catalyst for the rest of us, who subsequently contributed content because of his documentation efforts and initial work seeding the site.

Metronome Workflow

Overview

A build & test run generates a number of tasks to be executed on the submission machine and one or more user-defined platforms. The diagram below shows the overall workflow.

Fetch Tasks – Retrieves needed software inputs from one or more sources to the submission machine.
Pre Run Tasks – Lightweight tasks to be performed on the submission machine in order to prepare the software for staging to the remote platforms. These tasks can be global to all platforms (and thus executed only once, on the common input data) or specific to each platform (and thus executed once per platform, on its specific copy of the input data).
Remote Tasks – Tasks to be performed on each remote platform.
Post Run Tasks – Lightweight tasks to be performed on the submission machine to manipulate the results of the remote tasks. These tasks can be specific to each platform (and thus executed once per platform, on its specific results), or global to all platforms (and thus executed only once, on the combined output).

Task Hooks

The Build & Test System organizes a build or test workflow into predefined stages, or tasks. Each one provides a “hook” where you can optionally define a custom script or program to execute that task in the execution process. Only the remote_task task is required. The diagram below shows all of the available task hooks:

User-Defined Sub-Tasks

The Build & Test System gives you the option to divide the platform-specific remote_task into any number of user-defined sub-tasks. The diagram below shows where these user-defined tasks appear in the execution process:

Task Failure Handling

The build and test system handles failures in a way that allows the user to detect what went wrong. The handling depends on the task. The various ways that failures are handled are as follows:

  • abort run – Fatal failure of the entire run. No subsequent tasks are executed after the failure.
  • abort platform – Fatal failure of one platform. No subsequent tasks specific to the given platform are executed after the failure. Once the platform-specific tasks for all other platforms complete, the run is aborted.
  • continue remote / abort platform – The failure is recorded, but subsequent remote tasks continue. When all the remote tasks for the given platform complete, the platform is aborted.
  • remote_post / abort platform – Run skips remaining remote tasks and “jumps” to remote_post. This allows the creation of a results.tar.gz file for analysis. After remote_post returns, the platform is aborted.

Failed Task         Failure Type                    Run After Failure
                                                    remaining user tasks   remote_post   platform_post   post_all
pre_all             abort run                       no                     no            no              no
platform_pre        abort platform                  no                     no            no              no
remote_pre_declare  remote_post / abort platform    no                     yes           no              no
remote_declare      remote_post / abort platform    no                     yes           no              no
remote_pre          remote_post / abort platform    no                     yes           no              no
remote_task         remote_post / abort platform    N/A                    yes           no              no
user-defined task   record & continue               yes                    yes           no              no
remote_post         abort platform                  yes                    yes           no              no
platform_post       abort platform                  yes                    yes           yes             no
post_all            abort run                       yes                    yes           yes             yes

Timeouts

Timeouts may be run-specific or task-specific.

Run-specific timeouts refer to the amount of time a Condor job may remain in the queue before being removed. The automated removal assists with cleaning up jobs with mismatched requirements, or jobs which will never run due to various system problems. The default is 6 days, which may be overridden. The order of precedence is the run spec file (max_match_wait), then the NMI config file (MAX_MATCH_WAIT), then the default value. Timeout values are specified in seconds.

Task-specific timeouts are set in the NMI submit file. These timeouts are typically set by users at task boundaries to assist with shutting down services at the correct time or to prevent services from hanging indefinitely if something should go wrong upon shutdown. Please see the appropriate section of the manual for details.

Build & Test Specification Files

Build/Test Specification File Commands

The following commands are available in the NMI submit file (aka the build or test specification file). The syntax for each command is:

<command> = <value>

Any whitespace before or after the command or value is ignored (however whitespace within a value is retained).

Certain special variables in the file will be expanded before processing.
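
The syntax rule above can be illustrated with a minimal Python sketch. This is not Metronome's own parser: the function name parse_spec is hypothetical, variable expansion is not modelled, and skipping blank lines and lines starting with '#' is an assumption based on the examples in this manual.

```python
def parse_spec(text):
    """Parse '<command> = <value>' lines into a dict (illustrative sketch only)."""
    commands = {}
    for raw in text.splitlines():
        line = raw.strip()
        # Assumption: blank lines and '#' comment lines are skipped,
        # as the examples in this manual suggest.
        if not line or line.startswith("#"):
            continue
        # Split at the first '=' only, so values may themselves contain '='.
        name, sep, value = line.partition("=")
        if not sep:
            raise ValueError(f"not a '<command> = <value>' line: {raw!r}")
        # Whitespace around the command and value is dropped;
        # whitespace inside the value is kept.
        commands[name.strip()] = value.strip()
    return commands
```

For instance, a line such as "  description =   PyGlobus build for NMI  " would be stored under the key description with the value "PyGlobus build for NMI".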

<platform>_<remote_command>

Commands for the remote host may be specified on a platform-specific basis. In particular, commands beginning with remote_pre, remote_pre_declare, remote_declare, remote_task, and remote_post may be prefixed by a platform string to indicate that the command applies only to that platform. Examples:

x86_rh_9_remote_task = /bin/specialScript
x86_rh_9_remote_task_args = -special -arguments
x86_rh_9_remote_task_timeout = 200

A platform-specific command will override any corresponding generic command. E.g., x86_rh_9_remote_task_args will override remote_task_args on x86_rh_9, if they are both specified.
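
The override rule can be sketched in Python as a simple lookup. The helper name effective_command is hypothetical, and the sketch assumes the submit file has already been read into a dict of command names to values; it is an illustration, not Metronome's implementation.

```python
def effective_command(commands, platform, name):
    """Return the value of a remote command for one platform (illustrative sketch).

    A '<platform>_<name>' entry, if present, takes priority over the
    generic '<name>' entry. Returns None if neither is defined.
    """
    specific = f"{platform}_{name}"
    if specific in commands:
        return commands[specific]
    return commands.get(name)
```

With remote_task_args and x86_rh_9_remote_task_args both defined, a lookup for x86_rh_9 returns the platform-specific value, while any other platform falls back to the generic one.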

<taskname>_args

Specifies command line arguments to be passed to the script associated with taskname. See remote_task for an example.

<taskname>_timeout

For runs with a single remote_task, remote_task_timeout specifies how long Metronome should wait for the specified task to complete, after it begins running. If the task is still running after this time, Metronome will forcibly kill it and mark it as failed.

remote_task = takes-at-most-an-hour.sh
# Kill this job after two hours, because it's almost certainly hung.
remote_task_timeout = 120

Runs which include user tasks — which have a tasklist.nmi (including remote_declare) — may specify the timeouts for individual tasks therein, and this parameter supplies the default for that value.

remote_declare = list-my-tests.sh
# Default to ninety seconds for my tests, since they should be pretty fast.
remote_task_timeout = 90s

You may specify if your timeout is in minutes (‘M’ or ‘m’) or in seconds (‘S’ or ‘s’); Metronome defaults to minutes if no specifier is present.

Metronome 2.4.x and earlier only support remote_task_timeout.


Metronome 2.5.0 and later support timeouts for all the remote task hooks, as well as setting a default value for all unspecified remote timeouts:

remote_pre = start-server.sh
remote_task = run-client.sh
remote_post = stop-server.sh
# If it takes more than thirty seconds to start or stop the server, it probably crashed.
remote_pre_timeout = 30s
remote_post_timeout = 30s
# Nothing should take very long to do.
remote_default_timeout = 15m

(Specifically, we support timeouts for remote_pre_declare, remote_declare, remote_pre, remote_task, and remote_post.)


Metronome 2.5.1 and later support the ‘h’ and ‘H’ specifiers, for durations in hours.
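
The unit rules for timeout values can be summarized in a short Python sketch. It is an illustration only, not Metronome's own parser: the function name timeout_seconds is hypothetical, and it assumes the unit letter follows the number directly, as in the examples above.

```python
import re

# Seconds per unit; 'h' is only accepted by Metronome 2.5.1 and later.
_UNIT_SECONDS = {"s": 1, "m": 60, "h": 3600}


def timeout_seconds(value):
    """Convert a timeout value such as '90s', '15m', '2H' or '120' to seconds."""
    match = re.fullmatch(r"(\d+)([smhSMH]?)", value.strip())
    if match is None:
        raise ValueError(f"unrecognized timeout value: {value!r}")
    number, unit = match.groups()
    # No specifier means minutes, per the rule described above.
    return int(number) * _UNIT_SECONDS[unit.lower() or "m"]
```

Under these assumptions, remote_task_timeout = 120 corresponds to two hours, and remote_task_timeout = 90s to ninety seconds.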

always_run_post_all

Defaults to false.

The always_run_post_all option, if set to true, allows the post_all task to execute even after a failed platform_job.

append_requirements, append_requirements_<platform>

This value is appended to the raw Condor classad requirements expression for the platform_job jobs in the run. This can be useful to add additional matchmaking constraints to the Condor jobs, beyond what is added by the NMI B&T software itself as a result of prereqs or platforms — e.g., to force a match with a specific hostname, or with a host with adequate memory. It is to be considered an “advanced” or “expert” command, not something users should normally need to use.

(Note: you don’t need to prefix your extra requirements with ‘&&’, as it will be done for you.)

For example:

# ensure target machine has > 1 GB of memory (Condor's Memory attribute is in megabytes)
append_requirements = (Memory > 1024)
# ensure x86_rh_9 platform_job runs on host-foo.site.org
append_requirements_x86_rh_9 = (Machine == "host-foo.site.org")

In the unlikely event that you want to override your remote Condor job requirements entirely (rather than append to them), you can specify +requirements instead — but be sure you know what you’re doing.

arguments

This command is not yet documented. It was found in /space/nmi/run/pavlo_grandcentral.cs.wisc.edu_1131976992_26552/cmdfile:13.

component, component_version

Identifies the name and version of the software being built or tested. For example, SSH 4.2 would be identified as:

component = SSH
component_version = 4.2

description

Optional field which describes the component being built or tested, in an arbitrary unquoted text string. This field is stored in the database and displayed in the Build & Test Overview page.

Example:

description = PyGlobus build for NMI

fetch_retry_count

This parameter became available in Metronome 2.2.4.


By default, Metronome will try to fetch an input three times before giving up. This parameter allows you to change that to another integer for the inputs in the same submit file.

The machine default may be changed in nmi.conf using the parameter FETCH_RETRY_COUNT.

identity

An optional attribute whose arbitrary string value can be used to identify the run’s owner, if distinct from the submitting user.

The identity string will be stored in the DB for the run, and can be displayed on the web status pages in place of the submitting user (by setting the RUN_USER_IDENTITY_COLUMN), but does not affect the user account under which the job actually executes on computing resources.

inputs

Specifies the paths of one or more input files which define the inputs of a build or test submission. The paths can be absolute, or relative to the current working directory of nmi_submit at the time it is invoked. Paths are comma delimited.

For example, the following line specifies two input files. The first, foo.cvs, is expected to be in the current working directory when nmi_submit is invoked, while glue.cvs will be read from /nmi/glue:

inputs = foo.cvs, /nmi/glue/glue.cvs

max_match_wait

Specifies the maximum number of seconds that a run may remain in the queue without ever running before it will be automatically removed.

This value is determined by the max_match_wait param in a submit file. If that is undefined, it is determined by the default MAX_MATCH_WAIT value specified by the administrator in the Metronome config file. If that is undefined, the value defaults to six days.
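
The lookup order can be sketched in Python as follows. This is an illustration under stated assumptions: the function name resolve_max_match_wait is hypothetical, and the submit file and Metronome config file are assumed to have been read into plain dicts already.

```python
# Six days, expressed in seconds (the documented fallback value).
DEFAULT_MAX_MATCH_WAIT = 6 * 24 * 60 * 60


def resolve_max_match_wait(submit_file, nmi_config):
    """Pick the effective max_match_wait in seconds (illustrative sketch)."""
    # 1. The submit file setting takes priority.
    if "max_match_wait" in submit_file:
        return int(submit_file["max_match_wait"])
    # 2. Otherwise use the administrator's default from the config file.
    if "MAX_MATCH_WAIT" in nmi_config:
        return int(nmi_config["MAX_MATCH_WAIT"])
    # 3. Otherwise fall back to six days.
    return DEFAULT_MAX_MATCH_WAIT
```

With neither setting defined, the sketch returns 518400 seconds, which is six days.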

notify

Specifies an email address to which the B&T system will send a build/test run completion message. For example:

notify = micky.mouse@hotmail.com

specifies that the completion message is sent to a Hotmail account.

platforms

The following applies to Metronome 2.4.x and earlier. For information on platform names in Metronome 2.5.x, see the end of this section.



Specifies one or more types of machines that the submission should be run on. Multiple names need to be comma delimited. A list of platforms can be found by running the following on the submit machine [1]:

bash$ condor_status -format '%s\n' nmi_platform | sort | uniq

alpha_osf_V5.1
alpha_rh_7.2
hppa_hpux_11
hppa_hpux_B.10.20
ia64_rhas_3
...
ppc_macos_10.3
ppc_macos_10.4
ppc_ydl_3.0
sun4u_sol_5.8
sun4u_sol_5.9
x86_64_rhas_3
...
x86_rh_9
x86_rhas_3
...

The following example specifies that the submission should be run on Red Hat 9 and Apple Mac OS X version 10.3:

platforms = x86_rh_9, ppc_macos_10.3

[1] More information on the condor_status command can be found in the Condor manual.


The Metronome 2.5.x ‘platforms’ command is backwards-compatible with the 2.4.x (and earlier) command described above. When platforms are specified by

platforms = <platform> [, <platform>]*

Metronome will internally translate this to

platforms = <default platform type>:<platform> [, <default platform type>:<platform>]*

where the default platform type is either platform_type as specified in the command file, the PLATFORM_TYPE as specified in the nmi.conf file, or the hardcoded default (at present, “nmi”). Naturally, you can directly specify a platform type for a particular platform in the same way:

platforms = x86_fc_5, etics:fc5_ia32_gcc410

While (at the time of this writing) the “platform strings” in this example refer to the same “platform”, Metronome will generate two platform jobs for this run. If the two platforms were ‘x86_fc_5’ and ‘nmi:x86_fc_5’, however, the former will be internally translated to the latter, and the usual rules for platform collision apply.
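
The translation described above can be sketched in Python. This is only an illustration: the function name qualify_platforms is hypothetical, the default type is passed in by the caller rather than read from the command file or nmi.conf, and the platform collision rules are not modelled.

```python
def qualify_platforms(value, default_type="nmi"):
    """Prefix each unqualified platform name with the default platform type.

    'value' is the right-hand side of a 'platforms = ...' line. Names that
    already contain a '<type>:' prefix are left unchanged.
    """
    qualified = []
    for name in value.split(","):
        name = name.strip()
        if not name:
            continue
        if ":" not in name:
            name = f"{default_type}:{name}"
        qualified.append(name)
    return qualified
```

For the example above, the sketch turns "x86_fc_5, etics:fc5_ia32_gcc410" into the two entries nmi:x86_fc_5 and etics:fc5_ia32_gcc410.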

platform_post

Definition.

Task hook which specifies a script and its arguments that will be run on the submit machine after each set of remote tasks for a platform has been run.

Example.

The example below will cause the command VDTGlue.pm voms-*.gz /p/vdt/public/html/software/voms/1.6.3p1 south.cs.wisc.edu to be executed once on the submit machine after each set of remote tasks for a platform. Since the platforms command specifies 13 platforms, the task will be executed 13 times.

platforms = alpha_osf_V5.1, alpha_rh_7.2, hppa_hpux_B.10.20, ia64_sles_8, ppc_aix_5.2, ppc_macos_10.3, sun4u_sol_5.8, sun4u_sol_5.9,x86_64_rhas_3, x86_rh_7.2, x86_rh_8.0, x86_rh_9, x86_winnt_5.1
platform_post = VDTGlue.pm
platform_post_args = voms-*.gz /p/vdt/public/html/software/voms/1.6.3p1 south.cs.wisc.edu

Results.

The results of a platform_post script can be found in the run directory, in platform_post files with *.out and *.err extensions.

Run Directory.

The task is executed in the platform directory.

platform_pre

Definition.

Task hook which specifies a script and its arguments that will be run on the submit machine after all of the software specified by the input specification files has been fetched. Unlike the pre_all task hook, platform_pre is executed before each set of platform remote tasks.

Example.

The example below will cause the command nmi_glue/test/platform_pre glite/platform_pre_args to be executed on the submit machine as part of the platform_pre task. Since the platforms command specifies 13 platforms, the task will be executed 13 times.

platforms = alpha_osf_V5.1, alpha_rh_7.2, hppa_hpux_B.10.20, ia64_sles_8, ppc_aix_5.2, ppc_macos_10.3, sun4u_sol_5.8, sun4u_sol_5.9,x86_64_rhas_3, x86_rh_7.2, x86_rh_8.0, x86_rh_9, x86_winnt_5.1
platform_pre = nmi_glue/test/platform_pre
platform_pre_args = glite/platform_pre_args

Results.

The results of a platform_pre script can be found in the run directory, in (platform name) files with *.out and *.err extensions.

Run Directory.

The task is executed in the platform directory.

platform_type

First available in Metronome 2.2.8.


platform_type sets the platform type. If it is not set, the platform type defaults to the value of the configuration file option PLATFORM_TYPE or, if that is also unset, to ‘nmi’.


Generally, the only time a user should need to set this is when trying to use nonlocal resources, since the administrator of your local Metronome installation should set the PLATFORM_TYPE as appropriate in the Metronome configuration file. If, however, your Metronome installation is configured for job migration, remote sites may use a different scheme to name their platforms; this allows your submit files to conform to that scheme. Set platform_type as appropriate for the target resource and use the other platform names as normal.

Metronome does not presently support mixing platform types in a submit file.

post_all

Definition.

Task hook which specifies a script and its arguments that will be run on the submit machine after the remote tasks have been run on all of the platforms. The script is run only once.

Examples.

The example below will cause the command post_all --wrap to be executed once on the submit machine after all remote tasks are completed.

post_all = nwo/glue/all/build/post_all
post_all_args = --wrap

Results.

The results of a post_all script can be found in /nmi/run/<your GID>/post_all with *.out and *.err extensions.

prereqs, prereqs_<platform>

Lists the prerequisites needed for the build. For example:
prereqs = coreutils-5.2.1
adds the core utilities prereq as a requirement. Note that the version number is required and needs to be separated with periods (.) rather than with underscore characters, as it is displayed on the host machine pages. The following will not be recognized:
prereqs = coreutils-5_2_1

To specify platform-specific prereqs, append _<platform> to prereqs. For example:
prereqs_x86_fc_3 = coreutils-6.9
Please note that the platform-specific requirements will be appended to the global prereqs.


In Metronome 2.5.x, you may specify platform types in the <platform> portion of the command:

prereqs_nmi:x86_fc_3 = coreutils-6.9

Metronome will not ‘search’ for a prereq type if one is not specified; it will always be interpreted as the default platform type, even if specified differently elsewhere in the command file. For example, the following

platforms = etics:x86_fc_3, x86_fc_4
prereqs = gcc-3.4.3
prereqs_x86_fc_3 = coreutils-6.8
prereqs_x86_fc_4 = coreutils-6.9

will result in a two-platform run, one of which (nmi:x86_fc_4) will run with coreutils-6.9, the other of which (etics:x86_fc_3) will run without a coreutils prereq at all. Both, however, will use gcc 3.4.3.
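
The matching rule can be made concrete with a small Python sketch. It is an illustration, not Metronome's implementation: the function name prereqs_for is hypothetical, the command file is assumed to be a dict of command names to values, and the default platform type is assumed to be ‘nmi’.

```python
def prereqs_for(commands, platform, default_type="nmi"):
    """List the prereqs that apply to one platform (illustrative sketch)."""

    def qualify(name):
        # An unqualified name always gets the default platform type.
        return name if ":" in name else f"{default_type}:{name}"

    def split(value):
        return [item.strip() for item in value.split(",") if item.strip()]

    # Global prereqs apply to every platform.
    result = split(commands.get("prereqs", ""))
    target = qualify(platform)
    for key, value in commands.items():
        # Platform-specific prereqs are appended to the global ones.
        if key.startswith("prereqs_") and qualify(key[len("prereqs_"):]) == target:
            result.extend(split(value))
    return result
```

Applied to the command file above, the sketch returns only gcc-3.4.3 for etics:x86_fc_3, because prereqs_x86_fc_3 is read as nmi:x86_fc_3 and so does not match the etics platform.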

pre_all

Definition.

Task hook which specifies a script and its arguments that will be run on the submit machine after all of the software specified by the input specification files has been fetched. The script is run once before any remote tasks are run on any platforms.

Examples.

The example below will cause the command pre_all --src=/home/bt/condor-6.7.13.tar.gz to be executed on the submit machine as part of the pre_all task.

pre_all = nmi_glue/build/pre_all
pre_all_args = --src=/home/bt/condor-6.7.13.tar.gz

Results.

The results of a pre_all script can be found in the run directory with *.out and *.err extensions.

Run Directory.

The pre_all script is executed in the common directory.

private_web_users

You can restrict access to the archived “run directory” of your build & test jobs by adding the following option to your NMI submit file:

private_web_users = my_web_account, her_web_account

The web accounts in question do not correspond to system login accounts on the submit machine — rather, they are specific to the webserver, and must be manually created by NMI Build & Test Lab staff. Please submit a support request on this website or email nmi-support@cs.wisc.edu if you’d like one created.

project

Field used by the Build & Test Overview page to show what project a submission is associated with.

For example, all tutorial submission files contain the following:
project = tutorial

remote_declare

Specifies a task to be run on the target machines. This task runs second, after remote_pre_declare and before remote_pre, and is usually used to generate tasklist.nmi, which defines user tasks.

Please note that you can simply include a file named tasklist.nmi in your inputs; you only need to write a script in unusual cases (such as platform-specific user tasks).

Please see the user defined tasks section of our tutorial for a more extensive example.

remote_pool

When submitting a job to be executed on a remote site, the remote_pool option defines the address where the NMI framework can communicate with the remote collector daemon.

remote_pool = collector.example.com[:port]

This command must be used in conjunction with the remote_schedd option. More information about how to use the remote site execution features of the NMI framework can be found here.

remote_post

Specifies a task to be run on the target machines. This task runs after the remote_task. If user tasks are defined, this task runs after the last user-defined task. For example, it can be used to process failed user tasks.

remote_pre

Specifies a task to be run on the target machines. This task runs before the remote_task. If user tasks are defined, this task runs before the first user-defined task.

remote_pre_declare

Defines a script to be run before remote_declare. The task is executed before the tasklist.nmi file is generated.

remote_schedd

The remote_schedd is the host that the job will be routed to in the remote pool. Once there, the job will potentially be matched and begin execution on a computing resource within that pool.

remote_schedd = schedd.example.com

More information about how to use the remote execution features of the NMI framework can be found here.

remote_task

Specifies a task to be run on the target machines.

For example, the following specifies that “code/perlHelloWorld/helloWorld.pl Remote_Task Task” should be executed as the remote task:

remote_task = code/perlHelloWorld/helloWorld.pl
remote_task_args = Remote_Task Task
remote_task_timeout = 5
x86_rh_9_remote_task_timeout = 10

The additional parameters pass the arguments “Remote_Task Task” and set the timeout to 5 minutes. The timeout for the Red Hat 9 platform is set to 10 minutes.

A remote task can be subdivided into user-defined tasks. The definitions need to be in the file tasklist.nmi.
All of the user tasks get the same arguments from remote_task_args.

remote_task_is_null

This flag became available in Metronome 2.5.0.



By default, Metronome requires a remote task (remote_task) to be specified. This flag suppresses that behavior and overrides any supplied remote_task with the ‘null’ task, which runs on the submit machine and does nothing except note its own existence for the benefit of the web interface. Note: before version 2.5.1, Metronome still required one platform to be defined in submit files with this flag set. The “platform” may be any string; we suggest ‘dummy’.

This flag is intended for users who split pre_all or platform_pre steps that don’t vary much across runs into their own run, which is then used as input for the original run; it dispenses with the step of running anything, even a no-op, on a remote machine. By doing so, it skips the potentially slow transfer of the results of pre_all and platform_pre to the remote node(s), as well as a potentially long wait for some specific but unimportant platform to become available. When you use this run as input to another run, we recommend setting the ignore_missing_platforms flag, and not setting ‘platforms’ in the input file. You may supply a dummy platform string, but if you do, Metronome will create a (harmless but) spurious directory for your run (named after the dummy platform string).

(For instance, suppose ‘MyLinuxDistro.sub’ uses its pre_all and platform_pre steps to fetch a large number of software packages, which it then builds (as specified by remote_declare) with varying levels of optimization. You could build different optimization levels concurrently by creating ‘MyLinuxDistro-O[0-3].sub’, but all four of these would fetch the whole source all over again. Alternatively, maybe you want to try a number of different combinations of compiler flags to find the best set. Instead, you could copy the pre_all and platform_pre steps from ‘MyLinuxDistro.sub’ to ‘MyLinuxDistroSources.sub’ and set the remote_task_is_null flag, and change ‘MyLinuxDistro-O[0-3].sub’ to use the results of a run of ‘MyLinuxDistroSources.sub’. Then you only have to fetch the sources again when you decide to upgrade the packages.)

run_type

This field differentiates between build submissions and test submissions. The field is used by the Build & Test Overview page.

For example, the following indicates a test submission:

run_type = test

+<condor command>

Passes a command to Condor. For example, +getenv = true passes the command getenv with the value true to Condor.

Since Condor itself recognizes a + command to add arbitrary user-defined attributes to the job classad, in Metronome you can add such attributes by specifying a ++ prefix; the first + tells Metronome to pass the remaining +attr=value text to Condor, which interprets the + accordingly.

Input Specification File Commands

Each build or test specification file must reference one or more input specification files (via the inputs keyword) which define the inputs to the build or test run.

Input specification files follow the same basic format as the build or test specification file itself. The syntax for each line in the file is:

<command> = <value>

Any whitespace before or after the command or value is ignored (however whitespace within a value is retained).

Certain special variables in the file will be expanded before processing.

Metronome 2.4.x and earlier

This documentation reflects the input file specification commands for the Metronome 2.4.x series and earlier.

method

Specifies the method used by the B&T system to stage the software onto the submit machine. For example, the following specifies that the software should be obtained from a CVS repository.

method = cvs

cvs

This method specifies that the software is transferred to the submit machine using CVS. The method also requires the commands cvs_module and cvs_root in the input file.

From the fetch.pl usage:

    cvs_root = :ext:bt@chopin.cs.wisc.edu:/p/condor/repository/nmi
    # cvs_tag is optional
    cvs_tag = nmi_r5_branch
    # exactly one of the following two is required
    cvs_module = <cvs module name>
    cvs_subdir = <dir> [, <dir>, ...]                                          

ftp

This method fetches files from an FTP site. It requires the additional input specification file commands ftp_root and ftp_target, and downloads the files from there:

method = ftp
ftp_root = ftp://ftp.cs.wisc.edu/condor/nmi/tutorial/
ftp_target = helloWorld.tar.gz

will run a command equivalent to wget ftp://ftp.cs.wisc.edu/condor/nmi/tutorial/helloWorld.tar.gz.

(The present implementation of Metronome does use wget, so FTP sites requiring authentication can be accessed through use of a .wgetrc file in the submitter’s home directory. See the wgetrc documentation for details. This is not a supported feature of Metronome.)

nmi

The nmi input method is used to specify that one build or test run (let’s call it the consumer run) wishes to retrieve the results of another, previously-completed build or test run (the producer run). For example, a test run might specify an nmi input to retrieve the output of a finished build run it wishes to test.

For each nmi input method, Metronome first establishes a list of input platforms for which it will attempt to retrieve results from each producer run. By default, Metronome will attempt to retrieve results from the producer for each platform in the consumer’s platforms list, but this can be overridden using the platforms command in the input spec file.

From each specified producer, the fetch step retrieves the results.tar.gz files corresponding to each input platform (including the platform-independent “common” results.tar.gz file), and untars them into the corresponding platform-specific (or common) directory of the consumer.

I.e., for each input platform (including “common”), the following pseudo-code is executed for each producer:

cd consumer:userdir/<input_platform>/
tar zxf producer:userdir/<input_platform>/results.tar.gz

If a producer contains results for additional platforms not present in the consumer’s input platforms list, they are not retrieved. If a consumer specifies a platform with no corresponding results in the producer, an error is produced (return code ???) unless ignore_missing_platforms is true.

If the producer’s output files are no longer archived, an error is produced (return code ???).

Note: the targetdir command is ignored for this input method.

Example

The following input specification file instructs Metronome to untar the results.tar.gz file for each platform in runids 324 and 213 into the corresponding platform directory of the current run, ignoring any platforms not present in those input runs.

method=nmi
input_runids = 324, 213
ignore_missing_platforms = true

scp

This method fetches files using SCP. It requires the additional input specification file command scp_file, and copies that file (or directory), possibly from a remote host, to the local host.

For example, to specify a fetch of a directory called glue on a machine called role, the following needs to be in an input file:

method = scp
scp_file = role.cs.wisc.edu:/home/mbletzin/glue
recursive = true

svn (Subversion)

This method fetches files from a Subversion repository. It requires the additional input specification file command url, and checks out that URL:

method = svn
url = svn-method://svn-host/svn-path

will run a command equivalent to svn co svn-method://svn-host/svn-path.

url

This method fetches one or more files from a web server. It requires an additional url command, which specifies the filename to be downloaded. For example:

method = url
url = http://cs.wisc.edu/condor/nmi/nmi-releases/nmi-2.2.7.tar.gz

The url method also supports the recursive command to download entire directory trees.

NOTE: the url method does not currently provide direct support for websites requiring (basic HTTP) authentication. However, since the present implementation of Metronome relies on wget for URL retrieval, websites requiring (basic HTTP) authentication can in fact be accessed through simple use of a wgetrc file (.wgetrc in the submitter’s home directory by default). See the wget documentation for details. This is not a supported feature of Metronome.

ftp_args

Adds additional arguments to the wget command. See the wget documentation for a list of possible arguments.

ftp_root, ftp_target

Together, these commands specify the URL for the ftp input. For example, the URL ftp://ftp.cs.wisc.edu/condor/nmi/tutorial/helloWorld.tar.gz is specified as:

ftp_root = ftp://ftp.cs.wisc.edu/condor/nmi/tutorial/
ftp_target = helloWorld.tar.gz

http, url

url

As used by the url method, this command specifies a URL to fetch. ‘http’ is synonymous.

svn

As used by the svn method, this command specifies the Subversion URL to check out; ‘http’ is synonymous:

method = svn
url = svn-method://svn-host/svn-path

You can also optionally specify a path in the ‘url’ line:

method = svn
url = svn-method://svn-host/svn-path path

and Subversion will use ‘path’ as the destination [directory] (as opposed to determining the destination based on the URL).

input_runids

Comma-delimited list of run ids whose results are untarred into the working directory of the submit machine before platform_pre is run.


In Metronome 2.5.1, the list may include GIDs.

platforms

NOTE: valid for the nmi input method only.

The optional platforms command is used to specify a subset of platforms for which results from the input run should be copied into the current run. For example:

platforms = ppc_mac_10.3, x86_rhas3

Valid elements include any platform name (including common) or all. If unspecified, platforms defaults to all.

The results.tar.gz from each specified platform will be copied and untarred into the corresponding platform directory of the consuming run. This default destination can be overridden via an optional source:destination platform name mapping, like so:

platforms = ppc_mac_10.3, x86_rhas3:x86_rhas4, x86_rhas3:x86_fc3

This copies the ppc_mac_10.3 results from the input run into the ppc_mac_10.3 workspace of the current run (as usual), but copies the x86_rhas3 results from the input run into the x86_rhas4 and x86_fc3 workspaces of the current run (e.g., to do binary-compatibility testing).
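
To make the source:destination mapping concrete, here is a minimal Python sketch that parses a platforms value into a dictionary of destinations per source platform. The function name parse_platforms is hypothetical, the sketch covers only the single-colon syntax shown above, and it does not interpret the special value all.

```python
def parse_platforms(value):
    """Map each source platform to the list of destination platforms."""
    mapping = {}
    for element in value.split(","):
        element = element.strip()
        if not element:
            continue
        source, sep, destination = element.partition(":")
        source = source.strip()
        # Without an explicit destination, results go to the same platform.
        target = destination.strip() if sep else source
        mapping.setdefault(source, []).append(target)
    return mapping


example = parse_platforms("ppc_mac_10.3, x86_rhas3:x86_rhas4, x86_rhas3:x86_fc3")
# ppc_mac_10.3 maps to itself; x86_rhas3 maps to x86_rhas4 and x86_fc3.
```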

recursive

This command tells a method to recursively fetch the contents of any directories it finds under the target. For example:

scp_file = /home/bgietzel/fw-client-server/glue
recursive = true

This tells the system to fetch everything under the glue directory.

untar

If the command is set to true then the build and test system unpacks any archive that is fetched.

Metronome 2.5.x

The input specification file commands will be rationalized in Metronome 2.5.x. This section of the documentation should be considered speculative until the release of Metronome 2.5.x.


Methods

An input specification file must include one (and only one) “method” command. Each method, listed below, imposes its own requirements for subsequent commands. For instance, the “http” method requires a “url” command to specify the URL to fetch.

Options

Options are a type of command that affect the behavior of other commands. Metronome has two input specification file options: recursive, which allows methods that normally fetch a single file to recursively fetch an entire directory, and unpack, which unpacks the specified file. Both accept “true” or “false” as values.

At present, the two options are mutually exclusive.
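
The two rules above (exactly one method, and recursive and unpack being mutually exclusive) can be checked mechanically. The following Python sketch is an assumption-laden illustration: the function name parse_input_spec is hypothetical, and it assumes every non-blank line has the form command = value, as in the examples in this manual.

```python
def parse_input_spec(text):
    """Parse 'command = value' lines and apply the two rules stated above."""
    commands = {}
    method_count = 0
    for line in text.splitlines():
        line = line.strip()
        if not line:
            continue
        key, sep, value = line.partition("=")
        if not sep:
            raise ValueError(f"not a 'command = value' line: {line!r}")
        key, value = key.strip(), value.strip()
        if key == "method":
            method_count += 1
        commands[key] = value
    if method_count != 1:
        raise ValueError(f"expected exactly one method, found {method_count}")
    if commands.get("recursive") == "true" and commands.get("unpack") == "true":
        raise ValueError("recursive and unpack are mutually exclusive")
    return commands
```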

methods

A method specifies how to obtain an input for the run. The three categories of input methods, in decreasing order of reproducibility, are: Metronome outputs, version control systems, and archives.

Metronome Output

Metronome provides the ability to use the output of a previous Metronome run as the input to a subsequent run. Because Metronome runs are reproducible (to the extent that their inputs are reproducible), this method is itself a reproducible way of acquiring input for a run.

The only method in this category is metronome [NOTE: the ‘platforms’ cross-platform testing syntax requires two colons in Metronome 2.5.x, because of the platform_type:platform_name change.]

Version Control Systems

Metronome can acquire input from the “CVS” and “Subversion” version control systems. For ease of use, specifying a particular revision with these methods is optional; this simplifies the machinery required to regularly build the trunk (because it doesn’t have to be tagged ahead of time, or in the pre_all step), but reduces reproducibility (because without a particular revision specified, you may not easily get precisely the same repository again).

The methods in this category are cvs and svn.

Archives

You can fetch files (presumably from well-maintained archives; hence the name) from the web or a specific remote machine.

The methods in this category are http, ftp, and scp.

option: recursive

The recursive option affects methods assumed to have accessible directory trees (at present, “ftp” and “scp”) in the obvious way. (Because HTTP does not have to (and generally does not) expose the directory tree, the recursive option, proper, cannot reliably be implemented, although suggestions regarding mirroring will be entertained. The version control methods are inherently recursive, and the concept doesn’t apply to Metronome outputs.)

option: unpack

The Metronome system can recognize (generally, by file extension) a wide variety of compression and/or archival formats. If the unpack option is set to true, the method is archival (http, ftp, or scp), and the single target file is in one of those formats, Metronome will unpack it. This is generally simpler, easier, and more reliable than explicitly unpacking the file in a user task.

Recognized Compression/Archive Formats

... tar.gz, .gz, .tar, .zip?

option: wgetrc

Sometimes, you may wish to fetch inputs that are password-protected. To use a specific wgetrc file for an input, set in your input spec file:

wgetrc = <path_to_wgetrc_file>

You can then set a username and password in that wgetrc file. (See the wget documentation for details.)

Metronome 2.4 and 2.5

These input specification file commands did not change between versions.

cvs_module

Specifies the module to be checked out. The file CVS/Repository found in every directory of a CVS checkout contains the name of the module.

cvs_root

Specifies the root of the cvs repository. The root is contained in the file CVS/Root in each subdirectory of a CVS checkout.

cvs_rsh

Tell cvs to use an alternate remote shell. For example:

cvs_rsh = /nmi/scripts/ssh_no_x11

This tells CVS to use a local script, /nmi/scripts/ssh_no_x11, which hardcodes a set of ssh flags.

cvs_subdir

In the absence of a cvs_module command, the cvs_subdir command can be used to specify a comma-separated list of one or more specific directories (or files) to check out of the given repository.

cvs_tag

Specifies a tag name for the CVS checkout. The following example specifies the tag name “helloBranch”:

cvs_tag = helloBranch

ignore_missing_platforms

Set this to true in your input file to ignore any platforms not present in the input run. To be used with the nmi method.

scp_file

Specifies the URL to the target to be fetched using scp. The URL needs to be of the form hostname:path. If the host name is omitted then the system assumes that the target is on the submit machine and the local copy command is used.
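
The hostname:path rule can be illustrated with a small Python sketch that decides between scp and a local copy. The function name build_copy_command and the destination argument are hypothetical, and the exact commands Metronome runs are not documented here; the sketch only assumes the standard scp and cp tools with their -r flags.

```python
def build_copy_command(scp_file, destination, recursive=False):
    """Return the argv list for fetching scp_file into destination."""
    host, sep, path = scp_file.partition(":")
    if sep and host:
        argv = ["scp"]   # remote target of the form hostname:path
        source = scp_file
    else:
        argv = ["cp"]    # no host name: local copy on the submit machine
        source = path if sep else scp_file
    if recursive:
        argv.append("-r")
    return argv + [source, destination]


remote = build_copy_command("role.cs.wisc.edu:/home/mbletzin/glue", ".", recursive=True)
local = build_copy_command("/home/bgietzel/fw-client-server/glue", ".", recursive=True)
```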

Output Specification File Commands

In the nmi_submit submit file, one may define “outputs” to instruct the system to transfer some or all results to an external repository after the run is complete.

The syntax is as follows:

outputs = output_file_1, output_file_2, ...

Currently, the following output methods are supported:

  • scp
  • gridftp

Some example output files:

method = scp
platform = x86_slc_3
source = results.tar.gz
dest = /tmp/tolya

or

method = scp
platform = common
source = results.tar.gz
dest = /tmp/tolya-common

(In the last case, it is the user’s responsibility to create the results file under common/.)

Here’s an example of using the gridftp method:

method = gridftp
platform = x86_rh_9
source = results.tar.gz
dest = my.gridftp.host/data/rh9/results.tar.gz

Variable Substitution Inside NMI Specification Files

It is possible to perform macro substitution within the NMI run specification file and input specification files. There are two substitutions done at submit time:

  1. If you specify $(USER), it will be replaced by your login name.
  2. If you specify any other macro $(FOO) in your NMI run specification file, it will be immediately replaced by the value of the environment variable $_NMI_FOO, if defined in the submitter’s environment. (Otherwise it is replaced by the null string.)
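
The two substitution rules can be sketched in Python as follows. This is an illustration only: the function name substitute_macros and its environ and user parameters are hypothetical, and the sketch assumes macro names consist of letters, digits, and underscores.

```python
import getpass
import os
import re

# Matches $(NAME); the allowed characters in NAME are an assumption.
_MACRO = re.compile(r"\$\((\w+)\)")


def substitute_macros(text, environ=None, user=None):
    """Replace $(USER) with the login name and $(FOO) with $_NMI_FOO."""
    environ = os.environ if environ is None else environ
    user = getpass.getuser() if user is None else user

    def replace(match):
        name = match.group(1)
        if name == "USER":
            return user
        # Undefined macros are replaced by the null string.
        return environ.get("_NMI_" + name, "")

    return _MACRO.sub(replace, text)


line = substitute_macros("dest = /tmp/$(USER)/$(FOO)$(BAR)",
                         environ={"_NMI_FOO": "x"}, user="alice")
```

Here $(BAR) has no matching _NMI_BAR variable, so it expands to nothing.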

Command-Line Tools

This section describes the NMI Build & Test Software commands available on the submit machine.

nmi_condor_status

nmi_condor_status prints Metronome-specific information about your Condor pool. It accepts three options, in addition to the usual --nmi-conf. The -w, -ww, and -a options control the width of the printed information. Normally, nmi_condor_status prints output in 80 columns, truncating fields to fit. The -w and -ww print more columns; the -a option prints unlimited-width and unjustified output more-suitable for use by scripts.

For advanced users, starting with Metronome version 2.2.3, you may configure your pool so that nmi_condor_status ignores certain startds. This is useful if your pool includes non-Metronome hosts, or hosts which will never run jobs for other reasons. (For instance, if you’re running Hawkeye to monitor a submit host.) Setting the attribute NMI_isExecHost to FALSE in your Condor configuration file and adding NMI_isExecHost to STARTD_EXPRS will cause nmi_condor_status to ignore that startd.

nmi_gid2runid

This command has been available since before Metronome 2.2.2.


This command converts Metronome GIDs (strings) to Metronome runids (integers). It takes one argument, the GID, and one option, the ubiquitous --nmi-conf, which allows you to set the location of the Metronome configuration file:

$ nmi_gid2runid tutorial_nmi-s005.cs.wisc.edu_1201099302_30463
72316

If the database is down when you submit a job, this command can be used at a later date to determine the runid, as a command-line alternative to searching via the web interface.

nmi_pin

Command used to tell the build and test system database to save the results of a run beyond the usual timeframe. Here are some examples:

nmi_pin -list

Lists all of the runs that are currently pinned.

nmi_pin --runid=24630

Stores the record of run 24630 for the default 60 days.

nmi_pin --unpin --runid=24630

Undoes the pin.

nmi_pin --runid=24630 --days=100

Stores the record of run 24630 for 100 days.

nmi_resource_advertiser

Usage: nmi_resource_advertiser
        --nmiconf= Select which NMI configuration file to use
        --routing-table  Prints out the routing table
        --broadcast      Broadcast resource information to all hosts
        --debug          Enable debug output

Note: when the routing table gets updated by nmi_resource_advertiser, the job router must be reconfigured in order for the new information to get loaded. In some situations, it may reconfigure all of its host’s local Condor daemons. (It can’t use the target parameter for condor_reconfig because that tool doesn’t have logic to send a reconfig command to an arbitrary daemon. The resource advertiser actually tries to be smart about updating the job router: first it tries to send a SIGHUP with a killall condor_schedd.v7, then it tries a kill . If both of these measures fail for whatever reason, it then calls condor_reconfig.)

nmi_resubmit_run

This command became available in Metronome 2.5.0.


nmi_resubmit_run uses the Metronome database to recreate and then submit a new copy of a previously-submitted run. It cannot recreate runs submitted by versions prior to Metronome 2.5.0. (However, at the time of this writing, we are working on a tool which will remove this restriction (on a run-by-run basis) for runs whose run directories still exist.) For example:

nmi_resubmit_run 75549

will submit a copy of runid 75549, reporting as nmi_submit normally does. By default, nmi_resubmit_run does not keep the submission directory around after nmi_submit succeeds. You may override this with the --submit-dir option, passed as a flag (the directory name defaults to the id of the run being recreated) or with an argument (specifying the submit directory’s name).

We intend this command to simplify disk-space management, as reproducing a user’s run no longer requires a run’s run directory to be preserved. We do not, however, recommend that you begin removing run directories immediately after upgrading, as this is a new feature, and may yet have bugs.

nmi_rm

The nmi_rm command can be used to remove a run from the Metronome queue. It will not remove any information from the Metronome database. Unless the invoking user is root, it will only remove runs owned by the invoking user.

Here are some examples:

Usage: nmi_rm [options] [runid|gid] [runid|gid]...
        --user        Which user to remove runs for (defaults to current)
        --all               Remove all runs for the current user (or all runs if root)
        --force             Always try to remove a run and update the database
        --db-only           Only update the database
        --remove-consumers  Remove any runs that depend on a given run
        --help              Print this message

nmi_rm 24630

Kills run with a runid of 24630.

nmi_rm mbletzin_grandcentral.cs.wisc.edu_1155560480_32710

Kills a run using the GID.

nmi_rm 24630 mbletzin_grandcentral.cs.wisc.edu_1155560480_32710 24631

GIDs and runids can be mixed together and given in succession on the command line.

nmi_rm --user=jsmith --all

Kills all of the runs for user jsmith.

nmi_rundir

This command first became available in Metronome 2.5.1.


Converts runids or GIDs given on the command-line into hostname:full-path pairs.

nmi_runid2condor

This command became available in Metronome 2.2.4.


This command searches the Condor job queue for Metronome runs with the given runid or GID. It accepts one argument, the runid or GID, and a number of options.

$ nmi_runid2condor 72313
Global ID:      gthain_nmi-s003.cs.wisc.edu_1201097532_13311
Actively queued jobs for run:
        293011
        293012
        293015
        293020

The --history option will check Condor’s history files for the given runid or GID.

The --global option will check all job queues known to the local machine’s Condor collector, rather than just the local machine. For example, it will check the queues of nmi-s001.cs.wisc.edu and nmi-s005.cs.wisc.edu if invoked on nmi-s003.cs.wisc.edu.

nmi_runid2condor also accepts the usual --help and --nmiconf options.
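
If you want to feed the Condor IDs to another tool from a script, the output can be parsed with a few lines of Python. This sketch assumes the output looks exactly like the sample above (a "Global ID:" line followed by one Condor job ID per line); the function name parse_runid2condor is hypothetical.

```python
def parse_runid2condor(output):
    """Return (gid, [condor job ids]) from nmi_runid2condor-style output."""
    gid = None
    job_ids = []
    for line in output.splitlines():
        line = line.strip()
        if line.startswith("Global ID:"):
            gid = line.split(":", 1)[1].strip()
        elif line.isdigit():
            job_ids.append(int(line))
    return gid, job_ids


sample = """Global ID:      gthain_nmi-s003.cs.wisc.edu_1201097532_13311
Actively queued jobs for run:
        293011
        293012
        293015
        293020
"""
gid, job_ids = parse_runid2condor(sample)
```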

This command is generally most useful in conjunction with condor_q’s analysis flags:

$ condor_q -bet 293020
...
293020.000:  Run analysis summary.  Of 236 machines,
    233 are rejected by your job's requirements
      2 reject your job because of their own requirements
      1 match but are serving users with a better priority in the pool
      0 match but reject the job for unknown reasons
      0 match but will not currently preempt their existing job
      0 are available to run your job
...
    Condition                                     Machines Matched    Suggestion
    ---------                                     ----------------    ----------
1   target.nmi_platform == "hppa_hpux_11"         3
2   ( target.has_java_1_5_0_03 isnt undefined )   3
...

In this case, run 72313 can only run on one of three “machines” in the pool, two of which reject it, the other of which is busy. You can look at the web interface’s pool status pages to discover that this means a single physical hppa_hpux_11 machine exists, but this BUILD won’t be run by a TEST or PARALLEL slot, and that the BUILD slot is busy. In other cases, you might discover a typo in your prereqs (which would look something like 2 ( target.has_java_1_5_0_02 isnt undefined ) 0), or that you’ve requested a combination of existing prereqs not matched by any single machine. (In which case you should probably contact your Metronome installation’s administrator!)

The Condor ID passed to condor_q was the last one listed by nmi_runid2condor. This will generally, but not always, be the right ID to pass: Metronome runs each map to more than one Condor job, and those jobs are not the same throughout the run’s lifetime. As in the example, however, there should be four Condor jobs for the bulk of the run. Passing the wrong Condor ID to condor_q is harmless, and will generally result in it saying that the job is already being serviced.

nmi_runid2gid

This command has been available since before Metronome 2.2.2.


This command converts Metronome runids (positive integers) into Metronome GIDs (strings). Metronome uses GIDs internally, so that it can run jobs during database failures. (The information is recorded on disk and entered into the database when it comes back up.) It takes one argument, the runid, and one option, the ubiquitous --nmi-conf, which allows you to set the location of the Metronome configuration file:

$ nmi_runid2gid 72316
tutorial_nmi-s005.cs.wisc.edu_1201099302_30463

This command is useful when interacting directly with Condor, because the GID of a Metronome run is entered in the ad for all of its tasks. However, nmi_runid2condor converts directly from runids to Condor IDs if a job is in queue, which can be more convenient.

nmi_submit

Usage: nmi_submit
        --nmiconf=   Select which NMI configuration file to use
        --must-match       Job must match with resources before submitted
        --notify-fail-only Only send notification if job fails
        --verbose          Enable verbose output
        --quiet            Do not print job submission information
        --timeout=    Number of seconds to wait for runid. Default is 180
        --no-wait          Do not wait for runid. Program returns immediately
        --debug            Enable debug output
        --help             Show this information

This command is used to start B&T runs. The command expects a submit file as input. For example the following executes the submit file perlHelloWorld.submit located in the current working directory:

nmi_submit perlHelloWorld.submit

Although the submit file, and any input specification files it references, must exist when nmi_submit is invoked, they are not read again, and may be removed afterwards. A copy of the input files, along with all of the submission’s runtime data and output files, is archived in the “run directory”. The current working directory of the original submission has no significance.

nmi_testsforrun

This command returns all the NMI test runids that use a given build submission as input. Given either a gid or a runid, the command prints out all the test runids one-by-one on separate lines.

Command usage:

nmi_testsforrun <runid|gid> [--nmiconf=<path>]

Using a runid:

$ nmi_testsforrun 32419
32419
32424
32426
32427

Using a gid:

$ nmi_testsforrun cndrauto_nmi-s001.cs.wisc.edu_1159503304_1509
32419
32424
32426
32427

Remote Execution Environment and Tools

This section documents the environment in which platform-specific (remote_*) tasks execute, and the NMI tools available to them.

NMI-Provided Environment Variables

nmi_getattr, nmi_putattr

Primarily intended to enable simple communication between parallel tasks on different hosts, the nmi_getattr and nmi_putattr tools are available on the remote execution platform and allow a task to read or write individual attributes from a common classad-like hash table.

Usage:

nmi_putattr attr value

nmi_putattr will set the value of the given attr (overwriting any preexisting value).

returns 0 upon success
returns non-zero upon failure to set the attr for any reason

nmi_getattr attr

nmi_getattr will print to stdout the value of the given attr.

returns 0 if the attr is defined
returns 1 if the attr is undefined
returns >1 upon failure

Note: These tools are non-blocking. nmi_getattr will return 1 if you try to get a value that does not exist. The user must poll for values and determine how long to wait for a value to exist before giving up and declaring failure. Future plans for a Metronome-provided polling mechanism are in the works.
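
Until a Metronome-provided polling mechanism exists, a task can poll for itself. The Python sketch below shows one way to do that using the documented return codes of nmi_getattr (0 defined, 1 undefined, greater than 1 failure). The function names, the timeout and interval defaults, and the injectable getattr_fn, sleep, and clock parameters are all hypothetical choices made so the loop can be exercised without a real pool.

```python
import subprocess
import time


def run_getattr(attr):
    """Invoke nmi_getattr; return (exit code, printed value)."""
    done = subprocess.run(["nmi_getattr", attr], capture_output=True, text=True)
    return done.returncode, done.stdout.strip()


def wait_for_attr(attr, timeout=60.0, interval=2.0, getattr_fn=run_getattr,
                  sleep=time.sleep, clock=time.monotonic):
    """Poll until attr is defined; raise on failure or after timeout seconds."""
    deadline = clock() + timeout
    while True:
        code, value = getattr_fn(attr)
        if code == 0:
            return value                      # attr is defined
        if code > 1:
            raise RuntimeError(f"nmi_getattr failed with exit code {code}")
        if clock() >= deadline:               # code == 1: still undefined
            raise TimeoutError(f"{attr} still undefined after {timeout} seconds")
        sleep(interval)
```

A server node might call nmi_putattr once it is listening, while the client calls wait_for_attr with the same attribute name before connecting; how long to wait remains the task author's decision.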

nmi_getfile, nmi_putfile

The nmi_getfile and nmi_putfile tools are used to send files between nodes of a parallel job. These scripts are available on the remote execution platform and allow nodes to send files to, and retrieve files from, each other via the submit host.

Usage:

nmi_putfile local_file remote_file

nmi_putfile will send a file to the submit host.

returns 0 upon success
returns 1 if the chirp invocation fails
returns 2 if the local or remote file is not defined, or if the local file does not exist
returns actual error code >1 upon failure

nmi_getfile remote_file local_file

nmi_getfile will fetch a file from the submit host.

returns 0 if the fetch worked successfully
returns 1 if the chirp invocation fails
returns 2 if the local or remote file is not defined
returns actual error code >1 upon failure

Results Archive

In principle, no NMI build or test artifact should be irreplaceable. The NMI database stores the full specification needed to reproduce all past builds or tests, and an NMI facility should continue to maintain the necessary platforms and prerequisites for as long as reproducibility is required.

In practice, however, there are three reasons to archive the full output of old runs: convenience, efficiency and “insurance”. It is convenient for developers to be able to examine the detailed output of a build or test after it completes (e.g., to perform a “post-mortem”). It can be more efficient to use the archived output of one build as the input to multiple test runs, rather than re-building the initial software from scratch each time it is needed. And it can be prudent to save the full results of important builds as insurance, in case a software, hardware, or administrative error renders it unexpectedly irreproducible in the future.

NMI provides a mechanism for archiving old run results to address these concerns. However, our focus is on convenience and efficiency, and so depending on the degree of “insurance” desired, projects may wish to keep their own additional archives of completed builds or tests.

There are two classes of archive in the NMI framework: a full archive and a metadata-only archive. It’s worth reiterating that the NMI DB stores indefinitely the full specification of each run and the essential outcome of each of its tasks (e.g., return code, execution time, etc.). The archives we are discussing here are the file-based data and metadata corresponding to the run’s submission, input, execution, and results.

The duration that these file archives are kept is site-dependent, and is usually a function of available disk space and the rate at which new results are being generated. The NMI software assumes that a given run’s results are to be stored only temporarily unless they are explicitly “pinned” by their owner. Pinned builds are kept until the specified expiration of their pin.

This means the results of each run are in one of the four states below. The present archival state of each run is stored in the database.

State / Retention Goal (Which Runs, How Long)
—————
a) full archive / most recent runs, until disk needed
b) pinned full archive / all pinned runs, until pin expires
c) metadata-only archive / all runs, forever(1)
d) no longer archived / none (only as a result of data loss)

Valid state transitions are:
a -> b|c|d (after a pin, cleaning, or data loss, respectively)
b -> a|d (after an unpin or data loss)
c -> d (after a data loss)
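
The state transitions above form a small state machine, sketched here in Python. The state labels ("full", "pinned", "metadata_only", "unarchived") are hypothetical names for states a through d; the values actually stored in the database are not documented here.

```python
# Hypothetical labels for states a) through d) above.
VALID_TRANSITIONS = {
    "full": {"pinned", "metadata_only", "unarchived"},  # a -> b|c|d
    "pinned": {"full", "unarchived"},                   # b -> a|d
    "metadata_only": {"unarchived"},                    # c -> d
    "unarchived": set(),                                # d is terminal
}


def can_transition(current, new):
    """Return True if moving from current to new is a valid transition."""
    return new in VALID_TRANSITIONS[current]
```

Note that a pinned run cannot go directly to metadata-only: it must first be unpinned (or its pin must expire) so that it returns to the plain full-archive state.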

Definitions:

  • “metadata” means NMI spec files, condor submit & log files, task stdout/stderr files, etc.
  • “full archive” means metadata and all input & results tarballs

In addition, in the UW-NMI B&T Lab we unofficially “back up” the full archive of each run being cleaned, and only permanently delete them as disk is needed (again, in reverse chronological order). These backups are not performed as part of the NMI framework, are not tracked in the DB, and there is no automated way to retrieve them — they are simply a failsafe in case the disk cleaner malfunctions or we need to retrieve a just-cleaned run someone forgot to pin.

(1) This policy may be untenable, depending on the size of metadata, the volume of runs, and the growth of available disk, but for the moment has been possible at UW-Madison. As a result, the NMI B&T software currently provides no automated means to “clean” the metadata of old builds.

Return Codes

Run Result Codes

The result code stored in the DB for a Metronome run is the number of failed tasks in that run. If the run had no failed tasks, Metronome will store a return code of 0. If the run is in a special state, the Metronome result code will be negative, as follows:

Run Result Code / Key
null / running
>0 / completed and failed
0 / completed successfully
-1 / removed

NOTE: the nmi_run_status command-line tool interprets the value stored in the Metronome DB along with other information, and returns its own set of status codes for each run.
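
For scripts that read the run result code directly from the DB, the key above can be sketched as a Python function. The function name describe_run_result is hypothetical, a database null is assumed to arrive as None, and this is not how nmi_run_status computes its own status codes.

```python
def describe_run_result(code):
    """Translate a run result code (None for a DB null) into its key."""
    if code is None:
        return "running"
    if code == 0:
        return "completed successfully"
    if code > 0:
        # A positive code is the number of failed tasks in the run.
        return f"completed and failed (failed task count: {code})"
    if code == -1:
        return "removed"
    return "unknown"
```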

Task Result Codes

The result code reported by Metronome (and stored in the DB) for each Metronome task is the unix return code of the user-specified task script. If the script had no return code because it was killed by a signal, the Metronome result code will be the negative integer corresponding to the signal number. For example, -9 means SIGKILL.

For some Metronome or Condor-level failures (e.g., a job removed from the queue prior to the task script even running), the Metronome result code will be a special negative value beyond the range of signals (e.g., -1001).

These special negative Metronome result codes are documented here. The web portal does not yet translate all of them from numbers into human-readable terms.

Task Return Code / Key / Description
0 to 255 / Normal Exit / Return code of user-specified task executable
-1 to -31 / Killed by Signal / Negative integer corresponding to the Unix signal number
-32 / Execution failure / The user-specified task executable could not be executed or timed out
-1001 / Submission failure / DAGMan error code for a job submission failure
-1002 / Removed / The task’s Condor job was manually removed from the queue "out from under" Metronome
-1003 / Interrupted / Tasks are given this result code if they were interrupted while executing and no return code was available. (E.g., if the Metronome wrapper overseeing the task was killed by an external entity, or if the remote resource crashed.) This result should be temporary and non-fatal: once the task is automatically re-executed, its new status will replace this one.
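
Since the web portal does not yet translate all of these codes, a script can do so itself. The following Python sketch encodes the table above; the function name describe_task_result is hypothetical.

```python
# Special negative codes beyond the range of signals, from the table above.
_SPECIAL_TASK_CODES = {
    -32: "execution failure",
    -1001: "submission failure",
    -1002: "removed",
    -1003: "interrupted",
}


def describe_task_result(code):
    """Translate a task result code into the key from the table above."""
    if 0 <= code <= 255:
        return "normal exit"
    if -31 <= code <= -1:
        # The signal number is the absolute value of the code.
        return f"killed by signal {-code}"
    return _SPECIAL_TASK_CODES.get(code, "unknown")
```

For example, describe_task_result(-9) reports a kill by signal 9, which is SIGKILL.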

Meta-Tasks

Special meta-tasks (e.g., platform_job, or remote_task when sub-tasks have been declared) are given return codes corresponding to their total number of failed sub-tasks. If all of the tasks within the meta-task succeeded (i.e., returned 0), Metronome will assign the meta-task a return code of 0.

Running parallel jobs

Overview

This feature of Metronome builds on the Condor Parallel Universe and provides for running jobs on multiple machines simultaneously. Condor’s Chirp mechanism makes communication between the machines possible.

Submitting parallel jobs

This node describes how to set up your pool to run parallel jobs. The UW NMI pool already has the DedicatedScheduler set up. The submitter is nmi-s005.cs.wisc.edu.

A Metronome submit file for a parallel job looks similar to a normal submit file, with the following exceptions:

  • Add parentheses around your platform list. Note that at the present time you are only allowed one set of parentheses per cmdfile. The parallel nodes are assigned based on the location of the platform name in this list. In the example below, x86_fc_2 is node 0.
platforms = (x86_fc_2, x86_rhas_3)
  • For any of the following task hooks that run on the remote side, enter a glue script for each node as detailed below. The platform_pre task must be defined, even if it uses a noop script. This is a limitation of the Metronome framework and will be fixed in a future release. The example assumes the glue scripts are stored in directories named glue, client and server.


pre_all = glue/pre_all

platform_pre_0 = client/platform_pre
platform_pre_1 = server/platform_pre

remote_declare_0 = client/remote_declare
remote_declare_1 = server/remote_declare

remote_pre_0 = client/remote_pre
remote_pre_1 = server/remote_pre

remote_task_0 = client/remote_task
remote_task_1 = server/remote_task
remote_task_args_0 = 7000
remote_task_args_1 = 7001
remote_task_timeout_0 = 20
remote_task_timeout_1 = 20
remote_post_0 = client/remote_post
remote_post_1 = server/remote_post

platform_post_0 = client/platform_post
platform_post_1 = server/platform_post

post_all = glue/post_all

Communication between parallel job nodes

Chirp facilitates the underlying communication between nodes of a parallel job. Use the nmi_putattr and nmi_getattr scripts, described in the nmi_getattr, nmi_putattr section, to inject params directly into the job ad from one node and retrieve them from another. The scripts are sent to remote machines and are located in the NMI_BIN directory on each remote job node.

The older method of Chirp communication may be used as well. You may send a file to the head node using Chirp and then retrieve it on other nodes. Use the nmi_putfile and nmi_getfile scripts, described in the nmi_getfile, nmi_putfile section, for this purpose.

Metronome-provided environment variables

Certain attributes are published to the job ad and the remote environment. These attributes may be helpful in job synchronization and inter-job dependencies.

  • Hostname
NMI_NODE_0_HOSTNAME=nmi-build26.cs.wisc.edu
NMI_NODE_1_HOSTNAME=nmi-build21.cs.wisc.edu

  • Start time, return code and end time


NMI_NODE_0_START_remote_task=1177970566

NMI_NODE_0_START_remote_pre=1177970565
NMI_NODE_0_RVAL_remote_pre=0
NMI_NODE_0_END_remote_pre=1177970566

NMI_NODE_0_START_remote_declare=1177970563
NMI_NODE_0_RVAL_remote_declare=0
NMI_NODE_0_END_remote_declare=1177970564

  • Other useful variables
NMI_BIN=/home/condor/execute/dir_13377/bin
_CONDOR_SCRATCH_DIR=/home/condor/execute/dir_13377
_CONDOR_PROCNO=0
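
A task script can read these variables to discover its peers and check on their progress. The Python sketch below assumes the variable naming shown in the examples above (NMI_NODE_<n>_HOSTNAME and NMI_NODE_<n>_RVAL_<task>); the function names node_hostnames and task_return_code are hypothetical.

```python
import os
import re

_HOSTNAME_VAR = re.compile(r"^NMI_NODE_(\d+)_HOSTNAME$")


def node_hostnames(environ=None):
    """Return {node number: hostname} from NMI_NODE_<n>_HOSTNAME variables."""
    environ = os.environ if environ is None else environ
    hosts = {}
    for name, value in environ.items():
        match = _HOSTNAME_VAR.match(name)
        if match:
            hosts[int(match.group(1))] = value
    return hosts


def task_return_code(node, task, environ=None):
    """Return a task's return code on a node, or None if not yet published."""
    environ = os.environ if environ is None else environ
    value = environ.get(f"NMI_NODE_{node}_RVAL_{task}")
    return None if value is None else int(value)
```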

Finding Condor CPU slots

% condor_status -const 'DedicatedScheduler=!=Undefined'

This command lists all the “P” slots. To confirm which schedd queue each slot is bound to, run:

% condor_status -const 'DedicatedScheduler=!=Undefined' -format '%s\t' Machine -format '%s\n' DedicatedScheduler

Glossary

GIDs and Run IDs

The GID, or Globally Unique ID, is an alphanumeric identifier used by the build and test system to refer to a specific build or test submission. It is generated at submission time, and is used internally as a component of the run’s archive directory (or run directory) path.

The Run ID, or runid, is a short numeric identifier also used by the build and test system to refer to a specific build or test submission. It is only generated once the submission is successfully registered into the database (it is the primary key of the record), and as a result it may not yet be known at submission time if the database is unavailable or unresponsive.

There is a one-to-one correspondence between Run IDs and GIDs; the only practical differences are that GIDs are long and unwieldy, but are guaranteed to be known at submission time, whereas Run IDs are short and convenient, but cannot be presumed to exist until the database is successfully initialized for each run. GIDs are also unique across NMI pools, whereas Run IDs are only unique within a given NMI pool. Although having two similar identifiers can be confusing, the distinction is important and exists in order to ensure that build & test jobs can be submitted even in the face of database performance or availability problems.

GIDs may be used to look up Run IDs (if known), and vice versa, via the nmi_gid2runid and nmi_runid2gid tools.

Both the Run ID and GID are used (often interchangeably) as arguments for many NMI command-line tools (e.g., nmi_pin). The Run ID is also used as the identifier for the nmi input method.

input specification file

  • the file containing instructions for fetching a single input to a run

meta-task

In addition to the status and outcome of each individual user-defined task in a run, the Metronome DB also contains the outcome of a handful of special meta-tasks for each run. These meta-tasks represent the collective result of a set of related tasks, and are useful for reporting their outcome as a whole.

For example, the platform_job meta-task represents the collective outcome of all the tasks which executed on a given platform in a run; likewise, the remote_task meta-task represents the collective outcome of all the user-defined tasks that were declared on a given platform (however, if no user-defined tasks were declared at all, remote_task is a regular task and not a meta-task).

platform_job

For each specified platform in a Metronome build or test run, the DB records the outcome of a special platform_job meta-task representing the collective success or failure of all the tasks which executed on that platform.

run

aka build run
aka test run
aka build/test run
aka NMI run

A distinct build or test submission to Metronome (via nmi_submit). For more information, see Terminology.

run specification file

aka build specification file
aka test specification file
aka build/test specification file
aka NMI specification file
aka NMI submit file

  • the file provided to nmi_submit containing instructions for running a given build or test run.

Orphaned Pages

These are documentation snippets for which we don’t yet have a suitable high-level section in the reference manual. If there’s something here that you can find the right home for, please relocate it. If not, leave it here.

Archived Run Results

The metadata of each build/test run, and of all its tasks, are stored indefinitely in the NMI Build/Test database, so you should always be able to see (or query) the outcome of past builds.

However, the user-level output data (results.tar.gz files) of build/test runs are only archived on the submission host for a limited period of time after the run has completed, as defined by each individual NMI Build & Test Facility’s policy. After that time, the data may be deleted, unless they are explicitly “pinned” by the owner using the nmi_pin command. Pinned runs are never deleted.

Unlike the database, the output data archive is typically not backed up, and is not guaranteed to exist for any length of time — so if your build or test output data is difficult to reproduce, or you need access to it indefinitely, you should copy it to your own reliable storage. To make such copying easier, we plan to add optional “put” steps that can be defined at the end of a run (analogous to the “fetch” steps), which will allow you to specify at submit-time where & how you’d like your results transferred off-site. For more information, see http://grandcentral.cs.wisc.edu/nmi_drupal/?q=node/192.

Environment Variables

A variable in the Build and Test System is any value that is set in the execution environment of a task by the system. For example, the variable NMI_PLATFORM can be accessed by remote programs in the following ways:

Language               Expression
Bourne and C Shells    $NMI_PLATFORM
Perl                   $ENV{'NMI_PLATFORM'}
Python                 os.environ['NMI_PLATFORM']
C                      getenv("NMI_PLATFORM")
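
As a slightly fuller illustration of the Python case, a task script might read the variable defensively so that it can also be run by hand outside the system. This is a minimal sketch; the function name current_platform is hypothetical.

```python
import os


def current_platform(env=os.environ):
    """Return the value of NMI_PLATFORM, or None when the variable is not set."""
    return env.get("NMI_PLATFORM")


platform = current_platform()
if platform is None:
    print("NMI_PLATFORM is not set; probably not running under the build and test system")
else:
    print("running for platform", platform)
```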

NMI_<run specification attribute>

All of the attributes in a run specification file can be accessed as environment variables at runtime. The build and test system prepends NMI_ to the attribute name and stores the attribute's value in the resulting environment variable. For example, the following attributes:

platform_post = code/perlWhereAmI/whereAmI.pl
platform_post_args = platform_Post Task
platform_pre = code/perlWhereAmI/whereAmI.pl
platform_pre_args = platform_Pre Task

...are transformed into the following environment variables by the build and test system.

NMI_platform_post=code/perlWhereAmI/whereAmI.pl
NMI_platform_post_args=platform_Post Task
NMI_platform_pre=code/perlWhereAmI/whereAmI.pl
NMI_platform_pre_args=platform_Pre Task

These environment variables are defined for all of the tasks.
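
The mapping described above can be sketched in a few lines of Python. This is an illustration of the naming rule only, not the system's actual implementation; the function name spec_to_env is hypothetical, and the sketch assumes one "name = value" attribute per line.

```python
def spec_to_env(spec_text):
    """Map 'name = value' attribute lines to NMI_-prefixed variable names."""
    env = {}
    for raw in spec_text.splitlines():
        line = raw.strip()
        if not line or "=" not in line:
            continue  # skip blank lines and anything that is not an attribute
        name, value = line.split("=", 1)
        env["NMI_" + name.strip()] = value.strip()
    return env


spec = """\
platform_post = code/perlWhereAmI/whereAmI.pl
platform_post_args = platform_Post Task
"""
for name, value in spec_to_env(spec).items():
    print(name + "=" + value)
```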

NMI_CONF

First available in Metronome 2.2.8.

If set, the environment variable NMI_CONF gives the location of the nmi.conf file. It is overridden by the command-line argument, but takes precedence over the system default.
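
The precedence can be summarized as: command-line argument first, then NMI_CONF, then the system default. A minimal sketch of that lookup order follows. The default path /etc/nmi.conf and the function name are placeholders chosen for the example, and treating an empty NMI_CONF as unset is our assumption.

```python
import os

# Hypothetical default location, used only for this illustration.
SYSTEM_DEFAULT = "/etc/nmi.conf"


def resolve_nmi_conf(cli_value=None, env=None, default=SYSTEM_DEFAULT):
    """Pick the nmi.conf location: command line, then NMI_CONF, then default."""
    env = os.environ if env is None else env
    if cli_value:
        return cli_value
    if env.get("NMI_CONF"):  # assumption: an empty value counts as unset
        return env["NMI_CONF"]
    return default


print(resolve_nmi_conf(env={"NMI_CONF": "/home/user/nmi.conf"}))
```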

NMI_PLATFORM

This variable contains the name of the platform that is associated with the task. For tasks such as pre_all and post_all, the variable is set to local. Note that the platform name corresponds to the remote platform that the tasks are targeting rather than the platform the task is running on. For submit-side tasks such as platform_pre and platform_post, this means that the variable identifies the platform on which the subsequent remote tasks will run, not the current submit platform.

The following table shows an example of how the value of this variable changes with the task involved. It describes a job that was submitted from a Linux machine (i386-linux-thread-multi) and run on a Solaris platform (sun4u_sol_5.8).

Task                  NMI_PLATFORM Value    Platform Name
pre_all               local                 i386-linux-thread-multi
platform_pre          sun4u_sol_5.8         i386-linux-thread-multi
remote_pre_declare    sun4u_sol_5.8         sun4-solaris
remote_declare        sun4u_sol_5.8         sun4-solaris
remote_pre            sun4u_sol_5.8         sun4-solaris
remote_task           sun4u_sol_5.8         sun4-solaris
remote_post           sun4u_sol_5.8         sun4-solaris
platform_post         sun4u_sol_5.8         i386-linux-thread-multi
post_all              local                 i386-linux-thread-multi

PATH

This variable is present for every shell and is the standard way to tell the shell where executables are located. The build and test system sets this variable for every remote platform so that the standard set of executables plus any additional prereq executables are used.

By design, remote tasks should run with only their prereqs and the default OS bin directories in their PATH, so the software they see is explicit and predictable. Currently, remote tasks run with the following specific PATH elements, in order:

  • the local host’s path to any requested prereqs, in the order they appear on the prereqs line in the run specification file
  • (on Irix platforms only) /usr/bsd
  • /bin:/usr/bin
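
The ordering above can be sketched as follows. This only illustrates the order of the elements and is not the system's own code: the function name is hypothetical, the prereq directories in the usage line are made-up paths, and the sketch assumes each prereq contributes a single directory.

```python
def remote_task_path(prereq_dirs, irix=False):
    """Join PATH elements in the documented order.

    prereq_dirs: directories for the requested prereqs, already in the
    order they appear on the prereqs line of the run specification file.
    """
    parts = list(prereq_dirs)
    if irix:
        parts.append("/usr/bsd")  # Irix platforms only
    parts += ["/bin", "/usr/bin"]
    return ":".join(parts)


# Made-up prereq locations, for illustration only.
print(remote_task_path(["/prereq/coreutils-5.2.1/bin", "/prereq/perl-5.8.5/bin"]))
```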

_NMI_GID

This variable contains the GID of the run. It is only present in the tasks that are run on the submit machine: pre_all, platform_pre, platform_post, and post_all.

_NMI_PREREQ_<prerequisite>_ROOT

Set by the build and test system for every prerequisite requested and found. The variable is set to the installation path of the prereq. The format of the variable name is _NMI_PREREQ_[prereq name]_[prereq_version]_ROOT. The version number is delimited by underscore characters “_” instead of periods “.”. For example, the variable for the prereq coreutils-5.2.1 would be _NMI_PREREQ_coreutils_5_2_1_ROOT.
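
A task can compute the variable name from the prereq string and then look it up. The sketch below assumes the name and version are separated by the last hyphen in the prereq string, which holds for coreutils-5.2.1 but is not guaranteed for every prereq; the function name is hypothetical.

```python
import os


def prereq_root_var(prereq):
    """Build the _NMI_PREREQ_..._ROOT variable name for 'name-version'."""
    # Assumption: the version is everything after the last hyphen.
    # A string without a hyphen raises ValueError here.
    name, version = prereq.rsplit("-", 1)
    return "_NMI_PREREQ_{}_{}_ROOT".format(name, version.replace(".", "_"))


var = prereq_root_var("coreutils-5.2.1")
print(var, "=", os.environ.get(var, "(not set)"))
```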

_NMI_STEP_FAILED

This environment variable is set only in the environment of remote platform tasks, and contains the name of the last remote task to have failed on that platform. If nothing has failed on the platform the variable is not defined.

If a user-defined task has failed, then _NMI_STEP_FAILED will be set to remote_task.
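
For instance, a remote_post script could use this variable to decide whether packaging results is worthwhile. The following is a small sketch; the function name failed_step is our own.

```python
import os


def failed_step(env=os.environ):
    """Return the name of the last failed remote task, or None if none failed."""
    return env.get("_NMI_STEP_FAILED")


step = failed_step()
if step is None:
    print("no earlier remote task has failed on this platform")
else:
    print("an earlier remote task failed:", step)
```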

_NMI_TASKNAME

This variable contains the name of the task that is currently being executed. This variable is only present during the remote_task task.

Meaningful Filenames

The build and test system has several well-known filenames that it looks for. These files can be used to change the behaviour of the run.

notify.nmi

This feature first appeared in Metronome 2.5.1.

On the submit node, in the working directory of the post_all step (<rundir>/userdir/common), the contents of the file notify.nmi will be appended to the notification e-mail, if any, sent by the system.

results.tar.gz

When the build and test system sees this file on a remote platform, it copies it back to the run directory (/nmi/run/<GID>), into the user directory associated with that platform. For example, for a run with a GID of mbletzin_grandcentral.cs.wisc.edu_1151672109_29028 on a ppc_aix_5.3 platform, the results.tar.gz file will be found at:

/nmi/run/mbletzin_grandcentral.cs.wisc.edu_1151672109_29028/userdir/ppc_aix_5.3/results.tar.gz
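
The path layout in this example can be expressed as a small helper. This is a sketch only: the function name is hypothetical, and the /nmi/run root is taken from the example above and may differ between facilities.

```python
import posixpath

# Run directory root as it appears in the example above; may vary by facility.
RUN_ROOT = "/nmi/run"


def results_archive_path(gid, platform, run_root=RUN_ROOT):
    """Return where results.tar.gz for one platform of a run is archived."""
    return posixpath.join(run_root, gid, "userdir", platform, "results.tar.gz")


print(results_archive_path(
    "mbletzin_grandcentral.cs.wisc.edu_1151672109_29028", "ppc_aix_5.3"))
```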

tasklist.nmi

This optional, user-defined file is used to tell the build and test system to subdivide the remote_task. If it exists, it is examined by the NMI software on a remote platform before remote_task is invoked.

The format of the file is a task name followed by a timeout value for that task. The two items are delimited by one or more spaces, so a task name cannot contain any spaces. Each task is on a separate line. The timeout value defaults to minutes, but the unit may be specified (‘M’ or ‘m’ for minutes, ‘s’ or ‘S’ for seconds). For example, if a tasklist.nmi file contains the following lines, then the build system will split the remote task into three separate tasks:

MyTaskOne 10
MyTaskTwo 10
EndOfMyTasks 2

The system will fail the first two tasks if their execution exceeds 10 minutes, and fail the last task if its execution exceeds 2 minutes. The overview page will show each task as a separate row.

These task names will be passed to your remote_task in the environment variable _NMI_TASKNAME, once per name. See our tutorial for an example.

In Metronome 2.5.1, you may specify ‘h’ or ‘H’ for units of hours.
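
To make the file format concrete, here is a rough Python sketch that reads tasklist.nmi entries and converts each timeout to seconds. It is not the parser used by the NMI software. In particular, it assumes the unit letter is written as a suffix on the number (for example 30s), which the description above does not spell out, and the function name is hypothetical.

```python
import re

# Seconds per unit letter; minutes are the default when no unit is given.
UNIT_SECONDS = {"s": 1, "m": 60, "h": 3600}

# <task name> <whitespace> <number><optional unit letter>
ENTRY = re.compile(r"^(\S+)\s+(\d+)\s*([smhSMH]?)$")


def parse_tasklist(text):
    """Return a list of (task_name, timeout_in_seconds) tuples."""
    tasks = []
    for raw in text.splitlines():
        line = raw.strip()
        if not line:
            continue
        match = ENTRY.match(line)
        if match is None:
            raise ValueError("unrecognized tasklist line: %r" % raw)
        name, amount, unit = match.groups()
        tasks.append((name, int(amount) * UNIT_SECONDS[(unit or "m").lower()]))
    return tasks


example = """\
MyTaskOne 10
MyTaskTwo 10
EndOfMyTasks 2
"""
for name, seconds in parse_tasklist(example):
    print(name, seconds)
```

Inside remote_task itself, the current name is available from the _NMI_TASKNAME environment variable, so a single script can branch on that value to do the work for each listed task.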

Run Directory

This section discusses the contents of the run directory. The path name of this directory contains the GID of the run.

Path Descriptions

<task>.(err|out)

These files contain the standard error and standard output of the task.

cmdfile

Copy of the build and test specification file that was submitted for this run.

userdir

Directory that contains all of the files that are downloaded to the remote platforms.

userdir/<platform_name>

This directory contains the fetched distribution plus whatever is added by the pre_all, platform_pre, and platform_post tasks.

userdir/common

This directory contains the fetched distribution plus anything added during the pre_all and post_all tasks.