nmi-support@cs.wisc.edu – This address is for questions or support requests for the Metronome developers and NMI Lab staff.
The following tools are not supported by Metronome, but may prove useful.
If you have a tool other users may benefit from, please let us know, and we can add it to this list.
The attached bash script, ‘add_prereq_to_all_platforms.sh’, writes or extends, as appropriate, a ‘prereqs_<platform> = ‘ line in the named submit file to include the prereq named on the command-line. The other attachment, ‘add_prereq_version_to_all_platforms.sh’, does the same but accepts an additional argument, the prefix to the version of the prereq to match.
Generates a submit file for each permutation of a prereqs list of the form ‘x, {y, z, w}, v, {u, t}’. Useful for compatibility testing.
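A minimal sketch of the kind of expansion described above, assuming each braced group means "choose exactly one of these alternatives" and that groups are not nested. The function name expand_prereqs is hypothetical; the actual tool's parsing rules are not documented here.

```python
import itertools
import re


def expand_prereqs(spec):
    """Expand 'x, {y, z}, v' into one prereqs string per combination."""
    # Tokens are either a whole braced group or a run of plain text.
    tokens = re.findall(r"\{[^{}]*\}|[^,{}]+", spec)
    choices = []
    for token in tokens:
        token = token.strip()
        if not token:
            continue
        if token.startswith("{"):
            # A braced group contributes one alternative per combination.
            alternatives = [a.strip() for a in token[1:-1].split(",") if a.strip()]
            choices.append(alternatives)
        else:
            # A bare item appears in every combination.
            choices.append([token])
    return [", ".join(combo) for combo in itertools.product(*choices)]


if __name__ == "__main__":
    for line in expand_prereqs("x, {y, z, w}, v, {u, t}"):
        print("prereqs = " + line)
```

Each returned string could then be written into its own copy of the submit file.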
This section contains reference documentation for users of the Metronome Software.
Metronome (formerly The NMI Build & Test System) is a distributed, multi-platform framework designed to provide automated software building and testing capabilities to a variety of grid computing projects.
We believe that software isn’t reliable unless it’s regularly built and tested. Doing so requires not only a significant number of CPU cycles, but often a variety of unusual and difficult-to-maintain platforms, and a framework for automating, tracking, and monitoring the entire process.
Our goal is to provide an implementation of this framework utilizing proven grid computing tools as a foundation, as well as to support the growing number of Metronome Facilities internationally, including our own NMI Lab at the University of Wisconsin-Madison.
Our LISA 2006 paper below provides more details on the framework and its implementation, including how it differs from some other common build or test frameworks.
Unfortunately, we (the Metronome developers) have been inconsistent over time in our own terminology for some aspects of the Metronome framework and software. This may be reflected even on this site, although we’re working to make our documentation and reference materials more consistent.
For the time being, the following is the most common vocabulary used for the Metronome workflow:
A user submits a Metronome run to execute a build or test workflow on one or more platforms.
A run is described in a run specification file which is passed to nmi_submit.
Each run consists of a number of Metronome tasks whose individual success/failure/duration/exe-host/output are tracked and recorded in the DB.
Although the actual operations performed by each task are based on user-provided scripts or executables, most tasks have a predefined name and location in their run’s workflow. Users may, however, optionally declare an arbitrary number of custom-named tasks to run on each specified platform.
In addition to recording the results of individual tasks in a run, Metronome also records the success/failure/duration of certain meta-tasks. Each meta-task represents the collective result of a predefined set of related tasks in the run. A platform_job meta-task for each specified platform in the run represents the collective success or failure of all the tasks which executed on that platform. Additionally, if users choose to declare their own custom-named tasks on a platform, the collective success or failure of just those user-defined tasks is recorded in a remote_task meta-task.
All platform-independent tasks in a run are executed on the submit host by Metronome inside their own individual Condor jobs; all tasks specific to a given platform are executed on a remote machine inside a single Condor job. All the jobs used to execute a run are represented in a single Condor DAG.
This documentation is the product of many people’s work, in particular Parag Mhashilkar, who wrote the original NMI Build & Test System documentation, and NCSA’s Michael Bletzinger, who is responsible for a second generation of documentation which formed the basis of the website you see today. In addition to writing a number of badly-needed tutorials and reference docs, Michael served as a catalyst for the rest of us, who subsequently contributed content because of his documentation efforts and initial work seeding the site.
A build & test run generates a number of tasks to be executed on the submission machine and one or more user-defined platforms. The diagram below shows the overall workflow.

Fetch Tasks – Retrieves needed software inputs from one or more sources to the submission machine.
Pre Run Tasks – Lightweight tasks to be performed on the submission machine in order to prepare the software for staging to the remote platforms. These tasks can be global to all platforms (and thus executed only once, on the common input data) or specific to each platform (and thus executed once per platform, on its specific copy of the input data).
Remote Tasks – Tasks to be performed on each remote platform.
Post Run Tasks – Lightweight tasks to be performed on the submission machine to manipulate the results of the remote tasks. These tasks can be specific to each platform (and thus executed once per platform, on its specific results), or global to all platforms (and thus executed only once, on the combined output).
The Build & Test System organizes a build or test workflow into predefined stages, or tasks. Each one provides a “hook” where you can optionally define a custom script or program to execute that task in the execution process. Only the remote_task task is required. The diagram below shows all of the available task hooks:

The Build & Test System gives you the option to divide the platform-specific remote_task into any number of user-defined sub-tasks. The diagram below shows where these user-defined tasks appear in the execution process:

The build and test system handles failures in a way that allows the user to detect what went wrong. The handling depends on the task. The various ways that failures are handled are as follows:
| Failed Task | Failure Type | Run after failure: remaining user tasks | Run after failure: remote_post | Run after failure: platform_post | Run after failure: post_all |
|---|---|---|---|---|---|
| pre_all | abort run | no | no | no | no |
| platform_pre | abort platform | no | no | no | no |
| remote_pre_declare | remote_post / abort platform | no | yes | no | no |
| remote_declare | remote_post / abort platform | no | yes | no | no |
| remote_pre | remote_post / abort platform | no | yes | no | no |
| remote_task | remote_post / abort platform | N/A | yes | no | no |
| user-defined task | record & continue | yes | yes | no | no |
| remote_post | abort platform | yes | yes | no | no |
| platform_post | abort platform | yes | yes | yes | no |
| post_all | abort run | yes | yes | yes | yes |
Timeouts may be implemented as run or task specific.
Run specific timeouts refer to the amount of time a Condor job may remain in the queue before being removed. The automated removal assists with cleaning up jobs with mismatching requirements, or jobs which will never run due to various system problems. The default is 6 days, which may be overridden. The order of precedence is the run spec file (max_match_wait), then the NMI config file (MAX_MATCH_WAIT), then the default value. Timeout values are specified in seconds.
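The lookup order for the run timeout can be sketched as follows. This is an illustration only: the two dictionaries stand in for the parsed run spec file and NMI config file, and resolve_max_match_wait is a hypothetical name, not a Metronome function.

```python
DEFAULT_MAX_MATCH_WAIT = 6 * 24 * 60 * 60  # six days, in seconds


def resolve_max_match_wait(run_spec, nmi_config):
    """Return the queue timeout in seconds, checking each source in order."""
    sources = ((run_spec, "max_match_wait"), (nmi_config, "MAX_MATCH_WAIT"))
    for settings, key in sources:
        value = settings.get(key)
        if value is not None:
            return int(value)
    # Neither file set a value, so fall back to the built-in default.
    return DEFAULT_MAX_MATCH_WAIT
```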
Task specific timeouts are set in the NMI submit file. These timeouts are typically set by users at task boundaries to assist with shutting down services at the correct time or to prevent services from hanging indefinitely if something should go wrong upon shutdown. Please see the appropriate section of the manual for details.
The following commands are available in the NMI submit file (aka the build or test specification file). The syntax for each command is:
<command> = <value>
Any whitespace before or after the command or value is ignored (however whitespace within a value is retained).
Certain special variables in the file will be expanded before processing.
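As a rough sketch of the whitespace rule, the following parser splits each line on its first '=' and trims only the outer whitespace. It assumes that lines beginning with '#' are comments (as in the examples in this documentation) and does not model special-variable expansion; parse_submit_lines is a hypothetical helper, not part of Metronome.

```python
def parse_submit_lines(text):
    """Parse 'command = value' lines into a dict of command -> value."""
    commands = {}
    for raw_line in text.splitlines():
        line = raw_line.strip()
        if not line or line.startswith("#"):
            continue  # skip blank lines and (assumed) comment lines
        name, separator, value = line.partition("=")
        if not separator:
            continue  # not a 'command = value' line
        # Outer whitespace is dropped; whitespace inside the value is kept.
        commands[name.strip()] = value.strip()
    return commands
```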
Commands for the remote host may be specified on a platform-specific basis. In particular, commands beginning with remote_pre, remote_pre_declare, remote_declare, remote_task, and remote_post may be prefixed by a platform string to indicate that the command applies only to the platform. Examples:
x86_rh_9_remote_task = /bin/specialScript
x86_rh_9_remote_task_args = -special -arguments
x86_rh_9_remote_task_timeout = 200
A platform-specific command will override any corresponding generic command. E.g., x86_rh_9_remote_task_args will override remote_task_args on x86_rh_9, if they are both specified.
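The override rule amounts to a two-step lookup, sketched here under the assumption that the submit file has already been parsed into a dictionary. The name effective_value is hypothetical.

```python
def effective_value(commands, platform, name):
    """Return the platform-specific value if present, else the generic one."""
    specific_name = f"{platform}_{name}"
    if specific_name in commands:
        return commands[specific_name]
    return commands.get(name)
```

For example, with both remote_task_args and x86_rh_9_remote_task_args defined, a lookup for x86_rh_9 returns the platform-specific value, while every other platform gets the generic one.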
Specifies command line arguments to be passed to the script associated with taskname. See the remote_task for an example.
For runs with a single remote_task, remote_task_timeout specifies how long Metronome should wait for the specified task to complete after it begins running. If the task is still running after this time, Metronome will forcibly kill it and mark it as failed.
remote_task = takes-at-most-an-hour.sh
# Kill this job after two hours, because it's almost certainly hung.
remote_task_timeout = 120
Runs which include user tasks (that is, runs which have a tasklist.nmi, including one produced by remote_declare) may specify the timeouts for individual tasks therein, and this parameter supplies the default for that value.
remote_declare = list-my-tests.sh
# Default to ninety seconds for my tests, since they should be pretty fast.
remote_task_timeout = 90s
You may specify whether your timeout is in minutes (‘M’ or ‘m’) or in seconds (‘S’ or ‘s’); Metronome defaults to minutes if no specifier is present.
Metronome 2.4.x and earlier only support remote_task_timeout.
Metronome 2.5.0 and later support timeouts for all of the remote task hooks, as well as setting a default value for all unspecified remote timeouts:
remote_pre = start-server.sh
remote_task = run-client.sh
remote_post = stop-server.sh
# If it takes more than thirty seconds to start or stop the server, it probably crashed.
remote_pre_timeout = 30s
remote_post_timeout = 30s
# Nothing should take very long to do.
remote_default_timeout = 15m
(Specifically, we support timeouts for remote_pre_declare, remote_declare, remote_pre, remote_task, and remote_post.)
Metronome 2.5.1 and later support the ‘h’ and ‘H’ specifiers, for durations in hours.
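To make the unit rules concrete, here is a small sketch that converts a timeout value to seconds: ‘s’ for seconds, ‘m’ for minutes, ‘h’ for hours, and minutes when no specifier is given. It is not Metronome's own parser; the function name and the choice to reject anything else are assumptions.

```python
import re

_UNIT_SECONDS = {"s": 1, "m": 60, "h": 3600}


def timeout_to_seconds(value):
    """Convert a timeout such as '120', '90s', '15m' or '2H' to seconds."""
    match = re.fullmatch(r"\s*(\d+)([smhSMH]?)\s*", value)
    if match is None:
        raise ValueError(f"unrecognized timeout value: {value!r}")
    amount, unit = match.groups()
    # A bare number is interpreted as minutes.
    return int(amount) * _UNIT_SECONDS[unit.lower() or "m"]
```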
Defaults to false.
The always_run_post_all option, if set to true, allows the post_all task to execute even after a failed platform_job.
This value is appended to the raw Condor classad requirements expression for the platform_job jobs in the run. This can be useful to add additional matchmaking constraints to the Condor jobs, beyond what is added by the NMI B&T software itself as a result of prereqs or platforms — e.g., to force a match with a specific hostname, or with a host with adequate memory. It is to be considered an “advanced” or “expert” command, not something users should normally need to use.
(Note: you don’t need to prefix your extra requirements with ‘&&’, as it will be done for you.)
For example:
# ensure target machine has > 1 GB of memory
append_requirements = (Memory > 1024 * 1024)
# ensure x86_rh9 platform_job runs on host-foo.site.org
append_requirements_x86_rh9 = (Machine == "host-foo.site.org")
In the unlikely event that you want to override your remote Condor job requirements entirely (rather than append to them), you can specify +requirements instead, but be sure you know what you’re doing.
Identifies the name and version of the software being built or tested. For example, SSH 4.2 would be identified as:
component = SSH
component_version = 4.2
Optional field which describes the component being built or tested, in an arbitrary unquoted text string. This field is stored in the database and displayed on the Build & Test Overview page.
Example:
description = PyGlobus build for NMI
This parameter became available in Metronome 2.2.4.
By default, Metronome will try to fetch an input three times before giving up. This parameter allows you to change that to another integer for the inputs in the same submit file.
The machine default may be changed in nmi.conf using the parameter FETCH_RETRY_COUNT.
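The retry behavior can be pictured with the sketch below. It assumes the count is the total number of attempts and that a failed fetch raises an exception; fetch_with_retries and the fetch callable are hypothetical stand-ins, not Metronome internals.

```python
def fetch_with_retries(fetch, retry_count=3):
    """Call fetch() up to retry_count times, returning its first success."""
    last_error = None
    for _ in range(retry_count):
        try:
            return fetch()
        except Exception as error:  # any failure counts as a failed attempt
            last_error = error
    raise RuntimeError(f"fetch failed after {retry_count} attempts") from last_error
```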
An optional attribute whose arbitrary string value can be used to identify the run’s owner, if distinct from the submitting user.
The identity string will be stored in the DB for the run, and can be displayed on the web status pages in place of the submitting user (by setting the RUN_USER_IDENTITY_COLUMN), but does not affect the user account under which the job actually executes on computing resources.
Specifies the paths of one or more input files which define the inputs of a build or test submission. The paths can be absolute, or relative to the current working directory of nmi_submit at the time it is invoked. Paths are comma delimited.
For example, the following line specifies two input files. The first, foo.cvs, is expected to be in the current working directory when nmi_submit is invoked, while glue.cvs will be read from /nmi/glue:
inputs = foo.cvs, /nmi/glue/glue.cvs
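A short sketch of how such a value could be resolved into absolute paths, assuming POSIX-style paths. The helper name resolve_inputs is hypothetical, and cwd stands for the working directory of nmi_submit at invocation time.

```python
import posixpath


def resolve_inputs(inputs_value, cwd):
    """Split a comma-delimited inputs value and anchor relative paths at cwd."""
    resolved = []
    for path in inputs_value.split(","):
        path = path.strip()
        if not path:
            continue
        if not posixpath.isabs(path):
            path = posixpath.join(cwd, path)
        resolved.append(path)
    return resolved
```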
Specifies the maximum number of seconds that a run may remain in the queue without ever running before it will be automatically removed.
This value is determined by the max_match_wait param in a submit file. If that is undefined, it is determined by the default MAX_MATCH_WAIT value specified by the administrator in the Metronome config file. If that is undefined, the value defaults to six days.
Specifies an email address to which the B&T system will send a build/test run completion message. For example:
notify = micky.mouse@hotmail.com
specifies that the completion message is sent to a hotmail account.
The following applies to Metronome 2.4.x and earlier. Please read below the second horizontal rule for information on platform names in Metronome 2.5.x.
bash$ condor_status -format '%s\n' nmi_platform | sort | uniq
alpha_osf_V5.1
alpha_rh_7.2
hppa_hpux_11
hppa_hpux_B.10.20
ia64_rhas_3
...
ppc_macos_10.3
ppc_macos_10.4
ppc_ydl_3.0
sun4u_sol_5.8
sun4u_sol_5.9
x86_64_rhas_3
...
x86_rh_9
x86_rhas_3
...
The following example specifies that the submission should be run on RedHat version 9 and Apple Mac OS X version 10.3:
platforms = x86_rh_9, ppc_macos_10.3
[1] More information on the condor_status command can be found here.
The Metronome 2.5.x ‘platforms’ command is backwards-compatible with the 2.4.x (and earlier) command described above. When platforms are specified by
platforms = <platform> [, <platform>]*
Metronome will internally translate this to
platforms = <default platform type>:<platform> [, <default platform type>:<platform>]*
where the default platform type is either platform_type as specified in the command file, the PLATFORM_TYPE as specified in the nmi.conf file, or the hardcoded default (at present, “nmi”). Naturally, you can directly specify a platform type for a particular platform in the same way:
platforms = x86_fc_5, etics:fc5_ia32_gcc410
While (at the time of this writing) the “platform strings” in this example refer to the same “platform”, Metronome will generate two platform jobs for this run. If the two platforms were ‘x86_fc_5’ and ‘nmi:x86_fc_5’, however, the former will be internally translated to the latter, and the usual rules for platform collision apply.
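The translation described above can be sketched as a simple prefixing step. This illustrates only the rule as stated (unqualified names receive the default platform type); qualify_platforms is a hypothetical name, and platform collision handling is not modeled.

```python
def qualify_platforms(platforms_value, default_type="nmi"):
    """Prefix each unqualified platform name with the default platform type."""
    qualified = []
    for name in platforms_value.split(","):
        name = name.strip()
        if not name:
            continue
        if ":" not in name:
            # No explicit type given, so the default platform type applies.
            name = f"{default_type}:{name}"
        qualified.append(name)
    return qualified
```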
Task hook which specifies a script and its arguments that will be run on the submit machine after each set of remote tasks for a platform have been run.
The example below will cause the command VDTGlue.pm voms-*.gz /p/vdt/public/html/software/voms/1.6.3p1 south.cs.wisc.edu to be executed once on the submit machine after each set of remote tasks for a platform. Since the platforms command specifies 13 platforms, the task will be executed 13 times.
platforms = alpha_osf_V5.1, alpha_rh_7.2, hppa_hpux_B.10.20, ia64_sles_8, ppc_aix_5.2, ppc_macos_10.3, sun4u_sol_5.8, sun4u_sol_5.9,x86_64_rhas_3, x86_rh_7.2, x86_rh_8.0, x86_rh_9, x86_winnt_5.1
platform_post = VDTGlue.pm
platform_post_args = voms-*.gz /p/vdt/public/html/software/voms/1.6.3p1 south.cs.wisc.edu
The results of a platform_post script can be found in the run directory, in files named platform_post with *.out and *.err extensions.
The task is executed in the platform directory.
Task hook which specifies a script and its arguments that will be run on the submit machine after all of the software specified by the input specifications files have been fetched. Unlike the pre_all task hook, platform_pre is executed before each set of platform remote tasks.
The example below will cause the command nmi_glue/test/platform_pre glite/platform_pre_args to be executed on the submit machine as part of the platform_pre task. Since the platforms command specifies 13 platforms, the task will be executed 13 times.
platforms = alpha_osf_V5.1, alpha_rh_7.2, hppa_hpux_B.10.20, ia64_sles_8, ppc_aix_5.2, ppc_macos_10.3, sun4u_sol_5.8, sun4u_sol_5.9,x86_64_rhas_3, x86_rh_7.2, x86_rh_8.0, x86_rh_9, x86_winnt_5.1
platform_pre = nmi_glue/test/platform_pre
platform_pre_args = glite/platform_pre_args
The results of a platform_pre script can be found in the run directory, in files named for the platform with *.out and *.err extensions.
The task is executed in the platform directory.
First available in Metronome 2.2.8.
platform_type sets the platform type. If it is not set, the platform type defaults to the value of the configuration file option PLATFORM_TYPE or, if that is also unset, to ‘nmi’.
Generally, the only time a user should need to set this is when trying to use nonlocal resources, since the administrator of your local Metronome installation should set the PLATFORM_TYPE as appropriate in the Metronome configuration file. If, however, your Metronome installation is configured for job migration, remote sites may use a different scheme to name their platforms; this allows your submit files to conform to that scheme. Set platform_type as appropriate for the target resource and use the other platform names as normal.
Metronome does not presently support mixing platform types in a submit file.
Task hook which specifies a script and its arguments that will be run on the submit machine after the remote tasks have been run on all of the platforms. The script is run only once.
The example below will cause the command post_all --wrap to be executed once on the submit machine after all remote tasks are completed.
post_all = nwo/glue/all/build/post_all
post_all_args = --wrap
The results of a post_all script can be found in /nmi/run/ (Your GID) /post_all with *.out and *.err extensions.
List the prerequisites needed for the build. For example:
prereqs = coreutils-5.2.1
Adds the core utilities prereq as a requirement. Note that the version number is required and must be separated with periods (.) rather than underscores, as it is displayed on the host machine pages. The following will not be recognized:
prereqs = coreutils-5_2_1
To specify platform-specific prereqs, append _<platform> to prereqs. For example:
prereqs_x86_fc_3 = coreutils-6.9
Please note that the platform-specific requirements will be appended to the global prereqs.
In Metronome 2.5.x, you may specify platform types in the <platform> portion of the command:
prereqs_nmi:x86_fc_3 = coreutils-6.9
Metronome will not ‘search’ for a platform type if one is not specified in a prereqs command; the platform will always be interpreted as having the default platform type, even if it is specified differently elsewhere in the command file. For example, the following
platforms = etics:x86_fc_3, x86_fc_4
prereqs = gcc-3.4.3
prereqs_x86_fc_3 = coreutils-6.8
prereqs_x86_fc_4 = coreutils-6.9
will result in a two-platform run, one of which (nmi:x86_fc_4) will run with coreutils-6.9, the other of which (etics:x86_fc_3) will run without a coreutils prereq at all, because prereqs_x86_fc_3 is interpreted as applying to nmi:x86_fc_3, which is not part of the run. Both, however, will use gcc 3.4.3.
Task hook which specifies a script and its arguments that will be run on the submit machine after all of the software specified by the input specifications files have been fetched. The script is run once before any remote tasks are run on any platforms.
The example below will cause the command pre_all --src=/home/bt/condor-6.7.13.tar.gz to be executed on the submit machine as part of the pre_all task.
pre_all = nmi_glue/build/pre_all
pre_all_args = --src=/home/bt/condor-6.7.13.tar.gz
The results of a pre_all script can be found in the run directory with *.out and *.err extensions.
The pre_all script is executed in the common directory.
You can restrict access to the archived “run directory” of your build & test jobs by adding the following option to your NMI submit file:
private_web_users = my_web_account, her_web_account
The web accounts in question do not correspond to system login accounts on the submit machine — rather, they are specific to the webserver, and must be manually created by NMI Build & Test Lab staff. Please submit a support request on this website or email nmi-support@cs.wisc.edu if you’d like one created.
Field used by the Build & Test Overview page to show what project a submission is associated with.
For example all tutorial submission files contain the following:
project = tutorial
Specifies a task to be run on the target machines. This task runs second, after remote_pre_declare and before remote_pre, and is usually used to generate tasklist.nmi, which defines user tasks.
Please note that you can simply include a file named tasklist.nmi in your inputs; you only need to write a script in unusual cases (such as platform-specific user tasks).
Please see the user defined tasks section of our tutorial for a more extensive example.
When submitting a job to be executed on a remote site, the remote_pool option defines the address where the NMI framework can communicate with the remote collector daemon.
remote_pool = collector.example.com[:port]
This command must be used in conjunction with the remote_schedd option. More information about how to use the remote site execution features of the NMI framework can be found here.
Specifies a task to be run on the target machines. This task runs after the remote_task. If user tasks are defined, then this task runs after the last user-defined task. For example, it can be used to process failed user tasks.
Specifies a task to be run on the target machines. This task runs before the remote_task. If user tasks are defined, then this task runs before the first user-defined task.
Defines a script to be run before remote_declare. The task is executed before the tasklist.nmi file is generated.
The remote_schedd is the host that the job will be routed to in the remote pool. Once there, the job will potentially be matched and begin execution on a computing resource within that pool.
remote_schedd = schedd.example.com
More information about how to use the remote execution features of the NMI framework can be found here.
Specifies a task to be run on the target machines.
For example, the following specifies that "code/perlHelloWorld/helloWorld.pl Remote_Task Task" should be executed as the remote task:
remote_task = code/perlHelloWorld/helloWorld.pl
remote_task_args = Remote_Task Task
remote_task_timeout = 5
x86_rh_9_remote_task_timeout = 10
The additional parameters pass the arguments "Remote_Task Task" and set the timeout to 5 minutes. The timeout for the RedHat 9 platform is set to 10 minutes.
A remote task can be subdivided into user defined tasks. The definitions need to be in the known file tasklist.nmi.
All of the user tasks get the same arguments from remote_test_args.
This flag became available in Metronome 2.5.0.
Normally, Metronome requires a remote task (remote_task) to be specified. This flag suppresses that behavior and overrides any supplied remote_task with the ‘null’ task, which runs on the submit machine and does nothing except note its own existence for the benefit of the web interface. Note: before version 2.5.1, Metronome still required one platform to be defined in submit files with this flag set. The “platform” may be any string; I suggest ‘dummy’.
This flag is intended for users who split pre_all or platform_pre steps that don’t vary much across runs into their own run, which is then used as input for the original run; it dispenses with the step of running anything, even a no-op, on a remote machine. By doing so, it skips the potentially slow transfer of the results of pre_all and platform_pre to the remote node(s), as well as a potentially long wait for some specific but unimportant platform to become available. When you use this run as input to another run, we recommend setting the ignore_missing_platforms flag, and not setting ‘platforms’ in the input file. You may supply a dummy platform string, but if you do, Metronome will create a (harmless but) spurious directory for your run (named after the dummy platform string).
For instance, suppose ‘MyLinuxDistro.sub’ uses its pre_all and platform_pre steps to fetch a large number of software packages, which it then builds (as specified by remote_declare) with varying levels of optimization. You could build different optimization levels concurrently by creating ‘MyLinuxDistro-O[0-3].sub’, but all four of these would fetch the whole source all over again. (Alternatively, maybe you want to try a number of different combinations of compiler flags to find the best set.) Instead, you could copy the pre_all and platform_pre steps from ‘MyLinuxDistro.sub’ to ‘MyLinuxDistroSources.sub’, set the remote_task_is_null flag there, and change ‘MyLinuxDistro-O[0-3].sub’ to use the results of a run of ‘MyLinuxDistroSources.sub’. Then you only have to fetch the sources again when you decide to upgrade the packages.
This field differentiates between build submissions and test submissions. The field is used by the Build & Test Overview page.
For example, the following indicates a test submission:
run_type = test
Passes a command through to Condor. For example, +getenv = true passes the command getenv with the value true to Condor.
Since Condor itself recognizes a + command to add arbitrary user-defined attributes to the job classad, in Metronome you can add such attributes by specifying a ++ prefix; the first + tells Metronome to pass the remaining +attr=value text to Condor, which interprets the + accordingly.
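The prefix handling can be illustrated with the sketch below, which removes exactly one leading ‘+’ and returns what would be handed to Condor. This is only a model of the rule as described; strip_passthrough_prefix is a hypothetical name, and owner_tag in the usage note is a made-up attribute.

```python
def strip_passthrough_prefix(line):
    """Remove the single leading '+' that marks a Condor pass-through line."""
    line = line.strip()
    if not line.startswith("+"):
        raise ValueError("not a pass-through command: " + line)
    # Only the first '+' belongs to Metronome; any second '+' is for Condor.
    return line[1:]
```

Under this model, `+getenv = true` becomes `getenv = true`, and `++owner_tag = "demo"` becomes `+owner_tag = "demo"`, which Condor then treats as a user-defined attribute.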
Each build or test specification file must reference one or more input specification files (via the inputs keyword) which define the inputs to the build or test run.
Input specification files follow the same basic format as the build or test specification file itself. The syntax for each line in the file is:
<command> = <value>
Any whitespace before or after the command or value is ignored (however whitespace within a value is retained).
Certain special variables in the file will be expanded before processing.
This documentation reflects the input file specification commands for the Metronome 2.4.x series and earlier.
Specify the method used by the B&T system to stage the software onto the submit machine. For example the following specifies that the software should be obtained from a CVS repository.
method = cvs
This method specifies that the software is transferred to the submit machine using CVS. The method also requires the commands cvs_module and cvs_root in the input file.
From the fetch.pl usage:
cvs_root = :ext:bt@chopin.cs.wisc.edu:/p/condor/repository/nmi
# cvs_tag is optional
cvs_tag = nmi_r5_branch
# exactly one of the following two is required
cvs_module = <cvs module name>
cvs_subdir = <dir> [, <dir>, ...]
This method fetches files from an FTP site. It requires the additional input specification file commands ftp_root and ftp_target, and downloads the files from there:
method = ftp
ftp_root = ftp://ftp.cs.wisc.edu/condor/nmi/tutorial/
ftp_target = helloWorld.tar.gz
will run a command equivalent to wget ftp://ftp.cs.wisc.edu/condor/nmi/tutorial/helloWorld.tar.gz.
(The present implementation of Metronome does use wget, so FTP sites requiring authentication can be accessed through use of a .wgetrc file in the submitter’s home directory. See the wgetrc documentation for details. This is not a supported feature of Metronome.)
The nmi input method is used to specify that one build or test run (let’s call it the consumer run) wishes to retrieve the results of another, previously completed build or test run (the producer run). For example, a test run might specify an nmi input to retrieve the output of a finished build run it wishes to test.
For each nmi input method, Metronome first establishes a list of input platforms for which it will attempt to retrieve results from each producer run. By default, Metronome will attempt to retrieve results from the producer for each platform in the consumer’s platforms list, but this can be overridden using the platforms command in the input spec file.
From each specified producer, the fetch step retrieves the results.tar.gz files corresponding to each input platform (including the platform-independent “common” results.tar.gz file), and untars them into the corresponding platform-specific (or common) directory of the consumer.
I.e., for each input platform (including “common”), the following pseudo-code is executed for each producer:
cd consumer:userdir/<input_platform>/
tar zxf producer:userdir/<input_platform>/results.tar.gz
If a producer contains results for additional platforms not present in the consumer’s input platforms list, they are not retrieved. If a consumer specifies a platform with no corresponding results in the producer, an error is produced (return code ???) unless ignore_missing_platforms is true.
If the producer’s output files are no longer archived, an error is produced (return code ???).
Note: the targetdir command is ignored for this input method.
The following input specification file instructs Metronome to untar the results.tar.gz file for each platform in runids 324 and 213 into the corresponding platform directory of the current run, ignoring any platforms not present in those input runs.
method=nmi
input_runids = 324, 213
ignore_missing_platforms = true
This method fetches files using SCP. It requires the additional input specification file command scp_file, and copies that file (or directory), possibly from a remote host, to the local host.
For example, to specify a fetch of a directory called glue on a machine called role, the following needs to be in an input file:
method = scp
scp_file = role.cs.wisc.edu:/home/mbletzin/glue
recursive = true
This method fetches files from a Subversion repository. It requires the additional input specification file command url, and checks out that URL:
method = svn
url = svn-method://svn-host/svn-path
will run a command equivalent to svn co svn-method://svn-host/svn-path.
This method fetches one or more files from a web server. It requires an additional url command, which specifies the filename to be downloaded. For example:
method = url
url = http://cs.wisc.edu/condor/nmi/nmi-releases/nmi-2.2.7.tar.gz
The url method also supports the recursive command to download entire directory trees.
NOTE: the url method does not currently provide direct support for websites requiring (basic HTTP) authentication. However, since the present implementation of Metronome relies on wget for URL retrieval, websites requiring (basic HTTP) authentication can in fact be accessed through simple use of a wgetrc file (.wgetrc in the submitter’s home directory by default). See the wget documentation for details. This is not a supported feature of Metronome.
Adds additional arguments to the wget command. See here for a possible list.
Specifies the URL for the ftp input. For example, the URL ftp://ftp.cs.wisc.edu/condor/nmi/tutorial/helloWorld.tar.gz is specified as:
ftp_root = ftp://ftp.cs.wisc.edu/condor/nmi/tutorial/
ftp_target = helloWorld.tar.gz
As used by the url method, this command specifies a URL to fetch. ‘http’ is synonymous.
As used by the svn method, this command specifies the Subversion URL to check out; ‘http’ is synonymous:
method = svn
url = svn-method://svn-host/svn-path
You can also optionally specify a path in the ‘url’ line:
method = svn
url = svn-method://svn-host/svn-path path
and Subversion will use ‘path’ as the destination directory (as opposed to determining the destination based on the URL).
Comma-delimited list of run ids whose results are untarred into the working directory of the submit machine before platform_pre is run.
In Metronome 2.5.1, the list may include GIDs.
NOTE: valid for the nmi input method only.
The optional platforms command is used to specify a subset of platforms for which results from the input run should be copied into the current run. For example:
platforms = ppc_mac_10.3, x86_rhas3
Valid elements include any platform name (including common) or all. If unspecified, platforms defaults to all.
The results.tar.gz from each specified platform will be copied and untarred into the corresponding platform directory of the consuming run. This default destination can be overridden via an optional source:destination platform name mapping, like so:
platforms = ppc_mac_10.3, x86_rhas3:x86_rhas4, x86_rhas3:x86_fc3
This says copy ppc_mac_10.3 results from the input run into the ppc_mac_10.3 workspace of the current run (like usual), but copy the x86_rhas3 results from the input run into the x86_rhas4 and x86_fc3 workspaces of the current run (e.g., to do binary-compatibility testing).
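The source:destination mapping described above can be sketched in Python. This is only an illustration of how the syntax reads, not Metronome's own parser; the function name parse_platforms is hypothetical, and the sketch assumes the single-colon syntax shown in this section.

```python
def parse_platforms(value):
    """Map each source platform to the workspaces its results are copied into.

    An entry without a colon is copied into the workspace of the same name.
    (Hypothetical helper; assumes the single-colon 'source:destination' form.)
    """
    mapping = {}
    for item in value.split(","):
        item = item.strip()
        if not item:
            continue
        source, sep, dest = item.partition(":")
        source = source.strip()
        dest = dest.strip() if sep else source
        mapping.setdefault(source, []).append(dest)
    return mapping


line = "ppc_mac_10.3, x86_rhas3:x86_rhas4, x86_rhas3:x86_fc3"
for src, dests in parse_platforms(line).items():
    print(src, "->", ", ".join(dests))
```

Keeping a list of destinations per source makes the one-to-many case (one set of x86_rhas3 results copied into two workspaces) explicit.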
This command tells a method to recursively fetch the contents of any directories it finds under the target. For example:
scp_file = /home/bgietzel/fw-client-server/glue
recursive = true
Tells the system to fetch everything under the glue directory.
If the command is set to true then the build and test system unpacks any archive that is fetched.
The input specification file commands will be rationalized in Metronome 2.5.x. This section of the documentation should be considered speculative until the release of Metronome 2.5.x.
An input specification file must include one (and only one) method command. Each method, listed below, imposes its own requirements for subsequent commands. For instance, the http method requires a url command to specify the URL to fetch.
Options are a type of command that affects the behavior of other commands. Metronome has two input specification file options: recursive, which allows methods that normally fetch a single file to recursively fetch an entire directory, and unpack, which unpacks the specified file. Both accept “true” or “false” as values.
At present, the two options are mutually exclusive.
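The rules above (exactly one method command, true/false option values, and mutually exclusive options) can be checked mechanically. The following Python sketch assumes an input specification file is a sequence of "key = value" lines, as in the examples in this document; parse_spec and check_spec are hypothetical helper names, not Metronome tools.

```python
def parse_spec(text):
    """Parse 'key = value' lines into a list of (key, value) pairs."""
    pairs = []
    for raw in text.splitlines():
        line = raw.strip()
        if not line or "=" not in line:
            continue
        # Split on the first '=' only, so URLs containing '=' stay intact.
        key, _, value = line.partition("=")
        pairs.append((key.strip(), value.strip()))
    return pairs


def check_spec(text):
    """Return a list of problems; an empty list means the checks passed."""
    pairs = parse_spec(text)
    problems = []
    methods = [v for k, v in pairs if k == "method"]
    if len(methods) != 1:
        problems.append("expected exactly one method command, found %d" % len(methods))
    enabled = set()
    for key, value in pairs:
        if key in ("recursive", "unpack"):
            if value.lower() not in ("true", "false"):
                problems.append("%s must be true or false, got %r" % (key, value))
            elif value.lower() == "true":
                enabled.add(key)
    if len(enabled) == 2:
        problems.append("recursive and unpack are mutually exclusive")
    return problems
```

For example, check_spec("method = scp\nrecursive = true\nunpack = true") reports the mutual-exclusion problem, while a file with a single method and one option passes.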
A method specifies how to obtain an input for the run. The three categories of input methods, in decreasing order of reproducibility, are: Metronome outputs, version control systems, and archives.
Metronome provides the ability to use the output of a previous Metronome run as the input to a subsequent run. Because Metronome runs are reproducible (to the extent that their inputs are reproducible), this method is itself a reproducible way of acquiring input for a run.
The only method in this category is metronome [NOTE: the ‘platforms’ cross-platform testing syntax requires two colons in Metronome 2.5.x, because of the platform_type:platform_name change.]
Metronome can acquire input from the CVS and Subversion version control systems. For ease of use, specifying a particular revision with these methods is optional; this simplifies the machinery required to regularly build the trunk (because it doesn’t have to be tagged ahead of time, or in the pre_all step), but reduces reproducibility (because without a particular revision specified, you may not easily get precisely the same repository again).
The methods in this category are cvs and svn.
You can fetch files (presumably from well-maintained archives; hence the name) from the web or a specific remote machine.
The recursive option affects methods assumed to have accessible directory trees (at present, ftp and scp) in the obvious way. (Because HTTP does not have to (and generally does not) expose the directory tree, the recursive option, proper, cannot reliably be implemented, although suggestions regarding mirroring will be entertained. The version control methods are inherently recursive, and the concept doesn’t apply to Metronome outputs.)
The Metronome system can recognize (generally, by file extension) a wide variety of compression and/or archival formats. If the unpack option is set to true, the method is archival (http, ftp, or scp), and the single target file is in one of those formats, Metronome will unpack it. This is generally simpler, easier, and more reliable than explicitly unpacking the file in a user task.
... tar.gz, .gz, .tar, .zip?
Sometimes, you may wish to fetch inputs that are password-protected. To use a specific wgetrc file for an input, set in your input spec file:
wgetrc = <path_to_wgetrc_file>
You can then set a username and password in that wgetrc file. (See the wget documentation for details.)
These input specification file commands did not change between versions.
Specifies the module to be checked out. The file CVS/Repository found in every directory of a CVS checkout contains the name of the module.
Specifies the root of the cvs repository. The root is contained in the file CVS/Root in each subdirectory of a CVS checkout.
Tell cvs to use an alternate remote shell. For example:
cvs_rsh = /nmi/scripts/ssh_no_x11
Tells CVS to use a local script, /nmi/scripts/ssh_no_x11, which hardcodes a set of ssh flags.
In the absence of a cvs_module command, the cvs_subdir command can be used to specify a comma-separated list of one or more specific directories (or files) to check out of the given repository.
Specifies a tag name for the CVS checkout. The following example specifies the tag name “helloBranch”:
cvs_tag = helloBranch
Set this to true in your input file to ignore any platforms not present in the input run. To be used with the nmi method.
Specifies the URL to the target to be fetched using scp. The URL needs to be of the form hostname:path. If the host name is omitted then the system assumes that the target is on the submit machine and the local copy command is used.
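The hostname:path rule above can be illustrated with a short Python sketch. It assumes, following scp's own convention, that a colon only marks a host name when it appears before the first slash; split_scp_target is a hypothetical name, and Metronome's actual parsing may differ.

```python
def split_scp_target(scp_file):
    """Split an scp_file value into (host, path).

    Returns (None, path) when no host is given, meaning the target is on
    the submit machine and a local copy would be used.
    """
    host, sep, path = scp_file.partition(":")
    if sep and host and "/" not in host:
        return host, path
    return None, scp_file


print(split_scp_target("role.cs.wisc.edu:/home/mbletzin/glue"))
print(split_scp_target("/home/bgietzel/fw-client-server/glue"))
```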
In the nmi_submit submit file, one may define “outputs” to instruct the system to transfer some or all results to an external repository after the run is complete.
The syntax is as follows:
outputs = output_file_1, output_file_2, ...
Currently, the following output methods are supported:
Some example output files:
method = scp
platform = x86_slc_3
source = results.tar.gz
dest = /tmp/tolya
or
method = scp
platform = common
source = results.tar.gz
dest = /tmp/tolya-common
(In the last case, it is the user’s responsibility to create the results file under common/.)
Here’s an example of using the gridftp method:
method = gridftp
platform = x86_rh_9
source = results.tar.gz
dest = my.gridftp.host/data/rh9/results.tar.gz
It is possible to perform macro substitution within the NMI run specification file and input specification files. There are two substitutions done at submit time:
This section describes the NMI Build & Test Software commands available on the submit machine.
nmi_condor_status prints Metronome-specific information about your Condor pool. It accepts three options, in addition to the usual --nmi-conf. The -w, -ww, and -a options control the width of the printed information. Normally, nmi_condor_status prints output in 80 columns, truncating fields to fit. The -w and -ww options print more columns; the -a option prints unlimited-width, unjustified output more suitable for use by scripts.
For advanced users, starting with Metronome version 2.2.3, you may configure your pool so that nmi_condor_status ignores certain startds. This is useful if your pool includes non-Metronome hosts, or hosts which will never run jobs for other reasons. (For instance, if you’re running Hawkeye to monitor a submit host.) Setting the attribute NMI_isExecHost to FALSE in your Condor configuration file and adding NMI_isExecHost to STARTD_EXPRS will cause nmi_condor_status to ignore that startd.
This command has been available since before Metronome 2.2.2.
This command converts Metronome GIDs (strings) to Metronome runids (integers). It takes one argument, the GID, and one option, the ubiquitous --nmi-conf, which allows you to set the location of the Metronome configuration file:
$ nmi_gid2runid tutorial_nmi-s005.cs.wisc.edu_1201099302_30463
72316
If the database is down when you submit a job, this command can be used at a later date to determine the runid, as a command-line alternative to searching via the web interface.
Command used to tell the build and test system database to save the results of a run beyond the usual timeframe. Here are some examples:
nmi_pin -list
lists all of the runs that are currently pinned
nmi_pin --runid=24630
Stores the record of run 24630 for the default 60 days.
nmi_pin --unpin --runid=24630
Undoes the pin.
nmi_pin --runid=24630 --days=100
Stores the record of run 24630 for 100 days.
Usage: nmi_resource_advertiser
--nmiconf= Select which NMI configuration file to use
--routing-table Prints out the routing table
--broadcast Broadcast resource information to all hosts
--debug Enable debug output
Note: when the routing table is updated by nmi_resource_advertiser, the job router must be reconfigured before the new information is loaded. In some situations, it may reconfigure all of its host’s local Condor daemons. (It can’t use the target parameter for condor_reconfig because that tool doesn’t have logic to send a reconfig command to an arbitrary daemon. The resource advertiser actually tries to be smart about updating the job router: first it tries to send a SIGHUP with a killall condor_schedd.v7, then it tries a kill . If both of these measures fail for whatever reason, it then calls condor_reconfig.)
This command became available in Metronome 2.5.0.
nmi_resubmit_run uses the Metronome database to recreate and then submit a new copy of a previously-submitted run. It cannot recreate runs submitted by versions prior to Metronome 2.5.0. (However, at the time of this writing, we are working on a tool which will remove this restriction (on a run-by-run basis) for runs whose run directories still exist.) For example:
nmi_resubmit_run 75549
will submit a copy of runid 75549, reporting as nmi_submit normally does. By default, nmi_resubmit_run does not keep the submission directory around after nmi_submit succeeds. You may override this with the --submit-dir option, passed as a flag (the directory name defaults to the id of the run being recreated) or with an argument (specifying the submit directory’s name).
We intend this command to simplify disk-space management, as reproducing a user’s run no longer requires a run’s run directory to be preserved. We do not, however, recommend that you begin removing run directories immediately after upgrading, as this is a new feature, and may yet have bugs.
The nmi_rm command can be used to remove a run from the Metronome queue. It will not remove any information from the Metronome database. Unless the invoking user is root, it will only remove jobs that are owned by you.
Here are some examples:
Usage: nmi_rm [options] [runid|gid] [runid|gid]...
--user Which user to remove runs for (defaults to current)
--all Remove all runs for the current user (or all runs if root)
--force Always try to remove a run and update the database
--db-only Only update the database
--remove-consumers Remove any runs that depend on a given run
--help Print this message
nmi_rm 24630
Kills run with a runid of 24630.
nmi_rm mbletzin_grandcentral.cs.wisc.edu_1155560480_32710
Kills a run using its GID.
nmi_rm 24630 mbletzin_grandcentral.cs.wisc.edu_1155560480_32710 24631
GIDs and runids can be mixed together and given in succession on the command line.
nmi_rm --user=jsmith --all
Kills all of the runs for user jsmith.
This command first became available in Metronome 2.5.1.
Converts runids or GIDs given on the command-line into hostname:full-path pairs.
This command became available in Metronome 2.2.4.
This command searches the Condor job queue for Metronome runs with the given runid or GID. It accepts one argument, the runid or GID, and a number of options.
$ nmi_runid2condor 72313
Global ID: gthain_nmi-s003.cs.wisc.edu_1201097532_13311
Actively queued jobs for run:
293011
293012
293015
293020
The --history option will check Condor’s history files for the given runid or GID.
The --global option will check all job queues known to the local machine’s Condor collector, rather than just the local machine. For example, it will check the queues of nmi-s001.cs.wisc.edu and nmi-s005.cs.wisc.edu if invoked on nmi-s003.cs.wisc.edu.
nmi_runid2condor also accepts the usual --help and --nmiconf options.
This command is generally most useful in conjunction with condor_q’s analysis flags:
$ condor_q -bet 293020
...
293020.000: Run analysis summary. Of 236 machines,
233 are rejected by your job's requirements
2 reject your job because of their own requirements
1 match but are serving users with a better priority in the pool
0 match but reject the job for unknown reasons
0 match but will not currently preempt their existing job
0 are available to run your job
...
Condition                                        Machines Matched  Suggestion
---------                                        ----------------  ----------
1 target.nmi_platform == "hppa_hpux_11"          3
2 ( target.has_java_1_5_0_03 isnt undefined )    3
...
In this case, run 72313 can only run on one of three “machines” in the pool, two of which reject it, the other of which is busy. You can look at the web interface’s pool status pages to discover that this means a single physical hppa_hpux_11 machine exists, but this BUILD won’t be run by a TEST or PARALLEL slot, and that the BUILD slot is busy. In other cases, you might discover a typo in your prereqs (which would look something like 2 ( target.has_java_1_5_0_02 isnt undefined ) 0), or that you’ve requested a combination of existing prereqs not matched by any single machine. (In which case you should probably contact your Metronome installation’s administrator!)
The Condor ID passed to condor_q was the last one listed by nmi_runid2condor. This will generally, but not always, be the right ID to pass: each Metronome run maps to more than one Condor job, and those jobs are not the same throughout the run’s lifetime. As in the example, however, there should be four Condor jobs for the bulk of the run. Passing the wrong Condor ID to condor_q is harmless, and will generally result in it saying that the job is already being serviced.
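When scripting this workflow, the nmi_runid2condor output can be parsed to pick out the Condor job IDs. The Python sketch below assumes the output looks exactly like the sample shown earlier in this section (a "Global ID:" line, then an "Actively queued jobs" header followed by one job ID per line); the function name parse_runid2condor is hypothetical.

```python
def parse_runid2condor(output):
    """Extract (gid, [condor job ids]) from nmi_runid2condor-style output."""
    gid = None
    job_ids = []
    in_jobs = False
    for raw in output.splitlines():
        line = raw.strip()
        if line.startswith("Global ID:"):
            gid = line.split(":", 1)[1].strip()
        elif line.startswith("Actively queued jobs"):
            in_jobs = True
        elif in_jobs and line.isdigit():
            job_ids.append(int(line))
    return gid, job_ids


sample = """Global ID: gthain_nmi-s003.cs.wisc.edu_1201097532_13311
Actively queued jobs for run:
293011
293012
293015
293020
"""
gid, jobs = parse_runid2condor(sample)
# Per the text above, the last listed ID is usually the one to analyze.
print(gid, jobs[-1] if jobs else None)
```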
This command has been available since before Metronome 2.2.2.
This command converts Metronome runids (positive integers) into Metronome GIDs (strings). Metronome uses GIDs internally, so that it can run jobs during database failures. (The information is recorded on disk and entered into the database when it comes back up.) It takes one argument, the runid, and one option, the ubiquitous --nmi-conf, which allows you to set the location of the Metronome configuration file:
$ nmi_runid2gid 72316
tutorial_nmi-s005.cs.wisc.edu_1201099302_30463
This command is useful when interacting directly with Condor, because the GID of a Metronome run is entered in the ad for all of its tasks. However, nmi_runid2condor converts directly from runids to Condor IDs if a job is in queue, which can be more convenient.
Usage: nmi_submit
--nmiconf= Select which NMI configuration file to use
--must-match Job must match with resources before submitted
--notify-fail-only Only send notification if job fails
--verbose Enable verbose output
--quiet Do not print job submission information
--timeout= Number of seconds to wait for runid. Default is 180
--no-wait Do not wait for runid. Program returns immediately
--debug Enable debug output
--help Show this information
This command is used to start B&T runs. The command expects a submit file as input. For example the following executes the submit file perlHelloWorld.submit located in the current working directory:
nmi_submit perlHelloWorld.submit
Although the submit file, and any input specification files it references, must exist when nmi_submit is invoked, they are not read again, and may be removed afterwards. A copy of the input files, along with all of the submission’s runtime data and output files, is archived in the “run directory”. The current working directory of the original submission has no significance.
This command returns all the NMI test runids that use a given build submission as input. Given either a gid or a runid, the command prints out all the test runids one-by-one on separate lines.
Command usage:
nmi_testsforrun <runid|gid> [--nmiconf=<path>]
Using a runid:
$ nmi_testsforrun 32419
32419
32424
32426
32427
Using a gid:
$ nmi_testsforrun cndrauto_nmi-s001.cs.wisc.edu_1159503304_1509
32419
32424
32426
32427
This section documents the environment in which platform-specific (remote_*) tasks execute, and the NMI tools available to them.
Primarily intended to enable simple communication between parallel tasks on different hosts, the nmi_getattr and nmi_putattr tools are available on the remote execution platform and allow a task to read or write individual attributes from a common classad-like hash table.
nmi_putattr attr value
_nmi_putattr_ will set the value of the given attr (overwriting any preexisting value)
returns 0 upon success
returns non-zero upon failure to set the attr for any reason
nmi_getattr attr
_nmi_getattr_ will print to stdout the value of the given attr.
returns 0 if the attr is defined
returns 1 if the attr is undefined
returns >1 upon failure
Note: These tools are non-blocking. nmi_getattr will return 1 if you try to get a value that does not exist. The user must poll for values and determine how long to wait for a value to exist before giving up and declaring failure. Future plans for a Metronome-provided polling mechanism are in the works.
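Because the tools are non-blocking, a task that waits on another node has to poll. The following Python sketch shows one way to do so, relying only on the nmi_getattr return codes listed above (0 defined, 1 undefined, greater than 1 failure). The function names, the default timeout and interval, and the attribute name used in the usage note are assumptions made for illustration; this is not the Metronome-provided polling mechanism mentioned above.

```python
import subprocess
import time


def run_getattr(attr):
    """Invoke nmi_getattr and return (exit_code, printed_value)."""
    proc = subprocess.run(["nmi_getattr", attr],
                          stdout=subprocess.PIPE, universal_newlines=True)
    return proc.returncode, proc.stdout.strip()


def wait_for_attr(attr, timeout=60.0, interval=2.0,
                  fetch=run_getattr, sleep=time.sleep, clock=time.monotonic):
    """Poll until attr is defined; raise on failure or when timeout expires."""
    deadline = clock() + timeout
    while True:
        code, value = fetch(attr)
        if code == 0:
            return value          # attribute is defined
        if code != 1:
            raise RuntimeError("nmi_getattr failed with code %d" % code)
        if clock() >= deadline:   # code 1: still undefined
            raise TimeoutError("attribute %r not set within %s seconds" % (attr, timeout))
        sleep(interval)
```

A client node might call wait_for_attr("server_port", timeout=120) after the server node has run nmi_putattr server_port 7001; the attribute name here is purely illustrative. The fetch, sleep, and clock parameters exist so the loop can be exercised without a Metronome installation.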
The nmi_getfile and nmi_putfile tools are used to send files between nodes of a parallel job. These scripts are available on the remote execution platform and allow nodes to exchange files via the submit host.
nmi_putfile local_file remote_file
_nmi_putfile_ will send a file to the submit host
returns 0 upon success
returns 1 if the chirp invocation fails
returns 2 if the local or remote file are not defined, or if the local file does not exist
returns actual error code >1 upon failure
nmi_getfile remote_file local_file
_nmi_getfile_ will fetch a file from the submit host
returns 0 if the fetch worked successfully
returns 1 if the chirp invocation fails
returns 2 if the local or remote file are not defined
returns actual error code >1 upon failure
In principle, no NMI build or test artifact should be irreplaceable. The NMI database stores the full specification needed to reproduce all past builds or tests, and an NMI facility should continue to maintain the necessary platforms and prerequisites for as long as reproducibility is required.
In practice, however, there are three reasons to archive the full output of old runs: convenience, efficiency and “insurance”. It is convenient for developers to be able to examine the detailed output of a build or test after it completes (e.g., to perform a “post-mortem”). It can be more efficient to use the archived output of one build as the input to multiple test runs, rather than re-building the initial software from scratch each time it is needed. And it can be prudent to save the full results of important builds as insurance, in case a software, hardware, or administrative error renders it unexpectedly irreproducible in the future.
NMI provides a mechanism for archiving old run results to address these concerns. However, our focus is on convenience and efficiency, and so depending on the degree of “insurance” desired, projects may wish to keep their own additional archives of completed builds or tests.
There are two classes of archive in the NMI framework: a full archive and a metadata-only archive. It’s worth reiterating that the NMI DB stores indefinitely the full specification of each run and the essential outcome of each of its tasks (e.g., return code, execution time, etc.). The archives we are discussing here are the file-based data and metadata corresponding to the run’s submission, input, execution, and results.
The duration that these file archives are kept is site-dependent, and is usually a function of available disk space and the rate at which new results are being generated. The NMI software assumes that a given run’s results are to be stored only temporarily unless they are explicitly “pinned” by their owner. Pinned builds are kept until the specified expiration of their pin.
This means the results of each run are in one of the four states below. The present archival state of each run is stored in the database.
| State | Retention Goal (Which Runs, How Long) |
|---|---|
| a) full archive | most recent runs, until disk needed |
| b) pinned full archive | all pinned runs, until pin expires |
| c) metadata-only archive | all runs, forever(1) |
| d) no longer archived | none (only as a result of data loss) |
Valid state transitions are:
a -> b|c|d (after a pin, cleaning, or data loss, respectively)
b -> a|d (after an unpin or data loss)
c -> d (after a data loss)
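The transitions listed above form a small state machine, which can be written down as a lookup table. This Python sketch simply restates the list; the names STATES, TRANSITIONS, and can_transition are hypothetical and do not correspond to anything in the NMI database schema.

```python
# Archive states, keyed by the letters used in the table above.
STATES = {
    "a": "full archive",
    "b": "pinned full archive",
    "c": "metadata-only archive",
    "d": "no longer archived",
}

# For each state: the states reachable from it and the event that causes the move.
TRANSITIONS = {
    "a": {"b": "pin", "c": "cleaning", "d": "data loss"},
    "b": {"a": "unpin", "d": "data loss"},
    "c": {"d": "data loss"},
    "d": {},
}


def can_transition(current, new):
    """Return True if moving from state `current` to `new` is listed as valid."""
    return new in TRANSITIONS.get(current, {})


print(can_transition("a", "c"), can_transition("c", "a"))
```

Note that, as listed, a pinned run (b) must be unpinned back to (a) before it can be cleaned to metadata-only (c).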
Definitions:
In addition, in the UW-NMI B&T Lab we unofficially “back up” the full archive of each run being cleaned, and only permanently delete them as disk is needed (again, in reverse chronological order). These backups are not performed as part of the NMI framework, are not tracked in the DB, and there is no automated way to retrieve them — they are simply a failsafe in case the disk cleaner malfunctions or we need to retrieve a just-cleaned run someone forgot to pin.
(1) This policy may be untenable, depending on the size of metadata, the volume of runs, and the growth of available disk, but for the moment has been possible at UW-Madison. As a result, the NMI B&T software currently provides no automated means to “clean” the metadata of old builds.
The result code stored in the DB for a Metronome run is the number of failed tasks in that run. If the run had no failed tasks, Metronome will store a return code of 0. If the run is in a special state, the Metronome result code will be negative, as follows:
| Run Result Code | Key |
|---|---|
| null | running |
| >0 | completed and failed |
| 0 | completed successfully |
| -1 | removed |
NOTE: the nmi_run_status command-line tool interprets the value stored in the Metronome DB along with other information, and returns its own set of status codes for each run.
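For scripts that read the run result code directly from the DB, the table above translates into a few comparisons. The Python sketch below assumes a database NULL arrives as None; describe_run_result is a hypothetical helper and does not reproduce the separate status codes returned by nmi_run_status.

```python
def describe_run_result(code):
    """Translate a run result code from the DB into the key from the table."""
    if code is None:
        return "running"
    if code > 0:
        return "completed and failed (%d failed task(s))" % code
    if code == 0:
        return "completed successfully"
    if code == -1:
        return "removed"
    # Negative values other than -1 are not documented in this section.
    return "unrecognized special state (%d)" % code


for value in (None, 3, 0, -1):
    print(value, "->", describe_run_result(value))
```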
The result code reported by Metronome (and stored in the DB) for each Metronome task is the unix return code of the user-specified task script. If the script had no return code because it was killed by a signal, the Metronome result code will be the negative integer corresponding to the signal number. For example, -9 means SIGKILL.
For some Metronome or Condor-level failures (e.g., a job removed from the queue prior to the task script even running), the Metronome result code will be a special negative value beyond the range of signals (e.g., -1001).
These special negative Metronome result codes are documented here. The web portal does not yet translate all of them from numbers into human-readable terms.
| Task Return Code | Key | Description |
|---|---|---|
| 0 to 255 | Normal Exit | Return code of user-specified task executable |
| -1 to -31 | Killed by Signal | Negative integer corresponding to the Unix signal number |
| -32 | Execution failure | The user-specified task executable could not be executed or timed out |
| -1001 | Submission failure | DAGMan error code for a job submission failure |
| -1002 | Removed | The task's Condor job was manually removed from the queue "out from under" Metronome |
| -1003 | Interrupted | Tasks are given this result code if they were interrupted while executing and no return code was available. (E.g., if the Metronome wrapper overseeing the task was killed by an external entity, or if the remote resource crashed.) This result should be temporary and non-fatal: once the task is automatically re-executed, its new status will replace this one. |
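Since the web portal does not yet translate all of these codes, a small lookup can help when reading raw values. The Python sketch below follows the table above; describe_task_result is a hypothetical helper, and the signal names come from Python's signal module on the machine where the sketch runs, which may number some signals differently from the execution platform.

```python
import signal

# Special negative codes from the table above.
SPECIAL_CODES = {
    -32: "Execution failure",
    -1001: "Submission failure",
    -1002: "Removed",
    -1003: "Interrupted",
}


def describe_task_result(code):
    """Translate a task result code into the key used in the table."""
    if 0 <= code <= 255:
        return "Normal Exit (return code %d)" % code
    if -31 <= code <= -1:
        try:
            name = signal.Signals(-code).name
        except ValueError:
            # Signal number not known on this machine.
            name = "signal %d" % -code
        return "Killed by Signal (%s)" % name
    return SPECIAL_CODES.get(code, "Unknown (%d)" % code)


for value in (0, 1, -9, -32, -1002):
    print(value, "->", describe_task_result(value))
```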
Special meta-tasks (e.g., platform_job, or remote_task when sub-tasks have been declared) are given return codes corresponding to their total number of failed sub-tasks. If all of the tasks within the meta-task succeeded (i.e., returned 0), Metronome will assign the meta-task a return code of 0.
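As a worked illustration of the counting rule, the Python sketch below treats any non-zero sub-task code as a failure, which is an assumption based on the "returned 0" wording above; meta_task_code is a hypothetical name.

```python
def meta_task_code(subtask_codes):
    """Return the number of sub-tasks whose return code is not 0."""
    return sum(1 for code in subtask_codes if code != 0)


# Three sub-tasks: one success, one normal failure, one killed by SIGKILL.
print(meta_task_code([0, 1, -9]))
```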
This feature of Metronome builds on the Condor Parallel Universe and provides for running jobs on multiple machines simultaneously. Condor’s Chirp mechanism makes communication between the machines possible.
This node describes how to set up your pool to run parallel jobs. The UW NMI pool already has the DedicatedScheduler set up. The submitter is nmi-s005.cs.wisc.edu.
A Metronome submit file for a parallel job looks similar to a normal submit file, with the following exceptions:
platforms = (x86_fc_2, x86_rhas_3)
pre_all = glue/pre_all
platform_pre_0 = client/platform_pre
platform_pre_1 = server/platform_pre
remote_declare_0 = client/remote_declare
remote_declare_1 = server/remote_declare
remote_pre_0 = client/remote_pre
remote_pre_1 = server/remote_pre
remote_task_0 = client/remote_task
remote_task_1 = server/remote_task
remote_task_args_0 = 7000
remote_task_args_1 = 7001
remote_task_timeout_0 = 20
remote_task_timeout_1 = 20
remote_post_0 = client/remote_post
remote_post_1 = server/remote_post
platform_post_0 = client/platform_post
platform_post_1 = server/platform_post
post_all = glue/post_all
Chirp facilitates the underlying communication between nodes of a parallel job. Use the nmi_putattr and nmi_getattr scripts as described here to inject params directly into the job ad from one node and retrieve them from another. The scripts are sent to remote machines and are located in the NMI_BIN directory on each remote job node.
The older method of Chirp communication may be used as well. You may send a file to the head node using Chirp and then retrieve it on other nodes. Use the nmi_putfile and nmi_getfile scripts as described here for this purpose.
Certain attributes are published to the job ad and the remote environment. These attributes may be helpful in job synchronization and inter-job dependencies.
NMI_NODE_0_HOSTNAME=nmi-build26.cs.wisc.edu
NMI_NODE_1_HOSTNAME=nmi-build21.cs.wisc.edu
NMI_NODE_0_START_remote_task=1177970566
NMI_NODE_0_START_remote_pre=1177970565
NMI_NODE_0_RVAL_remote_pre=0
NMI_NODE_0_END_remote_pre=1177970566
NMI_NODE_0_START_remote_declare=1177970563
NMI_NODE_0_RVAL_remote_declare=0
NMI_NODE_0_END_remote_declare=1177970564
NMI_BIN=/home/condor/execute/dir_13377/bin
_CONDOR_SCRATCH_DIR=/home/condor/execute/dir_13377
_CONDOR_PROCNO=0
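A task can gather the per-node attributes shown above into a single structure before deciding how to synchronize. The Python sketch below infers the naming pattern (NMI_NODE_<n>_HOSTNAME and NMI_NODE_<n>_<START|END|RVAL>_<task>) from the sample values only, so treat the pattern as an assumption; collect_node_info is a hypothetical helper.

```python
import os
import re

# Pattern inferred from the sample attributes above (an assumption).
_NODE_VAR = re.compile(r"^NMI_NODE_(\d+)_(HOSTNAME|START|END|RVAL)(?:_(.+))?$")


def collect_node_info(environ):
    """Group NMI_NODE_* variables by node number and task name."""
    nodes = {}
    for name, value in environ.items():
        match = _NODE_VAR.match(name)
        if not match:
            continue
        node = nodes.setdefault(int(match.group(1)), {"hostname": None, "tasks": {}})
        kind, task = match.group(2), match.group(3)
        if kind == "HOSTNAME" and task is None:
            node["hostname"] = value
        elif kind != "HOSTNAME" and task is not None:
            # START/END are timestamps and RVAL is a return code in the sample.
            node["tasks"].setdefault(task, {})[kind.lower()] = int(value)
    return nodes


if __name__ == "__main__":
    for number, info in sorted(collect_node_info(os.environ).items()):
        print(number, info["hostname"], sorted(info["tasks"]))
```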
% condor_status -const 'DedicatedScheduler=!=Undefined'
...and it will list all the “P” slots. To confirm which schedd queue each slot is bound to, run:
% condor_status -const 'DedicatedScheduler=!=Undefined' -format '%s\t' Machine -format '%s\n' DedicatedScheduler
The GID, or Globally Unique ID, is an alphanumeric identifier used by the build and test system to refer to a specific build or test submission. It is generated at submission time, and is used internally as a component of the run’s archive directory (or run directory) path.
The Run ID, or runid, is a short numeric identifier also used by the build and test system to refer to a specific build or test submission. It is only generated once the submission is successfully registered into the database (it is the primary key of the record), and as a result it may not yet be known at submission time if the database is unavailable or unresponsive.
There is a one-to-one correspondence between Run IDs and GIDs; the only practical differences are that GIDs are long and unwieldy, but are guaranteed to be known at submission time, whereas Run IDs are short and convenient, but cannot be presumed to exist until the database is successfully initialized for each run. GIDs are also unique across NMI pools, whereas Run IDs are only unique within a given NMI pool. Although having two similar identifiers can be confusing, the distinction is important and exists in order to ensure that build & test jobs can be submitted even in the face of database performance or availability problems.
GIDs may be used to look up Run IDs (if known), and vice versa, via the nmi_gid2runid and nmi_runid2gid tools.
Both the Run ID and GID are used (often interchangeably) as arguments for many NMI command-line tools (e.g., nmi_pin). The Run ID is also used as the identifier for the nmi input method.
In addition to the status and outcome of each individual user-defined task in a run, the Metronome DB also contains the outcome of a handful of special meta-tasks for each run. These meta-tasks represent the collective result of a set of related tasks, and are useful for reporting their outcome as a whole.
For example, the platform_job meta-task represents the collective outcome of all the tasks which executed on a given platform in a run; likewise, the remote_task meta-task represents the collective outcome of all the user-defined tasks that were declared on a given platform (however, if no user-defined tasks were declared at all, remote_task is a regular task and not a meta-task).
For each specified platform in a Metronome build or test run, the DB records the outcome of a special platform_job meta-task representing the collective success or failure of all the tasks which executed on that platform.
aka build run
aka test run
aka build/test run
aka NMI run
A distinct build or test submission to Metronome (via nmi_submit). For more information, see Terminology.
aka build specification file
aka test specification file
aka build/test specification file
aka NMI specification file
aka NMI submit file
These are documentation snippets for which we don’t yet have a suitable high-level section in the reference manual. If there’s something here that you can find the right home for, please relocate it. If not, leave it here.
The metadata of each build/test run, and of all its tasks, are stored indefinitely in the NMI Build/Test database, so you should always be able to see (or query) the outcome of past builds.
However, the user-level output data (results.tar.gz files) of build/test runs are only archived on the submission host for a limited period of time after the run has completed, as defined by each individual NMI Build & Test Facility’s policy. After that time, the data may be deleted, unless they are explicitly “pinned” by the owner using the nmi_pin command. Pinned runs are never deleted.
Unlike the database, the output data archive is typically not backed up, and is not guaranteed to exist for any length of time — so if your build or test output data is difficult to reproduce, or you need access to it indefinitely, you should copy it to your own reliable storage. To make such copying easier, we plan to add optional “put” steps that can be defined at the end of a run (analogous to the “fetch” steps), which will allow you to specify at submit-time where & how you’d like your results transferred off-site. For more information, see http://grandcentral.cs.wisc.edu/nmi_drupal/?q=node/192.
A variable in the Build and Test System is any value that is set in the execution environment of a task by the system. For example, the variable NMI_PLATFORM can be accessed by remote programs in the following ways:
| Language | Expression |
|---|---|
| Bourne and C Shells | $NMI_PLATFORM |
| Perl | $ENV{'NMI_PLATFORM'} |
| Python | os.environ['NMI_PLATFORM'] |
| C | getenv("NMI_PLATFORM") |
All of the attributes in a run specification file can be accessed as environment variables at runtime. The build and test system prepends NMI_ to the attribute name and sticks its value into the environment variable. For example, the following commands:
platform_post = code/perlWhereAmI/whereAmI.pl
platform_post_args = platform_Post Task
platform_pre = code/perlWhereAmI/whereAmI.pl
platform_pre_args = platform_Pre Task
...are transformed into the following environment variables by the build and test system.
NMI_platform_post=code/perlWhereAmI/whereAmI.pl
NMI_platform_post_args=platform_Post Task
NMI_platform_pre=code/perlWhereAmI/whereAmI.pl
NMI_platform_pre_args=platform_Pre Task
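The naming rule can be stated in a few lines of Python. This is only a sketch of the transformation described above (prepend NMI_ to the attribute name and keep the value), not the build and test system's own code; attributes_to_env is a hypothetical name.

```python
import os


def attributes_to_env(attributes):
    """Return the environment variables implied by run specification attributes."""
    return {"NMI_" + name: value for name, value in attributes.items()}


spec = {
    "platform_pre": "code/perlWhereAmI/whereAmI.pl",
    "platform_pre_args": "platform_Pre Task",
}
for name, value in sorted(attributes_to_env(spec).items()):
    print("%s=%s" % (name, value))

# Inside a running task, the same values would be read back from the environment.
print(os.environ.get("NMI_platform_pre_args"))
```

Outside of a Metronome task the final line prints None, because the variable is only set by the system at run time.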
These environment variables are defined for all of the tasks.
First available in Metronome 2.2.8.
If set, the environment variable ‘NMI_CONF’ gives the location of the nmi.conf file. It is overridden by the command-line argument but takes precedence over the system default.
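The precedence order can be written out as a small Python sketch: command-line argument first, then NMI_CONF, then the system default. The function name `resolve_nmi_conf` and the placeholder default path are hypothetical; the real default location depends on the installation.

```python
import os


def resolve_nmi_conf(cli_value=None, environ=None,
                     system_default="/path/to/default/nmi.conf"):
    """Pick the nmi.conf location: command line, then NMI_CONF, then default.

    system_default is a placeholder; the real default is installation-specific.
    """
    if environ is None:
        environ = os.environ
    if cli_value:
        return cli_value  # the command-line argument always wins
    env_value = environ.get("NMI_CONF")
    if env_value:
        return env_value  # the environment variable beats the default
    return system_default


print(resolve_nmi_conf(environ={"NMI_CONF": "/home/user/nmi.conf"}))
```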
This variable contains the platform name that is associated with the task. For tasks such as pre_all and post_all the variable is set to local. Note that the platform name corresponds to the remote platform that the tasks are targeting rather than the platform the task is running on. For submit tasks such as platform_pre and platform_post this means that the variable identifies the platform the subsequent remote tasks will be run on rather than the current submit platform. The following table shows an example of how the value of this variable changes based on the tasks involved. The table shows a job that was submitted from a Linux machine (i386-linux-thread-multi) and run on a Solaris platform (sun4u_sol_5.8).
| Task | NMI_PLATFORM Value | Platform Name |
|---|---|---|
| pre_all | local | i386-linux-thread-multi |
| platform_pre | sun4u_sol_5.8 | i386-linux-thread-multi |
| remote_pre_declare | sun4u_sol_5.8 | sun4-solaris |
| remote_declare | sun4u_sol_5.8 | sun4-solaris |
| remote_pre | sun4u_sol_5.8 | sun4-solaris |
| remote_task | sun4u_sol_5.8 | sun4-solaris |
| remote_post | sun4u_sol_5.8 | sun4-solaris |
| platform_post | sun4u_sol_5.8 | i386-linux-thread-multi |
| post_all | local | i386-linux-thread-multi |
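A task script that needs to distinguish the two cases in the table might check NMI_PLATFORM along these lines. This is a minimal Python sketch; the helper name `target_platform` is hypothetical.

```python
import os


def target_platform(environ=None):
    """Return the remote platform a task targets, or None if there is none.

    None is returned both when NMI_PLATFORM is "local" (pre_all, post_all)
    and when the variable is not set at all.
    """
    if environ is None:
        environ = os.environ
    value = environ.get("NMI_PLATFORM")
    if value is None or value == "local":
        return None
    return value


platform = target_platform()
if platform is None:
    print("no specific remote platform is targeted")
else:
    print(f"this task targets {platform}")
```

Remember that for platform_pre and platform_post the returned value names the platform of the upcoming remote tasks, not the submit host the script itself is running on.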
This variable is present for every shell and is the standard way to tell the shell where executables are located. The build and test system sets this variable for every remote platform so that the standard set of executables plus any additional prereq executables are used.
By design, remote tasks should run with only their prereqs and the default OS bin directories in their PATH, so the software they see is explicit and predictable. Currently, remote tasks run with the following specific PATH elements, in order:
This variable contains the GID of the run. It is only present in the tasks that are run on the submit machine: pre_all, platform_pre, platform_post, and post_all.
Set by the build and test system for every prerequisite requested and found. The variable is set to the installation path of the prereq. The format of the variable name is _NMI_PREREQ_[prereq name]_[prereq_version]_ROOT. The version number is delimited by underscore characters “_” instead of periods “.”. For example, the variable for the prereq coreutils-5.2.1 would be _NMI_PREREQ_coreutils_5_2_1_ROOT.
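As an illustration, the variable name can be derived from a prereq name and version with a short Python helper. The function names are hypothetical, and the sketch only replaces periods, as described above; how other characters in a version string are handled is not covered here.

```python
import os


def prereq_root_var(name, version):
    """Build the variable name, e.g. coreutils 5.2.1 -> _NMI_PREREQ_coreutils_5_2_1_ROOT."""
    # Only periods are replaced; other characters are left untouched (assumption).
    return "_NMI_PREREQ_{}_{}_ROOT".format(name, version.replace(".", "_"))


def prereq_root(name, version, environ=None):
    """Return the prereq's installation path, or None if the variable is unset."""
    if environ is None:
        environ = os.environ
    return environ.get(prereq_root_var(name, version))


print(prereq_root_var("coreutils", "5.2.1"))
```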
This environment variable is set only in the environment of remote platform tasks, and contains the name of the last remote task to have failed on that platform. If nothing has failed on the platform the variable is not defined.
If a user-defined task has failed, then _NMI_STEP_FAILED will be set to remote_task.
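A later remote task (for example, a cleanup step) could use this variable to decide what to do. The following Python sketch only reads the variable; the helper name `failed_step` is hypothetical.

```python
import os


def failed_step(environ=None):
    """Return the name of the last failed remote task, or None if nothing failed."""
    if environ is None:
        environ = os.environ
    # The variable is simply absent when no remote task has failed.
    return environ.get("_NMI_STEP_FAILED")


step = failed_step()
if step is None:
    print("no remote task has failed on this platform")
else:
    print(f"last failed remote task: {step}")
```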
This variable contains the name of the task that is currently being executed. This variable is only present during the remote_task task.
The build and test system has several well-known filenames that it looks for. These files can be used to change the behaviour of the run.
This feature first appeared in Metronome 2.5.1.
On the submit node, in the working directory of the post_all step (<rundir>/userdir/common), the contents of the file notify.nmi will be appended to the notification e-mail, if any, sent by the system.
When the build and test system sees this file (results.tar.gz) on a remote platform, it copies it back to the run directory (/nmi/run/GID), into the user directory associated with the platform. For example, for a run with a GID of mbletzin_grandcentral.cs.wisc.edu_1151672109_29028 on a ppc_aix_5.3 platform, the results.tar.gz file will be found in /nmi/run/mbletzin_grandcentral.cs.wisc.edu_1151672109_29028/userdir/ppc_aix_5.3/results.tar.gz
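The layout described above can be expressed as a small Python helper that builds the expected path from a GID and a platform name. The function name is hypothetical, and the `/nmi/run` root is taken from the example above; other facilities may use a different run directory.

```python
import posixpath


def results_path(gid, platform, run_root="/nmi/run"):
    """Return where a platform's results.tar.gz is expected on the submit host."""
    # run_root defaults to the location used in the example; it may differ per site.
    return posixpath.join(run_root, gid, "userdir", platform, "results.tar.gz")


print(results_path("mbletzin_grandcentral.cs.wisc.edu_1151672109_29028",
                   "ppc_aix_5.3"))
```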
This optional, user-defined file is used to tell the build and test system to subdivide the remote_task. If it exists, it is examined by the NMI software on a remote platform before remote_task is invoked.
The format of the file is the task name and a timeout value for the task. The two items are delimited by one or more spaces, so a task name cannot contain any spaces. Each task is on a separate line. The timeout value defaults to minutes, but the unit may be specified (‘M’ or ‘m’ for minutes, ‘s’ or ‘S’ for seconds). For example, if a tasklist.nmi file contains the following lines, then the build system will split the remote task into three separate tasks:
MyTaskOne 10
MyTaskTwo 10
EndOfMyTasks 2
The system will fail the first two tasks if their execution exceeds 10 minutes and fail the last task if its execution is more than 2 minutes. The overview page will show each task as a separate row.
These task names will be passed to your remote_task in the environment variable _NMI_TASKNAME, once per name. See our tutorial for an example.
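One way to structure a remote_task program so that it handles each name is a simple dispatch table. The Python sketch below uses the three task names from the example; the handler functions are placeholders, and signalling failure through a nonzero exit status is an assumption rather than documented behaviour.

```python
import os
import sys


def my_task_one():
    print("running MyTaskOne")
    return 0


def my_task_two():
    print("running MyTaskTwo")
    return 0


def end_of_my_tasks():
    print("running EndOfMyTasks")
    return 0


# Map each name listed in tasklist.nmi to the function that performs it.
HANDLERS = {
    "MyTaskOne": my_task_one,
    "MyTaskTwo": my_task_two,
    "EndOfMyTasks": end_of_my_tasks,
}


def run_task(task_name, handlers=HANDLERS):
    """Run the handler for task_name; return 1 for an unknown or missing name."""
    handler = handlers.get(task_name)
    if handler is None:
        print(f"unknown task name: {task_name!r}", file=sys.stderr)
        return 1
    return handler()


if __name__ == "__main__":
    # The system invokes remote_task once per name, with the name in _NMI_TASKNAME.
    sys.exit(run_task(os.environ.get("_NMI_TASKNAME")))
```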
In Metronome 2.5.1, you may specify ‘h’ or ‘H’ for units of hours.
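To make the tasklist.nmi format concrete, here is a minimal Python parser that converts each line into a task name and a timeout in seconds. It is a sketch, not the parser Metronome uses: the function name is hypothetical, and accepting the unit letter either directly after the number or after a space is an assumption, since the exact placement is not specified above.

```python
import re

# Seconds per unit letter; 'h' is accepted as described for Metronome 2.5.1.
_UNIT_SECONDS = {"s": 1, "m": 60, "h": 3600}

# <task name> <whitespace> <number> [optional unit letters]
_LINE = re.compile(r"^(\S+)\s+(\d+)\s*([A-Za-z]*)$")


def parse_tasklist(text):
    """Return a list of (task_name, timeout_in_seconds) tuples."""
    tasks = []
    for raw in text.splitlines():
        line = raw.strip()
        if not line:
            continue
        match = _LINE.match(line)
        if match is None:
            raise ValueError(f"malformed tasklist line: {raw!r}")
        name, amount, unit = match.groups()
        unit = (unit or "m").lower()  # the timeout defaults to minutes
        if unit not in _UNIT_SECONDS:
            raise ValueError(f"unknown timeout unit in line: {raw!r}")
        tasks.append((name, int(amount) * _UNIT_SECONDS[unit]))
    return tasks


example = "MyTaskOne 10\nMyTaskTwo 10\nEndOfMyTasks 2\n"
for name, seconds in parse_tasklist(example):
    print(name, seconds)
```

Running a check like this before submitting can catch a task name that accidentally contains a space, which the format does not allow.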
This section discusses the contents of the run directory. The path name of this directory contains the GID of the run.
These files contain the standard error and standard output of the task.
Copy of the build and test specification file that was submitted for this run.
Directory which contains all of the files that are downloaded to the remote platforms.
This directory contains the fetched distribution plus whatever is added by the pre_all, platform_pre, and platform_post tasks.
There are a number of production NMI Build & Test Facilities around the world, in both the public and commercial sectors. Some of them are described below.
Each group of links goes to the same page. Use whichever host seems fastest at the time.
Status of Runs
The current status of all build and test runs in the UW-Madison NMI Build & Test Facility: nmi-s005 | nmi-s003 | nmi-s001
Status of Machines
The current status of all machines in the UW-Madison NMI Build & Test Facility: nmi-s005 | nmi-s003 | nmi-s001
RSS feed of UW NMI Build and Test Results
Users should feel free to log in and submit jobs from any of the following hosts in the UW-NMI Lab:
The UW-Madison Build & Test Lab is available for use by any of our various project collaborators, and we welcome any affiliated individuals to apply for an account.
If you already have an account in the UW Computer Sciences Dept. (the CSL), then complete this form.
If you do not have an existing account in UW Computer Sciences Dept., complete this form instead.
You may wish to sign up for relevant NMI Build & Test Lab mailing lists:
Please contact nmi-staff@cs.wisc.edu, Peter Couvares (pfc@cs.wisc.edu), or Becky Gietzel (bgietzel@cs.wisc.edu) for more information.
Although the Lab’s resources are primarily dedicated to automated build and test jobs, they are also available for interactive debugging when necessary. In order to log in to a specific Metronome node from the public network, you must first ssh into the nmi-net.cs.wisc.edu bastion host using your existing NMI account.
We do not have the resources to dedicate hardware to interactive use, however, so while you should feel free to log in to individual nodes when necessary, we ask that you not use them for regular interactive development work.
If you need to perform builds, tests, or other work which consume significant system resources and may interfere with Metronome jobs running in the background, please contact us first so we can provide you with dedicated access to the necessary resources for a limited period of time.
The NMI Build & Test Facility at CERN supports the ETICS (eInfrastructure for Testing, Integration and Configuration of Software) project, which is coordinated by CERN and funded partially by the European Commission, and aims to improve the quality of Grid and distributed software by offering a practical quality assurance process to software projects, based on a build and test service.
The NMI Build & Test Facility at INFN supports the eInfrastructure for Testing, Integration and Configuration of Software (ETICS) project, which is coordinated by CERN and funded partially by the European Commission, and aims to improve the quality of Grid and distributed software by offering a practical quality assurance process to software projects, based on a build and test service.
INFN’s NMI Build & Test Facility home page can be found here.
TeraGrid is an open scientific discovery infrastructure combining leadership class resources at nine partner sites to create an integrated, persistent computational resource.
The TeraGrid Build and Test Facility is dynamically deployed on-demand across distributed TeraGrid resources.
One of the benefits of a grid computing environment is that users are able to widen their access to different resources throughout the world. By leveraging a full-featured batch system, the NMI framework is able to execute jobs on computing resources beyond the local administrative domain. Thus, users may access specific operating system and architecture combinations that may not be available locally.
Currently, an NMI user must declare in their build or test specification file that the platform tasks should be routed to a remote execution site, and must provide the explicit location of that site. In future NMI releases, the system will be able to migrate jobs automatically to remote sites, if a match is unable to be made locally.
For purposes of these instructions, we refer to the site at which the jobs are being submitted as the local site, and the site to which the jobs are being sent as the remote site.
In order to enable the remote site execution, the local Condor installation (that is, the Condor installation where the jobs will be submitted from) must be configured to enable Condor’s grid technologies. As of Condor 6.8, this should be enabled by default.
Another important consideration is the authorization mechanism that the two Condor installations will use to allow jobs to execute. For testing purposes, the following parameters can be added to Condor’s configuration file.
SEC_DEFAULT_NEGOTIATION = OPTIONAL
SEC_DEFAULT_AUTHENTICATION_METHODS = CLAIMTOBE
This should only be a temporary option. It is highly advised that you switch to a more secure Condor authentication mechanism when deploying the NMI framework into production mode.
The remote execution site must be configured to allow outside connections to the Condor daemons. The following options should be included in the remote Condor’s configuration file, where local-condor.example.com is the address of the schedd on the local machine.
HOSTALLOW_WRITE = $(HOSTALLOW_WRITE), local-condor.example.com
HOSTALLOW_READ = $(HOSTALLOW_READ), local-condor.example.com
You may also use the wildcard option to allow an entire domain to communicate with Condor:
HOSTALLOW_WRITE = $(HOSTALLOW_WRITE), *.example.com
HOSTALLOW_READ = $(HOSTALLOW_READ), *.example.com
The remote Condor installation must also specify the authorization method that it will try to negotiate with the other site:
SEC_DEFAULT_NEGOTIATION = OPTIONAL
SEC_DEFAULT_AUTHENTICATION_METHODS = CLAIMTOBE
For information about how to configure Condor to operate correctly through firewalls, please refer to this section of the Condor manual.
More information about Condor’s grid technology can be found in the Condor manual.
Q: What’s the difference between “Metronome” and the “NMI Build and Test Lab”?
A: Metronome is the software run by the Lab. The same people here at the University of Wisconsin run the Lab and develop the software, so it’s easy to get confused! Other organizations maintain different Metronome installations.
Q: I have an account on your website. What else do I need to get started?
A: For security reasons, accounts on this website are not linked to the Lab’s log-in accounts. You’ll have to request a Lab account before you can build or test your software.
Q: We want to use the NMI Build and Test Lab to test our software. How do we send the code to you?
A: We, the Lab staff, don’t take part in building or testing your software; in fact, the process can be completely automated and even run every night, regardless of where you keep the source, provided it can be accessed from our servers. We certainly recommend using a version control system, and support CVS and Subversion, but you can access your source via ftp, http, or scp as well. Please see our tutorials for more information about getting started.
Initially, because many organizations don’t allow outside access to their CVS or Subversion repositories, it will probably be easiest to log into a Lab machine and use a command-line tool (perhaps one requiring you to type a password) to copy the source of some specific version of your software to one of our machines. The build and test system, Metronome, can then be directed to use the local copy. Once you’ve finished automating your build or test (we can help), you can look into automating access to your source. (Our advice comes down to “use SSH keys”, but the specifics really depend on the particulars of your project and the administration of the machines hosting your source.)
Q: Is there a way for me to specify a prereq of ‘any java 1.4.2’ rather than explicitly matching up platforms with 1.4.2_08, _05, whatever and submitting separately?
A: No, but this is a feature, not a bug. ;)
In order to ensure reproducibility of a build or test, Metronome requires that there be no “variables” in the specification file. If a test could match with more than one java version, then it might work today and fail tomorrow for no apparent reason. (For instance, on ppc_macos_10.4 there are java-1.4.2_07 and java-1.4.2_09 prereqs, because there are some differences in results between them.)
That being said, we know it can be tedious to provide a greater degree of specificity than you really care about, and have some tools to help. The first is ‘nmi_list_prereqs’, which is installed on all of our submit nodes (and ships with the Metronome distribution). You can use this tool to discover the specific versions of java available on different platforms. (For example, ‘nmi_list_prereqs --platform=macos java’.)
There’s also a script in our contrib area, ‘add_prereq_version_to_all_platforms’, that takes a prereq name and a prefix to a version (‘java’ and ‘1.4.2’ in our example) and generates a set of lines you can add to your submit file which specify, per-platform, which specific ‘java’ prereq is version ‘1.4.2’.
In general, you don’t have to submit separately to make your prereqs platform-specific; see prereqs. Be sure to remove the prereq from your general requirements line, since Metronome takes the union of the two.
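The idea behind generating per-platform prereq lines can be sketched in Python as follows. This is not the contrib script's implementation: the function name, the `available` mapping (which you would fill in from what nmi_list_prereqs reports), and the second platform name in the example are all hypothetical. When several versions match on one platform, the sketch picks the lexicographically last one, which is an arbitrary choice you may want to replace with an explicit decision.

```python
def platform_prereq_lines(available, name, version_prefix):
    """Build 'prereqs_<platform> = <prereq>' lines for a version prefix.

    available maps a platform name to the prereqs installed on it,
    e.g. {"ppc_macos_10.4": ["java-1.4.2_07", "java-1.4.2_09"]}.
    """
    wanted = f"{name}-{version_prefix}"
    lines = []
    for platform in sorted(available):
        matches = sorted(p for p in available[platform] if p.startswith(wanted))
        if matches:
            # Arbitrary tie-break: take the lexicographically last match.
            lines.append(f"prereqs_{platform} = {matches[-1]}")
    return lines


available = {
    "ppc_macos_10.4": ["java-1.4.2_07", "java-1.4.2_09", "java-1.5.0_06"],
    "example_platform": ["java-1.4.2_08"],  # hypothetical platform name
}
for line in platform_prereq_lines(available, "java", "1.4.2"):
    print(line)
```

Because each generated line names exactly one version, the resulting submit file stays fully specified, which preserves the reproducibility that the “no variables” rule is meant to protect.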
The tutorials presented here are designed to help new users become familiar with the Metronome software. You will need to get an account on the UW-Madison Build & Test Lab in order to try out these examples.
This tutorial is also available as a single, printer-friendly document here.
You will need a terminal window on a UW NMI Lab submit host, and a browser which can view the B&T web status pages.
On the terminal window you will need to perform a few setup steps if you want to cut and paste directly from these pages.
bash$ export NMI_BIN=/nmi/bin
bash$ export NMI_LIB=/nmi/lib
bash$ source $NMI_BIN/config.[sh/csh]
bash$ which nmi_submit
/nmi/bin/nmi_submit
bash$ nmi_submit --help
Usage: /nmi/bin/nmi_submit
  --nmiconf=          Select which NMI configuration file to use
  --must-match        Job must match with resources before submitted
  --notify-fail-only  Only send notification if job fails
  --verbose           Enable verbose output
  --quiet             Do not print job submission information
  --timeout=          Number of seconds to wait for runid. Default is 180
  --no-wait           Do not wait for runid. Program returns immediately
  --debug             Enable debug output
  --help              Show this information
You should also point your browser to the Build & Test Overview page
As shown in the figure below, when the Build & Test Overview page is first displayed, the page lists all of the submissions that have been run on the submit machines. The first thing that needs to be done is to filter the submissions down to a manageable number using the search box. In this case the page displays 1109 submissions.
The submission results can be sorted by each column listed in the page that contains sort buttons. As shown in the previous picture, each column which can be sorted has two buttons which will sort the results in either descending or ascending order.
The search box is used to filter out unwanted submissions. The figure below shows the fields that can be used to filter the results:

The following picture shows an example details page for the build and test run:

This tutorial introduces the steps needed to run a B&T submission. The dissection sections explain the contents of the submit file and input file used in the exercise. The procedure and examining sections walk through the steps of the exercise. Two submit procedures are presented. The version using CVS input shows you how the build and test system retrieves code from your source repository. The version using SCP shows you how the build and test system can download input files from the submit machine. You should have already prepared for this tutorial by completing the preparations section.
This tutorial demonstrates a simple build and test run using the CVS input method. Download the attached files and move them to the working directory of your submit machine and then follow the sections below. The first two pages discuss the contents of the files you have downloaded. This is followed by the procedure to execute the run and examine the results.
The following specifies fields that identify the run so that it can be located on the Build & Test Overview page.
project = tutorial
component = perlHelloWorld
component_version = 1.0.0
description = This is a simple example
run_type = build
This line points to the file perlHelloWorld.cvs for an input definition. The B&T system expects the file to be in the same directory as where nmi_submit is executed.
inputs = perlHelloWorld.cvs
These lines specify that the command _“code/perlHelloWorld/helloWorld.pl Rem