This section describes how to download, install, configure, and maintain a Metronome lab at your own site.
In the near future we will create more specialized mailing lists for announcements and discussion relevant to Metronome lab administrators. For the time being, however, the following lists are recommended:
Subsequent to Metronome 2.2.8, all releases with even minor versions will be stable releases, and all releases with odd minor versions will be development releases.
The first stable series will be 2.4.x, and the first development series will be 2.5.x.
Release Date: Feb 22, 2007.
MD5 checksum: d8957d492270892dafd3dd5f3cbcafcb ./nmi-2.2.2.tar.gz
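A downloaded tarball can be checked against the MD5 checksum published with each release. The following is a minimal sketch, not part of the Metronome toolkit; the function names and the local file path in the usage comment are assumptions made for illustration.

```python
import hashlib


def md5_of_file(path, chunk_size=1 << 20):
    """Return the hex MD5 digest of the file at `path`, reading it in chunks."""
    digest = hashlib.md5()
    with open(path, "rb") as handle:
        for chunk in iter(lambda: handle.read(chunk_size), b""):
            digest.update(chunk)
    return digest.hexdigest()


def matches_published_md5(path, published):
    """Compare a file's digest with a checksum copied from the release notes."""
    return md5_of_file(path) == published.strip().lower()


# Usage (hypothetical local path):
#   matches_published_md5("./nmi-2.2.2.tar.gz", "d8957d492270892dafd3dd5f3cbcafcb")
```

The file is read in chunks so that large tarballs do not need to fit in memory.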
NOTE: Be sure to read the NMI 2.2.0 Release Notes to understand the changes made since NMI 2.1.8. Most notable are the configuration parameter changes. The NMI 2.2.1 Release Notes summarize the changes from 2.2.0.
This release fixes a major bug recently discovered at our production facility.
Metronome 2.2.2 is marked as a STABLE release; all users of NMI 2.2.x are encouraged to upgrade to this latest release.
nmi_putattr command now correctly handles values containing quote characters.
Release Date: 03/29/2007
MD5 checksum: cc25e5f463f8e9228c05404796e70721
nmi_rm. You can now remove all users' jobs if run as root. More information can be found here.
nmi_putattr and nmi_getattr commands now correctly store and retrieve any valid string value, regardless of its contents, and correctly reject strings containing invalid characters.
-w -ww options for nmi_condor_status to use the database configuration information from nmi.conf. This allows these options to function properly in more environments.
This release of Metronome is known to be broken when used with Condor 6.9.2. If you want to use Condor 6.9.2, download release 2.2.3 (or earlier) or 2.2.5 (or later).
Release Date: 05/04/2007
MD5 checksum: fc8e23f58b288d391b3c8116007939e3
Due to the large size of some Metronome installations, continuing with a single directory for all runs proved infeasible. At the installation here at UW-Madison, our submit nodes were bogged down because a single run directory contained 20,000+ subdirectories at a single level. Therefore, starting in Metronome 2.2.4, the framework will break up directories into the following levels:
/path/to/nmi/rundir/<4 digit year>/<2 digit month>/<username>/<run directory>/
Example:
/nmi/run/2007/04/pavlo/pavlo_nmi-s002.cs.wisc.edu_1175765722_27363/
See this report for more information and a discussion about the change. The Metronome toolkit and web interface have been retrofitted to be backwards compatible with the old run directory format. One can use the new nmi_migrate_run utility to transition run directories to the new hierarchy.
The web status pages can now optionally provide visitors with the ability to add notes and comments for runs. If you are upgrading from an existing installation, you must execute the following SQL command to add the new column to the database.
ALTER TABLE Run ADD COLUMN notes varchar(255) NOT NULL DEFAULT '';
In order for this feature to work, the DB_READER_USER account in the database must be granted update permissions to the notes in the Run table. Use the following command to update your database privileges table (changing DB_READER_USER and DB_READER_PASS to match your existing account).
GRANT UPDATE (notes) ON nmi_history.Run \
TO 'DB_READER_USER'@'%.example.com' IDENTIFIED BY 'DB_READER_PASS';
Lastly, you must also set RUN_ALLOW_USER_NOTES to true in the web interface’s configuration file (etc/config.inc).
nmi_submit now produces more succinct and useful output unless --verbose is specified (feature 472)
--history option that will pull Condor job ids from the installation’s history log file.
nmi.conf with the option FETCH_RETRY_COUNT.
nmi_resource_advertiser no longer reconfigures the local Condor daemons every time it is executed, but now only does so when the routing table contents have changed.
nmi_rm to allow runs to be removed when their result code is null, and to correctly remove dependent runs when the --remove-consumers flag is used (bugs 501 and 868)
platform_job task’s Condor job classad.
nmi_runid2gid and nmi_gid2runid when a runid/gid cannot be found (bug 921).
nmi_gnutar attribute in their Condor machine classad, Metronome will use it instead. (Note: due to a Condor bug, this does not currently work for parallel tasks.)
nmi_rm not handling multiple runids as input.
Release Date: 2007-05-15
metronome-2.2.5.tar.gz
MD5 checksum: 3f1eb4f04b6ea283d27e77e1ecaf88f5
Metronome-2.2.5-0.noarch.rpm
MD5 checksum: 150b3de16f959ae89ac6da3ee8a1cdce
None
nmi_list_prereqs, which now ignores the ‘mtime’ entries.
nmi_list_prereqs.
Release Date: 2007-05-31
MD5 checksum: 31d7105c172ecd04367f1dd6e4336bfc
MD5 checksum: 470c4ea2c6bbd1e3510cfc4fc81fc6e2
None
Release Date: 06/10/2007
nmi-2.2.7.tar.gz (MD5 checksum: 57f8bc43797cf41f26800e67cd134169)
Metronome-2.2.7-0.noarch.rpm (MD5 checksum: 2a9dd9d8f346b5ebc9c353039e9491ee)
In Metronome 2.2.4, the run archive directories began being named using the 4 digit year as the “root”. This caused problems with permissions and directory ownership. Therefore, starting in Metronome 2.2.7, run archive directories will be named using the owner’s username as the “root”, followed by subdirectories like so:
/path/to/nmi/rundir/<username>/<4 digit year>/<2 digit month>/<run directory>/
Example:
/nmi/run/pavlo/2007/06/pavlo_nmi-s002.cs.wisc.edu_1175765722_27363/
See this report for more information and a discussion about this change. Metronome’s directory naming is backwards compatible, and existing run archives named using the old directory names will be recognized without any problems. However, administrators may use the new nmi_migrate_run utility to transition run directories to the new hierarchy format if they wish.
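The two run directory layouts (2.2.4 and 2.2.7) differ only in where the owner's username sits. The sketch below builds both paths from the same pieces; it is an illustration only, the function names are hypothetical, and it assumes the year and month are supplied by the caller rather than derived from the run itself.

```python
import posixpath


def run_path_2_2_4(rundir, year, month, user, run_name):
    """Layout introduced in 2.2.4: <rundir>/<year>/<month>/<user>/<run>."""
    return posixpath.join(rundir, f"{year:04d}", f"{month:02d}", user, run_name)


def run_path_2_2_7(rundir, year, month, user, run_name):
    """Layout introduced in 2.2.7: <rundir>/<user>/<year>/<month>/<run>."""
    return posixpath.join(rundir, user, f"{year:04d}", f"{month:02d}", run_name)


# Usage, with the run name taken from the examples in these release notes:
#   run_path_2_2_7("/nmi/run", 2007, 6, "pavlo",
#                  "pavlo_nmi-s002.cs.wisc.edu_1175765722_27363")
```

Placing the username first means each user's subtree can be owned by that user, which is the permissions problem the 2.2.7 change addresses.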
nmi_rm now uses the Condor job’s ProcID when removing jobs from the queue. This fixes compatibility with parallel jobs. More information can be found here.
Release Date: 2007-08-08
MD5 checksum: 1e9060cb5b141e1ed228d6d5fe27dda3
MD5 checksum: 71a59e60fd90fefe05e0aef96c9658f1
To support the web status pages’ ability to retain user preferences across multiple submit nodes, the web page database user must now be able to write to the ‘sessions’ table. The schema file, which now also describes the ‘sessions’ table, has been renamed from schema.sql to schema.mysql; please see it for details. We regret that sites with only one submit node cannot at present readily disable this feature.
nmi.conf under the Metronome install path specified at install time.
nmi_condor_q fails to properly display jobs migrated to other Metronome sites
use_condor_job_leases is enabled in nmi.conf: Condor >= 6.9.4
This is a stable release of Metronome. It contains only new bug fixes or new platform support.
Release Date: 2007-08-30
MD5 checksum: 1c2fb3ea8fed3626760ad644fb98ae15
MD5 checksum: 42775e9de4e14c26378077ff0e11c4bf
None.
nmi_rm no longer throws a spurious Perl warning.
nmi_condor_q handles remote (migrated) jobs better.
use lib $ENV{'NMI_LIB'} || "/usr/local/nmi-2.2.7/lib";
This is a stable release of Metronome. It contains only new bug fixes or new platform support.
Release Date: 2007-09-24
MD5 checksum: 74c68ec1e2bf28974eaaa739efb38f00
MD5 checksum: 9141a68c0735fe02523f785042425e3c
None.
None.
This is a stable release of Metronome. It contains only new bug fixes or new platform support.
Release Date: 2007-10-01
MD5 checksum: 4050fd0dc9f3d9d8f68118a467216476
MD5 checksum: 68bacd3a637e6eae7acbbe2b1f8c3f4d
None.
None.
This is a stable release of Metronome. It contains only new bug fixes or new platform support.
Release Date: 2007-10-18
MD5 checksum: 5295d8bd7a033b018fb4c0466b245a43
MD5 checksum: 25b6b04428475c97e00a8395f9c0a537
Because of some security enhancements, Metronome users upgrading from 2.4.2 and earlier may need to adjust the configuration of their webserver to make sure that the run directory, as set in RUN_DIR in nmi.conf, and the RUN_DIR_URL (also set in nmi.conf) are both accessible from the web. We expect to be able to remove this requirement (and return to requiring only that RUN_DIR be accessible via the web under RUN_DIR_URL) in our next release.
None.
nmi_submit now respects the notify_fail_only flag in submit files.
This is a stable release of Metronome. It contains only new bug fixes or new platform support.
Release Date: 2008-04-16
MD5 checksum: 1f8c12217d9bbea3b2f8684446794c57
MD5 checksum: 152706b2b0dd603b4aa97822b6e6f5d7
None.
nmi_rm now runs properly (no longer confuses DAGMan and non-DAGMan jobs).
nmi_putattr and nmi_getattr which prevented them from running.
nmi_migrate_run no longer holds a database transaction open during migration. This allows more concurrent operations and eliminates a class of errors where the database connection would time out and the migration would have to be retried.
This is a development release of Metronome. It contains new features, and may be unstable.
Release Date: 2008-04-16
MD5 checksum: ae1672b027af1d56377894e199d17446
Metronome-2.5.0-0.noarch.rpm not created due to technical difficulties
MD5 checksum: TBD
The RPM packaging of this release has been delayed. Please contact us if this becomes a problem.
Backwards-incompatible syntax change: as a result of adding the ability to support multiple platform namespaces, we had to change the syntax of the platforms command in input specification files for the nmi input method. Instead of using a single colon to separate the source and destination platforms, users must now separate the platforms with two colons. This does not affect Metronome 2.5.0’s ability to use runs from earlier Metronome releases.
You cannot run parallel jobs with Condor 6.9.5 and this release. Condor versions 6.9.4 and earlier, and 7.0.0 and later, do not have the problematic bug. Condor versions before 6.9.5 do not have the improved parallel job exit policies, which can dramatically simplify parallel testing, so we recommend using Condor 7.0.0 or above.
Some of Metronome 2.5.0’s new features require new tables in the database. Support for ‘git’ requires a new table, and this table has been added to the schema files. Support for nmi_resubmit_run remains more experimental, and its table is defined in a new file in the distribution, “database/Metronome-2.5.0”, which also includes a table for use with nmi_update_machine_table. This schema has only been tested against MySQL (although if you are using Metronome with Postgres, please let us know).
nmi_migrate_run to better handle large run directories.
remote_*_timeouts, as well as remote_default_timeout to replace the functionality of the 2.4.x remote_task_timeout. See <taskname>_timeout.
This is a development release of Metronome. It contains new features, and may be unstable.
Release Date: 2008-07-30
MD5 checksum: 614f408ab159658f3a1784b64504895e
Metronome-2.5.1-0.noarch.rpm
MD5 checksum: TBD
We accidentally introduced a dependency on PHP 5 in this version. Version 2.5.2 has been corrected to remove this dependency.
nmi.conf) limits the length of the whole of a platform job.
nmi_submit now checks for and fails on duplicate attributes.
nmi_submit now also checks for valid time-out specifications.
$_NMI_PREREQ_*_ROOT environment variables with partial version strings. (We urge you to use this feature with restraint: use the most-specific version you readily can.) To explain by way of example, for java-1.5.0_08, Metronome would set $_NMI_PREREQ_java_1_5_0_08_ROOT, $_NMI_PREREQ_java_1_5_0_ROOT, $_NMI_PREREQ_java_1_5_ROOT, and so on, down to $_NMI_PREREQ_java_ROOT.
block_until_exists and timeout. If you define block_until_exists, then Metronome will block until the named run or runs finish, or until timeout passes.
nmi_rundir. It converts GIDs or runids given on the command line to hostname:full-path pairs.
userdir/common), it will be appended to the notification e-mail (if any) sent by Metronome.
Includes the following bug fixes from the stable series (due to be released in Metronome 2.4.5):
nmi_submit will fail with an error message about the problem. This prevents mysterious failures at the end of the run.
nmi_list_prereq no longer silently fails when passed a prereq substring as an argument.
(This release also has a functioning ‘reset view’ button, and will properly return ‘uses run id’ results; but these are not, strictly, bug fixes, as they’re a part of the improved web searches feature above.)
file_[get|put]_contents(). Fixed in release 2.5.2.
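The partial-version prereq variables introduced in 2.5.1 can be illustrated with a short sketch. It reproduces the naming pattern from the java-1.5.0_08 example in these notes; the function name is hypothetical, and the assumption that the version is split on '-', '.', and '_' is inferred from that single example rather than taken from the Metronome source.

```python
import re


def prereq_root_variables(prereq):
    """List variable names from most specific to least specific.

    Assumption: components are separated by '-', '.' or '_', and each
    shorter name drops the last remaining component.
    """
    parts = re.split(r"[-._]", prereq)
    return [
        "_NMI_PREREQ_" + "_".join(parts[:count]) + "_ROOT"
        for count in range(len(parts), 0, -1)
    ]


# Usage:
#   for name in prereq_root_variables("java-1.5.0_08"):
#       print("$" + name)
```

A build script that prefers the most specific name and falls back only when necessary follows the advice to use this feature with restraint.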
This is a development release of Metronome, and may be unstable.
Release Date: 2008-08-12
MD5 checksum: b2ef88bb0b34c2e411ab68a4f55e7aef
MD5 checksum: 781f3a3a139a54a97b61b45ffbf2984e
Bugs fixed in 2.5.2b are marked by ‘[b]’ below.
The 2.5.2 release is identical to the 2.5.1 release, except that it removes an inadvertently added dependency on PHP 5.
None.
remote_*_timeouts to be increased by a factor of sixty.
file_[get|put]_contents if not supplied by PHP.
None.
Release Date: ???
Release Date: 07/10/2006
MD5 checksum:
9c41e87e2cf64be831471ba094356d91 nmi-2.1.3.tar.gz
Release Date: ???
MD5 checksum: f000491cc79a9cf449c69916c54a9e8b nmi-2.1.4.tar.gz
foo, , bar is now treated as foo, bar instead of as an error.
Release Date: 10/02/2006
MD5 checksum: f23dbe5aeb1b036e787efbd51eb34280 nmi-2.1.7.tar.gz
This release requires a change to the database schema in order to record the hostnames on which parallel tasks execute. The upgrade can easily be made to an existing installation by issuing the following command in MySQL:
ALTER TABLE Task ADD COLUMN node_id smallint(3) unsigned default null AFTER runid;
lib/init.inc from:
require_once(LIB_PATH.'formUtil.inc');
to
// require_once(LIB_PATH.'formUtil.inc');
Release Date: 11/30/2006
MD5 checksum: cbd3f03ca5cbf6600483d51e607ad204 nmi-2.1.8.tar.gz
This release requires a new field to be added to the database schema. To modify an existing NMI installation, execute the following command from the MySQL console:
ALTER TABLE Run ADD COLUMN identity tinytext DEFAULT NULL AFTER user;
identity string will be stored in the DB for the run, but does not affect the user the job actually executes as on the computing resources.
identity attribute (see above) could be used instead of the default user attribute. This may be useful for front-end systems to the NMI framework where a single daemon user submits all jobs.
VERSION_FOOTER that allows the page rendering time and NMI framework version number to be displayed discreetly at the bottom of every page. The default is set to true for all new installations.
nmi_run_status for input values. Savannah Bug #17282
results/runDetails page of the web framework that would create a SQL query that locks up large databases.
Makefile.PL where the NMI_LIB path is not set correctly for installation directories that end with /nmi/lib.
NMI_MAIN parameter in the NMI configuration file. See this bug report for more information.
Release Date: 12/21/2006
MD5 checksum: 2d3c52b0f16e58378099d0afdf4caa49 nmi-2.2.0.tar.gz
We recommend that all sites’ NMI web status pages be configured to use a read-only user for database access. To add a new limited-privilege database user, execute the following command (substituting the DB_READER_USER, DB_READER_PASS, and HOSTNAME variables with your own values).
GRANT SELECT,CREATE TEMPORARY TABLES ON nmi_history.* \
TO 'DB_READER_USER'@'HOSTNAME' IDENTIFIED BY 'DB_READER_PASS';
GRANT SELECT,CREATE TEMPORARY TABLES ON nmi_history.* \
TO 'DB_READER_USER'@'localhost' IDENTIFIED BY 'DB_READER_PASS';
Several NMI configuration file variable names have been changed. Specifically:
database is now DB_NAME
mysqlhost is now DB_HOST
mysqlport is now DB_PORT
username is now DB_WRITER_USER
password is now DB_WRITER_PASS
nmiprefix is now PATH_NMI
condor_base is now PATH_CONDOR
globus is now PATH_GLOBUS
rundir is now RUN_DIR
The old names are deprecated but will continue to work, so no immediate change to existing config files is necessary. More information about these new configuration variables can be found here.
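Although the old names keep working, a site may prefer to move its configuration file to the new names in one pass. The sketch below is one possible way to do that and is not a Metronome tool; it assumes that nmi.conf consists of `name = value` lines, and the function names are hypothetical.

```python
import re

# Old name -> new name, as listed in the 2.2.0 release notes.
RENAMED = {
    "database": "DB_NAME",
    "mysqlhost": "DB_HOST",
    "mysqlport": "DB_PORT",
    "username": "DB_WRITER_USER",
    "password": "DB_WRITER_PASS",
    "nmiprefix": "PATH_NMI",
    "condor_base": "PATH_CONDOR",
    "globus": "PATH_GLOBUS",
    "rundir": "RUN_DIR",
}

# Assumed line shape: optional indent, a name, then "= value".
ASSIGNMENT = re.compile(r"^(\s*)([A-Za-z_]\w*)(\s*=.*)$")


def modernize_line(line):
    """Rename a deprecated parameter; leave every other line untouched."""
    match = ASSIGNMENT.match(line)
    if match and match.group(2) in RENAMED:
        return match.group(1) + RENAMED[match.group(2)] + match.group(3)
    return line


def modernize_conf(text):
    """Apply modernize_line to each line of a configuration file's text."""
    return "\n".join(modernize_line(line) for line in text.split("\n"))
```

Comments and already-renamed parameters pass through unchanged, so running the sketch twice gives the same result as running it once.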
USERDIR now contains a .nmi_failed_tasks file listing the names of any remote tasks which have failed. This can be examined by a remote_post script to control what output to return in results.tar.gz.
platform_job fails to get written to the database. It is not clear what causes the problem or how it occurs.
Release Date: 01/26/2007
MD5 checksum: ff63ee14c79d58de384fa8f3d9542c58 nmi-2.2.1.tar.gz
NOTE: Be sure to read the NMI 2.2.0 Release Notes to understand the changes made since NMI 2.1.8. Most notable are the configuration parameter changes.
This release fixes two major bugs discovered in the last month at our production facility. These bugs were somewhat related and long standing; they became more prevalent due to the addition of the exponential backoff polling in the database logfile monitor.
NMI 2.2.1 is marked as a STABLE release; all users of NMI 2.2.0 are strongly encouraged to upgrade to this latest release. The NMI team will soon be supporting concurrent stable and development releases. More information will be posted in the future.
platform_job_prescript script. This would prevent the platform_job task information from being stored in the database. This bug did not have an adverse effect on the ability of jobs to run, but produced missing status information and SQL errors in the DB update script’s error file.
URL_PREFIX parameter of the email notification script. This caused emails to have incomplete URLs to build/test information.
The NMI framework software works on top of an existing Condor pool. You must identify hosts to run the following NMI facility services:
Any/all of these services may be co-located on the same host if desired. It’s recommended to have at least 1 execute host to start.
If you are using Linux for your submit node, you must install the mysql-dev and Perl DBI modules.
perl -MCPAN -e "install DBI"
perl -MCPAN -e "install DBD::mysql"
Install the version of MySQL listed in the release notes. You will need to create a database with the name defined by DB_NAME (the default name is nmi_history) and install the default schema.
mysqladmin create nmi_history
mysql nmi_history < nmi-X.Y.Z/framework/database/schema.mysql
Now, as a privileged user, create the following accounts. The first account shown below is the DB_WRITER_USER, and it needs to be able to insert and update records in the database. The second account, DB_READER_USER, is used by the web interface and needs only read access to the database, except to update the “notes” field (which you can turn off), and to write to the sessions table (which, at least for now, you can’t).
Replace ‘%.example.com’ with the appropriate domain or host. Only hosts you specify for the DB_WRITER_USER will be able to use the command-line tools, and only hosts you specify for the DB_READER_USER will be able to run the web interface.
NOTE: Be sure to execute FLUSH PRIVILEGES; to make sure these accounts are added properly. You may also need to create an additional ‘localhost’ record for each account if the database is running on the same host as the submit node.
# DB_WRITER ACCOUNT
GRANT SELECT,INSERT,UPDATE,DELETE ON nmi_history.* \
TO 'DB_WRITER_USER'@'%.example.com' IDENTIFIED BY 'DB_WRITER_PASS';
# DB_READER ACCOUNT
GRANT SELECT,CREATE TEMPORARY TABLES ON nmi_history.* \
TO 'DB_READER_USER'@'%.example.com' IDENTIFIED BY 'DB_READER_PASS';
GRANT UPDATE (notes) ON nmi_history.Run \
TO 'DB_READER_USER'@'%.example.com' IDENTIFIED BY 'DB_READER_PASS';
GRANT SELECT,INSERT,UPDATE,DELETE ON nmi_history.sessions \
TO 'DB_READER_USER'@'%.example.com' IDENTIFIED BY 'DB_READER_PASS';
Install the NMI software under your chosen prefix:
perl Makefile.PL prefix=<prefix>
make install
If you anticipate installing multiple versions of the framework, you may wish to set the prefix to a location such as <prefix>/nmi-X.Y.Z, then create symbolic links to the installation directories:
mkdir <prefix>/nmi
cd <prefix>/nmi
ln -s <prefix>/nmi-X.Y.Z/share
ln -s <prefix>/nmi-X.Y.Z/bin
ln -s <prefix>/nmi-X.Y.Z/lib
Copy nmi-X.Y.Z/framework/nmi.conf.sample to <prefix>/etc/nmi.conf and edit as required. Please make sure that all non-trivial configuration parameters are customized for your local site (see Site Configuration Parameters for more information).
mkdir <prefix>/etc
cp nmi-X.Y.Z/framework/nmi.conf.sample <prefix>/etc/nmi.conf
edit <prefix>/etc/nmi.conf
If you intend to install future framework versions, you may want to place your nmi.conf file in a general location such as /nmi/etc and create the symlink from <prefix>/etc instead:
mkdir /nmi/etc/
cp <prefix>/etc/nmi.conf /nmi/etc/
cd <prefix>/etc
ln -s /nmi/etc/nmi.conf nmi.conf
The framework relies on Condor Hawkeye technology to ensure that jobs get matched to machines with the right platform. Put the following lines in your Condor config file on ALL of your EXECUTE hosts:
# EDIT_ME: In the next line, <modules_location> is a directory
# path where you keep your Hawkeye modules, if any. For example, it
# could be /home/condor/hawkeye_modules.
MODULES = <modules_location>
STARTD_CRON_NAME = NMIPOOL
# Uncomment the following line if NMIPOOL_JOBS has not been defined yet.
# NMIPOOL_JOBS =
# JOB: Report the list of software installed on the system.
NMIPOOL_JOBS = $(NMIPOOL_JOBS) prereq:has_:$(MODULES)/prereq:10m:kill
# EDIT_ME: In the next line, <prereqs_location> is the path to the
# directory containing individual prereq installations. For example,
# it could be /prereq.
NMIPOOL_PREREQ_PREREQDIR = <prereqs_location>
# JOB: Report the nmi_platform.
NMIPOOL_JOBS = $(NMIPOOL_JOBS) nmi_platform::$(MODULES)/nmi_platform:720m:kill
Now take the contents of the framework ‘hawkeye_modules’ directory that comes with this distribution and, on ALL of your EXECUTE hosts, copy the files to the Hawkeye modules directory named by MODULES.
Check that the module returns sensible values when run directly on your build machines. For example:
./nmi_platform
nmi_platform = "ppc_macos_10.3"
—
Now restart Condor on the execute hosts and verify that they report their NMI platform correctly to Condor Collector. You should be able to see something like:
condor_status -l | grep nmi_platform | head -5
nmi_platform = "ppc_aix_5.2"
nmi_platform = "hppa_hpux_B.10.20"
nmi_platform = "irix_6.5"
nmi_platform = "alpha_rh_7.2"
nmi_platform = "ia64_rhas_4"
Similarly, to test the prereq module, install some prereqs in your prereqs_location. For example, say you installed python-2.2.3 from source using the --prefix=/prereq/python-2.2.3 option to configure. Then you should be able to see something like:
./prereq | grep python-2.2.3
python_2_2_3 = "/prereq/python-2.2.3"
Similarly, if in the command below you replace <hostname> by the actual hostname for that machine, you should be able to see something like:
condor_status -l <hostname> | grep python
has_python_2_2_3 = "/prereq/python-2.2.3"
You can use Hawkeye to help match jobs to machines on the basis of attributes other than platform. We wrote a Hawkeye module, publish_dir.pl (see below), to simplify this task. It reads files from a specified directory, ignoring lines beginning with ‘#’ (or whitespace followed by ‘#’) and lines which don’t match the ‘attribute = value’ form. For those lines, it makes ‘attribute’, with the value ‘value’, available for matching jobs to the machine.
For example,
# ETICS includes compiler information in the platform string.
etics_platform = Darwin821_powerpc_gcc400
If you set the PLATFORM_TYPES configuration file variable to ‘etics’, users could specify ‘Darwin821_powerpc_gcc400’ as the platform for their jobs.
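The filtering rule that publish_dir.pl applies to each file can be sketched as follows. This is a Python illustration of the behaviour described above, not the Perl module itself; the function name and the exact pattern used to recognize an ‘attribute = value’ line are assumptions.

```python
import re

# Assumed shape of a publishable line: a name, an equals sign, a value.
ATTRIBUTE_LINE = re.compile(r"^\s*([A-Za-z_]\w*)\s*=\s*(.+?)\s*$")


def parse_attributes(text):
    """Collect attribute/value pairs, skipping comments and other lines."""
    attributes = {}
    for line in text.splitlines():
        if line.lstrip().startswith("#"):
            continue  # '#' comment, possibly preceded by whitespace
        match = ATTRIBUTE_LINE.match(line)
        if match:
            attributes[match.group(1)] = match.group(2)
    return attributes


# Usage:
#   parse_attributes(open("/prereq/.hawkeye/etics").read())
# where the file name "etics" is a hypothetical example.
```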
How to create these files is left for the particular situation of the reader, although it will be convenient in many cases to use Hawkeye. See the Condor manual.
publish_dir.pl
For older versions of Condor (pre-6.9.x), the following should work:
NMIPOOL_JOBS = $(NMIPOOL_JOBS) publish_dir::$(MODULES)/publish_dir.pl:1h:kill
NMIPOOL_PUBLISH_DIR_PATH = /prereq/.hawkeye
This Condor configuration snippet applies to later versions of Condor:
# Report attribute = name pairs from the named directory.
NMIPOOL_JOBLIST = $(NMIPOOL_JOBLIST) publish_dir
NMIPOOL_PUBLISH_DIR_PREFIX =
NMIPOOL_PUBLISH_DIR_EXECUTABLE = $(MODULES)/publish_dir.pl
NMIPOOL_PUBLISH_DIR_PATH = /prereq/.hawkeye
NMIPOOL_PUBLISH_DIR_PERIOD = 1h
NMIPOOL_PUBLISH_DIR_KILL = True
In both cases, publish_dir.pl updates the attribute-value pairs available for matching, as specified by the files in /prereq/.hawkeye, once an hour.
This feature of Metronome builds on the Condor Parallel Universe and provides for running jobs on multiple machines simultaneously. Condor's Chirp mechanism makes communication between the machines possible.
Each Metronome pool where you would like to run parallel jobs requires 1 DedicatedScheduler machine. The DedicatedScheduler is usually a submit node, as the Parallel Universe is based on the condor_schedd daemon. Each execute machine must then know about the submitter. There are several ways to configure your pool to run parallel jobs.
#LOCAL_CONFIG_FILE = $(LOCAL_DIR)/$(HOSTNAME).local
PARALLEL = $(LOCAL_DIR)/condor_config.parallel
LOCAL = $(LOCAL_DIR)/$(HOSTNAME).local
REQUIRE_LOCAL_CONFIG_FILE = True
LOCAL_CONFIG_FILE = $(PARALLEL), $(LOCAL)
######################################################################
# MPI Settings
######################################################################
## If you want to "lie" to Condor about how many CPUs your machine
## has, you can use this setting to override Condor's automatic
## computation. If you modify this, you must restart the startd for
## the change to take effect (a simple condor_reconfig will not do).
## Please read the section on "condor_startd Configuration File
## Macros" in the Condor Administrators Manual for a further
## discussion of this setting. Its use is not recommended. This
## must be an integer ("N" isn't a valid setting, that's just used to
## represent the default).
NUM_CPUS = 3
## The number of evenly-divided virtual machines you want Condor to
## report to your pool (if less than the total number of CPUs). This
## setting is only considered if the "type" settings described above
## are not in use. By default, all CPUs are reported. This setting
## must be an integer ("N" isn't a valid setting, that's just used to
## represent the default).
NUM_VIRTUAL_MACHINES = $(NUM_CPUS)
## What is the name of the dedicated scheduler for this resource?
DedicatedScheduler = "DedicatedScheduler@submit-host.your.org"
## Path to the special version of rsh that's required to spawn MPI
## jobs under Condor. WARNING: This is not a replacement for rsh,
## and does NOT work for interactive use. Do not use it directly!
MPI_CONDOR_RSH_PATH = $(LIBEXEC)
## This setting puts the DedicatedScheduler attribute, defined above,
## into your machine's classad. This way, the dedicated scheduler
## (and you) can identify which machines are configured as dedicated
## resources.
SLOT1_STARTD_EXPRS = $(STARTD_EXPRS)
SLOT2_STARTD_EXPRS = $(STARTD_EXPRS)
SLOT3_STARTD_EXPRS = DedicatedScheduler
## required so the start expr won't eval to false and prevent jobs from ever running.
IsOwner = False
## Be cautious that you don't override this START expression in other condor_config.* files.
START = ( (VirtualMachineID == 1) || \
(VirtualMachineID == 2) || \
(VirtualMachineID == 3) && $(SLOT3_TYPE) )
## slot3 runs parallel / MPI jobs
SLOT3_TYPE = (Scheduler =?= $(DedicatedScheduler))
TBD.
Matchmaking is bilateral — that is, both jobs and hosts can express their own requirements of one another, and can advertise their own attributes for one another’s reference.
A platform_job’s Requirements expression specifies its requirements of a host to run on. In Metronome, you can easily add a new constraint to the platform_job’s Requirements expression via the append_requirements command in the run specification file, like so:
append_requirements = (host_attribute =?= "bar")
This tells Condor to make sure that the job only runs on a host whose host_attribute equals “bar”.
Likewise, a host’s START expression specifies its requirements of a job. To add a new constraint to the host’s START expression, you should edit the host’s condor_config (or condor_config.local) file, like so:
START = ( $(START) && job_attribute =?= "foo" )
This tells Condor to make sure that the host only receives jobs whose job_attribute equals “foo”.
To advertise a new job attribute (so you can reference it in a host’s START expression), just add it via the ++ command in the run specification file, like so:
++job_attribute = "foo"
To advertise a new host attribute (so you can reference it in a job’s Requirements expression), just add it to the host’s condor_config (or condor_config.local) file, like so:
host_attribute = "bar"
STARTD_EXPRS = $(STARTD_EXPRS) host_attribute
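The bilateral nature of matchmaking can be pictured with a small model in which each side is a dictionary of advertised attributes and each side's expression is a predicate over the other side's dictionary. This is only a sketch of the idea, not how Condor evaluates ClassAds; the function names are hypothetical, and the attribute names are the ones used in the examples above.

```python
def job_requirements(host_ad):
    """Job side: append_requirements = (host_attribute =?= "bar")."""
    return host_ad.get("host_attribute") == "bar"


def host_start(job_ad):
    """Host side: START = ( $(START) && job_attribute =?= "foo" )."""
    return job_ad.get("job_attribute") == "foo"


def is_match(job_ad, host_ad):
    """A match needs both sides to accept the other."""
    return job_requirements(host_ad) and host_start(job_ad)


# Usage:
#   is_match({"job_attribute": "foo"}, {"host_attribute": "bar"})
```

If either side fails to advertise the attribute the other side asks for, no match is made, which is why both the `++` line and the STARTD_EXPRS line are needed.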
=?= and How Is It Different From ==?
The short answer: Use =?=, not ==.
The long answer: In the boolean logic of Condor classads, expressions can evaluate to True, False, or Undefined. If an expression references an attribute which isn’t defined, the value of that expression becomes Undefined. Therefore, if you say:
START = (job_color == "Red")
...and job_color is not defined in the job ad, then START will evaluate to Undefined rather than False.
To avoid this, Condor classads provide a =?= variant of the equality operator which will evaluate to False if one half is Undefined, rather than evaluating to Undefined. So if you say:
START = (job_color =?= "Red")
...and job_color is not defined in the job ad, then START will evaluate to False rather than Undefined.
Likewise for =!= and !=.
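The difference between the two operators can be modelled in a few lines. The sketch below covers only the handling of Undefined described above and is not Condor's implementation; the names are hypothetical, and the choice to treat two Undefined operands as identical is an assumption that goes beyond the one-sided case discussed in the text.

```python
class _Undefined:
    """Marker for an attribute that is missing from a classad."""

    def __repr__(self):
        return "Undefined"


UNDEFINED = _Undefined()


def lookup(ad, name):
    """Fetch an attribute, yielding UNDEFINED when it is not present."""
    return ad.get(name, UNDEFINED)


def eq(left, right):
    """Model of ==: the result is Undefined if either operand is."""
    if left is UNDEFINED or right is UNDEFINED:
        return UNDEFINED
    return left == right


def is_identical(left, right):
    """Model of =?=: the result is always True or False."""
    if left is UNDEFINED or right is UNDEFINED:
        return left is right  # assumption: Undefined matches only Undefined
    return left == right


def is_not_identical(left, right):
    """Model of =!=: the negation of =?=."""
    return not is_identical(left, right)
```

With an empty job ad, `eq(lookup({}, "job_color"), "Red")` yields Undefined, while `is_identical(lookup({}, "job_color"), "Red")` yields False, which is the behaviour wanted in a START expression.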
Although it’s not needed for correctness, for debugging you might also want to add the following to the host’s condor_config:
STARTD_JOB_ATTRS = $(STARTD_JOB_ATTRS) job_attribute
This will tell the host to publish the job_attribute attribute of any currently running job in its own host classad, so you can see it. This can make it easier to confirm that the job that is currently running has the attribute you expect, without having to look up the jobid and examine its classad separately using condor_q -l. The only thing to be careful about is any attribute names which are present in both the host and job classads, because only one can be published in the machine classad. This is one good reason to name job attributes and host attributes with job_ or host_ prefixes.
This feature in Metronome allows build and test submissions to automatically migrate to different pools based on resource availability. If a local user submits a job that requests a particular platform that does not exist in the local pool, it can automatically be routed to a different site that does have a computing resource with that platform.
The following information and pages are a guide for system administrators and pool managers to allow their local Metronome installation to establish routes with other sites.
There are two major components of the job migration feature:
This is performed by the nmi_resource_advertiser tool. It transmits information about your local Metronome resources to remote collectors, such as the number of available machines in the pool and which types of platforms are available. Other sites that you have a pair-wise agreement with will in turn broadcast their pool information back to your local collector.
With this information, the resource advertiser can construct a table of routes to the remote sites on which local jobs can execute in order to get the resources they need.
To enable this feature, there are several steps and configuration changes that you will need to make to your local Metronome installation. Each pool that you wish to route jobs to will also need to make the same changes.
There are two parameters that need to be added to the end of your nmi.conf file on each of your pool’s submit nodes:
## ---------------------------------------------
## Job Routing
## ---------------------------------------------
ROUTING_TABLE = /path/to/condor/local/dir/condor_config.routing_table
REMOTE_SITES = /path/to/nmi/etc/remote_sites
The routing table is populated by the resource advertiser with information about how to migrate jobs to remote sites (see this page for more information). It is important that the ROUTING_TABLE entry be writable by the nmi_resource_advertiser script and readable by the Condor daemons. The remote_sites entry should be placed in the same directory as the submit node’s nmi.conf file.
The REMOTE_SITES file is a list of hostnames that your local pool is allowed to route jobs to. Each line in the file should contain the hostname of a remote submit node and, optionally, the hostname of the collector to which the resource information for that submit node should be broadcast. If no collector host is provided, the resource advertiser will send all information to the collector listening on the submit node at the standard port.
Please note that the entries in this file must match exactly with the hostname on each machine. The resource advertiser uses this list to look for remote resource information in the local collector and also to send the local resource information to each of these sites.
Sample list:
# Comments are allowed
#
remote.site1.com
remote.site2.com collector.site2.com
remote.site3.com collector.site3.com:1234
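The file format described above is simple enough to check by hand, but a small parser can help when validating a long list. The following is a minimal sketch, not the parser used by nmi_resource_advertiser; the function and type names are hypothetical, and it assumes that comments run to the end of the line and that the "standard port" is the Collector port 9618 mentioned later in this section.

```python
from typing import List, NamedTuple

# Assumption: the "standard port" is the Condor Collector port 9618.
DEFAULT_COLLECTOR_PORT = 9618


class RemoteSite(NamedTuple):
    submit_host: str
    collector_host: str
    collector_port: int


def parse_remote_sites(text: str) -> List[RemoteSite]:
    """Parse the contents of a REMOTE_SITES file (hypothetical helper)."""
    sites = []
    for raw in text.splitlines():
        # Assumption: everything after '#' is a comment.
        line = raw.split("#", 1)[0].strip()
        if not line:
            continue
        fields = line.split()
        submit_host = fields[0]
        # With no collector given, fall back to the submit node itself.
        collector = fields[1] if len(fields) > 1 else submit_host
        host, _, port = collector.partition(":")
        sites.append(
            RemoteSite(submit_host, host, int(port) if port else DEFAULT_COLLECTOR_PORT)
        )
    return sites
```

Applied to the sample list, this yields three entries: the first uses its own submit node as collector, and the third uses port 1234.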
The nmi_resource_advertiser program is used to broadcast information about the local NMI resources to sites listed in the REMOTE_SITES file and write the routes to remote sites to the ROUTING_TABLE used by the job router.
You will need to add the following command to the crontab of the user that Condor runs as (on most systems this is usually condor or daemon). The command will execute every five minutes. This is just a default interval that should suit most configurations. It should be less than the ClassAd lifetime of information stored by the Collector (CLASSAD_LIFETIME).
*/5 * * * * /path/to/bin/nmi_resource_advertiser --broadcast --routing-table --nmiconf=/path/to/nmi/etc/nmi.conf
If all of your submit and worker nodes are located either on the public network or behind a firewall, there are no configuration changes needed for Metronome. If, however, worker nodes are installed at remote sites that have firewalls, then you will need to configure both Condor and the firewalls to allow the appropriate network traffic. Note that the following changes must be made in your site’s Condor configuration file and not in nmi.conf.
Please refer to the Networking Section of the Condor Manual for the most up-to-date information about what ports are needed by Condor.
Condor can be configured to open sockets within a range of values. Please refer to the information below on the number of ports that need to be opened for your installation. The following example will cause Condor to open ports only in the range 20000 to 25000:
LOWPORT = 20000
HIGHPORT = 25000
The range of ports assigned may be restricted based on incoming (listening) and outgoing (connect) ports with the configuration variables IN_HIGHPORT, IN_LOWPORT, OUT_HIGHPORT, and OUT_LOWPORT.
The most important port to open is 9618; this is used as the listening port of the Condor Collector for receiving resource information from worker nodes.
There are several configuration possibilities for including remote resources. For example, you may just allow users to add their resources to your Metronome site’s pool, or you may allow them to submit jobs from the outside into your pool.
The worker nodes that are outside of the
Description needed here…
Number of ports needed:
5 + (5 * number of virtual machines advertised by that machine)
Description needed here…
Number of ports needed:
5 + (5 * MAX_JOBS_RUNNING)
Description needed here…
Number of ports needed:
5 + NEGOTIATOR_SOCKET_CACHE_SIZE
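The three formulas above can be evaluated directly to check whether a LOWPORT/HIGHPORT range is wide enough. The sketch below is illustrative only: the function names are hypothetical, the input values are made-up examples, and it treats each formula in isolation (a host running several Condor daemons would need the sum).

```python
def ports_for_virtual_machines(virtual_machines: int) -> int:
    """5 + (5 * number of virtual machines advertised by the machine)."""
    return 5 + 5 * virtual_machines


def ports_for_running_jobs(max_jobs_running: int) -> int:
    """5 + (5 * MAX_JOBS_RUNNING)."""
    return 5 + 5 * max_jobs_running


def ports_for_negotiator(socket_cache_size: int) -> int:
    """5 + NEGOTIATOR_SOCKET_CACHE_SIZE."""
    return 5 + socket_cache_size


def range_is_large_enough(lowport: int, highport: int, needed: int) -> bool:
    """True if the inclusive range [lowport, highport] holds `needed` ports."""
    return highport - lowport + 1 >= needed
```

For example, a machine advertising 2 virtual machines needs 15 ports, and a node with MAX_JOBS_RUNNING set to 200 needs 1005, which fits comfortably in the 20000 to 25000 range shown earlier.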
The Metronome configuration on each submit/archive node is determined by the settings in the nmi.conf configuration file. This page is a list of all the available parameters.
Any text after a hash character (#) is considered a comment. The syntax for each line in the configuration file is:
# Comments are ignored
<PARAMETER> = <VALUE> # inline comment
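To make the syntax rules concrete, here is a minimal sketch of a reader for this format. It is not the framework's own parser; the function name is hypothetical, and it assumes exactly what is stated above: text after a hash character is a comment, and each remaining line has the form name = value. One consequence of that rule is that a value containing a hash character would be cut short.

```python
from typing import Dict


def parse_nmi_conf(text: str) -> Dict[str, str]:
    """Parse nmi.conf-style text into a dict (hypothetical helper)."""
    settings: Dict[str, str] = {}
    for raw in text.splitlines():
        # Drop full-line and inline comments.
        line = raw.split("#", 1)[0].strip()
        if not line:
            continue
        if "=" not in line:
            raise ValueError(f"not a 'name = value' line: {raw!r}")
        name, value = line.split("=", 1)
        settings[name.strip()] = value.strip()
    return settings
```

Values are returned as strings; converting numeric parameters such as DB_PORT is left to the caller.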
NOTE
Starting in Metronome 2.2.0, the following parameters are deprecated. The default configuration file now uses UPPERCASE naming, although the lower case versions will still be supported. Use the appropriate substitution when updating your configuration file:
database -> DB_NAME
mysqlhost -> DB_HOST
mysqlport -> DB_PORT
username -> DB_USER
password -> DB_PASS
nmiprefix -> PATH_NMI
condor_base -> PATH_CONDOR
globus -> PATH_GLOBUS
rundir -> RUN_DIR
This is the email address to which the NMI framework sends debug and error information. For example, if a component of the framework crashes while executing a job, the debug information will be sent to ADMIN_EMAIL.
ADMIN_EMAIL = nmi-admin@example.com
Defines the path to the root directory of your submit node’s Condor installation.
condor_base = /path/to/condor
NOTE: As of NMI 2.2.0, this parameter has been deprecated. Use PATH_CONDOR instead.
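The deprecated-name substitutions listed in the note at the top of this parameter list can be applied mechanically when updating an old configuration. The sketch below works on a dict of already-parsed settings; the function name is hypothetical, and the choice to let an explicitly set new-style name win over its deprecated equivalent is an assumption, not documented framework behavior.

```python
from typing import Dict

# Old name -> new name, as listed in the deprecation note.
DEPRECATED_NAMES = {
    "database": "DB_NAME",
    "mysqlhost": "DB_HOST",
    "mysqlport": "DB_PORT",
    "username": "DB_USER",
    "password": "DB_PASS",
    "nmiprefix": "PATH_NMI",
    "condor_base": "PATH_CONDOR",
    "globus": "PATH_GLOBUS",
    "rundir": "RUN_DIR",
}


def modernize(settings: Dict[str, str]) -> Dict[str, str]:
    """Rename deprecated parameters to their current names."""
    updated: Dict[str, str] = {}
    for name, value in settings.items():
        new_name = DEPRECATED_NAMES.get(name, name)
        # Assumption: if both spellings are present, keep the new one.
        if new_name != name and new_name in settings:
            continue
        updated[new_name] = value
    return updated
```

Parameters that are not in the table, such as ADMIN_EMAIL, pass through unchanged.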
This parameter defines the location of the submit node’s main Condor configuration file. It is needed so that the framework can access Condor utilities for submitting, managing, and removing builds and tests.
CONDOR_CONFIG = /path/to/condor/etc/condor_config
To be written
database = history
NOTE: As of NMI 2.2.0, this parameter has been deprecated. Use DB_NAME instead.
This parameter is used to define the host name of the database server used by the NMI framework. It can be a fully-qualified host name or simply “localhost”. If your database server listens on a non-standard port, please use the DB_PORT parameter to define the connection port.
Prior to release 2.2.0, this parameter was named mysqlhost.
DB_HOST = database.example.com
Defines the database the framework will use to read and write build and test information. Be sure that the DB_WRITER_USER and DB_READER_USER have appropriate access to this database.
Prior to release 2.2.0, this parameter was named database.
DB_NAME = nmi_history
If your database server listens on a non-standard port, you must provide this port number in your NMI configuration file so that the core framework code and the default web interface can gain access. The value should be an integer without a preceding colon.
DB_PORT = 1234
This parameter defines the password used by DB_READER_USER to access the framework database. Please refer to the framework installation instructions on how to create users with minimal privileges.
Note that blank passwords are now allowed.
DB_READER_PASS = some_pass
Along with DB_READER_PASS, this parameter defines the user name of a database user that has read-only access to the database information. This account is used by the NMI framework web interface and other utilities that only require read-only access to the database. Please refer to the framework installation instructions on how to create users with minimal privileges.
DB_READER_USER = some_user
This parameter is used to define what database system is used at your site. It allows the framework and web interface to use the proper database connector utility to access the build and test information.
NOTE: The only supported database at this time is MySQL. Please contact the NMI development team if you would like to deploy the framework using a different database server.
DB_TYPE = mysql
To be written
DB_WRITER_PASS = some_password
Along with DB_WRITER_PASS, this parameter defines the user name of a user that has access to the NMI database. Please refer to the framework installation instructions on how to create users with increased privileges.
It is advised that you do not use the same account for DB_WRITER_USER as DB_READER_USER.
DB_WRITER_USER = some_user
Fully-qualified domain name of the central repository where the build and test artifacts and data are stored. For most installations, this value should be the same as THIS_HOST.
DEFAULT_INPUT_HOST = site.example.com
This parameter defines the free disk space threshold, in MB. The disk cleaner will run when free disk space falls below this value.
The default value is 400000 MB (about 390 GB).
disk_thresh = 400000
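The threshold comparison can be sketched as follows. This is an illustration of the rule stated above, not the disk cleaner's actual code; the function names are hypothetical, and it assumes 1 MB = 1024 * 1024 bytes, which is consistent with 400000 MB being roughly 390 GB.

```python
import shutil


def free_megabytes(path: str) -> float:
    """Free space on the filesystem holding `path`, in MB (1 MB = 1024**2 bytes)."""
    return shutil.disk_usage(path).free / (1024 * 1024)


def cleaner_should_run(free_mb: float, disk_thresh_mb: float = 400000) -> bool:
    """True when free space is below the disk_thresh value."""
    return free_mb < disk_thresh_mb
```

For example, `cleaner_should_run(free_megabytes("/path/to/nmi/rundir"))` would report whether the run directory's filesystem is under the default threshold.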
This parameter became available in Metronome 2.2.4.
By default, Metronome will try to fetch an input three times before giving up. This parameter allows you to change that to another integer for the inputs in the same submit file.
The machine default may be changed in nmi.conf using the parameter FETCH_RETRY_COUNT.
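The retry behavior described above amounts to a bounded loop. The following sketch shows the idea with a hypothetical helper; the `fetch` callable stands in for whatever actually retrieves an input, and the interpretation that the count is the total number of attempts (three tries, then give up) is taken from the wording above.

```python
from typing import Callable, Optional, TypeVar

T = TypeVar("T")


def fetch_with_retries(fetch: Callable[[], T], retry_count: int = 3) -> T:
    """Call `fetch` up to `retry_count` times, returning the first success."""
    last_error: Optional[Exception] = None
    for _ in range(retry_count):
        try:
            return fetch()
        except Exception as err:  # a real fetcher would catch narrower errors
            last_error = err
    raise RuntimeError(f"giving up after {retry_count} attempts") from last_error
```

A fetch that fails twice and then succeeds returns normally on the third call; one that always fails raises after the configured number of attempts.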
To be written.
NOTE: As of NMI 2.2.0, this parameter has been deprecated. Use PATH_GLOBUS instead.
Web server for the main entry point to the web view of the builds. This is only necessary if there is more than one submit node in your pool.
main_webserver = main-webserver.example.com
Specifies the default maximum number of seconds that a run may remain in the queue without ever being successfully matched with a resource, before it will be automatically removed. Users may redefine this value on a per-run basis via the max_match_wait attribute in their run specification file.
Defaults to six days if left undefined.
When set to true, the monitor will make backup copies of monitor.out and monitor.err in its run directory. Without this flag, Condor will overwrite the files’ contents when the monitor is restarted. This can happen if the jobs are put on hold or Condor is restarted at the submission point. The default is set to false. There are no negative implications to not having this set to true — it is mostly used for debugging.
MONITOR_BACKUP_LOGS = 1
To be written.
mysqlhost = database.example.com
NOTE: As of NMI 2.2.0, this parameter has been deprecated. Use DB_HOST instead.
If your database server listens on a non-standard port, you must provide this port number in your NMI configuration file so that the core framework code and the default web interface can gain access. The value should be an integer without a preceding colon.
mysqlport = 1234
NOTE: As of NMI 2.2.0, this parameter has been deprecated. Use DB_PORT instead.
To be written
nmiprefix = /path/to/nmi
NOTE: As of NMI 2.2.0, this parameter has been deprecated. Use PATH_NMI instead.
To be written
password = some_password
NOTE: As of NMI 2.2.0, this parameter has been deprecated. Use DB_PASS instead.
To be written.
path = rundir
Defines the path to the root directory of the submit node’s Condor installation. This directory should contain the bin, sbin, and lib subdirectories for Condor. It does not need a trailing slash.
PATH_CONDOR = /path/to/condor
Path to the local Globus installation on the submit node.
PATH_GLOBUS = /path/to/globus
To be written
PATH_NMI = /path/to/nmi
This configuration option became available in Metronome 2.5.1.
Sets an absolute upper limit on the length of time a platform job may run (including Metronome-internal stages). The timeout is parsed in the same way as the remote_*_timeout.
First available in Metronome 2.2.8.
PLATFORM_TYPE over-rides the hard-coded default of “nmi” for the platform type, and is in turn over-ridden by the configuration file option platform_type.
This parameter is for use in Metronome installations which use job migration. By default, Metronome uses the ‘nmi’ platform-naming scheme, which the developers found appropriate for their work. Other installations may want to name their platforms differently. Prior to Metronome 2.2.8, this would require changing the value of ‘nmi_platform’ as reported by the Hawkeye script we supply. With platform types, you can advertise more than one platform name for each machine, for instance, ‘nmi_platform’ and ‘etics_platform’. If your users prefer ETICS-style naming, you can set PLATFORM_TYPE to ‘etics’, and they won’t have to set platform_types in all of their submit files to use their preferred names.
When set to true, this parameter will cause the logfile monitor to back off exponentially when polling a submission’s files. During a polling cycle, if no new information was added to the logfiles, the monitor will sleep twice as long as it did in the previous cycle. The maximum sleep time can be controlled with the POLLING_MAX variable.
The default is to disable exponential backoff.
POLLING_BACKOFF = 1
This variable allows you to control how many seconds the logfile monitor will sleep after reading the contents of a submission’s logfiles. If set to zero, the framework will continuously poll the logfiles. See this page for more information on how to enable the exponential backoff feature when polling files.
The default value for this parameter is 1 second.
POLLING_INTERVAL = 1 # seconds
Sets a limit on how long the monitor will be allowed to sleep when using the exponential backoff option. A low value will cause the logfile monitors to read the files more often to determine whether new information has been posted from the execution nodes; a high value may cause the framework to update more slowly when a change occurs.
The default is 128 seconds.
POLLING_MAX = 128
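The interaction of POLLING_BACKOFF, POLLING_INTERVAL, and POLLING_MAX can be sketched as a single function that computes the next sleep time. This is an illustration, not the monitor's implementation; the function name is hypothetical, and resetting the sleep to POLLING_INTERVAL once new log data appears is an assumption, since the text above only specifies the doubling behavior.

```python
def next_sleep(
    previous_sleep: int,
    saw_new_data: bool,
    polling_interval: int = 1,
    polling_max: int = 128,
    backoff: bool = True,
) -> int:
    """Return how many seconds to sleep before the next polling cycle."""
    if not backoff or saw_new_data:
        # Assumption: fresh log data resets the sleep to the base interval.
        return polling_interval
    # No new data: sleep twice as long as last time, capped at POLLING_MAX.
    return min(previous_sleep * 2, polling_max)
```

Starting from the default interval of 1 second, a run of idle cycles sleeps 2, 4, 8, 16, 32, 64, and then 128 seconds, where it stays until new data arrives. Note that with this sketch an interval of zero never grows, since doubling zero yields zero.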
This parameter defines the default protocol for fetching builds submitted by and stored on this machine.
Possible options include:
protocol = http
To be written
To be written
To be written
rundir = /path/to/nmi/rundir
NOTE: As of NMI 2.2.0, this parameter has been deprecated. Use RUN_DIR instead.
To be written
RUN_DIR = /path/to/nmi/rundir
To be written…
RUN_DIR_URL =
Fully-qualified domain name of the submit host.
THIS_HOST = submit-node.example.com
This parameter defines the relative path where the web pages are located.
For example, if the framework webpages are installed in /data/www/htdocs/nmi, where /data/www/htdocs/ is the root directory of your webserver, url_prefix should be set to ‘nmi’.
url_prefix = nmi
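The relationship between the install directory, the web server's root directory, and url_prefix is a relative-path computation. The following sketch uses a hypothetical helper to derive the value from the two directories in the example above.

```python
import posixpath


def url_prefix_for(install_dir: str, document_root: str) -> str:
    """Return install_dir relative to document_root (hypothetical helper)."""
    rel = posixpath.relpath(install_dir, document_root)
    if rel == ".." or rel.startswith("../"):
        raise ValueError("install_dir is not located under document_root")
    # Pages installed directly in the document root have an empty prefix.
    return "" if rel == "." else rel
```

With the directories from the example, `url_prefix_for("/data/www/htdocs/nmi", "/data/www/htdocs/")` gives the value `nmi`.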
To be written
username = some_user
NOTE: As of NMI 2.2.0, this parameter has been deprecated. Use DB_USER instead.
This parameter became available in Metronome 2.5.0.
Administrators may also wish to limit the load on their submit nodes by limiting how many jobs run there at once. Condor calls these jobs “scheduler universe”, and it should be possible to use load measurements in the scheduler universe’s start expression. Doing so, however, is outside the scope of this manual.
Administrators may also wish to throttle submit host load on a per-user basis (so that heavy users don’t starve others), but Condor does not yet support this.
See the Condor manual for more information.
Added in Metronome 2.2.8.
If this parameter is set to true, Metronome will set job leases for its platform_jobs, which makes them tolerant of service interruptions on the submit hosts. Because Metronome requires streaming output, this option cannot be used prior to Condor 6.9.4.
To be written
webserver = server.example.com
The Metronome usage report measures the heterogeneity and frequency of builds and tests submitted by NMI Lab users, and also summarizes them by project. It is generated by the nmi_usage_stats.pl script.
The report measures the number of platform_job tasks, rather than all tasks or the number of runs, because the platform_job count offers the best measure of usage. A report of runs would over-report users who submit multi-platform builds as multiple, separate runs vs. users who submit them together in a single run, even though both users performed identical operations. Likewise, a report of all tasks would over-report users who break their builds into many more granular tasks vs. users who perform identical builds monolithically in a single task.
The number of platform_job tasks, in contrast, measures the number of builds or tests a user submitted to individual platforms, regardless of whether those builds or tests were submitted as a single multi-platform run or multiple single-platform runs, and regardless of whether the builds or tests were defined as a single task, or were broken up into many smaller tasks to be individually recorded in the DB.
NOTE: this report does not measure the actual resource consumption of the user. It reports equally a five-minute or ten-hour test. For a measure of resource consumption (as opposed to heterogeneity and frequency of builds and tests), see the Condor usage report.
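The difference between counting runs and counting platform_job tasks can be seen with a small made-up data set. The sketch below is not how nmi_usage_stats.pl works and does not reflect the real database schema; the record layout, the user names, and the "fetch" task name are all hypothetical. One user submits a single three-platform run, the other submits three single-platform runs.

```python
from collections import Counter

# Hypothetical task records: (user, run_id, task_name).
TASKS = [
    ("alice", 1, "platform_job"),
    ("alice", 1, "platform_job"),
    ("alice", 1, "platform_job"),
    ("alice", 1, "fetch"),
    ("bob", 2, "platform_job"),
    ("bob", 3, "platform_job"),
    ("bob", 4, "platform_job"),
    ("bob", 2, "fetch"),
    ("bob", 3, "fetch"),
    ("bob", 4, "fetch"),
]


def platform_jobs_per_user(tasks):
    """Count only platform_job tasks, the measure the report uses."""
    return Counter(user for user, _run, name in tasks if name == "platform_job")


def runs_per_user(tasks):
    """Count distinct runs, shown here only for comparison."""
    runs = {(user, run) for user, run, _name in tasks}
    return Counter(user for user, _run in runs)
```

Both users performed the same work and both get a platform_job count of 3, whereas a count of runs would report 1 for the first user and 3 for the second.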
To be written…
This page provides information on how to customize the web interface for your organization. All the changes shown below should be made in the web interface configuration file (etc/config.inc).
To change the text used in the web browser title, as well as on the sidebar column, modify the SITE_TITLE configuration parameter.
// -------------------------------------------------------
// SITE TITLE
// This is how the site will be branded
// -------------------------------------------------------
define('SITE_TITLE', 'New Site Title');
By default, the web interface displays the NSF and NMI logos in the sidebar column. This can easily be changed to either replace them or add logos for your organization. There are three base parameters that control these logos: SITE_LOGO, SITE_LOGO_LBL, and SITE_LOGO_URL. You can remove the logos from the page by removing these parameters from etc/config.inc.
It is easy to change not only the logo image, but also its url and label on your site. For example, the logo image is defined with the SITE_LOGO parameter, and the corresponding alternative text is defined with SITE_LOGO_LBL. If you want the logo to be a link to some site, you may also define SITE_LOGO_URL as an address that the image will link to.
define('SITE_LOGO', 'http://www.example.com/images/logo.gif');
define('SITE_LOGO_URL', 'http://www.example.com');
define('SITE_LOGO_LBL', 'Example Site');
To display multiple logos, add a unique numerical suffix, starting at one, to the end of each set of parameters. As shown in the example below, to display two logos one needs to define SITE_LOGO1 and SITE_LOGO2, along with the corresponding label and url parameters with the same numerical suffix.
define('SITE_LOGO1', 'http://www.example1.com/images/logo1.gif');
define('SITE_LOGO_URL1', 'http://www.example1.com');
define('SITE_LOGO_LBL1', 'Example Site #1');
define('SITE_LOGO2', 'http://www.example2.com/images/logo2.gif');
define('SITE_LOGO_URL2', 'http://www.example2.com');
define('SITE_LOGO_LBL2', 'Example Site #2');
1) Expose the contents of this directory (‘web’) so it appears under Apache’s DocumentRoot. You can copy or move it under the DocumentRoot or just create a symlink, for example:
# ln -s /web /nmiweb
Here is the top-level of where you unpacked the NMI
framework tarball.
2) Make a copy of index.php.sample named index.php. Please make sure that it has the proper permissions.
% cd /web
% cp index.php.sample index.php
% chmod 0644 index.php
define('BASE_PATH', '/path/to/this/file/');
define('NMI_CONF', '/path/to/nmi.conf');
define('BASE_PATH', '/home/pavlo/public/html/www/');
define('NMI_CONF', '/nmi/etc/nmi.conf');
3) Copy the sample website configuration file to make a new ‘etc/config.inc’ and tweak the settings for your site.
% cd etc/
% cp config.inc.sample config.inc
% chmod 0664 config.inc
4) Make the interface aware of where to look for the .out and .err files for the builds. For example, if your nmi.conf file has the settings:
rundir = /path/to/builds
path = foo
then create a symlink to expose the files:
# ln -s /path/to/builds /var/www/html/foo
5) Point your browser to the address of index.php. Check your Apache error/access logs if the page does not load correctly. You can also turn on debugging output by uncommenting the following line in index.php:
//error_reporting(E_ALL);
error_reporting(E_ERROR | E_PARSE);
error_reporting(E_ALL);
//error_reporting(E_ERROR | E_PARS