This is a collection of “how-to” recipes intended for Condor administrators. Until we organize things better, please ignore the surrounding frames that refer to the NMI Build and Test Lab. If you would like to contribute to this how-to collection, please register for an account below and contact condor-admin@cs.wisc.edu if you need more than the default level of access.
Known to work with Condor version: 7.0
To have some jobs preferentially run on some machines instead of others, see How to steer jobs towards more desirable machines. To require that specific jobs run on certain machines and not on others, read on.
A user can specify within the job's requirements in the submit description file that the job must run on a specific machine.
requirements = Machine == "example1.com"
As an alternative, the administrator may add requirements in the configuration file. These requirements are automatically inserted for every job. For example, suppose that jobs from the special user "appinstaller" are to always run on a specific and reserved machine. The administrator may set that up with the following configuration on the submit machine:
APPEND_REQUIREMENTS = MY.Owner != "appinstaller" || TARGET.IsAppInstallerMachine =?= True
Then, in the local configuration file of the specific and reserved machine, add the following:
IsAppInstallerMachine = True
STARTD_ATTRS = $(STARTD_ATTRS) IsAppInstallerMachine
With this configuration, user "appinstaller" jobs will only run on the correct machine(s). If only jobs from "appinstaller" are to run on the specific machine, then the administrator also modifies the configuration such that this machine only starts jobs from the desired owner:
START = ($(START)) && TARGET.Owner == "appinstaller"
Instead of altering the configuration to modify APPEND_REQUIREMENTS for each and every job's requirements (harmlessly in this example), the administrator may modify the user's login environment. Condor configuration is also read from the environment by tools such as condor_submit. User "appinstaller" changes the environment with:
export _CONDOR_APPEND_REQUIREMENTS=\
'MY.Owner != "appinstaller" || TARGET.IsAppInstallerMachine =?= True'
Note that this overrides any use of APPEND_REQUIREMENTS within the configuration file.
If the administrator does not control the configuration of the submit machine, none of the above options can work. In that case, do things the ugly way: modify all potential execute machines to reject jobs from user "appinstaller", except for the machines that should accept the jobs.
Known to work with Condor version: 7.0
If you simply need to insert an environment variable, you can do that by putting the following in your configuration file on the execute machine:
STARTER_JOB_ENVIRONMENT = "APP1_PATH=/opt/app1 APP2_PATH=/opt/app2"
If you need to dynamically modify an environment variable, you can do that in a wrapper script around the job. Here is what you would put in the configuration file:
USER_JOB_WRAPPER=/path/to/condor_job_wrapper
Then you would create a file named condor_job_wrapper (or whatever name you choose), make it executable, and put in something like the following:
#!/bin/sh
# insert /s/std/bin into the PATH
export PATH=/s/std/bin:$PATH
exec "$@"
Another way for information from the machine to enter the job environment is to publish the information in the machine ClassAd and leave it up to the user to insert it into the job's environment via the $$() mechanism, which substitutes in values from the target ClassAd. Example machine configuration:
APP1_PATH = "/opt/app1"
APP2_PATH = "/opt/app2"
STARTD_ATTRS = $(STARTD_ATTRS) APP1_PATH APP2_PATH
Then the user can insert this information into the job environment by putting the following in the job submit file:
environment = "APP1_PATH=$$(APP1_PATH) APP2_PATH=$$(APP2_PATH)"
If the job may run on machines where these attributes of the machine ClassAd are not defined, a default value should be specified. For example, if the value should just be empty when undefined, use this:
APP1_PATH=$$(APP1_PATH:)
Known to work with Condor version: 7.0
The simplest way to achieve this is to set NUM_CPUS=1 so that each machine advertises a single slot. However, this prevents you from supporting a mix of single-cpu and whole-machine jobs. The following example supports both, with one limitation: the Condor accountant does not charge the whole-machine user for claiming all of the slots; it only charges the user for claiming one slot.
First, you would have whole-machine jobs advertise themselves as such with something like the following in the submit file:
+RequiresWholeMachine = True
Then put the following in your Condor configuration file. Make sure it either comes after the other attributes that this appends to (such as START) or that you merge the definitions together.
#require that whole-machine jobs only match to Slot1
START = ($(START)) && (TARGET.RequiresWholeMachine =!= TRUE || SlotID == 1)
# have the machine advertise when it is running a whole-machine job
STARTD_JOB_EXPRS = $(STARTD_JOB_EXPRS) RequiresWholeMachine
# require that no single-cpu jobs may start when a whole-machine job is running
START = ($(START)) && (SlotID == 1 || Slot1_RequiresWholeMachine =!= True)
# suspend existing single-cpu jobs when there is a whole-machine job
SUSPEND = ($(SUSPEND)) || (SlotID != 1 && Slot1_RequiresWholeMachine =?= True)
Instead of suspending the single-cpu jobs while the whole-machine job runs, you could suspend the whole-machine job while the single-cpu jobs finish. Example:
# advertise the activity of each slot into the ads of the other slots,
# so the SUSPEND expression can see it
STARTD_SLOT_EXPRS = $(STARTD_SLOT_EXPRS) Activity
# Suspend the whole-machine job until the other slots are empty
SUSPEND = ($(SUSPEND)) || (SlotID == 1 && Slot1_RequiresWholeMachine =?= True && \
(Slot2_Activity =?= "Busy" || Slot3_Activity =?= "Busy" || ... ) )
You might want to steer whole-machine jobs towards machines that are completely vacant, especially on the slots only for single-cpu jobs.
Here's a simple example that just avoids machines with a high load:
NEGOTIATOR_PRE_JOB_RANK = -TARGET.LoadAvg*(MY.RequiresWholeMachine =?= True)
A more complicated expression would look at the attributes of the other slots when forming the rank:
STARTD_SLOT_EXPRS = $(STARTD_SLOT_EXPRS) Activity
NEGOTIATOR_PRE_JOB_RANK = (MY.RequiresWholeMachine =?= True) * \
( (Slot2_Activity =!= "Busy") + (Slot3_Activity =!= "Busy") + ... )
Known to work with Condor version: 7.0
Suppose jobs are mysteriously failing on a particular machine and you suspect some kind of hardware problem (e.g. memory corruption). You can turn Condor off until the problem is solved, but just so no mistakes are made, it may also be a good idea to block the machine from joining the Condor pool, in case Condor gets restarted prematurely.
Here is what you should put in the condor configuration visible to the collector:
# 2008-04-27: badmachine has a hardware problem, so I am temporarily blacklisting it
HOSTDENY_WRITE = $(HOSTDENY_WRITE) badmachine.domain.name
Then issue condor_reconfig -full. The existing machine ad will take some time to expire (~15 minutes), so restart the collector if you are impatient.
Known to work with Condor version: 7.0
Suppose a user is submitting jobs that are causing problems (like crashing machines) and you need to temporarily block the user until you can talk with them and solve the problem.
Add the following to the submit machine configuration:
DENY_WRITE = dan@hep.wisc.edu/*
Then run condor_reconfig -full <submit machine name>. You might need to run that from the central manager if only the central manager is allowed to run administrative commands in your Condor pool. To verify that this configuration setting was successfully processed, query the schedd:
condor_config_val -schedd DENY_WRITE
If you do not control the submit machine, you could block the user from advertising to the pool collector. Use condor_status -submitters to see the value of the submitter ad's Name attribute for the user to be banned. (You can use the -long option if the name is truncated in the output.) Then enter it into the following configuration settings and run condor_reconfig. It will take ~15 minutes for the existing submitter ad to expire, so restart the collector if you are in a hurry.
IsBannedSubmitter = MyType == "Submitter" && Name == "dan@hep.wisc.edu"
COLLECTOR_REQUIREMENTS = ($(IsBannedSubmitter)) == False
If you just want to ban a user from running jobs on some but not all machines, you could do that in the execute machine configuration:
IsBannedUser = User =?= "dan@hep.wisc.edu"
START = ($(START)) && ( ($(IsBannedUser)) == FALSE )
Run condor_reconfig -all after you have made that change (or just reconfig the execute machines that need it).
Condor's standard universe provides automatic checkpointing so that jobs can resume from where they left off when interrupted. However, not all jobs can be linked with the standard universe libraries. One very nice solution to checkpointing vanilla universe jobs is described by the University of Cambridge eScience Centre here.
There is also always the least fancy checkpoint solution: have the job save its own state so that it can restart from where it left off. This requires extra work on the part of the application developer, but it is sometimes quite a bit more efficient than an automatic checkpoint solution that saves the entire contents of the job's memory. If you recommend this option to a user, keep in mind the following: if the job is using Condor's file transfer mode, and there is any chance of the job being preempted, then the user should set the following in the submit file:
when_to_transfer_output = ON_EXIT_OR_EVICT
For this to work, the job should intercept Condor's soft-kill signal (SIGTERM), save its state, and exit before Condor's KILL expression gives up and hard-kills the job (default 10 minutes). If the job consists of more than one process (e.g. a shell script that runs some other program), then the kill signal must be intercepted by the parent process, which should do whatever is appropriate, such as sending the kill signal to its child process, so that the child knows to save state and shut down.
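As a rough illustration of the self-checkpointing approach described above, here is a minimal Python sketch of a job that catches SIGTERM, saves its state, and resumes from that state on the next run. The file name checkpoint.json, the function names, and the "work" performed are all hypothetical, and the sketch assumes the checkpoint file lives in the job's scratch directory so that file transfer can return it when the job is evicted.

```python
import json
import os
import signal

STATE_FILE = "checkpoint.json"  # hypothetical name, relative to the job's working directory

stop_requested = False


def request_stop(signum, frame):
    """Signal handler: only set a flag; the main loop does the saving."""
    global stop_requested
    stop_requested = True


def load_state(path=STATE_FILE):
    """Resume from a previous checkpoint if one exists, else start fresh."""
    if os.path.exists(path):
        with open(path) as f:
            return json.load(f)
    return {"next_step": 0, "total": 0}


def save_state(state, path=STATE_FILE):
    """Write to a temporary file and rename it, so a kill in the middle
    of the write cannot leave a corrupt checkpoint behind."""
    tmp = path + ".tmp"
    with open(tmp, "w") as f:
        json.dump(state, f)
    os.replace(tmp, path)


def run(num_steps, path=STATE_FILE):
    """Work one step at a time; stop early if SIGTERM arrived, then save state."""
    state = load_state(path)
    while state["next_step"] < num_steps and not stop_requested:
        state["total"] += state["next_step"]  # stand-in for real work
        state["next_step"] += 1
    save_state(state, path)
    return state


def main():
    """Entry point for the job: install the handler, then do the work."""
    signal.signal(signal.SIGTERM, request_stop)
    return run(num_steps=1000000)
```

To use this as a job, call main() at the bottom of the script. Each step should be short compared to the time Condor allows before the hard kill, since the flag is only checked between steps.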
Known to work with Condor version: 7.0
To see the current priorities and priority factors for users, use the following command:
condor_userprio -all -allusers
To configure things so that user A can use 2x the number of machines as user B based on their user priorities, set user B's priority factor to be 4 times that of user A. Example:
condor_userprio -setfactor B 4.0
condor_userprio -setfactor A 1.0
Why should the priority factor be 4 times instead of 2 times? Recall that effective priority is the real priority times the priority factor. A user's fair share of the pool is inversely proportional to their effective priority. A user's real priority over time tends towards the number of machines they have been using recently. Working backwards... At steady state, A's real user priority will be 2x that of user B, because A will be using 2x the number of machines on average. But we want B's effective priority to be 2x user A's, because that is what will allow A to be claiming 2x the number of machines as B. This gives the following equations:
PrioFactorB * RealPriorityB = 2 * PrioFactorA * RealPriorityA
RealPriorityA/RealPriorityB = 2.0
Therefore ...
PrioFactorB/PrioFactorA = 4.0
Note that if your pool uses startd RANK to give some users/jobs high priority on some machines, adjusting the user priority will only have an effect when two users with equal startd RANK are competing. This is because a match with a higher startd RANK will preempt a match with a lower startd RANK, even if the latter has a higher user priority.
This section explained how to manually adjust individual user priority factors. If you want to automatically adjust priority factors for many users, you may want to take a look at How to set user priority factors automatically by domain or other username pattern.
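The steady-state argument for the priority factors generalizes: if A should use k times as many machines as B, then the two equations in this section give PrioFactorB/PrioFactorA = k squared. The following Python sketch checks the 2x case numerically. The function names are made up for illustration, and the model is only the simplified steady-state one described in this section, not Condor's actual accounting code.

```python
def factor_ratio_for_share(share_ratio):
    """Return PrioFactorB / PrioFactorA so that A uses share_ratio times
    as many machines as B, under the steady-state model:

      RealPriorityA / RealPriorityB = share_ratio
      PrioFactorB * RealPriorityB   = share_ratio * PrioFactorA * RealPriorityA

    Solving for the factor ratio gives share_ratio squared.
    """
    return share_ratio ** 2


def effective_priority(factor, real_priority):
    """Effective priority is the real priority times the priority factor."""
    return factor * real_priority


# Worked example: A should use 2x the machines of B.
factor_a = 1.0
factor_b = factor_a * factor_ratio_for_share(2.0)

# At steady state the real priorities track recent usage (arbitrary units).
real_a, real_b = 2.0, 1.0

ratio = effective_priority(factor_b, real_b) / effective_priority(factor_a, real_a)
print(factor_b)  # 4.0
print(ratio)     # 2.0: B's effective priority is twice A's, as required
```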
Known to work in Condor version: 7.0.1
This is a technique for increasing the scalability of the Condor collector. This has been found to help scale up glidein pools using GSI authentication in order to scale beyond ~5000 slots to ~12000 slots. Other strong authentication methods are similarly CPU intensive, so they should also benefit from this technique. The reason why this is particularly relevant to glidein pools is that these pools typically have shorter lived startds than dedicated pools, so new security sessions need to be established more often. In the case mentioned where we needed to configure a multi-tier collector to scale beyond 5k slots, the glideins were restarting (unsynchronized) with an average of about 3 hour lifespans.
The basic idea is to have multiple collectors that each individually serve a portion of the pool. The machine ClassAds that are sent to these sub collectors are forwarded to one main collector (the central manager). The main collector is used for matchmaking purposes. All of these collectors could exist on the same machine, in which case you would want to make sure there are multiple CPUs/cores, or they could be located on separate machines.
Assuming you are running the collectors on the same machine, you will need to assign a different network port to each of them. The main collector can use the standard port, to keep things simpler. Here is how you could configure it to create 3 "sub" collectors on ports 10002-10004 (arbitrarily chosen) and to have them forward ClassAds to the main collector.
# define sub collectors
COLLECTOR2 = $(COLLECTOR)
COLLECTOR3 = $(COLLECTOR)
COLLECTOR4 = $(COLLECTOR)
# specify the ports for the sub collectors
COLLECTOR2_ARGS = -f -p 10002
COLLECTOR3_ARGS = -f -p 10003
COLLECTOR4_ARGS = -f -p 10004
# specify the logs for the sub collectors
COLLECTOR2_ENVIRONMENT = "_CONDOR_COLLECTOR_LOG=$(LOG)/Collector2Log"
COLLECTOR3_ENVIRONMENT = "_CONDOR_COLLECTOR_LOG=$(LOG)/Collector3Log"
COLLECTOR4_ENVIRONMENT = "_CONDOR_COLLECTOR_LOG=$(LOG)/Collector4Log"
# add sub collectors to the list of things to start
DAEMON_LIST = $(DAEMON_LIST) COLLECTOR2 COLLECTOR3 COLLECTOR4
# forward ads to the main collector
# (this is ignored by the main collector, since the address matches itself)
CONDOR_VIEW_HOST = $(COLLECTOR_HOST)
Then you would configure a fraction of your pool (execute machines) to use one of the sub collectors. It is tempting to use something like the following:
COLLECTOR_HOST=$RANDOM_CHOICE(collector.hostname:10002,collector.hostname:10003,collector.hostname:10004)
I haven't tested that, so I am not 100% sure there are no problems with doing that. Your schedds and negotiator should be configured with COLLECTOR_HOST equal to the main collector.
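If you need more than a handful of sub collectors, writing the repetitive settings by hand gets tedious. The following Python sketch generates the same configuration lines as the example above for any number of sub collectors. The function name and its parameters are hypothetical; the output simply follows the naming pattern used in this section (COLLECTOR2, COLLECTOR3, ... on consecutive ports).

```python
def sub_collector_config(first_port, count, first_index=2):
    """Return Condor configuration lines for `count` sub collectors.

    Sub collectors are named COLLECTOR<n>, starting at first_index, and
    listen on consecutive ports starting at first_port.
    """
    names = ["COLLECTOR%d" % (first_index + i) for i in range(count)]
    lines = ["# define sub collectors"]
    lines += ["%s = $(COLLECTOR)" % name for name in names]
    lines.append("# specify the ports for the sub collectors")
    lines += ["%s_ARGS = -f -p %d" % (name, first_port + i)
              for i, name in enumerate(names)]
    lines.append("# specify the logs for the sub collectors")
    lines += ['%s_ENVIRONMENT = "_CONDOR_COLLECTOR_LOG=$(LOG)/%sLog"'
              % (name, name.capitalize()) for name in names]
    lines.append("# add sub collectors to the list of things to start")
    lines.append("DAEMON_LIST = $(DAEMON_LIST) " + " ".join(names))
    lines.append("# forward ads to the main collector")
    lines.append("CONDOR_VIEW_HOST = $(COLLECTOR_HOST)")
    return lines


# Reproduce the three sub collectors on ports 10002-10004 from the example.
print("\n".join(sub_collector_config(first_port=10002, count=3)))
```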
Known to work with Condor version: 7.0
By default, Condor provides fair sharing between individual users by keeping track of usage and adjusting their relative user priorities in the pool. Frequently, however, there is a need to allocate resources at a higher level. Suppose you have a single Condor pool shared by several groups of users. Your goal is to configure Condor so that each group gets its fair share of the computing resources and within each group, each user gets a fair share relative to other members of the group. What is "fair" depends on circumstances, such as whether some groups own a larger share of the machines than others and whether some groups have a different pattern of usage such as occasional bursts of computation versus steady demand.
The following recipes can be adapted to a variety of such situations.
You can give a group of users higher priority on a specific set of machines by using the startd RANK expression. Example:
# This machine belongs to the "biology" group.
MachineOwner = "biology"
# Give high priority on this machine to the group that owns it.
STARTD_ATTRS = $(STARTD_ATTRS) MachineOwner
Rank = TARGET.Group =?= MY.MachineOwner
The above example requires that jobs be submitted with an additional custom attribute "Group" that declares what group they belong to. See How to insert custom ClassAd attributes into a job ad for different ways of doing that. One way is simply for the user to do it explicitly in the submit file:
+Group = "biology"
If group membership is not likely to change frequently, or you really don't want users to have to declare their group membership, you could configure the startd RANK expression to look at the built-in TARGET.User attribute rather than relying on the custom attribute TARGET.Group. Example:
MachineOwners = "user1@biology.wisc.edu user2@biology.wisc.edu"
STARTD_ATTRS = $(STARTD_ATTRS) MachineOwners
RANK = stringListMember(TARGET.User,MY.MachineOwners)
Since RANK is an arbitrary ClassAd expression, you can customize the policy in a number of ways. For example, there could be a group with second priority on the machines. Or you could specify that some types of jobs (identified by some other ClassAd attribute in the job) have higher priorities than others. You just need to write the expression so that it produces a higher number for higher priority jobs.
The down side of RANK is that it involves preemption. RANK only comes into play when there is an existing job on a machine and the negotiator is considering whether a new job should preempt it. You can control how quickly the preemption happens in order for the new job to replace the lower priority job using MaxJobRetirementTime as described in How to disable preemption. By default, the preemption will happen immediately. This is most appropriate in Condor pools where groups own specific machines and want guaranteed access to them whenever they need them.
Given a choice of two machines to run a job on, it is a good idea to steer jobs towards machines that rank them higher so they stand less of a chance of being preempted in the future. Here is an example configuration that preferentially runs jobs where they are most highly ranked and secondarily prefers to run jobs on idle machines rather than claimed machines:
NEGOTIATOR_PRE_JOB_RANK = 10 * (MY.RANK) + 1 * (RemoteOwner =?= UNDEFINED)
Note that preemption by startd RANK trumps considerations of user priority in the pool. For example, if user A with a high (bad) user priority is competing with another user B with a low (good) user priority, user B, with the better user priority, will be able to claim more idle machines, but user A can then preempt user B if the startd RANK is higher. The relative user priorities therefore only matter when the startd RANKs are equal or when vying for unclaimed machines.
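To see how the weights in the NEGOTIATOR_PRE_JOB_RANK example order candidate machines, here is a small Python model of the expression. It is only an illustration: the machine names and owners are invented, the startd RANK is assumed to be 0 or 1 (as with the boolean RANK expressions shown earlier), and an unclaimed machine is modelled as having remote_owner set to None.

```python
def pre_job_rank(startd_rank, remote_owner):
    """Model of 10 * (MY.RANK) + 1 * (RemoteOwner =?= UNDEFINED).

    startd_rank  -- value of the machine's RANK expression for this job
    remote_owner -- current claimant, or None if the machine is unclaimed
    """
    return 10 * startd_rank + 1 * (remote_owner is None)


# Hypothetical candidate machines for one job.
machines = [
    {"name": "a", "rank": 0, "remote_owner": None},
    {"name": "b", "rank": 1, "remote_owner": "someone@example.com"},
    {"name": "c", "rank": 1, "remote_owner": None},
    {"name": "d", "rank": 0, "remote_owner": "someone@example.com"},
]

# Higher values are preferred.
ordered = sorted(
    machines,
    key=lambda m: pre_job_rank(m["rank"], m["remote_owner"]),
    reverse=True,
)
print([m["name"] for m in ordered])  # ['c', 'b', 'a', 'd']
```

Because the idle-machine term is at most 1, a weight of 10 on the rank term is enough to make startd RANK dominate as long as distinct RANK values differ by at least 1. If your RANK expression produces fractional values, the weights would need adjusting.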
Whereas startd RANK can be used to give groups preemptive priority on specific machines, Condor's group accounting system can be used to give groups a share of a pool that is not tied to specific machines. The Condor manual section on Group Accounting covers this topic.
Basically, there are two options: adjust the priority factors of accounting groups, or assign each group a quota of slots.
Quick examples of these two options are provided below.
Users must simply declare their membership in a group. This is done in the submit file:
+AccountingGroup = "group_physics"
The administrator can then adjust the priority factor for the various groups using condor_userprio. The following example sets the biology group's priority factor to be twice that of physics, which means the physics group should get twice as big a share of the pool (because user priorities are inversely proportional to the share of the pool).
condor_userprio -setfactor group_physics@wisc.edu 2.0
condor_userprio -setfactor group_biology@wisc.edu 4.0
In the negotiator configuration:
# define the groups in your pool
GROUP_NAMES = group_physics, group_biology, group_chemistry
# specify what share of the pool (# of batch slots) each group should have
GROUP_QUOTA_group_physics = 100
GROUP_QUOTA_group_biology = 75
GROUP_QUOTA_group_chemistry = 80
# specify that once a group has used its full quota, users from within
# the group can still vie for additional resources as individuals
GROUP_AUTOREGROUP = True
The submit file of the job needs to specify which accounting group and user within the group it belongs to:
+AccountingGroup = "group_physics.newton"
Known to work in Condor version: 7.0
This is based on Igor Sfiligoi's talk at Condor Week 2008.
The basic idea is to add an extra slot for each normal job execution slot. This extra slot will run commands from the user whose job is running on the corresponding execution slot. Typical commands would be things like 'ls' or 'top'.
The following configuration assumes a single cpu machine. It should be easy to extend it to multiple cpus.
# Enable multiple slots
NUM_CPUS = 2
SLOT_TYPE_1 = cpus=1, memory=1%, swap=1%, disk=1%
NUM_SLOTS_TYPE_1 = 1
SLOT_TYPE_2 = cpus=1, memory=99%, swap=99%, disk=99%
NUM_SLOTS_TYPE_2 = 1
# Enable cross_slot information flow
STARTD_SLOT_EXPRS = $(STARTD_SLOT_EXPRS) State, RemoteUser
# Config one slot for monitoring and one for jobs
SLOT1_SLOT2_MATCH = (slot2_State=?="Claimed")&&(slot2_RemoteUser=?=User)
SLOT1_START = $(SLOT1_SLOT2_MATCH) && (JOB_Is_Monitor)
SLOT2_START = <your old START condition>
START = ((SlotID == 1) && ($(SLOT1_START))) || \
((SlotID == 2) && ($(SLOT2_START)))
SLOT1_IsMonitorSlot = True
SLOT2_MyMonitorSlot = "slot1"
HAS_MONITOR_SLOT = True
STARTD_ATTRS = $(STARTD_ATTRS) IsMonitorSlot MyMonitorSlot HAS_MONITOR_SLOT
You could then use Igor's example job monitoring tool extracted from glideinWMS to submit jobs to the monitoring slot. The basic idea is to figure out where the job is running (by using condor_q) and then submit a job with the following in its submit file:
+JOB_Is_Monitor=True
Requirements=(Name=?="slot1@<node running job>")
As Igor notes, the drawbacks of this method are:
One (untested) idea for how to speed things up a bit and avoid doubling the visible slots in the pool is to have the startds advertise to two central managers, one for normal jobs and one for monitoring jobs. (You can do that by simply listing both collectors in COLLECTOR_HOST.) You can use COLLECTOR_REQUIREMENTS to prune out all but the normal job execution slots in one collector and all but job monitoring slots in the other collector. A different idea is to have the job itself start up its own startd which joins a monitoring pool (private to the user?) and accepts monitoring jobs.
That said, the basic idea works just fine, so don't be afraid to use it as is.
Known to work with Condor version: 7.0
By default, Condor adjusts user priorities over time based on how many slots the user claims. A user who has recently used a lot of the pool will therefore tend to have a worse priority than one who has used it less.
If you wish to disable Condor's automatic adjustment of user priorities, you can effectively do so with the following configuration setting:
PRIORITY_HALFLIFE = 1
Then a user's priority will be effectively constant. You can adjust the relative priority of users, as usual:
condor_userprio -setfactor <user1> 10
condor_userprio -setfactor <user2> 20
In the example above, user1 should be able to claim twice as many machines as user2 (because user priorities are inversely proportional to share of the pool).
There is a good section in the Condor Manual describing this topic here. Also see How to suspend jobs instead of killing them.
Here is an example drawn from real life that is slightly more complicated than what is in the manual section on preemption policy.
MaxJobRetirementTime = (IsDesktop =!= True && User =!= "glidein@hep.wisc.edu") * \
( (OSG_VO =?= "uscms") * 3600*24 + \
(User == "osg_cmsprod@hep.wisc.edu") * 3600*24*3 )
# In case of graceful restart when condor is being upgraded, wait for as
# long as the largest possible value of MaxJobRetirementTime before switching
# to a fast restart (which results in hard-kill signals).
# 3600 * 24 * 4
SHUTDOWN_GRACEFUL_TIMEOUT = 345600
This example assumes that desktop machines advertise a ClassAd attribute IsDesktop=True. On such machines, when jobs need to be preempted (e.g. by the machine user), they are preempted immediately. Jobs belonging to a glidein user are also preempted immediately, because unless they are sent a kill signal, they will always run indefinitely, not just until the current job finishes. Jobs belonging to the USCMS virtual organization are given 1 day to finish (from the time the job started) before preemption results in kill signals being sent. Jobs belonging to the osg_cmsprod user are given 3 days to finish (plus 1 day because they also belong to the "USCMS" virtual organization).
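The arithmetic in the MaxJobRetirementTime expression can be checked with a small Python model. This is a simplified sketch, not a ClassAd evaluator: it ignores UNDEFINED semantics and the case-insensitive string comparison of the == operator, and the function name and the example user alice@hep.wisc.edu are invented for illustration.

```python
DAY = 3600 * 24  # seconds in one day


def max_job_retirement_time(is_desktop, user, osg_vo):
    """Simplified model of the MaxJobRetirementTime expression above.

    Returns the retirement time in seconds for a job with the given
    attributes (osg_vo may be None when the job does not define OSG_VO).
    """
    eligible = (is_desktop is not True) and (user != "glidein@hep.wisc.edu")
    return int(eligible) * (
        int(osg_vo == "uscms") * DAY
        + int(user == "osg_cmsprod@hep.wisc.edu") * 3 * DAY
    )


# Desktop machines and glidein jobs get no retirement time.
print(max_job_retirement_time(True, "alice@hep.wisc.edu", "uscms"))     # 0
print(max_job_retirement_time(False, "glidein@hep.wisc.edu", "uscms"))  # 0
# An ordinary USCMS job gets 1 day.
print(max_job_retirement_time(False, "alice@hep.wisc.edu", "uscms"))    # 86400
# osg_cmsprod in the USCMS VO gets 3 + 1 days, the largest possible value,
# which is why SHUTDOWN_GRACEFUL_TIMEOUT is set to 345600.
print(max_job_retirement_time(False, "osg_cmsprod@hep.wisc.edu", "uscms"))  # 345600
```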
Known to work with Condor version: 7.0
You may already be familiar with "flocking," which allows a submit machine to send jobs to multiple Condor pools. It is also possible to have execute machines belong to multiple Condor pools. One reason to do this would be to create a super-pool that contains all of the execute nodes from several existing Condor pools. Some motivation for this follows.
Suppose several departments each have their own Condor pool, but there is a desire to share resources across departments and with other users on the campus. One perfectly good solution is to use traditional flocking to send jobs to multiple pools. Each condor_schedd daemon needs to have the pools added to its configuration's FLOCK_TO list in order to use all of the resources. Whenever a new pool is added to the federation, the FLOCK_TO lists must be updated. If users want to see the status of the resources or the usage statistics, they must know to query the individual pools in the FLOCK_TO list. If campus-wide fair-sharing is desired (except on machines that you own, of course), this becomes awkward, because each user has a separate user priority and accumulated usage within each pool. Another small annoyance is that the job's rank expression is only evaluated within individual pools, not between resources from multiple pools. Similarly, if jobs must wait for other jobs to finish (for example, because of a long MaxJobRetirementTime), it can easily happen that a job gets matched to preempt some other job on a busy machine in one pool and then has to wait for the job to retire; in the meantime, a machine may be sitting ready and idle in some other pool.
Another option is to make one pool of all the machines: replace the individual central managers with a single central manager. There will be one collector and one negotiator, so there is only one place to query pool status, one place to send jobs to, and one global matchmaker that has access to information about all of the machines. On the down side, this could result in lower quality of service to users in the existing pools, because there is only one negotiator serving everybody. It may be slower (especially if somebody puts something in their job requirements that causes an inefficient auto-clustering of jobs). Also, if this single, large pool is not well managed, downtime would prevent users from being able to access their own resources. Each department probably also wishes to retain high priority on its own machines. The startd RANK expression is a good way to do this, but it has the disadvantage of operating via preemption, which is a second-round scheduling mechanism, rather than via user priority, which is a first-round mechanism. This means (as of Condor 7.0), that they may sometimes find that the negotiator first hands out their machines to someone else and then in the next negotiation cycle, their job gets matched to the machine and the lower ranked job gets preempted.
All these difficulties motivate the focus of this HOWTO section, which is to combine the two approaches of flocking with having one large pool. The central managers of the existing pools are left in place, but one new central manager is created which all the execute machines also report to. This provides usage accounting across all of the resources together, and it serves as a convenient top-level pool to submit jobs to when users want to access all possible resources. Users in the departments with their own existing Condor pool might prefer to have their own pool remain the default pool for their job submission, but the super-pool could be added to their FLOCK_TO list. This way, they get the quality of service they were already enjoying from their own central manager, but excess jobs may be conveniently sent to all of the other resources by flocking to one place. The problem of guaranteeing high priority to department users on their own machines can also be addressed by treating matches made by the department negotiator as higher priority than those made by the super pool. Since the department negotiator has its own independent notion of user priorities, it can rely on the better user priorities of department members to guarantee them first priority on their own machines, rather than (or in addition to) relying on startd RANK to do this. This avoids the slight inefficiency of department members losing in the first round of negotiation to outsiders who happen to have better user priority but lower startd RANK.
Here is an example of the super-pool configuration for the central manager.
# Insert NegotiatorMatchExprNegotiatorName="SuperPool" into matches
# that this negotiator makes. This is used by the startd to give
# the local negotiator priority over the super negotiator.
NegotiatorName = "SuperPool"
NEGOTIATOR_MATCH_EXPRS = NegotiatorName
# Configure authorization settings to permit startds in sub-pools to join
# the super-pool and to allow submission of jobs from all appropriate
# places.
HOSTALLOW_READ = ?
HOSTALLOW_WRITE = ?
Example configuration of a sub-pool:
# Insert NegotiatorMatchExprNegotiatorName="<LocalPoolName>" into matches
# that this negotiator makes. This is used by the startd to give
# the local negotiator priority over the super negotiator.
NegotiatorName = "InsertLocalPoolNameHere"
NEGOTIATOR_MATCH_EXPRS = NegotiatorName
# For advertising to super-pool
SUPER_COLLECTOR=<insert super-collector here>
LOCAL_COLLECTOR=<insert local-collector here e.g. $(CONDOR_HOST)>
# the local negotiator should only ever report to the local collector
NEGOTIATOR.COLLECTOR_HOST=$(LOCAL_COLLECTOR)
# startds should report to both collectors
STARTD.COLLECTOR_HOST=$(LOCAL_COLLECTOR),$(SUPER_COLLECTOR)
# trust both negotiators
HOSTALLOW_NEGOTIATOR=$(COLLECTOR_HOST)
# Ensure external users get big priority factor.
# If you don't have a uniform uid domain for all local users, then
# you will need to have some external process that updates priority
# factors using condor_userprio.
ACCOUNTANT_LOCAL_DOMAIN = $(UID_DOMAIN)
# Flocking to super-pool
FLOCK_TO=$(SUPER_COLLECTOR)
# Advertise in the machine ad the name of the negotiator that made the match
# for the job that is currently running. We need this in SUPER_START.
CurJobPool = "$$(NegotiatorMatchExprNegotiatorName)"
SUBMIT_EXPRS = $(SUBMIT_EXPRS) CurJobPool
STARTD_JOB_EXPRS = $(STARTD_JOB_EXPRS) CurJobPool
# We do not want the super-negotiator to preempt local-negotiator matches.
# Therefore, only match jobs if:
# 1. the new match is from the local pool
# OR 2. the existing match is not from the local pool
SUPER_START = NegotiatorMatchExprNegotiatorName =?= $(NegotiatorName) || \
MY.CurJobPool =!= $(NegotiatorName)
START = ($(START)) && ($(SUPER_START))
Known to work with Condor version: 7.0
In the following examples, an attribute named Group is added with the value "Physics". You can use whatever attribute name you want, but avoid attribute names that conflict with attributes used by Condor. See the manual or run condor_q -long on a job to see what attributes are there.
+Group = "Physics"
Group = "Physics"
SUBMIT_EXPRS = $(SUBMIT_EXPRS) Group
Insert desired value into the user's environment (e.g. in a shell setup script or whatever). Example:
export _CONDOR_GROUP='"Physics"'
Then add it in the condor configuration file:
SUBMIT_EXPRS = $(SUBMIT_EXPRS) Group
Suppose the name of the attribute in the machine ad is X. Then put the following in the condor configuration (on the submit node).
MachineX = "$$([X])"
SUBMIT_EXPRS = $(SUBMIT_EXPRS) MachineX
Actually, X can be any ClassAd expression; it is not limited to an attribute name. The value that is inserted into the job ClassAd, however, is always stored as a string.
In the job history file, the attribute of the most recent machine on which the job ran will be stored with the name MATCH_EXP_MachineX.
Known to work with Condor version: 7.0
In the following examples, a custom ClassAd attribute named MachineOwner is created with the value "chemistry". You can use whatever attribute name you want, but avoid conflicting with attribute names used by Condor. See the manual or run condor_status -long on a machine to see the attributes that are there.
MachineOwner = "chemistry"
STARTD_ATTRS = $(STARTD_ATTRS) MachineOwner
If you want different values for different slots within the same machine, do this:
SLOT1_MachineOwner = "chemistry"
SLOT2_MachineOwner = "physics"
STARTD_ATTRS = $(STARTD_ATTRS) MachineOwner
You can insert dynamic attributes that are periodically updated from a script. Here's an example of what you put in the condor configuration file to periodically call a script:
STARTD_CRON_NAME = CRON
CRON_JOBLIST =
CRON_JOBLIST = $(CRON_JOBLIST) kernel
CRON_kernel_PREFIX = kernel_
CRON_kernel_EXECUTABLE = /path/to/kernel
CRON_kernel_PERIOD = 1h
CRON_kernel_MODE = periodic
CRON_kernel_RECONFIG = false
CRON_kernel_KILL = true
CRON_kernel_ARGS =
The script named 'kernel' could add some attributes that give information about the system kernel. For example, it could output something like the following:
version = "2.6.9-55.0.12.ELsmp"
bigmem = FALSE
hugemem = FALSE
Given the configuration above, this would result in ClassAd attributes being added to the machine ClassAd with the following names: kernel_version, kernel_bigmem, and kernel_hugemem.
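As a minimal sketch of what the 'kernel' script itself might look like, assuming a Linux host where the kernel flavor can be guessed from the release string reported by uname -r. The function name kernel_attrs and the substring tests for "bigmem" and "hugemem" are illustrative assumptions, not something Condor provides:

```sh
#!/bin/sh
# Hypothetical 'kernel' script for the CRON_kernel_EXECUTABLE setting.
# It prints "name = value" lines; the startd adds the configured prefix.

# Print the three attributes for a given kernel release string.
# The substring matching below is an assumption about how kernel
# flavors are named on the host.
kernel_attrs() {
    release=$1
    bigmem=FALSE
    hugemem=FALSE
    case "$release" in
        *hugemem*) hugemem=TRUE ;;
        *bigmem*)  bigmem=TRUE ;;
    esac
    printf 'version = "%s"\n' "$release"
    printf 'bigmem = %s\n' "$bigmem"
    printf 'hugemem = %s\n' "$hugemem"
}

# Report on the kernel that is currently running.
kernel_attrs "$(uname -r)"
```

Note that string values must be quoted in the output, while boolean values are not.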
Known to work with Condor version: 7.0
Condor monitors how much disk space jobs consume in the scratch directory created for the job on the execute machine when the job runs. This scratch directory is typically only used by jobs which turn on Condor's file transfer mode (should_transfer_files=true). For such jobs, the scratch directory is the current working directory and they might write their output files into that directory while they are running.
One problem that can happen is that one job on a multi-cpu system uses up so much space that all other jobs fail due to lack of space. If the partition containing Condor's EXECUTE directory is shared by other tasks (including perhaps Condor), a full partition could cause additional things to fail as well.
The following configuration settings should be put in the config file of the execute machines (or the whole pool).
DISK_EXCEEDED = DiskUsage > Disk
PREEMPT = ($(PREEMPT)) || ($(DISK_EXCEEDED))
WANT_SUSPEND = ($(WANT_SUSPEND)) && ($(DISK_EXCEEDED)) =!= TRUE
The most effective way to control how much space jobs use is to put the execute directory for each slot on its own disk partition. Then you don't have to worry about a malformed job consuming massive amounts of disk space before PREEMPT has a chance to operate. Assuming you have already created the necessary partitions, you can configure Condor to use them like this:
SLOT1_EXECUTE = /path/to/execute1
SLOT2_EXECUTE = /path/to/execute2
...
Jobs can hold or remove themselves by specifying a periodic_hold or periodic_remove expression. The schedd can also hold or remove jobs as dictated by the configuration expressions SYSTEM_PERIODIC_HOLD or SYSTEM_PERIODIC_REMOVE. These are all submit-side controls, whereas the PREEMPT example above is an execute-side control. One problem with the PREEMPT example is that it doesn't do a very good job of communicating to the job owner why the job was evicted. Putting the job on hold may help communicate better. Then the user knows to resubmit the job with larger disk requirements or investigate why the job used more disk than it should have. The following example configuration shows how to put jobs on hold from the submit-side when they use too much disk.
# When a job matches, insert the machine disk space into the
# job ClassAd so SYSTEM_PERIODIC_HOLD can refer to it.
MachineDiskString = "$$(Disk)"
SUBMIT_EXPRS = $(SUBMIT_EXPRS) MachineDiskString
SYSTEM_PERIODIC_HOLD = MATCH_EXP_MachineDiskString =!= UNDEFINED && \
  DiskUsage > int(MATCH_EXP_MachineDiskString)
Known to work with Condor version: 7.0
Condor monitors the total virtual memory usage of jobs. This includes both physical RAM and disk-based virtual memory allocated by the job. While this provides a clear way to stop jobs from using more virtual memory than they should, what administrators often want is slightly more complicated: they want to limit the amount of physical RAM used by the job so that it doesn't cause performance problems for other jobs or tasks on the computer. This is difficult to do in a general way, because Condor currently lacks a way to measure how much physical RAM the application actually needs versus how much of its memory could be swapped out to the disk without impacting performance. (How much the application actually needs is known as the working set size.)
Example: Condor may see that the virtual memory size of a job is 1.5GB when there is only 1GB per slot on the 4 core system. This could be a problem if jobs on the other slots need their full 1GB of expected memory. However, it may be that there simply isn't demand for memory at the moment, so the operating system is letting this job keep more of its memory in physical RAM than it actually needs. If something else comes along and demands more memory, the memory usage of this job might painlessly shift so that only 1.0GB is in physical RAM and the other 0.5GB is on disk, leaving the expected amount of RAM for other jobs without causing poor performance due to thrashing (actively needed data jumping back and forth between disk and RAM).
This is an area we hope to improve in Condor. In the meantime, here are some recipes that have proven useful, even though they are not perfect.
Put the following in your configuration file on the execute machine. This assumes that settings such as PREEMPT have already been defined further up in the configuration, so put it after those definitions or merge it into them.
# Let a job use up to 90% of the memory allocated to its batch slot
MEMORY_AVAILABLE_MB = (Memory*0.9)
VIRTUAL_MEMORY_AVAILABLE_MB = (VirtualMemory*0.9)
# The working set size is the amount of memory the job actually needs
# in RAM (as opposed to disk-based memory) in order to run without
# thrashing (copying data back and forth between RAM and disk frequently).
# If the job has an attribute "MemoryRequirementsMB", then we use that
# for the working set size. This is a custom attribute that would have
# to be manually set by the user, and which we trust in place of our
# default assumption. The default in this example is to arbitrarily
# assume the working set size is 70% of the virtual memory size.
# That will certainly be wrong if the job calls mmap() on a large file,
# but doesn't need the full file in RAM, so in such cases, the user will
# have to set MemoryRequirementsMB.
WORKING_SET_SIZE_MB = ifThenElse( isUndefined(MemoryRequirementsMB), \
ImageSize/1024*0.7, \
MemoryRequirementsMB )
# Here we check if the working set size of the job is greater than
# the RAM allocated to this batch slot. We also check if the virtual
# memory size of the job is greater than the total virtual memory allocated
# to this batch slot. If either is true, then memory is exceeded.
MEMORY_EXCEEDED = $(WORKING_SET_SIZE_MB) > $(MEMORY_AVAILABLE_MB) || \
ImageSize/1024 > $(VIRTUAL_MEMORY_AVAILABLE_MB)
PREEMPT = ($(PREEMPT)) || ($(MEMORY_EXCEEDED))
WANT_SUSPEND = ($(WANT_SUSPEND)) && ($(MEMORY_EXCEEDED)) =!= TRUE
Note that preempted jobs will go back to idle in the job queue and will potentially try to run again if they can match to a machine.
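To get a feel for when MEMORY_EXCEEDED would fire, it can help to evaluate the same arithmetic by hand. The following is only a sketch that mirrors the expressions above in shell and awk. The function name memory_exceeded is hypothetical, and it assumes that ImageSize is given in KB and that both the slot memory and the virtual memory figures are supplied in MB:

```sh
#!/bin/sh
# Hypothetical helper mirroring the MEMORY_EXCEEDED policy expression.
# Arguments:
#   $1  ImageSize of the job in KB
#   $2  memory allocated to the slot in MB (assumed unit)
#   $3  virtual memory allocated to the slot in MB (assumed unit)
#   $4  optional MemoryRequirementsMB supplied by the user
memory_exceeded() {
    awk -v image_kb="$1" -v mem="$2" -v vmem="$3" -v req="${4:-}" 'BEGIN {
        image_mb = image_kb / 1024
        # Default working set guess: 70% of the virtual memory size.
        wss = (req == "") ? image_mb * 0.7 : req + 0
        # A job may use up to 90% of what the slot was allocated.
        exceeded = (wss > mem * 0.9) || (image_mb > vmem * 0.9)
        print (exceeded ? "TRUE" : "FALSE")
    }'
}

# A 1.5GB job on a 1GB slot with 4GB of virtual memory:
#   memory_exceeded 1572864 1024 4096        (prints TRUE)
# The same job when the user declares an 800MB working set:
#   memory_exceeded 1572864 1024 4096 800    (prints FALSE)
```

In the first case the guessed working set is 1536 * 0.7 = 1075.2MB, which is above the 921.6MB limit; in the second case the declared 800MB is trusted instead.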
Jobs can hold or remove themselves by specifying a periodic_hold or periodic_remove expression. The schedd can also hold or remove jobs as dictated by the configuration expressions SYSTEM_PERIODIC_HOLD or SYSTEM_PERIODIC_REMOVE. These are all submit-side controls, whereas the PREEMPT example above is an execute-side control. One problem with the PREEMPT example is that it doesn't do a very good job of communicating to the job owner why the job was evicted. Putting the job on hold may help communicate better. Then the user knows to resubmit the job with larger memory requirements or investigate why the job used more memory than it should have. The following example configuration shows how to put jobs on hold from the submit-side when they use too much memory. All of the same issues concerning accurate measurement of working set size apply here just as they did in the PREEMPT example above.
# The working set size is the amount of memory the job actually needs
# in RAM (as opposed to disk-based memory) in order to run without
# thrashing (copying data back and forth between RAM and disk frequently).
# If the job has an attribute "MemoryRequirementsMB", then we use that
# for the working set size. This is a custom attribute that would have
# to be manually set by the user, and which we trust in place of our
# default assumption. The default in this example is to arbitrarily
# assume the working set size is 70% of the virtual memory size.
# That will certainly be wrong if the job calls mmap() on a large file,
# but doesn't need the full file in RAM, so in such cases, the user will
# have to set MemoryRequirementsMB.
WORKING_SET_SIZE_MB = ifThenElse( isUndefined(MemoryRequirementsMB), \
ImageSize/1024*0.7, \
MemoryRequirementsMB )
# When a job matches, insert the machine memory into the
# job ClassAd so SYSTEM_PERIODIC_HOLD can refer to it.
MachineMemoryString = "$$(Memory)"
SUBMIT_EXPRS = $(SUBMIT_EXPRS) MachineMemoryString
SYSTEM_PERIODIC_HOLD = MATCH_EXP_MachineMemoryString =!= UNDEFINED && \
  $(WORKING_SET_SIZE_MB) > 0.9*int(MATCH_EXP_MachineMemoryString)
Condor doesn't currently provide a configuration setting for this, but you can write your own wrapper script that runs before the job and sets resource limits that are enforced by the operating system. Here is what you put in the configuration file of your execute machines:
USER_JOB_WRAPPER = /path/to/condor_job_wrapper
The file condor_job_wrapper above can be called whatever you want. You should create that file with the following contents:
#!/bin/sh
# change this to the maximum allowed data segment size (in kilobytes)
ulimit -d 1000000
# run the job
exec "$@"
Note that ulimit -m (maximum resident memory size) appears attractive, but it is not actually enforced on many operating systems.
Make sure the wrapper script is executable. Example:
chmod a+x /path/to/condor_job_wrapper
Known to work with Condor version: 7.0
Condor can handle pools of tens of thousands of execution slots and job queues of hundreds of thousands of jobs. Depending on the way you deploy it, the workloads that run on it, and the other tasks that share the system with it, you may find that Condor's ability to 'keep up' is limited by memory, processing speed, disk bandwidth, or configurable limits. The following information should help you determine whether there is a problem, which component is suffering, and what you might be able to do about it.
MAX_JOBS_RUNNING = 2000
SEC_DEFAULT_NEGOTIATION = OPTIONAL
HISTORY =
NEGOTIATOR_CONSIDER_PREEMPTION = False
Example calculations:
Absolute minimum RAM requirements for schedd with up to 10,000 jobs in the queue and up to 2,000 jobs running: 10000*0.01MB + 2000*0.5MB = 1.1GB
Absolute minimum RAM requirements for central manager machine (collector+negotiator) with 5000 batch slots: 2*5000*0.01MB = 100MB
Realistically, you will want to add in at least another 500MB or so to the above numbers. And if you do have other processes running on your submit or central manager machines, you will need extra resources for those.
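The example calculations above can be reproduced for other pool sizes with a small shell helper. This is a rough sketch that assumes the per-item figures implied by those calculations (about 0.01MB per queued job or per slot ad, and about 0.5MB per running job); the function names are hypothetical:

```sh
#!/bin/sh
# Hypothetical helpers that repeat the arithmetic of the example
# calculations. All results are in MB and are bare minimums; they do
# not include the extra 500MB or so of headroom suggested in the text.

# $1: maximum jobs in the queue, $2: maximum jobs running
schedd_min_ram_mb() {
    awk -v queued="$1" -v running="$2" \
        'BEGIN { printf "%.0f\n", queued * 0.01 + running * 0.5 }'
}

# $1: number of batch slots in the pool
# (one copy of each slot ad in the collector, one in the negotiator)
central_manager_min_ram_mb() {
    awk -v slots="$1" 'BEGIN { printf "%.0f\n", 2 * slots * 0.01 }'
}

# schedd_min_ram_mb 10000 2000      prints 1100 (about 1.1GB)
# central_manager_min_ram_mb 5000   prints 100
```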
Also remember to provision a fast dedicated disk for the spool directory of very busy schedds.
To see whether you are suffering from timeout tuning problems, search for "timeout reading" or "timeout writing" in your ShadowLog.
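One way to run that search is sketched below. The function name count_shadow_timeouts is made up for this example, and the path to the ShadowLog is passed in because its location depends on your LOG setting:

```sh
#!/bin/sh
# Hypothetical helper: count the lines in a ShadowLog that mention
# "timeout reading" or "timeout writing".
# $1: path to the ShadowLog file
count_shadow_timeouts() {
    grep -c -E 'timeout (reading|writing)' "$1"
}

# Example use (the path is an assumption):
#   count_shadow_timeouts /var/log/condor/ShadowLog
```

A count of 0 means no such timeouts were logged; note that grep also returns a non-zero exit status in that case.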
As a general indicator of health on a submit node, you can summarize the condor_shadow exit codes with a command like this:
$ grep 'EXITING WITH STATUS' ShadowLog | cut -d " " -f 8- | sort | uniq -c
12099 EXITING WITH STATUS 100
81 EXITING WITH STATUS 102
2965 EXITING WITH STATUS 107
239 EXITING WITH STATUS 108
332 EXITING WITH STATUS 112
Meaning of common exit codes:
Check the duration of the negotiation cycle:
$ grep "Negotiation Cycle" NegotiatorLog
5/3 07:37:35 ---------- Started Negotiation Cycle ----------
5/3 07:39:41 ---------- Finished Negotiation Cycle ----------
5/3 07:44:41 ---------- Started Negotiation Cycle ----------
5/3 07:46:59 ---------- Finished Negotiation Cycle ----------
If the cycle is taking a long time (e.g. longer than 5 minutes), then see if it is spending a lot of time on a particular user:
$ grep "Negotiating with" NegotiatorLog
5/3 07:53:12 Negotiating with osg_samgrid@hep.wisc.edu at ...
5/3 07:53:13 Negotiating with jherschleb@lmcg.wisc.edu at ...
5/3 07:53:13 Negotiating with camiller@che.wisc.edu at ...
5/3 07:53:14 Negotiating with malshe@cae.wisc.edu at ...
If a particular user is consuming a lot of time in the negotiator (e.g. job after job being rejected), then look at how well that user's jobs are getting "auto clustered". This auto clustering happens, for the most part, behind the scenes and helps improve the efficiency of negotiation by grouping equivalent jobs together.
You can see how the jobs are getting grouped together by looking at the job attribute AutoClusterID. Example:
$ condor_q -f "%s" AutoClusterID -f " %s" ClusterID -f ".%s\n" ProcID
1 649884.0
1 649885.0
50 650082.0
Jobs with the same AutoClusterID are in the same group for negotiation purposes. If you see that many small groups are being created, take a look at the attribute AutoClusterAttrs. This will tell you what attributes are being used to group jobs together. All jobs in a group have identical values for these attributes. In some cases, it may be necessary to tweak the way a particular attribute is being rounded. See SCHEDD_ROUND_ATTR in the manual for more information on that.
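With a large queue it is easier to spot many small groups if the output is summarized. The following sketch counts how many jobs share each AutoClusterID; it assumes input in the two-column form shown in the condor_q example above, and the function name autocluster_sizes is hypothetical:

```sh
#!/bin/sh
# Hypothetical helper: read "AutoClusterID ClusterID.ProcID" lines on
# stdin and print "AutoClusterID job-count", sorted by AutoClusterID.
autocluster_sizes() {
    awk '{ count[$1]++ } END { for (id in count) print id, count[id] }' |
        sort -n
}

# Example use:
#   condor_q -f "%s" AutoClusterID -f " %s" ClusterID -f ".%s\n" ProcID | autocluster_sizes
```

For the three sample jobs shown above, this would report two jobs in group 1 and one job in group 50.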
To protect the negotiator against one user consuming large amounts of time, you can also configure NEGOTIATOR_MAX_TIME_PER_SUBMITTER. Example:
NEGOTIATOR_MAX_TIME_PER_SUBMITTER = 360
If the collector can't keep up with the ClassAd updates that it is receiving from the Condor daemons in the pool, and you are using UDP updates (the default), then it will "drop" updates. The consequence of dropped updates is stale information about the state of the pool and possibly machines appearing to be missing from the pool (depending on how many successive updates are lost). If you are using TCP updates and the collector cannot keep up, then Condor daemons (e.g. startds) may block or time out when trying to send updates.
A simple way to see if you have a serious problem with dropped updates is to observe the total number of machines in the pool, from the point of view of the collector (condor_status -total). If this number drops down to less than it should be, and the missing machines are running Condor and otherwise working fine, then the problem may be dropped updates.
A more direct way to see if your collector is dropping ClassAd updates is to use the tool condor_updates_stats. Example:
condor_status -l | condor_updates_stats
*** Name/Machine = 'vm4@...' MyType = 'Machine' ***
Type: Main
Stats: Total=713, Seq=712, Lost=3 (0.42%)
0: Ok
...
127: Ok
If your problem is simply that UDP updates are coming in uneven bursts, then the solution is to provide enough UDP buffer space. You can see whether this is the problem by watching the receive queue on the collector's UDP port (visible through netstat -l under unix). If it fills up now and then but is otherwise empty, then increasing the buffer size should help. However, the default in current versions of Condor is 10MB, which is adequate for most large pools that we have seen. Example:
# 20MB
COLLECTOR_SOCKET_BUFSIZE = 20480000
See the Condor Manual entry for COLLECTOR_SOCKET_BUFSIZE for additional information on how to make sure the OS is cooperating with the requested buffer size.
If you are using strong authentication in the updates to the collector, this may add a lot of overhead and cause the collector not to scale high enough for very large pools. One way to deal with that is to have multiple collectors that each serve a portion of your execute nodes. These collectors would receive updates via strong authentication and then forward the updates to another main collector. An example of how to set this up is described in How to configure multi-tier collectors.
This forwarding can be configured by using the CONDOR_VIEW_HOST configuration setting.
If all else fails, you can decrease the frequency of ClassAd updates by tuning UPDATE_INTERVAL and MASTER_UPDATE_INTERVAL.
One more tunable parameter in the collector is (only under unix) the maximum number of queries that the collector will try to respond to simultaneously (by creating child processes to handle each one). This is controlled by COLLECTOR_QUERY_WORKERS, which defaults to 16.
You can set up redundant collector+negotiator instances, so if the central manager machine goes down, the pool can continue to function. All of the HAD collectors run all the time, but only one negotiator may run at a time, so the condor_had component ensures that a new instance of the negotiator is started up when the existing one dies. The main restriction is that the HAD negotiator won't help users who are flocking to the condor pool. More information about HAD can be found in the Condor manual.
Tip: if you do frequent condor_status queries for monitoring, you can direct these to one of your secondary collectors in order to offload work from your primary collector.
Known to work with Condor version: 7.0
condor_userprio -usage -allusers
If you want to track usage over time, one way is to save the results from the above command periodically (e.g. every day) and then build reports from this historical data. Some working scripts that do this (used on the Wisconsin GLOW pool) may be found here.
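A very small version of such a periodic snapshot could look like the sketch below. It is not the GLOW script; the function name, the file naming scheme, and the USERPRIO_CMD override (handy for trying the script without a pool) are all assumptions made for this example:

```sh
#!/bin/sh
# Hypothetical snapshot helper: save today's usage report into a
# directory, one file per day, and print the name of the file written.
# $1: directory in which to keep the snapshots
save_usage_snapshot() {
    outdir=$1
    # Allow the command to be overridden for testing (assumption).
    cmd=${USERPRIO_CMD:-condor_userprio}
    mkdir -p "$outdir" || return 1
    outfile="$outdir/userprio-$(date +%Y-%m-%d).txt"
    "$cmd" -usage -allusers > "$outfile" || return 1
    echo "$outfile"
}

# Example use from a daily cron job (the path is an assumption):
#   save_usage_snapshot /var/lib/condor/usage-history
```

Reports can then be built by comparing the accumulated usage columns across the saved files.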
The CondorView tool may be used to interactively view historical statistics for a queue.
Known to work in Condor version: 7.0
The Condor system is designed (among many other things) to scavenge compute cycles on desktop workstations when interactive users are idle. This same concept can be applied to scavenging cycles from another batch system running on the same computer. The main idea is to configure Condor to notice when the other batch system is idle on the machine, instead of noticing when an interactive user is idle. When the other system is idle, Condor is free to run jobs, until such time as the other batch system has work to do. Then, Condor must preempt or checkpoint the current work. This page discusses how to configure Condor to do this with PBS, though the concept works for other batch systems as well.
First, configure the condor startd to run jobs only when the attribute PBSRunning is False. We'll set this attribute dynamically with the condor_config_val -rset command.
On the worker nodes, define in the condor config:
ENABLE_RUNTIME_CONFIG = TRUE
STARTD_SETTABLE_ATTRS_OWNER = PBSRunning
PBSRunning = False
# Only start jobs if PBS is not currently running a job
START_NOPBS = ( $(PBSRunning) == False )
START = $(START) && $(START_NOPBS)
so that Condor will only start jobs if START is true and there are no PBS jobs running.
In the PBS world, again on the worker side, have PBS tell Condor when it is running, by adding the following to the PBS prologue.
if [ -x /opt/condor/bin/condor_config_val ]; then
    /opt/condor/bin/condor_config_val -rset -startd PBSRunning=True > /dev/null
    /opt/condor/sbin/condor_reconfig -startd > /dev/null
    sleep 2
    # If this machine is still claimed by Condor, vacate the running job
    if ( /opt/condor/bin/condor_status -format '%s ' Name -format '%s \n' State "$(hostname)" 2> /dev/null | grep -q Claimed )
    then
        /opt/condor/sbin/condor_vacate > /dev/null
        sleep 2
    fi
fi
In the PBS Epilogue, tell condor that it is OK to use this machine again:
if [ -x /opt/condor/bin/condor_config_val ]; then
/opt/condor/bin/condor_config_val -rset -startd PBSRunning=False > /dev/null
    /opt/condor/sbin/condor_reconfig -startd > /dev/null
fi
This is based on a recipe from Preston Smith of Purdue University. Thanks Preston!
Known to work with Condor version: 7.0
Suppose you have a condor pool that runs jobs from several different groups of users who submit their jobs from different domains: physics.wisc.edu, biology.wisc.edu, and chemistry.wisc.edu. Further suppose that there is an agreement between these departments about what their relative priority should be in the condor pool. How can you implement it?
First, you must answer this question: do you want each department to get an agreed-upon share of the pool? If so, you probably instead want to use group accounting. If you really do want to adjust the user priority factors of individual users submitting from the different domains, then read on.
In the simple case where there are just two domains, there is a built-in method in the condor configuration file that lets you boost the priority factor of all domains but one:
ACCOUNTANT_LOCAL_DOMAIN = "physics.wisc.edu"
# prio factor for users in "physics.wisc.edu" domain:
DEFAULT_PRIO_FACTOR = 1.0
# prio factor for everyone else:
REMOTE_PRIO_FACTOR = 10.0
In the more general case where you want to set the priority factor in different ways for several domains or user name patterns, there is no built-in method in the condor configuration. You can either adjust priority factors manually with condor_userprio or automate this process with a script (to be run periodically by Condor or cron). Here is an example script:
#!/bin/sh
condor_userprio -allusers | awk '/@/{print $1} {}' | while read user
do
factor=1
case "$user" in
( *@physics.wisc.edu ) factor=1 ;;
( *@biology.wisc.edu ) factor=10 ;;
( *@chemistry.wisc.edu ) factor=100 ;;
esac
condor_userprio -setfactor "$user" "$factor"
done
Be aware that if you run such a script, then any modifications to user priorities that you make manually with condor_userprio will be overwritten by the script when it runs!
Since a user may show up and claim machines before this script runs and adjusts their priority factor, you might want to set the default priority factor quite high (i.e. very bad priority). This will prevent them from getting many resources until their factor is adjusted:
DEFAULT_PRIO_FACTOR = 100000
REMOTE_PRIO_FACTOR = 100000
Known to work with Condor version: 7.0
Issue the following command from the central manager, or, depending on your security policy, from wherever and as whomever you need to be to issue administrative commands.
condor_off -startd -peaceful <hostname>
Initiate a peaceful shutdown of all execute nodes. Issue the following command from the central manager, or, depending on your security policy, from wherever and as whomever you need to be to issue administrative commands.
condor_off -all -startd -peaceful
Once condor_status reports that the pool is empty of startds, shut everything else off:
condor_off -all -master
Disable new submissions to the schedd by adding the following to the configuration file:
MAX_JOBS_SUBMITTED=0
Once all jobs have completed, turn off condor by issuing the following command from the central manager, or, depending on your security policy, from wherever and as whomever you need to be to issue administrative commands.
condor_off -schedd -peaceful <hostname>
As of Condor 7.1.1, you can do this by issuing the following command from the central manager, or, depending on your security policy, from wherever and as whomever you need to be to issue administrative commands.
condor_off -schedd -peaceful <hostname>
In versions prior to Condor 7.1.1, you can put all idle jobs on hold and then wait for the running jobs to finish. Run the following command as a user with administrative privileges in the queue (e.g. root).
condor_hold -constraint 'JobStatus == 1'
In this case, you may also wish to disable the submission of new jobs by adding the following to your configuration:
MAX_JOBS_SUBMITTED=0
Just shut down the schedd normally (graceful shutdown). Issue the following command from the central manager, or, depending on your security policy, from wherever and as whomever you need to be to issue administrative commands.
condor_off -schedd <hostname>
During graceful shutdown of the schedd, all running standard universe jobs are stopped and checkpointed. All other jobs are left running (if they have a non-zero JobLeaseDuration, which is 20 minutes by default). The schedd gracefully disconnects from them in the hope of being able to later reconnect to the running jobs when it starts back up. If the lease runs out before the schedd reconnects to the jobs, then they are killed. Therefore, if you need a longer down time, you should increase the lease. You can increase the default by adding the following to your Condor configuration:
JobLeaseDuration = 5400
SUBMIT_EXPRS = $(SUBMIT_EXPRS) JobLeaseDuration
Known to work with Condor version: 7.0
Jobs can be defined with their own rank expression that specifies which machines they prefer to run on. Sometimes it is desirable for administrators to also influence the choice of machine. For example, suppose you have a pool composed of desktop machines plus dedicated compute nodes. You might want jobs to run on the dedicated nodes if any are idle. The following example configuration achieves this:
NEGOTIATOR_PRE_JOB_RANK = (IsDesktop =!= True && isUndefined(RemoteOwner)) + \
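  isUndefined(RemoteOwner)
That produces the following possible values for NEGOTIATOR_PRE_JOB_RANK:
2 if the machine is unclaimed and is not a desktop
1 if the machine is unclaimed and is a desktop
0 if the machine is claimed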
This assumes that desktop machines define a ClassAd attribute IsDesktop. You can do that like this:
IsDesktop = True
STARTD_ATTRS = $(STARTD_ATTRS) IsDesktop
Note that NEGOTIATOR_PRE_JOB_RANK is a higher precedence sort key than the job's own rank expression, so if two machines match a job and NEGOTIATOR_PRE_JOB_RANK is bigger for one than the other, then it doesn't matter what the job's rank expression says. Sometimes, that is good, because otherwise, the user might define a rank expression for a completely different purpose (such as preferring faster machines) and not realize that in so doing, they lost the default behavior of steering their jobs away from desktops. That being said, it is still sometimes desirable to steer jobs without overriding the user's rank expression. You can do that with a configuration such as the following:
NEGOTIATOR_POST_JOB_RANK = isUndefined(RemoteOwner) * (KFlops - SlotID)
The above example steers jobs towards faster machines and it tends to fill a multi-cpu cluster by sending jobs to different machines first and doubling up only when it has to. This expression is chosen so that it has no effect if the machine is claimed, allowing control to pass on to PREEMPTION_RANK, which is intended for that purpose.
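To see why this expression spreads jobs across machines before doubling up, it may help to evaluate it by hand for a few slots. The sketch below mimics the arithmetic in shell. The function name post_job_rank and the KFlops value are made up for illustration, and the first argument stands in for isUndefined(RemoteOwner):

```sh
#!/bin/sh
# Hypothetical helper mirroring
#   isUndefined(RemoteOwner) * (KFlops - SlotID)
# $1: 1 if the slot is unclaimed, 0 if it is claimed
# $2: KFlops of the machine
# $3: SlotID of the slot
post_job_rank() {
    echo $(( $1 * ($2 - $3) ))
}

# Two equally fast machines with two slots each, all unclaimed:
#   post_job_rank 1 1000000 1   prints 999999 (slot 1 on either machine)
#   post_job_rank 1 1000000 2   prints 999998 (slot 2 on either machine)
# A claimed slot always ranks 0:
#   post_job_rank 0 1000000 1   prints 0
```

Because slot 1 ranks slightly higher than slot 2 on machines of equal speed, the first slots on different machines are matched before any second slot is used.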
NEGOTIATOR_POST_JOB_RANK can be overridden by anyone who specifies a rank expression in their job submit file (unless their rank expression ranks the machines in question equally). You might instead want users to have to try harder (i.e. know what they are doing) to override your configuration. Here is an example:
NEGOTIATOR_PRE_JOB_RANK = (JobOverridesNegotiatorRank =!= True) * \
  isUndefined(RemoteOwner) * (KFlops - SlotID)
Jobs that need to override the negotiator pre job rank can then be submitted with the following in their submit file:
+JobOverridesNegotiatorRank = True
Known to work with Condor version: 7.0
Condor can suspend a process and then later resume right where it left off. This is similar to standard universe checkpointing, except it cannot move the suspended job from one machine to another. The mechanism works with all types of jobs, without any special preparation of the job.
There are two cases where you might prefer suspension in place of killing jobs. One is preemption caused by other activity on the machine (e.g. a user coming back to their desktop). The other is preemption caused by higher priority jobs getting matched to machines that are already busy. Both of these cases are covered below.
There is a section in the Condor Manual on this topic here. The example in the manual requires that you divide jobs into two classes: high priority and low priority. When the high priority jobs arrive on a machine, existing low priority jobs get suspended.
Below is an example in which the relative priority of the jobs to be suspended and the other class of jobs is left up to the usual mechanisms of startd RANK and user priority. You still have to divide the jobs into two classes: normal jobs and those that should be suspended when they are preempted by normal jobs. This is useful, for example, when you have a special class of jobs which are long running and not checkpointable, but you don't want them to unfairly hog the pool.
# You may want to advertise double the amount of system memory
# if you have enough virtual memory to allow the foreground job
# to consume all of memory while the suspended job gets pushed
# into swap memory. There is currently no convenient way to
# tell Condor you want to oversubscribe your memory, so you
# have to hard-code the amount of memory you want to advertise
# by uncommenting and filling in the following:
# Memory = TWICE_YOUR_SYSTEM_MEMORY
NUM_CPUS = 2
# So that the suspension slot can see the state
# of the other slot, we need to have some things
# advertised about each slot in the ClassAds of
# all the other slots on the same machine:
STARTD_SLOT_EXPRS = $(STARTD_SLOT_EXPRS) State, CurrentRank
# For informational purposes, put IsSuspensionSlot
# in the startd ClassAd:
STARTD_ATTRS = $(STARTD_ATTRS) IsSuspensionSlot
# Slot 1 is the "normal" batch slot
SLOT1_IsSuspensionSlot = False
# Slot 2 suspends its job, rather than preempting it
SLOT2_IsSuspensionSlot = True
START = SlotID == 1 && ($(SLOT1_START)) || \
SlotID == 2 && ($(SLOT2_START))
CONTINUE = SlotID == 1 && ($(SLOT1_CONTINUE)) || \
SlotID == 2 && ($(SLOT2_CONTINUE))
PREEMPT = SlotID == 1 && ($(SLOT1_PREEMPT)) || \
SlotID == 2 && ($(SLOT2_PREEMPT))
SUSPEND = SlotID == 1 && ($(SLOT1_SUSPEND)) || \
SlotID == 2 && ($(SLOT2_SUSPEND))
# The purpose of the following expression is to prevent a
# job from starting on slot 1 if it has less priority to run
# than the job already running on slot 2, because once we let
# a job run on slot 1, the slot 2 job will be suspended.
# This expression refers to attributes that are only defined
# when requirements are being evaluated by the Negotiator:
# SubmittorPrio [sic] and RemoteUserPrio
SLOT1_HAS_PRIO = SubmittorPrio =?= UNDEFINED || \
vm2_RemoteUserPrio =?= UNDEFINED || \
SubmittorPrio < 1.2 * vm2_RemoteUserPrio || \
vm2_CurrentRank =?= UNDEFINED || \
MY.Rank > vm2_CurrentRank
# Slot 1 is a normal execution slot
SLOT1_START = TARGET.IsSuspensionJob =!= true && ($(SLOT1_HAS_PRIO))
SLOT1_CONTINUE = True
SLOT1_PREEMPT = False
SLOT1_SUSPEND = False
# Slot 2 is for jobs that get suspended while slot 1 is busy
SLOT2_START = TARGET.IsSuspensionJob =?= true
SLOT2_CONTINUE = slot1_State =?= "Unclaimed" || slot1_State =?= "Owner"
SLOT2_PREEMPT = FALSE
SLOT2_SUSPEND = slot1_State =?= "Claimed"
To submit a suspension job, you could put something like the following in your submit file:
+IsSuspensionJob = True
requirements = TARGET.IsSuspensionSlot
The example policy above does not prevent preemption of suspension jobs by other suspension jobs. It only prevents preemption of suspension jobs by normal jobs. If you want to prevent preemption by other suspension jobs as well, you could do something like this:
# Do not preempt suspension jobs (for up to 24 hours)
MaxJobRetirementTime = (MY.IsSuspensionSlot =?= True) * 3600 * 24
You just need to make sure WANT_SUSPEND and SUSPEND are true in the cases where you want suspension. The following example is similar to the default policy that ships in Condor's configuration file. It has been made more aggressive in what types of jobs it suspends (everything) and for how long it suspends them before giving up and killing them.
KeyboardBusy = (KeyboardIdle < $(MINUTE))
# Suspend jobs for up to one day
MaxSuspendTime = (3600 * 24)
ContinueIdleTime = 60 * $(MINUTE)
# Suspend jobs if:
# 1) the keyboard has been touched, OR
# 2a) The cpu has been busy for more than 2 minutes, AND
# 2b) the job has been running for more than 90 seconds
SUSPEND = ( $(KeyboardBusy) || \
            ( (CpuBusyTime > 2 * $(MINUTE)) \
              && $(ActivationTimer) > 90 ) )
WANT_SUSPEND = SUSPEND
# Preempt jobs if:
# 1) The job is suspended and has been suspended longer than we want
# 2) OR, we don't want to suspend this job, but the conditions to
# suspend jobs have been met (someone is using the machine)
PREEMPT = ( ((Activity == "Suspended") && \
             ($(ActivityTimer) > $(MaxSuspendTime))) \
            || (SUSPEND && (WANT_SUSPEND == False)) )
CONTINUE = ( $(CPUIdle) && ($(ActivityTimer) > 10) \
             && (KeyboardIdle > $(ContinueIdleTime)) )
Known to work with Condor version: 7.0
By running a command, you can suspend jobs running under Condor across the entire pool or portions of it. The following shell script is an example of how to do this. The necessary configuration changes are documented at the top of the script.
#!/bin/sh
# Author: Dan Bradley <dan@hep.wisc.edu>
# Date: 2007-12-21
#
# Example condor configuration required to make this script work:
# (You need to restart condor to enable runtime config modification.)
#
# SuspendedByAdmin = False
# SETTABLE_ATTRS_ADMINISTRATOR = SuspendedByAdmin
# ENABLE_RUNTIME_CONFIG = True
#
# START = ($(START)) && SuspendedByAdmin =!= True
# WANT_SUSPEND = ($(WANT_SUSPEND)) || SuspendedByAdmin =?= True
# SUSPEND = ($(SUSPEND)) || SuspendedByAdmin =?= True
# CONTINUE = ($(CONTINUE)) && SuspendedByAdmin =!= True
PrintUsage() {
  echo "USAGE: $0 OPTIONS"
  echo
  echo "Suspend/unsuspend jobs on GLOW. This depends on the condor"
  echo "configuration doing the right thing when SuspendedByAdmin"
  echo "is remotely modified by this script."
  echo
  echo "OPTIONS:"
  echo " --site=X (GLOW site name)"
  echo " --constraint=X (arbitrary ClassAd constraint)"
  echo " --unsuspend (remove suspension state set previously)"
  echo " --dry-run (don't do anything; just show what would have been done)"
  echo " --status (show suspension state)"
  exit 2
}
OPTS=`getopt -o "h" -l "help,site:,constraint:,unsuspend,dry-run,status" -- "$@"`
if [ $? -ne 0 ]; then PrintUsage; fi
eval set -- "$OPTS"
SITE=
SUSPEND=True
DRY_RUN=
CONSTRAINT=
STATUS=
while [ ! -z "$1" ]
do
  case "$1" in
    -h) PrintUsage;;
    --help) PrintUsage;;
    --site) shift; SITE=$1;;
    --constraint) shift; CONSTRAINT=$1;;
    --unsuspend) SUSPEND=False;;
    --dry-run) DRY_RUN="echo dry-run:";;
    --status) STATUS=1;;
    --) shift; break;;
    *) echo "Unexpected option $1"; PrintUsage;;
  esac
  shift
done
if ! [ -z "$SITE" ]; then
  if ! [ -z "$CONSTRAINT" ]; then
    CONSTRAINT="$CONSTRAINT && "
  fi
  CONSTRAINT="${CONSTRAINT}Site =?= \"${SITE}\""
fi
if ! [ -z "$STATUS" ]; then
  condor_status -constraint "$CONSTRAINT" -f "%s " Name -f "SuspendedByAdmin=%s" SuspendedByAdmin -f "\n" NewLine
  exit 0
fi
if [ -z "$CONSTRAINT" ]; then
  echo "You must specify --constraint or --site."
  exit 2
fi
if [ "$SUSPEND" = "True" ]; then
  action=Suspending
else
  action=Unsuspending
fi
echo "$action jobs on machines matching constraint $CONSTRAINT"
condor_status -constraint "$CONSTRAINT" -f "%s\n" Machine | sort | uniq |
while read HOST; do
  [ -z "$HOST" ] && continue
  echo "$action $HOST"
  # $DRY_RUN is intentionally unquoted so that "echo dry-run:" splits into words
  $DRY_RUN condor_config_val -startd -name "$HOST" -rset "SuspendedByAdmin=$SUSPEND"
  $DRY_RUN condor_reconfig "$HOST"
done
Known to work with Condor version: 7.0
One way to upgrade gracefully is to shut down the pool, install the new version of Condor, and then start it back up. To do that, see How to shut down condor without killing jobs. Before you do that, however, consider the cost of waiting for jobs to finish. On a multi-core machine, leaving all cores but one idle while you wait for the last job to finish may be worse than killing everything and restarting quickly.
Another way to upgrade is to leave Condor running. Condor will automatically restart itself if the condor_master binary is updated. To take advantage of this, configure Condor so that the path to binaries (e.g. MASTER) points to the new binaries. One way to do that (under Unix) is to use a symlink that points to the current Condor installation directory (e.g. /opt/condor). Once the new files are in place, change the symlink to point to the new directory. If Condor is configured to locate its binaries via the symlink, then after the symlink changes, condor_master will notice the new binaries and restart itself. (How frequently it checks is controlled by MASTER_CHECK_NEW_EXEC_INTERVAL, which defaults to 5 minutes.)
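As a minimal sketch of the symlink approach described above, the following shell function repoints a symlink at a new installation directory. The function name switch_condor_link and the example paths in the comments are hypothetical, and the sketch assumes an ln that supports -n (as GNU coreutils does), so that an existing symlink to a directory is replaced rather than followed.

```shell
# Hypothetical helper (not part of Condor): repoint a symlink such as
# /opt/condor at a newly installed directory such as /opt/condor-new.
# Example call, with illustrative paths:
#   switch_condor_link /opt/condor-new /opt/condor
switch_condor_link() {
    new_dir=$1
    link=$2
    # Refuse to point the link at a directory that does not exist.
    if [ ! -d "$new_dir" ]; then
        echo "no such directory: $new_dir" >&2
        return 1
    fi
    # -s: symbolic link, -f: replace an existing link,
    # -n: do not follow an existing symlink that points to a directory.
    ln -sfn "$new_dir" "$link"
}
```

Checking that the new directory exists before switching avoids leaving the symlink dangling, which would otherwise hide the binaries from condor_master.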
When the master notices new binaries, it begins a graceful restart, which may not be exactly what you want. On an execute machine, a graceful restart means that running jobs are preempted. Standard universe jobs will attempt to checkpoint, which could be a problem if all machines in a large pool attempt to do this at the same time. If they do not complete within the cutoff time specified by the KILL policy expression (default 10 minutes), then they are killed without checkpointing. You may therefore want to increase this cutoff time and you may also want to upgrade the pool in stages rather than all at once.
For universes other than standard universe, jobs are preempted. If jobs have been guaranteed a certain amount of uninterrupted run time with MaxJobRetirementTime, then the job is not killed until the specified amount of retirement time has been exceeded (the default is 0). The first step of killing the job is a soft kill signal, which can be intercepted by the job so that it can shut down gracefully, save state, etc. If the job has not gone away once the KILL expression fires (10 minutes by default), then it is forcibly hard-killed. Since graceful shutdown of jobs may rely on shared resources such as disks where state is saved, the same reasoning applies as for standard universe: you may want to increase the KILL time for large pools and you may want to upgrade the pool in stages to avoid jobs running out of time.
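One possible way to plan an upgrade in stages is to split the list of machines into fixed-size groups and switch one group at a time. The sketch below only does the grouping; the function name print_stages and the stage size are hypothetical, and how each stage is actually upgraded (for example, by changing the symlink on those machines) is left to the administrator.

```shell
# Hypothetical helper (not part of Condor): read host names, one per
# line, on stdin and print each one prefixed with its stage number.
# $1 is the maximum number of hosts per stage.
# Example call, with an illustrative stage size:
#   condor_status -master -format "%s\n" Machine | sort | uniq | print_stages 20
print_stages() {
    size=$1
    # Guard against a zero or negative stage size.
    [ "$size" -gt 0 ] || return 1
    # NF skips blank lines; n counts the hosts seen so far.
    awk -v size="$size" 'NF { printf "stage %d: %s\n", int(n / size) + 1, $1; n++ }'
}
```

Waiting for one stage to finish restarting before starting the next limits how many jobs checkpoint or save state at the same time.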
Another time limit to be aware of is the configuration setting SHUTDOWN_GRACEFUL_TIMEOUT. This defaults to 30 minutes. If the graceful restart is not completed within this time, a fast restart ensues. This causes jobs to be hard-killed.
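The interaction of these limits can be checked with simple arithmetic. The sketch below uses a simplified model, which is an assumption rather than Condor's exact logic: a graceful restart may take up to the remaining retirement time plus the KILL cutoff, and that total should fit within SHUTDOWN_GRACEFUL_TIMEOUT. The function name graceful_restart_fits is hypothetical, and all values are in seconds.

```shell
# Simplified model (an assumption, not Condor's exact logic): compare
# the worst-case graceful restart time with SHUTDOWN_GRACEFUL_TIMEOUT.
# Example call with the defaults mentioned in the text:
#   graceful_restart_fits 0 600 1800
graceful_restart_fits() {
    retirement=$1    # remaining MaxJobRetirementTime
    kill_cutoff=$2   # time until the KILL expression fires
    timeout=$3       # SHUTDOWN_GRACEFUL_TIMEOUT
    worst_case=$((retirement + kill_cutoff))
    if [ "$worst_case" -le "$timeout" ]; then
        echo "ok: worst case ${worst_case}s fits within ${timeout}s"
    else
        echo "too long: worst case ${worst_case}s exceeds ${timeout}s"
        return 1
    fi
}
```

With the defaults (0 seconds of retirement, a 10 minute KILL cutoff, a 30 minute timeout) the restart fits. With one hour of retirement time it does not, so under this model jobs could be hard-killed before their retirement time is up unless the timeout is raised.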
On unix, the following is a handy way to summarize the Condor versions that exist in your pool:
condor_status -master -format "%s\n" CondorVersion | sort | uniq -c
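To see which machines still need upgrading, the same query can be extended to print the machine name in front of each version string and then be filtered. This is a sketch under assumptions: the function name list_not_at_version is hypothetical, and it assumes that each input line has the host name as the first field and the version number as the third whitespace-separated field, as in a line of the form "host $CondorVersion: 7.0.1 ...".

```shell
# Hypothetical helper (not part of Condor): read lines of the form
#   <host> $CondorVersion: <version> <rest...>
# on stdin and print the hosts whose version differs from $1.
# Example call, with an illustrative version number:
#   condor_status -master -format "%s " Machine -format "%s\n" CondorVersion |
#       list_not_at_version 7.0.1
list_not_at_version() {
    want=$1
    # Field 1 is the host, field 3 is assumed to be the version number.
    awk -v want="$want" '$3 != want { print $1 }'
}
```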