[ARVADOS] updated: 92052dc0b2f80a16420483a756fcebedc6c6ec3c
git at public.curoverse.com
Tue Apr 7 09:58:44 EDT 2015
Summary of changes:
services/api/script/crunch-dispatch.rb | 10 +++++++++-
1 file changed, 9 insertions(+), 1 deletion(-)
via 92052dc0b2f80a16420483a756fcebedc6c6ec3c (commit)
via 6520efd1dcf2c83ebe0b896d76167a8a89761931 (commit)
from a2fc3a2158c3c154c3d7fc2b55eea928898cafb5 (commit)
Those revisions listed above that are new to this repository have
not appeared on any other notification email; so we list those
revisions in full, below.
commit 92052dc0b2f80a16420483a756fcebedc6c6ec3c
Merge: a2fc3a2 6520efd
Author: Brett Smith <brett at curoverse.com>
Date: Tue Apr 7 09:57:19 2015 -0400
Merge branch '5352-crunch-dispatch-salloc-tempfail-wip'
Closes #5352, #5670.
commit 6520efd1dcf2c83ebe0b896d76167a8a89761931
Author: Brett Smith <brett at curoverse.com>
Date: Sun Apr 5 17:10:22 2015 -0400
5352: crunch-dispatch treats node allocation failure as temporary.
Imagine a scenario where multiple crunch-dispatch processes are
sitting idle, then suddenly a new job appears in the queue. They will
all race to dispatch the job. When this happens, we frequently see
that salloc fails for most of them, because they all requested the
same node(s) and only the winner will get them. crunch-dispatch has
no way to know the exit code "came from" salloc and not crunch-job,
and so marks the job failed.
This patch sets the SLURM_EXIT_IMMEDIATE environment variable to make
salloc use exit code 75 when the allocation fails. crunch-dispatch
already recognizes this exit code as a temporary failure, and will
leave the Arvados job record unchanged. Refer to salloc(1) and the
long comment in Dispatch#reap_children.
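
The decision described above can be sketched in isolation. This is a minimal illustration, not the code in crunch-dispatch.rb: the helper name classify_exit and the returned symbols are hypothetical, and only the value 75 and the "Running" state come from the patch. The value 75 matches EX_TEMPFAIL in sysexits.h.

```ruby
# Same value the patch introduces as Dispatcher::EXIT_TEMPFAIL.
EXIT_TEMPFAIL = 75

# Hypothetical helper: decide what a dispatcher should do with a child
# process that has exited, given the exit status and the job's state.
def classify_exit(exit_status, job_state)
  if exit_status == EXIT_TEMPFAIL
    # Temporary failure (e.g. salloc lost the allocation race):
    # leave the job record unchanged so it can be dispatched again.
    :retry_later
  elsif job_state == "Running"
    # Any other exit code with the job still "Running" means an
    # unhandled error, so the job should be failed.
    :mark_failed
  else
    # The child already moved the job to a final state itself.
    :leave_alone
  end
end
```

Under these assumptions, classify_exit(75, "Running") yields :retry_later, while any other exit code with the job still "Running" yields :mark_failed.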
diff --git a/services/api/script/crunch-dispatch.rb b/services/api/script/crunch-dispatch.rb
index 249582e..7b3ed9e 100755
--- a/services/api/script/crunch-dispatch.rb
+++ b/services/api/script/crunch-dispatch.rb
@@ -53,6 +53,8 @@ end
class Dispatcher
include ApplicationHelper
+ EXIT_TEMPFAIL = 75
+
def initialize
@crunch_job_bin = (ENV['CRUNCH_JOB_BIN'] || `which arv-crunch-job`.strip)
if @crunch_job_bin.empty?
@@ -632,7 +634,7 @@ class Dispatcher
exit_status = j_done[:wait_thr].value.exitstatus
jobrecord = Job.find_by_uuid(job_done.uuid)
- if exit_status != 75 and jobrecord.state == "Running"
+ if exit_status != EXIT_TEMPFAIL and jobrecord.state == "Running"
# crunch-job did not return exit code 75 (see below) and left the job in
# the "Running" state, which means there was an unhandled error. Fail
# the job.
@@ -756,4 +758,10 @@ end
# This is how crunch-job child procs know where the "refresh" trigger file is
ENV["CRUNCH_REFRESH_TRIGGER"] = Rails.configuration.crunch_refresh_trigger
+# If salloc can't allocate resources immediately, make it use our temporary
+# failure exit code. This ensures crunch-dispatch won't mark a job failed
+# because of an issue with node allocation. This often happens when
+# another dispatcher wins the race to allocate nodes.
+ENV["SLURM_EXIT_IMMEDIATE"] = Dispatcher::EXIT_TEMPFAIL.to_s
+
Dispatcher.new.run
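
The last hunk relies on child processes inheriting the dispatcher's environment. The sketch below illustrates that inheritance with a stand-in child: instead of salloc it runs a Ruby one-liner that exits with whatever code SLURM_EXIT_IMMEDIATE holds, which only imitates the behavior salloc(1) documents for a failed immediate allocation. The variable names child and child_exit are hypothetical.

```ruby
require "open3"
require "rbconfig"

EXIT_TEMPFAIL = 75

# Set once in the parent; every child spawned afterwards inherits it.
ENV["SLURM_EXIT_IMMEDIATE"] = EXIT_TEMPFAIL.to_s

# Stand-in for salloc (an assumption for illustration only): a child
# Ruby process that exits with the code found in its environment.
child = [RbConfig.ruby, "-e", 'exit Integer(ENV.fetch("SLURM_EXIT_IMMEDIATE"))']
_stdout, _stderr, status = Open3.capture3(*child)

# The parent observes the temporary-failure code from the child.
child_exit = status.exitstatus
```

Setting the variable at the top level, before Dispatcher.new.run, means no per-job plumbing is needed: any salloc the dispatcher later starts sees the same value.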
-----------------------------------------------------------------------
hooks/post-receive