[ARVADOS] created: a9c5882fa41b1d97bd5a512bcaba2a58e002cff8
git at public.curoverse.com
Sun Apr 5 17:10:24 EDT 2015
at a9c5882fa41b1d97bd5a512bcaba2a58e002cff8 (commit)
commit a9c5882fa41b1d97bd5a512bcaba2a58e002cff8
Author: Brett Smith <brett at curoverse.com>
Date: Sun Apr 5 17:10:22 2015 -0400
5352: crunch-dispatch treats node allocation failure as temporary.

Imagine a scenario where multiple crunch-dispatch processes are
sitting idle, then suddenly a new job appears in the queue. They will
all race to dispatch the job. When this happens, we frequently see
that salloc fails for most of them, because they all requested the
same node(s) and only the winner will get them. crunch-dispatch has
no way to know the exit code "came from" salloc and not crunch-job,
and so marks the job failed.

This patch sets the SLURM_EXIT_IMMEDIATE environment variable to make
salloc use exit code 75 when the allocation fails. crunch-dispatch
already recognizes this exit code as a temporary failure, and will
leave the Arvados job record unchanged. Refer to salloc(1) and the
long comment in Dispatch#reap_children.
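
The exit-code decision described above can be sketched as a small function. This is a minimal illustration under stated assumptions, not the code in crunch-dispatch.rb: `classify_exit` and the symbols it returns are hypothetical names. Only the value 75 and the check against the "Running" state come from the patch.

```ruby
# Hypothetical sketch; classify_exit is not part of crunch-dispatch.
EXIT_TEMPFAIL = 75 # same value the patch gives Dispatcher::EXIT_TEMPFAIL

# Decide what to do with a job record once its child process has exited.
def classify_exit(exit_status, job_state)
  if exit_status == EXIT_TEMPFAIL
    # Temporary failure (for example, salloc lost the allocation race):
    # leave the job record alone.
    :leave_unchanged
  elsif job_state == "Running"
    # Any other exit code with the job still "Running" means an
    # unhandled error, so the job should be failed.
    :mark_failed
  else
    # The child already recorded a final state for the job.
    :no_action
  end
end

puts classify_exit(75, "Running")
puts classify_exit(1, "Running")
```

With salloc returning 75 on a failed allocation, a dispatcher that loses the race takes the first branch instead of the second.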
diff --git a/services/api/script/crunch-dispatch.rb b/services/api/script/crunch-dispatch.rb
index 249582e..7b3ed9e 100755
--- a/services/api/script/crunch-dispatch.rb
+++ b/services/api/script/crunch-dispatch.rb
@@ -53,6 +53,8 @@ end
class Dispatcher
include ApplicationHelper
+ EXIT_TEMPFAIL = 75
+
def initialize
@crunch_job_bin = (ENV['CRUNCH_JOB_BIN'] || `which arv-crunch-job`.strip)
if @crunch_job_bin.empty?
@@ -632,7 +634,7 @@ class Dispatcher
exit_status = j_done[:wait_thr].value.exitstatus
jobrecord = Job.find_by_uuid(job_done.uuid)
- if exit_status != 75 and jobrecord.state == "Running"
+ if exit_status != EXIT_TEMPFAIL and jobrecord.state == "Running"
# crunch-job did not return exit code 75 (see below) and left the job in
# the "Running" state, which means there was an unhandled error. Fail
# the job.
@@ -756,4 +758,10 @@ end
# This is how crunch-job child procs know where the "refresh" trigger file is
ENV["CRUNCH_REFRESH_TRIGGER"] = Rails.configuration.crunch_refresh_trigger
+# If salloc can't allocate resources immediately, make it use our temporary
+# failure exit code. This ensures crunch-dispatch won't mark a job failed
+# because of an issue with node allocation. This often happens when
+# another dispatcher wins the race to allocate nodes.
+ENV["SLURM_EXIT_IMMEDIATE"] = Dispatcher::EXIT_TEMPFAIL.to_s
+
Dispatcher.new.run
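
The last hunk works because child processes inherit the dispatcher's environment. A minimal sketch of that mechanism follows, assuming a stand-in child in place of salloc. The stand-in always exits with the configured code, whereas salloc(1) documents SLURM_EXIT_IMMEDIATE as applying only when the --immediate option is used and resources are not currently available. The `child_code` string and variable names are illustrative, not taken from crunch-dispatch.

```ruby
require "rbconfig"

EXIT_TEMPFAIL = 75

# Set once in the parent, as the patch does at the top level of the script.
ENV["SLURM_EXIT_IMMEDIATE"] = EXIT_TEMPFAIL.to_s

# Stand-in for salloc (an assumption for this sketch): a child Ruby
# process that exits with the code found in its inherited environment.
child_code = 'exit Integer(ENV.fetch("SLURM_EXIT_IMMEDIATE"))'
system(RbConfig.ruby, "-e", child_code)

# The parent reads the child's exit status and recognizes the
# temporary-failure code.
child_status = $?.exitstatus
temporary = (child_status == EXIT_TEMPFAIL)
puts "child exited #{child_status}, temporary failure: #{temporary}"
```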
-----------------------------------------------------------------------
hooks/post-receive
--