[ARVADOS] updated: 92052dc0b2f80a16420483a756fcebedc6c6ec3c

git at public.curoverse.com
Tue Apr 7 09:58:44 EDT 2015


Summary of changes:
 services/api/script/crunch-dispatch.rb | 10 +++++++++-
 1 file changed, 9 insertions(+), 1 deletion(-)

       via  92052dc0b2f80a16420483a756fcebedc6c6ec3c (commit)
       via  6520efd1dcf2c83ebe0b896d76167a8a89761931 (commit)
      from  a2fc3a2158c3c154c3d7fc2b55eea928898cafb5 (commit)

Those revisions listed above that are new to this repository have
not appeared on any other notification email; so we list those
revisions in full, below.


commit 92052dc0b2f80a16420483a756fcebedc6c6ec3c
Merge: a2fc3a2 6520efd
Author: Brett Smith <brett at curoverse.com>
Date:   Tue Apr 7 09:57:19 2015 -0400

    Merge branch '5352-crunch-dispatch-salloc-tempfail-wip'
    
    Closes #5352, #5670.


commit 6520efd1dcf2c83ebe0b896d76167a8a89761931
Author: Brett Smith <brett at curoverse.com>
Date:   Sun Apr 5 17:10:22 2015 -0400

    5352: crunch-dispatch treats node allocation failure as temporary.
    
    Imagine a scenario where multiple crunch-dispatch processes are
    sitting idle, then suddenly a new job appears in the queue.  They will
    all race to dispatch the job.  When this happens, we frequently see
    that salloc fails for most of them, because they all requested the
    same node(s) and only the winner will get them.  crunch-dispatch has
    no way to know the exit code "came from" salloc and not crunch-job,
    and so marks the job failed.
    
    This patch sets the SLURM_EXIT_IMMEDIATE environment variable to make
    salloc use exit code 75 when the allocation fails.  crunch-dispatch
    already recognizes this exit code as a temporary failure, and will
    leave the Arvados job record unchanged.  Refer to salloc(1) and the
    long comment in Dispatch#reap_children.

diff --git a/services/api/script/crunch-dispatch.rb b/services/api/script/crunch-dispatch.rb
index 249582e..7b3ed9e 100755
--- a/services/api/script/crunch-dispatch.rb
+++ b/services/api/script/crunch-dispatch.rb
@@ -53,6 +53,8 @@ end
 class Dispatcher
   include ApplicationHelper
 
+  EXIT_TEMPFAIL = 75
+
   def initialize
     @crunch_job_bin = (ENV['CRUNCH_JOB_BIN'] || `which arv-crunch-job`.strip)
     if @crunch_job_bin.empty?
@@ -632,7 +634,7 @@ class Dispatcher
     exit_status = j_done[:wait_thr].value.exitstatus
 
     jobrecord = Job.find_by_uuid(job_done.uuid)
-    if exit_status != 75 and jobrecord.state == "Running"
+    if exit_status != EXIT_TEMPFAIL and jobrecord.state == "Running"
       # crunch-job did not return exit code 75 (see below) and left the job in
       # the "Running" state, which means there was an unhandled error.  Fail
       # the job.
@@ -756,4 +758,10 @@ end
 # This is how crunch-job child procs know where the "refresh" trigger file is
 ENV["CRUNCH_REFRESH_TRIGGER"] = Rails.configuration.crunch_refresh_trigger
 
+# If salloc can't allocate resources immediately, make it use our temporary
+# failure exit code.  This ensures crunch-dispatch won't mark a job failed
+# because of an issue with node allocation.  This often happens when
+# another dispatcher wins the race to allocate nodes.
+ENV["SLURM_EXIT_IMMEDIATE"] = Dispatcher::EXIT_TEMPFAIL.to_s
+
 Dispatcher.new.run
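
The exit-code convention described in the commit message can be sketched
in isolation. This is a minimal illustration, not code from
crunch-dispatch.rb: the helper names temporary_failure? and
fake_salloc_exit_status are hypothetical, and a small sh command stands in
for salloc so the example runs without SLURM installed. As far as
salloc(1) describes it, SLURM_EXIT_IMMEDIATE applies when salloc is run
with --immediate and resources are not currently available.

```ruby
require "open3"

# Same value the patch assigns to Dispatcher::EXIT_TEMPFAIL.
EXIT_TEMPFAIL = 75

# Hypothetical helper (not in crunch-dispatch.rb): true when a child's exit
# code signals a temporary failure, so the job record should be left alone.
def temporary_failure?(exit_status)
  exit_status == EXIT_TEMPFAIL
end

# Hypothetical stand-in for salloc: a shell that exits with whatever code
# SLURM_EXIT_IMMEDIATE names, or 1 when the variable is unset.
def fake_salloc_exit_status(env)
  script = 'exit "${SLURM_EXIT_IMMEDIATE:-1}"'
  _output, status = Open3.capture2(env, "sh", "-c", script)
  status.exitstatus
end

# With the variable set, the child reports the temporary-failure code.
with_var = fake_salloc_exit_status("SLURM_EXIT_IMMEDIATE" => EXIT_TEMPFAIL.to_s)
# A nil value removes the variable from the child's environment.
without_var = fake_salloc_exit_status("SLURM_EXIT_IMMEDIATE" => nil)

puts "with variable:    exit #{with_var}, tempfail=#{temporary_failure?(with_var)}"
puts "without variable: exit #{without_var}, tempfail=#{temporary_failure?(without_var)}"
```

Setting the variable once in the dispatcher's own environment, as the
patch does, means every salloc child inherits it without any change to
the command line that launches the job.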

-----------------------------------------------------------------------


hooks/post-receive