[ARVADOS] updated: faff88f90f5621650b0cbc020261433d2de977b3

git at public.curoverse.com
Fri Jun 19 17:38:35 EDT 2015


Summary of changes:
 services/api/script/crunch-dispatch.rb | 13 ++++++++++---
 1 file changed, 10 insertions(+), 3 deletions(-)

       via  faff88f90f5621650b0cbc020261433d2de977b3 (commit)
      from  9fbcc04f89181992876f16baad6162396aa7c3f0 (commit)

Those revisions listed above that are new to this repository have
not appeared on any other notification email; so we list those
revisions in full, below.


commit faff88f90f5621650b0cbc020261433d2de977b3
Author: Brett Smith <brett at curoverse.com>
Date:   Fri Jun 19 17:38:32 2015 -0400

    4410: crunch-dispatch fixups from code review.

diff --git a/services/api/script/crunch-dispatch.rb b/services/api/script/crunch-dispatch.rb
index 439bddb..c78b2dc 100755
--- a/services/api/script/crunch-dispatch.rb
+++ b/services/api/script/crunch-dispatch.rb
@@ -691,12 +691,11 @@ class Dispatcher
         end
       end
     else
-      # Don't fail the job if crunch-job didn't even get as far as
-      # starting it. If the job failed to run due to an infrastructure
+      # If the job failed to run due to an infrastructure
       # issue with crunch-job or slurm, we want the job to stay in the
       # queue. If crunch-job exited after losing a race to another
       # crunch-job process, it exits 75 and we should leave the job
-      # record alone so the winner of the race do its thing.
+      # record alone so the winner of the race can do its thing.
       # If crunch-job exited after all of its allocated nodes failed,
       # it exits 93, and we want to retry it later (see the
       # EXIT_RETRY_UNLOCKED `if` block).
@@ -767,6 +766,14 @@ class Dispatcher
       select(@running.values.collect { |j| [j[:stdout], j[:stderr]] }.flatten,
              [], [], 1)
     end
+    # If there are jobs we wanted to retry, we have to mark them as failed now.
+    # Other dispatchers can't pick them up because we hold their lock.
+    @todo_job_retries.each_key do |job_uuid|
+      job = Job.find_by_uuid(job_uuid)
+      if job.state == "Running"
+        fail_job(job, "crunch-dispatch was stopped during job's tempfail retry loop")
+      end
+    end
   end
 
   protected

-----------------------------------------------------------------------
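
As a reading aid for the comment touched in the first hunk, here is a minimal sketch of how the exit statuses it describes could be mapped to actions. Only the values 75 and 93 and the name EXIT_RETRY_UNLOCKED come from the diff. The constant EXIT_LOST_RACE, the method action_for_exit, and the returned symbols are hypothetical names chosen for illustration, and this is not the code in crunch-dispatch.rb.

```ruby
# Sketch only. The values 75 and 93 and the name EXIT_RETRY_UNLOCKED appear
# in the diff; every other name below is a hypothetical stand-in.
EXIT_LOST_RACE = 75        # hypothetical name: crunch-job lost a race to another crunch-job
EXIT_RETRY_UNLOCKED = 93   # all allocated nodes failed; retry the job later

# Map a crunch-job exit status to the action the comment describes.
def action_for_exit(exit_code)
  case exit_code
  when EXIT_LOST_RACE
    # Leave the job record alone so the winner of the race can do its thing.
    :leave_job_record_alone
  when EXIT_RETRY_UNLOCKED
    # Retry later, as in the EXIT_RETRY_UNLOCKED `if` block the comment mentions.
    :retry_later
  else
    # Handling of other statuses is outside the lines shown in this excerpt.
    :not_covered_by_this_excerpt
  end
end

puts action_for_exit(75)
puts action_for_exit(93)
```

The second hunk covers the case where the dispatcher stops while such retries are still pending: it still holds the jobs' locks, so it marks any job still in the "Running" state as failed rather than leaving it for another dispatcher.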