[ARVADOS] updated: faff88f90f5621650b0cbc020261433d2de977b3
git at public.curoverse.com
Fri Jun 19 17:38:35 EDT 2015
Summary of changes:
services/api/script/crunch-dispatch.rb | 13 ++++++++++---
1 file changed, 10 insertions(+), 3 deletions(-)
via faff88f90f5621650b0cbc020261433d2de977b3 (commit)
from 9fbcc04f89181992876f16baad6162396aa7c3f0 (commit)
Those revisions listed above that are new to this repository have
not appeared on any other notification email; so we list those
revisions in full, below.
commit faff88f90f5621650b0cbc020261433d2de977b3
Author: Brett Smith <brett at curoverse.com>
Date: Fri Jun 19 17:38:32 2015 -0400
4410: crunch-dispatch fixups from code review.
diff --git a/services/api/script/crunch-dispatch.rb b/services/api/script/crunch-dispatch.rb
index 439bddb..c78b2dc 100755
--- a/services/api/script/crunch-dispatch.rb
+++ b/services/api/script/crunch-dispatch.rb
@@ -691,12 +691,11 @@ class Dispatcher
end
end
else
- # Don't fail the job if crunch-job didn't even get as far as
- # starting it. If the job failed to run due to an infrastructure
+ # If the job failed to run due to an infrastructure
# issue with crunch-job or slurm, we want the job to stay in the
# queue. If crunch-job exited after losing a race to another
# crunch-job process, it exits 75 and we should leave the job
- # record alone so the winner of the race do its thing.
+ # record alone so the winner of the race can do its thing.
# If crunch-job exited after all of its allocated nodes failed,
# it exits 93, and we want to retry it later (see the
# EXIT_RETRY_UNLOCKED `if` block).
@@ -767,6 +766,14 @@ class Dispatcher
select(@running.values.collect { |j| [j[:stdout], j[:stderr]] }.flatten,
[], [], 1)
end
+ # If there are jobs we wanted to retry, we have to mark them as failed now.
+ # Other dispatchers can't pick them up because we hold their lock.
+ @todo_job_retries.each_key do |job_uuid|
+ job = Job.find_by_uuid(job_uuid)
+ if job.state == "Running"
+ fail_job(job, "crunch-dispatch was stopped during job's tempfail retry loop")
+ end
+ end
end
protected
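
A minimal sketch of the behavior the comments in this diff describe, not the dispatcher's actual code: exit 75 means leave the job record alone, exit 93 means retry later, and on shutdown any job still "Running" in the retry list is marked failed. EXIT_RETRY_UNLOCKED is named in the comment above; EXIT_TEMPFAIL, action_for_exit, FakeJob, and fail_pending_retries are hypothetical names for illustration, and the nil check is a defensive addition the commit itself does not make.

```ruby
# Exit codes taken from the comment in the diff. The name EXIT_TEMPFAIL is
# an assumption; EXIT_RETRY_UNLOCKED is the name the comment uses.
EXIT_TEMPFAIL = 75        # crunch-job lost a race to another crunch-job
EXIT_RETRY_UNLOCKED = 93  # all allocated nodes failed; retry later

# Map a crunch-job exit code to the action the comment describes.
# (Hypothetical helper; the real dispatcher inlines this logic.)
def action_for_exit(exit_code)
  case exit_code
  when EXIT_TEMPFAIL then :leave_alone
  when EXIT_RETRY_UNLOCKED then :retry_later
  else :other
  end
end

# Stand-in for the Job model so the sketch runs without Rails.
FakeJob = Struct.new(:uuid, :state)

# On shutdown, fail every job that is still "Running" and was waiting for
# a retry, since no other dispatcher can pick it up while it is locked.
# Returns the UUIDs of the jobs it marked as failed.
def fail_pending_retries(todo_job_retries, jobs_by_uuid)
  failed = []
  todo_job_retries.each_key do |job_uuid|
    job = jobs_by_uuid[job_uuid]
    # Skip unknown jobs (defensive; not in the commit) and non-running ones.
    next if job.nil? || job.state != "Running"
    job.state = "Failed"
    failed << job.uuid
  end
  failed
end

jobs = {
  "job-a" => FakeJob.new("job-a", "Running"),
  "job-b" => FakeJob.new("job-b", "Complete"),
}
retries = { "job-a" => Time.now, "job-b" => Time.now, "job-c" => Time.now }
FAILED_UUIDS = fail_pending_retries(retries, jobs)
```

Only job-a is failed here: job-b has already left the "Running" state, and job-c has no record to update.
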
-----------------------------------------------------------------------
hooks/post-receive
--