[ARVADOS] updated: b54478ea1b7c8aaeaf565d591f32769bcdc09b8f
Git user
git at public.curoverse.com
Mon Sep 12 21:42:59 EDT 2016
Summary of changes:
sdk/cli/bin/crunch-job | 34 +++++++++++++++++++++++++++++++++-
1 file changed, 33 insertions(+), 1 deletion(-)
via b54478ea1b7c8aaeaf565d591f32769bcdc09b8f (commit)
via e51906ca834222fa0e85d01568507a39af4fde36 (commit)
via 19ad5c59b064088c58136f5387fdf029b754ee36 (commit)
from e636f4ec762410391aa8df7502468f98612ebb42 (commit)
Those revisions listed above that are new to this repository have
not appeared on any other notification email; so we list those
revisions in full, below.
commit b54478ea1b7c8aaeaf565d591f32769bcdc09b8f
Merge: e636f4e e51906c
Author: Peter Amstutz <peter.amstutz at curoverse.com>
Date: Mon Sep 12 21:42:53 2016 -0400
Merge branch '10004-check-sinfo' closes #10004
commit e51906ca834222fa0e85d01568507a39af4fde36
Author: Peter Amstutz <peter.amstutz at curoverse.com>
Date: Mon Sep 12 21:42:26 2016 -0400
10004: Add comment documenting reason why check_sinfo is needed.
diff --git a/sdk/cli/bin/crunch-job b/sdk/cli/bin/crunch-job
index 48f9669..e0aff31 100755
--- a/sdk/cli/bin/crunch-job
+++ b/sdk/cli/bin/crunch-job
@@ -1404,12 +1404,20 @@ sub check_squeue
sub check_sinfo
{
- my $last_sinfo_check = $sinfo_checked;
+ # If a node fails in a multi-node "srun" call during job setup, the call
+ # may hang instead of exiting with a nonzero code. This function checks
+ # "sinfo" for the health of the nodes that were allocated and ensures that
+ # they are all still in the "alloc" state. If a node that is allocated to
+ # this job is not in "alloc" state, then set please_freeze.
+ #
+ # This is only called from srun_sync() for node configuration. If a
+ # node fails doing actual work, there are other recovery mechanisms.
# Do not call `sinfo` more than once every 15 seconds.
- return if $last_sinfo_check > time - 15;
+ return if $sinfo_checked > time - 15;
$sinfo_checked = time;
+ # The output format "%t" means output node states.
my @sinfo = `sinfo --nodes=\Q$ENV{SLURM_NODELIST}\E --noheader -o "%t"`;
if ($? != 0)
{
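The comment added in this commit describes two behaviours: polling `sinfo` at most once every 15 seconds, and treating any allocated node that is not in the "alloc" state as a reason to freeze the job. The following is a minimal standalone sketch of that logic, not the code in crunch-job. The names `unhealthy_states`, `should_check`, and `$last_checked` are hypothetical, and the sample state lines are hard-coded so the sketch runs without Slurm.

```perl
use strict;
use warnings;

# Hypothetical helper (not part of crunch-job): given the lines printed by
# `sinfo --noheader -o "%t"`, return every state that is not exactly "alloc".
sub unhealthy_states {
    my @lines = @_;
    my @bad;
    for my $line (@lines) {
        (my $state = $line) =~ s/^\s+|\s+$//g;   # trim whitespace and newline
        next if $state eq '';                    # skip blank lines
        push @bad, $state if $state ne 'alloc';
    }
    return @bad;
}

# Hypothetical throttle mirroring the "once every 15 seconds" guard:
# returns 1 (and records the time) only if the previous check is old enough.
my $last_checked = 0;
sub should_check {
    my ($now) = @_;
    return 0 if $last_checked > $now - 15;
    $last_checked = $now;
    return 1;
}

# Sample input standing in for real sinfo output (assumption for illustration).
my @bad = unhealthy_states("alloc\n", "down\n", "alloc\n");
print "would freeze, unexpected node states: @bad\n" if @bad;
```

In the real function the equivalent of a non-empty `@bad` would lead to setting `please_freeze`, as the comment describes; how crunch-job reports or handles that is not shown in this hunk.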
-----------------------------------------------------------------------
hooks/post-receive