[ARVADOS] updated: b54478ea1b7c8aaeaf565d591f32769bcdc09b8f

Git user git at public.curoverse.com
Mon Sep 12 21:42:59 EDT 2016


Summary of changes:
 sdk/cli/bin/crunch-job | 34 +++++++++++++++++++++++++++++++++-
 1 file changed, 33 insertions(+), 1 deletion(-)

       via  b54478ea1b7c8aaeaf565d591f32769bcdc09b8f (commit)
       via  e51906ca834222fa0e85d01568507a39af4fde36 (commit)
       via  19ad5c59b064088c58136f5387fdf029b754ee36 (commit)
      from  e636f4ec762410391aa8df7502468f98612ebb42 (commit)

The revisions listed above that are new to this repository have
not appeared in any other notification email, so they are listed
in full below.


commit b54478ea1b7c8aaeaf565d591f32769bcdc09b8f
Merge: e636f4e e51906c
Author: Peter Amstutz <peter.amstutz at curoverse.com>
Date:   Mon Sep 12 21:42:53 2016 -0400

    Merge branch '10004-check-sinfo' closes #10004


commit e51906ca834222fa0e85d01568507a39af4fde36
Author: Peter Amstutz <peter.amstutz at curoverse.com>
Date:   Mon Sep 12 21:42:26 2016 -0400

    10004: Add comment documenting reason why check_sinfo is needed.

diff --git a/sdk/cli/bin/crunch-job b/sdk/cli/bin/crunch-job
index 48f9669..e0aff31 100755
--- a/sdk/cli/bin/crunch-job
+++ b/sdk/cli/bin/crunch-job
@@ -1404,12 +1404,20 @@ sub check_squeue
 
 sub check_sinfo
 {
-  my $last_sinfo_check = $sinfo_checked;
+  # If a node fails in a multi-node "srun" call during job setup, the call
+  # may hang instead of exiting with a nonzero code.  This function checks
+  # "sinfo" for the health of the nodes that were allocated and ensures that
+  # they are all still in the "alloc" state.  If a node that is allocated to
+  # this job is not in "alloc" state, then set please_freeze.
+  #
+  # This is only called from srun_sync() for node configuration.  If a
+  # node fails doing actual work, there are other recovery mechanisms.
 
   # Do not call `sinfo` more than once every 15 seconds.
-  return if $last_sinfo_check > time - 15;
+  return if $sinfo_checked > time - 15;
   $sinfo_checked = time;
 
+  # The output format "%t" means output node states.
   my @sinfo = `sinfo --nodes=\Q$ENV{SLURM_NODELIST}\E --noheader -o "%t"`;
   if ($? != 0)
   {

-----------------------------------------------------------------------
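
The comment added in this commit describes a health check: read one node
state per line from "sinfo", and treat any state other than "alloc" as a
reason to set please_freeze, without polling more often than every 15
seconds. The sketch below only illustrates that logic and is not the code
in crunch-job, whose remainder is not shown in this diff. The helper names
unhealthy_states and too_soon are hypothetical, and the exact-match test
against "alloc" is an assumption taken from the wording of the comment.

```perl
use strict;
use warnings;

# Hypothetical helper (not part of crunch-job): given the lines printed by
# `sinfo --noheader -o "%t"`, return every state that is not exactly "alloc".
sub unhealthy_states {
    my @lines = @_;
    my @bad;
    foreach my $line (@lines) {
        (my $state = $line) =~ s/^\s+|\s+$//g;   # strip newline and padding
        next if $state eq '';                    # ignore blank lines
        push @bad, $state if $state ne 'alloc';  # assumption: only "alloc" is healthy
    }
    return @bad;
}

# Hypothetical throttle test mirroring the 15-second guard in the diff:
# true when the previous check happened less than $min_interval seconds ago.
sub too_soon {
    my ($last_checked, $now, $min_interval) = @_;
    return $last_checked > $now - $min_interval;
}

# Example with canned output instead of a real `sinfo` call.
my @bad = unhealthy_states("alloc\n", "down*\n", "alloc\n");
print "would set please_freeze: @bad\n" if @bad;
```

Keeping the parsing separate from the backtick call makes the state check
easy to exercise with canned input, without a Slurm allocation.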


hooks/post-receive
-- 



