[ARVADOS] updated: c1ebef70f3b66080b51ef700383f44d70736f495

Git user git at public.curoverse.com
Fri Nov 18 15:15:12 EST 2016


Summary of changes:
 sdk/cli/bin/crunch-job | 6 +++---
 1 file changed, 3 insertions(+), 3 deletions(-)

       via  c1ebef70f3b66080b51ef700383f44d70736f495 (commit)
       via  5977b70a38e7102a6a369074897af990944c8934 (commit)
      from  1071e1163f894c2a73df76cd400d102748e5281d (commit)

Those revisions listed above that are new to this repository have
not appeared on any other notification email; so we list those
revisions in full, below.


commit c1ebef70f3b66080b51ef700383f44d70736f495
Merge: 1071e11 5977b70
Author: Tom Clegg <tom at curoverse.com>
Date:   Fri Nov 18 15:14:47 2016 -0500

    Merge branch '10470-slurm-error'
    
    refs #10470


commit 5977b70a38e7102a6a369074897af990944c8934
Author: Tom Clegg <tom at curoverse.com>
Date:   Fri Nov 18 14:04:53 2016 -0500

    10470: Recognize more slurm error messages.
    
    Example from slurm 14.03.9:
    
    srun: error: _server_read: fd 12 got error or unexpected eof reading header
    srun: error: step_launch_notify_io_failure: aborting, io error with slurmstepd on node 0
    srun: Job step aborted: Waiting up to 2 seconds for job step to finish.

diff --git a/sdk/cli/bin/crunch-job b/sdk/cli/bin/crunch-job
index be14be9..3587436 100755
--- a/sdk/cli/bin/crunch-job
+++ b/sdk/cli/bin/crunch-job
@@ -1509,7 +1509,7 @@ sub preprocess_stderr
     my $line = $1;
     substr $jobstep[$jobstepidx]->{stderr}, 0, 1+length($line), "";
     Log ($jobstepidx, "stderr $line");
-    if ($line =~ /srun: error: (SLURM job $ENV{SLURM_JOB_ID} has expired|Unable to confirm allocation for job $ENV{SLURM_JOB_ID})/) {
+    if ($line =~ /srun: error: (SLURM job $ENV{SLURM_JOB_ID} has expired|Unable to confirm allocation for job $ENV{SLURM_JOB_ID})/i) {
       # If the allocation is revoked, we can't possibly continue, so mark all
       # nodes as failed.  This will cause the overall exit code to be
       # EX_RETRY_UNLOCKED instead of failure so that crunch_dispatch can re-run
@@ -1519,14 +1519,14 @@ sub preprocess_stderr
         $st->{node}->{fail_count}++;
       }
     }
-    elsif ($line =~ /srun: error: (Node failure on|Aborting, .*\bio error\b)/) {
+    elsif ($line =~ /srun: error: .*?\b(Node failure on|Aborting, .*?\bio error\b)/i) {
       $jobstep[$jobstepidx]->{tempfail} = 1;
       if (defined($job_slot_index)) {
         $slot[$job_slot_index]->{node}->{fail_count}++;
         ban_node_by_slot($job_slot_index);
       }
     }
-    elsif ($line =~ /srun: error: (Unable to create job step|.*: Communication connection failure)/) {
+    elsif ($line =~ /srun: error: (Unable to create job step|.*?: Communication connection failure)/i) {
       $jobstep[$jobstepidx]->{tempfail} = 1;
       ban_node_by_slot($job_slot_index) if (defined($job_slot_index));
     }

-----------------------------------------------------------------------
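
The effect of the widened patterns can be checked in isolation with a
minimal sketch like the one below. It is not part of crunch-job:
classify_srun_line, old_node_failure_match, the returned labels, and the
job ID 12345 are hypothetical names and values chosen for illustration.
The job ID is passed in and quoted with \Q...\E instead of being read
from $ENV{SLURM_JOB_ID} as the real code does. The three patterns are
otherwise copied from the patched preprocess_stderr, and the sample line
is the slurm 14.03.9 message quoted in the commit message.

```perl
use strict;
use warnings;

# Hypothetical helper (not in crunch-job): classify one stderr line using
# the three patterns from the patched preprocess_stderr. The labels are
# illustrative only; the real code sets tempfail / fail_count and bans nodes.
sub classify_srun_line {
    my ($line, $job_id) = @_;
    if ($line =~ /srun: error: (SLURM job \Q$job_id\E has expired|Unable to confirm allocation for job \Q$job_id\E)/i) {
        return 'allocation_revoked';
    }
    elsif ($line =~ /srun: error: .*?\b(Node failure on|Aborting, .*?\bio error\b)/i) {
        return 'node_failure';
    }
    elsif ($line =~ /srun: error: (Unable to create job step|.*?: Communication connection failure)/i) {
        return 'tempfail';
    }
    return 'unrecognized';
}

# Pre-patch version of the second pattern, kept for comparison: it is
# case-sensitive and requires "Aborting," directly after "srun: error: ".
sub old_node_failure_match {
    my ($line) = @_;
    return $line =~ /srun: error: (Node failure on|Aborting, .*\bio error\b)/ ? 1 : 0;
}

# Sample line taken from the commit message above.
my $example = 'srun: error: step_launch_notify_io_failure: aborting, io error with slurmstepd on node 0';

printf "old pattern matches: %d\n", old_node_failure_match($example);
printf "new classification:  %s\n", classify_srun_line($example, '12345');
```

The old pattern misses the sample line for two reasons: "aborting" is
lowercase, and a function-name prefix sits between "srun: error: " and
the message. The added /i flag addresses the first and the leading
".*?\b" addresses the second.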


hooks/post-receive
-- 