[ARVADOS] created: 9c6540b9d42adc4a397a28be1ac23f357ba14ab5

Git user git at public.curoverse.com
Mon Aug 7 09:59:40 EDT 2017


        at  9c6540b9d42adc4a397a28be1ac23f357ba14ab5 (commit)


commit 9c6540b9d42adc4a397a28be1ac23f357ba14ab5
Author: Tom Clegg <tom at curoverse.com>
Date:   Mon Aug 7 09:58:04 2017 -0400

    12027: Recognize a new "node failed" error message.
    
    "srun: error: Cannot communicate with node 0.  Aborting job."
    
    Arvados-DCO-1.1-Signed-off-by: Tom Clegg <tom at curoverse.com>

diff --git a/sdk/cli/bin/crunch-job b/sdk/cli/bin/crunch-job
index 5a92176..5e6c3a0 100755
--- a/sdk/cli/bin/crunch-job
+++ b/sdk/cli/bin/crunch-job
@@ -1544,7 +1544,7 @@ sub preprocess_stderr
         $st->{node}->{fail_count}++;
       }
     }
-    elsif ($line =~ /srun: error: .*?\b(Node failure on|Aborting, .*?\bio error\b)/i) {
+    elsif ($line =~ /srun: error: .*?\b(Node failure on|Aborting, .*?\bio error\b|cannot communicate with node .* aborting job)/i) {
       $jobstep[$jobstepidx]->{tempfail} = 1;
       if (defined($job_slot_index)) {
         $slot[$job_slot_index]->{node}->{fail_count}++;

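As a quick check of the change above, the following standalone sketch compares the old and new patterns against the srun message quoted in the commit message. Both patterns are copied from the "-" and "+" lines of the diff. The helper name is_node_failure and the second and third sample lines are hypothetical, made up for illustration; they are not part of crunch-job.

```perl
#!/usr/bin/perl
use strict;
use warnings;

# Pattern from the "-" line of the diff (before this commit).
my $old_re = qr/srun: error: .*?\b(Node failure on|Aborting, .*?\bio error\b)/i;

# Pattern from the "+" line of the diff (after this commit).
my $new_re = qr/srun: error: .*?\b(Node failure on|Aborting, .*?\bio error\b|cannot communicate with node .* aborting job)/i;

# Hypothetical helper, not part of crunch-job: returns 1 if $line
# matches the given pattern, 0 otherwise.
sub is_node_failure {
    my ($line, $re) = @_;
    return $line =~ $re ? 1 : 0;
}

my @samples = (
    # Message quoted in the commit message.
    "srun: error: Cannot communicate with node 0.  Aborting job.",
    # Illustrative lines (assumptions, not taken from real logs).
    "srun: error: Node failure on compute3",
    "srun: error: Unable to create job step",
);

for my $line (@samples) {
    printf "old=%d new=%d  %s\n",
        is_node_failure($line, $old_re),
        is_node_failure($line, $new_re),
        $line;
}
```

Because of the /i flag, the lowercase alternative in the new pattern still matches the capitalized "Cannot" and "Aborting" in the real message, and the ".*" between "node" and "aborting job" absorbs the node number and the punctuation that follows it.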
-----------------------------------------------------------------------


hooks/post-receive
-- 



