<div dir="ltr">First off, thanks to Tom for some help on this issue. However we were not able to resolve it. It looks like the keepproxy is closing the connection with the keepstore early and there does not appear to be anything in the logs to indicate why.<div><br></div><div>If anyone can provide some insight into this problem, it would be greatly appreciated.</div><div><br></div><div>We have been using the arv-get tool to fetch the data. Our keepproxy is at 172.17.0.227, the keepstore is at 172.17.0.30, and we have a reverse HTTP proxy at 172.17.0.231. So connections between 172.17.0.231 <-> 172.17.0.227 are in essence between the client (arv-get) and the keepproxy. Connections between 172.17.0.227 <-> 172.17.0.30 are between the keepproxy and the keepstore.</div><div><br></div><div>I ran a single TCP dump on the keepstore system, so that we could see both conversations. Also the times are on the same scale, so we could compare events between the two conversations.</div><div><br></div><div>Here is the end of the conversation between the arv-get (172.17.0.231) and keepproxy (172.17.0.227)</div><div><br></div><div><br></div><div><img src="cid:ii_15ebf69db279d0b5" alt="Inline image 1" style="margin-right: 25px;"><br><div><br></div><div>Here is the end of the conversation between keepproxy (172.17.0.227) and keepstore (172.17.0.30)</div><div><br></div><div><img src="cid:ii_15ebf6deb04233cd" alt="Inline image 2" style="margin-right: 0px;"><br></div><div><br><div><br></div><div>From the tcpdump, it appears that the keepproxy sends a RST to the keepstore at 15.002 sec and then at 15.067 sec the keepproxy closes the connection with arv-get.</div><div><br></div><div>Also here is the output from the arv-get command, if this helps.<br></div><div><br></div><div><div style="font-size:12.8px"># arv-get --no-progress --skip-existing -r pg5mc-4zz18-tcq3i3k63n536ja /mnt/lustre/gchlip2/Ridlon_<wbr>Aug2017/mm10_work/</div><div style="font-size:12.8px"><br></div><div style="font-size:12.8px">Traceback (most recent call last):<br></div><div style="font-size:12.8px"> File "/mnt/store3/clustcrilab/<wbr>tools/arvados/bin/arv-get", line 212, in <module></div><div style="font-size:12.8px"> for data in f.readall():</div><div style="font-size:12.8px"> File "/mnt/store3/clustcrilab/<wbr>tools/arvados/lib/python2.7/<wbr>site-packages/arvados/arvfile.<wbr>py", line 108, in readall</div><div style="font-size:12.8px"> data = self.read(size, num_retries=num_retries)</div><div style="font-size:12.8px"> File "/mnt/store3/clustcrilab/<wbr>tools/arvados/lib/python2.7/<wbr>site-packages/arvados/arvfile.<wbr>py", line 59, in before_close_wrapper</div><div style="font-size:12.8px"> return orig_func(self, *args, **kwargs)</div><div style="font-size:12.8px"> File "/mnt/store3/clustcrilab/<wbr>tools/arvados/lib/python2.7/<wbr>site-packages/arvados/retry.<wbr>py", line 158, in num_retries_setter</div><div style="font-size:12.8px"> return orig_func(self, *args, **kwargs)</div><div style="font-size:12.8px"> File "/mnt/store3/clustcrilab/<wbr>tools/arvados/lib/python2.7/<wbr>site-packages/arvados/arvfile.<wbr>py", line 217, in read</div><div style="font-size:12.8px"> num_retries=num_retries)</div><div style="font-size:12.8px"> File "/mnt/store3/clustcrilab/<wbr>tools/arvados/lib/python2.7/<wbr>site-packages/arvados/retry.<wbr>py", line 158, in num_retries_setter</div><div style="font-size:12.8px"> return orig_func(self, *args, **kwargs)</div><div style="font-size:12.8px"> File "/mnt/store3/clustcrilab/<wbr>tools/arvados/lib/python2.7/<wbr>site-packages/arvados/stream.<wbr>py", line 85, in readfrom</div><div style="font-size:12.8px"> data.append(self._keepget(lr.<wbr>locator, num_retries=num_retries)[lr.<wbr>segment_offset:lr.segment_<wbr>offset+lr.segment_size])</div><div style="font-size:12.8px"> File "/mnt/store3/clustcrilab/<wbr>tools/arvados/lib/python2.7/<wbr>site-packages/arvados/retry.<wbr>py", line 158, in num_retries_setter</div><div style="font-size:12.8px"> return orig_func(self, *args, **kwargs)</div><div style="font-size:12.8px"> File "/mnt/store3/clustcrilab/<wbr>tools/arvados/lib/python2.7/<wbr>site-packages/arvados/stream.<wbr>py", line 74, in _keepget</div><div style="font-size:12.8px"> return self._keep.get(locator, num_retries=num_retries)</div><div style="font-size:12.8px"> File "/mnt/store3/clustcrilab/<wbr>tools/arvados/lib/python2.7/<wbr>site-packages/arvados/retry.<wbr>py", line 158, in num_retries_setter</div><div style="font-size:12.8px"> return orig_func(self, *args, **kwargs)</div><div style="font-size:12.8px"> File "/mnt/store3/clustcrilab/<wbr>tools/arvados/lib/python2.7/<wbr>site-packages/arvados/keep.py"<wbr>, line 926, in get</div><div style="font-size:12.8px"> return self._get_or_head(loc_s, method="GET", num_retries=num_retries)</div><div style="font-size:12.8px"> File "/mnt/store3/clustcrilab/<wbr>tools/arvados/lib/python2.7/<wbr>site-packages/arvados/keep.py"<wbr>, line 1039, in _get_or_head</div><div style="font-size:12.8px"> "failed to read {}".format(loc_s), service_errors, label="service")</div><div style="font-size:12.8px">arvados.errors.KeepReadError: failed to read 9c4ee2298ab88506ae8a965e6d3f37<wbr>f9+67108864+<wbr>A3495ce0a76bfceba6a21a77bd2f0e<wbr>e0253fbb4f5@59cc0527: service <a href="https://arvados-keep.cri.uic.edu/" target="_blank">https://arvados-keep.cri.uic.<wbr>edu:443/</a> responded with 0 (18, 'tra</div><div style="font-size:12.8px">nsfer closed with 21204434 bytes remaining to read')</div></div><div><br></div><div>After the initial problem had cropped up, we did have a separate issue (kernel update) that required us to reboot the system. After reboot, the issue disappeared. It seemed to be fine for a while, however it has started to occur again. The issue appears to be intermittent. I have restarted the keepproxy service. It worked for a while (few minutes) then the issue started to occur again. I have looked through the kernel logs (dmesg) and there are no errors and everything else on the system seems to be fine. We also run a keep-web service on the same machine as the keepproxy. There are no issues with the keep-web service. It would seem that this problem is isolated to the keepproxy service. Although, it is not clear why a reboot cleared the problem but a restart of the service will not. I would expect that restarting the keepproxy would be the same as a reboot from the service's standpoint.</div></div></div><div><br></div><div>Also the system appears to have sufficient resources. We have allocated 4 cores to this system and 8 GB memory. Here is a snapshot of the top summary while performing an arv-get. It does not look like the keepproxy is taxing the system too much.</div><div><div><br></div><div>%Cpu(s): 0.4 us, 0.4 sy, 0.0 ni, 99.1 id, 0.0 wa, 0.0 hi, 0.1 si, 0.0 st</div><div>GiB Mem : 7.638 total, 5.957 free, 0.830 used, 0.850 buff/cache</div><div>GiB Swap: 1.000 total, 1.000 free, 0.000 used. 6.394 avail Mem</div></div><div><br></div><div><div> PID USER PR NI VIRT RES SHR S %CPU %MEM TIME+ COMMAND</div><div> 6488 root 20 0 548.6m 13.2m 4.1m S 3.0 0.2 0:20.49 keepproxy</div></div><div><br></div><div>Any help would be greatly appreciated. Thanks!<br></div><div><br></div></div><div class="gmail_extra"><br clear="all"><div><div class="gmail_signature" data-smartmail="gmail_signature"><div dir="ltr"><div><br>George Chlipala, Ph.D.<br>Associate Director, Core for Research Informatics<br>Research Resources Center<br>University of Illinois at Chicago<br><br>phone: 312-413-1700<br>email: <a href="mailto:gchlip2@uic.edu" target="_blank">gchlip2@uic.edu</a></div></div></div></div>
<br><div class="gmail_quote">On Wed, Sep 13, 2017 at 11:31 AM, George Chlipala <span dir="ltr"><<a href="mailto:gchlip2@uic.edu" target="_blank">gchlip2@uic.edu</a>></span> wrote:<br><blockquote class="gmail_quote" style="margin:0 0 0 .8ex;border-left:1px #ccc solid;padding-left:1ex"><div dir="ltr">To whom it may concern -<div><br></div><div>We are experiencing a large number of "Reader failed checksum" errors when using the keepproxy and I cannot seem to figure out why.</div><div><br></div><div>Here is an snippet from our log for the keepproxy.</div><div><br></div><div><div>Sep 13 11:21:12 arvados keepproxy: 2017/09/13 11:21:12 128.248.171.19,<a href="http://172.17.0.231:52236" target="_blank">172.17.0.231:<wbr>52236</a> GET /<wbr>fe194b0a4576d5c8d9d0a2055b132b<wbr>60+67108864+<wbr>Ab7cbd46f674287a7e906d009c1822<wbr>7475a83edbb@59cbcfe9 200 67108864 34647446 <a href="http://arvados-keep01.cri.local:25107/fe194b0a4576d5c8d9d0a2055b132b60+67108864+Ab7cbd46f674287a7e906d009c18227475a83edbb@59cbcfe9" target="_blank">http://arvados-keep01.cri.<wbr>local:25107/<wbr>fe194b0a4576d5c8d9d0a2055b132b<wbr>60+67108864+<wbr>Ab7cbd46f674287a7e906d009c1822<wbr>7475a83edbb@59cbcfe9</a> Reader failed checksum</div><div>Sep 13 11:21:29 arvados keepproxy: 2017/09/13 11:21:29 128.248.171.19,<a href="http://172.17.0.231:51688" target="_blank">172.17.0.231:<wbr>51688</a> GET /<wbr>fe194b0a4576d5c8d9d0a2055b132b<wbr>60+67108864+<wbr>Ab7cbd46f674287a7e906d009c1822<wbr>7475a83edbb@59cbcfe9 200 67108864 30612102 <a href="http://arvados-keep01.cri.local:25107/fe194b0a4576d5c8d9d0a2055b132b60+67108864+Ab7cbd46f674287a7e906d009c18227475a83edbb@59cbcfe9" target="_blank">http://arvados-keep01.cri.<wbr>local:25107/<wbr>fe194b0a4576d5c8d9d0a2055b132b<wbr>60+67108864+<wbr>Ab7cbd46f674287a7e906d009c1822<wbr>7475a83edbb@59cbcfe9</a> Reader failed checksum</div><div>Sep 13 11:21:44 arvados keepproxy: 2017/09/13 11:21:44 128.248.171.19,<a href="http://172.17.0.231:50472" target="_blank">172.17.0.231:<wbr>50472</a> GET /<wbr>fe194b0a4576d5c8d9d0a2055b132b<wbr>60+67108864+<wbr>Ab7cbd46f674287a7e906d009c1822<wbr>7475a83edbb@59cbcfe9 200 67108864 34956478 <a href="http://arvados-keep01.cri.local:25107/fe194b0a4576d5c8d9d0a2055b132b60+67108864+Ab7cbd46f674287a7e906d009c18227475a83edbb@59cbcfe9" target="_blank">http://arvados-keep01.cri.<wbr>local:25107/<wbr>fe194b0a4576d5c8d9d0a2055b132b<wbr>60+67108864+<wbr>Ab7cbd46f674287a7e906d009c1822<wbr>7475a83edbb@59cbcfe9</a> Reader failed checksum</div><div>Sep 13 11:22:00 arvados keepproxy: 2017/09/13 11:22:00 128.248.171.19,<a href="http://172.17.0.231:52286" target="_blank">172.17.0.231:<wbr>52286</a> GET /<wbr>fe194b0a4576d5c8d9d0a2055b132b<wbr>60+67108864+<wbr>Ab7cbd46f674287a7e906d009c1822<wbr>7475a83edbb@59cbcfe9 200 67108864 63292254 <a href="http://arvados-keep01.cri.local:25107/fe194b0a4576d5c8d9d0a2055b132b60+67108864+Ab7cbd46f674287a7e906d009c18227475a83edbb@59cbcfe9" target="_blank">http://arvados-keep01.cri.<wbr>local:25107/<wbr>fe194b0a4576d5c8d9d0a2055b132b<wbr>60+67108864+<wbr>Ab7cbd46f674287a7e906d009c1822<wbr>7475a83edbb@59cbcfe9</a> Reader failed checksum</div><div><br></div><div>If I look at the keepstore, I see the following messages associated with those requests from the keepproxy... In all cases the status code is 200.<br></div><div><br></div><div><div>Sep 13 11:21:12 prometheus keepstore: time="2017-09-13T11:21:12.<wbr>588429359-05:00" level=info msg=response RequestID=bfpza7ye6fmw remoteAddr="<a href="http://172.17.0.227:59396" target="_blank">172.17.0.227:59396</a><wbr>" reqBytes=0 reqForwardedFor= reqMethod=GET reqPath= fe194b0a4576d5c8d9d0a2055b132b<wbr>60+67108864+<wbr>Ab7cbd46f674287a7e906d009c1822<wbr>7475a83edbb@59cbcfe9" respBytes=67108864 respStatus=OK respStatusCode=200 timeToStatus=1.628499 timeTotal=15.000262 timeWriteBody=13.371763</div><div>Sep 13 11:21:14 prometheus keepstore: time="2017-09-13T11:21:14.<wbr>127563080-05:00" level=debug msg=request RequestID=bfpzafjxba0b remoteAddr="<a href="http://172.17.0.227:59406" target="_blank">172.17.0.227:59406</a><wbr>" reqBytes=0 reqForwardedFor= reqMethod=GET reqPath= fe194b0a4576d5c8d9d0a2055b132b<wbr>60+67108864+<wbr>Ab7cbd46f674287a7e906d009c1822<wbr>7475a83edbb@59cbcfe9"</div><div>Sep 13 11:21:29 prometheus keepstore: time="2017-09-13T11:21:29.<wbr>125244128-05:00" level=info msg=response RequestID=bfpzafjxba0b remoteAddr="<a href="http://172.17.0.227:59406" target="_blank">172.17.0.227:59406</a><wbr>" reqBytes=0 reqForwardedFor= reqMethod=GET reqPath= fe194b0a4576d5c8d9d0a2055b132b<wbr>60+67108864+<wbr>Ab7cbd46f674287a7e906d009c1822<wbr>7475a83edbb@59cbcfe9" respBytes=67108864 respStatus=OK respStatusCode=200 timeToStatus=0.179427 timeTotal=14.997671 timeWriteBody=14.818244</div><div>Sep 13 11:21:29 prometheus keepstore: time="2017-09-13T11:21:29.<wbr>838397727-05:00" level=debug msg=request RequestID=bfpzamrr4y9i remoteAddr="<a href="http://172.17.0.227:59408" target="_blank">172.17.0.227:59408</a><wbr>" reqBytes=0 reqForwardedFor= reqMethod=GET reqPath= fe194b0a4576d5c8d9d0a2055b132b<wbr>60+67108864+<wbr>Ab7cbd46f674287a7e906d009c1822<wbr>7475a83edbb@59cbcfe9"</div><div>Sep 13 11:21:44 prometheus keepstore: time="2017-09-13T11:21:44.<wbr>838600482-05:00" level=info msg=response RequestID=bfpzamrr4y9i remoteAddr="<a href="http://172.17.0.227:59408" target="_blank">172.17.0.227:59408</a><wbr>" reqBytes=0 reqForwardedFor= reqMethod=GET reqPath= fe194b0a4576d5c8d9d0a2055b132b<wbr>60+67108864+<wbr>Ab7cbd46f674287a7e906d009c1822<wbr>7475a83edbb@59cbcfe9" respBytes=67108864 respStatus=OK respStatusCode=200 timeToStatus=0.179266 timeTotal=15.000209 timeWriteBody=14.820943</div><div>Sep 13 11:21:45 prometheus keepstore: time="2017-09-13T11:21:45.<wbr>863245412-05:00" level=debug msg=request RequestID=bfpzau4rwy8i remoteAddr="<a href="http://172.17.0.227:59412" target="_blank">172.17.0.227:59412</a><wbr>" reqBytes=0 reqForwardedFor= reqMethod=GET reqPath= fe194b0a4576d5c8d9d0a2055b132b<wbr>60+67108864+<wbr>Ab7cbd46f674287a7e906d009c1822<wbr>7475a83edbb@59cbcfe9"</div><div>Sep 13 11:22:00 prometheus keepstore: time="2017-09-13T11:22:00.<wbr>828749773-05:00" level=info msg=response RequestID=bfpzau4rwy8i remoteAddr="<a href="http://172.17.0.227:59412" target="_blank">172.17.0.227:59412</a><wbr>" reqBytes=0 reqForwardedFor= reqMethod=GET reqPath= fe194b0a4576d5c8d9d0a2055b132b<wbr>60+67108864+<wbr>Ab7cbd46f674287a7e906d009c1822<wbr>7475a83edbb@59cbcfe9" respBytes=67108864 respStatus=OK respStatusCode=200 timeToStatus=0.508386 timeTotal=14.965493 timeWriteBody=14.457107</div></div><div><br></div><div>When I check the blocks on the keepstore it seems that the blocks are not corrupted.</div><div><br></div><div><div>ROOT [prometheus:/mnt/keep/fe1] # openssl dgst -md5 -hex *</div><div>MD5(<wbr>fe125c2660a88a250a8d95542a80a6<wbr>65)= fe125c2660a88a250a8d95542a80a6<wbr>65</div><div>MD5(<wbr>fe1341f67e7c0a53a0a67c664b1385<wbr>a0)= fe1341f67e7c0a53a0a67c664b1385<wbr>a0</div><div>MD5(<wbr>fe1763cc1a8713b6c768852bca9207<wbr>b1)= fe1763cc1a8713b6c768852bca9207<wbr>b1</div><div>MD5(<wbr>fe194b0a4576d5c8d9d0a2055b132b<wbr>60)= fe194b0a4576d5c8d9d0a2055b132b<wbr>60</div><div>MD5(<wbr>fe1b04d3a017e84bef791ecaca2fb7<wbr>64)= fe1b04d3a017e84bef791ecaca2fb7<wbr>64</div></div><div><br></div><div>We are running the following versions of keepproxy and keepstore.</div><div><br></div><div>keepproxy-0.1.20170906144951.<wbr>22418ed-1.x86_64<br></div><div><div>keepstore-0.1.20170906144951.<wbr>22418ed-1.x86_64</div></div><div><br></div><div>Any help would be greatly appreciated.</div><div><br></div><div>Thanks!</div><div><br></div><div><div class="m_-7413417773236522176gmail_signature">George Chlipala, Ph.D.<br>Senior Research Specialist<br>Research Resources Center<br>University of Illinois at Chicago<br><br>phone: <a href="tel:(312)%20413-1700" value="+13124131700" target="_blank">312-413-1700</a><br>email: <a href="mailto:gchlip2@uic.edu" target="_blank">gchlip2@uic.edu</a></div></div>
</div></div>
</blockquote></div><br></div>