[arvados] Keepproxy Reader failed checksum

George Chlipala gchlip2 at uic.edu
Wed Sep 27 14:54:06 EDT 2017


Tom -

It was a different Tom.  We manage the network, and all of these systems are on
the same subnet.  As I mentioned, we are not experiencing any other network
issues on our systems.  I would expect that if there were a network issue, it
would affect other services running on the same system.  This problem appears
to affect only the keepproxy service.

Here is our keepproxy configuration.

Client:
  APIHost: "<removed>"
  AuthToken: "<removed>"
  Insecure: false
Debug: true
DefaultReplicas: 0
DisableGet: false
DisablePut: false
Listen: :25108
PIDFile: ""
Timeout: 15s
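
In case it helps with further debugging, here is roughly how the proxy can be
exercised directly for a single block with curl (the locator, signature, and
token below are placeholders, and I am assuming the "OAuth2" form of the
Authorization header that Keep services accept):

curl -s -o /dev/null \
  -w '%{http_code} %{size_download} %{time_total}\n' \
  -H "Authorization: OAuth2 $ARVADOS_API_TOKEN" \
  "https://arvados-keep.cri.uic.edu/<md5>+<size>+A<signature>@<expiry>"

The -w format just reports the status code, bytes received, and total time,
which makes truncated transfers easy to spot.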




George Chlipala, Ph.D.
Associate Director, Core for Research Informatics
Research Resources Center
University of Illinois at Chicago

phone: 312-413-1700
email: gchlip2 at uic.edu

On Tue, Sep 26, 2017 at 11:29 PM, Tom Morris <tfmorris at veritasgenetics.com>
wrote:

> Hi George,
>
> Sorry to hear that you're still having problems.  It's almost certainly a
> configuration issue, since we've got plenty of customers using keepproxy
> in their configurations.
>
> I'm not sure who the "Tom" was who helped you, apparently in private (it
> wasn't me, I swear!), but I'd encourage everyone who depends on
> support from the community to support each other on the public mailing
> list. That allows everyone to contribute -- and everyone to benefit from
> the responses.
>
> The temporal nature of the issue (i.e. it disappears for a while after
> reboots) makes me suspect a generic networking issue (e.g. reverse proxy,
> firewall, etc.) external to Arvados. Have you tried engaging the help of
> your local networking gurus to track this down?
>
> If anyone else has some thoughts, please chime in (publicly). Good luck!
>
> Tom
>
> On Tue, Sep 26, 2017 at 2:36 PM, George Chlipala <gchlip2 at uic.edu> wrote:
>
>> First off, thanks to Tom for some help on this issue.  However, we were
>> not able to resolve it.  It looks like the keepproxy is closing the
>> connection with the keepstore early and there does not appear to be
>> anything in the logs to indicate why.
>>
>> If anyone can provide some insight into this problem, it would be greatly
>> appreciated.
>>
>> We have been using the arv-get tool to fetch the data.  Our keepproxy is
>> at 172.17.0.227, the keepstore is at 172.17.0.30, and we have a reverse
>> HTTP proxy at 172.17.0.231.  So connections between 172.17.0.231 <->
>> 172.17.0.227 are in essence between the client (arv-get) and the
>> keepproxy.  Connections between 172.17.0.227 <-> 172.17.0.30 are between
>> the keepproxy and the keepstore.
>>
>> I ran a single tcpdump capture on the keepstore system so that we could see
>> both conversations.  The timestamps are on the same scale, so events in the
>> two conversations can be compared directly.
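>>
>> For anyone who wants to reproduce the capture, a filter along these lines
>> should pick up both conversations in one file (run it wherever both flows
>> are visible; the interface name here is only an example):
>>
>>   tcpdump -i eth0 -w keep-debug.pcap \
>>     '(host 172.17.0.231 and host 172.17.0.227) or (host 172.17.0.227 and host 172.17.0.30)'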
>>
>> Here is the end of the conversation between arv-get (172.17.0.231) and the
>> keepproxy (172.17.0.227)
>>
>>
>> [image: Inline image 1]
>>
>> Here is the end of the conversation between keepproxy (172.17.0.227) and
>> keepstore (172.17.0.30)
>>
>> [image: Inline image 2]
>>
>>
>> From the tcpdump, it appears that the keepproxy sends an RST to the
>> keepstore at 15.002 sec, and then at 15.067 sec the keepproxy closes the
>> connection with arv-get.
>>
>> Also here is the output from the arv-get command, if this helps.
>>
>> # arv-get --no-progress --skip-existing -r pg5mc-4zz18-tcq3i3k63n536ja
>> /mnt/lustre/gchlip2/Ridlon_Aug2017/mm10_work/
>>
>> Traceback (most recent call last):
>>   File "/mnt/store3/clustcrilab/tools/arvados/bin/arv-get", line 212, in <module>
>>     for data in f.readall():
>>   File "/mnt/store3/clustcrilab/tools/arvados/lib/python2.7/site-packages/arvados/arvfile.py", line 108, in readall
>>     data = self.read(size, num_retries=num_retries)
>>   File "/mnt/store3/clustcrilab/tools/arvados/lib/python2.7/site-packages/arvados/arvfile.py", line 59, in before_close_wrapper
>>     return orig_func(self, *args, **kwargs)
>>   File "/mnt/store3/clustcrilab/tools/arvados/lib/python2.7/site-packages/arvados/retry.py", line 158, in num_retries_setter
>>     return orig_func(self, *args, **kwargs)
>>   File "/mnt/store3/clustcrilab/tools/arvados/lib/python2.7/site-packages/arvados/arvfile.py", line 217, in read
>>     num_retries=num_retries)
>>   File "/mnt/store3/clustcrilab/tools/arvados/lib/python2.7/site-packages/arvados/retry.py", line 158, in num_retries_setter
>>     return orig_func(self, *args, **kwargs)
>>   File "/mnt/store3/clustcrilab/tools/arvados/lib/python2.7/site-packages/arvados/stream.py", line 85, in readfrom
>>     data.append(self._keepget(lr.locator, num_retries=num_retries)[lr.segment_offset:lr.segment_offset+lr.segment_size])
>>   File "/mnt/store3/clustcrilab/tools/arvados/lib/python2.7/site-packages/arvados/retry.py", line 158, in num_retries_setter
>>     return orig_func(self, *args, **kwargs)
>>   File "/mnt/store3/clustcrilab/tools/arvados/lib/python2.7/site-packages/arvados/stream.py", line 74, in _keepget
>>     return self._keep.get(locator, num_retries=num_retries)
>>   File "/mnt/store3/clustcrilab/tools/arvados/lib/python2.7/site-packages/arvados/retry.py", line 158, in num_retries_setter
>>     return orig_func(self, *args, **kwargs)
>>   File "/mnt/store3/clustcrilab/tools/arvados/lib/python2.7/site-packages/arvados/keep.py", line 926, in get
>>     return self._get_or_head(loc_s, method="GET", num_retries=num_retries)
>>   File "/mnt/store3/clustcrilab/tools/arvados/lib/python2.7/site-packages/arvados/keep.py", line 1039, in _get_or_head
>>     "failed to read {}".format(loc_s), service_errors, label="service")
>> arvados.errors.KeepReadError: failed to read 9c4ee2298ab88506ae8a965e6d3f37f9+67108864+A3495ce0a76bfceba6a21a77bd2f0ee0253fbb4f5@59cc0527: service https://arvados-keep.cri.uic.edu:443/ responded with 0 (18, 'transfer closed with 21204434 bytes remaining to read')
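>>
>> (For completeness: if I have the flag right, the SDK-level retry count can
>> be raised with something like
>>
>>   arv-get --retries 10 --no-progress --skip-existing -r pg5mc-4zz18-tcq3i3k63n536ja /mnt/lustre/gchlip2/Ridlon_Aug2017/mm10_work/
>>
>> although retrying would not explain why the connection is being closed in
>> the first place.)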
>>
>> After the initial problem had cropped up, we had a separate issue (a kernel
>> update) that required us to reboot the system.  After the reboot, the issue
>> disappeared.  It seemed to be fine for a while; however, it has started to
>> occur again, and it appears to be intermittent.  I have restarted the
>> keepproxy service; it worked for a while (a few minutes) and then the issue
>> started to occur again.  I have looked through the kernel logs (dmesg) and
>> there are no errors, and everything else on the system seems to be fine.  We
>> also run a keep-web service on the same machine as the keepproxy, and there
>> are no issues with it, so this problem seems to be isolated to the keepproxy
>> service.  It is not clear why a reboot cleared the problem but a restart of
>> the service does not; I would expect that, from the service's standpoint,
>> restarting keepproxy would be the same as a reboot.
>>
>> The system also appears to have sufficient resources.  We have allocated
>> 4 cores and 8 GB of memory to this system.  Here is a snapshot of the top
>> summary while performing an arv-get; it does not look like keepproxy is
>> taxing the system much.
>>
>> %Cpu(s):  0.4 us,  0.4 sy,  0.0 ni, 99.1 id,  0.0 wa,  0.0 hi,  0.1 si,  0.0 st
>> GiB Mem :    7.638 total,    5.957 free,    0.830 used,    0.850 buff/cache
>> GiB Swap:    1.000 total,    1.000 free,    0.000 used.    6.394 avail Mem
>>
>>   PID USER      PR  NI    VIRT    RES    SHR S  %CPU %MEM     TIME+ COMMAND
>>  6488 root      20   0  548.6m  13.2m   4.1m S   3.0  0.2   0:20.49 keepproxy
>>
>> Any help would be greatly appreciated.  Thanks!
>>
>>
>>
>> George Chlipala, Ph.D.
>> Associate Director, Core for Research Informatics
>> Research Resources Center
>> University of Illinois at Chicago
>>
>> phone: 312-413-1700
>> email: gchlip2 at uic.edu
>>
>> On Wed, Sep 13, 2017 at 11:31 AM, George Chlipala <gchlip2 at uic.edu>
>> wrote:
>>
>>> To whom it may concern -
>>>
>>> We are experiencing a large number of "Reader failed checksum" errors
>>> when using the keepproxy and I cannot seem to figure out why.
>>>
>>> Here is a snippet from our log for the keepproxy.
>>>
>>> Sep 13 11:21:12 arvados keepproxy: 2017/09/13 11:21:12 128.248.171.19, 172.17.0.231:52236 GET /fe194b0a4576d5c8d9d0a2055b132b60+67108864+Ab7cbd46f674287a7e906d009c18227475a83edbb@59cbcfe9 200 67108864 34647446 http://arvados-keep01.cri.local:25107/fe194b0a4576d5c8d9d0a2055b132b60+67108864+Ab7cbd46f674287a7e906d009c18227475a83edbb@59cbcfe9 Reader failed checksum
>>> Sep 13 11:21:29 arvados keepproxy: 2017/09/13 11:21:29 128.248.171.19, 172.17.0.231:51688 GET /fe194b0a4576d5c8d9d0a2055b132b60+67108864+Ab7cbd46f674287a7e906d009c18227475a83edbb@59cbcfe9 200 67108864 30612102 http://arvados-keep01.cri.local:25107/fe194b0a4576d5c8d9d0a2055b132b60+67108864+Ab7cbd46f674287a7e906d009c18227475a83edbb@59cbcfe9 Reader failed checksum
>>> Sep 13 11:21:44 arvados keepproxy: 2017/09/13 11:21:44 128.248.171.19, 172.17.0.231:50472 GET /fe194b0a4576d5c8d9d0a2055b132b60+67108864+Ab7cbd46f674287a7e906d009c18227475a83edbb@59cbcfe9 200 67108864 34956478 http://arvados-keep01.cri.local:25107/fe194b0a4576d5c8d9d0a2055b132b60+67108864+Ab7cbd46f674287a7e906d009c18227475a83edbb@59cbcfe9 Reader failed checksum
>>> Sep 13 11:22:00 arvados keepproxy: 2017/09/13 11:22:00 128.248.171.19, 172.17.0.231:52286 GET /fe194b0a4576d5c8d9d0a2055b132b60+67108864+Ab7cbd46f674287a7e906d009c18227475a83edbb@59cbcfe9 200 67108864 63292254 http://arvados-keep01.cri.local:25107/fe194b0a4576d5c8d9d0a2055b132b60+67108864+Ab7cbd46f674287a7e906d009c18227475a83edbb@59cbcfe9 Reader failed checksum
>>>
>>> If I look at the keepstore, I see the following messages associated with
>>> those requests from the keepproxy...  In all cases the status code is 200.
>>>
>>> Sep 13 11:21:12 prometheus keepstore: time="2017-09-13T11:21:12.588429359-05:00" level=info msg=response RequestID=bfpza7ye6fmw remoteAddr="172.17.0.227:59396" reqBytes=0 reqForwardedFor= reqMethod=GET reqPath="fe194b0a4576d5c8d9d0a2055b132b60+67108864+Ab7cbd46f674287a7e906d009c18227475a83edbb@59cbcfe9" respBytes=67108864 respStatus=OK respStatusCode=200 timeToStatus=1.628499 timeTotal=15.000262 timeWriteBody=13.371763
>>> Sep 13 11:21:14 prometheus keepstore: time="2017-09-13T11:21:14.127563080-05:00" level=debug msg=request RequestID=bfpzafjxba0b remoteAddr="172.17.0.227:59406" reqBytes=0 reqForwardedFor= reqMethod=GET reqPath="fe194b0a4576d5c8d9d0a2055b132b60+67108864+Ab7cbd46f674287a7e906d009c18227475a83edbb@59cbcfe9"
>>> Sep 13 11:21:29 prometheus keepstore: time="2017-09-13T11:21:29.125244128-05:00" level=info msg=response RequestID=bfpzafjxba0b remoteAddr="172.17.0.227:59406" reqBytes=0 reqForwardedFor= reqMethod=GET reqPath="fe194b0a4576d5c8d9d0a2055b132b60+67108864+Ab7cbd46f674287a7e906d009c18227475a83edbb@59cbcfe9" respBytes=67108864 respStatus=OK respStatusCode=200 timeToStatus=0.179427 timeTotal=14.997671 timeWriteBody=14.818244
>>> Sep 13 11:21:29 prometheus keepstore: time="2017-09-13T11:21:29.838397727-05:00" level=debug msg=request RequestID=bfpzamrr4y9i remoteAddr="172.17.0.227:59408" reqBytes=0 reqForwardedFor= reqMethod=GET reqPath="fe194b0a4576d5c8d9d0a2055b132b60+67108864+Ab7cbd46f674287a7e906d009c18227475a83edbb@59cbcfe9"
>>> Sep 13 11:21:44 prometheus keepstore: time="2017-09-13T11:21:44.838600482-05:00" level=info msg=response RequestID=bfpzamrr4y9i remoteAddr="172.17.0.227:59408" reqBytes=0 reqForwardedFor= reqMethod=GET reqPath="fe194b0a4576d5c8d9d0a2055b132b60+67108864+Ab7cbd46f674287a7e906d009c18227475a83edbb@59cbcfe9" respBytes=67108864 respStatus=OK respStatusCode=200 timeToStatus=0.179266 timeTotal=15.000209 timeWriteBody=14.820943
>>> Sep 13 11:21:45 prometheus keepstore: time="2017-09-13T11:21:45.863245412-05:00" level=debug msg=request RequestID=bfpzau4rwy8i remoteAddr="172.17.0.227:59412" reqBytes=0 reqForwardedFor= reqMethod=GET reqPath="fe194b0a4576d5c8d9d0a2055b132b60+67108864+Ab7cbd46f674287a7e906d009c18227475a83edbb@59cbcfe9"
>>> Sep 13 11:22:00 prometheus keepstore: time="2017-09-13T11:22:00.828749773-05:00" level=info msg=response RequestID=bfpzau4rwy8i remoteAddr="172.17.0.227:59412" reqBytes=0 reqForwardedFor= reqMethod=GET reqPath="fe194b0a4576d5c8d9d0a2055b132b60+67108864+Ab7cbd46f674287a7e906d009c18227475a83edbb@59cbcfe9" respBytes=67108864 respStatus=OK respStatusCode=200 timeToStatus=0.508386 timeTotal=14.965493 timeWriteBody=14.457107
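>>>
>>> For reference, a quick way to pull the per-request transfer times out of
>>> the keepstore log (the journald unit name or log file will vary by
>>> install) is something like:
>>>
>>>   journalctl -u keepstore | grep -o 'timeTotal=[0-9.]*'
>>>
>>> In the entries above, every response comes in at roughly 15 seconds total.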
>>>
>>> When I check the blocks on the keepstore it seems that the blocks are
>>> not corrupted.
>>>
>>> ROOT [prometheus:/mnt/keep/fe1] # openssl dgst -md5 -hex *
>>> MD5(fe125c2660a88a250a8d95542a80a665)= fe125c2660a88a250a8d95542a80a665
>>> MD5(fe1341f67e7c0a53a0a67c664b1385a0)= fe1341f67e7c0a53a0a67c664b1385a0
>>> MD5(fe1763cc1a8713b6c768852bca9207b1)= fe1763cc1a8713b6c768852bca9207b1
>>> MD5(fe194b0a4576d5c8d9d0a2055b132b60)= fe194b0a4576d5c8d9d0a2055b132b60
>>> MD5(fe1b04d3a017e84bef791ecaca2fb764)= fe1b04d3a017e84bef791ecaca2fb764
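>>>
>>> (That check only covers one prefix directory.  A rough sweep of the whole
>>> volume, assuming blocks are stored as bare md5-named files, would be
>>> something like:
>>>
>>>   find /mnt/keep -type f | while read -r f; do
>>>     b=$(basename "$f")
>>>     # skip anything that is not a bare 32-character hex name
>>>     echo "$b" | grep -qE '^[0-9a-f]{32}$' || continue
>>>     sum=$(openssl dgst -md5 -hex "$f" | awk '{print $NF}')
>>>     [ "$sum" = "$b" ] || echo "MISMATCH: $f"
>>>   done
>>>
>>> which prints a line for any block whose content no longer matches its name.)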
>>>
>>> We are running the following versions of keepproxy and keepstore.
>>>
>>> keepproxy-0.1.20170906144951.22418ed-1.x86_64
>>> keepstore-0.1.20170906144951.22418ed-1.x86_64
>>>
>>> Any help would be greatly appreciated.
>>>
>>> Thanks!
>>>
>>> George Chlipala, Ph.D.
>>> Senior Research Specialist
>>> Research Resources Center
>>> University of Illinois at Chicago
>>>
>>> phone: 312-413-1700
>>> email: gchlip2 at uic.edu
>>>
>>
>>
>>
>>
>
[Image attachments referenced above are archived at:
http://lists.arvados.org/pipermail/arvados/attachments/20170927/39a2996f/attachment-0002.png (image.png, 29744 bytes)
http://lists.arvados.org/pipermail/arvados/attachments/20170927/39a2996f/attachment-0003.png (image.png, 74998 bytes)]

