[arvados] Keepproxy Reader failed checksum

George Chlipala gchlip2 at uic.edu
Tue Sep 26 14:36:16 EDT 2017


First off, thanks to Tom for some help on this issue; however, we were not
able to resolve it.  It looks like the keepproxy is closing the connection
to the keepstore early, and there does not appear to be anything in the
logs to indicate why.

If anyone can provide some insight into this problem, it would be greatly
appreciated.

We have been using the arv-get tool to fetch the data.  Our keepproxy is at
172.17.0.227, the keepstore is at 172.17.0.30, and we have a reverse HTTP
proxy at 172.17.0.231.  So connections between 172.17.0.231 <->
172.17.0.227 are in essence between the client (arv-get) and the
keepproxy.  Connections between 172.17.0.227 <-> 172.17.0.30 are between
the keepproxy and the keepstore.
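
As an aside, one test we can run to take arv-get's collection logic out of
the picture is to fetch a single block through the keepproxy with the Python
SDK and hash it ourselves.  This is only a rough sketch: the locator below is
a placeholder (substitute one of the signed locators from the logs), and it
assumes ARVADOS_API_HOST and ARVADOS_API_TOKEN are set so that the SDK should
route the request through the keepproxy.

    import hashlib
    import arvados

    # Placeholder: substitute a signed block locator taken from the logs below.
    LOCATOR = "xxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxx+67108864+Asignature@expiry"

    # With ARVADOS_API_HOST/ARVADOS_API_TOKEN set and no local keepstores
    # accessible, KeepClient should send this request via the keepproxy.
    keep = arvados.keep.KeepClient(api_client=arvados.api('v1'))

    data = keep.get(LOCATOR, num_retries=0)
    print("bytes received: %d" % len(data))
    print("md5 matches locator: %s"
          % (hashlib.md5(data).hexdigest() == LOCATOR.split('+')[0]))

If this fails with the same KeepReadError as arv-get, the problem is in the
proxy path itself rather than in the file/stream handling.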

I ran a single tcpdump capture on the keepstore system so that we could see
both conversations.  Since both come from the same capture, the timestamps
are on the same scale and events in the two conversations can be compared
directly.

Here is the end of the conversation between arv-get (172.17.0.231) and the
keepproxy (172.17.0.227):


[image: Inline image 1]

Here is the end of the conversation between the keepproxy (172.17.0.227) and
the keepstore (172.17.0.30):

[image: Inline image 2]


From the tcpdump, it appears that the keepproxy sends an RST to the
keepstore at 15.002 sec, and then at 15.067 sec the keepproxy closes the
connection with arv-get.
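
Since every response in the keepstore logs quoted further down shows
timeTotal right around 15.0 seconds, and the RST shows up at roughly 15
seconds in the capture, it looks like something on the keepproxy side is
cutting the transfer off at about 15 seconds.  Here is a rough, untested
sketch of how we could time a streaming GET of one block through the
keepproxy to see whether the connection is always dropped near that mark
(the locator is a placeholder, the proxy URL is ours, and it assumes the
python requests module is available and that an "Authorization: OAuth2
<token>" header is accepted):

    import os
    import time
    import requests  # assumes python-requests is installed

    PROXY = "https://arvados-keep.cri.uic.edu"   # our keepproxy endpoint
    # Placeholder: substitute a signed block locator from the logs below.
    LOCATOR = "xxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxx+67108864+Asignature@expiry"
    TOKEN = os.environ["ARVADOS_API_TOKEN"]

    start = time.time()
    received = 0
    try:
        r = requests.get("%s/%s" % (PROXY, LOCATOR),
                         headers={"Authorization": "OAuth2 %s" % TOKEN},
                         stream=True, timeout=300)
        for chunk in r.iter_content(chunk_size=1 << 20):
            received += len(chunk)
    except Exception as exc:
        print("error after %.3f sec: %s" % (time.time() - start, exc))

    print("received %d bytes in %.3f sec" % (received, time.time() - start))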

Also, here is the output from the arv-get command, in case it helps.

# arv-get --no-progress --skip-existing -r pg5mc-4zz18-tcq3i3k63n536ja /mnt/lustre/gchlip2/Ridlon_Aug2017/mm10_work/

Traceback (most recent call last):
  File "/mnt/store3/clustcrilab/tools/arvados/bin/arv-get", line 212, in <module>
    for data in f.readall():
  File "/mnt/store3/clustcrilab/tools/arvados/lib/python2.7/site-packages/arvados/arvfile.py", line 108, in readall
    data = self.read(size, num_retries=num_retries)
  File "/mnt/store3/clustcrilab/tools/arvados/lib/python2.7/site-packages/arvados/arvfile.py", line 59, in before_close_wrapper
    return orig_func(self, *args, **kwargs)
  File "/mnt/store3/clustcrilab/tools/arvados/lib/python2.7/site-packages/arvados/retry.py", line 158, in num_retries_setter
    return orig_func(self, *args, **kwargs)
  File "/mnt/store3/clustcrilab/tools/arvados/lib/python2.7/site-packages/arvados/arvfile.py", line 217, in read
    num_retries=num_retries)
  File "/mnt/store3/clustcrilab/tools/arvados/lib/python2.7/site-packages/arvados/retry.py", line 158, in num_retries_setter
    return orig_func(self, *args, **kwargs)
  File "/mnt/store3/clustcrilab/tools/arvados/lib/python2.7/site-packages/arvados/stream.py", line 85, in readfrom
    data.append(self._keepget(lr.locator, num_retries=num_retries)[lr.segment_offset:lr.segment_offset+lr.segment_size])
  File "/mnt/store3/clustcrilab/tools/arvados/lib/python2.7/site-packages/arvados/retry.py", line 158, in num_retries_setter
    return orig_func(self, *args, **kwargs)
  File "/mnt/store3/clustcrilab/tools/arvados/lib/python2.7/site-packages/arvados/stream.py", line 74, in _keepget
    return self._keep.get(locator, num_retries=num_retries)
  File "/mnt/store3/clustcrilab/tools/arvados/lib/python2.7/site-packages/arvados/retry.py", line 158, in num_retries_setter
    return orig_func(self, *args, **kwargs)
  File "/mnt/store3/clustcrilab/tools/arvados/lib/python2.7/site-packages/arvados/keep.py", line 926, in get
    return self._get_or_head(loc_s, method="GET", num_retries=num_retries)
  File "/mnt/store3/clustcrilab/tools/arvados/lib/python2.7/site-packages/arvados/keep.py", line 1039, in _get_or_head
    "failed to read {}".format(loc_s), service_errors, label="service")
arvados.errors.KeepReadError: failed to read 9c4ee2298ab88506ae8a965e6d3f37f9+67108864+A3495ce0a76bfceba6a21a77bd2f0ee0253fbb4f5@59cc0527: service https://arvados-keep.cri.uic.edu:443/ responded with 0 (18, 'transfer closed with 21204434 bytes remaining to read')

For what it's worth, the (18, 'transfer closed with 21204434 bytes remaining
to read') in that error is libcurl error 18 (CURLE_PARTIAL_FILE), i.e. the
proxy closed the response before the full 64 MiB block had been transferred.

After the initial problem cropped up, a separate issue (a kernel update)
required us to reboot the system, and after the reboot the problem
disappeared.  It was fine for a while, but it has since started to occur
again and now appears to be intermittent.  Restarting the keepproxy service
helped for a few minutes, then the issue returned.  I have looked through
the kernel logs (dmesg): there are no errors, and everything else on the
system seems fine.  We also run a keep-web service on the same machine as
the keepproxy, and it has no issues, so the problem appears to be isolated
to the keepproxy service.  It is not clear, though, why a reboot cleared the
problem while restarting the service does not; from the service's standpoint
I would expect a restart to be equivalent to a reboot.

Also, the system appears to have sufficient resources: we have allocated 4
cores and 8 GB of memory to it.  Here is a snapshot of the top summary taken
while an arv-get was running; the keepproxy does not appear to be taxing the
system.

%Cpu(s):  0.4 us,  0.4 sy,  0.0 ni, 99.1 id,  0.0 wa,  0.0 hi,  0.1 si,  0.0 st
GiB Mem :    7.638 total,    5.957 free,    0.830 used,    0.850 buff/cache
GiB Swap:    1.000 total,    1.000 free,    0.000 used.    6.394 avail Mem

  PID USER      PR  NI    VIRT    RES    SHR S  %CPU %MEM     TIME+ COMMAND
 6488 root      20   0  548.6m  13.2m   4.1m S   3.0  0.2   0:20.49 keepproxy

Any help would be greatly appreciated.  Thanks!



George Chlipala, Ph.D.
Associate Director, Core for Research Informatics
Research Resources Center
University of Illinois at Chicago

phone: 312-413-1700
email: gchlip2 at uic.edu

On Wed, Sep 13, 2017 at 11:31 AM, George Chlipala <gchlip2 at uic.edu> wrote:

> To whom it may concern -
>
> We are experiencing a large number of "Reader failed checksum" errors when
> using the keepproxy and I cannot seem to figure out why.
>
> Here is a snippet from our keepproxy log.
>
> Sep 13 11:21:12 arvados keepproxy: 2017/09/13 11:21:12 128.248.171.19, 172.17.0.231:52236 GET /fe194b0a4576d5c8d9d0a2055b132b60+67108864+Ab7cbd46f674287a7e906d009c18227475a83edbb@59cbcfe9 200 67108864 34647446 http://arvados-keep01.cri.local:25107/fe194b0a4576d5c8d9d0a2055b132b60+67108864+Ab7cbd46f674287a7e906d009c18227475a83edbb@59cbcfe9 Reader failed checksum
> Sep 13 11:21:29 arvados keepproxy: 2017/09/13 11:21:29 128.248.171.19, 172.17.0.231:51688 GET /fe194b0a4576d5c8d9d0a2055b132b60+67108864+Ab7cbd46f674287a7e906d009c18227475a83edbb@59cbcfe9 200 67108864 30612102 http://arvados-keep01.cri.local:25107/fe194b0a4576d5c8d9d0a2055b132b60+67108864+Ab7cbd46f674287a7e906d009c18227475a83edbb@59cbcfe9 Reader failed checksum
> Sep 13 11:21:44 arvados keepproxy: 2017/09/13 11:21:44 128.248.171.19, 172.17.0.231:50472 GET /fe194b0a4576d5c8d9d0a2055b132b60+67108864+Ab7cbd46f674287a7e906d009c18227475a83edbb@59cbcfe9 200 67108864 34956478 http://arvados-keep01.cri.local:25107/fe194b0a4576d5c8d9d0a2055b132b60+67108864+Ab7cbd46f674287a7e906d009c18227475a83edbb@59cbcfe9 Reader failed checksum
> Sep 13 11:22:00 arvados keepproxy: 2017/09/13 11:22:00 128.248.171.19, 172.17.0.231:52286 GET /fe194b0a4576d5c8d9d0a2055b132b60+67108864+Ab7cbd46f674287a7e906d009c18227475a83edbb@59cbcfe9 200 67108864 63292254 http://arvados-keep01.cri.local:25107/fe194b0a4576d5c8d9d0a2055b132b60+67108864+Ab7cbd46f674287a7e906d009c18227475a83edbb@59cbcfe9 Reader failed checksum
>
> If I look at the keepstore, I see the following messages associated with
> those requests from the keepproxy...  In all cases the status code is 200.
>
> Sep 13 11:21:12 prometheus keepstore: time="2017-09-13T11:21:12.588429359-05:00" level=info msg=response RequestID=bfpza7ye6fmw remoteAddr="172.17.0.227:59396" reqBytes=0 reqForwardedFor= reqMethod=GET reqPath="fe194b0a4576d5c8d9d0a2055b132b60+67108864+Ab7cbd46f674287a7e906d009c18227475a83edbb@59cbcfe9" respBytes=67108864 respStatus=OK respStatusCode=200 timeToStatus=1.628499 timeTotal=15.000262 timeWriteBody=13.371763
> Sep 13 11:21:14 prometheus keepstore: time="2017-09-13T11:21:14.127563080-05:00" level=debug msg=request RequestID=bfpzafjxba0b remoteAddr="172.17.0.227:59406" reqBytes=0 reqForwardedFor= reqMethod=GET reqPath="fe194b0a4576d5c8d9d0a2055b132b60+67108864+Ab7cbd46f674287a7e906d009c18227475a83edbb@59cbcfe9"
> Sep 13 11:21:29 prometheus keepstore: time="2017-09-13T11:21:29.125244128-05:00" level=info msg=response RequestID=bfpzafjxba0b remoteAddr="172.17.0.227:59406" reqBytes=0 reqForwardedFor= reqMethod=GET reqPath="fe194b0a4576d5c8d9d0a2055b132b60+67108864+Ab7cbd46f674287a7e906d009c18227475a83edbb@59cbcfe9" respBytes=67108864 respStatus=OK respStatusCode=200 timeToStatus=0.179427 timeTotal=14.997671 timeWriteBody=14.818244
> Sep 13 11:21:29 prometheus keepstore: time="2017-09-13T11:21:29.838397727-05:00" level=debug msg=request RequestID=bfpzamrr4y9i remoteAddr="172.17.0.227:59408" reqBytes=0 reqForwardedFor= reqMethod=GET reqPath="fe194b0a4576d5c8d9d0a2055b132b60+67108864+Ab7cbd46f674287a7e906d009c18227475a83edbb@59cbcfe9"
> Sep 13 11:21:44 prometheus keepstore: time="2017-09-13T11:21:44.838600482-05:00" level=info msg=response RequestID=bfpzamrr4y9i remoteAddr="172.17.0.227:59408" reqBytes=0 reqForwardedFor= reqMethod=GET reqPath="fe194b0a4576d5c8d9d0a2055b132b60+67108864+Ab7cbd46f674287a7e906d009c18227475a83edbb@59cbcfe9" respBytes=67108864 respStatus=OK respStatusCode=200 timeToStatus=0.179266 timeTotal=15.000209 timeWriteBody=14.820943
> Sep 13 11:21:45 prometheus keepstore: time="2017-09-13T11:21:45.863245412-05:00" level=debug msg=request RequestID=bfpzau4rwy8i remoteAddr="172.17.0.227:59412" reqBytes=0 reqForwardedFor= reqMethod=GET reqPath="fe194b0a4576d5c8d9d0a2055b132b60+67108864+Ab7cbd46f674287a7e906d009c18227475a83edbb@59cbcfe9"
> Sep 13 11:22:00 prometheus keepstore: time="2017-09-13T11:22:00.828749773-05:00" level=info msg=response RequestID=bfpzau4rwy8i remoteAddr="172.17.0.227:59412" reqBytes=0 reqForwardedFor= reqMethod=GET reqPath="fe194b0a4576d5c8d9d0a2055b132b60+67108864+Ab7cbd46f674287a7e906d009c18227475a83edbb@59cbcfe9" respBytes=67108864 respStatus=OK respStatusCode=200 timeToStatus=0.508386 timeTotal=14.965493 timeWriteBody=14.457107
>
> When I check the blocks on the keepstore it seems that the blocks are not
> corrupted.
>
> ROOT [prometheus:/mnt/keep/fe1] # openssl dgst -md5 -hex *
> MD5(fe125c2660a88a250a8d95542a80a665)= fe125c2660a88a250a8d95542a80a665
> MD5(fe1341f67e7c0a53a0a67c664b1385a0)= fe1341f67e7c0a53a0a67c664b1385a0
> MD5(fe1763cc1a8713b6c768852bca9207b1)= fe1763cc1a8713b6c768852bca9207b1
> MD5(fe194b0a4576d5c8d9d0a2055b132b60)= fe194b0a4576d5c8d9d0a2055b132b60
> MD5(fe1b04d3a017e84bef791ecaca2fb764)= fe1b04d3a017e84bef791ecaca2fb764
>
> We are running the following versions of keepproxy and keepstore.
>
> keepproxy-0.1.20170906144951.22418ed-1.x86_64
> keepstore-0.1.20170906144951.22418ed-1.x86_64
>
> Any help would be greatly appreciated.
>
> Thanks!
>
> George Chlipala, Ph.D.
> Senior Research Specialist
> Research Resources Center
> University of Illinois at Chicago
>
> phone: 312-413-1700
> email: gchlip2 at uic.edu
>
-------------- next part --------------
A non-text attachment was scrubbed...
Name: image.png
Type: image/png
Size: 29744 bytes
Desc: not available
URL: <http://lists.arvados.org/pipermail/arvados/attachments/20170926/e3ae236b/attachment-0002.png>
-------------- next part --------------
A non-text attachment was scrubbed...
Name: image.png
Type: image/png
Size: 74998 bytes
Desc: not available
URL: <http://lists.arvados.org/pipermail/arvados/attachments/20170926/e3ae236b/attachment-0003.png>

