From gchlip2 at uic.edu Fri Aug 4 14:08:25 2017
From: gchlip2 at uic.edu (George Chlipala)
Date: Fri, 4 Aug 2017 13:08:25 -0500
Subject: [arvados] Tuning keep performance
Message-ID:

We have an application that we use to download data from Illumina
Basespace directly to our servers. Previously we had been writing directly
to disk and average data transfer speeds were > 10 MB/s. We modified the
application (a python script) to now push the data into Arvados via a
CollectionWriter. Now we are seeing data transfer speeds of 200-400 kB/s.
Both the Arvados servers and our Basespace application are on the same
subnet and connected via 1 Gbps ethernet.

I have set up the keepstore volume to serialize and I have the default
buffer setting.

Here is our keepstore configuration (keepstore.yml).

BlobSignatureTTL: 96h0m0s
BlobSigningKeyFile: /etc/arvados/keepstore/blob-signing.key
Debug: false
EnableDelete: true
Listen: :25107
LogFormat: text
MaxBuffers: 100
MaxRequests: 0
PIDFile: ""
RequireSignatures: false
SystemAuthTokenFile: /etc/arvados/keepstore/system-auth.key
TrashCheckInterval: 24h0m0s
TrashLifetime: 96h0m0s
Volumes:
- DirectoryReplication: 0
  ReadOnly: false
  Root: /mnt/keep
  Serialize: true
  Type: Directory

Also I have checked the socket connections on the system hosting the
application and it is connecting directly to the keepstore server.

Are there any other items to look at in order to improve performance?

For reference, here are snippets from our push application. The
following are the lines associated with creating the CollectionWriter.

    self.arv = arvados.api(token=arv_token, host=arvados_api_host)
    self.writer = CollectionWriter(self.arv, replication=replication)

The following lines show how we push the data. The fileinfo object
is a custom class that has the path and filename for the file fetched from
Basespace. We are fetching the file from Basespace and saving to a temp
directory in case there are issues during the download.
I have checked and the download speed is > 10 MB/s.

    with open(fileinfo.path, 'rb') as filein, self.writer.open('./raw_data/' + fileinfo.filename) as col_file:
        logging.info("Adding file {0} to Arvados collection".format(fileinfo.filename))
        for data in filein.read():
            col_file.write(data)
            fileinfo.byte_count += len(data)

    col_file.close()
    filein.close()

Any help would be greatly appreciated!

George Chlipala, Ph.D.
Senior Research Specialist
Research Resources Center
University of Illinois at Chicago

phone: 312-413-1700
email: gchlip2 at uic.edu

From peter.amstutz at curoverse.com Fri Aug 4 14:25:53 2017
From: peter.amstutz at curoverse.com (Peter Amstutz)
Date: Fri, 4 Aug 2017 14:25:53 -0400
Subject: [arvados] Tuning keep performance
In-Reply-To:
References:
Message-ID:

Try using the Collection class instead of CollectionWriter, and setting
put_threads in the Collection constructor (in our experiments I think we
found 4-6 threads to get the best throughput).

On Fri, Aug 4, 2017 at 2:08 PM, George Chlipala wrote:

> We have an application that we would use to download data from Illumina
> Basespace directly to our servers. Previously we had been writing directly
> to disk and average data transfer speeds were > 10 MB/s. We modified the
> application (a python script) to now push the data into Arvados via a
> CollectionWriter. Now we are seeing data transfer speeds 200-400 kB/s.
> Both the Arvados servers and our Basespace application are on the same
> subnet and connected via 1 Gbps ethernet.

From gchlip2 at uic.edu Fri Aug 4 16:55:57 2017
From: gchlip2 at uic.edu (George Chlipala)
Date: Fri, 4 Aug 2017 15:55:57 -0500
Subject: [arvados] Tuning keep performance
In-Reply-To:
References:
Message-ID:

Peter -

Thanks for the info!

George Chlipala, Ph.D.
Senior Research Specialist
Research Resources Center
University of Illinois at Chicago

phone: 312-413-1700
email: gchlip2 at uic.edu

On Fri, Aug 4, 2017 at 1:25 PM, Peter Amstutz wrote:

> Try using the Collection class instead of CollectionWriter, and setting
> put_threads in the Collection constructor (in our experiments I think we
> found 4-6 threads to get the best throughput).
>
> On Fri, Aug 4, 2017 at 2:08 PM, George Chlipala wrote:
>
>> We have an application that we would use to download data from Illumina
>> Basespace directly to our servers. Previously we had been writing directly
>> to disk and average data transfer speeds were > 10 MB/s. We modified the
>> application (a python script) to now push the data into Arvados via a
>> CollectionWriter. Now we are seeing data transfer speeds 200-400 kB/s.
>> Both the Arvados servers and our Basespace application are on the same
>> subnet and connected via 1 Gbps ethernet.
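
The suggestion in the thread (switch from CollectionWriter to the Collection class and set put_threads) might be sketched roughly as follows. This is a minimal sketch, not the code either poster ran: push_file, copy_in_chunks, CHUNK_SIZE and the default of 4 threads are hypothetical names and values chosen for illustration, and the keyword arguments passed to Collection (api_client, replication_desired, put_threads) are assumed to match the installed Python SDK version. Note also that iterating over filein.read(), as the original snippet does, yields one byte at a time, so the sketch reads fixed-size chunks instead; the thread does not establish whether that contributed to the slowdown.

```python
CHUNK_SIZE = 1 << 20  # 1 MiB per read; an illustrative value, not a tuned one


def copy_in_chunks(src, dst, chunk_size=CHUNK_SIZE):
    """Copy src to dst in fixed-size chunks; return the number of bytes written."""
    total = 0
    while True:
        data = src.read(chunk_size)
        if not data:  # empty read means end of file
            break
        dst.write(data)
        total += len(data)
    return total


def push_file(api_client, local_path, collection_path,
              put_threads=4, replication=None):
    """Hypothetical helper: upload one local file into a new collection."""
    # Imported here so copy_in_chunks can be used without the Arvados SDK.
    import arvados.collection

    # Assumption: these keyword names match the installed SDK version.
    coll = arvados.collection.Collection(
        api_client=api_client,
        replication_desired=replication,
        put_threads=put_threads,
    )
    with open(local_path, 'rb') as filein, \
            coll.open(collection_path, 'wb') as col_file:
        byte_count = copy_in_chunks(filein, col_file)
    coll.save_new()  # create the collection record once the data is written
    return coll, byte_count
```

With the objects from the original post, a call might look like push_file(self.arv, fileinfo.path, 'raw_data/' + fileinfo.filename, put_threads=4). The 4-6 thread range comes from Peter's reply; the best value for a given cluster would need to be measured.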