
Accessing Data Through Google Cloud Platform


Overview

This guide is meant to give you the basic steps and knowledge to access the data for which you have been approved via the Google Cloud Platform (GCP). It is not meant to be an exhaustive tutorial on GCP, and we suggest getting background on the key services in GCP via the following tutorials:

We assume you have already acquired an account; if not, please read our Getting a Google Account page in this documentation repository.

This guide will perform all operations via the command line. Please ensure you have installed the Google Cloud SDK, or use the Cloud Shell available in the console. Note: Cloud Shell runs in a virtual machine, so anything downloaded there will not be persisted to your own computer.

Logging in to GCP

Log into the SDK with the following command:

gcloud auth login

You will be taken to a browser to authenticate. Authenticate with the account with which you applied for the data access. You can check which account is authenticated with:

gcloud auth list
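
The credentials above are used by gcloud and gsutil. If you also plan to use the client libraries (as in the Python example later in this guide), set up Application Default Credentials with gcloud auth application-default login. A minimal sketch to check what the libraries will pick up:

import google.auth

# Loads Application Default Credentials; raises DefaultCredentialsError
# if none have been configured.
credentials, project = google.auth.default()
print(project)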

Accessing data

All data access is performed with the gsutil command-line tool or via the Google Cloud Storage APIs. We recommend gsutil for most use cases.

Data Request Bucket

We create a unique bucket for each data request, which contains:

You will receive this bucket name from our data request service team. It will look something like hmf-dr-123, and you can inspect it with gsutil like so:

gsutil -u your-project ls gs://hmf-dr-123

Note the -u your-project flag; it is necessary to assign a billing project to the request for any egress costs (even though an ls does not incur egress, all operations still need a billing project). See the costs overview for more details.
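
If you prefer the client libraries over gsutil, here is a minimal listing sketch in Python, using the same placeholder bucket and project names as above:

from google.cloud import storage

client = storage.Client()
# user_project plays the same role as gsutil's -u flag: it sets the
# project that is billed for the request.
bucket = client.bucket('hmf-dr-123', user_project='your-project')
for blob in bucket.list_blobs():
    print(blob.name)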

To download files locally or to a VM:

gsutil -u your-project cp gs://hmf-dr-123/metadata.tar /path/on/your/local/machine/

To copy the files to another bucket you've created in your own project:

gsutil -u your-project cp gs://hmf-dr-123/metadata.tar gs://your-new-bucket/
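
The equivalent copies can be done with the Python client, continuing with the bucket object from the listing sketch above (the object name and local path are placeholders):

# Download a single object to a local file
bucket.blob('metadata.tar').download_to_filename('metadata.tar')

# Or copy it server-side into a bucket in your own project
destination = client.bucket('your-new-bucket')
bucket.copy_blob(bucket.blob('metadata.tar'), destination)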

Manifest JSON

When dealing with the aligned reads and RNASeq data, our key challenge is to avoid duplicating the data and the associated costs. This means we expose the data to the requester directly from our own buckets.

To accomplish this we provide the data requester with a JSON file called the manifest, which contains URLs to all CRAMs, their indexes and the RNASeq FASTQs in the request. Along with this, we grant your account read access on each object's Access Control List (ACL).

The manifest.json is located in each data request bucket. It is intended to be a compact representation of the exposed data that can be easily parsed by a script or program, but also easily read in an editor. The manifest gives you the following information:

You can find an example here.

JSON has good support in most programming languages. For instance, with Python you can load the manifest straight from GCS into a dict in a few lines:

import json
from google.cloud import storage

client = storage.Client()
# The data request bucket is requester-pays, so pass your own project for billing
bucket = client.bucket('hmf-dr-123', user_project='your-project')
manifest_blob = bucket.get_blob('manifest.json')
data = json.loads(manifest_blob.download_as_text())
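
Once loaded, the dict can drive a download or analysis loop. The sketch below is purely illustrative: the field names are hypothetical, so consult the example manifest linked above for the actual schema.

# Hypothetical field names for illustration only; check the example
# manifest for the real structure of your data request.
for sample in data.get('samples', []):
    print(sample.get('cram'), sample.get('crai'))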

The intent of the manifest is to enable the use of GCP to scale analysis horizontally across virtual machines and avoid the time and expense of large downloads. At Hartwig Medical Foundation this generally follows the pattern:

We've also seen the manifest parsed into Nextflow configuration, which manages the GCP details for you. We kept things simple by design, and we hope to see many creative analysis implementations built on the manifest.

GCP Costs

When using any cloud platform, it's very important to understand the cost of operations. The good news is that GCP is very competitively priced and can also help alleviate load on internal HPC clusters and staff.

GCP has a very simple pricing model (linear in CPU, memory and storage) and you can find all the details here.

When using GCP compute resources we strongly recommend using pre-emptible VMs, which can save up to 80% on CPU and memory costs.
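
Note that pre-emptible VMs can be stopped by Google at any time and run for at most 24 hours, so your workload should be able to checkpoint or restart. You can request one by adding the --preemptible flag to gcloud compute instances create.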

Within GCP, egress (traffic that exits an entity or network boundary) may be charged. See details here.
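
As an illustration only (rates change, so always check the current pricing): at an assumed internet egress rate of around $0.12 per GB, downloading a 100 GB CRAM to your local machine would cost roughly $12, whereas reading the same file from a VM in the same region typically incurs no egress charge.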

We suggest using the pricing calculator to get an estimate for your workload. That said, here are some key costs to keep in mind (for the most up-to-date prices please check the pricing calculator):

Privacy and Security

When moving to a cloud platform and dealing with personal health data, many have concerns about privacy and security compared to an on-premise storage solution. The reality is that Google has much more expertise and resources to secure our data and processing than we could provide internally. Have a read of their white paper for more details.

That said, we have added additional security and privacy measures to ensure our data is only ever accessed by intended parties:

It is the responsibility of the requester to ensure that their environment is set up with adequate security and that they are operating in accordance with the License Agreement.