Accessing Data Through Google Cloud Platform
Overview
This guide gives you the basic steps and knowledge needed to access the data for which you have been approved via the Google Cloud Platform (GCP). It is not meant to be an exhaustive tutorial on GCP, and we suggest getting background on the key services in GCP via the following tutorials:
We assume you have already acquired an account; if not, please read our Getting a Google Account page in this documentation repository.
This guide performs all operations via the command line. Please ensure you've installed the Google Cloud SDK, or use the Cloud Shell available in the console. Note: The Cloud Shell runs in a virtual machine, so anything downloaded there will not be persisted to your own computer.
Logging in to GCP
Log into the SDK with the following command:
gcloud auth login
You will be taken to a browser to authenticate. Authenticate with the account with which you applied for the data access. You can check which account is authenticated with:
gcloud auth list
Accessing data
All data access is performed with gsutil or via the Google Cloud Storage APIs. We recommend gsutil for most use cases.
Data Request Bucket
We create a unique bucket for each data request which contains:
- If applicable, a metadata.tar containing the clinical data
- If applicable, a somatics.tar containing the somatic data
- If applicable, a germline.tar containing the germline data
- A manifest JSON file, which will contain Google Storage URLs to any DNA CRAM or RNASeq FASTQ data
You will receive this bucket name from our data request service team, but it will look something like hmf-dr-123, and you can inspect it with gsutil like so:
gsutil -u your-project ls gs://hmf-dr-123
Note the -u your-project flag. It is necessary to assign the billing project to the request for any egress costs (even though an ls does not incur egress, all operations still need a billing account). See the costs overview for more details.
To download files locally or to a VM:
gsutil -u your-project cp gs://hmf-dr-123/metadata.tar /path/on/your/local/machine/
To copy the files to another bucket you've created in your own project:
gsutil -u your-project cp gs://hmf-dr-123/metadata.tar gs://your-new-bucket/
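When scripting many transfers, it can help to build the gsutil invocation in one place so the billing project is never left out. The following is a minimal sketch of that idea in Python; the helper name gsutil_cp_command is hypothetical (it is not part of gsutil or any Google library), and the project and bucket names are the placeholders used above.

```python
import shlex


def gsutil_cp_command(billing_project, source, destination):
    """Return the argument list for a gsutil copy billed to billing_project."""
    if not source.startswith("gs://"):
        raise ValueError("source must be a gs:// URL")
    # -u assigns the billing project, as in the commands shown above.
    return ["gsutil", "-u", billing_project, "cp", source, destination]


command = gsutil_cp_command(
    "your-project", "gs://hmf-dr-123/metadata.tar", "gs://your-new-bucket/"
)
# Print the command for inspection before running it.
print(shlex.join(command))
```

To actually run the copy, the list can be passed to subprocess.run(command, check=True) on a machine where the Google Cloud SDK is installed and authenticated.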
Manifest JSON
When dealing with the aligned reads and RNASeq data, our key challenge is to avoid any duplication of the data and the associated costs. This means we need to expose the data directly to the requester from our own buckets.
To accomplish this we provide the data requester with a JSON file called the manifest. The manifest contains URLs to all CRAMs, their indexes and RNASeq FASTQ files in the request. Along with this, we grant your account read access through each object's Access Control List (ACL).
The manifest.json is located in each data request bucket. The intention is to provide a compact representation of the exposed data which can be easily parsed by a script or program, but also easily read in an editor.
The manifest gives you the following information:
- The unique ID of the data request
- The accounts which have access to the data in the manifest
- The Google Cloud Storage (GCS) URLs of the aforementioned TAR files
- For each sample in the data request:
  - The GCS URLs of CRAM and CRAI files (if requested and approved)
  - The GCS URLs of RNASeq FASTQ files (if requested and approved)
You can find an example here.
JSON has good support in most programming languages. For instance, with Python you can load the manifest straight from GCS into a dict in a few lines:
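import json
from google.cloud import storage

client = storage.Client()
# Pass the bare bucket name (no gs:// prefix); user_project sets the billing project, like gsutil -u
bucket = client.bucket('hmf-dr-123', user_project='your-project')
# Download the manifest object as text and parse it into a dict
manifest_text = bucket.blob('manifest.json').download_as_text()
data = json.loads(manifest_text)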
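Once the manifest is a dict, a script can walk it and split each GCS URL into a bucket name and an object name, which is the form the storage client expects. The sketch below illustrates this on a small inline manifest. The key names (samples, sampleId, cram, crai) and the example-bucket URLs are hypothetical and chosen only for illustration; consult the real manifest.json for the actual layout.

```python
import json

# Hypothetical manifest layout, for illustration only; the real
# manifest.json may use different key names.
EXAMPLE_MANIFEST = json.loads("""
{
  "id": "hmf-dr-123",
  "samples": [
    {"sampleId": "SAMPLE_A",
     "cram": "gs://example-bucket/SAMPLE_A/SAMPLE_A.cram",
     "crai": "gs://example-bucket/SAMPLE_A/SAMPLE_A.cram.crai"}
  ]
}
""")


def split_gcs_url(url):
    """Split gs://bucket/path/to/object into (bucket, object_name)."""
    prefix = "gs://"
    if not url.startswith(prefix):
        raise ValueError(f"not a GCS URL: {url}")
    bucket, _, object_name = url[len(prefix):].partition("/")
    return bucket, object_name


def cram_objects(manifest):
    """Yield (sample_id, bucket, object_name) for each sample that has a CRAM."""
    for sample in manifest.get("samples", []):
        if "cram" in sample:
            bucket, object_name = split_gcs_url(sample["cram"])
            yield sample["sampleId"], bucket, object_name


for sample_id, bucket, object_name in cram_objects(EXAMPLE_MANIFEST):
    print(sample_id, bucket, object_name)
```

Keeping the URL splitting in one small function avoids passing a gs:// prefixed string where a bare bucket name is expected.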
The intent of the manifest is to enable the use of GCP to scale analysis horizontally across virtual machines and avoid the time and expense of large downloads. At Hartwig Medical Foundation this generally follows the pattern:
- Create a VM with a predefined startup script
- Within the startup script, download the data you need
- Within the startup script, run your analysis
- Within the startup script, upload the results to your own bucket
- Terminate the VM
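The pattern above can be sketched as a small Python function that generates the startup script for one VM. This is an illustrative sketch, not the script we use ourselves: the function name build_startup_script, the /data/in and /data/out directories, and the example analysis command are all assumptions. Note also that shutdown -h now only stops the instance; deleting it afterwards is a separate step.

```python
import shlex


def build_startup_script(billing_project, input_urls, analysis_command, results_url):
    """Return a bash startup script: download inputs, analyse, upload, shut down."""
    lines = [
        "#!/bin/bash",
        "set -euo pipefail",
        # Stop the VM when the script exits, even if a step fails,
        # so a broken run does not keep accruing compute costs.
        "trap 'shutdown -h now' EXIT",
        "mkdir -p /data/in /data/out",
    ]
    for url in input_urls:
        # Download each input, billing egress to the requester's project.
        lines.append(
            f"gsutil -u {shlex.quote(billing_project)} cp {shlex.quote(url)} /data/in/"
        )
    # The analysis is expected to read /data/in and write to /data/out.
    lines.append(analysis_command)
    # Upload the results to a bucket in the requester's own project.
    lines.append(f"gsutil -m cp -r /data/out {shlex.quote(results_url)}")
    return "\n".join(lines) + "\n"


script = build_startup_script(
    billing_project="your-project",
    input_urls=["gs://hmf-dr-123/metadata.tar"],
    analysis_command="tar -tf /data/in/metadata.tar > /data/out/contents.txt",
    results_url="gs://your-new-bucket/results/",
)
print(script)
```

The generated text can be saved to a file and supplied as the startup-script metadata value when the VM is created. Generating one script per sample is one simple way to spread an analysis across many VMs.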
We've also seen the manifest parsed into Nextflow configuration, which manages the GCP details for you. We kept things simple by design, and we hope to see many creative analysis implementations built on the manifest.
GCP Costs
When using any cloud platform, it's very important to understand the cost of operations. The good news is that GCP is very competitively priced and will also help alleviate load on internal HPCs and staff.
GCP has a very simple pricing model (linear on CPU, memory and storage) and you can find all the details here.
When using GCP compute resources we strongly recommend using pre-emptible VMs, which save roughly 80% on CPU and memory costs.
Within GCP, egress (traffic that exits an entity or network boundary) may be charged. See details here.
We suggest using the pricing calculator to get an estimate for your workload. That said, here are some key costs to keep in mind (for the most up-to-date price please check the pricing calculator):
- Using a 32-CPU, 120 GB virtual machine for one hour will cost about $1.60/€1.42 or $0.30/€0.27 if pre-emptible
- Storing 1 TB of data for a month will cost about $20/€18
- Downloading 1 TB of data to a local server will cost about $120/€106
- Aligning 100 samples sequenced to 90x with BWA and storing CRAM for 1 year costs approximately $3100/€2755 (~$700/€622 for the compute and $2400/€2133 storage)
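For a rough first estimate before opening the pricing calculator, the approximate USD figures in the list above can be combined in a few lines of Python. This is a back-of-the-envelope sketch only: the rates are the indicative values quoted above and will drift over time, and the function name and the example workload are assumptions made for illustration.

```python
# Approximate USD rates quoted in the list above; check the pricing
# calculator for current values before relying on them.
VM_HOURLY_USD = 1.60              # 32-CPU, 120 GB VM, on demand
VM_HOURLY_PREEMPTIBLE_USD = 0.30  # same VM, pre-emptible
STORAGE_TB_MONTH_USD = 20.0       # storing 1 TB for one month
EGRESS_TB_USD = 120.0             # downloading 1 TB to a local server


def estimate_cost_usd(vm_hours, stored_tb, months, egress_tb=0.0, preemptible=True):
    """Return a rough cost breakdown in USD, assuming linear pricing."""
    hourly = VM_HOURLY_PREEMPTIBLE_USD if preemptible else VM_HOURLY_USD
    compute = round(vm_hours * hourly, 2)
    storage = round(stored_tb * months * STORAGE_TB_MONTH_USD, 2)
    egress = round(egress_tb * EGRESS_TB_USD, 2)
    return {
        "compute": compute,
        "storage": storage,
        "egress": egress,
        "total": round(compute + storage + egress, 2),
    }


# Hypothetical workload: 100 VM-hours, 2 TB of results kept for 3 months,
# and nothing downloaded out of GCP.
print(estimate_cost_usd(vm_hours=100, stored_tb=2, months=3))

# The same workload on on-demand VMs, for comparison.
print(estimate_cost_usd(vm_hours=100, stored_tb=2, months=3, preemptible=False))
```

Comparing the two calls shows how much of a compute-heavy budget depends on using pre-emptible VMs, and adding a non-zero egress_tb shows how quickly local downloads can dominate the total.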
Privacy and Security
When moving to a cloud platform and dealing with personal health data, many have concerns about privacy and security compared to an on-premise storage solution. The reality is that Google has much more expertise and resources to secure our data and processing than we could provide internally. Have a read of their white paper for more details.
That said, we have added additional security and privacy measures to ensure our data is only ever accessed by intended parties:
- All private data is encrypted with our own key using Customer Managed Encryption. This ensures that no one at Google can access our data.
- Any access to private data is logged using Audit Logging.
- Resource location restrictions ensure the data always resides in the EU.
- All VMs we create have a private IP address and reside on a private network, with no access to the public internet.
It is the responsibility of the requester to ensure that their environment is set up with adequate security and that they are operating in accordance with the License Agreement.