BigQuery to access data stored in AWS S3
BigQuery is a Google product that lets data users perform interactive analytics on massively large datasets within Google Cloud Platform (GCP). It is similar to Redshift from AWS and DashDB from IBM, but it is serverless. To use it, you need a GCP account and, of course, the data has to reside on storage within Google's infrastructure, such as Google Drive or Google Cloud Storage. GCP provides a native service, the Storage Transfer Service, that allows bringing data in from AWS S3.
This part of the article provides a step-by-step guide and highlights some issues that you may encounter (I did) while bringing data from AWS S3 (Simple Storage Service) to Google Cloud Storage.
Prerequisites:
- You have access to both Google Cloud Platform and AWS.
- You need to transfer data from S3 to GCS.
- Unload the data from Redshift to S3 (refer to http://docs.aws.amazon.com/redshift/latest/dg/r_UNLOAD.html for syntax if needed).
Note that by default UNLOAD generates pipe (|) delimited files, so you may want to change the delimiter depending on how the data is stored in Redshift.
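The unload step can be sketched in code. Below is a minimal sketch of composing an UNLOAD statement with an explicit delimiter; `build_unload_sql` is a hypothetical helper, and the query, S3 path, and IAM role string are placeholders you would replace with your own before running the statement through your usual Redshift client.

```python
# Minimal sketch: composing a Redshift UNLOAD statement with an explicit
# delimiter. The query, S3 path, and credential string are placeholders.

def build_unload_sql(select_sql: str, s3_path: str, credentials: str,
                     delimiter: str = ",") -> str:
    # Redshift defaults to pipe (|) when DELIMITER is omitted, so we set it
    # explicitly here.
    escaped = select_sql.replace("'", "''")  # escape quotes inside the query
    return (
        f"UNLOAD ('{escaped}') "
        f"TO '{s3_path}' "
        f"CREDENTIALS '{credentials}' "
        f"DELIMITER '{delimiter}' ALLOWOVERWRITE"
    )

sql = build_unload_sql(
    "SELECT * FROM my_table",                 # placeholder query
    "s3://my-bucket/InputData/part_",         # placeholder S3 output prefix
    "aws_iam_role=arn:aws:iam::123456789012:role/my-redshift-role",  # placeholder
    delimiter=",",
)
print(sql)
```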
- Log in to GCP and go to “Storage”
- Use Transfer Option
- As mentioned, BigQuery can only access data stored within Google infrastructure, so we first need to transfer the data from S3 to GCS. The Transfer option does exactly that.
- Specify the Source, Target, and Schedule for the data transfer
In the screen above, the key configuration needed to transfer data from S3 to GCS is the following:
- Select the Source as Amazon S3 bucket.
- Specify the bucket name. Do not add a path separator, i.e. a forward slash (/), after the bucket name, and do not specify an exact path; this parameter is for the bucket name ONLY.
- Provide an Access Key and Secret Key with permission to read files from the bucket you are using as the source.
- If there are folders inside the S3 bucket, use the filter and specify them as prefixes. For example, in my case the files were under subfolders (InputData/dataplatformexperts) in the specified bucket, so I entered that value under “Transfer files with these prefixes”, using the “Add Item” button shown above.
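The console fields above map onto the Storage Transfer Service REST API, so the same job can be described as a `transferJobs.create` request body. The sketch below builds that body as a plain dictionary; the project ID, bucket names, keys, and prefix are placeholders, and `make_transfer_job` is a hypothetical helper, not part of any Google SDK.

```python
# Sketch: the console configuration expressed as a Storage Transfer Service
# transferJobs.create request body. All concrete values are placeholders.

def make_transfer_job(project_id, s3_bucket, gcs_bucket,
                      access_key_id, secret_key, prefixes=None):
    spec = {
        "awsS3DataSource": {
            "bucketName": s3_bucket,  # bucket name ONLY, no slash or path
            "awsAccessKey": {
                "accessKeyId": access_key_id,
                "secretAccessKey": secret_key,
            },
        },
        "gcsDataSink": {"bucketName": gcs_bucket},
    }
    if prefixes:
        # equivalent of "Transfer files with these prefixes" in the console
        spec["objectConditions"] = {"includePrefixes": list(prefixes)}
    return {"projectId": project_id, "status": "ENABLED", "transferSpec": spec}

job = make_transfer_job(
    "my-gcp-project", "my-s3-bucket", "my-gcs-bucket",   # placeholders
    "AKIA-PLACEHOLDER", "SECRET-PLACEHOLDER",            # placeholders
    prefixes=["InputData/dataplatformexperts"],
)
```

You would then submit `job` to the Storage Transfer Service API (for example via the Google API client) instead of clicking through the console.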
Once you have specified all valid values, click Continue to provide details for the Destination (GCS).
- Add the Destination (GCS) details here:
In this screen, use the Browse button to explore your GCS bucket(s). A dialog appears where you can either use an existing bucket or create a new one. In my case I wanted to create a new bucket, so here is what had to be done:
Select the bucket from the list:
That will bring up the following page.
You may type the GCS bucket name here directly, but make sure the path does NOT end with a forward slash (/). The other options (check boxes) depend on your use case; check them as needed.
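Since a trailing slash is easy to miss, a tiny guard like the following can catch it before you submit the job. This is a hypothetical helper of my own, not part of any Google SDK:

```python
def check_bucket_name(name: str) -> str:
    # The Transfer screen rejects bucket names that end with a forward slash,
    # and the error it reports reads like a permissions problem rather than
    # a naming problem, so it is worth checking up front.
    if name.endswith("/"):
        raise ValueError(f"bucket name must not end with '/': {name!r}")
    return name

print(check_bucket_name("my-gcs-bucket"))  # placeholder bucket name; prints it back
```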
- Now you can proceed with creating/executing the job:
If all the information specified for Source and Destination is valid, the following screen appears, which is in turn a confirmation page:
The GCP Transfer service will now begin the data transfer from S3 to GCS and report status on the Transfer Service page. I was amazed by the performance: it took less than 8 minutes to transfer 478 GB of data.
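As a quick sanity check on those numbers, 478 GB in 8 minutes works out to roughly 1 GB/s of sustained throughput:

```python
# Back-of-the-envelope throughput from the figures above.
gb_transferred = 478
minutes = 8
throughput_gb_per_s = gb_transferred / (minutes * 60)
print(round(throughput_gb_per_s, 2))  # prints 1.0 (GB/s)
```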
A possible error that is not captured in the documentation:
Permissions issue. You must be a bucket owner of the destination bucket. Ask a bucket owner or project owner to grant you this permission.
There are multiple reasons for this error. Some are genuinely related to permissions, but it also occurs if the path ends with a forward slash, as shown below:
Otherwise, the following message is expected (unless the Google Cloud Platform team changes the error message to be more specific, or allows a trailing forward slash at the end of the bucket path).