Table of Contents
1) Set up DBT
What is Homebrew? Homebrew is an open-source package manager that makes it easier to install software on macOS. It's like an App Store for command-line installations.
1.1 Install Homebrew. Then, run:
brew install git
brew tap fishtown-analytics/dbt
1.2 Test your installation with dbt --version
Take note: The most time-consuming part is brew update. Sometimes when you run brew update it seems to have stopped, but it is actually running behind the scenes. So please do not cancel brew update or close your laptop; otherwise it might take even longer!
2) Create a BigQuery project
2.1 Go to the BigQuery Console
2.2 Create a new project
2.3 Head back to the BigQuery Console, and ensure your new project is selected. Please take note of your project location, because later on you will need to put it into your dbt profiles.yml config file.
3) Generate BigQuery credentials
3.1 Go to the BigQuery credential wizard. Ensure to choose your correct project name.
3.2 Click Create Credentials > Help me choose. Generate credentials with the following options:
- Which API are you using? BigQuery API
- What data will you be accessing? Application data
- Are you planning to use this API with App Engine or Compute Engine? No, I’m not using them
- Service account details — Service account name:
- Grant this service account access to the project — Role: BigQuery Admin
- Grant users access to this service account — Service account admin roles: I put my Gmail address.
3.3 Go back to the BigQuery credential wizard.
3.4 Under the Service Accounts, click your newly-created service account > Go to Show Domain-wide Delegation > Click Create new key > Choose JSON > Download the JSON file and save it in an easy-to-remember spot, with a clear filename.
4) Create data set
- Once you select your project name, click Create data set > Name your data set accordingly.
Take note: Please take note of your project location, because later on you will need to put it into your dbt profiles.yml config file. For example, I chose Singapore (asia-southeast1) as my data location.
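Since the profiles.yml file will hold both the credentials and this location, here is a minimal sketch of what a BigQuery profile can look like. The profile name, project ID, dataset name, and key-file path below are placeholders, not values from this tutorial:

```yaml
# ~/.dbt/profiles.yml (sketch; replace the placeholder values with your own)
my_dbt_project:
  target: dev
  outputs:
    dev:
      type: bigquery
      method: service-account
      project: your-gcp-project-id        # your BigQuery project ID
      dataset: your_dataset               # the data set you created above
      keyfile: /path/to/service-account-key.json
      location: asia-southeast1           # must match your data set's location
      threads: 1
```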
5) Clean Data
- As mentioned above, the dataset I used is the Brazilian E-Commerce Public Dataset by Olist. For this project, I will use all 9 .csv files except for one.
- When you upload data to BigQuery from .csv files, the maximum file size is 10 MB. So I trimmed a couple of rows so that all files are under 10 MB before the upload.
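Trimming can also be scripted instead of done by hand. Here is a minimal sketch that copies rows until the output file would cross the limit; the function name trim_csv_to_limit is my own, not from this tutorial, and it assumes a plain comma-separated file with a single header row:

```python
import csv

# The 10 MB figure is the local-file upload limit mentioned above.
MAX_BYTES = 10 * 1024 * 1024

def trim_csv_to_limit(src_path, dst_path, max_bytes=MAX_BYTES):
    """Copy the header plus as many data rows as fit under max_bytes.

    Returns the number of data rows written (hypothetical helper,
    not part of the original workflow).
    """
    written = 0
    with open(src_path, newline="", encoding="utf-8") as src, \
         open(dst_path, "w", newline="", encoding="utf-8") as dst:
        reader = csv.reader(src)
        writer = csv.writer(dst)
        writer.writerow(next(reader))  # always keep the header row
        for row in reader:
            # estimate the bytes this row will add (cells + commas + "\r\n")
            row_bytes = len(",".join(row)) + 2
            if dst.tell() + row_bytes > max_bytes:
                break  # stop before the file crosses the upload limit
            writer.writerow(row)
            written += 1
    return written
```

You would run this once per oversized file before uploading the trimmed copy.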
- In this tutorial, I will demo the upload of 4 files (olist_customers_dataset.csv, olist_order_items_dataset.csv, olist_order_reviews_dataset.csv, and product_category_name_translation.csv), because each of them has a different type of error and workaround.
5.1 — Upload olist_customers_dataset.csv
- This file is the easiest. In your newly created dataset, click Create table > Choose Upload and select the olist_customers_dataset.csv > Name your table accordingly.
The final result should look like this.
5.2 — Upload olist_order_items_dataset.csv
Let's move on to the second file. Select olist_order_items_dataset.csv and follow the same steps as for the first file.
- Error: You will likely encounter this error: null values appear in the order_items table when importing to BigQuery. However, if you click through to the last page of the preview, you will see the non-null values.
- Reason: The null values are caused by the empty rows in the .csv file.
- Workaround: Hence, based on my experience, the best practice is to open the .csv files in your code editor, check if there are any empty rows, and delete all of them. In the image below, I dragged olist_order_items_dataset.csv into Visual Studio Code and deleted all of the empty rows from line 51237 onwards.
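If you prefer not to delete the rows by hand, the same cleanup can be sketched in a few lines of Python. The helper name drop_empty_rows is my own invention, and "empty" here means every cell in the row is blank:

```python
import csv

def drop_empty_rows(src_path, dst_path):
    """Rewrite a CSV without fully empty rows; returns how many were dropped."""
    dropped = 0
    with open(src_path, newline="", encoding="utf-8") as src, \
         open(dst_path, "w", newline="", encoding="utf-8") as dst:
        writer = csv.writer(dst)
        for row in csv.reader(src):
            # a row counts as empty if every cell is blank after stripping
            if any(cell.strip() for cell in row):
                writer.writerow(row)
            else:
                dropped += 1
    return dropped
```

Blank lines and rows that contain only commas are both removed; everything else is copied through unchanged.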
- Error: You may also encounter a problem related to the TIMESTAMP data type. Supposedly the correct date-time format in BigQuery is yyyy/mm/dd hh:mm:ss. However, as can be seen from the column shipping_limit_date, the date-time format is 0018-01-18 02:48:00 UTC. This is not compatible with BigQuery.
- Reason: In the .csv file, all of the date-time columns, for example shipping_limit_date, use the dd/mm/yyyy hh:mm:ss format. Hence, we will need to change the date format to yyyy/mm/dd hh:mm:ss to be compatible with BigQuery.
- Workaround: To change the date format, select all rows of the shipping_limit_date column (shift + command + down arrow for Mac, shift + ctrl + down arrow for Windows) > Click More numbers format > Under Type, change the format from dd/mm/yyyy hh:mm:ss to yyyy/mm/dd hh:mm:ss (please see my image below for reference).
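The same conversion can be done programmatically. Here is a minimal sketch mirroring the spreadsheet workaround described above; the function name reformat_timestamp is mine, and it assumes the input strictly follows the dd/mm/yyyy hh:mm:ss pattern:

```python
from datetime import datetime

def reformat_timestamp(value):
    """Convert a 'dd/mm/yyyy hh:mm:ss' string to 'yyyy/mm/dd hh:mm:ss'."""
    parsed = datetime.strptime(value, "%d/%m/%Y %H:%M:%S")
    return parsed.strftime("%Y/%m/%d %H:%M:%S")

# example: a shipping_limit_date value in day-first format
print(reformat_timestamp("18/01/2018 02:48:00"))  # 2018/01/18 02:48:00
```

Applying this to every value in the date-time columns avoids the day/month ambiguity that produced dates like 0018-01-18.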
Now let’s try to import again. The result table should look like this.
5.3 — Upload olist_order_reviews_dataset.csv
Let's move on to the third file. Now I am going to create the order_reviews table from olist_order_reviews_dataset.csv.
- Error: Then I got the error below.
Error while reading data, error message: Error detected while parsing row starting at position: 1765. Error: Missing close double quote (") character.
Error while reading data, error message: CSV processing encountered too many errors, giving up. Rows: 12; errors: 1; max bad: 0; error percent: 0
- Reason: It is due to special characters, such as the double quote (") character inside the field values.
- Workaround: Therefore, when you import this file, you will need to tick Allow quoted newlines under Advanced options.
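To see what this option is compensating for, here is a small sketch of a row whose text spans two lines inside a quoted field. The column names are made up for illustration; Python's csv module handles such fields the way BigQuery does with the option ticked:

```python
import csv
import io

# A row whose text contains a newline inside a quoted field: the kind of
# content that breaks the import unless "Allow quoted newlines" is ticked.
raw = 'review_id,review_message\n1,"great product\nfast delivery"\n'

# The parser keeps the quoted newline inside a single field
# instead of treating it as the start of a new row.
rows = list(csv.reader(io.StringIO(raw)))
print(rows)
```

A missing closing quote, by contrast, cannot be recovered by any option, which is what the "Missing close double quote" error above is complaining about.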
And here is the final result.
5.4 — Upload product_category_name_translation.csv
Yayyy, here's the last file I would like to demo!
- Error: When I create the product_category_name_translation table from the product_category_name_translation.csv file, I get incorrect column names such as string_field_x, as shown in the image below.
- Reason: I did some research, and found the workaround from Samet Karadag (thank you!).
- Workaround: We will add a dummy integer column (int) to the .csv file before creating the product_category_name_translation table. Then let's try to create the product_category_name_translation table again. Now you will see that the column names are recognised correctly.
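Adding the dummy column can be scripted rather than done in an editor. This is a sketch under the assumption that the file has a single header row; the helper name add_dummy_int_column is mine:

```python
import csv

def add_dummy_int_column(src_path, dst_path):
    """Prepend a dummy 'int' column so autodetect treats row 1 as a header.

    Each data row gains a leading integer cell, which (per the workaround)
    lets the string header names be recognised correctly.
    """
    with open(src_path, newline="", encoding="utf-8") as src, \
         open(dst_path, "w", newline="", encoding="utf-8") as dst:
        reader = csv.reader(src)
        writer = csv.writer(dst)
        writer.writerow(["int"] + next(reader))  # header gains an 'int' column
        for i, row in enumerate(reader, start=1):
            writer.writerow([i] + row)           # each data row gains an integer
```

Upload the rewritten file, then drop the dummy column again with the query below.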
- The next step is to remove that int column by creating a new table using this query:
CREATE TABLE `<project_id>.<dataset>.product_category_name_trans` AS
SELECT * EXCEPT (int)
FROM `<project_id>.<dataset>.product_category_name_translation`;
Here, product_category_name_translation is the initial table, and product_category_name_trans is the new table after removing the int column.
Tadaa your final result is here!
In summary, I have covered the installation part (how to set up DBT, create a BigQuery project at the BigQuery Console, generate BigQuery credentials at BigQuery credential wizard, create a data set) and the data pre-processing (clean data and upload data to BigQuery).
In my next episode, I will demonstrate dbt run and dbt test in detail, together with my struggles and workarounds as well!