
->Back: Lesson 3: Data Warehouse

->Next: Lesson 5: Batch Processing

Analytics Engineering

Table of contents

Alt text

Data roles

Alt text Alt text

Alt text

Kimball: integrates datamarts into a centralized data warehouse. It is based on the Business Dimensional Lifecycle concept: a structure of dimensions and facts across the conglomerate of datamarts that make up the DWH. The bus architecture joins these entities between datamarts through conformed dimensions. There is a separation between data processing and reporting (historical data). Iterative approach: it allows you to improve and adjust the data warehouse as more information is gained and new business requirements are identified. It starts up very fast, but each new iteration requires a lot of effort. It prioritizes data delivery over data redundancy control (3NF).

Inmon: data must be integrated and consolidated from all sources into a central data warehouse to provide a single view of the data. An Inmon system must meet four standards. Subject-oriented: all data related to the same subject is stored together. Integrated: the information from all source systems is stored in a central data warehouse, managing the relationships between them. Non-volatile: data is never erased. Time-variant: a history of the data is maintained so that you can consult a snapshot of the data as it was at a given point in time. The Inmon approach prioritizes an accurate and consistent data warehouse, so the greatest effort is made at the last layer.

Data Vault: aims to address the weaknesses of the previous two approaches by focusing on flexibility in the face of changes in the source systems. It seeks to be an efficient model, quick to implement and very dynamic. Detail-oriented: maximum level of detail of the information. Historical: all changes to the data are stored. Set of standard tables: the model is built on three main types of tables: Hubs (entities of interest to the business, containing the business keys and their metadata), Links (relationships between Hubs) and Satellites (historical store of information from the Hubs). Hard and soft rules: business rules have two layers, the hard ones that are immutable and the technical (soft) ones that facilitate changes.

| | Kimball | Inmon | Data Vault |
|---|---|---|---|
| Data model | Multidimensional | Relational | Relational |
| Schema | Star model: facts + dimensions | Snowflake model: entity-relationship | Star model on the last layer (marts) |
| Process | Bottom-up: first the datamarts are generated, with their truths already calculated, and then the central DWH is built as their aggregation | Top-down: a single truth of the data is managed in the central DWH and distributed to the different datamarts | Top-down: data goes through several layers (data source, data lake, staging and finally data vault) |
| Data detail and consistency | Conformed dimensions: identical dimension tables (or a centralized one in the DWH) connect data between the datamarts loaded into the DWH to maintain consistency | Data at the highest level of detail | Data at the highest level of detail |
| Historical data | Historical data is stored in a location other than the central DWH | SCD (slowly changing dimensions) are used to control historical data | SCD2 (slowly changing dimensions) in the Satellite tables |
| Normalization | Denormalized | Normalized | Normalized |
| Redundancy | Yes, it allows data redundancy in order to optimize data processing | No data redundancy | No data redundancy |

Comparison between Kimball vs Inmon vs Data Vault methodologies

Data Build Tool (dbt)

Alt text Alt text Alt text Alt text

dbt (Data Build Tool) is an open-source tool, distributed as a Python library, that streamlines the construction of data models by allowing developers to define, orchestrate, and execute transformations in a modern data warehouse such as BigQuery, Snowflake or Redshift. We could say that it is a governance tool focused on the "T" of an ETL/ELT process: it allows us to centralize and build all data transformations in SQL, organizing them as reusable modules (models). In addition, because it is inspired by software engineering practices, we can create validation tests and implement a full CI/CD cycle for our data pipelines. As a complement to the knowledge provided by the Bootcamp, the official introductory course (duration: 5 hours) is very interesting.

In my previous professional role, the logic was scattered across stored procedures in the SQL Server database, SQL queries inside the ETLs (Azure Data Factory, SSIS and Talend) and even in the load scripts of the visualization tools (QlikView and Power BI), and none of it was documented anywhere. What dbt brings to this paradigm is a governance or control layer that facilitates the maintenance and documentation of the logic and its lineage, increases resilience and collaboration thanks to version control and, finally, facilitates continuous integration and continuous delivery/deployment (CI/CD).

Alt text

Integrated Data Architecture: dbt as transformation software

Some of the main features of dbt

How to get started with dbt?

There are two ways to use dbt for free:

Install dbt Core with PIP

We have several options to install dbt Core on our computer or local server; the easiest way is through a Python environment, using pip:

pip install dbt-core

Then we will install the adapter or connector of the database engine that we are going to use. We have at our disposal an official catalog and a complementary one of the community, you can consult all the available connectors from here. In our case, we’re going to install the BigQuery adapter.

pip install dbt-bigquery

Optionally, create a dedicated Python environment first and run the pip installs inside it:

conda create -n dbt

Create a new project by running the dbt init command. This command clones the dbt starter project into the folder from which it is executed, which contains all the files and directories necessary to start our project.

dbt init

Alt text

Install dbt Core locally

If everything has gone well, we will be able to configure our project through the console (CLI) to generate the profiles.yml file:

Alt text

Install dbt core locally

Alt text

Check that all the files and directories of the dbt project have been generated in the path where we have executed the command

Install dbt Core with a Docker image

I recommend this reading to delve deeper into this step. dbt Labs publishes official images for mounting a dbt container with Docker; in our case we build an image targeting the BigQuery adapter:

docker build --tag my-dbt  --target dbt-bigquery .

Once the image is created, we start the container:

docker run \
--network=host \
--mount type=bind,source=path/to/project,target=/usr/app \
--mount type=bind,source=path/to/profiles.yml,target=/root/.dbt/profiles.yml \
my-dbt \
ls

Create a project in dbt Cloud

Before creating our first project in dbt Cloud, we must gather the necessary ingredients: create a service account, generate the JSON Key to grant dbt access to our BigQuery instance and create an empty repo on Github where the project’s files and directories will be stored:

Create service account + JSON Key for Big Query

Since in our case we are going to use BigQuery, the authentication is done by BigQuery OAuth. We need to create a service account from GCP and download the JSON key to grant dbt access.

  1. We access the Google Cloud Platform console and go to IAM and admin > Service accounts to create a new service account with BigQuery Admin and Storage Object Viewer permissions.

Create a BigQuery service account

In order to connect we need the service account JSON file generated from bigquery:

  1. Open the BigQuery credential wizard to create a service account in your taxi project
  1. You can either grant the specific roles the account will need or simply use bq admin, as you’ll be the sole user of both accounts and data.

Note: if you decide to use specific roles instead of BQ Admin, some users reported that they needed to add also viewer role to avoid encountering denied access errors

  1. Now that the service account has been created we need to add and download a JSON key: go to the Keys section, select "Create new key", choose the JSON key type, and once you click on Create it will immediately be downloaded for you to use.
  1. We download the JSON key to our computer (later we will upload it to dbt during the project setup process).

Create a repository on Github

We simply create an empty repo on Github and click on Git Clone to copy the SSH key with which we will link the dbt project. The connection with Github is made in two parts, for now we are only interested in the key, later we will see how to configure the deploy key generated from dbt.

Alt text

We copied the SSH key from our Github repo to connect it with dbt

Now we have the ingredients! We sign up for dbt Cloud with the free option for one user from this link. Once the email has been verified, we can create our project, choosing the data warehouse first.

Alt text

Creation of a new project in dbt (Data Build Tool)

In the next step, we load the JSON Key that we have generated with the BigQuery service account and all the parameters are automatically loaded. It’s a good idea to create the dataset in BigQuery manually to avoid cross-region incompatibility issues. In my case, I have used the name that has been generated by default and created the dataset directly.
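
For reference, a minimal sketch of creating that dataset with an explicit location using BigQuery SQL (the dataset name and region below are illustrative, not taken from the project setup; match them to your own project and to the location of your source data):

-- Hypothetical dataset name and location: adjust to your project
CREATE SCHEMA IF NOT EXISTS `de-project-397922.dbt_adellor`
OPTIONS (location = 'EU');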

Alt text

Configuring BigQuery dataset in dbt

We run the connection test and, if everything has gone well, on to the next step!

Alt text

Testing dbt connection with Google BigQuery

Now it’s time to set up the Github repository that we have previously created and perform the second step that we had pending. Select Git Clone and paste the SSH Key that we copied earlier. Press the Import button.

Alt text Alt text

Configuring Github with dbt

It will generate a deployment key that we need to copy into the Github repository settings:

Alt text

Deployment key generated in dbt to connect to Github

Going back to our Github repository, click on Settings and in the Security section click on Deploy Keys to add it. It is necessary to check the Allow write access option:

Alt text

Deploy keys section on the Github repo

Alt text

Configuring deploy Key on Github

Alt text

Deploy Keys on Github

If we click on the Next button in the dbt project settings, we’re done:

Alt text

Configure dbt project

We access Develop and we must initialize our project in dbt Cloud (similar to running the dbt init command in dbt Core):

Alt text

We initialized the dbt project

Alt text

Newly created dbt project

After waiting a few seconds, all the .yml files, SQL files and directories of the dbt project are created. We must click on Commit and sync to push them to our Github repo, and then create a new branch to enter edit mode.

Alt text

Remember that since the project is linked to a Github repo, if you want to work in the dbt Cloud GUI you need to create a branch first. To execute any command from the GUI, we can use the console in the footer. NB: macros are functions.

Alt text

Creating first dbt project

Create the schema (dataset) in BigQuery

How to setup dbt cloud with bigquery

Official documentation

Create a dbt cloud project

  1. Create a dbt cloud account from their website (free for solo developers)
  2. Once you have logged into dbt Cloud you will be prompted to create a new project

You are going to need:

Note: For the sake of showing the creation of a project from scratch I’ve created a new empty repository just for this week project.

image

  1. Name your project
  2. Choose Bigquery as your data warehouse: image
  3. Upload the key you downloaded from BQ on the create from file option. This will fill out most fields related to the production credentials. Scroll down to the end of the page and set up your development credentials.

Note: The dataset you’ll see under the development credentials is the one you’ll use to run and build your models during development. Since BigQuery’s default location may not match the one you used for your source data, it’s recommended to create this schema manually to avoid multi-region errors.

  1. Click on Test and after that you can continue with the setup

Add GitHub repository

Note: This step could be skipped by using a managed repository if you don’t have your own GitHub repo for the course.

  1. Select git clone and paste the SSH key from your repo.
  1. You will get a deploy key; head to your GH repo and go to the Settings tab. Under Security you’ll find the Deploy keys menu. Click on Add key and paste the deploy key provided by dbt Cloud. Make sure to tick "Allow write access".

Alt text

Review your project settings

At the end, if you go to your projects it should look something like this:

image

You could simplify the process of adding and creating repositories by linking your GH account. Official documentation

Alt text

Configure dbt project

Access the Develop tab and initialize the dbt project in dbt Cloud:

Alt text

Initialize the project

Alt text

Initializing the project creates the project files

Create a new branch to enter edit mode; you need to create a branch before editing the files.

Alt text

DBT Model

Creating a table in the target database with a .sql model file containing a SELECT statement:

{{ config(materialized='table') }}

SELECT *
FROM staging.source_table
WHERE record_state = 'ACTIVE'

In addition to the query, a dbt model includes a block of code in Jinja notation that we recognize by the double curly braces. Within this block we call the config() function, which is commonly used to specify the materialization (persistence) strategy of the dbt model in the target database. By default, there are four ways to materialize queries (table, view, incremental and ephemeral), although it is possible to create your own.
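
For instance, a minimal sketch of an incremental model header, reusing the table from the example above (the updated_at column is an assumption added purely for illustration, not part of this project):

-- Hypothetical incremental model: on repeat runs, only new rows are processed
{{ config(materialized='incremental') }}

SELECT *
FROM staging.source_table
WHERE record_state = 'ACTIVE'
{% if is_incremental() %}
  -- 'this' resolves to the target table of this model;
  -- only take rows newer than what is already stored there
  AND updated_at > (SELECT max(updated_at) FROM {{ this }})
{% endif %}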

To construct the model that persists the query as a table in our database, we need to run it with dbt run (or dbt build). If we don’t add any parameters, dbt will compile and build all the models; to build only a specific model, we add the --select parameter. Let’s look at it with two examples:

We run and build all the models:

dbt run

To build only myModel.sql, we run:

dbt run --select myModel.sql

When we run the model, dbt compiles it into the following:

CREATE TABLE my_schema.my_model AS (
    SELECT *
    FROM staging.source_table 
    WHERE record_state = 'ACTIVE'
)

The FROM clause in a dbt model

In addition to directly specifying the schema and table name, there are two ways to configure data sources in the FROM clause of a dbt model: sources and seeds.

Sources are used when the origin is a database table (BigQuery, PostgreSQL, ...). The connection is configured in a schema .yml file that we must create in the same directory where the model is located. It is possible to indicate whether we want to periodically check that the connection is operational (source freshness). When constructing the model, we substitute the "schema.table_name" in the FROM clause with a macro in Jinja notation that fetches this data from the .yml configuration file: the source() macro receives the name of the source defined in the .yml and the name of the table.



SELECT *
FROM {{ source('staging', 'partitioned_database_cluster') }}
LIMIT 100

Alt text

Defining the stg_partitioned_database_cluster_trips_data_all .sql model file

The schema.yml file that we create in the same directory where the model is located contains the version, source name, database, schema and tables. One of the advantages of having the configuration in a separate file from the models is that it allows us to change the connection for all of them from a single place:

version: 2

sources:
  - name: staging
    database: de-project-397922
    schema: trips_data_all

    tables:
      - name: partitioned_database_cluster
      - name: partitioned_database

Alt text

Defining schema.yml: it specifies the sources (source name, database, schema and tables)

Therefore, in our dbt /models/ folder we will have a .sql file for each model and a schema.yml with the configuration of the sources.

Alt text

successful run of dbt model

Alt text

view stg_partitioned_database_cluster_trips_data_all created

Macros

A dbt macro is similar to a function in any other language, written in Jinja notation. Macros are defined in separate .sql files in the /macros directory of the dbt project. As we have already mentioned, dbt ships with several macros by default, such as source(), ref() and config(), but we can create new ones that meet our needs. Some features of macros:

We can use three different types of Jinja blocks within a macro:

- {% ... %} statement blocks, used for control flow and to define the macro itself
- {{ ... }} expression blocks, which print a value into the compiled SQL
- {# ... #} comment blocks, which are ignored at compile time

alt text
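
A minimal sketch of what macros/get_payment_type_description.sql could look like, reconstructed from the compiled CASE WHEN shown further below and using all three Jinja block types:

{# Comment block: ignored at compile time #}
{% macro get_payment_type_description(payment_type) -%}
    case {{ payment_type }}
        when 1 then 'Credit card'
        when 2 then 'Cash'
        when 3 then 'No charge'
        when 4 then 'Dispute'
        when 5 then 'Unknown'
        when 6 then 'Voided trip'
    end
{%- endmacro %}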

We use the macro in our dbt model:



SELECT 
    {{ get_payment_type_description('payment_type') }} as payment_type_description
FROM {{ source('staging', 'green_tripdata') }}
WHERE vendorid is not null

When we run the model, dbt compiles the code so that the generated table has the macro translated into a SQL CASE WHEN statement:

SELECT 
    case payment_type
        when 1 then 'Credit card'
        when 2 then 'Cash'
        when 3 then 'No charge'
        when 4 then 'Dispute'
        when 5 then 'Unknown'
        when 6 then 'Voided trip'
    end as payment_type_description
FROM 
WHERE vendorid is not null

Alt text

Macros defined in macros folder

Alt text

Implementing the macro 'function' in a model

Package

Packages allow us to reuse macros between different projects, similar to libraries or modules in other programming languages. To use a package in our project, we must create a packages.yml configuration file in the root directory of our dbt project:

packages:
    - package: dbt-labs/dbt_utils
      version: 0.8.0

Alt text

define package

Download and install the package with its dependencies:

dbt deps

The dbt deps command downloads all the package's dependencies and files into our project; afterwards, the dbt_packages/dbt_utils directory is created.

Alt text

generate surrogate keys with dbt_utils.generate_surrogate_key macro

We can use the macros of the newly installed package in any model of our project. For example, the dbt_utils package includes the generate_surrogate_key macro to create a surrogate key (an internal table identifier generated as a hash of the chosen columns):



SELECT
    {{ dbt_utils.generate_surrogate_key(['vendorid', 'lpep_pickup_datetime']) }} as tripid,
    vendorid
FROM {{ source('staging', 'green_tripdata') }}
WHERE vendorid is not null

dbt compiles this model by creating the surrogate key with a hash function:

SELECT
    to_hex(md5(cast(coalesce(cast(vendorid as string), '') || '-' 
        || coalesce(cast(lpep_pickup_datetime as string), '') as string))) as tripid,
    vendorid
FROM 
WHERE vendorid is not null

Variables

Variables are similar to those of any programming language: they allow us to save a value and reuse it in any part of the project. They can be defined in two ways: in the dbt_project.yml file, or on the command line when running dbt:

vars:
    payment_type_values: [1, 2, 3, 4, 5, 6]

dbt build --var 'is_test_run: false'

The variable can be accessed from any other part of the project with the var() macro.

To use a variable within a model, we call the var() macro, optionally passing a default value:



SELECT *
FROM {{ source('staging', 'partitioned_database_cluster') }}
{% if var('is_test_run', default=true) %}
LIMIT 100
{% endif %}

Alt text

Implementing the variable: the query is limited to 100 rows when it is a test run

Alt text

Successful run of dbt run --select stg_partitioned_database_cluster_trips_data_all

Alt text

dbt run --select stg_partitioned_database_cluster_trips_data_all --vars 'is_test_run: false'

Seed

Similar to the external tables of BigQuery or Azure Synapse, we can reference any CSV file stored in the /seeds directory of the repository; since it lives in the repo, we can take advantage of its version control. It is recommended to use seeds for data that does not change frequently (parametric dimension tables, such as provinces).

Instead of the source() macro that we saw for database tables, for seeds we use the ref() macro, which receives the file name as a parameter; dbt automatically detects the dependencies and their location. The ref() macro can also be used to reference tables or views created by other models, simply by passing the model name. To create a seed, we just upload a CSV file to the /seeds directory of our repo and run dbt seed taxi_zone_lookup.csv. If we ran dbt seed without arguments, all the CSVs in the directory would be loaded into the database. This command generates a table in our database (BigQuery):

dbt seed taxi_zone_lookup.csv

Alt text

Add the seed CSV file into the seeds folder: either create the file in the folder and paste its contents, or clone the repo locally (git submodule), add the file, commit, and then create a pull request from the dbt UI

Alt text

The CSV file

Alt text

Running the dbt seed command

Alt text

taxi_zone_lookup table created via the seed

Now we can create a dbt model that refers to the newly created seed:



SELECT *
FROM {{ ref('taxi_zone_lookup') }}

If we want to change the data types that have been inferred from the CSV by default, we must modify the dbt_project.yml project configuration file and add a seeds block:

seeds:
    taxi_riders_ny:
       taxi_zone_lookup:
          +column_types:
              locationid: numeric

Alt text

To change the data types of data in the csv file

If we want to recreate the seed, we must use the --full-refresh flag: dbt seed micsv.csv --full-refresh

Alt text

Recreating the seed with full refresh



select 
    locationid,
    borough,
    zone,
    replace(service_zone, 'Boro', 'Green') as service_zone
FROM {{ ref('taxi_zone_lookup') }}

Alt text

define a table referencing the seed table

dbt run --select dim_zones

Alt text

Successfully created a table referencing the seed table

dbt build 

would build all the models in our project

dbt build --select +dim_zones

This would select only the dim_zones model and build it along with its dependencies.

Use Case

  1. Upload the CSV files into Google Cloud Storage and then create the dbt models from them. Taxi trip datasets: I’ve left Prefect scripts (etl_to_gcs_yellow_green_2019_2020.py and etl_to_gcs_fhv_2019.py) in my github repository to upload all the files to a GCS bucket and then work with them in dbt and persist them in the BigQuery database.

Alt text

prefect deployment script: creates a prefect deployment to run etl of yellow and green data from github repo

Alt text

Prefect github connector block. Specify the repo name

Alt text

deployment run

Alt text

prefect deployment script: creates a prefect deployment to run etl of fhv from github repo

  1. Afterwards, create external tables with a BigQuery query to consume the data as BigQuery tables:
CREATE OR REPLACE EXTERNAL TABLE trips_data_all.fhv_tripdata
OPTIONS (
  format = 'PARQUET',
  uris = ['gs://de_data_lake_de-project-397922/new_data/fhv/*.gz']
);

CREATE OR REPLACE EXTERNAL TABLE trips_data_all.green_tripdata
OPTIONS (
  format = 'PARQUET',
  uris = ['gs://de_data_lake_de-project-397922/new_data/green_*.gz']
);

CREATE OR REPLACE EXTERNAL TABLE trips_data_all.yellow_tripdata
OPTIONS (
  format = 'PARQUET',
  uris = ['gs://de_data_lake_de-project-397922/new_data/yellow_*.gz']
);

Alt text

Successfully created the external tables with a BigQuery query to consume the data as BigQuery tables

  1. Create the schema.yml file that specifies the model parameters:
version: 2

sources:
    - name: staging
      #For bigquery:
      database: de-project-397922

      # For postgres:
      # database: production

      schema: dbt_adellor
      # path where models are created at

      # loaded_at_field: record_loaded_at
      tables:
        - name: green_tripdata
        - name: yellow_tripdata
         # freshness:
           # error_after: {count: 6, period: hour}

models:
    - name: stg_green_tripdata
      description: >
        Trip made by green taxis, also known as boro taxis and street-hail liveries.
        Green taxis may respond to street hails,but only in the areas indicated in green on the
        map (i.e. above W 110 St/E 96th St in Manhattan and in the boroughs).
        The records were collected and provided to the NYC Taxi and Limousine Commission (TLC) by
        technology service providers. 
      columns:
          - name: tripid
            description: Primary key for this table, generated with a concatenation of vendorid+pickup_datetime
            tests:
                - unique:
                    severity: warn
                - not_null:
                    severity: warn
          - name: VendorID 
            description: > 
                A code indicating the TPEP provider that provided the record.
                1= Creative Mobile Technologies, LLC; 
                2= VeriFone Inc.
          - name: pickup_datetime 
            description: The date and time when the meter was engaged.
          - name: dropoff_datetime 
            description: The date and time when the meter was disengaged.
          - name: Passenger_count 
            description: The number of passengers in the vehicle. This is a driver-entered value.
          - name: Trip_distance 
            description: The elapsed trip distance in miles reported by the taximeter.
          - name: Pickup_locationid
            description: locationid where the meter was engaged.
            tests:
              - relationships:
                  to: ref('taxi_zone_lookup')
                  field: locationid
                  severity: warn
          - name: dropoff_locationid 
            description: locationid where the meter was disengaged.
            tests:
              - relationships:
                  to: ref('taxi_zone_lookup')
                  field: locationid
          - name: RateCodeID 
            description: >
                The final rate code in effect at the end of the trip.
                  1= Standard rate
                  2=JFK
                  3=Newark
                  4=Nassau or Westchester
                  5=Negotiated fare
                  6=Group ride
          - name: Store_and_fwd_flag 
            description: > 
              This flag indicates whether the trip record was held in vehicle
              memory before sending to the vendor, aka “store and forward,”
              because the vehicle did not have a connection to the server.
                Y= store and forward trip
                N= not a store and forward trip
          - name: Dropoff_longitude 
            description: Longitude where the meter was disengaged.
          - name: Dropoff_latitude 
            description: Latitude where the meter was disengaged.
          - name: Payment_type 
            description: >
              A numeric code signifying how the passenger paid for the trip.
            tests: 
              - accepted_values:
                  values: ""
                  severity: warn
                  quote: false
          - name: payment_type_description
            description: Description of the payment_type code
          - name: Fare_amount 
            description: > 
              The time-and-distance fare calculated by the meter.
              Extra Miscellaneous extras and surcharges. Currently, this only includes
              the $0.50 and $1 rush hour and overnight charges.
              MTA_tax $0.50 MTA tax that is automatically triggered based on the metered
              rate in use.
          - name: Improvement_surcharge 
            description: > 
              $0.30 improvement surcharge assessed trips at the flag drop. The
              improvement surcharge began being levied in 2015.
          - name: Tip_amount 
            description: > 
              Tip amount. This field is automatically populated for credit card
              tips. Cash tips are not included.
          - name: Tolls_amount 
            description: Total amount of all tolls paid in trip.
          - name: Total_amount 
            description: The total amount charged to passengers. Does not include cash tips.

    - name: stg_yellow_tripdata
      description: > 
        Trips made by New York City's iconic yellow taxis. 
        Yellow taxis are the only vehicles permitted to respond to a street hail from a passenger in all five
        boroughs. They may also be hailed using an e-hail app like Curb or Arro.
        The records were collected and provided to the NYC Taxi and Limousine Commission (TLC) by
        technology service providers. 
      columns:
          - name: tripid
            description: Primary key for this table, generated with a concatenation of vendorid+pickup_datetime
            tests:
                - unique:
                    severity: warn
                - not_null:
                    severity: warn
          - name: VendorID 
            description: > 
                A code indicating the TPEP provider that provided the record.
                1= Creative Mobile Technologies, LLC; 
                2= VeriFone Inc.
          - name: pickup_datetime 
            description: The date and time when the meter was engaged.
          - name: dropoff_datetime 
            description: The date and time when the meter was disengaged.
          - name: Passenger_count 
            description: The number of passengers in the vehicle. This is a driver-entered value.
          - name: Trip_distance 
            description: The elapsed trip distance in miles reported by the taximeter.
          - name: Pickup_locationid
            description: locationid where the meter was engaged.
            tests:
              - relationships:
                  to: ref('taxi_zone_lookup')
                  field: locationid
                  severity: warn
          - name: dropoff_locationid 
            description: locationid where the meter was disengaged.
            tests:
              - relationships:
                  to: ref('taxi_zone_lookup')
                  field: locationid
                  severity: warn
          - name: RateCodeID 
            description: >
                The final rate code in effect at the end of the trip.
                  1= Standard rate
                  2=JFK
                  3=Newark
                  4=Nassau or Westchester
                  5=Negotiated fare
                  6=Group ride
          - name: Store_and_fwd_flag 
            description: > 
              This flag indicates whether the trip record was held in vehicle
              memory before sending to the vendor, aka “store and forward,”
              because the vehicle did not have a connection to the server.
                Y= store and forward trip
                N= not a store and forward trip
          - name: Dropoff_longitude 
            description: Longitude where the meter was disengaged.
          - name: Dropoff_latitude 
            description: Latitude where the meter was disengaged.
          - name: Payment_type 
            description: >
              A numeric code signifying how the passenger paid for the trip.
            tests: 
              - accepted_values:
                  values: ""
                  severity: warn
                  quote: false
          - name: payment_type_description
            description: Description of the payment_type code
          - name: Fare_amount 
            description: > 
              The time-and-distance fare calculated by the meter.
              Extra Miscellaneous extras and surcharges. Currently, this only includes
              the $0.50 and $1 rush hour and overnight charges.
              MTA_tax $0.50 MTA tax that is automatically triggered based on the metered
              rate in use.
          - name: Improvement_surcharge 
            description: > 
              $0.30 improvement surcharge assessed trips at the flag drop. The
              improvement surcharge began being levied in 2015.
          - name: Tip_amount 
            description: > 
              Tip amount. This field is automatically populated for credit card
              tips. Cash tips are not included.
          - name: Tolls_amount 
            description: Total amount of all tolls paid in trip.
          - name: Total_amount 
            description: The total amount charged to passengers. Does not include cash tips.
  1. Define the macro at macros/get_payment_type_description.sql (a sketch of this macro is shown in the Macros section above).

alt text

  1. Define the variables in dbt_project.yml:
# Name your project! Project names should contain only lowercase characters
# and underscores. A good package name should reflect your organization's
# name or the intended use of these models
name: 'taxi_rides_ny'
version: '1.0.0'
config-version: 2

# This setting configures which "profile" dbt uses for this project.
profile: 'default'

# These configurations specify where dbt should look for different types of files.
# The `model-paths` config, for example, states that models in this project can be
# found in the "models/" directory. You probably won't need to change these!
model-paths: ["models"]
analysis-paths: ["analyses"]
test-paths: ["tests"]
seed-paths: ["seeds"]
macro-paths: ["macros"]
snapshot-paths: ["snapshots"]

target-path: "target"  # directory which will store compiled SQL files
clean-targets:         # directories to be removed by `dbt clean`
  - "target"
  - "dbt_packages"


# Configuring models
# Full documentation: https://docs.getdbt.com/docs/configuring-models

# In dbt, the default materialization for a model is a view. This means, when you run 
# dbt run or dbt build, all of your models will be built as a view in your data platform. 
# The configuration below will override this setting for models in the example folder to 
# instead be materialized as tables. Any models you add to the root of the models folder will 
# continue to be built as views. These settings can be overridden in the individual model files
# using the config(...) macro.

models:
  taxi_rides_ny:
    # Applies to all files under models/example/staging
    staging:
      materialized: view
    core:
      materialized: table
    # schema: dbt_staging

seeds:
  taxi_rides_ny:
    taxi_zone_lookup:
      +column_types:
        locationid: numeric

vars:
    payment_type_values: [1, 2, 3, 4, 5, 6]
  1. Create the model file for stg_green_tripdata. The models/staging/stg_green_tripdata.sql file defines the model:


with tripdata as 
(
  select *,
    row_number() over(partition by vendorid, lpep_pickup_datetime) as rn
  from {{ source('staging', 'green_tripdata') }}
  where vendorid is not null 
)
select
    -- identifiers
    {{ dbt_utils.generate_surrogate_key(['vendorid', 'lpep_pickup_datetime']) }} as tripid,
    cast(vendorid as integer) as vendorid,
    cast(ratecodeid as integer) as ratecodeid,
    cast(pulocationid as integer) as  pickup_locationid,
    cast(dolocationid as integer) as dropoff_locationid,
    
    -- timestamps
    cast(lpep_pickup_datetime as timestamp) as pickup_datetime,
    cast(lpep_dropoff_datetime as timestamp) as dropoff_datetime,
    
    -- trip info
    store_and_fwd_flag,
    cast(passenger_count as integer) as passenger_count,
    cast(trip_distance as numeric) as trip_distance,
    cast(trip_type as integer) as trip_type,
    
    -- payment info
    cast(fare_amount as numeric) as fare_amount,
    cast(extra as numeric) as extra,
    cast(mta_tax as numeric) as mta_tax,
    cast(tip_amount as numeric) as tip_amount,
    cast(tolls_amount as numeric) as tolls_amount,
    cast(ehail_fee as numeric) as ehail_fee,
    cast(improvement_surcharge as numeric) as improvement_surcharge,
    cast(total_amount as numeric) as total_amount,
    cast(payment_type as integer) as payment_type,
    {{ get_payment_type_description('payment_type') }} as payment_type_description, 
    cast(congestion_surcharge as numeric) as congestion_surcharge
from tripdata
where rn = 1

6.1 Run the dbt model

dbt build --select stg_green_tripdata

This will execute the stg_green_tripdata model and generate a view named de-project-397922.trips_data_all.stg_green_tripdata in your target database with data from de-project-397922.trips_data_all.green_tripdata.

Alt text

  1. Create the model file for stg_yellow_tripdata. This models/staging/stg_yellow_tripdata.sql file defines the model, which is run to create a view in the dbt_adellor schema.


with tripdata as 
(
  select *,
    row_number() over(partition by vendorid, tpep_pickup_datetime) as rn
  from {{ source('staging', 'yellow_tripdata') }}
  where vendorid is not null 
)
select
    -- identifiers
    {{ dbt_utils.generate_surrogate_key(['vendorid', 'tpep_pickup_datetime']) }} as tripid,
    cast(vendorid as integer) as vendorid,
    cast(ratecodeid as integer) as ratecodeid,
    cast(pulocationid as integer) as  pickup_locationid,
    cast(dolocationid as integer) as dropoff_locationid,
    
    -- timestamps
    cast(tpep_pickup_datetime as timestamp) as pickup_datetime,
    cast(tpep_dropoff_datetime as timestamp) as dropoff_datetime,
    
    -- trip info
    store_and_fwd_flag,
    cast(passenger_count as integer) as passenger_count,
    cast(trip_distance as numeric) as trip_distance,
    cast(trip_type as integer) as trip_type,
    
    -- payment info
    cast(fare_amount as numeric) as fare_amount,
    cast(extra as numeric) as extra,
    cast(mta_tax as numeric) as mta_tax,
    cast(tip_amount as numeric) as tip_amount,
    cast(tolls_amount as numeric) as tolls_amount,
    cast(ehail_fee as numeric) as ehail_fee,
    cast(improvement_surcharge as numeric) as improvement_surcharge,
    cast(total_amount as numeric) as total_amount,
    cast(payment_type as integer) as payment_type,
    {{ get_payment_type_description('payment_type') }} as payment_type_description, 
    cast(congestion_surcharge as numeric) as congestion_surcharge
from tripdata
where rn = 1

7.1 run dbt command

dbt build --select stg_yellow_tripdata

This will execute the stg_yellow_tripdata model and generate a view named de-project-397922.trips_data_all.stg_yellow_tripdata in your target database with data from de-project-397922.trips_data_all.yellow_tripdata.

Alt text

  1. Upload the taxi_zone_lookup CSV and create a seed table

8.1 Run the dbt command

dbt seed taxi_zone_lookup.csv
  1. Create a dbt model dim_zones.sql of the zone seed in the models/staging folder:


select 
    locationid, 
    borough, 
    zone, 
    replace(service_zone,'Boro','Green') as service_zone
from {{ ref('taxi_zone_lookup') }}

9.1 Run the dbt command

dbt build --select dim_zones.sql

Alt text

Merge all

  1. Merge the views (stg_yellow_tripdata, stg_green_tripdata) and the taxi_zone_lookup table into one fact table:


with green_data as (
    select *, 
        'Green' as service_type 
    from {{ ref('stg_green_tripdata') }}
), 

yellow_data as (
    select *, 
        'Yellow' as service_type
    from {{ ref('stg_yellow_tripdata') }}
), 

trips_unioned as (
    select * from green_data
    union all
    select * from yellow_data
), 

dim_zones as (
    select * from {{ ref('dim_zones') }}
    where borough != 'Unknown'
)
select 
    trips_unioned.tripid, 
    trips_unioned.vendorid, 
    trips_unioned.service_type,
    trips_unioned.ratecodeid, 
    trips_unioned.pickup_locationid, 
    pickup_zone.borough as pickup_borough, 
    pickup_zone.zone as pickup_zone, 
    trips_unioned.dropoff_locationid,
    dropoff_zone.borough as dropoff_borough, 
    dropoff_zone.zone as dropoff_zone,  
    trips_unioned.pickup_datetime, 
    trips_unioned.dropoff_datetime, 
    trips_unioned.store_and_fwd_flag, 
    trips_unioned.passenger_count, 
    trips_unioned.trip_distance, 
    trips_unioned.trip_type, 
    trips_unioned.fare_amount, 
    trips_unioned.extra, 
    trips_unioned.mta_tax, 
    trips_unioned.tip_amount, 
    trips_unioned.tolls_amount, 
    trips_unioned.ehail_fee, 
    trips_unioned.improvement_surcharge, 
    trips_unioned.total_amount, 
    trips_unioned.payment_type, 
    trips_unioned.payment_type_description, 
    trips_unioned.congestion_surcharge
from trips_unioned
inner join dim_zones as pickup_zone
on trips_unioned.pickup_locationid = pickup_zone.locationid
inner join dim_zones as dropoff_zone
on trips_unioned.dropoff_locationid = dropoff_zone.locationid

Run dbt command

dbt build --select fact_table

Alt text

Alt text

NB

dbt run

Alt text

runs all the models except the seeds

NB

dbt build

Alt text

build command runs everything in the project including the seeds

dbt test

Alt text

test command runs all the tests defined in the project

dbt docs generate 

Alt text

The docs generate command generates the documentation for the entire project

DBT test

DBT tests are defined on a column in the .yml file. DBT provides basic tests to check whether the column values are:

  1. unique (no duplicates)
  2. not null (no null or empty values)
  3. of the expected data type (integer, float, date, etc.)
  4. a foreign key to another table (relationships)
  5. within a list of accepted values:
columns:
    - name: payment_type_description
      description: Description of the payment_type code
      tests:
        - accepted_values:
            values: "{{ var('payment_type_values') }}"
            severity: warn

The severity field is optional and can be set to either warn or error. If not provided, it defaults to error.

This test checks that the column values are within the accepted values list.

columns:
    - name: pickup_locationid
      description: locationid where the meter was engaged
      tests:
        - relationships:
            to: ref('taxi_zone_lookup')
            field: locationid
            severity: warn

Checks that pickup_locationid exists in the taxi_zone_lookup dimension table

columns:
    - name: tripid
      description: Primary key for ...
      tests:
        - unique:
            severity: warn
        - not_null:
            severity: warn

This test checks that all tripid primary keys are unique and not null
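
Besides these generic tests declared in the .yml files, dbt also supports singular tests: plain SQL files placed in the tests/ directory that return the rows violating a rule, so the test fails if any row comes back. A minimal hypothetical sketch (the file, model and column names are assumptions based on the fact table built above):

-- tests/assert_total_amount_is_non_negative.sql (hypothetical file name)
-- the test fails if this query returns any rows
select
    tripid,
    total_amount
from {{ ref('fact_table') }}
where total_amount < 0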

dbt production environment

Alt text

Continuous integration

Continuous integration (CI) is a software development practice used to ensure that code is automatically integrated and tested on a regular and frequent basis.

In CI, developers push their code to a shared repository multiple times a day, triggering a series of automated processes including code compilation, testing, and static analysis. These processes run automatically in an isolated test environment, which is created and destroyed for each integration cycle, ensuring that the code is tested in a clean and repeatable environment.

The goal of continuous integration is to detect and fix problems in code early, which helps reduce the time and cost of fixing bugs later in the development cycle. In addition, by regularly integrating and testing code, code quality is improved and the software delivery process is made easier.

We can employ continuous integration (CI) in a dbt project in pull requests (when we request to join our branch to the main one) using Github, Azure DevOps, or Gitlab webhooks. When a PR is approved, a webhook is sent to dbt cloud that queues a new execution of the corresponding job. The execution of the job is carried out on a temporary scheme that is created and self-destructs. The PR will not perform the merge until the job execution is complete. Let’s do a test (you can check all the documentation here):

NB: before we continue, if we don’t see the Run on Pull Requests check, we need to reconfigure the connection to Github and use dbt’s native integration. The following steps need to be followed:

  1. Connect your dbt account to Github and grant read/write permissions on the repository you’re using. From Profile Settings > Linked Accounts, select Github and click on the Configure integration with Github button. More info in this dbt article.
  2. Disconnect the current Github configuration (by SSH) in the project: from Account Settings > Projects (analytics) > Github connection, click on Edit, and the Disconnect button appears at the bottom left.
  3. If we go back to the project configuration screen and click on Repository Details again, we can select the repository provider again. This time instead of cloning, we’re going to connect directly to Github and select a repository:

alt text

dbt/Github CI: configure the dbt integration with Github from Profile Settings > Linked Accounts

After configuring continuous integration, create a job that is triggered by continuous integration (CI):

alt text

Create a new job whose trigger is continuous integration CI and activate the Run on Pull Request option

alt text

Create a new job whose trigger is continuous integration CI and activate the Run on Pull Request option

alt text

Create a new job whose trigger is continuous integration CI and activate the Run on Pull Request option

This job lies dormant for now, but when a pull request is initiated it will run the commands specified. For example:

  1. Let’s make changes to our models and commit the changes:

alt text

New branch in dbt project to test continuous integration (CI)

  1. Make a pull request:

alt text

pull request on github

  1. Approve the PR from Github:

alt text

Approve PR from github

alt text

Merge successful

  1. Going to dbt, we see that a new job execution ran, triggered by Github Pull Request #43:

alt text

Reviewing the steps of the job, we see that it was triggered from a PR and that a temporary schema named dbt_cloud_pr_536565_43 was created in our BigQuery dataset. This schema self-destructs when the job ends.

Visualizing the data with google looker studio

Google Data Studio, now known as Looker Studio, is a free data visualization and reporting tool that allows users to connect to multiple data sources, such as Google Analytics, Google Ads, Google Sheets, databases and more, to create custom reports and interactive visualizations.

The platform offers an easy-to-use graphical interface that allows users to design and customize reports with different types of charts, tables, key performance indicators (KPIs) and other visualizations, and to share them with other users securely and easily.

We basically have two types of elements: reports and data sources. The first are the dashboards with the visualizations and the second are the connectors with the tables of the source systems. The first step in generating a dashboard is to set up your data sources.

    1. Click on Create and select data source. In the new dashboard we search for the BigQuery connector:

Alt text

Connect Google Data Studio with BigQuery

  1. Authorize Google Data Studio access to our BigQuery.

  2. Select the table you want to use as your data source:

Alt text

Configure source data in Google Data Studio

  1. The last step for the data source is to review and confirm the data structure that has been inferred from the source system. At this point we can also perform transformation tasks and create new fields and parameters. For example, although we could do it on the fly while designing a report, from here we could create a month_pickup field that segments the data by month with the formula MONTH(pickup_datetime):

Alt text

Alt text

Reports in Google Data Studio

Creating a report in Google Data Studio is very simple, following the current trend of tools such as Microsoft’s Power BI, Qlik Sense or MicroStrategy Visual Insights. We have at our disposal a blank canvas on which to build visualizations based on the data set we have configured: we select a control or a chart, configure its dimensions and metrics, and that’s it! Let’s take a look at these two types of elements:

Controls: objects that allow us to interact with the data in the visualizations, for example selection filters, text input boxes, drop-down lists with all the values of a dimension, etc.

Graphs: the visualizations themselves, all kinds of statistical charts that we can use to analyze and present information: pie charts, bars, lines, bubbles, etc. Depending on the selected chart, we must choose one or more dimensions and one or more metrics.

After a few minutes of effort, we can enjoy our first report in Google Data Studio.

->Back: Lesson 3: Data Warehouse

->Next: Lesson 5: Batch Processing