
-> Back: Lesson 2: Data Ingestion

-> Next: Lesson 4: Analytics Engineering

Week 3: Data Warehouse


The files for this section are found in the Data_Engineering/week_3_data_warehouse folder in my repo.


|                     | OLTP                                                                                               | OLAP                                                                                   |
| ------------------- | -------------------------------------------------------------------------------------------------- | -------------------------------------------------------------------------------------- |
| Purpose             | Short, fast updates initiated by user                                                               | Data periodically refreshed with scheduled, long-running batch jobs                     |
| Database design     | Normalized databases for efficiency                                                                 | Denormalized databases for analysis                                                     |
| Space requirements  | Generally small if historical data is archived                                                      | Generally large due to aggregating large datasets                                       |
| Backup and recovery | Regular backups required to ensure business continuity and meet legal and governance requirements   | Lost data can be reloaded from the OLTP database as needed in lieu of regular backups   |
| Productivity        | Increases productivity of end users                                                                 | Increases productivity of business managers, data analysts and executives               |
| Data view           | Lists day-to-day business transactions                                                              | Multi-dimensional view of enterprise data                                               |
| User examples       | Customer-facing personnel, clerks, online shoppers                                                  | Knowledge workers such as data analysts, business analysts and executives               |

Google BigQuery


BigQuery (BQ) is a Data Warehouse solution offered by Google Cloud Platform.

External tables

External tables are similar to database views, except that the underlying data is not stored in BigQuery itself but in files in cloud storage or in another database. BigQuery stores only the schema (which it can infer from the external files when the table is created) and reads the data from the external source at query time. External tables otherwise behave like standard BigQuery tables, with the same properties, access management, metadata, and so on; the only difference is that they are a view, and the data lives in another location.

For example, instead of ingesting a CSV into a table in the BigQuery database, let's create an external table to access the data directly without persisting it:

CREATE OR REPLACE EXTERNAL TABLE `de-project-397922.trips_data_all.rides`
OPTIONS (
    format = 'CSV',
    uris   = ['gs://data/trip/yellow_tripdata_2020-01.csv']
);
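Once created, the external table can be queried like any native table; every query reads the CSV straight from Cloud Storage, which also means BigQuery cannot estimate the bytes processed in advance. A minimal sketch:

-- Counts rows by scanning the CSV in the bucket at query time.
SELECT COUNT(1) AS trip_count
FROM `de-project-397922.trips_data_all.rides`;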

Google BigQuery Optimization

Unlike a relational database, BigQuery doesn't support indexes to speed up SQL queries. Instead, it offers two alternatives: partitioning and clustering. Neither option is recommended when the data volume is below ~1 GB.

Partitioning

A partitioned table is a table divided into segments, a.k.a. partitions, based on the values of a column. Slicing a table greatly speeds up queries because the data you need to retrieve is much smaller than if you had to read the entire table. BigQuery offers three types of partitioning:

- Time-unit column: partition by a DATE, TIMESTAMP or DATETIME column.
- Ingestion time: partition by the time at which BigQuery ingests the data, exposed through the _PARTITIONTIME pseudo-column.
- Integer range: partition by ranges of values of an INTEGER column.
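Time-unit partitioning is used in the next example; as a minimal sketch of the other two types (the table names rides_by_load_date and rides_by_vendor are illustrative, assuming the rides schema from above):

-- Ingestion-time partitioning: partitions are keyed by load time and
-- queried through the _PARTITIONTIME / _PARTITIONDATE pseudo-columns.
CREATE OR REPLACE TABLE trips_data_all.rides_by_load_date
(
    VendorID             INT64,
    tpep_pickup_datetime TIMESTAMP,
    tip_amount           FLOAT64
)
PARTITION BY _PARTITIONDATE;

-- Integer-range partitioning: one partition per bucket of VendorID values.
CREATE OR REPLACE TABLE trips_data_all.rides_by_vendor
PARTITION BY RANGE_BUCKET(VendorID, GENERATE_ARRAY(1, 10, 1)) AS
SELECT * FROM trips_data_all.rides;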

Partitions are used to improve performance when you have large datasets that require frequent queries on specific date ranges or time intervals. For example, we create a new table from a query and partition it by the column tpep_pickup_datetime:

CREATE OR REPLACE TABLE trips_data_all.partitioned_database
PARTITION BY
  DATE(tpep_pickup_datetime) AS
SELECT * FROM trips_data_all.rides LIMIT 50;


Partitions improve query performance because filters on the partition key allow BigQuery to skip entire partitions, which can significantly reduce the amount of data processed.

The new table partitioned_database is created with a partition on the pickup date.
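For example, a query that filters on the partition column only scans the matching partitions instead of the whole table:

-- Only the partitions for January 2020 are read, which reduces
-- the bytes processed and therefore the cost of the query.
SELECT COUNT(1) AS trips
FROM trips_data_all.partitioned_database
WHERE DATE(tpep_pickup_datetime) BETWEEN '2020-01-01' AND '2020-01-31';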

Clustering

Clustering reorders the data in the table based on the values of one or more columns, with a maximum of 4 clustered columns per table. Clusters can be created at the same time as partitions. Building upon the previous query as an example, let's add a clustered column on the field VendorID:

CREATE OR REPLACE TABLE trips_data_all.partitioned_database_cluster
PARTITION BY DATE(tpep_pickup_datetime)
CLUSTER BY VendorIDKey AS
SELECT *, CAST(VendorID AS STRING) AS VendorIDKey FROM trips_data_all.rides LIMIT 1000;

You cannot cluster on a FLOAT64 column, hence the conversion of VendorID to STRING as VendorIDKey.
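A query that filters on both the partition column and the clustered column benefits twice: BigQuery first prunes partitions by date and then reads only the blocks whose VendorIDKey values match. A sketch against the table created above:

-- Partition pruning on the date plus block pruning on the clustered column.
SELECT COUNT(1) AS vendor_trips
FROM trips_data_all.partitioned_database_cluster
WHERE DATE(tpep_pickup_datetime) = '2020-01-15'
  AND VendorIDKey = '1';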


Partitioned and clustered

When to use partitioning and clustering?

Use partitioning when you want to filter or aggregate on a single column with low cardinality (a small number of unique values). If you want to filter or aggregate on several columns, partition by the column with the lowest cardinality and cluster by the rest, up to a maximum of 4 columns. BigQuery sorts the data using the values in the clustering columns, storing it in a way that is optimized for queries that filter, aggregate, or join on those columns. This can result in faster query response times and lower costs.

| Partitioning | Clustering |
| ------------ | ---------- |
| The cost of the query is known: BigQuery can estimate the amount of data it will retrieve before running the query. | The cost of the query is unknown, since BigQuery cannot estimate the amount of data beforehand. |
| Low granularity: you can only partition by a single column. | High granularity: multiple columns can be used to reorder the table (up to a maximum of 4). |
| Suited to queries that filter or aggregate data by a single column. | Suited to queries that filter or aggregate by multiple columns. |
| Limited to 4,000 partitions per table, which implies it can only be used with fields of low cardinality (up to ~4,000 unique values). | There is no limit on clusters, so it supports columns with high cardinality. |

BigQuery table optimized by partitioning on date

SQL Best Practices for Optimizing Queries

Most of these don't just apply to Google BigQuery; they are recommendations for queries run on any database engine:

- Avoid SELECT *; request only the columns you need.
- Filter on partitioned and/or clustered columns wherever possible.
- Use ORDER BY only in the outermost query.
- In joins, place the table with the largest number of rows first.
- Avoid oversharding tables (many small tables instead of one partitioned table).
- Use approximate aggregation functions (e.g. APPROX_COUNT_DISTINCT) when exact results are not required.
- Estimate the cost before running a query (BigQuery reports the bytes to be processed).
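As a small illustration of the first two points, on a columnar store such as BigQuery selecting only the columns you need directly reduces the bytes scanned:

-- Expensive: reads every column of the table.
SELECT * FROM trips_data_all.partitioned_database_cluster;

-- Cheaper: reads two columns and prunes partitions via the filter
-- on the partition column.
SELECT fare_amount, tip_amount
FROM trips_data_all.partitioned_database_cluster
WHERE DATE(tpep_pickup_datetime) >= '2020-01-01';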

Google BigQuery Architecture

The BigQuery architecture decouples storage from compute (the analytics engine), allowing each resource to be scaled independently. This flexibility allows for much more granular cost control. Which pieces do we find within BigQuery? Dremel, Colossus, Jupiter and Borg:

Borg: Container Orchestrator

Google’s own container orchestrator that is responsible for providing the necessary hardware to operate the slots and mixers of the Dremel engine.

Jupiter: Network

Because the BigQuery structure is decoupled (it physically separates the storage from the compute engine), it needs a network that connects the two entities: Jupiter, Google's internal data-center network. It offers enough bandwidth to allow communication between 100K machines at an ultra-fast 10 Gb/s.

Dremel: Execution Engine

This is the high-speed BigQuery query engine, which Google also uses internally. It orchestrates queries by segmenting them into small portions that are distributed across nodes and grouped together when finished to return the result: the definition of distributed processing. Dremel converts a SQL query into an execution tree where we find slots and mixers, all run on Borg (see above). The engine dynamically assigns slots to incoming queries.

Colossus: Distributed Storage

Google’s state-of-the-art distributed storage system. It manages replications, recovery (when disks fail), and distributed management (mitigating the impact in the event of a crash). Colossus uses the columnar and compression format ColumnIO capable of easily handling petabytes of data.

BigQuery and Machine Learning

Google BigQuery can run machine learning models in a simple and agile way using standard SQL, without the need for specific platforms, data movement, or programming knowledge (Python, Scala, etc.). The ML algorithms natively available within BigQuery can be found in the official documentation.

For example, let's create a linear regression model to predict the tip (tip_amount) of a taxi ride given the pickup_latitude and pickup_longitude. For all the options, see the CREATE MODEL documentation.

CREATE OR REPLACE MODEL `de-project-397922.trips_data_all.tip_model`
OPTIONS(
    -- input_label_cols names the label column; all other selected
    -- columns are used as features.
    model_type='linear_reg',
    input_label_cols=['tip_amount'],
    DATA_SPLIT_METHOD='AUTO_SPLIT'
)
AS
SELECT pickup_latitude, pickup_longitude, tip_amount
FROM `de-project-397922.trips_data_all.rides`
WHERE pickup_latitude IS NOT NULL AND pickup_longitude IS NOT NULL;

BQ offers us a series of statements to analyze and exploit the model. More information in the official documentation.
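For example, ML.TRAINING_INFO returns per-iteration training statistics (loss, learning rate, duration) for the model created above:

SELECT *
FROM ML.TRAINING_INFO(MODEL `de-project-397922.trips_data_all.tip_model`);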


SELECT THE COLUMNS YOU ARE INTERESTED IN

SELECT trip_distance, PULocationID, DOLocationID, payment_type, fare_amount, tolls_amount, tip_amount
FROM `de-project-397922.trips_data_all.partitioned_database_cluster`
WHERE fare_amount != 0;

CREATE ML TABLE

CREATE OR REPLACE TABLE `de-project-397922.trips_data_all.partitioned_database_cluster_ml`(
`trip_distance` FLOAT64,
`PULocationID` STRING,
`DOLocationID` STRING,
`payment_type` STRING,
`fare_amount` FLOAT64,
`tolls_amount` FLOAT64,
`tip_amount` FLOAT64
) AS (
-- Casting the ID columns to STRING makes BigQuery ML treat them as
-- categorical features instead of numeric ones.
SELECT trip_distance, CAST(PULocationID AS STRING) AS PULocationID, CAST(DOLocationID AS STRING) AS DOLocationID,
CAST(payment_type AS STRING) AS payment_type, fare_amount, tolls_amount, tip_amount
FROM `de-project-397922.trips_data_all.partitioned_database_cluster` WHERE fare_amount != 0
);


CREATE MODEL

CREATE OR REPLACE MODEL `de-project-397922.trips_data_all.tip_model`
OPTIONS (
    model_type='linear_reg',
    input_label_cols=['tip_amount'],
    DATA_SPLIT_METHOD='AUTO_SPLIT'
) AS
SELECT *
FROM `de-project-397922.trips_data_all.partitioned_database_cluster_ml`
WHERE tip_amount IS NOT NULL;


CHECK THE MODEL FEATURES

SELECT *
FROM ML.FEATURE_INFO(MODEL `de-project-397922.trips_data_all.tip_model`);


EVALUATE MODEL

SELECT *
FROM ML.EVALUATE(MODEL `de-project-397922.trips_data_all.tip_model`,
(SELECT *
FROM `de-project-397922.trips_data_all.partitioned_database_cluster_ml`
WHERE tip_amount IS NOT NULL
));


PREDICT WITH THE MODEL

SELECT *
FROM ML.PREDICT(MODEL `de-project-397922.trips_data_all.tip_model`,
(
SELECT *
FROM `de-project-397922.trips_data_all.partitioned_database_cluster_ml`
WHERE tip_amount IS NOT NULL
));


PREDICT AND EXPLAIN

SELECT *
FROM ML.EXPLAIN_PREDICT(MODEL `de-project-397922.trips_data_all.tip_model`,
(
SELECT *
FROM `de-project-397922.trips_data_all.partitioned_database_cluster_ml`
WHERE tip_amount IS NOT NULL
), STRUCT(3 as top_k_features));


HYPERPARAMETER TUNING

CREATE OR REPLACE MODEL `de-project-397922.trips_data_all.tip_hyperparam_model`
OPTIONS
    (model_type='linear_reg',
    input_label_cols=['tip_amount'],
    DATA_SPLIT_METHOD='AUTO_SPLIT',
    num_trials=5,
    max_parallel_trials=2,
    l1_reg=hparam_range(0, 20),
    l2_reg=hparam_candidates([0, 0.1, 1, 10])
    ) AS
SELECT *
FROM `de-project-397922.trips_data_all.partitioned_database_cluster_ml`
WHERE tip_amount IS NOT NULL;


SQL example for ML in BigQuery

BigQuery ML Tutorials

BigQuery ML Reference Parameter

Hyperparameter tuning

Feature preprocessing

BigQuery Machine Learning Deployment

Steps to extract and deploy model with docker
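A first step of that deployment can be done entirely in SQL with EXPORT MODEL, which writes the trained model to Cloud Storage in TensorFlow SavedModel format; from there it can be served, for example, with a TensorFlow Serving container. A minimal sketch (the bucket name gs://taxi_ml_model is hypothetical):

-- Export the trained model to a (hypothetical) GCS bucket.
EXPORT MODEL `de-project-397922.trips_data_all.tip_model`
OPTIONS (URI = 'gs://taxi_ml_model/tip_model');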

-> Back: Lesson 2: Data Ingestion

-> Next: Lesson 4: Analytics Engineering