
->Back: index

->Next: Data Ingestion

Introduction to Data Engineering


What is Data Engineering?

Data engineering is the process of designing, building, and maintaining the infrastructure and systems that enable the collection, storage, and analysis of data at scale.

Data engineers are responsible for designing and building the data pipelines that move data from its source to its final destination, such as a data lake or data warehouse.

A data pipeline is a service that receives data as input and produces data as output, for example reading a JSON file, transforming the data, and storing it as a table in a PostgreSQL database.

[Image: Data pipeline]
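As a toy illustration, here is a minimal sketch of such a pipeline in Python. The file name, table name, and connection details are hypothetical placeholders, and pandas plus SQLAlchemy (with the psycopg2 driver) are assumed to be installed:

import pandas as pd
from sqlalchemy import create_engine

# Input: read raw records from a JSON file (hypothetical file name)
df = pd.read_json('trips.json')

# Transform: a trivial cleaning step for illustration
df = df.drop_duplicates()

# Output: store the result as a table in PostgreSQL
# (connection string and table name are placeholders)
engine = create_engine('postgresql://root:root@localhost:5432/ny_taxi')
df.to_sql('trips', con=engine, if_exists='replace', index=False)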

Docker and Postgres

What is Docker?

Docker is a tool designed to make it easier to create, deploy, and run applications by using containers. Containers allow a developer to package up an application with all of the parts it needs, such as libraries and other dependencies, and ship it all out as one package. By doing so, thanks to the container, the developer can rest assured that the application will run on any other Linux machine regardless of any customized settings that machine might have that could differ from the machine used for writing and testing the code.

A couple of Docker concepts are worth understanding before we start:

Docker containers are based on images. An image is a read-only template with instructions for creating a Docker container. You can create your own images or use images from the Docker Hub, a public repository of Docker images. Docker images are built from Dockerfiles, which are text files that contain instructions for building an image.

Docker containers are stateless, meaning that any changes made inside a container will not be saved when the container is killed and started again. This can be advantageous as it allows us to easily restore a container to its initial state in a reproducible manner. However, if you need to persist data, you will need to store it elsewhere. One common approach is to use volumes. Volumes provide a way to store and access data outside of the container, ensuring that it is preserved even when the container is restarted or replaced. By utilizing volumes, you can maintain data consistency and ensure that important information is not lost.

To learn more about Docker and how to set it up on a Mac, see docker. You may also be interested in the Docker reference cheatsheet.

Creating a simple custom pipeline (Docker tutorial)

1; Start the Docker daemon from the terminal with one of these commands:

Mac: `open --background -a Docker`

Linux: `sudo systemctl start docker`

2; Write a dummy pipeline.py Python script that receives a command-line argument and prints it to the terminal.

import sys
# pandas is not used directly; it is imported only to show that the Docker
# image must provide the script's dependencies
import pandas

print(sys.argv)

# argument 0 is the name of the file
# argument 1 contains the actual first argument
day = sys.argv[1]

print(f'job finished successfully for day = {day}')

Verify that this script works by running it in the terminal with:

  python pipeline.py 2021-10-01

3; Create a Dockerfile that builds an image containing the Python script.

This script can be dockerized into an image with the following Dockerfile:

FROM python:3.9
RUN pip install pandas
WORKDIR /app
COPY pipeline.py pipeline.py
ENTRYPOINT ["python", "pipeline.py"]

Let's build the image:

  docker build -t pipeline:v001 .

Here the image name is pipeline with the tag v001, specifying the version number. If no tag is specified, the default tag latest is assigned.

4; Run the image in a container with the command:

  docker run -it pipeline:v001 2021-10-01

The container is run in interactive mode (-it) to allow input from the terminal.

Running the container produces the same result as running the Python script directly.

NB: the Dockerfile and script must be in the same directory.

Running PostgreSQL in Docker

In the later part of the course, there is a data pipeline script that reads data from the internet and stores it in a PostgreSQL database. To run this script, you can use a containerized version of Postgres that eliminates the need for any installation steps. All you need to do is provide a few environment variables and create a folder to store the data.

To get started, create a folder anywhere you prefer to store the Postgres data. For example, you can create a folder called “ny_taxi_postgres_data”. Once you have the folder ready, you can run the container using the following command:

docker run -it \
  -e POSTGRES_USER="root" \
  -e POSTGRES_PASSWORD="root" \
  -e POSTGRES_DB="ny_taxi" \
  -v $(pwd)/ny_taxi_postgres_data:/var/lib/postgresql/data \
  -p 5432:5432 \
  --name pg-database \
  postgres:13

[Image: PostgreSQL Docker execution]

This command runs a Postgres container with user root, password root and a database called ny_taxi, mounts the local ny_taxi_postgres_data folder to /var/lib/postgresql/data inside the container so the data persists between restarts, maps port 5432 on localhost to port 5432 in the container, and names the container pg-database.

NB: Make sure the localhost port is not already in use by another program by checking with lsof -i:5432

This will show whether the port is available or taken. If it is taken, change the host port in the command above.

Once the container is running, you can connect to the PostgreSQL database with pgcli:

pgcli -h localhost -p 5432 -u root -d ny_taxi

The password is not provided but is requested after running the command.
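If you prefer to check the connection from Python instead of pgcli, here is a minimal sketch, assuming the same credentials and port as the docker run command above and that pandas, SQLAlchemy and psycopg2 are installed:

import pandas as pd
from sqlalchemy import create_engine

# Same credentials and port as the containerized Postgres above
engine = create_engine('postgresql://root:root@localhost:5432/ny_taxi')

# A trivial query just to confirm that the connection works
print(pd.read_sql('SELECT 1 AS ok;', con=engine))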

Running pgAdmin in Docker

If you don’t want to interact with the database via the CLI, you can use pgAdmin, a web interface for managing PostgreSQL databases, which can also run in Docker. The command below attaches it to the pg-network Docker network (created in the next subsection):

docker run -it \
  -e PGADMIN_DEFAULT_EMAIL="admin@admin.com" \
  -e PGADMIN_DEFAULT_PASSWORD="root" \
  -p 8080:80 \
  --network=pg-network \
  --name pgadmin \
  dpage/pgadmin4

This command runs pgAdmin with the default login email admin@admin.com and password root, maps port 8080 on localhost to port 80 in the container, attaches the container to the pg-network network, and names it pgadmin.

Connecting pgAdmin to PostgreSQL with a Docker network

To connect pgAdmin to the Postgres container, you need to create a network that both containers can connect to; both containers must then be started with --network=pg-network so they can find each other by name. You can create the network using the following command:

docker network create pg-network

Ingest data from a Jupyter notebook into the PostgreSQL container

Create a Jupyter Notebook for Data Upload

At this point we will upload data from a CSV file to Postgres. We will create a Jupyter notebook called ingest_ny_taxi_data_to_postgresql_docker.ipynb. In this notebook, we will read a CSV file and export its contents to the Postgres database running in Docker.

For this task, we will use the Yellow taxi trip records CSV file for January 2021, which can be obtained from the NYC TLC Trip Record Data website. To understand the meaning of each field in the CSV file, you can refer to the available explanation Table.

By following the steps outlined in the notebook, you will be able to efficiently upload and store the data in Postgres for further analysis and processing.

ingest_ny_taxi_data_to_postgresql_docker.ipynb
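The heart of the notebook looks roughly like the sketch below. This is an illustration rather than the exact notebook code; the CSV file name, chunk size and connection string are assumptions. The idea is to read the CSV in chunks, create the table from the header, and append each chunk to Postgres so the whole file never has to fit in memory.

import pandas as pd
from sqlalchemy import create_engine

engine = create_engine('postgresql://root:root@localhost:5432/ny_taxi')

# Read the CSV in chunks instead of loading it all at once
df_iter = pd.read_csv('yellow_tripdata_2021-01.csv', iterator=True, chunksize=100_000)

first_chunk = next(df_iter)
first_chunk['tpep_pickup_datetime'] = pd.to_datetime(first_chunk['tpep_pickup_datetime'])
first_chunk['tpep_dropoff_datetime'] = pd.to_datetime(first_chunk['tpep_dropoff_datetime'])

# Create the table from the header only, then append the first chunk
first_chunk.head(0).to_sql('yellow_taxi_data', con=engine, if_exists='replace', index=False)
first_chunk.to_sql('yellow_taxi_data', con=engine, if_exists='append', index=False)

# Append the remaining chunks
for chunk in df_iter:
    chunk['tpep_pickup_datetime'] = pd.to_datetime(chunk['tpep_pickup_datetime'])
    chunk['tpep_dropoff_datetime'] = pd.to_datetime(chunk['tpep_dropoff_datetime'])
    chunk.to_sql('yellow_taxi_data', con=engine, if_exists='append', index=False)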

Create the network on which both containers will run


# Remove the Docker network
docker network rm pg-network

# recreate the docker network
docker network create pg-network

Run the PostgreSQL container

docker run -it \
  -e POSTGRES_USER="root" \
  -e POSTGRES_PASSWORD="root" \
  -e POSTGRES_DB="ny_taxi" \
  -v /Users/air/Documents/a_zoom_data_engineer/cli_docker_postgres/ny_taxi_postgres_data:/var/lib/postgresql/data \
  -p 5432:5432 \
  --network=pg-network \
  --name pg-database \
  postgres:13

[Image: successful run of the PostgreSQL container]

Run the pgAdmin container

docker run -it \
  -e PGADMIN_DEFAULT_EMAIL="admin@admin.com" \
  -e PGADMIN_DEFAULT_PASSWORD="root" \
  -p 8080:80 \
  --network=pg-network \
  --name pgadmin \
  dpage/pgadmin4

NB: pgAdmin listens on port 80 inside the container; this is mapped to port 8080 on localhost.

Log in to pgAdmin in a web browser at http://localhost:8080/browser/ using the same email and password you passed to the container:

email: admin@admin.com

password: root

[Image: pgAdmin landing page]

Click Tools -> Register -> Server.

[Image: creating the database login]

Give the server a name, and under Connection add the same host name, user and password you used when running the container (the host name is the name of the Postgres container, pg-database).


Click on Save. You should now be connected to the database.

We will explore using pgAdmin in later lessons.

Run the ingestion script

Open the Jupyter notebook and run the cells to execute the code.

[Image: data successfully read into the ny_taxi database]

Dockerize the ingestion script

Convert the ingest_ny_taxi_data_to_postgresql_docker.ipynb Jupyter Notebook into a Python script called ingest_data.py. In the script, read secrets from the .env file.

Please note that you should change the hostname to the name of the Postgres container, pg-database.

Aim: run the PostgreSQL and pgAdmin containers and ingest the data with the ingest_data.py script: python ingest_data.py

Dockerize the ingestion script with a Dockerfile

Export and test the script

Dockerize the ingestion script with a Dockerfile (video)

We will now export the Jupyter notebook file to a regular Python script and use Docker to run it.

To export the ipynb file to a py file, use the following command:

jupyter nbconvert --to=script ingest_ny_taxi_data_to_postgresql_docker.ipynb

Clean up the script by removing unnecessary code. Rename it to ingest_data.py and make the following modifications:

We will create an .env file to pass arguments to the script. The .env file should contain the following:

POSTGRES_USER=root
POSTGRES_PASSWORD=root
POSTGRES_DB=ny_taxi
HOST=pg-database
PORT=5432
TABLE_NAME=yellow_taxi_trips
URL=https://s3.amazonaws.com/nyc-tlc/trip+data/yellow_tripdata_2021-01.csv

Modify the engine used to connect to Postgres by passing the parameters and building the URL:

engine = create_engine(f'postgresql://{user}:{password}@{host}:{port}/{db}')

Download the CSV using the provided URL argument.
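As a rough sketch of what ingest_data.py might look like after these modifications (the completed script is linked below; details such as downloading the file with urllib and the chunk size are illustrative assumptions):

import os
from urllib.request import urlretrieve

import pandas as pd
from dotenv import load_dotenv
from sqlalchemy import create_engine

# Load the connection settings from the .env file
load_dotenv()
user = os.getenv('POSTGRES_USER')
password = os.getenv('POSTGRES_PASSWORD')
db = os.getenv('POSTGRES_DB')
host = os.getenv('HOST')
port = os.getenv('PORT')
table_name = os.getenv('TABLE_NAME')
url = os.getenv('URL')

# Download the CSV from the URL given in .env
csv_name = 'output.csv'
urlretrieve(url, csv_name)

# Build the engine from the parameters and ingest the data in chunks
engine = create_engine(f'postgresql://{user}:{password}@{host}:{port}/{db}')
df_iter = pd.read_csv(csv_name, iterator=True, chunksize=100_000)
for i, chunk in enumerate(df_iter):
    chunk.to_sql(table_name, con=engine, if_exists='replace' if i == 0 else 'append', index=False)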

You can find the completed ingest_data.py script here.

To test the script, we need to drop the table we previously created. In pgAdmin, navigate to Servers > Docker localhost > Databases > ny_taxi > Schemas > public > Tables > yellow_taxi_data. Right-click on yellow_taxi_data and select Query tool. Enter the following command:

DROP TABLE yellow_taxi_data;

Run the script with the following command:

python ingest_data.py

In pgAdmin, refresh the Tables and check that yellow_taxi_trips was created. You can also run a SQL query to check the contents:

SELECT COUNT(1) FROM yellow_taxi_trips;

This query should return 1,369,765 rows.

Dockerize the script

The ingestion script, together with the .env file, can be dockerized with the following Dockerfile:

FROM python:3.9
RUN apt-get update && apt-get install -y wget
RUN pip install pandas sqlalchemy psycopg2 python-dotenv
WORKDIR /app
COPY ingest_data.py ingest_data.py
COPY .env .env
ENTRYPOINT ["python", "ingest_data.py"]

Build the Dockerfile into a docker image

To build the Dockerfile into a Docker image called taxi_ingestion:v001, run the following command:

docker build -t taxi_ingestion:v001 .

This command will build the Dockerfile and tag the resulting image as taxi_ingestion:v001.

Run the docker image

Since the taxi_ingestion image uploads data into Postgres, the PostgreSQL container must be up and running before the ingestion script is run:

docker network create pg-network

docker run -it \
  -e POSTGRES_USER="root" \
  -e POSTGRES_PASSWORD="root" \
  -e POSTGRES_DB="ny_taxi" \
  -v /Users/air/Documents/a_zoom_data_engineer/cli_docker_postgres/ny_taxi_postgres_data:/var/lib/postgresql/data \
  -p 5431:5432 \
  --network=pg-network \
  --name pg-database \
  postgres:13

docker run -it \
  -e PGADMIN_DEFAULT_EMAIL="admin@admin.com" \
  -e PGADMIN_DEFAULT_PASSWORD="root" \
  -p 8080:80 \
  --network=pg-network \
  --name pgadmin \
  dpage/pgadmin4

docker run -it --network=pg-network taxi_ingestion:v001

Note:

To serve your local directory as a simple website, you can run the following command:

python -m http.server

Note:

If you want to find the default IP address (inet), you can use the following command:

ifconfig | grep "inet"

To access the directory served this way, use the following URL:

http://127.0.0.1:8000/

Running pgAdmin and Postgres with Docker Compose

docker-compose allows us to launch multiple containers using a single configuration file, so that we don’t have to run several complex docker run commands separately. It simplifies the process by using YAML files. Here’s an example of a docker-compose.yaml file that can be used to run the Postgres and pgAdmin containers; a single command spins up both containers.

1; Create a docker-compose.yaml file with the following content:

version: '3'
services:
  pg-database:
    image: postgres:13
    environment:
      - POSTGRES_USER=root
      - POSTGRES_PASSWORD=root
      - POSTGRES_DB=ny_taxi
    volumes:
      - /Users/air/Documents/a_zoom_data_engineer/cli_docker_postgres/ny_taxi_postgres_data:/var/lib/postgresql/data
    ports:
      - 5431:5432

  pgadmin:
    image: dpage/pgadmin4
    environment:
      - PGADMIN_DEFAULT_EMAIL=admin@admin.com
      - PGADMIN_DEFAULT_PASSWORD=root
    ports:
      - 8080:80

We don’t have to specify a network because docker-compose takes care of it: every container (or “service”, as the file calls them) will run within the same default network and will be able to find the others by name (pg-database and pgadmin in this example). You can identify this network once the services are up and running with:

  docker network ls

In this case, the network is chapter_3_mynetwork

The docker-compose.yaml file contains two services: pg-database and pgadmin.

The pg-database service runs the Postgres container with the specified environment variables, volume, and ports. The pgadmin service runs the pgAdmin container with the specified environment variables and ports.

All other details from the docker run commands (environment variables, volumes, and ports) are specified in the file following YAML syntax.

2; To start the services defined in docker-compose.yaml:

docker-compose up

3; To stop and remove the containers:

docker-compose down

We now repeat the build process (you can skip this step if the image already exists). Build the Dockerfile containing the ingestion script ingest_data.py into an image and tag it as taxi_ingestion_docker_compose:v001:

docker build -t taxi_ingestion_docker_compose:v001 .

Run the newly created image taxi_ingestion_docker_compose:v001 on the network called chapter_3_mynetwork:

docker run -it --network chapter_3_mynetwork taxi_ingestion_docker_compose:v001

Note: To make the pgAdmin configuration persistent, create a folder data_pgadmin and change its ownership with:

sudo chown 5050:5050 data_pgadmin

Then mount it to the container folder /var/lib/pgadmin with:

services:
  pgadmin:
    image: dpage/pgadmin4
    volumes:
      - ./data_pgadmin:/var/lib/pgadmin
    ...

Note: to inspect the network

docker network inspect chapter_3_mynetwork

Chapter four: Provision GCP resources with Terraform

Provision GCP resources with Terraform (video)

Terraform is a powerful infrastructure as code tool that enables us to provision and manage infrastructure resources using code. By treating infrastructure as code, we can leverage version control and automation, making it easier to manage and scale our infrastructure. With Terraform, we can bypass the need for manual configuration through cloud vendor GUIs, allowing for more efficient and reproducible infrastructure management.

During this course we used Google Cloud Platform (GCP) as our cloud services provider.

GCP revolves around the concept of projects. By creating a project, you gain access to a wide range of GCP resources and services through the project dashboard.

In order to proceed, we will create both a project and a service account. The service account functions similarly to a user account, but is specifically designed for applications and workloads. With a service account, you have the ability to authorize and restrict the resources that are accessible to your applications. Additionally, we will download the authentication keys for the service account to our local machine.

To provision GCP resources with Terraform, we need to install Terraform and the Google Cloud SDK, and then:

Create a GCP project

Create a service account & roles for the project

# Install the Google Cloud SDK and check the version
brew install --cask google-cloud-sdk
gcloud -v

# Point GCP tooling at the service-account key and authenticate
export GOOGLE_APPLICATION_CREDENTIALS="<path/to/your/service-account-authkeys>.json"
gcloud auth application-default login
gcloud auth list

Your GCP account should be listed as the active account and ready to use.
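As an optional sanity check from Python (assuming the google-auth package is installed and GOOGLE_APPLICATION_CREDENTIALS is exported as above), you can confirm that the service-account key is picked up:

import google.auth

# Resolves credentials from GOOGLE_APPLICATION_CREDENTIALS and reports the project
credentials, project_id = google.auth.default()
print(f'Authenticated against project: {project_id}')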

Setup for Access

Setup for access (video)

In the following chapters we will setup a Data Lake on Google Cloud Storage and a Data Warehouse in BigQuery. We will explore these concepts in future lessons but a Data Lake is where we would usually store data and a Data Warehouse provides a more structured way to access this data.

We need to set up access first by assigning the Storage Admin, Storage Object Admin, BigQuery Admin and Viewer IAM roles to the service account, and then enable the iam and iamcredentials APIs for our project.

IAM Roles docs

  1. For the service account:
    • IAM & Admin > Service accounts
    • Click on the service account you created and edit the permissions by clicking on the pencil-shaped icon on the left.
    • Click on Add another role and add the following roles:
      • Storage Admin: for creating and managing buckets
      • Storage Object Admin: for creating and managing objects in buckets
      • BigQuery Admin: for creating and managing datasets and tables in BigQuery
  2. Enable these APIs for your project (these are needed so that Terraform can interact with GCP): the IAM API and the IAM Service Account Credentials API.

Terraform basics

Terraform basics (video)

The files for Terraform are here:

Terraform files

Terraform is a tool for building, changing, and versioning infrastructure safely and efficiently. Terraform can manage existing and popular service providers as well as custom in-house solutions.

Terraform is an open-source infrastructure as code software tool created by HashiCorp. Users define and provide data center infrastructure using a declarative configuration language known as HashiCorp Configuration Language (HCL), or optionally JSON.

Terraform manages external resources such as public cloud infrastructure, private cloud infrastructure, network appliances, software as a service, and platform as a service with a concept known as providers. Terraform supports a number of cloud infrastructure providers such as Amazon Web Services, IBM Cloud (formerly Bluemix), Google Cloud Platform, Linode, Microsoft Azure, Oracle Cloud Infrastructure, and VMware vSphere as well as OpenStack.

The set of files used to describe infrastructure in Terraform is known as a Terraform configuration. Terraform configuration files end in .tf for files written in the Terraform language, or .tf.json for JSON files. A Terraform configuration must be in its own working directory; you cannot have two or more separate configurations in the same folder.

Here’s a basic main.tf file written in the Terraform language with all of the necessary info to describe basic infrastructure:

terraform {
  required_providers {
    google = {
      source = "hashicorp/google"
      version = "3.5.0"
    }
  }
}

provider "google" {
  credentials = file("<NAME>.json")

  project = "<PROJECT_ID>"
  region  = "us-central1"
  zone    = "us-central1-c"
}

resource "google_compute_network" "vpc_network" {
  name = "terraform-network"
}

We’ll see other settings to use in this block later.

variable "region" {
    description = "Region for GCP resources. Choose as per your location: https://cloud.google.com/about/locations"
    default = "europe-west6"
    type = string
}
region = var.region
locals{
    region  = "us-central1"
    zone    = "us-central1-c"
}

Use Terraform to provision GCP resources

The needed Terraform files are:

help

Execution steps

  1. terraform init:
    • Initializes and configures the backend, installs plugins/providers, and checks out an existing configuration from version control.
  2. terraform plan:
    • Matches/previews local changes against the remote state and proposes an execution plan.
  3. terraform apply:
    • Asks for approval of the proposed plan and applies the changes to the cloud.
  4. terraform destroy:
    • Removes your stack from the cloud.

Back to the top

Extra content

Setting up a development environment in a Google Cloud VM: if you cannot set up a local development environment, you may use part of the $300 GCP credits to create a Cloud VM and access it via SSH to set up the environment there: Guide video

Port mapping and networks in Docker

If you’re having issues with Docker and networking (especially if you already have Postgres running locally on your host computer), see this Video Guide.

Back to the top

->Back: index

->Next: Data Ingestion