Introduction to Data Engineering
What is Data Engineering?
Data engineering is the process of designing, building, and maintaining the infrastructure and systems that enable the collection, storage, and analysis of data at scale.
Data engineers are responsible for designing and building the data pipelines that move data from its source to its final destination, such as a data lake or data warehouse.
A data pipeline is a service that receives data as input and produces transformed data as output, for example reading a JSON file, transforming the data, and storing it as a table in a PostgreSQL database.
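As a minimal sketch of such a pipeline (the file name, column name, and connection details below are placeholder assumptions, not values fixed by the course):
import pandas as pd
from sqlalchemy import create_engine

# Extract: read the raw JSON file (placeholder path)
df = pd.read_json('trips.json')

# Transform: e.g. parse a timestamp column and drop incomplete rows
df['pickup_datetime'] = pd.to_datetime(df['pickup_datetime'])
df = df.dropna()

# Load: store the result as a table in PostgreSQL (placeholder credentials)
engine = create_engine('postgresql://root:root@localhost:5432/ny_taxi')
df.to_sql('trips', con=engine, if_exists='replace', index=False)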
Running data pipelines with Docker
Docker is a tool that allows you to run applications in isolated containers. A container is a lightweight, standalone, executable package of software that includes everything needed to run an application: code, runtime, system tools, system libraries, and settings.
Docker provides the following advantages:
- Portability: You can run the same container on any machine that has Docker installed.
- Reproducibility: You can reproduce your work with the same environment, dependencies, and settings every time.
- Isolation: Containers are isolated from each other and from the host system, so if one container crashes or experiences a security breach, the other containers and the host system are not affected.
- Local experimentation: You can run multiple containers on your local machine to experiment with different configurations and setups.
- Integration tests (CI/CD): You can use Docker to run integration tests in a CI/CD pipeline.
- Running pipelines on the cloud: You can use Docker to run your data pipelines on the cloud, e.g. as AWS Batch or Kubernetes jobs.
- Spark: You can run Spark, an analytics engine for large-scale data processing, in a container.
- Serverless: You can run serverless functions such as AWS Lambda or Google Cloud Functions in a container.
To learn more about Docker and how to set it up on a Mac, see the Docker documentation. You may also be interested in the Docker reference cheatsheet.
Creating a simple custom pipeline: a Docker tutorial
1; Start the Docker daemon from the terminal:
Mac: `open --background -a Docker`
Linux: `sudo systemctl start docker`
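You can check that the daemon is up with `docker ps`; an empty container list (rather than an error) means Docker is running.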
2; Write a dummy pipeline.py Python script that receives a command line argument and prints it to the terminal.
import sys
import pandas  # not used yet; included so the Docker image must have pandas installed

print(sys.argv)

# argument 0 is the name of the file
# argument 1 contains the actual first argument
day = sys.argv[1]
print(f'job finished successfully for day = {day}')
Verify that this script works by running it in the terminal with:
python pipeline.py 2021-10-01
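Expected output:
['pipeline.py', '2021-10-01']
job finished successfully for day = 2021-10-01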
3; Create a Dockerfile that dockerizes the script into an image:
FROM python:3.9
RUN pip install pandas
WORKDIR /app
COPY pipeline.py pipeline.py
ENTRYPOINT ["python", "pipeline.py"]
Let's build the image:
docker build -t pipeline:v001 .
Here the image name is pipeline and the tag is v001, specifying the version number. If the tag is not specified, the default tag latest is assigned.
4; Run the image in a container with the command:
docker run -it pipeline:v001 2021-10-01
The container is run in interactive mode (-it) to allow input from the terminal. Running the container produces the same result as running the Python script directly.
NB: the Dockerfile and script must be in the same directory.
Running PostgreSQL in Docker
In the later part of the course, there is a data pipeline script that reads data from the internet and stores it in a PostgreSQL database. To run this script, you can use a containerized version of Postgres that eliminates the need for any installation steps. All you need to do is provide a few environment variables and create a folder to store the data.
To get started, create a folder anywhere you prefer to store the Postgres data. For example, you can create a folder called “ny_taxi_postgres_data”. Once you have the folder ready, you can run the container using the following command:
docker run -it \
-e POSTGRES_USER="root" \
-e POSTGRES_PASSWORD="root" \
-e POSTGRES_DB="ny_taxi" \
-v $(pwd)/ny_taxi_postgres_data:/var/lib/postgresql/data \
-p 5431:5432 \
--name pg-database \
postgres:13
This command will run a Postgres container with the following settings:
- Environment variables (-e): the username and password for the database are "root" and "root", and the name of the database is "ny_taxi".
- Volume (-v): the data for the database is stored in the folder "ny_taxi_postgres_data", mounted inside the container at /var/lib/postgresql/data.
- Port (-p): the container listens on port 5432, which is mapped to port 5431 on the host machine.
- Name (--name): the container is named "pg-database".
- The version of Postgres is 13.
NB: Make sure the host port is not being used by another program by checking with lsof -i:5431. This shows whether the port is available or taken; if it is taken, change it.
2.1; Once the container is running, you can connect to the PostgreSQL database with pgcli:
pgcli -h localhost -p 5431 -u root -d ny_taxi
Running pgAdmin in Docker
If you don't want to interact with the database via the CLI, you can also interact with it using pgAdmin in Docker. pgAdmin is an interface for managing PostgreSQL databases. To run pgAdmin in Docker, use the following command (the network pg-network must already exist; it is created in section 1.1 below):
docker run -it \
-e PGADMIN_DEFAULT_EMAIL="admin@admin.com" \
-e PGADMIN_DEFAULT_PASSWORD="root" \
-p 8080:80 \
--network=pg-network \
--name pgadmin \
dpage/pgadmin4
This command will run pgAdmin with the following settings:
- Environment variables (-e): the default email and password for pgAdmin are "admin@admin.com" and "root".
- Port (-p): the container listens on port 80, which is mapped to port 8080 on the host machine.
- Network (--network): the container is connected to the network "pg-network".
- Name (--name): the container is named "pgadmin".
- The image used is "dpage/pgadmin4".
Ingest data from a Jupyter notebook into the PostgreSQL container
Creating a Jupyter Notebook for Data Upload
At this point we will upload data from a CSV file to Postgres. We will create a Jupyter notebook called upload-data.ipynb. In this notebook, we will read a CSV file and export its contents to the Postgres container.
For this task, we will use the Yellow Taxi trip records CSV file for January 2021, which can be obtained from the NYC TLC Trip Record Data website. To understand the meaning of each field in the CSV file, you can refer to the field-description table available on the same site.
By following the steps outlined in the notebook, you will be able to efficiently upload and store the data in Postgres for further analysis and processing.
ingest_ny_taxi_data_to_postgresql_docker.ipynb
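The notebook is not reproduced here, but its core loop might look like the minimal sketch below (chunked so the file does not have to fit in memory; the file, table, and column names are taken from the TLC dataset and may differ in your copy):
import pandas as pd
from sqlalchemy import create_engine

# Connecting from the host, so use the mapped host port 5431
engine = create_engine('postgresql://root:root@localhost:5431/ny_taxi')

# Read the CSV in chunks of 100k rows
df_iter = pd.read_csv('yellow_tripdata_2021-01.csv', iterator=True, chunksize=100000)

first_chunk = True
for df in df_iter:
    # Parse the datetime columns
    df['tpep_pickup_datetime'] = pd.to_datetime(df['tpep_pickup_datetime'])
    df['tpep_dropoff_datetime'] = pd.to_datetime(df['tpep_dropoff_datetime'])
    # Create the table on the first chunk, then append
    df.to_sql('yellow_taxi_data', con=engine,
              if_exists='replace' if first_chunk else 'append', index=False)
    first_chunk = False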
1; Query the data from the pgAdmin container
1.1; Create the network on which both containers will run
docker network create pg-network
1.2; Run the PostgreSQL container
docker run -it \
-e POSTGRES_USER="root" \
-e POSTGRES_PASSWORD="root" \
-e POSTGRES_DB="ny_taxi" \
-v /Users/air/Documents/a_zoom_data_engineer/cli_docker_postgres/ny_taxi_postgres_data:/var/lib/postgresql/data \
-p 5431:5432 \
--network=pg-network \
--name pg-database \
postgres:13
1.3; Run the pgAdmin container
docker run -it \
-e PGADMIN_DEFAULT_EMAIL="admin@admin.com" \
-e PGADMIN_DEFAULT_PASSWORD="root" \
-p 8080:80 \
--network=pg-network \
--name pgadmin \
dpage/pgadmin4
NB: pgAdmin listens on port 80 inside the container; the host listens on port 8080.
Log in to pgAdmin via a web browser at:
URL: http://localhost:8080/browser/
email: admin@admin.com
password: root
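Once logged in, register the database as a new server (right-click Servers > Register > Server; exact menu wording varies by pgAdmin version). Since pgAdmin and Postgres share pg-network, use the container name as the host: host pg-database, port 5432, username root, password root.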
1.4; Run the ingestion script
Open the Jupyter notebook and run the cells to execute the code.
3. Dockerize the ingestion script
Goal: Convert ingest_ny_taxi_data_to_postgresql_docker.ipynb into a Python script ingest_data.py that reads secrets from a .env file.
Note: change the hostname to the name of the Postgres container, pg-database.
Aim: Run the PostgreSQL and pgAdmin containers and ingest the data with the script: python ingest_data.py
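A minimal sketch of what ingest_data.py could look like (the .env variable names are assumptions; note the host is the container name pg-database and the container-side port 5432, since the script will run inside pg-network):
import os

import pandas as pd
from dotenv import load_dotenv
from sqlalchemy import create_engine

# Read secrets from the .env file
load_dotenv()
user = os.getenv('POSTGRES_USER')
password = os.getenv('POSTGRES_PASSWORD')
db = os.getenv('POSTGRES_DB')
host = os.getenv('POSTGRES_HOST', 'pg-database')  # container name on pg-network

engine = create_engine(f'postgresql://{user}:{password}@{host}:5432/{db}')

df = pd.read_csv('yellow_tripdata_2021-01.csv')
df.to_sql('yellow_taxi_data', con=engine, if_exists='replace', index=False)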
3.1 Dockerize the ingestion with a Dockerfile
FROM python:3.9
RUN apt-get update && apt-get install -y wget
RUN pip install pandas sqlalchemy psycopg2 python-dotenv
WORKDIR /app
COPY ingest_data.py ingest_data.py
COPY .env .env
ENTRYPOINT ["python", "ingest_data.py" ]
3.2 Build the Dockerfile into a Docker image called taxi_ingestion:v001 by running the commands:
>> docker build -t taxi_ingestion:v001 .
>> docker run -t taxi_ingestion:v001
3.3 With the PostgreSQL and pgAdmin containers running, run the image to execute the ingestion script
docker network create pg-network
docker run -it \
-e POSTGRES_USER="root" \
-e POSTGRES_PASSWORD="root" \
-e POSTGRES_DB="ny_taxi" \
-v /Users/air/Documents/a_zoom_data_engineer/cli_docker_postgres/ny_taxi_postgres_data:/var/lib/postgresql/data \
-p 5431:5432 \
--network=pg-network \
--name pg-database \
postgres:13
docker run -it \
-e PGADMIN_DEFAULT_EMAIL="admin@admin.com" \
-e PGADMIN_DEFAULT_PASSWORD="root" \
-p 8080:80 \
--network=pg-network \
--name pgadmin \
dpage/pgadmin4
docker run -it --network=pg-network taxi_ingestion:v001
note: -t allocates a pseudo-terminal for the container and -i keeps STDIN open; --network attaches the container to pg-network so the script can reach pg-database.
Note: To simulate a server from the local directory (which makes the directory browsable as a website), run:
python -m http.server
The directory is then served at http://127.0.0.1:8000/
Note: To get the machine's IP address, i.e. inet, run:
ifconfig | grep "inet"
(From inside a container, use that address rather than 127.0.0.1, which refers to the container itself.)
3.4 Combine pgAdmin and Postgres with docker-compose.yaml, so that a single command spins up both containers
1; Create a docker-compose.yaml
services:
  pgdatabase:
    image: postgres:13
    environment:
      - POSTGRES_USER=root
      - POSTGRES_PASSWORD=root
      - POSTGRES_DB=ny_taxi
    volumes:
      - "./ny_taxi_postgres_data:/var/lib/postgresql/data:rw"
    ports:
      - "5432:5432"
    networks:
      - mynetwork
  pgadmin:
    image: dpage/pgadmin4
    environment:
      - PGADMIN_DEFAULT_EMAIL=admin@admin.com
      - PGADMIN_DEFAULT_PASSWORD=root
      # - PGADMIN_LISTEN_PORT=5050
    volumes:
      - data_pgadmin:/var/lib/pgadmin
    ports:
      - "8080:80"
    networks:
      - mynetwork
volumes:
  data_pgadmin:
networks:
  mynetwork:
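Note: Docker Compose prefixes names with the project name (by default the directory the file lives in), so if this file sits in a folder called chapter_3, the network is created as chapter_3_mynetwork, which is the name used in section 3.6 below.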
2; To run docker-compose.yaml:
docker-compose up
3; To stop and remove the containers:
docker-compose down
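Note: docker-compose up -d starts the containers in detached mode, leaving the terminal free; logs can then be followed with docker-compose logs -f.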
3.5 Build the Dockerfile containing the ingestion script ingest_data.py into an image, tagged taxi_ingestion_docker_compose:v001
docker build -t taxi_ingestion_docker_compose:v001 .
3.6 Run the just-created image taxi_ingestion_docker_compose:v001 in the network called chapter_3_mynetwork
docker run -it --network chapter_3_mynetwork taxi_ingestion_docker_compose:v001
Note: To make the pgAdmin configuration persistent, create a folder data_pgadmin, change its permissions via sudo chown 5050:5050 data_pgadmin, and mount it to the container folder /var/lib/pgadmin with:
services:
  pgadmin:
    image: dpage/pgadmin4
    volumes:
      - ./data_pgadmin:/var/lib/pgadmin
    ...
Note: to inspect the network
docker network inspect chapter_3_mynetwork
Chapter four
Provision GCP resources with Terraform
Install:
- Python 3 (e.g. installed with Anaconda)
- Google Cloud SDK
- Docker with docker-compose
- Terraform
4.1 Create GCP project
4.2 Create a service account & roles for the project
- grant the Viewer role to the service account
- generate & download the service-account key as a .json file
1; Install the Google Cloud SDK and verify the installation:
gcloud -v
2; Add the path of the downloaded key file to an environment variable:
export GOOGLE_APPLICATION_CREDENTIALS="<path/to/your/service-account-authkeys>.json"
# Refresh token/session, and verify authentication
gcloud auth application-default login
4.3 Setup for Access
- IAM Roles for the service account:
  - Go to the IAM section of IAM & Admin
  - Click the Edit principal icon for your service account
  - Add these roles in addition to Viewer: Storage Admin + Storage Object Admin + BigQuery Admin
- Enable these APIs for your project:
4.4 Use Terraform to provision GCP resources
The needed Terraform files are:
- main.tf
- variables.tf
- Optional: resources.tf, output.tf, .tfstate
- terraform: configure basic Terraform settings to provision your infrastructure
  - required_version: minimum Terraform version to apply to your configuration
  - backend: stores Terraform's "state" snapshots, to map real-world resources to your configuration
    - local: stores the state file locally as terraform.tfstate
  - required_providers: specifies the providers required by the current module
- provider: adds a set of resource types and/or data sources that Terraform can manage
  - The Terraform Registry is the main directory of publicly available providers from most major infrastructure platforms.
- resource: blocks to define components of your infrastructure
  - Project modules/resources: google_storage_bucket, google_bigquery_dataset, google_bigquery_table
- variable & locals: runtime arguments and constants
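To make these blocks concrete, a minimal main.tf might look like the sketch below (the project variable, region, and bucket name are placeholders, not values prescribed by the course):
terraform {
  required_version = ">= 1.0"
  backend "local" {}
  required_providers {
    google = {
      source = "hashicorp/google"
    }
  }
}

provider "google" {
  project = var.project  # declared in variables.tf
  region  = "europe-west6"  # placeholder region
}

resource "google_storage_bucket" "data-lake-bucket" {
  name          = "${var.project}-data-lake"  # bucket names must be globally unique
  location      = "EUROPE-WEST6"
  force_destroy = true
}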
4.5 Execution steps
- terraform init: initializes & configures the backend, installs plugins/providers, & checks out an existing configuration from version control
- terraform plan: matches/previews local changes against the remote state, and proposes an execution plan
- terraform apply: asks for approval of the proposed plan, and applies the changes to the cloud
- terraform destroy: removes your stack from the cloud
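A typical cycle, assuming a project variable is declared in variables.tf, looks like:
terraform init
terraform plan -var="project=<your-gcp-project-id>"
terraform apply -var="project=<your-gcp-project-id>"
terraform destroy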