
->Back: index

->Next: Data Ingestion

Introduction to Data Engineering


What is Data Engineering?

Data engineering is the process of designing, building, and maintaining the infrastructure and systems that enable the collection, storage, and analysis of data at scale.

Data engineers are responsible for designing and building the data pipelines that move data from its source to its final destination, such as a data lake or data warehouse.

A data pipeline is a service that receives data as input and produces data as output, for example reading a JSON file, transforming the data, and storing it as a table in a PostgreSQL database.

[Image: Data pipeline]
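As a toy illustration, here is a minimal sketch of such a pipeline in Python. The file name, table name, and connection details are hypothetical placeholders, and pandas plus SQLAlchemy (with the psycopg2 driver) are assumed to be installed:

import pandas as pd
from sqlalchemy import create_engine

# Input: read raw records from a JSON file (hypothetical file name)
df = pd.read_json('trips.json')

# Transform: a trivial cleaning step for illustration
df = df.drop_duplicates()

# Output: store the result as a table in PostgreSQL
# (connection string and table name are placeholders)
engine = create_engine('postgresql://root:root@localhost:5432/ny_taxi')
df.to_sql('trips', con=engine, if_exists='replace', index=False)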

Docker and Postgres

What is Docker?

Docker is a tool designed to make it easier to create, deploy, and run applications by using containers. Containers allow a developer to package up an application with all of the parts it needs, such as libraries and other dependencies, and ship it all out as one package. By doing so, thanks to the container, the developer can rest assured that the application will run on any other Linux machine regardless of any customized settings that machine might have that could differ from the machine used for writing and testing the code.

A couple of Docker concepts are worth understanding before we start:

Docker containers are based on images. An image is a read-only template with instructions for creating a Docker container. You can create your own images or use images from the Docker Hub, a public repository of Docker images. Docker images are built from Dockerfiles, which are text files that contain instructions for building an image.

Docker containers are stateless, meaning that any changes made inside a container will not be saved when the container is killed and started again. This can be advantageous as it allows us to easily restore a container to its initial state in a reproducible manner. However, if you need to persist data, you will need to store it elsewhere. One common approach is to use volumes. Volumes provide a way to store and access data outside of the container, ensuring that it is preserved even when the container is restarted or replaced. By utilizing volumes, you can maintain data consistency and ensure that important information is not lost.

To learn more about Docker and how to set it up on a Mac, see docker. You may also be interested in the Docker reference cheatsheet.

Creating a simple custom pipeline (Docker tutorial)

1; Start the Docker daemon from the terminal with one of these commands:

Mac: `open --background -a Docker`

Linux: `sudo systemctl start docker`

2; Write a dummy pipeline.py Python script that receives a command-line argument and prints it to the terminal.

import sys
# pandas is not used directly; it is imported only to show that the Docker
# image must provide the script's dependencies
import pandas

print(sys.argv)

# argument 0 is the name of the file
# argument 1 contains the actual first argument
day = sys.argv[1]

print(f'job finished successfully for day = {day}')

Verify that this script works by running it in the terminal with:

  python pipeline.py 2021-10-01

3; Create a Dockerfile that builds an image containing the Python script.

This script can be dockerized into an image with the following Dockerfile:

FROM python:3.9
RUN pip install pandas
WORKDIR /app
COPY pipeline.py pipeline.py
ENTRYPOINT ["python", "pipeline.py"]

Let's build the image:

  docker build -t pipeline:v001 .

Here the image name is pipeline with the tag v001, specifying the version number. If no tag is specified, the default tag latest is assigned.

4; Run the image in a container with the command:

  docker run -it pipeline:v001 2021-10-01

The container is run in interactive mode (-it) to allow input from the terminal.

Running the container produces the same result as running the Python script directly.

NB: the Dockerfile and script must be in the same directory.

Running PostgreSQL in Docker

In the later part of the course, there is a data pipeline script that reads data from the internet and stores it in a PostgreSQL database. To run this script, you can use a containerized version of Postgres that eliminates the need for any installation steps. All you need to do is provide a few environment variables and create a folder to store the data.

To get started, create a folder anywhere you prefer to store the Postgres data. For example, you can create a folder called “ny_taxi_postgres_data”. Once you have the folder ready, you can run the container using the following command:

docker run -it \
  -e POSTGRES_USER="root" \
  -e POSTGRES_PASSWORD="root" \
  -e POSTGRES_DB="ny_taxi" \
  -v $(pwd)/ny_taxi_postgres_data:/var/lib/postgresql/data \
  -p 5432:5432 \
  --name pg-database \
  postgres:13

[Image: PostgreSQL Docker execution]

This command runs a Postgres container with user root, password root and a database called ny_taxi, mounts the local ny_taxi_postgres_data folder to /var/lib/postgresql/data inside the container so the data persists between restarts, maps port 5432 on localhost to port 5432 in the container, and names the container pg-database.

NB: Make sure the localhost port is not already in use by another program by checking with lsof -i:5432

This will show whether the port is available or taken. If it is taken, change the host port in the command above.

Once the container is running, you can connect to the PostgreSQL database with pgcli:

pgcli -h localhost -p 5432 -u root -d ny_taxi

The password is not provided but is requested after running the command.
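If you prefer to check the connection from Python instead of pgcli, here is a minimal sketch, assuming the same credentials and port as the docker run command above and that pandas, SQLAlchemy and psycopg2 are installed:

import pandas as pd
from sqlalchemy import create_engine

# Same credentials and port as the containerized Postgres above
engine = create_engine('postgresql://root:root@localhost:5432/ny_taxi')

# A trivial query just to confirm that the connection works
print(pd.read_sql('SELECT 1 AS ok;', con=engine))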

Running pgAdmin in Docker

If you don’t want to interact with the database via the CLI, you can use pgAdmin, a web interface for managing PostgreSQL databases, which can also run in Docker. The command below attaches it to the pg-network Docker network (created in the next subsection):

docker run -it \
  -e PGADMIN_DEFAULT_EMAIL="admin@admin.com" \
  -e PGADMIN_DEFAULT_PASSWORD="root" \
  -p 8080:80 \
  --network=pg-network \
  --name pgadmin \
  dpage/pgadmin4

This command runs pgAdmin with the default login email admin@admin.com and password root, maps port 8080 on localhost to port 80 in the container, attaches the container to the pg-network network, and names it pgadmin.

Connecting pgAdmin to PostgreSQL with a Docker network

To connect pgAdmin to the Postgres container, you need to create a network that both containers can connect to; both containers must then be started with --network=pg-network so they can find each other by name. You can create the network using the following command:

docker network create pg-network

Ingest data from a Jupyter notebook into the PostgreSQL container

Create a Jupyter Notebook for Data Upload

At this point we will upload data from a CSV file to Postgres. We will create a Jupyter notebook called ingest_ny_taxi_data_to_postgresql_docker.ipynb. In this notebook, we will read a CSV file and export its contents to the Postgres database running in Docker.

For this task, we will use the Yellow taxi trip records CSV file for January 2021, which can be obtained from the NYC TLC Trip Record Data website. To understand the meaning of each field in the CSV file, you can refer to the available explanation Table.

By following the steps outlined in the notebook, you will be able to efficiently upload and store the data in Postgres for further analysis and processing.

ingest_ny_taxi_data_to_postgresql_docker.ipynb
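The heart of the notebook looks roughly like the sketch below. This is an illustration rather than the exact notebook code; the CSV file name, chunk size and connection string are assumptions. The idea is to read the CSV in chunks, create the table from the header, and append each chunk to Postgres so the whole file never has to fit in memory.

import pandas as pd
from sqlalchemy import create_engine

engine = create_engine('postgresql://root:root@localhost:5432/ny_taxi')

# Read the CSV in chunks instead of loading it all at once
df_iter = pd.read_csv('yellow_tripdata_2021-01.csv', iterator=True, chunksize=100_000)

first_chunk = next(df_iter)
first_chunk['tpep_pickup_datetime'] = pd.to_datetime(first_chunk['tpep_pickup_datetime'])
first_chunk['tpep_dropoff_datetime'] = pd.to_datetime(first_chunk['tpep_dropoff_datetime'])

# Create the table from the header only, then append the first chunk
first_chunk.head(0).to_sql('yellow_taxi_data', con=engine, if_exists='replace', index=False)
first_chunk.to_sql('yellow_taxi_data', con=engine, if_exists='append', index=False)

# Append the remaining chunks
for chunk in df_iter:
    chunk['tpep_pickup_datetime'] = pd.to_datetime(chunk['tpep_pickup_datetime'])
    chunk['tpep_dropoff_datetime'] = pd.to_datetime(chunk['tpep_dropoff_datetime'])
    chunk.to_sql('yellow_taxi_data', con=engine, if_exists='append', index=False)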

Create the network on which both containers will run


# Remove the Docker network
docker network rm pg-network

# recreate the docker network
docker network create pg-network

Run the PostgreSQL container

docker run -it \
  -e POSTGRES_USER="root" \
  -e POSTGRES_PASSWORD="root" \
  -e POSTGRES_DB="ny_taxi" \
  -v /Users/air/Documents/a_zoom_data_engineer/cli_docker_postgres/ny_taxi_postgres_data:/var/lib/postgresql/data \
  -p 5432:5432 \
  --network=pg-network \
  --name pg-database \
  postgres:13

[Image: successful run of the PostgreSQL container]

Run the pgAdmin container

docker run -it \
  -e PGADMIN_DEFAULT_EMAIL="admin@admin.com" \
  -e PGADMIN_DEFAULT_PASSWORD="root" \
  -p 8080:80 \
  --network=pg-network \
  --name pgadmin \
  dpage/pgadmin4

NB: pgAdmin listens on port 80 inside the container; this is mapped to port 8080 on localhost.

Log in to pgAdmin in a web browser at http://localhost:8080/browser/ using the same email and password you passed to the container:

email: admin@admin.com

password: root

[Image: pgAdmin landing page]

Click Tools -> Register -> Server.

[Image: creating the database login]

Give the server a name, and under Connection add the same host name, user and password you used when running the container (the host name is the name of the Postgres container, pg-database).


Click on Save. You should now be connected to the database.

We will explore using pgAdmin in later lessons.

Run the ingestion script

Open the Jupyter notebook and run the cells to execute the code.

[Image: data successfully read into the ny_taxi database]

Dockerize the ingestion script

Convert the ingest_ny_taxi_data_to_postgresql_docker.ipynb Jupyter Notebook into a Python script called ingest_data.py. In the script, read secrets from the .env file.

Please note that you should change the hostname to the name of the Postgres container, pg-database.

Aim: run the PostgreSQL and pgAdmin containers and ingest the data with the ingest_data.py script: python ingest_data.py

Dockerize the ingestion script with a Dockerfile

Export and test the script

Dockerize the ingestion script with a Dockerfile (video)

We will now export the Jupyter notebook file to a regular Python script and use Docker to run it.

To export the ipynb file to a py file, use the following command:

jupyter nbconvert --to=script ingest_ny_taxi_data_to_postgresql_docker.ipynb

Clean up the script by removing unnecessary code. Rename it to ingest_data.py and make the following modifications:

We will create an .env file to pass arguments to the script. The .env file should contain the following:

POSTGRES_USER=root
POSTGRES_PASSWORD=root
POSTGRES_DB=ny_taxi
HOST=pg-database
PORT=5432
TABLE_NAME=yellow_taxi_trips
URL=https://s3.amazonaws.com/nyc-tlc/trip+data/yellow_tripdata_2021-01.csv

Modify the engine used to connect to Postgres by passing the parameters and building the URL:

engine = create_engine(f'postgresql://{user}:{password}@{host}:{port}/{db}')

Download the CSV using the provided URL argument.
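As a rough sketch of what ingest_data.py might look like after these modifications (the completed script is linked below; details such as downloading the file with urllib and the chunk size are illustrative assumptions):

import os
from urllib.request import urlretrieve

import pandas as pd
from dotenv import load_dotenv
from sqlalchemy import create_engine

# Load the connection settings from the .env file
load_dotenv()
user = os.getenv('POSTGRES_USER')
password = os.getenv('POSTGRES_PASSWORD')
db = os.getenv('POSTGRES_DB')
host = os.getenv('HOST')
port = os.getenv('PORT')
table_name = os.getenv('TABLE_NAME')
url = os.getenv('URL')

# Download the CSV from the URL given in .env
csv_name = 'output.csv'
urlretrieve(url, csv_name)

# Build the engine from the parameters and ingest the data in chunks
engine = create_engine(f'postgresql://{user}:{password}@{host}:{port}/{db}')
df_iter = pd.read_csv(csv_name, iterator=True, chunksize=100_000)
for i, chunk in enumerate(df_iter):
    chunk.to_sql(table_name, con=engine, if_exists='replace' if i == 0 else 'append', index=False)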

You can find the completed ingest_data.py script here.

To test the script, we need to drop the table we previously created. In pgAdmin, navigate to Servers > Docker localhost > Databases > ny_taxi > Schemas > public > Tables > yellow_taxi_data. Right-click on yellow_taxi_data and select Query tool. Enter the following command:

DROP TABLE yellow_taxi_data;

Run the script with the following command:

python ingest_data.py

In pgAdmin, refresh the Tables and check that yellow_taxi_trips was created. You can also run a SQL query to check the contents:

SELECT COUNT(1) FROM yellow_taxi_trips;

This query should return 1,369,765 rows.

Dockerize the script

The ingestion script, together with the .env file, can be dockerized with the following Dockerfile:

FROM python:3.9
RUN apt-get update && apt-get install -y wget
RUN pip install pandas sqlalchemy psycopg2 python-dotenv
WORKDIR /app
COPY ingest_data.py ingest_data.py
COPY .env .env
ENTRYPOINT ["python", "ingest_data.py"]

Build the Dockerfile into a docker image

To build the Dockerfile into a Docker image called taxi_ingestion:v001, run the following command:

docker build -t taxi_ingestion:v001 .

This command will build the Dockerfile and tag the resulting image as taxi_ingestion:v001.

Run the docker image

Since the taxi_ingestion image uploads data into Postgres, the PostgreSQL container must be up and running before the ingestion script is run:

docker network create pg-network

docker run -it \
  -e POSTGRES_USER="root" \
  -e POSTGRES_PASSWORD="root" \
  -e POSTGRES_DB="ny_taxi" \
  -v /Users/air/Documents/a_zoom_data_engineer/cli_docker_postgres/ny_taxi_postgres_data:/var/lib/postgresql/data \
  -p 5431:5432 \
  --network=pg-network \
  --name pg-database \
  postgres:13

docker run -it \
  -e PGADMIN_DEFAULT_EMAIL="admin@admin.com" \
  -e PGADMIN_DEFAULT_PASSWORD="root" \
  -p 8080:80 \
  --network=pg-network \
  --name pgadmin \
  dpage/pgadmin4

docker run -it --network=pg-network taxi_ingestion:v001

Note:

To serve your local directory as a simple website, you can run the following command:

python -m http.server

Note:

If you want to find the default IP address (inet), you can use the following command:

ifconfig | grep "inet"

To access the directory served this way, use the following URL:

http://127.0.0.1:8000/

Running pgAdmin and Postgres with Docker Compose

docker-compose allows us to launch multiple containers using a single configuration file, so that we don’t have to run several complex docker run commands separately. It simplifies the process by using YAML files. Here’s an example of a docker-compose.yaml file that can be used to run the Postgres and pgAdmin containers; a single command spins up both containers.

1; Create a docker-compose.yaml file with the following content:

version: '3'
services:
  pg-database:
    image: postgres:13
    environment:
      - POSTGRES_USER=root
      - POSTGRES_PASSWORD=root
      - POSTGRES_DB=ny_taxi
    volumes:
      - /Users/air/Documents/a_zoom_data_engineer/cli_docker_postgres/ny_taxi_postgres_data:/var/lib/postgresql/data
    ports:
      - 5431:5432

  pgadmin:
    image: dpage/pgadmin4
    environment:
      - PGADMIN_DEFAULT_EMAIL=admin@admin.com
      - PGADMIN_DEFAULT_PASSWORD=root
    ports:
      - 8080:80

We don’t have to specify a network because docker-compose takes care of it: every container (or “service”, as the file calls them) will run within the same default network and will be able to find the others by name (pg-database and pgadmin in this example). You can identify this network once the services are up and running with:

  docker network ls

In this case, the network is chapter_3_mynetwork

The docker-compose.yaml file contains two services: pg-database and pgadmin.

The pg-database service runs the Postgres container with the specified environment variables, volume, and ports. The pgadmin service runs the pgAdmin container with the specified environment variables and ports.

All other details from the docker run commands (environment variables, volumes, and ports) are specified in the file following YAML syntax.

2; To start the services defined in docker-compose.yaml:

docker-compose up

3; To stop and remove the containers:

docker-compose down

We now repeat the build process (you can skip this step if the image already exists). Build the Dockerfile containing the ingestion script ingest_data.py into an image and tag it as taxi_ingestion_docker_compose:v001:

docker build -t taxi_ingestion_docker_compose:v001 .

Run the newly created image taxi_ingestion_docker_compose:v001 on the network called chapter_3_mynetwork:

docker run -it --network chapter_3_mynetwork taxi_ingestion_docker_compose:v001

Note: To make the pgAdmin configuration persistent, create a folder data_pgadmin and change its ownership with:

sudo chown 5050:5050 data_pgadmin

Then mount it to the container folder /var/lib/pgadmin with:

services:
  pgadmin:
    image: dpage/pgadmin4
    volumes:
      - ./data_pgadmin:/var/lib/pgadmin
    ...

Note: to inspect the network

docker network inspect chapter_3_mynetwork

Chapter four: Provision GCP resources with Terraform

Provision GCP resources with Terraform (video)

Terraform is a powerful infrastructure as code tool that enables us to provision and manage infrastructure resources using code. By treating infrastructure as code, we can leverage version control and automation, making it easier to manage and scale our infrastructure. With Terraform, we can bypass the need for manual configuration through cloud vendor GUIs, allowing for more efficient and reproducible infrastructure management.

During this course we used Google Cloud Platform (GCP) as our cloud services provider.

GCP revolves around the concept of projects. By creating a project, you gain access to a wide range of GCP resources and services through the project dashboard.

In order to proceed, we will create both a project and a service account. The service account functions similarly to a user account, but is specifically designed for applications and workloads. With a service account, you have the ability to authorize and restrict the resources that are accessible to your applications. Additionally, we will download the authentication keys for the service account to our local machine.

To provision GCP resources with Terraform, we need to install Terraform and the Google Cloud SDK, and then:

Create a GCP project

Create a service account & roles for the project

# Install the Google Cloud SDK and check the version
brew install --cask google-cloud-sdk
gcloud -v

# Point GCP tooling at the service-account key and authenticate
export GOOGLE_APPLICATION_CREDENTIALS="<path/to/your/service-account-authkeys>.json"
gcloud auth application-default login
gcloud auth list

Your GCP account should be listed as the active account and ready to use.
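As an optional sanity check from Python (assuming the google-auth package is installed and GOOGLE_APPLICATION_CREDENTIALS is exported as above), you can confirm that the service-account key is picked up:

import google.auth

# Resolves credentials from GOOGLE_APPLICATION_CREDENTIALS and reports the project
credentials, project_id = google.auth.default()
print(f'Authenticated against project: {project_id}')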

Setup for Access

Setup for access (video)

In the following chapters we will setup a Data Lake on Google Cloud Storage and a Data Warehouse in BigQuery. We will explore these concepts in future lessons but a Data Lake is where we would usually store data and a Data Warehouse provides a more structured way to access this data.

We need to set up access first by assigning the Storage Admin, Storage Object Admin, BigQuery Admin and Viewer IAM roles to the service account, and then enable the iam and iamcredentials APIs for our project.

IAM Roles docs

  1. For the service account:
    • IAM & Admin > Service accounts
    • Click on the service account you created and edit the permissions by clicking on the pencil-shaped icon on the left.
    • Click on Add another role and add the following roles:
      • Storage Admin: for creating and managing buckets
      • Storage Object Admin: for creating and managing objects in buckets
      • BigQuery Admin: for creating and managing datasets and tables in BigQuery
  2. Enable these APIs for your project (these are needed so that Terraform can interact with GCP): the IAM API and the IAM Service Account Credentials API.

Terraform basics

Terraform basics (video)

The files for Terraform are here:

Terraform files

Terraform is a tool for building, changing, and versioning infrastructure safely and efficiently. Terraform can manage existing and popular service providers as well as custom in-house solutions.

Terraform is an open-source infrastructure as code software tool created by HashiCorp. Users define and provide data center infrastructure using a declarative configuration language known as HashiCorp Configuration Language (HCL), or optionally JSON.

Terraform manages external resources such as public cloud infrastructure, private cloud infrastructure, network appliances, software as a service, and platform as a service with a concept known as providers. Terraform supports a number of cloud infrastructure providers such as Amazon Web Services, IBM Cloud (formerly Bluemix), Google Cloud Platform, Linode, Microsoft Azure, Oracle Cloud Infrastructure, and VMware vSphere as well as OpenStack.

The set of files used to describe infrastructure in Terraform is known as a Terraform configuration. Terraform configuration files end in .tf for files written in the Terraform language, or .tf.json for JSON files. A Terraform configuration must be in its own working directory; you cannot have two or more separate configurations in the same folder.

Here’s a basic main.tf file written in the Terraform language with all of the necessary info to describe basic infrastructure:

terraform {
  required_providers {
    google = {
      source = "hashicorp/google"
      version = "3.5.0"
    }
  }
}

provider "google" {
  credentials = file("<NAME>.json")

  project = "<PROJECT_ID>"
  region  = "us-central1"
  zone    = "us-central1-c"
}

resource "google_compute_network" "vpc_network" {
  name = "terraform-network"
}

We’ll see other settings to use in this block later.

variable "region" {
    description = "Region for GCP resources. Choose as per your location: https://cloud.google.com/about/locations"
    default = "europe-west6"
    type = string
}
region = var.region
locals{
    region  = "us-central1"
    zone    = "us-central1-c"
}

Use Terraform to provision GCP resources

The needed Terraform files are:

help

Execution steps

  1. terraform init:
    • Initializes and configures the backend, installs plugins/providers, and checks out an existing configuration from version control.
  2. terraform plan:
    • Matches/previews local changes against the remote state and proposes an execution plan.
  3. terraform apply:
    • Asks for approval of the proposed plan and applies the changes to the cloud.
  4. terraform destroy:
    • Removes your stack from the cloud.

Back to the top

Extra content

Setting up a development environment in a Google Cloud VM: if you cannot set up a local development environment, you may use part of the $300 GCP credits to create a Cloud VM and access it via SSH to set up the environment there: Guide video

Port mapping and networks in Docker

If you’re having issues with Docker and networking (especially if you already have Postgres running locally on your host computer), see this Video Guide.

Back to the top

->Back: index

->Next: Data Ingestion