You are assigned to set up a new repo for a team. The requirements are as follows:
So we need to align on the pyspark version; otherwise we do not enjoy the guarantees we want in production code.
PP
Petra the Python Dev
Hi my venv somehow got corrupted 🥲. It's saying No module named 'numpy' even though I seemingly have it installed. Not sure how to fix.
P.S. this is also why I was not in the meeting.
EO
Erik the old project maintainer
Hey so I was working on maintaining the old project. Just found out that some of the dependencies don't even work on my MacOS version anymore. FML
BW
Billy Windows
I'm trying to dockerize the project but it's haaard. Everything that works on my Windows laptop seems to fail on Linux.
PU
Pyspark User
Hii. Can you maybe run pip show pyspark ? I'm curious which pyspark version you are running 🧐. Because if it works for you but not for me and also not in the CI maybe your environment is different than both. Just checking.
JJ
Johnny Junior
Good day. I'm new at the company and wanted to get started working on the repo. Tried following the README steps but doesn't work. By any chance: are there any more detailed docs available for project setup? No right?
Issues
PP | Petra the Python Dev | Corrupted Virtual Environment
EO | Erik the old project maintainer | Outdated project dependencies
BW | Billy Windows | Containerisation & going into production, e.g. Windows / Linux / MacOS
PU | PySpark user | Inconsistent environments, e.g. local vs. CI vs. prod vs. team members
JJ | Johnny Junior | No formal specification of install steps and missing docs
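To make the "inconsistent environments" issue concrete, here is a minimal Python sketch of what `pip show pyspark` answers, extended so that team members and CI can print the same report and compare. The helper name `describe_environment` and the example package list are assumptions for illustration; they are not part of the project.

```python
import platform
import sys
from importlib.metadata import PackageNotFoundError, version


def describe_environment(packages):
    """Collect interpreter, OS and package versions for side-by-side comparison."""
    found = {}
    for name in packages:
        try:
            found[name] = version(name)
        except PackageNotFoundError:
            found[name] = None  # not installed in this environment
    return {
        "python": platform.python_version(),
        "platform": sys.platform,
        "packages": found,
    }


if __name__ == "__main__":
    # The package list is only an example; use whatever your project pins.
    print(describe_environment(["pyspark", "numpy"]))
```

If two people get different reports for the same repo, the environments have drifted, which is exactly the situation a formal environment definition is meant to prevent.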
🐳 Docker helps us create a formal definition of our environment.
Devcontainers allow you to connect your editor (IDE) to that container.
Reproducible means:
Downsides?
Let's say we have a really simple project that looks like this:
$ tree .
.
├── README.md
├── requirements.txt
├── requirements-dev.txt
├── sales_analysis.py
└── test_sales_analysis.py
.devcontainer folder
Your Devcontainer spec will live inside the .devcontainer folder.
There will be two main files:
devcontainer.json
Dockerfile
Create a new file called devcontainer.json:
{
    "build": {
        "dockerfile": "Dockerfile",
        "context": ".."
    }
}
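Both paths in the `build` section are relative to the devcontainer.json file itself, so `".."` points at the project root and lets the Dockerfile COPY the requirements files. The following Python sketch shows how those paths resolve. The helper `resolve_build_paths` is hypothetical, and it assumes the file is plain JSON without comments (real devcontainer.json files may contain comments, which `json.loads` does not accept).

```python
import json
from pathlib import Path


def resolve_build_paths(devcontainer_json: Path) -> dict:
    """Resolve the Dockerfile and build context relative to devcontainer.json."""
    config = json.loads(devcontainer_json.read_text())
    base = devcontainer_json.parent  # the .devcontainer folder
    build = config["build"]
    return {
        "dockerfile": (base / build["dockerfile"]).resolve(),
        # When no context is given, the folder holding devcontainer.json is used.
        "context": (base / build.get("context", ".")).resolve(),
    }


if __name__ == "__main__":
    spec = Path(".devcontainer") / "devcontainer.json"
    if spec.exists():
        print(resolve_build_paths(spec))
```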
So what does this Dockerfile look like?
FROM python:3.10
# Install Java
RUN apt update && \
    apt install -y sudo && \
    sudo apt install default-jdk -y
## Pip dependencies
# Upgrade pip
RUN pip install --upgrade pip
# Install production dependencies
COPY requirements.txt /tmp/requirements.txt
RUN pip install -r /tmp/requirements.txt && \
    rm /tmp/requirements.txt
# Install development dependencies
COPY requirements-dev.txt /tmp/requirements-dev.txt
RUN pip install -r /tmp/requirements-dev.txt && \
    rm /tmp/requirements-dev.txt
With the .devcontainer folder in place, it's now time to open our Devcontainer.
Open up the command palette (CMD + Shift + P) and select "Dev Containers: Reopen in Container":
Upon opening a repo with a valid .devcontainer folder, you are already notified:
Rebuilding allows you to get a fresh environment anytime you want:
Besides starting the Docker image and attaching the terminal to it, VSCode is doing a couple more things:
~/.gitconfig and ~/.ssh/known_hosts are copied over to their respective locations in the container.
Your entire project setup is now encapsulated in the Devcontainer. So actually we can add a Markdown button to open up the Devcontainer:
[![Open in Remote - Containers](https://img.shields.io/static/v1?label=Remote%20-%20Containers&message=Open&color=blue&logo=visualstudiocode)](https://vscode.dev/redirect?url=vscode://ms-vscode-remote.remote-containers/cloneInVolume?url=https://github.com/godatadriven/python-devcontainer-template)
Which basically means, open this URL:
vscode://
ms-vscode-remote.remote-containers/
cloneInVolume?
url=https://github.com/godatadriven/python-devcontainer-template
Just modify the GitHub URL after url=.
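For a different repository, only the part after the final `url=` changes. As a small sketch, the badge Markdown can be generated in Python. The helper name `devcontainer_badge` is hypothetical; the badge image and redirect prefix are copied from the button shown earlier.

```python
from urllib.parse import urlsplit

# Both constants are taken verbatim from the Markdown button in this deck.
BADGE_IMAGE = (
    "https://img.shields.io/static/v1"
    "?label=Remote%20-%20Containers&message=Open&color=blue&logo=visualstudiocode"
)
REDIRECT_PREFIX = (
    "https://vscode.dev/redirect"
    "?url=vscode://ms-vscode-remote.remote-containers/cloneInVolume?url="
)


def devcontainer_badge(repo_url: str) -> str:
    """Return the Markdown for an 'open in Devcontainer' button for repo_url."""
    parts = urlsplit(repo_url)
    if parts.scheme != "https" or not parts.netloc:
        raise ValueError(f"expected an https repository URL, got {repo_url!r}")
    return (
        f"[![Open in Remote - Containers]({BADGE_IMAGE})]"
        f"({REDIRECT_PREFIX}{repo_url})"
    )


if __name__ == "__main__":
    print(devcontainer_badge("https://github.com/godatadriven/python-devcontainer-template"))
```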
Which kind of README would you rather have?
We have built a working Devcontainer, which is great! But a couple of things are still missing.
Let's see how.
If you pip install a new package, you will see the following message:
So let's go ahead and create a user for this scenario.
# Add non-root user
ARG USERNAME=nonroot
RUN groupadd --gid 1000 $USERNAME && \
    useradd --uid 1000 --gid 1000 -m $USERNAME
## Make sure to reflect new user in PATH
ENV PATH="/home/${USERNAME}/.local/bin:${PATH}"
USER $USERNAME
Add the following property to devcontainer.json:
"remoteUser": "nonroot"
That's great! When we now start the container we should connect as the user nonroot.
"customizations": {
    "vscode": {
        "extensions": [
            "ms-python.python"
        ],
        "settings": {
            "python.testing.pytestArgs": [
                "."
            ],
            "python.testing.unittestEnabled": false,
            "python.testing.pytestEnabled": true,
            "python.formatting.provider": "black",
            "python.linting.mypyEnabled": true,
            "python.linting.enabled": true
        }
    }
}
Since we are using pyspark, it would be nice to be able to access the Spark UI.
"portsAttributes": {
    "4040": {
        "label": "SparkUI",
        "onAutoForward": "notify"
    }
},
"forwardPorts": [
    4040
]
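Port forwarding only has something to forward while a Spark application is running and its UI is listening. A minimal Python sketch for checking that from inside the container follows; the helper `is_port_open` is hypothetical, and 4040 is the port configured above.

```python
import socket


def is_port_open(port: int, host: str = "127.0.0.1", timeout: float = 1.0) -> bool:
    """Return True if something accepts TCP connections on host:port."""
    with socket.socket(socket.AF_INET, socket.SOCK_STREAM) as sock:
        sock.settimeout(timeout)
        # connect_ex returns 0 on success and an error code otherwise.
        return sock.connect_ex((host, port)) == 0


if __name__ == "__main__":
    state = "listening" if is_port_open(4040) else "not listening"
    print(f"Port 4040 is {state}")
```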
When we now run our code, we get a notification that we can open the Spark UI in the browser:
Resulting in the Spark UI like we know it:
There are two basic options:
Let's see about option number (1).
Luckily, a GitHub Action was already set up for us to do exactly this:
Building, pushing and running a command in the Devcontainer is now as easy as:
name: Python app
on:
  ...
jobs:
  build:
    runs-on: ubuntu-latest
    steps:
      - name: Checkout (GitHub)
        uses: actions/checkout@v3
      - name: Login to GitHub Container Registry
        uses: docker/login-action@v2
        with:
          registry: ghcr.io
          username: ${{ github.repository_owner }}
          password: ${{ secrets.GITHUB_TOKEN }}
      - name: Build and run dev container task
        uses: devcontainers/ci@v0.2
        with:
          imageName: ghcr.io/${{ github.repository }}/devcontainer
          runCmd: pytest .
See below a trace of the executed GitHub Action:
Awesome!
We built the following Devcontainer definitions.
First, devcontainer.json:
{
    "build": {
        "dockerfile": "Dockerfile",
        "context": ".."
    },
    "remoteUser": "nonroot",
    "customizations": {
        "vscode": {
            "extensions": [
                "ms-python.python"
            ],
            "settings": {
                "python.testing.pytestArgs": [
                    "."
                ],
                "python.testing.unittestEnabled": false,
                "python.testing.pytestEnabled": true,
                "python.formatting.provider": "black",
                "python.linting.mypyEnabled": true,
                "python.linting.enabled": true
            }
        }
    },
    "portsAttributes": {
        "4040": {
            "label": "SparkUI",
            "onAutoForward": "notify"
        }
    },
    "forwardPorts": [
        4040
    ]
}
And our Dockerfile:
FROM python:3.10
# Install Java
RUN apt update && \
    apt install -y sudo && \
    sudo apt install default-jdk -y
# Add non-root user
ARG USERNAME=nonroot
RUN groupadd --gid 1000 $USERNAME && \
    useradd --uid 1000 --gid 1000 -m $USERNAME
## Make sure to reflect new user in PATH
ENV PATH="/home/${USERNAME}/.local/bin:${PATH}"
USER $USERNAME
## Pip dependencies
# Upgrade pip
RUN pip install --upgrade pip
# Install production dependencies
COPY --chown=nonroot:1000 requirements.txt /tmp/requirements.txt
RUN pip install -r /tmp/requirements.txt && \
    rm /tmp/requirements.txt
# Install development dependencies
COPY --chown=nonroot:1000 requirements-dev.txt /tmp/requirements-dev.txt
RUN pip install -r /tmp/requirements-dev.txt && \
    rm /tmp/requirements-dev.txt
💡 Pro tip: mount your AWS/GCP/Azure credentials
Remote Development
e.g. GitHub Codespaces, VMs on Azure/GCP/AWS
... and much more (see references slide)
Devcontainers connect your IDE to a running 🐳 Docker container.
→ reproducibility & isolation whilst getting a native experience.
Reproducible means:
Currently VSCode only, but an open specification is taking shape.
Associated blog post:
Spec:
Docs: Dockerfile and devcontainer.json.
Repos: