Mastering Virtual Environments for Data Science Projects using Pyenv, Pipx, Pipenv: A Critical Skill That Sets Beginners Apart from Veterans
forward: https://blog.devops.dev/best-practices-for-virtual-environments-for-data-science-pyenv-pipx-pipenv-4140b2974c7c
You’ve probably heard the line “But it works on my computer!” and thought of it as more of a joke.
Imagine this scenario: a team of Python developers (a data scientist, an AI engineer, and a web developer) is working on a new web application. They’re using different versions of Python and various Python packages to build out the functionality. As the project progresses, they run into a few problems that are intertwined and difficult to untangle:
- Different versions of Python: To make some packages work, the data scientist is using Python 3.7, while others are using Python 3.9. Using different versions of Python and packages can lead to compatibility issues, which may cause errors, bugs, or even crashes in the code. These issues can be time-consuming to fix and can delay the development process.
- Different versions of packages: Some developers are using different versions of the same Python package. Dependencies between packages can create conflicts, where different packages require different versions of the same dependency, which can cause issues with the code’s functionality.
- Dependencies of the packages: Some packages have dependencies on other packages, and the versions of those packages aren’t always compatible with each other. This has made it difficult to get all the packages working together seamlessly.
- Some packages are only used during development: Some packages are only used during the development phase of the project and aren’t needed in production. This has led to confusion about which packages are necessary for the application to run properly. It also leads to unnecessarily complex package dependencies.
Overview of the 3 “musketeers”
In this post, we will discuss best practices for virtual environments in development work, focusing on three tools: Pyenv, Pipx, and Pipenv. Together they form a complete toolkit for managing virtual environments effectively.
1. Pyenv
Pyenv is a simple tool that allows you to install and switch between multiple versions of Python on your machine. It is especially useful when working on projects that require specific versions of Python. We will use it to install, manage, and switch between multiple versions of Python, so that different projects can use different versions.
2. Pipx
Pipx is a tool that allows you to install and manage Python applications in isolated environments. Pipx has many other use cases, which are beyond the scope of this article. Here, we will use it to install packages whose command line interface (CLI) we want available across different projects. For example, we might want Jupyter Lab to be available in all projects. With pipx, we can install jupyterlab once and use it from every project. Other examples of packages that we might want to install via Pipx include black and streamlit.
3. Pipenv
Pipenv is the packaging tool brought to us by Python wunderkind Kenneth Reitz (the man who brought us requests and the new micro webserver framework python-responder) whose goal is to make tools “for humans”. Pipenv is a tool that combines Pip and virtualenv into a single tool for managing dependencies and virtual environments. It is especially useful for managing dependencies in Python projects. We will be using Pipenv for 1) creating a standalone virtual environment for each of the projects, and 2) installing and managing the packages for the individual project.
Setting up Pyenv, Pipx, Pipenv in the Right Order
1. Install pyenv
- Install pyenv-win in PowerShell
Invoke-WebRequest -UseBasicParsing -Uri "https://raw.githubusercontent.com/pyenv-win/pyenv-win/master/pyenv-win/install-pyenv-win.ps1" -OutFile "./install-pyenv-win.ps1"; &"./install-pyenv-win.ps1"
- Reopen PowerShell
- Run pyenv --version to check whether the installation was successful.
2. Install pipx
- Install pipx using pip. You can either use a system level pip installation, or install pip using the system python interpreter or one from pyenv.
# If you don't already have pip installed but do have a system-level
# Python, this will get you the latest version of pip
python3 -m ensurepip --upgrade
# pipx requires Python 3.6 or higher
python3 -m pip install --user pipx
python3 -m pipx ensurepath
3. Install pipenv
Once pipx is installed, we can use it to install pipenv.
pipx install pipenv
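After the three installs, it can help to confirm that each tool is actually reachable from the command line. The following is a minimal sketch using only the standard library; the helper name locate_tools is made up for illustration, and the check only tells you whether a command is on your PATH, not whether it works correctly.

```python
import shutil


def locate_tools(names):
    """Map each command name to its full path on PATH, or None if it is not found."""
    return {name: shutil.which(name) for name in names}


if __name__ == "__main__":
    # The three tools installed in this section.
    for tool, path in locate_tools(["pyenv", "pipx", "pipenv"]).items():
        print(f"{tool}: {path or 'not found on PATH'}")
```

If one of the tools is reported as not found, reopening the terminal (so that the PATH changes made by pipx ensurepath take effect) is usually the first thing to try.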
A Simplified Real-world Use Case
Imagine we have two projects, each needing the following packages:
- Project #1 (Python 3.9):
  - jupyterlab*
  - streamlit
  - pandas
- Project #2 (Python 3.10):
  - jupyterlab*
  - requests
  - flask
  - SQLAlchemy
(* jupyterlab is needed for both projects)
Step 1: Install a Global Python
- Open Windows PowerShell
- Run pyenv install -l to list the Python versions supported by pyenv-win
- Run pyenv install <version> to install a supported version. Since we need both Python 3.9 and Python 3.10, we will install both:
pyenv install 3.9.13
pyenv install 3.10.5
- Run pyenv global <version> to set a Python version as the global version. We set version 3.10.5 as our global Python version:
pyenv global 3.10.5
- Check which Python version you are using and its path
> pyenv version
<version> (set by \path\to\.pyenv\pyenv-win\.python-version)
To see all the Python versions we have installed, use pyenv versions.
See the original pyenv-win repository for more comprehensive instructions.
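The same check can be made from inside Python, which is handy when you are unsure which interpreter a script or notebook is really running on. This is a small sketch using the standard library only; describe_interpreter is a hypothetical helper name.

```python
import sys


def describe_interpreter():
    """Return the version and location of the interpreter running this code."""
    major, minor, micro = sys.version_info[:3]
    return {
        "version": f"{major}.{minor}.{micro}",
        "executable": sys.executable,
    }


if __name__ == "__main__":
    info = describe_interpreter()
    print(f"Python {info['version']} at {info['executable']}")
```

If pyenv is managing the interpreter, the executable path would normally point somewhere inside the pyenv installation folder.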
Step 2: Install Common Packages to Be Used across Projects
For our scenario, jupyterlab is the package that we need in both Project #1 and Project #2. We will install such common packages with pipx, which focuses on installing and managing Python packages that can be run directly from the command line as applications.
pipx install jupyterlab
Here, jupyterlab is a good example of a package that we want to install with Pipx because:
- We want to install Jupyter Lab only once (e.g. not for each virtual environment).
- We want the environment to be easy to manage (e.g. upgrading Jupyter Lab, adding and removing dependencies).
- We want to easily access the other command line tools installed with Jupyter Lab, such as jupyter and ipython. We also want to install these tools only once.
The pipx install command is the preferred way to globally install apps from Python packages on your system. It creates an isolated virtual environment for the package, then ensures the package’s apps are accessible on your $PATH. The result: apps you can run from anywhere, located in packages you can cleanly upgrade or uninstall.
Lastly, we can view the CLIs that were installed via Pipx by running pipx list.
Step 3: Set a Local Version and Install Packages Specific to a Project
1️⃣ Project #1
For this project, we start from scratch, as it is a brand-new project.
First of all, let’s set the local version 3.9.13 to be used with Project #1.
pyenv local 3.9.13
# Create the virtualenv for this project
# with a specific Python version
# and let it access the system-wide packages (--site-packages)
pipenv install --python 3.9.13 --site-packages
After the environment has been created, we can activate it with the command below. Take note: it’s pipenv shell, NOT pyenv shell.
pipenv shell
With this, we can now install the packages that are specific to this project. Let’s start by installing pandas, specifically version 1.5.3. This shows how to install a specific version of a package, which is very similar to how we usually do it with pip.
pipenv install pandas==1.5.3
pipenv install streamlit
That’s it. If we look at the project folder, we should have 3 additional files:
- .python-version
- Pipfile
- Pipfile.lock
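Of these three files, .python-version is the one written by pyenv local; it holds the version string that pyenv uses for this folder. As a rough sketch, the snippet below compares that pinned version with the interpreter that is actually running. The function names are hypothetical, and the sketch assumes the file contains a plain version string such as 3.9.13.

```python
import sys
from pathlib import Path


def parse_version_file(text):
    """Return the first version listed in the text of a .python-version file, or None."""
    entries = text.split()
    return entries[0] if entries else None


def interpreter_matches(pinned, version_info=sys.version_info):
    """True if the interpreter version starts with the pinned components.

    A pin of "3.9" matches any 3.9.x interpreter; "3.9.13" must match exactly.
    """
    parts = pinned.split(".")
    running = [str(n) for n in version_info[:3]]
    return running[:len(parts)] == parts


if __name__ == "__main__":
    version_file = Path(".python-version")
    if not version_file.is_file():
        print("No .python-version file in this folder")
    else:
        pinned = parse_version_file(version_file.read_text(encoding="utf-8"))
        if pinned is None:
            print(".python-version is empty")
        elif interpreter_matches(pinned):
            print(f"Running interpreter matches the pinned version {pinned}")
        else:
            print(f"Pinned {pinned}, but running {sys.version.split()[0]}")
```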
The file that we want to focus on here is Pipfile.lock. Its purpose is to specify the exact versions of all dependencies used in a project, including transitive dependencies, and to ensure that these versions are consistent across all installations of the project.
Pipfile.lock
You can skip this part if you’re not keen to dive deeper into how Pipfile.lock works:
Recording dependencies: When you run pipenv install or pipenv lock, Pipenv reads the project’s Pipfile to determine which dependencies are needed. Pipenv then resolves these dependencies and their dependencies recursively, recording the specific version number of each package in the lockfile.
Generating a hash: Pipenv stores a SHA-256 hash of the Pipfile in the _meta section of Pipfile.lock, which lets it detect when the lockfile is out of date with respect to the Pipfile. The lockfile also records hashes of each package’s distribution files so that downloads can be verified at install time.
Ensuring consistency: Pipfile.lock ensures that the same versions of all dependencies are used across all installations of the project, regardless of the environment or the order in which the packages are installed. This consistency is important for reproducibility and makes it easy for other developers to install the same dependencies on their own machines.
Resolving conflicts: While locking, Pipenv also resolves conflicts between dependencies by selecting versions based on the constraints specified in the Pipfile and by the packages themselves. If two dependencies require different version ranges of the same package, Pipenv selects a version that satisfies both; if no such version exists, locking fails and the conflict has to be fixed by hand.
Pinning transitive dependencies: Pipfile.lock also pins the versions of transitive dependencies, which are the dependencies required by other dependencies. This ensures that the same versions of these dependencies are used across all installations of the project.
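Because Pipfile.lock is a JSON file, the pinned versions can be inspected with a few lines of Python. The sketch below is illustrative only: SAMPLE_LOCK is a hand-written, heavily shortened stand-in for a real lockfile (the hash value and the numpy version are placeholders, not real output), and pinned_versions is a hypothetical helper name.

```python
import json

# Hand-written sample with the general shape of a Pipfile.lock.
# Real lockfiles contain many more entries and long hash lists.
SAMPLE_LOCK = """
{
    "_meta": {
        "hash": {"sha256": "<hash of the Pipfile>"},
        "pipfile-spec": 6,
        "requires": {"python_version": "3.9"},
        "sources": []
    },
    "default": {
        "pandas": {"hashes": [], "version": "==1.5.3"},
        "numpy": {"hashes": [], "version": "==1.24.2"}
    },
    "develop": {}
}
"""


def pinned_versions(lock_text, section="default"):
    """Return {package: pinned version} for one section of a lockfile.

    "default" holds regular packages and "develop" holds dev-only packages.
    Entries without a "version" key map to None.
    """
    lock = json.loads(lock_text)
    return {
        name: entry.get("version")
        for name, entry in lock.get(section, {}).items()
    }


if __name__ == "__main__":
    for name, version in sorted(pinned_versions(SAMPLE_LOCK).items()):
        print(name, version)
```

In the sample, numpy stands in for a transitive dependency: it was never requested directly, yet it is pinned alongside pandas. To inspect a real project, pass the text of its Pipfile.lock instead of SAMPLE_LOCK.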
2️⃣ Project #2
For this project, we assume that it is an existing project that previously used pip for package installation, so there is a requirements.txt in the folder alongside some existing working files.
#requirements.txt
requests
flask
SQLAlchemy
With pipenv, we can install all these packages by running the following command in the project root folder (where requirements.txt is located):
pipenv install -r requirements.txt
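Before importing an old requirements.txt, it can be useful to see which entries carry no exact version pin, since those are the ones whose versions will only be fixed once Pipenv writes Pipfile.lock. The following is a simplified sketch, not how Pipenv itself parses the file: it assumes one requirement per line, and it does not handle option lines such as -r or -e. Both function names are hypothetical.

```python
SAMPLE_REQUIREMENTS = """#requirements.txt
requests
flask
SQLAlchemy
"""


def requirement_entries(text):
    """Return the non-empty, non-comment lines of a requirements file."""
    entries = []
    for raw in text.splitlines():
        # Drop inline comments (whitespace followed by '#'), then trim.
        line = raw.split(" #", 1)[0].strip()
        if line and not line.startswith("#"):
            entries.append(line)
    return entries


def without_exact_pin(entries):
    """Return the entries that do not pin an exact version with '=='."""
    return [entry for entry in entries if "==" not in entry]


if __name__ == "__main__":
    entries = requirement_entries(SAMPLE_REQUIREMENTS)
    print("Requirements:", entries)
    print("Not pinned to an exact version:", without_exact_pin(entries))
```

For the file shown above, none of the three packages is pinned, so the exact versions used by Project #2 are decided at the moment the lockfile is generated.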
Let’s Test the Environments Created: Jupyter Lab
⚠️ If you go to the Project #1 folder and start jupyter-lab, you’ll realize that the kernel has no access to pandas. That’s weird, right?
There is one very important step left: linking the virtual environment to a kernel that Jupyter Lab can use.
- First, make sure the virtual environment has been activated by running pipenv shell within the folder.
- Install ipykernel by running pipenv install ipykernel
- Create a new kernel based on this virtual env:
# This python is the local python
python -m ipykernel install --user --name=data_project
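Once the kernel has been created and selected in Jupyter Lab, a quick check in a notebook cell can confirm that it really runs inside the project's virtual environment. This is a minimal sketch with a hypothetical helper name, kernel_report; it only accepts top-level module names such as pandas.

```python
import importlib.util
import sys


def kernel_report(module_names):
    """Summarize which interpreter is running and which modules it can import."""
    return {
        # Path of the Python executable behind the current kernel.
        "executable": sys.executable,
        # Inside a virtual environment, sys.prefix differs from sys.base_prefix.
        "in_virtualenv": sys.prefix != sys.base_prefix,
        # find_spec returns None when a top-level module cannot be found.
        "importable": {
            name: importlib.util.find_spec(name) is not None
            for name in module_names
        },
    }


if __name__ == "__main__":
    report = kernel_report(["pandas", "streamlit"])
    print("Interpreter:", report["executable"])
    print("Inside a virtualenv:", report["in_virtualenv"])
    for name, found in report["importable"].items():
        print(f"{name}: {'available' if found else 'missing'}")
```

If the executable points to the pipx environment of jupyterlab rather than the project's virtualenv, the notebook is still using the default kernel, and the data_project kernel needs to be selected.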
For packages that we want to share with the kernels across all projects (provided that the virtualenv was created with the --site-packages flag), we need to install them using the global Python’s pip.
# Example of installing system-wide packages (--site-packages)
# Navigate out of the project folder
cd ..
# Set the global Python version used to install the packages
pyenv global 3.10.5
# Make sure this is the global python
# Ideally should match the version that the project is using
pyenv which python
pyenv which pip
# Install the package(s)
pip install dtale
📑 Reference on how to link the virtualenv to Jupyter Kernel: https://medium.com/towards-data-science/create-virtual-environment-using-virtualenv-and-add-it-to-jupyter-notebook-6e1bf4e03415
Deleting a virtualenv
Finally, removing a virtualenv is as simple as executing pipenv --rm. It is worth deleting the project's Jupyter kernel first:
# Delete the ipykernel for the project
## List the kernels available
jupyter kernelspec list
## Delete kernel based on name
jupyter kernelspec uninstall <my_env_name>
# Make sure to do this within the project folder
pipenv --rm
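Before uninstalling a kernel, you may want to confirm that it really belongs to the environment you are about to remove. Each folder listed by jupyter kernelspec list contains a kernel.json file whose argv entry starts with the interpreter the kernel launches. The sketch below reads that entry; SAMPLE_KERNEL_JSON is a hand-written example with a placeholder path, and kernel_interpreter is a hypothetical helper name.

```python
import json

# Hand-written example of a kernel.json; the interpreter path is a placeholder.
SAMPLE_KERNEL_JSON = """
{
    "argv": [
        "/path/to/virtualenv/bin/python",
        "-m",
        "ipykernel_launcher",
        "-f",
        "{connection_file}"
    ],
    "display_name": "data_project",
    "language": "python"
}
"""


def kernel_interpreter(kernel_json_text):
    """Return the interpreter path a kernel spec launches, or None if absent."""
    spec = json.loads(kernel_json_text)
    argv = spec.get("argv", [])
    return argv[0] if argv else None


if __name__ == "__main__":
    print(kernel_interpreter(SAMPLE_KERNEL_JSON))
```

If the returned path lies inside the virtualenv that pipenv --rm is about to delete, the kernel would stop working afterwards, so it is safe to uninstall it as well.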