Using Conda Environments for Python at the SSCC
Python is a fast-moving language, with frequent updates to both the language itself and the major packages that provide so much of its functionality. “Backward compatibility,” i.e. making sure updates don’t break older code, is a lower priority than with most languages the SSCC supports. This has created a culture where package maintainers, especially the maintainers of complex packages like torch, do not feel obligated to update their package immediately every time an update comes out for Python or one of the Python packages they rely on. This can create headaches for Python users. Using conda environments for your Python projects will avoid these headaches and ensure your work is reproducible for years to come.
The Problem of Package Management
The way updates work in the Python world can create problems like the following:
- If your code uses a package that does not work with the latest version of Python, updating Python will break your code.
- If your code relies on a package working in a particular way and an update changes how that package works, updating that package will break your code.
- If the latest version of package X requires the latest version of package Y, but package Z requires an older version of package Y, updating package X will break any code that uses package Z.
The SSCC updates Python and all the packages included in the Anaconda distribution of Python twice a year, so these are not hypothetical concerns.
The Anaconda distribution of Python is a partial solution to the first and third problems above: it bundles a recent version of Python with a large collection of popular packages and ensures they all work together. However, it does not include the major deep learning packages like torch or tensorflow. It also does not address the second problem.
A conda environment contains the Python version of your choice and the packages you choose to install. Conda automatically identifies and installs package versions that are compatible with each other (in the third scenario above, it will install older versions of packages X and Y so package Z will work). Most importantly, you control if and when an environment you create is updated. In the Python world it’s common to not update unless there are specific new features you want to use.
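To make the third scenario concrete, here is a toy sketch of what “finding compatible versions” means. It is not conda’s actual solver: the catalog, the package names X, Y, and Z, and the integer version numbers are all made up for illustration, and the search is simple brute force.

```python
from itertools import product

# Hypothetical catalog: package -> {version: {dependency: allowed versions}}.
# Versions are plain integers to keep the toy simple.
CATALOG = {
    "X": {1: {"Y": {1}}, 2: {"Y": {2}}},  # the newest X needs the newest Y
    "Y": {1: {}, 2: {}},
    "Z": {1: {"Y": {1}}},                 # Z only works with the older Y
}


def compatible(choice):
    """Return True if every chosen version's requirements are met."""
    for pkg, version in choice.items():
        for dep, allowed in CATALOG[pkg][version].items():
            if dep in choice and choice[dep] not in allowed:
                return False
    return True


def resolve(packages):
    """Pick the newest mutually compatible versions by brute force."""
    candidates = [sorted(CATALOG[p], reverse=True) for p in packages]
    best = None
    for versions in product(*candidates):
        choice = dict(zip(packages, versions))
        # "Newest" is scored crudely as the sum of the version numbers.
        if compatible(choice) and (best is None or sum(versions) > sum(best.values())):
            best = choice
    return best


print(resolve(["X", "Y"]))       # {'X': 2, 'Y': 2}
print(resolve(["X", "Y", "Z"]))  # {'X': 1, 'Y': 1, 'Z': 1}
```

Asking for X and Y alone gives the newest versions of both. Adding Z forces the older versions of X and Y, which is the trade-off conda makes for you.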
(There are excellent alternatives to conda, notably pyenv and venv. We think conda is the best fit for most SSCC researchers, but the others would work well too.)
Conda environments are big and SSCC home directories are small. We thus recommend putting your conda environments in a project directory, and will show you how. This means you’ll refer to them by a prefix (basically a path) rather than a name. Most discussion of conda you’ll find on the web assumes your environments are in your home directory and refers to them by name, so you’ll probably need to change any example commands to use a prefix before they’ll work for you.
When to Use a Conda Environment
Posit suggests “Create a virtual environment for every project” as The Iron Law of Python Management but, with all due respect to Posit (and they’re due a lot), we’re going to soften that recommendation for SSCC researchers. (If you choose to follow their recommendation strictly, more power to you.)
When we talk about a “project” in this context we do not mean an SSCC project directory, but a research project in the general sense. An SSCC project directory may contain multiple projects in that sense.
Don’t worry about environments if you’re just learning Python. Give yourself some time to figure out what packages are and which packages you’ll use first. Start using environments when:
- You start using lots of packages that are not included in Anaconda, or any packages that are picky about versions like pytorch. For some of you this will come very quickly.
- You start writing code that produces results you care about and you need your work to be reproducible long-term.
If you are working on multiple projects that use similar analytical techniques and thus need the same set of packages, you can use one environment for all of them. However, when you finish a project, create a clone of the shared environment just for that project so it can be “frozen” (never updated). That ensures the code for that project will continue to work in the long term and makes the project reproducible. You should also create a clone of the shared environment for a specific project if the project turns out to need additional packages.
Preparing to Use Conda
Conda works best on SSCC servers with the bash shell. Also, if you do a web search for how to do something in conda, or Linux in general, the answers you find will usually assume you’re using bash. The default shell at the SSCC is tcsh because it’s easier for some things, so email the Help Desk and ask them to change your Linux shell to bash if you want to use conda.
Nothing in this article will work until your shell has been changed to bash.
Next, run the Linux command:
conda init
This will set up your shell to use Conda. Once you run it, you need to log out and log back in so the changes can take effect. You’ll know it worked when you see (base) at the beginning of your Linux prompt (e.g. (base) linstat1.ssc.wisc.edu>). This tells you you’re initially using the base, or default, Conda environment.
Creating a Conda Environment
To create a conda environment, first cd to the directory where you want to store the environment. That location should be:
- In an SSCC project directory, not your home directory (unless you don’t have access to a project directory and your environment will be small).
- Accessible to everyone who will use the environment.
- As close to the projects that will use it as possible.
If the environment will be used for a single project, put it inside that project’s directory. If it will be used for multiple projects, ideally put it in the directory that contains all those projects.
Suppose my SSCC project directory is called bbadger and it contains directories for the research projects dissertation, paper1, and paper2. If they all use similar packages, I might put my shared conda environment in /project/bbadger. But I’d put an environment just for my dissertation in /project/bbadger/dissertation/.
To actually create the environment, run:
conda create --prefix project_env python=X.X package1 package2...
where project_env should be replaced by the name you want to give your environment, X.X should be replaced by the version of Python you want to use, and package1 package2... should be replaced by a list of the packages you want the environment to include.
Most conda examples you’ll find on the web use -n instead of --prefix. That will put the resulting environment in your home directory, which is not what you want at the SSCC.
In naming your environment, think long-term: in 5-10 years you’ll need to figure out the name of the environment used to run the project, so we suggest something like project_env where project should be replaced by the name of the project. A shared environment might be shared_env.
With the --prefix option, the name of your environment is actually the path to it. That could be an absolute path, like /project/bbadger/dissertation/dissertation_env, or it could be a relative path like ./dissertation_env if the working directory is /project/bbadger/dissertation. However, conda does need to know that it’s a path and not just a name, or it will look for an environment with that name in your home directory. Thus when you refer to the environment name in subsequent commands, it always needs to start with ., /, .. or some other character that tells conda it’s a path. (. means the working directory, so ./dissertation_env means “look in the working directory for dissertation_env.”)
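If you build conda commands from a script, one way to sidestep the path-versus-name question is to always hand conda an absolute path, which by definition starts with /. The helper below is a minimal sketch; the function name env_prefix is our own invention, not part of conda.

```python
from pathlib import Path


def env_prefix(env_dir, base_dir="."):
    """Return an absolute path for an environment directory.

    env_dir may be relative to base_dir (for example "dissertation_env").
    If env_dir is already absolute, base_dir is ignored.
    """
    return str((Path(base_dir) / env_dir).resolve())


# Relative to an explicit project directory:
print(env_prefix("dissertation_env", "/project/bbadger/dissertation"))

# Relative to the current working directory:
print(env_prefix("dissertation_env"))
```

The result can be passed to any command that takes an environment path, such as conda activate or the --prefix option.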
A two-part Python version like 3.10 is enough to solve most version issues. Conda will then automatically install the latest version of Python 3.10. But if you need to specify something like python=3.10.13 you can.
Ideally you’ll install all the packages you need when you create your environment. That way conda can ensure they’ll all work together. You can add more packages later if you need to. You can also specify that you need a particular version of a package. For example, putting pytorch-cuda=11.8 in your package list says to include version 11.8 of the package pytorch-cuda.
Some packages are only available from particular conda channels. For example, some packages required to use the SSCC’s NVidia GPUs must be obtained from the nvidia channel. You can tell conda to look for packages in additional channels by adding -c and then the channel name.
Putting this all together, the following creates an environment called torch_env that can use pytorch and NVidia GPUs:
conda create --prefix torch_env python=3.10 pytorch torchvision torchaudio pytorch-cuda=11.8 -c pytorch -c nvidia
It specifies that the packages pytorch, torchvision, torchaudio, and pytorch-cuda should be installed, with the additional requirement that pytorch-cuda must be version 11.8. To find these packages, conda should look in the pytorch and nvidia channels as well as the defaults. It also specifies that the environment should use Python 3.10.
(A current version of this command can be found by going to pytorch.org, scrolling down a bit, and choosing appropriate options. It will give you a conda install command that you can easily adapt to conda create.)
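If you would like to keep a record of how an environment was defined, you could assemble the conda create command in a short Python script and save the script with your project. This is only a sketch: the function conda_create_command is a hypothetical helper, and it uses only the options described above (--prefix, python=, a package list, and -c).

```python
import shlex


def conda_create_command(prefix, python_version, packages, channels=()):
    """Build the argument list for `conda create` with a prefix."""
    command = ["conda", "create", "--prefix", prefix, f"python={python_version}"]
    command.extend(packages)
    for channel in channels:
        command.extend(["-c", channel])
    return command


command = conda_create_command(
    prefix="./torch_env",
    python_version="3.10",
    packages=["pytorch", "torchvision", "torchaudio", "pytorch-cuda=11.8"],
    channels=["pytorch", "nvidia"],
)

# Print the command so it can be reviewed, then pasted into the shell.
print(shlex.join(command))
```

The script only prints the command. Running it yourself in the shell lets you read conda’s list of planned changes before you confirm.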
Activating an Environment
To activate an environment, you can use conda activate followed by the path to the environment. This could be an absolute or a relative path. For example:
conda activate /project/bbadger/shared_env
Or, if the working directory is /project/bbadger/paper1:
conda activate ../shared_env
To deactivate an environment, run:
conda deactivate
JupyterLab
If you want to use JupyterLab with your environment, install the jupyterlab package (ideally when you create the environment). Then activate the environment before running sscc-jupyter.
Specifying an Environment for Slurm
You can run a Python script using a particular conda environment without activating that environment by using conda run. For example, to run dissertation_analysis.py using the environment shared_env (which is one directory up), you’d type:
conda run --prefix ../shared_env python dissertation_analysis.py
When you submit the script to Slurm, that becomes the command you run with ssubmit:
ssubmit --cores=128 --mem=250g "conda run --prefix ../shared_env python dissertation_analysis.py"
Slurm will then run the script using the conda environment specified.
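Because a Slurm job runs unattended, it can be reassuring to have the script itself record which environment it ran in. The sketch below could go at the top of a script like dissertation_analysis.py. The function name environment_report is our own. For a conda environment, sys.prefix is the environment’s directory.

```python
import platform
import sys


def environment_report():
    """Describe the interpreter and environment running this script."""
    return {
        "python_version": platform.python_version(),
        "executable": sys.executable,
        "prefix": sys.prefix,  # for a conda environment, its directory
    }


if __name__ == "__main__":
    # Write the report to the job's output so it is saved with the results.
    for key, value in environment_report().items():
        print(f"{key}: {value}")
```

If the prefix in the job’s output is not the environment you expected, check the path you gave to conda run.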
Freezing a Project’s Environment
When you’ve finished a project, you want to make sure the environment it runs in does not change. If you’ve created an environment just for that project, that’s easy: don’t change it. But if it uses an environment that’s shared with other projects you’ll want to create a separate environment for the completed project. If you haven’t been using an environment, that’s the time to start.
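For the shared-environment case, conda can copy an existing environment with the --clone option of conda create. This article does not otherwise cover that option, so treat the sketch below as an assumption to check against conda’s own help (conda create --help). The function name and the example paths are hypothetical.

```python
import shlex


def conda_clone_command(source_prefix, new_prefix):
    """Build a `conda create --clone` command that copies an environment."""
    return ["conda", "create", "--prefix", new_prefix, "--clone", source_prefix]


command = conda_clone_command(
    source_prefix="/project/bbadger/shared_env",
    new_prefix="/project/bbadger/paper1/paper1_env",
)

# Print the command for review rather than running it directly.
print(shlex.join(command))
```

Once the clone exists, run the finished project with the clone and leave it alone, while the shared environment continues to change.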
If You Didn’t Use an Environment
If you didn’t use a conda environment for the project at all, it’s using the base environment containing all the Anaconda packages plus anything else you installed. You do not want to clone the base environment, as it contains far more packages than you need. Instead, go through your code, search for all the import statements, and make a list of the packages you import. Also run:
python --version
and note which version of Python you’re using. Then go to the location of the project and create a new environment that uses the version of Python you want and contains all the packages you need.
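Searching for import statements by hand gets tedious for a large project, so here is a minimal sketch that does it with Python’s ast module. The function name imported_modules is our own. Two caveats: it lists import names, which are not always the same as package names (for example, the import name sklearn belongs to the package scikit-learn), and it will also list any modules you wrote yourself.

```python
import ast
import sys
from pathlib import Path


def imported_modules(project_dir):
    """Collect top-level module names imported by .py files in project_dir."""
    found = set()
    for path in Path(project_dir).rglob("*.py"):
        try:
            tree = ast.parse(path.read_text(encoding="utf-8"), filename=str(path))
        except (SyntaxError, UnicodeDecodeError):
            continue  # skip files that cannot be read or parsed
        for node in ast.walk(tree):
            if isinstance(node, ast.Import):
                found.update(alias.name.split(".")[0] for alias in node.names)
            elif isinstance(node, ast.ImportFrom) and node.level == 0 and node.module:
                found.add(node.module.split(".")[0])
    # Drop standard-library modules (sys.stdlib_module_names needs Python 3.10+;
    # on older versions nothing is dropped).
    stdlib = getattr(sys, "stdlib_module_names", frozenset())
    return sorted(found - set(stdlib))
```

Calling imported_modules(".") from the project’s directory returns a sorted list to use as a starting point for your package list. It only reads .py files, so imports in Jupyter notebooks still need to be checked by hand.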
For example, if your dissertation uses Python 3.11.5 and your code imports pandas, numpy, and plotnine, go to /project/my_projects/dissertation and run:
conda create --prefix dissertation_env python=3.11.5 pandas numpy plotnine
Then activate the ./dissertation_env environment and check that all the code related to your dissertation still runs.
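Before rerunning everything, a quick first check is to confirm that each package can be found in the new environment. The sketch below uses only the standard library. The function name check_packages is hypothetical, and passing this check does not replace actually running your code.

```python
from importlib import metadata, util


def check_packages(import_names):
    """Map each import name to its installed version, or None if missing."""
    report = {}
    for name in import_names:
        if util.find_spec(name) is None:
            report[name] = None  # not importable in this environment
            continue
        try:
            report[name] = metadata.version(name)
        except metadata.PackageNotFoundError:
            # Importable, but the package name differs from the import name
            # (or it is part of the standard library).
            report[name] = "installed (version unknown)"
    return report


if __name__ == "__main__":
    for name, version in check_packages(["pandas", "numpy", "plotnine"]).items():
        print(f"{name}: {version if version is not None else 'MISSING'}")
```

Run it with the new environment active. Any package reported as MISSING needs to be installed before your code will run.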
Installing Packages in an Environment
To install additional packages in an environment, first activate it and then use conda install with a list of packages and related options just like those used with conda create. For example, to give an existing environment the ability to run pytorch on NVidia GPUs, you could run:
conda install pytorch torchvision torchaudio pytorch-cuda=11.8 -c pytorch -c nvidia
When you install multiple packages conda will make sure they’re compatible with each other, but it can’t ensure they’re compatible with packages that are already installed. If you need to install a bunch of packages, put them all in a single conda install command. If you run into problems with package compatibility, creating a new environment with both the packages you have and the new packages you need will give conda a chance to find compatible versions of all of them.
You can also install packages with pip and they’ll be added to the active environment, but that gives you less compatibility checking.
Managing Environments
To get a list of conda environments you’ve created, run conda env list.
To delete an environment, use conda remove with the --all switch (as in “remove all packages”). For example, to remove dissertation_env run:
conda remove --prefix ./dissertation_env --all
Conda keeps track of environments in several places, so don’t just delete the directory containing the environment.