Using Conda Environments for Python at the SSCC

Python is a fast-moving language, with frequent updates to both the language itself and the major packages that provide so much of its functionality. “Backward compatibility,” i.e. making sure updates don’t break older code, is a lower priority than in most languages the SSCC supports. This has created a culture where package maintainers, especially the maintainers of complex packages like torch, do not feel obligated to update their package immediately every time an update comes out for Python or one of the Python packages they rely on. This can create headaches for Python users. Using conda environments for your Python projects helps you avoid these headaches and keeps your work reproducible for years to come.

The Problem of Package Management

The way updates work in the Python world can create problems like the following:

  1. If your code uses a package that does not work with the latest version of Python, updating Python will break your code.
  2. If your code relies on a package working in a particular way and an update changes how that package works, updating that package will break your code.
  3. If the latest version of package X requires the latest version of package Y, but package Z requires an older version of package Y, updating package X will break any code that uses package Z.

The SSCC updates Python and all the packages included in the Anaconda distribution of Python twice a year, so these are not hypothetical concerns.

The Anaconda distribution of Python is a partial solution to problems 1 and 3: it bundles a recent version of Python with a large collection of popular packages and ensures they all work together. However, it does not include the major deep learning packages like torch or tensorflow. It also does not address problem 2.

A conda environment contains the Python version of your choice and the packages you choose to install. Conda automatically identifies and installs package versions that are compatible (in scenario 3 it will install older versions of packages X and Y so package Z will work). Most importantly, you control if and when an environment you create is updated. In the Python world it’s common to not update unless there are specific new features you want to use.
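To make scenario 3 concrete, here is a toy sketch in Python. It is not how conda works internally (conda's real solver is far more sophisticated), and the version numbers and requirements are invented for illustration. It simply tries every combination of versions and keeps the newest combination in which every package's requirement on Y is met.

```python
from itertools import product

# Hypothetical catalog, invented for this example: the versions on offer
# for each package, and the range of Y (lowest, highest) each one accepts.
# None means "no limit".
AVAILABLE = {"X": [1, 2], "Y": [1, 2], "Z": [1]}
NEEDS_Y = {
    ("X", 1): (1, None),  # the older X works with any Y
    ("X", 2): (2, None),  # the latest X requires the latest Y
    ("Z", 1): (None, 1),  # Z only works with the older Y
}


def compatible(choice):
    """Return True if every chosen package accepts the chosen version of Y."""
    y = choice["Y"]
    for name, version in choice.items():
        low, high = NEEDS_Y.get((name, version), (None, None))
        if low is not None and y < low:
            return False
        if high is not None and y > high:
            return False
    return True


def solve(available=AVAILABLE):
    """Brute force: try every combination and keep the newest workable one."""
    names = sorted(available)
    candidates = [
        dict(zip(names, versions))
        for versions in product(*(available[n] for n in names))
    ]
    workable = [c for c in candidates if compatible(c)]
    return max(workable, key=lambda c: sum(c.values()), default=None)


if __name__ == "__main__":
    print(solve())  # {'X': 1, 'Y': 1, 'Z': 1}
```

The only workable combination uses the older versions of X and Y, which is the compromise described above.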

(There are excellent alternatives to conda, notably pyenv and venv. We think conda is the best fit for most SSCC researchers, but the others would work well too.)

Note

Conda environments are big and SSCC home directories are small. We thus recommend putting your conda environments in a project directory, and will show you how. This means you’ll refer to them by a prefix (basically a path) rather than a name. Most discussion of conda you’ll find on the web assumes your environments are in your home directory and refers to them by name, so you’ll probably need to change any example commands to use a prefix before they’ll work for you.

When to Use a Conda Environment

Posit suggests “Create a virtual environment for every project” as The Iron Law of Python Management but, with all due respect to Posit (and they’re due a lot), we’re going to soften that recommendation for SSCC researchers. (If you choose to follow their recommendation strictly, more power to you.)

Note

When we talk about a “project” in this context we do not mean an SSCC project directory, but a research project in the general sense. An SSCC project directory may contain multiple projects in that sense.

Don’t worry about environments if you’re just learning Python. Give yourself some time to figure out what packages are and which packages you’ll use first. Start using environments when:

  • You start using lots of packages that are not included in Anaconda, or any packages that are picky about versions like pytorch. For some of you this will come very quickly.
  • You start writing code that produces results you care about and you need your work to be reproducible long-term.

If you are working on multiple projects that use similar analytical techniques and thus need the same set of packages, you can use one environment for all of them. However, when you finish a project, create a clone of the shared environment just for that project so it can be “frozen” (never updated). That ensures the code for that project will continue to work in the long term and makes the project reproducible. You should also create a clone of the shared environment for a specific project if the project turns out to need additional packages.

Preparing to Use Conda

You need to initialize conda before you can use it by running:

conda init

This will set up your shell to use conda. Once you run it, you need to log out and log back in so the changes can take effect. You’ll know it worked when you see (base) at the beginning of your Linux prompt (e.g. (base) linstat1.ssc.wisc.edu>). This tells you that you’re initially using the base, or default, conda environment.

If you don’t see (base) after running conda init and logging out and logging back in, contact the SSCC Help Desk for assistance.

Creating a Conda Environment

To create a conda environment, first cd to the directory where you want to store the environment. That location should be:

  1. In an SSCC project directory, not your home directory (unless you don’t have access to a project directory and your environment will be small).
  2. Accessible to everyone who will use the environment.
  3. As close to the projects that will use it as possible.

If the environment will be used for a single project, put it inside that project’s directory. If it will be used for multiple projects, ideally put it in the directory that contains all those projects.

Suppose my SSCC project directory is called bbadger and it contains directories for the research projects dissertation, paper1, and paper2. If they all use similar packages, I might put my shared conda environment in /project/bbadger. But I’d put an environment just for my dissertation in /project/bbadger/dissertation/.

To actually create the environment, run:

conda create --prefix project_env python=X.X package1 package2...

where project_env should be replaced by the name you want to give your environment, X.X should be replaced by the version of Python you want to use, and package1 package2... should be replaced by a list of the packages you want the environment to include.

Note

Most conda examples you’ll find on the web use -n instead of --prefix. That will put the resulting environment in your home directory, which is not what you want at the SSCC.

In naming your environment, think long-term: in 5-10 years you’ll need to figure out the name of the environment used to run the project, so we suggest something like project_env where project should be replaced by the name of the project. A shared environment might be shared_env.

With the --prefix option, the name of your environment is actually the path to it. That could be an absolute path, like /project/bbadger/dissertation/dissertation_env or it could be a relative path like ./dissertation_env if the working directory is /project/bbadger/dissertation. However, conda does need to know that it’s a path and not just a name, or it will look for an environment with that name in your home directory. Thus when you refer to the environment name in subsequent commands, it always needs to start with ., /, .. or some other character that tells conda it’s a path. (. means the working directory, so ./dissertation_env means “look in the working directory for dissertation_env.”)
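If you write scripts that pass environment locations to conda, the rule above can be captured in a small helper. This is only a sketch: as_prefix is a name we made up, and it checks just the two leading characters mentioned above (. and /).

```python
def as_prefix(env: str) -> str:
    """Return env in a form that clearly reads as a path.

    Anything already starting with "." or "/" is left alone; a bare
    name such as "dissertation_env" gets "./" put in front of it.
    """
    if env.startswith((".", "/")):
        return env
    return "./" + env


if __name__ == "__main__":
    for example in ("dissertation_env", "../shared_env", "/project/bbadger/shared_env"):
        print(example, "->", as_prefix(example))
```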

A two-part Python version like 3.10 is enough to solve most version issues. Conda will then automatically install the latest version of Python 3.10. But if you need to specify something like python=3.10.13 you can.

Ideally you’ll install all the packages you need when you create your environment. That way conda can ensure they’ll all work together. You can add more packages later if you need to. You can also specify that you need a particular version of a package. For example, putting pytorch-cuda=11.8 in your package list says to include version 11.8 of the package pytorch-cuda.

Some packages are only available from particular conda channels. For example, some packages required to use the SSCC’s NVIDIA GPUs must be obtained from the nvidia channel. You can tell conda to look for packages in additional channels by adding -c and then the channel name.

Putting this all together, the following creates an environment called torch_env that can use pytorch and NVIDIA GPUs:

conda create --prefix torch_env python=3.10 pytorch torchvision torchaudio pytorch-cuda=11.8 -c pytorch -c nvidia

It specifies that the packages pytorch, torchvision, torchaudio, and pytorch-cuda should be installed, with the additional requirement that pytorch-cuda must be version 11.8. To find these packages, conda should look in the pytorch and nvidia channels as well as the defaults. It also specifies that the environment should use Python 3.10.

(A current version of this command can be found by going to pytorch.org, scrolling down a bit, and choosing appropriate options. It will give you a conda install command that you can easily adapt to conda create.)
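If you want a record of exactly how an environment was created, one option is to keep the pieces of the command in a short Python script stored with the project. The sketch below only assembles and prints the command. The function name conda_create_command is our own invention, and the package list is simply the one from the example above.

```python
import shlex


def conda_create_command(prefix, python_version, packages, channels=()):
    """Assemble a conda create command as a list of arguments."""
    cmd = ["conda", "create", "--prefix", prefix, f"python={python_version}"]
    cmd += list(packages)
    for channel in channels:
        cmd += ["-c", channel]  # one -c per additional channel
    return cmd


if __name__ == "__main__":
    cmd = conda_create_command(
        "torch_env",
        "3.10",
        ["pytorch", "torchvision", "torchaudio", "pytorch-cuda=11.8"],
        channels=["pytorch", "nvidia"],
    )
    # Print the command so you can check it before running it yourself.
    print(shlex.join(cmd))
```

Running the script prints the same command shown above, which you can then paste into the terminal.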

Activating an Environment

To activate an environment, you can use conda activate followed by the path to the environment. This could be an absolute or a relative path. For example:

conda activate /project/bbadger/shared_env

Or, if the working directory is /project/bbadger/paper1:

conda activate ../shared_env

To deactivate an environment, run:

conda deactivate

JupyterLab

If you want to use JupyterLab with your environment, install the jupyterlab package (ideally when you create the environment). Then activate the environment before running sscc-jupyter.

Specifying an Environment for Slurm

You can run a Python script using a particular conda environment without activating that environment by using conda run. For example, to run dissertation_analysis.py using the environment shared_env (which is one directory up), you’d type:

conda run --prefix ../shared_env python dissertation_analysis.py

When you submit the script to Slurm, that becomes the command you run with ssubmit:

ssubmit --cores=128 --mem=250g "conda run --prefix ../shared_env python dissertation_analysis.py"

Slurm will then run the script using the conda environment specified.
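Because a Slurm job runs unattended, you may want the script itself to confirm that it is running in the environment you intended. The sketch below relies on the fact that sys.prefix points at the directory of the environment whose Python is running. The function name running_in and the path in the comment are examples only, not part of conda or of the SSCC's tools.

```python
import sys
from pathlib import Path


def running_in(expected_prefix) -> bool:
    """Return True if this Python belongs to the environment at expected_prefix."""
    return Path(sys.prefix).resolve() == Path(expected_prefix).resolve()


if __name__ == "__main__":
    # Report which environment this interpreter belongs to.
    print("Environment in use:", sys.prefix)
    # In a real script you might stop early if it is the wrong one, e.g.:
    #   if not running_in("/project/bbadger/shared_env"):
    #       raise SystemExit("Wrong environment: " + sys.prefix)
```

An absolute path is the safer choice for the expected prefix, since a relative path is interpreted relative to whatever the working directory happens to be when the job runs.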

Freezing a Project’s Environment

When you’ve finished a project, you want to make sure the environment it runs in does not change. If you’ve created an environment just for that project, that’s easy: don’t change it. But if it uses an environment that’s shared with other projects you’ll want to create a separate environment for the completed project. If you haven’t been using an environment, that’s the time to start.

If the Project Uses a Shared Environment

If you’ve created a conda environment that is used for multiple projects, you’ll want to clone the environment in its current state. The completed project will use the cloned environment, which you will never update. Meanwhile, projects you are still working on will use the original environment, which you can update as needed.

To clone an environment, run:

conda create --prefix new_env --clone old_env

For example, if you have a shared environment called shared_env in /project/bbadger and you want to create a clone called dissertation_env in /project/bbadger/dissertation, make that the working directory and run:

conda create --prefix dissertation_env --clone ../shared_env

Note that the name of the old environment must clearly be a path. Start it with ./ if necessary.

Then activate the ./dissertation_env environment and check that all the code related to your dissertation still runs.

If You Didn’t Use an Environment

If you didn’t use a conda environment for the project at all, it’s using the base environment containing all the Anaconda packages plus anything else you installed. You do not want to clone the base environment, as it contains far more packages than you need. Instead, go through your code, search for all the import statements, and make a list of the packages you import. Also run:

python --version

and note which version of Python you’re using. Then go to the location of the project and create a new environment that uses the version of Python you want and contains all the packages you need.
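Searching for import statements by hand gets tedious in a large project. The following is a minimal sketch that does it with Python's standard ast module. The function names are our own, and it assumes your code lives in .py files (it will not look inside notebooks). Filtering out the standard library uses sys.stdlib_module_names, which exists in Python 3.10 and later; on older versions the sketch skips that filtering.

```python
import ast
import sys
from pathlib import Path


def imported_modules(source: str) -> set:
    """Return the top-level module names imported by a piece of Python source."""
    found = set()
    for node in ast.walk(ast.parse(source)):
        if isinstance(node, ast.Import):
            for alias in node.names:
                found.add(alias.name.split(".")[0])
        elif isinstance(node, ast.ImportFrom):
            # level > 0 is a relative import of your own code; skip it
            if node.level == 0 and node.module:
                found.add(node.module.split(".")[0])
    return found


def scan_project(root="."):
    """Collect imported module names from every .py file under root."""
    found = set()
    for path in Path(root).rglob("*.py"):
        try:
            found |= imported_modules(path.read_text(encoding="utf-8"))
        except (SyntaxError, ValueError):
            print(f"skipped {path}", file=sys.stderr)
    return found


def third_party(modules):
    """Drop standard library modules (Python 3.10+) and sort what is left."""
    stdlib = getattr(sys, "stdlib_module_names", frozenset())
    return sorted(m for m in modules if m not in stdlib)


if __name__ == "__main__":
    sample = "import os\nimport numpy as np\nfrom pandas.api import types\n"
    print(third_party(imported_modules(sample)))
```

To scan a real project you would call third_party(scan_project("path/to/project")). Treat the result as a starting point: the list will include any modules you wrote yourself, and the name you import is not always the name of the package you install (for example, sklearn is installed as scikit-learn).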

For example, if your dissertation uses Python 3.11.5 and your code imports pandas, numpy, and plotnine, go to /project/my_projects/dissertation and run:

conda create --prefix dissertation_env python=3.11.5 pandas numpy plotnine

Then activate the ./dissertation_env environment and check that all the code related to your dissertation still runs.
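As an extra check, you can compare package versions between the setup you were using and the new environment. The sketch below uses importlib.metadata from the standard library to report the Python version and the installed version of each package you name. The function name snapshot is our own, and the package list is just the one from the example above.

```python
import platform
from importlib import metadata


def snapshot(distributions):
    """Return the Python version plus the installed version of each named package.

    Packages that are not installed are reported as None.
    """
    versions = {"python": platform.python_version()}
    for name in distributions:
        try:
            versions[name] = metadata.version(name)
        except metadata.PackageNotFoundError:
            versions[name] = None
    return versions


if __name__ == "__main__":
    for name, version in snapshot(["pandas", "numpy", "plotnine"]).items():
        print(f"{name}: {version}")
```

Run it once in the old setup and once with the new environment active, then compare the two listings. Saving the output alongside the project also gives you a simple record of what the code was run with.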

Sharing an Environment

If you put a conda environment in an SSCC project directory, anyone with access to the project directory can use the environment. Just activate it as usual. That makes it very easy for SSCC collaborators to share environments.

Alternatively, you can have conda create an export file that others can easily use to create an identical environment. To do so, activate the environment and run:

conda env export > env_config.yml

This will create a YAML file called env_config.yml in your current directory. If you send it to a collaborator, they can then run:

conda env create -f env_config.yml -p new_env

This will create an identical environment on their computer in their working directory. They should replace new_env with the name they want to give the new environment. Note that this will be a prefix (path) rather than the name of an environment in their home directory.

Installing Packages in an Environment

To install additional packages in an environment, first activate it and then use conda install with a list of packages and related options just like those used with conda create. For example, to give an existing environment the ability to run pytorch on NVidia GPUs, you could run:

conda install pytorch torchvision torchaudio pytorch-cuda=11.8 -c pytorch -c nvidia

When you install multiple packages, conda looks for versions that are compatible with each other and with what is already installed, but the packages already in the environment limit its choices: it may have to change their versions, or it may fail to find a workable combination. If you need to install a bunch of packages, put them all in a single conda install command. If you run into problems with package compatibility, creating a new environment with both the packages you have and the new packages you need will give conda a chance to find compatible versions of all of them.

You can also install packages with pip and they’ll be added to the active environment, but that gives you less compatibility checking.

Managing Environments

To get a list of conda environments you’ve created, run conda env list.

To delete an environment, use conda remove with the --all switch (as in “remove all packages”). For example, to remove dissertation_env run:

conda remove --prefix ./dissertation_env --all

Conda keeps track of environments in several places, so don’t just delete the directory containing the environment.