Custom Conda Environments for Data Science on HPC Clusters

A problem that a lot of us scientists have to deal with is how to run our python code on an HPC cluster (e.g. an XSEDE resource, NCAR’s Yellowstone, etc.). On our local machines, we know how to manage packages using conda and pip. But most clusters do not have conda available, and the system or module-based python distribution probably doesn’t have all the packages we need (e.g. xarray, dask, etc.). How can we get around these limitations to get a fully functional, flexible python environment on any cluster?

Here is my solution, which I have used successfully on Columbia’s Habanero and NASA’s Pleiades clusters. I assume something similar will work on most standard HPC clusters.

Step 1: Install miniconda in user space.

Miniconda is a minimal version of Anaconda that includes just conda and its dependencies. It is a very small download. If you want to use python 3 (recommended), you can call
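something like the following (the exact installer name and URL can change over time, so check the Miniconda download page if this link goes stale):

```shell
# Download the python 3 Miniconda installer for 64-bit Linux
wget https://repo.anaconda.com/miniconda/Miniconda3-latest-Linux-x86_64.sh
```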

or for python 2.7
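a command along these lines (again, verify the installer name against the Miniconda download page):

```shell
# Download the python 2.7 Miniconda installer for 64-bit Linux
wget https://repo.anaconda.com/miniconda/Miniconda2-latest-Linux-x86_64.sh
```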

Step 2: Run Miniconda

Now you actually run the miniconda installer to set up the package manager. The trick is to specify an install directory within your home directory, rather than the default system-wide location (which you won’t have permission to write to). You then have to add this directory to your path.
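A sketch of what this looks like, assuming the python 3 installer and an install prefix of `$HOME/miniconda` (any directory in your home space will do):

```shell
# Run the installer in batch mode (-b) with a user-space prefix (-p)
bash Miniconda3-latest-Linux-x86_64.sh -b -p $HOME/miniconda

# Add miniconda to your path for the current session...
export PATH="$HOME/miniconda/bin:$PATH"

# ...and append the same line to ~/.bashrc so it persists across logins
echo 'export PATH="$HOME/miniconda/bin:$PATH"' >> ~/.bashrc
```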

Step 3: Create a custom conda environment specification

You now have to define what packages you actually want to install. A good way to do this is with a custom conda environment file. The contents of this file will differ for each project. Below is the environment.yml that I use for my xmitgcm project.
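A representative sketch of such a file, using the packages mentioned earlier in this post (the actual package list for your project will differ):

```yaml
name: xmitgcm
channels:
  - conda-forge
dependencies:
  - python=3
  - numpy
  - xarray
  - dask
  - jupyter
  - pip
```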

Create a similar file for your project and save it as environment.yml. You should choose a value of name that makes sense for your project.

Step 4: Create the conda environment

You should now be able to run the following command
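from the directory containing your environment file:

```shell
# Build the environment described in environment.yml
conda env create -f environment.yml
```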

This will download and install all the packages and their dependencies.

Step 5: Activate The Environment

The environment you created needs to be activated before you can actually use it. To do this, you call
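For example (older versions of conda use `source activate`; newer ones also accept `conda activate`):

```shell
# Activate the environment defined in environment.yml
source activate xmitgcm
```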

(Replace xmitgcm with whatever name you picked in your environment.yml file.)

This step needs to be repeated whenever you want to use the environment (i.e. every time you launch an interactive job or call python from within a batch job).

Step 6: Use Python!

You can now call ipython on the command line or launch a jupyter notebook. On most clusters, this should be done from a compute node, rather than the head node. Connecting to a notebook running on a cluster is a bit complicated, but it can definitely be done. That will be the topic of my next post.
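As a minimal sketch, once the environment is active on a compute node you might start a notebook server like this (the flags shown are standard jupyter options; connecting to it from your laptop is the part deferred to the next post):

```shell
# Start a notebook server without trying to open a browser,
# listening on the compute node's hostname
jupyter notebook --no-browser --ip=$(hostname)
```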

Associate Professor, Earth & Environmental Sciences, Columbia University.
