An academic's guide to conda virtual environments

Credit to Mark Seemann's post called Functional design is intrinsically testable

Further reading

Real Python has a terrific primer on the concept of virtual environments in general. Consider taking a look at their article, Python Virtual Environments: A Primer

Milk & Honey

Honestly, I have been somewhat apathetic about my blog...probably my site as a whole. Don't get me wrong, I love having a nice, clean representation of who I am online. Something that I control. However, I am also a married graduate student. Life kind of gets in the way of regular site maintenance. Furthermore, I didn't really know if my content was reaching anyone.

However, the past week or so really changed that for me. The most recent event involved a site called Kite.com (Note: this is not sponsored). Kite reached out in the interest of collaborating with me. While I was very flattered, I am not active enough to really reap the full benefits of such a collaboration. The second, and probably the most flattering, event happened earlier this week. A reader reached out to me via email. They had read my blog post on setting up WSL, in which I had alluded to doing a write-up on how to use virtual environments. Sadly, I never got around to it, and that was exactly what the reader was interested in.

So, here I am, giving my readers what they want. A guide on how (and why) to use virtual environments.


An academic's view on working environments

Conda Cheatsheet

The majority of the commands that I will be using can be found on any number of helpful cheatsheets, but I prefer to stay first-party when it comes to conda. Therefore, I suggest bookmarking their cheatsheet (or printing it out) for reference: conda cheatsheet

To explain the issues regarding virtual environments in the context of academic environments, one needs to understand my field of research first. Bioinformatics is a fancy term for a very specific field of interdisciplinary research. In the most generalized sense, bioinformatics is one part math/statistics, one part computer science, and one part some biomedical domain (e.g. biology, molecular chemistry, biophysics, etc.).

A bioinformatics department is a strange place to be when one considers computer science. Yes, we definitely cannot do the research we do without computers. However, bioinformatics often suffers from its interdisciplinary heritage. For the uninitiated, very few undergraduate institutions have "bioinformatics" programs. This usually means that the ideal graduate student in bioinformatics is like the unicorn that is a 10x developer in industry. The reality of the situation is that graduate students in the field are often very specialized in one (maybe two) of the three parts that make up the field. The biggest problem with this situation is that computer science specialists are often the smallest demographic in bioinformatics.

This is mainly because computer science, in biomedical research, is often seen as a tool...not a domain. The side effect of this pragmatic viewpoint is somewhat benign though: great scientists still produce great science. However, the unread subtext here is that the tools that built that science are in horrible condition. I cannot count how many times I have seen spaghetti code so bad that re-writing it was easier than debugging it, only to find out that the reason the previous code took 100+ hours to run was some poorly designed nested for-loop (or something of the sort). This all comes to a head when the research is ready to be published and the code base is expected to be shared in the interest of transparency, reproducibility, and open source. However, the graduate student doesn't have a GitHub account, let alone understand what version control is. As a last-minute response to reviewer comments, the student hastily writes up a README.md with installation instructions so poorly constructed that any attempt to duplicate the work ends in a "Works on my machine" certification.


Fruit of a poisonous tree

While there are many issues that could lead to "bad" code (informal instruction, Stack Overflow copy & paste, rushing, etc.), I make no claims to solve any (let alone all) of them with this blog post. With that in mind, I should also explain something about myself. I am a biologist by trade and training. I didn't get the bug to debug until the junior year of my undergraduate education. Therefore, I ask any computer scientists to grant me a modicum of leniency if I explain something in a generalized or over-simplified way.

If you are like most great data scientists that use Python for your research, you may have installed Anaconda to get yourself going. Now, what if I said that even though you may not have even typed a single line of code, your project could already be "corrupt"? I don't write this to cause any conflict with Anaconda. Actually, quite the contrary. There are few groups that have individually impacted the field of data science to the degree that Anaconda has. My only comment here is that while Anaconda is one of the best environments to prototype a project in, if you plan to share your project with anyone else, there are some things to consider.

Did you know that the reason your Anaconda install took so long is that Anaconda is actually a distribution of not only Python, but also over 200 other packages relevant to data science? Now, can you say that your project makes use of all 200+ of those packages? Likely, you said "no". Furthermore, you may have even installed some packages that didn't come with Anaconda: pip install seqlogo (https://github.com/betteridiot/seqlogo) or conda install bamnostic (https://github.com/betteridiot/bamnostic/). How do you know for sure which packages your project actually uses, such that you could write definitive instructions on how to easily replicate your working environment?

I can almost hear some developers groan. Yes, I know there are ways to generate a record of your working environment's condition: pip freeze > requirements.txt or conda env export > environment.yml. But this doesn't solve the specificity problem. And this is where this blog post hopes to clarify what I call the "fruit of a poisonous tree" issue.

So, what is the "fruit of a poisonous tree" issue? It is what happens when you build a project on top of an environment that is not tailored to just that project's requirements/dependencies. As a thought experiment, imagine a user who finishes one project using one set of packages (e.g. NumPy, Pandas, and scikit-learn). Then, after finishing that project, they start working on another, but it requires them to update NumPy. What is the problem here? I'll give you a second... If they did this in their (base) environment, they can no longer guarantee that their first project works correctly. Furthermore, they can no longer use the methods listed above to capture a snapshot of the original environment (unless they did so beforehand).
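Sketched out as shell history, the trap looks something like this (the version numbers are purely illustrative):

# Project 1 is developed directly in the (base) environment
(base) $ conda install numpy=1.15 pandas scikit-learn
# ...months later, Project 2 needs a newer NumPy
(base) $ conda install numpy=1.16
# (base) now fits Project 2, but no record remains of the
# environment Project 1 was actually built and tested against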

The "fruit of a poisonous tree" issue is not specific to Anaconda though. I was just being sensational. The point stands however. When someone installs Anaconda, their project's environment became a 200+ package (base) environment. In contrast, this same issue can arise even if someone installed Miniconda (instead of Anaconda) and installed all the packages they used for each project into their (base) environment over the course of their research. In either case, if development of separate projects happens on the same (base) environment, it is increasingly difficult to describe the correct environment in which each project was built.


Quarantine

The simplest way to prevent fruit of the poisonous tree is to ensure you have good roots. And, yes, I will probably overuse this analogy. The point to make here is that the (base) environment should be as clean as possible.

First though, why do you think I have been typing "base" like (base)? This is how conda prefixes your command prompt to indicate which environment you are in. That's right: conda inherently expects you to use different environments throughout your development process. conda just makes sure you have a starting point, and that starting point is called (base).

Back to the point though: your (base) is not really meant to be a place for development. It is meant to be your (base) of operations. Like your browser's homepage, you can do some things there, but not much; it takes a deliberate action to navigate away from it before you can really do what you want to do. Therefore, we should treat (base) the same way: a nice, familiar place to start our work.

The corollary to this is that you should have a different environment for your project. But doesn't that mean you will just end up with the same problem in a different place? No, because it is recommended that you have a different environment for every project.

What this partitioning of environments effectively does is quarantine your projects. The technical words describing this are "isolation" and "encapsulation". Isolation can be thought of as something that intentionally cannot "depend on any implicit knowledge about the external world". Encapsulation is the fundamental concept within object-oriented programming of "bundling data and methods that work on that data within one unit". Put simply, isolation means your code only knows what it is told and encapsulation is grouping up chunks of code that logically belong together. This concept is not lost when it comes to development environments either.


Enter the "Virtual Environment"

I have done a lot of explaining up to this point. Now, I will direct my attention to application. Isolating & encapsulating your working environments is called creating a "virtual environment". This is not unlike running a virtual desktop or an emulator. The key difference is that we only really care about the Python (or R, if you are into that sort of thing) development environment; we don't really have to address the operating system. To follow best practices, we should create a virtual environment for each project.

A convention

From here on out, I am going to explain the implementation of virtual environments in the context of conda env environments. I know there are others (pipenv and virtualenv), but I will stick with conda for consistency's sake.

Let there be code

When a developer starts a project, a few thoughts should cross their mind before they jump into coding:

  1. What version of Python will the project run on? (python=3.7)
  2. Are there any third-party packages the project may need?
    • What are they? (numpy)
  3. How will interaction with the code take place (GUI, REPL, script, or Jupyter Notebook)?

For example, I want to write a project that will run from the command line using only Numpy on Python 3.7. With this information, we can create our project environment:

conda create --name PROJECT_NAME python=3.7 numpy
[Screenshot: completed virtual environment install]

Press enter and let conda do all the work for you. conda, while it has some quirks, is really cool. It will inspect your current system and automatically detect which versions of the packages you need for your OS, manage the dependencies between all of the packages to be installed, download them, and install them. When it is all done, it will give you a friendly prompt that looks something like what is shown above.

This prompt is there to let you know that you have created the environment, but you have not turned it on yet, much like how mkdir foo does not automatically move you into foo/ after it is done. To start it up, just type:

conda activate PROJECT_NAME

This is when you should see (base) turn into (PROJECT_NAME) on your prompt, indicating that you are using this new virtual environment.
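And when you are done working on the project, hopping back out is just as easy; this returns your prompt to (base):

conda deactivate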


The syntax

Let's talk about the parts (or syntax) of the command. There are essentially four (4) major parts to a conda create call:

  1. The command and sub-command (conda create)
  2. The name of the environment (--name PROJECT_NAME)
  3. The version of Python (python=3.7)
    • If you don't specify one, it defaults to whatever version your (base) environment uses
  4. What additional package(s) to install

When all is said and done, all of your python calls and imports will come from the currently activated environment.
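If you want to verify that for yourself, a quick check after activation should point at the environment's own interpreter (the path below is illustrative of a Linux/WSL Miniconda install):

conda activate PROJECT_NAME
which python          # e.g. /home/user/miniconda3/envs/PROJECT_NAME/bin/python
python --version      # Python 3.7.x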


The cure for the common core

Hopefully, when you saw the last section, you realized why I choose not to use Anaconda as my (base) environment: I don't spend a lot of time in my (base) environment, so I can afford not to install a bunch of extra packages I don't need (or don't know I need).

I do want to take a moment to explain the concept of conda channels though. The default conda channel (anaconda) contains the list of packages that Anaconda has curated and bundled together for their distribution. This list is pretty exhaustive, but not really complete. There are many third-party packages that aren't listed on the anaconda channel, or that are on a different release timeline, which the community continues to develop and refine. Therefore, I introduce these two (2) additional conda channels that you can add to your .condarc config file and never think about again.

conda config --set channel_priority false
conda config --add channels conda-forge
conda config --add channels bioconda

With these one-time commands, your conda will not only automatically have access to the large majority of available packages out there, but will also install the newest version of any package that is cross-listed in any of these channels.
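For reference, after running those commands your ~/.condarc should contain something close to the following (conda config --add prepends, so the last channel added ends up on top):

channel_priority: false
channels:
  - bioconda
  - conda-forge
  - defaults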


Some examples

Now that we have gone through a lot of my reasoning for using conda virtual environments, this section will demonstrate some of the specifics of using them.

# To create a very basic data science environment
conda create -n basic_data_science "python>=3.7" pip numpy pandas matplotlib scikit-learn

This environment gives me access to the Python standard library, basic data science tools, and basic plotting capabilities. The fun part is that I can treat this as a parent environment for any others that I want to build out a bit more.

# To add IDE capabilities to my basic data science build
conda create -n jupyter_base --clone basic_data_science
conda install -n jupyter_base jupyter jupyterlab notebook=5.7.2 "tornado<6" seaborn plotly altair

Note: At the time of this writing, the very specific package versioning seen here is the only build that will populate a Windows-side browser from a Windows Subsystem for Linux (WSL) based installation of Jupyter Lab/Notebook.

Now, if I really wanted to expand this for some heavy genomics research, I would go one step further.

# Exploratory genomics research build
conda create -n genomics --clone jupyter_base
conda install -n genomics pysam bamnostic bedtools bcftools bwa star seqlogo 

With this, we come to the first "hiccup" of having multiple virtual environments: remembering their names. We can easily address this with the command:

conda env list
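The output should look something like the listing below (the install paths will vary from machine to machine):

# conda environments:
#
base                  *  /home/user/miniconda3
basic_data_science       /home/user/miniconda3/envs/basic_data_science
jupyter_base             /home/user/miniconda3/envs/jupyter_base
genomics                 /home/user/miniconda3/envs/genomics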
The asterisk marks whichever environment is currently active. When your list of projects needs some pruning, or you just want to do some spring cleaning, removal is as simple as:
conda env remove --name PROJECT_NAME

That's it. That is all it takes to install, clone, and uninstall conda virtual environments. But there is one more thing to mention...the one thing that really catalyzed my desire to write this article: how to share your environment.

There are two parts to quickly sharing your environment specifications. For the most part, it only takes the first command I will share. However, if you installed a package that is not listed on any conda channel and had to pip install it instead, you will also need the second command.

The first: capturing conda specs

The first command exports a given environment to a YAML file. This file contains all that is needed to recreate a 100% conda-built environment on someone else's computer (unless they don't use the same operating system).

conda env export --name PROJECT_NAME --file environment.yml
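To give a sense of what gets captured, the export for a small environment looks roughly like this (the names, channels, and version pins will reflect your actual setup):

name: PROJECT_NAME
channels:
  - bioconda
  - conda-forge
  - defaults
dependencies:
  - python=3.7.2
  - numpy=1.16.1
  # pip-installed packages, if any, appear under a nested pip: entry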

From here, all someone would have to do to create a quick copy of your virtual environment is take this YAML file and invoke:

conda env create --file environment.yml

The second: capturing pip specs

The second command captures any packages in your current working environment that you had to install with pip because they were not available on a conda channel.

pip freeze > requirements.txt

Then, the person that you share this with (after running the first command) would run:

pip install -r requirements.txt
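Putting the two halves together, the person reproducing your setup would run something like this (assuming your environment.yml names the environment PROJECT_NAME):

conda env create --file environment.yml
conda activate PROJECT_NAME
pip install -r requirements.txt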

P.S

There is a more explicit way to create exact copies of your conda working environments though. Because it is so explicit, it isn't as quick as using the conda env export method.

conda list --name PROJECT_NAME --explicit > spec-file.txt
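If you peek inside the spec file, it is just a short header followed by one fully qualified package URL per line (shown here for a linux-64 machine, truncated):

# This file may be used to create an environment using:
# $ conda create --name <env> --file <this file>
# platform: linux-64
@EXPLICIT
https://conda.anaconda.org/conda-forge/linux-64/...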

The person you are sharing this file with would then use:

conda create --name PROJECT_NAME --file spec-file.txt

0xDEADBEEF

I know that this post was long-winded. There are a lot of posts/forums/cheatsheets that list out these steps more succinctly than I did here, but I thought it was important to understand why I use virtual environments...and that isn't always covered in those pages.

In the end, I hope this was helpful. If you haven't subscribed yet, fill out the form to receive content when it is released. Also, please feel free to leave any comments below, share this post using any of the social media links below, or just @ me on Twitter. Thank you!


Have fun and code responsibly!
