An academic's guide to conda virtual environments
Real Python did a terrific primer on the concept of virtual environments (in general). Consider taking a look at their article at Python Virtual Environments: A Primer
Milk & Honey
Honestly, I have been somewhat apathetic about my blog...probably my site as a whole. Don't get me wrong, I love having a nice, clean representation of who I am online. Something that I control. However, I am also a married graduate student. Life kind of gets in the way of regular site maintenance. Furthermore, I didn't really know if my content was reaching anyone.
However, the past week or so really changed that for me. The most recent event was a site called Kite.com (Note: this is not sponsored). Kite reached out to me in the interest of collaborating with me. While I was very flattered, I am not active enough to really reap the full benefits of such a collaboration. The second, and probably the most flattering, event happened earlier this week. A reader reached out to me via email. They had read my blog post on setting up WSL and in the blog post I had alluded to doing a write up on how to use virtual environments. Sadly, I never got around to it, and that was what the reader was interested in.
So, here I am, giving my readers what they want. A guide on how (and why) to use virtual environments.
An academic's view on working environments
The majority of the commands that I will be using can be found on any number of helpful cheatsheets, but I prefer to stay first-party when it comes to conda. Therefore, I suggest bookmarking (or printing out) their cheatsheet for reference: conda cheatsheet
To explain the issues regarding virtual environments in the context of academic environments, one needs to understand my field of research first. Bioinformatics is a fancy term for a very specific field of interdisciplinary research. In the most generalized sense, bioinformatics is composed of one part math/statistics, one part computer science, and one part biomedical domain (e.g. biology, molecular chemistry, biophysics, etc.).
A bioinformatics department is a strange place to be when one considers computer science. Yes, we definitely cannot do the research we do without computers. However, bioinformatics often suffers from its interdisciplinary heritage. For the uninitiated, very few undergraduate institutions have "bioinformatics" programs. This usually means that the ideal graduate student in bioinformatics is like the unicorn that is a 10x developer in industry. The reality of the situation is that graduate students in the field often are very specialized in one (maybe two) of the three parts that make up the field. The biggest problem with this situation is that computer science specialists are often the smallest demographic in bioinformatics.
This is mainly because computer science, in biomedical research, is often seen as a tool...not a domain. The side effect of this pragmatic viewpoint is somewhat benign though: great scientists still produce great science. However, the unread subtext here is that the tools that built that science are in horrible condition. I do not know how many times I have seen spaghetti code so bad that rewriting it was easier than debugging it, only to find out that the previous code took 100+ hours to run because of some poorly designed nested for-loop (or something or other). This all comes to a head when the research is ready to be published, and the code base is expected to be shared in the interest of transparency, reproducibility, and open source. However, the graduate student doesn't have a GitHub account, let alone understand what version control is. As a last-minute response to reviewer comments, the student hastily writes up a README.md with installation instructions so poorly constructed that any attempt to duplicate the work ends in a "Works on my machine" certification.
Fruit of a poisonous tree
While there are many issues that could lead to "bad" code (informal instruction, Stack Overflow copy & paste, rushing, etc.), I make no claims to solve any/all of them with this blog post. With that in mind, I should also explain something about myself. I am a biologist by trade and training. I didn't get the bug to debug until my junior year of my undergraduate education. Therefore, I ask any computer scientists to grant me a modicum of leniency if I explain something in a little more generalized or over-simplified way.
If you are like most great data scientists that use Python for your research, you may have installed Anaconda to get yourself going. Now, what if I said that even though you may not have even typed a single line of code, your project could already be "corrupt"? I don't write this to cause any conflict with Anaconda. Actually, quite the contrary. There are few groups that have individually impacted the field of data science to the degree that Anaconda has. My only comment here is that while Anaconda is one of the best environments to prototype a project in, if you plan to share your project with anyone else, there are some things to consider.
Did you know that the reason your Anaconda install took so long is that Anaconda is actually a distribution of not only Python, but also over 200 other packages relevant to data science? Now, can you say that your project makes use of all 200 of those packages? Likely, you said "no". Furthermore, you may have even installed some packages that didn't come with Anaconda: pip install seqlogo (https://github.com/betteridiot/seqlogo) or conda install bamnostic (https://github.com/betteridiot/bamnostic/). How do you know for sure which packages your project actually uses, such that you could write definitive instructions on how to easily replicate your working environment?
I can almost hear some developers groan. Yes. I know there are ways to generate a record of your working environment's conditions: pip freeze > requirements.txt or conda env export > environment.yml. But this doesn't solve the specificity problem. And this is where this blog post hopes to clarify what I call the "fruit of a poisonous tree" issue.
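One rough way to approach the specificity problem is to ask the code itself what it imports. The following is a minimal sketch, not a definitive tool: the function names are my own, and it only sees static import statements. Keep in mind that an import name is not always the package name (sklearn is installed as scikit-learn), and that your project's own local modules will show up in the result too.

```python
import ast
import sys
from pathlib import Path


def top_level_imports(source):
    """Return the top-level module names imported by a piece of Python source."""
    names = set()
    for node in ast.walk(ast.parse(source)):
        if isinstance(node, ast.Import):
            for alias in node.names:
                names.add(alias.name.split(".")[0])
        elif isinstance(node, ast.ImportFrom):
            # level > 0 marks a relative import (from . import x); skip those
            if node.level == 0 and node.module:
                names.add(node.module.split(".")[0])
    return names


def non_stdlib_imports(project_dir):
    """Collect imports from every .py file under project_dir, minus the stdlib."""
    found = set()
    for path in Path(project_dir).rglob("*.py"):
        found |= top_level_imports(path.read_text(encoding="utf-8"))
    # sys.stdlib_module_names only exists on Python 3.10+; on older
    # versions nothing is filtered out.
    stdlib = getattr(sys, "stdlib_module_names", frozenset())
    return found - set(stdlib)
```

Calling non_stdlib_imports(".") from the root of a project gives a first draft of the packages that the project, rather than the whole environment, depends on.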
So, what is the "fruit of a poisonous tree" issue? It is when you build a package off of an environment that is not custom tailored to just its requirements/dependencies. As a thought experiment, imagine that a user has finished one project using one set of packages (e.g. Numpy, Pandas, and Scikit-Learn). Then, after finishing that project, they start working on another, but it requires them to update Numpy. What is the problem here? I'll give you a second... If they did this in their (base) environment, they can no longer guarantee that their first project works correctly. Furthermore, they can no longer use the methods listed above to capture a snapshot of their original environment (unless they did so beforehand).
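To make the thought experiment concrete, here is a small sketch of how one might record the versions a project was last known to work with before touching anything. The helper name is hypothetical; the only library call is importlib.metadata.version from the standard library (Python 3.8+).

```python
from importlib.metadata import PackageNotFoundError, version


def snapshot_versions(packages):
    """Map each distribution name to its installed version (None if absent)."""
    snapshot = {}
    for name in packages:
        try:
            snapshot[name] = version(name)
        except PackageNotFoundError:
            snapshot[name] = None
    return snapshot


if __name__ == "__main__":
    # The three packages from the thought experiment above.
    for name, ver in snapshot_versions(["numpy", "pandas", "scikit-learn"]).items():
        print(f"{name}: {ver or 'not installed'}")
```

If the Numpy version printed today differs from the one the first project was written against, that project's results can no longer be taken for granted.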
The "fruit of a poisonous tree" issue is not specific to Anaconda though. I was just being sensational. The point stands, however. When someone installs Anaconda, their project's environment becomes a 200+ package (base) environment. Likewise, this same issue can arise even if someone installed Miniconda (instead of Anaconda) and installed all the packages they used for each project into their (base) environment over the course of their research. In either case, if development of separate projects happens on the same (base) environment, it is increasingly difficult to describe the correct environment in which each project was built.
The simplest way to prevent fruit of the poisonous tree is to ensure you have good roots. And, yes, I will probably overuse this analogy. The point to make here is that the (base) environment should be as clean as possible.
First though, why do you think I have been typing "base" like (base)? This is what conda prepends to your command prompt to indicate which environment you are in. That's right. conda inherently expects you to use different environments throughout your development process. conda just makes sure you have a starting point, and that starting point is called (base), and you may have seen it.
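The prompt is not the only place this shows up. As a minimal sketch (the function names are mine), the same information can be read from inside Python: conda sets the CONDA_DEFAULT_ENV environment variable when an environment is activated, and sys.prefix points at the directory the running interpreter lives in.

```python
import os
import sys


def active_conda_env(environ=None):
    """Return the name of the activated conda environment, or None."""
    environ = os.environ if environ is None else environ
    return environ.get("CONDA_DEFAULT_ENV")


def describe_interpreter(environ=None):
    """Summarize which environment and install prefix Python is running from."""
    name = active_conda_env(environ)
    label = name if name else "no conda environment detected"
    return f"{label} -> {sys.prefix}"


if __name__ == "__main__":
    print(describe_interpreter())
```

Run from a fresh terminal, I would expect the label to read base; after activating another environment it should change to that environment's name.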
Back to the point though: your (base) is not really meant to be a place for development. It is meant to be your (base) of operations. Like your homepage, you can do some stuff, but not really much. It takes your deliberate actions to navigate away from it for you to really do what you want to do. Therefore, we should treat (base) the same way: a nice, familiar place to start our work.
The corollary to this is that you should have a different environment for your project. But doesn't that mean you will just end up with the same problem in a different place? No, because it is recommended that you have a different environment for every project.
What this partitioning of environments effectively does is quarantine your projects. The technical words describing this are "isolation" and "encapsulation". Isolation can be thought of as something that intentionally cannot "depend on any implicit knowledge about the external world". Encapsulation is the fundamental concept within object-oriented programming of "bundling data and methods that work on that data within one unit". Put simply, isolation means your code only knows what it is told and encapsulation is grouping up chunks of code that logically belong together. This concept is not lost when it comes to development environments either.
Enter the "Virtual Environment"
I have done a lot of explaining up to this point. Now, I will direct my attention to application. Isolating and encapsulating your working environments is called creating a "virtual environment". This is not unlike running a virtual desktop or emulator. The key difference here is that we only really care about the Python (or R, if you are into that sort of thing) development environment. We don't really have to address the operating system. To use best practices, we should create a virtual environment for each project.
From here on out, I am going to explain the implementation of virtual environments in the context of conda env environments. I know there are others (pipenv and virtualenv), but I will maintain this for consistency's sake.
Let there be code
When a developer starts a project, a few thoughts should cross their mind before they jump into coding:
- What version of Python will the project run on? (
- Are there any third-party packages the project may need?
- What are they? (
- How will interaction with the code take place (GUI, REPL, script, or Jupyter Notebook)?
For example, I want to write a project that will run from the command line using only Numpy on Python 3.7. With this information, we can create our project environment:
conda create --name PROJECT_NAME python=3.7 numpy
Press enter and let conda do all the work for you. conda, while it has some quirks, is really cool. It will inspect your current system and automatically detect which versions of the packages you need for which OS you are working on, manage dependencies between all of the packages to be installed, download them, and install them. When it is all done, it will give you a friendly prompt that looks something like what is shown to the right.
This prompt is there to let you know that you created the environment, but you have not turned it on yet. This is like how mkdir foo does not automatically move you into foo/ after it is done. To start it up, just type:
conda activate PROJECT_NAME
This is when you should see (base) turn into (PROJECT_NAME) on your prompt, indicating that you are using this new virtual environment.
Let's talk about the parts (or syntax) of the command. There are essentially four (4) major parts to conda virtual environments:
- The command and sub-command (
- The name of the environment (
- The version of Python (
  - It will default to whatever version your (base) environment is if you don't put anything
- What additional package(s) to install
When this is said and done, all of your python calls and imports will come from the currently activated environment.
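If you want to convince yourself of this, a quick check is to ask Python where it would load a module from. This is only a sketch with a helper name of my own; importlib.util.find_spec is the standard-library call doing the work.

```python
import importlib.util
import sys


def import_origin(module_name):
    """Return the file a module would be loaded from, or None if not found."""
    spec = importlib.util.find_spec(module_name)
    return None if spec is None else spec.origin


if __name__ == "__main__":
    print("interpreter:", sys.executable)
    # json ships with Python; swap in numpy (or any package you installed)
    print("json       :", import_origin("json"))
```

With an environment activated, both paths should sit inside that environment's directory rather than inside (base).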
The cure for the common core
Hopefully, when you saw the last section, you realized why I choose to not use Anaconda as my (base) environment. It is because I don't spend a lot of time in my (base) environment. So, I can afford to not install a bunch of extra packages I don't need (or don't know I need).
I do want to take a moment to explain the concept of conda channels though. The basic conda channel (anaconda) contains a list of packages that Anaconda has curated and bundled together for their distribution. This list is pretty exhaustive, but not really complete. There are many third-party packages that the community continues to develop and refine that either aren't listed on the anaconda channel or are on a different release timeline. Therefore, I introduce these two (2) additional conda channels that you can add to your .condarc config file and never think about again.
conda config --set channel_priority false
conda config --add channels conda-forge
conda config --add channels bioconda
With these one-time commands, your conda will not only automatically have access to a large majority of the available packages out there, but also install the newest version of any package that is cross listed in any of these channels.
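The order of those commands matters a little, because conda config --add places the new channel at the top of the channel list. The toy model below only mimics that prepend behavior to show the resulting order; it assumes the list started out as just defaults, and add_channel is a hypothetical helper, not part of conda.

```python
def add_channel(channels, name):
    """Mimic `conda config --add channels NAME`, which puts NAME first."""
    return [name] + [c for c in channels if c != name]


# Assumption: a fresh install whose channel list only contains "defaults".
channels = ["defaults"]
for name in ("conda-forge", "bioconda"):
    channels = add_channel(channels, name)

print(channels)  # ['bioconda', 'conda-forge', 'defaults']
```

You can compare this against your real configuration with conda config --show channels.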
Now that we have gone through a lot of my reasoning for using conda virtual environments, this section will demonstrate some of the specifics of using them.
# To create a very basic data science environment
conda create -n basic_data_science "python>=3.7" pip numpy pandas matplotlib scikit-learn
This environment gives me access to the Python standard library, basic data science tools, and basic plotting capabilities. The fun part is that I can treat this as a parent environment for any others that I want to build out a bit more.
# To add IDE capabilities to my basic data science build
conda create -n jupyter_base --clone basic_data_science
conda install -n jupyter_base jupyter jupyterlab notebook=5.7.2 "tornado<6" seaborn plotly altair
Note: At the time of this writing, the very specific package versioning seen here is the only build that can populate a Windows-side browser with a Windows Subsystem for Linux (WSL) based implementation of Jupyter Lab/Notebook.
Now, if I really wanted to expand this for some heavy genomics research, I would go one step further.
# Exploratory genomics research build
conda create -n genomics --clone jupyter_base
conda install -n genomics pysam bamnostic bedtools bcftools bwa star seqlogo
With this, we come to the first "hiccup" of having multiple virtual environments: remembering the names. We can easily address this with the command:
conda env list
The output should look something like what is shown on the left. When your list of projects needs some pruning, or you just want to do some spring cleaning, that is as simple as:
conda env remove --name PROJECT_NAME
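Going back to remembering names for a moment: if you ever want the list inside a script, conda env list accepts a --json flag. The sketch below assumes the JSON has a top-level "envs" key holding one path per environment; the sample paths are made up, and env_names is my own helper.

```python
import json
import re


def env_names(conda_json):
    """Return the last path component of every environment conda reports."""
    paths = json.loads(conda_json)["envs"]
    # Split on either slash style so Windows paths work as well.
    return [re.split(r"[\\/]", path.rstrip("\\/"))[-1] for path in paths]


# Hypothetical output of `conda env list --json`; your paths will differ.
sample = '{"envs": ["/home/me/miniconda3", "/home/me/miniconda3/envs/genomics"]}'
print(env_names(sample))  # ['miniconda3', 'genomics']
```

Note that the (base) environment shows up under the name of its install directory, since its path has no envs/ component.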
That's it. That is all it takes to install, clone, and uninstall conda virtual environments. But, there is one more thing to mention...the one thing that really catalyzed my desire to write this article: How to share your environment.
There are two parts to quickly sharing your environment specifications. For the most part, it only takes the first command I will share. However, if you install a package that is not listed on any conda channel and you have to pip install it instead, you will need to use the second command.
The first: capturing
The first command exports a given environment to a YAML file. This file contains all that is needed to recreate a 100% conda-built environment on someone else's computer (unless they don't use the same operating system).
conda env export --name PROJECT_NAME --file environment.yml
From here, all someone would have to do to create a quick copy of your virtual environment is to have a copy of this YAML file and invoke:
conda env create --file environment.yml
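An exported file lists every package in the environment, including the ones that were only pulled in as dependencies. If you would rather share just what you asked for, a hand-written file works too. The following is a small sketch that renders such a file; the function name is hypothetical, and the values reuse the PROJECT_NAME example from earlier.

```python
def render_environment_yml(name, channels, dependencies):
    """Build the text of a minimal environment.yml from direct dependencies."""
    lines = [f"name: {name}", "channels:"]
    lines += [f"  - {channel}" for channel in channels]
    lines.append("dependencies:")
    lines += [f"  - {dep}" for dep in dependencies]
    return "\n".join(lines) + "\n"


if __name__ == "__main__":
    text = render_environment_yml(
        "PROJECT_NAME", ["conda-forge"], ["python=3.7", "numpy"]
    )
    print(text)
```

Saving that text as environment.yml gives a file that conda env create --file environment.yml should be able to read, while leaving conda free to pick the dependency versions that fit the other person's operating system.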
The second: capturing
The second command will capture any packages in your current working environment that you had to install with pip because they were not available on any conda channel:
pip freeze > requirements.txt
Then, the person that you share this with (after running the first command) would run:
pip install -r requirements.txt
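One thing to watch for: inside a conda environment, pip freeze will often list conda-installed packages as "name @ file:///..." lines, which are useless on someone else's machine. A minimal sketch for keeping only the "name==version" pins is shown here; the helper name and the sample version number are made up for illustration.

```python
def parse_pins(requirements_text):
    """Return {name: version} for every 'name==version' line, skipping the rest."""
    pins = {}
    for raw in requirements_text.splitlines():
        line = raw.strip()
        if not line or line.startswith("#") or "==" not in line:
            continue  # blank lines, comments, and 'name @ file://...' entries
        name, _, pinned = line.partition("==")
        pins[name.strip()] = pinned.strip()
    return pins


# Hypothetical pip freeze output; the version number is illustrative only.
sample = "seqlogo==1.0.0\n# a comment\nnumpy @ file:///tmp/build/numpy\n"
print(parse_pins(sample))  # {'seqlogo': '1.0.0'}
```

Whatever survives this filter is a reasonable starting point for the requirements.txt you hand to someone else.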
There is a more explicit way to create exact copies of your conda working environments though. Again, this is very explicit. Therefore, it is not as quick as using the conda env export method.
conda list --name PROJECT_NAME --explicit > spec-file.txt
The person you are sharing this file with would then use:
conda create --name PROJECT_NAME --file spec-file.txt
I know that this post was long-winded. There are a lot of posts/forums/cheatsheets that really list out these steps more succinctly than I did here, but I thought it was important to understand why I use virtual environments...and that isn't always seen in some of those pages.
In the end, I hope this was helpful. If you haven't subscribed yet, fill out a form to receive content when it is released. Also, please feel free to leave any comments below, share it using any of the social media links below, or just @ me on Twitter. Thank you!