R v Python: Dawn of Analytics

I've just completed a 3-month internship at a startup that does data consulting for chemical engineering companies. It was a great experience overall and I learned a lot, including how to program in R. You see, the Amaral Lab (where I'm currently doing a PhD) is almost exclusively a Python lab. Every now and then other languages attempt to break in, but they never get very far (looking at you, MATLAB). So, among other reasons, I saw this internship as a good opportunity to learn R, since along with Python it is one of the main programming languages used in the data science world.

A simple Google search will yield tons of R vs. Python comparisons, and people can be very entrenched in either camp. Since I had not read any such comparison (true!) before starting my internship, I thought I would share my fresh yet slightly biased perspective on this data science language war by focusing on a few key points.

Python is a general-purpose programming language created by Guido van Rossum and first released in 1991. It's a high-level language built with a focus on readability and simplicity of syntax. Despite this (or because of this) it is an extremely popular language used in many applications, with an extensive library of third-party packages. R is a programming language created by Ross Ihaka and Robert Gentleman in 1993. Unlike Python, R was developed mainly for statistical analysis. Because of this specialization, R can be faster than plain Python for some computations, particularly operations on whole vectors. Just like Python, R also has a very large repository of packages.

In terms of syntax, both languages have dynamic typing (no need to specify the type of a variable before assigning a value to it), support loops and conditionals, and use logic and math symbols in mostly the same ways. Python is famous for its use of indentation to delimit blocks of code while R is more "classic" in that it requires curly braces for any block of code longer than a single line.
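To make the Python side of this concrete, here is a minimal sketch of dynamic typing and indentation-delimited blocks. The names (value, classify, labels) are made up for illustration; in R the function body and the branches would be wrapped in curly braces instead.

```python
# Dynamic typing: the same name can be rebound to values of different types
# without declaring a type first.
value = 42
first_type = type(value).__name__
value = "forty-two"
second_type = type(value).__name__


def classify(n):
    # Indentation alone marks the function body and each branch.
    if n % 2 == 0:
        return "even"
    else:
        return "odd"


labels = []
for n in range(4):
    labels.append(classify(n))

print(first_type, second_type)  # int str
print(labels)  # ['even', 'odd', 'even', 'odd']
```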

In R, a variable of a basic type (number, string, logical, etc.) can hold either a single value or a vector of values, whereas Python has both simple variable types (int, float, str, bool) and collection types (list, tuple, etc.). R also allows lists of variables, which are not necessarily vectors! Sometimes, to get the desired behavior, you need to unlist a variable first. I found this very baffling at first. On the plus side, R's built-in vector capabilities mean there's very little need for explicit for-loops, and a vector operation in R is usually faster than the equivalent for-loop. Similar performance in Python requires the array data type provided by the NumPy package.
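As a rough illustration of the Python side, the sketch below applies the same arithmetic once as a NumPy vector operation and once as an explicit for-loop. The numbers are arbitrary and no timing is shown; it only demonstrates that the two forms give the same result.

```python
import numpy as np

values = np.array([1.0, 2.0, 3.0, 4.0])

# Vectorized: the arithmetic is applied to every element of the array at once,
# much like an operation on an R vector.
vectorized = values * 2 + 1

# Equivalent explicit loop over a plain Python list.
looped = []
for v in values.tolist():
    looped.append(v * 2 + 1)

print(vectorized.tolist())  # [3.0, 5.0, 7.0, 9.0]
print(looped)               # [3.0, 5.0, 7.0, 9.0]
```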

The main structure used to analyse data is arguably the dataframe. It's essentially a table of data with convenient properties that allow for manipulation of whole columns or rows at once. It exists natively in R, but the dplyr package is often used to expand the dataframe manipulation tools. Python also has dataframes via the pandas package. In my experience, both languages provide very similar dataframe functionality.
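For readers who have not met dataframes yet, here is a small pandas sketch of the kind of whole-column manipulation described above. The table and its column names (reactor, temperature_c) are invented for illustration and do not come from any real dataset.

```python
import pandas as pd

# Hypothetical measurements, invented for illustration only.
df = pd.DataFrame({
    "reactor": ["A", "A", "B", "B"],
    "temperature_c": [80.0, 90.0, 100.0, 110.0],
})

# Whole-column arithmetic: no explicit loop over rows.
df["temperature_k"] = df["temperature_c"] + 273.15

# Keep only the rows that satisfy a condition.
hot = df[df["temperature_c"] > 85.0]

# Aggregate per group, similar in spirit to group_by() + summarise() in dplyr.
mean_by_reactor = df.groupby("reactor")["temperature_c"].mean()

print(hot)
print(mean_by_reactor)
```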

A very big and important difference between R and Python is how they organize documentation. In R, documentation for several related functions is grouped together. First there's a list of all related functions, then a table with the combined arguments of all the functions, followed by a wall of text explaining what each function does and what the return values are. For an example, see the page for grep. In Python, every function is instead documented separately. The arguments and return value are always listed but not always fully explained. The closest Python equivalent to R's grep is the re module.
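To give a feel for how the re module covers similar ground to R's grep family, here is a hedged sketch. The sample strings are made up, and the mapping to the R functions in the comments is only approximate (for one thing, R indices start at 1 while Python's start at 0).

```python
import re

# Hypothetical input lines, invented for illustration only.
lines = ["alpha 1", "beta", "gamma 22"]
pattern = re.compile(r"\d+")  # one or more digits

# Roughly like grep(pattern, x): positions of the elements that match.
matching_indices = [i for i, line in enumerate(lines) if pattern.search(line)]

# Roughly like grep(pattern, x, value = TRUE): the matching elements themselves.
matching_lines = [line for line in lines if pattern.search(line)]

# Roughly like gsub(pattern, "#", x): replace every match in every element.
replaced = [pattern.sub("#", line) for line in lines]

print(matching_indices)  # [0, 2]
print(matching_lines)    # ['alpha 1', 'gamma 22']
print(replaced)          # ['alpha #', 'beta', 'gamma #']
```

In an interactive session, help(re.search) prints the documentation for that single function, which is the per-function style described above.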

Ultimately, most data scientists seek to share their work online. This entails creating a web app to showcase data and/or visualizations and hosting it on a server somewhere. Python users typically turn to Django or Flask for the app creation. This process can be quite cumbersome as, much like Python itself, these frameworks are general purpose: Flask is just a bare-bones package for web apps, while the more powerful Django has a 7-part tutorial on how to make a simple poll! The process is much easier with R. Using Shiny, an R-based web framework, you can have an interactive web app with plots up and running locally in just a few minutes.

If you're starting to learn how to program in Python you may be using the excellent Jupyter Notebook. This browser-based programming tool is great for quick coding and data visualization. The R kernel for Jupyter is relatively new so there are still some kinks, but the tool is in active development so expect improvements soon. Still, R in Jupyter already allows for much of the same testing and visualization available via the Python kernel. A solid, and much more powerful, alternative for R programming is RStudio (Shiny was built by the RStudio people). It's a fully featured IDE that is very easy to set up.

So, where do I stand on this great war? ... Well, nowhere really. As is often the case, the truth lies somewhere in the middle. As I outlined above, Python and R have different strengths and are by no means mutually exclusive. Why choose a single language for all things when you can mix the best of both worlds?