2.3. Why Python?

2.3.1. High level reasons

  1. It is a general purpose programming language. The problem with learning how to use a more special-purpose tool is that effort expended in learning how to solve particular kinds of problems doesn’t extend well to other kinds of problems. Effort expended learning Python gives you access to a wide variety of data processing and data analysis tools.
  2. It is a high-level language. This means it takes care of many of the details you would otherwise have to handle yourself in other languages. It also means that if you can break the solution to a problem down into simple steps, there is probably a natural way to express those steps in Python. More simply put, Python can be useful in a variety of contexts and useful Python programs can be very short.
  3. It is used by a large, active community, including researchers in science and social science. Help is available. The problems you encounter are likely to have already been solved.
  4. It is taught in many university courses, both inside and outside computer science departments, as well as in industry (for example, Google's Python course). This makes collaboration easier.
  5. There are a lot of teaching resources, and not just because of (4). The Python community (for a variety of reasons) has taken the problem of teaching the language seriously.
  6. Writing a readable program is one of the best ways to track precisely what you have done with your data to get your results. A clear path from the raw data to the results is one of the best ways to analyze problems with your results and develop new questions. Python makes it easy to write clear programs.
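
As a small illustration of points (2) and (6), here is a sketch of a short program that breaks a task into simple steps: split a text into words, count them, and report the most frequent ones. The sample text is made up for illustration; Counter comes from Python's standard collections module.

```python
from collections import Counter

# Hypothetical sample text; in practice this might be read from a file.
text = "the quick brown fox jumps over the lazy dog the end"

# Step 1: split the text into words.
words = text.split()

# Step 2: count how often each word occurs.
counts = Counter(words)

# Step 3: report the three most frequent words with their counts.
top_three = counts.most_common(3)
print(top_three)
```

Each step of the solution corresponds to one line of code, which is what makes the path from raw data to result easy to follow.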

2.3.2. Practical reasons

  • Free
  • Stable
  • Portable across many platforms
  • Mature
    • Lots of applications of many types
    • Lots of solid programming libraries
    • Programs working on many platforms (Windows, Unix, MacOS)
    • Portability claims tested
  • Good for beginners
  • Good for rapid prototyping
  • Easy access to documentation
  • Extensions
    • C
    • C++
    • Java
    • Fortran
  • Everything is an object, which is good for debugging and, in turn, for rapid prototyping.
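
To make the last point concrete, the following sketch shows that numbers, strings, lists, and functions can all be inspected in the same way, which is part of why poking at a running program is easy. The function double is a made-up example.

```python
def double(x):
    """Return twice the input."""
    return 2 * x

# Numbers, strings, lists, and even functions are objects
# with a type and a set of attributes that can be listed.
for thing in (42, "spam", [1, 2, 3], double):
    print(type(thing).__name__, "has", len(dir(thing)), "attributes")

# Because a function is an object, it can be stored in a data structure
# and inspected like any other value.
operations = {"double": double}
print(operations["double"](21))
print(double.__name__, "-", double.__doc__)
```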

2.3.3. Data Science

Today, the most popular alternatives to Python for data scientists are R, Matlab/Octave, and Mathematica/Sage. Python is becoming a more and more popular choice, in part because it has imported or adapted some of the tools available in these competing languages. In answer to the question, ‘Why is Python the language of choice for data scientists?’, Wes McKinney, author of Python for Data Analysis, basically gave a list of good ideas that had been implemented in Python, some imported from other languages. The following list is adapted from his, which appeared in a recent Quora post:

  1. The Python community invested in the mid-1990s in an “extension to Python to support numeric analysis as naturally as Matlab does.” This evolved into NumPy. Support for Matlab-like array manipulation is a major reason to prefer Python over Perl and Ruby.
  2. Built on top of numpy, the scipy library provides a variety of statistical and scientific computing tools. The numpy and scipy libraries are being commercially supported by Enthought (whose Python distribution is recommended for this course).
  3. The plotting functionality of Matlab was ported to Python with matplotlib. Support for Matlab-like plotting is another advantage Python has over Perl and Ruby.
  4. From R, the data frame and associated manipulations have been implemented in the Python library pandas.
  5. The scikit-learn project has implemented an sklearn module in Python which provides a common interface to many machine learning algorithms, similar to the caret package in R.
  6. From Mathematica/Sage, the concept of a “notebook” has been implemented with IPython notebooks.

In this course, you will be introduced to the libraries McKinney mentions above: numpy, scipy, matplotlib, pandas, and sklearn.
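
To give a flavor of the Matlab-like array manipulation mentioned in the first item, here is a minimal sketch, assuming NumPy is installed. The temperature values are made up for illustration.

```python
import numpy as np

# Hypothetical measurements in degrees Celsius.
celsius = np.array([12.5, 18.0, 21.5, 30.0])

# Arithmetic applies to the whole array at once; no explicit loop is needed.
fahrenheit = celsius * 9 / 5 + 32

# A boolean mask selects elements, much as logical indexing does in Matlab.
warm = celsius[celsius > 20]

# A 2 x 3 matrix multiplied by its transpose with the @ operator.
m = np.arange(6).reshape(2, 3)
product = m @ m.T

print(fahrenheit)
print(warm)
print(product)
```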

McKinney (wisely) also mentions some of Python’s deficiencies.

  1. A more cumbersome syntax for array manipulations and formula specification. The Matlab/Octave syntax for array manipulation is still preferred (that’s why it’s used in the Stanford ML class, for example), and the R syntax for formula specification is quite nice.
  2. Lack of a Python equivalent to R’s ggplot2 for static graphics and JavaScript’s D3 for interactive graphics. The matplotlib library is hard to install, hard to use, and does not facilitate building interactive graphics for the web. [We are going to use matplotlib in this course. Recognizing that McKinney is right, we will relieve you of the burden of installation by using a prepackaged Python distribution, and by steering clear of the topic of web-based visualization. The hard to use part is right, too, but we will soldier on.]
  3. The scalability limits of NumPy and pandas when working with large data sets.
  4. The lack of an embedded, declarative language for data manipulation, similar to the LINQ project. Pandas is useful as a low-level data manipulation toolkit, but tracking down the custom Pandas syntax for complex operations can be frustrating.
  5. The lack of an IDE of similar quality to R Studio.

2.3.4. What about statistics?

Since the class is about using Python primarily in a data analysis and data collection context, the question arises: what about doing statistical analysis? Is Python the right tool? I will try to answer this question by focusing on the most popular open source statistical tool package, R. But the points made below probably apply equally well to any mature statistical package, such as SAS.

I found the following stackexchange.com post by an Austin data scientist named Ben Dundee quite on point. It accurately describes some of the tradeoffs, and touches on some points that are quite general.

Background: I’m a data scientist at a startup in Austin, and I come from grad school (Physics). I use Python day-to-day for data analysis, but use R a bit. I also use C#/.NET and Java (just about daily), I used C++ heavily in grad school.

I think the main problem with using Python for numerics (over R) is the size of the user community. Since the language has been around for ever, lots of people have done things that you’re likely to want to do. This means that, when faced with a hard problem, you can just download the package and get to work. And R “just works”: you give it a dataset, and it knows what summary statistics are useful. You give it some results, and it knows what plots you want. All the common plots you’d want to make are there, even some pretty esoteric ones that you’ll have to look up on Wikipedia. As nice as scipy/numpy/pandas/statsmodels/etc. are for Python, they’re not at the level of the R standard library.

The main advantage of Python over R is that it’s a real programming language in the C family. It scales easily, so it’s conceivable that anything you have in your sandbox can be used in production. Python has Object Orientation baked in, as opposed to R where it feels like kind of an afterthought (because it is). There’s other stuff that Python does nicely too: threading and parallel processing are pretty easy, and I’m not sure if that’s the case in R. And learning Python gives you a powerful scripting tool, too. There are also really good (free) IDEs for Python, much better ones if you’re willing to pay (less than $100), and I’m not sure this is the case for R–the only R IDE I know of is R Studio, which is pretty good, but isn’t as good as PyDev + Eclipse, in my experience.

I’ll add this as a bit of a kicker: since you’re still in school, you should think about jobs. You’ll find more job postings for highly skilled Python devs than you will for highly skilled R devs. In Austin, jobs for Django devs are kind of falling out of the sky. If you know R really well, there are a few places where you’ll be able to capitalize that skill (Revolution Analytics, for example), but lots of shops seem to use Python. Even in the field of data analysis/data science, more people seem to be turning to Python.

And don’t underestimate that you may work with/for people who only know (say) Java. Those people will be able to read your Python code pretty easily. This won’t necessarily be the case if you do all of your work in R. (This comes from experience.)

Finally, this may sound superficial, but I think the Python documentation and naming conventions (which are religiously adhered to, it turns out) is a lot nicer than the utilitarian R doc. This will be hotly debated, I’m sure, but the emphasis in Python is readability. That means that arguments to Python functions have names that you can read, and that mean something. In R, argument names are often truncated—I’ve found this less true in Python. This may sound pedantic, but it drives me nuts to write things like ‘xlab’ when you could just as easily name an argument ‘x_label’ (just one example)—this has a huge effect when you’re trying to learn a new module/package API. Reading R doc is like reading Linux man pages—if that’s what floats your boat, then more power to you. When I have a question about how something works in R, I avoid the R documentation, whereas I START with the Python doc when I’m confused about Python.

All of that being said, I’d suggest the following (which is also my typical workflow): since you know Python, use that as your first tool. When you find Python lacking, learn enough R to do what you want, and then either:

  • Write scripts in R and run them from Python using the subprocess module, or
  • Install the RPy module.

Use Python for what Python is good at and fill in the gaps with one of the above. This is my normal workflow—I usually use R for plotting things, and Python for the heavy lifting.

So to sum up: because of Python’s emphasis on readability (search google for “Pythonic”), the availability of good, free IDEs, the fact that it’s in the C family of languages, the greater possibility that you’ll be able to capitalize the skillset, and the all-around better documentation-style of the language, I’d suggest making Python your go-to, and relying on R only when necessary.
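
Dundee's first option, running an R script from Python with the subprocess module, might look something like the sketch below. It assumes R is installed and that its Rscript executable is on the PATH. The function names build_r_command and run_r_script, and the file names in the usage comment, are hypothetical.

```python
import shutil
import subprocess

def build_r_command(script_path, *args):
    """Build the argument list for running an R script with Rscript."""
    return ["Rscript", script_path, *args]

def run_r_script(script_path, *args):
    """Run an R script and return whatever it wrote to standard output.

    Assumes the Rscript executable (shipped with R) is on the PATH.
    """
    if shutil.which("Rscript") is None:
        raise RuntimeError("Rscript was not found on the PATH")
    result = subprocess.run(
        build_r_command(script_path, *args),
        capture_output=True,  # collect stdout and stderr instead of printing them
        text=True,            # decode the output as str rather than bytes
        check=True,           # raise CalledProcessError if R exits with an error
    )
    return result.stdout

# Hypothetical usage, with made-up file names:
# output = run_r_script("analysis.R", "data.csv")
```

Passing the command as a list rather than a single string avoids shell quoting problems when file names contain spaces.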

The takeaway here is that Python is not a replacement for a mature statistics package like R. For many data analysis problems, you will find the measures you want packaged up with intelligence and care in R, and even where a comparable solution might be available in Python, it may be more work to squeeze out exactly what you want. Python, on the other hand, is a better all-purpose programming language; it’s easier to learn; and the code is more readable.

A point Dundee does not make, but which is important, is that moving between R and Python is made much easier by Python packages like pandas, mentioned by McKinney above. Pandas provides data frames that are virtually identical to R data frames.
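
As a rough illustration of how closely pandas follows R here, the sketch below builds a small data frame and performs two common operations, with a comparable R expression noted in the comments. It assumes pandas is installed; the data and column names are made up.

```python
import pandas as pd

# Hypothetical data; the column names are made up for illustration.
df = pd.DataFrame({
    "group": ["a", "a", "b", "b"],
    "score": [1.0, 3.0, 2.0, 6.0],
})

# Select rows by a condition on a column.
# Comparable R: df[df$score > 1.5, ]
high = df[df["score"] > 1.5]

# Compute the mean score within each group.
# Comparable R: aggregate(score ~ group, data = df, FUN = mean)
means = df.groupby("group")["score"].mean()

print(df.head())
print(high)
print(means)
```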

2.3.5. The Appropriateness Principle

The discussion about R makes clear a point that will be important throughout. Python is not good at everything.

The basic justification for Python is that it is readable and good at many things, and that it is good at combining the results of other programs. Thus it is an excellent choice as the toplevel language of a large data analysis project, and may also be the right choice for many of the modules of that project. But it is not always the right choice.

As the discussion of R shows, Python has shortcomings as a statistical analysis tool. Another area where Python has shortcomings is visualization. Later in this course, we will recommend an R program for visualizing word clouds, and we will recommend a Java program (Gephi) for visualization of large graphs.

So the operational principle is the Appropriateness Principle.

Always use Python for the things it can do. Never use Python for the things it can’t do.