I’ve been using Python at work recently, primarily investigating some of the clustering algorithms available in ‘scikit-learn’. I’ve mainly used R and the ‘caret’ package for this sort of thing in the past. However, the steps are more or less the same in both languages: install a package to do some exploratory plots (‘matplotlib’ in Python and maybe ‘ggplot’ in R), then another package to do the clustering (‘caret’, I think, does this in R and ‘scikit-learn’ in Python). Then, depending on the complexity of the project, there are plenty of other packages you might end up installing as well, e.g. ‘numpy’ and ‘scipy’ in Python or maybe ‘dplyr’ in R.
All in all, there are a lot of packages in the scientific computing/data science open source community these days, and keeping everything up to date and dealing with conflicts when they occur can be a bit of a pain!
However, yesterday I came across Anaconda (https://docs.anaconda.com), a package management system for Python. Effectively, one installation provides all the packages you are ever likely to need. You can download and use it on your desktop at home, but there is also a cloud enterprise version available. I installed it on my ‘modest’ Ubuntu laptop at home without any problems.
Given a new tool to play with, I next tried to think of a small project I could complete without losing too much of my weekend! I have been meaning to have a look at ‘jupyter’ for a while now. It allows for a combination of Markdown, code and plots and, it seems, handles LaTeX, so I can include some maths. As I might be teaching statistical distributions to the apprentices at work, I decided on the following project:
- Review and write up a derivation of the Poisson distribution - using LaTeX
- Download the football scores from last season - running a wget bash command within the jupyter document
- Show that the total number of goals scored per game follows, roughly, a Poisson distribution - using numpy and scipy
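As a rough illustration of the last step, here is a minimal sketch of how observed goal counts might be compared with a Poisson distribution using numpy and scipy. It is not the code from my notebook: the function name `poisson_comparison` is made up for this example, and the goal counts are simulated (with an arbitrary rate of 2.7 and 300 games) rather than taken from real football results.

```python
import numpy as np
from scipy import stats


def poisson_comparison(goals_per_game):
    """Compare observed goal frequencies with a fitted Poisson distribution.

    Hypothetical helper for illustration only.
    """
    goals = np.asarray(goals_per_game, dtype=int)
    # The maximum-likelihood estimate of the Poisson rate is the sample mean.
    lam = goals.mean()
    # counts[k] is the number of games in which exactly k goals were scored.
    counts = np.bincount(goals)
    observed = counts / counts.sum()
    k = np.arange(len(counts))
    # Poisson probability of exactly k goals, given the fitted rate.
    expected = stats.poisson.pmf(k, lam)
    return lam, k, observed, expected


# ASSUMPTION: simulated data standing in for real scores.
rng = np.random.default_rng(0)
simulated_goals = rng.poisson(lam=2.7, size=300)

lam, k, observed, expected = poisson_comparison(simulated_goals)
print(f"Estimated goals per game: {lam:.2f}")
for n_goals, obs, exp in zip(k, observed, expected):
    print(f"{n_goals:2d} goals: observed {obs:.3f}, Poisson {exp:.3f}")
```

With real data, the total goals per game (home plus away) would replace `simulated_goals`; a bar chart of `observed` against `expected` is then an easy visual check.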
I’ve done things like this when I’ve taught statistics in the past. However, using ‘jupyter’ it should be possible to bring all of this together in one document. It took a bit longer than I’d intended, but everything worked well (see below for a link). I confess I didn’t do a full derivation of the Poisson distribution, as I don’t have that much patience with LaTeX, but I added a few equations. My thanks go to Glen Cowan for doing an excellent job of presenting the analysis in a freely available document.
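For reference, the standard result can be written in a few lines of LaTeX. This is a sketch of the usual textbook form, not necessarily the equations in the notebook or in Glen Cowan's document: the Poisson distribution arises as the limit of a binomial distribution with $n$ trials and success probability $p = \lambda / n$ as $n \to \infty$.

```latex
\begin{align}
P(k; n, p) &= \binom{n}{k} p^{k} (1 - p)^{n - k},
  \qquad p = \frac{\lambda}{n} \\
\lim_{n \to \infty} P\!\left(k; n, \tfrac{\lambda}{n}\right)
  &= \frac{\lambda^{k} e^{-\lambda}}{k!},
  \qquad k = 0, 1, 2, \dots
\end{align}
```

Here $\lambda$ is both the mean and the variance of the distribution, which is why the average number of goals per game is the natural estimate of it.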
To get started with jupyter I would recommend this video:
Hope you found this useful
Regards John S