9.8. Word clouds

Note

The text_to_word_cloud module discussed in this section can be downloaded (text_to_word_cloud.py).

A word cloud is a graphical representation of the vocabulary of a document or set of documents, which tries to represent the relative topical importance of words by varying font size. Despite some intrinsic limitations, it is an excellent tool for document visualization which often tells you something about the topic or key ideas.

[Example: word cloud for these course notes]

The programming language R provides a very good, very easy-to-use word cloud package. Rather than trying to translate the ideas of the R program into Python, we will write a Python module called text_to_word_cloud that calls the R wordcloud package using the Python subprocess module.

The text_to_word_cloud script is very simple: it outputs some data in a form which R can read in as a table; it calls R and R makes the word cloud.

In order for this last part to work, R must of course be installed and runnable on your computer, and the wordcloud package must be installed from some R repository (see your R docs for instructions on how to do this).

9.8.1. What is a word cloud?

A word cloud attempts to convey the subject matter of a text by displaying the high-value words in the text.

Every word has a score. The simplest choice of score is the word's frequency.

Word sizes are determined by their scores, and words are laid out on the page by an algorithm that tries to place high-scoring words nearer to the center and to avoid overlap. The first of the two pictures below was constructed from a large set of data collected from anti-vaccine websites, using raw word counts as the scores and excluding the most frequent “stop” words. The second used a score of statistical significance, which measures how much the frequency of a word in this collection of texts departs from its expected frequency given how common the word is. Rare words that show up more often than expected (like autism) take center stage, even if their raw frequency is not that high.

../_images/anti_vaccine_ct_cloud.png
../_images/anti_vaccine_mi_cloud.png
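The raw-count scoring used for the first picture can be sketched in a few lines of Python. This is only an illustration: the function name frequency_scores, the tokenizing regular expression, and the tiny stop list are assumptions made for this example, not part of the text_to_word_cloud module.

```python
import re
from collections import Counter

# A tiny illustrative stop list (an assumption); a real one would be much longer.
STOP_WORDS = {"the", "a", "an", "of", "and", "to", "in", "is", "are", "that"}


def frequency_scores(text, stop_words=STOP_WORDS):
    """Return a Counter mapping each non-stop word to its raw count."""
    # Lowercase the text and keep runs of letters and apostrophes as tokens.
    tokens = re.findall(r"[a-z']+", text.lower())
    return Counter(tok for tok in tokens if tok not in stop_words)


scores = frequency_scores("The vaccine and the autism claim: the vaccine is safe.")
print(scores.most_common(3))
```

A dictionary of this kind, mapping words to scores, is the sort of data that a word cloud is drawn from.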

9.8.2. Using the text_to_word_cloud module

[Some background about the word_counts_dict, which has been discussed in a previous module.]

If R is installed and runnable as a command-line program in batch mode, and if the wordcloud package has been installed in R, simply run:

>>> save_il_to_word_cloud_file(word_cloud_file,word_counts_dict,vocab_size,call_R=True)

and R will be called to create the word cloud PDF file.

If you just want to create the R-readable data file:

>>> save_il_to_word_cloud_file(word_cloud_file,word_counts_dict,vocab_size)

The font size arguments of save_il_to_word_cloud_file often have to be tweaked. They specify the sizes of the largest and smallest fonts in the word cloud. The range of sizes may have to be made smaller in order to get a healthy-looking, roughly circular word cloud to fit on a page. The vocab size will often have to be reduced as well.

If R is not run within Python, the following R commands will create the .pdf file:

> require(wordcloud)
Loading required package: wordcloud
Loading required package: Rcpp
Loading required package: RColorBrewer
> mr = read.table("word_cloud.dat",header=TRUE)
> pal <- brewer.pal(9,"BuGn")
> pal <- pal[-(1:4)]
> wordcloud(mr$Word,mr$Score,c(4,.3),2,,FALSE,,.15,pal)

Warning

These commands are in R and will not work in Python.

In the R code above, the expression “c(4,.3)” is an R vector which specifies the sizes of the largest and smallest fonts. As noted above, that range of sizes often has to be tweaked in order to make the word cloud words fit comfortably on the display canvas. The vocab size can also be reduced to do this.

So, for example, for a normal English vocabulary of about 500 words:

> wordcloud(mr$Word,mr$Score,c(6,.3),2,,FALSE,,.15,pal)

gives a good-looking word cloud.

9.8.3. How does text_to_word_cloud work?

The Python script outputs a data file (.dat extension), a simple text format which R can read in as a table; it then calls R and R makes the word cloud using its wordcloud package.

The text_to_word_cloud module thus illustrates Python as a scripting tool. It is a Python program that calls an external program that is not written in Python. This is done using the Python standard library module subprocess.

[Discuss using the subprocess module. Reference Hellmann’s PyMOTW discussion.]
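The two steps described above, writing a table and then handing it to R, might look roughly like the following. This is a minimal sketch, not the actual text_to_word_cloud code: the function names write_score_table, write_r_script, and run_r are hypothetical, the file layout (a "Word Score" header followed by one word per row) is inferred from the read.table call shown earlier, and the sketch assumes that Rscript is on the PATH, that paths contain no backslashes or quotes, and that words contain no whitespace, quote, or # characters.

```python
import subprocess
from pathlib import Path


def write_score_table(path, word_scores, vocab_size):
    """Write the vocab_size highest-scoring words as a Word/Score table."""
    ranked = sorted(word_scores.items(), key=lambda item: item[1], reverse=True)
    top = ranked[:vocab_size]
    with open(path, "w", encoding="utf-8") as handle:
        handle.write("Word Score\n")  # header row, read by header=TRUE in R
        for word, score in top:
            handle.write(f"{word} {score}\n")
    return len(top)


def write_r_script(script_path, data_path, pdf_path):
    """Write a short R script that reads the table and draws the cloud to a PDF."""
    lines = [
        "library(wordcloud)",
        f'mr <- read.table("{data_path}", header=TRUE)',
        f'pdf("{pdf_path}")',
        "wordcloud(mr$Word, mr$Score, scale=c(4, .3), min.freq=2, random.order=FALSE)",
        "dev.off()",
    ]
    Path(script_path).write_text("\n".join(lines) + "\n", encoding="utf-8")


def run_r(script_path):
    """Run the script with Rscript; raises CalledProcessError if R reports failure."""
    return subprocess.run(["Rscript", str(script_path)], check=True)
```

Passing the command as a list rather than a single string avoids shell quoting problems, and check=True turns a failed R run into a Python exception instead of a silent failure.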

9.8.4. Advanced word clouds

Using other scores

  1. Mutual information

  2. A pointwise information number like Pointwise Mutual Information, but using \frac{P(w)}{Poisson(w)} (for low counts Binom(w) would be more accurate).