12.11. Assorted assignments

  1. [Adapted from the NLTK Book, Chapter 2]. Pollard problem. Read the BBC News article: UK’s Vicky Pollard’s left behind. The article gives the following statistic about teen language: “The top 20 words used, including yeah, no, but and like, account for around a third of all words.” The background question is: How many word types account for a third of all word tokens in English in general?

    To answer this question, you will use a balanced corpus of English texts, that is, a corpus collected to represent a balanced variety of English text types: fiction, poetry, speech, nonfiction, and so on. One relatively well-established, free, and easy-to-get example of such a corpus is the Brown Corpus. Brown is about 1.2 M words.

    To get the Brown corpus, do the following in Python:

    >>> import nltk
    >>> nltk.download()

    This brings up a window you can interact with. There are some tabs at the top. Choose the tab labeled Corpora, select Brown, and click the download button at the bottom of the window. You will then have Brown on your machine and you can import the corpus as follows:

    >>> from nltk.corpus import brown

    The following returns a list of all 1.2 M word tokens in Brown:

    >>> brown.words()
    ['The', 'Fulton', 'County', 'Grand', 'Jury', 'said', ...]

    Loop through this list and make a frequency distribution dictionary (the value associated with word w is the count of how many times w has occurred in Brown). Then do some more computing. How many word types account for a third of all word tokens in Brown? What do you conclude about the statistic cited in Vicky Pollard’s article? Read more about this on LanguageLog.

    To hand in: (a) answers to the above questions, and (b) the code you used to find out how many word types account for a third of all word tokens in Brown.
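
    The counting step can be sketched on a toy token list before you run it on Brown. This is a minimal sketch, not a model answer: the function name types_covering and the toy sentence are made up for illustration, and it makes no decision for you about lowercasing or about the punctuation tokens that brown.words() includes.

```python
from collections import Counter


def types_covering(tokens, fraction=1 / 3):
    """Return how many of the most frequent word types are needed to
    cover at least `fraction` of all tokens."""
    counts = Counter(tokens)
    target = fraction * len(tokens)
    covered = 0
    # Walk through the types from most to least frequent, accumulating counts.
    for n_types, (_, count) in enumerate(counts.most_common(), start=1):
        covered += count
        if covered >= target:
            return n_types
    return len(counts)


# Toy data standing in for brown.words(); these are not real corpus counts.
toy_tokens = "the cat saw the dog and the dog saw the cat".split()
print(types_covering(toy_tokens))
```

    Once the corpus is downloaded, the same function should accept brown.words() in place of toy_tokens.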

  2. Go to the Social Security Administration US births website, select the births table there, and copy it to your clipboard. Use the pandas read_clipboard function to read the table into Python, and use matplotlib to plot male and female births for the years covered in the data. Turn this in as an ipython notebook file.

  3. Use Python to get a list of male and female names from these files (link needed). Use Python Counters to get letter frequencies for male and female names. Use matplotlib to draw a plot of first letters against name frequency for male and female names. A similar plot for last letters is shown below. Turn this in as an ipython notebook file.
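
    The letter-counting step might look like the following sketch. The two short name lists are hypothetical stand-ins for the lists you read from the files, and the helper first_letter_counts is an illustrative name, not something provided in class.

```python
from collections import Counter

# Hypothetical stand-ins for the name lists read from the course files.
male_names = ["Aaron", "Adam", "Brian", "Carl"]
female_names = ["Alice", "Anna", "Beth", "Cara"]


def first_letter_counts(names):
    """Count how often each letter starts a name (case-insensitive)."""
    return Counter(name[0].lower() for name in names if name)


male_first = first_letter_counts(male_names)
female_first = first_letter_counts(female_names)

# Put both counters on one sorted alphabet so they can be plotted side by side.
letters = sorted(set(male_first) | set(female_first))
for letter in letters:
    # A Counter returns 0 for letters it has never seen.
    print(letter, male_first[letter], female_first[letter])
```

    The letters list and the two series of counts are in a form that a matplotlib bar or line plot can take directly.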

  4. Using regular expressions, extract all the proper names from the Project Gutenberg version of Pride and Prejudice. Build a frequency table using a Python Counter and display the 50 most frequent names and their frequencies. Hint: Distinguish capitalized first names from capitalized first words of sentences by using the regular expression for sentence endings provided in class. Turn this in as an ipython notebook file. You may have to filter some non-names from the list by hand. Do so, then answer the following questions:

    1. Are all the top names names of people?
    2. Spoiler alert: Darcy is the man Elizabeth is going to marry at the end. What is his rank in the names ranking? What does this tell us about the kind of world Austen is writing about?
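
    One rough way to approach the extraction step is sketched below on a toy passage. The pattern is an assumption made for illustration, not the sentence-ending regular expression provided in class: it simply treats a capitalized word as a candidate name when it follows a lowercase letter, comma, or semicolon and a single space.

```python
import re
from collections import Counter

# Toy passage written for this example; it is not text from the novel.
text = ("Elizabeth looked at Darcy. He bowed to Elizabeth and left. "
        "Then Jane spoke to Elizabeth and Bingley.")

# A capitalized word preceded by a lowercase letter, comma, or semicolon and
# a single space is taken to be sentence-internal, hence a candidate name.
name_pattern = re.compile(r"(?<=[a-z,;] )[A-Z][a-z]+")

name_counts = Counter(name_pattern.findall(text))
print(name_counts.most_common(3))
```

    A pattern this simple misses names that begin a sentence and names that follow an abbreviated title ending in a period, which is one reason some filtering by hand is likely to be needed.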

  5. Download the year 2000 PUMS data for Alabama.

    Use the census_data module provided in class to build a pandas data frame that has income, household ID, and education. Compute a new frame aggregating the income for each household. Then compute the average income for education level 12 or better. You can turn this in as an ipython notebook file.
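
    The aggregation step could be sketched as follows on a toy frame. The column names are hypothetical, not PUMS field names, and the sketch does not use the census_data module. It also assumes one particular reading of the task: income is summed per household, and a household counts as "level 12 or better" if its best-educated member is at level 12 or above.

```python
import pandas as pd

# Toy person-level records; the column names are hypothetical.
people = pd.DataFrame({
    "household_id": [1, 1, 2, 3, 3],
    "income": [30000, 20000, 45000, 10000, 15000],
    "education": [12, 10, 16, 9, 11],
})

# One row per household: total income and the highest education level in it.
households = people.groupby("household_id").agg(
    income=("income", "sum"),
    education=("education", "max"),
)

# Average household income where the highest education level is 12 or above.
mean_income = households.loc[households["education"] >= 12, "income"].mean()
print(mean_income)
```

    If the intended reading is per person rather than per household, the filter would instead be applied to the original frame before averaging.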

  6. This file (link given) contains a dictionary which, for each pair of the top 50 characters in Les Miserables, tells how many times they were mentioned in the same paragraph. Using the function provided in class, read in this dictionary and use it to build a graph representing the book’s network of characters. Choose the layout program that you think displays the network best and save the result as a PDF file. You will be turning in the PDF file and a brief paragraph describing what layout program you used and how well you think the graph actually captures the relatedness of characters. If you haven’t read or seen Les Mis, guess.
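
    Turning such a dictionary into a graph might look like the sketch below. The counts are made up, and the assumption that the keys are (character, character) tuples may not match the real file, so check the structure returned by the function from class first.

```python
import networkx as nx

# Made-up co-occurrence counts; the real dictionary comes from the course file.
# Keys are assumed to be (character, character) tuples.
pair_counts = {
    ("Valjean", "Javert"): 17,
    ("Valjean", "Cosette"): 31,
    ("Cosette", "Marius"): 21,
    ("Javert", "Marius"): 0,
}

G = nx.Graph()
for (a, b), count in pair_counts.items():
    if count > 0:  # skip pairs that never share a paragraph
        G.add_edge(a, b, weight=count)

print(G.number_of_nodes(), G.number_of_edges())
```

    Storing the count as an edge weight keeps it available to layouts and drawing code that can use weights, for example to vary edge thickness.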

  7. Using networkx, read in the network for the Les Miserables dataset [KNUTH1993] (les miserables.gml). Find the 5 most central characters using some measure of centrality. Is the graph connected?
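
    The two questions can be tried out on a small hand-built graph first. The graph below is a toy stand-in for the Les Miserables network, and degree centrality is only one possible choice of measure; the same pattern should work with any networkx centrality function that returns a dictionary.

```python
import networkx as nx

# Toy graph standing in for the Les Miserables network.
G = nx.Graph()
G.add_edges_from([("A", "B"), ("A", "C"), ("A", "D"), ("D", "E")])

# Degree centrality: each node's degree divided by (number of nodes - 1).
centrality = nx.degree_centrality(G)

# Node names sorted from most to least central, cut off at the top 2.
top = sorted(centrality, key=centrality.get, reverse=True)[:2]
print(top)

# True if every node can be reached from every other node.
print(nx.is_connected(G))
```

    For the real data, the graph would be loaded with nx.read_gml and the cutoff changed from 2 to 5.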

  8. Read Mark Newman’s paper on random walk betweenness centrality. There may be things you don’t understand, but soldier on and read the part in which he discusses applying his new model to examples. At the very end he talks about the results of his new centrality measure on the Florentine Families graph, which depicts the alliances among Renaissance Florentine families, as established by marriage ties. It so happens this graph can be constructed in networkx with G = nx.florentine_families_graph(). Run classical nx.betweenness_centrality on this graph (see Section Introduction to Networkx), and compare the results Newman reports for random walk centrality with the results from classical betweenness centrality. What differences are there, if any? Does a different family get chosen as the most central? If need be, for more detail, run the networkx implementation of Newman’s measure, which is also discussed in Section Introduction to Networkx. Since the betweenness functions both return dictionaries, the following code snippet may be of use. It extracts a ranked list from dictionary DD:

    >>> L = [x[0] for x in sorted(DD.items(),key=lambda x: x[1],reverse=True)]

  9. Participate (belatedly) in the 2012 Kaggle insult detection competition, which involves automatically detecting insults in an Internet corpus. All the data is in CSV files. You can satisfactorily complete the assignment by building a text classifier along the lines of those demonstrated in the text classification notebook, the advanced text classification notebook, or the SVM intro notebook <../ipython_notebooks/text/05_svm.ipynb>. Your classifier should train on the train files and test on the test files. On the Kaggle website, the problem is defined so that you can produce confidence scores between 0 and 1 indicating how confident your system is that the data item is an insult (where 1 is 100% confident). An SVM approach would actually allow you to do this, but implementing that would take a fair amount of application and mathematical knowledge, so you need only produce 0’s and 1’s. This strategy would get you a terrible competition score, but it will teach you a lot about how hard this problem is. Evaluate yourself by looking at accuracy, precision, and recall. Pay particular attention to the difference between your training and test scores.
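
    The three evaluation scores can be computed by hand from 0/1 labels, as in the sketch below. The function name evaluate and the label lists are made up for illustration; the label 1 is assumed to mean "insult".

```python
def evaluate(gold, predicted):
    """Return (accuracy, precision, recall) for binary labels, with 1 = insult."""
    pairs = list(zip(gold, predicted))
    tp = sum(1 for g, p in pairs if g == 1 and p == 1)  # true positives
    fp = sum(1 for g, p in pairs if g == 0 and p == 1)  # false positives
    fn = sum(1 for g, p in pairs if g == 1 and p == 0)  # false negatives
    correct = sum(1 for g, p in pairs if g == p)
    accuracy = correct / len(pairs)
    # Fall back to 0.0 when a denominator would be zero.
    precision = tp / (tp + fp) if tp + fp else 0.0
    recall = tp / (tp + fn) if tp + fn else 0.0
    return accuracy, precision, recall


# Made-up labels for illustration only.
gold = [1, 0, 1, 1, 0, 0]
predicted = [1, 0, 0, 1, 1, 0]
print(evaluate(gold, predicted))
```

    Calling the function once with training labels and once with test labels gives the two sets of scores to compare.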