Word2Vec assignment

    Note: This assignment has two parts. The second part is optional. The first part is not.

    Part One

    The file conspiracy_truncated.zip is a zip file you should unzip in your working directory. You should connect to that directory in a command window/terminal before following the directions below.

    The file conspiracy_truncated.zip contains five files:

    1. The file conspiracy_truncated.tgd is a short tagged document. You should have a look at it to get a sense of what the data is like.
    2. The file conspiracy_truncated.word_vecs contains a large VxV term-context matrix computed from conspiracy_truncated.tgd. Each row is a word vector. The matrix contains word counts: if TC is your term-context matrix, TC[i, j] is the number of times context word j occurred within 5 words of target word i in the data. The matrix contains only raw counts, no PPMI values. Computing PPMI values will be your job.
    3. The file conspiracy_truncated.vocab is the vocabulary extracted from conspiracy_truncated.tgd. The ith word in the vocab file corresponds to the ith row of the word vector matrix.
    4. The file conspiracy_truncated.py contains some code you will use for this assignment; it loads the term-context matrix in conspiracy_truncated.word_vecs and provides some utilities for accessing it.
    5. The file demo.py contains a demonstration of how to use the code to load the data, compute the term-context matrix, and get some simple counts. It also shows you how to call a version of cosine on two vectors to compute their similarity. This is the file you will edit to create the document you turn in.

    To get help on this assignment, run demo.py from a command line (an MS-DOS command window or a Mac OS terminal) like this:

    python -i demo.py
    ... [python starts up:  some stuff prints out]
    >>>
    
    You can then interact with Python in an environment in which all the variables set in demo.py are defined. You should inspect this file in an editor because there are numerous useful commands and comments in it.

    What to hand in for the obligatory Part One Assignment

    You will hand in a hard copy with the PPMI values or cosine similarity scores requested for the word pairs below, together with some code (extending the code in demo.py) that shows how you accomplished this feat (a generic sketch of the PPMI computation appears after this list). Note that all vocabulary items are tagged:

    1. PPMI score for vocab word Trump_NNP, context word president_NN. These are represented as the tagged words ('Trump', 'NNP') and ('president', 'NN') in Python. (See the demo script.)
    2. PPMI score for target word news_NN and context word fake_JJ.
    3. Cosine similarity of vocab word propaganda_NN and vocab word news_NN.
    4. Cosine similarity of vocab word Trump_NNP and vocab word Obama_NNP.
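
    If you want a starting point for the PPMI computation, here is a minimal sketch. It assumes only that the raw term-context counts are available as a NumPy array TC; the code in conspiracy_truncated.py and demo.py provides its own accessors and a cosine function, which you should use where they exist.

        import numpy as np

        def ppmi_matrix(TC):
            """Turn a raw-count term-context matrix into a PPMI matrix."""
            TC = np.asarray(TC, dtype=float)
            total = TC.sum()
            p_wc = TC / total                              # joint probabilities P(w, c)
            p_w = TC.sum(axis=1, keepdims=True) / total    # target word marginals P(w)
            p_c = TC.sum(axis=0, keepdims=True) / total    # context word marginals P(c)
            with np.errstate(divide='ignore', invalid='ignore'):
                pmi = np.log2(p_wc / (p_w * p_c))
            pmi[~np.isfinite(pmi)] = 0.0                   # zero counts become 0, not -inf
            return np.maximum(pmi, 0.0)                    # keep only positive PMI

    The PPMI score for a particular pair is then the matrix entry at the row for the target word and the column for the context word.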

    Help with common errors

    Here are some common problems:

    1. You have an out of date version of the zip file conspiracy_truncated.zip. Make sure you have the latest version.
    2. You get a FileNotFound error that looks something like this:
        Loading vocab
        Traceback (most recent call last):
          File "demo.py", line 17, in 
            wvi = read_word_vectors.WordVectorInt(base)
          File "/Users/gawron/Documents/classes/compling/distributional_semantics/word_vec_assignment_part_one/read_word_vectors.py", line 108, in __init__
            self.load_vocab()
          File "/Users/gawron/Documents/classes/compling/distributional_semantics/word_vec_assignment_part_one/read_word_vectors.py", line 117, in load_vocab
            with codecs.open(self.vocab_file, 'r', encoding = 'utf8') as fh:
          File "/Users/gawron/anaconda3/lib/python3.6/codecs.py", line 897, in open
            file = builtins.open(filename, mode, buffering)
        FileNotFoundError: [Errno 2] No such file or directory: '/Users/gawron/Documents/classes/compling/distributional_semantics/word_vec_assignment_part_one/conspiracy_mid.vocab'
        
      This is most likely due to the way you are starting up Python: its default directory is not the one in which the data is found. For example, if you are running the script demo.py from inside a Canopy window, it won't find the vocab file (or the other data files) because Canopy uses some irrelevant home directory as its default. To fix this and keep using Canopy (or any editor with the same problem), edit the file demo.py and tell it where the data is, namely the directory into which you unzipped conspiracy_truncated.zip. Use whatever tool you normally use to find the full path to that directory. Let's say it is a Windows machine and the full path is
        C:\Users\Gawron\Documents\word_vec_assignment
      You would then edit the string value of the variable data_dir in the demo.py script. The relevant lines look like this:
        # "Current directory": You can edit this if you want the data to be elsewhere
        data_dir = os.getcwd()
        base = os.path.join(data_dir, base)
        
      You would edit them to look like this:
        # "Current directory": You can edit this if you want the data to be elsewhere
        data_dir = r"C:\Users\Gawron\Documents\word_vec_assignment"
        base = os.path.join(data_dir, base)
        
      All the files should now be found properly. Note the odd r outside the string quotes in the line setting the data_dir variable. This is Python's "raw string" prefix; it cancels escapes inside the string. Without the r prefix, Python treats each \ in the string as the start of an escape sequence, and in this case interprets \U as the beginning of a Unicode escape. This problem, which arises with pretty much every Windows path you write in Python, can be solved either by using the r prefix or by changing every \ to \\. People usually go with the r prefix.
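
      For example, these two lines set data_dir to exactly the same path; only the spelling differs:

        data_dir = r"C:\Users\Gawron\Documents\word_vec_assignment"     # raw string prefix
        data_dir = "C:\\Users\\Gawron\\Documents\\word_vec_assignment"  # doubled backslashes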

Part Two

Note: The next part of this assignment is optional. It is worth extra credit. I will be happy to comment on your efforts, or answer any questions.

Prep

Follow the easy install directions for Radim Rehurek's gensim module, which implements word2vec, the deep-learning-inspired word embedding component. Those directions are here. The discussion here is based on Rehurek's word2vec tutorial.

Run the following code (which will take some time):

from gensim.models import Word2Vec
from nltk.corpus import brown, movie_reviews, treebank
b = Word2Vec(brown.sents())
mr = Word2Vec(movie_reviews.sents())
t = Word2Vec(treebank.sents())

You have built three word2vec models on very small amounts of data: one on the Brown corpus, one on the movie review corpus, and one on Penn Treebank data. They won't be very good, but they still contain useful information.
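
Since training takes a while, you may want to save the models to disk and reload them in a later session instead of retraining each time. This uses gensim's standard save/load methods; the file names here are just examples:

mr.save('movie_reviews.w2v')
# ... later, in a new Python session ...
from gensim.models import Word2Vec
mr = Word2Vec.load('movie_reviews.w2v')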

Doing things

Let's find the 5 nearest neighbors of the word man, based on the data in the Movie Review corpus:

>>> mr_man_nns = mr.most_similar('man', topn=5)
>>> mr_man_nns
[(u'woman', 0.8220334053039551), (u'girl', 0.6817629933357239), (u'boy', 0.6774479150772095), (u'doctor', 0.6448361873626709), (u'calculating', 0.605032205581665)]
>>> mr.similarity('man','woman')
0.8220334001723657

The value of mr_man_nns is a list of five pairs, each a nearest neighbor together with its similarity to man. These are the five words most similar to man according to the model. To find the similarity score of any two words, use similarity as shown.
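
If you just want the neighbor words without their scores, you can peel them off the pairs:

>>> [word for (word, score) in mr_man_nns]
[u'woman', u'girl', u'boy', u'doctor', u'calculating']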

To complete analogies like man:king::woman:??, you do the following (a rough sketch of the underlying vector arithmetic appears a little further down):

>>> mr.most_similar(positive=['woman', 'king'], negative=['man'])
[(u'ark', 0.7656992077827454), ... ]

A variant proposed in Levy and Goldberg (2014):

>>> mr.most_similar_cosmul(positive=['woman', 'king'], negative=['man'])
[(u'ark', 1.0164012908935547), ... ]

This is a pretty weird answer.

Of course, if you were using the Wikipedia-trained model described in Rehurek's tutorial and Mikolov et al. (2013), you would instead get this as the completion of the analogy:

[('queen', 0.50882536), ...]
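
Under the hood, most_similar with positive and negative word lists is (roughly) doing vector arithmetic: it builds a target vector like king - man + woman and returns the vocabulary words whose vectors are closest to it by cosine similarity. Here is a rough sketch of that idea, using the old-style mr[word] vector access shown elsewhere on this page; gensim itself works with normalized vectors and excludes the query words, so its answers can differ slightly.

import numpy as np

# "man is to king as woman is to ...?"
target = mr['king'] - mr['man'] + mr['woman']

def cosine(u, v):
    return np.dot(u, v) / (np.linalg.norm(u) * np.linalg.norm(v))

# the vocab attribute spelling varies by gensim version; older releases use mr.vocab
candidates = (w for w in mr.wv.vocab if w not in ('king', 'man', 'woman'))
best = max(candidates, key=lambda w: cosine(mr[w], target))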

You can also use the model to find an outlier word that doesn't fit in a group of words. For this, we find the centroid of the word vectors and choose the vector farthest from the centroid as the odd one out:

>>> mr.doesnt_match("breakfast cereal dinner lunch".split())
'cereal'
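
Here is a rough by-hand version of that centroid idea. It is an illustration of the reasoning, not gensim's exact implementation, which works with normalized vectors:

import numpy as np

words = "breakfast cereal dinner lunch".split()
vecs = np.array([mr[w] for w in words])     # one row per word vector
centroid = vecs.mean(axis=0)                # the group's average vector

# the word whose vector lies farthest from the centroid is the odd one out
odd_one_out = max(words, key=lambda w: np.linalg.norm(mr[w] - centroid))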

Models can be significantly improved by first detecting bigram words ("New York", "Press Secretary") and training on sentences in which those bigrams have been joined into single tokens:

from gensim.models import Phrases

bigram_transformer = Phrases(movie_reviews.sents())
mr_bigrams = Word2Vec(bigram_transformer[movie_reviews.sents()])

Because a word vector model encodes implicit co-occurrence probabilities, it can also be used to find the probability of a sentence:

mr.score(["Colorless green ideas sleep furiously".split()])

In other words, the model goes through each word and computes the probability of the other words occurring as its neighbors, returning an overall log-likelihood score.
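
One caveat, based on my reading of the gensim documentation rather than anything stated above: score() is only implemented for models trained with hierarchical softmax, so you may need to retrain a model with hs=1 before scoring:

from gensim.models import Word2Vec
from nltk.corpus import movie_reviews

# retrain with hierarchical softmax (hs=1) so that score() is available
mr_hs = Word2Vec(movie_reviews.sents(), hs=1, negative=0)
mr_hs.score(["Colorless green ideas sleep furiously".split()])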

Finally, you can just look at a vector, which by default has 100 dimensions. It's an array with 100 floating point numbers.

>>> mvec =  mr['man']
>>> len(mvec)
100
>>> mvec
array([-0.10986061, -0.12899914, -0.01356757,  0.09450436, -0.04457339,
        0.1201788 ,  0.07375111,  0.00555919, -0.27961457, -0.04920399,
        0.21768376, -0.15391812, -0.07826918,  0.2606242 , -0.12305038,
       -0.0137245 ,  0.02650702, -0.01748919, -0.12054206, -0.13024689,
       -0.06372885, -0.23327361,  0.33404183,  0.22624712, -0.22911069,
       -0.12921527,  0.28556904, -0.23052499, -0.19462241,  0.26367468,
        0.25053203,  0.03706881,  0.12325867, -0.33901975, -0.02694192,
       -0.05358029, -0.00767274, -0.13719793,  0.00530363, -0.20839927,
       -0.03976608,  0.0351226 ,  0.18464074, -0.24369034,  0.15803961,
       -0.0514226 , -0.13602231,  0.25484481, -0.08208569,  0.06340741,
        0.21154985,  0.09053291,  0.13411717, -0.24650463,  0.2090898 ,
       -0.14951023, -0.02048201, -0.22660583,  0.04167137,  0.06884803,
        0.31761509, -0.1049979 ,  0.11771846,  0.11075541, -0.05071831,
        0.21371891,  0.12598746, -0.2079615 , -0.13616957, -0.01921517,
       -0.16636346,  0.1169065 ,  0.23653744,  0.31624255,  0.11505274,
       -0.09718218, -0.06874531,  0.10780501, -0.01663529, -0.10346226,
       -0.30455628,  0.00246542, -0.15952916, -0.01670113, -0.08883165,
        0.13546473, -0.39362314,  0.27298909, -0.08167259, -0.1424706 ,
        0.12223504,  0.18078147,  0.08870253,  0.15700033,  0.17984635,
        0.13593708, -0.43276551,  0.03234629, -0.16896026, -0.12703048], dtype=float32)

Optional part two assignment

  1. Do the assignment here, which provides a link to download a full Wikipedia-trained set of word vectors. We'll refer to this as the Wikipedia model.
  2. Find the nearest neighbors of the word computer in the Movie Review, Brown, Penn Treebank, and Wikipedia models. Show your code. Show the neighbors you found and their scores.
  3. Now that you know the 5 nearest neighbors of computer in the Brown model, would you expect the similarity score of program and computer to be greater or less than .75? Why? Verify your prediction. Show your code.
  4. Find the 10 nearest neighbors of great in the Movie Review model. Do the same with a negative sentiment word. Try some other techniques for collecting positive and negative evaluative words, such as changing the value of the "topn" parameter illustrated above and trying other seed words. Compare these results with the results using the Wikipedia model. Which dataset works best for collecting evaluative words? Speculate as to why.
  5. Complete the following analogy: great:excellent::good:??. Which model does this best?
  6. Find the outlier in the following set of words: great excellent terrible outstanding. Which model works best, or are they all equally good? Speculate as to why.
  7. Experiment with scoring sentences as done here

      mr.score(["Colorless green ideas sleep furiously".split()])
      

    to answer the following question: Does the probability of the sentence grow larger or smaller as the sentence grows longer? Either way, there is a kind of length bias showing up. Suggest a fairer way to judge likelihood of a sentence. Use it to decide which of the following sentences is more likely:

    1. My laptop goes to sleep easily.
    2. My child goes to sleep easily.

  8. Train a phrasal model. This will discover "words with spaces" or "phrasal words" or "collocations" like new york and compress them into one-word units (new_york).

      from gensim.models import Phrases

      phraser = Phrases(movie_reviews.sents())
      mr_sents_phrased = phraser[movie_reviews.sents()]
      mr_p = Word2Vec(mr_sents_phrased)
      

    Compare the top 25 most similar words to great in this new model with the top 25 most similar words to great in the mr model. Report on any effects that including phrases has had. Are all the results good?
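
    For the comparison itself, something along these lines should work (a sketch; mr and mr_p are the models trained above):

      great_plain   = mr.most_similar('great', topn=25)
      great_phrased = mr_p.most_similar('great', topn=25)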