The file conspiracy_truncated.zip is a zip archive that you should unzip in your working directory. Connect to that directory in a command window/terminal before following the directions below.
The file conspiracy_truncated.zip contains 5 files.
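If you don't have an unzip utility handy, Python's standard library can unpack the archive for you; this is a minimal sketch that assumes the zip file sits in your current directory:

import zipfile

# Extract all the files into the current working directory.
zipfile.ZipFile('conspiracy_truncated.zip').extractall('.')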
To get help on this assignment, you should run demo.py from a command line (an MS-DOS command window or a Mac OS terminal) like this:
python -i demo.py
[Python starts up; some stuff prints out]
>>>

You can then interact with Python in an environment in which all the variables set in demo.py are defined. You should inspect this file in an editor, because there are numerous useful commands and comments in it.
You will hand in a hard copy with the PPMI values or cosine similarity scores requested for the word pairs below, together with some code (extending the code in demo.py) that shows how you computed them (note that all vocabulary items are tagged).
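As a reminder, PPMI(w, c) = max(0, log2(P(w,c) / (P(w)P(c)))). demo.py and read_word_vectors.py provide their own machinery for this, which you should use; purely for orientation, here is a minimal sketch of the computation from raw counts (the argument names are illustrative, not names from the assignment code):

import math

def ppmi(count_wc, count_w, count_c, total):
    # count_wc: co-occurrence count of word w with context c
    # count_w, count_c: marginal counts of w and c
    # total: sum of all co-occurrence counts in the table
    if count_wc == 0:
        return 0.0
    p_wc = count_wc / total
    p_w = count_w / total
    p_c = count_c / total
    return max(0.0, math.log2(p_wc / (p_w * p_c)))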
Here are some common problems:
Loading vocab
Traceback (most recent call last):
  File "demo.py", line 17, in <module>
    wvi = read_word_vectors.WordVectorInt(base)
  File "/Users/gawron/Documents/classes/compling/distributional_semantics/word_vec_assignment_part_one/read_word_vectors.py", line 108, in __init__
    self.load_vocab()
  File "/Users/gawron/Documents/classes/compling/distributional_semantics/word_vec_assignment_part_one/read_word_vectors.py", line 117, in load_vocab
    with codecs.open(self.vocab_file, 'r', encoding = 'utf8') as fh:
  File "/Users/gawron/anaconda3/lib/python3.6/codecs.py", line 897, in open
    file = builtins.open(filename, mode, buffering)
FileNotFoundError: [Errno 2] No such file or directory: '/Users/gawron/Documents/classes/compling/distributional_semantics/word_vec_assignment_part_one/conspiracy_mid.vocab'
# "Current directory": You can edit this if you want the data to be elsewhere data_dir = os.getcwd() base = os.path.join(data_dir, base)
# "Current directory": You can edit this if you want the data to be elsewhere data_dir = r"C:\Users\Gawron\Documents\word_vec_assignment" base = os.path.join(data_dir, base)
Note: The next part of this assignment is optional. It is worth extra credit. I will be happy to comment on your efforts, or answer any questions.
Follow the easy install directions for Radim Rehurek's gensim module, which implements word2vec, the deep-learning-inspired word embedding method. Those directions are here. The discussion below is based on Rehurek's Word2vec tutorial.
Run the following code (which will take some time):
from gensim.models import Word2Vec
from nltk.corpus import brown, movie_reviews, treebank

b = Word2Vec(brown.sents())
mr = Word2Vec(movie_reviews.sents())
t = Word2Vec(treebank.sents())
You have built three word2vec models on very small amounts of data: one on the Brown corpus, one on the movie reviews corpus, and one on Treebank data. They won't be very good, but they still contain useful information.
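Word2Vec takes a number of optional training parameters; the calls above just use the defaults. For illustration, the following should be equivalent to the Brown call above (these are gensim's defaults in the versions current when this was written; later versions renamed size to vector_size):

# Explicit defaults: 100-dimensional vectors, 5-word context window,
# and words occurring fewer than 5 times are dropped.
b = Word2Vec(brown.sents(), size=100, window=5, min_count=5)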
Let's find the 5 nearest neighbors of the word man, based on the data in the Movie Review corpus:
>>> mr_man_nns = mr.most_similar('man', topn=5)
>>> mr_man_nns
[(u'woman', 0.8220334053039551),
 (u'girl', 0.6817629933357239),
 (u'boy', 0.6774479150772095),
 (u'doctor', 0.6448361873626709),
 (u'calculating', 0.605032205581665)]
>>> mr.similarity('man','woman')
0.8220334001723657
The result, mr_man_nns, is a list of five pairs, each a nearest neighbor of man together with its similarity to man. These are the five words most similar to man according to the model. To find the similarity score of any two words, use similarity as shown.
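Since each entry is a (word, score) pair, you can unpack the list directly, for example to print the neighbors one per line:

for word, sim in mr_man_nns:
    print(word, sim)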
To complete analogies like man:woman::king:??, you do the following:
>>> mr.most_similar(positive=['woman', 'king'], negative=['man'])
[(u'ark', 0.7656992077827454), ... ]

A variant proposed in Levy and Goldberg (2014):
>>> mr.most_similar_cosmul(positive=['woman', 'king'], negative=['man'])
[(u'ark', 1.0164012908935547), ... ]

This is a pretty weird answer.
Of course, if you were using the Wikipedia-trained model described in Rehurek's tutorial and Mikolov et al. (2013), you would instead get this as the completion of the analogy:
[('queen', 0.50882536), ...]
You can also use the model to find the outlier in a group of words. For this, we find the centroid of the word vectors and choose the one with the greatest distance from the centroid as the odd vector out:
>>> mr.doesnt_match("breakfast cereal dinner lunch".split())
'cereal'
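To make the centroid idea concrete, here is a rough re-implementation. This is a sketch, not gensim's exact code (gensim also normalizes the vectors first), and odd_one_out is my own name for the helper:

import numpy as np

def odd_one_out(model, words):
    # Stack the word vectors into a matrix and take their mean: the centroid.
    vecs = np.array([model[w] for w in words])
    centroid = vecs.mean(axis=0)
    # Cosine similarity of each word vector to the centroid.
    sims = vecs.dot(centroid) / (np.linalg.norm(vecs, axis=1) * np.linalg.norm(centroid))
    # The word least similar to the centroid is the odd one out.
    return words[int(np.argmin(sims))]

Calling odd_one_out(mr, "breakfast cereal dinner lunch".split()) should usually agree with mr.doesnt_match.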
Models can be significantly improved by finding bigram words ("New York", "Press Secretary"):
from gensim.models import Word2Vec, Phrases

# Learn common bigrams from the corpus, then train on the transformed sentences.
bigram_transformer = Phrases(movie_reviews.sents())
mr_bigrams = Word2Vec(bigram_transformer[movie_reviews.sents()])
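Phrases joins the words of a detected bigram with an underscore, so phrase tokens look like new_york in the transformed corpus. Whether any particular bigram gets detected depends on its frequency in the training data, so a query like the following may or may not succeed on the movie reviews model:

# Raises a KeyError if 'new_york' was not frequent enough to be detected.
mr_bigrams.most_similar('new_york', topn=5)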
Because a word vector model implicitly encodes co-occurrence probabilities, it can also be used to estimate the likelihood of a sentence:
mr.score(["Colorless green ideas sleep furiously".split()])

In other words, the model goes through each word and computes the probability of the other words occurring as its neighbors, returning an overall log-likelihood score.
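One caveat: as I read the gensim documentation, score only works for models trained with hierarchical softmax, which is not the default training scheme, so you may need to retrain before the call above will run:

# score() requires hierarchical softmax; negative-sampling models don't support it.
mr_hs = Word2Vec(movie_reviews.sents(), hs=1, negative=0)
print(mr_hs.score(["Colorless green ideas sleep furiously".split()]))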
Finally, you can just look at a vector, which by default has 100 dimensions. It's a numpy array of 100 floating-point numbers.
>>> mvec = mr['man']
>>> len(mvec)
100
>>> mvec
array([-0.10986061, -0.12899914, -0.01356757,  0.09450436, -0.04457339,
        0.1201788 ,  0.07375111,  0.00555919, -0.27961457, -0.04920399,
        0.21768376, -0.15391812, -0.07826918,  0.2606242 , -0.12305038,
       -0.0137245 ,  0.02650702, -0.01748919, -0.12054206, -0.13024689,
       -0.06372885, -0.23327361,  0.33404183,  0.22624712, -0.22911069,
       -0.12921527,  0.28556904, -0.23052499, -0.19462241,  0.26367468,
        0.25053203,  0.03706881,  0.12325867, -0.33901975, -0.02694192,
       -0.05358029, -0.00767274, -0.13719793,  0.00530363, -0.20839927,
       -0.03976608,  0.0351226 ,  0.18464074, -0.24369034,  0.15803961,
       -0.0514226 , -0.13602231,  0.25484481, -0.08208569,  0.06340741,
        0.21154985,  0.09053291,  0.13411717, -0.24650463,  0.2090898 ,
       -0.14951023, -0.02048201, -0.22660583,  0.04167137,  0.06884803,
        0.31761509, -0.1049979 ,  0.11771846,  0.11075541, -0.05071831,
        0.21371891,  0.12598746, -0.2079615 , -0.13616957, -0.01921517,
       -0.16636346,  0.1169065 ,  0.23653744,  0.31624255,  0.11505274,
       -0.09718218, -0.06874531,  0.10780501, -0.01663529, -0.10346226,
       -0.30455628,  0.00246542, -0.15952916, -0.01670113, -0.08883165,
        0.13546473, -0.39362314,  0.27298909, -0.08167259, -0.1424706 ,
        0.12223504,  0.18078147,  0.08870253,  0.15700033,  0.17984635,
        0.13593708, -0.43276551,  0.03234629, -0.16896026, -0.12703048], dtype=float32)
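If you want to see what similarity is doing under the hood, you can compute the cosine yourself from the raw vectors; this should reproduce mr.similarity('man', 'woman') up to floating-point rounding:

import numpy as np

v1, v2 = mr['man'], mr['woman']
# Cosine similarity: dot product divided by the product of the vector norms.
print(np.dot(v1, v2) / (np.linalg.norm(v1) * np.linalg.norm(v2)))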
For the exercise below, build a phrase-aware model of the movie reviews corpus:

from gensim.models import Phrases

# Learn common bigrams, transform the corpus, and retrain.
phraser = Phrases(movie_reviews.sents())
mrs_phrased = phraser[movie_reviews.sents()]
mr_p = Word2Vec(mrs_phrased)
Compare the top 25 most similar words to great in this new model (mr_p) with the top 25 most similar words to great in the mr model. Report on any effects that including phrases has had. Are all the results good?
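One convenient way to line the two lists up for this comparison (a sketch; the variable names are mine):

# Top 25 neighbors of 'great' in each model.
great_plain = [w for w, _ in mr.most_similar('great', topn=25)]
great_phrased = [w for w, _ in mr_p.most_similar('great', topn=25)]

# Words that appear in one list but not the other.
print(set(great_plain) - set(great_phrased))
print(set(great_phrased) - set(great_plain))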