Week 5 – Random Forest Classifiers

Mar 19, 2019

Week 5 was spent reading the last of the research material, a paper titled “Metaphor Detection with Cross-Lingual Model Transfer.” The paper details Yulia Tsvetkov and her team’s research into metaphor detection, and how they used a random-forest classifier as their method. Tsvetkov addresses both Subject-Verb-Object (SVO) and Adjective-Noun (AN) metaphors, both of which I hope to address in my own project.

As Tsvetkov explains, a random forest classifier is a collection of decision tree classifiers, each trained on the training data. Once trained, the forest can act on future inputs to assign a classification (in this case, metaphorical or literal) based on the data it has already learned from. Random forest classifiers, as I learned, also tend to be very accurate compared to other methods: Tsvetkov’s team obtained an accuracy of 0.82, a high figure next to existing approaches, which generally achieve accuracies anywhere from 0.5 to 0.7.
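To make the idea concrete, here is a minimal sketch of training a random forest and classifying new inputs. This uses scikit-learn’s RandomForestClassifier rather than Tsvetkov’s actual code, and the feature vectors and labels below are invented toy data standing in for whatever features a real metaphor detector would extract:

```python
from sklearn.ensemble import RandomForestClassifier

# Toy 2-D feature vectors (purely illustrative, not real metaphor features);
# labels: 1 = metaphorical, 0 = literal.
X = [[0.2, 0.9], [0.1, 0.8], [0.3, 0.95],
     [0.9, 0.1], [0.8, 0.2], [0.95, 0.3]]
y = [1, 1, 1, 0, 0, 0]

# A forest of 100 decision trees; each tree is trained on a bootstrap
# sample of the data, and the forest classifies by majority vote.
clf = RandomForestClassifier(n_estimators=100, random_state=0)
clf.fit(X, y)

# Classify two unseen inputs.
preds = clf.predict([[0.15, 0.85], [0.9, 0.15]])
print(preds)
```

The point is that the forest itself is generic; the hard part of the paper’s method is building good feature vectors for each word or phrase.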

A drawback of random forest classifiers, however, is that they are difficult to write. They rely heavily on regression and data training, require words to be modeled as vectors, and use software that I am currently unfamiliar with. Looking forward, I may decide not to write method 3, simply because I do not believe I will have enough time to learn the new software and concepts and to code the method on top of the other methods I still have to complete. At the very least, work on the random forest classifier will begin only after I have finished everything else, since my other methods already address the types of word phrases that Tsvetkov’s method covers.

I’ve also begun work on my first method, which uses hypernym-hyponym relationships to classify a phrase as metaphorical or literal. I’m coding in Python and using the Natural Language Toolkit (NLTK), specifically its WordNet corpus. So far, I’ve managed to obtain the hypernyms and hyponyms of nearly any given word. I’ve also looked into my second method, which uses word concreteness ratings to assign a phrase its classification. I was able to find a dataset of around 40,000 commonly used words and their concreteness ratings; all of these words are contained within an Excel file that is readable through Python.
