During Week 2, I was sick and was not able to work on my senior project.
During Week 3, I met with my external advisor for the first time, and he gave me a set of articles to use as test data so I can measure how accurate my programs are. At the beginning of the week, I worked on organizing my code to make it more readable for my external advisor and myself. I also made some minor adjustments to improve the accuracy of the FeatureExtractor class, which is the NLP portion of my project and returns all the noun and verb phrases in every article.
After I organized my code, I created a new class called TrainingDataReader, which reads each article from the dataset my external advisor gave me. I started researching what a Path in Java is and how to use one in my program, and it took a few hours to fully understand Paths and newDirectoryStream. I also needed to figure out how to skip certain folders along the path that do not contain data, so I used a glob, which is a pattern matcher that uses wildcards like asterisks, to filter those folders out.
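To give a sense of how this works, here is a minimal sketch of newDirectoryStream with a glob. The folder names here are made up for the example (my real dataset folders are named differently), but the filtering idea is the same: the glob `"topic-*"` only matches the data folders and skips everything else.

```java
import java.io.IOException;
import java.nio.file.DirectoryStream;
import java.nio.file.Files;
import java.nio.file.Path;

public class GlobFilterDemo {
    public static void main(String[] args) throws IOException {
        // Hypothetical dataset layout, built in a temp directory.
        Path root = Files.createTempDirectory("dataset");
        Files.createDirectory(root.resolve("topic-politics"));
        Files.createDirectory(root.resolve("topic-sports"));
        Files.createDirectory(root.resolve("notes"));      // no data: skipped

        // The glob "topic-*" matches only the data folders, so the
        // stream never yields the "notes" folder at all.
        try (DirectoryStream<Path> stream =
                 Files.newDirectoryStream(root, "topic-*")) {
            for (Path dir : stream) {
                System.out.println(dir.getFileName());
            }
        }
    }
}
```

The nice part is that the filtering happens inside newDirectoryStream itself, so there is no need for an if-statement checking each folder name by hand.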
To read all these files, I created a method called readFiles(), which reads every file with a size greater than zero in each data folder and separates the topic of each article from its body, storing the results in an ArrayList of Pairs. There were some bugs caused by the formatting of some of the files, and I had to do some research on how to fix them. After that research, I figured out the problem with my code and got my desired output.
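Roughly, the readFiles() logic looks like the sketch below. I am using AbstractMap.SimpleEntry from the standard library as a stand-in for my Pair type, and assuming for the example that the first line of each file holds the topic and the rest is the body (the real files are formatted differently).

```java
import java.io.IOException;
import java.nio.file.Files;
import java.nio.file.Path;
import java.util.AbstractMap.SimpleEntry;
import java.util.ArrayList;
import java.util.List;
import java.util.stream.Stream;

public class ReadFilesSketch {
    // Reads every non-empty file in dir and splits each one into a
    // (topic, body) pair, treating the first line as the topic.
    static List<SimpleEntry<String, String>> readFiles(Path dir) throws IOException {
        List<SimpleEntry<String, String>> articles = new ArrayList<>();
        try (Stream<Path> files = Files.list(dir)) {
            for (Path file : (Iterable<Path>) files::iterator) {
                if (!Files.isRegularFile(file) || Files.size(file) == 0) {
                    continue; // skip subfolders and empty files
                }
                List<String> lines = Files.readAllLines(file);
                String topic = lines.get(0);
                String body = String.join("\n", lines.subList(1, lines.size()));
                articles.add(new SimpleEntry<>(topic, body));
            }
        }
        return articles;
    }

    public static void main(String[] args) throws IOException {
        Path dir = Files.createTempDirectory("articles");
        Files.write(dir.resolve("a.txt"),
                List.of("Sports", "The game went into overtime."));
        Files.write(dir.resolve("empty.txt"), new byte[0]); // size 0: skipped

        for (SimpleEntry<String, String> article : readFiles(dir)) {
            System.out.println(article.getKey() + " -> " + article.getValue());
        }
    }
}
```

Checking Files.size() before reading is what keeps the empty files from sneaking blank entries into the ArrayList.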
In the coming week, I will connect my FeatureExtractor program with my TrainingDataReader program to extract all of the noun and verb phrases from each of my test articles.
Thanks for reading!