Finally, my project is nearing its conclusion. I am using data sets from here: https://www.kaggle.com/spscientist/students-performance-in-exams/downloads/students-performance-in-exams.zip/1, https://journals.plos.org/plosone/article?id=10.1371/journal.pone.0171207, https://archive.ics.uci.edu/ml/datasets/student+performance#, and https://www.kaggle.com/aljarah/xAPI-Edu-Data. Unfortunately, I was unable to obtain a data set from the principal I met with earlier, but the data sets I do have are enough for what I am working on so far. I am mostly done, though I may make a few tweaks to see whether the accuracy can be pushed any higher.
Each of the data sets is about student performance in some way, but each provides a different set and number of attributes. This meant I had to run the algorithms on each data set separately: I could not train an algorithm on one data set and then apply the trained model to another, since the formats are incompatible. Instead, I measured accuracy with 10-fold cross-validation. In most of the final data sets, except for the xAPI one, the value to be predicted was numeric rather than nominal. The algorithms I am focusing on do not handle numeric prediction, so I discretized the target into 3 bins (matching the 3 classes the xAPI data set predicts) to turn it into a nominal classification problem the algorithms could work on.
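To make the pipeline concrete, here is a minimal sketch of the discretize-then-cross-validate step, assuming scikit-learn and pandas rather than whatever tool was actually used. The column names and the synthetic stand-in data are hypothetical; "G3" follows the UCI student-performance set's naming, and a decision tree stands in for the actual algorithms under study.

```python
import numpy as np
import pandas as pd
from sklearn.model_selection import cross_val_score
from sklearn.tree import DecisionTreeClassifier

rng = np.random.default_rng(0)
# Synthetic stand-in data; in practice this would be one of the real data sets.
df = pd.DataFrame({
    "studytime": rng.integers(1, 5, 300),
    "absences": rng.integers(0, 30, 300),
    "G3": rng.integers(0, 21, 300),  # numeric final grade (0-20, as in the UCI set)
})

# Discretize the numeric target into 3 equal-width bins, matching the
# 3 classes of the xAPI data set, so classifiers can be applied.
df["grade_class"] = pd.cut(df["G3"], bins=3, labels=["low", "mid", "high"])

X = df.drop(columns=["G3", "grade_class"])
y = df["grade_class"]

# 10-fold cross-validation on this one data set, instead of training on
# one data set and testing on another (the formats are incompatible).
scores = cross_val_score(DecisionTreeClassifier(random_state=0), X, y, cv=10)
print(f"mean accuracy: {scores.mean():.2f}")
```

With real data, the reported figure would be the mean of the ten fold accuracies, which is how the percentages below are summarized.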
I have obtained a lot of numerical data. Some general observations: the xAPI data set provides no grade-related attributes other than the one being predicted, yet its accuracy is generally 70-80%. This may be because its target is already nominal and needs no discretization. Some of the other data sets include additional grade attributes, which generally raise accuracy to 80-90%; with those grade attributes removed, accuracy drops to 60-70%. I am not entirely sure whether the difference comes from the discretization or from the number of attributes available.
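The with-grades versus without-grades comparison can be sketched as follows, again assuming scikit-learn on synthetic stand-in data. The "G1"/"G2" names mirror the UCI set's earlier-period grades, and the correlation between them and the target is built into the fake data purely to illustrate the shape of the experiment, not to reproduce the real percentages.

```python
import numpy as np
import pandas as pd
from sklearn.model_selection import cross_val_score
from sklearn.tree import DecisionTreeClassifier

rng = np.random.default_rng(1)
n = 300
g3 = rng.integers(0, 21, n)  # numeric final grade
df = pd.DataFrame({
    "studytime": rng.integers(1, 5, n),          # unrelated attribute
    "G1": g3 + rng.integers(-2, 3, n),           # earlier grades, correlated with G3
    "G2": g3 + rng.integers(-1, 2, n),
    "target": pd.cut(g3, bins=3, labels=["low", "mid", "high"]),
})

clf = DecisionTreeClassifier(random_state=0)
# Same algorithm and same 10-fold CV, with and without the grade attributes.
acc_with = cross_val_score(clf, df[["studytime", "G1", "G2"]], df["target"], cv=10).mean()
acc_without = cross_val_score(clf, df[["studytime"]], df["target"], cv=10).mean()
print(f"with grades: {acc_with:.2f}, without grades: {acc_without:.2f}")
```

On data where earlier grades track the final grade, dropping them should cost accuracy, which is the pattern the real results show (80-90% falling to 60-70%).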