Project: Stylometry (Results) Part 2
I didn’t know whether my program worked, so I decided to test it. I chose authors who wrote in English so that their use of the words I looked for would be natural and consistent.
First, I tested it on different texts by Charles Dickens, and then compared those texts to ones by other authors from around the same time period:
| Oliver Twist | Great Expectations | A Tale of Two Cities |
Oliver Twist | 0% | 68.1% | 60.4% |
Great Expectations | 68.1% | 0% | 52.1% |
A Tale of Two Cities | 60.4% | 52.1% | 0% |
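The table above is symmetric and has 0% on its diagonal, which is what any symmetric difference measure produces when every text is compared with every other text. The scoring formula itself is not shown in this part, so the following is only a rough sketch under stated assumptions: the word list `FUNCTION_WORDS`, the per-1,000-token normalisation, and the summed symmetric percentage difference are all hypothetical stand-ins, not the program's actual method.

```python
import re
from collections import Counter

# Hypothetical function-word list; the project's real tag list is not shown here.
FUNCTION_WORDS = ["the", "and", "of", "to", "a", "in", "that", "it"]


def relative_frequencies(text, words=FUNCTION_WORDS):
    """Return occurrences per 1,000 tokens for each word in `words`."""
    tokens = re.findall(r"[a-z']+", text.lower())
    counts = Counter(tokens)
    total = len(tokens) or 1  # avoid division by zero on empty input
    return {w: 1000 * counts[w] / total for w in words}


def percentage_difference(text_a, text_b, words=FUNCTION_WORDS):
    """Sum of symmetric percentage differences over all words (assumed formula)."""
    fa = relative_frequencies(text_a, words)
    fb = relative_frequencies(text_b, words)
    total = 0.0
    for w in words:
        mean = (fa[w] + fb[w]) / 2
        if mean > 0:  # skip words absent from both texts
            total += abs(fa[w] - fb[w]) / mean * 100
    return total


def pairwise_table(texts):
    """Build {title: {title: difference}} for every pair of texts."""
    return {
        a: {b: percentage_difference(ta, tb) for b, tb in texts.items()}
        for a, ta in texts.items()
    }
```

Passing a dictionary of `{title: full_text}` to `pairwise_table` would give a nested dictionary with the same shape as the table. Because the per-word differences are summed, a score under this assumed formula can exceed 100%.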
Now, comparing different authors:
| Pride and Prejudice | Dracula |
Oliver Twist | 218% | 115% |
Great Expectations | 227% | 127% |
A Tale of Two Cities | 240% | 112% |
So, that worked. I found that texts by the same author had a percentage difference below roughly 70%, whilst texts by different authors had a difference above roughly 110%. I also saw that a lower word count would lead to less accurate results, so I wanted to try the checker on a novella by Charles Dickens, A Christmas Carol.
| A Christmas Carol (29400 words) |
Oliver Twist (171826 words) | 63.0% |
Great Expectations (190198 words) | 97.8% |
A Tale of Two Cities (138330 words) | 71.3% |
Thus, word count does indeed affect the percentage difference, because a shorter novella contains fewer of the typical function words used by the same author. However, even though A Christmas Carol is significantly shorter than the novels, its percentage difference was still smaller than that of the books by the other authors. I suspect that if I used a few novellas from the other novels, the percentage difference would be even smaller. I also think that some of the tags I decided to target are simply a natural part of English, and there might be a more personal list of words. Anyway, the longer the texts, the more accurate this program is.
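The rough cut-offs from the results can be written down as a simple rule. This is only a sketch: the 70% and 110% values are the approximate figures observed above, not tuned thresholds, and the `verdict` function and its "inconclusive" label are hypothetical names added for illustration.

```python
# Rough cut-offs read off the results above; approximate, not tuned values.
SAME_AUTHOR_MAX = 70.0        # same-author novel pairs scored below about 70%
DIFFERENT_AUTHOR_MIN = 110.0  # different-author pairs scored above about 110%


def verdict(difference):
    """Classify a percentage difference using the two rough cut-offs."""
    if difference < SAME_AUTHOR_MAX:
        return "same author"
    if difference > DIFFERENT_AUTHOR_MIN:
        return "different author"
    return "inconclusive"


# A Christmas Carol compared against each Dickens novel (values from the table).
christmas_carol = {
    "Oliver Twist": 63.0,
    "Great Expectations": 97.8,
    "A Tale of Two Cities": 71.3,
}

for title, diff in christmas_carol.items():
    print(f"{title}: {diff}% -> {verdict(diff)}")
```

With these cut-offs, only the Oliver Twist comparison falls clearly in the same-author range; the other two land in the gap between 70% and 110%. That matches the observation that a shorter text gives a less clear-cut result, while all three still sit below the lowest different-author score of 112%.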