Project: Stylometry (Results) Part 2

I didn’t know whether my program worked, so I decided to test it. I chose authors who wrote in English so that their use of the words I was looking for would be natural and consistent.

First, I tested it on different texts by Charles Dickens, and then compared those texts with ones by other authors from around the same time period:

Novel                  Oliver Twist   Great Expectations   A Tale of Two Cities
Oliver Twist           0%             68.1%                60.4%
Great Expectations     68.1%          0%                   52.1%
A Tale of Two Cities   60.4%          52.1%                0%

Now, comparing different authors:

Novel                  Pride and Prejudice   Dracula
Oliver Twist           218%                  115%
Great Expectations     227%                  127%
A Tale of Two Cities   240%                  112%

So, that worked. Texts by the same author had a percentage difference below about 70%, whilst texts by different authors had a difference above about 110%. I also saw that a lower word count would lead to less accurate results, so I wanted to try the checker on a novella by Charles Dickens, A Christmas Carol.
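As a rough illustration, the two cut-offs can be written down as a tiny classifier. This is only a sketch: the function name `classify` and the "inconclusive" band between the two thresholds are my own assumptions, and the numbers come from just five novels, so they are not calibrated in any way.

```python
# Rough thresholds read off the tables above (assumption: treated as hard cut-offs).
SAME_AUTHOR_MAX = 70.0        # same-author pairs scored below roughly this
DIFFERENT_AUTHOR_MIN = 110.0  # different-author pairs scored above roughly this


def classify(percentage_difference: float) -> str:
    """Label a percentage difference using the two rough thresholds."""
    if percentage_difference < SAME_AUTHOR_MAX:
        return "probably same author"
    if percentage_difference > DIFFERENT_AUTHOR_MIN:
        return "probably different author"
    # Scores between the two thresholds are not covered by the results so far.
    return "inconclusive"


if __name__ == "__main__":
    for score in (68.1, 52.1, 218.0, 112.0, 90.0):
        print(f"{score:6.1f}% -> {classify(score)}")
```

Anything that lands between 70% and 110% is left undecided here, because none of the full-length comparisons fell in that range.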

Novel                                 A Christmas Carol (29400 words)
Oliver Twist (171826 words)           63.0%
Great Expectations (190198 words)     97.8%
A Tale of Two Cities (138330 words)   71.3%

Thus, word count does indeed affect the percentage difference: two of the three scores (71.3% and 97.8%) are above the roughly 70% ceiling I saw between the full-length Dickens novels. A shorter novella contains fewer of the typical function words used by the same author. However, even though A Christmas Carol is significantly shorter than the novels, its percentage difference was still smaller than those of the books by the other authors. I suspect that if I had compared it with a few novellas instead of the full-length novels, the percentage difference would be even smaller. I also think that some of the tags I decided to target are a natural part of English, and there might be a more personal list of words for each author. Anyway, the longer the texts, the more accurate this program is.
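To make the word-count effect a bit more concrete, here is a minimal sketch of one way a percentage difference like this could be computed. It is not necessarily the formula my program uses: the word list `FUNCTION_WORDS` and the function names are hypothetical, and I am assuming the score is a sum of symmetric percentage differences between per-word relative frequencies. That assumption at least matches the tables above, where a text scores 0% against itself and the scores are the same in both directions.

```python
import re
from collections import Counter

# Hypothetical function-word list, for illustration only.
FUNCTION_WORDS = ("the", "and", "of", "to", "a", "in", "but", "that")


def relative_frequencies(text: str) -> dict[str, float]:
    """Return each function word's count divided by the total word count."""
    words = re.findall(r"[a-z']+", text.lower())
    counts = Counter(words)
    total = len(words)
    if total == 0:
        return {word: 0.0 for word in FUNCTION_WORDS}
    # Dividing by the total makes texts of different lengths comparable.
    return {word: counts[word] / total for word in FUNCTION_WORDS}


def percentage_difference(text_a: str, text_b: str) -> float:
    """Sum the symmetric percentage difference over all function words."""
    freq_a = relative_frequencies(text_a)
    freq_b = relative_frequencies(text_b)
    score = 0.0
    for word in FUNCTION_WORDS:
        a, b = freq_a[word], freq_b[word]
        mean = (a + b) / 2
        if mean > 0:  # skip words that appear in neither text
            score += abs(a - b) / mean * 100
    return score
```

With this formula, a word that appears in one text but never in the other adds the maximum of 200% on its own, which is one way a short text could push the total up even when the author is the same.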