Project: Stylometry Part 1
I’ve decided to look into natural language processing, since I’ve always been interested in it. Have you seen cleverbot? It’s pretty cool. Anyway, I managed to install TextBlob, a natural language processing library, and went looking for cool things to start off making. I found this and decided to make a program that compares two texts to see if it can figure out whether they have the same author.
So, I looked into this field and it turns out that it’s called stylometry. I also found out about the writer invariant, which is apparently a property that each author has. It’s not agreed what this property is exactly, but “function words” look promising because they are used unconsciously. I then found this article, which expanded on it.
The function words used in that article are: a, all, also, an, and, any, are, as, at, be, been, but, by, can, do, down, even, every, for, from, had, has, have, her, his, if, in, into, is, it, its, may, more, must, my, no, not, now, of, on, one, only, or, our, should, so, some, such, than, that, the, their, then, there, things, this, to, up, upon, was, were, what, when, which, who, will, with, would, your.
Looking at this list, I realised that these words aren’t all in the same category, so I needed to know what their “corresponding tags” actually were. I tagged each word and counted how many times each tag came up.
It produced:
MD 6
VB 1
WRB 1
JJ 1
VBD 3
CC 3
WP 2
VBN 1
JJR 1
TO 1
CD 1
VBP 3
WDT 1
PRP 1
RB 8
IN 15
VBZ 2
DT 9
PRP$ 7
NNS 1
EX 1
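A tally like the one above can be sketched in a few lines. This is a minimal sketch, not the original script: `count_tags` and the `sample` pairs are made-up names and data for illustration, and the only assumption about TextBlob is that tagged words arrive as (word, tag) pairs, which is the shape its `tags` property returns.

```python
from collections import Counter


def count_tags(tagged_words):
    """Count how often each part-of-speech tag occurs.

    tagged_words is an iterable of (word, tag) pairs, the same shape
    that TextBlob's `tags` property returns.
    """
    return Counter(tag for _word, tag in tagged_words)


# Hand-written sample pairs for illustration only (not real tagger output).
sample = [
    ("the", "DT"), ("of", "IN"), ("would", "MD"),
    ("a", "DT"), ("with", "IN"), ("in", "IN"),
]

tag_counts = count_tags(sample)
for tag, count in tag_counts.most_common():
    print(tag, count)
```

With TextBlob installed, the pairs could come from something like `TextBlob(text).tags` instead of the hand-written sample.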
Therefore, instead of using all the categories, I decided to look only for IN, DT, PRP$ and MD. These are preposition or subordinating conjunction, determiner, possessive pronoun, and modal verb respectively.
I started out focusing on just one text, Oliver Twist by Charles Dickens, available free online. To begin with, I only pulled out the words tagged IN and collected them into a set.
That worked, and way faster than I thought it would (considering the novel’s ~170,000 words). So I decided to expand it and get it working with Great Expectations, also by Charles Dickens.
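That first IN-only pass could look something like the sketch below. The helper name `words_with_tag` and the sample pairs are hypothetical. Lowercasing the words is also an assumption, made so that “In” and “in” count as the same word.

```python
def words_with_tag(tagged_words, wanted_tag):
    """Return the set of distinct lowercase words tagged with wanted_tag."""
    return {word.lower() for word, tag in tagged_words if tag == wanted_tag}


# Hand-written (word, tag) pairs for illustration only.
sample = [
    ("In", "IN"), ("the", "DT"), ("workhouse", "NN"),
    ("of", "IN"), ("a", "DT"), ("town", "NN"),
    ("in", "IN"), ("with", "IN"),
]

in_words = words_with_tag(sample, "IN")
print(sorted(in_words))
```

Using a set means each word is kept once, however often it appears in the novel.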
The complete code gives an average percentage difference of 68.1% between the two novels.
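One way such a figure could be computed is sketched below. The metric is an assumption, since the post does not say how the percentage was calculated: for each of the four tags, it takes the words that appear in only one text as a share of the words that appear in either, then averages over the tags. All function names and sample pairs are hypothetical.

```python
TAGS = ("IN", "DT", "PRP$", "MD")


def words_by_tag(tagged_words, tags=TAGS):
    """Group distinct lowercase words by tag, keeping only the wanted tags."""
    result = {tag: set() for tag in tags}
    for word, tag in tagged_words:
        if tag in result:
            result[tag].add(word.lower())
    return result


def percentage_difference(set_a, set_b):
    """Assumed metric: share of words found in only one of the two sets."""
    union = set_a | set_b
    if not union:
        return 0.0  # nothing to compare, so treat the sets as identical
    return 100.0 * len(set_a ^ set_b) / len(union)


def average_percentage_difference(tagged_a, tagged_b, tags=TAGS):
    """Average the per-tag percentage differences between two tagged texts."""
    by_tag_a = words_by_tag(tagged_a, tags)
    by_tag_b = words_by_tag(tagged_b, tags)
    diffs = [percentage_difference(by_tag_a[t], by_tag_b[t]) for t in tags]
    return sum(diffs) / len(diffs)


# Hand-written (word, tag) pairs standing in for two tagged novels.
text_a = [("of", "IN"), ("in", "IN"), ("the", "DT"), ("his", "PRP$"), ("would", "MD")]
text_b = [("of", "IN"), ("upon", "IN"), ("the", "DT"), ("her", "PRP$"), ("would", "MD")]

result = average_percentage_difference(text_a, text_b)
print(f"{result:.1f}%")
```

A different metric, such as comparing word frequencies rather than word sets, would give different numbers, so the sample output here is not comparable to the 68.1% result.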
I still haven’t tested what cut-off would separate texts by the same author from texts by different authors, so I guess I’ll try that out tomorrow. Can’t wait, that was pretty fun and satisfying to make!