Project: Stylometry Part 1
I’ve decided to look into natural language parsing, since I’ve always been interested in it. Have you seen Cleverbot? It’s pretty cool. Anyway, I managed to install TextBlob, a natural language processing library for Python, and went looking for cool things to start off making. I found this and decided to make a program that compares two texts to see whether it can figure out if they were written by the same author.
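For anyone following along: TextBlob installs with pip and needs its corpora downloaded once before tagging works. A minimal sketch of it in action:

# one-time setup:
#   pip install textblob
#   python -m textblob.download_corpora

from textblob import TextBlob

blob = TextBlob("The quick brown fox jumps over the lazy dog.")
print blob.tags  # a list of (word, part-of-speech tag) pairs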
So, I looked into this field and it turns out it’s called stylometry. I further found out about the writer invariant, a property that’s supposed to be distinctive for each author. There’s no agreement on what this property is exactly, but apparently “function words” are promising because they are used unconsciously. I then found this article, which expanded on the idea.
The function words used in that article are: a, all, also, an, and, any, are, as, at, be, been, but, by, can, do, down, even, every, for, from, had, has, have, her, his, if, in, into, is, it, its, may, more, must, my, no, not, now, of, on, one, only, or, our, should, so, some, such, than, that, the, their, then, there, things, this, to, up, upon, was, were, what, when, which, who, will, with, would, your.
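The idea is that the rate at which an author uses these words acts as a fingerprint. As a toy illustration (the sample sentence and word choices here are just placeholders), counting relative frequencies is straightforward:

# toy illustration: per-1000-word frequency of a few function words
sample = "it was the best of times it was the worst of times"
words = sample.split()
for fw in ["the", "of", "it"]:
    print fw, float(words.count(fw)) / len(words) * 1000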
Looking at this document, I realised that these words weren’t all of the same grammatical category. Hence I needed to know what their corresponding part-of-speech tags actually were. So:
from textblob import TextBlob

text = " a, all, also, an, and, any, are, as, at, be, been, but, by, can, do, down, even, every, for, from, had, has, have, her, his, if, in, into, is, it, its, may, more, must, my, no, not, now, of, on, one, only, or, our, should, so, some, such, than, that, the, their, then, there, things, this, to, up, upon, was, were, what, when, which, who, will, with, would, your"
blob = TextBlob(text)

# collect the part-of-speech tag of every word, then count each tag
stuff = []
for i in blob.tags:
    stuff.append(i[1])
tags = set(stuff)
for i in tags:
    print i, stuff.count(i)
It produced:
MD 6
VB 1
WRB 1
JJ 1
VBD 3
CC 3
WP 2
VBN 1
JJR 1
TO 1
CD 1
VBP 3
WDT 1
PRP 1
RB 8
IN 15
VBZ 2
DT 9
PRP$ 7
NNS 1
EX 1
Therefore, instead of using all the categories, I decided to look only for IN, DT, PRP$ and MD. These are prepositions/subordinating conjunctions, determiners, possessive pronouns, and modals, respectively.
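If you want the official description of any of these Penn Treebank tags, NLTK (which TextBlob uses under the hood) can print them. A quick sketch, assuming the "tagsets" help data has been downloaded:

import nltk

nltk.download('tagsets')  # one-time download of the tag documentation
for tag in ['IN', 'DT', 'PRP$', 'MD']:
    nltk.help.upenn_tagset(tag)  # prints the definition and example words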
I started out by focusing on just one text, Oliver Twist by Charles Dickens, which is freely available online. First I worked on pulling the INs out of the text and reducing them to a set of distinct words.
from textblob import TextBlob

ot = open("olivertwist.txt", "r")
ottext = TextBlob(ot.read())
otwc = len(ottext.tags)  # wordcount for oliver twist

targets = ['IN', 'DT', 'PRP$', 'MD']

def f(x): return x[1] == 'IN'
IN = filter(f, ottext.tags)  # keep only the words tagged IN

# lowercase each word so e.g. "Of" and "of" collapse together
for i in xrange(len(IN)):
    IN[i] = (IN[i][0].lower(), IN[i][1])

blah = set(IN)  # the distinct IN words
for x in blah:
    print x[0]
That worked, way faster than I thought it would (considering the novel’s ~170,000 words). So I decided to expand it and get it working with Great Expectations, also by Charles Dickens.
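As an aside, the filter-plus-index-loop pattern can be written more compactly as a list comprehension; this sketch is equivalent to the filtering and lowercasing steps above:

# same result as filter(f, ...) followed by the xrange loop
IN = [(word.lower(), tag) for (word, tag) in ottext.tags if tag == 'IN']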
This is the complete code, which gives an average percentage difference of 68.1%:
from textblob import TextBlob

first = open("olivertwist.txt", "r")
firsttext = TextBlob(first.read())
firstwc = len(firsttext.tags)  # wordcount for oliver twist

second = open("greatexpectations.txt", "r")
secondtext = TextBlob(second.read())
secondwc = len(secondtext.tags)  # wordcount for great expectations

targets = ['IN', 'DT', 'PRP$', 'MD']
firstwords = {}
secondwords = {}

for target in targets:
    def f(x): return x[1] == target

    firsttags = filter(f, firsttext.tags)  # gets the targeted tags
    for i in xrange(len(firsttags)):
        firsttags[i] = (firsttags[i][0].lower(), firsttags[i][1])  # reduces to lowercase

    secondtags = filter(f, secondtext.tags)  # gets the targeted tags
    for i in xrange(len(secondtags)):
        secondtags[i] = (secondtags[i][0].lower(), secondtags[i][1])  # reduces to lowercase

    firsttags_set = set(firsttags)
    secondtags_set = set(secondtags)

    for x in firsttags_set:
        firstwords[x[0]] = (float(firsttags.count(x)) / firstwc) * 1000  # freq per 1000 words
    for x in secondtags_set:
        secondwords[x[0]] = (float(secondtags.count(x)) / secondwc) * 1000  # freq per 1000 words

# compare the frequencies of every word that appears in both novels
difference = []
number = 0
for word in firstwords:
    if word in secondwords:
        number += 1
        diff = [firstwords[word], secondwords[word]]
        percentchange = (max(diff) - min(diff)) / min(diff) * 100
        difference.append(percentchange)

# average the percentage differences over all shared words
summed = 0
for x in difference:
    summed += x
print summed / number
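To make the metric concrete: for each word that appears in both novels, the script takes the two per-1000-word frequencies and computes how much larger the bigger one is than the smaller. With hypothetical numbers:

# hypothetical frequencies per 1000 words for one shared word
diff = [2.0, 1.6]
percentchange = (max(diff) - min(diff)) / min(diff) * 100  # = 25.0

The 68.1% figure is that value averaged over every shared word.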
I still haven’t worked out the cut-off that would separate different authors, so I guess I’ll try that tomorrow. Can’t wait! That was pretty fun and satisfying to make.