Project: Creating a Context-based Translator Using Data Mining

In line with parsing languages, I decided that my next project would be a translator. I’ve been learning French, and I’ve noticed that Google Translate really isn’t that good at translating. When I do have to translate a sentence, I use Reverso. It lets me search for my sentence and comes up with pre-translated sentences (from other sources) that best match it. It often matches only a quarter or half of the sentence, or even less, so it’s not very automatic and requires fine-tuning by me, with further searches.

I thought that I would solve this problem. The first thing I looked for on Reverso was an API that I could use. Unfortunately, there wasn’t one. I vaguely remembered a Python library called BeautifulSoup that could parse the HTML instead. This was my first time scraping a page, and I found it interesting that I was parsing HTML to get the results of a natural-language search in order to translate a text.

I had trouble with the library at first, but eventually (with my Google-fu) I managed to work out how to get the first result from a search.

import urllib
from bs4 import BeautifulSoup

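# fetches the results page for the search (the words are joined with +)
page = urllib.urlopen('http://context.reverso.net/translation/english-french/i+like+the+green+side+of+the+leaf').read()
soup = BeautifulSoup(page)

# the first "src" div holds the original text and the first "trg" div its translation;
# the <em> tag inside each one is the part that matched the search
print soup.find_all("div", { "class" : "src" })[0].em
print soup.find_all("div", { "class" : "trg" })[0].em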

Now that I had that, I just needed to take an input sentence and work this magic on it. My plan was to search for the whole sentence, then the sentence minus its last word, and so on, until a search term exactly matches a pre-translated text. The rest of the sentence is then handled the same way. Using the longest match available should keep the translation as accurate as possible.
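Stripped of the scraping, the plan is a greedy longest-prefix match. The sketch below is only an illustration, written for Python 3 (the scripts in this post are Python 2). `PHRASES` is a made-up phrase table standing in for Reverso's results, and `lookup` and `translate_greedy` are hypothetical names. Keeping a word untranslated when nothing matches is my own assumption about what should happen in that case.

```python
# Made-up phrase table standing in for Reverso's search results.
PHRASES = {
    "the dog": "le chien",
    "plays with": "joue avec",
    "the ball": "la balle",
}


def lookup(phrase):
    """Return the translation for an exact phrase match, or None."""
    return PHRASES.get(phrase)


def translate_greedy(sentence, lookup):
    """Translate by repeatedly taking the longest matching prefix."""
    words = sentence.split()
    out = []
    while words:
        # try the whole remaining text first, then one word fewer each time
        for n in range(len(words), 0, -1):
            hit = lookup(" ".join(words[:n]))
            if hit is not None:
                out.append(hit)
                words = words[n:]   # drop the part that is now translated
                break
        else:
            # assumption: if not even one word matches, keep it as it is
            out.append(words[0])
            words = words[1:]
    return " ".join(out)


print(translate_greedy("the dog plays with the ball", lookup))
# prints: le chien joue avec la balle
```

Passing `lookup` in as an argument means the same loop could be pointed at a function that queries Reverso instead of a dictionary, which is what the script below does.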

import urllib
from bs4 import BeautifulSoup

text = "Mary was born in Leiden in the Netherlands"
text = text.split(' ')
twords = ""

# fetches the first result for the words; returns its translation only if
# the result matches the search exactly, otherwise None
def lookup(words):
    search = "+".join(words)
    url = 'http://context.reverso.net/translation/english-french/' + search
    page = urllib.urlopen(url).read()
    soup = BeautifulSoup(page)
    src = soup.find_all("div", { "class" : "src" })
    trg = soup.find_all("div", { "class" : "trg" })
    if not src or not trg or src[0].em is None or trg[0].em is None:
        return None # the search came back empty
    if ' '.join(words) == src[0].em.text:
        return trg[0].em.text
    return None

# checks until everything is translated
while len(text) > 0:
    # tries the whole text first, then drops one word at a time from the end
    for i in range(len(text), 0, -1):
        match = lookup(text[:i])
        if match is not None:
            twords += match
            twords += " " # add a space to the end for further words
            text = text[i:]
            break
    else:
        # nothing matched, so the first word is left untranslated
        twords += text[0] + " "
        text = text[1:]

    print twords

It took a while, but here is the test output for “the dog plays with the ball”:

le chien 
le chien se joue de la balle

Surprisingly, the conjugations come out fine most of the time, which I thought was pretty weird. The translator is pretty good, but sometimes the translated text needs a bit of cleaning up. One problem comes up when part of the original text matches an idiom, which messes up the translation. After cleaning up the code and adding comments and an interface (so you can choose which language to translate into), here is the final code:

import urllib
from bs4 import BeautifulSoup

text = raw_input('Translate:\n')
language = raw_input('Into Arabic, German, Spanish, French, Italian, Dutch, Portuguese?\n').strip().lower()    # lowercase, as in 'english-french'
text = text.split(' ')    # puts the words in a list
twords = ""

# fetches the first result for the words and returns its translation,
# or None if there are no results or the result doesn't match exactly
def lookup(words):
    search = "+".join(words)
    url = 'http://context.reverso.net/translation/english-'+language+'/' + search
    page = urllib.urlopen(url).read()
    soup = BeautifulSoup(page)
    src = soup.find_all("div", { "class" : "src" })    # original text of each result
    trg = soup.find_all("div", { "class" : "trg" })    # translated text of each result
    if not src or not trg or src[0].em is None or trg[0].em is None:
        return None            # the search came back empty
    if ' '.join(words) == src[0].em.text:
        return trg[0].em.text  # the search matches the english completely
    return None

# checks until all of the text is translated
while len(text) > 0:
    # tries the whole text first, then one word fewer each time
    for i in range(len(text), 0, -1):
        match = lookup(text[:i])
        if match is not None:      # there is a match for the segment
            twords += match
            twords += " " # add a space to the end for further words
            text = text[i:]        # removes the already-translated text
            break
    else:
        # not even the first word has a match, so it is kept as it is
        twords += text[0] + " "
        text = text[1:]

    print twords # prints the progress
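
The script pastes whatever is typed straight into the URL and joins the words with `+` by hand, so a typo in the language or an apostrophe in the text produces a URL that may not work. Here is a rough sketch of building the URL more carefully, written for Python 3 (the scripts above are Python 2). `build_url`, `LANGUAGES` and `BASE` are hypothetical names of my own; the URL pattern is the one used in this post, and that every language in the prompt is a valid URL segment is an assumption, since only "french" has been tried here.

```python
from urllib.parse import quote_plus

# Language names from the script's prompt. Assumption: each one is a valid
# URL segment on Reverso; only "french" appears in a URL in this post.
LANGUAGES = {"arabic", "german", "spanish", "french",
             "italian", "dutch", "portuguese"}

BASE = "http://context.reverso.net/translation/english-{lang}/{query}"


def build_url(words, language):
    """Return the search URL for a list of words (hypothetical helper)."""
    lang = language.strip().lower()
    if lang not in LANGUAGES:
        raise ValueError("unsupported language: %r" % language)
    # quote_plus turns spaces into + and percent-encodes unsafe characters
    return BASE.format(lang=lang, query=quote_plus(" ".join(words)))


print(build_url("i like the green side of the leaf".split(), "French"))
# prints: http://context.reverso.net/translation/english-french/i+like+the+green+side+of+the+leaf
```

Raising an error for an unknown language stops the script before it makes a request that is bound to come back empty.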