Automatic language classification

Name: Anonymous 2011-06-24 18:25



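#!/usr/bin/python2

import sys
import bz2

def classify(text, langs=('english', 'german', 'french')):
    # Expects one reference corpus per language (english.txt, ...) in the working directory.
    results = {}
    for lang in langs:
        with open(lang + '.txt') as f:
            corpus = f.read()

        # Size of the corpus compressed on its own.
        compressed = len(bz2.compress(corpus))
        # Extra bytes needed once the input is appended; fewer means a closer match.
        results[lang] = len(bz2.compress(corpus + text)) - compressed

    # Languages ordered from smallest increase (best match) to largest.
    return sorted(results, key=results.__getitem__)

if __name__ == '__main__':
    print "Most likely %s." % classify(sys.stdin.read())[0].capitalize()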

$ wget -qO - http://www.gutenberg.org/ebooks/31469.txt.utf8 | ./classific.py
Most likely English.
$ wget -qO - http://www.gutenberg.org/ebooks/22367.txt.utf8 | ./classific.py
Most likely German.
$ wget -qO - http://www.gutenberg.org/ebooks/4968.txt.utf8 | ./classific.py
Most likely French.
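
For anyone on Python 3, here is a rough sketch of the same trick that takes the corpora as bytes instead of reading english.txt and friends from disk. The name rank_languages, the corpora argument, and the toy sentences are my own assumptions for illustration, not part of the script above. Real use would want large reference texts like the Gutenberg books in the examples.

```python
import bz2


def rank_languages(text, corpora):
    """Rank language names from most to least likely for `text`.

    `text` is bytes; `corpora` maps a language name to reference bytes.
    Both the function name and this signature are illustrative assumptions.
    """
    costs = {}
    for lang, corpus in corpora.items():
        # Compressed size of the reference corpus on its own.
        baseline = len(bz2.compress(corpus))
        # Extra compressed bytes caused by appending the input text.
        costs[lang] = len(bz2.compress(corpus + text)) - baseline
    # Smallest increase first: the input shares the most structure with it.
    return sorted(costs, key=costs.get)


if __name__ == "__main__":
    # Toy corpora, far too small for serious use.
    corpora = {
        "english": b"the quick brown fox jumps over the lazy dog " * 50,
        "german": b"der schnelle braune fuchs springt ueber den faulen hund " * 50,
    }
    sample = b"the quick brown fox jumps over the lazy dog "
    print(rank_languages(sample, corpora))
```

The idea is unchanged: the compressor has already seen the corpus, so input that resembles it adds few bytes, and the language whose corpus grows the least is ranked first.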

Name: Anonymous 2011-06-25 18:15

>>20
lol, academic papers are 4chan-level posts, but with less sexism


Watch your gendered pronouns. When you write, you'll want to make sure that you don't do anything to make your readers feel excluded. If you use "he" and "him" all the time, you are excluding half of your potential readership. We'll acknowledge that the he/she solution is a bit cumbersome in writing. However, you might solve the problem as we have done in this document: by alternating "he" and "she" throughout. Other writers advocate always using "she" instead of "he" as a way of acknowledging a long-standing exclusion of women from texts. Whatever decision you make in the end, be sensitive to its effect on your readers.
