
Automatic language classification

Name: Anonymous 2011-06-24 18:25



#!/usr/bin/python2

import sys
import bz2

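def classify(text, langs=('english', 'german', 'french')):
    # Rank langs by how few extra bytes bz2 needs to encode text after
    # each language's reference corpus (read from <lang>.txt).
    results = {}
    for lang in langs:
        with open(lang + '.txt') as f:
            corpus = f.read()

        # Baseline: compressed size of the corpus alone.
        compressed = len(bz2.compress(corpus))
        # Cost of text: growth in compressed size once text is appended.
        results[lang] = len(bz2.compress(corpus + text)) - compressed

    # Smallest growth first, so element 0 is the best match.
    return sorted(results, key=results.__getitem__)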

if __name__ == '__main__':
    print "Most likely %s." % classify(sys.stdin.read())[0].capitalize()

$ wget -qO - http://www.gutenberg.org/ebooks/31469.txt.utf8 | ./classific.py
Most likely English.
$ wget -qO - http://www.gutenberg.org/ebooks/22367.txt.utf8 | ./classific.py
Most likely German.
$ wget -qO - http://www.gutenberg.org/ebooks/4968.txt.utf8 | ./classific.py
Most likely French.
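
The script above reads its corpora from english.txt, german.txt and french.txt in the working directory. For anyone who wants to poke at the idea without those files, here is a rough sketch of the same compression-delta trick with the corpora passed in as bytes. The names compression_cost and corpora are made up for this example and are not part of the script above, and it assumes Python 3, where bz2.compress takes bytes.

```python
import bz2


def compression_cost(corpus, text):
    """Extra compressed bytes needed for text once it is appended to corpus."""
    baseline = len(bz2.compress(corpus))
    return len(bz2.compress(corpus + text)) - baseline


def classify(text, corpora):
    """Return the keys of corpora, cheapest (most likely) language first.

    corpora is a hypothetical dict mapping a language name to a bytes
    reference corpus; text is the bytes sample to classify.
    """
    costs = {lang: compression_cost(corpus, text)
             for lang, corpus in corpora.items()}
    return sorted(costs, key=costs.__getitem__)


if __name__ == '__main__':
    # Toy corpora, far too small for reliable results; real use wants
    # reference texts of a decent size, as in the script above.
    corpora = {
        'english': b'the quick brown fox jumps over the lazy dog ' * 50,
        'german': b'der schnelle braune fuchs springt ueber den hund ' * 50,
    }
    print(classify(b'the lazy dog jumps over the quick fox', corpora))
```

The reasoning is that a compressor which has already seen a corpus in language X should need fewer extra bytes for more text in X than for text in some other language. Ties and very short inputs are not handled in any special way here.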

Name: Anonymous 2011-06-25 19:28

>>21
Watch your gendered pronouns. When you write, you'll want to make sure that you don't do anything to make your readers feel excluded. If you use "he" and "him" all the time, you are excluding half of your potential readership. We'll acknowledge that the he/she solution is a bit cumbersome in writing. However, you might solve the problem as we have done in this document: by alternating "he" and "she" throughout. Other writers advocate always using "she" instead of "he" as a way of acknowledging a long-standing exclusion of women from texts. Whatever decision you make in the end, be sensitive to its effect on your readers.
Nobody gives a shit, fat feminist cunts.

Name: Anonymous 2011-06-25 19:52

>>21
Using ``she'' instead of ``he'' is stupid and I hate you for doing it.
