/prog/ - Automatic language classification

Name: Anonymous 2011-06-24 18:25





#!/usr/bin/python2



import sys

import bz2



def classify(text, langs=('english', 'german', 'french')):

    results = {}

    for lang in langs:

        with open(lang + '.txt') as f:

            corpus = f.read()



        compressed = len(bz2.compress(corpus))

        results[lang] = len(bz2.compress(corpus + text)) - compressed



    return sorted(results, key=results.__getitem__)



if __name__ == '__main__':

    print "Most likely %s." % classify(sys.stdin.read())[0].capitalize()



$ wget -qO - http://www.gutenberg.org/ebooks/31469.txt.utf8 | ./classific.py

Most likely English.

$ wget -qO - http://www.gutenberg.org/ebooks/22367.txt.utf8 | ./classific.py

Most likely German.

$ wget -qO - http://www.gutenberg.org/ebooks/4968.txt.utf8 | ./classific.py

Most likely French.

Name: Anonymous 2011-06-24 19:40

Also, >>3 here again, I bet if you were to rigorously analyze the OPs program, one would conclude that it is equivalent to a bag-of-words Bayesian classifier, with the bonus that it's really fucking easy to implement and port to other languages and platforms.

Automatic language classification

1 Name: Anonymous 2011-06-24 18:25

4 Name: Anonymous 2011-06-24 19:40

Name: Anonymous 2011-06-24 18:25

Name: Anonymous 2011-06-24 19:40