Automatic language classification

Name: Anonymous 2011-06-24 18:25



#!/usr/bin/python2

import sys
import bz2

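def classify(text, langs=('english', 'german', 'french')):
    # Rank languages by how few extra bytes the text adds when it is
    # compressed together with each reference corpus (<lang>.txt).
    results = {}
    for lang in langs:
        with open(lang + '.txt') as f:
            corpus = f.read()

        compressed = len(bz2.compress(corpus))
        # A smaller increase means the text shares more patterns with the corpus.
        results[lang] = len(bz2.compress(corpus + text)) - compressed

    # Best match (smallest increase) first.
    return sorted(results, key=results.__getitem__)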

if __name__ == '__main__':
    print "Most likely %s." % classify(sys.stdin.read())[0].capitalize()

$ wget -qO - http://www.gutenberg.org/ebooks/31469.txt.utf8 | ./classific.py
Most likely English.
$ wget -qO - http://www.gutenberg.org/ebooks/22367.txt.utf8 | ./classific.py
Most likely German.
$ wget -qO - http://www.gutenberg.org/ebooks/4968.txt.utf8 | ./classific.py
Most likely French.

Name: Anonymous 2011-06-24 19:04

Fascinating. But I didn't start my own threads whenever I made a shitty program using the language I was learning.

Name: Anonymous 2011-06-24 19:33

>>1
I see how it works. Compressing an input text together with a large file in the same language adds fewer bytes to the compressed size than compressing it with a file in another language, because the compressor can reuse patterns it has already found in the corpus. It's truly ingenious.
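
The idea in >>1 can be sketched in Python 3 with in-memory corpora instead of files. This is only an illustration: the names compression_delta and rank_languages and the toy corpora are assumptions, not part of the original script, and real use would need far larger reference texts.

```python
import bz2
from typing import Dict, List


def compression_delta(corpus: bytes, text: bytes) -> int:
    """Extra compressed bytes needed when text is appended to corpus."""
    return len(bz2.compress(corpus + text)) - len(bz2.compress(corpus))


def rank_languages(text: bytes, corpora: Dict[str, bytes]) -> List[str]:
    """Return language names, smallest compression delta (best match) first."""
    deltas = {lang: compression_delta(corpus, text)
              for lang, corpus in corpora.items()}
    return sorted(deltas, key=deltas.get)


# Toy corpora (assumption: stand-ins for english.txt, german.txt, ...).
corpora = {
    "english": b"the quick brown fox jumps over the lazy dog " * 50,
    "german": "der schnelle braune fuchs springt über den faulen hund "
              .encode("utf-8") * 50,
}

print(rank_languages(b"the quick brown fox jumps over the lazy dog", corpora))
```

Working on bytes throughout avoids the str/bytes mismatch that a straight port of the Python 2 script would hit.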

Name: Anonymous 2011-06-24 19:40

Also, >>3 here again, I bet if you were to rigorously analyze the OP's program, you would conclude that it is equivalent to a bag-of-words Bayesian classifier, with the bonus that it's really fucking easy to implement and port to other languages and platforms.

Name: Anonymous 2011-06-24 20:26

Compressing large files to determine the language of an input file is not slow enough. The program could at least implement sleepsort.

Name: Anonymous 2011-06-24 20:43

>>4
equivalent to a bag-of-words Bayesian classifier
Nope, it's even funnier, as bzip2 doesn't care about words, only about frequent sequences of characters. So it doesn't need huge corpora and can correctly recognize texts where none of the words are present in the corpus, also with typos, bad grammar, OCR artifacts etc.
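
A rough way to see the difference, as a sketch only: a word-level comparison finds nothing in common with a misspelled text, while the compression score is still computed from shared character sequences. The helper names shared_words and delta_bytes and the sample strings are made up for illustration, and a toy corpus this small says nothing about accuracy on real input.

```python
import bz2


def shared_words(corpus: str, text: str) -> set:
    """Words present in both corpus and text (the bag-of-words view)."""
    return set(corpus.lower().split()) & set(text.lower().split())


def delta_bytes(corpus: str, text: str) -> int:
    """Extra compressed bytes when text is appended to corpus."""
    raw = corpus.encode("utf-8")
    return len(bz2.compress(raw + text.encode("utf-8"))) - len(bz2.compress(raw))


# Toy data (assumption): every word in the sample is misspelled.
corpus = "the quick brown fox jumps over the lazy dog " * 50
misspelled = "teh quikc borwn foxx jmups ovre thee lzay dogg"

print(shared_words(corpus, misspelled))  # no whole word matches the corpus
print(delta_bytes(corpus, misspelled))   # a score is still produced
```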

Name: Anonymous 2011-06-24 21:45

>>5
it's okay, he's using FIOC

Name: Anonymous 2011-06-25 1:30

Name: Anonymous 2011-06-25 6:30

XARN QUALITY THREAD

Name: Anonymous 2011-06-25 7:20

Who the fuck is Xarn?

Name: Anonymous 2011-06-25 7:24

You can't recognize images this way.

Name: Anonymous 2011-06-25 7:24

Also, it would fail to recognize voice or other audio data.

Name: Anonymous 2011-06-25 8:33

Name: Anonymous 2011-06-25 8:49

>>13
Someone from /prog/ is having sex? Heresy!

Name: Anonymous 2011-06-25 8:50

>>13
He's not Xarn, idiot.

Name: Anonymous 2011-06-25 9:14

>>15
You clearly don't understand the Xarnness of Xarn

Name: Anonymous 2011-06-25 11:54

In a few months, some asshole will publish this to Reddit, finally resulting in a Wikipedia article about it. Just wait and see.

Name: Anonymous 2011-06-25 12:32

>>1
Brilliant!

Name: Anonymous 2011-06-25 13:20

>>17
/prog/ is so cutting-edge

Name: Anonymous 2011-06-25 17:02

>>19
This has been around since at least 2002, see ``Language trees and zipping''.

Name: Anonymous 2011-06-25 18:15

>>20
lol, academic papers are 4chan level posts, but with less sexism


Watch your gendered pronouns. When you write, you'll want to make sure that you don't do anything to make your readers feel excluded. If you use "he" and "him" all the time, you are excluding half of your potential readership. We'll acknowledge that the he/she solution is a bit cumbersome in writing. However, you might solve the problem as we have done in this document: by alternating "he" and "she" throughout. Other writers advocate always using "she" instead of "he" as a way of acknowledging a long-standing exclusion of women from texts. Whatever decision you make in the end, be sensitive to its effect on your readers.

Name: Anonymous 2011-06-25 19:28

>>21
Watch your gendered pronouns. When you write, you'll want to make sure that you don't do anything to make your readers feel excluded. If you use "he" and "him" all the time, you are excluding half of your potential readership. We'll acknowledge that the he/she solution is a bit cumbersome in writing. However, you might solve the problem as we have done in this document: by alternating "he" and "she" throughout. Other writers advocate always using "she" instead of "he" as a way of acknowledging a long-standing exclusion of women from texts. Whatever decision you make in the end, be sensitive to its effect on your readers.
Nobody gives a shit, fat feminist cunts.

Name: Anonymous 2011-06-25 19:52

>>21
Using ``she'' instead of ``he'' is stupid and I hate you for doing it.

Name: Anonymous 2011-06-25 21:37

>>22,23 go back to /r/mensrights

Name: Anonymous 2011-06-26 0:17

>>22>>23
lol, she thinks she won't lose her job for being a sexist pig with ugly penis.

One of British soccer's leading television commentators was fired Tuesday, a day after being taken off the air and temporarily suspended for making sexist remarks about a female match official.

Andy Gray, the face of Sky Sports' soccer coverage for the past two decades, was dismissed by the broadcaster after "new evidence of unacceptable and offensive behavior" that took place off-air last month.

The former Scotland striker and broadcast colleague Richard Keys had been reprimanded and removed from duty Monday for making derogatory comments about lineswoman Sian Massey, former referee Wendy Toms and West Ham executive Karren Brady.

In an off-air exchange with Andy Gray, Keys commented that "Somebody better get down there and explain offside to her." After Gray suggested that "Women don't know the offside rule", Keys remarked "Course they don't. I can guarantee you there will be a big one today."

Name: VIPPER 2011-06-26 0:19

>>25
Shut up JEWS boy.
Nobody in here gives a shit about your autism.

Name: Anonymous 2011-06-26 0:21

Yeah, feminist trolls in /prog/ !

Name: Anonymous 2011-06-26 1:50

lol "she"

hate when programming-blog-fags use it, like the typical neckbeard  computer programmer reading it is female

Name: Anonymous 2011-06-26 2:07

>>28
your neckbeard is sexist.

Name: Anonymous 2011-06-26 3:57

Xarn is of the kind who would alternate ``she'' and ``he'' so don't laugh at it, fuckers.

Name: Anonymous 2011-06-26 13:47

I always use she, it's funny because the reader has a vagina

Name: Anonymous 2011-06-26 15:43

hax my neckvagina
