is there any way to do it that's not slow as fuck?
Name:
Anonymous2009-01-20 3:56
yes there is.
Name:
Anonymous2009-01-20 3:57
>>1
Yes. Iterate through the file, each time choosing whether to keep the current choice or replace it with the current line. The first line is picked with probability 1, the second with probability 1/2, the third with probability 1/3, the fourth with probability 1/4, and the nth with probability 1/n. When you run out of lines, return the current choice.
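To see why every line ends up equally likely: line k is chosen at step k with probability 1/k, and it then survives each later step j with probability 1 - 1/j, so its overall chance is (1/k) * (k/(k+1)) * ((k+1)/(k+2)) * ... * ((n-1)/n) = 1/n, the same for every k.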
Whether this is slow as fuck is a matter of opinion, of course. Disk access is slow as fuck.
_________________________
orbis terrarum delenda est
Name:
Anonymous2009-01-20 4:35
>>3
i tried that. it was about 10 times as slow as reading the whole file into a list, generating a random number between 0 and the length of that list - 1, and then indexing the list.
and even that was about 100 times as slow as this simple perl script:
#!/usr/bin/perl
my ($i, $line) = (0, '');
open FILE, '<file.txt';
while (<FILE>) { $line = $_ if rand(++$i) < 1; }  # keep the nth line with probability 1/n
close FILE;
print $line;
Name:
Anonymous2009-01-20 4:43
Look at the C that your Haskell generates. And post it here.
haskell:
import System.IO
import System.Random

randomLine :: Handle -> String -> Int -> IO String
randomLine f l n = do
  eof <- hIsEOF f
  random <- getStdRandom (randomR (0, n))
  if eof then return l else do
    line <- hGetLine f
    if random < 1 then randomLine f line (n + 1) else randomLine f l (n + 1)

main = do
  file <- openFile "file.txt" ReadMode
  randomLine file "" 0 >>= putStrLn
perl:
my ($i, $line) = (0, '');
open FILE, '<file.txt';
while (<FILE>) { $line = $_ if rand(++$i) < 1; }
close FILE;
print $line;
c code that i wrote:
#include <limits.h>
#include <stdio.h>
#include <stdlib.h>
#include <string.h>
int main(void){
    char line[LINE_MAX], ret[LINE_MAX] = "";
    FILE *file = fopen("file.txt", "r");
    if(file){
        /* keep the ith line with probability 1/i */
        for(long long i = 1; !feof(file); ++i)
            if(fgets(line, LINE_MAX, file) && arc4random() % i < 1)
                strcpy(ret, line);
        fclose(file);
    }
    fputs(ret, stdout);
    return 0;
}
time results (averaged over 10 runs, with a 178690 line file):
haskell: 2.251s user 0.109s system
perl: 0.516s user 0.02s system
c: 0.093s user 0.015s system
Name:
Anonymous2009-01-20 6:58
Try using ByteString.
Name:
Anonymous2009-01-20 7:10
>>9
that's a tiny bit faster (2.06s user 0.093s system), but still a lot slower than perl.
args=$(getopt u $*)
if [ $? -ne 0 ]
then
echo 'Usage: rac [-u] [file ...]'
exit 1
fi
set -- $args
for i
do
case "$i" in
-u) buff="-r";;
--) break;;
esac
shift
done
shift
for file
do
$randomcmd $buff -f $file
racstdin=false
done
>>27
consider a file where half the lines are about 900 characters and the other half are about 100 characters. your method would choose a long line 9 times out of 10.
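a rough C sketch of what that method presumably is (guessing from this post that >>27 meant: seek to a uniformly random byte offset and print the line containing it; the file name and buffer size here are made up):
#include <stdio.h>
#include <stdlib.h>
#include <time.h>

int main(void){
    FILE *file = fopen("file.txt", "r");
    if(!file) return 1;

    /* pick a uniformly random byte offset inside the file */
    fseek(file, 0, SEEK_END);
    long size = ftell(file);
    if(size <= 0) return 1;
    srand(time(NULL));
    long pos = (long)((double)rand() / ((double)RAND_MAX + 1) * size);

    /* walk backwards to the start of the line containing that byte */
    while(pos > 0){
        fseek(file, pos - 1, SEEK_SET);
        if(fgetc(file) == '\n') break;
        --pos;
    }

    fseek(file, pos, SEEK_SET);
    char line[4096];
    if(fgets(line, sizeof line, file)) fputs(line, stdout);
    fclose(file);
    return 0;
}
every byte is equally likely to be hit, so a line's chance of being printed is proportional to its length, which is where the 9-in-10 figure comes from.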
Name:
Anonymous2009-01-20 20:34
>>30
So >>27's method fails on some pathological input-- so what? It's still better than every other suggestion put forward so far.
Name:
Anonymous2009-01-20 20:36
Use a real database system
Name:
Anonymous2009-01-20 20:37
>>31
That's not true. Your method is biased on almost EVERY input, not just pathological ones. You can't sacrifice correctness for optimization.
Name:
Anonymous2009-01-20 21:57
>>31
input where every line is the same length is the pathological case. in the real world most text files have lines of different lengths.
>>32
that would be slow as fuck and overkill when the only thing i'm ever going to do with this text file is pick a random line from it each time the program is run... but now that i think about it, maybe it'd be a good idea to just make an index file containing the byte positions of all the newlines as fixed-size integers. then i can seek to a random multiple of the integer size in that index file, read the position stored there, and seek to the right place in the text file...
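a rough C sketch of that index idea, with some liberties: it stores line-start offsets (offset 0, then the byte after each newline) instead of the newline positions themselves, since those are what you actually seek to, and the index file name file.idx is made up. in practice you'd build the index once and reuse it every run instead of rebuilding it like this:
#include <stdio.h>
#include <stdlib.h>
#include <time.h>

/* write one fixed-size long per line: the byte offset where that line starts */
static long build_index(const char *textpath, const char *indexpath){
    FILE *in = fopen(textpath, "r"), *out = fopen(indexpath, "wb");
    if(!in || !out) exit(1);
    long off = 0, count = 0;
    int c, at_line_start = 1;
    while((c = fgetc(in)) != EOF){
        if(at_line_start){
            fwrite(&off, sizeof off, 1, out);
            ++count;
            at_line_start = 0;
        }
        ++off;
        if(c == '\n') at_line_start = 1;
    }
    fclose(in); fclose(out);
    return count;
}

int main(void){
    long count = build_index("file.txt", "file.idx");
    if(count == 0) return 1;

    /* pick a random index entry, read the stored offset,
       and jump straight to that line in the text file */
    srand(time(NULL));
    long k = rand() % count;

    FILE *idx = fopen("file.idx", "rb"), *txt = fopen("file.txt", "r");
    if(!idx || !txt) return 1;
    long off = 0;
    fseek(idx, k * (long)sizeof off, SEEK_SET);
    if(fread(&off, sizeof off, 1, idx) != 1) return 1;
    fseek(txt, off, SEEK_SET);

    char line[4096];
    if(fgets(line, sizeof line, txt)) fputs(line, stdout);
    fclose(idx); fclose(txt);
    return 0;
}
once the index exists, picking a line is two seeks and two small reads, so it stays fast however big the text file gets.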
Name:
Anonymous2009-01-20 21:58
>>33 Prop. I. I am not >>27. Prop. II. The OP never said he wanted unbiased results.
>>36
it'd be a lot slower than making a file containing the positions of all the newlines as fixed-size integers, seeking to a random position in that file (rounded down to a multiple of the integer size), reading the number, and seeking to the right place in the text file...
>>37-39
$ shuf --version
zsh: command not found: shuf