Return Styles: Pseud0ch, Terminal, Valhalla, NES, Geocities, Blue Moon.

Pages: 1-4041-

choosing a random line from a file in haskell

Name: Anonymous 2009-01-20 3:35

is there any way to do it that's not slow as fuck?

Name: Anonymous 2009-01-20 3:56

yes there is.

Name: Anonymous 2009-01-20 3:57

>>1
Yes. Iterate through the file, each time choosing whether to keep the current choice or choose the current line instead. The first line is 1 likely to be picked. The second is 1/2 likely. The third is 1/3 likely. The fourth is 1/4 likely. The nth is 1/n likely. When you run out of lines, return the current choice.

Whether this is slow as fuck is a matter of opinion, of course. Disk access is slow as fuck.

Name: Anonymous 2009-01-20 4:16

>>3
Wouldn't it be 1/2**n not 1/n?

Name: =+=*=F=R=O=Z=E=N==V=O=I=D=*=+= !FrOzEn2BUo 2009-01-20 4:31

>>3
Why not just pick the  random()*numLines line?

_________________________
orbis terrarum delenda est

Name: Anonymous 2009-01-20 4:35

>>3
i tried that. it was about 10 times as slow as reading the whole file into a list, generating a random number between 0 and the length of that list - 1, and then indexing the list.
and even that was about 100 times as slow as this simple perl script:
#!/usr/bin/perl

my ($i, $line) = (0, '');
open FILE, '<file.txt';
while(<FILE>) $line = $_ if rand($i++) < 1;
close FILE;
print $line;

Name: Anonymous 2009-01-20 4:43

Look at the c that your Haskell generates.  And post it here.

Name: Anonymous 2009-01-20 6:48

>>7
haskell code:
import IO
import Random

randomLine :: Handle -> String -> Integer -> IO String

randomLine f l n = do
 eof <- hIsEOF f
 random <- getStdRandom (randomR (0,n))
 if eof then return l else do
  line <- hGetLine f
  if random < 1 then randomLine f line (n + 1) else randomLine f l (n + 1)

main = do
 file <- openFile "file.txt" ReadMode
 randomLine file "" 0 >>= putStrLn


c code generated by ghc: http://pastebin.com/f66f46bfd

for comparison:
perl code:
#!/usr/bin/perl

my ($i, $line) = (0, '');
open FILE, '<file.txt';
while(<FILE>) { $line = $_ if rand($i++) < 1; }
close FILE;
print $line;


c code that i wrote:
#include <limits.h>
#include <stdio.h>
#include <stdlib.h>
#include <string.h>

int main(void){
 char line[LINE_MAX], ret[LINE_MAX];
 FILE *file = fopen("file.txt", "r");
 if(file) for(long long i = 1; !feof(file); ++i)
  if(fgets(line, LINE_MAX, file) && arc4random() % i < 1)
   strcpy(ret, line);
 fputs(ret, stdout);
 return 0;
}


time results (averaged over 10 runs, with a 178690 line file):
haskell: 2.251s user 0.109s system
perl: 0.516s user 0.02s system
c: 0.093s user 0.015s system

Name: Anonymous 2009-01-20 6:58

Try using ByteString.

Name: Anonymous 2009-01-20 7:10

>>9
that's a tiny bit faster (2.06s user 0.093s system), but still a lot slower than perl.

Name: Anonymous 2009-01-20 7:37

>>10
Is this with runGhc, ghc, ghc -O, or what?

It didn't seem too slow, about 0.15s, with http://cairnarvon.rotahall.org/misc/progwordles.2.txt (I realise that's a smaller file) on my Core 2 Duo. Perl took about 0.25s.

Name: Anonymous 2009-01-20 7:43

i once considered writing a utility called rac or something like that that would be a random version of cat(1)

Name: Anonymous 2009-01-20 7:43

>>10
[code]
import System
import Random
import qualified Data.ByteString.Char8 as Char8

main = do file <- fmap Char8.lines . Char8.readFile . head =<< getArgs
          Char8.putStrLn . (file !!) =<< randomRIO (0, length file)[code]

Compiled with [m]-O -optc-O3[/m].

Time results with a 86104 line file (just 1 run, their speed doesn't vary that much):
[m]
real    0m0.176s
user    0m0.128s
sys     0m0.028s[/m]

Name: >>13 2009-01-20 7:45

Fuck.

import System
import Random
import qualified Data.ByteString.Char8 as Char8

main = do file <- fmap Char8.lines . Char8.readFile . head =<< getArgs
          Char8.putStrLn . (file !!) =<< randomRIO (0, length file)


Compiled with -O -optc-O3.

Time results with a 86104 line file (just 1 run, their speed doesn't vary that much):

real    0m0.176s
user    0m0.128s
sys     0m0.028s

Name: Anonymous 2009-01-20 7:48

>>12
raccoon?

Name: >>14 2009-01-20 7:55

Speed is the same without the optimization flags.

Name: Anonymous 2009-01-20 8:00

>>11
ghc -O2

>>14
thanks, that's a little bit faster than the perl one.
still nowhere close to c, though.

Name: Anonymous 2009-01-20 8:20

>>11
Hello Xarn. How are you today

Name: Anonymous 2009-01-20 8:21

>>12
you mean like this?
#!/bin/sh

randomcmd=/usr/games/random
racstdin=true

args=$(getopt u $*)
if [ $? -ne 0 ]
then
 echo 'Usage: rac [-u] [file ...]'
 exit 1
fi
set -- $args
for i
do
 case "$i" in
  -u) buff="-r";;
  --) break;;
 esac
 shift
done
shift

for file
do
 $randomcmd $buff -f $file
 racstdin=false
done

if $racstdin
then
 $randomcmd $buff
fi

Name: Anonymous 2009-01-20 8:54

>>14
But is lines lazy???

Name: Anonymous 2009-01-20 9:02

>>20
Char8.lines is strict.

Name: Anonymous 2009-01-20 16:39

>>18 is a Xarn fanclub member.

Name: Anonymous 2009-01-20 18:19

>>22
what did you expect in a thread like this? only Xarn uses haskell for remotely useful things.

Name: Anonymous 2009-01-20 18:27

who or what is xarn?

Name: Anonymous 2009-01-20 18:29

>>24
Xarn is Xarn.

Name: Anonymous 2009-01-20 19:20

>>25
I lol'd and then considered the epistemological ramifications.

Name: Anonymous 2009-01-20 19:51

Holy shit.
Seek to random position in file, read backward until BOF or a newline, and then read forward until a newline. There's your line of text.

Name: Anonymous 2009-01-20 19:57

>>27
Do both at the same time!

Name: Anonymous 2009-01-20 20:01

sort -R /usr/share/dict/words | sed q

Name: Anonymous 2009-01-20 20:26

>>27
consider a file where half the lines are about 900 characters and the other half are about 100 characters. your method would choose a long line 9 times out of 10.

Name: Anonymous 2009-01-20 20:34

>>30
So >>27's method fails on some pathological input-- so what? It's still better than every other suggestion put forward so far.

Name: Anonymous 2009-01-20 20:36

Use a real database system

Name: Anonymous 2009-01-20 20:37

>>31
That's not true.  For almost EVERY input there would be bias with your method.  You can't sacrifice functionality for optimization.

Name: Anonymous 2009-01-20 21:57

>>31
input where every line is the same length would be pathological input. in the real world most text files have lines of different lengths.

>>32
that would be slow as fuck and overkill when the only thing i'm ever going to do with this text file is pick a random line from it each time the program is run... but now that i think about it, maybe it'd be a good idea to just make a file containing the positions of all the newlines as fixed-size integers, then i can seek to a random position in that file (modulo the length of each integer), read the number, and seek to the right place in the text file...

Name: Anonymous 2009-01-20 21:58

>>33
Prop. I. I am not >>27.
Prop. II. The OP never said he wanted unbiased results.

Name: Anonymous 2009-01-20 22:08

>>34
that would be slow as fuck
Your wrong.

Name: Anonymous 2009-01-20 22:17

>>12
$ shuf --version
shuf (GNU coreutils) 6.12

Name: Anonymous 2009-01-20 22:23

>>11
$ time shuf -n 1 progwordles.2.txt
Yale 4

real    0m0.005s
user    0m0.004s
sys    0m0.001s

Name: Anonymous 2009-01-20 22:37

>>37
$ shuf --version
shuf (Anonix anoncoreutils) 1.01

Name: Anonymous 2009-01-21 0:59

>>36
it'd be a lot slower than
make a file containing the positions of all the newlines as fixed-size integers, then i can seek to a random position in that file (modulo the length of each integer), read the number, and seek to the right place in the text file...

>>37-39
$ shuf --version
zsh: command not found: shuf

Name: Anonymous 2009-01-21 3:53

>>37
cheers. thats why i try not to implement too many ideas that are quite obvious because i know some cunt willve written it

Name: Anonymous 2009-01-21 4:15

>>41
too bad that cunt released it under a non-free license so you'd have to use the whole slow, shitty program instead of taking out and using just the useful parts. also it's a lot slower than building an index and then seeking to a random position in the index like >>34 described.

also, >>40. you'd have to install it pretty much anywhere you want to use it.

Name: Anonymous 2009-01-21 7:50

Have you heard of fortune? It comes with a neat little utility called strfile that reads a file and dumps out a cache of line positions and stuff that's useful for pulling out random lines.

If you have some spare time, I suggest browsing through the source to the program. It's got a handful of nifty gems of code for various kinds of text-mashing that have come in handy at unexpected times.

Name: Anonymous 2009-01-21 8:20

Thank you >>45

Name: Anonymous 2009-01-21 8:27

>>42
lol what?

Name: Anonymous 2009-01-21 10:20

>>46
which part of >>42 are you replying to?

Name: 45 2009-01-21 10:31

>>45
You're welcome

Name: Anonymous 2009-01-21 14:58

>>47
his lies

Name: Anonymous 2009-01-21 18:31

>>49
there are no lies in >>42.

Name: Anonymous 2009-01-21 18:43

>>50
SHUT UP TROLLS

Name: Anonymous 2009-01-21 19:29

>>51
Calling >>50 a troll doesn't make what >>42 said any less true.

Name: Anonymous 2009-01-21 20:22

>>52
ITS A GNU PACKAGE FREAKS

Name: Anonymous 2009-01-21 20:35

>>53
GET OUT TROLL

Name: Anonymous 2009-01-21 20:56

>>54
NO U

Name: Anonymous 2009-01-21 21:02

>>43
Please don't come inside me!

Don't change these.
Name: Email:
Entire Thread Thread List