Parsing with bash
1
Name:
Egi
2010-11-06 12:01
Any ideas on how to parse a website with bash?
2
Name:
Anonymous
2010-11-06 12:06
Read SICP
3
Name:
Anonymous
2010-11-06 12:07
Well, Egi, I've written many a scraper using curl, grep, and sed. I'm sure you could do that, too.
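To make that concrete, here's a minimal sketch of the curl|grep|sed pattern: a tiny helper (hypothetical, not from any real scraper) that pulls the title text out of HTML on stdin. The URL in the usage line is just a placeholder.

```shell
# extract_title: print the <title> text from HTML on stdin.
# grep -o prints only the matching part of each line; the two sed
# expressions then strip the surrounding tags.
extract_title() {
  grep -o '<title>[^<]*</title>' | sed -e 's/<title>//' -e 's/<\/title>//'
}

# usage against a live page (placeholder URL):
# curl -s http://example.com/ | extract_title

# self-contained demo on a literal string:
printf '<html><head><title>Hello</title></head></html>\n' | extract_title
```

Regex-based extraction like this is fragile against markup changes, but for one-off scraping it's often all you need.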
4
Name:
Anonymous
2010-11-06 12:07
Any ideas on how to sage?
5
Name:
Egi
2010-11-06 12:24
>>3 Thanks! Any examples of using grep/sed on curl output?
6
Name:
Anonymous
2010-11-06 12:26
7
Name:
Anonymous
2010-11-06 12:31
any idea on how to make an app with C?
8
Name:
Anonymous
2010-11-06 12:31
any idea on how to make a website with PHP?
9
Name:
Anonymous
2010-11-06 12:31
any idea on how to make an applet with Java?
10
Name:
Anonymous
2010-11-06 12:32
any idea on how to write a script with Python?
11
Name:
Anonymous
2010-11-06 12:33
>>5
Something like this:
wget "http://dis.4chan.org/read/prog/1215479711" -qO - | grep -Pom1 ".{3}.{2}.O."
12
Name:
Anonymous
2010-11-06 12:36
any idea on how to district a hotdog with mineral water?
13
Name:
Anonymous
2010-11-06 13:32
>>3
I did it too. Now I've moved to python+BeautifulSoup.
As an example of grep+sed, here's my shitty script to download manga from stoptazmo:
#!/bin/bash
tmp="$(mktemp)"    # tempfile is Debian-specific; mktemp is portable
trap 'rm -f "$tmp"' EXIT
manga_home="http://stoptazmo.com/manga-series/$1/"
chapter_list=".$1_chapterlist"
echo "reading chapter list"
wget -q -O "$tmp" "$manga_home"
grep -e mirror "$tmp" | sed -e "s/^[^']*'//;s/'.*//" >"$chapter_list"
total_chapters=$(wc -l <"$chapter_list")
if [[ $total_chapters == 0 ]]; then
    echo "cannot read chapters of $1, aborting"
    exit 1
fi
i=1
while read -r url; do
    echo "getting $i of $total_chapters"
    wget -c "$url"
    i=$((i+1))
done <"$chapter_list"
14
Name:
Anonymous
2010-11-06 13:36
>>13
BeautifulSoup is kind of shitty. It's slow, bug-prone, and doesn't work nicely from one Python version to the next. Try lxml.html instead: it has a fucking awesome .cssselect() function and also offers everything BS can do.