Parsing with bash
1
Name:
Egi
2010-11-06 12:01
Any ideas on how to parse a website with bash?
2
Name:
Anonymous
2010-11-06 12:06
Read SICP
3
Name:
Anonymous
2010-11-06 12:07
Well, Egi, I've written many a scraper using curl, grep, and sed. I'm sure you could do that, too.
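To make that concrete, here's a minimal sketch of the curl|grep|sed pattern: a tiny helper (hypothetical, not from any real scraper) that pulls the title text out of HTML on stdin. The URL in the usage line is just a placeholder.

```shell
# extract_title: print the <title> text from HTML on stdin.
# grep -o prints only the matching part of each line; the two sed
# expressions then strip the surrounding tags.
extract_title() {
  grep -o '<title>[^<]*</title>' | sed -e 's/<title>//' -e 's/<\/title>//'
}

# usage against a live page (placeholder URL):
# curl -s http://example.com/ | extract_title

# self-contained demo on a literal string:
printf '<html><head><title>Hello</title></head></html>\n' | extract_title
```

Regex-based extraction like this is fragile against markup changes, but for one-off scraping it's often all you need.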
4
Name:
Anonymous
2010-11-06 12:07
Any ideas on how to sage?
5
Name:
Egi
2010-11-06 12:24
>>3 Thanks! Any examples of using grep/sed on curl output?
6
Name:
Anonymous
2010-11-06 12:26
7
Name:
Anonymous
2010-11-06 12:31
any idea on how to make an app with C?
8
Name:
Anonymous
2010-11-06 12:31
any idea on how to make a website with PHP?
9
Name:
Anonymous
2010-11-06 12:31
any idea on how to make an applet with Java?
10
Name:
Anonymous
2010-11-06 12:32
any idea on how to write a script with Python?
11
Name:
Anonymous
2010-11-06 12:33
>>5
Something like this:
wget "http://dis.4chan.org/read/prog/1215479711" -qO - | grep -Pom1 ".{3}.{2}.O."
12
Name:
Anonymous
2010-11-06 12:36
any idea on how to district a hotdog with mineral water?
13
Name:
Anonymous
2010-11-06 13:32
>>3
I did it too. Now I've moved to python+BeautifulSoup.
As an example of grep+sed, here's my shitty script to download manga from stoptazmo:
#!/bin/bash
tmp="$(mktemp)"    # tempfile is Debian-specific; mktemp is portable
trap 'rm -f "$tmp"' EXIT
manga_home="http://stoptazmo.com/manga-series/$1/"
chapter_list=".$1_chapterlist"
echo "reading chapter list"
wget -q -O "$tmp" "$manga_home"
grep -e mirror "$tmp" | sed -e "s/^[^']*'//;s/'.*//" >"$chapter_list"
total_chapters=$(wc -l <"$chapter_list")
if [[ $total_chapters == 0 ]]; then
    echo "cannot read chapters of $1, aborting"
    exit 1
fi
i=1
while read -r url; do
    echo "getting $i of $total_chapters"
    wget -c "$url"
    i=$((i+1))
done <"$chapter_list"
14
Name:
Anonymous
2010-11-06 13:36
>>13
BeautifulSoup is kind of shitty. It's slow, bug-prone, and doesn't work nicely from one Python version to the next. Try lxml.html instead: it has a fucking awesome .cssselect() function and also offers everything BS can do.