Return Styles: Pseud0ch, Terminal, Valhalla, NES, Geocities, Blue Moon.

Pages: 1-

Critique my code

Name: Anonymous 2011-07-26 4:31

Hello /prog/, I'm a complete beginner and I just wrote the longest script I have yet written. It's not finished, but would you please critique it and tell me if it's ugly or not?


#########################################
# g.e-hentai.org Gallery Scraper v0.1   #  
# by VitamnP                            #
# ENTERPRISE QUALITY                    #
#########################################

import urllib2
import time
from math import ceil
from BeautifulSoup import BeautifulSoup

def requestPage(url):
    # Define user agent to be used in requests
    agent = {'User-Agent':'Uzbl (Webkit 1.3) (Linux i686 [i686])'}
   
    # Request page from server for parsing, identifying as a browser so we don't
    # get banned
    request = urllib2.Request(url, data=None, headers=agent)
    response = urllib2.urlopen(request)
    info = response.info()
    page = response.read()
   
    # Create BeautifulSoup object from page for parsing if the page type is HTML
    soup = BeautifulSoup(page)

    return soup

def getInfo(url):
    # Request page and prepare for parsing
    soup = requestPage(url)   
   
    # Get number of images and total size in MB
    image_string = soup.find(text='Images:')
    info_string = image_string.findNext(text=True)
   
    # Determine number of thumbnail pages and generate URLs
    num_images = float(info_string.split(' ')[0])
    num_pages = ceil(num_images / 20.0) - 1
    pages_list = []
    while num_pages >= 0:
        pages_list.append(url + '?p=' + str(num_pages))
        num_pages -= 1
   
    return {'pages_list' : pages_list, 'info_string' : info_string}
   
def getImagePageLinks(pages_list):
    # Create a list to store image page URLs
    all_page_links = []
   
    # Go through each thumbnail page link in the list, extract the links to
    # the image pages, and store them in a list
    for link in pages_list:
        soup = requestPage(link)
        image_links = soup.findAll('div', {"class" : "gdtm"})
        for link in image_links:
            all_page_links.append(link.findNext('a')['href'])

    return all_page_links
   
def getAllImageLinks(all_page_links, wait_time_secs):
    # Create a list to store the image URLs
    all_image_links = []

    # For each URL in all_image_links, request the page from the server and save
    # the image on the page to a file in the directory specified by the user
    for link in all_image_links:
        soup = requestPage(link)
        topframe = soup.find('iframe')
        imgurl = topframe.findNext('img')['src']
        all_images.append(imgurl)
        time.sleep(wait_time_secs)

    return all_image_links

def download(image_links, save_dir):
   
    for image in image_links:
        filename = save_dir + image.split('/')[-1]
        f = open(filename, 'wb')
        f.write(data)
        f.close()
       
       
   
       
def main():
    print 'g.e-hentai.org Gallery Scraper v0.1'
    print "by VitamnP"
   
    gallery_url = raw_input("Enter gallery URL: \n")
    save_dir = raw_input("Enter save directory: \n")
    wait_time_secs = input("Seconds to wait between page loads: \n")
   
    info = getInfo(gallery_url)
    print "Gallery contains ", info['info_string']

    all_page_links = getImagePageLinks(info['pages_list'])
    all_image_links = getAllImageLinks(all_page_links, wait_time_secs)

    download(all_image_links, save_dir)
   
if __name__ == "__main__":
    main()

Name: Anonymous 2011-07-26 4:45

If it ain't Lisp, it's crap.

Name: Anonymous 2011-07-26 4:50

it's ugly

Name: Anonymous 2011-07-26 4:57

So many comments, not even a docstring, claims to be ENTERPRISE, but there's not even a factory, nor a singleton. append is slow as fuck, but FIOC doesn't use linked lists. Too many temporary variables make the bodies of the procedures heavier than they should be.
Also, THE FORCED INDENTATION OF CODE.
THREAD OVER

Name: Anonymous 2011-07-26 5:00

>>4
Thanks

Name: Anonymous 2011-07-26 5:27

argparse + urllib2 + xpath is the idiom

Name: Anonymous 2011-07-26 6:12

Great size. Look thick. Solid. Tight. Keep us all posted on your continued progress with any new progress pics or vid clips. Show us what you got man. Wanna see how freakin' huge, solid, thick and tight you can get. Thanks for the motivation.

Name: Anonymous 2011-07-26 6:35

>>7
Stop it.

Name: Anonymous 2011-07-26 8:16

I remember writing a scraper for e-hentai when I was young. It was like twenty lines of bash code.

Name: Anonymous 2011-07-26 8:17

>>9
Sage properly, thanks.

Name: Anonymous 2011-07-26 8:26

>>4
append is slow as fuck
you are an idiot.

>>1
Learn to avoid pointless comments like

    # Create BeautifulSoup object from page for parsing if the page type is HTML
    soup = BeautifulSoup(page)

Though it's perfectly fine for someone who only begins to code, like a small child who constantly tells herself what she is going to do, the sooner you learn to internalize this stuff, the better.

Stuff like:

    pages_list = []
    while num_pages >= 0:
        pages_list.append(url + '?p=' + str(num_pages))
        num_pages -= 1

Is better written as:

    pages_list = ['%s?p=%d' % (url, i) for i in xrange(num_pages)]


Also, the creator of BeautifulSoup himself recommends using lxml.

Name: Anonymous 2011-07-26 8:29

>>9
I have script to scrape stoptazmo.com


update_all.sh:

#/bin/bash


MANGALIST=/tmp/_tempmanga

if [ x"$1" != x"skip_dl" ]; then
    wget 'http://stoptazmo.com/downloads/manga_series.php'  -O"$MANGALIST"
fi

grep "$MANGALIST"  -e td\>\ *\<a\ href | \
    grep manga-series | \
    sed 's/.*<a href=.\([^>]*\).>.*/\1/' | \
    sort -R | \
    xargs -P1 -n1 ./update_category.sh


update_category.sh

#!/bin/bash

URL="$1"
CATEGORY=$(echo -n "$1" | sed 's@/$@@;s@.*/@@g')
mkdir "$CATEGORY" 2>/dev/null
cd "$CATEGORY" || exit 1
echo "$URL"
FILELIST="/tmp/_temp_$CATEGORY"
echo "Updating $CATEGORY"
wget "$URL" -q -O "$FILELIST" || exit 1
grep "file_name=" "$FILELIST" | \
    sed """s@.*<td><a href=['\"]\([^'\"]*\)['\"].*@\1@; s@&amp;@&@g""" |
    sort -u | \
    xargs -n1 -P4 ../update_file_from_url.sh


update_file_from_url.sh:

#!/bin/bash
URL="$1"
filename="$(../name_from_url.sh "$URL")"
if [ -f "$filename" ]; then
   if zip -q -T "$filename" >/dev/null 2>&1; then
        echo "$filename skipped"
       exit 0
   fi
   mv "$filename" ".bak.$filename"
fi
echo "getting $filename"
wget -q -O "$filename" "$URL" && zip -q -T "$filename" >/dev/null 2>&1 && rm -f ".bak.$filename"


name_from_url.sh

echo -n $1 | sed 's/.*file_name=\(.*\)/\1/g'


Note that it uses -P in update_category.sh to run in several thr^W processes at the same time. OP's uses one thread only.

Name: Anonymous 2011-07-26 8:31

Is better written as:

NO.

Name: Anonymous 2011-07-26 8:32

Seek porn, create script instead of fapping.

Upvote for you.

Name: Anonymous 2011-07-26 8:37

>>1

At requestPage:
Why assign soup? Why not just return BeautifulSoup(page) ?

Also:



def getAllImageLinks(all_page_links, wait_time_secs):
    # Create a list to store the image URLs
    all_image_links = []

    # For each URL in all_image_links, request the page from the server and save
    # the image on the page to a file in the directory specified by the user
    for link in all_image_links:
        soup = requestPage(link)
        topframe = soup.find('iframe')
        imgurl = topframe.findNext('img')['src']
        all_images.append(imgurl)
        time.sleep(wait_time_secs)

    return all_image_links


But all_image_links is empty, thus it won't return shite.

Name: Anonymous 2011-07-26 8:59

It would really improve the performance if it was multi-threaded.

Name: Anonymous 2011-07-26 9:01

one line of wget

Name: Anonymous 2011-07-26 9:01

>>15
Another bug (I believe) is all_images.append(imgurl)

Shouldn't it be all_image_links.append(imgurl)
?

Name: Anonymous 2011-07-26 11:58

You should be using XPath expressions:

Where doc is your parsed document,

getImagePageLinks:
image_links = [el.attrib["href"] for el in doc.xpath("//div[@class='gdtm']//a")]

getAllImageLinks:
imgurl = doc.xpath("//iframe//img")[0].attrib["src"]

Also, ose the os.* functions:
filename = os.path.join(save_dir, os.path.basename(image))

Name: Anonymous 2011-07-26 18:30

- Everything you write will be open source. No FASLs, DLLs or EXEs. There may be some very important instances where a business wouldn't want anybody to see the internal implementation of their modules and having strict control over levels of access are necessary. Python third-party library licensing is overly complex. Licenses like MIT allow you to create derived works as long as you maintain attrubution; GNU GPL, or other 'viral' licenses don't allow derived works without inheriting the same license. To inherit the benefits of an open source culture you also inherit the complexities of the licensing hell.
- Installation mentality, Python has inherited the idea that libraries should be installed, so it infact is designed to work inside unix package management, which basically contains a fair amount of baggage (library version issues) and reduced portability. Of course it must be possible to package libraries with your application, but its not conventional and can be hard to deploy as a desktop app due to cross platform issues, language version, etc. Open Source projects generally don't care about Windows, most open source developers use Linux because "Windows sucks".
- Probably the biggest practical problem with Python is that there's no well-defined API that doesn't change. This make life easier for Guido and tough on everybody else. That's the real cause of Python's "version hell".
- Global Interpreter Lock (GIL) is a significant barrier to concurrency. Due to signaling with a CPU-bound thread, it can cause a slowdown even on single processor. Reason for employing GIL in Python is to easy the integration of C/C++ libraries. Additionally, CPython interpreter code is not thread-safe, so the only way other threads can do useful work is if they are in some C/C++ routine, which must be thread-safe.
- Python (like most other scripting languages) does not require variables to be declared, as (let (x 123) ...) in Lisp or int x = 123 in C/C++. This means that Python can't even detect a trivial typo - it will produce a program, which will continue working for hours until it reaches the typo - THEN go boom and you lost all unsaved data. Local and global scopes are unintuitive. Having variables leak after a for-loop can definitely be confusing. Worse, binding of loop indices can be very confusing; e.g. "for a in list: result.append(lambda: fcn(a))" probably won't do what you think it would. Why nonlocal/global/auto-local scope nonsense?
- Python indulges messy horizontal code (> 80 chars per line), where in Lisp one would use "let" to break computaion into manageable pieces. Get used to things like self.convertId([(name, uidutil.getId(obj)) for name, obj in container.items() if IContainer.isInstance(obj)])
- Crippled support for functional programming. Python's lambda is limited to a single expression and doesn't allow conditionals, a side effect of Python making a distinction between expressions and statements. Assignments are not expressions. Most useful high-order functions were deprecated in Python 3.0 and have to be imported from functools. No continuations or even tail call optimization: "I don't like reading code that was written by someone trying to use tail recursion." --Guido
- Python's syntax, based on SETL language and mathematical Set Theory, is non-uniform, hard to understand and parse, compared to simpler languages, like Lisp, Smalltalk, Nial and Factor. Instead of usual "fold" and "map" functions, Python uses "set comprehension" syntax, which has an overhelmingly large collection of underlying linguistic and notational conventions, each with it's own variable binding semantics. To complicate things even more, Python uses the so called "off-side" indentation rule (aka Forced Indentation of Code), also taken from a math-intensive Haskell language. This, in effect, makes Python look like an overengineered toy for math geeks.
- Quite quirky: triple-quoted strings seem like a syntax-decision from a David Lynch movie, and double-underscores, like __init__, seem appropriate in C, but not in a language that provides list comprehensions. There has to be a better way to mark certain features as internal or special than just calling it __feature__.
- Python is unintuitive and has too many confusing non-orthogonal features: references can't be used as hash keys; expressions in default arguments are calculated when the function is defined, not when it’s called. Why have both dictionaries and objects? Why have both types and duck-typing? Why is there ":" in the syntax if it almost always has a newline after it?
- Python's garbage collection uses naive reference counting, which is slow and doesn't handle circular references, meaning you have to expect subtle memory leaks and can't easily use arbitrary graphs as your data. In effect Python complicates even simple tasks, like keeping directory tree with symlinks.
- Problems with arithmetic: no Numerical Tower (nor even rational/complex numbers), meaning 1/2 would produce 0, instead of 0.5, leading to subtle and dangerous errors.
- Poor UTF support and unicode string handling is somewhat awkward.
- self everywhere can make you feel like OO was bolted on, even though it wasn't.
- No outstanding feature, that makes the language, like the brevity of APL or macros of Lisp.

Name: Anonymous 2011-07-26 18:38

>>20
When I get off work I'll debunk every single fucking statement in this awful, ultimately destructive troll post.

Name: Anonymous 2011-07-26 18:40

float(info_string.split(' ')[0])
string.split(' ')[0])

LOLOLOLOLOLOL.

Name: OP 2011-07-26 18:53

>>15
Thanks for pointing this out, I changed some of the variable names to make it less confusing but I ended up making things worse. I think using so many temp variables must be a bad idea

>>11
The comments thing is what you thought, if I write the comment it helps me think of what code I have to write to make the effect I want. I'll just keep practising and I expect I won't have to do it after a while. About replacing the while loop with a list comprehension, I was roasted for doing that in the Python IRC channel? Which is better? They also told me not to use recursion which was disappointing because after I found out what recursion was I thought it was so cool I rewrote all my little programs to use it as much as possible.

>>20
Don't understand any of this except for not having to declare variables, which does seem like kind of a bad idea

Thanks for the advice and trolling /prog/, sorry if I made you cringe

Name: OP 2011-07-26 18:56

>>19
Thanks, that's really helpful

Name: Anonymous 2011-07-26 19:06

>>23
Don't understand
Because you're an idiot.

Name: Anonymous 2011-07-26 19:23

>>23
Ignore the retard haters, use recursion, list comprehensions and higher-order functions (map, filter, etc.) whenever you feel it's clearer to express things that way (i.e. most of the time).  Really.  And you should learn some other languages too: C|Pascal + Lua|Javascript|Lisp + assembly.  Also, you must master the concepts of lexical scoping and closures (hint: to play with these, use Lua|Javascript|Lisp).  Oh, and don't forget to read up about things like the Halting problem and P vs NP.

Name: Anonymous 2011-07-26 21:01

>>26
filthy academic. None of your advice will allow anyone to become a true enterprise-level developer.

Name: Anonymous 2011-07-26 21:28

>>27
Well, that's just, like, your opinion, man.

Name: Anonymous 2011-07-26 21:29

>>27

enterprise-level developers don't need to know about P vs NP? What if they encounter an NP problem while developing and spend time trying to make a polynomial algorithm for it?

Name: Anonymous 2011-07-26 21:50

>>27 Bad troll.
>>28 Are you just copy-pasting that reply or do you write it out every time?
>>29 Don't get trolled.

Name: Anonymous 2011-07-26 22:55

>>30
Are you just copy-pasting that reply or do you write it out every time?
His reply is a quote from some jewish movie.

Name: Anonymous 2011-07-27 3:22

>>21
We're still waiting for you.

Name: Anonymous 2011-07-27 4:23

>>31
Well, that's just, like, your opinion, man.

Don't change these.
Name: Email:
Entire Thread Thread List