

Authenticate before WebRequest

Name: Anonymous 2008-07-17 3:40

Okay, I know this is probably the worst place to look for programming advice, but here goes.

I'm using IronPython (in b4 forced indentation of code) to write a web scraper.

I've written many a web scraper in my day, but this is the first time I've attempted to scrape content that requires authentication. And not just basic authentication like accessing a shared drive or something (so it's not as easy as adding NetworkCredentials to the request object); this one uses forms authentication.

I'm trying to scrape a forum that requires you to be logged in before you can read any posts. I do have an active account on said forum, and can view it in FF/IE just fine.

I've been toying around with the LiveHTTPHeaders extension for FF and have been trying to get the auth cookie from the login page using the POST content, but I'm stuck right now because I have to wait an hour due to too many login attempts.

Does anyone have any direction, or code examples (any language is fine), on how to do this? Or how to bind my FF cookies to my programmatic web requests?

Thanks!
in b4 read SICP

Name: Anonymous 2008-07-17 3:53

You are such a programming moron, I should really not help you (written many a webscraper, indeed!).

BUT...  You're doing it wrong.  So wrong that I had to help you.

Whatever your HTTP interface, be it cURL, LWP, or GUIDOSCOCK, you should be able to keep what is commonly referred to as a cookie jar.  This will allow you to make an HTTP POST request with your login credentials and then crawl the site with the cookie being sent to the server automagically.  Now, it might be that Guido doesn't believe in cookies, or jars for that matter, in which case you will have to try a BETTER LANGUAGE.
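Since you're stuck with Guido anyway: the jar dance looks roughly like this (a sketch only -- `http.cookiejar` / `urllib.request` are the modern spellings of cookielib / urllib2, the throwaway local server below stands in for your forum, and the URLs, form fields, and session cookie are all made up; lift the real ones from LiveHTTPHeaders):

```python
# Sketch of the cookie-jar flow: POST the login form once, then crawl
# with the same opener so the cookie rides along automagically.
import http.cookiejar
import http.server
import threading
import urllib.parse
import urllib.request

class FakeForum(http.server.BaseHTTPRequestHandler):
    def do_POST(self):  # the login form: hands out a session cookie
        self.rfile.read(int(self.headers["Content-Length"]))
        self.send_response(200)
        self.send_header("Set-Cookie", "session=sekrit")
        self.end_headers()

    def do_GET(self):  # a protected page: checks the cookie came back
        ok = "session=sekrit" in self.headers.get("Cookie", "")
        self.send_response(200)
        self.end_headers()
        self.wfile.write(b"posts" if ok else b"login required")

    def log_message(self, *args):  # keep the demo quiet
        pass

server = http.server.HTTPServer(("127.0.0.1", 0), FakeForum)
threading.Thread(target=server.serve_forever, daemon=True).start()
base = "http://127.0.0.1:%d" % server.server_address[1]

jar = http.cookiejar.CookieJar()
opener = urllib.request.build_opener(
    urllib.request.HTTPCookieProcessor(jar))

# 1. POST the credentials; any Set-Cookie in the response lands in `jar`.
creds = urllib.parse.urlencode({"user": "moron", "pass": "moron73"})
opener.open(base + "/login", creds.encode())

# 2. Crawl; the jar attaches the cookie to every later request.
page = opener.open(base + "/whatever").read()
print(page.decode())  # -> posts

server.shutdown()
```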

Name: Anonymous 2008-07-17 4:06

wow. what is this cookie jar you speak of? What a crazy concept. I just wish I could have been born with your brains.

The response object isn't returning any cookies, so yes, I'm doing it wrong, somewhere, but I know Guido's cock loves cookies.

And I suppose you want me to write it in LISP, or HASKELL?

Name: Anonymous 2008-07-17 4:18

Name: Anonymous 2008-07-17 7:26

What is a "web scraper"?

Name: Anonymous 2008-07-17 7:56

Name: Anonymous 2008-07-17 8:45

>>2
Python cookie jars: http://docs.python.org/lib/module-cookielib.html
Python http module which handles cookies transparently: http://docs.python.org/lib/module-urllib2.html

>>1
> have been trying to get the auth cookie
What are you, some kind of retard? If you just want the cookie contents, install https://addons.mozilla.org/en-US/firefox/addon/573, grab the cookie data, and be done with it.
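If you do go that lazy route, stuffing the stolen cookie into your requests is one header (sketch; the cookie value and URL here are made up):

```python
import urllib.request

# Pretend this is what the addon showed you (value is made up):
stolen = "session=sekrit"

req = urllib.request.Request("http://blahblah.blah/whatever")
req.add_header("Cookie", stolen)   # jam it into every request you send
print(req.get_header("Cookie"))    # -> session=sekrit
```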

That isn't going to work for a reliable scraper -- in the long run you're going to have to make the scraper authenticate itself (via the site's login shit), take the cookie it gives you (ie, either use a CookieJar or grab the Set-Cookie HTTP header yourself and add it to all of your requests, which is retarded), and do your scraping with that authentication.

If you need a code sample, then you need to go back to reading SICP. I've already linked to all of the documentation that you'll need for this project.

>>8
DON'T HELP HIM!!!

Name: Anonymous 2008-07-17 9:23

perl -MLWP::UserAgent -le'$ua = new LWP::UserAgent cookie_jar => {} and $ua->post("http://blahblah.blah/login", [ user => "moron", pass => "moron73" ])->is_success and print $ua->get("http://blahblah.blah/whatever")->content'

Gee, that was hard.
