
Authenticate before WebRequest

Name: Anonymous 2008-07-17 3:40

Okay, I know this is probably the worst place to look for programming advice, but here goes.

I'm using IronPython (in b4 forced indentation of code) to write a web scraper.

I've written many a web scraper in my day, but this is the first time I've attempted to scrape content that requires authentication. And not just basic authentication like accessing a shared drive or something (so it's not as easy as adding NetworkCredentials to the request object); this one uses forms authentication.

I'm trying to scrape a forum that requires you to be logged in before you can read any posts. I do have an active account on said forum, and can view it in FF/IE just fine.

I've been toying around with the LiveHTTPHeaders extension for FF, trying to get the auth cookie from the login page by replaying the POST content, but I'm stuck right now because I have to wait an hour due to too many login attempts.

Does anyone have any direction or code examples (any language is fine) on how to do this, or on how to bind my FF cookies to my programmatic web requests?

Thanks!
in b4 read SICP

Name: Anonymous 2008-07-17 3:53

You are such a programming moron, I should really not help you (written many a webscraper, indeed!).

BUT...  You're doing it wrong.  So wrong that I had to help you.

Whatever your HTTP interface, be it cURL, LWP, or GUIDOSCOCK, you should be able to keep what is commonly referred to as a cookie jar.  This will allow you to make an HTTP POST request with your login credentials and then crawl the site with the cookie being sent to the server automagically.  Now, it might be that Guido doesn't believe in cookies, or jars for that matter, in which case you will have to try a BETTER LANGUAGE.
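For what it's worth, here is a rough sketch of the cookie jar idea using the CPython 3 standard library (http.cookiejar plus urllib.request; in the Python 2 line these modules were called cookielib and urllib2, and whether your IronPython build ships them is an assumption). The FakeForum server, the /login path, the "user" and "password" field names and the credentials are all made up so the thing runs offline; a real forum will use its own form fields, which you can read off the LiveHTTPHeaders capture.

```python
import http.cookiejar
import threading
import urllib.parse
import urllib.request
from http.server import BaseHTTPRequestHandler, HTTPServer


class FakeForum(BaseHTTPRequestHandler):
    """Hypothetical stand-in for the real forum so the sketch runs offline."""

    def do_POST(self):
        # Login form handler: check the posted fields, hand out a session cookie.
        length = int(self.headers.get("Content-Length", 0))
        form = urllib.parse.parse_qs(self.rfile.read(length).decode())
        ok = form.get("user") == ["anon"] and form.get("password") == ["hunter2"]
        self.send_response(200 if ok else 403)
        if ok:
            self.send_header("Set-Cookie", "session=abc123; Path=/")
        self.send_header("Content-Length", "0")
        self.end_headers()

    def do_GET(self):
        # Every page is readable only when the session cookie comes back.
        logged_in = "session=abc123" in self.headers.get("Cookie", "")
        body = b"secret posts" if logged_in else b"please log in"
        self.send_response(200 if logged_in else 403)
        self.send_header("Content-Length", str(len(body)))
        self.end_headers()
        self.wfile.write(body)

    def log_message(self, *args):
        pass  # keep the console quiet


def scrape_with_login(base_url, user, password, path):
    """POST the login form once, then fetch `path` with the same cookie jar."""
    jar = http.cookiejar.CookieJar()
    opener = urllib.request.build_opener(
        urllib.request.HTTPCookieProcessor(jar),  # stores and resends cookies
        urllib.request.ProxyHandler({}),  # ignore proxy env vars (demo only)
    )
    form = urllib.parse.urlencode({"user": user, "password": password}).encode()
    opener.open(base_url + "/login", data=form).close()  # data= makes it a POST
    with opener.open(base_url + path) as page:  # cookie is attached for us
        return page.read().decode(), [cookie.name for cookie in jar]


def run_demo():
    server = HTTPServer(("127.0.0.1", 0), FakeForum)  # port 0: any free port
    threading.Thread(target=server.serve_forever, daemon=True).start()
    try:
        base = "http://127.0.0.1:%d" % server.server_port
        return scrape_with_login(base, "anon", "hunter2", "/thread/1")
    finally:
        server.shutdown()
        server.server_close()


if __name__ == "__main__":
    print(run_demo())
```

The point is that the login POST and the later GETs go through the same opener, so you log in once and reuse the jar for the whole crawl instead of hammering the login page and tripping the lockout.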
