Scrape infoq.com?

Name: bumb 2013-08-22 12:12

Does anyone know of a good scraper for infoq.com?
I primarily want to scrape this talk, and the other one by Sussman, without having to log in with stupid credentials:
https://www.infoq.com/presentations/self-heal-scalable-system
https://www.infoq.com/presentations/We-Really-Dont-Know-How-To-Compute

Unless you know of (or have) an exploitable account I can use to scrape the MP3 and slides (PDF? .ppt? shit if I know).

Name: Anonymous 2013-08-22 12:18

Ask Stallman to ask Sussman to share the slides and the soundtrack of his talks in an open format and without paywalls or signupwalls.

Name: Anonymous 2013-08-22 12:21

curl

Name: Anonymous 2013-08-25 7:04

If I read the page source correctly, I am supposed to send:
POST /$FILE_TYPE.action HTTP/1.0
Content-Type: application/x-www-form-urlencoded
Content-Length: $POST_LENGTH

filename=presentations/$FILE_NAME

Where $FILE_TYPE is the type of download, mp3download or pdfdownload, and $FILE_NAME is the name of the file given by the page (which differs for each file). I think I can grep for it via:
<div class="download_presentation">, <form id="mp3Form"> and <form id="pdfForm">. I just hope there is no other check in the /$FILE_TYPE.action scripts.
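
For what it's worth, the request above could be put together roughly like this. It is only a sketch: the host is taken from the sample page URL, build_download_request is a made-up helper name, and nothing here tells you whether the server also wants a login cookie. It builds the request but does not send it.

```python
from urllib.parse import urlencode
from urllib.request import Request

# Assumption: same host as the sample page; the thread notes there is no https.
BASE_URL = "http://www.infoq.com"


def build_download_request(file_type, file_name):
    """Build (but do not send) the POST described in the post above.

    file_type: "mp3download" or "pdfdownload"
    file_name: the file name without the "presentations/" prefix
    """
    # safe="/" keeps the slash literal, matching the body shown in the post.
    body = urlencode({"filename": "presentations/" + file_name}, safe="/")
    return Request(
        BASE_URL + "/" + file_type + ".action",
        data=body.encode("ascii"),
        headers={"Content-Type": "application/x-www-form-urlencoded"},
        method="POST",
    )


req = build_download_request("mp3download", "infoq-13-jul-systemsthatrun.mp3")
print(req.get_method(), req.full_url)
print(req.data.decode("ascii"))
```

Passing req to urllib.request.urlopen would actually send it (Content-Length is filled in automatically); whether that gets you the file or a login page is exactly the open question.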

Here is a sample excerpt from http://www.infoq.com/presentations/self-heal-scalable-system
starting at line 690

<div class="download_presentation">
                               
                                    <ul>
                                        <li>Download</li>
                                       
                                            <li>
                                                <a id="mp3" title="" href="javascript:;">MP3</a>
                                                <form method="post" action="/mp3download.action" target="_blank" id="mp3Form">
                                                    <input type="hidden" name="filename" value="presentations/infoq-13-jul-systemsthatrun.mp3"/>
                                                </form>
                                            </li>
                                       
                                       
                                            <li>|</li>
                                       
                                       
                                            <li>
                                                <a id="slides" title="" href="javascript:;">Slides</a>
                                                <form method="post" action="/pdfdownload.action" target="_blank" id="pdfForm">
                                                    <input type="hidden" name="filename" value="presentations/LambdaJam2013-JoeArmstrong-Systemsthatrunforeverselfhealandscale.pdf"/>
                                                </form>
                                            </li>
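
Instead of grepping, the two forms in an excerpt like the one above could be picked apart with the standard library HTML parser. A rough sketch, assuming the form ids stay mp3Form and pdfForm and the hidden input is always named filename; DownloadFormParser and SAMPLE are names made up for illustration.

```python
from html.parser import HTMLParser


class DownloadFormParser(HTMLParser):
    """Collect the action and hidden filename of the mp3Form/pdfForm forms."""

    def __init__(self):
        super().__init__()
        self.forms = {}       # form id -> {"action": ..., "filename": ...}
        self._current = None  # id of the form we are currently inside

    def handle_starttag(self, tag, attrs):
        attrs = dict(attrs)
        if tag == "form" and attrs.get("id") in ("mp3Form", "pdfForm"):
            self._current = attrs["id"]
            self.forms[self._current] = {
                "action": attrs.get("action"),
                "filename": None,
            }
        elif tag == "input" and self._current and attrs.get("name") == "filename":
            self.forms[self._current]["filename"] = attrs.get("value")

    def handle_endtag(self, tag):
        if tag == "form":
            self._current = None


# Trimmed copy of the excerpt, just enough to exercise the parser.
SAMPLE = """
<form method="post" action="/mp3download.action" target="_blank" id="mp3Form">
    <input type="hidden" name="filename" value="presentations/infoq-13-jul-systemsthatrun.mp3"/>
</form>
"""

parser = DownloadFormParser()
parser.feed(SAMPLE)
print(parser.forms)
```

Feeding it the whole page instead of SAMPLE should give one entry per form, each with the action to POST to and the filename value to send.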

It's also sad they do not have https.
