Scraping data with phantomjs.

For the sake of their server, I won't post the link, but recently I needed to scrape a load of data from a fairly detailed page. This is pretty trivial using Ruby and the Nokogiri gem - nothing is particularly hard about scraping data when a gem does all the hard work for you. However, this particular site loads all its data through ajax requests which return JSON. Perfect! Right?

Well, not really. I jumped straight for the ajax URL, which on first appearance gave me all the JSON-data-goodness I could ever need. However, after some experimenting I started getting 404 errors on pages I knew existed. It turns out the site has some rate limiting and/or session protection against people like me who are after their data (and yes, it is purely for educational purposes) - if the ajax URL was accessed directly, without first visiting its parent page, it would throw a 404. The bastards; how dare they make getting their lovely formatted JSON harder than loading a URL.

Attempts at faking the Referer header, setting cookies, requesting the URLs in particular orders, etc. all failed. So I turned to phantomjs to get the job done. Since phantomjs is effectively a browser, the website in question knew nothing of my plans to harvest all their data.
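
For the curious, the trick boils down to something like this: load the parent page as a real visitor would, then fire the ajax request from inside the page so it carries the same cookies and referer as the site's own code. The URLs and delay below are placeholders, not the real site's:

    var page = require('webpage').create();

    // Look like a normal browser while we're at it.
    page.settings.userAgent = 'Mozilla/5.0 (Macintosh; Intel Mac OS X 10_8) ' +
      'AppleWebKit/537.4 (KHTML, like Gecko) Safari/537.4';

    page.open('http://example.com/listing/42', function (status) {
      if (status !== 'success') {
        console.log('failed to load parent page');
        phantom.exit(1);
        return;
      }

      // Crude: give the page's own scripts a moment to set up whatever
      // session state the server checks for.
      setTimeout(function () {
        // Fetch the JSON from *inside* the page, so the request is
        // indistinguishable from the site's own ajax call.
        var json = page.evaluate(function () {
          var xhr = new XMLHttpRequest();
          xhr.open('GET', '/ajax/listing/42.json', false); // synchronous
          xhr.send();
          return xhr.responseText;
        });

        console.log(json);
        phantom.exit();
      }, 3000);
    });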

The long and short of it is in the gist, for your full browsing pleasure. It is by no means a good script - I wanted to implement waitFor throughout (in place of the setTimeout) but I didn't have time [read: couldn't be bothered] to sort out the sandboxed variable scoping that phantomjs enforces inside page.evaluate.
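
For completeness, waitFor is just a polling helper along the lines of the waitfor.js example that ships with phantomjs; swapping it in for the setTimeout would look roughly like this (the URL and selector are again placeholders):

    var page = require('webpage').create();

    // Poll a condition instead of guessing a fixed delay.
    function waitFor(testFx, onReady, timeOutMillis) {
      var maxWait = timeOutMillis || 5000,
          start = new Date().getTime(),
          interval = setInterval(function () {
            if (testFx()) {
              clearInterval(interval);
              onReady();
            } else if (new Date().getTime() - start > maxWait) {
              console.log('waitFor() timed out');
              phantom.exit(1);
            }
          }, 250);
    }

    page.open('http://example.com/listing/42', function () {
      // Wait until the ajax-loaded content actually exists, rather
      // than sleeping and hoping.
      waitFor(function () {
        // page.evaluate is sandboxed: it can't see variables from this
        // script, so everything it needs must live inside the function
        // (or be passed in as extra arguments in phantomjs >= 1.6).
        return page.evaluate(function () {
          return document.querySelector('#results') !== null;
        });
      }, function () {
        console.log('content loaded - scrape away');
        phantom.exit();
      });
    });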

Happy scraping.
