A quick post about something that recently grabbed my attention.
Web scraping (also called web harvesting or web data extraction) is a software technique for extracting information from websites.
I already did some research on the subject when I was playing around with my Raspberry Pi.
There is a lot out there, especially for Python.
But what about my favourite programming language, Haxe?
Again, this was only a quick search, but this is what I found.
A (very?) old project from Jonas Malaco Filho on GitHub.
Check out this code: jonas-haxe, and specifically the scraper part of it. It is written for Neko and uses mostly undocumented classes such as neko.vm.Mutex.
Once you have the HTML page, you can start getting the data out of it!
You will need an HTML/XML parser; I found one written by Yaroslav Sivakov: the HtmlParser Haxe library.
It can also be found on haxelib: http://lib.haxe.org/p/HtmlParser/
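To give a rough idea of what that parsing step looks like, here is a minimal sketch. It is written in Python with the standard html.parser module, not with the HtmlParser haxelib, because I don't want to guess at that library's API. The names LinkCollector, extract_links and sample_html are made up for this illustration, and the HTML string stands in for a page you have already downloaded.

```python
from html.parser import HTMLParser


class LinkCollector(HTMLParser):
    """Collects the href value of every <a> tag it sees (hypothetical helper)."""

    def __init__(self):
        super().__init__()
        self.links = []

    def handle_starttag(self, tag, attrs):
        # attrs is a list of (name, value) pairs; tag names arrive lowercased.
        if tag == "a":
            for name, value in attrs:
                if name == "href" and value:
                    self.links.append(value)


def extract_links(html):
    """Feed an HTML string to the parser and return the collected links."""
    parser = LinkCollector()
    parser.feed(html)
    parser.close()
    return parser.links


# Assumed sample content; a real scraper would use the downloaded page here.
sample_html = (
    '<html><body><a href="/one">One</a> <p>text</p> '
    '<a href="/two">Two</a></body></html>'
)
links = extract_links(sample_html)
print(links)
```

The same idea should carry over to any parser that hands you tags and attributes: walk the document, pick the elements you care about, and collect their values.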
I also found a small (old) Haxe/PHP project that I will post as a reference: https://github.com/andor44/old_scraper.
But then it stops…
This is not a field that many Haxe developers venture into.
I am coding this with OpenFL and regular expressions, but perhaps a better way to go is Node.js!
And you can use Node.js with Haxe through hxnodejs (perhaps not completely ready, but probably good enough for the examples below).
I can’t really say how to get started with Node.js and Haxe because I have never tried it, but from what I have read it shouldn’t be a big problem.
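For comparison, the regular-expression approach I mentioned can be sketched in a few lines. This is again Python and not my actual OpenFL code; TITLE_RE and extract_title are names invented for this example, and the pattern only pulls out the page title. Regular expressions are fine for small, predictable snippets like this, but they get fragile quickly on real-world HTML, which is why a proper parser is usually the safer choice.

```python
import re

# Matches the contents of the first <title> element, ignoring case
# and allowing the title text to span several lines.
TITLE_RE = re.compile(r"<title[^>]*>(.*?)</title>", re.IGNORECASE | re.DOTALL)


def extract_title(html):
    """Return the stripped page title, or None when no <title> is found."""
    match = TITLE_RE.search(html)
    return match.group(1).strip() if match else None


# Assumed sample content standing in for a downloaded page.
page = "<html><head><TITLE> Hello scraper </TITLE></head><body></body></html>"
title = extract_title(page)
print(title)
```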
Some interesting reads… somewhat related to Haxe