web harvesting

A quick post about something that recently grabbed my attention.

Scraping

Web scraping (web harvesting or web data extraction) is a computer software technique of extracting information from websites.

Source: https://en.wikipedia.org/wiki/Web_scraping

I already did some research on the subject when I was playing around with my Raspberry Pi. There is a lot out there, especially for Python.

But what about my favourite programming language Haxe?

Again, this was only a quick search! And this is what I found.

A (very?) old project from Jonas Malaco Filho on GitHub. Check out this code: jonas-haxe, and specifically the scraper part of it. It is written for Neko and relies on mostly undocumented classes like neko.vm.Mutex. Once you have the HTML page you can start getting the data from it!

You will need an HTML/XML parser; I found one written by Yaroslav Sivakov: the HtmlParser Haxe library. It can also be found on haxelib: http://lib.haxe.org/p/HtmlParser/

I also found a little (old) Haxe/PHP project that I will post as a reference: https://github.com/andor44/old_scraper. But then it stops…

Not a field that a lot of Haxe developers venture into. Fun!

Update #2

The HtmlParser library doesn’t work with the HTML I am scraping, so I need to focus on just the parts I want to use. Regular expressions are the way to go, and I suck at them. Luckily I found an online tool that helps with testing regexes: http://www.regexr.com/ from an old Flash hero, gskinner.
Another thing I ran into was data from HTTPS sites. You need something “extra” to download HTML files from there: install hxssl via haxelib (haxelib install hxssl) and add it to your build.hxml (-lib hxssl).
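
To give an idea of the regular expression approach mentioned above, here is a minimal sketch in Haxe. It is not the code from my project: the class name LinkScraper, the function extractLinks, the pattern and the sample HTML are all made up for illustration, and the pattern only handles double-quoted href attributes. It uses Haxe's built-in EReg, so no extra library is needed.

```haxe
class LinkScraper {
    // Collect the href value of every <a> tag found in the given HTML string.
    // Hypothetical helper: only double-quoted href="..." attributes are handled.
    public static function extractLinks(html:String):Array<String> {
        // Group 1 captures whatever sits between the quotes of href="...".
        var linkPattern = ~/<a\s[^>]*href="([^"]*)"/i;
        var links = [];
        var rest = html;
        // EReg.match only finds the first hit, so keep matching on the remainder.
        while (linkPattern.match(rest)) {
            links.push(linkPattern.matched(1));
            rest = linkPattern.matchedRight();
        }
        return links;
    }

    static function main() {
        // Inline sample so the sketch runs offline. On a sys target such as Neko
        // the string could come from haxe.Http.requestUrl(...) instead.
        var html = '<p><a href="/one">One</a> and <a class="x" href="/two">Two</a></p>';
        for (link in extractLinks(html)) {
            trace(link);
        }
    }
}
```

Running this should trace the two paths /one and /two. Real pages are messier than the sample, so expect to tweak the pattern per site (regexr is handy for that).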

Update #1

I am coding this with OpenFL and regular expressions, but perhaps a better way to go is Node.js! And you can use Node.js with Haxe through hxnodejs (perhaps not completely ready, but probably good enough for the examples below).

I can’t really say how to start with Node.js and Haxe because I have never tried it, but from what I have read about it, it shouldn’t be a big problem. Fun again!

Read this

Some interesting reads… somewhat related to Haxe