
{"id":2211,"date":"2015-07-24T09:18:22","date_gmt":"2015-07-24T08:18:22","guid":{"rendered":"http:\/\/www.matthijskamstra.nl\/blog\/?p=2211"},"modified":"2015-08-22T20:04:46","modified_gmt":"2015-08-22T19:04:46","slug":"scraping-with-haxe","status":"publish","type":"post","link":"https:\/\/www.matthijskamstra.nl\/blog\/2015\/07\/24\/scraping-with-haxe\/","title":{"rendered":"Scraping with Haxe"},"content":{"rendered":"<p>A quick post about something that grabbed my attention.<\/p>\n<h3>Scraping<\/h3>\n<blockquote>\n<p>Web scraping (web harvesting or web data extraction) is a computer software technique of extracting information from websites.<\/p>\n<\/blockquote>\n<p>Source: <a href=\"https:\/\/en.wikipedia.org\/wiki\/Web_scraping\" target=\"_blank\">https:\/\/en.wikipedia.org\/wiki\/Web_scraping<\/a><\/p>\n<p>I already did some research on the subject when I was playing around with my Raspberry Pi. There is a lot out there, especially for Python.<\/p>\n<p>But what about my favourite programming language <strong>Haxe<\/strong>?<\/p>\n<p>Again, this was a quick search, and this is what I found.<\/p>\n<p>A (very?) old project from <a href=\"https:\/\/github.com\/jonasmalacofilho\" target=\"_blank\">Jonas Malaco Filho<\/a> on GitHub. Check out this code: <a href=\"https:\/\/github.com\/jonasmalacofilho\/jonas-haxe\" target=\"_blank\">jonas-haxe<\/a> and specifically the <a href=\"https:\/\/github.com\/jonasmalacofilho\/jonas-haxe\/tree\/haxe3migration\/src\/jonas\/scraper\" target=\"_blank\">scraper part<\/a> of it. 
It is written for Neko and uses mostly undocumented classes like <a href=\"http:\/\/api.haxe.org\/neko\/vm\/Mutex.html\" target=\"_blank\">neko.vm.Mutex<\/a>. Once you have the HTML page, you can start getting the data out of it!<\/p>\n<p>You will need an HTML\/XML parser; I found one written by <a href=\"https:\/\/bitbucket.org\/yar3333\" target=\"_blank\">Yaroslav Sivakov<\/a> &#8211; the <a href=\"https:\/\/bitbucket.org\/yar3333\/haxe-htmlparser\" target=\"_blank\">HtmlParser haxe library<\/a>. It can also be found on haxelib: <a href=\"http:\/\/lib.haxe.org\/p\/HtmlParser\/\" target=\"_blank\">http:\/\/lib.haxe.org\/p\/HtmlParser\/<\/a><\/p>\n<p>I also found a little (old) Haxe\/PHP project that I will post as a reference: <a href=\"https:\/\/github.com\/andor44\/old_scraper\" target=\"_blank\">https:\/\/github.com\/andor44\/old_scraper<\/a>. But then it stops&#8230;<\/p>\n<p>Not a field that a lot of Haxe developers walk. Fun!<\/p>\n<h2>Update #2<\/h2>\n<ol>\n<li>The HtmlParser library doesn&#8217;t work with the HTML I am scraping, so I need to target just the parts I want to use. Regular expressions are the way to go, and I suck at them. Luckily I found an online tool that helps with testing <a href=\"http:\/\/haxe.org\/manual\/std-regex.html\" target=\"_blank\">regexes<\/a>: <a href=\"http:\/\/www.regexr.com\/\" target=\"_blank\">http:\/\/www.regexr.com\/<\/a> from an old Flash hero, <a href=\"https:\/\/twitter.com\/gskinner\/\" target=\"_blank\">gskinner<\/a>.<\/li>\n<li>Another thing I ran into was the data from <strong>https<\/strong> sites. You need something &#8220;extra&#8221; to download HTML files from there: install <a href=\"https:\/\/github.com\/tong\/hxssl\" target=\"_blank\">hxssl<\/a> via haxelib (<code>haxelib install hxssl<\/code>) and add it to your build.hxml (<code>-lib hxssl<\/code>).<\/li>\n<\/ol>\n<h2>Update #1<\/h2>\n<p>I am coding this with OpenFL and regular expressions, but perhaps a better way to go is node.js! 
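<\/p>\n<p>For the record, the regular-expression approach from update #2 looks roughly like this. A minimal sketch, not my actual code: the URL and the pattern are placeholders, and for https pages it only works with hxssl set up as described above.<\/p>\n<pre><code>\/\/ build.hxml: -main Main  -neko scraper.n  -lib hxssl (for https)\nclass Main {\n  static function main() {\n    \/\/ download the page (https needs hxssl)\n    var html = haxe.Http.requestUrl(\"https:\/\/example.com\/\");\n    \/\/ grab every link target with a regex\n    var r = ~\/href=\"([^\"]+)\"\/;\n    while (r.match(html)) {\n      trace(r.matched(1)); \/\/ the captured url\n      html = r.matchedRight(); \/\/ continue after this match\n    }\n  }\n}<\/code><\/pre>\n<p>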
You can use node.js with Haxe (<a href=\"https:\/\/github.com\/HaxeFoundation\/hxnodejs\" target=\"_blank\">hxnodejs<\/a> is perhaps not completely ready, but probably good enough for the examples below).<\/p>\n<ul>\n<li><a href=\"https:\/\/scotch.io\/tutorials\/scraping-the-web-with-node-js\" target=\"_blank\">https:\/\/scotch.io\/tutorials\/scraping-the-web-with-node-js<\/a> <\/li>\n<li><a href=\"http:\/\/nrabinowitz.github.io\/pjscrape\/\" target=\"_blank\">http:\/\/nrabinowitz.github.io\/pjscrape\/<\/a> <\/li>\n<li><a href=\"https:\/\/medialab.github.io\/artoo\/\" target=\"_blank\">https:\/\/medialab.github.io\/artoo\/<\/a> <\/li>\n<li><a href=\"https:\/\/github.com\/ruipgil\/scraperjs\" target=\"_blank\">https:\/\/github.com\/ruipgil\/scraperjs<\/a> <\/li>\n<li><a href=\"http:\/\/www.smashingmagazine.com\/2015\/04\/web-scraping-with-nodejs\/\" target=\"_blank\">http:\/\/www.smashingmagazine.com\/2015\/04\/web-scraping-with-nodejs\/<\/a> <\/li>\n<li><a href=\"https:\/\/impythonist.wordpress.com\/2015\/01\/06\/ultimate-guide-for-scraping-javascript-rendered-web-pages\/\" target=\"_blank\">https:\/\/impythonist.wordpress.com\/2015\/01\/06\/ultimate-guide-for-scraping-javascript-rendered-web-pages\/<\/a> <\/li>\n<li><a href=\"http:\/\/code.tutsplus.com\/tutorials\/screen-scraping-with-nodejs--net-25560\" target=\"_blank\">http:\/\/code.tutsplus.com\/tutorials\/screen-scraping-with-nodejs&#8211;net-25560<\/a> <\/li>\n<li><a href=\"http:\/\/noodlejs.com\/\" target=\"_blank\">http:\/\/noodlejs.com\/<\/a> <\/li>\n<li><a href=\"http:\/\/webscraper.io\/\" target=\"_blank\">http:\/\/webscraper.io\/<\/a> <\/li>\n<\/ul>\n<p>I can&#8217;t really say how to start with node.js and Haxe because I have never tried it, but from what I have read it shouldn&#8217;t be a big problem. 
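<\/p>\n<p>I can&#8217;t vouch for it, but going by the hxnodejs readme, fetching a page would look something like this (an untested sketch; the class and event names come straight from node&#8217;s http module):<\/p>\n<pre><code>\/\/ compile with: haxe -lib hxnodejs -main Main -js main.js, then run: node main.js\nclass Main {\n  static function main() {\n    \/\/ stream the response body and print it when done\n    js.node.Http.get(\"http:\/\/example.com\/\", function(res) {\n      var body = \"\";\n      res.on(\"data\", function(chunk) { body += chunk; });\n      res.on(\"end\", function() { trace(body); });\n    });\n  }\n}<\/code><\/pre>\n<p>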
Fun again!<\/p>\n<h3>Read this<\/h3>\n<p>Some interesting reads&#8230; somewhat related to Haxe<\/p>\n<ul>\n<li><a href=\"https:\/\/blog.hartleybrody.com\/web-scraping\/\" target=\"_blank\">https:\/\/blog.hartleybrody.com\/web-scraping\/<\/a><\/li>\n<li><a href=\"http:\/\/blog.databigbang.com\/tag\/java\/\" target=\"_blank\">http:\/\/blog.databigbang.com\/tag\/java\/<\/a><\/li>\n<li><a href=\"https:\/\/github.com\/databigbang\/stream-oriented-knuth-morris-pratt\" target=\"_blank\">https:\/\/github.com\/databigbang\/stream-oriented-knuth-morris-pratt<\/a><\/li>\n<\/ul>\n","protected":false},"excerpt":{"rendered":"<p>A quick post about something that grabbed my attention. Scraping Web scraping (web harvesting or web data extraction) is a computer software technique of extracting information from websites. Source: https:\/\/en.wikipedia.org\/wiki\/Web_scraping I already did some research on the subject when I was playing around with my Raspberry Pi. There is a lot out there, especially 
[&hellip;]<\/p>\n","protected":false},"author":2,"featured_media":2231,"comment_status":"open","ping_status":"open","sticky":false,"template":"","format":"standard","meta":{"footnotes":""},"categories":[360,385],"tags":[412,396,395,394],"class_list":["post-2211","post","type-post","status-publish","format-standard","has-post-thumbnail","hentry","category-haxe","category-openfl","tag-haxe","tag-web-data-extraction","tag-web-harvesting","tag-web-scraping"],"aioseo_notices":[],"_links":{"self":[{"href":"https:\/\/www.matthijskamstra.nl\/blog\/wp-json\/wp\/v2\/posts\/2211","targetHints":{"allow":["GET"]}}],"collection":[{"href":"https:\/\/www.matthijskamstra.nl\/blog\/wp-json\/wp\/v2\/posts"}],"about":[{"href":"https:\/\/www.matthijskamstra.nl\/blog\/wp-json\/wp\/v2\/types\/post"}],"author":[{"embeddable":true,"href":"https:\/\/www.matthijskamstra.nl\/blog\/wp-json\/wp\/v2\/users\/2"}],"replies":[{"embeddable":true,"href":"https:\/\/www.matthijskamstra.nl\/blog\/wp-json\/wp\/v2\/comments?post=2211"}],"version-history":[{"count":8,"href":"https:\/\/www.matthijskamstra.nl\/blog\/wp-json\/wp\/v2\/posts\/2211\/revisions"}],"predecessor-version":[{"id":2238,"href":"https:\/\/www.matthijskamstra.nl\/blog\/wp-json\/wp\/v2\/posts\/2211\/revisions\/2238"}],"wp:featuredmedia":[{"embeddable":true,"href":"https:\/\/www.matthijskamstra.nl\/blog\/wp-json\/wp\/v2\/media\/2231"}],"wp:attachment":[{"href":"https:\/\/www.matthijskamstra.nl\/blog\/wp-json\/wp\/v2\/media?parent=2211"}],"wp:term":[{"taxonomy":"category","embeddable":true,"href":"https:\/\/www.matthijskamstra.nl\/blog\/wp-json\/wp\/v2\/categories?post=2211"},{"taxonomy":"post_tag","embeddable":true,"href":"https:\/\/www.matthijskamstra.nl\/blog\/wp-json\/wp\/v2\/tags?post=2211"}],"curies":[{"name":"wp","href":"https:\/\/api.w.org\/{rel}","templated":true}]}}