Web scraping for web developers

Almost every web developer eventually faces the task of extracting data from a particular site. There are many approaches to web scraping today, and entire libraries have been built for it in Python and JavaScript. There are also ready-made parsers that require no programming knowledge at all, as well as cloud solutions such as apify.com, which offers a wide range of services for automating and scaling almost any user action in the browser.

Why scrape website data?

Typically, companies do not want their unique content to be downloaded and reused for unauthorized purposes, so they do not expose all of their data through a consumable API or other readily available resource. Scrapers, on the other hand, are interested in getting website data regardless of any attempts to restrict access. Online store owners often populate their sites with data scraped from Amazon, eBay, or Walmart. They do not care that the text is not unique, since they bring in visitors through advertising, so it makes no sense for them to hire expensive copywriters to write thousands of unique product descriptions.

Web scraping with Node.js

To extract data from the HTML code of a page, you first need to get that code from the site. This can be done with the client from the http module built into Node.js, but for simple HTTP requests it is more convenient to use one of the wrapper modules around http. The best known of these has long been request, which is now deprecated, so in the examples below we will use needle. The first step is to make sure that the script receives the same HTML code from the site that the browser does. With most sites this is the case, but occasionally you come across a site that serves one thing to the browser and another to a script with an HTTP client. I used to start by checking landing pages with a GET request from curl, but once I ran into a site that returned different HTTP responses to curl and to the script, so now I try running the script right away, with something like the following code:

var needle = require('needle');

// URL of the page we want to fetch
var URL = 'https://www.amazon.com/SAMSUNG-43-inch-Crystal-TU-8000-Built/dp/B084JHKT7S/';

needle.get(URL, function(err, res){
    if (err) throw err;
    // Print the raw HTML and the HTTP status code
    console.log(res.body);
    console.log(res.statusCode);
});

Run the script. If the site is down or there are connection problems, we will see an error; if everything is fine, the page's source code is printed right in the terminal window. If that HTML code matches what the browser receives, there is no need to bother with cookies or HTTP headers.
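If the responses do differ, a common first step is to send the same User-Agent and cookies that the browser sends. Below is a minimal sketch using needle's options object; the header and cookie values are placeholders and should be copied from your own browser's developer tools:

var needle = require('needle');

var URL = 'https://www.amazon.com/SAMSUNG-43-inch-Crystal-TU-8000-Built/dp/B084JHKT7S/';

// Placeholder values: copy real ones from the browser's network tab
var options = {
    headers: {
        'User-Agent': 'Mozilla/5.0 (Windows NT 10.0; Win64; x64)',
        'Accept-Language': 'en-US,en;q=0.9'
    },
    cookies: { 'session-id': 'REPLACE_WITH_REAL_VALUE' },
    follow_max: 3 // follow up to 3 redirects
};

needle.get(URL, options, function(err, res){
    if (err) throw err;
    console.log(res.statusCode);
});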

Parsing gurus know that the most powerful and versatile way of parsing pages is with regular expressions. They let you parse pages with very non-standard and thoroughly non-semantic markup. In general, if the data can be reliably copied from the site without understanding its language, it can be parsed with regular expressions.
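For illustration, here is a small sketch that pulls the contents of the page's <title> tag out of the downloaded HTML with a regular expression; the pattern is deliberately simplistic and will not cover every edge case:

var needle = require('needle');

var URL = 'https://www.amazon.com/SAMSUNG-43-inch-Crystal-TU-8000-Built/dp/B084JHKT7S/';

needle.get(URL, function(err, res){
    if (err) throw err;
    // Naive pattern: grab whatever sits between <title> and </title>
    var match = /<title[^>]*>([\s\S]*?)<\/title>/i.exec(res.body);
    if (match) {
        console.log(match[1].trim());
    }
});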

However, most HTML pages are easy to handle with DOM parsers, which are much more convenient and produce far more readable code. Regular expressions should only be used when DOM parsers fail. In our case, a DOM parser is fine. At the moment, cheerio, a server-side implementation of the beloved jQuery API, is the clear leader among DOM parsers for Node.js.
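As a sketch, this is how the product title could be extracted with cheerio; the #productTitle selector is an assumption about Amazon's current markup and may need adjusting:

var needle = require('needle');
var cheerio = require('cheerio');

var URL = 'https://www.amazon.com/SAMSUNG-43-inch-Crystal-TU-8000-Built/dp/B084JHKT7S/';

needle.get(URL, function(err, res){
    if (err) throw err;
    // Load the HTML into cheerio and query it with jQuery-like selectors
    var $ = cheerio.load(res.body);
    // Assumed selector: Amazon typically renders the title in #productTitle
    var title = $('#productTitle').text().trim();
    console.log(title);
});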