Web Scraping Using JavaScript



Web scraping refers to the extraction of data from a website. The information is collected and then exported into a format that is more useful for the user, be it a spreadsheet or an API. Although web scraping can be done manually, in most cases automated tools are preferred, as they are less costly and work faster. There are several ways to do it from JavaScript. pjscrape lets you scrape from the command line using JavaScript and jQuery; it drives PhantomJS, a headless WebKit browser (it has no window and exists only for your script's use, so it can load complex AJAX-heavy sites just as a real browser would). Puppeteer is a Node library that controls headless Chrome, which is a way to run the Chrome browser without actually displaying Chrome. This guide looks at automating and scraping the web with plain JavaScript and Node.

Analog Forest 🌳

October 19, 2020

If you try to google “web scraping tutorial” you’ll get a bunch of tech articles on the subject that tell you how to achieve the result using Python. The toolkit is pretty standard for these posts: Python 3 (hopefully not 2) as an engine, the requests library for fetching, and Beautiful Soup 4 (which is six years old) for parsing.

I’ve also seen a few articles where they teach you how to parse HTML content with regular expressions; spoiler: don’t do this.

The problem is that I saw articles like this five years ago, and this stack has mostly not changed. More importantly, the solution is not native for JavaScript developers: if you would rather use technologies you are more familiar with, such as ES2020, Node, and the browser APIs, there is no direct guidance.

I’ve tried to fill the gap and create ‘the missing doc’.

Overview

Check if data is available in request

Before you start doing any programming, always check for the easiest available way. In our case, that would be a direct network request for the data.

Open the developer tools (F12 in most browsers), then switch to the Network tab and reload the page.

If the data is not baked into the HTML, as is the case in half of modern web applications, there is a good chance that you don’t need to scrape and parse at all: you can simply call the JSON endpoint you find in the Network tab.


If you are not so lucky and still need to do the scraping, here is the general overview of the process:

  1. fetch the page with the required data
  2. extract the data from the page markup to some in-language structure (Object, Array, Set)
  3. process the data: filter it, transform it to your needs, prepare it for future use
  4. save the data: write it to the database or dump it to the filesystem

That would be the easiest case for parsing; in more sophisticated ones you can bump into pagination, link navigation, bot protection (captchas), and even real-time site interaction. None of that is covered in the current guide, sorry.

Fetching

As the running example for this guide, we will scrape goal data for Messi from Transfermarkt. You can check his stats on the site. To load the page from the Node environment you will need to use your favorite request library. You could also use the raw HTTP/HTTPS modules, but they don’t even support Promises out of the box, so I’ve picked node-fetch for this task. Your code will look something like the sketch below.
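A minimal sketch, assuming node-fetch v3 (ESM) and guessing at the URL of Messi’s goal list on Transfermarkt; both are my assumptions, not taken from the original article:

    // fetch-page.js - a sketch; the URL below is a guess at Messi's goal list
    import fetch from 'node-fetch';

    const PAGE_URL = 'https://www.transfermarkt.com/lionel-messi/alletore/spieler/28003';

    async function fetchPage(url) {
      const response = await fetch(url, {
        // some sites reject requests that don't look like they come from a browser
        headers: { 'User-Agent': 'Mozilla/5.0' },
      });
      if (!response.ok) throw new Error(`Request failed with status ${response.status}`);
      return response.text(); // the raw HTML as a string
    }

    const html = await fetchPage(PAGE_URL);
    console.log(html.length);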

Tools for parsing

There are two major alternatives for this task, and each is conveniently represented by a high-quality, well-starred, actively maintained library.


The first approach is simply to build a syntax tree from the markup text and then navigate it with a familiar browser-like syntax. This one is fully covered by cheerio, which bills itself as jQuery for the server (IMO, they need to revise their marketing vibes for 2020).

The second way is to build the whole browser DOM, but without the browser itself. We can do this with the wonderful jsdom, a Node.js implementation of many web standards.

Let’s take a closer look at both of them.

cheerio


Despite these analogies, cheerio doesn’t have jQuery in its dependencies; it just reimplements the best-known methods from scratch.

Basic usage is really easy:

  • you load an HTML string

  • done, you’re great, now you can use jQuery selectors/methods (see the sketch below)
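A minimal sketch of that flow, assuming cheerio is installed; the markup and selectors are only illustrative:

    import * as cheerio from 'cheerio';

    // 1. load an HTML string
    const $ = cheerio.load('<h1>Hello</h1><ul><li>one</li><li>two</li></ul>');

    // 2. use jQuery-style selectors and methods
    console.log($('h1').text()); // 'Hello'
    const items = $('li')
      .map((i, el) => $(el).text())
      .get(); // .get() converts the cheerio collection to a plain array
    console.log(items); // ['one', 'two']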

You can probably pick this one if you need to save on size (cheerio is lightweight and fast) or if you are really familiar with jQuery syntax and for some reason want to bring it to your new project. Cheerio is a nice way to do any kind of work with HTML you need in your application.

jsdom

This one is a bit more complicated: it tries to emulate the part of a whole browser that works with HTML and JS (everything apart from rendering the result). It’s used heavily for testing and … well, scraping.

Let’s spin up jsdom:

  • you need to use a constructor with your HTML

  • then you can access the standard browser API (see the sketch below)
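A minimal sketch, assuming jsdom is installed:

    import { JSDOM } from 'jsdom';

    // the constructor takes your HTML string
    const dom = new JSDOM('<ul><li>one</li><li>two</li></ul>');

    // the standard browser API lives on dom.window
    const { document } = dom.window;
    const items = [...document.querySelectorAll('li')].map((li) => li.textContent);
    console.log(items); // ['one', 'two']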

jsdom is a lot heavier and it does a lot more work, so you should understand why you would choose it over the other options.

Parsing

In our example, I want to stick with jsdom. It will help us show one last approach at the end of the article. The parsing part is really vital but very short.

So we’ll start by building a DOM from the fetched HTML.

Then you can select the table contents with a CSS selector and the browser API. Don’t forget to create a real array from the NodeList that querySelectorAll returns.
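Something along these lines; the table.items selector is my assumption about Transfermarkt’s markup, not a verified one:

    import { JSDOM } from 'jsdom';

    const dom = new JSDOM(html); // html is the string fetched earlier
    const { document } = dom.window;

    // NodeList -> real Array, then every row -> an array of its cells' text
    const rows = Array.from(document.querySelectorAll('table.items tbody tr'));
    const table = rows.map((row) =>
      Array.from(row.querySelectorAll('td')).map((cell) => cell.textContent.trim())
    );
    // table is now a two-dimensional array of strings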

Now you have a two-dimensional array to work with. This part is finished; next you need to process this data to get clean, ready-to-work-with stats.

Processing

First, let’s check the lengths of our rows. Each row holds the stats for one goal, and we mostly don’t care how many rows there are. But rows can contain a different number of cells, so we have to deal with that.

We map over the rows and get their lengths, then deduplicate the results to see which variants we have here.
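For example, a Set makes the deduplication a one-liner:

    // collect the distinct row lengths, e.g. [1, 5, 14, 15]
    const shapes = [...new Set(table.map((row) => row.length))];
    console.log(shapes);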

Not that bad, only 4 different shapes: with 1, 5, 14, and 15 cells.

Since we don’t need the rank data from the extra cell in the 15-cell case, it is safe to drop it.

A row with only one cell is actually useless: it is just the name of the season, so we will skip it.

For the 5-cell case (when the player scored several goals in one match) we need to find the previous full row and use its data to fill the empty stats.
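A sketch of that normalization under a few assumptions of mine: the extra cell in 15-cell rows is the first one, and the five cells of a short row replace the last five cells of the previous full row (check these against the real markup):

    const FULL_LENGTH = 14;

    const normalized = [];
    for (const row of table) {
      if (row.length === 1) continue;                        // season header, skip
      const cells = row.length === 15 ? row.slice(1) : row;  // drop the extra rank cell

      if (cells.length === FULL_LENGTH) {
        normalized.push(cells);
      } else if (cells.length === 5) {
        // several goals in one match: reuse the match data of the previous full row
        const prev = normalized[normalized.length - 1];
        normalized.push([...prev.slice(0, FULL_LENGTH - 5), ...cells]);
      }
    }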

Now we just have to manually map the data to keys; nothing scientific here and no smart way to avoid it.
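Something like the mapping below; the field names and cell positions are purely illustrative, not Transfermarkt’s actual column order:

    // illustrative field names only; adjust the indices to the real column order
    const goals = normalized.map((cells) => ({
      competition: cells[0],
      matchday: cells[1],
      date: cells[2],
      venue: cells[3],
      opponent: cells[4],
      result: cells[5],
      minute: cells[13],
    }));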

Saving

We will just dump our result to a file, converting it to a string first with the JSON.stringify method.
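For instance (the file name is arbitrary):

    import { writeFile } from 'fs/promises';

    // pretty-print with two-space indentation so the dump stays readable
    await writeFile('goals.json', JSON.stringify(goals, null, 2));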

Bonus: One-time parsing with a snippet

Since we used jsdom with its browser-compatible API, we actually don’t need a Node environment to parse the data at all. If we just need the data once from a particular page, we can run some code in the Console tab of your browser’s developer tools. Try to open any player’s stats on Transfermarkt and paste the (giant, non-readable) snippet into the console.
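The original snippet is not reproduced here, but conceptually it boils down to something like this (same assumed selector as in the jsdom example above):

    // run in the browser console on a player's stats page;
    // the selector is an assumption about Transfermarkt's markup
    const data = [...document.querySelectorAll('table.items tbody tr')].map((row) =>
      [...row.querySelectorAll('td')].map((cell) => cell.textContent.trim())
    );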

And now just apply the magic copy function that is built into the browser devtools. It will copy the data to your clipboard.

Not that hard, right? And no need to deal with pip anymore. I hope you found this article useful. Stay tuned, next time we will visualize this scraped data with modern JS libs.

You can find the whole script for this article in the accompanying CodeSandbox.


by Pavel Prokudin. I write about web development and modern technologies. Follow me on Twitter.


Writing Scrapers with pjscrape


The core of a pjscrape script is the definition of one or more scraper functions. Here's what you need to know:

  • Scraper functions are evaluated in a full browser context. This means you not only have access to the DOM, you have access to Javascript variables and functions, AJAX-loaded content, etc.

  • Scraper functions are evaluated in a sandbox (read more here). Closures will not work the way you think:

    The best way to think about your scraper functions is to assume the code is being eval()'d in the context of the page you're trying to scrape.

  • Scrapers have access to a set of helper functions in the _pjs namespace. See the Javascript API docs for more info. One particularly useful function is _pjs.getText(), which returns an array of text from the matched elements (see the sketch after this list).

    For this instance, there's actually a shorter syntax: if your scraper is a string instead of a function, pjscrape will assume it is a selector and use it in a function like the one above.

  • Scrapers can return data in whatever format you want, provided it's JSON-serializable (so you can't return a jQuery object, for example). A scraper could, for instance, return a list of towns in the Django fixture syntax.

  • Scraper functions can always access the version of jQuery bundled with pjscrape (currently v1.6.1). If you're scraping a site that also uses jQuery, and you want the latest features, you can set noConflict: true and use the _pjs.$ variable.
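A rough sketch pulling the bullets above together. Treat the pjs.config/pjs.addSuite shape and any option names not mentioned above as my assumptions about pjscrape, not verified API:

    // scrape-towns.js - a sketch, not verified against the pjscrape API
    pjs.config({
      writer: 'file',
      format: 'json',
      outFile: 'towns.json',
    });

    pjs.addSuite({
      url: 'http://example.com/towns',
      scraper: function () {
        // runs in the full browser context: DOM, page JS and AJAX-loaded content
        return _pjs.getText('ul.towns li'); // array of text from the matched elements
      },
    });

    // shorthand: a string scraper is treated as a selector, equivalent to the above
    pjs.addSuite({
      url: 'http://example.com/towns',
      scraper: 'ul.towns li',
      // if the target page also uses jQuery, add noConflict: true and use _pjs.$
    });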


Asynchronous Scraping

Docs coming soon. For now, see:

  • Test for the ready option - wait for a ready condition before starting the scrape.
  • Test for asynchronous scrapes - the scraper function is expected to set _pjs.items when its scrape is complete (see the sketch below).
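A guess at what this looks like, based only on the two bullets above; the placement of the ready option and the overall suite shape are assumptions on my part:

    pjs.addSuite({
      url: 'http://example.com/slow-page',
      // wait for a ready condition before starting the scrape
      ready: function () {
        return _pjs.$('#content li').length > 0;
      },
      scraper: function () {
        // an asynchronous scraper signals completion by setting _pjs.items
        window.setTimeout(function () {
          _pjs.items = _pjs.getText('#content li');
        }, 1000);
      },
    });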


Crawling Multiple Pages


Docs coming soon. For now, the main thing is to set the moreUrls option to either a function or a selector that identifies more URLs to scrape.
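A sketch of the idea; only the moreUrls option name comes from the text above, the rest is assumed:

    pjs.addSuite({
      url: 'http://example.com/articles/page/1',
      // a selector (or a function returning an array of URLs) pointing at further pages to crawl
      moreUrls: 'a.next-page',
      scraper: 'h2.title',
    });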