Web Scraping with JS in the Browser

Here’s a tip on how to quickly scrape web pages using JavaScript from within the browser. It can save a bit of time when you need to extract some text or data from a webpage but don’t want to write the boilerplate for a full scraper.

tl;dr

Go to Wikipedia's "List of cat breeds" page (https://en.wikipedia.org/wiki/List_of_cat_breeds), and paste this JavaScript code into the browser console:

// Scrapes cat breed names from
// https://en.wikipedia.org/wiki/List_of_cat_breeds
// Run it in the browser console on that page.

// This `querySelector` will get the first `table` on the page.
var table = document.querySelector("table");

// `querySelectorAll` will get all of the matching elements in the table as a
// `NodeList`, which doesn't have a `.map` method, so it gets turned into an
// array with the spread operator. `Array.from` would also work there.
var catNameArray = [...table.querySelectorAll("th[scope='row']")].map((th) => {
  // We're returning the `innerText` of each `th` and removing the characters
  // we don't want. If you want to scrape the data into JSON format, return a
  // JS object here and stringify it later. See the video for details.
  return th.innerText
    .replace(/\[[^\]]*\]/g, "") // removes each bracketed marker, e.g. "[7]"
    .replace(/\n/g, " ") // replaces all the newlines with spaces
    .trim(); // removes leading and trailing whitespace
});

// Join the array of cat names into an HTML string.
var output = catNameArray.join("<br>");

// Print the HTML output on the page.
document.body.innerHTML = output;

Then watch the video above for a full explanation and additional tips.
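
The comments in the snippet mention that `Array.from` would work in place of the spread operator. Here's a minimal sketch of that variant, not the code from the video. The `cleanBreedName` helper is a hypothetical name introduced only for this example, and the DOM part is wrapped in a check so it only runs where a `document` exists, such as the browser console.

```javascript
// Hypothetical helper (not part of the original snippet): cleans one cell's text.
function cleanBreedName(text) {
  return text
    .replace(/\[[^\]]*\]/g, "") // drop each bracketed marker, e.g. "[7]"
    .replace(/\s+/g, " ") // collapse newlines and repeated spaces into one space
    .trim(); // remove leading and trailing whitespace
}

// Only touch the page when a DOM is available (i.e. in the browser console).
if (typeof document !== "undefined") {
  const table = document.querySelector("table");
  if (table) {
    // `Array.from` takes an optional mapping function as its second argument,
    // so the `NodeList` is converted and mapped in one step.
    const names = Array.from(
      table.querySelectorAll("th[scope='row']"),
      (th) => cleanBreedName(th.innerText)
    );
    // Log the names instead of replacing the page contents.
    console.log(names.join("\n"));
  }
}
```

Logging to the console instead of overwriting `document.body.innerHTML` leaves the page intact, which is handy if you want to tweak the selector and run it again.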

Update: in the YouTube comments, someone asked how to scrape the images from the page as well. Here’s a quick snippet as an example:

// First, extract just the one table.
var table = document.querySelector("table");
// Then get all the rows from that table.
var tableRows = table.querySelectorAll("tr");

// I used `reduce` here, because the number of returned
// items might differ from the length of the original array.
// It's probably possible to improve this. It was coded quickly.
var catArray = [...tableRows].slice(1).reduce((acc, row) => {
  const el = row.querySelector("th[scope='row']");
  if (el) {
    const breed = el.innerText.replace(/\[[^\]]*\]/g, "").replace(/\n/g, " ").trim();
    const img = row.querySelector("td:last-child img");

    acc.push({
      breed,
      img: img?.src || null,
    });
  }
  return acc;
}, []);

var output = JSON.stringify(catArray, null, 4);
document.body.innerHTML = `<pre>${output}</pre>`;
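
As the comment above admits, the `reduce` version can probably be improved. One possible alternative, shown here as a rough sketch rather than the approach from the video, is `flatMap`: returning an empty array for a row drops it, so the result can be shorter than the input without an accumulator. The `extractCats` name is hypothetical, and the function accepts anything with a `querySelector` method so it can be tried outside the browser too.

```javascript
// Hypothetical helper: turns table rows into { breed, img } objects.
// `rows` can be any iterable of objects that support `querySelector`.
function extractCats(rows) {
  return Array.from(rows).flatMap((row) => {
    const el = row.querySelector("th[scope='row']");
    // Returning an empty array drops rows without a row header cell,
    // which also skips the table's header row.
    if (!el) return [];
    const breed = el.innerText
      .replace(/\[[^\]]*\]/g, "") // drop each bracketed marker, e.g. "[7]"
      .replace(/\n/g, " ") // replace newlines with spaces
      .trim();
    const img = row.querySelector("td:last-child img");
    return [{ breed, img: img?.src || null }];
  });
}

// Only touch the page when a DOM is available (i.e. in the browser console).
if (typeof document !== "undefined") {
  const table = document.querySelector("table");
  if (table) {
    console.log(JSON.stringify(extractCats(table.querySelectorAll("tr")), null, 4));
  }
}
```

Because rows without a matching `th` are filtered out inside the callback, the `slice(1)` from the `reduce` version isn't needed here.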

Resources

Also check out a powerful technique to instantly scrape HTML tables with Python.

Tagged with: Programming, JavaScript, Web Scraping
