Web Scraping with JS in the Browser
Here’s a tip on how to quickly scrape web pages using JavaScript from within the browser. It can help save a bit of time if you need to extract some text or data from a webpage but don’t want to go through the process of writing boilerplate for a full scraper.
tl;dr
Go to Wikipedia’s List of cat breeds page (https://en.wikipedia.org/wiki/List_of_cat_breeds), open the browser console, and paste in this JavaScript code:
// Scrapes cat breed names from
// https://en.wikipedia.org/wiki/List_of_cat_breeds
// Run it in the browser console on that page.
// This `querySelector` will get the first `table` on the page.
var table = document.querySelector("table");
// `querySelectorAll` will get all of the matching elements in the table as a
// `NodeList`, which doesn't have a `.map` method, so it gets turned into an
// array with the spread operator. `Array.from` would also work there.
var catNameArray = [...table.querySelectorAll("th[scope='row']")].map((th) => {
  // We're returning the `innerText` of each `th` and removing the characters
  // we don't want. If you want to scrape the data into JSON format, return a
  // JS object here and stringify it later. See the video for details.
  return th.innerText
    .replace(/\[.+\]/, "") // removes everything in square brackets
    .replace(/\n/g, " "); // replaces all the newlines with spaces
});
// Join the array of cat names into an HTML string.
var output = catNameArray.join("<br>");
// Print the HTML output on the page.
document.body.innerHTML = output;
Then watch the video above for a full explanation and additional tips.
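The comments in the snippet mention that `Array.from` would also work in place of the spread operator. Here is a minimal sketch of that variant, not the code from the video. `cleanBreedName` is a hypothetical helper name I made up for illustration, and its regex is a slightly stricter pattern than the one above (it removes each bracketed marker separately), which is my own assumption about what you would want.

```javascript
// Hypothetical helper (not in the original snippet): strips footnote markers
// such as "[7]" and turns line breaks into single spaces.
function cleanBreedName(text) {
  return text
    .replace(/\[[^\]]*\]/g, "") // remove each bracketed footnote marker
    .replace(/\s*\n\s*/g, " ") // replace line breaks with a single space
    .trim();
}

// Only touch the DOM when one exists, so the helper can be tried anywhere.
if (typeof document !== "undefined") {
  const table = document.querySelector("table");
  if (table) {
    // `Array.from` takes an optional mapping function as its second argument,
    // so the NodeList is converted and mapped in one step.
    const names = Array.from(
      table.querySelectorAll("th[scope='row']"),
      (th) => cleanBreedName(th.innerText)
    );
    console.log(names.join("\n"));
  }
}
```

This version logs the names to the console instead of replacing the page body, so you can re-run it without reloading the page.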
Update: in the YouTube comments, someone asked how to scrape the images from the page as well. Here’s a quick snippet as an example:
// First, extract just the one table.
var table = document.querySelector("table");
// Then get all the rows from that table.
var tableRows = table.querySelectorAll("tr");
// I used `reduce` here, because the number of returned
// items might be different than the original array.
// It's probably possible to improve this. It was coded quickly.
var catArray = [...tableRows].slice(1).reduce((acc, row) => {
  const el = row.querySelector("th[scope='row']");
  if (el) {
    const breed = el.innerText.replace(/\[.+\]/, "").replace(/\n/g, " ");
    const img = row.querySelector("td:last-child img");
    acc.push({
      breed,
      img: img?.src || null,
    });
  }
  return acc;
}, []);
var output = JSON.stringify(catArray, null, 4);
document.body.innerHTML = `<pre>${output}</pre>`;
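Since the snippet above notes that the `reduce` could probably be improved, here is one possible alternative using `flatMap`. Treat it as a rough sketch rather than a definitive rewrite. `rowsToCats` is a hypothetical function name, and the stricter footnote regex is my own assumption. Each row returns either an empty array (skipped) or a one-item array (kept), which covers the same "output may be shorter than the input" case.

```javascript
// Hypothetical refactor of the `reduce` version: with `flatMap`, each row
// contributes zero or one result, so rows without a breed header are skipped.
function rowsToCats(rows) {
  return Array.from(rows)
    .slice(1) // skip the header row, as in the original snippet
    .flatMap((row) => {
      const el = row.querySelector("th[scope='row']");
      if (!el) return []; // an empty array adds nothing to the result
      const breed = el.innerText
        .replace(/\[[^\]]*\]/g, "") // remove each bracketed footnote marker
        .replace(/\n/g, " ") // replace newlines with spaces
        .trim();
      const img = row.querySelector("td:last-child img");
      return [{ breed, img: img?.src || null }];
    });
}

// Only touch the DOM when one exists.
if (typeof document !== "undefined") {
  const table = document.querySelector("table");
  if (table) {
    const cats = rowsToCats(table.querySelectorAll("tr"));
    console.log(JSON.stringify(cats, null, 4));
  }
}
```

Because the function only relies on `querySelector`, `innerText`, and `src`, it takes the `NodeList` from `querySelectorAll` directly and does not need the spread operator first.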
Resources
- Wikipedia’s List of Cat Breeds — This is the page to scrape.
- Firefox Developer Edition — This is the ideal browser for scraping in the browser, because it has a built-in multi-line JavaScript editor.
- NodeList on MDN
- Spread Operator on MDN
- Array.from on MDN
Also check out a powerful technique to instantly scrape HTML tables with Python.