How to Scrape URLs of a WordPress Website?
You are about to publish a new website and you need to create URL redirects, but how? Copying and pasting URLs by hand is a lot of work. There are some tools for scraping URLs, but they are pricey.
There is an easy and free way to scrape the URLs of a WordPress website using Node.js and the command prompt / terminal.
Step 1 – Export
Log in to the WordPress admin panel, go to Tools and then Export. Choose All content and click Download Export File.
Step 2 – Node.js
Download and install Node.js on your computer. You can download it from the official Node.js website (nodejs.org).
Step 3 – Project folder and script file
Create a folder for the project. Place the WordPress export file (XML format) inside the folder.
Create a JavaScript file with a code editor or Notepad. Copy and paste the following script and save it as, for example, scriptname.js
Remember to change the file paths to match your setup. They are set in the xmlFilePath and outputFilePath constants on lines 3 and 4 of the script.
Script:
const fs = require('fs');
const { parseString } = require('xml2js');
const xmlFilePath = '/path/to/file.xml'; // Replace with the path to your WordPress export file
const outputFilePath = '/path/to/output-file.txt'; // Replace with the desired path for the output file
fs.readFile(xmlFilePath, 'utf-8', (error, data) => {
    if (error) {
        console.error('Error reading the XML file:', error);
        return;
    }
    parseString(data, (error, result) => {
        if (error) {
            console.error('Error parsing the XML:', error);
            return;
        }
        const urls = extractUrls(result);
        writeUrlsToFile(urls, outputFilePath);
    });
});
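function extractUrls(xmlData) {
    // xml2js wraps child elements in arrays, hence the [0] lookups.
    // Fall back to an empty list if the export contains no <item> elements.
    const items = xmlData.rss.channel[0].item || [];
    return items
        .filter(item => item.link && item.link[0]) // skip items without a link
        .map(item => item.link[0]);
}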
function writeUrlsToFile(urls, filePath) {
    const content = urls.join('\n');
    fs.writeFile(filePath, content, 'utf-8', (error) => {
        if (error) {
            console.error('Error writing the file:', error);
            return;
        }
        console.log('URLs written to file:', filePath);
    });
}
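
To see why the script reads xmlData.rss.channel[0].item and item.link[0], it can help to look at the shape of the object that xml2js produces with its default settings: child elements are wrapped in arrays. The sketch below uses a hand-written object that mimics a tiny parsed export, so the sample URLs and the helper name extractPublishedUrls are made up for illustration. It also assumes each item carries wp:post_type and wp:status elements, as WordPress exports normally do, and uses them to skip drafts and attachments, which you usually do not want to redirect.

```javascript
// Hand-written stand-in for what xml2js returns for a small WordPress export.
// With default options, every child element becomes an array of values.
const sampleParsedExport = {
    rss: {
        channel: [{
            title: ['Example site'],
            item: [
                { link: ['https://example.com/about/'], 'wp:post_type': ['page'], 'wp:status': ['publish'] },
                { link: ['https://example.com/hello-world/'], 'wp:post_type': ['post'], 'wp:status': ['publish'] },
                { link: ['https://example.com/?p=12'], 'wp:post_type': ['post'], 'wp:status': ['draft'] },
                { link: ['https://example.com/logo/'], 'wp:post_type': ['attachment'], 'wp:status': ['inherit'] },
            ],
        }],
    },
};

// Hypothetical helper: keep only published items of the given post types.
function extractPublishedUrls(xmlData, postTypes = ['post', 'page']) {
    const items = xmlData.rss.channel[0].item || [];
    return items
        .filter(item => {
            const type = (item['wp:post_type'] || [])[0];
            const status = (item['wp:status'] || [])[0];
            return postTypes.includes(type) && status === 'publish';
        })
        .map(item => item.link[0]);
}

console.log(extractPublishedUrls(sampleParsedExport));
```

If this filtering suits your site, a helper like this could be called in the script in place of extractUrls(result).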
  
Step 4 – Open Terminal
Open Command Prompt (PC) or Terminal (Mac).
a) Navigate to the project folder by typing cd path/to/folder (on a Mac it could be, for example: cd /Users/firstname.lastname/Documents/foldername)
b) Initialize a Node.js project by typing:
npm init -y
c) Install the xml2js library, which the script uses to parse the XML file, by typing:
npm install xml2js
d) Run the script:
node scriptname.js
Step 5 – The result
This creates a text file at the path you set in outputFilePath (for example, in the project folder). You can copy the URLs from the text file and paste them into Google Sheets or Excel. After assigning the desired redirect targets, export the sheet as a CSV file and import it into your WordPress redirect plugin of choice.
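
If you would rather start from a ready-made CSV, the URL list can be converted directly. The following is a minimal sketch, assuming your redirect plugin accepts a two-column CSV of source path and target. The column names source and target and the helper name urlsToCsv are hypothetical, so check what your plugin expects before importing.

```javascript
// Hypothetical helper: turn absolute URLs into CSV rows with the source path
// filled in and the target column left empty for you to complete.
function urlsToCsv(urls) {
    const rows = urls.map(url => {
        const { pathname, search } = new URL(url); // URL is a Node.js global
        const source = pathname + search;
        // Quote the field and double any quotes so a comma in a path cannot break the row.
        return `"${source.replace(/"/g, '""')}",`;
    });
    return ['source,target', ...rows].join('\n');
}

const csv = urlsToCsv([
    'https://example.com/about/',
    'https://example.com/?p=12',
]);
console.log(csv);
```

The resulting text can be saved with fs.writeFile, in the same way the script writes the URL list, and then opened in Google Sheets or Excel to fill in the targets.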