How to Scrape the URLs of a WordPress Website

You are about to publish a new website and need to create URL redirects. But how? Copying and pasting URLs by hand is a lot of work. There are tools for scraping URLs, but they are pricey.

There is an easy and free way to scrape the URLs of a WordPress website using Node.js and the command prompt or terminal.

Step 1 – Export

Log in to the WordPress admin panel and go to Tools > Export. Choose All content and click Download Export File.

Step 2 – Node.js

Download and install Node.js on your computer. You can download it from the official Node.js website, nodejs.org.

Step 3 – Project folder and script-file

Create a folder for the project. Place the WordPress export file (XML format) inside the folder.

Create a JavaScript file with a code editor or Notepad. Copy and paste the following script and save it as, for example, scriptname.js

Remember to change the file paths to the correct ones (see lines 4 and 5 of the script).

Script:

const fs = require('fs');
const { parseString } = require('xml2js');

const xmlFilePath = '/path/to/file.xml'; // Replace with the path to your WordPress export file
const outputFilePath = '/path/to/output-file.txt'; // Replace with the desired path for the output file

fs.readFile(xmlFilePath, 'utf-8', (error, data) => {
    if (error) {
        console.error('Error reading the XML file:', error);
        return;
    }

    parseString(data, (error, result) => {
        if (error) {
            console.error('Error parsing the XML:', error);
            return;
        }

        const urls = extractUrls(result);
        writeUrlsToFile(urls, outputFilePath);
    });
});

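function extractUrls(xmlData) {
    const urls = [];
    // "item" is missing when the export contains no content
    const items = xmlData.rss.channel[0].item || [];

    items.forEach(item => {
        // Skip items that have no <link> element or an empty one
        const link = item.link && item.link[0];
        if (link) {
            urls.push(link);
        }
    });

    return urls;
}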

function writeUrlsToFile(urls, filePath) {
    const content = urls.join('\n');

    fs.writeFile(filePath, content, 'utf-8', (error) => {
        if (error) {
            console.error('Error writing the file:', error);
            return;
        }

        console.log('URLs written to file:', filePath);
    });
}
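
An export of All content also contains drafts, media attachments and menu items, which you usually do not want to redirect. The sketch below is an optional variation, not part of the script above. It uses a hand-written sample object in place of a real export to show the shape xml2js produces with its default options (every child element wrapped in an array). It then keeps only published posts and pages. The function name extractPublishedUrls and the sample URLs are made up for illustration, and it assumes your export carries wp:post_type and wp:status elements.

```javascript
// Hand-written stand-in for what xml2js returns for a WordPress export.
// With xml2js's default options, every child element is wrapped in an array.
const sampleResult = {
    rss: {
        channel: [{
            item: [
                { link: ['https://example.com/about/'], 'wp:post_type': ['page'], 'wp:status': ['publish'] },
                { link: ['https://example.com/?p=12'], 'wp:post_type': ['post'], 'wp:status': ['draft'] },
                { link: ['https://example.com/logo/'], 'wp:post_type': ['attachment'], 'wp:status': ['inherit'] },
            ],
        }],
    },
};

// Hypothetical variant of extractUrls: keep only published posts and pages.
function extractPublishedUrls(xmlData) {
    const items = xmlData.rss.channel[0].item || [];
    const wantedTypes = ['post', 'page'];

    return items
        .filter(item => {
            // Missing elements fall back to undefined and are filtered out
            const type = (item['wp:post_type'] || [])[0];
            const status = (item['wp:status'] || [])[0];
            return wantedTypes.includes(type) && status === 'publish';
        })
        .map(item => item.link[0]);
}

console.log(extractPublishedUrls(sampleResult));
```

To try it on a real export, you could call extractPublishedUrls(result) in place of extractUrls(result) in the script.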

Step 4 – Open Terminal

Open Command Prompt (PC) or Terminal (Mac).

a) Navigate to the project folder by typing cd path/to/folder (on a Mac it could be, for example: cd /Users/firstname.lastname/Documents/foldername)

b) Initialize the Node.js project by typing:

npm init -y

c) Install the xml2js library, which parses the XML file:

npm install xml2js

d) Run the script:

node scriptname.js

Step 5 – The result

This will create a txt file at the output path set in the script, for example in the project folder. You can copy the URLs from the txt file and paste them into Google Sheets or Excel. After assigning the redirect targets, you can export the sheet as a CSV file and import it into your WordPress redirect plugin of choice.
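
If you prefer to skip the copy-paste step, the URL list can be turned into a CSV skeleton directly. This is a minimal sketch, assuming every line of the txt file is an absolute URL. The function name buildRedirectCsv and the column headers source and target are made up for illustration; check which columns your redirect plugin expects.

```javascript
// Hypothetical helper: turns the URL list from the txt file into CSV rows,
// with the old path in the first column and an empty target column to fill in.
function buildRedirectCsv(urlListText) {
    const rows = urlListText
        .split(/\r?\n/)
        .map(line => line.trim())
        .filter(line => line.length > 0)
        .map(line => {
            // Assumes an absolute URL; keep only the path part
            const oldPath = new URL(line).pathname;
            // Quote the field and escape any embedded double quotes
            return `"${oldPath.replace(/"/g, '""')}",""`;
        });

    // Header names are an assumption, not a plugin requirement
    return ['"source","target"', ...rows].join('\n');
}

const sample = 'https://example.com/about/\nhttps://example.com/blog/hello-world/\n';
console.log(buildRedirectCsv(sample));
```

In practice you would read the txt file with fs.readFileSync(outputFilePath, 'utf-8'), pass the text to the function, and write the result to a .csv file.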
