How to Scrape the URLs of a WordPress Website?
You are about to publish a new website and need to create URL redirects, but how? Manually copy-pasting URLs is a lot of work. There are tools for scraping URLs, but they are pricey.
There is an easy and free way to collect the URLs of a WordPress website using Node.js and the command prompt / terminal.
Step 1 – Export
Log in to the WordPress admin panel and go to Tools, then Export. Choose All content and click Download Export File.
Step 2 – Node.js
Download and install Node.js on your computer. The installer is available from the official Node.js website (nodejs.org).
Step 3 – Project folder and script-file
Create a folder for the project. Place the WordPress export file (XML format) inside the folder.
Create a JavaScript file with a code editor or Notepad. Copy-paste the following script and save it as, for example, scriptname.js
Remember to change the file paths to the correct ones: the xmlFilePath and outputFilePath constants on rows 4 and 5 of the script.
Script:
const fs = require('fs');
const { parseString } = require('xml2js');

const xmlFilePath = '/path/to/file.xml'; // Replace with the path to your WordPress export file
const outputFilePath = '/path/to/output-file.txt'; // Replace with the desired path for the output file

// Read the export file, parse the XML and write the links to the output file
fs.readFile(xmlFilePath, 'utf-8', (error, data) => {
  if (error) {
    console.error('Error reading the XML file:', error);
    return;
  }
  parseString(data, (error, result) => {
    if (error) {
      console.error('Error parsing the XML:', error);
      return;
    }
    const urls = extractUrls(result);
    writeUrlsToFile(urls, outputFilePath);
  });
});

// Collect the <link> value of every <item> in the export
function extractUrls(xmlData) {
  const urls = [];
  const items = xmlData.rss.channel[0].item || []; // no <item> elements means an empty list
  items.forEach(item => {
    const link = item.link[0];
    urls.push(link);
  });
  return urls;
}

// Write the URLs to the output file, one per line
function writeUrlsToFile(urls, filePath) {
  const content = urls.join('\n');
  fs.writeFile(filePath, content, 'utf-8', (error) => {
    if (error) {
      console.error('Error writing the file:', error);
      return;
    }
    console.log('URLs written to file:', filePath);
  });
}
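Because the export was made with All content, the list can also contain attachments, menu items and drafts that you may not want to redirect. Below is a minimal sketch of a filter. It assumes the default xml2js output, where every child element is an array and namespaced tags keep their prefix, and it assumes the export marks each item with wp:post_type and wp:status. The function name filterItems and the default list of types are hypothetical, so adjust them to your site.

```javascript
// Hypothetical helper: keep only published items of the wanted post types.
// Assumes xml2js default parsing, where each child element is an array
// and namespaced tags keep their prefix (e.g. 'wp:post_type').
function filterItems(items, types = ['post', 'page']) {
  return (items || []).filter((item) => {
    const type = item['wp:post_type'] ? item['wp:post_type'][0] : '';
    const status = item['wp:status'] ? item['wp:status'][0] : '';
    return types.includes(type) && status === 'publish';
  });
}

// Small hand-made sample in the same shape as the parsed export.
const sample = [
  { link: ['https://example.com/about/'], 'wp:post_type': ['page'], 'wp:status': ['publish'] },
  { link: ['https://example.com/logo.png'], 'wp:post_type': ['attachment'], 'wp:status': ['inherit'] },
  { link: ['https://example.com/draft/'], 'wp:post_type': ['post'], 'wp:status': ['draft'] },
];

// Only the published page survives the filter.
const kept = filterItems(sample).map((item) => item.link[0]);
console.log(kept);
```

In the script above, one option would be to pass xmlData.rss.channel[0].item through filterItems inside extractUrls before collecting the links.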
Step 4 – Open Terminal
Open Command Prompt (PC) or Terminal (Mac).
a) Go to the project folder by typing cd path/to/folder (on Mac it could be, for example: cd /Users/firstname.lastname/Documents/foldername)
b) Initialize a Node.js project by typing:
npm init -y
c) Install the xml2js library to parse the XML file. Type:
npm install xml2js
d) Run the script:
node scriptname.js
Step 5 – The result
This creates a text file at the path you set in outputFilePath (for example, in the project folder), with one URL per line. You can copy the URLs from the text file and paste them into Google Sheets / Excel. After assigning the desired redirects, you can save the sheet as a CSV file and import it into your WordPress redirect plugin of choice.
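If you would rather skip the copy-paste step, the URL list can be turned into a CSV skeleton directly. This is only a sketch: the helper name toRedirectCsv and the column headers source and target are assumptions, so check which columns your redirect plugin expects before importing. It uses the URL class built into Node.js to keep only the path and query string of each address.

```javascript
// Hypothetical helper: turn full URLs into CSV rows of "old path,new path".
// The target column is left empty so it can be filled in by hand.
function toRedirectCsv(urls) {
  const rows = urls
    .filter((u) => u.trim() !== '') // skip blank lines
    .map((u) => {
      const { pathname, search } = new URL(u.trim());
      return `"${pathname}${search}",""`;
    });
  // The header names are an assumption; rename them for your plugin.
  return ['source,target', ...rows].join('\n');
}

const csv = toRedirectCsv([
  'https://example.com/about/',
  'https://example.com/?p=123',
]);
console.log(csv);
```

To use it with the output of the script, you could read the text file, split it on line breaks, and write the returned string to a .csv file.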