Exploring the Use of Screaming Frog for Extracting Text from Non-HTML Files
When conducting SEO audits or website analyses, tools like Screaming Frog SEO Spider are invaluable for crawling web pages and extracting data from them. Screaming Frog is built primarily to analyze HTML content, but what if you need to extract information from other file types, such as robots.txt files hosted across various domains? This is possible, using custom extraction with JavaScript inside Screaming Frog.
Can Screaming Frog Extract Text from robots.txt Files?
By default, Screaming Frog SEO Spider is optimized for parsing HTML and XML documents. However, when dealing with non-HTML files like robots.txt, standard configurations may not suffice. To extract the content of such files, we need to go beyond the default settings and implement custom extraction methods.
Custom JavaScript to Fetch and Extract robots.txt Content
One effective method is to import a custom JavaScript snippet into Screaming Frog. The snippet runs through the tool's built-in scripting capabilities, fetches each site's robots.txt file, and returns its raw text content.
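To make the idea concrete before wiring it into Screaming Frog, the fetch logic can be sketched as a standalone function. This is a minimal sketch, not Screaming Frog's API: `robotsUrlFor`, `fetchRobotsText`, and the `fetchFn` parameter are hypothetical names introduced here so the logic can run outside the crawler, for example in a recent Node.js version where `fetch` and `URL` are global.

```javascript
// Hypothetical helper: map any page URL on a site to that site's robots.txt.
// new URL() with an absolute path discards the page's own path and query,
// so every page on a host resolves to the same robots.txt URL.
function robotsUrlFor(pageUrl) {
  return new URL('/robots.txt', pageUrl).href;
}

// Hypothetical helper: fetch the raw robots.txt text for a page's site.
// fetchFn is injected so the logic can be exercised without network access;
// when omitted, the global fetch is used.
async function fetchRobotsText(pageUrl, fetchFn = globalThis.fetch) {
  const response = await fetchFn(robotsUrlFor(pageUrl));
  if (!response.ok) {
    // Surface HTTP errors (404, 500, ...) instead of returning an error page.
    throw new Error(`Failed to fetch robots.txt: ${response.status}`);
  }
  return response.text();
}
```

Inside Screaming Frog the same steps run in the context of the crawled page, with the result handed back to the tool rather than returned to a caller.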
Here’s an outline of how this can be achieved:
- Create a JSON configuration for custom extraction.
This JSON file defines the extraction process, instructing Screaming Frog to fetch the robots.txt file from each domain and extract its text content.
Example JSON configuration:
```json
[
  {
    "type": "EXTRACTION",
    "name": "Extract robots.txt content",
    "contentTypes": "text/plain",
    "actionTimeoutSecs": 1,
    "version": 3,
    "comments": "Fetch and extract robots.txt content",
    "javascript": "const robotsUrl = new URL('/robots.txt', window.location.origin).href;\n\nreturn fetch(robotsUrl)\n .then(response => {\n if (!response.ok) {\n throw new Error(`Failed to fetch robots.txt: ${response.status}`);\n }\n return response.text();\n })\n .then(text => {\n return seoSpider.data(text);\n })\n .catch(error => {\n return seoSpider.error(error.message);\n });"
  }
]
```
- Import the JSON configuration into Screaming Frog.
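Hand-escaping the snippet into the `javascript` field is error-prone, so one option is to keep the snippet as readable source and let `JSON.stringify` produce the file to import. The following is a sketch under that assumption: the field names and values are copied from the example configuration above, while `buildConfig`, `snippet`, and `configJson` are illustrative names, not something Screaming Frog prescribes.

```javascript
// The snippet body as readable source lines. It uses a top-level return
// because the example configuration does; window and seoSpider are supplied
// by Screaming Frog at crawl time, not by this script.
const snippet = [
  "const robotsUrl = new URL('/robots.txt', window.location.origin).href;",
  "",
  "return fetch(robotsUrl)",
  "  .then(response => {",
  "    if (!response.ok) {",
  "      throw new Error(`Failed to fetch robots.txt: ${response.status}`);",
  "    }",
  "    return response.text();",
  "  })",
  "  .then(text => seoSpider.data(text))",
  "  .catch(error => seoSpider.error(error.message));",
].join("\n");

// Hypothetical helper: wrap a snippet in the same fields as the example.
function buildConfig(javascript) {
  return [
    {
      type: "EXTRACTION",
      name: "Extract robots.txt content",
      contentTypes: "text/plain",
      actionTimeoutSecs: 1,
      version: 3,
      comments: "Fetch and extract robots.txt content",
      javascript,
    },
  ];
}

// JSON.stringify handles the newline and quote escaping automatically.
const configJson = JSON.stringify(buildConfig(snippet), null, 2);
console.log(configJson);
```

Run with Node.js and redirected to a file (for example `node build-config.js > robots-extraction.json`, where both file names are placeholders), this prints a configuration in the same shape as the example.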
