Contents:
- 3) Input Your Syntax
- Other Updates
- 3) Select ‘Pages’ To Include
- 1) Scheduling
- 4) Security Checks
- 3) Multi-Select Details & Bulk Exporting
- 2) Spelling & Grammar
- 7) Select The ‘Change Frequency’ of URLs
- Small Update – Version 8.1 Released 27th July 2017
- Other Updates
- What Is SEO?
- Other Updates
- 1) SERP Snippets Now Editable
- Small Update – Version 9.1 Released 8th March 2018
- 4) XML Sitemap Crawl Integration
- What does all this actually give you?
3) Input Your Syntax
Next up, you’ll need to input your syntax into the relevant extractor fields. A quick and easy way to find the CSS Path or XPath of the data you wish to scrape is to open the web page in Chrome, ‘inspect element’ on the HTML line you wish to collect, then right click and copy the relevant selector path provided.
For example, you may wish to scrape the ‘authors’ of blog posts, and the number of comments each has received. Let’s take the Screaming Frog website as the example.
Open up any blog post in Chrome, right click and ‘inspect element’ on the author’s name (which is located on every post), which will open up the ‘elements’ HTML window. Right click again on the relevant HTML line (containing the author’s name), copy the relevant CSS Path or XPath, and paste it into the respective extractor field in the SEO Spider. If you use Firefox, you can do the same there too.
You can rename the ‘extractors’, which correspond to the column names in the SEO Spider. In this example, I’ve used CSS Path.
The ticks next to each extractor confirm the syntax used is valid. If you see a red cross next to one, the expression is invalid and may need a little adjusting.
When you’re happy, simply press the ‘OK’ button at the bottom. If you’d like to see more examples, then skip to the bottom of this guide.
Please note – this is not the most robust method for building CSS Selectors and XPath expressions. Expressions generated this way can be tied to the exact position of the element in the code, which can change: the inspected view is the rendered version of the page (the DOM), while by default the SEO Spider looks at the HTML source, and HTML clean-up can also occur when the SEO Spider processes a page containing invalid mark-up.
These can also differ between browsers; e.g. for the above ‘author’ example the following CSS Selectors are given –
Chrome: body > div.main-blog.clearfix > div > div.main-blog--posts > div.main-blog--posts_single--inside_author.clearfix.drop > div.main-blog--posts_single--inside_author-details.col-13-16 > div.author-details--social > a
Firefox: .author-details--social > a:nth-child(1)
The expressions given by Firefox are generally more robust than those provided by Chrome. Even so, this should not be used as a complete replacement for understanding the various extraction options and being able to build these manually by examining the HTML source.
The w3schools guide on CSS Selectors and their XPath introduction are good resources for understanding the basics of these expressions.
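To make the extraction itself concrete, here is a minimal sketch of what an attribute-anchored XPath does. The markup and class names are hypothetical stand-ins (and real pages need a tolerant HTML parser rather than strict XML parsing), but it shows why a selector keyed on a class survives layout changes better than a long positional path:

```python
import xml.etree.ElementTree as ET

# A simplified, well-formed stand-in for a blog post's HTML.
# The class names here are hypothetical, not Screaming Frog's markup.
html = (
    "<html><body>"
    "<div class='post'>"
    "<span class='author'>Jane Doe</span>"
    "<span class='comments'>12</span>"
    "</div>"
    "</body></html>"
)

root = ET.fromstring(html)

# An XPath anchored on a class attribute keeps working even if the
# element moves elsewhere in the document tree.
author = root.find(".//span[@class='author']").text
comments = root.find(".//span[@class='comments']").text
print(author, comments)  # Jane Doe 12
```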
We have also included some other smaller updates and bug fixes in version 7.0 of the Screaming Frog SEO Spider, which include the following –
- All images now appear under the ‘Images’ tab. Previously the SEO Spider would only show ‘internal’ images from the same subdomain under the ‘images’ tab. All other images would appear under the ‘external’ tab. We’ve changed this behaviour as it was outdated, so now all images appear under ‘images’ regardless.
- The URL rewriting ‘remove parameters’ input is now a blank field, which allows users to bulk upload parameters one per line, rather than manually inputting and entering each separate parameter.
- The SEO Spider will now find the page title element anywhere in the HTML (not just the HEAD), like Googlebot. Not that we recommend having it anywhere else!
- Introduced tri-state row sorting, allowing users to clear a sort and revert back to crawl order.
- The maximum XML sitemap size has been increased to 50MB from 10MB, in line with Sitemaps.org updated protocol.
- Fixed a crash in custom extraction!
- Fixed a crash when using the date range Google Analytics configuration.
- Fixed exports ignoring column order and visibility.
- Fixed issue where SERP title and description widths were different for master view and SERP Snippet table on Windows for Thai language.
We hope you like the update! Please do let us know if you experience any problems, or discover any bugs at all.
Thanks to everyone as usual for all the feedback and suggestions for improving the Screaming Frog SEO Spider.
Now go and download version 7.0 of the SEO Spider!
3) Select ‘Pages’ To Include
Only HTML pages in the ‘internal’ tab with a ‘200’ OK response from the crawl are included in the XML sitemap by default. So you don’t need to worry about redirects (3XX), client-side errors (4XX, like broken links) or server errors (5XX) being included in the sitemap. However, you can optionally choose to include them, as some scenarios may require them.
Pages which are blocked by robots.txt, set as ‘noindex’, have been ‘canonicalised’ (the canonical URL is different to the URL of the page), paginated (URLs with a rel=“prev”) or PDFs are also not included as standard. This can all be adjusted within the XML Sitemap ‘pages’ configuration, so simply select your preference.
You can see which URLs have no response, are blocked, or redirect or error under the ‘Responses’ tab and using the respective filters. You can see which URLs are ‘noindex’, ‘canonicalised’ or have a rel=“prev” link element on them under the ‘Directives’ tab and using the filters as well.
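The default inclusion rules described above can be summarised in a short sketch. The field names below are hypothetical, not the SEO Spider’s internals:

```python
# Sketch of the default 'Pages' rules: only 200-OK, indexable HTML URLs
# go into the sitemap; redirects, errors, noindex, robots-blocked and
# canonicalised URLs are excluded unless you opt in.
def include_in_sitemap(page):
    return (
        page["status"] == 200
        and page["content_type"].startswith("text/html")
        and not page["noindex"]
        and not page["blocked_by_robots"]
        and page["canonical"] in (None, page["url"])  # not canonicalised away
    )

crawl = [
    {"url": "https://example.com/", "status": 200, "content_type": "text/html",
     "noindex": False, "blocked_by_robots": False, "canonical": None},
    {"url": "https://example.com/old", "status": 301, "content_type": "text/html",
     "noindex": False, "blocked_by_robots": False, "canonical": None},
    {"url": "https://example.com/dupe", "status": 200, "content_type": "text/html",
     "noindex": False, "blocked_by_robots": False, "canonical": "https://example.com/"},
]

sitemap_urls = [p["url"] for p in crawl if include_in_sitemap(p)]
print(sitemap_urls)  # ['https://example.com/']
```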
1) Scheduling
You can now schedule crawls to run automatically within the SEO Spider, as a one-off, or at chosen intervals.
You’re able to pre-select the mode (spider, or list), the saved configuration, as well as APIs to pull in any data for the scheduled crawl.
You can also automatically save the crawl file and export any of the tabs, filters or XML Sitemaps to a chosen location.
This should be super useful for anyone that runs regular crawls, has clients that only allow crawling at certain less-than-convenient ‘but, I’ll be in bed!’ off-peak times, uses crawl data for their own automatic reporting, or has a developer that needs a broken links report sent to them every Tuesday by 7am.
The keen-eyed among you may have noticed that the SEO Spider will run in headless mode (meaning without an interface) when scheduled to export data – which leads us to our next point.
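As a sketch of what a scheduled headless run looks like, the snippet below builds a command line for the crawler. The flag names are assumptions based on the command-line interface documented for recent versions and may differ in yours, so check the user guide before relying on them:

```python
import shlex

# Build an illustrative headless crawl command. Flag names here are
# assumptions; verify them against your version's CLI documentation.
def build_crawl_command(url, output_folder):
    args = [
        "screamingfrogseospider",
        "--headless",              # run without the UI, as a scheduler would
        "--crawl", url,
        "--save-crawl",            # persist the crawl file
        "--output-folder", output_folder,
        "--export-tabs", "Internal:All",
    ]
    return shlex.join(args)

cmd = build_crawl_command("https://example.com", "/tmp/crawls")
print(cmd)
# A scheduler such as cron could then invoke the printed command,
# e.g. every Tuesday at 7am: 0 7 * * 2 <command>
```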
4) Security Checks
The ‘Protocol’ tab has been renamed to ‘Security’, and more up-to-date security-related checks and filters have been introduced.
While the SEO Spider was already able to identify HTTP URLs, mixed content and other insecure elements, exposing them within filters helps you spot them more easily.
You’re able to quickly find mixed content, issues with insecure forms, unsafe cross-origin links, protocol-relative resource links, missing security headers and more.
The old insecure content report remains as well, as this checks all elements (canonicals, hreflang etc) for insecure elements and is helpful for HTTPS migrations.
The new security checks introduced are focused on the most common issues related to SEO, web performance and security, but this functionality might be extended to cover additional security checks based upon user feedback.
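As a rough illustration of one such check, the sketch below scans a page’s HTML for http:// references. It is a regex-based toy: a real check would parse the HTML properly and distinguish embedded resources (mixed content) from plain anchors (unsafe cross-origin links):

```python
import re

# Flag http:// references on a page that is itself served over HTTPS.
# A regex scan is a rough approximation of a real HTML-aware check.
def find_insecure_references(html):
    return re.findall(r'''(?:src|href)=["'](http://[^"']+)["']''', html)

page = (
    '<img src="http://example.com/logo.png">'
    '<a href="https://example.com/">secure link</a>'
)
print(find_insecure_references(page))  # ['http://example.com/logo.png']
```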
3) Multi-Select Details & Bulk Exporting
You can now select multiple URLs in the top window pane, view specific lower window details for all the selected URLs together, and export them. For example, if you click on three URLs in the top window, then click on the lower window ‘inlinks’ tab, it will display the ‘inlinks’ for those three URLs.
You can also export them via the right click or the new export button available for the lower window pane.
Obviously this scales, so you can do it for thousands, too.
This should provide a nice balance between everything in bulk via the ‘Bulk Export’ menu and then filtering in spreadsheets, or the previous singular option via the right click.
2) Spelling & Grammar
If you’ve found yourself with extra time under lockdown, then we know just the way you can spend it (sorry).
You’re now also able to perform a spelling and grammar check during a crawl. The new ‘Content’ tab has filters for ‘Spelling Errors’ and ‘Grammar Errors’ and displays counts for each page crawled.
You can enable spelling and grammar checks via ‘Config > Content > Spelling & Grammar‘.
While this is a little different from our usual very ‘SEO-focused’ features, a large part of our role is about improving websites for users. Google’s own search quality evaluator guidelines mention spelling and grammar errors numerous times as a characteristic of low-quality pages (if you need convincing!).
The lower window ‘Spelling & Grammar Details’ tab shows you the error, type (spelling or grammar), detail, and provides a suggestion to correct the issue.
The right-hand-side of the details tab also shows you a visual of the text from the page and errors identified.
The right-hand pane ‘Spelling & Grammar’ tab displays the top 100 unique errors discovered and the number of URLs it affects. This can be helpful for finding errors across templates, and for building your dictionary or ignore list.
The new spelling and grammar feature will auto-identify the language used on a page (via the HTML language attribute), but also allows you to manually select a language where required. It supports 39 languages, including English (UK, USA, Aus etc), German, French, Dutch, Spanish, Italian, Danish, Swedish, Japanese, Russian, Arabic and more.
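The language auto-detection via the HTML lang attribute can be sketched in a few lines; this toy version falls back to a default when the attribute is absent:

```python
import re

# Read the language from the <html lang="..."> attribute, as described
# above, falling back to a default when it is missing.
def page_language(html, default="en"):
    match = re.search(r'<html[^>]*\blang=["\']?([A-Za-z-]+)', html, re.IGNORECASE)
    return match.group(1) if match else default

print(page_language('<html lang="de"><head></head></html>'))  # de
print(page_language('<html><head></head></html>'))            # en
```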
You’re able to ignore words for a crawl, add to a dictionary (which is remembered across crawls), disable grammar rules and exclude or include content in specific HTML elements, classes or IDs for spelling and grammar checks.
You’re also able to ‘update’ the spelling and grammar check to reflect changes to your dictionary, ignore list or grammar rules without re-crawling the URLs.
As you would expect, you can export all the data via the ‘Bulk Export > Content’ menu.
7) Select The ‘Change Frequency’ of URLs
The ‘changefreq’ is another optional attribute which ‘hints’ at how frequently the page is likely to change.
The SEO Spider allows you to configure these based on the ‘last modification’ response or ‘level’ (depth) of URLs. The ‘calculate from last modified header’ option means that if the page has been changed in the last 24 hours, it will be set to ‘daily’, if not, it’s set as ‘monthly’.
Please do remember, these are not commands to the search engines, merely ‘hints’. Google will essentially crawl a URL as frequently as it determines algorithmically, over any ‘hint’ provided by you in the XML sitemap.
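The ‘calculate from last modified header’ rule described above boils down to a single threshold. This sketch assumes an already-parsed Last-Modified timestamp:

```python
from datetime import datetime, timedelta, timezone

# Modified within the last 24 hours -> 'daily', otherwise 'monthly',
# mirroring the rule described in the text.
def change_frequency(last_modified, now=None):
    now = now or datetime.now(timezone.utc)
    return "daily" if now - last_modified <= timedelta(hours=24) else "monthly"

now = datetime(2017, 7, 27, 12, 0, tzinfo=timezone.utc)
print(change_frequency(now - timedelta(hours=3), now))   # daily
print(change_frequency(now - timedelta(days=10), now))   # monthly
```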
Small Update – Version 8.1 Released 27th July 2017
We have just released a small update to version 8.1 of the SEO Spider. This release is mainly bug fixes and small improvements –
- Fix a crash when using Forms Based Authentication in two instances at the same time.
- Fix a crash selecting a URL in the main window, for certain types of URL string.
- Menus vanishing on mouse up on Windows, when used on multiple monitors.
- Trailing space in meta charset causing page to be read with wrong charset.
- Fix for API tab configuration buttons, which can be unresponsive.
- Fix for crash showing open/save dialogs when last used directory has been deleted.
- Fix for a crash when using AHREFs.
- Debug check box doesn’t stay ticked.
- Fonts not anti-aliased in Ubuntu.
- ga:socialActivity has been deprecated and removed to match Google’s API changes.
- Fix for a crash switching to Tree View, after loading in a saved project.
- Sitemap reading doesn’t extract images.
- Cookies stored against the wrong URL when using Forms Authentication.
- Pop ups (authentication/memory etc) while minimised in Windows leaves app unresponsive.
- SERP Snippet fixes.
Version 12.0 also includes a number of smaller updates and bug fixes, outlined below.
- There’s a new ‘Link Attributes’ column for inlinks and outlinks, which details whether a link has a nofollow, sponsored or ugc value. The existing nofollow configuration options will also apply to links with sponsored or ugc values, in the same way as a normal nofollow link.
- The SEO Spider will pick up the new max-snippet, max-video-preview and max-image-preview directives, and there are filters for these within the ‘Directives’ tab. We plan to add support for data-nosnippet at a later date; for now, it can be analysed using custom extraction.
- We’re committed to making the tool as reliable as possible and encouraging user reporting. So we’ve introduced in-app crash reporting, so you don’t even need to bring up your own email client or download the logs manually to send them to us. Our support team may get back to you if we require more information.
- The crawl name is now displayed in the title bar of the application. If you haven’t named the crawl (or saved a name for the .seospider crawl file), then we will use a smart name based upon your crawl. This should help when comparing two crawls in separate windows.
- Structured data validation has been updated to use Schema.org 3.9 and now supports FAQ, How To, Job Training and Movie Google features. We’ve also updated nearly a dozen features with changing required and recommended properties.
- The ga:users metric has now been added to the Google Analytics integration.
- The separate ‘Download XML Sitemap’ and ‘Download XML Sitemap Index’ options have been combined into a single ‘Download XML Sitemap’ option.
- The configuration now applies when in list mode, and to robots.txt files.
- Scroll bars have now been removed from screenshots.
- Our emulator has been updated with Google’s latest change to larger fonts on desktop, which has resulted in fewer characters being displayed before truncation in the SERPs. The ‘Over 65 Characters’ default filter for page titles has been amended to 60. This can of course be adjusted under ‘Config > Preferences’.
- We’ve significantly sped up robots.txt parsing.
- Memory usage has been reduced in a number of areas.
- We’ve added support for x-gzip content encoding, and content type ‘application/gzip’ for sitemap crawling.
- We removed the descriptive export name text from the first row of all exports as it was annoying.
That’s everything. If you experience any problems with the new version, then please do just let us know via our support and we’ll help as quickly as possible.
Thank you to everyone for all their feature requests, feedback, and bug reports. We appreciate each and every one of them.
Now, go and download version 12.0 of the Screaming Frog SEO Spider and let us know what you think!
What Is SEO?
Search Engine Optimisation (SEO) is the practice of increasing the number and quality of visitors to a website by improving rankings in the algorithmic search engine results.
Research shows that websites on the first page of Google receive almost 95% of clicks, and that results appearing higher up the page receive a higher click-through rate (CTR) and more traffic.
The algorithmic (‘natural’, ‘organic’, or ‘free’) search results are those that appear directly below the top pay-per-click adverts in Google, as highlighted below.
There are also various other listings that can appear in the Google search results, such as map listings, videos, the knowledge graph and more. SEO can include improving visibility in these result sets as well.
We have also included some other smaller updates and bug fixes in version 6.0 of the Screaming Frog SEO Spider, which include the following –
- A new ‘Text Ratio’ column has been introduced in the internal tab which calculates the text to HTML ratio.
- Google updated their Search Analytics API, so the SEO Spider can now retrieve more than 5k rows of data from Search Console.
- There’s a new ‘search query filter’ for Search Console, which allows users to include or exclude keywords (under ‘Configuration > API Access > Google Search Console > Dimension tab’). This should be useful for excluding brand queries for example.
- There’s a new configuration to extract images from the IMG srcset attribute under ‘Configuration > Advanced’.
- The new Googlebot smartphone user-agent has been included.
- Updated our support for relative base tags.
- Removed the blank line at the start of Excel exports.
- Fixed a bug with word count which could make it less accurate.
- Fixed a bug with GSC CTR numbers.
I think that’s just about everything! As always, please do let us know if you have any problems or spot any bugs at all.
Thanks to everyone for all the support and continued feedback. Apologies for any features we couldn’t include in this update, we are already working on the next set of updates and there’s plenty more to come!
Now go and download version 6.0 of the SEO Spider!
1) SERP Snippets Now Editable
First of all, the SERP snippet tool we released in our previous version has been updated extensively to include a variety of new features. The tool now allows you to preview SERP snippets by device type (whether it’s desktop, tablet or mobile) which all have their own respective pixel limits for snippets. You can also bold keywords, add rich snippets or description prefixes like a date to see how the page may appear in Google.
You can read more about this update and changes to pixel width and SERP snippets in Google in our new blog post.
The largest update is that the tool now allows you to edit page titles and meta descriptions directly in the SEO Spider as well. This subsequently updates the SERP snippet preview and the table calculations, letting you know the number of pixels you have before a word is truncated. It also updates the text in the SEO Spider itself, which will be remembered automatically unless you click the ‘reset title and description’ button. You can make as many edits to page titles and descriptions as you like, and they will all be remembered.
This means you can also export the changes you have made in the SEO Spider and send them over to your developer or client to update in their CMS. This feature means you don’t have to try and guesstimate pixel widths in Excel (or elsewhere!) and should provide greater control over your search snippets. You can quickly filter for page titles or descriptions which are over pixel width limits, view the truncations and SERP snippets in the tool, make any necessary edits and then export them. (Please remember, just because a word is truncated it does not mean it’s not counted algorithmically by Google).
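As a rough illustration of pixel-width checking, the sketch below approximates truncation with an average character width. Real SERP truncation depends on per-character font metrics and Google’s current limits, so both numbers here are illustrative assumptions only:

```python
# Approximate a pixel-width truncation check. The 580px limit and the
# average character width are illustrative assumptions, not Google's
# actual per-character font metrics.
def fits_in_serp(title, limit_px=580, avg_char_px=9.0):
    return len(title) * avg_char_px <= limit_px

print(fits_in_serp("Short, punchy page title"))  # True
print(fits_in_serp(
    "An extremely long page title that rambles on well past any "
    "sensible limit for a search snippet"))      # False
```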
Small Update – Version 9.1 Released 8th March 2018
We have just released a small update to version 9.1 of the SEO Spider. This release is mainly bug fixes and small improvements –
- Monitor disk usage on user configured database directory, rather than home directory. Thanks to Mike King, for that one!
- Stop monitoring disk usage in Memory Storage Mode.
- Make sitemap reading support utf-16.
- Fix crash using Google Analytics in Database Storage mode.
- Fix issue with depth stats not displaying when loading in a saved crawl.
- Fix crash when viewing Inlinks in the lower window pane.
- Fix crash in Custom Extraction when using XPath.
- Fix crash when embedded browser initialisation fails.
- Fix crash importing crawl in Database Storage Mode.
- Fix crash when sorting/searching main master view.
- Fix crash when editing custom robots.txt.
- Fix jerky scrolling in View Source tab.
- Fix crash when searching in View Source tab.
4) XML Sitemap Crawl Integration
It’s always been possible to crawl XML Sitemaps directly within the SEO Spider (in list mode); however, you’re now able to crawl and integrate them as part of a site crawl.
You can select to crawl XML Sitemaps under ‘Configuration > Spider’, and the SEO Spider will auto-discover them from the robots.txt entry, or the location can be supplied manually.
New filters then allow you to quickly analyse common issues with your XML Sitemap, such as URLs not in the sitemap, orphan pages, non-indexable URLs and more.
You can also now supply the XML Sitemap location into the URL bar at the top, and the SEO Spider will crawl that directly, too (instead of switching to list mode).
What does all this actually give you?
This is all well and good, but how do you apply this arsenal in practice? Blogs publish little overviews along the lines of ‘ooh, here it shows your titles… ooh, here it counts your descriptions…’ So what? What does that give you? Here are the concrete benefits of Screaming Frog:
- 404 errors and redirects. Find them with the Frog and fix them.
- Duplicate pages (by identical Title). Find and remove them.
- Empty, short and overly long Titles. Find, fill in, extend and fix them.
- Pages buried too deep in the site structure. Export to Excel, paste your list of promoted pages into the URL column, and highlight duplicate values. See which promoted pages sit at a crawl depth greater than 1, 2 or 3, and deal with the problem.
- URL length. Find long URLs, shorten them, and set up redirects from the old ones.
- ‘Thin’ pages. Use the Word Count column to identify pages with less content than average (or simply very little), then either block them in robots.txt, delete them, or flesh them out.
- The slowest pages. Check the Response Time column.
- External links. Remove either all of them, or just the broken ones returning 404.
- Matching Title and H1. Find and fix them.
- Tags such as <strong>, <b>, <br> and so on. Screaming Frog lets you find every page on the site where these tags are used.
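For the duplicate-Title check in the list above, a small script over an exported CSV does the counting. The ‘Title 1’ column name mirrors the tool’s ‘Internal’ export, but verify it against your own file’s header:

```python
import csv
import io
from collections import Counter

# A miniature stand-in for an 'Internal' tab CSV export. The 'Title 1'
# column name is an assumption; check your export's actual header.
export_csv = """Address,Title 1
https://example.com/a,Widgets
https://example.com/b,Widgets
https://example.com/c,About us
"""

rows = list(csv.DictReader(io.StringIO(export_csv)))
counts = Counter(row["Title 1"] for row in rows)
duplicates = sorted(title for title, n in counts.items() if n > 1)
print(duplicates)  # ['Widgets']
```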
That covers the important stuff. I’ll skip the minor niceties here, like previewing how a Title looks in the search results or finding empty descriptions.
There is one more shortcoming compared to PageWeight: the program doesn’t calculate page weight. But Netpeak Spider comes to the rescue there – it can.