Netpeak Spider 3.2: JavaScript Rendering and PDF Audit

Small Update – Version 11.3 Released 30th May 2019

We have just released a small update to version 11.3 of the SEO Spider. This release is mainly bug fixes and small improvements –


  • Added relative URL support for robots.txt redirects.
  • Fix crash importing crawl file as a configuration file.
  • Fix crash when clearing config in SERP mode.
  • Fix crash loading in configuration to perform JavaScript crawling on a platform that doesn’t support it.
  • Fix crash creating images sitemap.
  • Fix crash in right click remove in database mode.
  • Fix crash in scheduling when editing tasks on Windows.
  • Fix issue with Sitemap Hreflang data not being attached when uploading a sitemap in List mode.
  • Fix configuration window too tall for small screens.
  • Fix broken FDD HTML export.
  • Fix unable to read sitemap with BOM when in Spider mode.

The SEO Spider Tool Crawls & Reports On…

The Screaming Frog SEO Spider is an SEO auditing tool, built by real SEOs with thousands of users worldwide. A quick summary of some of the data collected in a crawl includes:

  1. Errors – Client errors such as broken links & server errors (No responses, 4XX client & 5XX server errors).
  2. Redirects – Permanent, temporary, JavaScript redirects & meta refreshes.
  3. Blocked URLs – View & audit URLs disallowed by the robots.txt protocol.
  4. Blocked Resources – View & audit blocked resources in rendering mode.
  5. External Links – View all external links, their status codes and source pages.
  6. Security – Discover insecure pages, mixed content, insecure forms, missing security headers and more.
  7. URI Issues – Non-ASCII characters, underscores, uppercase characters, parameters, or long URLs.
  8. Duplicate Pages – Discover exact and near duplicate pages using advanced algorithmic checks.
  9. Page Titles – Missing, duplicate, long, short or multiple title elements.
  10. Meta Description – Missing, duplicate, long, short or multiple descriptions.
  11. Meta Keywords – Mainly for reference or regional search engines, as they are not used by Google, Bing or Yahoo.
  12. File Size – Size of URLs & Images.
  13. Response Time – View how long pages take to respond to requests.
  14. Last-Modified Header – View the last modified date in the HTTP header.
  15. Crawl Depth – View how deep a URL is within a website’s architecture.
  16. Word Count – Analyse the number of words on every page.
  17. H1 – Missing, duplicate, long, short or multiple headings.
  18. H2 – Missing, duplicate, long, short or multiple headings.
  19. Meta Robots – Index, noindex, follow, nofollow, noarchive, nosnippet etc.
  20. Meta Refresh – Including target page and time delay.
  21. Canonicals – Link elements & canonical HTTP headers.
  22. X-Robots-Tag – See directives issued via the HTTP header.
  23. Pagination – View rel=“next” and rel=“prev” attributes.
  24. Follow & Nofollow – View meta nofollow, and nofollow link attributes.
  25. Redirect Chains – Discover redirect chains and loops.
  26. hreflang Attributes – Audit missing confirmation links, inconsistent & incorrect language codes, non-canonical hreflang and more.
  27. Inlinks – View all pages linking to a URL, the anchor text and whether the link is follow or nofollow.
  28. Outlinks – View all pages a URL links out to, as well as resources.
  29. Anchor Text – All link text. Alt text from images with links.
  1. Rendering – Crawl JavaScript frameworks like AngularJS and React, by crawling the rendered HTML after JavaScript has executed.
  2. AJAX – Select to obey Google’s now deprecated AJAX Crawling Scheme.
  3. Images – All URLs with the image link & all images from a given page. Images over 100kb, missing alt text, alt text over 100 characters.
  4. User-Agent Switcher – Crawl as Googlebot, Bingbot, Yahoo! Slurp, mobile user-agents or your own custom UA.
  5. Custom HTTP Headers – Supply any header value in a request, from Accept-Language to cookie.
  6. Custom Source Code Search – Find anything you want in the source code of a website! Whether that’s Google Analytics code, specific text or any other snippet of code.
  7. Custom Extraction – Scrape any data from the HTML of a URL using XPath, CSS Path selectors or regex.
  8. Google Analytics Integration – Connect to the Google Analytics API and pull in user and conversion data directly during a crawl.
  9. Google Search Console Integration – Connect to the Google Search Analytics API and collect impression, click and average position data against URLs.
  10. PageSpeed Insights Integration – Connect to the PSI API for Lighthouse metrics, speed opportunities, diagnostics and Chrome User Experience Report (CrUX) data at scale.
  11. External Link Metrics – Pull external link metrics from Majestic, Ahrefs and Moz APIs into a crawl to perform content audits or profile links.
  12. XML Sitemap Generation – Create an XML sitemap and an image sitemap using the SEO Spider.
  13. Custom robots.txt – Download, edit and test a site’s robots.txt using the new custom robots.txt.
  14. Rendered Screen Shots – Fetch, view and analyse the rendered pages crawled.
  15. Store & View HTML & Rendered HTML – Essential for analysing the DOM.
  16. AMP Crawling & Validation – Crawl AMP URLs and validate them, using the official integrated AMP Validator.
  17. XML Sitemap Analysis – Crawl an XML Sitemap independently or part of a crawl, to find missing, non-indexable and orphan pages.
  18. Visualisations – Analyse the internal linking and URL structure of the website, using the crawl and directory tree force-directed diagrams and tree graphs.
  19. Structured Data & Validation – Extract & validate structured data against Schema.org specifications and Google search features.
  20. Spelling & Grammar – Spell & grammar check your website in over 25 different languages.

2) SERP Mode For Uploading Page Titles & Descriptions

You can now switch to ‘SERP mode’ and upload page titles and meta descriptions directly into the SEO Spider to calculate pixel widths. There is no crawling involved in this mode, so they do not need to be live on a website.

This means you can export page titles and descriptions from the SEO Spider, make bulk edits in Excel (if that’s your preference, rather than in the tool itself) and then upload them back into the tool to understand how they may appear in Google’s SERPs.

Under ‘reports’, we have a new ‘SERP Summary’ report which is in the format required to re-upload page titles and descriptions. We simply require three headers for ‘URL’, ‘Title’ and ‘Description’.

The tool will then upload these into the SEO Spider and run the calculations without any crawling.
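For reference, the upload format itself is tiny. Below is a minimal sketch in Python that writes a file in that three-header layout; the file name and the example row are hypothetical, and only the ‘URL’, ‘Title’ and ‘Description’ headers come from the description above.

  import csv

  # Hypothetical example row; only the three headers are required by the
  # SERP Summary re-upload format described above.
  rows = [{
      "URL": "https://www.example.com/",
      "Title": "Example Page Title | Brand",
      "Description": "A meta description edited in bulk before re-uploading.",
  }]

  with open("serp-summary.csv", "w", newline="", encoding="utf-8") as f:
      writer = csv.DictWriter(f, fieldnames=["URL", "Title", "Description"])
      writer.writeheader()
      writer.writerows(rows)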

5) XML Sitemap Improvements

You’re now able to create XML Sitemaps with any response code, rather than just 200 ‘OK’ status pages. This allows the flexibility to quickly create sitemaps for a variety of scenarios, such as pages that don’t yet exist, pages that 301 to new URLs which you wish to force Google to re-crawl, or 404/410 pages you want removed from the index quickly.

If you have hreflang on the website set-up correctly, then you can also select to include hreflang within the XML Sitemap.

Please note – The SEO Spider can only create XML Sitemaps with hreflang if they are already present (as attributes or via the HTTP header). More to come here.
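To make the hreflang option concrete, here is a rough sketch of the kind of entries a sitemap with hreflang alternates contains, generated by hand in Python. The URLs and language codes are invented, and this only illustrates the standard xhtml:link alternate markup, not the SEO Spider’s own generator.

  from xml.sax.saxutils import escape

  # Invented URL set: each page lists its hreflang alternates (language -> URL).
  pages = {
      "https://www.example.com/": {
          "en": "https://www.example.com/",
          "de": "https://www.example.com/de/",
      },
  }

  lines = [
      '<?xml version="1.0" encoding="UTF-8"?>',
      '<urlset xmlns="http://www.sitemaps.org/schemas/sitemap/0.9"',
      '        xmlns:xhtml="http://www.w3.org/1999/xhtml">',
  ]
  for loc, alternates in pages.items():
      lines.append("  <url>")
      lines.append("    <loc>%s</loc>" % escape(loc))
      for lang, href in alternates.items():
          lines.append('    <xhtml:link rel="alternate" hreflang="%s" href="%s"/>'
                       % (lang, escape(href)))
      lines.append("  </url>")
  lines.append("</urlset>")

  print("\n".join(lines))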

1) Structured Data & Validation

Structured data is becoming increasingly important for providing search engines with explicit clues about the meaning of pages, and for enabling special search result features and enhancements in Google.

The SEO Spider now allows you to crawl and extract structured data from the three supported formats (JSON-LD, Microdata and RDFa) and validate it against Schema.org specifications and Google’s 25+ search features at scale.

To extract and validate structured data you just need to select the options under ‘Config > Spider > Advanced’.

Structured data itemtypes will then be pulled into the ‘Structured Data’ tab with columns for totals, errors and warnings discovered. You can filter URLs to those containing structured data, missing structured data, the specific format, and by validation errors or warnings.

The structured data details lower window pane provides specifics on the items encountered. The left-hand side of the lower window pane shows property values and icons against them when there are errors or warnings, and the right-hand window provides information on the specific issues discovered.

The right-hand side of the lower window pane will detail the validation type (Schema.org, or a Google Feature), the severity (an error, warning or just info) and a message for the specific issue to fix. It will also provide a link to the specific Schema.org property.

In the random example below from a quick analysis of the ‘car insurance’ SERPs, we can see lv.com have Google Product feature validation errors and warnings. The right-hand window pane lists those required (with an error), and recommended (with a warning).

As ‘product’ is used on these pages, it will be validated against Google product feature guidelines, where an image is required, and there are half a dozen other recommended properties that are missing.

Another example from the same SERP is Hastings Direct, who have a Google Local Business feature validation error against the use of ‘UK’ in the ‘addressCountry’ schema property.

The right-hand window pane explains that this is because the format needs to be a two-letter ISO 3166-1 alpha-2 country code (and the United Kingdom is ‘GB’). If you check the page in Google’s structured data testing tool, this error isn’t picked up. Screaming Frog FTW.
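That particular check is easy to sketch. The snippet below runs over a hand-written JSON-LD fragment shaped like the LocalBusiness markup discussed above; the values are made up, and a real validator would check against the full ISO 3166-1 alpha-2 list rather than this simplified shape test.

  import json
  import re

  # Made-up JSON-LD, shaped like the LocalBusiness example discussed above.
  jsonld = json.loads("""
  {
    "@type": "LocalBusiness",
    "address": {"@type": "PostalAddress", "addressCountry": "UK"}
  }
  """)

  country = jsonld["address"]["addressCountry"]

  # Google expects a two-letter ISO 3166-1 alpha-2 code here; 'UK' is not on
  # that list (the United Kingdom is 'GB'), so flag it.
  if not re.fullmatch(r"[A-Z]{2}", country) or country == "UK":
      print("addressCountry '%s' should be an ISO 3166-1 alpha-2 code such as 'GB'" % country)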

The SEO Spider will validate against 26 of Google’s 28 search features currently and you can see the full list in our section of the user guide.

As many of you will be aware, frustratingly Google don’t currently provide an API for their own Structured Data Testing Tool (at least a public one we can legitimately use) and they are slowly rolling out new structured data reporting in Search Console. As useful as the existing SDTT is, our testing found inconsistency in what it validates, and the results sometimes just don’t match Google’s own documented guidelines for search features (it often mixes up required or recommended properties for example).

We researched alternatives, like using the Yandex structured data validator (which does have an API), but again, found plenty of inconsistencies and fundamental differences to Google’s feature requirements – which we wanted to focus upon, due to our core user base.

Hence, we went ahead and built our own structured data validator, which considers both Schema.org specifications and Google feature requirements. This is another first to be seen in the SEO Spider, after previously introducing innovative new features such as JavaScript Rendering to the market.

There are plenty of nuances in structured data and this feature will not be perfect initially, so please do let us know if you spot any issues and we’ll fix them up quickly. We obviously recommend using this new feature in combination with Google’s Structured Data Testing Tool as well.

5) rel=“next” and rel=“prev” Elements Now Crawled

The SEO Spider can now crawl rel=“next” and rel=“prev” elements, whereas previously the tool merely reported them. Now, if a URL has not already been discovered, it will be added to the queue and crawled, provided the configuration is enabled (‘Configuration > Spider > Basic Tab > Crawl Next/Prev’).

rel=“next” and rel=“prev” elements are not counted as ‘Inlinks’ (in the lower window tab) as they are not links in a traditional sense. Hence, if a URL does not have any ‘Inlinks’ in the crawl, it might well have been discovered via a rel=“next”/“prev” element or a canonical. We recommend using the ‘Crawl Path Report‘ to see how the page was discovered; it will show the full path.

There’s also a new ‘respect next/prev’ configuration option (under ‘Configuration > Spider > Advanced tab’) which will hide any URLs with a ‘prev’ element, so they are not considered as duplicates of the first page in the series.

Netpeak Spider 3.0: The ‘Reports’ Tab

The ‘Reports’ panel sits next to the parameters and consists of four child tabs:

Errors


Critical errors are shown at the top of the list on a red background; less serious issues appear below.

Every item in the list is clickable, so you can look at the problem in more detail.

Take ‘Duplicate Titles’, for example. According to the report, Netpeak Spider found 16 duplicates on my site, which is rather sad. Click the corresponding error and carefully analyse the information shown in the dashboard.

The ‘tarantula’ found duplicate titles I didn’t even suspect existed. Impressive! After making corrections and re-checking, I managed to cut the number of duplicate titles down to 6, and it took no more than 5 minutes. As you can see, the program reacts to changes instantly.

Note in the screenshot that, having reduced the duplicate titles, the number of duplicate Descriptions on my site dropped as well. This means the problem involved the same URLs.

It seemed strange to me that the program counted pages that had been blocked from indexing in robots.txt since the site’s launch with the directive:

Disallow: /?start*

Later it turned out this was my own mistake: several malformed URLs of this kind had appeared on the site:

Disallow: /joomla/poleznye-sovety-i-rekomendatsii?start=NUMBER

They were created by a quirk of the CMS, which automatically generates new pages for categories once they fill up. Many thanks to the developers for the tip-off! In the end I got rid of the duplicate Titles and Descriptions completely.

To see which URLs are blocked from indexing in robots.txt, go to ‘Settings / Advanced’.

Tick the checkbox as shown in the screenshot and you get the data.

So if you disable the handling of robots.txt instructions, the program will ignore these restrictions and the blocked pages will end up in the general report. If you enable it, all of the blocked pages go to the ‘Skipped URLs’ tab.
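As a side note, whether a given path is covered by a wildcard rule such as ‘Disallow: /?start*’ can be checked with a few lines of Python. This is a simplified, Google-style interpretation of ‘*’ and ‘$’ for illustration only, not Netpeak Spider’s own matching logic, and the sample paths are taken from the examples above.

  import re

  def rule_to_regex(rule):
      # Translate a robots.txt path rule ('*' wildcard, optional '$' end
      # anchor) into an anchored regular expression.
      anchored = rule.endswith("$")
      body = rule[:-1] if anchored else rule
      pattern = ".*".join(re.escape(part) for part in body.split("*"))
      return re.compile("^" + pattern + ("$" if anchored else ""))

  disallow = rule_to_regex("/?start*")
  for path in ["/?start=10", "/joomla/poleznye-sovety-i-rekomendatsii?start=3"]:
      print(path, "-> blocked" if disallow.match(path) else "-> allowed")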

The ‘Data Segmentation’ feature lets you filter data by specific parameters. Say you need to track which pages contain high-severity errors: click the corresponding label and then press ‘Apply as segment’. The error reports will then be displayed on the ‘Dashboard’ tab.

Summary

Here you can see page status and type, connection protocol, host, server response code, crawl depth and so on. For example, you can select all pages with the ‘Redirect’ status, and within the filtered results view the data for a specific redirected page.

Site Structure

This holds grouped information about a specific section of the site. As in the example above, all of the data can be filtered.

Scraping

This shows your custom scraping data, which can easily be copied using the ‘Advanced copy’ function.

To do this, select the report and click the ‘Advanced copy’ button, then open any spreadsheet application and press Ctrl+V to paste.

All of the analysed reports can also be obtained in .xlsx format via the ‘Export’ button.

The program saves the report spreadsheet to your PC’s hard drive.

The ‘Save’ function on the ‘Project’ tab is particularly useful: the next time you open Netpeak Spider, the previous crawl results open straight away.

Conclusion


Internet users are hardly drawn to ‘broken’ websites, which is why many internet marketers, SEO specialists and webmasters choose Netpeak Spider 3.0. It is an excellent multi-functional tool for a fast, comprehensive SEO audit of your site.

Web Scraping & Data Extraction Using The SEO Spider Tool

This tutorial walks you through how you can use the Screaming Frog SEO Spider’s custom extraction feature, to scrape data from websites.

The custom extraction feature allows you to scrape any data from the HTML of a web page using CSSPath, XPath and regex. The extraction is performed on the static HTML returned from URLs crawled by the SEO Spider which respond with a 200 ‘OK’. You can switch to JavaScript rendering mode to extract data from the rendered HTML.
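Outside the tool itself, the same three selector types can be sketched in a few lines of Python with requests and lxml (the cssselect package is assumed to be installed). The URL, the class name and the regex below are placeholders for illustration, not part of the SEO Spider.

  import re

  import requests
  from lxml import html

  resp = requests.get("https://www.example.com/", timeout=10)
  tree = html.fromstring(resp.text)

  # XPath: pull the page's h1 text.
  h1 = [t.strip() for t in tree.xpath("//h1//text()") if t.strip()]

  # CSSPath: pull author names from a hypothetical byline class.
  authors = [el.text_content().strip() for el in tree.cssselect(".author-name")]

  # Regex: look for a Universal Analytics ID anywhere in the raw source.
  ga_ids = set(re.findall(r"UA-\d{4,10}-\d{1,4}", resp.text))

  print(h1, authors, ga_ids)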


To get started, you’ll need to download & install the SEO Spider software and have a licence to access the custom extraction feature necessary for scraping.

When you have the SEO Spider open, the next steps to start extracting data are as follows –

3) Resume Previously Lost or Crashed Crawls

Due to the feature above, you’re now able to resume from an otherwise ‘lost’ crawl in database storage mode.

Previously if Windows had kindly decided to perform an update and restart your machine mid crawl, there was a power-cut, software crash, or you just forgot you were running a week-long crawl and switched off your machine, the crawl would sadly be lost forever.

We’ve all been there, and we didn’t feel this was user error – we could do better! So if any of the above happens, you should now be able to just open it back up via the ‘File > Crawls’ menu and resume the crawl.

Unfortunately this can’t be completely guaranteed, but it will provide a very robust safety net as the crawl is always stored, and generally retrievable – even when pulling the plug directly from a machine mid-crawl.

Other Features

The Sitemaps tab lets you create your own sitemap.xml, which is handy when working on a site where you can’t install a plugin to generate the sitemap automatically.

You can export all of the anchor texts from the site to Excel.

You can also crawl just the URLs from your own list, which is useful when you have a list of pages you are promoting and want to check only those.

The list can be loaded from a file (even from a sitemap.xml) or entered manually.

Finally, one of the coolest features of the program is the ability to set your own rules for a crawl. Go to Configuration > Custom and set up ‘Contains’ or ‘Does Not Contain’ filters with the values you need.

Screaming Frog searches through the source code, so you can, for example, find all of the strong tags on a site, or stop words. The Frog understands a separator, so you could search the site for, say, profanity like this:
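As a rough analogue of that ‘Contains’ search outside the tool, a separated list of terms can be folded into a single regular expression and run over the page source. The term list and URL below are placeholders.

  import re

  import requests

  # Placeholder terms; in the tool the same idea is a separated 'Contains' filter.
  stop_words = ["cheap", "discount", "click here"]
  pattern = re.compile("|".join(re.escape(w) for w in stop_words), re.IGNORECASE)

  source = requests.get("https://www.example.com/", timeout=10).text
  found = sorted({m.lower() for m in pattern.findall(source)})
  print("Matched terms:", found)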

4) hreflang Attributes

First of all, apologies, this one has been a long time coming. The SEO Spider now extracts, crawls and reports on hreflang attributes delivered by HTML link element and HTTP Header. They are also extracted from Sitemaps when crawled in list mode.

While users have historically used custom extraction to collect hreflang, by default these can now be viewed under the ‘hreflang’ tab, with filters for common issues.

While hreflang is a fairly simple concept, there are plenty of issues that can be encountered in its implementation. We believe this is the most comprehensive hreflang auditing currently available anywhere, and it includes checks for missing confirmation links, inconsistent languages, incorrect language/regional codes, non-canonical confirmation links, multiple entries, missing self-reference, not using the canonical, missing the x-default, and missing hreflang completely.

Additionally, there are four new hreflang reports available to allow data to be exported in bulk (under the ‘reports’ top level menu) –

  • Errors – This report shows any hreflang attributes which are not a 200 response (no response, blocked by robots.txt, 3XX, 4XX or 5XX responses) or are unlinked on the site.
  • Missing Confirmation Links – This report shows the page missing a confirmation link, and which page requires it.
  • Inconsistent Language Confirmation Links – This report shows confirmation pages which use different language codes to the same page.
  • Non Canonical Confirmation Links – This report shows confirmation links which point to non-canonical URLs.

This feature can be fairly resource-intensive on large sites, so extraction and crawling are entirely configurable under ‘Configuration > Spider’.
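To illustrate the ‘missing confirmation links’ check from the list above, here is a minimal sketch over a hand-built map of each page’s declared hreflang alternates. The URLs are invented and the logic is a simplification, not the SEO Spider’s implementation.

  # Invented crawl data: each page's declared hreflang alternates (lang -> URL).
  hreflang = {
      "https://example.com/en/": {"en": "https://example.com/en/",
                                  "de": "https://example.com/de/"},
      "https://example.com/de/": {"de": "https://example.com/de/"},  # no return link
  }

  # A confirmation (return) link is missing when page A points at page B,
  # but B does not point back at A.
  for page, alternates in hreflang.items():
      for lang, target in alternates.items():
          if target == page:
              continue  # self-reference
          if page not in hreflang.get(target, {}).values():
              print("Missing confirmation link: %s does not reference %s" % (target, page))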

Summary:

Netpeak Spider is quite a powerful crawler that lets you quickly check a site for key errors, but it is aimed first and foremost at experienced SEOs who know how to interpret the collected information correctly and extract genuinely useful data from the reports. The usability of the reports raises some questions: for example, it is not very convenient to see which pages contain links that go through a 301 redirect. However, the developers are constantly improving the product, which gives hope that the user's path to the right information will be streamlined. Thanks to the excellent options for adjusting the settings to your needs, Netpeak Spider will no doubt become a useful tool in any SEO specialist's arsenal.

3) Improved Link Data – Link Position, Path Type & Target

Some of our most requested features have been around link data. You want more, to be able to make better decisions. We’ve listened, and the SEO Spider now records some new attributes for every link.

Link Position

You’re now able to see the ‘link position’ of every link in a crawl – such as whether it’s in the navigation, the content of the page, the sidebar or the footer. The classification is performed by using each link’s ‘link path’ (as an XPath) and known semantic substrings, and it can be seen in the ‘inlinks’ and ‘outlinks’ tabs.

If your website uses semantic HTML5 elements (or well-named non-semantic elements, such as div id=”nav”), the SEO Spider will be able to automatically determine different parts of a web page and the links within them.

But not every website is built in this way, so you’re able to configure the link position classification under ‘Config > Custom > ‘. This allows you to use a substring of the link path, to classify it as you wish.

For example, we have mobile menu links outside the nav element that are classified as ‘content’ links. This is incorrect, as they are just an additional sitewide navigation on mobile.

The ‘mobile-menu__dropdown’ class name (which is in the link path as shown above) can be used to define its correct link position using the Link Positions feature.

These links will then be correctly attributed as a sitewide navigation link.

This can help identify ‘inlinks’ to a page that come only from in-body content, for example, ignoring any links in the main navigation or footer, for better internal link analysis.
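The substring idea itself is simple to sketch: map a link’s XPath-style link path to a position label with an ordered list of rules, where the first match wins. The ‘mobile-menu__dropdown’ entry mirrors the example above; the other substrings and labels are assumptions for illustration, not the SEO Spider’s defaults.

  # Ordered (substring, position) rules; the first match wins.
  LINK_POSITION_RULES = [
      ("mobile-menu__dropdown", "Navigation"),  # the example discussed above
      ("/nav", "Navigation"),
      ("/header", "Header"),
      ("/footer", "Footer"),
      ("/aside", "Sidebar"),
  ]

  def classify_link_position(link_path):
      for substring, position in LINK_POSITION_RULES:
          if substring in link_path:
              return position
      return "Content"

  print(classify_link_position('/html/body/div[@class="mobile-menu__dropdown"]/ul/li[2]/a'))
  print(classify_link_position("/html/body/main/article/p[3]/a"))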

Path Type

The ‘path type’ of a link is also recorded (absolute, path-relative, protocol-relative or root-relative), which can be seen in inlinks, outlinks and all bulk exports.

This can help identify links which should be absolute, as there are some integrity, security and performance issues with relative linking under some circumstances.
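The four path types can be told apart from the raw href alone. The sketch below is a simplified classification with made-up example links, not the SEO Spider’s own logic.

  from urllib.parse import urlsplit

  def path_type(href):
      if href.startswith("//"):
          return "protocol-relative"
      if urlsplit(href).scheme:
          return "absolute"
      if href.startswith("/"):
          return "root-relative"
      return "path-relative"

  for href in ["https://example.com/a", "//cdn.example.com/app.js",
               "/pricing", "../blog/post.html"]:
      print(href, "->", path_type(href))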

Target Attribute


Additionally, we now show the ‘target’ attribute for every link, to help identify links which use ‘_blank’ to open in a new tab.

This is helpful when analysing usability, but also performance and security – which brings us onto the next feature.
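On the security point, the usual concern is links that open a new tab with ‘_blank’ but omit rel=‘noopener’ (or ‘noreferrer’), which exposes window.opener to the new page. A small sketch that flags those with lxml follows; the URL is a placeholder.

  import requests
  from lxml import html

  tree = html.fromstring(requests.get("https://www.example.com/", timeout=10).text)

  for link in tree.xpath('//a[@target="_blank"]'):
      rel = (link.get("rel") or "").lower()
      if "noopener" not in rel and "noreferrer" not in rel:
          print("target=_blank without rel=noopener:", link.get("href"))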

Content

Scraping is a common practice in content marketing: it lets you quickly gather key data about competitors’ content, or your own, which significantly speeds up analysis.

An example of scraping article metrics on vc.ru with Netpeak Spider

28. Errors in Text

Custom scraping with regular expressions lets you automate the search for common mistakes or, for example, brand or product name variants that became outdated after a rebrand. This way you can find any mention of a given word across the pages of an entire site.

29. Author Names

A popular use of scraping in content marketing is building a database of author names on popular blogs for subsequent outreach. Combined with content data, this lets you analyse which authors are the most popular on a given platform.

30. Quantitative Text Characteristics

Scraping can be used to analyse total text length, text length without spaces, the number of images and videos on a page, and the word count (see the sketch after this list), which can be relevant for:

  • those optimising content on their own site;
  • those analysing the content of individual competitors, or the top-ranking pages for particular queries.
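Here is a rough sketch of pulling those quantitative characteristics from a single page with requests and lxml; the URL is a placeholder and the counting rules (what counts as visible text, what counts as a video) are deliberate simplifications.

  import requests
  from lxml import html

  tree = html.fromstring(requests.get("https://www.example.com/article", timeout=10).text)

  # Visible-ish body text, skipping script and style contents.
  text = " ".join(tree.xpath("//body//text()[not(ancestor::script) and not(ancestor::style)]"))
  text = " ".join(text.split())

  metrics = {
      "characters": len(text),
      "characters_no_spaces": len(text.replace(" ", "")),
      "words": len(text.split()),
      "images": len(tree.xpath("//img")),
      "videos": len(tree.xpath("//video | //iframe[contains(@src, 'youtube')]")),
  }
  print(metrics)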

31. Meta Tags

Scraping titles, meta descriptions and H1–H6 headings lets you analyse how competitors structure their content and which optimisation techniques they use. If you want to collect and analyse a competitor’s full list of articles, meta tags are the most important thing to examine.

32. Content Virality Metrics

Content virality metrics include views, likes, upvotes, shares, comments and other engagement metrics, many of which are displayed right on the article page. Scraping these metrics helps identify the most popular and viral content of competitors or blog platforms.

Other Updates

Version 9.0 also includes a number of smaller updates and bug fixes, outlined below.

  • While we have introduced the new database storage mode to improve scalability, regular memory storage performance has also been significantly improved. The SEO Spider uses less memory, which will enable users to crawl more URLs than previous iterations of the SEO Spider.
  • The ‘exclude’ configuration now works instantly, as it is applied to URLs already waiting in the queue. Previously the exclude would only work on newly discovered URLs, rather than those already found and waiting in the queue. This meant you could apply an exclude, and it would be some time before the SEO Spider stopped crawling URLs that matched your exclude regex. Not anymore.
  • The ‘inlinks’ and ‘outlinks’ tabs (and exports) now include all sources of a URL, not just links (HTML anchor elements) as the source. Previously if a URL was discovered only via a canonical, hreflang, or rel next/prev attribute, the ‘inlinks’ tab would be blank and users would have to rely on the ‘crawl path report’, or various error reports to confirm the source of the crawled URL. Now these are included within ‘inlinks’ and ‘outlinks’ and the ‘type’ defines the source element (ahref, HTML canonical etc).
  • In line with Google’s plan to stop using the old AJAX crawling scheme (and rendering the #! URL directly), we have adjusted the default rendering to . You can switch between text only, old AJAX crawling scheme and JavaScript rendering.
  • You can now choose to ‘cancel’ either loading in a crawl, exporting data or running a search or sort.
  • We’ve added some rather lovely line numbers to the feature.
  • To match Google’s rendering characteristics, we now allow blob URLs during crawl.
  • We renamed the old ‘GA & GSC Not Matched’ report to the ‘‘ report, so it’s a bit more obvious.
  • now applies to list mode input.
  • There’s now a handy ‘strip all parameters’ option within URL Rewriting for ease.
  • We have introduced numerous stability improvements.
  • The Chromium version used for rendering is now reported in the ‘Help > Debug’ dialog.
  • List mode now supports .gz file uploads.
  • The SEO Spider now includes Java 8 update 161, with several bug fixes.
  • Fix: The SEO Spider would incorrectly crawl all ‘outlinks’ from JavaScript redirect pages, or pages with a meta refresh, when ‘Always Follow Redirects’ is ticked under the advanced configuration. Thanks to our friend Fili Weise for spotting that one!
  • Fix: Ahrefs integration requesting domain and subdomain data multiple times.
  • Fix: Ahrefs integration not requesting information for HTTP and HTTPS on (sub)domain level.
  • Fix: The crawl path report was missing some link types, which has now been corrected.
  • Fix: Incorrect robots.txt behaviour for rules ending *$.
  • Fix: Auth Browser cookie expiration date invalid for non-UK locales.

That’s everything for now. This is a big release and one which we are proud of internally, as it’s new ground for what’s achievable for a desktop application. It makes crawling at scale more accessible for the SEO community, and we hope you all like it.

As always, if you experience any problems with our latest update, then do let us know via support and we will help and resolve any issues.

We’re now starting work on version 10, where some long standing feature requests will be included. Thanks to everyone for all their patience, feedback, suggestions and continued support of Screaming Frog, it’s really appreciated.

Now, please go and download version 9.0 of the Screaming Frog SEO Spider and let us know your thoughts.

Understanding Bot Behaviour with the Log File Analyzer

An often-overlooked exercise, but nothing gives us quite as much insight into how bots are interacting with a site as the server logs themselves. The trouble is, these files can be messy and hard to analyse on their own, which is where our very own Log File Analyzer (LFA) comes into play (they didn’t force me to add this one in, promise!).

I’ll leave @ScreamingFrog to go into all the gritty details on why this tool is so useful, but my personal favourite aspect is the ‘Import URL data’ tab on the far right. This little gem will effectively match any spreadsheet containing URL information with the bot data on those URLs.

So, you can run a crawl in the Spider while connected to GA, GSC and a backlink app of your choice, pulling the respective data from each URL alongside the original crawl information. Then, export this into a spreadsheet before importing into the LFA to get a report combining metadata, session data, backlink data and bot data all in one comprehensive summary, aka the holy quadrilogy of technical SEO statistics.
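The same join can be approximated outside the tools with pandas, merging a crawl export against per-URL aggregates derived from the logs on the URL column. The file names and column names below are assumptions, not the LFA’s actual import format.

  import pandas as pd

  # Assumed exports: a crawl export keyed by 'Address', and a per-URL
  # aggregate built from the server logs (e.g. 'URL', 'Googlebot Hits').
  crawl = pd.read_csv("crawl_export.csv")
  logs = pd.read_csv("log_events_per_url.csv")

  combined = crawl.merge(logs, left_on="Address", right_on="URL", how="left")
  combined.to_csv("crawl_plus_bot_data.csv", index=False)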

While the LFA is a paid tool, there’s a free version if you want to give it a go.

Crawl Reporting in Google Data Studio

One of my favourite reports from the Spider is the simple but useful ‘Crawl Overview’ export (Reports > Crawl Overview), and if you mix this with the scheduling feature, you’re able to create a simple crawl report every day, week, month or year. This allows you to monitor the domain for any drastic changes and alerts you to anything which might be cause for concern between crawls.

However, in its native form it’s not the easiest to compare between dates, which is where Google Sheets & Data Studio can come in to lend a hand. After a bit of setup, you can easily copy over the crawl overview into your master G-Sheet each time your scheduled crawl completes, then Data Studio will automatically update, letting you spend more time analysing changes and less time searching for them.

This will require some fiddling to set up; however, at the end of this section I’ve included links to an example G-Sheet and Data Studio report that you’re welcome to copy. Essentially, you need a G-Sheet with date entries in one column and unique headings from the crawl overview report (or another) in the remaining columns:

Once that’s sorted, take your crawl overview report and copy out all the data in the ‘Number of URI’ column (column B), being sure to copy from ‘Total URI Encountered’ down to the end of the column.

Open your master G-Sheet and create a new date entry in column A (in YYYYMMDD format). Then in the adjacent cell, Right-click > ‘Paste special’ > ‘Paste transposed’ (Data Studio prefers this to long-form data):

If done correctly, after several crawls you should have one dated row of overview metrics per crawl.
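The copy-and-transpose step can also be scripted. The sketch below appends one dated, transposed row to a local master CSV, assuming the crawl overview export is a CSV with labels in the first column and counts in the second; the file names and column positions are assumptions.

  import csv
  from datetime import date

  with open("crawl_overview.csv", newline="", encoding="utf-8") as f:
      rows = list(csv.reader(f))

  # Keep everything from the 'Total URI Encountered' row to the end of column B.
  start = next(i for i, r in enumerate(rows) if r and r[0].startswith("Total URI Encountered"))
  counts = [r[1] for r in rows[start:] if len(r) > 1]

  # One transposed row per crawl: the date first (YYYYMMDD), counts across the columns.
  with open("master_crawl_overview.csv", "a", newline="", encoding="utf-8") as f:
      csv.writer(f).writerow([date.today().strftime("%Y%m%d")] + counts)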

Once the data is in a G-Sheet, uploading this to Data Studio is simple: just create a new report > add data source > connect to G-Sheets > select your sheet, and make sure all the heading entries are set as a metric (blue) while the date is set as a dimension (green), like this:

You can then build out a report to display your crawl data in whatever format you like. This can include scorecards and tables for individual time periods, or trend graphs to compare crawl stats over the date range provided (your very own Search Console Coverage report).

Here’s an overview report I quickly put together as an example. You can obviously do something much more comprehensive than this should you wish, or perhaps take this concept and combine it with even more reports and exports from the Spider.

If you’d like a copy of both my G-Sheet and Data Studio report, feel free to take them from here:

  • Master Crawl Overview G-Sheet: https://docs.google.com/spreadsheets/d/1FnfN8VxlWrCYuo2gcSj0qJoOSbIfj7bT9ZJgr2pQcs4/edit?usp=sharing
  • Crawl Overview Data Studio Report: https://datastudio.google.com/open/1Luv7dBnkqyRj11vLEb9lwI8LfAd0b9Bm

Note: if you take a copy, some of the dimension formats may change within Data Studio (breaking the graphs), so it’s worth checking the date dimension is still set to ‘Date (YYYYMMDD)’.

