Search engines play a pivotal role in the online visibility of businesses. However, not all search engine bots are beneficial for every website. In some instances, they can consume significant bandwidth without providing any tangible return on investment. Two such culprits that have been identified are the Baidu and Yandex search engine spiders.
The Issue with Baidu and Yandex Spiders
It has been observed that the Baidu and Yandex search engine spiders sometimes ignore the rules set in a site’s robots.txt file. When these bots crawl and index a website frequently despite being disallowed, the result can be excessive bandwidth consumption. For businesses monitoring their site’s performance, such unexpected bandwidth usage can be concerning.
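For context, directives like the following in robots.txt are what well-behaved crawlers are expected to honour; the Baiduspider and Yandex user-agent tokens come from each engine’s own crawler documentation:

User-agent: Baiduspider
Disallow: /

User-agent: Yandex
Disallow: /

It is precisely these rules that the bots in question have been reported to disregard, which is why a server-level block becomes necessary.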
A Solution to the Search Crawler Problem
To address this issue, Paul, who always loves a good challenge, implemented the following rules at the top of the .htaccess file, which appear to have blocked the Baidu and Yandex bots from spidering the website. This method has proven effective in curbing the unwanted attention from these bots.
# Flag requests whose User-Agent matches these bot names (case-insensitive)
SetEnvIfNoCase User-Agent "Baidu" spammer=yes
SetEnvIfNoCase User-Agent "Yandex" spammer=yes
SetEnvIfNoCase User-Agent "Sosospider" spammer=yes
# Deny flagged requests for the listed HTTP methods
<Limit GET PUT POST>
  Order Allow,Deny
  Allow from all
  Deny from env=spammer
</Limit>
While this solution has shown promising results, it’s important to note that a determined application or bot may still find a workaround. Continuous monitoring and rule updates are therefore crucial.
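As part of that upkeep, a mod_rewrite variant of the same idea can be quicker to extend when a new bot shows up in the logs. This is a sketch only, assuming mod_rewrite is enabled on the server:

RewriteEngine On
# Respond with 403 Forbidden when the User-Agent contains any of these tokens (case-insensitive)
RewriteCond %{HTTP_USER_AGENT} (Baiduspider|Yandex|Sosospider) [NC]
RewriteRule ^ - [F,L]

Either approach depends on the bot reporting an honest User-Agent string, which is exactly what a determined crawler may choose not to do.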
Don’t Block the Spiders
In contrast to the above, if your business targets audiences in Russia or China, it’s advisable not to block Yandex or Baidu. Yandex, for instance, holds approximately 61% of the Russian search market according to LiveInternet.ru stats, making it an essential platform for businesses targeting Russian consumers.
We have since updated the rules above to include Sosospider, which we found repeatedly hitting our server.