Blocking Baidu and Yandex Search Spiders
Search engines play a pivotal role in the online visibility of businesses. However, not all search engine bots are beneficial for every website. In some instances, they can consume significant bandwidth without providing any tangible return on investment. Two such culprits that have been identified are the Baidu and Yandex search engine spiders.
The Issue with Baidu and Yandex Spiders
It has been noted that the Baidu and Yandex search engines sometimes ignore the rules set in the robots.txt file. This can lead to excessive bandwidth consumption, especially when these bots crawl and index a site frequently. For businesses monitoring their site’s performance, such unexpected bandwidth usage can be concerning.
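For context, the kind of robots.txt rules these spiders are supposed to respect would look something like the sketch below. The syntax is standard robots.txt; blocking everything with Disallow: / is purely for illustration.

# Standard robots.txt rules these crawlers should obey, but reportedly sometimes ignore
User-agent: Baiduspider
Disallow: /

User-agent: Yandex
Disallow: /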
A Solution to the Search Crawler Problem
To address this issue, Paul, who always loves a good challenge, implemented the following at the top of the .htaccess file, which appears to have blocked the Baidu and Yandex bots from spidering the website. This method has proven effective in curbing the unwanted attention from these bots.
-------------------------------------------------------------------------------
# Flag known bandwidth-hungry spiders by their User-Agent string
SetEnvIfNoCase User-Agent "Baidu" spammer=yes
SetEnvIfNoCase User-Agent "Yandex" spammer=yes
SetEnvIfNoCase User-Agent "Sosospider" spammer=yes

# Deny any flagged request (GET also covers HEAD)
<Limit GET PUT POST>
Order deny,allow
Deny from env=spammer
</Limit>
-------------------------------------------------------------------------------
While this solution has shown promising results, it’s important to note that any determined application or bot might still find a workaround, for example by simply changing its User-Agent string. Therefore, continuous monitoring and updating are crucial.
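For that monitoring, a couple of quick checks can help. The domain below is a placeholder and the log path assumes a typical Debian/Ubuntu Apache setup, so adjust both for your own server.

# Spoof Baidu's user agent; a 403 Forbidden response means the block is working
curl -I -A "Baiduspider" http://www.yourdomain.com/

# Count remaining hits from these bots in the access log
grep -icE "baidu|yandex|sosospider" /var/log/apache2/access.log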
Don’t Block the Spiders
In contrast to the above, if your business targets audiences in Russia or China, it’s advisable not to block Yandex or Baidu. Yandex, for instance, holds approximately 61% of the Russian search market according to LiveInternet.ru stats, making it an essential platform for businesses targeting Russian consumers.
UPDATE:
We have updated the code above to include the Sosospider, which we found repeatedly hitting our server.
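One more note: the Order / Deny syntax used above is the older Apache 2.2 style; on Apache 2.4 it only keeps working if mod_access_compat is loaded and is considered deprecated there. If your host runs Apache 2.4, a rough equivalent is sketched below. Treat it as a starting point rather than something we have battle-tested.

# Flag the bots by User-Agent (mod_setenvif)
BrowserMatchNoCase "baidu" bad_bot
BrowserMatchNoCase "yandex" bad_bot
BrowserMatchNoCase "sosospider" bad_bot

# Refuse flagged requests (mod_authz_core, Apache 2.4)
<RequireAll>
    Require all granted
    Require not env bad_bot
</RequireAll>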
Hey Vincent,
I came here because the Baidu and Yandex bots are the main drain on bandwidth for all the sites I administer. I actually wrote to Baidu’s support department to beg them to stop hitting my website, and for a while they did, but of course they came back.
Have you found the above fix to work without side effects since you implemented it? Because if so, this will be awesome news!
Phil
Hi Phil,
Thanks for the response. It has been almost six months since we implemented it on a couple of e-commerce sites we manage. So far, so good. Your response prompted me to have a look through the stats, and I found no incidents of any of the above bots. However, I have updated the post to include the ‘Sosospider’, another resource-draining spider/bot.
Let us know how you get along.
I was hoping this was going to resolve my issue with these bots and their CPU drain, but alas, your fix, at least for me, has not been successful in blocking them. 🙁 I have added this code to five of my domains and they are still getting bombarded. Total bummer…
Hi Vincent, sorry for my English.
I have these rules in my .htaccess. Are they OK as they are, or is your way better? I don’t know the difference. Some use this, and some use SetEnvIfNoCase like you. Which is better?
RewriteCond %{HTTP_USER_AGENT} Baidu [NC,OR]
RewriteCond %{HTTP_USER_AGENT} Yandex [NC,OR]
RewriteCond %{HTTP_USER_AGENT} Sosospider [NC]
RewriteRule ^.* - [F,L]
Hi Daniel, there’s more than one way to send the bots away, and either method should work just fine.
You can even catch the request in your CMS and issue a 403 response to the bot. You’ll still see the request hit the site in that case, but you won’t be serving up any data, so if it’s done right there’s minimal load on the server.
It’s working for us and has reduced bandwidth usage. We don’t mind blocking spiders from China or Russia as we’re only selling into the UK.
Thanks x