Blocking Baidu and Yandex Search Spiders

A slap on the collective wrist for the Baidu and Yandex search engines for ignoring robots.txt rules!

We noticed on a couple of client sites that bandwidth was being sucked and vaporized into a no-return-on-investment void. After digging through the statistics and analytics, we found that the bots from these two search engines were hammering the server quite hard while completely ignoring the robots.txt file.

For a fix, I got in contact with Paul Arlott from Tolra Systems, who always loves a good challenge. We added the following to the top of the .htaccess file, and so far it seems to have blocked the Baidu and Yandex bots from spidering the website.

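------------------------------------------------------------

# Flag any request whose User-Agent header contains one of these names (case-insensitive)
SetEnvIfNoCase User-Agent "Baidu" spammer=yes
SetEnvIfNoCase User-Agent "Yandex" spammer=yes
SetEnvIfNoCase User-Agent "Sosospider" spammer=yes

# Refuse flagged requests
<Limit GET PUT POST>
order deny,allow
deny from env=spammer
</Limit>

------------------------------------------------------------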
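
The order and deny directives above are Apache 2.2 syntax; on Apache 2.4 they are only available through mod_access_compat. As a minimal sketch, assuming an Apache 2.4 server with mod_setenvif and mod_authz_core enabled and an AllowOverride setting that permits these directives in .htaccess, the same idea could be expressed with Require. This is an illustration rather than the configuration described above, and unlike the Limit block it applies to every request method, not only GET, PUT and POST.

------------------------------------------------------------

# Assumption: Apache 2.4 with mod_setenvif and mod_authz_core loaded
SetEnvIfNoCase User-Agent "Baidu" spammer=yes
SetEnvIfNoCase User-Agent "Yandex" spammer=yes
SetEnvIfNoCase User-Agent "Sosospider" spammer=yes

# Allow everyone except requests flagged with the spammer variable
<RequireAll>
Require all granted
Require not env spammer
</RequireAll>

------------------------------------------------------------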

So far so good, although of course any determined bot will always try to find a way through. Let us know how you have blocked unwanted spiders from your website.

 

WARNING

DO NOT block Yandex or Baidu if you rely on either of those search engines to index your content so that it will be seen in Russia and China.

Don’t Block the Spiders

In contrast to the above, please do not block these search spiders if you are doing business in Russia and/or China. Yandex is by far the most popular search engine in Russia, with approximately 61% of the market share according to LiveInternet.ru stats. If you want your website indexed by Yandex, don’t block it.

UPDATE:
We have updated the file to include the Sosospider which we found repeatedly ‘hitting a server’.
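
The same block can also be written with mod_rewrite instead of SetEnvIfNoCase. The sketch below is an alternative under stated assumptions (mod_rewrite enabled and allowed in .htaccess), not the configuration we deployed. Two details matter: the last RewriteCond must not carry the [OR] flag, otherwise the rule can end up applying to every request, and the patterns are left unanchored because crawler names often appear in the middle of the User-Agent string rather than at the start.

------------------------------------------------------------

# Assumption: mod_rewrite is loaded and permitted in .htaccess
RewriteEngine On

# Unanchored, case-insensitive matches on the User-Agent header
RewriteCond %{HTTP_USER_AGENT} Baidu [NC,OR]
RewriteCond %{HTTP_USER_AGENT} Yandex [NC,OR]
# No [OR] on the final condition
RewriteCond %{HTTP_USER_AGENT} Sosospider [NC]

# Answer matching requests with 403 Forbidden; "-" means no substitution
RewriteRule ^ - [F]

------------------------------------------------------------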

6 replies
  1. Phil says:

    Hey Vincent,

    I came here because Baidu and Yandex bots are the main drain on my bandwidth for all the sites I administer. I actually wrote to Baidu’s support department to beg them to stop hitting my website, and for a while they did, but of course they came back.

    Has the above fix worked without side effects since you implemented it? If so, this will be awesome news!

    Phil

    • Vincent says:

      Hi Phil,

      Thanks for the response. It has been almost six months since we implemented it on a couple of ecommerce sites we manage. So far so good. Your response prompted me to have a look through the stats, and I found no incidents of any of the above bots. However, I have updated the post to include the ‘Sosospider’, another resource-draining spider / bot.

      Let us know how you get along.

      Vincent

  2. Jim says:

    I was hoping this was going to resolve my issue with these bots and their CPU drain, but alas your fix, at least for me, has not been successful in blocking them. 🙁 I have added this code to five of my domains and they are still getting bombarded. Total bummer…

  3. Daniel says:

    Hi Vincent, sorry for my English.
    I have these rules in my htaccess. Is it OK like that, or is your way better? I don’t know the difference. Some use this, some use SetEnvIfNoCase like you. Which is better?
    RewriteCond %{HTTP_USER_AGENT} ^Baidu [OR]
    RewriteCond %{HTTP_USER_AGENT} ^Yandex [OR]
    RewriteCond %{HTTP_USER_AGENT} ^Sosospider [OR]
    RewriteRule ^.* - [F,L]

  4. Paul says:

    Hi Daniel, there’s more than one way to send the bots away, and either method should work just fine.
    You can even catch the request in your CMS and issue 403 responses to the bot. You’ll still see the request to the site in that case, but you won’t be serving up any data, so if it’s done right there’s minimal load on the server.

  5. CathyL says:

    It’s working for us and has reduced bandwidth usage. We do not mind blocking spiders from China or Russia as we’re only selling into the UK.
     
    Thanks x
