The importance of adding custom robots.txt to blogger - ChrisDiary: Hacks, Tutorials and Tech Updates


Sunday, July 2, 2017

The importance of adding custom robots.txt to blogger

Blogger comes with a default robots.txt file, but it also lets us customize this file to suit our needs. This is called a custom robots.txt file, and I will show you why customizing it matters and how to add it to the Blogger server.

What is a robots.txt file?
A robots.txt file is a plain-text file that instructs web crawlers on how to go about indexing specific pages on search engines. Before a crawler indexes a site's pages, it first fetches this file to learn which instructions apply to it.
So basically, a robots.txt file does two things:
1. It tells search engines which content on your site they may crawl and discover
2. It controls which content gets indexed on search engines and served to those looking for information

A robots.txt file is usually present by default on blog hosting platforms like Blogger. In this tutorial, I will explain what each line of a robots.txt file means and how to customize it. I will use Blogger's default robots.txt file to illustrate how it works.
The code is as shown below:

User-agent: Mediapartners-Google
Disallow:
User-agent: *
Allow: /
Sitemap: http://example.blogspot.com/feeds/posts/default?orderby=UPDATED
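Before walking through each line, note that you can check how a crawler will read any robots.txt file locally with Python's standard urllib.robotparser module. This is just a quick sketch using the default rules above; the URLs are placeholders:

```python
import urllib.robotparser

# Blogger's default robots.txt, as shown above
default_rules = """\
User-agent: Mediapartners-Google
Disallow:

User-agent: *
Allow: /
Sitemap: http://example.blogspot.com/feeds/posts/default?orderby=UPDATED
"""

rp = urllib.robotparser.RobotFileParser()
rp.parse(default_rules.splitlines())

# With these defaults, every crawler may fetch every page
print(rp.can_fetch("Googlebot", "http://example.blogspot.com/2017/05/any-post.html"))
print(rp.can_fetch("Mediapartners-Google", "http://example.blogspot.com/"))
```

Both checks print True, confirming that the default file blocks nothing.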
Explanation of the above code


1. User-agent: Mediapartners-Google: This line addresses the Google AdSense bot, which crawls your pages so AdSense can serve relevant ads on your blog. If you are a Google AdSense publisher, or wish to become one in the future, you should leave this line in place.

2. Disallow: This line, as the name suggests, is used to disallow specific posts or pages from being indexed on search engines. It can also be used to stop specific crawlers from crawling our sites entirely. Leaving it in its default empty state (Disallow:), without specifying any page, means that crawlers may index all the posts on one's blog.

Supposing we change this line to Disallow: /search, we are telling crawlers not to index any URL whose path begins with the search keyword just after the domain name. For example, a page with the URL http://chrisdiary.com/search/post/SEO will be ignored and will not be indexed on search engines.
If we intend to block a specific post instead, we simply add its path after the Disallow: directive, removing the domain name from the URL. For example, if I want crawlers to ignore this post: http://www.chrisdiary.com/2017/05/google-adsense-now-bans-specific-blog.html, I will simply change

Disallow:

to

Disallow: /2017/05/google-adsense-now-bans-specific-blog.html
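You can verify both Disallow patterns with the same urllib.robotparser approach; here is a sketch combining the /search rule and the single-post rule from above:

```python
import urllib.robotparser

# Hypothetical rules blocking the /search pages and one specific post
rules = """\
User-agent: *
Disallow: /search
Disallow: /2017/05/google-adsense-now-bans-specific-blog.html
"""

rp = urllib.robotparser.RobotFileParser()
rp.parse(rules.splitlines())

# Anything under /search is blocked
print(rp.can_fetch("Googlebot", "http://chrisdiary.com/search/post/SEO"))  # False
# The single disallowed post is blocked
print(rp.can_fetch("Googlebot", "http://www.chrisdiary.com/2017/05/google-adsense-now-bans-specific-blog.html"))  # False
# Every other post remains crawlable
print(rp.can_fetch("Googlebot", "http://chrisdiary.com/2017/06/some-other-post.html"))  # True
```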
3. User-agent: * This line addresses all web crawlers. We can target a particular crawler by replacing the asterisk with its name; for example, User-agent: Googlebot refers only to Google's crawler. We can then decide to either allow or block crawlers. For example, we can use the snippets below to allow or disallow crawlers from crawling our sites:

Blocking all web crawlers from all content

User-agent: *
Disallow: /

Using this syntax in a robots.txt file would tell all web crawlers not to crawl any pages on www.example.com, including the homepage.

Allowing all web crawlers access to all content

User-agent: *
Disallow:

Using this syntax in a robots.txt file tells web crawlers to crawl all pages on www.example.com, including the homepage.

 Blocking a specific web crawler from a specific folder 

User-agent: Googlebot
Disallow: /example-subfolder/

This syntax tells only Google’s crawler (user-agent name Googlebot) not to crawl any pages that contain the URL string www.example.com/example-subfolder/. 

 Blocking a specific web crawler from a specific web page 

User-agent: Bingbot
Disallow: /example-subfolder/blocked-page.html 

This syntax tells only Bing's crawler (user-agent name Bingbot) to avoid crawling the specific page at www.example.com/example-subfolder/blocked-page.html
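These per-crawler patterns can also be checked locally. The sketch below combines the two examples above into one file and confirms that each rule only binds the crawler it names (the URLs are the placeholder ones from the examples):

```python
import urllib.robotparser

# The two per-crawler patterns above, combined into one file
rules = """\
User-agent: Googlebot
Disallow: /example-subfolder/

User-agent: Bingbot
Disallow: /example-subfolder/blocked-page.html
"""

rp = urllib.robotparser.RobotFileParser()
rp.parse(rules.splitlines())

# Googlebot is shut out of the whole subfolder...
print(rp.can_fetch("Googlebot", "http://www.example.com/example-subfolder/page.html"))        # False
# ...Bingbot only from the one page...
print(rp.can_fetch("Bingbot", "http://www.example.com/example-subfolder/page.html"))          # True
print(rp.can_fetch("Bingbot", "http://www.example.com/example-subfolder/blocked-page.html"))  # False
# ...and a crawler with no matching group may fetch anything
print(rp.can_fetch("DuckDuckBot", "http://www.example.com/example-subfolder/page.html"))      # True
```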

4. Allow: / This directive is mainly honoured by Google's bots and is used to grant Googlebot access to specific pages or subfolders even though the parent folder may be disallowed.

5. Sitemap: This line shows crawlers where the site's sitemap is located. It is recognized by the major engines, including Google, Ask, Bing, and Yahoo.
By adding the sitemap, we are telling crawlers that while scanning our robots.txt they should also fetch the sitemap and index all the posts it links to. This way, we optimize our site's crawling. By default, Blogger's feed only exposes the 25 most recent posts to crawlers. To raise this limit, we can use the sitemap below, which covers the most recent 500 posts.

Sitemap: http://example.blogspot.com/atom.xml?redirect=false&start-index=1&max-results=500

If the blog has more than 500 posts, we can list two sitemaps, the second one starting where the first leaves off:

Sitemap: http://example.blogspot.com/atom.xml?redirect=false&start-index=1&max-results=500
Sitemap: http://example.blogspot.com/atom.xml?redirect=false&start-index=501&max-results=500
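If a blog grows well beyond 1,000 posts, writing these lines by hand gets tedious. Here is a small helper of my own (not part of Blogger) that builds the paged Sitemap lines for any number of posts, 500 per feed page:

```python
def blogger_sitemap_lines(base_url, total_posts, page_size=500):
    """Build the paged Blogger feed Sitemap lines for a robots.txt file."""
    lines = []
    # Each feed page starts where the previous one left off: 1, 501, 1001, ...
    for start in range(1, total_posts + 1, page_size):
        lines.append(
            f"Sitemap: {base_url}/atom.xml?redirect=false"
            f"&start-index={start}&max-results={page_size}"
        )
    return lines

for line in blogger_sitemap_lines("http://example.blogspot.com", 800):
    print(line)
```

For a blog with 800 posts this prints two Sitemap lines, with start-index=1 and start-index=501, matching the two-sitemap example above.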

At this point, to tie the tutorial together, I'd like to show you two examples of a complete custom robots.txt.

Example 1.

Here, I want to write a custom robots.txt with the following features:
* Allows google adsense to serve relevant ads
* Allows all web crawlers except Yahoo and Bing access to all the pages
* Gives crawlers access to my sitemap 

Here is the code:

User-agent: Mediapartners-Google
Disallow:
User-agent: Slurp
User-agent: Bingbot
Disallow: /
User-agent: *
Disallow:
Sitemap: http://www.example.com/sitemap.xml
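As a sanity check, the same urllib.robotparser sketch confirms Example 1 behaves as intended (note that Yahoo's crawler answers to the robots.txt token Slurp):

```python
import urllib.robotparser

# Example 1: AdSense allowed, Yahoo (Slurp) and Bing blocked, everyone else allowed
rules = """\
User-agent: Mediapartners-Google
Disallow:

User-agent: Slurp
User-agent: Bingbot
Disallow: /

User-agent: *
Disallow:
"""

rp = urllib.robotparser.RobotFileParser()
rp.parse(rules.splitlines())

print(rp.can_fetch("Slurp", "http://www.example.com/any-page.html"))      # False
print(rp.can_fetch("Bingbot", "http://www.example.com/any-page.html"))    # False
print(rp.can_fetch("Googlebot", "http://www.example.com/any-page.html"))  # True
print(rp.can_fetch("Mediapartners-Google", "http://www.example.com/"))    # True
```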


Example 2

Here, I want to write a robots.txt with the following features:
* Allows google adsense to serve relevant ads
* Disallows Yahoo and MSN from accessing http://www.chrisdiary.com/2017/05/7-blogging-etiquettes-yhat-will.html
* Gives crawlers access to my sitemap with up to 500 posts

Here is the code:

User-agent: Mediapartners-Google
Disallow:
User-agent: Slurp
User-agent: Msnbot
Disallow: /2017/05/7-blogging-etiquettes-yhat-will.html
User-agent: *
Disallow:
Sitemap: http://chrisdiary.com/atom.xml?redirect=false&start-index=1&max-results=500


Now that we have our custom robots.txt code, it is time to add it to Blogger. To do this, follow the steps below.

How to add custom robots.txt to blogger

Step 1: Log in to your dashboard and click on "Settings". Click on "Search preferences", then next to "Custom robots.txt" click on "Edit".

Step 2: After clicking on "Edit", you will be asked whether to "Enable custom robots.txt content". Choose yes and paste or compose your custom robots.txt.

Step 3: Lastly, click on "Save changes" and your custom robots.txt will be saved. That's all.

Final Words
While Blogger's robots.txt is very effective, it has limitations. The default file cannot instruct search engine crawlers to skip specific pages of a site, nor can it prevent specific crawlers from accessing the site at all. We may want to hide certain pages from crawlers or keep sensitive posts out of search results, but this can only be achieved by customizing our robots.txt file.
While this brings many advantages, a custom robots.txt can be disastrous if misused. Care has to be taken so that we don't pull our entire site from the web with a stray Disallow: /.


Thanks for reading my blog. Always remember to share my post and to link back to my blog if you must copy my post.
