
Don't neglect your robots.txt file

Last post 12-03-2007, 3:23 PM by DougJoseph. 3 replies.
  •  11-27-2007, 7:42 PM 37009

    Don't neglect your robots.txt file

    Search engine bots are quite thorough when indexing sites. If you're using Vine Type for a blog, as I do, you'll discover that the search engines may be indexing more than you expect and perhaps more than you want. Google was indexing all the RSS and ATOM feeds on my site: not just the main "all articles" feed, but also the individual article feeds and the archive pages (all articles from November 2007, for instance). I took a good look at all the links in my site, decided that there were several I did not want indexed, and updated my robots.txt file accordingly. Here's an excerpt from my blog's robots.txt file:


    Disallow: /default.aspx?feed=atom
    Disallow: /default.aspx?feed=rss
    Disallow: /default.aspx?archive=
    Disallow: /default.aspx?img=

    The first two are the RSS and ATOM feeds. The third is the archive pages. The fourth is the images generated via ViPR. And if you're wondering why I didn't specify


    Disallow: /default.aspx?feed=

     ...it's because Google refused to accept my sitemap whose format is


    default.aspx?feed=googlesitemap

    I thought I might share this lesson learned.


    Sincerely,

    Carl
    -----
    vine type - content management with standards in mind - vinetype.com
    -----
  •  12-01-2007, 8:09 PM 37181 in reply to 37009

    Re: Don't neglect your robots.txt file

    Carl

    Just curious. When you wrote, " ...it's because Google refused to accept my sitemap" ... did you mean to say "except" instead of "accept"? If not, I'm puzzled.

    -Doug


    Sincerely,
    Doug Joseph
  •  12-03-2007, 8:29 AM 37228 in reply to 37181

    Re: Don't neglect your robots.txt file

    Sorry for any confusion there.

    My goal was to get Google to crawl my sitemap, default.aspx?feed=googlesitemap (default.aspx?feed=sitemap also works), while telling Google not to index my RSS and ATOM feeds, default.aspx?feed=rss20 and default.aspx?feed=atom10.

    When I put this line into my robots.txt file

    Disallow: /default.aspx?feed=

    Google did not index my RSS or ATOM feeds, but it also refused to crawl my sitemap, because the query string of the sitemap URL also begins with feed=. I want Google to crawl my sitemap (that's why I created it), so I changed the above rule to

    Disallow: /default.aspx?feed=rss

    Disallow: /default.aspx?feed=atom

    and now Google will not index my rss and atom feeds, but will crawl my sitemap since my sitemap url is not blocked by my robots.txt entry anymore.
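
    To see why the narrower rules work, here is a rough sketch of plain prefix matching, which is how a simple Disallow rule is applied. It is only a simplified model: the helper name is_blocked is made up for this illustration, and real crawlers also handle user-agent groups, Allow rules and wildcards, none of which are modeled here.

```python
def is_blocked(url_path, disallow_rules):
    """Return True if url_path starts with any of the Disallow prefixes.

    Simplified model of robots.txt matching (hypothetical helper):
    a URL is blocked when its path plus query string begins with a rule.
    """
    return any(url_path.startswith(rule) for rule in disallow_rules)


# The single broad rule that also caught the sitemap.
broad = ["/default.aspx?feed="]

# The two narrower rules that replaced it.
narrow = ["/default.aspx?feed=rss", "/default.aspx?feed=atom"]

urls = [
    "/default.aspx?feed=rss20",
    "/default.aspx?feed=atom10",
    "/default.aspx?feed=googlesitemap",
    "/default.aspx?feed=sitemap",
]

for url in urls:
    print(url, "broad:", is_blocked(url, broad), "narrow:", is_blocked(url, narrow))
```

    Under the broad rule all four URLs come out blocked. Under the narrow rules only the rss20 and atom10 feeds are blocked, and both sitemap URLs stay crawlable.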

    Hope this helps.

     


    Sincerely,

    Carl
    -----
    vine type - content management with standards in mind - vinetype.com
    -----
  •  12-03-2007, 3:23 PM 37244 in reply to 37228

    Re: Don't neglect your robots.txt file

    Ah! Yes, I see that now.
    Sincerely,
    Doug Joseph