Google For Dummies
Keeping Google Out

Your priority might run contrary to this chapter: you might want to prevent Google from crawling your site and putting it in the Web search index. It does seem pushy, when you think about it, for any search engine to invade your Web space, suck up all your text, and make it available to anyone with a matching keyword. Some people feel that Google’s cache is more than just pushy, and that it infringes copyright by storing an unauthorized copy of a site.

If you want to keep the Google crawl out of your site, get familiar with the robots.txt file, which implements the Robots Exclusion Protocol. Google’s spider understands and obeys this protocol. The robots.txt file is a short, simple text file that you place in the top-level directory (root directory) of your domain server. (If you use server space provided by a utility ISP, such as AOL, you probably need administrative help in placing the robots.txt file.) The file contains two instructions: User-agent, which names the spider the rule applies to, and Disallow, which names what that spider must stay out of.
A sample robots.txt file looks like this:

    User-agent: *
    Disallow: /

This example is the most common and simplest robots.txt file. The asterisk after User-agent means the rule applies to all spiders. The forward slash after Disallow means that all site directories are off-limits, so every spider is excluded from the entire site.

The name of Google’s spider is Googlebot. (“Here, Googlebot! Come to Daddy! Sit. Good Googlebot! Who’s a good boy?”) If you want to exclude only Google and no other search engines, use this robots.txt file:

    User-agent: Googlebot
    Disallow: /

You may identify certain directories as impervious to the crawl, either from Google or all spiders:

    User-agent: *
    Disallow: /cgi-bin/
    Disallow: /family/
    Disallow: /photos/

Notice the forward slash at each end of the directory string in the preceding example. Google understands that the first slash implies your domain address before it. So, if the first Disallow line were found at the bradhill.com site, the line would be shorthand for http://www.bradhill.com/cgi-bin/, and Google would know to exclude that directory from the crawl. The second forward slash is the indicator that you are excluding an entire directory.

To exclude individual pages, type the page address following the first forward slash, and leave off the ending forward slash, like this:

    User-agent: *
    Disallow: /family/reunion-notes.htm
    Disallow: /blog/archive00082.htm
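If you want to check what a robots.txt file actually blocks before you upload it, a short script can help. The following is a minimal sketch in Python, using the standard library's urllib.robotparser module to test the rules shown above. The helper name allowed, the spider name OtherBot, and the page index.htm are made up for illustration; only Googlebot, the bradhill.com address, and the Disallow paths come from the text. Python's parser is one interpretation of the protocol, so treat the results as a sanity check rather than a guarantee of how any particular spider behaves.

```python
from urllib.robotparser import RobotFileParser


def allowed(robots_txt: str, agent: str, url: str) -> bool:
    """Return True if `agent` may fetch `url` under the given robots.txt text.

    Hypothetical helper for illustration; it parses the rules from a string
    instead of downloading them from a live site.
    """
    parser = RobotFileParser()
    parser.parse(robots_txt.splitlines())
    return parser.can_fetch(agent, url)


# Exclude only Google's spider from the whole site.
BLOCK_GOOGLE_ONLY = """\
User-agent: Googlebot
Disallow: /
"""

# Exclude whole directories for every spider.
BLOCK_DIRECTORIES = """\
User-agent: *
Disallow: /cgi-bin/
Disallow: /family/
Disallow: /photos/
"""

# Exclude individual pages for every spider (no trailing slash).
BLOCK_PAGES = """\
User-agent: *
Disallow: /family/reunion-notes.htm
Disallow: /blog/archive00082.htm
"""

SITE = "http://www.bradhill.com"

if __name__ == "__main__":
    # "OtherBot" and "index.htm" are assumed names, used only for comparison.
    print(allowed(BLOCK_GOOGLE_ONLY, "Googlebot", SITE + "/index.htm"))
    print(allowed(BLOCK_GOOGLE_ONLY, "OtherBot", SITE + "/index.htm"))
    print(allowed(BLOCK_DIRECTORIES, "Googlebot", SITE + "/cgi-bin/form.cgi"))
    print(allowed(BLOCK_PAGES, "Googlebot", SITE + "/family/reunion-notes.htm"))
    print(allowed(BLOCK_PAGES, "Googlebot", SITE + "/family/index.htm"))
```

Passing the rules in as a string keeps the check offline, so you can experiment with Disallow lines without touching your server.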