What’s Happening?
Google has provided a clear update on the use of fields in your robots.txt file. They have confirmed that only certain fields are officially supported. If you’re using fields that aren’t listed in their documentation (like crawl-delay), they won’t be recognized by Google.
Why Does This Matter?
Many webmasters and SEOs have tried using unsupported fields in their robots.txt files, leading to confusion when Google doesn’t follow those rules. By clarifying which fields are supported, Google wants to ensure that everyone understands exactly which instructions its crawlers will follow.
Key Supported Fields
Here are the four key fields that Google officially supports in your robots.txt file:
- User-agent:
  This field specifies which search engine crawler (also called a "user-agent") the rules apply to. For example, you can create rules that apply only to Google's crawler.
  Example: `User-agent: Googlebot`. This rule applies only to Google's web crawler (Googlebot).
- Allow:
  The Allow field specifies which pages or parts of your website are permitted to be crawled, even if broader rules block other parts.
  Example: `Allow: /public-content/`. This allows the crawler to access the /public-content/ section of your website.
- Disallow:
  The Disallow field blocks specific URLs from being crawled. It tells search engines not to visit certain pages or directories.
  Example: `Disallow: /private/`. This prevents crawlers from accessing the /private/ directory of your website.
- Sitemap:
  This field provides the full URL of your website's sitemap, a file listing the important pages on your site. Including it ensures Google knows where to find your sitemap.
  Example: `Sitemap: https://www.example.com/sitemap.xml`.
Unsupported Fields
Some fields, like crawl-delay, may work with other search engines but are not supported by Google. If you’re using unsupported fields, they will be ignored by Google’s crawlers.
Example of a Properly Configured Robots.txt:
User-agent: *
Disallow: /private/
Allow: /public/
Sitemap: https://www.example.com/sitemap.xml
In this example:
- All crawlers (`*`) are prevented from accessing the /private/ directory.
- The /public/ directory is allowed for crawling.
- Google is informed about the location of the sitemap.
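To sanity-check rules like these before deploying them, you can feed the file into Python's built-in `urllib.robotparser` and ask which URLs a given crawler may fetch. A minimal sketch, using the example file above (the page URLs are illustrative):

```python
from urllib import robotparser

# The example robots.txt from above, as a string.
rules = """\
User-agent: *
Disallow: /private/
Allow: /public/
Sitemap: https://www.example.com/sitemap.xml
"""

parser = robotparser.RobotFileParser()
parser.parse(rules.splitlines())

# Check whether Googlebot may fetch specific URLs under these rules.
print(parser.can_fetch("Googlebot", "https://www.example.com/public/page.html"))   # True
print(parser.can_fetch("Googlebot", "https://www.example.com/private/page.html"))  # False
```

Note that `urllib.robotparser` follows the original robots.txt conventions, so its answers can differ from Google's own matcher in edge cases; it is a quick local check, not an exact replica of Googlebot.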
| Robots.txt URL | Example path matches |
|---|---|
| `https://example.com/robots.txt` | This is the general case. It's not valid for other subdomains, protocols, or port numbers. It's valid for all files in all subdirectories on the same host, protocol, and port number. Valid for: `https://example.com/`, `https://example.com/folder/file`. Not valid for: `https://other.example.com/`, `http://example.com/`, `https://example.com:8181/` |
| `https://www.example.com/robots.txt` | A robots.txt on a subdomain is only valid for that subdomain. Valid for: `https://www.example.com/`. Not valid for: `https://example.com/`, `https://shop.www.example.com/`, `https://www.shop.example.com/` |
| `https://example.com/folder/robots.txt` | Not a valid robots.txt file. Crawlers don't check for robots.txt files in subdirectories. |
| `https://www.exämple.com/robots.txt` | IDNs are equivalent to their punycode versions. See also RFC 3492. Valid for: `https://www.exämple.com/`, `https://xn--exmple-cua.com/`. Not valid for: `https://www.example.com/` |
| `ftp://example.com/robots.txt` | Valid for: `ftp://example.com/`. Not valid for: `https://example.com/` |
| `https://212.96.82.21/robots.txt` | A robots.txt with an IP address as the host name is only valid for crawling of that IP address as host name. It isn't automatically valid for all websites hosted on that IP address (though it's possible that the robots.txt file is shared, in which case it would also be available under the shared host name). Valid for: `https://212.96.82.21/`. Not valid for: `https://example.com/` (even if hosted on 212.96.82.21) |
| `https://example.com:443/robots.txt` | Standard port numbers (80 for HTTP, 443 for HTTPS, 21 for FTP) are equivalent to their default host names. Valid for: `https://example.com:443/`, `https://example.com/`. Not valid for: `https://example.com:444/` |
| `https://example.com:8181/robots.txt` | Robots.txt files on non-standard port numbers are only valid for content made available through those port numbers. Valid for: `https://example.com:8181/`. Not valid for: `https://example.com/` |
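The scoping rules in the table reduce to one idea: a robots.txt file governs exactly one scheme, host, and port, and always lives at the root. A short sketch with `urllib.parse` that derives the governing robots.txt URL for any page (the page URLs are illustrative):

```python
from urllib.parse import urlsplit

def robots_txt_url(page_url: str) -> str:
    """Return the robots.txt URL that governs page_url.

    Per the scoping rules above, the file sits at the root of the
    same scheme, host, and port -- never inside a subfolder.
    """
    parts = urlsplit(page_url)
    # netloc keeps any explicit port, so a non-standard port stays distinct.
    return f"{parts.scheme}://{parts.netloc}/robots.txt"

print(robots_txt_url("https://example.com/folder/file"))  # https://example.com/robots.txt
print(robots_txt_url("https://www.example.com/"))         # https://www.example.com/robots.txt
print(robots_txt_url("https://example.com:8181/page"))    # https://example.com:8181/robots.txt
```

Note this sketch does not normalize standard ports (`:443` to none) or IDNs to punycode, which the table treats as equivalent.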
| Path | Example path matches |
|---|---|
| `/` | Matches the root and any lower-level URL. |
| `/*` | Equivalent to `/`. The trailing wildcard is ignored. |
| `/$` | Matches only the root. Any lower-level URL is allowed for crawling. |
| `/fish` | Matches any path that starts with `/fish`. Note that the matching is case-sensitive. Matches: `/fish`, `/fish.html`, `/fish/salmon.html`, `/fishheads`, `/fishheads/yummy.html`, `/fish.php?id=anything`. Doesn't match: `/Fish.asp`, `/catfish`, `/?id=fish`, `/desert/fish` |
| `/fish*` | Equivalent to `/fish`. The trailing wildcard is ignored. Matches: `/fish`, `/fish.html`, `/fish/salmon.html`, `/fishheads`, `/fishheads/yummy.html`, `/fish.php?id=anything`. Doesn't match: `/Fish.asp`, `/catfish`, `/?id=fish`, `/desert/fish` |
| `/fish/` | Matches anything in the `/fish/` folder. Matches: `/fish/`, `/fish/?id=anything`, `/fish/salmon.htm`. Doesn't match: `/fish`, `/fish.html`, `/animals/fish/`, `/Fish/Salmon.asp` |
| `/*.php` | Matches any path that contains `.php`. Matches: `/index.php`, `/filename.php`, `/folder/filename.php`, `/folder/filename.php?parameters`, `/folder/any.php.file.html`, `/filename.php/`. Doesn't match: `/` (even if it maps to /index.php), `/windows.PHP` |
| `/*.php$` | Matches any path that ends with `.php`. Matches: `/filename.php`, `/folder/filename.php`. Doesn't match: `/filename.php?parameters`, `/filename.php/`, `/filename.php5`, `/windows.PHP` |
| `/fish*.php` | Matches any path that contains `/fish` and `.php`, in that order. Matches: `/fish.php`, `/fishheads/catfish.php?parameters`. Doesn't match: `/Fish.PHP` |
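The matching behavior in the table can be sketched with a small function: `*` matches any run of characters, a trailing `$` anchors the end of the path, and everything else is a case-sensitive literal. This is a simplified illustration of the wildcard semantics, not Google's actual matcher:

```python
import re

def path_matches(pattern: str, path: str) -> bool:
    """Sketch of robots.txt path matching as described in the table above.

    '*' matches any sequence of characters; a trailing '$' anchors the
    end of the path; matching is case-sensitive and starts at the
    beginning of the path.
    """
    anchored = pattern.endswith("$")
    if anchored:
        pattern = pattern[:-1]
    # Escape literal segments, join them with '.*' for each wildcard.
    regex = ".*".join(re.escape(part) for part in pattern.split("*"))
    if anchored:
        regex += "$"
    return re.match(regex, path) is not None

print(path_matches("/fish", "/fishheads/yummy.html"))       # True
print(path_matches("/fish", "/Fish.asp"))                   # False (case-sensitive)
print(path_matches("/*.php$", "/filename.php?parameters"))  # False (anchored)
print(path_matches("/*.php", "/folder/any.php.file.html"))  # True
```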
Final Takeaway:
To ensure Google correctly follows the rules you set, make sure your robots.txt file only uses supported fields. Unsupported fields won’t work with Google, so double-check that your file contains only these key fields: User-agent, Allow, Disallow, and Sitemap.
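That double-check can be automated. A minimal sketch that scans a robots.txt string and reports any field names outside the four Google supports (the field-name parsing here is a simplification; field names are treated as case-insensitive and comments after `#` are ignored):

```python
SUPPORTED_FIELDS = {"user-agent", "allow", "disallow", "sitemap"}

def unsupported_fields(robots_txt: str) -> list[str]:
    """Return field names in robots_txt that Google will not recognize."""
    found = []
    for line in robots_txt.splitlines():
        line = line.split("#", 1)[0].strip()  # drop comments and whitespace
        if not line or ":" not in line:
            continue
        field = line.split(":", 1)[0].strip().lower()
        if field not in SUPPORTED_FIELDS:
            found.append(field)
    return found

sample = "User-agent: *\nCrawl-delay: 10\nDisallow: /private/"
print(unsupported_fields(sample))  # ['crawl-delay']
```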