Robots.txt File: Complete Guide to Crawl Control and SEO Best Practices
A robots.txt file is one of the most powerful — and most misunderstood — files you can place on your website. When configured correctly, it directs search engine crawlers toward your important content and away from pages that waste crawl budget. When misconfigured, it can quietly block Google from crawling your entire website.
This guide covers everything you need to know about robots.txt: what it does, how to write valid directives, which bots respect it, common mistakes that harm SEO, and real-world examples from Indian e-commerce stores, news websites, and international SaaS platforms. By the end, you will know exactly how to build a robots.txt file that supports — rather than hurts — your search rankings.
What Is a Robots.txt File and How Does It Work?
A robots.txt file is a plain text document that lives at the root of your website domain. Search engines and web crawlers fetch this file before they begin crawling your site, reading its instructions to determine which areas they are permitted to access.
The file follows the Robots Exclusion Protocol — a widely adopted (but voluntary) standard. The key word is voluntary: legitimate crawlers like Googlebot, Bingbot, and DuckDuckBot respect robots.txt. Malicious scrapers and spam bots generally do not. This means robots.txt is a traffic and SEO tool, not a security barrier.
When a bot arrives at your site, it checks https://yourdomain.com/robots.txt first. If no file exists, it assumes it can crawl everything. If the file exists, it reads the applicable User-agent directives and follows the Allow and Disallow rules for that bot.
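You can reproduce this rule-matching behavior with Python's standard-library robots.txt parser. The rules and URLs below are hypothetical; note that Python's parser applies rules in file order (first match wins) rather than Googlebot's longest-match logic, so the Allow line is placed first here.

```python
from urllib.robotparser import RobotFileParser

# Hypothetical rules a site might serve at /robots.txt.
rules = [
    "User-agent: *",
    "Allow: /admin/public/",
    "Disallow: /admin/",
]

rp = RobotFileParser()
rp.parse(rules)

# Check whether a generic crawler ("*") may fetch each URL.
print(rp.can_fetch("*", "https://yourdomain.com/admin/settings"))    # → False
print(rp.can_fetch("*", "https://yourdomain.com/admin/public/faq"))  # → True
print(rp.can_fetch("*", "https://yourdomain.com/blog/post"))         # → True (no rule matches)
```

Because no robots.txt file at all means "crawl everything", a URL that matches no rule defaults to allowed, as the last check shows.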
Robots.txt Syntax — Every Directive Explained
Understanding the syntax is essential before you create or modify a robots.txt file. Errors in syntax — even a missing space — can cause rules to be ignored silently.
User-agent
Every rule block starts with a User-agent line that identifies which crawler the following rules apply to. Use * to target all crawlers, or a specific name like Googlebot for Google only.
Disallow
Disallow tells the specified bot not to crawl a particular path. Leave the value blank to allow everything (same as having no rule).
Allow
Allow explicitly permits access to a path, even if a broader Disallow rule covers it. Googlebot uses longest-match-wins logic — the more specific rule takes precedence.
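For example, in this sketch (paths hypothetical), Googlebot may crawl /admin/public/ even though /admin/ is disallowed, because the Allow rule is the longer, more specific match:

```
User-agent: *
Disallow: /admin/
Allow: /admin/public/
```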
Sitemap
The Sitemap directive tells crawlers where to find your sitemap.xml. It is not bot-specific — it applies globally and helps crawlers discover all your indexable pages faster.
Crawl-delay
Crawl-delay requests that a bot wait a set number of seconds between requests. Google ignores this directive entirely (Googlebot adjusts its crawl rate automatically), but Bingbot and some other crawlers do respect it.
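For instance, this hypothetical block asks Bingbot to wait 10 seconds between requests and declares a sitemap location for all crawlers:

```
User-agent: Bingbot
Crawl-delay: 10

Sitemap: https://yourdomain.com/sitemap.xml
```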
Wildcards and Pattern Matching in Robots.txt
Googlebot supports two special pattern characters: the asterisk (*) as a wildcard in paths, and the dollar sign ($) to match the end of a URL. These allow more flexible rules without listing every individual path.
Disallow: /*.pdf$ blocks all URLs ending in .pdf. Disallow: /*? blocks any URL containing a query string, whatever the parameter. Disallow: /search? specifically blocks URLs beginning with /search? — useful for e-commerce sites with filter-generated URLs that produce thousands of near-duplicate pages.
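Put together in a file, these three pattern rules look like this:

```
User-agent: *
Disallow: /*.pdf$
Disallow: /*?
Disallow: /search?
```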
Real-World Examples — India and International
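As an illustration only (a hypothetical file, not copied from any real retailer), an Indian e-commerce store might block cart, checkout, and filter-generated URLs while declaring its sitemap:

```
User-agent: *
Disallow: /cart
Disallow: /checkout
Disallow: /orders
Disallow: /search?
Disallow: /*?sort=

Sitemap: https://example-store.in/sitemap.xml
```

News sites and SaaS platforms follow the same pattern with different low-value paths, such as tag archives or account dashboards.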
Common Robots.txt Mistakes That Hurt SEO
Errors in robots.txt are often invisible — your site appears to work normally, but Google silently stops indexing critical pages. These are the most damaging mistakes website owners make.
- Blocking the entire site during development: Adding Disallow: / during a site build and forgetting to remove it after launch is one of the most common — and disastrous — SEO accidents. Always verify your live site's robots.txt after launch.
- Blocking CSS and JavaScript files: Google needs to render your pages to understand them. Blocking /wp-content/ or your CSS directory prevents Googlebot from seeing your page the way a user does, which can hurt rankings.
- Thinking robots.txt removes pages from Google: If a disallowed page has backlinks, Google may still list it in results without a description snippet. To truly remove a page, use noindex or Google Search Console's URL removal tool.
- Incorrect path format: Paths in robots.txt must start with a forward slash. Disallow: admin/ does nothing; the correct form is Disallow: /admin/.
- Missing blank line between rule blocks: Google's parser delimits groups by User-agent lines rather than blank lines, but the original standard used blank-line-separated records and some crawlers still expect them. Separate each User-agent block with a blank line; it is the safe convention and makes group boundaries obvious.
- Placing robots.txt in a subdirectory: The file must be at https://yourdomain.com/robots.txt. Placing it at https://yourdomain.com/blog/robots.txt has no effect.
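Putting the formatting points above together, a correctly structured file with two separated rule blocks (paths hypothetical) looks like this:

```
User-agent: Googlebot
Disallow: /staging/

User-agent: *
Disallow: /admin/

Sitemap: https://yourdomain.com/sitemap.xml
```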
Robots.txt and Crawl Budget — Why It Matters for Large Sites
Crawl budget refers to the number of pages Googlebot will crawl on your site within a given time period. For small sites (under 1,000 pages), crawl budget is rarely a concern — Google crawls everything anyway. But for large e-commerce stores, news sites, and content platforms with tens of thousands of URLs, crawl budget management becomes critical.
If Googlebot spends its crawl budget on low-value URLs — paginated filter pages, parameter-generated duplicates, internal search result pages, user profile pages — it has less budget left to crawl your high-value product pages and fresh content. This directly delays indexing and can suppress rankings.
Combine robots.txt directives with canonical tags and the noindex meta tag for the most effective crawl budget strategy. robots.txt prevents crawling; canonical and noindex prevent indexing. They serve different purposes and work best together.
Platform-Specific Robots.txt Recommendations
WordPress
WordPress generates a virtual robots.txt automatically if no physical file exists. However, creating a custom robots.txt gives you more control. Always block /wp-admin/ (allow /wp-admin/admin-ajax.php for AJAX functionality) and consider blocking tag, author, and date archives if they generate duplicate content.
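A minimal WordPress robots.txt reflecting these recommendations might look like the following; the sitemap URL is a placeholder, since the actual path depends on your SEO plugin:

```
User-agent: *
Disallow: /wp-admin/
Allow: /wp-admin/admin-ajax.php

Sitemap: https://yourdomain.com/sitemap.xml
```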
Shopify
Shopify restricts editing of the robots.txt file by default (it is auto-generated). However, from 2021 onward, Shopify Plus merchants and standard merchants can customize it via the robots.txt.liquid template. Block /cart, /checkout, /orders, and search result pages.
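Based on the pattern in Shopify's robots.txt.liquid documentation, a customization sketch might look like this; the extra Disallow path is hypothetical, and the loop preserves Shopify's auto-generated defaults:

```liquid
{% for group in robots.default_groups %}
  {{- group.user_agent }}

  {%- for rule in group.rules %}
  {{ rule }}
  {%- endfor %}

  {%- if group.user_agent.value == '*' %}
  {{ 'Disallow: /pages/internal-docs' }}
  {%- endif %}

  {%- if group.sitemap != blank %}
  {{ group.sitemap }}
  {%- endif %}
{% endfor %}
```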
Custom or Static Sites
For static HTML sites or custom-built platforms, create a robots.txt file manually and place it in the root directory. Use our Robots.txt Generator to build the file and upload it via FTP or your hosting file manager.
Multi-language Sites
If your site uses subdirectories for languages (e.g., /hi/, /ta/), ensure your robots.txt does not accidentally block these paths. Each language subdirectory should be fully accessible to crawlers.
How to Verify Your Robots.txt Is Working
After creating and uploading your robots.txt file, verification is essential. A file with a syntax error or incorrect path can silently fail to block — or incorrectly block — pages.
- Direct URL check: Visit https://yourdomain.com/robots.txt in your browser to confirm the file is live and readable.
- Google Search Console: Use the robots.txt report in Search Console (which replaced the retired robots.txt Tester) to see when Google last fetched your file and to review any parsing warnings.
- URL Inspection: Use the URL Inspection tool in GSC to fetch specific pages and confirm Googlebot can access them; it reports when a page is blocked by robots.txt.
- Site: search operator: After deploying changes, use site:yourdomain.com in Google Search to monitor which pages are indexed and identify any unexpectedly missing pages.
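The manual checks above can be complemented with a scripted smoke test using Python's standard-library parser. The rules and URLs are hypothetical; to test a live file, replace the inline rules with rp.set_url(...) followed by rp.read(). Note that this parser does not implement Googlebot's * and $ wildcards, so keep wildcard rules out of scripted checks.

```python
from urllib.robotparser import RobotFileParser

# Hypothetical rules; for a live check use:
#   rp.set_url("https://yourdomain.com/robots.txt"); rp.read()
rules = [
    "User-agent: *",
    "Disallow: /cart",
    "Disallow: /search",
]

rp = RobotFileParser()
rp.parse(rules)

# Pages that must stay crawlable, and pages that should be blocked.
critical = ["https://yourdomain.com/", "https://yourdomain.com/products/blue-shirt"]
blocked = ["https://yourdomain.com/cart", "https://yourdomain.com/search?q=shoes"]

for url in critical:
    assert rp.can_fetch("Googlebot", url), f"Critical page blocked: {url}"
for url in blocked:
    assert not rp.can_fetch("Googlebot", url), f"Should be blocked: {url}"
print("robots.txt smoke test passed")
```

Running a check like this after every robots.txt change catches the "Disallow: / left over from staging" class of accident before Google does.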
Robots.txt vs. Noindex vs. Canonical — Which to Use?
These three tools are often confused, but they serve distinct purposes. Using the wrong one for a given situation can create SEO problems instead of solving them.
- robots.txt Disallow: Prevents a bot from visiting (crawling) a URL. Does not guarantee the URL is removed from the index. Best for pages you genuinely do not want crawled at all — admin pages, staging areas, API endpoints.
- noindex meta tag: Tells a bot it may visit the page but should not include it in the search index. Requires the page to be crawlable. Best for low-value pages you still want to track (thank you pages, internal search results, user account pages).
- canonical tag: Tells search engines which version of a URL is the preferred one when duplicates exist. Best for parameter-based duplicate URLs (color/size filters, UTM parameters, paginated pages).
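For reference, the noindex and canonical signals above are HTML tags placed in a page's head section (the canonical URL here is hypothetical):

```html
<head>
  <!-- noindex: the page may be crawled but is kept out of the search index -->
  <meta name="robots" content="noindex">

  <!-- canonical: declares the preferred URL when duplicates exist -->
  <link rel="canonical" href="https://yourdomain.com/products/blue-shirt">
</head>
```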
A page blocked by robots.txt cannot be read by Googlebot — meaning it cannot process the noindex or canonical tag on that page. Never block a page in robots.txt that also uses noindex. Use one or the other, not both.
Generate Your robots.txt File Now
Use our free Robots.txt Generator to build a valid, correctly formatted robots.txt file in seconds — no coding required. Add custom bot rules, set directives, and copy the output directly.
Open Robots.txt Generator →
Recommended Hosting
Hostinger
If you are building a website for your tools, blog, or store, reliable hosting matters for speed and uptime. Hostinger is a popular option used worldwide.
Visit Hostinger →
Disclosure: This is a sponsored link.
