robots.txt: Publish Clear Crawl Rules at the Site Root
Robots.txt is one of the first files crawlers look for when they evaluate a site.
It should be present, intentional, and free of launch-blocking mistakes.
What It Is
The robots.txt file lives at the site root and gives crawl guidance to well-behaved bots.
User-agent: *
Allow: /
Sitemap: https://example.com/sitemap.xml
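To see how a crawler might interpret the rules above, here is a minimal sketch using Python's standard-library urllib.robotparser. The bot name ExampleBot and the /products/ path are illustrative assumptions, not part of the original example, and site_maps() requires Python 3.8 or later.

```python
from urllib.robotparser import RobotFileParser

# The sample rules from this section, inlined so the sketch needs no network access.
ROBOTS_TXT = """\
User-agent: *
Allow: /
Sitemap: https://example.com/sitemap.xml
"""


def load_rules(text: str) -> RobotFileParser:
    """Parse robots.txt content that has already been read into a string."""
    parser = RobotFileParser()
    parser.parse(text.splitlines())
    return parser


rules = load_rules(ROBOTS_TXT)

# "ExampleBot" and the /products/ URL are hypothetical; any agent falls under "User-agent: *".
allowed = rules.can_fetch("ExampleBot", "https://example.com/products/")
sitemaps = rules.site_maps()

print(allowed)   # True
print(sitemaps)  # ['https://example.com/sitemap.xml']
```

Real crawlers may resolve overlapping rules differently from this parser, so treat the result as an approximation rather than a guarantee of any particular bot's behavior.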
Why It Matters
- It communicates basic crawl rules early.
- It can point crawlers to the sitemap.
- It helps avoid accidental blocking of important public paths.
Best Practices
- Publish the file at /robots.txt.
- Keep rules simple unless you have a clear reason for complexity.
- Review staging disallow rules before deploying to production.
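The staging review in the last practice can be partly automated. The sketch below, again assuming Python's urllib.robotparser, flags a file whose rules block the site root for a generic crawler. The function name blocks_whole_site and the sample rule strings are hypothetical, and a check like this would typically run as a pre-deploy step against the file about to be published.

```python
from urllib.robotparser import RobotFileParser


def blocks_whole_site(robots_text: str, agent: str = "*") -> bool:
    """Return True if the given rules stop `agent` from fetching the site root."""
    parser = RobotFileParser()
    parser.parse(robots_text.splitlines())
    # If "/" itself is disallowed, a staging-style blanket block is likely still present.
    return not parser.can_fetch(agent, "/")


# Hypothetical file contents for illustration.
STAGING_RULES = "User-agent: *\nDisallow: /\n"
PRODUCTION_RULES = "User-agent: *\nAllow: /\n"

print(blocks_whole_site(STAGING_RULES))     # True
print(blocks_whole_site(PRODUCTION_RULES))  # False
```

This only checks the root path, so narrower rules that block individual sections would still need a manual review.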
Common Mistakes
- No file at all.
- Leaving Disallow: / from staging.
- Using robots.txt as if it were a security control.
Quick Checklist
- File exists at root.
- Important pages are not blocked.
- Sitemap location included when useful.
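The checklist can be expressed as a small audit function. This is a sketch under stated assumptions: the caller has already tried to fetch /robots.txt and passes its text (or None if the file was missing), and the list of important paths is supplied by the site owner. The names audit_robots and the example paths are hypothetical.

```python
from typing import List, Optional
from urllib.robotparser import RobotFileParser


def audit_robots(text: Optional[str], important_paths: List[str], agent: str = "*") -> List[str]:
    """Return a list of checklist problems; an empty list means all checks passed."""
    if text is None:
        # Checklist item 1: the file must exist at the site root.
        return ["robots.txt not found at site root"]

    parser = RobotFileParser()
    parser.parse(text.splitlines())

    # Checklist item 2: important pages must not be blocked.
    problems = [f"blocked: {path}" for path in important_paths
                if not parser.can_fetch(agent, path)]

    # Checklist item 3: a Sitemap line is expected when one is useful.
    # site_maps() (Python 3.8+) returns None when no Sitemap line is present.
    if not parser.site_maps():
        problems.append("no Sitemap line")
    return problems


good = "User-agent: *\nAllow: /\nSitemap: https://example.com/sitemap.xml\n"
bad = "User-agent: *\nDisallow: /\n"

print(audit_robots(good, ["/", "/pricing"]))  # []
print(audit_robots(bad, ["/", "/pricing"]))
# ['blocked: /', 'blocked: /pricing', 'no Sitemap line']
```

Treat the missing-sitemap message as a prompt rather than a failure, since the checklist only asks for a sitemap location when it is useful.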
Final Takeaway
Robots.txt should guide discovery, not accidentally suppress it.