XML Sitemaps and Robots.txt: The Files Google Reads First
A plain-English guide to XML sitemaps and robots.txt files: what they do and how to set them up for your website.
Before Google even looks at your homepage, your beautiful service pages, or your carefully crafted blog posts, it reads two files: your XML sitemap and your robots.txt. These files are like the table of contents and the “do not enter” signs for your website. If they are missing, misconfigured, or outdated, you could be sabotaging your own SEO without knowing it.
Let’s break down what these files do and how to set them up correctly.
What Is an XML Sitemap?
An XML sitemap is a file that lists every page on your website that you want search engines to index. Think of it as a roadmap you hand to Google that says, “Here are all my important pages. Please crawl them.”
Your sitemap lives at a URL like: yourdomain.com/sitemap.xml
It includes:
- The URL of each page
- When each page was last modified
- How important each page is relative to other pages (priority)
- How frequently each page changes (changefreq: daily, weekly, monthly)
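For reference, here is what a single entry looks like in the raw XML. Your CMS or SEO plugin writes this for you; the URL and values below are placeholders:

<?xml version="1.0" encoding="UTF-8"?>
<urlset xmlns="http://www.sitemaps.org/schemas/sitemap/0.9">
  <url>
    <loc>https://yourdomain.com/services/</loc>
    <lastmod>2026-01-15</lastmod>
    <changefreq>monthly</changefreq>
    <priority>0.8</priority>
  </url>
</urlset>

Each page you want indexed gets its own <url> block. Worth knowing: Google treats priority and changefreq as hints at best, so focus on keeping loc and lastmod accurate.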
Why Your Sitemap Matters
Without a sitemap, Google has to discover your pages by following links. For small sites with good internal linking, this usually works fine. But for larger sites, new sites, or sites with poor internal linking, pages can go undiscovered.
A sitemap also:
- Helps new pages get indexed faster
- Tells Google about pages that might not be well-linked internally
- Shows Google when you updated content (freshness signals)
- Helps AI search crawlers find your content more efficiently
How to Create and Submit Your Sitemap
WordPress Users
If you use Yoast SEO or Rank Math, your sitemap is automatically generated. Check yourdomain.com/sitemap.xml or yourdomain.com/sitemap_index.xml.
Other Platforms
Squarespace, Wix, and Shopify all auto-generate sitemaps. Check your platform’s documentation for the exact URL.
Custom Websites
Use a tool like XML-Sitemaps.com (free for up to 500 pages) or Screaming Frog to generate one. Upload it to your website’s root directory.
Submitting to Google
- Log into Google Search Console
- Go to Sitemaps in the left menu
- Enter your sitemap URL
- Click Submit
Our guide on Google Search Console walks through this and other essential setup steps.
Sitemap Best Practices
- Only include pages you want indexed. Do not include login pages, thank-you pages, or admin URLs.
- Keep it current. If you add or remove pages, your sitemap should update automatically (most CMS platforms handle this).
- Stay under 50,000 URLs per sitemap. For most small businesses, this is never an issue. But if you have a very large site, split it with sitemap index files (see the sketch after this list).
- Update the lastmod date when you meaningfully update content. This signals freshness to Google.
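If you do need to split a large site, a sitemap index is just an XML file that points to your other sitemap files. A minimal sketch, with placeholder URLs:

<?xml version="1.0" encoding="UTF-8"?>
<sitemapindex xmlns="http://www.sitemaps.org/schemas/sitemap/0.9">
  <sitemap>
    <loc>https://yourdomain.com/sitemap-pages.xml</loc>
    <lastmod>2026-01-15</lastmod>
  </sitemap>
  <sitemap>
    <loc>https://yourdomain.com/sitemap-posts.xml</loc>
    <lastmod>2026-01-10</lastmod>
  </sitemap>
</sitemapindex>

Submit the index file to Google Search Console and it will find the individual sitemaps on its own.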
What Is Robots.txt?
Your robots.txt file tells search engine crawlers which parts of your site they can and cannot access. It lives at: yourdomain.com/robots.txt
A basic robots.txt file looks like this:
User-agent: *
Allow: /
Sitemap: https://yourdomain.com/sitemap.xml
This says: “All crawlers are allowed to access everything. Here is my sitemap.”
Why Robots.txt Matters
A misconfigured robots.txt can block Google from crawling your entire website. We have seen businesses that accidentally blocked their site with a robots.txt rule like:
User-agent: *
Disallow: /
That single line tells every search engine crawler to stay out. Complete invisibility. If your site suddenly disappeared from Google, check your robots.txt first.
Robots.txt Best Practices
DO Block
- Admin pages (/wp-admin/, /admin/)
- Internal search results pages
- Thank you pages
- Staging or development directories
- Private or duplicate content sections
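Translated into robots.txt, the list above might look like this. The paths are examples only; match them to your own site's structure:

User-agent: *
# Keep crawlers out of admin areas
Disallow: /wp-admin/
Disallow: /admin/
# WordPress sites should still allow this file, which front-end features rely on
Allow: /wp-admin/admin-ajax.php
# Internal search results, thank-you pages, staging
Disallow: /search/
Disallow: /thank-you/
Disallow: /staging/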
DO NOT Block
- Your homepage
- Service pages
- Blog posts
- Images (unless you have a specific reason)
- CSS and JavaScript files (Google needs these to render your pages)
Important for AI Search in 2026
Your robots.txt also controls which AI crawlers can access your site. Most major AI crawlers respect robots.txt directives. If you want to be found by AI search engines, make sure you are NOT blocking these user agents:
- OAI-SearchBot (ChatGPT Search)
- PerplexityBot (Perplexity)
- ClaudeBot (Anthropic’s Claude)
- Googlebot (Google, including AI Overviews)
If you previously added blocks for AI crawlers (some site owners did in 2024-2025), and you now want AI search visibility, remove those blocks.
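The kind of block to look for and delete looks like this. Because crawlers are allowed by default, removing the lines is enough; no explicit Allow rule is needed:

User-agent: OAI-SearchBot
Disallow: /

User-agent: PerplexityBot
Disallow: /

User-agent: ClaudeBot
Disallow: /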
Common Mistakes
Mistake 1: No Sitemap at All
Some websites do not have a sitemap. This is not catastrophic for small sites, but it is a missed opportunity. Create one.
Mistake 2: Sitemap Includes Noindexed Pages
If a page has a noindex tag but is in your sitemap, you are sending contradictory signals. Remove noindexed pages from your sitemap.
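For context, a noindex directive usually looks like this in the page's HTML head:

<meta name="robots" content="noindex">

If a URL carries that tag, it should not also appear in your sitemap.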
Mistake 3: Robots.txt Blocks CSS/JS
Some outdated robots.txt configurations block CSS and JavaScript files. Google needs these to render your pages properly. Make sure they are not blocked.
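Older WordPress setups, for example, often shipped rules like these, which block the directories where theme and plugin CSS and JavaScript live. If you see anything similar, remove it:

User-agent: *
Disallow: /wp-includes/
Disallow: /wp-content/plugins/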
Mistake 4: Forgetting the Sitemap Reference
Add your sitemap URL to your robots.txt file. This helps crawlers find it:
Sitemap: https://yourdomain.com/sitemap.xml
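The Sitemap directive stands on its own, outside any User-agent group, so it can go anywhere in the file. A common pattern is to put it at the end:

User-agent: *
Disallow: /wp-admin/

Sitemap: https://yourdomain.com/sitemap.xml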
Mistake 5: Never Checking After Setup
Review your robots.txt and sitemap at least quarterly. Our technical SEO audit checklist includes this as a standard check.
How to Check Yours Right Now
- Visit yourdomain.com/robots.txt in your browser. Read what it says. Make sure nothing important is blocked.
- Visit yourdomain.com/sitemap.xml in your browser. Make sure it exists and includes all your important pages.
- Check Google Search Console > Sitemaps to see if your sitemap is submitted and if there are any errors.
This takes five minutes and could reveal issues that are silently hurting your rankings.
Need help with your technical SEO setup? Contact our team and we will make sure Google (and AI search engines) can properly crawl and understand your website.