Excluding Pages From The Sitemap.xml File In Hugo
Note: I’ve created a GitHub repository for the template described in this post.
By default, Hugo includes all pages in the sitemap.xml file used by search engines to help crawl and index your site. There may come a time when you want to exclude a page from the sitemap.xml file, but there is no easy way to do that with Hugo’s default templates. You would need to manually edit your sitemap.xml file each time it is generated to remove unwanted pages.
This manual editing can be cumbersome, as the sitemap.xml file can grow quite large. As an example, here is the sitemap.xml file for my site, which is a relatively small site.
<?xml version="1.0" encoding="utf-8" standalone="yes" ?>
<urlset xmlns="http://www.sitemaps.org/schemas/sitemap/0.9"
  xmlns:xhtml="http://www.w3.org/1999/xhtml">
  <url>
    <loc>https://dereckcurry.com/posts/more-flexible-twitter-cards-in-hugo/</loc>
    <lastmod>2019-04-05T11:49:00-04:00</lastmod>
  </url>
  <url>
    <loc>https://dereckcurry.com/posts/automatically-tweeting-new-hugo-posts/</loc>
    <lastmod>2019-04-03T16:15:24-04:00</lastmod>
  </url>
  <url>
    <loc>https://dereckcurry.com/posts/advanced-mailto-encoder/</loc>
    <lastmod>2019-03-15T12:32:24-04:00</lastmod>
  </url>
  <url>
    <loc>https://dereckcurry.com/posts/limiting-the-generation-of-rss-feeds-in-hugo/</loc>
    <lastmod>2019-03-11T16:24:24-04:00</lastmod>
  </url>
  <url>
    <loc>https://dereckcurry.com/posts/hello-world/</loc>
    <lastmod>2019-03-05T16:16:55-05:00</lastmod>
  </url>
  <url>
    <loc>https://dereckcurry.com/about/</loc>
  </url>
  <url>
    <loc>https://dereckcurry.com/tags/advanced-mailto-encoder/</loc>
    <lastmod>2019-03-15T12:32:24-04:00</lastmod>
    <priority>0</priority>
  </url>
  <url>
    <loc>https://dereckcurry.com/tags/buffer/</loc>
    <lastmod>2019-04-03T16:15:24-04:00</lastmod>
    <priority>0</priority>
  </url>
  <url>
    <loc>https://dereckcurry.com/categories/</loc>
    <priority>0</priority>
  </url>
  <url>
    <loc>https://dereckcurry.com/</loc>
    <lastmod>2019-04-05T11:49:00-04:00</lastmod>
    <priority>0</priority>
  </url>
  <url>
    <loc>https://dereckcurry.com/categories/hugo/</loc>
    <lastmod>2019-04-05T11:49:00-04:00</lastmod>
    <priority>0</priority>
  </url>
  <url>
    <loc>https://dereckcurry.com/tags/hugo/</loc>
    <lastmod>2019-04-05T11:49:00-04:00</lastmod>
    <priority>0</priority>
  </url>
  <url>
    <loc>https://dereckcurry.com/posts/</loc>
    <lastmod>2019-04-05T11:49:00-04:00</lastmod>
    <priority>0</priority>
  </url>
  <url>
    <loc>https://dereckcurry.com/tags/rss/</loc>
    <lastmod>2019-04-03T16:15:24-04:00</lastmod>
    <priority>0</priority>
  </url>
  <url>
    <loc>https://dereckcurry.com/tags/rumkin.com/</loc>
    <lastmod>2019-03-15T12:32:24-04:00</lastmod>
    <priority>0</priority>
  </url>
  <url>
    <loc>https://dereckcurry.com/tags/</loc>
    <lastmod>2019-03-15T12:32:24-04:00</lastmod>
    <priority>0</priority>
  </url>
  <url>
    <loc>https://dereckcurry.com/tags/twitter/</loc>
    <lastmod>2019-04-05T11:49:00-04:00</lastmod>
    <priority>0</priority>
  </url>
  <url>
    <loc>https://dereckcurry.com/categories/web-tools/</loc>
    <lastmod>2019-03-15T12:32:24-04:00</lastmod>
    <priority>0</priority>
  </url>
  <url>
    <loc>https://dereckcurry.com/tags/web-tools/</loc>
    <lastmod>2019-03-15T12:32:24-04:00</lastmod>
    <priority>0</priority>
  </url>
  <url>
    <loc>https://dereckcurry.com/tags/zapier/</loc>
    <lastmod>2019-04-03T16:15:24-04:00</lastmod>
    <priority>0</priority>
  </url>
</urlset>
Instead of editing this file repeatedly, what I really want is to be able to specify in the front matter of a page if it should be excluded from the sitemap.xml file.
To achieve this, two things need to happen.
- The default sitemap.xml template that ships with Hugo needs to be modified.
- Front matter needs to be specified on a page to remove it from the sitemap.xml file.
Let’s get started.
Modifying The Sitemap.xml Template
The simplified Hugo site directory structure for this is:
dereckcurry.com
  layouts
    _default
      sitemap.xml
The theme used by my site did not override Hugo’s default sitemap.xml template, so we will need to create our custom layout in the site’s /layouts/_default/ directory based upon this template. I just copied the existing default sitemap.xml template from the Hugo repository on GitHub and placed it in the /layouts/_default/ directory of my site.
Note: If the theme you are using has already specified a custom sitemap.xml template file, then you will need to copy that file to the corresponding /layouts/ directory location, and make the appropriate modifications to that copied file instead.
The default Hugo sitemap.xml template at the time of writing is the file copied from the Hugo repository in the previous step.
We’re going to modify the template to look at the front matter of each page to see if a sitemapExclude parameter is set to true. The best place to do this seems to be after line 4, {{ range .Data.Pages }}.
In the modified template, the new check sits on line 5, immediately after the range statement.
Line 5 may now look a bit complicated, but essentially all pages are included in the sitemap.xml file by default, unless a sitemapExclude parameter is included in the front matter of a page and its value is set to true. The sitemapExclude parameter is optional and the template code will not break if it is not set.
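The check described above could be sketched roughly as follows. This is a minimal sketch, not the exact template from the repository: it assumes the range loop over .Data.Pages from the default template, reduces each entry to its loc and lastmod elements, and the exact form of the condition is an assumption. Because of the added comment, the line numbers in this sketch do not match the ones referenced in the text.

```go-html-template
{{ printf "<?xml version=\"1.0\" encoding=\"utf-8\" standalone=\"yes\" ?>" | safeHTML }}
<urlset xmlns="http://www.sitemaps.org/schemas/sitemap/0.9"
  xmlns:xhtml="http://www.w3.org/1999/xhtml">
  {{ range .Data.Pages }}
  {{/* Assumed form of the check: skip a page only when sitemapExclude is exactly true.
       An unset parameter compares as "not true", so the page is still included. */}}
  {{ if ne .Params.sitemapExclude true }}
  <url>
    <loc>{{ .Permalink }}</loc>{{ if not .Lastmod.IsZero }}
    <lastmod>{{ safeHTML ( .Lastmod.Format "2006-01-02T15:04:05-07:00" ) }}</lastmod>{{ end }}
  </url>
  {{ end }}
  {{ end }}
</urlset>
```

Wrapping the whole url element in a single condition keeps the rest of the template unchanged, so the remaining elements of the default template (changefreq, priority, translations) can stay exactly where they are inside the if block.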
That’s all the modifications necessary to the sitemap.xml template.
Page Front Matter Settings
As mentioned above, all pages will be included in the sitemap.xml by default. But if you do wish to exclude a page from the sitemap.xml file, then all you need to do is include a front matter sitemapExclude parameter and set the value to true. Any value other than true will have the page included in the sitemap.xml file.
Here’s an example TOML front matter configuration for a page that excludes the page from the sitemap.xml file.
+++
title = "Excluded Page"
sitemapExclude = true
+++
Again, you only need to set the front matter parameter for the pages you wish to exclude from the sitemap.xml file, as all pages will be included by default.
Caveats
One thing I discovered is that when a page is excluded, a blank line is inserted in the sitemap.xml file. Technically this is a small information leak, as it indicates that you excluded a page. It does not indicate which page was excluded, just that somewhere on your site a page was excluded. This is probably not a big deal for most sites, but be aware of it.
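A possible way to avoid the stray blank line is the whitespace trimming that Go templates provide: writing {{- instead of {{ removes the whitespace, including newlines, immediately before the action. The fragment below is a sketch of that idea applied to an assumed form of the exclusion check, not the template from this post, and each entry is reduced to its loc element for brevity.

```go-html-template
{{- range .Data.Pages }}
{{- /* Assumed check; the leading "-" trims the newline that would otherwise be emitted. */}}
{{- if ne .Params.sitemapExclude true }}
  <url>
    <loc>{{ .Permalink }}</loc>
  </url>
{{- end }}
{{- end }}
```

With the trim markers in place, an excluded page contributes no output at all, so nothing in the generated file hints that a page was skipped.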
Also, removing a page from the sitemap.xml file may not keep a search engine from eventually discovering and indexing the page. When determining whether a particular page should be indexed and included in search results, well-behaved search engines will also take into consideration the robots meta tag on the page and the disallow directives in the robots.txt file. But some search engines don’t honor those settings. If you really want to keep content out of search engines, you’ll have to password protect the page. Or better yet, never put the content on a web page in the first place, as search engines are really good at discovering and indexing content.
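As an illustration of the robots meta tag mentioned above, a theme's head template could emit a noindex directive for the same pages that are left out of the sitemap. This is a sketch under assumptions: reusing the sitemapExclude parameter for this purpose is not something the sitemap template in this post does, and where the fragment belongs depends on the theme.

```go-html-template
{{/* Hypothetical fragment for the <head> section of a page template. */}}
{{ if eq .Params.sitemapExclude true }}
<meta name="robots" content="noindex">
{{ end }}
```

Keeping the two signals tied to one front matter parameter means a page cannot be dropped from the sitemap while still inviting indexing, though a separate parameter would give finer control.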
For a Hugo template that allows for the generation and configuration of the robots meta tags on pages, see the following post.
Conclusion
That’s it. Not really that much to it. The biggest trick was getting the logic set in the template to try and avoid duplicate code while also remaining understandable.
I hope you find this useful. Please let me know if you discover any bugs.
Dereck