The Security Value of the Robots.txt file

Written by Dionach by Nomios

February 10, 2011

The robots.txt file has some security value, but it also has flaws: ‘disallow’ entries can reveal hidden folders, password lists and database backups.

This is my view on the use of robots.txt as a security control and the problems of not having one.

From my penetration testing experience, there have been many occurrences of websites not having a robots.txt file. This could be justified as a security measure, as having “disallow” entries could reveal hidden folders.

In my view, not declaring what files a web crawler can and can’t crawl is a bad security measure. For example, we have all heard about database backups, password lists and much more being indexed in Google. Take a look at the following Google Dork:

inurl:/wp-content/uploads/ confidential

Using this simple Google Dork, you can clearly see confidential files being indexed from WordPress blogs. A simple declaration such as the following would have prevented these files from being indexed:

User-agent: *
Disallow: /wp-content/uploads/
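As a quick sanity check, the declaration above can be tested with Python's standard-library `urllib.robotparser`. This is only a minimal sketch: the helper name `is_crawlable` and the example paths are assumptions for illustration, and the result only tells you what a crawler that honours robots.txt would do, not what an attacker can reach.

```python
from urllib.robotparser import RobotFileParser

# The two-line declaration from the article.
ROBOTS_TXT = """\
User-agent: *
Disallow: /wp-content/uploads/
"""


def is_crawlable(robots_txt: str, path: str, user_agent: str = "Googlebot") -> bool:
    """Return True if a compliant crawler may fetch `path` under these rules."""
    parser = RobotFileParser()
    parser.parse(robots_txt.splitlines())
    return parser.can_fetch(user_agent, path)


if __name__ == "__main__":
    # Hypothetical paths, chosen only to exercise the rule.
    for path in ("/wp-content/uploads/2011/02/confidential.pdf", "/index.html"):
        print(path, "->", "crawlable" if is_crawlable(ROBOTS_TXT, path) else "disallowed")
```

The uploads path is reported as disallowed and the home page as crawlable, which is the behaviour the declaration is meant to produce for well-behaved search engines.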

To reflect on how using robots.txt can be a bad security control, a colleague of mine raised an objection: “surely if you declare all the confidential paths such as /admin on your site, then an attacker will have a nice and easy job in finding them”. My response was that attackers have been using Google to actively find confidential files for a long time, so search engines themselves can pose a threat to the security of a website. I would rather have an attacker spider the site themselves when trying to find any sensitive files on my website than have Google index them and let anyone Google Dork me. Having said that, not revealing sensitive directories in robots.txt would improve security a little, although that could be considered security through obscurity, which is not a very effective control.
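My colleague's objection is easy to demonstrate. The sketch below shows how little effort it takes to turn a robots.txt file into a list of paths worth probing. The function name `disallowed_paths` and the sample file contents are hypothetical, and the parsing is deliberately simplified (it ignores which user agent each rule belongs to).

```python
def disallowed_paths(robots_txt: str) -> list[str]:
    """Collect every non-empty Disallow value from a robots.txt body."""
    paths = []
    for raw_line in robots_txt.splitlines():
        # Drop trailing comments, then split "Field: value" once.
        line = raw_line.split("#", 1)[0].strip()
        field, _, value = line.partition(":")
        if field.strip().lower() == "disallow" and value.strip():
            paths.append(value.strip())
    return paths


# Hypothetical robots.txt that lists exactly the folders it wants to hide.
SAMPLE = """\
User-agent: *
Disallow: /admin/      # management console
Disallow: /backups/
Disallow:
"""

if __name__ == "__main__":
    for path in disallowed_paths(SAMPLE):
        print("worth a look:", path)
```

Anything listed this way is effectively public knowledge, which is why the really sensitive locations are better left out of the file and protected by other means.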

The solution is to use robots.txt as a defence in depth measure. Take a look at the following layers recommended to protect a particular file:

Layer 1 – Access Control
Layer 2 – No Directory Listing
Layer 3 – Meta Robots
Layer 4 – Robots.txt

We start from the perimeter defence mechanism, robots.txt (Layer 4). This file should declare the areas of the site that you don’t want indexed, but not the really sensitive folders. To protect really sensitive pages, use the following declaration in the HTML page instead:

<meta name="robots" content="noindex, nofollow" />

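This serves two purposes: search engines will not index the page, and the attacker will not have a ready list of sensitive directories to attack. Note that a crawler can only obey this tag if it is allowed to fetch the page, so such pages should not also be disallowed in robots.txt. We then strengthen our security by adding no directory listing and access controls.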
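Because the meta robots tag is easy to mistype, it can be worth checking pages automatically. The following is a minimal sketch using Python's standard-library `html.parser`; the class and function names are assumptions for illustration. It relies on the fact that the directive names are single words, `noindex` and `nofollow`, so a value written with spaces will not be recognised.

```python
from html.parser import HTMLParser


class RobotsMetaParser(HTMLParser):
    """Collects the directives from any <meta name="robots"> tag."""

    def __init__(self):
        super().__init__()
        self.directives = set()

    def handle_starttag(self, tag, attrs):
        if tag != "meta":
            return
        attributes = dict(attrs)
        if (attributes.get("name") or "").lower() != "robots":
            return
        content = attributes.get("content") or ""
        # Directives are comma-separated; normalise case and whitespace.
        self.directives.update(
            part.strip().lower() for part in content.split(",") if part.strip()
        )


def blocks_indexing(html: str) -> bool:
    """Return True if the page carries a meta robots `noindex` directive."""
    parser = RobotsMetaParser()
    parser.feed(html)
    return "noindex" in parser.directives


if __name__ == "__main__":
    page = '<html><head><meta name="robots" content="noindex, nofollow" /></head></html>'
    print("indexing blocked:", blocks_indexing(page))
```

A check like this could be run against the sensitive pages of a site before release. It only confirms that the tag is present and well formed, not that any particular search engine honours it.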

In summary, robots.txt should be used to disallow folders that you don’t want indexed, and it should be used as a defence in depth measure. I know of a website that forgot to use a robots.txt file to restrict search engine robot access to their development site. As a consequence, the whole site got indexed, all 500,000 pages of it, in one night! This could result in being penalised for duplicate content, since the development site is the same as the real one, and possibly in a loss of PageRank. Additionally, legitimate visitors may search and get sent to the wrong site! Therefore my use of robots.txt serves two purposes: pages you don’t want in search results will not get indexed, and you are not providing an attacker with a list of really sensitive folders.

A robots.txt file is only good for stopping search engines from indexing your files; it will not stop an attacker who is targeting your site. Proper access control is king.
