
The Security Value of the Robots.txt file

The robots.txt file has some security value, but it also has flaws: "Disallow" entries can reveal hidden folders, password lists and database backups.

This is my view on the use of robots.txt as a security control and the problems of not having one.

In my penetration testing experience I have come across many websites that do not have a robots.txt file. Omitting the file could be justified as a security measure, since "Disallow" entries could reveal hidden folders.

In my view, not declaring what a web crawler can and cannot crawl is a bad security measure. We have all heard about database backups, password lists and other sensitive files being indexed by Google. Take a look at the following Google Dork:

inurl:/wp-content/uploads/ confidential

Using this simple Google Dork you can clearly see confidential files being indexed from WordPress blogs. A simple declaration such as the following would have prevented them from being indexed:

User-agent: *
Disallow: /wp-content/uploads/

To reflect on why robots.txt is a poor security control, a colleague of mine raised an objection: "surely if you declare all the confidential paths such as /admin on your site, then an attacker will have a nice and easy job finding them". My response was that attackers have been using Google to find confidential files for a long time, so search engines themselves pose a threat to the security of a website. I would rather force an attacker to spider the site themselves to find any sensitive files on my website than have Google index them and let anyone Google Dork me. Having said that, not revealing sensitive directories in robots.txt does improve security a little, although it amounts to security through obscurity – not a very effective control.

The solution is to use robots.txt as a defence-in-depth measure. Take a look at the following layers, recommended to protect a particular file:

Layer 1 – Access Control
Layer 2 – No Directory Listing
Layer 3 – Meta Robots
Layer 4 – Robots.txt

So we start from the perimeter defence mechanism, robots.txt. This file should be used to declare areas of the site that you don't want indexed, but not the really sensitive folders. To keep really sensitive pages out of the index, use the following declaration in the HTML page instead:

<meta name="robots" content="no index, no follow" />

The meta tag approach serves two purposes: search engines will not index the page, and the attacker will not get a ready list of sensitive directories from robots.txt. (Note that the page must not also be blocked in robots.txt, otherwise the crawler never sees the tag; this is exactly why the really sensitive paths stay out of robots.txt.) We then strengthen our security infrastructure by disabling directory listing and adding access controls.
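
Those inner layers are ordinary web server configuration. As a minimal sketch, assuming Apache 2.4 and an illustrative /admin directory (the paths and password file location are placeholders, not from the original article):

<Directory "/var/www/html/admin">
    # Layer 2 - no directory listing
    Options -Indexes
    # Layer 1 - access control via basic authentication
    AuthType Basic
    AuthName "Restricted area"
    AuthUserFile /etc/apache2/.htpasswd
    Require valid-user
</Directory>

With these in place, even a path that does leak through robots.txt or a search engine only leads an attacker to a login prompt.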

In summary, robots.txt should be used to disallow folders that you don't want indexed, as one layer of a defence-in-depth approach. I know of a website that forgot to use a robots.txt file to restrict search engine robot access to their development site. As a consequence the whole site got indexed – all 500,000 pages of it – in one night! This could result in being penalised for duplicate content, since the development site is identical to the live one, and possibly a loss of PageRank. Additionally, legitimate visitors may search and get sent to the wrong site! Used this way, robots.txt serves two purposes: the site's files will not get indexed, and you are not providing an attacker with a list of the really sensitive folders.
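
For a development site like that, two lines of robots.txt would have kept compliant crawlers away from the entire site (a minimal illustration; it deters well-behaved robots only, not attackers):

User-agent: *
Disallow: /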

The robots.txt file is only good for keeping search engines away from your files; it will not stop an attacker who is targeting your site. Proper access control is king.

