The robots.txt file has some security value, but it also has flaws: ‘disallow’ entries can reveal hidden folders, while having no file at all can leave password lists and database backups open to indexing.
This is my view on the use of robots.txt as a security control and the problems of not having one.
In my penetration testing experience there have been many occurrences of websites not having a robots.txt file. This omission could be justified as a security measure, as having “disallow” entries could reveal hidden folders.
In my view, not declaring what files a web crawler can and can’t crawl is a bad security measure. For example, we have all heard about database backups, password lists and many more being indexed in Google. Take a look at the following Google Dork:
Using this simple Google Dork you can clearly see confidential files being indexed from WordPress blogs. A simple declaration such as the following would have prevented this from being indexed:
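As a rough illustration of what such a declaration does, the sketch below feeds a two-line robots.txt to Python's standard urllib.robotparser module and asks whether a well-behaved crawler may fetch a given URL. The /wp-content/uploads/ path, the example.com URLs and the is_crawlable helper are assumptions chosen for illustration.

```python
from urllib.robotparser import RobotFileParser

# Hypothetical robots.txt: ask every crawler to stay out of the uploads folder.
ROBOTS_TXT = """\
User-agent: *
Disallow: /wp-content/uploads/
"""


def is_crawlable(robots_txt: str, url: str, agent: str = "*") -> bool:
    """Return True if a crawler that obeys robots_txt may fetch url."""
    parser = RobotFileParser()
    parser.parse(robots_txt.splitlines())
    return parser.can_fetch(agent, url)


if __name__ == "__main__":
    for url in (
        "https://example.com/wp-content/uploads/db-backup.sql",
        "https://example.com/index.php",
    ):
        print(url, is_crawlable(ROBOTS_TXT, url, "Googlebot"))
```

This only models crawlers that choose to obey robots.txt; the file itself stays publicly readable, and nothing here blocks a direct request for the backup.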
On the case against robots.txt as a security control, a colleague of mine raised this objection: “surely if you declare all the confidential paths such as /admin on your site, then an attacker will have a nice and easy job in finding them”. My reply was that attackers have been using Google to find confidential files for a long time, so search engines themselves can pose a threat to the security of a website. I would rather force an attacker to spider the site themselves to find any sensitive files I may have than let Google index those files and leave me open to anyone with a Google Dork. Having said that, leaving sensitive directories out of robots.txt would improve security a little, although that could be considered security through obscurity, which is not a very effective control.
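To make the colleague's objection concrete, here is a minimal Python sketch of how little effort it takes to harvest the “disallow” entries from a robots.txt file. The sample file contents and the disallowed_paths function are hypothetical, made up for illustration; an attacker would simply fetch /robots.txt from the target site.

```python
# Hypothetical robots.txt that lists sensitive folders outright.
SAMPLE_ROBOTS_TXT = """\
User-agent: *
Disallow: /admin
Disallow: /backups/   # nightly database dumps
Disallow:
Allow: /public/
"""


def disallowed_paths(robots_txt: str) -> list[str]:
    """Collect every non-empty Disallow path, ignoring comments."""
    paths = []
    for raw_line in robots_txt.splitlines():
        # Drop trailing comments, then split "Field: value" on the first colon.
        line = raw_line.split("#", 1)[0].strip()
        field, _, value = line.partition(":")
        value = value.strip()
        if field.strip().lower() == "disallow" and value:
            paths.append(value)
    return paths


if __name__ == "__main__":
    print(disallowed_paths(SAMPLE_ROBOTS_TXT))
```

Every path the function returns is a folder the site owner has flagged as interesting, which is why the really sensitive ones are better left out of the file.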
The solution is to use robots.txt as a defence in depth measure. Take a look at the following layers recommended to protect a particular file:
Layer 1 – Access Control
Layer 2 – No Directory Listing
Layer 3 – Meta Robots
Layer 4 – Robots.txt
So we start from the perimeter defence mechanism, robots.txt. This file should be used to declare areas of the site that you don’t want to get indexed, but not the really sensitive folders. To protect the really sensitive files, why not use a meta robots declaration with noindex in the HTML page itself?
This serves two purposes: search engines will not index the page, and the attacker will not have a ready list of sensitive directories to attack. We then strengthen our security by adding no directory listing and access controls.
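One way to check that a sensitive page really carries a meta robots declaration is a small Python sketch like the one below, built on the standard html.parser module. The sample page, the "noindex, nofollow" content value and the has_noindex helper are assumptions for illustration; the tag shown is the commonly used form of the declaration.

```python
from html.parser import HTMLParser

# Hypothetical sensitive page carrying a meta robots tag in its <head>.
SAMPLE_PAGE = """\
<html>
  <head>
    <meta name="robots" content="noindex, nofollow">
    <title>Internal report</title>
  </head>
  <body>Not for search engines.</body>
</html>
"""


class RobotsMetaParser(HTMLParser):
    """Collects the directives of every <meta name="robots"> tag."""

    def __init__(self):
        super().__init__()
        self.directives = set()

    def handle_starttag(self, tag, attrs):
        if tag != "meta":
            return
        attributes = dict(attrs)
        if (attributes.get("name") or "").lower() == "robots":
            content = attributes.get("content") or ""
            for directive in content.split(","):
                if directive.strip():
                    self.directives.add(directive.strip().lower())


def has_noindex(html: str) -> bool:
    """Return True if the page asks search engines not to index it."""
    parser = RobotsMetaParser()
    parser.feed(html)
    return "noindex" in parser.directives


if __name__ == "__main__":
    print(has_noindex(SAMPLE_PAGE))
```

A crawler can only see this tag if it is allowed to fetch the page, which fits the advice above: keep the really sensitive paths out of robots.txt and mark the pages themselves instead.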
In summary, robots.txt should be used to disallow folders that you don’t want to get indexed, and it should be used as a defence in depth measure. I know of a website that forgot to use a robots.txt file to restrict search engine robot access to their development site. As a consequence the whole site got indexed, all 500,000 pages of it, in one night! This could result in the site being penalised for duplicate content, since the development site is the same as the real one, and maybe in a loss of PageRank. Additionally, legitimate visitors may search and get sent to the wrong site! Therefore my use of robots.txt serves two purposes: the site’s sensitive files will not get indexed, and you are not providing an attacker with a list of really sensitive folders.
A robots.txt file is only good for stopping search engines from indexing your files; it will not stop an attacker who is targeting your site. Proper access control is king.