Reposcanner is a Python script, inspired by truffleHog, that scans Git repositories for interesting strings such as API keys or hard-coded passwords. Sensitive information like this often gets committed in the earlier stages of the development process (or accidentally), and is generally removed before the application or source code is released. However, since Git keeps a history of all changes, we can go back through the commit history to recover information that has been removed from the latest version. The basic flow of reposcanner is as follows:
- Try to clone the repository if a remote URL is given
- Get the active branch (or all branches if the -a option is given)
- Create a diff for each commit in the selected branch(es)
- Ignore any known boring string patterns and file names/extensions
- Extract any long hexadecimal or base64 strings
- Calculate the entropy of these strings
- If the entropy is high enough, and it's not been seen before, store the string
- Output all strings that are found
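The extraction, entropy and de-duplication steps above can be sketched roughly as follows. This is a minimal illustration and not Reposcanner's actual code: the regular expressions, the minimum length of 20 characters, the entropy thresholds and all function names are assumptions chosen for the example, and it works on plain lines of text rather than real Git diffs.

```python
import math
import re

# Assumed patterns and thresholds, for illustration only.
# Reposcanner's real values may differ.
HEX_RE = re.compile(r"[0-9a-fA-F]{20,}")
B64_RE = re.compile(r"[A-Za-z0-9+/]{20,}={0,2}")
MIN_ENTROPY = {"hex": 3.0, "base64": 4.5}


def shannon_entropy(s):
    """Return the Shannon entropy of s in bits per character."""
    if not s:
        return 0.0
    counts = {}
    for ch in s:
        counts[ch] = counts.get(ch, 0) + 1
    total = len(s)
    return -sum((n / total) * math.log2(n / total) for n in counts.values())


def find_candidates(line):
    """Yield (kind, string) pairs for long hex or base64 runs in a line."""
    for match in HEX_RE.finditer(line):
        yield "hex", match.group()
    for match in B64_RE.finditer(line):
        yield "base64", match.group()


def scan_lines(lines, seen=None):
    """Return high-entropy strings not already in seen, in first-seen order."""
    seen = set() if seen is None else seen
    found = []
    for line in lines:
        for kind, candidate in find_candidates(line):
            if candidate in seen:
                continue  # already reported from an earlier line or commit
            if shannon_entropy(candidate) >= MIN_ENTROPY[kind]:
                seen.add(candidate)
                found.append(candidate)
    return found


if __name__ == "__main__":
    sample = [
        'token = "0123456789abcdef0123456789abcdef"',
        "padding = " + "a" * 40,
    ]
    for s in scan_lines(sample):
        print(s)
```

Passing the same `seen` set across every commit's diff is one way to make sure a string that appears in many commits is only stored once. Hex strings get a lower threshold here because their alphabet is smaller, so their entropy per character is lower than a base64 string of similar randomness.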
The hardest step is identifying "interesting" strings without ending up with too many false positives. Reposcanner has some known patterns of boring strings which it ignores, and you can tweak the minimum entropy to report if you're getting too many false positives (the current value was obtained through some trial and error). A possible future option might be to also search for interesting strings (such as "api_key = foo", or connection strings).
Unlike truffleHog, which shows entire diffs to give context, reposcanner has a much more concise output: it shows only the relevant line, along with the commit information, so that you can go and examine the commit yourself if the string looks interesting. This makes the output much more manageable, especially when scanning larger repositories.
Scanning some randomly selected repositories on GitHub turned up the kind of interesting strings you would expect, including:
- API keys for third-party services, in an employee financial bonus scheme
- Application and database passwords
- A SQL database backup
This is a serious risk for companies when internally developed projects are released to the public - developers are less likely to be careful with their commits to an internal project compared to a publicly available one, and this increases the likelihood of inappropriate files making their way into the version control system. It can also reflect badly on a company if you have unprofessional code, comments or commit messages - a message like "accidentally deleted database" doesn't tend to inspire confidence.
Going through the commit history and trying to sanitise is likely to be unfeasible, unless it's a trivially sized repository, so the approach that most organisations take is just to completely wipe the commit history - either by creating a fresh repo and copying the files into it, or destroying the entire history with a rebase. While this provides a degree of protection from inappropriate commit messages or data being leaked, it does also destroy the development history of the repository, which is very valuable to developers when trying to fix bugs, or to understand why certain decisions have been made in the development process. As always, it's the trade-off between security and convenience.
Besides destroying the repo history, the best thing that you can do to protect against these issues is to have secure development practices from the start, even for projects that you never anticipate releasing. This should include making sure that sensitive information is never committed into source control, and of course, trying to keep comments and commit messages (reasonably) professional.
The Reposcanner code is available on the Dionach GitHub at https://github.com/Dionach/reposcanner - pull requests are welcome as always.