Many of you may have seen the famous XKCD cartoon regarding password strength where Randall Munroe, the author, advocates creating a passphrase based on four random words and remembering it by creating a sentence containing the words. If you've already seen the cartoon, before actually clicking the link, let me give you the words in alphabetical order: battery, correct, horse, staple.
Do you remember the order of the words? As I write this I don't, even though I do remember the words. That means I should try (at most) 24 permutations to get the order right. Some systems block you at three wrong passwords! So no, Munroe's system is not ideal. As long as FIDO/Passkey or better yet, SQRL are not common place, we will be condemned to using passwords for authentication, hopefully as one of multiple factors. Apart from passwords that are retrieved from a password management system (which should be long random strings that the password manager remembers for you), a good password needs to comply to two requirements:
- it needs to be hard to find with a brute force attack (i.e., it needs enough entropy even with full knowledge of the password generating algorithm)
- it needs to be relatively easy to remember.
These two requirements tend to contradict each other. The more a password complies to one requirement, the more the other suffers. The sweet spot is generally somewhere in the middle. In that regard I appreciate Munroe's suggestion. It is a lot better than some random mutation on an existing word which IMO suffers on both requirements and it's a good method for sharing short lived passwords, e.g. when mailing an encrypted file and passing the password in a phone call (but then I tend to use six words). Still I think we can do a lot better.
A better password generating algorithm
The algorithm that I have used for decades is the following:
- Generate a (long enough) sentence that is easy to remember. You could e.g.
pick a sentence that references the system you are generating a password
for. So let's say you need a password for a mail service, then you could use
a sentence like:
"Heidi Stettner's dog kept barking at the postman so the notification program was named Biff".
- Now take the first letter of each word. The password then becomes "HSdkbatpstnpwnB". Not bad, but let's modify it (deterministically).
- Capitalize nouns and proper names (and the pronoun I) and keep everything else lower case, numbers are entered as numbers (not relevant in this case) and some words are converted to symbols (in this case we replace "at" with the at-sign). The password then finally becomes: "HSDkb@tPstNPwnB".
The password looks pretty random but can be typed error free by just remembering the sentence (a rough calculation of the entropy will be done in the next paragraph). This method has enabled me to remember multiple (currently more than 15) strong passwords at a time without the need to write them down and type them in error free. I do write down a hint, especially shortly after renewing a password. In the example above I could use a hint like "Berkeley student" (which miss Stettner was at the time). The word to symbol mapping that I use is in Table 1 below.
Estimating the password strength
This is all fine, but it's not actually random so the question is: how
much do we deviate from true randomness? This depends on the frequencies of leading
characters in (English) words. To roughly estimate these frequencies, I wrote a
and used it on a few books from Project Gutenberg (listed at the end). I also
used a list of nouns and a large list
of British and American names and
counted a word as a noun or name if it was on one of these lists. Of course
this is an estimation because the same word can be a noun in one context and
not in another (think e.g. of
"time flies like an arrow"
where "flies" can be a noun or a verb, depending on the semantics). By using
the symbols from step 3, the character set consists of 69 characters (a-z,
A-Z, 0-9, =, +, |, !, <, >, @). 69 characters yield 6.10 bits of entropy
per character in a uniform distribution, but my script maps each character to
its own entropy, based on the frequency of that charachter in the input text
entropy = 2log(1/frequency)
The total entropy of the password is the sum of all these entropy values of the individual characters (see the script output for the frequencies and calculated entropy per character). To wrap your head around the reasoning: suppose one character would be the first character of 50% of all words, then that character should only count for 1 bit of entropy because in each position of the password it would either be present or not with equal probabilities). Using the script output, the "mail" password above is calculated to have 84.1 bits of entropy, about 7.5 bits less than a 15 character full random password on the same character set.
To brute force a password with 84 bits of entropy which is stored as a simple (single iteration) SHA256 hash, you would need on average 283 SHA256 calculations. To compare, one of the fastest GPU's available, the NVidia GeForce RTX 4090, does SHA256 hashes at a rate of 21975.5 Megahashes per second. Brute forcing such a password with the RTX 4090 would take around 13.8 million years. Even using the fastest Bitcoin miner, the AntMiner S19 XP which can do 140 Terahashes per second, brute forcing would take nearly 2200 years on average. Note that it's a common practice to use many hash iterations on the password with the PBKDF2 key derivation or the iterations in the Unix crypt method (which defaults to 5000 iterations on a SHA2 based password). Using those methods scales the brute forcing time linearly. There are even better key derivation algorithms available (like the memory hard SCrypt and Argon2) but I think the point is made: even with a single SHA256 hash, such a password is secure against any reasonable brute forcing attack while it can still be easily remembered.
Estimation short cuts
Like I said before, the script does a rough estimation and cuts a few corners which might have an effect on the frequencies but I doubt the effect would measure to anything meaningful. First, if a "word" in one of the script inputs was not recognized as an English word or string of digits, it was not counted. This amounted to around 7.1% of my input (think of a "word" like "soon—he" in "likely to vacate it soon—he might have got" in Sense and Sensibility).
Another cut corner is that I did not take likely or unlikely
character sequences into account. These are based on the likeliness of the
sequence of the underlying words. Other words with the same starting characters
may have a different likeliness for their sequence. An example that comes to mind
is that the word "not" (which is mapped to "!") likely follows a word
like "is", "have/has" or "will".
Now let's assume all cut corners would amount to 2 bits fewer on the example password above, which is way more than I can possibly imagine, then divide the 2200 years and 13.8 million years by four. That would still be more than strong enough in my opinion, definately a lot stronger than "correct horse battery staple" and a lot easier to remember.
I hope this system benefits you to be a little safer on the internet, but where possible, use multi-factor authentication and use a password manager to generate random and different passwords for various services and reserve this method for those passwords that cannot or should not be included in the password manager (like the master password for the manager itself).
Table 1: word to symbol mapping
Note: "numeric words" like one, two and ten are written as their numeric values.
Inputs used for the script
The following books were retrieved from Project Gutenberg in text format and fed to the perl script to estimate the English leading character frequencies.
- Alice in Wonderland
- A Room With A View
- Little Women
- Moby Dick
- Sense and Sensibility
- The Enchanted April