Tuesday, February 19, 2013

If only President Skroob had data visualization

I don't watch a lot of movies, but I did watch Spaceballs, and as a former system administrator I could certainly appreciate the jokes regarding a five-digit combination.

I've already talked about the large number of Syrian officials who used the sequential numbers "12345" as a numeric combination.

They aren't alone in using sequential combination numbers.

The Data Genetics website includes an analysis of numeric combinations. After pointing out that the analysis was only possible because so many people leave their password files unencrypted, the author went on to analyze the numeric passwords that people chose.

The analysis began with the four-digit numeric passwords. If numeric passwords were evenly distributed, then you would have a 1 in 10,000 chance of guessing a numeric password on the first try. But when the data was examined, it turned out that "1234" accounted for 10.713% of the passwords. That's not 1 in 10,000; that's roughly 1 in 10.
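For anyone who wants to check that arithmetic, here is a minimal sketch. The only inputs are the two figures quoted above (10,000 possible codes and the 10.713% share for "1234"); the variable names are my own.

```python
# Figures quoted above: 10,000 possible four-digit codes, and "1234"
# accounting for 10.713% of the observed passwords.
TOTAL_CODES = 10 ** 4
OBSERVED_SHARE_1234 = 0.10713

# Chance that a single guess is right if every code were equally popular.
uniform_share = 1 / TOTAL_CODES

# The observed share expressed as "1 in N" odds.
odds_one_in = 1 / OBSERVED_SHARE_1234

# How much better a guess of "1234" does than the uniform assumption predicts.
times_more_likely = OBSERVED_SHARE_1234 / uniform_share

print(f"Uniform chance: 1 in {TOTAL_CODES:,}")
print(f"Observed chance for 1234: about 1 in {odds_one_in:.1f}")
print(f"Roughly {times_more_likely:,.0f} times the uniform rate")
```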

[T]he more popular password selections dominate the frequency tables. The most popular PIN code of 1234 is more popular than the lowest 4,200 codes combined!

That's right, you might be able to crack over 10% of all codes with one guess! Expanding this, you could get 20% by using just five numbers!...

Statistically, one third of all codes can be guessed by trying just 61 distinct combinations!

The 50% cumulative chance threshold is passed at just 426 codes (far less than the 5,000 that a random uniform distribution would predict). Paranoid yet?
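The "cumulative chance threshold" idea is easy to sketch: sort the codes by popularity and count how many you have to try before their combined share passes a target. This is only my guess at the kind of calculation involved, not the Data Genetics author's code. The function name `guesses_needed` is hypothetical, and the sample counts are made up for illustration; they are NOT the real data.

```python
from collections import Counter

def guesses_needed(pin_counts, target_share):
    """Return how many of the most popular PINs must be tried before their
    combined share of all observed PINs reaches target_share (0.0 to 1.0)."""
    total = sum(pin_counts.values())
    covered = 0
    # most_common() yields (pin, count) pairs from most to least popular.
    for guesses, (_pin, count) in enumerate(pin_counts.most_common(), start=1):
        covered += count
        if covered / total >= target_share:
            return guesses
    return len(pin_counts)

# Made-up toy sample (100 observations), NOT the real data set.
sample = Counter({"1234": 50, "1111": 20, "0000": 10, "1212": 5,
                  "7777": 5, "1004": 4, "2580": 3, "8068": 3})

print(guesses_needed(sample, 0.50))
print(guesses_needed(sample, 0.75))
```

With evenly distributed codes, reaching 50% would take half of all codes. The more lopsided the counts, the fewer guesses it takes.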

Think about this the next time you use your mother's year of birth as a password. "No one knows when my mother was born," you may think. However, the mere fact that many people use birth or anniversary years as passwords means that you can have a great deal of success by guessing passwords that begin with the digits "19," followed by any two other digits. That's only 100 possibilities.

But sometimes when you want to attack a problem, it's best to visualize it.

I love pretty ways to graphically visualize data. Pictures really do paint thousands of words.

Another interesting way to visualize the PIN data is in this grid plot of the distribution. In this heatmap, the x-axis depicts the left two digits from [00] to [99] and the y-axis depicts the right two digits from [00] to [99]. The bottom left is 0000 and the top right is 9999.

Color is used to represent frequency. The higher frequency occurrences are yellow to white hot, and the lower frequency occurrences are red, through dark red to black.

Geek Note: The scaling is logarithmic.

Here is the heatmap that resulted.

Now some of the yellow stuff is fairly obvious. The diagonal line from lower left to upper right is for cases in which the same two digits are repeated: 0000, 0101, 0202, etc. The bright line on the left that gets brighter toward the top corresponds to numbers beginning with "19."
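To make the grid layout concrete, here is a rough sketch of how a PIN lands on the heatmap and why the diagonal and the "19" column fall where they do. It is my own reconstruction from the description, not the original plotting code. The helper names are hypothetical, the counts are invented, and the `+ 1` inside the logarithm is my assumption so that unseen codes map to zero.

```python
import math
from collections import Counter

def pin_to_cell(pin):
    """Map a four-digit PIN string to (x, y): x is the left two digits and
    y is the right two digits, so 0000 is bottom left and 9999 is top right."""
    return int(pin[:2]), int(pin[2:])

def build_grid(pin_counts):
    """Build a 100x100 grid of log10(count + 1) values, indexed grid[y][x].
    The + 1 is an assumption so that codes with no observations map to 0."""
    grid = [[0.0] * 100 for _ in range(100)]
    for pin, count in pin_counts.items():
        x, y = pin_to_cell(pin)
        grid[y][x] = math.log10(count + 1)
    return grid

# Invented counts for illustration only, NOT the real data.
sample = Counter({"1234": 999, "1984": 99, "0101": 9})
grid = build_grid(sample)

for pin in sample:
    x, y = pin_to_cell(pin)
    notes = []
    if x == y:
        notes.append("on the diagonal (repeated pair)")
    if x == 19:
        notes.append("in the '19' column")
    print(pin, (x, y), round(grid[y][x], 2), ", ".join(notes))
```

The logarithm is what keeps one hugely popular code from washing out everything else: a count a hundred times larger is only two steps brighter.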

But why is the lower left area so bright? The author didn't realize the reason for this until after the post was originally published:

Since publishing this article, it's been brought to my attention that, of course, in addition to anniversary years, many people encapsulate dates in the format MMDD (such as birthdays …) for their PIN codes.

This clearly explains the lower left corner where, if you look at the heatmap, there is a huge contrast change at the height of around 30-31 (the number of days in a month), extending to 12 on the x-axis. (Thanks to zero79 for first pointing this out).
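A quick way to see how small that date region is: enumerate every four-digit code that is a valid MMDD date. This is just a sketch using Python's standard `calendar` module. The function name `mmdd_codes` is my own, and I'm assuming a leap year so that February 29 counts.

```python
import calendar

def mmdd_codes(year=2012):
    """Return every four-digit code that is a valid MMDD date.
    The default year is a leap year so that 0229 is included."""
    codes = []
    for month in range(1, 13):
        # monthrange returns (weekday of the 1st, number of days in the month).
        days_in_month = calendar.monthrange(year, month)[1]
        for day in range(1, days_in_month + 1):
            codes.append(f"{month:02d}{day:02d}")
    return codes

codes = mmdd_codes()
print(len(codes), "of 10,000 codes are valid MMDD dates")
print("largest month:", max(int(c[:2]) for c in codes))
print("largest day:", max(int(c[2:]) for c in codes))
```

Every one of those codes sits at or below 12 on the x-axis and at or below 31 on the y-axis, which matches the bright block in the lower left corner.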

Other items are revealed in the visual data, and these suggest further avenues for study. A knowledge of telephone keypads explains the significance of 2580. A knowledge of Korean explains the significance of 1004.

This comes from Korean speakers. When spoken, "1004" is cheonsa (cheon = 1000, sa=4).

"Cheonsa" also happens to be the Korean word for Angel.
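As for 2580, the keypad explanation takes about three lines to verify. This sketch assumes the standard telephone keypad layout; the `column` helper is hypothetical.

```python
# Standard telephone keypad layout, top row to bottom row.
KEYPAD = ["123", "456", "789", "*0#"]

def column(index):
    """Read one keypad column from top to bottom."""
    return "".join(row[index] for row in KEYPAD)

print(column(1))  # middle column, prints 2580
```

In other words, 2580 is what you get by running a finger straight down the middle of the keypad.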

And this just covers the four-digit passwords. The password data included information for passwords of other lengths. For five-digit passwords, 22.8% of all people selected the same password used by President Skroob. And for seven-digit passwords, 0.465% of all people chose Jenny's telephone number of 867-5309 - and the only reason that percentage is so low is that many of us use our OWN telephone numbers.

And you learn something new every day. The 20th most popular five-digit password was "42069." I understood 40% of this password, but had to go to the Urban Dictionary to learn about the other 60%.

But is the problem solved by creating a "harder" password? According to XKCD, not necessarily.

Through 20 years of effort, we've successfully trained everyone to use passwords that are hard for humans to remember, but easy for computers to guess.