TidyTuesday #2: Passwords

Data on Cracking (Bad) Passwords

A quick exploration, looking at password data from TidyTuesday.

Setup

library(tidyverse)
library(tidytuesdayR)
theme_set(theme_light())

plot_caption <- "zachbogart.com\nSource: tidytuesday"

The data needed a little cleaning. The help file says the strength of a password is between one and ten, but there are some outliers. There are also some NAs to remove.

tt <- tt_load("2020-01-14")
## 
##  Downloading file 1 of 1: `passwords.csv`
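passwords <- tt$passwords %>% 
  # keep only strengths in the documented 1-10 range (this also drops NA strengths)
  filter(between(strength, 1, 10)) %>% 
  # drop any remaining rows that contain an NA
  drop_na() %>% 
  mutate(length = str_length(password))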
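
Before filtering, it can help to look at which rows are about to be dropped. The following is a minimal sketch on a small made-up tibble (the `toy` data and its values are hypothetical, not taken from the TidyTuesday file), and it assumes `drop_na()` from tidyr is an acceptable way to remove rows with missing values.

```r
library(dplyr)
library(tidyr)

# Hypothetical toy data standing in for tt$passwords
toy <- tibble(
  password = c("password", "123456", "qwerty", NA),
  strength = c(8, 4, 48, NA)
)

# Rows that are missing a strength or fall outside the documented 1-10 range
flagged <- toy %>% 
  filter(is.na(strength) | !between(strength, 1, 10))

# Same cleaning idea as above: keep in-range strengths, then drop rows with NAs
cleaned <- toy %>% 
  filter(between(strength, 1, 10)) %>% 
  drop_na()

flagged
cleaned
```

Printing `flagged` first makes it easy to confirm that only outliers and empty rows are being removed.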

What are these categories of passwords?

Passwords are split into “categories”. The largest category is name, which covers 182 of the 455 passwords.

passwords %>% 
  count(category, sort=TRUE)
## # A tibble: 10 x 2
##    category                n
##    <chr>               <int>
##  1 name                  182
##  2 cool-macho             78
##  3 fluffy                 43
##  4 sport                  36
##  5 animal                 29
##  6 simple-alphanumeric    29
##  7 nerdy-pop              24
##  8 password-related       14
##  9 rebellious-rude        11
## 10 food                    9

Use Longer Passwords

Looks like longer passwords take longer to crack. Makes sense.

# what passwords take a while to crack?
passwords %>% 
  arrange(desc(offline_crack_sec))
## # A tibble: 455 x 10
##     rank password category value time_unit offline_crack_s… rank_alt strength
##    <dbl> <chr>    <chr>    <dbl> <chr>                <dbl>    <dbl>    <dbl>
##  1     1 password passwor…  6.91 years                 2.17        1        8
##  2     8 baseball sport     6.91 years                 2.17        8        4
##  3     9 football sport     6.91 years                 2.17        9        7
##  4    18 jennifer name      6.91 years                 2.17       18        9
##  5    22 superman name      6.91 years                 2.17       22       10
##  6    41 michelle name      6.91 years                 2.17       41        8
##  7    43 sunshine fluffy    6.91 years                 2.17       43        9
##  8    53 starwars nerdy-p…  6.91 years                 2.17       53        8
##  9    66 computer nerdy-p…  6.91 years                 2.17       66       10
## 10    74 corvette cool-ma…  6.91 years                 2.17       74        8
## # … with 445 more rows, and 2 more variables: font_size <dbl>, length <int>

Let’s look at how the average time to crack a password (offline) increases as password length increases.

# do longer passwords take longer to crack?
crack_time <- passwords %>% 
  filter(length <= 8) %>% 
  group_by(length) %>% 
  summarise(avg_offline_time = mean(offline_crack_sec),
            n = n())
## `summarise()` ungrouping output (override with `.groups` argument)
crack_time %>% 
  ggplot(aes(length, avg_offline_time)) +
    geom_point(color = "#50C779") +
    geom_line(color = "#50C779") +
    scale_y_log10() +
    labs(title = "Length Matters",
         subtitle = "Average Time to Crack (Bad) Passwords vs. Password Length",
         caption = plot_caption,
         x = "Password Length",
         y = "Average Time to Crack (seconds)")

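It’s exponential! Each additional character multiplies the number of candidates a brute-force attack has to try by the size of the character set, so the time to crack grows by a roughly constant factor per character. On the log-scaled y-axis, that kind of growth shows up as a roughly straight line.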
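
To make the "multiplying by a factor" idea concrete, here is a small arithmetic sketch. It assumes a brute-force attacker who tries every string over a fixed alphabet of 36 characters (lowercase letters plus digits). That alphabet size is an illustrative choice, not something stated in the dataset.

```r
# Assumption: every string over a fixed alphabet is tried (pure brute force)
alphabet_size <- 36   # 26 lowercase letters + 10 digits (illustrative choice)
lengths <- 4:8

search_space <- data.frame(
  length = lengths,
  candidates = alphabet_size^lengths
)

# How many times larger the search space is than for the previous length
search_space$growth_factor <- c(
  NA,
  search_space$candidates[-1] / head(search_space$candidates, -1)
)

search_space
```

The growth factor is the same at every step, which is what exponential growth means: a constant multiplier per added character rather than a constant increment.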

Do Numbers Matter?

I wondered whether passwords with more numbers are better passwords. It looks like most passwords in this dataset don’t have numbers. And no, using numbers doesn’t by itself mean you have a more secure password (the computer doesn’t care; you’re better off with an XKCD-style password made of several random words).

# do "better" passwords have numbers?
passwords %>% 
  mutate(number_count = factor(str_count(password, "[0-9]"))) %>%
  ggplot(aes(number_count, strength, color = number_count)) +
    geom_jitter() +
    labs(title = "Numbers Don't Help",
         subtitle = "Password Strength vs. Numbers Used in Password",
         caption = plot_caption,
         x = "How Many Numbers Used",
         y = "Password Strength",
         color = "How Many Numbers Used")

Learning

  • I had trouble setting one fixed color across multiple layers in a ggplot, and ended up repeating the color in each geom.
  • geom_jitter is useful for spreading out overlapping points, but it is probably most helpful when plotting two continuous values. With a discrete axis, you get bands of data.
  • If any of these passwords look familiar, go get a password manager!
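
For the first two points above, here is a minimal sketch of one possible approach, using made-up toy data (the data frames and the `line_color` variable are hypothetical names for illustration). It stores the color once and passes it outside `aes()` in each layer, and it limits jitter to the x direction so that y values are not moved.

```r
library(ggplot2)

# Hypothetical toy data, not the TidyTuesday passwords
toy <- data.frame(
  length = c(4, 5, 6, 7, 8),
  avg_time = c(0.001, 0.01, 0.1, 1, 10)
)

# Store the color once, then set it as a fixed value (outside aes()) per layer
line_color <- "#50C779"

p_fixed <- ggplot(toy, aes(length, avg_time)) +
  geom_point(color = line_color) +
  geom_line(color = line_color) +
  scale_y_log10()

# With a discrete x-axis, jitter only horizontally so strength values stay exact
toy_strength <- data.frame(
  number_count = factor(c(0, 0, 0, 1, 1, 2)),
  strength = c(7, 7, 8, 7, 4, 6)
)

p_jitter <- ggplot(toy_strength, aes(number_count, strength)) +
  geom_jitter(width = 0.2, height = 0, color = line_color)
```

Keeping the color in a variable means a later change only has to happen in one place. Setting `height = 0` is a design choice: the points still form bands per category, but the vertical positions remain the true values.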

Image Credit

jail by Zach Bogart from the Noun Project

Zach Bogart