arXiv:2105.14170

Towards a Rigorous Statistical Analysis of Empirical Password Datasets

29 May 2021
Jeremiah Blocki
Peiyuan Liu
Abstract

In this paper we consider the following problem: given $N$ independent samples from an unknown distribution $\mathcal{P}$ over passwords $pwd_1, pwd_2, \ldots$, can we generate high-confidence upper/lower bounds on the guessing curve $\lambda_G \doteq \sum_{i=1}^G p_i$, where $p_i = \Pr[pwd_i]$ and the passwords are ordered such that $p_i \geq p_{i+1}$? Intuitively, $\lambda_G$ represents the probability that an attacker who knows the distribution $\mathcal{P}$ can guess a random password $pwd \leftarrow \mathcal{P}$ within $G$ guesses. Understanding how $\lambda_G$ increases with the number of guesses $G$ can help quantify the damage of a password cracking attack and inform password policies. Despite an abundance of large (breached) password datasets, upper/lower bounding $\lambda_G$ remains a challenging problem. We introduce several statistical techniques to derive tighter upper/lower bounds on the guessing curve $\lambda_G$ which hold with high confidence. We apply our techniques to analyze 9 large password datasets, finding that our new lower bounds dramatically improve upon prior work. Our empirical analysis shows that even state-of-the-art password cracking models are significantly less guess-efficient than an attacker who knows the distribution. When $G$ is not too large, i.e., $G \ll N$, we find that our upper/lower bounds on $\lambda_G$ are both very close to the empirical distribution, which justifies using the empirical distribution to approximate $\lambda_G$ in this regime. The analysis also highlights regions of the curve where we can, with high confidence, conclude that the empirical distribution significantly overestimates $\lambda_G$. Our new statistical techniques yield substantially tighter upper/lower bounds on $\lambda_G$, though there are still regions of the curve where the best upper/lower bounds diverge significantly.
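
To make the definition of the guessing curve concrete, the following minimal sketch computes the empirical guessing curve from a list of sampled passwords. This is the plain empirical estimate that the abstract compares against, not the paper's upper/lower bounding techniques. The function name `empirical_guessing_curve` and the toy sample list are assumptions made for illustration.

```python
from collections import Counter
from itertools import accumulate


def empirical_guessing_curve(samples):
    """Return the empirical estimates of lambda_1, ..., lambda_K.

    K is the number of distinct passwords in the sample. Entry G-1 is the
    fraction of samples covered by the G most frequent observed passwords.
    This is only the naive empirical estimate, not a confidence bound.
    """
    n = len(samples)
    if n == 0:
        raise ValueError("need at least one sample")
    # Observed frequencies, sorted so the most common password comes first,
    # mirroring the ordering p_i >= p_{i+1} in the definition.
    counts = sorted(Counter(samples).values(), reverse=True)
    # Running sum of counts divided by N gives the empirical lambda_G.
    return [c / n for c in accumulate(counts)]


# Hypothetical toy sample; real datasets contain millions of entries.
samples = ["123456", "password", "123456", "qwerty", "123456", "password"]
curve = empirical_guessing_curve(samples)
```

Note that the empirical curve always reaches 1 once $G$ equals the number of distinct passwords in the sample, even though unseen passwords may still carry probability mass under $\mathcal{P}$. This is one way to see why the empirical distribution can overestimate $\lambda_G$ when $G$ is not small relative to $N$.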
