Looking through the Ashley Madison email dataset, I noticed that a surprising number of users had registered with their work email addresses. Here are the 20 Fortune 50 companies with the most employee email addresses registered on Ashley Madison, ordered by count.
Company | Email Domain | # Of Emails |
---|---|---|
General Motors | gm.com | 361 |
IBM | ibm.com | 251 |
General Electric | ge.com | 217 |
Hewlett Packard | hp.com | 193 |
Ford | ford.com | 179 |
Wells Fargo | wellsfargo.com | 171 |
Procter & Gamble | pg.com | 152 |
Boeing | boeing.com | 117 |
UPS | ups.com | 74 |
Bank of America | baml.com | 72 |
Citibank | citi.com | 60 |
PepsiCo | pepsico.com | 59 |
Dow | dow.com | 59 |
Walmart | walmart.com | 52 |
MetLife | metlife.com | 43 |
State Farm | statefarm.com | 37 |
AIG | aig.com | 26 |
Archer Daniels Midland | adm.com | 19 |
Kroger | kroger.com | 14 |
CVS | cvs.com | 12 |
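
A table like this can be pulled out of a file of `domain,count` lines, which is the format the script further down writes. Here is a minimal sketch of that lookup step, not the code actually used for the table: `rank_companies` and `COMPANY_DOMAINS` are hypothetical names, the company list is a hand-picked subset, and the sample rows reuse three counts from the table plus one made-up placeholder domain.

```python
import csv

# Hypothetical, hand-picked subset of company domains to look up.
COMPANY_DOMAINS = {
    "General Motors": "gm.com",
    "IBM": "ibm.com",
    "General Electric": "ge.com",
}


def rank_companies(csv_lines, company_domains):
    """Return (company, domain, count) tuples, highest count first.

    csv_lines is any iterable of "domain,count" strings, e.g. an open file.
    """
    counts = {}
    for row in csv.reader(csv_lines):
        # Skip blank or malformed lines instead of failing on them.
        if len(row) != 2 or not row[1].isdigit():
            continue
        counts[row[0]] = int(row[1])
    ranked = [
        (company, domain, counts.get(domain, 0))
        for company, domain in company_domains.items()
    ]
    return sorted(ranked, key=lambda item: item[2], reverse=True)


if __name__ == "__main__":
    # "example.org,5" is a made-up row; the other counts come from the table.
    sample = ["example.org,5", "ibm.com,251", "gm.com,361", "ge.com,217"]
    for company, domain, count in rank_companies(sample, COMPANY_DOMAINS):
        print(f"{company} | {domain} | {count} |")
```

To run it against real output, pass an open file instead of the sample list, for example `rank_companies(open("final_domain_counts.csv"), COMPANY_DOMAINS)`. Domains missing from the file are reported with a count of 0.
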
The overwhelming majority of users did not register with their employee email addresses. The overall distribution of email address domains is shown in the pie chart below:
Here’s the script used to sift through the dump:
# write domain counts to file after every 1,000,000 emails read
increment_to_save = 1000000

def write_to_file(file_name, domains):
    # sort the domains by count
    from operator import itemgetter
    domains = sorted(domains, key=itemgetter('count'))
    # write the domains to the results file
    results_file = open(file_name, "w")
    domains = reversed(domains)
    for domain in domains:
        results_file.write(domain["domain"] + "," + str(domain["count"]) + "\n")
    results_file.close()

emails_file = open("emails_dump.txt", "r")
# storage for domain counts
domains = list()
email_number = 1
for email in emails_file:
    # if not @ symbol in email, go onto next email
    if(email.find("@") != -1):
        # get the email's domain
        email_domain = email.split("@")[-1].rstrip('\n')
        # go through the existing domains checking for the email's domain
        domain_exists = False
        for domain in domains:
            if domain["domain"] == email_domain:
                domain["count"] += 1
                domain_exists = True
                break
        # if the domain doesn't already exist in the counts, then add it
        if not domain_exists:
            new_domain = {"domain": "", "count": 1}
            new_domain["domain"] = email_domain
            domains.append(new_domain)
    # for every 1 million email addresses read, write the current domain counts to a file
    if ((email_number % increment_to_save) == 0) and (email_number > 0):
        write_to_file("domain_counts_" + str(int(email_number/increment_to_save)) + ".csv", domains)
    email_number+=1
emails_file.close()
# write the final domain counts to a file
write_to_file("final_domain_counts.csv", domains)
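
The script above scans the whole list of known domains for every email, which gets slow as the number of distinct domains grows. A possible alternative, sketched here with `collections.Counter` rather than taken from the original analysis, keeps the same rule (skip lines without an `@`, take everything after the last `@`) but does a dictionary lookup per email. The names `count_domains` and `write_counts` are hypothetical, and the sample addresses are made up.

```python
from collections import Counter


def count_domains(lines):
    """Count email domains in an iterable of lines, one address per line."""
    counts = Counter()
    for line in lines:
        address = line.rstrip("\n")
        if "@" not in address:
            continue  # same rule as the script: skip lines without an @
        # Everything after the last @ is treated as the domain.
        counts[address.split("@")[-1]] += 1
    return counts


def write_counts(file_name, counts):
    """Write "domain,count" lines, highest count first."""
    with open(file_name, "w") as results_file:
        for domain, count in counts.most_common():
            results_file.write(f"{domain},{count}\n")


if __name__ == "__main__":
    # Made-up addresses for illustration only.
    sample = ["a@gm.com\n", "b@ibm.com\n", "c@gm.com\n", "no-at-sign\n"]
    counts = count_domains(sample)
    print(counts.most_common())  # [('gm.com', 2), ('ibm.com', 1)]
```

For the real dump, `count_domains(open("emails_dump.txt"))` followed by `write_counts("final_domain_counts.csv", counts)` should give the same counts as the original script, though domains with equal counts may appear in a different order. Like the original, this sketch does not lowercase domains or strip `\r`, so `GM.com` and `gm.com` would be counted separately.
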