Looking through the Ashley Madison email dataset, I noticed a surprising number of users registered with employee email addresses. Here’s a list of the top 20 Fortune 50 companies ordered by number of employee emails registered on Ashley Madison.

Company Email Domain # Of Emails
General Motors gm.com 361
IBM ibm.com 251
General Electric ge.com 217
Hewlett Packard hp.com 193
Ford ford.com 179
Wells Fargo wellsfargo.com 171
Proctor and Gamble pg.com 152
Boeing boeing.com 117
UPS ups.com 74
Bank of America baml.com 72
Citi Bank citi.com 60
PepsiCo pepsico.com 59
DOW dow.com 59
Walmart walmart.com 52
Metlife metlife.com 43
State Farm statefarm.com 37
AIG aig.com 26
Archer Daniels Midland adm.com 19
Kroger kroger.com 14
CVS cvs.com 12

The overwhelming majority of users did not register with their employee email addresses. The overall distribution of email address domains is shown in the pie chart below:

ashley-madison-email-domain-chart

Here’s the script used to sift through the dump:

# write domain counts to file after every 1,000,000 emails read
increment_to_save = 1000000

def write_to_file(file_name, domains):
    # sort the domains by count
    from operator import itemgetter
    domains = sorted(domains, key=itemgetter('count'))

    # write the domains to the results file
    results_file = open(file_name, "w")
    domains = reversed(domains)
    for domain in domains:
        results_file.write(domain["domain"] + "," + str(domain["count"]) + "\n")
    results_file.close()


emails_file = open("emails_dump.txt", "r")

# storage for domain counts
domains = list()

email_number = 1
for email in emails_file:
    # if not @ symbol in email, go onto next email
    if(email.find("@") != -1):
        # get the email's domain
        email_domain = email.split("@")[-1].rstrip('\n')
        # go through the existing domains checking for the email's domain
        domain_exists = False
        for domain in domains:
            if domain["domain"] == email_domain:
                domain["count"] += 1
                domain_exists = True
                break
        # if the domain doesn't already exist in the counts, then add it     
        if not domain_exists:
            new_domain = {"domain": "", "count": 1}
            new_domain["domain"] = email_domain
            domains.append(new_domain)
        # for every 1 million email addresses read, write the current domain counts to a file
        if ((email_number % increment_to_save) == 0) and (email_number > 0):
            write_to_file("domain_counts_" + str(int(email_number/increment_to_save)) + ".csv", domains)
    email_number+=1
emails_file.close()

# write the final domain counts to a file
write_to_file("final_domain_counts.csv", domains)