Enron Corpus

The Enron Corpus is a large database of over 600,000 emails generated by 158 employees of the Enron Corporation and acquired by the Federal Energy Regulatory Commission during its investigation after the company's collapse.

History

The Enron data was originally collected at Enron Corporation headquarters in Houston during two weeks in May 2002 by Joe Bartling, a litigation support and data analysis contractor working for Aspen Systems (now Lockheed Martin), whom the Federal Energy Regulatory Commission (FERC) had hired to preserve and collect the vast amounts of data in the wake of the Enron bankruptcy in December 2001. In addition to the Enron employee emails, all of Enron's enterprise database systems, hosted in Oracle databases on Sun Microsystems servers, were captured and preserved, including its online energy trading platform, EnronOnline.

Once collected, the Enron emails were processed and hosted on the litigation platform Concordance, and then iCONECT, for review by investigators from the Federal Energy Regulatory Commission, the Commodity Futures Trading Commission, and the Department of Justice. At the conclusion of the investigation, and upon the issuance of the FERC staff report, the emails and information collected were deemed to be in the public domain, to be used for historical research and academic purposes. The email archive was made publicly available and searchable via the web using iCONECT 24/7, but the sheer volume of email, over 160 GB, made it impractical to use. Copies of the collected emails and databases were made available on hard drives.

A copy of the email database was subsequently purchased for $10,000 by Andrew McCallum, a computer scientist at the University of Massachusetts Amherst. He released this copy to researchers, providing a trove of data that has been used for studies on social networking and computer analysis of language.

Legacy

The corpus is one of the few publicly available mass collections of real emails open to study; such collections are typically bound by numerous privacy and legal restrictions that render them prohibitively difficult to access. In 2004, Jitesh Shetty and Jafar Adibi from the University of Southern California processed the corpus, released a MySQL version of it, and published link analysis results based on it. In 2010, EDRM.net published a revised version 2 of the corpus. This expanded corpus, containing over 1.7 million messages, is available on Amazon S3 for easy access by the research community.

External links

Nuix data set cleansed of PII (requires registration)
Tutorial on data modeling with the Enron Corpus
Shetty and Adibi's Enron email dataset download on S3 (178 MB)
Nathan Heller: "What the Enron E-mails Say About Us", The New Yorker, July 24, 2017

Source of the article: Wikipedia

Monday, 25 December 2017