NY Times and big data

The NY Times has been one of the most popular newspapers since its first publication in 1851. Over a century and a half, it has built a proud history. It has won 112 Pulitzer Prizes, more than any other news organization[1]. The printed newspapers of the past 150 years are not only the history of the NY Times but also a part of world history.

In the 1990s, the Internet rapidly took root in society. Media was one of the areas that experienced the most drastic changes. The Internet lowered the barrier to newspaper publishing. Many online news organizations were founded, and all of them became competitors of an established company, the NY Times.

The NY Times decided to show its old newspapers on a web site named “TimesMachine”[2]. Its strongest and inimitable competitive advantage over upstart competitors was its long and proud history. It had about 150 years of old newspapers, which had been digitized and stored in TIFF format. Newspapers stored as TIFF had to be converted to PDF before being served. Before the service officially launched, TimesMachine converted TIFF images into PDF dynamically whenever a user requested a specific newspaper.

However, given the strong possibility that traffic would increase after launch, converting all the old TIFF newspapers to PDF in advance was much more efficient than converting them dynamically. The NY Times used a web storage service, Amazon S3[3], and a cloud computing service, Amazon EC2[4], for the conversion job[5]. It uploaded 4 TB of source data to S3 and wrote code that read the source data, converted it to PDF, and stored the results back in S3, running on 100 EC2 instances. It also used Hadoop[6], an open-source implementation of MapReduce. The job successfully converted 11 million articles in under 24 hours.
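
The shape of such a job can be illustrated with a minimal sketch. This is not the NY Times code: a plain dict stands in for an S3 bucket, `fake_tiff_to_pdf` is a hypothetical placeholder for a real TIFF-to-PDF converter, and a thread pool stands in for the Hadoop workers. It only shows the map-only pattern of reading each source object, converting it, and writing the result back independently of all the others.

```python
from concurrent.futures import ThreadPoolExecutor


def fake_tiff_to_pdf(tiff_bytes: bytes) -> bytes:
    # Hypothetical placeholder: a real job would call an imaging library here.
    return b"%PDF-placeholder:" + tiff_bytes


def convert_one(key: str, source: dict, dest: dict) -> str:
    """Map step: read one source object, convert it, store the result."""
    pdf_key = key.rsplit(".", 1)[0] + ".pdf"
    dest[pdf_key] = fake_tiff_to_pdf(source[key])
    return pdf_key


def convert_all(source: dict, dest: dict, workers: int = 4) -> list:
    """Run the map step over every key; each item is independent,
    so no reduce step is needed."""
    with ThreadPoolExecutor(max_workers=workers) as pool:
        return sorted(pool.map(lambda k: convert_one(k, source, dest), source))


# Dicts standing in for the source and destination buckets (made-up keys).
source_bucket = {
    "1851/09/18/page1.tif": b"tiff-data-1",
    "1851/09/19/page1.tif": b"tiff-data-2",
}
dest_bucket = {}
converted = convert_all(source_bucket, dest_bucket)
```

Because every conversion is independent of the others, the work can be spread across as many machines as are available, which is why adding instances shortens the job.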

The NY Times was not the first to serve old newspapers online, but its content, 150 years of the world’s most popular newspaper, has a strong impact. Today, nytimes.com is one of America’s most popular news sites, and more than 44 million unique visitors visit the site[7].

The NY Times deals not only with old big data, its archive of news articles, but also with new big data generated by readers. It serves tens of millions of unique visitors every month, and the number goes higher at election time. During vote counting in November 2010, nytimes.com handled thousands of requests per second at a total cost of only a few hundred dollars[8].

The site posted 184 separate pages covering election results, each of which needed to be updated every few minutes from an Associated Press news feed. The data was stored in a MySQL database running on Amazon’s RDS (Relational Database Service). The web apps themselves ran on Amazon’s EC2 (Elastic Compute Cloud). The team set up a pool of four application servers behind a bank of Amazon servers running Apache that assembled pages from the latest data and periodically uploaded them.
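
The idea of assembling pages from the latest data and uploading them periodically can be sketched as follows. This is an illustration under stated assumptions, not the team's implementation: the race name, candidate names, and vote counts are made up, a dict stands in for the place the finished pages are uploaded to, and `render_page` and `publish_all` are hypothetical helpers.

```python
from html import escape


def render_page(race: str, results: dict) -> str:
    """Assemble one static results page from the latest vote counts,
    listing the leading candidate first."""
    rows = "".join(
        f"<tr><td>{escape(name)}</td><td>{votes}</td></tr>"
        for name, votes in sorted(results.items(), key=lambda kv: -kv[1])
    )
    return f"<h1>{escape(race)}</h1><table>{rows}</table>"


def publish_all(latest: dict, static_store: dict) -> int:
    """Re-render every race and overwrite its page; return pages written."""
    for race, results in latest.items():
        static_store[f"results/{race}.html"] = render_page(race, results)
    return len(latest)


# Made-up data in place of the news feed.
latest_counts = {"senate-ny": {"Candidate B": 950, "Candidate A": 1200}}
store = {}
written = publish_all(latest_counts, store)
```

In a setup like this, `publish_all` would be run on a timer every few minutes, so readers are served prebuilt pages and the database is queried once per refresh rather than once per request.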

In conclusion, the NY Times handles big data well. It successfully adapted to a new environment, the Internet, by turning its greatest competitive advantage into an online service, TimesMachine. It is also creating a new competitive advantage by effectively handling another kind of big data, web traffic.




[1] http://articles.latimes.com/2012/apr/17/nation/la-na-pulitzers-20120417

[2] http://open.blogs.nytimes.com/2008/05/21/the-new-york-times-archives-amazon-web-services-timesmachine/?_php=true&_type=blogs&_r=0

[3] http://aws.amazon.com/ko/s3/

[4] http://aws.amazon.com/ko/ec2/

[5] http://open.blogs.nytimes.com/2007/11/01/self-service-prorated-super-computing-fun/

[6] http://hadoop.apache.org/

[7] http://www.poynter.org/latest-news/mediawire/160780/new-york-times-traffic-flat-since-paywall/

[8] http://www.pcworld.com/article/214375/how_the_new_york_times_covered_the_election_with_amazon.html
