Realtime Hadoop usage at Facebook

Last week, Dhruba Borthakur, Hadoop Engineer at Facebook, gave a talk about realtime Hadoop usage at Facebook, mainly for Facebook messages.

Hadoop is open source software for distributed computing. It consists of two main parts: the Hadoop Distributed File System (HDFS), a distributed file system designed for high throughput, and Hadoop MapReduce, a software framework for processing large data sets. On top of Hadoop sits HBase, a columnar key-value store.
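
To make the key-value model concrete, here is a minimal sketch of writing and reading a single cell with the HBase Java client. The table name "messages", column family "m", and row key are made up for illustration, and the code assumes a running cluster with that table already created.

    import org.apache.hadoop.conf.Configuration;
    import org.apache.hadoop.hbase.HBaseConfiguration;
    import org.apache.hadoop.hbase.TableName;
    import org.apache.hadoop.hbase.client.Connection;
    import org.apache.hadoop.hbase.client.ConnectionFactory;
    import org.apache.hadoop.hbase.client.Get;
    import org.apache.hadoop.hbase.client.Put;
    import org.apache.hadoop.hbase.client.Result;
    import org.apache.hadoop.hbase.client.Table;
    import org.apache.hadoop.hbase.util.Bytes;

    public class HBaseKeyValueExample {
        public static void main(String[] args) throws Exception {
            // Standard client configuration; table and family names are hypothetical.
            Configuration conf = HBaseConfiguration.create();
            try (Connection connection = ConnectionFactory.createConnection(conf);
                 Table table = connection.getTable(TableName.valueOf("messages"))) {

                // Write: row key -> column family:qualifier -> value
                Put put = new Put(Bytes.toBytes("user123-msg456"));
                put.addColumn(Bytes.toBytes("m"), Bytes.toBytes("body"),
                              Bytes.toBytes("Hello from HBase"));
                table.put(put);

                // Read the same cell back
                Get get = new Get(Bytes.toBytes("user123-msg456"));
                Result result = table.get(get);
                byte[] body = result.getValue(Bytes.toBytes("m"), Bytes.toBytes("body"));
                System.out.println(Bytes.toString(body));
            }
        }
    }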

To handle large amounts of data (i.e., Big Data), many companies use a front-end cache (e.g., memcached) or keep their entire database in memory. In addition, companies partition the data across many nodes for query parallelism and to mitigate resource contention. Borthakur said that in every 1K-node cluster, they see about 2-3 bad nodes every day.
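
The partitioning idea can be illustrated with a small sketch. This is not how Facebook routes data; it is just a generic hash-based sharding example that maps a row key to one of N shards using a stable hash, so the same key always lands on the same node.

    import java.nio.charset.StandardCharsets;
    import java.security.MessageDigest;
    import java.security.NoSuchAlgorithmException;

    public class ShardRouter {
        private final int numShards;

        public ShardRouter(int numShards) {
            this.numShards = numShards;
        }

        // Map a row key to one of numShards partitions by hashing the key.
        // MD5 gives a well-distributed, language-independent hash, so the
        // mapping stays consistent across processes and client languages.
        public int shardFor(String rowKey) {
            try {
                MessageDigest md5 = MessageDigest.getInstance("MD5");
                byte[] digest = md5.digest(rowKey.getBytes(StandardCharsets.UTF_8));
                // Use the first four bytes of the digest as an int.
                int hash = ((digest[0] & 0xFF) << 24) | ((digest[1] & 0xFF) << 16)
                         | ((digest[2] & 0xFF) << 8)  |  (digest[3] & 0xFF);
                return Math.floorMod(hash, numShards);
            } catch (NoSuchAlgorithmException e) {
                throw new AssertionError("MD5 is always available", e);
            }
        }

        public static void main(String[] args) {
            ShardRouter router = new ShardRouter(16);
            System.out.println(router.shardFor("user123")); // always the same shard
            System.out.println(router.shardFor("user456"));
        }
    }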

For Facebook messages the requirements were:

  • Massive data set, large subsets of cold data
  • Elasticity and high availability
  • Strong consistency within a data center
  • Fault isolation, quick recovery

They wanted a database that could store at least a terabyte of data, so they looked at Cassandra, HBase, and sharded MySQL. After some testing, Facebook picked HBase because it gave excellent write performance and good read performance. Borthakur also commented that network partitions between nodes within a single data center (the failure mode Cassandra is designed to tolerate) are very rare.

Some Facebook messages stats:

  • 6K messages a second
  • 50K chat messages a second
  • 300 TB of linear data growth per month (compressed); the compression ratio is about 1:4
  • Most data is chat messages

For Facebook, HBase is good for counters, such as finding out how many people in certain age groups are logged on or how many people clicked “Like” on a particular brand. They have more than 1 million counters on a 100-node HBase cluster, with reads mostly served from memory. Currently, they do not use tiered storage for messages but may be looking at it for photos with Hadoop.
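
HBase supports atomic counters natively, which is part of what makes it a good fit for this workload. Below is a hedged sketch using the Java client's incrementColumnValue call; the table name "counters", column family "c", and row key are hypothetical. The increment is applied server-side on the region server, so concurrent clients do not race on a read-modify-write cycle.

    import org.apache.hadoop.conf.Configuration;
    import org.apache.hadoop.hbase.HBaseConfiguration;
    import org.apache.hadoop.hbase.TableName;
    import org.apache.hadoop.hbase.client.Connection;
    import org.apache.hadoop.hbase.client.ConnectionFactory;
    import org.apache.hadoop.hbase.client.Table;
    import org.apache.hadoop.hbase.util.Bytes;

    public class LikeCounterExample {
        public static void main(String[] args) throws Exception {
            // Assumes a running cluster and an existing "counters" table
            // with column family "c"; both names are made up for this sketch.
            Configuration conf = HBaseConfiguration.create();
            try (Connection connection = ConnectionFactory.createConnection(conf);
                 Table table = connection.getTable(TableName.valueOf("counters"))) {

                // Atomically add 1 to the "likes" counter for a given brand page.
                long likes = table.incrementColumnValue(
                        Bytes.toBytes("brand:acme"),   // row key
                        Bytes.toBytes("c"),            // column family
                        Bytes.toBytes("likes"),        // qualifier
                        1L);                           // delta
                System.out.println("Likes so far: " + likes);
            }
        }
    }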

Frequency of Facebook items accessed (pre-timeline):

  • Photos and videos get a lot of reads for the first few days; after 60 days, not many
  • Last 30 days of comments are in the cache
  • Most people are interested in the last 10 days of messages; all 10 days’ worth are already loaded when you log in

As for the future of Hadoop at Facebook, Borthakur is looking at possibly migrating all user data and graph data from MySQL to HBase.

One thing to note: if you need a database with more reads than writes, HBase is probably not the database for you.

From Hadoop website:

The Apache Hadoop software library is a framework that allows for the distributed processing of large data sets across clusters of computers using a simple programming model. It is designed to scale up from single servers to thousands of machines, each offering local computation and storage. Rather than rely on hardware to deliver high availability, the library itself is designed to detect and handle failures at the application layer, so delivering a highly available service on top of a cluster of computers, each of which may be prone to failures.
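
The “simple programming model” mentioned above is MapReduce’s pair of map and reduce functions. The classic illustration is word count; the sketch below uses the standard org.apache.hadoop.mapreduce API and is a generic example, not code from the talk. It would be run with something like hadoop jar wordcount.jar WordCount <input> <output>.

    import java.io.IOException;

    import org.apache.hadoop.conf.Configuration;
    import org.apache.hadoop.fs.Path;
    import org.apache.hadoop.io.IntWritable;
    import org.apache.hadoop.io.LongWritable;
    import org.apache.hadoop.io.Text;
    import org.apache.hadoop.mapreduce.Job;
    import org.apache.hadoop.mapreduce.Mapper;
    import org.apache.hadoop.mapreduce.Reducer;
    import org.apache.hadoop.mapreduce.lib.input.FileInputFormat;
    import org.apache.hadoop.mapreduce.lib.output.FileOutputFormat;

    public class WordCount {

        // Mapper: for each input line, emit (word, 1) pairs.
        public static class TokenizerMapper
                extends Mapper<LongWritable, Text, Text, IntWritable> {
            private static final IntWritable ONE = new IntWritable(1);
            private final Text word = new Text();

            @Override
            protected void map(LongWritable key, Text value, Context context)
                    throws IOException, InterruptedException {
                for (String token : value.toString().split("\\s+")) {
                    if (!token.isEmpty()) {
                        word.set(token);
                        context.write(word, ONE);
                    }
                }
            }
        }

        // Reducer: sum the counts emitted for each word.
        public static class SumReducer
                extends Reducer<Text, IntWritable, Text, IntWritable> {
            @Override
            protected void reduce(Text key, Iterable<IntWritable> values, Context context)
                    throws IOException, InterruptedException {
                int sum = 0;
                for (IntWritable v : values) {
                    sum += v.get();
                }
                context.write(key, new IntWritable(sum));
            }
        }

        public static void main(String[] args) throws Exception {
            Configuration conf = new Configuration();
            Job job = Job.getInstance(conf, "word count");
            job.setJarByClass(WordCount.class);
            job.setMapperClass(TokenizerMapper.class);
            job.setCombinerClass(SumReducer.class);
            job.setReducerClass(SumReducer.class);
            job.setOutputKeyClass(Text.class);
            job.setOutputValueClass(IntWritable.class);
            FileInputFormat.addInputPath(job, new Path(args[0]));
            FileOutputFormat.setOutputPath(job, new Path(args[1]));
            System.exit(job.waitForCompletion(true) ? 0 : 1);
        }
    }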

About Dhruba Borthakur:

Dhruba Borthakur is the Project Lead for the open source Apache Hadoop Distributed File System. He has been associated with Hadoop almost since its inception, while working for Yahoo. He currently works for Facebook and has been instrumental in scaling Facebook’s Hadoop cluster to multiple petabytes. Dhruba is also a contributor to the open source Apache HBase project.

Earlier, he was a Senior Lead Engineer at Veritas Software (now acquired by Symantec), where he was responsible for the design and development of the Veritas SAN File System. He was the Team Lead for developing the Mendocino Continuous Data Protection Software Appliance at a startup named Mendocino Software. Prior to Mendocino Software, he was the Chief Architect at Oreceipt.com, an e-commerce startup based in Sunnyvale, California. Before that, he was a Senior Engineer at IBM-Transarc Labs, where he was responsible for the development of the Andrew File System (AFS).

Dhruba has an M.S. in Computer Science from the University of Wisconsin, Madison, and a B.S. in Computer Science from the Birla Institute of Technology and Science (BITS), Pilani, India. He has 17 issued patents. He hosts a Hadoop blog at http://hadoopblog.blogspot.com/ and can be followed via his profile page at https://www.facebook.com/dhruba
