Deploying Hadoop Using Docker Containers: What Works and What Doesn’t.
A special meetup brought to you by Big Data Cloud with focus on the DevOps side of the world of Big Data.
6:00 – 6:30 pm:
Registration and Snacks.
6:30 – 6:35 pm:
Opening remarks and housekeeping.
6:35 – 7:15 pm:
7:15 – 7:30 pm:
7:30 – 8:30 pm:
Deploying Hadoop Using Docker Containers: What Works and What Doesn’t
Presenter: Nasser Manesh, Altiscale, Inc.
In this presentation, Nasser will share his experiences and lessons learned in deploying multi-tenant Hadoop clusters on top of Docker containers.
8:30 – 9:00 pm:
Q&A and Networking
Deploying Hadoop Using Docker Containers: What Works and What Doesn’t.
While Docker is popular as a development tool, not many production systems are using it today for virtualization. As a result, operating Docker at scale is a non-trivial task, especially in the context of Big Data deployments.
This presentation will give the audience a foundation for understanding the issues involved in using Docker for Hadoop deployment, along with patterns to follow and pitfalls to avoid, especially from the operations and DevOps point of view.
To run large-scale multi-tenant Hadoop clusters, certain challenges in resource control and security need to be addressed. Nasser’s company uses Docker on bare metal to partition machine resources without too much overhead, providing lightweight virtual machines as compute nodes for Hadoop.
This talk touches on some of the issues the team had to address in deploying and maintaining such a setup:
● Containers vs Virtual Machines
Why it makes sense to use containers, rather than virtual machines, to provide resource limits and isolation for Hadoop components.
● Key issues with containers
What can go wrong if one uses containers with Hadoop?
● Hadoop in Docker, or Docker in Hadoop?
Understanding the two main models to use Docker with Hadoop.
● Resource allocation and configuration management:
How to configure containers for a DataNode vs a NodeManager vs a NameNode, finding the optimal number of containers per machine, and sizing containers.
● Monitoring, Metrics, and Troubleshooting:
How to make Hadoop aware of the health of the Docker containers so that it can distribute jobs properly, and how to collect metrics for proper reporting on resource utilization.
● Disk and network access:
How to make sure that Docker can access only certain parts of the disk allocated to HDFS, and how to make containers leverage the high-bandwidth, high-throughput network infrastructure purpose-built for Hadoop.
About the Presenter:
Nasser Manesh, Senior Engineer, Infrastructure/Operations – Altiscale, Inc.
Nasser Manesh has 25+ years of experience in Unix, infrastructure, distributed systems, and backend operations in DevOps, team lead, and CTO roles. He has founded startups in the consumer Internet, mobile, photography, and art areas. Nasser is currently focused on Big Data infrastructure, Hadoop core (HDFS/YARN), Chef, Linux cgroups, and Docker at scale, and is a senior operations engineer at Altiscale, which provides Hadoop as a Service in the cloud.