
Azure Databricks Basics with Spark
Overview
This blog post provides an overview of Databricks, Azure Databricks, and Apache Spark fundamentals.
In this post you will learn about Databricks concepts (Workspace, Notebook, Cluster, Jobs, Scheduling, etc.), and the Spark fundamentals will cover the architecture and key features.
Agenda
- Apache Spark Fundamentals
- Azure Databricks
Prerequisites
- Understanding of basic Azure terminology
- Understanding of big data processing
Overview of Apache Spark Fundamentals
Before jumping into Spark, I would like to highlight some of the big data tools that were already around before Apache Spark.

If you have ever worked in the big data space, you may recognize the logos in the image above. These tools are very heterogeneous; to build any end-to-end pipeline with this big data tool set, you actually have to learn four or five tools.
The list below explains the purpose of each tool:
- Hive – To work with SQL
- Pig – For Scripting and ETL
- Mahout – For Machine Learning
- Storm – To work with Streaming data
- Flink – Again to work with Streaming
- Giraph – To work with Graphs
Basically, any specific task you want to perform can be done with one of these tools, but to build a complete big data pipeline you have to stitch together several of them.
Apache Spark
It's a unified computing engine and a set of libraries for parallel data processing on computer clusters.
Apache Spark can do everything shown in the picture above on its own.
Apache Spark Ecosystem

If you look at the image above, Apache Spark gives you libraries on top of the Spark Core API: Spark SQL, Structured Streaming, MLlib, and graph APIs (GraphX).
On top of those sits a set of languages (Scala, Python, Java, R) that help you write programs.
Apache Spark is just a computing engine, and most people run Spark on top of Hadoop HDFS, Cassandra, Amazon S3, Azure Blob storage, etc. as the storage layer.
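To make this concrete, here is a minimal PySpark sketch that reads a file from Azure Blob storage and queries it with Spark SQL. The storage path, container, and column names are illustrative assumptions, and it presumes access to the storage account (keys or a mount) has already been configured.

```python
from pyspark.sql import SparkSession

# Illustrative sketch: the wasbs path and column names below are placeholders,
# and storage access (account key or mount) is assumed to be configured already.
spark = SparkSession.builder.appName("ecosystem-demo").getOrCreate()

# Spark SQL / DataFrame API sitting on top of the core engine.
sales = spark.read.csv(
    "wasbs://container@account.blob.core.windows.net/sales.csv",
    header=True,
    inferSchema=True,
)

# Register the DataFrame as a temporary view and query it with SQL.
sales.createOrReplaceTempView("sales")
spark.sql("SELECT region, SUM(amount) AS total FROM sales GROUP BY region").show()
```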
Why Spark is Faster Than Hadoop

The reason Spark is faster than Hadoop is that, as the image above shows, in the Hadoop MapReduce model every mapper and reducer has to write its intermediate output to disk, and the stages communicate through those disk writes.
In Spark, most transformations happen in memory, and Spark can also cache data in memory.
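The short PySpark sketch below illustrates this: once a result is cached, a second action reuses the in-memory partitions instead of recomputing them. The data and column names are made up for illustration.

```python
from pyspark.sql import SparkSession

# Minimal caching sketch: the second action is served from memory instead of
# recomputing the lineage, which is the handoff MapReduce does via disk.
spark = SparkSession.builder.appName("cache-demo").getOrCreate()

events = spark.range(0, 10_000_000).withColumnRenamed("id", "event_id")
filtered = events.filter("event_id % 7 = 0").cache()  # mark for in-memory caching

print(filtered.count())  # first action computes the result and caches it
print(filtered.count())  # second action reuses the cached partitions
```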
Apache Spark APIs
RDD API – RDD stands for Resilient Distributed Dataset. It is a read-only, partitioned collection of records and the fundamental data structure of Spark. It allows a programmer to perform in-memory computations on large clusters in a fault-tolerant manner.
DataFrame API – Unlike an RDD, the data is organized into named columns, like a table in a relational database. It is an immutable, distributed collection of data. DataFrames allow developers to impose a structure onto a distributed collection of data, enabling higher-level abstraction.
Dataset API – Datasets in Apache Spark are an extension of the DataFrame API that provides a type-safe, object-oriented programming interface.
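A small sketch of the same data handled through the RDD and DataFrame APIs (the typed Dataset API is only available in Scala and Java, so it is not shown). The names and ages are made-up sample values.

```python
from pyspark.sql import SparkSession

spark = SparkSession.builder.appName("api-demo").getOrCreate()
sc = spark.sparkContext

# RDD API: a read-only, partitioned collection of plain Python objects.
rdd = sc.parallelize([("alice", 34), ("bob", 29), ("carol", 41)])
print(rdd.filter(lambda pair: pair[1] >= 30).collect())

# DataFrame API: the same records organized into named columns, so Spark's
# optimizer can work with the structure.
df = spark.createDataFrame(rdd, ["name", "age"])
df.filter(df.age >= 30).show()
```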
Spark Internal Architecture

- The driver is the heart of a Spark application; the SparkSession lives here.
- It is the gateway to the Spark cluster.
- The driver is in charge of analyzing the work, distributing it as tasks, and assigning them to the executors.
- An executor gets tasks from the driver, executes them, and reports the results back to the driver.
- Spark can also work with different cluster managers such as Hadoop YARN, Spark standalone, Apache Mesos, etc.; a minimal driver-program sketch follows this list.
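The sketch below shows how a driver program is wired up outside Databricks. The master URL is an assumption for illustration: "local[*]" runs everything in one process, while "yarn" or a "spark://host:7077" URL would hand the work to an external cluster manager. In a Databricks notebook, the `spark` session is already created for you.

```python
from pyspark.sql import SparkSession

# Minimal driver-program sketch; the master URL is illustrative.
spark = (
    SparkSession.builder
    .appName("architecture-demo")
    .master("local[*]")
    .getOrCreate()
)

# The driver analyzes this query and schedules it as tasks on the executors;
# the executors run the tasks and report their results back to the driver.
total = spark.range(0, 1000).selectExpr("sum(id) AS total").collect()[0]["total"]
print(total)

spark.stop()
```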
Azure Databricks
Before jumping into Azure Databricks, let's first understand what Databricks is, and then we will look at Azure Databricks.
Databricks
- Databricks was founded by the creators of Apache Spark and offers a unified platform designed to improve productivity for data engineers, data scientists and business analysts.
- This integration helps to ease the processes from data preparation to experimentation and machine learning application deployment.
- According to the company, the Databricks platform is a hundred times faster than the open source Apache Spark.
- The Databricks platform provides an interactive and collaborative notebook experience out of the box, and thanks to its optimized Spark runtime it frequently outperforms other big data SQL platforms in the cloud.
Azure Databricks
It's a managed Apache Spark platform optimized for Azure.
Apache Spark + Databricks + Enterprise cloud = Azure Databricks
- Azure Databricks is a fully managed service which provides powerful ETL, analytics, and machine learning capabilities. Unlike other vendors, it is a first-party service on Azure which integrates seamlessly with other Azure services such as Event Hubs and Cosmos DB.
- It's quick and simple to spin up clusters which can auto-scale and auto-terminate, so your cluster will shut down after a specified amount of time once your jobs have completed.
Why Azure Databricks

Azure Databricks Architecture

Azure Databricks Portal Interface
When you create an Azure Databricks workspace in the Azure portal and launch the workspace, it will look like the image below.
To create an Azure Databricks workspace and configure Databricks, you can follow the steps at the link below.
https://docs.azuredatabricks.net/getting-started/try-databricks.html

The image above might not be very clear, but I will explain all the important parts.
On the left-hand side of the image there are the Azure Databricks, Home, Workspace, Recents, Data, Clusters, Jobs, and Search options.
Workspace
A Workspace is an environment for accessing all of your Azure Databricks assets. A Workspace organizes notebooks, libraries, dashboards, and experiments into folders and provides access to data objects and computational resources. By default, the Workspace and all of its contents are available to all users, but each user also has a private home folder that is not shared. You can control who can view, edit, and run objects in the Workspace by enabling Workspace access control.
Notebook
A notebook is a web-based interface to a document that contains runnable code, visualizations, and narrative text. Notebooks are one interface for interacting with Azure Databricks.
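As a tiny illustrative notebook cell: in an Azure Databricks notebook the SparkSession is pre-created as `spark`, and `display()` is the notebook's built-in rich renderer (outside a notebook you would call `df.show()` instead). The expressions below are made up for illustration.

```python
# Runs as-is in a Databricks notebook cell, where `spark` already exists.
df = spark.range(0, 100).selectExpr("id", "id * 2 AS doubled")
display(df)  # notebook renderer: table view plus built-in charting
```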
Job
A job is a way of running a notebook or JAR either immediately or on a scheduled basis. You can create and run jobs using the UI, the CLI, and by invoking the Jobs API. Similarly, you can monitor job run results in the UI, using the CLI, by querying the API, and through email alerts.
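To make the "invoking the Jobs API" path concrete, here is a hedged sketch that triggers an existing job over REST. The workspace URL, personal access token, and job ID are placeholders, not values from this post; the same action is available from the UI and the Databricks CLI.

```python
import requests

# Placeholders: replace with your workspace URL, a personal access token,
# and the ID of a job that already exists in the workspace.
WORKSPACE_URL = "https://adb-1234567890123456.7.azuredatabricks.net"
TOKEN = "dapiXXXXXXXXXXXXXXXX"
JOB_ID = 123

resp = requests.post(
    f"{WORKSPACE_URL}/api/2.1/jobs/run-now",
    headers={"Authorization": f"Bearer {TOKEN}"},
    json={"job_id": JOB_ID},
)
resp.raise_for_status()
print("Triggered run:", resp.json()["run_id"])
```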
Cluster
Azure Databricks clusters provide a unified platform for various use cases such as running production ETL pipelines, streaming analytics, ad-hoc analytics, and machine learning.
Types of clusters:
- Interactive: Interactive clusters are used to analyze data collaboratively with interactive notebooks.
- Job: Job clusters are used to run fast and robust automated workloads using the UI or API.
Cluster Autostart
When a job assigned to an existing cluster is scheduled to run or you connect to a terminated cluster from a JDBC/ODBC interface, the cluster is automatically restarted.
Cluster auto-start allows you to configure clusters to auto-terminate without requiring manual intervention to restart them for scheduled jobs. Furthermore, you can schedule cluster initialization by scheduling a job to run on a terminated cluster.
Cluster Creation And Configuration
Clusters can be created and configured using the Databricks portal. The image below represents the creation and configuration page.
Databricks Runtime Version : You can choose from among many supported runtime versions when you create a cluster.
Autoscaling : When you create a cluster, you can either provide a fixed number of workers for the cluster or provide a minimum and maximum number of workers for the cluster.
When you provide a fixed size cluster, Azure Databricks ensures that your cluster has the specified number of workers.
When you provide a range for the number of workers, Databricks chooses the appropriate number of workers required to run your job. This is referred to as autoscaling.
Termination After : To save cluster resources, you can terminate a cluster. A terminated cluster cannot run notebooks or jobs, but its configuration is stored so that it can be reused at a later time. You can manually terminate a cluster or configure the cluster to automatically terminate after a specified period of inactivity. Azure Databricks records information whenever a cluster is terminated.
The worker type and driver type can be set based on the workload and requirements. Normally the driver type configuration is the same as the worker type.
In the advanced options, you can customize your cluster by setting environment variables, tagging the cluster (for example, for cost attribution), and configuring logging.
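The same settings can also be expressed through the Clusters REST API instead of the portal. The sketch below is hedged: the workspace URL, token, runtime version, VM size, tags, and environment variable are all placeholder values, and you would pick the runtime versions and node types offered in your own workspace.

```python
import requests

# Placeholders: replace with your workspace URL and a personal access token.
WORKSPACE_URL = "https://adb-1234567890123456.7.azuredatabricks.net"
TOKEN = "dapiXXXXXXXXXXXXXXXX"

cluster_spec = {
    "cluster_name": "demo-autoscaling-cluster",
    "spark_version": "7.3.x-scala2.12",            # Databricks Runtime version (example)
    "node_type_id": "Standard_DS3_v2",             # worker VM size (driver defaults to the same)
    "autoscale": {"min_workers": 2, "max_workers": 8},
    "autotermination_minutes": 30,                 # "Terminate after" period of inactivity
    "custom_tags": {"team": "data-eng"},           # tags, e.g. for cost attribution
    "spark_env_vars": {"PYSPARK_PYTHON": "/databricks/python3/bin/python3"},
}

resp = requests.post(
    f"{WORKSPACE_URL}/api/2.0/clusters/create",
    headers={"Authorization": f"Bearer {TOKEN}"},
    json=cluster_spec,
)
resp.raise_for_status()
print("Created cluster:", resp.json()["cluster_id"])
```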
That's all for now; I might write another blog post with a small demo on Azure Databricks.
Any feedback will be appreciated.
