Apache Spark is a data processing framework that can quickly perform processing tasks on very large data sets, and can also distribute data processing tasks across multiple computers, either on its own or in tandem with other distributed computing tools. These two qualities are key to the worlds of big data and machine learning, which require marshaling massive computing power to crunch through large data stores. Spark also takes some of the programming burden of these tasks off the shoulders of developers with an easy-to-use API that abstracts away much of the grunt work of distributed computing and big data processing.
From its humble beginnings at the AMPLab at UC Berkeley in 2009, Apache Spark has become one of the key big data distributed processing frameworks in the world. Spark can be deployed in a variety of ways, provides native bindings for the Java, Scala, Python, and R programming languages, and supports SQL, streaming data, machine learning, and graph processing. You'll find it used by banks, telecommunications companies, gaming companies, governments, and all the major tech giants such as Apple, IBM, Meta, and Microsoft.
Spark RDD

At the heart of Apache Spark is the concept of the Resilient Distributed Dataset (RDD), a programming abstraction that represents an immutable collection of objects that can be split across a compute cluster. Operations on RDDs can also be split across the cluster and executed in a parallel batch process, leading to fast and scalable parallel processing. Apache Spark turns the user's data processing commands into a directed acyclic graph, or DAG. The DAG is Apache Spark's scheduling layer; it determines which tasks are executed on which nodes and in which sequence.
RDDs can be created from simple text files, SQL databases, NoSQL stores (such as Cassandra and MongoDB), Amazon S3 buckets, and much more. Much of the Spark Core API is built on this RDD concept, enabling traditional map and reduce functionality, but also providing built-in support for joining data sets, filtering, sampling, and aggregation.
Apache Spark runs in a distributed fashion by combining a central driver process that splits a Spark application into tasks and distributes them among many executor processes that do the work. These executors can be scaled up and down as required for the application's needs.
Spark SQL

Spark SQL has become more and more important to the Apache Spark project. It is the interface most commonly used by today's developers when creating applications. Spark SQL is focused on the processing of structured data, using a dataframe approach borrowed from R and Python (in Pandas). But as the name suggests, Spark SQL also provides a SQL2003-compliant interface for querying data, bringing the power of Apache Spark to analysts as well as developers.
Along with standard SQL support, Spark SQL provides a standard interface for reading from and writing to other data stores, including JSON, HDFS, Apache Hive, JDBC, Apache ORC, and Apache Parquet, all of which are supported out of the box. Other popular data stores (Apache Cassandra, MongoDB, Apache HBase, and many others) can be used by pulling in separate connectors from the Spark Packages ecosystem. Spark SQL also allows user-defined functions (UDFs) to be used transparently in SQL queries.
Selecting some columns from a dataframe is as simple as this line of code:

citiesDF.select("name", "pop")

Using the SQL interface, we register the dataframe as a temporary table, after which we can issue SQL queries against it:

citiesDF.createOrReplaceTempView("cities")
spark.sql("SELECT name, pop FROM cities")
Behind the scenes, Apache Spark uses a query optimizer called Catalyst that examines both data and queries in order to produce an efficient query plan for data locality and computation that performs the required calculations across the cluster. Since Apache Spark 2.x, the Spark SQL interface of dataframes and datasets (essentially a typed dataframe that can be checked at compile time for correctness and take advantage of further memory and compute optimizations at run time) has been the recommended approach for development. The RDD interface is still available, but it is recommended only when your needs cannot be addressed within the Spark SQL paradigm (such as when you must work at a lower level to wring every last drop of performance out of the system).
Spark MLlib and MLflow
Apache Spark also bundles libraries for applying machine learning and graph analysis techniques to data at scale. MLlib includes a framework for building machine learning pipelines, allowing for easy implementation of feature extraction, selection, and transformations on any structured dataset. MLlib comes with distributed implementations of clustering and classification algorithms such as k-means clustering and random forests that can be swapped in and out of custom pipelines with ease. Data scientists can train models in Apache Spark using R or Python, save them using MLlib, and then import them into a Java- or Scala-based pipeline for production use.
An open source platform for managing the machine learning lifecycle, MLflow is not technically part of the Apache Spark project, but it is likewise a product of Databricks and others in the Apache Spark community. The community has been working on integrating MLflow with Apache Spark to provide MLOps features such as experiment tracking, model registries, packaging, and UDFs that can be easily imported for inference at Apache Spark scale and with traditional SQL statements.
Structured Streaming

Structured Streaming is a high-level API that lets developers create infinite streaming dataframes and datasets. As of Spark 3.0, Structured Streaming is the recommended way to handle streaming data within Apache Spark, superseding the earlier Spark Streaming approach. Spark Streaming (now marked as a legacy component) was full of difficult pain points for developers, especially when dealing with event-time aggregations and late delivery of messages.
All queries in Structured Streaming go through the Catalyst query optimizer, and they can even be run interactively, allowing users to perform SQL queries against live streaming data. Late messages are handled by watermarking, and there are three supported types of windowing: tumbling windows, sliding windows, and variable-length session windows.
In Spark 3.1 and later, you can treat streams as tables and tables as streams. The ability to combine multiple streams with a wide range of SQL-like stream-to-stream joins creates powerful possibilities for ingestion and transformation. Here's a simple example of creating a table from a streaming source:

val df = spark.readStream
  .format("rate")
  .option("rowsPerSecond", 20)
  .load()

df.writeStream
  .option("checkpointLocation", "checkpointPath")
  .toTable("streamingTable")

spark.read.table("streamingTable").show()
By default, Structured Streaming uses a micro-batch scheme to handle streaming data. But in Spark 2.3, the Apache Spark team added a low-latency continuous processing mode to Structured Streaming, allowing it to handle responses with impressive latencies as low as 1ms and making it much more competitive with rivals such as Apache Flink and Apache Beam. Continuous processing restricts you to map-like and selection-like operations, and although it supports SQL queries against streams, it does not currently support SQL aggregations. In addition, although Spark 2.3 arrived in 2018, as of Spark 3.3.2 in March 2023, continuous processing is still marked as experimental.
Structured Streaming is the future of streaming applications on the Apache Spark platform, so if you're building a new streaming application, you should use Structured Streaming. The legacy Spark Streaming APIs will continue to be supported, but the project recommends porting over to Structured Streaming, as the new method makes writing and maintaining streaming code much more bearable.
Delta Lake

Like MLflow, Delta Lake is technically a separate project from Apache Spark. Over the past few years, however, Delta Lake has become an integral part of the Spark ecosystem, forming the core of what Databricks calls the Lakehouse Architecture. Delta Lake augments cloud-based data lakes with ACID transactions, unified query semantics for batch and stream processing, and schema enforcement, effectively eliminating the need for a separate data warehouse for BI users. Full audit history and scalability to handle exabytes of data are also part of the package.
And using the Delta Lake format (built on top of Parquet files) within Apache Spark is as simple as using the delta format:

df = spark.readStream.format("rate").load()

stream = (df.writeStream
    .format("delta")
    .option("checkpointLocation", "checkpointPath")
    .start("deltaTable"))

Pandas API on Spark
The industry standard for data manipulation and analysis in Python is the Pandas library. With Apache Spark 3.2, a new API was provided that allows a large proportion of the Pandas API to be used transparently with Spark. Now data scientists can simply replace their imports with import pyspark.pandas as pd and be somewhat confident that their code will continue to work, while also taking advantage of Apache Spark's multi-node execution. At the moment, around 80% of the Pandas API is covered, with a target of 90% coverage in future releases.
Running Apache Spark

At a fundamental level, an Apache Spark application consists of two main components: a driver, which converts the user's code into multiple tasks that can be distributed across worker nodes, and executors, which run on those worker nodes and execute the tasks assigned to them. Some form of cluster manager is necessary to mediate between the two.
Out of the box, Apache Spark can run in a standalone cluster mode that simply requires the Apache Spark framework and a Java virtual machine on each node in your cluster. However, you're more likely to want to take advantage of a more robust resource or cluster management system to handle allocating workers on demand for you.
In the enterprise, cluster management historically meant running on Hadoop YARN (YARN is how the Cloudera and Hortonworks distributions run Spark jobs), but as Hadoop has become less entrenched, more and more enterprises have turned to deploying Apache Spark on Kubernetes. This is reflected in the Apache Spark 3.x releases, which improve the integration with Kubernetes, including the ability to define pod templates for drivers and executors and to use custom schedulers such as Volcano.
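Whichever cluster manager you pick, applications typically reach it through the spark-submit script. This command sketch uses placeholder names throughout (the master URL, class, and JAR are assumptions, not from the article):

```shell
# Submit a packaged application to a standalone cluster manager; swap the
# --master URL for yarn or k8s://https://<api-server> as appropriate.
# All names below are placeholders.
spark-submit \
  --master spark://spark-master.example.com:7077 \
  --deploy-mode cluster \
  --class com.example.MyApp \
  --executor-memory 4g \
  myapp.jar
```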
If you're looking for a managed solution, Apache Spark offerings can be found on all three of the big clouds: Amazon EMR, Azure HDInsight, and Google Cloud Dataproc.
Databricks Lakehouse Platform
Databricks, the company that employs the creators of Apache Spark, has taken a different approach than many other companies founded on the open source products of the Big Data era. For many years, Databricks has offered a comprehensive managed cloud service that provides Apache Spark clusters, streaming support, integrated web-based notebook development, and proprietary optimized I/O performance over a standard Apache Spark distribution. This mixture of managed and professional services has turned Databricks into a behemoth in the Big Data arena, with a valuation estimated at $38 billion in 2021. The Databricks Lakehouse Platform is now available from all three major cloud providers and is becoming the de facto way that most people interact with Apache Spark.
Learn Apache Spark

Ready to dive in and learn Apache Spark? We recommend starting with the Databricks learning portal, which will provide a good introduction to the framework, although it will be slightly biased toward the Databricks platform. For diving deeper, we'd suggest the Spark Workshop, which is a thorough tour of Apache Spark's features through a Scala lens. Some excellent books are available too. Spark: The Definitive Guide is a wonderful introduction written by two maintainers of Apache Spark. And High Performance Spark is an essential guide to processing data with Apache Spark at massive scale in a performant way. Happy learning!