Apache Spark – fast big data processing
Apache Spark
Apache Spark is a lightning-fast cluster computing technology, designed for fast computation. It is based on Hadoop MapReduce and extends the MapReduce model to efficiently support more types of computations, including interactive queries and stream processing. The main feature of Spark is its in-memory cluster computing, which increases the processing speed of an application.
Spark is designed to cover a wide range of workloads such as batch applications, iterative algorithms, interactive queries and streaming. By supporting all these workloads in a single system, it reduces the management burden of maintaining separate tools.
Evolution of Apache Spark
Spark began in 2009 as one of Hadoop's sub-projects, developed in UC Berkeley's AMPLab by Matei Zaharia. It was open-sourced in 2010 under a BSD license, donated to the Apache Software Foundation in 2013, and became a top-level Apache project in February 2014.
Characteristics of Apache Spark
Apache Spark has the following characteristics.
Speed − Spark helps run an application on a Hadoop cluster up to 100 times faster in memory, and 10 times faster when running on disk. It achieves this by reducing the number of read/write operations to disk and storing intermediate processing data in memory (see the sketch after this list).
Supports multiple languages − Spark provides built-in APIs in Java, Scala, and Python, so you can write applications in different languages. Spark also comes with 80 high-level operators for interactive querying.
Advanced Analytics − Spark supports more than just 'map' and 'reduce'; it also supports SQL queries, streaming data, machine learning (ML), and graph algorithms.
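To make the in-memory point concrete, here is a minimal sketch in Scala showing intermediate data being cached in memory and reused across several operations. The application name, data values and operations are illustrative, not taken from the text above.

```scala
import org.apache.spark.sql.SparkSession

object CachingSketch {
  def main(args: Array[String]): Unit = {
    // Local SparkSession for illustration; on a real cluster the master is
    // supplied by the deployment (standalone, YARN, etc.).
    val spark = SparkSession.builder()
      .appName("CachingSketch")
      .master("local[*]")
      .getOrCreate()
    val sc = spark.sparkContext

    val numbers = sc.parallelize(1 to 1000000)   // illustrative data set
    val squares = numbers.map(n => n.toLong * n)
    squares.cache()                              // keep intermediate results in memory

    // Both actions below reuse the cached data instead of recomputing it from scratch.
    println(s"count = ${squares.count()}")
    println(s"sum   = ${squares.sum()}")

    spark.stop()
  }
}
```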
Spark Built on Hadoop
The following diagram shows three ways in which Spark can be built with Hadoop components.
There are three ways of Spark deployment, as described below (a small configuration sketch follows the list).
Standalone − Spark Standalone deployment means Spark occupies the place on top of HDFS (Hadoop Distributed File System) and space is allocated for HDFS explicitly. Here, Spark and MapReduce run side by side to cover all Spark jobs on the cluster.
Hadoop YARN − Hadoop YARN deployment means Spark simply runs on YARN without any pre-installation or root access required. It helps integrate Spark into the Hadoop ecosystem or Hadoop stack, and it allows other components to run on top of the stack.
Spark in MapReduce (SIMR) − Spark in MapReduce is used to launch Spark jobs in addition to standalone deployment. With SIMR, a user can start Spark and use its shell without any administrative access.
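As a rough illustration of how the deployment choice surfaces in application code, the sketch below shows the standard Spark master URL forms that select where a job runs. The host name is a placeholder, and in practice the master is usually supplied via spark-submit rather than hard-coded.

```scala
import org.apache.spark.sql.SparkSession

object DeploymentSketch {
  def main(args: Array[String]): Unit = {
    // Standard master URL forms (SIMR uses its own launcher instead):
    //   "local[*]"                  - run on a single machine, for development
    //   "spark://master-host:7077"  - Spark Standalone cluster (placeholder host)
    //   "yarn"                      - Hadoop YARN (cluster details come from the Hadoop config)
    val spark = SparkSession.builder()
      .appName("DeploymentSketch")
      .master("local[*]")   // swap for one of the cluster URLs above when deploying
      .getOrCreate()

    println(s"Running against master: ${spark.sparkContext.master}")
    spark.stop()
  }
}
```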
Components of Spark
Apache Spark Core
Spark Core is the underlying general execution engine for the Spark platform, on which all other functionality is built. It provides in-memory computing and the ability to reference datasets in external storage systems.
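A minimal sketch of the Spark Core RDD API, using an in-memory collection rather than an external storage system; the data and operations are illustrative.

```scala
import org.apache.spark.sql.SparkSession

object CoreSketch {
  def main(args: Array[String]): Unit = {
    val spark = SparkSession.builder().appName("CoreSketch").master("local[*]").getOrCreate()
    val sc = spark.sparkContext

    val numbers = sc.parallelize(1 to 100)      // distribute a local collection as an RDD
    val evens   = numbers.filter(_ % 2 == 0)    // lazy transformation
    val total   = evens.reduce(_ + _)           // action triggers the computation

    println(s"sum of evens = $total")
    spark.stop()
  }
}
```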
Spark SQL
Spark SQL is a component on top of Spark Core that introduces a new data abstraction called SchemaRDD, which provides support for structured and semi-structured data.
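In later Spark releases this structured abstraction is exposed as the DataFrame API, which the following sketch uses; the JSON file, view name, and column names are illustrative.

```scala
import org.apache.spark.sql.SparkSession

object SqlSketch {
  def main(args: Array[String]): Unit = {
    val spark = SparkSession.builder().appName("SqlSketch").master("local[*]").getOrCreate()

    // Hypothetical semi-structured input: one JSON object per line with "name" and "age" fields.
    val people = spark.read.json("people.json")
    people.createOrReplaceTempView("people")

    // Query the structured data with plain SQL.
    val adults = spark.sql("SELECT name, age FROM people WHERE age >= 18")
    adults.show()

    spark.stop()
  }
}
```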
Spark Streaming
Spark Streaming leverages Spark Core's fast scheduling capability to perform streaming analytics. It ingests data in mini-batches and performs RDD (Resilient Distributed Dataset) transformations on those mini-batches of data.
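A minimal sketch of the mini-batch model using the classic DStream API; the socket source (localhost:9999) and the 5-second batch interval are illustrative.

```scala
import org.apache.spark.SparkConf
import org.apache.spark.streaming.{Seconds, StreamingContext}

object StreamingSketch {
  def main(args: Array[String]): Unit = {
    val conf = new SparkConf().setAppName("StreamingSketch").setMaster("local[2]")
    val ssc  = new StreamingContext(conf, Seconds(5))   // one mini-batch every 5 seconds

    // Hypothetical text source; each mini-batch becomes an RDD of lines.
    val lines  = ssc.socketTextStream("localhost", 9999)
    val counts = lines.flatMap(_.split("\\s+"))
                      .map(word => (word, 1))
                      .reduceByKey(_ + _)               // RDD transformations per mini-batch

    counts.print()
    ssc.start()
    ssc.awaitTermination()
  }
}
```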
MLlib (Machine Learning Library)
MLlib is a distributed machine learning framework on top of Spark, made possible by the distributed memory-based Spark architecture. According to benchmarks done by the MLlib developers against Alternating Least Squares (ALS) implementations, Spark MLlib is nine times as fast as the Hadoop disk-based version of Apache Mahout (before Mahout gained a Spark interface).
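A minimal MLlib sketch using the ALS recommender mentioned in the benchmark; the ratings file, its user,item,rating layout, and the rank and iteration counts are illustrative.

```scala
import org.apache.spark.mllib.recommendation.{ALS, Rating}
import org.apache.spark.sql.SparkSession

object AlsSketch {
  def main(args: Array[String]): Unit = {
    val spark = SparkSession.builder().appName("AlsSketch").master("local[*]").getOrCreate()
    val sc = spark.sparkContext

    // Hypothetical ratings file: "userId,productId,rating" per line.
    val ratings = sc.textFile("ratings.csv").map { line =>
      val Array(user, product, rating) = line.split(",")
      Rating(user.toInt, product.toInt, rating.toDouble)
    }

    // Train a collaborative-filtering model with ALS (rank 10, 10 iterations).
    val model = ALS.train(ratings, 10, 10)

    // Recommend 5 products for user 1 (illustrative IDs).
    model.recommendProducts(1, 5).foreach(println)

    spark.stop()
  }
}
```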
GraphX
GraphX is a distributed graph-processing framework on top of Spark. It provides an API for expressing graph computation that can model user-defined graphs, and it also provides an optimized runtime for this abstraction.
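A minimal GraphX sketch of a user-defined graph; the vertices, edges, and PageRank tolerance are illustrative.

```scala
import org.apache.spark.graphx.{Edge, Graph}
import org.apache.spark.sql.SparkSession

object GraphSketch {
  def main(args: Array[String]): Unit = {
    val spark = SparkSession.builder().appName("GraphSketch").master("local[*]").getOrCreate()
    val sc = spark.sparkContext

    // A tiny user-defined graph: vertices carry names, edges carry a relation label.
    val vertices = sc.parallelize(Seq((1L, "Alice"), (2L, "Bob"), (3L, "Carol")))
    val edges    = sc.parallelize(Seq(Edge(1L, 2L, "follows"), Edge(2L, 3L, "follows")))
    val graph    = Graph(vertices, edges)

    // Run the built-in PageRank algorithm on the graph.
    val ranks = graph.pageRank(0.001).vertices
    ranks.collect().foreach { case (id, rank) => println(s"vertex $id -> $rank") }

    spark.stop()
  }
}
```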
Spark vs Hadoop
Listen in on any conversation about big data, and you’ll probably hear mention of Hadoop or Apache Spark. Here is a brief look at what they do and how they compare.
1. They do different things. Hadoop and Apache Spark are both big-data frameworks, but they don't really serve the same purposes. Hadoop is essentially a distributed data infrastructure: it distributes massive data collections across multiple nodes within a cluster of commodity servers, which means you don't need to buy and maintain expensive custom hardware. It also indexes and keeps track of that data, enabling big-data processing and analytics far more effectively than was possible previously. Spark, on the other hand, is a data-processing tool that operates on those distributed data collections; it doesn't do distributed storage.
2. You can use one without the other. Hadoop includes not only a storage component, known as the Hadoop Distributed File System, but also a processing component called MapReduce, so you don't need Spark to get your processing done. Conversely, you can also use Spark without Hadoop. Spark does not come with its own file management system, though, so it needs to be integrated with one: if not HDFS, then another cloud-based data platform. Spark was designed for Hadoop, however, so many agree they're better together.
3. Spark is faster. Spark is generally a lot faster than MapReduce because of the way it processes data. While MapReduce operates in steps, Spark operates on the whole data set in one fell swoop. “The MapReduce workflow looks like this: read data from the cluster, perform an operation, write results to the cluster, read updated data from the cluster, perform the next operation, write the next results to the cluster, etc.,” explained Kirk Borne, principal data scientist at Booz Allen Hamilton. Spark, on the other hand, completes the full data analytics operations in memory and in near real time: “Read data from the cluster, perform all of the necessary analytic operations, write results to the cluster, done,” Borne said. Spark can be as much as 10 times faster than MapReduce for batch processing and up to 100 times faster for in-memory analytics, he said.
4. You may not need Spark's speed. MapReduce's processing style can be just fine if your data operations and reporting requirements are mostly static and you can wait for batch-mode processing. But if you need to do analytics on streaming data, like from sensors on a factory floor, or have applications that require multiple operations, you probably want to go with Spark. Most machine learning algorithms, for example, require multiple operations (a small sketch of such a multi-pass job follows). Common applications for Spark include real-time marketing campaigns, online product recommendations, cybersecurity analytics and machine log monitoring.
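To illustrate why multi-operation workloads favor Spark's in-memory model, here is a small sketch that reads a data set once, caches it, and then runs several analytic passes over it without going back to disk. The input path and the statistics computed are illustrative.

```scala
import org.apache.spark.sql.SparkSession

object MultiPassSketch {
  def main(args: Array[String]): Unit = {
    val spark = SparkSession.builder().appName("MultiPassSketch").master("local[*]").getOrCreate()
    val sc = spark.sparkContext

    // Hypothetical input: one numeric measurement per line.
    val values = sc.textFile("measurements.txt").map(_.toDouble)
    values.cache()                               // read from storage once, keep in memory

    // Several analytic operations over the same cached data, with no re-reads from disk.
    val count = values.count()
    val mean  = values.sum() / count
    val max   = values.max()

    println(s"count = $count, mean = $mean, max = $max")
    spark.stop()
  }
}
```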