


Apache Spark Interview Questions


In this post we will look at Apache Spark interview questions. Examples are provided with explanations.


  1. What is Apache Spark?
  2. What is the difference between Apache Spark and MapReduce?
  3. What are the chief elements of the Spark ecosystem?
  4. Explain how Spark runs applications with the help of its architecture
  5. What are the different cluster managers available in Apache Spark?
  6. What is the importance of Resilient Distributed Datasets in Spark?
  7. What is a lazy evaluation in Spark?
  8. How do we monitor Apache Spark with Prometheus?
  9. How can you trigger automatic clean-ups in Spark to handle accumulated metadata?
  10. What are the advantages of Spark over MapReduce?
  11. How to create an RDD that can represent a matrix in Apache Spark? How to multiply two such RDDs?
  12. What is YARN?
  13. What is the distinction between an RDD and a DataFrame in Apache Spark? Can you transform one to the other?
  14. How do we concatenate two columns in an Apache Spark DataFrame? Is there any function in Spark SQL which we can utilize?
  15. How would you compute the total count of unique words in Spark?
  16. What is the role of accumulators in Spark?

What is Apache Spark?

Apache Spark is a unified analytics engine for processing large volumes of data. It can run workloads up to 100 times faster than Hadoop MapReduce and offers more than 80 high-level operators that make it easy to build parallel apps. Spark can run on Hadoop, Apache Mesos, Kubernetes, standalone, or in the cloud, and can access data from diverse sources.
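As a minimal sketch (assuming a local Spark installation; the application name and data are illustrative), a self-contained Scala application only needs a SparkSession and a few of those high-level operators:

import org.apache.spark.sql.SparkSession

object SparkHello {
  def main(args: Array[String]): Unit = {
    // The SparkSession is the entry point of a Spark application.
    val spark = SparkSession.builder()
      .appName("SparkHello")
      .master("local[*]") // run locally on all available cores
      .getOrCreate()

    // Distribute a small collection and chain high-level operators;
    // the work is executed in parallel across partitions.
    val evenSquares = spark.sparkContext
      .parallelize(1 to 10)
      .map(n => n * n)
      .filter(_ % 2 == 0)

    println(evenSquares.collect().mkString(", "))
    spark.stop()
  }
}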

What is the difference between Apache Spark and MapReduce?


MapReduce writes intermediate results to disk after each map and reduce step, while Spark keeps data in memory wherever possible and is therefore significantly faster for most workloads. Spark also bundles batch, streaming, SQL, and machine-learning APIs on a single engine, whereas MapReduce supports only batch processing; the advantages are detailed further in the question on Spark's advantages over MapReduce below.
[Table: Apache Spark vs. MapReduce]

What are the chief elements of the Spark ecosystem?

Apache Spark has three principal sections that constitute its ecosystem. Those are:
  1. Language support: Spark integrates with several languages for application development and analytics, including Java, Python, Scala, and R.
  2. Core Components: Spark has five core components: Spark Core, Spark SQL, Spark Streaming, Spark MLlib, and GraphX.
  3. Cluster Management: Spark can run in three environments: a standalone cluster, Apache Mesos, and Hadoop YARN.

[Diagram: Spark ecosystem]

Explain how Spark runs applications with the help of its architecture.

Spark applications run as independent processes, coordinated by the SparkSession object in the driver program. The resource manager or cluster manager assigns tasks to the worker nodes, with one task per partition. Iterative algorithms apply operations to the data repeatedly, so they benefit from caching datasets across iterations. A task applies its unit of work to the dataset in its partition and outputs a new partition dataset. Finally, the results are sent back to the driver application or stored on disk.
[Diagram: Spark architecture]

What are the different cluster managers available in Apache Spark?

  1. Standalone Mode: By default, applications submitted to a standalone-mode cluster run in FIFO order, and each application will try to use all available nodes. We start a standalone cluster either by launching a master and workers manually or by using the provided launch scripts. It is also possible to run these daemons on a single machine for testing.
  2. Apache Mesos: Apache Mesos is an open-source project that manages computer clusters and can also run Hadoop applications. The advantages of deploying Spark with Mesos include dynamic partitioning between Spark and other frameworks, and scalable partitioning between multiple instances of Spark.
  3. Hadoop YARN: Apache YARN is the cluster resource manager of Hadoop 2. Spark can run on YARN as well.
  4. Kubernetes: Kubernetes is an open-source system for automating the deployment, scaling, and management of containerized applications.
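In practice the cluster manager is usually selected with the --master option of spark-submit; as a rough sketch (all host names and ports below are illustrative placeholders), the same choice can also be made programmatically when building the SparkSession:

import org.apache.spark.sql.SparkSession

// The master URL decides which cluster manager the application talks to.
// Host names and ports here are placeholders, not real endpoints.
val spark = SparkSession.builder()
  .appName("ClusterManagerDemo")
  // .master("local[*]")                      // no cluster manager, run locally
  // .master("spark://master-host:7077")      // standalone cluster
  // .master("yarn")                          // Hadoop YARN
  // .master("mesos://mesos-host:5050")       // Apache Mesos
  .master("k8s://https://k8s-apiserver:443")  // Kubernetes
  .getOrCreate()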



What is the importance of Resilient Distributed Datasets in Spark?

Resilient Distributed Datasets (RDDs) are the fundamental data structure of Apache Spark. Embedded in Spark Core, RDDs are immutable, fault-tolerant, distributed collections of objects that can be operated on in parallel. RDDs are divided into partitions and may be computed on different nodes of a cluster. They are created either by transforming existing RDDs or by loading an external dataset from stable storage. Here is the architecture of an RDD:
[Diagram: RDD architecture]
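A minimal sketch of the two ways an RDD is typically created, assuming the SparkContext sc provided by spark-shell (the file path is a placeholder):

// Create an RDD by parallelizing an in-memory collection (here split into 3 partitions)...
val numbers = sc.parallelize(Seq(1, 2, 3, 4, 5), 3)

// ...or by loading an external dataset from stable storage.
val lines = sc.textFile("hdfs:///data/input.txt")

// New RDDs are derived by transforming existing ones; the originals stay immutable.
val doubled = numbers.map(_ * 2)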

What is a lazy evaluation in Spark?

When Spark operates on any dataset, it remembers the instructions. When a transformation such as map() is called on an RDD, the operation is not performed immediately. Transformations in Spark are not evaluated until you execute an action, which helps optimize the overall data processing workflow. This behaviour is known as lazy evaluation.
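A small sketch of this behaviour in the spark-shell (the file path is a placeholder): the map below is only recorded in the lineage, and nothing is read or computed until the count action runs.

val lines = sc.textFile("hdfs:///data/input.txt") // nothing is read yet
val upper = lines.map(_.toUpperCase)              // transformation: only recorded

// The action finally forces Spark to read the file and apply the map.
val n = upper.count()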

How do we monitor Apache Spark with Prometheus?

There are a few ways to monitor Apache Spark with Prometheus. One of them is JmxSink + JMX exporter.
Preparation:
  1. Uncomment *.sink.jmx.class=org.apache.spark.metrics.sink.JmxSink in spark/conf/metrics.properties
  2. Download jmx-exporter from the prometheus/jmx_exporter project
  3. Download an example Prometheus config file (spark.yml)
Use it in spark-shell or spark-submit. In the following command, the jmx_prometheus_javaagent-0.3.1.jar file and spark.yml were downloaded in the previous steps; the paths may need to be adjusted accordingly.
bin/spark-shell --conf "spark.driver.extraJavaOptions=-javaagent:jmx_prometheus_javaagent-0.3.1.jar=8080:spark.yml"
After running, the metrics are available at localhost:8080/metrics. Prometheus can then be configured to scrape them from the JMX exporter. NOTE: the discovery part has to be handled properly if Spark runs in a cluster environment.

How can you trigger automatic clean-ups in Spark to handle accumulated metadata?

To trigger the clean-ups, you have to set the parameter spark.cleaner.ttl; a short sketch follows.
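A minimal sketch of setting this property (a time-to-live in seconds) on the SparkConf before the context is created; note that spark.cleaner.ttl applies to older Spark releases:

import org.apache.spark.{SparkConf, SparkContext}

// spark.cleaner.ttl is a time-to-live in seconds: metadata older than this
// (generated stages, tasks, and so on) is periodically forgotten.
val conf = new SparkConf()
  .setAppName("CleanupDemo")
  .set("spark.cleaner.ttl", "3600") // forget metadata older than one hour

val sc = new SparkContext(conf)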

What are the advantages of Spark over MapReduce?

Spark has the following advantages over MapReduce:
  1. Thanks to in-memory processing, Spark executes jobs roughly 10 to 100 times faster than Hadoop MapReduce, which uses persistent storage for all of its data processing.
  2. Unlike Hadoop, Spark provides built-in libraries that cover multiple kinds of work on the same core, such as batch processing, streaming, machine learning, and interactive SQL queries, whereas Hadoop MapReduce only supports batch processing.
  3. Hadoop is heavily disk-dependent, while Spark promotes caching and in-memory data storage.
  4. Spark can perform computations multiple times on the same dataset. This is called iterative computation and is sketched below; Hadoop MapReduce has no built-in support for it.
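A rough sketch of point 4 in the spark-shell (the data and the computation are illustrative): caching keeps the dataset in memory, so each iteration reuses it instead of recomputing it.

// Cache the dataset so repeated passes reuse the in-memory copy.
val points = sc.parallelize(1 to 1000000).map(_.toDouble).cache()

// Toy iterative computation: several passes over the same cached RDD.
var threshold = 0.0
for (i <- 1 to 10) {
  threshold = points.filter(_ > threshold).sum() / points.count()
}
println(s"final threshold: $threshold")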

How to create an RDD that can represent a matrix in Apache Spark? How to multiply two such RDDs?

It all depends on the input data and dimensions, but generally speaking what we want is not an RDD but one of the distributed data structures from org.apache.spark.mllib.linalg.distributed. At the moment it provides four different implementations of DistributedMatrix:
  1. IndexedRowMatrix - can be created directly from a RDD [IndexedRow] where IndexedRow consists of row index and org.apache.spark.mllib.linalg.Vector.
    import org.apache.spark.mllib.linalg.{Vectors, Matrices}
    import org.apache.spark.mllib.linalg.distributed.{IndexedRowMatrix,
      IndexedRow}
    
    val rows = sc.parallelize(Seq(
      (0L, Array(1.0, 0.0, 0.0)),
      (1L, Array(0.0, 1.0, 0.0)),
      (2L, Array(0.0, 0.0, 1.0)))
    ).map{case (i, xs) => IndexedRow(i, Vectors.dense(xs))}
    
    val indexedRowMatrix = new IndexedRowMatrix(rows)
    
  2. RowMatrix - similar to IndexedRowMatrix but without meaningful row indices. Can be created directly from RDD[org.apache.spark.mllib.linalg.Vector].
    import org.apache.spark.mllib.linalg.distributed.RowMatrix
    
    val rowMatrix = new RowMatrix(rows.map(_.vector))    
    
  3. BlockMatrix - can be created from RDD[((Int, Int), Matrix)] where the first element of the tuple contains coordinates of the block and the second one is a local org.apache.spark.mllib.linalg.Matrix
    import org.apache.spark.mllib.linalg.distributed.BlockMatrix

    val eye = Matrices.sparse(
      3, 3, Array(0, 1, 2, 3), Array(0, 1, 2), Array(1.0, 1.0, 1.0))
    
    val blocks = sc.parallelize(Seq(
       ((0, 0), eye), ((1, 1), eye), ((2, 2), eye)))
    
    val blockMatrix = new BlockMatrix(blocks, 3, 3, 9, 9)
    
  4. CoordinateMatrix - can be created from RDD[MatrixEntry] where the MatrixEntry consists of row, column and value.
    import org.apache.spark.mllib.linalg.distributed.{CoordinateMatrix,
      MatrixEntry}
    
    val entries = sc.parallelize(Seq(
       (0, 0, 3.0), (2, 0, -5.0), (3, 2, 1.0),
       (4, 1, 6.0), (6, 2, 2.0), (8, 1, 4.0))
    ).map{case (i, j, v) => MatrixEntry(i, j, v)}
    
    val coordinateMatrix = new CoordinateMatrix(entries, 9, 3)
    
The first two implementations support multiplication by a local Matrix:
val localMatrix = Matrices.dense(3, 2, Array(1.0, 2.0, 3.0, 4.0, 5.0, 6.0))

indexedRowMatrix.multiply(localMatrix).rows.collect
// Array(IndexedRow(0,[1.0,4.0]), IndexedRow(1,[2.0,5.0]),
//   IndexedRow(2,[3.0,6.0]))
And the third one can be multiplied by another BlockMatrix as long as the number of columns per block in this matrix matches the number of rows per block of the other matrix. CoordinateMatrix doesn't support multiplications but is pretty easy to create and transform to other types of distributed matrices:
blockMatrix.multiply(coordinateMatrix.toBlockMatrix(3, 3))
Each type has its own strengths and weaknesses, and there are some additional factors to consider when you use sparse or dense elements (Vectors or block Matrices). Multiplying by a local matrix is usually preferable since it doesn't require expensive shuffling. You can find more details about each type in the MLlib Data Types guide.

What is YARN?

Similar to Hadoop, YARN is one of the key features in Spark, providing a central resource-management platform to deliver scalable operations across the cluster. YARN is a distributed container manager, whereas Spark is a data processing tool. Spark can run on YARN, in the same way Hadoop MapReduce can run on YARN. Running Spark on YARN requires a binary distribution of Spark that is built with YARN support.

What is the distinction between an RDD and a DataFrame in Apache Spark? Can you transform one to the other?

A DataFrame is a table, or a two-dimensional array-like structure, in which each column holds measurements of one variable and each row contains one case. Because of this tabular format a DataFrame carries additional metadata, which allows Spark to run optimizations on the finalized query. An RDD (Resilient Distributed Dataset) is more of a black box of data that Spark cannot optimize, because the operations performed against it are not constrained in the same way. You can go from a DataFrame to an RDD via its rdd method, and from an RDD to a DataFrame (if the RDD is in a tabular format) via the toDF method. In general, it is recommended to use a DataFrame where possible because of the built-in query optimization.
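A short sketch of both conversions in Scala, assuming a SparkSession named spark (the case class and the sample data are illustrative):

import org.apache.spark.rdd.RDD
import org.apache.spark.sql.DataFrame

case class Person(name: String, age: Int)
import spark.implicits._ // enables toDF on RDDs of case classes

// RDD -> DataFrame: possible because the RDD already has a tabular shape.
val rdd: RDD[Person] = spark.sparkContext.parallelize(
  Seq(Person("Alice", 34), Person("Bob", 45)))
val df: DataFrame = rdd.toDF()

// DataFrame -> RDD: every DataFrame exposes its underlying RDD of Rows.
val backToRdd = df.rdd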

How do we concatenate two columns in an Apache Spark DataFrame? Is there any function in Spark SQL which we can utilize?

With raw SQL you can use CONCAT.
In Python:
df = sqlContext.createDataFrame([("foo", 1), ("bar", 2)], ("k", "v"))
df.registerTempTable("df")
sqlContext.sql("SELECT CONCAT(k, ' ',  v) FROM df")
In Scala:
import sqlContext.implicits._
val df = sc.parallelize(Seq(("foo", 1), ("bar", 2))).toDF("k", "v")
df.registerTempTable("df")
sqlContext.sql("SELECT CONCAT(k, ' ',  v) FROM df")
Since Spark 1.5.0 you can use the concat function with the DataFrame API.
In Python:
from pyspark.sql.functions import concat, col, lit
df.select(concat(col("k"), lit(" "), col("v")))
In Scala:
import org.apache.spark.sql.functions.{concat, lit}
df.select(concat($"k", lit(" "), $"v"))
There is also a concat_ws function, which takes a string separator as its first argument.
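For example, a brief sketch reusing the Scala df and the implicits imported above; the separator comes first, and unlike concat, concat_ws skips null values instead of returning null:

import org.apache.spark.sql.functions.concat_ws

// concat_ws(separator, columns...) joins the columns with the separator.
df.select(concat_ws(" ", $"k", $"v").alias("k_v")).show()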

How would you compute the total count of unique words in Spark?

  1. Load the text file as an RDD:
    lines = sc.textFile("hdfs://Hadoop/user/test_file.txt")

  2. Define a function that breaks each line into words:
    def toWords(line):
        return line.split()

  3. Apply the toWords function to each element of the RDD as a flatMap transformation:
    words = lines.flatMap(toWords)

  4. Convert each word into a (key, value) pair:
    def toTuple(word):
        return (word, 1)

    wordsTuple = words.map(toTuple)

  5. Perform the reduceByKey() transformation:
    def add(x, y):
        return x + y

    counts = wordsTuple.reduceByKey(add)

  6. Print the result (counts.count() gives the total number of unique words):
    counts.collect()
    

What is the role of accumulators in Spark?

Accumulators are variables used to aggregate information across the executors. This can be operational or diagnostic data, for example how many records are corrupted or how many times a particular library API was called.
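A minimal sketch using a long accumulator to count corrupt records, assuming the sc from spark-shell (the sample data and parsing logic are illustrative):

// Accumulators are written to by the executors and read on the driver.
val corruptRecords = sc.longAccumulator("corruptRecords")

val raw = sc.parallelize(Seq("1", "2", "oops", "4"))
val parsed = raw.flatMap { s =>
  try Some(s.toInt)
  catch { case _: NumberFormatException => corruptRecords.add(1); None }
}

parsed.count()                // an action must run before the value is populated
println(corruptRecords.value) // 1 corrupt record in this sample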

See Also

Spring Boot Interview Questions | Apache Camel Interview Questions | Drools Interview Questions | Java 8 Interview Questions | Enterprise Service Bus (ESB) Interview Questions | JBoss Fuse Interview Questions | ElasticSearch Interview Questions