


Apache Spark Interview Questions


In this post we will look at Apache Spark interview questions. Examples are provided with explanations.


  1. What is Apache Spark?
  2. What is the difference between Apache Spark and MapReduce?
  3. What are the chief elements of the Spark ecosystem?
  4. Explain how Spark runs applications with the help of its architecture
  5. What are the different cluster managers available in Apache Spark?
  6. What is the importance of Resilient Distributed Datasets in Spark?
  7. What is a lazy evaluation in Spark?
  8. How do we monitor Apache Spark with Prometheus?
  9. How can you trigger automatic clean-ups in Spark to handle accumulated metadata?
  10. What are the advantages of Spark over MapReduce?
  11. How to create an RDD that can represent a matrix in Apache Spark? How to multiply two such RDDs?
  12. What is YARN?
  13. What is the distinction between an RDD and a DataFrame in Apache Spark? Can you transform one to the other?
  14. How do we concatenate two columns in an Apache Spark DataFrame? Is there any function in Spark SQL which we can utilize?
  15. How would you compute the total count of unique words in Spark?
  16. What is the role of accumulators in Spark?

What is Apache Spark?

Apache Spark is a unified analytics engine for processing large volumes of data. It can run workloads up to 100 times faster than Hadoop MapReduce and offers more than 80 high-level operators that make it easy to build parallel apps. Spark can run on Hadoop, Apache Mesos, Kubernetes, standalone, or in the cloud, and can access data from diverse sources.
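As a minimal sketch (assuming a local Spark installation; the application name and data are illustrative), a self-contained Scala application only needs a SparkSession and a few of those high-level operators:

import org.apache.spark.sql.SparkSession

object SparkHello {
  def main(args: Array[String]): Unit = {
    // The SparkSession is the entry point of a Spark application.
    val spark = SparkSession.builder()
      .appName("SparkHello")
      .master("local[*]") // run locally on all available cores
      .getOrCreate()

    // Distribute a small collection and chain high-level operators;
    // the work is executed in parallel across partitions.
    val evenSquares = spark.sparkContext
      .parallelize(1 to 10)
      .map(n => n * n)
      .filter(_ % 2 == 0)

    println(evenSquares.collect().mkString(", "))
    spark.stop()
  }
}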

What is the difference between Apache Spark and MapReduce?


MapReduce writes intermediate results to disk after each map and reduce step, while Spark keeps data in memory wherever possible and is therefore significantly faster for most workloads. Spark also bundles batch, streaming, SQL, and machine-learning APIs on a single engine, whereas MapReduce supports only batch processing; the advantages are detailed further in the question on Spark's advantages over MapReduce below.
[Table: Apache Spark vs. MapReduce]

What are the chief elements of the Spark ecosystem?

Apache Spark has three principal sections that constitute its ecosystem. Those are:
  1. Language support: Spark integrates with several languages for application development and analytics, including Java, Python, Scala, and R.
  2. Core Components: Spark has five core components: Spark Core, Spark SQL, Spark Streaming, Spark MLlib, and GraphX.
  3. Cluster Management: Spark can run in three environments: a standalone cluster, Apache Mesos, and Hadoop YARN.

[Diagram: Spark ecosystem]

Explain how Spark runs applications with the help of its architecture.

Spark applications run as independent processes, coordinated by the SparkSession object in the driver program. The resource manager or cluster manager assigns tasks to the worker nodes, with one task per partition. Iterative algorithms apply operations to the data repeatedly, so they benefit from caching datasets across iterations. A task applies its unit of work to the dataset in its partition and outputs a new partition dataset. Finally, the results are sent back to the driver application or stored on disk.
[Diagram: Spark architecture]

What are the different cluster managers available in Apache Spark?

  1. Standalone Mode: By default, applications submitted to a standalone-mode cluster run in FIFO order, and each application will try to use all available nodes. We start a standalone cluster either by launching a master and workers manually or by using the provided launch scripts. It is also possible to run these daemons on a single machine for testing.
  2. Apache Mesos: Apache Mesos is an open-source project that manages computer clusters and can also run Hadoop applications. The advantages of deploying Spark with Mesos include dynamic partitioning between Spark and other frameworks, and scalable partitioning between multiple instances of Spark.
  3. Hadoop YARN: Apache YARN is the cluster resource manager of Hadoop 2. Spark can run on YARN as well.
  4. Kubernetes: Kubernetes is an open-source system for automating the deployment, scaling, and management of containerized applications.
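In practice the cluster manager is usually selected with the --master option of spark-submit; as a rough sketch (all host names and ports below are illustrative placeholders), the same choice can also be made programmatically when building the SparkSession:

import org.apache.spark.sql.SparkSession

// The master URL decides which cluster manager the application talks to.
// Host names and ports here are placeholders, not real endpoints.
val spark = SparkSession.builder()
  .appName("ClusterManagerDemo")
  // .master("local[*]")                      // no cluster manager, run locally
  // .master("spark://master-host:7077")      // standalone cluster
  // .master("yarn")                          // Hadoop YARN
  // .master("mesos://mesos-host:5050")       // Apache Mesos
  .master("k8s://https://k8s-apiserver:443")  // Kubernetes
  .getOrCreate()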



What is the importance of Resilient Distributed Datasets in Spark?

Resilient Distributed Datasets (RDDs) are the fundamental data structure of Apache Spark. Embedded in Spark Core, RDDs are immutable, fault-tolerant, distributed collections of objects that can be operated on in parallel. RDDs are divided into partitions and may be computed on different nodes of a cluster. They are created either by transforming existing RDDs or by loading an external dataset from stable storage. Here is the architecture of an RDD:
[Diagram: RDD architecture]
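A minimal sketch of the two ways an RDD is typically created, assuming the SparkContext sc provided by spark-shell (the file path is a placeholder):

// Create an RDD by parallelizing an in-memory collection (here split into 3 partitions)...
val numbers = sc.parallelize(Seq(1, 2, 3, 4, 5), 3)

// ...or by loading an external dataset from stable storage.
val lines = sc.textFile("hdfs:///data/input.txt")

// New RDDs are derived by transforming existing ones; the originals stay immutable.
val doubled = numbers.map(_ * 2)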

What is a lazy evaluation in Spark?

When Spark operates on any dataset, it remembers the instructions. When a transformation such as map() is called on an RDD, the operation is not performed immediately. Transformations in Spark are not evaluated until you execute an action, which helps optimize the overall data processing workflow. This behaviour is known as lazy evaluation.
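A small sketch of this behaviour in the spark-shell (the file path is a placeholder): the map below is only recorded in the lineage, and nothing is read or computed until the count action runs.

val lines = sc.textFile("hdfs:///data/input.txt") // nothing is read yet
val upper = lines.map(_.toUpperCase)              // transformation: only recorded

// The action finally forces Spark to read the file and apply the map.
val n = upper.count()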

How do we monitor Apache Spark with Prometheus?

There are a few ways to monitor Apache Spark with Prometheus. One of them is JmxSink + JMX exporter.
Preparation:
  1. Uncomment *.sink.jmx.class=org.apache.spark.metrics.sink.JmxSink in spark/conf/metrics.properties
  2. Download jmx-exporter from the prometheus/jmx_exporter project
  3. Download an example Prometheus config file (spark.yml)
Use it in spark-shell or spark-submit. In the following command, the jmx_prometheus_javaagent-0.3.1.jar file and spark.yml were downloaded in the previous steps; the paths may need to be adjusted accordingly.
bin/spark-shell --conf "spark.driver.extraJavaOptions=-javaagent:jmx_prometheus_javaagent-0.3.1.jar=8080:spark.yml"
After running, the metrics are available at localhost:8080/metrics. Prometheus can then be configured to scrape them from the JMX exporter. NOTE: the discovery part has to be handled properly if Spark runs in a cluster environment.

How can you trigger automatic clean-ups in Spark to handle accumulated metadata?

To trigger the clean-ups, you have to set the parameter spark.cleaner.ttl; a short sketch follows.
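A minimal sketch of setting this property (a time-to-live in seconds) on the SparkConf before the context is created; note that spark.cleaner.ttl applies to older Spark releases:

import org.apache.spark.{SparkConf, SparkContext}

// spark.cleaner.ttl is a time-to-live in seconds: metadata older than this
// (generated stages, tasks, and so on) is periodically forgotten.
val conf = new SparkConf()
  .setAppName("CleanupDemo")
  .set("spark.cleaner.ttl", "3600") // forget metadata older than one hour

val sc = new SparkContext(conf)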

What are the advantages of Spark over MapReduce?

Spark has the following advantages over MapReduce:
  1. Thanks to in-memory processing, Spark executes jobs roughly 10 to 100 times faster than Hadoop MapReduce, which uses persistent storage for all of its data processing.
  2. Unlike Hadoop, Spark provides built-in libraries that cover multiple kinds of work on the same core, such as batch processing, streaming, machine learning, and interactive SQL queries, whereas Hadoop MapReduce only supports batch processing.
  3. Hadoop is heavily disk-dependent, while Spark promotes caching and in-memory data storage.
  4. Spark can perform computations multiple times on the same dataset. This is called iterative computation and is sketched below; Hadoop MapReduce has no built-in support for it.
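A rough sketch of point 4 in the spark-shell (the data and the computation are illustrative): caching keeps the dataset in memory, so each iteration reuses it instead of recomputing it.

// Cache the dataset so repeated passes reuse the in-memory copy.
val points = sc.parallelize(1 to 1000000).map(_.toDouble).cache()

// Toy iterative computation: several passes over the same cached RDD.
var threshold = 0.0
for (i <- 1 to 10) {
  threshold = points.filter(_ > threshold).sum() / points.count()
}
println(s"final threshold: $threshold")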

How to create an RDD that can represent a matrix in Apache Spark? How to multiply two such RDDs?

It all depends on the input data and dimensions, but generally speaking what we want is not an RDD but one of the distributed data structures from org.apache.spark.mllib.linalg.distributed. At the moment it provides four different implementations of DistributedMatrix:
  1. IndexedRowMatrix - can be created directly from a RDD [IndexedRow] where IndexedRow consists of row index and org.apache.spark.mllib.linalg.Vector.
    import org.apache.spark.mllib.linalg.{Vectors, Matrices}
    import org.apache.spark.mllib.linalg.distributed.{IndexedRowMatrix,
      IndexedRow}
    
    val rows = sc.parallelize(Seq(
      (0L, Array(1.0, 0.0, 0.0)),
      (1L, Array(0.0, 1.0, 0.0)),
      (2L, Array(0.0, 0.0, 1.0)))
    ).map{case (i, xs) => IndexedRow(i, Vectors.dense(xs))}
    
    val indexedRowMatrix = new IndexedRowMatrix(rows)
    
  2. RowMatrix - similar to IndexedRowMatrix but without meaningful row indices. Can be created directly from RDD[org.apache.spark.mllib.linalg.Vector].
    import org.apache.spark.mllib.linalg.distributed.RowMatrix
    
    val rowMatrix = new RowMatrix(rows.map(_.vector))    
    
  3. BlockMatrix - can be created from RDD[((Int, Int), Matrix)] where the first element of the tuple contains coordinates of the block and the second one is a local org.apache.spark.mllib.linalg.Matrix
    import org.apache.spark.mllib.linalg.distributed.BlockMatrix

    val eye = Matrices.sparse(
      3, 3, Array(0, 1, 2, 3), Array(0, 1, 2), Array(1.0, 1.0, 1.0))
    
    val blocks = sc.parallelize(Seq(
       ((0, 0), eye), ((1, 1), eye), ((2, 2), eye)))
    
    val blockMatrix = new BlockMatrix(blocks, 3, 3, 9, 9)
    
  4. CoordinateMatrix - can be created from RDD[MatrixEntry] where the MatrixEntry consists of row, column and value.
    import org.apache.spark.mllib.linalg.distributed.{CoordinateMatrix,
      MatrixEntry}
    
    val entries = sc.parallelize(Seq(
       (0, 0, 3.0), (2, 0, -5.0), (3, 2, 1.0),
       (4, 1, 6.0), (6, 2, 2.0), (8, 1, 4.0))
    ).map{case (i, j, v) => MatrixEntry(i, j, v)}
    
    val coordinateMatrix = new CoordinateMatrix(entries, 9, 3)
    
The first two implementations support multiplication by a local Matrix:
val localMatrix = Matrices.dense(3, 2, Array(1.0, 2.0, 3.0, 4.0, 5.0, 6.0))

indexedRowMatrix.multiply(localMatrix).rows.collect
// Array(IndexedRow(0,[1.0,4.0]), IndexedRow(1,[2.0,5.0]),
//   IndexedRow(2,[3.0,6.0]))
And the third one can be multiplied by another BlockMatrix as long as the number of columns per block in this matrix matches the number of rows per block of the other matrix. CoordinateMatrix doesn't support multiplications but is pretty easy to create and transform to other types of distributed matrices:
blockMatrix.multiply(coordinateMatrix.toBlockMatrix(3, 3))
Each type has its own strengths and weaknesses, and there are some additional factors to consider when you use sparse or dense elements (Vectors or block Matrices). Multiplying by a local matrix is usually preferable since it doesn't require expensive shuffling. You can find more details about each type in the MLlib Data Types guide.

What is YARN?

Similar to Hadoop, YARN is one of the key features in Spark, providing a central resource-management platform to deliver scalable operations across the cluster. YARN is a distributed container manager, whereas Spark is a data processing tool. Spark can run on YARN, in the same way Hadoop MapReduce can run on YARN. Running Spark on YARN requires a binary distribution of Spark that is built with YARN support.

What is the distinction between an RDD and a DataFrame in Apache Spark? Can you transform one to the other?

A DataFrame is a table, or a two-dimensional array-like structure, in which each column holds measurements of one variable and each row contains one case. Because of this tabular format a DataFrame carries additional metadata, which allows Spark to run optimizations on the finalized query. An RDD (Resilient Distributed Dataset) is more of a black box of data that Spark cannot optimize, because the operations performed against it are not constrained in the same way. You can go from a DataFrame to an RDD via its rdd method, and from an RDD to a DataFrame (if the RDD is in a tabular format) via the toDF method. In general, it is recommended to use a DataFrame where possible because of the built-in query optimization.
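A short sketch of both conversions in Scala, assuming a SparkSession named spark (the case class and the sample data are illustrative):

import org.apache.spark.rdd.RDD
import org.apache.spark.sql.DataFrame

case class Person(name: String, age: Int)
import spark.implicits._ // enables toDF on RDDs of case classes

// RDD -> DataFrame: possible because the RDD already has a tabular shape.
val rdd: RDD[Person] = spark.sparkContext.parallelize(
  Seq(Person("Alice", 34), Person("Bob", 45)))
val df: DataFrame = rdd.toDF()

// DataFrame -> RDD: every DataFrame exposes its underlying RDD of Rows.
val backToRdd = df.rdd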

How do we concatenate two columns in an Apache Spark DataFrame? Is there any function in Spark SQL which we can utilize?

With raw SQL you can use CONCAT.
In Python:
df = sqlContext.createDataFrame([("foo", 1), ("bar", 2)], ("k", "v"))
df.registerTempTable("df")
sqlContext.sql("SELECT CONCAT(k, ' ',  v) FROM df")
In Scala:
import sqlContext.implicits._
val df = sc.parallelize(Seq(("foo", 1), ("bar", 2))).toDF("k", "v")
df.registerTempTable("df")
sqlContext.sql("SELECT CONCAT(k, ' ',  v) FROM df")
Since Spark 1.5.0 you can use the concat function with the DataFrame API.
In Python:
from pyspark.sql.functions import concat, col, lit
df.select(concat(col("k"), lit(" "), col("v")))
In Scala:
import org.apache.spark.sql.functions.{concat, lit}
df.select(concat($"k", lit(" "), $"v"))
There is also a concat_ws function, which takes a string separator as its first argument.
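For example, a brief sketch reusing the Scala df and the implicits imported above; the separator comes first, and unlike concat, concat_ws skips null values instead of returning null:

import org.apache.spark.sql.functions.concat_ws

// concat_ws(separator, columns...) joins the columns with the separator.
df.select(concat_ws(" ", $"k", $"v").alias("k_v")).show()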

How would you compute the total count of unique words in Spark?

  1. Load the text file as an RDD:
    lines = sc.textFile("hdfs://Hadoop/user/test_file.txt")

  2. Define a function that breaks each line into words:
    def toWords(line):
        return line.split()

  3. Apply the toWords function to each element of the RDD as a flatMap transformation:
    words = lines.flatMap(toWords)

  4. Convert each word into a (key, value) pair:
    def toTuple(word):
        return (word, 1)

    wordsTuple = words.map(toTuple)

  5. Perform the reduceByKey() transformation:
    def add(x, y):
        return x + y

    counts = wordsTuple.reduceByKey(add)

  6. Print the result (counts.count() gives the total number of unique words):
    counts.collect()
    

What is the role of accumulators in Spark?

Accumulators are variables used to aggregate information across the executors. This can be operational or diagnostic data, for example how many records are corrupted or how many times a particular library API was called.
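A minimal sketch using a long accumulator to count corrupt records, assuming the sc from spark-shell (the sample data and parsing logic are illustrative):

// Accumulators are written to by the executors and read on the driver.
val corruptRecords = sc.longAccumulator("corruptRecords")

val raw = sc.parallelize(Seq("1", "2", "oops", "4"))
val parsed = raw.flatMap { s =>
  try Some(s.toInt)
  catch { case _: NumberFormatException => corruptRecords.add(1); None }
}

parsed.count()                // an action must run before the value is populated
println(corruptRecords.value) // 1 corrupt record in this sample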

See Also

Spring Boot Interview Questions | Apache Camel Interview Questions | Drools Interview Questions | Java 8 Interview Questions | Enterprise Service Bus (ESB) Interview Questions | JBoss Fuse Interview Questions | ElasticSearch Interview Questions