Top AWS Glue (2021) Interview Questions | JavaInUse

Frequently asked AWS Glue interview questions.

In this post we will look at AWS Glue interview questions. Each answer is provided with an explanation and, where relevant, an example.

  1. What is AWS Glue?
  2. What are the Benefits of AWS Glue?
  3. What are the components used by AWS Glue?
  4. What Data Sources are supported by AWS Glue?
  5. What are Development Endpoints?
  6. What are AWS Tags in AWS Glue?
  7. What is AWS Glue Data Catalog?
  8. What are AWS Glue Crawlers?
  9. What is AWS Glue Streaming ETL?
  10. Is AWS Glue Schema Registry open-source?
  11. How can we list Databases and Tables in AWS Glue Catalog?
  12. How does AWS Glue update duplicate data?

What is AWS Glue?

AWS Glue is a managed extract, transform, and load (ETL) service that prepares data for analysis by automating the ETL process. It supports MySQL, Microsoft SQL Server, and PostgreSQL databases that run on Amazon EC2 (Elastic Compute Cloud) instances in an Amazon VPC (Virtual Private Cloud).
It automates the time-consuming steps of data preparation for analytics.

What are the Benefits of AWS Glue?

The benefits of AWS Glue are as follows:
  • Fault Tolerance - Failed AWS Glue jobs can be retried, and their logs can be used for debugging.
  • Filtering - AWS Glue can filter out bad data.
  • Maintenance and Deployment - AWS manages the service, so there is no infrastructure for us to maintain or deploy.

What are the components used by AWS Glue?


AWS Glue consists of:
  • Data Catalog is a central metadata repository.
  • ETL Engine generates Python and Scala code.
  • Flexible Scheduler handles dependency resolution, job monitoring, and retries.
  • AWS Glue DataBrew normalizes and cleans data through a visual interface.
  • AWS Glue Elastic Views replicates and combines data across multiple data stores.
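
The dependency resolution handled by the scheduler can be illustrated with a conditional trigger, which starts one job only after another has succeeded. The following is a minimal sketch, not a definitive setup: the helper name `create_dependency_trigger` and the trigger and job names passed to it are hypothetical, and `client` is assumed to be a boto3 Glue client created with `boto3.client('glue')`.

```python
def create_dependency_trigger(client, trigger_name, upstream_job, downstream_job):
    """Create a conditional trigger that starts downstream_job
    once upstream_job has finished in the SUCCEEDED state."""
    return client.create_trigger(
        Name=trigger_name,
        Type='CONDITIONAL',
        Predicate={
            'Conditions': [{
                'LogicalOperator': 'EQUALS',
                'JobName': upstream_job,
                'State': 'SUCCEEDED',
            }]
        },
        Actions=[{'JobName': downstream_job}],
        # Activate the trigger immediately instead of leaving it in the CREATED state
        StartOnCreation=True,
    )
```

The client is passed in as an argument so that the helper can be exercised with a stub before it is pointed at a real account.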

What Data Sources are supported by AWS Glue?

Data sources supported by AWS Glue are:
  • Amazon Aurora
  • Amazon RDS for MySQL
  • Amazon RDS for Oracle
  • Amazon RDS for PostgreSQL
  • Amazon RDS for SQL Server
  • Amazon Redshift
  • DynamoDB
  • Amazon S3
  • MySQL
  • Oracle
  • Microsoft SQL Server
AWS Glue also supports data streams from:
  • Amazon MSK
  • Amazon Kinesis Data Streams
  • Apache Kafka

What are Development Endpoints?

A development endpoint is an environment in which a developer can develop, debug, and test extract, transform, and load (ETL) scripts. The Development Endpoints API is the part of the AWS Glue API that covers testing with a custom DevEndpoint.

What are AWS Tags in AWS Glue?

A tag is a label that we assign to an AWS resource.
Each tag consists of a key and an optional value, both of which we define. We can use tags in AWS Glue to organize and identify our resources. Tags can also be used to create cost accounting reports and to restrict access to resources.
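
As a minimal sketch of how tags might be applied from code, the helper below attaches tags to a Glue resource and reads them back. The helper name `tag_glue_resource` is hypothetical, and `client` is assumed to be a boto3 Glue client; `tag_resource` and `get_tags` are the Glue API calls used.

```python
def tag_glue_resource(client, resource_arn, tags):
    """Attach tags (a dict of key -> value) to a Glue resource and
    return the complete set of tags now stored on that resource."""
    client.tag_resource(ResourceArn=resource_arn, TagsToAdd=tags)
    return client.get_tags(ResourceArn=resource_arn)['Tags']

# Example call (the ARN below is a made-up placeholder):
# tag_glue_resource(boto3.client('glue'),
#                   'arn:aws:glue:us-east-1:123456789012:job/example-job',
#                   {'team': 'analytics', 'env': 'dev'})
```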

   



What is AWS Glue Data Catalog?

The AWS Glue Data Catalog stores structural and operational metadata for all data assets. It provides a uniform repository where disparate systems can store and find metadata to keep track of data in data silos, and then use that metadata to query and transform the data.
The Data Catalog also stores table definitions, physical locations, and business-relevant attributes, and it tracks how the data has changed over time.
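
To make the stored metadata concrete, the sketch below reads one table definition from the Data Catalog and extracts its physical location, columns, and partition keys. It is an illustration under assumptions: the helper name `describe_table` is hypothetical, and `client` is assumed to be a boto3 Glue client.

```python
def describe_table(client, database_name, table_name):
    """Return the location, columns, and partition keys that the
    Data Catalog stores for a single table."""
    table = client.get_table(DatabaseName=database_name, Name=table_name)['Table']
    storage = table.get('StorageDescriptor', {})
    return {
        'location': storage.get('Location'),
        'columns': [(col['Name'], col.get('Type')) for col in storage.get('Columns', [])],
        'partition_keys': [key['Name'] for key in table.get('PartitionKeys', [])],
    }
```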


What are AWS Glue Crawlers?

An AWS Glue crawler connects to a data store and progresses through a prioritized list of classifiers to extract the schema of the data and other statistics. In this way, crawlers scan data stores to automatically infer schemas and partition structures, and then populate the Glue Data Catalog with the corresponding table definitions and statistics.
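
A crawler can also be created and started from code. The following is a minimal sketch for an Amazon S3 data store, assuming an existing IAM role that Glue can use; the helper name `create_and_start_s3_crawler` and all argument values are hypothetical, and `client` is assumed to be a boto3 Glue client.

```python
def create_and_start_s3_crawler(client, crawler_name, role_arn, database_name, s3_path):
    """Define a crawler over one S3 path and start it.
    Tables it discovers are written to database_name in the Data Catalog."""
    client.create_crawler(
        Name=crawler_name,
        Role=role_arn,
        DatabaseName=database_name,
        Targets={'S3Targets': [{'Path': s3_path}]},
    )
    # The crawl runs asynchronously; this call only requests the start
    client.start_crawler(Name=crawler_name)
    return crawler_name
```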

What is AWS Glue Streaming ETL?

AWS Glue enables ETL operations on streaming data by using continuously running jobs. Streaming ETL is built on the Apache Spark Structured Streaming engine, and it can ingest streams from Amazon Kinesis Data Streams and from Apache Kafka by using Amazon Managed Streaming for Apache Kafka.

Is AWS Glue Schema Registry open-source?

Only partly. AWS Glue Schema Registry storage is an AWS service, while the serializers and deserializers are Apache-licensed open-source components.
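
On the service side, schemas are registered through the Glue API. The sketch below registers an Avro schema in an existing registry; it is an illustration only. The helper name `register_avro_schema`, the registry and schema names, and the choice of `BACKWARD` compatibility are assumptions, and `client` is assumed to be a boto3 Glue client.

```python
import json

def register_avro_schema(client, registry_name, schema_name, avro_schema):
    """Create a schema in an existing registry and return its ARN.
    avro_schema is a Python dict holding the Avro schema definition."""
    response = client.create_schema(
        RegistryId={'RegistryName': registry_name},
        SchemaName=schema_name,
        DataFormat='AVRO',
        Compatibility='BACKWARD',  # assumed compatibility mode for this sketch
        SchemaDefinition=json.dumps(avro_schema),
    )
    return response['SchemaArn']
```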

How can we list Databases and Tables in AWS Glue Catalog?

We can list databases and tables by using the following boto3 script:
import boto3

# Create a Glue client; adjust the region to match your catalog
client = boto3.client('glue', region_name='us-east-1')

# Fetch the databases in the Data Catalog
responseGetDatabases = client.get_databases()
databaseList = responseGetDatabases['DatabaseList']

for databaseDict in databaseList:
    databaseName = databaseDict['Name']
    print('\ndatabaseName: ' + databaseName)

    # Fetch the tables that belong to this database
    responseGetTables = client.get_tables(DatabaseName=databaseName)
    tableList = responseGetTables['TableList']

    for tableDict in tableList:
        tableName = tableDict['Name']
        print('\n-- tableName: ' + tableName)
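
A single `get_databases` or `get_tables` call returns only one page of results, so a large catalog needs pagination. The following sketch collects every database and table by using boto3 paginators; the helper name `list_catalog` is hypothetical, and `client` is assumed to be a boto3 Glue client.

```python
def list_catalog(client):
    """Return {database_name: [table_name, ...]} for the whole
    Data Catalog, following pagination on both API calls."""
    catalog = {}
    database_pages = client.get_paginator('get_databases').paginate()
    for database_page in database_pages:
        for database in database_page['DatabaseList']:
            name = database['Name']
            table_pages = client.get_paginator('get_tables').paginate(DatabaseName=name)
            tables = []
            for table_page in table_pages:
                tables.extend(table['Name'] for table in table_page['TableList'])
            catalog[name] = tables
    return catalog
```

Returning a dictionary instead of printing keeps the listing reusable, for example to compare catalogs across regions.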



How does AWS Glue update duplicate data?

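AWS Glue can update a destination without keeping duplicate rows by using a script such as the following:
from pyspark.context import SparkContext
from awsglue.context import GlueContext

sc = SparkContext()
glueContext = GlueContext(sc)

# src_fg and dst_fg are placeholders; set them to your catalog database and table names

# Get your source data
src_data = glueContext.create_dynamic_frame.from_catalog(database = src_fg, table_name = src_fg)
src_df = src_data.toDF()

# Get your destination data
dst_data = glueContext.create_dynamic_frame.from_catalog(database = dst_fg, table_name = dst_fg)
dst_df = dst_data.toDF()

# Merge the two data frames; union() keeps every row, so drop exact duplicates explicitly
merged_df = dst_df.union(src_df).dropDuplicates()

# Save the data to the destination with OVERWRITE mode.
# 'abcd' is a placeholder for the real output format; supply the destination options or path it needs.
merged_df.write.format('abcd').mode('overwrite').save()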
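
A plain union keeps every row from both inputs, so removing duplicates needs an explicit rule for which copy of a record survives. The plain-Python sketch below (no Spark or Glue required) illustrates one such rule, in which a source row replaces a destination row that has the same key. The rule itself, the helper name `merge_by_key`, and the `id` key used in the example are assumptions made for illustration.

```python
def merge_by_key(dst_rows, src_rows, key):
    """Merge two lists of row dicts. When both lists contain a row with
    the same key value, the source row replaces the destination row."""
    merged = {row[key]: row for row in dst_rows}
    merged.update({row[key]: row for row in src_rows})
    return list(merged.values())

destination = [{'id': 1, 'value': 'old'}, {'id': 2, 'value': 'kept'}]
source = [{'id': 1, 'value': 'new'}, {'id': 3, 'value': 'added'}]
result = merge_by_key(destination, source, 'id')
```

Here `result` holds three rows: the row with `id` 1 carries the source value, while the rows with `id` 2 and 3 are carried over unchanged.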

