As a big data practitioner, you need to become an expert in big data frameworks, in particular Hadoop. Knowing your way around Hadoop helps you gain a strong foothold within your company and improves your chances of a salary hike. Hadoop training also helps you enhance your career prospects in a big data environment.
With big data here to stay and most companies migrating to a Hadoop environment, you must make sure to learn Hadoop. Become a Hadoop Developer and strengthen your profile with a Hadoop certification from a reputed institute.
What does a Hadoop Developer do?
The Hadoop Developer role involves the programming, design, and development of Hadoop applications within the Big Data ecosystem of the Hadoop framework.
The tasks of a Hadoop Developer are similar to those of a software developer, but in the domain of Big Data processing. The Developer designs and develops applications on Hadoop and maintains the privacy and security of the data.
Top interview questions for a Hadoop Developer
Although the following are basic questions you may face as a Developer, they will give you a general direction and help you crack the interview.
1. What is Hadoop?
Hadoop is a distributed computing platform for handling big data problems. It is an open-source framework for distributed storage and distributed processing of large data sets. You can run applications on the system with thousands of commodity hardware nodes. Its distributed file system provides rapid data transfer rates among nodes.
2. Why do we need Hadoop?
We need Hadoop to address the challenges of Big Data, namely, storage and processing of large data sets, security, analytics, data quality, and discovery.
3. What platform and Java version are required to run Hadoop?
Linux and Windows are the officially supported operating systems for Hadoop, although it is also known to run on BSD, Mac OS X, and Solaris. Java version 1.6.x or higher is required to run Hadoop.
4. What hardware is best for Hadoop?
Dual-processor or dual-core machines with 4 GB to 8 GB of ECC RAM are a good choice. However, the ideal hardware largely depends on the workflow.
5. What are the most common input formats defined in Hadoop?
- TextInputFormat
- KeyValueInputFormat
- SequenceFileInputFormat
TextInputFormat is the default input format.
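As a rough illustration, the sketch below shows how an input format is typically chosen in a MapReduce driver written against the newer org.apache.hadoop.mapreduce API. The job name, the input and output paths, and the choice of KeyValueTextInputFormat are placeholder assumptions for this sketch, not part of the question.

```java
import org.apache.hadoop.conf.Configuration;
import org.apache.hadoop.fs.Path;
import org.apache.hadoop.io.Text;
import org.apache.hadoop.mapreduce.Job;
import org.apache.hadoop.mapreduce.lib.input.FileInputFormat;
import org.apache.hadoop.mapreduce.lib.input.KeyValueTextInputFormat;
import org.apache.hadoop.mapreduce.lib.output.FileOutputFormat;

public class InputFormatDemo {
    public static void main(String[] args) throws Exception {
        Configuration conf = new Configuration();
        Job job = Job.getInstance(conf, "input-format-demo");
        job.setJarByClass(InputFormatDemo.class);

        // TextInputFormat is the default; an explicit call is only needed
        // when a different format is wanted, e.g. tab-separated key/value text.
        job.setInputFormatClass(KeyValueTextInputFormat.class);
        job.setOutputKeyClass(Text.class);
        job.setOutputValueClass(Text.class);

        // Placeholder input and output paths for this sketch.
        FileInputFormat.addInputPath(job, new Path("/input"));
        FileOutputFormat.setOutputPath(job, new Path("/output"));

        System.exit(job.waitForCompletion(true) ? 0 : 1);
    }
}
```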
6. What are the core components of Hadoop?
HDFS, MapReduce, and YARN.
Hadoop Distributed File System (HDFS) is the primary storage system of Hadoop. It stores large files as blocks across a cluster of commodity hardware and provides high-throughput access to application data by serving reads and writes in parallel.
MapReduce is the data processing layer of Hadoop. Applications written with MapReduce process large volumes of structured and unstructured data stored in HDFS in parallel.
YARN is the processing framework of Hadoop that handles resource management and supports multiple data processing engines.
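To make the HDFS component more concrete, here is a small sketch that reads a file from HDFS using Hadoop's standard FileSystem Java API. The file path is a hypothetical placeholder, and the cluster address is assumed to come from core-site.xml on the classpath.

```java
import java.io.InputStream;

import org.apache.hadoop.conf.Configuration;
import org.apache.hadoop.fs.FileSystem;
import org.apache.hadoop.fs.Path;
import org.apache.hadoop.io.IOUtils;

public class HdfsReadDemo {
    public static void main(String[] args) throws Exception {
        // Picks up fs.defaultFS from core-site.xml on the classpath.
        Configuration conf = new Configuration();
        FileSystem fs = FileSystem.get(conf);

        // Hypothetical file path, used only for this sketch.
        try (InputStream in = fs.open(new Path("/data/sample.txt"))) {
            // Copy the file contents to standard output.
            IOUtils.copyBytes(in, System.out, 4096, false);
        }
    }
}
```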
7. How does Hadoop compare with an RDBMS?
The differences between Hadoop and RDBMS are as follows:
Architecture – A traditional RDBMS is a database system with ACID properties, whereas Hadoop is a distributed computing framework.
Data acceptance – An RDBMS accepts only structured data, whereas Hadoop can accept both structured and unstructured data.
Scalability – An RDBMS is a traditional database that scales vertically, whereas Hadoop scales horizontally.
OLTP and OLAP – A traditional RDBMS supports OLTP, whereas Hadoop supports large-scale batch processing workloads, i.e. OLAP.
Cost – An RDBMS is licensed software, so you have to pay for its use, whereas Hadoop is an open-source framework that is free to use.
8. Which command is used for the retrieval of the status of daemons running the Hadoop cluster?
The ‘jps’ command.
9. What is InputSplit in Hadoop?
When a Hadoop job runs, it splits input files into logical chunks and assigns each chunk to a mapper for processing. Each of these chunks is an InputSplit.
10. How many InputSplits are made by a Hadoop Framework?
Assuming the default HDFS block size of 64 MB and input files of 64 KB, 65 MB, and 127 MB, 5 splits are made as follows:
- One split for the 64 KB file
- Two splits for the 65 MB file (64 MB + 1 MB), and
- Two splits for the 127 MB file (64 MB + 63 MB)
11. What is the use of RecordReader in Hadoop?
An InputSplit is only a logical chunk of the input; it does not know how to access the data it represents. The RecordReader class loads the data from its source and converts it into key-value pairs suitable for reading by the Mapper. For example, the RecordReader of the default TextInputFormat emits the byte offset of each line as the key and the line contents as the value.
12. When a client submits a Hadoop job, who receives it?
The NameNode receives the Hadoop job request; it then looks for the data requested by the client and provides the block information. The JobTracker takes care of resource allocation for the Hadoop job to ensure timely completion.
13. What are the functionalities of JobTracker?
- To accept jobs from the client.
- To communicate with the NameNode for the location of the data.
- To locate TaskTracker Nodes with available slots.
- To submit the work to the chosen TaskTracker node and monitor the progress.
14. What happens to a NameNode that has no data?
A NameNode without data does not exist in Hadoop. If a node is a NameNode, it will have some data in it; otherwise, it will not function as a NameNode.
15. What happens when a user submits a Hadoop job when the NameNode is down?
The Hadoop job fails when the NameNode is down, because clients cannot obtain the block information they need to read or write data in HDFS.
16. What is a MapReduce job in Hadoop?
A MapReduce job is a programming model that allows massive scalability across hundreds or thousands of servers in a Hadoop cluster.
A MapReduce job refers to two distinct tasks that Hadoop performs. First, the Map task takes a set of data and converts it into another set of data, in which individual elements are broken down into key-value pairs. Second, the Reduce task takes the output of the map as its input and combines those data tuples into a smaller set of tuples.
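A minimal word-count sketch, written against the newer org.apache.hadoop.mapreduce API, shows the two tasks side by side. The class names and whitespace tokenization are illustrative choices.

```java
import java.io.IOException;
import java.util.StringTokenizer;

import org.apache.hadoop.io.IntWritable;
import org.apache.hadoop.io.LongWritable;
import org.apache.hadoop.io.Text;
import org.apache.hadoop.mapreduce.Mapper;
import org.apache.hadoop.mapreduce.Reducer;

// Map task: the RecordReader of TextInputFormat delivers each line as
// (byte offset, line text); the mapper emits a (word, 1) pair per word.
class WordCountMapper extends Mapper<LongWritable, Text, Text, IntWritable> {
    private static final IntWritable ONE = new IntWritable(1);
    private final Text word = new Text();

    @Override
    protected void map(LongWritable key, Text value, Context context)
            throws IOException, InterruptedException {
        StringTokenizer tokens = new StringTokenizer(value.toString());
        while (tokens.hasMoreTokens()) {
            word.set(tokens.nextToken());
            context.write(word, ONE);
        }
    }
}

// Reduce task: after shuffle and sort, all counts for a given word arrive
// together and are combined into a smaller set of (word, total) tuples.
class WordCountReducer extends Reducer<Text, IntWritable, Text, IntWritable> {
    @Override
    protected void reduce(Text key, Iterable<IntWritable> values, Context context)
            throws IOException, InterruptedException {
        int sum = 0;
        for (IntWritable count : values) {
            sum += count.get();
        }
        context.write(key, new IntWritable(sum));
    }
}
```

The shuffle phase (see the next question) is what moves and groups the (word, 1) pairs between the map and reduce tasks.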
17. What is shuffling in MapReduce?
Shuffling is a process used for sorting and transferring the map outputs to the reducer as input.
18. What is the heartbeat in HDFS?
Heartbeat is a signal sent periodically from a DataNode to the NameNode, and from a TaskTracker to the JobTracker. If the NameNode or JobTracker stops receiving the signal, it indicates an issue with that DataNode or TaskTracker.
19. What is Safemode in Hadoop?
Safemode in Apache Hadoop is a maintenance state of NameNode during which NameNode does not allow any modifications to the file system.
20. What is the port number for NameNode, Task Tracker, and Job Tracker?
- NameNode: 50070
- Job Tracker: 50030
- Task Tracker: 50060
21. How is indexing done in HDFS?
HDFS indexes data based on the block size. Once data has been stored in blocks of the configured size, HDFS keeps storing the last part of the data, which indicates where the next part of the data is located.
22. What is Hadoop Streaming?
Hadoop Streaming is a utility that allows you to create and run MapReduce jobs with any executable or script as the mapper and/or the reducer. It is a generic API that allows programs written in virtually any language to be used as a Hadoop mapper or reducer.
23. What are Hadoop’s configuration files?
- core-site.xml
- mapred-site.xml
- hdfs-site.xml
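As a small sketch, Hadoop's Configuration class automatically loads core-default.xml and core-site.xml from the classpath, which is where client code picks up settings such as fs.defaultFS (fs.default.name in older releases):

```java
import org.apache.hadoop.conf.Configuration;

public class ShowConfig {
    public static void main(String[] args) {
        // Loads core-default.xml and core-site.xml found on the classpath.
        Configuration conf = new Configuration();

        // fs.defaultFS is typically set in core-site.xml and tells clients
        // where the HDFS NameNode lives (fs.default.name in older releases).
        System.out.println("fs.defaultFS = " + conf.get("fs.defaultFS"));
    }
}
```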
24. Is it possible to provide multiple inputs to Hadoop?
Yes, it is possible. The FileInputFormat class provides methods such as addInputPath and addInputPaths to add multiple files or directories as input to a Hadoop job.
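Here is a hedged sketch of two common approaches with the newer API: FileInputFormat accepts several input paths, and MultipleInputs additionally lets each path carry its own input format and mapper. The directory names are hypothetical, and the base Mapper class stands in for real mapper classes.

```java
import org.apache.hadoop.fs.Path;
import org.apache.hadoop.mapreduce.Job;
import org.apache.hadoop.mapreduce.Mapper;
import org.apache.hadoop.mapreduce.lib.input.FileInputFormat;
import org.apache.hadoop.mapreduce.lib.input.MultipleInputs;
import org.apache.hadoop.mapreduce.lib.input.SequenceFileInputFormat;
import org.apache.hadoop.mapreduce.lib.input.TextInputFormat;

public class MultiInputSetup {

    // Alternative 1: several directories, all read with the job's single
    // configured input format. The paths are placeholders for this sketch.
    static void addSeveralDirectories(Job job) {
        FileInputFormat.addInputPath(job, new Path("/logs/2023"));
        FileInputFormat.addInputPath(job, new Path("/logs/2024"));
    }

    // Alternative 2: MultipleInputs lets each path use its own input format
    // and mapper. The base Mapper class (identity mapper) stands in here
    // for real, custom mapper classes.
    static void addHeterogeneousInputs(Job job) {
        MultipleInputs.addInputPath(job, new Path("/raw/text"),
                TextInputFormat.class, Mapper.class);
        MultipleInputs.addInputPath(job, new Path("/raw/seq"),
                SequenceFileInputFormat.class, Mapper.class);
    }
}
```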
25. Is it necessary to write jobs for Hadoop in the Java language?
No. There are other ways to work with non-Java code. For example, Hadoop Streaming allows any shell command, script, or executable to be used as the map or reduce function.