Hadoop best practices for ETL

By Sharad Varshney , Posted October 23, 2017 In Hadoop, Big Data

Hadoop, an open-source framework has been around for quite some time in the industry. You would see a multitude of articles on how to use Hadoop for various data transformation needs, features, and functionalities.

But do you know what are the Hadoop best practices for ETL? Remember that as this is an open-source project, there are many features available in Hadoop. But it may not be the best option to use them in new Hadoop ETL development. I have seen from experience that old database programmers use Oracle/SQL Server’s techniques in Hive and screw up the performance of the ETL logic.

So, why do these techniques exist when it is not in your best interest to use them? It is because of Hadoop’s open-source nature and competition in the industry. If a client is doing a Teradata conversion project, they can save enough dollars by just converting Teradata logic to Hive, and for performance gain, they don’t mind paying for additional hardware. This is why all the features of all the traditional databases exist in Hive.

The Dos and Don'ts

If you are writing a new logic in Hadoop, use the proposed methodology for Hadoop development.

Do not use Views

Views are great for transactional systems, where data is frequently changing, and a programmer can consolidate sophisticated logic into a View. When source data is not changing, create a table instead of a View.

Do not use small partitions

In a transactional system, to reduce the query time, a simple approach is to partition the data based on a query where clause. While in Hadoop, the mapping is far cheaper than start and stop of a container. So use partition only when the data size of each partition is about a block size or more (64/128 MB).

Use ORC or Parquet File format

By changing the underneath file format to ORC or Parquet, we can get a significant performance gain.

Avoid Sequential Programming (Phase Development)

We need to find ways, where we can program in parallel and use Hadoop’s processing power. The core of this is breaking logic into multiple phases, and we should be able to run these steps in parallel.

Managed Table vs. External Table

Adopt a standard and stick to it but I recommend using Managed Tables. They are always better to govern. You should use External Tables when you are importing data from the external system. You need to define a schema for it after that entire workflow can be created using Hive scripts leveraging managed tables.

Pig vs. Hive

If you ask any Hadoop vendor (Cloudera, Hortonworks, etc.), you will not get a definitive answer as they support both languages. Sometimes, logic in Hive can be quite complicated compared to Pig but I would still advise using Hive if possible. This is because we would need resources in the future to maintain the code.

There are very few people who know Pig, and it is a steep learning curve. And also, there is not much investment happening in Pig as compared to Hive. Various industries are using Hive and companies like Hortonworks, AWS, MS, etc. are contributing to Hive.

Phases Development

So how would you get the right outcome without using Views? You would get that by using Phase Development. Keep parking processed data into various phases and then keep treating it to obtain a final result.

Basic principles of Hadoop programming

1. Storage is cheap while processing is expensive.

2. You cannot complete a job in sub-seconds, it would take way more than that, usually a few seconds to minutes.

3. It’s not a transactional system; source data is not changing until we import it again.

Download Our Trending White Papers

OvalEdge recognized as a leader in data governance solutions

SPARK Matrix™: Data Governance Solution, 2025

Final_2025_SPARK Matrix_Data Governance Solutions_QKS GroupOvalEdge 1

View

Total Economic Impact™ (TEI) Study commissioned by OvalEdge: ROI of 337%

“Reference customers have repeatedly mentioned the great customer service they receive along with the support for their custom requirements, facilitating time to value. OvalEdge fits well with organizations prioritizing business user empowerment within their data governance strategy.”

Download

Named an Overall Leader in Data Catalogs & Metadata Management

Download

Recognized as a Niche Player in the 2025 Gartner® Magic Quadrant™ for Data and Analytics Governance Platforms

Gartner, Magic Quadrant for Data and Analytics Governance Platforms, January 2025

Gartner does not endorse any vendor, product or service depicted in its research publications, and does not advise technology users to select only those vendors with the highest ratings or other designation. Gartner research publications consist of the opinions of Gartner’s research organization and should not be construed as statements of fact. Gartner disclaims all warranties, expressed or implied, with respect to this research, including any warranties of merchantability or fitness for a particular purpose.

Find your edge now. See how OvalEdge works.

Book demo

Table of Contents

Read More Posts On

View All Blog Posts

Hadoop best practices for ETL