Hadoop best practices for ETL
Hadoop, an open source framework, has been around in the industry for quite some time. You will find a multitude of articles on how to use Hadoop for various data transformation needs, and on its features and functionality. But do you know the Hadoop best practices for ETL? Remember that because Hadoop is an open source project, it has accumulated many features. That does not mean they are all good options for new Hadoop ETL development. I have seen from experience that programmers with a database background carry Oracle/SQL Server techniques over into Hive and ruin the performance of their ETL logic.
So why do these techniques exist if it is not in your best interest to use them? Because of Hadoop's open source nature and competition in the industry. If a client is running a Teradata conversion project, they can save a lot of money by converting Teradata logic directly to Hive, and for the performance gap they don't mind paying for additional hardware. That is why features from all the traditional databases exist in Hive. But if you are writing new logic in Hadoop, use the methodology proposed here for Hadoop development.
The Dos and Don'ts
Do not use Views
Views are great for transactional systems, where data changes frequently and a programmer can consolidate sophisticated logic into a View. In Hadoop, the source data does not change between loads, so create a table instead of a View.
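As a sketch, the logic that might otherwise live in a View can be materialized once per batch load instead (the table and column names below are hypothetical illustrations):

```sql
-- Instead of: CREATE VIEW daily_sales_v AS SELECT ...
-- materialize the result once per load, so downstream queries
-- read precomputed data rather than re-running the logic.
DROP TABLE IF EXISTS daily_sales;
CREATE TABLE daily_sales AS
SELECT sale_date,
       store_id,
       SUM(amount) AS total_amount
FROM   sales_raw
GROUP  BY sale_date, store_id;
```

Because the source is static until the next import, rebuilding this table once per load cycle gives the same answer a View would, without paying the computation cost on every query.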
Do not use small partitions
In a transactional system, a simple approach to reducing query time is to partition the data based on the query's WHERE clause. In Hadoop, however, scanning data in the mappers is far cheaper than the overhead of starting and stopping a container for every tiny partition. So use partitioning only when each partition holds roughly a block size of data or more (64/128 MB).
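For example, a sketch of partitioning on a coarse key so that each partition stays near or above a block in size (names here are illustrative, not from a real schema):

```sql
-- Partition by month rather than by day or hour: fewer, larger
-- partitions keep each one close to an HDFS block (64/128 MB),
-- avoiding a flood of small files and container start-up overhead.
CREATE TABLE sales_part (
  store_id INT,
  amount   DECIMAL(10,2)
)
PARTITIONED BY (sale_month STRING);
```

If a month of data still falls well under a block, drop the partitioning entirely and let the mappers scan the table.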
Use ORC or Parquet File format
Changing the underlying file format to ORC or Parquet can yield a significant performance gain.
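A minimal sketch of converting a text-backed table to ORC, assuming a hypothetical `sales_text` table already exists:

```sql
-- Columnar ORC storage compresses well and lets Hive skip
-- columns and row groups it does not need.
CREATE TABLE sales_orc (
  store_id INT,
  amount   DECIMAL(10,2)
)
STORED AS ORC;

-- One-time rewrite of the existing text data into ORC.
INSERT INTO TABLE sales_orc
SELECT store_id, amount FROM sales_text;
```

The same pattern applies with `STORED AS PARQUET` if Parquet fits your ecosystem better.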
Avoid Sequential Programming (Phase Development)
We need to find ways to program in parallel and exploit Hadoop's processing power. The core idea is to break the logic into multiple phases, structured so that independent steps can run at the same time.
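As an illustration of splitting logic into independent phases, the two staging steps below read different sources, so they can be submitted as two concurrent Hive sessions; a later phase joins their results (all table names are hypothetical):

```sql
-- phase_1a.hql: aggregate customers (independent of phase 1b)
CREATE TABLE stg_customer_agg AS
SELECT customer_id, COUNT(*) AS order_cnt
FROM   orders
GROUP  BY customer_id;

-- phase_1b.hql: aggregate products (independent of phase 1a)
CREATE TABLE stg_product_agg AS
SELECT product_id, SUM(qty) AS total_qty
FROM   order_items
GROUP  BY product_id;

-- phase_2.hql: runs only after both phase 1 scripts finish
CREATE TABLE final_report AS
SELECT c.customer_id, c.order_cnt, p.product_id, p.total_qty
FROM   stg_customer_agg c
CROSS  JOIN stg_product_agg p;
```

A scheduler (Oozie, cron, etc.) can launch the two phase-1 scripts in parallel and gate phase 2 on their completion.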
Pig vs. Hive
If you ask any Hadoop vendor (Cloudera, Hortonworks, etc.), you will not get a definitive answer, as they support both languages. Logic in Hive can sometimes be more complicated than the Pig equivalent, but I would still advise using Hive where possible, because you will need people to maintain the code in the future. Very few people know Pig, and it has a steep learning curve. There is also far less investment happening in Pig than in Hive: Hive is used across many industries, and companies like Hortonworks, AWS, and Microsoft are contributing to it.
Managed Table vs. External Table
Adopt a standard and stick to it, but I recommend Managed Tables: they are easier to govern. Use External Tables only when importing data from an external system. Define a schema over the incoming files, and from there the entire workflow can be built with Hive scripts that use managed tables.
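A sketch of this handoff, with hypothetical names and paths (the landing location would come from your ingest system):

```sql
-- External table over files landed by the ingest process; Hive
-- reads but does not own this data, so dropping the table does
-- not delete the files.
CREATE EXTERNAL TABLE raw_events (
  event_time STRING,
  payload    STRING
)
ROW FORMAT DELIMITED FIELDS TERMINATED BY '\t'
LOCATION '/data/landing/events';

-- Managed ORC table used by the rest of the workflow; Hive
-- owns its lifecycle and storage.
CREATE TABLE events
STORED AS ORC AS
SELECT event_time, payload FROM raw_events;
```

From this point on, every downstream phase reads and writes managed tables only.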
So how do you get the right outcome without using Views? Through Phase Development: land processed data into staging tables at each phase, then refine it step by step until you reach the final result.
Basic principles of Hadoop programming
1. Storage is cheap while processing is expensive.
2. You cannot complete a job in sub-seconds; it takes far longer than that, usually a few seconds to minutes.
3. It’s not a transactional system; source data is not changing until we import it again.