Choosing the Technology Stack for a Data Lake

A data lake is a sophisticated technology stack that requires integrating numerous technologies for ingestion, processing, and exploration. There are no standard rules for security, governance, operations, and collaboration, which complicates matters further. On top of that, you typically have hard SLAs for query processing time and for data ingestion (ETL) pipelines. Finally, the solution needs to scale from one user to thousands of users and from a kilobyte of data to a few petabytes.

Because the big data industry is changing rapidly, you need to select technology that is here to stay and robust enough to meet your SLAs. At OvalEdge, our objective is to give our customers and prospective customers as much detail as possible about each solution so that they can decide which one best fits their specific needs.

Factors to Consider for the Technology Stack

A business must weigh several factors before selecting its technology stack. The table below lists those factors and how they compare across three types of infrastructure: on-premise, cloud, and managed services.

Factor         | On-Premise                   | Cloud              | Managed Services
---------------+------------------------------+--------------------+-------------------
Maintenance    | Hard                         | Hard               | Easy
Monthly Cost   | Economic with large datasets | Predictable        | Predictable
Vendor Lock-in | Avoidable                    | Avoidable          | Not avoidable
Suitability    | For large corporations       | For all businesses | Ideal for startups
Investment     | Substantial in the beginning | More as data grows | More as data grows
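
The cost and investment rows above can be made concrete with a rough calculation. The following is a minimal Python sketch, not a pricing model: the function name break_even_month and all of its parameters are hypothetical, and it assumes that on-premise cost is an upfront investment plus a flat monthly cost, while cloud cost grows linearly with the amount of data stored.

```python
def break_even_month(upfront, on_prem_monthly, cloud_price_per_tb,
                     start_tb, growth_tb_per_month, horizon_months=120):
    """Return the first month in which the cumulative on-premise cost is
    no higher than the cumulative cloud cost, or None if that does not
    happen within the horizon. All inputs are illustrative assumptions."""
    on_prem_total = float(upfront)   # substantial investment in the beginning
    cloud_total = 0.0                # no upfront cost, pay as data grows
    for month in range(1, horizon_months + 1):
        data_tb = start_tb + growth_tb_per_month * (month - 1)
        on_prem_total += on_prem_monthly
        cloud_total += data_tb * cloud_price_per_tb
        if on_prem_total <= cloud_total:
            return month
    return None


if __name__ == "__main__":
    # Hypothetical figures, chosen only to show how the function is called.
    month = break_even_month(upfront=500_000, on_prem_monthly=8_000,
                             cloud_price_per_tb=25, start_tb=100,
                             growth_tb_per_month=50)
    print("Break-even month:", month)
```

With fast data growth the break-even arrives early, which is the sense in which on-premise is "economic with large datasets"; with little or no growth it may never arrive within the horizon.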

Storage

Storage options fall into two broad categories: cloud and on-premise. In the cloud, several companies offer managed storage services, including Amazon, Microsoft, and Google. On-premise, the primary option is HDFS (Hadoop Distributed File System).

Amazon S3

It is the most widely used storage technology for data lakes in the cloud. S3 is said to store about one-fourth of the world's data, which speaks to its scalability. It has other pros and cons as well.

Pros
  • Vastly scalable
  • Has enterprise features such as security, backup, and high availability and durability (S3 Standard is designed for 99.99% availability and 99.999999999% durability)
  • Price

Cons
  • Security: Managing S3 security is intricate. Consider the Alteryx incident, in which data in a misconfigured S3 bucket was publicly exposed. Even technologically advanced companies find it difficult to manage S3 security, and the risk of a mistake is high.
  • Small files: When you have lots of small files and want to analyze them together, S3 does not perform optimally.
  • API limitations: Pagination is hard to do through the API. The API also changes frequently, so it is hard to keep up with the latest release of the client.
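
To illustrate the pagination point above, here is a minimal Python sketch of the continuation-token loop that listing a large bucket requires. It follows the response shape of S3's ListObjectsV2 operation (Contents, IsTruncated, NextContinuationToken), but the helper names iter_keys and make_fake_lister are hypothetical, and the fake lister only stands in for a real client so that the sketch runs without AWS credentials.

```python
from typing import Any, Callable, Dict, Iterator, List, Optional


def iter_keys(list_page: Callable[..., Dict[str, Any]],
              bucket: str, prefix: str = "") -> Iterator[str]:
    """Yield every object key under a prefix, following continuation tokens."""
    token: Optional[str] = None
    while True:
        kwargs: Dict[str, Any] = {"Bucket": bucket, "Prefix": prefix}
        if token is not None:
            kwargs["ContinuationToken"] = token
        page = list_page(**kwargs)
        for obj in page.get("Contents", []):
            yield obj["Key"]
        if not page.get("IsTruncated"):
            break  # last page reached
        token = page["NextContinuationToken"]


def make_fake_lister(keys: List[str], page_size: int):
    """Build an in-memory stand-in for a paged listing call (assumption:
    used only for demonstration; tokens here are plain list offsets)."""
    def list_page(Bucket, Prefix="", ContinuationToken=None):
        matching = [k for k in keys if k.startswith(Prefix)]
        start = int(ContinuationToken) if ContinuationToken else 0
        chunk = matching[start:start + page_size]
        end = start + len(chunk)
        page = {"Contents": [{"Key": k} for k in chunk],
                "IsTruncated": end < len(matching)}
        if page["IsTruncated"]:
            page["NextContinuationToken"] = str(end)
        return page
    return list_page


if __name__ == "__main__":
    lister = make_fake_lister(["logs/a", "logs/b", "logs/c", "raw/x"], page_size=2)
    for key in iter_keys(lister, "demo-bucket", "logs/"):
        print(key)
```

With boto3, the list_objects_v2 method of an S3 client could be passed as list_page. The library also ships paginators (client.get_paginator("list_objects_v2")) that hide this loop.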

Azure Data Lake (ADL)

Microsoft recently launched ADL. We have run several proofs of concept (POCs) on ADL and found it easy to use and configure, particularly in combination with HDInsight.

Pros
  • Scalability
  • Connectivity to HDInsight for processing. ADL is designed to work with both small and large files and works well with Hadoop.
  • Support: We found Microsoft support to be more responsive than that of AWS or Google.

Cons
  • Limited knowledge available, as the service is new
  • Statistics refresh takes about 24 hours

Google Cloud Storage (GCS)

Use GCS if you plan to use BigQuery.

Hadoop Distributed File System (HDFS)

HDFS is the primary on-premise option. It is highly reliable but comparatively tricky to manage. Tools such as Cloudera Manager and Hortonworks Ambari help administer HDFS efficiently. Companies used to run into problems when upgrading a cluster or adding a node, but these issues have since been resolved, and HDFS has been quite stable from Hadoop 2.7.1 onwards.

Processing

Hadoop clusters

Hadoop has become almost synonymous with the data lake because of its wide presence and range of use cases. It is a framework for distributed processing of large datasets and can be deployed on-premise or in the cloud. Hortonworks, Cloudera, and MapR provide distributions of the open source Hadoop technology, while AWS, Microsoft, and Google offer their own Hadoop services: EMR, HDInsight, and Dataproc, respectively. The cloud stacks are mostly elastic and built on each vendor's proprietary storage, whereas CDH (Cloudera) and HDP (Hortonworks) are built on open source HDFS.

Spark clusters

Apache Spark provides a much faster engine for large-scale data processing by leveraging in-memory computing. It can run on Hadoop, on Mesos, in the cloud, or in a standalone environment, which makes it possible to create a unified compute layer across the enterprise.

Tools and Languages

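Hadoop and Spark form a new-generation processing layer, and a range of tools and languages is available for processing data on them. Some of these are:

  • Hive: SQL-like queries over data stored in Hadoop
  • MapReduce: batch programming model for distributed processing
  • Oozie: workflow scheduling for Hadoop jobs
  • Sqoop: bulk transfer between Hadoop and relational databases
  • NiFi: automation of data flows between systems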
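
As an illustration of the MapReduce model listed above, here is a minimal single-process Python sketch of a word count. It only mimics the map, shuffle, and reduce phases in memory and does not use the Hadoop API; the function names are hypothetical.

```python
from collections import defaultdict


def map_phase(line):
    """Map: emit a (word, 1) pair for every word in one input line."""
    for word in line.lower().split():
        yield word, 1


def shuffle(pairs):
    """Shuffle: group all emitted values by key, as the framework
    would do between the map and reduce phases."""
    groups = defaultdict(list)
    for key, value in pairs:
        groups[key].append(value)
    return groups


def reduce_phase(key, values):
    """Reduce: combine all values for one key into a single result."""
    return key, sum(values)


def word_count(lines):
    """Run the three phases in sequence over an iterable of lines."""
    pairs = (pair for line in lines for pair in map_phase(line))
    return dict(reduce_phase(k, v) for k, v in shuffle(pairs).items())


if __name__ == "__main__":
    print(word_count(["data lake", "Data pipeline lake"]))
```

On a real cluster the map and reduce functions run in parallel on many nodes and the shuffle moves data across the network. The sketch keeps only the logical structure.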
