Choosing the technology stack for a data lake

A data lake is a sophisticated technology stack that requires integrating numerous technologies for ingestion, processing, and exploration. There are no standard rules for security, governance, operations, and collaboration, which makes things more complicated. On top of that, you typically have hard SLAs for query processing time and for data ingestion ETL pipelines. Lastly, the solution needs to scale from one user to thousands of users and from a kilobyte of data to a few petabytes.

Because the big data industry is changing rapidly, you need to select technology that is here to stay and robust enough to meet your SLAs. At OvalEdge, our objective is to provide all possible details about each solution to our customers and prospective customers so that they can decide which one best fits their specific needs.

Factors to consider for the technology stack

A business must look into several factors before selecting its technology stack. The table below lists those factors and how they fare across three types of infrastructure: on-premise, cloud, and managed services.

Factors	On-Premise	Cloud	Managed Services
Maintenance	Hard	Hard	Easy
Monthly Cost	Economical with large datasets	Predictable	Predictable
Vendor lock-in	Avoidable	Avoidable	Not avoidable
Suitability	For large corporations	For all businesses	Ideal for startups
Investment	Substantial in the beginning	More as data grows	More as data grows
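
The cost and investment rows above describe a trade-off: on-premise needs a large upfront investment, while cloud spend grows with the data. A minimal sketch of that break-even arithmetic follows. The function name, its parameters, and all figures are hypothetical and only illustrate the shape of the calculation, not real pricing.

```python
def breakeven_month(upfront, onprem_monthly, cloud_start, cloud_growth):
    """Return the first month in which cumulative cloud spend exceeds
    cumulative on-premise spend, or None if that never happens in 10 years.

    upfront        -- one-time on-premise investment (hypothetical figure)
    onprem_monthly -- flat monthly on-premise running cost
    cloud_start    -- cloud bill in the first month
    cloud_growth   -- monthly growth rate of the cloud bill (0.05 = 5%)
    """
    onprem_total = float(upfront)
    cloud_total = 0.0
    cloud_bill = float(cloud_start)
    for month in range(1, 121):
        onprem_total += onprem_monthly
        cloud_total += cloud_bill
        if cloud_total > onprem_total:
            return month
        # The cloud bill grows as the stored data grows.
        cloud_bill *= 1 + cloud_growth
    return None


if __name__ == "__main__":
    # Purely illustrative numbers.
    print(breakeven_month(upfront=500_000, onprem_monthly=10_000,
                          cloud_start=8_000, cloud_growth=0.05))
```

If the function returns None, the cloud option stays cheaper over the whole ten-year window under the assumed growth rate.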

Storage

Storage options fall into two broad categories: cloud and on-premise. On the cloud, many companies offer managed storage services, including Amazon, Microsoft, and Google. On-premise, the primary option is HDFS (Hadoop Distributed File System).

Amazon S3

S3 is the most widely used storage technology for data lakes on the cloud. It reportedly holds about one-fourth of the world's data, which speaks to its scalability. It does, however, come with several pros and cons.

Pros

Vastly scalable

Offers enterprise features such as security, high availability, and backup. S3 Standard is designed for 99.99% availability and 99.999999999% (11 nines) durability.

Price
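
An availability percentage is easier to compare against an SLA once it is converted into a downtime budget. The sketch below shows that arithmetic. The function name is hypothetical, and a 365-day year is assumed.

```python
def downtime_minutes_per_year(availability_percent, days_per_year=365):
    """Convert an availability percentage into allowed downtime per year."""
    minutes_per_year = days_per_year * 24 * 60
    unavailable_fraction = 1 - availability_percent / 100
    return minutes_per_year * unavailable_fraction


if __name__ == "__main__":
    for pct in (99.9, 99.99, 99.999):
        print(f"{pct}% -> {downtime_minutes_per_year(pct):.2f} minutes/year")
```

For example, 99.99% availability allows a little under an hour of downtime per year.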

Cons

Security: S3 security is intricate to manage. Consider the Alteryx data exposure, which resulted from a misconfigured, publicly accessible S3 bucket. Even technologically advanced companies find it difficult to manage S3 security, and the risk of a mistake is high.

Small files: S3 does not perform optimally when you need to analyze a large number of small files together.

API limitations: Pagination is hard to do through the API, and the API changes so frequently that it is hard to keep up with the latest client release.
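
The pagination difficulty mentioned above comes from the token-based listing pattern: each response returns one page plus a token for the next. The sketch below shows that loop against an in-memory stand-in so it runs without AWS access. `FakeObjectStore`, `list_page`, and the response keys are hypothetical and do not mirror the real S3 API. In practice, the boto3 client's `get_paginator("list_objects_v2")` wraps an equivalent loop for you.

```python
from typing import Dict, Iterator, List, Optional


class FakeObjectStore:
    """In-memory stand-in for a paged object-listing API (hypothetical)."""

    def __init__(self, keys: List[str], page_size: int = 3) -> None:
        self._keys = sorted(keys)
        self._page_size = page_size

    def list_page(self, token: Optional[int] = None) -> Dict:
        """Return one page of keys, plus a continuation token if more remain."""
        start = token or 0
        end = start + self._page_size
        page = {
            "Keys": self._keys[start:end],
            "IsTruncated": end < len(self._keys),
        }
        if page["IsTruncated"]:
            page["NextToken"] = end
        return page


def iter_keys(store: FakeObjectStore) -> Iterator[str]:
    """Yield every key by following continuation tokens until the last page."""
    token = None
    while True:
        page = store.list_page(token)
        yield from page["Keys"]
        if not page["IsTruncated"]:
            break
        token = page["NextToken"]


if __name__ == "__main__":
    store = FakeObjectStore([f"logs/part-{i:03d}.json" for i in range(8)])
    for key in iter_keys(store):
        print(key)
```

Wrapping the loop in a generator keeps callers free of token handling, which is the part that tends to go wrong when it is repeated by hand.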

Azure Data Lake (ADL)

Microsoft recently launched ADL. We have run several proofs of concept (POCs) on ADL and found it easy to use and configure, particularly in combination with HDInsight.

Pros

Scalability
Connectivity to HDInsight for processing. ADL is designed to work with both small and large files and works well with Hadoop.
Support: We found Microsoft support to be more responsive than that of AWS or Google.

Cons

Limited community knowledge, since the service is relatively new
Statistics refresh takes about 24 hours

Google Cloud Storage (GCS)

Use GCS if you are planning to use BigQuery.

Hadoop Distributed File System (HDFS)

HDFS is the primary on-premise option. It is highly reliable but comparatively tricky to manage. Tools such as Cloudera Manager and Hortonworks Ambari help maintain HDFS efficiently. Companies used to face problems when upgrading HDFS or adding a node, but these issues have since been resolved, and HDFS has been fairly stable from Hadoop 2.7.1 onwards.

Processing

Hadoop clusters

Hadoop has become synonymous with the data lake through its vast presence and range of use cases. It is a framework for distributed processing of large datasets. Hadoop can be deployed on-premise or on the cloud. Hortonworks, Cloudera, and MapR provide distributions of the open-source Hadoop technology, while AWS, Microsoft, and Google offer their own distributions as EMR, HDInsight, and Dataproc, respectively. The cloud stacks are mostly elastic and built on the providers' proprietary storage, whereas CDH (Cloudera) and HDP (Hortonworks) are built on open-source HDFS.

Spark clusters

Apache Spark provides a much faster engine for large-scale data processing by leveraging in-memory computing. It can run on Hadoop, on Mesos, on the cloud, or in a standalone environment, creating a unified compute layer across the enterprise.

Tools and Languages

Hadoop and Spark form a new-generation processing layer and come with various tools and languages for processing data. Some of them are:

Hive
MapReduce
Oozie
Sqoop
NiFi
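
Of the tools listed above, MapReduce is the one that defines a programming model rather than a product, so it is worth a small illustration. The sketch below runs the three phases (map, shuffle, reduce) of a word count in plain Python on one machine. It only shows the shape of the model and does not use the Hadoop API. All function names are hypothetical.

```python
from collections import defaultdict


def map_phase(line):
    """Map: emit a (word, 1) pair for every word in one input line."""
    return [(word.lower(), 1) for word in line.split()]


def shuffle(pairs):
    """Shuffle: group all emitted values by key."""
    groups = defaultdict(list)
    for key, value in pairs:
        groups[key].append(value)
    return groups


def reduce_phase(key, values):
    """Reduce: combine the values of one key into a single result."""
    return key, sum(values)


def word_count(lines):
    """Run map, shuffle, and reduce in sequence over the input lines."""
    pairs = [pair for line in lines for pair in map_phase(line)]
    return dict(reduce_phase(k, v) for k, v in shuffle(pairs).items())


if __name__ == "__main__":
    print(word_count(["data lake on Hadoop", "Hadoop and Spark"]))
```

On a real cluster the map and reduce calls run in parallel on different nodes, and the framework performs the shuffle over the network.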
