Databases vs. Data Warehouses vs. Data Lakes

Learn what databases, data warehouses, and data lakes are and when to use them.

Every month I teach a Tech Term You Should Know (TTYSK) and a tech essay to level up your technical literacy and communicate well with dev teams. Ask me anything and I'll cover it in an upcoming issue.

This issue's TTYSK is "Big Data". Scroll to the end to learn more 👇

Special Event Announcement!

I’m hosting a webinar, How to Master Technical Literacy: Evolving your Approach, on Thursday, October 3rd and you’re invited!

In this talk, you'll find out where your current technical skills are, learn what skills are essential for building technical literacy as a PM, and evolve your approach to technical literacy.

Interested? Register for the special event 👉️ here. Can’t wait to see you!

Data Warehouses vs. Data Lakes

Chances are you know what a database is and when it is used. Databases are designed for real-time, transactional data, which means they’re built to handle constant updates and queries.
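To make "transactional" concrete, here's a minimal sketch using Python's built-in sqlite3 module as a stand-in for a production database. The accounts table, the names, and the amounts are all made up for illustration. The point is that a transfer either fully happens or doesn't happen at all.

```python
import sqlite3

def transfer(conn, sender, receiver, amount):
    """Move money between two accounts as one all-or-nothing transaction."""
    try:
        with conn:  # commits on success, rolls back if an exception is raised
            conn.execute(
                "UPDATE accounts SET balance = balance - ? WHERE name = ?",
                (amount, sender),
            )
            conn.execute(
                "UPDATE accounts SET balance = balance + ? WHERE name = ?",
                (amount, receiver),
            )
            (balance,) = conn.execute(
                "SELECT balance FROM accounts WHERE name = ?", (sender,)
            ).fetchone()
            if balance < 0:
                raise ValueError("insufficient funds")
        return True
    except ValueError:
        return False

def balances(conn):
    """Return the current balance of every account as a dictionary."""
    return dict(conn.execute("SELECT name, balance FROM accounts"))

# Hypothetical data: two accounts in an in-memory database.
conn = sqlite3.connect(":memory:")
conn.execute("CREATE TABLE accounts (name TEXT PRIMARY KEY, balance INTEGER)")
conn.executemany("INSERT INTO accounts VALUES (?, ?)", [("ana", 100), ("ben", 50)])
conn.commit()

transfer(conn, "ana", "ben", 30)   # succeeds and is committed
transfer(conn, "ana", "ben", 500)  # would overdraw, so both updates are rolled back
```

After both calls, only the first transfer has taken effect. That guarantee is what makes databases a good fit for live, constantly changing data.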

But what about data warehouses and data lakes?

Imagine your team is gearing up to launch the next big feature that relies heavily on advanced analytics to personalize the user experience. To support the feature, one person suggests using a data warehouse to dig into historical data, while someone else points out that a data lake might be better for handling all the unstructured data you’ll be collecting.

If these terms are a bit fuzzy to you, it’ll be hard to keep up during the discussion or understand the long-term implications of the decision.

So let’s break down what each of these data storage technologies does, how they differ and when to use them.

What is a Data Warehouse?

While a database handles real-time processing of transactions, a data warehouse is built for analyzing data.

Data warehouses are optimized for querying and reporting. They collect and store data from multiple sources, often through a process known as ETL (Extract, Transform, Load), where data is cleaned and organized before it’s stored. This makes it easier to generate reports, track trends over time, and make strategic decisions based on historical data.
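If ETL still feels abstract, here's a rough sketch of the three steps in Python. It's a toy version under clear assumptions: the CSV text, the orders table, and the cleaning rules are hypothetical, and an in-memory SQLite database stands in for a real warehouse. Real pipelines usually run on dedicated tooling.

```python
import csv
import io
import sqlite3

# Hypothetical raw export from a source system (messy on purpose).
RAW_ORDERS_CSV = """order_id,customer,amount,order_date
1, Alice ,19.99,2024-01-05
2,BOB,5.00,2024-01-06
3,alice,,2024-01-07
4,Cara,42.50,2024-02-01
"""

def extract(raw_text):
    """Extract: read rows from the source as dictionaries."""
    return list(csv.DictReader(io.StringIO(raw_text)))

def transform(rows):
    """Transform: drop incomplete rows and normalize names and types."""
    cleaned = []
    for row in rows:
        if not row["amount"].strip():
            continue  # skip orders with a missing amount
        cleaned.append((
            int(row["order_id"]),
            row["customer"].strip().lower(),
            float(row["amount"]),
            row["order_date"].strip(),
        ))
    return cleaned

def load(conn, rows):
    """Load: write the cleaned rows into the warehouse table."""
    conn.execute(
        "CREATE TABLE IF NOT EXISTS orders "
        "(order_id INTEGER PRIMARY KEY, customer TEXT, amount REAL, order_date TEXT)"
    )
    conn.executemany("INSERT INTO orders VALUES (?, ?, ?, ?)", rows)
    conn.commit()

warehouse = sqlite3.connect(":memory:")  # stand-in for a real warehouse
load(warehouse, transform(extract(RAW_ORDERS_CSV)))
```

Notice that the cleanup happens before anything is stored. By the time an analyst queries the orders table, customer names are consistent and incomplete rows are gone.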

For example, if you’re leading a team that needs to analyze customer behavior over the past five years, a data warehouse is where this information would be stored. Tools like Amazon Redshift, Google BigQuery, and Snowflake are popular choices for building and managing data warehouses.
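Here's a small sketch of the kind of question a warehouse is built to answer: a summary across years of history rather than a lookup of one record. SQLite is only a stand-in here, and the purchases table and its rows are invented for the example. Redshift, BigQuery, and Snowflake each have their own SQL dialect and setup.

```python
import sqlite3

conn = sqlite3.connect(":memory:")
conn.execute("CREATE TABLE purchases (customer TEXT, amount REAL, purchased_on TEXT)")
conn.executemany(
    "INSERT INTO purchases VALUES (?, ?, ?)",
    [
        ("alice", 20.0, "2022-03-14"),
        ("alice", 35.0, "2023-07-02"),
        ("bob", 15.0, "2023-01-20"),
        ("bob", 10.0, "2023-11-05"),
        ("cara", 50.0, "2024-02-01"),
    ],
)

def yearly_summary(conn):
    """Return (year, distinct customers, total revenue) for each year of history."""
    return conn.execute(
        """
        SELECT substr(purchased_on, 1, 4) AS purchase_year,
               COUNT(DISTINCT customer)   AS customers,
               SUM(amount)                AS revenue
        FROM purchases
        GROUP BY purchase_year
        ORDER BY purchase_year
        """
    ).fetchall()

for purchase_year, customers, revenue in yearly_summary(conn):
    print(purchase_year, customers, revenue)
```

The query scans the whole history and aggregates it. That is the access pattern warehouses are optimized for.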

What about a Data Lake?

If databases are structured and organized like a neatly arranged filing cabinet, and data warehouses are like a well-cataloged library, then a data lake is more like a vast, open reservoir where all kinds of data flow in.

A data lake can store massive amounts of raw data in its native format, whether it’s structured data from databases, semi-structured data like logs or JSON files, or unstructured data like text, images, and videos. This flexibility makes data lakes ideal for big data analytics, machine learning, and data science projects, where diverse data types are required.

For instance, if your product team is working on a machine learning model to predict customer churn, the raw data—clickstreams, social media mentions, transaction logs—might be stored in a data lake. From there, data scientists can pull in what they need, process it, and run their analyses.
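To illustrate the "store it raw, make sense of it later" idea, here's a minimal sketch in Python. A temporary local folder stands in for object storage, and the file names, folder layout, and event fields are all hypothetical. Deciding how to interpret the data only at read time is often called schema-on-read.

```python
import json
import tempfile
from pathlib import Path

def ingest(lake_root, relative_path, raw_bytes):
    """Store a file in the lake exactly as it arrived, with no cleaning or schema."""
    target = Path(lake_root) / relative_path
    target.parent.mkdir(parents=True, exist_ok=True)
    target.write_bytes(raw_bytes)
    return target

def read_clicks(lake_root):
    """Interpret the raw clickstream files only at analysis time."""
    events = []
    for path in sorted(Path(lake_root).glob("clickstream/*.jsonl")):
        for line in path.read_text().splitlines():
            if line.strip():
                events.append(json.loads(line))
    return events

lake = tempfile.mkdtemp()  # local folder standing in for object storage

# Very different kinds of data land side by side, untouched.
ingest(lake, "clickstream/2024-01-05.jsonl",
       b'{"user": "alice", "page": "/pricing"}\n{"user": "bob", "page": "/home"}\n')
ingest(lake, "support/ticket-17.txt", b"Customer asked how to cancel their plan.")
ingest(lake, "images/logo.png", b"\x89PNG placeholder bytes")

# A data scientist later pulls only what the churn analysis needs.
clicks = read_clicks(lake)
pricing_visitors = [event["user"] for event in clicks if event["page"] == "/pricing"]
```

Nothing was cleaned or reshaped on the way in, which is the opposite of the ETL approach a warehouse relies on.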

Popular platforms for building data lakes include Amazon S3, Azure Data Lake, and Hadoop.

A Quick Comparison:

|                | Data Warehouses                                          | Data Lakes                                        |
|----------------|----------------------------------------------------------|---------------------------------------------------|
| Data structure | Stores structured data optimized for analysis            | Stores both structured and unstructured data      |
| Purpose        | Used for historical analysis and reporting               | Used for big data analytics and machine learning  |
| Scalability    | More structured and less flexible in terms of data types | Highly scalable and flexible                      |

When should they be used?

  • Data Warehouses: When your team needs to perform historical data analysis, generate detailed reports, or track long-term trends, a data warehouse is the tool for the job.

  • Data Lakes: If your project involves big data, machine learning, or data that comes in a variety of formats (like logs, videos, or social media feeds), a data lake will provide the flexibility and scalability you need.

In many organizations, databases, data warehouses, and data lakes aren’t mutually exclusive—they work together as part of a comprehensive data strategy. For example, data might be collected and stored in a database for immediate use, then periodically moved to a data warehouse for long-term storage and analysis. At the same time, raw data from various sources could be ingested into a data lake for more complex analytics and machine learning tasks.

💡 Tech Term You Should Know (TTYSK)

Big Data

Big Data refers to the massive volumes of data that are generated every second from a wide variety of sources: social media posts, online transactions, sensor data from IoT devices, and much more. This data is so large and complex that it requires specialized tools to store and process.

But what exactly makes data "big data"? No exact numbers universally define it, but big data is commonly characterized by the 3 V’s: Volume, Velocity, and Variety.

  • Volume: This refers to the massive amount of data being generated. For example, think about all the data points collected from millions of users interacting with your app daily—every click, scroll, and purchase adds to the volume.

  • Velocity: This refers to the speed at which data is generated and processed, often in real-time or near-real-time. Big Data can involve thousands to millions of events per second, such as social media interactions, high-frequency trading, and live streaming.

  • Variety: Big Data typically includes both structured data (like database tables) and unstructured data (like text, images, and videos). Diverse data types aren't strictly required, but they are common, and much of Big Data's value comes from handling, integrating, and analyzing varied data from multiple sources.
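Volume and velocity are linked, and a bit of back-of-the-envelope arithmetic shows how. The sketch below estimates how much raw data a product generates per day from its event rate. The event rates and the 500-byte average event size are made-up numbers, not benchmarks.

```python
def daily_volume_gb(events_per_second, bytes_per_event):
    """Estimate raw data generated per day, in gigabytes (1 GB = 10**9 bytes)."""
    seconds_per_day = 24 * 60 * 60
    return events_per_second * bytes_per_event * seconds_per_day / 10**9

# Hypothetical workloads: events per second and average event size in bytes.
small_app = daily_volume_gb(events_per_second=50, bytes_per_event=500)
large_app = daily_volume_gb(events_per_second=200_000, bytes_per_event=500)

print(f"Small app: {small_app:.2f} GB per day")
print(f"Large app: {large_app / 1000:.2f} TB per day")
```

Under these assumptions, the small app produces about 2 GB per day, while the large one produces about 8.6 TB per day. The first fits comfortably in an ordinary database. The second is where specialized tools start to matter.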

It's important to note that what qualifies as Big Data in one context might not in another. For example, a small startup might consider terabytes of data to be Big Data, while a company like Google might not, given the scale at which it operates.

In Summary: While there isn't a strict numerical threshold that categorizes data as "Big Data," it generally involves datasets that are large enough in volume (ranging from terabytes to petabytes or more) and/or are generated at such high velocity (thousands to millions of events per second) that they require specialized technologies and approaches to process, store, and analyze effectively.
