An Introduction to Core Data Engineering Concepts
Welcome! This guide is designed to introduce you to the fundamental concepts of data engineering. If you’re new to the field, the terminology can seem a bit overwhelming. Our goal is to break down these core ideas into simple, understandable pieces.
We’ll start with the very basics of what data is and how it’s structured. From there, we’ll explore how data is processed and moved, how it’s organized in different types of databases, and finally, how it’s stored in large-scale systems like data warehouses and lakehouses. Think of this as your first step toward building a solid foundation in the world of data. Let’s begin!
——————————————————————————–
1. The Building Blocks: Understanding Data and Structure
Before we can build complex data pipelines, we need to understand the raw materials we’re working with. It all starts with data and the “blueprint,” or schema, that defines its shape.
What is Data?
At its core, data is simply information captured in individual units. That information can come in various forms, which we can categorize by its degree of organization, or “structure.”
- Structured Data: This is data that is highly organized and follows a predefined model, making it easy to browse and search.
  - Example: A relational database where information is neatly arranged in tables with rows and columns.
- Semi-structured Data: This data isn’t organized in a formal database structure but contains tags or markers to separate elements and create a hierarchy. It can be browsed or searched, but with some limitations.
  - Example: An XML file, which uses tags to define its contents.
- Unstructured Data: This is data that has no inherent organization or identifiable structure.
  - Example: A folder full of various loose files, such as documents, images, and videos. (All three forms appear in the short code sketch after this list.)
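To make these categories concrete, here is a minimal Python sketch, with made-up field names, values, and bytes, showing how each form of data might look in code:

```python
import json

# Structured: every record follows the same fixed columns (field names are made up).
structured_row = {"customer_id": 101, "name": "Ada", "signup_date": "2024-05-01"}

# Semi-structured: keys and nesting describe a hierarchy, but records can vary in shape.
semi_structured = json.loads('{"customer_id": 101, "orders": [{"sku": "A-1", "qty": 2}]}')

# Unstructured: raw bytes (an image, a video, free text) with no identifiable fields.
unstructured = b"\x89PNG\r\n..."  # e.g. the opening bytes of an image file

print(semi_structured["orders"][0]["sku"])  # A-1
```

Notice that only the structured and semi-structured records expose named fields a program can query directly; the unstructured bytes need extra processing (for example, image recognition or text extraction) before they yield queryable information.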
Schema vs. Schemaless: The Blueprint for Your Data
How we organize data is determined by its schema, which acts as a blueprint.
A schema formally defines how data is structured. It specifies the tables, the columns within those tables, and the specific data types each column can hold (e.g., integer, string). This upfront design ensures that data is consistent and predictable.
A schemaless approach is more flexible. The data still has structure, but you can store it without performing extensive, upfront data modeling. This flexibility is a key advantage when dealing with diverse or rapidly changing data.
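The contrast is easiest to see in code. Below is a small sketch using Python’s built-in sqlite3 module for the schema-first side and plain dictionaries standing in for a schemaless document store; the table and field names are invented for illustration:

```python
import sqlite3

# Schema-first: columns and types are declared up front, and every row must conform.
conn = sqlite3.connect(":memory:")
conn.execute("CREATE TABLE customers (id INTEGER PRIMARY KEY, name TEXT, signup_date TEXT)")
conn.execute("INSERT INTO customers VALUES (?, ?, ?)", (1, "Ada", "2024-05-01"))

# Schemaless: records are stored as they come; two documents in the same collection
# can carry different fields without any upfront modeling or migration.
documents = [
    {"id": 1, "name": "Ada"},
    {"id": 2, "name": "Grace", "loyalty_tier": "gold"},  # extra field, no schema change needed
]
```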
Now that we understand how data can be structured, let’s explore the two primary ways it’s processed and moved from one place to another.
——————————————————————————–
2. Moving Data: Batch vs. Stream Processing
Now that we’ve covered the different forms data can take, let’s look at how that data gets moved around. Data engineers use two main methods to process information as it flows from one system to another: batch processing and stream processing.
Batch Processing: Working with Collections
Batch processing involves collecting and processing data in large groups, or “batches,” at scheduled intervals. Think of it like collecting all your mail for the week and opening it on Saturday. It’s not a real-time process but is highly efficient and cost-effective for handling very large workloads.
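As a rough illustration, the following Python sketch (with made-up order data) mimics a nightly batch job: the records are collected first, then processed all at once in a single pass:

```python
from collections import defaultdict

# A (made-up) day's worth of orders, collected first and then processed together.
daily_orders = [
    {"customer": "Ada", "amount": 40.0},
    {"customer": "Grace", "amount": 25.5},
    {"customer": "Ada", "amount": 10.0},
]

def run_daily_batch(orders):
    """Process the whole batch in one pass, e.g. triggered by a nightly schedule."""
    totals = defaultdict(float)
    for order in orders:
        totals[order["customer"]] += order["amount"]
    return dict(totals)

print(run_daily_batch(daily_orders))  # {'Ada': 50.0, 'Grace': 25.5}
```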
Stream Processing: Data in Real-Time
Stream processing is designed to handle data in real time, processing it as soon as it arrives. This method is ideal for use cases that require immediate insights, such as real-time analytics or monitoring streaming video. Because it’s always “on,” it is generally more expensive than batch processing.
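By contrast, a stream processor handles each event the moment it shows up. This small Python sketch simulates that idea with a generator standing in for an always-on source such as a message queue (the events and timing are invented for illustration):

```python
import time

def event_stream():
    """Stand-in for an always-on source such as a message queue (simulated here)."""
    for amount in (40.0, 25.5, 10.0):
        yield {"amount": amount}
        time.sleep(0.1)  # events trickle in over time

running_total = 0.0
for event in event_stream():
    # Each event is handled the moment it arrives, keeping the result continuously up to date.
    running_total += event["amount"]
    print(f"running total so far: {running_total}")
```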
| Feature | Batch Processing | Stream Processing |
| --- | --- | --- |
| Timing | Scheduled (e.g., daily, hourly) | As it arrives (real-time) |
| Data Size | Large collections (batches) | Individual units or bits of data |
| Use Case | Very large processing workloads, scheduled reports | Real-time analytics, streaming videos, immediate insights |
| Cost | More cost-efficient | More expensive |
Once data is processed, it needs a place to live. Let’s look at the primary systems used to organize and store this data: databases.
——————————————————————————–
3. Organizing Data: Relational vs. Non-Relational Databases
Now that we know the two ways data can be moved—in batches or streams—we can explore the systems that organize and store it: databases. These are structured systems that allow data to be quickly accessed and searched, and they are broadly divided into two categories: relational (SQL) and non-relational (NoSQL).
Relational Databases (SQL)
Relational databases organize data into tables with predefined relationships, a format often called tabular data. The primary goal of their design is to ensure data integrity. A key feature is that their data is typically normalized, a design principle that reduces data redundancy and improves consistency by organizing data into discrete, related tables. This adherence to a predefined structure makes relational databases a prime example of the schema-based approach we discussed earlier.
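As a small illustration of normalization, the sketch below uses Python’s built-in sqlite3 module to create two related tables (the names and values are made up): each customer is stored exactly once, and a join brings the related order rows back together at query time:

```python
import sqlite3

conn = sqlite3.connect(":memory:")
# Normalized design: customers and orders live in separate, related tables,
# so each customer's name is stored exactly once.
conn.executescript("""
    CREATE TABLE customers (id INTEGER PRIMARY KEY, name TEXT);
    CREATE TABLE orders (
        id INTEGER PRIMARY KEY,
        customer_id INTEGER REFERENCES customers(id),
        amount REAL
    );
    INSERT INTO customers VALUES (1, 'Ada');
    INSERT INTO orders VALUES (10, 1, 40.0), (11, 1, 10.0);
""")

# A join reassembles the related rows at query time.
query = """
    SELECT c.name, SUM(o.amount)
    FROM customers c JOIN orders o ON o.customer_id = c.id
    GROUP BY c.name
"""
print(conn.execute(query).fetchall())  # [('Ada', 50.0)]
```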
Non-Relational Databases (NoSQL)
Non-relational databases store data in formats other than traditional tables, prioritizing performance and flexibility at a massive scale. Their flexible structure is a hallmark of the schemaless approach, designed for agility and scale. Common types include:
- Key-value: Data is stored in simple key-value pairs.
- Document: Data is stored in JSON-like documents.
- Column: Data is organized by column rather than by row, which can make analytical queries faster.
- Graph: Data is represented as a network of nodes and relationships.
To achieve faster query performance, non-relational databases often use denormalized data. This approach combines data into a single structure, even if it means some data is redundant.
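Here is what the same customer-and-orders data from the earlier relational sketch might look like as a single denormalized document (a hypothetical shape, for illustration only):

```python
# A denormalized record, shaped the way a document store might hold it: the
# customer's details are embedded alongside their orders, trading some
# redundancy for fast, join-free reads.
customer_doc = {
    "customer_id": 1,
    "name": "Ada",
    "orders": [
        {"order_id": 10, "amount": 40.0},
        {"order_id": 11, "amount": 10.0},
    ],
}

# One lookup returns everything about the customer; no join is needed.
total_spent = sum(order["amount"] for order in customer_doc["orders"])
print(total_spent)  # 50.0
```

The trade-off is deliberate: reads are fast because everything arrives in one piece, but if the customer’s name changes, it must be updated everywhere it is repeated.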
| Feature | Relational (SQL) | Non-Relational (NoSQL) |
| --- | --- | --- |
| Core Structure | Tables with rows, columns, and relationships | Various models (key-value, document, column, graph) |
| Schema | Predefined schema (structured) | Flexible or dynamic schema (schemaless) |
| Data Model Focus | Normalized to ensure data integrity | Denormalized to optimize for query performance and scale |
While databases are excellent for managing application data, large-scale analytics requires even bigger, more specialized storage systems.
——————————————————————————–
4. Storing Data: Warehouses, Lakes, and Lakehouses
Databases are great for managing day-to-day data, but for large-scale analytics and business intelligence, data engineers use even larger storage systems. Let’s explore the three main types: the data warehouse, the data lake, and the modern data lakehouse.
- Data Warehouse: A data warehouse is a relational data store designed specifically for analytic workloads. It stores structured data derived from various sources and is optimized for generating business reports and analytics. Because it’s designed for scheduled reporting, it is typically populated using batch processing.
- Data Lake: A data lake is a centralized repository that holds vast amounts of raw data in its native format. It can store semi-structured and unstructured data without a predefined schema, making it highly flexible. It can serve as a source for both batch and stream processing workloads for data scientists to explore.
- Data Lakehouse: A data lakehouse is a modern architecture that combines the best of both worlds. It merges the flexible, cost-effective storage of a data lake with the powerful data management and structuring features of a data warehouse. This hybrid model supports both business intelligence (BI) tasks and machine learning workloads effectively.
| Feature | Data Warehouse | Data Lake | Data Lakehouse |
| --- | --- | --- | --- |
| Primary Data Type | Structured and semi-structured | Raw, semi-structured, and unstructured | All types, with management features |
| Data Structure | Structured or semi-structured data for creating reports | Raw data in its native format without a predefined schema | Combines structured data management with flexible raw data storage |
| Key Use Case | Business Intelligence & Analytic Reports | Storing raw data for data science and ML | Unified platform for BI, analytics, and ML |
Now that we’ve seen how a data lakehouse combines the best of warehouses and lakes, let’s look at a popular method for organizing data within it: the Medallion Architecture.
——————————————————————————–
5. A Modern Approach: The Medallion Architecture
To effectively organize data within a modern lakehouse, data engineers often use a specific organizational pattern called the Medallion Architecture. It is a data quality framework used to logically organize data as it is progressively cleaned and transformed, consisting of three distinct layers: Bronze, Silver, and Gold.
- Bronze (Raw): This is the initial landing zone for raw data ingested from source systems. The data in this layer is often semi-structured or unclean. It is kept in its original state for archiving, validation, and reprocessing if needed.
- Silver (Cleaned & Filtered): In the Silver layer, the raw data from Bronze is transformed. Here, it is cleaned, deduplicated, joined with other data, and standardized. This data is now reliable and ready for use by analysts, dashboards, or machine learning pipelines.
- Gold (Business-Ready): The Gold layer contains data that has been further refined and aggregated for business consumption. This data is modeled to power specific business reports, analytics, and dashboards, providing high-level insights directly to business users.
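To tie the three layers together, here is a deliberately simplified, plain-Python sketch of data moving from Bronze to Silver to Gold (real lakehouse pipelines typically use engines such as Spark or SQL, and the records here are invented):

```python
# Bronze: raw records kept exactly as they arrived (duplicates, stray whitespace and all).
bronze = [
    {"id": 1, "customer": " Ada ", "amount": "40.0"},
    {"id": 1, "customer": " Ada ", "amount": "40.0"},  # duplicate from the source system
    {"id": 2, "customer": "Grace", "amount": "25.5"},
]

# Silver: cleaned, typed, and deduplicated records, ready for analysis.
silver = list({
    r["id"]: {"id": r["id"], "customer": r["customer"].strip(), "amount": float(r["amount"])}
    for r in bronze
}.values())

# Gold: aggregated, business-ready figures for reports and dashboards.
gold = {"total_revenue": round(sum(r["amount"] for r in silver), 2)}
print(gold)  # {'total_revenue': 65.5}
```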
——————————————————————————–
Conclusion
You’ve just completed a tour of the core concepts that form the bedrock of data engineering. We’ve journeyed from understanding the basic structure of data, to how it’s processed and stored, and finally to modern architectures like the lakehouse and the Medallion framework.
Grasping these fundamentals—from batch vs. stream to warehouses vs. lakes—provides a powerful foundation for anyone looking to build a career in data. With this knowledge, you are well-equipped to dive deeper into this exciting and ever-evolving field. Keep learning!