Apache Iceberg vs Delta Lake: What are the differences?


13 Oct 2024 · Amey Kolhe · 4 min read

The cloud data lakehouse is gaining momentum, driven by the evolution of table formats like Apache Iceberg, Delta Lake, and Hudi. With improved transactional support, ACID compliance, and additional features, data lakehouses look set to take center stage in replacing data warehouses and data lakes alike. But which table format do you choose? How do you compare Apache Iceberg vs Delta Lake?

This article will focus on comparing the two leading table formats:

  1. Apache Iceberg
  2. Delta Lake

Apache Iceberg vs Delta Lake


Comparing Iceberg and Delta


| Feature | Apache Iceberg | Delta Lake |
| --- | --- | --- |
| Transaction support (ACID) | Yes | Yes |
| File format | Parquet, ORC, Avro | Parquet |
| Schema evolution | Full | Partial |
| Partition evolution | Yes | No |
| Merge on read | Yes | No |
| Data versioning | Yes | Yes |
| Time travel queries | Yes | Yes |
| Object store cost optimization | Yes | Yes |

Although Apache Iceberg and Delta Lake converge on a similar set of capabilities, they are not identical technologies. They take different approaches to the same problems and arrive at similar outcomes: both achieve efficient storage optimization, data consistency, and performance improvements, but the underlying mechanisms they use to get there differ.

Key Differences


Manifest Files

Iceberg collects metadata about the datasets stored inside it and records that metadata in manifest files. Each manifest file points to the data files that make up the table, along with the partition values needed to retrieve that data efficiently. Together, these files form a historical record of changes made to the data, organized as a tree of Avro files.

Because they contain a history of changes made to the state of the dataset, manifest files make all of Iceberg's distinctive features possible. Manifest files allow Iceberg to handle:

  • Updates
  • Deletes
  • ACID transactions
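The tree structure described above can be sketched in plain Python. This is an illustrative model only: real Iceberg manifests are Avro files with a richer schema defined by the Iceberg table spec, and the field names and file paths below are simplified assumptions, not the actual format.

```python
# Illustrative sketch of Iceberg's metadata tree: a snapshot points at
# manifests, and each manifest lists data files plus partition values.
# Real manifests are Avro files; these dicts are a simplification.

snapshot = {
    "snapshot_id": 1,
    "manifest_list": "snap-1.avro",  # hypothetical path
    "manifests": [
        {
            "path": "manifest-a.avro",
            "partition": {"event_date": "2024-10-01"},
            "data_files": ["data/00000.parquet", "data/00001.parquet"],
        },
        {
            "path": "manifest-b.avro",
            "partition": {"event_date": "2024-10-02"},
            "data_files": ["data/00002.parquet"],
        },
    ],
}

def files_for_partition(snapshot, event_date):
    # Partition values stored in each manifest let a reader prune whole
    # manifests (and every data file they list) without opening them.
    return [
        f
        for m in snapshot["manifests"]
        if m["partition"]["event_date"] == event_date
        for f in m["data_files"]
    ]
```

The key point the sketch illustrates is that partition information lives in the metadata tree itself, so query planning can skip files before touching storage.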

Delta Log

Delta Lake also collects enhanced metadata, just like Iceberg. However, instead of manifest files, it stores this metadata in a directory known as the Delta Log. The log is structured as a flat, ordered list of JSON files, compacted periodically into Parquet checkpoint files. Each entry in the log tracks a corresponding change to the dataset.
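The replay model behind this can be sketched in a few lines of plain Python. This is a simplified assumption-level model, not the actual Delta protocol: real log entries carry many more fields (stats, commit info, schema), but the core idea is that the live table state is the set of files added and not later removed.

```python
import json

# Minimal sketch of replaying a Delta-style log: each JSON entry
# records an "add" or "remove" action, and the live table state is
# the set of files added but not later removed. Field names are
# simplified, not the actual Delta Lake protocol.

log_entries = [
    json.dumps({"add": {"path": "part-0000.parquet"}}),
    json.dumps({"add": {"path": "part-0001.parquet"}}),
    json.dumps({"remove": {"path": "part-0000.parquet"}}),
]

def replay(entries):
    live = set()
    for raw in entries:
        action = json.loads(raw)
        if "add" in action:
            live.add(action["add"]["path"])
        elif "remove" in action:
            live.discard(action["remove"]["path"])
    return live
```

Replaying the three entries above leaves only the file that was added and never removed, which is exactly how a reader reconstructs the current table state from the log.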

Iceberg Manifest files vs. Delta Lake Delta Log

Although both Delta Lake and Apache Iceberg maintain a comprehensive history of changes, their approaches differ. Delta Lake uses a Delta Log of JSON files with periodic Parquet checkpoints to track changes. This can make historical data retrieval efficient, but performance depends on checkpoint frequency. In contrast, Iceberg uses snapshots, with manifest files listing the data files and their partitions. Both approaches optimize historical data access, though their specific methods and efficiencies vary.
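Why checkpoint frequency matters can be shown with a small sketch, again using plain Python stand-ins rather than the real protocol: a reader starts from the latest checkpoint (here just a dict) and only replays the commits written after it, so the fewer commits between checkpoints, the less work each read has to do.

```python
# Sketch of checkpoint-based replay. A checkpoint summarizes the table
# state at some version; a reader applies only the commits that came
# after it. All structures here are simplified assumptions, not the
# actual Delta Lake checkpoint format.

checkpoint = {"version": 10, "live_files": {"part-0009.parquet"}}

commits_after_checkpoint = [
    {"version": 11, "add": "part-0010.parquet"},
    {"version": 12, "add": "part-0011.parquet"},
]

def state_at_latest(checkpoint, commits):
    # Start from the checkpointed state...
    live = set(checkpoint["live_files"])
    # ...then replay only the commits newer than the checkpoint.
    for commit in commits:
        live.add(commit["add"])
    return live
```

With frequent checkpoints the `commits` list stays short; with rare checkpoints a reader may replay hundreds of JSON files per query, which is the performance trade-off noted above.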


Deciding factors when considering Iceberg vs Delta Lake


Given the convergence of Iceberg and Delta Lake in features, the best way to compare the two technologies is on ecosystem and toolset integration.


Spark Integration

Delta Lake integrates deeply with Databricks and Spark, making it a strong candidate for organizations heavily invested in those technologies. Because these technologies have been developed within the same ecosystem, they have a tight, native integration.

Compute Engine Integration

Apache Iceberg and Delta Lake each integrate with a different set of compute engines. As with other ecosystem integrations, the right choice depends on your toolset and objectives. If you are a Spark user, Delta Lake may be the better choice; if you run an open data stack, Apache Iceberg is probably the better fit. In either case, compare your toolset, including the compute engine, against each table format to see the differences.

Cloud Integration and Optimization

The choice of cloud provider directly impacts a table format’s performance. Different ecosystems support different table formats and compute engines. To get optimal results, compare Apache Iceberg vs Delta Lake on the intended cloud platform, whether AWS, Azure, or GCP, and perform extensive testing and benchmarking to ensure the system is properly tuned for that cloud.
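One way to structure such benchmarking is a small timing harness that reports median latency over repeated runs rather than a single noisy measurement. This is a generic sketch: `run_query` is any zero-argument callable you supply (for example, one that submits the same query against an Iceberg table and then a Delta table on your target cloud).

```python
import statistics
import time

def median_latency(run_query, runs=5):
    # Execute the same workload several times and report the median
    # wall-clock latency, which is more stable than one measurement.
    samples = []
    for _ in range(runs):
        start = time.perf_counter()
        run_query()
        samples.append(time.perf_counter() - start)
    return statistics.median(samples)
```

Running the harness twice, once per table format, with identical queries and warm caches gives a like-for-like comparison on your actual cloud platform.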

Internal tables vs external table support

Compute engines integrate differently with Iceberg and Delta Lake depending on the ecosystem being used: some engines manage a format as internal tables, whereas others can only access it through external tables. When comparing the two formats, this difference is consequential, as it can lead to different performance outcomes.

Catalog Choices

With recent changes to Unity Catalog, Delta Lake is more open than ever. Unity Catalog’s goal is to improve interoperability across the data ecosystem. It provides a centralized governance layer for managing metadata, data access, and security across multiple data sources, including Delta Lake and Iceberg.


Conclusion


So where does this comparison of Apache Iceberg vs Delta Lake lead? For organizations looking to take the leap into a data lakehouse, there have never been more options available, and the two leading table formats have never been closer in the features they offer.