Background:
It was May/June 2024, when all the major tech companies were holding their annual conferences. As a core data professional and a data enthusiast at heart, and thanks to my following, my connections, and LinkedIn’s algorithm, my timeline was flooded with updates, announcements, and pictures from these conferences: colleagues and friends posting about their reunions, CEOs blogging about the major announcements, big enterprises releasing their prime offerings as open source, and whatnot.
In all of this chaos (which I love), three things kept catching my eye: Apache Hudi, Apache Iceberg, and Apache XTable. Knowing them only at a high level, I decided to explore them in more detail. Here is my take on each:
Delta Lake
But before we go there, let’s get a fair understanding of Delta Lake (another open source initiative under the Linux Foundation):
Delta Lake allows building a lakehouse architecture on top of existing data lakes. Unlike plain data lakes, Delta Lake tables support ACID transactions, handle metadata at scale, are highly scalable, and provide seamless integration of batch and streaming workloads. Other benefits of Delta Lake are time travel, upserts and deletes, and schema enforcement. These links should help if you want to know more about the lakehouse architecture or Delta Lake.
Apache Iceberg
Initially developed at Netflix and later donated to the Apache Software Foundation, Apache Iceberg is an open source table format built to address the challenges faced with Hive tables. Apache Iceberg supports schema evolution: columns can be added, dropped, updated, and renamed. These changes are made in the table metadata, so they do not affect the rest of the schema or the existing data and have no side effects.
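To make the "no side effects" point concrete, here is a minimal toy sketch in Python. It assumes columns are tracked by a stable numeric ID rather than by name, which is the idea Iceberg relies on; the ToyTable class and its methods are hypothetical names made up for this illustration and are not Iceberg's API.

```python
from dataclasses import dataclass, field


@dataclass
class ToyTable:
    """Toy table whose columns are tracked by a stable numeric ID (illustration only)."""
    schema: dict[int, str] = field(default_factory=dict)          # field ID -> current column name
    rows: list[dict[int, object]] = field(default_factory=list)   # stored values keyed by field ID
    next_id: int = 1

    def add_column(self, name: str) -> int:
        field_id = self.next_id
        self.schema[field_id] = name
        self.next_id += 1
        return field_id

    def rename_column(self, old: str, new: str) -> None:
        field_id = next(i for i, n in self.schema.items() if n == old)
        self.schema[field_id] = new   # only the schema changes, stored rows are untouched

    def drop_column(self, name: str) -> None:
        field_id = next(i for i, n in self.schema.items() if n == name)
        del self.schema[field_id]     # stored values remain, they are just no longer visible

    def read(self) -> list[dict[str, object]]:
        # Resolve the stored field IDs to the current column names at read time.
        return [{name: row.get(fid) for fid, name in self.schema.items()}
                for row in self.rows]


table = ToyTable()
id_col = table.add_column("id")
city_col = table.add_column("city")
table.rows.append({id_col: 1, city_col: "Pune"})

table.rename_column("city", "location")   # existing row is not rewritten
table.add_column("country")               # old rows simply read as None here
result = table.read()
print(result)
```

Because the stored rows are keyed by ID, the rename and the added column never touch existing data; only the schema mapping changes.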
3 Tier Architecture:
In the 3-tier architecture referenced below, Iceberg’s table format tracks individual data files rather than whole directories of data files. This way, a writer can create data files in place, and a file becomes part of the table only through an explicit commit.
The metadata file tracks the state of the table: its structure, its partition configuration, and the snapshots of the table. A snapshot is used to access all the data files of the table as of that moment.
A manifest file lists data files along with statistics about the data in each one, so that during processing some data files can be skipped based on these column-level stats. The manifest files for a table are listed in a manifest list. Each snapshot listed in metadata.json points to a manifest list.
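The three tiers described above can be sketched as a small Python model. This is a simplified illustration, not Iceberg's actual file layout or API: the class names, the file paths, and the single "id" column with min/max stats are all assumptions made for the example.

```python
from dataclasses import dataclass


@dataclass
class DataFile:
    path: str      # hypothetical file path
    min_id: int    # smallest "id" value stored in the file
    max_id: int    # largest "id" value stored in the file


@dataclass
class Manifest:
    data_files: list[DataFile]


@dataclass
class Snapshot:
    snapshot_id: int
    manifest_list: list[Manifest]   # stands in for the manifest list


@dataclass
class TableMetadata:
    current_snapshot_id: int
    snapshots: list[Snapshot]       # stands in for metadata.json


def plan_scan(metadata: TableMetadata, wanted_id: int) -> list[str]:
    """Return only the data files whose stats say they may contain wanted_id."""
    snapshot = next(s for s in metadata.snapshots
                    if s.snapshot_id == metadata.current_snapshot_id)
    selected = []
    for manifest in snapshot.manifest_list:
        for data_file in manifest.data_files:
            if data_file.min_id <= wanted_id <= data_file.max_id:
                selected.append(data_file.path)
    return selected


m1 = Manifest([DataFile("data/a.parquet", 1, 100),
               DataFile("data/b.parquet", 101, 200)])
m2 = Manifest([DataFile("data/c.parquet", 201, 300)])

snap1 = Snapshot(1, [m1])        # table state after the first commit
snap2 = Snapshot(2, [m1, m2])    # a second commit added one more manifest
metadata = TableMetadata(current_snapshot_id=2, snapshots=[snap1, snap2])

files_to_read = plan_scan(metadata, wanted_id=150)
print(files_to_read)
```

Note that data/c.parquet only becomes visible once a snapshot that references it is committed; a reader pinned to snapshot 1 never sees it.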
Apache Hudi
Originally developed at Uber because of the performance issues they faced with the Hive format, and later donated to the Apache Software Foundation, Apache Hudi (‘Hoodie’) is a streaming data lake platform. It brings database and data warehouse functionality to data lakehouses. Apache Hudi supports near-real-time streaming pipelines along with highly efficient incremental processing. It also supports unified ingestion for batch and streaming data, which helps bring data latency down to a few minutes. This suits scenarios where data freshness matters, but not down to the sub-second level, such as sensor data from assembly lines or system availability logs.
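The idea behind incremental processing can be sketched with a toy Python model: every upsert is stamped with a commit time, and a downstream job asks only for records changed after the last commit it processed. The ToyTimelineTable class, its methods, and the sensor values are hypothetical and exist only for this illustration; this is not Hudi's API.

```python
from dataclasses import dataclass


@dataclass
class Record:
    key: str
    value: float
    commit_time: str   # fixed-width timestamp, e.g. "20240601100000"


class ToyTimelineTable:
    """Toy table that remembers which commit last touched each record."""

    def __init__(self) -> None:
        self.records: dict[str, Record] = {}
        self.commits: list[str] = []

    def upsert(self, commit_time: str, batch: dict[str, float]) -> None:
        # Insert new keys and overwrite existing ones, stamping the commit time.
        for key, value in batch.items():
            self.records[key] = Record(key, value, commit_time)
        self.commits.append(commit_time)

    def incremental_read(self, since: str) -> list[Record]:
        # Fixed-width timestamps compare correctly as plain strings.
        changed = [r for r in self.records.values() if r.commit_time > since]
        return sorted(changed, key=lambda r: r.key)


table = ToyTimelineTable()
table.upsert("20240601100000", {"sensor-1": 20.5, "sensor-2": 19.0})
table.upsert("20240601100500", {"sensor-2": 19.4, "sensor-3": 22.1})

# A downstream job that already processed the first commit only pulls the delta.
changed = table.incremental_read(since="20240601100000")
print([(r.key, r.value) for r in changed])
```

The downstream job reads two records instead of rescanning the whole table, which is what keeps end-to-end latency in the range of minutes rather than hours.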
Apache XTable
Originally launched as open source under the name OneTable by Onehouse in collaboration with Microsoft and Google, then donated to the Apache Software Foundation and renamed XTable, the project is currently in the incubating phase. XTable is not a new table format; it is built to give you flexibility in the choice of table format based on your use case. As we saw, Iceberg, Delta Lake, and Hudi each have distinct features that make them the product of choice for specific requirements. XTable lets you convert tables from several source table formats to the desired target format by providing abstractions and translation tools.
A Few Reference Links:
- https://www.onehouse.ai/blog/apache-hudi-vs-delta-lake-vs-apache-iceberg-lakehouse-feature-comparison
- https://www.dremio.com/blog/exploring-the-architecture-of-apache-iceberg-delta-lake-and-apache-hudi/