Data warehouses and data lakes have evolved into lakehouses, and the concept is no longer new to anyone. With the growing popularity of lakehouses, experts worldwide have started publishing infrastructure and architecture best practices, comparisons of table formats (read here), and more. Let’s look at what is creating the buzz in lakehouse architecture: “Medallion Architecture”.
What is Medallion Architecture? Initially advocated by Databricks and later adopted by Microsoft in Microsoft Fabric, it is a way of organizing data in layers as it evolves in structure, formatting and readiness (fitness for purpose). The separation can be purely logical; the data need not be stored in, or moved between, different storage systems. Medallion architecture has three stages, namely Bronze, Silver and Gold, hence the name “Medallion”.
Bronze: This is structured, semi-structured or unstructured data from various sources (batch or streaming) in its rawest form. The table structure may match the format in the source, with the exception of some audit/supporting columns such as load date, job identifier, etc. In traditional terminology, this is the landing zone.
Silver: This layer is the next step after Bronze and stores semi-curated data sets. Curation here means filtering out unwanted, invalid or expired data and applying basic standard cleansing such as deduplication and the handling of master and cross-reference records. This layer can also be seen as the narrowed-down set of in-scope data for data engineering, data analysis and data science initiatives.
Gold: This is the final layer of progression in medallion architecture. It mostly serves use-case-specific data sets, curated further than the Silver layer and aggregated enough to meet consumer expectations. It is a natural home for final data sets modeled as star schemas and also supports the data mesh concept.
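The three hops described above can be sketched in a few lines of plain Python. This is only an illustration, not how Databricks or Microsoft Fabric implement it: the function names (`to_bronze`, `to_silver`, `to_gold`), the sample order records and the audit column names (`_load_date`, `_job_id`) are all assumptions made up for this example.

```python
from collections import defaultdict
from datetime import date


def to_bronze(raw_rows, job_id, load_date):
    """Land rows as-is, adding only audit columns (hypothetical names)."""
    return [{**row, "_load_date": load_date, "_job_id": job_id} for row in raw_rows]


def to_silver(bronze_rows):
    """Drop invalid rows and deduplicate on order_id (first occurrence wins)."""
    seen = set()
    silver = []
    for row in bronze_rows:
        amount = row.get("amount")
        # Basic validity rule for this sketch: key present, amount non-negative.
        if row.get("order_id") is None or amount is None or amount < 0:
            continue
        if row["order_id"] in seen:
            continue
        seen.add(row["order_id"])
        silver.append(row)
    return silver


def to_gold(silver_rows):
    """Aggregate to one total per customer for a specific use case."""
    totals = defaultdict(float)
    for row in silver_rows:
        totals[row["customer"]] += row["amount"]
    return dict(totals)


raw = [
    {"order_id": 1, "customer": "acme", "amount": 100.0},
    {"order_id": 1, "customer": "acme", "amount": 100.0},   # duplicate
    {"order_id": 2, "customer": "acme", "amount": 50.0},
    {"order_id": 3, "customer": "globex", "amount": -5.0},  # invalid amount
    {"order_id": 4, "customer": "globex", "amount": 20.0},
]

bronze = to_bronze(raw, job_id="job-001", load_date=date(2024, 1, 1))
silver = to_silver(bronze)
gold = to_gold(silver)
```

Note that nothing here requires three separate storage systems: each layer is just the previous one with more rules applied, which is the sense in which the separation can be purely logical.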
Old wine in a new bottle?
So far, we have seen that medallion architecture is a multi-hop approach to data curation: we filter and reduce the data set at every step, with the basic intention of delivering clean data that meets consumer requirements. But then how is it different from our traditional architecture?
Even in the good old days, we took the same approach: a landing zone where we dumped data without worrying about the nitty-gritty, only to understand later that much of it was never needed.
This data was then converted to a canonical structure by applying all the necessary filtering, validation and technical cleansing rules, and loaded into what we called the staging/cleansing zone.
Eventually, after all business rules were applied, this data was shared with consumers through outbound layers for various purposes.
Summary: Medallion architecture is not just data distributed across various layers but a good framework. There is no rule that there should be exactly one data set in each layer and, in the end, it is no different from
Raw-Cleansed-Outbound
or
Landing-Staging-ODS
or
Landing-Work-Presentation.
Final Take: Data projects are not Olympic events. Keep them away from the medal jargon!