In our experience delivering data solutions for our customers, we have observed a desire to move away from a centralised team function, responsible for data collection, analysis and reporting, towards shifting this responsibility to an organisation's lines of business (LOB) teams. The key driver for this is the recognition that LOBs retain the deep data knowledge and business understanding of their respective data domains, which improves the speed with which these teams can develop data solutions and gain customer insights. This shift away from centralised data engineering to LOBs has exposed a skills and tooling gap.

Let's assume as a starting point that the central data engineering team has chosen a project that migrates an on-premise data warehouse into a data lake (Spark + Iceberg + Redshift) on AWS, to provide a cost-effective way to serve data consumers thanks to Iceberg's ACID transaction features. The LOB data engineers are new to Spark and have a little experience in Python, while the majority of their work is based on SQL. Thanks to their expertise in SQL, they are able to get started building data transformation logic in Jupyter notebooks using PySpark. However, they soon find that the codebase grows quite large even during the minimum viable product (MVP) phase, and the issue would only be amplified as they extend it to cover the entire data warehouse. Additionally, the use of notebooks makes development challenging, mainly due to the lack of modularity and the failure to incorporate testing. Upon contacting the central data engineering team for assistance, they are advised that the team uses Scala and many other tools (e.g. Metorikku) that are successful for them but cannot be used directly by the engineers of the LOB. Moreover, the engineering team doesn't even have a suitable data transformation framework that supports Iceberg.

The LOB data engineering team understands that the data democratisation plan of the enterprise can be more effective if there is a tool or framework that:

- Can be shared across LOBs, although they can have different technology stacks and practices,
- Fits into various project types, from traditional data warehousing to data lakehouse projects, and
- Supports more than a notebook environment by facilitating code modularity and incorporating testing.

The data build tool (dbt) is an open-source command line tool, and it does the T in ELT (Extract, Load, Transform) processes well. It supports a wide range of data platforms, and the following key AWS analytics services are covered: Redshift, Glue, EMR and Athena. It is one of the most popular tools in the modern data stack and originally covered data warehousing projects. Its scope is extended to data lake projects by the addition of the dbt-spark and dbt-glue adapters, with which we can develop data lakes using Spark SQL. Recently the Spark adapter added open source table formats (Hudi, Iceberg and Delta Lake) as supported file formats, which allows you to work on data lakehouse projects with it.

As discussed in this blog post, dbt has clear advantages compared to Spark in terms of:

- Low learning curve, as SQL is easier than Spark
- Better code organisation, as there is no single correct way of organising a transformation pipeline with Spark

Its weaknesses, on the other hand, include:

- Lack of expressiveness, as Jinja is quite heavy and verbose, not very readable, and unit-testing is rather tedious
- Limitation of SQL, as some logic is much easier to implement with user defined functions than with SQL
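To make the point about user defined functions concrete, here is a minimal sketch of logic that is short as a function but tends to need a calendar table or a generated date sequence in plain SQL. The function name `business_days_between` and its parameters are hypothetical and do not come from the original post; it is only an illustration, not a recommended implementation.

```python
from datetime import date, timedelta


def business_days_between(start: date, end: date, holidays: frozenset = frozenset()) -> int:
    """Count weekdays in the half-open range [start, end) that are not holidays.

    Hypothetical helper for illustration only.
    """
    if end < start:
        raise ValueError("end must not be earlier than start")
    count = 0
    current = start
    while current < end:
        # weekday() returns 0-4 for Monday-Friday, 5-6 for the weekend
        if current.weekday() < 5 and current not in holidays:
            count += 1
        current += timedelta(days=1)
    return count


if __name__ == "__main__":
    # 1 January 2024 is a Monday, so the range covers one full week
    print(business_days_between(date(2024, 1, 1), date(2024, 1, 8)))
```

In a Spark job, a function like this could be wrapped as a user defined function and applied to date columns, whereas a SQL-only dbt model would typically express the same rule through joins against a date dimension.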