Running a SQL notebook without duplicates in Databricks
My task was to write a script that combines data from two silver tables and generates a gold table. If you are not familiar with the Lakehouse architecture, you can check here. A Lakehouse stores data in layers based on the level of processing: bronze, silver and gold.
The goal was to generate a gold table that would help monitor monthly platform usage. I had two options: SQL and PySpark. I tried both approaches, but the SQL version, which used a plain INSERT INTO, introduced duplicates in the gold table whenever I ran the notebook twice with the same parameters. That method was not robust enough.
After a long search, the WHERE NOT EXISTS clause got the job done. I have not experimented with the MERGE statement's WHEN NOT MATCHED clause, but I assume it can also work (a sketch is shown after the script below). What follows is not a mirror of my exact script, but it should be helpful if you want to ingest data into a table with SQL, either as a stored procedure or as a job.
USE database;

INSERT INTO gold_table
SELECT
    name,
    id,
    year,
    date,
    total_charges
FROM silver_table AS c
WHERE NOT EXISTS (
    SELECT 1
    FROM gold_table AS g
    WHERE c.name = g.name
      AND c.id = g.id
      AND c.year = g.year
      AND c.date = g.date
      AND c.total_charges = g.total_charges
);
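For completeness, here is a rough sketch of the MERGE alternative mentioned above. I have not run this version myself, so treat it as a starting point rather than a tested solution; it reuses the same table and column names and only inserts rows that do not already exist in the gold table.

-- Sketch of the MERGE / WHEN NOT MATCHED alternative (untested).
-- With no WHEN MATCHED clause, existing gold rows are left untouched
-- and only unmatched source rows are inserted.
MERGE INTO gold_table AS g
USING (
    SELECT name, id, year, date, total_charges
    FROM silver_table
) AS c
ON  c.name = g.name
AND c.id = g.id
AND c.year = g.year
AND c.date = g.date
AND c.total_charges = g.total_charges
WHEN NOT MATCHED THEN
    INSERT (name, id, year, date, total_charges)
    VALUES (c.name, c.id, c.year, c.date, c.total_charges);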
The assumptions are that the database and the gold table have already been created and that the table's schema is defined; a rough sketch of such a table definition is shown below.
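If the gold table does not exist yet, a minimal DDL sketch could look like the following. The column types here are my assumptions based on the column names in the query above, so adjust them to match your actual silver schema.

-- Minimal sketch of the gold table; column types are assumptions.
CREATE TABLE IF NOT EXISTS gold_table (
    name          STRING,
    id            STRING,
    year          INT,
    date          DATE,
    total_charges DOUBLE
)
USING DELTA;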
I believe this is not the only approach. You can also use Scala or PySpark, but if SQL is your choice, you can adopt this approach.