Blog

Data Quality Checks with Soda-Core in Databricks

It’s easy to run data quality checks on Spark with the soda-core library, which supports Spark dataframes. I’ve tested it in a Databricks environment and it worked without trouble. For the examples in this article I am loading the customers table from the tpch delta tables in the …

Databricks query federation with Snowflake. Easy and Fast!

Introduction: Just as it is possible to read and write data from Snowflake inside Databricks, it’s also possible to use Databricks query federation against diverse SQL engines, including Snowflake. The currently supported engines are: We are going to demonstrate how it works with Snowflake. We will first create a table in Databricks, …

Useful Databricks/Spark resources

Memory Profiling in PySpark: https://www.databricks.com/blog/2022/11/30/memory-profiling-pyspark.html
Run Databricks queries directly from VS Code: https://ganeshchandrasekaran.com/run-your-databricks-sql-queries-from-vscode-9c70c5d4903c
Spark Testing with chispa: https://github.com/alexott/spark-playground/tree/master/testing
Best Practices for Cost Management on Databricks: https://www.databricks.com/blog/2022/10/18/best-practices-cost-management-databricks.html
PySpark UDFs: https://docs.databricks.com/udf/python.html
Pandas UDFs: https://docs.databricks.com/udf/pandas.html
Introducing Pandas UDF for PySpark: https://www.databricks.com/blog/2017/10/30/introducing-vectorized-udfs-for-pyspark.html

Smallest Analytical Platform Ever!

In some of my free time I’ve started working on a project to build the smallest useful analytics platform in the cloud (starting with Azure). The purpose is to use it as a PoC to show to colleagues, managers and prospective customers, or just to have fun and play. It’s publicly available on my GitHub repo …

CentOS 8/9 Stream and AlmaLinux images for WSL

For those who, like me, are interested in running Linux systems on Windows for automation or administration tasks, here are the images I’ve found:
CentOS 7/8/9 Stream: https://github.com/mishamosher/CentOS-WSL
AlmaLinux (a CentOS replacement equivalent to RHEL): https://www.microsoft.com/en-us/p/almalinux-8-wsl/9nmd96xjj19f#activetab=pivot:overviewtab
The latter is a direct link to the Microsoft Store.

Databricks cluster policies at a glance. The easy way!

For those administering one or more Databricks workspaces, cluster policies are an important tool that we spend some time with. Introduction: But what are cluster policies? Cluster policies are basically a JSON file with some parameters that we use to allow (or not) users to select certain things when creating a cluster. Not only users …
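To make the "JSON file with some parameters" idea concrete, here is a minimal sketch of what a policy definition can look like, written as a Python dict and serialized to JSON. The rule types (fixed, allowlist, range) follow the Databricks policy definition format, but the concrete values (runtime version, node types, limits) are made up for illustration, and is_allowed is a hypothetical helper that only mimics the idea locally; it is not how Databricks itself evaluates a policy.

```python
import json

# Illustrative policy definition; attribute values are assumptions, not recommendations.
policy_definition = {
    # Users cannot change the runtime version.
    "spark_version": {"type": "fixed", "value": "13.3.x-scala2.12"},
    # Users may pick only from these node types.
    "node_type_id": {
        "type": "allowlist",
        "values": ["Standard_DS3_v2", "Standard_DS4_v2"],
        "defaultValue": "Standard_DS3_v2",
    },
    # Auto-termination must stay between 10 and 60 minutes.
    "autotermination_minutes": {
        "type": "range",
        "minValue": 10,
        "maxValue": 60,
        "defaultValue": 30,
    },
    # At most 4 workers per cluster.
    "num_workers": {"type": "range", "maxValue": 4},
}


def is_allowed(policy: dict, attribute: str, value) -> bool:
    """Hypothetical local check of one cluster attribute against the policy."""
    rule = policy.get(attribute)
    if rule is None:
        return True  # attribute not mentioned in the policy: unrestricted
    if rule["type"] == "fixed":
        return value == rule["value"]
    if rule["type"] == "allowlist":
        return value in rule["values"]
    if rule["type"] == "range":
        low = rule.get("minValue", float("-inf"))
        high = rule.get("maxValue", float("inf"))
        return low <= value <= high
    return True  # other rule types are not modelled in this sketch


# This JSON string is the kind of document you would paste into the policy editor.
policy_json = json.dumps(policy_definition, indent=2)
print(policy_json)
```

With this definition, a request for 2 workers passes the check while a request for 8 workers does not, which is the "allow (or not)" behaviour described above.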

Using Azure Private Endpoints with Databricks

In this article I will show how to avoid going out to the internet when using resources inside Azure, especially if they are in the same subscription and location (datacenter). Why might we want a private endpoint? That’s a good question. For both security and performance. Just like using TSCM Equipment for optimal safety and …

Databricks connectivity to Azure SQL / SQL Server

Most of the development work I see inside Databricks relies on fetching data from or writing data to some sort of database. Usually the preferred method for this is through a JDBC driver, as most databases offer one. In some cases, though, it’s also possible to use a Spark-optimized driver. This …
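As a rough sketch of the JDBC route, the snippet below only assembles the options that Spark’s generic JDBC data source expects for SQL Server. The server, database, table and credential values are placeholders, and sqlserver_jdbc_options is a hypothetical helper name, not something taken from the article.

```python
def sqlserver_jdbc_options(host: str, database: str, table: str,
                           user: str, password: str, port: int = 1433) -> dict:
    """Build the option dict for Spark's JDBC reader/writer (hypothetical helper)."""
    return {
        # Standard SQL Server JDBC URL shape: host, port, then the database name.
        "url": f"jdbc:sqlserver://{host}:{port};databaseName={database}",
        "dbtable": table,
        "user": user,
        "password": password,
        # Driver class shipped with the Microsoft JDBC driver for SQL Server.
        "driver": "com.microsoft.sqlserver.jdbc.SQLServerDriver",
    }


# Placeholder values; in a real notebook the password would come from a secret scope.
options = sqlserver_jdbc_options(
    host="myserver.database.windows.net",
    database="mydb",
    table="dbo.customers",
    user="my_user",
    password="my_password",
)
print(options["url"])

# Inside Databricks, where a `spark` session already exists, the dict would be used as:
# df = spark.read.format("jdbc").options(**options).load()
```

Keeping the options in a plain dict makes it easy to reuse the same settings for both reads and writes.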