<?xml version="1.0" encoding="UTF-8"?><rss version="2.0"
	xmlns:content="http://purl.org/rss/1.0/modules/content/"
	xmlns:wfw="http://wellformedweb.org/CommentAPI/"
	xmlns:dc="http://purl.org/dc/elements/1.1/"
	xmlns:atom="http://www.w3.org/2005/Atom"
	xmlns:sy="http://purl.org/rss/1.0/modules/syndication/"
	xmlns:slash="http://purl.org/rss/1.0/modules/slash/"
	>

<channel>
	<title>spark Archives - Albert Nogués</title>
	<atom:link href="https://www.albertnogues.com/tag/spark/feed/" rel="self" type="application/rss+xml" />
	<link>https://www.albertnogues.com/tag/spark/</link>
	<description>Data and Cloud Freelancer</description>
	<lastBuildDate>Fri, 31 May 2024 11:17:08 +0000</lastBuildDate>
	<language>en-US</language>
	<sy:updatePeriod>
	hourly	</sy:updatePeriod>
	<sy:updateFrequency>
	1	</sy:updateFrequency>
	

<image>
	<url>https://www.albertnogues.com/wp-content/uploads/2020/12/cropped-cropped-AlbertLogo2-32x32.png</url>
	<title>spark Archives - Albert Nogués</title>
	<link>https://www.albertnogues.com/tag/spark/</link>
	<width>32</width>
	<height>32</height>
</image> 
	<item>
		<title>Data Quality Checks with Soda-Core in Databricks</title>
		<link>https://www.albertnogues.com/data-quality-checks-with-soda-core-in-databricks/?utm_source=rss&#038;utm_medium=rss&#038;utm_campaign=data-quality-checks-with-soda-core-in-databricks</link>
		
		<dc:creator><![CDATA[Albert]]></dc:creator>
		<pubDate>Fri, 31 May 2024 11:17:06 +0000</pubDate>
				<category><![CDATA[Azure]]></category>
		<category><![CDATA[Databricks]]></category>
		<category><![CDATA[Python]]></category>
		<category><![CDATA[data]]></category>
		<category><![CDATA[data quality]]></category>
		<category><![CDATA[python]]></category>
		<category><![CDATA[soda]]></category>
		<category><![CDATA[spark]]></category>
		<category><![CDATA[sql]]></category>
		<guid isPermaLink="false">https://www.albertnogues.com/?p=3439</guid>

					<description><![CDATA[<p>It&#8217;s easy to run data quality checks when working with Spark using the soda-core library. The library has support for Spark DataFrames. I&#8217;ve tested it within a Databricks environment and it worked quite easily for me. For the examples in this article I am loading the customers table from the TPC-H Delta tables in the &#8230; </p>
<p>The post <a href="https://www.albertnogues.com/data-quality-checks-with-soda-core-in-databricks/">Data Quality Checks with Soda-Core in Databricks</a> appeared first on <a href="https://www.albertnogues.com">Albert Nogués</a>.</p>
]]></description>
										<content:encoded><![CDATA[
<p>It&#8217;s easy to run data quality checks when working with Spark using the soda-core library. The library has support for Spark DataFrames. I&#8217;ve tested it within a Databricks environment and it worked quite easily for me.</p>



<p>For the examples in this article I am loading the customers table from the TPC-H Delta tables in the databricks-datasets folder.</p>



<p>First of all we need to install the library, either scoped to our Databricks notebook or on our cluster. In my case I will install it notebook-scoped:</p>



<pre class="wp-block-code"><code>%pip install soda-core-spark-df</code></pre>



<p>Then we create a DataFrame from the TPC-H customers table:</p>



<pre class="wp-block-code"><code>#We create a table and read it into a dataframe
customer_df = spark.read.table("delta.`/databricks-datasets/tpch/delta-001/customer/`")</code></pre>



<p>We create a temporary view for our DataFrame so Soda can query the data and run the checks:</p>



<pre class="wp-block-code"><code>#We create a TempView
customer_df.createOrReplaceTempView("customer")</code></pre>



<p>And here comes the core of Soda: we define the checks using YAML (SodaCL) syntax:</p>



<pre class="wp-block-code"><code>from soda.scan import Scan
scan = Scan()
scan.set_scan_definition_name("Databricks Test Notebook")
scan.set_data_source_name("customer")
scan.add_spark_session(spark, data_source_name="customer")
#YAML Format
checks = '''
checks for customer:
  - row_count > 0
  - invalid_percent(c_phone) = 0:
      valid regex: ^&#91;0-9]{2}&#91;-]&#91;0-9]{3}&#91;-]&#91;0-9]{3}&#91;-]&#91;0-9]{4}$
  - duplicate_count(c_phone) = 0:
      name: No duplicate phone numbers
  - invalid_count(c_mktsegment) = 0:
      invalid values: &#91;HOUSEHOLD]
      name: HOUSEHOLD is not allowed as a Market Segment
'''
# you can use add_sodacl_yaml_file(s). Useful if the tests are in a github repo or FS
scan.add_sodacl_yaml_str(checks)
scan.execute()
print(scan.get_logs_text())</code></pre>


<div class="wp-block-image">
<figure class="aligncenter size-full"><img fetchpriority="high" decoding="async" width="570" height="322" src="https://www.albertnogues.com/wp-content/uploads/2024/05/Output1.png" alt="" class="wp-image-3440" srcset="https://www.albertnogues.com/wp-content/uploads/2024/05/Output1.png 570w, https://www.albertnogues.com/wp-content/uploads/2024/05/Output1-300x169.png 300w, https://www.albertnogues.com/wp-content/uploads/2024/05/Output1-106x60.png 106w" sizes="(max-width: 570px) 100vw, 570px" /></figure>
</div>


<p>More info: <a href="https://docs.soda.io/soda/quick-start-databricks.html">Add Soda to a Databricks notebook | Soda Documentation</a></p>



<p>List of validations: <a href="https://docs.soda.io/soda-cl/validity-metrics.html">Validity metrics | Soda Documentation</a> and <a href="https://docs.soda.io/soda-cl/metrics-and-checks.html">SodaCL metrics and checks | Soda Documentation</a></p>
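


<p>If we want the notebook (or the job that runs it) to fail when any check fails, the <code>Scan</code> object also exposes helpers such as <code>has_check_fails()</code> and <code>assert_no_checks_fail()</code> (check the soda-core docs for the exact API of your version). A minimal sketch of how we could react to the outcome after <code>scan.execute()</code>:</p>



<pre class="wp-block-code"><code>#React to the scan outcome programmatically
if scan.has_check_fails():
    #Print the logs and stop the notebook/job
    print(scan.get_logs_text())
    raise Exception("Soda data quality checks failed for table customer")
#Alternatively, scan.assert_no_checks_fail() raises an AssertionError for us</code></pre>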



<p>We can enhance this a bit and generate a Spark DataFrame out of the list of our warning or error validation checks:</p>



<pre class="wp-block-code"><code>from datetime import datetime
schema_checks = 'datasource STRING, table STRING, rule_name STRING, rule STRING, column STRING, check_status STRING, number_of_errors_in_sample INT, check_time TIMESTAMP'
list_of_checks = &#91;]
for c in scan.get_scan_results()&#91;'checks']:
    list_of_checks = list_of_checks + &#91;&#91;scan.get_scan_results()&#91;'defaultDataSource'], c&#91;'table'], c&#91;'name'], c&#91;'definition'], c&#91;'column'], c&#91;'outcome'], 0 if 'pass'in c&#91;'outcome'] else int(c&#91;'diagnostics']&#91;'blocks']&#91;0]&#91;'totalFailingRows']), datetime.strptime(scan.get_scan_results()&#91;'dataTimestamp'], '%Y-%m-%dT%H:%M:%S%z')]]
list_of_checks_df = spark.createDataFrame(list_of_checks,schema_checks)
display(list_of_checks_df)</code></pre>



<figure class="wp-block-image size-large is-resized"><img decoding="async" width="1024" height="403" src="https://www.albertnogues.com/wp-content/uploads/2024/05/DataFrameOutput-1024x403.png" alt="" class="wp-image-3441" style="width:840px;height:auto" srcset="https://www.albertnogues.com/wp-content/uploads/2024/05/DataFrameOutput-1024x403.png 1024w, https://www.albertnogues.com/wp-content/uploads/2024/05/DataFrameOutput-300x118.png 300w, https://www.albertnogues.com/wp-content/uploads/2024/05/DataFrameOutput-768x302.png 768w, https://www.albertnogues.com/wp-content/uploads/2024/05/DataFrameOutput-152x60.png 152w, https://www.albertnogues.com/wp-content/uploads/2024/05/DataFrameOutput.png 1328w" sizes="(max-width: 1024px) 100vw, 1024px" /></figure>



<p>In case we have the YAML file in our GitHub repo, we can read it and pass it in. Or, if we are working with Databricks Repos and the file is part of our repo, we can load it locally.</p>



<p>Accessing a remote file and reading it with requests:</p>



<pre class="wp-block-code"><code>#Trying to use a remote yaml file to enforce rules. We can upload it to a github of our own and use it in opur notebook.
#I've created a public repo so i dont need to authenticate to github, but in a real world scenario we should use private repo + secret scopes
customer_quality_rules = 'https://raw.githubusercontent.com/anogues/soda-core-quality-rules/main/soda-core-quality-rules-customer.yaml'
import requests
scan.add_sodacl_yaml_str(requests.get(customer_quality_rules).text)</code></pre>



<p>Or we can load it locally if we are using Databricks Repos:</p>



<pre class="wp-block-code"><code>scan.add_sodacl_yaml_file("your_file.yaml")</code></pre>
<p>The post <a href="https://www.albertnogues.com/data-quality-checks-with-soda-core-in-databricks/">Data Quality Checks with Soda-Core in Databricks</a> appeared first on <a href="https://www.albertnogues.com">Albert Nogués</a>.</p>
]]></content:encoded>
					
		
		
			</item>
		<item>
		<title>Query Delta Tables in the DataLake from PowerBi with Databricks</title>
		<link>https://www.albertnogues.com/query-delta-tables-in-the-datalake-from-powerbi-with-databricks/?utm_source=rss&#038;utm_medium=rss&#038;utm_campaign=query-delta-tables-in-the-datalake-from-powerbi-with-databricks</link>
		
		<dc:creator><![CDATA[Albert]]></dc:creator>
		<pubDate>Wed, 15 Nov 2023 18:47:00 +0000</pubDate>
				<category><![CDATA[Azure]]></category>
		<category><![CDATA[Databricks]]></category>
		<category><![CDATA[SQL]]></category>
		<category><![CDATA[databricks]]></category>
		<category><![CDATA[powerbi]]></category>
		<category><![CDATA[spark]]></category>
		<category><![CDATA[sql]]></category>
		<guid isPermaLink="false">https://www.albertnogues.com/?p=2195</guid>

					<description><![CDATA[<p>There are several ways to query Delta tables from Power BI. We are going to cover the 4th method here. To do it we first need a service principal, a Databricks secret scope backed by an Azure Key Vault, and the SPN password stored in that Key Vault. Once we have this, the first step is to &#8230; </p>
<p>The post <a href="https://www.albertnogues.com/query-delta-tables-in-the-datalake-from-powerbi-with-databricks/">Query Delta Tables in the DataLake from PowerBi with Databricks</a> appeared first on <a href="https://www.albertnogues.com">Albert Nogués</a>.</p>
]]></description>
										<content:encoded><![CDATA[
<p>There are several ways to query Delta tables from Power BI.</p>



<ul class="wp-block-list">
<li>You can use Snowflake with external stages reading the Delta data from the data lake,</li>



<li>You can use the Parquet connector and query the data directly from the data lake (caution: this method does not support SPNs and only works with Delta tables that have a single version),</li>



<li>You can use the Delta Sharing connector (not possible in our case because the workspace does not have Unity Catalog),</li>



<li>Or, the recommended way, you can use Databricks itself (with an SPN + a Databricks cluster, either a Data Engineering cluster or a SQL Warehouse).</li>
</ul>



<p>We are going to cover the 4th method here. To do it we first need a service principal, a Databricks secret scope backed by an Azure Key Vault, and the SPN password stored in that Key Vault.</p>



<p>Once we have this, the first step is to set up the cluster with the credentials to access the data lake. For this we need to configure the Spark variables of our Databricks cluster. You can follow the guide <a href="https://learn.microsoft.com/en-us/azure/databricks/getting-started/connect-to-azure-storage">here</a>.</p>
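


<p>To give an idea of what those Spark variables look like, here is a sketch of the equivalent settings applied from a notebook with <code>spark.conf.set</code>. The storage account is the one used later in this post; the secret scope, key name and application/tenant IDs are placeholders you should adapt:</p>



<pre class="wp-block-code"><code>#Orientative example: OAuth access to ADLS Gen2 with a service principal.
#Storage account, secret scope/key and SPN details below are placeholders.
storage_account = "albertdatabricks001"
client_secret = dbutils.secrets.get(scope="my-keyvault-scope", key="spn-password")

spark.conf.set(f"fs.azure.account.auth.type.{storage_account}.dfs.core.windows.net", "OAuth")
spark.conf.set(f"fs.azure.account.oauth.provider.type.{storage_account}.dfs.core.windows.net",
               "org.apache.hadoop.fs.azurebfs.oauth2.ClientCredsTokenProvider")
spark.conf.set(f"fs.azure.account.oauth2.client.id.{storage_account}.dfs.core.windows.net", "YourApplicationId")
spark.conf.set(f"fs.azure.account.oauth2.client.secret.{storage_account}.dfs.core.windows.net", client_secret)
spark.conf.set(f"fs.azure.account.oauth2.client.endpoint.{storage_account}.dfs.core.windows.net",
               "https://login.microsoftonline.com/YourTenantId/oauth2/token")</code></pre>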



<p>After that your cluster should have the credentials in the Spark conf section, something like this:</p>



<figure class="wp-block-image size-large"><img decoding="async" width="677" height="1024" src="https://www.albertnogues.com/wp-content/uploads/2023/11/image-1-677x1024.png" alt="" class="wp-image-2197" srcset="https://www.albertnogues.com/wp-content/uploads/2023/11/image-1-677x1024.png 677w, https://www.albertnogues.com/wp-content/uploads/2023/11/image-1-198x300.png 198w, https://www.albertnogues.com/wp-content/uploads/2023/11/image-1-768x1161.png 768w, https://www.albertnogues.com/wp-content/uploads/2023/11/image-1-40x60.png 40w, https://www.albertnogues.com/wp-content/uploads/2023/11/image-1.png 979w" sizes="(max-width: 677px) 100vw, 677px" /></figure>



<p>The second step is creating an EXTERNAL table in Databricks to point to our Delta table(s). For this we connect to our Databricks workspace with the previous configuration, create a new notebook and define the external tables we want, something like this:</p>



<pre class="wp-block-code"><code>%sql
CREATE TABLE IF NOT EXISTS anogues.customers_external 
LOCATION 'abfss://raw@albertdatabricks001.dfs.core.windows.net/customers'</code></pre>



<p>Make sure the location is the right one, otherwise when we query the data we will get either an error or no results.</p>



<p>Once we have this we can query our external table and verify we can see the data:</p>



<pre class="wp-block-code"><code>%sql
select * from anogues.customers_external LIMIT 5;</code></pre>



<p>And provided we did it right we should see the data. There is no need to define the table columns: Delta uses Parquet under the hood, so it&#8217;s a self-contained format where the schema is stored alongside the data.</p>
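


<p>We can see this self-describing behaviour from a notebook as well: reading the same path as a DataFrame gives us the schema without declaring any columns (the path below is the one used in the CREATE TABLE above):</p>



<pre class="wp-block-code"><code>#Read the Delta folder directly; the schema comes from the Delta/Parquet metadata
customers_df = spark.read.format("delta").load("abfss://raw@albertdatabricks001.dfs.core.windows.net/customers")
customers_df.printSchema()</code></pre>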



<p></p>



<figure class="wp-block-image size-large"><img loading="lazy" decoding="async" width="1024" height="531" src="https://www.albertnogues.com/wp-content/uploads/2023/11/image-20230817-142556-1024x531.png" alt="" class="wp-image-2198" srcset="https://www.albertnogues.com/wp-content/uploads/2023/11/image-20230817-142556-1024x531.png 1024w, https://www.albertnogues.com/wp-content/uploads/2023/11/image-20230817-142556-300x156.png 300w, https://www.albertnogues.com/wp-content/uploads/2023/11/image-20230817-142556-768x398.png 768w, https://www.albertnogues.com/wp-content/uploads/2023/11/image-20230817-142556-1536x796.png 1536w, https://www.albertnogues.com/wp-content/uploads/2023/11/image-20230817-142556-116x60.png 116w, https://www.albertnogues.com/wp-content/uploads/2023/11/image-20230817-142556.png 1682w" sizes="auto, (max-width: 1024px) 100vw, 1024px" /></figure>



<p>Once we have confirmed this is working we can go to Power BI and try to import data using the Databricks connector:</p>



<figure class="wp-block-image size-large"><img loading="lazy" decoding="async" width="1024" height="349" src="https://www.albertnogues.com/wp-content/uploads/2023/11/image-20230817-142627-1024x349.png" alt="" class="wp-image-2199" srcset="https://www.albertnogues.com/wp-content/uploads/2023/11/image-20230817-142627-1024x349.png 1024w, https://www.albertnogues.com/wp-content/uploads/2023/11/image-20230817-142627-300x102.png 300w, https://www.albertnogues.com/wp-content/uploads/2023/11/image-20230817-142627-768x262.png 768w, https://www.albertnogues.com/wp-content/uploads/2023/11/image-20230817-142627-176x60.png 176w, https://www.albertnogues.com/wp-content/uploads/2023/11/image-20230817-142627.png 1214w" sizes="auto, (max-width: 1024px) 100vw, 1024px" /></figure>



<p>To configure the connector we need to get some details from our cluster. These can be found in the advanced options of our cluster, in the JDBC/ODBC tab:</p>



<figure class="wp-block-image size-large"><img loading="lazy" decoding="async" width="1024" height="966" src="https://www.albertnogues.com/wp-content/uploads/2023/11/image-20230817-142733-1024x966.png" alt="" class="wp-image-2200" srcset="https://www.albertnogues.com/wp-content/uploads/2023/11/image-20230817-142733-1024x966.png 1024w, https://www.albertnogues.com/wp-content/uploads/2023/11/image-20230817-142733-300x283.png 300w, https://www.albertnogues.com/wp-content/uploads/2023/11/image-20230817-142733-768x726.png 768w, https://www.albertnogues.com/wp-content/uploads/2023/11/image-20230817-142733-64x60.png 64w, https://www.albertnogues.com/wp-content/uploads/2023/11/image-20230817-142733.png 1228w" sizes="auto, (max-width: 1024px) 100vw, 1024px" /></figure>



<p>On the following screen we need to select our authentication option to connect to the Databricks cluster. Since we have SAML + SCIM enabled in our workspaces, the username and password option is not possible. We either need a Databricks PAT token or Azure AD. I recommend the latter:</p>



<figure class="wp-block-image size-large"><img loading="lazy" decoding="async" width="1024" height="507" src="https://www.albertnogues.com/wp-content/uploads/2023/11/image-20230817-142917-1024x507.png" alt="" class="wp-image-2201" srcset="https://www.albertnogues.com/wp-content/uploads/2023/11/image-20230817-142917-1024x507.png 1024w, https://www.albertnogues.com/wp-content/uploads/2023/11/image-20230817-142917-300x149.png 300w, https://www.albertnogues.com/wp-content/uploads/2023/11/image-20230817-142917-768x381.png 768w, https://www.albertnogues.com/wp-content/uploads/2023/11/image-20230817-142917-121x60.png 121w, https://www.albertnogues.com/wp-content/uploads/2023/11/image-20230817-142917.png 1235w" sizes="auto, (max-width: 1024px) 100vw, 1024px" /></figure>



<p>We click on it and select our AAD account. If all works well our session will be started. We should see it on the screen:</p>



<figure class="wp-block-image size-large"><img loading="lazy" decoding="async" width="1024" height="506" src="https://www.albertnogues.com/wp-content/uploads/2023/11/image-20230817-143025-1024x506.png" alt="" class="wp-image-2202" srcset="https://www.albertnogues.com/wp-content/uploads/2023/11/image-20230817-143025-1024x506.png 1024w, https://www.albertnogues.com/wp-content/uploads/2023/11/image-20230817-143025-300x148.png 300w, https://www.albertnogues.com/wp-content/uploads/2023/11/image-20230817-143025-768x379.png 768w, https://www.albertnogues.com/wp-content/uploads/2023/11/image-20230817-143025-121x60.png 121w, https://www.albertnogues.com/wp-content/uploads/2023/11/image-20230817-143025.png 1223w" sizes="auto, (max-width: 1024px) 100vw, 1024px" /></figure>



<p>Then we click on connect, and we can see our data. Since we don&#8217;t have Unity Catalog, our table should appear in the hive_metastore catalog. There we can find our database and, inside it, our table(s). We click and either load all the tables we want or start transforming them inside Power BI, just as we would with any other data source.</p>



<figure class="wp-block-image size-full"><img loading="lazy" decoding="async" width="867" height="689" src="https://www.albertnogues.com/wp-content/uploads/2023/11/pbi.png" alt="" class="wp-image-2208" srcset="https://www.albertnogues.com/wp-content/uploads/2023/11/pbi.png 867w, https://www.albertnogues.com/wp-content/uploads/2023/11/pbi-300x238.png 300w, https://www.albertnogues.com/wp-content/uploads/2023/11/pbi-768x610.png 768w, https://www.albertnogues.com/wp-content/uploads/2023/11/pbi-76x60.png 76w" sizes="auto, (max-width: 867px) 100vw, 867px" /></figure>



<p>For more help, here is the documentation for the Power BI Databricks connector: <a href="https://learn.microsoft.com/en-us/azure/databricks/partners/bi/power-bi#--connect-power-bi-desktop-to-azure-databricks-manually">Connect Power BI to Azure Databricks &#8211; Azure Databricks | Microsoft Learn</a></p>
<p>The post <a href="https://www.albertnogues.com/query-delta-tables-in-the-datalake-from-powerbi-with-databricks/">Query Delta Tables in the DataLake from PowerBi with Databricks</a> appeared first on <a href="https://www.albertnogues.com">Albert Nogués</a>.</p>
]]></content:encoded>
					
		
		
			</item>
		<item>
		<title>Databricks query federation with Snowflake. Easy and Fast!</title>
		<link>https://www.albertnogues.com/databricks-query-federation-with-snowflake-easy-and-fast/?utm_source=rss&#038;utm_medium=rss&#038;utm_campaign=databricks-query-federation-with-snowflake-easy-and-fast</link>
		
		<dc:creator><![CDATA[Albert]]></dc:creator>
		<pubDate>Tue, 31 Jan 2023 12:15:05 +0000</pubDate>
				<category><![CDATA[BigData]]></category>
		<category><![CDATA[Databricks]]></category>
		<category><![CDATA[Snowflake]]></category>
		<category><![CDATA[SQL]]></category>
		<category><![CDATA[Cloud]]></category>
		<category><![CDATA[databricks]]></category>
		<category><![CDATA[python]]></category>
		<category><![CDATA[snowflake]]></category>
		<category><![CDATA[spark]]></category>
		<category><![CDATA[sql]]></category>
		<guid isPermaLink="false">https://www.albertnogues.com/?p=1757</guid>

					<description><![CDATA[<p>Introduction In the same way that it is possible to read and write Snowflake data from inside Databricks, it is also possible to use Databricks query federation against several SQL engines, including Snowflake. The currently supported engines are: We are going to demonstrate how it works with Snowflake. We will first create a table in Databricks, &#8230; </p>
<p>The post <a href="https://www.albertnogues.com/databricks-query-federation-with-snowflake-easy-and-fast/">Databricks query federation with Snowflake. Easy and Fast!</a> appeared first on <a href="https://www.albertnogues.com">Albert Nogués</a>.</p>
]]></description>
										<content:encoded><![CDATA[
<h2 class="wp-block-heading">Introduction</h2>



<p>In the same way that it is possible to read and write Snowflake data from inside Databricks, it is also possible to use Databricks query federation against several SQL engines, including <a href="http://www.snowflake.com" target="_blank" rel="noopener" title="snowflake">Snowflake</a>. The currently supported engines are:</p>



<ul class="wp-block-list">
<li><a href="https://docs.databricks.com/query-federation/postgresql.html">PostgreSQL</a></li>



<li><a href="https://docs.databricks.com/query-federation/mysql.html">MySQL</a></li>



<li><a href="https://docs.databricks.com/query-federation/snowflake.html">Snowflake</a></li>



<li><a href="https://docs.databricks.com/query-federation/redshift.html">Redshift</a></li>



<li><a href="https://docs.databricks.com/query-federation/synapse.html">Synapse</a></li>



<li><a href="https://docs.databricks.com/query-federation/sql-server.html">SQL Server</a></li>
</ul>



<p>We are going to demonstrate how it works with Snowflake. We will first create a table in Databricks; it can be a Delta table stored in the data lake, an unmanaged table pointing to a set of files (external table), or anything in between.</p>



<h2 class="wp-block-heading">Creating the required resources</h2>



<p>I will go to Databricks and run the following:</p>



<pre class="wp-block-code"><code>CREATE OR REPLACE TABLE default.DATABRICKS_ALBERT(
NAME STRING,
SEX STRING);

INSERT INTO default.DATABRICKS_ALBERT(NAME, SEX) VALUES ('Albert','Male');</code></pre>



<p>This process can either be done from the Data Engineering persona or the SQL persona. I created it first from the data engineering part.</p>



<figure class="wp-block-image size-large"><img loading="lazy" decoding="async" width="1024" height="558" src="https://www.albertnogues.com/wp-content/uploads/2023/01/image-3-1024x558.png" alt="" class="wp-image-1762" srcset="https://www.albertnogues.com/wp-content/uploads/2023/01/image-3-1024x558.png 1024w, https://www.albertnogues.com/wp-content/uploads/2023/01/image-3-300x164.png 300w, https://www.albertnogues.com/wp-content/uploads/2023/01/image-3-768x419.png 768w, https://www.albertnogues.com/wp-content/uploads/2023/01/image-3-110x60.png 110w, https://www.albertnogues.com/wp-content/uploads/2023/01/image-3.png 1332w" sizes="auto, (max-width: 1024px) 100vw, 1024px" /></figure>



<p>Now we go to snowflake and create a second table which we want to join:</p>



<pre class="wp-block-code"><code>CREATE TABLE SANDBOX.DEFAULT.DATABRICKS_FEDERATED(
NAME STRING,
AGE INTEGER);

INSERT INTO SANDBOX.DEFAULT.DATABRICKS_FEDERATED (NAME, AGE) VALUES ('Albert', 37);</code></pre>



<p>And it gets created successfully:</p>



<figure class="wp-block-image size-large"><img loading="lazy" decoding="async" width="1024" height="433" src="https://www.albertnogues.com/wp-content/uploads/2023/01/image-1-1024x433.png" alt="" class="wp-image-1759" srcset="https://www.albertnogues.com/wp-content/uploads/2023/01/image-1-1024x433.png 1024w, https://www.albertnogues.com/wp-content/uploads/2023/01/image-1-300x127.png 300w, https://www.albertnogues.com/wp-content/uploads/2023/01/image-1-768x325.png 768w, https://www.albertnogues.com/wp-content/uploads/2023/01/image-1-142x60.png 142w, https://www.albertnogues.com/wp-content/uploads/2023/01/image-1.png 1189w" sizes="auto, (max-width: 1024px) 100vw, 1024px" /></figure>



<p>Now we can stay in the Data Engineering persona or go to Databricks SQL and create the virtual table that links to the table in Snowflake. This will do the mapping:</p>



<pre class="wp-block-code"><code>CREATE TABLE MY_TABLE
USING snowflake
OPTIONS(
dbtable 'YourTable',
sfURL 'yourURL.snowflakecomputing.com', --You can use privatelink if you have one
sfUser 'YourUser',
sfPassword 'YourPassword',
sfDatabase 'YourDB', --You can use a secret scope like: secret('scope_name', 'pwd_entry'),
sfSchema 'YourSchema',
sfWarehouse 'YourSnowflakeWarehouse'
);</code></pre>



<p>In Databricks SQL:</p>



<figure class="wp-block-image size-large"><img loading="lazy" decoding="async" width="1024" height="482" src="https://www.albertnogues.com/wp-content/uploads/2023/01/image-4-1024x482.png" alt="" class="wp-image-1763" srcset="https://www.albertnogues.com/wp-content/uploads/2023/01/image-4-1024x482.png 1024w, https://www.albertnogues.com/wp-content/uploads/2023/01/image-4-300x141.png 300w, https://www.albertnogues.com/wp-content/uploads/2023/01/image-4-768x361.png 768w, https://www.albertnogues.com/wp-content/uploads/2023/01/image-4-1536x723.png 1536w, https://www.albertnogues.com/wp-content/uploads/2023/01/image-4-127x60.png 127w, https://www.albertnogues.com/wp-content/uploads/2023/01/image-4.png 1891w" sizes="auto, (max-width: 1024px) 100vw, 1024px" /></figure>



<p>In the Data Engineering persona:</p>



<figure class="wp-block-image size-large"><img loading="lazy" decoding="async" width="1024" height="646" src="https://www.albertnogues.com/wp-content/uploads/2023/01/image-5-1024x646.png" alt="" class="wp-image-1764" srcset="https://www.albertnogues.com/wp-content/uploads/2023/01/image-5-1024x646.png 1024w, https://www.albertnogues.com/wp-content/uploads/2023/01/image-5-300x189.png 300w, https://www.albertnogues.com/wp-content/uploads/2023/01/image-5-768x484.png 768w, https://www.albertnogues.com/wp-content/uploads/2023/01/image-5-1536x968.png 1536w, https://www.albertnogues.com/wp-content/uploads/2023/01/image-5-95x60.png 95w, https://www.albertnogues.com/wp-content/uploads/2023/01/image-5.png 1702w" sizes="auto, (max-width: 1024px) 100vw, 1024px" /></figure>



<blockquote class="wp-block-quote is-layout-flow wp-block-quote-is-layout-flow">
<h2 class="wp-block-heading">Glueing it up together</h2>



<p>If you use Databricks SQL I couldn’t make it work with the STARTER ENDPOINT so be sure to use a normal WAREHOUSE to avoid any errors. In my case i created an XS warehouse, but now i can run a query fetching data from both tables:</p>
</blockquote>



<pre class="wp-block-code"><code>select a.name, a.sex, b.age
FROM default.DATABRICKS_ALBERT a , default.SNOWFLAKE_ALBERT b
where a.name=b.name</code></pre>



<figure class="wp-block-image size-large"><img loading="lazy" decoding="async" width="1024" height="884" src="https://www.albertnogues.com/wp-content/uploads/2023/01/image-6-1024x884.png" alt="" class="wp-image-1765" srcset="https://www.albertnogues.com/wp-content/uploads/2023/01/image-6-1024x884.png 1024w, https://www.albertnogues.com/wp-content/uploads/2023/01/image-6-300x259.png 300w, https://www.albertnogues.com/wp-content/uploads/2023/01/image-6-768x663.png 768w, https://www.albertnogues.com/wp-content/uploads/2023/01/image-6-70x60.png 70w, https://www.albertnogues.com/wp-content/uploads/2023/01/image-6.png 1081w" sizes="auto, (max-width: 1024px) 100vw, 1024px" /></figure>



<p>And in the Data Engineering Persona:</p>



<figure class="wp-block-image size-large"><img loading="lazy" decoding="async" width="1024" height="679" src="https://www.albertnogues.com/wp-content/uploads/2023/01/image-7-1024x679.png" alt="" class="wp-image-1766" srcset="https://www.albertnogues.com/wp-content/uploads/2023/01/image-7-1024x679.png 1024w, https://www.albertnogues.com/wp-content/uploads/2023/01/image-7-300x199.png 300w, https://www.albertnogues.com/wp-content/uploads/2023/01/image-7-768x509.png 768w, https://www.albertnogues.com/wp-content/uploads/2023/01/image-7-1536x1018.png 1536w, https://www.albertnogues.com/wp-content/uploads/2023/01/image-7-91x60.png 91w, https://www.albertnogues.com/wp-content/uploads/2023/01/image-7.png 1785w" sizes="auto, (max-width: 1024px) 100vw, 1024px" /></figure>



<p>And of course you can use spark.sql with Python or any other language to query the table as well:</p>



<figure class="wp-block-image size-large"><img loading="lazy" decoding="async" width="1024" height="383" src="https://www.albertnogues.com/wp-content/uploads/2023/01/image-8-1024x383.png" alt="" class="wp-image-1767" srcset="https://www.albertnogues.com/wp-content/uploads/2023/01/image-8-1024x383.png 1024w, https://www.albertnogues.com/wp-content/uploads/2023/01/image-8-300x112.png 300w, https://www.albertnogues.com/wp-content/uploads/2023/01/image-8-768x288.png 768w, https://www.albertnogues.com/wp-content/uploads/2023/01/image-8-1536x575.png 1536w, https://www.albertnogues.com/wp-content/uploads/2023/01/image-8-160x60.png 160w, https://www.albertnogues.com/wp-content/uploads/2023/01/image-8.png 1811w" sizes="auto, (max-width: 1024px) 100vw, 1024px" /></figure>



<p>Or directly with DataFrames, treating the federated table as any other table in the catalog:</p>



<figure class="wp-block-image size-large"><img loading="lazy" decoding="async" width="1024" height="301" src="https://www.albertnogues.com/wp-content/uploads/2023/01/image-9-1024x301.png" alt="" class="wp-image-1768" srcset="https://www.albertnogues.com/wp-content/uploads/2023/01/image-9-1024x301.png 1024w, https://www.albertnogues.com/wp-content/uploads/2023/01/image-9-300x88.png 300w, https://www.albertnogues.com/wp-content/uploads/2023/01/image-9-768x226.png 768w, https://www.albertnogues.com/wp-content/uploads/2023/01/image-9-1536x452.png 1536w, https://www.albertnogues.com/wp-content/uploads/2023/01/image-9-204x60.png 204w, https://www.albertnogues.com/wp-content/uploads/2023/01/image-9.png 1856w" sizes="auto, (max-width: 1024px) 100vw, 1024px" /></figure>



<p>Hope this clears your way and helps you integrate data from different sources without having to use a virtual metadata layer.</p>
<p>The post <a href="https://www.albertnogues.com/databricks-query-federation-with-snowflake-easy-and-fast/">Databricks query federation with Snowflake. Easy and Fast!</a> appeared first on <a href="https://www.albertnogues.com">Albert Nogués</a>.</p>
]]></content:encoded>
					
		
		
			</item>
		<item>
		<title>Smallest Analytical Platform Ever!</title>
		<link>https://www.albertnogues.com/smallest-analytical-platform-ever/?utm_source=rss&#038;utm_medium=rss&#038;utm_campaign=smallest-analytical-platform-ever</link>
		
		<dc:creator><![CDATA[Albert]]></dc:creator>
		<pubDate>Sat, 07 May 2022 08:38:12 +0000</pubDate>
				<category><![CDATA[Azure]]></category>
		<category><![CDATA[Cloud]]></category>
		<category><![CDATA[Databricks]]></category>
		<category><![CDATA[DevOps]]></category>
		<category><![CDATA[Spark]]></category>
		<category><![CDATA[SQL]]></category>
		<category><![CDATA[databricks]]></category>
		<category><![CDATA[git]]></category>
		<category><![CDATA[python]]></category>
		<category><![CDATA[spark]]></category>
		<guid isPermaLink="false">https://www.albertnogues.com/?p=1440</guid>

					<description><![CDATA[<p>I&#8217;ve started working, in some of my free time, on a project to build the smallest useful analytics platform on the cloud (starting with Azure). The purpose is to use it as a PoC to show to colleagues, managers, prospective customers or just to have fun and play. It&#8217;s publicly available on my GitHub repo &#8230; </p>
<p>The post <a href="https://www.albertnogues.com/smallest-analytical-platform-ever/">Smallest Analytical Platform Ever!</a> appeared first on <a href="https://www.albertnogues.com">Albert Nogués</a>.</p>
]]></description>
										<content:encoded><![CDATA[
<p>I&#8217;ve started working, in some of my free time, on a project to build the smallest useful analytics platform on the cloud (starting with Azure).</p>



<p>The purpose is to use it as a PoC to show to colleagues, managers or prospective customers, or just to have fun and play.</p>



<p>It&#8217;s publicly available on my GitHub repo and any collaboration is welcome. You can fork it, improve it, send PRs and do whatever you want!</p>



<p>The first version will run solely on Azure. The objective is to show the following technologies/disciplines:</p>



<p>* Infrastructure as Code (IaC), using Terraform</p>



<p>* Cloud architecture and Cloud Ops, using an Azure cloud environment</p>



<p>* Data Engineering, using a Spark-powered Databricks notebook and an ADF pipeline (future)</p>



<p>* DevOps to trigger some pipelines based on changes (future)</p>



<p>* Basic security concepts (Key Vault, service principals, least-privilege RBAC access&#8230;)</p>



<p>* FinOps, keeping the costs to a minimum and choosing the proper tools for the job</p>



<p>* Reporting and Dashboarding on data in the platform</p>



<p>* Data management: we will use an ADLS storage account and an Azure SQL DB</p>



<p>TOOLS:</p>



<p>* Terraform to deploy all the infra as code</p>



<p>* Azure Cloud to host our resources</p>



<p>You have the code plus all the information on my github repo:</p>



<p><a href="https://github.com/anogues/ProjectZ">https://github.com/anogues/ProjectZ</a></p>



<p></p>
<p>The post <a href="https://www.albertnogues.com/smallest-analytical-platform-ever/">Smallest Analytical Platform Ever!</a> appeared first on <a href="https://www.albertnogues.com">Albert Nogués</a>.</p>
]]></content:encoded>
					
		
		
			</item>
		<item>
		<title>Implementing CI/CD in Databricks with Azure DevOps (Part 1)</title>
		<link>https://www.albertnogues.com/implementing-ci-cd-in-databricks-with-azure-devops-part-1/?utm_source=rss&#038;utm_medium=rss&#038;utm_campaign=implementing-ci-cd-in-databricks-with-azure-devops-part-1</link>
		
		<dc:creator><![CDATA[Albert]]></dc:creator>
		<pubDate>Sat, 30 Apr 2022 15:10:29 +0000</pubDate>
				<category><![CDATA[Azure]]></category>
		<category><![CDATA[Databricks]]></category>
		<category><![CDATA[DevOps]]></category>
		<category><![CDATA[Spark]]></category>
		<category><![CDATA[Cloud]]></category>
		<category><![CDATA[databricks]]></category>
		<category><![CDATA[git]]></category>
		<category><![CDATA[spark]]></category>
		<guid isPermaLink="false">https://www.albertnogues.com/?p=1416</guid>

					<description><![CDATA[<p>There are many ways to implement CI/CD with Databricks. We can use Azure DevOps, GitHub + GitHub Actions or any other combination of tools, including the dbx tool. But a simple way to just copy notebooks between workspaces can be implemented easily with Azure DevOps. We are going to use the Git repos capability of Azure &#8230; </p>
<p>The post <a href="https://www.albertnogues.com/implementing-ci-cd-in-databricks-with-azure-devops-part-1/">Implementing CI/CD in Databricks with Azure DevOps (Part 1)</a> appeared first on <a href="https://www.albertnogues.com">Albert Nogués</a>.</p>
]]></description>
										<content:encoded><![CDATA[
<p>There are many ways to implement CI/CD with Databricks. We can use Azure DevOps, GitHub + GitHub Actions or any other combination of tools, including the <a href="https://dbx.readthedocs.io/en/latest/templates/python_basic.html#project-file-structure" target="_blank" rel="noreferrer noopener">dbx tool</a>.</p>



<p>But a simple way to just copy notebooks between workspaces can be implemented easily with Azure DevOps.</p>



<p>We are going to use the Git repos capability of Azure Databricks, so when a new code change is committed in a notebook, an Azure DevOps pipeline will copy the notebook from the first Databricks workspace (in our case a NonProd workspace) to the target one, in our case the Prod workspace.</p>



<p>To achieve this we will use some more components from the Azure ecosystem, including Key Vaults to keep all our secrets stored safely. The list of prerequisites is the following:</p>



<ul class="wp-block-list"><li>Two Databricks workspaces, one our source workspace (NonProd) and another, our Production one.</li><li>An Azure Keyvault (or two if we want to segregate the environments)</li><li>Azure Databricks repository configured at least in our source workspace, so when the change is commited we can triger the pipeline that will fetch the notebook and transport it to the prod workspace</li><li>Access to Azure DevOps (Something similar can be implemented with Github + Github Actions)</li></ul>



<p>Let&#8217;s see how to implement it. First we need to make sure Git repos is enabled in our source workspace. We can verify it by logging in to our workspace with an admin-privileged user and making sure the option is checked as follows:</p>



<figure class="wp-block-image size-large is-resized"><img loading="lazy" decoding="async" src="https://www.albertnogues.com/wp-content/uploads/2022/04/DatabricksCICD1-924x1024.png" alt="" class="wp-image-1418" width="687" height="760" srcset="https://www.albertnogues.com/wp-content/uploads/2022/04/DatabricksCICD1-924x1024.png 924w, https://www.albertnogues.com/wp-content/uploads/2022/04/DatabricksCICD1-271x300.png 271w, https://www.albertnogues.com/wp-content/uploads/2022/04/DatabricksCICD1-768x851.png 768w, https://www.albertnogues.com/wp-content/uploads/2022/04/DatabricksCICD1-54x60.png 54w, https://www.albertnogues.com/wp-content/uploads/2022/04/DatabricksCICD1.png 1072w" sizes="auto, (max-width: 687px) 100vw, 687px" /><figcaption>Fig 1. Make sure that github repos is enabled in our workspace.</figcaption></figure>



<p>Secondly, we go to <a href="https://azure.microsoft.com/en-us/services/devops/" target="_blank" rel="noreferrer noopener">Azure DevOps services</a> and create a new project. I&#8217;ve called it DatabricksCICD, but feel free to call it whatever you need:</p>



<figure class="wp-block-image size-large"><img loading="lazy" decoding="async" width="1024" height="714" src="https://www.albertnogues.com/wp-content/uploads/2022/04/DatabricksCICD2-1024x714.png" alt="" class="wp-image-1419" srcset="https://www.albertnogues.com/wp-content/uploads/2022/04/DatabricksCICD2-1024x714.png 1024w, https://www.albertnogues.com/wp-content/uploads/2022/04/DatabricksCICD2-300x209.png 300w, https://www.albertnogues.com/wp-content/uploads/2022/04/DatabricksCICD2-768x536.png 768w, https://www.albertnogues.com/wp-content/uploads/2022/04/DatabricksCICD2-86x60.png 86w, https://www.albertnogues.com/wp-content/uploads/2022/04/DatabricksCICD2.png 1531w" sizes="auto, (max-width: 1024px) 100vw, 1024px" /><figcaption>Fig 2. Create a new repo and initialize it</figcaption></figure>



<p>Once created, we take the details for cloning our repo and copy them. We go to our Databricks workspace, look for the Repos option on the left, and add a new repository. We need to paste the URL to clone our newly created Azure DevOps repository:</p>



<figure class="wp-block-image size-large"><img loading="lazy" decoding="async" width="1024" height="319" src="https://www.albertnogues.com/wp-content/uploads/2022/04/DatabricksCICD3-1024x319.png" alt="" class="wp-image-1420" srcset="https://www.albertnogues.com/wp-content/uploads/2022/04/DatabricksCICD3-1024x319.png 1024w, https://www.albertnogues.com/wp-content/uploads/2022/04/DatabricksCICD3-300x93.png 300w, https://www.albertnogues.com/wp-content/uploads/2022/04/DatabricksCICD3-768x239.png 768w, https://www.albertnogues.com/wp-content/uploads/2022/04/DatabricksCICD3-1536x478.png 1536w, https://www.albertnogues.com/wp-content/uploads/2022/04/DatabricksCICD3-2048x637.png 2048w, https://www.albertnogues.com/wp-content/uploads/2022/04/DatabricksCICD3-193x60.png 193w" sizes="auto, (max-width: 1024px) 100vw, 1024px" /><figcaption>Fig 3. Cloning our Azure DevOps Repository</figcaption></figure>



<p>Once we have linked our Databricks workspace with our DevOps repo, we can create a new notebook. In the same Repos section, click on the down arrow to create a new notebook, as shown below:</p>



<figure class="wp-block-image size-full"><img loading="lazy" decoding="async" width="889" height="373" src="https://www.albertnogues.com/wp-content/uploads/2022/04/DatabricksCICD4.png" alt="" class="wp-image-1421" srcset="https://www.albertnogues.com/wp-content/uploads/2022/04/DatabricksCICD4.png 889w, https://www.albertnogues.com/wp-content/uploads/2022/04/DatabricksCICD4-300x126.png 300w, https://www.albertnogues.com/wp-content/uploads/2022/04/DatabricksCICD4-768x322.png 768w, https://www.albertnogues.com/wp-content/uploads/2022/04/DatabricksCICD4-143x60.png 143w" sizes="auto, (max-width: 889px) 100vw, 889px" /><figcaption>Fig 4.  Creating a new notebook.</figcaption></figure>



<p>As for the content of the notebook, you can put anything you want. I&#8217;m writing a print(&#8220;Hello from Albert&#8221;) statement. We will not run it; we just want to show it&#8217;s possible to transport it. Once done, click on Save Now in the revision tab:</p>



<figure class="wp-block-image size-large"><img loading="lazy" decoding="async" width="1024" height="73" src="https://www.albertnogues.com/wp-content/uploads/2022/04/DatabricksCICD5-1024x73.png" alt="" class="wp-image-1423" srcset="https://www.albertnogues.com/wp-content/uploads/2022/04/DatabricksCICD5-1024x73.png 1024w, https://www.albertnogues.com/wp-content/uploads/2022/04/DatabricksCICD5-300x21.png 300w, https://www.albertnogues.com/wp-content/uploads/2022/04/DatabricksCICD5-768x55.png 768w, https://www.albertnogues.com/wp-content/uploads/2022/04/DatabricksCICD5-1536x110.png 1536w, https://www.albertnogues.com/wp-content/uploads/2022/04/DatabricksCICD5-2048x146.png 2048w, https://www.albertnogues.com/wp-content/uploads/2022/04/DatabricksCICD5-600x43.png 600w" sizes="auto, (max-width: 1024px) 100vw, 1024px" /><figcaption>Fig 5. Saving our changes to the notebook.</figcaption></figure>



<p>Then click on the main branch button on the left; from there we will be committing the changes to our repository:</p>



<figure class="wp-block-image size-large"><img loading="lazy" decoding="async" width="1024" height="314" src="https://www.albertnogues.com/wp-content/uploads/2022/04/DatabricksCICD6-1024x314.png" alt="" class="wp-image-1424" srcset="https://www.albertnogues.com/wp-content/uploads/2022/04/DatabricksCICD6-1024x314.png 1024w, https://www.albertnogues.com/wp-content/uploads/2022/04/DatabricksCICD6-300x92.png 300w, https://www.albertnogues.com/wp-content/uploads/2022/04/DatabricksCICD6-768x235.png 768w, https://www.albertnogues.com/wp-content/uploads/2022/04/DatabricksCICD6-1536x470.png 1536w, https://www.albertnogues.com/wp-content/uploads/2022/04/DatabricksCICD6-2048x627.png 2048w, https://www.albertnogues.com/wp-content/uploads/2022/04/DatabricksCICD6-196x60.png 196w" sizes="auto, (max-width: 1024px) 100vw, 1024px" /><figcaption>Fig 6. Pushing our notebook to the DevOps repo</figcaption></figure>



<p>If we go back now to our Azure DevOps project we should see the file has been committed to the repository. This ends the first part of this tutorial.</p>



<p>In the second blog entry we will see how to trigger the pipeline after a modification of this notebook, and how to pass the credentials of the second workspace in order to deliver the changed notebook to our Production (target) workspace.</p>
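


<p>As a teaser of what that pipeline step will do, here is a rough sketch of copying a notebook between workspaces with the Databricks workspace REST API from Python. The workspace URLs, tokens and notebook path are placeholders; in the real pipeline the tokens would come from the Key Vault:</p>



<pre class="wp-block-code"><code>#Rough sketch: export a notebook from the NonProd workspace and import it into Prod.
#All hosts, tokens and paths below are placeholders.
import requests

source_host = "https://adb-nonprod.azuredatabricks.net"
source_token = "dapiNonProdToken"   #in the real pipeline these come from the Key Vault
target_host = "https://adb-prod.azuredatabricks.net"
target_token = "dapiProdToken"
notebook_path = "/Repos/DatabricksCICD/my_notebook"

#Export the notebook in SOURCE format (the API returns base64-encoded content)
export = requests.get(
    source_host + "/api/2.0/workspace/export",
    headers={"Authorization": "Bearer " + source_token},
    params={"path": notebook_path, "format": "SOURCE"},
).json()

#Import it into the target workspace, overwriting any previous version
requests.post(
    target_host + "/api/2.0/workspace/import",
    headers={"Authorization": "Bearer " + target_token},
    json={"path": notebook_path, "format": "SOURCE", "language": "PYTHON",
          "content": export&#91;"content"], "overwrite": True},
).raise_for_status()</code></pre>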
<p>The post <a href="https://www.albertnogues.com/implementing-ci-cd-in-databricks-with-azure-devops-part-1/">Implementing CI/CD in Databricks with Azure DevOps (Part 1)</a> appeared first on <a href="https://www.albertnogues.com">Albert Nogués</a>.</p>
]]></content:encoded>
					
		
		
			</item>
		<item>
		<title>Databricks cluster policies at a glance. The easy way!</title>
		<link>https://www.albertnogues.com/databricks-cluster-policies-at-a-glance-the-easy-way/?utm_source=rss&#038;utm_medium=rss&#038;utm_campaign=databricks-cluster-policies-at-a-glance-the-easy-way</link>
		
		<dc:creator><![CDATA[Albert]]></dc:creator>
		<pubDate>Tue, 08 Feb 2022 17:30:00 +0000</pubDate>
				<category><![CDATA[Databricks]]></category>
		<category><![CDATA[Cloud]]></category>
		<category><![CDATA[databricks]]></category>
		<category><![CDATA[spark]]></category>
		<guid isPermaLink="false">https://www.albertnogues.com/?p=1318</guid>

					<description><![CDATA[<p>For those administering one or more Databricks workspaces, cluster policies are an important tool that we spend quite some time with. Introduction But what are cluster policies? Cluster policies are basically a JSON document with some parameters that we use to allow (or not allow) users to select certain things when creating a cluster. Not only users &#8230; </p>
<p>The post <a href="https://www.albertnogues.com/databricks-cluster-policies-at-a-glance-the-easy-way/">Databricks cluster policies at a glance. The easy way!</a> appeared first on <a href="https://www.albertnogues.com">Albert Nogués</a>.</p>
]]></description>
										<content:encoded><![CDATA[
<p>For those administering one or more Databricks workspaces, cluster policies are an important tool that we spend quite some time with.</p>



<h2 class="wp-block-heading" id="introduction">Introduction</h2>



<p>But what are cluster policies?</p>



<p>Cluster policies are basically a JSON document with some parameters that we use to allow (or not allow) users to select certain things when creating a cluster. Not only can users select (or deselect) options, we can also force some cluster parameters by default.</p>



<p>The purpose of using cluster policies is not only standardizing or forcing certain specific configurations, but also limiting human error that can cost the company lots of money, by capping certain parameters to only allow specific machine sizes, a maximum number of nodes, or a cluster timeout.</p>



<p>To be able to use cluster policies, you need to have a Premium workspace. And of course, you need to be an admin of the workspace to be able to define them.</p>



<h3 class="wp-block-heading" id="format-of-a-cluster-policy-and-it-s-elements">Format of a cluster policy and it&#8217;s elements</h3>



<p>The format of a policy, as we said, is a JSON document:</p>



<pre class="wp-block-code"><code>interface Policy {
  &#91;path: string]: PolicyElement
}</code></pre>



<p>The list of policy element types we can use is quite long, and they are explained in the official Databricks documentation:</p>



<ul class="wp-block-list"><li><a href="https://docs.databricks.com/administration-guide/clusters/policies.html#fixed-policy">Fixed policy</a></li><li><a href="https://docs.databricks.com/administration-guide/clusters/policies.html#forbidden-policy">Forbidden policy</a></li><li><a href="https://docs.databricks.com/administration-guide/clusters/policies.html#limiting-policies-common-fields">Limiting policies: common fields</a></li><li><a href="https://docs.databricks.com/administration-guide/clusters/policies.html#allow-list-policy">Allow list policy</a></li><li><a href="https://docs.databricks.com/administration-guide/clusters/policies.html#block-list-policy">Block list policy</a></li><li><a href="https://docs.databricks.com/administration-guide/clusters/policies.html#regex-policy">Regex policy</a></li><li><a href="https://docs.databricks.com/administration-guide/clusters/policies.html#range-policy">Range policy</a></li><li><a href="https://docs.databricks.com/administration-guide/clusters/policies.html#unlimited-policy">Unlimited policy</a></li></ul>



<p>In this article we will create a simple policy that preconfigures a cluster with a specific machine size, restricts the maximum number of nodes to 5, and autotags the cluster with a specific key that we define in the policy.</p>



<p>However, there are endless possibilities, so I recommend having a look at the official documentation to get an idea of all the parameters you can configure <a href="https://docs.databricks.com/administration-guide/clusters/policies.html" target="_blank" rel="noreferrer noopener">here</a>.</p>



<h2 class="wp-block-heading" id="creating-a-custom-cluster-policy">Creating a custom cluster policy</h2>



<p>OK, let&#8217;s start! To create our first policy we need to log in to our workspace, go to the Compute section and click on the Cluster Policies tab:</p>



<figure class="wp-block-image size-full"><img loading="lazy" decoding="async" width="980" height="586" src="https://www.albertnogues.com/wp-content/uploads/2022/02/ClusterPolicies1.png" alt="" class="wp-image-1321" srcset="https://www.albertnogues.com/wp-content/uploads/2022/02/ClusterPolicies1.png 980w, https://www.albertnogues.com/wp-content/uploads/2022/02/ClusterPolicies1-300x179.png 300w, https://www.albertnogues.com/wp-content/uploads/2022/02/ClusterPolicies1-768x459.png 768w, https://www.albertnogues.com/wp-content/uploads/2022/02/ClusterPolicies1-100x60.png 100w" sizes="auto, (max-width: 980px) 100vw, 980px" /><figcaption>Fig 1. Creating a Cluster Policy on Azure Databricks</figcaption></figure>



<p>Then, if we have the rights (i.e. we are administrators of the workspace), we should see a button called Create Cluster Policy. Once clicked, we will see something similar to Fig 2:</p>



<figure class="wp-block-image size-full"><img loading="lazy" decoding="async" width="673" height="505" src="https://www.albertnogues.com/wp-content/uploads/2022/02/ClusterPolicies2.png" alt="" class="wp-image-1322" srcset="https://www.albertnogues.com/wp-content/uploads/2022/02/ClusterPolicies2.png 673w, https://www.albertnogues.com/wp-content/uploads/2022/02/ClusterPolicies2-300x225.png 300w, https://www.albertnogues.com/wp-content/uploads/2022/02/ClusterPolicies2-80x60.png 80w" sizes="auto, (max-width: 673px) 100vw, 673px" /><figcaption>Fig 2. A Cluster Policy</figcaption></figure>



<p>Here we have to concentrate on three things. The first one is the policy name. This is the name your users will see, so I recommend choosing a meaningful name, for example &#8220;multinode small cluster&#8221; or something like that.</p>



<p>Once the name is selected, we need to actually define the policy. This is a JSON content box. In my sample case I want to create a policy that by default chooses a machine with a small/medium-sized SKU and can autoscale to a maximum of 5 nodes. I also want to tag the cluster with a fixed string. To go further in this example, I will give the user a choice of two machine types for the worker nodes, while the driver will be restricted to a specific VM SKU. I will also set auto termination to 10 minutes to make sure I am not paying for something not in use. My policy will look like the following:</p>



<pre class="wp-block-code"><code>{
  "node_type_id": {
    "type": "allowlist",
    "values": &#91;
      "Standard_D8d_v4",
      "Standard_D16d_v4"
    ],
    "defaultValue": "Standard_D8d_v4"
  },
  "driver_node_type_id": {
    "type": "fixed",
    "value": "Standard_D8d_v4",
    "hidden": true
  },
  "autoscale.min_workers": {
    "type": "fixed",
    "value": 1,
    "hidden": true
  },
  "autoscale.max_workers": {
    "type": "range",
    "maxValue": 5,
    "defaultValue": 2
  },
  "autotermination_minutes": {
    "type": "fixed",
    "value": 10,
    "hidden": true
  },
  "custom_tags.project": {
    "type": "fixed",
    "value": "Albert"
  }
}</code></pre>



<p>Once defined, I click on Save. If the format of the policy is OK, the cluster policy will be created. I can then go to the Permissions tab and assign it to users or groups. By default the policy can only be used by the admin group:</p>



<figure class="wp-block-image size-full"><img loading="lazy" decoding="async" width="833" height="487" src="https://www.albertnogues.com/wp-content/uploads/2022/02/ClusterPolicies3.png" alt="" class="wp-image-1323" srcset="https://www.albertnogues.com/wp-content/uploads/2022/02/ClusterPolicies3.png 833w, https://www.albertnogues.com/wp-content/uploads/2022/02/ClusterPolicies3-300x175.png 300w, https://www.albertnogues.com/wp-content/uploads/2022/02/ClusterPolicies3-768x449.png 768w, https://www.albertnogues.com/wp-content/uploads/2022/02/ClusterPolicies3-103x60.png 103w" sizes="auto, (max-width: 833px) 100vw, 833px" /><figcaption>Fig 3. Permissions for a policy</figcaption></figure>



<p>We will leave it as it is since we are just playing around.</p>
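


<p>As a side note, the same policy can also be created programmatically, which is handy if you manage several workspaces. A minimal sketch against the Cluster Policies REST API (the workspace URL and token are placeholders, and the definition is an abridged version of the JSON shown above):</p>



<pre class="wp-block-code"><code>#Sketch: create a cluster policy through the Cluster Policies REST API.
#Workspace URL and token are placeholders.
import json
import requests

host = "https://adb-1234567890123456.7.azuredatabricks.net"
token = "dapiYourPatToken"

policy_definition = {
    "node_type_id": {"type": "allowlist",
                     "values": &#91;"Standard_D8d_v4", "Standard_D16d_v4"],
                     "defaultValue": "Standard_D8d_v4"},
    "autoscale.max_workers": {"type": "range", "maxValue": 5, "defaultValue": 2},
    "autotermination_minutes": {"type": "fixed", "value": 10, "hidden": True},
    "custom_tags.project": {"type": "fixed", "value": "Albert"},
}

response = requests.post(
    host + "/api/2.0/policies/clusters/create",
    headers={"Authorization": "Bearer " + token},
    #The API expects the definition as a JSON string
    json={"name": "Multinode small cluster", "definition": json.dumps(policy_definition)},
)
response.raise_for_status()
print(response.json())  #returns the new policy_id</code></pre>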



<p>We can now try to create a cluster using this policy:</p>



<figure class="wp-block-image size-full"><img loading="lazy" decoding="async" width="1003" height="937" src="https://www.albertnogues.com/wp-content/uploads/2022/02/ClusterPolicies4.png" alt="" class="wp-image-1325" srcset="https://www.albertnogues.com/wp-content/uploads/2022/02/ClusterPolicies4.png 1003w, https://www.albertnogues.com/wp-content/uploads/2022/02/ClusterPolicies4-300x280.png 300w, https://www.albertnogues.com/wp-content/uploads/2022/02/ClusterPolicies4-768x717.png 768w, https://www.albertnogues.com/wp-content/uploads/2022/02/ClusterPolicies4-64x60.png 64w" sizes="auto, (max-width: 1003px) 100vw, 1003px" /><figcaption>Fig 4. Create a cluster with the new policy</figcaption></figure>



<p>Select the newly created policy. For the worker type, the policy chooses the machine SKU Standard_D8d_v4 by default but allows the user to choose a bigger one, and for autoscaling we can choose up to 5 nodes. If we try to input 6 we will get an error, &#8220;Max Workers cannot be more than 5&#8221;, and it will not let us go through with the cluster creation.</p>



<p>The driver type and the default cluster timeout are not shown, as we don&#8217;t allow the user to change them; they are set by default and we chose to make them hidden. By removing the hidden attribute we let the users see the values but not modify them:</p>



<pre class="wp-block-code"><code>"driver_node_type_id": {
    "type": "fixed",
    "value": "Standard_D8d_v4",
    <strong>"hidden": true</strong>
  },
...
  "autotermination_minutes": {
    "type": "fixed",
    "value": 10,
    <strong>"hidden": true</strong>
  },</code></pre>



<p>And the last thing to check is in the advanced options. We wanted to autotag the cluster with the project tag set to a fixed string. If we expand the advanced options we can verify how this tag is applied:</p>



<figure class="wp-block-image size-full"><img loading="lazy" decoding="async" width="790" height="606" src="https://www.albertnogues.com/wp-content/uploads/2022/02/ClusterPolicies5.png" alt="" class="wp-image-1326" srcset="https://www.albertnogues.com/wp-content/uploads/2022/02/ClusterPolicies5.png 790w, https://www.albertnogues.com/wp-content/uploads/2022/02/ClusterPolicies5-300x230.png 300w, https://www.albertnogues.com/wp-content/uploads/2022/02/ClusterPolicies5-768x589.png 768w, https://www.albertnogues.com/wp-content/uploads/2022/02/ClusterPolicies5-78x60.png 78w" sizes="auto, (max-width: 790px) 100vw, 790px" /><figcaption>Fig 5. Project tag in advanced options</figcaption></figure>



<p>And if we go to our cloud provider and check the machines created as part of the cluster, we will see how the tag has also been propagated to the cloud resources. This will allow us to do cost analysis from the provider&#8217;s cloud portal.</p>



<figure class="wp-block-image size-large"><img loading="lazy" decoding="async" width="1024" height="305" src="https://www.albertnogues.com/wp-content/uploads/2022/02/ClusterPolicies6-1024x305.png" alt="" class="wp-image-1328" srcset="https://www.albertnogues.com/wp-content/uploads/2022/02/ClusterPolicies6-1024x305.png 1024w, https://www.albertnogues.com/wp-content/uploads/2022/02/ClusterPolicies6-300x89.png 300w, https://www.albertnogues.com/wp-content/uploads/2022/02/ClusterPolicies6-768x229.png 768w, https://www.albertnogues.com/wp-content/uploads/2022/02/ClusterPolicies6-1536x458.png 1536w, https://www.albertnogues.com/wp-content/uploads/2022/02/ClusterPolicies6-2048x610.png 2048w, https://www.albertnogues.com/wp-content/uploads/2022/02/ClusterPolicies6-201x60.png 201w" sizes="auto, (max-width: 1024px) 100vw, 1024px" /><figcaption>Fig 6. Project Tag propagated to the underlying cloud resource (VM).</figcaption></figure>



<h2 class="wp-block-heading" id="next-steps">Next Steps</h2>



<p>There is also another very good article about defining the strategy when implementing policies. As this process can lead to errors, it&#8217;s important not to enforce the policies directly in a production workspace before testing. Databricks has created a reference methodology that I think makes perfect sense when implementing these policies: start by testing the policy, then create a barebones policy to do some tagging, then deploy the real policy without enforcing it on any user (so it can only be selected optionally), and once you have tested that everything is working as expected, enforce it on the users.</p>



<p>By using this framework you can correct anything that may be wrong before it becomes an issue for the final users.</p>



<p>There is also an interesting section in that article about something that usually worries all of us who work with a large user base: tag enforcement on resources, which is very important when cross-charging between teams or different parts of the organization is required. In order to attribute costs to the proper project, department or other body, you need to enforce tagging of your resources. This is a very important topic when working in the cloud, and while it&#8217;s a bit difficult to implement a proper tagging policy in Databricks, since you mainly have to rely on free-form text or a regex expression, you can still build an effective system to cross-charge your projects or departments for the use of your Databricks workspace(s).</p>
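


<p>As a minimal sketch of what such a tag constraint could look like (the <code>project_code</code> tag name and its pattern are hypothetical, not taken from the article), we can build the policy fragment in Python and paste the resulting JSON into the policy definition:</p>



<pre class="wp-block-code"><code># Hypothetical tag-enforcement fragment for a cluster policy definition.
# The regex constraint forces users to supply a project code such as PRJ-1234.
import json

policy_fragment = {
    "custom_tags.project_code": {
        "type": "regex",
        "pattern": "PRJ-[0-9]{4}"
    }
}

# Print the JSON so it can be merged into the policy definition in the UI.
print(json.dumps(policy_fragment, indent=2))</code></pre>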



<p>This set of best practices and challenges is documented <a href="https://docs.databricks.com/administration-guide/clusters/policies-best-practices.html" target="_blank" rel="noreferrer noopener">here</a>, and for me it&#8217;s an essential resource for any Databricks administrator.</p>



<p>Have fun!</p>
<p>The post <a href="https://www.albertnogues.com/databricks-cluster-policies-at-a-glance-the-easy-way/">Databricks cluster policies at a glance. The easy way!</a> appeared first on <a href="https://www.albertnogues.com">Albert Nogués</a>.</p>
]]></content:encoded>
					
		
		
			</item>
		<item>
		<title>Using Azure Private Endpoints with Databricks</title>
		<link>https://www.albertnogues.com/using-azure-private-endpoints-with-databricks/?utm_source=rss&#038;utm_medium=rss&#038;utm_campaign=using-azure-private-endpoints-with-databricks</link>
		
		<dc:creator><![CDATA[Albert]]></dc:creator>
		<pubDate>Thu, 09 Dec 2021 19:31:32 +0000</pubDate>
				<category><![CDATA[Azure]]></category>
		<category><![CDATA[Databricks]]></category>
		<category><![CDATA[Spark]]></category>
		<category><![CDATA[SQL]]></category>
		<category><![CDATA[Cloud]]></category>
		<category><![CDATA[databricks]]></category>
		<category><![CDATA[PrivateEndpoints]]></category>
		<category><![CDATA[spark]]></category>
		<guid isPermaLink="false">https://www.albertnogues.com/?p=1235</guid>

					<description><![CDATA[<p>In this article I will show how to avoid going out to the internet when using resources inside Azure, especially if they are in the same subscription and location (datacenter). Why might we want a private endpoint? That&#8217;s a good question: for both security and performance. Just like using TSCM Equipment for optimal safety and &#8230; </p>
<p>The post <a href="https://www.albertnogues.com/using-azure-private-endpoints-with-databricks/">Using Azure Private Endpoints with Databricks</a> appeared first on <a href="https://www.albertnogues.com">Albert Nogués</a>.</p>
]]></description>
										<content:encoded><![CDATA[
<p>In this article I will show how to avoid going out to the internet when using resources inside Azure, especially if they are in the same subscription and location (datacenter).</p>



<p>Why might we want a private endpoint? That&#8217;s a good question: for both security and performance. Just like using <a href="https://spyassociates.com/counter-surveillance">TSCM Equipment</a> for optimal safety and security. We don&#8217;t want traffic to go out to the internet only to come back into the Azure datacenter when the resource we are trying to reach is already there. With a Private Link the traffic stays inside the Azure backbone network and never reaches the internet. More information about private endpoints <a href="https://azure.microsoft.com/en-us/services/private-link/" target="_blank" rel="noreferrer noopener">here</a> and <a href="https://docs.microsoft.com/en-us/azure/private-link/private-link-overview" target="_blank" rel="noreferrer noopener">here</a>.</p>



<p>Though it&#8217;s possible to create private endpoints to connect to services in other subscriptions, in this article we will stay within the same subscription and the West Europe region. The goal is to connect both to an AzureSQL database and to a data lake using private connectivity.</p>



<h2 class="wp-block-heading">Creating a Private Endpoint for AzureSQL and integrating in the databricks vnet</h2>



<p>For this, I created a Databricks workspace and selected an already existing VNET, so that I can add a new subnet for my private endpoints. One of the good things about doing it this way is that NICs in different subnets can see and reach each other (unless we block it with a network security group); by default traffic is open within the VNET. So I can create a private endpoint in a specific subnet of the same VNET that hosts the Databricks subnets.</p>



<p>Bear in mind that it&#8217;s not possible to add a private endpoint to a subnet managed by Databricks. So the two subnets created when we deployed the Databricks workspace (both public and private) should not be modified. We will create a new one as shown in the screen below:</p>



<figure class="wp-block-image size-large"><img loading="lazy" decoding="async" width="1024" height="145" src="https://www.albertnogues.com/wp-content/uploads/2021/12/SQlPep4-1024x145.png" alt="" class="wp-image-1236" srcset="https://www.albertnogues.com/wp-content/uploads/2021/12/SQlPep4-1024x145.png 1024w, https://www.albertnogues.com/wp-content/uploads/2021/12/SQlPep4-300x43.png 300w, https://www.albertnogues.com/wp-content/uploads/2021/12/SQlPep4-768x109.png 768w, https://www.albertnogues.com/wp-content/uploads/2021/12/SQlPep4-1536x218.png 1536w, https://www.albertnogues.com/wp-content/uploads/2021/12/SQlPep4-2048x290.png 2048w, https://www.albertnogues.com/wp-content/uploads/2021/12/SQlPep4-424x60.png 424w" sizes="auto, (max-width: 1024px) 100vw, 1024px" /><figcaption class="wp-element-caption">Our  Databricks VNET. Among the two subnets created when the databricks workspace is created i added a new one to host our Private Endpoints</figcaption></figure>



<p>Once the VNET is properly defined, we are going to create the private endpoint to reach our AzureSQL server through it.</p>



<p>First we need to go to the Azure Portal, find our AzureSQL Server, click on the left menu entry called Private Endpoint Connections and click on the plus sign on top to create a new one. We just need to select the subscription, the resource group, the name of the private endpoint and the region. We can fill it in as shown in the following picture:</p>



<figure class="wp-block-image size-full"><img loading="lazy" decoding="async" width="946" height="626" src="https://www.albertnogues.com/wp-content/uploads/2021/12/SQlPep1.png" alt="" class="wp-image-1237" srcset="https://www.albertnogues.com/wp-content/uploads/2021/12/SQlPep1.png 946w, https://www.albertnogues.com/wp-content/uploads/2021/12/SQlPep1-300x199.png 300w, https://www.albertnogues.com/wp-content/uploads/2021/12/SQlPep1-768x508.png 768w, https://www.albertnogues.com/wp-content/uploads/2021/12/SQlPep1-91x60.png 91w" sizes="auto, (max-width: 946px) 100vw, 946px" /><figcaption class="wp-element-caption">Private Endpoint Creation. Step 1</figcaption></figure>



<p>The second step requires a bit more information; here we define which resource our private endpoint targets. As expected, we need to find our AzureSQL Server here. We fill in the combo boxes as usual:</p>



<figure class="wp-block-image size-full"><img loading="lazy" decoding="async" width="920" height="529" src="https://www.albertnogues.com/wp-content/uploads/2021/12/SQlPep2.png" alt="" class="wp-image-1238" srcset="https://www.albertnogues.com/wp-content/uploads/2021/12/SQlPep2.png 920w, https://www.albertnogues.com/wp-content/uploads/2021/12/SQlPep2-300x173.png 300w, https://www.albertnogues.com/wp-content/uploads/2021/12/SQlPep2-768x442.png 768w, https://www.albertnogues.com/wp-content/uploads/2021/12/SQlPep2-330x190.png 330w, https://www.albertnogues.com/wp-content/uploads/2021/12/SQlPep2-104x60.png 104w" sizes="auto, (max-width: 920px) 100vw, 920px" /><figcaption class="wp-element-caption">Private Endpoint Creation. Step 2</figcaption></figure>



<p>The third screen is the most important one. We need to select the VNET and subnet that will host our private endpoint. In this case we want to use Databricks, so we need to use the VNET we created for Databricks, and then the subnet we created specifically to host the private endpoints.</p>



<p>Another important step here is to integrate it with DNS. If we don&#8217;t integrate it, then when we use the AzureSQL hostname provided by Azure we will still go through the public endpoint. By integrating it with DNS, queries for the public hostname in this private zone will resolve to the private IP of the NIC of the private endpoint.</p>



<p>If we choose not to integrate with DNS, we will have to add static entries in /etc/hosts or similar, or use the private IP instead of the hostname when connecting to the AzureSQL server. To simplify, we choose to integrate it.</p>



<figure class="wp-block-image size-large is-resized"><img loading="lazy" decoding="async" width="1024" height="583" src="https://www.albertnogues.com/wp-content/uploads/2021/12/SQlPep3-1024x583.png" alt="" class="wp-image-1239" style="width:840px;height:478px" srcset="https://www.albertnogues.com/wp-content/uploads/2021/12/SQlPep3-1024x583.png 1024w, https://www.albertnogues.com/wp-content/uploads/2021/12/SQlPep3-300x171.png 300w, https://www.albertnogues.com/wp-content/uploads/2021/12/SQlPep3-768x438.png 768w, https://www.albertnogues.com/wp-content/uploads/2021/12/SQlPep3-105x60.png 105w, https://www.albertnogues.com/wp-content/uploads/2021/12/SQlPep3.png 1241w" sizes="auto, (max-width: 1024px) 100vw, 1024px" /><figcaption class="wp-element-caption"> Private Endpoint Creation. Step 3.</figcaption></figure>



<p>Once created, we should see the private endpoint available. If you look at the right, it is implemented through a NIC (Network Interface Card), and by clicking on it we can find it and see the IP address assigned:</p>



<figure class="wp-block-image size-large"><img loading="lazy" decoding="async" width="1024" height="177" src="https://www.albertnogues.com/wp-content/uploads/2021/12/SQlPep5-1024x177.png" alt="" class="wp-image-1241" srcset="https://www.albertnogues.com/wp-content/uploads/2021/12/SQlPep5-1024x177.png 1024w, https://www.albertnogues.com/wp-content/uploads/2021/12/SQlPep5-300x52.png 300w, https://www.albertnogues.com/wp-content/uploads/2021/12/SQlPep5-768x132.png 768w, https://www.albertnogues.com/wp-content/uploads/2021/12/SQlPep5-1536x265.png 1536w, https://www.albertnogues.com/wp-content/uploads/2021/12/SQlPep5-2048x353.png 2048w, https://www.albertnogues.com/wp-content/uploads/2021/12/SQlPep5-348x60.png 348w" sizes="auto, (max-width: 1024px) 100vw, 1024px" /><figcaption class="wp-element-caption">Our newly created Private Endpoint for Azure SQL</figcaption></figure>



<figure class="wp-block-image size-large"><img loading="lazy" decoding="async" width="1024" height="225" src="https://www.albertnogues.com/wp-content/uploads/2021/12/SQlPep6-1024x225.png" alt="" class="wp-image-1242" srcset="https://www.albertnogues.com/wp-content/uploads/2021/12/SQlPep6-1024x225.png 1024w, https://www.albertnogues.com/wp-content/uploads/2021/12/SQlPep6-300x66.png 300w, https://www.albertnogues.com/wp-content/uploads/2021/12/SQlPep6-768x168.png 768w, https://www.albertnogues.com/wp-content/uploads/2021/12/SQlPep6-1536x337.png 1536w, https://www.albertnogues.com/wp-content/uploads/2021/12/SQlPep6-2048x449.png 2048w, https://www.albertnogues.com/wp-content/uploads/2021/12/SQlPep6-274x60.png 274w" sizes="auto, (max-width: 1024px) 100vw, 1024px" /><figcaption class="wp-element-caption">Finding the PrivateIP Address of the NIC that implements the private endpoint</figcaption></figure>



<h2 class="wp-block-heading">Test the AzureSQL DB Endpoint from Databricks</h2>



<p>Now we have it ready. From outside that VNET our server still resolves to a public IP, as was also the case inside Databricks before the private endpoint existed. We can ping it for testing purposes:</p>



<pre class="wp-block-code"><code>C:\Users\Albert&gt;ping azure-sql-server-albert.database.windows.net

Haciendo ping a cr4.westeurope1-a.control.database.windows.net &#91;<strong>104.40.168.105</strong>] con 32 bytes de datos:</code></pre>



<p>As you can see we get a public IP, but let&#8217;s try to ping it from inside the cluster:</p>



<figure class="wp-block-image size-full"><img loading="lazy" decoding="async" width="885" height="178" src="https://www.albertnogues.com/wp-content/uploads/2021/12/SQlPep7.png" alt="" class="wp-image-1244" srcset="https://www.albertnogues.com/wp-content/uploads/2021/12/SQlPep7.png 885w, https://www.albertnogues.com/wp-content/uploads/2021/12/SQlPep7-300x60.png 300w, https://www.albertnogues.com/wp-content/uploads/2021/12/SQlPep7-768x154.png 768w, https://www.albertnogues.com/wp-content/uploads/2021/12/SQlPep7-298x60.png 298w" sizes="auto, (max-width: 885px) 100vw, 885px" /><figcaption class="wp-element-caption">Private endpoint with the DNS integration working fine. Our dns record for the AzureSQL Db does not resolve to a public ip anymore but to the private IP of the PrivateEndpoint</figcaption></figure>



<p>So it&#8217;s working: it&#8217;s using the private IP instead of the public one. Our last step is to check that we can fetch data from the database:</p>



<figure class="wp-block-image size-large"><img loading="lazy" decoding="async" width="1024" height="363" src="https://www.albertnogues.com/wp-content/uploads/2021/12/SQlPep8-1024x363.png" alt="" class="wp-image-1245" srcset="https://www.albertnogues.com/wp-content/uploads/2021/12/SQlPep8-1024x363.png 1024w, https://www.albertnogues.com/wp-content/uploads/2021/12/SQlPep8-300x106.png 300w, https://www.albertnogues.com/wp-content/uploads/2021/12/SQlPep8-768x272.png 768w, https://www.albertnogues.com/wp-content/uploads/2021/12/SQlPep8-1536x544.png 1536w, https://www.albertnogues.com/wp-content/uploads/2021/12/SQlPep8-2048x726.png 2048w, https://www.albertnogues.com/wp-content/uploads/2021/12/SQlPep8-169x60.png 169w" sizes="auto, (max-width: 1024px) 100vw, 1024px" /><figcaption class="wp-element-caption">Accessing AzureSQL Database though a private endpoint from databricks</figcaption></figure>



<h2 class="wp-block-heading">Creating an AzureDataLake PrivateEndpoint and saving our data to the DataLake through it.</h2>



<p>We are not done yet! We can complicate matters further and create a private endpoint to save data to our data lake as well.</p>



<p>I&#8217;ve created an ADLS Gen 2 storage account, and going back to databricks I see by default it&#8217;s using public access:</p>



<figure class="wp-block-image size-full"><img loading="lazy" decoding="async" width="836" height="194" src="https://www.albertnogues.com/wp-content/uploads/2021/12/SQlPep9.png" alt="" class="wp-image-1246" srcset="https://www.albertnogues.com/wp-content/uploads/2021/12/SQlPep9.png 836w, https://www.albertnogues.com/wp-content/uploads/2021/12/SQlPep9-300x70.png 300w, https://www.albertnogues.com/wp-content/uploads/2021/12/SQlPep9-768x178.png 768w, https://www.albertnogues.com/wp-content/uploads/2021/12/SQlPep9-259x60.png 259w" sizes="auto, (max-width: 836px) 100vw, 836px" /><figcaption class="wp-element-caption">Datalake public access</figcaption></figure>



<p>But we can implement a private endpoint as well and route all the traffic through the Azure datacenter itself. Let&#8217;s see how to do it. To achieve this, we go to our ADLS Gen2 storage account, on the left we click again on Networking, and the second tab is called Private Endpoint Connections. We click the plus button to create a new one and basically follow the same steps as before, with a subtle difference.</p>



<figure class="wp-block-image size-full"><img loading="lazy" decoding="async" width="930" height="760" src="https://www.albertnogues.com/wp-content/uploads/2021/12/SQlPep10.png" alt="" class="wp-image-1247" srcset="https://www.albertnogues.com/wp-content/uploads/2021/12/SQlPep10.png 930w, https://www.albertnogues.com/wp-content/uploads/2021/12/SQlPep10-300x245.png 300w, https://www.albertnogues.com/wp-content/uploads/2021/12/SQlPep10-768x628.png 768w, https://www.albertnogues.com/wp-content/uploads/2021/12/SQlPep10-73x60.png 73w" sizes="auto, (max-width: 930px) 100vw, 930px" /><figcaption class="wp-element-caption">Creation of a private endpoint for an ADLS Gen2 storage account.</figcaption></figure>



<p>The difference with a storage account is that we need to choose which API we want to create the private endpoint for. We can use the blob, the table, the queue, the file share or the dfs (Data Lake) endpoint (and also the static website!).</p>



<p>We will use the dfs endpoint, and again we will place it in the private endpoint subnet of our Databricks VNET. Something like this:</p>



<figure class="wp-block-image size-full"><img loading="lazy" decoding="async" width="814" height="990" src="https://www.albertnogues.com/wp-content/uploads/2021/12/SQlPep11.png" alt="" class="wp-image-1248" srcset="https://www.albertnogues.com/wp-content/uploads/2021/12/SQlPep11.png 814w, https://www.albertnogues.com/wp-content/uploads/2021/12/SQlPep11-247x300.png 247w, https://www.albertnogues.com/wp-content/uploads/2021/12/SQlPep11-768x934.png 768w, https://www.albertnogues.com/wp-content/uploads/2021/12/SQlPep11-49x60.png 49w" sizes="auto, (max-width: 814px) 100vw, 814px" /><figcaption class="wp-element-caption">Creating a Private Endpoint for our DataLake an dplacing it in the appropiate subnet</figcaption></figure>



<p>After a few minutes our private endpoint will be ready to use. We can go back and check the NIC&#8217;s private IP, or go directly to Databricks and ping the storage account URL to see if it now resolves to our private endpoint:</p>



<figure class="wp-block-image size-full"><img loading="lazy" decoding="async" width="796" height="184" src="https://www.albertnogues.com/wp-content/uploads/2021/12/SQlPep12.png" alt="" class="wp-image-1249" srcset="https://www.albertnogues.com/wp-content/uploads/2021/12/SQlPep12.png 796w, https://www.albertnogues.com/wp-content/uploads/2021/12/SQlPep12-300x69.png 300w, https://www.albertnogues.com/wp-content/uploads/2021/12/SQlPep12-768x178.png 768w, https://www.albertnogues.com/wp-content/uploads/2021/12/SQlPep12-260x60.png 260w" sizes="auto, (max-width: 796px) 100vw, 796px" /><figcaption class="wp-element-caption">As we can see now databricks resolves our storage account through the private endpoint</figcaption></figure>



<h2 class="wp-block-heading">Test the ADLS Gen2 SA endpoint from Databricks </h2>



<p>If we have credential passthrough enabled on our cluster and we have permission to write to the data lake, we should now be able to write there without going through the internet:</p>



<figure class="wp-block-image size-large"><img loading="lazy" decoding="async" width="1024" height="175" src="https://www.albertnogues.com/wp-content/uploads/2021/12/SQlPep13-1024x175.png" alt="" class="wp-image-1250" srcset="https://www.albertnogues.com/wp-content/uploads/2021/12/SQlPep13-1024x175.png 1024w, https://www.albertnogues.com/wp-content/uploads/2021/12/SQlPep13-300x51.png 300w, https://www.albertnogues.com/wp-content/uploads/2021/12/SQlPep13-768x132.png 768w, https://www.albertnogues.com/wp-content/uploads/2021/12/SQlPep13-1536x263.png 1536w, https://www.albertnogues.com/wp-content/uploads/2021/12/SQlPep13-2048x351.png 2048w, https://www.albertnogues.com/wp-content/uploads/2021/12/SQlPep13-350x60.png 350w" sizes="auto, (max-width: 1024px) 100vw, 1024px" /><figcaption class="wp-element-caption">Writing to a DataLake through the Private Endpoint we just created</figcaption></figure>



<p>So this is the end of the tutorial. We created two private endpoints, one for the AzureSQL database and another for our data lake, and used both of them from Databricks. We also confirmed we are effectively using them by pinging the hostnames of both resources and seeing them resolve to the private IP instead of the public one.</p>



<p>Happy data projects!</p>
<p>The post <a href="https://www.albertnogues.com/using-azure-private-endpoints-with-databricks/">Using Azure Private Endpoints with Databricks</a> appeared first on <a href="https://www.albertnogues.com">Albert Nogués</a>.</p>
]]></content:encoded>
					
		
		
			</item>
		<item>
		<title>Databricks and Spark Crash Course. Delta and More!</title>
		<link>https://www.albertnogues.com/databricks-and-spark-crash-course-delta-and-more/?utm_source=rss&#038;utm_medium=rss&#038;utm_campaign=databricks-and-spark-crash-course-delta-and-more</link>
		
		<dc:creator><![CDATA[Albert]]></dc:creator>
		<pubDate>Thu, 25 Mar 2021 17:58:00 +0000</pubDate>
				<category><![CDATA[BigData]]></category>
		<category><![CDATA[Databricks]]></category>
		<category><![CDATA[databricks]]></category>
		<category><![CDATA[delta]]></category>
		<category><![CDATA[python]]></category>
		<category><![CDATA[scala]]></category>
		<category><![CDATA[spark]]></category>
		<category><![CDATA[sql]]></category>
		<guid isPermaLink="false">http://192.168.1.40/?p=1025</guid>

					<description><![CDATA[<p>I&#8217;ve been working on a Databricks and Delta tutorial for all of you. I published it as a notebook and you can grab it here. We will load some sample data from the NYC taxi dataset available in Databricks and store it as a table. We will then use Python to do some manipulation (Extract &#8230; </p>
<p>The post <a href="https://www.albertnogues.com/databricks-and-spark-crash-course-delta-and-more/">Databricks and Spark Crash Course. Delta and More!</a> appeared first on <a href="https://www.albertnogues.com">Albert Nogués</a>.</p>
]]></description>
										<content:encoded><![CDATA[
<p>I&#8217;ve been working on a Databricks and Delta tutorial for all of you. I published it as a notebook and you can grab it <a rel="noreferrer noopener" href="https://databricks-prod-cloudfront.cloud.databricks.com/public/4027ec902e239c93eaaa8714f173bcfc/897686883903747/2503669437642038/312541189568512/latest.html" data-type="URL" data-id="https://databricks-prod-cloudfront.cloud.databricks.com/public/4027ec902e239c93eaaa8714f173bcfc/897686883903747/2503669437642038/312541189568512/latest.html" target="_blank">here</a>.</p>



<p>We will load some sample data from the NYC taxi dataset available in Databricks and store it as a table. We will then use Python to do some manipulation (extracting month and year from the trip time), which adds two new columns to our dataframe, and we will check how the file is saved in the Hive warehouse. We will observe that we have some junk data, because partitioning created folders for months and years we are not supposed to have, so we will filter out these bad records both the Python way and the SQL way.</p>
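


<p>A rough sketch of those steps (the table and column names, such as <code>tpep_pickup_datetime</code>, and the 2019 filter are assumptions based on the NYC taxi schema rather than the exact notebook code):</p>



<pre class="wp-block-code"><code>from pyspark.sql.functions import col, month, year

# Assumed source table name; in the notebook this comes from the NYC taxi sample data.
trips = spark.read.table("nyc_trips")

# Add the two partition columns derived from the pickup timestamp.
trips = (trips
         .withColumn("trip_year", year(col("tpep_pickup_datetime")))
         .withColumn("trip_month", month(col("tpep_pickup_datetime"))))

# Filter out junk records outside the expected period, the Python way.
clean_trips = (trips
               .filter(col("trip_year") == 2019)
               .filter(col("trip_month").between(1, 12)))

# The same filter, the SQL way.
trips.createOrReplaceTempView("trips")
clean_trips_sql = spark.sql(
    "SELECT * FROM trips WHERE trip_year = 2019 AND trip_month BETWEEN 1 AND 12"
)</code></pre>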



<p>Then, we will load another month of data as a temporary view and compare it with a delta table, where we can run updates and all sorts of DML.</p>
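


<p>For instance, once the data lives in a Delta table, an update is a one-liner (the table and column names here are only illustrative):</p>



<pre class="wp-block-code"><code># DML such as UPDATE works directly on Delta tables;
# a temporary view over plain Parquet files would not support this.
spark.sql("""
  UPDATE trips_delta
  SET passenger_count = 1
  WHERE passenger_count IS NULL
""")</code></pre>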



<p>As a last step, we will load some master data and perform a join. For more on Delta Lake you can follow this tutorial &#8211;&gt; <a href="https://delta.io/tutorials/delta-lake-workshop-primer/">https://delta.io/tutorials/delta-lake-workshop-primer/</a></p>
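


<p>A join against such a master table might look like this (the lookup table name and the <code>payment_type</code> join key are assumptions, not the notebook&#8217;s exact code):</p>



<pre class="wp-block-code"><code># Hypothetical lookup table with one row per payment type code.
payment_types = spark.read.table("payment_type_lookup")

# Enrich the trips (from the earlier sketch) with the payment type description.
enriched = clean_trips.join(payment_types, on="payment_type", how="left")
display(enriched.limit(10))</code></pre>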



<p>Enjoy coding!</p>
<p>The post <a href="https://www.albertnogues.com/databricks-and-spark-crash-course-delta-and-more/">Databricks and Spark Crash Course. Delta and More!</a> appeared first on <a href="https://www.albertnogues.com">Albert Nogués</a>.</p>
]]></content:encoded>
					
		
		
			</item>
	</channel>
</rss>
