<?xml version="1.0" encoding="UTF-8"?><rss version="2.0"
	xmlns:content="http://purl.org/rss/1.0/modules/content/"
	xmlns:wfw="http://wellformedweb.org/CommentAPI/"
	xmlns:dc="http://purl.org/dc/elements/1.1/"
	xmlns:atom="http://www.w3.org/2005/Atom"
	xmlns:sy="http://purl.org/rss/1.0/modules/syndication/"
	xmlns:slash="http://purl.org/rss/1.0/modules/slash/"
	>

<channel>
	<title>Python Archives - Albert Nogués</title>
	<atom:link href="https://www.albertnogues.com/category/python/feed/" rel="self" type="application/rss+xml" />
	<link>https://www.albertnogues.com/category/python/</link>
	<description>Data and Cloud Freelancer</description>
	<lastBuildDate>Fri, 31 May 2024 11:17:08 +0000</lastBuildDate>
	<language>en-US</language>
	<sy:updatePeriod>
	hourly	</sy:updatePeriod>
	<sy:updateFrequency>
	1	</sy:updateFrequency>
	

<image>
	<url>https://www.albertnogues.com/wp-content/uploads/2020/12/cropped-cropped-AlbertLogo2-32x32.png</url>
	<title>Python Archives - Albert Nogués</title>
	<link>https://www.albertnogues.com/category/python/</link>
	<width>32</width>
	<height>32</height>
</image> 
	<item>
		<title>Data Quality Checks with Soda-Core in Databricks</title>
		<link>https://www.albertnogues.com/data-quality-checks-with-soda-core-in-databricks/?utm_source=rss&#038;utm_medium=rss&#038;utm_campaign=data-quality-checks-with-soda-core-in-databricks</link>
		
		<dc:creator><![CDATA[Albert]]></dc:creator>
		<pubDate>Fri, 31 May 2024 11:17:06 +0000</pubDate>
				<category><![CDATA[Azure]]></category>
		<category><![CDATA[Databricks]]></category>
		<category><![CDATA[Python]]></category>
		<category><![CDATA[data]]></category>
		<category><![CDATA[data quality]]></category>
		<category><![CDATA[python]]></category>
		<category><![CDATA[soda]]></category>
		<category><![CDATA[spark]]></category>
		<category><![CDATA[sql]]></category>
		<guid isPermaLink="false">https://www.albertnogues.com/?p=3439</guid>

					<description><![CDATA[<p>It&#8217;s easy to run data quality checks when working with Spark using the soda-core library. The library supports Spark DataFrames. I&#8217;ve tested it in a Databricks environment and it worked quite easily for me. For the examples in this article I am loading the customers table from the TPCH delta tables in the &#8230; </p>
<p>The post <a href="https://www.albertnogues.com/data-quality-checks-with-soda-core-in-databricks/">Data Quality Checks with Soda-Core in Databricks</a> appeared first on <a href="https://www.albertnogues.com">Albert Nogués</a>.</p>
]]></description>
										<content:encoded><![CDATA[
<p>It&#8217;s easy to run data quality checks when working with Spark using the soda-core library. The library supports Spark DataFrames. I&#8217;ve tested it in a Databricks environment and it worked quite easily for me.</p>



<p>For the examples in this article I am loading the customers table from the TPCH delta tables in the databricks-datasets folder.</p>



<p>First of all, we need to install the library, either scoped to our Databricks notebook or on our cluster. In my case I will install it notebook-scoped:</p>



<pre class="wp-block-code"><code>%pip install soda-core-spark-df</code></pre>



<p>Then we create a DataFrame from the TPCH customers table:</p>



<pre class="wp-block-code"><code># Read the TPCH customers Delta table into a DataFrame
customer_df = spark.read.table("delta.`/databricks-datasets/tpch/delta-001/customer/`")</code></pre>



<p>We create a temporary view over our DataFrame so Soda can query the data and run the checks:</p>



<pre class="wp-block-code"><code># Create a temporary view so Soda can reference the data by name
customer_df.createOrReplaceTempView("customer")</code></pre>



<p>And here comes the heart of Soda Core: we define the checks using YAML syntax (SodaCL):</p>



<pre class="wp-block-code"><code>from soda.scan import Scan
scan = Scan()
scan.set_scan_definition_name("Databricks Test Notebook")
scan.set_data_source_name("customer")
scan.add_spark_session(spark, data_source_name="customer")
#YAML Format
checks = '''
checks for customer:
  - row_count > 0
  - invalid_percent(c_phone) = 0:
      valid regex: ^&#91;0-9]{2}&#91;-]&#91;0-9]{3}&#91;-]&#91;0-9]{3}&#91;-]&#91;0-9]{4}$
  - duplicate_count(c_phone) = 0:
      name: No duplicate phone numbers
  - invalid_count(c_mktsegment) = 0:
      invalid values: &#91;HOUSEHOLD]
      name: HOUSEHOLD is not allowed as a Market Segment
'''
# you can use add_sodacl_yaml_file(s). Useful if the tests are in a github repo or FS
scan.add_sodacl_yaml_str(checks)
scan.execute()
print(scan.get_logs_text())</code></pre>


<div class="wp-block-image">
<figure class="aligncenter size-full"><img fetchpriority="high" decoding="async" width="570" height="322" src="https://www.albertnogues.com/wp-content/uploads/2024/05/Output1.png" alt="" class="wp-image-3440" srcset="https://www.albertnogues.com/wp-content/uploads/2024/05/Output1.png 570w, https://www.albertnogues.com/wp-content/uploads/2024/05/Output1-300x169.png 300w, https://www.albertnogues.com/wp-content/uploads/2024/05/Output1-106x60.png 106w" sizes="(max-width: 570px) 100vw, 570px" /></figure>
</div>


<p>More info: <a href="https://docs.soda.io/soda/quick-start-databricks.html">Add Soda to a Databricks notebook | Soda Documentation</a></p>



<p>List of validations: <a href="https://docs.soda.io/soda-cl/validity-metrics.html">Validity metrics | Soda Documentation</a> and <a href="https://docs.soda.io/soda-cl/metrics-and-checks.html">SodaCL metrics and checks | Soda Documentation</a></p>



<p>We can enhance this a bit and build a Spark DataFrame out of the list of our warning or error validation checks:</p>



<pre class="wp-block-code"><code>from datetime import datetime

schema_checks = 'datasource STRING, table STRING, rule_name STRING, rule STRING, column STRING, check_status STRING, number_of_errors_in_sample INT, check_time TIMESTAMP'

# Cache the results dict instead of calling get_scan_results() on every access
results = scan.get_scan_results()
scan_time = datetime.strptime(results&#91;'dataTimestamp'], '%Y-%m-%dT%H:%M:%S%z')

list_of_checks = &#91;]
for c in results&#91;'checks']:
    # Passing checks report 0 failing rows; failing ones expose a failing-row count
    failing_rows = 0 if 'pass' in c&#91;'outcome'] else int(c&#91;'diagnostics']&#91;'blocks']&#91;0]&#91;'totalFailingRows'])
    list_of_checks.append(&#91;results&#91;'defaultDataSource'], c&#91;'table'], c&#91;'name'], c&#91;'definition'], c&#91;'column'], c&#91;'outcome'], failing_rows, scan_time])

list_of_checks_df = spark.createDataFrame(list_of_checks, schema_checks)
display(list_of_checks_df)</code></pre>
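<p>The structure of the dictionary returned by scan.get_scan_results() can be illustrated with a small mock, which is handy for unit testing the flattening logic without a cluster. The field values below are my own illustration and the exact shape is what I observed in my runs, so treat it as an assumption rather than a documented contract:</p>

```python
from datetime import datetime

# A minimal mock of scan.get_scan_results() -- field names as observed in my
# own runs; the values are invented for illustration.
mock_results = {
    "defaultDataSource": "customer",
    "dataTimestamp": "2024-05-31T11:17:06+00:00",
    "checks": [
        {"table": "customer", "name": "No duplicate phone numbers",
         "definition": "duplicate_count(c_phone) = 0", "column": "c_phone",
         "outcome": "pass", "diagnostics": {"blocks": []}},
        {"table": "customer", "name": "Valid phone format",
         "definition": "invalid_percent(c_phone) = 0", "column": "c_phone",
         "outcome": "fail",
         "diagnostics": {"blocks": [{"totalFailingRows": 150000}]}},
    ],
}

def checks_to_rows(results):
    # Flatten the checks list into rows matching the schema_checks columns
    scan_time = datetime.strptime(results["dataTimestamp"], "%Y-%m-%dT%H:%M:%S%z")
    rows = []
    for c in results["checks"]:
        failing = 0 if "pass" in c["outcome"] else int(
            c["diagnostics"]["blocks"][0]["totalFailingRows"])
        rows.append([results["defaultDataSource"], c["table"], c["name"],
                     c["definition"], c["column"], c["outcome"], failing, scan_time])
    return rows

rows = checks_to_rows(mock_results)
print(rows)
```

<p>With a Spark session available, spark.createDataFrame(rows, schema_checks) then yields the same kind of DataFrame as above.</p>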



<figure class="wp-block-image size-large is-resized"><img decoding="async" width="1024" height="403" src="https://www.albertnogues.com/wp-content/uploads/2024/05/DataFrameOutput-1024x403.png" alt="" class="wp-image-3441" style="width:840px;height:auto" srcset="https://www.albertnogues.com/wp-content/uploads/2024/05/DataFrameOutput-1024x403.png 1024w, https://www.albertnogues.com/wp-content/uploads/2024/05/DataFrameOutput-300x118.png 300w, https://www.albertnogues.com/wp-content/uploads/2024/05/DataFrameOutput-768x302.png 768w, https://www.albertnogues.com/wp-content/uploads/2024/05/DataFrameOutput-152x60.png 152w, https://www.albertnogues.com/wp-content/uploads/2024/05/DataFrameOutput.png 1328w" sizes="(max-width: 1024px) 100vw, 1024px" /></figure>



<p>In case we have the YAML file in our GitHub repo, we can read it and pass it in. Or, if we are working with Databricks Repos and the file is part of our repo, we can load it locally.</p>



<p>Accessing a remote file and reading it with requests:</p>



<pre class="wp-block-code"><code># Trying to use a remote YAML file to enforce rules. We can upload it to a GitHub repo of our own and use it in our notebook.
# I've created a public repo so I don't need to authenticate to GitHub, but in a real-world scenario we should use a private repo + secret scopes
import requests

customer_quality_rules = 'https://raw.githubusercontent.com/anogues/soda-core-quality-rules/main/soda-core-quality-rules-customer.yaml'
scan.add_sodacl_yaml_str(requests.get(customer_quality_rules).text)</code></pre>



<p>Or we can load it locally if we are using Databricks Repos:</p>



<pre class="wp-block-code"><code>scan.add_sodacl_yaml_file("your_file.yaml")</code></pre>
<p>The post <a href="https://www.albertnogues.com/data-quality-checks-with-soda-core-in-databricks/">Data Quality Checks with Soda-Core in Databricks</a> appeared first on <a href="https://www.albertnogues.com">Albert Nogués</a>.</p>
]]></content:encoded>
					
		
		
			</item>
		<item>
		<title>Useful Databricks/Spark resources</title>
		<link>https://www.albertnogues.com/useful-databricks-spark-resources/?utm_source=rss&#038;utm_medium=rss&#038;utm_campaign=useful-databricks-spark-resources</link>
		
		<dc:creator><![CDATA[Albert]]></dc:creator>
		<pubDate>Wed, 14 Dec 2022 12:58:28 +0000</pubDate>
				<category><![CDATA[BigData]]></category>
		<category><![CDATA[Databricks]]></category>
		<category><![CDATA[Python]]></category>
		<category><![CDATA[Spark]]></category>
		<category><![CDATA[SQL]]></category>
		<guid isPermaLink="false">https://www.albertnogues.com/?p=1694</guid>

					<description><![CDATA[<p>Memory Profiling in PySpark: https://www.databricks.com/blog/2022/11/30/memory-profiling-pyspark.html Run Databricks queries directly from VSCODE: https://ganeshchandrasekaran.com/run-your-databricks-sql-queries-from-vscode-9c70c5d4903c Spark Testing with chispa: https://github.com/alexott/spark-playground/tree/master/testing Best Practices for Cost Management on Databricks: https://www.databricks.com/blog/2022/10/18/best-practices-cost-management-databricks.html UDF Pyspark: https://docs.databricks.com/udf/python.html Pandas UDF&#8217;s: https://docs.databricks.com/udf/pandas.html Introducing Pandas UDF for PySpark: https://www.databricks.com/blog/2017/10/30/introducing-vectorized-udfs-for-pyspark.html</p>
<p>The post <a href="https://www.albertnogues.com/useful-databricks-spark-resources/">Useful Databricks/Spark resources</a> appeared first on <a href="https://www.albertnogues.com">Albert Nogués</a>.</p>
]]></description>
										<content:encoded><![CDATA[
<p>Memory Profiling in PySpark: <a href="https://www.databricks.com/blog/2022/11/30/memory-profiling-pyspark.html" target="_blank" rel="noreferrer noopener">https://www.databricks.com/blog/2022/11/30/memory-profiling-pyspark.html</a></p>



<p>Run Databricks queries directly from VSCODE: <a href="https://ganeshchandrasekaran.com/run-your-databricks-sql-queries-from-vscode-9c70c5d4903c" target="_blank" rel="noreferrer noopener">https://ganeshchandrasekaran.com/run-your-databricks-sql-queries-from-vscode-9c70c5d4903c</a></p>



<p>Spark Testing with chispa: <a href="https://github.com/alexott/spark-playground/tree/master/testing" target="_blank" rel="noreferrer noopener">https://github.com/alexott/spark-playground/tree/master/testing</a></p>



<p>Best Practices for Cost Management on Databricks: <a href="https://www.databricks.com/blog/2022/10/18/best-practices-cost-management-databricks.html" target="_blank" rel="noreferrer noopener">https://www.databricks.com/blog/2022/10/18/best-practices-cost-management-databricks.html</a></p>



<p>UDF Pyspark: <a href="https://docs.databricks.com/udf/python.html" target="_blank" rel="noreferrer noopener">https://docs.databricks.com/udf/python.html</a></p>



<p>Pandas UDFs: <a href="https://docs.databricks.com/udf/pandas.html" target="_blank" rel="noreferrer noopener">https://docs.databricks.com/udf/pandas.html</a></p>



<p>Introducing Pandas UDF for PySpark: <a href="https://www.databricks.com/blog/2017/10/30/introducing-vectorized-udfs-for-pyspark.html" target="_blank" rel="noreferrer noopener">https://www.databricks.com/blog/2017/10/30/introducing-vectorized-udfs-for-pyspark.html</a></p>
<p>The post <a href="https://www.albertnogues.com/useful-databricks-spark-resources/">Useful Databricks/Spark resources</a> appeared first on <a href="https://www.albertnogues.com">Albert Nogués</a>.</p>
]]></content:encoded>
					
		
		
			</item>
		<item>
		<title>Databricks connectivity to Azure SQL / SQL Server</title>
		<link>https://www.albertnogues.com/databricks-connectivity-to-azure-sql-sql-server/?utm_source=rss&#038;utm_medium=rss&#038;utm_campaign=databricks-connectivity-to-azure-sql-sql-server</link>
		
		<dc:creator><![CDATA[Albert]]></dc:creator>
		<pubDate>Thu, 09 Dec 2021 10:45:34 +0000</pubDate>
				<category><![CDATA[Azure]]></category>
		<category><![CDATA[Databricks]]></category>
		<category><![CDATA[Python]]></category>
		<category><![CDATA[Spark]]></category>
		<category><![CDATA[SQL]]></category>
		<guid isPermaLink="false">https://www.albertnogues.com/?p=1224</guid>

					<description><![CDATA[<p>Most of the developments I see inside Databricks rely on fetching data from or writing data to some sort of database. Usually the preferred method for this is through a JDBC driver, as most databases offer one. In some cases, though, it&#8217;s also possible to use a Spark-optimized driver. This &#8230; </p>
<p>The post <a href="https://www.albertnogues.com/databricks-connectivity-to-azure-sql-sql-server/">Databricks connectivity to Azure SQL / SQL Server</a> appeared first on <a href="https://www.albertnogues.com">Albert Nogués</a>.</p>
]]></description>
										<content:encoded><![CDATA[
<p>Most of the developments I see inside Databricks rely on fetching data from or writing data to some sort of database.</p>



<p>Usually the preferred method for this is through a JDBC driver, as most databases offer one.</p>



<p>In some cases, though, it&#8217;s also possible to use a Spark-optimized driver. This is the case for Azure SQL / SQL Server. We still have the option to use the standard JDBC driver (which is what most people do, because it&#8217;s common to all databases), but we can improve performance by using a specific Spark driver. Until some time ago it was only supported through the Scala API, but now it can be used from Python and R as well, so there is no reason not to give it a try.</p>



<p>In this article we will see both options for making this connection. For testing purposes we will connect to an Azure SQL database in the same region (West Europe).</p>



<h2 class="wp-block-heading">Connecting to Azure SQL through the JDBC driver</h2>



<p>In this case the JDBC driver is already shipped with the Databricks cluster, so we do not need to install anything and can connect directly. Let&#8217;s see how (there is a Scala example <a href="https://docs.microsoft.com/es-es/azure/databricks/data/data-sources/sql-databases" target="_blank" rel="noreferrer noopener">here</a>, but I will use Python for this example):</p>



<pre class="wp-block-code"><code>#In a real development this should be fetched from a keyvault using a secret scope with: dbutils.secrets.get(scope = "sql_db", key = "username") and  dbutils.secrets.get(scope = "sql_db", key = "password")

jdbcDF = spark.read.format("jdbc") \
    .option("url", f"jdbc:sqlserver://azure-sql-server-albert.database.windows.net:1433;databaseName=databricksdata") \
    .option("dbtable", "SalesLT.Product") \
    .option("user", "anogues") \
    .option("password", "XXXXXX") \
    .option("driver", "com.microsoft.sqlserver.jdbc.SQLServerDriver") \
    .load()

jdbcDF.show()</code></pre>
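<p>To keep connection literals out of the notebook body, the option map above can be assembled by a small helper function. This is just a sketch of mine (the helper name and its defaults are not from any library), and in a real notebook the user and password should still come from a secret scope:</p>

```python
def jdbc_options(server, database, table, user, password, port=1433):
    # Option map for spark.read.format("jdbc").options(**...).load()
    return {
        "url": f"jdbc:sqlserver://{server}:{port};databaseName={database}",
        "dbtable": table,
        "user": user,
        "password": password,
        "driver": "com.microsoft.sqlserver.jdbc.SQLServerDriver",
    }

opts = jdbc_options("azure-sql-server-albert.database.windows.net",
                    "databricksdata", "SalesLT.Product", "anogues", "XXXXXX")
print(opts["url"])
```

<p>The read then becomes spark.read.format("jdbc").options(**opts).load(), and switching to the optimized driver discussed below only means changing the format string and dropping the driver entry.</p>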



<figure class="wp-block-image size-large"><img decoding="async" width="1024" height="362" src="https://www.albertnogues.com/wp-content/uploads/2021/12/SQLDBR1-1024x362.png" alt="" class="wp-image-1226" srcset="https://www.albertnogues.com/wp-content/uploads/2021/12/SQLDBR1-1024x362.png 1024w, https://www.albertnogues.com/wp-content/uploads/2021/12/SQLDBR1-300x106.png 300w, https://www.albertnogues.com/wp-content/uploads/2021/12/SQLDBR1-768x272.png 768w, https://www.albertnogues.com/wp-content/uploads/2021/12/SQLDBR1-1536x543.png 1536w, https://www.albertnogues.com/wp-content/uploads/2021/12/SQLDBR1-2048x724.png 2048w, https://www.albertnogues.com/wp-content/uploads/2021/12/SQLDBR1-170x60.png 170w" sizes="(max-width: 1024px) 100vw, 1024px" /><figcaption>Spark Dataframe from a JDBC Azure SQL DB Source</figcaption></figure>



<p>So, as we saw, we have been able to connect successfully to our Azure SQL DB using the JDBC driver shipped with Databricks. Let&#8217;s now switch to the Spark-optimized driver.</p>



<h2 class="wp-block-heading">Connecting to Azure SQL through the Spark-optimized driver</h2>



<p>To connect using the Spark-optimized driver, we first need to install it on the cluster, as it&#8217;s not available by default.</p>



<p>The driver is available in Maven for both Spark 2.x and 3.x. On the Microsoft <a href="https://docs.microsoft.com/en-us/sql/connect/spark/connector?view=sql-server-ver15">website</a> we can find more information on where to get it and how to use it. For the purposes of this exercise we will install it through Databricks libraries, using Maven. Just add the following in the coordinates box: com.microsoft.azure:spark-mssql-connector_2.12:1.2.0, as can be seen in the image below.</p>



<figure class="wp-block-image size-large"><img loading="lazy" decoding="async" width="1024" height="323" src="https://www.albertnogues.com/wp-content/uploads/2021/12/SQLDBR2-1024x323.png" alt="" class="wp-image-1227" srcset="https://www.albertnogues.com/wp-content/uploads/2021/12/SQLDBR2-1024x323.png 1024w, https://www.albertnogues.com/wp-content/uploads/2021/12/SQLDBR2-300x95.png 300w, https://www.albertnogues.com/wp-content/uploads/2021/12/SQLDBR2-768x242.png 768w, https://www.albertnogues.com/wp-content/uploads/2021/12/SQLDBR2-1536x485.png 1536w, https://www.albertnogues.com/wp-content/uploads/2021/12/SQLDBR2-190x60.png 190w, https://www.albertnogues.com/wp-content/uploads/2021/12/SQLDBR2.png 2006w" sizes="auto, (max-width: 1024px) 100vw, 1024px" /><figcaption>Installing the spark AzureSQL Driver from Maven</figcaption></figure>



<p>Once installed, we should see a green dot next to the driver, which means it is ready to be used. We go back to our notebook and try:</p>



<pre class="wp-block-code"><code>#In a real development this should be fetched from a keyvault using a secret scope with: dbutils.secrets.get(scope = "sql_db", key = "username") and  dbutils.secrets.get(scope = "sql_db", key = "password")
jdbcDF = spark.read.format("com.microsoft.sqlserver.jdbc.spark") \
    .option("url", f"jdbc:sqlserver://azure-sql-server-albert.database.windows.net:1433;databaseName=databricksdata") \
    .option("dbtable", "SalesLT.Product") \
    .option("user", "anogues") \
    .option("password", "XXXXXX") \
    .load()

jdbcDF.show()</code></pre>



<p>If we see an error like java.lang.ClassNotFoundException: com.microsoft.sqlserver.jdbc.spark, it means the driver can&#8217;t be found, so it&#8217;s probably not properly installed. Check the libraries on the cluster again and make sure the status is installed. If all goes well, we should see our DataFrame again:</p>



<figure class="wp-block-image size-large"><img loading="lazy" decoding="async" width="1024" height="351" src="https://www.albertnogues.com/wp-content/uploads/2021/12/SQLDBR3-1-1024x351.png" alt="" class="wp-image-1229" srcset="https://www.albertnogues.com/wp-content/uploads/2021/12/SQLDBR3-1-1024x351.png 1024w, https://www.albertnogues.com/wp-content/uploads/2021/12/SQLDBR3-1-300x103.png 300w, https://www.albertnogues.com/wp-content/uploads/2021/12/SQLDBR3-1-768x263.png 768w, https://www.albertnogues.com/wp-content/uploads/2021/12/SQLDBR3-1-1536x526.png 1536w, https://www.albertnogues.com/wp-content/uploads/2021/12/SQLDBR3-1-2048x701.png 2048w, https://www.albertnogues.com/wp-content/uploads/2021/12/SQLDBR3-1-175x60.png 175w" sizes="auto, (max-width: 1024px) 100vw, 1024px" /><figcaption> Spark Dataframe from a Spark Azure SQL DB Source </figcaption></figure>



<p>The reason to use the optimized Spark driver is usually performance: Microsoft claims it&#8217;s about 15x faster than the JDBC one. But there is more. The Spark driver also allows AAD authentication, either with a service principal or an AAD account, apart of course from native SQL Server authentication. Let&#8217;s see whether it works with an AAD account:</p>



<pre class="wp-block-code"><code>jdbcDF = spark.read \
    .format("com.microsoft.sqlserver.jdbc.spark") \
    .option("url", f"jdbc:sqlserver://azure-sql-server-albert.database.windows.net:1433;databaseName=databricksdata") \
    .option("dbtable", "SalesLT.Product") \
    .option("authentication", "ActiveDirectoryPassword") \
    .option("user", "sqluser@anogues4hotmail.onmicrosoft.com") \
    .option("password", "XXXXXX") \
    .option("encrypt", "true") \
    .option("hostNameInCertificate", "*.database.windows.net") \
    .load()
jdbcDF.show()</code></pre>



<figure class="wp-block-image size-large"><img loading="lazy" decoding="async" width="1024" height="366" src="https://www.albertnogues.com/wp-content/uploads/2021/12/SQLDBR4-1024x366.png" alt="" class="wp-image-1230" srcset="https://www.albertnogues.com/wp-content/uploads/2021/12/SQLDBR4-1024x366.png 1024w, https://www.albertnogues.com/wp-content/uploads/2021/12/SQLDBR4-300x107.png 300w, https://www.albertnogues.com/wp-content/uploads/2021/12/SQLDBR4-768x274.png 768w, https://www.albertnogues.com/wp-content/uploads/2021/12/SQLDBR4-1536x549.png 1536w, https://www.albertnogues.com/wp-content/uploads/2021/12/SQLDBR4-2048x732.png 2048w, https://www.albertnogues.com/wp-content/uploads/2021/12/SQLDBR4-168x60.png 168w" sizes="auto, (max-width: 1024px) 100vw, 1024px" /></figure>



<p>To use a service principal you need to generate a token. In Python this can be accomplished with the <a href="https://pypi.org/project/adal/" target="_blank" rel="noreferrer noopener">adal</a> library (which needs to be installed on the cluster as well, from PyPI). There is a sample notebook in the Microsoft Spark driver GitHub account <a href="https://github.com/microsoft/sql-spark-connector/tree/master/samples/Databricks-AzureSQL/DatabricksNotebooks">here</a>.</p>



<p>More information about the driver can be found in the Microsoft GitHub repository <a href="https://github.com/microsoft/sql-spark-connector" target="_blank" rel="noreferrer noopener">here</a>.</p>
<p>The post <a href="https://www.albertnogues.com/databricks-connectivity-to-azure-sql-sql-server/">Databricks connectivity to Azure SQL / SQL Server</a> appeared first on <a href="https://www.albertnogues.com">Albert Nogués</a>.</p>
]]></content:encoded>
					
		
		
			</item>
		<item>
		<title>Introduction to the maths of bookmaking (with python code)</title>
		<link>https://www.albertnogues.com/introduction-to-the-maths-of-bookmaking-with-python-code/?utm_source=rss&#038;utm_medium=rss&#038;utm_campaign=introduction-to-the-maths-of-bookmaking-with-python-code</link>
		
		<dc:creator><![CDATA[Albert]]></dc:creator>
		<pubDate>Sat, 27 Jun 2015 20:12:23 +0000</pubDate>
				<category><![CDATA[Betting]]></category>
		<category><![CDATA[Blog]]></category>
		<category><![CDATA[Python]]></category>
		<category><![CDATA[betting]]></category>
		<category><![CDATA[prices]]></category>
		<category><![CDATA[probability]]></category>
		<category><![CDATA[python]]></category>
		<guid isPermaLink="false">http://192.168.1.40/?p=180</guid>

					<description><![CDATA[<p>Introduction In this article I will show you how to calculate simple things about the odds the bookmakers offer and how to play with them with the intention of using the real chance of each outcome to model a group of prices. Basically what we will do is the following: retrieve the odds of a &#8230; </p>
<p>The post <a href="https://www.albertnogues.com/introduction-to-the-maths-of-bookmaking-with-python-code/">Introduction to the maths of bookmaking (with python code)</a> appeared first on <a href="https://www.albertnogues.com">Albert Nogués</a>.</p>
]]></description>
										<content:encoded><![CDATA[<h1>Introduction</h1>
<p>In this article I will show you how to calculate simple things about the odds bookmakers offer and how to play with them, with the intention of using the real chance of each outcome to model a set of prices. Basically, we will do the following:</p>
<ul>
<li>retrieve the odds of a horse race</li>
<li>calculate the overround applied</li>
<li>determine the true odds</li>
<li>generate a new set of odds with the desired overround. We will see several techniques, these are:
<ul>
<li>First approach for pricing: Apply the overround linearly</li>
<li>A better approach: Apply the overround based on the chance of winning</li>
<li>The real deal: Apply the overround based on a model</li>
</ul>
</li>
</ul>
<h1>Retrieve the Odds</h1>
<p>For the examples in this article I will be using the odds of a horse race held at Doncaster on the 27th of June 2015. This was the last race on the card, a class 4 handicap with 7 runners, but any race or sport would suit.</p>
<p>The odds on offer at the time of writing were the following (taken from oddschecker):</p>
<p>Rio Ronaldo <strong>3.25</strong> 3.0 3.0 3.25 3.0 3.25 3.25 3.0 3.0 2.75 3.25 3.25<br />Beau Eile <strong>3.5</strong> 3.25 3.25 3.25 3.5 3.25 3.25 3.25 3.25 3.25 3.5 3.5<br />Bahamian Sunrise <strong>4.0</strong> 4.0 3.75 3.75 4.0 3.75 3.75 3.75 4.0 4.0 3.5 3.75<br />Silver Rainbow <strong>13.0</strong> 13.0 9.0 13.0 12.0 11.0 11.0 13.0 11.0 11.0 10.0 9.0<br />Snow Cloud <strong>15.0</strong> 15.0 12.0 13.0 10.0 12.0 13.0 12.0 12.0 15.0 9.0 12.0<br />Equally Fast <strong>17.0</strong> 17.0 17.0 15.0 17.0 17.0 17.0 17.0 15.0 17.0 13.0 17.0<br />Mc Diamond <strong>67.0</strong> 67.0 41.0 34.0 67.0 51.0 41.0 34.0 51.0 67.0 41.0 41.0</p>
<p>In this article we will choose the best price or joint best price available, but any set of odds could be chosen.</p>
<p>So we construct our list of best prices with the following values: [3.25, 3.5, 4.0, 13.0, 15.0, 17.0, 67.0]</p>
<pre class="wp-block-code"><code>maxPrices = [3.25, 3.5, 4.0, 13.0, 15.0, 17.0, 67.0]</code></pre>
<h1>Calculate the Overround of a set of outcomes</h1>
<p>Calculating the overround of a set of prices is easy. Basically, we iterate through the list of prices, work out the winning chance implied by each one, accumulate them, and see by how much this number exceeds 1 (or 100% if we are counting percentages).</p>
<p>To work out the probability of each outcome winning, we do the following division:</p>
<p>1 / odds</p>
<p>Then we sum all these probabilities to get the overround of the race:</p>
<pre class="wp-block-code"><code>overround = 0
for price in maxPrices:
    overround = overround + 1/price
print("Total overround is", overround)</code></pre>
<p>which gives us the following output: Total overround is 1.0607452395424302</p>
<h1>Determine the true odds</h1>
<p>To calculate the fair price we multiply each current price by the overround calculated in the previous step. If we were working with probabilities, we would instead divide each probability by the overround.</p>
<pre class="wp-block-code"><code>fairPrice = []
for price in maxPrices:
    fairPrice = fairPrice + [price * overround]
print("fairPrice", fairPrice)</code></pre>
<p>The new fair price list without the overround is the following:</p>
<p>fairPrice [3.4474220285128983, 3.7126083383985056, 4.242980958169721, 13.789688114051593, 15.911178593136452, 18.032669072221314, 71.06993104934283]</p>
<h1>Generate a new set of odds</h1>
<p>The next step is to generate a new set of odds. These can be generated with different techniques; we cover the following in this article:</p>
<h2>First approach for pricing: Apply the overround linearly</h2>
<p>This solution is not the most useful one, but in some cases it may work. Basically, it consists of dividing the total overround percentage equally amongst all the outcomes. This is usually not a good idea, as we can get inflated prices for the favourites compared to the outsiders. And as we know, money is likely to go to the runners heading the market, so from a bookmaking point of view it does not make much sense.</p>
<p>We have not considered this solution interesting enough for the article, so we are not covering it in detail.</p>
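<p>For completeness, here is a minimal sketch of what the linear approach would look like (my own illustration, not part of the original article): every outcome receives the same absolute slice of the target margin, which is exactly why the outsiders get crushed while the favourites barely move.</p>

```python
# Fair prices from the previous step (rounded for readability)
fair_prices = [3.447, 3.713, 4.243, 13.790, 15.911, 18.033, 71.070]
target_margin = 0.05  # we want a 105% book

# Linear method: add the same probability mass to every outcome
extra = target_margin / len(fair_prices)
linear_prices = [1 / (1 / p + extra) for p in fair_prices]

# The 71.07 outsider shortens to roughly 47, while the 3.45
# favourite only drops to about 3.36 -- a very lopsided margin
print(linear_prices)
```

<p>The book still sums to the target overround, but the margin is carried almost entirely by the outsiders.</p>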
<h2>A better approach: Apply the overround based on the chance of winning</h2>
<p>In this paragraph we present a better approach. Instead of dividing the overround into equal parts, we divide it according to each outcome&#8217;s chance of winning, so, based on the calculated odds, each outcome carries a proportional share of the overround. This partially compensates for the problem with the previous method and will usually be more than enough, though sometimes it is still not the perfect solution.</p>
<p>In our sample, we will apply a 5% overround to the fair prices calculated in the previous step.</p>
<pre class="wp-block-code"><code>appliedOverround5pct = []
for price in fairPrice:
    appliedOverround5pct = appliedOverround5pct + [price/1.05]
print("appliedOverround 5%", appliedOverround5pct)</code></pre>
<p>The new list with a 5% overround is the following. As you can see, prices are slightly higher than they were originally, as the overround is about 1% lower:<br />[3.2832590747741888, 3.5358174651414336, 4.040934245875924, 13.133036299096755, 15.153503422034715, 17.17397054497268, 67.68564861842174]</p>
<h2>The real deal: Apply the overround based on a model</h2>
<p>This solution entails building a model from prices without overround and winning results. With a large number of outcomes we would be able to model and predict the overround to apply, based on this historical data.</p>
<p>Since this would involve building a model, which usually means some complexity, and the samples presented are good enough to get going, we leave it out of the scope of this article.</p>
<p>The full code is here:</p>


<pre class="wp-block-code"><code>def pythonPriceOverround():
    maxPrices = &#91;3.25, 3.5, 4.0, 13.0, 15.0, 17.0, 67.0]
    print(maxPrices)

    overround = 0
    for price in maxPrices:
        overround = overround + 1/price
    print("Total overround is", overround)

    fairPrice = &#91;]
    for price in maxPrices:
        fairPrice = fairPrice + &#91;price * overround]
    print("fairPrice", fairPrice)

    appliedOverround5pct = &#91;]
    for price in fairPrice:
        appliedOverround5pct = appliedOverround5pct + &#91;price/1.05]
    print("appliedOverround 5%", appliedOverround5pct)

    #Check that now the overround is indeed 5%
    overround = 0
    for price in appliedOverround5pct:
        overround = overround + 1/price
    print("Total overround is", overround)</code></pre>
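<p>The step-by-step snippets above can also be folded into a few small reusable functions. This is my own refactor of the same logic, not code from the original article:</p>

```python
def overround(prices):
    # Sum of implied probabilities; exactly 1.0 would be a fair book
    return sum(1 / p for p in prices)

def fair_prices(prices):
    # Strip the margin by scaling every price up by the overround
    ovr = overround(prices)
    return [p * ovr for p in prices]

def apply_margin(prices, margin):
    # Shorten every price proportionally, e.g. margin=0.05 for a 105% book
    return [p / (1 + margin) for p in prices]

maxPrices = [3.25, 3.5, 4.0, 13.0, 15.0, 17.0, 67.0]
priced = apply_margin(fair_prices(maxPrices), 0.05)
print("Total overround is", overround(priced))
```

<p>Written this way, the same pipeline can be reused for any race simply by swapping the input price list.</p>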



<p>The post <a href="https://www.albertnogues.com/introduction-to-the-maths-of-bookmaking-with-python-code/">Introduction to the maths of bookmaking (with python code)</a> appeared first on <a href="https://www.albertnogues.com">Albert Nogués</a>.</p>
]]></content:encoded>
					
		
		
			</item>
	</channel>
</rss>
