<?xml version="1.0" encoding="UTF-8"?><rss version="2.0"
	xmlns:content="http://purl.org/rss/1.0/modules/content/"
	xmlns:wfw="http://wellformedweb.org/CommentAPI/"
	xmlns:dc="http://purl.org/dc/elements/1.1/"
	xmlns:atom="http://www.w3.org/2005/Atom"
	xmlns:sy="http://purl.org/rss/1.0/modules/syndication/"
	xmlns:slash="http://purl.org/rss/1.0/modules/slash/"
	>

<channel>
	<title>Python Archives - Albert Nogués</title>
	<atom:link href="https://www.albertnogues.com/category/python/feed/" rel="self" type="application/rss+xml" />
	<link>https://www.albertnogues.com/category/python/</link>
	<description>Data and Cloud Freelancer</description>
	<lastBuildDate>Fri, 31 May 2024 11:17:08 +0000</lastBuildDate>
	<language>en-US</language>
	<sy:updatePeriod>
	hourly	</sy:updatePeriod>
	<sy:updateFrequency>
	1	</sy:updateFrequency>
	

<image>
	<url>https://www.albertnogues.com/wp-content/uploads/2020/12/cropped-cropped-AlbertLogo2-32x32.png</url>
	<title>Python Archives - Albert Nogués</title>
	<link>https://www.albertnogues.com/category/python/</link>
	<width>32</width>
	<height>32</height>
</image> 
	<item>
		<title>Data Quality Checks with Soda-Core in Databricks</title>
		<link>https://www.albertnogues.com/data-quality-checks-with-soda-core-in-databricks/?utm_source=rss&#038;utm_medium=rss&#038;utm_campaign=data-quality-checks-with-soda-core-in-databricks</link>
		
		<dc:creator><![CDATA[Albert]]></dc:creator>
		<pubDate>Fri, 31 May 2024 11:17:06 +0000</pubDate>
				<category><![CDATA[Azure]]></category>
		<category><![CDATA[Databricks]]></category>
		<category><![CDATA[Python]]></category>
		<category><![CDATA[data]]></category>
		<category><![CDATA[data quality]]></category>
		<category><![CDATA[python]]></category>
		<category><![CDATA[soda]]></category>
		<category><![CDATA[spark]]></category>
		<category><![CDATA[sql]]></category>
		<guid isPermaLink="false">https://www.albertnogues.com/?p=3439</guid>

					<description><![CDATA[<p>It&#8217;s easy to run data quality checks when working with Spark using the soda-core library. The library supports Spark DataFrames. I&#8217;ve tested it in a Databricks environment and it worked quite easily for me. For the examples in this article I am loading the customers table from the TPCH delta tables in the &#8230; </p>
<p>The post <a href="https://www.albertnogues.com/data-quality-checks-with-soda-core-in-databricks/">Data Quality Checks with Soda-Core in Databricks</a> appeared first on <a href="https://www.albertnogues.com">Albert Nogués</a>.</p>
]]></description>
										<content:encoded><![CDATA[
<p>It&#8217;s easy to run data quality checks when working with Spark using the soda-core library. The library supports Spark DataFrames. I&#8217;ve tested it in a Databricks environment and it worked quite easily for me.</p>



<p>For the examples in this article I am loading the customers table from the TPCH delta tables in the databricks-datasets folder.</p>



<p>First of all, we need to install the library, either scoped to our Databricks notebook or on our cluster. In my case I will install it notebook-scoped:</p>



<pre class="wp-block-code"><code>%pip install soda-core-spark-df</code></pre>



<p>Then we create a DataFrame from the TPCH customers table:</p>



<pre class="wp-block-code"><code># Read the TPCH customers Delta table into a DataFrame
customer_df = spark.read.table("delta.`/databricks-datasets/tpch/delta-001/customer/`")</code></pre>



<p>We create a temporary view over our DataFrame so Soda can query the data and run the checks:</p>



<pre class="wp-block-code"><code># Create a temporary view so Soda can reference the data by name
customer_df.createOrReplaceTempView("customer")</code></pre>



<p>And here comes the heart of Soda Core: we define the checks using YAML syntax (SodaCL):</p>



<pre class="wp-block-code"><code>from soda.scan import Scan
scan = Scan()
scan.set_scan_definition_name("Databricks Test Notebook")
scan.set_data_source_name("customer")
scan.add_spark_session(spark, data_source_name="customer")
#YAML Format
checks = '''
checks for customer:
  - row_count > 0
  - invalid_percent(c_phone) = 0:
      valid regex: ^&#91;0-9]{2}&#91;-]&#91;0-9]{3}&#91;-]&#91;0-9]{3}&#91;-]&#91;0-9]{4}$
  - duplicate_count(c_phone) = 0:
      name: No duplicate phone numbers
  - invalid_count(c_mktsegment) = 0:
      invalid values: &#91;HOUSEHOLD]
      name: HOUSEHOLD is not allowed as a Market Segment
'''
# you can use add_sodacl_yaml_file(s). Useful if the tests are in a github repo or FS
scan.add_sodacl_yaml_str(checks)
scan.execute()
print(scan.get_logs_text())</code></pre>


<div class="wp-block-image">
<figure class="aligncenter size-full"><img fetchpriority="high" decoding="async" width="570" height="322" src="https://www.albertnogues.com/wp-content/uploads/2024/05/Output1.png" alt="" class="wp-image-3440" srcset="https://www.albertnogues.com/wp-content/uploads/2024/05/Output1.png 570w, https://www.albertnogues.com/wp-content/uploads/2024/05/Output1-300x169.png 300w, https://www.albertnogues.com/wp-content/uploads/2024/05/Output1-106x60.png 106w" sizes="(max-width: 570px) 100vw, 570px" /></figure>
</div>


<p>More info: <a href="https://docs.soda.io/soda/quick-start-databricks.html">Add Soda to a Databricks notebook | Soda Documentation</a></p>



<p>List of validations: <a href="https://docs.soda.io/soda-cl/validity-metrics.html">Validity metrics | Soda Documentation</a> and <a href="https://docs.soda.io/soda-cl/metrics-and-checks.html">SodaCL metrics and checks | Soda Documentation</a></p>



<p>We can enhance this a bit and build a Spark DataFrame out of the list of our warning or error validation checks:</p>



<pre class="wp-block-code"><code>from datetime import datetime

schema_checks = 'datasource STRING, table STRING, rule_name STRING, rule STRING, column STRING, check_status STRING, number_of_errors_in_sample INT, check_time TIMESTAMP'

# Cache the results dict instead of calling get_scan_results() on every access
results = scan.get_scan_results()
scan_time = datetime.strptime(results&#91;'dataTimestamp'], '%Y-%m-%dT%H:%M:%S%z')

list_of_checks = &#91;]
for c in results&#91;'checks']:
    # Passing checks report 0 failing rows; failing ones expose a failing-row count
    failing_rows = 0 if 'pass' in c&#91;'outcome'] else int(c&#91;'diagnostics']&#91;'blocks']&#91;0]&#91;'totalFailingRows'])
    list_of_checks.append(&#91;results&#91;'defaultDataSource'], c&#91;'table'], c&#91;'name'], c&#91;'definition'], c&#91;'column'], c&#91;'outcome'], failing_rows, scan_time])

list_of_checks_df = spark.createDataFrame(list_of_checks, schema_checks)
display(list_of_checks_df)</code></pre>
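<p>The structure of the dictionary returned by scan.get_scan_results() can be illustrated with a small mock, which is handy for unit testing the flattening logic without a cluster. The field values below are my own illustration and the exact shape is what I observed in my runs, so treat it as an assumption rather than a documented contract:</p>

```python
from datetime import datetime

# A minimal mock of scan.get_scan_results() -- field names as observed in my
# own runs; the values are invented for illustration.
mock_results = {
    "defaultDataSource": "customer",
    "dataTimestamp": "2024-05-31T11:17:06+00:00",
    "checks": [
        {"table": "customer", "name": "No duplicate phone numbers",
         "definition": "duplicate_count(c_phone) = 0", "column": "c_phone",
         "outcome": "pass", "diagnostics": {"blocks": []}},
        {"table": "customer", "name": "Valid phone format",
         "definition": "invalid_percent(c_phone) = 0", "column": "c_phone",
         "outcome": "fail",
         "diagnostics": {"blocks": [{"totalFailingRows": 150000}]}},
    ],
}

def checks_to_rows(results):
    # Flatten the checks list into rows matching the schema_checks columns
    scan_time = datetime.strptime(results["dataTimestamp"], "%Y-%m-%dT%H:%M:%S%z")
    rows = []
    for c in results["checks"]:
        failing = 0 if "pass" in c["outcome"] else int(
            c["diagnostics"]["blocks"][0]["totalFailingRows"])
        rows.append([results["defaultDataSource"], c["table"], c["name"],
                     c["definition"], c["column"], c["outcome"], failing, scan_time])
    return rows

rows = checks_to_rows(mock_results)
print(rows)
```

<p>With a Spark session available, spark.createDataFrame(rows, schema_checks) then yields the same kind of DataFrame as above.</p>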



<figure class="wp-block-image size-large is-resized"><img decoding="async" width="1024" height="403" src="https://www.albertnogues.com/wp-content/uploads/2024/05/DataFrameOutput-1024x403.png" alt="" class="wp-image-3441" style="width:840px;height:auto" srcset="https://www.albertnogues.com/wp-content/uploads/2024/05/DataFrameOutput-1024x403.png 1024w, https://www.albertnogues.com/wp-content/uploads/2024/05/DataFrameOutput-300x118.png 300w, https://www.albertnogues.com/wp-content/uploads/2024/05/DataFrameOutput-768x302.png 768w, https://www.albertnogues.com/wp-content/uploads/2024/05/DataFrameOutput-152x60.png 152w, https://www.albertnogues.com/wp-content/uploads/2024/05/DataFrameOutput.png 1328w" sizes="(max-width: 1024px) 100vw, 1024px" /></figure>



<p>In case we have the YAML file in our GitHub repo, we can read it and pass it in. Or, if we are working with Databricks Repos and the file is part of our repo, we can load it locally.</p>



<p>Accessing a remote file and reading it with requests:</p>



<pre class="wp-block-code"><code># Trying to use a remote YAML file to enforce rules. We can upload it to a GitHub repo of our own and use it in our notebook.
# I've created a public repo so I don't need to authenticate to GitHub, but in a real-world scenario we should use a private repo + secret scopes
import requests

customer_quality_rules = 'https://raw.githubusercontent.com/anogues/soda-core-quality-rules/main/soda-core-quality-rules-customer.yaml'
scan.add_sodacl_yaml_str(requests.get(customer_quality_rules).text)</code></pre>



<p>Or we can load it locally if we are using Databricks Repos:</p>



<pre class="wp-block-code"><code>scan.add_sodacl_yaml_file("your_file.yaml")</code></pre>
<p>The post <a href="https://www.albertnogues.com/data-quality-checks-with-soda-core-in-databricks/">Data Quality Checks with Soda-Core in Databricks</a> appeared first on <a href="https://www.albertnogues.com">Albert Nogués</a>.</p>
]]></content:encoded>
					
		
		
			</item>
		<item>
		<title>Useful Databricks/Spark resources</title>
		<link>https://www.albertnogues.com/useful-databricks-spark-resources/?utm_source=rss&#038;utm_medium=rss&#038;utm_campaign=useful-databricks-spark-resources</link>
		
		<dc:creator><![CDATA[Albert]]></dc:creator>
		<pubDate>Wed, 14 Dec 2022 12:58:28 +0000</pubDate>
				<category><![CDATA[BigData]]></category>
		<category><![CDATA[Databricks]]></category>
		<category><![CDATA[Python]]></category>
		<category><![CDATA[Spark]]></category>
		<category><![CDATA[SQL]]></category>
		<guid isPermaLink="false">https://www.albertnogues.com/?p=1694</guid>

					<description><![CDATA[<p>Memory Profiling in PySpark: https://www.databricks.com/blog/2022/11/30/memory-profiling-pyspark.html Run Databricks queries directly from VSCODE: https://ganeshchandrasekaran.com/run-your-databricks-sql-queries-from-vscode-9c70c5d4903c Spark Testing with chispa: https://github.com/alexott/spark-playground/tree/master/testing Best Practices for Cost Management on Databricks: https://www.databricks.com/blog/2022/10/18/best-practices-cost-management-databricks.html UDF Pyspark: https://docs.databricks.com/udf/python.html Pandas UDF&#8217;s: https://docs.databricks.com/udf/pandas.html Introducing Pandas UDF for PySpark: https://www.databricks.com/blog/2017/10/30/introducing-vectorized-udfs-for-pyspark.html</p>
<p>The post <a href="https://www.albertnogues.com/useful-databricks-spark-resources/">Useful Databricks/Spark resources</a> appeared first on <a href="https://www.albertnogues.com">Albert Nogués</a>.</p>
]]></description>
										<content:encoded><![CDATA[
<p>Memory Profiling in PySpark: <a href="https://www.databricks.com/blog/2022/11/30/memory-profiling-pyspark.html" target="_blank" rel="noreferrer noopener">https://www.databricks.com/blog/2022/11/30/memory-profiling-pyspark.html</a></p>



<p>Run Databricks queries directly from VSCODE: <a href="https://ganeshchandrasekaran.com/run-your-databricks-sql-queries-from-vscode-9c70c5d4903c" target="_blank" rel="noreferrer noopener">https://ganeshchandrasekaran.com/run-your-databricks-sql-queries-from-vscode-9c70c5d4903c</a></p>



<p>Spark Testing with chispa: <a href="https://github.com/alexott/spark-playground/tree/master/testing" target="_blank" rel="noreferrer noopener">https://github.com/alexott/spark-playground/tree/master/testing</a></p>



<p>Best Practices for Cost Management on Databricks: <a href="https://www.databricks.com/blog/2022/10/18/best-practices-cost-management-databricks.html" target="_blank" rel="noreferrer noopener">https://www.databricks.com/blog/2022/10/18/best-practices-cost-management-databricks.html</a></p>



<p>UDF Pyspark: <a href="https://docs.databricks.com/udf/python.html" target="_blank" rel="noreferrer noopener">https://docs.databricks.com/udf/python.html</a></p>



<p>Pandas UDFs: <a href="https://docs.databricks.com/udf/pandas.html" target="_blank" rel="noreferrer noopener">https://docs.databricks.com/udf/pandas.html</a></p>



<p>Introducing Pandas UDF for PySpark: <a href="https://www.databricks.com/blog/2017/10/30/introducing-vectorized-udfs-for-pyspark.html" target="_blank" rel="noreferrer noopener">https://www.databricks.com/blog/2017/10/30/introducing-vectorized-udfs-for-pyspark.html</a></p>
<p>The post <a href="https://www.albertnogues.com/useful-databricks-spark-resources/">Useful Databricks/Spark resources</a> appeared first on <a href="https://www.albertnogues.com">Albert Nogués</a>.</p>
]]></content:encoded>
					
		
		
			</item>
		<item>
		<title>Databricks connectivity to Azure SQL / SQL Server</title>
		<link>https://www.albertnogues.com/databricks-connectivity-to-azure-sql-sql-server/?utm_source=rss&#038;utm_medium=rss&#038;utm_campaign=databricks-connectivity-to-azure-sql-sql-server</link>
		
		<dc:creator><![CDATA[Albert]]></dc:creator>
		<pubDate>Thu, 09 Dec 2021 10:45:34 +0000</pubDate>
				<category><![CDATA[Azure]]></category>
		<category><![CDATA[Databricks]]></category>
		<category><![CDATA[Python]]></category>
		<category><![CDATA[Spark]]></category>
		<category><![CDATA[SQL]]></category>
		<guid isPermaLink="false">https://www.albertnogues.com/?p=1224</guid>

					<description><![CDATA[<p>Most of the developments I see inside Databricks rely on fetching data from or writing data to some sort of database. Usually the preferred method for this is through a JDBC driver, as most databases offer one. In some cases, though, it&#8217;s also possible to use a Spark-optimized driver. This &#8230; </p>
<p>The post <a href="https://www.albertnogues.com/databricks-connectivity-to-azure-sql-sql-server/">Databricks connectivity to Azure SQL / SQL Server</a> appeared first on <a href="https://www.albertnogues.com">Albert Nogués</a>.</p>
]]></description>
										<content:encoded><![CDATA[
<p>Most of the developments I see inside Databricks rely on fetching data from or writing data to some sort of database.</p>



<p>Usually the preferred method for this is through a JDBC driver, as most databases offer one.</p>



<p>In some cases, though, it&#8217;s also possible to use a Spark-optimized driver. This is the case for Azure SQL / SQL Server. We still have the option to use the standard JDBC driver (which is what most people do, because it&#8217;s common to all databases), but we can improve performance by using a specific Spark driver. Until some time ago it was only supported through the Scala API, but now it can be used from Python and R as well, so there is no reason not to give it a try.</p>



<p>In this article we will see both options for making this connection. For testing purposes we will connect to an Azure SQL database in the same region (West Europe).</p>



<h2 class="wp-block-heading">Connecting to Azure SQL through the JDBC driver</h2>



<p>In this case the JDBC driver is already shipped with the Databricks cluster, so we do not need to install anything and can connect directly. Let&#8217;s see how (there is a Scala example <a href="https://docs.microsoft.com/es-es/azure/databricks/data/data-sources/sql-databases" target="_blank" rel="noreferrer noopener">here</a>, but I will use Python for this example):</p>



<pre class="wp-block-code"><code>#In a real development this should be fetched from a keyvault using a secret scope with: dbutils.secrets.get(scope = "sql_db", key = "username") and  dbutils.secrets.get(scope = "sql_db", key = "password")

jdbcDF = spark.read.format("jdbc") \
    .option("url", f"jdbc:sqlserver://azure-sql-server-albert.database.windows.net:1433;databaseName=databricksdata") \
    .option("dbtable", "SalesLT.Product") \
    .option("user", "anogues") \
    .option("password", "XXXXXX") \
    .option("driver", "com.microsoft.sqlserver.jdbc.SQLServerDriver") \
    .load()

jdbcDF.show()</code></pre>
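<p>To keep connection literals out of the notebook body, the option map above can be assembled by a small helper function. This is just a sketch of mine (the helper name and its defaults are not from any library), and in a real notebook the user and password should still come from a secret scope:</p>

```python
def jdbc_options(server, database, table, user, password, port=1433):
    # Option map for spark.read.format("jdbc").options(**...).load()
    return {
        "url": f"jdbc:sqlserver://{server}:{port};databaseName={database}",
        "dbtable": table,
        "user": user,
        "password": password,
        "driver": "com.microsoft.sqlserver.jdbc.SQLServerDriver",
    }

opts = jdbc_options("azure-sql-server-albert.database.windows.net",
                    "databricksdata", "SalesLT.Product", "anogues", "XXXXXX")
print(opts["url"])
```

<p>The read then becomes spark.read.format("jdbc").options(**opts).load(), and switching to the optimized driver discussed below only means changing the format string and dropping the driver entry.</p>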



<figure class="wp-block-image size-large"><img decoding="async" width="1024" height="362" src="https://www.albertnogues.com/wp-content/uploads/2021/12/SQLDBR1-1024x362.png" alt="" class="wp-image-1226" srcset="https://www.albertnogues.com/wp-content/uploads/2021/12/SQLDBR1-1024x362.png 1024w, https://www.albertnogues.com/wp-content/uploads/2021/12/SQLDBR1-300x106.png 300w, https://www.albertnogues.com/wp-content/uploads/2021/12/SQLDBR1-768x272.png 768w, https://www.albertnogues.com/wp-content/uploads/2021/12/SQLDBR1-1536x543.png 1536w, https://www.albertnogues.com/wp-content/uploads/2021/12/SQLDBR1-2048x724.png 2048w, https://www.albertnogues.com/wp-content/uploads/2021/12/SQLDBR1-170x60.png 170w" sizes="(max-width: 1024px) 100vw, 1024px" /><figcaption>Spark Dataframe from a JDBC Azure SQL DB Source</figcaption></figure>



<p>So, as we saw, we have been able to connect successfully to our Azure SQL DB using the JDBC driver shipped with Databricks. Let&#8217;s now switch to the Spark-optimized driver.</p>



<h2 class="wp-block-heading">Connecting to Azure SQL through the Spark-optimized driver</h2>



<p>To connect using the Spark-optimized driver, we first need to install it on the cluster, as it&#8217;s not available by default.</p>



<p>The driver is available in Maven for both Spark 2.x and 3.x. On the Microsoft <a href="https://docs.microsoft.com/en-us/sql/connect/spark/connector?view=sql-server-ver15">website</a> we can find more information on where to get it and how to use it. For the purposes of this exercise we will install it through Databricks libraries, using Maven. Just add the following in the coordinates box: com.microsoft.azure:spark-mssql-connector_2.12:1.2.0, as can be seen in the image below.</p>



<figure class="wp-block-image size-large"><img loading="lazy" decoding="async" width="1024" height="323" src="https://www.albertnogues.com/wp-content/uploads/2021/12/SQLDBR2-1024x323.png" alt="" class="wp-image-1227" srcset="https://www.albertnogues.com/wp-content/uploads/2021/12/SQLDBR2-1024x323.png 1024w, https://www.albertnogues.com/wp-content/uploads/2021/12/SQLDBR2-300x95.png 300w, https://www.albertnogues.com/wp-content/uploads/2021/12/SQLDBR2-768x242.png 768w, https://www.albertnogues.com/wp-content/uploads/2021/12/SQLDBR2-1536x485.png 1536w, https://www.albertnogues.com/wp-content/uploads/2021/12/SQLDBR2-190x60.png 190w, https://www.albertnogues.com/wp-content/uploads/2021/12/SQLDBR2.png 2006w" sizes="auto, (max-width: 1024px) 100vw, 1024px" /><figcaption>Installing the spark AzureSQL Driver from Maven</figcaption></figure>



<p>Once installed, we should see a green dot next to the driver, which means it is ready to be used. We go back to our notebook and try:</p>



<pre class="wp-block-code"><code>#In a real development this should be fetched from a keyvault using a secret scope with: dbutils.secrets.get(scope = "sql_db", key = "username") and  dbutils.secrets.get(scope = "sql_db", key = "password")
jdbcDF = spark.read.format("com.microsoft.sqlserver.jdbc.spark") \
    .option("url", f"jdbc:sqlserver://azure-sql-server-albert.database.windows.net:1433;databaseName=databricksdata") \
    .option("dbtable", "SalesLT.Product") \
    .option("user", "anogues") \
    .option("password", "XXXXXX") \
    .load()

jdbcDF.show()</code></pre>



<p>If we see an error like java.lang.ClassNotFoundException: com.microsoft.sqlserver.jdbc.spark, it means the driver can&#8217;t be found, so it&#8217;s probably not properly installed. Check the libraries on the cluster again and make sure the status is installed. If all goes well, we should see our DataFrame again:</p>



<figure class="wp-block-image size-large"><img loading="lazy" decoding="async" width="1024" height="351" src="https://www.albertnogues.com/wp-content/uploads/2021/12/SQLDBR3-1-1024x351.png" alt="" class="wp-image-1229" srcset="https://www.albertnogues.com/wp-content/uploads/2021/12/SQLDBR3-1-1024x351.png 1024w, https://www.albertnogues.com/wp-content/uploads/2021/12/SQLDBR3-1-300x103.png 300w, https://www.albertnogues.com/wp-content/uploads/2021/12/SQLDBR3-1-768x263.png 768w, https://www.albertnogues.com/wp-content/uploads/2021/12/SQLDBR3-1-1536x526.png 1536w, https://www.albertnogues.com/wp-content/uploads/2021/12/SQLDBR3-1-2048x701.png 2048w, https://www.albertnogues.com/wp-content/uploads/2021/12/SQLDBR3-1-175x60.png 175w" sizes="auto, (max-width: 1024px) 100vw, 1024px" /><figcaption> Spark Dataframe from a Spark Azure SQL DB Source </figcaption></figure>



<p>The reason to use the optimized Spark driver is usually performance: Microsoft claims it&#8217;s about 15x faster than the JDBC one. But there is more. The Spark driver also allows AAD authentication, either with a service principal or an AAD account, apart of course from native SQL Server authentication. Let&#8217;s see whether it works with an AAD account:</p>



<pre class="wp-block-code"><code>jdbcDF = spark.read \
    .format("com.microsoft.sqlserver.jdbc.spark") \
    .option("url", f"jdbc:sqlserver://azure-sql-server-albert.database.windows.net:1433;databaseName=databricksdata") \
    .option("dbtable", "SalesLT.Product") \
    .option("authentication", "ActiveDirectoryPassword") \
    .option("user", "sqluser@anogues4hotmail.onmicrosoft.com") \
    .option("password", "XXXXXX") \
    .option("encrypt", "true") \
    .option("hostNameInCertificate", "*.database.windows.net") \
    .load()
jdbcDF.show()</code></pre>



<figure class="wp-block-image size-large"><img loading="lazy" decoding="async" width="1024" height="366" src="https://www.albertnogues.com/wp-content/uploads/2021/12/SQLDBR4-1024x366.png" alt="" class="wp-image-1230" srcset="https://www.albertnogues.com/wp-content/uploads/2021/12/SQLDBR4-1024x366.png 1024w, https://www.albertnogues.com/wp-content/uploads/2021/12/SQLDBR4-300x107.png 300w, https://www.albertnogues.com/wp-content/uploads/2021/12/SQLDBR4-768x274.png 768w, https://www.albertnogues.com/wp-content/uploads/2021/12/SQLDBR4-1536x549.png 1536w, https://www.albertnogues.com/wp-content/uploads/2021/12/SQLDBR4-2048x732.png 2048w, https://www.albertnogues.com/wp-content/uploads/2021/12/SQLDBR4-168x60.png 168w" sizes="auto, (max-width: 1024px) 100vw, 1024px" /></figure>



<p>To use a service principal you need to generate a token. In Python this can be accomplished with the <a href="https://pypi.org/project/adal/" target="_blank" rel="noreferrer noopener">adal</a> library (which needs to be installed on the cluster as well, from PyPI). There is a sample notebook in the Microsoft Spark driver GitHub account <a href="https://github.com/microsoft/sql-spark-connector/tree/master/samples/Databricks-AzureSQL/DatabricksNotebooks">here</a>.</p>



<p>More information about the driver can be found in the Microsoft GitHub repository <a href="https://github.com/microsoft/sql-spark-connector" target="_blank" rel="noreferrer noopener">here</a>.</p>
<p>The post <a href="https://www.albertnogues.com/databricks-connectivity-to-azure-sql-sql-server/">Databricks connectivity to Azure SQL / SQL Server</a> appeared first on <a href="https://www.albertnogues.com">Albert Nogués</a>.</p>
]]></content:encoded>
					
		
		
			</item>
		<item>
		<title>Introduction to the maths of bookmaking (with python code)</title>
		<link>https://www.albertnogues.com/introduction-to-the-maths-of-bookmaking-with-python-code/?utm_source=rss&#038;utm_medium=rss&#038;utm_campaign=introduction-to-the-maths-of-bookmaking-with-python-code</link>
		
		<dc:creator><![CDATA[Albert]]></dc:creator>
		<pubDate>Sat, 27 Jun 2015 20:12:23 +0000</pubDate>
				<category><![CDATA[Betting]]></category>
		<category><![CDATA[Blog]]></category>
		<category><![CDATA[Python]]></category>
		<category><![CDATA[betting]]></category>
		<category><![CDATA[prices]]></category>
		<category><![CDATA[probability]]></category>
		<category><![CDATA[python]]></category>
		<guid isPermaLink="false">http://192.168.1.40/?p=180</guid>

					<description><![CDATA[<p>Introduction In this article I will show you how to calculate simple things about the odds the bookmakers offer and how to play with them with the intention of using the real chance of each outcome to model a group of prices. Basically what we will do is the following: retrieve the odds of a &#8230; </p>
<p>The post <a href="https://www.albertnogues.com/introduction-to-the-maths-of-bookmaking-with-python-code/">Introduction to the maths of bookmaking (with python code)</a> appeared first on <a href="https://www.albertnogues.com">Albert Nogués</a>.</p>
]]></description>
										<content:encoded><![CDATA[<h1>Introduction</h1>
<p>In this article I will show you how to calculate simple things about the odds bookmakers offer and how to play with them, with the intention of using the real chance of each outcome to model a set of prices. Basically, we will do the following:</p>
<ul>
<li>retrieve the odds of a horse race</li>
<li>calculate the overround applied</li>
<li>determine the true odds</li>
<li>generate a new set of odds with the desired overround. We will see several techniques, these are:
<ul>
<li>First approach for pricing: Apply the overround linearly</li>
<li>A better approach: Apply the overround based on the chance of winning</li>
<li>The real deal: Apply the overround based on a model</li>
</ul>
</li>
</ul>
<h1>Retrieve the Odds</h1>
<p>For the examples in this article I will be using the odds of a horse race held at Doncaster on the 27th of June 2015. This was the last race on the card, a class 4 handicap with 7 runners, but any race or sport would suit.</p>
<p>The odds on offer at the time of writing were the following (taken from oddschecker):</p>
<p>Rio Ronaldo <strong>3.25</strong> 3.0 3.0 3.25 3.0 3.25 3.25 3.0 3.0 2.75 3.25 3.25<br />Beau Eile <strong>3.5</strong> 3.25 3.25 3.25 3.5 3.25 3.25 3.25 3.25 3.25 3.5 3.5<br />Bahamian Sunrise <strong>4.0</strong> 4.0 3.75 3.75 4.0 3.75 3.75 3.75 4.0 4.0 3.5 3.75<br />Silver Rainbow <strong>13.0</strong> 13.0 9.0 13.0 12.0 11.0 11.0 13.0 11.0 11.0 10.0 9.0<br />Snow Cloud <strong>15.0</strong> 15.0 12.0 13.0 10.0 12.0 13.0 12.0 12.0 15.0 9.0 12.0<br />Equally Fast <strong>17.0</strong> 17.0 17.0 15.0 17.0 17.0 17.0 17.0 15.0 17.0 13.0 17.0<br />Mc Diamond <strong>67.0</strong> 67.0 41.0 34.0 67.0 51.0 41.0 34.0 51.0 67.0 41.0 41.0</p>
<p>In this article we will choose the best price or joint best price available, but any set of odds could be chosen.</p>
<p>So we construct our list of best prices with the following values: [3.25, 3.5, 4.0, 13.0, 15.0, 17.0, 67.0]</p>
<pre class="wp-block-code"><code>maxPrices = [3.25, 3.5, 4.0, 13.0, 15.0, 17.0, 67.0]</code></pre>
<h1>Calculate the Overround of a set of outcomes</h1>
<p>Calculating the overround of a set of prices is easy. Basically, we iterate through the list of prices, work out the winning chance implied by each one, accumulate them, and see by how much this number exceeds 1 (or 100% if we are counting percentages).</p>
<p>To work out the probability of each outcome winning, we do the following division:</p>
<p>1 / odds</p>
<p>Then we sum all these probabilities to get the overround of the race:</p>
<pre class="wp-block-code"><code>overround = 0
for price in maxPrices:
    overround = overround + 1/price
print("Total overround is", overround)</code></pre>
<p>which gives us the following output: Total overround is 1.0607452395424302</p>
<h1>Determine the true odds</h1>
<p>To calculate the fair price we multiply each current price by the overround calculated in the previous step. If we were working with probabilities, we would instead divide each probability by the overround.</p>
<pre class="wp-block-code"><code>fairPrice = []
for price in maxPrices:
    fairPrice = fairPrice + [price * overround]
print("fairPrice", fairPrice)</code></pre>
<p>The new fair price list without the overround is the following:</p>
<p>fairPrice [3.4474220285128983, 3.7126083383985056, 4.242980958169721, 13.789688114051593, 15.911178593136452, 18.032669072221314, 71.06993104934283]</p>
<h1>Generate a new set of odds</h1>
<p>The next step is to generate a new set of odds. These can be generated with different techniques; we cover the following in this article:</p>
<h2>First approach for pricing: Apply the overround linearly</h2>
<p>This solution is not the most useful one, but in some cases it may work. Basically, it consists of dividing the total overround percentage equally amongst all the outcomes. This is usually not a good idea, as we can get inflated prices for the favourites compared to the outsiders. And as we know, money is likely to go to the runners heading the market, so from a bookmaking point of view it does not make much sense.</p>
<p>We have not considered this solution interesting enough for the article, so we are not covering it in detail.</p>
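<p>For completeness, here is a minimal sketch of what the linear approach would look like (my own illustration, not part of the original article): every outcome receives the same absolute slice of the target margin, which is exactly why the outsiders get crushed while the favourites barely move.</p>

```python
# Fair prices from the previous step (rounded for readability)
fair_prices = [3.447, 3.713, 4.243, 13.790, 15.911, 18.033, 71.070]
target_margin = 0.05  # we want a 105% book

# Linear method: add the same probability mass to every outcome
extra = target_margin / len(fair_prices)
linear_prices = [1 / (1 / p + extra) for p in fair_prices]

# The 71.07 outsider shortens to roughly 47, while the 3.45
# favourite only drops to about 3.36 -- a very lopsided margin
print(linear_prices)
```

<p>The book still sums to the target overround, but the margin is carried almost entirely by the outsiders.</p>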
<h2>A better approach: Apply the overround based on the chance of winning</h2>
<p>In this paragraph we present a better approach. Instead of dividing the overround into equal parts, we divide it according to each outcome&#8217;s chance of winning, so, based on the calculated odds, each outcome carries a proportional share of the overround. This partially compensates for the problem with the previous method and will usually be more than enough, though sometimes it is still not the perfect solution.</p>
<p>In our sample, we will apply a 5% overround to the fair prices calculated in the previous step.</p>
<pre class="wp-block-code"><code>appliedOverround5pct = []
for price in fairPrice:
    appliedOverround5pct = appliedOverround5pct + [price/1.05]
print("appliedOverround 5%", appliedOverround5pct)</code></pre>
<p>The new list with a 5% overround is the following. As you can see, prices are slightly higher than they were originally, as the overround is about 1% lower:<br />[3.2832590747741888, 3.5358174651414336, 4.040934245875924, 13.133036299096755, 15.153503422034715, 17.17397054497268, 67.68564861842174]</p>
<h2>The real deal: Apply the overround based on a model</h2>
<p>This solution entails building a model from prices without overround and winning results. With a large number of outcomes we would be able to model and predict the overround to apply, based on this historical data.</p>
<p>Since this would involve building a model, which usually means some complexity, and the samples presented are good enough to get going, we leave it out of the scope of this article.</p>
<p>The full code is here:</p>


<pre class="wp-block-code"><code>def pythonPriceOverround():
    maxPrices = &#91;3.25, 3.5, 4.0, 13.0, 15.0, 17.0, 67.0]
    print(maxPrices)

    overround = 0
    for price in maxPrices:
        overround = overround + 1/price
    print("Total overround is", overround)

    fairPrice = &#91;]
    for price in maxPrices:
        fairPrice = fairPrice + &#91;price * overround]
    print("fairPrice", fairPrice)

    appliedOverround5pct = &#91;]
    for price in fairPrice:
        appliedOverround5pct = appliedOverround5pct + &#91;price/1.05]
    print("appliedOverround 5%", appliedOverround5pct)

    #Check that now the overround is indeed 5%
    overround = 0
    for price in appliedOverround5pct:
        overround = overround + 1/price
    print("Total overround is", overround)</code></pre>
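<p>The step-by-step snippets above can also be folded into a few small reusable functions. This is my own refactor of the same logic, not code from the original article:</p>

```python
def overround(prices):
    # Sum of implied probabilities; exactly 1.0 would be a fair book
    return sum(1 / p for p in prices)

def fair_prices(prices):
    # Strip the margin by scaling every price up by the overround
    ovr = overround(prices)
    return [p * ovr for p in prices]

def apply_margin(prices, margin):
    # Shorten every price proportionally, e.g. margin=0.05 for a 105% book
    return [p / (1 + margin) for p in prices]

maxPrices = [3.25, 3.5, 4.0, 13.0, 15.0, 17.0, 67.0]
priced = apply_margin(fair_prices(maxPrices), 0.05)
print("Total overround is", overround(priced))
```

<p>Written this way, the same pipeline can be reused for any race simply by swapping the input price list.</p>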



<p>The post <a href="https://www.albertnogues.com/introduction-to-the-maths-of-bookmaking-with-python-code/">Introduction to the maths of bookmaking (with python code)</a> appeared first on <a href="https://www.albertnogues.com">Albert Nogués</a>.</p>
]]></content:encoded>
					
		
		
			</item>
	</channel>
</rss>
