Cloudera Manager installation in Google Cloud

Installation of Cloudera Manager and small CDH Cluster Lab in Google Cloud

As preparation for the CCA Administrator certification, we need a working cluster for our practice tests.

I’m going to start by showing you how to install a CDH cluster in Google Cloud.

First, we need to install Cloudera Manager, which will guide us through the process of setting up a small cluster to use as a test lab. The cluster will consist of 2 master nodes (we will enable HDFS HA and YARN HA) and 3 workers, since the default HDFS replication factor is 3 and we will stick to that. We will also use one of the workers as a gateway, to keep the number of machines down.

Apart from these 5 machines, Cloudera Manager will run on a separate machine, so in total we will need 6 machines. The CPU and RAM required by each machine vary, but to keep things simple we will use only two instance types. In this post we start with Cloudera Manager; in the following one we will set up the cluster.

Cloudera Manager Installation

Open your Google Cloud console, go to the Compute Engine section, select VM Instances and click on Create. A new tab will open with some default values that we need to edit.

I will name the instance cloudera-manager to simplify things (instance names must be lowercase and cannot contain spaces). Choose the appropriate zone based on your geographic location. As the machine type we will select n1-highmem-2, which has 2 virtual CPUs and 13 GB of RAM. Using the standard n1-standard-2 instance (7.5 GB of RAM) is risky, because Cloudera Manager may fail to start if it does not have enough memory: it requires approximately 8 GB, and with less it produces unexpected errors. So we are going to play it safe.

As the boot disk, we are going to change the default Debian image to CentOS 7. Since this machine will only host the Cloudera Manager installation, the disk will only hold that and the OS files. Just in case, we will increase the disk from the default 10 GB to 20 GB to be on the safe side. Leave all the other options at their defaults, although later we will need to work on the firewall to open the Cloudera Manager port. Once finished, click on Create.

After a few seconds, our instance will be created. It will show the private and public IPs. To connect through SSH from the outside, we will need the public IP.

Before connecting through SSH, we need to generate a key pair. Follow the Google Cloud guide for this purpose.

Once the keys are generated, we can connect to our machine: open your SSH client, or the one integrated in Google Cloud (which runs in the web browser), and log in.

Before installing Cloudera Manager, let’s do a couple of configuration tasks that we may need later: disabling SELinux and updating the system.

To disable SELinux, run sudo vi /etc/sysconfig/selinux, set SELINUX=disabled and reboot the machine.


sudo yum update && sudo yum install wget

After installing wget, we can proceed to download the cloudera manager installer:


chmod u+x cloudera-manager-installer.bin

sudo ./cloudera-manager-installer.bin

The installer will start. We only need to follow the wizard through the setup process, and the download of the Java SDK and Cloudera Manager will begin. The last step shows the URL of Cloudera Manager, but note two things: first, the URL is given using the machine name, and with that URL we won’t be able to reach the server; second, we have to open the port in the firewall so that it becomes accessible from the outside.

So now we wait a few seconds, go back to our Google Cloud dashboard and click on VPC Network > Firewall Rules.

Now click on Create Firewall Rule and name it cm7180. In Targets, select either your instance or all instances in the network (not recommended). In Source IP ranges, enter either your own IP, if you always connect from the same place, or the subnet provided by your ISP. Under Protocols and ports, select “Specified protocols and ports” and enter tcp:7180.
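If you are unsure whether the address you connect from falls inside the source range you entered, a quick check can be sketched with Python's standard ipaddress module. The addresses below are hypothetical placeholders taken from the documentation-reserved blocks, not real values; replace them with your own.

```python
import ipaddress

def is_allowed(client_ip, source_ranges):
    """Return True if client_ip falls inside any of the CIDR source ranges."""
    address = ipaddress.ip_address(client_ip)
    return any(address in ipaddress.ip_network(cidr) for cidr in source_ranges)

# Hypothetical range, standing in for the one entered in the firewall rule
source_ranges = ["203.0.113.0/24"]

print(is_allowed("203.0.113.25", source_ranges))  # address inside the range
print(is_allowed("198.51.100.7", source_ranges))  # address outside the range
```

A single address can be written as a /32 range, for example 203.0.113.25/32.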

If you still have issues connecting, firewalld on the instance may be blocking access. In that case you can either disable it completely or configure it to allow access to port 7180.

After that, point your browser to http://<your public ip>:7180 and you should see the cloudera manager login screen. The username and password to log in are admin/admin and these should be changed ASAP.
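If the login screen does not load, it helps to know whether the port is reachable at all before looking elsewhere. The following is a minimal sketch using only the standard socket module; the function name port_is_open and the IP address are assumptions for illustration, so replace the address with the public IP of your instance.

```python
import socket

def port_is_open(host, port, timeout=3.0):
    """Return True if a TCP connection to host:port succeeds within timeout seconds."""
    try:
        with socket.create_connection((host, port), timeout=timeout):
            return True
    except OSError:
        # Covers refused connections, timeouts and unreachable hosts
        return False

if __name__ == "__main__":
    # Hypothetical address: replace with the public IP of your instance
    print(port_is_open("203.0.113.25", 7180))
```

A result of False points to the firewall rule or firewalld, or to Cloudera Manager not having finished starting yet.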

The login screen may appear in your own language (Spanish in my case), because Cloudera Manager follows your browser’s locale configuration. Once logged in, we can finish the installation of Cloudera Manager, which basically consists of accepting the license terms and choosing the license: Cloudera Manager Express, Cloudera Enterprise 60-day trial, or Cloudera Enterprise with a license.

At this point we will choose the 60-day trial, as we want to perform some tasks in the cluster that are not possible with the Express version.

And that completes the installation of Cloudera Manager. In the next post we will cover the installation of CDH, which is simply the wizard that follows the license selection in Cloudera Manager.

Stay tuned for the next chapter soon!

Business Intelligence Tools for Small Companies: A Guide to Free and Low-Cost Solutions

Juan Valladares and I have finished our book and it is now published. You can get it from major retailers, from the publisher’s website (Apress), or from Amazon.

The book:

  • Teaches how to implement and manage the business intelligence/data warehousing (BI/DWH) infrastructure for a small company
  • Provides practice extracting data from any enterprise resource planning (ERP) tool
  • Uses open-source extract-transform-load (ETL) tools to process and integrate BI data
  • Shows how to query, report, and analyze BI data using open-source visualization and dashboard tools
  • Requires no previous knowledge

Learn how to transition from Excel-based business intelligence (BI) analysis to enterprise stacks of open-source BI tools. Select and implement the best free and freemium open-source BI tools for your company’s needs and design, implement, and integrate BI automation across the full stack using agile methodologies.

Business Intelligence Tools for Small Companies provides hands-on demonstrations of open-source tools suitable for the BI requirements of small businesses. The authors draw on their deep experience as BI consultants, developers, and administrators to guide you through the extract-transform-load/data warehousing (ETL/DWH) sequence of extracting data from an enterprise resource planning (ERP) database freely available on the Internet, transforming the data, manipulating them, and loading them into a relational database.

The authors demonstrate how to extract, report, and dashboard key performance indicators (KPIs) in a visually appealing format from the relational database management system (RDBMS). They model the selection and implementation of free and freemium tools such as Pentaho Data Integrator and Talend for ETL, Oracle XE and MySQL/MariaDB for RDBMS, and Qliksense, Power BI, and MicroStrategy Desktop for reporting. This richly illustrated guide models the deployment of a small company BI stack on an inexpensive cloud platform such as AWS.

What You’ll Learn

You will learn how to manage, integrate, and automate the processes of BI by selecting and implementing tools to:

  • Implement and manage the business intelligence/data warehousing (BI/DWH) infrastructure
  • Extract data from any enterprise resource planning (ERP) tool
  • Process and integrate BI data using open-source extract-transform-load (ETL) tools
  • Query, report, and analyze BI data using open-source visualization and dashboard tools
  • Use a MOLAP tool to define next year’s budget, integrating real data with target scenarios
  • Deploy BI solutions and big data experiments inexpensively on cloud platforms

Who This Book Is For
Engineers, DBAs, analysts, consultants, and managers at small companies with limited resources but whose BI requirements have outgrown the limitations of Excel spreadsheets; personnel in mid-sized companies with established BI systems who are exploring technological updates and more cost-efficient solutions

Introduction to the maths of bookmaking (with python code)


In this article I will show you some simple calculations on the odds that bookmakers offer, and how to use the real chance of each outcome to model a set of prices. Basically, we will do the following:

  • retrieve the odds of a horse race
  • calculate the overround applied
  • determine the true odds
  • generate a new set of odds with the desired overround. We will look at several techniques:
    • First approach for pricing: Apply the overround linearly
    • A better approach: Apply the overround based on the chance of winning
    • The real deal: Apply the overround based on a model

Retrieve the Odds

For the example in this article I will use the odds of a horse race held at Doncaster on the 27th of June 2015. It was the last race on the card, a class 4 handicap with 7 runners, but any race or sport would do.

The odds on offer at the time of writing were the following (taken from oddschecker):

Rio Ronaldo 3.25 3.0 3.0 3.25 3.0 3.25 3.25 3.0 3.0 2.75 3.25 3.25
Beau Eile 3.5 3.25 3.25 3.25 3.5 3.25 3.25 3.25 3.25 3.25 3.5 3.5
Bahamian Sunrise 4.0 4.0 3.75 3.75 4.0 3.75 3.75 3.75 4.0 4.0 3.5 3.75
Silver Rainbow 13.0 13.0 9.0 13.0 12.0 11.0 11.0 13.0 11.0 11.0 10.0 9.0
Snow Cloud 15.0 15.0 12.0 13.0 10.0 12.0 13.0 12.0 12.0 15.0 9.0 12.0
Equally Fast 17.0 17.0 17.0 15.0 17.0 17.0 17.0 17.0 15.0 17.0 13.0 17.0
Mc Diamond 67.0 67.0 41.0 34.0 67.0 51.0 41.0 34.0 51.0 67.0 41.0 41.0

In this article we will choose the best price (or joint best price) available for each runner, but any set of odds could be chosen.

So we construct our list of best prices with the following values: [3.25, 3.5, 4.0, 13.0, 15.0, 17.0, 67.0]

maxPrices = [3.25, 3.5, 4.0, 13.0, 15.0, 17.0, 67.0]
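Rather than picking the best prices by eye, the same list can be derived from the table. This is a small sketch that assumes the odds are held in a dictionary keyed by runner name (the odds variable is introduced here for illustration); it simply takes the maximum of each row.

```python
# Odds offered by each bookmaker for every runner, copied from the table above
odds = {
    "Rio Ronaldo": [3.25, 3.0, 3.0, 3.25, 3.0, 3.25, 3.25, 3.0, 3.0, 2.75, 3.25, 3.25],
    "Beau Eile": [3.5, 3.25, 3.25, 3.25, 3.5, 3.25, 3.25, 3.25, 3.25, 3.25, 3.5, 3.5],
    "Bahamian Sunrise": [4.0, 4.0, 3.75, 3.75, 4.0, 3.75, 3.75, 3.75, 4.0, 4.0, 3.5, 3.75],
    "Silver Rainbow": [13.0, 13.0, 9.0, 13.0, 12.0, 11.0, 11.0, 13.0, 11.0, 11.0, 10.0, 9.0],
    "Snow Cloud": [15.0, 15.0, 12.0, 13.0, 10.0, 12.0, 13.0, 12.0, 12.0, 15.0, 9.0, 12.0],
    "Equally Fast": [17.0, 17.0, 17.0, 15.0, 17.0, 17.0, 17.0, 17.0, 15.0, 17.0, 13.0, 17.0],
    "Mc Diamond": [67.0, 67.0, 41.0, 34.0, 67.0, 51.0, 41.0, 34.0, 51.0, 67.0, 41.0, 41.0],
}

# Best price available for each runner, in the same order as the table
maxPrices = [max(prices) for prices in odds.values()]
print(maxPrices)  # [3.25, 3.5, 4.0, 13.0, 15.0, 17.0, 67.0]
```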

Calculate the Overround of a set of outcomes

Calculating the overround of a set of prices is easy. We iterate through the list of prices, calculate the chance of winning implied by each one, add them up, and see by how much the total exceeds 1 (or 100% if we are working in percentages).

To work out the implied probability of each outcome, we do the following division:

1 / odds

Then we sum all these probabilities to get the overround of the race:

overround = 0
for price in maxPrices:
    overround = overround + 1/price
print("Total overround is",overround) 

which gives us the following output: Total overround is 1.0607452395424302

Determine the true odds

To calculate the fair prices, we multiply each current price by the overround calculated in the previous step. If we were working with probabilities instead of prices, we would divide each probability by the overround.

fairPrice = []
for price in maxPrices:
    fairPrice = fairPrice + [price * overround]

The new fair price list without the overround is the following:

fairPrice [3.4474220285128983, 3.7126083383985056, 4.242980958169721, 13.789688114051593, 15.911178593136452, 18.032669072221314, 71.06993104934283]
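A quick way to convince ourselves that these prices really are fair is to check that their implied probabilities add up to 1. The snippet below is a self-contained sketch of that check; the fairProbabilities name is introduced here for illustration.

```python
import math

maxPrices = [3.25, 3.5, 4.0, 13.0, 15.0, 17.0, 67.0]
overround = sum(1 / price for price in maxPrices)
fairPrice = [price * overround for price in maxPrices]

# Implied probability of each runner at its fair price
fairProbabilities = [1 / price for price in fairPrice]
total = sum(fairProbabilities)

print("Sum of fair probabilities:", total)
# Allow for floating point rounding rather than testing for exactly 1
print("Adds up to 1:", math.isclose(total, 1.0))
```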

Generate a new set of odds

The next step is to generate a new set of odds. This can be done with different techniques. We cover the following in this article:

First approach for pricing: Apply the overround linearly

This solution is not the most useful one, but in some cases it may work. It consists of dividing the total overround percentage equally amongst all the outcomes. This is usually not a good idea, as the favourites end up with inflated prices relative to the outsiders. And as we know, money is likely to go on those heading the market, so from a bookmaking point of view it does not make much sense.

We do not consider this solution interesting enough for this article, so we are not covering it in detail.
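For reference only, the idea can be sketched in a few lines. The sketch assumes that "linearly" means every runner receives the same slice of a 5% overround, added to its fair probability; the variable names are introduced here for illustration.

```python
maxPrices = [3.25, 3.5, 4.0, 13.0, 15.0, 17.0, 67.0]
overround = sum(1 / price for price in maxPrices)

# Fair (overround-free) prices and probabilities of each runner
fairPrices = [price * overround for price in maxPrices]
fairProbabilities = [1 / price for price in fairPrices]

# Assumption: the 5% is shared equally, so each runner gets the same slice
targetOverround = 1.05
extraPerRunner = (targetOverround - 1) / len(fairProbabilities)
linearPrices = [1 / (p + extraPerRunner) for p in fairProbabilities]

# Compare each fair price with its linearly adjusted price
for fair, linear in zip(fairPrices, linearPrices):
    print(round(fair, 2), "->", round(linear, 2))
```

With these numbers the favourite is shortened only slightly, while the rank outsider loses roughly a third of its price, which illustrates the imbalance described above.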

A better approach: Apply the overround based on the chance of winning

In this section we present a better approach. Instead of dividing the overround in equal parts, we distribute it according to each outcome’s chance of winning: every fair price is shortened by the same factor, so each outcome carries a share of the overround proportional to its probability. This partially compensates for the problem with the previous method and will usually be more than enough, although it is still not the perfect solution.

In our example, we will apply a 5% overround to the fair prices calculated in the previous step.

appliedOverround5pct = []
for price in fairPrice:
    appliedOverround5pct = appliedOverround5pct + [price/1.05]
print("appliedOverround 5%",appliedOverround5pct)

The new list with a 5% overround is the following. As you can see, the prices are slightly higher than they were originally, as the overround is about 1% lower:
[3.2832590747741888, 3.5358174651414336, 4.040934245875924, 13.133036299096755, 15.153503422034715, 17.17397054497268, 67.68564861842174]
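The two steps (removing the existing overround and applying the new one) can be folded into a single helper that reprices to any target. This is a sketch of the same proportional technique; the reprice function name is made up for this example.

```python
def reprice(prices, targetOverround):
    """Rescale prices so that their implied probabilities sum to targetOverround."""
    currentOverround = sum(1 / price for price in prices)
    # Multiplying by the current overround gives the fair price;
    # dividing by the target overround then applies the new margin
    return [price * currentOverround / targetOverround for price in prices]

maxPrices = [3.25, 3.5, 4.0, 13.0, 15.0, 17.0, 67.0]
newPrices = reprice(maxPrices, 1.05)

print(newPrices)
print("Total overround is", sum(1 / price for price in newPrices))
```

Calling reprice(maxPrices, 1.0) returns the fair prices, and any other target, such as 1.10 for a 10% overround, works the same way.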

The real deal: Apply the overround based on a model

This solution entails building a model from prices without overround and winning results. With a large number of historical outcomes, we would be able to model and predict the overround to apply.

Since this would involve creating a model, which usually means some complexity, and the examples presented are good enough to get going, we leave this outside the scope of this article.

The full code is here:

def pythonPriceOverround():
    # Best (or joint best) price on offer for each runner
    maxPrices = [3.25, 3.5, 4.0, 13.0, 15.0, 17.0, 67.0]

    # The overround is the sum of the implied probabilities (1 / odds)
    overround = 0
    for price in maxPrices:
        overround = overround + 1/price

    print("Total overround is",overround)

    # Fair prices: scale each price up so the implied probabilities sum to 1
    fairPrice = []
    for price in maxPrices:
        fairPrice = fairPrice + [price * overround]

    # Apply a 5% overround to every fair price
    appliedOverround5pct = []
    for price in fairPrice:
        appliedOverround5pct = appliedOverround5pct + [price/1.05]
    print("appliedOverround 5%",appliedOverround5pct)

    # Check that the overround is now indeed 5%
    overround = 0
    for price in appliedOverround5pct:
        overround = overround + 1/price

    print("Total overround is",overround)

# Run the whole calculation
pythonPriceOverround()