Many experienced academics will tell you that 20% of any data science project involves building and setting up predictive models, and the other 80% involves the tedious but necessary work of collecting the data and getting it into the right form. Our project was no exception. We had to collect public data in different formats (PDF, XML, HTML, JSON) and find ways to automatically match teams across different sources.
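One possible way to approach this kind of matching is an alias table with a fuzzy fallback. The sketch below is only an illustration, not the pipeline we actually ran: the `ALIASES` table, the `normalize_team` helper, and the 0.8 similarity cutoff are all assumptions made for the example.

```python
import difflib

# Hypothetical alias table: keys are normalized spellings seen in the sources,
# values are the canonical names we want to end up with.
ALIASES = {
    "man u": "Manchester United",
    "man united": "Manchester United",
    "manchester united": "Manchester United",
}


def normalize_team(raw_name, cutoff=0.8):
    """Map a raw team name to its canonical form, or None if nothing matches."""
    # Lowercase, drop dots, and collapse whitespace before looking anything up.
    key = " ".join(raw_name.lower().replace(".", " ").split())
    if key in ALIASES:
        return ALIASES[key]
    # Fall back to fuzzy matching against the known spellings (assumed cutoff).
    close = difflib.get_close_matches(key, list(ALIASES), n=1, cutoff=cutoff)
    return ALIASES[close[0]] if close else None
```

Names that return `None` would still need a manual look, which is where much of the tedious work tends to go.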

For example, one source listed the team as Man U, another as Man United, a third as Manchester United, and so on. It was long, hard work, but eventually we had our first set of meaningful data and could move on to building machine learning models.

## A bit of maths

Initially we weren't sure which algorithm would suit our project best, so we decided to treat betting as a classification problem. With this approach, we only need to predict one of three possible outcomes: a win for the home team, a win for the visitors, or a draw.

Of course, there are other approaches, such as predicting the number of goals scored by each team and then using that information to determine the winner.
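To make the three-way target concrete, here is a minimal sketch of how a scoreline maps to a class label. The function name and the label strings are hypothetical; the same helper would also turn a pair of predicted goal counts into a winner under the goal-based approach.

```python
def outcome_label(home_goals, away_goals):
    """Return the three-way class label for a finished (or predicted) match."""
    if home_goals > away_goals:
        return "home_win"
    if home_goals < away_goals:
        return "away_win"
    return "draw"


# Example: label a few historical scorelines (made-up numbers).
scores = [(2, 1), (0, 0), (1, 3)]
labels = [outcome_label(h, a) for h, a in scores]
```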

## Classification approach

But we decided to go with a classification approach. Diving into different algorithms (k-nearest neighbours, logistic regression, random forests, artificial neural networks), we came to appreciate the beauty of Scikit-Learn.

One of the features we used was the Elo rating system. In short, it is a method for calculating the relative strength of players in two-player games. It is best known from chess, but it can also be applied to other sports. Using Elo, we calculated the strength of each team from 1990 onwards, and this improved our models' performance considerably.

To give you an idea of the predictive power of the Elo variable alone, it helps to look at the relationship between a hypothetical model using only Elo and the results of the same games.
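For readers unfamiliar with Elo, the standard update rule can be sketched in a few lines. This is the textbook chess formula, not necessarily the exact variant we used: the K-factor of 20, the 400-point scale, and the absence of any home-advantage adjustment are assumptions made for the example.

```python
def expected_score(rating_a, rating_b):
    """Expected score of side A against side B under the standard Elo formula."""
    return 1.0 / (1.0 + 10 ** ((rating_b - rating_a) / 400.0))


def update_ratings(home, away, home_score, k=20.0):
    """Return the new (home, away) ratings after one match.

    home_score is 1.0 for a home win, 0.5 for a draw, 0.0 for an away win.
    k is an assumed K-factor controlling how fast ratings move.
    """
    delta = k * (home_score - expected_score(home, away))
    # Elo is zero-sum: whatever one side gains, the other loses.
    return home + delta, away - delta
```

Replaying every match in date order through `update_ratings` gives each team a rating before each game, and the rating difference can then be fed to a classifier as a feature.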

## Which betting system to choose

Calculating the probability of winning and the size of the winnings is a key element of a betting site. The site should explain how betting odds work and how they relate to the probability and size of winnings.

• In the European (decimal) system, the odds state the total return per unit staked, so the payout is calculated as bet * odds. For a side given a 25% chance of winning, the fair odds are 1 / 0.25 = 4.0; bookmakers quote slightly less to keep a margin. For example, if the bet is \$100 and the odds are 1.4, a winning user receives \$140 (\$100 * 1.4), of which \$40 is profit (ref.: The legality of betting sites in India).
• The Hong Kong system works the same way, but the odds exclude the stake, so 0.4 is quoted instead of 1.4. Neither the probability nor the size of the winnings changes: a winning user still gets 40% of the bet as profit, plus the stake back.
• The British system uses fractional odds, e.g. 5/10. The right-hand number shows how many units are staked, and the left-hand number shows how many units the user wins on that stake. So at odds of 5/10, a \$100 bet can win \$50 ((\$100 / 10) * 5), plus the stake back.
• The American system uses odds such as +150 and -250, quoted relative to \$100. A positive number shows how much a \$100 bet wins: at +150, you bet \$100 to win \$150. A negative number shows how much you must bet to win \$100: at -250, you bet \$250 to win \$100.
• The Indonesian system uses the same scheme as the American system, but scaled to a unit stake, so the odds are decimals such as +1.5 and -2.5 instead of integers.
• The Malaysian system also uses signed numbers, but as decimals between -1 and +1. A positive number is the profit per unit staked; a negative number is the stake needed to win one unit, e.g. at -0.45 you bet \$45 to win \$100.
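Since all of these formats describe the same payout, it can help to convert everything to decimal odds first. The helpers below are a sketch under the standard definitions of each format (American odds relative to a 100-unit stake, Indonesian odds as American divided by 100, Malay odds between -1 and +1); the function names are hypothetical.

```python
def fair_decimal_odds(probability):
    """Decimal odds implied by a win probability, before any bookmaker margin."""
    return 1.0 / probability


def hong_kong_to_decimal(hk):
    """Hong Kong odds exclude the stake, so add it back."""
    return hk + 1.0


def fractional_to_decimal(numerator, denominator):
    """British odds: win `numerator` units for every `denominator` staked."""
    return 1.0 + numerator / denominator


def american_to_decimal(odds):
    """+150: win 150 on a 100 stake. -250: stake 250 to win 100."""
    if odds > 0:
        return 1.0 + odds / 100.0
    return 1.0 + 100.0 / abs(odds)


def indonesian_to_decimal(odds):
    """Indonesian odds are American odds scaled to a unit stake."""
    return american_to_decimal(odds * 100.0)


def malay_to_decimal(odds):
    """Positive: profit per unit staked. Negative: stake needed to win one unit."""
    if odds > 0:
        return 1.0 + odds
    return 1.0 + 1.0 / abs(odds)


def profit(stake, decimal_odds):
    """Net winnings on a successful bet, excluding the returned stake."""
    return stake * (decimal_odds - 1.0)
```

Under these definitions, American -250, Indonesian -2.5, Hong Kong 0.4 and decimal 1.4 all describe the same bet: a \$100 stake returns \$40 in profit.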