Who is going to win the football world cup 2018?

Intro

We aim to predict the winner of the 2018 FIFA World Cup based solely on data. The method is not fancy at all, but it should do the trick and yield some neat results (spoiler alert: Germany wins!). We use three datasets from Kaggle, which together provide the outcomes of past pairings between teams, each team's rank and points, and the weighted point difference to the opponent. Then, we build a model to predict the outcome of each match of the 2018 FIFA World Cup. To make the results more appealing, we translate the outcome probabilities into fair odds.

Data

The first dataset stems from Tadhg Fitzgerald and contains all available FIFA men’s international soccer rankings from August 1993 to April 2018. The rankings and points have been scraped from the official FIFA website. The second dataset stems from Mart Jüriso and includes the results of around 40k international football matches, from the very first official match in 1872 up to 2018. Again, the games are strictly men’s full internationals. This dataset will be used to quantify the effect of the rank difference, the point difference and the current rank of the teams on a match’s outcome. As we aim to predict the result of the ongoing FIFA World Cup, we use a third dataset from Nuggs to get its matches.

import numpy as np 
import pandas as pd 
from matplotlib import pyplot as plt

#import datasets
my_rankings = pd.read_csv('./data_i_o/input_data/fifa-international-soccer-mens-ranking-1993now/fifa_ranking.csv')

my_rankings = my_rankings.loc[:,['rank', 'country_full', 'country_abrv', 'cur_year_avg_weighted', 'rank_date', 
                           'two_year_ago_weighted', 'three_year_ago_weighted']]
#rename country names
my_rankings = my_rankings.replace({"IR Iran": "Iran"})
#sum the weighted point components into a single score
my_rankings['weighted_points'] =  my_rankings['cur_year_avg_weighted'] + my_rankings['two_year_ago_weighted'] + my_rankings['three_year_ago_weighted']
my_rankings['rank_date'] = pd.to_datetime(my_rankings['rank_date'])

#read the match results and, again, rename country names
my_matches = pd.read_csv('./data_i_o/input_dat/international-football-results-from-1872-to-2017/results.csv')
my_matches =  my_matches.replace({'Germany DR': 'Germany', 'China': 'China PR'})
my_matches['date'] = pd.to_datetime(my_matches['date'])

#read data
world_cup_2018 = pd.read_csv('./data_i_o/input_dat/fifa-worldcup-2018-dataset/World Cup 2018 Dataset.csv')
world_cup_2018= world_cup_2018.loc[:, ['Team', 'Group', 'First match \nagainst', 'Second match\n against', 'Third match\n against']]
world_cup_2018 = world_cup_2018.dropna(how='all')
#rename country names
world_cup_2018 = world_cup_2018.replace({"IRAN": "Iran", 
                               "Costarica": "Costa Rica", 
                               "Porugal": "Portugal", 
                               "Columbia": "Colombia", 
                               "Korea" : "Korea Republic"})
world_cup_2018 = world_cup_2018.set_index('Team')
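
Since the three datasets spell some countries differently, it can help to check which names still fail to match after renaming. The following is a minimal sketch on toy lists, not part of the original analysis; the helper name `unmatched_teams` is hypothetical, and it should work the same way when passed `world_cup_2018.index` and `my_rankings['country_full']`.

```python
def unmatched_teams(teams, reference):
    """Return the team names that do not appear among the reference names."""
    return sorted(set(teams) - set(reference))

# Toy stand-ins for the real columns (illustrative values only)
ranking_names = ['Germany', 'IR Iran', 'Portugal']
world_cup_names = ['Germany', 'IRAN', 'Porugal']

before = unmatched_teams(world_cup_names, ranking_names)

# Apply the same kind of renaming as in the loading code
ranking_fix = {'IR Iran': 'Iran'}
world_cup_fix = {'IRAN': 'Iran', 'Porugal': 'Portugal'}
ranking_names = [ranking_fix.get(name, name) for name in ranking_names]
world_cup_names = [world_cup_fix.get(name, name) for name in world_cup_names]

after = unmatched_teams(world_cup_names, ranking_names)

print(before)  # names that would be lost in a join
print(after)   # empty once all spellings agree
```

An empty result means every World Cup team can be looked up in the rankings; any name left over would silently drop out of the later merges.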

Feature Engineering

There is no magic here; we keep things as simple as possible. First, we join the rankings of each team and extract the pairwise point and rank differences. Friendly games should, a priori, be harder to predict, as players have little incentive to perform at their limit and instead try to avoid injuries. Hence, to avoid confusing our model in the next step, we also mark friendly games in our dataset. Rest days between matches might also hide a useful pattern in players’ performance, as they should be positively correlated with the probability of winning a game (unless you play for Mexico!). As the last step, we additionally one-hot encode the participating countries.

#get daily rankings
my_rankings = my_rankings.set_index(['rank_date'])\
            .groupby(['country_full'], group_keys=False)\
            .resample('D').first()\
            .ffill()\
            .reset_index()

#merge the daily rankings into the matches, first for the home team, then for the away team
my_matches = my_matches.merge(my_rankings, 
                        left_on=['date', 'home_team'], 
                        right_on=['rank_date', 'country_full'])
my_matches = my_matches.merge(my_rankings, 
                        left_on=['date', 'away_team'], 
                        right_on=['rank_date', 'country_full'], 
                        suffixes=('_home', '_away'))
#feature engineering
my_matches['rank_diff'] = my_matches['rank_home'] - my_matches['rank_away']
my_matches['average_rank'] = (my_matches['rank_home'] + my_matches['rank_away'])/2
my_matches['point_diff'] = my_matches['weighted_points_home'] - my_matches['weighted_points_away']
my_matches['score_diff'] = my_matches['home_score'] - my_matches['away_score']
my_matches['is_won'] = my_matches['score_diff'] > 0 #draw=lost
my_matches['is_stake'] = my_matches['tournament'] != 'Friendly'

#rest days since the home team's previous home match, capped at max_rest
max_rest = 30
my_matches['rest_days'] = my_matches.groupby('home_team')['date'].diff().dt.days.clip(0,max_rest).fillna(max_rest)

#one-hot encode the world cup participants; all other home teams are grouped as 'Other'
my_matches['wc_participant'] = my_matches['home_team'].where(my_matches['home_team'].isin(world_cup_2018.index.tolist()), 'Other')
my_matches = my_matches.join(pd.get_dummies(my_matches['wc_participant']))
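
The rankings are only published on certain dates, so the merge on the match date only works after the rankings have been stretched to one row per day. The sketch below shows that idea on a toy table with made-up ranks; it uses an explicit loop over countries instead of the chained groupby above, which is an assumption made for readability, not the original code.

```python
import pandas as pd

# Toy rankings with gaps between publication dates (illustrative values)
toy = pd.DataFrame({
    'rank_date': pd.to_datetime(['2018-01-01', '2018-01-04', '2018-01-02']),
    'country_full': ['Germany', 'Germany', 'Brazil'],
    'rank': [1, 2, 3],
})

pieces = []
for country, grp in toy.groupby('country_full'):
    # One row per calendar day; days without a published ranking
    # inherit the most recent earlier one via forward fill
    daily_grp = grp.set_index('rank_date').resample('D').first().ffill()
    pieces.append(daily_grp)

daily = pd.concat(pieces).reset_index()
print(daily)
```

In the toy table, Germany's ranking from 1 January is carried forward to 2 and 3 January until the new ranking arrives on 4 January, so a match played on any of those days finds a ranking to join against.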

Methodology

We use a simple logistic regression model to keep everything simple. If the feature engineering is done well, straightforward (linear) models can even beat some fancy deep learning networks.

from sklearn import linear_model
from sklearn import ensemble
from sklearn.model_selection import train_test_split
from sklearn.metrics import confusion_matrix, roc_curve, roc_auc_score
from sklearn.pipeline import Pipeline
from sklearn.preprocessing import PolynomialFeatures

#create the dataset
X, y = my_matches.loc[:,['average_rank', 'rank_diff', 'point_diff', 'is_stake']], my_matches['is_won']
X_train, X_test, y_train, y_test = train_test_split(
    X, y, test_size=0.2, random_state=1234)

logreg = linear_model.LogisticRegression(C=1e-5)
features = PolynomialFeatures(degree=2)
model = Pipeline([
    ('polynomial_features', features),
    ('logistic_regression', logreg)
])
model = model.fit(X_train, y_train)

# figures 
fpr, tpr, _ = roc_curve(y_test, model.predict_proba(X_test)[:,1])
plt.figure(figsize=(15,5))
ax = plt.subplot(1,3,1)
ax.plot([0, 1], [0, 1], 'k--')
ax.plot(fpr, tpr)
ax.set_title('AUC score is {0:0.2}'.format(roc_auc_score(y_test, model.predict_proba(X_test)[:,1])))
ax.set_aspect(1)

ax = plt.subplot(1,3,2)
cm = confusion_matrix(y_test, model.predict(X_test))
ax.imshow(cm, cmap='Blues', clim = (0, cm.max())) 

ax.set_xlabel('Prediction')
ax.set_title('Out of sample performance')

ax = plt.subplot(1,3,3)
cm = confusion_matrix(y_train, model.predict(X_train))
ax.imshow(cm, cmap='Blues', clim = (0, cm.max())) 
ax.set_xlabel('Prediction')
ax.set_title('In sample performance')
pass

The out-of-sample results are quite satisfying, with an AUC score of 0.735. The results also suggest that matches involving lower-ranked teams are hard to predict. The same applies to matches between teams with very similar ranks, which seems reasonable.
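
The pipeline feeds the four input features through a degree-2 polynomial expansion before the logistic regression, so the model actually sees the bias term, the four original features and all pairwise products. The sketch below lists these terms in plain Python to make that concrete; the helper `degree_two_terms` is hypothetical and only mimics the bookkeeping, it is not scikit-learn's implementation.

```python
from itertools import combinations_with_replacement

feature_names = ['average_rank', 'rank_diff', 'point_diff', 'is_stake']

def degree_two_terms(names):
    """List the terms of a degree-2 expansion: bias, linear terms, pairwise products."""
    terms = ['1']
    terms += list(names)
    terms += ['{}*{}'.format(a, b)
              for a, b in combinations_with_replacement(names, 2)]
    return terms

terms = degree_two_terms(feature_names)
print(len(terms))
for term in terms:
    print(term)
```

Interaction terms such as `rank_diff*is_stake` are what allow an otherwise linear model to treat the same rank difference differently in friendlies and in competitive games.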

Forecasting the World Cup 2018

First, we tackle the group stage:

#if the winning probability margin is smaller than 0.05 then we classify the outcome as a draw
margin = 0.05
#guess what - world cup rankings
world_cup_2018_rankings = my_rankings.loc[(my_rankings['rank_date'] == my_rankings['rank_date'].max()) & 
                                    my_rankings['country_full'].isin(world_cup_2018.index.unique())]
world_cup_2018_rankings = world_cup_2018_rankings.set_index(['country_full'])

from itertools import combinations

opponents = ['First match \nagainst', 'Second match\n against', 'Third match\n against']

world_cup_2018['points'] = 0
world_cup_2018['total_prob'] = 0

for group in set(world_cup_2018['Group']):
    print('Group {}:'.format(group))
    for home, away in combinations(world_cup_2018.query('Group == "{}"'.format(group)).index, 2):
        print("{} vs. {}: ".format(home, away), end='')
        row = pd.DataFrame(np.array([[np.nan, np.nan, np.nan, True]]), columns=X_test.columns)
        home_rank = world_cup_2018_rankings.loc[home, 'rank']
        home_points = world_cup_2018_rankings.loc[home, 'weighted_points']
        opp_rank = world_cup_2018_rankings.loc[away, 'rank']
        opp_points = world_cup_2018_rankings.loc[away, 'weighted_points']
        row['average_rank'] = (home_rank + opp_rank) / 2
        row['rank_diff'] = home_rank - opp_rank
        row['point_diff'] = home_points - opp_points
        
        home_win_prob = model.predict_proba(row)[:,1][0]
        world_cup_2018.loc[home, 'total_prob'] += home_win_prob
        world_cup_2018.loc[away, 'total_prob'] += 1-home_win_prob
        
        points = 0
        if home_win_prob <= 0.5 - margin:
            print("{} wins with {:.2f}".format(away, 1-home_win_prob))
            world_cup_2018.loc[away, 'points'] += 3
        if home_win_prob > 0.5 - margin:
            points = 1
        if home_win_prob >= 0.5 + margin:
            points = 3
            world_cup_2018.loc[home, 'points'] += 3
            print("{} wins with {:.2f}".format(home, home_win_prob))
        if points == 1:
            print("Draw")
            world_cup_2018.loc[home, 'points'] += 1
            world_cup_2018.loc[away, 'points'] += 1

And these are the prediction results for the group games:

Group B:
Portugal vs. Spain: Draw
Portugal vs. Morocco: Portugal wins with 0.64
Portugal vs. Iran: Portugal wins with 0.64
Spain vs. Morocco: Spain wins with 0.61
Spain vs. Iran: Spain wins with 0.61
Morocco vs. Iran: Draw
Group C:
France vs. Australia: France wins with 0.63
France vs. Peru: Draw
France vs. Denmark: Draw
Australia vs. Peru: Peru wins with 0.65
Australia vs. Denmark: Denmark wins with 0.71
Peru vs. Denmark: Draw
Group F:
Germany vs. Mexico: Germany wins with 0.62
Germany vs. Sweden: Germany wins with 0.65
Germany vs. Korea Republic: Germany wins with 0.74
Mexico vs. Sweden: Draw
Mexico vs. Korea Republic: Mexico wins with 0.65
Sweden vs. Korea Republic: Sweden wins with 0.63
Group H:
Poland vs. Senegal: Poland wins with 0.63
Poland vs. Colombia: Draw
Poland vs. Japan: Poland wins with 0.75
Senegal vs. Colombia: Colombia wins with 0.62
Senegal vs. Japan: Senegal wins with 0.59
Colombia vs. Japan: Colombia wins with 0.71
Group G:
Belgium vs. Panama: Belgium wins with 0.72
Belgium vs. Tunisia: Belgium wins with 0.59
Belgium vs. England: Belgium wins with 0.59
Panama vs. Tunisia: Tunisia wins with 0.72
Panama vs. England: England wins with 0.73
Tunisia vs. England: England wins with 0.54
Group E:
Brazil vs. Switzerland: Draw
Brazil vs. Costa Rica: Brazil wins with 0.61
Brazil vs. Serbia: Brazil wins with 0.64
Switzerland vs. Costa Rica: Switzerland wins with 0.58
Switzerland vs. Serbia: Switzerland wins with 0.63
Costa Rica vs. Serbia: Draw
Group D:
Argentina vs. Iceland: Argentina wins with 0.61
Argentina vs. Croatia: Argentina wins with 0.58
Argentina vs. Nigeria: Argentina wins with 0.71
Iceland vs. Croatia: Draw
Iceland vs. Nigeria: Iceland wins with 0.62
Croatia vs. Nigeria: Croatia wins with 0.62
Group A:
Russia vs. Saudi Arabia: Saudi Arabia wins with 0.56
Russia vs. Egypt: Egypt wins with 0.67
Russia vs. Uruguay: Uruguay wins with 0.82
Saudi Arabia vs. Egypt: Egypt wins with 0.65
Saudi Arabia vs. Uruguay: Uruguay wins with 0.81
Egypt vs. Uruguay: Uruguay wins with 0.70
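
The draw rule used for the group games can be written as a small standalone function, which makes the thresholds easier to see. This is a minimal sketch of the same rule; the function name `classify_match` is hypothetical and the probabilities in the loop are made up for illustration.

```python
def classify_match(home_win_prob, margin=0.05):
    """Map a home-win probability to (outcome, home_points, away_points).

    Probabilities within `margin` of 0.5 are treated as a draw.
    """
    if home_win_prob >= 0.5 + margin:
        return 'home', 3, 0
    if home_win_prob <= 0.5 - margin:
        return 'away', 0, 3
    return 'draw', 1, 1

for p in (0.62, 0.52, 0.30):
    print(p, classify_match(p))
```

With `margin = 0.05`, only matches where the model gives the first-listed team between 45% and 55% are called a draw, which is why draws appear mostly between teams of similar rank.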

Second, we tackle the knock-out games:

pairing = [0,3,4,7,8,11,12,15,1,2,5,6,9,10,13,14]
world_cup_2018 = world_cup_2018.sort_values(by=['Group', 'points', 'total_prob'], ascending=False).reset_index()
next_round_wc = world_cup_2018.groupby('Group').nth([0, 1]) # select the top 2
next_round_wc = next_round_wc.reset_index()
next_round_wc = next_round_wc.loc[pairing]
next_round_wc = next_round_wc.set_index('Team')

finals = ['Round_of_16', 'Quarter-Finals', 'Semi-Finals', 'Final']

labels = list()
odds = list()

for f in finals:
    print("{}:".format(f))
    iterations = int(len(next_round_wc) / 2)
    winners = []

    for i in range(iterations):
        home = next_round_wc.index[i*2]
        away = next_round_wc.index[i*2+1]
        print("{} vs. {}: ".format(home,
                                   away), 
                                   end='')
        row = pd.DataFrame(np.array([[np.nan, np.nan, np.nan, True]]), columns=X_test.columns)
        home_rank = world_cup_2018_rankings.loc[home, 'rank']
        home_points = world_cup_2018_rankings.loc[home, 'weighted_points']
        opp_rank = world_cup_2018_rankings.loc[away, 'rank']
        opp_points = world_cup_2018_rankings.loc[away, 'weighted_points']
        row['average_rank'] = (home_rank + opp_rank) / 2
        row['rank_diff'] = home_rank - opp_rank
        row['point_diff'] = home_points - opp_points

        home_win_prob = model.predict_proba(row)[:,1][0]
        if home_win_prob <= 0.5:
            print("{0} wins with probability {1:.2f}".format(away, 1-home_win_prob))
            winners.append(away)
        else:
            print("{0} wins with probability {1:.2f}".format(home, home_win_prob))
            winners.append(home)

        #label each game with the fair odds (1/probability) of both teams
        labels.append("{}({:.2f}) vs. {}({:.2f})".format(world_cup_2018_rankings.loc[home, 'country_abrv'], 
                                                        1/home_win_prob, 
                                                        world_cup_2018_rankings.loc[away, 'country_abrv'], 
                                                        1/(1-home_win_prob)))
        odds.append([home_win_prob, 1-home_win_prob])
                
    next_round_wc = next_round_wc.loc[winners]
    print("\n")

And these are the predictions for the knock-out games:

Round_of_16:
Uruguay vs. Spain: Spain wins with probability 0.56
Denmark vs. Croatia: Denmark wins with probability 0.60
Brazil vs. Mexico: Brazil wins with probability 0.63
Belgium vs. Colombia: Belgium wins with probability 0.57
Egypt vs. Portugal: Portugal wins with probability 0.83
France vs. Argentina: Argentina wins with probability 0.54
Switzerland vs. Germany: Germany wins with probability 0.61
England vs. Poland: Poland wins with probability 0.55

Quarter-Finals:
Spain vs. Denmark: Denmark wins with probability 0.51
Brazil vs. Belgium: Belgium wins with probability 0.53
Portugal vs. Argentina: Portugal wins with probability 0.52
Germany vs. Poland: Germany wins with probability 0.56

Semi-Finals:
Denmark vs. Belgium: Belgium wins with probability 0.58
Portugal vs. Germany: Germany wins with probability 0.57

Final:
Belgium vs. Germany: Germany wins with probability 0.59
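
The labels collected in the knock-out loop express each probability as fair decimal odds, i.e. the reciprocal of the probability without any bookmaker margin. A minimal sketch of that conversion, using the predicted final as an example; the function name `fair_decimal_odds` is hypothetical.

```python
def fair_decimal_odds(probability):
    """Convert a win probability into fair decimal odds (no bookmaker margin)."""
    if not 0 < probability < 1:
        raise ValueError('probability must lie strictly between 0 and 1')
    return 1.0 / probability

# Predicted final: Germany wins with probability 0.59, so Belgium has 0.41
germany_prob = 0.59
belgium_prob = 1 - germany_prob

germany_odds = fair_decimal_odds(germany_prob)
belgium_odds = fair_decimal_odds(belgium_prob)

print('Germany: {:.2f}'.format(germany_odds))
print('Belgium: {:.2f}'.format(belgium_odds))
```

Fair odds are the payout per unit staked at which a bet has zero expected profit, so the implied probabilities `1/odds` of both teams add up to exactly one.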

Network visualisation of the Knock-Out games:

import networkx as nx
import pydot
from networkx.drawing.nx_pydot import graphviz_layout

node_sizes = pd.DataFrame(list(reversed(odds)))
scale_factor = 0.3 # for visualization
G = nx.balanced_tree(2, 3)
pos = graphviz_layout(G, prog='twopi', args='')
centre = pd.DataFrame(pos).mean(axis=1).mean()

plt.figure(figsize=(10, 10))
ax = plt.subplot(1,1,1)
# add circles 
circle_positions = [(230, 'black'), (180, 'blue'), (120, 'red'), (60, 'yellow')]
[ax.add_artist(plt.Circle((centre, centre), 
                          cp, color='grey', 
                          alpha=0.2)) for cp, c in circle_positions]

nx.draw(G, pos, 
        node_color=node_sizes.diff(axis=1)[1].abs().pow(scale_factor), 
        node_size=node_sizes.diff(axis=1)[1].abs().pow(scale_factor)*2000, 
        alpha=1, 
        cmap='Reds',
        edge_color='black',
        width=10,
        with_labels=False)

shifted_pos = {k:[(v[0]-centre)*0.9+centre,(v[1]-centre)*0.9+centre] for k,v in pos.items()}
nx.draw_networkx_labels(G, 
                        pos=shifted_pos, 
                        bbox=dict(boxstyle="round,pad=0.3", fc="white", ec="black", lw=.5, alpha=1),
                        labels=dict(zip(reversed(range(len(labels))), labels)))

texts = ((10, 'Best 16', 'black'), (70, 'Quarter-\nFinal', 'blue'), (130, 'Semi-Final', 'red'), (190, 'Final', 'yellow'))
[plt.text(p, centre+20, t, 
          fontsize=12, color='grey', 
          va='center', ha='center') for p,t,c in texts]
plt.axis('equal')
plt.title('Knock-Out Games \n Predictions', fontsize=20)
plt.show()
