A Network Science Approach to Football

Imagine for a moment that you are the coach of a big European football team. You are responsible for choosing new players for next season to replace those who leave the team this year. These are hard decisions, with many implications and consequences, and any assistance is welcome. In this post we show how network science can help with this decision-making process.

Let us start with a public football dataset from Kaggle. It contains several tables with statistics for the years 2008-2016. FIFA provides, among others, player scores for several aspects of the game, such as speed, strength, dribbling ability and shooting, for every player and match. There are about 180000 European matches and every player is characterized by more than thirty numerical scores.

Now we are ready to create and study a new network using Machine Learning tools and techniques. For the sake of simplicity, we will reduce the size of the network and limit ourselves to the 1000 best players of the dataset, according to FIFA's overall rating. Those players are the nodes of our network. A link exists between a pair of players when they are similar enough according to a certain mathematical measure (technically, the cosine similarity). In addition, we discard isolated nodes, i.e. players that are not linked to anyone. This leads to a network with 945 nodes and 13156 links.
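To make the construction concrete, here is a minimal sketch of how such a similarity network could be built. It is not the code used for this post: the player names, the four-score vectors and the 0.8 threshold are made-up assumptions for illustration, and the real network uses more than thirty scores per player and a cut-off that is not stated here.

```python
import math
from itertools import combinations


def cosine_similarity(u, v):
    """Cosine of the angle between two non-zero score vectors."""
    dot = sum(a * b for a, b in zip(u, v))
    norm_u = math.sqrt(sum(a * a for a in u))
    norm_v = math.sqrt(sum(b * b for b in v))
    return dot / (norm_u * norm_v)


def build_similarity_network(players, threshold):
    """Link every pair of players whose similarity reaches the threshold.

    players: dict mapping a player name to its list of scores.
    Returns (nodes, edges); players without any link are left out.
    """
    edges = {}
    for a, b in combinations(sorted(players), 2):
        similarity = cosine_similarity(players[a], players[b])
        if similarity >= threshold:
            edges[(a, b)] = similarity
    # Only players that appear in at least one link become nodes.
    nodes = {name for pair in edges for name in pair}
    return nodes, edges


# Hypothetical scores: (speed, dribbling, tackling, goalkeeping reflexes).
players = {
    "forward_1": [90, 95, 30, 10],
    "forward_2": [88, 90, 35, 12],
    "defender_1": [70, 50, 90, 10],
    "keeper_1": [40, 20, 15, 92],
}

# The threshold is an assumption; the value used in the post is not given.
nodes, edges = build_similarity_network(players, threshold=0.8)
for (a, b), similarity in sorted(edges.items()):
    print(f"{a} - {b}: {similarity:.2f}")
```

In this toy example the goalkeeper is not similar enough to anyone, so it is dropped as an isolated node, in the same way that the 1000 selected players shrink to 945 nodes in the real network.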

We have loaded the network in BeGraph, expanded it and calculated some network centrality metrics (check this tutorial on how to use the visualizer):


The default view shows the nodes colored according to their community membership (Louvain Community, see ref [1]). There are four big communities, clustering the players according to their position on the pitch: goalkeepers (blue), defenders (yellow), midfielders (green) and forward players (red). Although the clustering is very rough, we can check with some examples that it actually makes sense. For example, Messi, Ribery, Neymar and Ronaldo are in the same cluster because they play in roughly similar positions. Moreover, the Messi-Ribery (0.94) and Messi-Neymar (0.93) similarities are larger than the Messi-Ronaldo one (0.89), as any football fan would expect.

You can observe that A. Iniesta and X. Hernández have a very high similarity (0.96) and are classified as forward players, but are located close to the midfield cluster (green). This also makes sense because they are midfield players with a very good goal-scoring rate. Some defenders are also strongly linked, like J. Terry and N. Vidic (0.92), colored in yellow.

The goalkeepers cluster (blue) is slightly different from the other three. It is isolated, meaning that goalkeepers have very different skills from the other players (well, it is kind of obvious since they can use their hands). You may notice that they form a very dense and strongly connected cluster. Check this with the cosine similarity and Eigenvector Centrality visual configurations.

If we modify the resolution parameter of the Louvain algorithm, reducing it from 1.0 to 0.7, a new community splits off from the forward players. This new cluster is composed of strikers, that is, forward players whose specialty is receiving the ball and scoring. The following two pictures show this cluster:
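As a side note on what a resolution parameter does, here is a small sketch of a generalized modularity score, the quantity that Louvain-type algorithms try to maximize. It assumes one common formulation, in which a weight gamma multiplies the "expected links" term; the function name and the toy graph are hypothetical. Be aware that tools differ in the direction of this parameter. In the formulation below, a larger gamma favors smaller communities, whereas the behavior described above (lowering the value from 1.0 to 0.7 produces an extra community) suggests that the visualizer uses the inverse convention.

```python
from collections import defaultdict


def modularity(edges, community_of, gamma=1.0):
    """Generalized modularity of a partition of an undirected, unweighted graph.

    Q = sum over communities c of [ L_c / m - gamma * (d_c / (2 m))**2 ],
    where m is the number of links, L_c the number of links inside c and
    d_c the sum of the degrees of the nodes in c.
    """
    m = len(edges)
    links_inside = defaultdict(int)
    degree_sum = defaultdict(int)
    for u, v in edges:
        degree_sum[community_of[u]] += 1
        degree_sum[community_of[v]] += 1
        if community_of[u] == community_of[v]:
            links_inside[community_of[u]] += 1
    return sum(
        links_inside[c] / m - gamma * (degree_sum[c] / (2 * m)) ** 2
        for c in degree_sum
    )


# Toy graph: two triangles (0-1-2 and 3-4-5) joined by the single link 2-3.
edges = [(0, 1), (1, 2), (0, 2), (3, 4), (4, 5), (3, 5), (2, 3)]

merged = {node: 0 for node in range(6)}            # one big community
split = {0: 0, 1: 0, 2: 0, 3: 1, 4: 1, 5: 1}       # one community per triangle

for gamma in (1.0, 0.2):
    print(
        f"gamma={gamma}: merged={modularity(edges, merged, gamma):.3f}, "
        f"split={modularity(edges, split, gamma):.3f}"
    )
```

With gamma = 1.0 the split into two triangles scores higher, while with gamma = 0.2 the single merged community wins. This is the sense in which the parameter sets the scale of the communities the algorithm looks for.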

In the interactive gallery we have prepared other visual configurations, highlighting network features like betweenness or eigenvector centralities, or showing data directly from the dataset like the player weight. We encourage the reader to check them and see if they make sense with the network. For example, does it make sense that Claudio Marchisio has the highest betweenness centrality? Or that the tallest players are generally goalkeepers? Check that the player weight distribution in the network differs from the height distribution! Why would that be?
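To help with the betweenness question, recall that betweenness centrality counts how many shortest paths between other pairs of nodes pass through a given node, so nodes that bridge clusters tend to score highest. The sketch below computes it with Brandes' algorithm on a toy graph. The graph and its node names are made up for illustration, and this is not necessarily how the visualizer computes the metric.

```python
from collections import deque


def betweenness_centrality(adjacency):
    """Unnormalized betweenness for an undirected, unweighted graph.

    adjacency: dict mapping each node to the list of its neighbors.
    Uses Brandes' algorithm: one breadth-first search per source node,
    followed by a backward accumulation of path dependencies.
    """
    centrality = {node: 0.0 for node in adjacency}
    for source in adjacency:
        visit_order = []
        predecessors = {node: [] for node in adjacency}
        num_paths = {node: 0 for node in adjacency}
        num_paths[source] = 1
        distance = {source: 0}
        queue = deque([source])
        while queue:
            v = queue.popleft()
            visit_order.append(v)
            for w in adjacency[v]:
                if w not in distance:          # first time we reach w
                    distance[w] = distance[v] + 1
                    queue.append(w)
                if distance[w] == distance[v] + 1:  # v lies on a shortest path to w
                    num_paths[w] += num_paths[v]
                    predecessors[w].append(v)
        # Walk back from the farthest nodes, accumulating dependencies.
        dependency = {node: 0.0 for node in adjacency}
        while visit_order:
            w = visit_order.pop()
            for v in predecessors[w]:
                dependency[v] += num_paths[v] / num_paths[w] * (1 + dependency[w])
            if w != source:
                centrality[w] += dependency[w]
    # Every pair was counted from both of its endpoints.
    return {node: value / 2 for node, value in centrality.items()}


# Toy graph: two triangles (a, b, c) and (x, y, z) bridged by the node "m".
adjacency = {
    "a": ["b", "c"], "b": ["a", "c"], "c": ["a", "b", "m"],
    "m": ["c", "x"],
    "x": ["m", "y", "z"], "y": ["x", "z"], "z": ["x", "y"],
}

scores = betweenness_centrality(adjacency)
for node, score in sorted(scores.items(), key=lambda item: -item[1]):
    print(node, score)
```

Here the bridging node "m" gets the highest score even though it has only two links. A player whose scores resemble those of two different groups plays a similar bridging role in the similarity network, which is one possible way to read a high betweenness value.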

So, we have seen that network science provides a nice mathematical description of football players' styles, and that it can be easily interpreted. Similarity measures, like the one we used in this post, can generate many types of networks, and each one can have different properties or emphasize particular aspects of the dataset. Just let your mind fly and you may find something interesting!

[1] Blondel, V. D., Guillaume, J.-L., Lambiotte, R., Lefebvre, E., Fast Unfolding of Communities in Large Networks, J. Stat. Mech. P10008 (2008).
