Using K-Means Clustering to analyze the types of teams Real Madrid and Barcelona drop points against in La Liga.
This is a shorter version of a longer paper, a link to the full paper is here. A link to the code and the datasets used is here. To see the final results of this paper, skip to the Conclusion.
Real Madrid’s success in the Champions League during the last decade is in stark contrast to their poor performance in La Liga. By using K-Means Clustering to categorize the teams the Real Madrid and their arch-rivals Barcelona have lost and drawn to in La Liga, we can get a clearer idea of where Madrid is making mistakes in this competition.
Throughout the late 1990s and the 21st century, Real Madrid has been known as the club in world football to spend an exorbitant amount on the players they desire the most, specifically attackers. Galaticos, as they became known globally, would often go on to be the offensive leaders of the team and the person that Madrid’s attack would flow through. While there have certainly been legendary Galaticos in the past, perhaps none have been more impactful to the club in modern history than Cristiano Ronaldo. Signed in the summer of 2009 as part of the answer to arch-rival Barcelona’s dominance of Spain and Europe (They were the first team to win six trophies in one year), Ronaldo would stay at Madrid for almost a decade. The defining moments of the Ronaldo Era at Madrid would almost all come in the Champions League, as Madrid went to the semifinals of the competition 8 years in a row and was able to win the entire thing 4 times in the span of 5 years; to say that this is anything other than legendary is an understatement.
Despite Madrid’s success in the Champions League, their struggles in domestic competition cannot be overlooked. In Ronaldo’s time at Madrid (nine seasons), Real Madrid was only able to win La Liga two times. In the same time period, Barcelona won the league title 6 times and city rivals Atletico de Madrid won the title once. While the Ronaldo Era won’t be looked back on as unsuccessful, the question remains: How can a team that was so dominant in one competition be so lackluster in another?
I attempt to answer this question by using Machine Learning. I’ve created a dataset for Madrid and Barcelona that lists every single time they’ve lost or drawn against an opponent in La Liga from the 2009-10 all the way until the 2017-18 season, which covers every season that Ronaldo was at Madrid. On each dataset, I’m going to use a Machine Learning technique called K-Means Clustering (K-Means or KMC for short) to find the average teams that Madrid and Barcelona are losing points to and see if there are any clear differences. K-Means Clustering will be done in a couple of different scenarios to get a better understanding of where exactly Madrid is struggling and Barceona has been succeeding (For example, dividing the dataset up by manager and doing KMC for each manager)
For each of the 2 teams, the dataset is made up of the following features:
Data collected after the 2014/15 season contains an extra feature called xg_diff, which calculates the difference in xG by simply doing (the xG of Madrid/Barca in the specific match) minus (the xG of their opposition).
This dataset also only considers games where the league title hasn’t been won yet. The reasoning behind this is that it’s hard to gauge the team’s motivation since there’s nothing left to play for; Madrid and Barca tend to let their starters rest and be ready for The Champions League and Copa Del Rey when they can’t win the league, so this opens up the door for lineups with bench players, new formations, and a general lack of motivation to win.
K-Means Clustering (KMC) is an unsupervised Machine Learning method that computes the average at multiple locations in numerical data.The k in KMC is the number of averages, or clusters, that are being computed; k is a number that’s greater than 1 but less than the number of rows in your dataset. The basic steps of KMC are as follows:
The benefits of using KMC as opposed to a normal average calculation is clear. For example, pretend that we’re using KMC on the matchday feature in our data for one season, and for this season Madrid lost/drew against teams on matchdays 1,2,3,4,35,36,37 and 38. If we were to calculate the average normally we’d get 19.5, which would lead us to believe that this season Madrid lost a majority of its games near the halfway point, but we can obviously see is wrong. Using KMC with k = 2 would give us a much better representation of the data and tell us that on average, Madrid lost to teams in the beginning of the season (1 Cluster would contain the points 1,2,3,4) and the end of the season (The second cluster would contain the points 35,36,37,38).
For more information on KMC, take a look at this video from StatQuest with Josh Starmer.
One of the most important parts of KMC is deciding on how big or small k should be. KMC is clearly a very powerful algorithm, but when used with the wrong value for k then it becomes difficult to extract meaningful information from your data.
Fortunately, there are many methods that have been developed to help determine the best k. For this analysis I’ll be using Silhouette Analysis, which involves measuring 1) the distance between each point in a given cluster (you want this value to be small since it measures how similar the data points in this cluster are to one another) and 2) the distance between each cluster (you want this value to be large since it measures how dissimilar this cluster is the others).
More information on how these two values are calculated can be found here and I adapted the code from this tutorial to find the optimal k for my data.
The plan for the data is as follows:
This is the data we’re looking at for each team:
To see my full breakdown of each category of data, check out the link here.
Real Madrid’s results in the Champions League in the Ronaldo Era certainly won’t be looked back on as a failure, but some questions will be asked about their sub-par league performance. Throughout the different managers that have tried to change Madrid’s luck in the league, this analysis shows us that Madrid typically fell into the same patterns over and over again which cost them the league title on more than one occasion. The biggest patterns included 1) dropping points to their direct competitors both home and away, 2) being unable to pull out results in away games against mid table teams and lower, 3) creating enough offensive opportunities for themselves but not capitalizing on them (evidenced by their xg_diff being 1 in the first cluster of the 2014/15-2017/18 dataset), and 4) not being able to start their league campaigns with good form. Solving any one of these problems will be beneficial for Madrid in the long run, as the smallest or margins tend to make a difference in the title race. Being able to secure even 3 or 4 more points a season can quite literally be the difference between 1st or 2nd, so in the future Real will be doing themselves a favor to fix one of these 4 areas.
In addition to the data that I already collected, it would definitely have been beneficial to have access to more reliable data about injuries and the form of certain players, as this could’ve given us more information about why and how Madrid was losing certain games. I treated a majority of this analysis with the assumption that Madrid was playing a full-strength lineup, when in reality this wasn’t always the case; knowing which games were played at full strength and which weren’t would’ve definitely helped this analysis. In addition, having more information about how these games unfolded would’ve helped. Having more information about the lineup strength of the other team, how many passes they attempted in each third of the field, how much possession they had, how many key passes they made, etc. would give us a better idea of the different styles of teams that have taken points from Madrid the most. Future work would definitely add this data, but only for the most recent seasons as this type of data is fairly new and wasn’t available even 10 years ago.
In this video, I had the pleasure of speaking with Nick Wan. Nick streams data…
In 2015, MLB introduced Statcast to all 30 stadiums. This system monitors player and ball movement and…
This past weekend, I was honored to speak to almost 100 Kellogg MBA students about…
Bears fans got a lesson in regression to the mean last season. It may have…
Intro & Lit Review Predicting the NBA draft is always difficult. Should you draft a…
Beyond The Arch is a series of articles where I use K Means Clustering to…
This website uses cookies.