Analyzing Formula 1 Team Performance with SQL and Python: Part 1

Formula 1 (F1) is widely regarded as the pinnacle of motorsport, blending cutting-edge technology, precision engineering, and strategic decision-making. Beyond the roaring engines and high-speed overtakes, data plays a crucial role in shaping the sport. Every aspect of an F1 race from car performance and tire degradation to weather conditions and driver strategy is meticulously analyzed to gain a competitive edge.

Modern F1 teams collect and process vast amounts of data in real time, relying on telemetry systems, machine learning models, and predictive analytics to optimize race strategies (Baraka, 2024). Data influences everything from pit stop timing to aerodynamics, helping teams make split-second decisions that can determine the outcome of a race. Furthermore, historical data provides valuable insights into performance trends, circuit characteristics, and race outcomes, enabling teams to refine their approaches over multiple seasons (Pandit & Day, 2021).

This project explores the integral role of data in Formula 1, using explorative analysis to derive insights into team performance. By examining key data points and trends, we gain a deeper understanding of how F1 has evolved into a data-driven sport where milliseconds can be the difference between victory and defeat.

Project Objectives

The aim of this project threefold. Firstly, the dataset will be cleaned and formatted to prepare for better analysis and explorative data and key insights will be determined. Secondly, python libraries will be used for data visualization. Thirdly, data will be used for basic predictions and forecasting for the upcoming 2025 season.

For exploratory data analysis, the following research questions will be taken within the boundaries of the dataset used,

  • Which teams have been the most dominant in F1 history?

  • Which drivers have the most wins and podiums?

  • How do teams compare over different seasons?

  • Which circuits favor certain teams/drivers?

  • What trends can we identify in race performance?

The Dataset

For this study, the Gigasheets’s formula 1 race results datasheet was used. This dataset contains formula 1 race results from 2000 to 2024. The fields of the dataset include,

  • Season – the year which the season was carried out

  • Round – number/ the order in which a particular race was held in a season

  • CircuitID – name of the race circuit

  • Country – the country in which the race was held

  • ConstructorName – name of the formula 1 team which is known as the constructor

  • ConstructorID

  • ConstructorNationality – the origin country of the team

  • DriverID

  • DriverName

  • DriverNationality

  • Nationality

  • Position – the position of the driver at the end of the race

  • Points – points earned by the driver

  • Grid – the position of the driver at the start of the race

  • Status – the status of the driver’s race (Finished, +1 Lap, Engine, Gearbox, Water pressure etc)

  • AverageSpeed – the average speed of the driver during the race

  • DifferenceGridPosition – the difference between grid position and results position

  • WonLastRace – whether the driver won the previous race

  • WonLast2Races to Top5Last10Races – there are 9 columns in the dataset to determine if the driver has won the previous 2, 3 …. ,10 races

Tools

  • SQL – for querrying the dataset

  • Python – for data cleaning, analysis and visualization

  • MySQL

  • Jupyter Notebook

Data Cleaning

Data cleaning was done at both excel sheet level and database level. After observing the dataset, it was identified that DriverNationality and Nationality columns were duplicate. Therefore, it was decided to delete the Nationality column from the dataset. The DriverName column only shows the drivers first name, which can lead to ambiguity if any analysis were to be done using only the first name of the driver. Therefore, the DriverName column was modified to include the driver’s full name. Below is a portion of the dataset after excel data cleaning was done.

For this project, MySQL was used as the database management system, in MySQL workbench, a new schema was created called “f1” and the dataset was imported into “f1results” table. Once the data has been loaded, further data cleaning was done to prepare the dataset for analysis.

  • Identifying null values – after checking the dataset for null values, it was identified that the dataset didn’t include any records with null values

  • Checking for duplicate values – after checking the dataset for duplicate values, it was identified that the dataset didn’t contain any duplicate values

Generating Insights

Please note that this study analyzes data from 2000 onwards. Formula 1 officially started in 1950 and the dataset doesn’t include the 50 years’ worth of race results. The aim of the study is to analyze the performance of formula 1 teams since 2000 to get an overview of the team’s performance in the recent years. Therefore, this cannot be taken as a representation of the team’s performance since its inception.

  • Top 10 teams with most wins

  • Most Dominant teams based on points scored

  • Top 10 drivers with most wins

  • Most dominant drivers based on number of points

  • Top 10 drivers with most podiums

  • Top 10 constructors with most podiums

According to the above findings, Ferrari has dominated in terms of most wins, most points and most podium finishes. Interestingly, Ferrari has won 6 WDC and 7 WCC during this period which is less than both Red Bull and Mercedes who has won 8 and 7 WDC respectively. Based on this observation, head-to-head team comparisons were analyzed.

  • Ferrari vs Red Bull

  • Ferrari vs Mercedes

This dataset reflects points earned by constructors based on the driver positions in each race. In formula 1, additional points are awarded for both driver and constructor for the fastest lap and their success in sprint races. Those points are not considered in this dataset. Therefore, this dataset cannot be used to determine WDC and WCC information from the data in “Points” column.

  • Performance trends over time

Moving on to circuit specific data analysis to identify which circuits favor certain constructors and drivers.

  • Most wins in a single circuit

  • Most wins in Monza which is considered as the fastest circuit

  • Most wins in Monaco which is considered as the most technical circuit

In Part 2, we will discuss data visualization, identifying trends and making predictions.