Formula 1 Comprehensive dashboard — Part I: Creating a SQL database

Mash Nejati
4 min readSep 29, 2020

--

Photo by Kevin Ku and Carl Jorgensen on Unsplash

This tutorial has also published on my personal website.

I have always been a Formula 1 enthusiast. Last year I wrote a short story on what I learned from it which you can read it here if interested:

A few weeks ago, I found an interesting dataset of Formula 1 races here that includes all the data for the races from 1950 up until 2017 season, and I have decided to make an informative dashboard from this dataset. The dataset is offered in 13 CSV files, each contains a series of data such as circuits, races, seasons, lap times, etc. For this project, I will create a SQL Server database from the dataset and import each file as a table into the SQL database. In the next step, will start creating the dashboard in Microsoft Power BI using the data stored in the SQL database.

Creating a SQL Server database

We will be creating a blank SQL database in SSMS and name it “F1stats” which will be home to our dataset:

CREATE DATABASE F1stats

Now let’s import the CSV files into our database:

Importing a CSV file using the import wizard

After doing the same process for all of the CSV files, our database will look like this:

Data validation

To make sure that the data has been stored correctly, let’s write a SQL query to return the circuits that hosted the most number of F1 races. In below query, I join two tables of “races” and “circuits”, since I need the data from both tables:

select circuitRef, circuits.country, count(*) as 'Number of Races'fromcircuits inner join raceson circuits.circuitId = races.circuitIdgroup by circuitRef, countryorder by [Number of Races] desc

As can be seen in the results screen shot, the “Monza” circuit in Italy have hosted F1 races for 68 times, which is correct considering the creation date of the original dataset. Here we noticed that in some countries there are multiple circuits, so what if we want to know that how many races each country and each circuit have been hosted? To answer this, we can use window functions, which are very useful in data analytics, as below:

select circuitRef, country,count(raceId) over (partition by circuitRef) as 'Circuit Races',count(raceId) over (partition by country) as 'Country Races'from [dbo].[circuits]inner join[dbo].[races]on circuits.circuitId = races.circuitIdorder by [Country Races] desc

Above code will count the number of races for each circuit and country and returns the result by showing the country with the most races hosted on top. If we execute the code then the result will look like this:

But this result is not desired since it keeps repeating the same values each for the number of occurrences. To solve this, we can use common table expression and manage it in the way that only shows each circuit only once:

with cte as (select circuitRef, country,count(raceId) over (partition by circuitRef) as 'Circuit Races',count(raceId) over (partition by country) as 'Country Races',ROW_NUMBER() over (partition by circuitRef order by country) as 'rownbr'from [dbo].[circuits]inner join[dbo].[races]on circuits.circuitId = races.circuitId)select circuitRef, country, [Circuit Races], [Country Races]
from cte
where rownbr = 1order by [Country Races] desc

We assigned Row Numbers to the first query and consequently we filter “rownbr” to show each once. The returned result is now more acceptable than previous one.

Troubleshooting

In above code, if instead of inner join we use full outer join, we will notice a one record difference between two results. To find out what is the missing record, we can use a full outer join and filter the result as below:

select *from [dbo].[circuits]full outer join[dbo].[races]on circuits.circuitId = races.circuitIdwhere circuits.circuitId is null or races.circuitId is null

By using the above code, we fully checked both tables and returned any record that has a null value for circuitId. It shows that we have “Port Imperial Circuit” in our database but hosted no Formula 1 races which is actually true, due to some financial difficulties as per Wikipedia.

In the next part, we will connect the data to Power BI to proceed with creating the Formula 1 dashboard.

--

--

Mash Nejati
Mash Nejati

Written by Mash Nejati

Business Intelligence | Analytics Engineer | Data Visualization & Accessibility | Formula 1 Enthusiast

No responses yet