How I keep my dataset updated using Azure Data Factory and Blob Storage
Recently, I was working on a reporting dashboard visualizing Toronto’s COVID-19 data. The dataset comes from Toronto’s Open Data portal and is updated and published on a weekly basis.
No matter how sophisticated and informative a reporting dashboard is, it is crucial that it presents up-to-date data, especially for topics that evolve rapidly, where accurate and current information is vital for an effective decision-making process.
This article is also available on my personal website.
The objective of this tutorial is to create a pipeline that automates the process of extracting the dataset from the web, storing it in Blob storage, and then loading it into an Azure SQL Database. Since this is a small and simple dataset, it can be mapped and loaded directly into the SQL database. However, in many professional cases we may need to work with a variety of unstructured data, or with datasets that require preparation such as cleaning or aggregation. In such cases, it is good practice to first load the data into the storage container and, after performing the necessary transformations, load the clean, prepared data into the database.
Important: It is highly recommended that you study the pricing structure of Azure services (or any other cloud services you will be using), as setting up or choosing resources inappropriately could result in unexpected charges to your Azure account. More details are provided at the end of the article.
The dataset in this tutorial does not require cleaning or aggregation, so no transformations are needed throughout the process.
I assume you already have access to your Azure resources and have created your Blob storage. In this tutorial, I will be using an Azure SQL Database as the destination database for the data I need for my dashboard, so you will also need a SQL Database already created in your Azure portal.
To start off, we need to create a data factory in the Azure portal. It is pretty straightforward: just search for “Data factories” in the portal and then click on “Add” or “Create data factory”. Then choose or enter the required parameters, such as the desired resource group and the data factory name. In the Git configuration tab, you can enter the Azure DevOps or GitHub repository parameters if you would like to integrate version control. If not, tick “Configure Git later” so you can proceed without entering them.
Finally, click on the Review + create button to validate the inputs and check for any errors. If validation is successful, you can create the data factory. After the deployment finishes, head to the data factory resource and, in the middle of the page, click on “Author & Monitor”.
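As a side note, if you prefer scripting over clicking through the portal, the same factory can be created with the azure-mgmt-datafactory Python SDK. The sketch below is only a minimal illustration: the subscription ID, resource group, factory name, and region are placeholders of my own, and model and method names can vary slightly between SDK versions, so double-check against the current documentation.

# pip install azure-identity azure-mgmt-datafactory
# Minimal sketch -- names and region below are placeholders, not values from this article.
from azure.identity import DefaultAzureCredential
from azure.mgmt.datafactory import DataFactoryManagementClient
from azure.mgmt.datafactory.models import Factory

subscription_id = "<your-subscription-id>"
rg_name = "covid-dashboard-rg"   # assumed resource group name
df_name = "toronto-covid-adf"    # assumed data factory name

# Uses whatever credential is available locally (Azure CLI login, managed identity, etc.)
adf_client = DataFactoryManagementClient(DefaultAzureCredential(), subscription_id)

# Create (or update) the data factory in the chosen region
df = adf_client.factories.create_or_update(rg_name, df_name, Factory(location="canadacentral"))
print(df.provisioning_state)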
Upon navigating to the Data Factory environment, simply expand the left menu and select “Author”.
Step I: Copying data from the web to Blob storage
At this stage, we create a dataset by clicking the “…” next to Datasets. In the window that opens, search for Blob Storage, select it, and continue. You can choose among a variety of format options; I choose DelimitedText since the dataset is in CSV format. After naming your dataset, you will need to create a new “Linked service”.
You will need to define a name for this service and choose the desired Azure subscription and the storage account name, which I have already created. After setting everything up in this window, click on “Test connection” to ensure that everything works. After clicking the Create button, you will return to the previous window.
Make sure to select the container by clicking the blue folder icon in front of File path. One thing to consider here is that we could point at the container alone; however, that would cause the process to create a new blob each time the activity triggers. Therefore, it is recommended that you create a folder with a file inside it (named however you like), so that each run overwrites just that one file. This uses less space in your Blob storage and keeps the storage more organized as well. You can create a folder or upload a file in the Blob storage section of the Azure portal. In this case, I renamed and uploaded the same dataset that I had already downloaded manually.
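If you are defining these objects in code rather than in the UI, the blob linked service and the fixed-file sink dataset would look roughly like the sketch below. It reuses adf_client, rg_name, and df_name from the earlier sketch; the connection string, container, file, and resource names are placeholders I chose for illustration.

from azure.mgmt.datafactory.models import (
    LinkedServiceResource, AzureBlobStorageLinkedService,
    DatasetResource, DelimitedTextDataset,
    LinkedServiceReference, AzureBlobStorageLocation,
)

# Linked service to the storage account (connection string comes from the portal's "Access keys" blade)
blob_ls = LinkedServiceResource(
    properties=AzureBlobStorageLinkedService(
        connection_string="DefaultEndpointsProtocol=https;AccountName=<account>;AccountKey=<key>"
    )
)
adf_client.linked_services.create_or_update(rg_name, df_name, "BlobStorageLS", blob_ls)

# CSV dataset pointing at one fixed blob, so every run overwrites the same file
blob_ds = DatasetResource(
    properties=DelimitedTextDataset(
        linked_service_name=LinkedServiceReference(
            type="LinkedServiceReference", reference_name="BlobStorageLS"
        ),
        location=AzureBlobStorageLocation(container="covid-data", file_name="toronto-covid19.csv"),
        column_delimiter=",",
        first_row_as_header=True,
    )
)
adf_client.datasets.create_or_update(rg_name, df_name, "BlobCovidCsv", blob_ds)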
At the same time, we need to create another dataset to serve as the Source. So we will again create a new dataset, but this time we will search for “HTTP” in the new window instead. Again, let’s define a new linked service and enter the required parameters:
Do not forget to choose GET as the Request method.
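The equivalent source-side definitions in the SDK would look roughly as follows. The base URL and relative path are illustrative placeholders rather than the dataset’s actual endpoint, and the GET method itself is specified later on the copy activity’s source settings.

from azure.mgmt.datafactory.models import (
    LinkedServiceResource, HttpLinkedService,
    DatasetResource, DelimitedTextDataset,
    LinkedServiceReference, HttpServerLocation,
)

# Anonymous HTTP linked service for the open-data host (URL is a placeholder)
http_ls = LinkedServiceResource(
    properties=HttpLinkedService(
        url="https://<open-data-host>",
        authentication_type="Anonymous",
    )
)
adf_client.linked_services.create_or_update(rg_name, df_name, "TorontoOpenDataLS", http_ls)

# CSV dataset served over HTTP; relative_url is the download path of the file
http_ds = DatasetResource(
    properties=DelimitedTextDataset(
        linked_service_name=LinkedServiceReference(
            type="LinkedServiceReference", reference_name="TorontoOpenDataLS"
        ),
        location=HttpServerLocation(relative_url="<relative-path-to-csv>"),
        column_delimiter=",",
        first_row_as_header=True,
    )
)
adf_client.datasets.create_or_update(rg_name, df_name, "WebCovidCsv", http_ds)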
Having the datasets created, it is time to set up a pipeline. Under Factory Resources, click on Pipeline and create a new pipeline. From “Activities”, expand “Move & transform” and drag “Copy data” onto the main canvas.
We will define names for the pipeline as well as for the activity to keep the workspace organized. Next, we select the Source and Sink datasets in the activity settings at the bottom, using the datasets we already created. After setting up the source and the sink, click on Validate at the top of the pipeline to make sure there are no errors throughout the pipeline. Once validation succeeds, click on Debug to do a test run and see if everything works as expected. After the debug run completes, you can also go to the Blob storage in the Azure portal and double-check that the dataset was correctly copied into the defined file.
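For completeness, here is roughly what the same pipeline, copy activity, and a one-off run (the scripted counterpart of Debug) look like in the Python SDK, again reusing the client and the dataset names from the earlier sketches.

from azure.mgmt.datafactory.models import (
    PipelineResource, CopyActivity, DatasetReference,
    DelimitedTextSource, DelimitedTextSink,
    HttpReadSettings, AzureBlobStorageWriteSettings,
)

# Copy activity: read the CSV over HTTP with GET and write it to the blob dataset
web_to_blob = CopyActivity(
    name="WebToBlob",
    inputs=[DatasetReference(type="DatasetReference", reference_name="WebCovidCsv")],
    outputs=[DatasetReference(type="DatasetReference", reference_name="BlobCovidCsv")],
    source=DelimitedTextSource(store_settings=HttpReadSettings(request_method="GET")),
    sink=DelimitedTextSink(store_settings=AzureBlobStorageWriteSettings()),
)

pipeline = PipelineResource(activities=[web_to_blob])
adf_client.pipelines.create_or_update(rg_name, df_name, "CovidRefreshPipeline", pipeline)

# Start a run and check its status (the rough equivalent of Debug in the UI)
run = adf_client.pipelines.create_run(rg_name, df_name, "CovidRefreshPipeline", parameters={})
print(adf_client.pipeline_runs.get(rg_name, df_name, run.run_id).status)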
Step II: Loading the data from Blob storage to the SQL Database
The process at this stage is almost the same as in the previous step. We will use the same blob dataset we created earlier as the source to read the data from. However, we will need to create a new dataset and choose the Azure SQL Database parameters. Before we can successfully create that dataset, we need to create a table in our SQL database so that we can select the destination table to load the data into. So go to the SQL Database in the Azure portal and open the query editor. I use column names similar to those of the source dataset for convenience during data mapping:
CREATE TABLE TorontoCovid19 (
    id int,
    Assigned_ID int,
    OutbreakAssociated varchar(255),
    AgeGroup varchar(255),
    NeighbourhoodName varchar(255),
    FSA varchar(12),
    SourceOfInfection varchar(255),
    Classification varchar(255),
    EpisodeDate date,
    ReportedDate date,
    ClientGender varchar(255),
    Outcome varchar(255),
    CurrentlyHospitalized varchar(12),
    CurrentlyinICU varchar(12),
    CurrentlyIntubated varchar(12),
    EverHospitalized varchar(12),
    EverinICU varchar(12),
    EverIntubated varchar(12)
);
After running the query, a new table named “TorontoCovid19” will be created.
Now go back to the Data Factory, click on New dataset, and look for Azure SQL Database.
After creating the linked service, select the table you just created and press OK.
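In the SDK, the SQL-side linked service and dataset would look roughly like this; the server, database, and credential values in the connection string are placeholders, and the linked service and dataset names are ones I made up for the sketch.

from azure.mgmt.datafactory.models import (
    LinkedServiceResource, AzureSqlDatabaseLinkedService,
    DatasetResource, AzureSqlTableDataset, LinkedServiceReference,
)

# Linked service to the Azure SQL Database (connection string values are placeholders)
sql_ls = LinkedServiceResource(
    properties=AzureSqlDatabaseLinkedService(
        connection_string=(
            "Server=tcp:<server>.database.windows.net,1433;"
            "Database=<database>;User ID=<user>;Password=<password>;"
        )
    )
)
adf_client.linked_services.create_or_update(rg_name, df_name, "AzureSqlLS", sql_ls)

# Dataset pointing at the table created above
sql_ds = DatasetResource(
    properties=AzureSqlTableDataset(
        linked_service_name=LinkedServiceReference(
            type="LinkedServiceReference", reference_name="AzureSqlLS"
        ),
        table_name="dbo.TorontoCovid19",
    )
)
adf_client.datasets.create_or_update(rg_name, df_name, "SqlCovidTable", sql_ds)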
Important: In order for the Azure resources to be able to communicate with this SQL Database, you should have already gone to the Firewall settings of the SQL Database in the Azure portal and enabled the option “Allow Azure services and resources to access this server”.
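If you want to script that firewall setting too, the checkbox corresponds to a special 0.0.0.0 firewall rule on the server, which can be created with the azure-mgmt-sql package. The sketch below assumes this representation as well as placeholder resource names, so verify it against the current documentation.

# pip install azure-mgmt-sql
from azure.identity import DefaultAzureCredential
from azure.mgmt.sql import SqlManagementClient
from azure.mgmt.sql.models import FirewallRule

sql_client = SqlManagementClient(DefaultAzureCredential(), "<your-subscription-id>")

# The 0.0.0.0-0.0.0.0 rule is how "Allow Azure services and resources to access
# this server" is represented; resource group and server names are placeholders.
sql_client.firewall_rules.create_or_update(
    "covid-dashboard-rg",
    "<your-sql-server-name>",
    "AllowAllWindowsAzureIps",
    FirewallRule(start_ip_address="0.0.0.0", end_ip_address="0.0.0.0"),
)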
With everything ready so far, it is now time to create a new activity in the pipeline to copy the data from Blob storage to the SQL Database. Simply drag and drop “Copy data” again from “Move & transform” under the Activities menu. I name this activity “BlobToSQL” for easier recognition and better organization of the project. This time, select the blob dataset we created earlier as the Source, and the SQL dataset as the Sink.
Please note that at this point we should map the source and sink data so that each column in the source is copied into the corresponding destination column. For this, just go to the Mapping tab and click on Import Schemas. Then map the Source and Destination for each data column:
Do not forget to connect the activities in the appropriate order so the pipeline works as it should:
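In SDK terms, the second copy activity and the dependency between the two activities would look roughly like the sketch below. The depends_on entry is the scripted counterpart of the connector between the activities on the canvas; the pre_copy_script is my own addition (not from the article) so that each weekly run replaces the previous rows instead of appending to them.

from azure.mgmt.datafactory.models import (
    CopyActivity, DatasetReference, DelimitedTextSource,
    AzureSqlSink, ActivityDependency, PipelineResource,
)

# Second copy activity: blob CSV -> SQL table
blob_to_sql = CopyActivity(
    name="BlobToSQL",
    inputs=[DatasetReference(type="DatasetReference", reference_name="BlobCovidCsv")],
    outputs=[DatasetReference(type="DatasetReference", reference_name="SqlCovidTable")],
    source=DelimitedTextSource(),
    # Optional: empty the table before each load so the weekly refresh replaces the data
    sink=AzureSqlSink(pre_copy_script="TRUNCATE TABLE dbo.TorontoCovid19"),
    # Run only after WebToBlob has succeeded -- the equivalent of connecting the activities on the canvas
    depends_on=[ActivityDependency(activity="WebToBlob", dependency_conditions=["Succeeded"])],
)

# Redeploy the pipeline with both activities, reusing web_to_blob from the earlier sketch
pipeline = PipelineResource(activities=[web_to_blob, blob_to_sql])
adf_client.pipelines.create_or_update(rg_name, df_name, "CovidRefreshPipeline", pipeline)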
Click on Validate and, if there are no errors, click on Debug to perform a test run. Once the test run is successful, click on “Publish all” to save and publish the pipeline. To have the pipeline run on a schedule, you can create a trigger and plan the pipeline’s execution based on your requirements:
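If you would rather define the schedule in code, a weekly schedule trigger could be sketched as below. The trigger name, start time, and frequency are assumptions chosen to match the dataset’s weekly publication, and a newly created trigger must be started before the schedule takes effect.

from datetime import datetime, timezone
from azure.mgmt.datafactory.models import (
    TriggerResource, ScheduleTrigger, ScheduleTriggerRecurrence,
    TriggerPipelineReference, PipelineReference,
)

# Weekly recurrence; adjust frequency, interval, and start_time to match the dataset's publication schedule
recurrence = ScheduleTriggerRecurrence(
    frequency="Week",
    interval=1,
    start_time=datetime(2021, 1, 4, 12, 0, tzinfo=timezone.utc),
    time_zone="UTC",
)

trigger = TriggerResource(
    properties=ScheduleTrigger(
        recurrence=recurrence,
        pipelines=[
            TriggerPipelineReference(
                pipeline_reference=PipelineReference(
                    type="PipelineReference", reference_name="CovidRefreshPipeline"
                ),
                parameters={},
            )
        ],
    )
)
adf_client.triggers.create_or_update(rg_name, df_name, "WeeklyRefreshTrigger", trigger)

# A new trigger starts in the "Stopped" state; start it so the schedule takes effect
# (older SDK versions expose .start() instead of .begin_start())
adf_client.triggers.begin_start(rg_name, df_name, "WeeklyRefreshTrigger").result()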
Important: You should note that Azure, like other popular cloud services, is a paid service, and if you are not careful with the pricing structures you might incur unexpected charges to your Azure account. The pricing formula can be complex and depends on several parameters, so it is highly recommended that you research the expected costs. You can also use the Azure pricing calculator to get a sense of the pricing details.
How you set up your pipelines, activities, and triggers, as well as the size of your data, affects how many DIU-hours you consume (a Data Integration Unit is a combination of CPU, memory, and network resource allocation), which is what your account is charged for. I found this article, which explains the costs associated with the Data Factory service in simple terms. However, it is highly recommended that you follow Microsoft Azure’s official resources so as not to miss any updates or changes to the pricing schemes.