Written by Kristy Noms, on 06/01/2022
5 minute read
Load data from On-Premises SQL Server Database to Azure Data Lake Storage and query it using Azure Synapse Analytics
Creating a simple BI environment in Azure using a local SQL Server sending data to a Data Lake through Azure Data Factory
The idea for this lab was to create a simple BI environment in Azure: an on-premises SQL Server sends data to a Data Lake through Azure Data Factory, and the data is then queried with Azure Synapse Analytics.
For this to be possible, we're going to use the following resources:
+++ On-premises SQL Server with a StackOverflow database installed.
+++ Azure Storage account
+++ Azure Data Lake Gen2 Container to store our data
+++ Azure Data Factory
+++ Azure Synapse Analytics
1 – Source database
The database used for this lab is the open-source data export from Stack Overflow (2010 version), restored on a SQL Server 2019 instance (Developer Edition).
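If you want a quick sanity check that the restore worked, a query along these lines should do; it assumes the database name StackOverflow2010 and the table and column names from the standard Stack Overflow export, so adjust them to your own setup:

```sql
-- Quick sanity check against the restored Stack Overflow database
-- (database name and table/column names assume the standard 2010 export)
USE StackOverflow2010;
GO

SELECT TOP (10)
       Id,
       PostTypeId,
       CreationDate,
       Score,
       Title
FROM   dbo.Posts
ORDER BY CreationDate;
```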
2 – Storage Account
To store our data, we first need to create a Storage Account in the Azure portal.
After that, we need to create a container where our data will be stored; I created one named dataloc:
We also need to enable Data Lake Gen2, but first let's disable ENABLE SOFT DELETE FOR BLOBS; without disabling it, we'll receive an error when upgrading to Gen2.
Then we can do the upgrade:
3 – Azure Data Factory
We’ll use Azure Data Factory to create a pipeline to transfer data from our local SQL Server to the Azure Data Lake.
For this to work, we need to set up a connection to our on-premises SQL Server so it can be used as a data source; this way we avoid setting up a site-to-site connection or a VPN.
(Data Factory instance)
(Data Factory home page)
Now we need to configure a self-hosted integration runtime.
The steps to configure Integration Runtime are the following:
Azure Portal → Data Factory → Manage → Integration Runtimes → New
Then you'll need to download and install the integration runtime on the SQL Server host.
Once installed, it’ll ask for the authentication key from the previous step:
Finally, our on-premises server is registered and connected to Azure Data Factory:
Returning to the portal we can see our integration runtime:
Back to the home page, click on Ingest:
It'll open the Copy Data tool; for now we'll use the "Built-in copy task" with "Run once now" to run a single load.
Here we need to click on New connection and choose our self-hosted runtime under "Connect via integration runtime".
If the connection test succeeds, we can click Create.
Once connected, you'll see a list of all tables in the database. You can choose multiple tables to export or write a query to export specific data; in this lab we're going to use a query to get a subset of records from the Posts table.
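As a rough sketch, the source query I have in mind looks something like this; the filter values are only illustrative, and the column names assume the standard Stack Overflow export:

```sql
-- Example source query for the copy task: a subset of the Posts table
-- (filter values are only illustrative; adjust them to your own needs)
SELECT Id,
       PostTypeId,
       CreationDate,
       Score,
       ViewCount,
       Title
FROM   dbo.Posts
WHERE  CreationDate >= '2010-01-01'
  AND  Score > 10;
```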
Now let's set up our target destination by clicking on New connection:
Then let's select the container we created earlier:
We have some format options to choose from; I'll use Parquet format with snappy compression.
Next, we have to configure the task name and logging options:
Then we’ll see the summary of all operations:
Once the deployment is complete, we can monitor our pipeline and see the results:
We can also take a look at the pipeline that was created; if we need to add or remove a table from the routine, it can be done from here:
Accessing our container, we can see all of the copied data:
4 – Azure Synapse Analytics
With our data saved, we'll now create a Synapse Workspace to consume the Parquet files, but first let's register the resource provider for the Synapse service in our subscription:
Now let’s create the Synapse Workspace linking the storage account and the container that will be used:
At this point you should have the following resources created:
Getting back to Synapse, open the workspace URL, and on the home page click:
New → SQL script
In the Data tab, we can see all of the files available to this workspace:
Right-clicking the file, we can query it or load it:
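For reference, a minimal ad-hoc query over a Parquet file in the serverless SQL pool looks something like this; the storage account name and file path below are placeholders for my environment:

```sql
-- Ad-hoc query over the exported Parquet file using the serverless SQL pool
-- (storage account name and file path are placeholders; use your own)
SELECT TOP 100 *
FROM OPENROWSET(
        BULK 'https://<storageaccount>.dfs.core.windows.net/dataloc/Posts.parquet',
        FORMAT = 'PARQUET'
     ) AS [result];
```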
Now the best part: we can also create a database and expose our Parquet files through it. I created a database named BI and a view named BI_POSTS over the Parquet file exported from the Posts table:
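Roughly, the script behind it looks like the sketch below; again, the storage account name and file path are placeholders, and the view simply wraps the same OPENROWSET call:

```sql
-- Create a database and a view over the Posts Parquet file
-- (storage account name and file path are placeholders; use your own)
CREATE DATABASE BI;
GO
USE BI;
GO
CREATE VIEW dbo.BI_POSTS AS
SELECT *
FROM OPENROWSET(
        BULK 'https://<storageaccount>.dfs.core.windows.net/dataloc/Posts.parquet',
        FORMAT = 'PARQUET'
     ) AS [posts];
GO

-- The view can then be queried like a regular table:
SELECT TOP 10 Title, Score
FROM   dbo.BI_POSTS
ORDER BY Score DESC;
```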
If you have a Power BI workspace, you can link it with Synapse and start using it as a source for your dashboards.
5 – Conclusion
+++ This lab was meant to show how simple it is to integrate our on-premises environment with Azure.
+++ We exported our local data to the cloud and stored it in a container.
+++ From the many options available, I chose to use Synapse to work with the stored data.
+++ In a production environment, we would use dedicated users following the principle of least privilege, with restricted access.