Azure Data Factory

Ojash Shrestha
Aug 1, 2021

In this article, we’ll learn about datasets, the JSON format in which they are defined, and their usage in Azure Data Factory pipelines. The article contains a sample dataset definition with its properties described, covers the types of datasets and the data stores supported by Data Factory, lists the tools for creating datasets, and compares the versions of Data Factory in tabular format. Moreover, the naming rules are discussed, along with a brief introduction to CI/CD for Azure Data Factory.

A data factory can have one or more pipelines. A pipeline is a logical grouping of activities that together perform a task; the activities in the pipeline define the actions to be performed on the data.
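
As a minimal sketch (placeholder names in angle brackets, in the same style as the dataset JSON later in this article), the general shape of a pipeline definition is:

{
    "name": "<name of pipeline>",
    "properties": {
        "description": "<description of pipeline>",
        "activities": [
            {
                "name": "<name of activity>",
                "type": "<type of activity: Copy, HDInsightHive etc...>",
                "typeProperties": {
                    "<type specific property>": "<value>"
                }
            }
        ],
        "parameters": {}
    }
}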

A dataset can be understood as a named view of data that references the data to be used by activities as inputs and outputs. Datasets identify data in a multitude of data stores, for instance tables, files, documents, and folders. To go a little deeper, take the example of Azure Blob storage: an Azure Blob dataset specifies the blob container and folder in Azure Blob storage from which the activity reads the data.

A linked service needs to be created before creating a dataset, in order to link the data store to the data factory. Linked services define the connection information that Data Factory needs to connect to external resources. A storage account is linked to the data factory through an Azure Storage linked service; the input blobs to be processed reside in that storage account, and the Azure Blob dataset represents the container and folder that hold them.
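
For instance, the AzureBlobStorage linked service referenced by the dataset examples later in this article could be defined roughly as below. This is only a sketch: the connection string values are placeholders, and in practice the account key is usually kept in Azure Key Vault rather than inline.

{
    "name": "AzureBlobStorage",
    "properties": {
        "type": "AzureBlobStorage",
        "typeProperties": {
            "connectionString": "DefaultEndpointsProtocol=https;AccountName=<account name>;AccountKey=<account key>"
        }
    }
}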

Putting these pieces together: a pipeline groups one or more activities, an activity consumes or produces one or more datasets, and each dataset points to data in a store that the data factory reaches through a linked service.

Activity

An activity is a task that is performed on the data. Activities are used inside Azure Data Factory (ADF) pipelines; an ADF pipeline is basically a group of one or more activities. For instance, an ADF pipeline that performs ETL uses multiple activities to extract data, transform it, and load it into a data warehouse. Examples of activities are Hive, Stored Procedure, Copy, MapReduce, and so on.

Hive — The Hive activity is an HDInsight activity that executes Hive queries on an HDInsight cluster (Linux or Windows) and is used to analyze and process structured data.

Stored Proc — The Stored Procedure activity in a Data Factory pipeline invokes a SQL Server stored procedure. Azure SQL Database, Azure Synapse Analytics, and SQL Server databases are some of the data stores where stored procedures can be used. A hedged sketch of such an activity is shown just below.
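
The sketch below shows what a Stored Procedure activity could look like inside a pipeline’s activities array; the linked service name, procedure name, and parameter are hypothetical placeholders.

{
    "name": "RunUpdateStatisticsProc",
    "type": "SqlServerStoredProcedure",
    "linkedServiceName": {
        "referenceName": "AzureSqlDatabaseLinkedService",
        "type": "LinkedServiceReference"
    },
    "typeProperties": {
        "storedProcedureName": "usp_UpdateStatistics",
        "storedProcedureParameters": {
            "RunDate": { "value": "2021-08-01", "type": "String" }
        }
    }
}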

Copy — The Copy activity copies data from a source location to a destination location. Numerous data store locations, such as NoSQL stores, files, Azure Storage, and Azure databases, are supported by Data Factory.
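
As a rough illustration, a Copy activity that reads the DelimitedTextInput dataset defined later in this article and writes to a hypothetical AzureSqlOutputTable dataset might look like this inside a pipeline:

{
    "name": "CopyInputLogToSql",
    "type": "Copy",
    "inputs": [
        { "referenceName": "DelimitedTextInput", "type": "DatasetReference" }
    ],
    "outputs": [
        { "referenceName": "AzureSqlOutputTable", "type": "DatasetReference" }
    ],
    "typeProperties": {
        "source": { "type": "DelimitedTextSource" },
        "sink": { "type": "AzureSqlSink" }
    }
}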

Dataset in Data Factory

The JSON below shows the general format of a dataset definition in Data Factory.

{
    "name": "<name of dataset>",
    "properties": {
        "type": "<type of dataset: DelimitedText, AzureSqlTable etc...>",
        "linkedServiceName": {
            "referenceName": "<name of linked service>",
            "type": "LinkedServiceReference"
        },
        "schema": [],
        "typeProperties": {
            "<type specific property>": "<value>",
            "<type specific property 2>": "<value 2>"
        }
    }
}

The properties in the above JSON are described in the table below.

Property | Description | Required
name | The name of the dataset. | Yes
type | The type of the dataset. One of the types supported by Data Factory must be specified. | Yes
schema | The schema of the dataset, representing the physical data type and shape. | No
typeProperties | The type properties are different for each type. | Yes

Types of Datasets

Azure Data Factory supports a multitude of dataset types, depending on the data stores to be used. The Connector Overview lists the data stores that are supported by Data Factory.

The following JSON sets the type to DelimitedText for a delimited text dataset.

{
    "name": "DelimitedTextInput",
    "properties": {
        "linkedServiceName": {
            "referenceName": "AzureBlobStorage",
            "type": "LinkedServiceReference"
        },
        "annotations": [],
        "type": "DelimitedText",
        "typeProperties": {
            "location": {
                "type": "AzureBlobStorageLocation",
                "fileName": "input.log",
                "folderPath": "inputdata",
                "container": "adfgetstarted"
            },
            "columnDelimiter": ",",
            "escapeChar": "\\",
            "quoteChar": "\""
        },
        "schema": []
    }
}

Creating Datasets

Datasets can be created using tools or SDKs such as Azure Resource Manager templates, the Azure portal, PowerShell, the REST API, and the .NET API.
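
For the Resource Manager route, a hedged sketch of deploying the DelimitedTextInput dataset as a factory child resource is shown below. The factoryName parameter is an assumption, and the AzureBlobStorage linked service is assumed to already exist in the target factory.

{
    "$schema": "https://schema.management.azure.com/schemas/2019-04-01/deploymentTemplate.json#",
    "contentVersion": "1.0.0.0",
    "parameters": {
        "factoryName": { "type": "string" }
    },
    "resources": [
        {
            "type": "Microsoft.DataFactory/factories/datasets",
            "apiVersion": "2018-06-01",
            "name": "[concat(parameters('factoryName'), '/DelimitedTextInput')]",
            "properties": {
                "linkedServiceName": {
                    "referenceName": "AzureBlobStorage",
                    "type": "LinkedServiceReference"
                },
                "type": "DelimitedText",
                "typeProperties": {
                    "location": {
                        "type": "AzureBlobStorageLocation",
                        "fileName": "input.log",
                        "folderPath": "inputdata",
                        "container": "adfgetstarted"
                    }
                }
            }
        }
    ]
}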

Moreover, there are a few differences between Version 1 datasets and current-version datasets in Data Factory.

CI/CD in Azure Data Factory

Continuous Integration is the practice of automatically testing every change made to the codebase as early as possible. Continuous Delivery follows the testing that happens during continuous integration and pushes the changes to a staging or production system.

In Azure Data Factory, CI/CD refers to moving Data Factory pipelines from one environment, such as development, test, or production, to another. Azure Data Factory uses Azure Resource Manager templates to store the configuration of its various ADF entities, such as pipelines, datasets, and data flows.

There are basically two suggested methods for promoting a data factory to another environment:

  • Manually uploading a Resource Manager template using the Data Factory UX integration with Azure Resource Manager
  • Automated deployment using Data Factory’s integration with Azure Pipelines
