DP-900: Microsoft Azure Data Fundamentals

Analyzing Data

Demo: Data Factory

In this step-by-step guide, you will create an Azure Data Factory instance and build a simple pipeline to copy a CSV file (biostats.csv) from Azure Blob Storage into Azure Data Lake Storage Gen2.

Prerequisites

  • Active Azure subscription
  • Existing Azure Synapse Analytics workspace with Data Lake Storage Gen2
  • Storage account (e.g., PHVNewStorage) containing a sampledata container

For more details, refer to the Azure Data Factory documentation.

1. Create an Azure Data Factory

  1. Sign in to the Azure Portal and navigate to Data Factories. Click + Create.
    The image shows a Microsoft Azure portal page for "Data factories," indicating that there are no data factories to display. There is an option to create a new data factory.

  2. On the Create Data Factory blade, configure:

    • Subscription
    • Resource group
    • Instance name
    • Region
    • Version (defaults to V2)
      The image shows a Microsoft Azure portal interface for creating a Data Factory, with fields for subscription, resource group, instance name, region, and version.
  3. Click Review + create, then Create. Wait for deployment to finish.
    The image shows a Microsoft Azure portal page indicating that a deployment named "Microsoft.DataFactory-20230910200037" has been successfully completed. It includes deployment details and options to go to the resource or pin it to the dashboard.

  4. In your resource group, locate the new Data Factory. Refresh if necessary.
    The image shows a Microsoft Azure portal interface displaying a resource group named "DefaultResourceGroup-EUS" with a list of resources, including their names, types, and locations.

  5. Select the Data Factory to open its overview page.
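The portal flow above is all this demo requires, but the same factory can also be provisioned from code. The following is a minimal, non-authoritative sketch using the azure-mgmt-datafactory Python package; the subscription ID, factory name (phv-demo-adf), and region are placeholders rather than values from this demo, while the resource group matches the one shown in the portal.

```python
from azure.identity import DefaultAzureCredential
from azure.mgmt.datafactory import DataFactoryManagementClient
from azure.mgmt.datafactory.models import Factory

subscription_id = "<your-subscription-id>"          # placeholder
resource_group = "DefaultResourceGroup-EUS"         # resource group shown in the portal steps
factory_name = "phv-demo-adf"                       # hypothetical instance name

credential = DefaultAzureCredential()
adf_client = DataFactoryManagementClient(credential, subscription_id)

# Create (or update) a V2 data factory; the region here is only an example.
factory = adf_client.factories.create_or_update(
    resource_group, factory_name, Factory(location="eastus")
)
print(factory.name, factory.provisioning_state)
```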

2. Launch Azure Data Factory Studio

Azure Data Factory Studio is the integrated web UI for designing and monitoring pipelines.

  1. On the Data Factory overview page, click Launch Studio.
    The image shows a Microsoft Azure portal interface for a Data Factory resource, with options to launch Azure Data Factory Studio and access various features like tutorials and templates.

  2. A new browser tab opens with the Studio workspace.

3. Set Up Source and Destination Storage

We will copy biostats.csv from the PHVNewStorage account's sampledata container into your Data Lake Gen2 storage.

Note

Ensure you have the Storage Blob Data Contributor role on both the source and destination storage accounts.
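A quick way to confirm the role assignment is to list the source container with your Azure AD identity. The sketch below assumes the azure-identity and azure-storage-blob packages and the PHVNewStorage/sampledata names from this demo; without the required role, the listing fails with an authorization error.

```python
from azure.identity import DefaultAzureCredential
from azure.storage.blob import BlobServiceClient

# Source account from this demo; listing blobs with your Azure AD identity
# requires Storage Blob Data Contributor (or at least Reader) on the account.
account_url = "https://phvnewstorage.blob.core.windows.net"
service = BlobServiceClient(account_url=account_url, credential=DefaultAzureCredential())
container = service.get_container_client("sampledata")

for blob in container.list_blobs():
    print(blob.name)
```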

4. Build the Copy Data Pipeline

  1. In Studio, select Ingest from the left menu, then click Copy Data.
    The image shows the Azure Data Factory interface with options for ingesting, orchestrating, transforming data, and configuring SSIS. It includes a navigation bar and a section for recent resources.

  2. Choose Built-in copy task, set Run once now, and click Next.
    The image shows a Microsoft Azure Data Factory interface for the "Copy Data Tool," where users can select task types and configure task schedules for data copying. Options include "Built-in copy task" and "Metadata-driven copy task," with a task schedule set to "Run once now."

4.1 Configure Source

  1. Under Source data store, select Azure Blob Storage.

  2. Click New to create a linked service. Name it (e.g., csvhome), choose your subscription, and select PHVNewStorage. Test the connection, then save. (An equivalent SDK call is sketched after this list.)
    The image shows a Microsoft Azure interface for setting up a new connection in the Copy Data tool, specifically configuring Azure Blob Storage as the source data store. Options for account selection and connection testing are visible.

  3. Click Browse, expand containers, and select sampledata.
    The image shows a screenshot of the "Copy Data tool" in Microsoft Azure Data Factory, specifically the "Source data store" configuration page. It includes options for selecting the source type, connection, and file or folder, along with additional settings like binary copy and recursion.

  4. Enable Recursion to include subfolders. Click Next.
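If you later want to script the same setup, the linked service from step 2 corresponds to a single SDK call. The sketch below is illustrative only; the factory name and resource group are carried over from the earlier hypothetical snippet, and the connection string is a placeholder.

```python
from azure.identity import DefaultAzureCredential
from azure.mgmt.datafactory import DataFactoryManagementClient
from azure.mgmt.datafactory.models import (
    LinkedServiceResource,
    AzureBlobStorageLinkedService,
)

subscription_id = "<your-subscription-id>"          # placeholder
resource_group = "DefaultResourceGroup-EUS"
factory_name = "phv-demo-adf"                       # hypothetical factory from the earlier sketch

adf_client = DataFactoryManagementClient(DefaultAzureCredential(), subscription_id)

# Linked service pointing at the PHVNewStorage source account.
source_ls = LinkedServiceResource(
    properties=AzureBlobStorageLinkedService(
        connection_string="<PHVNewStorage-connection-string>"  # placeholder secret
    )
)
adf_client.linked_services.create_or_update(
    resource_group, factory_name, "csvhome", source_ls
)
```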

4.2 Upload or Verify Your CSV

  1. In the Azure Portal, open Storage Accounts > PHVNewStorage > Containers > sampledata.
    The image shows a Microsoft Azure portal interface displaying storage account details, specifically the "Containers" section with two containers listed: "$logs" and "sampledata."

  2. Remove any old files, then upload biostats.csv. (A scripted upload is sketched after this list.)
    The image shows a Microsoft Azure portal interface for uploading a blob to a storage container named "sampledata." A file named "biostats.csv" is selected for upload.

  3. Return to Studio and click Next.
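For reference, the upload in step 2 can also be scripted. The sketch below assumes the azure-storage-blob package, Azure AD access to PHVNewStorage, and a local copy of biostats.csv in the working directory.

```python
from azure.identity import DefaultAzureCredential
from azure.storage.blob import BlobServiceClient

# PHVNewStorage, as in the portal steps.
account_url = "https://phvnewstorage.blob.core.windows.net"
service = BlobServiceClient(account_url=account_url, credential=DefaultAzureCredential())
container = service.get_container_client("sampledata")

# Upload (or replace) biostats.csv in the sampledata container.
with open("biostats.csv", "rb") as data:
    container.upload_blob(name="biostats.csv", data=data, overwrite=True)
```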

4.3 Define Source File Format

  • Select DelimitedText with comma (,) as the column delimiter, and keep the remaining default settings.
    The image shows a Microsoft Azure Data Factory interface, specifically the "Copy Data tool" with "File format settings" options for configuring data import, including file format, column delimiter, and row delimiter.

Click Next.
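In SDK terms, these format choices map to a DelimitedText dataset defined over the csvhome linked service. The sketch below reuses the hypothetical resource group and factory name from the earlier snippets; the dataset name SourceCsvDataset is illustrative.

```python
from azure.identity import DefaultAzureCredential
from azure.mgmt.datafactory import DataFactoryManagementClient
from azure.mgmt.datafactory.models import (
    DatasetResource,
    DelimitedTextDataset,
    AzureBlobStorageLocation,
    LinkedServiceReference,
)

subscription_id = "<your-subscription-id>"          # placeholder
resource_group = "DefaultResourceGroup-EUS"
factory_name = "phv-demo-adf"                       # hypothetical factory from the earlier sketches

adf_client = DataFactoryManagementClient(DefaultAzureCredential(), subscription_id)

# Comma-delimited source dataset pointing at biostats.csv in the sampledata container.
source_ds = DatasetResource(
    properties=DelimitedTextDataset(
        linked_service_name=LinkedServiceReference(
            type="LinkedServiceReference", reference_name="csvhome"
        ),
        location=AzureBlobStorageLocation(
            container="sampledata", file_name="biostats.csv"
        ),
        column_delimiter=",",
    )
)
adf_client.datasets.create_or_update(
    resource_group, factory_name, "SourceCsvDataset", source_ds
)
```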

4.4 Configure Destination

  1. For Destination data store, pick Azure Blob Storage (Data Lake Storage Gen2 can be accessed through the Blob API).

  2. Create a new linked service: name it (e.g., destination), select your Data Lake account (e.g., phvsinaccount), and save.
    The image shows a Microsoft Azure Data Factory interface, specifically the "Copy Data tool" where a user is selecting a destination data store for a new connection, with options like Azure Blob Storage and Azure Cosmos DB.
    The image shows a screenshot of the Azure Data Factory interface, specifically the "Copy Data tool" where a new connection to Azure Blob Storage is being configured.

  3. Browse to the sampledata folder path.

  4. Leave File name blank, choose Preserve hierarchy, and accept defaults.
    The image shows a screenshot of the Microsoft Azure Data Factory interface, specifically the "Copy Data tool" where the user is configuring the destination data store settings for a data copy task.

Click Next.
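The destination side mirrors the source: a second linked service for the Data Lake account and a delimited-text dataset for the output folder. The sketch below again uses placeholder names (the destination linked service, DestinationCsvDataset, and the phvsinaccount connection string) and assumes the output lands in a sampledata container, as in the portal steps.

```python
from azure.identity import DefaultAzureCredential
from azure.mgmt.datafactory import DataFactoryManagementClient
from azure.mgmt.datafactory.models import (
    LinkedServiceResource,
    AzureBlobStorageLinkedService,
    DatasetResource,
    DelimitedTextDataset,
    AzureBlobStorageLocation,
    LinkedServiceReference,
)

subscription_id = "<your-subscription-id>"          # placeholder
resource_group = "DefaultResourceGroup-EUS"
factory_name = "phv-demo-adf"                       # hypothetical factory from the earlier sketches

adf_client = DataFactoryManagementClient(DefaultAzureCredential(), subscription_id)

# Linked service for the Data Lake Gen2 account, accessed here through the Blob API.
dest_ls = LinkedServiceResource(
    properties=AzureBlobStorageLinkedService(
        connection_string="<phvsinaccount-connection-string>"  # placeholder secret
    )
)
adf_client.linked_services.create_or_update(
    resource_group, factory_name, "destination", dest_ls
)

# Output dataset: same delimited format, no file name, written into a sampledata container.
dest_ds = DatasetResource(
    properties=DelimitedTextDataset(
        linked_service_name=LinkedServiceReference(
            type="LinkedServiceReference", reference_name="destination"
        ),
        location=AzureBlobStorageLocation(container="sampledata"),
        column_delimiter=",",
    )
)
adf_client.datasets.create_or_update(
    resource_group, factory_name, "DestinationCsvDataset", dest_ds
)
```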

4.5 Review Output File Format

Keep the default delimited-text output settings (comma delimiter; leave the header-row options unchanged).
The image shows a screenshot of the "Copy Data tool" in Microsoft Azure, specifically the "File format settings" section where options for file format, column delimiter, and row delimiter are being configured.

Click Next.

4.6 Finalize and Run Pipeline

  1. Assign a task name (e.g., Copy_CSV) and an optional description.

  2. Accept the default fault tolerance and logging settings.

  3. Review the summary and click Finish.
    The image shows a screenshot of the Azure Data Factory interface, specifically the "Copy Data tool" where a pipeline is set up to copy data from one Azure Blob Storage to another. It includes details like task name, source, and properties.

  4. The pipeline will validate and execute automatically. Monitor the status; all activities should show Succeeded.
    The image shows a Microsoft Azure interface indicating that a data deployment process using the Copy Data tool has been completed successfully, with all steps marked as succeeded.
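For completeness, here is a hedged sketch of the same pipeline built and triggered through the SDK, using the hypothetical linked services and datasets from the earlier snippets. It defines one copy activity, creates the Copy_CSV pipeline, runs it once, and polls the run status until it leaves the Queued/InProgress states.

```python
import time

from azure.identity import DefaultAzureCredential
from azure.mgmt.datafactory import DataFactoryManagementClient
from azure.mgmt.datafactory.models import (
    PipelineResource,
    CopyActivity,
    DatasetReference,
    DelimitedTextSource,
    DelimitedTextSink,
    DelimitedTextWriteSettings,
    AzureBlobStorageReadSettings,
    AzureBlobStorageWriteSettings,
)

subscription_id = "<your-subscription-id>"          # placeholder
resource_group = "DefaultResourceGroup-EUS"
factory_name = "phv-demo-adf"                       # hypothetical factory from the earlier sketches

adf_client = DataFactoryManagementClient(DefaultAzureCredential(), subscription_id)

# Single copy activity: delimited-text source dataset -> delimited-text sink dataset.
copy_activity = CopyActivity(
    name="CopyBiostatsCsv",
    inputs=[DatasetReference(type="DatasetReference", reference_name="SourceCsvDataset")],
    outputs=[DatasetReference(type="DatasetReference", reference_name="DestinationCsvDataset")],
    source=DelimitedTextSource(
        store_settings=AzureBlobStorageReadSettings(recursive=True)  # mirrors the Recursion option
    ),
    sink=DelimitedTextSink(
        store_settings=AzureBlobStorageWriteSettings(copy_behavior="PreserveHierarchy"),
        format_settings=DelimitedTextWriteSettings(file_extension=".csv"),
    ),
)
adf_client.pipelines.create_or_update(
    resource_group, factory_name, "Copy_CSV", PipelineResource(activities=[copy_activity])
)

# Trigger a one-off run and poll until it finishes.
run = adf_client.pipelines.create_run(resource_group, factory_name, "Copy_CSV", parameters={})
while True:
    status = adf_client.pipeline_runs.get(resource_group, factory_name, run.run_id)
    if status.status not in ("Queued", "InProgress"):
        break
    time.sleep(15)
print("Pipeline run finished with status:", status.status)
```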

5. Verify in the Data Lake

In the Azure Portal, navigate to your Data Lake storage account and confirm that biostats.csv appears under the sampledata folder.
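The same check can be scripted with the Data Lake Gen2 SDK (azure-storage-filedatalake). The account and file-system names below are assumptions matching this demo's destination (phvsinaccount, sampledata); adjust them to your own environment.

```python
from azure.identity import DefaultAzureCredential
from azure.storage.filedatalake import DataLakeServiceClient

# Data Lake Gen2 endpoint of the destination account (hypothetical name from this demo).
account_url = "https://phvsinaccount.dfs.core.windows.net"
service = DataLakeServiceClient(account_url=account_url, credential=DefaultAzureCredential())
file_system = service.get_file_system_client("sampledata")  # assumed container/file-system name

# List the copied files; biostats.csv should appear here after the pipeline run.
for path in file_system.get_paths():
    print(path.name)
```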


By following these steps, you've successfully implemented an Azure Data Factory pipeline to transfer CSV data from Blob Storage to Data Lake Storage Gen2.
