Backup Concepts for Azure Data Lake

Azure Data Lake Storage is designed to enable operational and exploratory analytics through a hyper-scale repository. Two types of Data Lake Store are currently available in Azure: Gen1 and Gen2. For new deployments it is recommended to use Data Lake Storage Gen2. The service already replicates the data internally, so the backup concept has to cover the “human fault” component (accidental deletion or corruption) as well as the technical backup aspect.

Data Lake Gen1

Azure Data Lake Storage Gen1 is an enterprise-wide hyper-scale repository for big data analytic workloads. Azure Data Lake enables you to capture data of any size, type, and ingestion speed in one single place for operational and exploratory analytics as stated at: https://docs.microsoft.com/de-de/azure/data-lake-store/data-lake-store-overview

Data Lake Gen2

Azure Data Lake Storage Gen2 is a set of capabilities dedicated to big data analytics, built on Azure Blob storage. Data Lake Storage Gen2 is the result of converging the capabilities of two existing storage services, Azure Blob storage and Azure Data Lake Storage Gen1. Features from Azure Data Lake Storage Gen1, such as file system semantics, directory- and file-level security, and scale, are combined with the low-cost, tiered storage and high availability/disaster recovery capabilities of Azure Blob storage.

ADLS Gen2 is based on Azure Storage. Therefore, storage capacity is virtually limitless. Also, all the high availability features (GRS, RA-GRS etc.) supported by Azure Storage are readily available for ADLS Gen2. This also means ADLS Gen2 takes advantage of all the security lockdown features offered by Azure Storage. Azure Storage supports RBAC-based resource access control, and so does ADLS. In addition, Access Control Lists (ACLs) offer fine-grained access control to files and directories.
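
As an illustration of the two access layers, the following Azure PowerShell sketch assigns an RBAC data-plane role on the storage account and appends an ACL entry on one directory. The account, filesystem, directory and principal identifiers are placeholders for this example, not values from an actual setup.

```powershell
# Placeholder names: account "adlsgen2acct", filesystem "raw", directory "sales/2020".
$ctx = New-AzStorageContext -StorageAccountName "adlsgen2acct" -UseConnectedAccount

# Coarse-grained access: RBAC role assignment on the storage account.
New-AzRoleAssignment -ObjectId "<aad-object-id>" `
    -RoleDefinitionName "Storage Blob Data Reader" `
    -Scope "/subscriptions/<sub-id>/resourceGroups/rg-datalake/providers/Microsoft.Storage/storageAccounts/adlsgen2acct"

# Fine-grained access: append a read/execute ACL entry to the directory's existing ACL.
$dir = Get-AzDataLakeGen2Item -Context $ctx -FileSystem "raw" -Path "sales/2020"
$acl = Set-AzDataLakeGen2ItemAclObject -AccessControlType user `
    -EntityId "<aad-object-id>" -Permission "r-x" -InputObject $dir.ACL
Update-AzDataLakeGen2Item -Context $ctx -FileSystem "raw" -Path "sales/2020" -Acl $acl
```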

ADLS Gen2 is based on the hierarchical namespace feature of Azure Storage. Object stores like Azure Blob storage have historically had virtual file paths but no physically implemented file system. This makes it harder to query, iterate or move files within a path, because it means iterating over all the blobs; at analytical workload scales, the latency of such operations becomes noticeable. The hierarchical namespace in ADLS Gen2 introduces directories and a real file system, which helps to organize the data within directories. It also helps to grant or restrict access at directory or file level. See https://docs.microsoft.com/en-au/azure/storage/blobs/data-lake-storage-introduction for more details.
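
A short sketch of what the hierarchical namespace enables, again with placeholder names ("adlsgen2acct", filesystem "raw"): directories are first-class objects that can be created, listed and moved without enumerating every blob.

```powershell
$ctx = New-AzStorageContext -StorageAccountName "adlsgen2acct" -UseConnectedAccount

# Create a real directory (not just a virtual blob prefix).
New-AzDataLakeGen2Item -Context $ctx -FileSystem "raw" -Path "sales/2020" -Directory

# List only the children of that directory instead of scanning all blobs.
Get-AzDataLakeGen2ChildItem -Context $ctx -FileSystem "raw" -Path "sales"

# Move/rename the whole directory as a single metadata operation.
Move-AzDataLakeGen2Item -Context $ctx -FileSystem "raw" -Path "sales/2020" `
    -DestFileSystem "raw" -DestPath "archive/sales-2020"
```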

Microsoft Recommendations

Microsoft recommends copying the data into another Data Lake Store in another region at a dedicated frequency. This can be done via ADLCopy, Azure PowerShell or Data Factory.
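
For the Azure PowerShell route with a Gen1 account, a minimal sketch could look as follows; the copy is staged through a local folder, and the account names and paths are assumptions for the example. For larger volumes, AdlCopy or a Data Factory copy pipeline avoids the local staging step.

```powershell
# Placeholder names: primary account "adlsprod", secondary-region account "adlsbackup".
$stage = "C:\adls-staging"

# Download a folder from the primary Data Lake Store (Gen1) ...
Export-AzDataLakeStoreItem -AccountName "adlsprod" -Path "/data" -Destination $stage -Recurse

# ... and upload it to the Data Lake Store in the secondary region.
Import-AzDataLakeStoreItem -AccountName "adlsbackup" -Path $stage -Destination "/data" -Recurse
```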

To guard against data corruption or accidental deletion, it is recommended to use resource locks and the available Data Lake security features, and to restrict access via RBAC roles.

Details are listed at https://docs.microsoft.com/en-au/azure/data-lake-store/data-lake-store-disaster-recovery-guidance and https://docs.microsoft.com/en-au/azure/data-lake-store/data-lake-store-security-overview
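
A minimal sketch for the lock and RBAC part, assuming a Gen1 account "adlsprod" in resource group "rg-datalake" (both placeholders): a CanNotDelete lock protects the account resource itself, and a restrictive role keeps operators from modifying or deleting it.

```powershell
# Protect the Data Lake Store account against accidental deletion.
New-AzResourceLock -LockName "adls-no-delete" -LockLevel CanNotDelete `
    -ResourceGroupName "rg-datalake" -ResourceName "adlsprod" `
    -ResourceType "Microsoft.DataLakeStore/accounts"

# Restrict management access: Reader can view but not change or delete the account.
New-AzRoleAssignment -ObjectId "<aad-object-id>" -RoleDefinitionName "Reader" `
    -Scope "/subscriptions/<sub-id>/resourceGroups/rg-datalake/providers/Microsoft.DataLakeStore/accounts/adlsprod"
```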

Possible Implementations

The following section describes different implementation methods for the backup concept. The first two options use Data Factory, the third an Azure Function. ADLCopy is a command-line tool for copying files, and PowerShell also requires extra scripting, so the recommended way is to use Data Factory with dedicated triggers. Normally it is easier to back up Data Lake Store Gen2, as it is based on Azure Storage.

Backup via Data Factory and second Data Lake store

This method uses Data Factory to copy the data to another Azure Data Lake Store for backup. In the Data Factory, two pipelines are created: one that performs the backup and one that restores the backed-up data. Keep in mind that a second Data Lake Store may increase costs dramatically. A sketch for running the pipelines follows the prerequisites below.

Prerequisites:

  • Data Factory with Copy pipeline
  • Data Lake Store for Backup
  • Data Lake (Backup Source)
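
A minimal Azure PowerShell sketch for operating such a setup, assuming a factory named "df-backup" in resource group "rg-backup" with pipelines named "BackupPipeline" and "RestorePipeline" and a schedule trigger defined in trigger.json; all of these names are placeholders, and the pipelines with their Data Lake source and sink datasets are authored in the Data Factory itself.

```powershell
# Run the backup copy once, on demand.
Invoke-AzDataFactoryV2Pipeline -ResourceGroupName "rg-backup" `
    -DataFactoryName "df-backup" -PipelineName "BackupPipeline"

# Attach a schedule trigger and start it, so the backup runs at the dedicated frequency.
Set-AzDataFactoryV2Trigger -ResourceGroupName "rg-backup" -DataFactoryName "df-backup" `
    -Name "DailyBackupTrigger" -DefinitionFile ".\trigger.json"
Start-AzDataFactoryV2Trigger -ResourceGroupName "rg-backup" -DataFactoryName "df-backup" `
    -Name "DailyBackupTrigger"

# A restore is simply the second pipeline, run on demand when needed.
Invoke-AzDataFactoryV2Pipeline -ResourceGroupName "rg-backup" `
    -DataFactoryName "df-backup" -PipelineName "RestorePipeline"
```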

Backup via Data Factory and File Storage

Here Data Factory is used to copy the data to a Storage Account (Azure Files) for backup. In the Data Factory, two pipelines are created: one that performs the backup and one that restores the backed-up data. A sketch of the underlying copy step follows the prerequisites below.

Prerequisites:

  • Data Factory with Move and Transform pipeline
  • Storage Account (Backup Target) with HTTPS enabled
  • Access Key to the Storage Account
  • Data Lake (Backup Source)
  • Recovery Services Vault to back up the Storage Account for versioning
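
Expressed in Azure PowerShell for illustration, the copy step of such a pipeline amounts to reading a file from the Data Lake and writing it into the Azure Files share. Account, share and path names are placeholders; in the actual pipeline the copy activity handles folder hierarchy and scheduling.

```powershell
# Placeholder names: Data Lake account "adlsgen2acct" (filesystem "raw"),
# backup storage account "backupfiles01" with file share "datalake-backup".
$lakeCtx   = New-AzStorageContext -StorageAccountName "adlsgen2acct" -UseConnectedAccount
$backupCtx = New-AzStorageContext -StorageAccountName "backupfiles01" -StorageAccountKey "<access-key>"

# Download one file from the Data Lake and upload it into the file share root.
$tmp = Join-Path $env:TEMP "report.csv"
Get-AzDataLakeGen2ItemContent -Context $lakeCtx -FileSystem "raw" `
    -Path "sales/2020/report.csv" -Destination $tmp
Set-AzStorageFileContent -Context $backupCtx -ShareName "datalake-backup" `
    -Source $tmp -Path "report.csv" -Force
```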

Backup via Azure Function and File Storage

It is possible to use an Azure Function to automatically save data from the Data Lake to a Storage Account based on a specific trigger. A minimal timer-triggered sketch follows the prerequisites below.

Prerequisites:

  • App Service plan for the Azure Function and the function code
  • Storage Account (Backup Target)
  • Access Key to the Storage Account
  • Data Lake (Backup Source)
  • Recovery Services Vault to back up the Storage Account for versioning
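
A minimal sketch of the function body (run.ps1 of a timer-triggered PowerShell Azure Function), assuming the Az.Storage module is loaded via managed dependencies, the function's managed identity may read the Data Lake, and the backup account key is provided as an app setting; all account, filesystem and share names are placeholders.

```powershell
# run.ps1 - the schedule itself is defined in the accompanying function.json.
param($Timer)

# Source Data Lake (Gen2) and backup target (Azure Files), placeholder names.
$lakeCtx   = New-AzStorageContext -StorageAccountName "adlsgen2acct" -UseConnectedAccount
$backupCtx = New-AzStorageContext -StorageAccountName "backupfiles01" `
    -StorageAccountKey $env:BACKUP_STORAGE_KEY

# Copy every file below the "sales" directory into the file share.
$items = Get-AzDataLakeGen2ChildItem -Context $lakeCtx -FileSystem "raw" -Path "sales" -Recurse |
    Where-Object { -not $_.IsDirectory }

foreach ($item in $items) {
    $tmp = Join-Path $env:TEMP ([IO.Path]::GetFileName($item.Path))
    Get-AzDataLakeGen2ItemContent -Context $lakeCtx -FileSystem "raw" `
        -Path $item.Path -Destination $tmp
    Set-AzStorageFileContent -Context $backupCtx -ShareName "datalake-backup" `
        -Source $tmp -Path ([IO.Path]::GetFileName($item.Path)) -Force
    Remove-Item $tmp
}
```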

Note

In addition, it is possible to use an Azure Recovery Services Vault to back up the data of the Storage Account.
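
A sketch for enabling this, assuming a vault "backup-vault" in resource group "rg-backup", the backup storage account "backupfiles01" and the file share "datalake-backup" from the examples above (all placeholders), and an existing backup policy for Azure Files.

```powershell
$vault  = Get-AzRecoveryServicesVault -ResourceGroupName "rg-backup" -Name "backup-vault"

# Pick an existing Azure Files backup policy (or create one with
# New-AzRecoveryServicesBackupProtectionPolicy -WorkloadType AzureFiles ...).
$policy = Get-AzRecoveryServicesBackupProtectionPolicy -WorkloadType AzureFiles -VaultId $vault.ID |
    Select-Object -First 1

# Enable protection for the file share that holds the Data Lake backup copies.
Enable-AzRecoveryServicesBackupProtection -StorageAccountName "backupfiles01" `
    -Name "datalake-backup" -Policy $policy -VaultId $vault.ID
```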
