Data Governance for Cloud Data Deployments

6 min read · Apr 30, 2021

As the generation, processing, and consumption of data grow exponentially, governance functions and processes continue to lag behind. Most large organizations rely on the default implementation, or some customization, of their cloud provider’s platform services, which focus on metadata and security while leaving the other pillars untouched.

Introduction

Data Governance is a broad and complex topic that could easily fill a book. In this article, I will focus on the Data Catalog, Data Domains, Glossary, Metadata, Quality and Security aspects and highlight patterns and solutions for cloud deployments.

The industry authority DAMA-DMBOK (the Data Management Body of Knowledge by DAMA International) defines Data Governance as ‘The exercise of authority, control, and shared-decision making (planning, monitoring, and enforcement) over the management of data assets’. Governance specifies WHY and HOW data assets and their life cycle should be organized. It does so by setting up a framework and defining rules in the form of policies and procedures.

Image Source: Author showing Data Governance and Data Management components

Data Lake and Need for Governance

The Data Lake is rapidly evolving as a mechanism to ingest and store all forms of data (unstructured, semi-structured and structured; raw, semi-processed and processed) without any hierarchy, order or categorization. By that very definition, it tends over time to become a “Data Swamp”, with serious issues around finding relevant, credible and useful data. In the words of research firm Gartner:

“without at least some semblance of information governance, the data lake will end up being a collection of disconnected data pools or information silos all in one place…Without descriptive metadata and a mechanism to maintain it, the data lake risks turning into a data swamp.”
https://www.gartner.com/en/newsroom/press-releases/2014-07-28-gartner-says-beware-of-the-data-lake-fallacy

With a few key capabilities in place, the Data Lake starts to deliver on its promise as a Data Enabler and Data Provider:

Search and discoverability based on metadata such as tags and business terms
Trustworthiness measures and quality scores to give confidence in accuracy and timeliness
Information on data owners and publishers to indicate provisioning and access

Image Source: Author showing the Data Catalog with core functionalities to enable Governance

The experience of finding relevant, quality data can be likened to a shopping or self-service experience built on the capabilities above. For that to happen, however, Governance needs to be in place. The essential components are:

Business Glossary or Data Dictionary

A well maintained Business Glossary is a vital communication tool providing semantic translation, removing confusion and providing a common language. Some of the key requirements for a Glossary are:

Ability to define rich glossary vocabularies using the natural terminology (technical terms and/or business terms).
Ability to map assets to glossary term(s).
Ability to organize these terms by categories.
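
The three requirements above can be sketched as a small in-memory model. This is only an illustration under assumptions: the class names, the term definitions and the asset identifier `warehouse.auction.bids` are hypothetical and do not come from any particular glossary product.

```python
from __future__ import annotations

from dataclasses import dataclass, field


@dataclass
class GlossaryTerm:
    name: str
    definition: str
    categories: set[str] = field(default_factory=set)
    assets: set[str] = field(default_factory=set)  # identifiers of mapped data assets


class Glossary:
    """Hypothetical in-memory glossary: define terms, categorize them, map assets."""

    def __init__(self) -> None:
        self._terms: dict[str, GlossaryTerm] = {}

    def define(self, name: str, definition: str, categories=()) -> GlossaryTerm:
        term = GlossaryTerm(name, definition, set(categories))
        self._terms[name] = term
        return term

    def map_asset(self, asset_id: str, term_name: str) -> None:
        # Raises KeyError if the term has not been defined yet.
        self._terms[term_name].assets.add(asset_id)

    def terms_in_category(self, category: str) -> list[str]:
        return sorted(t.name for t in self._terms.values() if category in t.categories)

    def terms_for_asset(self, asset_id: str) -> list[str]:
        return sorted(t.name for t in self._terms.values() if asset_id in t.assets)


# Example terms and definitions are made up for illustration.
glossary = Glossary()
glossary.define("Bid Price", "Amount an advertiser offers for one impression.", ["Auction/Bid"])
glossary.define("Net Revenue", "Revenue after publisher payouts.", ["Revenue"])
glossary.map_asset("warehouse.auction.bids", "Bid Price")
```

A real glossary would also need versioning, ownership and approval workflows; the sketch only shows how terms, categories and asset mappings relate to each other.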

Data Domains or Subject Matter Areas

These are business-specific domains. In Online Advertising, for example, these could be Monetization, Pricing, Auction/Bid, Revenue and more. They map directly to business units and drive the agenda for analysis. Associated with each domain are:

Data Owners & Stewards
Policy and Standards
Business Glossary / Data Dictionary
Data Catalogs, Data Quality Scores
Business Processes, Systems & Applications

Data Catalog

As per Gartner, “A data catalog maintains an inventory of data assets through the discovery, description, and organization of datasets. The catalog provides context to enable data analysts, data scientists, data stewards, and other data consumers to find and understand a relevant dataset for the purpose of extracting business value.”

A well-maintained Data Catalog enables self-service by empowering users to discover, shop for and use reliable data. When combined with tools such as Search, Data Quality, Data Lineage, Glossary and Classification, the catalog becomes a core vehicle for Data Democratization.

Some of the key requirements for Catalog Metadata are:

Pre-defined types for various sources of metadata
Ability to define new types for the metadata to be managed
APIs to allow easier integration
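
As a rough sketch of the first two requirements, a catalog can ship with pre-defined metadata types and let teams register their own. The names `EntityType`, `TypeRegistry` and the `ml_feature` type below are assumptions made for illustration, not the API of any specific catalog.

```python
from __future__ import annotations

from dataclasses import dataclass


@dataclass(frozen=True)
class EntityType:
    name: str
    attributes: tuple[str, ...]  # attribute names every entity of this type must carry


class TypeRegistry:
    """Hypothetical type system: pre-defined types plus user-defined extensions."""

    def __init__(self, predefined: list[EntityType]) -> None:
        self._types = {t.name: t for t in predefined}

    def register(self, entity_type: EntityType) -> None:
        if entity_type.name in self._types:
            raise ValueError(f"type {entity_type.name!r} already exists")
        self._types[entity_type.name] = entity_type

    def create_entity(self, type_name: str, **attrs: str) -> dict:
        expected = set(self._types[type_name].attributes)
        given = set(attrs)
        if expected != given:
            raise ValueError(
                f"missing={sorted(expected - given)} unknown={sorted(given - expected)}"
            )
        return {"typeName": type_name, "attributes": dict(attrs)}


# Pre-defined types that the catalog would ship with (illustrative).
registry = TypeRegistry([
    EntityType("table", ("name", "owner")),
    EntityType("column", ("name", "data_type")),
])

# A new, user-defined type for metadata the catalog did not anticipate.
registry.register(EntityType("ml_feature", ("name", "source_table")))
feature = registry.create_entity("ml_feature", name="ctr_7d", source_table="auction_bids")
```

Validating attributes at creation time keeps the metadata consistent enough to be searched and integrated through APIs later.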

For Search and Discoverability:

Intuitive UI to search entities by type, classification, attribute value or free-text
APIs to search by complex criteria
SQL-like query language, a Domain Specific Language (DSL), to search entities
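
To make the search criteria concrete, here is a minimal sketch of filtering catalog entities by type, classification, attribute value or free text. The entity layout and the sample records are assumptions for illustration; a real catalog would run such queries against an index instead of scanning a list.

```python
from __future__ import annotations

# Hypothetical catalog entries; names, owners and tags are made up.
ENTITIES = [
    {"type": "table", "classifications": {"PII"},
     "attributes": {"name": "user_profiles", "owner": "identity-team"}},
    {"type": "table", "classifications": set(),
     "attributes": {"name": "auction_bids", "owner": "ads-team"}},
    {"type": "column", "classifications": {"PII"},
     "attributes": {"name": "email", "table": "user_profiles"}},
]


def search(entities, *, type_name=None, classification=None, attributes=None, text=None):
    """Return entities matching every criterion that was supplied."""
    results = []
    for entity in entities:
        if type_name is not None and entity["type"] != type_name:
            continue
        if classification is not None and classification not in entity["classifications"]:
            continue
        if attributes and any(
            entity["attributes"].get(key) != value for key, value in attributes.items()
        ):
            continue
        if text is not None:
            # Free-text search: case-insensitive substring match over attribute values.
            haystack = " ".join(str(v) for v in entity["attributes"].values()).lower()
            if text.lower() not in haystack:
                continue
        results.append(entity)
    return results


pii_tables = search(ENTITIES, type_name="table", classification="PII")
ads_assets = search(ENTITIES, attributes={"owner": "ads-team"})
bid_matches = search(ENTITIES, text="bids")
```

A DSL layered on top would parse a query string into the same kind of criteria before evaluating them.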

One example of an open source platform that covers these requirements is Apache Atlas. It provides a rich set of services for Metadata, Discovery, Classification and Lineage. Where cloud providers offer their own Data Catalog, integrations can be built between that Catalog service and Atlas to serve the broader Data Governance function.

Image Source: Author showing Apache Atlas architecture and integration with different sources

In a data lake environment with a variety of platforms and technologies in use for ETL, ELT and Data Access, a central data catalog is more efficient, seamless and scalable. Apache Atlas realizes that pattern by integrating with sources such as Apache Hive, Sqoop, Spark and Presto.

On AWS, EMR, Athena and Redshift connect with AWS Glue; on GCP, Dataproc, Dataflow and BigQuery map to Google Data Catalog; on Azure, Azure Data Factory and Azure Synapse map to Azure Purview. When these are further integrated with a Global / Enterprise Data Catalog, it is possible to get a holistic view of the data assets across the Enterprise.

Platforms like Collibra and Alation can be leveraged for Enterprise-level governance requirements, as they come bundled with connectors, APIs, hooks, etc. to plug into a wide array of sources.
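
The service-to-catalog mapping described here can be written down as a simple lookup, with an enterprise roll-up on top. This is a sketch only: the function name `enterprise_view` and the sample asset names are hypothetical, and a real integration would go through each catalog's own connectors.

```python
from __future__ import annotations

# Mapping taken from the paragraph above: data service -> provider catalog.
CATALOG_BY_SERVICE = {
    "EMR": "AWS Glue",
    "Athena": "AWS Glue",
    "Redshift": "AWS Glue",
    "Dataproc": "Google Data Catalog",
    "Dataflow": "Google Data Catalog",
    "BigQuery": "Google Data Catalog",
    "Azure Data Factory": "Azure Purview",
    "Azure Synapse": "Azure Purview",
}


def enterprise_view(assets: list[tuple[str, str]]) -> dict[str, list[str]]:
    """Group (service, asset_name) pairs by the provider catalog that holds them."""
    view: dict[str, list[str]] = {}
    for service, asset_name in assets:
        view.setdefault(CATALOG_BY_SERVICE[service], []).append(asset_name)
    return view


# Asset names are invented for the example.
view = enterprise_view([
    ("Athena", "clicks"),
    ("BigQuery", "bids"),
    ("Redshift", "revenue"),
])
```

An Enterprise Data Catalog would hold a view like this across all providers, so that a user can find an asset without knowing which cloud it lives in.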

Image Source: Author showing the Integration points for Enterprise Data Governance

Security and Access

An essential component of the Data Governance service is the integration of the Data Catalog with the Security framework. The key abstractions of this framework are Policy, Enforcement and Monitoring. In a single-cloud deployment, this translates to applying the cloud provider’s implementations of these abstractions. The implementation could range from UI-driven to automated and API-driven, depending on the maturity level of the organization.

Image Source: Author showing the abstractions of the Security Layer

In a multi-cloud deployment, a different approach is needed to achieve the same objective. From an architectural perspective, the above abstractions map to services with plugins for the different providers. As an example, for Monitoring there could be a unified Monitoring layer such as Stackdriver that aggregates information from all the providers’ monitoring services. These layers would also need to allow viewing, editing and reporting of the policies and rules. As this involves significant effort, one option is to consider SaaS and PaaS providers in this space that offer pre-built services and integrations with the various cloud platforms.
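
One way to picture the plugin architecture is a common interface for Policy, Enforcement and Monitoring that each provider implements. The sketch below is an assumption-laden illustration: `ProviderPlugin`, `InMemoryProvider` and `GovernanceSecurityLayer` are hypothetical names, and the in-memory provider stands in for real plugins that would call each cloud's own access-control and audit services.

```python
from __future__ import annotations

from abc import ABC, abstractmethod
from dataclasses import dataclass


@dataclass(frozen=True)
class Policy:
    principal: str
    resource: str
    action: str  # for example "read"


class ProviderPlugin(ABC):
    """Interface each cloud provider plugin would implement (hypothetical)."""

    name: str

    @abstractmethod
    def apply_policy(self, policy: Policy) -> None: ...

    @abstractmethod
    def is_allowed(self, principal: str, resource: str, action: str) -> bool: ...

    @abstractmethod
    def audit_events(self) -> list[str]: ...


class InMemoryProvider(ProviderPlugin):
    """Stand-in for a real cloud; keeps policies and audit events in memory."""

    def __init__(self, name: str) -> None:
        self.name = name
        self._policies: set[Policy] = set()
        self._events: list[str] = []

    def apply_policy(self, policy: Policy) -> None:
        self._policies.add(policy)

    def is_allowed(self, principal: str, resource: str, action: str) -> bool:
        allowed = Policy(principal, resource, action) in self._policies
        outcome = "allow" if allowed else "deny"
        self._events.append(f"{self.name}: {principal} {action} {resource} -> {outcome}")
        return allowed

    def audit_events(self) -> list[str]:
        return list(self._events)


class GovernanceSecurityLayer:
    """Applies policies to every provider and aggregates their audit logs."""

    def __init__(self, plugins: list[ProviderPlugin]) -> None:
        self._plugins = {p.name: p for p in plugins}

    def apply_everywhere(self, policy: Policy) -> None:
        for plugin in self._plugins.values():
            plugin.apply_policy(policy)

    def check(self, provider: str, principal: str, resource: str, action: str) -> bool:
        return self._plugins[provider].is_allowed(principal, resource, action)

    def unified_audit_log(self) -> list[str]:
        return [e for p in self._plugins.values() for e in p.audit_events()]


layer = GovernanceSecurityLayer([InMemoryProvider("aws"), InMemoryProvider("gcp")])
layer.apply_everywhere(Policy("analyst", "auction_bids", "read"))
can_read = layer.check("aws", "analyst", "auction_bids", "read")
can_write = layer.check("gcp", "analyst", "auction_bids", "write")
```

Keeping the policy definition provider-neutral is the design choice that matters here: the same rule is written once and each plugin translates it into its provider's native controls.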

In summary, we looked at the bigger picture of Data Governance and took a deeper dive into the Catalog, Metadata, Search and Discoverability, and Security pillars to examine the key use cases and their realization. We extended the view to the larger enterprise to understand how cloud platform data deployments integrate Catalog services to provide a broader Governance function. In future articles, I will tackle topics such as Data Quality and Data Lineage and explore their use cases and functionality. I welcome your comments and feedback as we continue this journey.


Written by Mukul Sood
