Data governance includes people, processes, and technology. Together, these pillars enable organizations to validate and manage data across dimensions such as:
- Data management, including data and pipeline lifecycle management and master data management.
- Data protection, spanning data access management, data masking and encryption, and audit and compliance.
- Data discoverability, including data cataloging, data quality assurance, and data lineage registration and administration.
- Data accountability, including data user identification and policy management requirements.
Investing in people drives the desired cultural transformation, and well-defined processes increase operational effectiveness and efficiency. The technology pillar, however, is the critical enabler: it is how people interact with data, and how organizations truly govern their data initiatives.
Financial services organizations face particularly stringent data governance requirements around security, regulatory compliance, and general robustness. Once people are aligned and processes are defined, the challenge turns to technology: solutions should be flexible enough to complement existing governance processes, and cohesive across data assets, to help make data management simpler.
In the following sections, we start with standard requirements for data governance implementations in financial services and map them to Google Cloud services, open-source resources, and third-party offerings. We then share an architecture capable of supporting the entire data lifecycle, based on our experience implementing data governance solutions with world-class financial services organizations.
Data management
Looking first at the data management dimension, we have compiled some of the most common requirements, along with the relevant Google Cloud services and capabilities from the technology perspective.
| Data Management Requirements | Services & Capabilities |
| --- | --- |
| **Data and pipelines lifecycle management**<br>- Batch ingestion: data pipeline management, scheduling, and data pipeline processing logging<br>- Streaming pipelines: metadata<br>- Data lifecycle management<br>- Operational metadata, including both state and statistical metadata<br>- A comprehensive end-to-end data platform | - GCS Object Lifecycle<br>- BigQuery data lifecycle<br>- Data Fusion pipeline lifecycle management, orchestration, coordination, and metadata management<br>- Dataplex intelligent data lifecycle management automation<br>- Cloud Logging, Cloud Monitoring<br>- Informatica Axon Data Governance |
| **Compliance**<br>- Facilitate regulatory compliance requirements | - Easily expandable to help comply with CCPA, HIPAA, PCI, SOX, and GDPR through security controls implemented with IAM, CMEKs, BigQuery column-level access control, BigQuery table ACLs, data masking, authorized views, DLP PII data identification, and policy tags<br>- DCAM data and analytics assessment framework<br>- CDMC best-practice assessment and certification |
| **Master data management**<br>- Duplicate suspect processing rules<br>- Solution and department scope | - Enterprise Knowledge Graph<br>- KG entity resolution/reconciliation and financial crime record matching (MDM + ML)<br>- Tamr cloud-native master data management |
| **Site reliability**<br>- Data pipeline SLAs<br>- Data-at-rest SLAs | - SLAs applied to data pipelines<br>- SLAs applied to services managing data<br>- DR strategies for data |
Registering, creating, and scheduling data pipelines is a recurring challenge that organizations face. Similarly, data lifecycle management is a key part of a comprehensive data governance strategy.
This is where Google Cloud can help, offering multiple data processing engines and data storage options tailored to each need, all integrated so that orchestration and cataloging stay simple.
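As an illustration of what these lifecycle controls can look like in practice, here is a minimal Python sketch that applies a GCS object lifecycle rule and a BigQuery table expiration. The bucket, dataset, and table names are hypothetical, and the retention periods are examples rather than recommendations.

```python
# A minimal sketch of data lifecycle controls, assuming hypothetical
# resource names and example retention periods.
from datetime import datetime, timedelta, timezone

from google.cloud import bigquery, storage

# GCS: move raw objects to Coldline after 90 days, delete them after ~7 years.
storage_client = storage.Client()
bucket = storage_client.get_bucket("example-finserv-raw-zone")  # hypothetical
bucket.add_lifecycle_set_storage_class_rule("COLDLINE", age=90)
bucket.add_lifecycle_delete_rule(age=7 * 365)
bucket.patch()  # persist the lifecycle rules on the bucket

# BigQuery: let a staging table expire automatically after 30 days.
bq_client = bigquery.Client()
table = bq_client.get_table("example_dataset.staging_trades")  # hypothetical
table.expires = datetime.now(timezone.utc) + timedelta(days=30)
bq_client.update_table(table, ["expires"])
```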
Data protection
Financial organizations demand world-class data protection services and capabilities to support their defined internal processes and help meet regulatory compliance requirements.
| Data Protection Requirements | Services & Capabilities |
| --- | --- |
| **Data access management**<br>- Definition of access policies<br>- Multi-cloud approval workflow integration* | - Access Approval<br>- IAM and ACLs, fine-grained GCS access<br>- Row-level and column-level permissions<br>- BigQuery security<br>- Hierarchical resources and policies<br>- Users, authentication, security (2FA), authorization<br>- Resources, separation boundaries, organization policies, billing and quota, networking, monitoring<br>- Event Threat Detection<br>- Multi-cloud approval workflow by third party (Collibra)* |
| **Data audit and compliance**<br>- Operational metadata log capture<br>- Failing-process alerting and root-cause identification | - Cloud Audit Logs<br>- Security Command Center<br>- Access Transparency and Access Approval<br>- Cloud Logging<br>- Collibra audit logging |
| **Security health**<br>- Data vulnerability identification<br>- Security health checks | - Security Health Analytics |
| **Data masking and encryption**<br>- Storage-level encryption metadata<br>- Application-level encryption metadata<br>- PII data identification and tagging | - Encryption at rest, encryption in transit, KMS<br>- Cloud DLP transformations, de-identification |
Access management, along with data and pipeline auditing, is a common requirement that should be handled consistently across all data assets. These security requirements are usually supported by security health checks and automatic remediation processes.
Specifically for data protection, capabilities like data masking, data encryption, and PII data management should be available as an integral part of processing pipelines, and should be defined and managed as policies.
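To make this concrete, here is a minimal Cloud DLP de-identification sketch in Python that masks email addresses found in a text record. The project ID and sample value are hypothetical; a production pipeline would typically apply such transformations through centrally managed de-identification templates and policies.

```python
# A minimal sketch of PII masking with Cloud DLP; the project ID and
# sample record are hypothetical.
import google.cloud.dlp_v2

dlp = google.cloud.dlp_v2.DlpServiceClient()
parent = "projects/example-project"  # hypothetical project

item = {"value": "Customer email: jane.doe@example.com"}
inspect_config = {"info_types": [{"name": "EMAIL_ADDRESS"}]}
deidentify_config = {
    "info_type_transformations": {
        "transformations": [
            {
                "primitive_transformation": {
                    "character_mask_config": {"masking_character": "#"}
                }
            }
        ]
    }
}

response = dlp.deidentify_content(
    request={
        "parent": parent,
        "inspect_config": inspect_config,
        "deidentify_config": deidentify_config,
        "item": item,
    }
)
print(response.item.value)  # "Customer email: ####################"
```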
Data discoverability
Data describes what an organization does and how it relates to its users, competitors, and regulators. This is why data discoverability capabilities are crucial for financial organizations.
| Data Discoverability Requirements | Services & Capabilities |
| --- | --- |
| **Data cataloging**<br>- Data catalog storage<br>- Metadata tag association with fields<br>- Data classification metadata registration<br>- Schema version control<br>- Schema definition before data loading | - Data Catalog<br>- Column-level tags<br>- Dataplex logical aggregations (lakes, zones, and assets)<br>- DLP<br>- Collibra Catalog<br>- Collibra asset version control<br>- Collibra asset type creation and asset pre-registration<br>- Alation Data Catalog<br>- Informatica Enterprise Data Catalog |
| **Data quality**<br>- On-ingestion data quality rule definition (such as regex validations for each column)<br>- Issue remediation lifecycle management | - BigQuery DQ<br>- Dataplex<br>- Data quality with Dataprep<br>- Collibra DQ<br>- Alation Data Quality<br>- CloudDQ declarative data quality validation (CLI)*<br>- Informatica Data Quality |
| **Data lineage**<br>- Storage- and attribute-level data lineage<br>- Multi-cloud/on-premises lineage | - Cloud Data Fusion data lineage: understand the flow, granular visibility into the flow of data, operational view, openness to share lineage<br>- Data Catalog and BigQuery<br>- Collibra lineage: multi-cloud/on-premises management<br>- Alation Data Lineage |
| **Data classification**<br>- Data discovery and data classification metadata registration | - DLP discovery and classification<br>- 90+ built-in classifiers, including PII<br>- Custom classifiers |
A data catalog is the foundation on which a large part of a data governance strategy is built. You need automatic classification options and data lineage registration and administration capabilities to make data discoverable. Dataplex is a fully managed data discovery and metadata management service that offers unified discovery of all data assets spread across multiple storage targets. Dataplex empowers users to annotate business metadata, providing the necessary data governance foundation within Google Cloud and supplying metadata that can later be integrated with external metadata by a multi-cloud or enterprise-level catalog. The Collibra Catalog is an example of an enterprise data catalog on Google Cloud that complements Dataplex with enterprise functionality, such as an operating model covering the business and logical layers of governance, federation, and the ability to catalog across multi-cloud and on-premises environments.
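As a small illustration of catalog-driven discoverability, the sketch below searches Data Catalog for assets carrying a PII tag. The project ID and the tag query are hypothetical and depend on the tag templates your organization has defined.

```python
# A minimal Data Catalog search sketch; project ID and tag query are
# hypothetical and depend on your own tag templates.
from google.cloud import datacatalog_v1

client = datacatalog_v1.DataCatalogClient()
scope = datacatalog_v1.SearchCatalogRequest.Scope(
    include_project_ids=["example-project"]  # hypothetical project
)

# Find catalog entries tagged with a (hypothetical) pii=true tag field.
results = client.search_catalog(
    request={"scope": scope, "query": "tag:pii=true"}
)
for result in results:
    print(result.linked_resource)
```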
Data quality assurance and automation is the second foundation of data discoverability. To help with that effort, Dataprep is another tool for assessing, remediating, and validating data, and it can be used in conjunction with customized data quality libraries like the Cloud Data Quality Engine (CloudDQ), a declarative and scalable data quality validation command-line interface. Collibra DQ is another data quality assurance tool; it uses machine learning to identify data quality issues, recommend data quality rules, and enable enhanced discoverability.
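For a sense of what an on-ingestion rule can look like, the following sketch runs a regex validation over a hypothetical BigQuery trades table, counting rows whose ISIN column fails the expected format. The dataset, table, and column names are illustrative only.

```python
# An illustrative on-ingestion data quality check: count rows whose ISIN
# column fails a regex validation. Table and column names are hypothetical.
from google.cloud import bigquery

client = bigquery.Client()
query = """
    SELECT
      COUNTIF(NOT REGEXP_CONTAINS(isin, r'^[A-Z]{2}[A-Z0-9]{9}[0-9]$'))
        AS bad_isin_rows,
      COUNT(*) AS total_rows
    FROM `example_dataset.trades`
"""
row = next(iter(client.query(query).result()))
print(f"{row.bad_isin_rows} of {row.total_rows} rows failed the ISIN check")
```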
Data accountability
Identifying data owners, controllers, stewards, or users, and effectively managing the related metadata, provides organizations with a way to ensure trusted and secure use of the data. Here we have the most commonly identified data accountability requirements and some tools and services you can use to meet them.
| Data Accountability Requirements | Services & Capabilities |
| --- | --- |
| **Data user identification**<br>- Data owner registration linked to datasets<br>- Data steward registration linked to datasets<br>- Role-based data usage logging for users | - Dataplex<br>- Data Catalog<br>- Analytics Hub<br>- Collibra data stewardship<br>- Alation data stewardship |
| **Policies management**<br>- Domain-based policy management<br>- Column-level policy management | - Cloud DLP<br>- Dataplex<br>- Policy tags<br>- BigQuery column-level security<br>- Collibra Policy Management |
| **Domain-based accountability**<br>- Governed data sharing | - IAM and ACL role-based access<br>- Analytics Hub |
Having a centralized identity and access management solution across the data landscape is a key accelerator for defining a data security strategy. Core capabilities should include user identification, role- and domain-based access policy management, and policy-managed data access authorization workflows.
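As one concrete building block, the sketch below grants a data steward group read access to a BigQuery dataset. The dataset and group names are hypothetical; in practice, such bindings would be derived from centrally managed, domain-based policies rather than applied ad hoc.

```python
# A minimal sketch of role-based dataset access in BigQuery; dataset and
# group names are hypothetical.
from google.cloud import bigquery

client = bigquery.Client()
dataset = client.get_dataset("example_dataset")  # hypothetical dataset

entries = list(dataset.access_entries)
entries.append(
    bigquery.AccessEntry(
        role="READER",
        entity_type="groupByEmail",
        entity_id="risk-data-stewards@example.com",  # hypothetical group
    )
)
dataset.access_entries = entries
client.update_dataset(dataset, ["access_entries"])
```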
Data governance building blocks to meet industry standards
Given these capabilities, we provide a reference architecture for a multi-cloud and centralized governance environment that enables a financial services organization to meet its requirements. While here we focus on the technology pillar of data governance, it is essential that people and processes are also aligned and well-defined.
The following architecture is not intended to cover each and every requirement presented above; rather, it provides the core building blocks of a data governance implementation that meets industry standards, as far as the technology pillar is concerned, at the time of writing this blog.