Cloud

2022 12 05

Azure – Generally available: Dapr secrets API support

You can now use Dapr secrets APIs from your Azure Container Apps and leverage Dapr secret store component references when creating other components in the environment.

Read More for the details.

2022 12 05

Azure – Public preview: Azure Container Apps Azure Monitor integration

Azure, Cloud Azure

Azure Container Apps now support using Azure Monitor to send your logs to additional destinations.

Read More for the details.

2022 12 05

Azure – Generally available: Dapr support for managed identity in Azure Container Apps

Azure, Cloud Azure

Container Apps now provides support for using Dapr with Managed Identity.

Read More for the details.

2022 12 05

GCP – Implementing IAM access control as code with HashiCorp Terraform

Cloud, Google Cloud gcp

Today, digital transformation requires security transformation. Identity and Access Management (IAM) can be used as the first line of defense in your Google Cloud security strategy. Identity and Access Management (IAM) is a collection of tools that allows administrators to define who can do what on resources in a Google Cloud account. Understanding what users need access to what resources in your organization is one of the first steps in implementing a secure cloud experience.

IAM goes far beyond users and groups. Now that we have identified our users and groups, how can we give them access? Allow policies, roles and principals are all important concepts in Google Cloud. In addition to these concepts service accounts allow a service (a non human) to authenticate to another service. Got a workload running outside of Google Cloud? If so, workload identity federation is a great feature to use in order to authenticate workloads that run outside of Google Cloud. Set compliance and guardrails with organization policies.

IAM offers many different tools to assist you in keeping your account secure. So now, how can we implement and keep track of these tools and concepts? Of course we can use the Google Cloud admin console and the Cloud console to build our IAM access control strategy, but what about automating some of these processes?

Infrastructure as code (IAC) is pretty common among operations teams. Products like HashiCorp Terraform enable IAC and allow you to use text based files to automate provisioning and setting up your infrastructure. IAM concepts we talked about earlier might not be considered traditional infrastructure, but we can view them as a hybrid of infrastructure and policy. We can use Terraform for more than just infrastructure as code; we can also use it to implement account access controls.

Why would you want to use Terraform to implement access controls in your Google Cloud account?

Speed. Terraform provides a level of automation. Being able to describe your access controls in a code based format allows you to programmatically interact with your Google Cloud account using API calls being made by Terraform to your Google cloud account. This can speed up development/implementation time.Integration. Being that we are using APIs on the backend, we can now integrate building certain access controls into new or existing pipelines.Version control. Because we are using Terraform .tf configuration files, we can upload our code to a source code repository. I can use git to keep track of all the changes and/or different versions of our code. Collaboration. Because we can store our code in a source code repository this enables our access controls to be shared across the team. Making use of pull requests allows your team to increase knowledge sharing.Consistency. Because we have our access controls defined in code we can enforce best practices using modules in Terraform. Modules allow you to reuse code in various configurations which further ensures consistency and speeds up development time.

IAM basics

Let’s briefly look at some basic components of IAM, which make up the foundation of any IAM strategy.

Roles

A role is a collection of individual permissions. Permissions can be looked at as “things I can do with a service”. For example with the Cloud Run Invokerrole I can run.jobs.run and run.routes.invoke.

Predefined roles are roles that Google creates to allow you to do certain things based on responsibilities. Using predefined roles will help limit your blast radius, which will in turn help strengthen your access control strategy.

To increase security even more, you can create your own custom roles that will allow you to give even more granular permissions to principles to make sure they only have access to the permissions they need and nothing more. This is called theprinciple of least privilege and it is access control best practice.

Role Binding

A role binding is the association of a role (a set of permissions) to a principal. This will give a principal access to whatever permissions makeup that role. We can take this a step further with allow policies. An allow policy is a collection of role bindings that bind one or more principals to individual roles.

Principals

A principal can be thought of as an entity that would need access to resources. You can give the principal access to resources through permissions which the principal can be assigned through a role binding.

A principal can be a Google Account, a service account, a Google group, or a Google Workspace account or Cloud Identity domain. Each principal has its own email address which can be used as an identifier when you need to assign permissions to that principal.

Hierarchical structure

Let’s take a look athierarchical structurein Google Cloud. In Google Cloud this hierarchical structure does two things.

Provide a hierarchy of ownershipProvide attach points and inheritance

What does this mean? It means that resources can be associated with a parent. For example, I can have a folder that represents the Devops team. Under that folder I can have a project that will then have resources attached to it. You can see from this progression that the project’s direct ancestor is the Devops folder (which represents the Devops department). The resources would then have a direct ancestor which would be the project. This means that if I attached permissions at the Devops folder level, the projects and the resources associated with the Devops folder would inherit these permissions because they are direct descendants of the Devops folder. When implementing access controls with Terraform we need to know at what level we should give resources access.

Organizational policies

Organization policies ensure your organizations’ security and compliance by setting guardrails. Organizational policies allow you to enforce constraints which specify what resource configurations are allowed within an organization. Let’s see how constraints work.

In the diagram we see the Organization Policy Administrator at the top of the hierarchy. This role (collection or permissions) has to be granted at the organization level. Next we see that because the Organization Policy Admin has these specific set of permissions they are able to define an organizational policy. This Policy consists of a constraint also known as restrictions. This constraint is the blueprint for your organization Policy. Next, the policy is set on a resource hierarchy node. For the sake of argument, let’s say it’s set at the folder level. By default, the policy is enforced on a specific GCP service. This policy is then inherited to all resources under that folder.

Building with Terraform

Now let’s take a look at how we could build a policy with code:

Resource- Also known as a resource block, tells Terraform what you want to build. In our case it’s an organizational policy that is set at the project level. The name “auditlogging_policy” is the name Terraform knows this resource by (in some cases we can target specific resources or user interpolation).

Project- Id of the project to apply policy to.

Constraint- The name of the Constraint the Policy is referencing. You can find a list of constraints here.

Boolean_policy-Value that enforces the policy.

Service Accounts

A service account can be looked at as both a principal and a resource. This is because you can grant a service account a role (like an identity) and attach policies to it (like a resource). Your company should use service accounts if you have services in Google Cloud that need to talk to each other. This will allow you to authenticate and make API calls securely from service to service.

Resource google_service_account – Creates a service account. Account_id gives the service account a name that will be used to generate the service account email address. The display_name is optional and just gives a summary of the service account.

Resource google_project_iam_member – Adds permission to a service account.

Resource google_service_account_iam_member -Grants access for a user (referenced as member) to assume a service account (service_account_id) by granting the user the iam.ServiceAccountUser role (referenced as role above).

Use Case

Now we have the basics down, let’s take a look at a practical use case.

Let’s imagine we work at Big Horn Inc. Big Horn Inc. is a SaaS company. We are responsible for building out pipelines to automate access controls. We’ve been tasked with solving 2 problems:

1. The team wants to modernize some stateless applications. They want to use containers to create microservices. They want different CloudRun services to be able to talk to other services in Google Cloud. In this case we need to create some service accounts for Cloud Run. Ideally we would like this process to be automated.

2. Right now we have very broad permissions. Some principals have been assigned “basic” roles. After using the policy insights tool in Google Cloud, the team decides that some principles have too much access. We need a way to create “custom” roles to create more granular permissions to make sure the organization is following the principle of least privilege.

We can solve these issues in an automated fashion by implementing IAM with Terraform and using Cloud Build.

Wiring things up

Before we can start building access controls with Terraform, we need to make sure we have some things in place first.

Local development

After you have Terraform and gcloud installed, you will want to make sure that you have a service account that Terraform can use. Make sure that service account has all the proper permissions needed. Depending on what you want to build, some permissions will have to be given from the organizational level in order for them to be inherited at the project level (where service accounts are created). Next, let’s make sure you are using the proper authentication method. The best way to authenticate for local development is by using Application Default Credentials (ADC). With a simple setup, Terraform will be able to authenticate automatically using the credentials from your gcloud configuration.

Automated development

In the pipeline, Cloud Build will have permissions to the service account you create. This will allow Cloud Build to assume the permissions of that service account and in turn authenticate your Terraform configuration.

Pipeline

Now that we have the service account and all the proper tools in place, let’s build a pipeline. As you can see below, I am using a yaml file in order to automatically build a pipeline in Cloud Build. Each step in the pipeline is introduced through a Docker container. My pipeline does some standard things with Terraform.

Terraform init

Terraform fmt

Terraform plan

Terraform apply

Conclusion

Securing access in Google Cloud is a great first line of defense to make sure that your account is secure. Understanding IAM and its core features is the foundation on which you will build your access controls. Automating access controls can save your company time, money, and give your organization the agility it needs to make changes in a structured way when the need arises. You can create a free account at cloud.google.com. Don’t know where to get started with IAM? We’ve got you covered. Try this IAM tutorial to hit the ground running.

Read More for the details.

2022 12 05

AWS – Amazon SageMaker Studio now supports Fine-grained data access control with AWS Lake Formation and Amazon EMR

AWS, Cloud AWS

Amazon SageMaker Studio is a fully integrated development environment (IDE) for machine learning. Studio comes with built-in integration with Amazon EMR so that data scientists can interactively prepare data at petabyte scale using frameworks such as Apache Spark right from Studio notebooks. We’re excited to announce that SageMaker Studio now supports applying fine-grained data access control with AWS Lake Formation when accessing data through Amazon EMR.

Read More for the details.

2022 12 05

GCP – 12 no-cost ways to learn Google Cloud over the holidays

Cloud, Google Cloud gcp

The holiday season is upon us! If you are making your list and checking it twice, we’ve got a few learning gifts you can tick off the list and share with others too. For the season of giving, we’ve wrapped up some of our most popular training and certification opportunities and made them available at no-cost.

This December we’re aiming to offer something for everyone, whether you’re just getting started with cloud, or knee deep in preparing for a professional certification exam. Start with the fundamentals to gain a deeper understanding of cloud whether you’re in a business or technical role. Perhaps you’re looking to flex your data analytics and ML muscle with BigQuery and SQL, earn a Google Cloud skill badge, or enhance your technical cloud skills. Or jump into a hot topic like sustainability and learn about Google’s commitment to a clean cloud, and how to use sustainability tools. Read on to find something on your learning wishlist.

We also have a variety of learning formats to fit your needs. Complete hands-on labs, view courses and webinars, or jump into competitions like the Google Cloud Fly Cup Challenge or our most popular #GoogleClout Challenge of 2022 – and let the fun begin!

Are you ready to learn? Take a look at the training we’ve recommended below to work towards your goals as we head into the new year, with new skills, to make the most of new opportunities.

We’re giving plenty of learning gifts to choose from this month, so take your pick from the topics below

ML, AI and data analytics

Who it’s for: ML, AI and data engineers
What you’ll take away: A deeper understanding of working in BigQuery and SQL.
Level: Foundational
Start learning now:

Introduction to SQL for BigQuery and Cloud SQL – Get started with this one hour and 15 minute hands-on lab to learn fundamental SQL querying keywords, which you will run in the BigQuery console on a public dataset, and how to export subsets of a dataset into CSV files, then upload to Cloud SQL. You’ll also learn how to use Cloud SQL to create and manage databases and tables, with hands-on practice on additional SQL keywords that manipulate and edit data.

Weather Data with BigQuery – In this 45 minute lab, you’ll use BigQuery to analyze historical weather observations, and run analytics on multiple datasets.

Insights from Data with BigQuery – Earn a shareable skill badge when you complete this five hour quest. It includes interactive labs covering the basics of BigQuery, from writing SQL queries, creating and managing database tables in Cloud SQL, and querying public tables to loading sample data into BigQuery.

The Google Cloud Fly Cup Challenge – This is a three-stage competition in the sport of drone racing in the Drone Racing League (DRL). You will use DRL’s race data to predict outcomes and give performance improvement tips to pilots (these are the best drone pilots in the world!). There’s a chance to win exclusive swag, prizes, and an expenses paid trip to the DRL World Championship. Registration closes on December 31, 2022.

CI/CD

Who it’s for: Software Developers
What you’ll take away: Take part in our most popular #GoogleClout challenge of 2022! Build a simple containerized application.
Level: Fundamental
Start learning now:

#GoogleClout – CI/CD in a Google Cloud World – Flex your #GoogleClout in this cloud puzzle that challenges you in a lab format to create a Cloud Build Trigger to rebuild a containerized application hosted on a remote repository. Register it in the Artifact Registry and deploy. You’ll be scored on your results and earn a badge to share.

Preparing for Google Cloud certification

Who it’s for: Cloud engineers and architects, network and security engineers and Google Workspace administrators
What you’ll take away: Explore the breadth and scope of the domains covered in the cloud certification exams, assess your exam readiness and create a study plan.
Level: Foundational to advanced
Start learning now:

Preparing for Google Cloud certification – These courses are for Associate Cloud Engineers, Professional Cloud Architects, Professional Cloud Network Engineers, Professional Cloud Security Engineers, and Google Workspace Administrators preparing for Google Cloud certification exams. You’ll also earn a completion badge when you finish the course.

Preparing for the Cloud Architect certification exam – Join this 30 minute on-demand webinar to learn about resources to maximize your study plan, and get tips from a #GoogleCloudCertified Professional Cloud Architect.

Intro to Google Cloud for technical professionals

Who it’s for: Software Developers
What you’ll take away: Boost your Google Cloud operational and efficiency skills to drive innovation by navigating the fundamentals of compute, containers, cloud storage, virtual machines, and data and machine learning services.
Level: Foundational
Start learning now:

Getting Started with Google Cloud Fundamentals – This on-demand webinar takes a little less than three hours to complete. Navigate Compute Engine, container strategies, and cloud storage options through sessions and demos. You’ll also learn how to create VM instances, and discover Google Cloud’s big data and machine learning options.

Intro to Google Cloud for business professionals

Who it’s for: Business roles in the cloud space like HR, marketing, operations and sales
What you’ll take away: A deeper understanding of cloud computing and how Google Cloud products help achieve organizational goals.
Level: Foundational
Start learning now:

Cloud Digital Leader learning path -There are four courses in this learning path covering digital transformation, innovating with data, infrastructure and application modernization, and Google Cloud security and operations.

Preparing for the Cloud Digital Leader certification exam – In this 30 minute webinar continue your learning journey by preparing for the Google Cloud Digital Leader certification exam. The webinar covers all the resources we’ve made available to help you prepare.

Sustainability

Who it’s for: Software Developers
What you’ll take away: Learn how the cleanest cloud in the industry can help you save your cloud bill, and save the planet.
Level: Foundational
Start learning now

A Tour of Google Cloud Sustainability -Work through this one hour, hands-on lab, to explore your carbon footprint data, use the Cloud Region Picker, and reduce your cloud carbon footprint with Active Assist recommendations.

Keep connected and learning with us in 2023

Accelerate your growth on Google Cloud by joining the Innovators Program. No-cost for users of Google Cloud (including Workspace), it’s for anyone who wants to advance their personal and professional development around digital transformation, drive innovation, and solve difficult business challenges.

Continue your learning with Google Cloud in 2023 by starting an annual subscription1 with Innovators Plus benefits. Gain access to $500 in Google Cloud credits, live learning events, our entire on-demand training catalog, a certification voucher, access to special events, and other benefits.

Build your skills, reach your goals and advance your career with 12 no-cost ways to learn Google Cloud!

1. Start an annual subscription on Google Cloud Skills Boost with Innovators Plus for $299/year, subject to eligibility limitations.

Read More for the details.

2022 12 05

GCP – IT prediction: Neuro-inclusive software design will become synonymous with good design

Cloud, Google Cloud gcp

Editor’s note: This post is part of an ongoing series on IT predictions from Google Cloud experts. Check out the full list of our predictions on how IT will change in the coming years.

Prediction: By 2025, starting with neuro-inclusive design will increase user adoption by 5x in the first 2 years of production

How do you know if your product designs are enjoyable and productive? This stat from the National Institutes of Health offers a clue — they estimate that up to 20% of the world’s population is neurodistinct, with the remaining 80% being neurotypical. Together, these two groups represent the broad spectrum of neurodiversity — the different ways people experience, interpret, and process the world around them, whether they are in school, at work, or interacting with others.

When you design applications solely for a neurotypical population, you risk marginalizing, or outright excluding, one in five potential users. For example, certain noise, vibrations, and pop ups in interactive or visual features may be fine for neurotypical users, but may be distracting to those who are neurodistinct, hampering their ability to use the application. Practicing neuro-inclusive design, on the other hand, results in applications that are accessible to a wide range of cognitive and sensory styles.

At Google, we believe that to build for everyone, you have to build with everyone. To help us open our eyes to new ways we can be more inclusive, we’ve made a conscious decision to include all types of thinkers across all phases of development, including ideation, user research, testing, and marketing. And it turns out that neuro-inclusive design is good design — benefiting everyone. For example, closed captioning in Google Meet helps all of us process information better visually, and allows people to participate in meetings where a different language is used.

To get you started, here are some neuro-inclusive design principles for developers:

Design with simplicity and clarity.

Remove distractions or extra visualizations like pop-up windows.

Avoid really bright colors or using too much of a single color.

Stick with a predictable and intuitive user flow.

Be thoughtful about the vibe you are setting. Do you need music or sounds to set the tone or is it an extra element that can create distraction?

Stay away from high-pressure interactions that require a quick or immediate reaction.

For more, check out my talk from Google Cloud Next ‘22 below.

Read More for the details.

2022 12 05

GCP – Move over 2022: Predictions from Google Cloud experts that will reshape IT

Cloud, Google Cloud gcp

Here at Google Cloud, we love helping organizations imagine what is possible, work toward their biggest goals, and build the technology that brings this all together so they can convert opportunities into reality. To do this, we think it’s important to look towards the future so we can help developers and businesses learn, grow, and build the world’s most powerful systems and immersive experiences.

During the Google Cloud Next developer keynote this year, 10 of our experts shared bold predictions about what to expect from IT by 2025. With 2022 winding down, we’re highlighting our cloud technology predictions one by one, to share where we’re headed and help you prepare for the year(s) ahead.

aside_block[StructValue([(u’title’, u’Starting with neuro-inclusive design will increase user adoption by 5x in the first 2 years of production’), (u’body’, <wagtail.wagtailcore.rich_text.RichText object at 0x3e90de7b1710>), (u’btn_text’, u’Read now’), (u’href’, u’https://cloud.google.com/blog/products/application-development/neuro-inclusive-software-design-will-become-synonymous-with-good-design’), (u’image’, <GAEImage: number 1.jpg>)])]

Thank you for continuing to inspire us! We’re excited for what the future holds for our customers, partners, and developer communities, and look forward to building it together on Google Cloud. Check in on this post often, as we’ll be adding more predictions between now and the end of the year.

Read More for the details.

2022 12 05

GCP – Pub/Sub Group Kafka Connector now GA: a drop-in solution for data movement

Cloud, Google Cloud gcp

We’re excited to announce that the Pub/Sub Group Kafka Connector is now Generally Available with active support from the Google Cloud Pub/Sub team. The Connector (packaged in a single jar file) is fully open source under an Apache 2.0 license and hosted on our GitHub repository. The packaged binaries are available on GitHub and Maven Central.

The source and sink connectors packaged in the Connector jar allow you to connect your existing Apache Kafka deployment to Pub/Sub or Pub/Sub Lite in just a few steps.

Simplifying data movement

As you migrate to the cloud, it can be challenging to keep systems deployed on Google Cloud in sync with those running on-premises. Using the sink connector, you can easily relay data from an on-prem Kafka cluster to Pub/Sub or Pub/Sub Lite, allowing different Google Cloud services as well as your own applications hosted on Google Cloud to consume data at scale. For instance, you can stream Pub/Sub data straight to BigQuery, enabling analytics teams to perform their workloads on BigQuery tables.

If you have existing analytics tied to your on-prem Kafka cluster, you can easily bring any data you need from microservices deployed on Google Cloud or your favorite Google Cloud services using the source connector. This way you can have a unified view across your on-prem and Google Cloud data sources.

The Pub/Sub Group Kafka Connector is implemented using Kafka Connect, a framework for developing and deploying solutions that reliably stream data between Kafka and other systems. Using Kafka Connect opens up the rich ecosystem of connectors for use with Pub/Sub or Pub/Sub Lite. Search your favorite source or destination system on Confluent Hub.

Flexibility and scale

You can configure exactly how you want messages from Kafka to be converted to Pub/Sub messages and vice versa with the available configuration options. You can also choose your desired Kafka serialization format by specifying which key/value converters to use. For use cases where message order is important, the sink connectors transmit the Kafka record key as the Pub/Sub message `ordering_key`, allowing you to use Pub/Sub ordered delivery and ensuring compatibility with Pub/Sub Lite order guarantees. To keep the message order when sending data to Kafka using the source connector, you can set the Kafka record key as a desired field.

The Connector can also take advantage of Pub/Sub’s and Pub/Sub Lite’s high-throughput messaging capabilities and scale up or down dynamically as stream throughput requires. This is achieved by running the Kafka Connect cluster in distributed mode. In distributed mode, Kafka Connect runs multiple worker processes on separate servers, each of which can host source or sink connector tasks. Configuring the `tasks.max` setting to greater than 1 allows Kafka Connect to enable parallelism and shard relay work for a given Kafka topic across multiple tasks. As message throughput increases, Kafka Connect spawns more tasks, increasing concurrency and thereby increasing total throughput.

A better approach

Compared to existing ways of transmitting data between Kafka and Google Cloud, the connectors are a step-change.

To connect Kafka to Pub/Sub or Pub/Sub Lite, one option is to write a custom relay application to read data from the source and write to the destination system. For developers with Kafka experience who want to connect to Pub/Sub Lite, we provide a Kafka Shim Client that can make the task of consuming from and producing to a Pub/Sub Lite topic easier using the familiar Kafka API. This approach has a couple of downsides. It can take significant effort to develop and can be challenging for high-throughput use-cases since there is no out-of-the-box horizontal scaling. You’ll also need to learn to operate this custom solution from scratch and add any monitoring to ensure data is relayed smoothly. Instead there are easier options to build or deploy using existing frameworks.

Pub/Sub, Pub/Sub Lite, and Kafka all have respective I/O connectors with Apache Beam. You can write a Beam pipeline using KafkaIO to move data between a cluster Pub/Sub or Pub/Sub Lite and then run it on an execution engine like Dataflow. This requires some familiarity with the Beam programming model, writing code to create the pipeline and possibly expanding your architecture to a supported runner like Dataflow. Using the Beam programming model with Dataflow gives you the flexibility to perform transformations on streams connecting your Kafka cluster to Pub/Sub or to create complex topologies like fan-out to multiple topics. For simple data movement especially when using an existing Connect cluster, however, the connectors offer a simpler experience requiring no development and low-operational overhead.

No code is required to set up a data integration pipeline in Cloud Data Fusion between Kafka and Pub/Sub, thanks to plugins that support all three products. Like a Beam pipeline that must execute somewhere, a Data Fusion pipeline needs to execute on a Cloud Dataproc cluster. It is a valid option most suitable for Cloud-native data practitioners who prefer drag-and-drop option in a GUI and who do not manage Kafka clusters directly. If you do manage Kafka clusters already, you may prefer a native solution, i.e., deploying the connector directly into a Kafka Connect cluster between your sources/sinks and your Kafka cluster, for more direct control.

To give the Pub/Sub connector a try, head over to the how-to guide.

1. Dataflow compute cost. 2. autoscaling. 3. Cloud Data Fusion cost

Read More for the details.

2022 12 05

GCP – 4 new features of Active Assist to help automate idle resource management

Cloud, Google Cloud gcp

Unattended Project Recommender makes identifying idle cloud projects easy, and helps you mitigate associated security issues, reduce unnecessary spend and environmental impact. In order to implement a scalable and repeatable resource lifecycle management process, it’s important to have the right tools for the job. Today, we’re announcing several new capabilities that can help you make idle project remediation a part of your company’s day-to-day operations and culture:

Organization-level aggregation of recommendations for a broader view of your unattended projects

Observation period configurability to give you more control into the usage windows

See the cost impact of following our recommendations

Shareable links so that you can distribute specific recommendations to others

Let’s take a closer look at each of these new features.

1. Organization-level view of recommendations

In Google Cloud, cloud administrators that are looking to cut costs, reduce their organization’s carbon impact, or clean up lingering security issues, can now use Resource Manager to find and address unattended projects. The summary banner at the top of the page, a new Recommendations column with counts of unattended projects per organization, a new column for Carbon Emissions, and the light bulbs next to unattended projects are there to help you achieve your goals.

On the command line and API, the new Organization and Billing Account-level endpoints allow users to issue a single list command that returns all unattended projects in the organization, or billed to the same account (documentation). Security-minded administrators can also use the command line or API to sort and address the recommendations based on the values of the priority field, which prioritizes unattended projects based on the highest priority security recommendation found within the project (documentation).

2. Observation period configurability

Previously, the Unattended Project Recommender looked for 30-days’ worth of consistent usage (or lack thereof) before producing a Recommendation. This minimum_observation_period is now configurable (usage documentation) via the Recommender API for the entire organization and can be set to predefined values from 30 up to 365 days. Organizations that look for a higher degree of confidence before applying unattended project recommendations, or those that have use cases for idling projects that are used very infrequently, are encouraged to change this configuration.

3. Cost Impact

In addition to being able to prioritize unattended projects by your carbon footprint, or by the level of security risk associated with unresolved Active Assist recommendations, you can now also see via the API how much cost savings you would obtain if you remove unattended projects (documentation). This makes it easier for you to find maximum cost efficiencies to sort and focus on the unattended projects that have the most cost impact.

Here’s what a response may look like with the cost impact field, displaying that removing this project will save 120 USD.

code_block[StructValue([(u’code’, u”costProjection:rn cost:rn currencyCode: USDrn nanos: -500000000rn units: ‘-120’rn duration: 2592000s”), (u’language’, u”), (u’caption’, <wagtail.wagtailcore.rich_text.RichText object at 0x3ea2cc71f0d0>)])]

4. Making automation easier with shareable recommendations

While some cloud administrators may choose to apply recommendations in-bulk (e.g. delete all projects recommended to be removed), most cloud administrators prefer to delegate that work to the person, team or department that owns the project or is closest to it. Now recommendations that you see in Cloud Console UI come with a shareable link: you can click on the copy icon in the top right corner of the recommendation details page to copy the link to the recommendation and then share it with someone else, or even construct the link programmatically. Clicking the link takes the user to the details page of the recommendation you shared.

Get started today

We hope these new capabilities will help you implement a scalable and repeatable approach to resource lifecycle management. We recently published asolution for lifecycle management of unused projects that takes advantage of all these new capabilities, and we hope the patterns demonstrated there are useful for your approach. For further explorations, you can read our documentation, and start reviewing your organization’s unattended projects in Resource Manager or in Recommendation Hub.

We hope that you can leverage Unattended Project Recommender to improve your cloud security posture and reduce cost, and can’t wait to hear your feedback and thoughts about this feature! Please feel free to reach us at active-assist-feedback@google.com and we also invite you to sign up for our Active Assist Trusted Tester Group if you would like to get early access to the newest features as they are developed.

Read More for the details.

2022 12 05

GCP – Moneyball for the Front Office

Cloud, Google Cloud gcp

In 2002, the Oakland Athletics’ successful application of sabermetrics – which was later popularized by the film Moneyball – transformed professional baseball (and later, most other sports) by applying statistics and data science to win games. For years, sports analytics efforts were focused on player stats and game outcomes. Today, the same applied statistics have come to the front office to drive ticket sales, merchandise sales, corporate packages, and new forms of fan engagement.

Sports teams from all over the world suffered when COVID shortened seasons and emptied stadiums in 2020. This lack of in-person events has been intensified by increasing advertising costs and shifts in Millenial and Gen-Z engagement channels. Teams are looking to bounce back by using customer data to engage fans across channels via personalized campaigns.

But, where to start? Brian Shield, SVP and CTO of The Boston Red Sox, saw an opportunity to supercharge his analytics team to build a robust fan acquisition and engagement capability for the business.

Investing in Cloud Analytics

Initially, Shield considered using a customer data platform (CDP), but given he’d already invested in BigQuery as the company’s cloud data warehouse, creating a separate data silo didn’t make sense. Plus, the high engineering effort required to set up a customer data platform to launch campaigns meant the time-to-value was too slow. Shield needed a way to leverage his data in BigQuery to power high-performing marketing and sales campaigns as quickly as possible.

“Regarding CDPs, we want to control our own destiny, we want a platform that will scale with us. A CDP can be a useful tool but is not a substitution for an owned First-Party Modern Data Stack on Google Cloud. I think people are making a mistake putting all their eggs in that basket. How do you ever migrate in the future if you want to move away from a 3rd-party vendor? It’s like starting over to extract yourself. By owning the architecture, you avoid vendor lock-in. That’s what we love about the architecture we’ve built with Flywheel,” Shield said.

BigQuery and Flywheel Software Deliver Growth Faster

Flywheel Software offered an innovative, new architecture capable of enriching BigQuery data sets with ticketing data, merchandise sales data, fan engagement and other important business data sources. By connecting sales and marketing teams directly to Red Sox customer data in BigQuery, Flywheel enabled the Red Sox team to leverage its B2B and B2C data immediately , without being limited by slow ETL pipelines and inflexible data models.

For the first time, the Red Sox marketing team could create audience segments with Flywheel’s no-code builder, and launch campaigns within hours instead of weeks. Freed from constant SQL queries and audience list pulls, the data team could focus on building more sophisticated predictive models.

“We went from spending all of our time answering data requests to a self-service model with scalable data democratization via Flywheel, which allows our team to be more proactive vs reactive,” said Jon Hay, Vice President Data, Intelligence & Analytics, Boston Red Sox.

How Flywheel Software and BigQuery Work Together

Customer 360: Flywheel helped the Red Sox combine all of their data sources into a single view of their customers in BigQuery, then connected them to every marketing and sales channel via Flywheel’s platform.

Predictive Models: The Sales team deployed a fan avidity score predictive model trained on BigQuery data to more efficiently target outreach.

BigQuery Access and Data Sharing: Red Sox data can be seamlessly combined with Major League Baseball data using BigQuery’s data-sharing capabilities.

Extensible Data Visualization: Since Flywheel writes all audience data back to BigQuery, the Red Sox analytics team can perform its own performance analysis.

“As Fenway Sports Group continues to evolve, we see our data prowess and the extensible architecture we’ve built with Flywheel Software as an asset that new FSG Media & Entertainment properties can tap into to support their growth and evolution,” Shield said.

Flywheel Audience Segmentation

With Flywheel, analytics and activation live in one place, making audience design a five-minute exercise. This means the marketing team has more at-bats to find winning combinations of target audience, channel, and message.

Uplift Measurement, the New Moneyball

With each audience launch, the Red Sox marketing team automatically applies ongoing A/B experiments by segmenting audience members into treatment and control groups in-app.

This experimental design unlocks visualizations for marketers to quickly understand the impact on ticket sales, ticket scans, total spend, number of games watched via streaming services, email opens, Salesforce opportunities closed, app logins, or any other metric in BigQuery.

Brian Shield leveraged several Google Cloud services to ingest, transform and activate Red Sox data:

Ingest data sources via an Import Service running in Google Kubernetes Engine (GKE) and Cloud Storage, though you can use self-serve ETL tools like Matillion;

Transform data with data build tool (dbt) via Cloud Composer to produce transparent and reliable data pipeline models.

Activate data with Flywheel, deployed on Google Cloud and using GKE, Pub/Sub and other services to deliver audiences and personalization fields to Salesforce, Facebook, Zeta, and Iterable. (Flywheel Software supports any destination with extensive API integrations.)

“Flywheel enabled and then dramatically accelerated the Red Sox transition to a data-driven organization,” said Hay.

Winning Segments

So what kind of marketing segments were successful for the Red Sox that your team could apply? Here is a quick summary of a few key examples:

Take Advantage of Timely Buyer Patterns

Encouraging the away team’s fans to attend a game at their home stadium may seem counterproductive, but it turns out to be a great revenue opportunity. For example, target the Oakland A’s fans that live in the Boston area (via Google Ads or email) to purchase tickets when the Red Sox play the Oakland A’s.

Deeper Segmentation and Hyper-Personalization

Build an audience with deeper segmentation based on data points in your BigQuery tables to achieve higher conversion rates. One of the highest-leverage customer segments is single-game ticket buyers with a high propensity to buy season tickets. The Red Sox have an AI model that scores fans on this propensity. When a high-scoring fan purchases a ticket to a homestand game, a sales rep is immediately notified via Salesforce to call the fan and set up a VIP tour before the game, giving the rep an opportunity to upsell season ticket packages.

Deliver High Quality Leads for your Sales Team

Leverage data in BigQuery to set your sales team up for success with campaigns targeting the right buyers at the right time. One example is sending existing purchasers of suites that don’t yet have a concessions package and have visited the website recently to the CRM as warm leads for the sales team.

“Our sales team refers to the Audience Builder as the Flywheel Lead machine” —Jon Hay, Vice President Data, Intelligence & Analytics, Boston Red Sox

Transform Your “Front Office” like the Boston Red Sox in 3 Easy Steps

Today, eight other Major League Baseball teams leverage Flywheel Software to drive marketing and sales wins, along with teams across the NBA, NHL, and NASCAR. Whether your company is in retail, financial services, travel, software, or something else entirely, you can join the expanding group of companies driving sustainable growth through real-time analytics by connecting BigQuery from Google Cloud to Flywheel Software. Here’s how:

If you have Customer Data in BigQuery…

Book a Flywheel Software + BigQuery demo customized to your use cases.

Link your BigQuery tables and marketing and sales destinations to the Flywheel Software platform.

Launch your first Flywheel Software audience in less than one week.

If you are getting started with BigQuery…

Get a Data Strategy Session with a Flywheel Solutions Architect at no cost.

Use our Quick Start Program to get started with BigQuery in 4 to 8 weeks.

Launch your first Flywheel Software audience in less than one week thereafter.

Read More for the details.

2022 12 05

GCP – A closer look at Translation Hub: Enterprise translation made easy

Cloud, Google Cloud gcp

Hello! ¡Hola! 你好! नमस्ते! Bonjour! Being greeted in your own language can instantly put a smile on your face, and organizations around the world recognize that messages become more inclusive and powerful when translated into many languages.

As the Head of Product Management for Translation AI at Google Cloud, I am always looking to make our user’s lives easier when it comes to language tools—like Google Cloud’s Document Translation API, which has been used to translate over 2.5 billion documents across Google Translate and our Cloud customers.

But during my time working on translation products, I have seen enterprise enterprise users making compromises, like sacrificing cost and speed of translation for quality, or paying much more for high-quality translations and making peace with the slow turnaround times and high costs. That’s why we are excited to see momentum building behind Translation Hub, which launched earlier this year. Built to address organizations’ localization and translation challenges and delight users with an incredibly intuitive and administratively simple offering, Translation Hub makes it easy to translate content at scale—users just need to select the files to be translated, choose the output languages, and then they’re off and running.

In this article, we’ll take a closer look at Translation Hub’s powerful features, and the ways it is helping customers do more with their content.

Enterprise translation, reimagined: fast, lean, and intuitive

Translation Hub is a fully-managed, self-serve translation offering, powered by Google AI and built for the enterprise. With Translation Hub, businesses can instantaneously translate content into 135 languages with a single click, via an intuitive interface that integrates human reviews (i.e., a “human in the loop”) where required.

Translation Hub is designed to be both easy to manage and use. Each organization’s Translation Hub administrator uses the Google Cloud console to onboard and manage business users, typically by adding their email, which triggers an invite to the business user. Once a business user is added, they can sign in and start requesting translations and post-editing reviews.

With advanced features like “Glossary” and “Custom Translation Models (AutoML Translation),” Translation Hub makes it easy for enterprises to control the terminology and inject domain-specific context to meet their unique needs. Furthermore, Translation Hub ensures that organizations own their Translation Memory, for use in future translations or to leverage as a rich, human-validated source of training data for custom translation models (e.g., AutoML Translation). For example, when a user edits a phrase or segment through Translation Hub, this is immediately captured and used in the next translation for all future users who initiate translation where the same content or segment appears.

One of my favorite features is that Translation Hub preserves the layout and format of documents, saving both time and money. And while Translation Hub works with Google Drive (you can ingest and export documents to Drive) and is compatible with all our Google Workspace content sources (e.g., Slides, Docs etc.), it also supports many of the most commonly used formats, like documents and presentations created in Microsoft Office, and scanned and native PDFs.

Organizations need to be able to share the output of AI-powered translation with localization teams or agencies, for review. They need to save time by leveraging glossaries or customer machine learning (ML) models. They need these capabilities in one cohesive platform—and they need all of this to be user-friendly for all kinds of business users. It’s the difference between being able to automate translations in a few scenarios and being able to translate at scale. Google Cloud’s mission is to accelerate every organization’s ability to digitally transform, and with Translation Hub, our customers can leverage the best of AI for their content translations without sacrificing speed, quality or cost.

All of Google’s Translation Innovations within a single product

Translation Hub uses Google Cloud’s Translation API, AutoML Translation, and innovations in Neural Machine Translation Quality Prediction and Translation Memory—all technologies we’ve put to good use inside Google. For example, our localization teams have realized more than $80 million in cost avoidance and savings since 2021, using Google Cloud AutoML Translation and localization expertise.

“In just three months of using Translation Hub and AutoML translation models, we saw our translated page count go up by 700% and translation cost reduced by 90%,” said Murali Nathan, digital innovation and employee experience lead, at materials science company Avery Dennison. “Beyond numbers, Google’s enterprise translation technology is driving a feeling of inclusion among our employees. Every Avery Dennison employee has access to on-demand, general purpose, and company-specific translations. English language fluency is no longer a barrier, and our employees are beginning to broadly express themselves right in their native language.”

To get started with Translation Hub, visit our solution page, and for a deep dive, check out this Google Next ‘22 session, including more details about Avery Dennison’s use case.

Read More for the details.

2022 12 05

GCP – Digital and cloud skills programs for the Future Workforce launch in Australia and Thailand

Cloud, Google Cloud gcp

Countries across the Asia Pacific region continue to launch programs to develop digital and cloud skills for the future workforce. In Australia and Thailand, Google is working with governments, not-for-profit organizations, and universities to bring development programs to enable the next generation of workers with the right digital skills. Learners at universities, and individuals supported by specialist not-for-profit organizations will be able to learn new skills in a range of growing future careers, such as IT Support, Cloud engineering, Project Management, Digital Marketing or User Experience design.

In Thailand, the Samart Skills program launched at the beginning of October, with support from the Ministry of Digital Economy and Society (MDES), the Ministry of Higher Education, Science, Research and Innovation, the Office of the Vocational Education Commission, the Digital Economy Promotion Agency (depa) and leading education institutions across Thailand. According to AlphaBeta’s report published last year, Thailand’s digital transformation could generate up to THB2.5 trillion (USD79.5 billion) in annual economic value by 2030. Around 78% of business leaders in Thailand put digitalization as a key strategy in 2021, while the World Economic Forum’s “Future of Jobs Report 2020” showed that only 55% of workers in Thailand are literate in the required digital skills for future work. There is an urgent need to address Thailand’s digital skills gap.

In addition to the 21,000 scholarships for Google Career Certificates, we are partnering with universities to offer the Google Cloud Computing Foundations curriculum to equip undergraduates with technical proficiency in cloud computing, where they will earn a skills badge. This is an important first step in building the skills needed by employers and the journey to professional certification alongside their academic qualifications.

Digital Economy and Society Minister Chaiwut Thanakamanusorn said “By 2030, the total demand for digital talent in Thailand will exceed 1 million workers. So there is an urgent need to build up a digital workforce to match the demand. The Ministry of Digital Economy and Society is pleased to work with the public sector, educational institutions and the private sector including Google Thailand to bridge this talent gap and upskill Thais to get access to high quality in-demand jobs.”

In Australia, the Digital Future Initiative was given a boost with the announcement of 10,000 scholarships for Google’s Career Certificate program focused on women and First Nations Australians. Using Cloud skills programs plus Google Career Certificates, the program is designed to lead to improved employment opportunities. A consortium of Australia’s leading employers that includes Australia Post, Woolworths Group, Canva, Optus and IAG is partnering on this project and planning to hire course graduates. Brooke Hayne, People Director at Woolworths, noted: “We’re incredibly excited by the potential for Google’s Career Certificates to unearth great talent by fast-tracking their path to a career change or that first job that helps them get their foot in the door.”

As the demand for digital skills increases, we’re pleased to continue to develop new initiatives with governments and educational institutions, with the shared goal of preparing for the digital economy of the future.

Read More for the details.

2022 12 02

AWS – AWS Firewall Manager now supports FortiGate Cloud-Native Firewall

AWS, Cloud AWS

Starting today, AWS Firewall Manager enables you to centrally deploy and monitor FortiGate Cloud-Native Firewall (CNF) across all AWS virtual private clouds (VPCs) in your AWS organization. With this release, customers now have a single firewall management solution to deploy and manage both AWS native firewalls and FortiGate CNF firewalls.

Read More for the details.

2022 12 02

Azure – Community support for Azure Functions Proxies will end on 30 September 2025

Azure, Cloud Azure

Transition to Azure API Management as soon as possible to avoid disruption to your Function applications using Proxies.

Read More for the details.

2022 12 02

AWS – AWS AppConfig Agent makes feature flags and runtime configuration simpler for containers

AWS, Cloud AWS

AWS AppConfig has released an agent for container runtimes that makes the use of feature flags and runtime configuration simpler, with improved performance. Customers that use Amazon Elastic Container Service (Amazon ECS), Amazon Elastic Kubernetes Service (Amazon EKS), Docker, or Kubernetes can now use the AWS AppConfig Agent to manage the calls from their containerized application to the AWS AppConfig service. Previously, customers needed to manage their own retrieval and caching of configuration data when using a container runtime. Now, by using the AWS AppConfig Agent, the agent will do this for them. This agent will query AWS AppConfig for configuration data and make it available locally. It will also handle polling and caching logic on behalf of customers. Applications will make local HTTP calls to the AWS AppConfig Agent, thus seeing performance improvements when retrieving configuration data.

Read More for the details.

2022 12 02

GCP – Auditing GKE Clusters across the entire organization

Cloud, Google Cloud gcp

Introduction

Companies moving to the cloud and running containers are often looking for elasticity. The ability to scale up or down as needed, means paying only for the resources used. Using automation allows engineers to focus on applications rather than on the infrastructure. These are key features of the cloud native and managed container orchestration platforms like Google Kubernetes Engine (GKE).

GKE clusters leverage Google Cloud to achieve the best in class security and scalability. They come with two modes of operation and a lot of advanced features. In Autopilot mode, clusters use more automation to reduce operational cost. This comes with less configuration options though. For use cases where you need more flexibility, the Standard mode offers greater control and configuration options. Irrespective of the selected operational mode, there are always recommended, GKE specific features and best practices to adopt. The official product documentation provides comprehensive descriptions and enlists these best practices.

But how do you ensure that your clusters are following them? Did you consider configuring the Google Groups for RBAC feature to make Kubernetes user management easier ? Or did you remember to set NodeLocal DNS cacheon standard GKE clusters to improve DNS lookup times?

Lapses in GKE cluster configuration may lead to reduced scalability or security. Over time, this may decrease the benefits of using the cloud and managed Kubernetes platform. Thus, keeping an eye on cluster configuration is an important task! There are many solutions to enforce policies for resources inside a cluster, but only a few address the clusters themselves. Organizations that implemented the Infrastructure-as-code approach may apply controls there. Yet, this requires change validation processes and code coverage for the entire infrastructure. Also, creation of GKE specific policies will need time investment and product expertise. And even then, there might be often a need to check the configurations of running clusters (i.e. for auditing purposes).

Automating cluster checks

The GKE Policy Automation is a tool that will check all clusters in your Google Cloud organization. It comes with a comprehensive library of codified cluster configuration policies. These follow the best practices and recommendations from the Google Product and Professional Services teams. Both the tool and the policy library are free and released as an open source project on Github. Also, the solution does not need any modifications on the clusters to operate. It is simple and secure to use, leverages read-only access to cluster data via Google Cloud APIs.

You can use GKE Policy Automation to run a manual one time check, or in an automated & serverless way for continuous verification. The second approach will discover your clusters and check if they comply with the defined policies on a regular basis.

After successful cluster identification, the tool pulls information using the Kubernetes Engine API. In the next releases, the tool will support more data inputs to cover additional cluster validation use cases, like scalability limits check.

GKE Policy Automation engine evaluates the gathered data against the set of codified policies, originating from Google Github repository by default; but users can specify their own repositories. This is useful for adding custom policies or in cases when public repository access is not allowed.

The tool supports a variety of ways for storing the policy check results. Besides the console output, it can save the results in JSON format on Cloud Storage or to Pub/Sub. Although those are good cloud integration patterns, they need further JSON data processing. We recommend leveraging the GKE Policy Automation integration with the Security Command Center.

The Security Command Center is Google Cloud’s centralized vulnerability and threat reporting service. The GKE Policy Automation registers itself as an additional source of findings there. Next, for each cluster evaluation, the tool creates new or updates existing findings. This brings all SCC features like finding visualization and management together. Also, the cluster check findings will be subject to the configured SCC notifications.

In the next chapters we will show how to run GKE Policy Automation in a serverless way. The solution will leverage cluster discovery mechanisms and Security Command Center integration.

Continuous cluster evaluation

The GKE Policy Automation comes with a sample Terraform code that creates the infrastructure for serverless operation. The below picture shows the overall architecture of this solution.

Serverless deployment of GKE Policy Automation tool

The serverless GKE Policy Automation solution uses a containerized version of the tool.

The Cloud Run Jobs service executes the container as a job to perform cluster checks. This happens in configurable intervals, triggered by the Cloud Scheduler.

The solution discovers GKE clusters running in your organization using Cloud Asset Inventory.

On each run, the solution gathers cluster data and evaluates configuration policies against them. The policies originate from the Google Github repository by default or from user specified repositories.

At the end, the tool sends evaluation results to the Security Command Center as findings.

GKE Policy Automation container images are available in the Github container registry. To run containers with Cloud Run, the images have to be either built in the Cloud or copied there. The provided Terraform code will provision Artifact Registry repository for this purpose. The following chapters of this post describe how to copy GKE Policy Automation’s image to Google Cloud.

Prerequisites

Existing Google Cloud project for GKE Policy Automation resources

Terraform tool version >= 1.3

Clone of GKE Policy Automation Github repository

The operator will also need sufficient IAM roles to create the necessary resources:

roles/editor role or equivalent on GKE Policy Automation project

roles/iam.securityAdmin role or equivalent on Google Cloud organization – to set IAM policies for Asset Inventory or Security Command Center

Adjusting variables

The Terraform code needs inputs to provision the desired infrastructure. In our scenario, GKE Policy Automation will check clusters in the entire organization. The dedicated IAM service account will be created for the purpose of running a tool. The account will need the following IAM roles on the Google Cloud organization level:

roles/cloudasset.viewer to detect running GKE clusters

roles/container.clusterViewer to get GKE clusters configuration

roles/securitycenter.sourcesAdmin to register the tool as SCC source

roles/securitycenter.findingsEditor to create findings in SCC

The .tfvars file below provides all of the necessary inputs to create the above role bindings. Remember to adjust the project_id, region and organization values accordingly.

code_block[StructValue([(u’code’, u’project_id = “gke-policy-123″rnregion = “europe-west2″rnrndiscovery = {rn organization = “123456789012”rn}rnrnoutput_scc = {rn enabled = truern organization = “123456789012”rn}’), (u’language’, u”), (u’caption’, <wagtail.wagtailcore.rich_text.RichText object at 0x3e295d7fbd90>)])]

The tool can be also used to check clusters in a given folder or project only. This can be useful when granting organization-wide permissions is not a viable option. In such a case, the discovery parameters in the above example need to be adjusted. Please refer to the input variables README documentation for more details.

Besides the infrastructure, we need to configure the tool itself. The repository provides an example config.yaml file, with the following content:

code_block[StructValue([(u’code’, u’silent: truernclusterDiscovery:rn enabled: truern organization: ${SCC_ORGANIZATION}rnoutputs:rn – securityCommandCenter:rnt provisionSource: truern organization: ${SCC_ORGANIZATION}’), (u’language’, u”), (u’caption’, <wagtail.wagtailcore.rich_text.RichText object at 0x3e29519dbf10>)])]

The Terraform will populate the variable values in the config.yaml file above and copy it to the Secret Manager. The secret will be then mounted as a volume in the Cloud Run job image. The full documentation of the tool’s configuration file is available in the GKE Policy Automation user guide.

Both configuration files should be saved in a terraform subdirectory in the GKE Policy Automation folder.

Running Terraform

Initialize Terraform by running terraform init

Create and inspect plan by running terraform plan -out tfplan

Apply plan by running terraform apply tfplan

Copying container image

The steps below describe how to copy the GKE Policy Automation container image from Github to Google Artifact Registry. Please

1. Set the environment variables. Remember to adjust the values accordingly.

code_block[StructValue([(u’code’, u’export GKE_PA_PROJECT_ID=gke-policy-123rnexport GKE_PA_REGION=europe-west2′), (u’language’, u”), (u’caption’, <wagtail.wagtailcore.rich_text.RichText object at 0x3e29519db110>)])]

2. Pull the latest image

code_block[StructValue([(u’code’, u’docker pull ghcr.io/google/gke-policy-automation:latest’), (u’language’, u”), (u’caption’, <wagtail.wagtailcore.rich_text.RichText object at 0x3e295dce6b10>)])]

3. Login

code_block[StructValue([(u’code’, u’gcloud auth print-access-token | docker login -u oauth2accesstoken –password-stdin https://${GKE_PA_REGION}-docker.pkg.dev’), (u’language’, u”), (u’caption’, <wagtail.wagtailcore.rich_text.RichText object at 0x3e295dce6510>)])]

4. Tag the container image

code_block[StructValue([(u’code’, u’docker tag ghcr.io/google/gke-policy-automation:latest ${GKE_PA_REGION}-docker.pkg.dev/${GKE_PA_PROJECT_ID}/gke-policy-automation/gke-policy-automation:latest’), (u’language’, u”), (u’caption’, <wagtail.wagtailcore.rich_text.RichText object at 0x3e29514b0950>)])]

5. Push the container image

code_block[StructValue([(u’code’, u’docker push ${GKE_PA_REGION}-docker.pkg.dev/${GKE_PA_PROJECT_ID}/gke-policy-automation/gke-policy-automation:latest’), (u’language’, u”), (u’caption’, <wagtail.wagtailcore.rich_text.RichText object at 0x3e29514b04d0>)])]

Creating Cloud Run job

As of the moment of writing this article, Google’s Terraform provider does not yet support Cloud Run Jobs.

Therefore, the Cloud Run job creation has to be done manually. The command below uses gcloud command and environment variables defined before.

code_block[StructValue([(u’code’, u’gcloud beta run jobs create gke-policy-automation \rn –image ${GKE_PA_REGION}-docker.pkg.dev/${GKE_PA_PROJECT_ID}/gke-policy-automation/gke-policy-automation:latest\rn –command=/gke-policy,check \rn –args=-c,/etc/secrets/config.yaml \rn –set-secrets /etc/secrets/config.yaml=gke-policy-automation:latest \ rn–service-account=gke-policy-automation@${GKE_PA_PROJECT_ID}.iam.gserviceaccount.com \rn –set-env-vars=GKE_POLICY_LOG=INFO \rn –region=${GKE_PA_REGION} \rn –project=${GKE_PA_PROJECT_ID}’), (u’language’, u”), (u’caption’, <wagtail.wagtailcore.rich_text.RichText object at 0x3e29514b0dd0>)])]

Observing the results

The configured Cloud Scheduler will run the GKE Policy Automation job once per day. To observe results immediately, we advise to run the job manually. The successful Cloud Run job executions can be viewed in the Cloud Console, as in the example below.

GKE Policy Automation Cloud Run jobs

Security Command Center Integration

Once the GKE Policy Automation job runs successfully, it will produce findings in the Security Command Center for the discovered GKE clusters. It is possible to view them in the SCC Findings view as shown below. Additionally, findings will be available via the SCC API and subject to configured SCC Pub/Sub notifications. This gives the possibility to leverage any existing SCC integrations i.e. with the systems used by your Security Operations Center teams.

GKE Policy Automation Findings in Security Command Center

Selecting the specific finding will navigate to the detailed finding view. In this view, all the finding attributes are shown. Additionally, in the Source Properties tab are the GKE Policy Automation specific properties. Those include:

Evaluated policy file name

Recommended actions and external documentation reference

Center for Internet Security (CIS) benchmarks for GKE mapping for security related findings

The below example shows source properties of a GKE Policy Automation finding in SCC.

GKE Policy Automation Finding Source Properties

GKE Policy Automation will update existing SCC findings during each run. The findings will be set as inactive when the corresponding policy will become valid for a given cluster. Same way, the existing inactive findings will be set to active when policy will violate.

For the cases when some policies are not relevant for given clusters, we recommend using the Security Command Center mute feature. For example, if the Binary Authorization policy is not relevant for development clusters, a muting rule with a development project identifier can be created.

Summary

In the article we have shown how to establish GKE cluster governance for Google Cloud organization using the GKE Policy Automation, an open-source tool created by the Google Professional Services team. The tool along with Google Cloud serverless solutions, like Cloud Run Jobs allows you to build a fully automated solution; alongside the Security Command Center integration, giving you the possibility to process GKE policy evaluation results in a unified and cloud native way.

Read More for the details.

2022 12 02

GCP – Google Cloud Biotech Acceleration Tooling

Cloud, Google Cloud gcp

Bio-pharma organizations can now leverage quick start tools and setup scripts to begin running scalable workloads in the cloud today.

This capability is a boon for research scientists and organizations in the bio-pharma space, from those developing treatments for diseases to those creating new synthetic biomaterials. Google Cloud’s solutions teams continue to shape products with customer feedback and contribute to platforms on which Google Cloud customers can build.

This guide provides a way to get started with simplified cloud architectures for specific workloads. Cutting edge research and biotechnology development organizations are often science first and can therefore save valuable resources by leveraging existing technology infrastructure starting points embedded with Google’s best practices. Biotech Acceleration Tooling frees up scientist and researcher bandwidth, while still enabling flexibility. The majority of the tools outlined in this guide come with quick start Terraform scripts to automate the stand up of environments for biopharma workloads.

Solution overview

This deployment creates the underlying infrastructure in accordance with Google’s best practices, configuring appropriate networking including VPC networking, security, data access, and analytics notebooks. All environments are created with Terraform scripts, which define cloud and on-prem resources in configuration files. A consistent workflow can be used to provision infrastructure.

If beginning from scratch, you will need to first consider security, networking, and identity access management set up to keep your organization’s computing environment safe. To do this, follow the steps below:

Use Terraform Automation Repository within Security Foundations Blueprint to deploy your new environment

Workloads needed can vary, and so should solutions tooling. We offer easy to deploy code and workflows for various biotech use cases including AlphaFold, genomics sequencing, cancer data analysis, clinical trials, and more.

AlphaFold

AlphaFold is an AI system developed by DeepMind that predicts a protein’s 3D structure from its amino acid sequence. It regularly achieves accuracy competitive with experiments. It is useful for researchers doing drug discovery and protein design, often computational biologists and chemists. To get started running AlphaFold batch inference on your own protein sequences, leverage these setup scripts. To better understand the batch inference solution, see this explanation of optimized inference pipeline and video explanation. If your team does not need to run AlphaFold at scale and is comfortable running structures one at a time on less optimized hardware, see the simplified AlphaFold run guide.

Genomics Tooling

Researchers today have the ability to generate an incredible amount of biological data. Once you have this data, the next step is to refine and analyze it for meaning. Whether you are developing your own algorithms or running common tools and workflows, you now have a large number of software packages to help you out.

Here we make a few recommendations for what technologies to consider. Your technology choice should be based on your own needs and experience. There is no “one size fits all” solution.

Genomics tools that may be of assistance for your organization include generalized genomics sequencing pipelines, Cromwell genomics, Databiosphere dsub genomics, and DeepVariant.

Cromwell

The Broad Institute has developed the Workflow Definition Language (WDL) and an associated runner called Cromwell. Together these have allowed the Broad to build, run at scale, and publish its recommended practices pipelines. If you want to run the Broad’s published GATK workflows or are interested in using the same technology stack, take a look at this deployment of Cromwell.

Dsub

This module is packaged to use databiosphere dsub as a Workflow engine, containerized tools (FastQC) and Google cloud lifescience API to automate execution of pipeline jobs. The function can be easily modified to adopt to other bioinformatic tools out there.

Dsub is a command-line tool that makes it easy to submit and run batch scripts in the cloud. The cloud function has embedded dsub libraries to execute pipeline jobs in Google cloud.

DeepVariant

DeepVariant is an analysis pipeline that uses a deep neural network to call genetic variants from next-generation DNA sequencing data.

Cancer Data Analysis

ISB-CGC (ISB Cancer Gateway in the Cloud) enables researchers to analyze cloud-based cancer data through a collection of powerful web-based tools and Google Cloud technologies. It is one of three National Cancer Institute (NCI) Cloud Resources tasked with bringing cancer data and computation power together through cloud platforms.

Interactive web-based Cancer Data Analysis & Exploration

Explore and analyze ISB-CGC cancer data through a suite of graphical user interfaces (GUIs) that allow users to select and filter data from one or more public data sets (such as TCGA, CCLE, and TARGET), combine these with your own uploaded data and analyze using a variety of built-in visualization tools.

Cancer data analysis using Google BigQuery

Processed data is consolidated by data type (ex. Clinical, DNA Methylation, RNAseq, Somatic Mutation, Protein Expression, etc.) from sources including the Genomics Data Commons (GDC) and Proteomics Data Commons (PDC) and transformed into ISB-CGC Google BigQuery tables. This allows users to quickly analyze information from thousands of patients in curated BigQuery tables using Structured Query Language (SQL). SQL can be used from the Google BigQuery Console but can also be embedded within Python, R and complex workflows, providing users with flexibility. The easy, yet cost effective, “burstability” of BigQuery allows you to, within minutes (as compared to days or weeks on a non-cloud based system), calculate statistical correlations across millions of combinations of data points.

Available Cancer Data Sources

TCGA

Pan-Cancer Atlas BigQuery Data

Therapeutically Applicable Research to Generate Effective Treatments (TARGET)

More here

Clinical Trials Studies

The FDA’s MyStudies platform enables organizations to quickly build and deploy studies that interact with participants through purpose-built apps on iOS and Android. MyStudies apps can be distributed to participants privately or made available through the App Store and Google Play.

This open-source repository contains the code necessary to run a complete FDA MyStudies instance, inclusive of all web and mobile applications.

Open-source deployment tools are included for semi-automated deployment to Google Cloud Platform (GCP). These tools can be used to deploy the FDA MyStudies platform in just a few hours. These tools follow compliance guidelines to simplify the end-to-end compliance journey. Deployment to other platforms and on-premise systems can be performed manually.

Data Science

For generalized data science pipelines to build custom predictive models or do interactive analysis within notebooks, check out our data science workflow setup scripts to get to work immediately. These include database connections and setup, virtual private cloud enablement, and notebooks.

Reference material

Life sciences public datasets

Drug discovery and in silico virtual screening on GCP

Semantic scientific literature search

Research workloads on GCP

Genomics and Secondary Analysis

Patient Monitoring

Variant Analysis

Healthcare API for Machine Learning and Analytics

Radiological Image Extraction

RAD Lab – a secure sandbox for innovation

During research, scientists are often asked to spin up research modules in the cloud to create more flexibility and collaboration opportunities for their projects. However, lacking the necessary cloud skills, many projects never get off the ground.

To accelerate innovation, RAD Lab is a Google Cloud-based sandbox environment which can help technology and research teams advance quickly from research and development to production. RAD Lab is a cloud-native research, development, and prototyping solution designed to accelerate the stand-up of cloud environments by encouraging experimentation, without risk to existing infrastructure. It’s also designed to meet public sector and academic organizations’ specific technology and scalability requirements with a predictable subscription model to simplify budgeting and procurement. You canfind the repository here.

RAD Lab delivers a flexible environment to collect data for analysis, giving teams the liberty to experiment and innovate at their own pace, without the risk of cost overruns. Key features include:

Open-source environment that runs on the cloud for faster deployment—with no hardware investment or vendor lock-in.

Built on Google Cloud tools that are compliant with regulatory requirements like FedRAMP, HIPAA, and GDPR security policies.

Common IT governance, logging, and access controls across all projects.

Integration with analytics tools like BigQuery, Vertex AI, and pre-built notebook templates.

Best-practice operations guidance, including documentation and code examples, that accelerate training, testing, and building cloud-based environments.

Optional onboarding workshops for users, conducted by Google Cloud specialists.

The next generation of RAD Lab includes RAD Lab UI, which provides a modern interface for less technical users to deploy Google Cloud resources – in just three steps.

Read More for the details.

2022 12 02

GCP – Monitoring GPU workloads on GKE with NVIDIA Data Center GPU Manager (DCGM)

Cloud, Google Cloud gcp

Artificial intelligence (AI) and machine learning (ML) have become an increasingly important enterprise capability, including use cases such as product recommendations, autonomous vehicles, application personalization, and automated conversational platforms. Building and deploying ML models demand high-performance infrastructure. Using NVIDIA GPUs can greatly accelerate the training and inference system. Consequently, monitoring GPU performance metrics to understand workload behavior is critical for optimizing the ML development process.

Many organizations use Google Kubernetes Engine (GKE) to manage NVIDIA GPUs to run production AI inference and training at scale.NVIDIA Data Center GPU Manager (DCGM) is a set of tools from NVIDIA to manage and monitor NVIDIA GPUs in cluster and datacenter environments. DCGM includes APIs for collecting a detailed view of GPU utilization, memory metrics, and interconnect traffic. It provides the system profiling metrics needed for ML engineers to identify bottlenecks and optimize performance, or for administrators to identify underutilized resources and optimize for cost.

In this blog post we demonstrate:

How to setup NVIDIA DCGM in your GKE cluster, and

How to observe the GPU utilization using either a Cloud Monitoring Dashboard or Grafana with Prometheus.

NVIDIA Data Center GPU Manager

NVIDIA DCGM simplifies GPU administration, including setting configuration, performing health checks, and observing detailed GPU utilization metrics. Check out NVIDIA’s DCGM user guide to learn more.

Here we focus on the gathering and observing of GPU utilization metrics in a GKE cluster. To do so, we also make use of NVIDIA DCGM exporter. This component collects GPU metrics using NVIDIA DCGM and exports them as Prometheus style metrics.

GPU Monitoring Architecture

The following diagram describes the high-level architecture of the GPU monitoring setup using NVIDIA DCGM, NVIDIA DCGM Exporter, and Google Managed Prometheus,Google Cloud’s managed offering for Prometheus.

In the above diagram, the boxes labeled “NVIDIA A100 GPU” represent an example NVIDIA GPU attached to a GCE VM Instance. Dependencies amongst components are traced out by the wire connections.

The “AI/ML workload” represents a pod that has been assigned one or more GPUs. The boxes “NVIDIA DCGM” and “NVIDIA DCGM exporter” are pods running as privileged daemonset across the GKE cluster. A ConfigMap contains the list of DCGM fields (in particular GPU metrics) to collect.

The “Managed Prometheus” box represents managed prometheus components deployed in the GKE cluster. This component is configured to scrape Prometheus style metrics from the “DCGM exporter” endpoint. “Managed Prometheus” exports each metric to Cloud Monitoring as “prometheus.googleapis.com/DCGM_NAME/gauge.” The metrics are accessible through various Cloud Monitoring APIs, including the Metric Explorer page.

To provide greater flexibility, we also include components that can set up an in-cluster Grafana dashboard. This consists of a “Grafana” pod that accesses the available GPU metrics through a “Prometheus UI” front end as a data source. The Grafana page is then made accessible at a Google hosted endpoint through an “Inverse Proxy” agent.

All the GPU monitoring components are deployed to a namespace “gpu-monitoring-system.”

Requirements

Google Cloud Project

Quota for NVIDIA GPUs (more information at GPU quota)

GKE version 1.21.4-gke.300 with “beta” component to install Managed Prometheus.

GKE version 1.18.6-gke.3504 or above to support all available cloud GPU types.

NVIDIA Datacenter GPU Manager requires NVIDIA Driver R450+.

Deploy a Cluster with NVIDIA GPUs

1. Follow the instructions at Run GPUs in GKE Standard node pools to create a GKE cluster with NVIDIA GPUs.

Here is an example to deploy a cluster with two A2 VMs with 2 x NVIDIA A100 GPUs each. For a list of available GPU platforms by region, see GPU regions and zones.

code_block[StructValue([(u’code’, u’gcloud beta container clusters create CLUSTER_NAME \rn –zone us-central1-f \rn –machine-type=a2-highgpu-2g \rn –num-nodes=2 \rn –enable-managed-prometheus’), (u’language’, u”), (u’caption’, <wagtail.wagtailcore.rich_text.RichText object at 0x3e0a8df90d90>)])]

Note the presence of the “–enable-managed-prometheus” flag. This allows us to skip the next step. By default a cluster will deploy the Container-Optimized OS on each VM.

2. Enable Managed Prometheus on this cluster. It allows us to collect and export our GPU metrics to Cloud Monitoring. It will also be used as a data source for Grafana.

code_block[StructValue([(u’code’, u’gcloud beta container clusters update CLUSTER_NAME \rn –zone ZONE \rn –enable-managed-prometheus’), (u’language’, u”), (u’caption’, <wagtail.wagtailcore.rich_text.RichText object at 0x3e0a8de98fd0>)])]

3. Before you can use kubectl to interact with your GKE cluster, you need to fetch the cluster credentials.

code_block[StructValue([(u’code’, u’gcloud container clusters get-credentials CLUSTER_NAME \rn –zone us-central1-f’), (u’language’, u”), (u’caption’, <wagtail.wagtailcore.rich_text.RichText object at 0x3e0a8de98f90>)])]

4. Before we can interact with the GPUs, we need to install the NVIDIA drivers. The following installs NVIDIA drivers for VMs running the Contained-Optimised OS.

code_block[StructValue([(u’code’, u’kubectl apply -f \rnhttps://raw.githubusercontent.com/GoogleCloudPlatform/container-engine-accelerators/master/nvidia-driver-installer/cos/daemonset-preloaded.yaml’), (u’language’, u”), (u’caption’, <wagtail.wagtailcore.rich_text.RichText object at 0x3e0a8c4eec50>)])]

Wait for “nvidia-gpu-device-plugin” to go Running across all nodes. This can take a couple minutes.

code_block[StructValue([(u’code’, u’kubectl get pods -n kube-system | grep nvidia-gpu-device-plugin’), (u’language’, u”), (u’caption’, <wagtail.wagtailcore.rich_text.RichText object at 0x3e0a8c4eeb90>)])]

Download GPU Monitoring System Manifests

Download the Kubernetes manifest files and dashboards used later in this guide.

code_block[StructValue([(u’code’, u’git clone https://github.com/suffiank/dcgm-on-gke && cd dcgm-on-gke’), (u’language’, u”), (u’caption’, <wagtail.wagtailcore.rich_text.RichText object at 0x3e0a8c4ee790>)])]

Configure GPU Monitoring System

Before we deploy the NVIDIA Data Center GPU manager and related assets, we need to select which GPU metrics we want to emit from the cluster. We also want to set the period at which we sample those GPU metrics. Note that all these steps are optional. You can choose to keep the defaults that we provide.

1. View and edit the ConfigMap section of quickstart/dcgm_quickstart.yml to select which GPU metrics to emit:

code_block[StructValue([(u’code’, u’apiVersion: v1rnkind: ConfigMaprnmetadata:rn name: nvidia-dcgm-exporter-metricsrnu2026rndata:rn counters.csv: |rn # Utilization (the sample period varies depending on the product),,rn DCGM_FI_DEV_GPU_UTIL, gauge, GPU utilization (in %).rn DCGM_FI_DEV_MEM_COPY_UTIL, gauge, Memory utilization (in %).rnrn # Utilization of IP blocks,,rn DCGM_FI_PROF_SM_ACTIVE, gauge, rn DCGM_FI_PROF_SM_OCCUPANCY, gauge,rn DCGM_FI_PROF_PIPE_TENSOR_ACTIVE, gauge, rn DCGM_FI_PROF_PIPE_FP64_ACTIVE, gauge, rn DCGM_FI_PROF_PIPE_FP32_ACTIVE, gauge, rn DCGM_FI_PROF_PIPE_FP16_ACTIVE, gauge, rnrn # Memory usage,,rn DCGM_FI_DEV_FB_FREE, gauge,rn DCGM_FI_DEV_FB_USED, gauge,rn DCGM_FI_DEV_FB_TOTAL, gauge,rnrn # PCIE,,rn DCGM_FI_PROF_PCIE_TX_BYTES, gauge,rn DCGM_FI_PROF_PCIE_RX_BYTES, gauge, rnrn # NVLink,,rn DCGM_FI_PROF_NVLINK_TX_BYTES, gauge, rn DCGM_FI_PROF_NVLINK_RX_BYTES, gauge,’), (u’language’, u”), (u’caption’, <wagtail.wagtailcore.rich_text.RichText object at 0x3e0a8c4ee550>)])]

A complete list of NVIDIA DCGM fields available are at NVIDIA DCGM list of Field IDs. For your benefit, here we briefly outline the GPU metrics set in this default configuration.

The most important of these is the GPU utilization (“DCGM_FI_DEV_GPU_UTIL”). This metric indicates what fraction of time the GPU is not idle. Next is the GPU used memory (“DCGM_FI_DEV_FB_USED”) and it indicates how many GPU memory bytes have been allocated by the workload. This can let you know how much headroom remains on the GPU memory. For an AI workload you can use this to gauge whether you can run a larger model or increase the batch size.

The GPU SM utilization (“DCGM_FI_PROF_SM_ACTIVE”) lets you know what fraction of the GPU SM processors are in use during the workload. If this is low, it indicates there is headroom to submit parallel workloads to the GPU. On an AI workload you might send multiple inference requests. Taken together with the SM occupancy (“DCGM_FI_PROF_SM_OCCUPANCY”) it can let you know if the GPUs are being efficiently and fully utilized.

The GPU Tensor activity (“DCGM_FI_PROF_PIPE_TENSOR_ACTIVE”) indicates whether your workload is taking advantage of the Tensor Cores on the GPU. The Tensor Cores are specialized IP blocks within an SM processor that enable accelerated matrix multiplication. It can indicate to what extent your workload is bound on dense matrix math.

The FP64, FP32, and FP16 activity (e.g. “DCGM_FI_PROF_PIPE_FP64_ACTIVE”) indicates to what extent your workload is exercising the GPU engines targeting a specific precision. A scientific application might skew to FP64 calculations and an ML/AI workload might skew to FP16 calculations.

The GPU NVLink activity (e.g. “DCGM_FI_PROF_NVLINK_TX_BYTES”) indicates the bandwidth (in bytes/sec) of traffic transmitted directly from one GPU to another over high-bandwidth NVLink connections. This can indicate whether the workload requires communicating GPUs; and, if so, what fraction of the time the workload is spending on collective communication.

The GPU PCIe activity (e.g. “DCGM_FI_PROF_PCIE_TX_BYTES“) indicates the bandwidth (in bytes/sec) of traffic transmitted to or from the host system.

All the fields with “_PROF_” in the DCGM field identifier are “profiling metrics.” For a detailed technical description of their meaning take a look at NVIDIA DCGM Profiling Metrics. Note that these do have some limitations for NVIDIA hardware before H100. In particular they cannot be used concurrently with profiling tools like NVIDIA Nsight. You can read more about these limitations at DCGM Features, Profiling Sampling Rate.

2. (Optional:) By default we have configured the scrape interval at 20 sec. You can adjust the period at which NVIDIA DCGM exporter scrapes NVIDIA DCGM and likewise the interval at which GKE Managed Prometheus scrapes the NVIDIA DCGM exporter:

code_block[StructValue([(u’code’, u’apiVersion: apps/v1rnkind: DaemonSetrnmetadata:rn name: nvidia-dcgm-exporterrnu2026rnspec:rnu2026rn args:rn – hostname $NODE_NAME; dcgm-exporter -k –remote-hostengine-info $(NODE_IP) –collectors /etc/dcgm-exporter/counters.csv –collect-interval 20000rnu2026rnapiVersion: monitoring.googleapis.com/v1alpha1rnkind: PodMonitoringrnmetadata:rn name: nvidia-dcgm-exporter-gmp-monitorrnu2026rnspec:rnu2026rn endpoints:rn – port: metricsrn interval: 20s’), (u’language’, u”), (u’caption’, <wagtail.wagtailcore.rich_text.RichText object at 0x3e0a831709d0>)])]

Selecting a lower sample period (say 1 sec) will give a high resolution view of the GPU activity and the workload pattern. However selecting a higher sample rate will result in more data being emitted to Cloud Monitoring. This may cause a higher bill from Cloud Monitoring. See “Metrics from Google Cloud Managed Service for Prometheus” on the Cloud Monitoring Pricing page to estimate charges.

3. (Optional:) In this example we use NVIDIA DCGM 2.3.5. You can adjust the NVIDIA DCGM version by selecting a different image from the NVIDIA container registry. Note that the NVIDIA DCGM exporter version must be compatible with the NVIDIA DCGM version. So be sure to change both when selecting a different version.

code_block[StructValue([(u’code’, u’apiVersion: apps/v1rnkind: DaemonSetrnmetadata:rn name: nvidia-dcgmrnu2026rnspec:rnu2026rn containers:rn – image: “nvcr.io/nvidia/cloud-native/dcgm:2.3.5-1-ubuntu20.04″rnu2026rnapiVersion: apps/v1rnkind: DaemonSetrnmetadata:rn name: nvidia-dcgm-exporterrnu2026rnspec:rnu2026rn containers:rn – name: nvidia-dcgm-exporterrn image: nvcr.io/nvidia/k8s/dcgm-exporter:2.3.5-2.6.5-ubuntu20.04′), (u’language’, u”), (u’caption’, <wagtail.wagtailcore.rich_text.RichText object at 0x3e0a83170790>)])]

Here we have deployed NVIDIA DCGM and the NVIDIA DCGM Exporter as separate containers. It is possible for the NVIDIA DCGM exporter to launch and run the NVIDIA DCGM process within its own container. For a description of the options available on the DCGM exporter, see the DCGM Exporter page.

Deploying GPU Monitoring System

1. Deploy NVIDIA DCGM + NVIDIA DCGM exporter + Managed Prometheus configuration.

code_block[StructValue([(u’code’, u’kubectl create namespace gpu-monitoring-systemrnkubectl apply -f quickstart/dcgm_quickstart.yml’), (u’language’, u”), (u’caption’, <wagtail.wagtailcore.rich_text.RichText object at 0x3e0a83170e90>)])]

If successful, you should see a privileged NVIDIA DCGM and NVIDIA DCGM exporter pod running on every GPU node.

Set up a Cloud Monitoring Dashboard

1. Import a custom dashboard to view DCGM metrics emitted to Managed Prometheus

code_block[StructValue([(u’code’, u’gcloud monitoring dashboards create \rn –config-from-file quickstart/gke-dcgm-dashboard.yml’), (u’language’, u”), (u’caption’, <wagtail.wagtailcore.rich_text.RichText object at 0x3e0a839b5150>)])]

2. Navigate to Monitoring Dashboards page of the Cloud Console to view the newly added “Example GKE GPU” dashboard.

3. For a given panel you can expand the legend to include the following fields:

The following fields are available in the legend:
“cluster” (GKE cluster name)
“instance” (GKE node name)
“gpu” (GPU index on the GKE node)
“modelName” (whether NVIDIA T4, V100, A100 etc.)
“exported container” (container that has mapped this GPU)

“exported namespace” (namespace of the container that has mapped this GPU)

Because Managed Prometheus monitors the GPU workload through NVIDIA DCGM exporter, it is important to keep in mind that that the container name and namespace are on the labels “exported container” and “exported namespace”

Stress Test your GPUs for Monitoring

We have provided an artificial load so you can observe your GPU metrics in action. Or feel free to deploy your own GPU workloads.

1. Apply an artificial load tester for the NVIDIA GPU metrics.

code_block[StructValue([(u’code’, u’kubectl apply -f quickstart/dcgm_loadtest.yml’), (u’language’, u”), (u’caption’, <wagtail.wagtailcore.rich_text.RichText object at 0x3e0a8dbbd550>)])]

This load test creates a container on a single GPU. It then gradually cycles through all the displayed metrics. Note that the NVLink bandwidth will only be utilized if the VM has two NVIDIA GPUs connected by an NVLink connection.

Set up a Grafana Dashboard

1. Deploy the Prometheus UI frontend, Grafana, and inverse proxy configuration.

code_block[StructValue([(u’code’, u”cd grafanarnsed ‘s/\$PROJECT_ID/<YOUR PROJECT ID>/’ grafana.yml | kubectl apply -f -“), (u’language’, u”), (u’caption’, <wagtail.wagtailcore.rich_text.RichText object at 0x3e0a8e0ba3d0>)])]

<YOUR PROJECT ID> should be replaced with your project ID for the cluster.

Wait until the inverse proxy config map is populated with an endpoint for Grafana:

code_block[StructValue([(u’code’, u’kubectl get configmap inverse-proxy-config -o jsonpath=”{.data}” -n gpu-monitoring-system’), (u’language’, u”), (u’caption’, <wagtail.wagtailcore.rich_text.RichText object at 0x3e0a83170350>)])]

code_block[StructValue([(u’code’, u'{rn u2026rn “Hostname”:rn “7b530ae5746e0134-dot-us-central1.pipelines.googleusercontent.com”,rn u2026rn}’), (u’language’, u”), (u’caption’, <wagtail.wagtailcore.rich_text.RichText object at 0x3e0a839b5290>)])]

Copy and paste this URL into your browser to access the Grafana page. Only users with access to the same GCP project will be authorized to visit the Grafana page.

The inverse proxy agent deployed to the GKE cluster uses a Docker Hub image hosted at sukha/inverse-proxy-for-grafana. See Building the Inverse Proxy Agent for more info.

2. On the Grafana page click “Add your first data source,” then select “Prometheus.” Then fill the following Prometheus configuration:

Note that the full URL should be
”http://prometheus-ui.gpu-monitoring-system.svc:9090”

Select “Save and test” at bottom. You should see “Data source is working.”

3. Import the Grafana dashboard by selecting the “Import” from the “+ Create” widget panel on the left-hand side of the Grafana page.

Then select the local JSON file “grafana/gke-dcgm-grafana-dashboard.json.”

You should see the GPU utilization and all other metrics for the fake workload you deployed earlier. Note that the dashboard is configured to only display metrics whose container label is not the empty string. Therefore it does not display metrics for idle GPUs with no attached containers.

4. You can also explore the available metrics directly from the “Explorer” page. Select the “Explorer” widget along the left-hand panel. Then click “Metrics Browser” to display the list of available metrics and their labels.

You can use the Metrics Explorer page to explore the available metrics. From this knowledge you can build a custom dashboard based on queries that suit your case.

Conclusion

In this blog you were able to deploy a GKE cluster with NVIDIA GPUs and emit GPU utilization metrics by workload to Cloud Monitoring. We also set up a Cloud Monitoring dashboard to view GPU utilization by workload.

This GPU monitoring system leveraged the NVIDIA Data Center GPU Manager. All of the available NVIDIA DCGM metrics are accessible for monitoring. We also discussed the available GPU metrics and their meaning in the context of application workloads.

Finally we provided a means to deploy an in-cluster Grafana GPU utilization dashboard accessible from a Google hosted endpoint for users with access to the corresponding Google Cloud project.

Read More for the details.

2022 12 02

GCP – Overcoming objections and unblocking the road to Zero Trust

Cloud, Google Cloud gcp

Overcoming blockades and potholes that threaten to derail organizational change is key to any IT or security transformation initiative. Many security and risk leaders have made it a priority to adopt Zero Trust access models so they can deliver better user experiences and strengthen security. Yet before they can even think about change management, they often face pushback from within their organization.

Earlier this year I had the privilege of chatting twice with Jess Burn, senior analyst at Forrester, on some common challenges CISOs face when planning their Zero Trust journeys. I found our talks enlightening and useful, and wanted to share the key insights with as many organizations as possible who are considering or actively going down this path. Some highlights from my interview with Jess Burn follows.

Q: When organizations embark on a Zero Trust implementation, what is the biggest difference observed between the benefits they expect to get versus what they actually experience after implementing Zero Trust?

I think a lot of organizations look at the benefits of Zero Trust from the perspective of improving overall security posture, which is a great goal but one where the goalpost moves constantly. But what we’ve heard from enterprises that embark on Zero Trust journeys is that there are a lot of small victories and surprise benefits to celebrate along the way. For example, Zero Trust can empower employees, enabling them to work from anywhere with any device as long as they authenticate properly on a compliant device.

Zero Trust can also empower employees by shifting responsibility for security away from users and instead letting them rely on technical controls to do their work. For example, employees can use a digital certificate and biometrics to establish identity instead of having to remember passwords.

Additionally, Zero Trust can help consolidate tech tools by acting as a catalyst for much-needed process changes. For example, a client of ours, as part of their Zero Trust model adoption journey, classified their critical business assets and identified the tools that aligned to the zero trust approach. From there, they were able to reduce the number of point solutions, many of which overlapped in functionality, from 58 to 11 in an 18-month timeframe. There are real cost savings there.

How are enterprises measuring success and justifying Zero Trust transformation?

We advise our clients that measuring the success of Zero Trust efforts and the impact of the transformation should be focused on the ability of their organization to move from network access to granular application-specific access, increase data security through obfuscation, limit the risks associated with excessive user privileges, and dramatically improve security detection and response with analytics and automation. We guide our clients to create outcome-focused metrics that are a good fit for the audiences with whom they are sharing them, whether strategic (board/executives), operational (counterparts in IT/the business), or tactical (security team). Additionally, we think about Zero Trust metrics in the context of three overarching goals:

Protecting customers’ data while preserving their trust. Customers who suffer identity theft or fraud will stop doing business with you if they believe you were negligent in protecting their data. They might also leave you if your post-breach communication is late, vague, or lacks empathy and specific advice. For strategic metrics, exposing changes in customer acquisition, retention, and enrichment rates before and after specific breaches will help you alert business leaders to customer trust issues that could hinder growth. When thinking about tactical metrics, looking at changes in customer adoption of two-factor authentication and the percentage of customer data that is encrypted will help you determine where your security team needs to focus its future efforts.

Recruiting and retaining happy, productive employees who appreciate security. Strategic-level goals should track changes in your organization’s ability to recruit new talent and changes in employee satisfaction, as retention rates indicate morale issues that will affect productivity and customer service. Angry, resentful, or disillusioned employees are more likely to steal data for financial profit or as retaliation for a perceived slight. At a tactical level, employee use of two-factor authentication, implementation of a privileged identity management solution, and strong processes for identity management and governance will help you identify priorities for your security team.

Guarding the organization’s IP and reducing the costs of security incidents. IP may include trade secrets, formulas, designs, and code that differentiate your organization’s products and services from those of competitors. An IP breach threatens your organization’s future revenue and potentially its viability. At a strategic level, executives need to understand if the organization is the target of corporate espionage or nation-state actors and how much IP these actors have already compromised. On the tactical end, the level to which the security team encrypted sensitive data across locations and hosting models tells security staff where they need to concentrate their efforts to discover, classify, and encrypt sensitive data and IP.

What is the biggest myth holding back companies from moving to a Zero Trust strategy?

I think there are several myths about moving to Zero Trust, but one of the most pervasive ones is that it costs too much and will require enterprises to rip and replace their systems and tools.

The first thing we say to Forrester clients who come to us with this objection from their peers in IT leadership or from senior executives is that you’re likely not starting from scratch. Look at Forrester’s pillars of Zero Trust — data, workloads, networks, devices, people, visibility and analytics, and automation and orchestration — and then line that up with what your organization already has in place or is in the process of implementing, such as two-factor and privileged access management under the people pillar, cloud security gateways under workload, endpoint security suites under devices, vulnerability management under networks, and data loss prevention (DLP) under data.

You probably have endpoint detection and response (EDR) or managed detection and response (MDR) for security analytics, and maybe you’ve started to automate some tasks in your security operations center (SOC). This should be very encouraging to you, your peers in IT operations, and executives from a cost perspective. Zero Trust doesn’t need to be a specific budget line item.

You may need to invest in new technology at some point, but you’re likely already doing that as tools become outdated. Where you’ll need some investment, we’ve found, is in process. There may be a fair amount of change management tied to the adoption of the zero trust model. And you should budget for that in people hours.

What is a common theme you observe across organizations that are able to do this well?

Executive buy-in, for sure, but also peer buy-in from stakeholders in IT and the business. A lot of the conversations and change management needed to move some Zero Trust initiatives forward — like moving to least privilege — are big ones. Anything that requires business buy-in and then subsequent effort is going to be time consuming and probably frustrating at times. But it’s a necessary effort, and it will increase understanding and collaboration between these groups with frequently competing priorities.

Our advice is to first identify who your Zero Trust stakeholders are and bust any Zero Trust myths to lay the groundwork for their participation.

Once you’ve identified your stakeholders and addressed their concerns, you need to persuade and influence. Ask questions and actively listen to your stakeholders without judgment. Articulate your strategy well, tell stakeholders what their role is, and let them know what you need from them to be successful. They may feel daunted by the shifts in strategy and architecture that Zero Trust demands. Build a pragmatic, realistic roadmap that clearly articulates how you will use existing security controls and will realize benefits.

What is a common theme you observe across organizations that struggle with a Zero Trust implementation?

Change is uncomfortable for most people. This discomfort produces detractors who continuously try to impede progress. Security leaders with too many detractors will see their Zero Trust adoption plans and roadmaps fizzle. Security leaders we speak to are often surprised by criticism from stakeholders in IT, and sometimes even on the security team, that portrays change as impossible.

If you’re in this situation, you’ll need to step back and spend more time influencing stakeholders and address their concerns. Not everyone is familiar with Zero Trust terminology. You can use Forrester’s The Definition of Modern Zero Trust or NIST’s Zero Trust architecture to create a common lexicon that everyone can understand.

This approach allows you to use the network effect as stakeholders become familiar with the model. Additionally, your stakeholders may feel daunted by the fundamental shifts in strategy and architecture that Zero Trust demands. Build a pragmatic, realistic roadmap that clearly articulates how you will use existing security controls and tools and realize benefits.

From there, develop a hearts-and-minds campaign focusing on the value of Zero Trust. Highlight good news using examples that your stakeholders will relate to, such as how Zero Trust can improve the employee experience — something that most people are interested in both personally and organizationally.

Lastly, don’t go it alone. Extend your reach by finding Zero Trust champions who act as extra members of the security team and as influencers across the organization. Create a Zero Trust champions program by identifying people who have interest in or enthusiasm for Zero Trust, creating a mandate for them, and motivating and developing your champions by giving them professional development and other opportunities.

Next steps

If you missed the webinar, be sure to view it on-demand here. You can also download a copy of Forrester’s “A Practical Guide To A Zero Trust Implementation” here. This report guides security leaders through a roadmap for implementing Zero Trust using practical building blocks that take advantage of existing technology investments and are aligned to current maturity levels.

We also have more Zero Trust content available for you, including multiple sessions from our Google Cloud Security Talks on December 7, available on-demand.

Read More for the details.

Cloud

IAM basics

Roles

Role Binding

Principals

Hierarchical structure

Organizational policies

Building with Terraform

Service Accounts

Use Case

Wiring things up

Local development

Automated development

Pipeline

Conclusion

We’re giving plenty of learning gifts to choose from this month, so take your pick from the topics below

ML, AI and data analytics

CI/CD

Preparing for Google Cloud certification

Intro to Google Cloud for technical professionals

Intro to Google Cloud for business professionals

Sustainability

Keep connected and learning with us in 2023

A visual tour of Google Cloud certifications

Simplifying data movement

Flexibility and scale

A better approach

1. Organization-level view of recommendations

2. Observation period configurability

3. Cost Impact

4. Making automation easier with shareable recommendations

Get started today

Investing in Cloud Analytics

BigQuery and Flywheel Software Deliver Growth Faster

How Flywheel Software and BigQuery Work Together

Flywheel Audience Segmentation

Uplift Measurement, the New Moneyball

Winning Segments

Take Advantage of Timely Buyer Patterns

Deeper Segmentation and Hyper-Personalization

Deliver High Quality Leads for your Sales Team

Transform Your “Front Office” like the Boston Red Sox in 3 Easy Steps

If you have Customer Data in BigQuery…

If you are getting started with BigQuery…

Enterprise translation, reimagined: fast, lean, and intuitive

All of Google’s Translation Innovations within a single product

New AI agents can drive business results faster: Translation Hub, Document AI, and Contact Center AI

Introduction

Automating cluster checks

Continuous cluster evaluation

Security Command Center Integration

Summary

Solution overview

AlphaFold

Genomics Tooling

Cromwell

Dsub

DeepVariant

Cancer Data Analysis

Interactive web-based Cancer Data Analysis & Exploration

Cancer data analysis using Google BigQuery

Available Cancer Data Sources

Clinical Trials Studies

Data Science

Reference material

RAD Lab – a secure sandbox for innovation

NVIDIA Data Center GPU Manager

GPU Monitoring Architecture

Requirements

Deploy a Cluster with NVIDIA GPUs

Download GPU Monitoring System Manifests

Configure GPU Monitoring System

Deploying GPU Monitoring System

Set up a Cloud Monitoring Dashboard

Stress Test your GPUs for Monitoring

Set up a Grafana Dashboard

Conclusion