Cloud

2022 10 18

AWS – Amazon Connect Wisdom now delivers improved machine learning capabilities

Amazon Connect Wisdom now delivers improved machine learning capabilities to continuously understand issues throughout a call and to deliver the right knowledge article to contact center agents. Wisdom analyzes contact center calls in real-time and proactively delivers agents the information they need to solve customer issues, improving agent productivity and caller satisfaction.

Read More for the details.

2022 10 18

AWS – PostgreSQL 15 Release Candidate 2 is now available in Amazon RDS Database preview environment

AWS, Cloud AWS

Amazon RDS for PostgreSQL 15 Release Candidate 2 (RC2) is now available in the Amazon RDS Database Preview Environment, allowing you to test the release candidate of PostgreSQL 15 on Amazon RDS for PostgreSQL. You can deploy PostgreSQL 15 RC2 for development and testing in the Amazon RDS Database Preview Environment without the hassle of installing, provisioning, and managing the database.

Read More for the details.

2022 10 18

Azure – Generally available: Auto Extension upgrade for Arc enabled Servers

Azure, Cloud Azure

Azure Arc can now provide high availability and automatic protection against zero-day or critical vulnerabilities in Azure extensions to your Arc enabled Servers.

Read More for the details.

2022 10 18

AWS – Announcing Red Hat Enterprise Linux (RHEL) Workstation on AWS

AWS, Cloud AWS

We are announcing the launch of Red Hat Enterprise Linux (RHEL) Workstation for accelerated GPU instances on AWS Marketplace. RHEL Workstation is a cloud-based remote desktop solution that allows end users from anywhere in the world to access a workstation instance to do their work and collaborate with team members. RHEL Workstation is designed for advanced Linux users working on more powerful hardware, and is optimized for activities such as animation, computer-aided design and engineering, scientific research, medical imaging etc. It is delivered via NICE DCV, a secure, high-performance remote display protocol. RHEL Workstation on AWS allows customers to provide high-end hardware capabilities to a distributed workforce, without the need for large capital investments in expensive workstation equipment.

Read More for the details.

2022 10 18

Azure – General availability: Azure savings plan for compute

Azure, Cloud Azure

Today, we’re officially announcing the general availability of Azure savings plan for compute.

Read More for the details.

2022 10 18

GCP – Unifying data and AI to bring unstructured data analytics to BigQuery

Cloud, Google Cloud gcp

Over one third of organizations believe that data analytics and machine learning have the most potential to significantly alter the way they run business over the next 3 to 5 years. However, only 26% of organizations are data driven. One of the biggest reasons for this gap is that a major portion of the data generated today is unstructured, which includes images, documents, and videos. Unstructured data is estimated to make up as much as 80% of all data, and most of it has so far remained untapped by organizations.

One of the goals of Google’s data cloud is to help customers realize value from data of all types and formats. Earlier this year, we announced BigLake, which unifies data lakes and warehouses under a single management framework, enabling you to analyze, search, secure, govern and share unstructured data using BigQuery.

At Next ‘22, we announced the preview of object tables, a new table type in BigQuery that provides a structured record interface for unstructured data stored in Google Cloud Storage. This enables you to directly run analytics and machine learning on images, audio, documents and other file types using existing frameworks like SQL and remote functions natively in BigQuery itself. Object tables also extend our best practices of securing, sharing and governing structured data to unstructured, without needing to learn or deploy new tools.

Directly process unstructured data using BigQuery ML

Object tables contain metadata such as URI (Uniform Resource Identifier), content type, and size that can be queried just like other BigQuery tables. You can then derive inferences using machine learning models on unstructured data with BigQuery ML. As part of the preview, you can import open source TensorFlow Hub image models, or your own custom models, to annotate the images. Very soon, we plan to enable this for audio, video, text and many other formats, and to add pre-trained models that enable out-of-the-box analysis. Check out this video to learn more and watch a demo.

    # Create an object table
    CREATE EXTERNAL TABLE my_dataset.object_table
    WITH CONNECTION us.my_connection
    OPTIONS(uris=["gs://mybucket/images/*.jpg"],
      object_metadata="SIMPLE", metadata_cache_mode="AUTOMATIC");

    # Generate inferences with BQML
    SELECT * FROM ML.PREDICT(
      MODEL my_dataset.vision_model,
      (SELECT ML.DECODE_IMAGE(data) AS img FROM my_dataset.object_table)
    );

By analyzing unstructured data natively in BigQuery, businesses can

Eliminate manual effort as pre-processing steps such as tuning image sizes to model requirements are automated

Leverage the simple and familiar SQL interface to quickly gain insights

Save costs by utilizing existing BigQuery slots without needing to provision new forms of compute

Adswerve is a leading Google Marketing, Analytics and Cloud partner on a mission to humanize data. Twiddy & Co. is Adswerve’s client – a vacation rental company in North Carolina. By combining structured and unstructured data, Twiddy and Adswerve used BigQuery ML to analyze images of rental listings and predict the click-through rate, enabling data-driven photo editorial decisions.

“Twiddy now has the capability to use advanced image analysis to stay competitive in an ever changing landscape of vacation rental providers – and can do this using their in-house SQL skills,” said Pat Grady, Technology Evangelist, Adswerve.

Process unstructured data using remote functions

Customers today use remote functions (UDFs) to process structured data for languages and libraries that are not supported in BigQuery. We are extending this capability to process unstructured data using object tables.

Object tables provide signed URLs to allow remote UDFs running on Cloud Functions or Cloud Run to process the object table content. This is particularly useful for running Google’s pre-trained AI models, including Vision AI, Speech-to-Text, Document AI, open source libraries such as Apache Tika, or deploying your own custom models where performance SLAs are important.

Here’s an example of an object table being created over PDF files that are parsed using an open source library running as a remote UDF.

    SELECT uri, extract_title(samples.parse_tika(signed_url)) AS title
    FROM EXTERNAL_OBJECT_TRANSFORM(TABLE pdf_files_object_table,
      ["SIGNED_URL"]);

Extending more BigQuery capabilities to unstructured data

Business intelligence – The results of analyzing unstructured data either directly in BigQuery ML or via UDFs can be combined with your structured data to build unified reports using Looker Studio (at no charge), Looker or any of your preferred BI solutions. This allows you to gain more comprehensive business insights. For example, online retailers can analyze product return rates by correlating them with the images of defective products. Similarly, digital advertisers can correlate ad performance with various attributes of ad creatives to make more informed decisions.

BigQuery search index – Customers are increasingly using the search functionality of BigQuery to power search use cases. These capabilities now extend to unstructured data analytics as well. Whether you use BigQuery ML to produce inferences on images or use remote UDFs with Document AI to extract document content, the results can now be search indexed and used to support search access patterns.

Here’s an example of search index on data that is parsed from PDF files:

    CREATE SEARCH INDEX my_index ON pdf_text_extract(ALL COLUMNS);

    SELECT * FROM pdf_text_extract WHERE SEARCH(pdf_text, "Google");

Security and governance – We are extending BigQuery’s row-level security capabilities to help you secure objects in Google Cloud Storage. By securing specific rows in an object table, you can restrict the ability of end users to retrieve the signed URLs of corresponding URIs present in the table. This is a shared responsibility security model, for which administrators need to ensure that end users don’t have direct access to Google Cloud Storage, and use signed URLs from object tables as the only access mechanism.

Here’s an example of a policy for PII images that are secured to be first processed through a blur pipeline:

    CREATE ROW ACCESS POLICY pii_data ON object_table_images
    GRANT TO ("group:admin@example.com")
    FILTER USING (ARRAY_LENGTH(metadata)=1 AND
      metadata[OFFSET(0)].name="face_detected")

Soon, Dataplex will support object tables, allowing you to automatically create object tables in BigQuery and manage and govern unstructured data at scale.

Data sharing – You can now use Analytics Hub to share unstructured data with partners, customers and suppliers while not compromising on security and governance. Subscribers can consume the rows of object tables that are shared with them, and use signed URLs for unstructured data objects.

Getting Started

Submit this form to try these new capabilities that unlock the power of your unstructured data in BigQuery. Watch this demo to learn more about these new capabilities.

Special thanks to engineering leaders Amir Hormati, Justin Levandoski and Yuri Volobuev for contributing to this post.

Read More for the details.

2022 10 18

Azure – General availability: Zone-redundant storage support by Azure Backup

Azure, Cloud Azure

Azure Backup enables you to configure cost-efficient backups while meeting your data residency requirements.

Read More for the details.

2022 10 18

GCP – Google Workspace proves it’s enterprise-ready for SAP Cloud ERP customers

Cloud, Google Cloud gcp

Google Workspace is used by more than three billion users, including companies with the strictest security and availability requirements. Enterprises choose Workspace to stay connected, share ideas, get more done together, and safely work from wherever they are. Now, with the integration of Google Workspace and SAP S/4HANA ERP — the world’s most well-known ERP system — enterprise collaboration has moved to a new level, and it’s only the beginning.

SAP is a pioneering technology company with 50 years of experience in the enterprise software market. Today it has 460,000 customers across 140 countries, and those customers generate 87% of total global commerce. For SAP S/4HANA customers, the Google Workspace partnership provides seamless integration between timely and truthful enterprise data and the everyday productivity applications they use to get work done. They can access Google Workspace applications from within the secured SAP S/4HANA Cloud ERP: no more collecting files from emails and drives, downloading them, working on them, and then entering the information in the ERP. Likewise, data from the ERP system can be easily shared and edited using Google Sheets, Docs, Slides, and other Workspace applications, all within the secured clouds.

This groundbreaking level of collaboration is an example of how Google is continuing to expand enterprise integration and capabilities to allow our customers to get more out of both Google Workspace and the long-standing systems that have supported their organizations for years.

New ways to work more efficiently

Because of the flexibility of Google Workspace, the beneficial use cases for this integration are virtually limitless. Efficiency improves for any type of collaborative task or process that collects or uses enterprise data because all of the steps happen in one shared and secure environment. Users can now comment on and edit documents directly from SAP, eliminating the risk and confusion that come from having to check files in and out to modify them.

These same collaboration benefits of shared Google files are also relevant to how entire SAP S/4HANA teams work together. For instance, think about how people have traditionally performed reconciliations. A spreadsheet file populated with data from the ERP is emailed to a group of people who then have to download the file, add their comments and corrections, and then send the files back. With the integration of Google Workspace, the person in charge of reconciliations can now seamlessly export the data to a Google Sheet, where other people can directly review and comment on their pertinent data. This can happen simultaneously without anyone using email or downloading any software. All of their work is tracked in one collaborative and secure space using verified data. To learn more about how we bring these secure by design collaboration benefits to large organizations with complex needs, click here.

These workflows reflect the tangible benefits Google Workspace provides to business at large. One Forrester report on the economic impact of Google Workspace reveals that its data control capabilities yield a 95% reduction in the data breach risk that businesses face. The same report estimates that Google Workspace saves companies 171 working hours per employee per year. That’s nearly an entire month of additional productivity, and contributes to revenue growth of 1.5% year-over-year.

Capabilities rolling out in phases

Three powerful capabilities are being rolled out, and additional releases are planned for 2023. As described in the above examples, the first capabilities are the ability to:

Export data from SAP S/4HANA into the Google Workspace applications, such as Google Sheets

Import data into SAP S/4HANA from Google Workspace applications, such as through Google Sheets

Collaborate live and simultaneously with others through editing, commenting, and adding content to Google files seamlessly and securely, making it available for import into SAP S/4HANA environment (direct editing in SAP S/4HANA planned for 2023)

We’ll share these developments as they happen, and look forward to continuing to help you get the most from your SAP data.

Read more about how SAP collaborates with Google Cloud and Workspace here.

Read More for the details.

2022 10 18

GCP – Migrate and modernize intelligently with Google Cloud Migration Center

Cloud, Google Cloud gcp

Organizations continue to adopt the public cloud to deliver better business and IT outcomes [1]. However, migration and modernization is a complex, multifaceted challenge. It involves understanding the current state of infrastructure and applications, determining what the future state should look like, and working out what it would take to get there, all while coordinating multiple stakeholders who use myriad tools, processes, and implementation partners.

Daniel Dill, Senior Vice President, Application Operations and Cloud Delivery at Global Payments Inc., summarizes a common sentiment amongst organizations contemplating migration projects: “Our cloud migration approach centered around quickly exiting high cost colo data centers and using those savings to fund the cloud costs and data center consolidation.”

Working with those goals and challenges in mind, last week at Google Cloud Next ‘22, we announced the Preview of Google Cloud Migration Center, a new Google first-party service that is available in the Google Cloud console. Migration Center is designed to streamline your cloud journey with intelligent, data-driven insights and actionable recommendations. This will help you make critical decisions on the optimal migration and modernization pathways for your organization faster and with greater confidence.

With Migration Center, we’ll bring all our planning, migration, and modernization tooling together in one centralized experience within the Google Cloud console with a unified data platform. Built on a new scalable, and extensible data foundation, Migration Center provides the connective tissue between user interfaces and APIs for Google Cloud or partner/ISV-provided discovery, analysis, planning, and migration tools.

One of our customers, Viant Technology, an Irvine, CA-based adtech company, had to move quickly when they were informed their data center location was closing. They used the tools and programs within Migration Center, and their story encapsulates the value of end-to-end streamlining, a key goal of Migration Center.

“We had 6 months to migrate a complex environment comprising 600 virtual machines (VMs), 200+TB of data, 80+ MSSQL databases, and 100 physical servers from on-premises to the cloud,” explains Linh Chung, CIO of Viant. “Google Cloud provided discovery and assessment through the Rapid Assessment & Migration Program (RAMP), giving us a complete inventory to migration plan within 24 hours. We even identified places where cost and redundancy could be cut by determining which VMs were necessary to migrate and which could be turned off.”

“We were also extremely fortunate to have found Slalom,” Linh continues. “They hit the ground running and were exceptionally knowledgeable in all aspects of the migration process. With their help, we were able to complete the entire migration to Google Cloud with an astonishing three weeks to spare.”

Migration Center is also a collaborative effort with inputs from our design partners that include both global and regional partners like Accenture, Cognizant, HCL, Infosys, Maven Wave, an Atos Company, SADA, SoftChoice, Tata Consultancy Services, Thoughtworks and Wipro.

Miles Ward, Chief Technology Officer at SADA, reflects on working with Google Cloud on Migration Center: “We are extremely excited to add customer value with Migration Center to speed up adoption of Google Cloud and make it easier for our customers to move large, complicated workloads to their new location on Google Cloud.”

Some key highlights of this service in the Preview phase include:

Console-integrated, one-stop portal for all things migration and modernization

Quick estimates for running your workloads in Google Cloud with minimal user input. These estimates will include right-sizing and the right mix of Google Cloud infrastructure and are built on editable assumptions from actual customer migrations

Easier initiation of discovery and assessment with in-console reports for asset inventory, delivering insights and best path recommendations for migrating to VMs, containers, databases, dedicated hardware, and managed services

Prescriptive guidance documentation based on best practices for migration planning and execution, distilled from real-world migrations done by Google Professional Services

Migration execution launchpad for kickstarting your actual migrations with proven tools such as Migrate to Containers, the Database Migration Service, and Migrate to Virtual Machines — which recently introduced a preview of migrating from AWS as a source.

Common API surface for customers and partners to enable ecosystem integrations

Speaking about some of what’s included within Migration Center, John Wick, a Senior Manager for Cloud Transformation and Architecture at Accenture, said that “Migrate to Virtual Machines was reliable and able to scale and concurrently migrate large applications while meeting our client’s strict requirements for minimal downtime. The Migrate to Virtual Machines interface is simple and easy for my team to use for migrations”.

Jason Foa, Cloud Platforms Practice Lead at Maven Wave, an Atos Company, echoed similar sentiments, stating that “Migrate for Virtual Machines allows us to help our customers quickly migrate a significant number of workloads in a short time and in a secure way that maintains business continuity. This allows our customers to take advantage of the benefits of Google Cloud rapidly.”

We also recently announced several exciting updates to our comprehensive Rapid Assessment & Migration Program (RAMP). RAMP is a holistic framework based on tangible customer TCO and ROI analyses that supports our customers’ journey to Google Cloud. We announced new assessment and consumption packages delivered via our ecosystem of partners who have completed their cloud migration specialization. You can learn more about it by reading this blog. In addition to this, we have introduced the Customer Onboarding Program to migrate qualified workloads with Google Cloud’s world-class Technical Onboarding Center — at no charge.

With Migration Center, our goal is to put what you need for migration and modernization at your fingertips, so that you can reduce the time, energy, and cost of these crucial projects. As you navigate your journey to the cloud, Migration Center and Google Cloud are here to help. To learn more, check out our web pages for Migration Center and the Rapid Assessment & Migration Program (RAMP), or sign up today for your free discovery and assessment.

The authors would like to thank Tom Nikl, Bobby Allen, Jason Joel, Tatiana Karmanova, Armaan Choudhary, and Nishant Kulkarni for their contributions to this blog.

1. Flexera 2022 State of the Cloud Report

Read More for the details.

2022 10 18

GCP – Top recommendations for building real-time intelligence on Google Cloud

Cloud, Google Cloud gcp

Editor’s note: In this guest blog, Sanjay Chaudhary, VP, Product Management at Exabeam, shares his experiences and lessons from building real-time security insight and action capabilities using Google Cloud. Exabeam is a cybersecurity leader in SIEM and security analytics for organizations worldwide.

Today’s datasphere is opening a new fast lane for intelligent businesses, offering a tremendous advantage to organizations who can leverage it. Business leaders must address this new data with instant analysis and action to both win in the market today, and meet customer expectations tomorrow. Last week I had the pleasure to speak at Google Cloud Next, where I shared some lessons Exabeam learned on how Google Cloud can help create real-time insights and actions, all while removing technical operational complexity through unified batch and streaming analytics capabilities. Below is a summary of the key takeaways:

The best way to achieve real-time intelligence – Unified streaming and batch

This might sound counterintuitive at first, as real-time capabilities are enabled by streaming technologies. However, combining batch and streaming data allows Exabeam to automate decisions based on a complete view of real-time and historical data across all data types. Exabeam is a lean organization, so it is critical to achieve real-time analysis while also maintaining operational efficiency. This means we want to write code once that can run as both stream and batch jobs. Before we adopted Google Cloud this required separate technologies (for example, Apache Spark for batch processing and Apache Flink for real-time stream processing). With Google Cloud we can write code for real-time processing on Dataflow, and then use the same source code to run on BigQuery for batch jobs. We now operate using the same tool with no rewriting required, saving us significant time.

Integration with machine learning at scale

Real-time intelligence is a game changer, but what’s truly revolutionary is the ability to automatically apply intelligence on the data. This ensures we continually make smart decisions based on the most recent data available. At Exabeam we ingest and process large volumes of data, then feed it back to our machine learning models before we reverse extract, transform, load (ETL) the freshest model outputs back to the source system. Dataflow allows us to evaluate millions of events per second across hundreds of machine learning models. No other product comes close to Dataflow in terms of performance and scale for processing machine learning jobs.

Handling the unexpected

Real-time stream processing can be complex and messy, with a key challenge being the handling of the late arriving events. When you want to make important decisions based on events from a particular session window, sometimes the events arrive late and miss the window. Dataflow is unique in handling these unplanned situations, as it allows Exabeam to easily rescore the models to capture these late events.

Do more with less

Cloud transformation is not just about the massive scale capabilities you can offer to your customers; you also have to run your business profitably. For example, when you go to a massive scale data processing pipeline in Spark, it might take you months to mature. This has huge implications on your total cost of ownership and time to market. When we migrated from Spark to Dataflow, we saw an eighty percent reduction in the lines of code. This has helped our data scientists and data engineers spend more time writing efficient code, and less time managing the Spark clusters.

The three most important not-so-obvious benefits of Google Cloud’s Dataflow:

Observability: Dataflow provides a UI for viewing the full pipeline and worker logs, and it reports metrics to Cloud Monitoring. This means you can track the stage of your jobs and look at the metrics for all your processing jobs, which is very convenient and provides peace of mind.

Predictable cost modeling: We build cybersecurity products and serve them to our customers through SaaS applications. Dataflow’s linear cost model is hugely beneficial, as it enables our costs to scale up proportionally based on usage even when running large scale processing jobs. No cost surprises.

Multi-tenancy: Dataflow pipelines allow Exabeam to build a single pipeline and parameterize it to run differently for different tenants. This is very important for Exabeam, as it allows us to offer custom machine learning and advanced model capabilities to each of our customers.

Recommendations and learnings:

Plan for 10X scale: Regardless of your company size, you should think 10X scale for your cloud transformation. If you’re running 10PB of scale today, plan for 100PB. A good example in our industry is that the scale and complexity of endpoint security products today has grown exponentially compared to ten years ago. As you build the data platform, build with the future in mind. Don’t get stuck on a technology stack which you have to reinvent every three to five years.

Cloud-native and serverless: A lot of data products have been built for the on-prem ecosystem, starting with Apache Kafka, Apache Spark, MongoDB, Presto, and Elasticsearch. When you bring these technologies to the cloud, you will spend a lot of time managing them. When you restart your cloud journey, focus on how to get to serverless and cloud-native capabilities as quickly as possible.

To learn more about how Google’s Data Cloud has helped Exabeam, watch my talk at Google Cloud Next here.

Read More for the details.

2022 10 18

GCP – Cloud Bigtable schema tips: Key salting

Cloud, Google Cloud gcp

Cloud Bigtable is a low-latency, high-throughput NoSQL database. With NoSQL, you want to design a schema that can scale and adapt to your business growth. When working with large sets of data in the real world, some rows may be access pattern outliers that receive significantly more activity than the rest, and these require a bit more planning. In this article, we are going to learn how to optimize a Bigtable schema to increase performance on highly active rows in an otherwise well-balanced schema.

Row key design refresher

Bigtable performs best when the throughput is evenly distributed across the entire row key space and can spread across all the nodes. Bigtable rows are physically stored in tablets containing groups of contiguous row keys, and each tablet is distributed to the available nodes. If rows on the same tablet are receiving a disproportionately large percentage of requests compared to other tablets in that node, that can impact performance.

Typically, row keys are designed to be optimized for particular queries. For example, to have queries centered around individual users you may put a user id at the beginning, like so: user_id-xxx-yyy. When some users are significantly more active than others, as is the case for celebrity accounts, writes and reads from their rows could cause hotspotting by putting too much pressure on specific nodes.

If we can distribute the logical row by physically dividing it amongst multiple tablets, then the rows can get balanced across the available nodes and reduce the hotspot.

Key prefix salting

A well-distributed user id would typically work as a row key prefix, so we can use this as the starting point for our row key design:
user_id-xxx-yyy

One strategy to distribute this unbalanced throughput across all Bigtable nodes is to prepend an additional value to the row key design:
01-user_id-xxx-yyy
02-user_id-xxx-yyy

This example has two physical rows corresponding to one logical row which divides the throughput in half. This will distribute all the rows for a particular user id across the rest of the keyspace. Since their prefix is different, they should be able to live on different tablets and are more likely to be hosted on multiple nodes. Note that it is possible for both prefixes to be in the same node or for one prefix to be split into multiple nodes since this setup’s goal is to provide more options to the load balancing mechanism.

Choosing a prefix

Choosing a prefix that doesn’t add much complexity for requests is important to consider. If we used random prefixes, each get request would turn into multiple get requests to ensure the correct row was located. If the prefix is deterministic from the row key, then it allows for minimal changes to single-row read and write requests.

If we would like N divisions, we can take modulo N of the hash of the entire existing row key. We will also refer to N as our salt range.

    // hashCode() can be negative, so floorMod keeps the prefix in [0, saltRange)
    int prefix = Math.floorMod(rowKey.hashCode(), saltRange);
    String saltedRowKey = prefix + "-" + rowKey;

A point lookup and write will still work as the physical key can be computed from the logical key. Salting won’t eliminate the hotspots, but it spreads them into N hotspots of strength 1/N. These less severe hotspots can be more easily processed by individual nodes.
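As a minimal sketch of the deterministic prefix described above, the helper below derives the salt from the logical row key alone. The source only shows calls to getSaltedRowKey, so the method body and the class name KeySalter are assumptions made for illustration. Math.floorMod is used rather than % because String.hashCode() can return a negative number in Java.

```java
class KeySalter {
    // Hypothetical helper: returns "<salt>-<rowKey>" with salt in [0, saltRange).
    static String getSaltedRowKey(String rowKey, int saltRange) {
        // floorMod keeps the result non-negative even when hashCode() is negative.
        int prefix = Math.floorMod(rowKey.hashCode(), saltRange);
        return prefix + "-" + rowKey;
    }

    public static void main(String[] args) {
        String key = "user_id-xxx-yyy";
        // The same logical key always maps to the same physical key,
        // so a point read or write needs only one request.
        String first = getSaltedRowKey(key, 4);
        String second = getSaltedRowKey(key, 4);
        System.out.println(first);
        System.out.println(first.equals(second));
    }
}
```

Because the salt depends only on the key and the salt range, changing the salt range later changes where existing logical rows live, so the range is best chosen once up front.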

Prefix options

If you have common scans over prefixes that you would like to stay intact, you can also hash just part of the row key rather than the entire row key.

For a row key of the format user_id-site-timestamp, you might want efficient scans over user_id and site combinations. Here, we can leave off the timestamp when creating the hash, so the time-series data for those combinations will always be grouped together.

    // Hash everything before the last "-" so the timestamp does not affect the salt
    String rowKeyBase = rowKey.substring(0, rowKey.lastIndexOf("-"));
    // hashCode() can be negative, so floorMod keeps the prefix in [0, saltRange)
    int prefix = Math.floorMod(rowKeyBase.hashCode(), saltRange);
    String saltedRowKey = prefix + "-" + rowKey;

Keys that share a frequently scanned logical prefix receive the same salt, so they can still be scanned efficiently.

This strategy is less resistant to hotspots—the same problem that the salting strategy is supposed to mitigate can come up again if individual user_id, site combinations get significant access.
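The trade-off can be seen in a small sketch. Assuming a row key of the form user_id-site-timestamp and a hypothetical class PartialKeySalter (neither the class nor the method body comes from the source), hashing only the part before the last "-" gives every timestamp for one user_id and site combination the same salt:

```java
class PartialKeySalter {
    // Hypothetical helper: salts on everything before the last "-" (for example user_id-site).
    static String getSaltedRowKey(String rowKey, int saltRange) {
        int cut = rowKey.lastIndexOf('-');
        // Fall back to hashing the whole key when there is no "-" separator.
        String rowKeyBase = cut >= 0 ? rowKey.substring(0, cut) : rowKey;
        int prefix = Math.floorMod(rowKeyBase.hashCode(), saltRange);
        return prefix + "-" + rowKey;
    }

    public static void main(String[] args) {
        // Both keys share "user1-siteA", so they receive the same salt and stay adjacent.
        System.out.println(getSaltedRowKey("user1-siteA-1000", 4));
        System.out.println(getSaltedRowKey("user1-siteA-2000", 4));
    }
}
```

All rows for one combination land under a single salt, which is what keeps the scan cheap and also what lets a very busy combination become a hotspot again.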

Implementation

To implement this in your code, you’ll need to change the areas where you are making requests to Bigtable data. You can view the full source code example on GitHub.

Writing

Using this new technique, if you want to write data, follow these steps:

Take the row key you intend to write to

Compute the prefix using your hash function

Construct the salted row key by concatenating the prefix and row key

Then use the salted row key for writing your data

You will need to integrate this flow everywhere you write data.

In Java, it would look something like this:

String saltedRowKey = getSaltedRowKey(rowKey, SALT_RANGE);
RowMutation rowMutation = RowMutation.create(tableId, saltedRowKey)
    .setCell(….

Reading

Gets

To read individual rows in a table with salted keys, follow the same initial steps as for writing:

Take the row key you intend to read

Compute the prefix using your hash function

Construct the salted row key by concatenating the prefix and row key

Then use the salted row key for reading your data

Since the physical row key is computed deterministically from the logical row key, only one read needs to be issued for each logical key.

In Java, it would look something like this:

Row row = dataClient.readRow(tableId, getSaltedRowKey(rowKey, SALT_RANGE));

Scans

You can follow these steps for each scan:

Take the row key prefix you intend to scan

For each salt value from 0 to N-1:

Construct the salted prefix by concatenating the salt value and the row key prefix

Then use the salted prefix for your prefix scan

Issue these scans in parallel

Combine the results of all the scans

Let’s look at an example. Say you wanted to get all the data for a user and one subcategory; you would do a prefix scan on “user_id-xxx-”. If you’re working with salted rows, you need to issue one prefix scan per possible salt value. If our salt range is 4, then we would do 4 prefix scans:

0-user_id-xxx-

1-user_id-xxx-

2-user_id-xxx-

3-user_id-xxx-

For the best performance you would want to issue each scan in parallel rather than sending all the prefixes into one request. Since the requests are done in parallel, the rows may not be returned in sorted order. If row order is important you will have to do some additional sorting once the results are received.

Because the physical row keys are no longer a contiguous range, these scans may consume more Bigtable CPU, which is an important consideration when choosing a salt range for a scan-heavy workload. Large scans, however, may be more performant as more resources can be used in parallel to serve the request.

In Java, it would look something like this:

List<Query> queries = new ArrayList<>();
for (int i = 0; i < SALT_RANGE; i++) {
  queries.add(Query.create(tableId).prefix(i + "-" + prefix));
}

List<ApiFuture<List<Row>>> futures = new ArrayList<>();
for (Query q : queries) {
  futures.add(dataClient.readRowsCallable().all().futureCall(q));
}

List<Row> rows = new ArrayList<>();
for (ApiFuture<List<Row>> future : futures) {
  rows.addAll(future.get());
}

for (Row row : rows) {
  // Access your row data here.
}
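If row order matters, the results of the parallel scans have to be re-sorted by logical key. The sketch below shows one way to do the fan-out and merge without a Bigtable dependency: a sorted in-memory map stands in for the table, and the class and method names (SaltedScanSketch, prefixScan, scanLogicalPrefix) are assumptions for illustration only.

```java
import java.util.ArrayList;
import java.util.List;
import java.util.Map;
import java.util.NavigableMap;
import java.util.TreeMap;
import java.util.concurrent.CompletableFuture;
import java.util.concurrent.ConcurrentSkipListMap;

// Sketch only: a sorted in-memory map stands in for a Bigtable table.
class SaltedScanSketch {
    static String salted(String rowKey, int saltRange) {
        return Math.floorMod(rowKey.hashCode(), saltRange) + "-" + rowKey;
    }

    /** Stand-in for one prefix scan: copies every entry whose key starts with the prefix. */
    static Map<String, String> prefixScan(NavigableMap<String, String> table, String prefix) {
        // Assumes keys never contain the character U+FFFF, used here as an upper bound.
        return new TreeMap<>(table.subMap(prefix, true, prefix + Character.MAX_VALUE, false));
    }

    /** Fans out one scan per salt value, then merges the results sorted by logical key. */
    static TreeMap<String, String> scanLogicalPrefix(
            NavigableMap<String, String> table, String logicalPrefix, int saltRange) {
        List<CompletableFuture<Map<String, String>>> futures = new ArrayList<>();
        for (int salt = 0; salt < saltRange; salt++) {
            String saltedPrefix = salt + "-" + logicalPrefix;
            futures.add(CompletableFuture.supplyAsync(() -> prefixScan(table, saltedPrefix)));
        }
        TreeMap<String, String> merged = new TreeMap<>();
        for (CompletableFuture<Map<String, String>> future : futures) {
            for (Map.Entry<String, String> e : future.join().entrySet()) {
                String physicalKey = e.getKey();
                // Strip "<salt>-" so callers see logical keys in sorted order.
                merged.put(physicalKey.substring(physicalKey.indexOf('-') + 1), e.getValue());
            }
        }
        return merged;
    }

    public static void main(String[] args) {
        NavigableMap<String, String> table = new ConcurrentSkipListMap<>();
        for (String key : new String[] {
                "user1-siteA-003", "user1-siteA-001", "user2-siteA-001", "user1-siteA-002"}) {
            table.put(salted(key, 4), "value-for-" + key);
        }
        // prints [user1-siteA-001, user1-siteA-002, user1-siteA-003]
        System.out.println(scanLogicalPrefix(table, "user1-siteA-", 4).keySet());
    }
}
```

Collecting into a TreeMap keyed by the logical row key restores the sort order that the salt removed; for very large result sets a streaming merge would likely be preferable.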

Forward-looking migrations

It can be difficult to make a large change to existing datasets, so one way to migrate is to apply the salt only going forward. If you have timestamps at the end of the key, change the code to salt row keys past a certain fixed point in time, and just use an unsalted key for old/existing keys.
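A forward-looking migration could be sketched as follows, assuming row keys end in "-" followed by an epoch-seconds timestamp. The class name ForwardSalting and the cutover value are hypothetical and would need to match your own key format and migration date.

```java
// Sketch only: assumes row keys end in "-<epoch seconds>"; names and cutoff are hypothetical.
class ForwardSalting {
    // Hypothetical cutover point: 2022-10-18T00:00:00Z.
    static final long SALT_CUTOFF_EPOCH_SECONDS = 1666051200L;

    /** Returns the unsalted key for rows older than the cutoff and a salted key otherwise. */
    static String physicalKey(String rowKey, int saltRange) {
        long timestamp = Long.parseLong(rowKey.substring(rowKey.lastIndexOf('-') + 1));
        if (timestamp < SALT_CUTOFF_EPOCH_SECONDS) {
            return rowKey; // existing rows were written without a salt
        }
        return Math.floorMod(rowKey.hashCode(), saltRange) + "-" + rowKey;
    }

    public static void main(String[] args) {
        System.out.println(physicalKey("user1-siteA-1600000000", 4)); // before the cutoff, unchanged
        System.out.println(physicalKey("user1-siteA-1700000000", 4)); // after the cutoff, salted
    }
}
```

With this approach, a scan whose time range spans the cutoff would presumably need both an unsalted scan for the older rows and the salted fan-out for the newer ones.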

Next steps

Read more about reading and writing data to Bigtable

Learn more about Bigtable performance

See if alternative solutions like adding a cache layer to Bigtable could help

2022 10 18

GCP – Using Envoy to create cross-region replicas for Cloud Memorystore

Cloud, Google Cloud gcp

In-memory databases are a critical component for delivering the lowest possible latency to users who might be adding items to online shopping carts, getting personalized content recommendations, or checking their latest account balances. Memorystore makes it easy for developers building these types of applications on Google Cloud to leverage the speed and powerful capabilities of the most loved in-memory store: Redis. Memorystore for Redis offers zonal high availability with a 99.9% SLA for its Standard Tier instances. In some cases, users are looking to expand their Memorystore footprint to multiple regions to support disaster recovery scenarios for regional failure or to provide the lowest possible latency for a multi-region application deployment. We’ll show you how to deploy such an architecture today with the help of the Envoy proxy Redis filter, which we introduced in our previous blog, Scaling to new heights with Cloud Memorystore and Envoy. Envoy makes creating such an architecture both simple and extensible due to its numerous supported configurations. Let’s get started with a hands-on tutorial which demonstrates how you can build a similar solution.

Architecture Overview

Let’s start by discussing an architecture of Google Cloud native services combined with open-source software which enables a multi-region Memorystore architecture. To do this, we’ll be using Envoy to mirror traffic to two Memorystore instances which we’ll create in separate regions. For simplicity, we’ll be using Memtier Benchmark, a popular CLI for Redis load generation, as a sample application to simulate end user traffic. In practice, feel free to use your existing application or write your own.

Because of Envoy’s traffic mirroring configuration, the application does not need to be aware of the various backend instances that exist and only needs to connect to the proxy. You’ll find a sample architecture below and we’ll briefly detail each of the major components.

https://storage.googleapis.com/gweb-cloudblog-publish/images/1_Cloud_Memorystore_YxHM7wC.max-2800x2800.jpg

Before we start, you’ll also want to ensure compatibility with your application by reviewing the list of the Redis commands which Envoy currently supports.

Prerequisites

To follow along with this walkthrough, you’ll need a Google Cloud project with permissions to do the following:

Deploy Cloud Memorystore for Redis instances (required permissions)

Deploy GCE instances with SSH access (required permissions)

Cloud Monitoring viewer access (required permissions)

Access to Cloud Shell or another gcloud-authenticated environment

Deploying the multi-region Memorystore backend

You’ll start by deploying a backend Memorystore for Redis cache which will serve all of your application traffic. You’ll deploy two instances in separate regions so that the deployment is protected against regional outages. We’ve chosen the regions us-west1 and us-central1, though you are free to choose whichever regions work best for your use case.

From an authenticated Cloud Shell environment, this can be done as follows:

$ gcloud redis instances create memorystore-primary --size=1 --region=us-west1 --tier=STANDARD --async

$ gcloud redis instances create memorystore-standby --size=1 --region=us-central1 --tier=STANDARD --async

If you do not already have the Memorystore for Redis API enabled in your project, the command will ask you to enable the API before proceeding. While your Memorystore instances deploy, which typically takes a few minutes, you can move on to the next steps.

Creating the Client and Proxy VMs

Next, you’ll need a VM where you can deploy a Redis client and the Envoy proxy. To protect against regional failures, we’ll create a GCE instance per region. On each instance, you will deploy the two applications, Envoy and Memtier Benchmark, as containers. This type of deployment is referred to as a “sidecar architecture” which is a common Envoy deployment model. Deploying in this fashion nearly eliminates any added network latency as there is no additional physical network hop that takes place.

You can start by creating the primary region VM:

$ gcloud compute instances create client-primary --zone=us-west1-a --machine-type=e2-highcpu-8 --image-family cos-stable --image-project cos-cloud

Next, create the secondary region VM:

$ gcloud compute instances create client-standby --zone=us-central1-a --machine-type=e2-highcpu-8 --image-family cos-stable --image-project cos-cloud

Configure and Deploy the Envoy Proxy

Before deploying the proxy, you need to gather the necessary information to properly configure the Memorystore endpoints. To do this, you need the host IP addresses for the Memorystore instances you have already created. You can gather these as follows:

$ gcloud redis instances describe memorystore-primary --region us-west1 --format=json | jq -r ".host"

$ gcloud redis instances describe memorystore-standby --region us-central1 --format=json | jq -r ".host"

Copy these IP addresses somewhere easily accessible as you’ll use them shortly in your Envoy configuration. You can also find these addresses in the Memorystore console page under the “Primary Endpoint” columns.

Next, you’ll need to connect to each of your newly created VM instances, so that you can deploy the Envoy Proxy. You can do this easily via SSH in the Google Cloud Console. More details can be found here.

After you have successfully connected to the instance, you’ll create the Envoy configuration.

Start by creating a new file named envoy.yaml on the instance with your text editor of choice. Use the following .yaml file, entering the IP addresses of the primary and secondary instances you created:

static_resources:
  listeners:
  - name: primary_redis_listener
    address:
      socket_address:
        address: 0.0.0.0
        port_value: 1999
    filter_chains:
    - filters:
      - name: envoy.filters.network.redis_proxy
        typed_config:
          "@type": type.googleapis.com/envoy.extensions.filters.network.redis_proxy.v3.RedisProxy
          stat_prefix: primary_egress_redis
          settings:
            op_timeout: 5s
            enable_hashtagging: true
          prefix_routes:
            catch_all_route:
              cluster: primary_redis_instance
              request_mirror_policy:
                cluster: secondary_redis_instance
                exclude_read_commands: true
  - name: secondary_redis_listener
    address:
      socket_address:
        address: 0.0.0.0
        port_value: 2000
    filter_chains:
    - filters:
      - name: envoy.filters.network.redis_proxy
        typed_config:
          "@type": type.googleapis.com/envoy.extensions.filters.network.redis_proxy.v3.RedisProxy
          stat_prefix: secondary_egress_redis
          settings:
            op_timeout: 5s
            enable_hashtagging: true
          prefix_routes:
            catch_all_route:
              cluster: secondary_redis_instance
  clusters:
  - name: primary_redis_instance
    connect_timeout: 3s
    type: STRICT_DNS
    lb_policy: RING_HASH
    dns_lookup_family: V4_ONLY
    load_assignment:
      cluster_name: primary_redis_instance
      endpoints:
      - lb_endpoints:
        - endpoint:
            address:
              socket_address:
                address: <primary_region_memorystore_ip>
                port_value: 6379
  - name: secondary_redis_instance
    connect_timeout: 3s
    type: STRICT_DNS
    lb_policy: RING_HASH
    load_assignment:
      cluster_name: secondary_redis_instance
      endpoints:
      - lb_endpoints:
        - endpoint:
            address:
              socket_address:
                address: <secondary_region_memorystore_ip>
                port_value: 6379

admin:
  address:
    socket_address:
      address: 0.0.0.0
      port_value: 8001

The various configuration interfaces are explained below:

Admin: This interface is optional. It allows you to view configuration and statistics, and to query and modify different aspects of the Envoy proxy.

Static_resources: This contains items that are configured during startup of the Envoy proxy. Inside it, we have defined the clusters and listeners interfaces.

Clusters: This interface allows you to define clusters, which we define per region. Inside the cluster configuration, you define all the available hosts and how to distribute load across them. We have defined two clusters, one in the primary region and another in the secondary region. Each cluster can have a different set of hosts and different load balancer policies. Since there is only one host in each cluster, you can use any load balancer policy, as all the requests will be forwarded to that single host.

Listeners: This interface allows you to expose the port on which clients connect and to define the behavior of the traffic received. In this case we have defined two listeners, one for each regional Memorystore instance.

Once you’ve added your Memorystore instance IP addresses, save the file locally to your container OS VM where it can be easily referenced. Make sure to repeat these steps for your secondary instance as well.

Now, you’ll use Docker to pull the official Envoy proxy image and run it with your own configuration. On the primary region client machine, run this command:

$ docker run --rm -d -p 8001:8001 -p 6379:1999 -v $(pwd)/envoy.yaml:/envoy.yaml envoyproxy/envoy:v1.21.0 -c /envoy.yaml

On the standby region client machine, run this command:

$ docker run --rm -d -p 8001:8001 -p 6379:2000 -v $(pwd)/envoy.yaml:/envoy.yaml envoyproxy/envoy:v1.21.0 -c /envoy.yaml

For our standby region, we have changed the binding port to port 2000. This ensures that traffic from our standby clients is routed to the standby instance in the event of a regional failure that makes our primary instance unavailable.

In this example, we are deploying the Envoy proxy manually. In practice, you would typically implement a CI/CD pipeline that deploys the Envoy proxy and binds ports according to your region-based configuration.

Now that Envoy is deployed, you can test it by visiting the admin interface from the container VM:

$ curl -v localhost:8001/stats

If successful, you should see a printout of the various Envoy admin stats in your terminal. Without any traffic yet, these will not be particularly useful, but they allow you to ensure that your container is running and available on the network. If this command does not succeed, we recommend checking that the Envoy container is running. A common issue is a syntax error within your envoy.yaml, which you can find by running your Envoy container interactively and reading the terminal output.
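Beyond the admin endpoint, you may also want to confirm that the Redis listener itself answers. The sketch below is one possible smoke test that speaks the Redis protocol (RESP) directly over a socket, assuming the proxy is published on 127.0.0.1:6379 as in the docker commands above. RespSmokeTest is a hypothetical helper, and it only reads the first line of each reply, which is enough for simple status replies.

```java
import java.io.BufferedReader;
import java.io.IOException;
import java.io.InputStreamReader;
import java.io.OutputStream;
import java.net.InetSocketAddress;
import java.net.Socket;
import java.nio.charset.StandardCharsets;

// Sketch only: RespSmokeTest is a hypothetical helper, not part of Envoy or Memorystore.
class RespSmokeTest {
    /** Encodes a command as a RESP array of bulk strings. */
    static String encode(String... parts) {
        StringBuilder sb = new StringBuilder("*").append(parts.length).append("\r\n");
        for (String part : parts) {
            int byteLength = part.getBytes(StandardCharsets.UTF_8).length;
            sb.append('$').append(byteLength).append("\r\n").append(part).append("\r\n");
        }
        return sb.toString();
    }

    /** Sends one command and returns the first line of the reply. */
    static String sendCommand(String host, int port, String... parts) throws IOException {
        try (Socket socket = new Socket()) {
            socket.connect(new InetSocketAddress(host, port), 3000);
            socket.setSoTimeout(5000);
            OutputStream out = socket.getOutputStream();
            out.write(encode(parts).getBytes(StandardCharsets.UTF_8));
            out.flush();
            BufferedReader in = new BufferedReader(
                    new InputStreamReader(socket.getInputStream(), StandardCharsets.UTF_8));
            return in.readLine();
        }
    }

    public static void main(String[] args) {
        try {
            // Assumed address: the local Envoy listener published by the docker command.
            System.out.println(sendCommand("127.0.0.1", 6379, "PING"));
            System.out.println(sendCommand("127.0.0.1", 6379, "SET", "smoke-test-key", "hello"));
        } catch (IOException e) {
            System.out.println("Proxy not reachable: " + e.getMessage());
        }
    }
}
```

A healthy proxy and backend would typically answer with status replies such as +PONG and +OK; an error reply or a timeout suggests checking the cluster addresses in envoy.yaml.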

Deploy and Run Memtier Benchmark

After reconnecting to the primary client instance in us-west1 via SSH, you will now deploy the Memtier Benchmark utility which you’ll use to generate artificial Redis traffic. Since you are using Memtier Benchmark, you do not need to provide your own dataset. The utility will populate the cache for you using a series of set commands.

$ for i in {1..5}; do docker run --network="host" --rm -d redislabs/memtier_benchmark:1.3.0 -s 127.0.0.1 -p 6379 --threads 2 --clients 10 --test-time=300 --key-maximum=100000 --ratio=1:1 --key-prefix="memtier-$RANDOM-"; done

Validate the cache contents

Now that we’ve generated some data from our primary region’s client, let’s ensure that it has been written to both of our regional Memorystore instances. We can do this with the Cloud Monitoring Metrics Explorer. There, configure the chart via “MQL”, which can be selected at the top of the explorer pane. For ease, we’ve created a query which you can simply paste into your console to populate your graph:

fetch redis_instance
| metric 'redis.googleapis.com/keyspace/keys'
| filter
    (resource.instance_id =~ '.*memorystore.*') && (metric.role == 'primary')
| group_by 1m, [value_keys_mean: mean(value.keys)]
| every 1m

If you have created your Memorystore instances with a different naming convention or have other Memorystore instances within the same project, you may need to modify the resource.instance_id filter. Once you’re finished, ensure that your chart is viewing the appropriate time range, and you should see something like:

In this graph, you should see two similar lines which show the same number of keys in both Memorystore instances. If you want to view metrics for a single instance, you can use the default monitoring graphs which are available from the Memorystore console after selecting a specific instance.

Simulate Regional Failure

Regional failure is a rare event. We will simulate this by deleting our primary Memorystore instance and primary client VM.

Let’s start by deleting our primary Memorystore instance like:

$ gcloud redis instances delete memorystore-primary --region=us-west1

And then our client VM like:

$ gcloud compute instances delete client-primary

Next, we’ll need to generate traffic from our secondary region client VM which we are using as our standby application.

For the sake of this example, we’ll manually perform a failover and generate traffic to save time. In practice, you’ll want to devise a failover strategy to automatically divert traffic to the standby region when the primary region becomes unavailable. Typically, this is done with the help of services like Cloud Load Balancing. Once more, SSH into the secondary region client VM from the console and run the Memtier Benchmark application as described in the previous section. You can validate that reads and writes are properly routing to our standby instance by viewing the console’s monitoring graphs once more.
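To make the idea of automated failover a little more concrete, here is a deliberately simplified client-side sketch: it tries a list of endpoints in preference order and picks the first one that accepts a TCP connection. The class EndpointSelector and the example addresses are assumptions for illustration; a managed load balancer with health checks would normally take over this role in production.

```java
import java.io.IOException;
import java.net.InetSocketAddress;
import java.net.Socket;
import java.util.Arrays;
import java.util.List;

// Sketch only: EndpointSelector is a hypothetical helper, not a Google Cloud API.
class EndpointSelector {
    /** Returns true if a TCP connection to the endpoint succeeds within the timeout. */
    static boolean isReachable(InetSocketAddress endpoint, int timeoutMillis) {
        try (Socket socket = new Socket()) {
            socket.connect(endpoint, timeoutMillis);
            return true;
        } catch (IOException e) {
            return false;
        }
    }

    /** Returns the first reachable endpoint in preference order, or null if none respond. */
    static InetSocketAddress firstReachable(List<InetSocketAddress> endpoints, int timeoutMillis) {
        for (InetSocketAddress endpoint : endpoints) {
            if (isReachable(endpoint, timeoutMillis)) {
                return endpoint;
            }
        }
        return null;
    }

    public static void main(String[] args) {
        // Hypothetical internal addresses of the primary and standby region proxies.
        List<InetSocketAddress> endpoints = Arrays.asList(
                new InetSocketAddress("10.0.0.2", 6379),
                new InetSocketAddress("10.1.0.2", 6379));
        System.out.println("Selected endpoint: " + firstReachable(endpoints, 1000));
    }
}
```

A TCP connect check only shows that something is listening; it does not prove that the Memorystore instance behind the proxy is healthy, so treat this as a starting point rather than a complete strategy.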

Once the original primary Memorystore instance is available again, it will become the new standby instance based on our Envoy configuration. It will also be out of sync with our new primary instance as it has missed writes during its unavailability. We do not intend to cover a detailed solution in this post, but we find that most users opt to rely on TTL which they have set on their keys to determine when their caches will eventually be in sync.

Clean Up

If you have followed along, you’ll want to spend a few minutes cleaning up resources to avoid accruing unwanted charges. You’ll need to delete the following:

Any deployed Memorystore instances

Any deployed GCE instances

Memorystore instances can be deleted like:

$ gcloud redis instances delete <instance-name> --region=<region>

The GCE container OS instance can be deleted like:

$ gcloud compute instances delete <instance-name>

If you created additional instances, you can simply chain them in a single command separated by spaces.

Conclusion

While Cloud Memorystore Standard tier provides high availability, some use cases require an even higher availability guarantee. Envoy and its Redis filter make creating a multi-regional deployment simple and extensible. The outline provided above is a great place to get started. These instructions can easily be extended to support automated region failover or even dual region active-active deployments. As always, you can learn more about Cloud Memorystore through our documentation or request desired features via our public issue tracker.

2022 10 18

GCP – Introducing lock insights and transaction insights for Spanner: troubleshoot lock contentions with pre-built dashboards

Cloud, Google Cloud gcp

As a developer, DevOps engineer, or database administrator, you typically have to deal with database lock issues. Often, rows locked by queries cause lags and can slow down applications, resulting in a poor user experience. Today, we are excited to announce the launch of lock insights and transaction insights for Cloud Spanner, which provide a set of new visualization tools for developers and database administrators to quickly diagnose lock contention issues on Spanner.

If you observe application slowness, a common issue could be lock contentions, which happen when multiple transactions are trying to modify the same row. Debugging lock contentions is not easy as it requires identifying the row ranges and columns on which transactions are contending for locks. This process can be tedious and time consuming without a visual interface. Today, we are solving this problem for customers.

Lock insights and transaction insights provide pre-built dashboards that make it easy to detect row ranges with the highest lock wait time, find transactions reading or writing on these row ranges, and identify the transactions with highest latencies causing these lock conflicts.

Earlier this year, we launched query insights for debugging query performance issues. Together with lock insights and transaction insights, these capabilities provide developers easy-to-use observability tools to troubleshoot issues and optimize the performance of their Spanner databases. Lock insights and transaction insights are available at no additional cost.

“Lock insights will be very helpful to debug lock contention which typically takes hours,” said Dominick Anggara, MSc., Staff Software Engineer at Kohl’s. “It allows the user to see the big picture, and make it easy to make correlations, and then narrow down to specific transactions. That’s what makes it powerful. Really looking forward to using this in production.”

Why do lock issues happen?

Most databases take locks on data to prohibit other transactions from concurrently changing the data to preserve data integrity. When you access data with the intent to change it, a lock prohibits other transactions from accessing the data while it is being modified. But when the data is locked, it can negatively impact application performance as other tasks wait to access the data.

Cloud Spanner, Google Cloud’s fully managed horizontally scalable relational database service, offers the strictest concurrency-control guarantees, so that you can focus on the logic of the transaction without worrying about data integrity. To give you this peace of mind, and to ensure consistency of multiple concurrent transactions, Spanner uses a combination of shared locks and exclusive locks at the table cell level (granularity of row-and-column) and not at the whole row level. You can learn more about different types of Lock modes for Spanner in our documentation.

Follow a visual journey with pre-built dashboards

With lock insights and transaction insights, developers can smoothly move from detection of latency issues to diagnosis of lock contentions, and ultimately identification of transactions that are contending for locks. Once the transactions causing the lock conflicts are identified, you can then try to identify issues in each transaction that are contributing to the problem.

You could do this by following a simple journey where you can quickly confirm if the application slowness is due to lock contentions, correlate row ranges and columns which have the highest lock wait time with the transactions taking locks on these row ranges, identify the transactions with the highest latencies, and analyze these transactions which are contending on locks. Let’s walk through an example scenario.

Diagnose application slowness

This journey will start by setting up an alert on Google Cloud Monitoring for latency (api/request_latencies) going above a certain threshold. The alert could be configured in a way that if this threshold is crossed, you will be notified with an email alert, with a link to the “Monitoring” dashboard.

Once you receive this alert, you would click on the link in the email, and navigate to the “Monitoring” dashboard. If you observe a spike in read/write latency, no observable spike in CPU utilization, and a dip in Throughput and/or Operations per second, a possible root cause could be lock contentions. A combination of these patterns in these metrics could be a strong signal that the system is locking due to the transactions contending on the same cells, even though the workload remains the same. Below, you can observe a spike between 5:45 PM and 6:00 PM. This could be due to new application code deployment which might have introduced a new access pattern.

The next step is to confirm that this application slowness is indeed due to lock contentions. This is where lock insights comes in. You can get to this tool by clicking on “Lock insights” in the left navigation of the Spanner Instance view in your Cloud Console. Here, the first graph that you see will be for Total lock wait time. If you observe a spike on this graph in the same time window, this would confirm that the application’s slowness is due to lock contentions.

Correlating row ranges, columns and transactions

Now you can select the database which is seeing the spike in total lock wait time, and drill down to see the row ranges with the highest lock wait times. When you click on the row range with the highest lock wait time, a right panel opens. It shows sample lock requests for that row range, including the columns which were read from or written to, the type of lock which was acquired on this row-column combination (database cell), and links to view the transactions which were contending for these locks. This helps you correlate row ranges, columns, and transactions, and makes it seamless to switch between lock insights and transaction insights, as explained in the next section.

In the above screenshot, we can see that at 5:53 PM, the first row range in the table (order_item(82,12)) is showing the highest lock wait times. You can investigate further by looking at the transactions which were acting on the sample lock columns.

Identifying transactions with highest write latencies causing locks

When you click on “View transactions” on the lock insights page, you will navigate to the transaction insights page with the topN transactions table (by latency) filtered on the Sample lock Columns from the previous page (lock insights), so you will view the topN transactions in the context of the locks (and row ranges) which were identified earlier in the journey.

In this example we can see that the first transaction, which reads from and writes to the columns item_inventory._exists and item_inventory.count, has the highest latencies and could be one of the transactions causing lock contentions. We can also see that the second transaction in the table is trying to read from the same column, and could be waiting on locks since its average latency is high. We should drill down and investigate both of these transactions.

Analyzing transactions to fix lock contentions

Once you have identified the transactions causing the locks, you can drill down into these transaction shapes to analyze the root cause of lock contentions.

You can do this by clicking on the Fingerprint ID for the specific transactions from the topN table, and navigating to the Transaction Details page where you will be able to see a list of metrics (Latency, CPU Utilization, Execution count, Rows Scanned / Rows Returned) over a time series for that specific transaction.

In this example, we notice that when we drill down into the second transaction, this transaction is only attempting to read and not write. By definition, the topN transactions table (on the previous page) only shows read-write transactions which take locks. We can also see that the abort count / total attempt count ratio (28/34) is very high, which means that most of the attempts are getting aborted.
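As a quick worked check of the ratio mentioned above, 28 aborted attempts out of 34 total attempts means that roughly 82% of attempts were aborted. The small sketch below just performs that arithmetic; the class name AbortRatio is hypothetical and the numbers are taken from the example.

```java
import java.util.Locale;

// Sketch only: AbortRatio is a hypothetical helper for the arithmetic in this example.
class AbortRatio {
    /** Returns the fraction of transaction attempts that were aborted. */
    static double abortRatio(long abortedAttempts, long totalAttempts) {
        if (totalAttempts <= 0) {
            throw new IllegalArgumentException("totalAttempts must be positive");
        }
        return (double) abortedAttempts / totalAttempts;
    }

    public static void main(String[] args) {
        // 28 aborts out of 34 attempts, as in the example above.
        System.out.println(String.format(Locale.ROOT, "%.1f%%", 100 * abortRatio(28, 34))); // prints 82.4%
    }
}
```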

Fixing the issue

To fix the problem in this scenario, you can convert this transaction from a read-write transaction to a read-only transaction. This prevents it from taking locks on the cell, which reduces lock contention and write latencies.

By following this simple visual journey, you can easily detect, diagnose and fix lock contention issues on Spanner.

When looking at potential issues in your application, or even when designing your application, consider these best practices to reduce the number of lock conflicts in your database.

Get started with lock insights and transaction insights today

To learn more about lock insights and transaction insights, review the documentation here, and watch the explainer video here.

Lock insights and transaction insights are enabled by default. In the Spanner console, you can click on “Lock insights” and “Transaction insights” in the left navigation and start visualizing lock issues and transaction performance metrics!

New to Spanner? Create a 90-day Spanner free trial instance. Try Spanner for free.

2022 10 18

GCP – Announcing open innovations for a new era of systems design

Cloud, Google Cloud gcp

We’re at a pivotal moment in systems design. Demand for computing is growing at insatiable rates. At the same time, the slowing of Moore’s law means that improvements to CPU performance, power consumption, memory and storage cost efficiencies have all plateaued. These headwinds are further exacerbated by new challenges in reliability and security.

At Google, we’ve responded to these challenges and opportunities with system design innovations across the stack: from new custom-silicon accelerators (e.g., TPU, VCU, and IPU), new hardware and data center infrastructure, all the way to new distributed systems and cloud solutions. But this is only the beginning. There are many more opportunities for advancements, including closely-coupled accelerators for core data center functions to minimize the so-called “data center tax.” As server and data center infrastructure diverges from decades-old traditional designs to be more modular, heterogeneous, disaggregated, and software-defined, distributed systems are also entering a new epoch — one defined by optimizations for the “killer microsecond” and novel programming models optimized for low-latency and accelerators.

At Google, we believe that these new opportunities and challenges are best addressed together, across the industry. Today, at the Open Compute Project (OCP) Global Summit, we are demonstrating our support of open hardware ecosystems, presenting at more than 40 talks, and announcing several key contributions:

Server design: We will share Google’s vision for a “multi-brained” server of the future, transforming traditional server designs to more modular disaggregated distributed systems across host computing, accelerators, memory expansion trays, infrastructure processing units, etc. We are sharing the work we are doing with all our OCP partners on the varied innovations needed to make this a reality — from modular hardware with DC-MHS, standardized management with OpenBMC and RedFish, standardized root of trust, and standardized interfaces including CXL, NVMe and beyond.

Trusted computing: The root of trust is an essential part of future systems. Google has a tradition of making contributions for transparent and best-in-class security, including our OpenTitan discrete security solutions on consumer devices. We are looking ahead to future innovations in confidential computing and varied use cases that require chip-level attestation at the level of a package or System on a Chip (SoC). Together with other industry leaders, AMD, Microsoft, and NVIDIA, we are contributing Caliptra, a reusable IP block for root of trust measurement, to OCP. In the coming months we will roll out initial code for the community to collectively harden.

Reliable computing: To address the challenges of reliability at scale, we’ve formed a new server-component resilience workstream at OCP, along with AMD, ARM, Intel, Meta, Microsoft, and NVIDIA. Through this workstream, we’ll develop consistent metrics about silent data errors and corruptions for the broader industry to track. We’ll also contribute test execution frameworks and suites, and provide access to test environments with faulty devices. This will enable the broader community, across industry and academia, to take a systems approach to addressing silicon faults and silent data errors.

Sustainability: Finally, we’re announcing our support for a new initiative within OCP to support environmental sustainability as a key tenet across the ecosystem. Google has been a leader in environmental sustainability for many years. We have been carbon neutral since 2007, powered by 100% renewable energy since 2017, and have an ambitious goal to achieve net-zero emissions across all of our operations and value chain by 2030. In turn, as the cleanest cloud in the industry, we have helped customers track and reduce their carbon footprint and achieve significant energy savings. We’re excited to share these best practices with OCP and work with the broader community to standardize sustainability measurement and optimization in this important area.

As the industry body focused on system integration (e.g., compute, memory, storage, management, power and cooling), the OCP Foundation is uniquely positioned to facilitate the industry-wide codesign we need. Google is active in OCP, serving in leadership roles, incubating new initiatives, and supporting numerous contributions.

These announcements are the latest example of our history of fostering open and standards-based ecosystems. Open ecosystems enable a diverse product marketplace, with agility in time-to-market, and the opportunity to be strategic about innovation. Google’s open source leadership is multidimensional: driving industry standardization and adoption, strong and varied community contributions to grow the ecosystem, as well as broad policy and organizational leadership and sharing of best practices.

The four initiatives we are announcing today, in combination with the Google-led talks at the OCP Summit, provide a small glimpse into the exciting new era of systems ahead. We look forward to working with the broader OCP community and other industry organizations to build a vibrant open hardware ecosystem to support even more innovation in this space. Please join us in this exciting journey.

2022 10 18

GCP – Easy Telemetry Instrumentation on GKE with the OpenTelemetry Operator

Cloud, Google Cloud gcp

In recent years, the application monitoring landscape has exploded with instrumentation libraries, SDKs, and backends for storage and visualization. But a major friction point is still the investment required to instrument applications with these libraries, and libraries are often tied to a small set of telemetry backends.

The OpenTelemetry project has tried to address these problems with its open standard for telemetry data, multi-lingual instrumentation libraries, and “auto-instrumentation” of applications without needing code changes. The community has also created a Kubernetes Operator to automate some of these solutions in containerized environments.

The OpenTelemetry Operator is designed to provide auto-instrumentation to export traces and metrics in new and existing applications without any code changes. It also automates the deployment of OpenTelemetry Collectors, which offer vendor-agnostic monitoring pipelines. These two features help simplify both the onboarding process for new users and the day-to-day operational burden of monitoring fully-instrumented applications.

Across this spectrum of use cases, we’ve identified a few common problems that can be easily solved using the OpenTelemetry Operator. We gathered these under a single GitHub repository (https://github.com/GoogleCloudPlatform/opentelemetry-operator-sample) with the goal of providing a set of samples and walkthroughs for installing and working with the Operator. In this post, we’ll talk about some benefits of the OpenTelemetry Operator and how to use it.

Getting started with the OpenTelemetry Operator

The two main functions of the Operator are managing OpenTelemetry Collectors and auto-instrumenting applications. You can configure these Collectors and instrumentation through Custom Resources (explained in more detail later in this post).

The Operator is fairly simple to install once you have a GKE cluster provisioned. The only prerequisite is to install cert-manager, which the Operator uses to manage its webhook certificates. On GKE Autopilot, the easiest way to do this is with Helm, which allows passing the extra arguments that Autopilot requires (we’ve documented these in the repo).

Ultimately, the Operator is installed with a single kubectl command such as:

```shell
kubectl apply -f https://github.com/open-telemetry/opentelemetry-operator/releases/download/v0.60.0/opentelemetry-operator.yaml
```

After a few moments, you’ll see the Operator running in the opentelemetry-operator-system namespace, eagerly waiting to serve your telemetry needs. But for it to actually do something, we need to have it set up a Collector.

Deploying and configuring the Collector

The first main feature of the Operator is that it can install, manage, and configure instances of the OpenTelemetry Collector. For background, the Collector works with numerous plugins to process and export telemetry data. That data can be ingested from a variety of sources with receivers, modified in-transit with processors, and reported to several popular telemetry backend services with exporters. Essentially, it provides a routing solution for flexible, vendor-agnostic, and instrumentation-independent observability.
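To make the receiver, processor, and exporter roles concrete, here is a minimal Python sketch of the routing idea. It is not how the Collector is implemented or configured (the Collector is driven by YAML configuration); the `Pipeline` class, the `add_cluster_attribute` processor, and the in-memory "backends" are all hypothetical names used only to illustrate the data flow.

```python
from dataclasses import dataclass, field
from typing import Callable, Dict, List

Span = Dict[str, str]  # simplified stand-in for one telemetry record


@dataclass
class Pipeline:
    """Hypothetical model of one pipeline: processors run first, then exporters."""
    processors: List[Callable[[Span], Span]] = field(default_factory=list)
    exporters: List[Callable[[Span], None]] = field(default_factory=list)

    def receive(self, span: Span) -> None:
        # A receiver hands incoming data to the pipeline.
        for process in self.processors:
            span = process(span)       # processors modify data in transit
        for export in self.exporters:
            export(dict(span))         # each exporter gets its own copy


def add_cluster_attribute(span: Span) -> Span:
    # Example processor: enrich each record with an extra attribute.
    return {**span, "cluster": "demo-cluster"}


# Two lists stand in for two different telemetry backends.
backend_a: List[Span] = []
backend_b: List[Span] = []

pipeline = Pipeline(
    processors=[add_cluster_attribute],
    exporters=[backend_a.append, backend_b.append],
)
pipeline.receive({"name": "GET /checkout"})
print(backend_a)
```

The application that produced the record never needs to know which backends exist. Swapping or adding an exporter changes only the pipeline definition, which is the vendor-agnostic property described above.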

Setting up the Collector with the Operator is done by creating an OpenTelemetryCollector object. This is a namespaced Custom Resource, so it is created in the same Namespace you want the Collector to run in, for example:

```yaml
apiVersion: opentelemetry.io/v1alpha1
kind: OpenTelemetryCollector
metadata:
  name: collector
  namespace: otel-collector
spec:
  config: |
    receivers:
      otlp:
        protocols:
          grpc:
          http:
    processors:

    exporters:
      logging:

    service:
      pipelines:
        traces:
          receivers: [otlp]
          processors: []
          exporters: [logging]
```

The OpenTelemetryCollector object has numerous settings to control how the Collector is deployed. For example, it can be deployed as a sidecar container by setting mode: sidecar like so:

```yaml
apiVersion: opentelemetry.io/v1alpha1
kind: OpenTelemetryCollector
metadata:
  name: collector-sidecar
spec:
  mode: sidecar
  …
```

The Operator will then inject the Collector sidecar into any Pod that has the annotation sidecar.opentelemetry.io/inject: "true". It’s a common pattern to run sidecar Collectors with resource detection, each forwarding telemetry to a load-balanced Collector service that exports the data. In that setup, the sidecar Collector can detect accurate resource information for the Pod while the service Collector can handle processing and exporting in one place.
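As an illustration of the opt-in annotation, the following Python sketch builds a Pod manifest that carries it. The helper name `pod_with_sidecar`, the Pod name, and the image are hypothetical placeholders; only the annotation key and its "true" value come from the text above.

```python
import json

SIDECAR_ANNOTATION = "sidecar.opentelemetry.io/inject"


def pod_with_sidecar(name: str, image: str) -> dict:
    """Build a Pod manifest that opts in to Collector sidecar injection."""
    return {
        "apiVersion": "v1",
        "kind": "Pod",
        "metadata": {
            "name": name,
            # Annotation values must be strings, hence "true" and not True.
            "annotations": {SIDECAR_ANNOTATION: "true"},
        },
        "spec": {"containers": [{"name": "app", "image": image}]},
    }


# "my-app" and its image are placeholders for your own workload.
manifest = pod_with_sidecar("my-app", "my-app:latest")
print(json.dumps(manifest, indent=2))
```

Because kubectl accepts JSON as well as YAML manifests, the printed output could be saved to a file and applied with kubectl apply -f.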

Using the OpenTelemetryCollector object also provides an easy way to set a specific container image to use for the Collector. This is really useful for working with custom Collectors, such as those built with the Collector builder tool (see our sample repository for working with the Collector builder at https://github.com/GoogleCloudPlatform/opentelemetry-collector-builder-sample). Custom-built Collectors allow you to completely control the build and deployment pipeline for your Collector, offering the ability to strip down a Collector to only the components you need. The resulting image saves space and reduces the dependency surface area for security risks.

For example, run a custom Collector image with the following config change:

```yaml
apiVersion: opentelemetry.io/v1alpha1
kind: OpenTelemetryCollector
metadata:
  name: custom-collector
spec:
  image: gcr.io/my-project/otel-builds/collector:latest
  …
```

Once the Collector is deployed, the Operator monitors it to ensure it stays active and reconciles it so that it remains consistent with the settings you provide, for example when you change the Collector’s config or create new Pods with a sidecar annotation. But deploying the Collector is only one of the Operator’s features. The other big feature is its ability to manage auto-instrumentation of applications.

Adding auto-instrumentation to Pods and Deployments

The second main feature of the Operator is that it provides easy auto-instrumentation for telemetry. Even with a Collector ready to ingest and route your telemetry, you still need to instrument your applications to report some data. In general, instrumentation can be done either manually or automatically. While manual instrumentation involves modifying application code to use OpenTelemetry SDKs, automatic instrumentation uses OpenTelemetry agents to inject instrumentation into your program at runtime.

Obviously, the option of auto-instrumenting your applications comes with less engineering investment than manual instrumentation. This makes it great for rapidly onboarding to OpenTelemetry, and is currently available for applications written in Java, NodeJS, Python, and .NET. It is even possible to combine auto-instrumentation across languages, making it useful for a system design with multi-language microservices.

The OpenTelemetry Operator provides this auto-instrumentation through mutating webhooks that inject a sidecar container into the application Pod’s specification. This sidecar container provides instrumentation agent code and shares a filesystem with the Pod’s main container so that it can inject traces into the application. Most importantly, it does all of this without the need to write any code.

To get started with auto-instrumentation, first create an Instrumentation object to define your desired instrumentation settings (such as sample rate):

```yaml
apiVersion: opentelemetry.io/v1alpha1
kind: Instrumentation
metadata:
  name: my-instrumentation
spec:
  exporter:
    endpoint: http://otel-collector:4317
  propagators:
    - tracecontext
    - baggage
    - b3
  sampler:
    type: parentbased_traceidratio
    argument: "0.25"
```

Like OpenTelemetryCollector objects, Instrumentation is also a namespaced Custom Resource. Namespacing in this case gives you finer control over the auto-instrumentation settings that are passed on to different parts of your application (though Pods can reference Instrumentations across Namespaces too).
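The sampler settings in the Instrumentation object (type parentbased_traceidratio with argument "0.25") can be read roughly as: follow the parent span's sampling decision if there is one, otherwise keep about 25% of new traces based on the trace ID. The Python sketch below illustrates that idea only. It is loosely modeled on the trace-ID-ratio approach and is not the code any particular OpenTelemetry SDK runs; the function names are hypothetical.

```python
import random

TRACE_ID_LIMIT = (1 << 64) - 1  # this sketch looks only at the low 64 bits


def ratio_sampled(trace_id: int, ratio: float) -> bool:
    """Deterministic ratio decision: keep IDs whose low 64 bits fall below a bound."""
    bound = round(ratio * (TRACE_ID_LIMIT + 1))
    return (trace_id & TRACE_ID_LIMIT) < bound


def should_sample(trace_id: int, ratio: float, parent_sampled=None) -> bool:
    """Parent-based wrapper: follow the parent's decision when there is one."""
    if parent_sampled is not None:
        return parent_sampled
    return ratio_sampled(trace_id, ratio)  # root span: fall back to the ratio


# Simulate many root spans with random 128-bit trace IDs (fixed seed for repeatability).
rng = random.Random(42)
trace_ids = [rng.getrandbits(128) for _ in range(100_000)]
kept = sum(should_sample(t, 0.25) for t in trace_ids)
print(f"kept {kept / len(trace_ids):.1%} of root traces")
```

Deciding from the trace ID rather than a fresh random number means every service that sees the same trace reaches the same decision, so sampled traces stay complete across services.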

The Operator will map these settings to an annotation on corresponding application Pods. This annotation takes the form of instrumentation.opentelemetry.io/inject-<LANGUAGE> and it’s how you opt-in a Pod for auto-instrumentation. For example, to auto-instrument a Pod running a Java application, add the following annotation to the Pod’s metadata:

instrumentation.opentelemetry.io/inject-java: "my-instrumentation"

The my-instrumentation value refers to the name of the Instrumentation object created earlier. This can also be set to an Instrumentation object from a different namespace, or true to pick a default Instrumentation from the current namespace, or false to explicitly opt the Pod out from auto-instrumentation. These options offer the flexibility to mix and match different instrumentation configurations in a system by referring to different configurations on different services. The full list of available annotations is documented on the Operator’s GitHub.
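The possible annotation values can be summarized in a small lookup function. This is a sketch of the semantics described above, not the Operator's own code. The function name is hypothetical, and the "namespace/name" form for cross-namespace references is an assumption based on the Operator's documentation, so check the annotation list on GitHub before relying on it.

```python
from typing import Optional, Tuple


def resolve_instrumentation(
    value: str, pod_namespace: str
) -> Optional[Tuple[str, Optional[str]]]:
    """Return (namespace, name) of the Instrumentation to use, or None to opt out.

    A name of None means "use the default Instrumentation in that namespace".
    """
    if value == "false":
        return None                       # explicit opt-out
    if value == "true":
        return (pod_namespace, None)      # default Instrumentation, same namespace
    if "/" in value:
        # Assumed "namespace/name" form for a cross-namespace reference.
        namespace, name = value.split("/", 1)
        return (namespace, name)
    return (pod_namespace, value)         # named Instrumentation, same namespace


print(resolve_instrumentation("my-instrumentation", "shop"))
print(resolve_instrumentation("observability/my-instrumentation", "shop"))
```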

Once a Pod is annotated, the Operator will automatically pick up the change and update the Pod spec to include the auto-instrumentation sidecar container. The same type of annotation can be used for Pods within a Deployment (or any kind of replica controller, such as a CronJob or ReplicaSet). Simply set the annotation on the Pod template metadata defined in the Deployment. For example, instrument a Python application by adding the annotation to its Deployment as shown below:

```yaml
apiVersion: apps/v1
kind: Deployment
metadata:
  name: python-deployment
  labels:
    app: py-app
spec:
  replicas: 3
  selector:
    matchLabels:
      app: py-app
  template:
    metadata:
      labels:
        app: py-app
      annotations:
        instrumentation.opentelemetry.io/inject-python: "true"
    spec:
      containers:
      - name: app
        image: my-python-app:latest
```

It’s important to note that the annotation is in the metadata for the Pod template, not the Deployment. Alternatively, the annotation can be added to the metadata for a Namespace to apply the referenced Instrumentation configuration to all Pods in the Namespace. For example, inject the NodeJS auto-instrumentation to every pod in the my-node-app Namespace with the following command:

kubectl annotate namespace my-node-app instrumentation.opentelemetry.io/inject-nodejs="true"

Annotating an entire Namespace like this is great for clusters with high churn (such as with serverless applications) or multi-tenancy setups, enforcing observability settings on tenant projects. Both cases show that auto-instrumentation is useful not just for experimenting, but for high-scale production workloads.

Our solutions repository for GKE users

The Operator makes it much easier to set up OpenTelemetry collection and auto-instrumentation than the alternative, manual process. However, we recognize that there are still a few steps that could be simplified and common setups to document, especially for new users who are just interested in trying it out for a quick demo. To help with this, we launched a new GitHub repository at https://github.com/GoogleCloudPlatform/opentelemetry-operator-sample focused on improving ease-of-use with the Operator.

Sample Apps

The repository includes ready-to-use configs, simplified commands for deploying and working with the Operator, and even a few sample apps written in various languages. These samples provide an end-to-end demonstration of setting up the Operator, configuring the Collector, and adding auto-instrumentation for a variety of use cases.

One of these samples is a basic client-server NodeJS app meant to showcase auto-instrumentation without any code modifications. Build this app yourself and run it alongside the Operator to play with auto-instrumentation and get a hands-on understanding of how it works before adding it to your existing workloads:

```shell
make build
make deploy
kubectl apply -f k8s/.
```

Replace our sample app with your own production NodeJS application and the process is identical (just add annotations!).

Whether you’re experimenting with sample apps or already auto-instrumenting your production workloads, the recipes shown in this repository are meant to walk you through the next step of configuring your telemetry ingestion pipeline.

Config Recipes

To demonstrate different use cases of the Operator, we’ve curated a set of recipes covering various Operator configurations. Each recipe works with the sample apps to enrich those demos, but they are also production-ready so there’s no difference between the steps to experiment and deploy them.

For example, the Cloud Trace integration recipe updates a Collector configuration to enable reporting to the GCP tracing backend. This is the same configuration you would use for a production application, and if you already have a Collector set up with the Operator it can be turned on with one command:

kubectl apply -f collector-config.yaml

Over time, we plan to grow this repository with more configurations for different use cases. These off-the-shelf recipes are tailored for usability on GKE and preconfigured to work optimally with other GCP products like Cloud Build and Cloud Trace.

We want this repository to provide a source for both onboarding with OpenTelemetry and putting the Operator to use in production environments. This repository is open-source and accepting contributions, so if you have a request or problem to report feel free to reach out on GitHub.

2022 10 18

GCP – How CBcloud is improving the working conditions of Japanese delivery people with Google Maps Platform

Cloud, Google Cloud gcp

Today’s post comes from Taichirou Tokumori, Software Development Engineer, CBcloud, and explores how CBcloud uses Google Maps Platform to help Japan’s delivery drivers become more efficient and improve their working conditions.

In an age of remote work, delivery drivers face an increasing burden due to the soaring demand for online shopping. At CBcloud, our mission is to empower drivers with technology and improve their working conditions. Thanks to a range of Google Maps Platform APIs and SDKs, we’re helping Japan’s delivery drivers become more efficient.

Logistics optimization is one of the keys to achieving our mission. By maximizing efficiencies, everybody involved in the delivery industry benefits, starting with drivers and extending to businesses and end users. Our goal is to build an optimized ecosystem underpinned by a vision that we call “Mobility as a Service (MaaS) of Things.” This means redefining the transport of people and things as a service, deploying cutting-edge technology to drive efficiencies and profitability, thereby resolving myriad economic and demographic problems in Japan’s rural regions.

A new logistics paradigm for Japan

In Japan’s logistics industry, the original shipper normally entrusts delivery to a subcontractor, who then passes the job onto another contractor, creating multiple layers in which delivery people lose out on their fair share of wages. The desire to overcome this issue inspired us to launch PickGo, a platform that directly connects shippers to delivery partners, bypassing intermediaries altogether. Launched in 2016, PickGo is available across Japan, 24 hours a day, 365 days per year. Vehicle dispatch is carried out in as little as 56 seconds, with a successful matching rate of 99.2%.

During the COVID-19 pandemic, the dramatic growth of e-commerce caused drivers to struggle with the increase in demand for their services. So we expanded use of SmaRyu Post, a mobile app that automates the process of delivery planning. This is done through a routing function that calculates everything from delivery addresses to optimal routing, and a “loading position indicator function” that determines the location in the vehicle for each parcel (according to the order in which they are meant to be delivered).

Until now, this work was carried out by relying on the driver’s own experience and rough visual estimates. It’s an inefficient process that often requires significant time simply to pinpoint a delivery location, especially when the driver is unfamiliar with the territory. Powered by Google Maps Platform, PickGo and SmaRyu enable drivers to cut the entire planning process for delivery, including job-driver matchmaking, route planning and parcel loading, to under one hour (on average).

Optimizing the user experience with Google Maps Platform

In developing these systems, we wanted to optimize the user experience of drivers. A key factor in our success is an intuitive Google Maps interface that makes using the app easy.

To run SmaRyu Post in conjunction with the existing PickGo system, we deploy a full range of Google Maps Platform APIs and SDKs in an integrated architecture. For app development, we rely on Maps SDK for Android to build mobile solutions that display dynamic maps in a user-friendly fashion. Our apps then rely on Maps JavaScript API as the core tool for loading and displaying various kinds of location information, such as landmarks and roads, within the app.

We deploy Directions API as the main service to calculate distances for optimal routing, while Geocoding API converts user addresses into precise coordinates. And an innovative use of Places API has proven crucial for last-mile delivery. With both Places API and Places SDK, we can easily establish the correct address by entering a building name and dropping a pin on the map. This combination unlocks the optimized user experience.
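For readers unfamiliar with these services, the Python sketch below shows how request URLs for the Geocoding API and Directions API web services are typically formed. It is not CBcloud's code. The function names, the example addresses, and the YOUR_API_KEY placeholder are illustrative assumptions, and the sketch only builds URLs without sending any request.

```python
from urllib.parse import urlencode

MAPS_BASE = "https://maps.googleapis.com/maps/api"


def geocoding_url(address: str, api_key: str) -> str:
    """URL that asks the Geocoding API to turn an address into coordinates."""
    return f"{MAPS_BASE}/geocode/json?" + urlencode({"address": address, "key": api_key})


def directions_url(origin: str, destination: str, api_key: str) -> str:
    """URL that asks the Directions API for a route between two places."""
    params = {"origin": origin, "destination": destination, "key": api_key}
    return f"{MAPS_BASE}/directions/json?" + urlencode(params)


# YOUR_API_KEY is a placeholder; a real key is required to call the services.
print(geocoding_url("Tokyo Station", "YOUR_API_KEY"))
print(directions_url("Tokyo Station", "Shinjuku Station", "YOUR_API_KEY"))
```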

Empowering drivers to get the job done faster

The results of deploying a dynamic ecosystem of Google Maps Platform solutions have been significant. Thanks to Google Maps Platform APIs, PickGo achieved a reduction of between 50% and 60% in the time needed for calculating distance and price from the point it receives an order. SmaRyu Post, meanwhile, has slashed the time needed for working out optimal routing and parcel loading by 80% compared to the time drivers performed these tasks manually.

By deploying Google Maps Platform APIs such as Maps JavaScript API, Directions API and Geocoding API, we have built a network of 55,000 registered drivers, with the average matching time between delivery request and driver at under one minute.

Our adoption of Google Maps Platform also led to the improvement of our team’s technology literacy. Many of our team use Google Maps, and want to learn more about how advanced mapping technology can change our world.

By helping us to invigorate the logistics sector, Google Maps Platform is enabling us to give a broader lift to Japan’s regions. In places facing depopulation, familiar shops are shutting one after the other. As transport services rapidly decrease, the difficulty of getting essentials such as food and household goods is becoming a big problem. We are offering solutions to resolve the logistics problems that Japan’s regions face, with systems that deploy Google Maps Platform.

For more information on Google Maps Platform, visit our website.

2022 10 17

AWS – Refit transforms to prepare data at scale with Amazon SageMaker Data Wrangler

AWS, Cloud AWS

Today, we are excited to announce support to refit transforms with Amazon SageMaker Data Wrangler. To make data usable by algorithms such as XGBoost, data scientists must transform non-numeric values to numeric values using transforms such as one-hot encoding. Since transforms like one-hot encoding depend on the data, these transforms are frequently referred to as fitted transforms. These transforms must be updated or re-fitted to account for changes in the data as data continues to change over time. Additionally, when working on a sample data set, transforms must be updated to account for changes between a sample data set and the larger data set. Use of transforms like one-hot encoding generates additional information, which needs to be tracked and captured in the data preparation pipeline. Omitting or incorrectly tracking this information can lead to errors in the data preparation process. Without support to refit transforms, many data scientists did not have an easy way to specify when to use a fitted version of a transform or to refit their transform on new data. Data scientists also lacked an easy way to generate updated versions of their transformation pipelines when refitting on new datasets.
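To illustrate why a fitted transform needs refitting, here is a minimal Python sketch of a one-hot encoder whose learned state is the list of categories it saw during fitting. It is a generic illustration and not the Data Wrangler API; the class and the sample data are hypothetical.

```python
class OneHotEncoder:
    """Tiny fitted transform: the learned state is the sorted list of categories."""

    def __init__(self):
        self.categories = []

    def fit(self, values):
        # Fitting records which categories exist in the data seen so far.
        self.categories = sorted(set(values))
        return self

    def transform(self, values):
        index = {c: i for i, c in enumerate(self.categories)}
        rows = []
        for v in values:
            row = [0] * len(self.categories)
            if v in index:          # unseen categories encode as all zeros
                row[index[v]] = 1
            rows.append(row)
        return rows


sample = ["red", "green", "red"]          # small sample used while prototyping
full = ["red", "green", "blue", "red"]    # larger data set with a new category

encoder = OneHotEncoder().fit(sample)
stale = encoder.transform(full)   # "blue" was never seen, so its row is all zeros

encoder.fit(full)                 # refit on the larger data set
fresh = encoder.transform(full)   # now every category has its own column

print(stale)
print(fresh)
```

The stale encoding silently loses the new category, which is the kind of error the announcement describes when fitted state is not tracked or refreshed.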

2022 10 17

AWS – AWS Database Migration Service now supports C6i and R6i instances

AWS, Cloud AWS

AWS Database Migration Service (AWS DMS) now supports Amazon EC2 C6i and R6i instance types. These instances are powered by 3rd Generation Intel Xeon Scalable processors with an all-core turbo frequency of 3.5 GHz, offering up to 15% better compute price performance over comparable Generation5 instances for a wide variety of workloads, and always-on memory encryption using Intel Total Memory Encryption (TME).

2022 10 17

AWS – Amazon Detective helps reduce time to investigate Amazon GuardDuty findings by grouping related findings

AWS, Cloud AWS

Starting today, Amazon Detective automatically groups related GuardDuty findings to help security analysts reduce triage time and create a more comprehensive security investigation. Detective uses machine learning (ML) to group related GuardDuty findings that in isolation may have been ignored but together show the lifecycle of an attack, which can help security analysts identify advanced threats more easily. Available under the Summary page, Detective shows groups of related GuardDuty findings with severity, all affected AWS accounts, and resources. In addition, Detective maps the evolution of findings to tactics, techniques, and procedures (TTP) from the MITRE ATT&CK framework – a well adopted framework for security and threat detection.

2022 10 17

GCP – Rethinking your VM strategy with Spot VMs

Cloud, Google Cloud gcp

Organizations globally choose Google Cloud as their transformation partner to accelerate their business and digital transformation because of our leadership in sustainability, AI/ML, data analytics, and more. We are also committed to making cost optimization simple, so we offer a suite of services and tools for customers to effortlessly optimize their environments.

“Media.net chose Spot VMs after exploring various options to support spiky workloads, as they provided Media.net with both deep discounts and simple, predictable pricing.” — Amit Bhawani, Sr VP of Engineering, Media.net

Today, we’ll dive deeper into the use cases and best practices for provisioning and managing Spot VMs to help you save up to 91% off your compute costs.

Spot VMs, previously known as preemptible VMs, are ideal for fault-tolerant workloads and offer the same performance as on-demand VMs. You are guaranteed 60% – 91% off on-demand VM pricing, including GPU, local SSD, and IP addresses that are attached to the VM. Prices vary by region and machine type.
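As a quick worked example of the discount range, the Python sketch below converts an on-demand hourly price into the range of Spot prices implied by a 60% to 91% discount. The $1.00 per hour figure is a made-up input chosen for easy arithmetic, not a real price, and the function name is hypothetical.

```python
def spot_price_range(on_demand_hourly: float,
                     min_discount: float = 0.60,
                     max_discount: float = 0.91):
    """Return (lowest, highest) hourly Spot price implied by the discount range."""
    lowest = on_demand_hourly * (1 - max_discount)    # best case: 91% off
    highest = on_demand_hourly * (1 - min_discount)   # worst case: 60% off
    return round(lowest, 4), round(highest, 4)


# Hypothetical on-demand price of $1.00 per hour.
low, high = spot_price_range(1.00)
print(f"Spot price between ${low:.2f} and ${high:.2f} per hour")
```

Actual prices depend on region and machine type, so real figures should come from the pricing page or the Cloud Billing Catalog API.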

Let’s now look at top use cases and workloads that work well with Spot VMs.

Use cases

Spot VMs are great for batch computing, HPC workloads, training ML models, and stateless web applications. Containerized workloads that can handle instance failure/termination are a great fit too. Spot is integrated with Google Kubernetes Engine (GKE), GKE Autopilot, Batch, Dataproc, and Dataflow VMs.

Because Spot VMs can be preempted (or interrupted), it is recommended to use Spot for fault-tolerant workloads such as rendering, genomic processing, and financial modeling.
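One common way to make a batch job tolerate preemption is to save progress as it goes and resume from the last saved position. The Python sketch below simulates that pattern locally. The function name, the checkpoint file format, and the `preempt_after` parameter (which merely simulates an interruption) are all hypothetical, and a real job would write its checkpoint to durable storage rather than a temporary directory.

```python
import json
import os
import tempfile


def process_items(items, checkpoint_path, preempt_after=None):
    """Process items in order, saving progress after each one.

    preempt_after simulates a preemption by stopping early after N items.
    Returns (results_from_this_run, finished).
    """
    start = 0
    if os.path.exists(checkpoint_path):
        with open(checkpoint_path) as f:
            start = json.load(f)["next_index"]   # resume where the last run stopped

    done = []
    for i in range(start, len(items)):
        if preempt_after is not None and len(done) >= preempt_after:
            return done, False                   # interrupted before finishing
        done.append(items[i] * 2)                # stand-in for real work
        with open(checkpoint_path, "w") as f:
            json.dump({"next_index": i + 1}, f)  # record progress
    return done, True


work = [1, 2, 3, 4, 5]
path = os.path.join(tempfile.mkdtemp(), "checkpoint.json")

first_run, finished_first = process_items(work, path, preempt_after=2)
second_run, finished_second = process_items(work, path)  # restarted after "preemption"
print(first_run, finished_first)
print(second_run, finished_second)
```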

Conversely, workloads with high uptime needs, such as stateful and fault-intolerant workloads, are not a great fit. Please check out our blog here for a deep dive into the use cases and best practices of using Spot VMs.

Key benefits

Simplified and predictable pricing: Spot VMs offer a minimum of 60% and up to 91% off compute costs, with predictable pricing that changes up to once a month, allowing you to better forecast costs and avoid runaway costs. To see prices, you can look them up manually on the VM instance pricing page or query them using the Cloud Billing Catalog API.

No time limits: Spot VMs run indefinitely until Compute Engine needs to reclaim resources.

Spot deployment overview

You can deploy one or multiple MIGs to support each pool of Spot VM resources you want to scale and manage. This is ideal for workloads that don’t require a minimum set of resources to run.

In contrast, we offer a fully managed batch offering that integrates with Spot, called Google Batch. There is no additional cost of using Batch, and it lets you create and run jobs that each automatically provision and utilize the resources required to execute its tasks. Let’s now look at the different methods of creating and managing Spot VMs.

Maintain and automate your Spot VMs with Managed Instance Groups:
Managed Instance Groups (MIGs) offer customers a way to ensure that their VM group can meet the demands of their application and customers. Like other managed services and features, MIGs let the cloud step in and take some actions automatically, reducing the manual work and management burden on your team. MIGs handle rolling updates and blue/green deployments, and the instance group can scale out or in automatically based on a configurable metric. When used with Spot VMs, the MIG provides the same benefits while deploying Spot VMs when scaling out or replacing VMs lost due to preemption. If Spot VMs are not available, the MIG keeps requesting the additional Spot VMs until capacity becomes available. Please note that MIGs will not prevent an outage if all of the Spot VMs are preempted; however, when Spot VMs become available again, the MIG will bring new instances online without manual work.

Create and use Spot VMs

Now that we have a better understanding of Spot VMs and their respective use cases, let’s walk through how to create and manage them, including the following:

How to start and identify Spot VMs

Various ways to create Spot VMs

Spot VMs with Google Kubernetes Engine (GKE)

Best practices for Spot VMs

Like other VMs, Spot VMs require available CPU quota. If you have not requested preemptible quota, Spot VMs consume your standard quota. If you plan to use Spot VMs, consider requesting preemptible quota first so that Spot VMs do not consume your standard quota.

Spot VMs can be created in a number of ways by using the console, gcloud CLI, the Compute Engine API, or Terraform. A Spot VM is any VM that is configured to use the spot provisioning model.

Console

In the Google Cloud console, go to the Create an instance page.

Expand the Networking, disks, security, management, sole tenancy section, and do the following:

Expand the Management section.

In the Availability policies section, select Spot from the VM provisioning model list. This setting disables automatic restart and host maintenance options for the VM and enables the termination action option.

Optional: In the On VM termination list, select what happens when Compute Engine preempts the VM:

To stop the VM during preemption, select Stop (default).

To delete the VM during preemption, select Delete.

Optional: Specify other VM options. For more information, see Creating and starting a VM instance.

To create and start the VM, click Create.

gcloud
To create a VM from the gcloud CLI, use the gcloud compute instances create command. To create Spot VMs, you must include the --provisioning-model=SPOT flag. Optionally, you can also specify a termination action for Spot VMs by also including the --instance-termination-action flag.

Command example:

```shell
gcloud compute instances create VM_NAME \
    --provisioning-model=SPOT \
    --instance-termination-action=TERMINATION_ACTION
```

Replace the following:
VM_NAME: name of the new VM.
TERMINATION_ACTION: Optional: specify which action to take when Compute Engine preempts the VM, either STOP (default behavior) or DELETE.

To create multiple Spot VMs with the same properties, you can create an instance template, and use the template to create a managed instance group (MIG).

Compute Engine API

To create a VM from the Compute Engine API, use the instances.insert method. You must specify a machine type and name for the VM. Optionally, you can also specify an image for the boot disk.

To create Spot VMs, you must include the "provisioningModel": "SPOT" field. Optionally, you can also specify a termination action for Spot VMs by also including the "instanceTerminationAction" field.

```shell
gcloud compute instances create VM_NAME \
    --provisioning-model=SPOT \
    [--image=IMAGE | --image-family=IMAGE_FAMILY] \
    --image-project=IMAGE_PROJECT \
    --machine-type=MACHINE_TYPE \
    --instance-termination-action=TERMINATION_ACTION
```
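For the REST path itself, the request body sent to instances.insert can be sketched in Python as below. This is a minimal illustration under stated assumptions: the helper name `spot_instance_body`, the VM name, the zone, the machine type, the image path, and the use of the default network are all placeholders. The Spot-specific fields, provisioningModel and instanceTerminationAction, sit inside the scheduling object. The sketch only builds and prints the JSON body and sends nothing.

```python
import json


def spot_instance_body(name, zone, machine_type, source_image, termination_action="STOP"):
    """Build a request body for instances.insert that asks for a Spot VM."""
    if termination_action not in ("STOP", "DELETE"):
        raise ValueError("termination_action must be STOP or DELETE")
    return {
        "name": name,
        "machineType": f"zones/{zone}/machineTypes/{machine_type}",
        "disks": [{
            "boot": True,
            "initializeParams": {"sourceImage": source_image},
        }],
        "networkInterfaces": [{"network": "global/networks/default"}],
        "scheduling": {
            "provisioningModel": "SPOT",
            "instanceTerminationAction": termination_action,
        },
    }


# All argument values here are illustrative placeholders.
body = spot_instance_body(
    name="spot-vm-example",
    zone="us-central1-c",
    machine_type="e2-medium",
    source_image="projects/debian-cloud/global/images/family/debian-11",
    termination_action="DELETE",
)
print(json.dumps(body, indent=2))
```

The body would be sent in an authenticated POST to https://compute.googleapis.com/compute/v1/projects/PROJECT_ID/zones/ZONE/instances, where PROJECT_ID and ZONE are your own values.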

For more information about the options you can specify when creating a VM, see Creating and starting a VM instance.
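For the API route, the request itself can be sketched as a URL plus a JSON body. The following is an illustrative sketch only: the function name build_spot_insert_request and the example project, zone, and image values are hypothetical, and the field layout reflects our understanding of the instances.insert request body, so check it against the API reference before use. It builds the request without sending it.

```python
import json


def build_spot_insert_request(project, zone, vm_name, machine_type,
                              source_image=None, termination_action="STOP"):
    """Build the URL and JSON body for an instances.insert call that creates a Spot VM.

    Hypothetical helper: nothing is sent over the network.
    """
    url = (
        "https://compute.googleapis.com/compute/v1/"
        f"projects/{project}/zones/{zone}/instances"
    )
    body = {
        "name": vm_name,
        "machineType": f"zones/{zone}/machineTypes/{machine_type}",
        "networkInterfaces": [{"network": "global/networks/default"}],
        # These two scheduling fields are what make the VM a Spot VM.
        "scheduling": {
            "provisioningModel": "SPOT",
            "instanceTerminationAction": termination_action,
        },
    }
    # The boot disk image is optional in this sketch, as in the text above.
    if source_image:
        body["disks"] = [
            {"boot": True, "initializeParams": {"sourceImage": source_image}}
        ]
    return url, body


if __name__ == "__main__":
    url, body = build_spot_insert_request(
        project="my-project",            # placeholder project ID
        zone="us-central1-c",
        vm_name="spot-instance-name",
        machine_type="n2-standard-16",
        source_image="projects/debian-cloud/global/images/family/debian-11",
    )
    print("POST", url)
    print(json.dumps(body, indent=2))
```

You would send the body as an authenticated POST to the printed URL, for example through a Google Cloud client library.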

Terraform
You can use a Terraform resource to create a Spot VM by setting the scheduling block.

spot_instance_basic/main.tf

    resource "google_compute_instance" "spot_vm_instance" {
      name         = "spot-instance-name"
      machine_type = "f1-micro"
      zone         = "us-central1-c"

      boot_disk {
        initialize_params {
          image = "debian-cloud/debian-9"
        }
      }

      scheduling {
        preemptible                 = true
        automatic_restart           = false
        provisioning_model          = "SPOT"
        instance_termination_action = "STOP"
      }

      network_interface {
        # A default network is created for all GCP projects
        network = "default"
        access_config {
        }
      }
    }

Spot VMs with Google Kubernetes Engine (GKE) and Autopilot Clusters
When you create a cluster or node pool with Spot VMs, GKE creates underlying Compute Engine Spot VMs that behave like a managed instance group (MIG). Nodes that use Spot VMs behave like standard GKE nodes but with no guarantee of availability. When the resources used by Spot VMs are required to run standard VMs, Compute Engine terminates those Spot VMs to use the resources elsewhere. This section shows you how to run fault-tolerant, stateless, or batch workloads at lower costs by using Spot VMs and Spot Pods in your GKE clusters and node pools.

Before you begin, ensure the Google Kubernetes Engine API is enabled.

For instructions on creating a cluster or node pool with Spot VMs, see the published GKE documentation.

For instructions on creating Spot Pods in GKE Autopilot clusters, see the published GKE documentation.

Start and stop Spot VMs
Like other VMs, Spot VMs start upon creation. Likewise, if Spot VMs are stopped, you can restart the VMs to resume the RUNNING state. You can stop and restart preempted Spot VMs as many times as you would like, as long as there is capacity. For more information, see VM instance life cycle.

If Compute Engine stops one or more Spot VMs in an autoscaling managed instance group (MIG) or Google Kubernetes Engine (GKE) cluster, the group restarts the VMs when the resources become available again.

Best Practices

Here are some best practices to help you get the most out of Spot VMs.

Use instance templates. Rather than creating Spot VMs one at a time, you can use instance templates to create multiple Spot VMs with the same properties. Instance templates are required for using MIGs. Alternatively, you can also create multiple Spot VMs using the bulk instance API.

Use MIGs to regionally distribute and automatically recreate Spot VMs. Use MIGs to make workloads on Spot VMs more flexible and resilient. For example, use regional MIGs to distribute VMs across multiple zones, which helps mitigate resource-availability errors. Additionally, use autohealing to automatically recreate Spot VMs after they are preempted.

Pick smaller machine types. Resources for Spot VMs come out of excess and backup Google Cloud capacity. Capacity for Spot VMs is often easier to get for smaller machine types, meaning machine types with fewer resources such as vCPUs and memory. You might find more capacity for Spot VMs by selecting a smaller custom machine type, but capacity is even more likely for smaller predefined machine types. For example, compared to capacity for the n2-standard-32 predefined machine type, capacity for the n2-custom-24-96 custom machine type is more likely, but capacity for the n2-standard-16 predefined machine type is even more likely. Note that non-compute services such as persistent disk and networking are not currently eligible for Spot VM discounts.

Run large clusters of Spot VMs during off-peak times. The load on Google Cloud data centers varies with location and time of day, but is generally lowest on nights and weekends. As such, nights and weekends are the best times to run large clusters of Spot VMs.
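A batch scheduler can encode the nights-and-weekends guidance as a simple gate before it launches a large cluster. The sketch below is illustrative only: the 20:00 to 06:00 night window is an assumption of ours and does not come from Google Cloud, and the time passed in should be local to the region you are targeting.

```python
from datetime import datetime

# Assumed night window (local time); adjust for your own workloads.
NIGHT_START_HOUR = 20
NIGHT_END_HOUR = 6


def is_off_peak(local_time):
    """Return True if local_time falls on a weekend or inside the assumed night window."""
    # datetime.weekday() returns 5 for Saturday and 6 for Sunday.
    if local_time.weekday() >= 5:
        return True
    return local_time.hour >= NIGHT_START_HOUR or local_time.hour < NIGHT_END_HOUR


if __name__ == "__main__":
    now = datetime.now()
    print("Launch large Spot cluster now?", is_off_peak(now))
```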

Design your applications to be fault and preemption tolerant. It’s important to be prepared for the fact that there are changes in preemption patterns at different points in time. For example, if a zone suffers a partial outage, large numbers of Spot VMs could be preempted to make room for standard VMs that need to be moved as part of the recovery. In that small window of time, the preemption rate would look very different than on any other day. If your application assumes that preemptions are always done in small groups, you might not be prepared for such an event. You can test your application’s behavior under a preemption event by stopping the VM.

Use shutdown scripts. Manage shutdown and preemption notices with a shutdown script that can save a job’s progress so that it can pick up where it left off, rather than start over from scratch.
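One way to make a job resumable is to have the worker checkpoint its position when it is asked to stop. The following is a minimal sketch, assuming that your shutdown script (or the operating system during shutdown) delivers SIGTERM to the worker process. The class name CheckpointedJob and the checkpoint file layout are hypothetical.

```python
import json
import os
import signal
import tempfile


class CheckpointedJob:
    """Process items 0..total_items-1, saving progress so a restart can resume."""

    def __init__(self, checkpoint_path, total_items):
        self.checkpoint_path = checkpoint_path
        self.total_items = total_items
        self.stop_requested = False
        self.next_item = self._load()

    def _load(self):
        # Resume from the saved position, or start from scratch if none exists.
        try:
            with open(self.checkpoint_path) as f:
                return json.load(f)["next_item"]
        except FileNotFoundError:
            return 0

    def save(self):
        # Write to a temporary file and rename it, so a shutdown that lands
        # mid-write cannot leave a truncated checkpoint behind.
        tmp_path = self.checkpoint_path + ".tmp"
        with open(tmp_path, "w") as f:
            json.dump({"next_item": self.next_item}, f)
        os.replace(tmp_path, self.checkpoint_path)

    def request_stop(self, signum=None, frame=None):
        # Usable directly as a signal handler.
        self.stop_requested = True

    def run(self, process_item):
        """Run until finished or asked to stop; return True if the job completed."""
        while self.next_item < self.total_items and not self.stop_requested:
            process_item(self.next_item)
            self.next_item += 1
        self.save()
        return self.next_item >= self.total_items


if __name__ == "__main__":
    path = os.path.join(tempfile.gettempdir(), "spot_job_checkpoint.json")
    job = CheckpointedJob(path, total_items=1000)
    signal.signal(signal.SIGTERM, job.request_stop)
    finished = job.run(lambda i: None)  # replace the lambda with real work
    print("finished" if finished else f"stopped at item {job.next_item}")
```

Because the stop flag is checked between items, each unit of work should be short enough to finish within the shutdown window that a preemption allows.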

To learn more about Spot VMs, see the Spot VM documentation.

Read More for the details.
