GCP – How Domino’s delivers pizza with the drop of a pin to almost anywhere
Read More for the details.
Spanner is a fully managed database service for both relational and non-relational workloads that offers strong consistency at global scale, high performance at virtually unlimited scale, and high availability with an up to five 9s SLA.
While Spanner is renowned for its relational capabilities, it is also a versatile key-value database that can be used to store and retrieve non-relational data via read and write APIs. In fact, a significant portion of internal Spanner usage at Google is non-relational.
Spanner was developed by Google to address the challenge of achieving synchronous replication and consistency while enabling virtually unlimited scaling. While relational workloads benefit from strong reads that guarantee the latest version of the data, non-relational workloads often utilize stale reads that can be served by local read replicas.
Recently, we announced a 50% increase in throughput and 2.5x more storage per node (up to 10TB) for Spanner, in addition to reduced latencies. These enhancements make Spanner an even more compelling option for NoSQL workloads.
Spanner is used extensively at Google by numerous projects, such as Google Photos, Google Ads, and Gmail. In total, Spanner serves over 3 billion read/write requests per second at peak. While some projects employ complex SQL and transactions, the vast majority of workloads are primarily key lookups. For such workloads, Spanner is chosen due to its high performance, customizability, and scalability.
Spanner can be used as a feature-rich RDBMS with relational semantics. However, these relational features are built on top of an equally powerful non-relational platform. For example, Spanner’s splits architecture demonstrates how it can provide write-scaling. Furthermore, Spanner’s support for JSON allows for more versatile use cases typically provided by document databases.
A question that frequently arises is why today’s data requires strong consistency. In reality, when workloads migrated from legacy databases to NoSQL databases, the requirement for consistency never went away. Instead, applications are now expected to handle inconsistencies in data, a task that was previously performed by databases. Customers had no choice but to accept additional application complexity for the sake of mandatory scalability requirements driven by data size. With Spanner, customers no longer have to make a trade-off between scalability and consistency — they can have both.
Cost is a primary concern for all customers. Spanner offers a cost-effective starting point of $65 per month (varying by region), which can be further reduced to $45 per month with Committed Use Discounts (CUDs). While this is higher than the free entry point offered by some non-relational databases, the reality is that most enterprise workloads demand significantly more resources. In such scenarios, the price-performance ratio of the database becomes more important than the minimum entry cost. Spanner also provides a free tier for those who simply want to try it out, along with a local emulator that allows customers to develop directly without incurring additional costs.
Many Spanner customers are running non-relational workloads on Spanner today. Here’s a sampling:
Uber
Uber’s previous infrastructure, based on Cassandra and Ringpop, presented various challenges, including low developer productivity and leaky abstractions between the database and application layers. Notably, Cassandra’s lack of consistency forced developers to “think about compensating actions especially when the writes fail due to system failures. In some cases, the failure of [a system] might also result in an inconsistent state of entities that often required manual intervention,” increasing costs and operational overhead. Read more about their engineering journey from the Uber Engineering blog.
ShareChat
ShareChat, a leading social media platform in India, migrated their non-relational workload from a NoSQL database to Spanner. They were particularly impressed with the similarity between Spanner’s schema system and their previous NoSQL database, which enabled them to perform a no-downtime migration. Additionally, they were able to achieve significant cost savings by moving to Spanner.
“Unlike our legacy NoSQL database, we could scale without having to rethink existing tables or schema definitions and keep our data systems in sync across multiple locations. It’s also cost-effective for us — moving over 120 tables with 17 indexes into Cloud Spanner reduced our costs by 30%.” – Bhanu Singh, Co-founder and CTO at ShareChat.
Read more about ShareChat’s migration story.
Niantic
Niantic, the creators of the popular mobile game Pokémon GO, migrated their non-relational workload from Cloud Datastore (a non-relational datastore) to Cloud Spanner. “As the game matured, we decided we needed more control over the size and scale of the database,” said James Prompanya, Senior Staff Software Engineer at Niantic. “We also like the consistent indexing that Cloud Spanner provides.” Spanner’s consistent secondary indexes provided Niantic with a significant performance boost. Secondary indexes are used to improve the performance of queries that involve filtering or sorting on data. In some non-relational databases, secondary indexes are eventually consistent, which means that they may not be immediately up-to-date. This can lead to latency issues for applications.
Read more about Niantic’s journey.
Let’s take a closer look at how non-relational concepts translate to Spanner concepts:
Non-relational → Spanner
Table → Table
Items → Rows
Attributes → Can be modeled in a number of ways:
- Defining columns in the schema is the most idiomatic approach in Spanner
- A JSON column is the closest equivalent
- An interleaved table with key-value pairs
Primary Key → Primary Key, which determines both the partitioning of the data across nodes and the sort order of the data within each node
Secondary Indexes → Non-relational systems usually provide two types of indexes:
- Local secondary indexes, which can be modeled as interleaved secondary indexes
- Global secondary indexes, which are equivalent to secondary indexes in Spanner
Spanner also lets users store non-key columns in the index, speeding up common read requests. It is important to note that all indexes are always transactionally consistent in Spanner.
Sparse Index → NULL-filtered indexes allow users to omit NULL rows from the index. When used in conjunction with generated columns, this offers a robust method for excluding a specific set of rows from the index.
Streams → Change streams
Non-relational → Spanner
Control plane APIs like CreateTable → Schema changes
Custom query languages → SQL (ANSI SQL or PG) with hints like FORCE_INDEX to force the use of a particular index
PutItem, BatchWriteItem, UpdateItem → Read-write transactions. In Spanner, all writes provide ACID guarantees.
GetItem, BatchGetItem, Query, Scan → Read API. Stale reads can be used to improve performance where the most recent data is not required; stale reads are still consistent as of some prior timestamp.
Data types
Spanner provides scalar types that are largely the same as other storage systems across the industry. In addition, Spanner natively supports dynamic arrays of any supported type. Document-like scenarios can use Spanner’s JSON type.
Read more about Spanner data types.
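To make the mapping above concrete, here is a minimal GoogleSQL DDL sketch (the table and index names are hypothetical, not taken from any real workload) showing a table with a JSON column, an interleaved child table, a secondary index that stores a non-key column, and a NULL-filtered index:

-- Hypothetical product catalog schema (illustrative names only)
CREATE TABLE Products (
  ProductId      STRING(64) NOT NULL,
  Name           STRING(MAX),
  Attributes     JSON,            -- document-style data
  DiscontinuedAt TIMESTAMP,       -- NULL for active products
) PRIMARY KEY (ProductId);

-- Interleaved child table: rows are physically co-located with their parent Product
CREATE TABLE ProductReviews (
  ProductId STRING(64) NOT NULL,
  ReviewId  STRING(64) NOT NULL,
  Rating    INT64,
) PRIMARY KEY (ProductId, ReviewId),
  INTERLEAVE IN PARENT Products ON DELETE CASCADE;

-- Secondary index that stores a non-key column so common reads are served from the index
CREATE INDEX ProductsByName ON Products(Name) STORING (Attributes);

-- NULL-filtered ("sparse") index: omits rows where DiscontinuedAt is NULL
CREATE NULL_FILTERED INDEX DiscontinuedProducts ON Products(DiscontinuedAt);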
In addition, Spanner offers a number of advantages over typical non-relational databases:
Stronger consistency guarantees: Spanner provides strong consistency for all reads and writes. Data consistency is maintained across all replicas. In the event of network latency, a replica may become stale. In this case, another replica can provide the most recent data. The stale replica still provides a consistent view of the data as of an earlier point in time. In contrast, most non-relational databases offer eventual consistency, which means that data may not be immediately consistent across all replicas. This can be a problem for applications that require strong consistency, such as those that involve financial transactions or real-time updates.
Global secondary indexes: Spanner supports global secondary indexes. This means that secondary indexes can be used to query data across all replicas. This can significantly improve the performance of queries that involve filtering or sorting on indexed data.
Support for transactions: Spanner supports ACID transactions. This means that multiple writes are either committed atomically or aborted. This can be essential for applications that require data integrity, such as those that involve multiple concurrent updates to the same data.
SQL as a query language: In addition to a read/write API, Spanner also supports SQL, a well-known and widely used query language. This makes it easier for developers to learn and use Spanner. In addition, some queries are much more straightforward and efficient when expressed using SQL compared to complex proprietary APIs.
Simplified application logic: Spanner’s strong consistency guarantees and support for transactions can help to simplify application logic. For example, applications that use Spanner do not need to implement their own mechanisms for ensuring data consistency, such as reconciliation pipelines.
Spanner is a versatile database that can be used for both relational and non-relational workloads. Spanner’s core concepts are similar to those of non-relational databases, but Spanner offers a number of advantages, such as stronger consistency guarantees, global secondary indexes, support for transactions, and SQL as a query language. These advantages make Spanner a good choice for a wide range of applications.
Interested in trying out Spanner for your non-relational workload? Get started for free!
Read More for the details.
As you orchestrate more services with Workflows, the workflow gets more complicated, with more steps, jumps, iterations, and parallel branches. When the workflow execution inevitably fails at some point, you need to debug and figure out which step failed and why. So far, you only had an execution summary with inputs/outputs, plus logs, to rely on when debugging an execution. While this was good enough for basic workflows, it didn’t provide step-level debugging information.
The newly released execution steps history solves this problem. You can now view step level debugging information for each execution from the Google Cloud console or the REST API. This is especially useful for complicated workflows with lots of steps and parallel branches.
In this blog we will take a closer look at a concrete example of Workflows execution steps history.
In an earlier blog post, Introducing Parallel Steps for Workflows, and its associated tutorial, we showed a workflow (workflow-parallel.yaml) that queries 5 BigQuery tables using parallel branches of Workflows to speed up processing.
The workflow outline is as follows:
runQueries step is a combination of 5 parallel branches where each table query happens in parallel.
Let’s assume you made a mistake with the name of one of the tables and you point to a non-existing table. Can the new execution steps history help us to debug that mistake?
First, as the execution is running, you’ll notice a new Steps tab under execution details in the Google Cloud console:
Under the Steps tab, you can see which steps have succeeded, which are running, and which have failed:
This is already very useful in visualizing and understanding what happens under the hood!
Once the execution is finished, you will see that the top-level runQueries step failed. You can filter to see its child steps:
You see that one of the iterations (i.e. tables) of runQueries failed:
Further filtering on runQueries.3 step, you realize that runQuery.3 failed:
Finally, further zooming into runQuery.3 reveals that the step received HTTP status 404 which hints that the table in question might not exist:
At this point, you can look into logs and get to the exact reason for HTTP 404, the non-existing table.
We went from a failed execution to the exact failed step in a parallel branch and the actual HTTP 404 error pretty quickly!
The new execution steps history makes it easier to understand what happens under the hood with an execution. While most developers will use the Google Cloud console to view the steps history, you can access the same information from the REST API as well. We will keep improving the developer experience with 2 more features scheduled for Workflows execution. You will be able to see progress for loops, parallel branches, retries and also see a detailed view of inputs/outputs and user variables. Stay tuned!
To learn more, check out the view history of execution steps documentation and the visualize and Inspect Workflows Executions blog post from Guillaume Laforge. You can also utilize the official Workflows samples and experiment with how the execution steps history helps. As always, connect with me on Twitter @meteatamel if you have any questions or feedback.
Read More for the details.
If you’ve joined a new team recently, like I have, you’ve probably had a lot of questions. Answers to those questions may or may not be easy to find, and might rely heavily on the generosity and spare time of your teammates. Let’s say you’re a DevRel engineer, working with Google Kubernetes Engine (GKE) to build and run microservices. Despite how willing your teammates are to help you get started, wouldn’t it be nice if there were a way to get questions about Google Cloud answered without bothering your coworkers?
Well, fortunately for you, and those supportive hypothetical teammates, Duet AI is available to answer your questions as you start in a new job, and will still be there as you explore more advanced options. You can see all those details in this narrative tutorial, but let’s walk through a quick example here.
Suppose you have a service or app already running, and you’d like to get to know it a bit better. A bit more personal, if you will. Logging helps tremendously when you’re exploring a new system. But finding the right options and flags to set can be a bit of a challenge. But you know what’s not a challenge? Asking Duet AI for the logs you need, with natural language.
How can I write a Cloud Logging query to search for logs from Pods in a namespace called “default” in a GKE cluster named “test”?
You should get a response something like the following, allowing for variation due to the built-in creativity of LLMs, of course.
To write a Cloud Logging query to search for logs from Pods in a namespace called “default” in GKE cluster called “test”, you can use the following query: [1]
resource.type="k8s_container"
resource.labels.namespace_name="default"
resource.labels.cluster_name="test"
Generated code may be subject to license restrictions not shown here. Use code with care. Learn more
This query will search for all logs from Pods in the “default” namespace in the “test” GKE cluster. You can also use the logName field to specify the type of logs you want to search for. For example, the following query will search for all stdout logs from Pods in the “default” namespace in the “test” GKE cluster:
resource.type="k8s_container"
resource.labels.namespace_name="default"
resource.labels.cluster_name="test"
logName="projects/PROJECT_ID/logs/stdout"
Generated code may be subject to license restrictions not shown here. Use code with care. Learn more
Where PROJECT_ID is the ID of your Google Cloud project.
For more information on writing Cloud Logging queries, see the Cloud Logging documentation.
And once you’ve run the command, you can click on “Explain this query” in each of the logs to get help understanding each of the log entries.
If you’re trying out the Vision API, you might notice a log entry like the one below:
There’s a lot here, but what does it mean? The “Explain this log entry” button in the upper left-hand corner will open a discussion with Duet AI, as shown below:
You can then follow up from this request. Suppose you want to know more about how to adjust the batch size for Vision API. You could continue the discussion:
You even have the ability to check the documentation if you want to know more by showing the related content:
(This example comes from the Jump Start Solution Generative AI document summarization if you’d like to explore Gen AI more.)
Once you understand the current situation, you may be in a more informed position to decide to make changes, like creating a private build environment. Duet AI is there to help throughout your entire process: from describing the right tools, to telling you how to use them effectively, to answering questions as you go. For even more on how this all works, and how you can get started, check out the narrative tutorial. So take a look, and impress those hypothetical teammates with how quickly you’re able to contribute to your new team.
Read More for the details.
Here at Wayfair, our data scientists rely on multiple sources of data to obtain features for model training. An ad hoc approach to feature engineering led to multiple versions of feature definitions, making it challenging to share features between different models. Most of the features were stored and used with minimal oversight on freshness, schema, and data guarantees. As a result, our data scientists frequently encountered discrepancies in model performance between development and production environments, making the feedback loop for retraining cumbersome. The whole process of curating new stable features and developing new model versions often took several months.
To address these issues, the Service Intelligence team at Wayfair decided to create a centralized feature engineering system. Our goal was to standardize feature definitions, automate ingestion processes, and simplify maintenance. We worked with Google to adopt different Vertex AI offerings, especially Vertex AI Feature Store and Vertex AI Pipelines. The former provides a centralized repository for organizing, storing, and serving ML features, and the latter helps to automate, monitor, and manage ML workflows. These offerings became the two main components of our feature engineering architecture.
On the data side, we developed workflows to streamline the flow of raw features data into BigQuery tables. We created a centralized repository of feature definitions that specify how each feature should be pulled, processed, and stored in the feature store. Using the Vertex AI Feature Store’s API, we automatically create features based on the given definitions. We use GitHub’s PR approval process to enforce governance and track changes.
Sample feature definition
We set up Vertex AI Pipelines to transform raw data in BigQuery into features in the feature store. These pipelines run SQL queries to extract the data, transform it, and then ingest it into the feature store. The pipelines run on different cadences depending on how frequently the features change, and what level of recency is required by the models that consume them. The pipelines are triggered by Cloud Functions that listen for Pub/Sub messages. These messages are generated both on a static schedule from Cloud Scheduler, and dynamically from other pipelines and processes.
Feature Engineering System Diagram
The Vertex AI Feature Store enables both training and inference. For training it allows data scientists to export historical feature values via point-in-time lookup to retrain their models. For inference it serves features at low latency to production models that make their predictions in real-time. Furthermore, it ensures consistency between our development and production environments, avoiding training-serving skew. Data scientists are able to confidently iterate on new model versions without worrying about data-related issues.
Our new feature engineering system makes it easy for data scientists to share and reuse features, while helping to provide guarantees around offline-online consistency and feature freshness. We are looking forward to adopting the new version of Vertex AI Feature Store that is now in public preview, as it will provide more transparent access to the underlying data and should reduce our cloud costs by allowing us to use BigQuery resources dedicated to our project.
The authors would like to thank Duncan Renfrow-Symon and Sandeep Kandekar from Wayfair for their technical contributions and Neela Chaudhari, Kieran Kavanagh, and Brij Dhanda from Google for their support with Google Cloud.
Read More for the details.
Google Public Sector is pleased to introduce the Public Sector Partner Learning Center – now available in the Google Partner Advantage Portal. Inside the learning center, partners can access powerful Go-to-Market (GTM) Kits, unlocking a wealth of resources designed to elevate impact in the market. Discover compelling value propositions, engage with customer success stories, embark on insightful learning journeys, and navigate opportunity registration seamlessly – all within a user-friendly interface. There are thirty-two GTM kits currently available to partners. Seven of the GTM kits were developed specifically for partners to execute with public sector customers:
Activate your public sector data with AI
Generative AI for the public sector
Continuity of Operations Plan (COOP) & Disaster Recovery (DR) for the public sector
Google Workspace for the public sector
Security Foundation for the public sector
Security Operations for the public sector
Zero Trust for Government
The Partner Learning Center empowers our partners and their go-to-market (GTM) teams with a curated selection of resources, including sales playbooks, marketing campaigns, technical assistance, and case studies. GTM Kits foster a collaborative environment for strategically navigating priority Google Cloud solutions. Additionally, dedicated Google Cloud experts stand ready to provide guidance on sales positioning and technical enablement, ensuring your success at every turn. Dive deeper and explore the transformative potential of the Partner Learning Center by visiting the website accessible via your Partner Advantage Portal login.
Fueling Success and Enabling Public Sector Transformation
The Partner Learning Center unlocks tangible outcomes and fosters meaningful contributions to the public sector:
Fast-track your GTM efforts: Optimize your go-to-market strategies and efficiently reach government customers with impactful solutions.
Supercharge sales productivity: Empower your teams with the knowledge and resources necessary to excel and achieve substantial results.
Deepen support for critical missions: Leverage Google Cloud solutions to drive positive outcomes in crucial government areas like healthcare, education, and national security.
Elevate your brand visibility: Establish your organization as a trusted leader in delivering cutting-edge technology solutions for government agencies.
Simply log in to the Partner Advantage Portal, locate the Training Overview page, and select Google Cloud Partner Learning Center. After creating a free account, you’ll find a gateway to a wealth of knowledge and practical tools.
We remain dedicated to continuously enriching the Partner Learning Center with new content and resources.
Contact Us
If you have any questions about the Partner Learning Center, please contact Partner Support by filling out a ticket here.
Read More for the details.
Organizations often use a variety of third-party tools to monitor the health and performance of their applications. But in the event of a service degradation, the source of the issue isn’t always clear — is it a disruption with your cloud provider, or a problem in your application environment? Recently, we announced the general availability of Personalized Service Health, which includes a new capability, emerging incidents, that provides speedy notification of Cloud Networking incidents to customers.
Emerging incidents are machine-driven alerts that are communicated simultaneously to you and internal Google SRE teams, significantly reducing the time to the first meaningful post about an incident. This means customers are notified as soon as Google Cloud incidents occur, even as our teams are still investigating the issues and assessing their impact. You can start receiving emerging incident notifications by enabling Personalized Service Health for supported Cloud Networking products and setting up alerts for them.
Emerging incident communications are sent in real time and personalized to your project, helping you address disruptions to operations and implement measures to mitigate the impact to your business. Of course, Personalized Service Health also sends timely updates on active incidents, making it a go-to resource for all incident information. The sooner you take action, the more you can reduce your mean time-to-resolution (MTTR), and improve application reliability. These real-time and personalized communications are shown below.
The health of Google Cloud networking products is continuously monitored using various probes. If the system detects a degradation or service interruption, it automatically generates internal alerts that are communicated to customers based on assessed impact, and sent out through all Personalized Service Health channels, including the dashboard, logs, alerts, and APIs.
Subsequently, if an event is confirmed, the emerging incident is closed out and linked to a confirmed incident. Alternatively, if the event was short-lived, e.g., a network re-route mitigated the impact to the customer, it may be closed before a confirmed event is even generated. These early alerts provide customers with clear information on the root of the issue. Now, as long as incidents are active, customers receive timely updates about both emerging and confirmed incidents from Google Cloud’s incident response process.
Emerging incidents for supported products are available by default for any projects that use them, as long as Personalized Service Health is enabled. You can learn more about managing emerging incidents here. Emerging incidents alert policies can be configured from within the Service Health dashboard.
For more information, follow the Personalized Service Health documentation and getting started guide. To get started, enable Personalized Service Health for a project or across your organization.
Read More for the details.
We’re taking our solar data and expanding coverage–far and wide. After launching the Solar API as part of our new suite of Environment APIs, we’ve continued to expand our coverage. We’re now able to provide valuable information to businesses in the solar industry for more than 472 million buildings in over 40 countries. This includes newly expanded coverage to over 95% of all buildings in the United States–nearly double our previous coverage.
Historically, our solar insights were computed using elevation maps and imagery captured by low-flying airplanes in limited regions. With new advancements in machine learning, we’re now using a larger set of Google Maps aerial imagery to produce detailed elevation maps and accurate solar projections for millions of buildings that previously had no data available. Our AI-enhanced height maps were internally evaluated based on geometric accuracy and predicted energy outputs, and developed closely with direct feedback from solar industry leaders from around the world. These advances help expand our comprehensive building data, solar potential insights, and detailed rooftop imagery broadly throughout North America, Europe, and Oceania. These advancements set the stage for future coverage expansions within our currently covered countries as well as expansions to new countries where data is not readily available.
Given the impact that access to reliable solar data can have on deploying renewable energy, we’re making it a top priority to roll out coverage across geographies where there is significant demand for this data. Since launch, solar companies have requested expanded coverage so they can unlock new markets, grow their business, and increase the amount of solar.
“Google has been adding more data, which has been great. Whenever that happens, we’re happy, because we’re paying less for higher quality imagery,” explains Walid Halty, CEO at Mona Lee. “Google’s Solar API has proven time and time again, where it’s available, it’s the best.”
Benefits of integrating the Solar API
Our Solar API is being used to optimize solar panel arrays, make solar assessments and proposals more accurate and efficient, and to educate the public about transitioning to solar energy by showing homeowners the feasibility for their individual properties. Here are two examples of how our customers are using the Solar API:
Demand IQ AI chatbot for solar assessments
Demand IQ uses the Solar API to help solar companies provide online, accurate, real-time rooftop assessments to homeowners considering a transition to solar energy. By digitizing the solar shopping experience, companies can increase transparency, realize more conversions, and cut costs–while providing homeowners with useful, engaging information so they can make an informed decision.
“To support the transition to solar energy, we need to help customers make informed decisions, and we need to help solar providers to answer their questions with up-to-date, accurate data,” explains Austin Rosenbaum, CEO, Demand IQ. “With Demand IQ and the power of data from the Solar API, we now do that in real-time.”
To support energy efficiency at scale, MyHEAT uses solar data, insights, and imagery to educate residents, utility companies, and cities on the solar potential of their homes and buildings. The Solar API significantly reduces the time needed to deliver solutions, while also improving efficiency, accuracy, and the quality of their 3D map imagery.
We’ll be at the Intersolar North America and Energy Storage North America conference at the San Diego Convention Center from January 17-19. Stop by booth #649, where we’ll have live presentations at 11 am and 2 pm each day to share more about how our Solar API can enhance your solar offering.
For more information on Google Maps Platform, visit our website.
Read More for the details.
Editor’s note: Since its founding in 2019, Linear has been enhancing global product development workflows for businesses through its project and issue-tracking system. Leveraging the power of Cloud SQL for PostgreSQL, Linear was able to keep pace with its expanding customer base–improving the efficiency, scalability, and reliability of data management, scaling up into the tens of terabytes without increasing engineering effort.
Linear’s mission is to empower product teams to ship great software. We’ve spent the last few years building a comprehensive project and issue tracking system to help users streamline workflows throughout the product development process. While we started as an issue tracker, we’ve grown our application into a powerful project management platform for cross-functional teams and users around the world.
For instance, Linear Asks allows organizations to manage request workflows like bug and feature requests via Slack, streamlining collaboration for individuals without Linear accounts who regularly work with our platform. Additionally, we introduced Similar Issues, a feature that prevents duplicate or overlapping tickets and ensures cleaner and more accurate data representation for growing organizations.
As our customers grow their businesses, they have more users on the platform and issues to track, which means more need for workflow and product management software. We’re focused on supporting this growth while continuing to deliver on stability, quality, performance, and the features that support complex technical configurations alongside a great user experience.
In our initial development phase, we had a PostgreSQL database with pgvector extension hosted on a PaaS that wasn’t indexed or used for production workloads. For production workloads we needed to upgrade our databases and find a solution with strong vector search support, since it’s the best way to identify and group similar issues based on shared characteristics or patterns. By representing issues as vectors and finding similarities, we can quickly identify duplicate or related issues. This functionality streamlines bug tracking and helps our customers address issues more effectively, saving them time and resources while improving their overall workflows.
We explored several new entrants in the database market that focus on storing vectors and ended up trialing a few of them. However, we faced challenges with speed of indexing and unacceptable downtime while scaling, not to mention the relatively high cost for a feature that wasn’t the core of the product. Given Linear’s existing data volume and our goals for finding a cost-efficient solution, we opted for Cloud SQL for PostgreSQL once support for pgvector was added. We were impressed by its scalability and reliability. This choice was also compatible with our existing database usage, models, ORM, etc., which meant the learning curve was non-existent for our team.
Our migration process from development to production was challenging at first due to the sheer size and volume of vectors we had to work with for the production dataset. However, after partitioning the issues table into 300 segments, we were able to successfully index each partition. The migration process followed a standard approach of creating a follower from the existing PostgreSQL database and proceeded smoothly.
Today, our primary operational database uses Cloud SQL for PostgreSQL. Since Cloud SQL for PostgreSQL includes the pgvector extension, we were able to set up an additional database to store vectors for our similarity-search features. This is achieved by encoding the semantic meaning of issues into a vector using OpenAI ada embeddings, then combining it with other filters to help us identify similar relevant entities.
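As a rough illustration only (this is not Linear’s actual schema; the table, columns, and filter below are hypothetical), a pgvector-backed similarity search combines an embedding column with ordinary relational filters:

-- Requires the pgvector extension available in Cloud SQL for PostgreSQL
CREATE EXTENSION IF NOT EXISTS vector;

-- OpenAI ada embeddings are 1536-dimensional
CREATE TABLE issue_embeddings (
  issue_id  text PRIMARY KEY,
  team_id   text NOT NULL,
  embedding vector(1536) NOT NULL
);

-- Find the 10 most similar issues within a team, using cosine distance
-- combined with an ordinary relational filter; $1 is the query embedding
SELECT issue_id
FROM issue_embeddings
WHERE team_id = 'TEAM-123'
ORDER BY embedding <=> $1
LIMIT 10;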
A simplified diagram of Linear’s architecture
In terms of our architecture design, Linear’s web and desktop clients seamlessly sync with our backend through real-time connections. On Google Cloud, we operate synchronized WebSocket servers, both public and private GraphQL APIs, and task runners for background jobs.
Each of these functions as a Kubernetes workload that can scale independently. Our technology stack is fully built with Node.js and TypeScript, and our primary database solution is Cloud SQL for PostgreSQL, a choice we’re confident in. Additionally, we use Google’s managed Memorystore for Redis as an event bus and cache.
Cloud SQL for PostgreSQL has proven invaluable for Linear. Because we do not have a dedicated operations team, relying on managed services is crucial. It allows us to scale our database smoothly into tens of terabytes of data without requiring extensive engineering efforts, which is fantastic for our operations and enables engineering to spend more time building user-facing features.
Furthermore, our customers have provided us with great feedback, specifically regarding Linear’s ability to identify duplicate issues when they report a bug. Now, when a user creates a new issue, the application first suggests potential duplicates. Additionally, when handling customer tickets through customer support application integrations like Zendesk, Linear displays possible related bugs that have already been logged.
Looking ahead, we envision integrating machine learning (ML) into Linear to enhance the user experience, automate tasks, and offer intelligent suggestions within the product. We’re also committed to further developing our similarity search features, expanding beyond vector similarity to incorporate additional signals into our calculations. We firmly believe that Google Cloud will be instrumental in helping us realize this vision.
Get started:
Discover how Cloud SQL for PostgreSQL can help you run your business. Learn more about Memorystore for Redis.
Start a free trial today! New Google Cloud customers get $300 in free credits.
Read More for the details.
In machine learning, transforming raw data into meaningful features, a preprocessing step known as feature engineering, is a critical step. BigQuery ML has made significant strides in this area, empowering data scientists and ML engineers with a versatile set of preprocessing functions for feature engineering (see our previous blog). These transformations can even be seamlessly embedded within models, ensuring their portability beyond BigQuery to serving environments like Vertex AI. Now we are taking this a step further in BigQuery ML, introducing a unique approach to feature engineering: modularity. This allows for easy reuse of feature pipelines within BigQuery, while also enabling direct portability to Vertex AI.
A companion tutorial is provided with this blog — try the new features out today!
When creating a model in BigQuery ML, the CREATE MODEL statement has the option to include a TRANSFORM statement. This allows for custom specifications for converting columns from the SELECT statement into features of the model by using preprocessing functions. This is a great advantage because the statistics used for transformation are based on the data used at model creation. This provides consistency of preprocessing similar to other frameworks — like the Transform component of the TFX framework, which helps eliminate training/serving skew. Even without a TRANSFORM statement, automatic transformations are applied based on the model type and data type.
In the following example, an excerpt from the accompanying tutorial, there are preprocessing steps applied prior to input for imputing missing values. There is also embedded preprocessing with the TRANSFORM statement for scaling the columns. This scaling gets embedded with the model and applies to the input data, which is already imputed prior to input here. The advantage of the embedded scaling functions is that the model remembers the calculated parameters used in scaling to apply later on when using the model for inference.
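The tutorial’s exact statement isn’t reproduced here, but a minimal sketch of the pattern might look like the following (the dataset, model name, and label column are assumptions; the feature columns come from the public penguins dataset referenced later in this post):

-- Hedged sketch, not the tutorial's exact code: scaling is embedded in the model
-- via TRANSFORM, so the scaling parameters learned at training time are
-- reapplied automatically at prediction time.
CREATE OR REPLACE MODEL `my_project.my_dataset.penguin_classifier`
  TRANSFORM (
    ML.STANDARD_SCALER(body_mass_g)       OVER () AS body_mass_scaled,
    ML.STANDARD_SCALER(flipper_length_mm) OVER () AS flipper_length_scaled,
    species                                        -- label passes through untouched
  )
  OPTIONS (
    model_type = 'logistic_reg',
    input_label_cols = ['species']
  ) AS
SELECT body_mass_g, flipper_length_mm, species
FROM `my_project.my_dataset.penguins_imputed`;     -- already-imputed input table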
With the new ML.TRANSFORM table function, the feature engineering part of the model can be called directly. This enables several helpful workflows, including:
Process a table to review preprocessed featuresUse the transformations of one model to transform the inputs of another model
In the example below (from the tutorial), the ML.TRANSFORM function is applied directly to the input data without having to recalculate the scaling parameters using the original training data. This allows for efficient reuse of the transformations for future models, further data review, and for model monitoring calculations detecting skew and drift.
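A hedged sketch of such a call, reusing the hypothetical names from the sketch above:

-- Reapply the model's embedded TRANSFORM to new rows without retraining or
-- recomputing scaling statistics (useful for data review and skew/drift checks)
SELECT *
FROM ML.TRANSFORM(
  MODEL `my_project.my_dataset.penguin_classifier`,
  TABLE `my_project.my_dataset.penguins_new`);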
Take reusability to a completely modular state by creating transformation only models. This works like other models by using CREATE MODEL with a TRANSFORM statement and using the value model_type = TRANSFORM_ONLY. In other words, it creates a model object of just the feature engineering part of the pipeline. That means the transform model can be reused to transform inputs of any CREATE MODEL statement as well, even registering the model to the Vertex AI Model Registry for use in ML pipelines outside of BigQuery. You can even EXPORT the model to GCS for complete portability.
The following excerpt from the tutorial shows a regular CREATE MODEL statement being used to compile the TRANSFORM statement as a model. In this case, all the imputation steps are being stored together in a single model object that will remember the mean/median values from the training data and be able to apply them for imputation on future records — even at inference time.
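The tutorial’s statement isn’t reproduced here; a minimal sketch of a transform-only model along these lines (with hypothetical project, dataset, and model names) might be:

-- Hedged sketch: a transform-only "model" that captures the imputation steps.
-- The mean/median values are computed from this training data and stored with
-- the model object, so the same values are reused at inference time.
CREATE OR REPLACE MODEL `my_project.my_dataset.penguins_imputer`
  TRANSFORM (
    ML.IMPUTER(body_mass_g,       'mean')   OVER () AS body_mass_g,
    ML.IMPUTER(culmen_length_mm,  'mean')   OVER () AS culmen_length_mm,
    ML.IMPUTER(culmen_depth_mm,   'mean')   OVER () AS culmen_depth_mm,
    ML.IMPUTER(flipper_length_mm, 'median') OVER () AS flipper_length_mm
  )
  OPTIONS (
    model_type = 'transform_only',
    vertex_ai_model_id = 'penguins_imputer'  -- also registers it in the Vertex AI Model Registry
  ) AS
SELECT body_mass_g, culmen_length_mm, culmen_depth_mm, flipper_length_mm
FROM `bigquery-public-data.ml_datasets.penguins`;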
The TRANSFORM_ONLY model can be used like any other model with the same ML.TRANSFORM function we covered above.
With the modularity of TRANSFORM_ONLY models, it is possible to use more than one in a feature pipeline. The BigQuery SQL WITH clause (CTEs) makes the feature pipeline highly readable. This makes feature-level transformation models, much like a feature store, easy to use in a modular way.
As an example of this idea, first create a TRANSFORM_ONLY model for each individual feature: body_mass_g, culmen_length_mm, culmen_depth_mm, flipper_length_mm. Here, they are used to scale columns into features – just like the full model we created at the beginning.
For example, for body_mass_g (the statements for culmen_length_mm, culmen_depth_mm, and flipper_length_mm follow the same pattern):
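A hedged sketch of what such a per-feature statement might look like (hypothetical names, not the tutorial’s exact code):

-- Single-feature, transform-only scaler for body_mass_g. The scaling statistics
-- are computed once from the training data and stored with the model; the
-- equivalent statements for the other features differ only in the column name.
CREATE OR REPLACE MODEL `my_project.my_dataset.scale_body_mass_g`
  TRANSFORM (
    ML.STANDARD_SCALER(body_mass_g) OVER () AS body_mass_g
  )
  OPTIONS (
    model_type = 'transform_only',
    vertex_ai_model_id = 'scale_body_mass_g'
  ) AS
SELECT body_mass_g
FROM `bigquery-public-data.ml_datasets.penguins`;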
Now, with CTEs, a feature pipeline can be as easy as the following and even packaged as a view:
And creating the original model from above using this modular feature pipeline will look like the following which selects directly from the feature preprocessing pipeline created as a view above:
This level of modularity and reusability brings the activities of MLOps into the familiar syntax and flow of SQL.
But there are times when models need to be used outside of the data warehouse, for example online predictions or edge applications. Notice how the models above were created with the parameter VERTEX_AI_MODEL_ID. This means they have automatically been registered in the Vertex AI Model Registry where they are just a step away from being deployed to a Vertex AI Prediction Endpoint. Also, like other BigQuery ML models, these models can be exported to Cloud Storage by using the EXPORT MODEL statement for complete portability.
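As an illustration, a hedged sketch of exporting one of the hypothetical transform models above to Cloud Storage (bucket and path are made up):

-- Export the transform-only model to Cloud Storage so it can be served
-- outside of BigQuery, for example behind a Vertex AI Prediction endpoint.
EXPORT MODEL `my_project.my_dataset.penguins_imputer`
  OPTIONS (URI = 'gs://my-bucket/models/penguins_imputer/');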
BigQuery ML’s new reusable and modular feature engineering capabilities are powerful tools that can make it easier to build and maintain machine learning pipelines and power MLOps. With modular preprocessing, you can create transformation-only models that can be reused in other models or even exported to Vertex AI. This modularity even enables feature pipelines directly in SQL. This can save you time, improve accuracy, and prevent training/serving skew, all while simplifying maintenance. To learn more about feature engineering with BigQuery, try out the tutorial and read more about feature engineering with BigQuery ML.
Read More for the details.
Using Cloud SQL, our fully managed relational database service, is a powerful way to streamline your database operations and focus on innovation. Cloud SQL handles the complexities of database administration for you, delivering a robust and secure relational database platform that’s scalable and highly available, all the while simplifying management tasks and reducing operational costs.
As an open and fully managed database service, Cloud SQL supports multiple versions of the database engines it offers, allowing you to choose the version of MySQL, PostgreSQL, or Microsoft SQL Server that best suits your needs. While Cloud SQL offers this flexibility to maintain older versions of database engines, there are substantial advantages to staying current with the latest releases. Newer versions often bring performance enhancements, security upgrades, and expanded feature sets, empowering you to optimize your applications and safeguard your data. To maximize the benefits of Cloud SQL and ensure the long-term stability and security of your applications, it’s essential to move away from database engine versions that have reached their end of life (EOL).
In this blog, we will discuss key advantages as well as best practices for transitioning to a newer version of MySQL and PostgreSQL by leveraging Cloud SQL’s in-place major version upgrade feature. We will also discuss strategies to successfully perform a major version upgrade on your primary and replica instances.
Cloud SQL’s in-place major version upgrade feature is a built-in functionality that allows you to upgrade your MySQL or PostgreSQL database instance to a newer major version directly (a.k.a. in-place upgrade) within the Cloud SQL platform. This removes the need for manual data migration, complex configuration changes, and the associated lengthy downtime. Further, one of the biggest advantages of this approach is that you can retain the name, IP address, and other settings of your current instance after the upgrade.
We recommend you plan and test major version upgrades thoroughly. One of the strategies to test includes cloning the current primary instance and performing a major version upgrade on the clone. This will help iron out issues upfront, and give you the confidence to perform a production upgrade.
Cloud SQL’s major version upgrade feature varies slightly between MySQL and PostgreSQL. Please see the dedicated sections below for detailed information.
MySQL community version 5.7 reached end of life in October 2023. If you are still running MySQL 5.6 and 5.7, we recommend upgrading to MySQL 8.0, which offers next-generation query capabilities, improved performance, and enhanced security. For example:
MySQL 8.0’s instant DDL drastically speeds up table alterations while allowing concurrent DML changes.
InnoDB received optimizations for various workloads, including read-write, IO-bound, and high-contention scenarios.
SKIP LOCKED and NOWAIT options prevent lock waits.
Window functions in MySQL 8.0 simplify query logic, and CTEs enable reusable temporary result sets.
MySQL 8.0 enhances JSON functionality and adds robust security features.
Replication performance is significantly improved, leading to faster data synchronization. Parallel replication is enabled by default.
New features like descending indexes and invisible indexes contribute to further performance enhancements.
Please click here for more details.
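To make a few of these concrete, here is a hedged sketch using made-up table and column names:

-- Instant DDL: adding a column avoids a full table rebuild
ALTER TABLE orders ADD COLUMN promo_code VARCHAR(32), ALGORITHM=INSTANT;

-- SKIP LOCKED: claim the next available job without waiting on row locks
SELECT id FROM jobs
WHERE status = 'pending'
ORDER BY id
LIMIT 1
FOR UPDATE SKIP LOCKED;

-- CTE plus window function: running revenue total without self-joins
WITH daily AS (
  SELECT order_date, SUM(total) AS revenue
  FROM orders
  GROUP BY order_date
)
SELECT order_date,
       revenue,
       SUM(revenue) OVER (ORDER BY order_date) AS running_total
FROM daily;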
You can leverage Cloud SQL’s major version upgrade feature to upgrade to 8.0. Pre-check has already been incorporated into the workflow but you have the option to run it separately as well. You can use the Upgrade Checker Utility in the MySQL shell to run a pre-check. Before upgrading, review your current primary/replica topology and devise a plan accordingly.
Upgrade using major version upgrade: If you have a primary instance with no read replicas, you can upgrade the instance in-place with Cloud SQL’s major version upgrade feature. MySQL allows replication from lower to higher major versions. This is beneficial if you have read replicas, as you can upgrade your read replicas prior to upgrading the primary instance.
The diagram below shows the stages of a major version upgrade.
Note: In this scenario, IP addresses will be maintained.
Upgrade using cascading replicas: You can leverage cascading replicas along with major version upgrades for the scenarios below. This approach allows you to:
fall back to the old primary with its full topology intact
set up an entire new stack in a new zone or a new region in addition to the current deployment
For example, Everflow, a Google Cloud customer that makes a partner marketing platform, leveraged cascading replicas and in-place major version upgrade to orchestrate a smooth MySQL upgrade to 8.0, with minimal downtime or disruption for their users.
To perform the major version upgrade using cascading replicas, please refer to the diagram and perform the following steps.
1. Create a read replica from the current 5.7 primary instance either to an existing or new zone/region.
2. Upgrade the replica to 8.0 via the major version upgrade feature.
3. Enable replication and create replicas as needed under the new 8.0 read replica.
4. Prep the application for IP address changes in advance to minimize downtime.
5. Route traffic and prep the application for switching over to the new master. Cloud Load Balancing can help do this efficiently.
Note: Consider this a transition period and try to keep the time for version mismatch short.
6. When you’re ready, promote the 8.0 read replica.
7. Delete the old primary MySQL 5.7 instance.
Note: As mentioned earlier, the above process requires IP address changes to the application. Ideally, IP address changes should be done before promoting the new read replica to minimize disruption when the cutover is performed.
PostgreSQL updates major versions yearly with a five-year support window. PostgreSQL 11 reached end-of-life in November, 2023. While you can upgrade to PostgreSQL 12 or 13, considering PostgreSQL’s end-of-life policy, we recommend upgrading to PostgreSQL 14 or later versions. PostgreSQL 14 and subsequent versions introduce several new features and enhancements that provide significant benefits. Here are some of the highlights:
Performance improvements including parallel query execution for GROUP BY and JOIN operations, and faster VACUUM and REINDEX operations.
Enhancements to logical replication with support for filtering, row-level replay, and replication to multiple destinations, making it more flexible and scalable for various use cases.
Security enhancements and advanced features like improved JSON functionality and enhanced table partitioning.
For additional details click here.
If your database is on an older version, we recommend upgrading to a newer version. There are different strategies that can be used to accomplish this. Since PostgreSQL does not support cross-version replication, upgrading the primary instance while the instance is replicating to the read replicas is not possible. In addition, upgrading read replicas prior to the primary instance may not be feasible. Hence, the upgrade flow involves upgrading primary instances first. Before proceeding, replication needs to be disabled for existing replicas. After the primary has been upgraded, read replicas can be upgraded one by one and replication can be re-enabled. Alternatively, you can drop read replicas and recreate them after the primary instance has been upgraded.
Upgrade via MVU: We recommend leveraging Cloud SQL’s major version upgrade feature for upgrading to newer versions of PostgreSQL (14.0+). With in-place upgrades, you can retain the name, IP address, and other settings of your current instance after the upgrade. The Cloud SQL for PostgreSQL in-place upgrade operation uses the pg_upgrade utility. Please make sure to test upgrades on beta or staging environments first, or clone the instance as mentioned above before you proceed. Cloud SQL for PostgreSQL major version upgrade performs pre-validation steps and backups on your behalf.
Upgrading to MySQL 8.0 or PostgreSQL 14 or 15 unlocks the ability to perform a quick in-place upgrade to Cloud SQL Enterprise Plus Edition, which is a powerhouse of advanced functionality. Cloud SQL Enterprise Plus Edition offers:
99.99% availability SLA inclusive of maintenance
Near-zero downtime planned maintenance with <10s instance downtime
Up to 3x faster throughput with the optional Data Cache and faster hardware for larger scale and optimal performance
Support for larger machine configurations with up to 128 vCPU and 864 GB of memory, compared to 96 vCPU and 624 GB in Enterprise edition
Support for up to 35 days of Point In Time Recovery (PITR) compared to seven days in Enterprise edition
In-place upgrade to Cloud SQL Enterprise Plus edition takes just a few minutes with a downtime of less than 60 seconds. To learn more, click here.
Let’s revisit why upgrading makes sense: It’s an investment in the security, performance, and capabilities of your database infrastructure. By embracing the latest advancements, you can safeguard your data, optimize your applications, and empower your organization. With Cloud SQL’s in-place major version upgrade feature, you can perform a streamlined and efficient upgrade of your databases, ensuring a smooth transition to the latest version. Click here to get started.
Read More for the details.
When troubleshooting distributed applications that are made up of numerous services, traces can help with pinpointing the source of the problem, so you can implement quick mitigating measures like rollbacks. However, not all application issues can be mitigated with rollbacks, and you need to undertake a root-cause analysis. Application logs often provide the level of detail necessary to understand code paths taken during abnormal execution of a service call. As a developer, the challenge is finding the right logs.
Let’s take a look at how you can use Cloud Trace, Google Cloud’s distributed tracing tool, and Cloud Logging together to help you perform root-cause analysis.
Imagine you’re a developer working on the Customer Relationship Management service (CRM) that is part of a retail webstore app. You were paged because there’s an ongoing incident for the webstore app and the error rate for the CRM service was spiking. You take a look at the CRM service’s error rate dashboard and notice a trace exemplar that you can view in Cloud Trace:
The Trace details view in Cloud Trace shows two spans with errors: update_user and update_product. This leads you to suspect that one of these calls is part of the problem. You notice that the update_product call is part of your CRM service and check to see if these errors started happening after a recent update to this service. If there’s a correlation between the errors and an update to the service, rolling back the service might be a potential mitigation.
Let’s assume that there is no correlation between updates to the CRM service and these errors. In this case, a rollback may not be helpful and further diagnosis is needed to understand the problem. A next possible step is to look at logs from this service.
The Trace details view in Cloud Trace allows users to select different views for displaying logs within the trace — selecting “Show expanded” displays all related logs under their respective spans.
In this example, you can see that there are three database-related logs under the update_product span. After retrying a few times, the attempts to connect to the database from the CRM service have failed.
Behind the scenes, Cloud Trace is querying Cloud Logging to retrieve logs that are both in the same timeframe as the trace and reference the traceID and the spanID. Once retrieved, Cloud Trace presents these logs as child nodes under the associated span, which makes the correlation between the service call and the logs emitted during the execution of that service very clear.
You know that other services are connecting to the same database successfully, so this is likely a configuration error. You check to see if there were any config updates to the database connection from the CRM service and notice that there was one recently. Reviewing the pull request for this config update leads you to believe that an error in this config was the source of the issue. You quickly update the config and deploy it to production to address the issue.
In the above example, Cloud Trace and Cloud Logging work together to combine traces and logs into a powerful way to perform root cause analysis when mitigating measures like rollbacks are not enough.
If you’re curious about how to instrument properly for logs and trace correlation to work, here are some examples:
You can also get started by trying out OpenTelemetry instrumentation with Cloud Trace in this codelab or by watching this webinar.
Read More for the details.
Neo4j provides a graph database that offers capabilities for handling complex relationships and traversing vast amounts of interconnected data. Google Cloud complements this with robust infrastructure for hosting and managing data-intensive workloads. Together, Neo4j and Google Cloud have developed a new Dataflow template, Google Cloud to Neo4j (docs, guide), that you can try from the Google Cloud console.
In this blog post, we discuss how the Google Cloud to Neo4j template can help data engineers and data scientists who need to streamline the movement of data from Google Cloud to Neo4j database, to enable enhanced data exploration and analysis with the Neo4j database.
Many customers leverage BigQuery, Google Cloud’s fully managed and serverless data warehouse, and Cloud Storage to centralize and analyze diverse data from various source systems, regardless of formats. This integrated approach simplifies the complex task of managing data from different sources while maintaining stringent security measures. With the ability to store and process data efficiently in one location, organizations can analyze, forecast, and predict trends, yielding valuable insights for informed decision-making. BigQuery is the linchpin for aggregating and analyzing data. Read on to see how the Google Cloud to Neo4j Dataflow template streamlines the movement of data from BigQuery and Cloud Storage to Neo4j’s Aura DB, a fully managed cloud graph database service running on Google Cloud.
Unlike typical data integration methods like Python-based notebooks and Spark environments, Dataflow simplifies the process entirely, and doesn’t require any coding. It’s also free during idle periods, and leverages Google Cloud’s security framework for enhanced trust and reliability of your data workflows.
Dataflow is a strong solution for orchestrating data movement across diverse systems. As a managed service, Dataflow caters to an extensive array of data processing patterns, enabling customers to easily deploy batch and streaming data processing pipelines. And to simplify data integration, Dataflow offers an array of templates tailored to various source systems.
Fig 1: Architecture Diagram of Dataflow from Google Cloud to Neo4j
With the Google Cloud to Neo4j template, you can opt for the flex or classic template. For this illustration, we employ the flex template, which leverages just two configuration files: the Neo4j connection metadata file, and the Job Description file.
The Neo4j partner GitHub repository provides a wealth of resources that show how to use this template. The repository houses sample configurations, screenshots and all the instructions required to set up the data pipeline. Additionally, there are step-by-step instructions that guide you through the process of transferring data from BigQuery to a Neo4j database.
Once you have these two configuration files (the Neo4j connection metadata file and the job description file), you are ready to use the Dataflow template to move data from Google Cloud to Neo4j; the job can be configured directly from the Dataflow configuration page in the Google Cloud console.
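If you prefer to launch the job programmatically rather than from the console page, here is a minimal sketch using the Dataflow API. The containerSpecGcsPath and the parameter names (jobSpecUri, neo4jConnectionUri) are assumptions here; check the template documentation for the exact values.

```python
# Sketch: launch the Google Cloud to Neo4j flex template via the Dataflow API.
# The containerSpecGcsPath and parameter names (jobSpecUri, neo4jConnectionUri)
# are assumptions; verify them against the template documentation.
from googleapiclient.discovery import build

project, region = "my-project", "us-central1"  # placeholders
dataflow = build("dataflow", "v1b3")

response = dataflow.projects().locations().flexTemplates().launch(
    projectId=project,
    location=region,
    body={
        "launchParameter": {
            "jobName": "bigquery-to-neo4j",
            "containerSpecGcsPath": (
                f"gs://dataflow-templates-{region}/latest/flex/Google_Cloud_to_Neo4j"
            ),
            "parameters": {
                "jobSpecUri": "gs://my-bucket/job-spec.json",                  # job description file
                "neo4jConnectionUri": "gs://my-bucket/neo4j-connection.json",  # connection metadata file
            },
        }
    },
).execute()

print(response["job"]["id"])
```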
You can find the detailed documentation on this Dataflow template on the Neo4j documentation portal. Please refer to the following links: Dataflow Flex Template for BigQuery to Neo4j and Dataflow Flex Template for Google Cloud to Neo4j.
The Google Cloud to Neo4j Dataflow template makes it easier to use Neo4j’s graph database with Google Cloud’s data processing suite. To get started, check out the following resources:
- Explore Neo4j within the Google Cloud Marketplace.
- Review the Google Cloud documentation on the Dataflow template.
- Walk through the step-by-step guide for setting up your pipeline and creating Neo4j config files that can be passed into the pipeline.
- Jump to the Cloud console to create your first job now!
Read More for the details.
Deutsche Bank is the leading German bank with strong European roots and a global network. The bank provides financial services to companies, governments, institutional investors, small and medium-sized businesses and private individuals.
For its German retail banking business, the bank recently completed the consolidation of two separate IT systems — Deutsche Bank and Postbank — to create one modern IT platform. This migration of roughly 19 million Postbank product contracts alongside the data of 12 million customers into the IT systems of Deutsche Bank was one of the largest and most complex technology migration projects in the history of the European banking industry.
As part of this modernization, the bank opted to design an entirely new online banking platform, partnering with Google Cloud for its migration from traditional on-premises servers to the cloud. An integral capability enabling this migration, already apparent in the first production rollout for 5 million Postbank customers, is Spanner, Google Cloud's fully managed database service. Spanner's high availability, external consistency, and virtually unlimited horizontal scalability made it the ideal choice for this business-critical application. Read on to learn about the benefits that Deutsche Bank achieved from migrating to Spanner, and some best practices it developed to reliably and efficiently scale the platform.
Scaling in high-availability environments can be challenging, but Spanner does all the heavy lifting for Deutsche Bank. Spanner scales horizontally to virtually any size, allowing Deutsche Bank to start small and easily scale up and down as needed.
In a traditional on-prem project, fixed resources would have been assigned to the online banking databases, provisioned generously enough to respond to customer requests quickly even during peaks. In such a setup, the resources remain unused most of the time, as the online banking load profile varies over the course of the day (more specifically, with the number of online users at a given time). Traffic is low overnight, increases sharply in the morning to a high load throughout the day, and drops again in the evening hours. Spanner supports elasticity with horizontal scaling based on nodes that can be added and removed at any time, without disrupting any active workloads.
The number of nodes can be changed via the Google Cloud console, gcloud, or the REST API. For automation, Google Cloud provides an open-source Autoscaler that runs entirely on Google Cloud. The bank used the Autoscaler in all environments (including non-production environments) to maximize cost-efficiency while still ensuring sufficient Spanner capacity for a seamless user experience.
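For illustration, the resize itself is a single call to the instance admin API. Here is a minimal Python sketch (with placeholder project and instance IDs) of the kind of operation the Autoscaler performs on your behalf:

```python
# Sketch: resize a Spanner instance with the instance admin API, the same kind
# of call the Autoscaler issues. Project and instance IDs are placeholders.
from google.cloud import spanner_admin_instance_v1
from google.protobuf import field_mask_pb2

client = spanner_admin_instance_v1.InstanceAdminClient()
name = "projects/my-project/instances/online-banking"  # placeholder

operation = client.update_instance(
    instance=spanner_admin_instance_v1.Instance(name=name, node_count=5),
    field_mask=field_mask_pb2.FieldMask(paths=["node_count"]),
)
operation.result(timeout=300)  # long-running operation; existing workloads keep running
```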
For any component subject to high-availability requirements, the autoscaler that manages it must be highly available, too. Below are some of the bank's experiences: lessons learned from running the Autoscaler, and contributions that will soon be given back to the open-source community.
By default, the Autoscaler checks Spanner instances once per minute. To scale out as early as possible, this interval can be shortened, which increases the frequency at which the Autoscaler queries the Cloud Monitoring API. This change, along with choosing the right scaling methods, helped the bank fulfill its latency service level objectives.
Projects running a high-availability GKE cluster should consider deploying the Spanner Autoscaler on GKE rather than on Cloud Functions, because it can then be deployed to multiple regions, which mitigates issues potentially caused by a regional outage. To avoid race conditions between the poller pods, simple semaphore logic can be added so that only one pod manages the Spanner resources at any given time. This is straightforward, since the Autoscaler already persists its state in either Firestore or Spanner.
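A minimal sketch of such a semaphore, assuming the Autoscaler state lives in Firestore (the document path and field names below are illustrative, not part of the Autoscaler itself), could look like this:

```python
# Sketch of a lease-based semaphore so that only one poller pod acts at a time.
# Assumes the Autoscaler state lives in Firestore; the document path and field
# names are illustrative, not part of the Autoscaler itself.
import datetime
from google.cloud import firestore

LEASE_DOC = "spannerAutoscaler/leader-lease"  # hypothetical document path
LEASE_TTL = datetime.timedelta(minutes=2)

db = firestore.Client()

@firestore.transactional
def try_acquire_lease(transaction, pod_name: str) -> bool:
    """Return True if this pod holds (or just acquired) the lease."""
    ref = db.document(LEASE_DOC)
    snapshot = ref.get(transaction=transaction)
    now = datetime.datetime.now(datetime.timezone.utc)
    data = snapshot.to_dict() if snapshot.exists else {}
    expires = data.get("expiresAt")
    if expires is None or expires < now or data.get("holder") == pod_name:
        transaction.set(ref, {"holder": pod_name, "expiresAt": now + LEASE_TTL})
        return True
    return False  # another pod currently holds the lease

# if try_acquire_lease(db.transaction(), "poller-pod-0"):
#     run_polling_cycle()
```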
Customizing the Spanner Autoscaler does not require rocket science. All changes can be made without touching the Autoscaler's poller-core or scaler-core. Semaphore handling and monitoring integration can be implemented in custom wrappers, like the wrappers provided by Google Cloud in the respective poller and scaler folders. For a multi-cluster deployment, you can adapt the example kpt files or add custom Helm charts, selecting the option that best suits your needs.
When multiple teams are working with Spanner instances, it can be inconvenient to redeploy the Autoscaler each time the scaling configuration changes. To avoid this, Deutsche Bank fetches the instance configuration from sources external to the image and deployment.
There are two ways to do this:
- Store the configuration separately from the instance, e.g., in Cloud Storage
- Add the configuration to the instance itself, e.g., by setting appropriate Spanner instance labels via Terraform
To read the instance configuration and build the poller's internal instances configuration on the fly, the Google Cloud client libraries provide convenient methods for listing and accessing either files in buckets or Spanner instances, along with their metadata such as labels.
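As a rough sketch of the label-based variant (the label keys such as autoscaler-min-pu and the configuration field names are illustrative assumptions, not part of the Autoscaler), a wrapper could build the poller's instance list like this:

```python
# Sketch: build the poller's instance configuration from Spanner instance labels.
# The label keys and configuration field names are illustrative assumptions;
# align them with the Autoscaler's actual configuration schema.
from google.cloud import spanner_admin_instance_v1

client = spanner_admin_instance_v1.InstanceAdminClient()
parent = "projects/my-project"  # placeholder

config = []
for inst in client.list_instances(parent=parent):
    labels = dict(inst.labels)
    if labels.get("autoscaled") != "true":
        continue  # only instances explicitly opted in are autoscaled
    config.append({
        "projectId": "my-project",
        "instanceId": inst.name.split("/")[-1],
        "minSize": int(labels.get("autoscaler-min-pu", "100")),
        "maxSize": int(labels.get("autoscaler-max-pu", "2000")),
        "scalingMethod": labels.get("autoscaler-method", "LINEAR"),
    })
```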
If you are using Terraform, it's a good idea to exclude the Spanner instance's processing units from Terraform management after the instance has been created. Otherwise, any terraform apply run would reset the autoscaled processing units to the fixed value recorded in the Terraform state. Terraform's lifecycle ignore_changes meta-argument does the trick.
The Autoscaler default metrics work well for most use cases. In special cases where scaling needs to be based on different parameters, custom metrics can be configured on an instance level.
A decoupled configuration makes it easy to create custom metrics and test them upfront. By making the custom metric part of the compiled image, using it on an instance level becomes less error prone. By following this approach, scaling a particular instance won’t accidentally stop because of a typo made in a metric definition during a configuration change.
By default, the Autoscaler bases scaling decisions on current storage utilization, 24-hour rolling CPU load, and current high-priority CPU load. In cases where scaling should be based on different parameters, e.g., medium-priority CPU load, custom metrics can be set up in less than a minute.
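As an illustration only (the field names mirror the Autoscaler's JSON configuration but should be verified against the current documentation, and the project and instance IDs are placeholders), an instance-level configuration with a medium-priority CPU custom metric could be sketched like this:

```python
# Illustration only: an instance-level Autoscaler configuration with a custom
# metric for medium-priority CPU. Field names mirror the Autoscaler's JSON
# configuration but should be verified against the current documentation.
import json

instance_config = {
    "projectId": "my-project",       # placeholder
    "instanceId": "online-banking",  # placeholder
    "units": "PROCESSING_UNITS",
    "minSize": 1000,
    "maxSize": 10000,
    "scalingMethod": "LINEAR",
    "metrics": [
        {
            "name": "medium_priority_cpu",
            "filter": (
                'metric.type="spanner.googleapis.com/instance/cpu/utilization_by_priority" '
                'AND metric.label.priority="medium"'
            ),
            "regional_threshold": 60,
            "multi_regional_threshold": 45,
        }
    ],
}

with open("autoscaler-config.json", "w") as f:
    json.dump([instance_config], f, indent=2)
```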
One minor shortcoming of the Autoscaler is its inability to compensate for sudden load peaks in real time. To prepare for expected peaks, it is advisable to temporarily raise the configured minimum number of processing units. This is easy to implement once the Autoscaler's instance configuration is decoupled from the Autoscaler image.
If changing the configuration isn't an option, you can either send a POST request to the scaler's metric endpoint or script gcloud commands to update the timestamps for the last scaling operation in the Autoscaler's state database and set the instance processing units directly. The first approach may cause concurrent scaling operations, in which case you should be aware of the Autoscaler's internal cooldown settings: by default, the Autoscaler waits 30 minutes after a scale-in event and 5 minutes after a scale-out event before scaling again. The second approach fixes the processing units at any value of your choice for n minutes by manipulating the state database timestamps.
The open-source Autoscaler is a valuable tool for balancing cost control and performance needs when using Spanner. Autoscaler automatically scales your database instances up and down based on load to avoid over-provisioning, increasing cost savings.
The Autoscaler is easy to set up and runs on Google Cloud. Google provides the Autoscaler as open source, which allows full customization of the scaling logic. The core project team at Deutsche Bank worked closely with Google to further improve the tool’s stability and is excited to contribute its enhancements back to the open source community in the near future.
To learn more about the open-source Autoscaler for Spanner, follow the official documentation. You can read more about the Deutsche Bank and Google Cloud partnership in the official Deutsche Bank press release.
Read More for the details.
Imagine that you’re an engineer at the company Acme Corp and you’ve been tasked with some big projects: integrating and delivering software using CI/CD and automation, as well as implementing data-driven metrics and observability tools. But many of your fellow engineers are struggling because there’s too much cognitive load — think deploying and automating Kubernetes clusters, configuring CI/CD pipelines, and worrying about security. You realize that to support the scale and growth of your company, you have to think differently about solving these challenges. This is where platform engineering might help you.
Platform engineering is “the practice of planning and providing such computing platforms to developers and users and encompasses all parts of platforms and their capabilities — their people, processes, policies and technologies; as well as the desired business outcomes that drive them,” writes the Cloud Native Computing Foundation (CNCF). This emerging discipline incorporates lessons learned from the DevOps revolution, recent Cloud Native developments in Kubernetes and serverless, as well as advances in observability and SRE.
A career in platform engineering, meanwhile, means becoming part of a product team focused on delivering software, tools, and services. Whether you’re just starting your IT career as a young graduate or you’re already a highly experienced developer or engineer, platform engineering offers growth opportunities and the ability to gain new technical skills.
Read on for an overview of the platform engineering field, including an introduction to what platform engineers do and the skills required. We also discuss the importance of user-centricity and having a product mindset, and provide some tips for setting goals and avoiding common pitfalls.
So, what are some of the things that are expected of a platform engineer? Generally speaking, the role requires a mix of technical and people skills: job-related competencies that are necessary to complete the work, as well as personal qualities and traits that shape how the role is approached. You can learn some of them to get started down the platform engineering career path; however, there is no expectation that you need to know all of them to be successful, as these skill sets are often distributed across your team. Here are some of the different attributes of a platform engineer:
- Takes a customer-centric approach: being a reliable partner for engineering groups, sharing knowledge, and working with other teams, including software developers, SREs, and Product Managers
- Familiar with DevSecOps practices
- Avid learner, problem solver, detail-oriented, and able to communicate effectively across teams
- Able to articulate the benefits of the platform engineering approach to fellow colleagues and engineers
- Applies a product mindset for the platform, e.g., using customer user journeys and friction logs
Given its particular significance in the Platform Engineering realm, let’s delve into the customer-centric approach from the list above.
If platforms are first and foremost a product, as the CNCF Platforms White Paper suggests, the focus is on its users. From the Google DORA Research 2023 we know that user focus is key: “Teams that focus on the user have 40% higher organizational performance than teams that don’t.”
At Google we believe that if we focus on the user everything else will follow — a key part of our philosophy. Having an empathetic user mindset requires a deep understanding of the needs and demands of your users, which is achieved through interviews, systematic statistics, metrics and data. You gather the data by focusing on both quantitative and qualitative metrics.
For example, you might decide to adopt Google’s HEART (Happiness, Engagement, Adoption, Retention, Task Success) framework, covered in detail in this whitepaper. As a platform engineer, you might be especially interested in the perceived “happiness” of your users with the offered platform services; you probably also want to measure and track platform adoption as well as (potential) retention of the offerings. Why are users coming to you or leaving? What is missing and could be improved in the next platform design sprint? Or perhaps you might want to create a friction log that documents the hurdles your users face when using your platform services. Ideally you can also become your own customer and use your own platform offerings, engaging with the friction log and users’ journeys through the platform.
The platform engineering design loop
We believe that an effective way to start thinking about platform engineering is to imagine a platform engineering design loop with you as a platform engineer at the center of it. You improve your customer focus by conducting user research that helps you understand their priorities better. You build empathy for them by documenting friction logs and other types of experiments. The platform backlog is where your team makes decisions on the engineering product portfolio, focusing on the platform's contribution to your company's value streams. A product mindset helps you understand users' needs, maintain a clear vision and roadmap, prioritize user features and documentation, and stay open to product enhancements. Finally, once you have delivered the initial release of your platform, you continue iterating in this loop, making the platform better with every iteration.
All this being said, a platform engineer performs a variety of tasks within a larger platform engineering group. Of course, nobody can do everything and you will require specialization, but here are some of the topics you might want to focus on:
Google Cloud services
- Container runtimes: Google Kubernetes Engine, Cloud Run
- Compute runtimes: Compute Engine, Google Cloud VMware Engine
- Databases: Spanner, Bigtable, Cloud SQL
- Build and maintain the internal developer portal
- Support developer tooling: Cloud Workstations
- Maintain CI/CD: Cloud Build, Cloud Deploy, and Artifact Registry
- Implement compliance as code for selected supported golden paths using Infrastructure Manager and Policy Controller, helping to reduce cognitive load on developers and allowing for faster time to deployment
Architecture
- Gain a deep understanding of infrastructure and application architecture
- Co-write with developers and support the golden paths through the use of Infrastructure as Code
- Create fantastic documentation, as explained, for example, in our courses on technical writing. Don't forget that architecture decision records are a key part of your engineering documentation.
Operations and reliability
- Site Reliability Engineering – Adopt best practices for reliable operations of your platform
- Security Engineering – Compliance, horizontal controls, and guardrails for your platform
Engineering backlog
- Use a backlog to list outstanding tasks and prioritize a portfolio of engineering work. The bulk of the focus should be on resolving the backlog of requests, with some additional time set aside for both continuous improvement and experimentation.
- Experimenting and innovating with new technology – This is an essential task for platform engineers, for example learning new services and features to better improve your platform.
Our industry has been focusing a lot on shifting complexity “left” to allow for better tested, integrated and secure code. In fact, here at Google we strongly believe that in addition to that, a platform effort can help you “shift down” this complexity. Of course, nobody can do everything (think about “cognitive load”), not even a superstar platform engineer like you!
In addition to all the things that newly minted platform engineers should be doing, here are some things not to do:
These are just some of the common pitfalls that we’ve seen so far.
Platform engineers are essential to the success of a modern enterprise software strategy, responsible for creating and maintaining the platforms that developers use to build and deploy applications. In today's world, where software is constantly evolving, platform engineers are a key force in providing scalable software services while keeping users as their primary focus. They carefully study the demands and needs of their internal customers, combining their technology expertise with knowledge of the latest developments in the industry.
Finally, here are some further resources to aid in your platform engineering learning journey.
- The book Software Engineering at Google covers creating a sustainable software ecosystem by diving into culture, processes, and tools
- Google SRE Books and workshops
- DORA.dev – research into the capabilities that drive software delivery and operations performance
- Google Cloud certifications: Cloud Architect, Cloud DevOps Engineer, Cloud Developer, Cloud Security Engineer, Cloud Network Engineer
Read More for the details.
At Google Cloud, we work to support a thriving cloud ecosystem that is open, secure, and interoperable. When customers’ business needs evolve, the cloud should be flexible enough to accommodate those changes.
Starting today, Google Cloud customers who wish to stop using Google Cloud and migrate their data to another cloud provider and/or on-premises infrastructure can take advantage of free network data transfer to migrate their data out of Google Cloud. This applies to all customers globally. You can learn more here.
Eliminating data transfer fees for switching cloud providers will make it easier for customers to change their cloud provider; however, it does not solve the fundamental issue that prevents many customers from working with their preferred cloud provider in the first place: restrictive and unfair licensing practices.
Certain legacy providers leverage their on-premises software monopolies to create cloud monopolies, using restrictive licensing practices that lock in customers and warp competition.
The complex web of licensing restrictions includes picking and choosing who their customers can work with and how; charging 5x the cost if customers decide to use certain competitors' clouds; and limiting interoperability of must-have software with competitors' cloud infrastructure. These and other restrictions have no technical basis and may impose a 300% cost increase on customers. In contrast, the cost for customers to migrate data out of a cloud provider is minimal.
Making it easier for customers to move from one provider to another does little to improve choice if customers remain locked in with restrictive licenses. Customers should choose a cloud provider because it makes sense for their business, not because their legacy provider has locked them in with overly restrictive contracting terms or punitive licensing practices.
The promise of the cloud is to allow businesses and governments to seamlessly scale their technology use. Today’s announcement builds on the multiple measures in recent months to provide more value and improve data transfer for large and small organizations running workloads on Google Cloud.
We will continue to be vocal in our efforts to advocate on behalf of our cloud customers — many of whom raise concerns about legacy providers’ licensing restrictions directly with us. Much more must be done to end the restrictive licensing practices that are the true barrier to customer choice and competition in the cloud market.
Read More for the details.