Cloud

2021 09 22

AWS – Amazon EC2 Fleet instant mode now supports targeted Amazon EC2 On-Demand Capacity Reservations

Starting today, you can use EC2 Fleet with targeted On-Demand Capacity Reservations. On-Demand Capacity Reservations enable you to reserve compute capacity for your Amazon EC2 instances in a specific Availability Zone for any duration. For targeted Capacity Reservations, instances must specifically target the Capacity Reservation to run in the reserved capacity. Until now, there was no option to use targeted Capacity Reservations when launching an EC2 Fleet.

Read More for the details.

2021 09 22

AWS – AWS IoT Device Defender announces Audit One-Click

AWS, Cloud AWS

Today we are launching Audit One-Click for AWS IoT Device Defender. Audit One-Click makes it easy for AWS IoT Core customers to improve their security baseline by making it possible to start auditing their account and IoT devices against security best practices with a single click.

Read More for the details.

2021 09 22

AWS – AWS Single Sign-On is now available in the AWS GovCloud (US-West) Region

AWS, Cloud AWS

AWS Single Sign-On is now available in the AWS GovCloud (US-West) Region. For a full list of the regions where AWS SSO is available, see the AWS Regional Services List.

Read More for the details.

2021 09 22

AWS – Amazon Lex is now available in the Asia Pacific (Seoul) and Africa (Cape Town) regions

AWS, Cloud AWS

Starting today, Amazon Lex is available in the Asia Pacific (Seoul) and Africa (Cape Town) regions. Amazon Lex is a service for building conversational interfaces into any application using voice and text. Amazon Lex combines the deep learning capabilities of automatic speech recognition (ASR), which converts speech to text, and natural language understanding (NLU), which recognizes the intent of the text. This enables you to build applications with engaging user experiences and lifelike interactions. With Amazon Lex, you can easily create sophisticated, natural language, conversational bots (“chatbots”), virtual agents, and IVR systems.

Read More for the details.

2021 09 22

AWS – Amazon EMR Studio now supports multi-language Jupyter-based notebooks for Spark workloads

AWS, Cloud AWS

EMR Studio is an integrated development environment (IDE) that makes it easy for data scientists and data engineers to develop, visualize, and debug big data and analytics applications written in R, Python, Scala, and PySpark. Today, we are excited to announce that with EMR 6.4.0 and later, you can use Python, Scala, SparkSQL, and R within the same Jupyter notebook in EMR Studio, giving you the flexibility to use different programming languages for Spark workloads.

Read More for the details.

2021 09 22

AWS – AWS Ground Station announces Licensing Accelerator

AWS, Cloud AWS

AWS is announcing Licensing Accelerator, a new AWS Ground Station feature that gives commercial businesses, space start-ups, and universities access to resources that help them more efficiently secure the spectrum licenses required for their operations and missions. Licensing Accelerator is free of charge for AWS Ground Station customers. AWS Ground Station is a fully managed service that lets customers control satellite communications, process satellite data, and scale their satellite operations. With Licensing Accelerator, AWS Ground Station customers can launch and scale their spacecraft operations faster by leveraging the latest, centrally located information about satellite licensing regulations such as space station licensing, remote sensing licenses, and International Telecommunication Union (ITU) coordination.

Read More for the details.

2021 09 22

AWS – Amazon ECR adds the ability to replicate individual repositories to other regions and accounts

AWS, Cloud AWS

Today, Amazon Elastic Container Registry (ECR) launched the ability to replicate specific repositories to other accounts or regions, and to see when images were replicated through the ECR API. This gives you granular control to replicate images within only the repositories you choose, instead of replicating all images in a registry, and the ability to automate actions through the new DescribeImageReplicationStatus API whenever images are replicated.

Read More for the details.

2021 09 22

Azure – Public preview: At-scale management of Azure Monitor alerts in Backup center

Azure, Cloud Azure

You can now manage your Azure Monitor alerts via Backup center.

Read More for the details.

2021 09 22

Azure – Azure Database for PostgreSQL – Flexible Server: Terraform support in public preview

Azure, Cloud Azure

Generate automated configuration files used with Terraform to automate provisioning and configuration on your Flexible Server for Azure Database for PostgreSQL, a managed service running the open source Postgres database.

Read More for the details.

2021 09 22

Azure – Azure Database for PostgreSQL – Flexible Server: Azure Pipelines in public preview

Azure, Cloud Azure

Deploy a SQL file or inline script to push changes to one or more databases with Azure Pipelines for Azure Database for PostgreSQL – Flexible Server using Azure CLI tasks.

Read More for the details.

2021 09 22

Azure – Azure Resource Health for Azure Database for PostgreSQL – Flexible Server in public preview

Azure, Cloud Azure

With Azure Resource Health, diagnose and get support for service problems that affect Flexible Server for Azure Database for PostgreSQL, a managed service running the open source Postgres database.

Read More for the details.

2021 09 22

Azure – Azure Database for MySQL: Azure Pipelines support in public preview

Azure, Cloud Azure

Deploy a SQL file or inline script to push changes to one or more databases with Azure Pipelines for Azure Database for MySQL – Flexible Server using Azure CLI tasks.

Read More for the details.

2021 09 22

Azure – Azure Database for PostgreSQL – Hyperscale (Citus) now includes PgBouncer version 1.16

Azure, Cloud Azure

Create a Hyperscale (Citus) server group and use PgBouncer 1.16 as a part of this server group by connecting to port 6432 for connection to the coordinator.

Read More for the details.

2021 09 22

Azure – Azure SQL Database: General availability updates for late September 2021

Azure, Cloud Azure

General availability enhancements and updates released for Azure SQL in late September 2021.

Read More for the details.

2021 09 22

Azure – Public preview: Distributed tracing for Java apps on Azure Functions Linux

Azure, Cloud Azure

Distributed tracing for Java apps on Azure Functions can now be enabled through the Azure portal. This integration provides additional insights into end-to-end transactions that were not previously supported, completes the application map, which aggregates many transactions to show a topological view of your system, and lets you detect and diagnose performance bottlenecks.

Read More for the details.

2021 09 22

Azure – Announcing general availability of Azure AD-joined VMs support

Azure, Cloud Azure

You can now deploy Azure AD-joined VMs in your host pools for Azure Virtual Desktop.

Read More for the details.

2021 09 22

Azure – Azure Functions runtime 4.0 is now in public preview

Azure, Cloud Azure

Azure Functions 4.0 includes support for .NET 6.

Read More for the details.

2021 09 22

GCP – DynamoDB to Cloud Spanner via HarbourBridge

Cloud, Google Cloud gcp

Today, we would like to announce that HarbourBridge, the open source toolkit that automates much of the migration effort to Cloud Spanner (including evaluation and assessment), supports DynamoDB in addition to its existing support for PostgreSQL and MySQL. This allows DynamoDB users to try out Cloud Spanner with zero configuration. HarbourBridge helps users quickly resolve issues during schema and data migration and lets them try out Cloud Spanner as soon as possible.

Workflow

HarbourBridge can now automatically load data from multiple DynamoDB tables into Cloud Spanner. It first builds a Cloud Spanner schema by inspecting the data in DynamoDB. HarbourBridge then creates a Cloud Spanner database using this schema and populates the database with the data from the DynamoDB tables. In addition, it generates a detailed assessment report that includes all issues encountered and rows that failed to migrate. For DynamoDB, HarbourBridge connects directly to the source table using public APIs. The tool is designed for assessment, evaluation, and migration, and is best used for up to tens of GB.

How HarbourBridge converts the schema

It is challenging to migrate from a schemaless database to a relational database. We need to resolve the following issues for schema conversion in our DynamoDB support:

Schemaless – DynamoDB is a schemaless database: other than a primary index and optional secondary index, column names and types are essentially unconstrained and can vary from one row to the next. However, many customers use DynamoDB in a consistent, structured way with a fairly well-defined set of columns and types. HarbourBridge’s support for DynamoDB focuses on this use case, and we construct a Spanner schema by inspecting table data. For small tables, we inspect all rows of the table. For large tables, scanning the entire table would be extremely expensive and slow, so we only inspect the first N rows from the table scan (N is defined by the flag schema-sample-size, with a default value of 10,000). In practice, this gives a reasonable sample of data to work with since DynamoDB scans don’t return results in order. In principle, random row selection could provide a more representative sample, but it would be much more expensive.
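The sampling step described above can be sketched in a few lines of Python. This is only an illustration, not HarbourBridge's code (the tool itself is written in Go): the names `sample_rows` and `observed_types` are hypothetical, rows are assumed to arrive as plain dictionaries, and Python type names stand in for DynamoDB types.

```python
from itertools import islice
from typing import Any, Dict, Iterable, List

# Mirrors the documented default of the schema-sample-size flag.
SCHEMA_SAMPLE_SIZE = 10_000


def sample_rows(scan: Iterable[Dict[str, Any]],
                sample_size: int = SCHEMA_SAMPLE_SIZE) -> List[Dict[str, Any]]:
    """Return at most sample_size rows from a (possibly huge) table scan."""
    return list(islice(scan, sample_size))


def observed_types(rows: Iterable[Dict[str, Any]]) -> Dict[str, Dict[str, int]]:
    """Count how often each type name appears in each column of the sample."""
    counts: Dict[str, Dict[str, int]] = {}
    for row in rows:
        for column, value in row.items():
            per_column = counts.setdefault(column, {})
            type_name = type(value).__name__  # stand-in for a DynamoDB type
            per_column[type_name] = per_column.get(type_name, 0) + 1
    return counts
```

Because `islice` stops pulling rows once the limit is reached, the scan is never read past the sample, which is the point of capping the sample size for large tables.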

Number – In most cases, we map the Number type in DynamoDB to Spanner’s NUMERIC type. However, since the range of NUMERIC in Cloud Spanner is smaller than the range of Number in DynamoDB, a value can fall out of range or lose precision during conversion. To address this possibility, we try to convert the sample data, and if conversion consistently fails, we choose the STRING type for the column.
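As a rough sketch of the range check behind this decision, the Python below tests whether a decimal string fits Cloud Spanner's NUMERIC type (precision 38, scale 9, so at most 29 integer digits and 9 fractional digits). The function names are hypothetical, and the text above does not say what share of failures counts as "consistently" failing, so the sketch only reports the failure ratio and leaves that cut-off to the caller.

```python
from decimal import Decimal, InvalidOperation
from typing import Iterable

MAX_INTEGER_DIGITS = 29   # Spanner NUMERIC: precision 38 ...
MAX_FRACTION_DIGITS = 9   # ... with scale 9


def fits_spanner_numeric(text: str) -> bool:
    """True if the decimal string fits NUMERIC without range or precision loss."""
    try:
        value = Decimal(text)
    except InvalidOperation:
        return False
    if not value.is_finite():
        return False
    if value == 0:
        return True
    _, digit_tuple, exponent = value.as_tuple()
    digits = list(digit_tuple)
    # Trailing fractional zeros carry no information, so ignore them.
    while exponent < 0 and digits[-1] == 0:
        digits.pop()
        exponent += 1
    fraction_digits = max(0, -exponent)
    integer_digits = max(0, len(digits) + exponent)
    return (integer_digits <= MAX_INTEGER_DIGITS
            and fraction_digits <= MAX_FRACTION_DIGITS)


def failure_ratio(samples: Iterable[str]) -> float:
    """Share of sampled Number values that would not fit NUMERIC."""
    values = list(samples)
    if not values:
        return 0.0
    failures = sum(1 for v in values if not fits_spanner_numeric(v))
    return failures / len(values)
```

A column whose sampled values mostly fail this check would be a candidate for STRING rather than NUMERIC.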

Null – In DynamoDB, a column can have a Null data type that represents an unknown or undefined state. In addition, each row defines its own schema for columns (other than primary keys), so a column can be absent from some rows. We treat both cases as a Null value in Cloud Spanner: if a column contains a Null value or is missing from a row, that indicates the column should be nullable.
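The nullability rule can be illustrated with a short Python sketch. It assumes rows are dictionaries in which a DynamoDB Null is represented as `None`; the function name `nullable_columns` is hypothetical and this is not HarbourBridge's implementation.

```python
from typing import Any, Dict, List, Set


def nullable_columns(rows: List[Dict[str, Any]]) -> Set[str]:
    """Columns that hold a Null, or are absent, in at least one sampled row."""
    all_columns: Set[str] = set()
    for row in rows:
        all_columns.update(row)

    nullable: Set[str] = set()
    for row in rows:
        for column in all_columns:
            # row.get returns None both for an explicit Null and for a
            # column that is missing from this row, so both cases match.
            if row.get(column) is None:
                nullable.add(column)
    return nullable
```

For example, given `[{"id": 1, "price": None}, {"id": 2, "price": 3, "tag": "x"}]`, the columns `price` (explicit Null) and `tag` (absent from the first row) are both reported as nullable, while `id` is not.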

List & Map – DynamoDB supports List and Map for storing complex data structures. Their elements can have different data types; for example, a List can be [“Book”, “Camera”, 3.14159]. We encode List and Map values as JSON strings in Cloud Spanner, so they need to be parsed when read from Cloud Spanner.
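The round trip this implies for application code looks roughly as follows. The sketch uses Python's standard `json` module and hypothetical helper names; it assumes the List or Map has already been converted to plain Python lists and dictionaries.

```python
import json
from typing import Any


def encode_complex(value: Any) -> str:
    """Serialize a List or Map value to the JSON string stored in a STRING column."""
    return json.dumps(value)


def decode_complex(stored: str) -> Any:
    """Parse the JSON string read back from Cloud Spanner."""
    return json.loads(stored)


stored = encode_complex(["Book", "Camera", 3.14159])
items = decode_complex(stored)
```

Readers of the migrated table receive a string, so the parsing step has to live in the application that reads from Cloud Spanner.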

Occasional Errors – As no schema is enforced when writing to a DynamoDB table, it can happen that a small number of data rows are incorrectly inserted, sometimes unbeknownst to the user. In Cloud Spanner, a table’s schema is strictly enforced, and we can’t write rows that differ from the table’s schema. To handle this in HarbourBridge, we define an error threshold when we are inferring the type of a column: if a type appears in only a very small proportion of rows (less than or equal to 0.1%), then we treat those rows as errors. Such rows are ignored when we determine the type of a column. This allows us to filter a certain amount of noise when building the Cloud Spanner schema.
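A minimal Python sketch of this noise filter is shown below. The per-column type counts and the name `drop_noise_types` are assumptions made for illustration; only the 0.1% threshold comes from the text above.

```python
from typing import Dict

# 0.1% expressed as a fraction 1/1000 so the comparison uses integers only.
ERROR_THRESHOLD_DENOMINATOR = 1000


def drop_noise_types(type_counts: Dict[str, int]) -> Dict[str, int]:
    """Discard types seen in 0.1% of rows or fewer; they are treated as errors."""
    total = sum(type_counts.values())
    return {
        type_name: count
        for type_name, count in type_counts.items()
        # Keep a type only if count / total is strictly greater than 0.1%.
        if count * ERROR_THRESHOLD_DENOMINATOR > total
    }
```

With 10,000 sampled rows, a type that appears in 10 rows or fewer is dropped before the column type is chosen.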

Multi-type Columns – In some situations a column may be split between two data types in substantial proportions. For example, a column may have 40% of its rows in String and 60% in Number. If we chose Number as its type, we would drop 40% of the rows during data conversion. To handle this, we define a conflicting threshold on normalized rows (after removing Null data types and rows where the column is not present). By default, the conflicting threshold is 5%; if two or more data types each account for more than that share, we consider the column to have conflicting data types. As a safe choice, we define this column as a STRING type in Cloud Spanner.
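The conflict rule can be sketched in Python as follows. The input is assumed to be per-type row counts for one column after Null and absent values have been removed; the function name `resolve_column_type` is hypothetical, and returning the dominant source type when there is no conflict is a simplification for this example.

```python
from typing import Dict

CONFLICT_THRESHOLD = 0.05  # the documented default of 5%


def resolve_column_type(type_counts: Dict[str, int]) -> str:
    """Return "STRING" on conflict, otherwise the dominant source type name."""
    total = sum(type_counts.values())
    if total == 0:
        raise ValueError("no non-null values to infer a type from")
    contenders = [
        type_name
        for type_name, count in type_counts.items()
        if count / total > CONFLICT_THRESHOLD
    ]
    if len(contenders) >= 2:
        # Two or more types each exceed the threshold: fall back to STRING.
        return "STRING"
    return max(type_counts, key=lambda name: type_counts[name])
```

For the 40% String and 60% Number column from the text, both types exceed 5%, so the sketch returns `"STRING"` and no rows have to be dropped.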

During conversion, any rows that fail to convert to the inferred schema (at least one column fails conversion) or cannot be written to Spanner are reported as bad rows in the assessment. Although we write the bad rows to *.dropped.txt, their number can be large, so we limit the logging to 100 rows to give the user a sample of the bad rows.
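One way to read this is that every bad row is counted for the report, while only the first 100 are written out as a sample. The Python below sketches that interpretation; it is an assumption about the behavior, not HarbourBridge's code, and `write_dropped_sample` is a hypothetical name.

```python
import io
from typing import Iterable, TextIO

MAX_LOGGED_BAD_ROWS = 100  # the logging limit mentioned above


def write_dropped_sample(bad_rows: Iterable[str], out: TextIO,
                         limit: int = MAX_LOGGED_BAD_ROWS) -> int:
    """Write at most `limit` bad rows to `out`; return how many were seen in total."""
    total = 0
    for row in bad_rows:
        if total < limit:
            out.write(row + "\n")
        total += 1
    return total


buffer = io.StringIO()  # stands in for the *.dropped.txt file
seen = write_dropped_sample((f"row-{i}" for i in range(250)), buffer)
```

Here `seen` is the full count of bad rows for the assessment report, while `buffer` holds only the capped sample.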

For more details about schema conversion, you can check here.

Getting Started

You can directly use HarbourBridge with a Cloud Spanner instance. For convenience, you can also use Cloud Spanner Emulator – a local, in-memory emulation of Cloud Spanner for testing and evaluation purposes. We can use the emulator to try out Cloud Spanner’s functionality without any cost.

To use the emulator, you can follow the steps in Emulator instructions. You can start an emulator service via gcloud, Docker, or Linux binaries. Then, you need to set the SPANNER_EMULATOR_HOST environment variable, so HarbourBridge will connect to the emulator instead of talking to a real Cloud Spanner instance.

In order to get the permission to retrieve data from DynamoDB, we need to set up the AWS credentials and region by using the following environment variables:

You also can find other ways to configure them here.

By default, DynamoDB uses the environment variable AWS_REGION to resolve the endpoint URL. To provide a custom endpoint, we can use the following environment variable:

Before you start running HarbourBridge, ensure that you run

Set the GCLOUD_PROJECT environment variable to your Google Cloud project ID:

Next, you need to have Go installed and set the GOPATH environment variable properly. Then, you can install HarbourBridge via the following command:

By default, it will install the binary at $GOPATH/bin/harbourbridge. To use the tool directly on DynamoDB (it will migrate all tables), run:

It will generate a new Cloud Spanner database, create tables by modeling schemas, and load data from source tables. In addition, you may see the following generated files:

*.report.txt: the assessment report.

*.schema.txt: Cloud Spanner schema for the source tables.

*.session.json: a persisted state of your schema conversion, which can be modified and reused for future data migration.

*.dropped.txt: rows that could not be converted.

For more information about files generated by HarbourBridge, see here.

Now, if you go to the Cloud Spanner instance or the emulator, you should be able to see a new database that contains tables with loaded records from DynamoDB sources.

Optimize your schema conversion

There are more aspects that we can optimize for different scenarios:

Sample Size – If you are testing with a large-scale database, you may find that the default sample size (10,000 rows) is not enough and want to increase it. The command option -schema-sample-size lets you set a number that meets your needs. For example, the following command will sample 1 million records for modelling the schema:

Secondary Index – HarbourBridge provides a good starting point, but at the moment it does not support converting indexes (only primary keys are converted). To achieve better performance or to have a fair comparison with the source database, it is better to add indexes. See Keys and Indexes for more information.

Interleaved Tables – This is a key concept in Cloud Spanner for improving locality and optimizing table layout. Interleaved tables are tables that you declare to be children of another table because you want the rows of the child table to be physically stored together with the associated parent row, which saves time when looking up related data. As a result, joins and writes can both perform better.

Summary

HarbourBridge is an open-source tool for Cloud Spanner evaluation and migration, which supports PostgreSQL, MySQL, and DynamoDB. It saves you time and effort by automating manual steps and creating an initial migration as quickly as possible. We also provide options to refine and optimize the generated schema.

We would like to hear your feedback and suggestions. You can file an issue with us if you want to start a discussion or request any features. We have a roadmap for HarbourBridge. HarbourBridge is part of the Cloud Spanner Ecosystem, which is owned and maintained by a user community effort. It is not officially supported by Google as part of Cloud Spanner.

Read More for the details.

2021 09 22

GCP – Introducing Google Cloud Deploy: Managed continuous delivery to GKE

Cloud, Google Cloud gcp

Continuous delivery is frequently top-of-mind for organizations adopting Google Kubernetes Engine (GKE). However, continuous delivery (deploying container image artifacts into your various environments) remains complex, particularly in Kubernetes environments. With little in the way of accepted best practices, building and scaling continuous delivery tooling, pipelines, and repeatable processes is hard work that requires a lot of on-the-job experience.

It doesn’t have to be this way.

Today, we are pleased to announce Google Cloud Deploy, a managed, opinionated continuous delivery service that makes continuous delivery to GKE easier, faster, and more reliable.

Solving for continuous delivery challenges

Google Cloud Deploy is the product of discussions with more than 50 customers to better understand the challenges they face doing continuous delivery to GKE. From cloud-native to more traditional businesses, three themes consistently emerged: cost of ownership, security and audit, and integration.

Let’s take a deeper look at these challenges and how we address them with Google Cloud Deploy.

Cost of ownership

Time and again we heard that the operational cost of Kubernetes continuous delivery is high. Identifying best and repeatable practices, scaling delivery tooling and pipelines, and staying current—to say nothing of maintenance—is resource-intensive and takes time away from the core business.

“We can’t afford to be innovating in continuous delivery,” one customer told us. “We want an opinionated product that supports best practices out of the box.”

Google Cloud Deploy addresses cost of ownership head-on.

As a managed service, Google Cloud Deploy eliminates the scaling and maintenance responsibilities that typically come with self-managed continuous delivery solutions. Now you can reclaim the time spent maintaining your continuous delivery tooling and spend it delivering value to your customers.

Google Cloud Deploy also provides structure. Delivery pipelines and targets are defined declaratively and are stored alongside each release. That means if your delivery pipeline changes, the release’s path to production remains durable. No more time lost troubleshooting issues on in-flight releases caused by changes made to the delivery pipeline.

We have found that a variety of GKE roles and personas interact with continuous delivery processes. A DevOps engineer may be focused on release promotion and rollback decisions, while a business decision maker thinks about delivery pipeline health and velocity. Google Cloud Deploy’s user experience keeps these multiple perspectives in mind, making it easier for various personas to perform contextualized reviews and make decisions, improving efficiency and reducing cost of ownership.

Figure: Contextualized deployment approvals

Security and audit

Lots of different users interact with a continuous delivery system, making a variety of decisions. Not all users and decisions carry the same authority, however. Being able to define a delivery pipeline and make updates doesn’t always mean you can create releases, for example, nor does being able to promote a release to staging mean you can approve it to production. Modern continuous delivery is full of security and audit considerations. Restricting who can access what, where, and how is necessary to maintain release integrity and safety.

Throughout, Google Cloud Deploy enables fine-grained restriction, with discrete resource access control and execution-level security. For additional safeguards against unwanted approvals, you can also take advantage of flow management features such as release promotion, rollback, and approvals.

Auditing with Google Cloud Deploy works just like it does for other Google Cloud services. Cloud Audit Logs audits user-invoked Google Cloud Deploy activities, providing centralized awareness into who promoted a specific release or made an update to a delivery pipeline.

Integration

Whether or not you already have continuous delivery capabilities, you likely already have continuous integration (CI), approval and/or operation workflows, and other systems that intersect with your software delivery practices.

Google Cloud Deploy embraces the GKE delivery tooling ecosystem in three ways: connectivity to CI systems, support for leading configuration (rendering) tooling, and Pub/Sub notifications to enable third-party integrations.

Connecting Google Cloud Deploy to existing CI tools is straightforward. After you build your containers, Google Cloud Deploy creates a delivery pipeline release that initiates the Kubernetes manifest configuration (render) and deployment process to the first environment in a progression sequence. Whether you are using Jenkins, Cloud Build, or another CI tool, this is usually a simple `gcloud beta deploy releases create`.

Delivering to Kubernetes often changes over time. To help, Google Cloud Deploy leverages Skaffold, allowing you to standardize your configuration between development and production environments. Organizations new to Kubernetes typically deploy using raw manifests, but as they become more sophisticated, may want to use more advanced tooling (Helm, Kustomize, kpt). The combination of Google Cloud Deploy and Skaffold lets you transition to these tools without impacting your delivery pipelines.

Finally, to facilitate other integrations, such as a post-deployment test execution or third party approval workflows, Google Cloud Deploy emits Pub/Sub messages throughout a release’s lifecycle.

The future

Comprehensive, easy-to-use, and cost-effective DevOps tools are key to building an efficient software development team, and it’s our hope that Google Cloud Deploy will help you complete your CI/CD pipelines. And we’re just getting started! Stay tuned as we continue to introduce exciting new capabilities and features to Google Cloud Deploy in the months and quarters to come.

In the meantime, to get started with the Preview, check out the product page, documentation, quickstart, and tutorials. Finally, if you have feedback on Google Cloud Deploy, you can join the conversation. We look forward to hearing from you!

Read More for the details.

2021 09 22

GCP – What’s your org’s reliability mindset? Insights from Google SREs

Cloud, Google Cloud gcp

Editor’s note: There’s more to ensuring a product’s reliability than following a bunch of prescriptive rules. Today, we hear from some Google SREs—Vartika Agarwal, Senior Technical Program Manager, Development; Tracy Ferrell, Senior SRE Manager; Mahesh Palekar, Director SRE; and Magi Agrama, Senior Technical Program Manager, SRE—about how to evaluate your team’s current reliability mindset, and what you want it to be.

Having a reliable software product can improve users’ trust in your organization, the effectiveness of your development processes, and the quality of your products overall. More than ever, product reliability is front and center, as outages negatively impact customers and their businesses. But in an effort to develop new features, many organizations limit their reliability efforts to what happens after an outage, and tactically solve for the immediate problems that sparked it. They often fail to realize that they can move quickly while still improving their product’s reliability.

At Google, we’ve given a lot of thought to product reliability, and several of its aspects, such as product or system design, are well understood. What people think about less is the culture and the mindset of the organization that creates a reliable product in the first place. We believe that the reliability of a product is a property of its system architecture, its processes, its culture, and the mindset of the product team or organization that built it. In other words, reliability should be woven into the fabric of an organization, not just the result of a strong design ethos.

In this blog post, we discuss the lessons we’ve learned that are relevant to organizational or product leads who have the ability to influence the culture of the entire product team, including (but not limited to) engineering, product management, marketing, reliability engineering, and support organizations.

Goals

Reliability should be woven into the fabric of how an organization executes. At Google, we’ve developed a terminology to categorize and describe your organization’s reliability mindset, to help you understand how intentional your organization is in this respect. Our ultimate goal is to help you improve and adopt product reliability practices that will permeate the ethos of the organization.

By identifying these reliability phases, we do not mean to offer a prescriptive list of things to do that will improve your product’s reliability. Nor should they be read as a set of mandated principles that everyone should apply, or be used to publicly label a team, spurring competition between teams. Rather, leaders should consider these phases as a way to help them develop their team’s culture, on the road to sustainably building reliable products.

The organizational reliability continuum

Based on our observations here at Google, there are five basic stages of organizational reliability, and they are based on the classic organizational model of absent, reactive, proactive, strategic and visionary. These phases describe the mindset of an organization at a point in time, and each one of them is characterized by a series of attributes, and is appropriate for different classes of workloads.

Absent: Reliability is a secondary consideration for the organization.

A feature launch is the key organizational metric and is the focus for incentives.

The majority of issues are found by users or testers. This organization is not aware of their long-term reliability risks.

Developer velocity is rarely exchanged for reliability.

This reliability phase may be appropriate for products and projects that are still under development.

Reactive: Responses to reliability issues and risks are tied to recent outages, with sporadic follow-through and rarely any longer-term investment in fixing system issues.

Teams have some reliability metrics defined and react when required.
They write postmortems for outages and create action items for tactical fixes.
Reasonable availability is maintained through heroic efforts by a few individuals or teams.
Developer productivity is throttled when outages temporarily shift priority to reliability work. Feature development may be frozen for a short period of time.

This level is appropriate for products/projects in pre-launch or in a stable long-term maintenance phase.

Proactive: Potential reliability risks are identified and addressed through regular organizational processes.

Risks are regularly reviewed and prioritized.
Teams proactively manage dependencies and review their reliability metrics (SLOs).
New designs are assessed for known risks and failure modes early on. Graceful degradation is a basic requirement.
The business understands the need to continuously invest in reliability and maintain its balance with developer velocity.

Most services/products should be at this level, particularly if they have a large blast radius or are critical to the business.

Strategic: Organizations at this level manage classes of risk via systemic changes to architectures, products and processes.

Reliability is inherent and ingrained in how the organization designs, operates and develops software. Reliability is systemic.
Complexity is addressed holistically through product architecture. Dependencies are constantly reduced or improved.
The cross-functional organization can sustain reliability and developer velocity simultaneously.
Organizations widely celebrate quality and stability milestones.

This level is appropriate for services and products that need very high availability to meet business-critical needs.

Visionary: The organization has reached the highest order of reliability and is able to drive broader reliability efforts within and outside the company (e.g., writing papers, sharing knowledge), based on their best practices and experiences.

Reliability knowledge exists broadly across all engineers and teams at a fairly advanced level and is carried forward as they move across organizations.
Systems are self-healing.
Architectural improvements for reliability positively impact productivity (release velocity) due to reduction of maintenance work/toil.

Very few services or products are at this level, and those that are tend to be industry leading.

Where should you be on the reliability spectrum?

It is very important to understand that your organization does not necessarily need to be at the strategic or visionary phase. There is a significant cost associated with moving from one phase to another and a cost to remain very high on this curve. In our experience, being proactive is a healthy level to target and is ideal for most products.

To illustrate this point, here is a simple graph of where various Google product teams are on the organizational reliability spectrum; as you can see, it produces a standard bell-curve distribution. While many of Google’s product teams have a reactive or proactive reliability culture, most can be described as proactive. You, as an organizational leader, must consciously decide which level to target based on the product requirements and client expectations.

Further, it’s common to have attributes across several phases; for example, an organization may be largely reactive with a few proactive attributes. Team culture will wax and wane between phases, as it takes effort to maintain a strategic reliability culture. However, as more of the organization embraces and celebrates reliability as a key feature, the cost of maintenance decreases.

The key to success is making an honest assessment of what phase you’re in, and then doing concerted work to move to the phase that makes sense for your product. If your organization is in the absent or reactive phase, remember that many products in nascent stages of their life cycle may be comfortable there (in both the startup and long term maintenance of a stable product).

Reliability phases in action

To illustrate the reliability phases in practice, it is interesting to look at examples of organizations and how they have progressed or regressed through them.

It should be noted that all companies and teams are different and the progress through these phases can take varying amounts of time. It is not uncommon to take two to three years to move into a truly proactive state. In a proactive state all parts of the organization contribute to reliability without worrying that it will negatively impact feature velocity. Staying in the proactive phase also takes time and effort.

Nobody can be a hero forever

One infrastructure services team started small with a few well-understood APIs. One key member of the team, a product architect, understood the system well and kept things running smoothly by making sure design decisions were sound and by being present at each major incident to rapidly mitigate the issue. This was the one person who understood the entire system and was able to predict what could and could not impact its stability. But when they left the team, the system complexity grew by leaps and bounds. Suddenly there were many critical user-facing and internal outages.

Organizational leaders initiated both short- and long-term reliability programs to restore stability. They focused on reducing the blast radius and the impact of global outages. Leadership recognized that to sustain this trajectory, they had to go beyond engineering solutions and implement cultural changes such as recognizing reliability as their number-one feature. This led to broad training around reliability best practices, incorporating reliability in architectural/design reviews, and recognizing and rewarding reliability beyond hero moments.

As a result, the organization evolved from a reactive to a strategic reliability mindset, aided by setting reliability as their number-one feature, recognizing and rewarding long-term reliability improvements, and adopting the systemic belief that reliability is everyone’s responsibility—not just that of a few heroes.

If you think you are done, think again

End users of one Google product are highly dependent on its reliability, which ties directly to user trust. For this reason, reliability was top of mind for the organization behind it for years, and the product was held as the gold standard of reliability by other Google teams. The org was deemed visionary in its reliability processes and work.

However, over the years, new products were added to the base service. The high level of reliability did not come as freely and easily as it did with the simpler product. Reliability was impacted at the cost of developer velocity and the organization moved to a more reactive reliability mindset.

To turn the ship around, the organization’s leaders had to be intentional about their reliability posture and overall practices, including how much they thought about and prioritized reliability. It took several years to move the team back to a strategic mindset.

Embrace reliability principles from the start

Another team with a new user-facing product was focused on adding features and growing their user base. Before they knew it, the product took off and saw exponential growth.

Unfortunately, their laser-focus on managing user requirements and growing user adoption led to high technical debt and reliability issues. Since the service didn’t start off with reliability as a primary focus, it was very hard to incorporate it after the fact.

Much of the code had to be rewritten and re-architected to reach a sustainable state. The team’s leaders incentivized attention to reliability throughout the organization, from product management through to development and UX domains, constantly reminding the organization about the importance of reliability to the long-term success of the product. This mindset shift took years to set in.

Conclusion

It is important that cross-functional organizations be honest about their reliability journeys and determine what is appropriate for their business and product. It is not uncommon for organizations to move from one level to another and then back again as the product matures, stabilizes and then is sunset for the next generation. Getting to a strategic level can be 4+ years in the making and require very high levels of investment from all aspects of the business. Leaders should ensure their product requires this level of continued investment.

We encourage you to study your culture of reliability, assess what phase you are in, determine where you should be on the continuum, and carefully and thoughtfully move there. Changing culture is hard and cannot be done by edicts or penalties. Most of all, remember that this is a journey and the business is ever-evolving; you cannot set reliability on the shelf and expect it to maintain itself in perpetuity.

Read More for the details.

Cloud
