Amazon SageMaker Inference now supports rolling updates for inference component (IC) endpoints. This allows customers to update running IC endpoints without traffic interruption while using minimal extra instances, rather than requiring doubled instances as in the past. SageMaker Inference makes it easy to deploy ML models, including foundation models (FMs). As a capability of SageMaker Inference, IC enables customers to deploy multiple FMs on the same endpoint and control accelerator allocation for each model.
Now, rolling updates enable customers to update ICs within an endpoint batch by batch, instead of all at once as with the previous blue/green update method. Blue/green updates required provisioning a new fleet of ICs with the updated model before shifting traffic from the old fleet to the new one, effectively doubling the number of required instances. With rolling updates, new ICs are created in smaller batches, significantly reducing the number of additional instances needed during updates. This helps customers minimize costs from extra capacity and maintain smaller buffer requirements in their capacity reservations.
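For illustration, here is a minimal sketch of triggering a rolling update with the AWS SDK for Python (boto3). The inference component name, image URI, and resource sizes are placeholders, and the DeploymentConfig/RollingUpdatePolicy field names reflect our reading of the UpdateInferenceComponent API, so verify them against the current API reference:

```python
import boto3

sagemaker = boto3.client("sagemaker")

# Hypothetical example: roll out a new container for an existing inference
# component one copy at a time, waiting between batches, instead of
# provisioning a full duplicate fleet up front.
sagemaker.update_inference_component(
    InferenceComponentName="my-inference-component",  # placeholder name
    Specification={
        "Container": {
            "Image": "123456789012.dkr.ecr.us-east-1.amazonaws.com/my-model:v2"
        },
        "ComputeResourceRequirements": {
            "NumberOfAcceleratorDevicesRequired": 1,
            "MinMemoryRequiredInMb": 4096,
        },
    },
    DeploymentConfig={
        "RollingUpdatePolicy": {
            # Update one IC copy per batch to keep extra instance usage minimal.
            "MaximumBatchSize": {"Type": "COPY_COUNT", "Value": 1},
            "WaitIntervalInSeconds": 120,
        },
    },
)
```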
Rolling update for IC is available in all regions where IC is supported: Asia Pacific (Tokyo, Seoul, Mumbai, Singapore, Sydney, Jakarta), Canada (Central), Europe (Frankfurt, Stockholm, Ireland, London), Middle East (UAE), South America (Sao Paulo), US East (N. Virginia, Ohio), and US West (N. California, Oregon). To learn more, see the documentation.
Amazon Elastic Container Service (Amazon ECS) today introduced a GPU-optimized Amazon Machine Image (AMI) for Amazon Linux 2023 (AL2023). This new offering enables customers to run GPU-accelerated containerized workloads on Amazon ECS while leveraging the improved security features and newer kernel version available in AL2023.
The new ECS GPU-optimized AMI is built on the minimal AL2023 base AMI and includes NVIDIA drivers, NVIDIA Fabric Manager, NVIDIA Container Toolkit, and other essential packages needed to run GPU-accelerated container workloads. The new AMI supports a wide range of NVIDIA GPU architectures including Ampere, Turing, Volta, Maxwell, Hopper, and Ada Lovelace, and works out-of-the-box with no additional configuration required. The new AMI is designed for GPU-accelerated applications such as machine learning (ML) and artificial intelligence (AI) workloads running on Amazon ECS.
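As a rough sketch of how a GPU workload is scheduled on ECS with the AWS SDK for Python (boto3): the family name, image, and command below are placeholders, and the GPU resource requirement is what steers the task onto a GPU-capable instance such as one launched from this AMI:

```python
import boto3

ecs = boto3.client("ecs")

# Minimal GPU task definition sketch: requesting one GPU tells ECS to place
# the task on a GPU-capable container instance and expose a GPU to the container.
ecs.register_task_definition(
    family="gpu-inference-task",                         # placeholder family name
    requiresCompatibilities=["EC2"],
    containerDefinitions=[
        {
            "name": "gpu-worker",
            "image": "nvidia/cuda:12.4.1-base-ubuntu22.04",  # example image
            "cpu": 1024,
            "memory": 4096,
            "resourceRequirements": [{"type": "GPU", "value": "1"}],
            "command": ["nvidia-smi"],                   # quick visibility check
        }
    ],
)
```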
The ECS GPU-optimized AL2023 AMI is now available in all AWS regions. For additional information about running GPU-accelerated workloads with Amazon ECS, refer to the documentation and release notes.
Starting today, you can use AWS WAF Targeted Bot Control in the AWS GovCloud (US) Regions. AWS WAF Targeted Bot Control protects applications against sophisticated bots targeting critical enterprise applications like e-commerce and financial services websites.
AWS WAF is a web application firewall that helps you protect your web application resources against common web exploits and bots that can affect availability, compromise security, or consume excessive resources. You can protect the following resource types: Amazon CloudFront distributions, Amazon API Gateway REST APIs, Application Load Balancer, AWS AppSync GraphQL API, AWS App Runner, AWS Verified Access, and Amazon Cognito user pools.
To see the full list of regions where AWS WAF is currently available, visit the AWS Region Table. For more information about the service, visit the AWS WAF page. AWS WAF pricing may vary between regions. For more information about pricing, visit the AWS WAF Pricing page.
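For illustration, a hedged boto3 sketch of attaching the Bot Control managed rule group at the targeted inspection level to a regional web ACL (for example in a GovCloud Region); names, priorities, and metrics are placeholders:

```python
import boto3

# Regional resources (ALB, API Gateway, etc.) use Scope="REGIONAL" in the
# Region where they live; the GovCloud Region below is an example.
wafv2 = boto3.client("wafv2", region_name="us-gov-west-1")

bot_control_rule = {
    "Name": "targeted-bot-control",                       # placeholder rule name
    "Priority": 0,
    "Statement": {
        "ManagedRuleGroupStatement": {
            "VendorName": "AWS",
            "Name": "AWSManagedRulesBotControlRuleSet",
            "ManagedRuleGroupConfigs": [
                # TARGETED enables the advanced bot detections.
                {"AWSManagedRulesBotControlRuleSet": {"InspectionLevel": "TARGETED"}}
            ],
        }
    },
    "OverrideAction": {"None": {}},
    "VisibilityConfig": {
        "SampledRequestsEnabled": True,
        "CloudWatchMetricsEnabled": True,
        "MetricName": "TargetedBotControl",
    },
}

wafv2.create_web_acl(
    Name="my-web-acl",                                    # placeholder ACL name
    Scope="REGIONAL",
    DefaultAction={"Allow": {}},
    Rules=[bot_control_rule],
    VisibilityConfig={
        "SampledRequestsEnabled": True,
        "CloudWatchMetricsEnabled": True,
        "MetricName": "MyWebACL",
    },
)
```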
Amazon Connect announces the expansion of access to industry-leading inbound number availability across 158 countries, national outbound numbers in 72 countries, and global international dialing capabilities from any supported AWS commercial region. This expansion increases telephony coverage by an average of 125% across AWS regions. Organizations can now focus on selecting the ideal location for their customer experience operations based on business considerations such as agent availability, language fluency and regulatory needs without being constrained by telecommunications infrastructure. Agents and customers benefit from the reliability, quality, and cost-effectiveness enabled by the AWS global network and Amazon Connect’s direct connections to the 40+ tier-1 carriers closest to your customers.
With this launch, Amazon Connect reimagines the delivery of voice calls. Traditional telephony networks often introduce quality degradation through multiple interconnection points, variable routing paths, and aging infrastructure. By leveraging the AWS global network backbone, the same high-performance, low-latency private network that powers AWS, call paths are optimized and routed directly to the carrier closest to your customer. This simplified routing enables consistently clear and natural conversations for every call.
Access expanded telephony coverage for Amazon Connect in all AWS Regions where Amazon Connect is available, except the AWS GovCloud (US) Regions and Africa (Cape Town). For information about our expanded telephony coverage, see Set Up Contact Center Phone Numbers for your Amazon Connect Instance in the Amazon Connect Administrator Guide.
This blog post presents an in-depth exploration of Microsoft’s Time Travel Debugging (TTD) framework, a powerful record-and-replay debugging framework for Windows user-mode applications. TTD relies heavily on accurate CPU instruction emulation to faithfully replay program executions. However, subtle inaccuracies within this emulation process can lead to significant security and reliability issues, potentially masking vulnerabilities or misleading critical investigations such as incident response and malware analysis, causing analysts to overlook threats or draw incorrect conclusions. Furthermore, attackers can exploit these inaccuracies to intentionally evade detection or disrupt forensic analyses, severely compromising investigative outcomes.
The blog post examines specific challenges, provides historical context, and analyzes real-world emulation bugs, highlighting the critical importance of accuracy and ongoing improvement to ensure the effectiveness and reliability of investigative tooling. Ultimately, addressing these emulation issues directly benefits users by enhancing security analyses, improving reliability, and ensuring greater confidence in their debugging and investigative processes.
Overview
We begin with an introduction to TTD, detailing its use of a sophisticated CPU emulation layer powered by the Nirvana runtime engine. Nirvana translates guest instructions into host-level micro-operations, enabling detailed capture and precise replay of a program’s execution history.
The discussion transitions into exploring historical challenges in CPU emulation, particularly for the complex x86 architecture. Key challenges include issues with floating-point and SIMD operations, memory model intricacies, peripheral and device emulation, handling of self-modifying code, and the constant trade-offs between performance and accuracy. These foundational insights lay the groundwork for our deeper examination of specific instruction emulation bugs discovered within TTD.
These include:
A bug involving the emulation of the pop r16 instruction, resulting in critical discrepancies between native execution and TTD instrumentation.
An issue with the push segment instruction that demonstrates differences between Intel and AMD CPU implementations, highlighting the importance of accurate emulation aligned with hardware behavior.
Errors in the implementation of the lodsb and lodsw instructions, where TTD incorrectly clears upper bits that should remain unchanged.
An issue within the WinDbg TTDAnalyze debugging extension, where a fixed output buffer resulted in truncated data during symbol queries, compromising debugging accuracy.
Each case is supported by detailed analyses, assembly code proof-of-concept samples, and debugging traces, clearly illustrating the subtle but significant pitfalls in modern CPU emulation as it pertains to TTD.
Additional bugs discovered beyond those detailed here are pending disclosure until addressed by Microsoft. All bugs discussed in this post have been resolved as of TTD version 1.11.410.
Intro to TTD
Time Travel Debugging (TTD) is a powerful user-mode record-and-replay framework developed by Microsoft, originally introduced in a 2006 whitepaper under a different name. It is a staple of our workflows in Windows environments.
TTD allows a user to capture a comprehensive recording of a process (and potential child processes) during the lifetime of the process’s execution. This is done by injecting a dynamic-link library (DLL) into the intended target process and capturing each state of the execution. This comprehensive historical view of the program’s runtime behavior is stored in a database-like trace file (.trace), which, much like a database, can be further indexed to produce a corresponding .idx file for efficient querying and analysis.
Once recorded, trace files can be consumed by a compatible client that supports replaying the entire execution history. In other words, TTD effectively functions as a record/replay debugger, enabling analysts to move backward and forward through execution states as if navigating a temporal snapshot of the program’s lifecycle.
TTD relies on a CPU emulation layer to accurately record and replay program executions. This layer is implemented by the Nirvana runtime engine, which simulates guest instructions by translating them into a sequence of simpler, host-level micro-operations. By doing so, Nirvana provides fine-grained control at the instruction and sub-instruction level, allowing instrumentation to be inserted at each stage of instruction processing (e.g., fetching, memory reads, writes). This approach not only ensures that TTD can capture the complete dynamic behavior of the original binary but also makes it possible to accurately re-simulate executions later.
Nirvana’s dynamic binary translation and code caching techniques improve performance by reusing translated sequences when possible. In cases where code behaves unpredictably—such as self-modifying code scenarios—Nirvana can switch to a pure interpretation mode or re-translate instructions as needed. These adaptive strategies ensure that TTD maintains fidelity and efficiency during the record and replay process, enabling it to store execution traces that can be fully re-simulated to reveal intricate details of the code’s behavior under analysis.
The TTD framework is composed of several core components:
TTD: The main TTD client executable, which takes a wide array of input arguments that dictate how the trace will be conducted.
TTDRecord: The main DLL responsible for the recording that runs within the TTD client executable. It initiates the injection sequence into the target binary by injecting TTDLoader.dll.
TTDLoader: DLL that gets injected into the guest process and initiates the recorder within the guest through the TTDRecordCPU DLL. It also establishes a process instrumentation callback within the guest process that allows Nirvana to monitor the egress of any system calls the guest makes.
TTDRecordCPU: The recorder responsible for capturing the execution states into the .trace file. This is injected as a DLL into the guest process and communicates the status of the trace with TTDRecord. The core logic works by emulating the respective CPU.
TTDReplay and TTDReplayClient: The replay components that read the captured state from the trace file and allow users to step through the recorded execution.
WinDbg uses these to provide support for replaying trace files.
TTDAnalyze: A WinDbg extension that integrates with the replay client, providing TTD-exclusive capabilities to WinDbg. Most notable of these are the Calls and Memory data model methods.
CPU Emulation
Historically, CPU emulation—particularly for architectures as intricate as x86—has been a persistent source of engineering challenges. Early attempts struggled with instruction coverage and correctness, as documentation gaps and hardware errata made it difficult to replicate every nuanced corner case. Over time, a number of recurring problem areas and bug classes emerged:
Floating-Point and SIMD Operations: Floating-point instructions, with their varying precision modes and extensive register states, have often been a source of subtle bugs. Miscalculating floating-point rounding, mishandling denormalized numbers, or incorrectly implementing special instructions like FSIN or FCOS can lead to silent data corruption or outright crashes. Similarly, SSE, AVX, and other vectorized instructions introduce complex states that must be tracked accurately.
Memory Model and Addressing Issues: The x86 architecture’s memory model, which includes segmentation, paging, alignment constraints, and potential misalignments in legacy code, can introduce complex bugs. Incorrectly emulating memory accesses, not enforcing proper page boundaries, or failing to handle “lazy” page faults and cache coherency can result in subtle errors that only appear under very specific conditions.
Peripheral and Device Emulation: Emulating the behavior of x86-specific peripherals—such as serial I/O ports, PCI devices, PS/2 keyboards, and legacy controllers—can be particularly troublesome. These components often rely on undocumented behavior or timing quirks. Misinterpreting device-specific registers or neglecting to reproduce timing-sensitive interactions can lead to erratic emulator behavior or device malfunctions.
Compatibility with Older or Unusual Processors: Emulating older generations of x86 processors, each with their own peculiarities and less standardized features, poses its own set of difficulties. Differences in default mode settings, instruction variants, and protected-mode versus real-mode semantics can cause unexpected breakages. A once-working emulator may fail after it encounters code written for a slightly different microarchitecture or an instruction that was deprecated or implemented differently in an older CPU.
Self-Modifying Code and Dynamic Translation: Code that modifies itself at runtime demands adaptive strategies, such as invalidating cached translations or re-checking original code bytes on the fly. Handling these scenarios incorrectly can lead to stale translations, misapplied optimizations, and difficult-to-trace logic errors.
Performance vs. Accuracy Trade-Offs: Historically, implementing CPU emulators often meant juggling accuracy with performance. Naïve instruction-by-instruction interpretation provided correctness but was slow. Introducing caching or just-in-time (JIT)-based optimizations risked subtle synchronization issues and bugs if not properly synchronized with memory updates or if instruction boundaries were not well preserved.
Collectively, these historical challenges underscore that CPU emulation is not just about instruction decoding. It requires faithfully recreating intricate details of processor states, memory hierarchies, peripheral interactions, and timing characteristics. Even as documentation and tooling have improved, achieving both correctness and efficiency remains a delicate balancing act, and emulation projects continue to evolve to address these enduring complexities.
The Initial TTD Bug
Executing a heavily obfuscated 32-bit Windows Portable Executable (PE) file under TTD instrumentation resulted in a crash. The same sample did not crash when executed on a real computer or in a virtual machine. We suspected that either the sample was detecting TTD execution or TTD itself had a bug in emulating an instruction. A nice property of debugging TTD issues is that the TTD trace file itself can usually be used to pinpoint the cause. Figure 1 points to the crash while in TTD emulation.
Figure 1: Crash while accessing an address pointed by register ESI
Back-tracing the ESI register value of 0xfb3e required stepping back hundreds of instructions and ended at the following sequence of instructions, as shown in Figure 2.
Figure 2: Register ESI getting populated by pop si and xchg si,bp
There are two instructions populating the ESI register, both working with the 16-bit SI sub-register while leaving the upper 16 bits of ESI untouched. If we look closely at the results after the pop si instruction in Figure 2, the upper 16 bits of the ESI register appear to be nulled out. This looked like a bug in emulating pop r16 instructions, and we quickly wrote proof-of-concept code for verification (Figure 3).
Figure 3: Proof-of-concept for pop r16
Running the resulting binary natively and with TTD instrumentation as shown in Figure 4 confirmed our suspicion that the pop r16 instructions are emulated differently in TTD than on a real CPU.
Figure 4: Running the code natively and with TTD instrumentation
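For readers following along without the figures, here is a minimal Python model of the divergent semantics we observed; it is an illustration only, not the original assembly proof-of-concept:

```python
# Illustrative model of the pop r16 discrepancy: natively, `pop si` writes only
# the low 16 bits of ESI, while the buggy emulation cleared the upper 16 bits.

def pop_si_native(esi: int, popped_word: int) -> int:
    """Native x86 behavior: only the SI sub-register is overwritten."""
    return (esi & 0xFFFF0000) | (popped_word & 0xFFFF)

def pop_si_ttd_buggy(esi: int, popped_word: int) -> int:
    """Observed (pre-fix) TTD behavior: the upper half of ESI is nulled out."""
    return popped_word & 0xFFFF

esi_before = 0x1234FB3E   # arbitrary example value with a non-zero upper half
popped = 0xFB3E

print(hex(pop_si_native(esi_before, popped)))     # 0x1234fb3e
print(hex(pop_si_ttd_buggy(esi_before, popped)))  # 0xfb3e
```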
We reported this issue and the fuzzing results to the TTD team at Microsoft.
Fuzzing TTD
Given one confirmed instruction emulation bug (an instruction sequence that produces different results in native versus TTD execution), we decided to fuzz TTD to find similar bugs. A rudimentary harness was created to execute a random sequence of instructions and record the resulting values. This harness was executed on a real CPU and under TTD instrumentation, providing us with two sets of results. Any difference in results, or a partial lack of results, points us to a likely instruction emulation bug.
The fuzzer soon surfaced a new bug fairly similar to the original pop r16 bug, but involving a push segment instruction. This bug also came with a bit of a twist. Our fuzzer was running on an Intel CPU-based machine, and while one of us verified the bug locally, the other was not able to reproduce it. Interestingly, the failure to reproduce happened on an AMD-based CPU, tipping us off to the possibility that the push segment instruction implementation varies between Intel and AMD CPUs.
Looking at both the Intel and AMD CPU specifications, the Intel specification goes into detail about how recent processors implement the push segment register instruction:
If the source operand is a segment register (16 bits) and the operand size is 64-bits, a zero-extended value is pushed on the stack; if the operand size is 32-bits, either a zero-extended value is pushed on the stack or the segment selector is written on the stack using a 16-bit move. For the last case, all recent Intel Core and Intel Atom processors perform a 16-bit move, leaving the upper portion of the stack location unmodified. (INTEL spec Vol.2B 4-517)
We reported the discrepancy to AMD PSIRT, who concluded that this is not a security vulnerability. It seems that sometime circa 2007, Intel and AMD CPUs started implementing the push segment instruction differently, and TTD emulation followed the old behavior.
The lodsb and lodsw instructions are not correctly implemented for both their 32-bit and 64-bit forms. Both clear the upper bits of the register (rax/eax), whereas the original instructions only modify their respective granularities (i.e., lodsb will only overwrite 1 byte, lodsw only 2 bytes).
Figure 6: Proof-of-concept for lodsb/lodsw
There are additional instruction emulation bugs pending fixes from Microsoft.
As we were pursuing our efforts on the CPU emulator, we accidentally stumbled on another bug, this time not in the emulator itself but inside the WinDbg extension exposed by TTD: TTDAnalyze.dll.
This extension leverages the debugger’s data model to allow a user to interact with the trace file in an interactive manner. This is done by exposing a TTD namespace under certain parts of the data model, such as the current process (@$curprocess), the current thread (@$curthread), and the current debugging session (@$cursession).
Figure 7: TTD query types
As an example, the @$cursession.TTD.Calls method allows a user to query all call locations captured within the trace. It takes as input either an address or a case-insensitive symbol name with support for regex. The symbol name can either be in the format of a string (with quotes) or a parsed symbol name (without quotes). The former is only applicable when the symbols are resolved fully (e.g., private symbols), as the data model has support for converting private symbols into an ObjectTargetObject object, thus making it consumable by the dx evaluation expression parser.
The bug in question directly affects the exposed Calls method under @$cursession.TTD.Calls because it uses a fixed, static buffer to capture the results of the symbol query. In Figure 8 we illustrate this by passing in two similar regex strings that produce inconsistent results.
Figure 8: TTD Calls query
When we query C* and Create*, the C* query results do not return the other Create APIs that were clearly captured in the trace. Under the hood, TTDAnalyze executes the examine debugger command “x KERNELBASE!C*” with a custom output capture to process the results. This output capture truncates any captured data if it is greater than 64 KB in size.
If we take the disassembly of the global buffer and output capture routine in TTDAnalyze (SHA256 CC5655E29AFA87598E0733A1A65D1318C4D7D87C94B7EBDE89A372779FF60BAD) prior to the fix, we can see the following (Figure 9 and Figure 10):
Figure 9: TTD implementation disassembly
Figure 10: TTD implementation disassembly
The capture for the examine command is capped at 64 KB. When the returned data exceeds this limit, truncation is performed at address 0x180029960. Naturally, querying symbols starting with C* yields a much larger volume of results than querying only those beginning with Create*, leading to the observed truncation of the data.
Final Thoughts
The analysis presented in this blog post highlights the critical nature of accuracy in instruction emulation—not just for debugging purposes, but also for ensuring robust security analysis. The observed discrepancies, while subtle, underscore a broader security concern: even minor deviations in emulation behavior can misrepresent the true execution of code, potentially masking vulnerabilities or misleading forensic investigations.
From a security perspective, the work emphasizes several key takeaways:
Reliability of Debugging Tools: TTD and similar frameworks are invaluable for reverse engineering and incident response. However, any inaccuracies in emulation, such as those revealed by the misinterpretation of pop r16, push segment, or lods* instructions, can compromise the fidelity of the analysis. This raises important questions about trust in our debugging tools when they are used to analyze potentially malicious or critical code.
Impact on Threat Analysis: The ability to replay a process’s execution with high fidelity is crucial for uncovering hidden behaviors in malware or understanding complex exploits. Instruction emulation bugs may inadvertently alter the execution path or state, leading to incomplete or skewed insights that could affect the outcome of a security investigation.
Collaboration and Continuous Improvement: The discovery of these bugs, followed by their detailed documentation and reporting to the relevant teams at Microsoft and AMD, highlights the importance of a collaborative approach to security research. Continuous testing, fuzzing, and cross-platform comparisons are essential in maintaining the integrity and security of our analysis tools.
In conclusion, this exploration not only sheds light on the nuanced challenges of CPU emulation within TTD, but also serves as a call to action for enhanced scrutiny and rigorous validation of debugging frameworks. By ensuring that these tools accurately mirror native execution, we bolster our security posture and improve our capacity to detect, analyze, and respond to sophisticated threats in an ever-evolving digital landscape.
Acknowledgments
We extend our gratitude to the Microsoft Time Travel Debugging team for their readiness and support in addressing the issues we reported. Their prompt and clear communication not only resolved the bugs but also underscored their commitment to keeping TTD robust and reliable. We further appreciate that they have made TTD publicly available—a resource invaluable for both troubleshooting and advancing Windows security research.
Amazon Athena Provisioned Capacity is now available in the Asia Pacific (Mumbai) Region. Provisioned Capacity allows you to run SQL queries on dedicated serverless resources for a fixed price, with no long-term commitment, and control workload performance characteristics such as query concurrency and cost.
Athena is a serverless, interactive query service that makes it possible to analyze petabyte-scale data with ease and flexibility. Provisioned Capacity provides workload management capabilities that help you prioritize, isolate, and scale your workloads. For example, use Provisioned Capacity when you need to run a high number of queries at the same time or isolate important queries from other queries that run in the same account. To get started, use the Athena console, AWS SDK, or CLI to request capacity and then select workgroups with queries you want to run on dedicated capacity.
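To sketch what that looks like with the AWS SDK for Python (boto3); the reservation name, DPU count, and workgroup are placeholders:

```python
import boto3

athena = boto3.client("athena", region_name="ap-south-1")  # Asia Pacific (Mumbai)

# Request dedicated capacity (measured in DPUs) for Athena queries.
athena.create_capacity_reservation(
    Name="analytics-capacity",   # placeholder reservation name
    TargetDpus=24,               # example size; check the docs for minimums
)

# Route queries from chosen workgroups onto the reserved capacity.
athena.put_capacity_assignment_configuration(
    CapacityReservationName="analytics-capacity",
    CapacityAssignments=[{"WorkGroupNames": ["primary"]}],
)
```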
Amazon WorkSpaces Pools now offers Federal Information Processing Standard 140-2 (FIPS) validated endpoints (FIPS endpoints) for user streaming sessions. FIPS 140-2 is a U.S. government standard that specifies the security requirements for cryptographic modules that protect sensitive information. WorkSpaces Pools FIPS endpoints use FIPS-validated cryptographic standards, which may be required for certain sensitive information or regulated workloads.
To enable FIPS endpoint encryption for end user streaming via the AWS Console, navigate to Directories and verify that the Pools directory where you want to add FIPS is in a STOPPED state and that the preferred protocol is set to TCP. Once verified, select the directory, and on the Directory Details page, update the endpoint encryption to FIPS 140-2 Validated Mode and save.
We are excited to announce that Amazon OpenSearch Serverless is expanding availability to the Europe (Spain) Region. OpenSearch Serverless is a serverless deployment option for Amazon OpenSearch Service that makes it simple to run search and analytics workloads without the complexities of infrastructure management. OpenSearch Serverless’ compute capacity used for data ingestion, search, and query is measured in OpenSearch Compute Units (OCUs). To control costs, customers can configure the maximum number of OCUs per account.
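For example, a short boto3 sketch of capping account-level OCU usage in the new Region; the limit values are placeholders:

```python
import boto3

# Europe (Spain) maps to the eu-south-2 Region code.
aoss = boto3.client("opensearchserverless", region_name="eu-south-2")

# Cap account-level OCU consumption for indexing and search to control cost.
aoss.update_account_settings(
    capacityLimits={
        "maxIndexingCapacityInOCU": 10,  # example limits; tune per workload
        "maxSearchCapacityInOCU": 10,
    }
)
```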
Today, AWS announces the general availability of GraphRAG, a capability in Amazon Bedrock Knowledge Bases that enhances Retrieval-Augmented Generation (RAG) by incorporating graph data. GraphRAG delivers more comprehensive, relevant, and explainable responses by leveraging relationships within your data, improving how Generative AI applications retrieve and synthesize information.
Since public preview, customers have leveraged the managed GraphRAG capability to get improved responses to queries from their end users. GraphRAG automatically generates and stores vector embeddings in Amazon Neptune Analytics, along with a graph representation of entities and their relationships. GraphRAG combines vector similarity search with graph traversal, enabling higher accuracy when retrieving information from disparate yet interconnected data sources.
GraphRAG with Amazon Neptune is built right into Amazon Bedrock Knowledge Bases, offering an integrated experience with no additional setup or additional charges beyond the underlying services. GraphRAG is generally available in AWS Regions where Amazon Bedrock Knowledge Bases and Amazon Neptune Analytics are both available (see current list of supported regions). To learn more, visit the Amazon Bedrock User Guide.
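As a rough sketch of querying a GraphRAG-backed knowledge base with the AWS SDK for Python (boto3); the knowledge base ID, model ARN, and question are placeholders:

```python
import boto3

bedrock_rt = boto3.client("bedrock-agent-runtime")

# Ask a question against a knowledge base configured with GraphRAG
# (Amazon Neptune Analytics under the hood); IDs below are placeholders.
response = bedrock_rt.retrieve_and_generate(
    input={"text": "How are supplier delays related to last quarter's shipment issues?"},
    retrieveAndGenerateConfiguration={
        "type": "KNOWLEDGE_BASE",
        "knowledgeBaseConfiguration": {
            "knowledgeBaseId": "KB12345678",  # placeholder knowledge base ID
            "modelArn": "arn:aws:bedrock:us-east-1::foundation-model/anthropic.claude-3-5-sonnet-20240620-v1:0",
        },
    },
)

print(response["output"]["text"])
```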
Today, Amazon SES announces the availability of the Vade Add On for Mail Manager, a sophisticated content filter that enhances email security for both incoming and outgoing messages. This new Add On, developed in collaboration with HornetSecurity, combines heuristics, behavioral analysis, and machine learning to provide robust protection against evolving communication threats such as spam, phishing attempts, and malware.
Now available as a rule property in Mail Manager, the Vade Add On empowers users with automated, real-time defense against email-based threats for safer communication. Its AI-powered technology employs a multi-layered approach, analyzing messages in real-time using advanced techniques like natural language processing. This integration allows customers to strengthen their email platforms by configuring ongoing protection against evolving cyber threats alongside existing Mail Manager rules, offering flexibility in managing their email security.
The Vade Add On for Amazon SES Mail Manager is available in the following AWS Regions: US East (N. Virginia), US East (Ohio), US West (Oregon), US West (N. California), Asia Pacific (Tokyo), Asia Pacific (Sydney), Asia Pacific (Mumbai), Asia Pacific (Osaka), Asia Pacific (Seoul), Canada (Central), Europe (Ireland), Europe (Frankfurt), Europe (London), Europe (Paris), Europe (Stockholm), and South America (São Paulo).
To learn more about this new feature and how it can enhance your email security, visit the documentation for Email Add Ons in Mail Manager. You can easily activate the Vade Advanced Email Security Add On directly from the Amazon SES console to start protecting your email communications today.
Contact Lens now enables you to create dynamic evaluation forms that automatically show or hide questions based on responses to previous questions, tailoring each evaluation to specific customer interaction scenarios. For example, when a manager answers “Yes” to the form question “Did the customer try to make a purchase on the call?”, the form automatically presents a follow-up question: “Did the agent read the sales disclosure?”. With this launch, you can consolidate evaluation forms that are applicable to different interaction scenarios into a single dynamic evaluation form which automatically hides irrelevant questions. This reduces manager effort in selecting the relevant evaluation form and determining which evaluation questions are applicable to the interaction, helping managers perform evaluations faster and more accurately.
This feature is available in all regions where Contact Lens performance evaluations are already available. To learn more, please visit our documentation and our webpage. For information about Contact Lens pricing, please visit our pricing page.
AWS Application Load Balancer (ALB) now allows customers to provide a pool of public IPv4 addresses for IP address assignment to load balancer nodes. Customers can configure a public IP Address Manager (IPAM) pool that consists of either customer-owned Bring Your Own IP (BYOIP) addresses or a contiguous IPv4 address block provided by Amazon.
With this feature, customers can optimize public IPv4 cost by using BYOIP in public IPAM pools. Customers can also simplify their enterprise allowlisting and operations, by using Amazon-provided contiguous IPv4 blocks in public IPAM pools. The ALB’s IP addresses are sourced from the IPAM pool and automatically switch to AWS managed IP addresses when the public IPAM pool is depleted. This intelligent switching maximizes service availability during scaling events.
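A hedged boto3 sketch of what referencing the pool at load balancer creation might look like; the subnets and IPAM pool ID are placeholders, and the IpamPools parameter name reflects our reading of the CreateLoadBalancer API, so confirm it against the current ELB API reference:

```python
import boto3

elbv2 = boto3.client("elbv2")

# Create an internet-facing ALB whose node addresses are drawn from a public
# IPAM pool (BYOIP or an Amazon-provided contiguous block). All identifiers
# below are placeholders.
elbv2.create_load_balancer(
    Name="public-ipam-alb",
    Type="application",
    Scheme="internet-facing",
    Subnets=["subnet-0123456789abcdef0", "subnet-0fedcba9876543210"],
    IpamPools={"Ipv4IpamPoolId": "ipam-pool-0abc123def456789a"},
)
```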
Amazon Redshift Data API, which lets you connect to Amazon Redshift through a secure HTTPS endpoint, now supports single sign-on (SSO) through AWS IAM Identity Center. Amazon Redshift Data API removes the need to manage database drivers, connections, network configurations, and data buffering, simplifying how you access your data warehouses and data lakes.
AWS IAM Identity Center lets customers connect existing identity providers from a centrally managed location. You can now use AWS IAM Identity Center with your preferred identity provider, including Microsoft Entra ID, Okta, and Ping, to connect to Amazon Redshift clusters through the Amazon Redshift Data API. This new SSO integration simplifies identity management, so that you don’t have to manage separate database credentials for your Amazon Redshift clusters. Once authenticated, your authorization rules are enforced using the permissions defined in Amazon Redshift or AWS Lake Formation.
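A brief boto3 sketch of what a Data API call can look like against an Identity Center-integrated cluster; the cluster, database, and SQL are placeholders, and the point to note is that no database user or secret is passed because authorization rides on the caller's IAM session:

```python
import time
import boto3

redshift_data = boto3.client("redshift-data")

# Submit a query; with an IAM Identity Center-integrated cluster, no DbUser or
# Secrets Manager secret is supplied here (placeholder identifiers below).
resp = redshift_data.execute_statement(
    ClusterIdentifier="my-redshift-cluster",
    Database="dev",
    Sql="SELECT current_user, count(*) FROM sales;",
)

# Poll for completion (simplified; production code should also handle
# FAILED/ABORTED states and paginate large result sets).
while True:
    desc = redshift_data.describe_statement(Id=resp["Id"])
    if desc["Status"] in ("FINISHED", "FAILED", "ABORTED"):
        break
    time.sleep(1)

if desc["Status"] == "FINISHED" and desc.get("HasResultSet"):
    print(redshift_data.get_statement_result(Id=resp["Id"])["Records"])
```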
This feature is available in all AWS Regions where both AWS IAM Identity Center and Amazon Redshift are available. For more information, see our documentation and blog.
AWS HealthOmics now supports the latest NVIDIA L4 and L40S graphical processing units (GPUs) and larger compute options of up to 192 vCPUs for workflows. AWS HealthOmics is a HIPAA-eligible service that helps healthcare and life sciences customers accelerate scientific breakthroughs with fully managed biological data stores and workflows. This release expands workflow compute capabilities to support more demanding workloads for genomics research and analysis.
In addition to current support for NVIDIA A10G and T4 GPUs, this release adds support for NVIDIA L4 and L40S GPUs, enabling researchers to efficiently run complex machine learning workloads such as protein structure prediction and biological foundation models (bioFMs). The enhanced CPU configurations, with up to 192 vCPUs and 1,536 GiB of memory, allow for faster processing of large-scale genomics datasets. These improvements help research teams reduce time-to-insight for critical life sciences work.
NVIDIA L4 and L40S GPUs and 128 and 192 vCPU omics instance types are now available in: US East (N. Virginia) and US West (Oregon). To get started with AWS HealthOmics workflows, see the documentation.
AI Hypercomputer is a fully integrated supercomputing architecture for AI workloads – and it’s easier to use than you think. In this blog, we break down four common use cases, including reference architectures and tutorials, representing just a few of the many ways you can use AI Hypercomputer today.
Short on time? Here’s a quick summary.
Affordable inference. JAX, Google Kubernetes Engine (GKE) and NVIDIA Triton Inference Server are a winning combination, especially when you pair them with Spot VMs for up to 90% cost savings. We have several tutorials, like this one on how to serve LLMs like Llama 3.1 405B on GKE.
Large and ultra-low latency training clusters. Hypercompute Cluster gives you physically co-located accelerators, targeted workload placement, advanced maintenance controls to minimize workload disruption, and topology-aware scheduling. You can get started by creating a cluster with GKE or try this pretraining NVIDIA GPU recipe.
High-reliability inference. Pair new cloud load balancing capabilities like custom metrics and service extensions with GKE Autopilot, which includes features like node auto-repair to automatically replace unhealthy nodes, and horizontal pod autoscaling to adjust resources based on application demand.
Easy cluster setup. The open-source Cluster Toolkit offers pre-built blueprints and modules for rapid, repeatable cluster deployments. You can get started with one of our AI/ML blueprints.
If you want to see a broader set of reference implementations, benchmarks and recipes, go to the AI Hypercomputer GitHub.
Why it matters
Deploying and managing AI applications is tough. You need to choose the right infrastructure, control costs, and reduce delivery bottlenecks. AI Hypercomputer helps you deploy AI applications quickly, easily, and with more efficiency relative to just buying the raw hardware and chips.
Take Moloco, for example. Using the AI Hypercomputer architecture they achieved 10x faster model training times and reduced costs by 2-4x.
Let’s dive deeper into each use case.
1. Reliable AI inference
According to Futurum, in 2023 Google had ~3x fewer outage hours vs. Azure, and ~3x fewer than AWS. Those numbers fluctuate over time, but maintaining high availability is a challenge for everyone. The AI Hypercomputer architecture offers fully integrated capabilities for high-reliability inference.
Many customers start with GKE Autopilot because of its 99.95% pod-level uptime SLA. Autopilot enhances reliability by automatically managing nodes (provisioning, scaling, upgrades, repairs) and applying security best practices, freeing you from manual infrastructure tasks. This automation, combined with resource optimization and integrated monitoring, minimizes downtime and helps your applications run smoothly and securely.
There are several configurations available, but in this reference architecture we use TPUs with the JetStream Engine to accelerate inference, plus JAX, GCS Fuse, and SSDs (like Hyperdisk ML) to speed up the loading of model weights. As you can see, there are two notable additions to the stack that get us to high reliability: Service Extensions and custom metrics.
Service extensions allow you to customize the behavior of Cloud Load Balancer by inserting your own code (written as plugins) into the data path, enabling advanced traffic management and manipulation.
Custom metrics, utilizing the Open Request Cost Aggregation (ORCA) protocol, allow applications to send workload-specific performance data (like model serving latency) to Cloud Load Balancer, which then uses this information to make intelligent routing and scaling decisions.
2. Large and ultra-low latency training clusters
Training large AI models demands massive, efficiently scaled compute. Hypercompute Cluster is a supercomputing solution built on AI Hypercomputer that lets you deploy and manage a large number of accelerators as a single unit, using a single API call. Here are a few things that set Hypercompute Cluster apart:
Clusters are densely physically co-located for ultra-low-latency networking. They come with pre-configured and validated templates for reliable and repeatable deployments, and with cluster-level observability, health monitoring, and diagnostic tooling.
To simplify management, Hypercompute Clusters are designed for integrating with orchestrators like GKE and Slurm, and are deployed via the Cluster Toolkit. GKE provides support for over 50,000 TPU chips to train a single ML model.
In this reference architecture, we use GKE Autopilot and A3 Ultra VMs.
GKE supports up to 65,000 nodes — we believe this is more than 10X larger scale than the other two largest public cloud providers.
A3 Ultra uses NVIDIA H200 GPUs with twice the GPU-to-GPU network bandwidth and twice the high bandwidth memory (HBM) compared to A3 Mega GPUs. They are built with our new Titanium ML network adapter and incorporate NVIDIA ConnectX-7 network interface cards (NICs) to deliver a secure, high-performance cloud experience, perfect for large multi-node workloads on GPUs.
3. Affordable inference
Serving AI, especially large language models (LLMs), can become prohibitively expensive. AI Hypercomputer combines open software, flexible consumption models and a wide range of specialized hardware to minimize costs.
Cost savings are everywhere, if you know where to look. Beyond the tutorials, there are two cost-efficient deployment models you should know. GKE Autopilot reduces the cost of running containers by up to 40% compared to standard GKE by automatically scaling resources based on actual needs, while Spot VMs can save up to 90% on batch or fault-tolerant jobs. You can combine the two to save even more — “Spot Pods” are available in GKE Autopilot to do just that.
In this reference architecture, after training with JAX, we convert into NVIDIA’s Faster Transformer format for inferencing. Optimized models are served via NVIDIA’s Triton on GKE Autopilot. Triton’s multi-model support allows for easy adaptation to evolving model architectures, and a pre-built NeMo container simplifies setup.
4. Easy cluster setup
You need tools that simplify, not complicate, your infrastructure setup. The open-source Cluster Toolkit offers pre-built blueprints and modules for rapid, repeatable cluster deployments. You get easy integration with JAX, PyTorch, and Keras. Platform teams get simplified management with Slurm, GKE, and Google Batch, plus flexible consumption models like Dynamic Workload Scheduler and a wide range of hardware options. In this reference architecture, we set up an A3 Ultra cluster with Slurm:
Kubernetes, the container orchestration platform, is inherently a complex, distributed system. While it provides resilience and scalability, it can also introduce operational complexities, particularly when troubleshooting. Even with Kubernetes’ self-healing capabilities, identifying the root cause of an issue often requires deep dives into the logs of various independent components.
At Google Cloud, our engineers have been directly confronting this Kubernetes troubleshooting challenge for years as we support large-scale, complex deployments. In fact, the Google Cloud Support team has developed deep expertise in diagnosing issues within Kubernetes environments through routinely analyzing a vast number of customer support tickets, diving into user environments, and leveraging our collective knowledge to pinpoint the root causes of problems. To address this pervasive challenge, the team developed an internal tool: the Kubernetes History Inspector (KHI), and today, we’ve released it as open source for the community.
The Kubernetes troubleshooting challenge
In Kubernetes, each pod, deployment, service, node, and control-plane component generates its own stream of logs. Effective troubleshooting requires collecting, correlating, and analyzing these disparate log streams. But manually configuring logging for each of these components can be a significant burden, requiring careful attention to detail and a thorough understanding of the Kubernetes ecosystem. Fortunately, managed Kubernetes services such as Google Kubernetes Engine (GKE) simplify log collection. For example, GKE offers built-in integration with Cloud Logging, aggregating logs from all parts of the Kubernetes environment. This centralized repository is a crucial first step.
However, simply collecting the logs solves only half the problem. The real challenge lies in analyzing them effectively. Many issues you’ll encounter in a Kubernetes deployment are not revealed by a single, obvious error message. Instead, they manifest as a chain of events, requiring a deep understanding of the causal relationships between numerous log entries across multiple components.
Consider the scale: a moderately sized Kubernetes cluster can easily generate gigabytes of log data, comprising tens of thousands of individual entries, within a short timeframe. Manually sifting through this volume of data to identify the root cause of a performance degradation, intermittent failure, or configuration error is, at best, incredibly time-consuming, and at worst, practically impossible for human operators. The signal-to-noise ratio is incredibly challenging.
Introducing the Kubernetes History Inspector
KHI is a powerful tool that analyzes logs collected by Cloud Logging, extracts state information for each component, and visualizes it in a chronological timeline. Furthermore, KHI links this timeline back to the raw log data, allowing you to track how each element evolved over time.
The Google Cloud Support team often assists users in critical, time-sensitive situations. A tool that requires lengthy setup or agent installation would be impractical. That’s why we packaged KHI as a container image — it requires no prior setup, and is ready to be launched with a single command.
It’s easier to show than to tell. Imagine a scenario where end users are reporting “Connection Timed Out” errors on a service running on your GKE cluster. Launching KHI, you might see something like this:
First, notice the colorful, horizontal rectangles on the left. These represent the state changes of individual components over time, extracted from the logs – the timeline. This timeline provides a macroscopic view of your Kubernetes environment. In contrast, the right side of the interface displays microscopic details: raw logs, manifests, and their historical changes related to the component selected in the timeline. By providing both macroscopic and microscopic perspectives, KHI makes it easy to explore your logs.
Now, let’s go back to our hypothetical problem. Notice the alternating green and orange sections in the “Ready” row of the timeline:
This indicates that the readiness probe is fluctuating between failure (orange) and success (green). That’s a smoking gun! You now know exactly where to focus your troubleshooting efforts.
KHI also excels at visualizing the relationships between components at any given point in the past. The complex interdependencies within a Kubernetes cluster are presented in a clear, understandable way.
What’s next for KHI and Kubernetes troubleshooting
We’ve only scratched the surface of what KHI can do. There’s a lot more under the hood: how the timeline colors actually work, what those little diamond markers mean, and many other features that can speed up your troubleshooting. To make this available to everyone, we open-sourced KHI.
For detailed specifications, a full explanation of the visual elements, and instructions on how to deploy KHI on your own managed Kubernetes cluster, visit the KHI GitHub page. Currently KHI only works with GKE and Kubernetes on Google Cloud combined with Cloud Logging, but we plan to extend its capabilities to the vanilla open-source Kubernetes setup soon.
While KHI represents a significant leap forward in Kubernetes log analysis, it’s designed to amplify your existing expertise, not replace it. Effective troubleshooting still requires a solid understanding of Kubernetes concepts and your application’s architecture. KHI helps you, the engineer, navigate the complexity by providing a powerful map to view your logs to diagnose issues more quickly and efficiently.
KHI is just the first step in our ongoing commitment to simplifying Kubernetes operations. We’re excited to see how the community uses and extends KHI to build a more observable and manageable future for containerized applications. The journey to simplify Kubernetes troubleshooting is ongoing, and we invite you to join us.
AWS WAF now supports JA4 fingerprinting of incoming requests, enabling customers to allow known clients or block requests from malicious clients. Additionally, you can now use both JA4 and JA3 fingerprints as aggregation keys within WAF’s rate-based rules, allowing you to monitor and control request rates based on client fingerprints.
A JA4 TLS client fingerprint contains a 36-character long fingerprint of the TLS Client Hello which is used to initiate a secure connection from clients. The fingerprint can be used to build a database of known good and bad actors to apply when inspecting HTTP requests. These new features enhance your ability to identify and mitigate sophisticated attacks by creating more precise rules based on client behavior patterns. By leveraging both JA4 and JA3 fingerprinting capabilities, you can implement robust protection against automated threats while maintaining legitimate traffic flow to your applications.
JA4 as a match statement is available in all regions where AWS WAF is available for Amazon CloudFront and Application Load Balancer (ALB). JA3 and JA4 aggregation keys are available in all regions, except the AWS GovCloud (US) Regions, the China Regions, Asia Pacific (Melbourne), Israel (Tel Aviv), and Asia Pacific (Malaysia). There is no additional cost for using this feature; however, standard AWS WAF charges still apply. For more information about pricing, visit the AWS WAF Pricing page.
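As a hedged boto3 sketch of a rate-based rule that aggregates on the JA4 fingerprint; names, limits, and priorities are placeholders, and the JA4Fingerprint custom-key name reflects our reading of the WAFv2 API, so verify it against the current reference:

```python
import boto3

wafv2 = boto3.client("wafv2")

# Rate-based rule sketch: count and block requests per JA4 client fingerprint.
ja4_rate_rule = {
    "Name": "rate-limit-by-ja4",                 # placeholder rule name
    "Priority": 1,
    "Statement": {
        "RateBasedStatement": {
            "Limit": 1000,                       # example request limit
            "AggregateKeyType": "CUSTOM_KEYS",
            "CustomKeys": [
                {"JA4Fingerprint": {"FallbackBehavior": "MATCH"}}
            ],
        }
    },
    "Action": {"Block": {}},
    "VisibilityConfig": {
        "SampledRequestsEnabled": True,
        "CloudWatchMetricsEnabled": True,
        "MetricName": "RateLimitByJA4",
    },
}

# The rule dict above would then be passed in the Rules list of a
# create_web_acl or update_web_acl call on the web ACL protecting your resource.
```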
AWS CodeConnections now allows you to securely share your Connection resource across individual AWS accounts or within your AWS Organization. Previously, to create a Connection, you installed the AWS Connector app for GitHub, GitLab, or Bitbucket in each AWS account from which source access was required.
You can now use AWS Resource Access Manager (RAM) to securely share a Connection to your third-party source provider across AWS accounts. By using AWS RAM to share your Connection resource, you no longer need to create a Connection in each AWS account. Instead, you can create a Connection in an AWS account, and then share the Connection across multiple AWS accounts. By using AWS RAM, you can also automate sharing the connection across AWS accounts, reducing the operational overhead to support a multi-account deployment strategy. To apply fine-grained access control, in the AWS account with which a Connection is shared, you can use IAM policies to manage what operations an IAM role can perform.
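For illustration, a minimal boto3 sketch of sharing a connection through AWS RAM; the connection ARN and account ID are placeholders:

```python
import boto3

ram = boto3.client("ram")

# Share an existing connection with another account (or an organization/OU ARN);
# the connection ARN and principal below are placeholders.
ram.create_resource_share(
    name="shared-github-connection",
    resourceArns=[
        "arn:aws:codeconnections:us-east-1:111111111111:connection/EXAMPLE-UUID"
    ],
    principals=["222222222222"],
    allowExternalPrincipals=False,   # keep sharing inside the organization
)
```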
To learn more about sharing connections, visit our documentation. To learn more about what Connections in AWS CodeConnections are and how they work, visit our documentation.
Starting today, general-purpose Amazon EC2 M7a instances are available in the AWS Asia Pacific (Sydney) Region. M7a instances, powered by 4th Gen AMD EPYC processors (code-named Genoa) with a maximum frequency of 3.7 GHz, deliver up to 50% higher performance compared to M6a instances.
With this additional Region, M7a instances are available in the following AWS Regions: US East (Ohio), US East (N. Virginia), US West (Oregon), Asia Pacific (Sydney, Tokyo), and Europe (Frankfurt, Ireland, Spain, Stockholm). These instances can be purchased with Savings Plans, or as Reserved, On-Demand, or Spot Instances. To get started, use the AWS Management Console, AWS Command Line Interface (CLI), or AWS SDKs. To learn more, visit the M7a instances page.
We are excited to announce that Amazon OpenSearch Serverless is expanding availability to the AWS US West (N. California) and Europe (Stockholm) Regions. OpenSearch Serverless is a serverless deployment option for Amazon OpenSearch Service that makes it simple to run search and analytics workloads without the complexities of infrastructure management. OpenSearch Serverless’ compute capacity used for data ingestion, search, and query is measured in OpenSearch Compute Units (OCUs). To control costs, customers can configure the maximum number of OCUs per account.