With the proliferation of digital devices and platforms including social media, mobile devices and IoT sensors, organizations are increasingly generating unstructured data in the form of images, audio files, videos, and documents etc. Over the last few months, we launched BigQuery integrations with Vertex AI to leverage Gemini 1.0 Pro, PaLM , Vision AI, Speech AI, Doc AI, Natural Language AI and more to help you interpret and extract meaningful insights from unstructured data.

While Vision AI provides image classification and object recognition capabilities, large language models (LLMs) unlock new visual use cases. To that end, we are expanding BigQuery and Vertex AI integrations to support multimodal generative AI use cases with Gemini 1.0 Pro Vision. Using familiar SQL statements, you can take advantage of Gemini 1.0 Pro Vision directly in BigQuery to analyze both images and videos by combining them with your own text prompts.

A birds-eye view of Vertex AI integration capabilities for analyzing unstructured data in BigQuery

Within a data warehouse setting, multimodal capabilities can help enhance your unstructured data analysis across a variety of use cases:

Object recognition: Answer questions related to fine-grained identification of the objects in images and videos.

Info seeking: Combine world knowledge with information extracted from the images and videos.

Captioning/description: Generate descriptions of images and videos with varying levels of detail.

Digital content understanding: Answer questions by extracting information from content like infographics, charts, figures, tables, and web pages.

Structured content generation: Generate responses in formats like HTML and JSON based on provided prompt instructions.

Turning unstructured data into structured data

With minimal prompt adjustments, Gemini 1.0 Pro Vision can produce structured responses in convenient formats like HTML or JSON, making them easy to consume in downstream tasks. In a data warehouse such as BigQuery, having structured data means you can use the results in SQL operations and combine it with other structured datasets for deeper analysis.

For example, imagine you have a large dataset that contains images of cars. You want to understand a few basic details about the car in each image. This is a use case that Gemini 1.0 Pro Vision can help with!

Combining text and image into a prompt for Gemini 1.0 Pro Vision, with a sample response.

Dataset from: 3D Object Representations for Fine-Grained Categorization Jonathan Krause, Michael Stark, Jia Deng, Li Fei-Fei 4th IEEE Workshop on 3D Representation and Recognition, at ICCV 2013 (3dRR-13). Sydney, Australia. Dec. 8, 2013.

As you can see, Gemini’s response is very thorough! But while the format and extra information are great if you’re a person, they’re not so great if you’re a data warehouse. Rather than turning unstructured data into more unstructured data, you can make changes to the prompt to direct the model on how to return a structured response.

Adjusting the text portion of the prompt to indicate a structured response from Gemini 1.0 Pro Vision, with a sample result.

You can see how this response would be much more useful in an environment like BigQuery.

Now let’s see how to prompt Gemini 1.0 Pro Vision directly in BigQuery to perform this analysis over thousands of images!

Accessing Gemini 1.0 Pro Vision from BigQuery ML

Gemini 1.0 Pro Vision is integrated with BigQuery through the ML.GENERATE_TEXT() function. To unlock this function in your BigQuery project, you will need to create a remote model that represents a hosted Vertex AI large language model. Fortunately, it’s just a few lines of SQL:

code_block
<ListValue: [StructValue([(‘code’, “CREATE MODEL `mydataset.gemini_pro_vision_model`rnREMOTE WITH CONNECTION `us.bqml_llm_connection`rnOPTIONS(endpoint = ‘gemini-pro-vision’);”), (‘language’, ”), (‘caption’, <wagtail.rich_text.RichText object at 0x3e6eed188e80>)])]>

Once the model is created, you can combine your data with the ML.GENERATE_TEXT() function in your SQL queries to generate text.

A few notes on the ML.GENERATE_TEXT() function syntax when it is pointing to a gemini-pro-vision model endpoint, as is the case in this example:

TABLE: takes an object table as input, where it can contain different types of unstructured objects (e.g. images, videos).

PROMPT: takes a single string text prompt that is placed as part of the option STRUCT (dissimilar to the case when using the gemini-pro model) and applies this prompt to each object, row-by-row, contained in the object TABLE.

code_block
<ListValue: [StructValue([(‘code’, “SELECTrn uri,rn ml_generate_text_llm_result as brand_model_yearrn FROMrn ML.GENERATE_TEXT(rn MODEL `mydataset.gemini_pro_vision_model`,rn TABLE `mydataset.car_images_object_table`,rn STRUCT(rn ‘What is the brand, model, and year of this car? Answer in JSON format with three keys: brand, model, year. brand and model should be string, year should be integer.’ AS prompt, TRUE AS flatten_json_output));”), (‘language’, ”), (‘caption’, <wagtail.rich_text.RichText object at 0x3e6eed188310>)])]>

Let’s take a peek at the results.

We can add some SQL to this query to extract each of the values for brand, model, and year into new fields for use downstream.

code_block
<ListValue: [StructValue([(‘code’, ‘WITH raw_json_result AS ( rnSELECTrn uri,rn ml_generate_text_llm_result as brand_model_yearrn FROMrn ML.GENERATE_TEXT(rn MODEL `mydataset.gemini_pro_vision_model`,rn TABLE `mydataset.car_images_object_table`,rn STRUCT(rn ‘What is the brand, model, and year of this car? Answer in JSON format with three keys: brand, model, year. brand and model should be string, year should be integer.’ AS prompt, TRUE AS flatten_json_output)))rnSELECTrn uri,rn JSON_QUERY(RTRIM(LTRIM(raw_json_result.brand_model_year, ” “`json”), ““`”), “$.brand”) AS brand,rn JSON_QUERY(RTRIM(LTRIM(raw_json_result.brand_model_year, ” “`json”), ““`”), “$.model”) AS model,rn JSON_QUERY(RTRIM(LTRIM(raw_json_result.brand_model_year, ” “`json”), ““`”), “$.year”) AS yearrnFROM raw_json_result’), (‘language’, ”), (‘caption’, <wagtail.rich_text.RichText object at 0x3e6eecd34130>)])]>

Now the responses have been parsed into new, structured columns.

And there you have it. We’ve just turned a collection of unlabeled, raw images into structured data, fit for analysis in a data warehouse. Imagine joining this new table with other relevant enterprise data. With a dataset of historical car sales, for example, you could determine the average or median sale price for similar cars in a recent time period. This is just a taste of the possibilities that are uncovered by bringing unstructured data into your data workflows!

When getting started with Gemini 1.0 Pro Vision in BigQuery, there are a few important items to note:

You need an enterprise or enterprise plus reservation to run Gemini 1.0 Pro Vision model inference over an object table. For reference see the BigQuery editions documentation.

Limits apply to functions that use Vertex AI large language models (LLMs) and Cloud AI services, so review the current quota in place for the Gemini 1.0 Pro Vision model.

Next steps

Bringing generative AI directly into BigQuery has enormous benefits. Instead of writing custom Python code and building data pipelines between BigQuery and the generative AI model APIs, you can now just write a few lines of SQL! BigQuery manages the infrastructure and helps you scale from one prompt to thousands. Check out the overview and demo video, and the documentation to see more example queries using ML.GENERATE_TEXT() with Gemini 1.0 Pro Vision.

Coming to Next ‘24? Check out session Power data analytics with generative AI using BigQuery and Gemini, where you can see Gemini Vision Pro and BigQuery in action.

Read More for the details.

2024 04 10

GCP – How Gemini in BigQuery accelerates data and analytics workflows with AI

Cloud, Google Cloud gcp

The journey of going from data to insights can be fragmented, complex and time consuming. Data teams spend time on repetitive and routine tasks such as ingesting structured and unstructured data, wrangling data in preparation for analysis, and optimizing and maintaining pipelines. Obviously, they’d rather prefer doing higher-value analysis and insights-led decision making.

At Next ‘23, we introduced Duet AI in BigQuery. This year at Next ‘24, Duet AI in BigQuery becomes Gemini in BigQuery which provides AI-powered experiences for data preparation, analysis and engineering as well as intelligent recommendations to enhance user productivity and optimize costs.

“With the new AI-powered assistive features in BigQuery and ease of integrating with other Google Workspace products, our teams can extract valuable insights from data. The natural language-based experiences, low-code data preparation tools, and automatic code generation features streamline high-priority analytics workflows, enhancing the productivity of data practitioners and providing the space to focus on high impact initiatives. Moreover, users with varying skill sets, including our business users, can leverage more accessible data insights to effect beneficial changes, fostering an inclusive data-driven culture within our organization.” said Tim Velasquez, Head of Analytics, Veo

Let’s take a closer look at the new features of Gemini in BigQuery.

Accelerate data preparation with AI

Your business insights are only as good as your data. When you work with large datasets that come from a variety of sources, there are often inconsistent formats, errors, and missing data. As such, cleaning, transforming, and structuring them can be a major hurdle.

To simplify data preparation, validation, and enrichment, BigQuery now includes AI augmented data preparation that helps users to cleanse and wrangle their data. Additionally we are enabling users to build low-code visual data pipelines, or rebuild legacy pipelines in BigQuery.

Once the pipelines are running in production, AI assists with finding and resolving issues such as schema or data drift, significantly reducing the toil associated with maintaining a data pipeline. Because the resulting pipelines run in BigQuery, users also benefit from integrated metadata management, automatic end-to-end data lineage, and capacity management.

Gemini in BigQuery provides AI-driven assistance for users to clean and wrangle data

Kickstart the data-to-insights journey

Most data analysis starts with exploration — finding the right dataset, understanding the data’s structure, identifying key patterns, and identifying the most valuable insights you want to extract. This step can be cumbersome and time-consuming, especially if you are working with a new dataset or if you are new to the team.

To address this problem, Gemini in BigQuery provides new semantic search capabilities to help you pinpoint the most relevant tables for your tasks. Leveraging the metadata and profiling information of these tables from Dataplex, Gemini in BigQuery surfaces relevant, executable queries that you can run with just one click. You can learn more about BigQuery data insights here.

Gemini in BigQuery suggests executable queries for tables that you can run in single click

Reimagine analytics workflows with natural language

To boost user productivity, we’re also rethinking the end-to-end user experience. The new BigQuery data canvas provides a reimagined natural language-based experience for data exploration, curation, wrangling, analysis, and visualization, allowing you to explore and scaffold your data journeys in a graphical workflow that mirrors your mental model.

For example, to analyze a recent marketing campaign, you can use simple natural language prompts to discover campaign data sources, integrate with existing customer data, derive insights, and share visual reports with executives — all within a single experience. Watch this video for a quick overview of BigQuery data canvas.

BigQuery data canvas allows you to explore and analyze datasets, and create a customized visualization, all using natural language prompts within the same interface

Enhance productivity with SQL and Python code assistance

Even advanced users sometimes struggle to remember all the details of SQL or Python syntax, and navigating through numerous tables, columns, and relationships can be daunting.

Gemini in BigQuery helps you write and edit SQL or Python code using simple natural language prompts, referencing relevant schemas and metadata. You can also leverage BigQuery’s in-console chat interface to explore tutorials, documentation and best practices for specific tasks using simple prompts such as: “How can I use BigQuery materialized views?” “How do I ingest JSON data?” and “How can I improve query performance?”

Optimize analytics for performance and speed

With growing data volumes, analytics practitioners including data administrators, find it increasingly challenging to effectively manage capacity and enhance query performance. We are introducing recommendations that can help continuously improve query performance, minimize errors and optimize your platform costs.

With these recommendations, you can identify materialized views that can be created or deleted based on your query patterns and partition or cluster of your tables. Additionally, you can autotune Spark pipelines and troubleshoot failures and performance issues.

Get started

To learn more about Gemini in BigQuery, watch this short overview video and refer to the documentation , and sign up to get early access to the preview features. If you’re at Next ‘24, join our data and analytics breakout sessions and stop by at the demo stations to explore further and see these capabilities in action. Pricing details for Gemini in BigQuery will be shared when generally available to all customers.

Read More for the details.

2024 04 10

GCP – Accelerate AI Inference with Google Cloud TPUs and GPUs

Cloud, Google Cloud gcp

In the rapidly evolving landscape of artificial intelligence, the demand for high-performance, cost-efficient AI inference (serving) has never been greater. This week we announced two new open source software offerings: JetStream and MaxDiffusion.

JetStream is a new inference engine for XLA devices, starting with Cloud TPUs. JetStream is specifically designed for large language models (LLMs) and represents a significant leap forward in both performance and cost efficiency, offering up to 3x more inferences per dollar for LLMs than previous Cloud TPU inference engines. JetStream supports PyTorch models through PyTorch/XLA, and JAX models through MaxText – our highly scalable, high-performance reference implementation for LLMs that customers can fork to accelerate their development.

MaxDiffusion is the analog of MaxText for latent diffusion models, and makes it easier to train and serve diffusion models optimized for high performance on XLA devices, starting with Cloud TPUs.

In addition, we are proud to share the latest performance results from MLPerf™ Inference v4.0, showcasing the power and versatility of Google Cloud’s A3 virtual machines (VMs) powered by NVIDIA H100 GPUs.

JetStream: High-performance, cost-efficient LLM inference

LLMs are at the forefront of the AI revolution, powering a wide range of applications such as natural language understanding, text generation, and language translation. To reduce our customers’ LLM inference costs, we built JetStream: an inference engine that provides up to 3x more inferences per dollar than previous Cloud TPU inference engines.

Figure 1: The JetStream stack.

JetStream includes advanced performance optimizations such as continuous batching, sliding window attention, and int8 quantization for weights, activations, and key-value (KV) cache. And whether you’re working with JAX or PyTorch, JetStream supports your preferred framework. To further streamline your LLM inference workflows, we provide MaxText and PyTorch/XLA implementations of popular open models such as Gemma and Llama, optimized for peak cost-efficiency and performance.

On Cloud TPU v5e-8, JetStream delivers up to 4783 tokens/second for open models including Gemma in MaxText and Llama 2 in PyTorch/XLA:

Figure 2: JetStream throughput (output tokens / second). Google internal data. Measured using Gemma 7B (MaxText), Llama 2 7B (PyTorch/XLA), and Llama 2 13B (PyTorch/XLA) on Cloud TPU v5e-8. Maximum input length: 1024, maximum output length: 1024. Continuous batching, int8 quantization for weights, activations, KV cache. PyTorch/XLA uses sliding window attention. As of April, 2024.

JetStream’s high performance and efficiency mean lower inference costs for Google Cloud customers, making LLM inference more accessible and affordable:

Figure 3: JetStream cost to generate 1 million output tokens. Google internal data. Measured using Gemma 7B (MaxText), Llama 2 7B (PyTorch/XLA), and Llama 2 13B (PyTorch/XLA) on Cloud TPU v5e-8. Maximum input length: 1024, maximum output length: 1024. Continuous batching, int8 quantization for weights, activations, KV cache. PyTorch/XLA uses sliding window attention. JetStream ($0.30 per 1M tokens) achieves up to 3x more inferences per dollar on Gemma 7B compared to the previous Cloud TPU LLM inference stack ($1.10 per 1M tokens). Cost is based on the 3Y CUD price for Cloud TPU v5e-8 in the US. As of April, 2024.

Customers such as Osmos are using JetStream to accelerate their LLM inference workloads:

“At Osmos, we’ve developed an AI-powered data transformation engine to help companies scale their business relationships through the automation of data processing. The incoming data from customers and business partners is often messy and non-standard, and needs intelligence applied to every row of data to map, validate, and transform it into good, usable data. To achieve this we need high-performance, scalable, cost-efficient AI infrastructure for training, fine-tuning, and inference. That’s why we chose Cloud TPU v5e with MaxText, JAX, and JetStream for our end-to-end AI workflows. With Google Cloud, we were able to quickly and easily fine-tune Google’s latest Gemma open model on billions of tokens using MaxText and deploy it for inference using JetStream, all on Cloud TPU v5e. Google’s optimized AI hardware and software stack enabled us to achieve results within hours, not days.” – Kirat Pandya, CEO, Osmos

By providing researchers and developers with a powerful, cost-efficient, open-source foundation for LLM inference, we’re powering the next generation of AI applications. Whether you’re a seasoned AI practitioner or just getting started with LLMs, JetStream is here to accelerate your journey and unlock new possibilities in natural language processing.

Experience the future of LLM inference with JetStream today. Visit our GitHub repository to learn more about JetStream and get started on your next LLM project. We are committed to developing and supporting JetStream over the long term on GitHub and through Google Cloud Customer Care. We are inviting the community to build with us and contribute improvements to further advance the state of the art.

MaxDiffusion: High-performance diffusion model inference

Just as LLMs have revolutionized natural language processing, diffusion models are transforming the field of computer vision. To reduce our customers’ costs of deploying these models, we created MaxDiffusion: a collection of open-source diffusion-model reference implementations. These implementations are written in JAX and are highly performant, scalable, and customizable – think MaxText for computer vision.

MaxDiffusion provides high-performance implementations of core components of diffusion models such as cross attention, convolutions, and high-throughput image data loading. MaxDiffusion is designed to be highly adaptable and customizable: whether you’re a researcher pushing the boundaries of image generation or a developer seeking to integrate cutting-edge gen AI capabilities into your applications, MaxDiffusion provides the foundation you need to succeed.

The MaxDiffusion implementation of the new SDXL-Lightning model achieves 6 images/s on Cloud TPU v5e-4, and throughput scales linearly to 12 images/s on Cloud TPU v5e-8, taking full advantage of the high performance and scalability of Cloud TPUs

Figure 4: MaxDiffusion throughput (images per second). Google internal data. Measured using the SDXL-Lightning model on Cloud TPU v5e-4 and Cloud TPU v5e-8. Resolution: 1024×1024, batch size per device: 2, decode steps: 4. As of April, 2024.

And like MaxText and JetStream, MaxDiffusion is cost-efficient: generating 1000 images on Cloud TPU v5e-4 or Cloud TPU v5e-8 costs just $0.10.

Figure 5: MaxDiffusion cost to generate 1000 images. Google internal data. Measured using the SDXL-Lightning model on Cloud TPU v5e-4 and Cloud TPU v5e-8. Resolution: 1024×1024, batch size per device: 2, decode steps: 4. Cost is based on the 3Y CUD prices for Cloud TPU v5e-4 and Cloud TPU v5e-8 in the US. As of April, 2024.

Customers such as Codeway are using Google Cloud to maximize cost-efficiency for diffusion model inference at scale:

“At Codeway, we create chart-topping apps and games used by more than 115 million people in 160 countries around the world. “Wonder,” for example, is an AI-powered app that turns words into digital artworks, while “Facedance” makes faces dance with a range of fun animations. Putting AI in the hands of millions of users requires a highly scalable and cost-efficient inference infrastructure. With Cloud TPU v5e, we achieved 45% faster serving time for serving diffusion models compared to other inference solutions, and can serve 3.6 times more requests per hour. At our scale, this translates into significant infrastructure cost savings, and makes it possible for us to bring AI-powered applications to even more users in a cost-efficient manner.” – Uğur Arpacı, Head of DevOps, Codeway

MaxDiffusion provides a high-performance, scalable, flexible foundation for image generation. Whether you’re a seasoned computer vision expert or just dipping your toes into the world of image generation, MaxDiffusion is here to support you on your journey.

Visit our GitHub repository to learn more about MaxDiffusion and start building your next creative project today.

A3 VMs: Strong results in MLPerf™ 4.0 Inference

In August 2023 we announced the general availability of A3 VMs. Powered by 8 NVIDIA H100 Tensor Core GPUs in a single VM, A3s are purpose-built to train and serve demanding gen AI workloads and LLMs. A3 Mega, powered by NVIDIA H100 GPUs, will be generally available next month and offers double the GPU-to-GPU networking bandwidth of A3.

For the MLPerf™ Inference v4.0 benchmark testing, Google submitted 20 results across seven models, including the new Stable Diffusion XL and Llama 2 (70B) benchmarks, using A3 VMs:

RetinaNet (Server and Offline)

3D U-Net: 99% and 99.9% accuracy (Offline)

BERT: 99 and 99% accuracy (Server and Offline)

DLRM v2: 99.9% accuracy (Server and Offline)

GPT-J: 99% and 99% accuracy (Server and Offline)

Stable Diffusion XL (Server and Offline)

Llama 2: 99% and 99% accuracy (Server and Offline)

All results were within 0-5% of the peak performance demonstrated by NVIDIA’s submissions. These results are a testament to Google Cloud’s close partnership with NVIDIA to build workload-optimized end-to-end solutions specifically for LLMs and gen AI.

Powering the future of AI with Google Cloud TPUs and NVIDIA GPUs

Google’s innovation in AI inference, powered by hardware advancements in Google Cloud TPUs and NVIDIA GPUs, plus software innovations such as JetStream, MaxText, and MaxDiffusion, empower our customers to build and scale AI applications. With JetStream, developers can achieve new levels of performance and cost efficiency in LLM inference, unlocking new opportunities for natural language processing applications. MaxDiffusion provides a foundation that empowers researchers and developers to explore the full potential of diffusion models to accelerate image generation. Our robust MLPerf™ 4.0 inference results on A3 VMs powered by NVIDIA H100 Tensor Core GPUs showcase the power and versatility of Cloud GPUs.

Visit our website to learn more and get started with Google Cloud TPU and GPU inference today.

Read More for the details.

2024 04 10

GCP – Gemini in Databases — supercharge database development and management

Cloud, Google Cloud gcp

Yesterday, at Next ‘24, we announced a preview of Gemini in Databases — an AI powered database assistant that simplifies all aspects of database operations including assisted development, performance optimization, fleet management, governance, and migrations. With Gemini in Databases, companies no longer need to exclusively rely on people with specialized skills and custom resources to manage their databases — Google’s proactive AI assistant can help!

In most organizations, developers, database administrators, and platform engineers handle various aspects of database development and management. However, with the proliferation of databases to meet growing data processing demands, these experts are often overwhelmed by a myriad of database requirements, operational issues, and inefficiencies. Furthermore, with the rapid evolution in database technologies, developers are finding it hard to stay up-to-date with the latest development techniques, hampering both development quality and productivity. At the same time, “a majority (79%) of IT teams are now using more than one database platform, with 29% of respondents using more than five, and across a wide range of categories, IT professionals cite a lack of skills, knowledge gaps and a rapid need to upgrade and diversify their abilities at every level”.1 Between developer productivity challenges, performance regressions, security and compliance vulnerability, and data migration difficulties, 82% of developers spend 30 minutes each day searching for solutions, according to a Stack Overflow survey.2

Gemini in Databases delivers an AI-powered assistant that helps across all stages of the database journey without development and operations teams needing to acquire specialized skills. With Database Studio, developers can build and deploy applications faster with the ability to generate, and summarize SQL code from simple natural language instructions. With Database Insights, operators can easily optimize performance by leveraging the insights and smart recommendations from the AI-powered assistant. With Database Center, platform engineers can manage entire fleets of diverse databases using the intelligent dashboards built with AI, proactively assessing availability, data protection, security, and compliance issues without any custom tools or processes. Finally, with Database Migration Service, database administrators can streamline and expedite their database migrations with automated code and database conversions. With Gemini in Databases, database professionals get all of these capabilities in a single offering supported by both visual and natural language interfaces, enabling them to simplify and de-risk database management.

“As we are journeying into digital transformation, we are concerned about maintaining our pace of development due to the growing complexity of database environments that power our applications,” said Bogdan Capatina, Technical Expert in Database Technologies, Ford Motor Company. “We manage thousands of databases across hundreds of projects, which makes it extremely challenging to optimize performance and ensure security. Gemini in Databases is a game changer, bringing the power of assistive AI across the database journey. Now, we can get answers on fleet health in seconds instead of days and act on performance and security recommendations to proactively mitigate potential risks to our applications more swiftly than ever before.”

Accelerate application development

With the increased focus on digital transformation and modernization of data platforms, developers who now frequently also perform the DevOps role are dealing with the increased burden to not only build, manage, and future-proof architectures, but also develop applications at a faster clip.

Database Studio empowers developers to easily write and understand SQL with intelligent code assistance, code completion, and guidance directly in the editor. Developers can utilize the context-aware chat interface with natural language to build database applications faster using the SQL suggestions provided by the AI assistant. These assistive features can easily boost developer productivity especially in cases where developers are unfamiliar with a particular database dialect or are looking to optimize existing code for better performance and efficiency.

Database Studio provides code assistance, code completion, and guidance directly in the editor. Also, developers can utilize a context-aware chat interface that uses natural language to manage databases.

Stay ahead of performance issues

Database Insights empowers operators to address their database performance issues through an easy-to-use interface, providing visibility into all database metrics in a single view, saving time and enhancing productivity.

Database Insights makes resolving complex database problems easier by providing in-context explainability for nuanced database concepts such as wait events, database flags, and various database metrics available for troubleshooting. It automatically analyzes your workloads, highlights problems, and provides recommendations to resolve them. It also looks for common problems like missing indexes, incorrect flags, and resource bottlenecks, among other key performance configurations, to help operators optimize queries and tune their databases.

With richer, near-real-time diagnostics, operators can easily detect and troubleshoot their query performance issues. This powerful yet simple experience allows them to correlate query metrics across a multitude of dimensions and identify the root cause of previously hard-to-diagnose problems.

Database Insights automatically analyzes your workloads, highlights problems, and provides recommendations to resolve them

De-risk and optimize your database fleet

With the new Database Center, platform engineers now have a single pane of glass into their entire database fleet, regardless of the number of instances or types of database engines. This new dashboard provides a central view on database performance, security, cost and availability to proactively monitor and identify instances that need attention.

Managing fleets of databases can be complex and time-consuming. However, with the integrated AI assistant in Database Center, database teams can interface with the system using natural language, making it easier to find the information they need and troubleshoot problems. Teams can ask ad-hoc questions regarding their database health and get tailored responses, enhancing productivity. In addition, Database Center monitors critical signals to highlight potential performance issues, using predictive analytics to alert when resources will be saturated. It also scans for common database flag misconfigurations and provides best practices to build the best version of their database.Database Center can also automatically analyze your instance usage patterns to provide right-sizing recommendations tailored for your workload, preventing over-provisioning that wastes resources or under-provisioning that compromises performance and availability. Furthermore, the AI assistant looks for idle instances and seasonal usage patterns to suggest shutdown and start policies during non-operational hours. Database Center delivers simplified security configuration through automated scans, providing recommendations to mitigate risks and promote industry-standard best practices. By integrating with Security Command Center, Database Center empowers teams with near-instant analysis of security findings and potential attack vectors, keeping them proactively shielded against adversaries.

Database Center allows users to ask ad-hoc questions on their database health and get tailored responses, ultimately enhancing productivity

Streamline and expedite database migrations

Performing database migrations require a lot of expertise, and understanding database schemas along with their dependencies in application code can be tedious. To save time and resources, Database Migration Service (DMS) makes it easier to lift and shift databases to Cloud SQL and modernize legacy databases to a cloud-optimized database such as AlloyDB. With Gemini in DMS, developers and administrators can now easily assess and convert database-resident code such as stored procedures, triggers, and functions, to the PostgreSQL dialect. Our AI capabilities also help with last-mile code conversion by learning from the user’s initial conversions. Furthermore, our AI-powered explainability feature in DMS helps developers to easily learn new PostgreSQL dialects, optimize SQL code, and enhance readability for better productivity, easier migrations, and higher efficiency.

Gemini in Database Migration Service helps with last-mile code conversion by learning from the initial conversions from the user

How to get started

We are excited to share that the new capabilities of Gemini in Databases are available in preview today. To learn more, visit https://cloud.google.com/products/gemini/databases

1. The State of the Database Landscape, Redgate, 2024
2. Professional Developer Productivity Impact Survey, StackOverflow, 2022

Read More for the details.

2024 04 10

GCP – Introducing Gemini in Looker to bring intelligent AI-powered BI to everyone

Cloud, Google Cloud gcp

We are at a pivotal moment for business intelligence (BI). There’s more data than ever impacting all aspects of business. Organizations are faced with increasing user demands for that data, with a wide range of access requirements. Then there’s AI, which is radically transforming how you create and think about every project. The delivery and adoption of generative AI is poised to bring the full benefits of BI to users who find a conversational experience more appealing than traditional methods. This week at Google Cloud Next, we introduced Conversational Analytics as part of Gemini in Looker, rethinking how we bring easy access of insights to our users, transforming the way we engage with our data in BI, using natural language. In addition, we announced the preview of an array of capabilities for Looker that leverage Google’s work in generative AI and speed up your organization’s ability to dive deeper into the data that matters most, so you can rapidly create and share insights.

With Gemini in Looker, your relationship with your data and reporting goes from a slow-moving and high-friction process, limited by gatekeepers, to a collaborative and intelligent conversation – powered by AI. The deep integration of Gemini models in Looker brings insights to the major user flows that power your business, and establishes a single source of truth for your data with consistent metrics.

Conversational Analytics brings your data to life

Conversational Analytics is a dedicated space in Looker for you to chat with and engage with your business data, as simply as you would ask a colleague a question on chat. In combination with LookML semantic models available from Looker, we establish a single source of truth for your data, providing consistent metrics across your organization. Now, your entire company, including business and analyst teams, can chat with your data and obtain insights in seconds, fully enabling your data-driven decision-making culture.

You can leverage Conversational Analytics, using Gemini in Looker, to find top products, sales details, and dive deeper into the answers with follow-up questions.

With Conversational Analytics, everyone can uncover patterns and trends in data, as if you were speaking to your in-house data expert – and while the answers come in seconds, Looker shows you the data behind the insights, so you know the foundation is accurate and the method is true.

Smart and simple modeling on a trusted foundation

In the generative AI era, ensuring data authenticity and standardizing operational metrics is more than a nice to have – it’s critical, ensuring measures and comparisons across apps and teams are reliable and consistent. Looker’s semantic layer is at the heart of our modeling capabilities, powering the centrally defined metrics and data relationships that mean truth and accuracy as you go through your workflows. With LookML, your analysts can work together seamlessly to create universal data and metrics definitions.

Gemini in Looker features LookML Assistant, which we hope will enable everyone to leverage and improve the power of their semantic models quickly using natural language. Simply tell Gemini in Looker what you are looking to build, and the LookML code will be automatically created for you, setting the stage for governed data, powered by generative AI, easier than ever before.

Expanding intelligence for all Looker customers — and beyond

As the world of BI has evolved, so have our customers’ needs. They demand powerful and complete BI tools that are intuitive to use, with self-service exploration, seamless ad-hoc analysis, and high-quality visualizations all in a single platform, augmented by generative AI.

We are now offering Looker Studio Pro to licensed Looker users (excluding Embed), at no additional cost, making getting started with BI easier than ever.

Our vision is that Looker is the single source of truth for both modeled data and metrics that can be consumed anywhere — in our products, through partner BI tools or through our open SQL APIs. Looker’s modeling layer provides a single place to curate and govern the metrics most important to your business, meaning that customers can see consistent results no matter where they interact with their data.

Thanks to deep integration with Google Workspace, you can ask questions of your data with Gemini in Looker, helping you create reports easily and bring your creations to Slides.

Traditionally, BI tools take a user out of the flow of their work. We believe we can improve on this, helping users collaborate on their data where they are. With this in mind, we have extended our connections to Google Workspace, with the goal of meeting users where they are, across Slides, Sheets and Chat. Users will be able to automatically create Looker Studio reports from Google Sheets, helping you rapidly visualize and share insights on your data, while Slide Generation from Gemini in Looker eliminates that blank deck start, starting with your visuals and reports, and building AI-generated summaries to kick off your presentation right.

Business data insights as easy as asking Google

Gemini in Looker offers an array of new capabilities to help speed up and make analytics tasks and workflows including data modeling, chart creation, slide presentation generation and more even easier. As Google has done for decades in applications like Chrome, Gmail, and Google Maps, Gemini in Looker offers a customer experience that is intuitive and efficient.Conversational Analytics in Looker and LookML Assistant are joined by a set of capabilities that we first showcased at Next 2023, namely:

Report generation: Build an entire report, including multiple visualizations, a title, theme and layout, in seconds, by providing a one- two-sentence prompt. Gemini in Looker is an AI analyst that can create entire reports, giving you a starting point that you can adjust by using natural language.

Advanced visualization assistant: Customize your visualizations using natural language. Gemini in Looker helps create JSON code configs, which you can modify as necessary, and generate a custom visualization.

Automatic slide generation: Create impactful presentations with insightful text summaries of your data. Gemini in Looker automatically exports a report into Google Slides, with text narratives that explains the data in charts and highlights key insights.

Formula assistant: Create calculated fields on-the-fly to extend and transform the information flowing from your data sources. Gemini in Looker removes the need for you to remember complicated formulas, and creates your formula for you, for ad-hoc analysis.

Each of these capabilities are now available in preview.

Reliable intelligence for the generative AI era

Looker plays a critical role in Google Cloud’s intelligence platform, unifying your data ecosystem. Bringing even more intelligence into Looker with Gemini makes it easier for our customers to understand and access their business data for analysts to create dashboards and reports, and for developers to build new semantic models. Join us as we create new experiences with data and analytics — one defined by AI-powered conversational interfaces for data and analytics. It all starts with a simple chat box.

Read More for the details.

2024 04 10

GCP – Get to know BigQuery data canvas: an AI-centric experience to reimagine data analytics

Cloud, Google Cloud gcp

Navigating the complexities of the data-to-insights journey can be frustrating. Data professionals spend valuable time sifting through data sources, reinventing the wheel with each new question that comes their way. They juggle multiple tools, hop between coding languages, and collaborate with a wide array of teams across their organizations. This fragmented approach is riddled with bottlenecks, preventing analysts from generating insights and doing high-impact work as quickly as they should.

Yesterday at Google Cloud Next ‘24, we introduced BigQuery data canvas, which reimagines how data professionals work with data. This novel user experience helps customers create graphical data workflows that map to their mental model while AI innovations accelerate finding, preparing, analyzing, visualizing and sharing data and insights.

Watch this video for a quick overview of BigQuery data canvas.

BigQuery data canvas: a NL-driven analytics experience

BigQuery data canvas makes data analytics faster and easier with a unified, natural language-driven experience that centralizes data discovery, preparation, querying, and visualization. Rather than toggling between multiple tools, you can now use data canvas to focus on the insights that matter most to your business. Data canvas addresses the challenges of traditional data analysis workflow in two areas:

Natural language-centric experience: Instead of writing code, you can now speak directly to your data. Ask questions, direct tasks, and let the AI guide you through various analytics tasks.

Reimagined user experience: Data canvas rethinks the notebook concept. Its expansive canvas workspace fosters iteration and easy collaboration, allowing you to refine your work, chain results, and share workspaces with colleagues.

For example, to analyze a recent marketing campaign with BigQuery data canvas, you could use natural language prompts to discover campaign data sources, integrate them with existing customer data, derive insights, collaborate with teammates and share visual reports with executives — all within a single canvas experience.

Natural language-based visual workflow with BigQuery data canvas

Do more with BigQuery data canvas

BigQuery provides a variety of features that can help analysts accelerate their analytics tasks:

Search and discover: Easily find the specific data asset visualization table or view that you need to work with. Or search for the most relevant data assets. Data canvas works with all data that can be managed with BigQuery, including BigQuery managed storage, BigLake, Google Cloud Storage objects, and BigQuery Omni tables. For example, you could use either of the follow inputs to pull data with data canvas:

Specific table: project_name.dataset_name.table_name

Search: “customer transaction data” or “projectid:my-project-name winter jacket sales Atlanta”

Explore data assets: Review the table schema, review their details or preview data and compare it side by side.

Generate SQL queries: Iterate with NL inputs to generate the exact SQL query you need to accomplish the analytics task at hand. You can also edit the SQL before executing it.

Combine results: Define joins with plain language instructions and refine the generated SQL as needed. Use query results as a starting point for further analysis with prompts like “Join this data with our customer demographics on order id.”

Visualize: Use natural language prompts to easily create and customize charts and graphs to visualize your data, e.g., “create a bar chart with gradient” Then, seamlessly share your findings by exporting your results to Looker Studio or Google Sheets.

Automated insights: Data canvas can interpret query results and chart data and generate automated insights from them. For example, it can look at the query results of sales deal sizes and automatically provide the insight “the median deal size is $73,500.”

Share to collaborate: Data analytics projects are often a team effort. You can simply save your canvas and share it with others using a link.

Popular use cases

While BigQuery data canvas can accelerate many analytics tasks, it’s particularly helpful for:

Ad hoc analysis: When working on a tight deadline, data canvas makes it easy to pull data from various sources.

Exploratory data analysis (EDA): This critical early step in the data analysis process focuses on summarizing the main characteristics of a dataset, often visually. Data canvas helps find data sources and then presents the results visually.

Collaboration: Data canvas makes it easy to share an analytics project with multiple people.

What our customers are saying

Companies large and small have been experimenting with BigQuery data canvas for their day-to-day analytics tasks and their feedback has been very positive.

Wunderkind, a performance marketing channel that powers one-to-one customer interactions, has been using BigQuery data canvas across their analytics team for several weeks and is experiencing significant time savings.

“For any sort of investigation or exploratory exercise resulting in multiple queries there really is no replacement [for data canvas]. [It] Saves us so much time and mental capacity!” – Scott Schaen, VP of Data & Analytics, Wunderkind

Veo, a micro mobility company that operates in 50+ locations across the USA, is seeing immediate benefits from the AI capabilities in data canvas.

“I think it’s been great in terms of being able to turn ideas in the form of NL to SQL to derive insights. And the best part is that I can review and edit the query before running it – that’s a very smart and responsible design. It gives me the space to confirm it and ensure accuracy as well as reliability!” – Tim Velasquez, Head of Analytics, Veo

Give BigQuery data canvas a try

To learn more, watch this video and check out the documentation. BigQuery data canvas is launching in preview and will be rolled out to all users starting on April 15th. Submit this form to get early access.

For any bugs and feedback, please reach out to the product and engineering team at datacanvas-feedback@google.com. We’re looking forward to hearing how you use the new data canvas!

Read More for the details.

2024 04 10

GCP – Celebrating 20 years of Bigtable with exciting announcements at Next

Cloud, Google Cloud gcp

How do you store the entire internet? That’s the problem our engineering team set out to solve in 2004 when it launched Bigtable, one of Google’s longest serving and largest data storage systems. As the internet — and Google — grew, we needed a new breed of storage solution to reliably handle millions of requests a second to store the ever-changing internet. And when we revealed its design to the world in a 2006 research paper, Bigtable kicked-off the Big Data revolution, inspiring the database architectures for NoSQL systems such as Apache HBase and Cassandra.

Twenty years later, Bigtable doesn’t just support Google Search, but also latency-sensitive workloads across Google such as Ads, Drive, Analytics, Maps, and YouTube. On Google Cloud, big names like Snap, Spotify and Shopify rely on Bigtable, now serving a whopping 7 billion queries per second at peak. On any given day, it is nearly impossible to use the internet without interacting with a Bigtable database.

Bigtable isn’t just for Big Tech, though. This year, our goal is to bring Bigtable to a much broader developer audience and range of use cases, starting with a number of capabilities that we announced this week at Google Cloud Next.

Introducing Bigtable Data Boost and Authorized Views

For one, Bigtable now supports Data Boost, a serverless way for users to perform analytical queries on their transactional data without impacting their operational workloads. Currently in preview, Data Boost makes managing multiple copies of data for serving and analytics a thing of the past. Further, Data Boost supports a requester-pays model, billing data consumers directly for their usage — a unique capability for an operational database.

Then, new Bigtable Authorized Views enable many data sharing and collaboration scenarios. For example, retailers can securely share sales or inventory data with each of their vendors, so they can more accurately forecast demand and resupply the shelves — without worrying about how much server capacity to provision. This type of use case is quite common for organizations with multiple business units, but sharing this level of data has traditionally required keeping copies of data in multiple databases, building custom application layers and billing components. Instead, with Bigtable Authorized Views and Data Boost, each vendor will get their own bill for the amount of data they process, with no negative impact on retailer’s operations. Bigtable Authorized Views make it easier to serve data from a single source of truth, with improved data governance and quality.

These features, along with the existing Request Priorities, stand to transform Bigtable into an all-purpose data fabric, or a Digital Integration Hub. Many Google Cloud customers already use Bigtable for their data fabrics, where its strong write performance, horizontal scalability and flexible schema make it an ideal platform for projects that ingest large amounts of data in batch from multiple sources or collate real-time streaming events. But businesses and their data evolve over time. New data sources are added through acquisitions, partnerships, new product launches, additional business metrics and ML features. To get the value out of data, you need to combine all the pieces and see the big picture — and do it in real-time. Bigtable has already solved the latency and database scaling problems, but features like Authorized Views and Data Boost help to solve data and resource governance issues.

During the preview, Data Boost is offered at no cost.

Boosting Bigtable performance for next-gen workloads

At Next, we also announced several Bigtable price-performance improvements. Bigtable now offers a new aggregate data type optimized for increment operations, which delivers significantly higher throughput and can be used to implement distributed counters and simplify Lambda architectures. You can also choose large nodes that offer more performance stability at higher server utilization rates, to better support spiky workloads. This is the first of workload-optimized node shapes that Bigtable will offer. All of these changes come on the heels of an increase in point-read throughput from 10K to 14K reads per second per node just a few months ago. Overall, these improvements mean lower TCO for a database already known for its price-performance.

These improvements could help power your modern analytics and machine learning (ML) workloads: ML is going real-time, and models are getting larger, with more and more variables that require flexible schemas and wide data structures. Analytics workloads are also moving towards wide-table designs with the so-called one big table (OBT) data model. Whether you need its flexible data model for very wide, gradually evolving tables; its scalable counters’ ability to provide real-time metrics at scale, or features like Data Boost and Request Priorities that allow seamless backfills and frequent model training (thereby combining real-time serving and batch processing into a single database), Bigtable simplifies the ML stack and reduces concept and data drift, uplifting ML model performance.

With 20 years of learnings from running one of the world’s largest cloud databases, Bigtable is ready to tackle even more demanding workloads. If you’re at Google Cloud Next, stop by the following sessions to learn how Ford uses Bigtable for its vehicle telemetry platform, how Snap uses it their latency-sensitive workloads, how Shopify uses Bigtable to power its recommendation system, and about Palo Alto Networks’ journey from Apache Cassandra to Bigtable.

Further resources

Unlock your data with Authorized Views

Make every click count with distributed counters

Discover new insights in your operational data with Data Boost’s isolated, serverless analytical processing

Read More for the details.

2024 04 10

GCP – New Google Cloud Consulting programs designed to accelerate your cloud journey

Cloud, Google Cloud gcp

At Google Cloud Consulting, we are committed to helping our customers learn, build, operate, and succeed, wherever they are in their cloud journey. We are energized by the transformative potential of generative AI, and while it is getting a lot of attention, the cloud infrastructure that powers it remains essential.

Through thousands of customer engagements across the globe, we’ve found that knowledgeable staff, a strong technology foundation, and trusted guidance leads to innovation. I’m thrilled to announce new Google Cloud Consulting programs that will help meet our customers’ needs today as they bring their new ideas to life.

Learning fuels results

Our mission at Google Cloud Consulting is to make it easy for anyone, anywhere to learn the skills they need to succeed with Google Cloud technology. Learning and enablement are key parts of Google’s core mission to organize the world’s information and make it universally accessible and useful. We’re excited to help organizations accelerate their learning journey and use gen AI to transform business functions. Highly trained organizations result in more successful cloud migrations and more confident workforces – so we’re betting big on learning as a business driver for our customers this year.

Gen AI requires skills most organizations are just now learning and many don’t have the in-house expertise to teach their workforce. Some 87% of the C-suite say they’re struggling to find talent with AI skills, and 77% say AI is disrupting their business strategy.

At Google Cloud, we want to change that dynamic. To show our commitment to expanding customers’ skills and knowledge, Google Cloud is offering no cost, on-demand training for our top customers, including new AI training for every member of your team. Customers will receive no-cost access to generative AI training led by a virtual instructor and our full catalog of on-demand training on Google Cloud Skills Boost, which includes freshly launched new gen AI skill badges. These three new badges showcase an individual’s applied and assessed abilities with Google Cloud concepts, products, and services:

Prompt Design in Vertex AI

Develop Gen AI Apps with Gemini and Streamlit

Inspect Rich Documents with Gemini Multimodality and Multimodal RAG

To learn more, please connect with your account team.

Simplified solutions for real world impact

At Google Cloud Consulting, we are committed to solving our customers’ complex problems with business-focused solutions. To help you use gen AI to deliver results and overcome common roadblocks, we’ve teamed up with Google product experts to curate end-to-end solutions, both technology and services focused. You can leverage complete solutions to known challenges across developer productivity, customer service modernization, marketing, digital commerce, website modernization, and back office business applications — all expertly designed and delivered in collaboration with our partners to fit your specific industry needs.

We’re also making it easier to engage with Google Cloud Consulting. We will be rolling out an update to our credits program that provides broader access to the full GCC Services catalog without impacting any of your existing incentives. This simplified experience will increase flexibility for our customers and provide greater visibility into real-time spending insights with guidance directly from Google Cloud experts to help them get the most out of their services strategy.

Partner to make it real

Leveraging cloud technology and effectively integrating gen AI to transform your business takes teamwork and demands a robust foundation. That’s why we’re investing heavily in core infrastructure, ironclad security, and upskilling our customer-facing teams, especially within our robust partner ecosystem. At this year’s Next, we’re excited to announce that Delivery Navigator is now GA.

We initially unveiled Delivery Navigator at Google Cloud Next ’23 as a way for us to jointly create consistent, repeatable, agile, and high-quality experiences for our customers. And now we are ready and excited to bring the rest of our partner ecosystem this library of transformation methods, which are based on our experience running thousands of cloud projects.

We know that success in the cloud comes from close collaboration between our customers, Google Cloud Consulting teams, and our partners. Delivery Navigator is just one of the ways we’re helping our partners be successful collaborators, and our customers will benefit from these changes with more seamless and excellent delivery. By combining the expertise of Google Cloud and our partners, we provide customers with the best of both worlds: access to Google’s cutting-edge technology and the specialized knowledge and experience of our partners.

Integration of gen AI into daily operations isn’t just theory for us at Google Cloud Consulting, it’s practice. Our internal adoption of Gemini and other gen AI tools have transformed our own operations. We are excited to help our customers do the same, ensuring you have the right mix of business and technical strategies to make your cloud transformation a reality.

Ready to learn more? Discover how Google Cloud Consulting can help you learn, build, operate and succeed.

1. EdX Enterprise, ‘Navigating the Workplace in the Age of AI, 2023 Whitepaper

Read More for the details.

2024 04 10

GCP – Announcing Cloud Service Mesh – the evolution of service mesh for Google Cloud

Cloud, Google Cloud gcp

Network operators are increasingly adopting service mesh architectures, which provide managed, observable, and secure communication between microservices, allowing them to be composed into robust enterprise applications. And as service mesh deployments scale, organizations are asking for fully managed solutions that cover a range of infrastructure and integrate with the rest of the network services, such as global load balancing, centralized health checking, managed rate limiting, traffic driven autoscaling.

Today, we are excited to announce Cloud Service Mesh, a fully managed service mesh across all Google Cloud platform types. Cloud Service Mesh takes the Traffic Director’s control plane and Google’s open-source Istio-based service mesh, Anthos Service Mesh, and combines them into a single offering that provides the best of both worlds.

With Cloud Service Mesh, customers get:

Traffic Director control plane with global scale

Anthos Service Mesh compatibility

Istio APIs (the most widely used open-source APIs for mesh in Kubernetes clusters)

Managed data plane for automatic upgrades of Envoy sidecars

GKE Fleet integration

Hosted certificate authorities for workload identity

Service Operations dashboard for service metrics

GCP APIs from Traffic Director for

Proxyless gRPC support

VMs

Cloud Run

Gateway API for Service Mesh or GAMMA API

Cloud Service Mesh will be generally available this quarter.

Cloud Service Mesh benefits

A service mesh manages all the common requirements of running a service: traffic management, observability, and security. This allows application developers and operators to focus on their business, creating and managing great applications for their users without having to invest in managing mesh infrastructure. Let’s take a look at the features that Cloud Service Mesh provides.

Traffic management

Cloud Service Mesh controls the flow of traffic among services in the mesh, into the mesh (ingress), and to outside services (egress), allowing you to configure and deploy resources at the application layer to manage this traffic. With Cloud Service Mesh, you can:

Use Google’s global load balancing software to offer automatic capacity and proximity aware global load balancing

Finely control routing for your services

Configure load balancing among services

Create canary and blue-green deployments

Set up retries and circuit breakers

Controlling how your services communicate, both in normal and failure scenarios, allows you to build much more reliable applications.

Observability insights

Cloud Service Mesh supports telemetry, logging and tracing. With this data you can track how your service is operating and find the issues when something goes wrong.

The graphical user interface in the Google Cloud console provides insights into your service mesh through this telemetry. These metrics are automatically generated for workloads configured through the Istio APIs.

Service metrics and logs for HTTP traffic within your mesh’s GKE cluster are automatically ingested to Google Cloud.

Preconfigured service dashboards give you the information you need to understand your services.

In-depth telemetry — powered by Cloud Monitoring, Cloud Logging, and Cloud Trace — lets you dig deep into your service metrics and logs. You can filter and segment your data on a wide variety of attributes.

Service-to-service relationships are visible at a glance, helping you understand the communication and dependencies between services.

You can quickly see the communication security posture not only of your service, but its relationships to other services.

Service level objectives (SLOs) give you insight into the health of your services. You can easily define an SLO and alert on your own standards of service health.

Security benefits

Service security is the third major component of a service mesh. Each service has its own identity, which is used by mutual TLS (mTLS) to provide strong service-to-service authentication and encryption. Cloud Service Mesh performs the following:

Mitigates risk of replay or impersonation attacks that use stolen credentials. Cloud Service Mesh relies on mTLS certificates to authenticate peers.

Ensures encryption in transit. Using mTLS for authentication also ensures that all TCP communications are encrypted in transit.

Ensures that only authorized clients can access a service with sensitive data, irrespective of the network location of the client and the application-level credentials.

Mitigates the risk of user data breach within your production network. You can ensure that insiders can only access sensitive data through authorized clients.

Identifies which clients accessed a service with sensitive data. Cloud Service Mesh access logging captures the mTLS identity of the client in addition to the IP address.

Supporting existing service mesh customers

In the short term, we will rebrand Anthos Service Mesh and Traffic Director documentation and SKUs to Cloud Service Mesh, but beyond that, current Anthos Service Mesh and Traffic Director users will see no immediate change to their environment. Your current APIs will continue to work as is, and your mesh will continue to function. Over the coming year, we will work with Anthos Service Mesh and Traffic Director customers to ensure they can leverage all the new capabilities while converging on a common control plane with a choice of APIs.

Get started by using the existing Managed ASM with Istio APis on GKE, or Traffic Director with GCP APIs today. These will automatically be moved to CSM at launch later this quarter.

Read More for the details.

2024 04 10

GCP – Using Gemini Code Assist to build APIs, integrations, and automation flows

Cloud, Google Cloud gcp

APIs and integrations between applications form the digital nervous system of modern architectures. APIs are the pathways that connect your intelligence (AI models and apps) to data sources spread across different environments and systems, and integrations drive intelligent actions across your SaaS workflows. However, building these APIs and integrations often requires specialized expertise, requiring a lot of time and resources to ensure consistent standards and quality.

We’re excited to announce that you will be able to use Gemini Code Assist in Google Cloud’s Apigee API Management and Application Integration (in public preview). Gemini code assist simplifies the process of building enterprise grade APIs and integrations using natural language prompts that don’t require special expertise.

Gemini Code Assist: AI-powered assistance, tailored for your enterprise

While off-the-shelf AI assistants can help with building APIs and integrations, the process is still time-intensive because every enterprise is unique, each with their own requirements, schemas, and data sources. Unless the AI assistant understands this context, users still need to manually address these items.

Gemini Code Assist understands enterprise context such as security schemas, API patterns, integrations, etc., and uses it to provide tailored recommendations for your use case. Furthermore, using Gemini Code Assist lets you iterate on your existing API or integration in development, instead of prompting from scratch. Lastly, Gemini’s proactive suggestions inspire new ideas.

Using Gemini Code Assist to build APIs

Apigee is Google Cloud’s turnkey API Management solution for building, managing, and securing APIs – for any use case or environment (cloud or on-premises). You can access Apigee through the Google Cloud console or in commonly used IDEs like VS Code via the Cloud Clode plug-in.

And now, you can use Gemini Code Assist to create consistent quality APIs in Apigee without any specialized expertise. If the existing API specifications in API Hub do not meet your requirements, you can use Gemini to create a new one by simply describing what you need in natural language. Gemini Code Assist considers artifacts such as your security schemas or API objects in API Hub, and uses it to create a specification tailored to your enterprise. This saves a lot of time in development and review cycles.

Using Gemini Code Assist to generate tailored and consistent API specifications

In Apigee, you can simulate real-world API behavior and publish the specification to API Hub, for testing and driving multiple development streams in parallel.

Gemini Code Assist provides guidance and explanations during API proxy creation

Furthermore, Gemini offers step-by-step guidance for adding new policy configurations while creating an API proxy. Lastly, Gemini also provides explanations for your existing configurations, reducing the learning curve during updates and maintenance.

Using Gemini Code Assist to build integrations and automations

Application Integration is Google Cloud’s Integration Platform as a Service (iPaaS) that automates business processes by connecting any application — both home-grown and third-party SaaS — with point-and-click simplicity. Its intuitive interface lets you build complex flows, map data, and streamline operations with pre-built tasks and triggers.

And now, using Gemini Code Assist, anyone in your team can create end-to-end automation flows in Application Integration by just describing their requirements. For example, you can use Gemini to automate the task of updating a case in your CRM (like Salesforce), when a new issue is created in your bug tracking system (like JIRA). You can either issue a prompt to Gemini or use one-click suggestions provided in the interface. Based on the prompt and existing enterprise context such as APIs or applications, Gemini suggests multiple flows tailored for your use case.

Using Gemini Code Assist to create integration flows and automate SaaS processes

In accordance with your enterprise context, Gemini automatically creates variables and pre-configures tasks, making the integration ready for immediate use. Gemini doesn’t just respond to prompts — it intelligently analyzes your flow and proactively suggests optimizations, such as replacing connectors or fine-tuning REST endpoint calls. Gemini also helps you to extend existing flows in a single click, significantly reducing your maintenance efforts.

Extending integration flows with proactive suggestions from Gemini Code Assist

The visual nature of the Application Integration interface makes the flow self-explanatory, making it easier for new users to ramp up. Gemini even automatically generates intelligent descriptions based on existing configurations, helping to get the integration into adoption faster.

Get started

APIs and integrations are essential building blocks that provide differentiated experiences with AI models and applications. Using Gemini Code Assist, you can significantly reduce the toil of building these APIs and integrations, while adhering to your enterprise quality standards. You can use Gemini Code Assist (public preview) to simplify API and integration building within Apigee or Application Integration in the coming weeks.

Read More for the details.

2024 04 10

GCP – Introducing Shadow API detection for your Google Cloud environments

Cloud, Google Cloud gcp

Enterprises operate a large and growing number of APIs — more than 200 on average — each a potential front door to sensitive data. Even more challenging can be figuring out which of these APIs are not actively managed “shadow APIs”. Born from well-intended development initiatives and legacy systems, shadow APIs operate without proper oversight or governance, and could be the source of damaging security incidents.

Today at Google Cloud Next, we are excited to announce shadow API detection in preview in Advanced API Security, part of our Apigee API Management solution.

Securing your APIs with Apigee API Management

Apigee is Google Cloud’s turnkey API management solution that can help you build, manage, and secure APIs in the cloud and on-premises. Apigee helps ensure the reliability of your API transactions with fine-grained controls and more than 50 built-in security policies, including authentication and authorization.

Advanced API Security works proactively to identify misconfigured APIs, detect malicious bot and business logic attacks, and helps organizations take swift action to mitigate threats. Previously, this protection was only available for actively-managed APIs. Now, with the ability to discover shadow APIs in Advanced API Security, you can eliminate hard-to-find blind spots and close security gaps.

Detecting shadow APIs in Advanced API Security

Advanced API Security now integrates with Google Cloud regional external Application Load Balancers to discover and identify API traffic in a specific region, to help support regulatory and performance requirements.

In the following example, we show how this works in our Belgium region (europe-west-1).

Select your Google Cloud external Application Load Balancer’s region to discover the associated APIs.

Through examination of requests and responses flowing through your load balancers, Advanced API Security extracts the APIs and their relevant details such as API endpoints, platform, protocol, parameter names, and responses. You can access critical details on where the API is operating, the kind of operations that are running, and the latest activity on these APIs via an intuitive interface.

Advanced API Security catalogs and organizes all the APIs linked to the selected load balancer

Shadow API detection also looks at historical data to uncover new API calls, and can provide always-on awareness and detection of emerging shadow APIs. You can tag individual endpoints that need further attention to ensure comprehensive protection across your API surface.

Detailed information on shadow API endpoints associated with your load balancer

Upon detecting shadow APIs, you can collaborate with the API owners to establish management in accordance with company-wide security and API management standards. You can also implement missing security measures to help reduce the risk of compromise.

Get started tracking down shadow APIs

By detecting shadow APIs, Advanced API Security can help you strengthen your security posture and adopt a more proactive approach to finding vulnerabilities lurking in your application infrastructure. Sign up today to gain access to Advanced API security with shadow API detection.

Read More for the details.

2024 04 10

GCP – Performance deep dive of Gemma on Google Cloud

Cloud, Google Cloud gcp

Earlier this year we announced Gemma, an open weights model family built to enable developers to rapidly experiment with, adapt, and productionize on Google Cloud. Gemma models can run on your laptop, workstation, or on Google Cloud through either Vertex AI or Google Kubernetes Engine (GKE) using your choice of Cloud GPUs or Cloud TPUs. This includes training, fine-tuning, and inference using PyTorch and JAX, leveraging vLLM, HuggingFace TGI, and TensorRT LLM on Cloud GPUs as well as JetStream and Hugging Face TGI (Optimum-TPU) on Cloud TPUs.

Our benchmarks indicate up to 3X training efficiency (better performance per dollar) for Gemma models using Cloud TPU v5e when compared to our baseline of Llama-2 training performance. Earlier this week, we released JetStream, a new cost-efficient and high-performance inference engine. We analyzed Gemma inference performance on Cloud TPU and found 3X inference efficiency gain (more inferences per dollar) for LLM inference when serving Gemma on JetStream compared to the prior TPU inference stack that we used as the baseline.

In this post, we review the training and inference performance of Gemma models on Google Cloud accelerators. The results we present are snapshots in time as of April 2024. We anticipate that the infrastructure efficiency and quality of these models will continue to evolve and improve through the contributions of the open-source community, our enterprise users, and the teams at Google.

Background: Gemma model architecture details

The Gemma family of models include two variants, Gemma 2B and Gemma 7B (dense decoder architecture). We pre-trained Gemma with 2 trillion and 6 trillion tokens for the 2B and 7B models, respectively, with the context length of 8,192 tokens. Both models use a head dimension of 256, and both variants utilize Rotary Positional Embeddings (RoPE).

Model

d_model

q_heads

kv_heads

d_ff

n_layers

Gemma 2B

2,048

16,384

Gemma 7B

3,072

24,576

While the Gemma 7B model leverages a multihead attention mechanism, Gemma 2B utilizes multi-query attention. This approach aids in reducing memory bandwidth requirements during the inference process, which can potentially be advantageous for Gemma 2B on-device inference scenarios, where memory bandwidth is often limited.

Gemma training performance

To assess the training infrastructure for a given model or a category of similarly sized models, there are two important dimensions: 1) effective model flops utilization; and 2) relative performance per dollar.

Effective model flops utilization

Model FLOPs Utilization (MFU) is the ratio of the model throughput, i.e., the actual floating-point operations per second performed by the model relative to the peak throughput of the underlying training infrastructure. We use the analytical estimate for the number of floating-point operations per training step and the step-time to compute the model throughput (ref. PaLM). When applied to mixed-precision training settings (Int8), the resultant metric is called Effective Model FLOPs Utilization (EMFU). All else being equal, a higher (E)MFU indicates improved performance per unit cost. Improvements in MFU directly translate to cost savings for training.

Gemma training setup

Pre-training for Gemma models was done internally at Google using TPU v5e. It employed two v5e-256 for Gemma 2B and 16 Cloud TPU v5e-256 for Gemma 7B.

We measured the (E)MFU for Gemma models on Cloud TPU. We present the performance on both Cloud TPU v5e and Cloud TPU v5p since both are the latest Cloud TPU generations (at the time of writing this post). Cloud TPU v5e is the most cost-efficient TPU to date on a performance per dollar basis. By contrast, Cloud TPU v5p, is the most powerful and scalable TPU available for more complex LLM architectures, such as mixture of experts and alternative workloads such as large ranking and recommendation systems.

The following graph presents the EMFU for the Gemma 2B and Gemma 7B training run with bf16 precision and mixed precision (int8) training (using AQT).

Gemma-2b & 7b Effective Model Flops Utilization. Measured using MaxText on TPU v5e-256 and v5p-128. context length 8192. As of Feb, 2024.

These results were derived using the MaxText reference implementation. We also provide an implementation for training and fine-tuning Gemma models using Hugging Face Transformers.

Achieving high-performance training with MaxText

We recognize that comparing training infrastructure performance across model types is a difficult problem, due to differences in model architectures, training parameters such as context length and the difference in the scale of underlying cluster. We selected Llama 2 published results (total number of tokens and GPU hours) as a baseline for comparison with Gemma 7B training for the following reason:

Similarity of model architecture with respect to Gemma 7B

Gemma 7B was trained with 2X context length, and therefore the comparison favors Llama 2 baseline

Gemma-7b and baseline relative training performance per dollar. Measured using Gemma 7B (MaxText) on TPU v5e-256 and v5p-128. context length 8192. Baseline (LLama2-7b) performance is derived using total GPU hours and total number of training tokens as per the published results. Performance/$ is derived using the list price of respective accelerators. As of Feb, 2024.

We derived performance per dollar using (peak-flops*EMFU)/ (list price of VM instance). Using the MaxText reference implementation, we observed up to 3X better performance per dollar for the Gemma 7B model with respect to the baseline training performance (Llama2 7B). Please note that the performance or performance-per-dollar differences presented here are functions of the model architecture, hyperparameters, the underlying accelerator and training software; better performance results cannot be solely attributed to any of these factors alone.

Gemma inference performance

LLM inference is often memory-bound, while training can benefit from massive parallelism. Inference comprises two phases, each with different computational characteristics: prefill and decode. The prefill phase can operate in the compute-bound regime (if num tokens > peak flops / HBM bandwidth), while the decode phase is auto-regressive and tends to be memory-bound unless batched efficiently. Since we are processing one token at a time in the decode phase, the batch size to escape the memory-bound region tends to be higher. Therefore, simply increasing overall batch size (for both prefill and decode) may not be optimal. Because of throughput and latency, along with prefix- and decode-length interplay, we treat input (prefill) and output (decode) numbers separately, and focus on output tokens below.

Next, to describe our observations, we use throughput-per-dollar as a metric as it represents the number of output tokens per second that a model server can generate across all requests from users. This is the Y-axis in the graphs measured in million output tokens. This number is further divided by compute engine CUD pricing for a specific region for Cloud TPU v5e.

Improved cost-efficiency for TPU inference with JetStream on Cloud TPU v5e

Measuring inference performance is challenging because throughput, cost, and latency can be impacted by a number of factors, such as the size of the model, accelerator type, kind of model architecture, precision format used, etc. We therefore used cost efficiency (cost per million tokens) as the metric to measure performance of JetStream as compared to the baseline TPU inference stack. We observed up to 3X gain in cost-efficiency, as depicted in the chart below (lower is better), with the optimized JetStream stack for TPU inference as compared to the baseline TPU inference stack.

JetStream cost per 1M token as compared to baseline TPU inference stack. Google internal data. Measured using Gemma 7B (MaxText) on TPU v5e-8. Input length 1024, output length 1024 for a specific request rate and batch size. Continuous batching, int8 quantization for weights, activations, KV cache. As of April, 2024.

Serving at scale, high-throughput per dollar for Gemma 7B with JetStream TPU Inference

We also wanted to observe the performance of serving Gemma 7B at scale using the JetStream stack and compare it with the baseline TPU inference stack. As part of this experiment, we varied the request rate sent to these TPU inference stacks from 1 to 256 requests per second, then measured the throughput per dollar for serving Gemma 7B with variable-length input and output tokens. We observe a consistent behavior that throughput-per-dollar for serving Gemma 7B on JetStream is higher than the baseline, even for higher request rates.

JetStream throughput per dollar (million-tokens per dollar) as compared to baseline TPU inference stack. Google internal data. Measured using Gemma 7B (MaxText) on TPU v5e-8. Input length 1024, output length 1024 for varying request rate from 1 to 256. Continuous batching, int8 quantization for weights, activations, KV cache. As of April, 2024.

Measuring throughput per dollar and cost per million tokens

We orchestrated the experiments using the JetStream container on Google Kubernetes Engine (GKE). The input dataset contains variable-length inputs and outputs and therefore mimics the real-world language model input traffic. To generate the graph, we deployed the Gemma models with JetStream and gradually increased the requests per second to the model endpoint. Increasing the request rate initially translates to higher batch size, higher throughput, and also increased per token latency. But once a critical batch size is reached, further requests are queued, giving rise to the plateau in throughput in terms of number of output tokens generated.

We recognize that the benchmark presented above is sensitive to prompt-length distribution, sampling, and batching optimizations, and can be further improved using variations of high-performance attention kernels and other adaptations. If you want to try out the benchmarks, AI on GKE benchmarking framework enables you to run automated benchmarks on GKE for AI workloads.

High-performance LLM inference on Google Cloud

For large-scale, cost-efficient serving for LLMs, Google Cloud offers a wide range of options that users can adopt based on their orchestration, framework, serving layer, and accelerator preferences. These options include GKE as an orchestration layer, which supports both Cloud TPUs and GPUs for large model inference. Furthermore, each of the accelerators offer a range of serving-layer options, including JetStream (JAX, PyTorch, MaxText), Hugging Face TGI, TensorRT-LLM, and vLLM.

Accelerator

Framework

Orchestration

AI-optimized Inference Stacks on Google Cloud

Cloud GPUs

PyTorch

GKE

vLLM, Hugging Face TGI,

Triton + TensorRT-LLM

Cloud TPUs

PyTorch, JAX, MaxText

GKE

JetStream, Hugging Face TGI

Summary

Regardless of whether you prefer JAX or PyTorch as your framework, self-managed flexibility with GKE orchestration or a fully-managed unified AI platform (Vertex AI), Google Cloud provides AI-optimized infrastructure to simplify running Gemma at scale and in production, using either Cloud GPUs or TPUs. Google Cloud offers a comprehensive set of high-performance and cost-efficient training, fine-tuning and serving options for Gemma models — or any other open-source or custom large language models.

Based on training performance for Gemma models using Cloud TPU v5e and v5p, we observed that using the Gemma reference implementation for training with MaxText delivers up to 3X more performance per dollar compared to the baseline. We also observed that using JetStream for Inference on Cloud TPU delivers up to 3X better inference efficiency gains as compared to the baseline. Whether you are interested in running your inference on Cloud GPUs or TPUs, there are highly optimized serving implementations for Gemma models.

To get started, please visit the Gemma documentation for an overview of Gemma models, the model access, on device variants of Gemma and all the resources. You can also read the Gemma technical report to learn more about its models, architecture, evaluation and safety benchmarks. Finally, visit the Gemma on GKE documentation for easy-to-follow recipes to start experimenting. We can’t wait to see what you build with Gemma.

Read More for the details.

2024 04 10

GCP – Introducing ML Productivity Goodput: a metric to measure AI system efficiency

Cloud, Google Cloud gcp

We live in one of the most exciting eras of computing. Large-scale generative models have expanded from the realms of research exploration to the fundamental ways we interact with technology, touching education, creativity, software design, and much more. The performance and capabilities of these foundation models continues to improve with the availability of ever-larger computation, typically measured in the number of floating-point operations required to train a model.

The exponential growth of computation scale for the notable models. Source: Our world in data

This rapid rise in compute scale is made feasible by larger and more efficient compute clusters. However, as the scale of a compute cluster (measured in the numbers of nodes or number of accelerators) increases, mean time between failures (MTBF) of the overall system reduces linearly, leading to a linear increase in the failure rate. Furthermore, the cost of infrastructure also increases linearly; therefore, the overall cost of failure rises quadratically with the scale of the compute cluster.

For large-scale training, the true efficiency of the overall ML system is core to its viability — if left unattended, it can make attaining a certain scale infeasible. But if engineered correctly, it can help you unlock new possibilities at a larger scale. In this blog post, we introduce a new metric, ML Productivity Goodput, to measure this efficiency. We also present an API that you can integrate into your projects to measure and monitor Goodput, and methods to maximize ML Productivity Goodput.

Introducing ML Productivity Goodput

ML Productivity Goodput is actually composed of three Goodput metrics: Scheduling Goodput, Runtime Goodput, and Program Goodput.

Scheduling Goodput measures the fraction of time that all the resources required to run the training job are available. In on-demand or preemptible consumption models, this factor is less than 100% because of potential stockouts. As such, we recommend you reserve your resources to optimize your Scheduling Goodput score.

Runtime Goodput measures the time spent to make forward progress as a fraction of time when all training resources are available. Maximizing runtime requires careful engineering considerations. In the next section we describe how you can measure and maximize runtime for your large-scale training jobs on Google Cloud.

Program Goodput measures the fraction of peak hardware performance that the training job can extract. Program Goodput is also referred to as Model Flop Utilization or effective model flop utilization, i.e., the model training throughput as a fraction of peak throughput of the system. Program Goodput depends on factors such as efficient compute communication overlaps and careful distribution strategies to scale efficiently to the desired number of accelerators.

Google’s AI Hypercomputer

AI Hypercomputer is a supercomputing architecture that incorporates a carefully selected set of functions built through systems-level codesign to boost ML productivity across AI training, tuning, and serving applications. The following diagram illustrates how different elements of ML Productivity Goodput are encoded into AI Hypercomputer:

As indicated in the diagram above, AI Hypercomputer encodes specific capabilities aimed towards optimizing the Program and Runtime Goodput across the framework, runtime, and orchestration layers. For the remainder of this post we will focus on elements of AI Hypercomputer that can help you maximize it.

Understanding Runtime Goodput

The essence of Runtime Goodput is the number of useful training steps completed over a given window of time. Based on an assumed checkpointing interval, the time to reschedule the slice and the time to resume the training, we can estimate Runtime Goodput as follows:

This analytical model also provides us with three precise factors that we need to minimize in order to maximize the Runtime Goodput: 1) time since the last checkpoint when the failure occurs (tch); 2) time to resume the training (trm). Time to reschedule (tre) the slice is also a key factor, however it’s accounted for under Scheduling Goodput.

Introducing Goodput Measurement API

The first step to improving something is to measure it. The Goodput Measurement API allows you to instrument (Scheduling Goodput * Runtime Goodput) measurement into your code using a Python package. The Goodput Measurement API provides methods to report your training step progress to Cloud Logging and then read the progress from Cloud Logging to measure and monitor Runtime Goodput.

Maximizing Scheduling Goodput

Scheduling goodput is contingent on the availability of ALL the required resources for the training execution. To maximize Goodput for short-term usage, we introduced DWS calendar mode that reserves compute resources for the training job. Furthermore, in order to minimize tre time to schedule resources when resuming from interruption, we recommend using “hot spares.” With the reserved resources and hot spares, we can maximize Scheduling Goodput.

Maximizing Runtime Goodput

AI Hypercomputer offers the following (recommended) methods to maximize the Runtime Goodput:

Enable auto-checkpointing

Use container pre-loading (available in Google Kubernetes Engine)

Use a persistent compilation cache

Auto-checkpointing

Auto-checkpointing lets you trigger checkpointing based on a SIGTERM signal that indicates the imminent interruption of the training job. Auto-checkpointing is useful in case of defragmentation-related preemption or maintenance events, helping to reduce loss since the last checkpoint.

An example implementation for auto-checkpointing is available in orbax as well as in Maxtext, a reference implementation for high-performance training and serving on Google Cloud.

Auto-checkpointing is available for both GKE and non-GKE-based training orchestrators, and is available for training on both Cloud TPUs and GPUs.

Container pre-loading

To achieve a maximum Goodput score, it’s important to rapidly resume training after a failure or any other interruption. To that end, we recommend Google Kubernetes Engine (GKE), which supports container and model preloading from a secondary boot disk. Currently available in preview, GKE’s container and model preloading allows a workload, especially a large container image, to start up very quickly. This means training can recover from failure or other interruptions with minimal time loss. That’s important because when resuming a job, pulling a container image from object storage can be significant for large images. Pre-loading lets you specify a secondary boot disk that contains the required container image when creating the nodepool or even for auto-provisioning. The required container images are available as soon as GKE brings up the failed node, so you can resume training promptly.

With container preloading, we measured the image pull operation for a 16GB container to be about 29X faster than the baseline (image pull from container registry).

Persistent compilation cache

Just-in-time compilation and system-aware optimizations are one of the key enablers for an XLA compiler-based computation stack. In most performant training loops, computation graphs are compiled once and executed many times with different input data. A compilation cache prevents recompilation if the graph shapes stay the same. In the event of a failure or interruption, this cache may be lost, thereby slowing down the training resumption process, adversely affecting the Runtime Goodput. A persistent compilation cache helps solve this problem by allowing users to save compilation cache to Cloud Storage such that the cache persists across restart events.

Furthermore, GKE, the recommended orchestration layer for AI Hypercomputer, has also made recent advancements to improve the job-scheduling throughput by 3X, helping reduce time to resume (trm).

Maximizing Program Goodput

Program Goodput or Model Flop Utilization depends on the efficient utilization of the underlying compute as the training program makes forward progress. Distribution strategy, efficient compute communication overlap, optimized memory access and designing efficient pipelines contribute to Program Goodput. XLA compiler is one of the core components of AI Hypercomputer designed to help you maximize the Program Goodput by out-of-the box optimizations and simple and performant scaling APIs such as GSPMD, which allows users to easily express a wide range of parallelisms to efficiently leverage scale. We recently introduced three key features to help Jax and PyTorch/XLA users maximize Program Goodput.

Custom Kernel with XLA

In compiler-driven computation optimization, often we need an “escape hatch,” which allows users to write more efficient implementations using fundamental primitives for complex computation blocks, pushing past the default performance. Jax/Pallas is the library built to support custom kernels for Cloud TPUs and GPUs. It supports both Jax and PyTorch/XLA. Some examples of custom kernels written using Pallas include Flash Attention or Block Sparse Kernels. The Flash attention kernel helps to improve Program Goodput or Model Flop Utilization for larger sequence lengths (more pronounced for sequence lengths 4K or above).

Host offload

For large-scale model training, accelerator memory is a limited resource and we often make trade-offs such as activation re-materialization to trade off compute cycles for accelerator memory resources. Host offload is another technique we recently introduced in the XLA compiler to leverage host DRAM to offload activations computed during the forward pass and reuse them during the backward pass for gradient computation; this saves activation recomputation cycles and therefore improves Program Goodput.

Int8 Mixed Precision Training using AQT

Accurated Quantized Training is another technique that maps a subset of matrix multiplications in the training step to int8 to boost training efficiency and therefore Program Goodput without compromising convergence.

The following benchmark shows aforementioned techniques used in conjunction to boost program goodput for a 128b dense LLM implementation using MaxText.

EMFU measured using MaxText 128b, context length 2048, trained with synthetic data, using Cloud TPU v5e-256. Measured as of April, 2024.

In this benchmark, the combination of these three techniques boosts the Program Goodput cumulatively up to 46%. Program Goodput improvement is often an iterative process. Actual improvements for a specific training job depend on training hyperparameters and the model architecture.

Conclusion

Large-scale training for generative models is an enabler of business value, but productivity for ML training becomes harder as it scales. In this post, we defined ML Productivity Goodput, a metric to measure overall ML productivity for large-scale training jobs. We introduced the Goodput measurement API, and we learned about the components of AI Hypercomputer that can help you maximize ML Productivity Goodput at scale. We look forward to helping you maximize your ML productivity at scale with AI Hypercomputer.

Read More for the details.

2024 04 10

GCP – Natural language support in AlloyDB for building gen AI apps with real-time data

Cloud, Google Cloud gcp

Generative AI offers the opportunity to build more interactive, personalized, and complete experiences for your customers, even if you don’t have specialized AI/ML expertise. Since the introduction of foundation models to the broader market, the conversation has turned swiftly from the art of the possible to the art of the viable. How can developers build real, enterprise-grade experiences that are accurate, relevant, and secure?

Operational data bridges the gap between pre-trained foundation models and real enterprise applications. A pre-trained model can name the capital of France, but it doesn’t know which items you have in stock. That means that the generative AI experiences built using these models are only as good as the data and context application that developers used to ground them. And because operational databases have the most up-to-date data, they play a crucial role in building quality experiences.

This week at Google Cloud Next ‘24, we announced natural language support in AlloyDB for PostgreSQL to help developers integrate real-time operational data into generative AI applications. Now you can build applications that accurately query your data with natural language for maximum flexibility and expressiveness. This means generative AI apps can respond to a much broader and more unpredictable set of questions.

We also announced a new generative AI-friendly security model to accommodate these new access patterns. AlloyDB introduced parameterized secure views, a new kind of database view that locks down access to end-user data at the database level to help you protect against prompt injection attacks. Together, these advances in AlloyDB present a new paradigm for integrating real-time data into generative AI apps — one that’s flexible enough to answer the full gamut of end-users’ questions while maintaining accuracy and security.

AlloyDB AI’s natural language support, with features like end-user access controls and NL2SQL primitives, are now available in technology preview through the downloadable AlloyDB Omni edition.

The need for improved flexibility

Because data is so critical to ensuring accurate, relevant responses, multiple approaches have emerged for integrating data into gen AI applications.

Last year, we announced vector database capabilities in AlloyDB and Cloud SQL to enable the most common pattern for generative AI apps to retrieve data: retrieval augmented generation (RAG). RAG leverages semantic search on unstructured data to retrieve relevant information, and is particularly powerful for searching through a knowledge base or through unstructured user inputs like chat history to retrieve the context needed by the foundation model.

AlloyDB’s end-to-end vector support made this super easy. With a few lines of code, you could turn your operational database into a vector database with automatic vector embeddings generation and easy vector queries using SQL. And now, with ScaNN in AlloyDB, we’re bringing Google’s state-of-the-art ScaNN algorithm — the same one that powers some of Google’s most popular services— into AlloyDB to enhance vector performance.

Another popular approach for retrieving real-time data is to package simple structured queries in custom LLM extensions and plugins that package SQL. Here, LLM orchestration frameworks connect to APIs that can appropriately retrieve the needed information. These APIs can execute any database query, including structured SQL queries as well as vector search, on behalf of the LLM. The benefit of this approach is that it is predictable and easy to secure. AlloyDB’s LangChain integration makes it easy for you to build and integrate these database-backed APIs for common questions and access patterns.

However, some use cases need to support more freeform conversations with users, where it’s difficult to predict the questions and required APIs ahead of time. For these situations, developers are increasingly using models to generate SQL on-the-fly. This initially appears to work well; it’s quite remarkable how well foundational models generate SQL, but it poses a number of challenges.

First, it is very hard to ensure, with high confidence, that the generated SQL is not only syntactically correct, but is also semantically correct. For example, if you’re searching for a flight to New York City, it would be syntactically correct to generate an executable SQL query that filters for flights to JFK airport. But a semantically correct query would also add flights to LaGuardia airport too, because it also serves the city of New York.

Second, providing the application with the ability to execute arbitrary, generated SQL introduces security challenges, making the app more vulnerable to prompt-injection and denial-of-service attacks.

To paraphrase: custom APIs are accurate and secure, but not flexible enough to handle the full range of end-user questions. Emerging natural-language-to-SQL (NL2SQL) approaches are much more flexible, but lack security and accuracy.

Introducing natural language support to AlloyDB

AlloyDB AI provides the best of both worlds: accuracy, security, and flexibility. This not only makes it easier for you to incorporate real-time operational data into gen AI experiences, but makes it viable in an enterprise setting.

Natural language support means developers get maximum flexibility in interacting with data. In its easiest-to-use form, it takes in any natural language question, including ones that are broad or imprecise, and either returns an accurate result or suggests follow-ups such as clarifying questions. You can use this new interface to create a single LLM extension for answering questions, with flexible querying across datasets to not only get the right data, but the right insights, leveraging database joins, filters, and aggregations to improve analysis.

Accuracy of responses

At the same time, AlloyDB AI comes packed with built-in features to improve the accuracy and relevance of responses in any scenario. These include:

Intent clarification: Because users are often imprecise with language, AlloyDB AI has mechanisms for interactively clarifying user intent. When AlloyDB AI encounters a question that it has difficulty translating, it responds with a follow-up question instead of making assumptions. These questions can then be passed back to the user to clarify intent by the application before final results are shared, improving the eventual accuracy of the interaction.

Context beyond schema: AlloyDB AI leverages semantic knowledge about the business and dataset to help translate the natural language input into a query that the database can execute. This knowledge includes existing in-database metadata and context as a baseline, but you can add knowledge bases, sample queries, and other sources of business context to improve results.

Let’s consider an example. Imagine trying to book a flight using a customer service chatbot. You might ask the question: “When does the next Cymbal Air flight from San Francisco to New York City depart?” This seems like a simple question, but even the best NL2SQL technology can’t translate this correctly, for two reasons.

First, the question is ambiguous; it’s not obvious whether you’re asking about the scheduled departure time or estimated departure time. If the flight is delayed, it might return the wrong answer. To help with this scenario, AlloyDB AI doesn’t just return results, it might suggest a follow up question: “Are you looking for the scheduled departure time or the estimated departure time?”

Second, the database schema itself does not contain all of the information needed to answer the question correctly. A database administrator likely has semantic knowledge that isn’t encoded in the schema, for example: that scheduled times are in the `flights` table, but estimated times are in the `flight status` table, or that the airline Cymbal Air corresponds to airline code `CY` in the `flights` table. AlloyDB AI’s context services make it easy for developers to incorporate this semantic knowledge about the dataset.

A new security model for gen AI

Most applications require fine-grained access control to protect user data. Traditionally, access control was enforced at the application level, but this was possible only when every SQL query that hit the database was composed by an application developer.

However, when developers and vendors — including AlloyDB — are synthesizing SQL on the fly and executing it on behalf of an LLM, a new security model is needed. LLMs alone cannot be trusted to enforce data access.

AlloyDB’s new parameterized secure views makes it easy to lock down access in the database itself. Unlike typical row-level security, parameterized secure views don’t require you to set up a unique database user per end-user. Instead, access to certain rows in the database view is limited based on the value of parameters, like a user id. This makes application development easier, is compatible with existing mechanisms for end-user identification and application connectivity, offers more robust security, and allows application developers to continue to take advantage of connection pooling to boost performance.

How natural language support works

In its easiest-to-use form, AlloyDB AI provides an interface that takes in any natural language question, including ones that are broad or imprecise, and returns either accurate results or follow-ups like disambiguation questions as described above. Developers can use this interface to create a single tool for answering unpredictable questions, with flexible querying across datasets to not only get the right data, but the right insights, leveraging joins, filters, and aggregations in the database to improve analysis.

For more advanced development, AlloyDB AI supports the broader spectrum of natural language interactions with structured data by making core primitives available to developers directly. These building blocks offer innovators more control and flexibility on intent clarification and context use, allowing you to stitch together the pieces as-needed to meet the specific needs of your applications.

Getting started

AlloyDB’s natural language support is coming soon to both AlloyDB in Google Cloud and AlloyDB Omni. In the interim, we’ve made a few primitives available as a technology preview in AlloyDB Omni today, including basic NL2SQL support and parameterized secure views.

To get started, follow our guide to deploy AlloyDB Omni on a Google Cloud virtual machine, on your server or laptop, or on the cloud of your choice.

Read More for the details.

2024 04 10

GCP – BigQuery is now your single, unified AI-ready data platform

Cloud, Google Cloud gcp

Eighty percent of data leaders believe that the lines between data and AI are blurring. Using large language models (LLMs) with your business data can give you a competitive advantage, but to realize this advantage, how you structure, prepare, govern, model, and scale your data matters.

Tens of thousands of organizations already choose BigQuery and its integrated AI capabilities to power their data clouds. But in a data-driven AI era, organizations need a simple way to manage all of their data workloads. Today, we’re going a step further and unifying key data Google Cloud analytics capabilities under BigQuery, which is now the single, AI-ready data analytics platform. BigQuery incorporates key capabilities from multiple Google Cloud analytics services into a single product experience that offers the simplicity and scale you need to manage structured data in BigQuery tables, unstructured data like images, audience and documents, and streaming workloads, all with the best price-performance.

BigQuery helps you:

Scale your data and AI foundation with support for all data types and open formats

Eliminate the need for upfront sizing and just simply bring your data, at any scale, with a fully managed, serverless workload management model and universal metastore

Increase flexibility and agility for data teams to collaborate by bringing multiple languages and engines (SQL, Spark, Python) to a single copy of data

Support the end-to-end data to AI lifecycle with built-in high availability, data governance, and enterprise security features

Simplify analytics with a unified product experience designed for all data users and AI-powered assistive and collaboration features

With your data in BigQuery, you can quickly and efficiently bring gen AI to your data and take advantage of LLMs. BigQuery simplifies multimodal generative AI for the enterprise by making Gemini models available through BigQuery ML and BigQuery DataFrames. It helps you unlock value from your unstructured data, with its expanded integration with Vertex AI’s document processing and speech-to-text APIs, and its vector capabilities to enable AI-powered search for your business data. The insights from combining your structured and unstructured data can be used to further fine-tune your LLMs.

Support for all data types and open formats

Customers use BigQuery to manage all data types, structured and unstructured, with fine-grained access controls and integrated governance. BigLake, BigQuery’s unified storage engine, supports open table formats which let you use existing open-source and legacy tools to access structured and unstructured data while benefiting from an integrated data platform. BigLake supports all major open table formats, including Apache Iceberg, Apache Hudi and now Delta Lake natively integrated with BigQuery. It provides a fully managed experience for Iceberg, including DDL, DML and streaming support.

Your data teams need access to a universal definition of data, whether in structured, unstructured or open formats. To support this, we are launching BigQuery metastore, a managed, scalable runtime metadata service that provides universal table definitions and enforces fine-grained access control policies for analytics and AI runtimes. Supported runtimes include Google Cloud, open source engines (through connectors), and 3rd party partner engines.

Use multiple languages and serverless engines on a single copy of data

Customers increasingly want to run multiple languages and engines on a single copy of their data, but the fragmented nature of today’s analytics and AI systems makes this challenging. You can now bring the programmatic power of Python and PySpark right to your data without having to leave BigQuery.

BigQuery DataFrames brings the power of Python together with the scale and ease of BigQuery with a minimum learning curve. It implements over 400 common APIs from pandas and scikit-learn by transparently and optimally converting methods to BigQuery SQL and BigQuery ML SQL. This breaks the barriers of client side capabilities, allowing data scientists to explore, transform and train on terabytes of data and processing horsepower of BigQuery.

Apache Spark has become a popular data processing runtime, especially for data engineering tasks. In fact, customers’ use of serverless Apache Spark in Google Cloud increased by over 500% in the past year.1 BigQuery’s newly integrated Spark engine lets you process data using PySpark as you do with SQL. Like the rest of BigQuery, the Spark engine is completely serverless — no need to manage compute infrastructure. You can even create stored procedures using PySpark and call them from your SQL-based pipelines.

Make decisions and feed ML models in near real-time

Data teams are also increasingly being asked to deliver real-time analytics and AI solutions, reducing the time between signal, insight, and action. BigQuery now helps make real-time streaming data processing easy with new support for continuous SQL queries, an unbounded SQL query that processes data the moment it arrives via SQL statement. BigQuery continuous queries amplifies downstream SaaS applications, like Salesforce, with the real-time enterprise knowledge of your data and AI platform. In addition, to support open source streaming workloads, we are announcing a preview of Apache Kafka for BigQuery. Customers can use Apache Kafka to manage streaming data workloads and feed ML models without the need to worry about version upgrades, rebalancing, monitoring and other operational headaches.

Scale analytics and AI with governance and enterprise features

To make it easier for you to manage, discover, and govern data, last year we brought data governance capabilities like data quality, lineage and profiling from Dataplex directly into BigQuery. We will be expanding BigQuery to include Dataplex’s enhanced search capabilities, powered by a unified metadata catalog, to help data users discover data and AI assets, including models and datasets from Vertex AI. Column-level lineage tracking in BigQuery is now available in preview, which will be followed by a preview for lineage for Vertex AI pipelines. Governance rules for fine-grained access control are also in preview, allowing businesses to define governance policies based on metadata.

For customers looking for enhanced redundancy across geographic regions, we are introducing managed disaster recovery for BigQuery. This feature, now in preview, offers automated failover of compute and storage and will offer a new cross-regional service level agreement (SLA) tailored for business-critical workloads. The managed disaster recovery feature provides standby compute capacity in the secondary region included in the price of BigQuery’s Enterprise Plus edition.

A unified experience for all data users

As Google Cloud’s single integrated platform for data analytics, BigQuery unifies how data teams work together with BigQuery Studio. Now generally available, BigQuery Studio gives data teams a collaborative data workspace that all data practitioners can use to accelerate their data-to-AI workflows. BigQuery Studio lets you use SQL, Python, PySpark, and natural language in a single unified analytics workspace, regardless of the data’s scale, format or location. B All development assets in BigQuery Studio are enabled with full lifecycle capabilities, including team collaboration and version control. Since BigQuery Studio’s launch at Next ‘23, hundreds of thousands of users are actively using the new interface.2

Gemini in BigQuery for AI assistive and collaborative experiences

We announced several new innovations for Gemini in BigQuery that help data teams with AI-powered experiences for data preparation, analysis and engineering as well as intelligent recommendations to enhance user productivity and optimize costs. BigQuery data canvas, an AI-centric experience with natural language input, makes data discovery, exploration, and analysis faster and more intuitive. AI augmented data preparation in BigQuery helps users to cleanse and wrangle their data and build low-code visual data pipelines, or rebuild legacy pipelines. Gemini in BigQuery also helps you write and edit SQL or Python code using simple natural language prompts, referencing relevant schemas and metadata.

How Deutsche Telekom is innovating with the BigQuery platform

“Deutsche Telekom built a horizontally scalable data platform in an innovative way that was designed to meet our current and future business needs. With BigQuery at the center of our enterprise’s One Data Ecosystem, we created a unified approach to maintain a single source of truth while fostering de-centralized usage of data across all of our data teams. With BigQuery and Vertex AI, we built a governed and scalable space for data scientists to experiment and productionize AI models while maintaining data sovereignty and federated access controls. This has allowed us to quickly deploy practical usage of LLMs to turbocharge our data engineering life cycle and unleash new business opportunities.” – Ashutosh Mishra, VP of Data Architecture, Deutsche Telekom

Start building your AI-ready data platform

To learn more and start building your AI-ready data platform, start exploring the next generation of BigQuery today. Read more about the latest innovations for Gemini in BigQuery and an overview of what’s next for data analytics at Google Cloud.

1. Google internal data – YoY growth of data processed using Apache Spark on Google Cloud compared with Feb ‘23.
2. Since the August 2023 announcement of BigQuery Studio, monthly active users have continued to grow.

Read More for the details.

2024 04 10

GCP – Powering generative AI with cloud storage innovations at Next ’24

Cloud, Google Cloud gcp

Generative AI is changing the way we create, innovate, and interact with the world. From generating realistic images and videos to composing music and writing code, gen AI models are pushing the boundaries of what’s possible. But achieving the heights of AI’s promises hinges on a scalable storage foundation.

At Google Cloud, we’re committed to providing the infrastructure for businesses to harness the possibilities of gen AI. At Google Cloud Next ’24, we’re excited to announce a series of advancements in our storage portfolio.

Accelerating AI training and inference with Google Cloud Storage

Gen AI models train on datasets in a computationally intensive and time-consuming process, gradually refining their ability to generate new content that resembles the training data. Similarly, AI inference (serving) in production requires low-latency access to models. At Next ’24, we introduced new storage solutions that address the challenge of decreasing model load, training, and inference times while maximizing accelerator utilization.

Cloud Storage FUSE with file caching: Faster training and inference through local data access

Cloud Storage FUSE allows you to mount Cloud Storage buckets as filesystems — a game-changer for AI/ML workloads that rely on frameworks that often require file-based data access. Training and inference can leverage the benefits of Cloud Storage, including lower cost, through filesystem APIs. And with the addition of file caching, Cloud Storage FUSE can increase training throughput by 2.9X. By keeping frequently accessed data closer to your compute instances, Cloud Storage FUSE file caching delivers faster training compared to native ML framework data loaders, so you can rapidly iterate and bring your gen AI models to market quicker.

Parallelstore: Ultra-low latency and caching for demanding training workloads

Parallelstore, Google Cloud’s parallel file system for high-performance computing and AI/ML workloads now also includes caching in preview. It delivers high performance, making it ideal for training and complex gen AI models. With caching, it enables up to 3.9X faster training times and up to 3.7X higher training throughput compared to native ML framework data loaders. Parallelstore also features optimized data import and export from Cloud Storage, to further accelerate training.

Hyperdisk ML: Purpose-built for high-performance training and inference

Training and serving inference in production require fast and reliable access to data. Hyperdisk ML is a new block storage offering that’s purpose-built for AI workloads. Currently in preview, it delivers exceptional performance, not only accelerating training times, but also increasing model load times up to 11.9X compared to common alternatives. Hyperdisk ML allows you to attach up to 2,500 instances to the same volume, so a single volume can serve over 150x more compute instances than competitive block storage volumes ensuring that storage access scales with your accelerator needs.

Manage storage at scale with Generate insights with Gemini

Google Cloud is innovating to use large language models (LLMs) to help you manage cloud storage at scale. Generate insights with Gemini is built upon Insights Datasets, a Google-managed, BigQuery-based storage metadata warehouse. Using simple, natural language, you can easily and quickly analyze your storage footprint, optimize costs, and enhance security — even when managing billions of objects.

Leveraging Google Cloud’s history of thoughtfully-designed user experiences we’ve tailored Generate insights with Gemini with solutions to meet the demanding requirements of modern organizations, including:

Fully validated responses for top customer questions: Verified data responses for pre-canned prompts, ensuring rapid, precise answers to your team’s most critical questions.

Accelerated understanding with visuals: Translate complex data into clear, visual representations, making it easy to understand, analyze, and share key findings across teams.

Dive deeper with multi-turn chat: Need more context or have follow-up questions? Generate insights with Gemini’s multi-turn chat feature allows you to engage in interactive analysis, and gain a granular understanding of your environment.

Generate insights with Gemini is available now through the Google Cloud console as an allowlist experimental release.

Other notable storage announcements

Beyond AI/ML, we also unveiled a range of storage innovations at Next ’24 that benefit a wide variety of use cases:

Google Cloud NetApp Volumes: NetApp Volumes is a fully managed, SMB and NFS storage service that provides advanced data management capabilities and highly scalable performance, for enhanced cost efficiency and performance for Windows and Linux workloads. And now, NetApp Volumes dynamically migrates files by policy to lower-cost storage based on access frequency (in preview Q2’24). In addition, NetApp Volumes Premium and Extreme service levels will support volumes of up to 1PB in size, and are increasing throughput performance up to 2X and 3X, respectively (preview Q2’24). Additionally, we are introducing a new Flex service level enabling volumes as small as 1GiB, and expanding to 15 new Google Cloud regions in Q2’24 (GA).

Filestore: Google Cloud’s fully managed file storage service now supports single-share backup for Filestore Persistent Volumes and Google Kubernetes Engine (GKE) (generally available) and NFS v4.1 ( preview), plus expanded Filestore Enterprise capacity up to 100TiB.

Hyperdisk Storage Pools: With Hyperdisk Advanced Capacity (generally available) and Advanced Performance (preview), you can purchase and manage block storage capacity in a pool that’s shared across workloads. Individual volumes are thinly provisioned from these pools; they only consume capacity as data is actually written to disk, and they benefit from data reduction such as deduplication and compression. This lets you substantially increase storage utilization and can reduce storage TCO by over 50% in typical scenarios, compared to leading cloud providers. Google is the first and only cloud hyperscaler to offer storage capacity pooling.

Anywhere Cache: Working with multi-region buckets, Cloud Storage Anywhere Cache now uses zonal SSD read cache across multiple regions within a continent to speed up cacheable workloads such as analytics, and AI/ML training and inference (allowlist GA).

Soft delete: With this feature, Cloud Storage protects against accidental or malicious deletion of data by preserving deleted items for a configurable period of time (generally available).

Managed Folders: This new Cloud Storage resource type allows granular IAM permissions to be applied to groups of objects (generally available).

Tag-based at scale backup: With this feature, users can leverage Google Cloud tags to manage data protection for Compute Engine VMs (generally available).

High-performance backup for SAP HANA: A new option for backups of SAP HANA databases running in Compute Engine VMs leverages persistent disk (PD) snapshot capabilities for database-aware backups (generally available).

Backup and DR Service Report Manager: Customers can now customize reports with data from Google Cloud Backup and DR using Cloud Monitoring, Cloud Logging, and BigQuery (generally available).

Accelerate your journey with Google Cloud Storage

At Google Cloud, we’re committed to empowering businesses to unlock the full potential of AI/ML, enterprise, and cloud-first workloads. Whether you’re training massive gen AI models, serving inference at scale, or running Windows or GKE workloads, Google Cloud storage provides the versatility and power you need to succeed. Get in touch with your account team to learn how we can help you unleash the potential of generative AI with Google Cloud storage. You can also attend the following sessions live at Next ‘24 or watch them afterwards:

ARC 232 Next Generation Storage: Designing storage for the future

ARC 306 How to define a storage infrastructure for AI/ML workloads

ARC 307 A Masterclass in Managing Billions of Google Cloud Storage Objects and Beyond

ARC 204 How to optimize block storage for any workload with the latest from Hyperdisk

Read More for the details.

2024 04 10

GCP – Introducing ScaNN vector indexing in AlloyDB, bringing 12 years of Google research to speed up vector search

Cloud, Google Cloud gcp

Over the past year, vector databases have skyrocketed in popularity, and have become the backbone of new semantic search and generative AI experiences. Developers use vector search for everything from product recommendations, to image search, to enhancing LLM-powered chatbots with retrieval augmented generation (RAG).

PostgreSQL is one of the most popular operational databases on the market, used by 49% of developers according to StackOverflow’s 2023 survey, and growing. So, it’s no surprise that pgvector, the most popular PostgreSQL extension for vector search, has become one of the most-loved vector databases on the market. That’s why we launched support for pgvector in Cloud SQL for PostgreSQL and AlloyDB for PostgreSQL in July of last year, adding a few enhancements in AlloyDB AI to optimize performance.

The PostgreSQL community has come a long way since then, introducing support for the HNSW algorithm, a state-of-the art graph-based algorithm used in many popular databases. HNSW is supported in both AlloyDB and Cloud SQL. While HNSW offers good query performance for many vector workloads, we’ve heard from some customers that it doesn’t always fit for their real-world use-cases. Some customers with larger corpuses experience issues with index build time and high memory usage; others need fast, real-time index updates or better vector query performance.

That’s why this week we announced the new ScaNN index for AlloyDB, bringing 12 years of Google research and innovation in approximate nearest neighbor algorithms to AlloyDB. This new index uses the same technology that powers some of Google’s most popular services to deliver up to 4x faster vector queries, up to 8x faster index build times and typically a 3-4x smaller memory footprint than the HNSW index in standard PostgreSQL. It also offers up to 10x higher write throughput than the HNSW index in standard PostgreSQL.

The new ScaNN index is available in technology preview in AlloyDB Omni, and will become available in the AlloyDB for PostgreSQL managed service in Google Cloud shortly thereafter.

Vector indexing using ANN algorithms

The most common use case for vectors is to find similar or relevant data. This is accomplished by querying the database for the k vectors that are closest to the query vector in terms of a distance metric such as inner product, cosine similarity, or Euclidean distance. This kind of query is referred to as a “k (exact) nearest neighbors” or “KNN” query.

Unfortunately, KNN queries don’t scale. This is where Approximate Nearest Neighbor (ANN) search comes in. ANN trades off some accuracy (specifically recall — the algorithm might miss some of the actual nearest neighbors) for big improvements in speed. For many use cases, this tradeoff is worthwhile. Consider, for example, user expectations from a search engine: they’ll happily accept 10 results that are approximately (if not perfectly) the most relevant, if it means they’ll get them in a fraction of a second rather than hours or days.

In the database, ANN search uses vector indexes. Although database performance depends on many factors, the underlying ANN index plays a large role in indexing time, query performance, and memory footprint, and determines the fundamental tradeoffs between recall (i.e., accuracy) and latency.

There are two popular types of ANN indices: graph-based and tree-quantization-based. Graph-based algorithms construct a network of nodes, which are connected by edges based on similarity. pgvector’s HNSW index implements the state-of-the-art Hierarchical Navigable Small Worlds (HNSW) graph algorithm used in many popular vector databases. This uses a hierarchical graph to very efficiently traverse the graph to find nearest neighbors. These types of algorithms perform well, especially for small datasets, but have higher memory footprints and longer index build times than tree-quantization-based algorithms.

Tree-quantization-based vector indexes, at a high level, structure the data such that clusters of nearby vectors are grouped together and are properly compressed (quantized). Tree-quantization indices have smaller memory footprints and faster index build times than graph-based ANN indices. Google’s ScaNN (Scalable Nearest Neighbor) is able to achieve these benefits without sacrificing excellent query performance, thanks to key innovations around (a) geometry awareness for smarter clustering and redundancy and (b) taking advantage of modern CPU hardware.

AlloyDB’s ScaNN index introduces support for Google’s state-of-the-art ScaNN into AlloyDB. Deeper integration between the index and the AlloyDB query execution engine further improve performance, as does AlloyDB’s tiered caching architecture. Read our ScaNN for AlloyDB whitepaper for a deep dive into Google’s ScaNN algorithm and how we’ve implemented it in PostgreSQL and AlloyDB.

Key benefits of the ScaNN index

In short, the new ScaNN index for AlloyDB gives you all of the benefits of pgvector plus access to state-of-the-art vector indexing:

Smaller memory footprint: With the ScaNN index, AlloyDB AI typically has a 3-4x smaller memory footprint than the HNSW index in standard PostgreSQL. That means we can offer in-memory performance for larger workloads on smaller machines. It also means more memory is available for other database activities, like the buffer cache for transactional workloads.

Faster index build times: AlloyDB AI’s ScaNN index has up to 8x faster index build times than the HNSW index in standard PostgreSQL, which is important for developer productivity — especially when corpus sizes are larger, or when developers need to test multiple index configurations or embeddings models.
Higher write throughput: 10x higher write throughput than the HNSW index in standard PostgreSQL means that it’s more able to handle real-time updates.
Faster vector queries: AlloyDB AI offers up to 4x faster vector queries than the HNSW index in standard PostgreSQL.
Full PostgreSQL compatibility: AlloyDB’s ScaNN index is compatible with pgvector, so it works with existing vector embeddings and query syntaxes and can be used as either a drop-in replacement or complement to existing HNSW indices.
Excellent developer experience with SQL: Developers building semantic search and generative AI applications can leverage their existing SQL skillset for vector similarity search and take advantage of full PostgreSQL querying capabilities like joins, filters, and more. They can also perform vector queries directly on their operational data, simplifying their technology stack and leveraging that real-time data to create the richest, most relevant experiences — all without sacrificing performance.

“At AbemaTV, we use AlloyDB AI for embeddings generation and vector search in 「ABEMA」, our streaming service, to make video recommendations. We’re excited to see the rapid expansion of model support and vector capabilities in AlloyDB, and plan to use the new model catalog to more easily access the latest embeddings models like Gemini Pro. We’re also looking forward to trying the newly announced scann index to speed up vector search.” – Shunya Suga, Engineering Manager, AbemaTV inc.

Getting started

AlloyDB now gives you the richest set of native vector search options in SQL databases, by offering both the graph-based HNSW index from pgvector and the pgvector-compatible ScaNN index for AlloyDB based on tree-quantization.

ScaNN for AlloyDB is available today as a technology preview via the downloadable AlloyDB Omni. Follow our quickstart guide to deploy AlloyDB Omni on a VM in GCP, on your server or laptop, or on the cloud of your choice. And then follow our documentation to get started with easy and fast vector queries.

Read More for the details.