Azure – Azure Digital Twins Control-Plane Preview API Retirement (2021-06-30)
Starting on May 02, 2023, the Azure Digital Twins control-plane preview API version 2021-06-30-preview will be retired.
Read More for the details.
The Amazon Chime SDK now provides a native client library for Windows applications to connect to WebRTC media sessions. The Amazon Chime SDK lets developers add intelligent real-time audio, video, and screen share to their web and mobile applications, and now to Windows applications.
Read More for the details.
Amazon EC2 now supports replacing the root volume on a running EC2 Mac instance, enabling you to restore the root volume of an EC2 Mac instance to its initial launch state or to a specific snapshot, without requiring you to stop or terminate the instance. You can now reset the EC2 Mac instance back to a known state, while still retaining any local data, networking configurations, and IAM instance profiles. You can also leverage this capability to quickly provision fresh macOS environments on your EC2 Mac Dedicated Hosts without triggering the host scrubbing workflow.
Read More for the details.
Amazon Relational Database Service (Amazon RDS) for Oracle now supports additional cipher suites that can be used with the Oracle Enterprise Manager (OEM) Agent and Oracle Secure Socket Layer (SSL) options. These new cipher suites provide stronger security for connections to RDS for Oracle database instances, strengthening the overall security posture of customers' infrastructure.
Read More for the details.
The AWS Serverless Application Model (SAM) Command Line Interface (CLI) announces the launch of the sam list command, which helps developers access information about deployed resources while they are testing their SAM applications. The AWS SAM CLI is a developer tool that makes it easier to build, test, package, and deploy serverless applications.
Read More for the details.
AWS Panorama customers are now able to get faster quotes and place orders for AWS Panorama Appliances directly in the AWS Panorama Console. With these new capabilities, customers can receive an automated quote from the AWS Panorama Console. This will reduce the time needed for customers to generate purchase orders and will expedite orders for the AWS Panorama Appliance. To get started on this new ordering capability for AWS Panorama, click here.
Read More for the details.
The Oden Technologies solution is an analytics layer for manufacturers that combines and analyzes all process information from machines and production systems to give real-time visibility to the biggest causes of inefficiency and recommendations to address them. Oden empowers front-line plant teams to make effective decisions, such as prioritizing resources more effectively, solving issues faster, and realizing optimal behavior.
Manufacturing plants have limited resources and want to use them optimally. Doing so means eliminating inefficiencies and grounding decisions in recommendations and data points derived from a torrent of data coming from multiple devices.
Oden’s customers are manufacturers with continuous and batch processes, such as in plastics extrusion, paper and pulp, and chemicals. Oden powers the real-time and historical dashboards and reports necessary for this decision-making by leveraging the underlying Google Cloud Platform.
Oden’s platform aggregates streaming, time-series data from multiple devices and instruments and processes them in real-time. This data is in the form of continuously sampled real-world sensor readings (metrics) that are ingested into CloudIoT and transformed in real-time using Dataflow before being written to Oden’s time series database. Transformations include data cleaning, normalization, synchronization, smoothing, outlier removal, and multi-metric calculations that are built in collaboration with manufacturing customers. The time-series database then powers real-time and historical dashboards and reports.
One of the major challenges of working with real-time manufacturing data is handling network disruptions. Manufacturing environments are often not well served by ISPs and can experience network issues due to environmental and process conditions or other factors. When this happens, data can be backed up locally and arrive late after the connection recovers. To avoid overloading real-time Dataflow jobs with this late data, Oden uses BigQuery to support late data handling and recoveries.
In addition to the sensor data, Oden collects metadata about the production process and factory operation such as products manufactured on each line, their specifications and quality. Integrations provide the metadata via Oden’s Integration APIs running on Google Kubernetes Engine (GKE), which then writes it to a PostgreSQL database hosted in CloudSQL. The solution then uses this metadata to contextualize the time-series data in manufacturing applications.
Oden uses this data in several ways, including real-time monitoring and alerting, dashboards for line operators and production managers, historical query tools for quality engineers, and machine learning models trained on historical data and scored in real-time to provide live predictions, recommendations, and insights. This is all served in an easy to access and understand UI, greatly empowering employees across the factory to use data to improve their lines of business.
The second major challenge in manufacturing systems is achieving quality specifications on the final product for it to be sold. Typically, Quality Assurance is conducted offline: after production has completed, a sample is taken from the final product, and a test is performed to determine physical properties of the product. However, this introduces a lag between the actual time period of production and information about the effectiveness of that production—sometimes hours (or even days) after the fact. This prevents proactive adjustments that could correct for quality failures, and results in considerable waste.
At the heart of the Oden platform is Google BigQuery, which plays an important backstage role in Oden’s data-driven software. Metric data is written simultaneously to BigQuery via a BigQuery Subscription through Cloud PubSub and metadata from CloudSQL is accessible via BigQuery’s Federated Queries. This makes BigQuery an exploratory engine for all customer data allowing Oden’s data scientists and engineers to support the data pipelines and build Oden’s machine learning models.
Sometimes these queries are ad-hoc analyses that help the team understand the data better. For example, here’s a BigQuery query joining the native BigQuery metrics table with a federated query against the metadata in PostgreSQL. This query helps determine the average lag between the event time and ingest time of customer metrics, by day, for November:
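A minimal sketch of what such a query might look like, run through the BigQuery Python client. The table name, connection ID, and column names below are hypothetical stand-ins for Oden's internal datasets, not the actual schema.

```python
from google.cloud import bigquery

client = bigquery.Client()  # assumes application-default credentials

# `oden-project.metrics.raw` and the Cloud SQL connection are placeholders.
sql = """
SELECT
  DATE(m.event_time) AS day,
  AVG(TIMESTAMP_DIFF(m.ingest_time, m.event_time, SECOND)) AS avg_lag_seconds
FROM `oden-project.metrics.raw` AS m
JOIN EXTERNAL_QUERY(
       'oden-project.us.metadata-connection',
       'SELECT line_id, customer_id FROM production_lines;') AS meta
  ON m.line_id = meta.line_id
WHERE DATE(m.event_time) BETWEEN '2022-11-01' AND '2022-11-30'
GROUP BY day
ORDER BY day
"""
for row in client.query(sql).result():
    print(row.day, row.avg_lag_seconds)
```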
In addition to ad-hoc queries, several key features of Oden use BigQuery as their foundation. Below, we cover two major features that leverage BigQuery as a highly scalable source of truth for data.
As mentioned earlier, one of the major challenges of working with real-time manufacturing data is handling network disruptions. After the connection recovers, you encounter data that has been backed up and is out of temporal sequence. To avoid overloading real-time dataflow jobs with this late data, BigQuery is used to support late data handling and recoveries.
The data transformation jobs that run on Dataflow are written in the Apache Beam framework and usually perform their transformations by reading metrics from an input Pub/Sub topic and writing back to an output topic. This forms a directed acyclic graph (DAG) of transformation stages before the final metrics are written to the time-series database. But the streaming jobs degrade in performance when handling large bursts of late data, putting at risk Oden’s ability to meet Service Level Agreements (SLAs) that guarantee customers high availability and fast end-to-end delivery of real-time features.
A key tenet of the Apache Beam model is that transformations can be applied to both bounded and unbounded collections of data. With this in mind, Apache Beam can be used for both streaming and batch processing. Oden takes this a step further with a universal shared connector for every one of the transformation jobs, which allows the entire job to switch between a regular “streaming mode” and an alternative “Batch Mode.” In “Batch Mode,” the streaming jobs perform the same transformations but use Avro files or BigQuery as their data source.
This “Batch Mode” feature started as a method of testing and running large batch recoveries after outages. But it has since evolved into a solution to late data handling problems. All data that arrives “late” to Oden bypasses the real-time Dataflow streaming jobs and is written to a special “Late Metrics” PubSub topic and then to BigQuery. Nightly, the “Batch Mode” jobs are deployed; they query from BigQuery the data that wasn’t processed that day and write the results back to the time-series database. This creates two SLAs for customers: a real-time one of seconds for “on-time” data and a batch one of 24 hours for any data that arrives late.
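The sketch below illustrates how a dual-mode Beam job of this kind could be structured. It is a conceptual example under assumed names (topics, tables, and the placeholder sink), not Oden's actual connector code.

```python
import json
import apache_beam as beam
from apache_beam.options.pipeline_options import PipelineOptions

def run(batch_mode: bool):
    opts = PipelineOptions(streaming=not batch_mode)
    with beam.Pipeline(options=opts) as p:
        if batch_mode:
            # "Batch Mode": read the late metrics that were parked in BigQuery.
            metrics = p | "ReadLateMetrics" >> beam.io.ReadFromBigQuery(
                query="SELECT * FROM `your-project.metrics.late_metrics` "
                      "WHERE DATE(event_time) = CURRENT_DATE()",
                use_standard_sql=True)
        else:
            # Streaming mode: read on-time metrics from Pub/Sub and decode them.
            metrics = (
                p
                | "ReadMetrics" >> beam.io.ReadFromPubSub(
                    topic="projects/your-project/topics/metrics")
                | "Decode" >> beam.Map(lambda msg: json.loads(msg.decode("utf-8"))))

        # The same transformation DAG runs in either mode; a real job would
        # clean, normalize, and smooth here before writing to the time-series DB.
        (metrics
            | "Normalize" >> beam.Map(lambda m: {**m, "value": float(m["value"])})
            | "WriteToTimeSeriesDB" >> beam.Map(lambda m: m))  # placeholder sink
```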
Occasionally, there is a need to backfill the transformations of these streaming jobs due to regressions or new features that must be backported over old data. In these cases, batch jobs are leveraged again. Additionally, these jobs join metrics with customer configuration data hosted in CloudSQL via BigQuery’s federated queries.
By using BigQuery for recoveries, Dataflow jobs continue to run smoothly, even in the face of network disruptions. This allows Oden to maintain high accuracy and reliability in real-time data analysis and reporting. Since moving to separate BigQuery-powered late-data handling, the median system latency of the calculated-metrics feature for real-time metrics is under 2 seconds, which allows customers to observe and respond to their custom multi-sensor calculated metrics almost instantly.
The next use case deals with applying machine learning to manufacturing: predicting offline quality test results using real-time process metrics. This is a challenging problem in manufacturing environments, where not only are high accuracy and reliability necessary, but the data is also collected at different sampling rates (seconds, minutes, and hours) and stored in several different systems. The merged datasets represent the comprehensive view of the data for factory personnel, who use the full context to make operational decisions. Predictive Quality models access this same full picture of the manufacturing process as they provide predictions.
At Oden, BigQuery addresses the two key challenges of machine learning in the manufacturing environment:
Using time-series data stored in the time-series database, summary aggregates are computed to construct features as input for model training and scoring.
Using federated queries to access context metadata, data is merged with the aggregates to fully characterize the production period. This allows easily combining the data from both sources and using it to train machine learning models.
Oden uses a variety of models and embeddings, ranging from linear models (Elastic Nets, Lasso) and ensemble models (boosted trees, random forests) to DNNs, to address customers’ different complexity-accuracy-interpretability requirements.
The chart shows out-of-sample predictions of offline quality test values, compared with the actual values that were observed after the end of production. The predicted values provide lead time of quality problems of up to one hour.
Models are trained using an automated pipeline based on Apache Airflow and scikit learn, and models are stored in Google Cloud Storage for versioning and retrieval. Once the models are trained, they can be used to predict the outcomes of quality tests in real-time via a streaming Dataflow job. This allows factory floor operators to identify and address potential problems before they become more serious or have a large impact. This improves the overall efficiency of the production process, and reduces the amount of waste that a factory generates. Factory floor operators receive up-to-date information about the quality characteristics of current production conditions, up to an hour before the actual test value is available for inspection. This gives early warning to help catch quality failures. In turn, this reduces material waste and machine downtime, metrics that are central to many manufacturers’ continuous improvement initiatives, as well as their day-to-day operations.
Operators come to rely upon predictive models to execute their roles effectively, regardless of their experience level or their familiarity with a specific type of production or machinery, and up-to-date models are critical to the success of the predictions in improving manufacturing processes. Hence, in addition to training, life-cycle management of models and ML ops are important considerations in deploying reliable models to the factory floor. Oden is focusing on leveraging Vertex AI to make the ML model lifecycle more simple and efficient.
Oden’s Predictive Quality model empowers operators to take proactive steps to optimize production on the factory floor, and allows for real-time reactions to changes in the manufacturing process. This contributes to cost reduction, energy savings, and reduced material waste.
Actionable data, like the processed data generated by Oden, has become a critical part of making the predictions and decisions needed to remain competitive in the manufacturing space. To use these insights to their full potential, businesses need a low barrier to accessing data, unifying it with other data sources, deriving richer insights, and making informed decisions. Oden already leads the market in delivering trustworthy, usable, and understandable insights from combined process, production, and machine data that is accessible to everyone within the plant to improve their line of business. There is an opportunity to go beyond the Oden interface and integrate with even more business systems. The data can be made available in the form of finished datasets, hosted in BigQuery. Google Cloud has launched a new service called Analytics Hub, powered by BigQuery, with the intent to make data sharing easier, secure, searchable, reliable, and highly scalable.
Analytics Hub is based on a publish-subscribe model in which BigQuery datasets are enlisted into a Data exchange as a Shared dataset; a Data exchange can host hundreds of listings. It lets users share multiple BigQuery objects such as views, tables, external tables, and models into the Data exchange. A Data exchange can be marked public or private for dedicated sharing. On the other end, businesses can subscribe to one or more listings in their BigQuery instance, where they are consumed as a Linked dataset to run queries against. Analytics Hub sets up a real-time data pipeline with a low-code, no-code approach to sharing data, while giving Oden complete control over what data needs to be shared for better governance.
This empowers advanced users, who have use-cases that exceed the common workflows already achievable with Oden’s configurable dashboard and query tools, to leverage the capabilities of BigQuery in their organization. This brings Oden’s internal success with BigQuery directly to advanced users. With BigQuery, they can join against datasets not in Oden, express complex BigQuery queries, load data directly with Google’s BigQuery client libraries, and integrate Oden data into third party Business Intelligence software such as Google Data Studio.
Google Cloud and Oden are forging a strong partnership in several areas, most of which are central to customers’ needs. Oden has developed a turnkey solution by using best-in-class Google Cloud tools and technologies, delivering pre-built models to accelerate time to value and enabling manufacturers to have accessible and impactful insights without hiring a data science team. Together, Google and Oden are expanding the way manufacturers access and use data by creating a clear path to centralize production, machine, and process data into the larger enterprise data platform, paving the way for energy savings, material waste reduction, and cost optimization.
Click here to learn more about Oden Technologies or to request a demo.
Google is helping tech companies like Oden Technologies build innovative applications on Google’s data cloud with simplified access to technology, helpful and dedicated engineering support, and joint go-to-market programs through the Built with BigQuery initiative, launched in April ‘22 as part of the Google Data Cloud Summit. Participating companies can:
Get started fast with a Google-funded, pre-configured sandbox.
Accelerate product design and architecture through access to designated experts from the ISV Center of Excellence who can provide insight into key use cases, architectural patterns, and best practices.
Amplify success with joint marketing programs to drive awareness, generate demand, and increase adoption.
BigQuery gives ISVs the advantage of a powerful, highly scalable data warehouse that’s integrated with Google Cloud’s open, secure, sustainable platform. And with a huge partner ecosystem and support for multi-cloud, open source tools and APIs, Google provides technology companies the portability and extensibility they need to avoid data lock-in.
Click here to learn more about Built with BigQuery.
We thank the Google Cloud and Oden team members who co-authored the blog: Oden: Henry Linder, Staff Data Scientist & Deepak Turaga, SVP Data Science and Engineering. Google: Sujit Khasnis, Solutions Architect & Merlin Yammsi, Solutions Consultant
Read More for the details.
Has your business made talent investments that directly link to your digital transformation strategy? Google Cloud wants to honor your organization for its dedication to developing your team’s Google Cloud skills through our *new* Talent Transformation Google Cloud Customer Award. Submit your application before March 31, 2023 to be recognized as a global leader in cloud talent transformation.
Google Cloud Customer Awards recognize organizations who are leading business transformation with Google Cloud products and solutions. We want to hear how you are growing one of the most important elements of your organization — your people! Tell us your story for a chance to win and enjoy benefits like:
A Google Cloud Customer Award designation for your website
Collaboration with Google Cloud leaders, engineers and product managers at a variety of roundtables, discussion and events
Google Cloud press release and announcement support to help strengthen your brand as a visionary leader in technology
Promotion through the Google Cloud results blog and social media to share your success story with our extensive customer and partner network
Inclusion in the annual Google Cloud Customer ebook
A place amongst the global leaders who are recognized at Google Cloud events and celebrations
Tell your compelling and unique story about cloud talent transformation! This can include mentorship, skills training, Google Cloud certification preparation support or anything you’ve built to invest in your people’s Google Cloud skills. To help your accomplishments shine, use the distinct voice and personality of your organization. You’ll want to begin by gathering:
Business and deployment metrics
Solution overview diagrams, workflows, architectural diagrams or images
Existing public case studies, webinars or other content
These awards recognize customers who demonstrate unique transformation and innovation, business/operational excellence, industry-wide problem solving, and implementing long-term, lasting benefits.
You can add depth to your submission by asking stakeholders to share their perspectives — for example, your CEO or customer testimonies are great ways to do this.
Metrics and impact are also important. Share how your company is now faster, smarter, more collaborative and flexible due to the Google Cloud skills development opportunities that you provided.
A diverse panel of senior technical judges from around the world carefully assess hundreds of entries that are ranked using a scoring framework. We ensure high quality assessment through a three-round process, using specified benchmarks at least twice per entry.
The Google Cloud Customer Awards team and the judges are the only people who see submissions, and winners are under embargo until official announcements take place. All participants will be notified of results via email at least two months prior to announcements, with results notification scheduled for May 31, 2023.
Results will be formally announced and celebrated at a special event later this year, where winners take their place amongst other outstanding leaders in innovative thinking and business transformation.
For inspiration and to learn more about the transformative organizations that have won Customer Awards with their visionary achievements, take a look at last year’s industry winners. We encourage entry by any customers – new to established, small to large, across all types of products and solutions.
In order to enter, you must be a Google Cloud customer, with success you can demonstrate within the last 12 months. Google Cloud partners and Googlers can also submit on behalf of customers.
Award categories include Industry Customer Awards across a range of verticals, and our Technology for Good Awards, which include the Talent Transformation Award. You may apply for one Industry Customer Award, plus any or all of the Technology for Good Awards.
Start by using this template to gather all of the relevant information as a team. Designate one person to complete the application and submit via the Customer Awards online questionnaire. The submission window is now open through March 31, 2023.
We are so excited to hear about the wonderful things you are doing to empower your teams to build upon their Google Cloud knowledge and skills — making you a leader in your industry. Happy submitting — get started here!
Read More for the details.
Over the past few years, Google Cloud has become the platform of choice for a growing number of SAP enterprise customers. That’s especially true for companies looking to migrate large and challenging SAP workloads, including those moving to S/4HANA systems as part of a cloud modernization strategy.
Google Cloud’s Bare Metal Solution (BMS) for SAP systems plays an important role in our success with these customers. Our BMS offerings are dedicated, single-tenant systems that combine uncompromising performance with the advantages of fully managed cloud infrastructure solutions. With SAP-certified BMS offerings available in North America and Europe, we’re offering SAP customers a set of high-end infrastructure capabilities.
Cardinal Health, Inc. is a distributor of pharmaceuticals, a global manufacturer and distributor of medical and laboratory products, and a provider of performance and data solutions for healthcare facilities. With operations in more than 30 countries and approximately 46,500 employees, Cardinal Health is a crucial link between the clinical and operational sides of healthcare. The company serves 90% of U.S. hospitals, more than 60,000 U.S. pharmacies and more than 10,000 specialty physician offices and clinics.
Over the past several years, a series of acquisitions drove a major expansion of Cardinal Health’s business. These acquisitions also created an increasingly complex and unwieldy IT environment that included a variety of ERP systems and dozens of other legacy applications, in addition to multiple ERP instances.
Cardinal Health’s technology modernization strategy will migrate its business away from these legacy systems to a single, modern digital platform. This includes leveraging the Google Cloud Large Memory Bare Metal Solution to modernize and consolidate its SAP application architecture within its Pharma segment with a single, massively scalable SAP S/4HANA system and BigQuery to unify SAP data with a fully managed enterprise data warehouse.
Cardinal Health’s SAP modernization effort presented significant challenges. The migration process had to take place within a very narrow window and at 100% accuracy to avoid significant financial and operational impacts. Additionally, Cardinal Health’s strategy of consolidating its pharma business onto a single SAP HANA scale-up instance, with no performance or capacity issues, would require Google Cloud to scale its SAP-certified server systems beyond their previous 12TB upper limit.
Google Cloud raised the bar with its SAP-certified Bare Metal Solution server options—one that supports up to 672 vCPUs and 18TB of memory, and another with up to 896 vCPUs and 24TB of memory. Additionally, customers have multiple storage options, offering up to a maximum of 96TB and 400,000 IOPS per system. Both offerings, along with a high-performance storage SKU, are certified for SAP HANA online transaction processing (OLTP) and meet SAP standard sizing requirements.
Our work with Cardinal Health involved some of the first production deployments of Google Cloud Large Memory Bare Metal Solution 24TB instances to run the company’s SAP HANA in-memory database. Cardinal Health got what it needed: modern, fully managed, and SAP-certified cloud infrastructure that can support a single, consolidated scale-up SAP HANA instance for the company’s pharma operations. Google Cloud BMS also ensures that Cardinal Health’s SAP environment can scale effortlessly to support its goals, including plans to transform and migrate 200+ million business records onto its HANA system.
In addition, Cardinal Health leveraged Google Cloud’s ability to run SAP application servers on virtualized systems alongside its SAP HANA instance running on a 24TB BMS server. This hybrid approach to SAP cloud infrastructure offered significant advantages in terms of efficiency and cost-effectiveness: Cardinal Health’s use of the BMS server played a big role in achieving significant improvements in reporting and decision-making efficiency, as well as millions in cost savings over the first five years of the effort.
On top of these quantitative improvements, Google Cloud delivered the SAP migration for Cardinal Health in a single weekend—without user impacts or business disruptions.
Based on raw performance, the Google Cloud server offerings define the cutting edge for our SAP customers. In fact, SAP’s certification of our 24TB Bare Metal Solution configuration earned us a world-record SAP HANA benchmark for Intel-based servers of 892,270 SAPS. And customers that combine our BMS server and high-performance storage offerings can expect to reload even the biggest SAP HANA datasets, following a full system restart, in as little as 30 minutes—a fraction of the time required in the past for an SAP HANA “rehydration” procedure.
More SAP customers are facing the same challenges that drove Cardinal Health to embark upon its modernization efforts: rapid business growth, pressure to consolidate sprawling and often chaotic SAP environments, and SAP HANA systems that now routinely require multi-TB memory capacities to run efficiently. For Google Cloud Bare Metal Solution customers, these industry-leading benchmarks translate directly into success with real-world SAP cloud modernization and growth initiatives.
It probably won’t take long for today’s boundary-pushing 24TB Bare Metal Solution systems to become tomorrow’s mainstream SAP HANA infrastructure. What we know for sure is that Google Cloud will be ready with cutting-edge solutions for our biggest and most demanding SAP customers.
Read More for the details.
While there has been a lot of attention given to wildfires, floods, and hurricanes, heat-related weather events are still understudied and underreported. Every summer, heat waves pose a major threat to the health of people and ecosystems. 83% of the North American population lives in cities, where the urban heat island (UHI) effect leads to higher local temperatures compared to surrounding rural areas.
But not everyone living in US cities experiences summer heat waves and urban heat islands equally. Communities with lower incomes or people of color are more likely to be impacted by extreme heat events, both due to fewer green spaces in urban areas and not having access to air conditioning. While there have been many studies that have shed light on environmental inequities between neighborhoods of varying income levels, there has been little analysis of what it will take to provide all people with protection from severe heat.
In the summer of 2019, TC Chakraborty, then a PhD candidate at the Yale School of the Environment, and Tanushree Biswas, then Spatial Data Scientist at The Nature Conservancy, California, met at one of our Geo for Good Summits. The summits bring together policymakers, scientists, and other change-makers who use Google’s mapping tools. They wanted to share ideas in their areas of expertise (urban heat and tree cover, respectively) and explore a potential path to address urban climate change inequities using open tools and a suite of datasets. Given the ability of tree cover to help mitigate local heat in cities, they wondered how much space is actually available for trees in lower income urban neighborhoods.
If this available space were to be quantified, it would provide estimates of several co-benefits of tree cover beyond heat mitigation, from carbon sequestration to air pollution reduction, to decreased energy demand for cooling, to possible health benefits. Chakraborty and Biswas believed that increasing the tree canopy in this available space could provide economic opportunities for green jobs as well as more equitable climate solutions. Inspired by this shared vision, they joined forces to explore the feasibility of adding trees to California’s cities.
Three years later, in June 2022, Chakraborty, Biswas, and co-authors L.S. Campbell, B. Franklin, S.S. Parker, and M. Tukman published a paper to address this challenge. The study combines medium-to-high-resolution satellite observations with census data to calculate the feasible area available for urban afforestation — planting new trees — for over 200 urban clusters in California. The paper demonstrates a systematic approach that leverages publicly available data on Google Earth Engine, Google’s planetary-scale platform for Earth science data & analysis, which is free of charge for nonprofits, academics, and research use cases. Results from the study can be explored through an Earth Engine web application: Closing Urban Tree Cover Inequity (CUTI).
California is the most populated state in the United States, the fifth largest economy in the world and frequently impacted by heat waves. This makes California a prime location to demonstrate approaches to strategically reducing surface UHI (SUHI), which has the potential to positively impact millions, especially those vulnerable to heat risk. Chakraborty et al. (2022) found that underprivileged neighborhoods in California have 5.9% less tree cover (see Fig. 1 for an illustrative example for Sacramento) and 1.7 °C higher summer SUHI intensity than more affluent neighborhoods. This disparity in tree cover can be partially closed through targeted urban afforestation.
Leveraging the wealth of data for cities in California, including heat-related mortality and morbidity data, sensitivity of residential energy demand to temperature, and carbon sequestration rates of California’s forests, the researchers calculated co-benefits of several urban afforestation scenarios. For their maximum possible afforestation scenario, they found potential for an additional 36 million (1.28 million acres of) trees, which can provide economic co-benefits, estimated to be worth as much as $1.1 billion annually:
4.5 million metric tons of annual CO2 sequestration
Reduction in heat-related medical visits (~4000 over 10 years)
Energy usage and cost reductions
Stormwater runoff reduction
Property value increase
With a focus on reducing disparities in SUHI and tree cover within these cities, the study provides suitability scores for scaling urban afforestation at the census-block group level across California. By focusing on California neighborhoods with high suitability scores, the authors estimate that an annual investment of $467 million in urban afforestation would both reduce heat disparities and generate $712 million of net annual benefits. Specifically, these benefits would go to 89% of the approximately 9 million residents in the lowest income quartiles of California cities. This annual investment equates to a 20-year commitment of $9.34 billion, or roughly the cost of 10,000 electric vehicles annually.
The adverse effects of climate change disproportionately impact cities, so it’s critical to start thinking about viable solutions to address environmental disparities within urban communities. Providing urban planners with data-driven tools to design climate-resilient cities is a key first step. The Chakraborty et al. study leverages Earth Engine data, tech, and cloud compute resources to provide actionable insights to address environmental disparities in cities. It’s a great example of how Earth Engine can help inform urban policy and provide a bird’s-eye view of logistical support for scalable climate solutions that enable innovation and investment opportunities. In the future, Chakraborty and Biswas hope to further scale this analysis across U.S. cities to provide baseline data that can help move us towards equitable climate adaptation for everyone.
Google wants to support this kind of research! If you are a researcher working on climate impact, apply to the Google Cloud Climate Innovation Challenge in partnership with The National Science Foundation (NSF) and AI Institute for Research on Trustworthy AI in Weather, Climate, and Coastal Oceanography (AI2ES) for free credits to fuel your research.
Thanks to TC Chakraborty and Tanushree Biswas for their help in preparing this blog post.
Read More for the details.
Preprocessing and transforming raw data into features is a critical but time-consuming step in the ML process. This is especially true when a data scientist or data engineer has to move data across different platforms to do MLOps. In this blog post, we describe how we streamline this process by adding two feature engineering capabilities in BigQuery ML.
Our previous blog outlines the data to AI journey with BigQuery ML, highlighting two powerful features that simplify MLOps – data preprocessing functions for feature engineering and the ability to export the BigQuery ML TRANSFORM statement as part of the model artifact. In this blog post, we share how to use these features for creating a seamless experience from BigQuery ML to Vertex AI.
Preprocessing and transforming raw data into features is a critical but time-consuming step when operationalizing ML. We recently announced the public preview of advanced feature engineering functions in BigQuery ML. These functions help you impute, normalize, or encode data. When this is done inside the database, BigQuery, the entire preprocessing process becomes easier, faster, and more secure.
Here is a list of the new functions we are introducing in this release. The full list of preprocessing functions can be found here.
ML.MAX_ABS_SCALER – Scale a numerical column to the range [-1, 1] without centering by dividing by the maximum absolute value.
ML.ROBUST_SCALER – Scale a numerical column by centering with the median (optional) and dividing by the quantile range of choice ([25, 75] by default).
ML.NORMALIZER – Turn an input numerical array into a unit norm array for any p-norm: 0, 1, >1, +inf. The default is 2, resulting in a normalized array where the sum of squares is 1.
ML.IMPUTER – Replace missing values in a numerical or categorical input with the mean, median, or mode (most frequent).
ML.ONE_HOT_ENCODER – One-hot encode a categorical input. It can optionally do dummy encoding by dropping the most frequent value, and the size of the encoding can be limited by specifying k for the k most frequent categories and/or a lower threshold for the frequency of categories.
ML.LABEL_ENCODER – Encode a categorical input to integer values [0, n categories], where 0 represents NULL and excluded categories. You can exclude categories by specifying k for the k most frequent categories and/or a lower threshold for the frequency of categories.
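As a quick illustration of how these functions are called (they use the analytic OVER() form), here is a small sketch run through the BigQuery Python client. The recipes table and its columns are hypothetical.

```python
from google.cloud import bigquery

client = bigquery.Client()  # assumes application-default credentials

# `mydataset.recipes` and its columns are placeholders for illustration only.
sql = """
SELECT
  ML.IMPUTER(mixing_time, 'median') OVER() AS mixing_time_imputed,
  ML.ONE_HOT_ENCODER(flour_type) OVER() AS flour_type_encoded,
  ML.LABEL_ENCODER(oven_type) OVER() AS oven_type_label
FROM `mydataset.recipes`
"""
for row in client.query(sql).result():
    print(dict(row))
```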
You can now export BigQuery ML models that include a feature TRANSFORM statement. The ability to include TRANSFORM statements makes models more portable when exporting them for online prediction. This capability also works when BigQuery ML models are registered with Vertex AI Model Registry and deployed to Vertex AI Prediction endpoints. More details about exporting models can be found in BigQuery ML Exporting models.
These new features are available through the Google Cloud Console, BigQuery API, and client libraries.
In this tutorial, we will use the bread recipe competition dataset to predict judges’ ratings using linear regression and boosted tree models.
Objective: To demonstrate how to preprocess data using the new functions, register the model with Vertex AI Model Registry, and deploy the model for online prediction with Vertex AI Prediction endpoints.
Dataset: Each row represents a bread recipe with columns for each ingredient (flour, salt, water, yeast) and procedure (mixing time, mixing speed, cooking temperature, resting time). There are also columns that include judges’ ratings of the final product from each recipe.
Overview of the tutorial: Steps 1 and 2 show how to use the TRANSFORM statement. Steps 3 and 4 demonstrate how to manually export and register the models. Steps 5 through 7 show how to deploy a model to Vertex AI Prediction endpoint.
For the best learning experience, follow this blog post alongside the tutorial notebook.
Before training an ML model, exploring the data within columns is essential to identifying the data type, distribution, scale, missing patterns, and extreme values. BigQuery ML enables this exploratory analysis with SQL. With the new preprocessing functions it is now even easier to transform BigQuery columns into ML features with SQL while iterating to find the optimal transformation. For example, when using the ML.MAX_ABS_SCALER function for an input column, each value is divided by the maximum absolute value (10 in the example):
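A sketch of what that exploratory query could look like, using an inline array whose maximum absolute value is 10; the column name is illustrative rather than the tutorial's exact schema.

```python
from google.cloud import bigquery

client = bigquery.Client()  # assumes application-default credentials

# The inline values stand in for an input column whose max absolute value is 10.
sql = """
SELECT
  water_ml,
  ML.MAX_ABS_SCALER(water_ml) OVER() AS water_ml_scaled
FROM UNNEST([2, 5, -4, 10]) AS water_ml
"""
for row in client.query(sql).result():
    print(row.water_ml, row.water_ml_scaled)  # e.g. 10 -> 1.0, -4 -> -0.4
```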
Once the input columns for an ML model are identified and the feature transformations are chosen, it is enticing to apply the transformation and save the output as a view. But this has an impact on our predictions later on because these same transformations will need to be applied before requesting predictions. Step 2 shows how to prevent this separation of processing and model training.
Building on the preprocessing explorations in Step 1, the chosen transformations are applied inline with model training using the TRANSFORM statement. This interlocks the model iteration with the preprocessing explorations while making any candidate ready for serving with BigQuery or beyond. This means you can immediately try multiple model types without any delayed impact of feature transformations on predictions. In this step, two models, linear regression and boosted tree, are trained side-by-side with identical TRANSFORM statements:
Training with linear regression – Model a
Training with boosted tree – Model b
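A hedged sketch of the two training statements described above. The dataset, column, and model names (bread.recipes, judge_a_rating, and so on) are placeholders, not the tutorial's exact schema; the point is the shared CREATE MODEL ... TRANSFORM ... OPTIONS shape.

```python
from google.cloud import bigquery

client = bigquery.Client()

# Identical TRANSFORM clause shared by both models; names are illustrative.
transform_clause = """
TRANSFORM(
  ML.MAX_ABS_SCALER(flour_g) OVER() AS flour_scaled,
  ML.ROBUST_SCALER(mixing_time_min) OVER() AS mixing_time_scaled,
  ML.IMPUTER(cooking_temp_c, 'median') OVER() AS cooking_temp_imputed,
  judge_a_rating  -- label column passes through untransformed
)
"""

# Model a: linear regression.
client.query(f"""
CREATE OR REPLACE MODEL `bread.model_a`
{transform_clause}
OPTIONS (model_type = 'LINEAR_REG', input_label_cols = ['judge_a_rating'])
AS SELECT * FROM `bread.recipes`
""").result()

# Model b: boosted tree regressor with the identical TRANSFORM statement.
client.query(f"""
CREATE OR REPLACE MODEL `bread.model_b`
{transform_clause}
OPTIONS (model_type = 'BOOSTED_TREE_REGRESSOR', input_label_cols = ['judge_a_rating'])
AS SELECT * FROM `bread.recipes`
""").result()
```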
Having identical input columns with the same preprocessing means you can easily compare the accuracy of the models. Using the BigQuery ML function ML.EVALUATE makes this comparison as simple as a single SQL query that stacks the outcomes with the UNION ALL set operator:
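A minimal sketch of that comparison, continuing with the placeholder model names from the previous sketch:

```python
from google.cloud import bigquery

client = bigquery.Client()

# Stack the regression metrics of both models into a single result set.
sql = """
SELECT 'linear_reg' AS model_name, * FROM ML.EVALUATE(MODEL `bread.model_a`)
UNION ALL
SELECT 'boosted_tree' AS model_name, * FROM ML.EVALUATE(MODEL `bread.model_b`)
"""
for row in client.query(sql).result():
    print(row.model_name, row.mean_squared_error, row.r2_score)
```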
The results of the evaluation comparison show that using the boosted tree model results in a much better model than linear regression with drastically lower mean squared error and higher r2.
Both models are ready to serve predictions, but the clear choice is the boosted tree regressor. Once you decide which model to use, you can predict directly within BigQuery ML using the ML.PREDICT function. In the rest of the tutorial, we show how to export the model outside of BigQuery ML and predict using Google Cloud Vertex AI.
Once your model is trained, if you want to do online inference with low-latency responses in your application, you have to deploy the model outside of BigQuery. The following steps demonstrate how to deploy the models to Vertex AI Prediction endpoints.
This can be accomplished in one of two ways:
Manually export the model from BigQuery ML and set up a Vertex AI Prediction Endpoint. To do this, you need to do steps 3 and 4 first.
Register the model and deploy from Vertex AI Model Registry automatically. The capability is not available yet but will be available in a forthcoming release. Once it’s available steps 3 and 4 can be skipped.
BigQuery ML supports an EXPORT MODEL statement to deploy models outside of BigQuery. A manual export includes two models – a preprocessing model that reflects the TRANSFORM statement and a prediction model. Both models are exported with a single export statement in BigQuery ML.
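A sketch of the export statement, with placeholder model and bucket names (the tutorial uses its own paths, shown below):

```python
from google.cloud import bigquery

client = bigquery.Client()

# Exports both the preprocessing (TRANSFORM) model and the prediction model
# under the given Cloud Storage prefix; the bucket path is illustrative.
client.query("""
EXPORT MODEL `bread.model_b`
OPTIONS (URI = 'gs://your-bucket/bqml-exports/model_b/')
""").result()
```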
The preprocessing model that captures the TRANSFORM statement is exported as a TensorFlow SavedModel file. In this example it is exported to a GCS bucket located at ‘gs://statmike-mlops-349915-us-central1-bqml-exports/03/2b/model/transform’.
The prediction models are saved in portable formats that match the frameworks in which they were trained by BigQuery ML. The linear regression model is exported as a TensorFlow SavedModel and the boosted tree regressor is exported as a Booster file (XGBoost). In this example, the boosted tree model is exported to a GCS bucket located at ‘gs://statmike-mlops-349915-us-central1-bqml-exports/03/2b/model’.
These export files are in a standard open format of the native model types making them completely portable to be deployed anywhere – they can be deployed to Vertex AI (Steps 4-7 below), on your own infrastructure, or even in edge applications.
Steps 4 through 7 show how to register and deploy a model to Vertex AI Prediction endpoint. These steps need to be repeated separately for the preprocessing models and the prediction models.
To deploy the models in Vertex AI Prediction, they first need to be registered with the Vertex AI Model Registry. To do this, two inputs are needed – the links to the model files and a URI to a pre-built container. Go to Step 4 in the tutorial to see exactly how it’s done.
The registration can be done with the Vertex AI console or programmatically with one of the clients. In the example below, the Python client for Vertex AI is used to register the models like this:
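A minimal sketch of that registration call with the Vertex AI SDK for Python; the project, bucket path, and pre-built serving container are assumptions to adapt to your environment (the container must match the exported framework, XGBoost here).

```python
from google.cloud import aiplatform

aiplatform.init(project="your-project", location="us-central1")

model = aiplatform.Model.upload(
    display_name="bqml-boosted-tree",
    # Points at the exported model files from Step 3 (placeholder path).
    artifact_uri="gs://your-bucket/bqml-exports/model_b/",
    # Pre-built prediction container matching the exported framework.
    serving_container_image_uri=(
        "us-docker.pkg.dev/vertex-ai/prediction/xgboost-cpu.1-1:latest"
    ),
)
```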
Vertex AI includes a service for hosting models for online predictions. To host a model on a Vertex AI Prediction endpoint you first create an endpoint. This can also be done directly from the Vertex AI Model Registry console or programmatically with one of the clients. In the example below, the Python client for Vertex AI is used to create the endpoint like this:
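A sketch of the endpoint creation call; the display name and location are placeholders.

```python
from google.cloud import aiplatform

aiplatform.init(project="your-project", location="us-central1")

# Creates an empty endpoint that models can later be deployed to.
endpoint = aiplatform.Endpoint.create(display_name="bread-rating-endpoint")
print(endpoint.resource_name)
```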
Deploying a model from the Vertex AI Model Registry (Step 4) to a Vertex AI Prediction endpoint (Step 5) is done in a single deployment action where the model definition is supplied to the endpoint along with the type of machine to utilize. Vertex AI Prediction endpoints can automatically scale up or down to handle prediction traffic needs by providing the number of replicas to utilize (default is 1 for min and max). In the example below, the Python client for Vertex AI is being used with the deploy method for the endpoint (Step 5) using the models (Step 4):
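A sketch of that deployment call; the resource names refer back to the model registered in Step 4 and the endpoint created in Step 5, and the machine type and replica counts are illustrative defaults.

```python
from google.cloud import aiplatform

aiplatform.init(project="your-project", location="us-central1")

# Placeholder resource names for the registered model and the endpoint.
model = aiplatform.Model("projects/your-project/locations/us-central1/models/1234567890")
endpoint = aiplatform.Endpoint("projects/your-project/locations/us-central1/endpoints/987654321")

endpoint.deploy(
    model=model,
    machine_type="n1-standard-4",
    min_replica_count=1,   # endpoint scales between min and max replicas
    max_replica_count=1,
)
```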
Once the model is deployed to a Vertex AI Prediction endpoint (Step 6) it can serve predictions. Rows of data, called instances, are passed to the endpoint and results are returned that include the processed information: preprocessing result or prediction. Getting prediction results from Vertex AI Prediction endpoints can be done with any of the Vertex AI API interfaces (REST, gRPC, gcloud, Python, Java, Node.js). Here, the request is demonstrated directly with the predict method of the endpoint (Step 6) using the Python client for Vertex AI as follows:
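A sketch of a prediction request; the instance keys are placeholders, and the exact instance format depends on whether you are calling the preprocessing (TensorFlow) model or the prediction (XGBoost) model.

```python
from google.cloud import aiplatform

aiplatform.init(project="your-project", location="us-central1")

endpoint = aiplatform.Endpoint("projects/your-project/locations/us-central1/endpoints/987654321")

# One instance per row of input data; field names here are illustrative.
response = endpoint.predict(instances=[{
    "flour_g": 500.0,
    "mixing_time_min": 12.0,
    "cooking_temp_c": 230.0,
}])
print(response.predictions)
```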
The result of an endpoint with a preprocessing model will be identical to applying the TRANSFORM statement from BigQuery ML. The results can then be pipelined to an endpoint with the prediction model to serve predictions that match the results of the ML.PREDICT function in BigQuery ML. The results of both methods, Vertex AI Prediction endpoints and BigQuery ML with ML.PREDICT are shown side-by-side in the tutorial to show that the results of the model are replicated. Now the model can be used for online serving with extremely low latency. This even includes using private endpoints for even lower latency and secure connections with VPC Network Peering.
With the new preprocessing functions, you can simplify data exploration and feature preprocessing. Further, by embedding preprocessing within model training using the TRANSFORM statement, the serving process is simplified by using prepped models without needing additional steps. In other words, predictions are done right inside BigQuery or alternatively the models can be exported to any location outside of BigQuery such as Vertex AI Prediction for online serving. The tutorial demonstrated how BigQuery ML works with Vertex AI Model Registry and Prediction to create a seamless end-to-end ML experience. In the future you can expect to see more capabilities that bring BigQuery, BigQuery ML and Vertex AI together.
Click here to access the tutorial or check out the documentation to learn more about BigQuery ML.
Thanks to Ian Zhao, Abhinav Khushraj, Yan Sun, Amir Hormati, Mingge Deng and Firat Tekiner from the BigQuery ML team
Read More for the details.
AWS Elemental MediaLive now supports decoding audio from sources with Dolby E compressed tracks. Dolby E supports delivery of eight discrete audio source channels in a PCM (pulse code modulated) stereo pair. With this feature you can deliver content with different language tracks and/or high channel count spatial audio from a single high-quality source. This is useful for international syndication of sports and events where commentary and immersive audio are present.
Read More for the details.
Starting today, customers can use Amazon Kinesis Data Firehose in the Europe (Zurich), Europe (Spain), Asia Pacific (Hyderabad) AWS Regions.
Read More for the details.
Today, AWS AppConfig announces integrations with AWS Secrets Manager and AWS Key Management Service (AWS KMS), providing customers with additional configuration sources and encryption capabilities. In addition to its own AWS AppConfig Hosted Configuration store, AWS AppConfig already integrates with Amazon Simple Storage Service (Amazon S3), AWS CodePipeline, AWS Systems Manager Parameter Store, and AWS Systems Manager Documents as data sources. Now customers can use Secrets Manager as a single source to safely and securely deploy sensitive data. All sensitive data retrieved from Secrets Manager via AWS AppConfig can be encrypted at deployment time using an AWS KMS Customer Managed Key (CMK). In addition, AWS AppConfig now offers support for CMK encryption for other configuration data. The integration with AWS KMS enables support for Amazon S3 objects encrypted with a customer managed key or secure strings from AWS Systems Manager Parameter Store encrypted with a customer managed key.
Read More for the details.
Enable full stack cloud observability and app performance monitoring with Azure Native New Relic Service in public preview
Read More for the details.
AWS Directory Service for Microsoft Active Directory, also known as AWS Managed Microsoft AD, and AD Connector are now available in the AWS Europe (Spain), Europe (Zurich) and Asia Pacific (Hyderabad) Regions.
Read More for the details.
Medical imaging offers remarkable opportunities in research for advancing our understanding of cancer, discovering new non-invasive methods for its detection, and improving overall patient care. Advancements in artificial intelligence (AI), in particular, have been key in unlocking our ability to use this imaging data as part of cancer research. Development of AI-powered research approaches, however, requires access to large quantities of high quality imaging data.
The US National Cancer Institute (NCI) has long prioritized collection, curation, and dissemination of comprehensive, publicly available cancer imaging datasets. Initiatives like The Cancer Genome Atlas (TCGA) and Human Tumor Atlas Network (HTAN) (to name a few) work to make robust, standardized datasets easily accessible to anyone interested in contributing their expertise: students learning the basics of AI, engineers developing commercial AI products, researchers developing innovative proposals for image analysis, and of course the funders evaluating those proposals.
Even so, there continue to be challenges that complicate sharing and analysis of imaging data:
Data is spread across a variety of repositories, which means replicating data to bring it together or within reach of tooling (such as cloud-based resources).
Images are often stored in vendor-specific or specialized research formats which complicates analysis workflows and increases maintenance costs.
Lack of a common data model or tooling makes capabilities such as search, visualization, and analysis of data difficult and repository- or dataset-specific.
Achieving reproducibility of the analysis workflows, a critical function in research, is challenging and often lacking in practice.
To address these issues, as part of the Cancer Research Data Commons (CRDC) initiative that establishes the national cancer research ecosystem, NCI launched the Imaging Data Commons (IDC), a cloud-based repository of publicly available cancer imaging data with several key advantages:
Colocation: Image files are curated into Google Cloud Storage buckets, side-by-side with on-demand computational resources and cloud-based tools, making it easier and faster for you to access and analyze.
Format: Images, annotations and analysis results are harmonized into the standard DICOM (Digital Imaging and Communications in Medicine) format to improve interoperability with tools and support uniform processing pipelines.
Tooling: IDC maintains tools that – without having to download anything – allow you to explore and search the data, and visualize images and annotations. You can easily access IDC data from the cloud-based tools available in Google Cloud, such as Vertex AI, Colab, or deploy your own tools in highly configurable virtual environments.
Reproducibility: Sharing reproducible analysis workflows is streamlined through maintaining persistent versioned data that you can use to precisely define cohorts used to train or validate algorithms, which in turn can be deployed in virtual environments that can provide consistent software and hardware configuration.
IDC ingests and harmonizes de-identified data from a growing list of repositories and initiatives, spanning a broad range of image types and scales, cancer types, and manufacturers. A significant portion of these images are accompanied by annotations and clinical data.
For a quick summary of what is available in IDC, check the IDC Portal or this Looker Studio dashboard!
IDC Portal
A great place to start exploring the data is the IDC Portal. From this in-browser portal, you can use some of the key metadata attributes to navigate the images and visualize them.
As an example, here are the steps you can follow to find slide microscopy images for patients with lung cancer:
From the IDC Portal, proceed to “Explore images”.
In the top right portion of the exploration screen, use the summary pie chart to select Chest primary site (you could alternatively select Lung, noting that annotation of cancer location can use different terms).
In the same pie chart summary section, navigate to Modality and select Slide Microscopy.
In the right-hand panel, scroll to the Collections section, which will now list all collections containing relevant images. Select one or more collections using the checkboxes.
Navigate to the Selected Cases section just below, where you will find a list of patients within the selected collections that meet the search criteria.
Next, select a given patient using the checkbox. Navigating to the Selected Studies section just below will now show the list of studies – think of these as specific imaging exams available for this patient. Click the “eye” icon on the far right which will open the viewer allowing you to see the images themselves.
BigQuery Public Dataset
When it’s time to search and select the subsets (or cohorts) of the data that you need to support your analysis more precisely, you’ll head to the public dataset in BigQuery. This dataset contains the comprehensive set of metadata available for the IDC images (beyond the subset contained in the IDC portal), which you can use to precisely define your target data subset with a custom, standard SQL query.
You can run these queries from the in-browser BigQuery Console by creating a BigQuery sandbox. The BigQuery sandbox enables you to query data within the limits of the Google Cloud free tier without needing a credit card. If you decide to enable billing and go above the free tier threshold, you are subject to regular BigQuery pricing. However, we expect most researchers’ needs will fit within this tier.
To get started with an exploratory query, you can select studies corresponding to the same criteria you just used in your exploration of the IDC Portal:
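A sketch of such an exploratory query, run through the BigQuery Python client from a sandbox project. It assumes the bigquery-public-data.idc_current.dicom_all view and its Modality/collection_id columns; check the IDC documentation for the current dataset name, and note that filtering collections by name is only a rough stand-in for the portal's primary-site facet.

```python
from google.cloud import bigquery

client = bigquery.Client(project="your-sandbox-project")

# Slide microscopy (Modality = 'SM') studies in lung-related collections.
sql = """
SELECT DISTINCT collection_id, PatientID, StudyInstanceUID
FROM `bigquery-public-data.idc_current.dicom_all`
WHERE Modality = 'SM'
  AND LOWER(collection_id) LIKE '%lung%'
LIMIT 10
"""
for row in client.query(sql).result():
    print(row.collection_id, row.PatientID, row.StudyInstanceUID)
```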
Alright now you’re ready to write a query that creates precisely defined cohorts. This time we’ll shift from exploring digital pathology images to subsetting Computed Tomography (CT) scans that meet certain criteria.
The following query selects all files, identified by their unique storage path in the gcs_url column, and corresponding to CT series that have SliceThickness between 0 and 1 mm. It also builds a URL in series_viewer_url that you can follow to visualize the series in the IDC Portal viewer. For the sake of this example, the results are limited to only one series.
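A hedged sketch of that query; it assumes the dicom_all schema (gcs_url, SliceThickness, Modality, StudyInstanceUID, SeriesInstanceUID), and the viewer URL pattern shown is illustrative; see the IDC documentation for the exact format.

```python
from google.cloud import bigquery

client = bigquery.Client(project="your-sandbox-project")

sql = """
WITH thin_ct AS (
  SELECT *
  FROM `bigquery-public-data.idc_current.dicom_all`
  WHERE Modality = 'CT'
    AND SAFE_CAST(SliceThickness AS FLOAT64) > 0
    AND SAFE_CAST(SliceThickness AS FLOAT64) < 1
)
SELECT
  gcs_url,
  CONCAT('https://viewer.imaging.datacommons.cancer.gov/viewer/',
         StudyInstanceUID, '?SeriesInstanceUID=', SeriesInstanceUID)
    AS series_viewer_url
FROM thin_ct
-- Limit the result to the files of a single series for this example.
WHERE SeriesInstanceUID = (SELECT MIN(SeriesInstanceUID) FROM thin_ct)
"""
for row in client.query(sql).result():
    print(row.gcs_url, row.series_viewer_url)
```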
As you start to write more complex queries, it will be important to familiarize yourself with the DICOM format, and how it is connected with the IDC dataset. This getting started tutorial is a great place to start learning more.
What can you do with the results of these queries? For example:
You can build the URL to open the IDC Portal viewer and examine individual studies, as we demonstrated in the second query above.
You can learn more about the patients and studies that meet this search criteria by exploring what annotations or clinical data are available alongside these images. The getting started tutorial provides several example queries along these lines.
You can link DICOM metadata describing imaging collections with related clinical information, which is linked when available. This notebook can help in navigating clinical data available for IDC collections.
Finally, you can download all images contained in the resulting studies. Thanks to the support of Google Cloud Public Dataset Program, you are able to download IDC image files from Cloud Storage without cost.
There are several Cloud tools we want to mention that can help in your explorations of the IDC data:
Colab: Colab is a hosted Jupyter notebook solution that allows you to write and share notebooks that combine text and code, download images from IDC, and execute the code in the cloud with a free virtual machine. You can expand beyond the free tier to use custom VMs or GPUs, while still controlling costs with fixed monthly pricing plans. Notebooks can easily be shared with colleagues (such as readers of your academic manuscript). Check out these example Colab notebooks to help you get started.
Vertex AI: Vertex AI is a platform to handle all the steps of the ML workflow. Again, it includes managed Jupyter notebooks, but with more control over the environment and hardware you use. As part of Google Cloud, it also comes with enterprise-grade security, which may be important to your use case, especially if you are joining in your own proprietary data. Its Experiments functionality allows you to automatically track architectures, hyperparameters, and training environments, to help you discover the optimal ML model faster.
Looker Studio: Looker Studio is a platform for developing and sharing custom interactive dashboards. You can create dashboards that focus on a specific subset of the metadata accompanying the images and that cater to users who prefer an interactive interface over SQL queries. As an example, this dashboard provides a summary of IDC data, and this dashboard focuses on the preclinical datasets within the IDC.
Cloud Healthcare API: IDC relies on the Cloud Healthcare API to extract and manage DICOM metadata with BigQuery, and to maintain the DICOM stores that make IDC data available via the standard DICOMweb interface. IDC users can use these tools to store, and provide access to, the artifacts resulting from their analysis of IDC images. As an example, a DICOM store can be populated with the results of image segmentation, which could then be visualized using a user-deployed, Firebase-hosted instance of OHIF Viewer (deployment instructions are available here).
The IDC dataset is a powerful tool for accelerating data-driven research and scientific discovery in cancer prevention, treatment, and diagnosis. We encourage researchers, engineers, and students alike to get started by following the onboarding steps we laid out in this post: familiarize yourselves with the data by heading to the IDC Portal, tailor your cohorts using the BigQuery public dataset, and then download the images to analyze with your on-prem tools, with Google Cloud services, or with Colab. The Getting started with the IDC notebook series should help you get familiar with the resource.
For questions, you can reach the IDC team at support@canceridc.dev, or join the IDC community and post your questions. Also, see the IDC user guide for more details, including official documentation.
Read More for the details.
BigQuery BI Engine is a fast, in-memory analysis system for BigQuery, currently processing over 2 billion queries per month and growing. BigQuery has its roots in Google’s Dremel system and is a data warehouse built with scalability as a goal. BI Engine, on the other hand, was envisioned with data analysts in mind and focuses on providing value on gigabyte to sub-terabyte datasets, with minimal tuning, for real-time analytics and BI purposes.
Using BI Engine is simple – create a memory reservation on the project that runs BigQuery queries, and it will cache data and use the optimizations. This post is a deep dive into how BI Engine helps deliver blazing fast performance for your BigQuery queries and what users can do to leverage its full potential.
BI Engine optimizations
The two main pillars of BI Engine are in-memory caching of data and vectorized processing. Other optimizations include CMETA metadata pruning, single-node processing, and join optimizations for smaller tables.
BI Engine utilizes the “Superluminal” vectorized evaluation engine, which is also used by Procella, the query engine behind YouTube’s analytic data platform. In BigQuery’s row-based evaluation, the engine processes all columns within a row for every row, potentially alternating between column types and memory locations before moving to the next row. In contrast, a vectorized engine like Superluminal processes a block of values of the same type from a single column for as long as possible and only switches to the next column when necessary. This way, hardware can run multiple operations at once using SIMD, reducing both latency and infrastructure costs. BI Engine dynamically chooses the block size to fit into caches and available memory.
For the example query, “SELECT AVG(word_count), MAX(word_count), MAX(corpus_date) FROM samples.shakespeare”, the vectorized plan processes “word_count” separately from “corpus_date”, switching columns only when necessary.
BigQuery is a disaggregated storage and compute engine. Usually the data in BigQuery is stored on Colossus, Google’s distributed file system, most often in blocks in Capacitor format, while the compute is represented by Borg tasks. This is what enables BigQuery’s scaling properties. To get the most out of vectorized processing, BI Engine needs to feed the raw data at CPU speeds, which is achievable only if the data is already in memory. BI Engine runs Borg tasks as well, but its workers are more memory-heavy so they can cache the data as it is read from Colossus.
A single BigQuery query can either be sent to a single BI Engine worker, or sharded and sent to multiple BI Engine workers. Each worker receives a piece of the query to execute, along with the set of columns and rows necessary to answer it. If the data is not cached in the worker’s memory from a previous query, the worker loads the data from Colossus into local RAM. Subsequent requests for the same columns and rows, or a subset of them, are served from memory only. Note that workers unload the contents if the data hasn’t been used for over 24 hours. As multiple queries arrive, they may require more CPU time than is available on a worker; if there is still reservation capacity available, a new worker is assigned to the same blocks, and subsequent requests for those blocks are load-balanced between the workers.
BI Engine can also process super-fresh data that was streamed to the BigQuery table. Therefore, there are two formats supported by BI Engine workers currently – Capacitor and streaming.
Generally, data in a Capacitor block is heavily pre-processed and compressed during generation. There are a number of different ways the data from a Capacitor block can be cached; some are more memory efficient, while others are more CPU efficient. The BI Engine worker intelligently chooses between them, preferring latency- and CPU-efficient formats where possible. As a result, actual reservation memory usage might not match logical or physical storage usage, due to the different caching formats.
Streaming data is stored in memory as blocks of native array-columns and is lazily unloaded when blocks get extracted into Capacitor by underlying storage processes. Note that for streaming, BI workers need to either go to streaming storage every time to potentially obtain new blocks or serve slightly stale data. BI Engine prefers serving slightly stale data and loading the new streaming blocks in the background instead.
The BI Engine worker does this opportunistically during queries: if the worker detects streaming data and the cache is newer than one minute, a background refresh is launched in parallel with the query. In practice, this means that with enough requests the data is no more stale than the previous request time. For example, if a request arrives every second, then the streaming data will be around a second stale.
Due to these read-time optimizations, loading data from previously unseen columns can take longer than it does in BigQuery. Subsequent reads then benefit from the optimizations.
For example, for the query above, comparing the backend time of a sample run with BI Engine off, the first BI Engine run, and a subsequent run shows this effect.
BI Engine workers are optimized for BI workloads where the output size will be small compared to the input size and the output will be mostly aggregated. In regular BigQuery execution, a single worker tries to minimize data loading due to network bandwidth limitations. Instead, BigQuery relies on massive parallelism to complete queries quickly. On the other hand, BI Engine prefers to process more data in parallel on a single machine. If the data has been cached, there is no network bandwidth limitation and BI Engine further reduces network utilization by reducing the number of intermediate “shuffle” layers between query stages.
With small enough inputs and a simple query, the entire query will be executed on a single worker and the query plan will have a single stage for the whole processing. We constantly work on making more tables and query shapes eligible for a single stage processing, as this is a very promising way to improve the latency of typical BI queries.
For the example query, which is very simple and runs over a very small table, a sample run with BI Engine distributed execution can be compared against single-node execution (the default).
How to get the most out of BI Engine
While we all want a switch that we can toggle and everything becomes fast, there are still some best practices to think about when using BI Engine.
BI optimizations assume human eyes on the other side, and that the output data is small enough to be comprehensible by a human. This limited output size is achieved by selective filters and aggregations. As a corollary, instead of SELECT * (even with a LIMIT), a better approach is to request only the fields you are interested in, with appropriate filters and aggregations.
To show this with an example: the query “SELECT * FROM samples.shakespeare” processes about 6MB and takes over a second with both BigQuery and BI Engine. If we add MAX to every field (“SELECT MAX(word), MAX(word_count), MAX(corpus), MAX(corpus_date) FROM samples.shakespeare”), both engines will read all of the data, perform some simple comparisons, and finish 5 times faster on BigQuery and 50 times faster on BI Engine.
BI Engine uses query filters to narrow down the set of blocks to read, so partitioning and clustering your data will reduce the amount of data read, the latency, and the slot usage. One caveat: “over-partitioning”, or having too many partitions, can interfere with BI Engine multi-block processing. For optimal BigQuery and BI Engine performance, partitions larger than one gigabyte are preferred.
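As a rough illustration (the dataset, table, and column names here are hypothetical), a table can be created with both partitioning and clustering so that date filters prune partitions and filters on the clustered columns reduce the blocks read:

-- Hypothetical example: partition by day, cluster by common filter columns.
CREATE TABLE mydataset.page_events
PARTITION BY DATE(event_timestamp)
CLUSTER BY country, device_type
AS
SELECT event_timestamp, country, device_type, page_path, latency_ms
FROM mydataset.raw_page_events;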
BI Engine currently accelerates the stages of the query that read data from the table, which are typically the leaves of the query execution tree. In practice this means that almost every query will still use some BigQuery slots. That’s why you get the most speedup from BI Engine when a lot of time is spent on leaf stages. To mitigate this, BI Engine tries to push as many computations as possible into the first stage, and ideally to execute them on a single worker, where the tree is just one node.
For example, Query 1 of the TPC-H 10G benchmark is relatively simple: it is three stages deep, with efficient filters and aggregations, and it processes 30 million rows but outputs just one.
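For reference, this is the canonical formulation of TPC-H Query 1, written here in BigQuery SQL against a hypothetical lineitem table; the benchmark run described below may use a slightly different variant.

SELECT
  l_returnflag,
  l_linestatus,
  SUM(l_quantity) AS sum_qty,
  SUM(l_extendedprice) AS sum_base_price,
  SUM(l_extendedprice * (1 - l_discount)) AS sum_disc_price,
  SUM(l_extendedprice * (1 - l_discount) * (1 + l_tax)) AS sum_charge,
  AVG(l_quantity) AS avg_qty,
  AVG(l_extendedprice) AS avg_price,
  AVG(l_discount) AS avg_disc,
  COUNT(*) AS count_order
FROM mydataset.lineitem  -- hypothetical location of the TPC-H lineitem table
WHERE l_shipdate <= DATE_SUB(DATE '1998-12-01', INTERVAL 90 DAY)
GROUP BY l_returnflag, l_linestatus
ORDER BY l_returnflag, l_linestatus;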
Running this query in BI Engine, we see that the full query took 215 ms, with the “S00: Input” stage (the one accelerated by BI Engine) taking 26 ms.
Running the same query in BigQuery takes 583 ms, with “S00: Input” taking 229 ms.
What we see here is that the “S00: Input” stage run time went down roughly 8x, but the overall query did not get 8x faster, because the other two stages were not accelerated and their run time remained roughly the same.
In a perfect world, where BI Engine processes its part in 0 milliseconds, the query will still take 189ms to complete. So the maximum speed gain for this query is about 2-3x.
If we make this query heavier on the first stage, for example by running TPC-H 100G instead, we see that BI Engine finishes the query 6x faster than BigQuery’s roughly 1 second, while the first stage is 30 times faster!
Over time, our goal is to expand the eligible query and data shapes and collapse as many operations as feasible into a single BI Engine stage to realize maximum gains.
As previously noted, BI Engine accelerates “leaf” stages of the query. However, there is one very common pattern used in BI tools that BI Engine optimizes. It’s when one large “fact” table is joined with one or more smaller “dimension” tables. Then BI Engine can perform multiple joins, all in one leaf stage, using so-called “broadcast” join execution strategy.
During the broadcast join, the fact table is sharded to be executed in parallel on multiple nodes, while the dimension tables are read on each node in their entirety.
For example, let’s run Query 3 from the TPC-DS 1G benchmark. The fact table is store_sales and the dimension tables are date_dim and item. In BigQuery, the dimension tables are loaded into shuffle first; then the “S03: Join+” stage, for every parallel part of store_sales, reads all necessary columns of the two dimension tables, in their entirety, to perform the join.
Note that the filters on date_dim and item are very efficient, and the 2.9M-row fact table is joined with only about 6,000 rows. The BI Engine plan will look a bit different, as BI Engine caches the dimension tables directly, but the same principle applies.
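For reference, the query has the classic star-schema shape. This is a sketch of TPC-DS Query 3 with hypothetical dataset paths; the literal filter values vary between benchmark runs.

SELECT
  dt.d_year,
  item.i_brand_id AS brand_id,
  item.i_brand AS brand,
  SUM(ss.ss_ext_sales_price) AS sum_agg
FROM mydataset.date_dim AS dt          -- hypothetical dataset paths
JOIN mydataset.store_sales AS ss ON dt.d_date_sk = ss.ss_sold_date_sk
JOIN mydataset.item AS item ON ss.ss_item_sk = item.i_item_sk
WHERE item.i_manufact_id = 128         -- selective dimension filters
  AND dt.d_moy = 11
GROUP BY dt.d_year, item.i_brand_id, item.i_brand
ORDER BY dt.d_year, sum_agg DESC, brand_id;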
For BI Engine, let’s assume that two nodes will process the query because the store_sales table is too big for single-node processing. Both nodes will perform similar operations: reading the data, filtering, building the lookup table, and then performing the join. While each node processes only a subset of the store_sales table, all operations on the dimension tables are repeated on every node.
Note that:
the “build lookup table” operation is very CPU-intensive compared to filtering;
“join” operation performance also suffers if the lookup tables are large, as it interferes with CPU cache locality;
dimension tables need to be replicated to each “block” of the fact table.
The takeaway is that when a join is performed by BI Engine, the fact table is sometimes split across different nodes, while all other tables are copied to every node to perform the join. Keeping dimension tables small, or filtering them selectively, helps ensure join performance stays optimal.
Summarizing everything above, there are a few things one can do to make full use of BI Engine and make queries faster:
Less is more when it comes to data returned – make sure to filter and aggregate as much data as possible early in the query. Push down filters and computations into BI Engine.
Queries with a small number of stages get the best acceleration. Preprocessing the data to minimize query complexity will help with optimal performance; for example, using materialized views can be a good option (see the sketch after this list).
Joins are sometimes expensive, but BI Engine may be very efficient in optimizing typical star schema queries.
It’s beneficial to partition and/or cluster the tables to limit the amount of data to be read.
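As a minimal sketch of the materialized view suggestion above (the dataset, table, and column names are hypothetical), a materialized view can pre-aggregate a large table so that dashboard queries read a much smaller, single-stage-friendly input:

-- Hypothetical example: pre-aggregate raw sales into a daily summary.
CREATE MATERIALIZED VIEW mydataset.daily_sales_mv AS
SELECT
  DATE(order_timestamp) AS order_date,
  store_id,
  SUM(sale_amount) AS total_sales,
  COUNT(*) AS order_count
FROM mydataset.sales
GROUP BY DATE(order_timestamp), store_id;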
Special thanks to Benjamin Liles, Software Engineer for BI Engine, and Deepak Dayama, Product Manager for BI Engine, for contributing to this post.
Read More for the details.
Editor’s note: February is Black History Month—a time for us to come together to celebrate the diverse identities and perspectives that make up the Black experience. Over the next few weeks, we will highlight Black-led startups and how they use Google Cloud to grow their businesses. This feature highlights Valence Discovery co-founder Therence Bois who’s transforming drug discovery with artificial intelligence and advanced deep-learning technologies.
Despite significant technological advances in the last few decades, many diseases are still untreated, leaving patients with inadequate treatment options. Although innovative drug therapies may improve quality of life, developing new medications is expensive and time-consuming, and fraught with failure. To help people get the medicines they need faster, pharmaceutical companies and innovative biotechs are turning to advanced artificial intelligence (AI) and machine learning (ML) technologies to significantly accelerate R&D and improve success rates in the path to developing new treatments.
AI and ML algorithms are typically most powerful when there is a lot of training data. In biomedical discovery, this corresponds to well-studied and well-understood diseases, where patients’ unmet needs may be smaller than in more novel and underserved areas of biology. Crucially, however, the large and robust datasets needed to design and discover novel treatments are often limited in these novel disease areas, rendering them unfit for disruption with advanced computational tools.
We founded Valence Discovery (Valence) to solve these challenges. Our ML platform for molecular design and optimization uses deep learning techniques specifically adapted for the sparse and incomplete biomedical datasets found in these high-value but intractable areas of biology, allowing scientists to design effective therapeutic candidates against their disease of interest.
Valence partners with drug discovery groups of all shapes and sizes, from academic labs to leading contract research organizations like Charles River Labs, to support them with AI-enabled drug discovery capabilities. Valence is committed to pushing the field of ML research in drug discovery forward, and democratizing access to advanced computational methods for drug design with the aim of improving human health.
Valence began as a PhD biotech project at the Canadian AI research institute Mila where our founding team built deep learning tools for drug discovery and design. While launching our startup, we knew we couldn’t accelerate time to market without a scalable cloud-based platform that included built-in AI and ML capabilities. We looked at the options and identified Google Cloud, including the Google for Startups Cloud Program, as the best choice for an AI startup such as ourselves and as the right partner to help us cost-effectively scale Valence.
We also needed a reliable technology partner to help manage and optimize applications and infrastructure so our small, dedicated team could focus on creating new AI tools for developing innovative drug therapies. Since migrating to the secure-by-design infrastructure of Google Cloud, we’ve significantly reduced expenses, shifted more resources to R&D, and accelerated the launch of new AI-enabled drug design tools.
Specifically, we use Google Kubernetes Engine (GKE) to quickly train and deploy large-scale ML models and molecular dynamics simulations, while leveraging Cloud Storage to efficiently store important datasets and machine learning artifacts. We also leverage Cloud Run, Google Cloud’s serverless product, to run our various drug discovery backend servers at scale.
The Google for Startups Cloud Program supported the launch of Valence by giving us immediate access to Google Cloud credits, which we used to cost-effectively trial and support our research activities.
Our Valence success story highlights the importance of asking others for help and guidance early on. By working closely with Google for Startups experts and our partners in the pharmaceutical industry, we’ve avoided technical and business missteps, which has allowed us to bring our ideas to market faster.
I encourage other Black founders to learn more about the Google for Startups Accelerator: Black Founders, along with other organizations such as Black Founders, Black Girl Ventures, Act House, and Lightship Bootcamp.
The Google for Startups Cloud Program, as well as the holistic support Google has given us, has been invaluable to our success. Since joining, we’ve been featured in industry publications such as Pharmaceutical Technology, expanded our partnerships, and received additional rounds of funding from investors. We can’t wait to see what we accomplish next as we empower pharmaceutical companies to develop innovative drug therapies that dramatically improve quality of life for people worldwide.
If you want to learn more about how Google Cloud can help your startup, visit our page here to get more information about our program, and sign up for our communications to get a look at our community activities, digital events, special offers, and more.
Read More for the details.
Editor’s note: Today’s post is from Neil Craig at the British Broadcasting Corporation (BBC), the national broadcaster of the United Kingdom. Neil is part of the BBC’s Digital Distribution team, which is responsible for building services such as the public-facing www.bbc.co.uk and bbc.com websites and for ensuring they can scale and operate reliably.
The BBC’s public-facing websites inform, educate, and entertain over 498 million adults per week across the world. Because breaking news is so unpredictable, we need a core content delivery platform that can easily scale in response to sudden surges in traffic.
To this end, we recently began running our log-processing infrastructure on a Google Cloud serverless platform. We’ve found the new system, based on Cloud Run and BigQuery, to be more reliable, scalable, and cost-effective than our previous infrastructure. And — news flash! — it also freed our team from having to do a lot of manual labor, and opened the door to being more collaborative and data-driven.
To operate the site and ensure our services run smoothly, we continually monitor Traffic Manager and CDN access logs. Our websites generate more than 3B log lines per day and handle large data bursts during major news events; on a busy day our system handles over 26B log lines.
As initially designed, we stored log data in a Cloud Storage bucket. But every time we needed to access that data, we had to download terabytes of logs to a virtual machine (VM) with a large amount of attached storage, and use the ‘grep’ tool to search and analyze them. From beginning to end, this took several hours. On heavy news days, the time lag made it difficult for the engineering team to do their jobs.
We needed a more efficient way to make this log data available, so we designed and deployed a new system that processes logs as they arrive and reacts to spikes more efficiently, significantly improving the timeliness of critical information.
In this new system, we still leverage Cloud Storage buckets, but on arrival, each log generates an event using EventArc. That event triggers Cloud Run to validate, transform and enrich various pieces of information about the log file such as filename, prefix, and type, then processes it and outputs the processed data as a stream into BigQuery. This event-driven design allows us to process files quickly and frequently — processing a single log file typically takes less than a second. Most of the files that we feed into the system are small, fewer than 100 Megabytes, but for larger files, Cloud Run automatically creates additional parallel instances very quickly, helping the system scale almost magically.
And because we’re, erm, lucky, and get frequent distributed denial-of-service attacks (free load tests!), we’re confident in the system’s ability to handle significant traffic. For example, not long before the announcement of the Queen’s passing in September, we had an attack that generated a colossal traffic spike. Within one minute, we went from running 150 to 200 container instances to over 1,000, and the infrastructure just worked. Because we engineered the log processing system to rely on the elasticity of a serverless architecture, we knew from the get-go that it would be able to handle this type of scaling.
Our initial concern about choosing serverless was cost. It turns out that using Cloud Run is significantly more cost-effective than running the number of VMs we would need for a system that could survive reasonable traffic spikes with a similar level of confidence.
It’s also saved us a lot of time. We picked Cloud Run intentionally because we wanted a system that could scale well without manual intervention. As the digital distribution team, our job is not to do ops. We leave that to Google, the experts. The new system is massively more reliable and cost-effective, but it’s also easier for us to build and maintain.
What surprises new team members who aren’t familiar with Google Cloud is how easy it is to fit all the pieces together. Google’s inter-service auth is automatically managed and really simple to configure. When the Cloud Run service writes to BigQuery or reads from Cloud Storage, I tell it to use OIDC auth (which it manages automatically via Service Account permissions), import the client library — and it just works. Another example is pushing events into Cloud Run, where we can configure Cloud Run authorization to only accept events from specific EventArc triggers, so it is fully private.
Going forward, the new system has also opened up many opportunities for the BBC organization to make better use of its data. For example, thanks to BigQuery’s per-column permissions, we can easily open up access to our logs to other engineering teams, without having to worry about sharing PII that’s restricted to approved users.
The goal of our team is to empower all teams within the BBC to get the content they want on the web when they want it, and to make it reliable, secure, and scalable. Google Cloud serverless products have helped us achieve these goals with relatively little effort, and they require significantly less management than previous generations of technology.
Read More for the details.