In Deloitte’s annual “State of AI in the Enterprise” survey, 94% of business leaders identified AI as critical to their organizations’ success over the next five years. The same survey also found a 29% increase in the number of organizations struggling to achieve meaningful AI-driven business outcomes. Part of the challenge lies in capitalizing on existing data in the many formats spread throughout the organization: up to 80% of enterprise information assets are scattered across text, PDFs, emails, web pages, and other unstructured formats. Valuable insights sit embedded in contracts, buried in patient files, recorded in chat transcripts, and noted in EHR/CRM text fields, yet this unstructured data often goes untapped because business leaders may be unaware of its value or unsure how to leverage it.
Challenges: The need to put unstructured data to use more rapidly
Accessing data across various locations and file types and then operationalizing that data for AI usage is usually a cumbersome, manual, time-consuming, and costly process. Individually labeling files to build an adequate dataset to train a machine learning (ML) model is notoriously slow, while human errors and inconsistencies also tend to degrade data quality and negatively impact ML model performance.
Often, analyzing enterprise data requires the expertise of analysts, clinicians, lawyers, or other domain-specific experts. In highly regulated industries such as financial services and healthcare, privacy regulations, standards, and other access restrictions make it even harder to put unstructured data to use.
Solution approach
Snorkel AI has teamed with Google Cloud to help organizations transform raw, unstructured data into a format that can be used to train AI models that deliver actionable insights and support decision making. By combining Google Cloud services such as BigQuery and Vertex AI with Snorkel AI’s data-centric AI platform for programmatic data curation and preparation, organizations can accelerate AI development 10-100x [1]. Tapping into the value of unstructured data stored in BigQuery, and making that data ready for ML training, lets enterprises incorporate all of their data types into AI model development.
Snorkel AI’s data-centric approach unlocks new ways of preparing ML training workloads
Snorkel AI addresses one of the biggest blockers to AI development: the massive hand-labeled training datasets needed for supervised training of ML models. Snorkel AI overcomes this bottleneck with a programmatic labeling approach implemented in Snorkel Flow, its data-centric AI platform.
Data science and ML teams write labeling functions in Snorkel Flow that label data programmatically, drawing on business logic, heuristics encoded from subject matter experts, foundation models used to generate candidate labels, and existing resources such as previously labeled datasets, even imperfect ones. Snorkel Flow combines these multiple data and knowledge sources to label large quantities of unstructured data at scale.
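To make the idea concrete, here is a minimal sketch of the labeling-function pattern, written against the open-source Snorkel library that the platform builds on. The contract-classification task, label names, and keyword heuristics are hypothetical illustrations; Snorkel Flow provides richer, platform-native templates for the same concept.

```python
# Minimal sketch of programmatic labeling functions using the open-source
# Snorkel library's API. The task (contract classification), labels, and
# keyword heuristics are hypothetical, not a Snorkel Flow recipe.
from snorkel.labeling import labeling_function

# Label constants for a hypothetical two-class contract task.
ABSTAIN, NDA, MSA = -1, 0, 1

@labeling_function()
def lf_mentions_nondisclosure(x):
    # Business-logic heuristic: non-disclosure language suggests an NDA.
    return NDA if "non-disclosure" in x.text.lower() else ABSTAIN

@labeling_function()
def lf_mentions_master_services(x):
    # Keyword heuristic encoded from a subject matter expert.
    return MSA if "master services agreement" in x.text.lower() else ABSTAIN
```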
In addition to data scientists, other users in the ML lifecycle, such as ML engineers, can leverage Snorkel Flow’s integrated error analysis and model-guided feedback mechanisms to rapidly improve training data quality and model performance, and so develop more accurate AI applications.
The data-centric AI workflow within Snorkel Flow operates as follows:
Data scientists, ML engineers, and subject matter experts programmatically label large amounts of data in minutes to hours by creating labeling functions.
From these labeling functions, Snorkel Flow generates a probabilistically labeled dataset that is used to train a model within the platform.
Next, data scientists use guided error analysis to examine where the model underperforms and to identify gaps that call for more targeted labeling work. In other words, they focus on the places where the model is most wrong, on particular high-value examples, or on commonly confused classes of data.
Users then iterate on these gaps with internal experts, refining or adding labeling functions as needed to label even more data, which they feed back into the model for another round of training and analysis.
Users repeat this iteration even after deploying a model and monitoring a slice of production data.
As a result of this loop, the metrics improvements in an AI application are often orders of magnitude greater than what can be achieved with model-centric AI and hand-labeled data.
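The core of this loop can be sketched with the open-source Snorkel library: apply the labeling functions, combine their noisy votes into probabilistic labels, and inspect coverage and conflicts to decide where new or refined labeling functions are needed. Snorkel Flow provides guided, integrated versions of these steps; the `df_train` DataFrame and the labeling functions from the earlier sketch are assumed here.

```python
# Sketch of one pass through the data-centric iteration loop using the
# open-source Snorkel library. Assumes df_train (a pandas DataFrame with a
# `text` column) and the labeling functions from the previous sketch.
from snorkel.labeling import PandasLFApplier, LFAnalysis
from snorkel.labeling.model import LabelModel

lfs = [lf_mentions_nondisclosure, lf_mentions_master_services]

# 1. Apply the labeling functions to the unlabeled training data.
applier = PandasLFApplier(lfs=lfs)
L_train = applier.apply(df=df_train)

# 2. Combine the noisy, overlapping LF votes into probabilistic labels.
label_model = LabelModel(cardinality=2, verbose=True)
label_model.fit(L_train=L_train, n_epochs=500, seed=42)
probs_train = label_model.predict_proba(L=L_train)

# 3. Review LF coverage, overlaps, and conflicts to target the next round of
#    labeling-function refinement (guided error analysis in Snorkel Flow).
print(LFAnalysis(L=L_train, lfs=lfs).lf_summary())
```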
Solution details
Unified access to data stored on Google Cloud
With training data curation and preparation unblocked via programmatic labeling of unstructured data, data scientists can harness the full power of Google’s end-to-end BigQuery ML and/or Vertex AI platforms to fast-track the development of analytics and AI applications. Google Cloud customers can easily deploy Snorkel Flow on their Google Cloud infrastructure using Google Kubernetes Engine (GKE), then consume unstructured, semi-structured, or structured data from Google Cloud data services such as BigQuery and Google Cloud Storage (GCS). See the figure below for data sources and integrations.
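As an example of that ingestion path, the snippet below pulls document text out of BigQuery into a pandas DataFrame using the BigQuery Python client. The project, dataset, table, and column names are hypothetical placeholders; in practice, Snorkel Flow’s built-in connectors perform this step from within the platform.

```python
# Minimal sketch of reading unstructured text from BigQuery for labeling.
# Project, dataset, table, and column names are hypothetical placeholders.
from google.cloud import bigquery

client = bigquery.Client(project="my-gcp-project")

query = """
    SELECT document_id, document_text AS text
    FROM `my-gcp-project.contracts.raw_documents`
"""

# Materialize the query results as a pandas DataFrame, ready to serve as
# df_train in the labeling sketches above.
df_train = client.query(query).to_dataframe()
print(df_train.head())
```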
BigQuery is a serverless, cost-effective, and cross-cloud analytics data warehouse built to address the needs of data-driven organizations. BigQuery breaks down silos across clouds, allowing enterprises to centralize all of their data – structured, semi-structured, and unstructured – in a single secure repository. BigQuery’s support for unstructured data includes built-in capabilities to secure, govern, and share that data.