# Extract data
Data extraction is the process of pulling data from different applications. Workato supports bulk and batch data extraction. Data extraction is the first step you perform in ETL (Extract, Transform, Load) and ELT (Extract, Load, Transform) processes.
# Event or trigger-based extraction
Event or trigger-based extraction in Workato enables you to initiate data extraction processes in response to specific events within source systems. This method is efficient for scenarios where data must be processed immediately after an event occurs, such as the creation of a new record or the update of an existing record.
You can set up event or trigger-based data extraction to monitor events in real-time or at specified intervals to ensure data is captured and processed with minimal latency. This is crucial for maintaining up-to-date data across systems, enabling real-time analytics and triggering automated workflows.
The following examples demonstrate how to use event or trigger-based extraction to extract data in bulk and in batches from different sources, including SaaS apps, databases, and file systems.
# SaaS
# Sample recipe: Batch extraction from Salesforce
This recipe extracts new/updated records in batches from a source system (Salesforce) and loads them into a data warehouse (Snowflake).
Batch extraction from Salesforce
# Recipe walkthrough
1. Configure the New/updated records in Salesforce batch trigger. This trigger monitors Salesforce for changes and exports records in batches when they are added or updated.
2. Upsert the data extracted from Salesforce into Snowflake using the Upsert batch of rows action.
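Outside of Workato, the upsert step is conceptually equivalent to staging the batch and running a MERGE statement against the target table. The following Python sketch illustrates that idea with the Snowflake Python connector; the table, columns, and sample record are hypothetical placeholders, not the recipe's actual configuration.

```python
# Sketch only: stage a batch of extracted Salesforce records in a temporary
# table, then MERGE them into the target table. Table and column names
# (ACCOUNTS, ID, NAME, UPDATED_AT) are hypothetical.
import snowflake.connector

def upsert_batch(conn, records):
    """Upsert a batch of Salesforce records into ACCOUNTS, keyed on ID."""
    with conn.cursor() as cur:
        cur.execute("CREATE TEMPORARY TABLE ACCOUNTS_STAGE LIKE ACCOUNTS")
        cur.executemany(
            "INSERT INTO ACCOUNTS_STAGE (ID, NAME, UPDATED_AT) VALUES (%s, %s, %s)",
            [(r["Id"], r["Name"], r["LastModifiedDate"]) for r in records],
        )
        cur.execute("""
            MERGE INTO ACCOUNTS t
            USING ACCOUNTS_STAGE s ON t.ID = s.ID
            WHEN MATCHED THEN UPDATE SET t.NAME = s.NAME, t.UPDATED_AT = s.UPDATED_AT
            WHEN NOT MATCHED THEN INSERT (ID, NAME, UPDATED_AT)
                                  VALUES (s.ID, s.NAME, s.UPDATED_AT)
        """)

conn = snowflake.connector.connect(
    account="my_account", user="my_user", password="...",
    warehouse="MY_WH", database="MY_DB", schema="PUBLIC",
)
upsert_batch(conn, [{"Id": "001xx0001", "Name": "Acme", "LastModifiedDate": "2024-07-02"}])
```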
# Databases
# Sample recipe: Scheduled bulk extraction from a database
This example demonstrates how to extract data from databases in bulk. This recipe extracts data from PostgreSQL and loads it into Snowflake.
Bulk extraction from PostgreSQL
# Supported databases
Several of Workato's database connectors, including PostgreSQL, support bulk actions.
# Recipe walkthrough
1. Set up a Scheduler trigger and determine the frequency of bulk extractions.
2. Export data in bulk using the Export query result from PostgreSQL action.
3. Upload the file to an external staging location, such as an S3 bucket, using the AWS S3 Upload file action.
4. Load the exported data into a table in Snowflake using the Bulk load to table from stage action.
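The sketch below shows roughly what this pipeline does outside of Workato, assuming hypothetical table, bucket, stage, and credential values: export a query result from PostgreSQL to a CSV file, upload it to S3, and bulk load it from an external stage with COPY INTO. The scheduling step would wrap this in whatever job scheduler you use.

```python
# Sketch only: table, bucket, stage, and credential values are placeholders.
import boto3
import psycopg2
import snowflake.connector

# 1. Export the query result from PostgreSQL to a local CSV file.
pg = psycopg2.connect(host="pg-host", dbname="sales", user="etl", password="...")
with pg, pg.cursor() as cur, open("/tmp/orders.csv", "w") as f:
    cur.copy_expert("COPY (SELECT * FROM orders) TO STDOUT WITH CSV HEADER", f)

# 2. Upload the exported file to an external staging location (an S3 bucket).
boto3.client("s3").upload_file("/tmp/orders.csv", "my-staging-bucket", "exports/orders.csv")

# 3. Bulk load from the external stage into the target table. Assumes an
#    existing stage ORDERS_STAGE that points at s3://my-staging-bucket/exports/.
sf = snowflake.connector.connect(account="my_account", user="my_user", password="...",
                                 warehouse="MY_WH", database="MY_DB", schema="PUBLIC")
with sf.cursor() as cur:
    cur.execute("""
        COPY INTO ORDERS
        FROM @ORDERS_STAGE/orders.csv
        FILE_FORMAT = (TYPE = CSV SKIP_HEADER = 1)
    """)
```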
# Sample recipe: Batch extraction from a database
This example demonstrates how to extract data from databases in batches and load it into a destination. Specifically, it extracts data from PostgreSQL and loads it into Snowflake.
# Supported databases
All of Workato's database connectors support batch extraction.
# Recipe walkthrough
1. Export new/updated records in batches from PostgreSQL using the New/updated rows in PostgreSQL trigger.
2. Upsert the exported rows into Snowflake using the Upsert batch of rows in Snowflake action.
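Conceptually, a new/updated rows trigger polls the source table for rows modified since a stored watermark. The following sketch shows that pattern against PostgreSQL with hypothetical table and column names; the returned rows would then be upserted into Snowflake in batches, as in the earlier MERGE sketch.

```python
# Sketch only: poll PostgreSQL for rows modified since a stored watermark.
# The table and column names (orders, updated_at, id) are hypothetical.
import psycopg2

def fetch_new_or_updated(conn, last_seen, batch_size=1000):
    """Return the next batch of new/updated rows plus the advanced watermark."""
    with conn.cursor() as cur:
        cur.execute(
            """
            SELECT id, customer, total, updated_at
            FROM orders
            WHERE updated_at > %s
            ORDER BY updated_at, id
            LIMIT %s
            """,
            (last_seen, batch_size),
        )
        rows = cur.fetchall()
    # Advance the watermark to the last row's updated_at so the next poll
    # only picks up rows modified after this batch.
    new_watermark = rows[-1][3] if rows else last_seen
    return rows, new_watermark
```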
# File systems
File connectors in Workato enable you to extract data from file systems using download actions that support streaming. Pairing these actions with a downstream system that supports bulk loads lets you extract data from file systems and load it into databases or data warehouses.
You can also pair download actions with bulk actions in app connectors to extract data from file systems and load entire files into applications.
# Sample recipe: Bulk extract from FileStorage and load to Snowflake
This example demonstrates how to load aggregated data from Workato FileStorage into a data table in a data warehouse (Snowflake).
Extract data in bulk from FileStorage and load to Snowflake
# Recipe walkthrough
1. Use the New file in Workato FileStorage trigger to monitor a specific directory for new files. A new recipe job begins when files are added to the directory.
2. Upload the files in bulk to Snowflake using the Upload file to internal stage action.
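For reference, uploading a file to a Snowflake internal stage corresponds to Snowflake's PUT command, as in this minimal sketch. The stage name, file path, and connection values are hypothetical placeholders.

```python
# Sketch only: the stage name, file path, and credentials are placeholders.
import snowflake.connector

sf = snowflake.connector.connect(account="my_account", user="my_user", password="...",
                                 warehouse="MY_WH", database="MY_DB", schema="PUBLIC")
with sf.cursor() as cur:
    # PUT uploads (and by default compresses) the local file into the internal stage.
    cur.execute("PUT file:///tmp/sales_2024_07.csv @MY_INTERNAL_STAGE AUTO_COMPRESS=TRUE")
```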
# Sample recipe: Bulk extract data from SFTP to Snowflake
This example demonstrates how to extract data in bulk from SFTP and load it into Snowflake.
Extract data in bulk from SFTP and load to Snowflake
# Supported connectors
The following connectors support bulk downloads.
All file connectors:
- On-prem files
- Workato FileStorage
- SFTP
- FTP/FTPS
- Google Drive
- Microsoft OneDrive
- Microsoft SharePoint
- Box
- Dropbox
- BIM 360
- Egnyte
All data lake connectors are also supported.
# Recipe walkthrough
1. Configure the New file trigger to monitor SFTP for new files. A new recipe job begins when new files are detected on the partner server.
2. Fetch the file contents and stream them to the next action using the Download file from SFTP server action.
3. Upload the streamed data to Snowflake's internal stage using the Upload file to internal stage in Snowflake action.
4. Use the Bulk load data to table from stage in Snowflake action to load the data from the internal stage into the target table.
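As a rough equivalent outside of Workato, the flow below downloads a file over SFTP, stages it in Snowflake with PUT, and loads it with COPY INTO. The host, credentials, file paths, stage, and table names are hypothetical.

```python
# Sketch only: host, credentials, paths, stage, and table names are placeholders.
import paramiko
import snowflake.connector

# 1. Download the new file from the partner's SFTP server.
transport = paramiko.Transport(("sftp.partner.example.com", 22))
transport.connect(username="etl", password="...")
sftp = paramiko.SFTPClient.from_transport(transport)
sftp.get("/outbound/orders.csv", "/tmp/orders.csv")
sftp.close()
transport.close()

# 2. Stage the file in Snowflake, then 3. bulk load it into the target table.
sf = snowflake.connector.connect(account="my_account", user="my_user", password="...",
                                 warehouse="MY_WH", database="MY_DB", schema="PUBLIC")
with sf.cursor() as cur:
    cur.execute("PUT file:///tmp/orders.csv @ORDERS_STAGE AUTO_COMPRESS=TRUE")
    # AUTO_COMPRESS adds a .gz suffix to the staged file name.
    cur.execute("""
        COPY INTO ORDERS
        FROM @ORDERS_STAGE/orders.csv.gz
        FILE_FORMAT = (TYPE = CSV SKIP_HEADER = 1)
    """)
```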
# Custom extraction
Custom extractions enable precise data control in Workato by allowing you to specify extraction criteria through custom queries or scripts. This method offers flexibility for complex scenarios but requires consideration of complexity, maintenance, performance, resource use, error handling, and security.
# Replication pipelines
Data replication pipelines ensure that data is consistently copied from one system to another, maintaining accuracy and consistency across your data landscape. These pipelines are essential for real-time analytics, backup, and disaster recovery.
# Set up a pipeline for replication
Setting up a replication pipeline involves several key features to ensure efficiency and reliability:
Change Data Capture (CDC) with triggers
Workato's triggers are configured to capture data changes quickly, with minimal impact on source systems. This is crucial for maintaining up-to-date data replication in real-time or near-real-time scenarios.
Schema replication
Workato can replicate schemas from source systems to destinations without requiring you to define destination schemas manually. This simplifies the setup process and ensures that the data structure is maintained during replication.
Scalability
The platform can scale elastically based on demand, providing the necessary backend resources to handle high-volume data loads.
Bulk/Batch processing
Workato supports both bulk and batch processing, which is essential for handling large volumes of data efficiently. Bulk processing allows for large amounts of data to be processed in a single job, which is particularly suited for ETL and ELT processes.
# Schema drift
Schema drift refers to changes in the structure or schema of a dataset over time. In a data orchestration context, schema drift occurs when the structure of the source data changes after a data orchestration process is implemented. These changes can include additions, deletions, or modifications of fields, data types, or other schema elements.
Schema drift can pose challenges for data orchestration processes because it can lead to inconsistencies between the source and target systems. If not properly managed, schema drift can result in data transformation errors, data loss, or incorrect data analysis.
# How Workato detects and manages schema drift
Workato detects and manages schema drift through automated schema detection and adaptation. Workato can also regularly monitor and validate data structures and send you notifications when schema changes are detected. This enables you to manually intervene when necessary.
# Replicate schema actions for Snowflake, BigQuery, and SQL Server
Workato provides Replicate schema actions for working with data in Google BigQuery, SQL Server, and Snowflake.
The following example highlights the steps to inspect the schema of a data source against a table in Snowflake. The action compares the source schema with the destination table and alters the table as required to match the source's columns and data types. It is a shortened version of the steps available in the Replicate Schema action documentation.
How it works:
- If the specified destination table does not exist, it is created in Snowflake automatically.
- Workato compares the source data's schema against the destination Snowflake table.
- If columns exist in the source data but not in the destination table, the missing columns are added to the destination table with DDL commands (ALTER TABLE).
- New columns follow the data type that was provided (applies only to the Schema source type).
- Columns can only be added, never removed. If a column exists in the destination table but not in the source data, no action is taken.
- The ordering of columns follows the source data exactly.
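The following sketch illustrates the behavior described above; it is not Workato's internal implementation. It creates the table if it is missing, then adds any source columns that the destination lacks, defaulting to VARCHAR when no type is provided. The table and column names are hypothetical, and `cur` is assumed to be a Snowflake cursor.

```python
# Sketch only: illustrates the add-missing-columns behavior, not Workato's
# internal implementation. Names are placeholders; columns default to VARCHAR
# when no type is provided, mirroring the action's behavior.
def replicate_schema(cur, table, source_columns):
    """source_columns: list of (column_name, column_type) pairs from the source."""
    # Create the destination table if it does not exist.
    ddl_cols = ", ".join(f'"{name}" {ctype or "VARCHAR"}' for name, ctype in source_columns)
    cur.execute(f'CREATE TABLE IF NOT EXISTS "{table}" ({ddl_cols})')

    # Compare the source schema against the destination table's columns.
    cur.execute(
        "SELECT COLUMN_NAME FROM INFORMATION_SCHEMA.COLUMNS WHERE TABLE_NAME = %s",
        (table,),
    )
    existing = {row[0] for row in cur.fetchall()}

    # Add columns that exist in the source but not in the destination.
    # Existing destination columns are never dropped.
    for name, ctype in source_columns:
        if name not in existing:
            cur.execute(f'ALTER TABLE "{table}" ADD COLUMN "{name}" {ctype or "VARCHAR"}')
```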
To configure the Replicate schema action in Snowflake:
1. Select the destination table in Snowflake. You can select from a list of existing tables in Snowflake, or enter a table name as text.
   Table names are case-sensitive because Workato applies double quotes around them (see double-quoted identifiers). This gives you the flexibility to use special characters in your table names if needed.
2. Select the source data type (CSV or Schema). The field set shown depends on the source type.
   Source type CSV: The CSV schema field set is shown. The data type of replicated columns using the CSV schema is VARCHAR.

   | Field | Description |
   | --- | --- |
   | CSV data | Use a datapill of CSV data that must have a header row. |
   | Column separator | Select the delimiter of the CSV data. The default is comma. |
   | Quote character | Select the character used to quote the CSV cell values. The default is double quote. |

   Source type Schema: The Schema field set is shown.

   | Field | Description |
   | --- | --- |
   | Schema source list | Input a list datapill. Learn more about list input. |
   | Column name | Map the field name to be used as the column name in your table. |
   | Column type | Map the field type to be used as the data type for the columns in the table. Leave this field blank to use VARCHAR for all columns in the table. |

3. Configure optional inputs, such as column case and exclude columns.
4. Configure the output.
# Set up extraction frequency
Set up the frequency with which data is extracted from your source system. Workato supports the following methods for processing data:
Batch processing
This method involves processing data in batches. A batch can consist of multiple records or rows of data that are processed as a single group. Batch processing is useful for moving large volumes of data efficiently.
Bulk processing
Similar to batch processing, bulk methods are designed to handle large volumes of data at once. However, bulk processing often involves even larger datasets and is optimized for maximum throughput and efficiency.
Real-time extraction
Real-time extraction involves processing data as soon as it becomes available. This method is ideal for use cases where immediate data availability is critical, such as in monitoring or alerting scenarios.
Micro-batch or polled extraction
Micro-batch or polled extraction involves regularly checking for and processing small batches of data at frequent intervals. This method strikes a balance between batch and real-time processing, offering timely data processing without the overhead of real-time systems.
# Change data capture
Change Data Capture (CDC) is a process used to capture and track changes made to data in a database. This process enables real-time or near-real-time monitoring and synchronization of data changes. This allows applications to stay up-to-date with the latest changes in the database without using continuous polling.
The primary goal of CDC is to identify and capture changes such as inserts, updates, and deletions made to database tables. These changes are then propagated to downstream systems, data warehouses, or analytics platforms, ensuring that all systems have access to the most current data.
# How CDC works in Workato
Workato uses triggers to monitor changes in an app or system you specify. Workato triggers handle CDC by monitoring changes in real-time and providing notifications for those changes, which facilitates data replication and synchronization across different systems.
Triggers follow in-sequence delivery, maintain records of processed jobs, prevent duplicate processing, and ensure jobs are completed in order. Trigger dispatches can be single (real-time data sync) or bulk/batch. Bulk/batch dispatches provide improved throughput if you are working with large data volumes.
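The sketch below models these trigger mechanics in simplified form: a cursor (high watermark) tracks progress, a record of processed jobs prevents duplicates, and changes are handled in order. It is an illustration of the concept, not Workato's implementation; the change fields (`id`, `modified_at`) are hypothetical.

```python
# Sketch only: a simplified model of cursor tracking, deduplication, and
# in-order processing; not Workato's implementation.
from datetime import datetime, timezone

class ChangePoller:
    def __init__(self, fetch_changes):
        self.fetch_changes = fetch_changes   # callable(since) -> changes ordered by modified_at
        self.cursor = datetime.min.replace(tzinfo=timezone.utc)  # high watermark
        self.processed_ids = set()           # record of processed jobs

    def poll(self, handle):
        # Changes arrive ordered by modification time, so handling them one by
        # one preserves in-sequence delivery.
        for change in self.fetch_changes(self.cursor):
            if change["id"] in self.processed_ids:   # prevent duplicate processing
                continue
            handle(change)                           # complete the job before advancing
            self.processed_ids.add(change["id"])
            self.cursor = max(self.cursor, change["modified_at"])
```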
# Supported data sources for CDC
Workato supports CDC for the following data sources:
- Software as a Service (SaaS) platforms
- On-premise systems
- Various databases, such as MySQL, PostgreSQL, and Snowflake
- Workato FileStorage
- Cloud storage services like Amazon S3
- Enterprise Resource Planning (ERP) systems
# Advanced strategies
Explore advanced CDC strategies, including filtering and conditional triggers, handling large volumes of changes, and optimizing performance:
Filtering and conditional triggers
Advanced CDC strategies include the use of filtering and conditional triggers to manage the flow of data changes. This allows for more granular control over which changes to capture and propagate to downstream systems.
Handling large volumes of changes
Batch processing and micro-batching can be used to process and transfer changes efficiently if you are working with large volumes of data changes. This helps manage the change load and ensures timely data synchronization.
Optimizing performance
Performance optimization can be achieved through techniques such as built-in cursor management for tracking high watermarks, auto-deduplication, and in-order processing.
Variable speed data pipelines
Workato supports the setup of variable speed data pipelines, including near real-time/continuous data streaming, micro-batches for frequent polling, and batching for periodic schedules. This flexibility allows for tailored data orchestration strategies that fit specific business needs.