# Data Pipeline key concepts

# Data sources and destinations

Workato Data Pipeline recipes extract data from a variety of data sources, such as applications, file systems, databases, API endpoints, and event streams, and sync it to data warehouse destinations.

Each pipeline takes data from multiple objects or fields in a single application or data source and synchronizes (syncs) that data to a single destination.

# What counts as a sync?

A sync is a single, concurrent execution of the extraction and load of the objects you select in the Data Pipeline recipe configuration. You set the sync frequency when you configure the recipe. When each sync completes, new or updated data from the source has been replicated to the destination, which keeps the destination consistent with the source.

# Best practice: Bulk or Batch extraction

Workato Data Pipeline recipes allow you to bulk process large amounts of data in a single job.

By contrast, batch processing is limited by batch sizes and memory constraints, and is generally less suitable for ETL/ELT.

Bulk operations process large volumes of data in a single, collective transaction. Instead of manipulating individual records, they often transfer or modify thousands or millions of records in a single operation.

Bulk actions are typically asynchronous and can provide the following benefits:

  • Efficiency: Bulk operations are highly efficient when dealing with large datasets. They minimize the overhead associated with processing individual records, resulting in faster execution times.

  • Atomicity: Bulk operations are typically atomic, meaning they are treated as a single, indivisible unit. This ensures that either all the changes are applied, or no changes are applied, maintaining data integrity.

  • Optimized for large datasets: Bulk operations optimize performance when dealing with a significant volume of data, making them suitable for scenarios involving mass data migration, synchronization, or transformation.

  • Reduced network overhead: Bulk operations transmit data in large chunks, which reduces the overall network overhead compared to handling individual records separately.

  • Fewer calls: Bulk transfers can reduce costs by lowering the number of calls made to the source and destination systems.
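To make the "fewer calls" point concrete, the following is a rough arithmetic sketch in Python. It is not Workato code: the function name `calls_needed`, the record count, and the batch size of 200 are all illustrative assumptions, and real limits depend on the source and destination systems.

```python
def calls_needed(total_records: int, records_per_call: int) -> int:
    """Return how many requests it takes to move total_records
    when each request carries at most records_per_call records."""
    if records_per_call <= 0:
        raise ValueError("records_per_call must be positive")
    # Integer ceiling division, so a partial final request is counted.
    return (total_records + records_per_call - 1) // records_per_call


total = 1_000_000  # hypothetical dataset size

per_record_calls = calls_needed(total, 1)   # one request per record
batch_calls = calls_needed(total, 200)      # hypothetical batch size of 200
bulk_calls = calls_needed(total, total)     # whole dataset in one bulk job

print(per_record_calls, batch_calls, bulk_calls)
```

Under these assumed numbers, the per-record approach needs one call per record, batching reduces that by a factor of the batch size, and a single bulk job needs one call.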

# Historical and incremental syncs

  • Historical: The initial bulk load of existing data. All records from the source application are extracted and replicated into the destination. The time frame can be configured.

  • Incremental: The ongoing synchronization of new and changed data. Each time the pipeline syncs, only data that is new or has changed since the previous sync is transferred.

When you first start a Data Pipeline recipe, it runs a full historical sync that extracts all data from the source application or all data from a date that you specify. All subsequent syncs are incremental.
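The difference between the two sync types can be sketched in a few lines of Python. This is a simplified illustration, not Workato's implementation: the `run_sync` function, the `updated_at` field, and the in-memory `destination` dictionary are hypothetical stand-ins for a real source and warehouse.

```python
from datetime import datetime


def run_sync(source, destination, cursor=None, start_from=None):
    """Copy records from source (a list of dicts) into destination (a dict keyed by id).

    cursor:     latest 'updated_at' seen by the previous sync; None means first run.
    start_from: optional lower bound for the first (historical) sync.
    Returns (number_of_records_synced, new_cursor).
    """
    if cursor is None:
        # Historical sync: everything, or everything from the chosen start date.
        selected = [r for r in source
                    if start_from is None or r["updated_at"] >= start_from]
    else:
        # Incremental sync: only records changed since the previous sync.
        selected = [r for r in source if r["updated_at"] > cursor]

    for record in selected:
        destination[record["id"]] = record  # insert or overwrite by id

    # Advance the cursor to the newest record seen; keep the old one if nothing synced.
    new_cursor = max((r["updated_at"] for r in selected), default=cursor)
    return len(selected), new_cursor


source = [
    {"id": 1, "updated_at": datetime(2024, 1, 5), "name": "Ada"},
    {"id": 2, "updated_at": datetime(2024, 3, 1), "name": "Grace"},
]
destination = {}

first_count, cursor = run_sync(source, destination)           # historical sync
source.append({"id": 3, "updated_at": datetime(2024, 4, 1), "name": "Edsger"})
second_count, cursor = run_sync(source, destination, cursor)  # incremental sync
```

In this sketch the first call copies both existing records, and the second call copies only the record added afterward.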

# Replicate schema and handle schema drift in Workato Data Pipelines

Schema drift refers to changes in the structure or schema of a dataset over time. In a data orchestration context, schema drift occurs when the structure of the source data changes after a data orchestration process is implemented. These changes can include additions, deletions, or modifications of fields, data types, or other schema elements.

Schema drift can pose challenges for data orchestration processes because it can lead to inconsistencies between the source and target systems. If not handled properly, schema drift can result in data transformation errors, data loss, or incorrect data analysis.
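As a minimal sketch of what detecting drift involves, the Python below compares two snapshots of a source schema and reports added, removed, and retyped fields. The `diff_schema` function and the field-name-to-type dictionaries are assumptions made for illustration; they do not describe how Workato represents or detects schemas internally.

```python
def diff_schema(old, new):
    """Compare two schemas given as {field_name: type_name} dictionaries."""
    added = {f: t for f, t in new.items() if f not in old}
    removed = {f: t for f, t in old.items() if f not in new}
    # Fields present in both snapshots whose type differs: (old_type, new_type).
    changed = {f: (old[f], new[f])
               for f in old if f in new and old[f] != new[f]}
    return {"added": added, "removed": removed, "changed": changed}


# Hypothetical schema snapshots taken before and after a source change.
previous = {"id": "integer", "email": "string", "fax": "string", "score": "integer"}
current = {"id": "integer", "email": "string", "phone": "string", "score": "float"}

drift = diff_schema(previous, current)
print(drift)
```

Here `phone` is reported as added, `fax` as removed, and `score` as changed from `integer` to `float`, which covers the three kinds of change described above.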

# Choose how to handle schema changes

Workato detects and manages schema drift through automated schema detection and adaptation.

When you create your Data Pipeline recipe, you can choose whether to manage schema drift automatically.

# How does a Data Pipeline handle deleted fields?

When a field is removed from the source, the Data Pipeline retains the field in the destination so that you can refer to the previous values.
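The retention behavior can be pictured with a short Python sketch. It is illustrative only: the helper names are hypothetical, and filling the retained column with `None` for records synced after the deletion is an assumption, since the behavior for new rows is not specified here.

```python
def merge_destination_columns(destination_columns, source_fields):
    """Keep every existing destination column and append any new source fields."""
    merged = list(destination_columns)
    for field in source_fields:
        if field not in merged:
            merged.append(field)
    return merged


def to_destination_row(record, columns):
    """Shape a source record to the destination columns.

    Columns missing from the record (for example, a field deleted at the
    source) are set to None. This fill value is an assumption.
    """
    return {column: record.get(column) for column in columns}


destination_columns = ["id", "email", "fax"]  # 'fax' was synced in the past
source_fields = ["id", "email", "phone"]      # 'fax' removed, 'phone' added

columns = merge_destination_columns(destination_columns, source_fields)
row = to_destination_row(
    {"id": 7, "email": "a@example.com", "phone": "555-0100"}, columns
)
```

The `fax` column stays in the destination, so values loaded before the field was removed remain available for reference.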


Last updated: 7/31/2024, 7:04:29 PM