Data Pipelines
Concerns regarding building a data processing system
- multi-layered system
- volume of data
- results do not need to be sent back to the client immediately
- event processing pipeline
Main Message Bus (Kafka)
- Scaling
- Message semantics
- A repeated pattern of reading from and writing to the bus forms an event processing pipeline, as sketched below.
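A minimal sketch of one such stage using the Java Kafka client; the topic names `events-raw` and `events-enriched` and the `enrich` transformation are hypothetical placeholders, not names from the post:

```java
import java.time.Duration;
import java.util.Collections;
import java.util.Properties;
import org.apache.kafka.clients.consumer.ConsumerRecord;
import org.apache.kafka.clients.consumer.ConsumerRecords;
import org.apache.kafka.clients.consumer.KafkaConsumer;
import org.apache.kafka.clients.producer.KafkaProducer;
import org.apache.kafka.clients.producer.ProducerRecord;

public class EnrichmentStage {
    public static void main(String[] args) {
        Properties consumerProps = new Properties();
        consumerProps.put("bootstrap.servers", "localhost:9092");
        consumerProps.put("group.id", "enrichment-stage");
        consumerProps.put("key.deserializer", "org.apache.kafka.common.serialization.StringDeserializer");
        consumerProps.put("value.deserializer", "org.apache.kafka.common.serialization.StringDeserializer");

        Properties producerProps = new Properties();
        producerProps.put("bootstrap.servers", "localhost:9092");
        producerProps.put("key.serializer", "org.apache.kafka.common.serialization.StringSerializer");
        producerProps.put("value.serializer", "org.apache.kafka.common.serialization.StringSerializer");

        try (KafkaConsumer<String, String> consumer = new KafkaConsumer<>(consumerProps);
             KafkaProducer<String, String> producer = new KafkaProducer<>(producerProps)) {
            consumer.subscribe(Collections.singletonList("events-raw"));   // read from one topic...
            while (true) {
                ConsumerRecords<String, String> records = consumer.poll(Duration.ofMillis(500));
                for (ConsumerRecord<String, String> record : records) {
                    String enriched = enrich(record.value());              // ...transform...
                    producer.send(new ProducerRecord<>("events-enriched", record.key(), enriched));
                }                                                          // ...write to the next topic
            }
        }
    }

    // placeholder transformation; a real stage would parse, join, or aggregate here
    static String enrich(String value) {
        return value.toUpperCase();
    }
}
```

Chaining several such stages, each reading the previous stage's output topic, is what turns the bus into a pipeline.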
Prerequisite questions regarding a data-intensive system
- Is the data complete, and does it retain its intended meaning?
- Is it consistent with other data sources, e.g., in the case of a data lake or OLAP store?
- If it is reprocessed, does the final data reflect that, and are the SLAs (Service-Level Agreements) all met?
- If some downstream service was down for some time, did that impact the final data?
Concerns with pipelines
- How stable are the underlying systems which generate the data?
- How reliable is the final data even when working with components which go down?
1. Downtime of service
- Services could/should/do go down
- What should a consumer of a service do under such a situation?
- What if it is a component of a data pipeline?
Some solutions
- Fail fast
- Implement a Dead Letter Queue (sketched after this list)
- The service takes responsibility for ensuring the message is processed
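A minimal Dead Letter Queue sketch in Java, assuming a hypothetical `orders-dlq` dead-letter topic and a placeholder `process` function; a record that fails processing is parked on the DLQ with some error context instead of blocking the rest of the pipeline:

```java
import java.nio.charset.StandardCharsets;
import org.apache.kafka.clients.consumer.ConsumerRecord;
import org.apache.kafka.clients.producer.KafkaProducer;
import org.apache.kafka.clients.producer.ProducerRecord;

public class DeadLetterHandler {

    // Handle a single record; on failure, park it on the DLQ topic and keep consuming.
    static void handle(ConsumerRecord<String, String> record,
                       KafkaProducer<String, String> producer) {
        try {
            process(record.value());                     // business logic; may throw
        } catch (Exception e) {
            ProducerRecord<String, String> dead =
                new ProducerRecord<>("orders-dlq", record.key(), record.value());
            // attach failure context so a DLQ consumer can triage or replay later
            dead.headers().add("error", e.getClass().getName().getBytes(StandardCharsets.UTF_8));
            producer.send(dead);
        }
    }

    // placeholder for real processing logic
    static void process(String value) { /* ... */ }
}
```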
2. Slow consumers
- A silent problem in data pipelines: the consumer receives more than it can handle
- Hard to keep track of until data loss happens
- Leads to cascading failures across systems when there is a change in contract or schema.
Some solutions
- Measuring and scaling consumers (a lag-measurement sketch follows this list)
- Contract and schema evolution
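A minimal sketch of measuring lag from inside a Java Kafka consumer; per-partition lag is the gap between the log end offset and the consumer's current position, and it is the number to alert on and scale consumers against:

```java
import java.util.Map;
import java.util.Set;
import org.apache.kafka.clients.consumer.KafkaConsumer;
import org.apache.kafka.common.TopicPartition;

public class LagMonitor {

    // Report per-partition lag for everything currently assigned to this consumer.
    static long totalLag(KafkaConsumer<String, String> consumer) {
        Set<TopicPartition> assigned = consumer.assignment();
        Map<TopicPartition, Long> endOffsets = consumer.endOffsets(assigned);
        long total = 0;
        for (TopicPartition tp : assigned) {
            long lag = endOffsets.get(tp) - consumer.position(tp);
            System.out.printf("%s lag=%d%n", tp, lag);
            total += lag;
        }
        return total;   // export this to a metrics system; scale consumers when it keeps growing
    }
}
```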
3. Ensuring zero data loss
- How to make it reliable?
- How do you build a system which continuously processes data with only a few hours of buffer for messages?
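One common building block for the reliability questions above (a general Kafka technique, not something specific to this post) is at-least-once delivery: disable auto-commit and commit offsets only after a batch has been fully processed, so a crash causes reprocessing rather than loss. A minimal Java sketch, with the topic name `events` and the `process` function as placeholders:

```java
import java.time.Duration;
import java.util.Collections;
import java.util.Properties;
import org.apache.kafka.clients.consumer.ConsumerRecord;
import org.apache.kafka.clients.consumer.ConsumerRecords;
import org.apache.kafka.clients.consumer.KafkaConsumer;

public class AtLeastOnceConsumer {
    public static void main(String[] args) {
        Properties props = new Properties();
        props.put("bootstrap.servers", "localhost:9092");
        props.put("group.id", "at-least-once-stage");
        props.put("enable.auto.commit", "false");   // never auto-commit; commit only after work is done
        props.put("key.deserializer", "org.apache.kafka.common.serialization.StringDeserializer");
        props.put("value.deserializer", "org.apache.kafka.common.serialization.StringDeserializer");

        try (KafkaConsumer<String, String> consumer = new KafkaConsumer<>(props)) {
            consumer.subscribe(Collections.singletonList("events"));      // hypothetical topic
            while (true) {
                ConsumerRecords<String, String> records = consumer.poll(Duration.ofSeconds(1));
                for (ConsumerRecord<String, String> record : records) {
                    process(record.value());     // if this crashes, the offset is not committed
                }                                // and the batch is re-read after restart
                consumer.commitSync();           // acknowledge only fully processed batches
            }
        }
    }

    static void process(String value) { /* business logic */ }
}
```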
4. Global view of the schema
- setting the blueprint / data model of the data
- e.g., type, default value, placement
- needs to be something agreed upon by all producers and consumers
Contract
- A good contract should look like the model classes in a code base
- With interactions between the different model classes driven by their responsibilities
Contract Relationship
Message Format
Thrift
Thrift is an interface definition language and binary communication protocol used for defining and creating services for numerous languages. It forms a remote procedure call framework.
Advantages of Thrift
- Strong typing of fields
- Easy renaming of fields and types
- Bindings for all major languages
- Very fast SerDe for messages
Thrift Contract
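As an illustration (the struct name and fields are hypothetical, not the post's original example), a Thrift contract for a click event might look like this; every field carries an index tag, a type, its requiredness, and optionally a default value:

```thrift
// click_event.thrift -- hypothetical contract for illustration
namespace java com.example.events

struct ClickEvent {
  1: required string  event_id,        // index tags (1, 2, ...) are part of the contract
  2: required i64     timestamp_ms,
  3: required string  user_id,
  4: optional string  page_url,
  5: optional string  referrer = "",   // default value protects consumers from older producers
}
```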
Compatibility
Forward Compatibility and Backward Compatibility
- Producers and consumers work on different versions of the contract but are still able to work together.
- Forward Compatibility: ensure data written today can be read and processed in the future.
- Backward Compatibility: ensure data written in the past can still be read and processed today.
Full Compatibility
- Consumers with version X of the schema are able to read data generated by versions X and X-1, but not necessarily X-2.
- Data generated by producers with version X of the schema can be processed by consumers running versions X and X-1 of the schema, but not necessarily X-2.
- While evolving a schema in a compatible way, transitivity needs to be considered: all the data written in the past may not remain fully compatible in the future.
Full Transitive Compatibility via Thrift
- Ensure that required fields are never changed or removed
- Newly added fields are only ever optional
- The index tags for a field are never changed
- Do not remove optional fields; rename them as deprecated instead
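A sketch of how the hypothetical `ClickEvent` contract from above could evolve while following these rules: existing tags and required fields are untouched, the new field is optional on a fresh tag, and an unused optional field is renamed to mark it deprecated rather than removed:

```thrift
// click_event.thrift, version 2 -- hypothetical evolution for illustration
struct ClickEvent {
  1: required string  event_id,                  // required fields and their tags are unchanged
  2: required i64     timestamp_ms,
  3: required string  user_id,
  4: optional string  page_url,
  5: optional string  deprecated_referrer = "",  // renamed, not removed; tag 5 stays reserved
  6: optional string  session_id,                // new fields are optional and use a new tag
}
```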