Data Pipelines

Concerns when building a data processing system

  • multi-layered system
  • volume of data
  • results do not need to be sent back to the client immediately
  • event-processing pipeline

Main Message Bus (Kafka)

  • Scaling
  • Message semantics
  • The repeated pattern of reading from and writing to the bus forms an event-processing pipeline (see the sketch below).
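
A minimal sketch of that pattern, assuming a local broker, the `confluent_kafka` Python client, and placeholder topic/group names:

```python
from confluent_kafka import Consumer, Producer

# One stage of the pipeline: read from one topic, transform, write to the next.
consumer = Consumer({
    "bootstrap.servers": "localhost:9092",  # assumed broker address
    "group.id": "enrichment-stage",         # hypothetical consumer group
    "enable.auto.commit": False,            # commit only after the output is written
    "auto.offset.reset": "earliest",
})
producer = Producer({"bootstrap.servers": "localhost:9092"})
consumer.subscribe(["raw-events"])          # upstream topic (placeholder name)

def transform(payload: bytes) -> bytes:
    # Placeholder for the actual processing/enrichment logic.
    return payload.upper()

while True:
    msg = consumer.poll(1.0)
    if msg is None or msg.error():
        continue
    enriched = transform(msg.value())
    # Writing the result back onto the bus makes it the input of the next stage.
    producer.produce("enriched-events", key=msg.key(), value=enriched)
    producer.flush()
    consumer.commit(message=msg)            # mark the input record as processed
```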

Prerequisite questions for a data-intensive system

  • Is the data complete, and is its meaning preserved?
  • Is it consistent with other data sources, e.g. in a data lake or OLAP store?
  • If data is reprocessed, does the final output reflect that, and are all SLAs (service-level agreements) still met?
  • If a downstream service was down for some time, did that impact the final data?

Concerns with pipelines

  • How stable are the underlying systems that generate the data?
  • How reliable is the final data when it is produced by components that can go down?

1. Downtime of service

  • Services could/should/do go down
  • What should a consumer of a service do in such a situation?
  • What if it is a component of a data pipeline?

Some solutions

  • Fail fast
  • Implement a Dead Letter Queue (sketched after this list)
  • The service takes responsibility for ensuring the message is processed
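
Fail fast and a Dead Letter Queue can be combined: attempt the downstream call, and on failure park the message on a dead-letter topic so the rest of the pipeline keeps moving. A rough sketch, assuming hypothetical `orders` / `orders-dlq` topics and a placeholder `process()` function:

```python
from confluent_kafka import Consumer, Producer

consumer = Consumer({
    "bootstrap.servers": "localhost:9092",
    "group.id": "order-processor",          # hypothetical consumer group
    "enable.auto.commit": False,
})
producer = Producer({"bootstrap.servers": "localhost:9092"})
consumer.subscribe(["orders"])              # placeholder topic

def process(payload: bytes) -> None:
    # Placeholder: call the downstream service; raise on failure.
    raise RuntimeError("downstream service unavailable")

while True:
    msg = consumer.poll(1.0)
    if msg is None or msg.error():
        continue
    try:
        process(msg.value())
    except Exception as exc:
        # Fail fast on this message, but do not lose it: park it on the DLQ
        # together with the error so it can be inspected and replayed later.
        producer.produce(
            "orders-dlq",
            key=msg.key(),
            value=msg.value(),
            headers=[("error", str(exc).encode("utf-8"))],
        )
        producer.flush()
    # Either way the offset is committed, so one bad message does not stall the stage.
    consumer.commit(message=msg)
```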

2. Slow consumers

  • A silent problem in data pipelines: a consumer receives more than it can handle
  • Hard to notice until data loss happens
  • Can lead to cascading failures across systems when there is a change in contract or schema

Some solutions

  • Measuring and scaling consumers (a lag-measurement sketch follows this list)
  • Contract and schema evolution
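
Scaling decisions need a lag signal first. A rough sketch of measuring per-partition lag (log end offset minus committed offset) with `confluent_kafka`, reusing the placeholder group/topic names from the earlier sketches:

```python
from confluent_kafka import Consumer, TopicPartition

# A consumer used only to inspect the offsets of an existing group.
consumer = Consumer({
    "bootstrap.servers": "localhost:9092",
    "group.id": "enrichment-stage",         # the group whose lag we want (placeholder)
    "enable.auto.commit": False,
})

topic = "raw-events"                        # placeholder topic
metadata = consumer.list_topics(topic, timeout=10)
partitions = [TopicPartition(topic, p) for p in metadata.topics[topic].partitions]

total_lag = 0
for tp in consumer.committed(partitions, timeout=10):
    low, high = consumer.get_watermark_offsets(tp, timeout=10)
    position = tp.offset if tp.offset >= 0 else low   # no commit yet -> start of log
    lag = high - position
    total_lag += lag
    print(f"partition {tp.partition}: lag={lag}")

print(f"total lag: {total_lag}")
# If lag keeps growing, add consumers (up to the partition count) or make the
# processing faster; measuring it continuously is what makes scaling possible.
consumer.close()
```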

3. Ensuring zero data loss

  • How do you make it reliable?
  • How do you build a system that continuously processes data when the bus holds only a few hours of message buffer? (see the sketch below)
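
A sketch of the producer-side half of that, assuming `confluent_kafka` and placeholder names: durable writes (`acks=all`, idempotence) plus a delivery callback that surfaces failures instead of dropping them. Combined with the commit-after-write consumers sketched earlier, this gives at-least-once delivery; the topic's retention is the "few hours of buffer" that bounds how long a consumer can stay behind.

```python
from confluent_kafka import Producer

# Producer settings aimed at not losing messages that were acknowledged.
producer = Producer({
    "bootstrap.servers": "localhost:9092",
    "acks": "all",                  # wait for all in-sync replicas
    "enable.idempotence": True,     # avoid duplicates when retries happen
})

def on_delivery(err, msg):
    if err is not None:
        # The write failed after retries: surface it rather than losing data silently.
        raise RuntimeError(f"delivery failed for {msg.topic()}: {err}")

producer.produce("enriched-events", value=b"payload", on_delivery=on_delivery)
producer.flush()  # block until every outstanding delivery report has arrived
```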

4. Global view of the schema

  • Setting the blueprint/data model for the data
  • e.g. type, default value, placement, etc.
  • Needs to be agreed upon by all producers and consumers

Contract

  • A good contract should look like the model classes in a code base (a minimal sketch follows this list)
  • Interactions between different model classes are driven more by responsibility
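
As an illustration (not Thrift itself), a contract for a single event type can be pictured as a plain model class, with every field typed and every optional field given a default; the names below are hypothetical:

```python
from dataclasses import dataclass
from typing import Optional

@dataclass
class PageViewEvent:
    # Required fields: every producer must fill these in.
    user_id: str
    url: str
    timestamp_ms: int
    # Optional fields: consumers fall back to the defaults when they are absent.
    referrer: Optional[str] = None
    schema_version: int = 1


event = PageViewEvent(user_id="u1", url="/home", timestamp_ms=1700000000000)
```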

Contract Relationship

Message Format

Thrift

Thrift is an interface definition language and binary communication protocol used for defining and creating services for numerous languages. It forms a remote procedure call framework.

Advantages of Thrift

  • Strong typing of fields
  • Easy renaming of fields and types
  • Bindings for all major languages
  • Very fast SerDe for messages (see the sketch after this list)
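
A rough sketch of that SerDe path in Python, assuming a hypothetical `pageview.thrift` IDL has been compiled with `thrift --gen py` into a `pageview` package containing a `PageViewEvent` struct (all names are placeholders):

```python
from thrift.TSerialization import serialize, deserialize
from thrift.protocol import TBinaryProtocol

# Generated by the Thrift compiler from the (hypothetical) IDL file.
from pageview.ttypes import PageViewEvent

factory = TBinaryProtocol.TBinaryProtocolFactory()

event = PageViewEvent(user_id="u1", url="/home", timestamp_ms=1700000000000)
payload = serialize(event, protocol_factory=factory)                       # struct -> bytes for the bus
decoded = deserialize(PageViewEvent(), payload, protocol_factory=factory)  # bytes -> struct
assert decoded == event
```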

Thrift Contract

Compatibility

Forward Compatibility and Backward Compatibility

  • Producers and consumers may run different versions of the contract but still work well together (a simplified sketch follows this list).
  1. Forward Compatibility
    • Ensure data written today can still be read and processed in the future.
  2. Backward Compatibility
    • Ensure data written in the past can be read and processed today.
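
A simplified illustration of the two directions, using plain dicts instead of Thrift's binary encoding (the mechanics differ, but the idea is the same): readers ignore fields they do not know about, and fill defaults for fields that are missing. All names are made up.

```python
# v1 of the contract: {user_id, url}
# v2 of the contract adds an *optional* field: {user_id, url, referrer}

V2_DEFAULTS = {"referrer": None}

def read_with_v1(record: dict) -> dict:
    # A reader on the older contract keeps only the fields it knows,
    # so data carrying the newer field is still usable.
    return {"user_id": record["user_id"], "url": record["url"]}

def read_with_v2(record: dict) -> dict:
    # A reader on the newer contract fills defaults for fields that
    # older data never had, so past data is still usable.
    return {**V2_DEFAULTS, **record}

old_data = {"user_id": "u1", "url": "/home"}                       # written under v1
new_data = {"user_id": "u2", "url": "/docs", "referrer": "/home"}  # written under v2

print(read_with_v1(new_data))  # older reader, newer data
print(read_with_v2(old_data))  # newer reader, older data -> referrer defaults to None
```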

Full Compatibility

  • Consumers with version X of the schema can read data generated by X and X-1, but not necessarily X-2.

  • Data generated by producers with version X of the schema can be processed by consumers running version X or X-1, but not necessarily X-2.

  • While evolving a schema in a compatible way, transitivity needs to be considered.

All data written in the past may not remain fully compatible in the future.

Full Transitive Compatibility via Thrift

  • Ensure that required fields are never changed or removed
  • Newly added fields are only ever optional
  • The index tags of a field are never changed
  • Do not remove optional fields; rename them as deprecated instead
