Awesome Data Temporality

A curated list to help you manage temporal data across many modalities 🚀.

Generative Art Created By DALL·E!

Data Versioning for Machine Learning

Data versioning is the practice of storing multiple versions of the same data and providing a mechanism for accessing and managing these versions. This is useful when data is accidentally deleted or corrupted, or when you need to see how the data has changed over time. The vast majority of "data versioning" tools you see today focus on managing datasets for machine learning. The common implementation paradigm is to store versions of your data and models in Git commits. The following part of the awesome list is therefore centered around machine learning. Other ways to manage temporal data are covered in later sections.
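
As a rough illustration of "storing multiple versions and providing a mechanism for accessing them", here is a minimal in-memory sketch. The `VersionStore` class and its `commit`/`checkout`/`log` methods are hypothetical names invented for this example. They do not reflect how Git or any tool in this list works internally.

```python
import hashlib


class VersionStore:
    """Toy store: every commit keeps an immutable snapshot keyed by its content hash."""

    def __init__(self):
        self._blobs = {}    # content hash -> raw bytes of that version
        self._history = []  # ordered (message, content hash) pairs

    def commit(self, data: bytes, message: str) -> str:
        digest = hashlib.sha256(data).hexdigest()
        self._blobs.setdefault(digest, data)  # identical content is stored once
        self._history.append((message, digest))
        return digest

    def checkout(self, digest: str) -> bytes:
        # Any earlier version stays retrievable by its hash.
        return self._blobs[digest]

    def log(self):
        return list(self._history)


store = VersionStore()
v1 = store.commit(b"id,label\n1,cat\n", "initial dataset")
v2 = store.commit(b"id,label\n1,cat\n2,dog\n", "add second row")

# The first version is still available after the dataset changed.
original = store.checkout(v1)
```

Addressing versions by content hash means an unchanged dataset does not take extra space when committed again.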

Time Travel and Temporal Tables

Data time travel refers to the ability to go back in time and access previous versions of data. Enabling it requires a system for versioning data, that is, storing multiple versions of the same data and providing a mechanism for accessing and managing them. Temporal tables, also known as system-versioned temporal tables, are database tables that automatically track the history of data changes and let you query the data as it existed at any point in time. The terms time travel and temporal tables are often used interchangeably, although temporal tables are an implementation-specific feature of some databases. These tables are useful for auditing, tracking changes to data over time, and performing point-in-time analysis. In databases that support them, you can usually query a temporal table using the FOR SYSTEM_TIME clause in a SELECT statement.
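
The sketch below emulates a point-in-time query with Python's built-in `sqlite3` module. SQLite has no native system-versioned tables or `FOR SYSTEM_TIME` clause, so the `price_history` table, its `sys_start`/`sys_end` columns, and the helpers `set_price` and `price_as_of` are assumptions made for illustration. A real temporal table would also set the timestamps itself; here they are passed in explicitly to keep the example deterministic.

```python
import sqlite3

OPEN_END = "9999-12-31"  # sentinel meaning "this row is still current"

conn = sqlite3.connect(":memory:")
conn.execute(
    """
    CREATE TABLE price_history (
        product   TEXT NOT NULL,
        price     REAL NOT NULL,
        sys_start TEXT NOT NULL,
        sys_end   TEXT NOT NULL
    )
    """
)


def set_price(product, price, at):
    # Close the currently open row instead of overwriting it...
    conn.execute(
        "UPDATE price_history SET sys_end = ? WHERE product = ? AND sys_end = ?",
        (at, product, OPEN_END),
    )
    # ...then add the new version as the open row.
    conn.execute(
        "INSERT INTO price_history (product, price, sys_start, sys_end) "
        "VALUES (?, ?, ?, ?)",
        (product, price, at, OPEN_END),
    )


def price_as_of(product, at):
    # Rough equivalent of SELECT ... FOR SYSTEM_TIME AS OF <at>.
    row = conn.execute(
        "SELECT price FROM price_history "
        "WHERE product = ? AND sys_start <= ? AND sys_end > ?",
        (product, at, at),
    ).fetchone()
    return row[0] if row else None


set_price("widget", 10.0, "2024-01-01")
set_price("widget", 12.5, "2024-06-01")

price_in_march = price_as_of("widget", "2024-03-15")
price_in_july = price_as_of("widget", "2024-07-01")
```

ISO 8601 date strings compare correctly as plain text, which is why the period columns can be `TEXT` here.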

Slowly Changing Dimensions Data Modeling

Slowly changing dimensions are those in which the attributes of the dimension change over time, and the changes need to be tracked in the data warehouse. For example, a customer's address or name might change over time, and the data warehouse needs to track these changes so that historical data can be analyzed correctly.

  • VDK Versatile Data Kit (VDK) is an open source framework that includes helpers for managing SCD-style data.
  • dbtvault A free to use dbt package for creating and loading Data Vault 2.0 compliant Data Warehouses (powered by dbt, an open source data engineering tool, registered trademark of dbt Labs)
  • dataform Common data models for creating type-2 slowly changing dimensions tables from mutable data sources in Dataform.
  • dbt snapshots DBT snapshots
  • DeltaLake Databricks change data capture with Delta Live Tables
  • 6 Kinds 6 Different Types of Slowly Changing Dimensions and How to Apply Them?
  • Data Vault Loading Dimensions from a Data Vault Model
  • SCD Data Warehouse Slowly Changing Dimension Handling in Data Warehouses Using Temporal Database Features
  • Redshift Implement a slowly changing dimension in Amazon Redshift
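
The customer-address example above can be sketched as a type-2 slowly changing dimension, where a change closes the current row and appends a new one instead of overwriting. This is a minimal in-memory sketch. `CustomerRow`, `apply_type2`, and `address_on` are hypothetical names for this example, and none of the tools listed above are used.

```python
from dataclasses import dataclass
from datetime import date
from typing import List, Optional


@dataclass
class CustomerRow:
    surrogate_key: int
    customer_id: str
    address: str
    valid_from: date
    valid_to: Optional[date] = None  # None means the row is still open
    is_current: bool = True


def apply_type2(dim: List[CustomerRow], customer_id: str,
                new_address: str, change_date: date) -> None:
    """Close the current row for the customer and append a new current row."""
    for row in dim:
        if row.customer_id == customer_id and row.is_current:
            if row.address == new_address:
                return  # nothing changed, keep history as it is
            row.valid_to = change_date
            row.is_current = False
    next_key = max((r.surrogate_key for r in dim), default=0) + 1
    dim.append(CustomerRow(next_key, customer_id, new_address, change_date))


def address_on(dim: List[CustomerRow], customer_id: str,
               day: date) -> Optional[str]:
    """Return the address that was in effect on the given day."""
    for row in dim:
        if (row.customer_id == customer_id
                and row.valid_from <= day
                and (row.valid_to is None or day < row.valid_to)):
            return row.address
    return None


dim = [CustomerRow(1, "C42", "1 Old Street", date(2020, 1, 1))]
apply_type2(dim, "C42", "9 New Avenue", date(2023, 5, 1))

old_address = address_on(dim, "C42", date(2021, 6, 1))
new_address = address_on(dim, "C42", date(2024, 1, 1))
```

Because the old row is kept, facts dated before the change still resolve to the address that applied at that time.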

Bi-temporality Tools + Modeling

Bitemporality is a data modeling concept in which every record is tracked along two independent time axes: valid time (when a fact was true in the real world) and transaction time (when the database recorded it). Existing records are not overwritten. Each change is stored as a new version. This allows users to query the data as it existed at a given point in time, and also as it was known at a given point in time, which is useful for understanding how data has changed, for handling late-arriving corrections, and for tracking the history of a particular piece of data.

  • Martin Fowler Bitemporal History (explained) from world famous Martin Fowler
  • Crux of Bitemporality The Crux of Bitemporality - Jon Pither
  • Capgemini Enhancing Time Series Data by Applying Bitemporality (opinionated white paper mentioning KDB+)
  • GoldenSource A financial services data modeling software company perspective on bitemporality
  • MarkLogic A deep dive into bitemporality in MarkLogic
  • XTDB A bitemporal graph database by JUXT
  • arXiv Bitemporal Property Graphs to Organize Evolving Systems white paper
  • Axway Decision Insights bitemporal capability
  • Cloudera - Data Modeling Bi-temporal data modeling with Envelope
  • Bitemporal Database Book Bitemporal Databases: Modeling and Implementation
  • Speakerdeck An overview of bitemporality
  • Val on Programming (Datomic) Datomic: this is not the history you're looking for
  • Cybertec Implementing "As Of" queries in PostgreSQL
  • Bitempura.DB Bitempura.DB is a simple, bitemporal key-value database.
  • Modeler (Anchormodeler) (Bi-temporal) data modelling tool inspired by Anchor modeler, for PostgreSQL
  • BarbelHisto Lightweight ultra-fast Java library to store data in bi-temporal format
  • Robinhood Tracking Temporal Data at Robinhood
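
To make the two time axes concrete, here is a deliberately simplified sketch in plain Python. The `Fact` record and the `as_of` function are hypothetical and do not come from any library above. Each fact carries a `valid_from` date (when it became true) and a `recorded_at` date (when the system learned it), and a later `valid_from` is assumed to supersede an earlier one.

```python
from dataclasses import dataclass
from typing import List, Optional


@dataclass(frozen=True)
class Fact:
    key: str
    value: int
    valid_from: str   # valid time: when the value became true in the real world
    recorded_at: str  # transaction time: when the system recorded it


def as_of(facts: List[Fact], key: str,
          valid_at: str, known_at: str) -> Optional[int]:
    """What did we believe at `known_at` about the value in effect at `valid_at`?"""
    candidates = [
        f for f in facts
        if f.key == key
        and f.valid_from <= valid_at    # already true at that moment
        and f.recorded_at <= known_at   # and already known to the system
    ]
    if not candidates:
        return None
    # Latest valid_from wins; ties go to the most recently recorded correction.
    best = max(candidates, key=lambda f: (f.valid_from, f.recorded_at))
    return best.value


facts = [
    # Recorded on Jan 10: salary is 100, effective Jan 1.
    Fact("alice.salary", 100, "2024-01-01", "2024-01-10"),
    # Recorded on Mar 5: a raise to 110 that was effective back on Feb 1.
    Fact("alice.salary", 110, "2024-02-01", "2024-03-05"),
]

# Same real-world date, two different states of knowledge.
believed_in_february = as_of(facts, "alice.salary", "2024-02-15", "2024-02-20")
believed_in_march = as_of(facts, "alice.salary", "2024-02-15", "2024-03-10")
```

Both queries ask about 15 February, but the first one reproduces what a report generated on 20 February would have shown, before the retroactive change was recorded.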

Change Data Capture (CDC) Tools

Change data capture (CDC) is a process that captures and stores data about changes made to a database or other data source. It is often used in data warehousing and data integration scenarios to keep data in different systems up to date and in sync. CDC tracks each change made to the source and stores information about it in a separate location, such as another database or a log file. The original source can keep being updated while a record of every change is preserved for downstream consumers.

  • Debezium Change data capture for a variety of databases
  • Supabase realtime Broadcast, Presence, and Postgres Changes via WebSockets
  • airbyte Data integration platform for ELT pipelines from APIs, databases & files to warehouses & lakes
  • Flink CDC Connectors for Apache Flink
  • gravity A Data Replication Center
  • brooklin An extensible distributed system for reliable nearline data streaming at scale

Soft Delete in ORM Frameworks

Soft delete is a method of deleting data from a database in a way that allows the data to be recovered later. Soft-deleted data is not physically removed from the database. Instead, it is marked as deleted and is typically no longer visible to users. The method is often used to prevent accidental or unintended data loss. It is also useful where data must be retained for compliance or regulatory purposes, because the data stays in the database while being unavailable to users.
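
A minimal sketch of the pattern with Python's built-in `sqlite3` module rather than a specific ORM. The `notes` table, the `deleted_at` marker column, the `active_notes` view, and the helper functions are assumptions made for this example. ORM soft-delete plugins typically achieve the same effect by adding a default filter to queries.

```python
import sqlite3

conn = sqlite3.connect(":memory:")
conn.executescript(
    """
    CREATE TABLE notes (
        id         INTEGER PRIMARY KEY,
        body       TEXT NOT NULL,
        deleted_at TEXT            -- NULL means the row is live
    );

    -- Normal reads go through this view, so deleted rows stay hidden.
    CREATE VIEW active_notes AS
        SELECT id, body FROM notes WHERE deleted_at IS NULL;
    """
)


def soft_delete(note_id, at):
    # Mark the row instead of issuing a DELETE.
    conn.execute(
        "UPDATE notes SET deleted_at = ? WHERE id = ? AND deleted_at IS NULL",
        (at, note_id),
    )


def restore(note_id):
    # Recovery only needs to clear the marker.
    conn.execute("UPDATE notes SET deleted_at = NULL WHERE id = ?", (note_id,))


def visible_ids():
    return [row[0] for row in conn.execute("SELECT id FROM active_notes ORDER BY id")]


conn.execute("INSERT INTO notes (id, body) VALUES (1, 'keep me')")
conn.execute("INSERT INTO notes (id, body) VALUES (2, 'delete me')")

soft_delete(2, "2024-01-01T00:00:00Z")
after_delete = visible_ids()

restore(2)
after_restore = visible_ids()
```

Storing a timestamp rather than a boolean flag also records when the deletion happened, which helps with retention rules.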

Contribution

This list started as a personal collection of interesting things about data versioning. Your contributions and suggestions are warmly welcomed. Read the contribution guidelines.