New York-headquartered data reliability company Datafold has launched an open-source diffing tool to help enterprises compare databases and perform checks to validate data consistency.
Modern enterprises are heavily reliant on extract, load and transform (ELT) practices. Datasets are extracted from sources and loaded into a data warehouse, and then transformations, such as cleansing and refining, are performed to make the information ready for analytics and data science use cases.
The task is straightforward, especially with data replication and syncing tools such as Fivetran, Airbyte and Stitch. Even so, records can get lost in interconnected systems due to dropped packets, general replication issues and configuration errors.
This can affect data integrity and the downstream use cases.
Data-diff to the rescue
To address this challenge, Datafold is releasing a new diffing package. Dubbed ‘data-diff,’ the solution uses algorithms to actively verify whether the data loaded into a data warehouse matches the data in the source, or point of extraction.
“It is a python package, and the test can be embedded inside of any orchestration or scheduling tool to determine if two databases contain the same data. If there is a mismatch, it very quickly determines where it is and surfaces that in your CLI or it can materialize it in a database,” said Gleb Mezhanskiy, CEO and founder of Datafold. “This makes it easy to fix any inconsistencies and be confident that you are not losing or misrepresenting any data from a source database.”
The solution can perform a row-level comparison of tables in a matter of seconds. Previously, data engineering teams had to rely on manual one-off checks and time-consuming investigations using makeshift diff tools to rule out discrepancies and ensure 100% integrity of data replicated through syncing tools.
“Open-source data-diff relies on hashing and state-of-the-art search algorithms to efficiently identify diverging rows at scale. It takes open-source data-diff only ten seconds to fully compare tables with 25M rows, and less than 5 minutes to perform the comparison for a 1-billion-row dataset – approximately the same time it would take to run a query simply counting rows,” Mezhanskiy said.
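To give a rough sense of how hashing and search can combine to find diverging rows, here is a minimal Python sketch. It is not data-diff's actual implementation: the function names are hypothetical, the two "tables" are plain dictionaries keyed by an integer primary key, and the checksums are computed in Python, whereas a real tool would presumably compute them inside each database. The idea is to compare a checksum over a whole key range, skip the range if both sides match, and otherwise split it in half and repeat.

```python
import hashlib


def segment_checksum(table, lo, hi):
    """Hash every (key, row) pair with lo <= key < hi, in key order."""
    digest = hashlib.md5()
    for key in sorted(k for k in table if lo <= k < hi):
        digest.update(repr((key, table[key])).encode("utf-8"))
    return digest.hexdigest()


def find_diverging_keys(source, target, lo, hi, min_segment=4):
    """Return the sorted keys in [lo, hi) whose rows differ between the tables."""
    if segment_checksum(source, lo, hi) == segment_checksum(target, lo, hi):
        return []  # the whole segment matches, so there is no need to look inside
    if hi - lo <= min_segment:
        # Segment is small enough to compare row by row.
        keys = {k for k in source if lo <= k < hi}
        keys |= {k for k in target if lo <= k < hi}
        return sorted(k for k in keys if source.get(k) != target.get(k))
    # Checksums differ on a large segment: bisect and search each half.
    mid = (lo + hi) // 2
    return (find_diverging_keys(source, target, lo, mid, min_segment)
            + find_diverging_keys(source, target, mid, hi, min_segment))


# Hypothetical data: a 1,000-row source table and a replica with two problems.
source = {i: (f"user{i}", i * 10) for i in range(1, 1001)}
target = dict(source)
del target[137]                  # a row lost during replication
target[802] = ("user802", -1)    # a row whose value was corrupted

print(find_diverging_keys(source, target, 1, 1001))  # prints [137, 802]
```

When the tables are nearly identical, most key ranges are ruled out by a single checksum comparison, which is why an approach along these lines can stay close to the cost of a full scan rather than a row-by-row transfer.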
Extension of existing offering
The offering comes as an extension of Datafold’s original diffing tool, which performs automated testing to check the effect of code changes (e.g., to dbt SQL code) after the transformation step. It analyzes how a change to the code impacts the data produced throughout the entire data pipeline.
“Open-source data-diff adds cross-database diffing functionality, thereby expanding the covered use cases from testing only transformations (T of ELT) to validating data replication (EL of ELT) throughout the entire data platform,” the CEO added.
The solution is available starting today under an MIT license and includes connectors for Postgres, MySQL, Snowflake, BigQuery, Redshift, Presto and Oracle.
The company, which raised $20 million in November 2021, said it also plans to invite contributors to build connectors for other data sources. Other leading companies working within the data reliability sector include Bigeye and Monte Carlo.