Fivetran announced yesterday the release of an API designed to push data pipeline metadata into data catalogs. By adding to the already rich store of metadata contained in catalogs such as Collibra and Alation, the API aims to improve data quality and data governance.
The metadata API is useful for tracking changes that occur to data in flight between source and target systems. There is also functionality for determining changes that occur in sources before data actually moves, which is critical for preserving regulatory compliance.
According to Meera Viswanathan, Fivetran senior product manager, many of these capabilities hinge on the fact that “what the API offers is source column to destination column mapping.”
As such, it has the potential to pinpoint even minute changes in schema and naming conventions in tables. Pairing this information with data lineage graphs aids impact analysis, so companies can fully understand the repercussions of changes made from source to target systems via data pipelines.
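To illustrate how source-column-to-destination-column mappings can expose small schema changes, here is a minimal Python sketch that compares two mapping snapshots. It does not call Fivetran's API. The `ColumnMapping` type, the `diff_mappings` function, and the sample table and column names are all hypothetical, and the snapshots are assumed to have already been fetched by some other means.

```python
from dataclasses import dataclass


@dataclass(frozen=True)
class ColumnMapping:
    """One source-to-destination column mapping (hypothetical shape)."""
    source_table: str
    source_column: str
    dest_table: str
    dest_column: str


def diff_mappings(before, after):
    """Compare two mapping snapshots, keyed by (source_table, source_column)."""
    old = {(m.source_table, m.source_column): m for m in before}
    new = {(m.source_table, m.source_column): m for m in after}
    return {
        # Source columns that appear only in the newer snapshot.
        "added": sorted(k for k in new if k not in old),
        # Source columns that disappeared from the newer snapshot.
        "removed": sorted(k for k in old if k not in new),
        # Source columns whose destination table or column changed.
        "retargeted": sorted(k for k in new if k in old and new[k] != old[k]),
    }


# Illustrative snapshots; these names are assumptions, not real schemas.
yesterday = [
    ColumnMapping("opportunity", "amount", "sf_opportunity", "amount"),
    ColumnMapping("opportunity", "stage", "sf_opportunity", "stage"),
]
today = [
    ColumnMapping("opportunity", "amount", "sf_opportunity", "amount_usd"),
    ColumnMapping("opportunity", "owner_email", "sf_opportunity", "owner_email"),
]

changes = diff_mappings(yesterday, today)
```

In this sketch, a renamed destination column shows up under `retargeted`, which is the kind of signal a catalog could feed into impact analysis.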
“Organizations weren’t able to pull any of this information in the past,” Viswanathan said. “They had some information, but it was very disparate. They could say: here are some Fivetran assets. Mapping the data from source to destination was never possible in the past.”
The metadata API is appropriate for organizations with established data governance workflows in place, especially those pertaining to data access, data privacy, and regulatory adherence. By providing fine-grained metadata about data’s journey inside pipelines, this resource expands the visibility and monitoring necessary for data governance into these channels. By “helping customers understand what’s happening within the pipeline, they can then enforce the right policies,” Viswanathan commented. “I very strongly believe that the earliest stage data governance can be applied is the pipeline, because the data is at rest when it’s in the source.”
Near the end of the year, Fivetran is projected to add capabilities to the metadata API so users can detect schema changes before data even moves. If someone unversed in a dataset’s compliance requirements accidentally adds a PII column to it, for example, security and governance teams can observe this change in data catalogs. They can then act to prevent the person who changed the dataset from moving the data and violating compliance mandates. “If I go and unblock a column or block a column that’s in the platform, if I can surface this information in a data catalog, which is where most of our data governance and security team sits, they can stop this request from going through,” Viswanathan noted.
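The kind of check a governance team might run on such schema-change metadata can be sketched in a few lines of Python. This is an illustration under stated assumptions, not Fivetran's or any catalog's implementation: the name-based `PII_NAME_HINTS` pattern, the `new_sensitive_columns` function, and the column names are hypothetical, and real PII detection usually relies on more than column names.

```python
import re

# Hypothetical name hints; a real policy would be maintained by the
# governance team and would likely use classification, not just names.
PII_NAME_HINTS = re.compile(r"(ssn|email|phone|birth|passport)", re.IGNORECASE)


def new_sensitive_columns(approved, observed):
    """Return newly observed columns whose names suggest PII.

    approved: column names already reviewed for this dataset.
    observed: column names currently reported for the source.
    """
    new_columns = set(observed) - set(approved)
    return sorted(c for c in new_columns if PII_NAME_HINTS.search(c))


# Illustrative data: one harmless new column and one that looks like PII.
approved_columns = {"id", "amount", "close_date"}
observed_columns = {"id", "amount", "close_date", "region", "Customer_Email"}

flagged = new_sensitive_columns(approved_columns, observed_columns)
```

A non-empty `flagged` list is the point at which a reviewer could hold the sync until the new column is approved or blocked.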
The metadata API also has considerable implications for data quality. Although it doesn’t address data quality in terms of mastering data or standardizing how addresses are written in systems, for example, it can add to data’s trustworthiness. Analysts looking at sales information in a cloud data warehouse may wonder where certain numbers came from. Catalog information supplied by the metadata API can help users answer that question and determine whether the numbers themselves are trustworthy. In this respect, it “helps you drive that line between saying this is how your data moved, this is the tool that was used, these are the owners within the pipeline of the data,” Viswanathan explained. “So, people can then start mapping that information from source to destination.”
This metadata is most useful when the data catalogs that receive it contain data lineage graphs that let users visualize this and other pertinent information. Viswanathan described a use case in which an analyst wanted to evaluate the basic data quality of revenue figures in Looker. Now, they can “pull this information and visualize it in an end-to-end lineage graph where you can see my revenue number went from this Salesforce column to this destination column within Snowflake,” Viswanathan mentioned. “It went through these transformations within Snowflake and then it got exposed in Looker. So, you really can trace your data all the way down to its source.”
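The trace Viswanathan describes amounts to walking a lineage graph upstream from a reported figure to its origin. The Python sketch below shows that walk over an in-memory graph. The `trace_to_sources` function, the `upstream` dictionary format, and the node names are assumptions made for illustration; they do not reflect how any particular catalog stores lineage.

```python
def trace_to_sources(node, upstream):
    """Walk a lineage graph upstream and return the original source nodes.

    upstream maps each node to the list of nodes it was derived from.
    A node with no recorded parents is treated as an original source.
    """
    seen = set()
    roots = set()
    stack = [node]
    while stack:
        current = stack.pop()
        if current in seen:
            continue  # Guard against cycles and repeated paths.
        seen.add(current)
        parents = upstream.get(current, [])
        if not parents:
            roots.add(current)
        stack.extend(parents)
    return roots


# Hypothetical lineage for a revenue figure shown in a BI dashboard.
lineage = {
    "looker.revenue": ["snowflake.analytics.revenue"],
    "snowflake.analytics.revenue": ["snowflake.raw.opportunity.amount"],
    "snowflake.raw.opportunity.amount": ["salesforce.opportunity.amount"],
}

sources = trace_to_sources("looker.revenue", lineage)
```

Here the raw-to-source edge is the part that pipeline metadata would contribute; the warehouse and BI edges would come from other tools feeding the same catalog.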
The savvy management of metadata has always been an integral component of data governance and data quality. Fivetran’s metadata API extends these dimensions of data governance, and the visibility upon which they’re predicated, into data pipelines that were previously opaque. This degree of transparency is useful for many aspects of data governance, from regulatory compliance to access controls and data modeling.