
Tracing Coarse-Grained and Fine-Grained Data Lineage in Data Lakes: Automated Capture, Modeling, Storage, and Visualization

Shinoy Vengaramkode Bhaskaran research paper 2021

Abstract

In contemporary data-driven organizations, data lakes have emerged as large-scale, flexible repositories that integrate heterogeneous data sources, ranging from raw transactional logs to refined analytical tables. As these ecosystems grow in complexity, understanding data lineage, i.e., the end-to-end provenance and transformations that data undergo, becomes necessary for ensuring data quality, regulatory compliance, and stakeholder trust. This paper offers a comprehensive technical overview of approaches to capturing, modeling, storing, and visualizing data lineage in modern data lakes, with an emphasis on distinguishing coarse-grained lineage (dataset-level traces) from fine-grained lineage (record- or cell-level provenance). We begin by examining automated lineage capture techniques, including instrumentation of ETL and data pipeline frameworks, logical query parsing, and runtime provenance tagging. Each technique we discuss involves trade-offs in performance, accuracy, and integration complexity. We then describe strategies for modeling lineage at multiple levels of abstraction, from high-level DAG-based dependencies across datasets to detailed provenance graphs for individual records. Storing and querying lineage at fine granularity raises scalability challenges, which prompt solutions such as compression, hierarchical aggregation, and delta-based referencing. Finally, we explore state-of-the-art visualization and interaction methodologies, discussing how graph-based dashboards, hierarchical drill-down views, and interactive queries help users quickly locate the root causes of data issues, assess the impact on downstream artifacts, and support reproducibility.

Keywords

Data governance, Data lakes, Data lineage, Data provenance, ETL frameworks, Provenance visualization, Scalability challenges
