Discover the robust features that make GknowsysDQ a top-tier data quality monitoring tool.
Receive instant alerts when data issues occur, ensuring reliable data processing.
Prevent data corruption by halting data processing when issues are detected.
Choose from a library of quality rules across data sources and stages.
Easily integrate GknowsysDQ into existing pipelines using our extensive APIs and our Warehouse and Data Lake integrations.
Choose what access you share with GknowsysDQ and what metadata is captured, or use access-less integrations.
Stay informed of the quality issues via SMS, email, or a workflow management system.
Each data processing step is identified as a process. Each stage the data goes through, such as ingestion, processing, and reporting, is identified as a component. Each component type has expected behaviour and quality check types in GknowsysDQ.
Relationships between components show their interdependence for the purpose of calculating quality. By nature, this is the inverse of the execution order represented in orchestrator tools like Apache Airflow.
A pipeline is a logical representation of the data processing flow, represented as a DAG. Pipelines in GknowsysDQ are more nuanced than those in orchestration tools like Apache Airflow, in that they also represent logical groups and stages.
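The concepts above can be sketched in code. This is a minimal illustration, not the GknowsysDQ API: the `Component` class, the component names, and the dependency traversal are all assumptions made for clarity. It shows how quality dependencies run in the inverse of the execution order.

```python
# Illustrative sketch only: Component and the names below are hypothetical,
# not GknowsysDQ's actual data model.
from dataclasses import dataclass, field

@dataclass
class Component:
    name: str
    kind: str  # e.g. "ingestion", "processed", "report"
    upstream: list = field(default_factory=list)  # quality dependencies

# Execution order: ingestion -> processed -> report.
ingestion = Component("raw_events", "ingestion")
processed = Component("clean_events", "processed", upstream=[ingestion])
report = Component("daily_report", "report", upstream=[processed])

def quality_dependencies(c: Component) -> list:
    """Walk the DAG in quality order: from a component back to its sources."""
    deps = []
    for up in c.upstream:
        deps.append(up.name)
        deps.extend(quality_dependencies(up))
    return deps

print(quality_dependencies(report))  # ['clean_events', 'raw_events']
```

Note that while an orchestrator would execute `raw_events` first, the quality of `daily_report` is resolved by walking the same edges in reverse.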
Like tests, checks are executable expectations of the data quality. There is a library of checks to choose from, some checks are built in based on the component type, and it is also possible to write your own.
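As a rough sketch of what an executable check looks like, the snippet below defines a tiny `Check` class and two example rules. The class, the rule names, and the dataset statistics are hypothetical, assumed only for illustration; they do not reflect GknowsysDQ's actual check API.

```python
# Hypothetical sketch of executable checks; not GknowsysDQ's actual API.
from typing import Callable

class Check:
    def __init__(self, name: str, predicate: Callable[[dict], bool], weight: float = 1.0):
        self.name = name
        self.predicate = predicate  # expectation over dataset statistics
        self.weight = weight        # importance of this check

    def run(self, dataset: dict) -> bool:
        return self.predicate(dataset)

# A built-in-style check for an ingestion component: rows must be non-empty.
non_empty = Check("non_empty", lambda ds: ds["row_count"] > 0)
# A custom check: null ratio must stay below a threshold.
low_nulls = Check("low_nulls", lambda ds: ds["null_ratio"] < 0.05, weight=2.0)

dataset_stats = {"row_count": 1200, "null_ratio": 0.01}
results = {c.name: c.run(dataset_stats) for c in (non_empty, low_nulls)}
print(results)  # {'non_empty': True, 'low_nulls': True}
```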
A quality rating generated by GknowsysDQ based on the checks applied. It is a comprehensive metric calculated by weighted resolution of the ratings of all dependencies and the checks applied.
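One plausible form of such a weighted resolution is sketched below. The formula, a weighted average of check outcomes and upstream ratings, is an assumption for illustration only; GknowsysDQ's actual algorithm is not published here.

```python
# Illustrative DCR-style weighted resolution; the blending formula is an
# assumption, not GknowsysDQ's published algorithm.
def dcr(check_results, dependency_ratings):
    """check_results: list of (passed: bool, weight);
    dependency_ratings: list of (rating 0..100, weight)."""
    total_w = sum(w for _, w in check_results) + sum(w for _, w in dependency_ratings)
    score = sum((100.0 if passed else 0.0) * w for passed, w in check_results)
    score += sum(r * w for r, w in dependency_ratings)
    return score / total_w

# Two checks of weight 1 (one failed) and one upstream dataset rated 90
# with weight 2: (100*1 + 0*1 + 90*2) / 4 = 70.0
print(dcr([(True, 1.0), (False, 1.0)], [(90.0, 2.0)]))  # 70.0
```

The weights let an important failed check drag the rating down more than a cosmetic one, and the dependency terms are how upstream quality feeds into a dataset's own rating.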
gknowsysDQ is a cutting-edge quality monitoring and assessment tool, born from our extensive experience with processing vast, multi-tenant data through complex, multi-step processing pipelines. This tool aims to shift the traditional focus from merely processing and logging to analysing data states and their transitions, ensuring a high standard of data quality and reliability.
Historically, monitoring has focused on process logs, which are inadequate for detecting issues in dynamic datasets; even more so when processing happens in recurring cycles. With the sheer volume of data, processing often involves multiple steps, and intermediate steps are usually non-human-readable. Typically, errors are identified only at the final reporting stage. This is particularly problematic in SaaS systems, where scrutinizing every process report for each tenant is impractical. Such oversight can lead to errors in final reports, eroding user trust in the platform.
gknowsysDQ addresses these challenges by automating data QA in multi-tenant, multi-step data processing systems using a "fail-fast, fail-early" approach. This enhances data quality assurance and helps build trust in your data. gknowsysDQ takes an approach similar to log monitoring tools but focuses on data and data pipelines. In data-heavy applications, traditional log monitoring tools are inadequate because the output is data, not logs. The key message is to monitor the data itself, not just the logs.
gknowsysDQ offers robust data quality assurance and monitoring capabilities, complemented by an extensive alerting and notification system and event subscription mechanisms. It integrates seamlessly with various orchestrators and connects to a wide range of data stores. By performing live, synchronized data analysis integrated with the orchestrators, gknowsysDQ provides flexibility for data processing involvement. It also includes APIs that can be utilized in any codebase to trigger data quality assessments as needed, retrieve results, and make informed decisions.
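To make the API-driven, fail-fast usage concrete, here is a hedged sketch of how a pipeline task might trigger an assessment and decide whether to halt. The endpoint path, payload fields, response field `dcr`, and the threshold are all assumptions for illustration; consult the actual gknowsysDQ API documentation for real names.

```python
# Hypothetical integration sketch; endpoint, payload, and response fields
# are assumptions, not gknowsysDQ's documented API.
import json
import urllib.request

BASE_URL = "https://dq.example.com/api/v1"  # placeholder endpoint

def trigger_assessment(pipeline: str, component: str) -> dict:
    """POST an assessment request and return the parsed JSON result."""
    payload = json.dumps({"pipeline": pipeline, "component": component}).encode()
    req = urllib.request.Request(
        f"{BASE_URL}/assessments", data=payload,
        headers={"Content-Type": "application/json"}, method="POST",
    )
    with urllib.request.urlopen(req) as resp:
        return json.load(resp)

def should_halt(assessment: dict, threshold: float = 80.0) -> bool:
    """Fail fast when the returned rating falls below a threshold."""
    return assessment.get("dcr", 0.0) < threshold

# Inside an orchestrator task (not executed here):
# result = trigger_assessment("billing", "clean_events")
# if should_halt(result):
#     raise RuntimeError("Data confidence too low; failing fast")
```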
A significant differentiator of gknowsysDQ is its approach to validating a dataset's quality. In typical scenarios, datasets are interdependent, meaning a quality drop in one can affect others downstream. Traditional data quality tools often fail to highlight this impact. gknowsysDQ assigns a Data Confidence Rating (DCR) to a dataset, reflecting the percentage of failed checks, the importance of the failed checks, the dataset's relationships, and its contributions to downstream datasets. This DCR acts like a ripple in a pond, propagating the impact of a failed check through the pipeline to the DCR of the final report.
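The ripple effect can be sketched as a recursive walk over the dependency graph. The blending rule below (a dataset's rating is averaged with the mean rating of its upstream datasets) is an illustrative assumption, not gknowsysDQ's actual propagation formula, but it shows how one upstream failure dampens every downstream DCR.

```python
# Illustrative DCR propagation; the averaging rule is an assumption made
# for this sketch, not gknowsysDQ's actual formula.
def propagate_dcr(node, graph, own_rating):
    """graph: node -> list of upstream nodes; own_rating: node -> 0..100."""
    ups = graph.get(node, [])
    if not ups:
        return own_rating[node]
    upstream_avg = sum(propagate_dcr(u, graph, own_rating) for u in ups) / len(ups)
    return (own_rating[node] + upstream_avg) / 2

graph = {"report": ["processed"], "processed": ["ingestion"], "ingestion": []}
# All checks pass except at ingestion, which scores 60 on its own checks.
ratings = {"ingestion": 60.0, "processed": 100.0, "report": 100.0}

# The ingestion failure ripples downstream:
# processed = (100 + 60) / 2 = 80; report = (100 + 80) / 2 = 90
print(propagate_dcr("report", graph, ratings))  # 90.0
```

Even though every check on the report itself passes, its DCR is pulled below 100 by the upstream failure, which is exactly the signal traditional per-dataset tools miss.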
At its core, gknowsysDQ is a powerful rule engine with integrations into various data sources and alerting systems, offering flexibility in the use cases it can serve and how it is applied.