How to improve data quality through validation and quality checks 

Intermediate · Analysis · Data Management · Quality · 5 Steps · Last updated: July 12, 2024
This guide explains data validation and cleaning processes and their value in making your data more useful.

One of the most fundamental challenges in deriving trustworthy insights is ensuring that quality data is being used. When data quality is poor, confidence in the data is lacking, insights may be misleading, and decisions and progress are ultimately impeded.

To improve quality, it is valuable to develop good habits in designing validation into spreadsheets and to implement consistent data cleaning processes. This guide will help you understand these processes and best practices, and their value in building confidence in, and the usefulness of, your data.

Guide-Specific Disclaimer

This guide focuses on data validation within spreadsheets, as spreadsheets are among the most commonly used data storage and analysis applications. However, data validation can occur in all types of databases and other systems you use to collect and store your data. Understanding data validation for your specific tools and applications is crucial to improving data quality.

Determine when to develop data validations 

Data validation ensures data quality at the moment data is entered into a spreadsheet, system, or database. During this process, requirements on the data being entered are used to check that inputs meet certain criteria. In this way, validation prevents errors, inconsistencies, and inaccuracies from reaching your dataset.

Data validation is possible when there are clear parameters or limitations on the type of data being entered.

A few common types of data that are best suited for validation are listed below, with a short code sketch after the list:

  • Numeric Data: Numeric data can benefit from validation rules that enforce numeric constraints, such as minimum and maximum values, allowed ranges, or specific formats. 
  • Text Data: Textual data can be validated using rules that enforce character limits, required formats (e.g., uppercase, lowercase, title case), or patterns (e.g., email addresses, phone numbers). 
  • Date and Time Data: Date and time data can be validated to ensure that entries fall within specific date ranges, conform to standard date/time formats, or meet other temporal constraints. 
  • Categorical Data: Categorical data, consisting of discrete categories or labels, can be validated using rules that restrict data entry to predefined lists of valid options. This includes fields such as ethnicities, donor types, or educational sectors. 
  • Boolean Data: Boolean data, representing binary values (e.g., true/false, yes/no, 0/1), can be validated to ensure that only valid boolean values are entered. 
  • Unique Identifiers: Unique identifiers, such as customer IDs, order numbers, or employee IDs, can be validated to ensure uniqueness across the dataset. 
  • Geospatial Data: Geospatial data, such as coordinates or addresses, can be validated to ensure that entries are properly formatted, geographically accurate, and consistent with reference datasets (e.g., postal code databases).
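
To make these categories concrete, here is a minimal sketch of what entry-time checks can look like in code. It assumes a Python environment; the field names and rules are hypothetical examples, not a prescribed schema.

```python
import re
from datetime import date

# Hypothetical entry-time checks; each returns True when the value passes.

def valid_age(value):
    """Numeric: enforce minimum and maximum values."""
    return isinstance(value, (int, float)) and 0 <= value <= 120

def valid_email(value):
    """Text: pattern matching with a simple (not exhaustive) regex."""
    return bool(re.fullmatch(r"[^@\s]+@[^@\s]+\.[^@\s]+", value))

def valid_enrollment_date(value):
    """Date: entries must fall within a specific date range."""
    return date(2020, 1, 1) <= value <= date.today()

def valid_donor_type(value):
    """Categorical: restrict entry to a predefined list of options."""
    return value in {"individual", "foundation", "corporate"}

def valid_consent(value):
    """Boolean: only accept valid true/false values."""
    return value in (True, False)

def all_ids_unique(ids):
    """Unique identifiers: ensure uniqueness across the dataset."""
    return len(ids) == len(set(ids))

print(valid_email("info@example.org"))  # True
print(valid_donor_type("government"))   # False: rejected at entry time
```

Checks like these can run as data is captured (for example, in a form backend) or as a first pass over a submitted batch.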

When designing a spreadsheet, system, or database for data entry, knowing when data validation is an option is the first step in crafting validation rules and improving data quality. 

Data Quality Characteristics: 

  • Timeliness – Is my data recent enough to be useful for my purposes? 
  • Accuracy – Does my data reliably reflect the ground truth? 
  • Completeness – How much of my data is missing? 
  • Uniqueness – Is there confusing duplication in my data source? 
  • Consistency – Are data fields saved in a consistent format and data type? 
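
Several of these characteristics can be approximated as simple metrics over a table. The sketch below is one rough way to quantify them, assuming a pandas DataFrame with a datetime column (hypothetically named `updated_at`); accuracy is omitted because it generally requires comparison against a trusted reference.

```python
import pandas as pd

def quality_scorecard(df: pd.DataFrame, timestamp_col: str) -> dict:
    """Rough proxies for some data quality characteristics."""
    return {
        # Timeliness: how old is the most recent record?
        "days_since_last_update": (pd.Timestamp.now() - df[timestamp_col].max()).days,
        # Completeness: share of cells that are populated.
        "completeness": 1 - df.isna().mean().mean(),
        # Uniqueness: share of rows that are not duplicates.
        "uniqueness": 1 - df.duplicated().mean(),
        # Consistency: 'object' columns often hide mixed formats and types.
        "object_typed_columns": int((df.dtypes == "object").sum()),
    }

# e.g., quality_scorecard(df, timestamp_col="updated_at")
```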

Develop and implement data validation rules 

Now that you’ve identified where data validation is appropriate, it is important to understand how to design and implement validation rules when working with spreadsheets. While there is a vast array of resources available for developing validation rules, we have provided a few within this section to help support your work.

Validation is an essential part of creating a data entry process that leads to accurate, consistent, and complete datasets. This validated data then requires less post-entry review and cleanup, ultimately saving your organization time and money.  

After developing the validation rules within your spreadsheet, it is important to test them with your colleagues. There are typically many places where errors can arise while creating validation rules, and you want to catch these before releasing the spreadsheet for data entry across your organization, community, or group of key stakeholders.

Data validation typically takes the following forms, illustrated in the spreadsheet sketch after the list:

  • Data Type Validation: Ensuring that data entered into a field matches the expected data type, such as text, numbers, dates, or times.
  • Range Validation: Restricting data entry to values within a specified range, such as numeric ranges, date ranges, or ranges of acceptable text values. 
  • List Validation: Limiting data entry to a predefined list of acceptable values. This is often implemented using dropdown lists or picklists to provide users with a selection of valid options. 
  • Pattern Matching Validation: Validating data based on specific patterns or formats, such as phone numbers, email addresses, postal codes, or identification numbers. 
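
If your data entry happens in Excel files, rules of these forms can be attached to a sheet programmatically. The sketch below uses the openpyxl library (an assumption about your tooling); the columns, ranges, and allowed values are hypothetical examples.

```python
from openpyxl import Workbook
from openpyxl.worksheet.datavalidation import DataValidation

wb = Workbook()
ws = wb.active
ws.append(["participant_id", "donor_type", "age"])  # header row

# List validation: restrict donor_type to a dropdown of valid options.
donor_dv = DataValidation(
    type="list",
    formula1='"Individual,Foundation,Corporate"',
    allow_blank=True,
)
donor_dv.errorTitle = "Invalid donor type"
donor_dv.error = "Please choose a donor type from the dropdown."

# Range validation: age must be a whole number between 0 and 120.
age_dv = DataValidation(type="whole", operator="between",
                        formula1="0", formula2="120")
age_dv.errorTitle = "Invalid age"
age_dv.error = "Age must be a whole number between 0 and 120."

ws.add_data_validation(donor_dv)
ws.add_data_validation(age_dv)
donor_dv.add("B2:B1000")  # apply to the data-entry rows
age_dv.add("C2:C1000")

wb.save("intake_form.xlsx")  # hypothetical output file
```

Spreadsheet applications expose the same ideas through their menus (for example, Data > Data validation in Google Sheets), so the approach carries over even if you never touch code.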

Identify common data quality issues within a dataset 

Now that you have developed your data entry or collection tool and datasets are being created, the next step is to identify data quality issues to correct during a review or audit process.

When reviewing your datasets, a few common types of quality issues to look out for include the following (the sketch after the list shows how some can be flagged programmatically):

  • Missing or Incomplete Data: Gaps in certain attributes or fields can lead to incomplete analysis and biased results.
  • Inaccurate Data: Errors can stem from human error, data entry mistakes, or system malfunctions.
  • Duplicate Data: Repeated records within a dataset can skew analysis results and waste computational resources.
  • Inconsistent Data: Variations in data formats, units of measurement, or coding conventions can make it challenging to perform meaningful analysis and comparisons across different records.
  • Biased Data: Data may reflect systemic biases or inaccuracies that skew analysis results and decision-making processes. Addressing bias in data collection and analysis is crucial for ensuring fairness and equity.
  • Outdated Data: Stale data may no longer accurately reflect the current state of affairs, leading to incorrect conclusions and ineffective decision-making.
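
A minimal review pass over a tabular dataset might look like the following sketch, assuming pandas and hypothetical file and column names. Note that biased and outdated data usually require human judgment and context rather than a one-line check.

```python
import pandas as pd

df = pd.read_csv("survey_responses.csv")  # hypothetical dataset

# Missing or incomplete data: fields with the largest share of gaps.
print(df.isna().mean().sort_values(ascending=False).head())

# Duplicate data: fully repeated rows, or repeated IDs that should be unique.
dup_rows = df[df.duplicated(keep=False)]
dup_ids = df[df["participant_id"].duplicated(keep=False)]

# Inconsistent data: the same category spelled several different ways.
print(df["donor_type"].str.strip().str.lower().value_counts())

# Inaccurate data: values outside a plausible range deserve review.
suspect_ages = df[(df["age"] < 0) | (df["age"] > 120)]
```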

Implement best practices regarding data quality checks 

Given the number of data quality issues that may arise, there are best practices for mitigating the risk of errors, inaccuracies, and other quality issues in your datasets.  

The first step is to define the frequency of your data quality reviews. We do recommend regular reviews, but the frequency depends on multiple factors, including:  

  • value of the data for your organization or programs 
  • frequency of data updates 
  • risk of human error in data inputs 
  • stakeholder needs 
  • resources available to handle the effort  

When improving data quality, you should consider: 

  • Defining Metrics: Clearly define the data quality metrics that matter for your organization, such as accuracy, completeness, consistency, timeliness, and validity. Establishing clear quality criteria will help guide your data quality checks.
  • Validating Data (discussed in Steps 1 and 2): Validate data against rules and requirements to ensure that it meets the necessary criteria for accuracy and completeness.  
  • Implementing Data Governance Practices: Establish data governance practices to ensure accountability, ownership, and stewardship of the data quality process. Define roles for data quality management and establish processes for data monitoring and escalation of issues. 
  • Educating and Training Users: Provide education and training to users who interact with the data to ensure that they understand data quality requirements and best practices for data entry, manipulation, and analysis. 
  • Conducting Data Cleansing: Perform data cleansing activities to correct errors, remove duplicates, and standardize data formats. This may involve techniques such as deduplication, data transformation, and outlier detection, illustrated in the sketch after this list.
  • Iteratively Improving Data Quality Processes: Continuously review and refine your data quality processes based on feedback and lessons learned. Regularly assess the effectiveness of your data quality checks and make adjustments as needed to improve data quality over time. 
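
As one illustration of the cleansing step, the sketch below deduplicates on an ID, standardizes a text field, and flags outliers using a simple interquartile-range rule. It assumes pandas; the file name, columns, and thresholds are hypothetical.

```python
import pandas as pd

df = pd.read_csv("survey_responses.csv")  # hypothetical dataset

# Deduplication: keep the first occurrence of each unique ID.
df = df.drop_duplicates(subset="participant_id", keep="first")

# Standardization: normalize a categorical text field.
df["donor_type"] = df["donor_type"].str.strip().str.lower()

# Outlier detection: flag values beyond 1.5x the interquartile range.
q1, q3 = df["donation_amount"].quantile([0.25, 0.75])
iqr = q3 - q1
is_outlier = (df["donation_amount"] < q1 - 1.5 * iqr) | (
    df["donation_amount"] > q3 + 1.5 * iqr
)
df["donation_outlier"] = is_outlier  # flag for human review; don't auto-delete

df.to_csv("survey_responses_clean.csv", index=False)
```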

Apart from the last two points, this list focuses on pre-empting data quality issues rather than fixing them. Pre-empting these issues typically takes far fewer resources, creates greater confidence in the data, and leads to fewer challenges than reviewing and fixing quality problems after the fact.

‘So what’ and next steps 

After following the steps, and particularly the best practices, outlined in this guide, you should have a strong enough understanding to improve your data quality processes before, during, and after data entry.

The next step is to prioritize where to start implementing these data quality efforts, define who is responsible for managing them, and determine what adjustments need to be made within your spreadsheets, databases, and/or other systems to optimize data integrity through validation and other approaches.

With improved data quality, your organization may be ready to add more data-oriented roles to your team. If you want to learn more about data jobs, see the guide "How to appropriately staff an organization to meet its data needs?"

