Custom Data Checks Module Documentation
Introduction
This module provides a set of functions for validating the integrity of tabular data such as CSV files. The checks validate rows, columns, and individual values against given criteria.
Key Features
- Column Validation: Ensures uniform row lengths and column existence.
- Uniqueness Check: Detects duplicate values in specified columns.
- Null Checks: Validates that specified columns do not contain null values.
- Numeric Range Validation: Confirms numeric values fall within a specified range.
- String Pattern Matching: Verifies strings match a specified regex pattern.
- Date Format Validation: Ensures dates adhere to a given format.
- Datatype Validation: Confirms values match the expected datatype.
- Value Set Check: Ensures values exist within a specified set.
Class Initialization
To use the CustomDataChecks utilities, you first need to create an instance of the class:
from bigquery_advanced_utils.utils.custom_data_checks import CustomDataChecks
custom_data_checks_util = CustomDataChecks()
The CustomDataChecks class does not require any parameters during initialization. Once instantiated, you can call its methods to run the data validation checks described below.
Methods Overview
check_columns
Validates that all rows in a dataset have the same number of columns as the header.
Parameters
- idx (int): Row number.
- row (list): List of values in the row.
- header (list): List of column names.
- column_sums (list): (Unused) Placeholder for memory storage.
Raises
ValueError: If row lengths do not match the header length.
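The row-length comparison described above can be sketched as follows. This is a minimal standalone illustration, not the library's implementation: `sketch_check_columns` is a hypothetical stand-in that mirrors the documented parameters, and the error message wording is an assumption.

```python
def sketch_check_columns(idx, row, header, column_sums=None):
    """Hypothetical stand-in: raise ValueError when a row's length differs from the header's."""
    if len(row) != len(header):
        raise ValueError(
            f"Row {idx}: expected {len(header)} columns, got {len(row)}"
        )


header = ["id", "name", "email"]
rows = [
    ["1", "Ada", "ada@example.com"],  # same length as the header
    ["2", "Grace"],                   # one value short
]

# Collect the messages instead of stopping at the first failure.
errors = []
for idx, row in enumerate(rows, start=1):
    try:
        sketch_check_columns(idx, row, header)
    except ValueError as exc:
        errors.append(str(exc))

print(errors)
```

Only the second row is reported, because it has two values against a three-column header.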
check_unique
Ensures specified columns contain unique values across all rows.
Parameters
- idx (int): Row number.
- row (list): List of values in the row.
- header (list): List of column names.
- column_sums (list): List of sets for tracking seen values per column.
- columns_to_test (list, optional): Columns to validate for uniqueness.
Raises
ValueError: If duplicate values are detected.
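Because uniqueness can only be judged across rows, this check is the one that relies on `column_sums` as shared state. The sketch below shows one way that could work. `sketch_check_unique` is a hypothetical stand-in, and two details are assumptions rather than documented behavior: `column_sums` holds one set per header position, and omitting `columns_to_test` means every column is tested.

```python
def sketch_check_unique(idx, row, header, column_sums, columns_to_test=None):
    """Hypothetical stand-in: raise ValueError when a value was already seen in its column."""
    for name in columns_to_test or header:  # assumption: default to all columns
        position = header.index(name)
        value = row[position]
        seen = column_sums[position]  # assumption: one set per header position
        if value in seen:
            raise ValueError(
                f"Row {idx}: duplicate value {value!r} in column {name!r}"
            )
        seen.add(value)


header = ["id", "email"]
column_sums = [set() for _ in header]  # state shared across all rows

sketch_check_unique(1, ["1", "a@example.com"], header, column_sums, ["id"])
sketch_check_unique(2, ["2", "a@example.com"], header, column_sums, ["id"])

try:
    # "1" was already seen in the id column on row 1.
    sketch_check_unique(3, ["1", "b@example.com"], header, column_sums, ["id"])
    duplicate_found = False
except ValueError:
    duplicate_found = True
```

The repeated email on row 2 is not flagged, since only the `id` column is listed in `columns_to_test`.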
check_no_nulls
Validates that specified columns do not contain null or empty values.
Parameters
- idx (int): Row number.
- row (list): List of values in the row.
- header (list): List of column names.
- column_sums (list): (Unused) Placeholder for memory storage.
- columns_to_test (list, optional): Columns to check for null values.
Raises
ValueError: If null values are found.
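A minimal sketch of a null check along these lines is shown here. `sketch_check_no_nulls` is a hypothetical stand-in, and treating both `None` and whitespace-only strings as null is an assumption; the library may define "null or empty" differently.

```python
def sketch_check_no_nulls(idx, row, header, column_sums=None, columns_to_test=None):
    """Hypothetical stand-in: raise ValueError if a tested column is None or blank."""
    for name in columns_to_test or header:  # assumption: default to all columns
        value = row[header.index(name)]
        # Assumption: None and whitespace-only strings both count as null.
        if value is None or str(value).strip() == "":
            raise ValueError(f"Row {idx}: null value in column {name!r}")


header = ["id", "name", "email"]
required = ["id", "name"]

# The empty email is ignored because email is not in the tested columns.
sketch_check_no_nulls(1, ["1", "Ada", ""], header, columns_to_test=required)

try:
    sketch_check_no_nulls(2, ["2", "", "g@example.com"], header, columns_to_test=required)
    null_error = None
except ValueError as exc:
    null_error = str(exc)

print(null_error)
```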
check_numeric_range
Ensures numeric values in specified columns fall within a defined range. Allows null values.
Parameters
- idx (int): Row number.
- row (list): List of values in the row.
- header (list): List of column names.
- column_sums (list): (Unused) Placeholder for memory storage.
- columns_to_test (list, optional): Columns to check for numeric range.
- min_value (int|float, optional): Minimum allowed value.
- max_value (int|float, optional): Maximum allowed value.
Raises
ValueError: If values fall outside the specified range.
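The range logic, including the null pass-through, might look like the sketch below. `sketch_check_numeric_range` is a hypothetical stand-in. It assumes the bounds are inclusive, that a missing bound means "no limit", and that values arrive as strings and are parsed with `float`; none of these details are confirmed by the documentation above.

```python
def sketch_check_numeric_range(idx, row, header, column_sums=None,
                               columns_to_test=None, min_value=None, max_value=None):
    """Hypothetical stand-in: raise ValueError if a non-null value is outside the range."""
    for name in columns_to_test or header:  # assumption: default to all columns
        value = row[header.index(name)]
        if value is None or value == "":
            continue  # null values are allowed
        number = float(value)  # itself raises ValueError for non-numeric text
        too_low = min_value is not None and number < min_value
        too_high = max_value is not None and number > max_value
        if too_low or too_high:  # assumption: both bounds are inclusive
            raise ValueError(
                f"Row {idx}: {value!r} in column {name!r} is outside "
                f"[{min_value}, {max_value}]"
            )


header = ["id", "age"]


def in_range(row):
    """Return True when the age column passes the 0..120 check."""
    try:
        sketch_check_numeric_range(1, row, header, columns_to_test=["age"],
                                   min_value=0, max_value=120)
        return True
    except ValueError:
        return False


results = [in_range(["1", "42"]), in_range(["2", ""]), in_range(["3", "150"])]
print(results)
```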
check_string_pattern
Validates that values in specified columns match a given regex pattern. Allows null values.
Parameters
- idx (int): Row number.
- row (list): List of values in the row.
- header (list): List of column names.
- column_sums (list): (Unused) Placeholder for memory storage.
- columns_to_test (list, optional): Columns to validate.
- regex_pattern (str): Regular expression to match.
Raises
ValueError: If values do not match the regex pattern.
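As a rough illustration of pattern validation, the sketch below uses Python's standard `re` module. `sketch_check_string_pattern` is a hypothetical stand-in, and the use of `re.fullmatch` (the whole value must match) is an assumption; the library may instead anchor only at the start of the string or search anywhere in it.

```python
import re


def sketch_check_string_pattern(idx, row, header, column_sums=None,
                                columns_to_test=None, *, regex_pattern):
    """Hypothetical stand-in: raise ValueError if a non-null value fails the pattern."""
    compiled = re.compile(regex_pattern)
    for name in columns_to_test or header:  # assumption: default to all columns
        value = row[header.index(name)]
        if value is None or value == "":
            continue  # null values are allowed
        # Assumption: the entire value must match, not just a prefix.
        if compiled.fullmatch(str(value)) is None:
            raise ValueError(
                f"Row {idx}: {value!r} in column {name!r} does not match "
                f"{regex_pattern!r}"
            )


header = ["id", "email"]
simple_email = r"[^@\s]+@[^@\s]+\.[a-z]+"  # deliberately simplistic example pattern


def matches(row):
    """Return True when the email column passes the pattern check."""
    try:
        sketch_check_string_pattern(1, row, header, columns_to_test=["email"],
                                    regex_pattern=simple_email)
        return True
    except ValueError:
        return False


pattern_results = [
    matches(["1", "ada@example.com"]),
    matches(["2", ""]),
    matches(["3", "not-an-email"]),
]
print(pattern_results)
```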
check_date_format
Ensures dates in specified columns match a given date format. Allows null values.
Parameters
- idx (int): Row number.
- row (list): List of values in the row.
- header (list): List of column names.
- column_sums (list): (Unused) Placeholder for memory storage.
- columns_to_test (list, optional): Columns to validate.
- date_format (str): Expected date format (default: %Y-%m-%d).
Raises
ValueError: If values do not match the date format.
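Format codes such as %Y-%m-%d are the ones accepted by Python's `datetime.strptime`, so a date check can plausibly be built on it. The sketch below assumes exactly that; `sketch_check_date_format` is a hypothetical stand-in, and whether the library parses dates this way is not confirmed here.

```python
from datetime import datetime


def sketch_check_date_format(idx, row, header, column_sums=None,
                             columns_to_test=None, date_format="%Y-%m-%d"):
    """Hypothetical stand-in: raise ValueError if a non-null value fails to parse."""
    for name in columns_to_test or header:  # assumption: default to all columns
        value = row[header.index(name)]
        if value is None or value == "":
            continue  # null values are allowed
        try:
            datetime.strptime(value, date_format)
        except ValueError as exc:
            # Re-raise with row and column context added.
            raise ValueError(
                f"Row {idx}: {value!r} in column {name!r} does not match "
                f"{date_format!r}"
            ) from exc


header = ["id", "signup_date"]


def date_ok(value, date_format="%Y-%m-%d"):
    """Return True when the signup_date column passes the format check."""
    try:
        sketch_check_date_format(1, ["1", value], header,
                                 columns_to_test=["signup_date"],
                                 date_format=date_format)
        return True
    except ValueError:
        return False


date_results = [
    date_ok("2024-02-29"),              # valid leap day
    date_ok("2024-02-30"),              # right shape, impossible date
    date_ok("29/02/2024"),              # wrong shape for the default format
    date_ok("29/02/2024", "%d/%m/%Y"),  # valid once the format is supplied
]
print(date_results)
```

Parsing with `strptime` rejects impossible calendar dates as well as wrongly shaped strings, which a purely regex-based format check would not do.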
check_datatype
Validates that values in specified columns match the expected datatype. Allows null values.
Parameters
- idx (int): Row number.
- row (list): List of values in the row.
- header (list): List of column names.
- column_sums (list): (Unused) Placeholder for memory storage.
- columns_to_test (list, optional): Columns to validate.
- expected_datatype (type): Expected datatype (e.g., int, float).
Raises
ValueError: If values do not match the expected datatype.
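Values read from a CSV file arrive as strings, so one plausible reading of "match the expected datatype" is "can be converted to it". The sketch below takes that interpretation, which is an assumption; the library may instead use `isinstance` or some other test. `sketch_check_datatype` is a hypothetical stand-in.

```python
def sketch_check_datatype(idx, row, header, column_sums=None,
                          columns_to_test=None, *, expected_datatype):
    """Hypothetical stand-in: raise ValueError if a non-null value cannot be converted."""
    for name in columns_to_test or header:  # assumption: default to all columns
        value = row[header.index(name)]
        if value is None or value == "":
            continue  # null values are allowed
        try:
            expected_datatype(value)  # assumption: conversion-based check
        except (TypeError, ValueError) as exc:
            raise ValueError(
                f"Row {idx}: {value!r} in column {name!r} is not a valid "
                f"{expected_datatype.__name__}"
            ) from exc


header = ["id", "score"]


def type_ok(value, expected):
    """Return True when the score column converts to the expected type."""
    try:
        sketch_check_datatype(1, ["1", value], header,
                              columns_to_test=["score"],
                              expected_datatype=expected)
        return True
    except ValueError:
        return False


type_results = [
    type_ok("42", int),
    type_ok("4.2", int),    # int("4.2") fails
    type_ok("4.2", float),
    type_ok("", int),       # null passes
]
print(type_results)
```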
check_in_set
Ensures values in specified columns are within a predefined set.
Parameters
- idx (int): Row number.
- row (list): List of values in the row.
- header (list): List of column names.
- column_sums (list): (Unused) Placeholder for memory storage.
- columns_to_test (list, optional): Columns to validate.
- valid_values_set (list): Allowed values for validation.
Raises
ValueError: If values are not in the allowed set.
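A membership check of this kind can be sketched as below. `sketch_check_in_set` is a hypothetical stand-in. Unlike the other checks, the description does not say whether null values are allowed, so this sketch treats an empty value like any other value and rejects it unless it is listed; that choice is an assumption.

```python
def sketch_check_in_set(idx, row, header, column_sums=None,
                        columns_to_test=None, *, valid_values_set):
    """Hypothetical stand-in: raise ValueError if a value is not among the allowed ones."""
    allowed = set(valid_values_set)  # set membership is O(1) per lookup
    for name in columns_to_test or header:  # assumption: default to all columns
        value = row[header.index(name)]
        # Assumption: empty values get no special treatment.
        if value not in allowed:
            raise ValueError(
                f"Row {idx}: {value!r} in column {name!r} is not an allowed value"
            )


header = ["id", "status"]
statuses = ["active", "inactive", "pending"]


def status_ok(value):
    """Return True when the status column holds an allowed value."""
    try:
        sketch_check_in_set(1, ["1", value], header,
                            columns_to_test=["status"],
                            valid_values_set=statuses)
        return True
    except ValueError:
        return False


set_results = [status_ok("active"), status_ok("archived"), status_ok("")]
print(set_results)
```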