validation - Functionality for validating a pandas data structure

Module providing pandas data structure validation functionality.

Purpose:

This module contains functionality to validate pandas data structures.

When validation fails, a warning is printed to the terminal. The warning message can also be returned, along with the rows of the data structure that failed validation.

Platform:

Linux/Windows | Python 3.10+

Developer:

J Berendt

Email:

support@s3dev.uk

Source:

This project (pdvalidate) is a fork from Markus Englund’s pandas-validation project (v0.5.0), which can be found on GitHub:

This fork was built to provide additional functionality, specifically returning the validation error message from the test. The original project issued a ValidationWarning via the warnings library, which prevented the validation error from being logged.

We have worked to keep the initial integrity of the project, while adding some features.

Thank you Markus for the excellent framework, and for sharing it with us all!

Example Use:

Example code use:

>>> import pandas as pd
>>> from pdvalidate.validation import validate as pv

# Create a Series and validation rules.
>>> s = pd.Series(['aaa', 'bb', 'c', 'dddd'], name='TestSeries')
>>> result, msg = pv.validate_string(s,
                                     min_length=1,
                                     max_length=2,
                                     return_type='mask_series')

[RangeWarning]: 'TestSeries': string(s) too long.

# Show row(s) which fail validation.
>>> print(s[result])

0     aaa
3    dddd
Name: TestSeries, dtype: object

# Show row(s) which pass validation.
>>> print(s[~result])

1    bb
2     c
Name: TestSeries, dtype: object

class validation.Validation

Bases: object

Class container for all validation functionality.

static _build_message_dtype(series_name: str, exp: str, rec: str) -> str

Build the unexpected datatype warning message for terminal output.

Parameters:
  • series_name (str) – Name of the Series causing the error.

  • exp (str) – Expected datatype.

  • rec (str) – Received datatype.

Returns:

Compiled error message string.

Return type:

str

static _build_message_range(series_name: str, message_list: list) -> str

Build the range warning message string for terminal output.

Parameters:
  • series_name (str) – Name of the Series causing the error.

  • message_list (list) – List of error message strings to be printed to the terminal.

Returns:

Compiled error message string.

Return type:

str

static _datetime_to_string(series: Series, datetime_format: str = '%Y-%m-%d') -> Series

Convert datetime values in a pandas Series to strings.

Other values are left as they are.

Parameters:
  • series (pd.Series) – Values to convert.

  • datetime_format (str, optional) – Format string for datetime type. Defaults to '%Y-%m-%d'.

Returns:

A converted pd.Series.
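
The behaviour described above can be illustrated with a minimal sketch in plain pandas. This is not the library's implementation: the function name datetime_to_string_sketch is hypothetical, and the assumption is that "datetime values" means instances of datetime.date or datetime.datetime.

```python
import datetime

import pandas as pd


def datetime_to_string_sketch(series: pd.Series,
                              datetime_format: str = '%Y-%m-%d') -> pd.Series:
    """Format date/datetime values as strings; leave other values untouched."""
    def convert(value):
        # datetime.datetime subclasses datetime.date, so one check covers both.
        # Null-like values such as NaT are skipped and returned unchanged.
        if isinstance(value, datetime.date) and not pd.isna(value):
            return value.strftime(datetime_format)
        return value
    return series.map(convert)


mixed = pd.Series([datetime.date(2024, 1, 31), 'abc', 3], dtype=object)
converted = datetime_to_string_sketch(mixed)
print(converted.tolist())  # ['2024-01-31', 'abc', 3]
```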

static _get_error_messages(masks: list, error_info: dict) -> list

Compile a list of error messages.

Parameters:
  • masks (list)

  • error_info (dict) – Dictionary with error messages corresponding to different validation errors.

Returns:

A compiled list of error messages.

static _get_return_object(masks: dict, values: Series, return_type: str) -> Series

Build the return object.

Parameters:
  • masks (dict) – Dictionary of validation failure masks.

  • values (pd.Series) – Series of values which were validated.

  • return_type (str) – Return type string descriptor.

Raises:

ValueError – For an invalid return type string.

Returns:

Series containing the records which failed validation.

Return type:

pd.Series
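
One plausible way to turn a dictionary of failure masks into the three documented return types is sketched below. The function name get_return_object_sketch and the mask keys 'too_low' and 'too_high' are hypothetical, and returning only the failing records for 'values' is an assumption based on the Returns description above, not confirmed library behaviour.

```python
import pandas as pd


def get_return_object_sketch(masks: dict, values: pd.Series, return_type: str):
    """Combine per-rule failure masks into the requested return object."""
    mask_frame = pd.DataFrame(masks)       # one boolean column per rule
    mask_series = mask_frame.any(axis=1)   # True where any rule failed
    if return_type == 'mask_frame':
        return mask_frame
    if return_type == 'mask_series':
        return mask_series
    if return_type == 'values':
        return values[mask_series]         # assumed: only the failing records
    raise ValueError(f'Invalid return type: {return_type!r}')


values = pd.Series([5, 50, -3])
masks = {'too_low': values < 0, 'too_high': values > 10}

print(get_return_object_sketch(masks, values, 'mask_series').tolist())  # [False, True, True]
print(get_return_object_sketch(masks, values, 'values').tolist())       # [50, -3]
```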

static _numeric_to_string(series: Series, float_format: str = '%g') -> Series

Convert numeric values in a pandas Series to strings.

Other values are left as they are.

Parameters:
  • series (pd.Series) – Values to convert.

  • float_format (str, optional) – Format string for floating point number. Defaults to '%g'.

Returns:

A converted pd.Series.

Return type:

pd.Series
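
To show what the default '%g' format does to numeric values, here is a small sketch. The function name numeric_to_string_sketch is hypothetical, and treating bool as non-numeric is an assumption made for the example.

```python
import numbers

import pandas as pd


def numeric_to_string_sketch(series: pd.Series,
                             float_format: str = '%g') -> pd.Series:
    """Format numeric values as strings; leave other values untouched."""
    def convert(value):
        # bool is a numbers.Number subclass; it is excluded here (an assumption).
        if isinstance(value, numbers.Number) and not isinstance(value, bool):
            return float_format % value
        return value
    return series.map(convert)


mixed = pd.Series([1.5, 2.0, 1234567.0, 'x'], dtype=object)
formatted = numeric_to_string_sketch(mixed)
print(formatted.tolist())  # ['1.5', '2', '1.23457e+06', 'x']
```

Note that '%g' drops trailing zeros and switches to exponent notation for large magnitudes, so a different float_format (for example '%.2f') may be preferable when fixed precision matters.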

ei = <validation.ErrorInfo object>

static mask_nonconvertible(series: Series, to_datatype: str, datetime_format: str = None, exact_date: bool = True) -> Series

Determine if values cannot be converted.

Parameters:
  • series (pd.Series) – Values to check.

  • to_datatype (str) – Datatype to which values should be converted. Available options are ‘numeric’ and ‘datetime’.

  • datetime_format (str, optional) – Datetime format string. For example: '%d/%m/%Y'. Note that '%f' will parse nanoseconds to six decimal places. Defaults to None.

  • exact_date (bool, optional) – If True (default), require an exact format match. If False, allow the format to match anywhere in the target string. Defaults to True.

Returns:

A boolean Series of the same size as the input, indicating which values cannot be converted.

Return type:

pd.Series
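
A mask of this kind can be approximated with pandas' own coercing converters. The sketch below is not the library's code: the function name mask_nonconvertible_sketch is hypothetical, and it assumes that True marks a nonconvertible value and that values which were already null are not flagged.

```python
import pandas as pd


def mask_nonconvertible_sketch(series: pd.Series,
                               to_datatype: str,
                               datetime_format: str = None,
                               exact_date: bool = True) -> pd.Series:
    """Return True where a non-null value fails conversion."""
    if to_datatype == 'numeric':
        converted = pd.to_numeric(series, errors='coerce')
    elif to_datatype == 'datetime':
        converted = pd.to_datetime(series, format=datetime_format,
                                   exact=exact_date, errors='coerce')
    else:
        raise ValueError(f'Unsupported datatype: {to_datatype!r}')
    # Nonconvertible: present before coercion, null afterwards.
    return converted.isna() & series.notna()


numbers = pd.Series(['1', '2.5', 'abc', None])
dates = pd.Series(['2024-01-31', '31/01/2024', None])

numeric_mask = mask_nonconvertible_sketch(numbers, 'numeric')
date_mask = mask_nonconvertible_sketch(dates, 'datetime', datetime_format='%Y-%m-%d')
print(numeric_mask.tolist())  # [False, False, True, False]
print(date_mask.tolist())     # [False, True, False]
```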

static test_dtype_numeric(series: Series) -> bool

Test if the Series has a numeric datatype.

Parameters:

series (pd.Series) – Series to be tested.

Returns:

True if a numeric datatype, otherwise False.

Return type:

bool

static test_dtype_object(series: Series) -> bool

Test if the Series has an object datatype.

Parameters:

series (pd.Series) – Series to be tested.

Returns:

True if an object datatype, otherwise False.

Return type:

bool

static to_datetime(arg, dayfirst: bool = False, yearfirst: bool = False, utc: bool = False, datetime_format: str = None, exact: str = True) -> Series

Convert argument to datetime. Set nonconvertible values to NaT.

This function calls pandas.to_datetime() with errors='coerce' and issues a warning if values cannot be converted.

For detailed parameter documentation, please refer to the docstring for pandas.to_datetime.

Parameters:
  • dayfirst (bool, optional) – See pandas documentation. Defaults to False.

  • yearfirst (bool, optional) – See pandas documentation. Defaults to False.

  • utc (bool, optional) – See pandas documentation. Defaults to False.

  • datetime_format (str, optional) – See pandas documentation. Defaults to None.

  • exact (bool, optional) – See pandas documentation. Defaults to True.
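
The underlying pandas behaviour can be seen by calling pandas.to_datetime directly with errors='coerce'. The sketch below does not call this module's wrapper, and the sample values are made up for illustration.

```python
import pandas as pd

raw = pd.Series(['01/02/2024', 'not a date'])

# errors='coerce' turns unparseable values into NaT instead of raising.
day_first = pd.to_datetime(raw, dayfirst=True, errors='coerce')
month_first = pd.to_datetime(raw, dayfirst=False, errors='coerce')

print(day_first.iloc[0])    # 2024-02-01 00:00:00
print(month_first.iloc[0])  # 2024-01-02 00:00:00
print(day_first.isna().tolist())  # [False, True]
```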

static to_numeric(arg) -> Series

Convert argument to numeric type. Set nonconvertible values to NaN.

This function calls pandas.to_numeric() with errors='coerce' and issues a warning if values cannot be converted.

Parameters:

arg (list, tuple, 1-d array, or Series) – Values to convert.

Returns:

A converted pd.Series.

Return type:

pd.Series
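
The coerce-then-warn pattern described above might look roughly like the following. The function name to_numeric_sketch and the warning text are hypothetical, and warnings.warn is used only as a stand-in, since how this module reports the problem is not shown here.

```python
import warnings

import pandas as pd


def to_numeric_sketch(arg) -> pd.Series:
    """Coerce values to numeric and warn about any that were lost."""
    series = pd.Series(arg)
    converted = pd.to_numeric(series, errors='coerce')
    # Values that were present but became NaN could not be converted.
    lost = converted.isna() & series.notna()
    if lost.any():
        warnings.warn(f'{int(lost.sum())} value(s) could not be converted to numeric.')
    return converted


with warnings.catch_warnings(record=True) as caught:
    warnings.simplefilter('always')
    result = to_numeric_sketch(['1', '2.5', 'abc'])

print(result.tolist())  # [1.0, 2.5, nan]
print([str(w.message) for w in caught])
```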

to_string(series: Series, float_format: str = '%g', datetime_format: str = '%Y-%m-%d') -> Series

Convert values in a pandas Series to strings.

Parameters:
  • series (pd.Series) – Values to convert.

  • float_format (str, optional) – Format string for floating point number. Defaults to '%g'.

  • datetime_format (str, optional) – Format string for datetime type. Defaults to '%Y-%m-%d'.

Returns:

A converted pd.Series.

Return type:

pd.Series

validate_date(series: Series, convert: bool = False, dateformat: str = None, nullable: bool = True, unique: bool = False, min_date: date = None, max_date: date = None, return_type: str = None) -> tuple | None

Validate a pandas Series with values of type datetime.date.

Values of a different data type will be replaced with NaN prior to the validation.

Parameters:
  • series (pd.Series) – Values to validate.

  • convert (bool, optional) – Convert the Series to datetime using the pd.to_datetime() function. Also use the dateformat parameter to define the format. Defaults to False.

  • dateformat (str, optional) – Format code for the datetimes being passed in the Series. For use with the convert parameter. Defaults to None.

  • nullable (bool, optional) – If False, check for NaN values. Defaults to True.

  • unique (bool, optional) – If True, check that values are unique. Defaults to False.

  • min_date (datetime.date, optional) – If defined, check for values before min_date, inclusive. Defaults to None.

  • max_date (datetime.date, optional) – If defined, check for values later than max_date, inclusive. Defaults to None.

  • return_type (str, optional) – Kind of data object to return. Options: ‘mask_series’, ‘mask_frame’, ‘values’. Defaults to None.

Returns:

If a return_type is specified, return a tuple of the following, otherwise return None:

(return_object, error_messages)

Return type:

tuple | None

validate_numeric(series: Series, nullable: bool = True, unique: bool = False, integer: bool = False, min_value: int = None, max_value: int = None, return_type: str = None) -> tuple | None

Validate a pandas Series containing numeric values.

Parameters:
  • series (pd.Series) – Values to validate.

  • nullable (bool, optional) – If False, check for NaN values. Defaults to True.

  • unique (bool, optional) – If True, check that values are unique. Defaults to False.

  • integer (bool, optional) – If True, check that values are integers. Defaults to False.

  • min_value (int, optional) – If defined, check for values below minimum, inclusive. Defaults to None.

  • max_value (int, optional) – If defined, check for values above maximum, inclusive. Defaults to None.

  • return_type (str, optional) – Kind of data object to return. Options: ‘mask_series’, ‘mask_frame’, ‘values’. Defaults to None.

Returns:

If a return_type is specified, return a tuple of the following, otherwise return None:

(return_object, error_messages)

Return type:

tuple | None
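
The numeric rules listed above can be expressed as boolean masks in plain pandas, as in the rough sketch below. It does not call validate_numeric. The column names are hypothetical, and it assumes that "inclusive" means a value equal to the bound passes.

```python
import pandas as pd

s = pd.Series([1.0, 10.0, 2.5, 11.0, -1.0, None], name='TestSeries')
min_value, max_value = 0, 10

masks = pd.DataFrame({
    'isnull': s.isna(),                      # corresponds to nullable=False
    'noninteger': s.notna() & (s % 1 != 0),  # corresponds to integer=True
    'too_low': s < min_value,                # a value equal to min_value passes
    'too_high': s > max_value,               # a value equal to max_value passes
})
failed = masks.any(axis=1)

print(failed.tolist())     # [False, False, True, True, True, True]
print(s[~failed].tolist())  # [1.0, 10.0]
```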

validate_string(series: Series, nullable: bool = True, unique: bool = False, min_length: int = None, max_length: int = None, case: str = None, newlines: bool = True, trailing_whitespace: bool = True, whitespace: bool = True, matching_regex: str = None, non_matching_regex: str = None, whitelist: list = None, blacklist: list = None, return_type: str = None) -> tuple | None

Validate a pandas Series with strings.

Non-string values will be flagged as errors.

Parameters:
  • series (pd.Series) – Values to validate.

  • nullable (bool, optional) – If False, check for NaN values. Defaults to True.

  • unique (bool, optional) – If True, check that values are unique. Defaults to False.

  • min_length (int, optional) – If defined, check for strings shorter than min_length, inclusive. Defaults to None.

  • max_length (int, optional) – If defined, check for strings longer than max_length, inclusive. Defaults to None.

  • case (str, optional) – Check for a character case constraint. Options: ‘lower’, ‘upper’, ‘title’. Defaults to None.

  • newlines (bool, optional) – If False, check for platform-specific newline characters. Note: Linux searches for ‘\n’. Windows searches for ‘\r\n’. Defaults to True.

  • trailing_whitespace (bool, optional) – If False, check for trailing whitespace. Defaults to True.

  • whitespace (bool, optional) – If False, check for whitespace. Defaults to True.

  • matching_regex (str, optional) – Check that strings match the provided regular expression. Defaults to None.

  • non_matching_regex (str, optional) – Check that strings do not match the provided regular expression. Defaults to None.

  • whitelist (list, optional) – Check that values are in whitelist. Defaults to None.

  • blacklist (list, optional) – Check that values are not in blacklist. Defaults to None.

  • return_type (str, optional) – Kind of data object to return. Options: ‘mask_series’, ‘mask_frame’, ‘values’. Defaults to None.

Returns:

If a return_type is specified, return a tuple of the following, otherwise return None:

(return_object, error_messages)

Return type:

tuple | None
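
The length rules from the module-level example can be reproduced with plain pandas string methods, which may help clarify what the returned mask means. This sketch does not call validate_string. The second Series and the regular expressions are illustrative assumptions, not the patterns the library uses.

```python
import pandas as pd

s = pd.Series(['aaa', 'bb', 'c', 'dddd'], name='TestSeries')

lengths = s.str.len()
too_short = lengths < 1   # min_length=1: only strictly shorter strings fail
too_long = lengths > 2    # max_length=2: only strictly longer strings fail
mask_series = too_short | too_long

print(s[mask_series].tolist())   # ['aaa', 'dddd']
print(s[~mask_series].tolist())  # ['bb', 'c']

# Whitespace rules can be written as masks in the same way.
t = pd.Series(['ok', 'trailing ', 'two words'])
trailing_ws = t.str.contains(r'\s$', regex=True)
any_ws = t.str.contains(r'\s', regex=True)
print(trailing_ws.tolist())  # [False, True, False]
print(any_ws.tolist())       # [False, True, True]
```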

validate_timestamp(series: Series, nullable: bool = True, unique: bool = False, min_timestamp: Timestamp = None, max_timestamp: Timestamp = None, return_type: str = None) -> tuple | None

Validate a pandas Series with values of type pandas.Timestamp.

Values of a different data type will be replaced with NaT prior to the validation.

Parameters:
  • series (pd.Series) – Values to validate.

  • nullable (bool, optional) – If False, check for NaN values. Defaults to True.

  • unique (bool, optional) – If True, check that values are unique. Defaults to False.

  • min_timestamp (pd.Timestamp, optional) – If defined, check for values before min_timestamp, inclusive. Defaults to None.

  • max_timestamp (pd.Timestamp, optional) – If defined, check for values later than max_timestamp, inclusive. Defaults to None.

  • return_type (str, optional) – Kind of data object to return. Options: ‘mask_series’, ‘mask_frame’, ‘values’. Defaults to None.

Returns:

If a return_type is specified, return a tuple of the following, otherwise return None:

(return_object, error_messages)

Return type:

tuple | None
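
As a rough illustration of the timestamp range and null checks, the sketch below builds the equivalent masks in plain pandas. It does not call validate_timestamp. The variable names are hypothetical, and it assumes that a value equal to a bound passes.

```python
import pandas as pd

s = pd.Series([pd.Timestamp('2024-01-01'),
               pd.Timestamp('2024-06-15'),
               pd.Timestamp('2025-01-01'),
               pd.NaT], name='TestSeries')
min_timestamp = pd.Timestamp('2024-01-01')
max_timestamp = pd.Timestamp('2024-12-31')

is_null = s.isna()             # corresponds to nullable=False
too_early = s < min_timestamp  # comparisons against NaT evaluate to False
too_late = s > max_timestamp
failed = is_null | too_early | too_late

print(failed.tolist())  # [False, False, True, True]
```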