Quickstart Guide

This quickstart guide briefly introduces pdvalidate. The library contains four core functions that let you validate the values in a pandas.Series (i.e. a column in a pandas.DataFrame).

The examples below are designed to help you get started. For more detail on the lower-level functionality, see the Library API Documentation.

The code examples below assume the following imports exist at the top of the module:

import numpy as np
import pandas as pd
from pdvalidate.validation import validate as pv

Validate dates

The first example validates a pandas Series containing a few dates built with Python’s datetime.datetime type (imported below as dt; it is a subclass of datetime.date). Values of other types are replaced with NaT (“Not-A-Time”) before validation. A warning then tells the user whether any of the values are invalid.

from datetime import datetime as dt

s1 = pd.Series([dt(2010, 10, 7),
                dt(2014, 8, 13),
                dt(2018, 3, 15),
                dt(2018, 3, 15),
                np.nan],
                name='DateTest')

pv.validate_date(s1,
                 nullable=False,
                 unique=True,
                 min_date=dt(2013, 1, 1),
                 max_date=dt(2015, 12, 31),
                 return_type=None)

# A warning is displayed:
# [RangeWarning]: 'DateTest': NaT value(s); duplicates; date(s) too early; date(s) too late.
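
To make the four conditions named in the warning concrete, the following pandas-only sketch rebuilds them by hand. It is an approximation for illustration, not pdvalidate's implementation; in particular, flagging every copy of a duplicated date (keep=False) is an assumption.

```python
from datetime import datetime as dt

import numpy as np
import pandas as pd

s1 = pd.Series([dt(2010, 10, 7), dt(2014, 8, 13), dt(2018, 3, 15),
                dt(2018, 3, 15), np.nan], name='DateTest')

# Normalise to a datetime dtype so that np.nan is represented as NaT.
dates = pd.to_datetime(s1)

# Each check yields a boolean Series; True marks a failing row.
is_null = dates.isna()
is_duplicate = dates.duplicated(keep=False) & dates.notna()  # assumption: all copies flagged
too_early = dates < pd.Timestamp(2013, 1, 1)
too_late = dates > pd.Timestamp(2015, 12, 31)

failed = is_null | is_duplicate | too_early | too_late

# Keep only the rows that pass every check; the others become NaT.
valid_only = dates.where(~failed)
```

Only the row holding 2014-08-13 passes all four checks in this sketch.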

Continuing with the example above, setting the return_type parameter to 'values' returns a two-element tuple. The first element is a pandas Series in which only the valid dates are kept (invalid entries become NaT); the second is the warning message, if any. The message can then be passed to a validation reporter for logging.

values = pv.validate_date(s1,
                          nullable=False,
                          unique=True,
                          min_date=dt(2013, 1, 1),
                          max_date=dt(2015, 12, 31),
                          return_type='values')  # <-- Note the change here.

Returns the following two-element tuple:

(0          NaT
 1   2014-08-13
 2          NaT
 3          NaT
 4          NaT
 Name: DateTest, dtype: datetime64[ns],
 "[RangeWarning]: 'DateTest': NaT value(s); duplicates; date(s) too early; date(s) too late.")
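
As a rough sketch of what can be done with the returned tuple, the snippet below unpacks it and hands the message to Python's standard logging module. The tuple is rebuilt by hand so that the sketch runs without pdvalidate; the logger name is hypothetical, and the idea that the message is empty or None when nothing fails is an assumption.

```python
import logging

import pandas as pd

logging.basicConfig(level=logging.WARNING)
logger = logging.getLogger('validation')  # hypothetical logger name

# Stand-in for the tuple returned with return_type='values' (rebuilt by hand).
result = (
    pd.Series(pd.to_datetime([None, '2014-08-13', None, None, None]),
              name='DateTest'),
    "[RangeWarning]: 'DateTest': NaT value(s); duplicates; "
    "date(s) too early; date(s) too late.",
)

valid_dates, message = result

# Assumption: the message is empty or None when every value is valid.
if message:
    logger.warning(message)

# Drop the NaT placeholders to keep only the dates that passed.
clean = valid_dates.dropna()
```

The original index labels survive dropna(), so each surviving date can still be traced back to its row.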

Validate numeric values

Validation of numeric values (e.g. floats and integers) follows the same general principles as the validation of dates and timestamps. Non-numeric values are treated as NaN, and warnings are issued to indicate invalid values to the user.

s2 = pd.Series([13, 42, 73, 73, 3.14159, 1.1618033, np.nan],
               name='NumericTest')

pv.validate_numeric(s2,
                    nullable=False,
                    unique=True,
                    integer=True,
                    min_value=15,
                    max_value=100,
                    return_type=None)

# A warning is displayed:
# [RangeWarning]: 'NumericTest': NaN value(s); duplicates; non-integer(s); value(s) too low.

Continuing with the example above, setting the return_type parameter to 'mask_series' returns a two-element tuple. The first element is a boolean pandas Series mask in which True marks the rows that failed validation and False marks the rows that passed; the second is the warning message, if any. The message can then be passed to a validation reporter for logging.

values = pv.validate_numeric(s2,
                             nullable=False,
                             unique=True,
                             integer=True,
                             min_value=15,
                             max_value=100,
                             return_type='mask_series')  # <-- Note the change here.

Returns the following two-element tuple:

(0     True
 1    False  # <-- Reminder: False indicates a validation *pass*
 2    False  # <--
 3     True
 4     True
 5     True
 6     True
 dtype: bool,
 "[RangeWarning]: 'NumericTest': NaN value(s); duplicates; non-integer(s); value(s) too low.")

Note

Because using True for a failure and False for a pass may seem counterintuitive for some use cases, the Series can be inverted with the tilde (~) operator:

~values[0]

After inversion, True marks a validation pass and False marks a validation failure.
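
The sketch below approximates the numeric checks with plain pandas and then uses the inverted mask to select the passing rows. It is not pdvalidate's implementation; keep='first' is an assumption, chosen because it reproduces the mask shown above, where row 2 passes and row 3 fails.

```python
import numpy as np
import pandas as pd

s2 = pd.Series([13, 42, 73, 73, 3.14159, 1.1618033, np.nan],
               name='NumericTest')

# Each check yields a boolean Series; True marks a failing row.
is_null = s2.isna()
is_duplicate = s2.duplicated(keep='first') & s2.notna()  # assumption: first copy passes
non_integer = s2.notna() & (s2 % 1 != 0)  # notna() guard: NaN % 1 != 0 is True
too_low = s2 < 15
too_high = s2 > 100

failed = is_null | is_duplicate | non_integer | too_low | too_high

# Invert the mask to select the rows that passed every check.
passed = s2[~failed]
```

Here passed holds the values at index 1 and 2 (42 and 73), matching the two False entries in the mask above.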

Validate strings

String validation works in the same way as the other validations, but concerns only strings. Values of other types, like numbers and timestamps, are simply replaced with NaN values before the validation takes place.

s3 = pd.Series(['1',            # Too short
                '',             # Empty
                'ab\n',         # Newline character present
                'abc',          # OK
                'Abc',          # Includes upper case character(s)
                'ABc',          # Includes upper case character(s)
                b'abcd',        # Bytes (not string)
                'abc 123',      # Includes whitespace
                'abc123',       # OK
                'abc123',       # Duplicate
                'ABC123',       # Includes upper case character(s)
                'abc123abc123', # Too long
                123,            # Numeric
                0xc0ffee,       # Integer (hex literal), not a string
                np.nan],        # NaN
               name='StringTest')

pv.validate_string(s3,
                   nullable=True,
                   unique=True,
                   min_length=2,
                   max_length=8,
                   case='lower',
                   newlines=False,
                   whitespace=False,
                   return_type=None)

# A warning is displayed:
# [RangeWarning]: 'StringTest': Non-string value(s) set as NaN; duplicates; string(s) too short; string(s) too long; wrong case letter(s); newline character(s); whitespace.

Continuing with the example above, setting the return_type parameter to 'mask_frame' returns a two-element tuple. The first element is a boolean pandas DataFrame mask with one column per check, in which True marks the rows that failed that check and False marks the rows that passed it; the second is the warning message, if any. The message can then be passed to a validation reporter for logging.

values = pv.validate_string(s3,
                            nullable=True,
                            unique=True,
                            min_length=2,
                            max_length=8,
                            case='lower',
                            newlines=False,
                            whitespace=False,
                            return_type='mask_frame')  # <-- Note the change here.

Returns the following two-element tuple:

(    invalid_type  nonunique  too_short  too_long  wrong_case  newlines  whitespace
 0          False      False       True     False       False     False       False
 1          False      False       True     False       False     False       False
 2          False      False      False     False       False      True        True
 3          False      False      False     False       False     False       False
 4          False      False      False     False        True     False       False
 5          False      False      False     False        True     False       False
 6           True      False        NaN       NaN         NaN       NaN         NaN
 7          False      False      False     False       False     False        True
 8          False      False      False     False       False     False       False
 9          False       True      False     False       False     False       False
 10         False      False      False     False        True     False       False
 11         False      False      False      True       False     False       False
 12          True      False        NaN       NaN         NaN       NaN         NaN
 13          True      False        NaN       NaN         NaN       NaN         NaN
 14         False      False        NaN       NaN         NaN       NaN         NaN,
 "[RangeWarning]: 'StringTest': Non-string value(s) set as NaN; duplicates; string(s) too short; string(s) too long; wrong case letter(s); newline character(s); whitespace.")
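
A mask frame like the one above can be condensed with ordinary pandas operations. The sketch below copies rows 0, 3, 6 and 9 and three of the columns from that output into a small frame by hand, then derives a per-row verdict and a per-check failure count. Treating NaN as "not a failure" is an assumption made for this illustration.

```python
import numpy as np
import pandas as pd

# Rows 0, 3, 6 and 9 of the mask frame above, limited to three columns.
mask_frame = pd.DataFrame(
    {'invalid_type': [False, False, True, False],
     'nonunique':    [False, False, False, True],
     'too_short':    [True, False, np.nan, False]},
    index=[0, 3, 6, 9])

# eq(True) turns NaN into False (assumption: NaN is not counted as a failure).
flags = mask_frame.eq(True)

# True where a row failed at least one check.
row_failed = flags.any(axis=1)

# Number of failing rows for each check.
failures_per_check = flags.sum()
```

In this subset only row 3 ('abc') passes every check.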

Validate timestamps

Validation of timestamps works in the same way as date validation. The major difference is that only values of type pandas.Timestamp are taken into account. Values of other types are replaced by NaT.

from datetime import datetime as dt

s4 = pd.Series([pd.Timestamp(2010, 1, 1, 12, 30, 0),                # Invalid: Out of range
                pd.Timestamp(2014, 2, 1, 12, 30, 0),                # Valid
                pd.Timestamp(2014, 2, 1, 12, 30, 0),                # Invalid: Duplicate
                dt(2015, 3, 1, 12, 30, 0),                          # Invalid: Datetime object
                pd.to_datetime(dt(2020, 4, 1)),                     # Valid
                '2024-02-02',                                       # Invalid: String
                pd.to_datetime('2024-03-01', format='%Y-%m-%d'),    # Valid
                1234,                                               # Invalid: Integer
                np.nan],                                            # Invalid: NaN
               name='TimestampTest')

pv.validate_timestamp(s4,
                      nullable=False,
                      unique=True,
                      min_timestamp=pd.Timestamp(2011, 1, 1),
                      max_timestamp=dt(2024, 12, 31),
                      return_type=None)

# A warning is displayed:
# [RangeWarning]: 'TimestampTest': Value(s) not of type pandas.Timestamp set as NaT; NaT value(s); duplicates; timestamp(s) too early.

Continuing with the example above, setting the return_type parameter to 'values' returns a two-element tuple. The first element is a pandas Series in which only the valid timestamps are kept (invalid entries become NaT); the second is the warning message, if any. The message can then be passed to a validation reporter for logging.

values = pv.validate_timestamp(s4,
                               nullable=False,
                               unique=True,
                               min_timestamp=pd.Timestamp(2011, 1, 1),
                               max_timestamp=dt(2024, 12, 31),
                               return_type='values')  # <-- Note the change here.

Returns the following two-element tuple:

(0                   NaT
 1   2014-02-01 12:30:00
 2                   NaT
 3                   NaT
 4   2020-04-01 00:00:00
 5                   NaT
 6   2024-03-01 00:00:00
 7                   NaT
 8                   NaT
 Name: TimestampTest, dtype: datetime64[ns],
 "[RangeWarning]: 'TimestampTest': Value(s) not of type pandas.Timestamp set as NaT; NaT value(s); duplicates; timestamp(s) too early.")
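
The type rule described in this section can be illustrated with a plain isinstance check. This is a sketch of the idea only, and it is an assumption that pdvalidate decides in a comparable way. Note that pandas.Timestamp is a subclass of datetime.datetime, so the test must ask for pandas.Timestamp specifically; a plain datetime object does not pass it.

```python
from datetime import datetime as dt

import numpy as np
import pandas as pd

raw = [pd.Timestamp(2014, 2, 1, 12, 30, 0),  # pandas.Timestamp: kept
       dt(2015, 3, 1, 12, 30, 0),            # datetime.datetime: replaced
       '2024-02-02',                         # string: replaced
       1234,                                 # integer: replaced
       np.nan]                               # missing value: replaced

# True only for genuine pandas.Timestamp objects.
is_timestamp = [isinstance(value, pd.Timestamp) for value in raw]

# Replace everything else with NaT before any range or uniqueness check.
cleaned = pd.Series(
    [value if ok else pd.NaT for value, ok in zip(raw, is_timestamp)],
    name='TimestampTest')
```

Values that should count as timestamps can be converted with pd.to_datetime before validation, as the example Series above does for two of its entries.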

Summary

This guide is designed to get you up and running with pdvalidate. For a detailed explanation of each validation function and a description of each parameter, see the Library API Documentation section.