Add support for the auto-initializing framework to a custom Expectation
Prerequisites
Determine if the auto-initializing framework is appropriate to include in your Expectation
Auto-initializing Expectations automate parameter estimation for Expectations, but not all parameters require this kind of estimation. If your expectation only takes in a Domain (such as the name of a column) then it will not benefit from being configured to work in the auto-initializing framework. In general, the auto-initializing Expectation framework benefits those Expectations that have numeric ranges which are intended to be descriptive of the data found in a Batch or Batches. Existing examples of these would be Expectations such as ExpectColumnMeanToBeBetween
, ExpectColumnMaxToBeBetween
, or ExpectColumnSumToBeBetween
.
Build a Custom Profiler for your Expectation
In order to automate the estimation of parameters, auto-initializing Expectations utilize a Custom Profiler. You will need to create an appropriate configuration for the Profiler that your Expectation will use. The easiest way to do this is to modify an existing Profiler configuration.
You can find existing Profiler configurations in the source code for any Expectation that works within the auto-initializing framework. For this example, we will look at the existing configuration for the ExpectColumnMeanToBeBetween
Expectation. You can view the source code for this Expectation on our GitHub.
Modify variables
Key-value pairs defined in the variables
portion of a Profiler Configuration will be shared across all of its Rules
and Rule
components. This helps you define and keep track of values without having to input them multiple times. In our example, the variables
are:
strict_min
: Used byexpect_column_mean_to_be_between
Expectation. Recognized values areTrue
orFalse
.strict_max
: Used byexpect_column_mean_to_be_between
Expectation. Recognized values areTrue
orFalse
.false_positive_rate
: Used byNumericMetricRangeMultiBatchParameterBuilder
. Typically, this will be a float0 <= 1.0
.quantile_statistic_interpolation_method
: Used byNumericMetricRangeMultiBatchParameterBuilder
, which is used when estimating quantile values (not relevant in our case). Recognized values includeauto
,nearest
, andlinear
.estimator
: Used byNumericMetricRangeMultiBatchParameterBuilder
. Recognized values includeoneshot
,bootstrap
, andkde
.n_resamples
: Used byNumericMetricRangeMultiBatchParameterBuilder
. Integer values are expected.include_estimator_samples_histogram_in_details
: Used byNumericMetricRangeMultiBatchParameterBuilder
. Recognized values areTrue
orFalse
.truncate_values
: A value used by theNumericMetricRangeMultiBatchParameterBuilder
to specify the[lower_bound, upper_bound]
interval, where either boundary is numeric or None. In our case the value is an empty dictionary, and an equivalent configuration would have beentruncate_values : { lower_bound: None, upper_bound: None }
.round_decimals
: Used byNumericMetricRangeMultiBatchParameterBuilder
, and determines how many digits after the decimal point to output (in our case 2).
Modify the domain_builder
The DomainBuilder
configuration requries a class_name
and module_name
. In this example, we will be using the ColumnDomainBuilder
which outputs the column of interest (for example: trip_distance
in the NYC taxi data) which is then accessed by the ExpectationConfigurationBuilder
using the variable $domain.domain_kwargs.column
.
-
class_name
: is the name of the DomainBuilder class that is to be used. Additional Domain Builders are:ColumnDomainBuilder
: ThisDomainBuilder
outputs column Domains, which are required byColumnAggregateExpectations
like (expect_column_median_to_be_between
).MultiColumnDomainBuilder
: This DomainBuilder outputsmulticolumn
Domains by taking in a column list in theinclude_column_names
parameter.ColumnPairDomainBuilder
: This DomainBuilder outputs columnpair domains by taking in a column pair list in the include_column_names parameter.TableDomainBuilder
: ThisDomainBuilder
outputs tableDomains
, which is required byExpectations
that act on tables, like (expect_table_row_count_to_equal
, orexpect_table_columns_to_match_set
).MapMetricColumnDomainBuilder
: ThisDomainBuilder
allows you to choose columns based on Map Metrics, which give a yes/no answer for individual values or rows.CategoricalColumnDomainBuilder
: ThisDomainBuilder
allows you to choose columns based on their cardinality (number of unique values).
noteCategoricalColumnDomainBuilder
will take in variouscardinality_limit_mode
values for cardinality. For a full listing of valid modes, along with the associated values, please refer to theCardinalityLimitMode
enum in the source code on our GitHub. -
module_name
: isgreat_expectations.rule_based_profiler.domain_builder
, which is common for allDomainBuilders
.
Modify the ParameterBuilder
Our example contains a configuration for one ParamterBuilder
, a NumericMetricRangeMultiBatchParameterBuilder
. You can find the other types of ParameterBuilder
by browsing the source code in our GitHub. For the NumericMetricRangeMultiBatchParameterBuilder
the configuration key-value pairs consist of:
name
: an arbitrary name assigned to thisParameterBuilder
configuration.class_name
: the name of the class that corresponds to theParameterBuilder
defined by this configuration.module_name
:great_expectations.rule_based_profiler.parameter_builder
which is the same for allParameterBuilders
.json_serialize
: Boolean value that determines whether to convert computed value to JSON prior to saving result.estimator
: choice of the estimation algorithm: "oneshot" (one observation), "bootstrap" (default), or "kde" (kernel density estimation). Value is pulled from$variables.estimator
, which is set to "bootstrap" in our configuration.quantile_statistic_interpolation_method
: Applicable for the "bootstrap" sampling method. Determines the value of interpolation "method" tonp.quantile()
statistic, which is used for confidence intervals. Value is pulled from$variables.quantile_statistic_interpolation_method
, which is set to "auto" in our configuration.enforce_numeric_metric
: used inMetricConfiguration
to ensure that metric computations return numeric values. Set toTrue
.n_resamples
: Applicable for the "bootstrap" and "kde" sampling methods -- if omitted (default), then 9999 is used. Value is pulled from$variables.n_resamples
, which is set to9999
in our configuration.round_decimals
: User-configured non-negative integer indicating the number of decimals of the rounding precision of the computed parameter values (i.e.,min_value
,max_value
) prior to packaging them on output. If omitted, then no rounding is performed, unless the computed value is already an integer. Value is pulled from$variables.round_decimals
which is2
in our configuration.reduce_scalar_metric
: IfTrue
(default), then reduces computation of 1-dimensional metric to scalar value. This value is set toTrue
.include_estimator_samples_histogram_in_details
: For the "bootstrap" sampling method -- if True, then add 10-bin histogram of bootstraps to "details"; otherwise, omit this information (default). Value pulled from$variables.include_estimator_samples_histogram_in_details
, which isFalse
in our configuration.truncate_values
: User-configured directive for whether or not to allow the computed parameter values (i.e.,lower_bound
,upper_bound
) to take on values outside the specified bounds when packaged on output. Value pulled from$variables.truncate_values
, which isNone
in our configuration.false_positive_rate
: User-configured fraction between 0 and 1 expressing desired false positive rate for identifying unexpected values as judged by the upper- and lower- quantiles of the observed metric data. Value pulled from$variables.false_positive_rate
and is0.05
in our configuration.replace_nan_with_zero
: If False, then if the computed metric givesNaN
, then exception is raised; otherwise, if True (default), then if the computed metric gives NaN, then it is converted to the 0.0 (float) value. Set toTrue
in our configuration.metric_domain_kwargs
: Domain values forParameteBuilder
. Pulled from$domain.domain_kwargs
, and is empty in our configuration.
Modify the expectation_configuration_builders
Our Configuration contains 1 ExpectationConfigurationBuilder
, for the expect_column_mean_to_be_between
Expectation type.
The ExpectationConfigurationBuilder
configuration requires a expectation_type
, class_name
and module_name
:
expectation_type
:expect_column_mean_to_be_between
class_name
:DefaultExpectationConfigurationBuilder
module_name
:great_expectations.rule_based_profiler.expectation_configuration_builder
which is common for allExpectationConfigurationBuilders
Also included are:
validation_parameter_builder_configs
: Which are a list ofValidationParameterBuilder
configurations, and our configuration case contains theParameterBuilder
described in the previous section.
Next are the parameters that are specific to the expect_column_mean_to_be_between
Expectation
.
column
: Pulled fromDomainBuilder
using the parameter$domain.domain_kwargs.column
min_value
: Pulled from theParameterBuilder
using$parameter.mean_range_estimator.value[0]
max_value
: Pulled from theParameterBuilder
using$parameter.mean_range_estimator.value[1]
strict_min
: Pulled from ``$variables.strict_min, which is
False`.strict_max
: Pulled from ``$variables.strict_max, which is
False`.
Last is meta
which contains details
from our parameter_builder
.
Assign your configuration to the default_profiler_config
class attribute of your Expectation
Once you have modified the necessary parts of the Profiler configuration to suit your purposes you will need to assign it to the default_profiler_config
class attribute of your Expectation. If you initially copied the Profiler configuration that you modified from another Expectation that was already set up to work with the auto-initializing framework then you can refer to that Expectation for an example of this.
Test your Expectation with auto=True
After assigning your Profiler configuration to the default_profiler_config
attribute of your Expectation, your Expectation should be able to work in the auto-initializing framework. Test your expectation with the parameter auto=True
.