Manage Data Docs
Data Docs translate Expectations, Validation Results, and other metadata into human-readable documentation. Because Data Docs are compiled automatically from your data tests, your documentation stays current. Use the information provided here to host and share Data Docs stored on a filesystem or in cloud storage.
- Amazon S3
- Microsoft Azure Blob Storage
- Google Cloud Storage
- Filesystem
Host and share Data Docs on AWS S3.
Prerequisites
- A working deployment of Great Expectations
- The AWS CLI
- Permissions to create and configure an AWS S3 bucket
Create an S3 bucket
In the AWS CLI, run the following command to create an S3 bucket configured for a specific location. Modify the bucket name and region for your environment.
> aws s3api create-bucket --bucket data-docs.my_org --region us-east-1
{
    "Location": "/data-docs.my_org"
}
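If you prefer to manage the bucket from Python, the following is a minimal boto3 sketch of the same call (this assumes boto3 is installed and AWS credentials are configured; the bucket name and region are placeholders to adapt):

import boto3

region = "us-east-1"
s3 = boto3.client("s3", region_name=region)

# us-east-1 is the default region and must not be passed as a
# LocationConstraint; any other region must be.
kwargs = {"Bucket": "data-docs.my_org"}
if region != "us-east-1":
    kwargs["CreateBucketConfiguration"] = {"LocationConstraint": region}

response = s3.create_bucket(**kwargs)
print(response["Location"])  # e.g. "/data-docs.my_org"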
Configure your bucket policy
The example policy below enforces IP-based access. Modify the bucket name and IP addresses for your environment. After you have customized the example policy to suit your situation, name the file ip-policy.json and save it in your local directory.
Your policy should limit access to authorized users. Data Docs sites can include sensitive information and should not be publicly accessible.
{
    "Version": "2012-10-17",
    "Statement": [
        {
            "Sid": "Allow only based on source IP",
            "Effect": "Allow",
            "Principal": "*",
            "Action": "s3:GetObject",
            "Resource": [
                "arn:aws:s3:::data-docs.my_org",
                "arn:aws:s3:::data-docs.my_org/*"
            ],
            "Condition": {
                "IpAddress": {
                    "aws:SourceIp": [
                        "192.168.0.1/32",
                        "2001:db8:1234:1234::/64"
                    ]
                }
            }
        }
    ]
}
Because Data Docs include multiple generated pages, it is important to include the arn:aws:s3:::{your_data_docs_site}/* path in the Resource list along with the arn:aws:s3:::{your_data_docs_site} path that permits access to your Data Docs' front page.
Amazon Web Services' S3 buckets are a third-party utility. For more information about configuring AWS S3 bucket policies, see Using bucket policies.
Apply the policy
Run the following AWS CLI command to apply the policy:
> aws s3api put-bucket-policy --bucket data-docs.my_org --policy file://ip-policy.json
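As an alternative to the CLI, here is a hedged boto3 sketch that applies the policy and reads it back (again assuming boto3 and configured credentials; ip-policy.json is the file you saved above):

import json

import boto3

s3 = boto3.client("s3")

# Apply the policy saved in the previous step.
with open("ip-policy.json") as f:
    s3.put_bucket_policy(Bucket="data-docs.my_org", Policy=f.read())

# Read the policy back to confirm it was applied.
applied = s3.get_bucket_policy(Bucket="data-docs.my_org")
print(json.dumps(json.loads(applied["Policy"]), indent=2))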
Add a new S3 site to great_expectations.yml
The following example shows the default local_site configuration that you will find in your great_expectations.yml file, followed by the s3_site configuration that you will need to add. To maintain a single S3 Data Docs site, remove the default local_site configuration and replace it with the new s3_site configuration.
data_docs_sites:
  local_site:
    class_name: SiteBuilder
    show_how_to_buttons: true
    store_backend:
      class_name: TupleFilesystemStoreBackend
      base_directory: uncommitted/data_docs/local_site/
    site_index_builder:
      class_name: DefaultSiteIndexBuilder
  s3_site:  # this is a user-selected name - you may select your own
    class_name: SiteBuilder
    store_backend:
      class_name: TupleS3StoreBackend
      bucket: '<your bucket name>'
    site_index_builder:
      class_name: DefaultSiteIndexBuilder
Test your configuration
Run the following code to build and open your newly configured S3 Data Docs site:

context.build_data_docs()
context.open_data_docs()
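If you are starting from a fresh Python session, a fuller sketch (assuming the project's great_expectations.yml is discoverable from the working directory and the site is named s3_site as above) looks like this:

import great_expectations as gx

# Load the Data Context that holds the data_docs_sites configuration.
context = gx.get_context()

# Build only the S3 site, then open it in a browser.
context.build_data_docs(site_names=["s3_site"])
context.open_data_docs(site_name="s3_site")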
Additional notes
- Run the following command to update the static hosting settings for your bucket so that AWS automatically serves your index.html file or a custom error file (a boto3 sketch of the same call follows this list):

  > aws s3 website s3://data-docs.my_org/ --index-document index.html

- To host a Data Docs site in a subfolder of an S3 bucket, add the prefix property to the configuration snippet immediately after the bucket property.
- To host a Data Docs site through a private DNS, you can configure a base_public_path for the Data Docs Store. The following example configures an S3 site with the base_public_path set to www.mydns.com. Data Docs will still be written to the configured location on S3 (for example, https://s3.amazonaws.com/data-docs.my_org/docs/index.html), but you can access the pages from your DNS (http://www.mydns.com/index.html in this example):

  data_docs_sites:
    s3_site:  # this is a user-selected name - you may select your own
      class_name: SiteBuilder
      store_backend:
        class_name: TupleS3StoreBackend
        bucket: data-docs.my_org  # UPDATE the bucket name here to match the bucket you configured above.
        base_public_path: http://www.mydns.com
      site_index_builder:
        class_name: DefaultSiteIndexBuilder
        show_cta_footer: true
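The static-hosting command in the first note corresponds to this boto3 sketch (boto3 and configured credentials assumed):

import boto3

s3 = boto3.client("s3")

# Equivalent to: aws s3 website s3://data-docs.my_org/ --index-document index.html
s3.put_bucket_website(
    Bucket="data-docs.my_org",
    WebsiteConfiguration={"IndexDocument": {"Suffix": "index.html"}},
)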
Host and share Data Docs on Azure Blob Storage. Data Docs are served using an Azure Blob Storage static website with restricted access.
Prerequisites
- A working deployment of Great Expectations
- Permissions to create and configure an Azure Storage account
Install Azure Storage Blobs client library for Python
Run the following pip command in a terminal to install the Azure Storage Blobs client library and its dependencies:
pip install azure-storage-blob
Create an Azure Blob Storage static website
- Create a storage account.
- In Settings, select Static website.
- Select Enabled to enable static website hosting for the storage account.
- Enter "index.html" in the Index document name field.
- Record the Primary endpoint URL. Your team will use this URL to view the Data Docs site. A container named $web is added to your storage account to help you map a custom domain to this endpoint.
Configure the config_variables.yml file
GX recommends storing Azure Storage credentials in the config_variables.yml file, which is located in the uncommitted/ folder by default and is not part of source control. To review additional options for configuring the config_variables.yml file or additional environment variables, see Manage credentials.
- Get the Connection string of the storage account you created.
- Open the config_variables.yml file and then add the following AZURE_STORAGE_CONNECTION_STRING entry:

  AZURE_STORAGE_CONNECTION_STRING: "DefaultEndpointsProtocol=https;EndpointSuffix=core.windows.net;AccountName=<YOUR-STORAGE-ACCOUNT-NAME>;AccountKey=<YOUR-STORAGE-ACCOUNT-KEY==>"
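To confirm the connection string works before wiring it into GX, here is a minimal sketch using the azure-storage-blob library installed above (reading the value from an environment variable is illustrative):

import os

from azure.storage.blob import BlobServiceClient

# For this test, export the connection string in your shell first.
conn_str = os.environ["AZURE_STORAGE_CONNECTION_STRING"]
service = BlobServiceClient.from_connection_string(conn_str)

# The $web container exists once static website hosting is enabled.
print("$web exists:", service.get_container_client("$web").exists())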
Add a new Azure site to the data_docs_sites section of your great_expectations.yml
- Open the great_expectations.yml file and add the following entry:

  data_docs_sites:
    local_site:
      class_name: SiteBuilder
      show_how_to_buttons: true
      store_backend:
        class_name: TupleFilesystemStoreBackend
        base_directory: uncommitted/data_docs/local_site/
      site_index_builder:
        class_name: DefaultSiteIndexBuilder
    new_site_name:  # this is a user-selected name - you can select your own
      class_name: SiteBuilder
      store_backend:
        class_name: TupleAzureBlobStoreBackend
        container: \$web
        connection_string: ${AZURE_STORAGE_CONNECTION_STRING}
      site_index_builder:
        class_name: DefaultSiteIndexBuilder

- Optional. Replace the default local_site configuration to maintain a single Azure Data Docs site.
Because the container is named $web, setting container: $web in great_expectations.yml would cause GX to try, unsuccessfully, to find a web variable in config_variables.yml. Use an escape character (\) before the $ so that substitute_config_variable can locate the $web container.
You can also configure GX to store your Expectations and Validation Results in the Azure Storage account. See Configure Expectation Stores and Configure Validation Result Stores. Make sure you set container: \$web correctly.
The following options are available:
- container: The name of the Azure Blob container to store your data in.
- connection_string: The Azure Storage connection string. This can also be supplied by setting the AZURE_STORAGE_CONNECTION_STRING environment variable.
- prefix: All paths on blob storage will be prefixed with this string.
- account_url: The URL to the blob storage account. Any other entities included in the URL path (e.g. container or blob) will be discarded. This URL can optionally be authenticated with a SAS token. This option can only be used if you don't configure the connection_string. You can also configure this by setting the AZURE_STORAGE_ACCOUNT_URL environment variable.
The following authentication methods are supported:
- SAS token authentication: append the SAS token to account_url or make sure it is set in the connection_string.
- Account key authentication: include the account key in the connection_string.
- When neither of the above authentication methods is specified, DefaultAzureCredential is used, which supports most common authentication methods. You still need to provide the account URL, either through the config file or the environment variable.
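As an illustration of the last method, a minimal sketch using DefaultAzureCredential (this additionally assumes pip install azure-identity; the account URL is a placeholder):

import os

from azure.identity import DefaultAzureCredential
from azure.storage.blob import BlobServiceClient

# With no SAS token or account key, DefaultAzureCredential picks up whatever
# is available: environment variables, managed identity, Azure CLI login, etc.
account_url = os.environ.get(
    "AZURE_STORAGE_ACCOUNT_URL",
    "https://<YOUR-STORAGE-ACCOUNT-NAME>.blob.core.windows.net",
)
service = BlobServiceClient(account_url=account_url, credential=DefaultAzureCredential())
print("$web exists:", service.get_container_client("$web").exists())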
Build the Azure Blob Data Docs site
Creating or modifying an Expectation Suite will also build the Data Docs website.
Run the following Python code to build and open your Data Docs:
site_name = "new_site_name"
context.build_data_docs(site_names=site_name)
context.open_data_docs(site_name=site_name)
Limit Data Docs access (Optional)
- In your Azure Storage account Settings, select Networking.
- Allow access from Selected networks.
- Optional. Add access to a Virtual Network.
- Optional. Add IP ranges to the firewall. See Configure Azure Storage firewalls and virtual networks.
Host and share Data Docs on Google Cloud Storage (GCS). GX recommends using IP-based access, which is achieved by deploying a Google App Engine application.
To view the code used in the examples, see how_to_host_and_share_data_docs_on_gcs.py.
Prerequisites
- A Google Cloud project
- The Google Cloud SDK
- The gsutil command line tool
- Permissions to list and create buckets, deploy Google App Engine apps, and add app firewall rules
Create a GCS bucket
Run the following command to create a GCS bucket. Modify the project name, bucket name, and region for your environment.

gsutil mb -p <your project name> -l US-EAST1 -b on gs://<your bucket name>/

This is the output after you run the command:

Creating gs://<your bucket name>/...
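If you prefer Python, here is a hedged equivalent using the google-cloud-storage library (the project and bucket names are placeholders; -b on corresponds to uniform bucket-level access):

from google.cloud import storage

client = storage.Client(project="<your project name>")

# Mirror `gsutil mb -b on`: enable uniform bucket-level access at creation.
bucket = storage.Bucket(client, name="<your bucket name>")
bucket.iam_configuration.uniform_bucket_level_access_enabled = True
client.create_bucket(bucket, location="US-EAST1")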
Create a directory for your Google App Engine app
GX recommends adding the directory to your project directory. For example, great_expectations/team_gcs_app
.
- Create and then open app.yaml and then add the following entry:

  runtime: python37
  env_variables:
    CLOUD_STORAGE_BUCKET: <your bucket name>

- Create and then open requirements.txt and then add the following entry:

  flask>=1.1.0
  google-cloud-storage

- Create and then open main.py and then add the following entry (an optional local test entry point follows this list):

  import logging
  import os

  from flask import Flask, request
  from google.cloud import storage

  app = Flask(__name__)

  # Configure this environment variable via app.yaml
  CLOUD_STORAGE_BUCKET = os.environ['CLOUD_STORAGE_BUCKET']


  @app.route('/', defaults={'path': 'index.html'})
  @app.route('/<path:path>')
  def index(path):
      gcs = storage.Client()
      bucket = gcs.get_bucket(CLOUD_STORAGE_BUCKET)
      try:
          blob = bucket.get_blob(path)
          content = blob.download_as_string()
          if blob.content_encoding:
              resource = content.decode(blob.content_encoding)
          else:
              resource = content
      except Exception:
          logging.exception("couldn't get blob")
          resource = '<p></p>'
      return resource


  @app.errorhandler(500)
  def server_error(e):
      logging.exception('An error occurred during a request.')
      return '''
      An internal error occurred: <pre>{}</pre>
      See logs for full stacktrace.
      '''.format(e), 500
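Optionally, to smoke-test the app locally before deploying (App Engine itself does not need this), you can append a development entry point to main.py; CLOUD_STORAGE_BUCKET must be set in your shell:

  if __name__ == '__main__':
      # Local development only; on App Engine the runtime serves the app.
      app.run(host='127.0.0.1', port=8080, debug=True)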
Authenticate the gcloud CLI
Run the following command to authenticate the gcloud CLI and set the project:
gcloud auth login && gcloud config set project <your project name>
Deploy your Google App Engine app
Run the following CLI command from within the app directory you created previously:
gcloud app deploy
Set up the Google App Engine firewall
To restrict access to authorized IP addresses, configure firewall rules for the deployed app. See the Google App Engine firewall rules documentation.
Add a new GCS site to the data_docs_sites section of your great_expectations.yml
Open great_expectations.yml and add the following entry:

data_docs_sites:
  local_site:
    class_name: SiteBuilder
    show_how_to_buttons: true
    store_backend:
      class_name: TupleFilesystemStoreBackend
      base_directory: uncommitted/data_docs/local_site/
    site_index_builder:
      class_name: DefaultSiteIndexBuilder
  new_site_name:  # this is a user-selected name - you may select your own
    class_name: SiteBuilder
    store_backend:
      class_name: TupleGCSStoreBackend
      project: <your project name>
      bucket: <your bucket name>
    site_index_builder:
      class_name: DefaultSiteIndexBuilder

Replace the default local_site configuration to maintain a single GCS Data Docs site.
To host a Data Docs site with a private DNS, you can configure a base_public_path for the Data Docs Store. The following example configures a GCS site with the base_public_path set to www.mydns.com. Data Docs are still written to the configured location on GCS (for example, https://storage.cloud.google.com/my_org_data_docs/index.html), but you will be able to access the pages from your DNS (http://www.mydns.com/index.html in this example):
data_docs_sites:
  gs_site:  # this is a user-selected name - you may select your own
    class_name: SiteBuilder
    store_backend:
      class_name: TupleGCSStoreBackend
      project: <YOUR GCP PROJECT NAME>
      bucket: <YOUR GCS BUCKET NAME>
      base_public_path: http://www.mydns.com
    site_index_builder:
      class_name: DefaultSiteIndexBuilder
Build the GCS Data Docs site
Run the following Python code to build and open your Data Docs:
site_name = "new_site_name"
context.build_data_docs(site_names=site_name)
context.open_data_docs(site_name=site_name)
Test the configuration
In the gcloud CLI, run gcloud app browse. If the command runs successfully, your app's URL is returned and opened in a new browser window, displaying the index page of your Data Docs site.
Host and share Data Docs on a filesystem.
Prerequisites
- A working deployment of Great Expectations
Review the default settings
Filesystem-hosted Data Docs are configured by default for Great Expectations deployments created using great_expectations init. To create additional Data Docs sites, you may re-use the default Data Docs configuration below. You may replace local_site with your own site name, or leave the default.
data_docs_sites:
  local_site:  # this is a user-selected name - you may select your own
    class_name: SiteBuilder
    store_backend:
      class_name: TupleFilesystemStoreBackend
      base_directory: uncommitted/data_docs/local_site/  # this is the default path but can be changed as required
    site_index_builder:
      class_name: DefaultSiteIndexBuilder
Build the site
Run the following Python code to build and open your Data Docs:
context.build_data_docs()
context.open_data_docs()
To share the site, compress and distribute the directory specified in the base_directory key of your site configuration.
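For example, here is a minimal Python sketch that zips the default site directory for distribution (the archive name is illustrative):

import shutil

# Creates data_docs_site.zip in the current working directory.
shutil.make_archive(
    base_name="data_docs_site",
    format="zip",
    root_dir="uncommitted/data_docs/local_site/",  # the base_directory from your site configuration
)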