Install GX 1.0
To use GX 1.0, you need to install Python and the GX 1.0 Python library. GX also recommends you set up a virtual environment for your GX Python projects.
Prerequisites
- An installation of Python, version 3.8 to 3.11
- (Recommended) A Python virtual environment
- Internet access
- Permissions to download and install packages in your environment
Install Python
- Reference the official Python documentation to install an appropriate version of Python.
GX requires Python version 3.8 to 3.11, which is available from the official Python downloads page.
- Verify your Python installation.
Run the following command to display your Python version:
python --version
You should receive a response similar to:
Python 3.8.6
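The same check can be made from inside Python. The following is a minimal sketch and not part of GX: the helper name `is_supported_python` and the two bounds constants are illustrative assumptions that simply encode the 3.8 to 3.11 range stated above.

```python
import sys

# Supported range taken from the prerequisites above (3.8 to 3.11, inclusive).
MIN_SUPPORTED = (3, 8)
MAX_SUPPORTED = (3, 11)


def is_supported_python(version_info=None):
    """Return True if the major.minor version is within the supported range."""
    if version_info is None:
        version_info = sys.version_info
    major_minor = tuple(version_info[:2])
    return MIN_SUPPORTED <= major_minor <= MAX_SUPPORTED


if __name__ == "__main__":
    current = ".".join(str(part) for part in sys.version_info[:3])
    status = "supported" if is_supported_python() else "outside the supported range"
    print(f"Python {current} is {status} for GX 1.0")
```

Only the major and minor components are compared, so any patch release within 3.8 to 3.11 passes the check.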
(Optional) Create a virtual environment
Although this step is optional, working in a virtual environment is a best practice for Python projects. A virtual environment ensures that the libraries and dependencies you install for your project do not conflict with libraries and dependencies installed for other Python projects.
There are various tools, such as virtualenv and pyenv, that can be used to create a virtual environment. This example uses venv because it is included with Python 3.
- Create the virtual environment with venv.
To create your virtual environment, run the following code from the folder the environment should reside in:
python -m venv my_venv
This command creates a new directory named my_venv, which will contain your virtual environment.
To use a different name for your virtual environment, replace my_venv in the example with the name you prefer. You must also replace my_venv with your virtual environment's actual name in any other example code that includes my_venv.
- (Optional) Test your virtual environment by activating it.
Activate your virtual environment by running the following command from the folder it was installed in:
source my_venv/bin/activate
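To confirm that the activation took effect, you can ask the interpreter itself. This is a minimal sketch, assuming the environment was created with venv as described above; the function name `in_virtual_environment` is illustrative and not part of GX.

```python
import sys


def in_virtual_environment():
    """Return True when the running interpreter belongs to a virtual environment."""
    # Inside a venv, sys.prefix points at the environment directory while
    # sys.base_prefix still points at the base Python installation.
    return sys.prefix != sys.base_prefix


if __name__ == "__main__":
    if in_virtual_environment():
        print(f"Virtual environment active: {sys.prefix}")
    else:
        print("No virtual environment is active")
```

When my_venv is active, the printed prefix should end with the my_venv directory you created.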
Install the GX 1.0 Python library
- Local
- Hosted environment
- GX Cloud
Local
GX 1.0 is a Python library, so you can use it with a local Python installation to access GX functionality through Python scripts.
Installation and setup
- (Optional) Activate your virtual environment.
If you created a virtual environment for your GX Python installation, navigate to the folder that contains your virtual environment and activate it by running the following command:
source my_venv/bin/activate
- Ensure you have the latest version of pip:
python -m ensurepip --upgrade
- Install the GX 1.0 library:
pip install great_expectations
- Verify that GX installed successfully with the CLI command:
great_expectations --version
If GX installed successfully, the output is similar to:
great_expectations, version 1.0.0a2
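The installation can also be verified from Python rather than from the command line. The sketch below uses the standard library's `importlib.metadata`; the helper name `installed_version` is an illustrative assumption, and the distribution name is taken from the pip command above.

```python
from importlib import metadata


def installed_version(distribution_name):
    """Return the installed version string, or None if the package is absent."""
    try:
        return metadata.version(distribution_name)
    except metadata.PackageNotFoundError:
        return None


if __name__ == "__main__":
    version = installed_version("great_expectations")
    if version is None:
        print("great_expectations is not installed in this environment")
    else:
        print(f"great_expectations {version} is installed")
```

Run this with the same interpreter (and the same virtual environment, if you created one) that you used for the pip installation. Otherwise the lookup may inspect a different set of installed packages.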
Hosted environment
Hosted environments such as EMR Spark clusters or Databricks clusters do not provide a filesystem where you can install your GX instance. Instead, you must install GX in-memory using the Python-style notebooks available on those platforms.
- EMR Spark notebook
- Databricks notebook
EMR Spark notebook
Use the information provided here to install GX on an EMR Spark cluster and instantiate a Data Context without a full configuration directory.
Additional prerequisites
- An EMR Spark cluster.
- Access to the EMR Spark notebook.
Installation and setup
- To install Great Expectations on your EMR Spark cluster, copy this code snippet into a cell in your EMR Spark notebook and then run it:
sc.install_pypi_package("great_expectations")
- Create an in-code Data Context. See Instantiate an Ephemeral Data Context.
- Copy the Python code at the end of How to instantiate an Ephemeral Data Context into a cell in your EMR Spark notebook, or use the other examples to customize your configuration. The code instantiates and configures a Data Context for an EMR Spark cluster.
- Execute the cell with your Data Context initialization and configuration.
- Run the following command to verify that GX was installed and your in-memory Data Context was instantiated successfully:
context.list_datasources()
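For orientation, an in-memory Data Context can be requested roughly as follows. This is a hedged sketch rather than the code from the referenced guide: it assumes that `gx.get_context` accepts `mode="ephemeral"` in your installed GX version, and the wrapper name `create_ephemeral_context` is hypothetical. Prefer the referenced guide if the two differ.

```python
def create_ephemeral_context():
    """Return an in-memory Data Context, or None if GX is not importable."""
    try:
        import great_expectations as gx
    except ImportError:
        return None
    # Assumption: mode="ephemeral" requests a context that keeps its
    # configuration in memory rather than in a project directory.
    return gx.get_context(mode="ephemeral")


if __name__ == "__main__":
    context = create_ephemeral_context()
    if context is None:
        print("great_expectations is not installed")
    else:
        print(f"Created context of type {type(context).__name__}")
```

Returning None when the import fails keeps the cell from raising if the package installation step has not finished or did not succeed.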
Databricks notebook
Databricks is a web-based platform that automates Spark cluster management and lets you work with clusters through Python notebooks.
To avoid configuring external resources, you'll use the Databricks File System (DBFS) for your Metadata Stores and Data Docs store.
DBFS is a distributed file system mounted in a Databricks workspace and available on Databricks clusters. Files on DBFS can be written and read as if they were on a local filesystem by adding the /dbfs/ prefix to the path. It also persists in object storage, so you won’t lose data after terminating a cluster. See the Databricks documentation for best practices, including mounting object stores.
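The prefix rule described above can be illustrated with a small helper. This is a sketch under stated assumptions: the function name `to_local_dbfs_path` is hypothetical, and it assumes DBFS locations are written either as `dbfs:/...` URIs or as root-relative paths, and that an input already starting with `/dbfs/` is in local form.

```python
LOCAL_PREFIX = "/dbfs/"


def to_local_dbfs_path(path):
    """Convert a DBFS location into the /dbfs/-prefixed local filesystem form."""
    # Assumption: a path that already starts with /dbfs/ is in local form.
    if path.startswith(LOCAL_PREFIX):
        return path
    # Drop an optional "dbfs:" scheme before adding the local prefix.
    if path.startswith("dbfs:"):
        path = path[len("dbfs:"):]
    return LOCAL_PREFIX + path.lstrip("/")


if __name__ == "__main__":
    print(to_local_dbfs_path("dbfs:/great_expectations/"))
```

The resulting string can be passed to ordinary Python file APIs on a Databricks cluster, which is why the Data Context path in the steps below begins with `/dbfs/`.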
Additional prerequisites
- A complete Databricks setup, including a running Databricks cluster with an attached notebook
- Access to DBFS
Installation and setup
- Run the following command in your notebook to install GX as a notebook-scoped library:
%pip install great-expectations
A notebook-scoped library is a custom Python environment that is specific to a notebook. You can also install a library at the cluster or workspace level. See Databricks Libraries.
- Run the following code to import the Python modules you'll use in the following steps:
import great_expectations as gx
from great_expectations.checkpoint import Checkpoint
- Run the following code to specify a /dbfs/ path for your Data Context:
context_root_dir = "/dbfs/great_expectations/"
- Run the following code to instantiate your Data Context:
context = gx.get_context(context_root_dir=context_root_dir)
Next steps
🚧 Under construction 🚧
Updates for this page are in progress.
- Connect to data in files stored in the DBFS
- Connect to data in an in-memory Spark Dataframe
GX Cloud
GX Cloud provides a web interface for using GX to validate your data without creating and running complex Python code. However, GX 1.0 can also connect to a GX Cloud account if you want to further customize or automate your workflows through Python scripts.
Installation and setup
To deploy a GX Agent, which serves as an intermediary between GX Cloud's interface and your organization's data stores, see Connect GX Cloud. The GX Agent serves all GX Cloud users within your organization. If a GX Agent has already been deployed for your organization, you can use the GX Cloud online application without further installation or setup.
To connect to GX Cloud from a Python script that uses a local installation of GX 1.0 instead of the GX Agent, see Connect to an existing Data Context.
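Before attempting a connection from a script, it can help to confirm that your GX Cloud credentials are available to Python. The sketch below is illustrative only: it assumes the credentials are supplied through environment variables named `GX_CLOUD_ACCESS_TOKEN` and `GX_CLOUD_ORGANIZATION_ID`, which you should confirm against the GX Cloud documentation, and the helper name `missing_cloud_variables` is hypothetical.

```python
import os

# Assumption: these are the environment variable names GX reads for
# GX Cloud credentials; confirm them in the GX Cloud documentation.
REQUIRED_VARIABLES = ("GX_CLOUD_ACCESS_TOKEN", "GX_CLOUD_ORGANIZATION_ID")


def missing_cloud_variables(environ=None):
    """Return the names of required variables that are unset or empty."""
    if environ is None:
        environ = os.environ
    return [name for name in REQUIRED_VARIABLES if not environ.get(name)]


if __name__ == "__main__":
    missing = missing_cloud_variables()
    if missing:
        print("Missing GX Cloud settings: " + ", ".join(missing))
    else:
        print("GX Cloud credentials found in the environment")
```

Checking for the variables by name, without printing their values, avoids leaking the access token into notebook output or logs.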