Build an End-to-End Data Capture Pipeline using Document AI
Task 1. Enable the APIs required for the lab
You must enable the APIs for Document AI, Cloud Functions, Cloud Build, and Geocoding for this lab and then create the API key that is required by the Geocoding Cloud Function.
In Cloud Shell, enter the following commands to enable the APIs required by the lab:
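The command block for this step is not shown above. A typical invocation would enable the four APIs named in this task; the service IDs below are the standard ones for these products, but confirm them against your lab instructions:

```shell
# Enable the APIs used by the lab (service IDs assumed; verify against your lab instructions)
gcloud services enable documentai.googleapis.com
gcloud services enable cloudfunctions.googleapis.com
gcloud services enable cloudbuild.googleapis.com
gcloud services enable geocoding-backend.googleapis.com
```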
In the Cloud Console, in the Navigation menu, click APIs & services > Credentials.
Select Create credentials, then select API key from the dropdown menu.
The API key created dialog box displays your newly created key. An API key is a long, randomly generated string, for example a4db08b757294ea94c08f2df493465a1.
Click Edit API key in the dialog box.
Select Restrict key in the API restrictions section to add API restrictions for your new API key.
Click in the filter box and type Geocoding API.
Select Geocoding API and click OK.
Click the Save button.
Task 2. Copy the lab source code
- In Cloud Shell, enter the following command to clone the source repository for the lab:
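The clone command itself is not shown above. Based on the paths used later in the lab, the repository is checked out to `~/documentai-pipeline-demo`; the repository URL below is a placeholder, so substitute the URL provided in your lab instructions:

```shell
cd ~
# Placeholder URL — use the repository URL provided in your lab instructions.
git clone <LAB_REPOSITORY_URL> documentai-pipeline-demo
```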
Task 3. Create a form processor
Create an instance of the generic form processor to use in the Document AI Platform using the Document AI Form Parser specialized parser. The generic form processor will process any type of document and extract all the text content it can identify in the document. It is not limited to printed text: it can handle handwritten text and text in any orientation, supports a number of languages, and understands how form data elements relate to each other, so you can extract key:value pairs for form fields that have text labels.
In the console, open the navigation menu and select Document AI > Overview.
Click Explore Processor and select Form Parser.
Specify the processor name as form-processor and select the region US (United States) from the list.
Click Create to create your processor.
You will configure a Cloud Function later in this lab with the processor ID and location of this processor so that the Cloud Function will use this specific processor to process sample invoices.
Task 4. Create Cloud Storage buckets and a BigQuery dataset
Prepare your environment by creating the Google Cloud resources that are required for your document processing pipeline.
Create input, output, and archive Cloud Storage buckets
Create input, output, and archive Cloud Storage buckets for your document processing pipeline.
- In Cloud Shell, enter the following command to create the Cloud Storage buckets for the lab:
export PROJECT_ID=$(gcloud config get-value core/project)
export BUCKET_LOCATION=us-central1
gsutil mb -c standard -l ${BUCKET_LOCATION} -b on \
gs://${PROJECT_ID}-input-invoices
gsutil mb -c standard -l ${BUCKET_LOCATION} -b on \
gs://${PROJECT_ID}-output-invoices
gsutil mb -c standard -l ${BUCKET_LOCATION} -b on \
gs://${PROJECT_ID}-archived-invoices
Create a BigQuery dataset and tables
Create a BigQuery dataset and the output tables required for your data processing pipeline.
- In Cloud Shell, enter the following command to create the BigQuery tables for the lab:
bq --location=US mk -d \
--description "Form Parser Results" \
${PROJECT_ID}:invoice_parser_results
cd ~/documentai-pipeline-demo/scripts/table-schema/
bq mk --table \
invoice_parser_results.doc_ai_extracted_entities \
doc_ai_extracted_entities.json
bq mk --table \
invoice_parser_results.geocode_details \
geocode_details.json
You can navigate to BigQuery in the Cloud Console and inspect the schemas for the tables in the invoice_parser_results dataset using the BigQuery SQL workspace.
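You can also inspect a table's schema directly from Cloud Shell. For example, assuming the dataset and table names created above:

```shell
# Print the schema of the extracted-entities table as formatted JSON
bq show --schema --format=prettyjson \
  ${PROJECT_ID}:invoice_parser_results.doc_ai_extracted_entities
```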
Create a Pub/Sub topic
Initialize the Pub/Sub topic used to trigger the Geocoding API data enrichment operations in the processing pipeline.
- In Cloud Shell, enter the following command to create the Pub/Sub topic for the lab:
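The topic-creation command is not shown above. A typical invocation follows; the topic name here is an assumption, and it must match the `GEO_CODE_REQUEST_PUBSUB_TOPIC` variable used when deploying the geocode-addresses Cloud Function later in the lab:

```shell
# Topic name assumed for illustration; it must match the value used in Task 5
export GEO_CODE_REQUEST_PUBSUB_TOPIC=geocode_request
gcloud pubsub topics create ${GEO_CODE_REQUEST_PUBSUB_TOPIC}
```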
Task 5. Create Cloud Functions
Create the two Cloud Functions that your data processing pipeline uses to process invoices uploaded to Cloud Storage. These functions use the Document AI API to extract form data from the raw documents, then use the Geocoding API to retrieve geolocation data about the address information extracted from the documents.
You can examine the source code for the two Cloud Functions using the Code Editor or any other editor of your choice. The Cloud Functions are stored in the following folders in Cloud Shell:
- Process Invoices - scripts/cloud-functions/process-invoices
- Geocode Addresses - scripts/cloud-functions/geocode-addresses
The main Cloud Function, process-invoices, is triggered when files are uploaded to the input files storage bucket you created earlier.
The function folder scripts/cloud-functions/process-invoices contains the two files that are used to create the process-invoices Cloud Function.
The requirements.txt file specifies the Python libraries required by the function. This includes the Document AI client library as well as the other Google Cloud libraries required by the Python code to read the files from Cloud Storage, save data to BigQuery, and write messages to Pub/Sub that will trigger the remaining functions in the solution pipeline.
The main.py Python file contains the Cloud Function code that creates the Document AI, BigQuery, and Pub/Sub API clients and the following internal functions to process the documents:
write_to_bq - Writes a dictionary object to the BigQuery table. Note that you must ensure the schema is valid before calling this function.
get_text - Maps form name and value text anchors to the scanned text in the document. This allows the function to identify specific form elements, such as the Supplier name and Address, and extract the relevant values. A specialized Document AI processor provides that contextual information directly in the entities property.
process_invoice - Uses the asynchronous Document AI client API to read and process files from Cloud Storage as follows:
- Creates an asynchronous request to process the file(s) that triggered the Cloud Function call.
- Processes form data to extract invoice fields, storing only specific fields in a dictionary that are part of the predefined schema.
- Publishes Pub/Sub messages to trigger the Geocoding Cloud Function using address form data extracted from the document.
- Writes form data to a BigQuery table.
- Deletes the intermediate (output) files created by the asynchronous Document AI API call.
- Copies input files to the archive bucket.
- Deletes processed input files.
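As a sketch of what the get_text helper described above does: Document AI returns form fields as text anchors, i.e., offsets into the document's full text, rather than as standalone strings, so the helper assembles each field value from those segments. The attribute names follow the Document AI client library's Document message; treat this as illustrative rather than the lab's exact code:

```python
def get_text(element, document):
    """Concatenate the text segments referenced by an element's text anchor.

    `element` is a form-field name or value with a `text_anchor`;
    `document` carries the full OCR text in its `text` attribute.
    """
    response = ""
    for segment in element.text_anchor.text_segments:
        start = int(segment.start_index)
        end = int(segment.end_index)
        response += document.text[start:end]
    return response
```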
The process-invoices Cloud Function only processes form data that has been detected with the following form field names:
input_file_name, address, supplier, invoice_number, purchase_order, date, due_date, subtotal, tax, total
The other Cloud Function, geocode-addresses, is triggered when a new message arrives on a Pub/Sub topic and it extracts its parameter data from the Pub/Sub message.
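Pub/Sub-triggered background functions receive the message payload base64-encoded in the event's data field, which is how geocode-addresses extracts its parameters. A minimal sketch of its entry point (named process_address to match the deploy command below; the payload field names are assumptions, not the lab's exact schema):

```python
import base64
import json

def process_address(event, context):
    """Entry point for the Pub/Sub-triggered geocoding function.

    The publisher's JSON payload arrives base64-encoded in event["data"].
    """
    payload = json.loads(base64.b64decode(event["data"]).decode("utf-8"))
    address = payload.get("address")             # field name assumed for illustration
    input_file = payload.get("input_file_name")  # field name assumed for illustration
    # ...call the Geocoding API with `address`, then write the results to BigQuery...
    return address, input_file
```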
Create the Cloud Function to process documents uploaded to Cloud Storage
Create a Cloud Function that uses a Document AI form processor to parse form documents that have been uploaded to a Cloud Storage bucket.
- Create the Invoice Processor Cloud Function:
cd ~/documentai-pipeline-demo/scripts
export CLOUD_FUNCTION_LOCATION=us-central1
gcloud functions deploy process-invoices \
--region=${CLOUD_FUNCTION_LOCATION} \
--entry-point=process_invoice \
--runtime=python37 \
--service-account=${PROJECT_ID}@appspot.gserviceaccount.com \
--source=cloud-functions/process-invoices \
--timeout=400 \
--env-vars-file=cloud-functions/process-invoices/.env.yaml \
--trigger-resource=gs://${PROJECT_ID}-input-invoices \
--trigger-event=google.storage.object.finalize
Create the Cloud Function to lookup geocode data from an address
Create the Cloud Function that accepts address data from a Pub/Sub message and uses the Geocoding API to precisely locate the address.
- Create the Geocoding Cloud Function:
cd ~/documentai-pipeline-demo/scripts
gcloud functions deploy geocode-addresses \
--region=${CLOUD_FUNCTION_LOCATION} \
--entry-point=process_address \
--runtime=python38 \
--service-account=${PROJECT_ID}@appspot.gserviceaccount.com \
--source=cloud-functions/geocode-addresses \
--timeout=60 \
--env-vars-file=cloud-functions/geocode-addresses/.env.yaml \
--trigger-topic=${GEO_CODE_REQUEST_PUBSUB_TOPIC}
Task 6. Edit environment variables for Cloud Functions
In this task, you finalize the configuration of the Cloud Functions by editing each function's environment variables in the Cloud Console to reflect your lab-specific parameters.
Edit environment variables for the process-invoices Cloud Function
Set the Cloud Function environment variables for the process-invoices function.
- In the Cloud Console, in the Navigation menu, click Cloud Functions.
- Click the Cloud Function process-invoices to open its management page.
- Click Edit.
- Click Runtime, build, connections and security settings to expand that section.
- Under Runtime environment variables, update the PROCESSOR_ID value to match the Invoice processor ID you created earlier.
- Under Runtime environment variables, update the PARSER_LOCATION value to match the region of the Invoice processor you created earlier. This will be us if you accepted the default location, otherwise eu. This parameter must be lowercase.
- Click Next and select .env.yaml and then update the PROCESSOR_ID and PARSER_LOCATION values again for your invoice processor.
- Click Deploy.
Edit environment variables for the geocode-addresses Cloud Function
Set the Cloud Function environment variables for the GeoCode data enrichment function.
- Click the Cloud Function geocode-addresses to open its management page.
- Click Edit.
- Click Runtime, build, connections and security settings to expand that section.
- Under Runtime environment variables, update the API_key value to match the API Key value created in Task 1.
- Click Next and select .env.yaml and then update the API_key value to match the API Key value you set in the previous step.
- Click Deploy.
Task 7. Test and validate the end-to-end solution
Upload test data to Cloud Storage and monitor the progress of the pipeline as the documents are processed and the extracted data is enhanced.
- In Cloud Shell, enter the following command to upload sample forms to the Cloud Storage bucket that will trigger the
process-invoices Cloud Function:
export PROJECT_ID=$(gcloud config get-value core/project)
gsutil cp gs://sureskills-lab-dev/gsp927/documentai-pipeline-demo/sample-files/* gs://${PROJECT_ID}-input-invoices/
- In the Cloud Console, on the Navigation menu, click Cloud Functions.
- Click the Cloud Function process-invoices to open its management page.
- Click Logs.
In the Cloud Console, on the Navigation menu, click BigQuery.
Expand your Project ID in the Explorer.
Expand invoice_parser_results.
Select doc_ai_extracted_entities and click Preview. You will see the form information extracted from the invoices by the invoice processor. You can see that the address information and the supplier name have been detected.
Select geocode_details and click Preview. You will see the formatted address, latitude, and longitude for each invoice that has been processed that contained address data that Document AI was able to extract.