DataPipeline-with-DocumentAI-Form-Parser

 

Build an End-to-End Data Capture Pipeline using Document AI

Task 1. Enable the APIs required for the lab

You must enable the APIs for Document AI, Cloud Functions, Cloud Build, and Geocoding for this lab and then create the API key that is required by the Geocoding Cloud Function.

  1. In Cloud Shell, enter the following commands to enable the APIs required by the lab:

    gcloud services enable documentai.googleapis.com      
    gcloud services enable cloudfunctions.googleapis.com  
    gcloud services enable cloudbuild.googleapis.com    
    gcloud services enable geocoding-backend.googleapis.com  
  2. In the Cloud Console, in the Navigation menu (Navigation menu icon), click APIs & services > Credentials.

  3. Select Create credentials, then select API key from the dropdown menu.

The API key created dialog box displays your newly created key. An API key is a long string of letters and numbers, for example, a4db08b757294ea94c08f2df493465a1.

  4. Click Edit API key in the dialog box.

  5. Select Restrict key in the API restrictions section to add API restrictions for your new API key.

  6. Click in the filter box and type Geocoding API.

  7. Select Geocoding API and click OK.

  8. Click the Save button.
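
The Geocoding Cloud Function you deploy later reads this key from an environment variable. As a rough, stand-alone sketch of how a Geocoding API request URL can be assembled (the key and address values here are placeholders, and this helper is illustrative, not the lab's code):

```python
from urllib.parse import urlencode

def build_geocode_url(address: str, api_key: str) -> str:
    """Build a Geocoding API request URL for the given address."""
    base = "https://maps.googleapis.com/maps/api/geocode/json"
    return f"{base}?{urlencode({'address': address, 'key': api_key})}"

# Placeholder key; in the pipeline the real key comes from the function's environment.
url = build_geocode_url("1600 Amphitheatre Parkway, Mountain View, CA", "YOUR_API_KEY")
```

Because the key is restricted to the Geocoding API, requests built this way will only succeed against that service.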

Task 2. Copy the lab source files into your Cloud Shell

  • In Cloud Shell, enter the following commands to copy the source files for the lab:
  mkdir ~/documentai-pipeline-demo
  gsutil -m cp -r \
    gs://sureskills-lab-dev/gsp927/documentai-pipeline-demo/* \
    ~/documentai-pipeline-demo/

Task 3. Create a form processor

Create an instance of the generic form processor in the Document AI Platform using the Document AI Form Parser specialized parser. The generic form processor will process any type of document and extract all the text content it can identify in the document. It is not limited to printed text: it can handle handwritten text and text in any orientation, supports a number of languages, and understands how form data elements relate to each other, so you can extract key:value pairs for form fields that have text labels.

  1. In the console, open the navigation menu and select Document AI > Overview.

  2. Click Explore Processor and select Form Parser.

  3. Specify the processor name as form-processor and select the region US (United States) from the list.

  4. Click Create to create your processor.

You will configure a Cloud Function later in this lab with the processor ID and location of this processor so that the Cloud Function will use this specific processor to process sample invoices.

Task 4. Create Cloud Storage buckets and a BigQuery dataset

Prepare your environment by creating the Google Cloud resources that are required for your document processing pipeline.

Create input, output, and archive Cloud Storage buckets

Create input, output, and archive Cloud Storage buckets for your document processing pipeline.

  • In Cloud Shell, enter the following commands to create the Cloud Storage buckets for the lab:
  export PROJECT_ID=$(gcloud config get-value core/project)
  export BUCKET_LOCATION=us-central1
  gsutil mb -c standard -l ${BUCKET_LOCATION} -b on \
    gs://${PROJECT_ID}-input-invoices
  gsutil mb -c standard -l ${BUCKET_LOCATION} -b on \
    gs://${PROJECT_ID}-output-invoices
  gsutil mb -c standard -l ${BUCKET_LOCATION} -b on \
    gs://${PROJECT_ID}-archived-invoices

Create a BigQuery dataset and tables

Create a BigQuery dataset and the two output tables required for your data processing pipeline.

  • In Cloud Shell, enter the following commands to create the BigQuery dataset and tables for the lab:
  bq --location=US mk -d \
     --description "Form Parser Results" \
     ${PROJECT_ID}:invoice_parser_results
  cd ~/documentai-pipeline-demo/scripts/table-schema/
  bq mk --table \
    invoice_parser_results.doc_ai_extracted_entities \
    doc_ai_extracted_entities.json
  bq mk --table \
    invoice_parser_results.geocode_details \
    geocode_details.json

You can navigate to BigQuery in the Cloud Console and inspect the schemas for the tables in the invoice_parser_results dataset using the BigQuery SQL workspace.
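
The table layouts come from the JSON schema files copied in Task 2. As an illustration only of the standard BigQuery JSON schema format those files use (the field names below are assumptions, not the lab's actual schema):

```json
[
  {"name": "input_file_name", "type": "STRING", "mode": "NULLABLE"},
  {"name": "formatted_address", "type": "STRING", "mode": "NULLABLE"},
  {"name": "lat", "type": "FLOAT", "mode": "NULLABLE"},
  {"name": "lng", "type": "FLOAT", "mode": "NULLABLE"}
]
```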

Create a Pub/Sub topic

Initialize the Pub/Sub topic used to trigger the Geocoding API data enrichment operations in the processing pipeline.

  • In Cloud Shell, enter the following command to create the Pub/Sub topic for the lab:
  export GEO_CODE_REQUEST_PUBSUB_TOPIC=geocode_request
  gcloud pubsub topics \
    create ${GEO_CODE_REQUEST_PUBSUB_TOPIC}

Task 5. Create Cloud Functions


Create the two Cloud Functions that your data processing pipeline uses to process invoices uploaded to Cloud Storage. These functions use the Document AI API to extract form data from the raw documents, then use the Geocoding API to retrieve geolocation data for the address information extracted from the documents.

You can examine the source code for the two Cloud Functions using the Code Editor or any other editor of your choice. The Cloud Functions are stored in the following folders in Cloud Shell:

  • Process Invoices - scripts/cloud-functions/process-invoices
  • Geocode Addresses - scripts/cloud-functions/geocode-addresses

The main Cloud Function, process-invoices, is triggered when files are uploaded to the input files storage bucket you created earlier.

The function folder scripts/cloud-functions/process-invoices contains the two files that are used to create the process-invoices Cloud Function.

The requirements.txt file specifies the Python libraries required by the function. This includes the Document AI client library as well as the other Google Cloud libraries required by the Python code to read the files from Cloud Storage, save data to BigQuery, and write messages to Pub/Sub that will trigger the remaining functions in the solution pipeline.

The main.py Python file contains the Cloud Function code that creates the Document AI, BigQuery, and Pub/Sub API clients and the following internal functions to process the documents:

  • write_to_bq - Writes a dictionary object to the BigQuery table. Note that you must ensure the schema is valid before calling this function.
  • get_text - Maps form name and value text anchors to the scanned text in the document. This allows the function to identify specific form elements, such as the supplier name and address, and extract the relevant value. A specialized Document AI processor provides that contextual information directly in the entities property.
  • process_invoice - Uses the asynchronous Document AI client API to read and process files from Cloud Storage as follows:
    • Creates an asynchronous request to process the file(s) that triggered the Cloud Function call.
    • Processes form data to extract invoice fields, storing only specific fields in a dictionary that are part of the predefined schema.
    • Publishes Pub/Sub messages to trigger the Geocoding Cloud Function using address form data extracted from the document.
    • Writes form data to a BigQuery table.
    • Deletes the intermediate (output) files created by the asynchronous Document AI API call.
    • Copies input files to the archive bucket.
    • Deletes processed input files.
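
The text-anchor mapping that get_text performs can be sketched in isolation as follows (the dictionary layout below is a simplified assumption for illustration; the real function works with Document AI response objects):

```python
def get_text(doc_text: str, text_anchor: dict) -> str:
    """Join the text segments a text anchor points at in the full document text."""
    parts = []
    for segment in text_anchor.get("text_segments", []):
        # start_index is omitted by the API when it is zero.
        start = int(segment.get("start_index", 0))
        end = int(segment["end_index"])
        parts.append(doc_text[start:end])
    return "".join(parts).strip()

# Hypothetical scanned text and a text anchor covering the supplier name.
doc_text = "Supplier: Acme Corp\nTotal: $120.00\n"
anchor = {"text_segments": [{"start_index": 10, "end_index": 19}]}
supplier = get_text(doc_text, anchor)  # "Acme Corp"
```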

The process-invoices Cloud Function only processes form data that has been detected with the following form field names:

  • input_file_name
  • address
  • supplier
  • invoice_number
  • purchase_order
  • date
  • due_date
  • subtotal
  • tax
  • total
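
The filtering step above can be sketched as follows (the sample extracted values are hypothetical):

```python
# The predefined schema: only these detected field names are kept.
ALLOWED_FIELDS = {
    "input_file_name", "address", "supplier", "invoice_number",
    "purchase_order", "date", "due_date", "subtotal", "tax", "total",
}

def filter_to_schema(extracted: dict) -> dict:
    """Drop any detected form fields that are not part of the BigQuery schema."""
    return {k: v for k, v in extracted.items() if k in ALLOWED_FIELDS}

# "fax" is not in the schema, so it is discarded before the BigQuery write.
row = filter_to_schema({"supplier": "Acme Corp", "total": "120.00", "fax": "555-0100"})
```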

The other Cloud Function, geocode-addresses, is triggered when a new message arrives on the Pub/Sub topic; it extracts its parameter data from the Pub/Sub message.
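
Background functions triggered by Pub/Sub receive the message data base64-encoded inside the event object. A minimal stand-alone sketch of the decode step (the payload field names are illustrative, not the lab's exact message format):

```python
import base64
import json

def decode_pubsub_event(event: dict) -> dict:
    """Decode the JSON payload of a Pub/Sub-triggered background function event."""
    message_bytes = base64.b64decode(event["data"])
    return json.loads(message_bytes.decode("utf-8"))

# Simulate an event as it would be delivered to the geocode-addresses function.
payload = {"input_file_name": "invoice-001.pdf",
           "address": "1600 Amphitheatre Parkway, Mountain View, CA"}
event = {"data": base64.b64encode(json.dumps(payload).encode("utf-8")).decode("utf-8")}
params = decode_pubsub_event(event)
```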

Create the Cloud Function to process documents uploaded to Cloud Storage

Create a Cloud Function that uses a Document AI form processor to parse form documents that have been uploaded to a Cloud Storage bucket.

  • Create the Invoice Processor Cloud Function:
  cd ~/documentai-pipeline-demo/scripts
  export CLOUD_FUNCTION_LOCATION=us-central1
  gcloud functions deploy process-invoices \
  --region=${CLOUD_FUNCTION_LOCATION} \
  --entry-point=process_invoice \
  --runtime=python37 \
  --service-account=${PROJECT_ID}@appspot.gserviceaccount.com \
  --source=cloud-functions/process-invoices \
  --timeout=400 \
  --env-vars-file=cloud-functions/process-invoices/.env.yaml \
  --trigger-resource=gs://${PROJECT_ID}-input-invoices \
  --trigger-event=google.storage.object.finalize

Create the Cloud Function to lookup geocode data from an address

Create the Cloud Function that accepts address data from a Pub/Sub message and uses the Geocoding API to precisely locate the address.

  • Create the Geocoding Cloud Function:
  cd ~/documentai-pipeline-demo/scripts
  gcloud functions deploy geocode-addresses \
  --region=${CLOUD_FUNCTION_LOCATION} \
  --entry-point=process_address \
  --runtime=python38 \
  --service-account=${PROJECT_ID}@appspot.gserviceaccount.com \
  --source=cloud-functions/geocode-addresses \
  --timeout=60 \
  --env-vars-file=cloud-functions/geocode-addresses/.env.yaml \
  --trigger-topic=${GEO_CODE_REQUEST_PUBSUB_TOPIC}

Task 6. Edit environment variables for Cloud Functions

In this task, you finalize the configuration of the Cloud Functions by editing each function's environment variables in the Cloud Console to reflect your lab-specific parameters.

Edit environment variables for the process-invoices Cloud Function

Set the Cloud Function environment variables for the process-invoices function.

  1. In the Cloud Console, in the Navigation menu (Navigation menu icon), click Cloud Functions.
  2. Click the Cloud Function process-invoices to open its management page.
  3. Click Edit.
  4. Click Runtime, build, connections and security settings to expand that section.
  5. Under Runtime environment variables, update the PROCESSOR_ID value to match the Invoice processor ID you created earlier.
  6. Under Runtime environment variables, update the PARSER_LOCATION value to match the region of the Invoice processor you created earlier. This will be us if you accepted the default location, otherwise eu. This parameter must be lowercase.
  7. Click Next and select .env.yaml and then update the PROCESSOR_ID and PARSER_LOCATION values again for your invoice processor.
  8. Click Deploy.

Edit environment variables for the geocode-addresses Cloud Function

Set the Cloud Function environment variables for the GeoCode data enrichment function.

  1. Click the Cloud Function geocode-addresses to open its management page.
  2. Click Edit.
  3. Click Runtime, build, connections and security settings to expand that section.
  4. Under Runtime environment variables, update the API_key value to match the API key value you created in Task 1.
  5. Click Next and select .env.yaml and then update the API_key value to match the API Key value you set in the previous step.
  6. Click Deploy.

Task 7. Test and validate the end-to-end solution

Upload test data to Cloud Storage and monitor the progress of the pipeline as the documents are processed and the extracted data is enhanced.

  1. In Cloud Shell, enter the following commands to upload sample forms to the Cloud Storage bucket that will trigger the process-invoices Cloud Function:
  export PROJECT_ID=$(gcloud config get-value core/project)
  gsutil cp gs://sureskills-lab-dev/gsp927/documentai-pipeline-demo/sample-files/* gs://${PROJECT_ID}-input-invoices/
  2. In the Cloud Console, on the Navigation menu (Navigation menu icon), click Cloud Functions.
  3. Click the Cloud Function process-invoices to open its management page.
  4. Click Logs.

  5. In the Cloud Console, on the Navigation menu (Navigation menu icon), click BigQuery.

  6. Expand your Project ID in the Explorer.

  7. Expand invoice_parser_results.

  8. Select doc_ai_extracted_entities and click Preview. You will see the form information extracted from the invoices by the form processor. You can see that the address information and the supplier name have been detected.

  9. Select geocode_details and click Preview. You will see the formatted address, latitude, and longitude for each processed invoice that contained address data Document AI was able to extract.
