Example Python code for extracting items from a table in a form (in this example, a grocery store receipt) that has been processed with an Azure Form Recognizer model.
You need to install Python and set up a Form Recognizer service on Azure.
- Instructions for requesting access to Form Recognizer and installing Python are here.
- Once you have Form Recognizer access in your Azure subscription, follow these steps to set up your service.
To train the model, collect 5 examples of the forms you want to train with and store them in Azure Blob storage. Here are tips for setting up training data.
Follow the instructions to set up your Azure Cognitive Service and create a Form Recognizer resource: Train a Form Recognizer Model. FormAnalyzer_Train.py is based on the code used for training the model, and you can use it for your own model with the following changes:
Replace the values in the sample code for:
- base_url: The region your cognitive service is deployed to
- source: The SAS URL for the blob storage with training documents (instructions in Step 3; note the date/time expiration of the link and make sure it suits your use case).
- Ocp-Apim-Subscription-Key: the key from your cognitive service
NOTE: The values to replace are all denoted with angle brackets <>. Replace both the brackets and the sample text. For example, 'Ocp-Apim-Subscription-Key': '' should become 'Ocp-Apim-Subscription-Key': '123456789123456789'
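The substitutions above can be sketched as a small helper that assembles the training request. This is a minimal sketch, not the contents of FormAnalyzer_Train.py: the `/formrecognizer/v1.0-preview/custom/train` path is an assumption based on the preview version of the Form Recognizer REST API, and `build_train_request` and every value passed to it are hypothetical. Check the path against the API version your resource exposes.

```python
import json
import urllib.request


def build_train_request(endpoint, sas_url, subscription_key):
    """Assemble (but do not send) a Form Recognizer training request."""
    # Assumed path for the v1.0 preview API; adjust for your API version.
    url = endpoint.rstrip("/") + "/formrecognizer/v1.0-preview/custom/train"
    headers = {
        "Content-Type": "application/json",
        "Ocp-Apim-Subscription-Key": subscription_key,
    }
    # The body points the service at the blob container holding the training forms.
    body = json.dumps({"source": sas_url}).encode("utf-8")
    return urllib.request.Request(url, data=body, headers=headers, method="POST")


# Made-up values for illustration; nothing is sent over the network here.
request = build_train_request(
    endpoint="https://example.invalid",
    sas_url="https://example.invalid/container?sig=placeholder",
    subscription_key="123456789123456789",
)
print(request.get_method(), request.full_url)
```

Passing the returned object to `urllib.request.urlopen` would send the request, which needs a real endpoint, SAS URL, and key.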
Once you run the code to train your model, you should get a response back that includes the ModelID for the model created. Note this down for the next step.
The next step is to use the model to extract data from a new form. The steps for getting top-level entities are documented in the next section of the quick start, Extract key-value pairs and tables from forms.
In addition to extracting the table, FormAnalyzer_ExtractColumn.py iterates over the columns in the table to extract all the entries for one of the columns, giving you a list of those entries.
An example receipt, Tesco_Receipt_Example.pdf, is included. When it is processed, the entries in the column with the header "Product" are extracted.
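Picking the model ID out of the training response could look like the sketch below. The `modelId` key name is an assumption based on the preview API's training response, and the sample response is invented for illustration, so inspect your own response if the key differs. `get_model_id` is a hypothetical helper.

```python
import json


def get_model_id(train_response_text):
    """Return the model ID from a training response, or raise if it is absent."""
    response = json.loads(train_response_text)
    # "modelId" is the assumed key name; print the full response to confirm it.
    model_id = response.get("modelId")
    if not model_id:
        raise KeyError("No 'modelId' found in the training response")
    return model_id


# Invented response for illustration only.
sample_response = '{"modelId": "00000000-0000-0000-0000-000000000000"}'
print(get_model_id(sample_response))
```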
Replace the values in the sample code for:
- base_url: The region your cognitive service is deployed to
- file_path: Your local path to Tesco_Receipt_Example.pdf (could be set up as a URL/SAS URL as well).
- model_id: The model id from the trained model (see previous section).
- Ocp-Apim-Subscription-Key: the key from your cognitive service
- (Optional) you can change the columnheader to "Quantity" or "Total" for the example Tesco receipt to extract other entries.
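The column iteration itself can be sketched as follows. This is an illustration, not the code in FormAnalyzer_ExtractColumn.py: it assumes the analysis result nests `pages`, `tables`, and `columns`, with each column carrying a `header` and a list of `entries` whose items are lists of `{"text": ...}` fragments, as in the preview API output. `extract_column` and the sample data are hypothetical.

```python
def extract_column(analyze_result, column_header):
    """Collect the text of every entry under the column with the given header."""
    entries = []
    for page in analyze_result.get("pages", []):
        for table in page.get("tables", []):
            for column in table.get("columns", []):
                # A header is a list of text fragments; join them into one string.
                header_text = " ".join(part["text"] for part in column.get("header", []))
                if header_text != column_header:
                    continue
                for entry in column.get("entries", []):
                    # Each entry (table cell) is also a list of text fragments.
                    entries.append(" ".join(part["text"] for part in entry))
    return entries


# Invented, trimmed-down result for illustration; not real receipt data.
sample_result = {
    "pages": [{
        "tables": [{
            "columns": [
                {"header": [{"text": "Product"}],
                 "entries": [[{"text": "Milk"}], [{"text": "Brown"}, {"text": "bread"}]]},
                {"header": [{"text": "Quantity"}],
                 "entries": [[{"text": "1"}], [{"text": "2"}]]},
            ]
        }]
    }]
}

columnheader = "Product"  # could also be "Quantity" or "Total"
print(extract_column(sample_result, columnheader))
```

Matching on the joined header text keeps multi-word headers usable, at the cost of requiring an exact match on spacing and case.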
Alternatively, if you don't want to train and analyze with your own model, you can use the Tesco_Receipt_Example.json file, which is the output from analyzing the example receipt with a model I trained.
FormAnalyzer_ExtractColumn_FromJSON.py is code that iterates over the JSON file of the Azure Form Recognizer results and extracts the entries in the column with the header "Product".
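When working from a saved result, it can help to see which column headers the file contains before choosing one to extract. The sketch below does that under the assumption that the saved JSON uses a `pages` / `tables` / `columns` layout with `header` lists of `{"text": ...}` fragments; `list_column_headers` is a hypothetical helper and is not taken from FormAnalyzer_ExtractColumn_FromJSON.py.

```python
import json


def list_column_headers(json_path):
    """Return the header text of every table column in a saved analysis result."""
    with open(json_path, encoding="utf-8") as f:
        result = json.load(f)
    headers = []
    for page in result.get("pages", []):
        for table in page.get("tables", []):
            for column in table.get("columns", []):
                # Join the header's text fragments into a single string.
                headers.append(" ".join(part["text"] for part in column.get("header", [])))
    return headers
```

For example, `list_column_headers("Tesco_Receipt_Example.json")` would return the headers found in the included file, with the path adjusted to wherever you saved it.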