The application image is built with Cloud Native Buildpacks:
./mvnw spring-boot:build-image
A single command is enough to start the whole application:
docker-compose up
OCR is performed by Apache Tika, which uses Google Tesseract under the hood. The application uses OkHttpClient to make HTTP calls to the Apache Tika Server. Tika uses two strategies to extract text from a document:
- simple text extraction - no OCR at all
- OCR mode - Tika uses Tesseract to recognize the text
The application first tries to retrieve the text without OCR. If that fails, the OCR strategy is used to extract the text from the file.
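The fallback described above can be sketched as follows. This is a minimal illustration, not the application's actual code: the class and method names are hypothetical, and it assumes that "fails" means either an exception or a blank result from the plain extraction step.

```java
import java.util.function.Function;

// Hypothetical sketch of the "plain text first, OCR second" fallback.
final class ExtractionFallback {

    /**
     * Tries plain text extraction first and falls back to OCR.
     * Assumption: a runtime exception or a blank result counts as a failure.
     */
    static String extract(byte[] file,
                          Function<byte[], String> plainExtractor,
                          Function<byte[], String> ocrExtractor) {
        String text;
        try {
            text = plainExtractor.apply(file);
        } catch (RuntimeException e) {
            text = null; // treat an error like "no text found"
        }
        if (text == null || text.isBlank()) {
            // Nothing usable without OCR, so run the slower OCR strategy.
            return ocrExtractor.apply(file);
        }
        return text;
    }
}
```

Trying the cheap strategy first means the slow OCR pass only runs for scanned documents and images that contain no embedded text.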
To start the Apache Tika server, run:
docker-compose up
We can check whether the server is running by opening the Tika URL on localhost, as shown in the following image.
The OCR process takes a while (about 30 seconds per PDF page). We can tune it a bit by adding specific request headers. Read more about it here.
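As an illustration of header-based tuning, the sketch below builds a request to Tika Server with two of its OCR headers, `X-Tika-OCRLanguage` and `X-Tika-PDFOcrStrategy`. It rests on several assumptions: the source does not say which headers the application sends, the base URL (Tika's default port 9998) and the class name are hypothetical, and the JDK's `java.net.http` client is used here instead of OkHttpClient to keep the example dependency-free.

```java
import java.net.URI;
import java.net.http.HttpRequest;
import java.time.Duration;

// Hypothetical helper that prepares an OCR request for Tika Server.
final class TikaOcrRequest {

    static HttpRequest build(String tikaBaseUrl, byte[] document, String ocrLanguage) {
        return HttpRequest.newBuilder(URI.create(tikaBaseUrl + "/tika"))
                // OCR is slow, so allow a generous timeout (value is an assumption).
                .timeout(Duration.ofMinutes(5))
                .header("Accept", "text/plain")
                // Tesseract language code, e.g. "eng" or "pol".
                .header("X-Tika-OCRLanguage", ocrLanguage)
                // Skip plain extraction and run OCR only; "no_ocr" disables OCR instead.
                .header("X-Tika-PDFOcrStrategy", "ocr_only")
                .PUT(HttpRequest.BodyPublishers.ofByteArray(document))
                .build();
    }
}
```

The request can then be sent with `HttpClient.newHttpClient().send(request, HttpResponse.BodyHandlers.ofString())`, for example against `http://localhost:9998` if the Tika container exposes its default port.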
In the standard installation, the languages available in Tesseract are English (default), French, German, Italian, and Spanish. To add new ones, we need to open a terminal in the container through Docker and run the following commands, which install, for example, the Polish language. Choosing the correct text language improves the accuracy of character recognition.
docker exec -it tika-server-ocr /bin/bash
apt-get update
apt-get install tesseract-ocr-pol
To read more about Tika OCR, check this blog post.
To see how the OCR process works in practice, send a sample file using curl:
curl -X POST "http://localhost:8080/api/actions/extractText" -H "accept: */*" -H "Content-Type: multipart/form-data" -F "fileToOcr=@someDoc.jpg;type=image/jpeg"
You can also use the Swagger dashboard: swagger-ui
The OCR API has a language parameter. It can be null, in which case Apache Tika tries to detect the language. The language can also be specified in the request, for instance pl or en. Here is the list of languages. Specifying the language sometimes improves the quality of the OCR process.
curl -X POST "http://localhost:8080/api/actions/extractText" -H "accept: */*" -H "Content-Type: multipart/form-data" -F "fileToOcr=@doc.pdf;type=application/pdf" -F "language=pl"
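For Java clients, the same multipart request can be assembled by hand. The sketch below only builds the request body. The part names `fileToOcr` and `language` come from the curl commands above, while the class name is hypothetical and the idea that omitting the `language` part is equivalent to passing null is an assumption.

```java
import java.io.ByteArrayOutputStream;
import java.nio.charset.StandardCharsets;

// Hypothetical builder for the multipart/form-data body sent to /api/actions/extractText.
final class OcrMultipartBody {

    static byte[] build(String boundary, String fileName, String contentType,
                        byte[] file, String language) {
        ByteArrayOutputStream out = new ByteArrayOutputStream();
        // File part, matching -F "fileToOcr=@...;type=..." in the curl example.
        write(out, "--" + boundary + "\r\n"
                + "Content-Disposition: form-data; name=\"fileToOcr\"; filename=\"" + fileName + "\"\r\n"
                + "Content-Type: " + contentType + "\r\n\r\n");
        out.writeBytes(file);
        write(out, "\r\n");
        // Optional language part; left out when the caller wants auto-detection.
        if (language != null) {
            write(out, "--" + boundary + "\r\n"
                    + "Content-Disposition: form-data; name=\"language\"\r\n\r\n"
                    + language + "\r\n");
        }
        // Closing boundary.
        write(out, "--" + boundary + "--\r\n");
        return out.toByteArray();
    }

    private static void write(ByteArrayOutputStream out, String text) {
        out.writeBytes(text.getBytes(StandardCharsets.UTF_8));
    }
}
```

The resulting bytes are sent as a POST body with the header `Content-Type: multipart/form-data; boundary=<boundary>`, using the same boundary string that was passed to `build`.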