System specification

Build the application

The application image is built with Cloud Native Buildpacks:

./mvnw spring-boot:build-image

Run the application

A single command is enough to start the whole application:

docker-compose up

Text Extraction

The OCR process is handled by Apache Tika, which uses Google Tesseract under the hood. The application uses OkHttpClient to make HTTP calls to the Apache Tika Server. Tika uses two strategies to extract text from a document:

simple text extraction - no OCR at all
OCR mode - Tika uses Tesseract to get the recognized text

At the beginning, the application tries to retrieve the text without OCR. If that fails, the OCR strategy is used to extract the text from the file.
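
The fallback order described above can be sketched as follows. This is only an illustration, not the application's actual code: the class name `ExtractionFallback` and the two strategy parameters are hypothetical, and it assumes that "fails" means the plain extraction returned no text or only whitespace.

```java
import java.util.Optional;
import java.util.function.Function;

final class ExtractionFallback {

    private ExtractionFallback() {
    }

    /**
     * Tries plain text extraction first and falls back to OCR.
     * Assumption: a strategy signals failure with an empty Optional,
     * and blank plain text is treated as a failure as well.
     */
    static String extract(byte[] document,
                          Function<byte[], Optional<String>> plainExtraction,
                          Function<byte[], Optional<String>> ocrExtraction) {
        return plainExtraction.apply(document)
                .map(String::trim)
                .filter(text -> !text.isEmpty())
                // OCR is invoked only when the plain strategy produced nothing usable
                .or(() -> ocrExtraction.apply(document))
                .orElse("");
    }
}
```

Because the OCR strategy is passed lazily, the slow Tesseract call is skipped whenever the document already contains a text layer.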

Start the server

To start the Apache Tika server, just run the command:

docker-compose up

We can check whether the server is running by opening the Tika URL on localhost, as seen in the following image.

Optical Character Recognition

The OCR process takes a while (~30 seconds per PDF page). We can tune it a bit by adding specific headers to the request. Read more about it here.
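
As a rough illustration of header-based tuning, the sketch below builds a request to Tika Server with two OCR-related headers. It makes several assumptions: it uses the JDK's `java.net.http` client instead of the OkHttpClient used by the application (to stay dependency-free), it assumes Tika Server's default port 9998 (check docker-compose.yml for the real mapping), and the class name `TikaOcrRequest` is hypothetical. `X-Tika-PDFOcrStrategy` and `X-Tika-OCRLanguage` are headers documented for Tika Server; verify them against the Tika version in use.

```java
import java.net.URI;
import java.net.http.HttpRequest;
import java.time.Duration;

final class TikaOcrRequest {

    // Assumption: Tika Server's default port; adjust to match docker-compose.yml
    static final URI TIKA_ENDPOINT = URI.create("http://localhost:9998/tika");

    private TikaOcrRequest() {
    }

    /** Builds a PUT request that asks Tika to OCR the document in the given Tesseract language. */
    static HttpRequest build(byte[] document, String contentType, String tesseractLanguage) {
        return HttpRequest.newBuilder(TIKA_ENDPOINT)
                // OCR is slow, so allow a generous timeout
                .timeout(Duration.ofMinutes(5))
                .header("Content-Type", contentType)
                .header("Accept", "text/plain")
                // Skip the embedded text layer and run OCR only
                .header("X-Tika-PDFOcrStrategy", "ocr_only")
                // Tesseract language code, e.g. "eng" or "pol"
                .header("X-Tika-OCRLanguage", tesseractLanguage)
                .PUT(HttpRequest.BodyPublishers.ofByteArray(document))
                .build();
    }
}
```

The request can then be sent with `HttpClient.newHttpClient().send(request, HttpResponse.BodyHandlers.ofString())`, which returns the extracted text as the response body.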

Languages in Tesseract OCR

In the standard installation, the languages available in Tesseract are English (default), French, German, Italian, and Spanish. Choosing the correct text language gives greater precision in character recognition. To add a new language, we need to open a terminal in the container through Docker and run the following commands, which install, for example, the Polish language.

docker exec -it tika-server-ocr /bin/bash
apt-get update
apt-get install tesseract-ocr-pol

Text extraction in action

To see how the OCR process works in practice, just send a sample file using curl:

curl -X POST "http://localhost:8080/api/actions/extractText" -H "accept: */*" -H "Content-Type: multipart/form-data" -F "fileToOcr=@someDoc.jpg;type=image/jpeg"

You can also use the Swagger dashboard (swagger-ui).

Processing document with specific language

The OCR API has a language parameter. It can be null, in which case Apache Tika tries to detect the language. The language can also be specified in the request, for instance pl or en. Here is the list of languages. Specifying the language sometimes improves the quality of the OCR process.

curl -X POST "http://localhost:8080/api/actions/extractText" -H "accept: */*" -H "Content-Type: multipart/form-data" -F "fileToOcr=@doc.pdf;type=application/pdf" -F "language=pl"
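
For callers that prefer Java over curl, the multipart body of the request above could be assembled roughly like this. It is a sketch under stated assumptions: the part names `fileToOcr` and `language` are taken from the curl command, the class name `ExtractTextMultipart` is hypothetical, and the language part is simply omitted when the value is null so that detection is left to Apache Tika.

```java
import java.io.ByteArrayOutputStream;
import java.nio.charset.StandardCharsets;

final class ExtractTextMultipart {

    private ExtractTextMultipart() {
    }

    /** Builds a multipart/form-data body with the file part and an optional language part. */
    static byte[] body(String boundary, String fileName, String fileType, byte[] file, String language) {
        ByteArrayOutputStream out = new ByteArrayOutputStream();
        write(out, "--" + boundary + "\r\n"
                + "Content-Disposition: form-data; name=\"fileToOcr\"; filename=\"" + fileName + "\"\r\n"
                + "Content-Type: " + fileType + "\r\n\r\n");
        out.write(file, 0, file.length);
        write(out, "\r\n");
        if (language != null) {
            // Optional part: leave it out to let the server detect the language
            write(out, "--" + boundary + "\r\n"
                    + "Content-Disposition: form-data; name=\"language\"\r\n\r\n"
                    + language + "\r\n");
        }
        // Closing delimiter ends the multipart body
        write(out, "--" + boundary + "--\r\n");
        return out.toByteArray();
    }

    private static void write(ByteArrayOutputStream out, String text) {
        byte[] bytes = text.getBytes(StandardCharsets.UTF_8);
        out.write(bytes, 0, bytes.length);
    }
}
```

The resulting bytes would be sent as a POST to the extractText endpoint with the header `Content-Type: multipart/form-data; boundary=<boundary>`, using the same boundary string that was passed to `body`.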
