Natural Language Processing demo for Java Developers

marekkapowicki/nlp

System specification

Build the application

The application image is built with Cloud Native Buildpacks:

./mvnw spring-boot:build-image

Run the application

A single command is enough to start the whole application:

docker-compose up

Text Extraction

OCR is performed by Apache Tika, which uses Google Tesseract under the hood. The application uses OkHttpClient to make HTTP calls to the Apache Tika server. Tika uses two strategies to extract text from a document:

  • simple text extraction - no OCR at all
  • OCR mode - Tika uses Tesseract to recognize the text

The application first tries to retrieve the text without OCR. If that fails, the OCR strategy is used to extract the text from the file.
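The fallback described above can be sketched as an ordered list of strategies where the first one that yields text wins. This is only an illustration: `FallbackTextExtractor` and the two lambdas are hypothetical names, not the project's actual classes, and it assumes that "fails" means the plain extraction returned no (or only blank) text.

```java
import java.util.List;
import java.util.Optional;
import java.util.function.Function;

// Hypothetical sketch of the "plain text first, OCR second" idea.
final class FallbackTextExtractor {

    private final List<Function<byte[], Optional<String>>> strategies;

    FallbackTextExtractor(List<Function<byte[], Optional<String>>> strategies) {
        this.strategies = List.copyOf(strategies);
    }

    // Runs the strategies in order and returns the first non-blank result.
    Optional<String> extract(byte[] document) {
        for (Function<byte[], Optional<String>> strategy : strategies) {
            Optional<String> text = strategy.apply(document).filter(t -> !t.isBlank());
            if (text.isPresent()) {
                return text;
            }
        }
        return Optional.empty();
    }

    public static void main(String[] args) {
        // Stand-ins for the two Tika calls: plain extraction finds nothing, OCR succeeds.
        Function<byte[], Optional<String>> plainText = doc -> Optional.of("   ");
        Function<byte[], Optional<String>> ocr = doc -> Optional.of("recognized text");

        FallbackTextExtractor extractor = new FallbackTextExtractor(List.of(plainText, ocr));
        System.out.println(extractor.extract(new byte[0]).orElse("<no text>"));
    }
}
```

Keeping the expensive OCR strategy last means it only runs for documents, such as scanned images, that have no embedded text layer.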

Start the server

To start the Apache Tika server, run:

docker-compose up

We can check whether the server is running by opening the tika URL on localhost.

Optical Character Recognition

OCR takes a while (about 30 seconds per PDF page). It can be tuned somewhat by adding specific request headers. Read more about it here.
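As a rough illustration of header-based tuning, the sketch below builds a request to the Tika server with the JDK's `java.net.http` API (the project itself uses OkHttpClient). It makes several assumptions: that the server listens on Tika's default port 9998, that the `X-Tika-PDFOcrStrategy` and `X-Tika-OCRLanguage` headers are supported by the Tika server version in use, and the class name `TikaOcrRequests` is hypothetical.

```java
import java.net.URI;
import java.net.http.HttpRequest;
import java.time.Duration;

// Hypothetical helper; header support depends on the Tika server version.
final class TikaOcrRequests {

    // Assumption: default Tika server port; adjust to your docker-compose mapping.
    static final URI TIKA_ENDPOINT = URI.create("http://localhost:9998/tika");

    static HttpRequest ocrRequest(byte[] document, String contentType, String tesseractLanguage) {
        return HttpRequest.newBuilder(TIKA_ENDPOINT)
                // OCR is slow, so allow a generous timeout.
                .timeout(Duration.ofMinutes(5))
                .header("Accept", "text/plain")
                .header("Content-Type", contentType)
                // Skip the embedded-text pass and go straight to OCR.
                .header("X-Tika-PDFOcrStrategy", "ocr_only")
                // Tesseract language code, e.g. "eng" or "pol".
                .header("X-Tika-OCRLanguage", tesseractLanguage)
                .PUT(HttpRequest.BodyPublishers.ofByteArray(document))
                .build();
    }
}
```

The request can then be sent with `HttpClient.newHttpClient().send(request, HttpResponse.BodyHandlers.ofString())`.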

Languages in Tesseract OCR

In the standard installation, the languages available in Tesseract are English (default), French, German, Italian, and Spanish. To add more, open a terminal in the container through Docker and run the following commands, which install the Polish language as an example. Choosing the correct text language improves the precision of character recognition.

docker exec -it tika-server-ocr /bin/bash
apt-get update
apt-get install tesseract-ocr-pol

More reading

To read more about Tika OCR, check this blog post.

Text extraction in action

To see how the OCR process works in practice, send a sample file using curl:

curl -X POST "http://localhost:8080/api/actions/extractText" -H "accept: */*" -H "Content-Type: multipart/form-data" -F "fileToOcr=@someDoc.jpg;type=image/jpeg"

You can also use the Swagger dashboard: swagger-ui

Processing document with specific language

The OCR API has a language parameter. It can be null, in which case Apache Tika tries to detect the language. The language can also be specified in the request, for instance pl or en. Here is the list of languages. Specifying the language sometimes improves the quality of the OCR result.

curl -X POST "http://localhost:8080/api/actions/extractText" -H "accept: */*" -H "Content-Type: multipart/form-data" -F "fileToOcr=@doc.pdf;type=application/pdf" -F "language=pl" 
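The same call can be made from Java. The sketch below mirrors the curl command using only the JDK's `java.net.http` API and a hand-built multipart body. The endpoint and the `fileToOcr` and `language` field names come from the curl examples; the class name `ExtractTextRequests` and the boundary format are assumptions made for illustration.

```java
import java.io.ByteArrayOutputStream;
import java.net.URI;
import java.net.http.HttpRequest;
import java.nio.charset.StandardCharsets;

// Hypothetical client-side helper mirroring the curl command above.
final class ExtractTextRequests {

    static final URI ENDPOINT = URI.create("http://localhost:8080/api/actions/extractText");

    // Builds the multipart/form-data body: the file part plus an optional language part.
    static byte[] multipartBody(String boundary, String fileName, String fileType,
                                byte[] content, String language) {
        ByteArrayOutputStream out = new ByteArrayOutputStream();
        write(out, "--" + boundary + "\r\n"
                + "Content-Disposition: form-data; name=\"fileToOcr\"; filename=\"" + fileName + "\"\r\n"
                + "Content-Type: " + fileType + "\r\n\r\n");
        out.write(content, 0, content.length);
        write(out, "\r\n");
        if (language != null) {
            // Omit this part to let the server detect the language.
            write(out, "--" + boundary + "\r\n"
                    + "Content-Disposition: form-data; name=\"language\"\r\n\r\n"
                    + language + "\r\n");
        }
        write(out, "--" + boundary + "--\r\n");
        return out.toByteArray();
    }

    private static void write(ByteArrayOutputStream out, String text) {
        byte[] bytes = text.getBytes(StandardCharsets.UTF_8);
        out.write(bytes, 0, bytes.length);
    }

    static HttpRequest extractTextRequest(String fileName, String fileType,
                                          byte[] content, String language) {
        String boundary = "nlp-demo-" + System.nanoTime();
        byte[] body = multipartBody(boundary, fileName, fileType, content, language);
        return HttpRequest.newBuilder(ENDPOINT)
                .header("Accept", "*/*")
                .header("Content-Type", "multipart/form-data; boundary=" + boundary)
                .POST(HttpRequest.BodyPublishers.ofByteArray(body))
                .build();
    }
}
```

For example, `ExtractTextRequests.extractTextRequest("doc.pdf", "application/pdf", Files.readAllBytes(Path.of("doc.pdf")), "pl")` corresponds to the second curl command; passing `null` as the language corresponds to the first.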
