BeCode (Ghent campus)
KPMG (Belgium)

2 weeks (01/06/2022 - 16/06/2022)
Creating a tool that scrapes the National Gazette and extracts the content relevant to the client.
- Scrape legal content
- Use Natural Language Processing to analyze text
- Classify tax-related documents based on their content
links.csv with the attributes "Date of release of laws", "Titles", and "url links to both French and Dutch versions of the laws".
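A minimal sketch of reading such a file with the standard library (the column names and sample row below are invented for illustration; the real links.csv may use different headers):

```python
import csv
import io

# Hypothetical sample matching the attributes described above;
# the real links.csv from the repository may differ.
sample = io.StringIO(
    "date,title,url_fr,url_nl\n"
    "2022-06-01,Wet houdende diverse fiscale bepalingen,"
    "https://example.org/fr/1,https://example.org/nl/1\n"
)

rows = list(csv.DictReader(sample))
for row in rows:
    print(row["date"], row["title"], row["url_nl"])
```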
We were tasked by KPMG Belgium with building an NLP-based prototype that automates the detection of tax-related law changes.
The steps we have taken:
- Scraping the documents from the National Gazette's Dutch urls
- Cleaning the documents to make them more readable and understandable
- Preprocessing the documents: stripping tags, numerics, punctuation and multiple whitespaces, removing stopwords, tokenization and lemmatization
- Unsupervised topic modelling with Latent Dirichlet Allocation (LDA), setting the number of topics to 10
- An NLP Streamlit app that visualises lemmas, Named Entity Recognition, word importances and text summaries
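The preprocessing step above can be sketched roughly as follows. This is an illustrative standard-library version only: the project presumably uses Gensim/spaCy filters, the stopword list here is a tiny invented sample, and the lemmatization step (done with spaCy in the project) is omitted:

```python
import re

# Tiny illustrative Dutch stopword list; the real pipeline would use a
# full list from spaCy or Gensim.
STOPWORDS = {"de", "het", "een", "van", "en", "in", "op"}

def preprocess(text: str) -> list[str]:
    """Strip tags, punctuation, numerics and extra whitespace, then
    lowercase, tokenize and drop stopwords."""
    text = re.sub(r"<[^>]+>", " ", text)         # strip HTML tags
    text = re.sub(r"[^a-zA-ZÀ-ÿ\s]", " ", text)  # drop punctuation/numerics
    text = re.sub(r"\s+", " ", text).strip()     # collapse whitespace
    tokens = text.lower().split()
    return [t for t in tokens if t not in STOPWORDS]

print(preprocess("<p>De wet van 1 juni 2022, artikel 5!</p>"))
# → ['wet', 'juni', 'artikel']
```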
We use Jupyter notebooks for:
- Functions for scraping the documents from the Dutch urls
- Functions for cleaning the scraped documents
- Functions for preprocessing the cleaned documents
- Latent Dirichlet Allocation Model (LDA)
- NLP Streamlit App
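The project fits LDA with Gensim (10 topics); to illustrate the idea behind the model in a self-contained way, here is a toy collapsed Gibbs sampler with invented documents and hyperparameters, not the project's actual code:

```python
import random
from collections import defaultdict

def lda_gibbs(docs, n_topics=2, iters=200, alpha=0.1, beta=0.01, seed=0):
    """Minimal collapsed Gibbs sampler for LDA; returns the top word of
    each topic. Purely illustrative, not the project's Gensim model."""
    rng = random.Random(seed)
    vocab = sorted({w for d in docs for w in d})
    V = len(vocab)
    ndk = [[0] * n_topics for _ in docs]               # doc-topic counts
    nkw = [defaultdict(int) for _ in range(n_topics)]  # topic-word counts
    nk = [0] * n_topics                                # tokens per topic
    z = []                                             # assignment per token
    for d, doc in enumerate(docs):
        zs = []
        for w in doc:
            k = rng.randrange(n_topics)
            zs.append(k)
            ndk[d][k] += 1; nkw[k][w] += 1; nk[k] += 1
        z.append(zs)
    for _ in range(iters):
        for d, doc in enumerate(docs):
            for i, w in enumerate(doc):
                k = z[d][i]
                ndk[d][k] -= 1; nkw[k][w] -= 1; nk[k] -= 1
                # Sample a new topic proportional to the collapsed posterior
                weights = [(ndk[d][t] + alpha) * (nkw[t][w] + beta) / (nk[t] + V * beta)
                           for t in range(n_topics)]
                k = rng.choices(range(n_topics), weights=weights)[0]
                z[d][i] = k
                ndk[d][k] += 1; nkw[k][w] += 1; nk[k] += 1
    return [max(vocab, key=lambda w: nkw[t][w]) for t in range(n_topics)]

docs = [["belasting", "btw", "belasting"], ["arbeid", "loon", "arbeid"],
        ["btw", "belasting", "btw"], ["loon", "arbeid", "loon"]]
print(lda_gibbs(docs))
```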
To open the Heroku application, follow the Heroku link.
Python / libraries:
- Numpy
- Pandas
- spacy
- Matplotlib
- sklearn
- langdetect
- BeautifulSoup
- Gensim
- Streamlit
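The scraping step itself relies on BeautifulSoup against the Gazette's Dutch urls; as a self-contained sketch using only the standard library, extracting visible text from a fetched page could look like this (the sample HTML below is invented):

```python
from html.parser import HTMLParser

class TextExtractor(HTMLParser):
    """Collects visible text, skipping <script>/<style> content."""
    def __init__(self):
        super().__init__()
        self.parts, self._skip = [], 0
    def handle_starttag(self, tag, attrs):
        if tag in ("script", "style"):
            self._skip += 1
    def handle_endtag(self, tag):
        if tag in ("script", "style") and self._skip:
            self._skip -= 1
    def handle_data(self, data):
        if not self._skip and data.strip():
            self.parts.append(data.strip())

# Offline sample; in the project the HTML is fetched from the Gazette's
# Dutch urls and parsed with BeautifulSoup instead.
html = "<html><body><h1>Staatsblad</h1><p>Wet van 1 juni 2022.</p></body></html>"
parser = TextExtractor()
parser.feed(html)
print(" ".join(parser.parts))
# → Staatsblad Wet van 1 juni 2022.
```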
│ app.py : contains the script for running the streamlit NLP app.
│ Procfile : settings for deployment of app on Heroku
│ README.md : Description of the project
│ requirements.txt : contains the libraries required for running the streamlit NLP app on Heroku.
│ setup.sh : settings for deployment of app on Heroku
│ Topic Modeling Dutch.ipynb : Notebook with LDA topic modeling, cloud words, reading Gazette url articles
├───data
│ │ links.csv : All url addresses from the National Gazette, not cleaned.
│ │ dataframe.csv : All scraped urls, content of the document, cleaned and preprocessed.
├───image : contains images used on the readme
└───utils
│ cleaning.py : Script used for cleaning the scraped documents.
│ preprocessapp.py : Script used for preprocessing the cleaned data.
│ preprocessing.py : Script used for preprocessing for the app.
│ scraping.py : Script used for scraping the documents.
│ translating_notebook.ipynb : Attempt at translating the documents to English.
Sebastián García Martínez, Pieter Van Hoefs, Moshood Owolabi, Havva Ebrahimi Pour






