BeCode (Ghent campus)
KPMG (Belgium)

2 weeks (01/06/2022 - 16/06/2022)
Creating a tool that scrapes the National Gazette and extracts the content relevant to the client.
- Scrape legal content
- Use Natural Language Processing to analyze text
- Classify tax-related documents based on their content
links.csv with the attributes "Date of release of laws", "Titles", and "url links to both French and Dutch versions of the laws".
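A minimal sketch of reading such a file with the standard library (the column names and sample row below are invented for illustration; the real links.csv may use different headers):

```python
import csv
import io

# Hypothetical sample matching the attributes described above;
# the real links.csv from the repository may differ.
sample = io.StringIO(
    "date,title,url_fr,url_nl\n"
    "2022-06-01,Wet houdende diverse fiscale bepalingen,"
    "https://example.org/fr/1,https://example.org/nl/1\n"
)

rows = list(csv.DictReader(sample))
for row in rows:
    print(row["date"], row["title"], row["url_nl"])
```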
We were tasked by KPMG Belgium with building an NLP-based prototype that automates the detection of tax-related law changes.
The steps we have taken:
- Scraping the documents from the National Gazette's Dutch urls
- Cleaning the documents to make them more readable and understandable
- Preprocessing the documents: stripping tags, numerics, punctuation and multiple whitespaces, removing stopwords, tokenization and lemmatization
- Unsupervised topic modelling with Latent Dirichlet Allocation (LDA), setting the number of topics to 10
- An NLP Streamlit app that visualises lemmas, Named Entity Recognition, word importances and text summaries
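The preprocessing step above can be sketched roughly as follows. This is an illustrative standard-library version only: the project presumably uses Gensim/spaCy filters, the stopword list here is a tiny invented sample, and the lemmatization step (done with spaCy in the project) is omitted:

```python
import re

# Tiny illustrative Dutch stopword list; the real pipeline would use a
# full list from spaCy or Gensim.
STOPWORDS = {"de", "het", "een", "van", "en", "in", "op"}

def preprocess(text: str) -> list[str]:
    """Strip tags, punctuation, numerics and extra whitespace, then
    lowercase, tokenize and drop stopwords."""
    text = re.sub(r"<[^>]+>", " ", text)         # strip HTML tags
    text = re.sub(r"[^a-zA-ZÀ-ÿ\s]", " ", text)  # drop punctuation/numerics
    text = re.sub(r"\s+", " ", text).strip()     # collapse whitespace
    tokens = text.lower().split()
    return [t for t in tokens if t not in STOPWORDS]

print(preprocess("<p>De wet van 1 juni 2022, artikel 5!</p>"))
# → ['wet', 'juni', 'artikel']
```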
We use Jupyter notebooks for:
- Functions for scraping the documents from the Dutch urls
- Functions for cleaning the scraped documents
- Functions for preprocessing the cleaned documents
- Latent Dirichlet Allocation Model (LDA)
- NLP Streamlit App
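The project fits LDA with Gensim (10 topics); to illustrate the idea behind the model in a self-contained way, here is a toy collapsed Gibbs sampler with invented documents and hyperparameters, not the project's actual code:

```python
import random
from collections import defaultdict

def lda_gibbs(docs, n_topics=2, iters=200, alpha=0.1, beta=0.01, seed=0):
    """Minimal collapsed Gibbs sampler for LDA; returns the top word of
    each topic. Purely illustrative, not the project's Gensim model."""
    rng = random.Random(seed)
    vocab = sorted({w for d in docs for w in d})
    V = len(vocab)
    ndk = [[0] * n_topics for _ in docs]               # doc-topic counts
    nkw = [defaultdict(int) for _ in range(n_topics)]  # topic-word counts
    nk = [0] * n_topics                                # tokens per topic
    z = []                                             # assignment per token
    for d, doc in enumerate(docs):
        zs = []
        for w in doc:
            k = rng.randrange(n_topics)
            zs.append(k)
            ndk[d][k] += 1; nkw[k][w] += 1; nk[k] += 1
        z.append(zs)
    for _ in range(iters):
        for d, doc in enumerate(docs):
            for i, w in enumerate(doc):
                k = z[d][i]
                ndk[d][k] -= 1; nkw[k][w] -= 1; nk[k] -= 1
                # Sample a new topic proportional to the collapsed posterior
                weights = [(ndk[d][t] + alpha) * (nkw[t][w] + beta) / (nk[t] + V * beta)
                           for t in range(n_topics)]
                k = rng.choices(range(n_topics), weights=weights)[0]
                z[d][i] = k
                ndk[d][k] += 1; nkw[k][w] += 1; nk[k] += 1
    return [max(vocab, key=lambda w: nkw[t][w]) for t in range(n_topics)]

docs = [["belasting", "btw", "belasting"], ["arbeid", "loon", "arbeid"],
        ["btw", "belasting", "btw"], ["loon", "arbeid", "loon"]]
print(lda_gibbs(docs))
```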
To open the Heroku application, follow the Heroku link.
Python / libraries:
- Numpy
- Pandas
- spacy
- Matplotlib
- sklearn
- langdetect
- BeautifulSoup
- Gensim
- Streamlit
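The scraping step itself relies on BeautifulSoup against the Gazette's Dutch urls; as a self-contained sketch using only the standard library, extracting visible text from a fetched page could look like this (the sample HTML below is invented):

```python
from html.parser import HTMLParser

class TextExtractor(HTMLParser):
    """Collects visible text, skipping <script>/<style> content."""
    def __init__(self):
        super().__init__()
        self.parts, self._skip = [], 0
    def handle_starttag(self, tag, attrs):
        if tag in ("script", "style"):
            self._skip += 1
    def handle_endtag(self, tag):
        if tag in ("script", "style") and self._skip:
            self._skip -= 1
    def handle_data(self, data):
        if not self._skip and data.strip():
            self.parts.append(data.strip())

# Offline sample; in the project the HTML is fetched from the Gazette's
# Dutch urls and parsed with BeautifulSoup instead.
html = "<html><body><h1>Staatsblad</h1><p>Wet van 1 juni 2022.</p></body></html>"
parser = TextExtractor()
parser.feed(html)
print(" ".join(parser.parts))
# → Staatsblad Wet van 1 juni 2022.
```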
│ app.py : contains the script for running the streamlit NLP app.
│ Procfile : settings for deployment of app on Heroku
│ README.md : Description of the project
│ requirements.txt : contains the libraries required for running the streamlit NLP app on Heroku.
│ setup.sh : settings for deployment of app on Heroku
│ Topic Modeling Dutch.ipynb : Notebook with LDA topic modeling, cloud words, reading Gazette url articles
├───data
│ │ links.csv : All url addresses from the National Gazette, not cleaned.
│ │ dataframe.csv : All scraped urls, content of the document, cleaned and preprocessed.
├───image : contains images used on the readme
└───utils
│ cleaning.py : Script used for cleaning the scraped documents.
│ preprocessapp.py : Script used for preprocessing the cleaned data.
│ preprocessing.py : Script used for preprocessing for the app.
│ scraping.py : Script used for scraping the documents.
│ translating_notebook.ipynb : Attempt at translating the documents to English.
Sebastián García Martínez, Pieter Van Hoefs, Moshood Owolabi, Havva Ebrahimi Pour






