diff --git a/open-machine-learning-jupyter-book/_toc.yml b/open-machine-learning-jupyter-book/_toc.yml
index 6a4456061..a09d32cd1 100644
--- a/open-machine-learning-jupyter-book/_toc.yml
+++ b/open-machine-learning-jupyter-book/_toc.yml
@@ -106,7 +106,10 @@ parts:
       - file: deep-learning/object-detection.ipynb
       - file: deep-learning/image-classification.ipynb
       - file: deep-learning/image-segmentation.ipynb
-      - file: deep-learning/nlp.ipynb
+      - file: deep-learning/nlp/nlp.ipynb
+        sections:
+        - file: deep-learning/nlp/text-preprocessing.ipynb
+        - file: deep-learning/nlp/text-representation.ipynb
       - file: deep-learning/gan.ipynb
       - file: deep-learning/difussion-model.ipynb
       - file: deep-learning/dqn.ipynb
@@ -232,6 +235,8 @@ parts:
       - file: assignments/deep-learning/object-detection/car-object-detection
       - file: assignments/deep-learning/overview/basic-classification-classify-images-of-clothing
       - file: assignments/deep-learning/nlp/getting-start-nlp-with-classification-task
+      - file: assignments/deep-learning/nlp/beginner-guide-to-text-preprocessing
+      - file: assignments/deep-learning/nlp/news-topic-classification-tasks
       - file: slides/introduction
         sections:
       - file: slides/python-programming/python-programming-introduction
diff --git a/open-machine-learning-jupyter-book/assignments/deep-learning/nlp/beginner-guide-to-text-preprocessing.ipynb b/open-machine-learning-jupyter-book/assignments/deep-learning/nlp/beginner-guide-to-text-preprocessing.ipynb
new file mode 100644
index 000000000..308414b0f
--- /dev/null
+++ b/open-machine-learning-jupyter-book/assignments/deep-learning/nlp/beginner-guide-to-text-preprocessing.ipynb
@@ -0,0 +1,613 @@
+{
+ "cells": [
+ {
+ "cell_type": "markdown",
+ "metadata": {},
+ "source": [
+ "# Beginner’s Guide to Text Pre-Processing"
+ ]
+ },
+ {
+ "cell_type": "markdown",
+ "metadata": {},
+ "source": [
+ "Natural Language Processing (NLP) is a subdomain of Artificial Intelligence that deals with processing natural language data like text and speech. We can also describe NLP as the “art of extracting information from text”. There has been a lot of activity in this field recently, with impressive research coming out almost every day. The truly revolutionary work, though, was the “Transformer”, which opened up avenues to build massive deep learning models that come very close to human-level performance on tasks like summarization and question answering. Then came the GPTs and BERTs: massive models with billions of parameters, trained on huge datasets, which can be fine-tuned for a wide variety of NLP tasks and problem statements.\n",
+ "\n",
+ "Deep down at the roots of building a robust NLP model, Text Processing plays a very important role. This might not be very evident in recent models like BERT and GPT, but it is one of the most elementary processes in Natural Language Processing. Every NLP researcher and enthusiast will have done Text Processing more often than not while solving problems in this domain. For a beginner, Text Processing is a fundamental concept to nail down before setting sights on more advanced problems. This brings us to the question: why Text Pre-processing?"
+ ]
+ },
+ {
+ "cell_type": "markdown",
+ "metadata": {},
+ "source": [
+ "## Why Text Pre-processing?"
+ ]
+ },
+ {
+ "cell_type": "markdown",
+ "metadata": {},
+ "source": [
+ "Text Pre-processing is important because language models are quite complex, largely due to grammar rules. Unnecessary data in non-processed datasets will only add to ambiguity, increase computation requirements, and impact the accuracy of the model to a considerable extent.\n",
+ "\n",
+ "Moreover, we have to transform the text into vectors/numbers that can be ingested by machines. This process is called an Encoding Technique, and there are many such techniques: CountVectorizer, Tf-Idf Vectorizer, Bag of Words, Word2Vec, GloVe, etc. Popularly this process is also known as Text Representation, and it comes after Text Pre-processing. We shall look into these techniques in a later chapter 🙂\n",
+ "\n",
+ "Coming back to Text Pre-processing, let us look into a few popular Text Pre-processing methods in NLP."
+ ]
+ },
+ {
+ "cell_type": "markdown",
+ "metadata": {},
+ "source": [
+ "## Downloading Packages"
+ ]
+ },
+ {
+ "cell_type": "markdown",
+ "metadata": {},
+ "source": [
+ "We will use the most popular library for processing textual data, NLTK (Natural Language Toolkit). On top of downloading and loading the base NLTK library, we have to download a few additional resource files for our pre-processing techniques. The code is shown below:"
+ ]
+ },
+ {
+ "cell_type": "code",
+ "execution_count": null,
+ "metadata": {},
+ "outputs": [],
+ "source": [
+ "!pip install nltk\n",
+ "\n",
+ "import nltk\n",
+ "nltk.download('punkt')\n",
+ "nltk.download('wordnet')\n",
+ "nltk.download('stopwords')"
+ ]
+ },
+ {
+ "cell_type": "markdown",
+ "metadata": {},
+ "source": [
+ "Note: All code examples can be executed on Colab exactly as they are shown here.\n",
+ "\n",
+ "Once that is done, we can start with the different pre-processing activities, a few of which are listed below. At the end, we will bundle all of these pre-processing techniques into functions, making them very easy to use and to chain into a sequence with other pre-processing steps."
+ ]
+ },
+ {
+ "cell_type": "markdown",
+ "metadata": {},
+ "source": [
+ "## Removing Accented Characters"
+ ]
+ },
+ {
+ "cell_type": "markdown",
+ "metadata": {},
+ "source": [
+ "Our first pre-processing technique involves normalizing accented characters like é, â, etc. The accents add no extra meaning for our purposes, so we can use the unicodedata library to replace accented characters with their plain ASCII equivalents."
+ ]
+ },
+ {
+ "cell_type": "code",
+ "execution_count": 2,
+ "metadata": {},
+ "outputs": [
+ {
+ "data": {
+ "text/plain": [
+ "'cafe'"
+ ]
+ },
+ "execution_count": 2,
+ "metadata": {},
+ "output_type": "execute_result"
+ }
+ ],
+ "source": [
+ "import unicodedata\n",
+ "\n",
+ "def remove_accented_chars(text):\n",
+ " text = unicodedata.normalize('NFKD', text).encode('ascii', 'ignore').decode('utf-8', 'ignore')\n",
+ " \n",
+ " return text\n",
+ "\n",
+ "remove_accented_chars('résumé')\n",
+ "# Result - 'resume'\n",
+ "\n",
+ "remove_accented_chars('café')\n",
+ "# Result - 'cafe'"
+ ]
+ },
+ {
+ "cell_type": "markdown",
+ "metadata": {},
+ "source": [
+ "## Removing Special & Non-Alphanumeric Characters"
+ ]
+ },
+ {
+ "cell_type": "markdown",
+ "metadata": {},
+ "source": [
+ "The next step is to take care of special symbols and non-alphanumeric characters like #, @, $, etc. We can remove these characters easily using regular expressions. Since removing numbers is not always desirable, the function below takes a flag that controls whether digits are dropped as well."
+ ]
+ },
+ {
+ "cell_type": "code",
+ "execution_count": 3,
+ "metadata": {},
+ "outputs": [
+ {
+ "data": {
+ "text/plain": [
+ "'ElonMusk is revolutionizing the Space industry especially the aspect of Reusable rockets'"
+ ]
+ },
+ "execution_count": 3,
+ "metadata": {},
+ "output_type": "execute_result"
+ }
+ ],
+ "source": [
+ "import re\n",
+ "\n",
+ "def remove_special_characters(text, remove_digits=True):\n",
+ " # Keep only letters (and, if remove_digits is False, digits) plus whitespace\n",
+ " pattern = r'[^a-zA-Z\\s]' if remove_digits else r'[^a-zA-Z0-9\\s]'\n",
+ " text = re.sub(pattern, '', text)\n",
+ " return text\n",
+ "\n",
+ "remove_special_characters('The brown fox is quick and the blue dog is lazy!')\n",
+ "# Result - The brown fox is quick and the blue dog is lazy\n",
+ "\n",
+ "remove_special_characters('@ElonMusk is revolutionizing the Space industry, especially the aspect of Reusable rockets!!!')\n",
+ "# Result - ElonMusk is revolutionizing the Space industry especially the aspect of Reusable rockets"
+ ]
+ },
+ {
+ "cell_type": "markdown",
+ "metadata": {},
+ "source": [
+ "Note: Removing numbers may or may not be appropriate, depending on the dataset and the problem statement. Pass remove_digits=False above to keep them."
+ ]
+ },
+ {
+ "cell_type": "markdown",
+ "metadata": {},
+ "source": [
+ "## Converting to Lowercase"
+ ]
+ },
+ {
+ "cell_type": "markdown",
+ "metadata": {},
+ "source": [
+ "This is an important and practically compulsory step in pre-processing text. Consider the words “Banana” and “banana”: both convey the same meaning, but they are represented differently and are treated as unique words by the encoder (which converts text to vectors), as the quick check below illustrates. To combat this, we can simply convert the entire corpus to lower case, which makes sure every word or token (in NLP jargon) is in the same configuration, making it easier to process and represent effectively.\n",
+ "\n",
+ "We can achieve this by simply using the lower() method on the string, and we further use the strip() method to remove leading and trailing whitespace."
+ ]
+ },
+ {
+ "cell_type": "code",
+ "execution_count": 4,
+ "metadata": {},
+ "outputs": [
+ {
+ "data": {
+ "text/plain": [
+ "'hi there, how are you?'"
+ ]
+ },
+ "execution_count": 4,
+ "metadata": {},
+ "output_type": "execute_result"
+ }
+ ],
+ "source": [
+ "def to_lower(text):\n",
+ " return text.lower().strip()\n",
+ "\n",
+ "to_lower('Hi there, How are you?')\n",
+ "# Result - hi there, how are you?"
+ ]
+ },
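+ {
+ "cell_type": "markdown",
+ "metadata": {},
+ "source": [
+ "As a quick, purely illustrative check of the claim above (this cell is not part of the original tutorial): a simple frequency-based encoder counts “Banana” and “banana” as two different tokens until we lowercase them."
+ ]
+ },
+ {
+ "cell_type": "code",
+ "execution_count": null,
+ "metadata": {},
+ "outputs": [],
+ "source": [
+ "from collections import Counter\n",
+ "\n",
+ "tokens = ['Banana', 'banana', 'BANANA']\n",
+ "\n",
+ "# Without lower-casing, each spelling is counted as a separate word\n",
+ "print(Counter(tokens))\n",
+ "# Expected - Counter({'Banana': 1, 'banana': 1, 'BANANA': 1})\n",
+ "\n",
+ "# After lower-casing, all three collapse into one token\n",
+ "print(Counter(token.lower() for token in tokens))\n",
+ "# Expected - Counter({'banana': 3})"
+ ]
+ },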
+ {
+ "cell_type": "markdown",
+ "metadata": {},
+ "source": [
+ "## Removing Punctuation"
+ ]
+ },
+ {
+ "cell_type": "markdown",
+ "metadata": {},
+ "source": [
+ "Punctuation adds weight to the corpus, even though it is important in conveying the semantic meaning of a sentence. Still, removing it is a standard pre-processing technique: advanced encoding techniques like word embeddings (covered in a later chapter) can model the corpus without any punctuation."
+ ]
+ },
+ {
+ "cell_type": "code",
+ "execution_count": 26,
+ "metadata": {},
+ "outputs": [
+ {
+ "data": {
+ "text/plain": [
+ "'We were though we had rushed to get there late for the film Thank you I said'"
+ ]
+ },
+ "execution_count": 26,
+ "metadata": {},
+ "output_type": "execute_result"
+ }
+ ],
+ "source": [
+ "import string\n",
+ "\n",
+ "def remove_p(text):\n",
+ "\n",
+ " # string.punctuation only covers ASCII punctuation\n",
+ " text = text.translate(str.maketrans('', '', string.punctuation))\n",
+ " # Curly quotes and the ellipsis are not in string.punctuation, so strip them separately\n",
+ " text = re.sub('[‘’“”…]', '', text)\n",
+ " text = re.sub('\\n', '', text)\n",
+ "\n",
+ " return text\n",
+ "\n",
+ "remove_p('We were , though we had rushed to get there, late for the film. “Thank you”, I said')\n",
+ "# Result - We were though we had rushed to get there late for the film Thank you I said"
+ ]
+ },
+ {
+ "cell_type": "markdown",
+ "metadata": {},
+ "source": [
+ "Note: Punctuation was not removed for the more advanced GPTs & BERTs, as those models are powerful enough to process and model sentences as-is, without any pre-processing."
+ ]
+ },
+ {
+ "cell_type": "markdown",
+ "metadata": {},
+ "source": [
+ "## Tokenization"
+ ]
+ },
+ {
+ "cell_type": "markdown",
+ "metadata": {},
+ "source": [
+ "This is a small step which converts a sentence into tokens or words. If the input is a string (a sentence), the output will be the list of words/tokens in that sentence. A common definition of a token is “a sequence of characters which are grouped together as a useful semantic unit for analysis”. To put it simply, tokens are the smallest meaningful entities of a sentence. Here, we use NLTK's word_tokenize() function. Tokenization is required before applying the next steps: Stopword Removal, Stemming and Lemmatization."
+ ]
+ },
+ {
+ "cell_type": "code",
+ "execution_count": 14,
+ "metadata": {},
+ "outputs": [
+ {
+ "data": {
+ "text/plain": [
+ "['She', 'sells', 'sea', 'shells', 'on', 'the', 'sea', 'shore']"
+ ]
+ },
+ "execution_count": 14,
+ "metadata": {},
+ "output_type": "execute_result"
+ }
+ ],
+ "source": [
+ "import nltk\n",
+ "\n",
+ "def tokenization(text):\n",
+ " tokens = nltk.word_tokenize(text)\n",
+ " return tokens\n",
+ "\n",
+ "tokenization('She sells sea shells on the sea shore')\n",
+ "# Result - ['She','sells','sea','shells','on','the','sea','shore']"
+ ]
+ },
+ {
+ "cell_type": "markdown",
+ "metadata": {},
+ "source": [
+ "There is an alternative to nltk.word_tokenize, namely TensorFlow’s text_to_word_sequence. Note that the output is not quite the same: text_to_word_sequence also lowercases the text and filters punctuation by default."
+ ]
+ },
+ {
+ "cell_type": "code",
+ "execution_count": 15,
+ "metadata": {},
+ "outputs": [
+ {
+ "data": {
+ "text/plain": [
+ "['she', 'sells', 'sea', 'shells', 'on', 'the', 'sea', 'shore']"
+ ]
+ },
+ "execution_count": 15,
+ "metadata": {},
+ "output_type": "execute_result"
+ }
+ ],
+ "source": [
+ "from tensorflow.keras.preprocessing.text import text_to_word_sequence\n",
+ "\n",
+ "text_to_word_sequence('She sells sea shells on the sea shore')\n",
+ "# Result - ['she', 'sells', 'sea', 'shells', 'on', 'the', 'sea', 'shore']"
+ ]
+ },
+ {
+ "cell_type": "markdown",
+ "metadata": {},
+ "source": [
+ "## Stopword Removal"
+ ]
+ },
+ {
+ "cell_type": "markdown",
+ "metadata": {},
+ "source": [
+ "Stopwords are the most common words, like “I”, “am”, “there”, “where”, etc. They usually don't help in certain NLP tasks and are best removed to save computation and time; removing them was the common methodology in earlier days. In the age of GPT and BERT, however, we don't usually remove the stopwords."
+ ]
+ },
+ {
+ "cell_type": "code",
+ "execution_count": 16,
+ "metadata": {},
+ "outputs": [
+ {
+ "data": {
+ "text/plain": [
+ "['brown', 'fox', 'quick', 'blue', 'dog', 'lazy']"
+ ]
+ },
+ "execution_count": 16,
+ "metadata": {},
+ "output_type": "execute_result"
+ }
+ ],
+ "source": [
+ "from nltk.corpus import stopwords\n",
+ "\n",
+ "STOPWORDS = stopwords.words('english')\n",
+ "def remove_stopwords(tokens):\n",
+ "\n",
+ " filtered_tokens = [token for token in tokens if token not in STOPWORDS]\n",
+ " return filtered_tokens\n",
+ "\n",
+ "remove_stopwords(['the', 'brown', 'fox', 'is', 'quick', 'and', 'the', 'blue', 'dog', 'is', 'lazy'])\n",
+ "# Result - ['brown', 'fox', 'quick', 'blue', 'dog', 'lazy']\n",
+ "\n",
+ "# We can print all the stopwords in NLTK's English list with print(stopwords.words('english'))\n",
+ "\n",
+ "# STOPWORDS is a plain Python list, so we can tailor it to a custom scenario\n",
+ "# with the usual list methods, e.g. STOPWORDS.remove('not') or STOPWORDS.append('etc')"
+ ]
+ },
+ {
+ "cell_type": "markdown",
+ "metadata": {},
+ "source": [
+ "Note: You can try out both approaches for creating the corpus, removing stopwords and retaining them, and compare the end results."
+ ]
+ },
+ {
+ "cell_type": "markdown",
+ "metadata": {},
+ "source": [
+ "## Stemming"
+ ]
+ },
+ {
+ "cell_type": "markdown",
+ "metadata": {},
+ "source": [
+ "Stemming is a process of reducing a given token/word to its root form. For example, the words likely, likes, liked and liking are all reduced to their root form, i.e. like. Stemming uses a crude heuristic process that chops off the ends of words in the hope of correctly transforming them into their root form. So the words “trouble”, “troubled” and “troubles” might actually be converted to “troubl” instead of “trouble”, because the ends were simply chopped off!\n",
+ "\n",
+ "Stemming is an optional step, and the best way to find out whether it is effective is to experiment and observe the results before and after stemming. NLTK defines two main stemmers, PorterStemmer & SnowballStemmer; the details are in the NLTK documentation. In our examples, we will be using PorterStemmer."
+ ]
+ },
+ {
+ "cell_type": "code",
+ "execution_count": 17,
+ "metadata": {},
+ "outputs": [
+ {
+ "data": {
+ "text/plain": [
+ "['welcom', 'fairli', 'easili']"
+ ]
+ },
+ "execution_count": 17,
+ "metadata": {},
+ "output_type": "execute_result"
+ }
+ ],
+ "source": [
+ "from nltk.stem import PorterStemmer\n",
+ "\n",
+ "ps = PorterStemmer()\n",
+ "def stem(words):\n",
+ " stemmed_tokens = [ps.stem(word) for word in words]\n",
+ " return stemmed_tokens\n",
+ "\n",
+ "stem(['brown', 'fox', 'quick', 'blue', 'dog', 'lazy'])\n",
+ "# Result - ['brown', 'fox', 'quick', 'blue', 'dog', 'lazi']\n",
+ "\n",
+ "stem(['welcome', 'fairly', 'easily'])\n",
+ "# Result - ['welcom', 'fairli', 'easili']"
+ ]
+ },
+ {
+ "cell_type": "markdown",
+ "metadata": {},
+ "source": [
+ "## Lemmatization"
+ ]
+ },
+ {
+ "cell_type": "markdown",
+ "metadata": {},
+ "source": [
+ "A widely used step after lower-casing and the removal of stopwords is Lemmatization. It is similar to stemming, but instead of chopping off word endings it transforms each word to its actual root form based on a dictionary called WordNet (see the NLTK documentation for details). Since it has to look up a dictionary, it is slightly slower than stemming. The result retains the semantic meaning, which is usually not the case with stemming (most of the time the stemmed word is not a semantically valid word: lazy becomes lazi after stemming!). Note that NLTK's lemmatizer treats every token as a noun by default; given a part-of-speech hint such as pos='a', the token “better” is transformed into “good”, but with the default setting it is left unchanged, as the example below shows."
+ ]
+ },
+ {
+ "cell_type": "code",
+ "execution_count": 18,
+ "metadata": {},
+ "outputs": [
+ {
+ "data": {
+ "text/plain": [
+ "['welcome', 'fairly', 'better', 'goose', 'goose']"
+ ]
+ },
+ "execution_count": 18,
+ "metadata": {},
+ "output_type": "execute_result"
+ }
+ ],
+ "source": [
+ "from nltk.stem.wordnet import WordNetLemmatizer\n",
+ "\n",
+ "lemmatizer = WordNetLemmatizer()\n",
+ "def lemmatize(words):\n",
+ " # lemmatize() defaults to pos='n' (noun); try lemmatizer.lemmatize('better', pos='a') -> 'good'\n",
+ " lemmatized_tokens = [lemmatizer.lemmatize(word) for word in words]\n",
+ " return lemmatized_tokens\n",
+ "\n",
+ "lemmatize(['welcome', 'fairly', 'better', 'goose', 'geese'])\n",
+ "# Result - ['welcome', 'fairly', 'better', 'goose', 'goose']"
+ ]
+ },
+ {
+ "cell_type": "markdown",
+ "metadata": {},
+ "source": [
+ "## Putting it all together"
+ ]
+ },
+ {
+ "cell_type": "markdown",
+ "metadata": {},
+ "source": [
+ "Now that we have defined functions for all the pre-processing steps, let us call them and observe the results. We can also create a pipe of function calls in a specific order for processing. This is also termed the pre-processing pipeline."
+ ]
+ },
+ {
+ "cell_type": "code",
+ "execution_count": 35,
+ "metadata": {},
+ "outputs": [
+ {
+ "data": {
+ "text/plain": [
+ "'brown fox quick blue dog lazy'"
+ ]
+ },
+ "execution_count": 35,
+ "metadata": {},
+ "output_type": "execute_result"
+ }
+ ],
+ "source": [
+ "sentence = 'The brown fox is quick and the blue dog is lazy!'\n",
+ "\n",
+ "# REMOVING ACCENTED CHARACTERS\n",
+ "remove_accented_chars(sentence)\n",
+ "# Result - The brown fox is quick and the blue dog is lazy!\n",
+ "\n",
+ "# REMOVING SPECIAL CHARACTERS\n",
+ "remove_special_characters(sentence)\n",
+ "# Result - The brown fox is quick and the blue dog is lazy\n",
+ "\n",
+ "# CONVERTING TO LOWER CASE\n",
+ "# Pipeline involving removal of special chars and then lower casing\n",
+ "to_lower(remove_special_characters(sentence))\n",
+ "# Result - the brown fox is quick and the blue dog is lazy\n",
+ "\n",
+ "# REMOVING PUNCTUATION\n",
+ "remove_p(to_lower(remove_special_characters(sentence)))\n",
+ "# Result - the brown fox is quick and the blue dog is lazy\n",
+ "\n",
+ "# TOKENIZATION\n",
+ "text_tokens = tokenization(remove_p(to_lower(remove_special_characters(sentence))))\n",
+ "# Result - ['the', 'brown', 'fox', 'is', 'quick', 'and', 'the', 'blue', 'dog', 'is', 'lazy']\n",
+ "\n",
+ "# REMOVAL OF STOPWORDS\n",
+ "filtered_tokens = remove_stopwords(text_tokens)\n",
+ "# Result - ['brown', 'fox', 'quick', 'blue', 'dog', 'lazy']\n",
+ "\n",
+ "# STEMMING\n",
+ "stem(filtered_tokens)\n",
+ "# Result - ['brown', 'fox', 'quick', 'blue', 'dog', 'lazi']\n",
+ "\n",
+ "# LEMMATIZATION\n",
+ "lemmatize(filtered_tokens)\n",
+ "# Result - ['brown', 'fox', 'quick', 'blue', 'dog', 'lazy']\n",
+ "\n",
+ "# REFACTORING THE CORPUS\n",
+ "def refactor(words):\n",
+ " return ' '.join(words)\n",
+ "refactor(lemmatize(filtered_tokens))\n",
+ "# Result - 'brown fox quick blue dog lazy'\n",
+ "\n",
+ "# ONE PIPELINE FOR ALL STEPS\n",
+ "refactor(lemmatize(remove_stopwords(tokenization(remove_p(to_lower(remove_special_characters('The brown fox is quick and the blue dog is lazy!')))))))\n",
+ "# Result - 'brown fox quick blue dog lazy'"
+ ]
+ },
+ {
+ "cell_type": "markdown",
+ "metadata": {},
+ "source": [
+ "Feel free to experiment with the stemming, lemmatization and stopword-removal aspects of the pipeline. All of the functions above can also be bundled into a single Python file for reuse."
+ ]
+ },
+ {
+ "cell_type": "markdown",
+ "metadata": {},
+ "source": [
+ "## Conclusion"
+ ]
+ },
+ {
+ "cell_type": "markdown",
+ "metadata": {},
+ "source": [
+ "We have covered a few of the most popular text pre-processing steps in NLP in this post. There are a few more advanced concepts, like bi-gram/tri-gram filtering (see the short sketch below), correcting spelling mistakes, expanding abbreviations, etc. Feel free to explore these methods as well. One more thing to note, which has surfaced in recent years, is that pre-processing can actually hamper the performance of deep NLP models. BERT and GPT do not employ rigorous pre-processing steps either, which might induce the thought: “was learning these techniques a waste of time?” Definitely not! These techniques are building blocks of NLP and should be known by any beginner starting out in the field.\n",
+ "\n",
+ "Try these techniques on your custom data and observe how pre-processing can help in building a very good text corpus, which can later be employed for training deep learning models for NLP tasks. In the next chapter, we will move to the next step of representing the corpus as a vector, commonly known as Text Encoding."
+ ]
+ },
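+ {
+ "cell_type": "markdown",
+ "metadata": {},
+ "source": [
+ "As a quick taste of one of those extra concepts, here is a short, purely illustrative sketch (not part of the original tutorial) that builds bi-grams and tri-grams with nltk.ngrams, reusing the kind of cleaned token list produced by the pipeline above."
+ ]
+ },
+ {
+ "cell_type": "code",
+ "execution_count": null,
+ "metadata": {},
+ "outputs": [],
+ "source": [
+ "from nltk import ngrams\n",
+ "\n",
+ "def build_ngrams(tokens, n=2):\n",
+ " # nltk.ngrams yields sliding windows of n consecutive tokens\n",
+ " return [' '.join(gram) for gram in ngrams(tokens, n)]\n",
+ "\n",
+ "build_ngrams(['brown', 'fox', 'quick', 'blue', 'dog', 'lazy'], n=2)\n",
+ "# Expected - ['brown fox', 'fox quick', 'quick blue', 'blue dog', 'dog lazy']\n",
+ "\n",
+ "build_ngrams(['brown', 'fox', 'quick', 'blue', 'dog', 'lazy'], n=3)\n",
+ "# Expected - ['brown fox quick', 'fox quick blue', 'quick blue dog', 'blue dog lazy']"
+ ]
+ },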
+ {
+ "cell_type": "markdown",
+ "metadata": {},
+ "source": [
+ "## Acknowledgements\n",
+ "\n",
+ "Thanks to [Pranav Raikote](https://twitter.com/A6Singularity) for creating [NLP Tutorials – Part 1: Beginner’s Guide to Text Pre-Processing](https://appliedsingularity.com/2021/12/28/nlp-tutorials-part-1-beginners-guide-to-text-pre-processing/). It inspires the majority of the content in this chapter."
+ ]
+ }
+ ],
+ "metadata": {
+ "kernelspec": {
+ "display_name": "Python 3 (ipykernel)",
+ "language": "python",
+ "name": "python3"
+ },
+ "language_info": {
+ "codemirror_mode": {
+ "name": "ipython",
+ "version": 3
+ },
+ "file_extension": ".py",
+ "mimetype": "text/x-python",
+ "name": "python",
+ "nbconvert_exporter": "python",
+ "pygments_lexer": "ipython3",
+ "version": "3.9.13"
+ }
+ },
+ "nbformat": 4,
+ "nbformat_minor": 4
+}
diff --git a/open-machine-learning-jupyter-book/assignments/deep-learning/nlp/news-topic-classification-tasks.ipynb b/open-machine-learning-jupyter-book/assignments/deep-learning/nlp/news-topic-classification-tasks.ipynb
new file mode 100644
index 000000000..b7f50f462
--- /dev/null
+++ b/open-machine-learning-jupyter-book/assignments/deep-learning/nlp/news-topic-classification-tasks.ipynb
@@ -0,0 +1,456 @@
+{
+ "cells": [
+ {
+ "cell_type": "markdown",
+ "metadata": {},
+ "source": [
+ "# News topic classification tasks"
+ ]
+ },
+ {
+ "cell_type": "markdown",
+ "metadata": {},
+ "source": [
+ "## Packages"
+ ]
+ },
+ {
+ "cell_type": "code",
+ "execution_count": null,
+ "metadata": {},
+ "outputs": [],
+ "source": [
+ "import torch\n",
+ "import torchtext\n",
+ "import os\n",
+ "from keras.preprocessing.text import Tokenizer\n",
+ "from keras_preprocessing import sequence\n",
+ "import string\n",
+ "import re\n",
+ "import numpy as np\n",
+ "import torch.nn as nn\n",
+ "import torch.nn.functional as F\n",
+ "from torch.utils.data.dataset import random_split\n",
+ "import time\n",
+ "from torch.utils.data import DataLoader"
+ ]
+ },
+ {
+ "cell_type": "markdown",
+ "metadata": {},
+ "source": [
+ "## Download the data"
+ ]
+ },
+ {
+ "cell_type": "markdown",
+ "metadata": {},
+ "source": [
+ "We're going to use torchtext.datasets.AG_NEWS, a news dataset with four topic classes."
+ ]
+ },
+ {
+ "cell_type": "code",
+ "execution_count": null,
+ "metadata": {},
+ "outputs": [],
+ "source": [
+ "load_data_path = \"../data\"\n",
+ " \n",
+ "if not os.path.isdir(load_data_path):\n",
+ " os.mkdir(load_data_path)\n",
+ " \n",
+ "train_dataset, test_dataset = torchtext.datasets.AG_NEWS(\n",
+ " root='../data/', split=('train', 'test'))"
+ ]
+ },
+ {
+ "cell_type": "markdown",
+ "metadata": {},
+ "source": [
+ "Let's have a look at the data."
+ ]
+ },
+ {
+ "cell_type": "code",
+ "execution_count": null,
+ "metadata": {},
+ "outputs": [],
+ "source": [
+ "print(list(train_dataset)[:3])"
+ ]
+ },
+ {
+ "cell_type": "markdown",
+ "metadata": {},
+ "source": [
+ "The training set has 120,000 samples, and the labels take four values: 1, 2, 3 and 4. The label distribution in both the training set and the test set is fairly uniform."
+ ]
+ },
+ {
+ "cell_type": "markdown",
+ "metadata": {},
+ "source": [
+ "## Work with datasets"
+ ]
+ },
+ {
+ "cell_type": "markdown",
+ "metadata": {},
+ "source": [
+ "The preprocessing function below does three things: \n",
+ "1. Replace the stray backslashes in the raw text with a space (so the words on both sides of them become two separate words), strip punctuation, and convert all letters to lowercase. \n",
+ "2. Map the labels from [1, 4] to [0, 3]. \n",
+ "3. Truncate (or pad) every sentence to a fixed length; the quick length check below motivates the value we pick."
+ ]
+ },
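+ {
+ "cell_type": "markdown",
+ "metadata": {},
+ "source": [
+ "Before fixing the truncation length, it is worth checking the sentence-length distribution. The cell below is an illustrative sketch (not part of the original tutorial): it assumes train_dataset is still the raw iterable of (label, text) pairs loaded above, so re-run the download cell afterwards if the iterator has been consumed."
+ ]
+ },
+ {
+ "cell_type": "code",
+ "execution_count": null,
+ "metadata": {},
+ "outputs": [],
+ "source": [
+ "# Token counts per news text, using a simple whitespace split as an approximation\n",
+ "lengths = np.array([len(text.split()) for _, text in train_dataset])\n",
+ "\n",
+ "print(\"median length:\", np.median(lengths))\n",
+ "print(\"share of texts with <= 50 tokens:\", np.mean(lengths <= 50))"
+ ]
+ },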
+ {
+ "cell_type": "markdown",
+ "metadata": {},
+ "source": [
+ "**The sentence lengths in the sample were analyzed: more than 90% of the sentences are no longer than 50 tokens, so sentences are subsequently truncated to 50 words.**"
+ ]
+ },
+ {
+ "cell_type": "code",
+ "execution_count": null,
+ "metadata": {},
+ "outputs": [],
+ "source": [
+ "punct = str.maketrans('','',string.punctuation)\n",
+ "def process_datasets_by_Tokenizer(train_dataset, test_dataset, seq_len=200):\n",
+ " \"\"\"\n",
+ " Parameters:\n",
+ " train_dataset / test_dataset: lists of (label, text) samples, List[Tuple[int, str]]\n",
+ " Returns:\n",
+ " train_dataset / test_dataset: lists of (sequence, label) samples, List[Tuple[Tensor, int]],\n",
+ " plus the vocabulary size and the number of classes\n",
+ " \"\"\"\n",
+ " tokenizer = Tokenizer()\n",
+ " train_dataset_texts, train_dataset_labels = [], []\n",
+ " test_dataset_texts, test_dataset_labels = [], []\n",
+ " \n",
+ " for label, text in train_dataset:\n",
+ " # The earlier print shows stray \"\\\\\" sequences; replace them with a space, strip punctuation and lowercase\n",
+ " train_dataset_texts.append(text.replace('\\\\',' ').translate(punct).lower())\n",
+ " train_dataset_labels.append(label - 1) # Map labels to [0,3]\n",
+ " \n",
+ " for label, text in test_dataset:\n",
+ " test_dataset_texts.append(text.replace('\\\\',' ').translate(punct).lower())\n",
+ " test_dataset_labels.append(label - 1)\n",
+ " \n",
+ " # Trick: fit the vocabulary on the training and test texts together, so there are no out-of-vocabulary words\n",
+ " all_dataset_texts = train_dataset_texts + test_dataset_texts\n",
+ " all_dataset_labels = train_dataset_labels + test_dataset_labels\n",
+ " tokenizer.fit_on_texts(all_dataset_texts)\n",
+ " \n",
+ " # train_dataset_seqs is a list in which each element is a list representing a sentence as vocabulary indices instead of words\n",
+ " train_dataset_seqs = tokenizer.texts_to_sequences(train_dataset_texts)\n",
+ " test_dataset_seqs = tokenizer.texts_to_sequences(test_dataset_texts)\n",
+ " #print(type(train_dataset_seqs), type(train_dataset_seqs[0]))\n",
+ " #print(train_dataset_seqs)\n",
+ " \n",
+ " # Keep the first seq_len tokens and pad shorter sequences with trailing zeros\n",
+ " # train_dataset_seqs becomes a tensor of size (number of samples, seq_len)\n",
+ " train_dataset_seqs = torch.tensor(sequence.pad_sequences(\n",
+ " train_dataset_seqs, seq_len, padding='post'), dtype=torch.int32)\n",
+ " test_dataset_seqs = torch.tensor(sequence.pad_sequences(\n",
+ " test_dataset_seqs, seq_len, padding='post'), dtype=torch.int32)\n",
+ " #print(type(train_dataset_seqs), type(train_dataset_seqs[0]))\n",
+ " #print(train_dataset_seqs)\n",
+ " \n",
+ " train_dataset = list(zip(train_dataset_seqs, train_dataset_labels))\n",
+ " test_dataset = list(zip(test_dataset_seqs, test_dataset_labels))\n",
+ " \n",
+ " vocab_size = len(tokenizer.index_word.keys())\n",
+ " num_class = len(set(all_dataset_labels))\n",
+ " return train_dataset, test_dataset, vocab_size, num_class\n",
+ " \n",
+ " \n",
+ "embed_dim = 16 # There are roughly 90,000 words, and the embedding dimension here is 16\n",
+ "batch_size = 64\n",
+ "seq_len = 50 # A sentence length of 50 covers more than 90% of the samples\n",
+ " \n",
+ "train_dataset, test_dataset, vocab_size, num_class = process_datasets_by_Tokenizer(\n",
+ " train_dataset, test_dataset, seq_len=seq_len)\n",
+ " \n",
+ "print(train_dataset[:2])\n",
+ "print(\"vocab_size = {}, num_class = {}\".format(vocab_size, num_class))"
+ ]
+ },
+ {
+ "cell_type": "markdown",
+ "metadata": {},
+ "source": [
+ "To inspect the intermediate results, uncomment the four print statements above. Here we test the function on a tiny example:"
+ ]
+ },
+ {
+ "cell_type": "code",
+ "execution_count": null,
+ "metadata": {},
+ "outputs": [],
+ "source": [
+ "train = [(1, 'The moon is light'),\n",
+ " (2, 'This is the last rose of summer')]\n",
+ "test = train[:]\n",
+ "train, test, sz, cls = process_datasets_by_Tokenizer(train, test, seq_len=5)\n",
+ "train, test, sz, cls"
+ ]
+ },
+ {
+ "cell_type": "markdown",
+ "metadata": {},
+ "source": [
+ "## Build a model"
+ ]
+ },
+ {
+ "cell_type": "markdown",
+ "metadata": {},
+ "source": [
+ "The structure of the model is simple: an embedding layer + an average pooling layer + a fully connected layer."
+ ]
+ },
+ {
+ "cell_type": "code",
+ "execution_count": null,
+ "metadata": {},
+ "outputs": [],
+ "source": [
+ "class TextSentiment(nn.Module):\n",
+ " \"\"\"Text classification model\"\"\"\n",
+ " def __init__(self, vocab_size, embed_dim, num_class, seq_len):\n",
+ " \"\"\"\n",
+ " description: Initialization function of the class\n",
+ " :param vocab_size: The total number of distinct words contained in the entire corpus\n",
+ " :param embed_dim: Specifies the dimension in which the word is embedded\n",
+ " :param num_class: The total number of categories for the text classification\n",
+ " :param seq_len: The fixed sentence length, used as the pooling kernel size\n",
+ " \"\"\" \n",
+ " super(TextSentiment, self).__init__()\n",
+ " \n",
+ " self.seq_len = seq_len\n",
+ " self.embed_dim = embed_dim\n",
+ " \n",
+ " # Instantiating the embedding layer;\n",
+ " # sparse=True means that only part of the weights are updated each time the gradient is computed for this layer\n",
+ " self.embedding = nn.Embedding(vocab_size, embed_dim, sparse=True)\n",
+ " # Instantiate the linear layer, with parameters embed_dim and num_class\n",
+ " self.fc = nn.Linear(embed_dim, num_class)\n",
+ " # Initialize weights for each layer\n",
+ " self.init_weights()\n",
+ " \n",
+ " def init_weights(self):\n",
+ " \"\"\"Initialize the weight function\"\"\"\n",
+ " # Specifies the value range of the initial weights\n",
+ " initrange = 0.5\n",
+ " # The weight parameters of each layer are initialized to a uniform distribution\n",
+ " self.embedding.weight.data.uniform_(-initrange, initrange)\n",
+ " self.fc.weight.data.uniform_(-initrange, initrange)\n",
+ " # The bias is initialized to 0\n",
+ " self.fc.bias.data.zero_()\n",
+ " \n",
+ " def forward(self, text):\n",
+ " \"\"\"\n",
+ " :param text: The result of the text numeric mapping\n",
+ " :return: A tensor of the same size as the number of categories, which is used to determine the category of the text\n",
+ " \"\"\"\n",
+ " # [batch_size, seq_len, embed_dim]\n",
+ " embedded = self.embedding(text) \n",
+ " # [batch_size, embed_dim, seq_len]\n",
+ " # The pooling below runs over the sentence dimension, so move it to the end\n",
+ " embedded = embedded.transpose(2, 1)\n",
+ " # [batch_size, embed_dim, 1]\n",
+ " embedded = F.avg_pool1d(embedded, kernel_size=self.seq_len)\n",
+ " # [batch_size, embed_dim]\n",
+ " embedded = embedded.squeeze(-1)\n",
+ " # torch.nn.CrossEntropyLoss() applies softmax internally, so there is no softmax here\n",
+ " return self.fc(embedded)"
+ ]
+ },
+ {
+ "cell_type": "markdown",
+ "metadata": {},
+ "source": [
+ "## Training"
+ ]
+ },
+ {
+ "cell_type": "markdown",
+ "metadata": {},
+ "source": [
+ "### generate_batch"
+ ]
+ },
+ {
+ "cell_type": "markdown",
+ "metadata": {},
+ "source": [
+ "generate_batch constructs the data in a batch; it is passed to DataLoader as its collate_fn parameter."
+ ]
+ },
+ {
+ "cell_type": "code",
+ "execution_count": null,
+ "metadata": {},
+ "outputs": [],
+ "source": [
+ "def generate_batch(batch):\n",
+ " \"\"\"\n",
+ " Args:\n",
+ " batch: a batch_size-sized list of (sample tensor, label) tuples,\n",
+ " [(sample1, label1), (sample2, label2), ..., (samplen, labeln)]\n",
+ " Returns:\n",
+ " the sample tensors concatenated into a single tensor, and the labels as a tensor\n",
+ " \"\"\"\n",
+ " text = [entry[0].reshape(1, -1) for entry in batch]\n",
+ "# print(text)\n",
+ " label = torch.tensor([entry[1] for entry in batch])\n",
+ " text = torch.cat(text, dim=0)\n",
+ " \n",
+ " # text is already a tensor after torch.cat, so return it as-is\n",
+ " # (re-wrapping it in torch.tensor() would trigger a copy-construct warning)\n",
+ " return text, label"
+ ]
+ },
+ {
+ "cell_type": "markdown",
+ "metadata": {},
+ "source": [
+ "Let's test the effect of this function:"
+ ]
+ },
+ {
+ "cell_type": "code",
+ "execution_count": null,
+ "metadata": {},
+ "outputs": [],
+ "source": [
+ "batch = [(torch.tensor([3, 23, 2, 8]), 1), (torch.tensor([3, 45, 21, 6]), 0)]\n",
+ "res = generate_batch(batch)\n",
+ "print(res, res[0].size())"
+ ]
+ },
+ {
+ "cell_type": "markdown",
+ "metadata": {},
+ "source": [
+ "### Training & Validation Functions"
+ ]
+ },
+ {
+ "cell_type": "code",
+ "execution_count": null,
+ "metadata": {},
+ "outputs": [],
+ "source": [
+ "device = torch.device('cuda' if torch.cuda.is_available() else 'cpu')\n",
+ "def run(data, batch_size, model, criterion, \n",
+ " mode='train', optimizer=None, scheduler=None):\n",
+ " total_loss, total_acc = 0., 0.\n",
+ " \n",
+ " shuffle = False\n",
+ " if mode == 'train':\n",
+ " shuffle = True\n",
+ " data = DataLoader(data, batch_size=batch_size, shuffle=shuffle,\n",
+ " collate_fn=generate_batch)\n",
+ " \n",
+ " for i, (text, label) in enumerate(data):\n",
+ "# text = text.to(device) # gpu version\n",
+ "# label = label.to(device)\n",
+ " sz = text.size(0)\n",
+ " if mode == 'train':\n",
+ " optimizer.zero_grad()\n",
+ " output = model(text)\n",
+ " loss = criterion(output, label)\n",
+ " # Incremental (running) mean of the per-sample loss over the batches seen so far\n",
+ " total_loss = i / (i + 1) * total_loss + loss.item() / sz / (i + 1)\n",
+ " loss.backward()\n",
+ " optimizer.step()\n",
+ "# predict = F.softmax(output, dim=-1)\n",
+ " correct_cnt = (output.argmax(1) == label).sum().item()\n",
+ " total_acc = i / (i + 1) * total_acc + correct_cnt / sz / (i + 1)\n",
+ " else:\n",
+ " with torch.no_grad():\n",
+ " output = model(text)\n",
+ " loss = criterion(output, label)\n",
+ " total_loss = i / (i + 1) * total_loss + loss.item() / sz / (i + 1)\n",
+ "# predict = F.softmax(output, dim=-1)\n",
+ " correct_cnt = (output.argmax(1) == label).sum().item()\n",
+ " total_acc = i / (i + 1) * total_acc + correct_cnt / sz / (i + 1)\n",
+ " \n",
+ "# if i % 10 == 0:\n",
+ "# print(\"i: {}, loss: {}\".format(i, total_loss))\n",
+ " \n",
+ " # Adjust the optimizer learning rate\n",
+ " if (scheduler):\n",
+ " scheduler.step()\n",
+ "# print(total_loss, total_acc, total_loss / count, total_acc / count, count)\n",
+ " return total_loss , total_acc"
+ ]
+ },
+ {
+ "cell_type": "markdown",
+ "metadata": {},
+ "source": [
+ "### Main process"
+ ]
+ },
+ {
+ "cell_type": "code",
+ "execution_count": null,
+ "metadata": {},
+ "outputs": [],
+ "source": [
+ "model = TextSentiment(vocab_size + 1, embed_dim, num_class, seq_len)\n",
+ "# model = TextSentiment(vocab_size + 1, embed_dim, num_class, seq_len).to(device) # gpu version\n",
+ " \n",
+ "criterion = torch.nn.CrossEntropyLoss() # Applies softmax internally\n",
+ "optimizer = torch.optim.SGD(model.parameters(), lr=0.1)\n",
+ "scheduler = torch.optim.lr_scheduler.StepLR(optimizer, 1, gamma=0.99)\n",
+ " \n",
+ "train_len = int(len(train_dataset) * 0.95)\n",
+ "sub_train_, sub_valid_ = random_split(train_dataset, \n",
+ " [train_len, len(train_dataset) - train_len])\n",
+ "n_epochs = 10\n",
+ "for epoch in range(n_epochs):\n",
+ " start_time = time.time()\n",
+ " train_loss, train_acc = run(sub_train_, batch_size, model, criterion, \n",
+ " mode='train', optimizer=optimizer, scheduler=scheduler)\n",
+ " \n",
+ " # Validate on the held-out split, not on the training data\n",
+ " valid_loss, valid_acc = run(sub_valid_, batch_size, model, criterion, mode='validation')\n",
+ " \n",
+ " secs = int(time.time() - start_time)\n",
+ " mins = secs // 60\n",
+ " secs = secs % 60\n",
+ " \n",
+ " print(\"Epoch: %d\" % (epoch + 1),\n",
+ " \" | time in %d minutes, %d seconds\" % (mins, secs))\n",
+ " print(\n",
+ " f\"\\tLoss: {train_loss:.4f}(train)\\t|\\tAcc: {train_acc * 100:.1f}%(train)\"\n",
+ " )\n",
+ " print(\n",
+ " f\"\\tLoss: {valid_loss:.4f}(valid)\\t|\\tAcc: {valid_acc * 100:.1f}%(valid)\"\n",
+ " )"
+ ]
+ }
+ ],
+ "metadata": {
+ "kernelspec": {
+ "display_name": "Python 3 (ipykernel)",
+ "language": "python",
+ "name": "python3"
+ },
+ "language_info": {
+ "codemirror_mode": {
+ "name": "ipython",
+ "version": 3
+ },
+ "file_extension": ".py",
+ "mimetype": "text/x-python",
+ "name": "python",
+ "nbconvert_exporter": "python",
+ "pygments_lexer": "ipython3",
+ "version": "3.9.13"
+ }
+ },
+ "nbformat": 4,
+ "nbformat_minor": 2
+}
diff --git a/open-machine-learning-jupyter-book/deep-learning/nlp.ipynb b/open-machine-learning-jupyter-book/deep-learning/nlp/nlp.ipynb
similarity index 87%
rename from open-machine-learning-jupyter-book/deep-learning/nlp.ipynb
rename to open-machine-learning-jupyter-book/deep-learning/nlp/nlp.ipynb
index efcff10fd..8f424ef66 100644
--- a/open-machine-learning-jupyter-book/deep-learning/nlp.ipynb
+++ b/open-machine-learning-jupyter-book/deep-learning/nlp/nlp.ipynb
@@ -46,7 +46,7 @@
 "tags": []
 },
 "source": [
- "# Natural Language Processing Overview"
+ "# Natural Language Processing"
 ]
 },
 {
@@ -87,30 +87,12 @@
 "NLP involves enabling machines to understand, interpret, and produce human language in a way that is both valuable and meaningful. OpenAI, known for developing advanced language models like ChatGPT, highlights the importance of NLP in creating intelligent systems that can understand, respond to, and generate text, making technology more user-friendly and accessible"
 ]
 },
- {
- "cell_type": "markdown",
- "id": "7d322ec8-6408-42b6-a8ea-71c5d5884de9",
- "metadata": {
- "tags": []
- },
- "source": [
- "## How Does NLP Work?"
- ]
- },
- {
- "cell_type": "markdown",
- "id": "d4f0be94-661e-4c6b-a6d7-632f95a5a9ca",
- "metadata": {},
- "source": [
- "Let’s take a look at some of the mechanisms at work behind natural language processing."
- ] - }, { "cell_type": "markdown", "id": "dd2ee092-520b-4064-b14d-183661e6a049", "metadata": {}, "source": [ - "### Components of NLP" + "## Components of NLP" ] }, { @@ -126,7 +108,7 @@ "id": "fd0b8220-7d49-4c84-9338-d01d5555d664", "metadata": {}, "source": [ - "#### Syntax\n", + "### Syntax\n", "- Definition: Syntax pertains to the arrangement of words and phrases to create well-structured sentences in a language.\n", "- Example: Consider the sentence \"The cat sat on the mat.\" Syntax involves analyzing the grammatical structure of this sentence, ensuring that it adheres to the grammatical rules of English, such as subject-verb agreement and proper word order." ] @@ -136,7 +118,7 @@ "id": "7364e95d-a57e-4b70-9f5f-37d22cd17984", "metadata": {}, "source": [ - "#### Semantics\n", + "### Semantics\n", "- Definition: Semantics is concerned with understanding the meaning of words and how they create meaning when combined in sentences.\n", "- Example: In the sentence \"The panda eats shoots and leaves,\" semantics helps distinguish whether the panda eats plants (shoots and leaves) or is involved in a violent act (shoots) and then departs (leaves), based on the meaning of the words and the context." ] @@ -154,7 +136,7 @@ "id": "82bbe334-dc25-46b8-a2b8-61dc28e266ad", "metadata": {}, "source": [ - "#### Pragmatics\n", + "### Pragmatics\n", "- Definition: Pragmatics deals with understanding language in various contexts, ensuring that the intended meaning is derived based on the situation, speaker’s intent, and shared knowledge.\n", "- Example: If someone says, \"Can you pass the salt?\" Pragmatics involves understanding that this is a request rather than a question about one's ability to pass the salt, interpreting the speaker’s intent based on the dining context." ] @@ -164,7 +146,7 @@ "id": "a0bb2293-2ea7-491e-80a1-aee4259d8c3f", "metadata": {}, "source": [ - "#### Discourse\n", + "### Discourse\n", "- Definition: Discourse focuses on the analysis and interpretation of language beyond the sentence level, considering how sentences relate to each other in texts and conversations.\n", "- Example: In a conversation where one person says, \"I’m freezing,\" and another responds, \"I’ll close the window,\" discourse involves understanding the coherence between the two statements, recognizing that the second statement is a response to the implied request in the first." ] @@ -177,102 +159,6 @@ "Understanding these components is crucial for anyone delving into NLP, as they form the backbone of how NLP models interpret and generate human langua" ] }, - { - "cell_type": "markdown", - "id": "222b00a6-f1ba-4b63-a8e5-1dbca48b77fe", - "metadata": {}, - "source": [ - "### NLP techniques and methods" - ] - }, - { - "cell_type": "markdown", - "id": "948b9247-e6d1-4ee1-bf99-1aa419ac8700", - "metadata": {}, - "source": [ - "To analyze and understand human language, NLP employs a variety of techniques and methods. Here are some fundamental techniques used in NLP:" - ] - }, - { - "cell_type": "markdown", - "id": "9591bff3-247e-45be-a300-a7c4c941597c", - "metadata": {}, - "source": [ - "- Tokenization: This is the process of breaking text into words, phrases, symbols, or other meaningful elements, known as tokens." - ] - }, - { - "cell_type": "markdown", - "id": "b17ee05d-7891-4726-bd9c-7dd881bad64c", - "metadata": {}, - "source": [ - "
Image: Tokenization in NLP
" - ] - }, - { - "cell_type": "markdown", - "id": "2476764b-2b93-41f9-932a-578941752c51", - "metadata": {}, - "source": [ - "- Parsing: Parsing involves analyzing the grammatical structure of a sentence to extract meaning." - ] - }, - { - "cell_type": "markdown", - "id": "73a80730-782c-47e7-824b-55e18f9769e2", - "metadata": {}, - "source": [ - "- Lemmatization: This technique reduces words to their base or root form, allowing for the grouping of different forms of the same word." - ] - }, - { - "cell_type": "markdown", - "id": "f8d9dcd0-e581-490e-a1ab-b6b9e4440869", - "metadata": {}, - "source": [ - "
Image: Lemmatization in NLP
" - ] - }, - { - "cell_type": "markdown", - "id": "223836f4-cb48-4746-b1f0-18e726d86935", - "metadata": {}, - "source": [ - "- Named Entity Recognition (NER): NER is used to identify entities such as persons, organizations, locations, and other named items in the text." - ] - }, - { - "cell_type": "markdown", - "id": "ef91cea7-c99a-4955-a319-deeeb471583d", - "metadata": {}, - "source": [ - "
Image: NER in NLP
" - ] - }, - { - "cell_type": "markdown", - "id": "af0f5555-bb7a-44a1-9316-829eca456515", - "metadata": {}, - "source": [ - "- Sentiment analysis: This method is used to gain an understanding of the sentiment or emotion conveyed in a piece of text." - ] - }, - { - "cell_type": "markdown", - "id": "c26a4926-3a0d-4384-a547-0cd9eea3ff05", - "metadata": {}, - "source": [ - "
Image: Sentiment analysis in NLP
" - ] - }, - { - "cell_type": "markdown", - "id": "514884a8-dd0b-4b85-bb66-fc33cdec524b", - "metadata": {}, - "source": [ - "Each of these techniques plays a vital role in enabling computers to process and understand human language, forming the building blocks of more advanced NLP applications." - ] - }, { "cell_type": "markdown", "id": "6926e7a2-cc7f-4459-9925-9fa78f73f471", @@ -786,7 +672,8 @@ }, "source": [ "## Your turn! 🚀\n", - "You can practice your nlp skills by following the assignment [getting start nlp with classification task](../assignments/deep-learning/nlp/getting-start-nlp-with-classification-task.ipynb)." + "\n", + "You can practice your nlp skills by following the assignment [getting start nlp with classification task](../../assignments/deep-learning/nlp/getting-start-nlp-with-classification-task.ipynb)." ] }, { @@ -798,6 +685,16 @@ "\n", "Thanks to [Matt Crabtree](https://www.datacamp.com/portfolio/mattcrabtree) and [Phil Culliton](https://www.kaggle.com/philculliton) for creating the open-source course [What is Natural Language Processing (NLP)?](https://www.datacamp.com/blog/what-is-natural-language-processing) and [NLP Getting Started Tutorial](https://www.kaggle.com/code/philculliton/nlp-getting-started-tutorial). It inspires the majority of the content in this chapter.\n" ] + }, + { + "cell_type": "markdown", + "id": "c3f25b14", + "metadata": {}, + "source": [ + "```{tableofcontents}\n", + "\n", + "```" + ] } ], "metadata": { diff --git a/open-machine-learning-jupyter-book/deep-learning/nlp/text-preprocessing.ipynb b/open-machine-learning-jupyter-book/deep-learning/nlp/text-preprocessing.ipynb new file mode 100644 index 000000000..5ecdcc0c2 --- /dev/null +++ b/open-machine-learning-jupyter-book/deep-learning/nlp/text-preprocessing.ipynb @@ -0,0 +1,522 @@ +{ + "cells": [ + { + "cell_type": "code", + "execution_count": null, + "metadata": { + "tags": [ + "hide-cell" + ] + }, + "outputs": [], + "source": [ + "# Install the necessary dependencies\n", + "\n", + "import os\n", + "import sys\n", + "!{sys.executable} -m pip install --quiet pandas scikit-learn numpy matplotlib jupyterlab_myst ipython nltk" + ] + }, + { + "cell_type": "markdown", + "metadata": { + "tags": [ + "remove-cell" + ] + }, + "source": [ + "---\n", + "license:\n", + " code: MIT\n", + " content: CC-BY-4.0\n", + "github: https://github.com/ocademy-ai/machine-learning\n", + "venue: By Ocademy\n", + "open_access: true\n", + "bibliography:\n", + " - https://raw.githubusercontent.com/ocademy-ai/machine-learning/main/open-machine-learning-jupyter-book/references.bib\n", + "---" + ] + }, + { + "cell_type": "markdown", + "metadata": {}, + "source": [ + "# Text Preprocessing\n", + "\n", + "Preprocessing in NLP is a means to get text data ready for further processing or analysis. Most of the time, preprocessing is a mix of cleaning and normalising techniques that make the text easier to use for the task at hand.\n", + "\n", + "A useful library for processing text in Python is the Natural Language Toolkit (NLTK). This chapter will go into 6 of the most commonly used pre-processing steps and provide code examples so you can start using the techniques immediately." 
+ ]
+ },
+ {
+ "cell_type": "markdown",
+ "metadata": {},
+ "source": [
+ "## Common NLTK preprocessing steps"
+ ]
+ },
+ {
+ "cell_type": "markdown",
+ "metadata": {},
+ "source": [
+ "### Tokenization\n",
+ "\n",
+ "Splitting the text into individual words or subwords (tokens).\n",
+ "\n",
+ ":::{figure} https://static-1300131294.cos.ap-shanghai.myqcloud.com/images/deep-learning/NLP/tokenization.png\n",
+ "---\n",
+ "name: 'tokenization in nlp'\n",
+ "width: 90%\n",
+ "---\n",
+ "Tokenization in NLP\n",
+ ":::\n",
+ "\n",
+ "Here is how to implement tokenization in NLTK:"
+ ]
+ },
+ {
+ "cell_type": "code",
+ "execution_count": 2,
+ "metadata": {},
+ "outputs": [
+ {
+ "name": "stdout",
+ "output_type": "stream",
+ "text": [
+ "Tokens: ['Natural', 'language', 'processing', 'is', 'a', 'field', 'of', 'artificial', 'intelligence', 'that', 'deals', 'with', 'the', 'interaction', 'between', 'computers', 'and', 'human', '(', 'natural', ')', 'language', '.']\n"
+ ]
+ }
+ ],
+ "source": [
+ "import nltk\n",
+ "\n",
+ "# input text\n",
+ "text = \"Natural language processing is a field of artificial intelligence that deals with the interaction between computers and human (natural) language.\"\n",
+ "\n",
+ "# tokenize the text\n",
+ "tokens = nltk.word_tokenize(text)\n",
+ "\n",
+ "print(\"Tokens:\", tokens)"
+ ]
+ },
+ {
+ "cell_type": "markdown",
+ "metadata": {},
+ "source": [
+ "### Remove stop words\n",
+ "\n",
+ "Removing common words that do not add significant meaning to the text, such as “a,” “an,” and “the.”\n",
+ "\n",
+ "To remove common stop words from a list of tokens using NLTK, you can use the **nltk.corpus.stopwords.words()** function to get a list of stopwords in a specific language and filter the tokens using this list. Here is an example of how to do this:"
+ ]
+ },
+ {
+ "cell_type": "code",
+ "execution_count": 3,
+ "metadata": {},
+ "outputs": [
+ {
+ "name": "stdout",
+ "output_type": "stream",
+ "text": [
+ "Tokens without stopwords: ['Natural', 'language', 'processing', 'field', 'artificial', 'intelligence', 'deals', 'interaction', 'computers', 'human', '(', 'natural', ')', 'language', '.']\n"
+ ]
+ }
+ ],
+ "source": [
+ "import nltk\n",
+ "\n",
+ "# input text\n",
+ "text = \"Natural language processing is a field of artificial intelligence that deals with the interaction between computers and human (natural) language.\"\n",
+ "\n",
+ "# tokenize the text\n",
+ "tokens = nltk.word_tokenize(text)\n",
+ "\n",
+ "# get list of stopwords in English\n",
+ "stopwords = nltk.corpus.stopwords.words(\"english\")\n",
+ "\n",
+ "# remove stopwords\n",
+ "filtered_tokens = [token for token in tokens if token.lower() not in stopwords]\n",
+ "\n",
+ "print(\"Tokens without stopwords:\", filtered_tokens)"
+ ]
+ },
+ {
+ "cell_type": "markdown",
+ "metadata": {},
+ "source": [
+ "### Stemming\n",
+ "\n",
+ "Reducing words to their root form by removing suffixes and prefixes, such as converting “jumping” to “jump.” However, it may produce non-existent words.\n",
+ "\n",
+ ":::{figure} https://static-1300131294.cos.ap-shanghai.myqcloud.com/images/deep-learning/NLP/stemming.png\n",
+ "---\n",
+ "name: 'stemming in nlp'\n",
+ "width: 90%\n",
+ "---\n",
+ "Stemming in NLP\n",
+ ":::\n",
+ "\n",
+ "To perform stemming on a list of tokens using NLTK, you can use the **nltk.stem.PorterStemmer()** function to create a stemmer object and its **stem()** method to stem each token. 
Here is an example of how to do this:" + ] + }, + { + "cell_type": "code", + "execution_count": 4, + "metadata": {}, + "outputs": [ + { + "name": "stdout", + "output_type": "stream", + "text": [ + "Stemmed tokens: ['natur', 'languag', 'process', 'is', 'a', 'field', 'of', 'artifici', 'intellig', 'that', 'deal', 'with', 'the', 'interact', 'between', 'comput', 'and', 'human', '(', 'natur', ')', 'languag', '.']\n" + ] + } + ], + "source": [ + "import nltk\n", + "\n", + "# input text\n", + "text = \"Natural language processing is a field of artificial intelligence that deals with the interaction between computers and human (natural) language.\"\n", + "\n", + "# tokenize the text\n", + "tokens = nltk.word_tokenize(text)\n", + "\n", + "# create stemmer object\n", + "stemmer = nltk.stem.PorterStemmer()\n", + "\n", + "# stem each token\n", + "stemmed_tokens = [stemmer.stem(token) for token in tokens]\n", + "\n", + "print(\"Stemmed tokens:\", stemmed_tokens)" + ] + }, + { + "cell_type": "markdown", + "metadata": {}, + "source": [ + "### Lemmatization\n", + "\n", + "Reducing words to their base form by considering the context in which they are used, such as “running” or “ran” becoming “run”. This technique is similar to stemming, but it is more accurate as it considers the context of the word.\n", + "\n", + ":::{figure} https://static-1300131294.cos.ap-shanghai.myqcloud.com/images/deep-learning/NLP/Lemmatization.png\n", + "---\n", + "name: 'lemmatization in nlp'\n", + "width: 90%\n", + "---\n", + "Lemmatization in NLP\n", + ":::\n", + "\n", + "To perform lemmatization on a list of tokens using NLTK, you can use the **nltk.stem.WordNetLemmatizer()** function to create a lemmatizer object and the **lemmatize()** method to lemmatize each token. Here is an example of how to do this:" + ] + }, + { + "cell_type": "code", + "execution_count": 5, + "metadata": {}, + "outputs": [ + { + "name": "stdout", + "output_type": "stream", + "text": [ + "Lemmatized tokens: ['Natural', 'language', 'processing', 'is', 'a', 'field', 'of', 'artificial', 'intelligence', 'that', 'deal', 'with', 'the', 'interaction', 'between', 'computer', 'and', 'human', '(', 'natural', ')', 'language', '.']\n" + ] + } + ], + "source": [ + "import nltk\n", + "\n", + "# input text\n", + "text = \"Natural language processing is a field of artificial intelligence that deals with the interaction between computers and human (natural) language.\"\n", + "\n", + "# tokenize the text\n", + "tokens = nltk.word_tokenize(text)\n", + "\n", + "# create lemmatizer object\n", + "lemmatizer = nltk.stem.WordNetLemmatizer()\n", + "\n", + "# lemmatize each token\n", + "lemmatized_tokens = [lemmatizer.lemmatize(token) for token in tokens]\n", + "\n", + "print(\"Lemmatized tokens:\", lemmatized_tokens)" + ] + }, + { + "cell_type": "markdown", + "metadata": { + "tags": [] + }, + "source": [ + "### Part Of Speech Tagging\n", + "\n", + "Identifying the part of speech of each word in the text, such as noun, verb, or adjective.\n", + "\n", + ":::{figure} https://static-1300131294.cos.ap-shanghai.myqcloud.com/images/deep-learning/NLP/POS tag and description.png\n", + "---\n", + "name: 'POS tag and description in nlp'\n", + "width: 90%\n", + "---\n", + "POS tag and description\n", + ":::\n", + "\n", + "To perform part of speech (POS) tagging on a list of tokens using NLTK, you can use the **nltk.pos_tag()** function to tag the tokens with their corresponding POS tags. 
Here is an example of how to do this:"
+   ]
+  },
+  {
+   "cell_type": "code",
+   "execution_count": 6,
+   "metadata": {},
+   "outputs": [
+    {
+     "name": "stdout",
+     "output_type": "stream",
+     "text": [
+      "Tagged tokens: [('Natural', 'JJ'), ('language', 'NN'), ('processing', 'NN'), ('is', 'VBZ'), ('a', 'DT'), ('field', 'NN'), ('of', 'IN'), ('artificial', 'JJ'), ('intelligence', 'NN'), ('that', 'IN'), ('deals', 'NNS'), ('with', 'IN'), ('the', 'DT'), ('interaction', 'NN'), ('between', 'IN'), ('computers', 'NNS'), ('and', 'CC'), ('human', 'JJ'), ('(', '('), ('natural', 'JJ'), (')', ')'), ('language', 'NN'), ('.', '.')]\n"
+     ]
+    }
+   ],
+   "source": [
+    "import nltk\n",
+    "\n",
+    "# input text\n",
+    "text = \"Natural language processing is a field of artificial intelligence that deals with the interaction between computers and human (natural) language.\"\n",
+    "\n",
+    "# tokenize the text\n",
+    "tokens = nltk.word_tokenize(text)\n",
+    "\n",
+    "# tag the tokens with their POS tags\n",
+    "tagged_tokens = nltk.pos_tag(tokens)\n",
+    "\n",
+    "print(\"Tagged tokens:\", tagged_tokens)"
+   ]
+  },
+  {
+   "cell_type": "markdown",
+   "metadata": {},
+   "source": [
+    "### Named Entity Recognition (NER)\n",
+    "\n",
+    "Extracting named entities from a text, like a person’s name.\n",
+    "\n",
+    ":::{figure} https://static-1300131294.cos.ap-shanghai.myqcloud.com/images/deep-learning/NLP/ner.gif\n",
+    "---\n",
+    "name: 'ner in nlp'\n",
+    "width: 90%\n",
+    "---\n",
+    "NER in NLP\n",
+    ":::\n",
+    "\n",
+    "To perform named entity recognition (NER) on a list of tokens using NLTK, you can use the **nltk.ne_chunk()** function to identify and label named entities in the tokens. Here is an example of how to do this:"
+   ]
+  },
+  {
+   "cell_type": "code",
+   "execution_count": 7,
+   "metadata": {},
+   "outputs": [
+    {
+     "name": "stdout",
+     "output_type": "stream",
+     "text": [
+      "Named entities: (S\n",
+      " Natural/JJ\n",
+      " language/NN\n",
+      " processing/NN\n",
+      " is/VBZ\n",
+      " a/DT\n",
+      " field/NN\n",
+      " of/IN\n",
+      " artificial/JJ\n",
+      " intelligence/NN\n",
+      " that/IN\n",
+      " deals/NNS\n",
+      " with/IN\n",
+      " the/DT\n",
+      " interaction/NN\n",
+      " between/IN\n",
+      " computers/NNS\n",
+      " and/CC\n",
+      " human/JJ\n",
+      " (/(\n",
+      " natural/JJ\n",
+      " )/)\n",
+      " language/NN\n",
+      " ./.\n",
+      " (PERSON John/NNP Smith/NNP)\n",
+      " works/VBZ\n",
+      " at/IN\n",
+      " (ORGANIZATION Google/NNP)\n",
+      " in/IN\n",
+      " (GPE New/NNP York/NNP)\n",
+      " ./.)\n"
+     ]
+    }
+   ],
+   "source": [
+    "import nltk\n",
+    "\n",
+    "# input text\n",
+    "text = \"Natural language processing is a field of artificial intelligence that deals with the interaction between computers and human (natural) language. John Smith works at Google in New York.\"\n",
+    "\n",
+    "# tokenize the text\n",
+    "tokens = nltk.word_tokenize(text)\n",
+    "\n",
+    "# tag the tokens with their part of speech\n",
+    "tagged_tokens = nltk.pos_tag(tokens)\n",
+    "\n",
+    "# identify named entities\n",
+    "named_entities = nltk.ne_chunk(tagged_tokens)\n",
+    "\n",
+    "print(\"Named entities:\", named_entities)"
+   ]
+  },
+  {
+   "cell_type": "markdown",
+   "metadata": {},
+   "source": [
+    "## NLTK preprocessing pipeline example"
+   ]
+  },
+  {
+   "cell_type": "markdown",
+   "metadata": {},
+   "source": [
+    "Preprocessing techniques can be applied independently or in combination, depending on the specific requirements of the task at hand.\n",
+    "\n",
+    "Here is an example of a typical NLP pipeline using NLTK:\n",
+    "\n",
+    "1. Tokenization: First, we need to split the input text into individual words (tokens). This can be done using the **nltk.word_tokenize()** function.\n",
+    "\n",
+    "2. Part-of-speech tagging: Next, we can use the **nltk.pos_tag()** function to assign a part-of-speech (POS) tag to each token, which indicates its role in a sentence (e.g., noun, verb, adjective).\n",
+    "\n",
+    "3. Named entity recognition: Using the **nltk.ne_chunk()** function, we can identify named entities (e.g., person, organization, location) in the text.\n",
+    "\n",
+    "4. Lemmatization: We can use the **nltk.WordNetLemmatizer()** class to convert each token to its base form (lemma), which helps with the analysis of the text.\n",
+    "\n",
+    "5. Stopword removal: We can use the **nltk.corpus.stopwords.words()** function to remove common words (stopwords) that do not add significant meaning to the text, such as “the,” “a,” and “an.”\n",
+    "\n",
+    "6. Text classification: Finally, we can use the processed text to train a classifier using machine learning algorithms to perform tasks such as sentiment analysis or spam detection."
+   ]
+  },
+  {
+   "cell_type": "markdown",
+   "metadata": {},
+   "source": [
+    "## NLTK preprocessing example code"
+   ]
+  },
+  {
+   "cell_type": "markdown",
+   "metadata": {},
+   "source": [
+    "First, we preprocess the text, including tokenization, part-of-speech tagging, named entity recognition, lemmatization, and stopword removal."
+   ]
+  },
+  {
+   "cell_type": "code",
+   "execution_count": 2,
+   "metadata": {},
+   "outputs": [],
+   "source": [
+    "import nltk\n",
+    "\n",
+    "# input text\n",
+    "text = \"Natural language processing is a field of artificial intelligence that deals with the interaction between computers and human (natural) language.\"\n",
+    "\n",
+    "# tokenization\n",
+    "tokens = nltk.word_tokenize(text)\n",
+    "\n",
+    "# part-of-speech tagging\n",
+    "pos_tags = nltk.pos_tag(tokens)\n",
+    "\n",
+    "# named entity recognition\n",
+    "named_entities = nltk.ne_chunk(pos_tags)\n",
+    "\n",
+    "# lemmatization\n",
+    "lemmatizer = nltk.WordNetLemmatizer()\n",
+    "lemmas = [lemmatizer.lemmatize(token) for token in tokens]\n",
+    "\n",
+    "# stopword removal\n",
+    "stopwords = nltk.corpus.stopwords.words(\"english\")\n",
+    "filtered_tokens = [token for token in tokens if token not in stopwords]"
+   ]
+  },
+  {
+   "cell_type": "markdown",
+   "metadata": {},
+   "source": [
+    "Then we do the text classification (example using a simple Naive Bayes classifier)."
+   ]
+  },
+  {
+   "cell_type": "code",
+   "execution_count": 15,
+   "metadata": {},
+   "outputs": [
+    {
+     "name": "stdout",
+     "output_type": "stream",
+     "text": [
+      "Sentiment: pos\n"
+     ]
+    }
+   ],
+   "source": [
+    "from nltk.classify import NaiveBayesClassifier\n",
+    "\n",
+    "# training data (using a toy dataset for illustration purposes)\n",
+    "training_data = [(\"I enjoy the book.\", \"pos\"), (\"I like this movie.\", \"pos\"), (\"It was a boring movie.\", \"neg\")]\n",
+    "\n",
+    "# extract features from the training data\n",
+    "def extract_features(text):\n",
+    "    features = {}\n",
+    "    for word in nltk.word_tokenize(text):\n",
+    "        features[word] = True\n",
+    "    return features\n",
+    "\n",
+    "# create a list of feature sets and labels\n",
+    "feature_sets = [(extract_features(text), label) for (text, label) in training_data]\n",
+    "# train the classifier\n",
+    "classifier = NaiveBayesClassifier.train(feature_sets)\n",
+    "\n",
+    "# test the classifier on a new example\n",
+    "test_text = \"I enjoyed the movie.\"\n",
+    "print(\"Sentiment:\", classifier.classify(extract_features(test_text)))"
+   ]
+  },
+  {
+   "cell_type": "markdown",
+   "metadata": {},
+   "source": [
+    "## Your turn! 🚀\n",
+    "\n",
+    "Assignment - [Beginner Guide to Text Pre-Processing](../../assignments/deep-learning/nlp/beginner-guide-to-text-preprocessing.ipynb)\n",
+    "\n",
+    "## Acknowledgments\n",
+    "\n",
+    "Thanks to [Neri Van Otten](https://spotintelligence.com/author/spotintelligence) for creating the open-source project [Top 14 Steps To Build A Complete NLTK Preprocessing Pipeline In Python](https://spotintelligence.com/2022/12/21/nltk-preprocessing-pipeline). It inspired the majority of the content in this chapter."
+   ]
+  }
+ ],
+ "metadata": {
+  "kernelspec": {
+   "display_name": "Python 3 (ipykernel)",
+   "language": "python",
+   "name": "python3"
+  },
+  "language_info": {
+   "codemirror_mode": {
+    "name": "ipython",
+    "version": 3
+   },
+   "file_extension": ".py",
+   "mimetype": "text/x-python",
+   "name": "python",
+   "nbconvert_exporter": "python",
+   "pygments_lexer": "ipython3",
+   "version": "3.9.13"
+  }
+ },
+ "nbformat": 4,
+ "nbformat_minor": 4
+}
diff --git a/open-machine-learning-jupyter-book/deep-learning/nlp/text-representation.ipynb b/open-machine-learning-jupyter-book/deep-learning/nlp/text-representation.ipynb
new file mode 100644
index 000000000..04baf427f
--- /dev/null
+++ b/open-machine-learning-jupyter-book/deep-learning/nlp/text-representation.ipynb
@@ -0,0 +1,1049 @@
+{
+ "cells": [
+  {
+   "cell_type": "code",
+   "execution_count": null,
+   "id": "25fea2d2",
+   "metadata": {
+    "tags": [
+     "hide-cell"
+    ]
+   },
+   "outputs": [],
+   "source": [
+    "# Install the necessary dependencies\n",
+    "\n",
+    "import os\n",
+    "import sys\n",
+    "!{sys.executable} -m pip install --quiet pandas scikit-learn numpy matplotlib jupyterlab_myst ipython gensim torch"
+   ]
+  },
+  {
+   "cell_type": "markdown",
+   "id": "74660f15",
+   "metadata": {
+    "tags": [
+     "remove-cell"
+    ]
+   },
+   "source": [
+    "---\n",
+    "license:\n",
+    "    code: MIT\n",
+    "    content: CC-BY-4.0\n",
+    "github: https://github.com/ocademy-ai/machine-learning\n",
+    "venue: By Ocademy\n",
+    "open_access: true\n",
+    "bibliography:\n",
+    "  - https://raw.githubusercontent.com/ocademy-ai/machine-learning/main/open-machine-learning-jupyter-book/references.bib\n",
+    "---"
+   ]
+  },
+  {
+   "cell_type": "markdown",
+   "id": "3c79b007",
+   "metadata": {},
+   "source": [
+    "# Word embedding"
+   ]
+  },
+  {
+   "cell_type": "markdown",
+   "id": "9ef3de21",
+   "metadata": {},
+   "source": [
+    "Word embeddings are numeric representations of words in a lower-dimensional space that capture semantic and syntactic information. They fall into two main families: discrete representation methods and distributed representation methods."
+   ]
+  },
+  {
+   "cell_type": "markdown",
+   "id": "31ac0f5a",
+   "metadata": {},
+   "source": [
+    "## Discrete representation"
+   ]
+  },
+  {
+   "cell_type": "markdown",
+   "id": "5d1dc20e",
+   "metadata": {},
+   "source": [
+    "This approach compiles a list of distinct terms, gives each one a unique integer value (an id), and then replaces each word in a sentence with its id. Every vocabulary word is handled as a feature, so a large vocabulary results in an extremely large feature size. Common discrete representation methods include one-hot encoding, Bag of Words (BoW), and Term Frequency-Inverse Document Frequency (TF-IDF)."
+   ]
+  },
+  {
+   "cell_type": "markdown",
+   "id": "2cc04cbb",
+   "metadata": {},
+   "source": [
+    "### One-Hot"
+   ]
+  },
+  {
+   "cell_type": "markdown",
+   "id": "2f1dc868",
+   "metadata": {},
+   "source": [
+    "One-hot encoding is a simple method for representing words in natural language processing (NLP). In this encoding scheme, each word in the vocabulary is represented as a unique vector, where the dimensionality of the vector is equal to the size of the vocabulary. The vector has all elements set to 0, except for the element corresponding to the index of the word in the vocabulary, which is set to 1."
+   ]
+  },
+  {
+   "cell_type": "code",
+   "execution_count": 5,
+   "id": "864709ef",
+   "metadata": {},
+   "outputs": [
+    {
+     "name": "stdout",
+     "output_type": "stream",
+     "text": [
+      "Vocabulary: {'mat', 'dog', 'tree', 'in', 'on', 'hat', 'the', 'bird', 'cat'}\n",
+      "Word to Index Mapping: {'mat': 0, 'dog': 1, 'tree': 2, 'in': 3, 'on': 4, 'hat': 5, 'the': 6, 'bird': 7, 'cat': 8}\n",
+      "One-Hot Encoded Matrix:\n",
+      "cat: [0, 0, 0, 0, 0, 0, 0, 0, 1]\n",
+      "in: [0, 0, 0, 1, 0, 0, 0, 0, 0]\n",
+      "the: [0, 0, 0, 0, 0, 0, 1, 0, 0]\n",
+      "hat: [0, 0, 0, 0, 0, 1, 0, 0, 0]\n",
+      "dog: [0, 1, 0, 0, 0, 0, 0, 0, 0]\n",
+      "on: [0, 0, 0, 0, 1, 0, 0, 0, 0]\n",
+      "the: [0, 0, 0, 0, 0, 0, 1, 0, 0]\n",
+      "mat: [1, 0, 0, 0, 0, 0, 0, 0, 0]\n",
+      "bird: [0, 0, 0, 0, 0, 0, 0, 1, 0]\n",
+      "in: [0, 0, 0, 1, 0, 0, 0, 0, 0]\n",
+      "the: [0, 0, 0, 0, 0, 0, 1, 0, 0]\n",
+      "tree: [0, 0, 1, 0, 0, 0, 0, 0, 0]\n"
+     ]
+    }
+   ],
+   "source": [
+    "def one_hot_encode(text):\n",
+    "    words = text.split()\n",
+    "    vocabulary = set(words)\n",
+    "    word_to_index = {word: i for i, word in enumerate(vocabulary)}\n",
+    "    one_hot_encoded = []\n",
+    "    for word in words:\n",
+    "        one_hot_vector = [0] * len(vocabulary)\n",
+    "        one_hot_vector[word_to_index[word]] = 1\n",
+    "        one_hot_encoded.append(one_hot_vector)\n",
+    "\n",
+    "    return one_hot_encoded, word_to_index, vocabulary\n",
+    "\n",
+    "# sample\n",
+    "example_text = \"cat in the hat dog on the mat bird in the tree\"\n",
+    "\n",
+    "one_hot_encoded, word_to_index, vocabulary = one_hot_encode(example_text)\n",
+    "\n",
+    "print(\"Vocabulary:\", vocabulary)\n",
+    "print(\"Word to Index Mapping:\", word_to_index)\n",
+    "print(\"One-Hot Encoded Matrix:\")\n",
+    "for word, encoding in zip(example_text.split(), one_hot_encoded):\n",
+    "    print(f\"{word}: {encoding}\")"
+   ]
+  },
+  {
+   "cell_type": "markdown",
+   "id": "97ef1ca7",
+   "metadata": {},
+   "source": [
+    "### Bag of Words (BoW)"
+   ]
+  },
+  {
+   "cell_type": "markdown",
+   "id": "bd4709a2",
+   "metadata": {},
+   "source": [
+    "Bag-of-Words (BoW) is a text representation technique that represents a document as an unordered set of words and their respective frequencies. It discards the word order and captures the frequency of each word in the document, creating a vector representation."
+   ]
+  },
+  {
+   "cell_type": "code",
+   "execution_count": 3,
+   "id": "9a2f8864",
+   "metadata": {},
+   "outputs": [
+    {
+     "name": "stdout",
+     "output_type": "stream",
+     "text": [
+      "Bag-of-Words Matrix:\n",
+      "[[0 1 1 1 0 0 1 0 1]\n",
+      " [0 2 0 1 0 1 1 0 1]\n",
+      " [1 0 0 1 1 0 1 1 1]\n",
+      " [0 1 1 1 0 0 1 0 1]]\n",
+      "Vocabulary (Feature Names): ['and' 'document' 'first' 'is' 'one' 'second' 'the' 'third' 'this']\n"
+     ]
+    }
+   ],
+   "source": [
+    "from sklearn.feature_extraction.text import CountVectorizer\n",
+    "\n",
+    "documents = [\"This is the first document.\",\n",
+    "             \"This document is the second document.\",\n",
+    "             \"And this is the third one.\",\n",
+    "             \"Is this the first document?\"]\n",
+    "\n",
+    "vectorizer = CountVectorizer()\n",
+    "X = vectorizer.fit_transform(documents)\n",
+    "feature_names = vectorizer.get_feature_names_out()\n",
+    "\n",
+    "print(\"Bag-of-Words Matrix:\")\n",
+    "print(X.toarray())\n",
+    "print(\"Vocabulary (Feature Names):\", feature_names)"
+   ]
+  },
+  {
+   "cell_type": "markdown",
+   "id": "6a18e45f",
+   "metadata": {},
+   "source": [
+    "### Term Frequency-Inverse Document Frequency (TF-IDF)"
+   ]
+  },
+  {
+   "cell_type": "markdown",
+   "id": "d4d1e4d7",
+   "metadata": {},
+   "source": [
+    "Term Frequency-Inverse Document Frequency, commonly known as TF-IDF, is a numerical statistic that reflects the importance of a word in a document relative to a collection of documents (corpus). It is widely used in natural language processing and information retrieval to evaluate the significance of a term within a specific document in a larger corpus. TF-IDF consists of two components:"
+   ]
+  },
+  {
+   "cell_type": "markdown",
+   "id": "3d63655c",
+   "metadata": {},
+   "source": [
+    "**Term Frequency (TF)**: Term Frequency measures how often a term (word) appears in a document. It is calculated using the formula:\n",
+    "\n",
+    "$TF(t,d)=\frac{\text{Number of times term } t \text{ appears in document } d}{\text{Total number of terms in document } d}$"
+   ]
+  },
+  {
+   "cell_type": "markdown",
+   "id": "7570820c",
+   "metadata": {},
+   "source": [
+    "**Inverse Document Frequency (IDF)**: Inverse Document Frequency measures the importance of a term across a collection of documents. It is calculated using the formula:\n",
+    "\n",
+    "$IDF(t,D)=\log{(\frac{\text{Total documents}}{\text{Number of documents containing term } t})}$"
+   ]
+  },
+  {
+   "cell_type": "markdown",
+   "id": "d95ef779",
+   "metadata": {},
+   "source": [
+    "The TF-IDF score for a term $t$ in a document $d$ is then given by multiplying the TF and IDF values:\n",
+    "\n",
+    "$TF\text{-}IDF(t,d,D)=TF(t,d)\times IDF(t,D)$\n",
+    "\n",
+    "For example, a term that appears 3 times in a 100-word document has $TF = 3/100 = 0.03$; if it occurs in 10 out of 1000 documents, $IDF = \log(1000/10) = \log(100)$, so its TF-IDF score is $0.03 \times \log(100)$.\n",
+    "\n",
+    "The higher the TF-IDF score for a term in a document, the more important that term is to that document within the context of the entire corpus. This weighting scheme helps in identifying and extracting relevant information from a large collection of documents, and it is commonly used in text mining, information retrieval, and document clustering.\n",
+    "\n",
+    "Let's implement TF-IDF in Python with the scikit-learn library. The code begins by defining a set of sample documents. The TfidfVectorizer is employed to transform these documents into a TF-IDF matrix. The code then extracts and prints the TF-IDF values for each word in each document.\n",
+    "This statistical measure helps assess the importance of words in a document relative to their frequency across a collection of documents, aiding in information retrieval and text analysis tasks."
+   ]
+  },
+  {
+   "cell_type": "code",
+   "execution_count": 4,
+   "id": "c645bb7a",
+   "metadata": {},
+   "outputs": [
+    {
+     "name": "stdout",
+     "output_type": "stream",
+     "text": [
+      "Document 1:\n",
+      "dog: 0.30151134457776363\n",
+      "lazy: 0.30151134457776363\n",
+      "over: 0.30151134457776363\n",
+      "jumps: 0.30151134457776363\n",
+      "fox: 0.30151134457776363\n",
+      "brown: 0.30151134457776363\n",
+      "quick: 0.30151134457776363\n",
+      "the: 0.6030226891555273\n",
+      "\n",
+      "\n",
+      "Document 2:\n",
+      "step: 0.3535533905932738\n",
+      "single: 0.3535533905932738\n",
+      "with: 0.3535533905932738\n",
+      "begins: 0.3535533905932738\n",
+      "miles: 0.3535533905932738\n",
+      "thousand: 0.3535533905932738\n",
+      "of: 0.3535533905932738\n",
+      "journey: 0.3535533905932738\n",
+      "\n",
+      "\n"
+     ]
+    }
+   ],
+   "source": [
+    "from sklearn.feature_extraction.text import TfidfVectorizer\n",
+    "\n",
+    "# Sample documents\n",
+    "documents = [\n",
+    "    \"The quick brown fox jumps over the lazy dog.\",\n",
+    "    \"A journey of a thousand miles begins with a single step.\",\n",
+    "]\n",
+    "\n",
+    "vectorizer = TfidfVectorizer()  # Create the TF-IDF vectorizer\n",
+    "tfidf_matrix = vectorizer.fit_transform(documents)\n",
+    "feature_names = vectorizer.get_feature_names_out()\n",
+    "tfidf_values = {}\n",
+    "\n",
+    "for doc_index, doc in enumerate(documents):\n",
+    "    feature_index = tfidf_matrix[doc_index, :].nonzero()[1]\n",
+    "    tfidf_doc_values = zip(feature_index, [tfidf_matrix[doc_index, x] for x in feature_index])\n",
+    "    tfidf_values[doc_index] = {feature_names[i]: value for i, value in tfidf_doc_values}\n",
+    "\n",
+    "# print the TF-IDF values per document\n",
+    "for doc_index, values in tfidf_values.items():\n",
+    "    print(f\"Document {doc_index + 1}:\")\n",
+    "    for word, tfidf_value in values.items():\n",
+    "        print(f\"{word}: {tfidf_value}\")\n",
+    "    print(\"\\n\")"
+   ]
+  },
+  {
+   "cell_type": "markdown",
+   "id": "f02af2ee",
+   "metadata": {},
+   "source": [
+    "## Distributed representation"
+   ]
+  },
+  {
+   "cell_type": "markdown",
+   "id": "dbf1c07c",
+   "metadata": {},
+   "source": [
+    "### Word2Vec"
+   ]
+  },
+  {
+   "cell_type": "markdown",
+   "id": "67e6d642",
+   "metadata": {},
+   "source": [
+    "Word2Vec is a neural approach for generating word embeddings. It belongs to the family of neural word embedding techniques and specifically falls under the category of distributed representation models. It is a popular technique in natural language processing (NLP) used to represent words as vectors in a continuous vector space. Developed by a team at Google, Word2Vec aims to capture the semantic relationships between words by mapping them to high-dimensional vectors. The underlying idea is that words with similar meanings should have similar vector representations. In Word2Vec, every word is assigned a vector, which starts out either random or one-hot."
+   ]
+  },
+  {
+   "cell_type": "markdown",
+   "id": "bf16b0de",
+   "metadata": {},
+   "source": [
+    "There are two neural embedding methods for Word2Vec: Continuous Bag of Words (CBOW) and Skip-gram."
+   ]
+  },
+  {
+   "cell_type": "markdown",
+   "id": "a4ba4b22",
+   "metadata": {},
+   "source": [
+    "#### Continuous Bag of Words (CBOW)"
+   ]
+  },
+  {
+   "cell_type": "markdown",
+   "id": "1e7d21ea",
+   "metadata": {},
+   "source": [
+    "Continuous Bag of Words (CBOW) is a type of neural network architecture used in the Word2Vec model.\n",
+    "The primary objective of CBOW is to predict a target word based on its context, i.e. the surrounding words in a given window: given a sequence of words in a context window, the model is trained to predict the target word at the center of the window.\n",
+    "\n",
+    "CBOW is a feedforward neural network with a single hidden layer. The input layer represents the context words, and the output layer represents the target word. The hidden layer contains the learned continuous vector representations (word embeddings) of the input words.\n",
+    "\n",
+    "The architecture is useful for learning distributed representations of words in a continuous vector space."
+   ]
+  },
+  {
+   "cell_type": "markdown",
+   "id": "97e81f1f",
+   "metadata": {},
+   "source": [
+    ":::{figure} https://static-1300131294.cos.ap-shanghai.myqcloud.com/images/deep-learning/NLP/cbow.png\n",
+    "---\n",
+    "name: 'continuous bag of words'\n",
+    "width: 90%\n",
+    "---\n",
+    "Continuous Bag of Words\n",
+    ":::"
+   ]
+  },
+  {
+   "cell_type": "markdown",
+   "id": "612a2f35",
+   "metadata": {},
+   "source": [
+    "The weights between the input layer and the hidden layer are learned during training, and the dimensionality of the hidden layer determines the size of the word embeddings (the continuous vector space)."
+   ]
+  },
+  {
+   "cell_type": "code",
+   "execution_count": null,
+   "id": "c51a75d9",
+   "metadata": {},
+   "outputs": [],
+   "source": [
+    "import torch\n",
+    "import torch.nn as nn\n",
+    "import torch.optim as optim\n",
+    "\n",
+    "# Define CBOW model\n",
+    "class CBOWModel(nn.Module):\n",
+    "    def __init__(self, vocab_size, embed_size):\n",
+    "        super(CBOWModel, self).__init__()\n",
+    "        self.embeddings = nn.Embedding(vocab_size, embed_size)\n",
+    "        self.linear = nn.Linear(embed_size, vocab_size)\n",
+    "\n",
+    "    def forward(self, context):\n",
+    "        # context: (batch, 2 * context_size) -> sum the context word embeddings\n",
+    "        context_embeds = self.embeddings(context).sum(dim=1)\n",
+    "        output = self.linear(context_embeds)\n",
+    "        return output\n",
+    "\n",
+    "# Sample data; note the corpus needs at least 2 * context_size + 1 tokens,\n",
+    "# otherwise the loop below generates no (context, target) pairs at all\n",
+    "context_size = 2\n",
+    "raw_text = \"we use word embeddings because word embeddings capture word meaning\"\n",
+    "tokens = raw_text.split()\n",
+    "vocab = set(tokens)\n",
+    "word_to_index = {word: i for i, word in enumerate(vocab)}\n",
+    "data = []\n",
+    "for i in range(context_size, len(tokens) - context_size):\n",
+    "    context = [word_to_index[word] for word in tokens[i - context_size:i] + tokens[i + 1:i + context_size + 1]]\n",
+    "    target = word_to_index[tokens[i]]\n",
+    "    data.append((torch.tensor(context), torch.tensor(target)))\n",
+    "\n",
+    "# Hyperparameters\n",
+    "vocab_size = len(vocab)\n",
+    "embed_size = 10\n",
+    "learning_rate = 0.01\n",
+    "epochs = 100\n",
+    "\n",
+    "# Initialize CBOW model\n",
+    "cbow_model = CBOWModel(vocab_size, embed_size)\n",
+    "criterion = nn.CrossEntropyLoss()\n",
+    "optimizer = optim.SGD(cbow_model.parameters(), lr=learning_rate)\n",
+    "\n",
+    "# Training loop\n",
+    "for epoch in range(epochs):\n",
+    "    total_loss = 0\n",
+    "    for context, target in data:\n",
+    "        optimizer.zero_grad()\n",
+    "        output = cbow_model(context.unsqueeze(0))  # add a batch dimension\n",
+    "        loss = criterion(output, target.unsqueeze(0))\n",
+    "        loss.backward()\n",
+    "        optimizer.step()\n",
+    "        total_loss += loss.item()\n",
+    "    if (epoch + 1) % 10 == 0:\n",
+    "        print(f\"Epoch {epoch + 1}, Loss: {total_loss}\")\n",
+    "\n",
+    "# Example usage: Get embedding for a specific word\n",
+    "word_to_lookup = \"embeddings\"\n",
+    "word_index = word_to_index[word_to_lookup]\n",
+    "embedding = cbow_model.embeddings(torch.tensor([word_index]))\n",
+    "print(f\"Embedding for '{word_to_lookup}': {embedding.detach().numpy()}\")"
+   ]
+  },
+  {
+   "cell_type": "markdown",
+   "id": "f057334a",
+   "metadata": {},
+   "source": [
+    "#### Skip-Gram"
+   ]
+  },
+  {
+   "cell_type": "markdown",
+   "id": "f38b64f9",
+   "metadata": {},
+   "source": [
+    "The Skip-Gram model learns distributed representations of words in a continuous vector space. Its main objective is to predict the context words (the words surrounding a target word) given the target word. This is the opposite of the Continuous Bag of Words (CBOW) model, where the objective is to predict the target word based on its context. Empirically, this method tends to produce more meaningful embeddings.\n",
+   ]
+  },
+  {
+   "cell_type": "markdown",
+   "id": "a3f31a42",
+   "metadata": {},
+   "source": [
+    ":::{figure} https://static-1300131294.cos.ap-shanghai.myqcloud.com/images/deep-learning/NLP/skipgram.png\n",
+    "---\n",
+    "name: 'skip gram'\n",
+    "width: 90%\n",
+    "---\n",
+    "Skip Gram\n",
+    ":::"
+   ]
+  },
+  {
+   "cell_type": "markdown",
+   "id": "977b2fa3",
+   "metadata": {},
+   "source": [
+    "After applying either of the above neural embedding methods, we obtain a trained vector for each word after many iterations through the corpus. These trained vectors preserve syntactic and semantic information while living in a lower-dimensional space, and vectors with similar meanings or semantic content end up close to each other in that space.\n",
+    "\n",
+    "Let's understand this with a basic example. In the Python code below, the `vector_size` parameter controls the dimensionality of the word vectors, and you can adjust other parameters such as `window` based on your specific needs."
+   ]
+  },
+  {
+   "cell_type": "markdown",
+   "id": "87dc4e0e",
+   "metadata": {},
+   "source": [
+    ">Note: Word2Vec models can perform better with larger datasets. \n",
+    ">If you have a large corpus, you might achieve more meaningful word embeddings.\n"
+   ]
+  },
+  {
+   "cell_type": "code",
+   "execution_count": 8,
+   "id": "bbe003fd",
+   "metadata": {},
+   "outputs": [
+    {
+     "name": "stdout",
+     "output_type": "stream",
+     "text": [
+      "Vector representation of 'word': [-9.5800208e-03  8.9437785e-03  4.1664648e-03  9.2367809e-03\n",
+      "  6.6457358e-03  2.9233587e-03  9.8055992e-03 -4.4231843e-03\n",
+      " -6.8048164e-03  4.2256550e-03  3.7299085e-03 -5.6668529e-03\n",
+      "  9.7035142e-03 -3.5551414e-03  9.5499391e-03  8.3657773e-04\n",
+      " -6.3355025e-03 -1.9741615e-03 -7.3781307e-03 -2.9811086e-03\n",
+      "  1.0425397e-03  9.4814906e-03  9.3598543e-03 -6.5986011e-03\n",
+      "  3.4773252e-03  2.2767992e-03 -2.4910474e-03 -9.2290826e-03\n",
+      "  1.0267317e-03 -8.1645092e-03  6.3240929e-03 -5.8001447e-03\n",
+      "  5.5353874e-03  9.8330071e-03 -1.5987856e-04  4.5296676e-03\n",
+      " -1.8086446e-03  7.3613892e-03  3.9419360e-03 -9.0095028e-03\n",
+      " -2.3953868e-03  3.6261671e-03 -1.0080514e-04 -1.2024897e-03\n",
+      " -1.0558038e-03 -1.6681013e-03  6.0541567e-04  4.1633579e-03\n",
+      " -4.2531900e-03 -3.8336846e-03 -5.0755290e-05  2.6549282e-04\n",
+      " -1.7014991e-04 -4.7843382e-03  4.3120929e-03 -2.1710952e-03\n",
+      "  2.1056964e-03  6.6702347e-04  5.9686624e-03 -6.8418151e-03\n",
+      " -6.8183104e-03 -4.4762432e-03  9.4359247e-03 -1.5930856e-03\n",
+      " -9.4291316e-03 -5.4270827e-04 -4.4478951e-03  5.9980620e-03\n",
+      " -9.5831212e-03  2.8602476e-03 -9.2544509e-03  1.2484600e-03\n",
+      "  6.0004774e-03  7.4001122e-03 -7.6209377e-03 -6.0561695e-03\n",
+      " -6.8399287e-03 -7.9184016e-03 -9.4984965e-03 -2.1255787e-03\n",
+      " -8.3757477e-04 -7.2564054e-03  6.7876028e-03  1.1183097e-03\n",
+      "  5.8291717e-03  1.4714618e-03  7.9081533e-04 -7.3718326e-03\n",
+      " -2.1769912e-03  4.3199472e-03 -5.0856168e-03  1.1304744e-03\n",
+      "  2.8835384e-03 -1.5386029e-03  9.9318363e-03  8.3507905e-03\n",
+      "  2.4184163e-03  7.1170190e-03  5.8888551e-03 -5.5787875e-03]\n"
+     ]
+    }
+   ],
+   "source": [
+    "from gensim.models import Word2Vec\n",
+    "from nltk.tokenize import word_tokenize\n",
+    "import nltk\n",
+    "nltk.download('punkt')  # Download the tokenizer models if not already downloaded\n",
+    "\n",
+    "sample = \"Word embeddings are dense vector representations of words.\"\n",
+    "tokenized_corpus = word_tokenize(sample.lower())  # Lowercasing for consistency\n",
+    "\n",
+    "skipgram_model = Word2Vec(sentences=[tokenized_corpus],\n",
+    "                          vector_size=100,  # Dimensionality of the word vectors\n",
+    "                          window=5,         # Maximum distance between the current and predicted word within a sentence\n",
+    "                          sg=1,             # Skip-Gram model (1 for Skip-Gram, 0 for CBOW)\n",
+    "                          min_count=1,      # Ignores all words with a total frequency lower than this\n",
+    "                          workers=4)        # Number of CPU cores to use for training the model\n",
+    "\n",
+    "# Training\n",
+    "skipgram_model.train([tokenized_corpus], total_examples=1, epochs=10)\n",
+    "skipgram_model.save(\"skipgram_model.model\")\n",
+    "loaded_model = Word2Vec.load(\"skipgram_model.model\")\n",
+    "vector_representation = loaded_model.wv['word']\n",
+    "print(\"Vector representation of 'word':\", vector_representation)"
+   ]
+  },
+  {
+   "cell_type": "markdown",
+   "id": "42cda55f",
+   "metadata": {},
+   "source": [
+    "In practice, the choice between CBOW and Skip-gram often depends on the specific characteristics of the data and the task at hand. CBOW might be preferred when training resources are limited and capturing syntactic information is important. Skip-gram, on the other hand, might be chosen when semantic relationships and the representation of rare words are crucial."
+   ]
+  },
+  {
+   "cell_type": "markdown",
+   "id": "4ea1f37d",
+   "metadata": {},
+   "source": [
+    "## Pretrained word embeddings"
+   ]
+  },
+  {
+   "cell_type": "markdown",
+   "id": "59ead5c0",
+   "metadata": {},
+   "source": [
+    "Pre-trained word embeddings are representations of words that are learned from large corpora and are made available for reuse in various natural language processing (NLP) tasks. These embeddings capture semantic relationships between words, allowing the model to understand similarities and relationships between different words in a meaningful way."
+   ]
+  },
+  {
+   "cell_type": "markdown",
+   "id": "2c887969",
+   "metadata": {},
+   "source": [
+    "### GloVe"
+   ]
+  },
+  {
+   "cell_type": "markdown",
+   "id": "0b8a6505",
+   "metadata": {},
+   "source": [
+    "GloVe is trained on global word co-occurrence statistics. It leverages global context to create word embeddings that reflect the overall meaning of words based on their co-occurrence probabilities. In this method, we iterate through the corpus and record the co-occurrence of each word with every other word, which yields a co-occurrence matrix. Words which occur next to each other get a value of 1; if they are one word apart, 1/2; if two words apart, 1/3; and so on.\n",
+    "\n",
+    "Let us take an example to understand how the matrix is created. We have a small corpus:"
+   ]
+  },
+  {
+   "cell_type": "markdown",
+   "id": "ef94e71e",
+   "metadata": {},
+   "source": [
+    ">Corpus:\n",
+    ">\n",
+    ">It is a nice evening.\n",
+    ">\n",
+    ">Good Evening!\n",
+    ">\n",
+    ">Is it a nice evening?"
+   ]
+  },
+  {
+   "cell_type": "markdown",
+   "id": "394c071b",
+   "metadata": {},
+   "source": [
+    "| |it|is|a|nice|evening|good|\n",
+    "|----------|----------|----------|----------|----------|----------|----------|\n",
+    "|it|0| | | | | |\n",
+    "|is|1+1|0| | | | |\n",
+    "|a|1/2+1|1+1/2|0| | | |\n",
+    "|nice|1/3+1/2|1/2+1/3|1+1|0| | |\n",
+    "|evening|1/4+1/3|1/3+1/4|1/2+1/2|1+1|0| |\n",
+    "|good|0|0|0|0|1|0|\n"
+   ]
+  },
+  {
+   "cell_type": "markdown",
+   "id": "4bd03226",
+   "metadata": {},
+   "source": [
+    "The upper half of the matrix is simply a reflection of the lower half. We can also use a window frame to calculate the co-occurrences, shifting the frame until the end of the corpus; this helps gather information about the context in which each word is used.\n",
+    "\n",
+    "Initially, the vectors for each word are assigned randomly. Then we take pairs of vectors and see how close they are to each other in space. If two words occur together often (i.e. have a higher value in the co-occurrence matrix) but are far apart in space, their vectors are brought closer together. If they are close in space but rarely used together, their vectors are moved further apart.\n",
+    "\n",
+    "After many iterations of this process, we get a vector-space representation that approximates the information in the co-occurrence matrix. GloVe often performs better than Word2Vec at capturing both semantic and syntactic relationships.\n",
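+    "\n",
+    "As a rough sketch (assuming a symmetric window of up to 4 words and the 1/distance weighting described above; this is an illustration, not the actual GloVe training code), the weighted co-occurrence counts for this corpus could be computed like this:\n",
+    "\n",
+    "```python\n",
+    "from collections import defaultdict\n",
+    "\n",
+    "def cooccurrence(sentences, window=4):\n",
+    "    # counts[(w1, w2)] accumulates 1/d for every co-occurrence at distance d\n",
+    "    counts = defaultdict(float)\n",
+    "    for sentence in sentences:\n",
+    "        tokens = sentence.lower().strip('.!?').split()\n",
+    "        for i in range(len(tokens)):\n",
+    "            for j in range(max(0, i - window), i):\n",
+    "                pair = tuple(sorted((tokens[i], tokens[j])))\n",
+    "                counts[pair] += 1.0 / (i - j)  # weight decays with distance\n",
+    "    return counts\n",
+    "\n",
+    "corpus = [\"It is a nice evening.\", \"Good Evening!\", \"Is it a nice evening?\"]\n",
+    "for pair, weight in sorted(cooccurrence(corpus).items()):\n",
+    "    print(pair, round(weight, 2))  # e.g. ('is', 'it') 2.0, matching 1+1 in the table\n",
+    "```"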
+   ]
+  },
+  {
+   "cell_type": "code",
+   "execution_count": null,
+   "id": "ae3737fa",
+   "metadata": {},
+   "outputs": [],
+   "source": [
+    "from gensim.downloader import load\n",
+    "\n",
+    "# Load the pre-trained GloVe vectors\n",
+    "glove_model = load('glove-wiki-gigaword-50')\n",
+    "word_pairs = [('learn', 'learning'), ('india', 'indian'), ('fame', 'famous')]\n",
+    "\n",
+    "# Compute similarity for each pair of words\n",
+    "for pair in word_pairs:\n",
+    "    similarity = glove_model.similarity(pair[0], pair[1])\n",
+    "    print(f\"Similarity between '{pair[0]}' and '{pair[1]}' using GloVe: {similarity:.3f}\")"
+   ]
+  },
+  {
+   "cell_type": "markdown",
+   "id": "28c55b60",
+   "metadata": {},
+   "source": [
+    "Output:\n",
+    "\n",
+    ">Similarity between 'learn' and 'learning' using GloVe: 0.802\n",
+    ">\n",
+    ">Similarity between 'india' and 'indian' using GloVe: 0.865\n",
+    ">\n",
+    ">Similarity between 'fame' and 'famous' using GloVe: 0.589"
+   ]
+  },
+  {
+   "cell_type": "markdown",
+   "id": "e2847def",
+   "metadata": {},
+   "source": [
+    "### FastText"
+   ]
+  },
+  {
+   "cell_type": "markdown",
+   "id": "dfec4ddf",
+   "metadata": {},
+   "source": [
+    "Developed by Facebook, FastText extends Word2Vec by representing words as bags of character n-grams. This approach is particularly useful for handling out-of-vocabulary words and capturing morphological variations."
+   ]
+  },
+  {
+   "cell_type": "code",
+   "execution_count": null,
+   "id": "eebd305d",
+   "metadata": {},
+   "outputs": [],
+   "source": [
+    "import gensim.downloader as api\n",
+    "\n",
+    "# Load the pre-trained fastText model\n",
+    "fasttext_model = api.load(\"fasttext-wiki-news-subwords-300\")\n",
+    "\n",
+    "# Define word pairs to compute similarity for\n",
+    "word_pairs = [('learn', 'learning'), ('india', 'indian'), ('fame', 'famous')]\n",
+    "\n",
+    "# Compute similarity for each pair of words\n",
+    "for pair in word_pairs:\n",
+    "    similarity = fasttext_model.similarity(pair[0], pair[1])\n",
+    "    print(f\"Similarity between '{pair[0]}' and '{pair[1]}' using FastText: {similarity:.3f}\")"
+   ]
+  },
+  {
+   "cell_type": "markdown",
+   "id": "157465a1",
+   "metadata": {},
+   "source": [
+    "Output:\n",
+    "\n",
+    ">Similarity between 'learn' and 'learning' using FastText: 0.642\n",
+    ">\n",
+    ">Similarity between 'india' and 'indian' using FastText: 0.708\n",
+    ">\n",
+    ">Similarity between 'fame' and 'famous' using FastText: 0.519"
+   ]
+  },
+  {
+   "cell_type": "markdown",
+   "id": "81b1af62",
+   "metadata": {},
+   "source": [
+    "### BERT (Bidirectional Encoder Representations from Transformers)"
+   ]
+  },
+  {
+   "cell_type": "markdown",
+   "id": "5c151177",
+   "metadata": {},
+   "source": [
+    "BERT is a transformer-based model that learns contextualized embeddings for words. It takes the entire context of a word into account by looking at both the left and right contexts, resulting in embeddings that capture rich contextual information."
+   ]
+  },
+  {
+   "cell_type": "code",
+   "execution_count": null,
+   "id": "332b65ea",
+   "metadata": {},
+   "outputs": [],
+   "source": [
+    "from transformers import BertTokenizer, BertModel\n",
+    "import torch\n",
+    "\n",
+    "# Load pre-trained BERT model and tokenizer\n",
+    "model_name = 'bert-base-uncased'\n",
+    "tokenizer = BertTokenizer.from_pretrained(model_name)\n",
+    "model = BertModel.from_pretrained(model_name)\n",
+    "\n",
+    "word_pairs = [('learn', 'learning'), ('india', 'indian'), ('fame', 'famous')]\n",
+    "\n",
+    "# Compute similarity for each pair of words\n",
+    "for pair in word_pairs:\n",
+    "    tokens = tokenizer(list(pair), padding=True, return_tensors='pt')\n",
+    "    with torch.no_grad():\n",
+    "        outputs = model(**tokens)\n",
+    "\n",
+    "    # Extract embeddings for the [CLS] token\n",
+    "    cls_embedding = outputs.last_hidden_state[:, 0, :]\n",
+    "\n",
+    "    similarity = torch.nn.functional.cosine_similarity(cls_embedding[0], cls_embedding[1], dim=0)\n",
+    "\n",
+    "    print(f\"Similarity between '{pair[0]}' and '{pair[1]}' using BERT: {similarity:.3f}\")"
+   ]
+  },
+  {
+   "cell_type": "markdown",
+   "id": "a58c5b0d",
+   "metadata": {},
+   "source": [
+    "Output:\n",
+    "\n",
+    ">Similarity between 'learn' and 'learning' using BERT: 0.930\n",
+    ">\n",
+    ">Similarity between 'india' and 'indian' using BERT: 0.957\n",
+    ">\n",
+    ">Similarity between 'fame' and 'famous' using BERT: 0.956"
+   ]
+  },
+  {
+   "cell_type": "markdown",
+   "id": "66f947fb",
+   "metadata": {},
+   "source": [
+    "## Considerations for Deploying Word Embedding Models"
+   ]
+  },
+  {
+   "cell_type": "markdown",
+   "id": "6a613911",
+   "metadata": {},
+   "source": [
+    "+ You need to use exactly the same preprocessing pipeline when deploying your model as was used to create the training data for the word embedding. If you use a different tokenizer or a different method of handling whitespace, punctuation, etc., you might end up with incompatible inputs."
+   ]
+  },
+  {
+   "cell_type": "markdown",
+   "id": "5fa07fb4",
+   "metadata": {},
+   "source": [
+    "+ Your input may contain words that do not have a pre-trained vector. Such words are known as out-of-vocabulary (OOV) words. What you can do is replace them with a special “UNK” (unknown) token and handle them separately."
+   ]
+  },
+  {
+   "cell_type": "markdown",
+   "id": "6a10d779",
+   "metadata": {},
+   "source": [
+    "+ Dimension mismatch: vectors can be of many lengths. If you train a model with vectors of length, say, 400 and then try to apply vectors of length 1000 at inference time, you will run into errors. So make sure to use the same dimensions throughout."
+   ]
+  },
+  {
+   "cell_type": "markdown",
+   "id": "d15dc1cd",
+   "metadata": {},
+   "source": [
+    "## Advantages and Disadvantages of Word Embeddings"
+   ]
+  },
+  {
+   "cell_type": "markdown",
+   "id": "bfef3ca2",
+   "metadata": {},
+   "source": [
+    "### Advantages"
+   ]
+  },
+  {
+   "cell_type": "markdown",
+   "id": "e3250120",
+   "metadata": {},
+   "source": [
+    "+ It is much faster to train than hand-built models like WordNet (which uses graph embeddings).\n",
+    "+ Almost all modern NLP applications start with an embedding layer.\n",
+    "+ It stores an approximation of meaning."
+   ]
+  },
+  {
+   "cell_type": "markdown",
+   "id": "9c43e912",
+   "metadata": {},
+   "source": [
+    "### Disadvantages"
+   ]
+  },
+  {
+   "cell_type": "markdown",
+   "id": "e57b362a",
+   "metadata": {},
+   "source": [
+    "+ It can be memory intensive.\n",
+    "+ It is corpus dependent: any bias in the underlying corpus will carry over into your model.\n",
+    "+ It cannot distinguish between homophones, e.g. brake/break, cell/sell, weather/whether."
+   ]
+  },
+  {
+   "cell_type": "markdown",
+   "id": "f18cb460",
+   "metadata": {},
+   "source": [
+    "## Conclusion"
+   ]
+  },
+  {
+   "cell_type": "markdown",
+   "id": "88044847",
+   "metadata": {},
+   "source": [
+    "In conclusion, word embedding techniques such as TF-IDF, Word2Vec, and GloVe play a crucial role in natural language processing by representing words in a lower-dimensional space, capturing semantic and syntactic information."
+   ]
+  },
+  {
+   "cell_type": "markdown",
+   "id": "13556362",
+   "metadata": {},
+   "source": [
+    "## Your turn! 🚀\n",
+    "\n",
+    "Assignment - [News topic classification tasks](../../assignments/deep-learning/nlp/news-topic-classification-tasks.ipynb)\n",
+    "\n",
+    "## Acknowledgments\n",
+    "\n",
+    "Thanks to [GeeksforGeeks](https://auth.geeksforgeeks.org/user/shristikotaiah/articles?utm_source=geeksforgeeks&utm_medium=article_author&utm_campaign=auth_user) for creating the open-source project [Word Embeddings in NLP](https://www.geeksforgeeks.org/word-embeddings-in-nlp). It inspired the majority of the content in this chapter."
+   ]
+  }
+ ],
+ "metadata": {
+  "kernelspec": {
+   "display_name": "Python 3 (ipykernel)",
+   "language": "python",
+   "name": "python3"
+  },
+  "language_info": {
+   "codemirror_mode": {
+    "name": "ipython",
+    "version": 3
+   },
+   "file_extension": ".py",
+   "mimetype": "text/x-python",
+   "name": "python",
+   "nbconvert_exporter": "python",
+   "pygments_lexer": "ipython3",
+   "version": "3.9.13"
+  }
+ },
+ "nbformat": 4,
+ "nbformat_minor": 5
+}