[XXX] = Lecture/Lab Reference
(Justify your decision) = Please justify your decision/selection in the documentation. You must show your final decision in your report with empirical evidence.
(Explain the performance) = Please explain the performance trend and the reason (or your opinion) why the trend looks the way it does.
In Assignment 1, we will focus on developing a sentiment analysis model using Recurrent Neural Networks (RNNs).
Sentiment analysis [Lecture5] is the contextual mining of text that identifies and extracts subjective information from source material. It helps a business understand the social sentiment around its brand, product, or service while monitoring online conversations.
Detailed information for each implementation step is given in the following sections. The lab exercises are a good starting point for the assignment, and the relevant labs are listed in each section.
In this assignment, you are to use NLTK's Twitter_Sample dataset. Twitter is a well-known microblogging service that allows public data to be collected via APIs. NLTK's Twitter corpus currently contains a sample of Tweets retrieved from the Twitter Streaming API. For more details about nltk.corpus, please check the NLTK corpus website.
The dataset contains Twitter posts (tweets) along with their associated binary sentiment polarity labels. Both the training and testing sets are provided as pickle files (training_data.pkl, testing_data.pkl) and can be downloaded from Google Drive using the code provided in the Assignment 1 Template ipynb.
In this Data Preprocessing section, you are required to complete the following item in the given format:
- Preprocess data: You are asked to pre-process the training set by combining several text pre-processing techniques [Lab5] (e.g. tokenisation, removing numbers, converting to lowercase, removing stop words, stemming, etc.). You should justify why you apply each specific preprocessing technique. (Justify your decision)
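As a rough illustration of how such techniques could be chained, the sketch below uses only the Python standard library. The tiny stop-word set, the regex tokeniser, and the function name `preprocess` are illustrative assumptions and not part of the template. Stemming is left out here, and in practice the Lab 5 tools (for example NLTK's tokenisers, stop-word lists, and stemmers) would likely replace these stand-ins.

```python
import re

# Hypothetical, deliberately tiny stop-word list for illustration only.
STOP_WORDS = {"a", "an", "the", "is", "are", "to", "of", "and", "i", "it"}


def preprocess(tweet, remove_stop_words=True):
    """Lowercase, strip URLs/mentions/numbers, tokenise, optionally drop stop words."""
    text = tweet.lower()
    text = re.sub(r"https?://\S+", " ", text)  # remove URLs
    text = re.sub(r"@\w+", " ", text)          # remove user mentions
    text = re.sub(r"\d+", " ", text)           # remove numbers
    tokens = re.findall(r"[a-z']+", text)      # keep alphabetic tokens
    if remove_stop_words:
        tokens = [t for t in tokens if t not in STOP_WORDS]
    return tokens


print(preprocess("I LOVE the 2 new phones from @brand http://t.co/xyz"))
# ['love', 'new', 'phones', 'from']
```

Each step is a separate line, so a step can be switched off and the effect on the final F1 compared, which is one way to gather the empirical evidence the (Justify your decision) tag asks for.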
In this section, you are to implement three components: a Word Embedding module, a Lexicon Embedding module, and a Bi-directional RNN Sequence Model. For training, you are free to choose the hyperparameters [Lab2,Lab4,Lab5] (e.g. dimension of embeddings, learning rate, epochs, etc.).
The model architecture can be found in [Lecture5].
First, you are asked to build a word embedding model (for representing word vectors, such as word2vec-CBOW, word2vec-Skip gram, fastText, and GloVe) for the input embedding of your sequence model [Lab2]. Note that we used one-hot vectors as inputs for the sequence model in Lab3 and Lab4. You are required to complete the following sections in the format:
- Preprocess data for word embeddings: You are to use and preprocess the NLTK Twitter dataset (the one provided in Section 1) and/or any other dataset (e.g. TED talk, Google News) for word embeddings [Lab2]. The preprocessing can differ from the technique that you used in Section 1. You can use both the training and testing datasets to train the word embedding. (Justify your decision)
- Build training model for word embeddings: You are to build a training model for word embeddings. You are required to articulate the hyperparameters [Lab2] you chose (dimension of embeddings, window size, learning rate, etc.). Note that any word embedding model [Lab2] (e.g. word2vec-CBOW, word2vec-Skip gram, fastText, GloVe) can be applied. (Justify your decision)
- Train model: You are to train the model.
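To make the role of the window size hyperparameter concrete, the following sketch generates the (centre, context) pairs that a skip-gram model would be trained on. It is a standard-library illustration that assumes skip-gram is the chosen variant; `skipgram_pairs` is a hypothetical helper name, and an actual embedding model (for example one built as in Lab 2) would then learn vectors from these pairs.

```python
def skipgram_pairs(tokens, window_size=2):
    """Return (centre, context) pairs within `window_size` tokens on each side."""
    pairs = []
    for i, centre in enumerate(tokens):
        start = max(0, i - window_size)
        end = min(len(tokens), i + window_size + 1)
        for j in range(start, end):
            if j != i:  # a word is not its own context
                pairs.append((centre, tokens[j]))
    return pairs


sentence = ["great", "phone", "love", "it"]
for pair in skipgram_pairs(sentence, window_size=1):
    print(pair)
```

A larger window produces more pairs per sentence and tends to capture broader topical context, while a smaller window focuses on immediate neighbours. This trade-off is one thing a justification of the window size could discuss.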
Then, you are to check whether each word is in the positive or negative lexicon. In this assignment, we will use the Opinion Lexicon (if you cannot download it, please right-click and open it in a new page, or download it directly from the data folder in this GitHub repository), which includes a list of English positive and negative opinion words, or sentiment words (2006 positive and 4783 negative words).
Each word needs to be converted into a one-dimensional categorical embedding with three categories: not_exist(0), negative(1), and positive(2).
These 0, 1, 2 categories will be used as input to the Section 2.3 Bi-directional RNN Sequence model.
NOTE: If you want to use an embedding with more than one dimension, or a non-categorical embedding, please (Justify your decision)
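The category lookup described above could be sketched as follows. The two small word sets are stand-ins for the real Opinion Lexicon lists, and the function name `lexicon_categories` is a hypothetical choice for illustration. How the lexicon files are actually loaded is left out.

```python
NOT_EXIST, NEGATIVE, POSITIVE = 0, 1, 2


def lexicon_categories(tokens, positive_words, negative_words):
    """Map each token to 0 (not in lexicon), 1 (negative), or 2 (positive)."""
    categories = []
    for token in tokens:
        if token in positive_words:
            categories.append(POSITIVE)
        elif token in negative_words:
            categories.append(NEGATIVE)
        else:
            categories.append(NOT_EXIST)
    return categories


# Tiny stand-in sets; the real lists come from the Opinion Lexicon files.
positive_words = {"good", "love", "great"}
negative_words = {"bad", "hate", "awful"}

print(lexicon_categories(["i", "love", "this", "awful", "phone"],
                         positive_words, negative_words))
# [0, 2, 0, 1, 0]
```

Storing the lexicon as Python sets keeps each lookup constant-time, which matters little for this dataset but keeps the code simple.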
- Apply/Import Word and Lexicon Embedding as an input: You are to concatenate the trained word embedding and the lexicon embedding, and apply the result as the input to the sequence model.
- Build training sequence model: You are to build the Bi-directional RNN-based (Bi-RNN or Bi-LSTM or Bi-GRU) Many-to-One (N to One) sequence model (N: word, One: Sentiment - Positive or Negative). You are required to describe how hyperparameters [Lab4,Lab5] (the Number of Epochs, learning rate, etc.) were decided. (Justify your decision)
- Train model: While the model is being trained, you are required to display the Training Loss and the Number of Epochs. [Lab4,Lab5]
Note that this part will not be marked if the Training Loss and the Number of Epochs are not displayed in the Assignment 1 ipynb.
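As a sketch of the concatenation step in the first bullet above, the snippet below appends each token's lexicon category to its word vector. The 3-dimensional vectors are hand-written placeholders rather than trained embeddings, `build_inputs` is a hypothetical helper name, and the zero-vector fallback for out-of-vocabulary words is an assumption that would need its own justification.

```python
def build_inputs(tokens, word_vectors, lexicon_ids, embedding_dim):
    """Concatenate each token's word vector with its lexicon category."""
    unknown = [0.0] * embedding_dim  # assumed fallback for out-of-vocabulary words
    sequence = []
    for token, lex_id in zip(tokens, lexicon_ids):
        vector = word_vectors.get(token, unknown)
        sequence.append(list(vector) + [float(lex_id)])
    return sequence


# Placeholder 3-dimensional vectors standing in for trained embeddings.
word_vectors = {"love": [0.2, 0.7, -0.1], "phone": [0.5, -0.3, 0.4]}
tokens = ["love", "this", "phone"]
lexicon_ids = [2, 0, 0]  # positive, not_exist, not_exist

inputs = build_inputs(tokens, word_vectors, lexicon_ids, embedding_dim=3)
for row in inputs:
    print(row)
```

Each row has `embedding_dim + 1` features, so that value would be the per-step input size of the Bi-RNN, Bi-LSTM, or Bi-GRU layer.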
After completing all model training (in Sections 1 and 2), you should evaluate two points: 1) Word Embedding Evaluation and 2) Sentiment Analysis Performance Prediction (apply the trained model to the test set).
- Word Embedding Evaluation (3 marks): Intrinsic Evaluation [Lecture3] - You are required to apply Semantic-Syntactic word relationship tests to understand a wide variety of relationships. The example code is provided here - Word Embedding Intrinsic Evaluation (this is discussed and explained in the [Lecture5 Recording]). You are also to visualise the result (an example can be found in Table 2 and Figure 2 of the original GloVe paper). (Explain the performance)
- Performance Evaluation (2 marks): You are to report the precision, recall, and F1 [Lab4] of your model in a table. (Explain the performance)
- Hyperparameter Testing (2 marks): You are to provide a line graph that shows the hyperparameter testing (with the test dataset) and explain the optimal number of epochs based on the learning rate you chose. You can have multiple graphs with different learning rates. In the graph, the x-axis is the number of epochs and the y-axis is the F1. (Explain the performance)
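For reference, precision, recall, and F1 for a binary task can be computed directly from the true and predicted labels, as in the sketch below. It assumes labels are encoded as 0 and 1 with 1 as the positive class; the labels shown are made-up placeholders, not results from the dataset. A library routine such as the one used in Lab 4 would normally be used instead.

```python
def precision_recall_f1(y_true, y_pred, positive_label=1):
    """Compute precision, recall, and F1 for the given positive class."""
    pairs = list(zip(y_true, y_pred))
    tp = sum(1 for t, p in pairs if t == positive_label and p == positive_label)
    fp = sum(1 for t, p in pairs if t != positive_label and p == positive_label)
    fn = sum(1 for t, p in pairs if t == positive_label and p != positive_label)
    precision = tp / (tp + fp) if (tp + fp) else 0.0
    recall = tp / (tp + fn) if (tp + fn) else 0.0
    if precision + recall == 0:
        return precision, recall, 0.0
    f1 = 2 * precision * recall / (precision + recall)
    return precision, recall, f1


# Placeholder labels for illustration only.
y_true = [1, 1, 1, 0, 0, 0]
y_pred = [1, 1, 0, 1, 0, 0]

precision, recall, f1 = precision_recall_f1(y_true, y_pred)
print(f"precision={precision:.3f} recall={recall:.3f} f1={f1:.3f}")
# precision=0.667 recall=0.667 f1=0.667
```

Calling this function on the test set after every epoch yields the F1 values for the y-axis of the hyperparameter graph.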
In Sections 1, 2, and 3, you are required to describe and justify any decisions you made for the final implementation. The tags (Justify your decision) and (Explain the performance) mark the points where you should justify the purpose of applying a specific technique/model or explain the performance.
For example, for Section 1 (preprocess data), you need to describe which pre-processing techniques (removing numbers, converting to lowercase, removing stop words, stemming, etc.) were applied and justify your decision (the purpose of choosing a specific pre-processing technique, and the benefit of using that technique or combination of techniques for your AI) in your ipynb file.
Submit an ipynb file (file name: your_unikey_COMP5046_Ass1.ipynb) that contains all of the above sections (Sections 1, 2, 3, and 4).
The ipynb template can be found in the Assignment 1 template.
Question: What do I need to write in the justification? How much do I need to articulate?
Answer: As you can see in the 'Read me' section of the ipynb Assignment 1 template, visualising a comparison of different test results is a good way to justify your decision. You can also justify it another way (other than comparing different models), for example with a theoretical comparison or by trying different hyperparameters.
Question: Is there any marking scheme/marking criteria available for assignment 1?
Answer: The assignment specification is extremely detailed. The marking will be conducted based on the specification.
Question: My Word Embedding/Sentiment Analysis performs really badly (low accuracy). What did I do wrong?
Answer: Please don't worry about the low accuracy, as our training dataset is very small and your model is a very basic deep learning model.
Question: Do I need to use only NLTKTwitter dataset for training the word embedding?
Answer: No. As mentioned in Lecture 5 (assignment 1 specification description), you can use any dataset (including TED, Google News) or the NLTK Twitter dataset for training your word embedding. Word embedding training only learns the word meaning space, so you can use any data.
Note: Training the word embedding is different from training the Bi-RNN prediction model for sentiment analysis. For training the Bi-RNN sentiment analysis model, you should use only the training dataset (from the NLTK Twitter dataset that we provided in the Assignment 1 template).
