COMP5046 Assignment 1 [Individual Assessment] (20 marks)

Submission Due: May 2nd, 2021 (11:59PM)

[XXX] = Lecture/Lab Reference
(Justify your decision) = Please justify your decision/selection in the documentation. You must show your final decision in your report with empirical evidence.
(Explain the performance) = Please explain the trend in performance and the reason (or your opinion) why the trend appears.

Sentiment Analysis using Recurrent Neural Networks

In Assignment 1, we will focus on developing a sentiment analysis model using Recurrent Neural Networks (RNN).
Sentiment analysis [Lecture5] is the contextual mining of text, which identifies and extracts subjective information in source material and helps a business understand the social sentiment of its brand, product, or service while monitoring online conversations.

The details of each implementation step are specified in the following sections. Note that the lab exercises are a good starting point for the assignment. The useful lab exercises are specified in each section.

1. Data Preprocessing (2 marks)

In this assignment, you are to use NLTK's Twitter_Sample dataset. Twitter is a well-known microblogging service that allows public data to be collected via APIs. NLTK's Twitter corpus currently contains a sample of Tweets retrieved from the Twitter Streaming API. For more detailed information about nltk.corpus, please check the NLTK corpus website.
The dataset contains Twitter posts (tweets) along with their associated binary sentiment polarity labels. Both the training and testing sets are provided as pickle files (testing_data.pkl, training_data.pkl) and can be downloaded from Google Drive using the code provided in the Assignment 1 Template ipynb.

In this Data Preprocessing section, you are required to complete the following section in the format:

Preprocess data: You are asked to pre-process the training set by integrating several text pre-processing techniques [Lab5] (e.g. tokenisation, removing numbers, converting to lowercase, removing stop words, stemming, etc.). You should justify why you apply each specific pre-processing technique. (Justify your decision)
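As an illustration of how several of these techniques can be combined, here is a minimal sketch using only the Python standard library. The function name `preprocess_tweet`, the regular expressions, and the tiny stop-word list are assumptions made for this example, not part of the assignment template; NLTK's tokenisers, stop-word corpus, and stemmers are fuller alternatives.

```python
import re

# Tiny illustrative stop-word list (an assumption for this sketch).
STOP_WORDS = {"a", "an", "the", "is", "are", "to", "and", "of", "in", "it"}

URL_RE = re.compile(r"https?://\S+")
MENTION_RE = re.compile(r"@\w+")
TOKEN_RE = re.compile(r"[a-z]+(?:'[a-z]+)?")  # alphabetic tokens, optional apostrophe


def preprocess_tweet(text, remove_stop_words=True):
    """Lowercase, strip URLs and @mentions, tokenise, optionally drop stop words."""
    text = text.lower()
    text = URL_RE.sub(" ", text)
    text = MENTION_RE.sub(" ", text)
    # Only alphabetic tokens match, so numbers and punctuation are dropped here.
    tokens = TOKEN_RE.findall(text)
    if remove_stop_words:
        tokens = [t for t in tokens if t not in STOP_WORDS]
    return tokens


print(preprocess_tweet("@user The movie is GREAT!!! 10/10 http://t.co/abc"))
# ['movie', 'great']
```

Note that this sketch also discards emoticons, which may carry sentiment cues in tweets; whether to keep them is exactly the kind of choice the (Justify your decision) tag asks you to defend.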

2. Model Implementation (7 marks)

In this section, you are to implement three components: a Word Embedding module, a Lexicon Embedding module, and a Bi-directional RNN Sequence Model. For training, you are free to choose the hyperparameters [Lab2,Lab4,Lab5] (e.g. dimension of embeddings, learning rate, epochs, etc.).

The model architecture can be found in [Lecture5].

1) Word Embedding (2 marks)

First, you are asked to build a word embedding model (for representing word vectors, such as word2vec-CBOW, word2vec-Skip-gram, fastText, or GloVe) for the input embedding of your sequence model [Lab2]. Note that we used one-hot vectors as inputs for the sequence model in Lab3 and Lab4. You are required to complete the following sections in the format:

Preprocess data for word embeddings: You are to use and preprocess the NLTK Twitter dataset (the one provided in Section 1) and/or any other dataset (e.g. TED talks, Google News) for word embeddings [Lab2]. The preprocessing can differ from the technique you used in Section 1. You can use both the training and testing datasets to train the word embedding. (Justify your decision)
Build training model for word embeddings: You are to build a training model for word embeddings. You are required to articulate the hyperparameters [Lab2] you chose (dimension of embeddings, window size, learning rate, etc.). Note that any word embedding model [Lab2] (e.g. word2vec-CBOW, word2vec-Skip-gram, fastText, GloVe) can be applied. (Justify your decision)
Train model: You are to train the model.
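To make the window size hyperparameter concrete, the sketch below shows how Skip-gram training pairs could be generated from a tokenised sentence. It covers only the pair-generation step, not the embedding training itself, and the function name `skipgram_pairs` is a hypothetical helper introduced for this example.

```python
def skipgram_pairs(tokens, window_size=2):
    """Return (centre, context) pairs for tokens within window_size positions."""
    pairs = []
    for i, centre in enumerate(tokens):
        start = max(0, i - window_size)
        end = min(len(tokens), i + window_size + 1)
        for j in range(start, end):
            if j != i:  # a word is not its own context
                pairs.append((centre, tokens[j]))
    return pairs


print(skipgram_pairs(["good", "movie", "tonight"], window_size=1))
# [('good', 'movie'), ('movie', 'good'), ('movie', 'tonight'), ('tonight', 'movie')]
```

A larger window produces more pairs per sentence and tends to capture broader topical context, while a smaller window focuses on immediate neighbours; comparing a few values is one way to support the justification.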

2) Lexicon Embedding (2 marks)

Next, you are to check whether each word is in the positive or negative lexicon. In this assignment, we will use the Opinion Lexicon (if you cannot download it, right-click the link and open it in a new page, or download it directly from the data folder in this GitHub repository), which includes a list of English positive and negative opinion (sentiment) words: 2006 positive and 4783 negative words.
Each word needs to be converted into a one-dimensional categorical embedding with three categories: not_exist(0), negative(1), and positive(2). These categories (0, 1, 2) will be used as input to the Section 2.3 Bi-directional RNN Sequence Model.
NOTE: If you want to use an embedding with more than one dimension, or a non-categorical embedding, please (Justify your decision)
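The category mapping described above can be sketched as follows. The toy word sets stand in for the Opinion Lexicon files, and the names `lexicon_ids`, `positive_words`, and `negative_words` are assumptions for this example rather than identifiers from the template.

```python
NOT_EXIST, NEGATIVE, POSITIVE = 0, 1, 2


def lexicon_ids(tokens, positive_words, negative_words):
    """Map each token to 0 (not in lexicon), 1 (negative), or 2 (positive)."""
    ids = []
    for token in tokens:
        if token in positive_words:
            ids.append(POSITIVE)
        elif token in negative_words:
            ids.append(NEGATIVE)
        else:
            ids.append(NOT_EXIST)
    return ids


# Toy lexicon (assumption); the real lists would be loaded from the Opinion Lexicon.
positive_words = {"good", "great", "love"}
negative_words = {"bad", "awful", "hate"}

print(lexicon_ids(["i", "love", "this", "awful", "movie"], positive_words, negative_words))
# [0, 2, 0, 1, 0]
```

If a word appeared in both sets, this sketch would label it positive because that check runs first; that tie-breaking rule is a design choice of the example, not a requirement of the assignment.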

3) Bi-directional RNN Sequence Model (3 marks)

Finally, you are asked to build a Many-to-One (N to 1) sequence model to detect the sentiment/emotion. Note that your model should be the best model selected from the evaluation (discussed in Section 3, Evaluation). You are required to implement the following functions:

Apply/Import Word and Lexicon Embedding as an input: You are to concatenate the trained word embedding and the lexicon embedding, and apply the result to the sequence model.
Build training sequence model: You are to build a Bi-directional RNN-based (Bi-RNN, Bi-LSTM, or Bi-GRU) Many-to-One (N to One) sequence model (N: words, One: sentiment, Positive or Negative). You are required to describe how the hyperparameters [Lab4,Lab5] (number of epochs, learning rate, etc.) were decided. (Justify your decision)
Train model: While the model is being trained, you are required to display the training loss and the number of epochs. [Lab4,Lab5]
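The concatenation step can be illustrated with plain Python lists. This is a minimal sketch that assumes word vectors are stored in a dictionary and that out-of-vocabulary words receive a zero vector; `build_inputs` and its parameters are hypothetical names, and a real implementation would typically build tensors for the chosen deep learning framework instead.

```python
def build_inputs(tokens, word_vectors, lexicon_categories, embedding_dim):
    """Concatenate each token's word vector with its 1-D lexicon category."""
    if len(tokens) != len(lexicon_categories):
        raise ValueError("tokens and lexicon_categories must have the same length")
    unknown = [0.0] * embedding_dim  # assumption: zero vector for unseen words
    sequence = []
    for token, category in zip(tokens, lexicon_categories):
        vector = list(word_vectors.get(token, unknown))
        sequence.append(vector + [float(category)])
    return sequence


word_vectors = {"good": [0.1, 0.2], "movie": [0.3, 0.4]}
inputs = build_inputs(["good", "movie", "zzz"], word_vectors, [2, 0, 0], embedding_dim=2)
print(inputs)
# [[0.1, 0.2, 2.0], [0.3, 0.4, 0.0], [0.0, 0.0, 0.0]]
```

Each time step then has `embedding_dim + 1` features, which would be the input size of the bi-directional RNN layer under these assumptions.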

Note that this section will not be marked if you do not display the training loss and the number of epochs in the Assignment 1 ipynb.

3. Evaluation (7 marks)

After completing all model training (in Sections 1 and 2), you should evaluate two points: 1) Word Embedding Evaluation and 2) Sentiment Analysis Performance Prediction (apply the trained model to the test set).

Word Embedding Evaluation (3 marks): Intrinsic Evaluation [Lecture3] - You are required to apply Semantic-Syntactic word relationship tests to assess how well the embeddings capture a wide variety of relationships. The example code is provided here - Word Embedding Intrinsic Evaluation (this is discussed and explained in the [Lecture5 Recording]). You are also to visualise the result (examples can be found in Table 2 and Figure 2 of the original GloVe paper). (Explain the performance)
Performance Evaluation (2 marks): You are to report the precision, recall, and F1 [Lab4] of your model in a table. (Explain the performance)
Hyperparameter Testing (2 marks): You are to provide a line graph that shows the hyperparameter testing (with the test dataset) and explain the optimal number of epochs based on the learning rate you choose. You can have multiple graphs with different learning rates. In the graph, the x-axis is the number of epochs and the y-axis is the F1. (Explain the performance)
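For the Word Embedding Evaluation item, a Semantic-Syntactic relationship test is commonly run as an analogy task: for "a is to b as c is to ?", the answer is taken as the word whose vector is closest (by cosine similarity) to b - a + c. The sketch below uses hand-made two-dimensional toy vectors purely for illustration; the function names are hypothetical and this is not the provided example code.

```python
import math


def cosine(u, v):
    """Cosine similarity between two equal-length vectors (0.0 if either is zero)."""
    dot = sum(a * b for a, b in zip(u, v))
    norm = math.sqrt(sum(a * a for a in u)) * math.sqrt(sum(b * b for b in v))
    return dot / norm if norm else 0.0


def solve_analogy(vectors, a, b, c):
    """Answer 'a is to b as c is to ?' with the nearest neighbour of b - a + c."""
    target = [vb - va + vc for va, vb, vc in zip(vectors[a], vectors[b], vectors[c])]
    candidates = (w for w in vectors if w not in (a, b, c))  # exclude the question words
    return max(candidates, key=lambda w: cosine(vectors[w], target))


def analogy_accuracy(vectors, questions):
    """Fraction of (a, b, c, expected) questions answered correctly."""
    correct = sum(solve_analogy(vectors, a, b, c) == d for a, b, c, d in questions)
    return correct / len(questions)


# Toy vectors (assumption); real vectors would come from the trained embedding.
vectors = {
    "man": [1.0, 0.0], "woman": [1.0, 1.0],
    "king": [3.0, 0.0], "queen": [3.0, 1.0],
    "apple": [0.0, -1.0],
}
print(solve_analogy(vectors, "man", "woman", "king"))
# queen
```

Reporting `analogy_accuracy` separately for semantic and syntactic question sets would give the per-category numbers that a results table needs.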

Note that the evaluation will not be marked if you do not display it in the ipynb file.
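The precision, recall, and F1 required for the Performance Evaluation item follow directly from the counts of true positives, false positives, and false negatives. The sketch below is a minimal standard-library version, assuming labels are encoded as 1 for positive and 0 for negative; `precision_recall_f1` is a hypothetical helper, and a library such as scikit-learn offers equivalent functions.

```python
def precision_recall_f1(y_true, y_pred, positive_label=1):
    """Compute precision, recall, and F1 for the given positive label."""
    pairs = list(zip(y_true, y_pred))
    tp = sum(t == positive_label and p == positive_label for t, p in pairs)
    fp = sum(t != positive_label and p == positive_label for t, p in pairs)
    fn = sum(t == positive_label and p != positive_label for t, p in pairs)
    precision = tp / (tp + fp) if tp + fp else 0.0
    recall = tp / (tp + fn) if tp + fn else 0.0
    total = precision + recall
    f1 = 2 * precision * recall / total if total else 0.0
    return precision, recall, f1


y_true = [1, 1, 1, 0, 0, 0]
y_pred = [1, 1, 0, 1, 0, 0]
precision, recall, f1 = precision_recall_f1(y_true, y_pred)
print(f"precision={precision:.3f} recall={recall:.3f} f1={f1:.3f}")
# precision=0.667 recall=0.667 f1=0.667
```

Calling this once per epoch on the test set would yield the F1 values for the y-axis of the hyperparameter testing graph.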

4. Documentation (4 marks)

In Sections 1, 2, and 3, you are required to describe and justify any decisions you made for the final implementation. The tags (Justify your decision) and (Explain the performance) mark the points where you should justify the purpose of applying a specific technique/model or explain the performance.
For example, for Section 1 (preprocess data), you need to describe which pre-processing techniques (removing numbers, converting to lowercase, removing stop words, stemming, etc.) were applied and justify your decision (the purpose of choosing the specific pre-processing techniques, and the benefit of using each technique or the combination of techniques for your model) in your ipynb file.

Submission Instruction

Submit an ipynb file (file name: your_unikey_COMP5046_Ass1.ipynb) that contains all of the above sections (Sections 1, 2, 3, and 4).
The ipynb template can be found in the Assignment 1 template.

FAQ

Question: What do I need to write in the justification? How much do I need to articulate?
Answer: As you can see in the 'Read me' section of the ipynb Assignment 1 template, visualising the comparison of different testing results is a good way to justify your decision. You can also justify it in other ways (besides comparing different models), such as presenting a theoretical comparison or testing different hyperparameters.

Question: Is there any marking scheme/marking criteria available for assignment 1?
Answer: The assignment specification is extremely detailed. The marking will be conducted based on the specification.

Question: My Word Embedding/Sentiment Analysis performs really badly (low accuracy). What did I do wrong?
Answer: Please don't worry about the low accuracy: our training dataset is very small, and your model is a very basic deep learning model.

Question: Do I need to use only the NLTK Twitter dataset for training the word embedding?
Answer: No. As mentioned in Lecture 5 (assignment 1 specification description), you can use any dataset (including TED, Google News) or the NLTK Twitter dataset for training your word embedding. Word embedding training only learns the word meaning space, so you can use any data. Note: Training the word embedding is different from training the Bi-RNN prediction model for sentiment analysis. For the Bi-RNN sentiment analysis model, you should use only the training dataset (from the NLTK Twitter dataset that we provided in the assignment 1 template).

If you have any questions, please come to LiveQA and post them on Edstem anytime!
