diff --git a/.gitignore b/.gitignore index a55e00c..3fe0e8e 100644 --- a/.gitignore +++ b/.gitignore @@ -105,3 +105,7 @@ venv.bak/ .DS_Store .idea/ + +#sean_fork +sean_notes.txt +results/* \ No newline at end of file diff --git a/CO_ATTN/utils.py b/CO_ATTN/utils.py index a4755d9..c330c07 100644 --- a/CO_ATTN/utils.py +++ b/CO_ATTN/utils.py @@ -75,7 +75,7 @@ def load_word_embedding_dict(embedding, embedding_path, word_alphabet, logger, e logger.info("Loading GloVe ...") embedd_dim = -1 embedd_dict = dict() - with open(embedding_path, 'r') as file: + with open(embedding_path, 'r', encoding='utf-8') as file: for line in file: line = line.strip() if len(line) == 0: diff --git a/README.md b/README.md index e7d6276..a0371e9 100644 --- a/README.md +++ b/README.md @@ -1,22 +1,54 @@ -# co-attention +# Co-Attention Code for BEA 13 paper "Co-Attention Based Neural Network for Source-Dependent Essay Scoring" ## Dependencies -python 2 for data/preprocess_asap.py (will be upgraded to python 3) +Python 2 for data/preprocess_asap.py (will be upgraded to Python 3). -python 3 for the rest +* I recommend that on installation, you *do not add to your PATH variable*. This way, it doesn't interfere with your current Python workflow. +* Then, when you need to run the preprocessing script, you'll run it something like: + * *c:/Python27/python.exe preprocess_asap.py* + +Python 3 for the rest * tensorflow 2.0.0 beta * gensim + * gensim may have more dependencies, such as VS tools * nltk * sklearn -run python2 data/preprocess_asap.py for data splitting. -Download Glove pretrained embedding from https://nlp.stanford.edu/projects/glove -Extract glove.6B.50d.txt to the glove folder -run python3 attn_network.py [options] for training and evaluation +## Running on Linux, MacOS + +1. Run *python2 data/preprocess_asap.py* for data splitting. +2. Download GloVe pretrained embedding from: *https://nlp.stanford.edu/projects/glove* +3. Extract *glove.6B.50d.txt* to the glove folder +4. Run *python3 attn_network.py [options]* for training and evaluation + +## Running on Windows + +To run on Windows, do all of the commands for Linux/MacOS. Then you'll need to remove two "\n" symbols from the preprocessing script. + +1. open *data/preprocess_asap.py* in your preferred text editor +2. on lines 28 and 31 in the preprocessing script, you'll find: *f_write.write("\r\n")* +3. remove the *\n* from both lines + +## Results + +After preprocessing the data, the program will start the training process. At the end of each epoch, the logger will +output the development and testing set scores. The highest will be kept and outputted after all epochs are complete. You can toggle the which +task (the default is ASAP3), the number of epochs (the default is 50), and more by looking at the arguments in lines 21-51 of *attn_network.py*. + +Additionally, if you want to look at specific essays with their predicted and actual scores: + +1. go to the *checkpoints* folder +2. after training, there should be a text file with one number per line. the line number corresponds to the essay number in the test data. +3. in the various fold directories, open the test.tsv file and compare with the predicted scored in from step 2. + +## Coming Soon +1. making specific essays, as well as their predicted scores and real scores, more accessible. + - likely a Python script +2. updating *the preprocessing_asap.py* to be compatible with Python 3 ## Cite If you use the code, please cite the following paper: diff --git a/requirements.txt b/requirements.txt new file mode 100644 index 0000000..f9d410c --- /dev/null +++ b/requirements.txt @@ -0,0 +1,46 @@ +absl-py==0.11.0 +astunparse==1.6.3 +cachetools==4.2.0 +certifi==2020.12.5 +chardet==4.0.0 +click==7.1.2 +Cython==0.29.14 +flatbuffers==1.12 +gast==0.3.3 +gensim==3.8.3 +google-auth==1.24.0 +google-auth-oauthlib==0.4.2 +google-pasta==0.2.0 +grpcio==1.32.0 +h5py==2.10.0 +idna==2.10 +joblib==1.0.0 +Keras-Preprocessing==1.1.2 +Markdown==3.3.3 +nltk==3.5 +numpy==1.19.5 +oauthlib==3.1.0 +opt-einsum==3.3.0 +protobuf==3.14.0 +pyasn1==0.4.8 +pyasn1-modules==0.2.8 +regex==2020.11.13 +requests==2.25.1 +requests-oauthlib==1.3.0 +rsa==4.7 +scikit-learn==0.24.1 +scipy==1.6.0 +six==1.15.0 +sklearn==0.0 +smart-open==4.1.2 +tensorboard==2.4.1 +tensorboard-plugin-wit==1.8.0 +tensorflow==2.4.0 +tensorflow-estimator==2.4.0 +termcolor==1.1.0 +threadpoolctl==2.1.0 +tqdm==4.56.0 +typing-extensions==3.7.4.3 +urllib3==1.26.2 +Werkzeug==1.0.1 +wrapt==1.12.1