diff --git a/README.md b/README.md
index 2511f8d..d308d1d 100644
--- a/README.md
+++ b/README.md
@@ -7,13 +7,13 @@ and transcribed text (and metadata) of 160+ pages that were handwritten by two n
Nicolas de Valdivia y Brisuela nearly 400 years ago. Through empirical evaluation, we demonstrate that our collection can be used to
fine-tune Spanish LLMs for tasks such as classification and masked language modeling.
-For further details, refer to our arXiv [pre-print](https://arxiv.org/pdf/2406.05812).
+
# Table of Contents
1. [Dataset](#dataset)
2. [Model](#model)
-3. [Acknowledgements](#acknowledgement)
+
# Dataset
@@ -21,14 +21,14 @@ For further details, refer to our arXiv [pre-print](https://arxiv.org/pdf/2406.0
 SANRlite had 162 pages containing 900+ sentences. Each sentence (or a group of sentences) was assigned one or more class labels and extended class labels. Extended class labels provide a fine-grained representation. There are 33 class labels and 154 extended class labels that were assigned to the sentences. To semantically enrich the JSON metadata, for each class label, we searched Wikidata [31], a popular free and open knowledge base, to extract the uniform resource identifier (URI) for the class label to precisely denote its meaning. The JSON metadata also includes the notary name, the year when the notary record was written, and the Rollo/image number. To download the dataset and use it, please follow the guidelines in [dataset-README.md](dataset/dataset-README.md)
# Model
- We used SANR to do two down-stream task of language models using bert base language model. One is sentence classification and another is masked language model. We used bert base pretrained model to perform these tasks. For classification task, we used Multilingual BERT model, which is trained on text from multiple language along with Spanish. For MLM task, we used BETO: Spanish bert model, which is specifically trained on Spanish text.
+ We used SANRlite for two downstream language-model tasks: sentence classification and masked language modeling (MLM). Both tasks were performed by fine-tuning BERT-base pretrained models. For the classification task, we used Multilingual BERT, which is trained on text from multiple languages, including Spanish. For the MLM task, we used BETO, a Spanish BERT model trained specifically on Spanish text.
## Download
-| | Model | Tokenizer |
-|:------------:|:----------------------------------------:|:-------------------------------------------:|
-| SANR Classification Model | [Model](https://mailmissouri-my.sharepoint.com/:f:/g/personal/sscx3_umsystem_edu/Em6J8fzd4KxLtVMo4YtoPywBn8OcPcG4NW1upggdcIJ5Cw?e=Gkud58) | [Tokenizer](https://mailmissouri-my.sharepoint.com/:f:/g/personal/sscx3_umsystem_edu/EkFVNqwHpDVOuFYT3hrxEEgBsG7ItzPm2NiMlbF5C1TxEQ?e=TZgkUC) |
-| SANR Masked Language Model | [Model](https://mailmissouri-my.sharepoint.com/:f:/g/personal/sscx3_umsystem_edu/El2jWbHfDs1Jtb0-bLA4BGgBCbBL_xAJ4ro65JCsCsILPg?e=j1efVP) | [Tokenizer](https://mailmissouri-my.sharepoint.com/:f:/g/personal/sscx3_umsystem_edu/EhVwk6WAcudGsvaATfGAakEB3ccN6K4DMjl8e6Mew1zBSg?e=lYlCtY) |
+| | Model and Tokenizer |
+|:------------:|:-------------------------------------------:|
+| SANRlite Classification Model | [Link](https://drive.google.com/file/d/13pMvBPLlOjcGUEWjnfWgXzpt_F9HAr3V/view?usp=sharing) |
+| SANRlite Masked Language Model | [Link](https://drive.google.com/file/d/1PNE1Hdz_vM9lXiYC0wKvG7kccUPv7NIz/view?usp=sharing) |
@@ -106,6 +106,3 @@ Install the required libraries using pip:
print(output)
-# Acknowlegdement
-This work was supported by the National Endowments for the Humanities Grant No. HAA-287903-22.
-
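The Dataset section above notes that each sentence can carry one or more of the 33 class labels, i.e., the classification task is multi-label. A minimal sketch of turning such annotations into multi-hot target vectors for fine-tuning a classifier is shown below; the label names used here are hypothetical placeholders, not the actual SANRlite labels.

```python
# Multi-hot encoding for multi-label annotations: each sentence may be
# assigned one or more class labels, so the target is a 0/1 vector with
# one position per label (33 positions for the full SANRlite label set).

def multi_hot(labels, label_to_idx):
    """Return a 0/1 vector with a 1 at each assigned label's index."""
    vec = [0] * len(label_to_idx)
    for lab in labels:
        vec[label_to_idx[lab]] = 1
    return vec

# Hypothetical subset of the label set, for illustration only.
label_set = ["sale", "will", "power-of-attorney"]
label_to_idx = {lab: i for i, lab in enumerate(label_set)}

print(multi_hot(["sale", "will"], label_to_idx))  # -> [1, 1, 0]
```

Vectors of this shape are what a BERT-style sequence-classification head with a per-label sigmoid output would be trained against.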
diff --git a/dataset/336920977-9f40fdcc-f8ed-443b-afda-866aec771730.png b/dataset/336920977-9f40fdcc-f8ed-443b-afda-866aec771730.png
new file mode 100644
index 0000000..9953a1e
Binary files /dev/null and b/dataset/336920977-9f40fdcc-f8ed-443b-afda-866aec771730.png differ
diff --git a/dataset/336921934-30880d76-b0f1-4743-8b2f-6ac0dfe22182.png b/dataset/336921934-30880d76-b0f1-4743-8b2f-6ac0dfe22182.png
new file mode 100644
index 0000000..128a672
Binary files /dev/null and b/dataset/336921934-30880d76-b0f1-4743-8b2f-6ac0dfe22182.png differ
diff --git a/dataset/dataset-README.md b/dataset/dataset-README.md
index 19eaced..077b15b 100644
--- a/dataset/dataset-README.md
+++ b/dataset/dataset-README.md
@@ -12,22 +12,18 @@ This repository contains a dataset of images from 17th century American Spanish
To view the annotations, you can use the labelImg software. Follow these steps to load the dataset and view the annotations:
1. Download and install [LabelImg](https://github.com/HumanSignal/labelImg).
-2. Clone this repository to your local machine:
- ```bash
- git clone https://github.com/raopr/SpanishNotaryCollection.git
-
-
+2. Clone this repository to your local machine.
3. The main page of LabelImg will look like the image shown below. At the beginning, you have to set the directory where your images and XML files are saved. After that, you need to set the directory where changes will be saved. To view the annotations in the LabelImg software, make sure that the scanned images and their corresponding XML files are in the same directory, as organized in the images directory. The annotations will look like the image below. The bounding boxes with the green circles in the corners represent the labeling or annotation process we performed.
-
+
## Sample of Rollos
Below are some sample images from rollos 40 and 38:
-
+
-
+
diff --git a/dataset/labelimg.png b/dataset/labelimg.png
new file mode 100644
index 0000000..f7e0c03
Binary files /dev/null and b/dataset/labelimg.png differ