17 changes: 7 additions & 10 deletions README.md
@@ -7,28 +7,28 @@ and transcribed text (and metadata) of 160+ pages that were handwritten by two n
Nicolas de Valdivia y Brisuela nearly 400 years ago. Through empirical evaluation, we demonstrate that our collection can be used to
fine-tune Spanish LLMs for tasks such as classification and masked language modeling.

For further details, refer to our arXiv [pre-print](https://arxiv.org/pdf/2406.05812).


# Table of Contents

1. [Dataset](#dataset)
2. [Model](#model)
3. [Acknowledgements](#acknowledgement)



# Dataset

SANRlite has 162 pages containing 900+ sentences. Each sentence (or group of sentences) was assigned one or more class labels and extended class labels; the extended class labels provide a finer-grained representation. In total, 33 class labels and 154 extended class labels were assigned. To semantically enrich the JSON metadata, we searched Wikidata [31], a popular free and open knowledge base, for each class label and extracted its uniform resource identifier (URI) to denote the label's meaning precisely. The JSON metadata also includes the notary name, the year in which the notary record was written, and the Rollo/image number. To download and use the dataset, follow the guidelines in [dataset-README.md](dataset/dataset-README.md).
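
As a rough illustration of how metadata of this kind might be consumed, the sketch below parses one hypothetical record and tallies its class labels. Every field name (`notary`, `year`, `rollo`, `image`, `sentences`, `class_labels`, `extended_class_labels`, `wikidata_uri`) and every label or URI value is an assumption made for this example, not the actual SANRlite schema; dataset-README.md remains the reference for the real layout.

```python
import json
from collections import Counter

# Hypothetical record: all keys, labels, and URIs are placeholders for illustration.
# "year" is left as null because the real value is not known here.
SAMPLE_RECORD = """
{
  "notary": "Nicolas de Valdivia y Brisuela",
  "year": null,
  "rollo": 40,
  "image": 1,
  "sentences": [
    {
      "text": "placeholder sentence one",
      "class_labels": [
        {"label": "label_a", "wikidata_uri": "http://www.wikidata.org/entity/Q_PLACEHOLDER_A"},
        {"label": "label_b", "wikidata_uri": "http://www.wikidata.org/entity/Q_PLACEHOLDER_B"}
      ],
      "extended_class_labels": ["label_a_detail", "label_b_detail"]
    },
    {
      "text": "placeholder sentence two",
      "class_labels": [
        {"label": "label_a", "wikidata_uri": "http://www.wikidata.org/entity/Q_PLACEHOLDER_A"}
      ],
      "extended_class_labels": ["label_a_detail"]
    }
  ]
}
"""


def count_class_labels(record: dict) -> Counter:
    """Count how often each class label occurs across all sentences of a record."""
    counts = Counter()
    for sentence in record["sentences"]:
        for entry in sentence["class_labels"]:
            counts[entry["label"]] += 1
    return counts


record = json.loads(SAMPLE_RECORD)
label_counts = count_class_labels(record)
print(record["notary"], record["rollo"], dict(label_counts))
```

The same function could be applied to each JSON file in the dataset once the key names are adjusted to the real schema.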

# Model
We used SANR for two downstream tasks with BERT-base language models: sentence classification and masked language modeling (MLM). Both tasks start from a pretrained BERT-base model. For classification, we used Multilingual BERT, which is trained on text from multiple languages, including Spanish. For MLM, we used BETO, a Spanish BERT model trained specifically on Spanish text.
We used SANRlite for two downstream tasks with BERT-base language models: sentence classification and masked language modeling (MLM). Both tasks start from a pretrained BERT-base model. For classification, we used Multilingual BERT, which is trained on text from multiple languages, including Spanish. For MLM, we used BETO, a Spanish BERT model trained specifically on Spanish text.

## Download

| | Model | Tokenizer |
|:------------:|:----------------------------------------:|:-------------------------------------------:|
| SANR Classification Model | [Model](https://mailmissouri-my.sharepoint.com/:f:/g/personal/sscx3_umsystem_edu/Em6J8fzd4KxLtVMo4YtoPywBn8OcPcG4NW1upggdcIJ5Cw?e=Gkud58) | [Tokenizer](https://mailmissouri-my.sharepoint.com/:f:/g/personal/sscx3_umsystem_edu/EkFVNqwHpDVOuFYT3hrxEEgBsG7ItzPm2NiMlbF5C1TxEQ?e=TZgkUC) |
| SANR Masked Language Model | [Model](https://mailmissouri-my.sharepoint.com/:f:/g/personal/sscx3_umsystem_edu/El2jWbHfDs1Jtb0-bLA4BGgBCbBL_xAJ4ro65JCsCsILPg?e=j1efVP) | [Tokenizer](https://mailmissouri-my.sharepoint.com/:f:/g/personal/sscx3_umsystem_edu/EhVwk6WAcudGsvaATfGAakEB3ccN6K4DMjl8e6Mew1zBSg?e=lYlCtY) |
| | Model and Tokenizer |
|:------------:|:-------------------------------------------:|
| SANRlite Classification | [Link](https://drive.google.com/file/d/13pMvBPLlOjcGUEWjnfWgXzpt_F9HAr3V/view?usp=sharing) |
| SANRlite Masked Language | [Link](https://drive.google.com/file/d/1PNE1Hdz_vM9lXiYC0wKvG7kccUPv7NIz/view?usp=sharing) |

<!-- If you wish to download and use the model and tokenizer, please follow the steps mentioned in the [model-README.md](model/model-README.md). -->

@@ -106,6 +106,3 @@ Install the required libraries using pip:
print(output)


# Acknowledgement
This work was supported by the National Endowment for the Humanities Grant No. HAA-287903-22.

12 changes: 4 additions & 8 deletions dataset/dataset-README.md
@@ -12,22 +12,18 @@ This repository contains a dataset of images from 17th century American Spanish
To view the annotations, you can use the labelImg software. Follow these steps to load the dataset and view the annotations:

1. Download and install [LabelImg](https://github.com/HumanSignal/labelImg).
2. Clone this repository to your local machine:
```bash
git clone https://github.com/raopr/SpanishNotaryCollection.git
```

2. Clone this repository to your local machine.
3. The main page of LabelImg will look like the image shown below. First, set the directory where your images and XML files are saved, then set the directory where changes will be saved. To view the annotations in LabelImg, make sure that the scanned images and their corresponding XML files are in the same directory, as organized in the images directory. The annotations will look like the image below: the bounding boxes with green circles at their corners are the annotations we created.

<img width="959" alt="Notary" src="https://github.com/raopr/SpanishNotaryCollection/assets/58792703/98732301-2875-44d8-999d-eba70bc038c4">
<img width="959" alt="Notary" src="labelimg.png">
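
LabelImg writes Pascal VOC-style XML by default, so the annotations can also be read without the GUI. The sketch below assumes that format and extracts the label and bounding box of each annotated region using only the Python standard library. The inline XML, the file name `example.jpg`, and the label `text_block` are made up for illustration; the real files in the images directory may use different labels or carry extra fields.

```python
import xml.etree.ElementTree as ET

# Hypothetical annotation in Pascal VOC layout; names and coordinates are placeholders.
SAMPLE_XML = """<annotation>
  <filename>example.jpg</filename>
  <object>
    <name>text_block</name>
    <bndbox>
      <xmin>10</xmin><ymin>20</ymin><xmax>110</xmax><ymax>220</ymax>
    </bndbox>
  </object>
  <object>
    <name>text_block</name>
    <bndbox>
      <xmin>15</xmin><ymin>240</ymin><xmax>300</xmax><ymax>400</ymax>
    </bndbox>
  </object>
</annotation>"""


def read_boxes(root):
    """Return a list of (label, (xmin, ymin, xmax, ymax)) for every <object> element."""
    boxes = []
    for obj in root.iter("object"):
        label = obj.findtext("name")
        bndbox = obj.find("bndbox")
        coords = tuple(
            int(bndbox.findtext(tag)) for tag in ("xmin", "ymin", "xmax", "ymax")
        )
        boxes.append((label, coords))
    return boxes


root = ET.fromstring(SAMPLE_XML)
boxes = read_boxes(root)
for label, (xmin, ymin, xmax, ymax) in boxes:
    print(f"{label}: ({xmin}, {ymin}) to ({xmax}, {ymax})")
```

For a file on disk, `ET.parse(path).getroot()` can replace `ET.fromstring(SAMPLE_XML)`; the rest of the function stays the same.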



## Sample of Rollos
Below are some sample images from rollos 40 and 38:

<img width="565" alt="rollo" src="https://github.com/raopr/SpanishNotaryCollection/assets/58792703/9f40fdcc-f8ed-443b-afda-866aec771730">
<img width="565" alt="rollo" src="336920977-9f40fdcc-f8ed-443b-afda-866aec771730.png">


<img width="568" alt="Rolloss" src="https://github.com/raopr/SpanishNotaryCollection/assets/58792703/30880d76-b0f1-4743-8b2f-6ac0dfe22182">
<img width="568" alt="Rolloss" src="336921934-30880d76-b0f1-4743-8b2f-6ac0dfe22182.png">

Binary file added dataset/labelimg.png