Conversation
|
Thanks for this PR. The database runs OK. I'm holding off on accepting this PR because I would like to suggest two things (if they are easy to incorporate; otherwise, we can add them afterward).
For EUconst: For MultiUN: For Europarl: *I extracted this info from their respective OPUS webpages. I'm tagging @tsamardzic and @christianbentz here in case they have more comments. |
|
Hi, thank you for your feedback. Regarding the first part (splitting files from the same source), the split follows the crawling instructions provided, the same ones used for the professional genre last year. Essentially, each document provided is sampled and saved in a new file if it has fewer than 50k tokens. If it has more than that, it is split into multiple files. However, I am not entirely sure that all the data can be stored in 100 files, as the metadata (year published/composed, sample type, source, etc.) would no longer be correct for those files. The second part (extended information about the source) can be easily incorporated. I just need to rerun the script, which can take a bit of time. Just to confirm, should this information be in the metadata under #comments? |
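For reference, the splitting rule described above can be sketched roughly as follows. This is a minimal illustration and not the actual crawling script: the function name `split_into_chunks`, the whitespace tokenization, and the handling of a document with exactly 50k tokens (kept as one file here) are all assumptions.

```python
def split_into_chunks(tokens, max_tokens=50_000):
    """Split a token list into consecutive chunks of at most max_tokens.

    A document at or under the limit yields a single chunk (one file);
    a longer document yields several chunks (several files).
    """
    if max_tokens <= 0:
        raise ValueError("max_tokens must be positive")
    return [tokens[i:i + max_tokens] for i in range(0, len(tokens), max_tokens)]


# Hypothetical usage: whitespace tokenization is an assumption.
document = "a short sample document"
chunks = split_into_chunks(document.split())
print(len(chunks))  # number of output files this document would need
```

Each chunk would then be written to its own file, carrying the metadata of the document it came from.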
|
Yes, it should be under #comments. It may not be necessary to rerun the whole crawling script. Shell scripting could be useful for processing the existing texts. For instance, I imagine that files belonging to the same source could be filtered, and then the line containing As for the file splitting, it's a matter of keeping the database easy to manage, without too many different files per source. Especially if the files only contain a couple of lines and have exactly the same metadata, it's better to merge them. But for now we can put a pin in this; I'll think about it more carefully. Thanks! |
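A rough sketch of that post-processing idea, written in Python rather than shell. Everything about the file layout here is an assumption: that the metadata line starts with `#comments`, that the files end in `.txt`, and that files from one source can be recognized by a marker string such as `#source: Europarl`. The helper names are hypothetical.

```python
from pathlib import Path


def replace_comments_line(text, new_comment, prefix="#comments"):
    """Return text with every line starting with prefix rewritten."""
    out = []
    for line in text.splitlines(keepends=True):
        if line.startswith(prefix):
            # Preserve the original line ending, if any.
            ending = line[len(line.rstrip("\r\n")):]
            out.append(f"{prefix}: {new_comment}{ending}")
        else:
            out.append(line)
    return "".join(out)


def update_source_files(directory, source_marker, new_comment):
    """Rewrite the comments line in every .txt file containing source_marker.

    Returns the names of the files that were rewritten.
    """
    changed = []
    for path in sorted(Path(directory).glob("*.txt")):
        text = path.read_text(encoding="utf-8")
        if source_marker in text:
            path.write_text(replace_comments_line(text, new_comment), encoding="utf-8")
            changed.append(path.name)
    return changed
```

Something like `update_source_files("data", "#source: Europarl", "extended source info")` would then touch only the files of one source, without rerunning the crawl. The directory and marker in that call are placeholders.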
|
No description provided.