Crawler professional genre #281

Open

veljkovic wants to merge 2 commits into MorphDiv:master from veljkovic:crawler-professional
Conversation

@veljkovic
Contributor

No description provided.

@ximenina
Collaborator

ximenina commented Apr 6, 2023

Thanks for this PR. The database runs OK. I'm waiting to accept this PR because I would like to suggest two things (in case they are easy to incorporate; otherwise, we can add them afterward):

  • Splitting of files that belong to the same source. Is there a criterion for how the crawled text is currently split into files? I'm asking because, in principle, we established to limit the number of files to 100 per online source. Is it possible to merge the ones that belong to the same source so that we don't have more than ~100 different text files per source?

  • Extended information about the source. Could we add more details about each corpus? This could be done under #comments:

For EUconst:
A parallel corpus collected from the European Constitution.

For MultiUN:
This is a collection of translated documents from the United Nations originally compiled by Andreas Eisele and Yu Chen (see http://www.euromatrixplus.net/multi-un/). Please cite MultiUN: A Multilingual corpus from United Nation Documents, Andreas Eisele and Yu Chen, LREC 2010

For Europarl:
A parallel corpus extracted from the European Parliament web site by Philipp Koehn (University of Edinburgh). The main intended use is to aid statistical machine translation research.
More information can be found at http://www.statmt.org/europarl/.

*I extracted this info from their respective OPUS webpages.

I'm tagging here @tsamardzic and @christianbentz in case they have more comments.

@veljkovic
Contributor Author

Hi,

Thank you for your feedback.

Regarding the first part (splitting files from the same source), the split is done based on the crawling instructions provided, the same as for the professional genre done last year. Essentially, each document provided is sampled and saved in a new file if it has fewer than 50k tokens; if it has more than that, it is separated into multiple files. However, I am not entirely sure that all the data can be stored in 100 files, as the metadata (year published/composed, sample type, source, etc.) would not be correct for those merged files.
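For reference, the 50k-token splitting criterion described above could be sketched in a few lines of awk. This is only an illustration, not the actual crawler code; the file names (`doc.txt`, `doc_part_N.txt`) and the tiny demo input are assumptions, and `max` is lowered to 4 so the demo visibly splits:

```shell
# Demo stand-in for one crawled document (assumed name).
printf 'one two three\nfour five six\nseven eight\n' > doc.txt

# Split into parts of at most `max` whitespace-separated tokens
# (50000 in the described pipeline; 4 here so the demo splits).
awk -v max=4 '
BEGIN { part = 1 }
{
    out = sprintf("doc_part_%d.txt", part)
    count += NF          # NF = tokens on this line
    print > out
    if (count >= max) {  # current part is full: start a new one
        close(out)
        part++
        count = 0
    }
}' doc.txt
```

With the demo input this produces `doc_part_1.txt` (two lines, six tokens) and `doc_part_2.txt` (one line), since a part is closed only after the line that pushes it past the limit.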

For the second part (extended information about the source), it can be easily incorporated. I just need to rerun the script, which can take a bit of time. Just to confirm, should this information be in the metadata under #comments?

@ximenina
Collaborator

Yes, it should be under #comments. It may not be necessary to rerun the whole crawling script; shell scripting could be used to process the existing texts. For instance, I imagine that files belonging to the same source could be filtered and then the line containing #comments simply substituted, using a command like sed.
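The sed substitution suggested above might look like this, for instance. This is a minimal sketch with assumed file names and metadata layout; the EUconst description is the one quoted earlier in this thread. Note that the `-i` flag as used here is GNU sed syntax (BSD/macOS sed needs `-i ''`):

```shell
# Demo stand-in for one crawled EUconst file (assumed name and layout).
printf '#source: EUconst\n#comments:\nSample text line.\n' > EUconst_demo.txt

# Replace the line starting with "#comments:" with the extended description.
sed -i 's|^#comments:.*|#comments: A parallel corpus collected from the European Constitution.|' EUconst_demo.txt
```

In practice one would loop over the files of each source (e.g. with a glob per source name) and apply the matching description.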

As for the file splitting, it's a matter of keeping the database easy to manage, without too many different files per source. Especially if the files only contain a couple of lines and have exactly the same metadata, it's better to merge them. But for now we can put a pin in this; I'll think about it more carefully.

Thanks!

@ximenina
Collaborator

ximenina commented May 1, 2023

  • By the way, I forgot to say that if adding the new info turns out to be too complicated, we can do it from our side. Just let us know.


3 participants