Crawler professional genre #281

Open

veljkovic wants to merge 2 commits into MorphDiv:master from veljkovic:crawler-professional
Conversation

@veljkovic
Contributor

No description provided.

@ximenina
Collaborator

ximenina commented Apr 6, 2023

Thanks for this PR. The database runs OK. I'm waiting to accept this PR because I would like to suggest two things (in case they are easy to incorporate; otherwise, we can add them afterward):

  • Splitting of files that belong to the same source. Is there a criterion for how the crawled text is currently split into files? I'm asking because, in principle, we established to limit the number of files to 100 per online source. Is it possible to merge the ones that belong to the same source so that we don't have more than ~100 different text files per source?

  • Extended information about the source. Could we add more details about each corpus? This could be done under #comments:

For EUconst:
A parallel corpus collected from the European Constitution.

For MultiUN:
This is a collection of translated documents from the United Nations originally compiled by Andreas Eisele and Yu Chen (see http://www.euromatrixplus.net/multi-un/). Please cite MultiUN: A Multilingual corpus from United Nation Documents, Andreas Eisele and Yu Chen, LREC 2010

For Europarl:
A parallel corpus extracted from the European Parliament web site by Philipp Koehn (University of Edinburgh). The main intended use is to aid statistical machine translation research.
More information can be found at http://www.statmt.org/europarl/.

*I extracted this info from their respective OPUS webpages.

I'm tagging here @tsamardzic and @christianbentz in case they have more comments.

@veljkovic
Contributor Author

Hi,

Thank you for your feedback.

Regarding the first part (splitting files from the same source), the split is done based on the crawling instructions provided, the same as for the professional genre done last year. Essentially, each document provided is sampled and saved in a new file if it has fewer than 50k tokens; if it has more than that, it is separated into multiple files. However, I am not entirely sure that all the data can be stored in 100 files, as the metadata (year published/composed, sample type, source, etc.) would not be correct for those merged files.
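For reference, the 50k-token splitting criterion described above could be sketched in a few lines of awk. This is only an illustration, not the actual crawler code; the file names (`doc.txt`, `doc_part_N.txt`) and the tiny demo input are assumptions, and `max` is lowered to 4 so the demo visibly splits:

```shell
# Demo stand-in for one crawled document (assumed name).
printf 'one two three\nfour five six\nseven eight\n' > doc.txt

# Split into parts of at most `max` whitespace-separated tokens
# (50000 in the described pipeline; 4 here so the demo splits).
awk -v max=4 '
BEGIN { part = 1 }
{
    out = sprintf("doc_part_%d.txt", part)
    count += NF          # NF = tokens on this line
    print > out
    if (count >= max) {  # current part is full: start a new one
        close(out)
        part++
        count = 0
    }
}' doc.txt
```

With the demo input this produces `doc_part_1.txt` (two lines, six tokens) and `doc_part_2.txt` (one line), since a part is closed only after the line that pushes it past the limit.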

For the second part (extended information about the source), it can be easily incorporated. I just need to rerun the script, which can take a bit of time. Just to confirm, should this information be in the metadata under #comments?

@ximenina
Collaborator

Yes, it should be under #comments. It may not be necessary to rerun the whole crawling script; shell scripting could be used to process the existing texts. For instance, I imagine that files belonging to the same source could be filtered and then the line containing #comments simply substituted, using a command like sed.
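The sed substitution suggested above might look like this, for instance. This is a minimal sketch with assumed file names and metadata layout; the EUconst description is the one quoted earlier in this thread. Note that the `-i` flag as used here is GNU sed syntax (BSD/macOS sed needs `-i ''`):

```shell
# Demo stand-in for one crawled EUconst file (assumed name and layout).
printf '#source: EUconst\n#comments:\nSample text line.\n' > EUconst_demo.txt

# Replace the line starting with "#comments:" with the extended description.
sed -i 's|^#comments:.*|#comments: A parallel corpus collected from the European Constitution.|' EUconst_demo.txt
```

In practice one would loop over the files of each source (e.g. with a glob per source name) and apply the matching description.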

As for the file splitting, it's a matter of keeping the database easy to manage, without too many different files per source. Especially if the files only contain a couple of lines and have exactly the same metadata, it's better to merge them. But for now we can put a pin in this; I'll think about it more carefully.

Thanks!

@ximenina
Collaborator

ximenina commented May 1, 2023

  • By the way, I forgot to say that if adding the new info turns out to be too complicated, we can do it from our side. Just let us know.


3 participants