Description
In the slides for the text mining session (file vignettes/h_text-mining.Rmd), we have a download.file() call to get the manifesto of the Green Party. This is the relevant chunk:
gruene_btw2021 <- "https://cms.gruene.de/uploads/documents/2021_Wahlprogrammentwurf.pdf"
gruene_btw2021_local <- tempfile()
download.file(url = gruene_btw2021, destfile = gruene_btw2021_local)
This works nicely for macOS users, but on Windows the downloaded PDF cannot be opened with pdftools::pdf_info(gruene_btw2021_local). Opening it "by hand" with a viewer shows that the file is corrupted.
As the manually downloaded PDF can be opened without problems, we know that download.file() corrupts the file.
Our search for issues others have encountered with download.file() yields this result:
https://community.rstudio.com/t/download-file-issue-corrupted-file/60844
Indeed, adding mode = "wb" as an argument does the job:
download.file(url = gruene_btw2021, destfile = gruene_btw2021_local, mode = "wb")
Actually, the documentation of download.file() addresses the issue as follows:
"The choice of binary transfer (mode = "wb" or "ab") is important on Windows, since unlike Unix-alikes it does distinguish between text and binary files and for text transfers changes \n line endings to \r\n (aka ‘CRLF’).
On Windows, if mode is not supplied (missing()) and url ends in one of .gz, .bz2, .xz, .tgz, .zip, .jar, .rda, .rds or .RData, mode = "wb" is set so that a binary transfer is done to help unwary users."
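To make the quoted passage concrete, here is a minimal sketch that simulates the LF to CRLF rewrite on a few raw bytes. It does not call download.file() at all. The helper name to_crlf() and the sample bytes are made up for illustration; the point is only that inserting extra bytes into binary content changes its length and layout, which is what breaks a PDF.

```r
# Hypothetical helper: mimic a Windows text-mode transfer by rewriting
# every LF byte (0x0a) as CR LF (0x0d 0x0a).
to_crlf <- function(bytes) {
  stopifnot(is.raw(bytes))
  is_lf <- bytes == as.raw(0x0a)
  out <- raw(length(bytes) + sum(is_lf))
  # Position of each original byte in the output, shifted right by the
  # number of CR bytes inserted up to and including that byte.
  pos <- seq_along(bytes) + cumsum(is_lf)
  out[pos] <- bytes
  # Each LF gets a CR placed directly in front of it.
  out[pos[is_lf] - 1L] <- as.raw(0x0d)
  out
}

# Made-up binary payload: "%PDF", LF, 0xff, LF, 0x00 (8 bytes, two of them LF).
payload <- as.raw(c(0x25, 0x50, 0x44, 0x46, 0x0a, 0xff, 0x0a, 0x00))
mangled <- to_crlf(payload)

length(payload)             # 8
length(mangled)             # 10: one extra byte per LF
identical(payload, mangled) # FALSE
```

In a real PDF, byte offsets and compressed streams depend on the exact byte sequence, so even a handful of inserted CR bytes is enough to make the file unreadable.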
I think it would be very helpful if mode = "wb" were set implicitly for PDF documents, too.
Another consideration is that using curl::curl_download() may be more robust.
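A rough sketch of what the curl-based variant could look like, assuming the curl package is installed. The helpers looks_like_pdf() and download_pdf() are hypothetical names, not part of the course material. The check relies on the convention that a PDF file starts with the signature "%PDF-", so it is a sanity check only, not full validation.

```r
# Hypothetical helper: does the file start with the PDF signature "%PDF-"?
looks_like_pdf <- function(path) {
  header <- readBin(path, what = "raw", n = 5L)
  identical(header, charToRaw("%PDF-"))
}

# Hypothetical wrapper around curl::curl_download(), always using a
# binary transfer and warning if the result does not look like a PDF.
download_pdf <- function(url, destfile = tempfile(fileext = ".pdf")) {
  curl::curl_download(url = url, destfile = destfile, mode = "wb")
  if (!looks_like_pdf(destfile)) {
    warning("Downloaded file does not start with '%PDF-': ", destfile)
  }
  destfile
}

# Usage in the slides (needs network access, so left commented out here):
# gruene_btw2021_local <- download_pdf(gruene_btw2021)
# pdftools::pdf_info(gruene_btw2021_local)
```

The header check can of course be combined with download.file(..., mode = "wb") as well; it just makes a corrupted download visible before pdftools is called.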