Allow levels to be missing for fct_concept#40
Allow levels to be missing for fct_concept#40mlondschien wants to merge 1 commit intoeth-mds:mainfrom
fct_concept#40Conversation
|
@mlondschien just had a quick look at this. Iiuc, what you're after is a concept class that allows for returning arbitrary strings. Sure, we could make the I guess the reason why we never felt a need for this is inter-source data harmonization: if you define a concept with categorical values, all data sources need to stick to the same "labels" otherwise you can't really do any "source-agnostic" down-steam work. For the concrete use-case ("caregiver"), is the set of levels very large? Or is there another reason for not specifying this ahead of time? Could this concept be implemented for other datasets as well? Not super sure whether this actually works, as we never tried out this specifically. The heavy reliance on S3 was in part to allow for extensibility, so you might be able to define your own concept class ( |
Yes, around ~10'000 levels.
I'm using this locally. It does work. Is there a downside to using levels here? |
|
@mlondschien I assume in that case you are not actually interested in the "values" themselves but more in a grouping this induces? If this is actually the case, why not simply return an "auto increment" ID (integer valued; in data.table you could use the If you were interested in contributing such a concept to ricu, we'd have to think about how to do this across datasets. We'd have to somehow "namespace" the IDs, as the MIMIC caregiver no. 1 almost surely is distinct from the HiRID caregiver no. 1. But step by step. |
Yes, I one-hot encode them later. However, I prefer leaving them as is, as this simplifies downstream analysis when talking to collaborators. IIUC factors are nothing other than
I currently simply treat |
|
Any chance we could merge this? This would allow me to work with the main branch of ricu and not my personal fork. |
The docs state
Currently, a set of factor levels needs to be specified. Sometimes this is not desirable. This PR allows not specifying
levelsinfct_cncpt.