Thanks @dmyersturnbull. Can I ask what the inspiration/motivation for this is? Is this just a helper utility for some of your work? Or is this going to be a new step in our workflow?
josemduarte
left a comment
Thanks! I think this looks good, no issues on the code. And I understand the motivation behind making it very general, so that it is more reusable.
Having said that, to me it looks like the mongoexport CLI from the MongoDB CLI tools does exactly this. Doesn't it? If so, I'd say it'd be preferable to use the off-the-shelf tool. Sorry that I've only thought about this now, but I was initially more focused on a specific chem comp pipeline, thinking that it required something more custom.
@piehld Yeah, it was for the new chemical service ETL workflow, which otherwise doesn't need to talk to dw/exdb. But, as Jose points out, it's not doing more than MongoDB export right now. It might get functionality for incremental loading later.
Cool thanks for clarifying @dmyersturnbull! I guess I have the same question as @josemduarte now too, on whether mongoexport CLI can be used? If not, one thing that might help your code is to rely on our ExDB configuration file (e.g., with Mongo params here) being passed in as a CLI flag, which you could use our ConfigUtil to read in and grab the necessary Mongo client information from. This config file is what is passed in during production for ExDB loading tasks.
Yeah, I think for non-incremental exports mongoexport works perfectly. For incremental updates, we'll need slightly more logic and so will need some code -- at that point, either PyMongo or mongoexport works (because mongoexport allows
Yeah, Jose and I discussed this. I originally took that approach for consistency with our other Python projects, but I think we should move to just using single URI connection strings (from config files, env vars, or (less securely) from CLI args).
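For illustration, here is a rough sketch of how a single connection string could be resolved from those three sources. Everything named here is an assumption rather than something in this PR: the `EXDB_MONGO_URI` environment variable, the `mongo_uri` key in a JSON config file, and the precedence order (CLI argument first, then environment, then config file) are all hypothetical.

```python
import json
import os
from pathlib import Path
from typing import Mapping, Optional

# Hypothetical environment variable name; not defined anywhere in this PR.
ENV_VAR = "EXDB_MONGO_URI"


def resolve_mongo_uri(
    cli_uri: Optional[str] = None,
    env: Optional[Mapping[str, str]] = None,
    config_path: Optional[Path] = None,
) -> str:
    """Return a MongoDB connection URI from the first source that provides one.

    Assumed precedence: explicit CLI value, then environment variable,
    then a JSON config file with a top-level "mongo_uri" key.
    """
    env = os.environ if env is None else env
    if cli_uri:
        # Least secure option: the value may end up in shell history.
        return cli_uri
    if env.get(ENV_VAR):
        return env[ENV_VAR]
    if config_path is not None and Path(config_path).is_file():
        config = json.loads(Path(config_path).read_text(encoding="utf-8"))
        uri = config.get("mongo_uri")
        if uri:
            return uri
    raise ValueError(
        f"No MongoDB URI found: pass one explicitly, set {ENV_VAR}, "
        "or provide a config file with a 'mongo_uri' key"
    )
```

The returned string could then be handed directly to `pymongo.MongoClient(uri)`, so no separate host, port, or credential parameters need to be threaded through the CLI.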
Added a new CLI called `exdb-export` with a single subcommand, `export`, which simply writes a MongoDB collection (or subset of fields) to a JSON file.
This allows weekly-update-workflow to get the list of chemical component IDs. Getting exactly those would probably make for an overspecialized entry point, so the `export` subcommand takes a collection name and optionally a list of fields.
I kept it self-contained; a config file, `rcsb.db.mongo`, etc. feel unnecessary. The first commit just cleans up `setup.py` slightly.
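As a rough illustration of what such an export amounts to (not the code in this PR), the sketch below writes a collection, or a subset of its fields, to a JSON file. The function names `build_projection` and `export_collection` are hypothetical. The `collection` argument is assumed to behave like a PyMongo `Collection`, that is, to offer `find(filter, projection)`; dropping `_id` when specific fields are requested and serializing non-JSON types with `str` are also assumptions.

```python
import json
from pathlib import Path
from typing import Any, Dict, Optional, Sequence, Union


def build_projection(fields: Optional[Sequence[str]]) -> Optional[Dict[str, int]]:
    """Turn a list of field names into a MongoDB inclusion projection.

    Returns None (meaning "all fields") when no fields are requested.
    Assumption: _id is dropped unless the caller asks for it explicitly.
    """
    if not fields:
        return None
    projection = {name: 1 for name in fields}
    if "_id" not in projection:
        projection["_id"] = 0
    return projection


def export_collection(
    collection: Any,
    out_path: Union[str, Path],
    fields: Optional[Sequence[str]] = None,
) -> int:
    """Write every document in `collection` to `out_path` as a JSON array.

    `collection` only needs a PyMongo-style find(filter, projection) method.
    Returns the number of documents written.
    """
    docs = list(collection.find({}, build_projection(fields)))
    with open(out_path, "w", encoding="utf-8") as handle:
        # default=str covers values such as ObjectId or datetime.
        json.dump(docs, handle, default=str)
    return len(docs)
```

With PyMongo installed, usage would look something like `export_collection(MongoClient(uri)[db_name][collection_name], "out.json", ["some_field"])`, where the database, collection, and field names are placeholders.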