A package that allows a continuous non-blocking read of large batches of documents from a MongoDB database (remote or local), with some action performed on each batch.
Installation instructions:
- Download the mongojoin package:
  git clone https://github.com/knowbodynos/mongojoin.git
- Navigate into the main directory:
  cd mongojoin
- Install mongojoin:
  python setup.py install
Using mongojoin:
- Queries are given in the form of a Python list of lists:
  [['<COLLECTION_1>',<JSON_QUERY_1>,<JSON_PROJECTIONS_1>,<OPTIONS_1>], ['<COLLECTION_2>',<JSON_QUERY_2>,<JSON_PROJECTIONS_2>,<OPTIONS_2>], ...]
  where
  - <COLLECTION_#> is the name of the collection in the database.
  - <JSON_QUERY_#> is a query of the form {'_id': 10}.
  - <JSON_PROJECTIONS_#> is a projection of the form {'_id': 0}.
  - <OPTIONS_#> is a dictionary of options such as HINT, SKIP, SORT, LIMIT, and COUNT, of the form {'HINT': {'<FIELD_1>': 1}, 'SKIP': 5, 'SORT': {'<FIELD_2>': 1}, 'LIMIT': 10, 'COUNT': True}.
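As an illustration of the format above, a query list over two collections might look like the following sketch. The collection names (users, orders) and the field names are hypothetical, and the sketch assumes that options can be supplied individually rather than all at once.

```python
# Hypothetical collection and field names, chosen only to illustrate the format.
# Assumption: each options dictionary may contain only the options that are needed.
queries = [
    # [collection, query, projection, options]
    ['users', {'age': {'$gte': 18}}, {'_id': 0, 'name': 1, 'age': 1},
     {'SORT': {'age': 1}, 'LIMIT': 10}],
    ['orders', {'status': 'shipped'}, {'_id': 0}, {'SKIP': 5}],
]

# Every entry has exactly four parts, so it can be unpacked directly.
for collection, query, projection, options in queries:
    print(collection, query, projection, options)
```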
- The main function is dbcrawl:
  dbcrawl(db,queries,statefilepath,statefilename="querystate",inputfunc=lambda x:{"nsteps":1},inputdoc={"nsteps":1},action=printasfunc,readform=lambda x:eval(x),writeform=lambda x:x,timeleft=lambda:1,counters=[1,1],counterupdate=lambda x:None,resetstatefile=False,limit=None,limittries=10,toplevel=True,initdoc={})
  where
  - db is a pymongo database object.
  - queries is a list of queries of the form described above.
  - statefilepath is the path where an intermediate file will be stored, and statefilename is its filename.
  - inputfunc is a function that returns a dictionary with information used for reading in documents. inputdoc is the first such dictionary, which is preloaded. nsteps is the number of documents read in each batch.
  - action is a function that performs an action on each batch of documents.
  - readform and writeform allow you to alter the format in which processed documents are stored in the intermediate file statefilename.
  - timeleft is a function that returns how much time (in seconds) is left before some limit is reached (default: no limit).
  - counters is a list containing a batch counter and a document counter. Both are initialized to 1 by default.
  - resetstatefile is True or False depending on whether the intermediate file statefilename should be overwritten.
  - limit is a limit on how many documents should be processed in total. If there is no limit, set it to None (default).
  - limittries is a limit on how many times a read should be attempted before giving up.
  - toplevel and initdoc are internal recursive variables and should not be customized.
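The callback parameters can be built from small helper functions. The sketch below is one possible way to do this and is not part of mongojoin: make_timeleft and make_inputfunc are hypothetical helpers, written only to match the shapes of the defaults in the signature above (timeleft takes no arguments, inputfunc takes one argument and returns a dictionary containing nsteps).

```python
import time

def make_timeleft(budget_seconds, clock=time.time):
    """Return a zero-argument function giving the seconds left in a time budget."""
    deadline = clock() + budget_seconds
    return lambda: deadline - clock()

def make_inputfunc(batch_size):
    """Return a one-argument function that always requests batch_size documents."""
    # The argument is ignored, as in the default inputfunc=lambda x: {"nsteps": 1}.
    return lambda _: {"nsteps": batch_size}

timeleft = make_timeleft(60)     # a budget of about one minute
inputfunc = make_inputfunc(100)  # batches of 100 documents

print(inputfunc(None))  # {'nsteps': 100}

# These could then be passed to dbcrawl by keyword, for example:
# dbcrawl(db, queries, statefilepath, inputfunc=inputfunc,
#         inputdoc={"nsteps": 100}, timeleft=timeleft)
```

Passing the clock as a parameter makes the remaining-time logic easy to check without waiting for real time to pass.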
- Some useful actions are:
  - To print batches to the screen, set
    action = printasfunc
  - To add batches of documents to a list of batches, set
    action = lambda x,y,z: my_list.append(z)
  - To add batches of documents to a list of documents, set
    action = lambda x,y,z: my_list.extend(z)
  - To write batches of documents to a file, set
    action = lambda x,y,z: writeasfunc("<FILE_PATH>",z)
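Because an action is just a three-argument function whose last argument is the batch, it can be tried out without a database by calling it directly. The sketch below does this for the list-collecting lambda and for a hypothetical make_jsonl_action helper, which is not part of mongojoin. It assumes that each batch is an iterable of JSON-serializable dictionaries. The meaning of the first two arguments is not described here, so they are passed as None and ignored.

```python
import json

def make_jsonl_action(path):
    """Return an action that appends each document in a batch to path, one JSON object per line."""
    def action(x, y, batch):
        # x and y mirror the first two arguments of the lambdas above and are ignored.
        with open(path, "a") as f:
            for doc in batch:
                f.write(json.dumps(doc) + "\n")
    return action

my_list = []
collect = lambda x, y, z: my_list.extend(z)

# Simulate two batches by calling the action directly.
collect(None, None, [{"a": 1}, {"a": 2}])
collect(None, None, [{"a": 3}])
print(my_list)  # [{'a': 1}, {'a': 2}, {'a': 3}]
```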