pjcraig edited this page Jan 30, 2013 · 2 revisions

Pubs Parser

This document is to keep track of work that needs to be done in order to improve the parser.

To-Do

  • The parser needs to move away from its current method of running all the citations to be parsed through a plaintext file. Currently every citation it is given is written to a file; the parser then reads the citations back from that file, parses them, and either returns HTML or saves the parsed results to the database. One way to do this is to refactor the parser so that it handles one citation at a time, with a parser spawned once for each citation entered, which leads to the next item.

  • The parser needs a control script which breaks batch jobs down into individual citations and spawns a parser to read in each one. Implementing this control script should speed up parse times considerably. We are looking into using exec() for this, as it doesn't open a server connection for each spawned parser. If we were to use something like curl_multi_exec(), large jobs could soak up the finite number of connections Apache holds open for use. Making the number of possible connections arbitrarily large isn't a great solution, so for now exec() looks like the way to go.
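
The control-script idea in the two items above can be sketched roughly as follows. This is only an illustration in Python, not the parser's actual code: the real control script would presumably use PHP's exec(), and the names `split_batch`, `parse_one`, `parse_batch` and `WORKER_CMD` are hypothetical. The sketch also assumes one citation per non-blank line, and the worker command is a stand-in that upper-cases its input in place of real parsing.

```python
import subprocess
import sys
from concurrent.futures import ThreadPoolExecutor

# Hypothetical stand-in for the real single-citation parser: it reads one
# citation from stdin and echoes it back upper-cased instead of parsing it.
WORKER_CMD = [
    sys.executable,
    "-c",
    "import sys; print(sys.stdin.read().strip().upper())",
]


def split_batch(raw):
    """Break a batch job into citations (assumed: one per non-blank line)."""
    return [line.strip() for line in raw.splitlines() if line.strip()]


def parse_one(citation, cmd=WORKER_CMD):
    """Spawn one parser process for a single citation and return its output.

    The citation is passed over stdin, so no intermediate plaintext file
    and no web-server connection is involved.
    """
    result = subprocess.run(
        cmd, input=citation, capture_output=True, text=True, check=True
    )
    return result.stdout.strip()


def parse_batch(raw, max_workers=4):
    """Split the batch and run a bounded number of parser processes at once."""
    citations = split_batch(raw)
    with ThreadPoolExecutor(max_workers=max_workers) as pool:
        # pool.map returns results in the same order as the input citations.
        return list(pool.map(parse_one, citations))
```

Calling `parse_batch(text)` on a pasted block of citations returns one result per citation, in input order. The `max_workers` cap is a design choice added for the sketch: it keeps a large batch from starting an unbounded number of processes at once.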
