Jon Stokes edited this page Aug 22, 2014 · 2 revisions

Object queues are where you get the output from your scrapes. Each queue is essentially a Redis set, so when you pop it, the results arrive in no particular order.
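Because pop order carries no meaning, a consumer usually needs some other way to tell which results belong together. The following is a minimal sketch, not part of the service itself: it groups already-popped JSON strings by the session key described later on this page. The function name `group_by_session` and the sample items are hypothetical, and how you actually pop items from the queue is not shown.

```python
import json
from collections import defaultdict


def group_by_session(raw_items):
    """Group popped queue items by the key of the session that produced them."""
    grouped = defaultdict(list)
    for raw in raw_items:
        item = json.loads(raw)
        # Arrival order is arbitrary, so index on the session key instead.
        grouped[item["session"]["key"]].append(item["object"])
    return dict(grouped)


# Hypothetical popped items, deliberately interleaved across two sessions.
popped = [
    '{"session": {"key": "a"}, "page": {}, "object": {"title": "second"}}',
    '{"session": {"key": "b"}, "page": {}, "object": {"title": "other"}}',
    '{"session": {"key": "a"}, "page": {}, "object": {"title": "first"}}',
]
by_session = group_by_session(popped)
```

Within each group the objects stay in whatever order they were popped, so sort on a field of your own schema if you need a stable order.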

Important: Unlike the session queues, you do not register an object queue. One object queue is automatically created for each schema that you register, and the name of the schema and the name of the queue are identical.

Here's an example object of the type that you might get back from the queue:

{
  "page": {
    "url": "http://arstechnica.com/",
    "code": 200,
    "headers": {
      "server": ["nginx"],
      "date": ["Thu, 21 Aug 2014 16:53:53 GMT"],
      "content-type": ["text/html; charset=UTF-8"],
      "transfer-encoding": ["chunked"],
      "connection": ["keep-alive"],
      "x-ars-server": ["web10"]
    },
    "response_time": 628,
    "fetched": true
  },
  "session": {
    "key": "5fd62abffe58dbe6d6b32700",
    "queue_name": "www.arstechnica.com",
    "definition_key": "www.arstechnica.com/article_link",
    "started_at": "2014-08-21 11:55:22"
  },
  "object": {
    "title": "Seals carried tuberculosis across the Atlantic, gave it to humans",
    "excerpt": "Disease was present in the Americas prior to European contact.",
    "author": "JOHN TIMMER",
    "url": "http://arstechnica.com/science/2014/08/seals-carried-tuberculosis-across-the-atlantic-gave-it-to-humans/",
    "story_type": "in_depth"
  }
}

A JSON object that you get from popping an object queue has three properties:

  • Session
  • Page
  • Object
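As a rough illustration of that three-part layout, the sketch below parses one popped item and splits it into its parts. It assumes the item arrives as a JSON string and that the keys are lowercase, as in the example above. The helper name `split_item` is hypothetical, and the sample is a trimmed copy of the example.

```python
import json

REQUIRED = ("session", "page", "object")


def split_item(raw):
    """Parse one popped queue item and return (session, page, object)."""
    item = json.loads(raw)
    missing = [name for name in REQUIRED if name not in item]
    if missing:
        raise ValueError("popped item is missing: " + ", ".join(missing))
    return item["session"], item["page"], item["object"]


# Trimmed copy of the example on this page.
raw = """
{
  "page": {"url": "http://arstechnica.com/", "code": 200, "fetched": true},
  "session": {"key": "5fd62abffe58dbe6d6b32700", "queue_name": "www.arstechnica.com"},
  "object": {"author": "JOHN TIMMER", "story_type": "in_depth"}
}
"""
session, page, obj = split_item(raw)
```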

Let's look at each in turn.

Session

The session property holds the following information about the session that produced the object.

Key: Each session is assigned a unique key when it's pushed into a session queue, and that key is returned along with all of that session's objects.

Started_at: This timestamp tells you when the session started scraping the website. All times are UTC.

Queue_name: The name of the queue that the session was pushed to.

Definition_key: The name/key of the session definition that was used to run the session.
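Since the timestamp carries no timezone marker, it is easy to misread it as local time. Here is a small sketch of turning it into a timezone-aware value. The format string is inferred from the single example on this page ("2014-08-21 11:55:22"), so treat it as an assumption. `parse_started_at` is a hypothetical helper name.

```python
from datetime import datetime, timezone


def parse_started_at(value):
    """Parse a started_at string and attach UTC, since all times are UTC."""
    # Format inferred from the example object; adjust if yours differs.
    naive = datetime.strptime(value, "%Y-%m-%d %H:%M:%S")
    return naive.replace(tzinfo=timezone.utc)


started = parse_started_at("2014-08-21 11:55:22")
```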

Page

The page property has the following fields. Most are self-explanatory:

  • url
  • headers
  • code (HTTP response code)
  • body (boolean that indicates whether a body was present)
  • error (any error triggered by the fetch)
  • referer
  • fetched (boolean)
  • redirect_to
  • response_time (in ms)
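One common use of these fields is to decide whether a page was retrieved cleanly before trusting the scraped object. The check below is only a sketch under stated assumptions: the example object on this page omits several of the listed fields, so it reads every field with a default, and treating any 2xx code as success is a choice made here rather than something this page specifies. `fetch_succeeded` is a hypothetical name.

```python
def fetch_succeeded(page):
    """Return True if the page was fetched, reported no error, and got a 2xx code."""
    # Fields may be absent (the example object omits several), so use defaults.
    if not page.get("fetched", False):
        return False
    if page.get("error"):
        return False
    code = page.get("code")
    return isinstance(code, int) and 200 <= code < 300


# Values taken from the example object on this page.
example_page = {
    "url": "http://arstechnica.com/",
    "code": 200,
    "response_time": 628,
    "fetched": True,
}
ok = fetch_succeeded(example_page)
```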

Object

The object property contains the object that was scraped from the page.
