Skip to content

some urls will not work with celery #28

@Zman67

Description

@Zman67

Hi,

I have a rather urgent problem, for which I hope you can help me,
I'm trying to parse urls/html via boilerpipe and celery. Straightforward stuff, giving a task to a celery worker. However some links work, some don't.
If I call call_txt_extr, url: 'http://t.co/XIDUuUIjPi' will not work and disappears in a "soft" followed by a "hard" timeout in celery.
If I do the same thing with url 'http://www.rezmanagement.nl' it works perfectly.

code:

from celery import Celery

from boilerpipe.extract import Extractor
from harvest.celery import app
app.config_from_object('harvest.celeryconfig')

def call_txt_extr():

Extract_Text.soft_time_limit = 10
Extract_Text.time_limit = 15
Extract_Text.apply_async()

@app.task
def Extract_Text():

URL = 'http://t.co/XIDUuUIjPi'
# URL = 'http://www.rezmanagement.nl/'
extractorType="DefaultExtractor"
# Extractor(extractor=extractorType, url=URL)
print Extractor(extractor=extractorType, url=URL).getText()
return 

I've tried everything but editing the java code and found the following:

  1. the task / boilerpipe stops working at line 70 or so in the Extractor (init.py),
    "self.source = BoilerpipeSAXInput(InputSource(reader)).getTextDocument()"
    it simply doesn't give back the parsed text and then the task times out.

  2. Please understand It works perfectly with some URL's within celery, others timeout.
    If I remove the celery decorator (thus no longer getting the task executed by celery, it works perfectly, so the URL is ok (Extractor can deal with the html etc.)

  3. if I define a celery class, and configure the task to inherrit the class, and run the extractor call from the class, this works in celery
    however: this it not the way to run call the Extractor. Furthermore since the Extractor needs inpunt I would be polling for the same URL at every functioncall which is highly unwanted and not supposed to work like that.

So: this works, but is not good code and highly unwanted I think:

class taskclass(celery.Task):

URL = 'http://t.co/XIDUuUIjPi'
# URL = 'http://www.rezmanagement.nl'
extractorType="DefaultExtractor"
print Extractor(extractor=extractorType, url=URL).getText()

def call_txt_extr():

Extract_Text.soft_time_limit = 10
Extract_Text.time_limit = 15
Extract_Text.apply_async()

@app.task (base=taskclass)
def Extract_Text():

URL = 'http://t.co/XIDUuUIjPi'
# URL = 'http://www.rezmanagement.nl/'
extractorType="DefaultExtractor"
# Extractor(extractor=extractorType, url=URL)
print Extractor(extractor=extractorType, url=URL).getText()
return 
  1. updated JPype1
  2. updated nekohtml
  3. cannot find any other instance of this on the internet.

I hope you can help me,

Kindest regards,

Roland Zoet

Metadata

Metadata

Assignees

No one assigned

    Labels

    No labels
    No labels

    Projects

    No projects

    Milestone

    No milestone

    Relationships

    None yet

    Development

    No branches or pull requests

    Issue actions