Some URLs will not work with celery

Hi,

I have a rather urgent problem that I hope you can help me with.
I'm trying to parse URLs/HTML with boilerpipe inside a celery task. It is straightforward stuff: a task is handed to a celery worker. However, some links work and some don't.
If I call call_txt_extr with the URL 'http://t.co/XIDUuUIjPi', the task never finishes: it runs into celery's "soft" time limit and then the "hard" time limit.
If I do the same thing with the URL 'http://www.rezmanagement.nl', it works perfectly.
## code:

```
from celery import Celery

from boilerpipe.extract import Extractor
from harvest.celery import app
app.config_from_object('harvest.celeryconfig')

def call_txt_extr():
    Extract_Text.soft_time_limit = 10
    Extract_Text.time_limit = 15
    Extract_Text.apply_async()

@app.task
def Extract_Text():
    URL = 'http://t.co/XIDUuUIjPi'
    # URL = 'http://www.rezmanagement.nl/'
    extractorType = "DefaultExtractor"
    # Extractor(extractor=extractorType, url=URL)
    print(Extractor(extractor=extractorType, url=URL).getText())
    return
```

I've tried everything short of editing the Java code and found the following:

1) The task / boilerpipe stops at around line 70 of the Extractor (`__init__.py`):
"self.source = BoilerpipeSAXInput(InputSource(reader)).getTextDocument()"
This call simply never returns the parsed text, and then the task times out.

2) Please understand that it works perfectly with some URLs within celery, while others time out.
If I remove the celery decorator (so the function is no longer executed by celery as a task), it works perfectly, so the URL itself is OK (the Extractor can deal with the HTML etc.).

3) If I define a celery task class, configure the task to inherit from that class, and run the Extractor call from the class, it works in celery.
However, this is not the way to call the Extractor. Furthermore, since the Extractor needs input, I would be polling the same URL on every function call, which is highly unwanted and not how it is supposed to work.

So: this works, but is not good code and highly unwanted I think:

```
# Same imports as in the first snippet, plus:
import celery

class taskclass(celery.Task):
    # The class body runs once, when this module is imported.
    URL = 'http://t.co/XIDUuUIjPi'
    # URL = 'http://www.rezmanagement.nl'
    extractorType = "DefaultExtractor"
    print(Extractor(extractor=extractorType, url=URL).getText())

def call_txt_extr():
    Extract_Text.soft_time_limit = 10
    Extract_Text.time_limit = 15
    Extract_Text.apply_async()

@app.task(base=taskclass)
def Extract_Text():
    URL = 'http://t.co/XIDUuUIjPi'
    # URL = 'http://www.rezmanagement.nl/'
    extractorType = "DefaultExtractor"
    # Extractor(extractor=extractorType, url=URL)
    print(Extractor(extractor=extractorType, url=URL).getText())
    return
```
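
The class-based variant above runs the Extractor once while the module is being loaded rather than inside the running task. A less intrusive way to test whether the process in which the extraction runs makes the difference (an assumption, not something confirmed here) could be to start a fresh Python interpreter for each extraction. The following is only a sketch using the standard library, assuming Python 3.7 or newer; `run_in_fresh_process` and `EXTRACT_CODE` are hypothetical names, and the child script assumes boilerpipe is importable in the child interpreter.

```python
import subprocess
import sys


def run_in_fresh_process(child_code, argument, timeout):
    """Run child_code in a new Python interpreter and return its stdout.

    Returns None if the child does not finish within `timeout` seconds.
    A non-zero exit status of the child raises CalledProcessError.
    """
    try:
        completed = subprocess.run(
            [sys.executable, "-c", child_code, argument],
            capture_output=True,
            text=True,
            timeout=timeout,
            check=True,
        )
    except subprocess.TimeoutExpired:
        # subprocess.run kills the child before re-raising the timeout.
        return None
    return completed.stdout


# Hypothetical child script: assumes boilerpipe can be imported there.
# With "python -c CODE ARG", the argument is available as sys.argv[1].
EXTRACT_CODE = """
import sys
from boilerpipe.extract import Extractor
text = Extractor(extractor='DefaultExtractor', url=sys.argv[1]).getText()
sys.stdout.write(text)
"""
```

Inside the task this could be called as `run_in_fresh_process(EXTRACT_CODE, URL, 10)`. A `None` result would mean the child was killed after the timeout, which would suggest the problem is not specific to the celery worker process.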

4) Updated JPype1.
5) Updated nekohtml.
6) I cannot find any other report of this on the internet.
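
For finding 1), the place where the call stalls could also be recorded automatically instead of being tracked down by hand. The sketch below uses the standard `faulthandler` module, so it assumes Python 3 (the code above is Python 2). `call_with_hang_dump` is a hypothetical helper name. It writes to an explicitly opened file rather than relying on stderr, and the dump only shows the Python frames, not what the Java side is doing.

```python
import faulthandler


def call_with_hang_dump(func, dump_after, dump_file):
    """Call func() and return its result.

    If func() runs longer than `dump_after` seconds, the traceback of
    every Python thread is written to `dump_file` (an open file with a
    real file descriptor). The call itself is not interrupted.
    """
    faulthandler.dump_traceback_later(dump_after, exit=False, file=dump_file)
    try:
        return func()
    finally:
        # Disarm the watchdog so a later, unrelated call is not reported.
        faulthandler.cancel_dump_traceback_later()
```

Wrapping the Extractor call in the task with a `dump_after` below the soft time limit (for example 5 seconds against the limit of 10 used above) should leave a traceback in the file that points at the Python line still waiting when the timeout hits.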

I hope you can help me,

Kindest regards,

Roland Zoet

