java.lang.OutOfMemoryError: Java heap space after multiple getHTML calls #29

I need to extract article bodies from raw HTML documents. My code is as simple as:

```
for html in htmls:
    # Pass the loop variable; a fresh Extractor is created for every document.
    extractor = Extractor(extractor='ArticleExtractor', html=html)
    extractor.getHTML()
```

After calling `getHTML()` many times (around 10K), I get a `java.lang.OutOfMemoryError`:

```
Traceback (most recent call last):
  File "test.py", line 228, in <module>
    extractor.getHTML()
  File "/Users/macuser/.virtualenvs/bro/lib/python2.7/site-packages/boilerpipe/extract/__init__.py", line 70, in getHTML
    return highlighter.process(self.source, self.data)
jpype._jexception.OutOfMemoryErrorPyRaisable: java.lang.OutOfMemoryError: Java heap space
```

I looked into the code, and it looks like creating `BoilerpipeSAXInput`, `HTMLHighlighter`, and other Java instances causes this problem. Is there a way to fix this issue?

To reproduce this without 10K articles, simply reduce the heap size in `boilerpipe.__init__`:

```
MAX_JVM_HEAP_SIZE_MBYTES = 4  # deliberately tiny so the heap fills up after far fewer calls

if jpype.isJVMStarted() != True:
    jars = []
    for top, dirs, files in os.walk(imp.find_module('boilerpipe')[1]+'/data'):
        for nm in files:
            jars.append(os.path.join(top, nm))

    jvm_args = [
        '-Xmx%dM' % MAX_JVM_HEAP_SIZE_MBYTES,
        "-Djava.class.path=%s" % os.pathsep.join(jars)
    ]
    jpype.startJVM(jpype.getDefaultJVMPath(), *jvm_args)
```
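With the heap reduced like this, a small driver makes it easy to see how many calls succeed before the error shows up. This is only a sketch: `count_calls_until_failure` and `boilerpipe_extract` are hypothetical helper names of mine, not part of boilerpipe, and I am assuming the JPype error reaches Python as an ordinary exception, as the traceback above suggests.

```python
def count_calls_until_failure(extract, html, max_calls=10000):
    """Call extract(html) repeatedly.

    Returns (completed_calls, error), where error is None if every
    call succeeded.
    """
    for i in range(max_calls):
        try:
            extract(html)
        except Exception as exc:
            # Assumption: the Java OutOfMemoryError wrapper raised by JPype
            # derives from Exception and is therefore caught here.
            return i, exc
    return max_calls, None


def boilerpipe_extract(html):
    # Imported lazily so the helper above can be tried without boilerpipe.
    from boilerpipe.extract import Extractor
    # Same pattern as in my loop: a new Extractor per document.
    return Extractor(extractor='ArticleExtractor', html=html).getHTML()
```

Calling `count_calls_until_failure(boilerpipe_extract, some_html)` with any single article then gives the number of successful `getHTML()` calls and the exception that stopped the loop, so the same document can be reused instead of collecting 10K articles.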

