When applying jwzthreading to the 20 newsgroups dataset, we get an error about the maximum recursion limit being reached in hashing:
Traceback (most recent call last):
  File "/home/rth/src/jwzthreading/jwzthreading/jwzthreading.py", line 71, in __hash__
    return hash(tuple(sorted(self.items())) + (self.parent,))
  File "/home/rth/src/jwzthreading/jwzthreading/jwzthreading.py", line 71, in __hash__
    return hash(tuple(sorted(self.items())) + (self.parent,))
  File "/home/rth/src/jwzthreading/jwzthreading/jwzthreading.py", line 71, in __hash__
    return hash(tuple(sorted(self.items())) + (self.parent,))
  [Previous line repeated 996 more times]
RecursionError: maximum recursion depth exceeded while calling a Python object
The script used to reproduce this on the 20_newsgroups.tar.gz dataset can be found below:
import sys
from glob import glob
from email.parser import Parser

from jwzthreading import (Message, thread, print_container,
                          sort_threads)
from tqdm import tqdm

# Raise the recursion limit from Python's default of 1000
sys.setrecursionlimit(21000)

# Parse only the headers of every message in the dataset
msglist = []
for path in tqdm(glob('20_newsgroups/*/*')):
    with open(path, 'rt', encoding='latin1') as fh:
        msg = Parser().parsestr(fh.read(), headersonly=True)
    msglist.append(msg)

# Thread the messages, then sort the resulting threads by subject
threads = thread([Message(el, message_idx=idx)
                  for idx, el in enumerate(msglist)],
                 group_by_subject=False)
threads = sort_threads(threads, key='subject', missing='Z')

# Print the first 20 threads
for container in threads[:20]:
    print_container(container)
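The 20 newsgroups dataset has ~20000 messages, and Python's default recursion limit is 1000. Since `__hash__` includes `self.parent`, hashing a container also hashes each of its ancestors, one nested call per level. The only reason this could be happening is that it finds a thread more than 1000 emails deep.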
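To illustrate the mechanism, here is a minimal sketch that does not use jwzthreading at all. `Node`, `build_chain`, `chain_depth` and `hash_overflows` are hypothetical names made up for this example; only the body of `__hash__` is copied from the traceback. It assumes the container is dict-like with a `parent` attribute, which is what that line suggests.

```python
import sys


class Node(dict):
    """Hypothetical stand-in for a dict-like container with a parent link."""

    def __init__(self, parent=None, **kwargs):
        super().__init__(**kwargs)
        self.parent = parent

    def __hash__(self):
        # Same expression as in the traceback: hashing the tuple hashes
        # self.parent, which hashes its own parent, and so on upwards.
        return hash(tuple(sorted(self.items())) + (self.parent,))


def build_chain(depth):
    """Return the deepest node of a parent chain containing `depth` nodes."""
    node = None
    for idx in range(depth):
        node = Node(parent=node, message_idx=idx)
    return node


def chain_depth(node):
    """Count the node and its ancestors with a loop, so no recursion is used."""
    depth = 0
    while node is not None:
        depth += 1
        node = node.parent
    return depth


def hash_overflows(node):
    """Return True if hashing `node` exceeds the recursion limit."""
    try:
        hash(node)
    except RecursionError:
        return True
    return False


shallow = build_chain(10)
deep = build_chain(sys.getrecursionlimit() + 100)

print(chain_depth(shallow), hash_overflows(shallow))
print(chain_depth(deep), hash_overflows(deep))
```

The shallow chain hashes without trouble, while the chain that is deeper than the recursion limit raises `RecursionError`. Walking the parent links with a loop, as `chain_depth` does, is one way to measure how deep a chain is without hitting the limit.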
Looking for a fix.