Add source names (via new Stream and SourceSpan classes) and .span() combinator
#83
Force-pushed: e64a7f5 to 6579cae
This primarily wraps the str|bytes|list that is the data to parse, but also adds the metadata `source` to hold a filename, URL, etc. where the data is from. Introducing this class also paves the way for eventually supporting streaming input data.
Force-pushed: cf9c189 to 58b00dd
Wrap the string, bytes, list into a Stream before calling parse.
Force-pushed: 58b00dd to 52ac956
Sorry about the million force-pushes to my local branch obscuring the history above. I wasn't able to get tox working locally, so I debugged using GH actions on my fork. Everything should be good to go now. I made a couple of changes to the workflows: namely removing Python 3.7, which is de jure unsupported by parsy at this point and is unavailable in GH actions now anyway. I added runners for Python 3.12 and 3.13, for which I added a kludge in setup.py since setuptools isn't part of Python 3.12 onward.
Hi @tsani - thanks so much for your work on this. At this point in its life, parsy is a pretty mature library, so breaking API compatibility is a really big deal, and not something that I would consider at this point for this feature. Breaking the [...] However, I think there shouldn't be a need to do that.

The first thing I think we can do is add a [...]

Somewhat harder is that we need to keep the interface for [...]

    from parsy import Parser, Result

    def consume(n):
        @Parser
        def consumer(stream, index):
            # Take the next n items and succeed only if all n are available.
            items = stream[index:index + n]
            if len(items) == n:
                return Result.success(index + n, items)
            else:
                return Result.failure(index, "{0} items".format(n))
        return consumer

This means that [...] This is a harder constraint, but there are some ways forward:
Proof of concept code:

    class StrStream(str):
        def __new__(cls, string, source):
            instance = super().__new__(cls, string)
            instance.source = source
            return instance

    >>> s = StrStream('some text', 'myfile.txt')
    >>> s
    'some text'
    >>> s.split(' ')
    ['some', 'text']
    >>> s.source
    'myfile.txt'
    >>> isinstance(s, str)
    True

My expectation is that there shouldn't be any need to change any of the existing test suite - it should run without modification, any breakage is telling us that we've possibly broken someone else's code too.
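As a rough illustration of why the `str` subclass keeps existing parsers working, the sketch below applies the same slice-and-length check that `consume(n)` uses to a `StrStream`. The helper `take` is hypothetical and exists only for this example; it is not part of parsy. The sketch also shows one caveat that seems worth keeping in mind: slices of a `str` subclass are plain `str` objects, so `source` is only available on the original stream.

```python
class StrStream(str):
    """A str that also remembers where its text came from."""

    def __new__(cls, string, source):
        instance = super().__new__(cls, string)
        instance.source = source
        return instance


def take(stream, index, n):
    """Hypothetical helper mirroring the slice-and-check step in consume(n)."""
    items = stream[index:index + n]
    return items if len(items) == n else None


s = StrStream("some text", "myfile.txt")

first = take(s, 0, 4)      # slicing works exactly as it does for a plain str
too_far = take(s, 5, 10)   # not enough input left, so the helper returns None

# Slices of a str subclass are plain str objects, so `source` lives only on
# the original stream object, not on the pieces cut from it.
slice_is_plain_str = type(first) is str
```

Because a parser receives the whole stream plus an index rather than a slice, reading `stream.source` inside a parser would still work under this approach.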
BTW - I have done some work on the CI etc., and switched to uv for packaging, and merged that to master, so those parts of the PR shouldn't be needed any more. You might want to start a new branch and cherry-pick what you need. Sorry for the extra work!
Hey @spookylukey, thanks for the input on this! I made a new PR with the changes here: #85
Following from #82:

I've gone ahead with the name `source`. That makes the most sense to me as it could be something more abstract like `<stdin>` or a URL as you mentioned.

I opted against changing `mark` at all, since this would cause parsers involving it to break when parsing a data stream equipped with a source.

This approach seemed the best to me. I created a dataclass `SourceSpan` to hold the start and end row and column alongside the `source`, adjusted the `line_info` and `line_info_at` helpers to account for the `source`, and introduced the method `.span()` as the improved version of `.mark()` to augment the result of the parser with a `SourceSpan` object.

The tricky part of the PR was that it wasn't so simple to "just thread a source name through the parsers." The parser objects themselves are completely stateless -- all the state is held within the data stream, which is just a string, list, or bytes object.
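To make the `SourceSpan` idea described above more concrete, here is a minimal standalone sketch. The field names (`source`, `start`, `end`) and the helpers `row_col_at` and `span_for` are assumptions made for this illustration; the actual dataclass and the `line_info` helpers in the PR may be shaped differently.

```python
from dataclasses import dataclass
from typing import Optional, Tuple


@dataclass(frozen=True)
class SourceSpan:
    """Hypothetical layout: where a parsed value came from."""
    source: Optional[str]      # filename, URL, "<stdin>", or None
    start: Tuple[int, int]     # (row, column), both 0-based
    end: Tuple[int, int]       # (row, column), both 0-based


def row_col_at(text: str, index: int) -> Tuple[int, int]:
    """Convert a flat index into a 0-based (row, column) pair."""
    row = text.count("\n", 0, index)
    last_newline = text.rfind("\n", 0, index)   # -1 when on the first row
    return (row, index - (last_newline + 1))


def span_for(text: str, start: int, end: int, source: Optional[str] = None) -> SourceSpan:
    """Build a span covering text[start:end]."""
    return SourceSpan(source, row_col_at(text, start), row_col_at(text, end))


text = "first line\nsecond line"
span = span_for(text, 11, 17, source="myfile.txt")   # covers the word "second"
```

Here `span.start` is `(1, 0)` and `span.end` is `(1, 6)`, so an error message can report both the file and the position within it.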
I created a class `Stream` to wrap the underlying data stream, making it possible to add extra fields; in this case, that's just `source`. Then `.parse()` takes a Stream instead of "raw" data. This does create a breaking change in the API, as anyone calling `parse` must pass a Stream as input. (Fixing the tests to account for this was super fun.)

I believe that introducing Stream is important for the future since it's common for parsers to work on data in a truly streaming fashion. The current design of parsy requires all the data to parse to be buffered upfront, so adding genuine streaming will take a lot more effort.
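A wrapper of the kind described above might look roughly like the following. This is only a sketch under assumptions: the attribute name `data` and the choice to delegate `len()` and indexing to the wrapped object are illustrative, and the real `Stream` class in the PR may differ.

```python
from dataclasses import dataclass
from typing import Optional, Union


@dataclass(frozen=True)
class Stream:
    """Hypothetical wrapper: the data to parse plus where it came from."""
    data: Union[str, bytes, list]
    source: Optional[str] = None

    def __len__(self):
        # Delegate so code that checks len(stream) keeps working.
        return len(self.data)

    def __getitem__(self, key):
        # Delegate indexing and slicing to the wrapped data.
        return self.data[key]


stream = Stream("some text", source="myfile.txt")
prefix = stream[0:4]          # slices come from the underlying str
tokens = Stream(["a", "b"])   # lists of tokens can be wrapped the same way
```

Delegating `__len__` and `__getitem__` keeps slice-based primitives working, but anything that calls `str` methods directly on the stream would still need to reach through `data`.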
To eliminate the API breakage, I added a somewhat ugly `isinstance` check in `.parse()` to convert to a `Stream` (with no source name) when the user provides something else, so that this can be a patch release instead of a minor release.
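The compatibility check described above could be sketched as below. The function name `ensure_stream` and the minimal `Stream` definition are assumptions for illustration only; in the PR the check lives inside `.parse()` itself.

```python
from dataclasses import dataclass
from typing import Optional, Union


@dataclass(frozen=True)
class Stream:
    """Minimal stand-in for the wrapper class, for this example only."""
    data: Union[str, bytes, list]
    source: Optional[str] = None


def ensure_stream(data, source=None):
    """Hypothetical helper: accept raw data or an existing Stream."""
    if isinstance(data, Stream):
        return data                  # already wrapped, keep its source
    return Stream(data, source)      # raw str/bytes/list, no source by default


wrapped = ensure_stream("1 + 2")                     # old-style call with raw data
named = ensure_stream(Stream("1 + 2", "expr.txt"))   # new-style call with a Stream
```

With a check like this, existing callers that pass a plain string keep working, while callers that want source names can opt in by building a `Stream` themselves.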