- Refactor tree building to not be recursive, so it doesn't segfault because of running out of stack space on really big files
- Fix Cnode crashing on `<!---->` due to myhtml_node_text returning a null pointer
- Fix Cnode benchmarks

Cnode benchmarks on master:
```
benchmark name                  iterations   average time
wikipedia_hyperlink.html 97k           200   7817.54 µs/op
w3c_html5.html 131k                    100   13872.61 µs/op
github_trending_js.html 341k            50   30911.76 µs/op
```
This branch:
```
benchmark name                  iterations   average time
wikipedia_hyperlink.html 97k           500   7045.92 µs/op
w3c_html5.html 131k                    200   10438.43 µs/op
github_trending_js.html 341k           100   23686.21 µs/op
```
c_src/myhtml_worker.c
if (tag_ns != MyHTML_NAMESPACE_HTML)
{
  // tag_ns_name_ptr is unmodifyable, copy it in our tag_ns_buffer to make it modifyable.
  tag_ns_buffer = malloc(tag_ns_len);
You're in a new scope, so you can use a VLA. You also want to add a second byte for the nul terminator.
char tag_ns_buffer[tag_ns_len + tag_name_len + 2];
snprintf(tag_ns_buffer, sizeof tag_ns_buffer, "%s:%s", tag_ns_name_ptr, tag_name);
lowercase(tag_ns_buffer);
> You also want to add a second byte for the nul terminator.
Erlang binary strings are not null-terminated. I still had to add a second byte because snprintf always writes a null byte at the end. Please tell me if there is a way to make it not do that, or whether I should just go back to the multiple strcpy calls that were there before.
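One way to answer the question above, as a sketch rather than a drop-in patch: when both lengths are already known, `memcpy` copies exactly the bytes requested and never appends a terminator, so the buffer needs only one extra byte for the `:`. The helper name `join_ns_name` and its parameters are hypothetical; the lowercasing is folded in here because the `lowercase` helper from the suggestion is not shown in this thread.

```c
#include <assert.h>
#include <ctype.h>
#include <stddef.h>
#include <string.h>

/* Hypothetical helper, not part of the PR.
   Writes "<ns>:<name>" in lowercase into out, without a trailing nul,
   and returns the number of bytes written.
   out must have room for ns_len + 1 + name_len bytes. */
size_t join_ns_name(char *out, const char *ns, size_t ns_len,
                    const char *name, size_t name_len) {
  assert(out != NULL && ns != NULL && name != NULL);
  size_t total = ns_len + 1 + name_len;

  memcpy(out, ns, ns_len);                  /* namespace, no terminator */
  out[ns_len] = ':';                        /* separator */
  memcpy(out + ns_len + 1, name, name_len); /* tag name, no terminator */

  for (size_t i = 0; i < total; i++) {
    out[i] = (char)tolower((unsigned char)out[i]);
  }
  return total;
}
```

The returned length is what would be passed along when building the binary. `snprintf` itself cannot be told to skip the terminator: for a non-zero size it always nul-terminates, so keeping it means keeping the extra byte.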
Hello everybody, thank you for submitting the patches and engaging in the conversation. I will review them by the end of this week. In general, I am very happy to see these improvements, and yes, my C is sadly not where it should be.
Would you be interested in further refactoring?
while we are here, we should replace all the |
7a6372a to 395d292
Done
Not sure if this is directed at me or at @Overbryd, but I am up for it.
@rinpatch I highly appreciate the efforts. Would you be up for a call sometime this or next week? You can reach me at my email address for that purpose. I would like to go over the changes and do a full review once you are done. I think it's faster and we can have a better discussion that way. Other people are invited to join the call too, of course.
To be honest, calls stress me out, I would much rather communicate over text. You can reach me over IMs if you want faster discussion though, I am |
Overbryd left a comment
I fully endorse the two main changes here:
- Using a stack to build the tree, rather than recursive function calls.
  I now understand this should be safer, and also faster because the stack is resized by 30 ETERMs at once.
- Preventing empty comments from blowing up.
I kindly ask for one cosmetic change: split the implementation out of tstack.h so that only the headers remain there.
ETERM* tag;
ETERM* attrs;
tstack stack;
tstack_init(&stack, 30);
I went through this change with a fellow programmer of mine, and now that I have followed it through and understood it, I think it's way smarter to use a stack here than my previous implementation.
Makes perfect sense and should be much safer.
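For readers following along, the general idea can be sketched as below. This is not the code from the PR: the `node` and `node_stack` types and the `count_nodes` function are hypothetical stand-ins, and the chunk size of 30 only mirrors the value passed to `tstack_init` in the diff. The point is that pending nodes live in a heap-allocated array that can grow, instead of in C call frames.

```c
#include <assert.h>
#include <stdlib.h>

/* Hypothetical tree node: first child plus next sibling. */
typedef struct node {
  struct node *child;
  struct node *next;
} node;

/* Hypothetical growable stack of node pointers (not the PR's tstack). */
typedef struct {
  node **data;
  size_t used;
  size_t size;
} node_stack;

#define GROW_BY 30 /* assumption: grow in chunks of 30 entries */

/* Returns 0 on success, -1 if the array could not grow. */
static int node_stack_push(node_stack *s, node *n) {
  if (s->used == s->size) {
    node **grown = realloc(s->data, (s->size + GROW_BY) * sizeof *grown);
    if (grown == NULL) return -1;
    s->data = grown;
    s->size += GROW_BY;
  }
  s->data[s->used++] = n;
  return 0;
}

static node *node_stack_pop(node_stack *s) {
  assert(s->used > 0);
  return s->data[--s->used];
}

/* Counts every node reachable from root without recursion.
   Returns the count, or -1 if memory ran out. */
long count_nodes(node *root) {
  node_stack s = {NULL, 0, 0};
  long count = 0;

  if (root != NULL && node_stack_push(&s, root) != 0) return -1;
  while (s.used > 0) {
    node *n = node_stack_pop(&s);
    count++;
    /* Push the sibling first so the child is visited next (depth-first). */
    if (n->next != NULL && node_stack_push(&s, n->next) != 0) {
      free(s.data);
      return -1;
    }
    if (n->child != NULL && node_stack_push(&s, n->child) != 0) {
      free(s.data);
      return -1;
    }
  }
  free(s.data);
  return count;
}
```

With this shape, a document nested hundreds of thousands of levels deep costs heap memory rather than call-stack frames, and running out of memory shows up as a failed `realloc` that can be handled instead of a segfault.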
  size_t size;
} tstack;

void tstack_init(tstack *stack, size_t initial_size) {
Can you move the implementation of each function to c_src/tstack.c and keep the headers in here?
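A rough sketch of what the requested split could look like, shown as one listing with the two files marked by comments. Only the `tstack` type name, its `size` field, and the `tstack_init` signature come from the diff above; the other fields, the `tstack_free`, `tstack_push` and `tstack_pop` names, the `void *` element type (standing in for `ETERM*`), and the doubling growth policy are all assumptions made for illustration.

```c
/* ---- c_src/tstack.h: type and prototypes only ---- */
#ifndef TSTACK_H
#define TSTACK_H

#include <stddef.h>

typedef void *tstack_item; /* assumption: stand-in for ETERM* */

typedef struct {
  tstack_item *data; /* assumed field */
  size_t used;       /* assumed field */
  size_t size;       /* assumed to mean allocated capacity */
} tstack;

void tstack_init(tstack *stack, size_t initial_size);
void tstack_free(tstack *stack);                  /* assumed name */
int tstack_push(tstack *stack, tstack_item item); /* assumed name */
tstack_item tstack_pop(tstack *stack);            /* assumed name */

#endif /* TSTACK_H */

/* ---- c_src/tstack.c: definitions ---- */
/* The real file would begin with: #include "tstack.h" */
#include <assert.h>
#include <stdlib.h>

void tstack_init(tstack *stack, size_t initial_size) {
  stack->data = malloc(initial_size * sizeof *stack->data);
  stack->used = 0;
  stack->size = stack->data != NULL ? initial_size : 0;
}

void tstack_free(tstack *stack) {
  free(stack->data);
  stack->data = NULL;
  stack->used = 0;
  stack->size = 0;
}

/* Returns 0 on success, -1 if the stack could not grow. */
int tstack_push(tstack *stack, tstack_item item) {
  if (stack->used == stack->size) {
    size_t new_size = stack->size != 0 ? stack->size * 2 : 1;
    tstack_item *grown = realloc(stack->data, new_size * sizeof *grown);
    if (grown == NULL) return -1;
    stack->data = grown;
    stack->size = new_size;
  }
  stack->data[stack->used++] = item;
  return 0;
}

tstack_item tstack_pop(tstack *stack) {
  assert(stack->used > 0); /* popping an empty stack is a caller bug */
  return stack->data[--stack->used];
}
```

The include guard lets several source files include the header safely, and keeping the definitions in a single `.c` file avoids duplicate-symbol errors at link time once more than one file uses the stack.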
  use MyhtmlexSharedTests, module: Myhtmlex.Safe

  test "doesn't segfault when <!----> is encountered" do
    assert {"html", _attrs, _children} = Myhtmlex.decode("<div> <!----> </div>")
Apologies for stuffing all of this into one commit, I can split it into
separate patches if you don't wish some of it merged.