This repository was archived by the owner on Aug 5, 2024. It is now read-only.


@emcsween

In the JavaScript version, some diffs cause the semantic alignment loop in diff_cleanupSemantic() to run many times. This happens when one file contains a long chunk of characters and the other file contains that same chunk twice in succession, i.e.:

File 1: <chunk A><chunk B><chunk C>
File 2: <chunk A><chunk B><chunk B><chunk C>

When this happens, the loop runs as many times as there are characters in chunk B. This can get quite expensive because three new strings are created in every iteration. This PR replaces these with faster index manipulations.
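To make the cost concrete, here is a simplified sketch of string-based shifting in the style described above. It is not the library's code: the scoring step is omitted, and the function name `shiftWithStrings` and the `iterations` counter are hypothetical, added only to show how many times the loop runs.

```javascript
// Simplified sketch of string-based shifting (scoring omitted).
// shiftWithStrings and the iterations counter are illustrative names only.
function shiftWithStrings(equality1, edit, equality2) {
  // Shift the edit as far left as possible: measure the common suffix
  // of equality1 and edit.
  let suffixLen = 0;
  while (
    suffixLen < equality1.length &&
    suffixLen < edit.length &&
    equality1[equality1.length - 1 - suffixLen] === edit[edit.length - 1 - suffixLen]
  ) {
    suffixLen++;
  }
  if (suffixLen > 0) {
    const common = edit.substring(edit.length - suffixLen);
    equality1 = equality1.substring(0, equality1.length - suffixLen);
    edit = common + edit.substring(0, edit.length - suffixLen);
    equality2 = common + equality2;
  }

  // Step right one character at a time. Each pass builds three new strings.
  let iterations = 0;
  while (edit.length > 0 && edit.charAt(0) === equality2.charAt(0)) {
    equality1 += edit.charAt(0);
    edit = edit.substring(1) + equality2.charAt(0);
    equality2 = equality2.substring(1);
    iterations++;
  }
  return { equality1, edit, equality2, iterations };
}

// With chunk B of length 100, the right-shift loop runs 100 times.
const result = shiftWithStrings('a'.repeat(50) + 'b'.repeat(100), 'b'.repeat(100), 'c'.repeat(50));
console.log(result.iterations);
```

Every pass through the second loop allocates new `equality1`, `edit` and `equality2` strings, so the work per iteration grows with the length of the strings involved.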

I tried to translate the algorithm line by line with the objective of changing nothing in its behaviour. Basically, instead of tracking 3 strings (equality1, edit, equality2), I track 1 string (buffer) and 2 indices (editStart, editEnd). They are related this way:

buffer: | equality1 | edit | equality2 |
                    ^      ^
       editStart ---+      +---- editEnd
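The relationship in the diagram can be sketched as follows. Only `buffer`, `editStart` and `editEnd` come from the description above; the helper names `viewOf` and `shiftRight` are hypothetical, and this is an illustration of the idea rather than the code in the PR.

```javascript
// Recover the three logical strings from one buffer and two indices.
// viewOf and shiftRight are illustrative names, not part of the library.
function viewOf(buffer, editStart, editEnd) {
  return {
    equality1: buffer.substring(0, editStart),
    edit: buffer.substring(editStart, editEnd),
    equality2: buffer.substring(editEnd),
  };
}

// Shifting the edit one character to the right only moves both indices.
// It is valid when the character entering the edit on the right equals
// the character leaving it on the left.
function shiftRight(buffer, editStart, editEnd) {
  if (editEnd >= buffer.length || buffer[editStart] !== buffer[editEnd]) {
    return null; // no further right shift is possible
  }
  return { editStart: editStart + 1, editEnd: editEnd + 1 };
}

const buffer = 'aabbbbcc';
console.log(viewOf(buffer, 2, 4)); // equality1 'aa', edit 'bb', equality2 'bbcc'
const next = shiftRight(buffer, 2, 4);
console.log(viewOf(buffer, next.editStart, next.editEnd)); // equality1 'aab', edit 'bb', equality2 'bcc'
```

No strings are created while shifting; substrings are only needed once the final position has been chosen.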

The other change I made was to the loop condition. The original code shifts the edit left as far as possible (using the common suffix between equality1 and edit) and then shifts right until the first characters of edit and equality2 differ. I changed that to counting the common prefix between edit and equality2 and adding it to the amount of left shift, which gives the total number of right shifts required.
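One way to compute the number of loop iterations up front, sketched under assumptions: the function name `countShifts` is hypothetical, and the right-hand count here is taken directly on the buffer (so it is not capped at the edit length), which is my reading of how to mirror the original character-by-character loop rather than the PR's actual code.

```javascript
// Count how many right-shift steps are needed after shifting the edit
// fully left. countShifts is an illustrative name, not part of the library.
function countShifts(buffer, editStart, editEnd) {
  // Left shift: common suffix of equality1 and edit, capped at the
  // shorter of the two, as a common-suffix function would be.
  let left = 0;
  while (
    left < editStart &&
    left < editEnd - editStart &&
    buffer[editStart - 1 - left] === buffer[editEnd - 1 - left]
  ) {
    left++;
  }

  // Further right shift from the original position: characters where the
  // start of the edit keeps matching the start of equality2.
  let right = 0;
  while (
    editEnd + right < buffer.length &&
    buffer[editStart + right] === buffer[editEnd + right]
  ) {
    right++;
  }

  // From the leftmost position, the first `left` steps undo the left
  // shift and the remaining `right` steps continue past the original spot.
  return { left, right, total: left + right };
}

const buffer = 'a'.repeat(50) + 'b'.repeat(200) + 'c'.repeat(50);
console.log(countShifts(buffer, 150, 250)); // edit is the second run of b's
console.log(countShifts(buffer, 50, 150));  // edit is the first run of b's
```

Both placements of the edit give the same total, which matches the observation that the loop runs once per character of chunk B.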

I used the following benchmark to force the loop to run an arbitrary number of times:

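// SIZE is the length of chunk B; vary it to control the number of loop iterations.
const SIZE = 1000;
// Assumes the module exposes the constructor as `diff_match_patch`.
const { diff_match_patch } = require('./diff_match_patch_uncompressed');
const dmp = new diff_match_patch();
const text1 = 'a'.repeat(50) + 'b'.repeat(SIZE) + 'c'.repeat(50);
const text2 = 'a'.repeat(50) + 'b'.repeat(SIZE * 2) + 'c'.repeat(50);
const diffs = dmp.diff_main(text1, text2);
dmp.diff_cleanupSemantic(diffs);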

Here are the timings I got for diff_cleanupSemantic() before and after this PR.

| SIZE      | before   | after    |
|-----------|----------|----------|
| 10        | 0.061 ms | 0.019 ms |
| 100       | 0.48 ms  | 0.03 ms  |
| 1,000     | 5.2 ms   | 0.13 ms  |
| 10,000    | 65 ms    | 1.2 ms   |
| 100,000   | 2.2 s    | 12.35 ms |
| 1,000,000 | too long | 111 ms   |
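The description does not say how these timings were taken. As one possible approach under Node.js, a small helper like the following could be used; the name `timeMs` and the best-of-N strategy are assumptions for illustration, not the method behind the numbers above.

```javascript
// Hypothetical timing helper: runs fn several times and returns the
// fastest run in milliseconds, using Node's high-resolution clock.
function timeMs(fn, runs = 5) {
  let best = Infinity;
  for (let i = 0; i < runs; i++) {
    const start = process.hrtime.bigint();
    fn();
    const elapsedMs = Number(process.hrtime.bigint() - start) / 1e6;
    best = Math.min(best, elapsedMs);
  }
  return best;
}

// Example with a stand-in workload. For the benchmark above, the diffs
// array would need to be rebuilt before each run, because
// diff_cleanupSemantic() modifies it in place.
const elapsed = timeMs(() => {
  let s = '';
  for (let i = 0; i < 1000; i++) s += 'b';
  return s;
});
console.log(elapsed.toFixed(3) + ' ms');
```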

Some diffs result in the semantic alignment loop being run many times.
This happens when comparing a file containing a long chunk of characters
with a similar file containing the same long chunk of characters twice
in succession.

Manipulating indexes rather than creating new strings at each iteration
makes the loop run much more quickly.