Skip to content

Conversation

@gkampitakis
Copy link
Owner

Release new forked version with no security issue

iambus and others added 15 commits December 13, 2021 17:10
upgrade gopkg.in/yaml.v2
Use common cache of line contents between two texts in `DiffLinesToChars` to get line diffs correctly.
In some cases, line diffs cannot be retrieved correctly in the standard way (https://github.com/google/diff-match-patch/wiki/Line-or-Word-Diffs#line-mode).
In the below case, we failed to get line diffs correctly before this fix.

```go:main.go
package main

import (
	"fmt"

	"github.com/sergi/go-diff/diffmatchpatch"
)

const (
	text1 = `hoge:
  step11:
  - arrayitem1
  - arrayitem2
  step12:
    step21: hoge
    step22: -93
fuga: flatitem
`
	text2 = `hoge:
  step11:
  - arrayitem4
  - arrayitem2
  - arrayitem3
  step12:
    step21: hoge
    step22: -92
fuga: flatitem
`
)

func main() {
	dmp := diffmatchpatch.New()
	a, b, c := dmp.DiffLinesToChars(text1, text2)
	diffs := dmp.DiffMain(a, b, false)
	diffs = dmp.DiffCharsToLines(diffs, c)
	// diffs = dmp.DiffCleanupSemantic(diffs)
	fmt.Println(diffs)
}
```

```text:output
[{Insert hoge:
  step11:
hoge:
} {Equal hoge:
} {Insert hoge:
} {Equal   step11:
} {Insert hoge:
} {Equal   - arrayitem1
} {Insert hoge:
} {Equal   - arrayitem2
} {Insert hoge:
} {Equal   step12:
} {Insert hoge:
} {Equal     step21: hoge
} {Insert hoge:
} {Equal     step22: -93
} {Delete fuga: flatitem
}]
```

Note: This fix corresponds to a javascript implementation.
(ref: https://github.com/google/diff-match-patch/blob/62f2e689f498f9c92dbc588c58750addec9b1654/javascript/diff_match_patch_uncompressed.js#L466)
Use common lineHash to share indice between text1 and text2 for correct line diffs
Current implementation produces wrong result because
it calls `DiffMain` on the following 2 arguments:

* `1,2,3,4,5,6,7,8,9,10`
* `1,2,3,4,5,6,7,8,9,11`

This numbers represent indices into the lines array.
The algorithm finds that equal part of those
strings is `1,2,3,4,5,6,7,8,9,1` and which is followed
by `Delete 0` and `Insert `1`.
[The suggested approach](https://github.com/google/diff-match-patch/wiki/Line-or-Word-Diffs#line-mode
) for doing line level diffing is the following set of steps:

1. `ti1, ti2, linesIdx = DiffLinesToChars(t1, t2)`
2. `diffs = DiffMain(ti1, ti2)`
3. `DiffCharsToLines(diff, linesIdx)`

The original implementation in `google/diff-match-patch` uses
unicode codepoints for storing indices in `ti1` and `ti2` joined by an empty string.
Current implementation in this repo stores them as integers joined by a
comma. While this implementation makes `ti1` and `ti2` more readable, it
introduces bugs when trying to rely on it when doing line level diffing
with `DiffMain`. The root cause of the issue is that an integer line
index might span more than one character/rune, and `DiffMain` can assume
that two different lines having the same index prefix match partially. For
example, indices 123 and 129 will have partial match `12`. In that
example, the diff will show lines 3 and 9 which is not correct. A simple
failing test case demonstrating this issue is available at
`TestDiffPartialLineIndex`.

In this PR I am adjusting the algorithm to use the same approach as in
[diff-match-patch](https://github.com/google/diff-match-patch/blob/62f2e689f498f9c92dbc588c58750addec9b1654/javascript/diff_match_patch_uncompressed.js#L508-L510
) by storing each line index as a rune.
While a rune in Golang is a type alias to uint32, not every uint32
can be a valid rune. During string to rune slice conversion invalid runes will
be replaced with `utf.RuneError`.

The integer to rune generation logic is based on the table in https://en.wikipedia.org/wiki/UTF-8#Encoding

The first 127 lines will work the fastest as they are represented as a
single bytes. Higher numbers are represented as 2-4 bytes.

In addition to that, the range `U+D800 - U+DFFF` contains
[invalid codepoints](https://en.wikipedia.org/wiki/UTF-8#Invalid_sequences_and_error_handling).
and all codepoints higher or equal to `0xD800` are incremented by
`0xDFFF - 0xD800`.

The maximum representable integer using this approach is 1'112'060.
This improves on Javascript implementation which currently
[bails out](https://github.com/google/diff-match-patch/blob/62f2e689f498f9c92dbc588c58750addec9b1654/javascript/diff_match_patch_uncompressed.js#L503-L505
) when files have more than 65535 lines.
Fix line diff by using rune index without a separator
…red_text

Fix colored display for diffs with newlines
@gkampitakis gkampitakis deleted the sync-upstream branch October 29, 2025 20:11
Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment

Labels

None yet

Projects

None yet

Development

Successfully merging this pull request may close these issues.

7 participants