Commit fd419bd

make codepoint(c) work for overlong chars (#55152)
As discussed in #54393, `codepoint(c)` should succeed for overlong encodings, and whenever `ismalformed(c)` returns `false`. This should be backwards compatible since it simply removes an error, and should be strictly faster than before since it merely removes a call to `Base.is_overlong_enc`. Also, `Base.ismalformed` and `Base.isoverlong` are declared `public` (but not yet exported) and are included in the manual, since they are referenced in the docstring of `codepoint` etcetera. I also made `Base.show_invalid` a `public` and documented function, since it is referenced from the `ismalformed` docs and is required by new implementations of `AbstractChar` types that support malformed data. Fixes #54343, closes #54393.
1 parent 21f3b37 commit fd419bd
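
The behavior change described in the commit message can be sketched in Julia as follows. This is an illustration only: `safe_codepoint` is a hypothetical helper that is not part of `Base`, and the two literals are example inputs (the overlong one is taken from the new tests). The `try`/`catch` fallback to `Base.decode_overlong` is an assumption added so that the snippet also runs on releases that predate this commit.

```julia
# Hypothetical helper (not part of Base): return the codepoint of `c`,
# or `nothing` when `c` is malformed and therefore has no codepoint.
function safe_codepoint(c::AbstractChar)
    Base.ismalformed(c) && return nothing
    try
        return codepoint(c)              # succeeds for overlong chars after this commit
    catch err
        err isa Base.InvalidCharError || rethrow()
        return Base.decode_overlong(c)   # fallback for releases before this commit
    end
end

overlong_space = '\xc0\xa0'   # two-byte overlong encoding of U+0020
lone_byte      = '\xc0'       # truncated sequence, reported as malformed

@show Base.isoverlong(overlong_space)
@show safe_codepoint(overlong_space)
@show safe_codepoint(lone_byte)
```

With this commit applied, the `catch` branch should never run for `Char`, since `codepoint` is documented to succeed whenever `Base.ismalformed(c)` is `false`.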

5 files changed: +67 -27 lines changed

NEWS.md

Lines changed: 3 additions & 0 deletions
@@ -84,6 +84,9 @@ New library features
 Standard library changes
 ------------------------
 
+* `codepoint(c)` now succeeds for overlong encodings. `Base.ismalformed`, `Base.isoverlong`, and
+  `Base.show_invalid` are now `public` and documented (but not exported) ([#55152]).
+
 #### JuliaSyntaxHighlighting
 
 #### LinearAlgebra

base/char.jl

Lines changed: 45 additions & 25 deletions
@@ -4,18 +4,22 @@ import Core: AbstractChar, Char
 
 """
 The `AbstractChar` type is the supertype of all character implementations
-in Julia. A character represents a Unicode code point, and can be converted
-to an integer via the [`codepoint`](@ref) function in order to obtain the
-numerical value of the code point, or constructed from the same integer.
-These numerical values determine how characters are compared with `<` and `==`,
-for example. New `T <: AbstractChar` types should define a `codepoint(::T)`
+in Julia. A character normally represents a Unicode codepoint (and can
+also encapsulate other information from an encoded byte sequence as described below),
+and characters can be converted to integer codepoint values via the [`codepoint`](@ref)
+function, or can be constructed from the same integer. At least for valid,
+properly encoded Unicode characters, these numerical codepoint values
+determine how characters are compared with `<` and `==`, for example.
+New `T <: AbstractChar` types should define a `codepoint(::T)`
 method and a `T(::UInt32)` constructor, at minimum.
 
 A given `AbstractChar` subtype may be capable of representing only a subset
 of Unicode, in which case conversion from an unsupported `UInt32` value
 may throw an error. Conversely, the built-in [`Char`](@ref) type represents
 a *superset* of Unicode (in order to losslessly encode invalid byte streams),
-in which case conversion of a non-Unicode value *to* `UInt32` throws an error.
+in which case conversion of a non-Unicode value *to* `UInt32` throws an error
+(see [`Base.ismalformed`](@ref)), and on the other hand a `Char` can also represent
+a nonstandard "overlong" encoding ([`Base.isoverlong`](@ref)) of a codepoint.
 The [`isvalid`](@ref) function can be used to check which codepoints are
 representable in a given `AbstractChar` type.
 
@@ -77,10 +81,19 @@ end
     codepoint(c::AbstractChar)::Integer
 
 Return the Unicode codepoint (an unsigned integer) corresponding
-to the character `c` (or throw an exception if `c` does not represent
-a valid character). For `Char`, this is a `UInt32` value, but
+to the character `c` (or throw an exception if `c` represents
+a malformed character). For `Char`, this is a `UInt32` value, but
 `AbstractChar` types that represent only a subset of Unicode may
 return a different-sized integer (e.g. `UInt8`).
+
+Should succeed for any non-malformed character, i.e. when
+[`Base.ismalformed(c)`](@ref) returns `false`. This includes
+invalid Unicode characters (such as unpaired surrogates)
+and overlong encodings.
+
+!!! compat "Julia 1.12"
+    Prior to Julia 1.12, `codepoint(c)` fails for overlong encodings (when
+    [`Base.isoverlong(c)`](@ref) is `true`), and `Base.decode_overlong(c)` was needed.
 """
 function codepoint end
 
@@ -116,10 +129,19 @@ end
 """
     ismalformed(c::AbstractChar)::Bool
 
-Return `true` if `c` represents malformed (non-Unicode) data according to the
+Return `true` if `c` represents malformed (non-codepoint / mis-encoded) data according to the
 encoding used by `c`. Defaults to `false` for non-`Char` types.
 
-See also [`show_invalid`](@ref).
+Any *non*-malformed `c` can be mapped to an integer codepoint
+by [`codepoint(c)`](@ref); this includes codepoints that are
+not valid Unicode characters ([`isvalid(c)`](@ref) is `false`).
+For example, well-formed characters can include invalid Unicode
+codepoints like `'\\U110000'`, unpaired surrogates such as `'\\ud800'`,
+and can also include overlong encodings ([`Base.isoverlong`](@ref)).
+Malformed data, in contrast, cannot be decoded to a codepoint
+(`codepoint` will throw an exception).
+
+See also [`Base.show_invalid`](@ref).
 """
 ismalformed(c::AbstractChar) = false
 
@@ -129,7 +151,7 @@ ismalformed(c::AbstractChar) = false
 Return `true` if `c` represents an overlong UTF-8 sequence. Defaults
 to `false` for non-`Char` types.
 
-See also [`decode_overlong`](@ref) and [`show_invalid`](@ref).
+See also [`Base.show_invalid`](@ref).
 """
 isoverlong(c::AbstractChar) = false
 
@@ -140,7 +162,7 @@ isoverlong(c::AbstractChar) = false
     l1 = leading_ones(u)
     t0 = trailing_zeros(u) & 56
     (l1 == 1) | (8l1 + t0 > 32) |
-    ((((u & 0x00c0c0c0) ⊻ 0x00808080) >> t0 != 0) | is_overlong_enc(u)) &&
+    (((u & 0x00c0c0c0) ⊻ 0x00808080) >> t0 != 0) &&
         throw_invalid_char(c)
     u &= 0xffffffff >> l1
     u >>= t0
@@ -152,20 +174,18 @@ end
     decode_overlong(c::AbstractChar)::Integer
 
 When [`isoverlong(c)`](@ref) is `true`, `decode_overlong(c)` returns
-the Unicode codepoint value of `c`. `AbstractChar` implementations
-that support overlong encodings should implement `Base.decode_overlong`.
+the Unicode codepoint value of `c`. Deprecated in favor of
+`codepoint(c)`.
+
+!!! compat "Julia 1.12"
+    In Julia 1.12 or later, `decode_overlong(c)` simply calls
+    `codepoint(c)`, which should now work for overlong encodings.
+    `AbstractChar` implementations that support overlong encodings
+    should implement `Base.decode_overlong` on older releases.
 """
 function decode_overlong end
 
-@constprop :aggressive function decode_overlong(c::Char)
-    u = bitcast(UInt32, c)
-    l1 = leading_ones(u)
-    t0 = trailing_zeros(u) & 56
-    u &= 0xffffffff >> l1
-    u >>= t0
-    ((u & 0x0000007f) >> 0) | ((u & 0x00007f00) >> 2) |
-    ((u & 0x007f0000) >> 4) | ((u & 0x7f000000) >> 6)
-end
+@constprop :aggressive decode_overlong(c::AbstractChar) = codepoint(c)
 
 @constprop :aggressive function Char(u::UInt32)
     u < 0x80 && return bitcast(Char, u << 24)
@@ -277,7 +297,7 @@ function show_invalid(io::IO, c::Char)
 end
 
 """
-    show_invalid(io::IO, c::AbstractChar)
+    Base.show_invalid(io::IO, c::AbstractChar)
 
 Called by `show(io, c)` when [`isoverlong(c)`](@ref) or
 [`ismalformed(c)`](@ref) return `true`. Subclasses
@@ -330,7 +350,7 @@ function show(io::IO, ::MIME"text/plain", c::T) where {T<:AbstractChar}
         print(io, ": ")
         if isoverlong(c)
             print(io, "[overlong] ")
-            u = decode_overlong(c)
+            u = decode_overlong(c) # backwards compat Julia < 1.12
             c = T(u)
         else
             u = codepoint(c)

base/public.jl

Lines changed: 5 additions & 0 deletions
@@ -112,6 +112,11 @@ public
 # Strings
     escape_raw_string,
 
+# Chars
+    ismalformed,
+    isoverlong,
+    show_invalid,
+
 # IO
     # types
     BufferStream,

doc/src/base/strings.md

Lines changed: 3 additions & 0 deletions
@@ -35,6 +35,9 @@ Base.Docs.@text_str
 Base.isvalid(::Any)
 Base.isvalid(::Any, ::Any)
 Base.isvalid(::AbstractString, ::Integer)
+Base.ismalformed
+Base.isoverlong
+Base.show_invalid
 Base.match
 Base.eachmatch
 Base.RegexMatch

test/char.jl

Lines changed: 11 additions & 2 deletions
@@ -244,8 +244,8 @@ end
 
 @testset "overlong codes" begin
     function test_overlong(c::Char, n::Integer, rep::String)
-        if isvalid(c)
-            @test Int(c) == n
+        if !Base.ismalformed(c)
+            @test Int(c) == n == codepoint(c)
         else
             @test_throws Base.InvalidCharError UInt32(c)
         end
@@ -357,6 +357,15 @@ end
         "'\\xc0': Malformed UTF-8 (category Ma: Malformed, bad data)"
 end
 
+@testset "overlong, non-malformed chars" begin
+    c = ['\xc0\xa0', '\xf0\x8e\x80\x80']
+    @test all(Base.isoverlong, c)
+    @test !any(Base.ismalformed, c)
+    @test repr("text/plain", c[1]) == "'\\xc0\\xa0': [overlong] ASCII/Unicode U+0020 (category Zs: Separator, space)"
+    @test codepoint.(c) == [0x20, 0xE000]
+    @test isuppercase(c[1]) == isuppercase(c[2]) == false # issue #54343
+end
+
 @testset "More fallback tests" begin
     @test length(ASCIIChar('x')) == 1
     @test firstindex(ASCIIChar('x')) == 1
