@@ -4,18 +4,22 @@ import Core: AbstractChar, Char
 
 """
 The `AbstractChar` type is the supertype of all character implementations
-in Julia. A character represents a Unicode code point, and can be converted
-to an integer via the [`codepoint`](@ref) function in order to obtain the
-numerical value of the code point, or constructed from the same integer.
-These numerical values determine how characters are compared with `<` and `==`,
-for example. New `T <: AbstractChar` types should define a `codepoint(::T)`
+in Julia. A character normally represents a Unicode codepoint (and can
+also encapsulate other information from an encoded byte sequence as described below),
+and characters can be converted to integer codepoint values via the [`codepoint`](@ref)
+function, or can be constructed from the same integer. At least for valid,
+properly encoded Unicode characters, these numerical codepoint values
+determine how characters are compared with `<` and `==`, for example.
+New `T <: AbstractChar` types should define a `codepoint(::T)`
 method and a `T(::UInt32)` constructor, at minimum.
 
 A given `AbstractChar` subtype may be capable of representing only a subset
 of Unicode, in which case conversion from an unsupported `UInt32` value
 may throw an error. Conversely, the built-in [`Char`](@ref) type represents
 a *superset* of Unicode (in order to losslessly encode invalid byte streams),
-in which case conversion of a non-Unicode value *to* `UInt32` throws an error.
+in which case conversion of a non-Unicode value *to* `UInt32` throws an error
+(see [`Base.ismalformed`](@ref)), and on the other hand a `Char` can also represent
+a nonstandard "overlong" encoding ([`Base.isoverlong`](@ref)) of a codepoint.
 The [`isvalid`](@ref) function can be used to check which codepoints are
 representable in a given `AbstractChar` type.
 
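As a rough illustration of the minimum interface this docstring asks for, the sketch below defines a hypothetical `ASCIIChar` type. The type name, the 8-bit layout, and the range check are assumptions made for the example and are not part of this commit or of Base.

```julia
# Hypothetical 8-bit character type, for illustration only; not part of Base.
primitive type ASCIIChar <: AbstractChar 8 end

# Construct from a raw byte without any validation.
ASCIIChar(b::UInt8) = reinterpret(ASCIIChar, b)

# The `T(::UInt32)` constructor: reject values outside the ASCII subset.
function ASCIIChar(u::UInt32)
    u < 0x80 || throw(ArgumentError("codepoint $u is not ASCII"))
    return ASCIIChar(u % UInt8)
end

# The `codepoint(::T)` method: a UInt8 is enough for this subset.
Base.codepoint(c::ASCIIChar) = reinterpret(UInt8, c)

a = ASCIIChar(UInt32('A'))
println(codepoint(a) === 0x41)   # the stored byte comes back as a UInt8
println(a == 'A')                # comparison uses the generic AbstractChar fallbacks
```

With only these two pieces defined, comparison against a `Char` is expected to work through the generic fallbacks, and an unsupported `UInt32` value throws, as the docstring allows.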
@@ -77,10 +81,19 @@
     codepoint(c::AbstractChar)::Integer
 
 Return the Unicode codepoint (an unsigned integer) corresponding
-to the character `c` (or throw an exception if `c` does not represent
-a valid character). For `Char`, this is a `UInt32` value, but
+to the character `c` (or throw an exception if `c` represents
+a malformed character). For `Char`, this is a `UInt32` value, but
 `AbstractChar` types that represent only a subset of Unicode may
 return a different-sized integer (e.g. `UInt8`).
+
+Should succeed for any non-malformed character, i.e. when
+[`Base.ismalformed(c)`](@ref) returns `false`. This includes
+invalid Unicode characters (such as unpaired surrogates)
+and overlong encodings.
+
+!!! compat "Julia 1.12"
+    Prior to Julia 1.12, `codepoint(c)` fails for overlong encodings (when
+    [`Base.isoverlong(c)`](@ref) is `true`), and `Base.decode_overlong(c)` was needed.
 """
 function codepoint end
 
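The distinction drawn in this docstring between malformed data and well-formed but invalid codepoints can be checked with a small sketch. The helper names `classify` and `try_codepoint` are hypothetical; only `Base.ismalformed`, `Base.isoverlong`, `isvalid`, and `codepoint` come from Base.

```julia
# Hypothetical helper: sort a character into one of four buckets.
function classify(c::AbstractChar)
    Base.ismalformed(c) && return :malformed
    Base.isoverlong(c)  && return :overlong
    isvalid(c)          || return :invalid_codepoint
    return :valid
end

# Hypothetical helper: the codepoint, or `nothing` when `codepoint` throws.
function try_codepoint(c::AbstractChar)
    try
        return codepoint(c)
    catch
        return nothing
    end
end

surrogate = Char(0x0000d800)   # unpaired surrogate: well-formed, not valid Unicode
beyond    = Char(0x00110000)   # past U+10FFFF: well-formed, not valid Unicode
stray     = "\xff"[1]          # a lone 0xff byte: malformed

for c in ('a', surrogate, beyond, stray)
    println(classify(c), " => ", repr(try_codepoint(c)))
end
```

The overlong case is left out of the loop on purpose, since whether `codepoint` succeeds there depends on the Julia release, per the compat note.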
@@ -116,10 +129,19 @@ end
 """
     ismalformed(c::AbstractChar)::Bool
 
-Return `true` if `c` represents malformed (non-Unicode) data according to the
+Return `true` if `c` represents malformed (non-codepoint / mis-encoded) data according to the
 encoding used by `c`. Defaults to `false` for non-`Char` types.
 
-See also [`show_invalid`](@ref).
+Any *non*-malformed `c` can be mapped to an integer codepoint
+by [`codepoint(c)`](@ref); this includes codepoints that are
+not valid Unicode characters ([`isvalid(c)`](@ref) is `false`).
+For example, well-formed characters can include invalid Unicode
+codepoints like `'\\U110000'`, unpaired surrogates such as `'\\ud800'`,
+and can also include overlong encodings ([`Base.isoverlong`](@ref)).
+Malformed data, in contrast, cannot be decoded to a codepoint
+(`codepoint` will throw an exception).
+
+See also [`Base.show_invalid`](@ref).
 """
 ismalformed(c::AbstractChar) = false
 
@@ -129,7 +151,7 @@ ismalformed(c::AbstractChar) = false
 Return `true` if `c` represents an overlong UTF-8 sequence. Defaults
 to `false` for non-`Char` types.
 
-See also [`decode_overlong`](@ref) and [`show_invalid`](@ref).
+See also [`Base.show_invalid`](@ref).
 """
 isoverlong(c::AbstractChar) = false
 
@@ -140,7 +162,7 @@ isoverlong(c::AbstractChar) = false
     l1 = leading_ones(u)
     t0 = trailing_zeros(u) & 56
     (l1 == 1) | (8l1 + t0 > 32) |
-    ((((u & 0x00c0c0c0) ⊻ 0x00808080) >> t0 != 0) | is_overlong_enc(u)) &&
+    (((u & 0x00c0c0c0) ⊻ 0x00808080) >> t0 != 0) &&
         throw_invalid_char(c)
     u &= 0xffffffff >> l1
     u >>= t0
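The masking and shifting in this hunk can be tried in isolation. The sketch below is a standalone approximation, assuming well-formed input and skipping the validity check. The name `decode_bits` is hypothetical, and the final packing step is copied from the `decode_overlong(c::Char)` body that this commit removes.

```julia
# Decode raw `Char` bits into a codepoint. Assumes `u` is well-formed;
# unlike the method in the diff, no validity check is performed.
function decode_bits(u::UInt32)
    l1 = leading_ones(u)           # length marker of the lead byte (0 for ASCII)
    t0 = trailing_zeros(u) & 56    # unused low bits, rounded down to whole bytes
    u &= 0xffffffff >> l1          # clear the length marker
    u >>= t0                       # drop the unused trailing bytes
    # Mask off the high bit of each byte and close the 2-bit gaps
    # between the 6-bit payloads.
    return ((u & 0x0000007f) >> 0) | ((u & 0x00007f00) >> 2) |
           ((u & 0x007f0000) >> 4) | ((u & 0x7f000000) >> 6)
end

for c in ('a', '\u00e9', '\u20ac', '\U1f600')
    bits = reinterpret(UInt32, c)  # the UTF-8 bytes of `c`, left-aligned in 32 bits
    println(repr(c), ": ", decode_bits(bits) == codepoint(c))
end
```

Because the overlong check is gone from the guard, the same arithmetic also yields a codepoint for an overlong sequence such as the two-byte encoding of U+0000.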
@@ -152,20 +174,18 @@ end
     decode_overlong(c::AbstractChar)::Integer
 
 When [`isoverlong(c)`](@ref) is `true`, `decode_overlong(c)` returns
-the Unicode codepoint value of `c`. `AbstractChar` implementations
-that support overlong encodings should implement `Base.decode_overlong`.
+the Unicode codepoint value of `c`. Deprecated in favor of
+`codepoint(c)`.
+
+!!! compat "Julia 1.12"
+    In Julia 1.12 or later, `decode_overlong(c)` simply calls
+    `codepoint(c)`, which should now work for overlong encodings.
+    `AbstractChar` implementations that support overlong encodings
+    should implement `Base.decode_overlong` on older releases.
 """
 function decode_overlong end
 
-@constprop :aggressive function decode_overlong(c::Char)
-    u = bitcast(UInt32, c)
-    l1 = leading_ones(u)
-    t0 = trailing_zeros(u) & 56
-    u &= 0xffffffff >> l1
-    u >>= t0
-    ((u & 0x0000007f) >> 0) | ((u & 0x00007f00) >> 2) |
-    ((u & 0x007f0000) >> 4) | ((u & 0x7f000000) >> 6)
-end
+@constprop :aggressive decode_overlong(c::AbstractChar) = codepoint(c)
 
 @constprop :aggressive function Char(u::UInt32)
     u < 0x80 && return bitcast(Char, u << 24)
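For code that has to run on releases both before and after this change, one possible pattern is to try `codepoint` first and fall back to `Base.decode_overlong`. This is only a sketch: the helper name `overlong_codepoint` is hypothetical, and it assumes `Base.decode_overlong` stays callable, as the deprecation note in this hunk suggests.

```julia
# Hypothetical helper: decode an overlong character on old and new releases alike.
function overlong_codepoint(c::AbstractChar)
    try
        return codepoint(c)              # expected to succeed on releases with this change
    catch
        return Base.decode_overlong(c)   # fallback where `codepoint` still throws
    end
end

c = "\xc0\x80"[1]                  # two-byte (overlong) encoding of U+0000
println(Base.isoverlong(c))
println(Int(overlong_codepoint(c)))
```

A version check would avoid the `try`, but it ties the code to the exact release in which the change lands.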
@@ -277,7 +297,7 @@ function show_invalid(io::IO, c::Char)
 end
 
 """
-    show_invalid(io::IO, c::AbstractChar)
+    Base.show_invalid(io::IO, c::AbstractChar)
 
 Called by `show(io, c)` when [`isoverlong(c)`](@ref) or
 [`ismalformed(c)`](@ref) return `true`. Subclasses
@@ -330,7 +350,7 @@ function show(io::IO, ::MIME"text/plain", c::T) where {T<:AbstractChar}
         print(io, ": ")
         if isoverlong(c)
             print(io, "[overlong] ")
-            u = decode_overlong(c)
+            u = decode_overlong(c) # backwards compat Julia < 1.12
             c = T(u)
         else
             u = codepoint(c)