Skip to content

Conversation

@jmarble
Copy link

@jmarble jmarble commented Oct 23, 2025

The Bug

array_unique() with SORT_REGULAR was failing to remove duplicate numeric strings when mixed with alphanumeric strings:

$units = ['5', '10', '3A', '5'];
array_unique($units, SORT_REGULAR);
// Before: ['5', '10', '3A', '5'] - duplicate '5' not removed 
// After:  ['5', '10', '3A']      - works correctly 

Root Cause

Non-transitive comparisons in SORT_REGULAR (where '5' == 5 and 5 != '5abc' but '5' < '5abc') broke the sort-based algorithm's assumption that sorting would group duplicates adjacently.

Solution

Implemented type-optimized hybrid approach for SORT_REGULAR:

Three-tier algorithm selection:

  1. Integer-only arrays: O(N) hash table for optimal performance
  2. Arrays/Objects present: O(N log N) sorting (works correctly for complex types)
  3. Simple mixed types: O(N) hash bucketing with dynamic sizing (64-16,384 buckets)

Key features:

  • Preserves full SORT_REGULAR type coercion semantics (1 == '1' == true)
  • Dynamic bucket sizing for memory efficiency
  • Comprehensive exception handling prevents memory leaks
  • Integer overflow protection on all allocation paths

Security Hardening

  • Eliminates Hash-DoS vulnerabilities through dynamic bucketing and type-based routing
  • Integer overflow checks prevent heap corruption
  • Exception-safe on all code paths
  • Proper resource handling with identity-based hashing

Performance

Improved performance across all data types compared to PHP 8.4.13 while fixing correctness issues.

Backward Compatibility

  • 100% BC compatible - preserves all existing coercion behavior
  • Uses identical comparison functions as current implementation
  • All existing tests pass (834/834)

Tests Added

  • gh20262.phpt - Minimal regression test for the bug
  • array_unique_variation_sort_regular.phpt - Comprehensive SORT_REGULAR behavior coverage (16 scenarios)

Closes #20262

@jmarble jmarble requested a review from bukka as a code owner October 23, 2025 23:06
@jmarble jmarble marked this pull request as draft October 23, 2025 23:35
@jmarble jmarble force-pushed the gh20262-array-unique-sort-regular branch from 4102fe6 to f60a45f Compare October 23, 2025 23:55
…h mixed strings

array_unique() with SORT_REGULAR was failing to remove duplicate numeric
strings when mixed with alphanumeric strings due to non-transitive
comparison issues in the sort-based algorithm.

Implemented hash-bucketing optimization for SORT_REGULAR that preserves
full type coercion semantics while improving performance from O(n²) to O(n).

Closes phpGH-20262
@jmarble jmarble force-pushed the gh20262-array-unique-sort-regular branch from f60a45f to c4fc6a9 Compare October 24, 2025 01:11
@jmarble jmarble marked this pull request as ready for review October 24, 2025 01:48
@jmarble jmarble marked this pull request as draft October 24, 2025 04:38
ZEND_HASH_FOREACH_KEY_VAL(Z_ARRVAL_P(array), num_key, str_key, val) {
/* Dereference if this is a reference */
zval *deref_val = val;
ZVAL_DEREF(deref_val);
Copy link
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

nit: might be relevant to add test case for it wdyt ?

Copy link
Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Thanks. This was added after a failed test in the circle jerk, I mean one of the CI checks. There's some other things I felt needed some TLC. Still passing all tests at the moment, but really trying to harden this up.

} else if (Z_TYPE_P(deref_val) == IS_OBJECT) {
/* Hash objects by class name */
zend_class_entry *ce = Z_OBJCE_P(deref_val);
hash = zend_string_hash_val(ce->name);
Copy link
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

How will this interact with inheritance?

Copy link
Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

I didn't note any issues with inheritance, but I ultimately found what I feel is a much more conservative approach that's still both performant and correct -- at least in regards to the original bug that was reported.

- Eliminate Hash-DoS and integer overflow vulnerabilities
- Add exception handling to prevent memory leaks
- Implement type-specific optimizations (hash/sort/bucket)
- Dynamic bucket sizing for memory efficiency
- Fix resource handling
Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment

Projects

None yet

Development

Successfully merging this pull request may close these issues.

array_unique() with SORT_REGULAR returns duplicate values

3 participants