Trusting sizes for reading strings leads to trimming when reading incorrectly formatted serialized session


The issue is the library is trusting the string size announced in the serialized data.

Taking this correct PHP session…

```
encryption_key|s:32:"abcdefghijklmnopqrstuvwxyzabcdef";session_id|i:42;
```

…`PHP.unserialize` will correctly parse as

```
{"encryption_key" => "abcdefghijklmnopqrstuvwxyzabcdef", "session_id" => 42}
```

But if the serialized session is incorrect (encryption_key value has a byte missing or declared size is too large by one)

```
encryption_key|s:32:"abcdefghijklmnopqrstuvwxyzabcde";session_id|i:42;
```

Then `PHP.unserialize` will go nasty, not only having encryption_key value overflowing (expected) but also removing one byte from the next key (unexpected).

```
{"encryption_key" => "abcdefghijklmnopqrstuvwxyzabcde\"", "ession_id" => 42}   
```

If parsing the same incorrect data with https://github.com/jurias/php-serial/blob/master/lib/php-serial.rb `Php.unserialize_session`, then encryption_key value is overflowing too, but the next key is not damaged.

```
{"encryption_key" => "abcdefghijklmnopqrstuvwxyzabcde\"", "session_id" => 42}
```

Now if the encryption_key value has an extra byte or declared size is too small by one

```
encryption_key|s:32:"abcdefghijklmnopqrstuvwxyzabcdefZ";session_id|i:42;
```

Then `PHP.unserialize` will no display the extra byte (expected) but will totally remove any remaining keys afterward (unexpected).

```
{"encryption_key" => "abcdefghijklmnopqrstuvwxyzabcdef"}
```

While with jurias's parser it will go just fine.

```
{"encryption_key" => "abcdefghijklmnopqrstuvwxyzabcdef", "session_id" => 42}
```

You may say that if the session file is generated by PHP, and not a third party library, the declared size will never be wrong. That may be true, but the problem doesn't lie there.

In fact, I have a session file where there is an `encryption_key` value is 32 bytes of random data, e.g. (do not rely on the example data below, the encoding may be messed up with copy pasting and IDE re-encoding when file is opened)

```
encryption_key|s:32:"�냝���ģ7;��-Vb��z�E=��[<�Q";
```

Note: the true byte value of the key is (in hex) a0 eb 83 9d b7 a2 d3 c4 a3 37 3b ac d0 2d 56 0e 62 ae 9a 7a d4 45 3d 9d 93 11 5b 10 3c 90 11 51

The issue is that when reading the file without specifying the encoding, Ruby will try to guess the encoding of the session file, and will end-up opening it as UTF-8.

```ruby
require_relative 'php_serialize'
serialized_session = File.read('sess_enckey')

require 'pp'
pp PHP.unserialize(serialized_session)
```

But `encryption_key` value may not be UTF-8. It may be something else. It's impossible to detect the correct encoding since it is unknown random data, the only thing we can assume is the size should be 32 bytes.

The original PHP session file is correct and has it's integrity guaranteed. In fact if I have a session file untouched (not opened in an IDE or text editor that could re-encode while saving or something), and I display it with an hex editor like xxd, heydump or hexyl, I can count the value is exactly 32 bytes.

But if I copy the original session file, open the copy in VS Code (that detects the file as encoded in UTF-8), make any transparent change, save the file (as UTF-8), then the value is now 56 bytes long. The re-encoded value (in hex) is ef bf bd eb 83 9d ef bf bd ef bf bd ef bf bd c4 a3 37 3b ef bf bd ef bf bd 2d 56 0e 62 ef bf bd ef bf bd 7a ef bf bd 45 3d ef bf bd ef bf bd 11 5b 10 3c ef bf bd 11 51.


Why is that? It's because encryption_key is probably 32 randomly generated raw bytes, not real data encoded in anything. The dev just stored the raw bytes in a variable and when PHP serializes it the bytes are stored as is "in a string". But when you open the file with whatever editor or script, you try to parse the raw bytes as UTF-8, which doesn't make sense.

And has we saw in the example, when the value is longer than the declared size, this parser is silently trimming anything after.

So as long as the PHP session file is untouched, the naive script with `File.read()` will work. But if the session file is opened in a IDE, the key value will be re-encoded, the value and value size will be wrong, then when the naive script will parse the session, anything after the key will be silently trimmed.

Yes I know it a bad idea to store raw bytes like that (not encoded in hex or base64) but I'm not the webapp author.
Also php-serialize (this gem) is not responsible for messing with the encoding, but only for having unexpected behavior when trying to parse incorrect serialized data. Indeed, as jurias/php-serial does, when incorrect data is encountered, only the corresponding key/value should be messed up, not everything that comes after that, which limits the damage.

I guess it's due to the streaming nature of https://github.com/jqr/php-serialize/blob/68c0b6f9a20c07d7498d3bce62a37f1e240eddfe/lib/php_serialize.rb#L170 which is a problem if incorrect length is trusted.

In jurias/php-serial, the value is read by trusting the declared length too https://github.com/jurias/php-serial/blob/e720d4f4c494fc8b709f860101d6bc11b9acf418/lib/php-serial.rb#L108 but the remaining data buffer is substraced from the previously read data not based on the declared length if it is a string but until a semicolon is encountered https://github.com/jurias/php-serial/blob/e720d4f4c494fc8b709f860101d6bc11b9acf418/lib/php-serial.rb#L109. Which is why with jurias/php-serial, the value is wrong but the next key/value are untouched.

I don't even know why php serialized format is including a declared length for strings or arrays? I mean, for integers like `i:123;` the value is directly read until `;` is encountered (as long as what is reads is only digits I guess). But for strings, why the format is not just `s:"content";` instead of `s:7:"content";` and reading until `";`, same for array until `}` is encountered. Is that for performance reasons, so it avoids computing the size? For streaming so that the length is known in advance and there is no need for look-ahead or look-behind? For parsing safety so that strings could contain `"` or `;` and array could contain strings that could contain `}`?

I don't know if there is a perfect solution for that? Indeed, the parser is expected to parse correctly formatted serialiazed data not to parse incorrectly formatted serialized data. It's true that a naive "read until" X char could have side effects.

By the way https://github.com/jqr/php-serialize/blob/68c0b6f9a20c07d7498d3bce62a37f1e240eddfe/lib/php_serialize.rb#L257C24-L258 should check that `string.read_until(';')` contains only digits else trigger and error (in strict mode) instead of casting with `to_i`. Same for floats (only digits and one dot) or any time `read_until` is used. Even for boolean in case the value is something else than 0 or 1. In lax mode it's ok just cast or assume.

Parsing is so hard.

PS: I expected this issue description to be higly unclear and hard to understand so don't hesitate to ask question about what I mean.


Provide feedback

Saved searches

Use saved searches to filter your results more quickly

Uh oh!

Trusting sizes for reading strings leads to trimming when reading incorrectly formatted serialized session #23

Metadata

Assignees

Labels

Projects

Milestone

Relationships

Development

Trusting sizes for reading strings leads to trimming when reading incorrectly formatted serialized session #23

Description

Metadata

Metadata

Assignees

Labels

Projects

Milestone

Relationships

Development

Issue actions