From c68925b1d5760b5b1a1b7159b295af7709851ae8 Mon Sep 17 00:00:00 2001 From: Gabriel Viganotti Date: Tue, 6 Aug 2019 18:39:20 -0300 Subject: [PATCH] Define our own content-id encoding format for XOR-URLs --- text/0053-xor-urls/0053-xor-urls.md | 149 +++++++++++++--------------- 1 file changed, 68 insertions(+), 81 deletions(-) diff --git a/text/0053-xor-urls/0053-xor-urls.md b/text/0053-xor-urls/0053-xor-urls.md index fc3ac78..5f1501c 100644 --- a/text/0053-xor-urls/0053-xor-urls.md +++ b/text/0053-xor-urls/0053-xor-urls.md @@ -2,139 +2,126 @@ - Status: proposed - Type: new feature -- Related components: Safe Browser. Safe App Nodejs / SAFE Client libs. +- Related components: SAFE Browser, SAFE Client libs, SAFE CLI. - Start Date: (13-11-2018) - Discussion: https://forum.safedev.org/t/rfc-discussion-xor-urls/2365 -- Supersedes: (fill me in with a link to RFC this supersedes - if applicable) -- Superseded by: (fill me in with a link to RFC this is superseded by - if applicable) +- Supersedes: N/A +- Superseded by: N/A ## Summary -It is herein proposed to have a resolver function in the SAFE app client API that allows us to have standardised `safe://`-URLs which are generated based on the XOR address of the content being referenced. +It is herein proposed to have a resolver function in the SAFE client API that allows us to have standardised `safe://`-URLs which are generated based on the XoR address of the content being referenced, as well as encoding any other piece of information required to uniquely identify the content with a URL. + +In other words, such a URL (we'll call it XOR-URL from here on) would uniquely reference a single piece of native data on the SAFE Network, regardless of its type and content, which can be used by any SAFE client application to retrieve the data from the network. ## Conventions - The key words "MUST", "MUST NOT", "REQUIRED", "SHALL", "SHALL NOT", "SHOULD", "SHOULD NOT", "RECOMMENDED", "MAY", and "OPTIONAL" in this document are to be interpreted as described in [RFC 2119](http://tools.ietf.org/html/rfc2119). ## Motivation -Currently the SAFE app client API, and our SAFE browser, support fetching safesites and content stored on the SAFE Network with the DNS/public-names system, i.e. from URLs like `safe://./`, but it’s not possible to fetch data using their address on the network (i.e. the XOR address of any data stored on the network) without publishing it under a public name. +Currently the SAFE client APIs, and our SAFE browser, support fetching safesites and content stored on the SAFE Network with the NRS/public-names system, i.e. from URLs like `safe://./` ([see RFC-0052](https://github.com/maidsafe/rfcs/blob/master/text/0052-RDF-for-public-name-resolution/0052-RDF-for-public-name-resolution.md) for more info), but it's not possible to fetch data using their address on the network (i.e. the XoR address of any data stored on the Network) without publishing it under a public name. -This represents not only a restriction in some trivial use cases, like sharing a file by sharing its location and without publishing it under a public-name, but it also prevents users from being able to store linked-data on the network as these type of data cannot be linked unless it's all published at a public-name. +This represents not only a restriction in some trivial use cases, like sharing a file by sharing its location and without publishing it under a public-name, but it also means there is a lack of standard URLs for referencing any content stored on the SAFE Network, that can be resolved and fetched without the need to be published with the NRS/public-names system. This prevents the Network from supporting the Semantic Web since linked-data is defined to make use of URIs for the links (see http://linkeddata.org/home for some definitions). For an overview of the proposal, and foresight about its potential, [this screencast video](https://www.youtube.com/watch?v=BikfxRNARnM) explores some ideas around it with a working proof of concept. ## Detailed design -### XOR address encoding -The XOR address shall be encoded in the XOR-URL, [z-base32 encoding](http://philzimmermann.com/docs/human-oriented-base-32-encoding.txt) seems to be a good choice as it is case insensitive, as opposed to other case-sensitive encodings like base58btc or base64url, and it was designed to be easier for human use permuting the alphabet so that the easier characters are the ones that occur more frequently. - -### ImmutableData XOR-URL -XOR-URLs for ImmutableData’s are the simplest ones since they don’t need any additional information to uniquely identify them on the network, as opposed to MutableData’s that also have a type tag. Therefore an ImmutableData XOR-URL can be simply defined as `safe://`. - -### MutableData XOR-URL -As already mentioned above, a string based on the XOR address along with a type tag is needed to uniquely identify a MutableData on the network, therefore the XOR-URL for a MD needs to include the type tag in it. - -When a MutableData is fetched, if it’s not an NFS/Files container with an `index.html` file, the resolver (e.g. the `webFetch` function) can return the MutableData’s raw data so the browser (or any client application) can render it in a specific way (see below for more details of what’s being proposed here in this regard). - -#### MutableData versions -MutableData’s are versioned and therefore this shall be also accountable in the XOR-URL format to allow any MD URL to (optionally) reference a specific version. This will be also necessary in the future when append-only type of data is made available on the network. - -The version value can be used to enforce a specific version to be retrieved, and otherwise fail if that specific version is not found. On the other hand, the latest version will be retrieved if the version value is omitted from the XOR-URL. - -#### Paths -Given that a referenced MutableData can effectively be the root NFS/Files container of the files of a safesite, any MD XOR-URL can also specify a path which needs to be resolved, just like when using the DNS/public-names system to resolve the path in a URL. - -#### Browsable content -Currently, when a public-name URL is resolved to an NFS/Files container which doesn’t have an `index.html` file, the browser simply shows an error stating that the safesite content was not found. - -For a MD XOR-URL that doesn’t have a path, and there is no `index.html` file or a `default` file defined (for more details about this please see refer to [this other RFC](https://github.com/joshuef/rfcs/blob/PnsAndResolveableMap/text/0000-RDF-for-public-name-resolution/0000-RDF-for-public-name-resolution.md)), the resolver can return the raw content of the MD (i.e. its key-value entries), and the browser can automatically generate an HTML page which makes the content browsable, generating links to other data when an entry’s key/value is a `safe://` string, in an analogous/similar way to how web servers on the current internet allow browsing on folders/directories. - ### XOR-URLs specification -The following are the main requirements for the encoding to be used to generate the XOR-URLs: + +The following are the main requirements for the encoding format to be used to generate the XOR-URLs: - Be able to support new and different types of base encodings and hash functions for the XOR addresses in the future. -- Include the content type within the XOR URL which would allow the client app to correctly render the data to the user especially when referencing an ImmutableData. +- Include the content type within the XOR URL which would allow the client app to correctly render the data to the user, e.g. if it's an image it can launch an image viewer/editor, but if it's an audio file embed it in a module which makes it playable to the user. +- Encode the SAFE native data type where the content is stored on the Network so it can be retrieved using the corresponding, e.g. `CoinBalance`, `PublishedImmutableData`, etc. -It is proposed the use of [multiformats](https://github.com/multiformats) and [CID](https://github.com/ipld/cid), which allow us to cover the above requirements. A CID identifier can be used in the XOR-URL for specifying the XOR address part, and have additional parts to support MutableData’s type tag, version, as well as the path and query parts. SAFE URLs can then be defined in the following way (BNF 1-like): +With the introduction of XOR-URLs, as an addition to the existing NRS-URLs, we can define standard SAFE-URLs in the following way ([BNF-like](https://en.wikipedia.org/wiki/Backus%E2%80%93Naur_form)): ``` - := 'safe://' ( | ) + ::= "safe://" ( | ) - := [ '.'] + ::= *( ".") - := | + ::= - := + ::= *("/" ) [] ["#" ] - := ':' ['+' ] - - := ['/' ] - - := ['?' ] ['#' ] + ::= "?" ["v=" ] ``` Where: -- ``: follows the [CID format](https://github.com/ipld/cid) which self-describes and encodes: - - the base encoding for the string, it's here proposed to use [z-base32 encoding](http://philzimmermann.com/docs/human-oriented-base-32-encoding.txt) for the reasons explained above, - - the version of CID for upgradability purposes (v1 now), - - the content type or codec of the data being referenced, - - the XOR address of the content being referenced -- ``: for future implementation to reference versionable content, using a single address with different versions +- `` is a sequence of bytes (variable length, from 36 to 44 bytes) encoding the following information: + - 1 byte for XOR-URL encoding format version. This is version 1 + - 2 bytes for content type, see below for the supported types and codes to be used + - 1 byte for SAFE native data type, see below for the supported types and codes to be used + - 32 bytes for the content's XoR name + - 0 to 8 bytes for type tag value + It's is here proposed to use [z-base32 encoding](http://philzimmermann.com/docs/human-oriented-base-32-encoding.txt) for the base encoding of the `content-id` string for the reasons explained in that link -- ``: the type tag value if the CID is referencing a MutableData. In the absence of this value, the CID will be assumed to be targetting an ImmutableData +- ``: the path of the content, e.g. when referencing a `FilesContainer` -- ``: the path of the file if the CID is referencing a MutableData, which can be accessed through the NFS/Files convention (or other emulations/conventions in the future) and resolve it +- ``: the version of the content being referenced when it's versioned perpetual data. When the version is omitted from the URL, the latest available version of the content is assumed -- ``: query arguments, to be used by the client app and not for retrieving the content +- ``: query arguments, to be used by the client app and not by the resolver for retrieving the content -- ``: fragment of the content, to be used by the client app and not for retrieving the content +- ``: fragment of the content, to be used by the client app consuming the content and not for retrieving the content -The resolver function shall first attempt to decode the URL assuming it contains a `` part. If that step fails (either because it couldn’t decode the CID, or because there was no data at the decoded XOR address), it will do a fallback to assume it’s a public-name URL and it shall try to find the pulic-name MD (with the `sha3` hash of the string), just like any public-name URLs are resolved now by the resolver function. +The following SAFE native data types shall be supported in this first version, encoding them in the content-id with the listed codes: +- CoinBalance = 0x00 +- PublishedImmutableData = 0x01 +- UnpublishedImmutableData = 0x02 +- SeqMutableData = 0x03 +- UnseqMutableData = 0x04 +- PublishedSeqAppendOnlyData = 0x05 +- PublishedUnseqAppendOnlyData = 0x06 +- UnpublishedSeqAppendOnlyData = 0x07 +- UnpublishedUnseqAppendOnlyData = 0x08 -Note that it's proposed that the resolver should try to decode the CID and attempt to fetch the content before falling back to assume it's a public-name URL, rather than only decoding. The reason behind this is to not eclipse a public-name which happens to be a valid CID string. In such cases it could still be eclipsed if there exist content at the location encoded by the CID string, but at least not do it up front when there is no content stored at such location. +The following list of content types shall be supported as a minimum in the first version, with the listed codes (MIME types codes shall be also allocated discretionarily by the actual implementation): +- Raw = 0x0000 +- Wallet = 0x0001 +- FilesContainer = 0x0002 +- NrsMapContainer = 0x0003 -The reason why the `` is exposed in the URL has to do with the fact that it matches quite well if you think of an analogy with ports in the clearnet, i.e. the location where the content is stored/found is the same for several MutableData's each with a different type tag (just like different services listening at different ports at the same IP address on the clearnet). +The resolver function shall first attempt to decode the URL assuming it contains a `` part. If that step fails, it will do a fallback to assume it's an NRS/public-name URL and it shall try to find the NRS Map Container which address shall be the `sha3` hash of the content-id string. -Having the `` available to the user, allowing it to be changed in a URL to get a different version of the same content, enables the user to be explicitly aware of the version of the content it's being fetched. As mentioned above, if the user is willing to get the latest version of the targeted content then the `` part can simply be omitted from the URL. +Having the `` available to the user, allowing it to be changed in a URL to get a different version of the same content, enables the user to be explicitly aware of the version of the content it's being fetched. As mentioned above, if the user is willing to get the latest version of the targeted content then the content version part can simply be omitted from the URL. -Since the main goal is to have a standard way of referencing and linking any piece of data stored on the SAFE Network with a URL (as required for supporting [linked-data](http://linkeddata.org/home)), thus the `` and `` parts are reserved to be used by the application consuming the content rather than used by the resolver itself. Once the resolver has located the content and fetched it, the values provided in the `` and `` parts shall be passed back to the application which requested the content. The application can then define what it's the way of referring to a specific part of the document you are targeting with the URL. E.g. the [WebID spec](https://dvcs.w3.org/hg/WebID/raw-file/tip/spec/identity-respec.html#overview) makes use of the `#fragment` to refer to one of the subgraphs, i.e. one of the persons/agents described in the WebID Profile document. As another example, the `#fragment` and `query` values are used by the web page that is being referenced by a URL and not used by the DNS to locate the file/s of the web page. In the case of the SAFE Network, the `` could be used to refer to a specific entry of a MutableData, or the `` can also be used to provide a filter to be applied to the entries that need to be considered on a referenced MutableData. This all would be defined/determined by the convention followed by the application that is fetching the content from a XOR-URL. +Since the main goal is to have a standard way of referencing and linking any piece of data stored on the SAFE Network with a URL (as required for supporting [linked-data](http://linkeddata.org/home)), thus the `` and `` parts are reserved to be used by the application consuming the content rather than used by the resolver itself. Once the resolver has located the content and fetched it, the values provided in the `` and `` parts shall be passed back to the application which requested the content. The application can then define what it's the way of referring to a specific part of the document you are targeting with the URL. E.g. the [WebID spec](https://dvcs.w3.org/hg/WebID/raw-file/tip/spec/identity-respec.html#overview) makes use of the `#fragment` to refer to one of the subgraphs, i.e. one of the persons/agents described in the WebID Profile document. As another example, the `#fragment` and `query` values are used by the web page that is being referenced to and not by the NRS to locate the file/s of the web page. In the case of the SAFE Network, the `` could be used to refer to a specific entry of a `MutableData`, or the `` could be used to provide a filter to be applied to the entries on a referenced `MutableData`. This all would be defined/determined by the convention followed by the application that is fetching the content from a XOR-URL. The following are examples of what would become valid XOR-URLs as per this proposal: -ImmutableData XOR-URL: -`safe://hygjdkfty6m7ag3bckq7eqgeizbtjk915c3jbrcgtisad8iikbk4xws4jbpky` +`PublishedImmutableData` XOR-URL: +`safe://hbyyyydp3xqdhobs6utkua8x48pu4c3s7b7ay1tbj7734op4r3c5zi3y4b` -MutableData XOR-URL: -`safe://hyfktce8j75yhmj1dbi1xw5wnb4m3zdydr7wpbzf1a16hc3sbxzu8a9hiqw:15000` +`FilesContainer` XOR-URL with path to a file: +`safe://hnyynywp3xtjczdcng85986m5ixmqsqcgm4466h3pk8qk8nqn9zojthy4wbnc/some/folder/index.html` -NFS/Files container MutableData XOR-URL: -`safe://hyfktcenm57js4bm3owhez9td9pi3t8bzk1crqp7mr5865c15ih3yxpz68w:15008/some/folder/index.html` +Just as an example, given `safe://hnyynywp3xtjczdcng85986m5ixmqsqcgm4466h3pk8qk8nqn9zojthy4wbnc/some/folder/index.html?v=2&somekey=5#somesection` XOR-URL, which is referencing a `FilesContainer` which follows the convention to store hierarchy of files, the resolver will decode the URL in the following way to locate the content: +- `hnyynywp3xtjczdcng85986m5ixmqsqcgm4466h3pk8qk8nqn9zojthy4wbnc` is the `` part which can be decoded as follows: + - `z-base32` base encoding for the whole string + - `v1` of the encoding format + - `FilesContainer` deduced from the content type code + - `PublishedSeqAppendOnlyData` deduced from the native data type code + - `0x1b97c52cb8d8231f7f3f97babd6eb39865eb5ef732d51dca389c2fde098f01aa` XOR name/address to locate the content + - `1100` is the type tag value +- `/some/folder/index.html?v=2` is the path and the content version (v2) which (using the information above obtained from decoding the content-id string) will be used by the resolver to fetch the specified file from version 2 of the `FilesContaier` retrieved +- `&somekey=5#somesection` is the `` and `` parts respectively, which will passed back to the consuming application to use it accordingly. -Just as an example, given `safe://hyfktcenm57js4bm3owhez9td9pi3t8bzk1crqp7mr5865c15ih3yxpz68w:15008/some/folder/index.html#somesection?somekey=5` XOR-URL, which is referencing a MutableData which follows the convention to store hierarchy of files, the resolver will decode/decompose it in the following way to locate the content: -- `hyfktcenm57js4bm3owhez9td9pi3t8bzk1crqp7mr5865c15ih3yxpz68w` is the `` part which can be decoded by the CID spec as follows: - - `z-base32` base encoding for the whole CID string - - `v1` of the CID protocol/format - - `MutableData` content type - - `sha3-256` hash function was used to generate the XOR name/address - - `4bdf536d057985388bfe23fb6b989c3754984737ab26cfedb25baf3207b6fe3d` XOR name/address to locate the content -- `15008` is the `` part, the resolver will try to fetch a MutableData with this type tag and the decoded XOR name/address -- `/some/folder/index.html` is the path which will be used by the resolver to fetch the specified file from the MutableData retrieved. This MutableData shall be a 'Files container' otherwise the resolution will fail for this case. -- `#somesection?somekey=5` is the `` and `` parts which will be ignored by the resoler and simply returned back to the application for it to use it. +### Encrypted content -### Private content +It is assumed that the XOR-URL would never contain/encode decryption keys to support sharing encrypted content with a XOR-URL, just like the mechanism to fetch an encrypted MutableData requires the decryption keys to be obtained out of band, or with a separate mechanism not supported by the MutableData API itself. -It is assumed that the URL would never contain decryption keys, not even to support sharing private content with a XOR-URL, just like the mechanism to fetch a private MutableData requires the decryption keys to be obtained from the account our of band, or with a separate mechanism not supported by the MutableData API itself. Therefore for sharing/linking private content with a XOR-URL, it could still be possible but the resolver function will need to allow to receive the decryption keys as arguments, just like the `newPrivate` function of the MutableData API, and this decryption key would need to be fetched from the account somehow by the application before invoking the resolver. At the same time, it may be considered to have either a fallback mechanism, or an additional bit encoded in the URL to realise if the XOR-URL is targeting a private or public content to act accordingly. +Therefore for sharing/linking encrypted content with a XOR-URL, it could still be possible but the resolver function will need to allow to receive the decryption key as arguments. The decryption key could be fetched from the account somehow by the application before providing it to the resolver when invoking. ## Drawbacks -There is a possibility that a public-name happens to be a valid CID, and it's eclipsed by a XOR-URL with the same CID. Even that this is a drawback it should be an unlikely scenario since a public-name is meant to be for human readable URLs so having a public-name which collides with a valid CID doesn't seem to be something that can intentionally happen. Although it should be considered the case that someone may want to generate such a public-name to cheat users making them believe it's linking to an ImmutableData content when it's effectively not, or at least not until it gets eclipsed by an immutable content stored a the same location that is decoded from the CID. To prevent this possible malice events, perhaps it should be consider to have the resolver to restrict public-names in certain way, e.g. they cannot be of the exact same length as the XOR-URLs length (or length range). +There is a possibility that a public-name happens to be eclipsed by a XOR-URL. Even that this is a drawback it should be an unlikely scenario since a `public name` is meant to be for human readable URLs, thus having a public name which collides with a valid XOR-URL doesn't seem to be something that can intentionally happen. Although it should be considered the case that someone may want to generate such a public-name to cheat users making them believe it's linking to an ImmutableData content when it's effectively not, or at least not until it gets eclipsed by XOR-URL that effectively references immutable content. To prevent this possible malice events, perhaps it should be consider to have the resolver to restrict public-names in certain way, e.g. they cannot be of the exact same length as the XOR-URLs length (or length range). ## Alternatives -The alternative of not supporting this type of URLs implies there is a lack of standard URLs for referencing any content stored on the SAFE Network that can be resolved and fetched without the need to be published with the DNS/public-names system. This prevents the network from supporting the Semantic Web since linked-data is defined to make use of URIs for the links (see http://linkeddata.org/home for some definitions). - -Another alternative specifically to the proposed XOR-URL format/encoding is to have a separate protocol, e.g. `safe-xor://`, for identifying these URIs. However there doesn't seem to be much benefits from doing this, and the only evident one is to avoid the scenario described above in drawback section when a public-name can be eclipsed by a XOR-URL, which it should be an unlikely scenario. On the other hand, having a single protocol as the standard for all type of URLs that resolve to content on the SAFE Network should be seen as a more homogeneous design. +One alternative to the proposed encoding format is the use of [multiformats](https://github.com/multiformats) and [CID](https://github.com/ipld/cid) which has the benefit of being used by other project/s and a few of additional resources already available for it, like having libraries for several programming languages. A CID identifier can be used in the XOR-URL for specifying the XOR address part, and the [](https://github.com/multiformats/multicodec) part is also something we need. Although, in current version of its spec (v1) it lacks a place where we can encode the native data type as described in the proposed encoding above (i.e. `CoinBalance`, `PublishedImmutableData`, etc.). Another thing to consider is the current list of supported codecs in the [multicodec project](https://github.com/multiformats/multicodec) (which is part of the CID format) doesn't include the MIME-types that can be used for encoding in an ImmutableData XOR-URL. There is [a discussion](https://github.com/multiformats/multicodec/issues/4) which has been triggered to add such a support (see [PR sent](https://github.com/multiformats/multicodec/pull/84)) but it hasn't been approved yet. -## Unresolved questions +Therefore, Multiformats CID shouldn't be a discarded alternative, but it'll need these amendments (to be proposed to Multiformats project and resolved) before we can fully benefit from it while covering our needs. -The current list of supported codecs in the [multicodec project](https://github.com/multiformats/multicodec) (which is part of the CID format that it's proposed to use) doesn't include the MIME-types that can be used for encoding in an ImmutableData XOR-URL. There is [a discussion](https://github.com/multiformats/multicodec/issues/4) which has been triggered to add such a support (see [PR sent](https://github.com/multiformats/multicodec/pull/84)) but it hasn't been approved yet. +Another aspect specifically to the proposed XOR-URL format/encoding is to have a separate protocol, e.g. `safe-xor://`, for identifying these URIs. However there doesn't seem to be much benefits from doing this, and the only evident one is to avoid the scenario (described above in the [Drawbacks section](#Drawbacks)) when a public-name can be eclipsed by a XOR-URL, which should be an unlikely scenario. Having a single protocol as the standard for all type of URLs that resolve the location of the content on the SAFE Network should be seen as a more homogeneous design.