diff --git a/.github/settings.yml b/.github/settings.yml index 32910f6..bf421cf 100644 --- a/.github/settings.yml +++ b/.github/settings.yml @@ -37,3 +37,6 @@ collaborators: - username: lpabon permission: admin + + - username: dutchshark + permission: admin diff --git a/CNCF Storage Landscape - White Paper.pdf b/CNCF Storage Landscape - White Paper.pdf deleted file mode 100644 index 59d5b1e..0000000 Binary files a/CNCF Storage Landscape - White Paper.pdf and /dev/null differ diff --git a/CODE-OF-CONDUCT.md b/CODE-OF-CONDUCT.md new file mode 100644 index 0000000..40147b6 --- /dev/null +++ b/CODE-OF-CONDUCT.md @@ -0,0 +1,124 @@ +# Code of Conduct + +We follow the [CNCF Code of Conduct][cncf-coc]. + +As part of our pledge to respect all people, in both live in-person and online +interactions, we are committed to providing a friendly, safe and welcoming +environment for all, regardless of gender, gender identity and expression, +sexual orientation, disability, physical appearance, body size, race, religion, +native language, operating system choice, current software stack or prior +experience. + +In keeping with this commitment, we offer the following guidelines: + +* Welcome newcomers at any stage of expertise. + * Everyone has something to contribute. + * Everyone deserves access to materials and community that will help them + learn. + * As long as an individual can be respectful and not disruptive to other + participants, they deserve to participate. +* Provide open and free material. +* Be kind and courteous. + * Interpret the arguments of others in good faith, offering private + constructive feedback when communication style bears improvement. + * Leave space for quieter voices. +* Consider who is not in the room. + * Invite participation from experts or user community representatives + outside of the working group. + * Participate in online forums to be inclusive of those who cannot attend + meetings. 
+* Work performed within this group, either finalized or in draft, is to be + used in accordance with the group [Mission and + Charter][charter], + the open source license, and used for the equal benefit of all + members of the community. Further information on the use of work may be found + in [Security Reviews: + Outcome][review-outcome]. + +## Incident handling and escalation + +Content, for the purposes of the code of conduct as well as incident handling, is defined +not only as published or draft content but also online discourse, such as Slack +messages or emails, and interactions at in-person events. If an incident +involving community conduct occurs, please follow the guidelines below on how to +handle and report the issue: + +* If you see content that clearly does not meet the official Code of Conduct, + please send an e-mail to the Co-Chair/TL mailing list + (cncf-tag-security-leads@lists.cncf.io) and the creator of the content. (For + more details, refer to the [CNCF Code of Conduct][cncf-coc].) If it is + regarding a co-chair, reach out to the two other chairs directly if you are + uncomfortable using the mailing list. +* If you are uncomfortable with a piece of content (but it may not necessarily + violate the code of conduct), we suggest sending a private message to the + content owner expressing your concerns. If this is not resolved, you may wish + to request the help of a Co-Chair/TL via cncf-tag-security-leads@lists.cncf.io + to help mediate the situation. +* Discussions about these potential code of conduct violations and concerns are + important, and there are great avenues to discuss them. This includes bringing + up concerns to the [CNCF TOC][cncf-toc] (which can be done through discussion + with Security TAG leadership) or talking to Security TAG leadership about + moderating a post. 
To help ensure that we can give focus to these issues and + not tangle them up with technical discussions, we should keep these + discussions separate from channels that are focused on technical + exchange. + +For content creators: + +* Content must strive to remain _on-topic_, particularly where video and images + are provided. Emojis and gifs used as responses are content in and of + themselves and need to be relevant to the particular post. For examples, please + refer to the reference section below. +* If you receive a notice about a piece of content you've created, please seek + to understand that in some cases you may not agree with a decision or request. + Being able to practice tolerance and mindfulness is just as important in keeping + the community working towards a common goal. The mediation and resolution + system that we have in place aims to handle this with the hope that both + content creators and consumers are heard and represented. These situations are + not zero sum, and often we aim to reach an agreeable compromise where a + discussion of a topic can happen without making members of the community feel + uncomfortable. +* In the event of a disagreement, we have some guidelines that can + help prevent escalation: + * Do not take the discussion out of context. + * Do not rationalize the actions you take. We do not expect anyone to + understand what everyone else feels towards certain things (e.g. the same + gestures are good in certain cultures and bad in others). Understand that + something may not be wrong, but it may still affect others. + +In summary, be nice, inclusive and welcoming. Misunderstandings, mistakes and +oversights happen, and when they do, there are some good ways to go about having +a conversation with colleagues to make our community inclusive and welcoming to +everyone! + +## Reference + +Example of a reasonable gif: The group is close to wrapping up a deliverable; as part of +an update, the lead posts a "nearly done" gif. 
+ +Example of a reasonable emoji: A post in the group uses emojis to break up content +and is relevant to the item discussed, or emojis are used in response to a post to signify +voting, opinion, acceptance, emotion, etc. + +Example of a reasonable image or video: Posting a picture of a community meetup* +or posting a recording of a presentation on cloud native security. + +*Note: Many events within the community may include content which is only +acceptable depending on the context it is used in. An example of this is +alcohol consumption. It is important that, when posting photos and videos, members +consider whether the post glorifies alcohol or alcohol is the primary subject of the +content (unacceptable) or whether the alcohol is a happenstance occurrence in the image +(acceptable). + +## Inspiration + +The above guidelines are inspired by and borrowed from other communities: + +* +* +* + +[cncf-coc]: https://github.com/cncf/foundation/blob/master/code-of-conduct.md +[charter]: https://github.com/cncf/tag-security/blob/main/governance/charter.md +[review-outcome]: https://github.com/cncf/tag-security/tree/main/assessments#outcome +[cncf-toc]: https://www.cncf.io/people/technical-oversight-committee/ diff --git a/CODEOWNERS b/CODEOWNERS new file mode 100644 index 0000000..968263f --- /dev/null +++ b/CODEOWNERS @@ -0,0 +1,19 @@ +# The CODEOWNERS file indicates code owners for certain files +# +# Code owners will automatically be added as reviewers for PRs that touch +# the owned files. +# +# The main branch will be configured to require at least 1 approval from a +# code owner for a PR. +# +# Actions by community members should follow these guidelines: +# https://github.com/cncf/tag-storage/blob/main/governance/github.md + +# Global code owners: co-chairs, tech leads +# Note: Tech leads can perform approval and merging of PRs, with the intent +# of delegating some responsibilities from co-chairs. 
+# +# Tech Lead and Chair Emeritus roles should exercise discretion in deferring final +# approval for a PR to a co-chair. +# This includes major edits or new introductions to the repository. +* @chira001 @sougou @quinton-hoole @erinboyd @xing-yang @saad-ali @lpabon diff --git a/CONTRIB/README.md b/CONTRIB/README.md new file mode 100644 index 0000000..1c8263c --- /dev/null +++ b/CONTRIB/README.md @@ -0,0 +1,37 @@ +# Contributing + +We aspire to create a welcoming environment for collaboration on this project +and ask that all contributors do the same. For more details, see our [code of +conduct](/CODE-OF-CONDUCT.md). + +This document covers contributions to this git repository. Please review +[governance](/governance) for our mission, charter, and other operations. + +## Open source + +While this repository does not contain open source code, we manage content +contributions following open source practice, as detailed below. + +All contributions to this project will be released under an open source license as +described in [LICENSE.md](/LICENSE.md). By submitting a pull request (PR), +you are agreeing to release the PR contents under this license. + +## Communication + +Anyone interested in contributing should join the mailing list and other +[communication channels](/README.md#Communications). + +We strongly encourage and support all our members to participate in any way +they can. Not everyone can participate in the regularly scheduled live meetings, +so we strive to make our processes friendly for people to be active contributors +through asynchronous communication and contributions to our documentation +in this repository. + +## GitHub pull requests and issues + +If you are new to the group, [reviewing pull requests](pull-request-review.md) +and commenting on issues is a great way to get involved! + +When creating or reviewing pull requests, please refer to the +[writing style guide](writing-style.md) to help maintain consistency across +all of our documents. 
diff --git a/CONTRIB/first-time-contributions.md b/CONTRIB/first-time-contributions.md new file mode 100644 index 0000000..5836863 --- /dev/null +++ b/CONTRIB/first-time-contributions.md @@ -0,0 +1,63 @@ +# First time contributors + +We happily welcome new contributors to this +community. If you are contributing to the CNCF +and/or TAG-Storage for the first time, it is +okay if you feel overwhelmed. We, as a +community, are always there to help you +with any problems you are facing. +Open source is about collaboration, and +we are there to support +each other. + +## Getting involved and contributing + +As a new contributor, you might find it +difficult to understand where to start. +Don't worry! We've got you. + +In the interest of getting more new people +involved, we have issues marked as +[good-first-issues](https://github.com/cncf/tag-storage/issues?q=is%3Aopen+is%3Aissue+label%3A%22good+first+issue%22). +These are issues that have a smaller scope +and are great to start with. + +The good-first-issues should also provide +you with details on how to get things resolved or +how to proceed. If you find these details missing or +incomplete, please tag the person who created +the issue and let them know. + +## Before your first PR + +Before you make your first PR, we would like +you to go through the resources below +for your understanding: + +- [How to submit contributions](https://opensource.guide/how-to-contribute/#how-to-submit-a-contribution) +- [Collaborating with pull requests](https://docs.github.com/en/github/collaborating-with-pull-requests) + +Our PRs also follow a particular writing +style. Check out the [style guide](https://github.com/cncf/tag-storage/blob/main/CONTRIB/writing-style.md). + +## Other ways of communication + +If you have additional questions or +doubts about a certain issue, +please reach out and we will +be happy to discuss. + +You can reach us via our [Mailing List](mailto:cncf-tag-storage-leads@lists.cncf.io). 
+ +You can also reach out in our Slack channel [#tag-storage-governance](https://cloud-native.slack.com/archives/C0230RW8V2T). + +### You can reach out to our members + +Our members list can be found +[here](https://github.com/cncf/tag-storage#members). + +## After PR merge + +Once you have successfully gotten your +first PR merged, you can add your name +to our Members section in [README.md](https://github.com/cncf/tag-storage#members). diff --git a/CONTRIB/pull-request-review.md b/CONTRIB/pull-request-review.md new file mode 100644 index 0000000..bb18323 --- /dev/null +++ b/CONTRIB/pull-request-review.md @@ -0,0 +1,88 @@ +# Pull Request (PR) reviews + +Except for urgent or very small grammar or spelling fixes, such as the simple +changes discussed below, we leave pull requests open for at least 24 hours so +that others have the chance to review and comment. + +### Favorable review + +A favorable review is determined by the contents of the PR complying with the +contributing guide and the writing style, and agreement that the contents align with the +TAG's goals, objectives, and scope. It is anticipated that PRs submitted, with +the exception of spelling and grammar changes, have been discussed with members +of the TAG via Slack or issues. + +#### Nits + +Nits are minor suggestions and changes that are strongly encouraged to be +identified and resolved to provide consistency in the repo. Preferential +language, or language that is a matter of preferred usage, is not considered a +nit. + +##### Example of preferential language + +> They use cloud technologies with clear understanding of risks and the ability +> to validate that their storage policy decisions are reflected in deployed +> software. + +"Ability" is a human-oriented term; "capability" is more technical and may be +more appropriate. + +Suggestion: +> They use cloud technologies with clear understanding of risks and the +> capability to validate their storage policy decisions are reflected in +> deployed software. 
+ +##### Example of a nit + +> They use cloud-native technologies with clear understanding of risks and the +> ability to validate that their storage policy decisions are reflected in +> deployed software. + +Per the TOC definition of cloud native, it is not hyphenated. + +Correction: +> They use cloud native technologies... + +#### Simple changes + +Simple changes are defined as: + +* spelling, typo, and grammar fixes +* clarifications and minor updates + +A person without access, other than the PR author, can and _is_ encouraged to +review a PR and comment/+1 that they have done a review and found it favorable. +A person with access, including the PR author, may then perform the merge. + +A person with access, other than the PR author, can both review **and** merge a +PR if found favorable after review. + +At least one of the concurring reviewers or the merging party must be a +[code owner](/CODEOWNERS). + +#### Significant changes + +Significant changes are defined as: + +* major changes to the repo +* extensive changes to repo contents +* other items as determined by the Technical Leads and Co-Chairs (to be updated + here as they occur) + +A person without access, other than the PR author, can and _is_ encouraged to +review a PR and comment/+1 that they have done a review and found it favorable. +A second person with access, other than the PR author, must also review the PR +and provide concurrence prior to merging. + +Two persons with access, other than the PR author, must review the PR and +provide concurrence, the last of whom should perform the merge. + +At least one of the concurring reviewers or the merging party must be a +[code owner](/CODEOWNERS). + +### Merging pull requests + +PRs may be merged after at least one review has occurred, dependent on the type +of changes reflected in the PR. The merging party needs to verify that a review has +occurred, that the PR is in alignment with this guide, and that it is in scope for the TAG. 
\ No newline at end of file diff --git a/CONTRIB/writing-style.md b/CONTRIB/writing-style.md new file mode 100644 index 0000000..90bb2ea --- /dev/null +++ b/CONTRIB/writing-style.md @@ -0,0 +1,42 @@ +# Writing style + +Consistency creates clarity in communication. + +If you find yourself correcting for consistency, please propose additional style +guidelines via pull request to this document. Feel free to add references to +good sources for content guidelines at the bottom of this guide. + + +## Common terms + +* When referring to users and use cases, ensure consistency with + [use cases](/usecase-personas/) +* See [CNCF Style Guide][cncf-style] for common terms. Note that the following + terms are not hyphenated and all lower case, except for capitalizing the + first letter when at the beginning of a sentence: + * open source + * cloud native + +## Additional formatting + +* Headlines, page titles, subheads and similar content should follow sentence + case, and should not include a trailing colon. +* Paragraphs do not start with leading indent. +* Wrap lines at 80 characters, except where it would break a link. No need to + reformat the whole paragraph to make it perfect -- fewer diffs are easier + for reviewers. + +## File & directory naming conventions + +* Every directory should have a README.md with useful introductory text. +* All other file and directory names should be all lower case with dashes to + separate words. + +## Sources + + +* [OpenOpps Contribution Guide][openopps-style] +* [18F Content Guide](https://content-guide.18f.gov/) + +[cncf-style]: https://github.com/cncf/foundation/blob/master/style-guide.md +[openopps-style]: https://github.com/openopps/openopps-platform/blob/master/CONTRIBUTING.md diff --git a/LICENSE b/LICENSE-code similarity index 99% rename from LICENSE rename to LICENSE-code index 8dada3e..261eeb9 100644 --- a/LICENSE +++ b/LICENSE-code @@ -178,7 +178,7 @@ APPENDIX: How to apply the Apache License to your work. 
To apply the Apache License to your work, attach the following - boilerplate notice, with the fields enclosed by brackets "{}" + boilerplate notice, with the fields enclosed by brackets "[]" replaced with your own identifying information. (Don't include the brackets!) The text should be enclosed in the appropriate comment syntax for the file format. We also recommend that a @@ -186,7 +186,7 @@ same "printed page" as the copyright notice for easier identification within third-party archives. - Copyright {yyyy} {name of copyright owner} + Copyright [yyyy] [name of copyright owner] Licensed under the Apache License, Version 2.0 (the "License"); you may not use this file except in compliance with the License. diff --git a/LICENSE-document b/LICENSE-document new file mode 100644 index 0000000..2802779 --- /dev/null +++ b/LICENSE-document @@ -0,0 +1,395 @@ +Attribution 4.0 International + +======================================================================= + +Creative Commons Corporation ("Creative Commons") is not a law firm and +does not provide legal services or legal advice. Distribution of +Creative Commons public licenses does not create a lawyer-client or +other relationship. Creative Commons makes its licenses and related +information available on an "as-is" basis. Creative Commons gives no +warranties regarding its licenses, any material licensed under their +terms and conditions, or any related information. Creative Commons +disclaims all liability for damages resulting from their use to the +fullest extent possible. + +Using Creative Commons Public Licenses + +Creative Commons public licenses provide a standard set of terms and +conditions that creators and other rights holders may use to share +original works of authorship and other material subject to copyright +and certain other rights specified in the public license below. The +following considerations are for informational purposes only, are not +exhaustive, and do not form part of our licenses. 
+ + Considerations for licensors: Our public licenses are + intended for use by those authorized to give the public + permission to use material in ways otherwise restricted by + copyright and certain other rights. Our licenses are + irrevocable. Licensors should read and understand the terms + and conditions of the license they choose before applying it. + Licensors should also secure all rights necessary before + applying our licenses so that the public can reuse the + material as expected. Licensors should clearly mark any + material not subject to the license. This includes other CC- + licensed material, or material used under an exception or + limitation to copyright. More considerations for licensors: + wiki.creativecommons.org/Considerations_for_licensors + + Considerations for the public: By using one of our public + licenses, a licensor grants the public permission to use the + licensed material under specified terms and conditions. If + the licensor's permission is not necessary for any reason--for + example, because of any applicable exception or limitation to + copyright--then that use is not regulated by the license. Our + licenses grant only permissions under copyright and certain + other rights that a licensor has authority to grant. Use of + the licensed material may still be restricted for other + reasons, including because others have copyright or other + rights in the material. A licensor may make special requests, + such as asking that all changes be marked or described. + Although not required by our licenses, you are encouraged to + respect those requests where reasonable. 
More_considerations + for the public: + wiki.creativecommons.org/Considerations_for_licensees + +======================================================================= + +Creative Commons Attribution 4.0 International Public License + +By exercising the Licensed Rights (defined below), You accept and agree +to be bound by the terms and conditions of this Creative Commons +Attribution 4.0 International Public License ("Public License"). To the +extent this Public License may be interpreted as a contract, You are +granted the Licensed Rights in consideration of Your acceptance of +these terms and conditions, and the Licensor grants You such rights in +consideration of benefits the Licensor receives from making the +Licensed Material available under these terms and conditions. + + +Section 1 -- Definitions. + + a. Adapted Material means material subject to Copyright and Similar + Rights that is derived from or based upon the Licensed Material + and in which the Licensed Material is translated, altered, + arranged, transformed, or otherwise modified in a manner requiring + permission under the Copyright and Similar Rights held by the + Licensor. For purposes of this Public License, where the Licensed + Material is a musical work, performance, or sound recording, + Adapted Material is always produced where the Licensed Material is + synched in timed relation with a moving image. + + b. Adapter's License means the license You apply to Your Copyright + and Similar Rights in Your contributions to Adapted Material in + accordance with the terms and conditions of this Public License. + + c. Copyright and Similar Rights means copyright and/or similar rights + closely related to copyright including, without limitation, + performance, broadcast, sound recording, and Sui Generis Database + Rights, without regard to how the rights are labeled or + categorized. 
For purposes of this Public License, the rights + specified in Section 2(b)(1)-(2) are not Copyright and Similar + Rights. + + d. Effective Technological Measures means those measures that, in the + absence of proper authority, may not be circumvented under laws + fulfilling obligations under Article 11 of the WIPO Copyright + Treaty adopted on December 20, 1996, and/or similar international + agreements. + + e. Exceptions and Limitations means fair use, fair dealing, and/or + any other exception or limitation to Copyright and Similar Rights + that applies to Your use of the Licensed Material. + + f. Licensed Material means the artistic or literary work, database, + or other material to which the Licensor applied this Public + License. + + g. Licensed Rights means the rights granted to You subject to the + terms and conditions of this Public License, which are limited to + all Copyright and Similar Rights that apply to Your use of the + Licensed Material and that the Licensor has authority to license. + + h. Licensor means the individual(s) or entity(ies) granting rights + under this Public License. + + i. Share means to provide material to the public by any means or + process that requires permission under the Licensed Rights, such + as reproduction, public display, public performance, distribution, + dissemination, communication, or importation, and to make material + available to the public including in ways that members of the + public may access the material from a place and at a time + individually chosen by them. + + j. Sui Generis Database Rights means rights other than copyright + resulting from Directive 96/9/EC of the European Parliament and of + the Council of 11 March 1996 on the legal protection of databases, + as amended and/or succeeded, as well as other essentially + equivalent rights anywhere in the world. + + k. You means the individual or entity exercising the Licensed Rights + under this Public License. Your has a corresponding meaning. 
+ + +Section 2 -- Scope. + + a. License grant. + + 1. Subject to the terms and conditions of this Public License, + the Licensor hereby grants You a worldwide, royalty-free, + non-sublicensable, non-exclusive, irrevocable license to + exercise the Licensed Rights in the Licensed Material to: + + a. reproduce and Share the Licensed Material, in whole or + in part; and + + b. produce, reproduce, and Share Adapted Material. + + 2. Exceptions and Limitations. For the avoidance of doubt, where + Exceptions and Limitations apply to Your use, this Public + License does not apply, and You do not need to comply with + its terms and conditions. + + 3. Term. The term of this Public License is specified in Section + 6(a). + + 4. Media and formats; technical modifications allowed. The + Licensor authorizes You to exercise the Licensed Rights in + all media and formats whether now known or hereafter created, + and to make technical modifications necessary to do so. The + Licensor waives and/or agrees not to assert any right or + authority to forbid You from making technical modifications + necessary to exercise the Licensed Rights, including + technical modifications necessary to circumvent Effective + Technological Measures. For purposes of this Public License, + simply making modifications authorized by this Section 2(a) + (4) never produces Adapted Material. + + 5. Downstream recipients. + + a. Offer from the Licensor -- Licensed Material. Every + recipient of the Licensed Material automatically + receives an offer from the Licensor to exercise the + Licensed Rights under the terms and conditions of this + Public License. + + b. No downstream restrictions. You may not offer or impose + any additional or different terms or conditions on, or + apply any Effective Technological Measures to, the + Licensed Material if doing so restricts exercise of the + Licensed Rights by any recipient of the Licensed + Material. + + 6. No endorsement. 
Nothing in this Public License constitutes or + may be construed as permission to assert or imply that You + are, or that Your use of the Licensed Material is, connected + with, or sponsored, endorsed, or granted official status by, + the Licensor or others designated to receive attribution as + provided in Section 3(a)(1)(A)(i). + + b. Other rights. + + 1. Moral rights, such as the right of integrity, are not + licensed under this Public License, nor are publicity, + privacy, and/or other similar personality rights; however, to + the extent possible, the Licensor waives and/or agrees not to + assert any such rights held by the Licensor to the limited + extent necessary to allow You to exercise the Licensed + Rights, but not otherwise. + + 2. Patent and trademark rights are not licensed under this + Public License. + + 3. To the extent possible, the Licensor waives any right to + collect royalties from You for the exercise of the Licensed + Rights, whether directly or through a collecting society + under any voluntary or waivable statutory or compulsory + licensing scheme. In all other cases the Licensor expressly + reserves any right to collect such royalties. + + +Section 3 -- License Conditions. + +Your exercise of the Licensed Rights is expressly made subject to the +following conditions. + + a. Attribution. + + 1. If You Share the Licensed Material (including in modified + form), You must: + + a. retain the following if it is supplied by the Licensor + with the Licensed Material: + + i. identification of the creator(s) of the Licensed + Material and any others designated to receive + attribution, in any reasonable manner requested by + the Licensor (including by pseudonym if + designated); + + ii. a copyright notice; + + iii. a notice that refers to this Public License; + + iv. a notice that refers to the disclaimer of + warranties; + + v. a URI or hyperlink to the Licensed Material to the + extent reasonably practicable; + + b. 
indicate if You modified the Licensed Material and + retain an indication of any previous modifications; and + + c. indicate the Licensed Material is licensed under this + Public License, and include the text of, or the URI or + hyperlink to, this Public License. + + 2. You may satisfy the conditions in Section 3(a)(1) in any + reasonable manner based on the medium, means, and context in + which You Share the Licensed Material. For example, it may be + reasonable to satisfy the conditions by providing a URI or + hyperlink to a resource that includes the required + information. + + 3. If requested by the Licensor, You must remove any of the + information required by Section 3(a)(1)(A) to the extent + reasonably practicable. + + 4. If You Share Adapted Material You produce, the Adapter's + License You apply must not prevent recipients of the Adapted + Material from complying with this Public License. + + +Section 4 -- Sui Generis Database Rights. + +Where the Licensed Rights include Sui Generis Database Rights that +apply to Your use of the Licensed Material: + + a. for the avoidance of doubt, Section 2(a)(1) grants You the right + to extract, reuse, reproduce, and Share all or a substantial + portion of the contents of the database; + + b. if You include all or a substantial portion of the database + contents in a database in which You have Sui Generis Database + Rights, then the database in which You have Sui Generis Database + Rights (but not its individual contents) is Adapted Material; and + + c. You must comply with the conditions in Section 3(a) if You Share + all or a substantial portion of the contents of the database. + +For the avoidance of doubt, this Section 4 supplements and does not +replace Your obligations under this Public License where the Licensed +Rights include other Copyright and Similar Rights. + + +Section 5 -- Disclaimer of Warranties and Limitation of Liability. + + a. 
UNLESS OTHERWISE SEPARATELY UNDERTAKEN BY THE LICENSOR, TO THE + EXTENT POSSIBLE, THE LICENSOR OFFERS THE LICENSED MATERIAL AS-IS + AND AS-AVAILABLE, AND MAKES NO REPRESENTATIONS OR WARRANTIES OF + ANY KIND CONCERNING THE LICENSED MATERIAL, WHETHER EXPRESS, + IMPLIED, STATUTORY, OR OTHER. THIS INCLUDES, WITHOUT LIMITATION, + WARRANTIES OF TITLE, MERCHANTABILITY, FITNESS FOR A PARTICULAR + PURPOSE, NON-INFRINGEMENT, ABSENCE OF LATENT OR OTHER DEFECTS, + ACCURACY, OR THE PRESENCE OR ABSENCE OF ERRORS, WHETHER OR NOT + KNOWN OR DISCOVERABLE. WHERE DISCLAIMERS OF WARRANTIES ARE NOT + ALLOWED IN FULL OR IN PART, THIS DISCLAIMER MAY NOT APPLY TO YOU. + + b. TO THE EXTENT POSSIBLE, IN NO EVENT WILL THE LICENSOR BE LIABLE + TO YOU ON ANY LEGAL THEORY (INCLUDING, WITHOUT LIMITATION, + NEGLIGENCE) OR OTHERWISE FOR ANY DIRECT, SPECIAL, INDIRECT, + INCIDENTAL, CONSEQUENTIAL, PUNITIVE, EXEMPLARY, OR OTHER LOSSES, + COSTS, EXPENSES, OR DAMAGES ARISING OUT OF THIS PUBLIC LICENSE OR + USE OF THE LICENSED MATERIAL, EVEN IF THE LICENSOR HAS BEEN + ADVISED OF THE POSSIBILITY OF SUCH LOSSES, COSTS, EXPENSES, OR + DAMAGES. WHERE A LIMITATION OF LIABILITY IS NOT ALLOWED IN FULL OR + IN PART, THIS LIMITATION MAY NOT APPLY TO YOU. + + c. The disclaimer of warranties and limitation of liability provided + above shall be interpreted in a manner that, to the extent + possible, most closely approximates an absolute disclaimer and + waiver of all liability. + + +Section 6 -- Term and Termination. + + a. This Public License applies for the term of the Copyright and + Similar Rights licensed here. However, if You fail to comply with + this Public License, then Your rights under this Public License + terminate automatically. + + b. Where Your right to use the Licensed Material has terminated under + Section 6(a), it reinstates: + + 1. automatically as of the date the violation is cured, provided + it is cured within 30 days of Your discovery of the + violation; or + + 2. 
upon express reinstatement by the Licensor. + + For the avoidance of doubt, this Section 6(b) does not affect any + right the Licensor may have to seek remedies for Your violations + of this Public License. + + c. For the avoidance of doubt, the Licensor may also offer the + Licensed Material under separate terms or conditions or stop + distributing the Licensed Material at any time; however, doing so + will not terminate this Public License. + + d. Sections 1, 5, 6, 7, and 8 survive termination of this Public + License. + + +Section 7 -- Other Terms and Conditions. + + a. The Licensor shall not be bound by any additional or different + terms or conditions communicated by You unless expressly agreed. + + b. Any arrangements, understandings, or agreements regarding the + Licensed Material not stated herein are separate from and + independent of the terms and conditions of this Public License. + + +Section 8 -- Interpretation. + + a. For the avoidance of doubt, this Public License does not, and + shall not be interpreted to, reduce, limit, restrict, or impose + conditions on any use of the Licensed Material that could lawfully + be made without permission under this Public License. + + b. To the extent possible, if any provision of this Public License is + deemed unenforceable, it shall be automatically reformed to the + minimum extent necessary to make it enforceable. If the provision + cannot be reformed, it shall be severed from this Public License + without affecting the enforceability of the remaining terms and + conditions. + + c. No term or condition of this Public License will be waived and no + failure to comply consented to unless expressly agreed to by the + Licensor. + + d. Nothing in this Public License constitutes or may be interpreted + as a limitation upon, or waiver of, any privileges and immunities + that apply to the Licensor or You, including from the legal + processes of any jurisdiction or authority. 
+ + +======================================================================= + +Creative Commons is not a party to its public +licenses. Notwithstanding, Creative Commons may elect to apply one of +its public licenses to material it publishes and in those instances +will be considered the "Licensor." The text of the Creative Commons +public licenses is dedicated to the public domain under the CC0 Public +Domain Dedication. Except for the limited purpose of indicating that +material is shared under a Creative Commons public license or as +otherwise permitted by the Creative Commons policies published at +creativecommons.org/policies, Creative Commons does not authorize the +use of the trademark "Creative Commons" or any other trademark or logo +of Creative Commons without its prior written consent including, +without limitation, in connection with any unauthorized modifications +to any of its public licenses or any other arrangements, +understandings, or agreements concerning use of licensed material. For +the avoidance of doubt, this paragraph does not form part of the +public licenses. + +Creative Commons may be contacted at creativecommons.org. \ No newline at end of file diff --git a/LICENSE.md b/LICENSE.md new file mode 100644 index 0000000..9ea106d --- /dev/null +++ b/LICENSE.md @@ -0,0 +1,2 @@ +Code in this repository is licensed under [Apache License Version 2.0](LICENSE-code) (SPDX-License-Identifier: Apache-2.0). +Documentation in this repository is licensed under [Creative Commons Attribution 4.0 International License](LICENSE-document) (SPDX-License-Identifier: CC-BY-4.0) \ No newline at end of file diff --git a/NEW-MEMBERS.md b/NEW-MEMBERS.md new file mode 100644 index 0000000..fdbd1c4 --- /dev/null +++ b/NEW-MEMBERS.md @@ -0,0 +1,29 @@ +# New members + +The purpose of this plan is to ensure that you become familiar with the team and +know how you will contribute.
The first step is to familiarize yourself with +our mission in the [Storage TAG charter](governance/charter.md). + +New members are advised to: + +* Join the [CNCF Slack team](https://slack.cncf.io/), particularly the + [#tag-storage](https://cloud-native.slack.com/archives/C6PK4RLF7) channel, and + introduce yourself. +* Initially go through the following documents in the repository: + * [README.md](README.md) + * [CODE-OF-CONDUCT.md][coc] + * [first-time-contributions] + * [Use cases and personas][use-cases] +* Regularly join the [Zoom meetings][meeting-times], at least for the first + couple of months, to get yourself up to speed. +* Get involved in one of the following ways: + * Join the meeting as advised above and express your areas of interest or + any specific issue you want to work on. + * Express your thoughts or ask questions on an issue you find interesting. + * Choose an issue where [help is + needed](https://github.com/cncf/tag-storage/labels/help%20wanted) and + comment on it expressing interest.
+ +[meeting-times]: README.md#meeting-times +[coc]: CODE-OF-CONDUCT.md +[first-time-contributions]: CONTRIB/first-time-contributions.md diff --git a/README.md b/README.md index 1e525b6..5f12187 100644 --- a/README.md +++ b/README.md @@ -1,74 +1,109 @@ -# CNCF Storage TAG +# CNCF Storage Technical Advisory Group -## Meetings + + +Cloud Native Storage logo + -The Storage Technical Advisory Group meets on the 2nd and 4th Wednesday of every month at 8AM PT (USA Pacific): +## Quick links -Join from PC, Mac, Linux, iOS or Android: https://zoom.us/j/2920471159?pwd=em1JbE44MktjZE4vbnJtUUFQcGZwdz09 +- [Meeting Information](#meeting-times) +- [Slack Information](#communications) +- [New Members](#new-members) +- [Members](#members) -Or iPhone one-tap (US Toll): +16465588656,158580155# or +14086380968,158580155# +## Governance -Or Telephone: Dial: +1 646 558 8656 (US Toll) or +1 408 638 0968 (US Toll) Meeting ID: 158 580 155 International numbers available: https://zoom.us/zoomconference?m=0gvI03dCdRnx6WENPzdNhioORGmhVwYo +[STAG charter](governance/charter.md) outlines the scope of our group +activities, as part of our [governance process](governance), which details how we +work. -## Mailing list and minutes +## Communications -Mailing list: https://lists.cncf.io/g/cncf-tag-storage +Anyone is welcome to join our open discussions of STAG projects and share news +related to the group's mission and charter.
Much of the work of the group +happens outside of Storage TAG meetings and we encourage project teams to share +progress updates or post questions in these channels: -(note: the old WG mailing list is here: https://groups.google.com/forum/#!forum/cncf-wg-storage) +Group communication: -Meeting minutes are recorded here: https://bit.ly/cncf-storage-sig-minutes +- [Email list](https://lists.cncf.io/g/cncf-tag-storage) +- [CNCF Slack](https://slack.cncf.io/) #tag-storage channel +Leadership: -## CNCF Storage TAG Charter +- To reach the leadership team (chairs & tech leads), email + cncf-tag-storage-leads@lists.cncf.io +- To reach the chairs, email cncf-tag-storage-chairs@lists.cncf.io -The charter is available here: https://github.com/cncf/tag-storage/blob/master/storage-charter.md +### Slack governance +Refer to the [Slack governance document](slack.md) for details on Slack channels +and posting to the channels. -## CNCF Storage Landscape Whitepaper +## Meeting times -The whitepaper documents the storage landscape by clarifying the terminology used in the storage space including: +Group meeting times are listed below: +The Storage Technical Advisory Group meets on the 2nd and 4th Wednesday of every month at 8AM PT (USA Pacific). -- The attributes of a storage system such that an end-user can understand the appropriate capabilities that might be required by an application or architectural pattern. -- The layers in a storage solution (or service) with a focus on terminology and how they impact the defined attributes covering the container, orchestrator, transport, topology, virtual/physical, data protection, data services and the non-volatile layers. -- The data access interfaces in terms of volume (including block, file system and shared file system) and application API (including object, KV and database) as high level groupings. -- Separate sections with further detail on Block Storage, File systems, Object Storage and Key Value Stores.
-- The management interfaces needed to orchestrate the storage layers to facilitate composability, dynamic provisioning and self service management. +See the [CNCF Calendar](https://www.cncf.io/calendar/) for calendar invites. -The whitepaper is available here: http://bit.ly/cncf-storage-whitepaper +[Meeting minutes and +agenda](https://bit.ly/cncf-storage-sig-minutes) +### Zoom Meeting Details -## Current CNCF Storage Projects + +[Meeting Link](https://zoom.us/j/2920471159?pwd=em1JbE44MktjZE4vbnJtUUFQcGZwdz09) (Password: 77777) -### Graduated Projects +One tap mobile: +| Location | Number | +| --- | --- | +|US (San Jose)|+16699006833,158580155#| +|US (Tacoma)|+12532158782,158580155#| +|US (Washington DC)|+13017158592,158580155#| +|US (Chicago)|+13126266799,158580155#| +|US (Houston)|+13462487799,158580155#| +|US (New York)|+16465588656,158580155#| -- [etcd](https://github.com/etcd-io/etcd) -- [Rook](https://github.com/rook/rook) -- [TiKV](https://github.com/tikv/tikv) -- [Vitess](https://github.com/vitessio/vitess) +Dial by your location: +| Location | Number | +| --- | --- | +| US - New York | +1 646 558 8656| +| US - San Jose | +1 669 900 6833| +| US - Toll-free | 877 369 0926| +| US - Toll-free | 855 880 1246| +| Australia - Toll-free | 158 580 155| -### Incubating Projects +Or [find your local number](https://zoom.us/u/alwlmxlNn). -- [Dragonfly](https://github.com/dragonflyoss/Dragonfly) +Meeting ID: 158 580 155 -### Sandbox Projects +## Gatherings -- [ChubaoFS](https://github.com/chubaofs/chubaofs) -- [Longhorn](https://github.com/longhorn/longhorn) -- [OpenEBS](https://github.com/openebs) -- [Pravega](https://github.com/pravega/pravega) -- [Piraeus](https://github.com/piraeusdatastore/piraeus) -- [Vineyard](https://github.com/v6d-io/v6d) +Please let us know if you are going and if you are interested in attending (or +helping to organize!) a gathering. 
Create a [GitHub +issue](https://github.com/cncf/tag-storage/issues/new) for an event and add it to the +list below: - -## Operating Model +- KubeCon + CloudNativeCon, Europe May 16-20 2022 -This TAG follows the [standard operating -guidelines](https://github.com/cncf/toc/blob/master/sigs/cncf-sigs.md#operating-model) -provided by the TOC unless otherwise stated here. +## New members -**Current TOC Liaison:** Erin Boyd, Saad Ali +If you are new to the group, we encourage you to check out our [New Members Page](NEW-MEMBERS.md). -**Co-Chairs:** Alex Chircop, Quinton Hoole, Xing Yang +## Members + -**Tech Leads:** Raffaele Spazzoli, Luis Pabon, Sheng Yang, Nick Connolly +### Chairs -**Other named roles:** None at present; will be identified and staffed as needed. +- Alex Chircop [Chair term: ??? - ???] +- Quinton Hoole [Chair term: ??? - ???] +- Xing Yang [Chair term: ??? - ???] + +### Tech Leads + + - Raffaele Spazzoli + - Luis Pabon + - Sheng Yang + - Nick Connolly diff --git a/design/logo/122128977-4050a380-cdea-11eb-84b7-191c8e73aac9.png b/design/logo/122128977-4050a380-cdea-11eb-84b7-191c8e73aac9.png new file mode 100644 index 0000000..9aba8d6 Binary files /dev/null and b/design/logo/122128977-4050a380-cdea-11eb-84b7-191c8e73aac9.png differ diff --git a/design/logo/122128980-4181d080-cdea-11eb-928a-2e68ea1a4bee.png b/design/logo/122128980-4181d080-cdea-11eb-928a-2e68ea1a4bee.png new file mode 100644 index 0000000..3912e97 Binary files /dev/null and b/design/logo/122128980-4181d080-cdea-11eb-928a-2e68ea1a4bee.png differ diff --git a/storage-charter.md b/governance/charter.md similarity index 63% rename from storage-charter.md rename to governance/charter.md index 8bd6718..6eba69f 100644 --- a/storage-charter.md +++ b/governance/charter.md @@ -77,44 +77,9 @@ cloud-native environments through: - [TiKV](https://github.com/tikv/tikv) - [Vitess](https://github.com/vitessio/vitess) -# Interfaces With Other Related Groups - -* **Kubernetes Storage SIG** - is focussed towards
Kubernetes-specific storage abstractions, interfaces, and - implementations of these interfaces. We maintain close - communication with this Kubernetes SIG, with several individuals - actively involved in both. Our aim is to avoid unnecessary - duplication of effort by the two groups, and maintain clear an - consistent messaging by the two groups to our end user community - and projects. -* **CSI** - is focussed on defining an industry standard “Container - Storage Interface” (CSI) that will enable storage vendors to - develop a plugin once and have it work across a number of - container orchestration systems. Again, we maintain close - communication with this group, and avoid unnecessary duplication - of effort and inconsistent messaging wherever possible. -* **CNCF Security SIG** - works on the more general area of - cloud-native security including authentication, authorization, - encryption, accounting, auditing and related topics. We defer as - much as possible to this group to deal with general - security-related issues, and liaise closely with them on how to - deal with storage-specific security areas where these arise. -* **CNCF Apps SIG** (not yet fully formed) - will be focussed on the - development, deployment, operation and testing of cloud-native - applications. We collaborate with this SIG where this pertains to - Storage. -* **K8s Apps SIG** - has done some work on how Kubernetes apps use - storage, as well as how storage systems (including databases) may - be deployed on Kubernetes . We collaborate with Apps SIG and make - sure that important topics are well covered. -* **[Kubernetes Service Catalog SIG](https://github.com/kubernetes/community/tree/master/sig-service-catalog)**- - works on enabling external managed software offerings such as - datastore services offered by public cloud providers. 
- - # Operating Model -This SIG follows the [standard operating +This TAG follows the [standard operating guidelines](https://github.com/cncf/toc/blob/master/sigs/cncf-sigs.md#operating-model) provided by the TOC unless otherwise stated here. diff --git a/slack.md b/slack.md new file mode 100644 index 0000000..73e21bd --- /dev/null +++ b/slack.md @@ -0,0 +1,21 @@ +# TAG-Storage channels housekeeping + +We're located on the CNCF workspace at [#tag-storage](https://cloud-native.slack.com/archives/C6PK4RLF7) + +Additional information may be found in the [CNCF slack guidelines](https://github.com/cncf/foundation/blob/master/slack-guidelines.md). + +## Code of conduct + +Members of TAG-Storage channels are expected to abide by the [code of conduct](https://github.com/cncf/tag-storage/blob/master/CODE-OF-CONDUCT.md). + +## Posting outside content + +The TAG-Storage channels are mechanisms for cloud native storage discussions. +It is expected that outside, non-tag created content will be posted; however, +these should include topics of relevance and interest to the cloud native +community space, rather than marketing or promotion of a vendor-specific +product. + +For example, maintainers and contributors of projects are encouraged to post +relevant topics, podcasts, and blogs in the channels provided the content is not +self-endorsing for the sake of driving attention to the project. 
diff --git a/storage-whitepaper/v2.md b/storage-whitepaper/v2.md new file mode 100644 index 0000000..7bd573e --- /dev/null +++ b/storage-whitepaper/v2.md @@ -0,0 +1,1450 @@ +## CNCF Storage Whitepaper v2 + +##### By Alex Chircop, Quinton Hoole, Clinton Kitson, Xiang Li, Luis Pabón, Sugu Sougoumarane, Xing Yang +[Public link to this document](https://bit.ly/cncf-storage-whitepaperV2) +#### Status: 03/07/2020 - final version + +- 1 Scope of this document + - 1.1 Goals + - 1.2 Non-goals +- 2 Introduction and document layout +- 3 Attributes of a storage interface or system + - 3.1 Availability + - 3.2 Scalability + - 3.3 Performance + - 3.4 Consistency + - 3.5 Durability + - 3.6 Instantiation & Deployment +- 4 Storage stack / layers + - 4.1 Storage Topology + - 4.1.1 Centralised + - 4.1.2 Distributed + - 4.1.3 Sharded + - 4.1.4 Hyperconverged + - 4.2 Data Protection + - 4.2.1 RAID: Striping, Mirrors & Parity + - 4.2.2 Erasure Coding + - 4.2.3 Replicas + - 4.3 Data Services + - 4.3.1 Replication + - 4.3.2 Snapshots and Point in Time (PIT) copies + - 4.4 Data Reduction + - 4.5 Encryption + - 4.6 Physical / Non-Volatile Layer – terminology +- 5 Data Access Interface + - 5.1 Data Access Interface: Volumes + - 5.1.1 Block + - 5.1.2 Filesystem + - 5.1.3 Shared Filesystem + - 5.2 Data Access Interface: Application API + - 5.2.1 Object Stores + - 5.2.2 Key Value Stores + - 5.2.3 Databases + - 5.3 Orchestrator, host and operating system level interactions + - 5.3.1 Volumes + - 5.3.2 Application API + - 5.4 Comparison between Object Stores, File Systems and Block Stores + - 5.5 Comparison between Local, Remote and Distributed Systems +- 6 Block Stores + - 6.1 Local Block Stores + - 6.2 Remote Block Stores + - 6.3 Distributed Block Stores +- 7 File Systems + - 7.1 Local File Systems + - 7.2 Remote File Systems + - 7.3 Distributed File Systems + - 7.4 Comparison +- 8 Object Stores + - 8.1 HTTP Based Object Storage + - 8.2 Scalability, Availability, Durability, Performance +- 9
Key-Value Stores + - 9.1 Local Key-value Stores + - 9.2 Remote Shared Key-value Stores + - 9.3 Distributed Key-value Stores + - 9.4 Comparison +- 10 Databases + - 10.1 Functionality and Backing Stores + - 10.2 Cloud Native Databases + - 10.3 Data Protection + - 10.4 Database Comparison +- 11 Orchestration and Management Interfaces + - 11.1 Volumes - block stores and filesystems + - 11.1.1 Control Plane Interfaces + - 11.1.1.1 Container Storage Interface + - 11.1.1.2 K8S Native Drivers + - 11.1.1.3 Docker Volume Driver Interface + - 11.1.1.4 K8S Flexvolume + - 11.1.2 Frameworks and other tools + - 11.2 Application API + - 11.2.1 Object Stores + - 11.2.2 Key Value Stores + - 11.2.3 Databases +- 12 Appendix + - 12.1 Document History + - 12.2 Consensus Protocols + - 12.2.1 Paxos + - 12.2.2 Raft + - 12.2.3 Two-phase Commit (“2PC”) + - 12.2.4 Three-phase Commit (“3PC”) + - 12.3 Consistency, Coherence and Isolation + - 12.3.1 ACID + - 12.3.2 The CAP Theorem + + +## 1 Scope of this document + +This is the first phase of documenting the storage landscape. It aims to offer clear +information on terminology, usage patterns and classes of technology as defined by the +goals of the document. + +During phase two we might tackle the non-goals, on the basis of feedback from phase one, +specifically in light of understood production use, and comparisons w.r.t. primary properties. + +### 1.1 Goals + +1. **Clarify the terminology** currently in use in the storage space, and the relationships + between the various terms. Essentially a taxonomy of the storage landscape. +2. This includes anything reasonably within scope of “storage”, including block stores, + key value stores, databases, object stores, volumes, file systems etc. +3. Provide some general information as to **how these things are currently being used** + **in production** in public or private cloud environments. +4. **Compare and contrast** the various technology areas w.r.t.
the primary properties of + availability, scalability, consistency, durability, performance, API, etc. + +### 1.2 Non-goals + +1. Define what’s in-scope and out of scope for the CNCF. +2. Provide any recommendations regarding preferred storage approaches or solutions. + + +## 2 Introduction and document layout + +Multiple options were considered when defining how to present the many storage systems +and services in the landscape for the document. + +In order to simplify the consumption of information in a complex landscape, the document +has been structured as follows: + +* Definition of the attributes of a storage system such that an end-user can +understand the appropriate capabilities that might be required by an application or +architectural pattern +* Definition of the layers in a storage solution (or service) with a focus on terminology +and how they impact the defined attributes - covering the container, orchestrator, +transport, topology, virtual/physical, data protection, data services and the +non-volatile layers. +* Definition of the data access interfaces in terms of volume (including block, file +system and shared file system) and application API (including object, KV and +database) as high level groupings +* Separate sections with further detail on Block Storage, File systems, Object +Storage, Key Value Stores and Databases. +* Definition of the management interfaces needed to orchestrate the storage layers to +facilitate composability, dynamic provisioning and self service management. + +## 3 Attributes of a storage interface or system + +Storage systems and services have a variety of interfaces which are suitable for different +use cases and tend to be composed of multiple layers which each impact different attributes +of the system. + +When choosing an overall storage solution, the different attributes of the desired solution +need to be considered.
+ +It is important to note that different storage systems are built with different design objectives, +and may be architected to optimise for one or more storage attributes which may in turn +impact another storage attribute. + +### 3.1 Availability + +Availability of a storage system defines the ability to access the data during failure +conditions. The failures may be due to failures in the storage media, transport, controller or +any other component in the system. + +Availability defines how access to the data continues during a failure condition and also how +access to the data is re-routed (or failed-over) to another access node in the event that the +node that is accessing the data is unavailable. + +The availability attribute can sometimes be referred to as a Recovery Time Objective (RTO) +after a failure i.e. the time between a failure occurring and service being recovered. + +Availability can be measured as an uptime percentage (e.g. 99.9% uptime) as well as +MTTF (mean time to failure) or MTTR (mean time to repair), which are measured in units of +time. + +### 3.2 Scalability + +Scalability of a storage system can be measured by a number of criteria. Different criteria +may be important for different use cases and each defines a set of architectural patterns that +will need to be implemented in a storage system. + +Criteria used to measure scalability include: + +``` +A. the ability to scale the number of clients that can access the storage system +B. the ability to scale throughput (e.g. MB/sec) or number of operations (e.g. per +second) of a single interface +C. the ability to scale the capacity, in terms of data stored, of a single deployment of the +storage system/service. This could be with respect to storage volume (GB/TB/PB) +and/or number of individual items. +D.
ability to scale the number of components in a storage system to facilitate (a), (b), or +(c) +``` +### 3.3 Performance + +Similar to scalability, the performance of a storage system can be measured against different +criteria, the relative importance of each depending on the use case. + +Performance of a storage system is typically measured in terms of one or more of: +* latency to perform a storage operation +* the number of storage operations that are possible per second +* the throughput of data that can be stored or retrieved per second + +### 3.4 Consistency + +Consistency attributes of a storage system refer to the ability to access newly created data +or updates to the same after it has been committed, and apply to both: + +* “read” operations returning the correct data after a “write”, “update” or “delete” - with +or without a delay. +* any delays that occur between performing the data storage operation and the data +getting committed to a non-volatile store or being fully protected. + +Systems that have delays between read operations returning up-to-date data, and/or delays +before all data is protected after getting committed, are defined as being “eventually +consistent”. If there are no delays, the system is defined as being “strongly consistent”. +Consistency is discussed in further detail in the Appendix. + +The consistency attribute can sometimes be referred to as a Recovery Point Objective +(RPO) after a failure i.e. the amount of tolerated data loss (usually measured in time, based +on the consistency delay) when a component or service in the storage system has suffered a +failure. + +### 3.5 Durability + +Durability covers the attributes of a storage system that impact the ability for a data set to +endure as opposed to just being accessible.
Multiple factors can impact the durability of a +storage system, including: +* the data protection layers, such as how many copies of the data are available +* the levels of redundancy of the system +* the endurance characteristics of the storage media that is holding the data (e.g. SSD +vs spinning disks vs tape) +* the ability to detect corruption of data (e.g. due to component failure or wear/usage) +and the ability to use data protection functions to rebuild or recover the corrupted +data (sometimes referred to as “bit-rot”) + +### 3.6 Instantiation & Deployment + +A storage system can be deployed or instantiated on-premises or in a cloud environment in +a variety of ways which defines where the storage solution or service can be deployed +and/or consumed: + +``` +Hardware : deployed as hardware solutions in a datacenter. This limits the +portability of the application and generally means that such systems cannot be +deployed in a public cloud environment +Software : deployed as software components on commodity hardware, appliances or +cloud instances. Software solutions tend to be more platform agnostic and can be +installed both on-premises and in cloud environments. Some software defined +storage systems can also be deployed as a container and deployment can be +automated by an orchestrator. +Cloud Services : consumed from public cloud providers. Cloud services provide +storage services in cloud environments. +``` + +## 4 Storage stack / layers + +Any storage solution is composed of multiple layers of functionality that define how data is +stored, retrieved, protected and interacts with an application, orchestrator and/or operating +system. Each of these layers has the potential to influence and impact one or more of the +attributes of a storage solution including availability, scalability, consistency and durability.
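Several of the attributes above are directly quantifiable. As a brief illustration (the helper name and figures are our own, not part of the whitepaper), the uptime percentages used to express availability in section 3.1 translate into permitted downtime per year:

```python
# Illustrative only: convert an uptime percentage (section 3.1) into the
# downtime it permits over one year.
MINUTES_PER_YEAR = 365 * 24 * 60  # 525,600

def allowed_downtime_minutes(uptime_percent: float) -> float:
    """Minutes of downtime per year permitted by a given uptime percentage."""
    return MINUTES_PER_YEAR * (1 - uptime_percent / 100)

# "Three nines" permits roughly 8.8 hours of downtime per year;
# "five nines" permits only about 5 minutes.
print(round(allowed_downtime_minutes(99.9)))    # ~526 minutes
print(round(allowed_downtime_minutes(99.999)))  # ~5 minutes
```

Comparable back-of-envelope arithmetic applies to the RPO of section 3.4, where the consistency delay bounds the amount of data lost after a failure.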
+ +### 4.1 Storage Topology + +The storage topology of a storage system defines the different arrangements of storage and +compute devices and the data links between them. The topology can influence multiple +attributes, including: + +* Availability - in terms of the speed of failover and reconvergence following a +component failure +* Performance - in terms of both latency and throughput +* Scalability - different topologies are optimised to scale in different directions (e.g. +scaling vertically vs horizontally, sometimes referred to as scaling up vs out) +* Consistency and Durability - the topology often defines the consistency delay as well +as the data protection options that are possible + +#### 4.1.1 Centralised + +Storage systems that are deployed in a centralised topology tend to be formed of fewer +nodes that maintain a tightly coupled state. Often the architecture is dependent on vendor +specific hardware technology for intra-controller communication, configuration and data +plane activity (such as shared memory, cache synchronisation or shared data buses). + +This type of storage is typically accessed by compute nodes via network interfaces where a +number of clients consume storage from a small number of centralised nodes. Centralised +storage is often characterised by scale up topologies (or scaling vertically) and is usually +more consistent than distributed storage. + +As a result of the small number of nodes (often just a single pair), the latency required to +maintain data protection and sync consistency is very low and many block based systems +use this architecture as a result. + +It can be hard to scale such a system horizontally as the requirement for a tightly coupled +state limits the number of nodes that can be supported.
A software solution will often be implemented with a +“shared nothing” architecture where data needs to be synced across more than one node +over a standard network connection. + +Some distributed solutions are accessed directly in a scale out manner which allows many +clients to access many server nodes in parallel. Other distributed storage systems layer +other protocols on top to enable compatibility with existing environments or access +transports (e.g. NFS or iSCSI) which may limit the overall scalability. + +Different distributed architectures have different focuses and make design decisions that +may favour performance, scalability, durability, availability or consistency. Distributed +topologies typically offer better horizontal scalability as data can be distributed across many +more nodes and can support many clients. This can result in systems that are also more +complex to deploy and operate and therefore benefit from additional automation. + +#### 4.1.3 Sharded + +Sharding is a process where a dataset or workload is partitioned based on ranges of keys +across multiple instances. The shard can be computed by using the key to determine which +node to access based on a range, a hash or other algorithms. Sharding is primarily used as +a method for scaling database architectures. + +Sharded systems provide a way to scale a storage system for both capacity and compute +capability. Workload is distributed across the shards in the system allowing workloads to +scale horizontally. + +Sharded systems can increase operational complexity and care needs to be taken to ensure +that the algorithm used to distribute the keys is balanced to the specific workload or dataset. +Managing availability can also be more challenging as systems may experience more +complex or partial failure modes where only parts of a data set are impacted by individual +node or network failures. 
Although sharding enables scale, the performance of any +particular request will be limited to the performance of the specific node that the shard is +located on, and it is possible for individual shards to become “hot” or overloaded. +Rebalancing shards when scaling a cluster can also be complex. + +#### 4.1.4 Hyperconverged + +Hyperconverged topologies combine application as well as storage workloads onto the same +nodes. Multiple nodes can be clustered together creating a common resource pool which is +shared for both compute workloads and storage functionality. In hyperconverged +topologies, the storage layer is usually implemented as a software component on commodity +compute nodes and typically shares the same attributes as a distributed system. + +Hyperconverged topologies tend to be selected to maximise flexibility as the storage system +can grow with the compute workload. + +A reduction in the separation of concerns in hyperconverged systems can have an impact on +security and operational complexity as maintenance operations, or any node failure, not only +impacts the workload on that node, but also the underlying storage system. + + +### 4.2 Data Protection + +A key function of any storage system is to provide protection of the data that is being +persisted in the system or service. This is often implemented as a transparent layer in the +system. + +#### 4.2.1 RAID: Striping, Mirrors & Parity + +RAID (redundant array of independent disks) uses techniques such as striping, mirroring and +parity to distribute and provide redundancy for data across a set of disks: + +``` +Striping : this is a process where data is spread evenly across 2 or more disks. +Striping in itself does not provide redundancy or fault tolerance - in fact, striping on +its own increases the chance of failure as a failure of any of the individual disks will +typically result in unavailability of the whole dataset.
Instead striping is used to +increase performance of a number of data protection functions by distributing the +load across more disks such that a workload is not limited to the performance of a +single component. +``` +``` +Mirrors : a mirror maintains an identical copy of adataset across two disks. This +configuration enables the availability of data to continue in the event of a disk or +component failure. It is also possible to mirror multiple disks for additional +redundancy. +``` +``` +Parity : when using parity, a data set is distributedacross a number of disks that are +grouped together. For each unit of data (typically a block, but can be as small as a +byte), an algorithm is used to generate an additional set of parity data which is stored +alongside the data. In the event of the failure of any individual disk, then the missing +data can be regenerated using the remaining data segments and the parity data. +The benefit of using parity over mirrors is that parity does not require a full copy of +the data set and can therefore implement data protection with less overhead in terms +of disks or backend storage capacity. The capacity benefit comes at the expense of +performance overhead and using parity for data protection can impact latency and +throughput. +``` +There are four main RAID levels in common use today: + +``` +RAID0 : this uses a simple stripe data set and is typicallyonly used when the only +consideration is performance, as RAID0 datasets do not have any redundancy. +``` +``` +RAID1 : a RAID1 dataset consists of a mirror. In aRAID1 dataset, read performance +can be increased as the reads can be striped across both sets of the mirror, but +writes will only be as fast as an individual disk as the write needs to be written to both +disks in parallel. Any data set will also consume double the capacity on the disks as +a result. +``` + +``` +RAID5 : this implements a distributed dataset withdistributed parity. 
Each block is +distributed across the disks in a RAID5 set together with the additional parity. This +method provides a good balance between capacity utilisation and redundancy: the +parity ensures that data can be recovered or rebuilt if any single disk fails, but as +data is not mirrored, the capacity lost to redundancy is only 1/x (where x is the +number of disks in the raid set). Performance of read operations is similar to a +striped dataset and can utilise the combined speed of all the disks in the dataset, but +write performance has a high penalty: every write or update needs to touch every +disk in the RAID set. A RAID5 set can only survive a single disk failure and care +must be taken to ensure that a rebuild of the data is completed before a second +failure occurs. +``` +``` +RAID6 : RAID6 is also a distributed dataset with distributedparity with the difference +that two sets of parity are generated. This allows for two concurrent disk failures to +occur in a RAID6 set without impacting the availability of data. RAID6 has similar +disk performance characteristics of RAID5, but imposes a higher CPU workload - this +is due to the calculation of the second parity set. +``` +RAID sets can be striped to further spread the data across many more disks for improved +performance. This is sometimes referred to as nested RAID, but more often is determined +by adding a “0” to each of the RAID levels e.g. RAID10, RAID50 and RAID60 referred to +stripes of RAID1, RAID5 and RAID6 respectively. + +Using multiple copies of parity (such as RAID6) has become more important as the size of +disks continues to grow, as the size of the disk tends to determine the time to rebuild a +dataset. As a result, custom additional parity based RAID sets have been defined in some +solutions (e.g. RAIDZ in ZFS) to add 3 or more sets of parity. 
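The parity mechanism described above can be illustrated with simple XOR parity, the scheme used for single-disk redundancy in RAID5: the parity block is the XOR of the data blocks, and any single missing block can be regenerated by XOR-ing the surviving blocks with the parity. A minimal Python sketch (the block contents are made up for illustration):

```python
from functools import reduce

def xor_blocks(blocks):
    """XOR a list of equal-sized byte blocks together."""
    return reduce(lambda a, b: bytes(x ^ y for x, y in zip(a, b)), blocks)

# Three data blocks - a tiny stand-in for blocks striped across three disks.
data = [b"AAAA", b"BBBB", b"CCCC"]

# The parity block is the XOR of all data blocks, stored on a fourth disk.
parity = xor_blocks(data)

# Simulate losing disk 1: rebuild its block from the survivors plus parity.
rebuilt = xor_blocks([data[0], data[2], parity])
assert rebuilt == data[1]
```

This also shows why RAID5 writes are expensive: updating any one data block requires recomputing and rewriting the parity block as well.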
Although RAID is typically implemented within the set of disks in a specific node, it is also possible to distribute RAID across a network and implement redundancy across nodes AND disks at the same time. This is a technique used in some distributed storage systems.

#### 4.2.2 Erasure Coding

Erasure coding is a method used to protect data where a data set or object is split into multiple fragments that are then encoded and stored with a configurable number of redundant parity sets. As an example, a data object might be broken down into 6 data fragments and 4 parity fragments and would be referred to as (6+4) erasure coding. The ability to have many parity fragments enables very high redundancy and very high durability.

Each of the fragments can be distributed across different disks and servers/nodes in multiple locations. Erasure coding typically uses Reed-Solomon codes (although a variety of algorithms are available with different performance/efficiency characteristics) to perform encoding and is therefore a computationally intensive process. The primary benefit of using erasure codes is the flexibility of a user configurable balance between data distribution, capacity utilisation and redundancy. As a result, erasure coding is utilised in many distributed storage systems and is the primary building block for data protection and redundancy in many object stores.

One drawback of erasure coding is that the number of data fragments and the distribution across multiple nodes means that write and read operations on data objects can incur significant latency due to the network overhead as well as the computational overhead. As a result, erasure coding is best applied to large datasets which are optimised for either reducing overall capacity utilisation or improving redundancy and durability.

#### 4.2.3 Replicas

Replicas are mirrored data sets that are distributed across multiple servers/nodes.
A replica is a full copy of the dataset, and therefore the number of replicas for a data set multiplies the capacity needed to store it. Each individual replica is usable as a standalone copy, and rebuild operations are therefore extremely quick as they are simple and can be implemented as a point to point transaction.

Replicas have a much lower compute and network distribution overhead and are therefore preferred when lower latency is important. Replicas can also be used to provide parallelised read access for some workloads.

### 4.3 Data Services

Storage systems often implement a number of data services which complement the core storage function by providing additional functionality that may be implemented at different layers of the stack.

#### 4.3.1 Replication

This service provides the capability to replicate a set of data (e.g. a volume or a bucket) to improve the availability and durability of the data. Note - this data service is often separate from the core data protection function (such as mirrors or replicas) and is generally used to replicate data between independent storage systems, often in different locations.

Replication can be performed synchronously, where a request to persist data is only acknowledged to the application after the replica target has also acknowledged it. This provides a strongly consistent model with a low time to recover from failure, but can impact latency and performance. Due to the time taken for data to traverse a network, latency increases with distance, and synchronous replication is typically only feasible when the source and target systems are within 100km of each other.

Replication can also be performed asynchronously, where data to be replicated is queued and is transferred to the target replica out of band of the actual storage persist operations. This means that asynchronous replication is eventually consistent and has a lower impact on overall performance.
Asynchronous replication can support replication over long distances, but adequate bandwidth must be available to transfer the deltas that change between the source and target system in an acceptable time frame.

#### 4.3.2 Snapshots and Point in Time (PIT) copies

Snapshots or point in time copies of data improve the availability of a dataset and provide the capability to back up and further protect the data. A snapshot is a view of the data set at a given point in time (when the snapshot was taken) and provides the ability to access the data consistently as it was at that point.

Snapshots can be implemented in a space efficient manner using techniques such as "copy-on-write" (COW), which provides a virtualisation layer where snapshots only contain the delta between the original data set and what was written since the snapshot was taken. This provides the capability of taking multiple snapshots at different intervals whilst minimising the amount of capacity needed to store the snapshots.

Many storage systems also allow the creation of a point in time copy of the data which includes a full copy of the data set. This is often referred to as a "clone" and utilises additional capacity in the storage system, but creates an independent copy of the data set. This can be useful when the data set is to be utilised in a manner which might impact the performance of the original data set if a snapshot was taken.

Processing snapshots and data copies often means maintaining data structures and metadata, which can add CPU, memory or disk overhead and impact performance. Whilst the creation of space efficient snapshots is often a low overhead function, the creation of a clone requires a full copy of the data set, which will impact performance and utilise bandwidth to move the data from the original data set to the copy.
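The copy-on-write behaviour described above can be sketched in a few lines of Python: a snapshot stores nothing at creation time, and blocks are copied into it only when they are about to be overwritten. This is a conceptual sketch only - the class and method names are invented for illustration and do not correspond to any particular storage system.

```python
class Volume:
    """A toy block volume with copy-on-write snapshots."""

    def __init__(self):
        self.blocks = {}     # block number -> current data
        self.snapshots = []  # each snapshot is a dict of preserved old blocks

    def snapshot(self):
        """Create a snapshot; it is empty until blocks are overwritten."""
        snap = {}
        self.snapshots.append(snap)
        return snap

    def write(self, block_no, data):
        # Preserve the old contents for any snapshot that has not yet
        # captured this block (the "copy" part of copy-on-write).
        for snap in self.snapshots:
            if block_no not in snap:
                snap[block_no] = self.blocks.get(block_no)
        self.blocks[block_no] = data

    def read_snapshot(self, snap, block_no):
        """Read a block as it was when the snapshot was taken."""
        if block_no in snap:
            return snap[block_no]          # preserved old data
        return self.blocks.get(block_no)   # unchanged since the snapshot

vol = Volume()
vol.write(0, b"v1")
snap = vol.snapshot()   # capacity used by the snapshot so far: none
vol.write(0, b"v2")     # old contents of block 0 are preserved first
assert vol.read_snapshot(snap, 0) == b"v1"
assert vol.blocks[0] == b"v2"
```

The space efficiency follows directly from the structure: a snapshot only ever holds blocks that have actually changed since it was taken.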
### 4.4 Data Reduction

Storage systems can use a number of techniques to reduce the size of data stored. This improves the capacity efficiency of the underlying physical storage by using data compression and/or deduplication. Storage systems implement data reduction with various granularities based on implementation, and data reduction can be applied at a block, file, object, local or global level.

Data compression provides a method to efficiently encode data to remove redundant or repetitive patterns such that the encoded data consumes less space.

Deduplication typically uses a method such as a hash to detect duplicate data and then stores a link to the existing data rather than storing multiple copies of it.

Many applications can benefit from data reduction techniques, but some types of data that are encrypted or already compressed (e.g. images or videos) do not benefit from data reduction.

Data reduction can impact the performance and scalability attributes of a storage system. In general, data reduction will add compute overhead which will impact latency and throughput. In some cases, data reduction may improve performance, especially when the limiting factor is the performance of the physical storage or the network.

### 4.5 Encryption

Storage systems can provide methods to ensure that data is protected through encryption. Data encryption can be implemented for data in transit or data at rest and can ensure that the encryption function is implemented independently of the application.

Encryption can have an impact on performance as it adds compute overhead, but acceleration options are available on many systems which can reduce the overhead.

Encryption services can be implemented for data in transit (protecting data in the network) and for data at rest (protecting data on disk).
The encryption may be implemented in the storage client or storage server, and the granularity of encryption will vary by system (e.g. per volume, per group or global keys).

The encryption function will often depend on integration with a key management system, which may add complexity to a storage system.

### 4.6 Physical / Non-Volatile Layer – terminology

Storage systems will ultimately persist data on some form of physical storage layer which is generally non-volatile. The choice of the physical layer impacts the overall performance of the storage system and defines the long term durability of the stored dataset.

Cloud services often use similar terminology for service classes to define the performance characteristics and SLAs of the service.

Some of the most commonly used systems include:

* Spinning / magnetic disk (e.g. SATA, SAS & SCSI) - magnetic media are traditional hard disks and are mechanical devices in that they have spinning magnetic disks that are read by a read/write head. Latency is a combination of the rotational latency of the disk, the seek time for the head to move into place to read/write the data, and the electronics/bus. SATA, SAS and SCSI are transports used by the operating system to access the device through a host bus adapter (HBA). Latency per operation is measured in milliseconds and throughput is generally under 250MB/sec. Magnetic media generally offers the lowest cost per GB of capacity.

* SSD (with traditional interfaces such as SATA, SAS or SCSI) - a solid state disk does not have any moving parts and stores data in non-volatile memory (typically some type of flash). This allows for much lower latency operations - typically small fractions of a millisecond - and allows for tens of thousands of I/O operations per second. Throughput is usually limited only by the transport utilised and is measured in hundreds of MB per second.
Different classes of flash are available which impact the performance as well as the durability - SSD flash wears out and can fail after a given number of cell overwrites. Storage systems that are optimised for SSDs will therefore generally attempt to minimise write amplification to minimise wear.

* Non Volatile Memory (e.g. SSD/NVMe) - flash-based devices are generally faster than the traditional transports used to access them. NVMe is a faster transport that minimises the protocol overhead by treating the flash more like memory, where data can be accessed randomly rather than in block format as defined in disk transport protocols like SCSI. This allows for much lower latency - typically a few tens of microseconds - and much faster throughput - typically measured in GB per second.

## 5 Data Access Interface

The data access interface defines how applications or workloads store or consume data that is persisted by the storage system or service.

The interface is an important factor in the choice of a storage solution as, often, different workloads or applications will have a pre-defined or preferred access method.

Different interfaces also influence a number of attributes such as:
* availability – in terms of failover and moving access between nodes
* performance – in terms of latency and throughput
* scalability – in terms of the number of clients that can access a given pool of storage

In addition to these attributes, in practice, the choice of access interface has a large impact on the management interfaces available and therefore the ability of orchestrators to manage and provision storage. In particular, volume interfaces currently have more mature integrations with orchestrators.

### 5.1 Data Access Interface: Volumes

#### 5.1.1 Block

A block device is the fundamental building block of many volumes.
A disk device is represented as a block device to an operating system and represents a contiguous set of blocks that are ultimately stored on the disk (or other non-volatile storage). Blocks are typically represented as a 4KiB unit of data to the operating system, although different disk systems may actually store blocks internally in either smaller or larger units. Read and write operations are performed in units of individual blocks.

A block device can be a representation of a local disk but can also be a representation of a virtual or remote disk that is either connected to or provided by a storage system.

Block devices are rarely consumed by applications directly and are often used as a device that underlies a filesystem. Some databases can be configured to consume raw block devices directly in order to improve performance. Permissions and access control of block devices are typically reserved to admin users of the operating system.

Further details are available in this section.

#### 5.1.2 Filesystem

A filesystem defines how data is persisted and retrieved by the operating system, by structuring the data in terms of files and directories. A filesystem will often use a block device to persist the data to a non-volatile storage medium such as a disk.

Permission attributes in a filesystem can be allocated to both files and directories, allowing granular access to users and groups, as well as defining the type of access (e.g. read, write or execute access). Some filesystems support more extended attributes that improve the flexibility and levels of control and access. Filesystems can also support locking semantics that allow an application to mark a file as locked for exclusive use. The supported locking capabilities vary between filesystems and may operate differently when used on a remote or distributed filesystem.
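The locking semantics mentioned above can be exercised from user space. As an illustrative Python sketch using the POSIX `flock` advisory lock (Linux/Unix only), a second open of the same file cannot take a conflicting exclusive lock until the first is released:

```python
import fcntl
import tempfile

# A scratch file standing in for shared application data.
path = tempfile.NamedTemporaryFile(delete=False).name

f1 = open(path, "wb")
f2 = open(path, "wb")  # a second, independent open of the same file

# f1 takes an exclusive advisory lock on the file.
fcntl.flock(f1, fcntl.LOCK_EX)

# f2 cannot take a conflicting lock without blocking.
try:
    fcntl.flock(f2, fcntl.LOCK_EX | fcntl.LOCK_NB)
    locked = True
except BlockingIOError:
    locked = False
assert not locked  # the lock is held exclusively by f1

# Once f1 releases, f2 can acquire the lock.
fcntl.flock(f1, fcntl.LOCK_UN)
fcntl.flock(f2, fcntl.LOCK_EX | fcntl.LOCK_NB)
```

Note that `flock` is advisory: a process that never checks the lock can still read and write the file, which mirrors the caveat above that locking behaviour varies between filesystems and may differ on remote or distributed ones.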
Filesystem code is typically run within the kernel of the operating system to maximise performance, which means that the filesystems available to an application will be dependent on the particular operating system distribution. It is also possible to run filesystems at the user level (FUSE); these are often used to provide a filesystem representation of datasets other than those stored in a native block device.

Further details are available in this section.

#### 5.1.3 Shared Filesystem

Filesystems are typically limited to an individual server or node and can therefore only be accessed by one node at a time. A shared filesystem is a filesystem that can be mounted on more than one node at a time. This provides additional flexibility and supports patterns where applications are distributed between multiple servers and need to access a common set of data.

A shared filesystem can be consumed from a point-to-point service endpoint, where a server node exposes a local filesystem to other servers - this is limited to the performance (and sometimes the availability) of a single node. Alternatively, a shared filesystem can be distributed across multiple nodes and systems in a distributed filesystem - this allows for datasets and scalability beyond what can be supported on a single node.

Clustered filesystems can provide similar functionality to shared filesystems by using shared block devices which are available on multiple nodes, but they are rarely utilised in a cloud native context.

### 5.2 Data Access Interface: Application API

#### 5.2.1 Object Stores

Object stores use an API to store and retrieve objects or blobs. The APIs for the most popular object stores utilise an HTTP interface. Object stores are typically based on a distributed architecture that is optimised for capacity, durability and scalability, allowing thousands of clients to connect to petabyte-scale buckets of storage.
The overhead required to commit multiple copies for availability and durability, and the use of an HTTP API, tends to lead to a higher latency overhead per operation, but high levels of throughput can be maintained through parallel access from multiple clients.

Further details are available in this section.

#### 5.2.2 Key Value Stores

A key value store is accessed by an API and uses a key as an identifier to store and retrieve values from the store. Key value stores can be implemented in a library, a local system or a distributed system.

Key value stores are often used to store metadata and configuration and are often implemented with strong consistency. As a result they are often utilised as a method for storing state, configuration and indexes for distributed systems and applications.

Further details are available in this section.

#### 5.2.3 Databases

Databases are typically accessed through an API provided by the project or vendor. Those that offer relational features have the opportunity to build an API that conforms to an industry defined set of standards for access. Examples are Java JDBC, Python PEP 249, Go's database package, etc.

### 5.3 Orchestrator, host and operating system level interactions

A number of virtualization and access layers are often overlaid or interposed on a data access interface as part of the integration of the storage solution into an orchestrated environment, and can influence the availability, scalability and performance of the overall end to end solution.

Often a hypervisor may also be providing access to resources and may be performing a variety of functions including mapping storage resources, pooling multiple resources which are shared between workloads, managing connectivity to resources and handling failover and data protection functions.

#### 5.3.1 Volumes

Some interactions that may apply to volume access interfaces include:
* A volume manager (e.g.
LVM) which may provide functionality to pool resources, provide data protection and even take an active role in failover and recovery
* Bind mounts and overlay filesystems which provide functionality to layer filesystems and image layers to provide integration with orchestrators and container runtimes.

#### 5.3.2 Application API

Some interactions that may apply to application API interfaces include:
* Discovery, to provide functionality to identify resources in a cluster or a network
* Meshes, ingress end-points and load balancers that can provide functionality to route requests to store and retrieve data based on content or resource availability

### 5.4 Comparison between Object Stores, File Systems and Block Stores

Data Access Interface | Most suited | Least suited
--- | --- | ---
Block | Availability, low latency performance, good throughput performance for individual workloads | Capacity scaling, sharing data with multiple workloads simultaneously
Filesystem | Sharing data with multiple workloads simultaneously, optimised throughput for aggregated workloads | Strong file locking integrity when filesystems are shared
Object Store | Availability, large capacities (PB scale), durability, sharing data with multiple workloads simultaneously, optimised throughput for parallelised workloads | Low latency performance

_** The information in this table reflects generally accepted attributes and measurements for object stores, file systems and block stores._

### 5.5 Comparison between Local, Remote and Distributed Systems

... | Local | Remote | Distributed
--- | --- | --- | ---
|Availability | Limited by failure of components locally and the ability to failover. If a node fails, the local storage is isolated to the local node. | May be limited by single points of failure. Workloads can move to another node and reconnect to the remote storage.
| Clients may access numerous nodes, and any storage node failure can be mitigated. The additional complexity of distributed systems may add operational complexity, which may in turn affect availability or the ability to recover from errors.
|Scalability | Limited by the local architecture (1 node; typically TB) | Limited by a monolithic architecture (2-16 nodes; typically 10s-100s of TB) | Scales by adding additional systems; the topology enables scaling of both nodes and supported capacities (3-1000s of nodes; often supports PB)
|Consistency | Yes (storage system implementation is easy) | Yes (storage system implementation is harder with more nodes) | Yes (storage system implementation is hardest)
|Durability | Limited by local components (least) | Limited by a monolithic architecture (more) | Scaling out to additional systems increases durability (most)
|Performance | Limited by local components, can benefit low-latency applications (100us-5ms, GB/sec) | Similar to local, but with additional overhead from the network transport (500us-5ms, GB/sec) | Scaling out to additional systems increases performance (500us-5ms, TB/sec)

_** The information in this table reflects generally accepted attributes and measurements among local, remote, and distributed storage systems._

## 6 Block Stores

Block stores are a persistence target where data is stored in blocks in local, remote, or distributed locations. The blocks are typically numerically addressed using a method called Logical Block Addressing (LBA) and accessed by a client through a device interface provided by the kernel. The location (local/remote/distributed) is determined by the physical persistence location of the blocks and serves as a method to group and categorize different stores.

It is possible to transparently augment or enhance numerous characteristics of block stores such as availability, scalability, consistency, durability, and performance by adding additional software-based storage layers (i.e.
RAID) along with physical devices, networking, and nodes. Please refer to the Consistency, Availability, and Partition-tolerance (CAP) theorem overview in the appendix for more details.

Virtualization adds another perspective which is important to consider. Operating systems may or may not be aware of the type of block store being used. Virtual machines and machine instances are likely not storing any blocks locally but completely leveraging remote or distributed block stores. In this case, instances provide virtualized hardware that stores data remotely and emulates the connectivity and behavior of local physical storage devices. This storage would not be considered a local block store due to the non-locality of the stored data.

Most applications do not directly store data in block format, but instead interface with file systems supported by block devices (i.e. application -> local EXT4 filesystem -> local block device -> local/remote/distributed block store). See the file systems section below for more details.

_The following categories include examples solely with the intent of providing context to the category being described. Examples are intended to be widely known to the readers._

### 6.1 Local Block Stores

Local block stores are built on Direct Attached Storage (DAS) where data is persisted locally on hardware devices. Since all data is stored locally, the scale is limited to the local resource capabilities. The availability of the data is a major consideration when applications are interacting directly with local block stores. Logical volume management (LVM) and similar techniques can be used to augment and concatenate the capabilities that discrete hardware devices provide. These stores tend to be focused on specific use cases where latency is critical or to support other storage services.
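Logical Block Addressing as described above can be illustrated from user space: reading block *n* of size *b* is simply a seek to offset `n * b` followed by a read of `b` bytes. The sketch below uses an ordinary file as a stand-in for a block device (on Linux the same code would work against a device node such as `/dev/sdX`, given sufficient privileges):

```python
import os
import tempfile

BLOCK_SIZE = 512  # a common logical block size; 4KiB is also typical

def read_block(fd, lba):
    """Read one block at the given logical block address."""
    os.lseek(fd, lba * BLOCK_SIZE, os.SEEK_SET)
    return os.read(fd, BLOCK_SIZE)

# Build a fake "device" of 4 blocks, each filled with a distinct byte.
path = tempfile.NamedTemporaryFile(delete=False).name
with open(path, "wb") as f:
    for i in range(4):
        f.write(bytes([i]) * BLOCK_SIZE)

fd = os.open(path, os.O_RDONLY)
block2 = read_block(fd, 2)
os.close(fd)
assert block2 == bytes([2]) * BLOCK_SIZE
```

Real block devices additionally require reads and writes to be aligned to the device's block size; the file-backed stand-in hides that constraint.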
Generally accepted example terms, platforms, and protocols: ATA, IDE, logical volumes, LVM, physical volumes, physical storage devices, RAID, SCSI, volume groups

### 6.2 Remote Block Stores

Remote block stores provide storage attached by a network, where data is persisted remotely across that network. This differs from local storage because there is a separation of the application from the storage. Generally, this has the ability to increase capacity and performance. Availability is also increased since high availability design patterns can be implemented. Without intentional design and clear service assurances, however, overall service levels are likely to be dictated by this layer.

Generally accepted example terms, platforms, and protocols: AWS EBS, FC, FCoE, iSCSI, SAN

### 6.3 Distributed Block Stores

Distributed block stores are similar to remote block stores but data is persisted across many nodes, possibly in conjunction with the local node, and clients are able to rely on many nodes to provide redundancy and horizontal scalability. When compared with local and remote block stores, distributed block stores require additional control and data access layers to manage data distribution (and often also replication). This added complexity can provide improved scalability, availability, and durability.

Generally accepted example terms, platforms, and protocols: Ceph, DRBD, OpenEBS, Longhorn, hyper-converged

## 7 File Systems

A file system is a logical persistence layer organized around storing and retrieving data referenced by files. File systems provide a richer set of primitives than block stores, including access control, concurrency control and locking, naming and directory structure, sequential file access, and other features. This makes them more suitable for direct use by applications than block stores.
The actual persistence function is performed by supporting layers, where the file system may translate files to logical block addresses. File systems can be local, remote, or distributed (independent of the underlying block store's locality). There are numerous types of file systems which tend to differentiate to optimize for many characteristics including storage medium, read/write expectations, performance, durability, and access patterns.

_The following categories include examples solely with the intent of providing context to the category being described. Examples are intended to be widely known to the readers._

### 7.1 Local File Systems

Local file systems are typically built from local, remote, or distributed block stores. They are commonly used by operating systems to store dependent files.

Generally accepted example terms, platforms, and protocols: EXT4, file, inode, XFS

### 7.2 Remote File Systems

Remote file systems are also referred to by the category name network file systems. They consist of a specialized client that presents local data structures and stores data across a network in remote locations. By separating client from server, a remote file system's capabilities expand beyond the limits of the local system.

There are numerous types of remote file systems with their own specializations. For example, remote file systems are not inherently optimized for safe multi-client access. Applications sometimes solve for this by introducing additional locking mechanisms, or they embrace clustered file systems.

Generally accepted example terms, platforms, and protocols: CIFS, cluster, file locks, NFS, VMFS

### 7.3 Distributed File Systems

Distributed file systems are a type of remote file system that provide the ability for clients to seamlessly store and retrieve files across clusters of servers. The scale is elastic because files are stored in a distributed manner and are globally addressable.
Generally accepted example terms, platforms, and protocols: Gluster, HDFS, Lustre, CephFS

### 7.4 Comparison

Comparing file systems also requires considering the interaction with the underlying storage layers. The following table describes the generally accepted understanding of which combinations of these layers are optimal, neutral or non-optimal.

... | Local File System on.. | Remote File System on.. | Distributed File System on..
--- | --- | --- | ---
Local Block Store | Optimal | Optimal | Optimal
Remote Block Store | Optimal | Neutral | Non-Optimal
Distributed Block Store | Optimal | Neutral | Non-Optimal

## 8 Object Stores

Unlike file systems and block stores, where there is a general understanding of the implementation behind the interface, object stores are quite heterogeneous in their implementations. In general an object store system is an _atomic_ key-value store, where the key and value are defined by the implementer of the storage system. An atomic key-value store guarantees that a request to _set_ or _get_ a value is either fully committed or not at all.

There are many examples of object stores, ranging from how an internet browser gets HTML content and sends, or posts, data back to the web server, to how an operating system gets and sets data at an LBA on a block store.

### 8.1 HTTP Based Object Storage

Due to the large range of object store implementations, this paper will focus on HTTP based object stores as defined by Amazon Web Services S3, Google Cloud, and OpenStack Swift. Typically, these types of object stores are used for large opaque values and, as a result, they are often used to store images, videos, and data backups.

These types of object store systems have defined a set of methods based on the HTTP protocol where the key is a URL and the value is a set of data. This interface makes it simple to access content since there is no need to mount or attach an object store.
Due to the nature of this model, these types of object store systems are always remote to the requesting node.

An HTTP based object store's access model is largely constructed around an _account_, which contains a set of _buckets_ (as they are called by S3) or _containers_ (as they are called by OpenStack Swift). Each of these buckets or containers can then contain objects. The _key_ is the combination of these values, loosely based on the format:

```
http(s)://<server>/<version>/<account>/<bucket or container>/[...object]
```
where the _object_ is a unique identifier referring to an object. For example:

```
https://server.io/v1/admin/pictures/path/to/the/picture.jpg
```
shows the object is _path/to/the/picture.jpg_.

One of the advantages of these HTTP based systems over simple object stores is metadata management. These storage systems make it possible for the requester to attach custom metadata to objects, which can then be used to list, fetch, or group objects. Another advantage is access control, where access can be granted on anything from a set of containers or buckets down to a single object.


### 8.2 Scalability, Availability, Durability, Performance

HTTP object stores are designed for scalability and durability, but not for low latency performance as compared to block or file based storage systems. Instead they are designed to support extremely large amounts of data spread over not only a single data center, but over many regions all over the world. HTTP object stores are also designed for durability, supporting many methods of maintaining the data integrity of their objects. They may maintain multiple copies of their objects or use erasure coding to maintain object durability. These methods provide an unprecedented object durability service level agreement. As an example, Amazon Web Services claims its S3 object store service is designed for a durability of 99.999999999%.
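These replication-based durability figures can be illustrated with a simple back-of-the-envelope model: assuming independent replica failures, the chance of losing all n copies of an object is the per-replica loss probability raised to the n-th power. A minimal sketch; the 1% per-replica figure is an illustrative assumption, not a vendor number:

```python
import math

# Naive illustration of how replication multiplies durability.
# Assumes independent replica failures; real systems must also account
# for correlated failures and for repair (re-replication) time.

def annual_loss_probability(p_replica_loss: float, n_replicas: int) -> float:
    """Probability that all n independent replicas are lost in a year."""
    return p_replica_loss ** n_replicas

def nines_of_durability(loss_probability: float) -> float:
    """Express durability as a 'number of nines', e.g. 0.001 loss -> 3 nines."""
    return -math.log10(loss_probability)

# With an (assumed) 1% annual loss probability per replica:
for n in (1, 2, 3):
    p = annual_loss_probability(0.01, n)
    print(f"{n} replica(s): loss probability {p:.0e}, ~{nines_of_durability(p):.0f} nines")
```

Correlated failures and the window before a lost replica is re-created both reduce durability below this idealized figure, which is one reason erasure coding and multi-region placement are used in practice.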
Due to their nature, HTTP based object store systems are not suited for latency sensitive applications. On the other hand, unlike block and file, an object store system can provide data to clients on behalf of an application. For example, when a web browser requests data from an application which stores data in an object store, instead of returning the data itself, the application can send the web browser pointers to the data on the object store. The object store system then returns the data _directly_ to the client from the region closest to it, reducing the network IO requirements of the application.


## 9 Key-Value Stores

A key-value store is a storage system designed for storing, retrieving, and managing key-value pairs. Values are identified and accessed via a key, similar to a hash table. In a key-value store there is no predefined schema, and the value of the data is usually opaque. It is a very flexible data model because the application has complete control over what is stored in the value.

A key-value store system might store its data fully in memory, partially in memory, or fully on disk. It might be only locally accessible or remotely accessible. It might run on only a single node or might be distributed and scalable. Many more complex storage systems, such as databases, block storage, file systems, and logging systems, are usually built on top of key-value stores or a key-value abstraction.

### 9.1 Local Key-value Stores

A local key-value store is usually accessed by a single application through inter-process communication or direct intra-process API calls. It stores the data in local memory or a local filesystem. The local key-value store is designed for low latency access and for ease of use and operation. Many distributed applications or distributed storage systems use one or more local key-value stores as their basic storage unit for further replication.
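A local key-value store of this kind can be sketched with Python's standard-library `dbm` module, which persists byte-string pairs in a local file and is accessed in-process (the file path here is illustrative):

```python
import dbm
import os
import tempfile

# Open (flag "c": create if missing) a file-backed local key-value store.
# The path is illustrative; any local filesystem location works.
path = os.path.join(tempfile.mkdtemp(), "kv.db")

with dbm.open(path, "c") as db:
    db[b"user:42"] = b"alice"   # set
    value = db[b"user:42"]      # get

# The pairs survive the process that wrote them: reopening reads them back.
with dbm.open(path, "r") as db:
    persisted = db[b"user:42"]

print(value, persisted)  # b'alice' b'alice'
```

It is exactly this kind of durable, in-process store that higher layers treat as the basic storage unit for replication.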
Berkeley-DB, InnoDB, LMDB, and RocksDB are among the best-known examples of this category.

### 9.2 Remote Shared Key-value Stores

A remote shared key-value store is usually accessed by a number of applications through networking protocols (HTTP, RPC, or customized ones). It stores the data in local memory or a local filesystem. The shared key-value store is designed for efficiency and flexibility. Some remote key-value stores also provide additional data-structure APIs for ease of use. A traditional relational database can also be used as a remote key-value store, with a simple two-column (key, value) table, when reliability and durability are the first priorities.

Redis and memcached are among the best-known examples of this category.

### 9.3 Distributed Key-value Stores

A distributed key-value store replicates its data to one or more nodes in the system for high availability and durability, and might shard its data across different replication groups for scale-out. Some distributed key-value stores trade off latency or scalability for linearizability and serializability consistency guarantees over the entire key-value space, to reduce the risk of potentially conflicting updates. Some provide weaker consistency guarantees (either eventual consistency, or stronger consistency within a single partition) but better latency guarantees.

etcd, ZooKeeper, Consul, etc. provide distributed key-value store APIs for handling metadata or coordination. They implement only data replication, not sharding, to simplify the overall design and improve reliability. These systems provide strong consistency guarantees over the entire key space.

Cassandra, HBase, etc. provide distributed key-value store APIs for managing massive amounts of data with low latency. They are similar in that they are all wide-row key-value stores. They implement both data replication and sharding.
Strong consistency can be achieved for mutations within a row or within a partition, sometimes with limited availability. They do not provide strong consistency guarantees for mutations spanning different partitions of the key space.

Spanner, CockroachDB, TiKV, YugabyteDB, FaunaDB, FoundationDB, etc. provide distributed key-value store APIs for managing massive amounts of data with strong consistency guarantees. They implement both data replication and sharding. Additionally, they implement distributed transactional protocols across multiple shards to support global transactions, either through clocks (a high-accuracy physical clock or HLC) or through a single master (Calvin or similar protocols). The distributed transaction protocol typically introduces additional latency for cross-shard transactions. Even with high-accuracy physical clocks, the latency can be as high as several milliseconds.

### 9.4 Comparison

... | Local | Remote | Distributed and non-global-transactional | Distributed and global-transactional
--- | --- | --- | --- | ---
Availability | Limited by local component failures | Limited by remote component failures | Partial failures affect no or only a limited key space | Partial failures affect no or only a limited key space
Scalability | Limited by local resources | Limited by remote resources | Scales out as more capacity is added | Scales out as more capacity is added; API scalability is often limited by a single master
Global consistency | Strong | Strong | Weak | Strong
Durability | Limited by local storage failures | Limited by remote component failures | Tolerant to partial failures | Tolerant to partial failures
Performance | Limited by I/O access latency | Limited by I/O access latency and network latency | Limited by I/O access latency and network latency | Limited by I/O access latency, network latency, and usually a single-master. 
Multiple rounds of network latency for cross-shard transactions.

## 10 Databases

In the past, the term “database” was synonymous with a relational database. However, there are now other systems that get categorized as databases even though they don't strictly satisfy the properties of a relational database. In particular, there are many upcoming NewSQL systems, and there are also specialized ones like graph databases. Similarly, existing relational databases such as PostgreSQL and MySQL have been moving in the opposite direction, allowing data to be stored without a fixed schema.

### 10.1 Functionality and Backing Stores

Databases offer advanced functionality beyond what one would expect of a traditional key-value store. A database typically has some of the following characteristics (but not necessarily all):
* ACID transactions (Atomicity, Consistency, Isolation, and Durability)
* Secondary indexes
* Relationships across different pieces of data, and the ability to join them on the fly
* A query language to fetch and/or mutate the data; the most popular of these is SQL

We are also aware that the lines are blurring, as many key-value systems are starting to support some of the above features.

Many databases allow one to configure their backing store as an external file system or block storage. In such cases, the trade-offs are the same as those of a key-value store. Essentially, the comparisons made in section 9.4 also apply to such systems.

### 10.2 Cloud Native Databases

Not all databases are cloud-native. Therefore, caution must be used before running them in a cloud environment like Kubernetes. The major areas of concern are:
* the life-cycle and mobility of a Kubernetes Pod,
* the ephemeral local storage,
* the added latency of a remotely mounted volume.
These concerns can typically be addressed with additional tooling, such as proxies and orchestration systems that can react to events that some databases may not be inherently built to handle. The exact solution will differ based on the extent to which a system is sensitive to the above changes.

On the other hand, systems like Vitess, TiDB, YugabyteDB, Cloud Spanner, and CockroachDB come with built-in proxies and orchestration. These properties make them better suited to run in the cloud.

### 10.3 Data Protection

It is also recommended that backups be taken regularly. Even if sufficient durability is achieved through replication, there are other cases where a backup comes in handy. For example, if a bug in the application accidentally destroys data, one could go back to an older snapshot to recover the lost data. Some database systems have native support for continuous backup, allowing users to perform finer-grained Point In Time Recovery operations, restoring a consistent snapshot of the database as it was immediately before the incident.

### 10.4 Database Comparison

Topology | Stand-alone instance | Replicated DB | Sharded | Sharded and Replicated
--- | --- | --- | --- | ---
Example | Individual relational database instance | Master-replica or multi-master deployments | A subset of records per instance, behind a front-end router | Cloud native databases
Availability | Limited by the availability of the single node and its network connection. | Multiple replicas; failover needs to be coordinated | Sharding may lower overall availability: any one unavailable shard may make the DB unavailable. 
| Availability based on the number of replicas
Scalability | Requires compute and storage to scale up; capacity limited to the capabilities of a single node | Data is not distributed, but queries can be targeted at replicas; capacity limited to the capabilities of a single node | Horizontal scaling of reads, writes, and capacity is possible, but sharding does not solve read latency problems without replicas | Scaling based on the sharding
Consistency | Strong | Strong | Typically strong, but asynchronous replication and eventual consistency may impact consistency | Typically strong, but asynchronous replication and eventual consistency may impact consistency
Durability | Dependent on the capabilities of the underlying volume storage | Based on the number of replicas: a data loss event requires all n replicas to be lost | Can be comparable to a stand-alone instance, although sharding minimises the blast radius, as loss of a single shard results in only partial data loss | Based on the number of replicas: a data loss event requires all n replicas to be lost
Performance | Dependent on memory (cache), compute, and storage resources | Can be negatively impacted by replication overhead, especially if replication is synchronous to facilitate strong consistency; long-running queries can be offloaded to replicas to improve transactions on the master | Balanced across a number of nodes; operational complexity for sharded systems may apply | May be either increased or decreased by sharding and replication, depending on query types and replication strategy

## 11 Orchestration and Management Interfaces

This section defines how Container Orchestration Systems interact with Storage Systems to associate workloads with data from the Storage Systems. Depending on the Data Access Interfaces, different layers may be involved.
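As a concrete example of associating a workload with storage, a Kubernetes PersistentVolumeClaim (covered in section 11.1.1.2 below) expresses a storage request that the orchestrator matches against available volumes. A minimal sketch of such a manifest, shown here as a Python dict; the claim name, storage class name, and requested size are illustrative assumptions:

```python
# A minimal PersistentVolumeClaim manifest expressed as a Python dict.
# "data-claim", "standard", and "10Gi" are illustrative; actual names,
# storage classes, and sizes depend on the cluster.
pvc_manifest = {
    "apiVersion": "v1",
    "kind": "PersistentVolumeClaim",
    "metadata": {"name": "data-claim"},
    "spec": {
        "accessModes": ["ReadWriteOnce"],
        "storageClassName": "standard",
        "resources": {"requests": {"storage": "10Gi"}},
    },
}

# The orchestrator matches this claim against available PersistentVolumes,
# or dynamically provisions one via the named storage class.
print(pvc_manifest["kind"])
```

The same manifest is normally written as YAML and submitted to the cluster; the dict form above is only meant to show the shape of the request.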
### 11.1 Volumes - block stores and filesystems

A Container Orchestration System (CO) such as Kubernetes can support multiple interfaces to interact with the Storage System.

The Storage System can:
* **(A)** support a control plane interface API directly and interact directly with the orchestrator, or
* **(B)** interact with the orchestrator via an API Framework layer or other tools.

The orchestrator can use the control plane interfaces **(A)** or **(B)** to support the request for a volume, and may also be able to use the interface to dynamically provision a volume.

Workloads consume **(C)** storage from storage systems over various data access interfaces.

The underlying storage infrastructure layer can be software-based commodity storage, cloud storage, or enterprise storage. The management layer provides an abstraction over the complexity of various storage systems.

Whether to use **(A)** or **(B)** depends on user requirements and the capabilities supported by the storage system. **(A)** has primarily focused on dynamically provisioning storage (or pre-provisioning storage) for workloads, although more advanced functionality may be added in the future. **(B)** may also support discovery, automation, and other data services such as data protection, data migration, or data replication, in addition to provisioning.

There are ongoing discussions in Kubernetes about providing more advanced functionality such as data protection. At the time of this writing, a Data Protection Working Group was formed as a collaboration between Kubernetes SIG-Storage and SIG-Apps to promote this objective (https://github.com/kubernetes/community/blob/master/wg-data-protection/charter.md).

#### 11.1.1 Control Plane Interfaces

_“Control-Plane Interfaces”_ refers to storage interfaces for COs.
It includes Native Interfaces, such as Kubernetes Native Drivers and the Docker Volume Driver Interface, as well as External Interfaces, such as Kubernetes Flexvolume and the Container Storage Interface.

##### 11.1.1.1 Container Storage Interface

The Container Storage Interface (CSI) is an industry standard defining a set of storage interfaces so that a storage vendor can write one plugin and have it work across a range of Container Orchestration (CO) Systems. COs supporting this specification include Kubernetes, Mesos, and Cloud Foundry. Other companies, including storage vendors, have also been helping with the design. It has evolved to become the volume driver interface of the future for Container Orchestration Systems.

CSI has three gRPC services: the controller, node, and identity services. The identity service provides information about a plugin and its capabilities. The controller service supports creating and deleting volumes, creating and deleting snapshots, attaching and detaching volumes, and expanding volumes. The node service supports mounting and unmounting volumes, and expanding volumes. For more details, see the spec here: https://github.com/container-storage-interface/spec

CSI v1.0.0 was released in November 2018 and v1.2.0 was released in October 2019. The Kubernetes implementation of CSI was promoted to GA in the Kubernetes v1.13 release (https://kubernetes.io/blog/2019/01/15/container-storage-interface-ga/). At the time of this writing, both Mesos and Cloud Foundry have implemented experimental CSI drivers.

##### 11.1.1.2 K8S Native Drivers

This refers to the Kubernetes in-tree volume drivers that extend the Kubernetes volume interfaces to support block and file storage systems. Kubernetes has the following concepts for storage:

* A Persistent Volume (PV) is a piece of storage provisioned by an administrator on the storage system.
* A Persistent Volume Claim (PVC) is the storage requested by a user. The Kubernetes cluster will try to find a PV that matches the PVC request.
* A PV can be pre-provisioned or dynamically provisioned. Dynamic provisioning is done using a Storage Class created by an administrator. A Storage Class defines the different levels of service that a storage system can provide. Kubernetes manages the life cycle of PVs and PVCs. Data on a volume can persist beyond the lifetime of a pod that consumes the volume.

Kubernetes in-tree volume drivers can support the following functionalities: create and delete volume, attach and detach volume, mount and unmount volume, and expand volume.

https://kubernetes.io/docs/concepts/storage/
https://kubernetes.io/docs/concepts/storage/volumes/#types-of-volumes

The Kubernetes Storage SIG is in the process of moving these in-tree drivers out of tree, in favor of CSI drivers. There is a design spec aiming to seamlessly migrate from in-tree drivers to CSI, which would allow CSI drivers to handle volume provisioning requests as a proxy for in-tree drivers:

https://github.com/kubernetes/enhancements/blob/master/keps/sig-storage/20190129-csi-migration.md

In the Kubernetes 1.17 release, this CSI migration feature was promoted to beta. Cloud provider volume drivers will be the first ones targeted to move from in-tree to CSI. The plan is to remove the cloud provider in-tree drivers (AWS EBS, GCE PD, OpenStack Cinder, Azure, and vSphere) in the Kubernetes 1.21 release. After that, plans will be made to migrate the other in-tree drivers.

##### 11.1.1.3 Docker Volume Driver Interface

Docker volumes can be used to persist data in Docker. Docker provides a mechanism for storage vendors to write a volume driver so that remote storage systems such as Amazon EBS can be used to provide volumes for a Docker container. This allows data volumes to persist beyond the lifetime of a single Docker host. If a plugin registers itself as a VolumeDriver when activated, it must provide the Docker daemon with writable paths on the host filesystem.
The Docker daemon makes these volumes available for containers to consume by bind-mounting the provided paths into the containers.

Supported Docker volume driver interfaces include Create, Remove, Mount, Unmount, Path, Get, List, and Capabilities.

https://docs.docker.com/storage/
https://docs.docker.com/engine/extend/plugins_volume/

##### 11.1.1.4 K8S Flexvolume

Kubernetes Flexvolume provides interfaces to initialize a driver, to attach/detach a volume to/from a host, and to mount/unmount a volume on/from the host. These functions are executed by the Kubelet component in Kubernetes.

Flexvolume does not provide interfaces to provision/deprovision a volume. A dynamic provisioner can be developed to provision/deprovision a volume and be used together with the Flexvolume plugin.

Flexvolume is also an out-of-tree plugin. Since the plan is to use out-of-tree CSI drivers in the future, new features will not be added to Flexvolume, although existing Flexvolume features are still supported.

https://github.com/kubernetes/community/blob/master/contributors/devel/flexvolume.md

Note that there is an effort underway to move the core Kubernetes component images to distroless. One side effect is that shell access (and possibly other common Linux utilities you may depend on) is unavailable inside these containers. Flexvolume drivers requiring master installation (in the controller-manager) will have issues starting up if they depend on these utilities. Note that this does not impact drivers running only on the Kubelet.

#### 11.1.2 Frameworks and other tools

_“Frameworks and other tools”_ are extensions of COs’ _“Control-Plane Interfaces”_. In addition to provisioning and managing storage, this extended control plane can also support discovery, automation, data protection, data migration, disaster recovery, monitoring, analytics, performance tuning, data lifecycle management, etc.
Some examples of frameworks and other tools described in this section include OpenSDS, Rook, and Velero.


### 11.2 Application API

Currently the Control-Plane Interfaces, the storage interfaces supported by COs, do not include object stores, key-value stores, or databases, although there is ongoing work in Kubernetes trying to fill the gaps in object store support. Some Frameworks and Tools have support for object stores, key-value stores, and databases. Some examples are given in the following sections.

Note that there is an extension API called Service Catalog that enables applications running in Kubernetes clusters to use externally managed software offerings, such as a datastore service offered by a cloud provider.

https://kubernetes.io/docs/concepts/extend-kubernetes/service-catalog/

#### 11.2.1 Object Stores

Some management interfaces provide a way to directly deploy object storage and allow object storage to be consumed by containers through the object interface (usually S3). Rook's support of Minio is a good example of this. REX-Ray has integration with object storage as well. OpenSDS has built object storage support via S3 APIs.

There are also ways to connect persistent volumes provisioned for containers to an object store on premises or in the cloud.

* For cloud storage such as Google Cloud Persistent Disks or Amazon Elastic Block Storage, a snapshot of a PVC for block storage will be uploaded to an object store somewhere in the cloud as part of the snapshot creation process.
* Some other management interfaces provide a similar approach that uploads a snapshot created for block storage to an object store on premises or in the cloud.
* Some management interfaces also provide a separate backup API that takes a volume or snapshot from block storage and backs it up to an object store.
  * At the time of this writing, there are ongoing discussions in Kubernetes about providing a separate backup API that could back up a volume to a remote backup device such as an object store.
* At the time of this writing, there is ongoing work in Kubernetes to allow an S3 bucket to be provisioned as a first class resource, similar to how a persistent volume is provisioned.

#### 11.2.2 Key Value Stores

It is possible for a management interface to provide a way to deploy and manage key-value stores, similar to how databases can be deployed and managed by the management interface.

#### 11.2.3 Databases

A management interface can provide a way to deploy and manage databases. For example, Rook provides an operator to deploy and manage CockroachDB and YugabyteDB clusters. Another CNCF storage project, [Vitess](https://vitess.io), also provides an operator to manage MySQL clusters.


## 12 Appendix

### 12.1 Document History

Initially, the document was structured around classes of storage type, categorised by the way the storage is consumed, e.g. block, file, or object. This did not provide a useful way to compare and contrast their attributes and how they are utilised in production, as most storage systems have many layers and are formed of multiple components. While the data access interface (such as block or file) might affect how the data is consumed and how it might fail over between nodes, it does not effectively define attributes such as data protection, consistency, or durability.

As a further complication, many commonly used systems are layered storage systems where, for example, a filesystem may be built on an object store (e.g. CephFS), or a block store may be built on a distributed filesystem (e.g. Gluster block storage).
This meant that the way the storage is accessed did not usefully define the attributes that an application cares about (such as durability, data protection, or some of the performance characteristics of the overall system), as those attributes are defined at other layers in the stack.

### 12.2 Consensus Protocols

Consensus protocols provide reliable agreement among a group of potentially faulty distributed processes on a single data value or a replicated log. In distributed systems they are commonly used to decide whether to commit a data change transaction, and for leader election, state machine replication, load balancing, clock synchronization, and more. The two most popular (families of) consensus algorithms are Multi-Paxos and Raft, both of which have been formally proven correct (for practical uses, with some caveats). Both rely on a single elected leader and (typically) agreement by a strict majority of participants (e.g. for 5 participants, at least 3 must explicitly agree). Raft is considered simpler to understand and implement than Multi-Paxos. Other ad-hoc attempts at consensus algorithms are notoriously prone to edge-case failures.

#### 12.2.1 Paxos

Paxos is arguably the oldest formally studied family of consensus algorithms. It is considered highly robust when implemented properly, but challenging to implement correctly for practical uses.


#### 12.2.2 Raft

Raft was developed about a decade after Paxos to address the issues mentioned above. It has become widely used, and forms the basis of, amongst others, the popular etcd cloud-native key-value store and the Consul distributed service mesh.

#### 12.2.3 Two-phase Commit (“2PC”)

2PC is a specialized form of consensus protocol used for coordination between participants in a distributed atomic transaction, to decide whether to commit or abort (roll back) the transaction. 2PC is not resilient to all possible failures, and in some cases outside (e.g.
human) intervention is needed to remedy failures. It is also a blocking protocol: all participants block between sending in their vote (see below) and receiving the outcome of the transaction from the coordinator. If the coordinator fails permanently, participants may block indefinitely without outside intervention. In normal, non-failure cases, the protocol consists of the two phases from which it derives its name:

1. The commit-request phase (or voting phase), in which the coordinator requests that all participants take the necessary steps for either committing or aborting the transaction, and vote either "Yes" (on success) or "No" (on failure).
2. The commit phase, in which the coordinator decides whether to commit (if all participants have voted "Yes") or abort, and notifies all participants accordingly.

#### 12.2.4 Three-phase Commit (“3PC”)

3PC adds an additional phase to the 2PC protocol to address the indefinite blocking issue mentioned above. However, 3PC still cannot recover from network segmentation, and due to the additional phase it requires more network round-trips, resulting in higher transaction latency.

### 12.3 Consistency, Coherence and Isolation

The above three terms are commonly used in different contexts to mean different things in the fields of data stores and distributed systems. Without going into detail here, suffice it to say that consistency, in particular, is a widely misunderstood term, so it is worth thinking twice before assuming that you understand exactly what is meant by a particular use of the term. For example, ACID (Atomicity, Consistency, Isolation, Durability) properties and the CAP Theorem (concerning Consistency, Availability, and Partition-tolerance) are both widely used terms, and many people assume that they understand what these terms mean. But considerably fewer people realise that “Consistency” means quite different things in those two contexts.
For further details, Wikipedia and Irene Zhang's musing provide good starting points.

#### 12.3.1 ACID

With the above caveats, for data storage systems, Atomicity, Consistency, Isolation, and Durability are generally considered to mean:

1. Atomicity: a guarantee that each transaction across multiple data items is treated as a single "unit", which either succeeds completely or fails completely, even in the case of various failures including machine crashes and network errors.
2. Consistency: usually understood to mean guarantees about whether a transaction started in the future can necessarily see the effects of all transactions committed in the past. Also sometimes understood as a guarantee that a transaction can only bring the data from one valid state to another, while maintaining invariants (for example, that a stock count cannot be less than zero, or that two customers with the same id number cannot exist).
3. Isolation: guarantees that concurrent execution of transactions leaves the database in the same state that would have been obtained if the transactions were executed sequentially, in some order.
4. Durability: guarantees that once a transaction has been committed, it will remain committed even in the case of a system failure (e.g. power outage or crash). This usually means that completed transactions (or their effects) are recorded in non-volatile memory.

#### 12.3.2 The CAP Theorem

The CAP Theorem states that it is impossible for a distributed data store to simultaneously provide more than two of the following three guarantees:

1. Consistency: Every read receives the most recent write or an error
2. Availability: Every request receives a response that is not an error
3. 
Partition tolerance: The system continues to operate despite an arbitrary number of messages being dropped (or delayed) by the network between nodes

In the absence of network failure, both availability and consistency can be satisfied. CAP is frequently misunderstood to mean that one has to choose to abandon one of the three guarantees at all times. In fact, the choice is really between consistency and availability only when a network partition or failure happens; at all other times, no trade-off has to be made.

Database systems designed with traditional ACID guarantees in mind, such as RDBMSs, choose consistency over availability, whereas systems designed around the BASE philosophy, common in the NoSQL movement, choose availability over consistency.

The PACELC theorem builds on CAP by stating that, even in the absence of partitioning, a further trade-off between latency and consistency occurs.
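The two-phase commit protocol described in section 12.2.3 can be sketched with in-process objects standing in for the coordinator and participants; this is a deliberately simplified, failure-free happy path that ignores the blocking and recovery issues discussed above:

```python
# Minimal two-phase commit sketch: in-process, failure-free happy path.
# Real implementations must persist votes and decisions and handle
# coordinator failure, which is exactly where 2PC can block.

class Participant:
    def __init__(self, can_commit: bool):
        self.can_commit = can_commit
        self.state = "init"

    def prepare(self) -> bool:
        """Phase 1 (voting): take the necessary steps and vote Yes/No."""
        self.state = "prepared" if self.can_commit else "aborted"
        return self.can_commit

    def finish(self, commit: bool) -> None:
        """Phase 2 (commit): apply the coordinator's decision."""
        self.state = "committed" if commit else "aborted"

def two_phase_commit(participants) -> bool:
    votes = [p.prepare() for p in participants]   # commit-request phase
    decision = all(votes)                         # commit iff all voted "Yes"
    for p in participants:                        # commit phase
        p.finish(decision)
    return decision

ok = two_phase_commit([Participant(True), Participant(True)])
bad = two_phase_commit([Participant(True), Participant(False)])
print(ok, bad)  # True False
```

A single "No" vote aborts the whole transaction for every participant, which is what makes the protocol atomic across participants but also what forces everyone to wait on the coordinator's decision.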