Mitigate performance issues through cache configuration and other improvements. #215
Open
asw101 wants to merge 646 commits into Azure:master from
Add documentation about managing Azure DDoS Protection in Manage.md.
…plan after the deployment.
Azure DDoS protection (fixes Azure#107)
…to avoid any duplicates)
enable Moodle 3.5 (LTS) release
…tate configuration of vmss nodes
Added hook for cron script on nfs or gluster for configuration of vss nodes
enabling accelerated networking on all created interfaces
iennae (Contributor) reviewed on Sep 29, 2020 and left a comment:
I've reviewed and provided feedback to Aaron directly. LGTM so far.
This reverts commit 7605b70.
iennae (Contributor) approved these changes on Oct 1, 2020 and left a comment:
Awesome! Changes look great.
Installing jq with curl command.
Fixing ubuntu installation of moosh
Fix timegated jmeter test
This setup adds an NSG on the VMSS to specifically allow http and https ports.
Upgrading loadbalancer SKU to Standard.
Member, Author
Thank you @naioja for your tweaks for NSG with Standard Load Balancer. I have merged the current changes from master and resolved the merge conflict. I have also included your suggested snippet to ensure the alternative_component_cache directory exists!
5c199d9 to 2afa403
This PR mitigates performance and transient reliability issues that we identified during load testing with JMeter, using the Latency-Sensitive Stress Testing exam (time-gated-exam.jmx) with tweaks and updates for the latest version. The changes are as follows:
Sets the Moodle localcachedir to /tmp/localcachedir
During testing of the Large size deployment, which defaults to Azure Premium Files as the external file share, we identified files in the /moodle/moodledata directory that caused increased latency. The first is the localcachedir directory, for which Moodle recommends a fast local file system when Moodle is clustered.
Sets alternative_component_cache to /var/www/html/moodle/core_component.php
This change works in conjunction with localcachedir and provides significant performance improvements when moodledata is located on an external file share such as Azure Premium Files (see the related issue "caching problem with gluster" #126 regarding GlusterFS). We chose this directory because it must already exist and the web server must have permission to write to it.
Increases default osDisk size from 30GB (120 IOPS/3,500 Burst IOPS/25MB/sec) to 256GB (1,100 IOPS/3,500 Burst IOPS/125MB/sec)
During load testing we believe we may have hit IOPS and/or throughput limits at the disk level, the VM level, or both, which can cause a VM to become unavailable. Updates to Disk and VM metrics will make this clearer. To mitigate this, we chose a Premium SSD size with significantly more IOPS and throughput.
We initially chose 1,024GB (5,000 IOPS/200MB/sec) because it is the first size that does not rely on the 3,500 "Burst" IOPS. Latency also decreased as the disk size increased. However, a smaller size such as 256GB (1,100 IOPS/3,500 Burst IOPS/125MB/sec) may be suitable, so this PR changes the default from 30GB to 256GB.
We applied this change to both the Virtual Machine Scale Set (VMSS) that handles the web traffic and the Controller VM we use for JMeter testing (after resizing it to match the VMSS), in order to maintain parity in terms of IOPS and throughput.
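To put the disk figures quoted above side by side, here is a small shell sketch that compares the baseline IOPS and throughput of two sizes. The numbers are copied from this description; the function names (tier_specs, compare_tiers) are hypothetical and not part of the templates, and the shared 3,500 burst IOPS figure is deliberately left out of the comparison.

```shell
#!/bin/sh
# Baseline figures as quoted in this PR description, keyed by disk size in GB.
# Output format: "<iops> <throughput in MB/sec>".
tier_specs() {
  case "$1" in
    30)   echo "120 25" ;;
    256)  echo "1100 125" ;;
    1024) echo "5000 200" ;;
    *)    echo "unknown disk size: $1" >&2; return 1 ;;
  esac
}

# Print what the new size provides as an integer percentage of the old size:
# "<iops percent> <throughput percent>" (integer division, rounded down).
# The function body runs in a subshell so its variables do not leak.
compare_tiers() (
  old=$(tier_specs "$1") || exit 1
  new=$(tier_specs "$2") || exit 1
  old_iops=${old% *}; old_mbps=${old#* }
  new_iops=${new% *}; new_mbps=${new#* }
  echo "$(( ${new_iops} * 100 / ${old_iops} )) $(( ${new_mbps} * 100 / ${old_mbps} ))"
)

# Old default (30) versus the new default in this PR (256).
compare_tiers 30 256   # prints: 916 500 (about 9.2x the IOPS, 5x the throughput)
```

Running compare_tiers 30 1024 gives the same comparison for the size we tried first.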
Defaults Load Balancer and Public IP to the Standard SKU.
We upgraded our Load Balancer and Public IP to the Standard SKU to enable multi-dimensional metrics and alerts, particularly "SNAT connections". These help us avoid issues such as SNAT port exhaustion, and confirm that we are not experiencing them.
These changes have been tested to deploy successfully against the current master, though load testing was performed against an earlier commit.
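For readers who want to try the two cache settings by hand, the following is a minimal shell sketch rather than the script used in this PR. Only the two $CFG setting names and their values come from the description above; the helper names (ensure_cache_dirs, print_cache_settings) and the temporary demo directory are assumptions made for illustration.

```shell
#!/bin/sh
# Hypothetical helper: make sure the local cache directory and the parent
# directory of the component cache file exist.
# $1 = localcachedir, $2 = alternative_component_cache file path
ensure_cache_dirs() {
  mkdir -p "$1" "$(dirname "$2")"
}

# Hypothetical helper: print the two settings as PHP lines for config.php.
# In Moodle's config.php these need to appear before the line that
# requires lib/setup.php, so they are printed here, not appended to a file.
# $1 = localcachedir, $2 = alternative_component_cache file path
print_cache_settings() {
  printf "\$CFG->localcachedir = '%s';\n" "$1"
  printf "\$CFG->alternative_component_cache = '%s';\n" "$2"
}

# Demo in a throwaway directory, so nothing on the real system is touched.
demo_root=$(mktemp -d)
ensure_cache_dirs "$demo_root/tmp/localcachedir" \
                  "$demo_root/var/www/html/moodle/core_component.php"

# Values from this PR description.
print_cache_settings /tmp/localcachedir /var/www/html/moodle/core_component.php

rm -rf "$demo_root"
```

On a real web node the web server user would also need write permission on both locations, as noted above.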
(Special thanks to @iennae for feedback and insights throughout!)