The UK Government's register-to-vote website recently crashed under the number of users trying to access it. This article looks at tools and techniques you can use to review whether your own website could cope.
The voter registration website started to report errors from around 10.15pm on 7th June, shortly before the original midnight deadline for registering to vote. Users reported a "504 Gateway Time-out" error. It happened as 50,711 users were online at the same time, and after the site had recorded 525,000 visits during the day. The issue seems to have been that the physical infrastructure could not cope with demand - a 504 gateway error means that information requested from one server was not received by another server as quickly as it should have been, either because that server had gone offline or could not keep up with demand. Every connected user would be requesting information from the server (or, in practice, from one of many servers), and the tipping point would be a combination of the memory on the machines, the information being asked for and the number of simultaneous requests. Of course, it is not always easy to predict peak levels of traffic, nor is it cheap to have servers on standby just in case they are needed. However, there are some things you can consider:
The more efficiently your website runs, the more users it can cope with. This should be the first point of focus when considering whether your site can cope with spikes in traffic.
You should check things like page loading speed (even just using Google Analytics) and the efficiency of the code on your site. It is possible to review the code to ensure that requests (or database queries) are not asking for unnecessary information and are not taking too long to run. When bandwidth was more limited, it was considered good practice to send one complex query to a database and wait for the response; more recently, it is often considered better to send many small requests. There are a number of analysis tools, such as New Relic, which also doubles as a monitoring tool so you can receive alerts for any downtime or load issues. New Relic can identify particularly slow-running queries so you can see if they can be optimised.
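To illustrate the basic idea behind timing slow queries (this is a minimal plain-Python sketch, not how New Relic works internally; `time_call` and `fake_query` are invented names for the example):

```python
import time

def time_call(label, fn, *args, slow_threshold=0.5):
    """Run fn, measure how long it takes, and flag slow calls (threshold in seconds)."""
    start = time.perf_counter()
    result = fn(*args)
    elapsed = time.perf_counter() - start
    if elapsed > slow_threshold:
        print(f"SLOW: {label} took {elapsed:.3f}s")
    return result, elapsed

# Example: time a simulated database query
def fake_query(user_id):
    time.sleep(0.01)            # stands in for a real database round trip
    return {"id": user_id}

row, took = time_call("fetch_user", fake_query, 42)
```

Wrapping individual queries like this is a quick way to find candidates for optimisation before reaching for a full profiling product.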
Another obvious issue is the content that you deliver to users. Making sure images and logos are optimised will ensure quicker delivery. You can also use tools such as Varnish, which cache regularly served items for a defined period of time. For example, if you send out a large email campaign that pulls images from your website, you could see a large number of requests for a few files; Varnish allows those files to be served from a cache, reducing file requests. One other issue to bear in mind is calling third-party content into your site. For example, if you display social media updates on your site, it is normally better to store a cached version on your local server than to connect to the social media site every time a user visits your page.
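The caching idea can be sketched in a few lines. This is not Varnish itself, just an illustration of the time-to-live principle it uses; `TimedCache` and `fetch_feed` are invented for the example:

```python
import time

class TimedCache:
    """Cache fetched items for a fixed number of seconds, Varnish-style."""
    def __init__(self, fetch, ttl=300):
        self.fetch = fetch          # function that does the real (slow) request
        self.ttl = ttl
        self._store = {}            # key -> (expiry_time, value)

    def get(self, key):
        entry = self._store.get(key)
        if entry and entry[0] > time.time():
            return entry[1]         # still fresh: serve from the cache
        value = self.fetch(key)     # stale or missing: do the real request
        self._store[key] = (time.time() + self.ttl, value)
        return value

# Example: cache a social media lookup so every visitor doesn't trigger a remote call
calls = []
def fetch_feed(user):
    calls.append(user)              # stands in for the real network request
    return f"feed for {user}"

cache = TimedCache(fetch_feed, ttl=60)
cache.get("acme")
cache.get("acme")                   # second call is served from the cache
```

Only the first request within the time-to-live window hits the remote service; every other visitor gets the stored copy.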
You probably know a fair amount about the datacentres you are using, but have you asked what levels of demand they can cope with and how they handle network outages? Most providers these days have robust networks, but it is always worth knowing how they cope with spikes and whether multiple data centre delivery is worth considering.
You can improve performance by using a load balancer in front of your servers. This directs traffic across multiple servers holding the same information, so you effectively have more overall RAM. You can also consider splitting servers into file servers and dedicated database servers, which is useful for more complex sites with a lot of database calls. This setup can also provide more resilience.
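A simple round-robin scheme is the most common starting point for load balancing. The sketch below shows the routing idea only (real balancers such as HAProxy or nginx also handle health checks and failover); `RoundRobinBalancer` and the server names are invented for the example:

```python
import itertools

class RoundRobinBalancer:
    """Distribute incoming requests across identical backend servers in turn."""
    def __init__(self, servers):
        self._cycle = itertools.cycle(servers)

    def route(self, request):
        server = next(self._cycle)      # pick the next server in the rotation
        return server, request

balancer = RoundRobinBalancer(["app1", "app2", "app3"])
routed = [balancer.route(f"req{i}")[0] for i in range(6)]
# Requests rotate evenly across the pool
```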
There are a number of basic options for server infrastructure - shared or dedicated physical servers, cloud-based servers, or a hybrid. In most cases, cloud-based servers are shared with other users, so if you are on a shared server you need to consider so-called "noisy neighbours": your site could be slowed down by another website hosted on the same server. The physical make-up of your servers can also affect performance - you can check RAM, but be aware that inefficient code can push up memory usage. You should also check the maximum number of simultaneous users allowed on the server. This is normally a trial-and-error setting, as it also depends on the number and type of requests for information. Most infrastructure providers will err on the side of caution and keep the maximum low to prevent a server from crashing; in most cases, it is possible to increase it without an impact on the server.
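A "maximum simultaneous users" setting is, at heart, a counter of slots. The sketch below (a hypothetical `ConnectionLimiter`; real servers configure this in their software, e.g. Apache or nginx worker limits, rather than in application code) shows why a full server rejects the next request rather than slowing everyone down:

```python
import threading

class ConnectionLimiter:
    """Cap the number of simultaneous requests a server will handle."""
    def __init__(self, max_users):
        self._slots = threading.BoundedSemaphore(max_users)

    def try_accept(self):
        # Non-blocking: a full server turns the request away rather than queuing it
        return self._slots.acquire(blocking=False)

    def release(self):
        self._slots.release()

limiter = ConnectionLimiter(max_users=2)
limiter.try_accept()            # 1st user accepted
limiter.try_accept()            # 2nd user accepted
limiter.try_accept()            # 3rd user rejected: no free slot
limiter.release()               # a user finishes, freeing a slot
```

Raising the cap costs nothing until requests actually arrive, which is why providers' cautious defaults can often be increased safely.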
Know your traffic levels
Most website administrators will have an idea of normal traffic levels and can expect to know when spikes will occur. One example is concert ticket sites. These often have a queuing system so that they can cope with demand for popular artists when tickets first go on sale. However, the user experience is not normally that good - a user simply wants to buy tickets, not queue. To meet the demand without queues, ticket sites would need more computing power, but that could be expensive, and there would still be limits on the number of simultaneous users and on the underlying ticketing software.
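The queuing approach can be sketched simply: admit visitors up to a capacity, hold the rest in order, and let the next person in as someone leaves. This is an invented illustration (`AdmissionQueue` is not a real product), not how any particular ticketing site works:

```python
from collections import deque

class AdmissionQueue:
    """Hold excess visitors in a queue and admit them as capacity frees up."""
    def __init__(self, capacity):
        self.capacity = capacity
        self.active = set()
        self.waiting = deque()

    def arrive(self, user):
        if len(self.active) < self.capacity:
            self.active.add(user)
            return "admitted"
        self.waiting.append(user)
        return f"queued (position {len(self.waiting)})"

    def leave(self, user):
        self.active.discard(user)
        if self.waiting:                    # admit the longest-waiting visitor
            self.active.add(self.waiting.popleft())

q = AdmissionQueue(capacity=2)
q.arrive("a")                               # admitted
q.arrive("b")                               # admitted
q.arrive("c")                               # queued behind the capacity limit
q.leave("a")                                # "c" is now admitted
```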
Most websites will not see unexpected spikes unless they appear on TV, are trending on social media or have a special promotion (such as Black Friday or an email campaign). If you know what your peak traffic is, and what it could be, you can run load tests on your servers to see if they can cope with extra demand. It is always useful to know the maximum number of requests or users your site can handle, so you can plan for growth.
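A basic load test just fires many concurrent requests and records how response times degrade. The sketch below uses a simulated page handler so it runs anywhere; for a real test you would swap `fake_page` for an HTTP call (dedicated tools such as JMeter or ab do this at much larger scale). All the names here are invented for the example:

```python
import time
from concurrent.futures import ThreadPoolExecutor

def load_test(handler, n_requests, concurrency):
    """Fire n_requests at `handler` with the given concurrency and time each one."""
    def one_request(i):
        start = time.perf_counter()
        handler(i)
        return time.perf_counter() - start

    with ThreadPoolExecutor(max_workers=concurrency) as pool:
        timings = sorted(pool.map(one_request, range(n_requests)))
    return {
        "requests": len(timings),
        "median_s": timings[len(timings) // 2],
        "worst_s": timings[-1],
    }

# Example run against a simulated page that takes ~5ms to serve
def fake_page(i):
    time.sleep(0.005)

stats = load_test(fake_page, n_requests=50, concurrency=10)
```

Running this at increasing concurrency levels shows where the median and worst-case times start to climb, which is the practical ceiling you should plan growth around.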
If you are concerned about large, unpredictable spikes, cloud-based solutions can be set up to bring new servers online automatically as they are needed.
Another major cause of unpredicted spikes is being the victim of a Denial of Service attack, which normally swamps a site with fictitious traffic. The main problem here is that you probably will not know which requests for information are genuine and which are not. One solution for this scenario is to use a Content Delivery Network such as Cloudflare. As well as delivering content from multiple locations, these networks monitor where requests are coming from across their network and can block requests from suspect IP addresses.
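One common building block for filtering flood traffic is per-IP rate limiting: an address making far more requests than a human could is probably not genuine. The sketch below shows the sliding-window idea only (CDNs use far more signals than this); `RateLimiter` and the addresses are invented for the example:

```python
import time
from collections import defaultdict, deque

class RateLimiter:
    """Reject an IP that exceeds max_requests within a sliding window of seconds."""
    def __init__(self, max_requests, window_s):
        self.max_requests = max_requests
        self.window_s = window_s
        self._hits = defaultdict(deque)     # ip -> timestamps of recent requests

    def allow(self, ip, now=None):
        now = time.time() if now is None else now
        hits = self._hits[ip]
        while hits and hits[0] <= now - self.window_s:
            hits.popleft()                  # forget requests outside the window
        if len(hits) >= self.max_requests:
            return False                    # looks like a flood: reject
        hits.append(now)
        return True

limiter = RateLimiter(max_requests=3, window_s=1.0)
limiter.allow("10.0.0.1", now=0.0)          # normal traffic passes
limiter.allow("10.0.0.1", now=0.1)
limiter.allow("10.0.0.1", now=0.2)
limiter.allow("10.0.0.1", now=0.3)          # 4th request inside the window: blocked
```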
Costs vs benefits
While websites going down is never ideal, there are always limits to what is possible and practical. The main message is to ensure that your key stakeholders understand that there is always a small risk a website could go down, and that the cost of keeping a site up could outweigh the benefits. Planning is the key, so you know where the limits are.