Monitoring API performance with AWS Athena

Deciding what technical improvements to make is always a challenge, and one area that often requires attention is application performance. The de facto standard for delivering services is to use REST-based APIs, and if you are using Amazon Web Services (AWS) then this usually involves deploying an Elastic Load Balancer (ELB) in front of your backend application servers.

It is a good idea to take a methodical (scientific) approach to fixing performance issues, which means looking at the data and making data-driven decisions about which problems to focus on. This article describes how to get started with performance monitoring using AWS Athena.

TL;DR

AWS Athena provides an SQL query interface to your ELB logs which allows advanced troubleshooting and pattern analysis:

  1. Enable logging on the ELB.
  2. Set a lifecycle policy on the S3 bucket used for logging.
  3. Let the logging run for a few days at least so you have some data to work with.
  4. Set up the Athena database and table.
  5. Craft your query in the Athena Query Editor.
  6. Set up a scheduled job to get regular performance reports.

Costs

There are two costs associated with this solution:

  1. S3 storage costs per TB of stored log data, which is why it is wise to set a lifecycle rule on the log bucket.
    https://aws.amazon.com/s3/pricing/
  2. Athena costs per TB of scanned data, which can be controlled using the storage lifecycle rule already mentioned and by crafting good SQL queries.
    https://aws.amazon.com/athena/pricing/

Step 1: Enable logging

By default logging is not enabled on the ELB. When it is enabled, the logs will be saved to an S3 bucket. Follow the instructions here:

https://docs.aws.amazon.com/elasticloadbalancing/latest/application/load-balancer-access-logs.html
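
Enabling the logs can also be scripted. The sketch below assumes an Application Load Balancer and uses placeholder values for the load balancer ARN, bucket and prefix; the bucket also needs the bucket policy described in the linked documentation so that the load balancer is allowed to write to it.

# Sketch only: enable access logging from the CLI (placeholder ARN, bucket and prefix).
aws elbv2 modify-load-balancer-attributes \
  --load-balancer-arn arn:aws:elasticloadbalancing:eu-west-1:123456789012:loadbalancer/app/my-alb/1234567890abcdef \
  --attributes \
    Key=access_logs.s3.enabled,Value=true \
    Key=access_logs.s3.bucket,Value=my-elb-access-logs \
    Key=access_logs.s3.prefix,Value=my-app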

Step 2: Set a lifecycle policy on the logs

It might take some time to work out how much log history you need or can afford, but it is good to know how to set up and modify the lifecycle rule. Follow the instructions here:

https://docs.aws.amazon.com/AmazonS3/latest/userguide/how-to-set-lifecycle-configuration-intro.html

  1. Set the rule name to Delete_files_older_than_XX_days
  2. Choose the rule scope to be Apply to all objects in the bucket
  3. Set lifecycle rule actions to Expire current versions of objects
  4. Set days after object creation to 90 (or whatever your needs are)

The lifecycle period can be changed anytime.
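
The same rule can be applied from the CLI. This is only a sketch; the bucket name is a placeholder and the retention period should match whatever you chose above.

# Sketch only: define and apply a 90-day expiration rule to the (placeholder) log bucket.
cat > lifecycle.json <<'EOF'
{
  "Rules": [
    {
      "ID": "Delete_files_older_than_90_days",
      "Filter": { "Prefix": "" },
      "Status": "Enabled",
      "Expiration": { "Days": 90 }
    }
  ]
}
EOF

aws s3api put-bucket-lifecycle-configuration \
  --bucket my-elb-access-logs \
  --lifecycle-configuration file://lifecycle.json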

Step 3: Set up AWS Athena

Follow these instructions to set up the database and table that will be used to query the logs:
https://aws.amazon.com/premiumsupport/knowledge-center/athena-analyze-access-logs/

  1. Create the table with partition projection; this avoids having to add partitions manually as new logs arrive.
  2. Change the CREATE TABLE query to name the table “alb_log”. This will save some headaches later on when copying and pasting queries from others. Note that you can recreate the table at any time without losing data or saved queries.
  3. Both the LOCATION and storage.location.template parameters should be changed to match the location of your ELB logs in S3.
  4. To avoid typos in the location name, navigate to the S3 bucket and drill down into the parent of the year-level directories. Then click the Copy S3 URI button and paste this into both locations in the CREATE query.
  5. Set the projection.year.range to the current year and next year, e.g. “2022,2023”. A trimmed sketch of the resulting CREATE statement follows this list.
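
The sketch below is heavily trimmed: the full column list and the RegexSerDe ROW FORMAT definition must be copied from the AWS article, and the bucket, account ID, region and partition layout (year/month/day) shown here are placeholder assumptions.

-- Trimmed sketch only; not the full statement from the AWS article.
CREATE EXTERNAL TABLE IF NOT EXISTS alb_log (
  type string,
  time string,
  client_ip string,
  request_url string,
  target_processing_time double
  -- ...remaining ALB log columns from the AWS article...
)
PARTITIONED BY (year string, month string, day string)
-- The ROW FORMAT SERDE and input.regex from the AWS article go here.
LOCATION 's3://my-elb-access-logs/AWSLogs/123456789012/elasticloadbalancing/eu-west-1/'
TBLPROPERTIES (
  'projection.enabled' = 'true',
  'projection.year.type' = 'integer',
  'projection.year.range' = '2022,2023',
  'projection.month.type' = 'integer',
  'projection.month.range' = '1,12',
  'projection.month.digits' = '2',
  'projection.day.type' = 'integer',
  'projection.day.range' = '1,31',
  'projection.day.digits' = '2',
  'storage.location.template' = 's3://my-elb-access-logs/AWSLogs/123456789012/elasticloadbalancing/eu-west-1/${year}/${month}/${day}'
);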

When the table has been created, click the vertical ellipsis (“⋮”) beside the table name and select Preview table. If everything is working, the results will show ten of the logged requests to your application servers.

Step 4: Find performance issues

Here is an example of a query that finds the 100 slowest requests (taking at least five seconds) in the last week:

SELECT client_ip, request_url, target_processing_time FROM alb_log
WHERE (parse_datetime(time,'yyyy-MM-dd''T''HH:mm:ss.SSSSSS''Z')
     BETWEEN DATE_ADD('day', -7, now()) AND now())
     AND (target_processing_time >= 5.0)
ORDER BY target_processing_time DESC
LIMIT 100;

The frequency of requests that took longer than 30 seconds in the last week:

SELECT count(*) AS cnt, request_url, ROUND(AVG(target_processing_time),0) AS avg_time FROM alb_log
WHERE (parse_datetime(time,'yyyy-MM-dd''T''HH:mm:ss.SSSSSS''Z')
     BETWEEN DATE_ADD('day', -7, now()) AND now())
     AND (target_processing_time >= 30.0)
GROUP BY request_url
ORDER BY cnt DESC
LIMIT 100;
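
Since Athena charges per TB of data scanned, it also pays to filter on the projected partition columns so that only the relevant days of logs are read. Assuming the year/month/day partitions from the sketch in step 3, a week of June 2022 could be summarised with something like:

SELECT request_url, COUNT(*) AS cnt, MAX(target_processing_time) AS worst_time
FROM alb_log
WHERE year = '2022' AND month = '06' AND day BETWEEN '01' AND '07'
     AND target_processing_time >= 5.0
GROUP BY request_url
ORDER BY worst_time DESC
LIMIT 100;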

This is just the start. Querying in SQL means that there is a big community that can help you craft queries to meet your needs.

Step 5: Stay informed

Ideally you want to be getting regular performance reports from your application servers. To do that, you can set up a scheduled task that runs an arbitrary query on the ELB logs and sends the results to the team. We could use an AWS Lambda function, but in this example I will set up a pipeline in Buddy to run once a week and publish the results via email.

First, give Buddy permission to access Athena. Buddy has its own AWS account, and you should follow its instructions for adding Buddy’s account as a trusted entity in AWS. The Buddy account will assume a specific role, which can then have policies attached to it that decide what the role can do. In this case I will use the standard policies that allow Buddy to run Athena queries and retrieve the query results from S3:

  • AWSQuicksightAthenaAccess policy
  • AmazonS3ReadOnlyAccess policy

Note that the AWSQuicksightAthenaAccess policy expects the S3 bucket name to start with “aws-athena-query-results-”. If that is not the case, then create a bucket with such a name and update the workgroup to use it instead. The old bucket can then be deleted.
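
Updating the workgroup can be done in the Athena console, or with something like the following sketch (the bucket name is a placeholder):

# Sketch only: point the primary workgroup at the new results bucket.
aws athena update-work-group \
  --work-group primary \
  --configuration-updates \
    "ResultConfigurationUpdates={OutputLocation=s3://aws-athena-query-results-123456789012-eu-west-1/}"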

In Buddy, create an AWS CLI action that runs a query and then copies the CSV result file from the S3 bucket into the Buddy pipeline.

sql_query="SELECT client_ip, request_url, 
  target_processing_time FROM alb_log
  WHERE (parse_datetime(time,'yyyy-MM-dd''T''HH:mm:ss.SSSSSS''Z')
    BETWEEN DATE_ADD('day', -7, now()) AND now())
    AND (target_processing_time >= 5.0)
  ORDER BY target_processing_time DESC
  LIMIT 100;"

aws athena start-query-execution \
  --query-string "$sql_query" \
  --work-group "primary" \
  --query-execution-context "Database=alb_db" \
  > aws_out.json
execution_id=`jq -r .QueryExecutionId aws_out.json`

while true; do
  aws athena get-query-execution --query-execution-id "$execution_id" > aws_out.json
  execution_state=`jq -r .QueryExecution.Status.State aws_out.json`
  # QUEUED and RUNNING mean the query is still in progress, so wait and poll again.
  case "$execution_state" in
    "SUCCEEDED") break;;
    "FAILED"|"CANCELLED") exit 1;;
    "QUEUED"|"RUNNING") sleep 10;;
  esac
done

csv_file=`jq -r .QueryExecution.ResultConfiguration.OutputLocation aws_out.json`
aws s3 cp $csv_file athena_query_report.csv

The results can also be retrieved directly as JSON if desired:

aws athena get-query-results --query-execution-id $execution_id > athena_query_report.json

Now we can use a second Buddy action to send the results to the team for analysis via email, Slack, etc.

Thinking about migrating to SaaS

In my last article we looked at what factors influence a company’s choice of IaaS solution. A more advanced strategy would be to migrate to a SaaS with the potential for even bigger savings.

If a company is looking to upgrade an existing IT system, then there should always be some research done into what cloud alternatives are available. More and more companies are offering cloud versions of their services, or someone else is offering an equivalent competing SaaS.

Compared to IaaS, SaaS takes management of the IT systems completely out of the hands of the IT department. So much so that any executive with purchasing power can start paying as they go for a cloud service. It requires no IT expertise to register for a Salesforce subscription, for instance.

However, this is a flawed approach for two reasons. The first is that it undermines whatever IT strategy the company may have and can lead to a proliferation of SaaS subscriptions that provide overlapping functionality and are difficult to integrate. What we are talking about is IT governance. While SaaS simplifies the business case for using a new IT system (i.e. zero CAPEX, aka “pay-as-you-go”), adoption still needs to be done in coordination with the IT function (e.g. the CIO). The second aspect of IT governance is security. While any decent SaaS provides good security functionality, it still requires the application of a security posture that is in line with the organisation’s security policies and standards.

To rephrase, if a company uses only SaaS solutions for its IT needs, then IT governance is reduced to managing the SaaS portfolio (which functionality is available where and how they could or should be integrated) and maintaining the organisation’s IT security posture.

Subscribing to a new SaaS is easy, as it should be. The pay-as-you-go model simplifies testing a service, and the setup and roll-out of the service in the organisation is not under the same time pressure as one that has required a huge upfront CAPEX. However, it is a different proposition if the company needs to migrate from an existing legacy system.

There are two types of migration: one is from an on-premise product to the cloud version of the same product, either because it is cheaper or because the vendor has phased out the server version in favour of the cloud version of the product. The other is a migration to a cloud service based on a different product.

Regardless of which type of migration is being performed, there are some challenges (I hesitate to say limitations; read on) with leaving a legacy server-based solution. When a company owns and manages its own copy of a product, it has complete control over how it is deployed and integrated into the corporate IT environment. The product may provide APIs (or not), and there is the possibility to customise the product to meet the organisation’s needs. But the same product delivered as a SaaS will not allow the same customisation. And this, I believe, is where SaaS really comes into its own.

SaaS is very attractive from a licensing and management point of view, provided the company does not want to do a lot of customisation. Vendors, however, understand that one-size-fits-all will limit the number of customers they will have, so they invest heavily in providing lots of configuration possibilities. In the extreme, they can provide layers of abstraction and deliver what is essentially a toolbox of functionality that the customer can use to build the equivalent of their proprietary functionality. Jira Cloud, for instance, provides enormous flexibility when building issue-tracking workflows.

Vendors will provide this toolbox-like functionality as long as there is a market willing to pay for it. However, this may still not be enough for customers with very specific needs. But cloud vendors are not done yet. They can also provide APIs, such as REST, that allow the customer to fulfil its requirements by encapsulating the custom functionality in a separate service. Jira Cloud and Salesforce’s Force.com, for instance, provide this type of integration.

And so here it is: the customer can migrate to the cloud using a standardised, configurable SaaS with an integration to a company-specific service that meets all of their requirements. Now, suddenly, you have cost visibility. On one side you have a standard SaaS that probably provides 95% of the functionality for a very reasonable monthly cost, and on the other side the customisations that deliver 5% of the functionality but probably cost more per month.

But the whole point is not for the customer to have to migrate to the cloud in this way. SaaS makes the real cost of maintaining proprietary solutions painfully visible to management, with the result that there is more incentive to analyse why they are needed in the first place. And guess what: the organisation can often adapt its business processes to behave in a more standard fashion; after all, cloud services exist because they are a successful way for lots of companies to leverage IT in their businesses.

In summary, when an organisation has to choose between changing a work process and making proprietary changes to an on-premise IT system, it is IT that most often gets the job. This creates a legacy that is dragged into the light when the company wants to leverage the benefits of very economical pay-as-you-go services. These customisations acquire a very real maintenance cost, and companies will only retain those that are essential. The IT department’s budget will start to correlate more with maintaining these customisations.

What does tomorrow’s IT department look like? I will explore this topic in another article.

Cloudily

So where are we today? Well, just about everything can be done in the cloud nowadays. At least, there are SaaS services available that offer standard functionality to support most business processes. And then there are all the integration possibilities, both between services (e.g. data mining) and with business-specific solutions (e.g. via Force.com, a REST API or similar). In short, it is possible for companies to choose their comfort zone when adopting cloud technologies.

We mustn’t forget that virtualisation is the backbone of all cloud infrastructures. Functionality like thin-provisioning provides economies of scale and a more efficient use of resources. Virtualisation started in the enterprise as private clouds and is still as relevant today even with the fantastic flora of public cloud offerings. So picking your comfort zone raises some important questions for companies, the most important being, what are the business drivers for adopting cloud technologies?

There is the obvious cost driver that can be realised with virtualisation coupled with hosting (aka IaaS). Virtualisation provides efficiency, and hosting eliminates the need for in-house management and maintenance resources. This also brings us to the competence driver. IT organisations are implicitly expected to remain small and basically invisible to the rest of the organisation. Certainly, in the Cloud Age, if you need more IT people instead of outsourcing, the view is that you’re doing it wrong.

Virtualisation products also deliver a host of other functionality that can be an important business driver. High availability, scalability and performance-monitoring features are common to probably all IaaS platforms. So outsourcing infrastructure can provide cost savings plus lots of new features in a like-for-like migration, i.e. moving from on-premise to an equivalently performing cloud infrastructure with respect to the potentially different workloads of each IT system (e.g. IOPS vs. data mining).

The next driver to look at is flexibility: the ability to bring more resources online in response to a temporarily increased workload. The question is, does the organisation really need that flexibility? Yes, loads vary, but this must be weighed against the cost of hosting in an on-demand infrastructure instead of, say, a (virtual) private cloud (VPC). A VPC can of course be extended to meet increased workloads, but this is a longer-term activity, such as meeting the needs of a growing organisation. If just a single application requires extra capacity, then this does not motivate putting the entire IT environment in an on-demand IaaS solution such as AWS or Microsoft Azure. Certainly this is true of larger organisations, which have a bigger and more stable infrastructure footprint.

Flexibility, then, is achieved using a hybrid solution. The backbone of the IT infrastructure should be virtualised as much as possible using a private cloud (on-premise or hosted). The private cloud will have a predictable workload and will expand or change at a strategic pace. It is then complemented with on-demand resources from a supplier such as AWS for handling peak workloads. Finding the balance between these two costs can be achieved using the resource-monitoring tools in both platforms. Vertical vs. horizontal scalability is also a deciding factor.

Security concerns are usually not a business driver. By this I mean the confidentiality and integrity of data when some or all of the IT systems reside in one or more IaaS platforms. These platforms usually provide networking capabilities such as VPN to allow the corporate network to extend into one or more cloud infrastructures and so provide seamless integration between the various IT systems. Central authentication systems can also use the same VPN tunnels to control authentication and authorisation in all the relevant systems.

To summarise, this walkthrough of the IaaS landscape shows that you can pick your comfort zone with a proper analysis of the infrastructure requirements. The benefits are huge, so some of the cost savings should be used to over-dimension the infrastructure solution and thereby save time performing difficult workload analysis. Hopefully the company will grow into its new suit, and this will also provide important input to future infrastructure investments.

Dimensioning database storage

When building a database system with servers and disk arrays, it is not immediately obvious how to best utilize the available drive bays to create an optimised database storage setup. Let’s look at some requirements.

Requirements

The biggest considerations when selecting a storage solution are:

  1. High availability (HA): the system must fail gracefully, not catastrophically. This means that some form of RAID must be used and hot spares should be available.
  2. Disaster recovery (DR): in the case of data loss or corruption, we need to be able to restore the database from a backup. This must take the recovery time objective (RTO) and recovery point objective (RPO) into account. Can the backups be compressed?
  3. Capacity: how much data is expected to be stored in the system? Usually the data volume grows continuously, so the capacity is often calculated with respect to the lifetime of the hardware.
  4. Transaction intensity (IOPS): how much throughput should the disks handle? Remember that striping across many small disks is preferred. The advent of SSDs is starting to change all that, with Oracle now able to handle tiered storage types.

Storage options

While the database may need high-performance SAS/FC disks or SSD, the backup disks only have to perform sequential reads (database restore) and writes (database backup), and so can make do with slower TB SATA disks.

Looking at your vendor’s offerings, there will be a range of different solutions available. Some rack-mounted servers have built-in storage capacity ranging from 2 up to 25 bays. Other servers have just two internal disks for the OS and database application, and are instead connected via FC to a dedicated disk array that provides the storage. Disk arrays include all the features needed to create an HA storage solution.

Next are the disks. FC disks are the most expensive but, with the advent of competitive SAS disks, are not worth the money unless you need every ounce of performance you can get. Disks come in various sizes; for the database, the relatively small 300GB or 600GB SAS disks are a good choice, while for the backup there are 1TB or 2TB SATA disks available.

Capacity

Once the choice of disks is made, it is possible to calculate the number of disks needed. This has to take three things into account: 1) RAID will consume 20-50% of the raw disk capacity, 2) the remaining capacity ideally shouldn’t be filled to more than 50%, though in practice 70% is usually an acceptable planning figure, and 3) a minimum of two hot spares is needed. The logic behind two spares is that during the time it takes to replace a failed disk, the disk array is still providing HA with the remaining hot spare.
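
As a rough rule of thumb using these figures, usable capacity ≈ (bays − hot spares) × disk size × RAID factor × fill factor. For example, 16 bays of 300GB disks with two hot spares in RAID 1+0 (factor 0.5) at 70% fill gives (16 − 2) × 300GB × 0.5 × 0.7 ≈ 1470GB, which is the figure that reappears in the worked example below.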

Calculating capacity can be tricky. When selecting a 300GB disk from the vendor, the capacity uses the SI definition of a gigabyte, which is 1,000,000,000 bytes. From the OS point of view, however, the disk is 279GiB, a somewhat smaller number based on the traditional binary gibibyte (2^30 bytes). So when asking the sales team what capacity they need for the customer, just be clear which units they are using to avoid embarrassment later.
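
To make the conversion concrete: 300GB = 300 × 10^9 bytes, and 300 × 10^9 / 2^30 ≈ 279GiB, which is what the OS will report for a “300GB” disk.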

OK, so finally to the DR requirement: how much backup storage is needed? Let’s assume we want to use uncompressed backups; this is less processor-intensive and will result in faster restore times. In this case, the backup size is the size of the database. On day one we make the first full backup. On day two we make a new full backup, but we can’t delete the first backup until the new backup is completed, so there must be sufficient space for two full uncompressed backups. In my experience, the last two full backups were kept due to the critical nature of the data, so this required a backup storage capacity that was three times the size of the planned database capacity.
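
As a hypothetical example, a database planned to grow to 1.4TB whose two most recent full, uncompressed backups are retained needs roughly 3 × 1.4TB ≈ 4.2TB of backup capacity: the two retained backups plus room to write the new one before the oldest is deleted.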

Solution portfolio

Now we have examined all of the requirements for creating a high-performance, highly resilient database solution that matches the customer’s capacity and performance requirements. But this case-by-case dimensioning is a time-consuming analysis that doesn’t scale well. Instead, we want to create and maintain a portfolio of standard database solutions that the sales teams can just pick from, so that all the calculations are performed once, in advance. This strategy also aligns with fully populating the disk arrays, since the array chassis is expensive compared to the disks themselves (especially the SAS and SATA disks). The choice of which hardware components to use becomes instead a choice the customer can make to get the best value for money.

Example: HP DL380P

Let’s apply all of these rules to an HP DL380P solution that has a 25-bay SFF (2.5″) backplane. The SFF backplane is needed because the database needs many small, fast disks, which means that all disks in the server have to be SFF, including the ones used for backup. The two disk sizes available in SFF are 300GB and 600GB.

The 25 disk bays are divided into three groups: one for the OS and database application, one for the database storage and one for the backup. The OS and application don’t need much space, so one mirrored pair of 300GB disks will suffice.

For the database storage, 300GB disks will be used, and for the backup, 600GB disks. So how do we divide the remaining 23 disk bays between database storage and backup? This is a process of trial and error. First, we allocate two bays for database hot spares. Then we pick a RAID strategy; RAID 1+0 (mirroring then striping) provides the best performance, but with the result that half the disks are used for redundancy. Hot spares for the backup storage would be overkill, and RAID 5 provides sufficient performance with a capacity loss of just one disk for parity.

Here is the math for the final configuration:

Database storage
  Raw disk capacity:     4800GB (16 x 300GB)
  Hot spares:            2
  Redundancy strategy:   RAID 1+0
  Usable capacity:       1470GB (7 x 300GB x 70%)
  Reported capacity:     1953GiB (7 x 279GiB)

Backup storage
  Raw disk capacity:     4200GB (7 x 600GB)
  Hot spares:            0
  Redundancy strategy:   RAID 5
  Usable capacity:       3600GB (6 x 600GB x 100%)
  Reported capacity:     3354GiB (6 x 559GiB)

Ratios
  Database usable/raw:   31% (1470GB/4800GB)
  Database usable/total: 16% (1470GB/(4800GB + 4200GB))
  Backup usable/raw:     85.7% (3600GB/4200GB)
  Backup/database:       2.45 (3600GB/1470GB)

From this example we can create a high-performance database configuration with a usable database capacity of 1.47TB. The ratio of backup to database storage is 2.45, which means there is space for one retained backup plus the creation of a new one. To allow two full backups to be retained, two more disk bays could be allocated to the backup storage, lowering the database capacity to 1.26TB. Alternatively, the marketed database capacity could simply be lowered to 1.2TB (3600GB/3), which may be more flexible.

The actual usable storage as a percentage of the total is also interesting: just 16%, or about 1.5TB, of the total 9TB of raw disk can be used to store application data.

Finally, two of these database configurations (a primary and secondary) are needed to provide redundancy at the database application level. This is for two reasons: 1) in case of database application failure (a failover) or 2) for maintenance operations on the primary (a switchover).