Monitoring API performance with AWS Athena

Deciding what technical improvements to make is always a challenge and one area that often requires attention is application performance. The de facto standard for delivering services is to used REST-based APIs and if you are using Amazon Web Services (AWS) then this will usually involves deploying an Elastic Load Balancer (ELB) in front of your backend application servers.

It is a good idea to take methodical (scientific) approach to fixing performance issues. So we want to look at the data and make good data-driven decisions about what problems to focus on. This article will describe how to get started with performance monitoring using AWS Athena.

TL;DR;

AWS Athena provides an SQL query interface to your ELB logs which allows advanced troubleshooting and pattern analysis:

Enable logging on the ELB.
Set a lifetime policy on the S3 bucket used for logging.
Let the logging run for a few days at least so you have some data to work with.
Set up the Athena database and table.
Craft your query in the Athena Query Editor.
Set up a scheduled job to get regular performance reports.

Costs

There are two costs associated with this solution:

S3 storage costs per TB stored log data, which it is wise to set a lifecycle rule on the log bucket.
https://aws.amazon.com/s3/pricing/
Athena costs per TB of scanned data, which can be controlled using the storage lifecycle rule already mentioned and by crafting good SQL queries.
https://aws.amazon.com/athena/pricing/

Step 1: Enable logging

By default logging is not enabled on the ELB. When it is enabled, the logs will be saved to an S3 bucket. Follow the instructions here:

https://docs.aws.amazon.com/elasticloadbalancing/latest/application/load-balancer-access-logs.html

Step 2: Set a lifetime policy on the logs

It might take some time to understand how much log history you need or can afford. But it is good to know how to set up and modify it. Follow the instructions here:

https://docs.aws.amazon.com/AmazonS3/latest/userguide/how-to-set-lifecycle-configuration-intro.html

Set the rule name to Delete_files_older_than_XX_days
Choose the rule scope to be Apply to all objects in the bucket
Set lifecycle rule actions to Expire current versions of objects
Set days after object creation to 90 (or whatever your needs are)

The lifecycle period can be changed anytime.

Step 3: Setup AWS Athena

Follow these instructions to setup the database and table which will be used to query the logs:
https://aws.amazon.com/premiumsupport/knowledge-center/athena-analyze-access-logs/

Create the table with partition projection, this avoids having to manually add the partitions as you go.
Change the CREATE TABLE query to name the table “alb_log”. This will save some headaches later on when copying and pasting queries from others. Note that you can recreate the table at any time without losing data or saved queries.
Both the LOCATION and storage.location.template parameters should be changed to match the location of your ELB logs in S3.
To avoid typos in the location name, navigate to the S3 bucket and drill down into the parent of the year level directories. Then click on Copy S3 URI button and paste this into both locations in the CREATE query.
Set the projection.year.range to the current year and next year, e.g. “2022,2023”.

When the table has been created, click on the vertical ellipsis (“…”) beside the table name and select Preview table. If everything is working, then the results will show the latest 10 queries made to the application server.

Step 4: Find performance issues

Here is an example of a query that will find the top 100 slowest queries in the last week:

SELECT client_ip, request_url, target_processing_time FROM alb_log
WHERE (parse_datetime(time,'yyyy-MM-dd''T''HH:mm:ss.SSSSSS''Z')
     BETWEEN DATE_ADD('day', -7, now()) AND now())
     AND (target_processing_time >= 5.0)
ORDER BY target_processing_time DESC
LIMIT 100;

The frequency of requests that took longer than 30 seconds in the last week:

SELECT count(*) AS cnt, request_url, ROUND(AVG(target_processing_time),0) AS avg_time FROM alb_log
WHERE (parse_datetime(time,'yyyy-MM-dd''T''HH:mm:ss.SSSSSS''Z')
     BETWEEN DATE_ADD('day', -7, now()) AND now())
     AND (target_processing_time >= 30.0)
GROUP BY request_url
ORDER BY cnt DESC
LIMIT 100;

This is just the start. Querying in SQL means that there is a big community than can help you craft good queries that will meet your needs.

Step 5: Stay informed

Ideally you want to be getting regular performance reports from your application server, and to do that you can set up a scheduled task to run any arbitrary query on the ELB logs and have it sent to the team. We could use an AWS Lambda function, but in this example I will set up a pipeline in Buddy to run once a week and publish the results via email.

First, give Buddy permission to access Athena. Buddy has its own AWS account and you should follow the instructions for adding Buddy’s account as a trusted entity to AWS. The Buddy account will have a specific role that can then have policies attached to it which decides what the role can do. In this case I will use the standard policies to allow Buddy to run Athena queries and retrieve the query results from S3:

AWSQuicksightAthenaAccess policy
AmazonS3ReadOnlyAccess policy

Note that the AWSQuicksightAthenaAccess policy expects the S3 bucket name to start with “aws-athena-query-results-“. If that is not the case the create a bucket with that name and update the workgroup to use it instead. The old bucket can then be deleted.

In Buddy, create a AWS CLI action that runs a query and then copies the CSV result file from the S3 bucket to the Buddy pipeline.

sql_query="SELECT client_ip, request_url, 
  target_processing_time FROM alb_log
  WHERE (parse_datetime(time,'yyyy-MM-dd''T''HH:mm:ss.SSSSSS''Z')
    BETWEEN DATE_ADD('day', -7, now()) AND now())
    AND (target_processing_time >= 5.0)
  ORDER BY target_processing_time DESC
  LIMIT 100;"

aws athena start-query-execution \
  --query-string "$sql_query" \
  --work-group "primary" \
  --query-execution-context "Database=alb_db" \
  > aws_out.json
	
execution_id=`jq -r .QueryExecutionId aws_out.json`

while true; do
  aws athena get-query-execution --query-execution-id $execution_id > aws_out.json
  execution_state=`jq -r .QueryExecution.Status.State aws_out.json`
  case "$execution_state" in
    "SUCCEEDED") break;;
    "FAILED") exit 1;;
    "RUNNING") sleep 10;;
  esac
done

csv_file=`jq -r .QueryExecution.ResultConfiguration.OutputLocation aws_out.json`
aws s3 cp $csv_file athena_query_report.csv

The results can also be retrieved directly as JSON if desired:

aws athena get-query-results --query-execution-id $execution_id > athena_query_report.json

Now we can use a second Buddy action to send the results for the team to analyse via email, Slack, etc.