Athena In A Nutshell
Background
We talked previously about BI (business intelligence) and data warehousing. One thing we haven't touched on is the AWS EMR (Elastic MapReduce) service. This service uses the open-source Hadoop framework and clusters. A Hadoop cluster is a special type of computational cluster designed specifically for storing and analyzing huge amounts of unstructured data in a distributed computing environment. You can also run other popular distributed frameworks such as Apache Spark, HBase, Presto, and Flink on EMR, and interact with data in other AWS data stores such as Amazon S3 and Amazon DynamoDB.

Architecture
It takes some effort to manage EMR or Hadoop clusters. What if we could store our data directly on S3 (the AWS object storage service) and analyze it in place?
This is exactly what Athena does: it is an AWS-managed data analysis service for analyzing data stored on S3. Note that Athena is not a database service; it can only query data stored directly on S3.

Data Queries
Athena is an interactive data query service that uses standard SQL as its query language. It has a serverless architecture and is fully managed by AWS.
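As a minimal sketch of submitting a query programmatically (the region, bucket, database, and table names below are hypothetical placeholders), the AWS SDK for Python can be used:

    import boto3

    # Region, bucket, and database names are hypothetical placeholders.
    athena = boto3.client("athena", region_name="us-east-1")

    def run_query(sql):
        """Submit a SQL string to Athena; returns the query execution id."""
        response = athena.start_query_execution(
            QueryString=sql,
            QueryExecutionContext={"Database": "my_example_db"},
            ResultConfiguration={
                "OutputLocation": "s3://my-example-bucket/athena-results/"
            },
        )
        return response["QueryExecutionId"]

    # Standard SQL; no servers or clusters to provision.
    query_id = run_query(
        "SELECT status_code, COUNT(*) FROM weblogs GROUP BY status_code")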

Performance & Availability
Athena draws on a warm pool of compute resources with zero spin-up time, so queries can be run immediately. Queries are parallelized for performance, and compressing your data reduces the amount scanned and therefore the cost. DDL queries and failed queries are not charged for.
All Athena resource pools are distributed across Availability Zones, making the service highly available; the source data on S3 is highly available as well. Athena only reads from S3, so the data on S3 is never modified by Athena.

Data
Athena queries are run against tables created within the service; the Data Definition Language (DDL) is based on Hive.
All tables in Athena are defined as external tables, so deleting a table does not delete the data stored on S3.
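To make the external-table behavior concrete, here is a hedged sketch of a Hive-based DDL statement (the table, columns, and S3 path are hypothetical), submitted with the run_query helper from the earlier sketch:

    # Hive-based DDL: an EXTERNAL table pointing at raw CSV files on S3.
    # Dropping this table later removes only the metadata; the S3 objects
    # are left untouched.
    ddl = """
    CREATE EXTERNAL TABLE IF NOT EXISTS weblogs (
        request_time string,
        client_ip    string,
        url          string,
        status_code  int
    )
    ROW FORMAT DELIMITED FIELDS TERMINATED BY ','
    LOCATION 's3://my-example-bucket/weblogs/'
    """
    run_query(ddl)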
The following types of data can be queried in raw format:
text, CSV, JSON, weblogs, AWS service logs, TSV files, and compressed files.
For best performance, data can be converted to columnar formats such as ORC or Parquet.
https://orc.apache.org/
https://parquet.apache.org/
Avro is currently supported only in the US East (Ohio) region.
https://avro.apache.org/
An ETL process is not required for the above types of data.
ETL:
Extract, transform, load (ETL) is the general procedure of copying data from one or more sources into a destination system which represents the data differently from the source(s).
Athena streams data directly from S3 for querying.
The query execution engine is Presto, which uses ANSI SQL.
https://prestosql.io/
Complex joins, nested queries, and window functions are supported, as are complex data types such as arrays and structs. Data can be partitioned by any key.
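As an illustrative sketch of these features (reusing the hypothetical weblogs table from the earlier examples, and assuming it is partitioned by a dt key), a windowed query pruned to a single partition could look like this:

    # A window function ranks URLs by hit count within each status code;
    # the WHERE clause prunes the scan to a single partition (by day).
    sql = """
    SELECT url,
           status_code,
           COUNT(*) AS hits,
           RANK() OVER (PARTITION BY status_code
                        ORDER BY COUNT(*) DESC) AS rnk
    FROM weblogs
    WHERE dt = '2019-01-01'
    GROUP BY url, status_code
    """
    run_query(sql)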
Output:
Results are streamed to the console or to S3, and a CSV of the results can be downloaded to the desktop. Athena also has a JDBC driver, which enables external SQL clients to connect to Athena for querying; the driver also supports programmatic access via AWS IAM access keys.
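Continuing the earlier boto3 sketch, here is a hedged example of retrieving results programmatically (polling until the query completes, then reading the rows):

    import time

    def fetch_results(query_id):
        """Poll until the query finishes, then return the result rows."""
        while True:
            status = athena.get_query_execution(QueryExecutionId=query_id)
            state = status["QueryExecution"]["Status"]["State"]
            if state in ("SUCCEEDED", "FAILED", "CANCELLED"):
                break
            time.sleep(1)
        if state != "SUCCEEDED":
            raise RuntimeError("Query ended in state %s" % state)
        # A CSV of the results is also written to the S3 output location.
        result = athena.get_query_results(QueryExecutionId=query_id)
        return result["ResultSet"]["Rows"]

    for row in fetch_results(query_id):
        print([col.get("VarCharValue") for col in row["Data"]])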
Athena can be integrated with Amazon QuickSight, the AWS visualization tool, where data is represented visually through dashboards.
The Athena catalogue consists of tables; table definitions, column names, and partitions are defined via metadata. When a query is run, the data is streamed from S3 into these tables for manipulation. The metadata is Hive metastore compatible.
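The catalogue can also be inspected programmatically. As a hedged sketch (the database name is a placeholder, and "AwsDataCatalog" is assumed to be the default catalog name):

    # List tables in the catalogue: names, columns, and partition keys
    # are all described via metadata.
    paginator = athena.get_paginator("list_table_metadata")
    pages = paginator.paginate(
        CatalogName="AwsDataCatalog", DatabaseName="my_example_db")
    for page in pages:
        for table in page["TableMetadataList"]:
            print(table["Name"],
                  [column["Name"] for column in table.get("Columns", [])])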
PySpark code can be used to convert data to Parquet or ORC via EMR; a minimal sketch follows the link below.
Sample code is available on GitHub:
https://github.com/aws-samples/aws-big-data-blog/tree/master/aws-blog-spark-parquet-conversion
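In the same spirit, here is a minimal PySpark sketch of such a conversion (the S3 paths and the dt column are hypothetical, and this is a simplified stand-in for the linked sample rather than a copy of it):

    from pyspark.sql import SparkSession

    # Runs on an EMR cluster; the S3 paths are hypothetical placeholders.
    spark = SparkSession.builder.appName("csv-to-parquet").getOrCreate()

    # Read the raw CSV files from S3 and rewrite them as Parquet,
    # partitioned by day, so that Athena scans less data per query.
    df = spark.read.option("header", "true").csv(
        "s3://my-example-bucket/weblogs/")
    df.write.mode("overwrite").partitionBy("dt").parquet(
        "s3://my-example-bucket/weblogs-parquet/")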