[Amazon Athena] Athena Partition Projection์„ ์ด์šฉํ•˜์—ฌ ์ฟผ๋ฆฌ ์„ฑ๋Šฅ ๋†’์ด๊ธฐ

2024. 1. 21. 23:31ใ†AWS

๊ฐœ์š”

Athena๋ฅผ ์ด์šฉํ•˜์—ฌ S3 Bucket์— ์žˆ๋Š” ๋ฐ์ดํ„ฐ๋ฅผ ๊ฒ€์ƒ‰ํ•œ๋‹ค. S3 ๋ฒ„ํ‚ท์—๋Š” ์‹ค์‹œ๊ฐ„์œผ๋กœ ๋ฐ์ดํ„ฐ๊ฐ€ ์ €์žฅ๋˜๊ณ  ์žˆ์œผ๋ฉฐ, Athena์—์„œ ์ฟผ๋ฆฌ ํ–ˆ์„ ๋•Œ ๋” ๋น ๋ฅธ ๊ฒ€์ƒ‰ ๊ฒฐ๊ณผ๋ฅผ ์–ป๊ธฐ ์œ„ํ•ด ๋ฐ์ดํ„ฐ ํŒŒํ‹ฐ์…”๋‹์ด ํ•„์š”ํ•˜๋‹ค. Athena Partition projection์„ ์ด์šฉํ•ด ๋ณด์ž.

 

 

๋“ค์–ด๊ฐ€๊ธฐ ์ „

ํŒŒํ‹ฐ์…”๋‹์ด๋ž€?

ํฐ Table์ด๋‚˜ ์ธ๋ฑ์Šค๋ฅผ ๊ด€๋ฆฌํ•˜๊ธฐ ์‰ฌ์šด ๋‹จ์œ„๋กœ ๋ถ„๋ฆฌํ•˜๋Š” ๋ฐฉ๋ฒ•์„ ์˜๋ฏธํ•œ๋‹ค. ๋ฐ์ดํ„ฐ๋ฅผ ๋ถ„ํ• ํ•˜๋ฉด ๊ฐ ์ฟผ๋ฆฌ๊ฐ€ ์Šค์บ”ํ•˜๋Š” ๋ฐ์ดํ„ฐ์˜ ์–‘์„ ์ œํ•œํ•˜์—ฌ ์„ฑ๋Šฅ์„ ํ–ฅ์ƒํ•˜๊ณ  ๋น„์šฉ์„ ์ ˆ๊ฐํ•  ์ˆ˜ ์žˆ๋‹ค. ํŒŒํ‹ฐ์…”๋‹์— ๋Œ€ํ•œ ์ƒ์„ธ ๋‚ด์šฉ์€ ์—ฌ๊ธฐ ์ฐธ์กฐ.

 

Amazon Athena๋ž€?

Amazon S3์— ์žˆ๋Š” ๋ฐ์ดํ„ฐ๋ฅผ ์ง์ ‘ ๊ฐ„ํŽธํ•˜๊ฒŒ ๋ถ„์„ํ•  ์ˆ˜ ์žˆ๋Š” ๋Œ€ํ™”ํ˜• ์ฟผ๋ฆฌ ์„œ๋น„์Šค. ๋ฐ์ดํ„ฐ๋ฅผ ํฌ๋งทํ•˜๊ฑฐ๋‚˜ ์ธํ”„๋ผ ๋˜๋Š” ํด๋Ÿฌ์Šคํ„ฐ๋ฅผ ๊ด€๋ฆฌํ•  ํ•„์š”๊ฐ€ ์—†๋‹ค. ๋ฐ์ดํ„ฐ์— ๋Œ€ํ•œ ํ…Œ์ด๋ธ”์„ ์ •์˜ํ•˜๊ณ  ํ‘œ์ค€ SQL์„ ์‚ฌ์šฉํ•˜์—ฌ ์ฟผ๋ฆฌ๋ฅผ ์‹œ์ž‘ํ•˜๊ธฐ๋งŒ ํ•˜๋ฉด ๋œ๋‹ค.

Athena๋Š” Amazon S3์— ์ €์žฅ๋œ ๋น„์ •ํ˜•, ๋ฐ˜์ •ํ˜• ๋ฐ ์ •ํ˜• ๋ฐ์ดํ„ฐ๋ฅผ ๋ถ„์„ํ•˜๋Š” ๋ฐ ๋„์›€์ด ๋œ๋‹ค : CSV, JSON ๋˜๋Š” ์ปฌ๋Ÿผ ๋ฐฉ์‹ ๋ฐ์ดํ„ฐ ํ˜•์‹(ex: Apache Parquet ๋ฐ Apache ORC) ๋“ฑ.

Athena๋Š” S3์—์„œ ๋ฐ์ดํ„ฐ์— ๋Œ€ํ•œ ์ง€์†์  ๋ฉ”ํƒ€๋ฐ์ดํ„ฐ ์Šคํ† ์–ด๋ฅผ ์ œ๊ณตํ•˜๋Š” AWS Glue Data Catalog์™€ ํ†ตํ•ฉ๋œ๋‹ค. Athena๋Š” AWS Glue Data Catalog๋ฅผ ์‚ฌ์šฉํ•ด AWS ๊ณ„์ •์˜ Amazon S3 ๋ฐ์ดํ„ฐ์— ๋Œ€ํ•œ ํ…Œ์ด๋ธ” ๋ฉ”ํƒ€๋ฐ์ดํ„ฐ๋ฅผ ์ €์žฅํ•˜๊ณ  ๊ฒ€์ƒ‰ํ•œ๋‹ค. Athena ์ฟผ๋ฆฌ ์—”์ง„์€ ํ…Œ์ด๋ธ” ๋ฉ”ํƒ€๋ฐ์ดํ„ฐ๋ฅผ ํ†ตํ•ด ์ฟผ๋ฆฌ ํ•˜๋ ค๋Š” ๋ฐ์ดํ„ฐ๋ฅผ ์ฐพ๊ณ , ์ฝ๊ณ , ์ฒ˜๋ฆฌํ•  ๋ฐฉ๋ฒ•์„ ํŒŒ์•…ํ•œ๋‹ค.

 

AWS Glue๋ž€?

AWS Glue๋Š” AWS์˜ ์™„์ „ ๊ด€๋ฆฌํ˜• ์ถ”์ถœ(Extract), ๋ณ€ํ™˜(Transform), ๋กœ๋“œ(Load) ์„œ๋น„์Šค์ด๋‹ค. ์ฃผ์š” ๊ธฐ๋Šฅ ์ค‘ ํ•˜๋‚˜๋Š” ๋ฐ์ดํ„ฐ๋ฅผ ๋ถ„์„ํ•˜๊ณ  ๋ถ„๋ฅ˜ํ•˜๋Š” ๊ฒƒ์ด๋‹ค. Glue Data Catalog๋Š” ๋ฐ์ดํ„ฐ ์†Œ์Šค(S3)์™€ ๋Œ€์ƒ ๊ฐ„์˜ ๋ฉ”ํƒ€๋ฐ์ดํ„ฐ๋ฅผ ๊ด€๋ฆฌํ•˜๊ธฐ ์œ„ํ•œ ๋ฉ”ํƒ€๋ฐ์ดํ„ฐ ์ €์žฅ์†Œ์ด๋‹ค. ํ…Œ์ด๋ธ” ์ •์˜, job ์ •์˜์™€ ๋‹ค๋ฅธ ๊ด€๋ฆฌ ์ •๋ณด๋ฅผ ํฌํ•จํ•˜๋ฉฐ, ๊ฐ AWS account๋Š” ๋ฆฌ์ „๋‹น ํ•˜๋‚˜์˜ AWS Glue Data Catalog๋ฅผ ๊ฐ€์งˆ ์ˆ˜ ์žˆ๋‹ค.

๋”๋ณด๊ธฐ

Glue data catalog

“ํ…Œ์ด๋ธ”์˜ ๋ฉ”ํƒ€์ •๋ณด๋ฅผ ์ €์žฅํ•˜๋Š” ์žฅ์†Œ๋กœ๋Š” Hive metastore์—์„œ ํŒŒ์ƒ๋œ Glue data catalog๋ฅผ ์‚ฌ์šฉํ•˜๊ณ  ์žˆ์Šต๋‹ˆ๋‹ค. schema-on-write์ธ ์ผ๋ฐ˜์ ์ธ ๋ฐ์ดํ„ฐ๋ฒ ์ด์Šค์™€๋Š” ๋‹ค๋ฅด๊ฒŒ schema-on-read์ธ ๋ฐ์ดํ„ฐ ์นดํƒˆ๋กœ๊ทธ์—๋Š” ํ…Œ์ด๋ธ” ์ •๋ณด์™€ ์‹ค์ œ ํ…Œ์ด๋ธ”์˜ ๋ฐ์ดํ„ฐ๊ฐ€ ์ €์žฅ๋œ S3 ๊ฒฝ๋กœ๊ฐ€ ํฌํ•จ๋˜์–ด ์žˆ์Šต๋‹ˆ๋‹ค. ๋•๋ถ„์— ๋ฐ์ดํ„ฐ๋ฅผ ์‚ฌ์šฉํ•˜๊ธฐ ์œ„ํ•ด ์ „์ฒด ๊ฒฝ๋กœ๋ฅผ ์ง์ ‘ ์ž…๋ ฅํ•˜์ง€ ์•Š๊ณ ๋„ ๋ฐ์ดํ„ฐ๋ฒ ์ด์Šค ์ด๋ฆ„๊ณผ ํ…Œ์ด๋ธ” ์ด๋ฆ„์œผ๋กœ ํ…Œ์ด๋ธ”์„ ์ฐพ์•„ ์ฟผ๋ฆฌํ•  ์ˆ˜ ์žˆ์Šต๋‹ˆ๋‹ค.” ์ถœ์ฒ˜ 

 

Athena์—์„œ ๋ฐ์ดํ„ฐ ๋ถ„ํ•  (partitioning)

๋ฐ์ดํ„ฐ๋ฅผ ๋ถ„ํ• (partition)ํ•˜์—ฌ Athena ์ฟผ๋ฆฌ๊ฐ€ ์Šค์บ”ํ•˜๋Š” ๋ฐ์ดํ„ฐ์˜ ์–‘์„ ์ œํ•œํ•˜์—ฌ ์ฟผ๋ฆฌ ์„ฑ๋Šฅ์„ ํ–ฅ์ƒํ•˜๊ณ  ๋น„์šฉ์„ ์ ˆ๊ฐํ•  ์ˆ˜ ์žˆ๋‹ค. Apache Hive ์Šคํƒ€์ผ์˜ partition์„ ์‚ฌ์šฉํ•  ์ˆ˜๋„ ์žˆ๊ณ , ๋น„ Hive ์Šคํƒ€์ผ์˜ ํŒŒํ‹ฐ์…˜ ์ฒด๊ณ„๋ฅผ ์‚ฌ์šฉํ•  ์ˆ˜๋„ ์žˆ๋‹ค. Apache Hive์™€ ํ˜ธํ™˜๋˜์ง€ ์•Š๋Š” ๋ฐ์ดํ„ฐ์˜ ๊ฒฝ์šฐ Athena์—์„œ๋Š” ALTER TABLE ADD PARTITION์„(๋ฅผ) ์‚ฌ์šฉํ•˜์—ฌ ํŒŒํ‹ฐ์…˜์„ ์ˆ˜๋™์œผ๋กœ ์ถ”๊ฐ€ํ•  ์ˆ˜ ์žˆ๋‹ค. (์ฐธ๊ณ  : Athena์—์„œ ๋ฐ์ดํ„ฐ ๋ถ„ํ• )

ex) CloudTrail ๋กœ๊ทธ ๋ฐ Kinesis Data Firehose ์ „์†ก ์ŠคํŠธ๋ฆผ: data/2023/07/19/4 ge87b2.json๊ณผ ๊ฐ™์€ ๋‚ ์งœ ๋ถ€๋ถ„์— ๋Œ€ํ•ด ๋ณ„๋„์˜ ๊ฒฝ๋กœ ๊ตฌ์„ฑ ์š”์†Œ๋ฅผ ์‚ฌ์šฉํ•œ๋‹ค.

ALTER TABLE {database_name}.{table_name}
ADD IF NOT EXISTS PARTITION (partition_col_name = partition_col_value [,...])
LOCATION 's3://{bucket}/{partition_location}'".format(
            database_name=AuditingGlueDatabaseName,
            table_name=GlueTable,
            account=account,
            region=region,
            date=date,
            bucket=bucket,
            partition_location=partition_location
        );

*Athena์—์„œ ์‚ฌ์šฉํ•  ํŒŒํ‹ฐ์…˜์˜ ์œ„์น˜(location)๋Š” s3 ํ”„๋กœํ† ์ฝœ์„ ์‚ฌ์šฉํ•ด์•ผ ํ•œ๋‹ค.

 

 

๐Ÿ”ซ Partition Projection

Partition Projection์ด๋ž€?

Athena์—์„œ partition projection์„ ์‚ฌ์šฉํ•˜์—ฌ partition๋œ ํ…Œ์ด๋ธ”์˜ ์ฟผ๋ฆฌ ์ฒ˜๋ฆฌ ์†๋„๋ฅผ ๋†’์ด๊ณ  ํŒŒํ‹ฐ์…˜ ๊ด€๋ฆฌ๋ฅผ ์ž๋™ํ™”ํ•  ์ˆ˜ ์žˆ๋‹ค. Glue Data Catalog์—์„œ ๋ฉ”ํƒ€๋ฐ์ดํ„ฐ๋ฅผ ๊ฒ€์ƒ‰ํ•  ํ•„์š” ์—†์ด Athena์—์„œ ํ•„์š”ํ•œ ํŒŒํ‹ฐ์…˜ ์ •๋ณด๋ฅผ ‘projection’ํ•˜๊ฑฐ๋‚˜ ๊ฒฐ์ •ํ•  ์ˆ˜ ์žˆ์œผ๋ฏ€๋กœ ์ฟผ๋ฆฌ ์‹คํ–‰ ์‹œ๊ฐ„์„ ํฌ๊ฒŒ ์ค„์ผ ์ˆ˜ ์žˆ๋‹ค. ํŒŒํ‹ฐ์…˜ ํ”„๋กœ์ ์…˜์—์„œ Athena๋Š” AWS Glue ํ…Œ์ด๋ธ”์˜ ํ…Œ์ด๋ธ” ์†์„ฑ์—์„œ ํŒŒํ‹ฐ์…˜ ๊ฐ’ ๋ฐ ์œ„์น˜๋ฅผ ๊ณ„์‚ฐํ•œ๋‹ค.

 

์ ์šฉ ๋ฐฉ๋ฒ•

VPC Flowlogs๊ฐ€ ๊ธฐ์กด S3์— ๋กœ๊น…๋˜๊ณ  ์žˆ๋‹ค. ์ด flowlogs ๋ฐ์ดํ„ฐ๋ฅผ Athena์—์„œ ํšจ์œจ์ ์œผ๋กœ ์ฟผ๋ฆฌํ•  ์ˆ˜ ์žˆ๋„๋ก partition projection์„ ์ ์šฉํ•ด๋ณด์ž.

 

VPC Flowlogs๊ฐ€ ์ €์žฅ๋˜๋Š” S3 ๋ฒ„ํ‚ท์„ Data Store๋กœ Glue table์„ ์ƒ์„ฑํ•œ๋‹ค. Schema์—๋Š” ์‹ค์ œ๋กœ ๋ฐ์ดํ„ฐ ์†์„ฑ ์ด๋ฆ„๊ณผ data type์„ ์ง€์ •ํ•œ๋‹ค. VPC Flowlogs ๊ธฐ๋ณธ ์„ค์ •์˜ ๊ฒฝ์šฐ ํ•˜๊ธฐ์™€ ๊ฐ™์€ ๊ฐ’๋“ค์ด ๋ฐ์ดํ„ฐ์— ํฌํ•จ๋˜๊ณ  ์ด ์ˆœ์„œ์— ๋”ฐ๋ผ Schema๋ฅผ ์ง€์ •ํ•˜๋ฉด ๋œ๋‹ค.

${version} ${account-id} ${interface-id} ${srcaddr} ${dstaddr} ${srcport} ${dstport} ${protocol} ${packets} ${bytes} ${start} ${end} ${action} ${log-status}

 

Glue Table์˜ advanced properties๋ฅผ ์•„๋ž˜์™€ ๊ฐ™์ด ์ ์šฉํ•œ๋‹ค.

FlowLogsTable:
    Properties:
      CatalogId: !Ref AWS::AccountId
      DatabaseName: !Ref AuditingGlueDatabase
      TableInput:
        Description: !Sub
          - "FlowLogs table for ${S3BucketName} bucket"
          - S3BucketName: !Ref FlowLogsBucketName
        Name: !Ref FlowLogsTableName
        StorageDescriptor:
          Columns:
            - Name: version
              Type: int
            - Name: sourceaccount
              Type: string
            - Name: interfaceid
              Type: string
            - Name: sourceaddress
              Type: string
            - Name: destinationaddress
              Type: string
            - Name: sourceport
              Type: int
            - Name: destinationport
              Type: int
            - Name: protocol
              Type: bigint
            - Name: numpackets
              Type: bigint
            - Name: numbytes
              Type: bigint
            - Name: starttime
              Type: bigint
            - Name: endtime
              Type: bigint
            - Name: action
              Type: string
            - Name: logstatus
              Type: string
          InputFormat: org.apache.hadoop.mapred.TextInputFormat
          OutputFormat: org.apache.hadoop.hive.ql.io.HiveIgnoreKeyTextOutputFormat
          SerdeInfo:
            SerializationLibrary: org.apache.hadoop.hive.serde2.lazy.LazySimpleSerDe
            Parameters:
              field.delim: " "
              serialization.format: " "
          Location: !Sub
            - "s3://${S3BucketName}/"
            - S3BucketName: !Ref FlowLogsBucketName
          # NumberOfBuckets: -1
        PartitionKeys:
          - Name: account
            Type: string
          - Name: region
            Type: string
          - Name: date
            Type: string
        Parameters:
          EXTERNAL: "TRUE"
          skip.header.line.count: "1"
          projection.enabled: true
          projection.account.type: enum
          projection.account.values: account-1, account-2, account-3
          projection.region.type: enum
          projection.region.values: us-east-1,us-east-2,us-west-1,us-west-2,ap-south-1,ap-northeast-1,ap-northeast-2,ap-northeast-3,ap-southeast-1,ap-southeast-2,ca-central-1,eu-central-1,eu-west-1,eu-west-2,eu-west-3,eu-north-1,sa-east-1
          projection.date.type: date
          projection.date.format: yyyy/MM/dd
          projection.date.range: 2023/05/31,NOW
          projection.date.interval: 1
          projection.date.interval.unit: DAYS
          storage.location.template: "s3://VPCFlowLogs-S3-bucket-name/AWSLogs/${account}/vpcflowlogs/${region}/${date}/"
        TableType: EXTERNAL_TABLE
    Type: AWS::Glue::Table

 

ํ•„์ž๋Š” ์œ„์™€ ๊ฐ™์ด CloudFormation์œผ๋กœ ๋ฐฐํฌํ•˜์—ฌ ์•„๋ž˜์™€ ๊ฐ™์ด ์ ์šฉ๋˜์—ˆ๋‹ค.

S3 ๊ฐ์ฒด ๊ฒฝ๋กœ ์ค‘ region, account, date์— projection์„ ํ•˜๊ณ  ๊ฐ type๊ณผ ๋“ค์–ด์˜ฌ ์ˆ˜ ์žˆ๋Š” ๊ฐ’, ํ˜•์‹์„ ๋ช…์‹œํ•ด์ฃผ๋ฉด ์†์‰ฝ๊ฒŒ ํŒŒํ‹ฐ์…”๋‹์ด ์ž๋™ํ™”๋œ๋‹ค.

 

ํŒŒํ‹ฐ์…”๋‹ ์ž๋™ํ™” ์™„๋ฃŒ ~!

 

  1. AWS Docs: Athena์—์„œ ๋ฐ์ดํ„ฐ ๋ถ„ํ• 
  2. AWS Docs: Amazon Athena๋ฅผ ์‚ฌ์šฉํ•œ ํŒŒํ‹ฐ์…˜ ํ”„๋กœ์ ์…˜
  3. AWS Docs: ํŒŒํ‹ฐ์…˜ ํ”„๋กœ์ ์…˜์„ ์‚ฌ์šฉํ•˜์—ฌ Amazon VPC ํ๋ฆ„ ๋กœ๊ทธ์— ๋Œ€ํ•œ ํ…Œ์ด๋ธ” ์ƒ์„ฑ ๋ฐ ์ฟผ๋ฆฌ
  4. ๊ธฐ์ˆ  ๋ธ”๋กœ๊ทธ: AWS ATHENA ๋กœ VPC FLOW LOG ๋ถ„์„ํ•˜๊ธฐ – 1
  5. ๊ธฐ์ˆ  ๋ธ”๋กœ๊ทธ: AWS ATHENA ๋กœ VPC FLOW LOG ๋ถ„์„ํ•˜๊ธฐ – 2
  6. ๊ธฐ์ˆ  ๋ธ”๋กœ๊ทธ: [AWS] ๐Ÿ“š Athena - Partition Projection ํŒŒํ‹ฐ์…”๋‹ ์ž๋™ํ™” ํ•˜๊ธฐ
  7. AWS Docs: CloudFormation User Guide - AWS::Glue::Table StorageDescriptor
  8. AWS Docs: Amazon Athena๋ž€ ๋ฌด์—‡์ธ๊ฐ€์š”?