Notes on Accessing AWS Resources with Boto3 (1)

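Recently at work I needed to query and cross-analyze some resources on AWS. The scenarios were all fairly simple, but this kind of semi-mechanical work is better handed to Python. The SDK that AWS provides for Python is called boto3, so I created a Python project, chose a venv as the interpreter, and installed boto3 into it; after that, the coding could begin. There were a few pitfalls along the way, which I am recording here for future reference.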

Query AWS CloudWatch

Search CloudWatch for log records that match a given query.

import time

import boto3


def query_cloudwatch_with_condition(log_group, query, start_time, end_time):
    r"""
    Search CloudWatch logs by some conditions.
    :param log_group: eg. '/aws/some_log_group'
    :param query: eg. f"fields @timestamp, @message \
                            | sort @timestamp desc \
                            | filter @message like /(?i)(some_filter)/ \
                            | filter @message like /Reason:\sError:/ \
                            | limit 10 \
                            | display @message"
    :param start_time: eg. int((datetime.today() - timedelta(days=5)).timestamp())
    :param end_time: eg. int(datetime.now().timestamp())
    :return: log message string.
    """
    cw_client = boto3.client('logs')
    start_query_response = cw_client.start_query(
        logGroupName=log_group,
        startTime=start_time,
        endTime=end_time,
        queryString=query,
    )
    query_id = start_query_response['queryId']
    response = None
    # NOTE: Must wait for the query to complete; it can be 'Scheduled' before it is 'Running'.
    while response is None or response['status'] in ('Scheduled', 'Running'):
        print('Waiting for query to complete ...')
        time.sleep(1)
        response = cw_client.get_query_results(queryId=query_id)
    issue_detail = ''
    # NOTE: In my situation, we only care about the first message because we expect all logs to be the same.
    # Guard against an empty result set before indexing the first row.
    if response['results']:
        for item in response['results'][0]:
            if item['field'] == '@message':
                issue_detail = item['value']
                break
    return issue_detail
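
The polling loop above waits indefinitely and treats any status other than a running one as success. A minimal sketch of a more defensive variant follows, assuming the caller passes in a client created with boto3.client('logs'). The helper names wait_for_query and first_message, and the timeout and poll_interval parameters, are hypothetical and not part of boto3.

```python
import time


def wait_for_query(cw_client, query_id, timeout=60.0, poll_interval=1.0):
    """Poll get_query_results until the query leaves the Scheduled/Running states.

    cw_client is assumed to be a boto3 CloudWatch Logs client.
    timeout and poll_interval are hypothetical knobs, both in seconds.
    """
    deadline = time.monotonic() + timeout
    while True:
        response = cw_client.get_query_results(queryId=query_id)
        status = response['status']
        if status == 'Complete':
            return response
        if status not in ('Scheduled', 'Running'):
            # Failed, Cancelled, Timeout or Unknown: surface it to the caller.
            raise RuntimeError(f'Query {query_id} ended with status {status}')
        if time.monotonic() >= deadline:
            raise TimeoutError(f'Query {query_id} still {status} after {timeout}s')
        time.sleep(poll_interval)


def first_message(response):
    """Return the @message value of the first result row, or '' if there is none."""
    for row in response.get('results', [])[:1]:
        for item in row:
            if item['field'] == '@message':
                return item['value']
    return ''
```

With these two helpers, the tail of the function would reduce to first_message(wait_for_query(cw_client, query_id)), and a failed or cancelled query raises instead of silently returning an empty string.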

Query DynamoDB

import boto3
from boto3.dynamodb.conditions import Key


def query_dynamodb_with_condition(key_condition_exp):
    """
    Query DynamoDB with a key condition expression (Query, not Scan).
    :param key_condition_exp: eg. Key('id').eq(certain_id) & Key('sk').begins_with('example::')
    :return: query results list
    """
    dynamodb = boto3.resource('dynamodb')
    table = dynamodb.Table('some-dynamodb-name')
    response = table.query(KeyConditionExpression=key_condition_exp)
    items = response['Items']
    # Filter items here if there are further conditions.
    for item in items:
        pass
    return items

Scan DynamoDB

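One pitfall when scanning DynamoDB is that a single Scan call has an upper limit: it reads at most 1 MB of data and then returns a LastEvaluatedKey. To perform a full scan, the code has to keep calling Scan until that key is no longer present.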

import boto3
from boto3.dynamodb.conditions import Attr


def scan_dynamodb_with_condition(filter_condition_exp):
    """
    Full scan of DynamoDB with a filter expression.
    :param filter_condition_exp: eg. Attr('sk').eq('my_sk') & Attr('name').begins_with('Jone') & Attr('isDeleted').eq(False)
    :return: scan results list
    """
    dynamodb = boto3.resource('dynamodb')
    table = dynamodb.Table('some-dynamo-table')
    response = table.scan(FilterExpression=filter_condition_exp)
    # Loop until no LastEvaluatedKey is returned, i.e. the full table has been scanned.
    results = response['Items']
    index = 1
    while 'LastEvaluatedKey' in response:
        print(f'scanning....{index}')
        index += 1
        response = table.scan(
            ExclusiveStartKey=response['LastEvaluatedKey'],
            FilterExpression=filter_condition_exp)
        results.extend(response['Items'])
        print(len(results))
    return results
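
The same LastEvaluatedKey loop applies to Query as well, because a single Query call is also capped at 1 MB of data. A minimal sketch of a shared generator follows, assuming operation is a bound method such as table.scan or table.query. The helper name iter_dynamodb_items is hypothetical.

```python
def iter_dynamodb_items(operation, **kwargs):
    """Yield every item from a paginated DynamoDB Scan or Query.

    operation is assumed to be a bound boto3 Table method (table.scan or
    table.query); kwargs are passed through on every call.
    """
    while True:
        response = operation(**kwargs)
        yield from response.get('Items', [])
        last_key = response.get('LastEvaluatedKey')
        if not last_key:
            break
        # Resume the next page right after the last key that was evaluated.
        kwargs['ExclusiveStartKey'] = last_key
```

For example, list(iter_dynamodb_items(table.scan, FilterExpression=filter_condition_exp)) would collect the full scan, and the same call with table.query and KeyConditionExpression would collect every page of a query.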

List S3 objects and read contents

Listing all objects under an S3 path has a similar pitfall: a single list_objects_v2 call returns at most 1,000 objects, so a full listing also needs special handling.

import json

import boto3


def get_all_s3_objects(s3, **base_kwargs):
    """
    Private method to list all files under a path.
    :param s3: s3 client using boto3.client('s3')
    :param base_kwargs: list_objects_v2 args, eg. Bucket and Prefix
    :return: yield each object's metadata dict (including 'Key') to the caller
    """
    continuation_token = None
    while True:
        list_kwargs = dict(MaxKeys=1000, **base_kwargs)
        if continuation_token:
            list_kwargs['ContinuationToken'] = continuation_token
        response = s3.list_objects_v2(**list_kwargs)
        yield from response.get('Contents', [])
        if not response.get('IsTruncated'):  # At the end of the list?
            break
        continuation_token = response.get('NextContinuationToken')


def main():
    bucket_name = 'my-bucket-name'
    s3_client = boto3.client('s3')
    # Use the prefix to define the folder to search.
    prefix = 'this-is-some-path-without-prefix-and-postfix-slash'
    file_paths = []
    for file in get_all_s3_objects(s3_client, Bucket=bucket_name, Prefix=prefix):
        file_paths.append(file['Key'])
    print(f'length of file_paths: {len(file_paths)}')
    with open('./file_paths_results.json', 'w') as f:
        f.write(json.dumps(file_paths))
        print('finished writing file paths into json file')
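
boto3 also ships a built-in paginator for list_objects_v2 that handles the continuation token internally. A minimal sketch follows, assuming s3_client comes from boto3.client('s3'). The helper name iter_s3_keys is hypothetical.

```python
def iter_s3_keys(s3_client, bucket, prefix=''):
    """Yield every object key under prefix using boto3's built-in paginator."""
    paginator = s3_client.get_paginator('list_objects_v2')
    for page in paginator.paginate(Bucket=bucket, Prefix=prefix):
        # 'Contents' is absent when a page (or the whole listing) is empty.
        for obj in page.get('Contents', []):
            yield obj['Key']
```

With this helper, the loop in main() could become file_paths = list(iter_s3_keys(s3_client, bucket_name, prefix)). The hand-written token loop remains useful when the pagination state needs to be inspected or persisted between calls.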

Read S3 file contents

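While reading the contents of S3 files, we ran into a problem: the content in the file Body (messages from AWS SQS) could not be converted to JSON correctly. Because time was short, I did not dig into it deeply; I simply replaced the substrings that were not valid JSON and extracted the content that way. How to load this kind of file content into JSON properly is something to look into later.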

import json
import re
from pprint import pprint

import boto3
from dynamodb_json import json_util


def read_file_contents(s3client, bucket, path):
    """
    Read a file's content by its key (file path).
    :param s3client: eg. boto3.client('s3')
    :param bucket: eg. 'some-bucket-name'
    :param path: eg. 'some-path-to-my-file-with-postfix-no-slash-prefix'
    :return: parsed file contents
    """
    file_obj = s3client.get_object(
        Bucket=bucket,
        Key=path)
    # Read the whole object body into file_data.
    file_data = file_obj['Body'].read()
    # TODO: we did some ugly string replace here.. will fix this later
    print_str = json_util.loads(file_data).replace('\\', '').replace('""', '"').replace('"Body":"', '"Body":').replace(
        '}}}"}', '}}}}').replace('= "', '- ').replace('" Or', ' -').replace('" And', ' -')
    json_obj = json_util.loads(print_str)
    # NOTE: alternatively, a regex can be used to pull out just the value we want.
    # match = re.findall('someKey":{"S":"(.*?)"', print_str)
    # if match:
    #     pprint(f'find key: {match[0]}')
    #     return match[0]
    # else:
    #     print(f'no key found!')
    #     return None
    return json_obj
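
One plausible cause of the parsing trouble, offered here only as an assumption, is that the SQS message Body is itself a JSON-encoded string nested inside the outer JSON document, so it needs a second decoding pass instead of string replacements. A minimal sketch under that assumption follows, using only the standard json module. The helper name parse_nested_body and the sample payload are hypothetical and not taken from the real data.

```python
import json


def parse_nested_body(raw, body_key='Body'):
    """Parse a JSON document whose body_key field may hold JSON encoded as a string."""
    if isinstance(raw, bytes):
        raw = raw.decode('utf-8')
    outer = json.loads(raw)
    inner = outer.get(body_key) if isinstance(outer, dict) else None
    if isinstance(inner, str):
        try:
            outer[body_key] = json.loads(inner)
        except json.JSONDecodeError:
            # Not JSON after all: keep the original string untouched.
            pass
    return outer


# Hypothetical payload: the inner document is escaped inside the outer one.
sample = json.dumps({'MessageId': 'abc', 'Body': json.dumps({'someKey': {'S': 'value'}})})
parsed = parse_nested_body(sample.encode('utf-8'))
print(parsed['Body']['someKey']['S'])
```

If the real files turn out to contain something other than double-encoded JSON, this will not help, and the raw bytes should be inspected first to see what the producer actually wrote.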

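This post records how a data issue in our production environment was investigated. The data has been sanitized, so please adapt the configuration to your own environment.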