How To Build an AWS Data Lake: A Step-By-Step Guide
We are going to create an AWS Data Lake using a combination of AWS services. The services we will be using are:
- AWS Glue: for running the ETL and processing jobs and cataloging the data.
- AWS Lake Formation: to provide access control over the data.
- Amazon Athena: for querying and analyzing the data in the Amazon S3 bucket.
- Amazon S3: to store our data.
Step 1: Create IAM User
- First, we need to create an IAM user for controlled access to AWS services. We are going to create an IAM user named "Amazon-sales-user" for our dataset.
- Search for IAM (Identity and Access Management) in the AWS Console search bar and navigate to IAM.
- Click on the "Users" option in the menu, then click the "Create user" button.
- Enter a user name in the user name box and click "Next".
- Now we have to grant the user permissions; select the "Attach policies directly" option to set them.
- Search for and select the following permission policies:
- AmazonS3FullAccess
- AmazonAthenaFullAccess
- AWSCloudFormationReadOnlyAccess
- AWSGlueConsoleFullAccess
- CloudWatchLogsReadOnlyAccess
- After selecting the permission policies, click "Next", review the user details, and hit the "Create user" button.
- The following screenshot shows the successful creation of the user.
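If you prefer the command line, the console steps above can be sketched with the AWS CLI (a provisioning sketch, assuming configured credentials; the user name is the one used in this guide):

```shell
# Create the IAM user for the data lake
aws iam create-user --user-name Amazon-sales-user

# Attach the five managed policies listed above
for policy in AmazonS3FullAccess AmazonAthenaFullAccess \
              AWSCloudFormationReadOnlyAccess AWSGlueConsoleFullAccess \
              CloudWatchLogsReadOnlyAccess; do
  aws iam attach-user-policy \
    --user-name Amazon-sales-user \
    --policy-arn "arn:aws:iam::aws:policy/${policy}"
done
```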
Step 2: Create IAM Role
- After creating the IAM user, we have to create an IAM role to catalog the data stored in the Amazon S3 bucket for our data lake.
- Navigate to the IAM console again, click the "Roles" option in the left-hand menu, then click the "Create role" button.
- Next, select the "AWS service" option, type "Glue" in the use case or service box, and click the "Next" button.
- Now we have to add permissions: search for the "PowerUserAccess" policy, select it, and click the "Next" button.
- On the next screen, enter a role name of your choice, scroll down, and click the "Create role" button.
- Our IAM role is successfully created.
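The same role can be sketched with the AWS CLI (the role name `gfg-data-lake-glue-role` is a placeholder of our choosing; the trust policy lets the Glue service assume the role, which is what selecting "Glue" as the use case does in the console):

```shell
# Trust policy allowing AWS Glue to assume the role
cat > glue-trust-policy.json <<'EOF'
{
  "Version": "2012-10-17",
  "Statement": [{
    "Effect": "Allow",
    "Principal": { "Service": "glue.amazonaws.com" },
    "Action": "sts:AssumeRole"
  }]
}
EOF

# Create the role and attach the PowerUserAccess policy
aws iam create-role \
  --role-name gfg-data-lake-glue-role \
  --assume-role-policy-document file://glue-trust-policy.json
aws iam attach-role-policy \
  --role-name gfg-data-lake-glue-role \
  --policy-arn arn:aws:iam::aws:policy/PowerUserAccess
```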
Step 3: Create S3 Bucket to Store the Data
- We have successfully created the IAM user and IAM role for our AWS Data Lake; now, to store our data, we need to create an Amazon S3 bucket. In this demonstration we upload the data into S3 manually.
- Search for Amazon S3 in the AWS Management Console search bar and navigate to the S3 console.
- Click the "Create bucket" button, enter a bucket name of your choice, and click "Create bucket".
- For default encryption, choose server-side encryption and leave the bucket key disabled.
- The following screenshot shows the successfully created bucket.
- Our bucket is now created. Select your bucket to open it, click the "Upload" button, then click the "Add files" tab, choose your data file, and click "Upload".
- Upload the files as shown in the figure. And our data is ready!
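The bucket creation and upload can also be sketched with the AWS CLI (the bucket name matches the one queried later in this guide, and `sales-data.csv` is a hypothetical local file name standing in for your dataset):

```shell
# Create the bucket (S3 bucket names are globally unique)
aws s3 mb s3://gfg-data-lake-bucket

# Upload the local data file into the bucket
aws s3 cp sales-data.csv s3://gfg-data-lake-bucket/

# Verify the upload
aws s3 ls s3://gfg-data-lake-bucket/
```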
Step 4: Set Up the Data Lake Using AWS Lake Formation
- Our data is ready to ingest into the data lake, so now we will begin to set it up. In the data lake we will create a database. Search for and navigate to the AWS Lake Formation console.
- Add an administrator to perform the data lake's administrative tasks. Click the "Add administrators" button to add administrators for your data lake (the "Add administrators" window pops up only if you are working with AWS Lake Formation for the first time).
- The administrator is added; now it's time to create a database. In the left-hand menu, click "Databases", then click the "Create database" button.
- Enter a database name of your choice, then browse to and provide the S3 bucket path where your data is stored in the "Location" box.
- Also make sure to uncheck the "Use only IAM access control for new tables in this database" checkbox, then click the "Create database" button. And there you go, your database is created in no time.
- The database is created; now we have to register our S3 bucket as storage for the data lake. Find and click the "Data lake locations" option in the left-hand menu, click "Register location", then browse to and enter the S3 bucket path where the data is stored. After providing the S3 path, keep the default IAM role "AWSServiceRoleForLakeFormationDataAccess" and click "Register location".
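The database creation and location registration can be sketched with the AWS CLI as well (the database name matches the one used in the Athena query in Step 6; `--use-service-linked-role` corresponds to accepting the default AWSServiceRoleForLakeFormationDataAccess role in the console):

```shell
# Create the database in the Glue/Lake Formation data catalog
aws glue create-database \
  --database-input '{"Name": "gfg-data-lake-db"}'

# Register the S3 bucket as a data lake location using the
# Lake Formation service-linked role
aws lakeformation register-resource \
  --resource-arn arn:aws:s3:::gfg-data-lake-bucket \
  --use-service-linked-role
```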
Step 5: Data Cataloging using AWS Glue Crawlers
- While building a data lake, it is essential that the data in it be cataloged, and AWS Glue makes the data cataloging process easy.
- AWS Glue provides an ETL (Extract, Transform, Load) service: it transforms, cleanses, and organizes data coming from multiple sources before loading it into the data lake. AWS Glue makes the data preparation process efficient by automating ETL jobs.
- AWS Glue offers crawlers, which automate the data catalog process for better discovery, search, and querying of big data.
- To create a data catalog in the database, the AWS Glue crawler will use the IAM role we created in the previous step.
- Go back to the AWS Lake Formation console, click the "Databases" option, and you will see your previously created database. Select your database, open the "Actions" dropdown menu, and click the "Grant" option.
- In the next window, choose your previously created IAM role under "IAM users and roles". Scroll down to the "Database permissions" field, check only the "Create table" and "Alter" permissions, and click the "Grant" button.
- After that, navigate to the AWS Glue console. In the left-hand menu, under "Data Catalog", click the "Crawlers" option, then click the "Create crawler" button. Enter a name for your crawler (you can also add a description if you want) and click "Next".
- Set the crawler properties as shown in the below screenshot.
- After clicking "Next", the "Choose data sources and classifiers" window opens; here we choose the data source to be crawled. For the S3 path, browse to and provide the S3 bucket path where our data exists, click "Add an S3 data source", and then click "Next".
- Add the data source and the location of the S3 data as shown in the below screenshot.
- On the next screen, we need to add the IAM role: choose the previously created IAM role from the drop-down list and click "Next".
- For "Set output and scheduling", choose our created database, select "On demand" as the crawler schedule frequency, and click "Next".
- Finally, review the AWS Glue crawler configuration and click the "Create crawler" button to save and create the crawler. The crawler is now ready! It may take a short while to finish crawling the S3 bucket, after which you will see the tables created automatically by the crawler in the database.
- Navigate to the AWS Lake Formation console and click "Tables" in the menu; you can verify here as well that the table has been created.
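The crawler setup can be sketched with the AWS CLI too (the crawler and role names are placeholders; the role is the Glue role from Step 2 and the database and bucket are the ones created above):

```shell
# Create a crawler that scans the bucket and writes tables into the database
aws glue create-crawler \
  --name gfg-data-lake-crawler \
  --role gfg-data-lake-glue-role \
  --database-name gfg-data-lake-db \
  --targets '{"S3Targets": [{"Path": "s3://gfg-data-lake-bucket/"}]}'

# Run it on demand, then list the tables it created
aws glue start-crawler --name gfg-data-lake-crawler
aws glue get-tables --database-name gfg-data-lake-db
```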
Step 6: Data Query with Amazon Athena
- Amazon Athena is a query service offered by AWS that allows us to analyze data stored in an Amazon S3 bucket efficiently using standard SQL.
- When we are working with a large amount of data, we need some sort of querying tool to analyze it, and this is where Amazon Athena comes into play: it makes it easy to analyze the data present in the Amazon S3 bucket.
- With Amazon Athena there is no infrastructure to set up or manage, and it supports standard SQL (Structured Query Language) out of the box, so data analysts, data scientists, and organizations can perform analytics and derive valuable insights from the data.
- Amazon Athena allows users to query data stored in Amazon S3 in its original format. Navigate to the Amazon Athena console.
- Click "Query editor" and select the database we created in the earlier steps. Before executing any query, we need to provide a "Query result location", which is an Amazon S3 bucket.
- Amazon Athena stores the query output and metadata for each executed query in the "Query result location".
- We have to create an S3 bucket to store our query results. Click the "Set up a query result location in Amazon S3" tab, provide the S3 bucket's path, and hit the "Save" button.
- With the "Query result location" added, we can now run queries in the Amazon Athena query editor.
- Run the following SQL query by clicking the "Run" button.
SELECT * FROM "gfg-data-lake-db"."gfg-data-lake-bucket" LIMIT 10;
- The output of the above query is illustrated in the following screenshot.
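The same query can be submitted from the AWS CLI (a sketch: the result bucket name is a placeholder, and `<query-execution-id>` stands for the id returned by the first command):

```shell
# Submit the query; results and metadata land in the result location bucket
aws athena start-query-execution \
  --query-string 'SELECT * FROM "gfg-data-lake-db"."gfg-data-lake-bucket" LIMIT 10;' \
  --query-execution-context Database=gfg-data-lake-db \
  --result-configuration OutputLocation=s3://gfg-athena-results-bucket/

# Fetch the rows once the execution finishes
aws athena get-query-results --query-execution-id <query-execution-id>
```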
Step 7: Clean Up
- After following these steps, we have successfully created our AWS Data Lake with a combination of different AWS services. Now it's time to clean up all the created resources to avoid any unnecessary bills.
- Delete all the created AWS Resources including:
- Amazon S3 Buckets
- IAM Users and Roles
- AWS Glue Crawler
- Database created in AWS Lake Formation
- Delete the Registered Locations
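The cleanup can also be sketched with the AWS CLI, tearing resources down roughly in reverse order of creation (all names are the placeholders used in the earlier sketches):

```shell
# Remove the crawler, the database, and the registered location
aws glue delete-crawler --name gfg-data-lake-crawler
aws glue delete-database --name gfg-data-lake-db
aws lakeformation deregister-resource \
  --resource-arn arn:aws:s3:::gfg-data-lake-bucket

# Empty and remove the buckets
aws s3 rb s3://gfg-data-lake-bucket --force
aws s3 rb s3://gfg-athena-results-bucket --force

# Detach policies, then delete the role and the user
aws iam detach-role-policy --role-name gfg-data-lake-glue-role \
  --policy-arn arn:aws:iam::aws:policy/PowerUserAccess
aws iam delete-role --role-name gfg-data-lake-glue-role
for policy in AmazonS3FullAccess AmazonAthenaFullAccess \
              AWSCloudFormationReadOnlyAccess AWSGlueConsoleFullAccess \
              CloudWatchLogsReadOnlyAccess; do
  aws iam detach-user-policy --user-name Amazon-sales-user \
    --policy-arn "arn:aws:iam::aws:policy/${policy}"
done
aws iam delete-user --user-name Amazon-sales-user
```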
In today's data-driven world, organizations are flooded with large amounts of data originating from various sources, ranging from structured databases to unstructured files and logs. To effectively make use of this data for insights and decisions, organizations need a storage system capable of holding vast datasets. An AWS Data Lake addresses this challenge by enabling centralized ingestion, cataloging, and querying at scale. By combining AWS services such as Amazon S3, AWS Glue, AWS Lake Formation, Amazon Athena, and IAM, organizations can build an elastic data lake architecture that delivers actionable intelligence from their data while maintaining security and compliance standards.