Data Lakes on AWS

The article Key Concepts about Data Lakes delved into the importance of Data Lakes, their architecture and how they compare to Data Warehouses. This article will focus on deployment using Amazon Web Services (AWS), Amazon’s cloud platform. We will look into the overall flow, the different services available and, finally, AWS Lake Formation, a tool specially designed to facilitate this task.

Overall Flow

Data Lakes support the needs of our applications and analytics without the need to constantly worry about increasing storage and computing resources as the business grows and the data volume increases. However, there is no magic formula for creating them. Generally, they involve dozens of technologies, tools and environments. The diagram below shows the overall flow of data, from collection, storage and processing, to the use of analytics via Machine Learning and Business Intelligence techniques.

Services supported by AWS

AWS provides a comprehensive set of managed services that help build Data Lakes. Proper planning and design are necessary to migrate a data ecosystem to the Cloud, and understanding Amazon’s offerings is critical. Below are only a few of the most important tools at each stage of the flow.

Collection

The first step is to analyze the goals and benefits you want to achieve with the implementation of an AWS-based Data Lake. Once the plan is designed, data must be migrated to the Cloud, taking into account its volume. You can easily accelerate this migration with services such as Snowball and Snowcone (edge devices for storage and computing), or DataSync and Transfer Family, which simplify and automate transfers.
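
As a small illustration, an existing DataSync task can be started programmatically with boto3, AWS’s Python SDK. This is only a sketch: the task ARN below is a hypothetical placeholder, and the task’s source and destination locations are assumed to have been configured beforehand.

    import boto3

    datasync = boto3.client("datasync")

    # Start an execution of a pre-configured DataSync task that copies data
    # from an on-premises source into S3. The TaskArn is a hypothetical
    # placeholder for a task created earlier in the console or API.
    datasync.start_task_execution(
        TaskArn="arn:aws:datasync:us-east-1:123456789012:task/task-0123456789abcdef0"
    )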

Ingestion

In this step, you can operate in two modes: Batch or Streaming.

In Batch Loading, AWS Glue is used to extract information from different sources at periodic intervals and move it into the Data Lake. This usually involves some degree of minimal transformation (ELT), such as compression or data aggregation.
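
As a sketch of what such a batch job can look like, below is a minimal AWS Glue job script in Python; the database, table and output bucket names are hypothetical placeholders.

    import sys
    from awsglue.utils import getResolvedOptions
    from awsglue.context import GlueContext
    from awsglue.job import Job
    from pyspark.context import SparkContext

    # Standard Glue job boilerplate: resolve arguments and initialize contexts.
    args = getResolvedOptions(sys.argv, ["JOB_NAME"])
    glue_context = GlueContext(SparkContext())
    job = Job(glue_context)
    job.init(args["JOB_NAME"], args)

    # Read a table previously registered in the Glue Data Catalog
    # ("sales_db" / "raw_orders" are hypothetical names).
    orders = glue_context.create_dynamic_frame.from_catalog(
        database="sales_db", table_name="raw_orders"
    )

    # Write the data back into the lake as compressed, columnar Parquet.
    glue_context.write_dynamic_frame.from_options(
        frame=orders,
        connection_type="s3",
        connection_options={"path": "s3://my-data-lake/curated/orders/"},
        format="parquet",
    )
    job.commit()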

For Streaming, data generated continuously from multiple sources, such as log files, telemetry, mobile applications, IoT sensors and social networks, is collected. It can be processed over a sliding time window and delivered into the Data Lake.

Real-time analytics provides useful information for critical business processes that rely on streaming data analysis, such as Machine Learning algorithms for anomaly detection. Amazon Kinesis Data Firehose helps perform this process from hundreds of thousands of sources in real time, rather than uploading data for hours and processing it at a later stage.
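
For instance, a producer written with boto3 can push individual records into a Firehose delivery stream, which buffers them and delivers them to S3. The stream name and payload below are hypothetical.

    import json
    import boto3

    firehose = boto3.client("firehose")

    # Send one telemetry record to a delivery stream
    # ("iot-telemetry-stream" is a hypothetical name).
    firehose.put_record(
        DeliveryStreamName="iot-telemetry-stream",
        Record={
            "Data": (json.dumps({"sensor": "s-42", "temp": 21.7}) + "\n").encode("utf-8")
        },
    )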

Storage and Processing

The core service in any AWS Data Lake is Amazon S3, which provides highly scalable storage with excellent cost and security levels, thus offering a comprehensive solution for different processing models. It can store unlimited data and any type of file as an object. It allows you to create logical tables and hierarchies from key prefixes that act as folders (for example, by year, month, and day), allowing large volumes of data to be partitioned. It also offers a wide set of security functions, such as access controls and policies, encryption at rest, logging and monitoring, among others. Once the data is uploaded, it can be used anytime, anywhere, to address any need. The service supports a wide range of storage classes (Standard, Intelligent-Tiering, Infrequent Access), each with different capacities, retrieval times, security and cost.
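
To make the partitioning idea concrete, the sketch below uploads a file under a year/month/day key prefix with boto3; the bucket and file names are hypothetical.

    import boto3

    s3 = boto3.client("s3")

    # Keys that encode year/month/day let query engines prune partitions
    # and scan only the data they actually need.
    s3.upload_file(
        "events-2024-01-15.json.gz",
        "my-data-lake",
        "raw/events/year=2024/month=01/day=15/events-2024-01-15.json.gz",
    )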

Amazon S3 Glacier is a service for secure archiving and backup management at a fraction of the cost of S3 Standard. File retrieval can take from a few minutes to 12 hours, depending on the storage class and retrieval option selected.

AWS Glue is a managed ETL and Data Catalog service that helps find and catalog metadata for faster queries and searches. Once Glue is pointed at the data stored in S3, it analyzes it using automatic crawlers and registers its schemas. Glue is designed to perform transformations (ETL/ELT) using Apache Spark, Python scripts and Scala. Glue is serverless, so there is no infrastructure to configure or manage, which makes it more efficient.
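
As a minimal sketch, a crawler can be created and started with boto3; the crawler name, IAM role, database and S3 path are all hypothetical placeholders.

    import boto3

    glue = boto3.client("glue")

    # Create a crawler that scans an S3 prefix and registers the inferred
    # schema as a table in the Glue Data Catalog.
    glue.create_crawler(
        Name="raw-events-crawler",
        Role="arn:aws:iam::123456789012:role/GlueCrawlerRole",
        DatabaseName="data_lake_raw",
        Targets={"S3Targets": [{"Path": "s3://my-data-lake/raw/events/"}]},
    )
    glue.start_crawler(Name="raw-events-crawler")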

If the contents of the Data Lake need to be indexed, Amazon DynamoDB (a NoSQL database) and Amazon Elasticsearch Service (a text search engine) can be used. In addition, by using AWS Lambda functions, triggered directly by S3 in response to events such as the upload of new files, processes can be started to keep your Catalog up to date.
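
A minimal sketch of such a Lambda function follows; the handler simply extracts the bucket and key from the S3 event, and the catalog-update step is left as a hypothetical placeholder.

    # Lambda handler triggered by S3 "ObjectCreated" events.
    def handler(event, context):
        for record in event["Records"]:
            bucket = record["s3"]["bucket"]["name"]
            key = record["s3"]["object"]["key"]
            # Hypothetical placeholder: register or update this object in
            # your catalog or index (e.g., a DynamoDB table) here.
            print(f"New object: s3://{bucket}/{key}")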

Analytics for Machine Learning and Business Intelligence

There are several options for analyzing the massive volumes of information in a Data Lake.

Once data has been catalogued by Glue, different services can be used in the client layer for analytics, visualizations, dashboards, etc. Some of these are:

  • Amazon Athena, an interactive serverless service for ad hoc exploratory queries using standard SQL.
  • Amazon Redshift, a Data Warehouse service for more structured queries and reports.
  • Amazon EMR (Amazon Elastic MapReduce), a managed platform for Big Data processing tools such as Apache Hadoop, Spark and Flink, among others.
  • Amazon SageMaker, a Machine Learning platform that allows developers to create, train and deploy Machine Learning models in the cloud.

With Athena and Redshift Spectrum, you can query the Data Lake in S3 directly using standard SQL, relying on the AWS Glue Data Catalog, which contains the metadata (logical tables, schemas, versions, etc.). The most important aspect is that you only pay for the queries executed, based on the volume of data scanned. Therefore, you can achieve significant performance and cost improvements by compressing, partitioning, or converting data into a columnar format (such as Apache Parquet), since each of those operations reduces the amount of data Athena or Redshift Spectrum must read.
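
For example, an ad hoc query can be submitted to Athena with boto3; the database, table and results bucket below are hypothetical.

    import boto3

    athena = boto3.client("athena")

    # Submit a query against a Glue-catalogued table; Athena writes the
    # results to the given S3 location.
    response = athena.start_query_execution(
        QueryString="SELECT status, COUNT(*) FROM events WHERE year = '2024' GROUP BY status",
        QueryExecutionContext={"Database": "data_lake_raw"},
        ResultConfiguration={"OutputLocation": "s3://my-athena-results/"},
    )
    print(response["QueryExecutionId"])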

AWS Lake Formation

Building a Data Lake is a complex, multi-step task, including:

  • Identify the sources (databases, files, streams, transactions, etc.).
  • Create the necessary S3 buckets to store the data, with the applicable policies.
  • Create the ETL jobs that will carry out the necessary transformations, along with the corresponding administration of audit policies and permissions.
  • Allow the Analytics services to access the Data Lake information.

AWS Lake Formation is an attractive option that allows users (both beginners and experts) to start immediately with a basic Data Lake, abstracting away complex technical details. It allows real-time monitoring from a single point, without having to go through multiple services. One strong aspect is cost: AWS Lake Formation itself is free; you are only charged for the services you invoke from it.

It allows you to load data from various sources, monitor flows, set up partitions, enable encryption and key management, define and monitor transformation jobs, reorganize data into a columnar format, configure access control, eliminate redundant data, match linked records, and grant and audit access.
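
As one small example of the access-control piece, the sketch below grants a role SELECT permission on a catalogued table via boto3; the role, database and table names are hypothetical.

    import boto3

    lakeformation = boto3.client("lakeformation")

    # Grant an analyst role read access to one table governed by Lake Formation.
    lakeformation.grant_permissions(
        Principal={
            "DataLakePrincipalIdentifier": "arn:aws:iam::123456789012:role/AnalystRole"
        },
        Resource={"Table": {"DatabaseName": "data_lake_raw", "Name": "events"}},
        Permissions=["SELECT"],
    )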

Conclusions

These two articles looked into the definition of Data Lakes, what makes them different from Data Warehouses and how they can be deployed on the Amazon platform. TCO (total cost of ownership) can be significantly reduced by moving your data ecosystem to the cloud. Providers such as AWS continuously add new services, while improving existing ones and reducing costs.

Huenei can help you plan and execute your Data Lake initiative on AWS, migrate your data to the cloud, and implement the analytics tools your organization needs.

A Few Use Cases for Serverless Computing

Introduction

The revolution of Serverless Computing is here to stay: this new technology enables application development without having to manage and administer a server. Under this model, applications can be grouped and loaded onto a platform, then run and scaled as demand for them increases.

Although “Serverless Computing” does not suppress the use of servers when executing code, it does eliminate all the activities related to their maintenance and updating. This creates an efficient model where developers can disassociate themselves from those routine tasks to focus on more productive activities, thus increasing the company’s operational efficiency.

What is Function as a Service (FaaS)?

Function as a Service (FaaS) is a model that allows computing actions to be executed in response to events; thanks to it, developers can manage applications while “bypassing” the need to manage servers.

In traditional computing, part of an application’s logic is tied to managing the state of a server; the FaaS model instead packages that logic as functions that are executed in containers located in the cloud.

In general terms, FaaS allows us to design applications under a new architecture where the server works in the background and event-driven code execution becomes the fundamental pillar of the model. This means that the underlying processes that normally occur on a server do not run continuously, but are available when needed.
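
To illustrate the idea, below is a minimal sketch of an event-driven function in the AWS Lambda style, written in Python; the event field is a hypothetical example.

    # A minimal FaaS handler: the platform runs it only when an event
    # arrives; nothing executes (or is billed) between invocations.
    def handler(event, context):
        name = event.get("name", "world")  # hypothetical event field
        return {"statusCode": 200, "body": f"Hello, {name}!"}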

This is a clear advantage of the FaaS model: it allows developers to scale dynamically, that is, application capacity automatically increases or decreases based on actual demand.

In addition to the above, FaaS increases the efficiency and profitability of operations, since providers will not bill the company when no activity is detected.

All this makes the FaaS model an innovative element within the recent field of serverless architecture by minimizing investment in infrastructure, and leveraging the competitive advantages of Cloud Computing.

The evolution of Serverless Computing

With the advent of the cloud in the first decade of the 2000s, people had the opportunity to store and transfer data online, eliminating the need for local hard drives.

This undoubtedly created important advantages for users, who had the opportunity to immediately access their information online from any device.

However, developers were missing an element in this equation: the place where applications or software were deployed. In this sense, the “Virtual Machine” model was introduced, which allowed them to point to a “simulated server”, creating significant flexibility in updates and migrations; with this, the problems associated with hardware variations were left behind.

Despite this progress, virtual machines had some operational limitations, and this led to the creation of containers, a new technology that allowed administrators to partition the operating system in order to keep several applications active simultaneously, without one interfering with another.

Considering this reality, we can see that all these technologies maintain the paradigm of “where an application runs” as their fundamental structure. Under this scenario, Serverless Computing emerged, promising a new level of abstraction focused on the code itself that diminished the importance of the place where code was stored.

With the advent of Amazon’s AWS Lambda service at the end of 2014, a milestone in serverless architecture was reached, as developers could finally focus their efforts on creating software without having to worry about hardware, OS maintenance, the location of the application, or its level of scalability.

Use Cases for Serverless Computing

Below are some successful cases of companies that applied serverless technology, or Serverless Computing, within their organizations:

Case 1. Major League Baseball Advanced Media (MLBAM)

Major League Baseball has used serverless computing technology to provide all its fans with real-time baseball game data through its “Statcast” product. This adoption has increased MLBAM’s processing speed, as well as its ability to handle more data.

Case 2. T-Mobile US

T-Mobile US is a mobile phone company with a strong presence in the North American market. The company decided to bet on serverless technology, achieving significant benefits in terms of resource optimization, simplified scaling and reduced patching work, thus increasing its real capacity to respond much more efficiently to all its customers.

Case 3. Autodesk

Autodesk is a company that develops software for the architecture, construction and engineering industries. Recently, the organization decided to apply serverless technology to manage its development processes, as well as the time-to-market of all its products. In keeping with this policy, Autodesk created the “Tailor” application as an efficient response for managing its clients’ accounts.

Case 4. iRobot

iRobot is a company that designs and manufactures robotic devices intended for use within the home and in industrial settings. Since the organization decided to get involved with Serverless Computing technology, the data processing capacity of its robots has increased substantially, also allowing the capture of data streams in real time. The new serverless architecture allows them to focus on their customers and not on operations.

Case 5. Netflix

Netflix has become one of the world’s largest providers of on-demand online media content. In line with its innovative spirit, the company has decided to use Serverless Computing to build an architecture that helps optimize the encoding processes of its audiovisual files, as well as the monitoring of its resources.

Conclusions

When we look at the evolution of Serverless Computing and how significantly it has impacted computing processes in general, we understand that this system is quickly becoming the next step in the world of cloud computing, with a promising future ahead.

What is Serverless Computing?

Introduction

Innovation in the world of computing occurs at a startling pace in each and every area, generating important progress in the processes related to “Serverless Computing”, also known as “Serverless Architecture”.

In this context, an increasing number of companies are turning to the “Cloud” as a way to optimize the creation and execution of applications and processes, minimizing the use of servers. This is where Serverless Computing comes in as a key element for the proper development of internal software architecture.

Although Serverless Computing reduces the use of a server, the server does not disappear entirely; it is simply optimized and reassigned by the cloud provider, who is ultimately responsible for all the routine activities associated with server maintenance.

Background

In the beginning, creating a web application required the use of hardware that would allow the execution of a server, sometimes resulting in a complicated and expensive process. Later on, when the cloud came along, companies and developers had the possibility to rent spaces on remote servers to carry out their activities.

However, this process was not entirely efficient either, since companies ended up buying more space than necessary in order to ensure the system would remain stable in case of very high demand peaks, thus incurring additional expenses. This is why developers began to see the need for a platform that would allow them to pay only for the space actually used.

In this sense, the history of Serverless Computing is recent: the first reports of this technology are found in an article by Ken Fromm, a specialist in decentralized applications and serverless development, published in October 2012 and titled “Why the Future of Software and Apps is Serverless.”

By November 2014, Amazon launched its “AWS Lambda” service, which allows developers to execute code and automatically provision resources without the need to manage the underlying infrastructure.

A year later, in July 2015, Amazon created “API Gateway”, a service for creating and maintaining REST, HTTP and WebSocket APIs, where developers can generate Application Programming Interfaces that access Amazon or other web services, as well as data stored in the cloud. Finally, in October 2015, the “Serverless Framework” was born as the first framework developed for creating applications on AWS Lambda.

Serverless architecture overview

Serverless Computing, or serverless architecture, does not imply the total absence of a server as such; what this system actually seeks is for the cloud provider to adequately and efficiently manage all processes related to the server.

In this sense, one of the outstanding features of Serverless Computing is the ability to let go of the traditional way of managing servers in a company, replacing it with automated management by the cloud provider.

This means that the cloud provider is responsible for managing all the resources needed to execute a particular activity, leaving behind the old model in which that administration was carried out by users within the organization.

Under this new scheme, a company’s IT activities are billed according to the resources each particular task requires, in clear contrast with the old model, where unused capacity was often hired. This allows for major capital savings, since the company only pays for what is actually used.

In addition to the above, the Serverless Computing model eliminates the need to make server reservations. As a result, developers no longer need to access the server through an Application Programming Interface (API) to add resources, since the cloud provider is now responsible for doing this automatically.

Advantages

Serverless Computing has a number of advantages when compared to the traditional model, including the following:

  • It significantly reduces operating costs by allowing developers to pay only for the resources actually used.
  • Higher productivity for companies, with the possibility of assigning server administration tasks to third parties and thus focusing directly on application development.
  • Serverless Computing platforms reduce time to market, since developers have the option of gradually modifying or adding code.
  • Providers of this new service can manage everything related to scaling code under real demand.
  • Ability to focus on unifying software development and its operational capacities, that is, adopting “DevOps” engineering practices.
  • Optimized application development, incorporating essential components of the Backend as a Service (BaaS) model offered by other providers.

Disadvantages

Regarding the disadvantages or downsides of Serverless Computing, the following may be mentioned:

  • Significant restrictions imposed by cloud providers on interaction with their platforms, directly affecting system customization and flexibility.
  • Dependence on the service provider (vendor lock-in).
  • Potential problems associated with the loss of direct control over the company’s own servers.
  • Access to virtual machines and operating systems is limited.
  • Implementing a serverless architecture implies an economic effort, since it typically requires updating systems to meet the provider’s requirements.

What role does the cloud provider play in Serverless Computing?

Cloud providers play a fundamental role in serverless architecture, since they are in charge of running the servers and allocating resources for developers at the same time.

In this sense, cloud providers offer two main methods within the Serverless Computing scheme: “Function as a Service” (FaaS) and “Backend as a Service” (BaaS).

The first method, Function as a Service (FaaS), allows developers to apply microservices when writing and updating code to be deployed in the cloud, thereby simplifying the incorporation of data, reducing execution times and leaving day-to-day resource management to the provider.

On the other hand, the Backend as a Service (BaaS) method is based on providing third-party services through the Application Programming Interface (API) established by the provider, such as databases, authentication services and encryption processes.

Finally, it is worth noting that all the major cloud providers offer Function as a Service (FaaS) products, such as AWS Lambda from Amazon, Azure Functions from Microsoft, IBM Cloud Functions and Google Cloud Functions.
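
As a brief sketch of the FaaS model in practice, a function already deployed to AWS Lambda can be invoked on demand with boto3, AWS’s Python SDK; the function name and payload here are hypothetical.

    import json
    import boto3

    client = boto3.client("lambda")

    # Synchronously invoke a deployed function and read its response;
    # "my-faas-function" is a hypothetical function name.
    response = client.invoke(
        FunctionName="my-faas-function",
        Payload=json.dumps({"name": "Huenei"}).encode("utf-8"),
    )
    print(json.loads(response["Payload"].read()))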

Conclusion

Serverless Computing has certainly had a significant impact on the world of computing, allowing developers to focus on creating software without having to worry about managing the application or the production environment, since the cloud provider is in charge of efficiently managing the resources necessary for this important activity.

Would you like to learn more about this subject? Please visit our IT Continuity page to learn more about the services we offer related to infrastructure and custom Software Development.