Consider a cloud-native WebSocket application running on AWS

Introduction

The term "cloud native" is popular these days. Yesterday @chiyoppy wrote an article about microservices and the cloud. I think it is a good article that concisely summarizes the important points of designing web services for the cloud. Following on from that, I will think about how to create a cloud-native WebSocket application. Yesterday's article made the point that HTTP is stateless, which is great because it scales. In this article, I will look at a WebSocket application, which seems harder to scale than HTTP.

By the way, I am a 2015 new graduate, and at work I write iOS apps in Swift, so this article has nothing to do with my work at Dwango. Also, today is my birthday. Gifts are welcome.

The reason I picked this topic is that I am currently studying cloud native and microservices at a study session held roughly once a week by some Dwango new graduates. While studying, a question came up: **how do you build a cloud-native WebSocket application, given that it tends to hold state on the server itself?** In this article, we will work through the design of a WebSocket application that runs in the cloud while building a small chat application.

Cloud native

To put it simply, cloud native means designing an application on the assumption that it will run in the cloud. You also have to think about containerization, automated deployment, auto-scaling, and application design for the cloud. In this article we will focus on the application design part; chiyoppy covered the other topics yesterday, so I will omit them here. I will assume the application runs on AWS, so AWS terminology will appear throughout.

Requirements for cloud-native applications

In order to create an application that runs in the cloud, you need to take the characteristics of the cloud into account. A typical one is that individual components are unreliable: the application has to be designed on the premise that an EC2 instance may disappear at any moment.

- EC2 holds no state
  - EC2 instances can disappear suddenly
  - Design so that any instance can be destroyed at any time
- High availability (fault tolerance)
  - Always run two or more instances so that failover is possible
  - Place instances in multiple AZs
    - Fail over automatically even if an entire AZ goes down

These are typical requirements. For HTTP, this leads you toward an application that scales because it is stateless. The WebSocket application built this time follows a similar idea and aims for the following:

- Push the state out of the application
- Aim for a configuration that can scale out
- Fail over without the user noticing when the connection they are on fails

The third point is unique to WebSocket; with plain HTTP you don't need to think about it.

To put it concretely, the WebSocket application we are aiming for is one where **two instances, A and B, are placed in two AZs, and when A is killed by hand the user does not notice that anything went down**. Since a WebSocket application keeps its connections open, the connections lost when a server dies have to be re-established transparently. The goal is an application that appears never to go down (it does go down, but the user does not notice) by automatically failing A's connections over to B while a new A is launched in the background. In this article, the goal is to realize the part written in bold.

First, try writing without being aware of cloud natives

For a start, let's write the app without worrying about any of this. I chose chat as the sample application. The code at this stage is here: https://github.com/kouki-dan/CloudChat/tree/basic-chat When you launch it, a text field appears; type something there and it is synchronized with the other users, which is exactly what a chat is.

(Screenshot: the chat UI — Screen Shot 2015-12-02 at 4.04.51 AM.png)

Code description

This class is responsible for WebSocket communication.

chat.py


import logging

import tornado.escape
import tornado.websocket


class ChatSocketHandler(tornado.websocket.WebSocketHandler):
    # Class variables (accessed as ChatSocketHandler.xxx, or as cls.xxx inside class methods)
    waiters = set()  # set of currently connected users
    cache = []       # chat history

    def open(self):  # called when a WebSocket connection is opened
        ChatSocketHandler.waiters.add(self)
        self.write_message({"chats": ChatSocketHandler.cache})  # send the history on connect

    def on_close(self):  # called when the WebSocket is disconnected
        ChatSocketHandler.waiters.remove(self)

    @classmethod
    def update_cache(cls, chat):
        cls.cache.append(chat)  # append to the history

    @classmethod
    def send_updates(cls, chat):  # broadcast a chat to all connected users
        for waiter in cls.waiters:
            try:
                waiter.write_message(chat)
            except Exception:
                logging.error("Error sending message", exc_info=True)

    def on_message(self, message):  # called when a message arrives over the WebSocket
        parsed = tornado.escape.json_decode(message)
        chat = {
            "body": parsed["body"],
        }

        ChatSocketHandler.update_cache(chat)
        ChatSocketHandler.send_updates({
            "chats": [chat]
        })

I apologize if you can't read Python; it is a relatively easy-to-read language, so you should be able to follow the general flow. In short, there are two things going on: keeping the history and broadcasting when a chat is sent. Even in this state it works fine as a standalone app, but if we want to run it in the cloud, the current version has the following problems:

- The application has state
  - The cache (chat history) lives in memory
    - Data disappears when the instance goes down
  - Multiple instances go out of sync
    - With multiple instances, a chat only reaches users connected to the same instance
  - It does not scale out
- No reconnection when an instance goes down

First, let's remove the state (the cache) from the application. This time we will use Redis and store the cache in a Redis list. AWS ElastiCache is convenient here because it gives you a Redis that can be reached from multiple AZs.
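As a side note, connecting to ElastiCache from the application is the same as connecting to any other Redis server. Below is a minimal sketch of how the connection could be configured, assuming the endpoint is passed in through environment variables; the names REDIS_HOST and REDIS_PORT are my own choice for illustration and are not defined by the repository (which simply calls Redis() with its defaults for local development).

redis_config.py (hypothetical helper, not in the repository)


import os

from redis import Redis


def make_redis():
    # REDIS_HOST / REDIS_PORT are illustrative names; point them at the
    # ElastiCache endpoint in production and use the defaults locally.
    return Redis(
        host=os.environ.get("REDIS_HOST", "localhost"),
        port=int(os.environ.get("REDIS_PORT", "6379")),
        decode_responses=True,  # return str instead of bytes, as chat.py expects
    )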

Chat with no state in the application

Until now, chat data was kept in memory in a variable called cache. Let's change the configuration so that data is not lost even if the EC2 instance dies, by handing it off to Redis. The result is here: https://github.com/kouki-dan/CloudChat/tree/cache-in-redis

Let's take a look at the diff.

chat.diff


 class ChatSocketHandler(tornado.websocket.WebSocketHandler):
     waiters = set()
-    cache = []  # removed: the cache now lives in Redis
+    redis = Redis(decode_responses=True)  # Redis client used to store the cache instead
 
     def open(self):
         ChatSocketHandler.waiters.add(self)
         # Changed to fetch the history through a method instead of reading cache directly
-        self.write_message({"chats": ChatSocketHandler.cache})
+        self.write_message({"chats": ChatSocketHandler.get_caches()})
 
     def on_close(self):
         ChatSocketHandler.waiters.remove(self)
 
     @classmethod
     def update_cache(cls, chat):
         # Changed to save to Redis instead of keeping the chat in memory
-        cls.cache.append(chat)
+        chat_id = cls.redis.incr("nextChatId")  # get a new chat ID from Redis
+        redis_chat_key = "chat:{}".format(chat_id)  # key to store the chat under (chat:1, chat:2, ...)
+
+        cls.redis.set(redis_chat_key, json.dumps(chat))  # Redis stores strings, so save the chat as a JSON string
+        cls.redis.rpush("chats", redis_chat_key)  # append the chat's key to the "chats" list
+
+    @classmethod
+    def get_caches(cls):
+        # Fetch the whole chat history from Redis
+        chat_ids = cls.redis.lrange("chats", 0, -1)  # get all chat keys
+        chats = []
+        for chat_id in chat_ids:
+            chat = json.loads(cls.redis.get(chat_id))  # convert the JSON string back into a Python dict
+            chats.append(chat)
+        return chats  # a list of dicts

For those who can't read Python, the comments describe roughly what each part does (note that the full file also imports json and Redis from the redis package). By using Redis for both posting and fetching, the data is no longer lost when an EC2 instance dies. A nice side effect of moving the data outside the instance is that multiple instances now see the same data, which brings us one step closer to a scale-out configuration. Let's try launching multiple instances.
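How were the two ports started? The repository is based on the standard Tornado chat demo, so the entry point looks roughly like the sketch below. This is my reconstruction; the actual startup code in the repository may differ in detail, and the static/HTML handlers are omitted here.

chat.py (sketch of the entry point)


import tornado.ioloop
import tornado.web
from tornado.options import define, options, parse_command_line

define("port", default=8888, help="port to listen on", type=int)


def main():
    parse_command_line()
    app = tornado.web.Application([
        (r"/chatsocket", ChatSocketHandler),  # the handler shown above
    ])
    app.listen(options.port)
    tornado.ioloop.IOLoop.current().start()


if __name__ == "__main__":
    main()

Running this once with --port=8888 and once with --port=8899 gives the two instances shown in the screenshot below.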

(Screenshot: two instances of the application showing the same chat history — スクリーンショット 2015-12-07 15.57.31.png)

In this screenshot, I started the application twice on different ports. With the previous version, chat content was not synchronized between instances, but now that the data is stored in Redis, both show the same content. This is not yet a complete scale-out, however: the chat history is synchronized at startup, but real-time message delivery is not, so a chat sent to localhost:8888 as in the image below never reaches localhost:8899, and vice versa.

(Screenshot: a message sent to one instance does not appear on the other — スクリーンショット 2015-12-07 16.13.19.png)

Now let's solve this and consider a configuration that can be completely scaled out.

WebSocket application to scale out

In order to scale out, an event that occurs on one instance has to be propagated to the other instances. In the example above, the data users saw became inconsistent because there was no mechanism for telling localhost:8899 about a chat posted to localhost:8888. Redis can solve this too. Let's evolve this into a chat that can scale out by using the PubSub mechanism Redis provides.

What is PubSub

PubSub stands for Publish (Pub) and Subscribe (Sub) and is a mechanism for communication between processes and applications. You publish an event to a channel, and the event is delivered to every connection that has subscribed to that channel in advance.
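As a tiny standalone illustration of this with the redis-py client (separate from the chat code): one connection subscribes to a channel, another publishes to it, and the subscriber receives the message from listen().

pubsub_example.py (standalone illustration)


from redis import Redis

r = Redis(decode_responses=True)

# Subscriber side: register interest in the "chats" channel.
pubsub = r.pubsub()
pubsub.subscribe("chats")

# Publisher side (normally another process or another instance).
r.publish("chats", "hello")

# listen() yields subscription confirmations as well as published messages.
for item in pubsub.listen():
    if item["type"] == "message":
        print(item["data"])  # -> hello
        break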

Chat to scale out

We will use PubSub for the communication between instances so that the chat can scale out. The code that implements this is here: https://github.com/kouki-dan/CloudChat/tree/pubsub-chat

Now let's take a look at the diff.

chat.py


+import threading
+
+
+class Listener(threading.Thread):
+    # Thread that subscribes to the "chats" channel and relays published messages
+    def __init__(self, r):
+        threading.Thread.__init__(self)
+        self.redis = r
+        self.pubsub = self.redis.pubsub()
+        self.pubsub.subscribe(["chats"])
+
+    def work(self, item):
+        if item["type"] == "message":  # ignore subscribe/unsubscribe notifications
+            chat = json.loads(item["data"])
+            ChatSocketHandler.send_updates(chat)
+
+    def run(self):
+        # listen() blocks, which is why this runs on its own thread
+        for item in self.pubsub.listen():
+            if item["data"] == "KILL":
+                self.pubsub.unsubscribe()
+            else:
+                self.work(item)
+

First, we create a class that handles the subscription. If the subscribe loop ran on the main thread it would block it, so this part runs on a separate thread, using the threading module from the Python standard library. When a message is published, this class calls ChatSocketHandler.send_updates(chat), which delivers the chat to every user connected to that instance.

Next, let's look at the changes to the chat handler itself.

chat.diff


 class ChatSocketHandler(tornado.websocket.WebSocketHandler):
     waiters = set()
     redis = Redis(decode_responses=True)
+    # Start a Listener thread that subscribes to the "chats" channel
+    client = Listener(redis)
+    client.start()

# ......
# ...... 
 
     def on_message(self, message):
         parsed = tornado.escape.json_decode(message)
         chat = {
             "body": parsed["body"],
         }

         ChatSocketHandler.update_cache(chat)
-        ChatSocketHandler.send_updates({
-          "chats": [chat]
-        })
-
+        ChatSocketHandler.redis.publish("chats", json.dumps({
+            "chats": [chat]
+        }))

The only change is that the part that used to call send_updates directly now publishes to Redis instead. As a result, communication flows as follows. (I wanted to draw a diagram, but gave up for lack of tools; maybe someday.)

User
   ↓ (sends a chat over WebSocket)
The instance the user is connected to
   ↓ (publish)
Redis
   ↓ (notifies all subscribed instances)
Instance A, Instance B, ... (all instances)
   ↓ (each instance sends the chat over WebSocket to the users connected to it)
User

The result is a scale-out chat application that stays fully synchronized across instances. The application part is now complete. What is still missing for a cloud-native application is failover when something breaks. The server alone cannot handle that, so let's move on to reconnection on the client side.

WebSocket application failover

So far, we've built a WebSocket application that can scale out. To use a cloud where no single instance is reliable, how to fail over when a failure occurs matters a great deal. Assuming two instances A and B are running, failover means that when instance A goes down, all the connections that were attached to A are re-established against instance B.

As before, here is the branch that implements this: https://github.com/kouki-dan/CloudChat/tree/reconnect-for-failover

Until now the server sent the full chat history every time a connection was made, but the history is not needed when reconnecting, so the server-side code has also changed slightly.
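The exact server-side change is in the branch linked above; roughly speaking, the history is no longer pushed from open() and is instead returned when the client sends a requestFirstChat command. A sketch of what the changed on_message could look like (my reconstruction, not a verbatim excerpt from the repository):

chat.py (sketch of the changed on_message)


    def on_message(self, message):
        parsed = tornado.escape.json_decode(message)

        # First-time clients explicitly ask for the history; reconnecting
        # clients skip this command, so the backlog is not sent again.
        if parsed.get("type") == "command":
            if parsed.get("command") == "requestFirstChat":
                self.write_message({"chats": ChatSocketHandler.get_caches()})
            return

        chat = {"body": parsed["body"]}
        ChatSocketHandler.update_cache(chat)
        ChatSocketHandler.redis.publish("chats", json.dumps({"chats": [chat]}))

With that in mind, let's look at the client code that handles reconnection.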

index.js


    window.addEventListener("load", function() {

      // Connect when the page is opened
      connect(function() {
        // Error handler
        alert("connection error");
      }, closeFunction, function() {
        // Open handler
        // Request the chat history on the first connection
        socket.send(JSON.stringify({"type":"command", "command":"requestFirstChat"}))
      });

      // Establishes the WebSocket connection
      function connect(errorFunction, closeFunction, openFunction) {
        var url = "ws://" + location.host + "/chatsocket";
        socket = new WebSocket(url);
        // Receiving a message is handled the same way in every case
        socket.onmessage = function(event) {
          receiveChat(JSON.parse(event.data)["chats"]) // reflects the chats in the DOM
        }
        // Error, close and open handling differ between the first connection and a
        // reconnection, so those handlers are passed in as arguments
        socket.onclose = closeFunction;
        socket.onerror = errorFunction;
        socket.onopen = openFunction;
      }

      // Called when the connection is closed, e.g. because the EC2 instance went down
      function closeFunction(event) {
        var retry = 0;
        setTimeout(function() {
          var callee = arguments.callee; // so this function can reschedule itself with setTimeout
          connect(function() {
            // Error handler
            retry++;
            // Retry the connection up to 3 times
            if(retry < 3) {
              setTimeout(function() {
                callee();
              }, 3000);
            } else {
              alert("connection error");
            }
          }, null /* the close handler cannot be installed yet at this point */, function() {
            // onclose also fires when the socket closes due to an error, so install
            // the close handler only once the connection has succeeded
            socket.onclose = closeFunction;
          });
        }, 1000);
      };
// ........

The amount of code ended up roughly double what I had expected. "Isn't it just reconnecting?" I thought lightly, but as I wrote it, things kept coming up that had to be handled one after another: the difference between the first connection and a reconnection, the retry count, and so on.

Finally, if you establish a connection and then kill one server with Ctrl+C, the client reconnects to another instance, which completes the application. For this to actually work as expected you also need to set up and tune the load balancer and DNS, but here I only cover the application implementation.

I've written quite a lot of code, but the application we were aiming for is now complete. It is not entirely finished: in this version, a message sent while a client is reconnecting is lost. Let's call it done for now, though. Sent messages could be cached on the client; for received messages, one possible implementation is for the client to report the ID of the last message it received and for the server to resend whatever was missed. If you are interested, please write it and send a PR. So that's it for this article. Thank you for reading. The working code for each stage is tagged on GitHub, so please refer to it.
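Before wrapping up, to make the resend idea a bit more concrete: every chat already gets an incrementing ID from Redis (nextChatId), and the keys stored in the chats list look like chat:<id>, so the server could replay everything after a given ID. The helper below is purely illustrative and is not in the repository, and the command name requestChatsAfter is likewise made up.

chat.py (illustrative only, not in the repository)


    @classmethod
    def get_caches_after(cls, last_chat_id):
        # Return only the chats whose ID is greater than last_chat_id.
        chats = []
        for chat_key in cls.redis.lrange("chats", 0, -1):  # keys look like "chat:42"
            chat_id = int(chat_key.split(":")[1])
            if chat_id > last_chat_id:
                chats.append(json.loads(cls.redis.get(chat_key)))
        return chats

A reconnecting client would then send something like {"type": "command", "command": "requestChatsAfter", "lastChatId": 42}, and the server would answer with the result of get_caches_after(42).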

Summary

So far we have walked through evolving a chat written without any of this in mind into an application that runs in the cloud with high availability. I was writing this article while writing the code, so I was relieved that the reconnection actually worked when I finally dropped an instance. That said, there is a lot more to think about in an application that really has to work: sessions are one example, and as I said, the chat built here isn't perfect either; in a chat, things like direct messages seem hard to handle. Running WebSocket in the cloud takes a lot of care and is genuinely difficult. I happen to have an opportunity to write a WebSocket application personally, so I'd like to use what I learned here and someday write something like "I actually built a real-time communication application with WebSocket, designed with cloud native in mind". Also, as I wrote at the beginning, the most important point of this article is that **today is my birthday**. Congratulations to me; gifts are welcome.


Tomorrow

It looks like Hiroppy (@about_hiroppy) will write **something** tomorrow. What could **something** be? I'm looking forward to it.
