This article is also introduced here: https://cloud.flect.co.jp/entry/2020/04/02/114756#f-459f23dc
In the previous two posts, I introduced how to process camera images for use with Teams and Zoom, with demos that detect smiles and emotions (facial expressions) and overlay smiley marks.
https://qiita.com/wok/items/0a7c82c6f97f756bde65 https://qiita.com/wok/items/06bdc6b0a0a3f93eab91
This time I extended that a little further and experimented with converting the camera image into an anime-style image and displaying it. First, a caveat: with a CPU the lag is too large for real-time anime-style conversion, so practical use is difficult. ~~(I haven't tried whether it is better on a GPU.)~~ ➔ I tried it on a GPU and it reaches about 15 fps. It was surprisingly fast.
Let's get right into it.
The method for converting photos into an anime style was covered in the news media about half a year ago, so many of you may already know it; it is published on the following page.
https://github.com/taki0112/UGATIT
Unlike simple image-to-image style conversion, UGATIT builds on GAN technology (a Generator and a Discriminator) and adds a novel function called AdaLIN, which apparently makes it possible to handle changes in shape as well.
In our work, we propose an Adaptive Layer-Instance Normalization (AdaLIN) function to adaptively select a proper ratio between IN and LN. Through the AdaLIN, our attention-guided model can flexibly control the amount of change in shape and texture.
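To make the AdaLIN idea a little more concrete, here is a minimal NumPy sketch of what the function computes, written from the formula in the paper; the variable names and shapes are my own, and this is not code from the UGATIT repository.

```python
import numpy as np

def adalin(a, gamma, beta, rho, eps=1e-5):
    """Minimal sketch of AdaLIN (notation mine, shapes assumed).

    a            : feature map of shape (C, H, W)
    gamma, beta  : per-channel affine parameters, shape (C, 1, 1)
    rho          : learnable ratio in [0, 1] blending Instance Norm and Layer Norm
    """
    # Instance Norm statistics: per channel, over the spatial dimensions
    mu_i = a.mean(axis=(1, 2), keepdims=True)
    var_i = a.var(axis=(1, 2), keepdims=True)
    a_in = (a - mu_i) / np.sqrt(var_i + eps)

    # Layer Norm statistics: over all channels and spatial dimensions
    mu_l = a.mean(keepdims=True)
    var_l = a.var(keepdims=True)
    a_ln = (a - mu_l) / np.sqrt(var_l + eps)

    # Adaptively blend the two normalizations, then apply the affine transform
    return gamma * (rho * a_in + (1 - rho) * a_ln) + beta
```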
For details, see the paper [^1] and the commentary article [^2]. When you actually convert an image using the trained model published on the page above, it looks like this.
It doesn't seem to convert very well when the subject is far away. It also doesn't seem to handle middle-aged men very well. The training dataset is also published on the page above, and it appears to be biased toward young women, which is probably the cause. (I shouldn't count as middle-aged yet... is it hopeless for me?)
As noted above, the subject (the person's face) needs to be close to the camera, i.e. the face should occupy most of the frame. So this time I use the face detection introduced in the previous posts to locate the face, crop that region, and convert the crop with UGATIT, as sketched below.
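As a rough illustration of that flow, here is a minimal sketch of the detect-crop-resize step using OpenCV's bundled Haar cascade; the cascade file, the margin, and the 256x256 input size are my assumptions and are not taken from the WebCamHooker repository.

```python
import cv2

# Assumed detector: OpenCV's bundled frontal-face Haar cascade
face_cascade = cv2.CascadeClassifier(
    cv2.data.haarcascades + "haarcascade_frontalface_default.xml")

def crop_face_for_ugatit(frame_bgr, margin=0.3, size=256):
    gray = cv2.cvtColor(frame_bgr, cv2.COLOR_BGR2GRAY)
    faces = face_cascade.detectMultiScale(gray, scaleFactor=1.1, minNeighbors=5)
    if len(faces) == 0:
        return None  # no face found; the caller can skip conversion for this frame
    # Use the largest detected face
    x, y, w, h = max(faces, key=lambda f: f[2] * f[3])
    # Expand the box so the whole head fits, then clamp to the frame
    mx, my = int(w * margin), int(h * margin)
    x0, y0 = max(x - mx, 0), max(y - my, 0)
    x1 = min(x + w + mx, frame_bgr.shape[1])
    y1 = min(y + h + my, frame_bgr.shape[0])
    face = frame_bgr[y0:y1, x0:x1]
    # Resize to a square input for the style conversion (256x256 assumed)
    return cv2.resize(face, (size, size))
```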
See the repository mentioned below for implementation details. [^3]

[^3]: As of April 2, 2020, the source code is messy because features have been bolted on one after another. I'll refactor it at some point.
Please set up v4l2loopback and the face detection model by referring to the previous articles.
Also, as before, clone the script from the repository below and install the required modules.
$ git clone https://github.com/dannadori/WebCamHooker.git
$ cd WebCamHooker/
$ pip3 install -r requirements.txt
UGATIT officially provides source code for both TensorFlow and PyTorch, but a trained model appears to be available only for the TensorFlow version. Download it and extract it. Note that extraction seems to fail with the standard zip tools on Windows and Linux; an issue reports that 7zip works on Windows, and there seem to be no problems on Mac. A workaround for Linux is still unknown... [^4]

[^4]: All as of April 2, 2020.
For reference, here are the hash values (md5sum) of a model that works correctly. (This may be the number one stumbling block.)
$ find . -type f |xargs -I{} md5sum {}
43a47eb34ad056427457b1f8452e3f79 ./UGATIT.model-1000000.data-00000-of-00001
388e18fe2d6cedab8b1dbaefdddab4da ./UGATIT.model-1000000.meta
a08353525ecf78c4b6b33b0b2ab2b75c ./UGATIT.model-1000000.index
f8c38782b22e3c4c61d4937316cd3493 ./checkpoint
Store these files in UGATIT/checkpoint inside the folder cloned from git above. If it looks like this, you're OK:
$ ls UGATIT/checkpoint/ -1
UGATIT.model-1000000.data-00000-of-00001
UGATIT.model-1000000.index
UGATIT.model-1000000.meta
checkpoint
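If you want to double-check the extracted files programmatically, the following small sketch recomputes the md5 hashes listed above (the paths are assumed to be relative to the cloned WebCamHooker directory):

```python
import hashlib
import os

# Expected hashes are the md5sum values listed above
EXPECTED = {
    "UGATIT.model-1000000.data-00000-of-00001": "43a47eb34ad056427457b1f8452e3f79",
    "UGATIT.model-1000000.meta": "388e18fe2d6cedab8b1dbaefdddab4da",
    "UGATIT.model-1000000.index": "a08353525ecf78c4b6b33b0b2ab2b75c",
    "checkpoint": "f8c38782b22e3c4c61d4937316cd3493",
}

def md5(path, chunk=1 << 20):
    h = hashlib.md5()
    with open(path, "rb") as f:
        for block in iter(lambda: f.read(chunk), b""):
            h.update(block)
    return h.hexdigest()

for name, expected in EXPECTED.items():
    path = os.path.join("UGATIT", "checkpoint", name)
    status = "OK" if md5(path) == expected else "MISMATCH (re-extract the archive)"
    print(f"{name}: {status}")
```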
Run it as follows. One option has been added:

- --input_video_num: enter the actual webcam device number. For /dev/video0, enter the trailing 0.
- --output_video_dev: specify the device file of the virtual webcam device.
- --anime_mode: set to True.

To stop it, press Ctrl+C.
$ python3 webcamhooker.py --input_video_num 0 --output_video_dev /dev/video2 --anime_mode True
When you execute the above command, ffmpeg will run and the video will start to be delivered to the virtual camera device.
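Conceptually, the pipeline looks something like the sketch below: read a frame from the real camera, convert it, and pipe the raw frames to ffmpeg, which writes them to the v4l2loopback device. This is only an illustration of the idea under those assumptions (frame size, frame rate, device numbers), not the actual webcamhooker.py implementation.

```python
import subprocess
import cv2

WIDTH, HEIGHT, FPS = 640, 480, 15   # assumed capture settings

cap = cv2.VideoCapture(0)           # the real webcam (--input_video_num 0)

# ffmpeg reads raw BGR frames from stdin and writes them to the virtual device
ffmpeg = subprocess.Popen(
    ["ffmpeg", "-f", "rawvideo", "-pix_fmt", "bgr24",
     "-s", f"{WIDTH}x{HEIGHT}", "-r", str(FPS), "-i", "-",
     "-f", "v4l2", "/dev/video2"],
    stdin=subprocess.PIPE)

try:
    while True:
        ok, frame = cap.read()
        if not ok:
            break
        frame = cv2.resize(frame, (WIDTH, HEIGHT))
        # ... here the face would be cropped and converted to anime style ...
        ffmpeg.stdin.write(frame.tobytes())
except KeyboardInterrupt:           # Ctrl+C ends the loop, as described above
    pass
finally:
    cap.release()
    ffmpeg.stdin.close()
    ffmpeg.wait()
```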
As before, when you join a video conference, a device named something like "dummy ~~" will appear in the list of video devices, so select it. This is an example in Teams. The source image is also shown picture-in-picture in the upper right corner of the screen. It converts to an anime style and streams better than I expected. However, it is very heavy: on a slightly old PC (Intel(R) Core(TM) i7-4770 CPU @ 3.40GHz, 32GB RAM) it runs at about 1 frame per second, which may be too slow for normal use. For now it is more of a one-shot gag than a practical tool. I would like to try it on a GPU eventually.
With working from home dragging on, casual communication can be difficult, but bringing this kind of playfulness into video conferencing is a good way to liven up conversations. There is a lot more that could be done, so please give it a try.