--In practice, this downloads the images one after another, much like `$ wget -i urls.txt`.
--However, when a link no longer points to an image, files such as .html and .txt get downloaded instead.
--This time we check the Content-Type, validate the image data, and convert everything uniformly to .jpeg.
--The previous article was Getting Image Links with Google Custom Search Engine.
--The complete source is here.
--Downloading images from URLs: requests
--Checking and converting image data: Pillow
$ pip install pillow requests
--As shown below, the URL files are located via CLASSES and LINK_PATH.
--The images are downloaded to DOWNLOAD_PATH.
--For details, please see the previous article.
$ cat config.py
import os

CLASSES = [
    'Abe Oto',
    'Satomi Ishihara',
    'Yuno Ohara',
    'Fuka Koshiba',
    'Haruna Kawaguchi',
    'Nana Mori',
    'Minami Hamabe',
    'Kaya Kiyohara',
    'Haruka Fukuhara',
    'Kuroshima Yuina'
]
BASE_PATH = os.path.dirname(os.path.dirname(os.path.abspath(__file__)))
DATA_PATH = os.path.join(BASE_PATH, 'data')
LINK_PATH = os.path.join(DATA_PATH, 'link')
DOWNLOAD_PATH = os.path.join(DATA_PATH, 'download')
--Each link file looks like this.
$ head "Yuina Kuroshima.txt"
http://cm-watch.net/wp-content/uploads/2018/03/b22dc3193fd35ebb1bf7aa4e74c8cffb.jpg
https://www.crank-in.net/img/db/1165407_650.jpg
https://media.image.infoseek.co.jp/isnews/photos/hwchannel/hwchannel_20191107_7062003_0-small.jpg
https://i.pinimg.com/originals/3e/3c/61/3e3c61df2f426a8e4623b58d84d94b40.jpg
http://yukutaku.net/blog/wp-content/uploads/wordpress-popular-posts/253-100x100.jpg
http://gratitude8888.biz/wp-content/uploads/2017/03/cb1175590da467bef3600df48eabf770.jpg
https://www.cinemacafe.net/imgs/p/ATDRThl-6oWF9fpps9341csCOg8ODQwLCgkI/416673.jpg
https://s3-ap-northeast-1.amazonaws.com/moviche-uploads/wp-content/uploads/2019/10/IMG_2547.jpg
https://scontent-frx5-1.cdninstagram.com/vp/05d6926fed565f82247879638771ee46/5E259FCC/t51.2885-15/e35/67735702_2288175727962135_1310736136046930744_n.jpg?_nc_ht=scontent-frx5-1.cdninstagram.com&_nc_cat=103&se=7&ig_cache_key=MjEyMzM1MTc4NDkyMzQ4NzgxMg%3D%3D.2
http://moco-garden.com/wp-content/uploads/2016/05/kurosimayuina.jpg
--Read the file created last time, which lists the URLs one per line.
import io
import os

import requests
from PIL import Image, ImageFile

from config import LINK_PATH, DOWNLOAD_PATH


def download(query):
    """Download data, check data, save images."""
    linkfile = os.path.join(LINK_PATH, '{}.txt'.format(query))
    if not os.path.isfile(linkfile):
        print('no linkfile: {}'.format(linkfile))
        return
    with open(linkfile, 'r') as fin:
        link_list = fin.read().split('\n')[:-1]
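As a side note (my observation, not from the article): `split('\n')[:-1]` silently drops the last URL if the file happens to lack a trailing newline, whereas `splitlines()` handles both cases.

```python
text_with_nl = 'http://a/0001.jpg\nhttp://b/0002.jpg\n'
text_without_nl = 'http://a/0001.jpg\nhttp://b/0002.jpg'

# With a trailing newline, both approaches agree...
assert text_with_nl.split('\n')[:-1] == text_with_nl.splitlines()

# ...but without one, the slice drops the final URL.
assert text_without_nl.split('\n')[:-1] == ['http://a/0001.jpg']
assert text_without_nl.splitlines() == ['http://a/0001.jpg', 'http://b/0002.jpg']
```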
--Using the list of URLs read above, download them one after another.
--Verify that the Content-Type starts with `image/`.
--The subtype of `image/` may be `jpeg`, `png`, `gif`, `bmp`, and so on.
    for num, link in enumerate(link_list, start=1):
        try:
            # A timeout keeps one dead host from stalling the whole run.
            result = requests.get(link, timeout=10)
            content = result.content
            content_type = result.headers['Content-Type']
        except Exception as err:
            print('err: {}, link: {}'.format(err, link))
            continue
        if not content_type.startswith('image/'):
            print('err: {}, link: {}'.format(content_type, link))
            continue
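One caveat (an addition, not from the original): real Content-Type headers often carry parameters, such as `text/html; charset=utf-8` — visible in the error tally further down as entries like `text/html;`. The prefix check above still works for those, but if you ever need the exact media type, strip the parameters first. A minimal sketch:

```python
def media_type(content_type):
    # Drop parameters like "; charset=utf-8" and normalize case.
    return content_type.split(';', 1)[0].strip().lower()

assert media_type('image/jpeg') == 'image/jpeg'
assert media_type('text/html; charset=utf-8') == 'text/html'
assert media_type('IMAGE/PNG') == 'image/png'
assert media_type('image/jpeg').startswith('image/')
```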
--Setting the following lets Pillow read truncated (partially downloaded) images instead of raising an error.
ImageFile.LOAD_TRUNCATED_IMAGES = True
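A small self-contained demonstration of what the flag does (my own sketch, not from the article): a JPEG cut off mid-stream still decodes once the flag is set.

```python
import io
from PIL import Image, ImageFile

# Build a JPEG in memory, then cut off the tail to simulate a broken download.
buf = io.BytesIO()
Image.new('RGB', (64, 64), (0, 128, 255)).save(buf, 'jpeg')
data = buf.getvalue()
truncated = data[: int(len(data) * 0.9)]

ImageFile.LOAD_TRUNCATED_IMAGES = True
image = Image.open(io.BytesIO(truncated))
image.load()  # without the flag, this raises OSError on the truncated stream
print(image.size)
```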
--Read the image data with Pillow.
--If it cannot be read, the image data is most likely corrupted.
        try:
            image = Image.open(io.BytesIO(content))
        except Exception as err:
            print('err: {}, link: {}'.format(err, link))
            continue
--Thinking about downstream processing, handling the .png and .bmp cases one by one is tedious.
--Therefore, everything is converted to .jpeg uniformly.
--Since the source may be RGBA or another mode, convert it to the RGB that .jpeg requires.
        if image.mode != 'RGB':
            image = image.convert('RGB')
        data = io.BytesIO()
        image.save(data, 'jpeg', optimize=True, quality=95)
        content = data.getvalue()
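To see why the mode conversion is necessary, here is a standalone demo (mine, not from the article): Pillow refuses to encode an RGBA image as JPEG, while the converted RGB copy saves fine.

```python
import io
from PIL import Image

rgba = Image.new('RGBA', (8, 8), (255, 0, 0, 128))

# Saving RGBA directly as JPEG fails...
try:
    rgba.save(io.BytesIO(), 'jpeg')
    raise AssertionError('expected JPEG save of RGBA to fail')
except OSError as err:
    print(err)  # e.g. "cannot write mode RGBA as JPEG"

# ...but after converting to RGB it works.
buf = io.BytesIO()
rgba.convert('RGB').save(buf, 'jpeg', optimize=True, quality=95)
print(Image.open(io.BytesIO(buf.getvalue())).format)
```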
--Save the files under the DOWNLOAD_PATH from the config file, with names such as 0001.jpeg, 0002.jpeg.
--Using the tail of the URL as the file name is rarely useful.
--Also, since the line number in the URL text file matches the number in the file name, the two are easy to cross-reference.
        # Make sure the per-query directory exists before writing.
        os.makedirs(os.path.join(DOWNLOAD_PATH, query), exist_ok=True)
        filename = os.path.join(DOWNLOAD_PATH, query, '{:04d}.jpeg'.format(num))
        with open(filename, 'wb') as fout:
            fout.write(content)
        print('query: {}, filename: {}, link: {}'.format(query, os.path.basename(filename), link))
--There were about 6,000 URLs, and roughly 180 of them produced errors.
--The errors break down as follows.
--Some responses were HTML rather than image data.
--Responses with Content-Type `application/octet-stream` or `binary/octet-stream` could probably still be saved as image data, but they are skipped this time because there are so few of them.
$ awk '{print $2}' err.txt | sort | uniq -c | sort -nr
  47 text/html;
  31 text/plain,
  30 ('Connection
  27 text/html,
  18 'content-type',
  10 cannot
   5 application/octet-stream,
   2 application/xml,
   1 images
   1 binary/octet-stream,
   1 UserWarning:
   1 HTTPSConnectionPool(host='jpnews24h.com',
   1 HTTPSConnectionPool(host='host-your-site.net',
   1 HTTPSConnectionPool(host='gamers.co.jp',
   1 HTTPConnectionPool(host='youtube.dojin.com',
   1 HTTPConnectionPool(host='nosh.media',
   1 HTTPConnectionPool(host='arukunews.jp',
   1 Exceeded
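If you did want to rescue the `octet-stream` responses, one option (my suggestion, not in the article) is to sniff the payload itself with Pillow rather than trusting the header:

```python
import io
from PIL import Image

def looks_like_image(content):
    # verify() checks file integrity without fully decoding the pixels.
    try:
        Image.open(io.BytesIO(content)).verify()
        return True
    except Exception:
        return False

# A real PNG passes; an HTML error page does not.
buf = io.BytesIO()
Image.new('RGB', (4, 4)).save(buf, 'png')
print(looks_like_image(buf.getvalue()), looks_like_image(b'<html>not found</html>'))
```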
--This scratches the itch that plain `$ wget -i urls.txt` cannot quite reach.
--Next time, we plan to perform face recognition on the downloaded images.