Make a simple OMR (mark sheet reader) with Python and OpenCV

I was looking for an OMR (Optical Mark Reader) to make data entry as efficient as possible when conducting paper questionnaire surveys, but I couldn't find one that suited my needs, so I decided to make a simple one myself. There may be a smarter way, but what I built works as intended.

OpenCV is used for image recognition and NumPy for numerical processing. Note that some care is needed to use OpenCV 3 from Python 3 when it is installed with Homebrew on a Mac. When you run brew install, you cannot use it from Python 3 unless you add the --with-python3 option. Also, since it is a keg-only package, you need to set the library path yourself for both Python 2 and 3. If you search the net you will find various pages that describe the procedure; I found this page the easiest to understand.

Rough flow

Requirements
  Python (2.7 or 3.x; 3.5 is used here)
  Packages used: NumPy, OpenCV
Advance preparation
  Creating a mark sheet
  Scanning with a document scanner (such as ScanSnap)
Recognition processing
  Image cropping and resizing
  Mark recognition
  Result output

Advance preparation

Creating a mark sheet

First, create a mark sheet, since nothing can start without one. To keep the mark-reading process as simple as possible, keep the following points in mind when designing the sheet.

Note 1: Print the marks themselves as faintly as possible

If the printed mark is dark, it is hard to tell whether it has been filled in. Print the marks in a color that is light but still easy to see, or draw them with fairly thin lines.

The marks can be created in Illustrator or similar software, or with the basic shapes of a word processor such as MS Word or Pages. The figure below was created with the rounded rectangle in Pages. The color is 40% gray, the digits are 5pt Hiragino Kakugo W1, and the outline is 0.3pt. On the mark sheet, each mark is 3 mm wide and 5 mm high.

Note 2: Marks should be evenly spaced with sufficient space between them.

To simplify the mark recognition process, the marks are placed slightly apart from each other and arranged so that they are **evenly spaced** in both the vertical and horizontal directions. If the spacing is wide enough, small deviations from equal spacing cause no problem.

Note 3: Prepare a feature point (marker)

A feature point (**marker**) is required to extract the mark sheet area. Create a figure that has a relatively simple shape and cannot be confused with other characters or symbols, and place 2 to 4 of them at the corners of the area used for recognition processing.

At this time, make sure that the **upper left** corner of each marker coincides with a corner of the recognition area. In the sample mark sheet, the recognition area consists of the area containing the marks plus a margin of 3 mark-line heights above it and 1 line below it, and markers are placed at three corners of that area: upper left, upper right, and lower right. Two markers, upper left and lower right, would probably be enough, but I was concerned that a tilted page might be misjudged, so I added one at the upper right just in case. The image below is a sample of the finished mark sheet.

Since the marker image file is required for recognition processing, save only the marker as an image file separately from the mark sheet paper.

← Save only the marker as an image

This completes the preparation of the mark sheet.

Mark sheet scan

The filled-in mark sheet is scanned in **grayscale** using a document scanner such as ScanSnap. If the scanned image is tilted, recognition will not work well. Tilt correction can be done with OpenCV, but that complicates the processing, so here **tilt correction** is turned on in the scanner's software instead. If the sheets are not all fed in the same orientation, turn on automatic rotation as well.

The image below is the result of scanning the sample mark sheet after filling in some marks (the image has been reduced in size).

Recognition process

From here on, image recognition is performed with Python and OpenCV. First, import NumPy and OpenCV.

import numpy as np
import cv2

Image cropping and resizing

Read the scanned image and use a marker to cut out only the required area.

First, read the marker image file (marker.jpg in this case) used to locate the mark sheet area, and make the necessary settings. Template matching is used to find the position of the markers with OpenCV, and the template (marker) image must be about the same size as the marker in the scanned image. Resize the marker according to the scan resolution so that the two roughly match. Alternatively, you can enlarge the marker and save it at that size in advance.

### Marker settings

marker_dpi = 72   # Screen resolution (marker image size)
scan_dpi = 300    # Scanned image resolution

# Read the file as grayscale (flag 0)
marker = cv2.imread('marker.jpg', 0)

# Get the size of the marker (shape is (height, width), so reverse it)
w, h = marker.shape[::-1]

# Resize the marker; cv2.resize takes the target size as (width, height)
marker = cv2.resize(marker, (int(w*scan_dpi/marker_dpi), int(h*scan_dpi/marker_dpi)))
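The scaling arithmetic can be checked on its own. The following is a minimal sketch; the helper name scaled_marker_size and the example marker sizes are made up for illustration, and it rounds to the nearest pixel rather than truncating with int().

```python
def scaled_marker_size(width, height, marker_dpi=72, scan_dpi=300):
    """Return the (width, height) the marker should have at the scan resolution."""
    # Multiply before dividing so exact ratios stay exact
    return (round(width * scan_dpi / marker_dpi),
            round(height * scan_dpi / marker_dpi))

# Hypothetical marker images saved at 72 dpi and scanned at 300 dpi
print(scaled_marker_size(24, 24))  # (100, 100)
print(scaled_marker_size(30, 18))  # (125, 75)
```

The returned tuple is in (width, height) order, which is the order cv2.resize expects for its size argument.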

Next, load the scanned mark sheet image. It is assumed that the scanned image is saved as sample001.jpg.

###Load scanned image
img = cv2.imread('sample001.jpg',0)

The markers are located in this scanned image using the template matching function matchTemplate().

res = cv2.matchTemplate(img, marker, cv2.TM_CCOEFF_NORMED)

This cv2.TM_CCOEFF_NORMED part specifies a function for determining similarity. If there is no particular problem, you can leave it as it is.

The matching result returned by matchTemplate() contains, for each coordinate in the scanned image, a value that indicates the similarity to the template image (maximum = 1.0). From this, only the parts with a high degree of similarity are extracted.

threshold = 0.7
loc = np.where( res >= threshold)

Here, only the coordinates whose similarity is 0.7 or more are extracted. Adjust this value as necessary. The larger the value, the stricter the judgment, and the smaller the value, the looser the judgment. For example, if the size of the marker used as a template is extremely different from the size of the marker in the scanned image, the similarity will be low and it will not be recognized well unless this value is lowered. Also, if you loosen the criteria too much, non-markers will be falsely detected. If the size of the marker is appropriate, I think that it should be around 0.7.
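To see how the threshold changes the number of candidate positions, here is a small NumPy-only sketch. The similarity map is a hand-written toy array standing in for the output of matchTemplate(), and the helper count_hits is hypothetical.

```python
import numpy as np

# Toy similarity map standing in for the result of cv2.matchTemplate()
score_map = np.array([
    [0.10, 0.20, 0.15, 0.05],
    [0.30, 0.95, 0.72, 0.10],
    [0.20, 0.65, 0.40, 0.10],
])

def count_hits(scores, threshold):
    """Count the coordinates whose similarity is at least `threshold`."""
    ys, xs = np.where(scores >= threshold)
    return len(ys)

for t in (0.6, 0.7, 0.9):
    print(t, count_hits(score_map, t))
# 0.6 3
# 0.7 2
# 0.9 1
```

A real similarity map usually has a small cluster of high values around each marker, so several hits per marker are normal.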

From the extracted coordinates, find the upper left and lower right coordinates of the recognition area. The upper left coordinate value is the smallest coordinate value for both x and y, and the lower right coordinate value is the largest coordinate value for both x and y among the results with high similarity. Note that the extracted coordinate values are stored in the array in the order of y and x.

mark_area={}
mark_area['top_x']= min(loc[1])
mark_area['top_y']= min(loc[0])
mark_area['bottom_x']= max(loc[1])
mark_area['bottom_y']= max(loc[0])

The scanned image is cropped based on these coordinates. To crop an image, simply index the original image with the coordinates of the required area. Note, however, that the order is Y coordinate first, then X coordinate.

img = img[mark_area['top_y']:mark_area['bottom_y'],mark_area['top_x']:mark_area['bottom_x']]
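The y-before-x ordering is easy to get wrong, so here is a small sketch of the same bounding-box and cropping logic on a toy array. The array page and the hit coordinates in loc are invented for illustration; in the real script loc comes from np.where(res >= threshold).

```python
import numpy as np

# 6 x 8 toy "page": each pixel holds its own index so the crop is easy to check
page = np.arange(48).reshape(6, 8)

# Pretend hits in the same layout np.where returns: (rows / y, columns / x)
loc = (np.array([1, 1, 4]), np.array([2, 6, 6]))

top_y, bottom_y = loc[0].min(), loc[0].max()
top_x, bottom_x = loc[1].min(), loc[1].max()

# Rows (y) come first, then columns (x)
cropped = page[top_y:bottom_y, top_x:bottom_x]
print(cropped.shape)  # (3, 4)
print(cropped[0, 0])  # 10, i.e. the pixel at row 1, column 2 of the page
```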

Write the cropped region to a file and check that it has been cut out properly.

cv2.imwrite('res.png',img)

There is a slight space on the left, but this is not a problem.

Next, to simplify the subsequent processing, resize the cropped image so that its width and height are integral multiples of the number of mark columns and rows. Here, 100 pixels are used per column and per row. When counting the number of rows, include the margins between the mark area and the markers.

n_col = 7          # Number of marks per line
n_row = 7          # Number of mark lines
margin_top = 3     # Number of top margin lines
margin_bottom = 1  # Number of bottom margin lines

# Total number of lines (7 mark lines + 3 top margin lines + 1 bottom margin line)
n_row = n_row + margin_top + margin_bottom

img = cv2.resize(img, (n_col*100, n_row*100))
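After this resize, every grid cell is exactly 100 x 100 pixels, so the pixel range of any cell follows directly from its row and column. The sketch below spells this out; the constant CELL and the helper cell_bounds are names introduced here for illustration only.

```python
CELL = 100  # pixels per grid cell after resizing (the value used in the text)

def cell_bounds(row, col, cell=CELL):
    """Return (y0, y1, x0, x1) pixel bounds of the grid cell at (row, col)."""
    return row * cell, (row + 1) * cell, col * cell, (col + 1) * cell

n_col = 7
n_row = 7 + 3 + 1  # mark lines + top margin + bottom margin

print((n_col * CELL, n_row * CELL))  # (700, 1100): (width, height) for cv2.resize
print(cell_bounds(3, 0))             # (300, 400, 0, 100): first mark of the first mark line
```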

Next, the cropped image is lightly blurred, binarized into black and white, and then inverted. In the example below, a Gaussian blur is applied before binarization. Note that because cv2.THRESH_OTSU is specified, OpenCV chooses the threshold automatically with Otsu's method and ignores the value 50 passed to cv2.threshold(); remove cv2.THRESH_OTSU if you want a fixed brightness threshold of 50. Inversion simply subtracts each pixel value from 255.

### Blur
img = cv2.GaussianBlur(img,(5,5),0)

### Binarize (50 is ignored when THRESH_OTSU is set; Otsu's method picks the threshold)
res, img = cv2.threshold(img, 50, 255, cv2.THRESH_BINARY+cv2.THRESH_OTSU)

### Black and white inversion
img = 255 - img
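To make the effect on pixel values concrete, here is a NumPy-only sketch of a fixed-threshold binarization followed by inversion. It assumes a fixed threshold of 50 rather than an automatically chosen one, and the function name binarize_and_invert and the toy array are made up for illustration.

```python
import numpy as np

def binarize_and_invert(gray, thresh=50):
    """Pixels brighter than `thresh` become 255, the rest 0; then invert."""
    binary = np.where(gray > thresh, 255, 0).astype(np.uint8)
    return 255 - binary

# Toy grayscale patch: bright paper (about 250) and dark pencil marks (10 to 30)
gray = np.array([[250, 240, 30],
                 [245, 20, 10]], dtype=np.uint8)

print(binarize_and_invert(gray))
# [[  0   0 255]
#  [  0 255 255]]
```

After inversion the pencil marks are the white (255) pixels, which is what lets the later step measure the marked area by summing pixel values.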

The result of these operations is as follows.

Mark recognition

Marks are recognized by dividing the cropped and resized image into lines and processing each line in turn.

For each line, the image is first divided horizontally into one cell per mark, and the sum of the pixel values in each cell is calculated. Since the image has been inverted, colored parts are white (255) and blank parts are black (0). In other words, the sum of the pixel values represents the area of the colored (marked) part.

Then the **median** of these areas is calculated and used as the basis for the **threshold** that decides whether each mark is filled in. Since parts of the mark sheet are colored from the start (printed lines and numbers), unmarked cells also contain some colored pixels. Therefore, in the following example, a cell is judged True (marked) if its colored area is more than 3 times the median, and False otherwise. Adjust this multiple as necessary: the larger the multiple, the stricter the judgment, and the smaller the multiple, the more lenient.

This method is based on the assumption that each line will be filled with one or two marks. Since it is based on the median, it will not work well if all the marks are filled.
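A small worked example may help show both the rule and its limitation. The sums below are invented numbers, not measurements from the sample sheet, and the helper name marked is hypothetical.

```python
import numpy as np

def marked(area_sum, factor=3):
    """Return a boolean array: True where a cell's sum exceeds factor * median."""
    area_sum = np.asarray(area_sum)
    return area_sum > np.median(area_sum) * factor

# One mark filled in: the median (125) reflects the unmarked cells
one_filled = [120, 130, 900, 125, 110, 115, 128]
print(marked(one_filled))
# [False False  True False False False False]

# Every mark filled in: the median itself is large, so nothing exceeds 3x of it
all_filled = [900, 910, 905, 890, 920, 915, 900]
print(marked(all_filled))
# [False False False False False False False]
```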

### Prepare a list to hold the results
result = []

### Process line by line (skipping the margin lines)
for row in range(margin_top, n_row - margin_bottom):

    ### Cut out only the line being processed
    tmp_img = img[row*100:(row+1)*100, :]
    area_sum = []  # Sum of pixel values for each mark

    ### Process each mark
    for col in range(n_col):

        ### Sum the pixel values in each mark area with NumPy
        area_sum.append(np.sum(tmp_img[:, col*100:(col+1)*100]))

    ### A mark is judged filled if its sum exceeds 3 times the median
    area_sum = np.array(area_sum)
    result.append(area_sum > np.median(area_sum) * 3)
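As an alternative to the nested loops, the per-cell sums can also be computed in one step by reshaping the image into a grid of cells. This is only a sketch of an equivalent formulation, not the method used in this article; the function name cell_sums and the tiny test image are assumptions for illustration.

```python
import numpy as np

def cell_sums(img, n_row, n_col, cell=100):
    """Sum pixel values in every cell of an image of shape (n_row*cell, n_col*cell)."""
    # Axes after reshape: (grid row, y inside cell, grid column, x inside cell)
    return img.reshape(n_row, cell, n_col, cell).sum(axis=(1, 3))

# Toy inverted image with 2 lines of 3 marks; only cell (1, 2) is filled in
img = np.zeros((2 * 100, 3 * 100), dtype=np.uint8)
img[100:200, 200:300] = 255

sums = cell_sums(img, 2, 3)
marks = sums > np.median(sums, axis=1, keepdims=True) * 3
print(marks)
# [[False False False]
#  [False False  True]]
```

Here the median is taken per line (axis=1), which matches the line-by-line judgment of the loop version. Margin lines would still need to be skipped before interpreting the result.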

Result output

Let's output the recognition result. To be on the safe side, also check whether more than one mark has been filled in on the same line.

for x in range(len(result)):
    res = np.where(result[x]==True)[0]+1
    if len(res)>1:
        print('Q%d: ' % (x+1) +str(res)+ ' ##Multiple answers##')
    elif len(res)==1:
        print('Q%d: ' % (x+1) +str(res))
    else:
        print('Q%d: **Unanswered**' % (x+1))

Q1: [1]
Q2: [2]
Q3: [3]
Q4: [4]
Q5: [5]
Q6: [6]
Q7: [7]

Everything was recognized correctly. There is no multiple-answer warning here, but if you lower the recognition threshold a little, misrecognition increases and multiple-answer warnings start to appear.

When the recognition threshold was set to twice the median and executed, the result was as follows.

Q1: [1]
Q2: [2]
Q3: [3]
Q4: [4 7] ##Multiple answers##
Q5: [5]
Q6: [6]
Q7: [7]

When the recognition threshold was set to 50 times the median and executed, the result was as follows.

Q1: [1]
Q2: **Unanswered**
Q3: **Unanswered**
Q4: [4]
Q5: **Unanswered**
Q6: [6]
Q7: **Unanswered**

Summary

As long as the mark sheet is designed properly, you can build a mark sheet reader with reasonable recognition accuracy fairly easily by combining it with a document scanner. The sample introduced here is only a sample, so it does not handle multiple scanned images, but I don't think it would be difficult to extend it to process all the image files in a folder.

If the layout of the mark sheet changes, the script has to be modified, but since the layout rarely changes in my environment, there is almost no need to touch it once the thresholds have been adjusted over the first few trial runs. In fact, I have been processing tens to hundreds of sheets per week for more than a year using this OMR with some functions added, and I have only changed the settings a few times in that period.
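One possible way to handle a whole folder is sketched below. It assumes the recognition steps above have been wrapped in a function that takes a file path and returns the result for that sheet; that function is not part of this article, so here it is passed in as the hypothetical argument recognize.

```python
import glob
import os

def process_folder(folder, recognize, pattern='*.jpg'):
    """Apply `recognize` to every matching file in `folder`, in file-name order.

    `recognize` is a hypothetical callable: path -> recognition result.
    Returns a dict mapping file name to result.
    """
    results = {}
    for path in sorted(glob.glob(os.path.join(folder, pattern))):
        results[os.path.basename(path)] = recognize(path)
    return results

# Usage sketch (read_sheet would wrap the cropping and mark recognition steps):
# all_results = process_folder('scans', read_sheet)
```

Passing the recognition function as an argument keeps the folder loop independent of the sheet layout, so only the recognition function changes when the layout does.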