I wrote a Java program that runs OCR on the image files embedded in a PDF. At first I tried to implement it in Python, but I couldn't work out the dependencies of the libraries involved, so I switched to Java, which I'm used to.
- The PDF analysis sample programs I found all used so many loops that the source was hard to read, so I tried writing this one with map and reduce as much as possible.
- Is there a smarter way to convert an Iterator to a Stream ...
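On the Iterator-to-Stream question, one possible shortcut: anything that implements `Iterable` already has a default `spliterator()` method, so the `Spliterators.spliteratorUnknownSize(...)` wrapper is only needed for a bare `Iterator`. The sketch below uses plain JDK collections. `IterableStreams`, `fromIterable` and `fromIterator` are hypothetical names made up for illustration. Whether it applies here depends on the PDFBox objects being `Iterable` in your version (in PDFBox 2.x, `PDPageTree` and the result of `getXObjectNames()` should be), so treat that part as an assumption to check.

```java
import java.util.Arrays;
import java.util.Iterator;
import java.util.List;
import java.util.Spliterator;
import java.util.Spliterators;
import java.util.stream.Collectors;
import java.util.stream.Stream;
import java.util.stream.StreamSupport;

// Hypothetical helper, not part of PDFBox or the original program.
final class IterableStreams {
    private IterableStreams() {}

    // Any Iterable provides a default spliterator(), so no wrapper is needed.
    static <T> Stream<T> fromIterable(Iterable<T> source) {
        return StreamSupport.stream(source.spliterator(), false);
    }

    // A bare Iterator still has to be wrapped in a Spliterator first.
    static <T> Stream<T> fromIterator(Iterator<T> source) {
        return StreamSupport.stream(
                Spliterators.spliteratorUnknownSize(source, Spliterator.ORDERED), false);
    }

    public static void main(String[] args) {
        // Stand-in data; in the real program this would be pages or XObject names.
        List<String> names = Arrays.asList("Im0", "Im1", "Im2");

        String viaIterable = fromIterable(names).collect(Collectors.joining(","));
        String viaIterator = fromIterator(names.iterator()).collect(Collectors.joining(","));

        System.out.println(viaIterable);
        System.out.println(viaIterator);
    }
}
```

If the assumption holds, a call such as `fromIterable(document.getDocumentCatalog().getPages())` would replace the three-line conversion in `main`.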
PDFMaker.java
package pdf;
import java.io.File;
import java.nio.charset.Charset;
import java.nio.file.Files;
import java.nio.file.Paths;
import java.nio.file.StandardOpenOption;
import java.util.Spliterator;
import java.util.Spliterators;
import java.util.stream.Collectors;
import java.util.stream.Stream;
import java.util.stream.StreamSupport;
import org.apache.pdfbox.cos.COSName;
import org.apache.pdfbox.pdmodel.PDDocument;
import org.apache.pdfbox.pdmodel.PDPage;
import org.apache.pdfbox.pdmodel.PDResources;
import org.apache.pdfbox.pdmodel.graphics.PDXObject;
import org.apache.pdfbox.pdmodel.graphics.image.PDImageXObject;
public class PDFMaker {
	public static void main(String[] args) throws Exception {
		//PDF file to read; try-with-resources closes the document when done
		try (PDDocument document = PDDocument.load(new File("C:/python/img/pdf/2017h29a_sc_pm2_qs.pdf"))) {
			//Process PDF pages
			Stream<PDPage> stream = StreamSupport.stream(Spliterators.spliteratorUnknownSize(
					document.getDocumentCatalog().getPages().iterator(),
					Spliterator.ORDERED), false);

			System.out.println("start");
			//Run OCR page by page and write one entry per page.
			//TRUNCATE_EXISTING stops leftovers of an older, longer parse.txt from surviving.
			Files.write(Paths.get("parse.txt"),
					stream.map(s -> exePDFpage(s)).collect(Collectors.toList()),
					Charset.forName("MS932"),
					StandardOpenOption.CREATE,
					StandardOpenOption.TRUNCATE_EXISTING);
			System.out.println("end");
		}
	}
	
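	//Process one PDF page: OCR every image XObject on it and concatenate the text
	public static String exePDFpage(PDPage p){
		PDResources resources = p.getResources();
		if(resources == null){
			//A page without resources has nothing to OCR
			return "";
		}
		Stream<COSName> stream = StreamSupport.stream(Spliterators.spliteratorUnknownSize(
				resources.getXObjectNames().iterator(), Spliterator.ORDERED), false);
		//joining() returns "" for a page with no XObjects, where reduce(...).get() would throw
		return stream.map(s -> exeImage(s, resources))
				.collect(Collectors.joining());
	}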
	
	//Run OCR on the named XObject if it is an image; anything else yields ""
	public static String exeImage(COSName n, PDResources resources){
		try{
			PDXObject xobject = resources.getXObject(n);
			if(xobject instanceof PDImageXObject){
				//Reuse the object already fetched instead of looking it up a second time
				PDImageXObject image = (PDImageXObject) xobject;
				return PDFtoImg.extractFromPDF(image.getImage());
			}
			return "";
		}catch(Exception e){
			e.printStackTrace();
			return "";
		}
	}
}
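A note on combining the per-image strings: a page with no XObjects produces an empty stream, and `reduce` without an identity value returns an empty `Optional` in that case, so calling `get()` on it would throw `NoSuchElementException`. The following is a minimal sketch with plain strings standing in for OCR results. `JoinDemo` and its method names are hypothetical.

```java
import java.util.Arrays;
import java.util.Collections;
import java.util.List;
import java.util.Optional;
import java.util.stream.Collectors;

// Hypothetical demo class, not part of the original program.
final class JoinDemo {
    private JoinDemo() {}

    // reduce without an identity value: the result is empty for an empty list.
    static Optional<String> reduceConcat(List<String> parts) {
        return parts.stream().reduce((a, b) -> a + b);
    }

    // Collectors.joining() always returns a String, "" for an empty list.
    static String joinAll(List<String> parts) {
        return parts.stream().collect(Collectors.joining());
    }

    public static void main(String[] args) {
        List<String> noImages = Collections.emptyList();
        List<String> twoImages = Arrays.asList("first image text", "second image text");

        System.out.println(reduceConcat(noImages).isPresent());
        System.out.println(joinAll(noImages).isEmpty());
        System.out.println(joinAll(twoImages));
    }
}
```

`Collectors.joining()` also builds the result with a single internal buffer, which tends to matter when many long OCR strings are concatenated.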
PDFtoImg.java
package pdf;
import java.awt.image.BufferedImage;
import net.sourceforge.tess4j.ITesseract;
import net.sourceforge.tess4j.Tesseract;
import net.sourceforge.tess4j.TesseractException;
public class PDFtoImg {
	//Directory that holds the Tesseract language data
	private static final String DICTIONARY_PATH = "C:/Users/takayoshi/workspace/PDF/tessdata";

	//Run Japanese OCR on one image; returns "" if recognition fails
	public static String extractFromPDF(BufferedImage img) {
		ITesseract instance = new Tesseract();
		try {
			instance.setLanguage("jpn");
			instance.setDatapath(DICTIONARY_PATH);
			return instance.doOCR(img);
		} catch (TesseractException ex) {
			ex.printStackTrace();
			return "";
		}
	}
}
- Analyzing a question booklet from the Information Processing Security Supporter examination (SC), one of the Information Processing Engineer examinations, takes about 10 to 20 minutes.
- The analysis result of the first page is as follows:
Fall 2017
{D`
Information processing woman all securing supporter examination
No
afternoon=problem
Test time-4:30 ~-6:30 (2 hours)
Notes
-・ The start and end of the test,The supervisor's clock is the standard. Follow the instructions of the supervisor 〟
2~Until there is a signal to start the test,Do not open the question booklet and look inside.
3~Entering the examination number etc. on the answer sheet,Please start after the signal to start the test.
4_The problem is,Please answer according to the table of Fuki.
Considering that the OCR engine is free, I think the Japanese recognition is pretty good.