Automatic data capture process

Conversion to one bitmap image per page

From a PDF scan

Most often, a PDF scan is just a container holding one bitmap image per page. AMC therefore tries to extract those images with pdfimages, which is quite fast.
This can fail if, for example, the scans were modified (annotations, drawings, cropping): in that case the original image embedded in the PDF file is left untouched, and the modifications are stored as extra material in the PDF container, to be drawn on top of it. pdfimages would then return the unmodified image, so the conversion from PDF to bitmap images should not rely on it. Use the Scan/Scans conversion/force conversion option to tell AMC to use ImageMagick's convert instead of pdfimages.
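
The choice between the two tools can be pictured with the short sketch below. It only builds the command lines and is not AMC's actual invocation: the function name, the output naming pattern and the 300 dpi density are assumptions made for illustration.

```python
def extraction_command(pdf_path, out_prefix, force_conversion=False, density=300):
    """Build the argument list that turns a PDF scan into per-page images.

    Hypothetical helper: AMC's real command lines may differ.
    """
    if force_conversion:
        # Rasterise every page, so that annotations drawn on top of the
        # embedded image end up in the result (slower).
        return ["convert", "-density", str(density), pdf_path,
                f"{out_prefix}-%04d.png"]
    # Pull the embedded images out of the container as they are stored (fast).
    return ["pdfimages", "-all", pdf_path, out_prefix]


fast = extraction_command("scan.pdf", "page")
forced = extraction_command("scan.pdf", "page", force_conversion=True)
```

Either list could then be passed to subprocess.run(cmd, check=True), provided the corresponding tool is installed.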

Conversion to black and white

The scans are converted to black and white (only black and white pixels: no color, no gray) using OpenCV. The Scan/Scans conversion/Black&white conversion threshold parameter determines the threshold used for this conversion.
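
As a rough illustration of what such a threshold does, here is a minimal pure-Python sketch. It does not use OpenCV, and it assumes that the threshold is a fraction of the maximum gray level and that pixels darker than the cutoff become black; the value 0.6 is only an example.

```python
def to_black_and_white(gray, threshold=0.6, max_value=255):
    """Return a grid of booleans, True meaning a black pixel.

    `gray` is a list of rows of gray levels (0 = black, max_value = white).
    Assumption: the threshold is a fraction of max_value.
    """
    cutoff = threshold * max_value
    return [[pixel < cutoff for pixel in row] for row in gray]


scan = [[250, 130, 20],
        [200, 160, 0]]
bw = to_black_and_white(scan, threshold=0.6)
# With a cutoff of 153, bw is [[False, True, True], [False, False, True]]
```

Raising the threshold turns more gray pixels black, which helps with faint pencil marks but also keeps more scanning noise.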

Corner marks detection

AMC first locates the four corner marks (circles) on the scan:

  1. AMC computes the expected mark size E in pixels, from the scan size in pixels and from the size of the marks relative to the page, as given by the subject.
  2. AMC tries to remove noisy pixels and handwriting from the scan, so that they do not interfere with the mark detection.
  3. AMC selects, from the scan, the black connected components whose size falls between E×(1-d) and E×(1+i) (d and i are the values of the configuration options Scan/Detection parameters/Marks size decrease and Scan/Detection parameters/Marks size increase). Among those components, the four that lie nearest to the corners of the page are retained, and their positions are used as the corner mark positions.
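
Steps 1 and 3 can be sketched as follows. This is an illustration under assumptions, not AMC's code: each connected component is reduced to a centre and a single scalar size, and all function and field names are hypothetical. Step 2 (noise removal) is left out.

```python
import math


def expected_mark_size(scan_width_px, mark_width_ratio):
    """Step 1: expected mark size E, from the scan size and the mark size
    relative to the page (assumed here to be a fraction of the page width)."""
    return scan_width_px * mark_width_ratio


def select_corner_marks(components, expected, decrease, increase, page_w, page_h):
    """Step 3: keep components of plausible size, then take the one nearest
    to each page corner (top-left, top-right, bottom-left, bottom-right)."""
    low, high = expected * (1 - decrease), expected * (1 + increase)
    candidates = [c for c in components if low <= c["size"] <= high]
    if len(candidates) < 4:
        raise ValueError("fewer than four components of the expected size")
    corners = [(0, 0), (page_w, 0), (0, page_h), (page_w, page_h)]
    marks = []
    for cx, cy in corners:
        best = min(candidates,
                   key=lambda c: math.hypot(c["x"] - cx, c["y"] - cy))
        marks.append((best["x"], best["y"]))
    return marks


components = [
    {"x": 50, "y": 50, "size": 40},
    {"x": 950, "y": 48, "size": 42},
    {"x": 52, "y": 1350, "size": 39},
    {"x": 948, "y": 1352, "size": 41},
    {"x": 500, "y": 700, "size": 41},  # blob of the right size, far from any corner
    {"x": 10, "y": 10, "size": 5},     # speck of noise, too small
]
E = expected_mark_size(1000, 0.04)
marks = select_corner_marks(components, E, 0.2, 0.2, page_w=1000, page_h=1400)
```

The speck at (10, 10) is closer to the top-left corner than the real mark, but it is rejected by the size filter before the nearest-corner search, which is why the two tolerance options matter.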

Distortion

AMC maps the positions (coordinates) on the scan to the positions on the subject using the most accurate linear transform that maps the detected corner mark positions on the scan to the corner mark positions on the subject. This ensures that distortions such as rotations, translations and scalings (which occur while printing or scanning the papers) are corrected by AMC.
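
One way to read "most accurate" is a least-squares fit. The sketch below assumes an affine model with six parameters (x' = a·x + b·y + c, y' = d·x + e·y + f) fitted to the four mark pairs; the model, the function names and the sample coordinates are illustrative assumptions, not taken from AMC's source.

```python
def solve3(matrix, rhs):
    """Solve a 3x3 linear system by Gauss-Jordan elimination with pivoting."""
    m = [row[:] + [value] for row, value in zip(matrix, rhs)]
    for col in range(3):
        pivot = max(range(col, 3), key=lambda r: abs(m[r][col]))
        m[col], m[pivot] = m[pivot], m[col]
        for r in range(3):
            if r != col:
                factor = m[r][col] / m[col][col]
                m[r] = [v - factor * p for v, p in zip(m[r], m[col])]
    return [m[i][3] / m[i][i] for i in range(3)]


def fit_affine(src, dst):
    """Least-squares affine transform sending the src points to the dst points.

    Returns [[a, b, c], [d, e, f]].
    """
    rows = [(x, y, 1.0) for x, y in src]
    # Normal equations: (M^T M) p = M^T target, solved once per output coordinate.
    normal = [[sum(r[i] * r[j] for r in rows) for j in range(3)] for i in range(3)]
    coeffs = []
    for k in range(2):
        rhs = [sum(r[i] * d[k] for r, d in zip(rows, dst)) for i in range(3)]
        coeffs.append(solve3(normal, rhs))
    return coeffs


def apply_affine(coeffs, point):
    x, y = point
    return tuple(a * x + b * y + c for a, b, c in coeffs)


# Corner marks as detected on the scan (pixels) and as placed on the subject.
scan_marks = [(53, 48), (1003, 48), (53, 1433), (1003, 1433)]
subject_marks = [(10, 10), (200, 10), (10, 287), (200, 287)]
scan_to_subject = fit_affine(scan_marks, subject_marks)
centre = apply_affine(scan_to_subject, (528, 723))  # approximately (105.0, 145.0)
```

With four mark pairs and six unknowns the system is overdetermined, so a small error in one detected mark is averaged out rather than followed exactly.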

Ticked boxes

For each box on the subject, AMC shrinks it by the ratio given by Scan/Detection parameters/Measured box proportion, so that the box border is not taken into account. AMC then maps this shrunken box onto the scan and examines all the pixels inside it. AMC computes the darkness ratio, that is, the number of black pixels divided by the total number of pixels in the shrunken box.
The box is finally considered ticked if the darkness ratio is larger than the darkness threshold (option Project/Automatic data capture/Darkness threshold).
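
The measurement can be sketched as follows. For simplicity the box is assumed to be axis-aligned and already expressed in scan pixel coordinates, so the mapping step is skipped; the function names, the 0.8 proportion and the 0.15 threshold are example assumptions.

```python
import math


def shrink_box(box, proportion):
    """Shrink a box (x0, y0, x1, y1) around its centre by the given proportion."""
    x0, y0, x1, y1 = box
    cx, cy = (x0 + x1) / 2, (y0 + y1) / 2
    half_w = (x1 - x0) / 2 * proportion
    half_h = (y1 - y0) / 2 * proportion
    return (cx - half_w, cy - half_h, cx + half_w, cy + half_h)


def darkness_ratio(bw, box):
    """Number of black pixels over number of pixels inside the box.

    `bw` is a grid of booleans (True = black), indexed as bw[row][column].
    """
    x0, y0, x1, y1 = box
    cols = range(math.ceil(x0), math.floor(x1))
    rows = range(math.ceil(y0), math.floor(y1))
    total = len(cols) * len(rows)
    black = sum(1 for r in rows for c in cols if bw[r][c])
    return black / total if total else 0.0


def make_box(ticked):
    """10x10 test image: a black border, plus a cross inside when ticked."""
    return [[r in (0, 9) or c in (0, 9) or (ticked and (r == c or r + c == 9))
             for c in range(10)] for r in range(10)]


inner = shrink_box((0, 0, 10, 10), 0.8)               # (1.0, 1.0, 9.0, 9.0)
ticked_ratio = darkness_ratio(make_box(True), inner)  # 16 / 64 = 0.25
empty_ratio = darkness_ratio(make_box(False), inner)  # 0.0, the border is ignored
darkness_threshold = 0.15
is_ticked = ticked_ratio > darkness_threshold
```

Without the shrinking step, the empty box would score 36/100 because of its printed border and would be read as ticked with this threshold.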