Automatic data capture process

Version 1 (Alexis Bienvenüe, 12/18/2015 04:08 pm)

h1. Automatic data capture process

h2. Conversion to one bitmap image per page

h3. From a PDF scan

Most often, a PDF scan is only a container holding several one-page bitmap images. AMC therefore tries to extract those images using @pdfimages@, which is quite fast.
This can fail if, for example, the scans were modified (annotations, drawings, cropping): in that case the original image embedded in the PDF file is left untouched, and the new material is added to the PDF container to be drawn on top of it. The conversion from PDF to bitmap images should then not rely on @pdfimages@. You can use the _Scan/Scans conversion/force conversion_ option to tell AMC to use ImageMagick's @convert@ instead of @pdfimages@.
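
The choice between the two tools can be sketched as follows. This is only an illustration: @build_command@ and @extract_pages@ are hypothetical helpers, and the flags shown (@-png@ for @pdfimages@, @-density@ for @convert@) are ordinary options of those tools, not necessarily the ones AMC passes.

```python
import subprocess
from pathlib import Path


def build_command(pdf_path, out_dir, force_conversion=False, density=300):
    """Return the argv list for turning a PDF scan into page images.

    Hypothetical helper: the output names and the density value are
    assumptions made for this sketch.
    """
    out_dir = Path(out_dir)
    if force_conversion:
        # Rasterise each page, so material drawn on top is included.
        return ["convert", "-density", str(density), str(pdf_path),
                str(out_dir / "page-%04d.png")]
    # Copy the embedded bitmaps out of the PDF container.
    return ["pdfimages", "-png", str(pdf_path), str(out_dir / "page")]


def extract_pages(pdf_path, out_dir, force_conversion=False):
    """Run the chosen tool; raises CalledProcessError if it fails."""
    Path(out_dir).mkdir(parents=True, exist_ok=True)
    subprocess.run(build_command(pdf_path, out_dir, force_conversion), check=True)
```

Calling @extract_pages("scan.pdf", "pages", force_conversion=True)@ would mimic the effect of the _force conversion_ option in this sketch.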

h2. Conversion to black and white

The scans are converted to pure black and white (each pixel becomes either black or white, with no color and no gray) using OpenCV. The _Scan/Scans conversion/Black&white conversion threshold_ parameter sets the threshold used for this conversion.
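
As a rough illustration of what thresholding does, here is a minimal sketch in plain Python rather than OpenCV. It assumes that the image is a grid of gray levels from 0 to 255 and that the threshold is a fraction of full brightness; both are assumptions of this example, and 0.6 is only a sample value.

```python
def to_black_and_white(gray, threshold=0.6):
    """Binarise a grayscale image given as rows of 0-255 values.

    Returns rows of booleans where True means a black pixel.
    Assumption: a pixel is black when its brightness is strictly
    below threshold * 255.
    """
    cutoff = threshold * 255
    return [[value < cutoff for value in row] for row in gray]


page = [
    [255, 200, 30],
    [140, 160, 0],
]
print(to_black_and_white(page, threshold=0.6))
```

Raising the threshold turns more light-gray pixels black, which helps with faint pencil marks but also darkens scanning noise.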

h2. Corner marks detection

AMC first locates the four corner marks (circles) on the scan:

# AMC computes the expected mark size E, in pixels, from the scan size in pixels and from the size of the marks relative to the page in the subject.
# AMC selects, from the scan, the black connected components whose size falls between E×(1-d) and E×(1+i), where d and i are the values of the configuration options _Scan/Detection parameters/Marks size decrease_ and _Scan/Detection parameters/Marks size increase_. Among these, the four components nearest to the corners of the page are retained as the corner marks.
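
The two steps above can be sketched as follows. The representation of a connected component as an @(x, y, size)@ tuple and the function name @select_corner_marks@ are assumptions made for this example; the text does not say how AMC stores components or measures their size.

```python
import math


def select_corner_marks(components, expected_size, decrease, increase,
                        page_w, page_h):
    """Pick the candidate marks nearest to the four page corners.

    components: list of (x, y, size) tuples, one per black connected
    component (hypothetical representation: centre and size in pixels).
    """
    lo = expected_size * (1 - decrease)
    hi = expected_size * (1 + increase)
    # Keep only components whose size is plausible for a corner mark.
    candidates = [c for c in components if lo <= c[2] <= hi]
    if not candidates:
        raise ValueError("no component has a plausible mark size")

    corners = {
        "top_left": (0, 0),
        "top_right": (page_w, 0),
        "bottom_left": (0, page_h),
        "bottom_right": (page_w, page_h),
    }
    marks = {}
    for name, (cx, cy) in corners.items():
        nearest = min(candidates,
                      key=lambda c: math.hypot(c[0] - cx, c[1] - cy))
        marks[name] = (nearest[0], nearest[1])
    return marks
```

Widening the size window (larger d or i) makes detection more tolerant of badly printed marks, at the cost of letting more unrelated blobs compete for each corner.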

h2. Distortion

AMC maps positions (coordinates) on the scan to positions on the subject using the most accurate linear transform that maps the corner mark positions detected on the scan to the corner mark positions on the subject. This ensures that distortions such as rotations, translations and scalings, which occur while printing or scanning the papers, are corrected by AMC.
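
One way to read "most accurate" is as a least-squares fit. The sketch below makes that assumption, and also assumes the transform is affine (a linear part plus a translation), since translations are among the distortions listed. The function names are hypothetical and this is not necessarily how AMC computes the transform.

```python
def solve3(m, rhs):
    """Solve the 3x3 system m * p = rhs by Gauss-Jordan elimination."""
    a = [row[:] + [r] for row, r in zip(m, rhs)]
    n = 3
    for col in range(n):
        pivot = max(range(col, n), key=lambda r: abs(a[r][col]))
        if abs(a[pivot][col]) < 1e-12:
            raise ValueError("marks are collinear; transform is undetermined")
        a[col], a[pivot] = a[pivot], a[col]
        for r in range(n):
            if r != col:
                factor = a[r][col] / a[col][col]
                a[r] = [v - factor * p for v, p in zip(a[r], a[col])]
    return [a[i][n] / a[i][i] for i in range(n)]


def fit_affine(scan_pts, subject_pts):
    """Least-squares affine map from scan to subject coordinates."""
    rows = [(x, y, 1.0) for x, y in scan_pts]
    # Normal equations (A^T A) p = A^T t, shared by both output coordinates.
    ata = [[sum(r[i] * r[j] for r in rows) for j in range(3)]
           for i in range(3)]
    coeffs = []
    for k in range(2):
        atb = [sum(r[i] * t[k] for r, t in zip(rows, subject_pts))
               for i in range(3)]
        coeffs.append(solve3(ata, atb))
    return coeffs  # [[a, b, c], [d, e, f]]


def apply_affine(coeffs, point):
    """Map one (x, y) point through the fitted transform."""
    x, y = point
    return tuple(a * x + b * y + c for a, b, c in coeffs)
```

With four marks and six unknown coefficients the system is overdetermined, so small errors in individual mark positions are averaged out rather than propagated exactly.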

h2. Ticked boxes

For each box on the subject, AMC shrinks the box by a ratio given by _Scan/Detection parameters/Measured box proportion_, so that the box border is not taken into account. AMC then maps this shrunken box onto the scan and examines all the pixels inside it. AMC computes the _darkness ratio_, that is, the number of black pixels divided by the total number of pixels in the shrunken box.
The box is finally considered ticked if the _darkness ratio_ is greater than the _darkness threshold_ (option _Project/Automatic data capture/Darkness threshold_).
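
The decision rule can be sketched as follows. For simplicity this example assumes that the box is already expressed in scan coordinates and is axis-aligned, and that the proportion is the fraction of each side that is kept; these are assumptions of the sketch, as are the function names.

```python
def shrink_box(box, proportion):
    """Shrink (x0, y0, x1, y1) about its centre, keeping `proportion` of each side."""
    x0, y0, x1, y1 = box
    dx = (x1 - x0) * (1 - proportion) / 2
    dy = (y1 - y0) * (1 - proportion) / 2
    return (x0 + dx, y0 + dy, x1 - dx, y1 - dy)


def darkness_ratio(black, box):
    """Fraction of black pixels among the pixels whose centre lies in `box`.

    black: rows of booleans, True meaning a black pixel.
    """
    x0, y0, x1, y1 = box
    total = dark = 0
    for y, row in enumerate(black):
        for x, is_black in enumerate(row):
            if x0 <= x + 0.5 <= x1 and y0 <= y + 0.5 <= y1:
                total += 1
                dark += is_black
    return dark / total if total else 0.0


def is_ticked(black, box, proportion, darkness_threshold):
    """Apply the rule: ticked when the darkness ratio exceeds the threshold."""
    ratio = darkness_ratio(black, shrink_box(box, proportion))
    return ratio > darkness_threshold
```

Shrinking the box before counting keeps the printed border out of the measurement, so an empty box yields a ratio close to zero rather than a value that depends on line thickness.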