Scans with some garbage pages.

Added by Michaël Cadilhac about 1 month ago

Hi there,

When scans are returned to me by the administration, they sometimes contain extra blank pages, extra instructions for the students, etc. I would like AMC to simply disregard these pages, which I could then quickly review in the Data Capture tab.

I kind of remember that it used to be the case. Now, on a fresh install with a fresh project, this does not work: if there's one page that is not recognized, the whole PDF gets ditched, along with any other PDF submitted. As the dialog box does not contain all the filenames, tracking which files still have a problem is a frustrating game of duck hunt.

See: https://depaul.hosted.panopto.com/Panopto/Pages/Viewer.aspx?id=d5693fe8-b227-451f-8ea7-b0c201662da5

I'm attaching two logs, one with a PDF that is correctly captured, and one with the same PDF except that one of the pages (with no graded item) is not recognized. It seems that 'detect' is not called as much in the 'fail' case, same with the 'layout' module.

I contacted Alexis a month ago, to no avail; hopefully the community can chime in.

This is on Arch Linux.

log-pass.txt (416.2 kB)

log-fail.txt (223 kB)


Replies (12)

RE: Scans with some garbage pages. - Added by Alexis Bienvenüe about 1 month ago

You are using the photocopy mode (some answer sheets were photocopied) with multi-page subjects. In this situation, unrecognized pages can mess up the whole thing, putting together pages from different students. So AMC discards all sheets from a scan file that contains unrecognized pages.
Maybe you don't need the photocopy mode?

RE: Scans with some garbage pages. - Added by Michaël Cadilhac about 1 month ago

Hey Alexis,

Thanks for the prompt response. I do need photocopy mode: I have a bunch of online students, and I upload many variants of the exam for them, but can't guarantee that they are unique at the end of the day.

I have been struggling a lot with this feature for the past year, spending hours not only trying to identify which PDF failed and why, then editing them, but also tracing the code to understand the logic behind discarding PDFs. A message in the error box about why the whole PDF was discarded would be quite informative.

That being said, can you explain why unrecognized pages can "mess up the whole thing"? I have one PDF file per student, so in photocopy mode, I'm expecting that each PDF gets assigned a unique ID. That way, no merger should happen.

As a hot-fix with DO-NOT-CROSS tape, can you point me to the if statement in the code that discards all sheets from a scan if one page is unrecognized, so that I can deactivate it?

Thanks again Alexis!

Cheers,
Michaël

RE: Scans with some garbage pages. - Added by Alexis Bienvenüe about 1 month ago

Michaël Cadilhac wrote:

As a hot-fix with DO-NOT-CROSS tape, can you point me to the if statement in the code that discards all sheets from a scan if one page is unrecognized, so that I can deactivate it?

I think you can remove (or comment out) line 322 of AMC/Project.pm:

push @args, "--unlink-on-global-err" if ( $oo{copy} );

RE: Scans with some garbage pages. - Added by Michaël Cadilhac about 1 month ago

Thanks Alexis! Alas, I see no difference. The log is attached, if that's helpful (one can check that `unlink-on-global-err` is not passed anymore).

log-fail2.txt (220.5 kB)

RE: Scans with some garbage pages. - Added by Alexis Bienvenüe about 1 month ago

My bad.
Could you prepare a toy project with a simple source file producing several pages, and two or three scan files with some extra pages (not blank but with no AMC markers) added?
I'll try another solution.

RE: Scans with some garbage pages. - Added by Alexis Bienvenüe about 1 month ago

Michaël Cadilhac wrote:

That being said, can you explain why unrecognized pages can "mess up the whole thing"?

AMC extracts pages from all the files provided, then tries to group the scans by student.
Suppose the subject has three pages p1, p2 and p3, and we have three students A, B and C.
The scan pages are Ap1 Ap2 Ap3 Bp1 Bp2 Bp3 Cp1 Cp2 Cp3, but AMC can't know whether each scan page is from student A, B or C, so what it sees is the sequence p1 p2 p3 p1 p2 p3 p1 p2 p3.
AMC goes from one scan page to the next, and switches to a new student if the page is already present for the current student:

  • p1 -> new current student A
  • p2: we have no page p2 for current student A, so -> A
  • p3: we have no page p3 for current student A, so -> A
  • p1: we already have a page p1 for current student A, so -> new student B
  • p2: we have no page p2 for current student B, so -> B
  • p3: we have no page p3 for current student B, so -> B
  • p1: we already have a page p1 for current student B, so -> new student C
  • p2: we have no page p2 for current student C, so -> C
  • p3: we have no page p3 for current student C, so -> C

This method allows us to do the right thing even if the pages are shuffled for some particular student (e.g. if front/back are inverted while scanning).

Suppose now that the first scan p1 is not properly detected: X p2 p3 p1 p2 p3 p1 p2 p3.
In this case the reconstructed student pages are X Ap2 Ap3 Ap1 Bp2 Bp3 Bp1 Cp2 Cp3 instead of X Ap2 Ap3 Bp1 Bp2 Bp3 Cp1 Cp2 Cp3, and we should prefer not to go on like that...
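
The grouping rule described above can be sketched in a few lines of Python. This is only an illustration of the rule as explained in this post, not AMC's actual (Perl) code; the names `group_by_student` and `label` are made up for the example, and an unrecognized scan is represented by `None`.

```python
from typing import Optional


def group_by_student(pages: list[Optional[int]]) -> list[Optional[int]]:
    """Assign a student index to each scan page, in scan order.

    `pages` holds the detected page number of each scan (1, 2, 3, ...),
    or None when the scan was not recognized. The result holds the
    student index (0 for A, 1 for B, ...), or None for unrecognized scans.
    """
    assignment: list[Optional[int]] = []
    current = -1            # no current student yet
    seen: set[int] = set()  # pages already present for the current student
    for page in pages:
        if page is None:
            # Unrecognized scan: it cannot be attached to any student.
            assignment.append(None)
            continue
        if current < 0 or page in seen:
            # Page already present for the current student: new student.
            current += 1
            seen = set()
        seen.add(page)
        assignment.append(current)
    return assignment


def label(pages: list[Optional[int]], assignment: list[Optional[int]]) -> str:
    """Render the result in the notation used above (Ap1, Bp2, X, ...)."""
    names = "ABCDEFGHIJKLMNOPQRSTUVWXYZ"
    return " ".join(
        "X" if p is None else f"{names[s]}p{p}"
        for p, s in zip(pages, assignment)
    )


ok = [1, 2, 3, 1, 2, 3, 1, 2, 3]
print(label(ok, group_by_student(ok)))
# Ap1 Ap2 Ap3 Bp1 Bp2 Bp3 Cp1 Cp2 Cp3

bad = [None, 2, 3, 1, 2, 3, 1, 2, 3]  # first p1 not detected
print(label(bad, group_by_student(bad)))
# X Ap2 Ap3 Ap1 Bp2 Bp3 Bp1 Cp2 Cp3
```

The second call reproduces the mix-up: student B's first page ends up attached to student A, and every later student is shifted by one page.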

However, for your setup where there is one scan file per student, the process can be improved to be more robust and tolerant. I should add an option for this.

RE: Scans with some garbage pages. - Added by Michaël Cadilhac about 1 month ago

Alexis, thanks again for your time.

I had sent you the whole project I was working with back in November. I realized only last month that the email did not get through, and I resent it on Feb 14, 2024 (Message-ID: <CADt3fpN4UHc=907_7=QTXG20TJ4J=>). As it contains sensitive data, I preferred to send it privately. Is that OK? Can you find the email?

As for your explanation, thank you, I get it now. I think there are two additional options that can be added there, to supplement the current strategy:
- As you suggested, a 1-to-1 correspondence between PDF files and students. That's my preferred solution, as scans are always provided that way for me. When I scan myself on our photocopier, I also get one PDF per student.
- Do not allow shuffled pages (in which case, a new student is created iff there's a page i after a page j with i <= j). This would also work for me, as I've never had a case where the scanned pages were out of order or repeated. The advantage is that it's easier to implement, but it's less robust than the 1-to-1 correspondence.
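
To make the two options concrete, here is a rough Python sketch of each. It is only my reading of the two rules, not a patch for AMC: the function names, the `(file name, page number)` representation and the file names `alice.pdf` / `bob.pdf` are all hypothetical, and an unrecognized page is represented by `None`. For the second option I assume "page i after page j" means consecutive recognized pages.

```python
from typing import Optional

# (source file name, detected page number or None if unrecognized)
Scan = tuple[str, Optional[int]]


def group_one_file_per_student(scans: list[Scan]) -> dict[str, list[int]]:
    """Option 1: each scan file is exactly one student.

    Unrecognized pages are simply dropped; they cannot shift pages
    from one student to another since the file fixes the student.
    """
    students: dict[str, list[int]] = {}
    for filename, page in scans:
        pages = students.setdefault(filename, [])
        if page is not None:
            pages.append(page)
    return students


def group_in_order(pages: list[Optional[int]]) -> list[Optional[int]]:
    """Option 2: no shuffled pages allowed.

    A new student starts whenever a recognized page number is not
    strictly greater than the previous recognized one (i <= j).
    Unrecognized pages (None) are skipped.
    """
    assignment: list[Optional[int]] = []
    current = -1
    last: Optional[int] = None
    for page in pages:
        if page is None:
            assignment.append(None)
            continue
        if last is None or page <= last:
            current += 1
        last = page
        assignment.append(current)
    return assignment


scans = [
    ("alice.pdf", None), ("alice.pdf", 1), ("alice.pdf", 2),
    ("bob.pdf", 1), ("bob.pdf", None), ("bob.pdf", 2),
]
print(group_one_file_per_student(scans))
# {'alice.pdf': [1, 2], 'bob.pdf': [1, 2]}

print(group_in_order([None, 2, 3, 1, 2, 3, 1, 2, 3]))
# [None, 0, 0, 1, 1, 1, 2, 2, 2]
```

With the second rule, the sequence X p2 p3 p1 p2 p3 p1 p2 p3 is grouped as X Ap2 Ap3 Bp1 Bp2 Bp3 Cp1 Cp2 Cp3, at the price of breaking as soon as a student's pages are scanned out of order.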

Cheers,
Michaël

RE: Scans with some garbage pages. - Added by Alexis Bienvenüe about 1 month ago

I would prefer a very small toy project with source file and several scans, that could be used for tests in the future.

RE: Scans with some garbage pages. - Added by Michaël Cadilhac about 1 month ago

Sorry, it took me quite a bit of time to shotgun debug this one. In the end, I was able to use a template tex file and just add one \newpage to it. The project is here:

https://michael.cadilhac.name/private/amc-project-rejected-pdfs.tar.bz2 (filesize 1.1M)

The exam files are in the same folder, as exam.pdf and exam-1.pdf.

The curious fact is that if I don't have the \newpage after the name field, but move it, say, to the end of the document, then even though the scans still trigger the error message ("No layout for…"), the pages that were recognized do appear. That is regardless of whether

push @args, "--unlink-on-global-err" if ( $oo{copy} );

was removed from Project.pm.

I'm inclined to think this is a bug, but again, my expertise with the intricacies of AMC is very limited.

RE: Scans with some garbage pages. - Added by Alexis Bienvenüe about 1 month ago

Thanks.

Michaël Cadilhac wrote:

The curious fact is that if I don't have the \newpage after the name field, but move it, say, at the end of the document

Maybe with a \newpage at the end there is only one page where the students are expected to write something?
In this case the photocopy mode can't be messy, so the unrecognized pages can be simply discarded by AMC with a warning.

RE: Scans with some garbage pages. - Added by Michaël Cadilhac about 1 month ago

Alexis Bienvenüe wrote:

Maybe with a \newpage at the end there is only one page where the students are expected to write something?
In this case the photocopy mode can't be messy, so the unrecognized pages can be simply discarded by AMC with a warning.

I'm not sure: The exam test files only have one faulty page, and that's the added page at the beginning which is not part of the exam. So with or without the \newpage, all the pages with answers are there and correctly scanned (I expect).

RE: Scans with some garbage pages. - Added by Alexis Bienvenüe about 1 month ago

Michaël Cadilhac wrote:

1-to-1 correspondence between PDF files and students

This is now available in the development version:

multi-scan-mode.png (116.7 kB)
