processing overwritten pages seems to be much slower ...

Added by S. McKay 6 months ago

Hi,
I notice that when I process a page that has already been processed, the data capture is much slower than the original capture. This is a big deal to me because in my workflow, I run the test through AMC, create grading files for the questions that need manual grading, and then once grading is done, process those files again. This results in many overwritten pages.

Is there a way to turn off the tracking of overwritten pages? Right now, it's not that helpful to me and slows my workflow down significantly at the end of grading.
Obviously I didn't have this problem in 1.2.1. I am using 1.4.0, but seem to remember it occurring in version 1.3.

If there's no way to turn it off, I'll try to learn patience :-)

If there's an easy way to turn it off, sorry for wasting your time. (I did look in preferences, though).

Thanks,


Replies (7)

RE: processing overwritten pages seems to be much slower ... - Added by Alexis Bienvenüe 6 months ago

I thought handling the overwritten counter should not slow down the overall process too much.
Can you try the following options and compare the time spent on the same process?
  1. no change
  2. create an index in the database: in the project directory,
    sqlite3 data/capture.sqlite 'create index capture_index_page on capture_page(student,page,copy);'
    
  3. change line 70 of AMC-analyse.pl to be
    my $tag_overwritten=0;
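
For option 2, one way to check that SQLite actually picks up the new index is to look at the query plan. The following is only a sketch in Python: it uses an in-memory stand-in table with just the three indexed columns (an assumption, the real capture_page table in data/capture.sqlite has more columns), and the lookup query is illustrative rather than the one AMC itself runs.

```python
import sqlite3

# Assumption: a stand-in table holding only the three indexed columns.
conn = sqlite3.connect(":memory:")
conn.execute(
    "CREATE TABLE capture_page (student INTEGER, page INTEGER, copy INTEGER)"
)

# Same index as in option 2 above.
conn.execute(
    "CREATE INDEX IF NOT EXISTS capture_index_page "
    "ON capture_page(student, page, copy)"
)

# Ask SQLite how it would run a lookup on (student, page, copy).
plan = conn.execute(
    "EXPLAIN QUERY PLAN "
    "SELECT * FROM capture_page WHERE student=? AND page=? AND copy=?",
    (1, 1, 0),
).fetchall()

# The last column of each plan row is the human-readable detail.
plan_text = " ".join(str(row[-1]) for row in plan)
print(plan_text)
conn.close()
```

If the index name shows up in the printed plan, lookups by student, page and copy are using it.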
    

RE: processing overwritten pages seems to be much slower ... - Added by S. McKay 6 months ago

Ok,
I ran some partial timings. They are a little rough, because I timed them with a clock while AMC ran. I only timed the analyse, not the splitting of the files or the copying.

This was an exam with 82 tests, 18 pages (some of the pages were blank so students could have more space).
Processing the scanned exams: 1476 pages, 204 unrecognized pages (testing center cover sheets and scratch paper). The analyse took 12 minutes 30 seconds. This was about half a second per page.

Processing the graded questions: 6 pages per exam, a total of 492 pages. It took one hour 8 minutes, for a little under 3 seconds per page. Thus overwritten pages take 6 times as long to process.
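
For anyone repeating this comparison, the per-page figures can be recomputed from the totals. A small sketch in Python follows (the function name is just for illustration). With the totals exactly as quoted, the second pass comes out well above 3 seconds per page, so the per-page numbers above may have come from a rougher or different measurement.

```python
def seconds_per_page(minutes, seconds, pages):
    """Average wall-clock seconds spent per page."""
    return (minutes * 60 + seconds) / pages

# Totals as quoted in the post.
first_pass = seconds_per_page(12, 30, 1476)   # original scans
second_pass = seconds_per_page(68, 0, 492)    # graded pages, 1 h 8 min

print(f"first pass:  {first_pass:.2f} s/page")
print(f"second pass: {second_pass:.2f} s/page")
print(f"ratio:       {second_pass / first_pass:.1f}x")
```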

I created an index using the command you gave me. There appeared to be no change. I did not time this, but it seemed about the same speed.

I changed line 70 of AMC-analyse.pl, and the only change was that the overwritten statement did not show up, but there did not appear to be much of a change. I was over 20 minutes in and less than halfway through. I did not time that fully because it takes so long.

It's interesting because something else is slowing it down. I thought the reporting on the scans was doing it, but it apparently is not.

Any other ideas?

Thanks,

RE: processing overwritten pages seems to be much slower ... - Added by S. McKay 6 months ago

Ok,
I just had an idea. When I get the graded files back, they are annotated pdfs. When I run them through AMC directly, I lose all annotations, so I have a script that uses ghostscript to break the pdf's apart into individual jpg's.
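
The script itself is not shown here, so the following is only a sketch of what such a split might look like, written in Python. The file names and the function name are hypothetical; the Ghostscript options used are the standard ones for batch rendering to JPEG at a given resolution.

```python
import subprocess

def gs_split_command(pdf_path, out_pattern="page-%03d.jpg", dpi=300):
    """Build a Ghostscript command that renders each PDF page to a JPEG."""
    return [
        "gs",
        "-dNOPAUSE", "-dBATCH", "-dSAFER",   # run non-interactively
        "-sDEVICE=jpeg",                     # one JPEG per page
        f"-r{dpi}",                          # rendering resolution in dpi
        f"-sOutputFile={out_pattern}",       # %03d becomes the page number
        pdf_path,
    ]

cmd = gs_split_command("graded.pdf")         # hypothetical input file
print(" ".join(cmd))

# To actually run it (requires Ghostscript installed as "gs"):
# subprocess.run(cmd, check=True)
```

Note that the output size in pixels depends on the PDF's page dimensions as well as on the dpi setting.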

When I run the same pdf through a second time, it runs at about the same speed that it did the first time. So it must be the jpg format that is giving it problems. So, if this is a problem with AMC, it's with the handling of the jpgs.

I am wondering however if it is a resolution problem. I think the jpg's are at 300 dpi. The difference between 300 dpi and 200 dpi shouldn't be 6 fold, however.

I'm going to run a couple of tests:
test 1: stitch the jpgs back into a pdf and run it through to see if that improves the time.
test 2: try to split the pdfs into ppm(?) format that AMC normally uses.

Unless you have another idea. I am 90 percent certain that this did not happen in 1.2.1. I did not start to see a slowdown until 1.3 or 1.4. I have been using jpg's the entire time.

RE: processing overwritten pages seems to be much slower ... - Added by Alexis Bienvenüe 6 months ago

Can you test automatic data capture from the original scans a second time, to see if the slowdown comes from the overwrite or from the different scan format?

Maybe off-topic: what is your use-case? Can you consider using manual data capture selecting a specific target question?

RE: processing overwritten pages seems to be much slower ... - Added by S. McKay 6 months ago

I did that, and there is no slowdown. That is why I suspect that the jpg format is slowing it down.

I took the jpg's, and put them together in a pdf using Preview (on a Mac). I then ran that through AMC, but AMC choked on it - it hung at the splitting process. There's something weird I guess with the pdf that Preview makes.
I then opened the pdf in Adobe Acrobat, and resaved it as an optimized pdf. (That took a long time). The pdf ran quickly through AMC, however, quicker than the originals. This is not a viable solution because the optimized save took so long, but it is an interesting use case.

For your other question, this is my use case.
We have hundreds, sometimes over a thousand students taking a particular exam, which has multiple choice and free response elements.
Once the students have taken the exam, we scan them, and run them through AMC. I have a python script which then accesses the sqlite databases, and pulls out the pages for each individual free response question. They are saved in different pdfs which are then sent to the graders who grade them on tablets. Once the graders are finished grading them, I split them into jpg's and run the jpgs through AMC. I do this because running them through AMC directly will remove the annotations the graders left.

So, no, I cannot do this manually. It would take much longer than the process does now.

I am working on a couple of tests to run to see if I can get speedup. Hopefully I can get time to do those soon.

Thanks,

RE: processing overwritten pages seems to be much slower ... - Added by S. McKay 6 months ago

Ok,
Confession time. The problem is NOT AMC's fault. I guess when upgrading to a newer version of AMC, I upgraded one of the python modules I use to create grading files. This module changed how to size the pdf's, and I didn't know it. It was giving me huge (24 by 36 inches) pdfs. Splitting those at 300 dpi was causing problems. The nice thing is that everything still worked, albeit slower.

This has nothing to do with the jpg format. Sorry for leading you down the primrose path, but thanks for listening.
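
To put the 24 by 36 inch pages in perspective, here is a rough Python sketch of the image sizes involved at 300 dpi. The normal page size is not stated in the thread, so US Letter (8.5 by 11 inches) is an assumption, and the function name is just for illustration.

```python
def pixels_at(width_in, height_in, dpi=300):
    """Pixel width, height and total pixel count of a page rendered at dpi."""
    w, h = round(width_in * dpi), round(height_in * dpi)
    return w, h, w * h

letter = pixels_at(8.5, 11)   # assumed normal page size (US Letter)
huge = pixels_at(24, 36)      # the oversized pages described above

print("letter:", letter)
print("huge:  ", huge)
print("pixel ratio:", round(huge[2] / letter[2], 1))
```

Under that assumption each oversized page carries roughly nine times as many pixels as a normal one, which is the kind of difference that would plausibly slow down every image-processing step.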
