PyTesseract

Improving OCR’d Student Papers to enhance citation detection for removal

Motivation for PyTesseract Approach

The “Citation variations” approach is yielding a failure rate of about 20%, mostly due to OCR errors that throw off the text matching against citation terms. We discussed starting from the page images and optimizing the OCR results, rather than reverse-engineering fixes for poor OCR. As a result, we decided to use OpenCV to enhance the OCR’d student papers, in the hope of improving the detection of citation pages so that they can be removed before the ingestion process.

OpenCV (Open Source Computer Vision Library) is an open-source library of programming functions aimed at real-time computer vision. Computer vision tasks include acquiring, processing, and analyzing digital images, and extracting data from them to produce numerical or symbolic information. Alex had not used OpenCV before, so he read the documentation and went through the OpenCV Bootcamp, a 3-hour course on how to manipulate images and videos and detect objects and faces.

Optical Character Recognition (OCR) is the technology behind converting typed, handwritten, or printed text in images into machine-encoded text: it takes a 2D image of text and produces machine-readable text. The OCR process generally consists of several sub-processes:

  • Pre-processing of the image
  • Text localization
  • Character segmentation
  • Character recognition
  • Post-processing

Many OCR packages are available, but one of the most popular is Tesseract. Python Tesseract (Pytesseract) is a Python library that wraps Google’s Tesseract-OCR engine, so developers can call the engine from Python.
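As a rough illustration of the wrapper, the sketch below reads one page image and returns the recognized text. It assumes the pytesseract and Pillow packages and the Tesseract binary are installed. The helper name ocr_page is made up for this example and is not taken from the project notebook.

```python
def ocr_page(image_path: str, lang: str = "eng") -> str:
    """Return the text Tesseract recognizes in the image at image_path."""
    # Imported inside the function so this module still loads on machines
    # where Tesseract and its Python wrapper are not installed.
    from PIL import Image
    import pytesseract

    with Image.open(image_path) as page:
        # image_to_string runs the Tesseract engine and returns plain text.
        return pytesseract.image_to_string(page, lang=lang)
```

Calling ocr_page("page_0001.png") would return the page text as a single string; the file name here is only a placeholder.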

Purpose of h in fastNlMeansDenoising:

Higher values of h: Remove more noise, at the risk of also removing detail from the image, which can blur or erase important features such as edges or text.

Lower values of h: Keep more detail intact, but the less aggressive noise removal may leave some noise in the image.

Explanation: The fastNlMeansDenoising() function is based on the Non-Local Means (NL-Means) algorithm, which removes noise by replacing each pixel with a weighted average of pixels whose surrounding patches look similar. The h parameter controls how much weight is given to less similar patches when averaging: the larger h is, the more they contribute.

  • Small h values focus more on retaining the original pixel values, which may not be enough to remove noise.
  • Large h values smooth the image more aggressively to reduce noise but can also blur fine details, especially in images with text.

Typical Usage: For scanned documents or images with text, you might start with a moderate h value (e.g., 10-20). You can adjust it based on the quality of the input and the desired level of denoising.
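To make the role of h concrete, here is a small sketch of the weight used in the textbook form of NL-Means, exp(-d² / h²), where d² is the mean squared difference between two patches. This is an assumption for illustration: OpenCV’s fast implementation differs in its details, and the function name nlm_weight is hypothetical.

```python
import math


def nlm_weight(patch_distance_sq: float, h: float) -> float:
    """Textbook NL-Means weight for a candidate patch.

    patch_distance_sq is the mean squared intensity difference between the
    reference patch and the candidate patch; h is the filter strength.
    """
    return math.exp(-patch_distance_sq / (h * h))


# A candidate patch that differs from the reference by about 15 gray levels
# per pixel, for example a patch containing part of a faint character stroke.
distance_sq = 15.0 ** 2

for h in (5, 10, 20):
    print(f"h={h:>2}: weight={nlm_weight(distance_sq, h):.4f}")
```

With a small h the dissimilar patch contributes almost nothing, so detail is preserved. With a large h it is averaged in with substantial weight, which is where the blurring of text comes from.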

Purpose of templateWindowSize:

Purpose: This parameter defines the size of the window around each pixel that will be used to find similar patches in the image for denoising.

Explanation: The algorithm looks at a small square neighborhood around each pixel, called the “template window.” This window is used as a reference to find similar patches in a larger area (defined by searchWindowSize).

Value: It should be an odd number, such as 7, 9, or 11. The default value is 7.

Effect:

  • Smaller templateWindowSize (e.g., 7): Focuses on a smaller area around each pixel. This is faster but might not be as accurate in removing noise in larger areas.
  • Larger templateWindowSize (e.g., 21): Considers a larger neighborhood for similarity comparison. This can improve denoising, especially for large-scale noise, but is computationally more expensive.

Purpose of searchWindowSize:

Purpose: This parameter defines the size of the window where the algorithm searches for similar patches to the reference patch (defined by templateWindowSize).

Explanation: After selecting the template window, the algorithm looks for similar patches within a larger search window. The size of this search window determines how far the algorithm will look to find similar regions.

Value: It should be an odd number, such as 21, 31, or 41. The default value is 21.

Effect:

  • Smaller searchWindowSize (e.g., 21): Limits the search to a smaller area around the pixel. This is faster but may miss similar patches further away.
  • Larger searchWindowSize (e.g., 31 or 41): Allows searching in a wider area for similar patches, which can improve noise removal, especially in textured regions, but it is slower.

Outputs of the PyTesseract OCR Notebook

We are looking at the effects of modifying the parameters h (which regulates filter strength), templateWindowSize (the size in pixels of the template patch used to compute weights), and searchWindowSize (the size in pixels of the window searched for similar patches). Below are the papers, with the parameter values printed on each output. In each pair, the left is the original OCR’d image and the right is the newly OCR’d output.
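A parameter sweep like this one can be organized with a small helper that enumerates the settings and builds the caption printed on each output. The sketch below is illustrative only: the function names are hypothetical, and the h values and defaults are the ones quoted in the captions in this document.

```python
from itertools import product

# Defaults quoted in the captions in this document.
DEFAULT_TEMPLATE_WINDOW = 7
DEFAULT_SEARCH_WINDOW = 21


def describe(value, default):
    """Format a parameter value, marking it when it equals the default."""
    return f"{value} (default)" if value == default else str(value)


def caption(doc_id, h,
            template_window=DEFAULT_TEMPLATE_WINDOW,
            search_window=DEFAULT_SEARCH_WINDOW):
    """Build the label printed above each set of outputs."""
    return (
        f"{doc_id}; h = {h}, "
        f"templateWindowSize = {describe(template_window, DEFAULT_TEMPLATE_WINDOW)} & "
        f"searchWindowSize = {describe(search_window, DEFAULT_SEARCH_WINDOW)}"
    )


def sweep(doc_ids, h_values=(10, 15, 20)):
    """Yield (doc_id, h, caption) for every document and filter strength."""
    for doc_id, h in product(doc_ids, h_values):
        yield doc_id, h, caption(doc_id, h)


for _, _, label in sweep(["ch792bx6307", "cj258ns3486"]):
    print(label)
```

Each yielded (doc_id, h) pair would then be passed to the denoising and OCR steps, with the caption attached to the resulting image.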

bg870mr8040

bg870mr8040_1 bg870mr8040_2 bg870mr8040_3 bg870mr8040_4 bg870mr8040_5 bg870mr8040_6 bg870mr8040_7

bj170wc5114

bj170wc5114_1 bj170wc5114_2 bj170wc5114_3 bj170wc5114_4 bj170wc5114_5 bj170wc5114_6 bj170wc5114_7 bj170wc5114_8 bj170wc5114_9 bj170wc5114_10 bj170wc5114_11 bj170wc5114_12

ch792bx6307; h = 10, templateWindowSize = 7 (default) & searchWindowSize = 21 (default)

ch792bx6307_0001 ch792bx6307_0002 ch792bx6307_0003 ch792bx6307_0004 ch792bx6307_0005 ch792bx6307_0006 ch792bx6307_0007 ch792bx6307_0008 ch792bx6307_0009 ch792bx6307_0010 ch792bx6307_0011 ch792bx6307_0012 ch792bx6307_0013 ch792bx6307_0014 ch792bx6307_0015 ch792bx6307_0016

ch792bx6307; h = 15, templateWindowSize = 7 (default) & searchWindowSize = 21 (default)

ch792bx6307_0001 ch792bx6307_0002 ch792bx6307_0003 ch792bx6307_0004 ch792bx6307_0005 ch792bx6307_0006 ch792bx6307_0007 ch792bx6307_0008 ch792bx6307_0009 ch792bx6307_0010 ch792bx6307_0011 ch792bx6307_0012 ch792bx6307_0013 ch792bx6307_0014 ch792bx6307_0015 ch792bx6307_0016

ch792bx6307; h = 20, templateWindowSize = 7 (default) & searchWindowSize = 21 (default)

ch792bx6307_0001 ch792bx6307_0002 ch792bx6307_0003 ch792bx6307_0004 ch792bx6307_0005 ch792bx6307_0006 ch792bx6307_0007 ch792bx6307_0008 ch792bx6307_0009 ch792bx6307_0010 ch792bx6307_0011 ch792bx6307_0012 ch792bx6307_0013 ch792bx6307_0014 ch792bx6307_0015 ch792bx6307_0016

cj258ns3486; h = 10, templateWindowSize = 7 (default) & searchWindowSize = 21 (default)

cj258ns3486_0001 cj258ns3486_0002 cj258ns3486_0003 cj258ns3486_0004 cj258ns3486_0005 cj258ns3486_0006 cj258ns3486_0007 cj258ns3486_0008 cj258ns3486_0009 cj258ns3486_0010 cj258ns3486_0011 cj258ns3486_0012 cj258ns3486_0013 cj258ns3486_0014 cj258ns3486_0015 cj258ns3486_0016 cj258ns3486_0017 cj258ns3486_0018 cj258ns3486_0019 cj258ns3486_0020 cj258ns3486_0021 cj258ns3486_0022 cj258ns3486_0023 cj258ns3486_0024 cj258ns3486_0025 cj258ns3486_0026

cj258ns3486; h = 15, templateWindowSize = 7 (default) & searchWindowSize = 21 (default)

cj258ns3486_0001 cj258ns3486_0002 cj258ns3486_0003 cj258ns3486_0004 cj258ns3486_0005 cj258ns3486_0006 cj258ns3486_0007 cj258ns3486_0008 cj258ns3486_0009 cj258ns3486_0010 cj258ns3486_0011 cj258ns3486_0012 cj258ns3486_0013 cj258ns3486_0014 cj258ns3486_0015 cj258ns3486_0016 cj258ns3486_0017 cj258ns3486_0018 cj258ns3486_0019 cj258ns3486_0020 cj258ns3486_0021 cj258ns3486_0022 cj258ns3486_0023 cj258ns3486_0024 cj258ns3486_0025 cj258ns3486_0026

cj258ns3486; h = 20, templateWindowSize = 7 (default) & searchWindowSize = 21 (default)

cj258ns3486_0001 cj258ns3486_0002 cj258ns3486_0003 cj258ns3486_0004 cj258ns3486_0005 cj258ns3486_0006 cj258ns3486_0007 cj258ns3486_0008 cj258ns3486_0009 cj258ns3486_0010 cj258ns3486_0011 cj258ns3486_0012 cj258ns3486_0013 cj258ns3486_0014 cj258ns3486_0015 cj258ns3486_0016 cj258ns3486_0017 cj258ns3486_0018 cj258ns3486_0019 cj258ns3486_0020 cj258ns3486_0021 cj258ns3486_0022 cj258ns3486_0023 cj258ns3486_0024 cj258ns3486_0025 cj258ns3486_0026

dj224jp8743; h = 10, templateWindowSize = 7 (default) & searchWindowSize = 21 (default)

dj224jp8743_0001 dj224jp8743_0002 dj224jp8743_0003 dj224jp8743_0004 dj224jp8743_0005 dj224jp8743_0006 dj224jp8743_0007 dj224jp8743_0008 dj224jp8743_0009 dj224jp8743_0010 dj224jp8743_0011 dj224jp8743_0012 dj224jp8743_0013 dj224jp8743_0014

dj224jp8743; h = 15, templateWindowSize = 7 (default) & searchWindowSize = 21 (default)

dj224jp8743_0001 dj224jp8743_0002 dj224jp8743_0003 dj224jp8743_0004 dj224jp8743_0005 dj224jp8743_0006 dj224jp8743_0007 dj224jp8743_0008 dj224jp8743_0009 dj224jp8743_0010 dj224jp8743_0011 dj224jp8743_0012 dj224jp8743_0013 dj224jp8743_0014

dj224jp8743; h = 20, templateWindowSize = 7 (default) & searchWindowSize = 21 (default)

dj224jp8743_0001 dj224jp8743_0002 dj224jp8743_0003 dj224jp8743_0004 dj224jp8743_0005 dj224jp8743_0006 dj224jp8743_0007 dj224jp8743_0008 dj224jp8743_0009 dj224jp8743_0010 dj224jp8743_0011 dj224jp8743_0012 dj224jp8743_0013 dj224jp8743_0014

dq995jh3669; h = 10, templateWindowSize = 7 (default) & searchWindowSize = 21 (default)

dq995jh3669_0001 dq995jh3669_0002 dq995jh3669_0003 dq995jh3669_0004 dq995jh3669_0005 dq995jh3669_0006 dq995jh3669_0007 dq995jh3669_0008 dq995jh3669_0009 dq995jh3669_0010

dq995jh3669; h = 15, templateWindowSize = 7 (default) & searchWindowSize = 21 (default)

dq995jh3669_0001 dq995jh3669_0002 dq995jh3669_0003 dq995jh3669_0004 dq995jh3669_0005 dq995jh3669_0006 dq995jh3669_0007 dq995jh3669_0008 dq995jh3669_0009 dq995jh3669_0010

dq995jh3669; h = 20, templateWindowSize = 7 (default) & searchWindowSize = 21 (default)

dq995jh3669_0001 dq995jh3669_0002 dq995jh3669_0003 dq995jh3669_0004 dq995jh3669_0005 dq995jh3669_0006 dq995jh3669_0007 dq995jh3669_0008 dq995jh3669_0009 dq995jh3669_0010

fy246vw6211; h = 10, templateWindowSize = 7 (default) & searchWindowSize = 21 (default)

fy246vw6211_0001 fy246vw6211_0002 fy246vw6211_0003 fy246vw6211_0004 fy246vw6211_0005 fy246vw6211_0006 fy246vw6211_0007 fy246vw6211_0008 fy246vw6211_0009 fy246vw6211_0010 fy246vw6211_0011 fy246vw6211_0012 fy246vw6211_0013 fy246vw6211_0014 fy246vw6211_0015 fy246vw6211_0016 fy246vw6211_0017 fy246vw6211_0018 fy246vw6211_0019

fy246vw6211; h = 15, templateWindowSize = 7 (default) & searchWindowSize = 21 (default)

fy246vw6211_0001 fy246vw6211_0002 fy246vw6211_0003 fy246vw6211_0004 fy246vw6211_0005 fy246vw6211_0006 fy246vw6211_0007 fy246vw6211_0008 fy246vw6211_0009 fy246vw6211_0010 fy246vw6211_0011 fy246vw6211_0012 fy246vw6211_0013 fy246vw6211_0014 fy246vw6211_0015 fy246vw6211_0016 fy246vw6211_0017 fy246vw6211_0018 fy246vw6211_0019

fy246vw6211; h = 20, templateWindowSize = 7 (default) & searchWindowSize = 21 (default)

fy246vw6211_0001 fy246vw6211_0002 fy246vw6211_0003 fy246vw6211_0004 fy246vw6211_0005 fy246vw6211_0006 fy246vw6211_0007 fy246vw6211_0008 fy246vw6211_0009 fy246vw6211_0010 fy246vw6211_0011 fy246vw6211_0012 fy246vw6211_0013 fy246vw6211_0014 fy246vw6211_0015 fy246vw6211_0016 fy246vw6211_0017 fy246vw6211_0018 fy246vw6211_0019

gp441gd9761; h = 10, templateWindowSize = 7 (default) & searchWindowSize = 21 (default)

gp441gd9761_0001 gp441gd9761_0002 gp441gd9761_0003 gp441gd9761_0004 gp441gd9761_0005 gp441gd9761_0006 gp441gd9761_0007 gp441gd9761_0008 gp441gd9761_0009 gp441gd9761_0010 gp441gd9761_0011 gp441gd9761_0012

gp441gd9761; h = 25, templateWindowSize = 7 (default) & searchWindowSize = 21 (default)

gp441gd9761_0001 gp441gd9761_0002 gp441gd9761_0003 gp441gd9761_0004 gp441gd9761_0005 gp441gd9761_0006 gp441gd9761_0007 gp441gd9761_0008 gp441gd9761_0009 gp441gd9761_0010 gp441gd9761_0011 gp441gd9761_0012

gp441gd9761; h = 20, templateWindowSize = 7 (default) & searchWindowSize = 21 (default)

gp441gd9761_0001 gp441gd9761_0002 gp441gd9761_0003 gp441gd9761_0004 gp441gd9761_0005 gp441gd9761_0006 gp441gd9761_0007 gp441gd9761_0008 gp441gd9761_0009 gp441gd9761_0010 gp441gd9761_0011 gp441gd9761_0012

gx021jv8425; h = 10, templateWindowSize = 7 (default) & searchWindowSize = 21 (default)

gx021jv8425_0001 gx021jv8425_0002 gx021jv8425_0003 gx021jv8425_0004 gx021jv8425_0005 gx021jv8425_0006 gx021jv8425_0007 gx021jv8425_0008 gx021jv8425_0009 gx021jv8425_0010 gx021jv8425_0011 gx021jv8425_0012 gx021jv8425_0013 gx021jv8425_0014 gx021jv8425_0015 gx021jv8425_0016 gx021jv8425_0017 gx021jv8425_0018 gx021jv8425_0019 gx021jv8425_0020 gx021jv8425_0021 gx021jv8425_0022 gx021jv8425_0023 gx021jv8425_0024 gx021jv8425_0025 gx021jv8425_0026

gx021jv8425; h = 15, templateWindowSize = 7 (default) & searchWindowSize = 21 (default)

gx021jv8425_0001 gx021jv8425_0002 gx021jv8425_0003 gx021jv8425_0004 gx021jv8425_0005 gx021jv8425_0006 gx021jv8425_0007 gx021jv8425_0008 gx021jv8425_0009 gx021jv8425_0010 gx021jv8425_0011 gx021jv8425_0012 gx021jv8425_0013 gx021jv8425_0014 gx021jv8425_0015 gx021jv8425_0016 gx021jv8425_0017 gx021jv8425_0018 gx021jv8425_0019 gx021jv8425_0020 gx021jv8425_0021 gx021jv8425_0022 gx021jv8425_0023 gx021jv8425_0024 gx021jv8425_0025 gx021jv8425_0026

gx021jv8425; h = 20, templateWindowSize = 7 (default) & searchWindowSize = 21 (default)

gx021jv8425_0001 gx021jv8425_0002 gx021jv8425_0003 gx021jv8425_0004 gx021jv8425_0005 gx021jv8425_0006 gx021jv8425_0007 gx021jv8425_0008 gx021jv8425_0009 gx021jv8425_0010 gx021jv8425_0011 gx021jv8425_0012 gx021jv8425_0013 gx021jv8425_0014 gx021jv8425_0015 gx021jv8425_0016 gx021jv8425_0017 gx021jv8425_0018 gx021jv8425_0019 gx021jv8425_0020 gx021jv8425_0021 gx021jv8425_0022 gx021jv8425_0023 gx021jv8425_0024 gx021jv8425_0025 gx021jv8425_0026

ry002zj8695; h = 10, templateWindowSize = 7 (default) & searchWindowSize = 21 (default)

ry002zj8695_0001 ry002zj8695_0002 ry002zj8695_0003 ry002zj8695_0004 ry002zj8695_0005 ry002zj8695_0006 ry002zj8695_0007 ry002zj8695_0008 ry002zj8695_0009 ry002zj8695_0010 ry002zj8695_0011 ry002zj8695_0012 ry002zj8695_0013 ry002zj8695_0014 ry002zj8695_0015 ry002zj8695_0016 ry002zj8695_0017 ry002zj8695_0018 ry002zj8695_0019 ry002zj8695_0020 ry002zj8695_0021 ry002zj8695_0022

ry002zj8695; h = 15, templateWindowSize = 7 (default) & searchWindowSize = 21 (default)

ry002zj8695_0001 ry002zj8695_0002 ry002zj8695_0003 ry002zj8695_0004 ry002zj8695_0005 ry002zj8695_0006 ry002zj8695_0007 ry002zj8695_0008 ry002zj8695_0009 ry002zj8695_0010 ry002zj8695_0011 ry002zj8695_0012 ry002zj8695_0013 ry002zj8695_0014 ry002zj8695_0015 ry002zj8695_0016 ry002zj8695_0017 ry002zj8695_0018 ry002zj8695_0019 ry002zj8695_0020 ry002zj8695_0021 ry002zj8695_0022

ry002zj8695; h = 20, templateWindowSize = 7 (default) & searchWindowSize = 21 (default)

ry002zj8695_0001 ry002zj8695_0002 ry002zj8695_0003 ry002zj8695_0004 ry002zj8695_0005 ry002zj8695_0006 ry002zj8695_0007 ry002zj8695_0008 ry002zj8695_0009 ry002zj8695_0010 ry002zj8695_0011 ry002zj8695_0012 ry002zj8695_0013 ry002zj8695_0014 ry002zj8695_0015 ry002zj8695_0016 ry002zj8695_0017 ry002zj8695_0018 ry002zj8695_0019 ry002zj8695_0020 ry002zj8695_0021 ry002zj8695_0022

rz356zt5681; h = 10, templateWindowSize = 7 (default) & searchWindowSize = 21 (default)

rz356zt5681_0001 rz356zt5681_0002 rz356zt5681_0003 rz356zt5681_0004 rz356zt5681_0005 rz356zt5681_0006 rz356zt5681_0007 rz356zt5681_0008 rz356zt5681_0009 rz356zt5681_0010 rz356zt5681_0011 rz356zt5681_0012 rz356zt5681_0013 rz356zt5681_0014 rz356zt5681_0015 rz356zt5681_0016 rz356zt5681_0017 rz356zt5681_0018 rz356zt5681_0019 rz356zt5681_0020 rz356zt5681_0021 rz356zt5681_0022 rz356zt5681_0023 rz356zt5681_0024

rz356zt5681; h = 15, templateWindowSize = 7 (default) & searchWindowSize = 21 (default)

rz356zt5681_0001 rz356zt5681_0002 rz356zt5681_0003 rz356zt5681_0004 rz356zt5681_0005 rz356zt5681_0006 rz356zt5681_0007 rz356zt5681_0008 rz356zt5681_0009 rz356zt5681_0010 rz356zt5681_0011 rz356zt5681_0012 rz356zt5681_0013 rz356zt5681_0014 rz356zt5681_0015 rz356zt5681_0016 rz356zt5681_0017 rz356zt5681_0018 rz356zt5681_0019 rz356zt5681_0020 rz356zt5681_0021 rz356zt5681_0022 rz356zt5681_0023 rz356zt5681_0024

rz356zt5681; h = 20, templateWindowSize = 7 (default) & searchWindowSize = 21 (default)

rz356zt5681_0001 rz356zt5681_0002 rz356zt5681_0003 rz356zt5681_0004 rz356zt5681_0005 rz356zt5681_0006 rz356zt5681_0007 rz356zt5681_0008 rz356zt5681_0009 rz356zt5681_0010 rz356zt5681_0011 rz356zt5681_0012 rz356zt5681_0013 rz356zt5681_0014 rz356zt5681_0015 rz356zt5681_0016 rz356zt5681_0017 rz356zt5681_0018 rz356zt5681_0019 rz356zt5681_0020 rz356zt5681_0021 rz356zt5681_0022 rz356zt5681_0023 rz356zt5681_0024

sh033st8655; h = 10, templateWindowSize = 7 (default) & searchWindowSize = 21 (default)

sh033st8655_0001 sh033st8655_0002 sh033st8655_0003 sh033st8655_0004 sh033st8655_0005 sh033st8655_0006 sh033st8655_0007 sh033st8655_0008 sh033st8655_0009 sh033st8655_0010

sh033st8655; h = 15, templateWindowSize = 7 (default) & searchWindowSize = 21 (default)

sh033st8655_0001 sh033st8655_0002 sh033st8655_0003 sh033st8655_0004 sh033st8655_0005 sh033st8655_0006 sh033st8655_0007 sh033st8655_0008 sh033st8655_0009 sh033st8655_0010

sh033st8655; h = 20, templateWindowSize = 7 (default) & searchWindowSize = 21 (default)

sh033st8655_0001 sh033st8655_0002 sh033st8655_0003 sh033st8655_0004 sh033st8655_0005 sh033st8655_0006 sh033st8655_0007 sh033st8655_0008 sh033st8655_0009 sh033st8655_0010