PyTesseract
Improving OCR’d Student Papers to enhance citation detection for removal
Motivation for PyTesseract Approach
The “Citation variations” approach is yielding about a 20% failure rate, mostly due to OCR errors that throw off the text matching against citation terms. We discussed starting from the page images and optimizing the OCR results, rather than reverse-engineering fixes for poor OCR output. As a result, we decided to use OpenCV to enhance the images of the student papers before OCR, in the hope of improving the detection of citation pages so that they can be removed prior to the ingestion process.
OpenCV (Open Source Computer Vision Library) is an open source library of programming functions aimed at real-time computer vision. CV tasks include acquiring, processing, and analyzing digital images, and extracting data from them to produce numerical or symbolic information. Alex had not used OpenCV before, so he read the documentation and went through the OpenCV Bootcamp, a 3-hour course on how to manipulate images and videos and detect objects and faces.
Optical Character Recognition (OCR) is the technology that converts a 2D image of typed, handwritten, or printed text into machine-encoded, machine-readable text. The OCR process generally consists of several sub-processes:
- Pre-processing of the image
- Text localization
- Character segmentation
- Character recognition
- Post-processing
There is a lot of OCR software available, but one of the most popular engines is Tesseract. Python Tesseract (Pytesseract) is a Python library that serves as a wrapper for Google’s Tesseract-OCR engine. Essentially, it allows developers to call Tesseract’s OCR engine from Python.
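As a rough illustration of how these sub-processes map onto the tools named here, the sketch below reads a page image, denoises it with OpenCV, passes it to Tesseract through Pytesseract, and tidies the result. The function names ocr_page and clean_ocr_text, and the whitespace cleanup itself, are assumptions made for illustration and are not the notebook's actual code.

```python
import re


def clean_ocr_text(text: str) -> str:
    """Post-processing (illustrative): collapse spaces/tabs and drop blank lines."""
    lines = (re.sub(r"[ \t]+", " ", line).strip() for line in text.splitlines())
    return "\n".join(line for line in lines if line)


def ocr_page(image_path: str, h: float = 10) -> str:
    """Pre-process one page image with OpenCV, then OCR it with Tesseract."""
    # Imported here so clean_ocr_text() works without OpenCV/Tesseract installed.
    import cv2
    import pytesseract

    # Pre-processing: load the page as a single-channel grayscale image.
    image = cv2.imread(image_path, cv2.IMREAD_GRAYSCALE)
    if image is None:
        raise FileNotFoundError(f"could not read image: {image_path}")

    # Pre-processing: Non-Local Means denoising with default window sizes.
    denoised = cv2.fastNlMeansDenoising(image, None, h=h)

    # Localization, segmentation, and recognition happen inside Tesseract.
    return clean_ocr_text(pytesseract.image_to_string(denoised))
```

Calling ocr_page("page.png") (a hypothetical file name) would return the recognized text as a single string, ready for citation-term matching.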
Purpose of h in fastNlMeansDenoising:
Higher Values of h: Increase the amount of noise that is removed. Risk of removing more details from the image, which can cause blurring or loss of important features like edges or text.
Lower Values of h: Keep more details intact. Less aggressive noise removal, which may leave some noise in the image.
Explanation: The fastNlMeansDenoising() function is based on the Non-Local Means (NL-Means) algorithm, which removes noise by replacing each pixel with a weighted average of pixels whose surrounding patches look similar. The h parameter controls how quickly that weight falls off as patches become less similar.
- Small h values focus more on retaining the original pixel values, which may not be enough to remove noise.
- Large h values smooth the image more aggressively to reduce noise but can also blur fine details, especially in images with text.
Typical Usage: For scanned documents or images with text, you might start with a moderate h value (e.g., 10-20). You can adjust it based on the quality of the input and the desired level of denoising.
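To make the role of h more concrete, here is a minimal sketch of the textbook NL-Means weight, exp(-d²/h²), where d² is the mean squared difference between two patches. This is an assumption about the general form of the algorithm; OpenCV's fast implementation computes its weights in its own optimized way, and nlm_weight is a name invented for this example.

```python
import math


def nlm_weight(patch_distance_sq: float, h: float) -> float:
    """Textbook NL-Means weight for a candidate patch.

    patch_distance_sq is the mean squared intensity difference between
    the candidate patch and the reference patch.
    """
    if h <= 0:
        raise ValueError("h must be positive")
    return math.exp(-patch_distance_sq / (h * h))


# A candidate patch that differs from the reference by about 10 gray levels.
for h in (10, 15, 20):
    print(f"h = {h}: weight = {nlm_weight(100.0, h):.3f}")
```

The same moderately different patch contributes roughly 0.37, 0.64, and 0.78 of a full vote at h = 10, 15, and 20. This is why larger h values smooth more aggressively: dissimilar patches, including ones that cross the edges of letters, are mixed into the average.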
Purpose of templateWindowSize:
Purpose: This parameter defines the size of the window around each pixel that will be used to find similar patches in the image for denoising.
Explanation: The algorithm looks at a small square neighborhood around each pixel, called the “template window.” This window is used as a reference to find similar patches in a larger area (defined by searchWindowSize).
Value: It should be an odd number, such as 7, 9, or 11. The OpenCV default is 7.
Effect:
- Smaller templateWindowSize (e.g., 7): Focuses on a smaller area around each pixel. This is faster but might not be as accurate in removing noise in larger areas.
- Larger templateWindowSize (e.g., 21): Considers a larger neighborhood for similarity comparison. This can improve denoising, especially for large-scale noise, but is computationally more expensive.
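The idea of comparing template windows can be sketched in plain Python. The helpers below (template_window and patch_distance are hypothetical names) extract the square window around two pixels of a small image stored as nested lists and return their mean squared difference. Clamping at the image border is a simplification chosen for this example, not necessarily what OpenCV does.

```python
def template_window(image, row, col, size):
    """Return the size x size window centred on (row, col) as a flat list.

    Coordinates outside the image are clamped to the nearest edge pixel.
    """
    if size < 1 or size % 2 == 0:
        raise ValueError("size should be a positive odd number")
    half = size // 2
    last_row, last_col = len(image) - 1, len(image[0]) - 1
    return [
        image[min(max(r, 0), last_row)][min(max(c, 0), last_col)]
        for r in range(row - half, row + half + 1)
        for c in range(col - half, col + half + 1)
    ]


def patch_distance(image, p, q, size=7):
    """Mean squared difference between the windows centred on pixels p and q."""
    a = template_window(image, p[0], p[1], size)
    b = template_window(image, q[0], q[1], size)
    return sum((x - y) ** 2 for x, y in zip(a, b)) / len(a)


# 9 x 9 test image with alternating black and white columns.
stripes = [[0 if c % 2 == 0 else 255 for c in range(9)] for _ in range(9)]
print(patch_distance(stripes, (4, 4), (4, 6), size=3))  # same stripe phase
print(patch_distance(stripes, (4, 4), (4, 5), size=3))  # opposite stripe phase
```

Pixels whose windows match exactly get a distance of zero and therefore the highest weight in the average; a larger window makes each comparison more selective but costs more per pixel.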
Purpose of searchWindowSize:
Purpose: This parameter defines the size of the window where the algorithm searches for similar patches to the reference patch (defined by templateWindowSize).
Explanation: After selecting the template window, the algorithm looks for similar patches within a larger search window. The size of this search window determines how far the algorithm will look to find similar regions.
Value: It should be an odd number, such as 21, 31, or 41. The OpenCV default is 21.
Effect:
- Smaller searchWindowSize (e.g., 21): Limits the search to a smaller area around the pixel. This is faster but may miss similar patches further away.
- Larger searchWindowSize (e.g., 31 or 41): Allows searching in a wider area for similar patches, which can improve noise removal, especially in textured regions, but it is slower.
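A back-of-the-envelope estimate shows why both window sizes affect speed. The sketch below counts the pixel comparisons a naive NL-Means implementation would make for one output pixel: one template-sized patch comparison for every position in the search window. The function name is hypothetical, and OpenCV's fast implementation avoids much of this work, so the numbers indicate relative cost only.

```python
def comparisons_per_pixel(template_window_size: int = 7,
                          search_window_size: int = 21) -> int:
    """Naive count of pixel comparisons needed to denoise one pixel."""
    for name, value in (("templateWindowSize", template_window_size),
                        ("searchWindowSize", search_window_size)):
        if value < 1 or value % 2 == 0:
            raise ValueError(f"{name} should be a positive odd number, got {value}")
    candidate_patches = search_window_size ** 2   # positions inside the search window
    pixels_per_patch = template_window_size ** 2  # pixels compared per candidate
    return candidate_patches * pixels_per_patch


print(comparisons_per_pixel(7, 21))   # defaults
print(comparisons_per_pixel(7, 41))   # wider search window
print(comparisons_per_pixel(21, 21))  # larger template window
```

Under this naive count, widening the search window from 21 to 41 is nearly four times the work of the defaults, and growing the template window from 7 to 21 is nine times the work.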
Outputs of the PyTesseract OCR Notebook
We are looking at the effects of modifying the parameters h (which regulates filter strength), templateWindowSize (the size in pixels of the template patch used to compute the weights), and searchWindowSize (the size in pixels of the window searched for similar patches, over which the weighted average for a given pixel is computed). Below are the sample papers, with the parameter values used printed on each output. In each pair, the left image is the original OCR’d page and the right is the newly OCR’d output.
bg870mr8040
bj170wc5114_
ch792bx6307; h = 10, templateWindowSize = 7 (default) & searchWindowSize = 21 (default)
ch792bx6307; h = 15, templateWindowSize = 7 (default) & searchWindowSize = 21 (default)
ch792bx6307; h = 20, templateWindowSize = 7 (default) & searchWindowSize = 21 (default)
cj258ns3486; h = 10, templateWindowSize = 7 (default) & searchWindowSize = 21 (default)
cj258ns3486; h = 15, templateWindowSize = 7 (default) & searchWindowSize = 21 (default)
cj258ns3486; h = 20, templateWindowSize = 7 (default) & searchWindowSize = 21 (default)
dj224jp8743; h = 10, templateWindowSize = 7 (default) & searchWindowSize = 21 (default)
dj224jp8743; h = 15, templateWindowSize = 7 (default) & searchWindowSize = 21 (default)
dj224jp8743; h = 20, templateWindowSize = 7 (default) & searchWindowSize = 21 (default)
dq995jh3669; h = 10, templateWindowSize = 7 (default) & searchWindowSize = 21 (default)
dq995jh3669; h = 15, templateWindowSize = 7 (default) & searchWindowSize = 21 (default)
dq995jh3669; h = 20, templateWindowSize = 7 (default) & searchWindowSize = 21 (default)
fy246vw6211; h = 10, templateWindowSize = 7 (default) & searchWindowSize = 21 (default)
fy246vw6211; h = 15, templateWindowSize = 7 (default) & searchWindowSize = 21 (default)
fy246vw6211; h = 20, templateWindowSize = 7 (default) & searchWindowSize = 21 (default)
gp441gd9761; h = 10, templateWindowSize = 7 (default) & searchWindowSize = 21 (default)
gp441gd9761; h = 25, templateWindowSize = 7 (default) & searchWindowSize = 21 (default)
gp441gd9761; h = 20, templateWindowSize = 7 (default) & searchWindowSize = 21 (default)
gx021jv8425; h = 10, templateWindowSize = 7 (default) & searchWindowSize = 21 (default)
gx021jv8425; h = 15, templateWindowSize = 7 (default) & searchWindowSize = 21 (default)
gx021jv8425; h = 20, templateWindowSize = 7 (default) & searchWindowSize = 21 (default)
ry002zj8695; h = 10, templateWindowSize = 7 (default) & searchWindowSize = 21 (default)
ry002zj8695; h = 15, templateWindowSize = 7 (default) & searchWindowSize = 21 (default)
ry002zj8695; h = 20, templateWindowSize = 7 (default) & searchWindowSize = 21 (default)
rz356zt5681; h = 10, templateWindowSize = 7 (default) & searchWindowSize = 21 (default)
rz356zt5681; h = 15, templateWindowSize = 7 (default) & searchWindowSize = 21 (default)
rz356zt5681; h = 20, templateWindowSize = 7 (default) & searchWindowSize = 21 (default)
sh033st8655; h = 10, templateWindowSize = 7 (default) & searchWindowSize = 21 (default)
sh033st8655; h = 15, templateWindowSize = 7 (default) & searchWindowSize = 21 (default)
sh033st8655; h = 20, templateWindowSize = 7 (default) & searchWindowSize = 21 (default)
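A sweep like the one captioned above could be driven by a small parameter grid. The sketch below is one possible way to do it, not the notebook's actual code: parameter_grid, caption, and denoise_variants are hypothetical helper names, and the grid values simply mirror the h = 10, 15, 20 runs with default window sizes.

```python
from itertools import product

# OpenCV defaults for the two window-size parameters.
DEFAULTS = {"templateWindowSize": 7, "searchWindowSize": 21}


def parameter_grid(h_values=(10, 15, 20), template_sizes=(7,), search_sizes=(21,)):
    """Return one keyword-argument dict per parameter combination."""
    return [
        {"h": h, "templateWindowSize": t, "searchWindowSize": s}
        for h, t, s in product(h_values, template_sizes, search_sizes)
    ]


def caption(paper_id, params):
    """Format a caption in the style used for the outputs above."""
    def show(name):
        suffix = " (default)" if params[name] == DEFAULTS[name] else ""
        return f"{name} = {params[name]}{suffix}"

    return (f"{paper_id}; h = {params['h']}, "
            f"{show('templateWindowSize')} & {show('searchWindowSize')}")


def denoise_variants(image, grid):
    """Denoise one grayscale image once per parameter set in the grid."""
    import cv2  # imported here so the grid helpers work without OpenCV installed

    return [(params, cv2.fastNlMeansDenoising(image, None, **params))
            for params in grid]


for params in parameter_grid():
    print(caption("ch792bx6307", params))
```

Extending template_sizes or search_sizes in the call to parameter_grid would add the window-size variations to the same sweep without changing the rest of the code.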