Text Difference Checker
Compare and Contrast Transkribus Outputs: Private model vs. Print 0.3 (Public) model
Motivation for Checker
We wanted to compare the outputs of our private model and the Print 0.3 (public) model on Transkribus. Our private model was trained on 70 pages of student papers, using Print 0.3 as the base model. Despite this, our private model performed significantly worse than the Print 0.3 model.
Upon further research and guidance from Transkribus's support team, we learned that our results were an example of overfitting. Overfitting is a common problem in machine learning where a model learns the details and noise in the training data to such an extent that it hurts the model's performance on new data. In essence, an overfitted model performs very well on the training data but poorly on validation or test data.
As a result, Print 0.3 was deemed effective enough, with a character error rate (CER) of 1.6%. CER measures the number of characters that were incorrectly predicted compared to the ground truth, normalized by the total number of characters in the ground truth.
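As an illustration of the metric, below is a minimal sketch of how a CER could be computed from two plain-text transcripts. This is only an example using a standard edit-distance calculation, not the exact implementation Transkribus uses to report its CER.

```python
# Minimal sketch of a character error rate (CER) calculation.
# Illustrative only; not the exact metric implementation Transkribus reports.

def edit_distance(reference: str, hypothesis: str) -> int:
    """Levenshtein distance: minimum substitutions, deletions, and insertions."""
    prev = list(range(len(hypothesis) + 1))
    for i, r in enumerate(reference, start=1):
        curr = [i]
        for j, h in enumerate(hypothesis, start=1):
            cost = 0 if r == h else 1
            curr.append(min(prev[j] + 1,          # deletion
                            curr[j - 1] + 1,      # insertion
                            prev[j - 1] + cost))  # substitution
        prev = curr
    return prev[-1]

def cer(reference: str, hypothesis: str) -> float:
    """Edit distance normalized by the length of the ground truth."""
    return edit_distance(reference, hypothesis) / len(reference)

# Two wrong characters out of 17 ground-truth characters -> CER of about 0.118.
print(f"{cer('ground truth text', 'ground trvth texf'):.3f}")
```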
Tools Used
- Python
- difflib (Python standard library module for comparing sequences and calculating similarity)
- Jupyter Notebook
Compare character sequences
SequenceMatcher is a class in difflib that can be used to compare the similarity between two sequences (such as strings). It uses the Ratcliff/Obershelp algorithm to calculate the similarity between two sequences.
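For illustration, here is a small sketch of how SequenceMatcher can score the similarity of two transcribed lines; the strings are placeholders, not actual model output.

```python
from difflib import SequenceMatcher

# Placeholder lines standing in for the same line transcribed by each model.
public_line = "The quick brown fox jumps over the lazy dog."
private_line = "The quick brovvn fox jumps ouer the lazy dog."

matcher = SequenceMatcher(None, public_line, private_line)
print(f"similarity ratio: {matcher.ratio():.3f}")

# get_opcodes() lists the edits needed to turn one sequence into the other.
for tag, i1, i2, j1, j2 in matcher.get_opcodes():
    if tag != "equal":
        print(tag, repr(public_line[i1:i2]), "->", repr(private_line[j1:j2]))
```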
Create a difference report
HtmlDiff is a class that can be used to create an HTML table showing a side-by-side, line-by-line comparison of text, with inter-line and intra-line change highlights.
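Below is a minimal sketch of how such a report could be generated with HtmlDiff; the file names are placeholders for the exported transcriptions.

```python
import difflib

# File names are placeholders for the actual transcription exports.
with open("print03_output.txt", encoding="utf-8") as f:
    public_lines = f.readlines()
with open("private_model_output.txt", encoding="utf-8") as f:
    private_lines = f.readlines()

# make_file() returns a complete HTML page with a side-by-side table;
# fromdesc/todesc label the left and right columns.
html = difflib.HtmlDiff(wrapcolumn=80).make_file(
    public_lines, private_lines,
    fromdesc="Print 0.3", todesc="Private model")

with open("comparison_report.html", "w", encoding="utf-8") as f:
    f.write(html)
```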
Below is the gallery of text comparisons:
- Left: Print 0.3
- Right: Private model
bv172tc9618_0002
cp967xz4450
dr894zh9418
kr104zb7305
wp009br6936
yw206cp4709
zz472cp8582