Reading Glasses

When doctors can’t read doctors’ handwriting: machine learning in medical indemnity

The result

The deployed system:

  • Reduced mean reserving error by 55% (from £31,800 to £14,200 per case)
  • Freed £12 million from unnecessary reserves in its first two years
  • Delivered a 6:1 return on the £680,000 project cost
  • Changed adviser behavior: in 70% of cases where advisers’ estimates exceeded the model’s upper bound, they revised downward after reviewing its reasoning

The full details

In 2020, we worked with one of the UK’s largest medical defense organizations on a problem that had been quietly draining funds for years. Their medico-legal claims handlers, who are both clinically qualified and legally trained, repeatedly over-reserved when new potential claims were notified.

Each individual over-estimation seemed prudent, but over the course of a financial year, the cumulative effect was tens of millions of pounds pulled from long-term investment portfolios (which returned 5–7% annually) and parked in claims reserves where they earned roughly the base rate. The organization was effectively paying a multi-million-pound annual premium for the privilege of caution.

The problem

Medical defense organizations provide indemnity cover to doctors and dentists against claims of clinical negligence. When a member notifies the organization of a potential claim (a patient complaint, a solicitor’s letter, a coroner’s inquest), a medico-legal adviser must estimate the likely financial exposure. That estimate feeds directly into the claims reserve, the pool of capital set aside to meet future liabilities.

The advisers doing this work were not amateurs. They were practicing or recently retired clinicians who had also completed legal training. People who understood both medicine and litigation. The problem was that their estimates were systematically biased toward caution. A case that might settle for £80,000 could be reserved at £120,000. A dental negligence claim worth £15,000 might be reserved at £35,000. Each individual safety margin was justifiable on its own, and no one ever got fired for over-reserving.

The aggregate impact was severe. The organization’s actuarial team estimated total over-reservation across the portfolio at around £38 million at any given time. That capital, parked in short-term investments earning 1–2%, could have stayed in the organization’s diversified long-term portfolio, which historically returned 5.8% net of fees. The yearly opportunity cost ran between £1.4 million and £2.2 million depending on market conditions, and it had been silently accumulating for years.

Over-reserved cases also skewed the organization’s risk profile, which affected reinsurance negotiations and regulatory capital calculations. The management were enthusiastic champions of this “AI” project to the point that we had to introduce a “swear box” when anyone called it “AI” rather than “Machine Learning”. How times have changed.

What we built

The project faced three engineering challenges, each of which would have been a substantial project on its own. We’ll walk through each in turn, and for every tool we used, explain what it does and why we picked it over the alternatives.

Challenge one: reading 50 years of medical records

The first and arguably most difficult challenge was data ingestion. The organization had digitized its case files dating back to the early 1970s, but in this context “digitized” meant scanned into TIFF and PDF formats. The underlying content was a mix of typed correspondence, printed medical reports, and handwritten clinical notes. The handwritten material mattered most, because it contained the contemporaneous accounts of what happened during treatment and the observations that determined liability.

Doctors’ handwriting is proverbially terrible, and not just as a cultural joke. A 2006 BMJ study found that doctors’ handwritten notes were significantly less legible than those of other healthcare professionals, with interpretation error rates reaching 15% even among trained medical secretaries. Our training data spanned the entire historical archive, so we were dealing with handwriting styles across five decades, multiple specialties, and every flavor of clinical shorthand.

We built a multi-stage recognition pipeline. Each stage solved a different sub-problem.

Stage one: cleaning up the scans with OpenCV. OpenCV is an open-source computer vision library. Think of it as a toolkit of basic image-manipulation operations: rotating, cropping, sharpening, removing noise. We used it for adaptive thresholding (turning faded grey scans into clean black-and-white text), deskewing (straightening pages that had been fed crookedly into a scanner thirty years ago), and noise reduction. Many older records had degraded badly, with foxing, bleed-through from double-sided pages, and the characteristic fading of thermal fax paper, which was never meant to last more than a few years. We chose OpenCV because it was at the time the industry standard for this kind of pre-processing, it’s free, and it runs fast enough to handle a large archive without specialized hardware. Skipping this step was not an option. Feeding a degraded scan straight into a recognition model gives you garbage out.

Stage two: reading the typed text with ABBYY FineReader. For printed and typewritten material, we used ABBYY FineReader’s SDK. ABBYY is a commercial OCR (optical character recognition) engine, meaning a piece of software that converts pictures of text into actual text characters a computer can search and process. In 2020, ABBYY was comfortably the strongest commercial OCR engine for structured document recognition. We chose it over open-source alternatives like Tesseract because typed correspondence and printed lab reports were the easy 60% of the archive, and ABBYY handled them with accuracy rates above 98% out of the box. Spending engineering time tuning a free tool to match a paid one would have been a false economy.

Stage three: reading the handwriting with a custom neural network. This was the core of the project, and the part where off-the-shelf tools stopped being good enough. We trained a custom model with two components stitched together.

The first component was a convolutional neural network (CNN). A CNN is a type of model that learns to recognize visual patterns by sliding small filters across an image, the way you might run a magnifying glass over a page looking for specific shapes. We used it as the “eyes” of the system, learning to spot loops, strokes, ascenders and descenders without initially knowing what letters they represented. CNNs are the standard architecture for any task where the input is an image and the model needs to learn what’s visually distinctive about it.

The second component was a bidirectional long short-term memory network (BiLSTM). This is a type of model designed to read sequences in order, holding context in a kind of working memory as it goes. The “bidirectional” part means it reads each line both left-to-right and right-to-left, so it can use later context to interpret earlier characters. This is where the model started to understand that a squiggle after “peni” is probably a “c” (penicillin) rather than an “s,” because the surrounding characters constrain what’s plausible.

Stitching these together, we used a training method called connectionist temporal classification (CTC) loss. CTC solves a specific practical problem: handwriting has no clear boundaries between characters. An “m” in one doctor’s hand might take up the same space as three letters in another’s. CTC lets the network learn to match a predicted character sequence to the true text without anyone having to manually mark where each character starts and ends. Without it, you would need an army of annotators to draw bounding boxes around every individual letter. With it, you can train on transcribed lines.
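The three pieces fit together in surprisingly little code. The skeleton below is a sketch in PyTorch (an assumption for illustration; layer sizes, alphabet size, and image dimensions are placeholders rather than the production values):

```python
# Minimal CNN-BiLSTM-CTC skeleton. Dimensions and the 80-character
# alphabet are illustrative assumptions.
import torch
import torch.nn as nn

class HandwritingRecognizer(nn.Module):
    def __init__(self, n_chars: int = 80):
        super().__init__()
        # CNN "eyes": learn local visual patterns (strokes, loops).
        self.cnn = nn.Sequential(
            nn.Conv2d(1, 32, 3, padding=1), nn.ReLU(), nn.MaxPool2d(2),
            nn.Conv2d(32, 64, 3, padding=1), nn.ReLU(), nn.MaxPool2d(2),
        )
        # BiLSTM "reader": scan feature columns in both directions.
        self.lstm = nn.LSTM(64 * 8, 128, num_layers=2,
                            bidirectional=True, batch_first=True)
        # Per-timestep character scores; index 0 is the CTC blank.
        self.head = nn.Linear(256, n_chars + 1)

    def forward(self, x):            # x: (batch, 1, 32, width)
        f = self.cnn(x)              # (batch, 64, 8, width / 4)
        b, c, h, w = f.shape
        f = f.permute(0, 3, 1, 2).reshape(b, w, c * h)  # one vector per column
        out, _ = self.lstm(f)
        return self.head(out).log_softmax(-1)

model = HandwritingRecognizer()
images = torch.randn(4, 1, 32, 128)          # four fake line images
log_probs = model(images)                    # (4, 32, 81)

# CTC aligns predictions to transcripts without per-character boxes:
# only the target text is needed, not where each letter sits.
targets = torch.randint(1, 81, (4, 10))      # ten characters per fake line
loss = nn.CTCLoss(blank=0)(
    log_probs.permute(1, 0, 2),              # CTC expects (time, batch, chars)
    targets,
    input_lengths=torch.full((4,), 32),
    target_lengths=torch.full((4,), 10))
```

Note that the labels are just transcribed lines; nowhere does the code need bounding boxes around individual letters, which is exactly the annotation burden CTC removes.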

Why this combination rather than something simpler? By 2020, CNN-BiLSTM-CTC was the established architecture for offline handwriting recognition in the academic literature. The gap between academic demonstrations on tidy datasets and production deployment on real clinical notes was still substantial, but the architecture itself was a known-good starting point. We were not going to invent something better than what handwriting recognition researchers had spent a decade refining.

We used transfer learning from the IAM Handwriting Database. Transfer learning means starting from a model that has already been trained on a related task, then fine-tuning it on your specific data. The IAM database is the standard benchmark for offline handwriting recognition, containing roughly 13,000 lines of handwritten English from about 650 writers. Training a neural network from scratch needs enormous amounts of labeled data. Starting from IAM weights meant our model arrived already knowing general English handwriting patterns. We then fine-tuned it on roughly 3,200 pages of manually transcribed clinical notes from the organization’s archive. Think of it as hiring someone who can already read cursive English and then teaching them medical shorthand, rather than teaching a child to read from scratch.

The transcription work was done by a team of retired medical secretaries recruited specifically for the project. People who had spent decades reading exactly this kind of handwriting. Their expertise was irreplaceable, and they knew it, charging accordingly but worth every penny.

Stage four: a medical dictionary correction layer. After the neural network produced a transcription, we ran it through a correction layer built from medical terminology dictionaries (the BNF, ICD-10 codes, SNOMED CT, and the organization’s own claims taxonomy). This was a constrained language model based on n-grams (statistical patterns of which words tend to follow which other words), not a neural network. Its job was to resolve ambiguous character sequences by favoring medically plausible terms. “Diclofenac” and “Diazepam” look similar in poor handwriting, but one is far more common in orthopaedic notes than the other, and the surrounding clinical context usually disambiguates. We chose an n-gram approach over something more sophisticated because the problem was narrow and the data well-defined. A bigger model would have added cost without adding accuracy.
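A toy version of the idea, with invented counts, shows why n-grams were enough. The real layer was built from the BNF, ICD-10, SNOMED CT, and the claims taxonomy rather than four hand-typed entries:

```python
# Toy bigram correction: pick the candidate term most plausible given
# the preceding word. Counts are invented for illustration.
from collections import Counter

# Bigram counts harvested from a (hypothetical) corpus of clinical notes.
bigrams = Counter({
    ("prescribed", "diclofenac"): 118,
    ("prescribed", "diazepam"): 9,
    ("anxious", "diazepam"): 41,
    ("anxious", "diclofenac"): 2,
})

def correct(prev_word: str, ocr_candidates: list[str]) -> str:
    """Resolve an ambiguous OCR output by bigram frequency."""
    return max(ocr_candidates, key=lambda w: bigrams[(prev_word, w)])

# The network can't decide between two visually similar drug names,
# but the preceding word usually settles it.
correct("prescribed", ["diclofenac", "diazepam"])   # → "diclofenac"
correct("anxious", ["diclofenac", "diazepam"])      # → "diazepam"
```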

The completed system achieved a character error rate of 4.7% on handwritten notes, roughly comparable to a skilled medical secretary working from the same material, and under 1% on printed text.

Challenge two: keeping patient data private

Medical records contain some of the most sensitive personal data there is. Before processing any of this material through a pattern-matching model, we needed to remove patient-identifiable information while preserving the clinical and legal content that made the records useful for prediction. This is harder than it sounds, because you cannot simply blank out every name or number. Some names (the treating clinician) are predictive features that the model needs. Others (the patient) must never appear in the training data.

We built a two-tier sanitization system.

Tier one: regular expressions for structured identifiers. A regular expression (or regex) is a pattern-matching syntax that lets you describe the shape of a piece of text and find every instance of it. NHS numbers always follow a specific format. So do GMC registration numbers, National Insurance numbers, postcodes, telephone numbers and dates of birth in common formats. These are the easy cases, and regex handles them reliably and predictably. We used regex here rather than something fancier because deterministic pattern matching is the right tool when the patterns are stable and you need to know exactly what the system will do.
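A few simplified patterns make the "shape of the text" idea concrete. These are illustrative sketches of the real rules (the production NHS-number check, for instance, also validated the checksum digit):

```python
# Simplified tier-one redaction patterns; illustrative, not exhaustive.
import re

PATTERNS = {
    # NHS numbers: ten digits, usually written 3-3-4.
    "NHS_NUMBER": re.compile(r"\b\d{3}[ -]?\d{3}[ -]?\d{4}\b"),
    # GMC registration numbers: seven digits.
    "GMC_NUMBER": re.compile(r"\bGMC[ :]*\d{7}\b"),
    # UK postcodes, simplified.
    "POSTCODE": re.compile(r"\b[A-Z]{1,2}\d[A-Z\d]?\s*\d[A-Z]{2}\b"),
}

def redact(text: str) -> str:
    for label, pattern in PATTERNS.items():
        text = pattern.sub(f"[{label}]", text)
    return text

redact("Patient 943 476 5919, seen at EH12 9DN by GMC 1234567.")
# → "Patient [NHS_NUMBER], seen at [POSTCODE] by [GMC_NUMBER]."
```

Because the patterns are deterministic, every redaction decision can be replayed and audited, which is precisely why regex beat a learned model for this tier.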

Tier two: a named entity recognition model built on spaCy. Named entity recognition (NER) is a technique for identifying real-world entities (people, places, organizations) inside free-form text. spaCy is an open-source natural language processing library that includes pre-trained NER models you can fine-tune on your own data. We chose spaCy over alternatives because it’s fast, well-documented, and easy to retrain on a specific domain.

We fine-tuned the spaCy NER model on a custom-annotated corpus of medico-legal documents. It identified person names, hospital names, GP practice names, and geographic references that the regex layer would miss. Crucially, we trained it to distinguish between treating clinicians (which needed pseudonymization but were relevant features for the prediction model) and patient names (which required full redaction).
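The fine-tuning loop itself is compact. The sketch below shows the mechanics on a single invented sentence with the two labels that mattered most; the real corpus was thousands of annotated medico-legal documents:

```python
# Sketch of fine-tuning a spaCy NER pipe to separate clinicians from
# patients. The training sentence and labels are invented examples.
import spacy
from spacy.training import Example

nlp = spacy.blank("en")
ner = nlp.add_pipe("ner")
ner.add_label("CLINICIAN")       # pseudonymize, keep as a feature
ner.add_label("PATIENT")         # redact entirely

examples = [
    Example.from_dict(
        nlp.make_doc("Mr Smith was seen by Dr Jones"),
        {"entities": [(0, 8, "PATIENT"), (21, 29, "CLINICIAN")]}),
]

nlp.initialize(get_examples=lambda: examples)
for _ in range(20):              # real training: many epochs, many docs
    nlp.update(examples)
```

The crucial detail is that "person" is not one label here. The same NER machinery that finds names also assigns the role, and the role determines whether the name is redacted or pseudonymized downstream.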

We chose pseudonymization rather than full anonymization for clinician identifiers because the treating professional’s identity strongly predicted claim outcomes. Some practitioners had claim histories that meaningfully shifted settlement estimates, and the model needed to retain that signal without exposing raw identity data. Pseudonymization means assigning each clinician a persistent fake identifier (Doctor_47821, say) that stays consistent across cases but cannot be traced back to a real person. The model learns that Doctor_47821 has a particular pattern without knowing who they are.
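One common way to implement this kind of persistent pseudonym is a keyed hash, sketched below. Whether the production system used a keyed hash or a lookup table is an implementation detail; the key property is the same token every time, reversible only with a secret held outside the training environment:

```python
# Persistent pseudonymization via HMAC: same input, same token, but
# reversing the mapping requires the secret key. Key and ID format
# are illustrative.
import hmac
import hashlib

SECRET_KEY = b"kept-in-a-vault-not-in-the-repo"

def pseudonymize(gmc_number: str) -> str:
    digest = hmac.new(SECRET_KEY, gmc_number.encode(), hashlib.sha256)
    return f"Doctor_{int(digest.hexdigest(), 16) % 100_000:05d}"

token = pseudonymize("1234567")   # e.g. "Doctor_NNNNN", stable across cases
```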

The sanitization process was validated against a manually reviewed sample of 500 documents. It achieved a recall of 99.2% for patient-identifiable information (it caught 99.2% of all PII present) and a precision of 97.8% (only 2.2% of redactions were false positives, often medical terms that look like proper nouns). The 0.8% of PII it missed was caught during a human review step applied to a random 10% sample of processed documents, with any failures triggering a review of the entire batch.

Challenge three: predicting settlement values

With clean, structured data from 50 years of resolved cases, we could finally build the prediction model. The target variable was the ultimate settlement cost: the total financial outlay when a case reached final resolution, whether by settlement, court judgment, or withdrawal.

Feature engineering was where domain expertise proved invaluable. Feature engineering is the process of turning raw data into the specific input variables a model uses to make predictions. The model doesn’t ingest a case file directly. It ingests a structured row of numbers and categories that represent the case. Choosing what to put in that row determines what the model can learn.

We extracted around 140 features from each case, in several groups.

Case metadata included the clinical specialty involved (obstetrics claims behave very differently from dental negligence), the type of allegation (failure to diagnose, surgical error, medication error, consent failure), the claimant’s age at the time of the incident, geographic region (which correlated with court jurisdiction and local legal culture), and the year of notification.

Clinical features extracted from the OCR-processed notes included severity grading of the adverse outcome (using a modified Clavien-Dindo classification for surgical cases and a bespoke severity taxonomy for other specialties), whether the incident involved a death, the number of treating clinicians involved, and whether the notes indicated deviations from established clinical guidelines.

Legal features included the identity of the claimant’s solicitor firm (certain firms had historical patterns that strongly predicted settlement behavior), whether legal aid was involved, the stage at which proceedings were issued, and the presence of expert witness reports in the early notification.

Textual features came from the free-text clinical narratives, which are too messy to use directly. We turned them into numerical features using three different techniques, each capturing something different.

The first was TF-IDF (term frequency–inverse document frequency). TF-IDF scores each word in a document by how often it appears in that document compared to how common it is across the whole archive. A word like “negligence” appearing in every case file scores low, because it offers little distinctive information. A phrase like “brachial plexus” found only in a few obstetric cases scores high, because it’s a strong differentiator. We used TF-IDF because it’s simple, fast, and interpretable. You can look at the highest-scoring terms for any case and immediately understand what makes it distinctive.
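The scoring fits in a few lines, which is part of its appeal. A from-scratch version (the production system used a library implementation; the documents below are invented):

```python
# TF-IDF from scratch: score a term by how frequent it is in one
# document versus how common it is across the corpus. Toy documents.
import math

docs = [
    "alleged failure to diagnose brachial plexus injury at delivery",
    "alleged failure to obtain consent before extraction",
    "alleged delay in diagnosis of fracture",
]

def tf_idf(term: str, doc: str, corpus: list[str]) -> float:
    words = doc.split()
    tf = words.count(term) / len(words)               # term frequency
    df = sum(term in d.split() for d in corpus)       # document frequency
    idf = math.log(len(corpus) / df) if df else 0.0   # inverse doc freq
    return tf * idf

# "alleged" appears in every document, so it carries no signal...
tf_idf("alleged", docs[0], docs)     # → 0.0  (idf = log 3/3 = 0)
# ...while "brachial" is distinctive to one case, so it scores high.
tf_idf("brachial", docs[0], docs)
```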

The second was doc2vec. Doc2vec learns dense vector representations of entire documents that capture semantic relationships rather than just word counts. Two case narratives about missed fracture diagnoses would yield similar doc2vec vectors even if they used different specific words. We added doc2vec because TF-IDF cannot tell that “missed fracture” and “undiagnosed broken bone” mean roughly the same thing. We then reduced these vectors to 50 dimensions using principal component analysis (PCA), a standard technique for compressing high-dimensional data into a smaller number of variables while preserving most of the useful information. This made the doc2vec features manageable for the prediction model without losing their main signal.
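The PCA step can be shown in isolation. The sketch below uses numpy's SVD on random stand-in vectors (in production the 300-dimensional inputs came from a trained doc2vec model, not a random generator):

```python
# PCA via SVD: compress 300-dimensional document vectors to 50
# components, ordered by how much variance each one preserves.
# The input vectors are random stand-ins for real doc2vec output.
import numpy as np

rng = np.random.default_rng(2)
doc_vectors = rng.normal(size=(500, 300))        # 500 cases x 300 dims

centered = doc_vectors - doc_vectors.mean(axis=0)
_, _, vt = np.linalg.svd(centered, full_matrices=False)
reduced = centered @ vt[:50].T                   # keep the top 50 components
```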

The third was Latent Dirichlet Allocation (LDA), a topic modeling algorithm that discovers hidden thematic clusters in a text corpus. We told LDA to look for recurring patterns and it identified which words tended to appear together. It might find that one topic clustered around “extraction,” “nerve,” “inferior alveolar,” “numbness,” “lingual” (dental nerve injury cases) without anyone explicitly telling it that such a category existed. Each case then received a probability distribution over all identified topics, which served as another input to the prediction model. We used LDA because it surfaces patterns that even the domain experts had not consciously named.
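A minimal LDA run, here with scikit-learn as a stand-in for whichever implementation a given project uses, and a four-note invented corpus built around the dental-nerve and obstetric examples above:

```python
# Topic modeling with LDA: each note becomes a probability
# distribution over discovered topics. Corpus is invented.
from sklearn.feature_extraction.text import CountVectorizer
from sklearn.decomposition import LatentDirichletAllocation

notes = [
    "extraction inferior alveolar nerve numbness lingual",
    "extraction lingual nerve numbness paraesthesia",
    "shoulder dystocia brachial plexus delivery traction",
    "delivery brachial plexus erb palsy traction",
]

counts = CountVectorizer().fit_transform(notes)
lda = LatentDirichletAllocation(n_components=2, random_state=0).fit(counts)

# Each case's topic mixture became another block of input features
# for the prediction model.
topic_mix = lda.transform(counts)     # shape: (4 cases, 2 topics)
```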

The prediction model itself: XGBoost. XGBoost is an open-source implementation of gradient-boosted decision trees. A decision tree is a model that makes predictions by asking a series of yes/no questions about the input (“Is the specialty obstetrics? Is the claimant under 18? Is there a death?”). Gradient boosting builds hundreds of small trees one after another, with each new tree focused on correcting the mistakes of the previous ones. The result is an ensemble that captures complex patterns while staying interpretable.

We chose XGBoost after benchmarking it against random forests, elastic net regression, and a feedforward neural network. XGBoost won on predictive accuracy, training speed, and (most importantly) interpretability. Interpretability mattered because the tool’s outputs needed to be explained to medico-legal advisers who would be skeptical of any black box telling them what to do. With XGBoost we could show, for any given prediction, which features pushed the estimate up and which pushed it down. With a neural network we could not have done that nearly as cleanly. XGBoost also handled the practical realities of our data well: missing values, mixed categorical and numerical inputs, and a moderate dataset size (tens of thousands of cases, not millions) where neural networks often struggle without extensive tuning.

Walk-forward cross-validation. We trained using walk-forward cross-validation, a method that respects chronology by always training on older data and testing on newer data. Standard cross-validation randomly shuffles the dataset, which risks information leakage with time-series data: the model might train on a 2018 case and test on a 2014 case, effectively learning from the future. Walk-forward validation mimics real-world performance, where the model only ever sees past cases before predicting current ones. The training set covered cases notified between 1970 and 2015, with cases from 2016–2018 as the validation set, and 2019 cases held out as the final test set. We chose this approach because anything else would have given us inflated accuracy numbers that fell apart in production.
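The splitting logic is simple enough to show in miniature. Years below are illustrative, but the invariant is the real point: no test case ever predates a training case.

```python
# Walk-forward splitting: train on everything up to a cutoff year,
# evaluate on the years after it. Years here are illustrative.
def walk_forward_splits(cases, cutoffs):
    """Yield (train, test) lists, always training strictly on the past."""
    for cutoff in cutoffs:
        train = [c for c in cases if c["year"] <= cutoff]
        test = [c for c in cases if c["year"] > cutoff]
        yield train, test

cases = [{"year": y} for y in range(2010, 2020)]
for train, test in walk_forward_splits(cases, [2015, 2016, 2017]):
    # The leakage check: the newest training case predates
    # the oldest test case in every split.
    assert max(c["year"] for c in train) < min(c["year"] for c in test)
```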

On the test set, the model achieved a mean absolute error of £14,200 against final settlement values, compared to £31,800 for the initial reserves set by human medico-legal advisers. That’s a 55% reduction in reserving error. Perhaps more importantly, the model’s errors were roughly symmetrically distributed around zero (it overestimated as often as it underestimated), whereas the human advisers’ errors were heavily skewed toward overestimation.

Quantile regression forests for prediction intervals. We also built a quantile regression forest to provide prediction intervals rather than point estimates. A standard random forest predicts the average outcome. A quantile regression forest predicts specific percentiles of the outcome distribution, giving the advisers a likely settlement range at the 25th and 75th percentiles rather than a single number. This turned out to be the feature that drove adoption, because it reframed the tool from “the computer says £82,000” to “this case is likely to settle between £64,000 and £108,000, with a central estimate of £82,000.” It also flagged cases with wide prediction intervals (high uncertainty) as ones deserving more human attention, which the advisers appreciated rather than resented. They could apply their clinical judgment within that range rather than feeling overruled by it.
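The same interval-producing idea can be sketched with scikit-learn's quantile gradient boosting, used here as a convenient stand-in for the quantile regression forest we deployed (both predict percentiles of the outcome rather than its mean). Data are synthetic:

```python
# Prediction intervals via quantile loss: fit one model per percentile.
# A stand-in for a quantile regression forest; synthetic data.
import numpy as np
from sklearn.ensemble import GradientBoostingRegressor

rng = np.random.default_rng(1)
X = rng.random((1000, 3))
y = 50_000 + 100_000 * X[:, 0] + rng.normal(0, 15_000, 1000)

lower = GradientBoostingRegressor(loss="quantile", alpha=0.25).fit(X, y)
upper = GradientBoostingRegressor(loss="quantile", alpha=0.75).fit(X, y)

case = X[:1]
low, high = lower.predict(case)[0], upper.predict(case)[0]
# Presented to advisers as "likely to settle between low and high";
# a wide gap flags the case for closer human attention.
```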

Deployment and adoption

The tool was deployed as a web application built on Flask (a lightweight Python web framework, chosen because the rest of the stack was Python and we wanted minimal moving parts), sitting behind the organization’s VPN and authenticating against their Active Directory. The prediction engine ran on a pair of on-premise GPU servers (NVIDIA T4s) that also handled OCR processing for new case ingestion. The organization’s information security team would not tolerate any cloud deployment for a system handling medical records. In 2020, this was a sensible position rather than mere conservatism.

Adoption proved to be the political challenge we anticipated. Medico-legal advisers are by definition highly qualified professionals who have spent years honing their judgment. Telling them that a machine could do part of their job better was never going to be an easy conversation. We approached it as a decision-support tool rather than an automated replacement, and we framed the model’s outputs as a second opinion rather than a recommendation.

The breakthrough came from the quantile regression intervals. When an adviser saw that their proposed reserve of £150,000 sat well above the model’s 75th percentile of £95,000, it sparked a constructive conversation about what the adviser knew that the model did not, rather than an outright acceptance or rejection. In about 70% of cases where the adviser’s initial estimate exceeded the model’s 75th percentile, they revised their reserve downward after reviewing the model’s reasoning. In the remaining 30%, the adviser had case-specific knowledge (an aggressive claimant solicitor, an unusually sympathetic claimant, pending regulatory action) that justified the higher figure, and the model’s output was overridden with a documented rationale.

Results

Over the 24 months following full deployment in Q3 2020, the tool’s impact was measured against the counterfactual of the pre-deployment reserving pattern.

Aggregate over-reservation across the active claims book fell from approximately £38 million to £21 million, a reduction of £17 million in unnecessarily locked-up capital. Not all of this was attributable to the model (the organization had also tightened its reserving review process as part of the same program), but the actuarial team attributed approximately £12 million of the improvement directly to model-assisted reserving decisions.

The £12 million in freed capital, returned to the organization’s long-term investment portfolio, generated approximately £2.8 million in additional investment returns over the two-year measurement period, based on the portfolio’s net return of 5.3% during that window versus the 1.4% earned on claims reserves.

Operational savings came from reducing the number of formal reserving reviews needed per case, dropping from an average of 3.2 to 2.1 over a case’s lifetime, which saved about 4,400 hours of senior adviser time each year. At a fully loaded cost of roughly £120 per hour for medico-legal advisers, this equated to a further £1.06 million annually, or £2.12 million across the measurement period. Much of this time was reallocated to higher-value advisory work rather than cut through headcount reductions, so the actual cash savings were more modest. We estimated the net operational saving at around £1.4 million.

Total quantifiable benefit over two years was about £4.2 million against a project cost of roughly £680,000 (covering manual transcription work, retired medical secretaries, infrastructure, and our fees). Return on investment was approximately 6:1 over two years.

What this taught us, and how we would do it now

The project predated the arrival of large language models by roughly two years. By 2023, much of what we built from scratch could have been approximated with GPT or Claude in a fraction of the development time and at a fraction of the cost.

The OCR system would be the most dramatic simplification. Modern vision-language models can read handwritten text directly from images with no custom training. You would upload a scanned page of clinical notes to Claude or GPT with vision, ask it to transcribe the content, and get back a reasonable result on the first attempt. The months we spent training a CNN-BiLSTM architecture, the 3,200 pages of manually transcribed ground truth, the retired medical secretaries and their hard-won expertise in deciphering clinical shorthand: all of that would collapse into an API call. The accuracy would not be identical to our bespoke system (our 4.7% character error rate was tuned specifically for medical handwriting), but it would be close enough for most purposes and orders of magnitude faster to deploy.

The sanitization layer would contract similarly. The custom spaCy NER model trained on annotated medico-legal documents, the regex patterns for NHS numbers and GMC registrations, the pseudonymization logic for clinician identifiers, most of this could now be handled by prompting an LLM with clear instructions about what to redact and what to preserve. A validation step would still be necessary (LLMs are not deterministic, and missing a patient name in a medical record is not an acceptable error), but the development effort would drop from weeks to days.

The post-OCR correction layer, our n-gram medical language model that resolved ambiguous character sequences by preferring clinically plausible terms, would be almost redundant. An LLM reading handwritten clinical notes already has the medical knowledge to understand that “diclofenac” is more likely than “diazepam” in an orthopaedic context. The disambiguation that took us weeks of dictionary construction and n-gram tuning is built into the model’s training data.

The prediction model, however, still needs to be built for purpose. This part has not changed and is unlikely to change soon. Generative models are not good at producing calibrated probabilistic estimates from structured tabular data. If you ask GPT to estimate the likely settlement value of a clinical negligence case based on 140 structured input variables, you will get a plausible-sounding number wrapped in confident language.

You will not get a properly calibrated prediction with quantified uncertainty intervals. The model cannot learn from the specific loss distributions in your historical data, cannot account for the unique legal culture of your jurisdiction, and has no mechanism for producing the percentile-based ranges that make our tool useful to advisers. XGBoost (or a modern equivalent like LightGBM or CatBoost) remains the right tool for this type of problem.

What you would build today is a hybrid. An LLM-powered ingestion layer that reads scanned documents, extracts and structures the clinical narrative, sanitizes patient data, and populates the feature set, feeding into a purpose-trained gradient-boosted model that produces calibrated settlement estimates with prediction intervals. The ingestion would take days rather than months. The prediction model would still require careful feature engineering, domain expertise, and proper temporal validation. The total project cost might fall from £680,000 to closer to £200,000, with most of the savings coming from eliminating custom OCR development and manual transcription.

What would remain completely unchanged are the parts of the project that were about people, not technology. The months of building trust with advisers who had legitimate reasons to be skeptical. The political work of framing a prediction tool as a colleague rather than a replacement. The decision to show ranges instead of point estimates, which was a design choice rooted in empathy, not engineering. You could build the technology in a tenth of the time. The adoption work would still take exactly as long as it did in 2020, because the human dynamics of a machine questioning a professional’s judgment do not follow Moore’s Law.

The swear box for calling Machine Learning AI, incidentally, collected £47.50 over the course of the project. Management contributed most of it. We spent it on biscuits.

