
How I Turned a Scanned Book into an Audiobook Using AI

Back story

Recently, I came across a fascinating book titled The Lunatic Express—a captivating account of Kenya’s history during the colonial and pre-colonial eras. As a history enthusiast, especially when it comes to understanding my country’s past, I was hooked from the start. However, there was one major hurdle: the book I had found was a scanned copy, and at nearly 600 pages, it wasn’t the easiest to read on a screen.

If you’ve ever tried reading a scanned PDF, you know how painful it can be. The orientation of the pages is often off, the text is blurry, and your eyes can start aching after just a few pages. A few weeks before stumbling upon The Lunatic Express, I had been playing around with some text-to-speech (TTS) AI models on Hugging Face, which got me thinking: What if I could use AI to transform this scanned book into an audiobook and avoid the screen fatigue?

How to get a copy of the book from the Internet Archive

At first, I thought I could download the book from the Internet Archive, where it was available for borrowing. However, the web app restricts access through a session-based borrowing system. I didn’t want to just read it online; I wanted to download it for offline use. After some research and a little tinkering with my browser’s inspect element tool, I realized I could bypass this and automate downloading all pages with a script.

Here’s a breakdown of how the Internet Archive’s borrowing system works:


Borrowing Process

  • Metadata Fetching:
    The app fetches the metadata for the book, marking it as “borrowed.” This ensures that the book is locked for your use during the loan period.

  • Heartbeat Mechanism:
    To make sure you’re still reading, a heartbeat signal is sent regularly. It seems that each heartbeat comes with a session token, which changes every minute.

  • Session Renewal:
    Each heartbeat triggers a session renewal. If a heartbeat is missed, the session expires and the book is returned (this keep-alive loop is sketched below).
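
To keep the loan alive outside the browser, something has to stand in for that heartbeat. Here’s a minimal sketch of the idea in Python; the heartbeat URL, request details, and cookie value are illustrative placeholders, since the real request is easiest to copy from the browser’s network inspector while the book is borrowed.

```python
import time

import requests

# Illustrative placeholders: copy the real heartbeat URL and cookie from the
# request the Internet Archive reader sends while the book is borrowed.
HEARTBEAT_URL = "https://archive.org/..."  # hypothetical endpoint
SESSION_COOKIE = "logged-in-user=...; logged-in-sig=..."  # from the browser session


def keep_loan_alive(interval_seconds: int = 60) -> None:
    """Re-send the reader's heartbeat so the borrowed session stays valid."""
    headers = {"Cookie": SESSION_COOKIE}
    while True:
        response = requests.post(HEARTBEAT_URL, headers=headers, timeout=30)
        # A non-200 response usually means the loan lapsed or the cookie expired.
        print(f"heartbeat -> {response.status_code}")
        time.sleep(interval_seconds)


if __name__ == "__main__":
    keep_loan_alive()
```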

By understanding how this system works, I was able to create a script that could make requests to the Internet Archive’s metadata API and fetch all the image links for the scanned pages. Once I had access to these pages in JPG or TIFF format, I was ready to get started.

Downloading the Book’s Pages

The first task was to create a script that would download all of the scanned pages. The script used the session cookie from my active browsing session to keep the connection alive, ensuring the “borrowed” book wasn’t returned before I had finished downloading all the images.

When fed my current session cookie, the script could request the book’s metadata, which contained information about the book, including links to every page in JPG or TIFF format.

The book.json metadata returned by the Internet Archive described the entire scan, including an image link for every page.
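
Walking that metadata with the session cookie attached is enough to pull down every page. Here’s a rough Python sketch of that loop; the page_urls key and the cookie string are illustrative placeholders rather than the actual values from book.json.

```python
import json
from pathlib import Path

import requests

# Illustrative placeholders: the real cookie comes from the browser session, and
# "page_urls" is an assumed name for wherever book.json lists the page images.
SESSION_COOKIE = "logged-in-user=...; logged-in-sig=..."


def download_pages(metadata_path: str = "book.json", out_dir: str = "pages") -> None:
    Path(out_dir).mkdir(exist_ok=True)
    metadata = json.loads(Path(metadata_path).read_text())
    headers = {"Cookie": SESSION_COOKIE}

    for index, url in enumerate(metadata["page_urls"], start=1):  # assumed key
        response = requests.get(url, headers=headers, timeout=60)
        response.raise_for_status()
        # Zero-padded names keep the pages in reading order when sorted later.
        Path(f"{out_dir}/page_{index:04d}.jpg").write_bytes(response.content)


if __name__ == "__main__":
    download_pages()
```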

Converting the Images to a PDF

Once I had all the images, the next task was to stitch them together into a PDF for easy offline reading. I used a simple Bash script, which looped through the saved page images, sorted them by page number, and combined them into a single PDF:
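
The same idea, sketched in Python with the img2pdf library (a substitute for the original Bash approach, which the post does not reproduce), looks roughly like this:

```python
import glob

import img2pdf  # pip install img2pdf


def build_pdf(pages_dir: str = "pages", output_path: str = "book.pdf") -> None:
    # Zero-padded filenames (page_0001.jpg, page_0002.jpg, ...) sort correctly
    # with a plain lexicographic sort.
    page_files = sorted(glob.glob(f"{pages_dir}/page_*.jpg"))
    with open(output_path, "wb") as pdf_file:
        pdf_file.write(img2pdf.convert(page_files))


if __name__ == "__main__":
    build_pdf()
```

img2pdf embeds the JPEGs as-is instead of re-encoding them, so the pages keep their original quality and the PDF stays reasonably small.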

OCR to Extract Text from the Images

While I could now read the book in PDF format, the next goal was to extract the text from these images for TTS processing. This is where Tesseract OCR came into play. Using Tesseract, I was able to convert the scanned images into editable text files, which I saved as {page}.txt.
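A minimal sketch of that step, calling the Tesseract command-line tool from Python and writing one text file per page (the pages/ and text/ directory names are just this sketch’s convention):

```python
import glob
import subprocess
from pathlib import Path


def ocr_pages(pages_dir: str = "pages", text_dir: str = "text") -> None:
    Path(text_dir).mkdir(exist_ok=True)
    for image_path in sorted(glob.glob(f"{pages_dir}/page_*.jpg")):
        page_name = Path(image_path).stem        # e.g. "page_0001"
        output_base = f"{text_dir}/{page_name}"  # tesseract appends ".txt" itself
        # "-l eng" selects the English language model; tesseract must be installed.
        subprocess.run(["tesseract", image_path, output_base, "-l", "eng"], check=True)


if __name__ == "__main__":
    ocr_pages()
```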

Converting the Text to Speech

With the text extracted, it was time to convert it into speech. I used a Python script that leveraged Hugging Face’s text-to-speech models for this step. After trying a few models, I settled on OuteTTS-0.3-500M, which provided the best results for my use case. However, there was a catch: the model had a character limit of around 50 characters per request, meaning I could only convert small chunks of text at a time.

To overcome this limitation, I used the NLTK (Natural Language Toolkit) library to tokenize the text into sentences. This allowed me to process each sentence individually and save it as a separate audio file in WAV format.

Here’s a simplified version of how I approached the problem:

  1. Tokenizing the Text:
    I used NLTK to break the text into sentences. This ensured that each sentence was under the character limit and that the text would flow naturally in the final audio.

  2. Processing Each Sentence:
    After tokenizing the text, I passed each sentence to the Hugging Face model and saved the output as a WAV file.

Here’s a snippet of the Python script I used:
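In outline, it amounts to the loop below. This is a sketch rather than the full script: synthesize_to_wav() is a hypothetical stand-in for the OuteTTS-0.3-500M generation call, and the directory names are illustrative.

```python
import glob
from pathlib import Path

import nltk
from nltk.tokenize import sent_tokenize

nltk.download("punkt")  # one-time download of the sentence tokenizer data


def synthesize_to_wav(sentence: str, wav_path: str) -> None:
    """Placeholder: call the TTS model (OuteTTS-0.3-500M in this post) and save a WAV here."""
    raise NotImplementedError


def book_to_wavs(text_dir: str = "text", audio_dir: str = "audio") -> None:
    Path(audio_dir).mkdir(exist_ok=True)
    clip_index = 0
    for text_path in sorted(glob.glob(f"{text_dir}/page_*.txt")):
        text = Path(text_path).read_text()
        # Sentence-level chunks keep each request small and the narration natural.
        for sentence in sent_tokenize(text):
            sentence = sentence.strip()
            if not sentence:
                continue
            clip_index += 1
            synthesize_to_wav(sentence, f"{audio_dir}/{clip_index:06d}.wav")


if __name__ == "__main__":
    book_to_wavs()
```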

Merging the WAV Files into One and Converting to MP3

After converting each sentence into a separate WAV file, the next step was to merge all these individual WAV files into a single, continuous audio file. Since the book had hundreds of pages, this meant a lot of individual files. Once they were all merged, the file size was enormous—almost 12GB. Clearly, this was impractical for storage or listening on the go.

To solve this, I used FFmpeg, a powerful tool for handling audio and video files. FFmpeg allowed me to merge all the WAV files into a single large file and then compress it into a much smaller MP3 format without losing much audio quality.

The script I used to do that:
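In sketch form, it comes down to two FFmpeg invocations: one concat pass that joins the WAV clips, and one encode pass that compresses the result to MP3. The 64 kbps bitrate below is an example value, not necessarily what the original script used.

```python
import glob
import subprocess
from pathlib import Path


def merge_and_compress(audio_dir: str = "audio", mp3_path: str = "audiobook.mp3") -> None:
    wav_files = sorted(glob.glob(f"{audio_dir}/*.wav"))

    # The concat demuxer reads a text file listing every input in order.
    list_path = "wav_list.txt"
    Path(list_path).write_text("".join(f"file '{wav}'\n" for wav in wav_files))

    # 1. Merge the clips into one large WAV without re-encoding.
    subprocess.run(
        ["ffmpeg", "-f", "concat", "-safe", "0", "-i", list_path, "-c", "copy", "merged.wav"],
        check=True,
    )
    # 2. Compress the merged WAV to MP3 (64 kbps is a reasonable bitrate for speech).
    subprocess.run(
        ["ffmpeg", "-i", "merged.wav", "-codec:a", "libmp3lame", "-b:a", "64k", mp3_path],
        check=True,
    )


if __name__ == "__main__":
    merge_and_compress()
```

The concat pass copies the audio samples without re-encoding, so the only quality loss happens in the final MP3 encode.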

My Personalized Audiobook

By the end of the process, I had successfully converted The Lunatic Express into an audiobook. I could now listen to the entire book, including all its intricate historical details, without dealing with the discomfort of reading a scanned document on a screen.

This project was not only a fun technical challenge but also a perfect example of how modern tools like AI and scripting can be combined to solve real-world problems.
