Take Project Gutenberg texts and turn them into 'audiobooks'
Project Gutenberg and Project Gutenberg Australia
So two steps are required:
-
process books into their component parts (which will probably be quite a challenge because they're in a variety of layout logics)
-
once we have books in 'chapters' we can pass them to some sort of reading (Text-To-Speech, TTS) software
We are currently at step 1. Work in progress.
Initially we're trying to extract chapters as separate text files from a single file utf-8 text version of a Project Gutenberg book.
If you're familiar with the variety of layouts in these text files then you'll realise it's not rocket science, but it's not trivial either.
Approach so far, sort of psuedo-coded goes something like:
-
identify and get - metadata, contents block if any, start and end of book marker line numbers
-
process contents block into a little 'chapters' dataset, and then try and identify where each chapter starts in body text
-
if there's no contents block try to create 'chapters' dataset by looking up 'Chapter XX' type titles within body text
-
once we have what we think is a sane 'chapters' dataset use it to run through (based on line numbers we've identified) and extract each chapter into its own, nicely named (e.g. booktitledDIR/XX-booktitle-chapterYY.txt), text file
Sounds simple enough, however, not quite so :-)
php testx.php [-t] [-t1] [-t3] [-p] [-d] [-D]
If successful will create a 'booktitle' directory, and place each extracted booktitle-chapter.txt file in there.
where:
-
-t test only (don't output chapter files), default off
-
-t1 run code verbose to test point 1 and exit, default off
-
-t2 run code verbose to test point 2 and exit, default off
-
-p invoke preprocessor (over entire text), default off
-
-d create dump file (as dir/bookname-debug-coded.txt), default off
-
-D create hex'd dumpfile (as dir/bookname-debug-coded.txt), default off
More refined, all output goes into 'book title' subdir, more error checking and exit points, dealing with roman numerals
Testing, with pretty good success, on:
-
Agatha Christie's "The Mysterious Affair at Styles"
-
Persuasion by Jane Austin
-
A Tale of Two Cities by Charles Dickens
-
Moby Dick by Herman Melville
-
(this one is a good 'guaranteed to break' test) - The peoples of Europe
Is more sane, predictable, and therefore robust. Tests fairly reliably on two 'types', over three books.
A frustrating case is this book which while a human can work out what's going on it's a large pain in the backside to line process because it does odd things like use _ to lead and trail chapter titles, breaks chapter titles over lines in both content block and within text etc. I'm guessing this one may fall into the bucket of hand edit first then process.
That raises the question of sensing and providing some warning/guidance from the processor when layouts are not understood - an interesting future to do maybe.
I've implemented initial 'book processing' test code in PHP, currently given a url it can:
-
character, line and word count
-
identify, isolate and array-load book metadata
-
identify start and end of book Gutenburg markers (and their line numbers)
-
identify, isolate and array-load book 'contents block' and chapters list from within, into 'Contents Block array'
-
(removed temporarily) if no 'contents block', identify 'Chapter X's within text and their meta, store in 'Chapters array'
-
for each Chapter find its starting line number 'markers' from within text and record them into 'Chapters array'
-
based on all this, the test processor cuts up the input and outputs each chapter as a separate text file
Testing has only taken place on the single Agatha Christie book in the test code (test.php) so far (https://www.gutenberg.org/files/863/863-0.txt)
- with the Agatha Christie, under current logic, the Contents Block outputs as Chapter 1..
Thanks to Charlie Harrington who triggered off this idea on his blog
'Say' command line tool on OSX can speak relatively sanely, although work will be required to better implement 'pauses' at full stops and paragraph breaks. Should be doable using voice commands as referenced in Apple docs. But it's good enough in test already to say if step one can be solved, then step two is just a choice of tools.
-
LibreVox - already done! Human voiced audiobooks
Other attempts to extract chapters, or parts, from Project Gutenberg Books