Writing an EPUB parser. Part 1
##############################
:date: 2022-06-01 01:41
:author: tyrel
:category: Python
:tags: epub, python
:slug: writing-an-epub-parser-part-1
:status: published

Parsing Epubs
-------------

Recently I've become frustrated with the experience of reading books on my Kindle Paperwhite. The swipe features really bother me. I really like MoonReader on Android, but reading on my phone isn't always pleasant. This led me to look into other hardware. I've been eyeing BOOX for a while, and I'm definitely considering some of their new offerings at some point. Until I can afford to splurge on a new ebook reader, I've decided to start a new project: making my own ebook reader tools!

I'm starting with EPUBs, as this is one of the easiest formats to work with. At its core, an EPUB is a zip file with the ``.epub`` extension instead of ``.zip``, with many individual XHTML chapter files inside it. You can read more about how they're structured over at `FILEFORMAT <https://docs.fileformat.com/ebook/epub/>`__.
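
To make the "it's just a zip file" point concrete, here is a minimal sketch that uses only the standard library. It builds a toy, EPUB-shaped archive in memory (the file names ``OEBPS/content.opf`` and ``OEBPS/chapter1.xhtml`` are made up for illustration, and the archive is not a complete, valid book), then lists its entries and reads ``META-INF/container.xml`` to find where the package document lives.

```python
import io
import zipfile
import xml.etree.ElementTree as ET
from typing import List, Tuple

# XML namespace used by META-INF/container.xml in an EPUB container.
CONTAINER_NS = {"c": "urn:oasis:names:tc:opendocument:xmlns:container"}


def build_toy_epub() -> bytes:
    """Build a tiny EPUB-shaped zip in memory (hypothetical contents)."""
    container = (
        '<?xml version="1.0"?>'
        '<container version="1.0" '
        'xmlns="urn:oasis:names:tc:opendocument:xmlns:container">'
        '<rootfiles><rootfile full-path="OEBPS/content.opf" '
        'media-type="application/oebps-package+xml"/></rootfiles>'
        "</container>"
    )
    buf = io.BytesIO()
    with zipfile.ZipFile(buf, "w") as zf:
        zf.writestr("mimetype", "application/epub+zip")
        zf.writestr("META-INF/container.xml", container)
        zf.writestr("OEBPS/chapter1.xhtml", "<html><body><p>Hi</p></body></html>")
    return buf.getvalue()


def list_epub(data: bytes) -> Tuple[List[str], str]:
    """Return the archive's file names and the path of the package document."""
    with zipfile.ZipFile(io.BytesIO(data)) as zf:
        names = zf.namelist()
        root = ET.fromstring(zf.read("META-INF/container.xml"))
    rootfile = root.find("c:rootfiles/c:rootfile", CONTAINER_NS)
    return names, rootfile.get("full-path")


names, opf_path = list_epub(build_toy_epub())
print(names)
print(opf_path)
```

With a real book you would pass the bytes of the ``.epub`` file instead of the toy archive; the layout inside will vary from book to book.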

The tool I've chosen for reading EPUBs is the Python library `ebooklib <https://pypi.org/project/EbookLib/>`__, which seemed to be a nice lightweight library for the job. I also used `DearPyGUI <https://pypi.org/project/dearpygui/>`__ to show the result on screen, because I figured why not, I like GUI libraries.

My first task was to find an EPUB file, so I downloaded one from my calibre server. I convert all my ebook files to ``.epub`` and ``.mobi`` on my calibre server so I can access them anywhere I can read my OPDS feed. I chose Throne of Glass (abbreviated to ``TOG.epub`` for the rest of this post). I launched Python and ran:

.. code:: console

    >>> from ebooklib import epub
    >>> print(book := epub.read_epub("TOG.epub"))

This returned a ``<ebooklib.epub.EpubBook object...>``. Seeing that I had an ``EpubBook``, I ran ``dir(book)`` and found the properties available to me:

.. code:: python

    ['add_author', 'add_item', 'add_metadata', 'add_prefix',
     'bindings',
     'direction',
     'get_item_with_href', 'get_item_with_id', 'get_items',
     'get_items_of_media_type', 'get_items_of_type', 'get_metadata',
     'get_template', 'guide',
     'items',
     'language',
     'metadata',
     'namespaces',
     'pages', 'prefixes',
     'reset',
     'set_cover', 'set_direction', 'set_identifier', 'set_language', 'set_template', 'set_title', 'set_unique_metadata', 'spine',
     'templates', 'title', 'toc',
     'uid',
     'version']

Of note, the ``get_item_with_X`` entries caught my eye, as well as ``spine``. For my file, ``book.spine`` gave me a bunch of tuples, each holding an ID and a ``"yes"`` string whose meaning I didn't know. I then noticed the ``toc`` property; assuming that was a Table of Contents, I printed it out and saw a bunch of ``epub.Link`` objects. This looked like something I could use.
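
The ``"yes"`` string most likely mirrors the ``linear`` attribute that each ``<itemref>`` in the package document's ``<spine>`` may carry, which defaults to ``yes`` when omitted. The sketch below is an illustration of that reading, not ebooklib's actual code: it parses a made-up, trimmed-down OPF snippet with the standard library and produces the same ``(id, linear)`` tuple shape.

```python
import xml.etree.ElementTree as ET
from typing import List, Tuple

OPF_NS = {"opf": "http://www.idpf.org/2007/opf"}

# Hypothetical, trimmed-down package document; real OPF files carry much more.
SAMPLE_OPF = """<?xml version="1.0"?>
<package xmlns="http://www.idpf.org/2007/opf" version="3.0">
  <spine>
    <itemref idref="cover" linear="no"/>
    <itemref idref="chapter1"/>
    <itemref idref="chapter2" linear="yes"/>
  </spine>
</package>"""


def read_spine(opf_xml: str) -> List[Tuple[str, str]]:
    """Return (idref, linear) pairs in reading order; linear defaults to "yes"."""
    root = ET.fromstring(opf_xml)
    return [
        (ref.get("idref"), ref.get("linear", "yes"))
        for ref in root.findall("opf:spine/opf:itemref", OPF_NS)
    ]


spine = read_spine(SAMPLE_OPF)
print(spine)
```

Under that assumption, an entry marked ``"no"`` would be auxiliary content (a cover or notes page, say) that sits outside the main reading order.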

I will note that at this point I was thinking this wasn't the direction I wanted to take the project. I really wanted to learn how to parse these things myself (unzip, parse XML or HTML, etc.), but I realized I needed to see someone else's work to even know what was going on. With this "defeat for the evening" admitted, I figured, "hey, why not at least make SOMETHING, right?" and decided to carry on.

Seeing I was on at least some track, I opened up PyCharm and made a new project. First I set up a class called ``Epub``, made a couple of functions for setting things up, and ended up with:

.. code:: python

    from typing import List

    import ebooklib
    from ebooklib import epub


    class Epub:
        def __init__(self, book_path: str) -> None:
            self.contents: ebooklib.epub.EpubBook = epub.read_epub(book_path)
            self.title: str = self.contents.title
            self.toc: List[ebooklib.epub.Link] = self.contents.toc
            self.chapters: List[dict] = []  # filled in by parse_chapters

I then set up a ``parse_chapters`` method, where I loop through the TOC. I went to the definition of ``Link`` and saw I could get an ``href`` and a ``title``, so I decided my object for chapters would be a dictionary (I'll move to a dataclass later) with ``title`` and ``content``. I remembered from earlier that I had ``get_item_with_href``, so I stored the text from the TOC's href: ``self.contents.get_item_with_href(link.href).get_content()``. This would later prove to be a bad decision when I opened "The Fold.epub" and realized that a TOC entry could be a tuple of a ``Section`` and a list of ``Link`` objects, not just a ``Link``. I ended up storing the item itself, and adding a nested loop to ``parse_chapters`` to walk the links when an entry is a tuple.

.. code:: python

    def parse_chapters(self) -> None:
        idx = 0
        for _item in self.toc:
            if isinstance(_item, tuple):  # section entry: (section, [link, ...])
                for link in _item[1]:
                    self._parse_link(idx, link)
                    idx += 1
            else:
                self._parse_link(idx, _item)
                idx += 1
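
The nested loop handles one level of sections. If a TOC ever nests sections inside sections (an assumption; neither book mentioned here is confirmed to do so), a recursive generator is one way to flatten it. The ``Link`` and ``Section`` classes below are small stand-ins so the sketch runs without ebooklib; they are not the library's classes.

```python
from dataclasses import dataclass
from typing import Iterable, Iterator


@dataclass(frozen=True)
class Link:  # stand-in for ebooklib.epub.Link (only the fields used here)
    href: str
    title: str


@dataclass(frozen=True)
class Section:  # stand-in for ebooklib.epub.Section
    title: str


def iter_links(toc: Iterable) -> Iterator[Link]:
    """Yield every Link in reading order, descending into (section, children) tuples."""
    for entry in toc:
        if isinstance(entry, tuple):
            _section, children = entry
            yield from iter_links(children)
        else:
            yield entry


# Hypothetical TOC with a section nested inside another section.
toc = [
    Link("cover.xhtml", "Cover"),
    (Section("Part One"), [
        Link("ch1.xhtml", "Chapter 1"),
        (Section("Interlude"), [Link("ch1a.xhtml", "Chapter 1a")]),
    ]),
    Link("ch2.xhtml", "Chapter 2"),
]

chapters = [
    {"index": idx, "title": link.title, "href": link.href}
    for idx, link in enumerate(iter_links(toc))
]
print([chapter["title"] for chapter in chapters])
```

Using ``enumerate`` over the flattened links also removes the need to track ``idx`` by hand.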

``_parse_link`` simply builds that dictionary of ``title`` and ``item`` I mentioned earlier, plus a new ``index`` key, since I introduced buttons in DearPyGUI around this time as well.

.. code:: python

    def _parse_link(self, idx, link) -> None:
        title = link.title
        self.chapters.append(dict(
            index=idx,
            title=title,
            item=self.contents.get_item_with_href(link.href)
        ))
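
The move from a dictionary to a dataclass mentioned earlier could look roughly like this. It is only a sketch: the ``Chapter`` name and the ``add_chapter`` helper are hypothetical, and ``item`` is typed as ``Any`` because the real value would be whatever ``get_item_with_href`` returns.

```python
from dataclasses import dataclass
from typing import Any, List


@dataclass
class Chapter:
    index: int   # position in the chapter list, used for the GUI buttons
    title: str   # title taken from the TOC link
    item: Any    # in the real class, the object returned by get_item_with_href


chapters: List[Chapter] = []


def add_chapter(idx: int, title: str, item: Any) -> None:
    """Hypothetical replacement for the dict(...) call in _parse_link."""
    chapters.append(Chapter(index=idx, title=title, item=item))


add_chapter(0, "Dedication", item=None)  # None is a placeholder item
print(chapters[0].title)
```

Attribute access (``chapter.title``) then replaces key lookups (``chapter['title']``), and a typo in a field name fails loudly instead of silently creating a new key.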

That's really all there is to making an MVP of an EPUB parser. You can use ``BeautifulSoup`` to parse the HTML from the ``get_body_content()`` calls on items to produce more readable text if you want, but depending on your front end, the HTML may be what you want.

In my implementation, my ``Epub`` class keeps track of the currently selected chapter, so ``load_view`` looks that chapter up in the chapter list and sets the ``current_text`` variable.

.. code:: python

    from bs4 import BeautifulSoup

    def load_view(self) -> None:
        item = self.chapters[self.current_index]['item']
        soup = BeautifulSoup(item.get_body_content(), "html.parser")
        text = [para.get_text() for para in soup.find_all("p")]
        self.current_text = "\n".join(text)
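
For anyone who would rather avoid the extra dependency, the same paragraph extraction can be approximated with the standard library's ``html.parser``. This is a swapped-in technique, not what the reader above uses. The ``ParagraphText`` class and ``paragraphs_to_text`` helper are hypothetical names, and the sketch assumes the chapter body has already been decoded to a ``str``.

```python
from html.parser import HTMLParser
from typing import List


class ParagraphText(HTMLParser):
    """Collect the text found inside each <p> element."""

    def __init__(self) -> None:
        super().__init__()
        self.paragraphs: List[str] = []
        self._in_p = False
        self._buffer: List[str] = []

    def handle_starttag(self, tag, attrs):
        if tag == "p":
            self._in_p = True
            self._buffer = []

    def handle_endtag(self, tag):
        if tag == "p" and self._in_p:
            self.paragraphs.append("".join(self._buffer))
            self._in_p = False

    def handle_data(self, data):
        # Text inside inline tags such as <em> still arrives here.
        if self._in_p:
            self._buffer.append(data)


def paragraphs_to_text(body_html: str) -> str:
    """Join the text of every <p> element with newlines."""
    parser = ParagraphText()
    parser.feed(body_html)
    parser.close()
    return "\n".join(parser.paragraphs)


sample = "<div><h1>Dedication</h1><p>For <em>you</em>.</p><p>Second line.</p></div>"
print(paragraphs_to_text(sample))
```

BeautifulSoup is more forgiving with malformed markup, so this is mainly worth considering when the input is known to be well-formed XHTML.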

I don't believe any of this code will be useful to anyone outside of my research for now, but it's my first step into writing an EPUB parser myself.

The DearPyGUI steps are out of scope for this blog post, but here is my `final ebook Reader <https://gist.github.com/tyrelsouza/9c6681850fc00bf5d9f35568faf611d4>`__, which is super inefficient!

.. figure:: {static}/images/2022-06-01_ebook_reader.png
    :alt: final ebook reader, chapters on left, text on right

I figure the Dedication page is not *as* copyrighted as the rest of the book, so it's fair play showing that much. Sarah J. Maas, if you have any issues, I can find another book for my screenshots.