Using GREP to tweak EPUB files
Over the last few weeks, I have been working on some EPUB files that were generated from InDesign CS4 a while ago, in December 2010 I believe. While tweaking the files in Sigil, I noticed that many paragraphs include a class generated by InDesign that is absolutely useless. Here is an excerpt of the HTML file:
Every paragraph contains a span with the class “generated-style”, which was exported from InDesign. This doesn’t happen when exporting from InDesign CS5 or InDesign CS5.5! If you have a look at the CSS, nothing is defined for this automatically generated class. So you really don’t need it, and it only bloats your file.
GREP to the rescue
So what can you do about this issue? While it does not prevent your EPUB from working properly, you may still want clean code. So I had the idea of running a GREP search and deleting every occurrence of this useless class in the HTML files. Regular expressions can be really awesome for this kind of complicated search-and-replace work. Many programmes already support GREP search, InDesign among them. But in this case, for tweaking the EPUB files, I’m working with Sigil and BBEdit, which both understand regular expressions (commonly called regex).
Well, I’m not a great regex expert. I only use it for little things, and this challenge was a bit more complicated. I don’t know anyone around me who is good at regex, so I asked for help on Twitter. Twitter is really great for connecting with geeks and experts around the world. Shortly after my request, Ahmad Moqanasa (@AbuGnais) answered with a great search string. He’s a developer and geek from Amman, and he suggested the following:
(<[^<>/]*?class=\"generated-style"[^<>/]*?>)([^<>]*?)(</[^<>/]*?>)
Wow, this looks complicated, doesn’t it? I can’t explain it to you, but I wanted to share this bit of GREP. So let’s see how it works. I entered this search string in BBEdit’s Find/Replace dialogue (do not forget to check the GREP option) and replaced with the second match: \2.
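For anyone who would rather script this than click through BBEdit, the same search and replace can be sketched with Python's `re` module. This is only an illustration: the sample markup and the variable names are made up for the example, and the pattern is Ahmad's string copied as-is.

```python
import re

# Ahmad's search string, unchanged (the \" simply matches a double quote).
PATTERN = re.compile(
    r'(<[^<>/]*?class=\"generated-style"[^<>/]*?>)([^<>]*?)(</[^<>/]*?>)'
)

# Hypothetical paragraph, invented for this sketch.
html = '<p class="para"><span class="generated-style">Hello world.</span></p>'

# Replace every match with group 2, i.e. the text between the two tags.
cleaned = PATTERN.sub(r'\2', html)
print(cleaned)
```

Group 1 is the opening tag that carries the class, group 2 is the text inside it, and group 3 is the closing tag, so keeping only `\2` drops both tags. Because group 2 excludes `<` and `>`, a span that contains other tags is left untouched by this pattern.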
The expression finds the whole span with the class “generated-style”, including its content. Replacing it with the second match deletes only the span tags and keeps the content. The result is this:
OK, now we have clean code and have got rid of this superfluous span class. I find this regex very helpful in this case. I don’t know if you can use this GREP too, but if you can, you know where to find it ;-)
Do you know other cases where GREP could be helpful for tweaking EPUB files?
EDIT
As Kai points out in the comments, this GREP is shorter and also does a very decent job. Check it out:
Search: <span class="generated-style">(.+?)</span>
Replace: \1
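As a rough sketch of how Kai's shorter pattern behaves, here it is in Python's `re` module. The sample paragraph is hypothetical and includes an inline `<em>` element to show that other elements inside the span survive.

```python
import re

# Kai's search string: (.+?) lazily captures everything up to the first </span>.
PATTERN = re.compile(r'<span class="generated-style">(.+?)</span>')

# Hypothetical paragraph with an inline <em> element inside the span.
html = '<p><span class="generated-style">Some <em>important</em> text.</span></p>'

# Keep only the captured content, dropping the surrounding span tags.
cleaned = PATTERN.sub(r'\1', html)
print(cleaned)
```

One difference from BBEdit worth knowing: in Python, `.` does not match line breaks unless `re.DOTALL` is passed, so a span that wraps across lines would need that flag.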
A simple GREP will do the job:
Search: (|)
Replace: [NOTHING]
The trouble with Kai’s GREP is that it will not work properly if there is another span within the paragraph – one that you want to keep (for example, if there is an underline or small caps). You’ll end up deleting the close tag for the underline span instead of the close tag for the paragraph span.
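The problem can be reproduced in a few lines of Python. The markup below is a made-up example with an underline span nested inside the generated one; the class name `underline` is an assumption for illustration only.

```python
import re

# Kai's pattern: the lazy (.+?) stops at the FIRST </span> it meets.
KAI = re.compile(r'<span class="generated-style">(.+?)</span>')

# Hypothetical paragraph: an underline span nested inside the generated span.
html = ('<p><span class="generated-style">Some '
        '<span class="underline">underlined</span> text.</span></p>')

broken = KAI.sub(r'\1', html)
print(broken)
```

The first `</span>` belongs to the underline span, so that is the one that gets removed. The remaining `</span>` now closes the underline span after "text.", and the underline runs further than it should.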
How about (untested) something like:
Search: <span class="generated-style">(.+?)</span>(?=</p>)
Replace: \1
which makes sure that the closing span it deletes is the one just before the close paragraph tag.
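A quick way to try the lookahead version is a small Python sketch. The nested markup is hypothetical, and the sketch assumes the generated span always closes directly before `</p>`, which is what the lookahead relies on.

```python
import re

# Lookahead variant: the closing </span> must be followed directly by </p>.
PATTERN = re.compile(r'<span class="generated-style">(.+?)</span>(?=</p>)')

# Hypothetical paragraph: an underline span nested inside the generated span.
html = ('<p><span class="generated-style">Some '
        '<span class="underline">underlined</span> text.</span></p>')

fixed = PATTERN.sub(r'\1', html)
print(fixed)
```

The lazy `(.+?)` skips past the inner `</span>` because it is not followed by `</p>`, so the nested underline span stays intact. If the whitespace between `</span>` and `</p>` varies in your files, the lookahead would need adjusting, for example to `(?=\s*</p>)`.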
Hey Kai,
Thanks for your tip. Just tried it out. It is indeed much shorter and seems to do the same thing. And as you say, when there are other elements inside the paragraph, these will stay. Cool :-)
Greetz,
Sacha
Sacha, you will run into a problem with this GREP if you have other elements like [b]something[/b] in your paragraph. So sometimes it is better to search more literally than with wildcards.
The following GREP should find the same thing and honor other tags inside the span element:
If you want to delete all span classes, you could search for something like this: