10-17-2016, 10:02 PM | #1 |
Enthusiast
Posts: 39
Karma: 10
Join Date: Nov 2011
Device: Kindle Paperwhite
|
How do I create a recipe for content with these attributes
Hello -
I've been messing around with recipes but haven't yet been able to get a clean result. On my news site, I notice that the unwanted content uses one article tag and the wanted content another. For example:
<article class="asset clearfix"> ... ... ... </article>
<article class="asset story clearfix"> ... ... ... </article>
The former class covers photo galleries and video stories, which I don't want any content from. The latter is news stories. What do I need to do in my recipe to ignore all pages with the former content and only include the latter, "asset story clearfix"?
Thanks
Last edited by anleva; 10-17-2016 at 10:39 PM. |
10-17-2016, 11:10 PM | #2 |
creator of calibre
Posts: 44,017
Karma: 22669822
Join Date: Oct 2006
Location: Mumbai, India
Device: Various
|
Code:
keep_only_tags = dict(attrs={'class':'asset story clearfix'})

def preprocess_raw_html(self, html, url):
    if '<article class="asset clearfix">' in html:
        self.abort_article()
    return html
|
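The substring test in the snippet above only fires when the page contains the opening tag exactly as written; extra whitespace, a different class order, or an added attribute would slip past it. Below is a minimal sketch of a more tolerant check that uses only the Python standard library. The helper names (`article_classes`, `is_unwanted_page`) are hypothetical and not part of calibre; the idea is that `preprocess_raw_html` could call `is_unwanted_page(html)` in place of the `in` test.

```python
from html.parser import HTMLParser

# Class tokens that mark the unwanted gallery/video articles (from the thread).
UNWANTED = frozenset({'asset', 'clearfix'})


class _ArticleClassCollector(HTMLParser):
    """Collect the set of class tokens of every <article> tag in a page."""

    def __init__(self):
        super().__init__()
        self.found = []  # one frozenset of class names per <article>

    def handle_starttag(self, tag, attrs):
        if tag == 'article':
            # A missing or valueless class attribute is treated as empty.
            classes = dict(attrs).get('class') or ''
            self.found.append(frozenset(classes.split()))


def article_classes(html):
    """Return the class-token sets of all <article> tags in html."""
    parser = _ArticleClassCollector()
    parser.feed(html)
    parser.close()
    return parser.found


def is_unwanted_page(html):
    """True if any <article> carries exactly the classes 'asset' and 'clearfix'."""
    return any(classes == UNWANTED for classes in article_classes(html))
```

Comparing sets of class tokens rather than raw strings is a design choice of this sketch: it keeps the "asset story clearfix" pages out of the match while ignoring spacing and ordering differences in the markup.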
10-18-2016, 09:17 AM | #3 |
Enthusiast
Posts: 39
Karma: 10
Join Date: Nov 2011
Device: Kindle Paperwhite
|
Thank you, I will try this. If I also add remove_tags with additional attributes, does it matter where it is placed in the recipe? Same question for auto_cleanup.
|
10-18-2016, 10:07 AM | #4 |
creator of calibre
Posts: 44,017
Karma: 22669822
Join Date: Oct 2006
Location: Mumbai, India
Device: Various
|
You cannot use auto_cleanup and keep/remove together. It is one or the other.
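On the placement question: these settings are ordinary Python class attributes, so their order inside the recipe class does not matter. Below is a minimal sketch of a recipe that takes the manual route with auto_cleanup left off. The class name, title, feed URL, and the 'advert' div are placeholders for illustration, and the `except ImportError` stand-in exists only so the sketch can be loaded outside calibre; it is not the real base class.

```python
try:
    from calibre.web.feeds.news import BasicNewsRecipe
except ImportError:
    # Stand-in so the sketch can be loaded outside calibre; NOT the real class.
    class BasicNewsRecipe:
        def abort_article(self, msg=None):
            raise RuntimeError(msg or 'article aborted')


class NewsSiteStories(BasicNewsRecipe):
    title = 'News Site Stories'                     # placeholder title
    feeds = [('News', 'https://example.com/rss')]   # placeholder feed URL

    # Manual cleanup is used below, so auto_cleanup stays off.
    auto_cleanup = False
    keep_only_tags = [dict(attrs={'class': 'asset story clearfix'})]
    remove_tags = [dict(name='div', attrs={'class': 'advert'})]  # illustrative

    def preprocess_raw_html(self, raw_html, url):
        # Skip pages that contain the gallery/video article tag.
        if '<article class="asset clearfix">' in raw_html:
            self.abort_article()
        return raw_html
```

Switching to the automatic route would mean setting `auto_cleanup = True` and dropping the `keep_only_tags` and `remove_tags` lines.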
|
10-18-2016, 01:06 PM | #5 |
Enthusiast
Posts: 39
Karma: 10
Join Date: Nov 2011
Device: Kindle Paperwhite
|
Thank you. I am also a bit confused about the syntax.
In your example above you used:
keep_only_tags = dict(attrs={'class':'asset story clearfix'})
I was thinking it would be something like:
keep_only_tags = dict(name='article', attrs={'class':'asset story clearfix'})
As for remove_tags, I sometimes see:
remove_tags = [dict(name='div', attrs={'class':'advert'})]
When do you need to have name='xx' in the dict and when do you not need it?
|
10-18-2016, 11:07 PM | #6 |
creator of calibre
Posts: 44,017
Karma: 22669822
Join Date: Oct 2006
Location: Mumbai, India
Device: Various
|
You don't ever need it unless you are trying to distinguish between two types of tags that have the same value for class.
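To make the difference concrete, here is a simplified model of how a tag spec with and without `name` behaves. The `matches` function is hypothetical and is not calibre's implementation (calibre hands the spec to its HTML parser, which supports more than exact string comparison); it only illustrates that omitting `name` lets any tag with the given class match.

```python
def matches(spec, tag_name, tag_attrs):
    """Simplified model: optional 'name' plus exact attribute values."""
    name = spec.get('name')
    if name is not None and name != tag_name:
        return False  # a name was given and this tag is a different type
    wanted = spec.get('attrs', {})
    return all(tag_attrs.get(key) == value for key, value in wanted.items())


# Two specs for the same class, one without and one with a tag name.
by_class = dict(attrs={'class': 'advert'})
by_name_and_class = dict(name='div', attrs={'class': 'advert'})

# Two different tag types that share the class value.
div = ('div', {'class': 'advert'})
span = ('span', {'class': 'advert'})

print(matches(by_class, *div), matches(by_class, *span))
print(matches(by_name_and_class, *div), matches(by_name_and_class, *span))
```

In this model the class-only spec matches both the div and the span, while the spec with `name='div'` matches only the div, which is the one case where adding `name` changes the result.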
|
10-19-2016, 09:42 AM | #7 |
Enthusiast
Posts: 39
Karma: 10
Join Date: Nov 2011
Device: Kindle Paperwhite
|
Thank you for your help! It looks nearly perfect now.
One final thing to clean up. As desired, it is no longer parsing any content from the article class that I do not want, i.e. <article class="asset clearfix">. However, the titles of those articles still appear in the table of contents. When I click on one, it does not go to any content of its own but jumps to the next story that has content. How can I keep those article titles out of the TOC? Amazing software! |
10-21-2016, 09:45 AM | #8 | |
Enthusiast
Posts: 39
Karma: 10
Join Date: Nov 2011
Device: Kindle Paperwhite
|
|