How do I create a recipe for content with these attributes

anleva · 10-17-2016, 10:02 PM

Hello -

I've been messing around with recipes but haven't yet been able to get a clean result. On my news site, I notice that the unwanted content uses one article class and the wanted content uses another. For example:

<article class="asset clearfix">
...
...
...
</article>

<article class="asset story clearfix">
...
...
...
</article>

The former class covers photo galleries and video stories, and I don't want any content from those. The latter is news stories.

What do I need to do in my recipe to ignore all pages with the former and only include the latter, "asset story clearfix"?

Thanks

kovidgoyal · 10-17-2016, 11:10 PM

Code:

keep_only_tags = dict(attrs={'class':'asset story clearfix'})

def preprocess_raw_html(self, html, url):
    # Skip the whole article when the page uses the gallery/video markup.
    if '<article class="asset clearfix">' in html:
        self.abort_article()
    return html
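
The check in preprocess_raw_html is a plain substring test on the raw page source, so it only fires when the tag is written exactly that way. A minimal sketch for trying the same test on saved page source outside calibre; the function name is_unwanted_page and the two sample pages are made up for illustration:

```python
# Marker taken from the thread: the opening tag used by gallery/video pages.
UNWANTED_MARKER = '<article class="asset clearfix">'


def is_unwanted_page(html):
    """Return True when the raw HTML contains the gallery/video article tag."""
    return UNWANTED_MARKER in html


# Hypothetical page sources, reduced to the part that matters.
gallery_page = '<body><article class="asset clearfix"><p>photos</p></article></body>'
story_page = '<body><article class="asset story clearfix"><p>news</p></article></body>'

print(is_unwanted_page(gallery_page))  # True
print(is_unwanted_page(story_page))    # False
```

If the site changes the spacing or attribute order in that tag, the substring no longer matches and the page is kept, so it is worth rechecking the marker against the live page source from time to time.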

anleva · 10-18-2016, 09:17 AM

Thank you, I will try this. If I also add remove_tags with additional attributes, does it matter where it is placed in the recipe? Same question for auto_cleanup.

kovidgoyal · 10-18-2016, 10:07 AM

You cannot use auto_cleanup and keep/remove together. It is one or the other.
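
On the placement question: settings such as keep_only_tags and remove_tags are ordinary Python class attributes, so the order in which they appear in the class body does not change their values. A small sketch with plain stand-in classes (RecipeA and RecipeB are hypothetical; a real recipe would subclass calibre's BasicNewsRecipe, which is left out here so the snippet runs without calibre):

```python
class RecipeA:
    auto_cleanup = False  # per the thread: not combined with keep/remove rules
    keep_only_tags = [dict(attrs={'class': 'asset story clearfix'})]
    remove_tags = [dict(name='div', attrs={'class': 'advert'})]


class RecipeB:
    # Same settings as RecipeA, written in a different order.
    remove_tags = [dict(name='div', attrs={'class': 'advert'})]
    keep_only_tags = [dict(attrs={'class': 'asset story clearfix'})]
    auto_cleanup = False


# Compare each setting by name; the order in the class body plays no role.
for setting in ('auto_cleanup', 'keep_only_tags', 'remove_tags'):
    print(setting, getattr(RecipeA, setting) == getattr(RecipeB, setting))
```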

anleva · 10-18-2016, 01:06 PM

Thank you. I am also a bit confused about the syntax.

In your example above you used:

keep_only_tags = dict(attrs={'class':'asset story clearfix'})

I was thinking it would be something like

keep_only_tags = dict(name='article', attrs={'class':'asset story clearfix'})

As for remove_tags, I sometimes see:

remove_tags = [dict(name='div', attrs={'class':'advert'})]

When do you need to have name='xx' in the dict, and when can you leave it out?

kovidgoyal · 10-18-2016, 11:07 PM

You don't ever need it unless you are trying to distinguish between two types of tags that have the same value for class.
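
To illustrate that answer, here is a toy model of how a tag specification with and without name behaves. The function spec_matches and the sample tags are hypothetical, and this is not calibre's actual matching code; it only models exact matches on the tag name and attribute values:

```python
def spec_matches(tag_name, tag_attrs, spec):
    """Simplified model: does a tag satisfy a dict(name=..., attrs=...) spec?"""
    wanted_name = spec.get('name')
    # A missing name means "any tag"; a given name must match exactly.
    if wanted_name is not None and wanted_name != tag_name:
        return False
    # Every attribute listed in the spec must be present with the same value.
    for key, value in spec.get('attrs', {}).items():
        if tag_attrs.get(key) != value:
            return False
    return True


# Two hypothetical tags that share the same class value.
tags = [
    ('article', {'class': 'asset story clearfix'}),
    ('div', {'class': 'asset story clearfix'}),
]

without_name = dict(attrs={'class': 'asset story clearfix'})
with_name = dict(name='article', attrs={'class': 'asset story clearfix'})

print([name for name, attrs in tags if spec_matches(name, attrs, without_name)])
# ['article', 'div']
print([name for name, attrs in tags if spec_matches(name, attrs, with_name)])
# ['article']
```

In this model, leaving out name is harmless as long as only one kind of tag carries that class value; name only matters once a second tag type shares it.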

anleva · 10-19-2016, 09:42 AM

Thank you for your help! It looks nearly perfect now.

One final thing to try and clean up. As desired, it is not parsing any content from the article class that I do not want, i.e. <article class="asset clearfix">. However, those article titles still appear in the table of contents. When I click on them, it does not go to any content but jumps to the next story with content. How can I keep their titles from appearing in the TOC?

Amazing software!

anleva · 10-21-2016, 09:45 AM

Any ideas?
