Old 10-17-2016, 10:02 PM   #1
anleva
Enthusiast
anleva began at the beginning.
 
Posts: 39
Karma: 10
Join Date: Nov 2011
Device: Kindle Paperwhite
How do I create a recipe for content with these attributes

Hello -

I've been messing around with recipes but haven't yet been able to get a clean result. On my news site, I notice that the unwanted content uses one article class and the wanted content uses another. For example:

<article class="asset clearfix">
...
...
...
</article>


<article class="asset story clearfix">
...
...
...
</article>

The former class includes photo galleries and video stories which I don't want any content from. The latter is news stories.

What do I need to do in my recipe to ignore all pages with the former content and only include the latter, "asset story clearfix"?

Thanks

Last edited by anleva; 10-17-2016 at 10:39 PM.
Old 10-17-2016, 11:10 PM   #2
kovidgoyal
creator of calibre
kovidgoyal ought to be getting tired of karma fortunes by now.
Posts: 44,017
Karma: 22669822
Join Date: Oct 2006
Location: Mumbai, India
Device: Various
Code:
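# Keep only the markup of real news stories.
keep_only_tags = dict(attrs={'class':'asset story clearfix'})

def preprocess_raw_html(self, html, url):
    # Skip pages that contain a photo gallery or video article.
    if '<article class="asset clearfix">' in html:
        self.abort_article()
    return html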
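
For context, here is a minimal sketch of where the snippet above would sit inside a complete recipe class. The class name, title, and feed URL are hypothetical placeholders, not taken from the thread. The small stand-in base class is an assumption that only exists so the sketch can be run outside calibre; inside calibre the real BasicNewsRecipe is imported instead.

```python
try:
    # Inside calibre, recipes subclass the real base class.
    from calibre.web.feeds.news import BasicNewsRecipe
except ImportError:
    # Stand-in (assumption) so this sketch also runs without calibre.
    class AbortArticle(Exception):
        pass

    class BasicNewsRecipe(object):
        def abort_article(self, msg=None):
            raise AbortArticle(msg or 'Article download aborted')


class NewsStoriesOnly(BasicNewsRecipe):
    # Hypothetical title and feed; replace with the real site's values.
    title = 'Example News Site'
    feeds = [('News', 'https://example.com/rss')]

    # Keep only the markup of real news stories.
    keep_only_tags = dict(attrs={'class': 'asset story clearfix'})

    def preprocess_raw_html(self, html, url):
        # Abort the download of gallery and video pages entirely.
        if '<article class="asset clearfix">' in html:
            self.abort_article()
        return html
```

Both pieces are members of the recipe class: keep_only_tags is a class attribute and preprocess_raw_html is a method, so they are indented one level under the class statement.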
Old 10-18-2016, 09:17 AM   #3
anleva
Quote:
Originally Posted by kovidgoyal View Post
Code:
keep_only_tags = dict(attrs={'class':'asset story clearfix'})

def preprocess_raw_html(self, html, url):
    if '<article class="asset clearfix">' in html:
        self.abort_article()
    return html
Thank you, I will try this. If I also add remove_tags with additional attributes, does it matter where it is placed in the recipe? Same question for auto_cleanup.
Old 10-18-2016, 10:07 AM   #4
kovidgoyal
You cannot use auto_cleanup and keep/remove together. It is one or the other.
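
As a rough illustration of that either/or choice, the sketch below shows two alternative recipes: one relies on auto_cleanup, the other on explicit keep/remove rules. The class names and titles are hypothetical, and the stand-in base class is an assumption used only so the sketch runs outside calibre. On the placement question: these are ordinary Python class attributes, so their order within the class body does not matter.

```python
try:
    from calibre.web.feeds.news import BasicNewsRecipe
except ImportError:
    # Stand-in (assumption) with the same attribute names, for running
    # this sketch without calibre installed.
    class BasicNewsRecipe(object):
        auto_cleanup = False
        keep_only_tags = []
        remove_tags = []


class AutoCleanupRecipe(BasicNewsRecipe):
    # Option 1: let calibre guess the article body; no keep/remove rules.
    title = 'Example (auto cleanup)'  # hypothetical
    auto_cleanup = True


class ManualCleanupRecipe(BasicNewsRecipe):
    # Option 2: explicit rules, with auto_cleanup left at its default.
    title = 'Example (manual rules)'  # hypothetical
    keep_only_tags = [dict(attrs={'class': 'asset story clearfix'})]
    remove_tags = [dict(name='div', attrs={'class': 'advert'})]
```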
Old 10-18-2016, 01:06 PM   #5
anleva
Thank you. I am also a bit confused about the syntax.

In your example above you used:

keep_only_tags = dict(attrs={'class':'asset story clearfix'})

I was thinking it would be something like

keep_only_tags = dict(name='article', attrs={'class':'asset story clearfix'})

As for remove_tags, I sometimes see:

remove_tags = [dict(name='div', attrs={'class':'advert'})]

When do you need name='xx' in the dict, and when can you leave it out?
Old 10-18-2016, 11:07 PM   #6
kovidgoyal
You don't ever need it unless you are trying to distinguish between two types of tags that have the same value for class.
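
To make that distinction concrete, here is a small standalone sketch of how a spec with and without name narrows the match. It is not calibre's implementation: the helpers matches, TagCollector, and find_tags are hypothetical, built on Python's standard html.parser, and the sketch assumes the class value is compared as an exact string.

```python
from html.parser import HTMLParser


def matches(tag, attrs, spec):
    """Return True if a start tag satisfies a keep/remove-style spec."""
    name = spec.get('name')
    if name is not None and tag != name:
        return False
    wanted = spec.get('attrs', {})
    # Assumption: attribute values are compared as exact strings.
    return all(attrs.get(key) == value for key, value in wanted.items())


class TagCollector(HTMLParser):
    """Collect the names of all start tags that match a spec."""

    def __init__(self, spec):
        super().__init__()
        self.spec = spec
        self.found = []

    def handle_starttag(self, tag, attrs):
        if matches(tag, dict(attrs), self.spec):
            self.found.append(tag)


def find_tags(html, spec):
    collector = TagCollector(spec)
    collector.feed(html)
    collector.close()
    return collector.found


SAMPLE = (
    '<div class="advert">ad</div>'
    '<span class="advert">inline ad</span>'
    '<article class="asset story clearfix">story</article>'
)

# Without name, every tag with that class matches.
print(find_tags(SAMPLE, dict(attrs={'class': 'advert'})))              # ['div', 'span']
# With name, only the div is selected.
print(find_tags(SAMPLE, dict(name='div', attrs={'class': 'advert'})))  # ['div']
```

In the original example only article tags carry the class "asset story clearfix", so adding name='article' would select the same elements.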
Old 10-19-2016, 09:42 AM   #7
anleva
Thank you for your help! It looks nearly perfect now.

One final thing to try and clean up. As desired, it is not parsing any content from the article class that I do not want, i.e. <article class="asset clearfix">. However, those article titles still appear in the table of contents. When I click on them, it does not go to any content but to the next story with content. How can I keep their titles out of the TOC?

Amazing software!
Old 10-21-2016, 09:45 AM   #8
anleva
Any ideas?