Maybe treat these as timeouts too?

Add a --change option to change a feed's URL. Problem: getopt can't handle parsing this -- so it would mean changing to argparse.

Rework the template mechanism so that it keeps strings as Unicode objects until output time. This'd reduce a lot of the current encoding complexity.

A common complaint of packagers (e.g. Gentoo #345485) is that you have to explicitly install a config file for rawdog to work. Build the default config file and stylesheet into rawdog somehow?

Show the chain of HTTP statuses, not just the last one.

Show feedparser version in --help. Add --version.

Can we do without ensure_unicode? Having instrumented it, it looks like it's mostly converting strings that feedparser itself has created. ... however, it also flattens out wacky classes that feedparser inserts, so yes, it's useful.

Handle maxage working on article.date/added -- make this a config option? Merge with one of the existing options?

Make maxarticles work as a per-feed option.

Plugin hook to allow the articles list to be sorted again after filtering -- so you can filter out duplicates then sort by originally-published date.

Duplicate removal by article title.

gzip the state file.

Optionally use a better backend: a real transactional database rather than a huge file and pickle. An alternative approach would be to keep a second lightweight cache of articles that rawdog knows it's already seen -- it would only need to store hashes (or article guids) rather than the full articles. Or could use fuzzy comparison against previous articles in the same feed -- do this as a plugin.

Daemon mode -- keep a pidfile, and check the mtime of the state file to avoid having to reread it.

Option to limit update runtime.

Fix rawdog -a https://www.fsf.org/blogs/rms/ ... specifically, the problem is that it lists lots of feeds that aren't related to that page: ... I'm not sure there's anything that can reasonably be done about that.

Look at Expires header to automatically select "periods".

Longer-term future: split features out to plugins
- refresh header
- HTML4 output
- removing duplicate articles
- DayWriter
- mx.Tidy
- proxy auth
- per-feed proxies
- length limiting in templates
- mx.Tidy flag to output xhtml
- nuke future pubDates (Rick van Rein)

And new features to write as plugins
- sort by feeds first (should work since sort is stable)
- Article numbering (probably not)

Ctrl-C shouldn't print the "error loading" warning.

If rawdog crashes while updating a feed, it shouldn't forget the feeds it's already updated. Perhaps have an exception handler that keeps a safety copy of the state file and saves where it's got to so far?

Improve efficiency -- memoise stuff before comparing articles.

__comments__ from the feed (and anything else worth having?).

Option to choose whether full content or summary is preferred.

Review expiry logic: is maxage=0 the same as currentonly?

OPML listing -- needs feed type. Do for now as separate program once config parser separate.

Error reporting as feed (or as separate display?) (Dorward)

RSS output. raymond@dotsphinx.com

Feed schedule spreading.

Timeouts should be ignored unless it's been getting timeouts for more than a configurable amount of time.

Add a needs_update() method to Feed; make Rawdog call that on all the feeds (when not being forced) and then call update() on each of them that needs it.
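A minimal sketch of how that needs_update() split could look, assuming the feed already keeps a per-feed period and a last_update timestamp (the attribute names and signatures here are illustrative, not necessarily rawdog's actual ones):

    import time

    class Feed:
        # Assumes the feed object stores self.period (seconds between
        # updates) and self.last_update (time of the last fetch);
        # illustrative attribute names only.

        def needs_update(self, now):
            """Return True if this feed is due to be refetched."""
            return now - self.last_update >= self.period

    class Rawdog:
        def update(self, config, force=False):
            now = time.time()
            for feed in self.feeds.values():
                # When not being forced, skip feeds that aren't due yet.
                if not force and not feed.needs_update(now):
                    continue
                feed.update(self.articles, now, config)  # existing update path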
- "for line in file" - "item in dict" - include author in article hash Abandoned: - __feednumber_mod_N__ (for easier default styles, i.e. _mod_4 counts 0-3,0-3) (didn't really work for lots of feeds) ## Notes from the splitstate conversion The objective here is to significantly reduce rawdog's memory usage in favour of IO. (Although the IO usage may actually go down, since we don't have to rewrite feed states that didn't change.) The plan is to enable split state while keeping regular behaviour around as the default (for now, to be removed in rawdog 3). -- Stage 1: making update memory usage O(biggest #articles) -- Feed stays as is -- i.e. persisted as part of Rawdog, containing the feed info, and so forth. (These may change in rawdog 3 -- there's a tradeoff, because if we store the update time/eTag/... in the feed state then we have to rewrite it every time we update, rather than just if the content's changed. Actually, we don't want to do this, since we don't want to read the FeedState at all if it doesn't need updating.) There's a new FeedState class, persisted into STATEDIR/feeds/12345678.state (where 12345678 is the feed URL hash as currently used). (FIXME: when changing feed URL, we need to rename the statefile too.) Feed.update() takes an article-dict argument, which might be the existing Rawdog.articles hash or might be from a FeedState, just containing that feed's articles. (It doesn't care either way.) When doing updates, if we're in split-state mode, it loads and saves the FeedState around each article. (FIXME: optimisation: only mark a FeedState as modified if it was actually modified, not if it was updated but nothing changed.) -- Stage 2: making write memory usage O(#articles on page) -- Article gets a new method to return the date that should be used for sorting (i.e. this logic gets moved out of the write code). Get the list of articles eligable for output -- as (sort-date, feed-hash, sequence-number, article-hash) tuples (for ease of sorting). Then fetch the articles for each feed. (FIXME: the implementation of this is rather messy; it should be done, perhaps, at the Feed level, then it would be sufficiently abstract to let us do this over a database at some point in the future...) Rawdog.write() then collects the list of articles from all the feeds, sorts it, and retrieves only the appropriate set of articles from each feed state before writing them. (FIXME: optimisation: have a dict available at update and write time into which the current article lists get stashed as the update progresses, to avoid opening the state file three times when we update a feed.) (FIXME: the sort hook will need to be changed -- use a different hook when in split-state mode.) -- Stage 3: making fetch memory usage O(biggest #articles * #threads) -- Give the fetcher threads a "shared channel" to the main thread that's doing the updates, so that updates and fetches can proceed in parallel, and the only buffers used are by active threads.