From 45dccf4608c1455a4c2bec7c6ba233591a968b8a Mon Sep 17 00:00:00 2001
From: Adam Sampson
Date: Sat, 23 Feb 2008 22:52:53 +0000
Subject: [PATCH] Update NEWS and todolist after the merge.

---
 NEWS             | 21 ++++++++++++++++++
 notes.splitstate | 57 ------------------------------------------------
 todolist         | 14 ------------
 3 files changed, 21 insertions(+), 71 deletions(-)
 delete mode 100644 notes.splitstate

diff --git a/NEWS b/NEWS
index 87d4de6..33e9d88 100644
--- a/NEWS
+++ b/NEWS
@@ -1,3 +1,24 @@
+FIXME: either fix or remove the sort/filter hooks, and document what
+their replacements are.
+
+- rawdog 2.12
+
+Add the "splitstate" option, which makes rawdog use a separate state
+file for each feed rather than one large one. This significantly reduces
+rawdog's memory usage at the cost of some more disk IO during --write.
+The old behaviour is still the default, but I would recommend turning
+splitstate on if you read a lot of feeds or if you're on a machine with
+limited memory.
+
+Add the "useids" option, which makes rawdog respect article GUIDs when
+updating feeds; if an article's GUID matches one we already know about,
+we just update the existing article's contents rather than treating it
+as a new article (like most aggregators do). This is turned on in the
+default configuration, since the behaviour it produces is generally more
+useful these days -- many feeds include random advertisements, or other
+dynamic content, and so the old approach resulted in lots of duplicated
+articles.
+
 - rawdog 2.11
 
 Avoid a crash when a feed's URL is changed and expiry is done on the
diff --git a/notes.splitstate b/notes.splitstate
deleted file mode 100644
index fa7ee69..0000000
--- a/notes.splitstate
+++ /dev/null
@@ -1,57 +0,0 @@
-The objective here is to significantly reduce rawdog's memory usage in favour
-of IO. (Although the IO usage may actually go down, since we don't have to
-rewrite feed states that didn't change.)
-
-The plan is to enable split state while keeping regular behaviour around as the
-default (for now, to be removed in rawdog 3).
-
--- Stage 1: making update memory usage O(biggest #articles) --
-
-Feed stays as is -- i.e. persisted as part of Rawdog, containing the feed info,
-and so forth. (These may change in rawdog 3 -- there's a tradeoff, because if
-we store the update time/eTag/... in the feed state then we have to rewrite it
-every time we update, rather than just if the content's changed. Actually, we
-don't want to do this, since we don't want to read the FeedState at all if it
-doesn't need updating.)
-
-There's a new FeedState class, persisted into STATEDIR/feeds/12345678.state
-(where 12345678 is the feed URL hash as currently used).
-(FIXME: when changing feed URL, we need to rename the statefile too.)
-
-Feed.update() takes an article-dict argument, which might be the existing
-Rawdog.articles hash or might be from a FeedState, just containing that feed's
-articles. (It doesn't care either way.)
-
-When doing updates, if we're in split-state mode, it loads and saves the
-FeedState around each article.
-
-(FIXME: optimisation: only mark a FeedState as modified if it was actually
-modified, not if it was updated but nothing changed.)
-
--- Stage 2: making write memory usage O(#articles on page) --
-
-Article gets a new method to return the date that should be used for sorting
-(i.e. this logic gets moved out of the write code).
-
-Get the list of articles eligable for output -- as (sort-date, feed-hash,
-sequence-number, article-hash) tuples (for ease of sorting). Then fetch the
-articles for each feed.
-(FIXME: the implementation of this is rather messy; it should be done, perhaps,
-at the Feed level, then it would be sufficiently abstract to let us do this
-over a database at some point in the future...)
-
-Rawdog.write() then collects the list of articles from all the feeds, sorts it,
-and retrieves only the appropriate set of articles from each feed state before
-writing them.
-(FIXME: optimisation: have a dict available at update and write time into which
-the current article lists get stashed as the update progresses, to avoid
-opening the state file three times when we update a feed.)
-(FIXME: the sort hook will need to be changed -- use a different hook when in
-split-state mode.)
-
--- Stage 3: making fetch memory usage O(biggest #articles * #threads) --
-
-Give the fetcher threads a "shared channel" to the main thread that's doing the
-updates, so that updates and fetches can proceed in parallel, and the only
-buffers used are by active threads.
-
diff --git a/todolist b/todolist
index 4057809..839b6c5 100644
--- a/todolist
+++ b/todolist
@@ -2,22 +2,9 @@ Handle maxage working on article.date/added -- make this a config option? Merge
 
 Make maxarticles work as a per-feed option.
 
-An idea for reducing rawdog's memory usage:
-- have a separate state file for each feed
-- have the update process for each feed return a list of articles to include in
-  the output as (hash, time) pairs
-- the update process probably doesn't even need to read all the articles if
-  it's got guids (or something equivalent) available
-- the write process then only needs to pull the articles that should be
-  displayed from the database, rather than all of them
-
 Plugin hook to allow the articles list to be sorted again after filtering -- so
 you can filter out duplicates then sort by originally-published date.
 
-Option to do duplicate removal by more sensible article hashing: use a
-namespace for hashes where it could be hash:existing-hash or
-uid:uid-from-article (detecting articles that are already present).
-
 Duplicate removal by article title.
 
 gzip the state file.
@@ -33,7 +20,6 @@ this as a plugin.
 Daemon mode -- keep a pidfile, and check the mtime of the state file to avoid
 having to reread it.
 
-Option to quit if flocked
 Option to limit update runtime
 
 Fix rawdog -a https://www.fsf.org/blogs/rms/
-- 
2.35.1
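rawdog is written in Python, so the "splitstate" layout from notes.splitstate
(a FeedState persisted into STATEDIR/feeds/12345678.state, where 12345678 is a
short hash of the feed URL) could be sketched roughly as below. This is a
minimal illustration only: the md5-based short hash, the pickle format, and
the function names are assumptions for the sketch, not rawdog's actual
persistence code.

```python
import hashlib
import os
import pickle

def feed_state_path(statedir, feed_url):
    """Per-feed state file path; short URL hash is an assumed scheme."""
    short_hash = hashlib.md5(feed_url.encode("utf-8")).hexdigest()[:8]
    return os.path.join(statedir, "feeds", short_hash + ".state")

def load_feed_state(statedir, feed_url):
    """Load one feed's articles, or an empty dict if none saved yet."""
    try:
        with open(feed_state_path(statedir, feed_url), "rb") as f:
            return pickle.load(f)
    except FileNotFoundError:
        return {}

def save_feed_state(statedir, feed_url, articles):
    """Write one feed's articles back to its own state file."""
    path = feed_state_path(statedir, feed_url)
    os.makedirs(os.path.dirname(path), exist_ok=True)
    with open(path, "wb") as f:
        pickle.dump(articles, f)
```

Loading and saving around each feed's update is what keeps peak memory at
O(biggest #articles) rather than O(all articles). Note the FIXME in the notes:
with a scheme like this, renaming the state file on a feed URL change is the
caller's job.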
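The "useids" behaviour from the NEWS entry -- match an incoming entry against
known GUIDs and refresh the stored article in place instead of storing a
duplicate -- can be sketched as follows. The dict layout and field names here
are illustrative assumptions, not rawdog's real Feed/Article classes.

```python
def update_articles(articles, entries, now):
    """Merge fetched entries into articles, matching by GUID when present.

    articles: {key: {"guid": ..., "content": ..., "added": ...}}
    entries:  fetched items as {"guid": ..., "content": ...} dicts
    """
    # Index the articles we already know about by GUID.
    guid_index = {a["guid"]: key for key, a in articles.items()
                  if a.get("guid") is not None}
    for entry in entries:
        guid = entry.get("guid")
        if guid is not None and guid in guid_index:
            # Known GUID: update the existing article's contents, keeping
            # its identity (and hence its original added date).
            articles[guid_index[guid]]["content"] = entry["content"]
        else:
            # New article, or no GUID to match on: store it as new.
            key = guid if guid is not None else entry["content"]
            articles[key] = {"guid": guid, "content": entry["content"],
                             "added": now}
    return articles
```

This is why rotating advertisements or other dynamic content no longer produce
duplicate articles: only the contents change, not the article's identity.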
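Stage 2 of the deleted notes sorts lightweight (sort-date, feed-hash,
sequence-number, article-hash) tuples before fetching any full article bodies,
so write-time memory stays O(#articles on page). A rough sketch of that sort
step, with a hypothetical input layout (the per-feed tuple lists here stand in
for whatever the Feed level would actually expose):

```python
def collect_sorted_keys(feeds):
    """Build and sort the lightweight article keys for output.

    feeds: {feed_hash: [(sort_date, seq, article_hash), ...]}
    Returns (sort_date, feed_hash, seq, article_hash) tuples, newest first.
    """
    keys = []
    for feed_hash, entries in feeds.items():
        for sort_date, seq, article_hash in entries:
            keys.append((sort_date, feed_hash, seq, article_hash))
    # Plain tuple sort: date first, with feed hash and sequence number as
    # tie-breakers.  Newest first, as on a typical aggregator page.
    keys.sort(reverse=True)
    return keys
```

Rawdog.write() would then take the top slice of this list and go back to each
feed's state only for the article bodies it actually needs to render.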