Replace json? #81

sciunto · 2016-01-17T19:47:44Z

Hi @wking and others.

As explained here: kurtmckee/feedparser#44 i made some investigations about performances and CPU consumption. I saw somebody complaining on the web about large CPU consumptions, and I have similar problems too.

In addition to what I reported above and the small enhancement I tried to make, it appears that a lot of cpu is used on save functions.

File: /home/fr/github/rss2email/rss2email/feeds.py
Function: _save_feeds at line 357
   340                                               @profile
   341                                               def save(self):
   342         1          172    172.0      0.0          dst_config_file = _os.path.realpath(self.configfiles[-1])
   343         1           27     27.0      0.0          _LOG.debug('save feed configuration to {}'.format(dst_config_file))
   344       110          335      3.0      0.0          for feed in self:
   345       109       365121   3349.7     14.2              feed.save_to_config()
   346         1           34     34.0      0.0          dirname = _os.path.dirname(dst_config_file)
   347         1           40     40.0      0.0          if dirname and not _os.path.isdir(dirname):
   348                                                       _os.makedirs(dirname, mode=0o700, exist_ok=True)
   349         1            3      3.0      0.0          tmpfile = dst_config_file + '.tmp'
   350         1          162    162.0      0.0          with open(tmpfile, 'w') as f:
   351         1         2953   2953.0      0.1              self.config.write(f)
   352         1           40     40.0      0.0              f.flush()
   353         1       123012 123012.0      4.8              _os.fsync(f.fileno())
   354         1          206    206.0      0.0          _os.rename(tmpfile, dst_config_file)
   355         1      2087550 2087550.0     80.9          self._save_feeds()

File: /home/fr/github/rss2email/rss2email/feeds.py
Function: _save_feed_states at line 373

Line #      Hits         Time  Per Hit   % Time  Line Contents
==============================================================
   373                                               @profile
   374                                               def _save_feed_states(self, feeds, stream):
   375         1           10     10.0      0.0          _json.dump(
   376         1            7      7.0      0.0              {'version': self.datafile_version,
   377         1         2910   2910.0      0.1               'feeds': list(feed.get_state() for feed in feeds),
   378                                                        },
   379         1            4      4.0      0.0              stream,
   380         1            9      9.0      0.0              indent=2,
   381         1      1939815 1939815.0     99.8              separators=(',', ': '),
   382                                                       )
   383         1           15     15.0      0.0          stream.write('\n')

Is there any particular reason to choose json? I guess, but I've not tested that a binary format would be more efficient. I don't think anybody is going to read the json file anyway.
I think there is room for optimization here.

Any though?

The text was updated successfully, but these errors were encountered:

sciunto · 2016-01-17T19:50:33Z

In addition:

Line #      Hits         Time  Per Hit   % Time  Line Contents
==============================================================
    53                                           @profile
    54                                           def run(feeds, args):
    55                                               "Fetch feeds and send entry emails."
    56         1            5      5.0      0.0      if not args.index:
    57                                                   args.index = range(len(feeds))
    58         1            3      3.0      0.0      try:
    59         2            6      3.0      0.0          for index in args.index:
    60         1          142    142.0      0.0              feed = feeds.index(index)
    61         1            3      3.0      0.0              if feed.active:
    62         1            1      1.0      0.0                  try:
    63         1      5180755 5180755.0     65.9                      feed.run(send=args.send)
    64                                                           except _error.RSS2EmailError as e:
    65                                                               e.log()
    66                                               finally:
    67         1      2677187 2677187.0     34.1          feeds.save()

sciunto · 2016-02-12T19:18:53Z

For some other stuffs, I discovered that ujson has better performances. A quick test with rss2email shown me that I could get between 30 to 50 % improvements on the execution time of r2e.

It requires an extra deps, but probably worth it. However, I still think that a binary format would be appropriated. Any other thoughts?

Phiber2000 · 2016-04-08T00:04:10Z

You're right that r2e could be faster. But switching to a binary format would be a step backwards. The possibility to have a quick look/change to the database is a big advantage of r2e!
But switching to ujson - why not? If this speeds up the whole thing, it would be great & easy!

A problem may be, that the ujson package may be not available as repository package and has to be installed manually. F.e.: https://packages.debian.org/stretch/python3-ujson ...this is still in testing suite (+ not kept up to date).

sciunto · 2016-04-08T00:43:13Z

I'm running ujson without any problem for two months.
I also split my configuration file to an hourly and daily versions (according to the typical frequency) to minimize the cpu charge.

Phiber2000 · 2016-04-08T00:58:42Z

I made some measurements using simplejson - which is normaly faster than the json encoder - and didn't notice a considerable runtime improvement (diff: 70ms).

The guilty part doesn't seem to be the json handler itself...

Did you make other experiences?
If yes, could you post your _json.dump(...) statement from feeds.py?

jkufner · 2016-08-01T21:41:06Z

I have 150 feeds monitored bby r2e and ~/.local/share/rss2email.json has about 8000 lines after fresh run. It takes many minutes and few hundred megabytes of memory to complete a single run.

Why JSON? It is completely inapropriate format. What we need here is a quick indexed storage. For example Berkeley DB or SQLite. Both can be viewed by existing tools so it is not hard to debug. Libraries are production-ready and with many tutorials around.

Use of such database would drop memory usage to few MB per run (database is mmapped instead of reading and parsing it) and since the database would be properly indexed, it would increase speed significantly. SQLite can do ten thousands inserts per second with no trouble. BDB is even few times faster. Retrieving data will be faster too, because index will be created once and then stored on disk along the data. Index lookup then does not have to load whole index to memory, it can just read few blocks here and there (because of mmap) to get requested records (single logarithmic-complexity search instead of multiple linear processings and then hash table lookup).

Forget about binary json, it won't help.

Ekleog · 2020-09-11T18:45:05Z

Hello,

This repository has been deprecated for a few years now, and has been replaced by https://github.com/rss2email/rss2email .

If this issue is still relevant to you, and not fixed with v3.12.2, could you please reopen the issue there?

Cheers,
Leo

Provide feedback

Saved searches

Use saved searches to filter your results more quickly

Replace json? #81

Replace json? #81

sciunto commented Jan 17, 2016

sciunto commented Jan 17, 2016

sciunto commented Feb 12, 2016

Phiber2000 commented Apr 8, 2016

sciunto commented Apr 8, 2016

Phiber2000 commented Apr 8, 2016

jkufner commented Aug 1, 2016

Ekleog commented Sep 11, 2020

Replace json? #81

Replace json? #81

Comments

sciunto commented Jan 17, 2016

sciunto commented Jan 17, 2016

sciunto commented Feb 12, 2016

Phiber2000 commented Apr 8, 2016

sciunto commented Apr 8, 2016

Phiber2000 commented Apr 8, 2016

jkufner commented Aug 1, 2016

Ekleog commented Sep 11, 2020