Re: Preparing hypermail cvs commit/merge, suggestions?

---------

From: John Finlay (finlay@moeraki.com)
Date: Tue Sep 14 1999 - 17:51:34 CDT


Just to add more fuel to the fire, I've also been hacking on hypermail
(2a23 base). The attached outlines the changes I've made. If there is
interest in my changes, we can talk about how and when to integrate
them.

It sounds to me like a feature freeze and 2.0 final version is of most
interest at this point in time. Seems reasonable to me to solidify the
current 2a24 code before embarking on adding significant new features.
Likewise it seems like time to try to develop a feature list for the
next release (2.2 or 3.0?) and then a plan for integrating/developing
those features.

Comments?

John

                              Hypermail-2a23jf
                              ================
                 A Variant of Hypermail-2a23 by John Finlay

Purpose
=======

Hypermail-2a23jf is based on hypermail-2a23. The external functionality is
largely the same but the main internal routines have been rewritten to
enhance readability and to provide a more robust implementation especially
for large archives that are incrementally updated a message at a time. These
changes are similar to ones I made to an earlier version of hypermail
(1.02f), which I decided to port to the latest development version. While
hypermail-2a23 is a great improvement over earlier hypermails (1.02f, etc.),
I felt that some of the improvements I had made would also be useful going
forward. Like most features, these were developed to meet the email
archiving needs in my workplace environment.

Status
======

This code has been tested with a number of mailboxes with various numbers
(1-2300) of messages on both Solaris 2.6 and Debian Linux 2.0. More testing
needs to be done to check out all the features of hypermail and explore the
bug space.
 
Enhancements
============

The enhancements are:

- replace the on-the-fly binary tree sorting of messages with a growable
array of messages that is sorted using quicksort after all messages are
loaded -> speeds up processing of large archives

- incremental updates use a merge sort instead of quicksort -> speeds up
processing

- added some additional MIME types (e.g. multipart/digest)

- use a header summary file to avoid opening all messages in the archive
when doing incremental updates. Also maintains the sort information for
merge sorting during incremental updates -> speed up incremental update
processing

- provide a means for subsequent hypermail processes to pass off their work
to the active hypermail process -> lower resource usage in high volume
archives or those that experience bursty message traffic.

- use temp files and rename() to atomically update message files when
headers need to be updated -> avoid loss of message information in the face
of errors.

- extend parsing of mailboxes and stdin during incremental update mode to
allow more than one message to be handled.

- add an optional memory saving technique which minimizes memory usage
during parsing by saving the message bodies into temporary files -> should
help with processing really large mailboxes. The default is to save the
message bodies in memory. NOTE: Attachments are saved to files during
parsing in either case.

- reasonably robust parsing of a variety of MIME messages and messages with
embedded CRLF. MIME boundaries are recognized using a regular expression
match to eliminate some pathological cases.
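
To make the first two items concrete, here is a minimal sketch in C of a
growable message array that is sorted once with qsort() after loading, plus
a linear merge for incremental updates. It is an illustration only, not the
code in 2a23jf: struct msg, struct msglist, msglist_push(), msglist_sort(),
msg_merge() and by_date() are hypothetical names, and sorting by date with
the message number as tie-breaker is an assumption.

```c
#include <assert.h>
#include <stdlib.h>

/* Hypothetical, pared-down record; the real struct emailinfo has more. */
struct msg {
    long date;    /* assumed sort key: message date in seconds */
    int  msgnum;  /* archive message number, used as tie-breaker */
};

/* Growable array of messages. */
struct msglist {
    struct msg *items;
    size_t len;
    size_t cap;
};

/* Append one message, doubling the capacity when the array is full.
 * Returns 0 on success, -1 if memory could not be allocated. */
int msglist_push(struct msglist *l, struct msg m)
{
    if (l->len == l->cap) {
        size_t ncap = l->cap ? l->cap * 2 : 64;
        struct msg *p = realloc(l->items, ncap * sizeof *p);
        if (p == NULL)
            return -1;
        l->items = p;
        l->cap = ncap;
    }
    l->items[l->len++] = m;
    return 0;
}

/* Comparison function for qsort(): by date, then by message number. */
int by_date(const void *a, const void *b)
{
    const struct msg *x = a, *y = b;
    if (x->date != y->date)
        return (x->date > y->date) - (x->date < y->date);
    return (x->msgnum > y->msgnum) - (x->msgnum < y->msgnum);
}

/* Full build: sort once, after every message has been loaded. */
void msglist_sort(struct msglist *l)
{
    if (l->len > 1)
        qsort(l->items, l->len, sizeof l->items[0], by_date);
}

/* Incremental update: merge the already sorted archive (old, n entries)
 * with the sorted new messages (add, m entries) into out, which must
 * have room for n + m entries. Runs in O(n + m) comparisons. */
void msg_merge(const struct msg *old, size_t n,
               const struct msg *add, size_t m, struct msg *out)
{
    size_t i = 0, j = 0, k = 0;
    while (i < n && j < m)
        out[k++] = (by_date(&add[j], &old[i]) < 0) ? add[j++] : old[i++];
    while (i < n)
        out[k++] = old[i++];
    while (j < m)
        out[k++] = add[j++];
    assert(k == n + m);
}
```

With the old order kept in sorted form (for example in a summary file), an
update that adds a handful of messages only has to sort those few and merge
them in, rather than re-sort the whole archive.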

Revisions
=========

In addition to the above enhancements, a number of revisions to the existing
code were made either to enhance readability and maintainability or to
support the above enhancements. Major revisions include:

- rewrite of the parsemail() routine to split it into several functions. The
result is more like recursive descent parsing which should make it easier to
add processing for new MIME types, etc.

- parsing of messages also generates the html message bodies - was
previously done during printing of articles.

- crossindexing and printing of threads has been rewritten.

- the struct emailinfo has been expanded to add more fields to support the
merge sort, memory saving option, parsing and crossindexing changes.

- some tweaking of string operations.

- some changes to function signatures since more passing of emailinfo
structures is used.
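
To show what a recursive descent structure can look like for nested MIME
parts, here is a minimal sketch in C. It is not the rewritten parsemail():
struct parser, parse_entity(), is_delim() and get_boundary() are
hypothetical names, the input is assumed to be already split into lines,
and boundaries are matched by plain string comparison rather than by
regular expression. Each entity parses its own headers and, if it is a
multipart, calls itself once per contained part.

```c
#include <assert.h>
#include <string.h>

/* Hypothetical parser state over a message already split into lines. */
struct parser {
    const char **lines;
    size_t n;        /* number of lines */
    size_t pos;      /* index of the next unread line */
    int leaves;      /* count of non-multipart parts seen */
    int max_depth;   /* deepest nesting level reached */
};

/* Return 1 if line is "--b" or "--b--"; *closing tells which. */
static int is_delim(const char *line, const char *b, int *closing)
{
    size_t bl;
    if (b == NULL || strncmp(line, "--", 2) != 0)
        return 0;
    bl = strlen(b);
    if (strncmp(line + 2, b, bl) != 0)
        return 0;
    line += 2 + bl;
    if (strcmp(line, "--") == 0) { *closing = 1; return 1; }
    if (*line == '\0')           { *closing = 0; return 1; }
    return 0;
}

/* Copy the boundary parameter of a header line into out, if present. */
static void get_boundary(const char *hdr, char *out, size_t outsz)
{
    const char *q = strstr(hdr, "boundary=");
    size_t i = 0;
    if (q == NULL)
        return;
    q += strlen("boundary=");
    if (*q == '"')
        q++;
    while (*q && *q != '"' && *q != ';' && i + 1 < outsz)
        out[i++] = *q++;
    out[i] = '\0';
}

/* Parse one entity: headers, then either a leaf body or nested parts.
 * parent is the enclosing boundary, or NULL at the top level. */
void parse_entity(struct parser *p, const char *parent, int depth)
{
    static const char mp[] = "Content-Type: multipart/";
    char boundary[128] = "";
    int closing = 0;

    assert(p != NULL && p->lines != NULL);
    if (depth > p->max_depth)
        p->max_depth = depth;

    /* Header block ends at the first empty line. */
    while (p->pos < p->n && p->lines[p->pos][0] != '\0') {
        if (strncmp(p->lines[p->pos], mp, sizeof mp - 1) == 0)
            get_boundary(p->lines[p->pos], boundary, sizeof boundary);
        p->pos++;
    }
    if (p->pos < p->n)
        p->pos++;                       /* skip the empty separator */

    if (boundary[0] == '\0') {          /* leaf: body runs to parent delim */
        p->leaves++;
        while (p->pos < p->n && !is_delim(p->lines[p->pos], parent, &closing))
            p->pos++;
        return;
    }
    /* Multipart: skip the preamble, then descend into each part. */
    while (p->pos < p->n && !is_delim(p->lines[p->pos], boundary, &closing))
        p->pos++;
    while (p->pos < p->n && is_delim(p->lines[p->pos], boundary, &closing)) {
        p->pos++;
        if (closing)
            break;
        parse_entity(p, boundary, depth + 1);
    }
    /* Skip the epilogue up to the enclosing boundary, if any. */
    while (p->pos < p->n && !is_delim(p->lines[p->pos], parent, &closing))
        p->pos++;
}
```

Supporting a new type in this shape means adding a branch where the leaf
body is consumed. For instance, a message/rfc822 part, treated as a plain
leaf here, could instead call parse_entity() again on its contents.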

Performance
===========

The changes made have a small negative effect on performance for operations
that create an archive from a single mailbox. For example, a 2300 message
archive is processed 5-10% slower on an ultraSparc 1. This is mostly due to
the parsing changes. Using the memory saving option slows processing by
another 40% since saving to disk, reading back and resaving incurs a
substantial penalty.

However, incremental updates are processed faster for archives with more
than 1000 messages - twice as fast for 2000 message archives; much faster
for much larger archives.

I plan to work on improving the performance of the parsing routines next.


---------

This archive was generated by hypermail 2.1.5.