From: Jose Kahan (jose.kahan@w3.org)
Date: Wed Sep 25 2002 - 10:51:21 CDT
Hello,
I just committed a patch that adds a new option to hypermail. It replaces
the sequential numbering used for filenames with a name derived from
a hash of the mail properties. This lets you decouple the filenames from
the archiving order... quite useful if you are rebuilding an archive and
the source files changed (you deleted something, or something else arrived
and it wasn't archived).
I'm using a 32-bit FNV1 hash function [1], giving it as input the
msgid and the From date. This will hopefully give each message a
unique hash name. If the 2^32 hash space isn't wide enough, we can
always move to 64 bits... time will tell.
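To make the hashing step concrete, here is a minimal sketch of a 32-bit FNV-1 hash chained over the msgid and the From date. The offset basis and prime are the standard published FNV constants; the function names, the order in which the two strings are fed in, and the "%08lx.html" formatting are assumptions for illustration, not the code from src/fnv or from the patch.

```c
#include <stddef.h>
#include <stdint.h>
#include <stdio.h>
#include <string.h>

#define FNV1_32_INIT  UINT32_C(0x811c9dc5)  /* standard 32-bit FNV offset basis */
#define FNV1_32_PRIME UINT32_C(0x01000193)  /* standard 32-bit FNV prime */

/* FNV-1: multiply by the prime first, then XOR in the next byte.
 * Passing a previous result as `hash` continues the same hash. */
static uint32_t fnv1_32(const void *buf, size_t len, uint32_t hash)
{
    const unsigned char *p = buf;

    while (len-- > 0) {
        hash *= FNV1_32_PRIME;   /* wraps modulo 2^32 */
        hash ^= *p++;
    }
    return hash;
}

/* Hypothetical helper: hash the msgid, continue with the From date,
 * and format the result as 8 hex digits plus ".html". */
static void hashed_filename(const char *msgid, const char *fromdate,
                            char *out, size_t outlen)
{
    uint32_t h = fnv1_32(msgid, strlen(msgid), FNV1_32_INIT);

    h = fnv1_32(fromdate, strlen(fromdate), h);
    snprintf(out, outlen, "%08lx.html", (unsigned long)h);
}
```

A 32-bit value printed as hex always fits in 8 characters, which matches the 4 extra chars compared to a 4-digit sequential number.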
In practice, filenames now have 4 more chars. Here is an excerpt
of one of my archive directories:
    04affc9c.html 4123c74a.html 714eba2c.html att-ea1bb52c/
These are not much harder to quote. I had previously considered using
a SHA or an MD5 hash function, but the resulting filename was too long
to quote.
All the code is available on CVS. In order to use it, you'll have
to rerun configure as follows:
    cd hypermail
    autoconf (so that you can get the new options)
    ./configure --enable-libfnv
    make
This will build you a hypermail with the correct options and will link
it to the fnv library (which is in src/fnv).
The name of the new option is nonsequential, with a command line
shortcut of -N. Turn it on and build a test archive to see the differences.
In more detail, the change was quite straightforward. I changed
all the functions where a msgnum value was used to create links or make
filenames so that they go through a single function. Depending on your
hypermail options, this function (file.c:message_name()) will either
return the msgnum formatted in %.04d or the message hash name.
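The dispatch described above could look roughly like the following. This is only a sketch: the struct, the function name message_name_sketch() and its parameters are hypothetical stand-ins, and the real file.c:message_name() may well have a different signature.

```c
#include <stddef.h>
#include <stdio.h>

/* Hypothetical stand-in for one archived message; not hypermail's
 * real message structure. */
struct msg_sketch {
    int msgnum;          /* sequential archive number */
    unsigned long hash;  /* 32-bit hash of the message properties */
};

/* Writes the base name (without extension) into `out` and returns it.
 * `nonsequential` mirrors the new option: nonzero selects the hash name. */
static const char *message_name_sketch(const struct msg_sketch *m,
                                       int nonsequential,
                                       char *out, size_t outlen)
{
    if (nonsequential)
        snprintf(out, outlen, "%08lx", m->hash & 0xffffffffUL);
    else
        snprintf(out, outlen, "%.04d", m->msgnum);
    return out;
}
```

Keeping the choice in one function means callers that build links or filenames do not need to know which naming scheme is active.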
Another file, called for the moment "messageindex", keeps track of the
messages that we have in the archive. That's necessary as there are no
heuristics to find the current files in the archive. I use the info
in "messageindex" to build an internal table that relates a msgno to
each of the hashed messages. That was all that was needed. There's no
option yet to configure the "messageindex" name. This can easily be
added if needed.
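As an illustration of the kind of in-memory table this implies, here is a small sketch that maps a msgno to a hash name. The one-pair-per-line "msgno name" format, the struct layout and all function names are assumptions made for this example; the actual "messageindex" format and hypermail's internal table are not described here.

```c
#include <stddef.h>
#include <stdio.h>
#include <stdlib.h>
#include <string.h>

/* Hypothetical table entry: msgno plus an 8-hex-digit hash name. */
struct index_entry {
    int msgno;
    char name[9];
};

/* Growable array of entries; initialize with {0}. */
struct index_table {
    struct index_entry *e;
    size_t n, cap;
};

/* Appends one (msgno, name) pair; returns 0 on success, -1 on failure. */
static int index_add(struct index_table *t, int msgno, const char *name)
{
    if (t->n == t->cap) {
        size_t ncap = t->cap ? t->cap * 2 : 64;
        struct index_entry *ne = realloc(t->e, ncap * sizeof *ne);

        if (ne == NULL)
            return -1;
        t->e = ne;
        t->cap = ncap;
    }
    t->e[t->n].msgno = msgno;
    strncpy(t->e[t->n].name, name, 8);
    t->e[t->n].name[8] = '\0';
    t->n++;
    return 0;
}

/* Parses one line in the ASSUMED "msgno name" format. */
static int index_parse_line(struct index_table *t, const char *line)
{
    int msgno;
    char name[9];

    if (sscanf(line, "%d %8s", &msgno, name) != 2)
        return -1;
    return index_add(t, msgno, name);
}

/* Linear lookup; returns NULL if the msgno is not in the table. */
static const char *index_lookup(const struct index_table *t, int msgno)
{
    for (size_t i = 0; i < t->n; i++)
        if (t->e[i].msgno == msgno)
            return t->e[i].name;
    return NULL;
}
```

The whole table lives in memory, so its size grows with the number of archived messages.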
The cost of this feature is that we need to store the messageindex
table in memory. If you have thousands and thousands of messages, it may
be an issue. On the other hand, we already store so many things in
memory that it wouldn't be possible to use hypermail anyway, with or
without this option. There may be some slowdown too, as we now go through
a function to get the message name rather than reading it directly from
a structure, plus the cost of computing the hash. There may be
some optimizations to do, but this is just a first implementation.
Experience will tell us if more work is needed.
I hope you find this new feature as useful as I do.
I took advantage of this commit to fix the make tgz rule. It wasn't
working anymore (at least for me). I also updated the FILES file.
I tried to test it and take into account all possible side effects on
hypermail. With my current set of options, it works without a hitch.
However, hypermail has become much more complex than it was
some time ago. I am not sure if this option will work with all the
other features, such as GDBM files and so on.
Before, it was easy to understand the code just from the comments, but
it's now easy to get lost without some doc :(
Cheers,
-jose
[1] http://www.isthe.com/chongo/tech/comp/fnv/index.html