From: Peter C. McCluskey (pcm@rahul.net)
Date: Fri Sep 17 1999 - 17:12:44 CDT
Daniel.Stenberg@frontec.se (Daniel Stenberg) writes:
>On Fri, 10 Sep 1999, Peter C. McCluskey wrote:
>> I currently have about 1700 lines of code in new modules devoted to the
>> linkquotes option
>
>Wow, that's a lot of code. Could you take a moment to describe the
>functionality of that feature? With a focus on implementation issues and
>details.
Here's an overview of the more important features of the new modules.
finelink.c:
At the start of any section of quoted text in new messages, if
hashreplynumlookup returns a match, it reads through the indicated
archive file for an exact match.
If that fails, it calls the search_for_quote function (described below).
If either approach finds the source of the quoted text, the quoted
message is rewritten to add an <A NAME="nnnnqlinkm">...</A> around the quoted
text, where nnnn is the number of the quoting message, and m is a number
which distinguishes multiple links between the same pair of messages.
The first line of the quoted text in the quoting message (or
set_quote_link_string if specified) is then output as a link to that
anchor.
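
As a rough illustration of the anchor naming described above, here is a minimal sketch. The helper name format_quote_anchor is hypothetical (it is not taken from finelink.c), and zero-padding the message number to four digits is an assumption suggested only by the "nnnn" placeholder.

```c
#include <assert.h>
#include <stdio.h>

/*
 * Hypothetical helper, not from finelink.c: builds the anchor name
 * "nnnnqlinkm" for link number link_index out of quoting message
 * quoting_msgnum. Zero-padding to four digits is an assumption.
 * Returns the number of characters written, or -1 if buf is too small.
 */
static int format_quote_anchor(char *buf, size_t buflen,
                               int quoting_msgnum, int link_index)
{
    int n;

    assert(buf != NULL && buflen > 0);
    n = snprintf(buf, buflen, "%04dqlink%d", quoting_msgnum, link_index);
    if (n < 0 || (size_t)n >= buflen)
        return -1;              /* encoding error or output truncated */
    return n;
}
```

The quoted message would then carry <A NAME="..."> with this name, and the quoting message would link to the same name with a leading '#'.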
After each call to printbody that linked such quotes has finished,
replylist is updated to indicate a definite reply relationship if it
didn't list this pair of messages, or listed them only as "maybereply".
The latest message is also rewritten to replace any "Maybe in reply to:"
lines with a single "In reply to:".
search.c:
This module saves message body text in the following tree structure
intended for fast text search:
    struct bigram_list
    {
        const struct body *bp;
        short offset;
        struct bigram_list *next;
    };
    struct bigram_tree_entry
    {
        struct bigram_tree_entry *left;
        struct bigram_tree_entry *right;
        struct bigram_list list;
        BIGRAM_TYPE bigram1;
        BIGRAM_TYPE bigram2;
    };
BIGRAM_TYPE is an integral type (currently unsigned long) which
represents a word.
Each word (a sequence of alphanumeric chars; 20 common words are excluded)
is converted to a BIGRAM_TYPE value, with each unique word being assigned a
different number. Non-alphanumeric characters are discarded.
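
The word-to-number step might look roughly like the sketch below. Everything here except BIGRAM_TYPE is hypothetical: the names word_to_id and line_to_ids, the fixed-size linear table, and the three-entry stop list are stand-ins, since the post does not show the real lookup or the real list of common words.

```c
#include <assert.h>
#include <ctype.h>
#include <stdlib.h>
#include <string.h>

typedef unsigned long BIGRAM_TYPE;   /* "currently unsigned long" per the post */

#define MAX_WORDS    1024   /* hypothetical capacity for this sketch */
#define MAX_WORD_LEN 63     /* longer words are truncated in this sketch */

/* Placeholder stop list; the real 20 common words are not given in the post. */
static const char *const common_words[] = { "the", "a", "of" };

static char *word_table[MAX_WORDS];  /* word_table[i] holds the word with id i + 1 */
static size_t word_count = 0;

static int is_common_word(const char *word)
{
    size_t i;
    for (i = 0; i < sizeof common_words / sizeof common_words[0]; i++)
        if (strcmp(common_words[i], word) == 0)
            return 1;
    return 0;
}

/* Returns the id of word, assigning the next free id on first sight.
 * Returns 0 if the table is full or memory runs out. */
static BIGRAM_TYPE word_to_id(const char *word)
{
    size_t i;

    assert(word != NULL);
    for (i = 0; i < word_count; i++)
        if (strcmp(word_table[i], word) == 0)
            return (BIGRAM_TYPE)(i + 1);
    if (word_count == MAX_WORDS)
        return 0;
    word_table[word_count] = malloc(strlen(word) + 1);
    if (word_table[word_count] == NULL)
        return 0;
    strcpy(word_table[word_count], word);
    word_count++;
    return (BIGRAM_TYPE)word_count;
}

/* Splits line into runs of alphanumeric chars, drops common words, and
 * stores one id per remaining word. Returns the number of ids stored. */
static size_t line_to_ids(const char *line, BIGRAM_TYPE *ids, size_t max_ids)
{
    const char *p = line;
    size_t n = 0;

    while (*p != '\0' && n < max_ids) {
        char word[MAX_WORD_LEN + 1];
        size_t len = 0;

        while (*p != '\0' && !isalnum((unsigned char)*p))
            p++;                            /* discard non-alphanumerics */
        while (isalnum((unsigned char)*p)) {
            if (len < MAX_WORD_LEN)
                word[len++] = *p;
            p++;
        }
        word[len] = '\0';
        if (len > 0 && !is_common_word(word))
            ids[n++] = word_to_id(word);
    }
    return n;
}
```

A linear table is chosen only to keep the sketch short; a hash table would be the obvious choice for an archive of any size.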
Each node in the tree represents a pair of words that occur consecutively
in the archive, with a list indicating each place the pair is found.
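
To make the node layout concrete, here is a hedged sketch of how an occurrence of a word pair could be added to such a tree and looked up again. The struct definitions are copied from above; bigram_cmp, bigram_add and bigram_find are hypothetical names, and ordering nodes by (bigram1, bigram2) in an unbalanced binary tree is an assumption, not something the post states.

```c
#include <assert.h>
#include <stdlib.h>

typedef unsigned long BIGRAM_TYPE;
struct body;                      /* opaque here; defined elsewhere */

struct bigram_list {
    const struct body *bp;
    short offset;
    struct bigram_list *next;
};

struct bigram_tree_entry {
    struct bigram_tree_entry *left;
    struct bigram_tree_entry *right;
    struct bigram_list list;      /* first occurrence is stored inline */
    BIGRAM_TYPE bigram1;
    BIGRAM_TYPE bigram2;
};

/* Orders word pairs by first word, then by second word (assumed ordering). */
static int bigram_cmp(BIGRAM_TYPE a1, BIGRAM_TYPE a2,
                      BIGRAM_TYPE b1, BIGRAM_TYPE b2)
{
    if (a1 != b1)
        return a1 < b1 ? -1 : 1;
    if (a2 != b2)
        return a2 < b2 ? -1 : 1;
    return 0;
}

/* Records that the pair (w1, w2) occurs at offset within bp.
 * Returns 0 on success, -1 if memory runs out. */
static int bigram_add(struct bigram_tree_entry **root,
                      BIGRAM_TYPE w1, BIGRAM_TYPE w2,
                      const struct body *bp, short offset)
{
    struct bigram_tree_entry **slot = root;

    assert(root != NULL);
    while (*slot != NULL) {
        int c = bigram_cmp(w1, w2, (*slot)->bigram1, (*slot)->bigram2);
        if (c == 0) {
            /* Known pair: link a new occurrence after the inline head. */
            struct bigram_list *occ = malloc(sizeof *occ);
            if (occ == NULL)
                return -1;
            occ->bp = bp;
            occ->offset = offset;
            occ->next = (*slot)->list.next;
            (*slot)->list.next = occ;
            return 0;
        }
        slot = c < 0 ? &(*slot)->left : &(*slot)->right;
    }
    *slot = malloc(sizeof **slot);
    if (*slot == NULL)
        return -1;
    (*slot)->left = NULL;
    (*slot)->right = NULL;
    (*slot)->list.bp = bp;
    (*slot)->list.offset = offset;
    (*slot)->list.next = NULL;
    (*slot)->bigram1 = w1;
    (*slot)->bigram2 = w2;
    return 0;
}

/* Returns the occurrence list for the pair (w1, w2), or NULL if absent. */
static const struct bigram_list *bigram_find(const struct bigram_tree_entry *root,
                                             BIGRAM_TYPE w1, BIGRAM_TYPE w2)
{
    while (root != NULL) {
        int c = bigram_cmp(w1, w2, root->bigram1, root->bigram2);
        if (c == 0)
            return &root->list;
        root = c < 0 ? root->left : root->right;
    }
    return NULL;
}
```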
A call to analyze_headers shortly after calling parsemail fills the
bigram structure with text from all new messages plus old messages
reread in loadoldheaders as limited by set_searchbackmsgnum.
Then during printbody, the following function searches for the best match:
int search_for_quote(char *search_line, const char *exact_line, int max_msgnum,
String_Match *match_info);
It searches messages numbered less than max_msgnum, mainly by converting
search_line to BIGRAM_TYPE values and looking for the location with the
most consecutive matching values.
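
The "most consecutive matches" idea can be sketched as below. This is a brute-force stand-in that compares two plain arrays of word numbers; it does not use the bigram tree and is not the code of search_for_quote. The name longest_run and its parameters are hypothetical.

```c
#include <assert.h>
#include <stddef.h>

typedef unsigned long BIGRAM_TYPE;

/*
 * Hypothetical brute-force stand-in: finds where in text the longest run
 * of consecutive word numbers shared with query begins.
 * Returns the run length and stores its start index in *text_pos
 * (0 if nothing matches). The first longest run found wins.
 */
static size_t longest_run(const BIGRAM_TYPE *query, size_t qlen,
                          const BIGRAM_TYPE *text, size_t tlen,
                          size_t *text_pos)
{
    size_t best = 0, i, j;

    assert(text_pos != NULL);
    *text_pos = 0;
    for (i = 0; i < qlen; i++) {
        for (j = 0; j < tlen; j++) {
            size_t k = 0;
            /* Extend the run while both sequences keep agreeing. */
            while (i + k < qlen && j + k < tlen && query[i + k] == text[j + k])
                k++;
            if (k > best) {
                best = k;
                *text_pos = j;
            }
        }
    }
    return best;
}
```

The nested loops cost O(qlen * tlen) comparisons per starting point, which is the kind of work an index keyed on word pairs is meant to avoid.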
quotes.c:
The find_quote_prefix function looks through a message body to decide
what prefix is most likely being used to indicate quoted text. It looks
from the start of each line up to the first alphanumeric char, and counts
how many times each unique prefix occurs. A bias is added in favor of
prefixes containing '>'. The prefix with the highest count is selected.
If there is a tie, the longest prefix is used. There are also provisions
for counting partial matches, and for deciding, in many cases, that there
is no quoted text if no prefix occurs more than once.
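
A simplified sketch of that counting scheme follows. It is not the code of find_quote_prefix: the name guess_quote_prefix, the table limits, and the size of the '>' bias (GT_BIAS) are assumptions, partial matches are left out, and the "no prefix occurs more than once" rule is applied unconditionally here.

```c
#include <assert.h>
#include <ctype.h>
#include <string.h>

#define MAX_PREFIXES   32   /* hypothetical limits for this sketch */
#define MAX_PREFIX_LEN 15
#define GT_BIAS        1    /* assumed size of the bias toward '>' */

/*
 * Guesses the quote prefix used in lines[0..nlines-1]. out must hold at
 * least MAX_PREFIX_LEN + 1 chars. Returns 1 and fills out on success,
 * or 0 (out set to "") if no prefix occurs more than once.
 */
static int guess_quote_prefix(const char *const *lines, size_t nlines, char *out)
{
    char prefixes[MAX_PREFIXES][MAX_PREFIX_LEN + 1];
    int counts[MAX_PREFIXES];
    size_t nprefixes = 0, best = 0, i, j;
    int best_score = 0;

    assert(lines != NULL && out != NULL);
    for (i = 0; i < nlines; i++) {
        size_t len = 0;

        /* The prefix runs from line start up to the first alphanumeric. */
        while (lines[i][len] != '\0' && !isalnum((unsigned char)lines[i][len]))
            len++;
        if (len == 0 || len > MAX_PREFIX_LEN)
            continue;               /* no prefix, or too long for the sketch */
        for (j = 0; j < nprefixes; j++)
            if (strlen(prefixes[j]) == len &&
                strncmp(prefixes[j], lines[i], len) == 0)
                break;
        if (j == nprefixes) {
            if (nprefixes == MAX_PREFIXES)
                continue;           /* table full: ignore further new prefixes */
            memcpy(prefixes[j], lines[i], len);
            prefixes[j][len] = '\0';
            counts[j] = 0;
            nprefixes++;
        }
        counts[j]++;
    }
    for (j = 0; j < nprefixes; j++) {
        int score;

        if (counts[j] < 2)
            continue;               /* seen only once: not a quote prefix */
        score = counts[j] + (strchr(prefixes[j], '>') != NULL ? GT_BIAS : 0);
        /* Highest score wins; on a tie the longer prefix wins. */
        if (score > best_score ||
            (score == best_score &&
             strlen(prefixes[j]) > strlen(prefixes[best]))) {
            best = j;
            best_score = score;
        }
    }
    if (best_score == 0) {
        out[0] = '\0';
        return 0;
    }
    strcpy(out, prefixes[best]);
    return 1;
}
```

Note that in this sketch the prefix includes any whitespace before the first alphanumeric, so "> text" yields "> " rather than ">".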
-- 
------------------------------------------------------------------------
Peter McCluskey          | Critmail (http://crit.org/critmail.html):
http://www.rahul.net/pcm | Accept nothing less to archive your mailing list