Discussion:
Is it safe to use mbox?
吴悦
2009-12-23 13:17:23 UTC
Hi list,

I want to use the mbox mail format with the combination of mpop + procmail
+ mutt, but I have a question about the safety of mbox: mpop runs
periodically from crontab, so is it safe to do mail management operations
such as deleting or viewing messages in mutt while it runs? I know the
maildir format has no such issue.
--
Hi,
Wu, Yue
Kyle Wheeler
2009-12-23 16:17:57 UTC
Post by 吴悦
I want to use the mbox mail format with the combination of mpop +
procmail + mutt, but I have a question about the safety of mbox: mpop
runs periodically from crontab, so is it safe to do mail management
operations such as deleting or viewing messages in mutt while it
runs? I know the maildir format has no such issue.
Yes, using mbox that way is safe.

Now, I'm assuming that you're storing your mbox locally on a standard
unix filesystem rather than on an NFS-mount or on an MS-DOS partition
or something similarly weird. And as long as you avoid those, you can
be assured that using mbox is *safe*. It's not *efficient* (because
everyone who touches the mbox must lock the file first), but it is
safe.
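
To illustrate the "lock the file first" point, here is a minimal sketch of a locked append. It assumes Python's standard `mailbox` module purely for illustration; mpop, procmail and mutt each have their own locking code, and the function name `deliver_to_mbox` is hypothetical.

```python
import mailbox
from email.message import EmailMessage


def deliver_to_mbox(path, msg):
    """Append one message to the mbox at `path` while holding its lock."""
    box = mailbox.mbox(path)  # creates the file if it does not exist
    try:
        box.lock()            # dot lock, plus flock()/lockf() where available;
                              # raises if another program holds the lock
        box.add(msg)          # appended at the end, after a "From " separator line
        box.flush()           # push the change to disk before releasing the lock
    finally:
        box.close()           # close() also flushes and releases the lock


if __name__ == "__main__":
    demo = EmailMessage()
    demo["Subject"] = "locking demo"
    demo.set_content("hello\n")
    deliver_to_mbox("demo.mbox", demo)
```

A reader such as a mail client would take the same lock before rewriting the file (for example when expunging), which is why a fetcher and a client can share one mbox without corrupting it, as long as both use a compatible locking method.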

~Kyle
--
Strong coffee, much strong coffee, is what awakens me. Coffee gives me
warmth, waking, an unusual force and a pain that is not without very
great pleasure.
-- Napoleon Bonaparte
Derek Martin
2009-12-23 19:41:14 UTC
Post by Kyle Wheeler
Now, I'm assuming that you're storing your mbox locally on a standard
unix filesystem rather than on an NFS-mount or on an MS-DOS partition
or something similarly weird.
Does mbox have issues on MS-DOS partitions? I wasn't aware of any
(though you need to be careful about using mixed case). I was aware
that maildir does, because on MS filesystems, you can't use ':' in a
path.
Post by Kyle Wheeler
And as long as you avoid those, you can be assured that using mbox
is *safe*. It's not *efficient* (because everyone who touches the
mbox must lock the file first), but it is safe.
I really hate when people say stuff like this. It's true that you
have to lock the file before you *write* to it, but even for busy
mailboxes, the user will basically never notice this, unless maybe
he's still reading mail on a PDP-11 with a bunch of other people.
Plus, while maildir doesn't need to lock, it does need to open(2)
every individual message, whether you're reading or writing, which
more than makes up for not having to lock in terms of efficiency
lossage.

Both folder formats are efficient at some things, and less efficient
at others. Which one will perform better for you depends a lot on your
usage patterns and the underlying filesystem. Fans of maildir like to
sell it as being inherently better than mbox in every way, and that's
simply false. I use both, about 50% each, because for my usage
patterns, maildir is better about half the time, and mbox is better
the other half. [This is why the fact that Mutt behaves (or mostly
behaved) differently for the two formats has always been a pet peeve
of mine.]

I was about to write a blurb about how locking on NFS isn't really
*that* bad (anymore)... but yeah, it is. ;-) In modern, homogeneous
environments with well-behaved apps, odds of data loss are only
slightly worse than they would be if you took NFS out of the picture,
i.e. it's really only an issue in server failure modes, but that
tends to be when data loss occurs regardless of NFS... NFS + mbox
increases the odds of loss occurring only slightly (with well-behaved
apps), but makes the odds of the impact being serious much higher
(it's unlikely, but you might lose your entire mail folder). Trouble
is, in practice there's no such thing as a modern, homogeneous
environment, where all the apps are well-behaved! :)
--
Derek D. Martin http://www.pizzashack.org/ GPG Key ID: 0xDFBEAD02
-=-=-=-=-
This message is posted from an invalid address. Replying to it will result in
undeliverable mail due to spam prevention. Sorry for the inconvenience.
Wu, Yue
2009-12-24 14:12:46 UTC
Post by Derek Martin
Post by Kyle Wheeler
Now, I'm assuming that you're storing your mbox locally on a standard
unix filesystem rather than on an NFS-mount or on an MS-DOS partition
or something similarly weird.
Does mbox have issues on MS-DOS partitions? I wasn't aware of any
(though you need to be careful about using mixed case). I was aware
that maildir does, because on MS filesystems, you can't use ':' in a
path.
[....]

Thanks for useful explanations :)
--
Hi,
Wu, Yue
Kyle Wheeler
2009-12-25 14:41:01 UTC
Post by Derek Martin
Post by Kyle Wheeler
Now, I'm assuming that you're storing your mbox locally on a standard
unix filesystem rather than on an NFS-mount or on an MS-DOS partition
or something similarly weird.
Does mbox have issues on MS-DOS partitions? I wasn't aware of any
(though you need to be careful about using mixed case). I was aware
that maildir does, because on MS filesystems, you can't use ':' in a
path.
Honestly, I'm not sure. I know that MS-DOS requires explicit sharing,
so it's easy for a mail client to prevent delivery of new mail by
accident, but that's easy to work around.
Post by Derek Martin
Post by Kyle Wheeler
And as long as you avoid those, you can be assured that using mbox
is *safe*. It's not *efficient* (because everyone who touches the
mbox must lock the file first), but it is safe.
I really hate when people say stuff like this.
Efficient and fast are two different things. From a parallel
perspective, one giant lock (even one giant write lock) is NOT
efficient, no matter how you look at it. It may well be *fast* for the
common case (especially when the common case is one-writer), but
that's a different issue.

But if you're here to rehash the "fast" argument, I think we can't get
anywhere without pointing to the CourierMTA's webpage of mbox/maildir
benchmarks: http://www.courier-mta.org/mbox-vs-maildir/
Post by Derek Martin
Plus, while maildir doesn't need to lock, it does need to open(2)
every individual message, whether you're reading or writing, which
more than makes up for not having to lock in terms of efficiency
lossage.
You only need to open(2) every individual message if you're reading
the whole thing for the first time. You certainly don't need to do
that if you're delivering mail, or deleting mail, or marking a message
as read, or what have you.
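
A minimal sketch of why those operations touch only one file, assuming the usual Maildir layout (`tmp/`, `new/`, `cur/`) and the `:2,S` "seen" suffix. The function names are hypothetical, and the unique-name scheme is simplified compared with what real delivery agents generate.

```python
import os
import socket
import time


def maildir_deliver(root, data):
    """Deliver `data` (bytes) into the Maildir at `root`; return the file name."""
    for sub in ("tmp", "new", "cur"):
        os.makedirs(os.path.join(root, sub), exist_ok=True)
    # Simplified unique name: time, pid, host ("/" and ":" escaped).
    host = socket.gethostname().replace("/", "\\057").replace(":", "\\072")
    name = f"{time.time_ns()}.{os.getpid()}.{host}"
    tmp_path = os.path.join(root, "tmp", name)
    with open(tmp_path, "xb") as f:  # "x": fail rather than overwrite
        f.write(data)
        f.flush()
        os.fsync(f.fileno())
    # rename() within one filesystem is atomic, so no lock is needed.
    os.rename(tmp_path, os.path.join(root, "new", name))
    return name


def maildir_mark_seen(root, name):
    """Move a message from new/ to cur/ and set the S (seen) flag."""
    seen_name = name + ":2,S"  # the ':' that MS filesystems reject in paths
    os.rename(os.path.join(root, "new", name),
              os.path.join(root, "cur", seen_name))
    return seen_name


def maildir_delete(root, seen_name):
    """Delete one message from cur/ without reading any other message."""
    os.unlink(os.path.join(root, "cur", seen_name))
```

Delivery, flag changes and deletion are each a create, rename or unlink on a single file; none of them opens the other messages in the folder.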
Post by Derek Martin
Both folder formats are efficient at some things, and less efficient
at others. Which one will perform better for you depends a lot on
your usage patterns and the underlying filesystem.
Agreed.
Post by Derek Martin
Fans of maildir like to sell it as being inherently better than mbox
in every way, and that's simply false. I use both, about 50% each,
because for my usage patterns, maildir is better about half the
time, and mbox is better the other half. [This is why the fact that
Mutt behaves (or mostly behaved) differently for the two formats has
always been a pet peeve of mine.]
Also agreed. That's why I like Dovecot, actually, because I can use
mbox for my Archive tree and maildir for everything else and get the
exact same semantics.

~Kyle
--
If after I depart this vale you ever remember me and have thought to
please my ghost, forgive some sinner, and wink your eye at some homely
girl.
-- H.L. Mencken's Epitaph
Derek Martin
2009-12-26 20:25:01 UTC
Post by Kyle Wheeler
But if you're here to rehash the "fast" argument, I think we can't get
anywhere without pointing to the CourierMTA's webpage of mbox/maildir
benchmarks: http://www.courier-mta.org/mbox-vs-maildir/
I'm very familiar with this page, and I consider it fairly useless.
First, it has a tendency to focus on operations where courier wins,
and somewhat downplays cases where it doesn't. For example, it does
no tests at all with extremely large numbers of messages. On typical
Unix file systems, maildir basically falls over, because accessing
files in large directories is inherently slow (to the point of being
painful) on such filesystems.

Second, and much more importantly, it assumes that University of
Washington's mbox implementation is representative of how well mbox is
capable of performing, and that Courier's maildir implementation is
similarly representative of how well maildir performs. In other
words, you're actually comparing two specific implementations -- not
mbox vs. maildir per se. That pretty much invalidates every aspect
of the conclusions drawn on this page (though they may well be valid
for Courier vs. UW-IMAP). Despite this, it is interesting to note
that UW-IMAP mostly outperforms Courier on low-end hardware *by a
lot*, with the sole exception of the very special case of expunge
(which the study calls delete), whereas on high-end hardware, Courier
wins by only a small margin.

It's been a long while since I looked at UW's implementation, but I do
remember thinking that it had a number of opportunities for
optimization. I believe, for example, that UW-IMAP's caching was
basically nonexistent (which would explain why Courier does so much
better on all the .2 tests). When comparing Mutt's implementations of
mbox vs. maildir, mbox BLOWS AWAY maildir opening large mailboxes
(i.e. pre-header-caching). IIRC UW-IMAP also uses stdio... which, being
double-buffered, is the least efficient method of I/O. On reasonably
modern (i.e. not broken) implementations, using memory-mapped I/O is
substantially faster. For maildir, the difference probably wouldn't
matter much since the reads and writes tend to be small. For mbox,
that matters a lot (see W. Richard Stevens, Advanced Programming in
the Unix Environment, for an example of how drastically MMIO can
improve I/O performance).
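
For readers unfamiliar with the technique, here is a minimal sketch of scanning an mbox through a memory mapping instead of buffered reads. It uses Python's standard `mmap` module as a stand-in (neither Mutt nor UW-IMAP is written this way as far as this sketch knows), the function name `count_mbox_messages` is hypothetical, and it shows only the access pattern, not a performance measurement.

```python
import mmap
import os


def count_mbox_messages(path):
    """Count "From " separator lines in an mbox using a read-only mapping."""
    with open(path, "rb") as f:
        if os.fstat(f.fileno()).st_size == 0:
            return 0  # an empty file cannot be mapped
        with mmap.mmap(f.fileno(), 0, access=mmap.ACCESS_READ) as mm:
            # The first message starts at offset 0; every later one starts
            # at a line beginning with "From " (body lines are ">From ").
            count = 1 if mm[:5] == b"From " else 0
            pos = mm.find(b"\nFrom ")
            while pos != -1:
                count += 1
                pos = mm.find(b"\nFrom ", pos + 1)
            return count
```

The search runs directly over the mapped pages, so the file contents are not copied into a separate stdio-style buffer first.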
Post by Kyle Wheeler
You only need to open(2) every individual message if you're reading
the whole thing for the first time. You certainly don't need to do
that if you're delivering mail, or deleting mail, or marking a message
as read, or what have you.
Yes, exactly. I have dozens of mailboxes, most of which (in my work
environment) are high-volume folders... With my usage patterns
(especially pre-headercache), the speed of opening mailboxes matters A
LOT. FWIW, last I was paying attention, mbox was not receiving the
benefits of header caching in Mutt. For my particular usage patterns,
this matters much, much more than say, the time it takes to expunge a
single message from a large mail box. With my particular usage
patterns, the latter case happens pretty much never. Opening large
mailboxes happens pretty frequently.

As it happens, mbox (on Mutt at least) is actually about the same or
faster for almost all of the operations that actually make a
difference to my e-mail experience. I tend to keep my busy incoming
folders small, and either delete or archive messages from those
folders into mbox folders when I'm done processing them. I rarely
delete messages from those mbox folders, but I still do open them very
frequently to remind myself of whatever's in the messages I saved
there. So for me, maildir's huge win deleting messages in large
folders is a *complete* non-issue. A good mbox implementation with
caching will perform about as well as or even beat maildir handily in
almost every other case. For me, using maildir was as much about
Mutt's behavior when using it, as it was about performance and safety.
With recent improvements from Brendan and/or Rocco, the behavior is no
longer sufficiently different that there's really any benefit at all
for me to use maildir (I don't keep my mail on network shares of any
sort), but there is genuine benefit from using mbox. It may still be
true that mbox is not receiving the benefits of hcache, but if so I
don't really notice the difference. I still do use both, but it's
mostly a remnant of past issues that no longer exist.
--
Derek D. Martin http://www.pizzashack.org/ GPG Key ID: 0xDFBEAD02
Derek Martin
2009-12-27 05:24:49 UTC
Just to be clear, I'm still *NOT* saying that mbox is inherently
better than mbox. That said...
Post by Derek Martin
Despite this, it is interesting to note that UW-IMAP mostly
outperforms Courier on low-end hardware *by a lot*, with the sole
exception of the very special case of expunge (which the study calls
delete), whereas on high-end hardware, Courier wins by only a small
margin.
Well, I wasn't looking at the graphs closely enough... this is not
true. But that doesn't take away from my other points.

To add to those, a third point about the analysis which I neglected to
mention is that it doesn't take into account that Courier may have a
much more efficient implementation of IMAP and other underlying code.
This ties in to my second point, and you can summarize those by saying
that the analysis fails to attribute when differences in performance
are caused by the folder format, and when they are caused by some
other implementation details. Having looked a little at UW's
implementation in the past (though not recently enough to be certain
or to explain why), my guess is Courier is more efficient generally.

If you look at the Phase I & II graph for select.1 and select.2, there
seems to be some evidence: on low end hardware, UW-IMAP wins by 20s on
mailboxes of 10,000 messages on the first pass. On the second pass,
Courier takes almost constant time to process mailboxes of any tested
size, while UW's performance slope is considerably higher. Processing
100 messages appears to take the same amount of time for UW on both
passes. Processing 10,000 messages takes roughly 9x longer on UW than
on Courier on the second pass.

So, either Courier is caching internally and UW is relying on file
system caching (yielding worse performance), or they both cache
internally and Courier's caching code is good, while UW's caching code
is crap. Something of the sort *must* be true: if both were using
efficient caching, their performance should be roughly identical for
the second pass, since neither one would need to read from the disk;
in other words, for the second pass, the message store's on-disk
format should not come into play *at all*.

This factor also shows itself if you look at the way he calculates
CPU usage. He's taking the average of user+sys and real. Look at the
results for 2,000 messages on high-end hardware. UW's user + sys is
about .8s, but its REAL is 3.068s. So, what the hell was it doing for
the other 2.2s? Something is fishy here. For Courier, the numbers
add up a bit better: 0.030s + 2.120s = 2.150, where the real time is
2.147s. It's odd that the user+sys add up to more than real, but very
likely Linux isn't perfectly accurate accounting for CPU time. But
what about our 2.2s difference with UW? That's more than enough to be
significant. The difference between real and user+sys should be the
amount of time the process was sleeping. We have neither access to
his test machines nor a time machine, so we can only make guesses
about why it was sleeping. Most likely, either UW's code is very
inefficient, or the server was swapped out. In neither case can you
blame that on the folder format.
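
The arithmetic above can be sketched as follows. The helper names are hypothetical, and the measurement uses Python's `time.perf_counter` (wall clock) and `time.process_time` (user + sys CPU of the current process) as assumed stand-ins for the `time` output the study reports.

```python
import time


def unaccounted_time(real, user_plus_sys):
    """Time the process spent neither in user nor in kernel mode (waiting)."""
    return real - user_plus_sys


def measure(fn):
    """Run fn() and return (wall_seconds, cpu_seconds) for this process."""
    wall_start = time.perf_counter()
    cpu_start = time.process_time()
    fn()
    cpu = time.process_time() - cpu_start
    wall = time.perf_counter() - wall_start
    return wall, cpu


if __name__ == "__main__":
    # Figures quoted in the post (2,000 messages, high-end hardware).
    print("UW-IMAP:", unaccounted_time(3.068, 0.8))            # "about .8s" user+sys
    print("Courier:", unaccounted_time(2.147, 0.030 + 2.120))
    # A sleeping process shows the same pattern: wall time far above CPU time.
    wall, cpu = measure(lambda: time.sleep(0.2))
    print("sleep demo:", unaccounted_time(wall, cpu))
```

With the quoted figures, UW-IMAP has a gap of more than two seconds while Courier's is essentially zero, which is the discrepancy the paragraph points at.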

He's not wrong that the real CPU time matters when it comes to
responsiveness when using imapd, but it very probably is wrong to
attribute that loss of time to the mailbox format. We can't really
know without a more detailed analysis. Such a detail is suitable for
comparing UW-IMAP to Courier, but not for comparing maildir to mbox.

So, my take on that analysis is that it's pretty much worthless. For
the analysis to be worth anything, it needs to at least eliminate IMAP
from the picture, and take a stab at analyzing the efficiency of the
implementation. You'd need to write client code that uses the mailbox
drivers of both, but uses the same code to feed the drivers, and then
compare any inefficiencies and optimizations in the implementations of
each driver. You'd need to eliminate caching, since caching eliminates
or reduces the need to actually perform I/O on the mail store. Then,
and only then, you'd have something worth talking about.
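
A rough sketch of the shape such a harness could take, with IMAP out of the picture and the same client code feeding both formats. It swaps in Python's standard `mailbox` drivers rather than the UW or Courier ones, all function names are hypothetical, and it does not defeat the operating system's page cache, so it illustrates the structure of the test rather than producing meaningful numbers.

```python
import mailbox
import os
import tempfile
import time
from email.message import EmailMessage


def make_message(i):
    """Build a small synthetic message (addresses are placeholders)."""
    msg = EmailMessage()
    msg["From"] = "sender@example.invalid"
    msg["To"] = "reader@example.invalid"
    msg["Subject"] = f"test message {i}"
    msg.set_content("body of message %d\n" % i)
    return msg


def fill(box, count):
    """Store the same synthetic messages in any mailbox.Mailbox subclass."""
    for i in range(count):
        box.add(make_message(i))
    box.close()


def timed_scan(box):
    """Open-folder workload: read every Subject; return (count, seconds)."""
    start = time.perf_counter()
    subjects = [msg["Subject"] for msg in box]
    elapsed = time.perf_counter() - start
    box.close()
    return len(subjects), elapsed


def compare(count=200):
    """Run the identical workload against an mbox and a Maildir."""
    with tempfile.TemporaryDirectory() as tmp:
        mbox_path = os.path.join(tmp, "folder.mbox")
        maildir_path = os.path.join(tmp, "folder.maildir")
        fill(mailbox.mbox(mbox_path), count)
        fill(mailbox.Maildir(maildir_path), count)
        return {
            "mbox": timed_scan(mailbox.mbox(mbox_path)),
            "maildir": timed_scan(mailbox.Maildir(maildir_path)),
        }


if __name__ == "__main__":
    for fmt, (n, secs) in compare().items():
        print(f"{fmt}: {n} messages scanned in {secs:.4f}s")
```

Because `fill` and `timed_scan` are shared, any difference between the two results comes from the format drivers and the filesystem, not from the client code; separating those two remaining factors would still need the per-driver analysis described above.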

And again, I'm not saying that his conclusions are wrong; I'm just
saying that his analysis is bunk. =8^)
--
Derek D. Martin http://www.pizzashack.org/ GPG Key ID: 0xDFBEAD02
Cameron Simpson
2009-12-27 23:42:18 UTC
On 26Dec2009 23:24, Derek Martin <***@pizzashack.org> wrote:
| Just to be clear, I'm still *NOT* saying that mbox is inherently
| better than mbox. [...]

I should think not :-)

Personally I use maildir for "live" mail folders and mbox for archive
folders. And am very thankful for the header cache.
--
Cameron Simpson <***@zip.com.au> DoD#743
http://www.cskk.ezoshosting.com/cs/

Steinbach's Law: 2 is not equal to 3 -- even for large values of 2.