Discussion:
svn commit: r213648 - head/sys/kern
Andriy Gapon
2010-10-09 08:07:49 UTC
Author: avg
Date: Sat Oct 9 08:07:49 2010
New Revision: 213648
URL: http://svn.freebsd.org/changeset/base/213648

Log:
panic_cpu variable should be volatile

This is to prevent caching of its value in a register when it is checked
and modified by multiple CPUs in parallel.
Also, move the variable into the scope of the only function that uses it.

Reviewed by: jhb
Hint from: mdf
MFC after: 1 week

Modified:
head/sys/kern/kern_shutdown.c

Modified: head/sys/kern/kern_shutdown.c
==============================================================================
--- head/sys/kern/kern_shutdown.c Sat Oct 9 07:45:24 2010 (r213647)
+++ head/sys/kern/kern_shutdown.c Sat Oct 9 08:07:49 2010 (r213648)
@@ -513,10 +513,6 @@ shutdown_reset(void *junk, int howto)
/* NOTREACHED */ /* assuming reset worked */
}

-#ifdef SMP
-static u_int panic_cpu = NOCPU;
-#endif
-
/*
* Panic is called on unresolvable fatal errors. It prints "panic: mesg",
* and then reboots. If we are called twice, then we avoid trying to sync
@@ -525,6 +521,9 @@ static u_int panic_cpu = NOCPU;
void
panic(const char *fmt, ...)
{
+#ifdef SMP
+ static volatile u_int panic_cpu = NOCPU;
+#endif
struct thread *td = curthread;
int bootopt, newpanic;
va_list ap;
Bruce Evans
2010-10-09 09:33:08 UTC
Post by Andriy Gapon
panic_cpu variable should be volatile
This is to prevent caching of its value in a register when it is checked
and modified by multiple CPUs in parallel.
Also, move the variable into the scope of the only function that uses it.
Reviewed by: jhb
Hint from: mdf
MFC after: 1 week
I doubt that this is either necessary or sufficient. Most variables
aren't volatile while they are locked by a mutex. But panic() cannot use
normal mutex locking, so it should access this variable using atomic
ops with memory barriers. The compiler part of the memory barriers
effectively makes _all_ variables temporarily volatile. However, 2 of
the 4 accesses to this variable don't use an atomic op.

% #ifdef SMP
% /*
% * We don't want multiple CPU's to panic at the same time, so we
% * use panic_cpu as a simple spinlock. We have to keep checking
% * panic_cpu if we are spinning in case the panic on the first
% * CPU is canceled.
% */
% if (panic_cpu != PCPU_GET(cpuid))

This access doesn't use an atomic op.

% while (atomic_cmpset_int(&panic_cpu, NOCPU,
% PCPU_GET(cpuid)) == 0)

This access uses an atomic op. Not all atomic ops use memory barriers,
at least on i386. However, this one does, at least on i386.

(I'm always confused about what atomic_any() without acq or rel
means. Do they mean that you don't want a memory barrier (this is
what they mostly give, at least on i386), and if so, what use are
they? There are far too many atomic ops, for far too many never-used
widths, with alternative spellings to encourage using a wrong one.
cmpset is especially confusing since you can spell it as cmpset,
cmpset_acq or cmpset_rel and always get the barrier, at least on
i386. At least cmpset only supports 1 width, at least on i386.)

% while (panic_cpu != NOCPU)

This access doesn't use an atomic op.

% ; /* nothing */
% #endif
% ...
% #ifdef RESTARTABLE_PANICS
% /* See if the user aborted the panic, in which case we continue. */
% if (panicstr == NULL) {
% #ifdef SMP
% atomic_store_rel_int(&panic_cpu, NOCPU);

This access uses an atomic op with a memory barrier, at least on i386.
Now its rel semantics are clear.

panicstr is non-volatile and never accessed by an atomic op, so it probably
strictly needs to be declared volatile even more than panic_cpu. I think
the only thing that makes it work now is that it is bogusly public, and
compilers can't analyze the whole program yet -- if they could, then they
would see that it is only set in panic().

% #endif
% return;
% }
% #endif
% #endif

Now, why don't the partial memory barriers prevent caching the variable?

% if (panic_cpu != PCPU_GET(cpuid))
% while (atomic_cmpset_int(&panic_cpu, NOCPU,
% PCPU_GET(cpuid)) == 0)
% while (panic_cpu != NOCPU)
% ; /* nothing */

The very first access can't reasonably use a cached value. atomic_cmpset()
can change panic_cpu, so even without the memory barrier, panic_cpu must
be reloaded for the third access the first time. But then when the third
access is repeated in the second while loop, the missing atomic op with
barrier makes things very broken. panic_cpu isn't changed by the loop,
and the compiler thinks that it isn't changed by anything else either, so
the compiler may reduce the loop to:

% if (panic_cpu != NOCPU)
% for (;;)
% ; /* nothing */

except I've seen claims that even an endless for loop can be optimized
to nothing. Declaring panic_cpu as volatile prevents the compiler from
doing this, but I think it is insufficient. We really do want to see
panic_cpu changed by other CPUs, and what is the point of
atomic_load_acq*() if not to use it for this -- if declaring things
volatile were sufficient, then we could just use *(volatile *)&var.
atomic_load/store_acq/rel*() on i386 used to do exactly that for the UP
case, but use a lock prefix and also a memory barrier for the SMP case.
Current versions also use the memory barrier in the UP case, and have a
relevant comment about this:

% Index: atomic.h
% ===================================================================
% RCS file: /home/ncvs/src/sys/i386/include/atomic.h,v
% retrieving revision 1.32.2.1
% retrieving revision 1.55
% diff -u -1 -r1.32.2.1 -r1.55
% --- atomic.h 24 Nov 2004 18:10:02 -0000 1.32.2.1
% +++ atomic.h 20 May 2010 06:18:03 -0000 1.55
% @@ -181,5 +207,5 @@
% * SMP kernels. For UP kernels, however, the cache of the single processor
% - * is always consistent, so we don't need any memory barriers.
% + * is always consistent, so we only need to take care of compiler.
% */

i386 doesn't really have memory barriers (maybe l/s/mfence are, but we
don't use them even on amd64), but we have to use "memory" clobbers to
take care of the compiler, or we would have to declare too many things
as volatile.

% -#define ATOMIC_STORE_LOAD(TYPE, LOP, SOP) \
% +#define ATOMIC_STORE_LOAD(TYPE, LOP, SOP) \
% static __inline u_##TYPE \
% @@ -187,3 +213,7 @@
% { \
% - return (*p); \
% + u_##TYPE tmp; \
% + \
% + tmp = *p; \
% + __asm __volatile("" : : : "memory"); \
% + return (tmp); \
% } \
% @@ -193,2 +223,3 @@
% { \
% + __asm __volatile("" : : : "memory"); \
% *p = v; \

5-STABLE is still missing the memory clobber, so a newer compiler on it
might compile away the while loop even when it is fixed to use
atomic_load_acq_int().
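For readers following along, the clobber pattern in the diff above can be imitated in portable GNU C. This is a minimal sketch under stated assumptions, not the kernel's actual atomic.h; the demo_* names are invented for illustration:

```c
#include <assert.h>

/*
 * Illustrative stand-ins for the i386 atomic.h pattern shown in the
 * diff: a plain load or store separated from surrounding code by a
 * compiler-only "memory" clobber, so the compiler must reload the
 * variable on every iteration even though no CPU fence instruction is
 * emitted.
 */
static inline unsigned int
demo_load_acq_uint(unsigned int *p)
{
	unsigned int tmp;

	tmp = *p;
	__asm__ __volatile__("" : : : "memory");	/* compiler barrier only */
	return (tmp);
}

static inline void
demo_store_rel_uint(unsigned int *p, unsigned int v)
{
	__asm__ __volatile__("" : : : "memory");	/* compiler barrier only */
	*p = v;
}

/*
 * Spin until *p differs from 'old'.  The clobber in demo_load_acq_uint()
 * keeps the compiler from hoisting the load out of the loop the way the
 * reduced "if (...) for (;;);" form above does.
 */
static unsigned int
demo_wait_for_change(unsigned int *p, unsigned int old)
{
	unsigned int v;

	while ((v = demo_load_acq_uint(p)) == old)
		;	/* nothing */
	return (v);
}
```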

% ...
% if (panicstr == NULL) {
% atomic_store_rel_int(&panic_cpu, NOCPU);

Smaller problems with this. Just the unlocked access to panicstr.
Maybe not a problem, except to understand why it isn't one. I think
the earlier while loops prevent concurrent modification of panicstr.
Recursive panics are possible, but since they are possible the compiler
should see that they are possible even if it analyzes the whole program,
and thus know that it cannot cache panicstr.

Bruce
Andriy Gapon
2010-10-09 09:48:50 UTC
Post by Bruce Evans
Post by Andriy Gapon
panic_cpu variable should be volatile
This is to prevent caching of its value in a register when it is checked
and modified by multiple CPUs in parallel.
Also, move the variable into the scope of the only function that uses it.
Reviewed by: jhb
Hint from: mdf
MFC after: 1 week
I doubt that this is either necessary or sufficient. Most variables
aren't volatile while they are locked by a mutex. But panic() cannot use
normal mutex locking, so it should access this variable using atomic
ops with memory barriers. The compiler part of the memory barriers
effectively makes _all_ variables temporarily volatile. However, 2 of
the 4 accesses to this variable don't use an atomic op.
% #ifdef SMP
% /*
% * We don't want multiple CPU's to panic at the same time, so we
% * use panic_cpu as a simple spinlock. We have to keep checking
% * panic_cpu if we are spinning in case the panic on the first
% * CPU is canceled.
% */
% if (panic_cpu != PCPU_GET(cpuid))
This access doesn't use an atomic op.
% while (atomic_cmpset_int(&panic_cpu, NOCPU,
% PCPU_GET(cpuid)) == 0)
This access uses an atomic op. Not all atomic ops use memory barriers,
at least on i386. However, this one does, at least on i386.
(I'm always confused about what atomic_any() without acq or rel
means. Do they mean that you don't want a memory barrier (this is
what they mostly give, at least on i386), and if so, what use are
they? There are far too many atomic ops, for far too many never-used
widths, with alternative spellings to encourage using a wrong one.
cmpset is especially confusing since you can spell it as cmpset,
cmpset_acq or cmpset_rel and always get the barrier, at least on
i386. At least cmpset only supports 1 width, at least on i386.)
% while (panic_cpu != NOCPU)
This access doesn't use an atomic op.
% ; /* nothing */
% #endif
% ...
% #ifdef RESTARTABLE_PANICS
% /* See if the user aborted the panic, in which case we continue. */
% if (panicstr == NULL) {
% #ifdef SMP
% atomic_store_rel_int(&panic_cpu, NOCPU);
This access uses an atomic op with a memory barrier, at least on i386.
Now its rel semantics are clear.
panicstr is non-volatile and never accessed by an atomic op, so it probably
strictly needs to be declared volatile even more than panic_cpu. I think
I agree about panicstr.
But I am not sure whether we have places in the code (beyond the panic
function itself) where volatile vs. non-volatile would make any actual
difference. But still.
Post by Bruce Evans
the only thing that makes it work now is that it is bogusly public, and
compilers can't analyze the whole program yet -- if they could, then they
would see that it is only set in panic().
% #endif
% return;
% }
% #endif
% #endif
Now, why don't the partial memory barriers prevent caching the variable?
% if (panic_cpu != PCPU_GET(cpuid))
% while (atomic_cmpset_int(&panic_cpu, NOCPU,
% PCPU_GET(cpuid)) == 0)
% while (panic_cpu != NOCPU)
% ; /* nothing */
The very first access can't reasonably use a cached value. atomic_cmpset()
can change panic_cpu, so even without the memory barrier, panic_cpu must
be reloaded for the third access the first time. But then when the third
access is repeated in the second while loop, the missing atomic op with
barrier makes things very broken. panic_cpu isn't changed by the loop,
and the compiler thinks that it isn't changed by anything else either, so
% if (panic_cpu != NOCPU)
% for (;;)
% ; /* nothing */
Yes, it's exactly the last loop that had the problem.
On amd64 with -O2:
	.loc 1 544 0
	movl	panic_cpu(%rip), %eax
.LVL134:
	.p2align 4,,7
.L210:
	cmpl	$255, %eax
	jne	.L210
	jmp	.L225
Post by Bruce Evans
except I've seen claims that even an endless for loop can be optimized
to nothing. Declaring panic_cpu as volatile prevents the compiler doing
this, but I think it is insufficient. We really do want to see panic_cpu
changed by other CPUs, and what is the point of atomic_load_acq*() if not
to use for this -- if declaring things volatile were sufficient, then we
could just use *(volatile *)&var. atomic_load/store_acq/rel*() on i386
I discussed this with kib and the idea is that an atomic operation is not
needed in that place and volatile is sufficient, because we don't need
barrier semantics there and only want to ensure that the variable is
reloaded from memory.
atomic_store_rel at the end seems to be needed because of the
inter-dependency between panicstr and panic_cpu. That is, we want changes
to panicstr to become visible at the same time as (before, actually)
changes to panic_cpu become visible.
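That ordering requirement can be sketched with C11 atomics as a rough userspace analogue. The demo_* names are hypothetical stand-ins for the kernel globals, not the kernel code itself:

```c
#include <assert.h>
#include <stdatomic.h>
#include <stddef.h>

#define	DEMO_NOCPU	255u

/* Hypothetical stand-ins for the kernel globals under discussion. */
static const char *demo_panicstr;
static atomic_uint demo_panic_cpu = DEMO_NOCPU;

/*
 * Publish "the panic was canceled": clear demo_panicstr first, then
 * store NOCPU into demo_panic_cpu with release semantics.  Any CPU
 * whose later acquire load observes panic_cpu == NOCPU is then also
 * guaranteed to observe the cleared panicstr -- the inter-dependency
 * described above.
 */
static void
demo_panic_release(void)
{
	demo_panicstr = NULL;
	atomic_store_explicit(&demo_panic_cpu, DEMO_NOCPU,
	    memory_order_release);
}
```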
--
Andriy Gapon
Bruce Evans
2010-10-13 00:29:08 UTC
[>]*...
Post by Andriy Gapon
Post by Bruce Evans
Now, why don't the partial memory barriers prevent caching the variable?
% if (panic_cpu != PCPU_GET(cpuid))
% while (atomic_cmpset_int(&panic_cpu, NOCPU,
% PCPU_GET(cpuid)) == 0)
% while (panic_cpu != NOCPU)
% ; /* nothing */
The very first access can't reasonably use a cachec value. atomic_cmpset()
So the first access doesn't need a compiler-type memory barrier. Not needing
a CPU-type one is more subtle -- see below.
Post by Andriy Gapon
Post by Bruce Evans
can change panic_cpu, so even without the memory barrier, panic_cpu must
be reloaded for the third access the first time. But then when the third
access is repeated in the second while loop, the missing atomic op with
barrier makes things very broken. panic_cpu isn't changed by the loop,
and the compiler thinks that it isn't changed by anything else either, so
% if (panic_cpu != NOCPU)
% for (;;)
% ; /* nothing */
Yes, it's exactly the last loop that had the problem.
.loc 1 544 0
movl panic_cpu(%rip), %eax
.p2align 4,,7
cmpl $255, %eax
jne .L210
jmp .L225
Post by Bruce Evans
except I've seen claims that even an endless for loop can be optimized
to nothing. Declaring panic_cpu as volatile prevents the compiler doing
this, but I think it is insufficient. We really do want to see panic_cpu
changed by other CPUs, and what is the point of atomic_load_acq*() if not
to use for this -- if declaring things volatile were sufficient, then we
could just use *(volatile *)&var. atomic_load/store_acq/rel*() on i386
I discussed this with kib and the idea is that atomic operation is not needed in
that place and volatile is sufficient, because we don't need barrier semantics
there and only want to ensure that variable is reloaded from memory.
If I understand correctly, without an acquiring barrier the variable is not
guaranteed to be loaded from memory; it can end up with some stale value from
the cpu's cache or pipeline. If an incorrect value is checked in the first "if"
operator, it can lead to skipping all the atomic synchronization.
The same goes for writing to the variable: without a barrier the new value may
be written to some local cpu cache and not be seen by readers until it is
flushed from the cache.
That's what I was thinking. Now I think the only problem is that not
locking everything just makes the correctness very hard to understand,
and it shouldn't be done since this code is as far as possible from
needing to be efficient.

If the access in the inner while loop sees a stale value, then there
is no problem provided the value eventually becomes unstale or just
changes to a different stale value -- then we exit the inner loop and
go back to the locked access in the outer loop.

The access before the outer loop can at most see a stale value, which
doesn't matter. Only the current CPU can set panic_cpu to PCPU_GET(cpuid).
When panic() is called with panic_cpu having this value, it means that
the panic is recursive (since panic_cpu is initially NOCPU, and the
current CPU doesn't/shouldn't change it to anything but itself). For
recursive panics, the current CPU should already have locked everything
and stopped other CPUs. It doesn't need to lock against itself, and it
needs the test using the first access to avoid deadlocking against itself.
When panic() is called with panic_cpu = NOCPU and this value is not stale,
then there is no problem. When panic() is called with panic_cpu = NOCPU
and this value is stale, it means that one CPU entered and left panic()
and set the value to NOCPU, but another CPU has entered panic() and set
the value to !NOCPU but we haven't seen this yet. Then we reach the
outer while loop expecting to acquire the lock, but when we try to
acquire it we don't get it, and there is no problem. Finally, when
panic() is called with panic_cpu = another_cpu, there is similarly
no problem, whether or not this value is stale. Then we reach the outer
while loop expecting not to acquire the lock, but we may sometimes
acquire it, even on the first try, if the value was stale or if it
just became stale after we read it.

Perhaps my rule for locking read accesses may allow/require the complexity
in the above. It is: never lock read accesses, since the value may be
stale after you read it, whether or not you locked the access. It is
only in delicate cases, where you need a value that is non-stale at the
time of the access but are happy with a value that is stale later, that
locking a read access is useful.

The panic locking could be written using only standard complications using
a recursive mutex (or trylock), except then it might have to fight with
the mutex implementation doing more than the simple spin:

	mtx_lock_spin(&panic_mtx);
	/*
	 * Already have a problem -- the implementation disables interrupts,
	 * which we should want, but since we do wrong things like syncing
	 * with normal i/o expected to work, we probably need interrupts.
	 */

It's interesting that the implementation of mutexes doesn't optimize for
the unusual case of a mutex being recursed on, but we do that here. We
could be more like the mutex implementation and do:

	while (atomic_cmpset_int(&panic_cpu, NOCPU, PCPU_GET(cpuid)) == 0) {
		if (panic_cpu == PCPU_GET(cpuid))
			break;		/* recursive */
		while (panic_cpu != NOCPU)
			continue;
	}
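Bruce's variant can be made concrete with C11 atomics. A single try-lock pass is shown so the recursion case is easy to test; the demo_* names are invented, and a real version would keep spinning on failure:

```c
#include <assert.h>
#include <stdatomic.h>
#include <stdbool.h>

#define	DEMO_NOCPU	255u

static atomic_uint demo_panic_cpu = DEMO_NOCPU;

/*
 * One acquire attempt of the loop sketched above: returns true when
 * this CPU takes the panic lock, or already holds it (the recursive
 * case handled by the inner cpuid check).  On failure the CAS leaves
 * the current owner in 'expected', which is exactly the value the
 * recursion test needs.
 */
static bool
demo_panic_trylock(unsigned int cpuid)
{
	unsigned int expected = DEMO_NOCPU;

	if (atomic_compare_exchange_strong(&demo_panic_cpu, &expected,
	    cpuid))
		return (true);		/* acquired */
	if (expected == cpuid)
		return (true);		/* recursive entry on this CPU */
	return (false);			/* owned by another CPU; caller spins */
}
```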
This also applies to r213736.
Post by Andriy Gapon
atomic_store_rel at the end seems to be needed because of inter-dependency
between panicstr and panic_cpu. That is we want changes to panicstr to becomes
visible at the same time (before actually) changes to panic_cpu are visible.
There are similar but larger complications for debugger entry. Debugger
entry has the additional problem that a thundering herd of CPUs may
enter at a breakpoint, or at any debugger trap where handling is necessary
but normal handling is not possible. Debugger entry has the additional
non-problem that it at least tries to stop other CPUs before proceeding
far. My i386 ddb entry starts a bit like the above:

% ef = read_eflags();
% disable_intr();
% if (atomic_cmpset_int(&kdb_trap_lock, NOCPU, PCPU_GET(cpuid)) == 0 &&
% kdb_trap_lock != PCPU_GET(cpuid)) {

It has the anti-recursion detection in the same if statement as the cmpset.

% while (atomic_cmpset_int(&output_lock, 0, 1) == 0)
% ;
% db_printf(
% "concurrent ddb entry: type %d trap, code=%x cpu=%d\n",
% type, code, PCPU_GET(cpuid));
% atomic_store_rel_int(&output_lock, 0);

Another lock to serialize output for debugging this stuff. Thundering
herds at breakpoints are very easy to arrange, and they will interleave
their output almost perfectly with some console drivers (say, a serial
console at 50 bps so that you can easily watch the problem; this is also
a good test for console drivers).

Elsewhere I posted a similar method for properly serializing plain
printf output. That bounds the time for the output so it might not
work on 50 bps consoles. The above has no bound and depends on
db_printf() never blocking endlessly even under fire from places like
here.
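The output_lock idiom above amounts to a one-word spinlock held across a whole message. As a hedged userspace sketch in C11 (hypothetical demo_* names, with printf standing in for db_printf):

```c
#include <assert.h>
#include <stdatomic.h>
#include <stdio.h>

static atomic_int demo_output_lock;

/*
 * Serialize whole messages, not single characters: spin for the lock,
 * emit the complete line, then release with release semantics,
 * mirroring the atomic_cmpset_int/atomic_store_rel_int pair quoted
 * above.
 */
static void
demo_serialized_msg(const char *msg)
{
	int expected;

	do {
		expected = 0;	/* the CAS rewrites 'expected' on failure */
	} while (!atomic_compare_exchange_weak(&demo_output_lock,
	    &expected, 1));
	printf("%s\n", msg);	/* whole line emitted under the lock */
	atomic_store_explicit(&demo_output_lock, 0, memory_order_release);
}
```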

% if (type == T_BPTFLT)
% regs->tf_eip--;

This backs off from the breakpoint. It is not MI enough. Some nut
might have put a 2-byte breakpoint instruction in the instruction
stream. Then the breakpoint won't belong to us, so not handling it
here would be correct. Backing off 2 would work OK too (it allows the
instruction to trap again). Backing off 1 backs into it and corrupts
it. Backing off 1 works OK in the same way as when the breakpoint is
not ours. Backing off 1 here handles the case where the breakpoint
is ours now but will go away because another ddb entry proceeds and
removes it.

% else {
% while (atomic_cmpset_int(&output_lock, 0, 1) == 0)
% ;
% db_printf(
% "concurrent ddb entry on non-breakpoint: too hard to handle properly\n");
% atomic_store_rel_int(&output_lock, 0);
% }

More debugging.

% while (atomic_load_acq_int(&kdb_trap_lock) != NOCPU)
% ;

This corresponds to the inner while loop in panic(). According to the
above discussion, I don't need the atomic op here. I should use pause()
here.

% write_eflags(ef);

panic() doesn't disable interrupts, especially in the inner while loop.
Interrupts must be kept disabled to avoid them doing bad things including
recursive debugger and/or panic entries, although masking of all interrupts
breaks sending of most IPIs.

% return (1);
% }
% // Proceed to stopping CPUs... Since some CPUs may be looping with
% // interrupts masked, either in the above loop or accidentally, the
% // old method of sending maskable IPIs may block us endlessly. I
% // use a hack to work around this.

Bruce
Gennady Proskurin
2010-10-12 21:08:04 UTC
Post by Andriy Gapon
Post by Bruce Evans
Post by Andriy Gapon
[...]
I discussed this with kib and the idea is that atomic operation is not needed in
that place and volatile is sufficient, because we don't need barrier semantics
there and only want to ensure that variable is reloaded from memory.
If I understand correctly, without an acquiring barrier the variable is not
guaranteed to be loaded from memory; it can end up with some stale value from
the cpu's cache or pipeline. If an incorrect value is checked in the first "if"
operator, it can lead to skipping all the atomic synchronization.
The same goes for writing to the variable: without a barrier the new value may
be written to some local cpu cache and not be seen by readers until it is
flushed from the cache.
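In C11 terms, the reader-side guarantee Gennady is after is an acquire load. A minimal sketch, with a hypothetical demo_ name:

```c
#include <assert.h>
#include <stdatomic.h>

#define	DEMO_NOCPU	255u

static atomic_uint demo_panic_cpu = DEMO_NOCPU;

/*
 * Acquire load of the lock word: once this observes the value written
 * by another CPU's release store, everything that CPU stored before
 * the release (e.g. panicstr) is guaranteed visible here too.  A plain
 * volatile read only forces the reload; it makes no such ordering
 * promise.
 */
static unsigned int
demo_read_panic_cpu(void)
{
	return (atomic_load_explicit(&demo_panic_cpu,
	    memory_order_acquire));
}
```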

This also applies to r213736.
John Baldwin
2010-10-13 13:05:04 UTC
Post by Andriy Gapon
Post by Bruce Evans
Post by Andriy Gapon
panic_cpu variable should be volatile
This is to prevent caching of its value in a register when it is checked
and modified by multiple CPUs in parallel.
Also, move the variable into the scope of the only function that uses it.
Reviewed by: jhb
Hint from: mdf
MFC after: 1 week
I doubt that this is either necessary or sufficient. Most variables
aren't volatile while they are locked by a mutex. But panic() cannot use
normal mutex locking, so it should access this variable using atomic
ops with memory barriers. The compiler part of the memory barriers
effectively makes _all_ variables temporarily volatile. However, 2 of
the 4 accesses to this variable don't use an atomic op.
% #ifdef SMP
% /*
% * We don't want multiple CPU's to panic at the same time, so we
% * use panic_cpu as a simple spinlock. We have to keep checking
% * panic_cpu if we are spinning in case the panic on the first
% * CPU is canceled.
% */
% if (panic_cpu != PCPU_GET(cpuid))
This access doesn't use an atomic op.
% while (atomic_cmpset_int(&panic_cpu, NOCPU,
% PCPU_GET(cpuid)) == 0)
This access uses an atomic op. Not all atomic ops use memory barriers,
at least on i386. However, this one does, at least on i386.
(I'm always confused about what atomic_any() without acq or rel
means. Do they mean that you don't want a memory barrier (this is
what they mostly give, at least on i386), and if so, what use are
they? There are far too many atomic ops, for far too many never-used
widths, with alternative spellings to encourage using a wrong one.
cmpset is especially confusing since you can spell it as cmpset,
cmpset_acq or cmpset_rel and always get the barrier, at least on
i386. At least cmpset only supports 1 width, at least on i386.)
% while (panic_cpu != NOCPU)
This access doesn't use an atomic op.
% ; /* nothing */
% #endif
% ...
% #ifdef RESTARTABLE_PANICS
% /* See if the user aborted the panic, in which case we continue. */
% if (panicstr == NULL) {
% #ifdef SMP
% atomic_store_rel_int(&panic_cpu, NOCPU);
This access uses an atomic op with a memory barrier, at least on i386.
Now its rel semantics are clear.
panicstr is non-volatile and never accessed by an atomic op, so it probably
strictly needs to be declared volatile even more than panic_cpu. I think
I agree about panicstr.
But I am not sure if we have places in code (beyond panic function) itself where
volatile vs. non-volatile would make any actual difference.
But still.
Post by Bruce Evans
the only thing that makes it work now is that it is bogusly public, and
compilers can't analyze the whole program yet -- if they could, then they
would see that it is only set in panic().
% #endif
% return;
% }
% #endif
% #endif
Now, why don't the partial memory barriers prevent caching the variable?
% if (panic_cpu != PCPU_GET(cpuid))
% while (atomic_cmpset_int(&panic_cpu, NOCPU,
% PCPU_GET(cpuid)) == 0)
% while (panic_cpu != NOCPU)
% ; /* nothing */
The very first access can't reasonably use a cached value. atomic_cmpset()
can change panic_cpu, so even without the memory barrier, panic_cpu must
be reloaded for the third access the first time. But then when the third
access is repeated in the second while loop, the missing atomic op with
barrier makes things very broken. panic_cpu isn't changed by the loop,
and the compiler thinks that it isn't changed by anything else either, so
% if (panic_cpu != NOCPU)
% for (;;)
% ; /* nothing */
Yes, it's exactly the last loop that had the problem.
.loc 1 544 0
movl panic_cpu(%rip), %eax
.p2align 4,,7
cmpl $255, %eax
jne .L210
jmp .L225
Post by Bruce Evans
except I've seen claims that even an endless for loop can be optimized
to nothing. Declaring panic_cpu as volatile prevents the compiler doing
this, but I think it is insufficient. We really do want to see panic_cpu
changed by other CPUs, and what is the point of atomic_load_acq*() if not
to use for this -- if declaring things volatile were sufficient, then we
could just use *(volatile *)&var. atomic_load/store_acq/rel*() on i386
I discussed this with kib and the idea is that an atomic operation is not needed
in that place and volatile is sufficient, because we don't need barrier
semantics there and only want to ensure that the variable is reloaded from
memory.
If I understand correctly, without an acquire barrier the variable is not
guaranteed to be loaded from memory; it can end up with some stale value from
the CPU's cache or pipeline. If an incorrect value is checked in the first "if"
statement, it can lead to skipping all the atomic synchronization.
The same goes for writing to the variable: without a barrier the new value may
be written to some local CPU cache and not be seen by readers until it is
flushed from the cache.
No, a barrier does _not_ force any cache flushes. The point of the volatile
is to serve as a compiler barrier. A memory barrier only enforces an ordering
of memory operations in the CPU; it does not flush any caches or force a CPU
to post writes to memory. It is weaker than that. :) For example, the
sparcv9 manuals explicitly state something along the lines that any write may
be held in the CPU's store buffer for an unspecified amount of time.
Barriers do not alter that, nor would it really be useful for them to do so.
What barriers allow you to do is order the operations on a lock cookie (such
as mtx_lock in mutexes) with respect to operations on other memory locations
(e.g. the data a lock protects). By using a specific set of barriers and
protocols for accessing the lock cookie, you can ensure that the protected
data is not accessed without holding the lock. However, for a standalone
word, memory barriers do not buy you anything. And in fact, if one CPU has
written to a variable and that write is still sitting in the store buffer,
'atomic_load_acq' on another CPU may still return a stale value. All that
FreeBSD requires is that 'atomic_cmpset' will not report success using stale
data. It can either fail or block waiting for the write in a store buffer in
another CPU to drain. We typically handle the 'fail' case by continuing to
loop until we observe a new state or successfully perform 'atomic_cmpset'
(for an example, see setting of MTX_CONTESTED in mtx_lock).

Currently we do not have 'atomic_load()' and 'atomic_store()' variants without
memory barriers since 'x = y' is atomic (in that it is performed as a single
operation). However, there are cases where 'x = y' needs a compiler memory
barrier (but not a CPU one). (e.g. if 'y' can be updated by a signal handler
in userland, or an interrupt in the kernel.) That is what 'volatile' is for.
--
John Baldwin
Gennady Proskurin
2010-10-13 16:21:44 UTC
Permalink
Thank you and Bruce for explanation.
The key point here (which I did not understand) is that
"something == PCPU_GET(cpuid)" may be true only if "something" was set by the
current CPU, in which case its value is not stale.
_______________________________________________
http://lists.freebsd.org/mailman/listinfo/svn-src-head
John Baldwin
2010-10-13 19:46:09 UTC
Permalink
Post by Gennady Proskurin
Thank you and Bruce for explanation.
The key point here (which I did not understand) is that
"something == PCPU_GET(cpuid))" may be true only if "something" is set by the
current cpu, in which case its value is not stale.
Yes. mtx_owned() (and mtx_assert()) depend on a similar "feature" in that
stale values don't hurt.
--
John Baldwin