Merge rsync://git.kernel.org/pub/scm/linux/kernel/git/torvalds/linux-2.6

Conflicts:

	include/linux/kernel.h
This commit is contained in:
Steven Whitehouse 2006-07-03 10:25:08 -04:00
commit 0a1340c185
7737 changed files with 255567 additions and 164571 deletions

15
CREDITS
View file

@ -24,6 +24,11 @@ S: C. Negri 6, bl. D3
S: Iasi 6600
S: Romania
N: Mark Adler
E: madler@alumni.caltech.edu
W: http://alumnus.caltech.edu/~madler/
D: zlib decompression
N: Monalisa Agrawal
E: magrawal@nortelnetworks.com
D: Basic Interphase 5575 driver with UBR and ABR support.
@ -1573,12 +1578,8 @@ S: 160 00 Praha 6
S: Czech Republic
N: Niels Kristian Bech Jensen
E: nkbj@image.dk
W: http://www.image.dk/~nkbj
E: nkbj1970@hotmail.com
D: Miscellaneous kernel updates and fixes.
S: Dr. Holsts Vej 34, lejl. 164
S: DK-8230 Åbyhøj
S: Denmark
N: Michael K. Johnson
E: johnsonm@redhat.com
@ -3400,10 +3401,10 @@ S: Czech Republic
N: Thibaut Varene
E: T-Bone@parisc-linux.org
W: http://www.parisc-linux.org/
W: http://www.parisc-linux.org/~varenet/
P: 1024D/B7D2F063 E67C 0D43 A75E 12A5 BB1C FA2F 1E32 C3DA B7D2 F063
D: PA-RISC port minion, PDC and GSCPS2 drivers, debuglocks and other bits
D: Some bits in an ARM port, S1D13XXX FB driver, random patches here and there
D: Some ARM at91rm9200 bits, S1D13XXX FB driver, random patches here and there
D: AD1889 sound driver
S: Paris, France

77
Documentation/ABI/README Normal file
View file

@ -0,0 +1,77 @@
This directory attempts to document the ABI between the Linux kernel and
userspace, and the relative stability of these interfaces. Due to the
everchanging nature of Linux, and the differing maturity levels, these
interfaces should be used by userspace programs in different ways.
We have four different levels of ABI stability, as shown by the four
different subdirectories in this location. Interfaces may change levels
of stability according to the rules described below.
The different levels of stability are:
stable/
This directory documents the interfaces that the developer has
defined to be stable. Userspace programs are free to use these
interfaces with no restrictions, and backward compatibility for
them will be guaranteed for at least 2 years. Most interfaces
(like syscalls) are expected to never change and always be
available.
testing/
This directory documents interfaces that are felt to be stable,
as the main development of this interface has been completed.
The interface can be changed to add new features, but the
current interface will not break by doing this, unless grave
errors or security problems are found in them. Userspace
programs can start to rely on these interfaces, but they must be
aware of changes that can occur before these interfaces move to
be marked stable. Programs that use these interfaces are
strongly encouraged to add their name to the description of
these interfaces, so that the kernel developers can easily
notify them if any changes occur (see the description of the
layout of the files below for details on how to do this.)
obsolete/
This directory documents interfaces that are still remaining in
the kernel, but are marked to be removed at some later point in
time. The description of the interface will document the reason
why it is obsolete and when it can be expected to be removed.
The file Documentation/feature-removal-schedule.txt may describe
some of these interfaces, giving a schedule for when they will
be removed.
removed/
This directory contains a list of the old interfaces that have
been removed from the kernel.
Every file in these directories will contain the following information:
What: Short description of the interface
Date: Date created
KernelVersion: Kernel version this feature first showed up in.
Contact: Primary contact for this interface (may be a mailing list)
Description: Long description of the interface and how to use it.
Users: All users of this interface who wish to be notified when
it changes. This is very important for interfaces in
the "testing" stage, so that kernel developers can work
with userspace developers to ensure that things do not
break in ways that are unacceptable. It is also
important to get feedback for these interfaces to make
sure they are working in a proper way and do not need to
be changed further.
How things move between levels:
Interfaces in stable may move to obsolete, as long as the proper
notification is given.
Interfaces may be removed from obsolete and the kernel as long as the
documented amount of time has gone by.
Interfaces in the testing state can move to the stable state when the
developers feel they are finished. They cannot be removed from the
kernel tree without going through the obsolete state first.
It's up to the developer to place their interfaces in the category they
wish for it to start out in.

View file

@ -0,0 +1,13 @@
What: devfs
Date: July 2005
Contact: Greg Kroah-Hartman <gregkh@suse.de>
Description:
devfs has been unmaintained for a number of years, has unfixable
races, contains a naming policy within the kernel that is
against the LSB, and can be replaced by using udev.
The files fs/devfs/*, include/linux/devfs_fs*.h will be removed,
along with the the assorted devfs function calls throughout the
kernel tree.
Users:

View file

@ -0,0 +1,10 @@
What: The kernel syscall interface
Description:
This interface matches much of the POSIX interface and is based
on it and other Unix based interfaces. It will only be added to
over time, and not have things removed from it.
Note that this interface is different for every architecture
that Linux supports. Please see the architecture-specific
documentation for details on the syscall numbers that are to be
mapped to each syscall.

View file

@ -0,0 +1,30 @@
What: /sys/module
Description:
The /sys/module tree consists of the following structure:
/sys/module/MODULENAME
The name of the module that is in the kernel. This
module name will show up either if the module is built
directly into the kernel, or if it is loaded as a
dyanmic module.
/sys/module/MODULENAME/parameters
This directory contains individual files that are each
individual parameters of the module that are able to be
changed at runtime. See the individual module
documentation as to the contents of these parameters and
what they accomplish.
Note: The individual parameter names and values are not
considered stable, only the fact that they will be
placed in this location within sysfs. See the
individual driver documentation for details as to the
stability of the different parameters.
/sys/module/MODULENAME/refcnt
If the module is able to be unloaded from the kernel, this file
will contain the current reference count of the module.
Note: If the module is built into the kernel, or if the
CONFIG_MODULE_UNLOAD kernel configuration value is not enabled,
this file will not be present.

View file

@ -0,0 +1,16 @@
What: /sys/class/
Date: Febuary 2006
Contact: Greg Kroah-Hartman <gregkh@suse.de>
Description:
The /sys/class directory will consist of a group of
subdirectories describing individual classes of devices
in the kernel. The individual directories will consist
of either subdirectories, or symlinks to other
directories.
All programs that use this directory tree must be able
to handle both subdirectories or symlinks in order to
work properly.
Users:
udev <linux-hotplug-devel@lists.sourceforge.net>

View file

@ -0,0 +1,25 @@
What: /sys/devices
Date: February 2006
Contact: Greg Kroah-Hartman <gregkh@suse.de>
Description:
The /sys/devices tree contains a snapshot of the
internal state of the kernel device tree. Devices will
be added and removed dynamically as the machine runs,
and between different kernel versions, the layout of the
devices within this tree will change.
Please do not rely on the format of this tree because of
this. If a program wishes to find different things in
the tree, please use the /sys/class structure and rely
on the symlinks there to point to the proper location
within the /sys/devices tree of the individual devices.
Or rely on the uevent messages to notify programs of
devices being added and removed from this tree to find
the location of those devices.
Note that sometimes not all devices along the directory
chain will have emitted uevent messages, so userspace
programs must be able to handle such occurrences.
Users:
udev <linux-hotplug-devel@lists.sourceforge.net>

View file

@ -181,8 +181,8 @@ Intel IA32 microcode
--------------------
A driver has been added to allow updating of Intel IA32 microcode,
accessible as both a devfs regular file and as a normal (misc)
character device. If you are not using devfs you may need to:
accessible as a normal (misc) character device. If you are not using
udev you may need to:
mkdir /dev/cpu
mknod /dev/cpu/microcode c 10 184
@ -201,7 +201,9 @@ with programs using shared memory.
udev
----
udev is a userspace application for populating /dev dynamically with
only entries for devices actually present. udev replaces devfs.
only entries for devices actually present. udev replaces the basic
functionality of devfs, while allowing persistant device naming for
devices.
FUSE
----
@ -231,18 +233,13 @@ The PPP driver has been restructured to support multilink and to
enable it to operate over diverse media layers. If you use PPP,
upgrade pppd to at least 2.4.0.
If you are not using devfs, you must have the device file /dev/ppp
If you are not using udev, you must have the device file /dev/ppp
which can be made by:
mknod /dev/ppp c 108 0
as root.
If you use devfsd and build ppp support as modules, you will need
the following in your /etc/devfsd.conf file:
LOOKUP PPP MODLOAD
Isdn4k-utils
------------

View file

@ -155,7 +155,83 @@ problem, which is called the function-growth-hormone-imbalance syndrome.
See next chapter.
Chapter 5: Functions
Chapter 5: Typedefs
Please don't use things like "vps_t".
It's a _mistake_ to use typedef for structures and pointers. When you see a
vps_t a;
in the source, what does it mean?
In contrast, if it says
struct virtual_container *a;
you can actually tell what "a" is.
Lots of people think that typedefs "help readability". Not so. They are
useful only for:
(a) totally opaque objects (where the typedef is actively used to _hide_
what the object is).
Example: "pte_t" etc. opaque objects that you can only access using
the proper accessor functions.
NOTE! Opaqueness and "accessor functions" are not good in themselves.
The reason we have them for things like pte_t etc. is that there
really is absolutely _zero_ portably accessible information there.
(b) Clear integer types, where the abstraction _helps_ avoid confusion
whether it is "int" or "long".
u8/u16/u32 are perfectly fine typedefs, although they fit into
category (d) better than here.
NOTE! Again - there needs to be a _reason_ for this. If something is
"unsigned long", then there's no reason to do
typedef unsigned long myflags_t;
but if there is a clear reason for why it under certain circumstances
might be an "unsigned int" and under other configurations might be
"unsigned long", then by all means go ahead and use a typedef.
(c) when you use sparse to literally create a _new_ type for
type-checking.
(d) New types which are identical to standard C99 types, in certain
exceptional circumstances.
Although it would only take a short amount of time for the eyes and
brain to become accustomed to the standard types like 'uint32_t',
some people object to their use anyway.
Therefore, the Linux-specific 'u8/u16/u32/u64' types and their
signed equivalents which are identical to standard types are
permitted -- although they are not mandatory in new code of your
own.
When editing existing code which already uses one or the other set
of types, you should conform to the existing choices in that code.
(e) Types safe for use in userspace.
In certain structures which are visible to userspace, we cannot
require C99 types and cannot use the 'u32' form above. Thus, we
use __u32 and similar types in all structures which are shared
with userspace.
Maybe there are other cases too, but the rule should basically be to NEVER
EVER use a typedef unless you can clearly match one of those rules.
In general, a pointer, or a struct that has elements that can reasonably
be directly accessed should _never_ be a typedef.
Chapter 6: Functions
Functions should be short and sweet, and do just one thing. They should
fit on one or two screenfuls of text (the ISO/ANSI screen size is 80x24,
@ -183,7 +259,7 @@ and it gets confused. You know you're brilliant, but maybe you'd like
to understand what you did 2 weeks from now.
Chapter 6: Centralized exiting of functions
Chapter 7: Centralized exiting of functions
Albeit deprecated by some people, the equivalent of the goto statement is
used frequently by compilers in form of the unconditional jump instruction.
@ -220,7 +296,7 @@ out:
return result;
}
Chapter 7: Commenting
Chapter 8: Commenting
Comments are good, but there is also a danger of over-commenting. NEVER
try to explain HOW your code works in a comment: it's much better to
@ -240,7 +316,7 @@ When commenting the kernel API functions, please use the kerneldoc format.
See the files Documentation/kernel-doc-nano-HOWTO.txt and scripts/kernel-doc
for details.
Chapter 8: You've made a mess of it
Chapter 9: You've made a mess of it
That's OK, we all do. You've probably been told by your long-time Unix
user helper that "GNU emacs" automatically formats the C sources for
@ -288,7 +364,7 @@ re-formatting you may want to take a look at the man page. But
remember: "indent" is not a fix for bad programming.
Chapter 9: Configuration-files
Chapter 10: Configuration-files
For configuration options (arch/xxx/Kconfig, and all the Kconfig files),
somewhat different indentation is used.
@ -313,7 +389,7 @@ support for file-systems, for instance) should be denoted (DANGEROUS), other
experimental options should be denoted (EXPERIMENTAL).
Chapter 10: Data structures
Chapter 11: Data structures
Data structures that have visibility outside the single-threaded
environment they are created and destroyed in should always have
@ -344,7 +420,7 @@ Remember: if another thread can find your data structure, and you don't
have a reference count on it, you almost certainly have a bug.
Chapter 11: Macros, Enums and RTL
Chapter 12: Macros, Enums and RTL
Names of macros defining constants and labels in enums are capitalized.
@ -399,7 +475,7 @@ The cpp manual deals with macros exhaustively. The gcc internals manual also
covers RTL which is used frequently with assembly language in the kernel.
Chapter 12: Printing kernel messages
Chapter 13: Printing kernel messages
Kernel developers like to be seen as literate. Do mind the spelling
of kernel messages to make a good impression. Do not use crippled
@ -410,7 +486,7 @@ Kernel messages do not have to be terminated with a period.
Printing numbers in parentheses (%d) adds no value and should be avoided.
Chapter 13: Allocating memory
Chapter 14: Allocating memory
The kernel provides the following general purpose memory allocators:
kmalloc(), kzalloc(), kcalloc(), and vmalloc(). Please refer to the API
@ -429,7 +505,7 @@ from void pointer to any other pointer type is guaranteed by the C programming
language.
Chapter 14: The inline disease
Chapter 15: The inline disease
There appears to be a common misperception that gcc has a magic "make me
faster" speedup option called "inline". While the use of inlines can be
@ -457,7 +533,7 @@ something it would have done anyway.
Chapter 15: References
Appendix I: References
The C Programming Language, Second Edition
by Brian W. Kernighan and Dennis M. Ritchie.
@ -481,4 +557,4 @@ Kernel CodingStyle, by greg@kroah.com at OLS 2002:
http://www.kroah.com/linux/talks/ols_2002_kernel_codingstyle_talk/html/
--
Last updated on 30 December 2005 by a community effort on LKML.
Last updated on 30 April 2006.

View file

@ -10,7 +10,8 @@ DOCBOOKS := wanbook.xml z8530book.xml mcabook.xml videobook.xml \
kernel-hacking.xml kernel-locking.xml deviceiobook.xml \
procfs-guide.xml writing_usb_driver.xml \
kernel-api.xml journal-api.xml lsm.xml usb.xml \
gadget.xml libata.xml mtdnand.xml librs.xml rapidio.xml
gadget.xml libata.xml mtdnand.xml librs.xml rapidio.xml \
genericirq.xml
###
# The build process is as follows (targets):

View file

@ -0,0 +1,474 @@
<?xml version="1.0" encoding="UTF-8"?>
<!DOCTYPE book PUBLIC "-//OASIS//DTD DocBook XML V4.1.2//EN"
"http://www.oasis-open.org/docbook/xml/4.1.2/docbookx.dtd" []>
<book id="Generic-IRQ-Guide">
<bookinfo>
<title>Linux generic IRQ handling</title>
<authorgroup>
<author>
<firstname>Thomas</firstname>
<surname>Gleixner</surname>
<affiliation>
<address>
<email>tglx@linutronix.de</email>
</address>
</affiliation>
</author>
<author>
<firstname>Ingo</firstname>
<surname>Molnar</surname>
<affiliation>
<address>
<email>mingo@elte.hu</email>
</address>
</affiliation>
</author>
</authorgroup>
<copyright>
<year>2005-2006</year>
<holder>Thomas Gleixner</holder>
</copyright>
<copyright>
<year>2005-2006</year>
<holder>Ingo Molnar</holder>
</copyright>
<legalnotice>
<para>
This documentation is free software; you can redistribute
it and/or modify it under the terms of the GNU General Public
License version 2 as published by the Free Software Foundation.
</para>
<para>
This program is distributed in the hope that it will be
useful, but WITHOUT ANY WARRANTY; without even the implied
warranty of MERCHANTABILITY or FITNESS FOR A PARTICULAR PURPOSE.
See the GNU General Public License for more details.
</para>
<para>
You should have received a copy of the GNU General Public
License along with this program; if not, write to the Free
Software Foundation, Inc., 59 Temple Place, Suite 330, Boston,
MA 02111-1307 USA
</para>
<para>
For more details see the file COPYING in the source
distribution of Linux.
</para>
</legalnotice>
</bookinfo>
<toc></toc>
<chapter id="intro">
<title>Introduction</title>
<para>
The generic interrupt handling layer is designed to provide a
complete abstraction of interrupt handling for device drivers.
It is able to handle all the different types of interrupt controller
hardware. Device drivers use generic API functions to request, enable,
disable and free interrupts. The drivers do not have to know anything
about interrupt hardware details, so they can be used on different
platforms without code changes.
</para>
<para>
This documentation is provided to developers who want to implement
an interrupt subsystem based for their architecture, with the help
of the generic IRQ handling layer.
</para>
</chapter>
<chapter id="rationale">
<title>Rationale</title>
<para>
The original implementation of interrupt handling in Linux is using
the __do_IRQ() super-handler, which is able to deal with every
type of interrupt logic.
</para>
<para>
Originally, Russell King identified different types of handlers to
build a quite universal set for the ARM interrupt handler
implementation in Linux 2.5/2.6. He distinguished between:
<itemizedlist>
<listitem><para>Level type</para></listitem>
<listitem><para>Edge type</para></listitem>
<listitem><para>Simple type</para></listitem>
</itemizedlist>
In the SMP world of the __do_IRQ() super-handler another type
was identified:
<itemizedlist>
<listitem><para>Per CPU type</para></listitem>
</itemizedlist>
</para>
<para>
This split implementation of highlevel IRQ handlers allows us to
optimize the flow of the interrupt handling for each specific
interrupt type. This reduces complexity in that particular codepath
and allows the optimized handling of a given type.
</para>
<para>
The original general IRQ implementation used hw_interrupt_type
structures and their ->ack(), ->end() [etc.] callbacks to
differentiate the flow control in the super-handler. This leads to
a mix of flow logic and lowlevel hardware logic, and it also leads
to unnecessary code duplication: for example in i386, there is a
ioapic_level_irq and a ioapic_edge_irq irq-type which share many
of the lowlevel details but have different flow handling.
</para>
<para>
A more natural abstraction is the clean separation of the
'irq flow' and the 'chip details'.
</para>
<para>
Analysing a couple of architecture's IRQ subsystem implementations
reveals that most of them can use a generic set of 'irq flow'
methods and only need to add the chip level specific code.
The separation is also valuable for (sub)architectures
which need specific quirks in the irq flow itself but not in the
chip-details - and thus provides a more transparent IRQ subsystem
design.
</para>
<para>
Each interrupt descriptor is assigned its own highlevel flow
handler, which is normally one of the generic
implementations. (This highlevel flow handler implementation also
makes it simple to provide demultiplexing handlers which can be
found in embedded platforms on various architectures.)
</para>
<para>
The separation makes the generic interrupt handling layer more
flexible and extensible. For example, an (sub)architecture can
use a generic irq-flow implementation for 'level type' interrupts
and add a (sub)architecture specific 'edge type' implementation.
</para>
<para>
To make the transition to the new model easier and prevent the
breakage of existing implementations, the __do_IRQ() super-handler
is still available. This leads to a kind of duality for the time
being. Over time the new model should be used in more and more
architectures, as it enables smaller and cleaner IRQ subsystems.
</para>
</chapter>
<chapter id="bugs">
<title>Known Bugs And Assumptions</title>
<para>
None (knock on wood).
</para>
</chapter>
<chapter id="Abstraction">
<title>Abstraction layers</title>
<para>
There are three main levels of abstraction in the interrupt code:
<orderedlist>
<listitem><para>Highlevel driver API</para></listitem>
<listitem><para>Highlevel IRQ flow handlers</para></listitem>
<listitem><para>Chiplevel hardware encapsulation</para></listitem>
</orderedlist>
</para>
<sect1>
<title>Interrupt control flow</title>
<para>
Each interrupt is described by an interrupt descriptor structure
irq_desc. The interrupt is referenced by an 'unsigned int' numeric
value which selects the corresponding interrupt decription structure
in the descriptor structures array.
The descriptor structure contains status information and pointers
to the interrupt flow method and the interrupt chip structure
which are assigned to this interrupt.
</para>
<para>
Whenever an interrupt triggers, the lowlevel arch code calls into
the generic interrupt code by calling desc->handle_irq().
This highlevel IRQ handling function only uses desc->chip primitives
referenced by the assigned chip descriptor structure.
</para>
</sect1>
<sect1>
<title>Highlevel Driver API</title>
<para>
The highlevel Driver API consists of following functions:
<itemizedlist>
<listitem><para>request_irq()</para></listitem>
<listitem><para>free_irq()</para></listitem>
<listitem><para>disable_irq()</para></listitem>
<listitem><para>enable_irq()</para></listitem>
<listitem><para>disable_irq_nosync() (SMP only)</para></listitem>
<listitem><para>synchronize_irq() (SMP only)</para></listitem>
<listitem><para>set_irq_type()</para></listitem>
<listitem><para>set_irq_wake()</para></listitem>
<listitem><para>set_irq_data()</para></listitem>
<listitem><para>set_irq_chip()</para></listitem>
<listitem><para>set_irq_chip_data()</para></listitem>
</itemizedlist>
See the autogenerated function documentation for details.
</para>
</sect1>
<sect1>
<title>Highlevel IRQ flow handlers</title>
<para>
The generic layer provides a set of pre-defined irq-flow methods:
<itemizedlist>
<listitem><para>handle_level_irq</para></listitem>
<listitem><para>handle_edge_irq</para></listitem>
<listitem><para>handle_simple_irq</para></listitem>
<listitem><para>handle_percpu_irq</para></listitem>
</itemizedlist>
The interrupt flow handlers (either predefined or architecture
specific) are assigned to specific interrupts by the architecture
either during bootup or during device initialization.
</para>
<sect2>
<title>Default flow implementations</title>
<sect3>
<title>Helper functions</title>
<para>
The helper functions call the chip primitives and
are used by the default flow implementations.
The following helper functions are implemented (simplified excerpt):
<programlisting>
default_enable(irq)
{
desc->chip->unmask(irq);
}
default_disable(irq)
{
if (!delay_disable(irq))
desc->chip->mask(irq);
}
default_ack(irq)
{
chip->ack(irq);
}
default_mask_ack(irq)
{
if (chip->mask_ack) {
chip->mask_ack(irq);
} else {
chip->mask(irq);
chip->ack(irq);
}
}
noop(irq)
{
}
</programlisting>
</para>
</sect3>
</sect2>
<sect2>
<title>Default flow handler implementations</title>
<sect3>
<title>Default Level IRQ flow handler</title>
<para>
handle_level_irq provides a generic implementation
for level-triggered interrupts.
</para>
<para>
The following control flow is implemented (simplified excerpt):
<programlisting>
desc->chip->start();
handle_IRQ_event(desc->action);
desc->chip->end();
</programlisting>
</para>
</sect3>
<sect3>
<title>Default Edge IRQ flow handler</title>
<para>
handle_edge_irq provides a generic implementation
for edge-triggered interrupts.
</para>
<para>
The following control flow is implemented (simplified excerpt):
<programlisting>
if (desc->status &amp; running) {
desc->chip->hold();
desc->status |= pending | masked;
return;
}
desc->chip->start();
desc->status |= running;
do {
if (desc->status &amp; masked)
desc->chip->enable();
desc-status &amp;= ~pending;
handle_IRQ_event(desc->action);
} while (status &amp; pending);
desc-status &amp;= ~running;
desc->chip->end();
</programlisting>
</para>
</sect3>
<sect3>
<title>Default simple IRQ flow handler</title>
<para>
handle_simple_irq provides a generic implementation
for simple interrupts.
</para>
<para>
Note: The simple flow handler does not call any
handler/chip primitives.
</para>
<para>
The following control flow is implemented (simplified excerpt):
<programlisting>
handle_IRQ_event(desc->action);
</programlisting>
</para>
</sect3>
<sect3>
<title>Default per CPU flow handler</title>
<para>
handle_percpu_irq provides a generic implementation
for per CPU interrupts.
</para>
<para>
Per CPU interrupts are only available on SMP and
the handler provides a simplified version without
locking.
</para>
<para>
The following control flow is implemented (simplified excerpt):
<programlisting>
desc->chip->start();
handle_IRQ_event(desc->action);
desc->chip->end();
</programlisting>
</para>
</sect3>
</sect2>
<sect2>
<title>Quirks and optimizations</title>
<para>
The generic functions are intended for 'clean' architectures and chips,
which have no platform-specific IRQ handling quirks. If an architecture
needs to implement quirks on the 'flow' level then it can do so by
overriding the highlevel irq-flow handler.
</para>
</sect2>
<sect2>
<title>Delayed interrupt disable</title>
<para>
This per interrupt selectable feature, which was introduced by Russell
King in the ARM interrupt implementation, does not mask an interrupt
at the hardware level when disable_irq() is called. The interrupt is
kept enabled and is masked in the flow handler when an interrupt event
happens. This prevents losing edge interrupts on hardware which does
not store an edge interrupt event while the interrupt is disabled at
the hardware level. When an interrupt arrives while the IRQ_DISABLED
flag is set, then the interrupt is masked at the hardware level and
the IRQ_PENDING bit is set. When the interrupt is re-enabled by
enable_irq() the pending bit is checked and if it is set, the
interrupt is resent either via hardware or by a software resend
mechanism. (It's necessary to enable CONFIG_HARDIRQS_SW_RESEND when
you want to use the delayed interrupt disable feature and your
hardware is not capable of retriggering an interrupt.)
The delayed interrupt disable can be runtime enabled, per interrupt,
by setting the IRQ_DELAYED_DISABLE flag in the irq_desc status field.
</para>
</sect2>
</sect1>
<sect1>
<title>Chiplevel hardware encapsulation</title>
<para>
The chip level hardware descriptor structure irq_chip
contains all the direct chip relevant functions, which
can be utilized by the irq flow implementations.
<itemizedlist>
<listitem><para>ack()</para></listitem>
<listitem><para>mask_ack() - Optional, recommended for performance</para></listitem>
<listitem><para>mask()</para></listitem>
<listitem><para>unmask()</para></listitem>
<listitem><para>retrigger() - Optional</para></listitem>
<listitem><para>set_type() - Optional</para></listitem>
<listitem><para>set_wake() - Optional</para></listitem>
</itemizedlist>
These primitives are strictly intended to mean what they say: ack means
ACK, masking means masking of an IRQ line, etc. It is up to the flow
handler(s) to use these basic units of lowlevel functionality.
</para>
</sect1>
</chapter>
<chapter id="doirq">
<title>__do_IRQ entry point</title>
<para>
The original implementation __do_IRQ() is an alternative entry
point for all types of interrupts.
</para>
<para>
This handler turned out to be not suitable for all
interrupt hardware and was therefore reimplemented with split
functionality for egde/level/simple/percpu interrupts. This is not
only a functional optimization. It also shortens code paths for
interrupts.
</para>
<para>
To make use of the split implementation, replace the call to
__do_IRQ by a call to desc->chip->handle_irq() and associate
the appropriate handler function to desc->chip->handle_irq().
In most cases the generic handler implementations should
be sufficient.
</para>
</chapter>
<chapter id="locking">
<title>Locking on SMP</title>
<para>
The locking of chip registers is up to the architecture that
defines the chip primitives. There is a chip->lock field that can be used
for serialization, but the generic layer does not touch it. The per-irq
structure is protected via desc->lock, by the generic layer.
</para>
</chapter>
<chapter id="structs">
<title>Structures</title>
<para>
This chapter contains the autogenerated documentation of the structures which are
used in the generic IRQ layer.
</para>
!Iinclude/linux/irq.h
</chapter>
<chapter id="pubfunctions">
<title>Public Functions Provided</title>
<para>
This chapter contains the autogenerated documentation of the kernel API functions
which are exported.
</para>
!Ekernel/irq/manage.c
!Ekernel/irq/chip.c
</chapter>
<chapter id="intfunctions">
<title>Internal Functions Provided</title>
<para>
This chapter contains the autogenerated documentation of the internal functions.
</para>
!Ikernel/irq/handle.c
!Ikernel/irq/chip.c
</chapter>
<chapter id="credits">
<title>Credits</title>
<para>
The following people have contributed to this document:
<orderedlist>
<listitem><para>Thomas Gleixner<email>tglx@linutronix.de</email></para></listitem>
<listitem><para>Ingo Molnar<email>mingo@elte.hu</email></para></listitem>
</orderedlist>
</para>
</chapter>
</book>

View file

@ -62,6 +62,8 @@
<sect1><title>Internal Functions</title>
!Ikernel/exit.c
!Ikernel/signal.c
!Iinclude/linux/kthread.h
!Ekernel/kthread.c
</sect1>
<sect1><title>Kernel objects manipulation</title>
@ -114,9 +116,33 @@ X!Ilib/string.c
</sect1>
</chapter>
<chapter id="kernel-lib">
<title>Basic Kernel Library Functions</title>
<para>
The Linux kernel provides more basic utility functions.
</para>
<sect1><title>Bitmap Operations</title>
!Elib/bitmap.c
!Ilib/bitmap.c
</sect1>
<sect1><title>Command-line Parsing</title>
!Elib/cmdline.c
</sect1>
<sect1><title>CRC Functions</title>
!Elib/crc16.c
!Elib/crc32.c
!Elib/crc-ccitt.c
</sect1>
</chapter>
<chapter id="mm">
<title>Memory Management in Linux</title>
<sect1><title>The Slab Cache</title>
!Iinclude/linux/slab.h
!Emm/slab.c
</sect1>
<sect1><title>User Space Memory Access</title>
@ -280,12 +306,13 @@ X!Ekernel/module.c
<sect1><title>MTRR Handling</title>
!Earch/i386/kernel/cpu/mtrr/main.c
</sect1>
<sect1><title>PCI Support Library</title>
!Edrivers/pci/pci.c
!Edrivers/pci/pci-driver.c
!Edrivers/pci/remove.c
!Edrivers/pci/pci-acpi.c
<!-- kerneldoc does not understand to __devinit
<!-- kerneldoc does not understand __devinit
X!Edrivers/pci/search.c
-->
!Edrivers/pci/msi.c
@ -314,9 +341,11 @@ X!Earch/i386/kernel/mca.c
</sect1>
</chapter>
<chapter id="devfs">
<title>The Device File System</title>
!Efs/devfs/base.c
<chapter id="firmware">
<title>Firmware Interfaces</title>
<sect1><title>DMI Interfaces</title>
!Edrivers/firmware/dmi_scan.c
</sect1>
</chapter>
<chapter id="sysfs">
@ -331,6 +360,18 @@ X!Earch/i386/kernel/mca.c
!Esecurity/security.c
</chapter>
<chapter id="audit">
<title>Audit Interfaces</title>
!Ekernel/audit.c
!Ikernel/auditsc.c
!Ikernel/auditfilter.c
</chapter>
<chapter id="accounting">
<title>Accounting Framework</title>
!Ikernel/acct.c
</chapter>
<chapter id="pmfuncs">
<title>Power Management</title>
!Ekernel/power/pm.c
@ -390,7 +431,6 @@ X!Edrivers/pnp/system.c
</sect1>
</chapter>
<chapter id="blkdev">
<title>Block Devices</title>
!Eblock/ll_rw_blk.c
@ -401,6 +441,14 @@ X!Edrivers/pnp/system.c
!Edrivers/char/misc.c
</chapter>
<chapter id="parportdev">
<title>Parallel Port Devices</title>
!Iinclude/linux/parport.h
!Edrivers/parport/ieee1284.c
!Edrivers/parport/share.c
!Idrivers/parport/daisy.c
</chapter>
<chapter id="viddev">
<title>Video4Linux</title>
!Edrivers/media/video/videodev.c

View file

@ -1590,7 +1590,7 @@ the amount of locking which needs to be done.
<para>
Our final dilemma is this: when can we actually destroy the
removed element? Remember, a reader might be stepping through
this element in the list right now: it we free this element and
this element in the list right now: if we free this element and
the <symbol>next</symbol> pointer changes, the reader will jump
off into garbage and crash. We need to wait until we know that
all the readers who were traversing the list when we deleted the

View file

@ -169,6 +169,22 @@ void (*tf_read) (struct ata_port *ap, struct ata_taskfile *tf);
</sect2>
<sect2><title>PIO data read/write</title>
<programlisting>
void (*data_xfer) (struct ata_device *, unsigned char *, unsigned int, int);
</programlisting>
<para>
All bmdma-style drivers must implement this hook. This is the low-level
operation that actually copies the data bytes during a PIO data
transfer.
Typically the driver
will choose one of ata_pio_data_xfer_noirq(), ata_pio_data_xfer(), or
ata_mmio_data_xfer().
</para>
</sect2>
<sect2><title>ATA command execute</title>
<programlisting>
void (*exec_command)(struct ata_port *ap, struct ata_taskfile *tf);
@ -204,11 +220,10 @@ command.
<programlisting>
u8 (*check_status)(struct ata_port *ap);
u8 (*check_altstatus)(struct ata_port *ap);
u8 (*check_err)(struct ata_port *ap);
</programlisting>
<para>
Reads the Status/AltStatus/Error ATA shadow register from
Reads the Status/AltStatus ATA shadow register from
hardware. On some hardware, reading the Status register has
the side effect of clearing the interrupt condition.
Most drivers for taskfile-based hardware use
@ -269,23 +284,6 @@ void (*set_mode) (struct ata_port *ap);
</sect2>
<sect2><title>Reset ATA bus</title>
<programlisting>
void (*phy_reset) (struct ata_port *ap);
</programlisting>
<para>
The very first step in the probe phase. Actions vary depending
on the bus type, typically. After waking up the device and probing
for device presence (PATA and SATA), typically a soft reset
(SRST) will be performed. Drivers typically use the helper
functions ata_bus_reset() or sata_phy_reset() for this hook.
Many SATA drivers use sata_phy_reset() or call it from within
their own phy_reset() functions.
</para>
</sect2>
<sect2><title>Control PCI IDE BMDMA engine</title>
<programlisting>
void (*bmdma_setup) (struct ata_queued_cmd *qc);
@ -354,16 +352,74 @@ int (*qc_issue) (struct ata_queued_cmd *qc);
</sect2>
<sect2><title>Timeout (error) handling</title>
<sect2><title>Exception and probe handling (EH)</title>
<programlisting>
void (*eng_timeout) (struct ata_port *ap);
void (*phy_reset) (struct ata_port *ap);
</programlisting>
<para>
This is a high level error handling function, called from the
error handling thread, when a command times out. Most newer
hardware will implement its own error handling code here. IDE BMDMA
drivers may use the helper function ata_eng_timeout().
Deprecated. Use ->error_handler() instead.
</para>
<programlisting>
void (*freeze) (struct ata_port *ap);
void (*thaw) (struct ata_port *ap);
</programlisting>
<para>
ata_port_freeze() is called when HSM violations or some other
condition disrupts normal operation of the port. A frozen port
is not allowed to perform any operation until the port is
thawed, which usually follows a successful reset.
</para>
<para>
The optional ->freeze() callback can be used for freezing the port
hardware-wise (e.g. mask interrupt and stop DMA engine). If a
port cannot be frozen hardware-wise, the interrupt handler
must ack and clear interrupts unconditionally while the port
is frozen.
</para>
<para>
The optional ->thaw() callback is called to perform the opposite of ->freeze():
prepare the port for normal operation once again. Unmask interrupts,
start DMA engine, etc.
</para>
<programlisting>
void (*error_handler) (struct ata_port *ap);
</programlisting>
<para>
->error_handler() is a driver's hook into probe, hotplug, and recovery
and other exceptional conditions. The primary responsibility of an
implementation is to call ata_do_eh() or ata_bmdma_drive_eh() with a set
of EH hooks as arguments:
</para>
<para>
'prereset' hook (may be NULL) is called during an EH reset, before any other actions
are taken.
</para>
<para>
'postreset' hook (may be NULL) is called after the EH reset is performed. Based on
existing conditions, severity of the problem, and hardware capabilities,
</para>
<para>
Either 'softreset' (may be NULL) or 'hardreset' (may be NULL) will be
called to perform the low-level EH reset.
</para>
<programlisting>
void (*post_internal_cmd) (struct ata_queued_cmd *qc);
</programlisting>
<para>
Perform any hardware-specific actions necessary to finish processing
after executing a probe-time or EH-time command via ata_exec_internal().
</para>
</sect2>

View file

@ -189,9 +189,9 @@ static unsigned long baseaddr;
<sect1>
<title>Partition defines</title>
<para>
If you want to divide your device into parititions, then
enable the configuration switch CONFIG_MTD_PARITIONS and define
a paritioning scheme suitable to your board.
If you want to divide your device into partitions, then
enable the configuration switch CONFIG_MTD_PARTITIONS and define
a partitioning scheme suitable to your board.
</para>
<programlisting>
#define NUM_PARTITIONS 2

View file

@ -976,7 +976,7 @@ static int camera_close(struct video_device *dev)
<title>Interrupt Handling</title>
<para>
Our example handler is for an ISA bus device. If it was PCI you would be
able to share the interrupt and would have set SA_SHIRQ to indicate a
able to share the interrupt and would have set IRQF_SHARED to indicate a
shared IRQ. We pass the device pointer as the interrupt routine argument. We
don't need to since we only support one card but doing this will make it
easier to upgrade the driver for multiple devices in the future.

View file

@ -10,7 +10,7 @@ standard for controlling intelligent devices that monitor a system.
It provides for dynamic discovery of sensors in the system and the
ability to monitor the sensors and be informed when the sensor's
values change or go outside certain boundaries. It also has a
standardized database for field-replacable units (FRUs) and a watchdog
standardized database for field-replaceable units (FRUs) and a watchdog
timer.
To use this, you need an interface to an IPMI controller in your
@ -64,7 +64,7 @@ situation, you need to read the section below named 'The SI Driver' or
IPMI defines a standard watchdog timer. You can enable this with the
'IPMI Watchdog Timer' config option. If you compile the driver into
the kernel, then via a kernel command-line option you can have the
watchdog timer start as soon as it intitializes. It also have a lot
watchdog timer start as soon as it initializes. It also have a lot
of other options, see the 'Watchdog' section below for more details.
Note that you can also have the watchdog continue to run if it is
closed (by default it is disabled on close). Go into the 'Watchdog

22
Documentation/IRQ.txt Normal file
View file

@ -0,0 +1,22 @@
What is an IRQ?
An IRQ is an interrupt request from a device.
Currently they can come in over a pin, or over a packet.
Several devices may be connected to the same pin thus
sharing an IRQ.
An IRQ number is a kernel identifier used to talk about a hardware
interrupt source. Typically this is an index into the global irq_desc
array, but except for what linux/interrupt.h implements the details
are architecture specific.
An IRQ number is an enumeration of the possible interrupt sources on a
machine. Typically what is enumerated is the number of input pins on
all of the interrupt controller in the system. In the case of ISA
what is enumerated are the 16 input pins on the two i8259 interrupt
controllers.
Architectures can assign additional meaning to the IRQ numbers, and
are encouraged to in the case where there is any manual configuration
of the hardware involved. The ISA IRQs are a classic example of
assigning this kind of additional meaning.

View file

@ -144,9 +144,47 @@ over a rather long period of time, but improvements are always welcome!
whether the increased speed is worth it.
8. Although synchronize_rcu() is a bit slower than is call_rcu(),
it usually results in simpler code. So, unless update performance
is important or the updaters cannot block, synchronize_rcu()
should be used in preference to call_rcu().
it usually results in simpler code. So, unless update
performance is critically important or the updaters cannot block,
synchronize_rcu() should be used in preference to call_rcu().
An especially important property of the synchronize_rcu()
primitive is that it automatically self-limits: if grace periods
are delayed for whatever reason, then the synchronize_rcu()
primitive will correspondingly delay updates. In contrast,
code using call_rcu() should explicitly limit update rate in
cases where grace periods are delayed, as failing to do so can
result in excessive realtime latencies or even OOM conditions.
Ways of gaining this self-limiting property when using call_rcu()
include:
a. Keeping a count of the number of data-structure elements
used by the RCU-protected data structure, including those
waiting for a grace period to elapse. Enforce a limit
on this number, stalling updates as needed to allow
previously deferred frees to complete.
Alternatively, limit only the number awaiting deferred
free rather than the total number of elements.
b. Limiting update rate. For example, if updates occur only
once per hour, then no explicit rate limiting is required,
unless your system is already badly broken. The dcache
subsystem takes this approach -- updates are guarded
by a global lock, limiting their rate.
c. Trusted update -- if updates can only be done manually by
superuser or some other trusted user, then it might not
be necessary to automatically limit them. The theory
here is that superuser already has lots of ways to crash
the machine.
d. Use call_rcu_bh() rather than call_rcu(), in order to take
advantage of call_rcu_bh()'s faster grace periods.
e. Periodically invoke synchronize_rcu(), permitting a limited
number of updates per grace period.
9. All RCU list-traversal primitives, which include
list_for_each_rcu(), list_for_each_entry_rcu(),

View file

@ -7,7 +7,7 @@ The CONFIG_RCU_TORTURE_TEST config option is available for all RCU
implementations. It creates an rcutorture kernel module that can
be loaded to run a torture test. The test periodically outputs
status messages via printk(), which can be examined via the dmesg
command (perhaps grepping for "rcutorture"). The test is started
command (perhaps grepping for "torture"). The test is started
when the module is loaded, and stops when the module is unloaded.
However, actually setting this config option to "y" results in the system
@ -35,6 +35,19 @@ stat_interval The number of seconds between output of torture
be printed -only- when the module is unloaded, and this
is the default.
shuffle_interval
The number of seconds to keep the test threads affinitied
to a particular subset of the CPUs. Used in conjunction
with test_no_idle_hz.
test_no_idle_hz Whether or not to test the ability of RCU to operate in
a kernel that disables the scheduling-clock interrupt to
idle CPUs. Boolean parameter, "1" to test, "0" otherwise.
torture_type The type of RCU to test: "rcu" for the rcu_read_lock()
API, "rcu_bh" for the rcu_read_lock_bh() API, and "srcu"
for the "srcu_read_lock()" API.
verbose Enable debug printk()s. Default is disabled.
@ -42,14 +55,14 @@ OUTPUT
The statistics output is as follows:
rcutorture: --- Start of test: nreaders=16 stat_interval=0 verbose=0
rcutorture: rtc: 0000000000000000 ver: 1916 tfle: 0 rta: 1916 rtaf: 0 rtf: 1915
rcutorture: Reader Pipe: 1466408 9747 0 0 0 0 0 0 0 0 0
rcutorture: Reader Batch: 1464477 11678 0 0 0 0 0 0 0 0
rcutorture: Free-Block Circulation: 1915 1915 1915 1915 1915 1915 1915 1915 1915 1915 0
rcutorture: --- End of test
rcu-torture: --- Start of test: nreaders=16 stat_interval=0 verbose=0
rcu-torture: rtc: 0000000000000000 ver: 1916 tfle: 0 rta: 1916 rtaf: 0 rtf: 1915
rcu-torture: Reader Pipe: 1466408 9747 0 0 0 0 0 0 0 0 0
rcu-torture: Reader Batch: 1464477 11678 0 0 0 0 0 0 0 0
rcu-torture: Free-Block Circulation: 1915 1915 1915 1915 1915 1915 1915 1915 1915 1915 0
rcu-torture: --- End of test
The command "dmesg | grep rcutorture:" will extract this information on
The command "dmesg | grep torture:" will extract this information on
most systems. On more esoteric configurations, it may be necessary to
use other commands to access the output of the printk()s used by
the RCU torture test. The printk()s use KERN_ALERT, so they should
@ -115,8 +128,9 @@ The following script may be used to torture RCU:
modprobe rcutorture
sleep 100
rmmod rcutorture
dmesg | grep rcutorture:
dmesg | grep torture:
The output can be manually inspected for the error flag of "!!!".
One could of course create a more elaborate script that automatically
checked for such errors.
checked for such errors. The "rmmod" command forces a "SUCCESS" or
"FAILURE" indication to be printk()ed.

View file

@ -184,7 +184,17 @@ synchronize_rcu()
blocking, it registers a function and argument which are invoked
after all ongoing RCU read-side critical sections have completed.
This callback variant is particularly useful in situations where
it is illegal to block.
it is illegal to block or where update-side performance is
critically important.
However, the call_rcu() API should not be used lightly, as use
of the synchronize_rcu() API generally results in simpler code.
In addition, the synchronize_rcu() API has the nice property
of automatically limiting update rate should grace periods
be delayed. This property results in system resilience in face
of denial-of-service attacks. Code using call_rcu() should limit
update rate in order to gain this same sort of resilience. See
checklist.txt for some approaches to limiting the update rate.
rcu_assign_pointer()
@ -790,7 +800,6 @@ RCU pointer update:
RCU grace period:
synchronize_kernel (deprecated)
synchronize_net
synchronize_sched
synchronize_rcu

View file

@ -78,9 +78,9 @@ also known as "System Drives", and Drive Groups are also called "Packs". Both
terms are in use in the Mylex documentation; I have chosen to standardize on
the more generic "Logical Drive" and "Drive Group".
DAC960 RAID disk devices are named in the style of the Device File System
(DEVFS). The device corresponding to Logical Drive D on Controller C is
referred to as /dev/rd/cCdD, and the partitions are called /dev/rd/cCdDp1
DAC960 RAID disk devices are named in the style of the obsolete Device File
System (DEVFS). The device corresponding to Logical Drive D on Controller C
is referred to as /dev/rd/cCdD, and the partitions are called /dev/rd/cCdDp1
through /dev/rd/cCdDp7. For example, partition 3 of Logical Drive 5 on
Controller 2 is referred to as /dev/rd/c2d5p3. Note that unlike with SCSI
disks the device names will not change in the event of a disk drive failure.

View file

@ -0,0 +1,57 @@
Linux Kernel patch sumbittal checklist
~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~
Here are some basic things that developers should do if they
want to see their kernel patch submittals accepted quicker.
These are all above and beyond the documentation that is provided
in Documentation/SubmittingPatches and elsewhere about submitting
Linux kernel patches.
- Builds cleanly with applicable or modified CONFIG options =y, =m, and =n.
No gcc warnings/errors, no linker warnings/errors.
- Passes allnoconfig, allmodconfig
- Builds on multiple CPU arch-es by using local cross-compile tools
or something like PLM at OSDL.
- ppc64 is a good architecture for cross-compilation checking because it
tends to use `unsigned long' for 64-bit quantities.
- Matches kernel coding style(!)
- Any new or modified CONFIG options don't muck up the config menu.
- All new Kconfig options have help text.
- Has been carefully reviewed with respect to relevant Kconfig
combinations. This is very hard to get right with testing --
brainpower pays off here.
- Check cleanly with sparse.
- Use 'make checkstack' and 'make namespacecheck' and fix any
problems that they find. Note: checkstack does not point out
problems explicitly, but any one function that uses more than
512 bytes on the stack is a candidate for change.
- Include kernel-doc to document global kernel APIs. (Not required
for static functions, but OK there also.) Use 'make htmldocs'
or 'make mandocs' to check the kernel-doc and fix any issues.
- Has been tested with CONFIG_PREEMPT, CONFIG_DEBUG_PREEMPT,
CONFIG_DEBUG_SLAB, CONFIG_DEBUG_PAGEALLOC, CONFIG_DEBUG_MUTEXES,
CONFIG_DEBUG_SPINLOCK, CONFIG_DEBUG_SPINLOCK_SLEEP all simultaneously
enabled.
- Has been build- and runtime tested with and without CONFIG_SMP and
CONFIG_PREEMPT.
- If the patch affects IO/Disk, etc: has been tested with and without
CONFIG_LBD.
2006-APR-27

View file

@ -85,7 +85,7 @@ IXP4xx provides two methods of accessing PCI memory space:
2) If > 64MB of memory space is required, the IXP4xx can be
configured to use indirect registers to access PCI This allows
for up to 128MB (0x48000000 to 0x4fffffff) of memory on the bus.
The disadvantadge of this is that every PCI access requires
The disadvantage of this is that every PCI access requires
three local register accesses plus a spinlock, but in some
cases the performance hit is acceptable. In addition, you cannot
mmap() PCI devices in this case due to the indirect nature

View file

@ -7,11 +7,13 @@ Introduction
------------
The Samsung S3C24XX range of ARM9 System-on-Chip CPUs are supported
by the 's3c2410' architecture of ARM Linux. Currently the S3C2410 and
the S3C2440 are supported CPUs.
by the 's3c2410' architecture of ARM Linux. Currently the S3C2410,
S3C2440 and S3C2442 devices are supported.
Support for the S3C2400 series is in progress.
Support for the S3C2412 and S3C2413 CPUs is being merged.
Configuration
-------------
@ -43,9 +45,18 @@ Machines
Samsung's own development board, geared for PDA work.
Samsung/Aiji SMDK2412
The S3C2412 version of the SMDK2440.
Samsung/Aiji SMDK2413
The S3C2412 version of the SMDK2440.
Samsung/Meritech SMDK2440
The S3C2440 compatible version of the SMDK2440
The S3C2440 compatible version of the SMDK2440, which has the
option of an S3C2440 or S3C2442 CPU module.
Thorcom VR1000
@ -211,24 +222,6 @@ Port Contributors
Lucas Correia Villa Real (S3C2400 port)
Document Changes
----------------
05 Sep 2004 - BJD - Added Document Changes section
05 Sep 2004 - BJD - Added Klaus Fetscher to list of contributors
25 Oct 2004 - BJD - Added Dimitry Andric to list of contributors
25 Oct 2004 - BJD - Updated the MTD from the 2.6.9 merge
21 Jan 2005 - BJD - Added rx3715, added Shannon to contributors
10 Feb 2005 - BJD - Added Guillaume Gourat to contributors
02 Mar 2005 - BJD - Added SMDK2440 to list of machines
06 Mar 2005 - BJD - Added Christer Weinigel
08 Mar 2005 - BJD - Added LCVR to list of people, updated introduction
08 Mar 2005 - BJD - Added section on adding machines
09 Sep 2005 - BJD - Added section on platform data
11 Feb 2006 - BJD - Added I2C, RTC and Watchdog sections
11 Feb 2006 - BJD - Added Osiris machine, and S3C2400 information
Document Author
---------------

View file

@ -0,0 +1,120 @@
S3C2412 ARM Linux Overview
==========================
Introduction
------------
The S3C2412 is part of the S3C24XX range of ARM9 System-on-Chip CPUs
from Samsung. This part has an ARM926-EJS core, capable of running up
to 266MHz (see data-sheet for more information)
Clock
-----
The core clock code provides a set of clocks to the drivers, and allows
for source selection and a number of other features.
Power
-----
No support for suspend/resume to RAM in the current system.
DMA
---
No current support for DMA.
GPIO
----
There is support for setting the GPIO to input/output/special function
and reading or writing to them.
UART
----
The UART hardware is similar to the S3C2440, and is supported by the
s3c2410 driver in the drivers/serial directory.
NAND
----
The NAND hardware is similar to the S3C2440, and is supported by the
s3c2410 driver in the drivers/mtd/nand directory.
USB Host
--------
The USB hardware is similar to the S3C2410, with extended clock source
control. The OHCI portion is supported by the ohci-s3c2410 driver, and
the clock control selection is supported by the core clock code.
USB Device
----------
No current support in the kernel
IRQs
----
All the standard, and external interrupt sources are supported. The
extra sub-sources are not yet supported.
RTC
---
The RTC hardware is similar to the S3C2410, and is supported by the
s3c2410-rtc driver.
Watchdog
--------
The watchdog harware is the same as the S3C2410, and is supported by
the s3c2410_wdt driver.
MMC/SD/SDIO
-----------
No current support for the MMC/SD/SDIO block.
IIC
---
The IIC hardware is the same as the S3C2410, and is supported by the
i2c-s3c24xx driver.
IIS
---
No current support for the IIS interface.
SPI
---
No current support for the SPI interfaces.
ATA
---
No current support for the on-board ATA block.
Document Author
---------------
Ben Dooks, (c) 2006 Simtec Electronics

View file

@ -0,0 +1,21 @@
S3C2413 ARM Linux Overview
==========================
Introduction
------------
The S3C2413 is an extended version of the S3C2412, with an camera
interface and mobile DDR memory support. See the S3C2412 support
documentation for more information.
Camera Interface
---------------
This block is currently not supported.
Document Author
---------------
Ben Dooks, (c) 2006 Simtec Electronics

View file

@ -0,0 +1,61 @@
README on the ADC/Touchscreen Controller
========================================
The LH79524 and LH7A404 include a built-in Analog to Digital
controller (ADC) that is used to process input from a touchscreen.
The driver only implements a four-wire touch panel protocol.
The touchscreen driver is maintenance free except for the pen-down or
touch threshold. Some resistive displays and board combinations may
require tuning of this threshold. The driver exposes some of it's
internal state in the sys filesystem. If the kernel is configured
with it, CONFIG_SYSFS, and sysfs is mounted at /sys, there will be a
directory
/sys/devices/platform/adc-lh7.0
containing these files.
-r--r--r-- 1 root root 4096 Jan 1 00:00 samples
-rw-r--r-- 1 root root 4096 Jan 1 00:00 threshold
-r--r--r-- 1 root root 4096 Jan 1 00:00 threshold_range
The threshold is the current touch threshold. It defaults to 750 on
most targets.
# cat threshold
750
The threshold_range contains the range of valid values for the
threshold. Values outside of this range will be silently ignored.
# cat threshold_range
0 1023
To change the threshold, write a value to the threshold file.
# echo 500 > threshold
# cat threshold
500
The samples file contains the most recently sampled values from the
ADC. There are 12. Below are typical of the last sampled values when
the pen has been released. The first two and last two samples are for
detecting whether or not the pen is down. The third through sixth are
X coordinate samples. The seventh through tenth are Y coordinate
samples.
# cat samples
1023 1023 0 0 0 0 530 529 530 529 1023 1023
To determine a reasonable threshold, press on the touch panel with an
appropriate stylus and read the values from samples.
# cat samples
1023 676 92 103 101 102 855 919 922 922 1023 679
The first and eleventh samples are discarded. Thus, the important
values are the second and twelfth which are used to determine if the
pen is down. When both are below the threshold, the driver registers
that the pen is down. When either is above the threshold, it
registers then pen is up.

View file

@ -0,0 +1,59 @@
README on the LCD Panels
========================
Configuration options for several LCD panels, available from Logic PD,
are included in the kernel source. This README will help you
understand the configuration data and give you some guidance for
adding support for other panels if you wish.
lcd-panels.h
------------
There is no way, at present, to detect which panel is attached to the
system at runtime. Thus the kernel configuration is static. The file
arch/arm/mach-ld7a40x/lcd-panels.h (or similar) defines all of the
panel specific parameters.
It should be possible for this data to be shared among several device
families. The current layout may be insufficiently general, but it is
amenable to improvement.
PIXEL_CLOCK
-----------
The panel data sheets will give a range of acceptable pixel clocks.
The fundamental LCDCLK input frequency is divided down by a PCD
constant in field '.tim2'. It may happen that it is impossible to set
the pixel clock within this range. A clock which is too slow will
tend to flicker. For the highest quality image, set the clock as high
as possible.
MARGINS
-------
These values may be difficult to glean from the panel data sheet. In
the case of the Sharp panels, the upper margin is explicitly called
out as a specific number of lines from the top of the frame. The
other values may not matter as much as the panels tend to
automatically center the image.
Sync Sense
----------
The sense of the hsync and vsync pulses may be called out in the data
sheet. On one panel, the sense of these pulses determine the height
of the visible region on the panel. Most of the Sharp panels use
negative sense sync pulses set by the TIM2_IHS and TIM2_IVS bits in
'.tim2'.
Pel Layout
----------
The Sharp color TFT panels are all configured for 16 bit direct color
modes. The amba-lcd driver sets the pel mode to 565 for 5 bits of
each red and blue and 6 bits of green.

View file

@ -157,13 +157,13 @@ For example, smp_mb__before_atomic_dec() can be used like so:
smp_mb__before_atomic_dec();
atomic_dec(&obj->ref_count);
It makes sure that all memory operations preceeding the atomic_dec()
It makes sure that all memory operations preceding the atomic_dec()
call are strongly ordered with respect to the atomic counter
operation. In the above example, it guarentees that the assignment of
operation. In the above example, it guarantees that the assignment of
"1" to obj->dead will be globally visible to other cpus before the
atomic counter decrement.
Without the explicitl smp_mb__before_atomic_dec() call, the
Without the explicit smp_mb__before_atomic_dec() call, the
implementation could legally allow the atomic counter update visible
to other cpus before the "obj->dead = 1;" assignment.
@ -173,11 +173,11 @@ ordering with respect to memory operations after an atomic_dec() call
(smp_mb__{before,after}_atomic_inc()).
A missing memory barrier in the cases where they are required by the
atomic_t implementation above can have disasterous results. Here is
an example, which follows a pattern occuring frequently in the Linux
atomic_t implementation above can have disastrous results. Here is
an example, which follows a pattern occurring frequently in the Linux
kernel. It is the use of atomic counters to implement reference
counting, and it works such that once the counter falls to zero it can
be guarenteed that no other entity can be accessing the object:
be guaranteed that no other entity can be accessing the object:
static void obj_list_add(struct obj *obj)
{
@ -291,9 +291,9 @@ to the size of an "unsigned long" C data type, and are least of that
size. The endianness of the bits within each "unsigned long" are the
native endianness of the cpu.
void set_bit(unsigned long nr, volatils unsigned long *addr);
void clear_bit(unsigned long nr, volatils unsigned long *addr);
void change_bit(unsigned long nr, volatils unsigned long *addr);
void set_bit(unsigned long nr, volatile unsigned long *addr);
void clear_bit(unsigned long nr, volatile unsigned long *addr);
void change_bit(unsigned long nr, volatile unsigned long *addr);
These routines set, clear, and change, respectively, the bit number
indicated by "nr" on the bit mask pointed to by "ADDR".
@ -301,9 +301,9 @@ indicated by "nr" on the bit mask pointed to by "ADDR".
They must execute atomically, yet there are no implicit memory barrier
semantics required of these interfaces.
int test_and_set_bit(unsigned long nr, volatils unsigned long *addr);
int test_and_clear_bit(unsigned long nr, volatils unsigned long *addr);
int test_and_change_bit(unsigned long nr, volatils unsigned long *addr);
int test_and_set_bit(unsigned long nr, volatile unsigned long *addr);
int test_and_clear_bit(unsigned long nr, volatile unsigned long *addr);
int test_and_change_bit(unsigned long nr, volatile unsigned long *addr);
Like the above, except that these routines return a boolean which
indicates whether the changed bit was set _BEFORE_ the atomic bit
@ -335,7 +335,7 @@ subsequent memory operation is made visible. For example:
/* ... */;
obj->killed = 1;
The implementation of test_and_set_bit() must guarentee that
The implementation of test_and_set_bit() must guarantee that
"obj->dead = 1;" is visible to cpus before the atomic memory operation
done by test_and_set_bit() becomes visible. Likewise, the atomic
memory operation done by test_and_set_bit() must become visible before
@ -474,7 +474,7 @@ Now, as far as memory barriers go, as long as spin_lock()
strictly orders all subsequent memory operations (including
the cas()) with respect to itself, things will be fine.
Said another way, _atomic_dec_and_lock() must guarentee that
Said another way, _atomic_dec_and_lock() must guarantee that
a counter dropping to zero is never made visible before the
spinlock being acquired.

View file

@ -0,0 +1,144 @@
Console Drivers
===============
The linux kernel has 2 general types of console drivers. The first type is
assigned by the kernel to all the virtual consoles during the boot process.
This type will be called 'system driver', and only one system driver is allowed
to exist. The system driver is persistent and it can never be unloaded, though
it may become inactive.
The second type has to be explicitly loaded and unloaded. This will be called
'modular driver' by this document. Multiple modular drivers can coexist at
any time with each driver sharing the console with other drivers including
the system driver. However, modular drivers cannot take over the console
that is currently occupied by another modular driver. (Exception: Drivers that
call take_over_console() will succeed in the takeover regardless of the type
of driver occupying the consoles.) They can only take over the console that is
occupied by the system driver. In the same token, if the modular driver is
released by the console, the system driver will take over.
Modular drivers, from the programmer's point of view, has to call:
take_over_console() - load and bind driver to console layer
give_up_console() - unbind and unload driver
In newer kernels, the following are also available:
register_con_driver()
unregister_con_driver()
If sysfs is enabled, the contents of /sys/class/vtconsole can be
examined. This shows the console backends currently registered by the
system which are named vtcon<n> where <n> is an integer fro 0 to 15. Thus:
ls /sys/class/vtconsole
. .. vtcon0 vtcon1
Each directory in /sys/class/vtconsole has 3 files:
ls /sys/class/vtconsole/vtcon0
. .. bind name uevent
What do these files signify?
1. bind - this is a read/write file. It shows the status of the driver if
read, or acts to bind or unbind the driver to the virtual consoles
when written to. The possible values are:
0 - means the driver is not bound and if echo'ed, commands the driver
to unbind
1 - means the driver is bound and if echo'ed, commands the driver to
bind
2. name - read-only file. Shows the name of the driver in this format:
cat /sys/class/vtconsole/vtcon0/name
(S) VGA+
'(S)' stands for a (S)ystem driver, ie, it cannot be directly
commanded to bind or unbind
'VGA+' is the name of the driver
cat /sys/class/vtconsole/vtcon1/name
(M) frame buffer device
In this case, '(M)' stands for a (M)odular driver, one that can be
directly commanded to bind or unbind.
3. uevent - ignore this file
When unbinding, the modular driver is detached first, and then the system
driver takes over the consoles vacated by the driver. Binding, on the other
hand, will bind the driver to the consoles that are currently occupied by a
system driver.
NOTE1: Binding and binding must be selected in Kconfig. It's under:
Device Drivers -> Character devices -> Support for binding and unbinding
console drivers
NOTE2: If any of the virtual consoles are in KD_GRAPHICS mode, then binding or
unbinding will not succeed. An example of an application that sets the console
to KD_GRAPHICS is X.
How useful is this feature? This is very useful for console driver
developers. By unbinding the driver from the console layer, one can unload the
driver, make changes, recompile, reload and rebind the driver without any need
for rebooting the kernel. For regular users who may want to switch from
framebuffer console to VGA console and vice versa, this feature also makes
this possible. (NOTE NOTE NOTE: Please read fbcon.txt under Documentation/fb
for more details).
Notes for developers:
=====================
take_over_console() is now broken up into:
register_con_driver()
bind_con_driver() - private function
give_up_console() is a wrapper to unregister_con_driver(), and a driver must
be fully unbound for this call to succeed. con_is_bound() will check if the
driver is bound or not.
Guidelines for console driver writers:
=====================================
In order for binding to and unbinding from the console to properly work,
console drivers must follow these guidelines:
1. All drivers, except system drivers, must call either register_con_driver()
or take_over_console(). register_con_driver() will just add the driver to
the console's internal list. It won't take over the
console. take_over_console(), as it name implies, will also take over (or
bind to) the console.
2. All resources allocated during con->con_init() must be released in
con->con_deinit().
3. All resources allocated in con->con_startup() must be released when the
driver, which was previously bound, becomes unbound. The console layer
does not have a complementary call to con->con_startup() so it's up to the
driver to check when it's legal to release these resources. Calling
con_is_bound() in con->con_deinit() will help. If the call returned
false(), then it's safe to release the resources. This balance has to be
ensured because con->con_startup() can be called again when a request to
rebind the driver to the console arrives.
4. Upon exit of the driver, ensure that the driver is totally unbound. If the
condition is satisfied, then the driver must call unregister_con_driver()
or give_up_console().
5. unregister_con_driver() can also be called on conditions which make it
impossible for the driver to service console requests. This can happen
with the framebuffer console that suddenly lost all of its drivers.
The current crop of console drivers should still work correctly, but binding
and unbinding them may cause problems. With minimal fixes, these drivers can
be made to work correctly.
==========================
Antonino Daplas <adaplas@pol.net>

View file

@ -3,7 +3,7 @@
Maintained by Torben Mathiasen <device@lanana.org>
Last revised: 25 January 2005
Last revised: 15 May 2006
This list is the Linux Device List, the official registry of allocated
device numbers and /dev directory nodes for the Linux operating
@ -94,7 +94,6 @@ Your cooperation is appreciated.
9 = /dev/urandom Faster, less secure random number gen.
10 = /dev/aio Asyncronous I/O notification interface
11 = /dev/kmsg Writes to this come out as printk's
12 = /dev/oldmem Access to crash dump from kexec kernel
1 block RAM disk
0 = /dev/ram0 First RAM disk
1 = /dev/ram1 Second RAM disk
@ -262,13 +261,13 @@ Your cooperation is appreciated.
NOTE: These devices permit both read and write access.
7 block Loopback devices
0 = /dev/loop0 First loopback device
1 = /dev/loop1 Second loopback device
0 = /dev/loop0 First loop device
1 = /dev/loop1 Second loop device
...
The loopback devices are used to mount filesystems not
The loop devices are used to mount filesystems not
associated with block devices. The binding to the
loopback devices is handled by mount(8) or losetup(8).
loop devices is handled by mount(8) or losetup(8).
8 block SCSI disk devices (0-15)
0 = /dev/sda First SCSI disk whole disk
@ -943,7 +942,7 @@ Your cooperation is appreciated.
240 = /dev/ftlp FTL on 16th Memory Technology Device
Partitions are handled in the same way as for IDE
disks (see major number 3) expect that the partition
disks (see major number 3) except that the partition
limit is 15 rather than 63 per disk (same as SCSI.)
45 char isdn4linux ISDN BRI driver
@ -1168,7 +1167,7 @@ Your cooperation is appreciated.
The filename of the encrypted container and the passwords
are sent via ioctls (using the sdmount tool) to the master
node which then activates them via one of the
/dev/scramdisk/x nodes for loopback mounting (all handled
/dev/scramdisk/x nodes for loop mounting (all handled
through the sdmount tool).
Requested by: andy@scramdisklinux.org
@ -2538,18 +2537,32 @@ Your cooperation is appreciated.
0 = /dev/usb/lp0 First USB printer
...
15 = /dev/usb/lp15 16th USB printer
16 = /dev/usb/mouse0 First USB mouse
...
31 = /dev/usb/mouse15 16th USB mouse
32 = /dev/usb/ez0 First USB firmware loader
...
47 = /dev/usb/ez15 16th USB firmware loader
48 = /dev/usb/scanner0 First USB scanner
...
63 = /dev/usb/scanner15 16th USB scanner
64 = /dev/usb/rio500 Diamond Rio 500
65 = /dev/usb/usblcd USBLCD Interface (info@usblcd.de)
66 = /dev/usb/cpad0 Synaptics cPad (mouse/LCD)
96 = /dev/usb/hiddev0 1st USB HID device
...
111 = /dev/usb/hiddev15 16th USB HID device
112 = /dev/usb/auer0 1st auerswald ISDN device
...
127 = /dev/usb/auer15 16th auerswald ISDN device
128 = /dev/usb/brlvgr0 First Braille Voyager device
...
131 = /dev/usb/brlvgr3 Fourth Braille Voyager device
132 = /dev/usb/idmouse ID Mouse (fingerprint scanner) device
133 = /dev/usb/sisusbvga1 First SiSUSB VGA device
...
140 = /dev/usb/sisusbvga8 Eigth SISUSB VGA device
144 = /dev/usb/lcd USB LCD device
160 = /dev/usb/legousbtower0 1st USB Legotower device
...
175 = /dev/usb/legousbtower15 16th USB Legotower device
240 = /dev/usb/dabusb0 First daubusb device
...
243 = /dev/usb/dabusb3 Fourth dabusb device
180 block USB block devices
0 = /dev/uba First USB block device
@ -2710,6 +2723,17 @@ Your cooperation is appreciated.
1 = /dev/cpu/1/msr MSRs on CPU 1
...
202 block Xen Virtual Block Device
0 = /dev/xvda First Xen VBD whole disk
16 = /dev/xvdb Second Xen VBD whole disk
32 = /dev/xvdc Third Xen VBD whole disk
...
240 = /dev/xvdp Sixteenth Xen VBD whole disk
Partitions are handled in the same way as for IDE
disks (see major number 3) except that the limit on
partitions is 15.
203 char CPU CPUID information
0 = /dev/cpu/0/cpuid CPUID on CPU 0
1 = /dev/cpu/1/cpuid CPUID on CPU 1
@ -2747,11 +2771,27 @@ Your cooperation is appreciated.
46 = /dev/ttyCPM0 PPC CPM (SCC or SMC) - port 0
...
47 = /dev/ttyCPM5 PPC CPM (SCC or SMC) - port 5
50 = /dev/ttyIOC40 Altix serial card
50 = /dev/ttyIOC0 Altix serial card
...
81 = /dev/ttyIOC431 Altix serial card
82 = /dev/ttyVR0 NEC VR4100 series SIU
83 = /dev/ttyVR1 NEC VR4100 series DSIU
81 = /dev/ttyIOC31 Altix serial card
82 = /dev/ttyVR0 NEC VR4100 series SIU
83 = /dev/ttyVR1 NEC VR4100 series DSIU
84 = /dev/ttyIOC84 Altix ioc4 serial card
...
115 = /dev/ttyIOC115 Altix ioc4 serial card
116 = /dev/ttySIOC0 Altix ioc3 serial card
...
147 = /dev/ttySIOC31 Altix ioc3 serial card
148 = /dev/ttyPSC0 PPC PSC - port 0
...
153 = /dev/ttyPSC5 PPC PSC - port 5
154 = /dev/ttyAT0 ATMEL serial port 0
...
169 = /dev/ttyAT15 ATMEL serial port 15
170 = /dev/ttyNX0 Hilscher netX serial port 0
...
185 = /dev/ttyNX15 Hilscher netX serial port 15
186 = /dev/ttyJ0 JTAG1 DCC protocol based serial port emulation
205 char Low-density serial ports (alternate device)
0 = /dev/culu0 Callout device for ttyLU0
@ -2786,8 +2826,8 @@ Your cooperation is appreciated.
50 = /dev/cuioc40 Callout device for ttyIOC40
...
81 = /dev/cuioc431 Callout device for ttyIOC431
82 = /dev/cuvr0 Callout device for ttyVR0
83 = /dev/cuvr1 Callout device for ttyVR1
82 = /dev/cuvr0 Callout device for ttyVR0
83 = /dev/cuvr1 Callout device for ttyVR1
206 char OnStream SC-x0 tape devices
@ -2897,7 +2937,6 @@ Your cooperation is appreciated.
...
196 = /dev/dvb/adapter3/video0 first video decoder of fourth card
216 char Bluetooth RFCOMM TTY devices
0 = /dev/rfcomm0 First Bluetooth RFCOMM TTY device
1 = /dev/rfcomm1 Second Bluetooth RFCOMM TTY device
@ -3002,12 +3041,43 @@ Your cooperation is appreciated.
ioctl()'s can be used to rewind the tape regardless of
the device used to access it.
231 char InfiniBand MAD
231 char InfiniBand
0 = /dev/infiniband/umad0
1 = /dev/infiniband/umad1
...
...
63 = /dev/infiniband/umad63 63rd InfiniBandMad device
64 = /dev/infiniband/issm0 First InfiniBand IsSM device
65 = /dev/infiniband/issm1 Second InfiniBand IsSM device
...
127 = /dev/infiniband/issm63 63rd InfiniBand IsSM device
128 = /dev/infiniband/uverbs0 First InfiniBand verbs device
129 = /dev/infiniband/uverbs1 Second InfiniBand verbs device
...
159 = /dev/infiniband/uverbs31 31st InfiniBand verbs device
232-239 UNASSIGNED
232 char Biometric Devices
0 = /dev/biometric/sensor0/fingerprint first fingerprint sensor on first device
1 = /dev/biometric/sensor0/iris first iris sensor on first device
2 = /dev/biometric/sensor0/retina first retina sensor on first device
3 = /dev/biometric/sensor0/voiceprint first voiceprint sensor on first device
4 = /dev/biometric/sensor0/facial first facial sensor on first device
5 = /dev/biometric/sensor0/hand first hand sensor on first device
...
10 = /dev/biometric/sensor1/fingerprint first fingerprint sensor on second device
...
20 = /dev/biometric/sensor2/fingerprint first fingerprint sensor on third device
...
233 char PathScale InfiniPath interconnect
0 = /dev/ipath Primary device for programs (any unit)
1 = /dev/ipath0 Access specifically to unit 0
2 = /dev/ipath1 Access specifically to unit 1
...
4 = /dev/ipath3 Access specifically to unit 3
129 = /dev/ipath_sma Device used by Subnet Management Agent
130 = /dev/ipath_diag Device used by diagnostics programs
234-239 UNASSIGNED
240-254 char LOCAL/EXPERIMENTAL USE
240-254 block LOCAL/EXPERIMENTAL USE
@ -3021,6 +3091,28 @@ Your cooperation is appreciated.
This major is reserved to assist the expansion to a
larger number space. No device nodes with this major
should ever be created on the filesystem.
(This is probaly not true anymore, but I'll leave it
for now /Torben)
---LARGE MAJORS!!!!!---
256 char Equinox SST multi-port serial boards
0 = /dev/ttyEQ0 First serial port on first Equinox SST board
127 = /dev/ttyEQ127 Last serial port on first Equinox SST board
128 = /dev/ttyEQ128 First serial port on second Equinox SST board
...
1027 = /dev/ttyEQ1027 Last serial port on eighth Equinox SST board
256 block Resident Flash Disk Flash Translation Layer
0 = /dev/rfda First RFD FTL layer
16 = /dev/rfdb Second RFD FTL layer
...
240 = /dev/rfdp 16th RFD FTL layer
257 char Phoenix Technologies Cryptographic Services Driver
0 = /dev/ptlsec Crypto Services Driver
**** ADDITIONAL /dev DIRECTORY ENTRIES

View file

@ -2,7 +2,7 @@ NOTE: This driver is obsolete. Digi provides a 2.6 driver (dgdm) at
http://www.digi.com for PCI cards. They no longer maintain this driver,
and have no 2.6 driver for ISA cards.
This driver requires a number of user-space tools. They can be aquired from
This driver requires a number of user-space tools. They can be acquired from
http://www.digi.com, but only works with 2.4 kernels.

View file

@ -18,7 +18,7 @@ Traditional driver models implemented some sort of tree-like structure
(sometimes just a list) for the devices they control. There wasn't any
uniformity across the different bus types.
The current driver model provides a comon, uniform data model for describing
The current driver model provides a common, uniform data model for describing
a bus and the devices that can appear under the bus. The unified bus
model includes a set of common attributes which all busses carry, and a set
of common callbacks, such as device discovery during bus probing, bus

View file

@ -135,10 +135,10 @@ C. Boot options
The angle can be changed anytime afterwards by 'echoing' the same
numbers to any one of the 2 attributes found in
/sys/class/graphics/fb{x}
/sys/class/graphics/fbcon
con_rotate - rotate the display of the active console
con_rotate_all - rotate the display of all consoles
rotate - rotate the display of the active console
rotate_all - rotate the display of all consoles
Console rotation will only become available if Console Rotation
Support is compiled in your kernel.
@ -148,5 +148,177 @@ C. Boot options
Actually, the underlying fb driver is totally ignorant of console
rotation.
---
C. Attaching, Detaching and Unloading
Before going on on how to attach, detach and unload the framebuffer console, an
illustration of the dependencies may help.
The console layer, as with most subsystems, needs a driver that interfaces with
the hardware. Thus, in a VGA console:
console ---> VGA driver ---> hardware.
Assuming the VGA driver can be unloaded, one must first unbind the VGA driver
from the console layer before unloading the driver. The VGA driver cannot be
unloaded if it is still bound to the console layer. (See
Documentation/console/console.txt for more information).
This is more complicated in the case of the the framebuffer console (fbcon),
because fbcon is an intermediate layer between the console and the drivers:
console ---> fbcon ---> fbdev drivers ---> hardware
The fbdev drivers cannot be unloaded if it's bound to fbcon, and fbcon cannot
be unloaded if it's bound to the console layer.
So to unload the fbdev drivers, one must first unbind fbcon from the console,
then unbind the fbdev drivers from fbcon. Fortunately, unbinding fbcon from
the console layer will automatically unbind framebuffer drivers from
fbcon. Thus, there is no need to explicitly unbind the fbdev drivers from
fbcon.
So, how do we unbind fbcon from the console? Part of the answer is in
Documentation/console/console.txt. To summarize:
Echo a value to the bind file that represents the framebuffer console
driver. So assuming vtcon1 represents fbcon, then:
echo 1 > sys/class/vtconsole/vtcon1/bind - attach framebuffer console to
console layer
echo 0 > sys/class/vtconsole/vtcon1/bind - detach framebuffer console from
console layer
If fbcon is detached from the console layer, your boot console driver (which is
usually VGA text mode) will take over. A few drivers (rivafb and i810fb) will
restore VGA text mode for you. With the rest, before detaching fbcon, you
must take a few additional steps to make sure that your VGA text mode is
restored properly. The following is one of the several methods that you can do:
1. Download or install vbetool. This utility is included with most
distributions nowadays, and is usually part of the suspend/resume tool.
2. In your kernel configuration, ensure that CONFIG_FRAMEBUFFER_CONSOLE is set
to 'y' or 'm'. Enable one or more of your favorite framebuffer drivers.
3. Boot into text mode and as root run:
vbetool vbestate save > <vga state file>
The above command saves the register contents of your graphics
hardware to <vga state file>. You need to do this step only once as
the state file can be reused.
4. If fbcon is compiled as a module, load fbcon by doing:
modprobe fbcon
5. Now to detach fbcon:
vbetool vbestate restore < <vga state file> && \
echo 0 > /sys/class/vtconsole/vtcon1/bind
6. That's it, you're back to VGA mode. And if you compiled fbcon as a module,
you can unload it by 'rmmod fbcon'
7. To reattach fbcon:
echo 1 > /sys/class/vtconsole/vtcon1/bind
8. Once fbcon is unbound, all drivers registered to the system will also
become unbound. This means that fbcon and individual framebuffer drivers
can be unloaded or reloaded at will. Reloading the drivers or fbcon will
automatically bind the console, fbcon and the drivers together. Unloading
all the drivers without unloading fbcon will make it impossible for the
console to bind fbcon.
Notes for vesafb users:
=======================
Unfortunately, if your bootline includes a vga=xxx parameter that sets the
hardware in graphics mode, such as when loading vesafb, vgacon will not load.
Instead, vgacon will replace the default boot console with dummycon, and you
won't get any display after detaching fbcon. Your machine is still alive, so
you can reattach vesafb. However, to reattach vesafb, you need to do one of
the following:
Variation 1:
a. Before detaching fbcon, do
vbetool vbemode save > <vesa state file> # do once for each vesafb mode,
# the file can be reused
b. Detach fbcon as in step 5.
c. Attach fbcon
vbetool vbestate restore < <vesa state file> && \
echo 1 > /sys/class/vtconsole/vtcon1/bind
Variation 2:
a. Before detaching fbcon, do:
echo <ID> > /sys/class/tty/console/bind
vbetool vbemode get
b. Take note of the mode number
b. Detach fbcon as in step 5.
c. Attach fbcon:
vbetool vbemode set <mode number> && \
echo 1 > /sys/class/vtconsole/vtcon1/bind
Samples:
========
Here are 2 sample bash scripts that you can use to bind or unbind the
framebuffer console driver if you are in an X86 box:
---------------------------------------------------------------------------
#!/bin/bash
# Unbind fbcon
# Change this to where your actual vgastate file is located
# Or Use VGASTATE=$1 to indicate the state file at runtime
VGASTATE=/tmp/vgastate
# path to vbetool
VBETOOL=/usr/local/bin
for (( i = 0; i < 16; i++))
do
if test -x /sys/class/vtconsole/vtcon$i; then
if [ `cat /sys/class/vtconsole/vtcon$i/name | grep -c "frame buffer"` \
= 1 ]; then
if test -x $VBETOOL/vbetool; then
echo Unbinding vtcon$i
$VBETOOL/vbetool vbestate restore < $VGASTATE
echo 0 > /sys/class/vtconsole/vtcon$i/bind
fi
fi
fi
done
---------------------------------------------------------------------------
#!/bin/bash
# Bind fbcon
for (( i = 0; i < 16; i++))
do
if test -x /sys/class/vtconsole/vtcon$i; then
if [ `cat /sys/class/vtconsole/vtcon$i/name | grep -c "frame buffer"` \
= 1 ]; then
echo Unbinding vtcon$i
echo 1 > /sys/class/vtconsole/vtcon$i/bind
fi
fi
done
---------------------------------------------------------------------------
--
Antonino Daplas <adaplas@pol.net>

View file

@ -6,17 +6,6 @@ be removed from this file.
---------------------------
What: devfs
When: July 2005
Files: fs/devfs/*, include/linux/devfs_fs*.h and assorted devfs
function calls throughout the kernel tree
Why: It has been unmaintained for a number of years, has unfixable
races, contains a naming policy within the kernel that is
against the LSB, and can be replaced by using udev.
Who: Greg Kroah-Hartman <greg@kroah.com>
---------------------------
What: RAW driver (CONFIG_RAW_DRIVER)
When: December 2005
Why: declared obsolete since kernel 2.6.3
@ -33,27 +22,12 @@ Who: Adrian Bunk <bunk@stusta.de>
---------------------------
What: RCU API moves to EXPORT_SYMBOL_GPL
When: April 2006
Files: include/linux/rcupdate.h, kernel/rcupdate.c
Why: Outside of Linux, the only implementations of anything even
vaguely resembling RCU that I am aware of are in DYNIX/ptx,
VM/XA, Tornado, and K42. I do not expect anyone to port binary
drivers or kernel modules from any of these, since the first two
are owned by IBM and the last two are open-source research OSes.
So these will move to GPL after a grace period to allow
people, who might be using implementations that I am not aware
of, to adjust to this upcoming change.
Who: Paul E. McKenney <paulmck@us.ibm.com>
---------------------------
What: raw1394: requests of type RAW1394_REQ_ISO_SEND, RAW1394_REQ_ISO_LISTEN
When: November 2005
When: November 2006
Why: Deprecated in favour of the new ioctl-based rawiso interface, which is
more efficient. You should really be using libraw1394 for raw1394
access anyway.
Who: Jody McIntyre <scjody@steamballoon.com>
Who: Jody McIntyre <scjody@modernduck.com>
---------------------------
@ -147,16 +121,6 @@ Who: NeilBrown <neilb@suse.de>
---------------------------
What: au1x00_uart driver
When: January 2006
Why: The 8250 serial driver now has the ability to deal with the differences
between the standard 8250 family of UARTs and their slightly strange
brother on Alchemy SOCs. The loss of features is not considered an
issue.
Who: Ralf Baechle <ralf@linux-mips.org>
---------------------------
What: eepro100 network driver
When: January 2007
Why: replaced by the e100 driver
@ -192,6 +156,16 @@ Who: Jean Delvare <khali@linux-fr.org>
---------------------------
What: Unused EXPORT_SYMBOL/EXPORT_SYMBOL_GPL exports
(temporary transition config option provided until then)
The transition config option will also be removed at the same time.
When: before 2.6.19
Why: Unused symbols are both increasing the size of the kernel binary
and are often a sign of "wrong API"
Who: Arjan van de Ven <arjan@linux.intel.com>
---------------------------
What: remove EXPORT_SYMBOL(tasklist_lock)
When: August 2006
Files: kernel/fork.c
@ -239,3 +213,56 @@ Why: The interface no longer has any callers left in the kernel. It
Who: Nick Piggin <npiggin@suse.de>
---------------------------
What: Support for the MIPS EV96100 evaluation board
When: September 2006
Why: Does no longer build since at least November 15, 2003, apparently
no userbase left.
Who: Ralf Baechle <ralf@linux-mips.org>
---------------------------
What: Support for the Momentum / PMC-Sierra Jaguar ATX evaluation board
When: September 2006
Why: Does no longer build since quite some time, and was never popular,
due to the platform being replaced by successor models. Apparently
no user base left. It also is one of the last users of
WANT_PAGE_VIRTUAL.
Who: Ralf Baechle <ralf@linux-mips.org>
---------------------------
What: Support for the Momentum Ocelot, Ocelot 3, Ocelot C and Ocelot G
When: September 2006
Why: Some do no longer build and apparently there is no user base left
for these platforms.
Who: Ralf Baechle <ralf@linux-mips.org>
---------------------------
What: Support for MIPS Technologies' Altas and SEAD evaluation board
When: September 2006
Why: Some do no longer build and apparently there is no user base left
for these platforms. Hardware out of production since several years.
Who: Ralf Baechle <ralf@linux-mips.org>
---------------------------
What: Support for the IT8172-based platforms, ITE 8172G and Globespan IVR
When: September 2006
Why: Code does no longer build since at least 2.6.0, apparently there is
no user base left for these platforms. Hardware out of production
since several years and hardly a trace of the manufacturer left on
the net.
Who: Ralf Baechle <ralf@linux-mips.org>
---------------------------
What: Interrupt only SA_* flags
When: Januar 2007
Why: The interrupt related SA_* flags are replaced by IRQF_* to move them
out of the signal namespace.
Who: Thomas Gleixner <tglx@linutronix.de>
---------------------------

View file

@ -99,7 +99,7 @@ prototypes:
int (*sync_fs)(struct super_block *sb, int wait);
void (*write_super_lockfs) (struct super_block *);
void (*unlockfs) (struct super_block *);
int (*statfs) (struct super_block *, struct kstatfs *);
int (*statfs) (struct dentry *, struct kstatfs *);
int (*remount_fs) (struct super_block *, int *, char *);
void (*clear_inode) (struct inode *);
void (*umount_begin) (struct super_block *);
@ -142,15 +142,16 @@ see also dquot_operations section.
--------------------------- file_system_type ---------------------------
prototypes:
struct super_block *(*get_sb) (struct file_system_type *, int,
const char *, void *);
struct int (*get_sb) (struct file_system_type *, int,
const char *, void *, struct vfsmount *);
void (*kill_sb) (struct super_block *);
locking rules:
may block BKL
get_sb yes yes
kill_sb yes yes
->get_sb() returns error or a locked superblock (exclusive on ->s_umount).
->get_sb() returns error or 0 with locked superblock attached to the vfsmount
(exclusive on ->s_umount).
->kill_sb() takes a write-locked superblock, does all shutdown work on it,
unlocks and drops the reference.

View file

@ -19,7 +19,7 @@ following procedure:
(2) Have the follow_link() op do the following steps:
(a) Call do_kern_mount() to call the appropriate filesystem to set up a
(a) Call vfs_kern_mount() to call the appropriate filesystem to set up a
superblock and gain a vfsmount structure representing it.
(b) Copy the nameidata provided as an argument and substitute the dentry

View file

@ -264,6 +264,15 @@ static struct config_item_type simple_child_type = {
};
struct simple_children {
struct config_group group;
};
static inline struct simple_children *to_simple_children(struct config_item *item)
{
return item ? container_of(to_config_group(item), struct simple_children, group) : NULL;
}
static struct config_item *simple_children_make_item(struct config_group *group, const char *name)
{
struct simple_child *simple_child;
@ -304,7 +313,13 @@ static ssize_t simple_children_attr_show(struct config_item *item,
"items have only one attribute that is readable and writeable.\n");
}
static void simple_children_release(struct config_item *item)
{
kfree(to_simple_children(item));
}
static struct configfs_item_operations simple_children_item_ops = {
.release = simple_children_release,
.show_attribute = simple_children_attr_show,
};
@ -345,10 +360,6 @@ static struct configfs_subsystem simple_children_subsys = {
* children of its own.
*/
struct simple_children {
struct config_group group;
};
static struct config_group *group_children_make_group(struct config_group *group, const char *name)
{
struct simple_children *simple_children;

File diff suppressed because it is too large Load diff

File diff suppressed because it is too large Load diff

View file

@ -1,40 +0,0 @@
Device File System (devfs) ToDo List
Richard Gooch <rgooch@atnf.csiro.au>
3-JUL-2000
This is a list of things to be done for better devfs support in the
Linux kernel. If you'd like to contribute to the devfs, please have a
look at this list for anything that is unallocated. Also, if there are
items missing (surely), please contact me so I can add them to the
list (preferably with your name attached to them:-).
- >256 ptys
Thanks to C. Scott Ananian <cananian@alumni.princeton.edu>
- Amiga floppy driver (drivers/block/amiflop.c)
- Atari floppy driver (drivers/block/ataflop.c)
- SWIM3 (Super Woz Integrated Machine 3) floppy driver (drivers/block/swim3.c)
- Amiga ZorroII ramdisc driver (drivers/block/z2ram.c)
- Parallel port ATAPI CD-ROM (drivers/block/paride/pcd.c)
- Parallel port ATAPI floppy (drivers/block/paride/pf.c)
- AP1000 block driver (drivers/ap1000/ap.c, drivers/ap1000/ddv.c)
- Archimedes floppy (drivers/acorn/block/fd1772.c)
- MFM hard drive (drivers/acorn/block/mfmhd.c)
- I2O block device (drivers/message/i2o/i2o_block.c)
- ST-RAM device (arch/m68k/atari/stram.c)
- Raw devices

View file

@ -1,65 +0,0 @@
/* -*- auto-fill -*- */
Device File System (devfs) Boot Options
Richard Gooch <rgooch@atnf.csiro.au>
18-AUG-2001
When CONFIG_DEVFS_DEBUG is enabled, you can pass several boot options
to the kernel to debug devfs. The boot options are prefixed by
"devfs=", and are separated by commas. Spaces are not allowed. The
syntax looks like this:
devfs=<option1>,<option2>,<option3>
and so on. For example, if you wanted to turn on debugging for module
load requests and device registration, you would do:
devfs=dmod,dreg
You may prefix "no" to any option. This will invert the option.
Debugging Options
=================
These requires CONFIG_DEVFS_DEBUG to be enabled.
Note that all debugging options have 'd' as the first character. By
default all options are off. All debugging output is sent to the
kernel logs. The debugging options do not take effect until the devfs
version message appears (just prior to the root filesystem being
mounted).
These are the options:
dmod print module load requests to <request_module>
dreg print device register requests to <devfs_register>
dunreg print device unregister requests to <devfs_unregister>
dchange print device change requests to <devfs_set_flags>
dilookup print inode lookup requests
diget print VFS inode allocations
diunlink print inode unlinks
dichange print inode changes
dimknod print calls to mknod(2)
dall some debugging turned on
Other Options
=============
These control the default behaviour of devfs. The options are:
mount mount devfs onto /dev at boot time
only disable non-devfs device nodes for devfs-capable drivers

View file

@ -113,6 +113,14 @@ noquota
grpquota
usrquota
bh (*) ext3 associates buffer heads to data pages to
nobh (a) cache disk block mapping information
(b) link pages into transaction to provide
ordering guarantees.
"bh" option forces use of buffer heads.
"nobh" option tries to avoid associating buffer
heads (supported only for "writeback" mode).
Specification
=============

View file

@ -18,6 +18,14 @@ Non-privileged mount (or user mount):
user. NOTE: this is not the same as mounts allowed with the "user"
option in /etc/fstab, which is not discussed here.
Filesystem connection:
A connection between the filesystem daemon and the kernel. The
connection exists until either the daemon dies, or the filesystem is
umounted. Note that detaching (or lazy umounting) the filesystem
does _not_ break the connection, in this case it will exist until
the last reference to the filesystem is released.
Mount owner:
The user who does the mounting.
@ -86,16 +94,20 @@ Mount options
The default is infinite. Note that the size of read requests is
limited anyway to 32 pages (which is 128kbyte on i386).
Sysfs
~~~~~
Control filesystem
~~~~~~~~~~~~~~~~~~
FUSE sets up the following hierarchy in sysfs:
There's a control filesystem for FUSE, which can be mounted by:
/sys/fs/fuse/connections/N/
mount -t fusectl none /sys/fs/fuse/connections
where N is an increasing number allocated to each new connection.
Mounting it under the '/sys/fs/fuse/connections' directory makes it
backwards compatible with earlier versions.
For each connection the following attributes are defined:
Under the fuse control filesystem each connection has a directory
named by a unique number.
For each connection the following files exist within this directory:
'waiting'
@ -110,7 +122,47 @@ For each connection the following attributes are defined:
connection. This means that all waiting requests will be aborted an
error returned for all aborted and new requests.
Only a privileged user may read or write these attributes.
Only the owner of the mount may read or write these files.
Interrupting filesystem operations
~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~
If a process issuing a FUSE filesystem request is interrupted, the
following will happen:
1) If the request is not yet sent to userspace AND the signal is
fatal (SIGKILL or unhandled fatal signal), then the request is
dequeued and returns immediately.
2) If the request is not yet sent to userspace AND the signal is not
fatal, then an 'interrupted' flag is set for the request. When
the request has been successfully transfered to userspace and
this flag is set, an INTERRUPT request is queued.
3) If the request is already sent to userspace, then an INTERRUPT
request is queued.
INTERRUPT requests take precedence over other requests, so the
userspace filesystem will receive queued INTERRUPTs before any others.
The userspace filesystem may ignore the INTERRUPT requests entirely,
or may honor them by sending a reply to the _original_ request, with
the error set to EINTR.
It is also possible that there's a race between processing the
original request and it's INTERRUPT request. There are two possibilities:
1) The INTERRUPT request is processed before the original request is
processed
2) The INTERRUPT request is processed after the original request has
been answered
If the filesystem cannot find the original request, it should wait for
some timeout and/or a number of new requests to arrive, after which it
should reply to the INTERRUPT request with an EAGAIN error. In case
1) the INTERRUPT request will be requeued. In case 2) the INTERRUPT
reply will be ignored.
Aborting a filesystem connection
~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~
@ -139,8 +191,8 @@ the filesystem. There are several ways to do this:
- Use forced umount (umount -f). Works in all cases but only if
filesystem is still attached (it hasn't been lazy unmounted)
- Abort filesystem through the sysfs interface. Most powerful
method, always works.
- Abort filesystem through the FUSE control filesystem. Most
powerful method, always works.
How do non-privileged mounts work?
~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~
@ -304,25 +356,7 @@ Scenario 1 - Simple deadlock
| | for "file"]
| | *DEADLOCK*
The solution for this is to allow requests to be interrupted while
they are in userspace:
| [interrupted by signal] |
| <fuse_unlink() |
| [release semaphore] | [semaphore acquired]
| <sys_unlink() |
| | >fuse_unlink()
| | [queue req on fc->pending]
| | [wake up fc->waitq]
| | [sleep on req->waitq]
If the filesystem daemon was single threaded, this will stop here,
since there's no other thread to dequeue and execute the request.
In this case the solution is to kill the FUSE daemon as well. If
there are multiple serving threads, you just have to kill them as
long as any remain.
Moral: a filesystem which deadlocks, can soon find itself dead.
The solution for this is to allow the filesystem to be aborted.
Scenario 2 - Tricky deadlock
----------------------------
@ -355,24 +389,14 @@ but is caused by a pagefault.
| | [lock page]
| | * DEADLOCK *
Solution is again to let the the request be interrupted (not
elaborated further).
Solution is basically the same as above.
An additional problem is that while the write buffer is being
copied to the request, the request must not be interrupted. This
is because the destination address of the copy may not be valid
after the request is interrupted.
An additional problem is that while the write buffer is being copied
to the request, the request must not be interrupted/aborted. This is
because the destination address of the copy may not be valid after the
request has returned.
This is solved with doing the copy atomically, and allowing
interruption while the page(s) belonging to the write buffer are
faulted with get_user_pages(). The 'req->locked' flag indicates
when the copy is taking place, and interruption is delayed until
this flag is unset.
Scenario 3 - Tricky deadlock with asynchronous read
---------------------------------------------------
The same situation as above, except thread-1 will wait on page lock
and hence it will be uninterruptible as well. The solution is to
abort the connection with forced umount (if mount is attached) or
through the abort attribute in sysfs.
This is solved with doing the copy atomically, and allowing abort
while the page(s) belonging to the write buffer are faulted with
get_user_pages(). The 'req->locked' flag indicates when the copy is
taking place, and abort is delayed until this flag is unset.

View file

@ -69,17 +69,135 @@ Prototypes:
int inotify_rm_watch (int fd, __u32 mask);
(iii) Internal Kernel Implementation
(iii) Kernel Interface
Each inotify instance is associated with an inotify_device structure.
Inotify's kernel API consists a set of functions for managing watches and an
event callback.
To use the kernel API, you must first initialize an inotify instance with a set
of inotify_operations. You are given an opaque inotify_handle, which you use
for any further calls to inotify.
struct inotify_handle *ih = inotify_init(my_event_handler);
You must provide a function for processing events and a function for destroying
the inotify watch.
void handle_event(struct inotify_watch *watch, u32 wd, u32 mask,
u32 cookie, const char *name, struct inode *inode)
watch - the pointer to the inotify_watch that triggered this call
wd - the watch descriptor
mask - describes the event that occurred
cookie - an identifier for synchronizing events
name - the dentry name for affected files in a directory-based event
inode - the affected inode in a directory-based event
void destroy_watch(struct inotify_watch *watch)
You may add watches by providing a pre-allocated and initialized inotify_watch
structure and specifying the inode to watch along with an inotify event mask.
You must pin the inode during the call. You will likely wish to embed the
inotify_watch structure in a structure of your own which contains other
information about the watch. Once you add an inotify watch, it is immediately
subject to removal depending on filesystem events. You must grab a reference if
you depend on the watch hanging around after the call.
inotify_init_watch(&my_watch->iwatch);
inotify_get_watch(&my_watch->iwatch); // optional
s32 wd = inotify_add_watch(ih, &my_watch->iwatch, inode, mask);
inotify_put_watch(&my_watch->iwatch); // optional
You may use the watch descriptor (wd) or the address of the inotify_watch for
other inotify operations. You must not directly read or manipulate data in the
inotify_watch. Additionally, you must not call inotify_add_watch() more than
once for a given inotify_watch structure, unless you have first called either
inotify_rm_watch() or inotify_rm_wd().
To determine if you have already registered a watch for a given inode, you may
call inotify_find_watch(), which gives you both the wd and the watch pointer for
the inotify_watch, or an error if the watch does not exist.
wd = inotify_find_watch(ih, inode, &watchp);
You may use container_of() on the watch pointer to access your own data
associated with a given watch. When an existing watch is found,
inotify_find_watch() bumps the refcount before releasing its locks. You must
put that reference with:
put_inotify_watch(watchp);
Call inotify_find_update_watch() to update the event mask for an existing watch.
inotify_find_update_watch() returns the wd of the updated watch, or an error if
the watch does not exist.
wd = inotify_find_update_watch(ih, inode, mask);
An existing watch may be removed by calling either inotify_rm_watch() or
inotify_rm_wd().
int ret = inotify_rm_watch(ih, &my_watch->iwatch);
int ret = inotify_rm_wd(ih, wd);
A watch may be removed while executing your event handler with the following:
inotify_remove_watch_locked(ih, iwatch);
Call inotify_destroy() to remove all watches from your inotify instance and
release it. If there are no outstanding references, inotify_destroy() will call
your destroy_watch op for each watch.
inotify_destroy(ih);
When inotify removes a watch, it sends an IN_IGNORED event to your callback.
You may use this event as an indication to free the watch memory. Note that
inotify may remove a watch due to filesystem events, as well as by your request.
If you use IN_ONESHOT, inotify will remove the watch after the first event, at
which point you may call the final inotify_put_watch.
(iv) Kernel Interface Prototypes
struct inotify_handle *inotify_init(struct inotify_operations *ops);
inotify_init_watch(struct inotify_watch *watch);
s32 inotify_add_watch(struct inotify_handle *ih,
struct inotify_watch *watch,
struct inode *inode, u32 mask);
s32 inotify_find_watch(struct inotify_handle *ih, struct inode *inode,
struct inotify_watch **watchp);
s32 inotify_find_update_watch(struct inotify_handle *ih,
struct inode *inode, u32 mask);
int inotify_rm_wd(struct inotify_handle *ih, u32 wd);
int inotify_rm_watch(struct inotify_handle *ih,
struct inotify_watch *watch);
void inotify_remove_watch_locked(struct inotify_handle *ih,
struct inotify_watch *watch);
void inotify_destroy(struct inotify_handle *ih);
void get_inotify_watch(struct inotify_watch *watch);
void put_inotify_watch(struct inotify_watch *watch);
(v) Internal Kernel Implementation
Each inotify instance is represented by an inotify_handle structure.
Inotify's userspace consumers also have an inotify_device which is
associated with the inotify_handle, and on which events are queued.
Each watch is associated with an inotify_watch structure. Watches are chained
off of each associated device and each associated inode.
off of each associated inotify_handle and each associated inode.
See fs/inotify.c for the locking and lifetime rules.
See fs/inotify.c and fs/inotify_user.c for the locking and lifetime rules.
(iv) Rationale
(vi) Rationale
Q: What is the design decision behind not tying the watch to the open fd of
the watched object?
@ -145,7 +263,7 @@ A: The poor user-space interface is the second biggest problem with dnotify.
file descriptor-based one that allows basic file I/O and poll/select.
Obtaining the fd and managing the watches could have been done either via a
device file or a family of new system calls. We decided to implement a
family of system calls because that is the preffered approach for new kernel
family of system calls because that is the preferred approach for new kernel
interfaces. The only real difference was whether we wanted to use open(2)
and ioctl(2) or a couple of new system calls. System calls beat ioctls.

View file

@ -50,10 +50,11 @@ Turn your foo_read_super() into a function that would return 0 in case of
success and negative number in case of error (-EINVAL unless you have more
informative error value to report). Call it foo_fill_super(). Now declare
struct super_block foo_get_sb(struct file_system_type *fs_type,
int flags, const char *dev_name, void *data)
int foo_get_sb(struct file_system_type *fs_type,
int flags, const char *dev_name, void *data, struct vfsmount *mnt)
{
return get_sb_bdev(fs_type, flags, dev_name, data, ext2_fill_super);
return get_sb_bdev(fs_type, flags, dev_name, data, foo_fill_super,
mnt);
}
(or similar with s/bdev/nodev/ or s/bdev/single/, depending on the kind of

View file

@ -70,11 +70,13 @@ tmpfs mounts. See Documentation/filesystems/tmpfs.txt for more information.
What is rootfs?
---------------
Rootfs is a special instance of ramfs, which is always present in 2.6 systems.
(It's used internally as the starting and stopping point for searches of the
kernel's doubly-linked list of mount points.)
Rootfs is a special instance of ramfs (or tmpfs, if that's enabled), which is
always present in 2.6 systems. You can't unmount rootfs for approximately the
same reason you can't kill the init process; rather than having special code
to check for and handle an empty list, it's smaller and simpler for the kernel
to just make sure certain lists can't become empty.
Most systems just mount another filesystem over it and ignore it. The
Most systems just mount another filesystem over rootfs and ignore it. The
amount of space an empty instance of ramfs takes up is tiny.
What is initramfs?
@ -92,14 +94,16 @@ out of that.
All this differs from the old initrd in several ways:
- The old initrd was a separate file, while the initramfs archive is linked
into the linux kernel image. (The directory linux-*/usr is devoted to
generating this archive during the build.)
- The old initrd was always a separate file, while the initramfs archive is
linked into the linux kernel image. (The directory linux-*/usr is devoted
to generating this archive during the build.)
- The old initrd file was a gzipped filesystem image (in some file format,
such as ext2, that had to be built into the kernel), while the new
such as ext2, that needed a driver built into the kernel), while the new
initramfs archive is a gzipped cpio archive (like tar only simpler,
see cpio(1) and Documentation/early-userspace/buffer-format.txt).
see cpio(1) and Documentation/early-userspace/buffer-format.txt). The
kernel's cpio extraction code is not only extremely small, it's also
__init data that can be discarded during the boot process.
- The program run by the old initrd (which was called /initrd, not /init) did
some setup and then returned to the kernel, while the init program from
@ -124,13 +128,14 @@ Populating initramfs:
The 2.6 kernel build process always creates a gzipped cpio format initramfs
archive and links it into the resulting kernel binary. By default, this
archive is empty (consuming 134 bytes on x86). The config option
CONFIG_INITRAMFS_SOURCE (for some reason buried under devices->block devices
in menuconfig, and living in usr/Kconfig) can be used to specify a source for
the initramfs archive, which will automatically be incorporated into the
resulting binary. This option can point to an existing gzipped cpio archive, a
directory containing files to be archived, or a text file specification such
as the following example:
archive is empty (consuming 134 bytes on x86).
The config option CONFIG_INITRAMFS_SOURCE (for some reason buried under
devices->block devices in menuconfig, and living in usr/Kconfig) can be used
to specify a source for the initramfs archive, which will automatically be
incorporated into the resulting binary. This option can point to an existing
gzipped cpio archive, a directory containing files to be archived, or a text
file specification such as the following example:
dir /dev 755 0 0
nod /dev/console 644 0 0 c 5 1
@ -146,23 +151,84 @@ as the following example:
Run "usr/gen_init_cpio" (after the kernel build) to get a usage message
documenting the above file format.
One advantage of the text file is that root access is not required to
One advantage of the configuration file is that root access is not required to
set permissions or create device nodes in the new archive. (Note that those
two example "file" entries expect to find files named "init.sh" and "busybox" in
a directory called "initramfs", under the linux-2.6.* directory. See
Documentation/early-userspace/README for more details.)
The kernel does not depend on external cpio tools, gen_init_cpio is created
from usr/gen_init_cpio.c which is entirely self-contained, and the kernel's
boot-time extractor is also (obviously) self-contained. However, if you _do_
happen to have cpio installed, the following command line can extract the
generated cpio image back into its component files:
The kernel does not depend on external cpio tools. If you specify a
directory instead of a configuration file, the kernel's build infrastructure
creates a configuration file from that directory (usr/Makefile calls
scripts/gen_initramfs_list.sh), and proceeds to package up that directory
using the config file (by feeding it to usr/gen_init_cpio, which is created
from usr/gen_init_cpio.c). The kernel's build-time cpio creation code is
entirely self-contained, and the kernel's boot-time extractor is also
(obviously) self-contained.
The one thing you might need external cpio utilities installed for is creating
or extracting your own preprepared cpio files to feed to the kernel build
(instead of a config file or directory).
The following command line can extract a cpio image (either by the above script
or by the kernel build) back into its component files:
cpio -i -d -H newc -F initramfs_data.cpio --no-absolute-filenames
The following shell script can create a prebuilt cpio archive you can
use in place of the above config file:
#!/bin/sh
# Copyright 2006 Rob Landley <rob@landley.net> and TimeSys Corporation.
# Licensed under GPL version 2
if [ $# -ne 2 ]
then
echo "usage: mkinitramfs directory imagename.cpio.gz"
exit 1
fi
if [ -d "$1" ]
then
echo "creating $2 from $1"
(cd "$1"; find . | cpio -o -H newc | gzip) > "$2"
else
echo "First argument must be a directory"
exit 1
fi
Note: The cpio man page contains some bad advice that will break your initramfs
archive if you follow it. It says "A typical way to generate the list
of filenames is with the find command; you should give find the -depth option
to minimize problems with permissions on directories that are unwritable or not
searchable." Don't do this when creating initramfs.cpio.gz images, it won't
work. The Linux kernel cpio extractor won't create files in a directory that
doesn't exist, so the directory entries must go before the files that go in
those directories. The above script gets them in the right order.
External initramfs images:
--------------------------
If the kernel has initrd support enabled, an external cpio.gz archive can also
be passed into a 2.6 kernel in place of an initrd. In this case, the kernel
will autodetect the type (initramfs, not initrd) and extract the external cpio
archive into rootfs before trying to run /init.
This has the memory efficiency advantages of initramfs (no ramdisk block
device) but the separate packaging of initrd (which is nice if you have
non-GPL code you'd like to run from initramfs, without conflating it with
the GPL licensed Linux kernel binary).
It can also be used to supplement the kernel's built-in initamfs image. The
files in the external archive will overwrite any conflicting files in
the built-in initramfs archive. Some distributors also prefer to customize
a single kernel image with task-specific initramfs images, without recompiling.
Contents of initramfs:
----------------------
An initramfs archive is a complete self-contained root filesystem for Linux.
If you don't already understand what shared libraries, devices, and paths
you need to get a minimal root filesystem up and running, here are some
references:
@ -176,13 +242,36 @@ code against, along with some related utilities. It is BSD licensed.
I use uClibc (http://www.uclibc.org) and busybox (http://www.busybox.net)
myself. These are LGPL and GPL, respectively. (A self-contained initramfs
package is planned for the busybox 1.2 release.)
package is planned for the busybox 1.3 release.)
In theory you could use glibc, but that's not well suited for small embedded
uses like this. (A "hello world" program statically linked against glibc is
over 400k. With uClibc it's 7k. Also note that glibc dlopens libnss to do
name lookups, even when otherwise statically linked.)
A good first step is to get initramfs to run a statically linked "hello world"
program as init, and test it under an emulator like qemu (www.qemu.org) or
User Mode Linux, like so:
cat > hello.c << EOF
#include <stdio.h>
#include <unistd.h>
int main(int argc, char *argv[])
{
printf("Hello world!\n");
sleep(999999999);
}
EOF
gcc -static hello2.c -o init
echo init | cpio -o -H newc | gzip > test.cpio.gz
# Testing external initramfs using the initrd loading mechanism.
qemu -kernel /boot/vmlinuz -initrd test.cpio.gz /dev/zero
When debugging a normal root filesystem, it's nice to be able to boot with
"init=/bin/sh". The initramfs equivalent is "rdinit=/bin/sh", and it's
just as useful.
Why cpio rather than tar?
-------------------------
@ -241,7 +330,7 @@ the above threads) is:
Future directions:
------------------
Today (2.6.14), initramfs is always compiled in, but not always used. The
Today (2.6.16), initramfs is always compiled in, but not always used. The
kernel falls back to legacy boot code that is reached only if initramfs does
not contain an /init program. The fallback is legacy code, there to ensure a
smooth transition and allowing early boot functionality to gradually move to
@ -258,8 +347,9 @@ and so on.
This kind of complexity (which inevitably includes policy) is rightly handled
in userspace. Both klibc and busybox/uClibc are working on simple initramfs
packages to drop into a kernel build, and when standard solutions are ready
and widely deployed, the kernel's legacy early boot code will become obsolete
and a candidate for the feature removal schedule.
packages to drop into a kernel build.
But that's a while off yet.
The klibc package has now been accepted into Andrew Morton's 2.6.17-mm tree.
The kernel's current early boot code (partition detection, etc) will probably
be migrated into a default initramfs, automatically created and used by the
kernel build.

View file

@ -113,8 +113,8 @@ members are defined:
struct file_system_type {
const char *name;
int fs_flags;
struct super_block *(*get_sb) (struct file_system_type *, int,
const char *, void *);
struct int (*get_sb) (struct file_system_type *, int,
const char *, void *, struct vfsmount *);
void (*kill_sb) (struct super_block *);
struct module *owner;
struct file_system_type * next;
@ -211,7 +211,7 @@ struct super_operations {
int (*sync_fs)(struct super_block *sb, int wait);
void (*write_super_lockfs) (struct super_block *);
void (*unlockfs) (struct super_block *);
int (*statfs) (struct super_block *, struct kstatfs *);
int (*statfs) (struct dentry *, struct kstatfs *);
int (*remount_fs) (struct super_block *, int *, char *);
void (*clear_inode) (struct inode *);
void (*umount_begin) (struct super_block *);

View file

@ -0,0 +1,59 @@
Kernel driver abituguru
=======================
Supported chips:
* Abit uGuru (Hardware Monitor part only)
Prefix: 'abituguru'
Addresses scanned: ISA 0x0E0
Datasheet: Not available, this driver is based on reverse engineering.
A "Datasheet" has been written based on the reverse engineering it
should be available in the same dir as this file under the name
abituguru-datasheet.
Authors:
Hans de Goede <j.w.r.degoede@hhs.nl>,
(Initial reverse engineering done by Olle Sandberg
<ollebull@gmail.com>)
Module Parameters
-----------------
* force: bool Force detection. Note this parameter only causes the
detection to be skipped, if the uGuru can't be read
the module initialization (insmod) will still fail.
* fan_sensors: int Tell the driver how many fan speed sensors there are
on your motherboard. Default: 0 (autodetect).
* pwms: int Tell the driver how many fan speed controls (fan
pwms) your motherboard has. Default: 0 (autodetect).
* verbose: int How verbose should the driver be? (0-3):
0 normal output
1 + verbose error reporting
2 + sensors type probing info\n"
3 + retryable error reporting
Default: 2 (the driver is still in the testing phase)
Notice if you need any of the first three options above please insmod the
driver with verbose set to 3 and mail me <j.w.r.degoede@hhs.nl> the output of:
dmesg | grep abituguru
Description
-----------
This driver supports the hardware monitoring features of the Abit uGuru chip
found on Abit uGuru featuring motherboards (most modern Abit motherboards).
The uGuru chip in reality is a Winbond W83L950D in disguise (despite Abit
claiming it is "a new microprocessor designed by the ABIT Engineers").
Unfortunatly this doesn't help since the W83L950D is a generic
microcontroller with a custom Abit application running on it.
Despite Abit not releasing any information regarding the uGuru, Olle
Sandberg <ollebull@gmail.com> has managed to reverse engineer the sensor part
of the uGuru. Without his work this driver would not have been possible.
Known Issues
------------
The voltage and frequency control parts of the Abit uGuru are not supported.

View file

@ -0,0 +1,312 @@
uGuru datasheet
===============
First of all, what I know about uGuru is no fact based on any help, hints or
datasheet from Abit. The data I have got on uGuru have I assembled through
my weak knowledge in "backwards engineering".
And just for the record, you may have noticed uGuru isn't a chip developed by
Abit, as they claim it to be. It's realy just an microprocessor (uC) created by
Winbond (W83L950D). And no, reading the manual for this specific uC or
mailing Windbond for help won't give any usefull data about uGuru, as it is
the program inside the uC that is responding to calls.
Olle Sandberg <ollebull@gmail.com>, 2005-05-25
Original version by Olle Sandberg who did the heavy lifting of the initial
reverse engineering. This version has been almost fully rewritten for clarity
and extended with write support and info on more databanks, the write support
is once again reverse engineered by Olle the additional databanks have been
reverse engineered by me. I would like to express my thanks to Olle, this
document and the Linux driver could not have been written without his efforts.
Note: because of the lack of specs only the sensors part of the uGuru is
described here and not the CPU / RAM / etc voltage & frequency control.
Hans de Goede <j.w.r.degoede@hhs.nl>, 28-01-2006
Detection
=========
As far as known the uGuru is always placed at and using the (ISA) I/O-ports
0xE0 and 0xE4, so we don't have to scan any port-range, just check what the two
ports are holding for detection. We will refer to 0xE0 as CMD (command-port)
and 0xE4 as DATA because Abit refers to them with these names.
If DATA holds 0x00 or 0x08 and CMD holds 0x00 or 0xAC an uGuru could be
present. We have to check for two different values at data-port, because
after a reboot uGuru will hold 0x00 here, but if the driver is removed and
later on attached again data-port will hold 0x08, more about this later.
After wider testing of the Linux kernel driver some variants of the uGuru have
turned up which will hold 0x00 instead of 0xAC at the CMD port, thus we also
have to test CMD for two different values. On these uGuru's DATA will initally
hold 0x09 and will only hold 0x08 after reading CMD first, so CMD must be read
first!
To be really sure an uGuru is present a test read of one or more register
sets should be done.
Reading / Writing
=================
Addressing
----------
The uGuru has a number of different addressing levels. The first addressing
level we will call banks. A bank holds data for one or more sensors. The data
in a bank for a sensor is one or more bytes large.
The number of bytes is fixed for a given bank, you should always read or write
that many bytes, reading / writing more will fail, the results when writing
less then the number of bytes for a given bank are undetermined.
See below for all known bank addresses, numbers of sensors in that bank,
number of bytes data per sensor and contents/meaning of those bytes.
Although both this document and the kernel driver have kept the sensor
terminoligy for the addressing within a bank this is not 100% correct, in
bank 0x24 for example the addressing within the bank selects a PWM output not
a sensor.
Notice that some banks have both a read and a write address this is how the
uGuru determines if a read from or a write to the bank is taking place, thus
when reading you should always use the read address and when writing the
write address. The write address is always one (1) more then the read address.
uGuru ready
-----------
Before you can read from or write to the uGuru you must first put the uGuru
in "ready" mode.
To put the uGuru in ready mode first write 0x00 to DATA and then wait for DATA
to hold 0x09, DATA should read 0x09 within 250 read cycles.
Next CMD _must_ be read and should hold 0xAC, usually CMD will hold 0xAC the
first read but sometimes it takes a while before CMD holds 0xAC and thus it
has to be read a number of times (max 50).
After reading CMD, DATA should hold 0x08 which means that the uGuru is ready
for input. As above DATA will usually hold 0x08 the first read but not always.
This step can be skipped, but it is undetermined what happens if the uGuru has
not yet reported 0x08 at DATA and you proceed with writing a bank address.
Sending bank and sensor addresses to the uGuru
----------------------------------------------
First the uGuru must be in "ready" mode as described above, DATA should hold
0x08 indicating that the uGuru wants input, in this case the bank address.
Next write the bank address to DATA. After the bank address has been written
wait for to DATA to hold 0x08 again indicating that it wants / is ready for
more input (max 250 reads).
Once DATA holds 0x08 again write the sensor address to CMD.
Reading
-------
First send the bank and sensor addresses as described above.
Then for each byte of data you want to read wait for DATA to hold 0x01
which indicates that the uGuru is ready to be read (max 250 reads) and once
DATA holds 0x01 read the byte from CMD.
Once all bytes have been read data will hold 0x09, but there is no reason to
test for this. Notice that the number of bytes is bank address dependent see
above and below.
After completing a successfull read it is advised to put the uGuru back in
ready mode, so that it is ready for the next read / write cycle. This way
if your program / driver is unloaded and later loaded again the detection
algorithm described above will still work.
Writing
-------
First send the bank and sensor addresses as described above.
Then for each byte of data you want to write wait for DATA to hold 0x00
which indicates that the uGuru is ready to be written (max 250 reads) and
once DATA holds 0x00 write the byte to CMD.
Once all bytes have been written wait for DATA to hold 0x01 (max 250 reads)
don't ask why this is the way it is.
Once DATA holds 0x01 read CMD it should hold 0xAC now.
After completing a successfull write it is advised to put the uGuru back in
ready mode, so that it is ready for the next read / write cycle. This way
if your program / driver is unloaded and later loaded again the detection
algorithm described above will still work.
Gotchas
-------
After wider testing of the Linux kernel driver some variants of the uGuru have
turned up which do not hold 0x08 at DATA within 250 reads after writing the
bank address. With these versions this happens quite frequent, using larger
timeouts doesn't help, they just go offline for a second or 2, doing some
internal callibration or whatever. Your code should be prepared to handle
this and in case of no response in this specific case just goto sleep for a
while and then retry.
Address Map
===========
Bank 0x20 Alarms (R)
--------------------
This bank contains 0 sensors, iow the sensor address is ignored (but must be
written) just use 0. Bank 0x20 contains 3 bytes:
Byte 0:
This byte holds the alarm flags for sensor 0-7 of Sensor Bank1, with bit 0
corresponding to sensor 0, 1 to 1, etc.
Byte 1:
This byte holds the alarm flags for sensor 8-15 of Sensor Bank1, with bit 0
corresponding to sensor 8, 1 to 9, etc.
Byte 2:
This byte holds the alarm flags for sensor 0-5 of Sensor Bank2, with bit 0
corresponding to sensor 0, 1 to 1, etc.
Bank 0x21 Sensor Bank1 Values / Readings (R)
--------------------------------------------
This bank contains 16 sensors, for each sensor it contains 1 byte.
So far the following sensors are known to be available on all motherboards:
Sensor 0 CPU temp
Sensor 1 SYS temp
Sensor 3 CPU core volt
Sensor 4 DDR volt
Sensor 10 DDR Vtt volt
Sensor 15 PWM temp
Byte 0:
This byte holds the reading from the sensor. Sensors in Bank1 can be both
volt and temp sensors, this is motherboard specific. The uGuru however does
seem to know (be programmed with) what kindoff sensor is attached see Sensor
Bank1 Settings description.
Volt sensors use a linear scale, a reading 0 corresponds with 0 volt and a
reading of 255 with 3494 mV. The sensors for higher voltages however are
connected through a division circuit. The currently known division circuits
in use result in ranges of: 0-4361mV, 0-6248mV or 0-14510mV. 3.3 volt sources
use the 0-4361mV range, 5 volt the 0-6248mV and 12 volt the 0-14510mV .
Temp sensors also use a linear scale, a reading of 0 corresponds with 0 degree
Celsius and a reading of 255 with a reading of 255 degrees Celsius.
Bank 0x22 Sensor Bank1 Settings (R)
Bank 0x23 Sensor Bank1 Settings (W)
-----------------------------------
This bank contains 16 sensors, for each sensor it contains 3 bytes. Each
set of 3 bytes contains the settings for the sensor with the same sensor
address in Bank 0x21 .
Byte 0:
Alarm behaviour for the selected sensor. A 1 enables the described behaviour.
Bit 0: Give an alarm if measured temp is over the warning threshold (RW) *
Bit 1: Give an alarm if measured volt is over the max threshold (RW) **
Bit 2: Give an alarm if measured volt is under the min threshold (RW) **
Bit 3: Beep if alarm (RW)
Bit 4: 1 if alarm cause measured temp is over the warning threshold (R)
Bit 5: 1 if alarm cause measured volt is over the max threshold (R)
Bit 6: 1 if alarm cause measured volt is under the min threshold (R)
Bit 7: Volt sensor: Shutdown if alarm persist for more then 4 seconds (RW)
Temp sensor: Shutdown if temp is over the shutdown threshold (RW)
* This bit is only honored/used by the uGuru if a temp sensor is connected
** This bit is only honored/used by the uGuru if a volt sensor is connected
Note with some trickery this can be used to find out what kinda sensor is
detected see the Linux kernel driver for an example with many comments on
how todo this.
Byte 1:
Temp sensor: warning threshold (scale as bank 0x21)
Volt sensor: min threshold (scale as bank 0x21)
Byte 2:
Temp sensor: shutdown threshold (scale as bank 0x21)
Volt sensor: max threshold (scale as bank 0x21)
Bank 0x24 PWM outputs for FAN's (R)
Bank 0x25 PWM outputs for FAN's (W)
-----------------------------------
This bank contains 3 "sensors", for each sensor it contains 5 bytes.
Sensor 0 usually controls the CPU fan
Sensor 1 usually controls the NB (or chipset for single chip) fan
Sensor 2 usually controls the System fan
Byte 0:
Flag 0x80 to enable control, Fan runs at 100% when disabled.
low nibble (temp)sensor address at bank 0x21 used for control.
Byte 1:
0-255 = 0-12v (linear), specify voltage at which fan will rotate when under
low threshold temp (specified in byte 3)
Byte 2:
0-255 = 0-12v (linear), specify voltage at which fan will rotate when above
high threshold temp (specified in byte 4)
Byte 3:
Low threshold temp (scale as bank 0x21)
byte 4:
High threshold temp (scale as bank 0x21)
Bank 0x26 Sensors Bank2 Values / Readings (R)
---------------------------------------------
This bank contains 6 sensors (AFAIK), for each sensor it contains 1 byte.
So far the following sensors are known to be available on all motherboards:
Sensor 0: CPU fan speed
Sensor 1: NB (or chipset for single chip) fan speed
Sensor 2: SYS fan speed
Byte 0:
This byte holds the reading from the sensor. 0-255 = 0-15300 (linear)
Bank 0x27 Sensors Bank2 Settings (R)
Bank 0x28 Sensors Bank2 Settings (W)
------------------------------------
This bank contains 6 sensors (AFAIK), for each sensor it contains 2 bytes.
Byte 0:
Alarm behaviour for the selected sensor. A 1 enables the described behaviour.
Bit 0: Give an alarm if measured rpm is under the min threshold (RW)
Bit 3: Beep if alarm (RW)
Bit 7: Shutdown if alarm persist for more then 4 seconds (RW)
Byte 1:
min threshold (scale as bank 0x26)
Warning for the adventerous
===========================
A word of caution to those who want to experiment and see if they can figure
the voltage / clock programming out, I tried reading and only reading banks
0-0x30 with the reading code used for the sensor banks (0x20-0x28) and this
resulted in a _permanent_ reprogramming of the voltages, luckily I had the
sensors part configured so that it would shutdown my system on any out of spec
voltages which proprably safed my computer (after a reboot I managed to
immediatly enter the bios and reload the defaults). This probably means that
the read/write cycle for the non sensor part is different from the sensor part.

31
Documentation/hwmon/lm70 Normal file
View file

@ -0,0 +1,31 @@
Kernel driver lm70
==================
Supported chip:
* National Semiconductor LM70
Datasheet: http://www.national.com/pf/LM/LM70.html
Author:
Kaiwan N Billimoria <kaiwan@designergraphix.com>
Description
-----------
This driver implements support for the National Semiconductor LM70
temperature sensor.
The LM70 temperature sensor chip supports a single temperature sensor.
It communicates with a host processor (or microcontroller) via an
SPI/Microwire Bus interface.
Communication with the LM70 is simple: when the temperature is to be sensed,
the driver accesses the LM70 using SPI communication: 16 SCLK cycles
comprise the MOSI/MISO loop. At the end of the transfer, the 11-bit 2's
complement digital temperature (sent via the SIO line), is available in the
driver for interpretation. This driver makes use of the kernel's in-core
SPI support.
Thanks to
---------
Jean Delvare <khali@linux-fr.org> for mentoring the hwmon-side driver
development.

View file

@ -7,6 +7,10 @@ Supported chips:
Addresses scanned: I2C 0x18 - 0x1a, 0x29 - 0x2b, 0x4c - 0x4e
Datasheet: Publicly available at the National Semiconductor website
http://www.national.com/pf/LM/LM83.html
* National Semiconductor LM82
Addresses scanned: I2C 0x18 - 0x1a, 0x29 - 0x2b, 0x4c - 0x4e
Datasheet: Publicly available at the National Semiconductor website
http://www.national.com/pf/LM/LM82.html
Author: Jean Delvare <khali@linux-fr.org>
@ -15,10 +19,11 @@ Description
-----------
The LM83 is a digital temperature sensor. It senses its own temperature as
well as the temperature of up to three external diodes. It is compatible
with many other devices such as the LM84 and all other ADM1021 clones.
The main difference between the LM83 and the LM84 in that the later can
only sense the temperature of one external diode.
well as the temperature of up to three external diodes. The LM82 is
a stripped down version of the LM83 that only supports one external diode.
Both are compatible with many other devices such as the LM84 and all
other ADM1021 clones. The main difference between the LM83 and the LM84
in that the later can only sense the temperature of one external diode.
Using the adm1021 driver for a LM83 should work, but only two temperatures
will be reported instead of four.
@ -30,12 +35,16 @@ contact us. Note that the LM90 can easily be misdetected as a LM83.
Confirmed motherboards:
SBS P014
SBS PSL09
Unconfirmed motherboards:
Gigabyte GA-8IK1100
Iwill MPX2
Soltek SL-75DRV5
The LM82 is confirmed to have been found on most AMD Geode reference
designs and test platforms.
The driver has been successfully tested by Magnus Forsström, who I'd
like to thank here. More testers will be of course welcome.

View file

@ -0,0 +1,102 @@
Kernel driver smsc47m192
========================
Supported chips:
* SMSC LPC47M192 and LPC47M997
Prefix: 'smsc47m192'
Addresses scanned: I2C 0x2c - 0x2d
Datasheet: The datasheet for LPC47M192 is publicly available from
http://www.smsc.com/
The LPC47M997 is compatible for hardware monitoring.
Author: Hartmut Rick <linux@rick.claranet.de>
Special thanks to Jean Delvare for careful checking
of the code and many helpful comments and suggestions.
Description
-----------
This driver implements support for the hardware sensor capabilities
of the SMSC LPC47M192 and LPC47M997 Super-I/O chips.
These chips support 3 temperature channels and 8 voltage inputs
as well as CPU voltage VID input.
They do also have fan monitoring and control capabilities, but the
these features are accessed via ISA bus and are not supported by this
driver. Use the 'smsc47m1' driver for fan monitoring and control.
Voltages and temperatures are measured by an 8-bit ADC, the resolution
of the temperatures is 1 bit per degree C.
Voltages are scaled such that the nominal voltage corresponds to
192 counts, i.e. 3/4 of the full range. Thus the available range for
each voltage channel is 0V ... 255/192*(nominal voltage), the resolution
is 1 bit per (nominal voltage)/192.
Both voltage and temperature values are scaled by 1000, the sys files
show voltages in mV and temperatures in units of 0.001 degC.
The +12V analog voltage input channel (in4_input) is multiplexed with
bit 4 of the encoded CPU voltage. This means that you either get
a +12V voltage measurement or a 5 bit CPU VID, but not both.
The default setting is to use the pin as 12V input, and use only 4 bit VID.
This driver assumes that the information in the configuration register
is correct, i.e. that the BIOS has updated the configuration if
the motherboard has this input wired to VID4.
The temperature and voltage readings are updated once every 1.5 seconds.
Reading them more often repeats the same values.
sysfs interface
---------------
in0_input - +2.5V voltage input
in1_input - CPU voltage input (nominal 2.25V)
in2_input - +3.3V voltage input
in3_input - +5V voltage input
in4_input - +12V voltage input (may be missing if used as VID4)
in5_input - Vcc voltage input (nominal 3.3V)
This is the supply voltage of the sensor chip itself.
in6_input - +1.5V voltage input
in7_input - +1.8V voltage input
in[0-7]_min,
in[0-7]_max - lower and upper alarm thresholds for in[0-7]_input reading
All voltages are read and written in mV.
in[0-7]_alarm - alarm flags for voltage inputs
These files read '1' in case of alarm, '0' otherwise.
temp1_input - chip temperature measured by on-chip diode
temp[2-3]_input - temperature measured by external diodes (one of these would
typically be wired to the diode inside the CPU)
temp[1-3]_min,
temp[1-3]_max - lower and upper alarm thresholds for temperatures
temp[1-3]_offset - temperature offset registers
The chip adds the offsets stored in these registers to
the corresponding temperature readings.
Note that temp1 and temp2 offsets share the same register,
they cannot both be different from zero at the same time.
Writing a non-zero number to one of them will reset the other
offset to zero.
All temperatures and offsets are read and written in
units of 0.001 degC.
temp[1-3]_alarm - alarm flags for temperature inputs, '1' in case of alarm,
'0' otherwise.
temp[2-3]_input_fault - diode fault flags for temperature inputs 2 and 3.
A fault is detected if the two pins for the corresponding
sensor are open or shorted, or any of the two is shorted
to ground or Vcc. '1' indicates a diode fault.
cpu0_vid - CPU voltage as received from the CPU
vrm - CPU VID standard used for decoding CPU voltage
The *_min, *_max, *_offset and vrm files can be read and
written, all others are read-only.

View file

@ -3,15 +3,15 @@ Naming and data format standards for sysfs files
The libsensors library offers an interface to the raw sensors data
through the sysfs interface. See libsensors documentation and source for
more further information. As of writing this document, libsensors
(from lm_sensors 2.8.3) is heavily chip-dependant. Adding or updating
further information. As of writing this document, libsensors
(from lm_sensors 2.8.3) is heavily chip-dependent. Adding or updating
support for any given chip requires modifying the library's code.
This is because libsensors was written for the procfs interface
older kernel modules were using, which wasn't standardized enough.
Recent versions of libsensors (from lm_sensors 2.8.2 and later) have
support for the sysfs interface, though.
The new sysfs interface was designed to be as chip-independant as
The new sysfs interface was designed to be as chip-independent as
possible.
Note that motherboards vary widely in the connections to sensor chips.
@ -24,7 +24,7 @@ range using external resistors. Since the values of these resistors
can change from motherboard to motherboard, the conversions cannot be
hard coded into the driver and have to be done in user space.
For this reason, even if we aim at a chip-independant libsensors, it will
For this reason, even if we aim at a chip-independent libsensors, it will
still require a configuration file (e.g. /etc/sensors.conf) for proper
values conversion, labeling of inputs and hiding of unused inputs.
@ -39,15 +39,16 @@ If you are developing a userspace application please send us feedback on
this standard.
Note that this standard isn't completely established yet, so it is subject
to changes, even important ones. One more reason to use the library instead
of accessing sysfs files directly.
to changes. If you are writing a new hardware monitoring driver those
features can't seem to fit in this interface, please contact us with your
extension proposal. Keep in mind that backward compatibility must be
preserved.
Each chip gets its own directory in the sysfs /sys/devices tree. To
find all sensor chips, it is easier to follow the symlinks from
/sys/i2c/devices/
find all sensor chips, it is easier to follow the device symlinks from
/sys/class/hwmon/hwmon*.
All sysfs values are fixed point numbers. To get the true value of some
of the values, you should divide by the specified value.
All sysfs values are fixed point numbers.
There is only one value per file, unlike the older /proc specification.
The common scheme for files naming is: <type><number>_<item>. Usual
@ -69,28 +70,40 @@ to cause an alarm) is chip-dependent.
-------------------------------------------------------------------------
[0-*] denotes any positive number starting from 0
[1-*] denotes any positive number starting from 1
RO read only value
RW read/write value
Read/write values may be read-only for some chips, depending on the
hardware implementation.
All entries are optional, and should only be created in a given driver
if the chip has the feature.
************
* Voltages *
************
in[0-8]_min Voltage min value.
in[0-*]_min Voltage min value.
Unit: millivolt
Read/Write
RW
in[0-8]_max Voltage max value.
in[0-*]_max Voltage max value.
Unit: millivolt
Read/Write
RW
in[0-8]_input Voltage input value.
in[0-*]_input Voltage input value.
Unit: millivolt
Read only
RO
Voltage measured on the chip pin.
Actual voltage depends on the scaling resistors on the
motherboard, as recommended in the chip datasheet.
This varies by chip and by motherboard.
Because of this variation, values are generally NOT scaled
by the chip driver, and must be done by the application.
However, some drivers (notably lm87 and via686a)
do scale, with various degrees of success.
do scale, because of internal resistors built into a chip.
These drivers will output the actual voltage.
Typical usage:
@ -104,58 +117,72 @@ in[0-8]_input Voltage input value.
in7_* varies
in8_* varies
cpu[0-1]_vid CPU core reference voltage.
cpu[0-*]_vid CPU core reference voltage.
Unit: millivolt
Read only.
RO
Not always correct.
vrm Voltage Regulator Module version number.
Read only.
Two digit number, first is major version, second is
minor version.
RW (but changing it should no more be necessary)
Originally the VRM standard version multiplied by 10, but now
an arbitrary number, as not all standards have a version
number.
Affects the way the driver calculates the CPU core reference
voltage from the vid pins.
Also see the Alarms section for status flags associated with voltages.
********
* Fans *
********
fan[1-3]_min Fan minimum value
fan[1-*]_min Fan minimum value
Unit: revolution/min (RPM)
Read/Write.
RW
fan[1-3]_input Fan input value.
fan[1-*]_input Fan input value.
Unit: revolution/min (RPM)
Read only.
RO
fan[1-3]_div Fan divisor.
fan[1-*]_div Fan divisor.
Integer value in powers of two (1, 2, 4, 8, 16, 32, 64, 128).
RW
Some chips only support values 1, 2, 4 and 8.
Note that this is actually an internal clock divisor, which
affects the measurable speed range, not the read value.
Also see the Alarms section for status flags associated with fans.
*******
* PWM *
*******
pwm[1-3] Pulse width modulation fan control.
pwm[1-*] Pulse width modulation fan control.
Integer value in the range 0 to 255
Read/Write
RW
255 is max or 100%.
pwm[1-3]_enable
pwm[1-*]_enable
Switch PWM on and off.
Not always present even if fan*_pwm is.
0 to turn off
1 to turn on in manual mode
2 to turn on in automatic mode
Read/Write
0: turn off
1: turn on in manual mode
2+: turn on in automatic mode
Check individual chip documentation files for automatic mode details.
RW
pwm[1-*]_mode
0: DC mode
1: PWM mode
RW
pwm[1-*]_auto_channels_temp
Select which temperature channels affect this PWM output in
auto mode. Bitfield, 1 is temp1, 2 is temp2, 4 is temp3 etc...
Which values are possible depend on the chip used.
RW
pwm[1-*]_auto_point[1-*]_pwm
pwm[1-*]_auto_point[1-*]_temp
@ -163,6 +190,7 @@ pwm[1-*]_auto_point[1-*]_temp_hyst
Define the PWM vs temperature curve. Number of trip points is
chip-dependent. Use this for chips which associate trip points
to PWM output channels.
RW
OR
@ -172,50 +200,57 @@ temp[1-*]_auto_point[1-*]_temp_hyst
Define the PWM vs temperature curve. Number of trip points is
chip-dependent. Use this for chips which associate trip points
to temperature channels.
RW
****************
* Temperatures *
****************
temp[1-3]_type Sensor type selection.
temp[1-*]_type Sensor type selection.
Integers 1 to 4 or thermistor Beta value (typically 3435)
Read/Write.
RW
1: PII/Celeron Diode
2: 3904 transistor
3: thermal diode
4: thermistor (default/unknown Beta)
Not all types are supported by all chips
temp[1-4]_max Temperature max value.
Unit: millidegree Celcius
Read/Write value.
temp[1-*]_max Temperature max value.
Unit: millidegree Celsius (or millivolt, see below)
RW
temp[1-3]_min Temperature min value.
Unit: millidegree Celcius
Read/Write value.
temp[1-*]_min Temperature min value.
Unit: millidegree Celsius
RW
temp[1-3]_max_hyst
temp[1-*]_max_hyst
Temperature hysteresis value for max limit.
Unit: millidegree Celcius
Unit: millidegree Celsius
Must be reported as an absolute temperature, NOT a delta
from the max value.
Read/Write value.
RW
temp[1-4]_input Temperature input value.
Unit: millidegree Celcius
Read only value.
temp[1-*]_input Temperature input value.
Unit: millidegree Celsius
RO
temp[1-4]_crit Temperature critical value, typically greater than
temp[1-*]_crit Temperature critical value, typically greater than
corresponding temp_max values.
Unit: millidegree Celcius
Read/Write value.
Unit: millidegree Celsius
RW
temp[1-2]_crit_hyst
temp[1-*]_crit_hyst
Temperature hysteresis value for critical limit.
Unit: millidegree Celcius
Unit: millidegree Celsius
Must be reported as an absolute temperature, NOT a delta
from the critical value.
RW
temp[1-4]_offset
Temperature offset which is added to the temperature reading
by the chip.
Unit: millidegree Celsius
Read/Write value.
If there are multiple temperature sensors, temp1_* is
@ -225,6 +260,17 @@ temp[1-2]_crit_hyst
itself, for example the thermal diode inside the CPU or
a thermistor nearby.
Some chips measure temperature using external thermistors and an ADC, and
report the temperature measurement as a voltage. Converting this voltage
back to a temperature (or the other way around for limits) requires
mathematical functions not available in the kernel, so the conversion
must occur in user space. For these chips, all temp* files described
above should contain values expressed in millivolt instead of millidegree
Celsius. In other words, such temperature channels are handled as voltage
channels by the driver.
Also see the Alarms section for status flags associated with temperatures.
************
* Currents *
@ -233,25 +279,88 @@ temp[1-2]_crit_hyst
Note that no known chip provides current measurements as of writing,
so this part is theoretical, so to say.
curr[1-n]_max Current max value
curr[1-*]_max Current max value
Unit: milliampere
Read/Write.
RW
curr[1-n]_min Current min value.
curr[1-*]_min Current min value.
Unit: milliampere
Read/Write.
RW
curr[1-n]_input Current input value
curr[1-*]_input Current input value
Unit: milliampere
Read only.
RO
*********
* Other *
*********
**********
* Alarms *
**********
Each channel or limit may have an associated alarm file, containing a
boolean value. 1 means than an alarm condition exists, 0 means no alarm.
Usually a given chip will either use channel-related alarms, or
limit-related alarms, not both. The driver should just reflect the hardware
implementation.
in[0-*]_alarm
fan[1-*]_alarm
temp[1-*]_alarm
Channel alarm
0: no alarm
1: alarm
RO
OR
in[0-*]_min_alarm
in[0-*]_max_alarm
fan[1-*]_min_alarm
temp[1-*]_min_alarm
temp[1-*]_max_alarm
temp[1-*]_crit_alarm
Limit alarm
0: no alarm
1: alarm
RO
Each input channel may have an associated fault file. This can be used
to notify open diodes, unconnected fans etc. where the hardware
supports it. When this boolean has value 1, the measurement for that
channel should not be trusted.
in[0-*]_input_fault
fan[1-*]_input_fault
temp[1-*]_input_fault
Input fault condition
0: no fault occured
1: fault condition
RO
Some chips also offer the possibility to get beeped when an alarm occurs:
beep_enable Master beep enable
0: no beeps
1: beeps
RW
in[0-*]_beep
fan[1-*]_beep
temp[1-*]_beep
Channel beep
0: disable
1: enable
RW
In theory, a chip could provide per-limit beep masking, but no such chip
was seen so far.
Old drivers provided a different, non-standard interface to alarms and
beeps. These interface files are deprecated, but will be kept around
for compatibility reasons:
alarms Alarm bitmask.
Read only.
RO
Integer representation of one to four bytes.
A '1' bit means an alarm.
Chips should be programmed for 'comparator' mode so that
@ -259,35 +368,26 @@ alarms Alarm bitmask.
if it is still valid.
Generally a direct representation of a chip's internal
alarm registers; there is no standard for the position
of individual bits.
of individual bits. For this reason, the use of this
interface file for new drivers is discouraged. Use
individual *_alarm and *_fault files instead.
Bits are defined in kernel/include/sensors.h.
alarms_in Alarm bitmask relative to in (voltage) channels
Read only
A '1' bit means an alarm, LSB corresponds to in0 and so on
Prefered to 'alarms' for newer chips
alarms_fan Alarm bitmask relative to fan channels
Read only
A '1' bit means an alarm, LSB corresponds to fan1 and so on
Prefered to 'alarms' for newer chips
alarms_temp Alarm bitmask relative to temp (temperature) channels
Read only
A '1' bit means an alarm, LSB corresponds to temp1 and so on
Prefered to 'alarms' for newer chips
beep_enable Beep/interrupt enable
0 to disable.
1 to enable.
Read/Write
beep_mask Bitmask for beep.
Same format as 'alarms' with the same bit locations.
Read/Write
Same format as 'alarms' with the same bit locations,
use discouraged for the same reason. Use individual
*_beep files instead.
RW
*********
* Other *
*********
eeprom Raw EEPROM data in binary form.
Read only.
RO
pec Enable or disable PEC (SMBus only)
Read/Write
0: disable
1: enable
RW

View file

@ -6,31 +6,32 @@ voltages, fans speed). They are often connected through an I2C bus, but some
are also connected directly through the ISA bus.
The kernel drivers make the data from the sensor chips available in the /sys
virtual filesystem. Userspace tools are then used to display or set or the
data in a more friendly manner.
virtual filesystem. Userspace tools are then used to display the measured
values or configure the chips in a more friendly manner.
Lm-sensors
----------
Core set of utilites that will allow you to obtain health information,
Core set of utilities that will allow you to obtain health information,
setup monitoring limits etc. You can get them on their homepage
http://www.lm-sensors.nu/ or as a package from your Linux distribution.
If from website:
Get lmsensors from project web site. Please note, you need only userspace
part, so compile with "make user_install" target.
Get lm-sensors from project web site. Please note, you need only userspace
part, so compile with "make user" and install with "make user_install".
General hints to get things working:
0) get lm-sensors userspace utils
1) compile all drivers in I2C section as modules in your kernel
1) compile all drivers in I2C and Hardware Monitoring sections as modules
in your kernel
2) run sensors-detect script, it will tell you what modules you need to load.
3) load them and run "sensors" command, you should see some results.
4) fix sensors.conf, labels, limits, fan divisors
5) if any more problems consult FAQ, or documentation
Other utilites
--------------
Other utilities
---------------
If you want some graphical indicators of system health look for applications
like: gkrellm, ksensors, xsensors, wmtemp, wmsensors, wmgtemp, ksysguardd,

113
Documentation/hwmon/w83791d Normal file
View file

@ -0,0 +1,113 @@
Kernel driver w83791d
=====================
Supported chips:
* Winbond W83791D
Prefix: 'w83791d'
Addresses scanned: I2C 0x2c - 0x2f
Datasheet: http://www.winbond-usa.com/products/winbond_products/pdfs/PCIC/W83791Da.pdf
Author: Charles Spirakis <bezaur@gmail.com>
This driver was derived from the w83781d.c and w83792d.c source files.
Credits:
w83781d.c:
Frodo Looijaard <frodol@dds.nl>,
Philip Edelbrock <phil@netroedge.com>,
and Mark Studebaker <mdsxyz123@yahoo.com>
w83792d.c:
Chunhao Huang <DZShen@Winbond.com.tw>,
Rudolf Marek <r.marek@sh.cvut.cz>
Module Parameters
-----------------
* init boolean
(default 0)
Use 'init=1' to have the driver do extra software initializations.
The default behavior is to do the minimum initialization possible
and depend on the BIOS to properly setup the chip. If you know you
have a w83791d and you're having problems, try init=1 before trying
reset=1.
* reset boolean
(default 0)
Use 'reset=1' to reset the chip (via index 0x40, bit 7). The default
behavior is no chip reset to preserve BIOS settings.
* force_subclients=bus,caddr,saddr,saddr
This is used to force the i2c addresses for subclients of
a certain chip. Example usage is `force_subclients=0,0x2f,0x4a,0x4b'
to force the subclients of chip 0x2f on bus 0 to i2c addresses
0x4a and 0x4b.
Description
-----------
This driver implements support for the Winbond W83791D chip.
Detection of the chip can sometimes be foiled because it can be in an
internal state that allows no clean access (Bank with ID register is not
currently selected). If you know the address of the chip, use a 'force'
parameter; this will put it into a more well-behaved state first.
The driver implements three temperature sensors, five fan rotation speed
sensors, and ten voltage sensors.
Temperatures are measured in degrees Celsius and measurement resolution is 1
degC for temp1 and 0.5 degC for temp2 and temp3. An alarm is triggered when
the temperature gets higher than the Overtemperature Shutdown value; it stays
on until the temperature falls below the Hysteresis value.
Fan rotation speeds are reported in RPM (rotations per minute). An alarm is
triggered if the rotation speed has dropped below a programmable limit. Fan
readings can be divided by a programmable divider (1, 2, 4, 8 for fan 1/2/3
and 1, 2, 4, 8, 16, 32, 64 or 128 for fan 4/5) to give the readings more
range or accuracy.
Voltage sensors (also known as IN sensors) report their values in millivolts.
An alarm is triggered if the voltage has crossed a programmable minimum
or maximum limit.
Alarms are provided as output from a "realtime status register". The
following bits are defined:
bit - alarm on:
0 - Vcore
1 - VINR0
2 - +3.3VIN
3 - 5VDD
4 - temp1
5 - temp2
6 - fan1
7 - fan2
8 - +12VIN
9 - -12VIN
10 - -5VIN
11 - fan3
12 - chassis
13 - temp3
14 - VINR1
15 - reserved
16 - tart1
17 - tart2
18 - tart3
19 - VSB
20 - VBAT
21 - fan4
22 - fan5
23 - reserved
When an alarm goes off, you can be warned by a beeping signal through your
computer speaker. It is possible to enable all beeping globally, or only
the beeping for some alarms.
The driver only reads the chip values each 3 seconds; reading them more
often will do no harm, but will return 'old' values.
W83791D TODO:
---------------
Provide a patch for per-file alarms as discussed on the mailing list
Provide a patch for smart-fan control (still need appropriate motherboard/fans)

View file

@ -21,8 +21,7 @@ Authors:
Module Parameters
-----------------
* force_addr: int
Forcibly enable the ICH at the given address. EXTREMELY DANGEROUS!
None.
Description

View file

@ -7,6 +7,8 @@ Supported adapters:
* nForce3 250Gb MCP 10de:00E4
* nForce4 MCP 10de:0052
* nForce4 MCP-04 10de:0034
* nForce4 MCP51 10de:0264
* nForce4 MCP55 10de:0368
Datasheet: not publically available, but seems to be similar to the
AMD-8111 SMBus 2.0 adapter.

View file

@ -0,0 +1,51 @@
Kernel driver i2c-ocores
Supported adapters:
* OpenCores.org I2C controller by Richard Herveille (see datasheet link)
Datasheet: http://www.opencores.org/projects.cgi/web/i2c/overview
Author: Peter Korsgaard <jacmet@sunsite.dk>
Description
-----------
i2c-ocores is an i2c bus driver for the OpenCores.org I2C controller
IP core by Richard Herveille.
Usage
-----
i2c-ocores uses the platform bus, so you need to provide a struct
platform_device with the base address and interrupt number. The
dev.platform_data of the device should also point to a struct
ocores_i2c_platform_data (see linux/i2c-ocores.h) describing the
distance between registers and the input clock speed.
E.G. something like:
static struct resource ocores_resources[] = {
[0] = {
.start = MYI2C_BASEADDR,
.end = MYI2C_BASEADDR + 8,
.flags = IORESOURCE_MEM,
},
[1] = {
.start = MYI2C_IRQ,
.end = MYI2C_IRQ,
.flags = IORESOURCE_IRQ,
},
};
static struct ocores_i2c_platform_data myi2c_data = {
.regstep = 2, /* two bytes between registers */
.clock_khz = 50000, /* input clock of 50MHz */
};
static struct platform_device myi2c = {
.name = "ocores-i2c",
.dev = {
.platform_data = &myi2c_data,
},
.num_resources = ARRAY_SIZE(ocores_resources),
.resource = ocores_resources,
};

View file

@ -6,6 +6,8 @@ Supported adapters:
Datasheet: Publicly available at the Intel website
* ServerWorks OSB4, CSB5, CSB6 and HT-1000 southbridges
Datasheet: Only available via NDA from ServerWorks
* ATI IXP southbridges IXP200, IXP300, IXP400
Datasheet: Not publicly available
* Standard Microsystems (SMSC) SLC90E66 (Victory66) southbridge
Datasheet: Publicly available at the SMSC website http://www.smsc.com
@ -21,8 +23,6 @@ Module Parameters
Forcibly enable the PIIX4. DANGEROUS!
* force_addr: int
Forcibly enable the PIIX4 at the given address. EXTREMELY DANGEROUS!
* fix_hstcfg: int
Fix config register. Needed on some boards (Force CPCI735).
Description
@ -63,10 +63,36 @@ The PIIX4E is just an new version of the PIIX4; it is supported as well.
The PIIX/PIIX3 does not implement an SMBus or I2C bus, so you can't use
this driver on those mainboards.
The ServerWorks Southbridges, the Intel 440MX, and the Victory766 are
The ServerWorks Southbridges, the Intel 440MX, and the Victory66 are
identical to the PIIX4 in I2C/SMBus support.
A few OSB4 southbridges are known to be misconfigured by the BIOS. In this
case, you have you use the fix_hstcfg module parameter. Do not use it
unless you know you have to, because in some cases it also breaks
configuration on southbridges that don't need it.
If you own Force CPCI735 motherboard or other OSB4 based systems you may need
to change the SMBus Interrupt Select register so the SMBus controller uses
the SMI mode.
1) Use lspci command and locate the PCI device with the SMBus controller:
00:0f.0 ISA bridge: ServerWorks OSB4 South Bridge (rev 4f)
The line may vary for different chipsets. Please consult the driver source
for all possible PCI ids (and lspci -n to match them). Lets assume the
device is located at 00:0f.0.
2) Now you just need to change the value in 0xD2 register. Get it first with
command: lspci -xxx -s 00:0f.0
If the value is 0x3 then you need to change it to 0x1
setpci -s 00:0f.0 d2.b=1
Please note that you don't need to do that in all cases, just when the SMBus is
not working properly.
Hardware-specific issues
------------------------
This driver will refuse to load on IBM systems with an Intel PIIX4 SMBus.
Some of these machines have an RFID EEPROM (24RF08) connected to the SMBus,
which can easily get corrupted due to a state machine bug. These are mostly
Thinkpad laptops, but desktop systems may also be affected. We have no list
of all affected systems, so the only safe solution was to prevent access to
the SMBus on all IBM systems (detected using DMI data.)
For additional information, read:
http://www2.lm-sensors.nu/~lm78/cvs/lm_sensors2/README.thinkpad

View file

@ -2,14 +2,31 @@ Kernel driver scx200_acb
Author: Christer Weinigel <wingel@nano-system.com>
The driver supersedes the older, never merged driver named i2c-nscacb.
Module Parameters
-----------------
* base: int
* base: up to 4 ints
Base addresses for the ACCESS.bus controllers on SCx200 and SC1100 devices
By default the driver uses two base addresses 0x820 and 0x840.
If you want only one base address, specify the second as 0 so as to
override this default.
Description
-----------
Enable the use of the ACCESS.bus controller on the Geode SCx200 and
SC1100 processors and the CS5535 and CS5536 Geode companion devices.
Device-specific notes
---------------------
The SC1100 WRAP boards are known to use base addresses 0x810 and 0x820.
If the scx200_acb driver is built into the kernel, add the following
parameter to your boot command line:
scx200_acb.base=0x810,0x820
If the scx200_acb driver is built as a module, add the following line to
the file /etc/modprobe.conf instead:
options scx200_acb base=0x810,0x820

View file

@ -0,0 +1,208 @@
MEMORY ATTRIBUTE ALIASING ON IA-64
Bjorn Helgaas
<bjorn.helgaas@hp.com>
May 4, 2006
MEMORY ATTRIBUTES
Itanium supports several attributes for virtual memory references.
The attribute is part of the virtual translation, i.e., it is
contained in the TLB entry. The ones of most interest to the Linux
kernel are:
WB Write-back (cacheable)
UC Uncacheable
WC Write-coalescing
System memory typically uses the WB attribute. The UC attribute is
used for memory-mapped I/O devices. The WC attribute is uncacheable
like UC is, but writes may be delayed and combined to increase
performance for things like frame buffers.
The Itanium architecture requires that we avoid accessing the same
page with both a cacheable mapping and an uncacheable mapping[1].
The design of the chipset determines which attributes are supported
on which regions of the address space. For example, some chipsets
support either WB or UC access to main memory, while others support
only WB access.
MEMORY MAP
Platform firmware describes the physical memory map and the
supported attributes for each region. At boot-time, the kernel uses
the EFI GetMemoryMap() interface. ACPI can also describe memory
devices and the attributes they support, but Linux/ia64 currently
doesn't use this information.
The kernel uses the efi_memmap table returned from GetMemoryMap() to
learn the attributes supported by each region of physical address
space. Unfortunately, this table does not completely describe the
address space because some machines omit some or all of the MMIO
regions from the map.
The kernel maintains another table, kern_memmap, which describes the
memory Linux is actually using and the attribute for each region.
This contains only system memory; it does not contain MMIO space.
The kern_memmap table typically contains only a subset of the system
memory described by the efi_memmap. Linux/ia64 can't use all memory
in the system because of constraints imposed by the identity mapping
scheme.
The efi_memmap table is preserved unmodified because the original
boot-time information is required for kexec.
KERNEL IDENTITY MAPPINGS
Linux/ia64 identity mappings are done with large pages, currently
either 16MB or 64MB, referred to as "granules." Cacheable mappings
are speculative[2], so the processor can read any location in the
page at any time, independent of the programmer's intentions. This
means that to avoid attribute aliasing, Linux can create a cacheable
identity mapping only when the entire granule supports cacheable
access.
Therefore, kern_memmap contains only full granule-sized regions that
can referenced safely by an identity mapping.
Uncacheable mappings are not speculative, so the processor will
generate UC accesses only to locations explicitly referenced by
software. This allows UC identity mappings to cover granules that
are only partially populated, or populated with a combination of UC
and WB regions.
USER MAPPINGS
User mappings are typically done with 16K or 64K pages. The smaller
page size allows more flexibility because only 16K or 64K has to be
homogeneous with respect to memory attributes.
POTENTIAL ATTRIBUTE ALIASING CASES
There are several ways the kernel creates new mappings:
mmap of /dev/mem
This uses remap_pfn_range(), which creates user mappings. These
mappings may be either WB or UC. If the region being mapped
happens to be in kern_memmap, meaning that it may also be mapped
by a kernel identity mapping, the user mapping must use the same
attribute as the kernel mapping.
If the region is not in kern_memmap, the user mapping should use
an attribute reported as being supported in the EFI memory map.
Since the EFI memory map does not describe MMIO on some
machines, this should use an uncacheable mapping as a fallback.
mmap of /sys/class/pci_bus/.../legacy_mem
This is very similar to mmap of /dev/mem, except that legacy_mem
only allows mmap of the one megabyte "legacy MMIO" area for a
specific PCI bus. Typically this is the first megabyte of
physical address space, but it may be different on machines with
several VGA devices.
"X" uses this to access VGA frame buffers. Using legacy_mem
rather than /dev/mem allows multiple instances of X to talk to
different VGA cards.
The /dev/mem mmap constraints apply.
However, since this is for mapping legacy MMIO space, WB access
does not make sense. This matters on machines without legacy
VGA support: these machines may have WB memory for the entire
first megabyte (or even the entire first granule).
On these machines, we could mmap legacy_mem as WB, which would
be safe in terms of attribute aliasing, but X has no way of
knowing that it is accessing regular memory, not a frame buffer,
so the kernel should fail the mmap rather than doing it with WB.
read/write of /dev/mem
This uses copy_from_user(), which implicitly uses a kernel
identity mapping. This is obviously safe for things in
kern_memmap.
There may be corner cases of things that are not in kern_memmap,
but could be accessed this way. For example, registers in MMIO
space are not in kern_memmap, but could be accessed with a UC
mapping. This would not cause attribute aliasing. But
registers typically can be accessed only with four-byte or
eight-byte accesses, and the copy_from_user() path doesn't allow
any control over the access size, so this would be dangerous.
ioremap()
This returns a kernel identity mapping for use inside the
kernel.
If the region is in kern_memmap, we should use the attribute
specified there. Otherwise, if the EFI memory map reports that
the entire granule supports WB, we should use that (granules
that are partially reserved or occupied by firmware do not appear
in kern_memmap). Otherwise, we should use a UC mapping.
PAST PROBLEM CASES
mmap of various MMIO regions from /dev/mem by "X" on Intel platforms
The EFI memory map may not report these MMIO regions.
These must be allowed so that X will work. This means that
when the EFI memory map is incomplete, every /dev/mem mmap must
succeed. It may create either WB or UC user mappings, depending
on whether the region is in kern_memmap or the EFI memory map.
mmap of 0x0-0xA0000 /dev/mem by "hwinfo" on HP sx1000 with VGA enabled
See https://bugzilla.novell.com/show_bug.cgi?id=140858.
The EFI memory map reports the following attributes:
0x00000-0x9FFFF WB only
0xA0000-0xBFFFF UC only (VGA frame buffer)
0xC0000-0xFFFFF WB only
This mmap is done with user pages, not kernel identity mappings,
so it is safe to use WB mappings.
The kernel VGA driver may ioremap the VGA frame buffer at 0xA0000,
which will use a granule-sized UC mapping covering 0-0xFFFFF. This
granule covers some WB-only memory, but since UC is non-speculative,
the processor will never generate an uncacheable reference to the
WB-only areas unless the driver explicitly touches them.
mmap of 0x0-0xFFFFF legacy_mem by "X"
If the EFI memory map reports this entire range as WB, there
is no VGA MMIO hole, and the mmap should fail or be done with
a WB mapping.
There's no easy way for X to determine whether the 0xA0000-0xBFFFF
region is a frame buffer or just memory, so I think it's best to
just fail this mmap request rather than using a WB mapping. As
far as I know, there's no need to map legacy_mem with WB
mappings.
Otherwise, a UC mapping of the entire region is probably safe.
The VGA hole means the region will not be in kern_memmap. The
HP sx1000 chipset doesn't support UC access to the memory surrounding
the VGA hole, but X doesn't need that area anyway and should not
reference it.
mmap of 0xA0000-0xBFFFF legacy_mem by "X" on HP sx1000 with VGA disabled
The EFI memory map reports the following attributes:
0x00000-0xFFFFF WB only (no VGA MMIO hole)
This is a special case of the previous case, and the mmap should
fail for the same reason as above.
NOTES
[1] SDM rev 2.2, vol 2, sec 4.4.1.
[2] SDM rev 2.2, vol 2, sec 4.4.6.

View file

@ -67,8 +67,7 @@ initrd adds the following new options:
as the last process has closed it, all data is freed and /dev/initrd
can't be opened anymore.
root=/dev/ram0 (without devfs)
root=/dev/rd/0 (with devfs)
root=/dev/ram0
initrd is mounted as root, and the normal boot procedure is followed,
with the RAM disk still mounted as root.
@ -90,8 +89,7 @@ you're building an install floppy), the root file system creation
procedure should create the /initrd directory.
If initrd will not be mounted in some cases, its content is still
accessible if the following device has been created (note that this
does not work if using devfs):
accessible if the following device has been created:
# mknod /dev/initrd b 1 250
# chmod 400 /dev/initrd
@ -119,8 +117,7 @@ We'll describe the loopback device method:
(if space is critical, you may want to use the Minix FS instead of Ext2)
3) mount the file system, e.g.
# mount -t ext2 -o loop initrd /mnt
4) create the console device (not necessary if using devfs, but it can't
hurt to do it anyway):
4) create the console device:
# mkdir /mnt/dev
# mknod /mnt/dev/console c 5 1
5) copy all the files that are needed to properly use the initrd
@ -152,12 +149,7 @@ have to be given:
root=/dev/ram0 init=/linuxrc rw
if not using devfs, or
root=/dev/rd/0 init=/linuxrc rw
if using devfs. (rw is only necessary if writing to the initrd file
system.)
(rw is only necessary if writing to the initrd file system.)
With LOADLIN, you simply execute
@ -217,9 +209,9 @@ following command:
# exec chroot . what-follows <dev/console >dev/console 2>&1
Where what-follows is a program under the new root, e.g. /sbin/init
If the new root file system will be used with devfs and has no valid
/dev directory, devfs must be mounted before invoking chroot in order to
provide /dev/console.
If the new root file system will be used with udev and has no valid
/dev directory, udev must be initialized before invoking chroot in order
to provide /dev/console.
Note: implementation details of pivot_root may change with time. In order
to ensure compatibility, the following points should be observed:
@ -236,7 +228,7 @@ Now, the initrd can be unmounted and the memory allocated by the RAM
disk can be freed:
# umount /initrd
# blockdev --flushbufs /dev/ram0 # /dev/rd/0 if using devfs
# blockdev --flushbufs /dev/ram0
It is also possible to use initrd with an NFS-mounted root, see the
pivot_root(8) man page for details.

View file

@ -85,7 +85,9 @@ Code Seq# Include File Comments
<mailto:maassen@uni-freiburg.de>
'C' all linux/soundcard.h
'D' all asm-s390/dasd.h
'E' all linux/input.h
'F' all linux/fb.h
'H' all linux/hiddev.h
'I' all linux/isdn.h
'J' 00-1F drivers/scsi/gdth_ioctl.h
'K' all linux/kd.h
@ -117,7 +119,6 @@ Code Seq# Include File Comments
'c' 00-7F linux/comstats.h conflict!
'c' 00-7F linux/coda.h conflict!
'd' 00-FF linux/char/drm/drm/h conflict!
'd' 00-1F linux/devfs_fs.h conflict!
'd' 00-DF linux/video_decoder.h conflict!
'd' F0-FF linux/digi1.h
'e' all linux/digi1.h conflict!

View file

@ -124,7 +124,8 @@ GigaSet 307x Device Driver
You can use some configuration tool of your distribution to configure this
"modem" or configure pppd/wvdial manually. There are some example ppp
configuration files and chat scripts in the gigaset-VERSION/ppp directory.
configuration files and chat scripts in the gigaset-VERSION/ppp directory
in the driver packages from http://sourceforge.net/projects/gigaset307x/.
Please note that the USB drivers are not able to change the state of the
control lines (the M105 driver can be configured to use some undocumented
control requests, if you really need the control lines, though). This means
@ -164,8 +165,8 @@ GigaSet 307x Device Driver
If you want both of these at once, you are out of luck.
You can also use /sys/module/<name>/parameters/cidmode for changing
the CID mode setting (<name> is usb_gigaset or bas_gigaset).
You can also use /sys/class/tty/ttyGxy/cidmode for changing the CID mode
setting (ttyGxy is ttyGU0 or ttyGB0).
3. Troubleshooting

View file

@ -1123,6 +1123,14 @@ The top Makefile exports the following variables:
$(INSTALL_MOD_PATH)/lib/modules/$(KERNELRELEASE). The user may
override this value on the command line if desired.
INSTALL_MOD_STRIP
If this variable is specified, will cause modules to be stripped
after they are installed. If INSTALL_MOD_STRIP is '1', then the
default option --strip-debug will be used. Otherwise,
INSTALL_MOD_STRIP will used as the option(s) to the strip command.
=== 8 Makefile language
The kernel Makefiles are designed to run with GNU Make. The Makefiles

View file

@ -175,7 +175,7 @@ end
document trapinfo
Run info threads and lookup pid of thread #1
'trapinfo <pid>' will tell you by which trap & possibly
addresthe kernel paniced.
address the kernel panicked.
end

View file

@ -1,155 +1,325 @@
Documentation for kdump - the kexec-based crash dumping solution
================================================================
Documentation for Kdump - The kexec-based Crash Dumping Solution
================================================================
DESIGN
======
This document includes overview, setup and installation, and analysis
information.
Kdump uses kexec to reboot to a second kernel whenever a dump needs to be
taken. This second kernel is booted with very little memory. The first kernel
reserves the section of memory that the second kernel uses. This ensures that
on-going DMA from the first kernel does not corrupt the second kernel.
Overview
========
All the necessary information about Core image is encoded in ELF format and
stored in reserved area of memory before crash. Physical address of start of
ELF header is passed to new kernel through command line parameter elfcorehdr=.
Kdump uses kexec to quickly boot to a dump-capture kernel whenever a
dump of the system kernel's memory needs to be taken (for example, when
the system panics). The system kernel's memory image is preserved across
the reboot and is accessible to the dump-capture kernel.
On i386, the first 640 KB of physical memory is needed to boot, irrespective
of where the kernel loads. Hence, this region is backed up by kexec just before
rebooting into the new kernel.
You can use common Linux commands, such as cp and scp, to copy the
memory image to a dump file on the local disk, or across the network to
a remote system.
In the second kernel, "old memory" can be accessed in two ways.
Kdump and kexec are currently supported on the x86, x86_64, and ppc64
architectures.
- The first one is through a /dev/oldmem device interface. A capture utility
can read the device file and write out the memory in raw format. This is raw
dump of memory and analysis/capture tool should be intelligent enough to
determine where to look for the right information. ELF headers (elfcorehdr=)
can become handy here.
When the system kernel boots, it reserves a small section of memory for
the dump-capture kernel. This ensures that ongoing Direct Memory Access
(DMA) from the system kernel does not corrupt the dump-capture kernel.
The kexec -p command loads the dump-capture kernel into this reserved
memory.
- The second interface is through /proc/vmcore. This exports the dump as an ELF
format file which can be written out using any file copy command
(cp, scp, etc). Further, gdb can be used to perform limited debugging on
the dump file. This method ensures methods ensure that there is correct
ordering of the dump pages (corresponding to the first 640 KB that has been
relocated).
On x86 machines, the first 640 KB of physical memory is needed to boot,
regardless of where the kernel loads. Therefore, kexec backs up this
region just before rebooting into the dump-capture kernel.
SETUP
=====
All of the necessary information about the system kernel's core image is
encoded in the ELF format, and stored in a reserved area of memory
before a crash. The physical address of the start of the ELF header is
passed to the dump-capture kernel through the elfcorehdr= boot
parameter.
1) Download the upstream kexec-tools userspace package from
http://www.xmission.com/~ebiederm/files/kexec/kexec-tools-1.101.tar.gz.
With the dump-capture kernel, you can access the memory image, or "old
memory," in two ways:
Apply the latest consolidated kdump patch on top of kexec-tools-1.101
from http://lse.sourceforge.net/kdump/. This arrangment has been made
till all the userspace patches supporting kdump are integrated with
upstream kexec-tools userspace.
- Through a /dev/oldmem device interface. A capture utility can read the
device file and write out the memory in raw format. This is a raw dump
of memory. Analysis and capture tools must be intelligent enough to
determine where to look for the right information.
2) Download and build the appropriate (2.6.13-rc1 onwards) vanilla kernels.
Two kernels need to be built in order to get this feature working.
Following are the steps to properly configure the two kernels specific
to kexec and kdump features:
A) First kernel or regular kernel:
----------------------------------
a) Enable "kexec system call" feature (in Processor type and features).
CONFIG_KEXEC=y
b) Enable "sysfs file system support" (in Pseudo filesystems).
CONFIG_SYSFS=y
c) make
d) Boot into first kernel with the command line parameter "crashkernel=Y@X".
Use appropriate values for X and Y. Y denotes how much memory to reserve
for the second kernel, and X denotes at what physical address the
reserved memory section starts. For example: "crashkernel=64M@16M".
- Through /proc/vmcore. This exports the dump as an ELF-format file that
you can write out using file copy commands such as cp or scp. Further,
you can use analysis tools such as the GNU Debugger (GDB) and the Crash
tool to debug the dump file. This method ensures that the dump pages are
correctly ordered.
B) Second kernel or dump capture kernel:
---------------------------------------
a) For i386 architecture enable Highmem support
CONFIG_HIGHMEM=y
b) Enable "kernel crash dumps" feature (under "Processor type and features")
CONFIG_CRASH_DUMP=y
c) Make sure a suitable value for "Physical address where the kernel is
loaded" (under "Processor type and features"). By default this value
is 0x1000000 (16MB) and it should be same as X (See option d above),
e.g., 16 MB or 0x1000000.
CONFIG_PHYSICAL_START=0x1000000
d) Enable "/proc/vmcore support" (Optional, under "Pseudo filesystems").
CONFIG_PROC_VMCORE=y
Setup and Installation
======================
3) After booting to regular kernel or first kernel, load the second kernel
using the following command:
Install kexec-tools and the Kdump patch
---------------------------------------
kexec -p <second-kernel> --args-linux --elf32-core-headers
--append="root=<root-dev> init 1 irqpoll maxcpus=1"
1) Login as the root user.
Notes:
======
i) <second-kernel> has to be a vmlinux image ie uncompressed elf image.
bzImage will not work, as of now.
ii) --args-linux has to be speicfied as if kexec it loading an elf image,
it needs to know that the arguments supplied are of linux type.
iii) By default ELF headers are stored in ELF64 format to support systems
with more than 4GB memory. Option --elf32-core-headers forces generation
of ELF32 headers. The reason for this option being, as of now gdb can
not open vmcore file with ELF64 headers on a 32 bit systems. So ELF32
headers can be used if one has non-PAE systems and hence memory less
than 4GB.
iv) Specify "irqpoll" as command line parameter. This reduces driver
initialization failures in second kernel due to shared interrupts.
v) <root-dev> needs to be specified in a format corresponding to the root
device name in the output of mount command.
vi) If you have built the drivers required to mount root file system as
modules in <second-kernel>, then, specify
--initrd=<initrd-for-second-kernel>.
vii) Specify maxcpus=1 as, if during first kernel run, if panic happens on
non-boot cpus, second kernel doesn't seem to be boot up all the cpus.
The other option is to always built the second kernel without SMP
support ie CONFIG_SMP=n
2) Download the kexec-tools user-space package from the following URL:
4) After successfully loading the second kernel as above, if a panic occurs
system reboots into the second kernel. A module can be written to force
the panic or "ALT-SysRq-c" can be used initiate a crash dump for testing
purposes.
http://www.xmission.com/~ebiederm/files/kexec/kexec-tools-1.101.tar.gz
5) Once the second kernel has booted, write out the dump file using
3) Unpack the tarball with the tar command, as follows:
tar xvpzf kexec-tools-1.101.tar.gz
4) Download the latest consolidated Kdump patch from the following URL:
http://lse.sourceforge.net/kdump/
(This location is being used until all the user-space Kdump patches
are integrated with the kexec-tools package.)
5) Change to the kexec-tools-1.101 directory, as follows:
cd kexec-tools-1.101
6) Apply the consolidated patch to the kexec-tools-1.101 source tree
with the patch command, as follows. (Modify the path to the downloaded
patch as necessary.)
patch -p1 < /path-to-kdump-patch/kexec-tools-1.101-kdump.patch
7) Configure the package, as follows:
./configure
8) Compile the package, as follows:
make
9) Install the package, as follows:
make install
Download and build the system and dump-capture kernels
------------------------------------------------------
Download the mainline (vanilla) kernel source code (2.6.13-rc1 or newer)
from http://www.kernel.org. Two kernels must be built: a system kernel
and a dump-capture kernel. Use the following steps to configure these
kernels with the necessary kexec and Kdump features:
System kernel
-------------
1) Enable "kexec system call" in "Processor type and features."
CONFIG_KEXEC=y
2) Enable "sysfs file system support" in "Filesystem" -> "Pseudo
filesystems." This is usually enabled by default.
CONFIG_SYSFS=y
Note that "sysfs file system support" might not appear in the "Pseudo
filesystems" menu if "Configure standard kernel features (for small
systems)" is not enabled in "General Setup." In this case, check the
.config file itself to ensure that sysfs is turned on, as follows:
grep 'CONFIG_SYSFS' .config
3) Enable "Compile the kernel with debug info" in "Kernel hacking."
CONFIG_DEBUG_INFO=Y
This causes the kernel to be built with debug symbols. The dump
analysis tools require a vmlinux with debug symbols in order to read
and analyze a dump file.
4) Make and install the kernel and its modules. Update the boot loader
(such as grub, yaboot, or lilo) configuration files as necessary.
5) Boot the system kernel with the boot parameter "crashkernel=Y@X",
where Y specifies how much memory to reserve for the dump-capture kernel
and X specifies the beginning of this reserved memory. For example,
"crashkernel=64M@16M" tells the system kernel to reserve 64 MB of memory
starting at physical address 0x01000000 for the dump-capture kernel.
On x86 and x86_64, use "crashkernel=64M@16M".
On ppc64, use "crashkernel=128M@32M".
The dump-capture kernel
-----------------------
1) Under "General setup," append "-kdump" to the current string in
"Local version."
2) On x86, enable high memory support under "Processor type and
features":
CONFIG_HIGHMEM64G=y
or
CONFIG_HIGHMEM4G
3) On x86 and x86_64, disable symmetric multi-processing support
under "Processor type and features":
CONFIG_SMP=n
(If CONFIG_SMP=y, then specify maxcpus=1 on the kernel command line
when loading the dump-capture kernel, see section "Load the Dump-capture
Kernel".)
4) On ppc64, disable NUMA support and enable EMBEDDED support:
CONFIG_NUMA=n
CONFIG_EMBEDDED=y
CONFIG_EEH=N for the dump-capture kernel
5) Enable "kernel crash dumps" support under "Processor type and
features":
CONFIG_CRASH_DUMP=y
6) Use a suitable value for "Physical address where the kernel is
loaded" (under "Processor type and features"). This only appears when
"kernel crash dumps" is enabled. By default this value is 0x1000000
(16MB). It should be the same as X in the "crashkernel=Y@X" boot
parameter discussed above.
On x86 and x86_64, use "CONFIG_PHYSICAL_START=0x1000000".
On ppc64 the value is automatically set at 32MB when
CONFIG_CRASH_DUMP is set.
6) Optionally enable "/proc/vmcore support" under "Filesystems" ->
"Pseudo filesystems".
CONFIG_PROC_VMCORE=y
(CONFIG_PROC_VMCORE is set by default when CONFIG_CRASH_DUMP is selected.)
7) Make and install the kernel and its modules. DO NOT add this kernel
to the boot loader configuration files.
Load the Dump-capture Kernel
============================
After booting to the system kernel, load the dump-capture kernel using
the following command:
kexec -p <dump-capture-kernel> \
--initrd=<initrd-for-dump-capture-kernel> --args-linux \
--append="root=<root-dev> init 1 irqpoll"
Notes on loading the dump-capture kernel:
* <dump-capture-kernel> must be a vmlinux image (that is, an
uncompressed ELF image). bzImage does not work at this time.
* By default, the ELF headers are stored in ELF64 format to support
systems with more than 4GB memory. The --elf32-core-headers option can
be used to force the generation of ELF32 headers. This is necessary
because GDB currently cannot open vmcore files with ELF64 headers on
32-bit systems. ELF32 headers can be used on non-PAE systems (that is,
less than 4GB of memory).
* The "irqpoll" boot parameter reduces driver initialization failures
due to shared interrupts in the dump-capture kernel.
* You must specify <root-dev> in the format corresponding to the root
device name in the output of mount command.
* "init 1" boots the dump-capture kernel into single-user mode without
networking. If you want networking, use "init 3."
Kernel Panic
============
After successfully loading the dump-capture kernel as previously
described, the system will reboot into the dump-capture kernel if a
system crash is triggered. Trigger points are located in panic(),
die(), die_nmi() and in the sysrq handler (ALT-SysRq-c).
The following conditions will execute a crash trigger point:
If a hard lockup is detected and "NMI watchdog" is configured, the system
will boot into the dump-capture kernel ( die_nmi() ).
If die() is called, and it happens to be a thread with pid 0 or 1, or die()
is called inside interrupt context or die() is called and panic_on_oops is set,
the system will boot into the dump-capture kernel.
On powererpc systems when a soft-reset is generated, die() is called by all cpus and the system system will boot into the dump-capture kernel.
For testing purposes, you can trigger a crash by using "ALT-SysRq-c",
"echo c > /proc/sysrq-trigger or write a module to force the panic.
Write Out the Dump File
=======================
After the dump-capture kernel is booted, write out the dump file with
the following command:
cp /proc/vmcore <dump-file>
Dump memory can also be accessed as a /dev/oldmem device for a linear/raw
view. To create the device, type:
You can also access dumped memory as a /dev/oldmem device for a linear
and raw view. To create the device, use the following command:
mknod /dev/oldmem c 1 12
mknod /dev/oldmem c 1 12
Use "dd" with suitable options for count, bs and skip to access specific
portions of the dump.
Use the dd command with suitable options for count, bs, and skip to
access specific portions of the dump.
Entire memory: dd if=/dev/oldmem of=oldmem.001
To see the entire memory, use the following command:
dd if=/dev/oldmem of=oldmem.001
ANALYSIS
Analysis
========
Limited analysis can be done using gdb on the dump file copied out of
/proc/vmcore. Use vmlinux built with -g and run
gdb vmlinux <dump-file>
Before analyzing the dump image, you should reboot into a stable kernel.
Stack trace for the task on processor 0, register display, memory display
work fine.
You can do limited analysis using GDB on the dump file copied out of
/proc/vmcore. Use the debug vmlinux built with -g and run the following
command:
Note: gdb cannot analyse core files generated in ELF64 format for i386.
gdb vmlinux <dump-file>
Latest "crash" (crash-4.0-2.18) as available on Dave Anderson's site
http://people.redhat.com/~anderson/ works well with kdump format.
Stack trace for the task on processor 0, register display, and memory
display work fine.
Note: GDB cannot analyze core files generated in ELF64 format for x86.
On systems with a maximum of 4GB of memory, you can generate
ELF32-format headers using the --elf32-core-headers kernel option on the
dump kernel.
You can also use the Crash utility to analyze dump files in Kdump
format. Crash is available on Dave Anderson's site at the following URL:
http://people.redhat.com/~anderson/
TODO
====
1) Provide a kernel pages filtering mechanism so that core file size is not
insane on systems having huge memory banks.
2) Relocatable kernel can help in maintaining multiple kernels for crashdump
and same kernel as the first kernel can be used to capture the dump.
To Do
=====
1) Provide a kernel pages filtering mechanism, so core file size is not
extreme on systems with huge memory banks.
2) Relocatable kernel can help in maintaining multiple kernels for
crash_dump, and the same kernel as the system kernel can be used to
capture the dump.
CONTACT
Contact
=======
Vivek Goyal (vgoyal@in.ibm.com)
Maneesh Soni (maneesh@in.ibm.com)
Trademark
=========
Linux is a trademark of Linus Torvalds in the United States, other
countries, or both.

View file

@ -35,7 +35,6 @@ parameter is applicable:
APM Advanced Power Management support is enabled.
AX25 Appropriate AX.25 support is enabled.
CD Appropriate CD support is enabled.
DEVFS devfs support is enabled.
DRM Direct Rendering Management support is enabled.
EDD BIOS Enhanced Disk Drive Services (EDD) is enabled
EFI EFI Partitioning (GPT) is enabled
@ -61,6 +60,7 @@ parameter is applicable:
MTD MTD support is enabled.
NET Appropriate network support is enabled.
NUMA NUMA support is enabled.
GENERIC_TIME The generic timeofday code is enabled.
NFS Appropriate NFS support is enabled.
OSS OSS sound support is enabled.
PARIDE The ParIDE subsystem is enabled.
@ -147,6 +147,9 @@ running once the system is up.
acpi_irq_isa= [HW,ACPI] If irq_balance, mark listed IRQs used by ISA
Format: <irq>,<irq>...
acpi_os_name= [HW,ACPI] Tell ACPI BIOS the name of the OS
Format: To spoof as Windows 98: ="Microsoft Windows"
acpi_osi= [HW,ACPI] empty param disables _OSI
acpi_serialize [HW,ACPI] force serialization of AML methods
@ -176,6 +179,11 @@ running once the system is up.
override platform specific driver.
See also Documentation/acpi-hotkey.txt.
acpi_pm_good [IA-32,X86-64]
Override the pmtimer bug detection: force the kernel
to assume that this machine's pmtimer latches its value
and always returns good values.
enable_timer_pin_1 [i386,x86-64]
Enable PIN 1 of APIC timer
Can be useful to work around chipset bugs
@ -338,10 +346,11 @@ running once the system is up.
Value can be changed at runtime via
/selinux/checkreqprot.
clock= [BUGS=IA-32,HW] gettimeofday timesource override.
Forces specified timesource (if avaliable) to be used
when calculating gettimeofday(). If specicified
timesource is not avalible, it defaults to PIT.
clock= [BUGS=IA-32, HW] gettimeofday clocksource override.
[Deprecated]
Forces specified clocksource (if avaliable) to be used
when calculating gettimeofday(). If specified
clocksource is not avalible, it defaults to PIT.
Format: { pit | tsc | cyclone | pmtmr }
disable_8254_timer
@ -430,9 +439,6 @@ running once the system is up.
Format: <area>[,<node>]
See also Documentation/networking/decnet.txt.
devfs= [DEVFS]
See Documentation/filesystems/devfs/boot-options.
dhash_entries= [KNL]
Set number of hash buckets for dentry cache.
@ -1614,6 +1620,10 @@ running once the system is up.
time Show timing data prefixed to each printk message line
clocksource= [GENERIC_TIME] Override the default clocksource
Override the default clocksource and use the clocksource
with the name specified.
tipar.timeout= [HW,PPT]
Set communications timeout in tenths of a second
(default 15).
@ -1655,6 +1665,10 @@ running once the system is up.
usbhid.mousepoll=
[USBHID] The interval which mice are to be polled at.
vdso= [IA-32]
vdso=1: enable VDSO (default)
vdso=0: disable VDSO mapping
video= [FB] Frame buffer configuration
See Documentation/fb/modedb.txt.
@ -1671,9 +1685,14 @@ running once the system is up.
decrease the size and leave more room for directly
mapped kernel RAM.
vmhalt= [KNL,S390]
vmhalt= [KNL,S390] Perform z/VM CP command after system halt.
Format: <command>
vmpoff= [KNL,S390]
vmpanic= [KNL,S390] Perform z/VM CP command after kernel panic.
Format: <command>
vmpoff= [KNL,S390] Perform z/VM CP command after power off.
Format: <command>
waveartist= [HW,OSS]
Format: <io>,<irq>,<dma>,<dma2>

View file

@ -3,16 +3,23 @@
===================
The key request service is part of the key retention service (refer to
Documentation/keys.txt). This document explains more fully how that the
requesting algorithm works.
Documentation/keys.txt). This document explains more fully how the requesting
algorithm works.
The process starts by either the kernel requesting a service by calling
request_key():
request_key*():
struct key *request_key(const struct key_type *type,
const char *description,
const char *callout_string);
or:
struct key *request_key_with_auxdata(const struct key_type *type,
const char *description,
const char *callout_string,
void *aux);
Or by userspace invoking the request_key system call:
key_serial_t request_key(const char *type,
@ -20,16 +27,26 @@ Or by userspace invoking the request_key system call:
const char *callout_info,
key_serial_t dest_keyring);
The main difference between the two access points is that the in-kernel
interface does not need to link the key to a keyring to prevent it from being
immediately destroyed. The kernel interface returns a pointer directly to the
key, and it's up to the caller to destroy the key.
The main difference between the access points is that the in-kernel interface
does not need to link the key to a keyring to prevent it from being immediately
destroyed. The kernel interface returns a pointer directly to the key, and
it's up to the caller to destroy the key.
The request_key_with_auxdata() call is like the in-kernel request_key() call,
except that it permits auxiliary data to be passed to the upcaller (the default
is NULL). This is only useful for those key types that define their own upcall
mechanism rather than using /sbin/request-key.
The userspace interface links the key to a keyring associated with the process
to prevent the key from going away, and returns the serial number of the key to
the caller.
The following example assumes that the key types involved don't define their
own upcall mechanisms. If they do, then those should be substituted for the
forking and execution of /sbin/request-key.
===========
THE PROCESS
===========
@ -40,8 +57,8 @@ A request proceeds in the following manner:
interface].
(2) request_key() searches the process's subscribed keyrings to see if there's
a suitable key there. If there is, it returns the key. If there isn't, and
callout_info is not set, an error is returned. Otherwise the process
a suitable key there. If there is, it returns the key. If there isn't,
and callout_info is not set, an error is returned. Otherwise the process
proceeds to the next step.
(3) request_key() sees that A doesn't have the desired key yet, so it creates
@ -62,7 +79,7 @@ A request proceeds in the following manner:
instantiation.
(7) The program may want to access another key from A's context (say a
Kerberos TGT key). It just requests the appropriate key, and the keyring
Kerberos TGT key). It just requests the appropriate key, and the keyring
search notes that the session keyring has auth key V in its bottom level.
This will permit it to then search the keyrings of process A with the
@ -79,10 +96,11 @@ A request proceeds in the following manner:
(10) The program then exits 0 and request_key() deletes key V and returns key
U to the caller.
This also extends further. If key W (step 7 above) didn't exist, key W would be
created uninstantiated, another auth key (X) would be created (as per step 3)
and another copy of /sbin/request-key spawned (as per step 4); but the context
specified by auth key X will still be process A, as it was in auth key V.
This also extends further. If key W (step 7 above) didn't exist, key W would
be created uninstantiated, another auth key (X) would be created (as per step
3) and another copy of /sbin/request-key spawned (as per step 4); but the
context specified by auth key X will still be process A, as it was in auth key
V.
This is because process A's keyrings can't simply be attached to
/sbin/request-key at the appropriate places because (a) execve will discard two
@ -118,17 +136,17 @@ A search of any particular keyring proceeds in the following fashion:
(2) It considers all the non-keyring keys within that keyring and, if any key
matches the criteria specified, calls key_permission(SEARCH) on it to see
if the key is allowed to be found. If it is, that key is returned; if
if the key is allowed to be found. If it is, that key is returned; if
not, the search continues, and the error code is retained if of higher
priority than the one currently set.
(3) It then considers all the keyring-type keys in the keyring it's currently
searching. It calls key_permission(SEARCH) on each keyring, and if this
searching. It calls key_permission(SEARCH) on each keyring, and if this
grants permission, it recurses, executing steps (2) and (3) on that
keyring.
The process stops immediately a valid key is found with permission granted to
use it. Any error from a previous match attempt is discarded and the key is
use it. Any error from a previous match attempt is discarded and the key is
returned.
When search_process_keyrings() is invoked, it performs the following searches
@ -153,7 +171,7 @@ The moment one succeeds, all pending errors are discarded and the found key is
returned.
Only if all these fail does the whole thing fail with the highest priority
error. Note that several errors may have come from LSM.
error. Note that several errors may have come from LSM.
The error priority is:

View file

@ -19,6 +19,7 @@ This document has the following sections:
- Key overview
- Key service overview
- Key access permissions
- SELinux support
- New procfs files
- Userspace system call interface
- Kernel services
@ -232,6 +233,39 @@ For changing the ownership, group ID or permissions mask, being the owner of
the key or having the sysadmin capability is sufficient.
===============
SELINUX SUPPORT
===============
The security class "key" has been added to SELinux so that mandatory access
controls can be applied to keys created within various contexts. This support
is preliminary, and is likely to change quite significantly in the near future.
Currently, all of the basic permissions explained above are provided in SELinux
as well; SELinux is simply invoked after all basic permission checks have been
performed.
The value of the file /proc/self/attr/keycreate influences the labeling of
newly-created keys. If the contents of that file correspond to an SELinux
security context, then the key will be assigned that context. Otherwise, the
key will be assigned the current context of the task that invoked the key
creation request. Tasks must be granted explicit permission to assign a
particular context to newly-created keys, using the "create" permission in the
key security class.
The default keyrings associated with users will be labeled with the default
context of the user if and only if the login programs have been instrumented to
properly initialize keycreate during the login process. Otherwise, they will
be labeled with the context of the login program itself.
Note, however, that the default keyrings associated with the root user are
labeled with the default kernel context, since they are created early in the
boot process, before root has a chance to log in.
The keyrings associated with new threads are each labeled with the context of
their associated thread, and both session and process keyrings are handled
similarly.
================
NEW PROCFS FILES
================
@ -241,9 +275,17 @@ about the status of the key service:
(*) /proc/keys
This lists all the keys on the system, giving information about their
type, description and permissions. The payload of the key is not available
this way:
This lists the keys that are currently viewable by the task reading the
file, giving information about their type, description and permissions.
It is not possible to view the payload of the key this way, though some
information about it may be given.
The only keys included in the list are those that grant View permission to
the reading process whether or not it possesses them. Note that LSM
security checks are still performed, and may further filter out keys that
the current process is not authorised to view.
The contents of the file look like this:
SERIAL FLAGS USAGE EXPY PERM UID GID TYPE DESCRIPTION: SUMMARY
00000001 I----- 39 perm 1f3f0000 0 0 keyring _uid_ses.0: 1/4
@ -271,7 +313,7 @@ about the status of the key service:
(*) /proc/key-users
This file lists the tracking data for each user that has at least one key
on the system. Such data includes quota information and statistics:
on the system. Such data includes quota information and statistics:
[root@andromeda root]# cat /proc/key-users
0: 46 45/45 1/100 13/10000
@ -738,6 +780,17 @@ payload contents" for more information.
See also Documentation/keys-request-key.txt.
(*) To search for a key, passing auxiliary data to the upcaller, call:
struct key *request_key_with_auxdata(const struct key_type *type,
const char *description,
const char *callout_string,
void *aux);
This is identical to request_key(), except that the auxiliary data is
passed to the key_type->request_key() op if it exists.
(*) When it is no longer required, the key should be released using:
void key_put(struct key *key);
@ -935,6 +988,16 @@ The structure has a number of fields, some of which are mandatory:
It is not safe to sleep in this method; the caller may hold spinlocks.
(*) void (*revoke)(struct key *key);
This method is optional. It is called to discard part of the payload
data upon a key being revoked. The caller will have the key semaphore
write-locked.
It is safe to sleep in this method, though care should be taken to avoid
a deadlock against the key semaphore.
(*) void (*destroy)(struct key *key);
This method is optional. It is called to discard the payload data on a key
@ -979,6 +1042,24 @@ The structure has a number of fields, some of which are mandatory:
as might happen when the userspace buffer is accessed.
(*) int (*request_key)(struct key *key, struct key *authkey, const char *op,
void *aux);
This method is optional. If provided, request_key() and
request_key_with_auxdata() will invoke this function rather than
upcalling to /sbin/request-key to operate upon a key of this type.
The aux parameter is as passed to request_key_with_auxdata() or is NULL
otherwise. Also passed are the key to be operated upon, the
authorisation key for this operation and the operation type (currently
only "create").
This function should return only when the upcall is complete. Upon return
the authorisation key will be revoked, and the target key will be
negatively instantiated if it is still uninstantiated. The error will be
returned to the caller of request_key*().
============================
REQUEST-KEY CALLBACK SERVICE
============================

View file

@ -200,6 +200,17 @@ All md devices contain:
This can be written only while the array is being assembled, not
after it is started.
layout
The "layout" for the array for the particular level. This is
simply a number that is interpretted differently by different
levels. It can be written while assembling an array.
resync_start
The point at which resync should start. If no resync is needed,
this will be a very large number. At array creation it will
default to 0, though starting the array as 'clean' will
set it much larger.
new_dev
This file can be written but not read. The value written should
be a block device number as major:minor. e.g. 8:0
@ -207,6 +218,54 @@ All md devices contain:
available. It will then appear at md/dev-XXX (depending on the
name of the device) and further configuration is then possible.
safe_mode_delay
When an md array has seen no write requests for a certain period
of time, it will be marked as 'clean'. When another write
request arrive, the array is marked as 'dirty' before the write
commenses. This is known as 'safe_mode'.
The 'certain period' is controlled by this file which stores the
period as a number of seconds. The default is 200msec (0.200).
Writing a value of 0 disables safemode.
array_state
This file contains a single word which describes the current
state of the array. In many cases, the state can be set by
writing the word for the desired state, however some states
cannot be explicitly set, and some transitions are not allowed.
clear
No devices, no size, no level
Writing is equivalent to STOP_ARRAY ioctl
inactive
May have some settings, but array is not active
all IO results in error
When written, doesn't tear down array, but just stops it
suspended (not supported yet)
All IO requests will block. The array can be reconfigured.
Writing this, if accepted, will block until array is quiessent
readonly
no resync can happen. no superblocks get written.
write requests fail
read-auto
like readonly, but behaves like 'clean' on a write request.
clean - no pending writes, but otherwise active.
When written to inactive array, starts without resync
If a write request arrives then
if metadata is known, mark 'dirty' and switch to 'active'.
if not known, block and switch to write-pending
If written to an active array that has pending writes, then fails.
active
fully active: IO and resync can be happening.
When written to inactive array, starts with resync
write-pending
clean, but writes are blocked waiting for 'active' to be written.
active-idle
like active, but no writes have been seen for a while (safe_mode_delay).
sync_speed_min
sync_speed_max
This are similar to /proc/sys/dev/raid/speed_limit_{min,max}
@ -250,10 +309,18 @@ Each directory contains:
faulty - device has been kicked from active use due to
a detected fault
in_sync - device is a fully in-sync member of the array
writemostly - device will only be subject to read
requests if there are no other options.
This applies only to raid1 arrays.
spare - device is working, but not a full member.
This includes spares that are in the process
of being recoverred to
This list make grow in future.
This can be written to.
Writing "faulty" simulates a failure on the device.
Writing "remove" removes the device from the array.
Writing "writemostly" sets the writemostly flag.
Writing "-writemostly" clears the writemostly flag.
errors
An approximate count of read errors that have been detected on

View file

@ -262,9 +262,14 @@ What is required is some way of intervening to instruct the compiler and the
CPU to restrict the order.
Memory barriers are such interventions. They impose a perceived partial
ordering between the memory operations specified on either side of the barrier.
They request that the sequence of memory events generated appears to other
parts of the system as if the barrier is effective on that CPU.
ordering over the memory operations on either side of the barrier.
Such enforcement is important because the CPUs and other devices in a system
can use a variety of tricks to improve performance - including reordering,
deferral and combination of memory operations; speculative loads; speculative
branch prediction and various types of caching. Memory barriers are used to
override or suppress these tricks, allowing the code to sanely control the
interaction of multiple CPUs and/or devices.
VARIETIES OF MEMORY BARRIER
@ -282,7 +287,7 @@ Memory barriers come in four basic varieties:
A write barrier is a partial ordering on stores only; it is not required
to have any effect on loads.
A CPU can be viewed as as commiting a sequence of store operations to the
A CPU can be viewed as committing a sequence of store operations to the
memory system as time progresses. All stores before a write barrier will
occur in the sequence _before_ all the stores after the write barrier.
@ -413,7 +418,7 @@ There are certain things that the Linux kernel memory barriers do not guarantee:
indirect effect will be the order in which the second CPU sees the effects
of the first CPU's accesses occur, but see the next point:
(*) There is no guarantee that the a CPU will see the correct order of effects
(*) There is no guarantee that a CPU will see the correct order of effects
from a second CPU's accesses, even _if_ the second CPU uses a memory
barrier, unless the first CPU _also_ uses a matching memory barrier (see
the subsection on "SMP Barrier Pairing").
@ -461,8 +466,8 @@ Whilst this may seem like a failure of coherency or causality maintenance, it
isn't, and this behaviour can be observed on certain real CPUs (such as the DEC
Alpha).
To deal with this, a data dependency barrier must be inserted between the
address load and the data load:
To deal with this, a data dependency barrier or better must be inserted
between the address load and the data load:
CPU 1 CPU 2
=============== ===============
@ -484,7 +489,7 @@ lines. The pointer P might be stored in an odd-numbered cache line, and the
variable B might be stored in an even-numbered cache line. Then, if the
even-numbered bank of the reading CPU's cache is extremely busy while the
odd-numbered bank is idle, one can see the new value of the pointer P (&B),
but the old value of the variable B (1).
but the old value of the variable B (2).
Another example of where data dependency barriers might by required is where a
@ -597,7 +602,7 @@ Consider the following sequence of events:
This sequence of events is committed to the memory coherence system in an order
that the rest of the system might perceive as the unordered set of { STORE A,
STORE B, STORE C } all occuring before the unordered set of { STORE D, STORE E
STORE B, STORE C } all occurring before the unordered set of { STORE D, STORE E
}:
+-------+ : :
@ -744,7 +749,7 @@ some effectively random order, despite the write barrier issued by CPU 1:
: :
If, however, a read barrier were to be placed between the load of E and the
If, however, a read barrier were to be placed between the load of B and the
load of A on CPU 2:
CPU 1 CPU 2
@ -1461,9 +1466,8 @@ instruction itself is complete.
On a UP system - where this wouldn't be a problem - the smp_mb() is just a
compiler barrier, thus making sure the compiler emits the instructions in the
right order without actually intervening in the CPU. Since there there's only
one CPU, that CPU's dependency ordering logic will take care of everything
else.
right order without actually intervening in the CPU. Since there's only one
CPU, that CPU's dependency ordering logic will take care of everything else.
ATOMIC OPERATIONS
@ -1640,9 +1644,9 @@ functions:
The PCI bus, amongst others, defines an I/O space concept - which on such
CPUs as i386 and x86_64 cpus readily maps to the CPU's concept of I/O
space. However, it may also mapped as a virtual I/O space in the CPU's
memory map, particularly on those CPUs that don't support alternate
I/O spaces.
space. However, it may also be mapped as a virtual I/O space in the CPU's
memory map, particularly on those CPUs that don't support alternate I/O
spaces.
Accesses to this space may be fully synchronous (as on i386), but
intermediary bridges (such as the PCI host bridge) may not fully honour

View file

@ -74,7 +74,7 @@ Examples:
pgset "pkt_size 9014" sets packet size to 9014
pgset "frags 5" packet will consist of 5 fragments
pgset "count 200000" sets number of packets to send, set to zero
for continious sends untill explicitl stopped.
for continuous sends until explicitly stopped.
pgset "delay 5000" adds delay to hard_start_xmit(). nanoseconds

View file

@ -39,10 +39,13 @@ Copyright (C) 1999-2000 Maxim Krasnyansky <max_mk@yahoo.com>
mknod /dev/net/tun c 10 200
Set permissions:
e.g. chmod 0700 /dev/net/tun
if you want the device only accessible by root. Giving regular users the
right to assign network devices is NOT a good idea. Users could assign
bogus network interfaces to trick firewalls or administrators.
e.g. chmod 0666 /dev/net/tun
There's no harm in allowing the device to be accessible by non-root users,
since CAP_NET_ADMIN is required for creating network devices or for
connecting to network devices which aren't owned by the user in question.
If you want to create persistent devices and give ownership of them to
unprivileged users, then you need the /dev/net/tun device to be usable by
those users.
Driver module autoloading

View file

@ -213,11 +213,19 @@ have been remapped by the kernel.
See Documentation/IO-mapping.txt for how to access device memory.
You still need to call request_region() for I/O regions and
request_mem_region() for memory regions to make sure nobody else is using the
same device.
The device driver needs to call pci_request_region() to make sure
no other device is already using the same resource. The driver is expected
to determine MMIO and IO Port resource availability _before_ calling
pci_enable_device(). Conversely, drivers should call pci_release_region()
_after_ calling pci_disable_device(). The idea is to prevent two devices
colliding on the same address range.
All interrupt handlers should be registered with SA_SHIRQ and use the devid
Generic flavors of pci_request_region() are request_mem_region()
(for MMIO ranges) and request_region() (for IO Port ranges).
Use these for address resources that are not described by "normal" PCI
interfaces (e.g. BAR).
All interrupt handlers should be registered with IRQF_SHARED and use the devid
to map IRQs to devices (remember that all PCI interrupts are shared).

View file

@ -0,0 +1,32 @@
/* crc32hash.c - derived from linux/lib/crc32.c, GNU GPL v2 */
/* Usage example:
$ ./crc32hash "Dual Speed"
*/
#include <string.h>
#include <stdio.h>
#include <ctype.h>
#include <stdlib.h>
unsigned int crc32(unsigned char const *p, unsigned int len)
{
int i;
unsigned int crc = 0;
while (len--) {
crc ^= *p++;
for (i = 0; i < 8; i++)
crc = (crc >> 1) ^ ((crc & 1) ? 0xedb88320 : 0);
}
return crc;
}
int main(int argc, char **argv) {
unsigned int result;
if (argc != 2) {
printf("no string passed as argument\n");
return -1;
}
result = crc32(argv[1], strlen(argv[1]));
printf("0x%x\n", result);
return 0;
}

View file

@ -27,37 +27,7 @@ pcmcia:m0149cC1ABf06pfn00fn00pa725B842DpbF1EFEE84pc0877B627pd00000000
The hex value after "pa" is the hash of product ID string 1, after "pb" for
string 2 and so on.
Alternatively, you can use this small tool to determine the crc32 hash.
simply pass the string you want to evaluate as argument to this program,
e.g.
Alternatively, you can use crc32hash (see Documentation/pcmcia/crc32hash.c)
to determine the crc32 hash. Simply pass the string you want to evaluate
as argument to this program, e.g.:
$ ./crc32hash "Dual Speed"
-------------------------------------------------------------------------
/* crc32hash.c - derived from linux/lib/crc32.c, GNU GPL v2 */
#include <string.h>
#include <stdio.h>
#include <ctype.h>
#include <stdlib.h>
unsigned int crc32(unsigned char const *p, unsigned int len)
{
int i;
unsigned int crc = 0;
while (len--) {
crc ^= *p++;
for (i = 0; i < 8; i++)
crc = (crc >> 1) ^ ((crc & 1) ? 0xedb88320 : 0);
}
return crc;
}
int main(int argc, char **argv) {
unsigned int result;
if (argc != 2) {
printf("no string passed as argument\n");
return -1;
}
result = crc32(argv[1], strlen(argv[1]));
printf("0x%x\n", result);
return 0;
}

121
Documentation/pi-futex.txt Normal file
View file

@ -0,0 +1,121 @@
Lightweight PI-futexes
----------------------
We are calling them lightweight for 3 reasons:
- in the user-space fastpath a PI-enabled futex involves no kernel work
(or any other PI complexity) at all. No registration, no extra kernel
calls - just pure fast atomic ops in userspace.
- even in the slowpath, the system call and scheduling pattern is very
similar to normal futexes.
- the in-kernel PI implementation is streamlined around the mutex
abstraction, with strict rules that keep the implementation
relatively simple: only a single owner may own a lock (i.e. no
read-write lock support), only the owner may unlock a lock, no
recursive locking, etc.
Priority Inheritance - why?
---------------------------
The short reply: user-space PI helps achieving/improving determinism for
user-space applications. In the best-case, it can help achieve
determinism and well-bound latencies. Even in the worst-case, PI will
improve the statistical distribution of locking related application
delays.
The longer reply:
-----------------
Firstly, sharing locks between multiple tasks is a common programming
technique that often cannot be replaced with lockless algorithms. As we
can see it in the kernel [which is a quite complex program in itself],
lockless structures are rather the exception than the norm - the current
ratio of lockless vs. locky code for shared data structures is somewhere
between 1:10 and 1:100. Lockless is hard, and the complexity of lockless
algorithms often endangers to ability to do robust reviews of said code.
I.e. critical RT apps often choose lock structures to protect critical
data structures, instead of lockless algorithms. Furthermore, there are
cases (like shared hardware, or other resource limits) where lockless
access is mathematically impossible.
Media players (such as Jack) are an example of reasonable application
design with multiple tasks (with multiple priority levels) sharing
short-held locks: for example, a highprio audio playback thread is
combined with medium-prio construct-audio-data threads and low-prio
display-colory-stuff threads. Add video and decoding to the mix and
we've got even more priority levels.
So once we accept that synchronization objects (locks) are an
unavoidable fact of life, and once we accept that multi-task userspace
apps have a very fair expectation of being able to use locks, we've got
to think about how to offer the option of a deterministic locking
implementation to user-space.
Most of the technical counter-arguments against doing priority
inheritance only apply to kernel-space locks. But user-space locks are
different, there we cannot disable interrupts or make the task
non-preemptible in a critical section, so the 'use spinlocks' argument
does not apply (user-space spinlocks have the same priority inversion
problems as other user-space locking constructs). Fact is, pretty much
the only technique that currently enables good determinism for userspace
locks (such as futex-based pthread mutexes) is priority inheritance:
Currently (without PI), if a high-prio and a low-prio task shares a lock
[this is a quite common scenario for most non-trivial RT applications],
even if all critical sections are coded carefully to be deterministic
(i.e. all critical sections are short in duration and only execute a
limited number of instructions), the kernel cannot guarantee any
deterministic execution of the high-prio task: any medium-priority task
could preempt the low-prio task while it holds the shared lock and
executes the critical section, and could delay it indefinitely.
Implementation:
---------------
As mentioned before, the userspace fastpath of PI-enabled pthread
mutexes involves no kernel work at all - they behave quite similarly to
normal futex-based locks: a 0 value means unlocked, and a value==TID
means locked. (This is the same method as used by list-based robust
futexes.) Userspace uses atomic ops to lock/unlock these mutexes without
entering the kernel.
To handle the slowpath, we have added two new futex ops:
FUTEX_LOCK_PI
FUTEX_UNLOCK_PI
If the lock-acquire fastpath fails, [i.e. an atomic transition from 0 to
TID fails], then FUTEX_LOCK_PI is called. The kernel does all the
remaining work: if there is no futex-queue attached to the futex address
yet then the code looks up the task that owns the futex [it has put its
own TID into the futex value], and attaches a 'PI state' structure to
the futex-queue. The pi_state includes an rt-mutex, which is a PI-aware,
kernel-based synchronization object. The 'other' task is made the owner
of the rt-mutex, and the FUTEX_WAITERS bit is atomically set in the
futex value. Then this task tries to lock the rt-mutex, on which it
blocks. Once it returns, it has the mutex acquired, and it sets the
futex value to its own TID and returns. Userspace has no other work to
perform - it now owns the lock, and futex value contains
FUTEX_WAITERS|TID.
If the unlock side fastpath succeeds, [i.e. userspace manages to do a
TID -> 0 atomic transition of the futex value], then no kernel work is
triggered.
If the unlock fastpath fails (because the FUTEX_WAITERS bit is set),
then FUTEX_UNLOCK_PI is called, and the kernel unlocks the futex on the
behalf of userspace - and it also unlocks the attached
pi_state->rt_mutex and thus wakes up any potential waiters.
Note that under this approach, contrary to previous PI-futex approaches,
there is no prior 'registration' of a PI-futex. [which is not quite
possible anyway, due to existing ABI properties of pthread mutexes.]
Also, under this scheme, 'robustness' and 'PI' are two orthogonal
properties of futexes, and all four combinations are possible: futex,
robust-futex, PI-futex, robust+PI-futex.
More details about priority inheritance can be found in
Documentation/rtmutex.txt.

View file

@ -118,96 +118,6 @@ will fail.
There is currently no way to know what states a device or driver
supports a priori. This will change in the future.
pm_message_t meaning
pm_message_t has two fields. event ("major"), and flags. If driver
does not know event code, it aborts the request, returning error. Some
drivers may need to deal with special cases based on the actual type
of suspend operation being done at the system level. This is why
there are flags.
Event codes are:
ON -- no need to do anything except special cases like broken
HW.
# NOTIFICATION -- pretty much same as ON?
FREEZE -- stop DMA and interrupts, and be prepared to reinit HW from
scratch. That probably means stop accepting upstream requests, the
actual policy of what to do with them beeing specific to a given
driver. It's acceptable for a network driver to just drop packets
while a block driver is expected to block the queue so no request is
lost. (Use IDE as an example on how to do that). FREEZE requires no
power state change, and it's expected for drivers to be able to
quickly transition back to operating state.
SUSPEND -- like FREEZE, but also put hardware into low-power state. If
there's need to distinguish several levels of sleep, additional flag
is probably best way to do that.
Transitions are only from a resumed state to a suspended state, never
between 2 suspended states. (ON -> FREEZE or ON -> SUSPEND can happen,
FREEZE -> SUSPEND or SUSPEND -> FREEZE can not).
All events are:
[NOTE NOTE NOTE: If you are driver author, you should not care; you
should only look at event, and ignore flags.]
#Prepare for suspend -- userland is still running but we are going to
#enter suspend state. This gives drivers chance to load firmware from
#disk and store it in memory, or do other activities taht require
#operating userland, ability to kmalloc GFP_KERNEL, etc... All of these
#are forbiden once the suspend dance is started.. event = ON, flags =
#PREPARE_TO_SUSPEND
Apm standby -- prepare for APM event. Quiesce devices to make life
easier for APM BIOS. event = FREEZE, flags = APM_STANDBY
Apm suspend -- same as APM_STANDBY, but it we should probably avoid
spinning down disks. event = FREEZE, flags = APM_SUSPEND
System halt, reboot -- quiesce devices to make life easier for BIOS. event
= FREEZE, flags = SYSTEM_HALT or SYSTEM_REBOOT
System shutdown -- at least disks need to be spun down, or data may be
lost. Quiesce devices, just to make life easier for BIOS. event =
FREEZE, flags = SYSTEM_SHUTDOWN
Kexec -- turn off DMAs and put hardware into some state where new
kernel can take over. event = FREEZE, flags = KEXEC
Powerdown at end of swsusp -- very similar to SYSTEM_SHUTDOWN, except wake
may need to be enabled on some devices. This actually has at least 3
subtypes, system can reboot, enter S4 and enter S5 at the end of
swsusp. event = FREEZE, flags = SWSUSP and one of SYSTEM_REBOOT,
SYSTEM_SHUTDOWN, SYSTEM_S4
Suspend to ram -- put devices into low power state. event = SUSPEND,
flags = SUSPEND_TO_RAM
Freeze for swsusp snapshot -- stop DMA and interrupts. No need to put
devices into low power mode, but you must be able to reinitialize
device from scratch in resume method. This has two flavors, its done
once on suspending kernel, once on resuming kernel. event = FREEZE,
flags = DURING_SUSPEND or DURING_RESUME
Device detach requested from /sys -- deinitialize device; proably same as
SYSTEM_SHUTDOWN, I do not understand this one too much. probably event
= FREEZE, flags = DEV_DETACH.
#These are not really events sent:
#
#System fully on -- device is working normally; this is probably never
#passed to suspend() method... event = ON, flags = 0
#
#Ready after resume -- userland is now running, again. Time to free any
#memory you ate during prepare to suspend... event = ON, flags =
#READY_AFTER_RESUME
#
pm_message_t meaning
pm_message_t has two fields. event ("major"), and flags. If driver

View file

@ -18,10 +18,11 @@ Some warnings, first.
*
* (*) suspend/resume support is needed to make it safe.
*
* If you have any filesystems on USB devices mounted before suspend,
* If you have any filesystems on USB devices mounted before software suspend,
* they won't be accessible after resume and you may lose data, as though
* you have unplugged the USB devices with mounted filesystems on them
* (see the FAQ below for details).
* you have unplugged the USB devices with mounted filesystems on them;
* see the FAQ below for details. (This is not true for more traditional
* power states like "standby", which normally don't turn USB off.)
You need to append resume=/dev/your_swap_partition to kernel command
line. Then you suspend by
@ -204,7 +205,7 @@ Q: There don't seem to be any generally useful behavioral
distinctions between SUSPEND and FREEZE.
A: Doing SUSPEND when you are asked to do FREEZE is always correct,
but it may be unneccessarily slow. If you want USB to stay simple,
but it may be unneccessarily slow. If you want your driver to stay simple,
slowness may not matter to you. It can always be fixed later.
For devices like disk it does matter, you do not want to spindown for
@ -349,25 +350,72 @@ Q: How do I make suspend more verbose?
A: If you want to see any non-error kernel messages on the virtual
terminal the kernel switches to during suspend, you have to set the
kernel console loglevel to at least 5, for example by doing
kernel console loglevel to at least 4 (KERN_WARNING), for example by
doing
echo 5 > /proc/sys/kernel/printk
# save the old loglevel
read LOGLEVEL DUMMY < /proc/sys/kernel/printk
# set the loglevel so we see the progress bar.
# if the level is higher than needed, we leave it alone.
if [ $LOGLEVEL -lt 5 ]; then
echo 5 > /proc/sys/kernel/printk
fi
IMG_SZ=0
read IMG_SZ < /sys/power/image_size
echo -n disk > /sys/power/state
RET=$?
#
# the logic here is:
# if image_size > 0 (without kernel support, IMG_SZ will be zero),
# then try again with image_size set to zero.
if [ $RET -ne 0 -a $IMG_SZ -ne 0 ]; then # try again with minimal image size
echo 0 > /sys/power/image_size
echo -n disk > /sys/power/state
RET=$?
fi
# restore previous loglevel
echo $LOGLEVEL > /proc/sys/kernel/printk
exit $RET
Q: Is this true that if I have a mounted filesystem on a USB device and
I suspend to disk, I can lose data unless the filesystem has been mounted
with "sync"?
A: That's right. It depends on your hardware, and it could be true even for
suspend-to-RAM. In fact, even with "-o sync" you can lose data if your
programs have information in buffers they haven't written out to disk.
A: That's right ... if you disconnect that device, you may lose data.
In fact, even with "-o sync" you can lose data if your programs have
information in buffers they haven't written out to a disk you disconnect,
or if you disconnect before the device finished saving data you wrote.
If you're lucky, your hardware will support low-power modes for USB
controllers while the system is asleep. Lots of hardware doesn't,
however. Shutting off the power to a USB controller is equivalent to
unplugging all the attached devices.
Software suspend normally powers down USB controllers, which is equivalent
to disconnecting all USB devices attached to your system.
Your system might well support low-power modes for its USB controllers
while the system is asleep, maintaining the connection, using true sleep
modes like "suspend-to-RAM" or "standby". (Don't write "disk" to the
/sys/power/state file; write "standby" or "mem".) We've not seen any
hardware that can use these modes through software suspend, although in
theory some systems might support "platform" or "firmware" modes that
won't break the USB connections.
Remember that it's always a bad idea to unplug a disk drive containing a
mounted filesystem. With USB that's true even when your system is asleep!
The safest thing is to unmount all USB-based filesystems before suspending
and remount them after resuming.
mounted filesystem. That's true even when your system is asleep! The
safest thing is to unmount all filesystems on removable media (such USB,
Firewire, CompactFlash, MMC, external SATA, or even IDE hotplug bays)
before suspending; then remount them after resuming.
Q: I upgraded the kernel from 2.6.15 to 2.6.16. Both kernels were
compiled with the similar configuration files. Anyway I found that
suspend to disk (and resume) is much slower on 2.6.16 compared to
2.6.15. Any idea for why that might happen or how can I speed it up?
A: This is because the size of the suspend image is now greater than
for 2.6.15 (by saving more data we can get more responsive system
after resume).
There's the /sys/power/image_size knob that controls the size of the
image. If you set it to 0 (eg. by echo 0 > /sys/power/image_size as
root), the 2.6.15 behavior should be restored. If it is still too
slow, take a look at suspend.sf.net -- userland suspend is faster and
supports LZF compression to speed it up further.

View file

@ -90,6 +90,7 @@ Table of known working notebooks:
Model hack (or "how to do it")
------------------------------------------------------------------------------
Acer Aspire 1406LC ole's late BIOS init (7), turn off DRI
Acer TM 230 s3_bios (2)
Acer TM 242FX vbetool (6)
Acer TM C110 video_post (8)
Acer TM C300 vga=normal (only suspend on console, not in X), vbetool (6) or video_post (8)
@ -115,6 +116,7 @@ Dell D610 vga=normal and X (possibly vbestate (6) too, but not tested)
Dell Inspiron 4000 ??? (*)
Dell Inspiron 500m ??? (*)
Dell Inspiron 510m ???
Dell Inspiron 5150 vbetool needed (6)
Dell Inspiron 600m ??? (*)
Dell Inspiron 8200 ??? (*)
Dell Inspiron 8500 ??? (*)
@ -125,6 +127,7 @@ HP NX7000 ??? (*)
HP Pavilion ZD7000 vbetool post needed, need open-source nv driver for X
HP Omnibook XE3 athlon version none (1)
HP Omnibook XE3GC none (1), video is S3 Savage/IX-MV
HP Omnibook XE3L-GF vbetool (6)
HP Omnibook 5150 none (1), (S1 also works OK)
IBM TP T20, model 2647-44G none (1), video is S3 Inc. 86C270-294 Savage/IX-MV, vesafb gets "interesting" but X work.
IBM TP A31 / Type 2652-M5G s3_mode (3) [works ok with BIOS 1.04 2002-08-23, but not at all with BIOS 1.11 2004-11-05 :-(]
@ -157,6 +160,7 @@ Sony Vaio vgn-s260 X or boot-radeon can init it (5)
Sony Vaio vgn-S580BH vga=normal, but suspend from X. Console will be blank unless you return to X.
Sony Vaio vgn-FS115B s3_bios (2),s3_mode (4)
Toshiba Libretto L5 none (1)
Toshiba Libretto 100CT/110CT vbetool (6)
Toshiba Portege 3020CT s3_mode (3)
Toshiba Satellite 4030CDT s3_mode (3) (S1 also works OK)
Toshiba Satellite 4080XCDT s3_mode (3) (S1 also works OK)

View file

@ -95,7 +95,7 @@ comparison. If the thread has registered a list, then normally the list
is empty. If the thread/process crashed or terminated in some incorrect
way then the list might be non-empty: in this case the kernel carefully
walks the list [not trusting it], and marks all locks that are owned by
this thread with the FUTEX_OWNER_DEAD bit, and wakes up one waiter (if
this thread with the FUTEX_OWNER_DIED bit, and wakes up one waiter (if
any).
The list is guaranteed to be private and per-thread at do_exit() time,

View file

@ -0,0 +1,781 @@
#
# Copyright (c) 2006 Steven Rostedt
# Licensed under the GNU Free Documentation License, Version 1.2
#
RT-mutex implementation design
------------------------------
This document tries to describe the design of the rtmutex.c implementation.
It doesn't describe the reasons why rtmutex.c exists. For that please see
Documentation/rt-mutex.txt. Although this document does explain problems
that happen without this code, but that is in the concept to understand
what the code actually is doing.
The goal of this document is to help others understand the priority
inheritance (PI) algorithm that is used, as well as reasons for the
decisions that were made to implement PI in the manner that was done.
Unbounded Priority Inversion
----------------------------
Priority inversion is when a lower priority process executes while a higher
priority process wants to run. This happens for several reasons, and
most of the time it can't be helped. Anytime a high priority process wants
to use a resource that a lower priority process has (a mutex for example),
the high priority process must wait until the lower priority process is done
with the resource. This is a priority inversion. What we want to prevent
is something called unbounded priority inversion. That is when the high
priority process is prevented from running by a lower priority process for
an undetermined amount of time.
The classic example of unbounded priority inversion is were you have three
processes, let's call them processes A, B, and C, where A is the highest
priority process, C is the lowest, and B is in between. A tries to grab a lock
that C owns and must wait and lets C run to release the lock. But in the
meantime, B executes, and since B is of a higher priority than C, it preempts C,
but by doing so, it is in fact preempting A which is a higher priority process.
Now there's no way of knowing how long A will be sleeping waiting for C
to release the lock, because for all we know, B is a CPU hog and will
never give C a chance to release the lock. This is called unbounded priority
inversion.
Here's a little ASCII art to show the problem.
grab lock L1 (owned by C)
|
A ---+
C preempted by B
|
C +----+
B +-------->
B now keeps A from running.
Priority Inheritance (PI)
-------------------------
There are several ways to solve this issue, but other ways are out of scope
for this document. Here we only discuss PI.
PI is where a process inherits the priority of another process if the other
process blocks on a lock owned by the current process. To make this easier
to understand, let's use the previous example, with processes A, B, and C again.
This time, when A blocks on the lock owned by C, C would inherit the priority
of A. So now if B becomes runnable, it would not preempt C, since C now has
the high priority of A. As soon as C releases the lock, it loses its
inherited priority, and A then can continue with the resource that C had.
Terminology
-----------
Here I explain some terminology that is used in this document to help describe
the design that is used to implement PI.
PI chain - The PI chain is an ordered series of locks and processes that cause
processes to inherit priorities from a previous process that is
blocked on one of its locks. This is described in more detail
later in this document.
mutex - In this document, to differentiate from locks that implement
PI and spin locks that are used in the PI code, from now on
the PI locks will be called a mutex.
lock - In this document from now on, I will use the term lock when
referring to spin locks that are used to protect parts of the PI
algorithm. These locks disable preemption for UP (when
CONFIG_PREEMPT is enabled) and on SMP prevents multiple CPUs from
entering critical sections simultaneously.
spin lock - Same as lock above.
waiter - A waiter is a struct that is stored on the stack of a blocked
process. Since the scope of the waiter is within the code for
a process being blocked on the mutex, it is fine to allocate
the waiter on the process's stack (local variable). This
structure holds a pointer to the task, as well as the mutex that
the task is blocked on. It also has the plist node structures to
place the task in the waiter_list of a mutex as well as the
pi_list of a mutex owner task (described below).
waiter is sometimes used in reference to the task that is waiting
on a mutex. This is the same as waiter->task.
waiters - A list of processes that are blocked on a mutex.
top waiter - The highest priority process waiting on a specific mutex.
top pi waiter - The highest priority process waiting on one of the mutexes
that a specific process owns.
Note: task and process are used interchangeably in this document, mostly to
differentiate between two processes that are being described together.
PI chain
--------
The PI chain is a list of processes and mutexes that may cause priority
inheritance to take place. Multiple chains may converge, but a chain
would never diverge, since a process can't be blocked on more than one
mutex at a time.
Example:
Process: A, B, C, D, E
Mutexes: L1, L2, L3, L4
A owns: L1
B blocked on L1
B owns L2
C blocked on L2
C owns L3
D blocked on L3
D owns L4
E blocked on L4
The chain would be:
E->L4->D->L3->C->L2->B->L1->A
To show where two chains merge, we could add another process F and
another mutex L5 where B owns L5 and F is blocked on mutex L5.
The chain for F would be:
F->L5->B->L1->A
Since a process may own more than one mutex, but never be blocked on more than
one, the chains merge.
Here we show both chains:
E->L4->D->L3->C->L2-+
|
+->B->L1->A
|
F->L5-+
For PI to work, the processes at the right end of these chains (or we may
also call it the Top of the chain) must be equal to or higher in priority
than the processes to the left or below in the chain.
Also since a mutex may have more than one process blocked on it, we can
have multiple chains merge at mutexes. If we add another process G that is
blocked on mutex L2:
G->L2->B->L1->A
And once again, to show how this can grow I will show the merging chains
again.
E->L4->D->L3->C-+
+->L2-+
| |
G-+ +->B->L1->A
|
F->L5-+
Plist
-----
Before I go further and talk about how the PI chain is stored through lists
on both mutexes and processes, I'll explain the plist. This is similar to
the struct list_head functionality that is already in the kernel.
The implementation of plist is out of scope for this document, but it is
very important to understand what it does.
There are a few differences between plist and list, the most important one
being that plist is a priority sorted linked list. This means that the
priorities of the plist are sorted, such that it takes O(1) to retrieve the
highest priority item in the list. Obviously this is useful to store processes
based on their priorities.
Another difference, which is important for implementation, is that, unlike
list, the head of the list is a different element than the nodes of a list.
So the head of the list is declared as struct plist_head and nodes that will
be added to the list are declared as struct plist_node.
Mutex Waiter List
-----------------
Every mutex keeps track of all the waiters that are blocked on itself. The mutex
has a plist to store these waiters by priority. This list is protected by
a spin lock that is located in the struct of the mutex. This lock is called
wait_lock. Since the modification of the waiter list is never done in
interrupt context, the wait_lock can be taken without disabling interrupts.
Task PI List
------------
To keep track of the PI chains, each process has its own PI list. This is
a list of all top waiters of the mutexes that are owned by the process.
Note that this list only holds the top waiters and not all waiters that are
blocked on mutexes owned by the process.
The top of the task's PI list is always the highest priority task that
is waiting on a mutex that is owned by the task. So if the task has
inherited a priority, it will always be the priority of the task that is
at the top of this list.
This list is stored in the task structure of a process as a plist called
pi_list. This list is protected by a spin lock also in the task structure,
called pi_lock. This lock may also be taken in interrupt context, so when
locking the pi_lock, interrupts must be disabled.
Depth of the PI Chain
---------------------
The maximum depth of the PI chain is not dynamic, and could actually be
defined. But is very complex to figure it out, since it depends on all
the nesting of mutexes. Let's look at the example where we have 3 mutexes,
L1, L2, and L3, and four separate functions func1, func2, func3 and func4.
The following shows a locking order of L1->L2->L3, but may not actually
be directly nested that way.
void func1(void)
{
mutex_lock(L1);
/* do anything */
mutex_unlock(L1);
}
void func2(void)
{
mutex_lock(L1);
mutex_lock(L2);
/* do something */
mutex_unlock(L2);
mutex_unlock(L1);
}
void func3(void)
{
mutex_lock(L2);
mutex_lock(L3);
/* do something else */
mutex_unlock(L3);
mutex_unlock(L2);
}
void func4(void)
{
mutex_lock(L3);
/* do something again */
mutex_unlock(L3);
}
Now we add 4 processes that run each of these functions separately.
Processes A, B, C, and D which run functions func1, func2, func3 and func4
respectively, and such that D runs first and A last. With D being preempted
in func4 in the "do something again" area, we have a locking that follows:
D owns L3
C blocked on L3
C owns L2
B blocked on L2
B owns L1
A blocked on L1
And thus we have the chain A->L1->B->L2->C->L3->D.
This gives us a PI depth of 4 (four processes), but looking at any of the
functions individually, it seems as though they only have at most a locking
depth of two. So, although the locking depth is defined at compile time,
it still is very difficult to find the possibilities of that depth.
Now since mutexes can be defined by user-land applications, we don't want a DOS
type of application that nests large amounts of mutexes to create a large
PI chain, and have the code holding spin locks while looking at a large
amount of data. So to prevent this, the implementation not only implements
a maximum lock depth, but also only holds at most two different locks at a
time, as it walks the PI chain. More about this below.
Mutex owner and flags
---------------------
The mutex structure contains a pointer to the owner of the mutex. If the
mutex is not owned, this owner is set to NULL. Since all architectures
have the task structure on at least a four byte alignment (and if this is
not true, the rtmutex.c code will be broken!), this allows for the two
least significant bits to be used as flags. This part is also described
in Documentation/rt-mutex.txt, but will also be briefly described here.
Bit 0 is used as the "Pending Owner" flag. This is described later.
Bit 1 is used as the "Has Waiters" flags. This is also described later
in more detail, but is set whenever there are waiters on a mutex.
cmpxchg Tricks
--------------
Some architectures implement an atomic cmpxchg (Compare and Exchange). This
is used (when applicable) to keep the fast path of grabbing and releasing
mutexes short.
cmpxchg is basically the following function performed atomically:
unsigned long _cmpxchg(unsigned long *A, unsigned long *B, unsigned long *C)
{
unsigned long T = *A;
if (*A == *B) {
*A = *C;
}
return T;
}
#define cmpxchg(a,b,c) _cmpxchg(&a,&b,&c)
This is really nice to have, since it allows you to only update a variable
if the variable is what you expect it to be. You know if it succeeded if
the return value (the old value of A) is equal to B.
The macro rt_mutex_cmpxchg is used to try to lock and unlock mutexes. If
the architecture does not support CMPXCHG, then this macro is simply set
to fail every time. But if CMPXCHG is supported, then this will
help out extremely to keep the fast path short.
The use of rt_mutex_cmpxchg with the flags in the owner field help optimize
the system for architectures that support it. This will also be explained
later in this document.
Priority adjustments
--------------------
The implementation of the PI code in rtmutex.c has several places that a
process must adjust its priority. With the help of the pi_list of a
process this is rather easy to know what needs to be adjusted.
The functions implementing the task adjustments are rt_mutex_adjust_prio,
__rt_mutex_adjust_prio (same as the former, but expects the task pi_lock
to already be taken), rt_mutex_get_prio, and rt_mutex_setprio.
rt_mutex_getprio and rt_mutex_setprio are only used in __rt_mutex_adjust_prio.
rt_mutex_getprio returns the priority that the task should have. Either the
task's own normal priority, or if a process of a higher priority is waiting on
a mutex owned by the task, then that higher priority should be returned.
Since the pi_list of a task holds an order by priority list of all the top
waiters of all the mutexes that the task owns, rt_mutex_getprio simply needs
to compare the top pi waiter to its own normal priority, and return the higher
priority back.
(Note: if looking at the code, you will notice that the lower number of
prio is returned. This is because the prio field in the task structure
is an inverse order of the actual priority. So a "prio" of 5 is
of higher priority than a "prio" of 10.)
__rt_mutex_adjust_prio examines the result of rt_mutex_getprio, and if the
result does not equal the task's current priority, then rt_mutex_setprio
is called to adjust the priority of the task to the new priority.
Note that rt_mutex_setprio is defined in kernel/sched.c to implement the
actual change in priority.
It is interesting to note that __rt_mutex_adjust_prio can either increase
or decrease the priority of the task. In the case that a higher priority
process has just blocked on a mutex owned by the task, __rt_mutex_adjust_prio
would increase/boost the task's priority. But if a higher priority task
were for some reason to leave the mutex (timeout or signal), this same function
would decrease/unboost the priority of the task. That is because the pi_list
always contains the highest priority task that is waiting on a mutex owned
by the task, so we only need to compare the priority of that top pi waiter
to the normal priority of the given task.
High level overview of the PI chain walk
----------------------------------------
The PI chain walk is implemented by the function rt_mutex_adjust_prio_chain.
The implementation has gone through several iterations, and has ended up
with what we believe is the best. It walks the PI chain by only grabbing
at most two locks at a time, and is very efficient.
The rt_mutex_adjust_prio_chain can be used either to boost or lower process
priorities.
rt_mutex_adjust_prio_chain is called with a task to be checked for PI
(de)boosting (the owner of a mutex that a process is blocking on), a flag to
check for deadlocking, the mutex that the task owns, and a pointer to a waiter
that is the process's waiter struct that is blocked on the mutex (although this
parameter may be NULL for deboosting).
For this explanation, I will not mention deadlock detection. This explanation
will try to stay at a high level.
When this function is called, there are no locks held. That also means
that the state of the owner and lock can change when entered into this function.
Before this function is called, the task has already had rt_mutex_adjust_prio
performed on it. This means that the task is set to the priority that it
should be at, but the plist nodes of the task's waiter have not been updated
with the new priorities, and that this task may not be in the proper locations
in the pi_lists and wait_lists that the task is blocked on. This function
solves all that.
A loop is entered, where task is the owner to be checked for PI changes that
was passed by parameter (for the first iteration). The pi_lock of this task is
taken to prevent any more changes to the pi_list of the task. This also
prevents new tasks from completing the blocking on a mutex that is owned by this
task.
If the task is not blocked on a mutex then the loop is exited. We are at
the top of the PI chain.
A check is now done to see if the original waiter (the process that is blocked
on the current mutex) is the top pi waiter of the task. That is, is this
waiter on the top of the task's pi_list. If it is not, it either means that
there is another process higher in priority that is blocked on one of the
mutexes that the task owns, or that the waiter has just woken up via a signal
or timeout and has left the PI chain. In either case, the loop is exited, since
we don't need to do any more changes to the priority of the current task, or any
task that owns a mutex that this current task is waiting on. A priority chain
walk is only needed when a new top pi waiter is made to a task.
The next check sees if the task's waiter plist node has the priority equal to
the priority the task is set at. If they are equal, then we are done with
the loop. Remember that the function started with the priority of the
task adjusted, but the plist nodes that hold the task in other processes
pi_lists have not been adjusted.
Next, we look at the mutex that the task is blocked on. The mutex's wait_lock
is taken. This is done by a spin_trylock, because the locking order of the
pi_lock and wait_lock goes in the opposite direction. If we fail to grab the
lock, the pi_lock is released, and we restart the loop.
Now that we have both the pi_lock of the task as well as the wait_lock of
the mutex the task is blocked on, we update the task's waiter's plist node
that is located on the mutex's wait_list.
Now we release the pi_lock of the task.
Next the owner of the mutex has its pi_lock taken, so we can update the
task's entry in the owner's pi_list. If the task is the highest priority
process on the mutex's wait_list, then we remove the previous top waiter
from the owner's pi_list, and replace it with the task.
Note: It is possible that the task was the current top waiter on the mutex,
in which case the task is not yet on the pi_list of the waiter. This
is OK, since plist_del does nothing if the plist node is not on any
list.
If the task was not the top waiter of the mutex, but it was before we
did the priority updates, that means we are deboosting/lowering the
task. In this case, the task is removed from the pi_list of the owner,
and the new top waiter is added.
Lastly, we unlock both the pi_lock of the task, as well as the mutex's
wait_lock, and continue the loop again. On the next iteration of the
loop, the previous owner of the mutex will be the task that will be
processed.
Note: One might think that the owner of this mutex might have changed
since we just grab the mutex's wait_lock. And one could be right.
The important thing to remember is that the owner could not have
become the task that is being processed in the PI chain, since
we have taken that task's pi_lock at the beginning of the loop.
So as long as there is an owner of this mutex that is not the same
process as the tasked being worked on, we are OK.
Looking closely at the code, one might be confused. The check for the
end of the PI chain is when the task isn't blocked on anything or the
task's waiter structure "task" element is NULL. This check is
protected only by the task's pi_lock. But the code to unlock the mutex
sets the task's waiter structure "task" element to NULL with only
the protection of the mutex's wait_lock, which was not taken yet.
Isn't this a race condition if the task becomes the new owner?
The answer is No! The trick is the spin_trylock of the mutex's
wait_lock. If we fail that lock, we release the pi_lock of the
task and continue the loop, doing the end of PI chain check again.
In the code to release the lock, the wait_lock of the mutex is held
the entire time, and it is not let go when we grab the pi_lock of the
new owner of the mutex. So if the switch of a new owner were to happen
after the check for end of the PI chain and the grabbing of the
wait_lock, the unlocking code would spin on the new owner's pi_lock
but never give up the wait_lock. So the PI chain loop is guaranteed to
fail the spin_trylock on the wait_lock, release the pi_lock, and
try again.
If you don't quite understand the above, that's OK. You don't have to,
unless you really want to make a proof out of it ;)
Pending Owners and Lock stealing
--------------------------------
One of the flags in the owner field of the mutex structure is "Pending Owner".
What this means is that an owner was chosen by the process releasing the
mutex, but that owner has yet to wake up and actually take the mutex.
Why is this important? Why can't we just give the mutex to another process
and be done with it?
The PI code is to help with real-time processes, and to let the highest
priority process run as long as possible with little latencies and delays.
If a high priority process owns a mutex that a lower priority process is
blocked on, when the mutex is released it would be given to the lower priority
process. What if the higher priority process wants to take that mutex again.
The high priority process would fail to take that mutex that it just gave up
and it would need to boost the lower priority process to run with full
latency of that critical section (since the low priority process just entered
it).
There's no reason a high priority process that gives up a mutex should be
penalized if it tries to take that mutex again. If the new owner of the
mutex has not woken up yet, there's no reason that the higher priority process
could not take that mutex away.
To solve this, we introduced Pending Ownership and Lock Stealing. When a
new process is given a mutex that it was blocked on, it is only given
pending ownership. This means that it's the new owner, unless a higher
priority process comes in and tries to grab that mutex. If a higher priority
process does come along and wants that mutex, we let the higher priority
process "steal" the mutex from the pending owner (only if it is still pending)
and continue with the mutex.
Taking of a mutex (The walk through)
------------------------------------
OK, now let's take a look at the detailed walk through of what happens when
taking a mutex.
The first thing that is tried is the fast taking of the mutex. This is
done when we have CMPXCHG enabled (otherwise the fast taking automatically
fails). Only when the owner field of the mutex is NULL can the lock be
taken with the CMPXCHG and nothing else needs to be done.
If there is contention on the lock, whether it is owned or pending owner
we go about the slow path (rt_mutex_slowlock).
The slow path function is where the task's waiter structure is created on
the stack. This is because the waiter structure is only needed for the
scope of this function. The waiter structure holds the nodes to store
the task on the wait_list of the mutex, and if need be, the pi_list of
the owner.
The wait_lock of the mutex is taken since the slow path of unlocking the
mutex also takes this lock.
We then call try_to_take_rt_mutex. This is where the architecture that
does not implement CMPXCHG would always grab the lock (if there's no
contention).
try_to_take_rt_mutex is used every time the task tries to grab a mutex in the
slow path. The first thing that is done here is an atomic setting of
the "Has Waiters" flag of the mutex's owner field. Yes, this could really
be false, because if the the mutex has no owner, there are no waiters and
the current task also won't have any waiters. But we don't have the lock
yet, so we assume we are going to be a waiter. The reason for this is to
play nice for those architectures that do have CMPXCHG. By setting this flag
now, the owner of the mutex can't release the mutex without going into the
slow unlock path, and it would then need to grab the wait_lock, which this
code currently holds. So setting the "Has Waiters" flag forces the owner
to synchronize with this code.
Now that we know that we can't have any races with the owner releasing the
mutex, we check to see if we can take the ownership. This is done if the
mutex doesn't have a owner, or if we can steal the mutex from a pending
owner. Let's look at the situations we have here.
1) Has owner that is pending
----------------------------
The mutex has a owner, but it hasn't woken up and the mutex flag
"Pending Owner" is set. The first check is to see if the owner isn't the
current task. This is because this function is also used for the pending
owner to grab the mutex. When a pending owner wakes up, it checks to see
if it can take the mutex, and this is done if the owner is already set to
itself. If so, we succeed and leave the function, clearing the "Pending
Owner" bit.
If the pending owner is not current, we check to see if the current priority is
higher than the pending owner. If not, we fail the function and return.
There's also something special about a pending owner. That is a pending owner
is never blocked on a mutex. So there is no PI chain to worry about. It also
means that if the mutex doesn't have any waiters, there's no accounting needed
to update the pending owner's pi_list, since we only worry about processes
blocked on the current mutex.
If there are waiters on this mutex, and we just stole the ownership, we need
to take the top waiter, remove it from the pi_list of the pending owner, and
add it to the current pi_list. Note that at this moment, the pending owner
is no longer on the list of waiters. This is fine, since the pending owner
would add itself back when it realizes that it had the ownership stolen
from itself. When the pending owner tries to grab the mutex, it will fail
in try_to_take_rt_mutex if the owner field points to another process.
2) No owner
-----------
If there is no owner (or we successfully stole the lock), we set the owner
of the mutex to current, and set the flag of "Has Waiters" if the current
mutex actually has waiters, or we clear the flag if it doesn't. See, it was
OK that we set that flag early, since now it is cleared.
3) Failed to grab ownership
---------------------------
The most interesting case is when we fail to take ownership. This means that
there exists an owner, or there's a pending owner with equal or higher
priority than the current task.
We'll continue on the failed case.
If the mutex has a timeout, we set up a timer to go off to break us out
of this mutex if we failed to get it after a specified amount of time.
Now we enter a loop that will continue to try to take ownership of the mutex, or
fail from a timeout or signal.
Once again we try to take the mutex. This will usually fail the first time
in the loop, since it had just failed to get the mutex. But the second time
in the loop, this would likely succeed, since the task would likely be
the pending owner.
If the mutex is TASK_INTERRUPTIBLE a check for signals and timeout is done
here.
The waiter structure has a "task" field that points to the task that is blocked
on the mutex. This field can be NULL the first time it goes through the loop
or if the task is a pending owner and had it's mutex stolen. If the "task"
field is NULL then we need to set up the accounting for it.
Task blocks on mutex
--------------------
The accounting of a mutex and process is done with the waiter structure of
the process. The "task" field is set to the process, and the "lock" field
to the mutex. The plist nodes are initialized to the processes current
priority.
Since the wait_lock was taken at the entry of the slow lock, we can safely
add the waiter to the wait_list. If the current process is the highest
priority process currently waiting on this mutex, then we remove the
previous top waiter process (if it exists) from the pi_list of the owner,
and add the current process to that list. Since the pi_list of the owner
has changed, we call rt_mutex_adjust_prio on the owner to see if the owner
should adjust its priority accordingly.
If the owner is also blocked on a lock, and had its pi_list changed
(or deadlock checking is on), we unlock the wait_lock of the mutex and go ahead
and run rt_mutex_adjust_prio_chain on the owner, as described earlier.
Now all locks are released, and if the current process is still blocked on a
mutex (waiter "task" field is not NULL), then we go to sleep (call schedule).
Waking up in the loop
---------------------
The schedule can then wake up for a few reasons.
1) we were given pending ownership of the mutex.
2) we received a signal and was TASK_INTERRUPTIBLE
3) we had a timeout and was TASK_INTERRUPTIBLE
In any of these cases, we continue the loop and once again try to grab the
ownership of the mutex. If we succeed, we exit the loop, otherwise we continue
and on signal and timeout, will exit the loop, or if we had the mutex stolen
we just simply add ourselves back on the lists and go back to sleep.
Note: For various reasons, because of timeout and signals, the steal mutex
algorithm needs to be careful. This is because the current process is
still on the wait_list. And because of dynamic changing of priorities,
especially on SCHED_OTHER tasks, the current process can be the
highest priority task on the wait_list.
Failed to get mutex on Timeout or Signal
----------------------------------------
If a timeout or signal occurred, the waiter's "task" field would not be
NULL and the task needs to be taken off the wait_list of the mutex and perhaps
pi_list of the owner. If this process was a high priority process, then
the rt_mutex_adjust_prio_chain needs to be executed again on the owner,
but this time it will be lowering the priorities.
Unlocking the Mutex
-------------------
The unlocking of a mutex also has a fast path for those architectures with
CMPXCHG. Since the taking of a mutex on contention always sets the
"Has Waiters" flag of the mutex's owner, we use this to know if we need to
take the slow path when unlocking the mutex. If the mutex doesn't have any
waiters, the owner field of the mutex would equal the current process and
the mutex can be unlocked by just replacing the owner field with NULL.
If the owner field has the "Has Waiters" bit set (or CMPXCHG is not available),
the slow unlock path is taken.
The first thing done in the slow unlock path is to take the wait_lock of the
mutex. This synchronizes the locking and unlocking of the mutex.
A check is made to see if the mutex has waiters or not. On architectures that
do not have CMPXCHG, this is the location that the owner of the mutex will
determine if a waiter needs to be awoken or not. On architectures that
do have CMPXCHG, that check is done in the fast path, but it is still needed
in the slow path too. If a waiter of a mutex woke up because of a signal
or timeout between the time the owner failed the fast path CMPXCHG check and
the grabbing of the wait_lock, the mutex may not have any waiters, thus the
owner still needs to make this check. If there are no waiters than the mutex
owner field is set to NULL, the wait_lock is released and nothing more is
needed.
If there are waiters, then we need to wake one up and give that waiter
pending ownership.
On the wake up code, the pi_lock of the current owner is taken. The top
waiter of the lock is found and removed from the wait_list of the mutex
as well as the pi_list of the current owner. The task field of the new
pending owner's waiter structure is set to NULL, and the owner field of the
mutex is set to the new owner with the "Pending Owner" bit set, as well
as the "Has Waiters" bit if there still are other processes blocked on the
mutex.
The pi_lock of the previous owner is released, and the new pending owner's
pi_lock is taken. Remember that this is the trick to prevent the race
condition in rt_mutex_adjust_prio_chain from adding itself as a waiter
on the mutex.
We now clear the "pi_blocked_on" field of the new pending owner, and if
the mutex still has waiters pending, we add the new top waiter to the pi_list
of the pending owner.
Finally we unlock the pi_lock of the pending owner and wake it up.
Contact
-------
For updates on this document, please email Steven Rostedt <rostedt@goodmis.org>
Credits
-------
Author: Steven Rostedt <rostedt@goodmis.org>
Reviewers: Ingo Molnar, Thomas Gleixner, Thomas Duetsch, and Randy Dunlap
Updates
-------
This document was originally written for 2.6.17-rc3-mm1

View file

@ -0,0 +1,79 @@
RT-mutex subsystem with PI support
----------------------------------
RT-mutexes with priority inheritance are used to support PI-futexes,
which enable pthread_mutex_t priority inheritance attributes
(PTHREAD_PRIO_INHERIT). [See Documentation/pi-futex.txt for more details
about PI-futexes.]
This technology was developed in the -rt tree and streamlined for
pthread_mutex support.
Basic principles:
-----------------
RT-mutexes extend the semantics of simple mutexes by the priority
inheritance protocol.
A low priority owner of a rt-mutex inherits the priority of a higher
priority waiter until the rt-mutex is released. If the temporarily
boosted owner blocks on a rt-mutex itself it propagates the priority
boosting to the owner of the other rt_mutex it gets blocked on. The
priority boosting is immediately removed once the rt_mutex has been
unlocked.
This approach allows us to shorten the block of high-prio tasks on
mutexes which protect shared resources. Priority inheritance is not a
magic bullet for poorly designed applications, but it allows
well-designed applications to use userspace locks in critical parts of
an high priority thread, without losing determinism.
The enqueueing of the waiters into the rtmutex waiter list is done in
priority order. For same priorities FIFO order is chosen. For each
rtmutex, only the top priority waiter is enqueued into the owner's
priority waiters list. This list too queues in priority order. Whenever
the top priority waiter of a task changes (for example it timed out or
got a signal), the priority of the owner task is readjusted. [The
priority enqueueing is handled by "plists", see include/linux/plist.h
for more details.]
RT-mutexes are optimized for fastpath operations and have no internal
locking overhead when locking an uncontended mutex or unlocking a mutex
without waiters. The optimized fastpath operations require cmpxchg
support. [If that is not available then the rt-mutex internal spinlock
is used]
The state of the rt-mutex is tracked via the owner field of the rt-mutex
structure:
rt_mutex->owner holds the task_struct pointer of the owner. Bit 0 and 1
are used to keep track of the "owner is pending" and "rtmutex has
waiters" state.
owner bit1 bit0
NULL 0 0 mutex is free (fast acquire possible)
NULL 0 1 invalid state
NULL 1 0 Transitional state*
NULL 1 1 invalid state
taskpointer 0 0 mutex is held (fast release possible)
taskpointer 0 1 task is pending owner
taskpointer 1 0 mutex is held and has waiters
taskpointer 1 1 task is pending owner and mutex has waiters
Pending-ownership handling is a performance optimization:
pending-ownership is assigned to the first (highest priority) waiter of
the mutex, when the mutex is released. The thread is woken up and once
it starts executing it can acquire the mutex. Until the mutex is taken
by it (bit 0 is cleared) a competing higher priority thread can "steal"
the mutex which puts the woken up thread back on the waiters list.
The pending-ownership optimization is especially important for the
uninterrupted workflow of high-prio tasks which repeatedly
takes/releases locks that have lower-prio waiters. Without this
optimization the higher-prio thread would ping-pong to the lower-prio
task [because at unlock time we always assign a new owner].
(*) The "mutex has waiters" bit gets set to take the lock. If the lock
doesn't already have an owner, this bit is quickly cleared if there are
no waiters. So this is a transitional state to synchronize with looking
at the owner field of the mutex and the mutex owner releasing the lock.

View file

@ -44,8 +44,10 @@ normal timer interrupt, which is 100Hz.
Programming and/or enabling interrupt frequencies greater than 64Hz is
only allowed by root. This is perhaps a bit conservative, but we don't want
an evil user generating lots of IRQs on a slow 386sx-16, where it might have
a negative impact on performance. Note that the interrupt handler is only
a few lines of code to minimize any possibility of this effect.
a negative impact on performance. This 64Hz limit can be changed by writing
a different value to /proc/sys/dev/rtc/max-user-freq. Note that the
interrupt handler is only a few lines of code to minimize any possibility
of this effect.
Also, if the kernel time is synchronized with an external source, the
kernel will write the time back to the CMOS clock every 11 minutes. In
@ -81,6 +83,7 @@ that will be using this driver.
*/
#include <stdio.h>
#include <stdlib.h>
#include <linux/rtc.h>
#include <sys/ioctl.h>
#include <sys/time.h>

View file

@ -30,8 +30,6 @@ aic7xxx.txt
- info on driver for Adaptec controllers
aic7xxx_old.txt
- info on driver for Adaptec controllers, old generation
cpqfc.txt
- info on driver for Compaq Tachyon TS adapters
dpti.txt
- info on driver for DPT SmartRAID and Adaptec I2O RAID based adapters
dtc3x80.txt

View file

@ -1,3 +1,16 @@
1 Release Date : Wed Feb 03 14:31:44 PST 2006 - Sumant Patro <Sumant.Patro@lsil.com>
2 Current Version : 00.00.02.04
3 Older Version : 00.00.02.04
i. Remove superflous instance_lock
gets rid of the otherwise superflous instance_lock and avoids an unsave
unsynchronized access in the error handler.
- Christoph Hellwig <hch@lst.de>
1 Release Date : Wed Feb 03 14:31:44 PST 2006 - Sumant Patro <Sumant.Patro@lsil.com>
2 Current Version : 00.00.02.04
3 Older Version : 00.00.02.04

View file

@ -24,10 +24,10 @@ Supported Cards/Chipsets
9005:0285:9005:0296 Adaptec 2240S (SabreExpress)
9005:0285:9005:0290 Adaptec 2410SA (Jaguar)
9005:0285:9005:0293 Adaptec 21610SA (Corsair-16)
9005:0285:103c:3227 Adaptec 2610SA (Bearcat)
9005:0285:103c:3227 Adaptec 2610SA (Bearcat HP release)
9005:0285:9005:0292 Adaptec 2810SA (Corsair-8)
9005:0285:9005:0294 Adaptec Prowler
9005:0286:9005:029d Adaptec 2420SA (Intruder)
9005:0286:9005:029d Adaptec 2420SA (Intruder HP release)
9005:0286:9005:029c Adaptec 2620SA (Intruder)
9005:0286:9005:029b Adaptec 2820SA (Intruder)
9005:0286:9005:02a7 Adaptec 2830SA (Skyray)
@ -38,7 +38,7 @@ Supported Cards/Chipsets
9005:0285:9005:0297 Adaptec 4005SAS (AvonPark)
9005:0285:9005:0299 Adaptec 4800SAS (Marauder-X)
9005:0285:9005:029a Adaptec 4805SAS (Marauder-E)
9005:0286:9005:02a2 Adaptec 4810SAS (Hurricane)
9005:0286:9005:02a2 Adaptec 3800SAS (Hurricane44)
1011:0046:9005:0364 Adaptec 5400S (Mustang)
1011:0046:9005:0365 Adaptec 5400S (Mustang)
9005:0283:9005:0283 Adaptec Catapult (3210S with arc firmware)
@ -72,7 +72,7 @@ Supported Cards/Chipsets
9005:0286:9005:02a1 ICP ICP9087MA (Lancer)
9005:0286:9005:02a4 ICP ICP9085LI (Marauder-X)
9005:0286:9005:02a5 ICP ICP5085BR (Marauder-E)
9005:0286:9005:02a3 ICP ICP5085AU (Hurricane)
9005:0286:9005:02a3 ICP ICP5445AU (Hurricane44)
9005:0286:9005:02a6 ICP ICP9067MA (Intruder-6)
9005:0286:9005:02a9 ICP ICP5087AU (Skyray)
9005:0286:9005:02aa ICP ICP5047AU (Skyray)

View file

@ -1,272 +0,0 @@
Notes for CPQFCTS driver for Compaq Tachyon TS
Fibre Channel Host Bus Adapter, PCI 64-bit, 66MHz
for Linux (RH 6.1, 6.2 kernel 2.2.12-32, 2.2.14-5)
SMP tested
Tested in single and dual HBA configuration, 32 and 64bit busses,
33 and 66MHz. Only supports FC-AL.
SEST size 512 Exchanges (simultaneous I/Os) limited by module kmalloc()
max of 128k bytes contiguous.
Ver 2.5.4 Oct 03, 2002
* fixed memcpy of sense buffer in ioctl to copy the smaller defined size
Ver 2.5.3 Aug 01, 2002
* fix the passthru ioctl to handle the Scsi_Cmnd->request being a pointer
Ver 2.5.1 Jul 30, 2002
* fix ioctl to pay attention to the specified LUN.
Ver 2.5.0 Nov 29, 2001
* eliminated io_request_lock. This change makes the driver specific
to the 2.5.x kernels.
* silenced excessively noisy printks.
Ver 2.1.2 July 23, 2002
* initialize DumCmnd->lun in cpqfcTS_ioctl (used in fcFindLoggedInPorts as LUN index)
Ver 2.1.1 Oct 18, 2001
* reinitialize Cmnd->SCp.sent_command (used to identify commands as
passthrus) on calling scsi_done, since the scsi mid layer does not
use (or reinitialize) this field to prevent subsequent comands from
having it set incorrectly.
Ver 2.1.0 Aug 27, 2001
* Revise driver to use new kernel 2.4.x PCI DMA API, instead of
virt_to_bus(). (enables driver to work w/ ia64 systems with >2Gb RAM.)
Rework main scatter-gather code to handle cases where SG element
lengths are larger than 0x7FFFF bytes and use as many scatter
gather pages as necessary. (Steve Cameron)
* Makefile changes to bring cpqfc into line w/ rest of SCSI drivers
(thanks to Keith Owens)
Ver 2.0.5 Aug 06, 2001
* Reject non-existent luns in the driver rather than letting the
hardware do it. (some HW behaves differently than others in this area.)
* Changed Makefile to rely on "make dep" instead of explicit dependencies
* ifdef'ed out fibre channel analyzer triggering debug code
* fixed a jiffies wrapping issue
Ver 2.0.4 Aug 01, 2001
* Incorporated fix for target device reset from Steeleye
* Fixed passthrough ioctl so it doesn't hang.
* Fixed hang in launch_FCworker_thread() that occurred on some machines.
* Avoid problem when number of volumes in a single cabinet > 8
Ver 2.0.2 July 23, 2001
Changed the semiphore changes so the driver would compile in 2.4.7.
This version is for 2.4.7 and beyond.
Ver 2.0.1 May 7, 2001
Merged version 1.3.6 fixes into version 2.0.0.
Ver 2.0.0 May 7, 2001
Fixed problem so spinlock is being initialized to UNLOCKED.
Fixed updated driver so it compiles in the 2.4 tree.
Ver 1.3.6 Feb 27, 2001
Added Target_Device_Reset function for SCSI error handling
Fixed problem with not reseting addressing mode after implicit logout
Ver 1.3.4 Sep 7, 2000
Added Modinfo information
Fixed problem with statically linking the driver
Ver 1.3.3, Aug 23, 2000
Fixed device/function number in ioctl
Ver 1.3.2, July 27, 2000
Add include for Alpha compile on 2.2.14 kernel (cpq*i2c.c)
Change logic for different FCP-RSP sense_buffer location for HSG80 target
And search for Agilent Tachyon XL2 HBAs (not finished! - in test)
Tested with
(storage):
Compaq RA-4x000, RAID firmware ver 2.40 - 2.54
Seagate FC drives model ST39102FC, rev 0006
Hitachi DK31CJ-72FC rev J8A8
IBM DDYF-T18350R rev F60K
Compaq FC-SCSI bridge w/ DLT 35/70 Gb DLT (tape)
(servers):
Compaq PL-1850R
Compaq PL-6500 Xeon (400MHz)
Compaq PL-8500 (500MHz, 66MHz, 64bit PCI)
Compaq Alpha DS20 (RH 6.1)
(hubs):
Vixel Rapport 1000 (7-port "dumb")
Gadzoox Gibralter (12-port "dumb")
Gadzoox Capellix 2000, 3000
(switches):
Brocade 2010, 2400, 2800, rev 2.0.3a (& later)
Gadzoox 3210 (Fabric blade beta)
Vixel 7100 (Fabric beta firmare - known hot plug issues)
using "qa_test" (esp. io_test script) suite modified from Unix tests.
Installation:
make menuconfig
(select SCSI low-level, Compaq FC HBA)
make modules
make modules_install
e.g. insmod -f cpqfc
Due to Fabric/switch delays, driver requires 4 seconds
to initialize. If adapters are found, there will be a entries at
/proc/scsi/cpqfcTS/*
sample contents of startup messages
*************************
scsi_register allocating 3596 bytes for CPQFCHBA
ioremap'd Membase: c887e600
HBA Tachyon RevId 1.2
Allocating 119808 for 576 Exchanges @ c0dc0000
Allocating 112904 for LinkQ @ c0c20000 (576 elements)
Allocating 110600 for TachSEST for 512 Exchanges
cpqfcTS: writing IMQ BASE 7C0000h PI 7C4000h
cpqfcTS: SEST c0e40000(virt): Wrote base E40000h @ c887e740
cpqfcTS: New FC port 0000E8h WWN: 500507650642499D SCSI Chan/Trgt 0/0
cpqfcTS: New FC port 0000EFh WWN: 50000E100000D5A6 SCSI Chan/Trgt 0/1
cpqfcTS: New FC port 0000E4h WWN: 21000020370097BB SCSI Chan/Trgt 0/2
cpqfcTS: New FC port 0000E2h WWN: 2100002037009946 SCSI Chan/Trgt 0/3
cpqfcTS: New FC port 0000E1h WWN: 21000020370098FE SCSI Chan/Trgt 0/4
cpqfcTS: New FC port 0000E0h WWN: 21000020370097B2 SCSI Chan/Trgt 0/5
cpqfcTS: New FC port 0000DCh WWN: 2100002037006CC1 SCSI Chan/Trgt 0/6
cpqfcTS: New FC port 0000DAh WWN: 21000020370059F6 SCSI Chan/Trgt 0/7
cpqfcTS: New FC port 00000Fh WWN: 500805F1FADB0E20 SCSI Chan/Trgt 0/8
cpqfcTS: New FC port 000008h WWN: 500805F1FADB0EBA SCSI Chan/Trgt 0/9
cpqfcTS: New FC port 000004h WWN: 500805F1FADB1EB9 SCSI Chan/Trgt 0/10
cpqfcTS: New FC port 000002h WWN: 500805F1FADB1ADE SCSI Chan/Trgt 0/11
cpqfcTS: New FC port 000001h WWN: 500805F1FADBA2CA SCSI Chan/Trgt 0/12
scsi4 : Compaq FibreChannel HBA Tachyon TS HPFC-5166A/1.2: WWN 500508B200193F50
on PCI bus 0 device 0xa0fc irq 5 IObaseL 0x3400, MEMBASE 0xc6ef8600
PCI bus width 32 bits, bus speed 33 MHz
FCP-SCSI Driver v1.3.0
GBIC detected: Short-wave. LPSM 0h Monitor
scsi : 5 hosts.
Vendor: IBM Model: DDYF-T18350R Rev: F60K
Type: Direct-Access ANSI SCSI revision: 03
Detected scsi disk sdb at scsi4, channel 0, id 0, lun 0
Vendor: HITACHI Model: DK31CJ-72FC Rev: J8A8
Type: Direct-Access ANSI SCSI revision: 02
Detected scsi disk sdc at scsi4, channel 0, id 1, lun 0
Vendor: SEAGATE Model: ST39102FC Rev: 0006
Type: Direct-Access ANSI SCSI revision: 02
Detected scsi disk sdd at scsi4, channel 0, id 2, lun 0
Vendor: SEAGATE Model: ST39102FC Rev: 0006
Type: Direct-Access ANSI SCSI revision: 02
Detected scsi disk sde at scsi4, channel 0, id 3, lun 0
Vendor: SEAGATE Model: ST39102FC Rev: 0006
Type: Direct-Access ANSI SCSI revision: 02
Detected scsi disk sdf at scsi4, channel 0, id 4, lun 0
Vendor: SEAGATE Model: ST39102FC Rev: 0006
Type: Direct-Access ANSI SCSI revision: 02
Detected scsi disk sdg at scsi4, channel 0, id 5, lun 0
Vendor: SEAGATE Model: ST39102FC Rev: 0006
Type: Direct-Access ANSI SCSI revision: 02
Detected scsi disk sdh at scsi4, channel 0, id 6, lun 0
Vendor: SEAGATE Model: ST39102FC Rev: 0006
Type: Direct-Access ANSI SCSI revision: 02
Detected scsi disk sdi at scsi4, channel 0, id 7, lun 0
Vendor: COMPAQ Model: LOGICAL VOLUME Rev: 2.48
Type: Direct-Access ANSI SCSI revision: 02
Detected scsi disk sdj at scsi4, channel 0, id 8, lun 0
Vendor: COMPAQ Model: LOGICAL VOLUME Rev: 2.48
Type: Direct-Access ANSI SCSI revision: 02
Detected scsi disk sdk at scsi4, channel 0, id 8, lun 1
Vendor: COMPAQ Model: LOGICAL VOLUME Rev: 2.40
Type: Direct-Access ANSI SCSI revision: 02
Detected scsi disk sdl at scsi4, channel 0, id 9, lun 0
Vendor: COMPAQ Model: LOGICAL VOLUME Rev: 2.40
Type: Direct-Access ANSI SCSI revision: 02
Detected scsi disk sdm at scsi4, channel 0, id 9, lun 1
Vendor: COMPAQ Model: LOGICAL VOLUME Rev: 2.54
Type: Direct-Access ANSI SCSI revision: 02
Detected scsi disk sdn at scsi4, channel 0, id 10, lun 0
Vendor: COMPAQ Model: LOGICAL VOLUME Rev: 2.54
Type: Direct-Access ANSI SCSI revision: 02
Detected scsi disk sdo at scsi4, channel 0, id 11, lun 0
Vendor: COMPAQ Model: LOGICAL VOLUME Rev: 2.54
Type: Direct-Access ANSI SCSI revision: 02
Detected scsi disk sdp at scsi4, channel 0, id 11, lun 1
Vendor: COMPAQ Model: LOGICAL VOLUME Rev: 2.54
Type: Direct-Access ANSI SCSI revision: 02
Detected scsi disk sdq at scsi4, channel 0, id 12, lun 0
Vendor: COMPAQ Model: LOGICAL VOLUME Rev: 2.54
Type: Direct-Access ANSI SCSI revision: 02
Detected scsi disk sdr at scsi4, channel 0, id 12, lun 1
resize_dma_pool: unknown device type 12
resize_dma_pool: unknown device type 12
SCSI device sdb: hdwr sector= 512 bytes. Sectors= 35843670 [17501 MB] [17.5 GB]
sdb: sdb1
SCSI device sdc: hdwr sector= 512 bytes. Sectors= 144410880 [70513 MB] [70.5 GB]
sdc: sdc1
SCSI device sdd: hdwr sector= 512 bytes. Sectors= 17783240 [8683 MB] [8.7 GB]
sdd: sdd1
SCSI device sde: hdwr sector= 512 bytes. Sectors= 17783240 [8683 MB] [8.7 GB]
sde: sde1
SCSI device sdf: hdwr sector= 512 bytes. Sectors= 17783240 [8683 MB] [8.7 GB]
sdf: sdf1
SCSI device sdg: hdwr sector= 512 bytes. Sectors= 17783240 [8683 MB] [8.7 GB]
sdg: sdg1
SCSI device sdh: hdwr sector= 512 bytes. Sectors= 17783240 [8683 MB] [8.7 GB]
sdh: sdh1
SCSI device sdi: hdwr sector= 512 bytes. Sectors= 17783240 [8683 MB] [8.7 GB]
sdi: sdi1
SCSI device sdj: hdwr sector= 512 bytes. Sectors= 2056160 [1003 MB] [1.0 GB]
sdj: sdj1
SCSI device sdk: hdwr sector= 512 bytes. Sectors= 2052736 [1002 MB] [1.0 GB]
sdk: sdk1
SCSI device sdl: hdwr sector= 512 bytes. Sectors= 17764320 [8673 MB] [8.7 GB]
sdl: sdl1
SCSI device sdm: hdwr sector= 512 bytes. Sectors= 8380320 [4091 MB] [4.1 GB]
sdm: sdm1
SCSI device sdn: hdwr sector= 512 bytes. Sectors= 17764320 [8673 MB] [8.7 GB]
sdn: sdn1
SCSI device sdo: hdwr sector= 512 bytes. Sectors= 17764320 [8673 MB] [8.7 GB]
sdo: sdo1
SCSI device sdp: hdwr sector= 512 bytes. Sectors= 17764320 [8673 MB] [8.7 GB]
sdp: sdp1
SCSI device sdq: hdwr sector= 512 bytes. Sectors= 2056160 [1003 MB] [1.0 GB]
sdq: sdq1
SCSI device sdr: hdwr sector= 512 bytes. Sectors= 2052736 [1002 MB] [1.0 GB]
sdr: sdr1
*************************
If a GBIC of type Short-wave, Long-wave, or Copper is detected, it will
print out; otherwise, "none" is displayed. If the cabling is correct
and a loop circuit is completed, you should see "Monitor"; otherwise,
"LoopFail" (on open circuit) or some LPSM number/state with bit 3 set.
ERRATA:
1. Normally, Linux Scsi queries FC devices with INQUIRY strings. All LUNs
found according to INQUIRY should get READ commands at sector 0 to find
partition table, etc. Older kernels only query the first 4 devices. Some
Linux kernels only look for one LUN per target (i.e. FC device).
2. Physically removing a device, or a malfunctioning system which hides a
device, leads to a 30-second timeout and subsequent _abort call.
In some process contexts, this will hang the kernel (crashing the system).
Single bit errors in frames and virtually all hot plugging events are
gracefully handled with internal driver timer and Abort processing.
3. Some SCSI drives with error conditions will not handle the 7 second timeout
in this software driver, leading to infinite retries on timed out SCSI commands.
The 7 secs balances the need to quickly recover from lost frames (esp. on sequence
initiatives) and time needed by older/slower/error-state drives in responding.
This can be easily changed in "Exchanges[].timeOut".
4. Due to the nature of FC soft addressing, there is no assurance that the
same LUNs (drives) will have the same path (e.g. /dev/sdb1) from one boot to
next. Dynamic soft address changes (i.e. 24-bit FC port_id) are
supported during run time (e.g. due to hot plug event) by the use of WWN to
SCSI Nexus (channel/target/LUN) mapping.
5. Compaq RA4x00 firmware version 2.54 and later supports SSP (Selective
Storage Presentation), which maps LUNs to a WWN. If RA4x00 firmware prior
2.54 (e.g. older controller) is used, or the FC HBA is replaced (another WWN
is used), logical volumes on the RA4x00 will no longer be visible.
Send questions/comments to:
Amy Vanzant-Hodge (fibrechannel@compaq.com)

View file

@ -0,0 +1,92 @@
HIGHPOINT ROCKETRAID 3xxx RAID DRIVER (hptiop)
Controller Register Map
-------------------------
The controller IOP is accessed via PCI BAR0.
BAR0 offset Register
0x10 Inbound Message Register 0
0x14 Inbound Message Register 1
0x18 Outbound Message Register 0
0x1C Outbound Message Register 1
0x20 Inbound Doorbell Register
0x24 Inbound Interrupt Status Register
0x28 Inbound Interrupt Mask Register
0x30 Outbound Interrupt Status Register
0x34 Outbound Interrupt Mask Register
0x40 Inbound Queue Port
0x44 Outbound Queue Port
I/O Request Workflow
----------------------
All queued requests are handled via inbound/outbound queue port.
A request packet can be allocated in either IOP or host memory.
To send a request to the controller:
- Get a free request packet by reading the inbound queue port or
allocate a free request in host DMA coherent memory.
The value returned from the inbound queue port is an offset
relative to the IOP BAR0.
Requests allocated in host memory must be aligned on 32-bytes boundary.
- Fill the packet.
- Post the packet to IOP by writing it to inbound queue. For requests
allocated in IOP memory, write the offset to inbound queue port. For
requests allocated in host memory, write (0x80000000|(bus_addr>>5))
to the inbound queue port.
- The IOP process the request. When the request is completed, it
will be put into outbound queue. An outbound interrupt will be
generated.
For requests allocated in IOP memory, the request offset is posted to
outbound queue.
For requests allocated in host memory, (0x80000000|(bus_addr>>5))
is posted to the outbound queue. If IOP_REQUEST_FLAG_OUTPUT_CONTEXT
flag is set in the request, the low 32-bit context value will be
posted instead.
- The host read the outbound queue and complete the request.
For requests allocated in IOP memory, the host driver free the request
by writing it to the outbound queue.
Non-queued requests (reset/flush etc) can be sent via inbound message
register 0. An outbound message with the same value indicates the completion
of an inbound message.
User-level Interface
---------------------
The driver exposes following sysfs attributes:
NAME R/W Description
driver-version R driver version string
firmware-version R firmware version string
The driver registers char device "hptiop" to communicate with HighPoint RAID
management software. Its ioctl routine acts as a general binary interface
between the IOP firmware and HighPoint RAID management software. New management
functions can be implemented in application/firmware without modification
in driver code.
-----------------------------------------------------------------------------
Copyright (C) 2006 HighPoint Technologies, Inc. All Rights Reserved.
This file is distributed in the hope that it will be useful,
but WITHOUT ANY WARRANTY; without even the implied warranty of
MERCHANTABILITY or FITNESS FOR A PARTICULAR PURPOSE. See the
GNU General Public License for more details.
linux@highpoint-tech.com
http://www.highpoint-tech.com

View file

@ -12,5 +12,3 @@ http://www.torque.net/parport/
Email list for Linux Parport
linux-parport@torque.net
Email for problems with ZIP or ZIP Plus drivers
campbell@torque.net

View file

@ -109,7 +109,7 @@ than the 33.33 MHz being in the PCI spec.
If you want to share the IRQ with another device and the driver refuses to
do so, you might succeed with changing the DC390_IRQ type in tmscsim.c to
SA_SHIRQ | SA_INTERRUPT.
IRQF_SHARED | IRQF_DISABLED.
3.Features

View file

@ -366,7 +366,9 @@ Prior to version 0.9.0rc4 options had a 'snd_' prefix. This was removed.
Module for C-Media CMI8338 and 8738 PCI sound cards.
mpu_port - 0x300,0x310,0x320,0x330, 0 = disable (default)
mpu_port - 0x300,0x310,0x320,0x330 = legacy port,
1 = integrated PCI port,
0 = disable (default)
fm_port - 0x388 (default), 0 = disable (default)
soft_ac3 - Software-conversion of raw SPDIF packets (model 033 only)
(default = 1)
@ -468,7 +470,23 @@ Prior to version 0.9.0rc4 options had a 'snd_' prefix. This was removed.
Module for multifunction CS5535 companion PCI device
The power-management is supported.
Module snd-darla20
------------------
Module for Echoaudio Darla20
This module supports multiple cards.
The driver requires the firmware loader support on kernel.
Module snd-darla24
------------------
Module for Echoaudio Darla24
This module supports multiple cards.
The driver requires the firmware loader support on kernel.
Module snd-dt019x
-----------------
@ -497,6 +515,14 @@ Prior to version 0.9.0rc4 options had a 'snd_' prefix. This was removed.
The power-management is supported.
Module snd-echo3g
-----------------
Module for Echoaudio 3G cards (Gina3G/Layla3G)
This module supports multiple cards.
The driver requires the firmware loader support on kernel.
Module snd-emu10k1
------------------
@ -655,6 +681,22 @@ Prior to version 0.9.0rc4 options had a 'snd_' prefix. This was removed.
The power-management is supported.
Module snd-gina20
-----------------
Module for Echoaudio Gina20
This module supports multiple cards.
The driver requires the firmware loader support on kernel.
Module snd-gina24
-----------------
Module for Echoaudio Gina24
This module supports multiple cards.
The driver requires the firmware loader support on kernel.
Module snd-gusclassic
---------------------
@ -707,8 +749,10 @@ Prior to version 0.9.0rc4 options had a 'snd_' prefix. This was removed.
Module snd-hda-intel
--------------------
Module for Intel HD Audio (ICH6, ICH6M, ICH7), ATI SB450,
VIA VT8251/VT8237A
Module for Intel HD Audio (ICH6, ICH6M, ESB2, ICH7, ICH8),
ATI SB450, SB600, RS600,
VIA VT8251/VT8237A,
SIS966, ULI M5461
model - force the model name
position_fix - Fix DMA pointer (0 = auto, 1 = none, 2 = POSBUF, 3 = FIFO size)
@ -756,12 +800,18 @@ Prior to version 0.9.0rc4 options had a 'snd_' prefix. This was removed.
basic fixed pin assignment w/o SPDIF
auto auto-config reading BIOS (default)
ALC882/883/885
ALC882/885
3stack-dig 3-jack with SPDIF I/O
6stck-dig 6-jack digital with SPDIF I/O
auto auto-config reading BIOS (default)
ALC861
ALC883/888
3stack-dig 3-jack with SPDIF I/O
6stack-dig 6-jack digital with SPDIF I/O
6stack-dig-demo 6-stack digital for Intel demo board
auto auto-config reading BIOS (default)
ALC861/660
3stack 3-jack
3stack-dig 3-jack with SPDIF I/O
6stack-dig 6-jack with SPDIF I/O
@ -778,6 +828,7 @@ Prior to version 0.9.0rc4 options had a 'snd_' prefix. This was removed.
AD1981
basic 3-jack (default)
hp HP nx6320
thinkpad Lenovo Thinkpad T60/X60/Z60
AD1986A
6stack 6-jack, separate surrounds (default)
@ -932,6 +983,30 @@ Prior to version 0.9.0rc4 options had a 'snd_' prefix. This was removed.
driver isn't configured properly or you want to try another
type for testing.
Module snd-indigo
-----------------
Module for Echoaudio Indigo
This module supports multiple cards.
The driver requires the firmware loader support on kernel.
Module snd-indigodj
-------------------
Module for Echoaudio Indigo DJ
This module supports multiple cards.
The driver requires the firmware loader support on kernel.
Module snd-indigoio
-------------------
Module for Echoaudio Indigo IO
This module supports multiple cards.
The driver requires the firmware loader support on kernel.
Module snd-intel8x0
-------------------
@ -1031,6 +1106,22 @@ Prior to version 0.9.0rc4 options had a 'snd_' prefix. This was removed.
This module supports multiple cards.
Module snd-layla20
------------------
Module for Echoaudio Layla20
This module supports multiple cards.
The driver requires the firmware loader support on kernel.
Module snd-layla24
------------------
Module for Echoaudio Layla24
This module supports multiple cards.
The driver requires the firmware loader support on kernel.
Module snd-maestro3
-------------------
@ -1051,6 +1142,14 @@ Prior to version 0.9.0rc4 options had a 'snd_' prefix. This was removed.
The power-management is supported.
Module snd-mia
---------------
Module for Echoaudio Mia
This module supports multiple cards.
The driver requires the firmware loader support on kernel.
Module snd-miro
---------------
@ -1083,6 +1182,14 @@ Prior to version 0.9.0rc4 options had a 'snd_' prefix. This was removed.
When no hotplug fw loader is available, you need to load the
firmware via mixartloader utility in alsa-tools package.
Module snd-mona
---------------
Module for Echoaudio Mona
This module supports multiple cards.
The driver requires the firmware loader support on kernel.
Module snd-mpu401
-----------------
@ -1633,9 +1740,7 @@ Prior to version 0.9.0rc4 options had a 'snd_' prefix. This was removed.
About capture IBL, see the description of snd-vx222 module.
Note: the driver is build only when CONFIG_ISA is set.
Note2: snd-vxp440 driver is merged to snd-vxpocket driver since
Note: snd-vxp440 driver is merged to snd-vxpocket driver since
ALSA 1.0.10.
The power-management is supported.
@ -1662,8 +1767,6 @@ Prior to version 0.9.0rc4 options had a 'snd_' prefix. This was removed.
Module for Sound Core PDAudioCF sound card.
Note: the driver is build only when CONFIG_ISA is set.
The power-management is supported.

View file

@ -1149,7 +1149,7 @@
}
chip->port = pci_resource_start(pci, 0);
if (request_irq(pci->irq, snd_mychip_interrupt,
SA_INTERRUPT|SA_SHIRQ, "My Chip", chip)) {
IRQF_DISABLED|IRQF_SHARED, "My Chip", chip)) {
printk(KERN_ERR "cannot grab irq %d\n", pci->irq);
snd_mychip_free(chip);
return -EBUSY;
@ -1323,7 +1323,7 @@
<programlisting>
<![CDATA[
if (request_irq(pci->irq, snd_mychip_interrupt,
SA_INTERRUPT|SA_SHIRQ, "My Chip", chip)) {
IRQF_DISABLED|IRQF_SHARED, "My Chip", chip)) {
printk(KERN_ERR "cannot grab irq %d\n", pci->irq);
snd_mychip_free(chip);
return -EBUSY;
@ -1342,7 +1342,7 @@
<para>
On the PCI bus, the interrupts can be shared. Thus,
<constant>SA_SHIRQ</constant> is given as the interrupt flag of
<constant>IRQF_SHARED</constant> is given as the interrupt flag of
<function>request_irq()</function>.
</para>
@ -3048,7 +3048,7 @@ struct _snd_pcm_runtime {
</para>
<para>
If you aquire a spinlock in the interrupt handler, and the
If you acquire a spinlock in the interrupt handler, and the
lock is used in other pcm callbacks, too, then you have to
release the lock before calling
<function>snd_pcm_period_elapsed()</function>, because
@ -4215,7 +4215,7 @@ struct _snd_pcm_runtime {
<programlisting>
<![CDATA[
struct snd_rawmidi *rmidi;
snd_mpu401_uart_new(card, 0, MPU401_HW_MPU401, port, integrated,
snd_mpu401_uart_new(card, 0, MPU401_HW_MPU401, port, info_flags,
irq, irq_flags, &rmidi);
]]>
</programlisting>
@ -4242,15 +4242,36 @@ struct _snd_pcm_runtime {
</para>
<para>
The 5th argument is bitflags for additional information.
When the i/o port address above is a part of the PCI i/o
region, the MPU401 i/o port might have been already allocated
(reserved) by the driver itself. In such a case, pass non-zero
to the 5th argument
(<parameter>integrated</parameter>). Otherwise, pass 0 to it,
(reserved) by the driver itself. In such a case, pass a bit flag
<constant>MPU401_INFO_INTEGRATED</constant>,
and
the mpu401-uart layer will allocate the i/o ports by itself.
</para>
<para>
When the controller supports only the input or output MIDI stream,
pass <constant>MPU401_INFO_INPUT</constant> or
<constant>MPU401_INFO_OUTPUT</constant> bitflag, respectively.
Then the rawmidi instance is created as a single stream.
</para>
<para>
<constant>MPU401_INFO_MMIO</constant> bitflag is used to change
the access method to MMIO (via readb and writeb) instead of
iob and outb. In this case, you have to pass the iomapped address
to <function>snd_mpu401_uart_new()</function>.
</para>
<para>
When <constant>MPU401_INFO_TX_IRQ</constant> is set, the output
stream isn't checked in the default interrupt handler. The driver
needs to call <function>snd_mpu401_uart_interrupt_tx()</function>
by itself to start processing the output stream in irq handler.
</para>
<para>
Usually, the port address corresponds to the command port and
port + 1 corresponds to the data port. If not, you may change
@ -5333,7 +5354,7 @@ struct _snd_pcm_runtime {
<informalexample>
<programlisting>
<![CDATA[
snd_info_set_text_ops(entry, chip, read_size, my_proc_read);
snd_info_set_text_ops(entry, chip, my_proc_read);
]]>
</programlisting>
</informalexample>
@ -5394,29 +5415,12 @@ struct _snd_pcm_runtime {
<informalexample>
<programlisting>
<![CDATA[
entry->c.text.write_size = 256;
entry->c.text.write = my_proc_write;
]]>
</programlisting>
</informalexample>
</para>
<para>
The buffer size for read is set to 1024 implicitly by
<function>snd_info_set_text_ops()</function>. It should suffice
in most cases (the size will be aligned to
<constant>PAGE_SIZE</constant> anyway), but if you need to handle
very large text files, you can set it explicitly, too.
<informalexample>
<programlisting>
<![CDATA[
entry->c.text.read_size = 65536;
]]>
</programlisting>
</informalexample>
</para>
<para>
For the write callback, you can use
<function>snd_info_get_line()</function> to get a text line, and
@ -5562,7 +5566,7 @@ struct _snd_pcm_runtime {
power status.</para></listitem>
<listitem><para>Call <function>snd_pcm_suspend_all()</function> to suspend the running PCM streams.</para></listitem>
<listitem><para>If AC97 codecs are used, call
<function>snd_ac97_resume()</function> for each codec.</para></listitem>
<function>snd_ac97_suspend()</function> for each codec.</para></listitem>
<listitem><para>Save the register values if necessary.</para></listitem>
<listitem><para>Stop the hardware if necessary.</para></listitem>
<listitem><para>Disable the PCI device by calling

View file

@ -25,42 +25,84 @@ the bits necessary to run your device. The most commonly
used members of this structure, and their typical usage,
will be detailed below.
Here is how probing is performed by an SBUS driver
under Linux:
Here is a piece of skeleton code for perofming a device
probe in an SBUS driverunder Linux:
static void init_one_mydevice(struct sbus_dev *sdev)
static int __devinit mydevice_probe_one(struct sbus_dev *sdev)
{
struct mysdevice *mp = kzalloc(sizeof(*mp), GFP_KERNEL);
if (!mp)
return -ENODEV;
...
dev_set_drvdata(&sdev->ofdev.dev, mp);
return 0;
...
}
static int mydevice_match(struct sbus_dev *sdev)
static int __devinit mydevice_probe(struct of_device *dev,
const struct of_device_id *match)
{
if (some_criteria(sdev))
return 1;
return 0;
struct sbus_dev *sdev = to_sbus_device(&dev->dev);
return mydevice_probe_one(sdev);
}
static void mydevice_probe(void)
static int __devexit mydevice_remove(struct of_device *dev)
{
struct sbus_bus *sbus;
struct sbus_dev *sdev;
struct sbus_dev *sdev = to_sbus_device(&dev->dev);
struct mydevice *mp = dev_get_drvdata(&dev->dev);
for_each_sbus(sbus) {
for_each_sbusdev(sdev, sbus) {
if (mydevice_match(sdev))
init_one_mydevice(sdev);
}
}
return mydevice_remove_one(sdev, mp);
}
All this does is walk through all SBUS devices in the
system, checks each to see if it is of the type which
your driver is written for, and if so it calls the init
routine to attach the device and prepare to drive it.
static struct of_device_id mydevice_match[] = {
{
.name = "mydevice",
},
{},
};
"init_one_mydevice" might do things like allocate software
state structures, map in I/O registers, place the hardware
into an initialized state, etc.
MODULE_DEVICE_TABLE(of, mydevice_match);
static struct of_platform_driver mydevice_driver = {
.name = "mydevice",
.match_table = mydevice_match,
.probe = mydevice_probe,
.remove = __devexit_p(mydevice_remove),
};
static int __init mydevice_init(void)
{
return of_register_driver(&mydevice_driver, &sbus_bus_type);
}
static void __exit mydevice_exit(void)
{
of_unregister_driver(&mydevice_driver);
}
module_init(mydevice_init);
module_exit(mydevice_exit);
The mydevice_match table is a series of entries which
describes what SBUS devices your driver is meant for. In the
simplest case you specify a string for the 'name' field. Every
SBUS device with a 'name' property matching your string will
be passed one-by-one to your .probe method.
You should store away your device private state structure
pointer in the drvdata area so that you can retrieve it later on
in your .remove method.
Any memory allocated, registers mapped, IRQs registered,
etc. must be undone by your .remove method so that all resources
of your device are relased by the time it returns.
You should _NOT_ use the for_each_sbus(), for_each_sbusdev(),
and for_all_sbusdev() interfaces. They are deprecated, will be
removed, and no new driver should reference them ever.
Mapping and Accessing I/O Registers
@ -263,10 +305,3 @@ discussed above and plus it handles both PCI and SBUS boards.
Lance driver abuses consistent mappings for data transfer.
It is a nifty trick which we do not particularly recommend...
Just check it out and know that it's legal.
Bad examples, do NOT use
drivers/video/cgsix.c
This one uses result of sbus_ioremap as if it is an address.
This does NOT work on sparc64 and therefore is broken. We will
convert it at a later date.

View file

@ -1,5 +1,6 @@
Copyright 2004 Linus Torvalds
Copyright 2004 Pavel Machek <pavel@suse.cz>
Copyright 2006 Bob Copeland <me@bobcopeland.com>
Using sparse for typechecking
~~~~~~~~~~~~~~~~~~~~~~~~~~~~~
@ -41,15 +42,8 @@ sure that bitwise types don't get mixed up (little-endian vs big-endian
vs cpu-endian vs whatever), and there the constant "0" really _is_
special.
Use
make C=[12] CF=-Wbitwise
or you don't get any checking at all.
Where to get sparse
~~~~~~~~~~~~~~~~~~~
Getting sparse
~~~~~~~~~~~~~~
With git, you can just get it from
@ -57,7 +51,7 @@ With git, you can just get it from
and DaveJ has tar-balls at
http://www.codemonkey.org.uk/projects/git-snapshots/sparse/
http://www.codemonkey.org.uk/projects/git-snapshots/sparse/
Once you have it, just do
@ -65,8 +59,20 @@ Once you have it, just do
make
make install
as your regular user, and it will install sparse in your ~/bin directory.
After that, doing a kernel make with "make C=1" will run sparse on all the
C files that get recompiled, or with "make C=2" will run sparse on the
files whether they need to be recompiled or not (ie the latter is fast way
to check the whole tree if you have already built it).
as a regular user, and it will install sparse in your ~/bin directory.
Using sparse
~~~~~~~~~~~~
Do a kernel make with "make C=1" to run sparse on all the C files that get
recompiled, or use "make C=2" to run sparse on the files whether they need to
be recompiled or not. The latter is a fast way to check the whole tree if you
have already built it.
The optional make variable CF can be used to pass arguments to sparse. The
build system passes -Wbitwise to sparse automatically. To perform endianness
checks, you may define __CHECK_ENDIAN__:
make C=2 CF="-D__CHECK_ENDIAN__"
These checks are disabled by default as they generate a host of warnings.

View file

@ -28,7 +28,7 @@ Currently, these files are in /proc/sys/vm:
- block_dump
- drop-caches
- zone_reclaim_mode
- zone_reclaim_interval
- panic_on_oom
==============================================================
@ -166,15 +166,15 @@ use of files and builds up large slab caches. However, the slab
shrink operation is global, may take a long time and free slabs
in all nodes of the system.
================================================================
=============================================================
zone_reclaim_interval:
panic_on_oom
The time allowed for off node allocations after zone reclaim
has failed to reclaim enough pages to allow a local allocation.
This enables or disables panic on out-of-memory feature. If this is set to 1,
the kernel panics when out-of-memory happens. If this is set to 0, the kernel
will kill some rogue process, called oom_killer. Usually, oom_killer can kill
rogue processes and system will survive. If you want to panic the system
rather than killing rogue processes, set this to 1.
Time is set in seconds and set by default to 30 seconds.
Reduce the interval if undesired off node allocations occur. However, too
frequent scans will have a negative impact onoff node allocation performance.
The default value is 0.

View file

@ -115,8 +115,9 @@ trojan program is running at console and which could grab your password
when you would try to login. It will kill all programs on given console
and thus letting you make sure that the login prompt you see is actually
the one from init, not some trojan program.
IMPORTANT:In its true form it is not a true SAK like the one in :IMPORTANT
IMPORTANT:c2 compliant systems, and it should be mistook as such. :IMPORTANT
IMPORTANT: In its true form it is not a true SAK like the one in a :IMPORTANT
IMPORTANT: c2 compliant system, and it should not be mistaken as :IMPORTANT
IMPORTANT: such. :IMPORTANT
It seems other find it useful as (System Attention Key) which is
useful when you want to exit a program that will not let you switch consoles.
(For example, X or a svgalib program.)

Some files were not shown because too many files have changed in this diff Show more