Merge branch 'master' of /home/tglx/work/mtd/git/linux-2.6.git/
This commit is contained in:
commit
2fc2991175
7316 changed files with 540452 additions and 288420 deletions
.gitignoreCOPYINGCREDITS
Documentation
00-INDEXChangesCodingStyleDMA-API.txtDMA-ISA-LPC.txt
DocBook
journal-api.tmplkernel-api.tmplkernel-hacking.tmpllibata.tmplmcabook.tmplusb.tmplwriting_usb_driver.tmpl
IPMI.txtMSI-HOWTO.txtRCU
SubmittingPatchesacpi-hotkey.txtaoe
applying-patches.txtarm/Samsung-S3C24XX
block
cachetlb.txtcciss.txtcdrom
connector
cpu-freq
cpusets.txtcrypto
dcdbas.txtdell_rbu.txtdevice-mapper
dontdiffdriver-model
dvb
exception.txtfb
feature-removal-schedule.txtfilesystems
firmware_class
hwmon
i2c
i386
ia64
ibm-acpi.txtinput
ioctl
kbuild
kdump
kernel-parameters.txtkeys-request-key.txtkeys.txtkprobes.txtm68k
30
.gitignore
vendored
Normal file
30
.gitignore
vendored
Normal file
|
@ -0,0 +1,30 @@
|
|||
#
|
||||
# NOTE! Don't add files that are generated in specific
|
||||
# subdirectories here. Add them in the ".gitignore" file
|
||||
# in that subdirectory instead.
|
||||
#
|
||||
# Normal rules
|
||||
#
|
||||
.*
|
||||
*.o
|
||||
*.a
|
||||
*.s
|
||||
*.ko
|
||||
*.mod.c
|
||||
|
||||
#
|
||||
# Top-level generic files
|
||||
#
|
||||
vmlinux*
|
||||
System.map
|
||||
Module.symvers
|
||||
|
||||
#
|
||||
# Generated include files
|
||||
#
|
||||
include/asm
|
||||
include/config
|
||||
include/linux/autoconf.h
|
||||
include/linux/compile.h
|
||||
include/linux/version.h
|
||||
|
4
COPYING
4
COPYING
|
@ -18,7 +18,7 @@
|
|||
Version 2, June 1991
|
||||
|
||||
Copyright (C) 1989, 1991 Free Software Foundation, Inc.
|
||||
59 Temple Place, Suite 330, Boston, MA 02111-1307 USA
|
||||
51 Franklin St, Fifth Floor, Boston, MA 02110-1301 USA
|
||||
Everyone is permitted to copy and distribute verbatim copies
|
||||
of this license document, but changing it is not allowed.
|
||||
|
||||
|
@ -321,7 +321,7 @@ the "copyright" line and a pointer to where the full notice is found.
|
|||
|
||||
You should have received a copy of the GNU General Public License
|
||||
along with this program; if not, write to the Free Software
|
||||
Foundation, Inc., 59 Temple Place, Suite 330, Boston, MA 02111-1307 USA
|
||||
Foundation, Inc., 51 Franklin St, Fifth Floor, Boston, MA 02110-1301 USA
|
||||
|
||||
|
||||
Also add information on how to contact you by electronic and paper mail.
|
||||
|
|
39
CREDITS
39
CREDITS
|
@ -2211,6 +2211,15 @@ D: OV511 driver
|
|||
S: (address available on request)
|
||||
S: USA
|
||||
|
||||
N: Ian McDonald
|
||||
E: iam4@cs.waikato.ac.nz
|
||||
E: imcdnzl@gmail.com
|
||||
W: http://wand.net.nz/~iam4
|
||||
W: http://imcdnzl.blogspot.com
|
||||
D: DCCP, CCID3
|
||||
S: Hamilton
|
||||
S: New Zealand
|
||||
|
||||
N: Patrick McHardy
|
||||
E: kaber@trash.net
|
||||
P: 1024D/12155E80 B128 7DE6 FF0A C2B2 48BE AB4C C9D4 964E 1215 5E80
|
||||
|
@ -2238,6 +2247,12 @@ S: 249 Nichols Avenue
|
|||
S: Syracuse, New York 13206
|
||||
S: USA
|
||||
|
||||
N: Kyle McMartin
|
||||
E: kyle@parisc-linux.org
|
||||
D: Linux/PARISC hacker
|
||||
D: AD1889 sound driver
|
||||
S: Ottawa, Canada
|
||||
|
||||
N: Dirk Melchers
|
||||
E: dirk@merlin.nbg.sub.org
|
||||
D: 8 bit XT hard disk driver for OMTI5520
|
||||
|
@ -2246,19 +2261,12 @@ S: D-90453 Nuernberg
|
|||
S: Germany
|
||||
|
||||
N: Arnaldo Carvalho de Melo
|
||||
E: acme@conectiva.com.br
|
||||
E: acme@kernel.org
|
||||
E: acme@gnu.org
|
||||
W: http://bazar2.conectiva.com.br/~acme
|
||||
W: http://advogato.org/person/acme
|
||||
E: acme@mandriva.com
|
||||
E: acme@ghostprotocols.net
|
||||
W: http://oops.ghostprotocols.net:81/blog/
|
||||
P: 1024D/9224DF01 D5DF E3BB E3C8 BCBB F8AD 841A B6AB 4681 9224 DF01
|
||||
D: wanrouter hacking
|
||||
D: misc Makefile, Config.in, drivers and network stacks fixes
|
||||
D: IPX & LLC network stacks maintainer
|
||||
D: Cyclom 2X synchronous card driver
|
||||
D: wl3501 PCMCIA wireless card driver
|
||||
D: i18n for minicom, net-tools, util-linux, fetchmail, etc
|
||||
S: Conectiva S.A.
|
||||
D: IPX, LLC, DCCP, cyc2x, wl3501_cs, net/ hacks
|
||||
S: Mandriva
|
||||
S: R. Tocantins, 89 - Cristo Rei
|
||||
S: 80050-430 - Curitiba - Paraná
|
||||
S: Brazil
|
||||
|
@ -2380,8 +2388,8 @@ E: tmolina@cablespeed.com
|
|||
D: bug fixes, documentation, minor hackery
|
||||
|
||||
N: James Morris
|
||||
E: jmorris@redhat.com
|
||||
W: http://www.intercode.com.au/jmorris/
|
||||
E: jmorris@namei.org
|
||||
W: http://namei.org/
|
||||
D: Netfilter, Linux Security Modules (LSM), SELinux, IPSec,
|
||||
D: Crypto API, general networking, miscellaneous.
|
||||
S: PO Box 707
|
||||
|
@ -2423,8 +2431,7 @@ S: Toronto, Ontario
|
|||
S: Canada
|
||||
|
||||
N: Zwane Mwaikambo
|
||||
E: zwane@linuxpower.ca
|
||||
W: http://function.linuxpower.ca
|
||||
E: zwane@arm.linux.org.uk
|
||||
D: Various driver hacking
|
||||
D: Lowlevel x86 kernel hacking
|
||||
D: General debugging
|
||||
|
|
|
@ -46,6 +46,8 @@ SubmittingPatches
|
|||
- procedure to get a source patch included into the kernel tree.
|
||||
VGA-softcursor.txt
|
||||
- how to change your VGA cursor from a blinking underscore.
|
||||
applying-patches.txt
|
||||
- description of various trees and how to apply their patches.
|
||||
arm/
|
||||
- directory with info about Linux on the ARM architecture.
|
||||
basic_profiling.txt
|
||||
|
@ -275,7 +277,7 @@ tty.txt
|
|||
unicode.txt
|
||||
- info on the Unicode character/font mapping used in Linux.
|
||||
uml/
|
||||
- directory with infomation about User Mode Linux.
|
||||
- directory with information about User Mode Linux.
|
||||
usb/
|
||||
- directory with info regarding the Universal Serial Bus.
|
||||
video4linux/
|
||||
|
|
|
@ -65,7 +65,7 @@ o isdn4k-utils 3.1pre1 # isdnctrl 2>&1|grep version
|
|||
o nfs-utils 1.0.5 # showmount --version
|
||||
o procps 3.2.0 # ps --version
|
||||
o oprofile 0.9 # oprofiled --version
|
||||
o udev 058 # udevinfo -V
|
||||
o udev 071 # udevinfo -V
|
||||
|
||||
Kernel compilation
|
||||
==================
|
||||
|
@ -237,6 +237,12 @@ udev
|
|||
udev is a userspace application for populating /dev dynamically with
|
||||
only entries for devices actually present. udev replaces devfs.
|
||||
|
||||
FUSE
|
||||
----
|
||||
|
||||
Needs libfuse 2.4.0 or later. Absolute minimum is 2.3.0 but mount
|
||||
options 'direct_io' and 'kernel_cache' won't work.
|
||||
|
||||
Networking
|
||||
==========
|
||||
|
||||
|
@ -390,6 +396,10 @@ udev
|
|||
----
|
||||
o <http://www.kernel.org/pub/linux/utils/kernel/hotplug/udev.html>
|
||||
|
||||
FUSE
|
||||
----
|
||||
o <http://sourceforge.net/projects/fuse>
|
||||
|
||||
Networking
|
||||
**********
|
||||
|
||||
|
|
|
@ -236,6 +236,9 @@ ugly), but try to avoid excess. Instead, put the comments at the head
|
|||
of the function, telling people what it does, and possibly WHY it does
|
||||
it.
|
||||
|
||||
When commenting the kernel API functions, please use the kerneldoc format.
|
||||
See the files Documentation/kernel-doc-nano-HOWTO.txt and scripts/kernel-doc
|
||||
for details.
|
||||
|
||||
Chapter 8: You've made a mess of it
|
||||
|
||||
|
@ -407,7 +410,26 @@ Kernel messages do not have to be terminated with a period.
|
|||
Printing numbers in parentheses (%d) adds no value and should be avoided.
|
||||
|
||||
|
||||
Chapter 13: References
|
||||
Chapter 13: Allocating memory
|
||||
|
||||
The kernel provides the following general purpose memory allocators:
|
||||
kmalloc(), kzalloc(), kcalloc(), and vmalloc(). Please refer to the API
|
||||
documentation for further information about them.
|
||||
|
||||
The preferred form for passing a size of a struct is the following:
|
||||
|
||||
p = kmalloc(sizeof(*p), ...);
|
||||
|
||||
The alternative form where struct name is spelled out hurts readability and
|
||||
introduces an opportunity for a bug when the pointer variable type is changed
|
||||
but the corresponding sizeof that is passed to a memory allocator is not.
|
||||
|
||||
Casting the return value which is a void pointer is redundant. The conversion
|
||||
from void pointer to any other pointer type is guaranteed by the C programming
|
||||
language.
|
||||
|
||||
|
||||
Chapter 14: References
|
||||
|
||||
The C Programming Language, Second Edition
|
||||
by Brian W. Kernighan and Dennis M. Ritchie.
|
||||
|
|
|
@ -121,7 +121,7 @@ pool's device.
|
|||
dma_addr_t addr);
|
||||
|
||||
This puts memory back into the pool. The pool is what was passed to
|
||||
the the pool allocation routine; the cpu and dma addresses are what
|
||||
the pool allocation routine; the cpu and dma addresses are what
|
||||
were returned when that routine allocated the memory being freed.
|
||||
|
||||
|
||||
|
|
151
Documentation/DMA-ISA-LPC.txt
Normal file
151
Documentation/DMA-ISA-LPC.txt
Normal file
|
@ -0,0 +1,151 @@
|
|||
DMA with ISA and LPC devices
|
||||
============================
|
||||
|
||||
Pierre Ossman <drzeus@drzeus.cx>
|
||||
|
||||
This document describes how to do DMA transfers using the old ISA DMA
|
||||
controller. Even though ISA is more or less dead today the LPC bus
|
||||
uses the same DMA system so it will be around for quite some time.
|
||||
|
||||
Part I - Headers and dependencies
|
||||
---------------------------------
|
||||
|
||||
To do ISA style DMA you need to include two headers:
|
||||
|
||||
#include <linux/dma-mapping.h>
|
||||
#include <asm/dma.h>
|
||||
|
||||
The first is the generic DMA API used to convert virtual addresses to
|
||||
physical addresses (see Documentation/DMA-API.txt for details).
|
||||
|
||||
The second contains the routines specific to ISA DMA transfers. Since
|
||||
this is not present on all platforms make sure you construct your
|
||||
Kconfig to be dependent on ISA_DMA_API (not ISA) so that nobody tries
|
||||
to build your driver on unsupported platforms.
|
||||
|
||||
Part II - Buffer allocation
|
||||
---------------------------
|
||||
|
||||
The ISA DMA controller has some very strict requirements on which
|
||||
memory it can access so extra care must be taken when allocating
|
||||
buffers.
|
||||
|
||||
(You usually need a special buffer for DMA transfers instead of
|
||||
transferring directly to and from your normal data structures.)
|
||||
|
||||
The DMA-able address space is the lowest 16 MB of _physical_ memory.
|
||||
Also the transfer block may not cross page boundaries (which are 64
|
||||
or 128 KiB depending on which channel you use).
|
||||
|
||||
In order to allocate a piece of memory that satisfies all these
|
||||
requirements you pass the flag GFP_DMA to kmalloc.
|
||||
|
||||
Unfortunately the memory available for ISA DMA is scarce so unless you
|
||||
allocate the memory during boot-up it's a good idea to also pass
|
||||
__GFP_REPEAT and __GFP_NOWARN to make the allocater try a bit harder.
|
||||
|
||||
(This scarcity also means that you should allocate the buffer as
|
||||
early as possible and not release it until the driver is unloaded.)
|
||||
|
||||
Part III - Address translation
|
||||
------------------------------
|
||||
|
||||
To translate the virtual address to a physical use the normal DMA
|
||||
API. Do _not_ use isa_virt_to_phys() even though it does the same
|
||||
thing. The reason for this is that the function isa_virt_to_phys()
|
||||
will require a Kconfig dependency to ISA, not just ISA_DMA_API which
|
||||
is really all you need. Remember that even though the DMA controller
|
||||
has its origins in ISA it is used elsewhere.
|
||||
|
||||
Note: x86_64 had a broken DMA API when it came to ISA but has since
|
||||
been fixed. If your arch has problems then fix the DMA API instead of
|
||||
reverting to the ISA functions.
|
||||
|
||||
Part IV - Channels
|
||||
------------------
|
||||
|
||||
A normal ISA DMA controller has 8 channels. The lower four are for
|
||||
8-bit transfers and the upper four are for 16-bit transfers.
|
||||
|
||||
(Actually the DMA controller is really two separate controllers where
|
||||
channel 4 is used to give DMA access for the second controller (0-3).
|
||||
This means that of the four 16-bits channels only three are usable.)
|
||||
|
||||
You allocate these in a similar fashion as all basic resources:
|
||||
|
||||
extern int request_dma(unsigned int dmanr, const char * device_id);
|
||||
extern void free_dma(unsigned int dmanr);
|
||||
|
||||
The ability to use 16-bit or 8-bit transfers is _not_ up to you as a
|
||||
driver author but depends on what the hardware supports. Check your
|
||||
specs or test different channels.
|
||||
|
||||
Part V - Transfer data
|
||||
----------------------
|
||||
|
||||
Now for the good stuff, the actual DMA transfer. :)
|
||||
|
||||
Before you use any ISA DMA routines you need to claim the DMA lock
|
||||
using claim_dma_lock(). The reason is that some DMA operations are
|
||||
not atomic so only one driver may fiddle with the registers at a
|
||||
time.
|
||||
|
||||
The first time you use the DMA controller you should call
|
||||
clear_dma_ff(). This clears an internal register in the DMA
|
||||
controller that is used for the non-atomic operations. As long as you
|
||||
(and everyone else) uses the locking functions then you only need to
|
||||
reset this once.
|
||||
|
||||
Next, you tell the controller in which direction you intend to do the
|
||||
transfer using set_dma_mode(). Currently you have the options
|
||||
DMA_MODE_READ and DMA_MODE_WRITE.
|
||||
|
||||
Set the address from where the transfer should start (this needs to
|
||||
be 16-bit aligned for 16-bit transfers) and how many bytes to
|
||||
transfer. Note that it's _bytes_. The DMA routines will do all the
|
||||
required translation to values that the DMA controller understands.
|
||||
|
||||
The final step is enabling the DMA channel and releasing the DMA
|
||||
lock.
|
||||
|
||||
Once the DMA transfer is finished (or timed out) you should disable
|
||||
the channel again. You should also check get_dma_residue() to make
|
||||
sure that all data has been transfered.
|
||||
|
||||
Example:
|
||||
|
||||
int flags, residue;
|
||||
|
||||
flags = claim_dma_lock();
|
||||
|
||||
clear_dma_ff();
|
||||
|
||||
set_dma_mode(channel, DMA_MODE_WRITE);
|
||||
set_dma_addr(channel, phys_addr);
|
||||
set_dma_count(channel, num_bytes);
|
||||
|
||||
dma_enable(channel);
|
||||
|
||||
release_dma_lock(flags);
|
||||
|
||||
while (!device_done());
|
||||
|
||||
flags = claim_dma_lock();
|
||||
|
||||
dma_disable(channel);
|
||||
|
||||
residue = dma_get_residue(channel);
|
||||
if (residue != 0)
|
||||
printk(KERN_ERR "driver: Incomplete DMA transfer!"
|
||||
" %d bytes left!\n", residue);
|
||||
|
||||
release_dma_lock(flags);
|
||||
|
||||
Part VI - Suspend/resume
|
||||
------------------------
|
||||
|
||||
It is the driver's responsibility to make sure that the machine isn't
|
||||
suspended while a DMA transfer is in progress. Also, all DMA settings
|
||||
are lost when the system suspends so if your driver relies on the DMA
|
||||
controller being in a certain state then you have to restore these
|
||||
registers upon resume.
|
|
@ -116,7 +116,7 @@ filesystem. Almost.
|
|||
|
||||
You still need to actually journal your filesystem changes, this
|
||||
is done by wrapping them into transactions. Additionally you
|
||||
also need to wrap the modification of each of the the buffers
|
||||
also need to wrap the modification of each of the buffers
|
||||
with calls to the journal layer, so it knows what the modifications
|
||||
you are actually making are. To do this use journal_start() which
|
||||
returns a transaction handle.
|
||||
|
@ -128,7 +128,7 @@ and its counterpart journal_stop(), which indicates the end of a transaction
|
|||
are nestable calls, so you can reenter a transaction if necessary,
|
||||
but remember you must call journal_stop() the same number of times as
|
||||
journal_start() before the transaction is completed (or more accurately
|
||||
leaves the the update phase). Ext3/VFS makes use of this feature to simplify
|
||||
leaves the update phase). Ext3/VFS makes use of this feature to simplify
|
||||
quota support.
|
||||
</para>
|
||||
|
||||
|
|
|
@ -239,9 +239,9 @@ X!Ilib/string.c
|
|||
<title>Network device support</title>
|
||||
<sect1><title>Driver Support</title>
|
||||
!Enet/core/dev.c
|
||||
</sect1>
|
||||
<sect1><title>8390 Based Network Cards</title>
|
||||
!Edrivers/net/8390.c
|
||||
!Enet/ethernet/eth.c
|
||||
!Einclude/linux/etherdevice.h
|
||||
!Enet/core/wireless.c
|
||||
</sect1>
|
||||
<sect1><title>Synchronous PPP</title>
|
||||
!Edrivers/net/wan/syncppp.c
|
||||
|
@ -286,7 +286,9 @@ X!Edrivers/pci/search.c
|
|||
-->
|
||||
!Edrivers/pci/msi.c
|
||||
!Edrivers/pci/bus.c
|
||||
!Edrivers/pci/hotplug.c
|
||||
<!-- FIXME: Removed for now since no structured comments in source
|
||||
X!Edrivers/pci/hotplug.c
|
||||
-->
|
||||
!Edrivers/pci/probe.c
|
||||
!Edrivers/pci/rom.c
|
||||
</sect1>
|
||||
|
|
|
@ -8,8 +8,7 @@
|
|||
|
||||
<authorgroup>
|
||||
<author>
|
||||
<firstname>Paul</firstname>
|
||||
<othername>Rusty</othername>
|
||||
<firstname>Rusty</firstname>
|
||||
<surname>Russell</surname>
|
||||
<affiliation>
|
||||
<address>
|
||||
|
@ -20,7 +19,7 @@
|
|||
</authorgroup>
|
||||
|
||||
<copyright>
|
||||
<year>2001</year>
|
||||
<year>2005</year>
|
||||
<holder>Rusty Russell</holder>
|
||||
</copyright>
|
||||
|
||||
|
@ -64,7 +63,7 @@
|
|||
<chapter id="introduction">
|
||||
<title>Introduction</title>
|
||||
<para>
|
||||
Welcome, gentle reader, to Rusty's Unreliable Guide to Linux
|
||||
Welcome, gentle reader, to Rusty's Remarkably Unreliable Guide to Linux
|
||||
Kernel Hacking. This document describes the common routines and
|
||||
general requirements for kernel code: its goal is to serve as a
|
||||
primer for Linux kernel development for experienced C
|
||||
|
@ -96,13 +95,13 @@
|
|||
|
||||
<listitem>
|
||||
<para>
|
||||
not associated with any process, serving a softirq, tasklet or bh;
|
||||
not associated with any process, serving a softirq or tasklet;
|
||||
</para>
|
||||
</listitem>
|
||||
|
||||
<listitem>
|
||||
<para>
|
||||
running in kernel space, associated with a process;
|
||||
running in kernel space, associated with a process (user context);
|
||||
</para>
|
||||
</listitem>
|
||||
|
||||
|
@ -114,11 +113,12 @@
|
|||
</itemizedlist>
|
||||
|
||||
<para>
|
||||
There is a strict ordering between these: other than the last
|
||||
category (userspace) each can only be pre-empted by those above.
|
||||
For example, while a softirq is running on a CPU, no other
|
||||
softirq will pre-empt it, but a hardware interrupt can. However,
|
||||
any other CPUs in the system execute independently.
|
||||
There is an ordering between these. The bottom two can preempt
|
||||
each other, but above that is a strict hierarchy: each can only be
|
||||
preempted by the ones above it. For example, while a softirq is
|
||||
running on a CPU, no other softirq will preempt it, but a hardware
|
||||
interrupt can. However, any other CPUs in the system execute
|
||||
independently.
|
||||
</para>
|
||||
|
||||
<para>
|
||||
|
@ -130,10 +130,10 @@
|
|||
<title>User Context</title>
|
||||
|
||||
<para>
|
||||
User context is when you are coming in from a system call or
|
||||
other trap: you can sleep, and you own the CPU (except for
|
||||
interrupts) until you call <function>schedule()</function>.
|
||||
In other words, user context (unlike userspace) is not pre-emptable.
|
||||
User context is when you are coming in from a system call or other
|
||||
trap: like userspace, you can be preempted by more important tasks
|
||||
and by interrupts. You can sleep, by calling
|
||||
<function>schedule()</function>.
|
||||
</para>
|
||||
|
||||
<note>
|
||||
|
@ -153,7 +153,7 @@
|
|||
|
||||
<caution>
|
||||
<para>
|
||||
Beware that if you have interrupts or bottom halves disabled
|
||||
Beware that if you have preemption or softirqs disabled
|
||||
(see below), <function>in_interrupt()</function> will return a
|
||||
false positive.
|
||||
</para>
|
||||
|
@ -168,10 +168,10 @@
|
|||
<hardware>keyboard</hardware> are examples of real
|
||||
hardware which produce interrupts at any time. The kernel runs
|
||||
interrupt handlers, which services the hardware. The kernel
|
||||
guarantees that this handler is never re-entered: if another
|
||||
guarantees that this handler is never re-entered: if the same
|
||||
interrupt arrives, it is queued (or dropped). Because it
|
||||
disables interrupts, this handler has to be fast: frequently it
|
||||
simply acknowledges the interrupt, marks a `software interrupt'
|
||||
simply acknowledges the interrupt, marks a 'software interrupt'
|
||||
for execution and exits.
|
||||
</para>
|
||||
|
||||
|
@ -188,60 +188,52 @@
|
|||
</sect1>
|
||||
|
||||
<sect1 id="basics-softirqs">
|
||||
<title>Software Interrupt Context: Bottom Halves, Tasklets, softirqs</title>
|
||||
<title>Software Interrupt Context: Softirqs and Tasklets</title>
|
||||
|
||||
<para>
|
||||
Whenever a system call is about to return to userspace, or a
|
||||
hardware interrupt handler exits, any `software interrupts'
|
||||
hardware interrupt handler exits, any 'software interrupts'
|
||||
which are marked pending (usually by hardware interrupts) are
|
||||
run (<filename>kernel/softirq.c</filename>).
|
||||
</para>
|
||||
|
||||
<para>
|
||||
Much of the real interrupt handling work is done here. Early in
|
||||
the transition to <acronym>SMP</acronym>, there were only `bottom
|
||||
the transition to <acronym>SMP</acronym>, there were only 'bottom
|
||||
halves' (BHs), which didn't take advantage of multiple CPUs. Shortly
|
||||
after we switched from wind-up computers made of match-sticks and snot,
|
||||
we abandoned this limitation.
|
||||
we abandoned this limitation and switched to 'softirqs'.
|
||||
</para>
|
||||
|
||||
<para>
|
||||
<filename class="headerfile">include/linux/interrupt.h</filename> lists the
|
||||
different BH's. No matter how many CPUs you have, no two BHs will run at
|
||||
the same time. This made the transition to SMP simpler, but sucks hard for
|
||||
scalable performance. A very important bottom half is the timer
|
||||
BH (<filename class="headerfile">include/linux/timer.h</filename>): you
|
||||
can register to have it call functions for you in a given length of time.
|
||||
different softirqs. A very important softirq is the
|
||||
timer softirq (<filename
|
||||
class="headerfile">include/linux/timer.h</filename>): you can
|
||||
register to have it call functions for you in a given length of
|
||||
time.
|
||||
</para>
|
||||
|
||||
<para>
|
||||
2.3.43 introduced softirqs, and re-implemented the (now
|
||||
deprecated) BHs underneath them. Softirqs are fully-SMP
|
||||
versions of BHs: they can run on as many CPUs at once as
|
||||
required. This means they need to deal with any races in shared
|
||||
data using their own locks. A bitmask is used to keep track of
|
||||
which are enabled, so the 32 available softirqs should not be
|
||||
used up lightly. (<emphasis>Yes</emphasis>, people will
|
||||
notice).
|
||||
</para>
|
||||
|
||||
<para>
|
||||
tasklets (<filename class="headerfile">include/linux/interrupt.h</filename>)
|
||||
are like softirqs, except they are dynamically-registrable (meaning you
|
||||
can have as many as you want), and they also guarantee that any tasklet
|
||||
will only run on one CPU at any time, although different tasklets can
|
||||
run simultaneously (unlike different BHs).
|
||||
Softirqs are often a pain to deal with, since the same softirq
|
||||
will run simultaneously on more than one CPU. For this reason,
|
||||
tasklets (<filename
|
||||
class="headerfile">include/linux/interrupt.h</filename>) are more
|
||||
often used: they are dynamically-registrable (meaning you can have
|
||||
as many as you want), and they also guarantee that any tasklet
|
||||
will only run on one CPU at any time, although different tasklets
|
||||
can run simultaneously.
|
||||
</para>
|
||||
<caution>
|
||||
<para>
|
||||
The name `tasklet' is misleading: they have nothing to do with `tasks',
|
||||
The name 'tasklet' is misleading: they have nothing to do with 'tasks',
|
||||
and probably more to do with some bad vodka Alexey Kuznetsov had at the
|
||||
time.
|
||||
</para>
|
||||
</caution>
|
||||
|
||||
<para>
|
||||
You can tell you are in a softirq (or bottom half, or tasklet)
|
||||
You can tell you are in a softirq (or tasklet)
|
||||
using the <function>in_softirq()</function> macro
|
||||
(<filename class="headerfile">include/linux/interrupt.h</filename>).
|
||||
</para>
|
||||
|
@ -288,11 +280,10 @@
|
|||
<term>A rigid stack limit</term>
|
||||
<listitem>
|
||||
<para>
|
||||
The kernel stack is about 6K in 2.2 (for most
|
||||
architectures: it's about 14K on the Alpha), and shared
|
||||
with interrupts so you can't use it all. Avoid deep
|
||||
recursion and huge local arrays on the stack (allocate
|
||||
them dynamically instead).
|
||||
Depending on configuration options the kernel stack is about 3K to 6K for most 32-bit architectures: it's
|
||||
about 14K on most 64-bit archs, and often shared with interrupts
|
||||
so you can't use it all. Avoid deep recursion and huge local
|
||||
arrays on the stack (allocate them dynamically instead).
|
||||
</para>
|
||||
</listitem>
|
||||
</varlistentry>
|
||||
|
@ -339,7 +330,7 @@ asmlinkage long sys_mycall(int arg)
|
|||
|
||||
<para>
|
||||
If all your routine does is read or write some parameter, consider
|
||||
implementing a <function>sysctl</function> interface instead.
|
||||
implementing a <function>sysfs</function> interface instead.
|
||||
</para>
|
||||
|
||||
<para>
|
||||
|
@ -417,7 +408,10 @@ cond_resched(); /* Will sleep */
|
|||
</para>
|
||||
|
||||
<para>
|
||||
You will eventually lock up your box if you break these rules.
|
||||
You should always compile your kernel
|
||||
<symbol>CONFIG_DEBUG_SPINLOCK_SLEEP</symbol> on, and it will warn
|
||||
you if you break these rules. If you <emphasis>do</emphasis> break
|
||||
the rules, you will eventually lock up your box.
|
||||
</para>
|
||||
|
||||
<para>
|
||||
|
@ -515,8 +509,7 @@ printk(KERN_INFO "my ip: %d.%d.%d.%d\n", NIPQUAD(ipaddress));
|
|||
success).
|
||||
</para>
|
||||
</caution>
|
||||
[Yes, this moronic interface makes me cringe. Please submit a
|
||||
patch and become my hero --RR.]
|
||||
[Yes, this moronic interface makes me cringe. The flamewar comes up every year or so. --RR.]
|
||||
</para>
|
||||
<para>
|
||||
The functions may sleep implicitly. This should never be called
|
||||
|
@ -587,10 +580,11 @@ printk(KERN_INFO "my ip: %d.%d.%d.%d\n", NIPQUAD(ipaddress));
|
|||
</variablelist>
|
||||
|
||||
<para>
|
||||
If you see a <errorname>kmem_grow: Called nonatomically from int
|
||||
</errorname> warning message you called a memory allocation function
|
||||
from interrupt context without <constant>GFP_ATOMIC</constant>.
|
||||
You should really fix that. Run, don't walk.
|
||||
If you see a <errorname>sleeping function called from invalid
|
||||
context</errorname> warning message, then maybe you called a
|
||||
sleeping allocation function from interrupt context without
|
||||
<constant>GFP_ATOMIC</constant>. You should really fix that.
|
||||
Run, don't walk.
|
||||
</para>
|
||||
|
||||
<para>
|
||||
|
@ -639,16 +633,16 @@ printk(KERN_INFO "my ip: %d.%d.%d.%d\n", NIPQUAD(ipaddress));
|
|||
</sect1>
|
||||
|
||||
<sect1 id="routines-udelay">
|
||||
<title><function>udelay()</function>/<function>mdelay()</function>
|
||||
<title><function>mdelay()</function>/<function>udelay()</function>
|
||||
<filename class="headerfile">include/asm/delay.h</filename>
|
||||
<filename class="headerfile">include/linux/delay.h</filename>
|
||||
</title>
|
||||
|
||||
<para>
|
||||
The <function>udelay()</function> function can be used for small pauses.
|
||||
Do not use large values with <function>udelay()</function> as you risk
|
||||
The <function>udelay()</function> and <function>ndelay()</function> functions can be used for small pauses.
|
||||
Do not use large values with them as you risk
|
||||
overflow - the helper function <function>mdelay()</function> is useful
|
||||
here, or even consider <function>schedule_timeout()</function>.
|
||||
here, or consider <function>msleep()</function>.
|
||||
</para>
|
||||
</sect1>
|
||||
|
||||
|
@ -698,8 +692,8 @@ printk(KERN_INFO "my ip: %d.%d.%d.%d\n", NIPQUAD(ipaddress));
|
|||
These routines disable soft interrupts on the local CPU, and
|
||||
restore them. They are reentrant; if soft interrupts were
|
||||
disabled before, they will still be disabled after this pair
|
||||
of functions has been called. They prevent softirqs, tasklets
|
||||
and bottom halves from running on the current CPU.
|
||||
of functions has been called. They prevent softirqs and tasklets
|
||||
from running on the current CPU.
|
||||
</para>
|
||||
</sect1>
|
||||
|
||||
|
@ -708,10 +702,16 @@ printk(KERN_INFO "my ip: %d.%d.%d.%d\n", NIPQUAD(ipaddress));
|
|||
<filename class="headerfile">include/asm/smp.h</filename></title>
|
||||
|
||||
<para>
|
||||
<function>smp_processor_id()</function> returns the current
|
||||
processor number, between 0 and <symbol>NR_CPUS</symbol> (the
|
||||
maximum number of CPUs supported by Linux, currently 32). These
|
||||
values are not necessarily continuous.
|
||||
<function>get_cpu()</function> disables preemption (so you won't
|
||||
suddenly get moved to another CPU) and returns the current
|
||||
processor number, between 0 and <symbol>NR_CPUS</symbol>. Note
|
||||
that the CPU numbers are not necessarily continuous. You return
|
||||
it again with <function>put_cpu()</function> when you are done.
|
||||
</para>
|
||||
<para>
|
||||
If you know you cannot be preempted by another task (ie. you are
|
||||
in interrupt context, or have preemption disabled) you can use
|
||||
smp_processor_id().
|
||||
</para>
|
||||
</sect1>
|
||||
|
||||
|
@ -722,19 +722,14 @@ printk(KERN_INFO "my ip: %d.%d.%d.%d\n", NIPQUAD(ipaddress));
|
|||
<para>
|
||||
After boot, the kernel frees up a special section; functions
|
||||
marked with <type>__init</type> and data structures marked with
|
||||
<type>__initdata</type> are dropped after boot is complete (within
|
||||
modules this directive is currently ignored). <type>__exit</type>
|
||||
<type>__initdata</type> are dropped after boot is complete: similarly
|
||||
modules discard this memory after initialization. <type>__exit</type>
|
||||
is used to declare a function which is only required on exit: the
|
||||
function will be dropped if this file is not compiled as a module.
|
||||
See the header file for use. Note that it makes no sense for a function
|
||||
marked with <type>__init</type> to be exported to modules with
|
||||
<function>EXPORT_SYMBOL()</function> - this will break.
|
||||
</para>
|
||||
<para>
|
||||
Static data structures marked as <type>__initdata</type> must be initialised
|
||||
(as opposed to ordinary static data which is zeroed BSS) and cannot be
|
||||
<type>const</type>.
|
||||
</para>
|
||||
|
||||
</sect1>
|
||||
|
||||
|
@ -762,9 +757,8 @@ printk(KERN_INFO "my ip: %d.%d.%d.%d\n", NIPQUAD(ipaddress));
|
|||
<para>
|
||||
The function can return a negative error number to cause
|
||||
module loading to fail (unfortunately, this has no effect if
|
||||
the module is compiled into the kernel). For modules, this is
|
||||
called in user context, with interrupts enabled, and the
|
||||
kernel lock held, so it can sleep.
|
||||
the module is compiled into the kernel). This function is
|
||||
called in user context with interrupts enabled, so it can sleep.
|
||||
</para>
|
||||
</sect1>
|
||||
|
||||
|
@ -779,6 +773,34 @@ printk(KERN_INFO "my ip: %d.%d.%d.%d\n", NIPQUAD(ipaddress));
|
|||
reached zero. This function can also sleep, but cannot fail:
|
||||
everything must be cleaned up by the time it returns.
|
||||
</para>
|
||||
|
||||
<para>
|
||||
Note that this macro is optional: if it is not present, your
|
||||
module will not be removable (except for 'rmmod -f').
|
||||
</para>
|
||||
</sect1>
|
||||
|
||||
<sect1 id="routines-module-use-counters">
|
||||
<title> <function>try_module_get()</function>/<function>module_put()</function>
|
||||
<filename class="headerfile">include/linux/module.h</filename></title>
|
||||
|
||||
<para>
|
||||
These manipulate the module usage count, to protect against
|
||||
removal (a module also can't be removed if another module uses one
|
||||
of its exported symbols: see below). Before calling into module
|
||||
code, you should call <function>try_module_get()</function> on
|
||||
that module: if it fails, then the module is being removed and you
|
||||
should act as if it wasn't there. Otherwise, you can safely enter
|
||||
the module, and call <function>module_put()</function> when you're
|
||||
finished.
|
||||
</para>
|
||||
|
||||
<para>
|
||||
Most registerable structures have an
|
||||
<structfield>owner</structfield> field, such as in the
|
||||
<structname>file_operations</structname> structure. Set this field
|
||||
to the macro <symbol>THIS_MODULE</symbol>.
|
||||
</para>
|
||||
</sect1>
|
||||
|
||||
<!-- add info on new-style module refcounting here -->
|
||||
|
@ -821,7 +843,7 @@ printk(KERN_INFO "my ip: %d.%d.%d.%d\n", NIPQUAD(ipaddress));
|
|||
There is a macro to do this:
|
||||
<function>wait_event_interruptible()</function>
|
||||
|
||||
<filename class="headerfile">include/linux/sched.h</filename> The
|
||||
<filename class="headerfile">include/linux/wait.h</filename> The
|
||||
first argument is the wait queue head, and the second is an
|
||||
expression which is evaluated; the macro returns
|
||||
<returnvalue>0</returnvalue> when this expression is true, or
|
||||
|
@ -847,10 +869,11 @@ printk(KERN_INFO "my ip: %d.%d.%d.%d\n", NIPQUAD(ipaddress));
|
|||
<para>
|
||||
Call <function>wake_up()</function>
|
||||
|
||||
<filename class="headerfile">include/linux/sched.h</filename>;,
|
||||
<filename class="headerfile">include/linux/wait.h</filename>;,
|
||||
which will wake up every process in the queue. The exception is
|
||||
if one has <constant>TASK_EXCLUSIVE</constant> set, in which case
|
||||
the remainder of the queue will not be woken.
|
||||
the remainder of the queue will not be woken. There are other variants
|
||||
of this basic function available in the same header.
|
||||
</para>
|
||||
</sect1>
|
||||
</chapter>
|
||||
|
@ -863,7 +886,7 @@ printk(KERN_INFO "my ip: %d.%d.%d.%d\n", NIPQUAD(ipaddress));
|
|||
first class of operations work on <type>atomic_t</type>
|
||||
|
||||
<filename class="headerfile">include/asm/atomic.h</filename>; this
|
||||
contains a signed integer (at least 24 bits long), and you must use
|
||||
contains a signed integer (at least 32 bits long), and you must use
|
||||
these functions to manipulate or read atomic_t variables.
|
||||
<function>atomic_read()</function> and
|
||||
<function>atomic_set()</function> get and set the counter,
|
||||
|
@ -882,13 +905,12 @@ printk(KERN_INFO "my ip: %d.%d.%d.%d\n", NIPQUAD(ipaddress));
|
|||
|
||||
<para>
|
||||
Note that these functions are slower than normal arithmetic, and
|
||||
so should not be used unnecessarily. On some platforms they
|
||||
are much slower, like 32-bit Sparc where they use a spinlock.
|
||||
so should not be used unnecessarily.
|
||||
</para>
|
||||
|
||||
<para>
|
||||
The second class of atomic operations is atomic bit operations on a
|
||||
<type>long</type>, defined in
|
||||
The second class of atomic operations is atomic bit operations on an
|
||||
<type>unsigned long</type>, defined in
|
||||
|
||||
<filename class="headerfile">include/linux/bitops.h</filename>. These
|
||||
operations generally take a pointer to the bit pattern, and a bit
|
||||
|
@ -899,7 +921,7 @@ printk(KERN_INFO "my ip: %d.%d.%d.%d\n", NIPQUAD(ipaddress));
|
|||
<function>test_and_clear_bit()</function> and
|
||||
<function>test_and_change_bit()</function> do the same thing,
|
||||
except return true if the bit was previously set; these are
|
||||
particularly useful for very simple locking.
|
||||
particularly useful for atomically setting flags.
|
||||
</para>
|
||||
|
||||
<para>
|
||||
|
@ -907,12 +929,6 @@ printk(KERN_INFO "my ip: %d.%d.%d.%d\n", NIPQUAD(ipaddress));
|
|||
than BITS_PER_LONG. The resulting behavior is strange on big-endian
|
||||
platforms though so it is a good idea not to do this.
|
||||
</para>
|
||||
|
||||
<para>
|
||||
Note that the order of bits depends on the architecture, and in
|
||||
particular, the bitfield passed to these operations must be at
|
||||
least as large as a <type>long</type>.
|
||||
</para>
|
||||
</chapter>
|
||||
|
||||
<chapter id="symbols">
|
||||
|
@ -932,11 +948,8 @@ printk(KERN_INFO "my ip: %d.%d.%d.%d\n", NIPQUAD(ipaddress));
|
|||
<filename class="headerfile">include/linux/module.h</filename></title>
|
||||
|
||||
<para>
|
||||
This is the classic method of exporting a symbol, and it works
|
||||
for both modules and non-modules. In the kernel all these
|
||||
declarations are often bundled into a single file to help
|
||||
genksyms (which searches source files for these declarations).
|
||||
See the comment on genksyms and Makefiles below.
|
||||
This is the classic method of exporting a symbol: dynamically
|
||||
loaded modules will be able to use the symbol as normal.
|
||||
</para>
|
||||
</sect1>
|
||||
|
||||
|
@ -949,7 +962,8 @@ printk(KERN_INFO "my ip: %d.%d.%d.%d\n", NIPQUAD(ipaddress));
|
|||
symbols exported by <function>EXPORT_SYMBOL_GPL()</function> can
|
||||
only be seen by modules with a
|
||||
<function>MODULE_LICENSE()</function> that specifies a GPL
|
||||
compatible license.
|
||||
compatible license. It implies that the function is considered
|
||||
an internal implementation issue, and not really an interface.
|
||||
</para>
|
||||
</sect1>
|
||||
</chapter>
|
||||
|
@ -962,12 +976,13 @@ printk(KERN_INFO "my ip: %d.%d.%d.%d\n", NIPQUAD(ipaddress));
|
|||
<filename class="headerfile">include/linux/list.h</filename></title>
|
||||
|
||||
<para>
|
||||
There are three sets of linked-list routines in the kernel
|
||||
headers, but this one seems to be winning out (and Linus has
|
||||
used it). If you don't have some particular pressing need for
|
||||
a single list, it's a good choice. In fact, I don't care
|
||||
whether it's a good choice or not, just use it so we can get
|
||||
rid of the others.
|
||||
There used to be three sets of linked-list routines in the kernel
|
||||
headers, but this one is the winner. If you don't have some
|
||||
particular pressing need for a single list, it's a good choice.
|
||||
</para>
|
||||
|
||||
<para>
|
||||
In particular, <function>list_for_each_entry</function> is useful.
|
||||
</para>
|
||||
</sect1>
|
||||
|
||||
|
@ -979,14 +994,13 @@ printk(KERN_INFO "my ip: %d.%d.%d.%d\n", NIPQUAD(ipaddress));
|
|||
convention, and return <returnvalue>0</returnvalue> for success,
|
||||
and a negative error number
|
||||
(eg. <returnvalue>-EFAULT</returnvalue>) for failure. This can be
|
||||
unintuitive at first, but it's fairly widespread in the networking
|
||||
code, for example.
|
||||
unintuitive at first, but it's fairly widespread in the kernel.
|
||||
</para>
|
||||
|
||||
<para>
|
||||
The filesystem code uses <function>ERR_PTR()</function>
|
||||
Using <function>ERR_PTR()</function>
|
||||
|
||||
<filename class="headerfile">include/linux/fs.h</filename>; to
|
||||
<filename class="headerfile">include/linux/err.h</filename>; to
|
||||
encode a negative error number into a pointer, and
|
||||
<function>IS_ERR()</function> and <function>PTR_ERR()</function>
|
||||
to get it back out again: avoids a separate pointer parameter for
|
||||
|
@ -1040,7 +1054,7 @@ static struct block_device_operations opt_fops = {
|
|||
supported, due to lack of general use, but the following are
|
||||
considered standard (see the GCC info page section "C
|
||||
Extensions" for more details - Yes, really the info page, the
|
||||
man page is only a short summary of the stuff in info):
|
||||
man page is only a short summary of the stuff in info).
|
||||
</para>
|
||||
<itemizedlist>
|
||||
<listitem>
|
||||
|
@ -1091,7 +1105,7 @@ static struct block_device_operations opt_fops = {
|
|||
</listitem>
|
||||
<listitem>
|
||||
<para>
|
||||
Function names as strings (__FUNCTION__)
|
||||
Function names as strings (__FUNCTION__).
|
||||
</para>
|
||||
</listitem>
|
||||
<listitem>
|
||||
|
@ -1164,63 +1178,35 @@ static struct block_device_operations opt_fops = {
|
|||
<listitem>
|
||||
<para>
|
||||
Usually you want a configuration option for your kernel hack.
|
||||
Edit <filename>Config.in</filename> in the appropriate directory
|
||||
(but under <filename>arch/</filename> it's called
|
||||
<filename>config.in</filename>). The Config Language used is not
|
||||
bash, even though it looks like bash; the safe way is to use only
|
||||
the constructs that you already see in
|
||||
<filename>Config.in</filename> files (see
|
||||
<filename>Documentation/kbuild/kconfig-language.txt</filename>).
|
||||
It's good to run "make xconfig" at least once to test (because
|
||||
it's the only one with a static parser).
|
||||
</para>
|
||||
|
||||
<para>
|
||||
Variables which can be Y or N use <type>bool</type> followed by a
|
||||
tagline and the config define name (which must start with
|
||||
CONFIG_). The <type>tristate</type> function is the same, but
|
||||
allows the answer M (which defines
|
||||
<symbol>CONFIG_foo_MODULE</symbol> in your source, instead of
|
||||
<symbol>CONFIG_FOO</symbol>) if <symbol>CONFIG_MODULES</symbol>
|
||||
is enabled.
|
||||
Edit <filename>Kconfig</filename> in the appropriate directory.
|
||||
The Config language is simple to use by cut and paste, and there's
|
||||
complete documentation in
|
||||
<filename>Documentation/kbuild/kconfig-language.txt</filename>.
|
||||
</para>
|
||||
|
||||
<para>
|
||||
You may well want to make your CONFIG option only visible if
|
||||
<symbol>CONFIG_EXPERIMENTAL</symbol> is enabled: this serves as a
|
||||
warning to users. There many other fancy things you can do: see
|
||||
the various <filename>Config.in</filename> files for ideas.
|
||||
the various <filename>Kconfig</filename> files for ideas.
|
||||
</para>
|
||||
|
||||
<para>
|
||||
In your description of the option, make sure you address both the
|
||||
expert user and the user who knows nothing about your feature. Mention
|
||||
incompatibilities and issues here. <emphasis> Definitely
|
||||
</emphasis> end your description with <quote> if in doubt, say N
|
||||
</quote> (or, occasionally, `Y'); this is for people who have no
|
||||
idea what you are talking about.
|
||||
</para>
|
||||
</listitem>
|
||||
|
||||
<listitem>
|
||||
<para>
|
||||
Edit the <filename>Makefile</filename>: the CONFIG variables are
|
||||
exported here so you can conditionalize compilation with `ifeq'.
|
||||
If your file exports symbols then add the names to
|
||||
<varname>export-objs</varname> so that genksyms will find them.
|
||||
<caution>
|
||||
<para>
|
||||
There is a restriction on the kernel build system that objects
|
||||
which export symbols must have globally unique names.
|
||||
If your object does not have a globally unique name then the
|
||||
standard fix is to move the
|
||||
<function>EXPORT_SYMBOL()</function> statements to their own
|
||||
object with a unique name.
|
||||
This is why several systems have separate exporting objects,
|
||||
usually suffixed with ksyms.
|
||||
</para>
|
||||
</caution>
|
||||
</para>
|
||||
</listitem>
|
||||
|
||||
<listitem>
|
||||
<para>
|
||||
Document your option in Documentation/Configure.help. Mention
|
||||
incompatibilities and issues here. <emphasis> Definitely
|
||||
</emphasis> end your description with <quote> if in doubt, say N
|
||||
</quote> (or, occasionally, `Y'); this is for people who have no
|
||||
idea what you are talking about.
|
||||
exported here so you can usually just add a "obj-$(CONFIG_xxx) +=
|
||||
xxx.o" line. The syntax is documented in
|
||||
<filename>Documentation/kbuild/makefiles.txt</filename>.
|
||||
</para>
|
||||
</listitem>
|
||||
|
||||
|
@ -1253,20 +1239,12 @@ static struct block_device_operations opt_fops = {
|
|||
</para>
|
||||
|
||||
<para>
|
||||
<filename>include/linux/brlock.h:</filename>
|
||||
<filename>include/asm-i386/delay.h:</filename>
|
||||
</para>
|
||||
<programlisting>
|
||||
extern inline void br_read_lock (enum brlock_indices idx)
|
||||
{
|
||||
/*
|
||||
* This causes a link-time bug message if an
|
||||
* invalid index is used:
|
||||
*/
|
||||
if (idx >= __BR_END)
|
||||
__br_lock_usage_bug();
|
||||
|
||||
read_lock(&__brlock_array[smp_processor_id()][idx]);
|
||||
}
|
||||
#define ndelay(n) (__builtin_constant_p(n) ? \
|
||||
((n) > 20000 ? __bad_ndelay() : __const_udelay((n) * 5ul)) : \
|
||||
__ndelay(n))
|
||||
</programlisting>
|
||||
|
||||
<para>
|
||||
|
|
File diff suppressed because it is too large
Load diff
|
@ -96,7 +96,7 @@
|
|||
|
||||
<chapter id="pubfunctions">
|
||||
<title>Public Functions Provided</title>
|
||||
!Earch/i386/kernel/mca.c
|
||||
!Edrivers/mca/mca-legacy.c
|
||||
</chapter>
|
||||
|
||||
<chapter id="dmafunctions">
|
||||
|
|
|
@ -291,7 +291,7 @@
|
|||
|
||||
!Edrivers/usb/core/hcd.c
|
||||
!Edrivers/usb/core/hcd-pci.c
|
||||
!Edrivers/usb/core/buffer.c
|
||||
!Idrivers/usb/core/buffer.c
|
||||
</chapter>
|
||||
|
||||
<chapter>
|
||||
|
@ -841,7 +841,7 @@ usbdev_ioctl (int fd, int ifno, unsigned request, void *param)
|
|||
File modification time is not updated by this request.
|
||||
</para><para>
|
||||
Those struct members are from some interface descriptor
|
||||
applying to the the current configuration.
|
||||
applying to the current configuration.
|
||||
The interface number is the bInterfaceNumber value, and
|
||||
the altsetting number is the bAlternateSetting value.
|
||||
(This resets each endpoint in the interface.)
|
||||
|
|
|
@ -345,8 +345,7 @@ if (!retval) {
|
|||
<programlisting>
|
||||
static inline void skel_delete (struct usb_skel *dev)
|
||||
{
|
||||
if (dev->bulk_in_buffer != NULL)
|
||||
kfree (dev->bulk_in_buffer);
|
||||
kfree (dev->bulk_in_buffer);
|
||||
if (dev->bulk_out_buffer != NULL)
|
||||
usb_buffer_free (dev->udev, dev->bulk_out_size,
|
||||
dev->bulk_out_buffer,
|
||||
|
|
|
@ -605,12 +605,13 @@ is in the ipmi_poweroff module. When the system requests a powerdown,
|
|||
it will send the proper IPMI commands to do this. This is supported on
|
||||
several platforms.
|
||||
|
||||
There is a module parameter named "poweroff_control" that may either be zero
|
||||
(do a power down) or 2 (do a power cycle, power the system off, then power
|
||||
it on in a few seconds). Setting ipmi_poweroff.poweroff_control=x will do
|
||||
the same thing on the kernel command line. The parameter is also available
|
||||
via the proc filesystem in /proc/ipmi/poweroff_control. Note that if the
|
||||
system does not support power cycling, it will always to the power off.
|
||||
There is a module parameter named "poweroff_powercycle" that may
|
||||
either be zero (do a power down) or non-zero (do a power cycle, power
|
||||
the system off, then power it on in a few seconds). Setting
|
||||
ipmi_poweroff.poweroff_control=x will do the same thing on the kernel
|
||||
command line. The parameter is also available via the proc filesystem
|
||||
in /proc/sys/dev/ipmi/poweroff_powercycle. Note that if the system
|
||||
does not support power cycling, it will always do the power off.
|
||||
|
||||
Note that if you have ACPI enabled, the system will prefer using ACPI to
|
||||
power off.
|
||||
|
|
|
@ -430,7 +430,7 @@ which may result in system hang. The software driver of specific
|
|||
MSI-capable hardware is responsible for whether calling
|
||||
pci_enable_msi or not. A return of zero indicates the kernel
|
||||
successfully initializes the MSI/MSI-X capability structure of the
|
||||
device funtion. The device function is now running on MSI/MSI-X mode.
|
||||
device function. The device function is now running on MSI/MSI-X mode.
|
||||
|
||||
5.6 How to tell whether MSI/MSI-X is enabled on device function
|
||||
|
||||
|
|
112
Documentation/RCU/NMI-RCU.txt
Normal file
112
Documentation/RCU/NMI-RCU.txt
Normal file
|
@ -0,0 +1,112 @@
|
|||
Using RCU to Protect Dynamic NMI Handlers
|
||||
|
||||
|
||||
Although RCU is usually used to protect read-mostly data structures,
|
||||
it is possible to use RCU to provide dynamic non-maskable interrupt
|
||||
handlers, as well as dynamic irq handlers. This document describes
|
||||
how to do this, drawing loosely from Zwane Mwaikambo's NMI-timer
|
||||
work in "arch/i386/oprofile/nmi_timer_int.c" and in
|
||||
"arch/i386/kernel/traps.c".
|
||||
|
||||
The relevant pieces of code are listed below, each followed by a
|
||||
brief explanation.
|
||||
|
||||
static int dummy_nmi_callback(struct pt_regs *regs, int cpu)
|
||||
{
|
||||
return 0;
|
||||
}
|
||||
|
||||
The dummy_nmi_callback() function is a "dummy" NMI handler that does
|
||||
nothing, but returns zero, thus saying that it did nothing, allowing
|
||||
the NMI handler to take the default machine-specific action.
|
||||
|
||||
static nmi_callback_t nmi_callback = dummy_nmi_callback;
|
||||
|
||||
This nmi_callback variable is a global function pointer to the current
|
||||
NMI handler.
|
||||
|
||||
fastcall void do_nmi(struct pt_regs * regs, long error_code)
|
||||
{
|
||||
int cpu;
|
||||
|
||||
nmi_enter();
|
||||
|
||||
cpu = smp_processor_id();
|
||||
++nmi_count(cpu);
|
||||
|
||||
if (!rcu_dereference(nmi_callback)(regs, cpu))
|
||||
default_do_nmi(regs);
|
||||
|
||||
nmi_exit();
|
||||
}
|
||||
|
||||
The do_nmi() function processes each NMI. It first disables preemption
|
||||
in the same way that a hardware irq would, then increments the per-CPU
|
||||
count of NMIs. It then invokes the NMI handler stored in the nmi_callback
|
||||
function pointer. If this handler returns zero, do_nmi() invokes the
|
||||
default_do_nmi() function to handle a machine-specific NMI. Finally,
|
||||
preemption is restored.
|
||||
|
||||
Strictly speaking, rcu_dereference() is not needed, since this code runs
|
||||
only on i386, which does not need rcu_dereference() anyway. However,
|
||||
it is a good documentation aid, particularly for anyone attempting to
|
||||
do something similar on Alpha.
|
||||
|
||||
Quick Quiz: Why might the rcu_dereference() be necessary on Alpha,
|
||||
given that the code referenced by the pointer is read-only?
|
||||
|
||||
|
||||
Back to the discussion of NMI and RCU...
|
||||
|
||||
void set_nmi_callback(nmi_callback_t callback)
|
||||
{
|
||||
rcu_assign_pointer(nmi_callback, callback);
|
||||
}
|
||||
|
||||
The set_nmi_callback() function registers an NMI handler. Note that any
|
||||
data that is to be used by the callback must be initialized up -before-
|
||||
the call to set_nmi_callback(). On architectures that do not order
|
||||
writes, the rcu_assign_pointer() ensures that the NMI handler sees the
|
||||
initialized values.
|
||||
|
||||
void unset_nmi_callback(void)
|
||||
{
|
||||
rcu_assign_pointer(nmi_callback, dummy_nmi_callback);
|
||||
}
|
||||
|
||||
This function unregisters an NMI handler, restoring the original
|
||||
dummy_nmi_handler(). However, there may well be an NMI handler
|
||||
currently executing on some other CPU. We therefore cannot free
|
||||
up any data structures used by the old NMI handler until execution
|
||||
of it completes on all other CPUs.
|
||||
|
||||
One way to accomplish this is via synchronize_sched(), perhaps as
|
||||
follows:
|
||||
|
||||
unset_nmi_callback();
|
||||
synchronize_sched();
|
||||
kfree(my_nmi_data);
|
||||
|
||||
This works because synchronize_sched() blocks until all CPUs complete
|
||||
any preemption-disabled segments of code that they were executing.
|
||||
Since NMI handlers disable preemption, synchronize_sched() is guaranteed
|
||||
not to return until all ongoing NMI handlers exit. It is therefore safe
|
||||
to free up the handler's data as soon as synchronize_sched() returns.
|
||||
|
||||
|
||||
Answer to Quick Quiz
|
||||
|
||||
Why might the rcu_dereference() be necessary on Alpha, given
|
||||
that the code referenced by the pointer is read-only?
|
||||
|
||||
Answer: The caller to set_nmi_callback() might well have
|
||||
initialized some data that is to be used by the
|
||||
new NMI handler. In this case, the rcu_dereference()
|
||||
would be needed, because otherwise a CPU that received
|
||||
an NMI just after the new handler was set might see
|
||||
the pointer to the new NMI handler, but the old
|
||||
pre-initialized version of the handler's data.
|
||||
|
||||
More important, the rcu_dereference() makes it clear
|
||||
to someone reading the code that the pointer is being
|
||||
protected by RCU.
|
|
@ -2,7 +2,8 @@ Read the F-ing Papers!
|
|||
|
||||
|
||||
This document describes RCU-related publications, and is followed by
|
||||
the corresponding bibtex entries.
|
||||
the corresponding bibtex entries. A number of the publications may
|
||||
be found at http://www.rdrop.com/users/paulmck/RCU/.
|
||||
|
||||
The first thing resembling RCU was published in 1980, when Kung and Lehman
|
||||
[Kung80] recommended use of a garbage collector to defer destruction
|
||||
|
@ -113,6 +114,10 @@ describing how to make RCU safe for soft-realtime applications [Sarma04c],
|
|||
and a paper describing SELinux performance with RCU [JamesMorris04b].
|
||||
|
||||
|
||||
2005 has seen further adaptation of RCU to realtime use, permitting
|
||||
preemption of RCU realtime critical sections [PaulMcKenney05a,
|
||||
PaulMcKenney05b].
|
||||
|
||||
Bibtex Entries
|
||||
|
||||
@article{Kung80
|
||||
|
@ -410,3 +415,32 @@ Oregon Health and Sciences University"
|
|||
\url{http://www.livejournal.com/users/james_morris/2153.html}
|
||||
[Viewed December 10, 2004]"
|
||||
}
|
||||
|
||||
@unpublished{PaulMcKenney05a
|
||||
,Author="Paul E. McKenney"
|
||||
,Title="{[RFC]} {RCU} and {CONFIG\_PREEMPT\_RT} progress"
|
||||
,month="May"
|
||||
,year="2005"
|
||||
,note="Available:
|
||||
\url{http://lkml.org/lkml/2005/5/9/185}
|
||||
[Viewed May 13, 2005]"
|
||||
,annotation="
|
||||
First publication of working lock-based deferred free patches
|
||||
for the CONFIG_PREEMPT_RT environment.
|
||||
"
|
||||
}
|
||||
|
||||
@conference{PaulMcKenney05b
|
||||
,Author="Paul E. McKenney and Dipankar Sarma"
|
||||
,Title="Towards Hard Realtime Response from the Linux Kernel on SMP Hardware"
|
||||
,Booktitle="linux.conf.au 2005"
|
||||
,month="April"
|
||||
,year="2005"
|
||||
,address="Canberra, Australia"
|
||||
,note="Available:
|
||||
\url{http://www.rdrop.com/users/paulmck/RCU/realtimeRCU.2005.04.23a.pdf}
|
||||
[Viewed May 13, 2005]"
|
||||
,annotation="
|
||||
Realtime turns into making RCU yet more realtime friendly.
|
||||
"
|
||||
}
|
||||
|
|
|
@ -8,7 +8,7 @@ is that since there is only one CPU, it should not be necessary to
|
|||
wait for anything else to get done, since there are no other CPUs for
|
||||
anything else to be happening on. Although this approach will -sort- -of-
|
||||
work a surprising amount of the time, it is a very bad idea in general.
|
||||
This document presents two examples that demonstrate exactly how bad an
|
||||
This document presents three examples that demonstrate exactly how bad an
|
||||
idea this is.
|
||||
|
||||
|
||||
|
@ -26,6 +26,9 @@ from softirq, the list scan would find itself referencing a newly freed
|
|||
element B. This situation can greatly decrease the life expectancy of
|
||||
your kernel.
|
||||
|
||||
This same problem can occur if call_rcu() is invoked from a hardware
|
||||
interrupt handler.
|
||||
|
||||
|
||||
Example 2: Function-Call Fatality
|
||||
|
||||
|
@ -44,8 +47,37 @@ its arguments would cause it to fail to make the fundamental guarantee
|
|||
underlying RCU, namely that call_rcu() defers invoking its arguments until
|
||||
all RCU read-side critical sections currently executing have completed.
|
||||
|
||||
Quick Quiz: why is it -not- legal to invoke synchronize_rcu() in
|
||||
this case?
|
||||
Quick Quiz #1: why is it -not- legal to invoke synchronize_rcu() in
|
||||
this case?
|
||||
|
||||
|
||||
Example 3: Death by Deadlock
|
||||
|
||||
Suppose that call_rcu() is invoked while holding a lock, and that the
|
||||
callback function must acquire this same lock. In this case, if
|
||||
call_rcu() were to directly invoke the callback, the result would
|
||||
be self-deadlock.
|
||||
|
||||
In some cases, it would possible to restructure to code so that
|
||||
the call_rcu() is delayed until after the lock is released. However,
|
||||
there are cases where this can be quite ugly:
|
||||
|
||||
1. If a number of items need to be passed to call_rcu() within
|
||||
the same critical section, then the code would need to create
|
||||
a list of them, then traverse the list once the lock was
|
||||
released.
|
||||
|
||||
2. In some cases, the lock will be held across some kernel API,
|
||||
so that delaying the call_rcu() until the lock is released
|
||||
requires that the data item be passed up via a common API.
|
||||
It is far better to guarantee that callbacks are invoked
|
||||
with no locks held than to have to modify such APIs to allow
|
||||
arbitrary data items to be passed back up through them.
|
||||
|
||||
If call_rcu() directly invokes the callback, painful locking restrictions
|
||||
or API changes would be required.
|
||||
|
||||
Quick Quiz #2: What locking restriction must RCU callbacks respect?
|
||||
|
||||
|
||||
Summary
|
||||
|
@ -53,12 +85,35 @@ Summary
|
|||
Permitting call_rcu() to immediately invoke its arguments or permitting
|
||||
synchronize_rcu() to immediately return breaks RCU, even on a UP system.
|
||||
So do not do it! Even on a UP system, the RCU infrastructure -must-
|
||||
respect grace periods.
|
||||
respect grace periods, and -must- invoke callbacks from a known environment
|
||||
in which no locks are held.
|
||||
|
||||
|
||||
Answer to Quick Quiz
|
||||
Answer to Quick Quiz #1:
|
||||
Why is it -not- legal to invoke synchronize_rcu() in this case?
|
||||
|
||||
The calling function is scanning an RCU-protected linked list, and
|
||||
is therefore within an RCU read-side critical section. Therefore,
|
||||
the called function has been invoked within an RCU read-side critical
|
||||
section, and is not permitted to block.
|
||||
Because the calling function is scanning an RCU-protected linked
|
||||
list, and is therefore within an RCU read-side critical section.
|
||||
Therefore, the called function has been invoked within an RCU
|
||||
read-side critical section, and is not permitted to block.
|
||||
|
||||
Answer to Quick Quiz #2:
|
||||
What locking restriction must RCU callbacks respect?
|
||||
|
||||
Any lock that is acquired within an RCU callback must be
|
||||
acquired elsewhere using an _irq variant of the spinlock
|
||||
primitive. For example, if "mylock" is acquired by an
|
||||
RCU callback, then a process-context acquisition of this
|
||||
lock must use something like spin_lock_irqsave() to
|
||||
acquire the lock.
|
||||
|
||||
If the process-context code were to simply use spin_lock(),
|
||||
then, since RCU callbacks can be invoked from softirq context,
|
||||
the callback might be called from a softirq that interrupted
|
||||
the process-context critical section. This would result in
|
||||
self-deadlock.
|
||||
|
||||
This restriction might seem gratuitous, since very few RCU
|
||||
callbacks acquire locks directly. However, a great many RCU
|
||||
callbacks do acquire locks -indirectly-, for example, via
|
||||
the kfree() primitive.
|
||||
|
|
|
@ -43,6 +43,10 @@ over a rather long period of time, but improvements are always welcome!
|
|||
rcu_read_lock_bh()) in the read-side critical sections,
|
||||
and are also an excellent aid to readability.
|
||||
|
||||
As a rough rule of thumb, any dereference of an RCU-protected
|
||||
pointer must be covered by rcu_read_lock() or rcu_read_lock_bh()
|
||||
or by the appropriate update-side lock.
|
||||
|
||||
3. Does the update code tolerate concurrent accesses?
|
||||
|
||||
The whole point of RCU is to permit readers to run without
|
||||
|
@ -90,7 +94,11 @@ over a rather long period of time, but improvements are always welcome!
|
|||
|
||||
The rcu_dereference() primitive is used by the various
|
||||
"_rcu()" list-traversal primitives, such as the
|
||||
list_for_each_entry_rcu().
|
||||
list_for_each_entry_rcu(). Note that it is perfectly
|
||||
legal (if redundant) for update-side code to use
|
||||
rcu_dereference() and the "_rcu()" list-traversal
|
||||
primitives. This is particularly useful in code
|
||||
that is common to readers and updaters.
|
||||
|
||||
b. If the list macros are being used, the list_add_tail_rcu()
|
||||
and list_add_rcu() primitives must be used in order
|
||||
|
@ -150,16 +158,9 @@ over a rather long period of time, but improvements are always welcome!
|
|||
|
||||
Use of the _rcu() list-traversal primitives outside of an
|
||||
RCU read-side critical section causes no harm other than
|
||||
a slight performance degradation on Alpha CPUs and some
|
||||
confusion on the part of people trying to read the code.
|
||||
|
||||
Another way of thinking of this is "If you are holding the
|
||||
lock that prevents the data structure from changing, why do
|
||||
you also need RCU-based protection?" That said, there may
|
||||
well be situations where use of the _rcu() list-traversal
|
||||
primitives while the update-side lock is held results in
|
||||
simpler and more maintainable code. The jury is still out
|
||||
on this question.
|
||||
a slight performance degradation on Alpha CPUs. It can
|
||||
also be quite helpful in reducing code bloat when common
|
||||
code is shared between readers and updaters.
|
||||
|
||||
10. Conversely, if you are in an RCU read-side critical section,
|
||||
you -must- use the "_rcu()" variants of the list macros.
|
||||
|
|
|
@ -64,6 +64,54 @@ o I hear that RCU is patented? What is with that?
|
|||
Of these, one was allowed to lapse by the assignee, and the
|
||||
others have been contributed to the Linux kernel under GPL.
|
||||
|
||||
o I hear that RCU needs work in order to support realtime kernels?
|
||||
|
||||
Yes, work in progress.
|
||||
|
||||
o Where can I find more information on RCU?
|
||||
|
||||
See the RTFP.txt file in this directory.
|
||||
Or point your browser at http://www.rdrop.com/users/paulmck/RCU/.
|
||||
|
||||
o What are all these files in this directory?
|
||||
|
||||
|
||||
NMI-RCU.txt
|
||||
|
||||
Describes how to use RCU to implement dynamic
|
||||
NMI handlers, which can be revectored on the fly,
|
||||
without rebooting.
|
||||
|
||||
RTFP.txt
|
||||
|
||||
List of RCU-related publications and web sites.
|
||||
|
||||
UP.txt
|
||||
|
||||
Discussion of RCU usage in UP kernels.
|
||||
|
||||
arrayRCU.txt
|
||||
|
||||
Describes how to use RCU to protect arrays, with
|
||||
resizeable arrays whose elements reference other
|
||||
data structures being of the most interest.
|
||||
|
||||
checklist.txt
|
||||
|
||||
Lists things to check for when inspecting code that
|
||||
uses RCU.
|
||||
|
||||
listRCU.txt
|
||||
|
||||
Describes how to use RCU to protect linked lists.
|
||||
This is the simplest and most common use of RCU
|
||||
in the Linux kernel.
|
||||
|
||||
rcu.txt
|
||||
|
||||
You are reading it!
|
||||
|
||||
whatisRCU.txt
|
||||
|
||||
Overview of how the RCU implementation works. Along
|
||||
the way, presents a conceptual view of RCU.
|
||||
|
|
74
Documentation/RCU/rcuref.txt
Normal file
74
Documentation/RCU/rcuref.txt
Normal file
|
@ -0,0 +1,74 @@
|
|||
Refcounter framework for elements of lists/arrays protected by
|
||||
RCU.
|
||||
|
||||
Refcounting on elements of lists which are protected by traditional
|
||||
reader/writer spinlocks or semaphores are straight forward as in:
|
||||
|
||||
1. 2.
|
||||
add() search_and_reference()
|
||||
{ {
|
||||
alloc_object read_lock(&list_lock);
|
||||
... search_for_element
|
||||
atomic_set(&el->rc, 1); atomic_inc(&el->rc);
|
||||
write_lock(&list_lock); ...
|
||||
add_element read_unlock(&list_lock);
|
||||
... ...
|
||||
write_unlock(&list_lock); }
|
||||
}
|
||||
|
||||
3. 4.
|
||||
release_referenced() delete()
|
||||
{ {
|
||||
... write_lock(&list_lock);
|
||||
atomic_dec(&el->rc, relfunc) ...
|
||||
... delete_element
|
||||
} write_unlock(&list_lock);
|
||||
...
|
||||
if (atomic_dec_and_test(&el->rc))
|
||||
kfree(el);
|
||||
...
|
||||
}
|
||||
|
||||
If this list/array is made lock free using rcu as in changing the
|
||||
write_lock in add() and delete() to spin_lock and changing read_lock
|
||||
in search_and_reference to rcu_read_lock(), the rcuref_get in
|
||||
search_and_reference could potentially hold reference to an element which
|
||||
has already been deleted from the list/array. rcuref_lf_get_rcu takes
|
||||
care of this scenario. search_and_reference should look as;
|
||||
|
||||
1. 2.
|
||||
add() search_and_reference()
|
||||
{ {
|
||||
alloc_object rcu_read_lock();
|
||||
... search_for_element
|
||||
atomic_set(&el->rc, 1); if (rcuref_inc_lf(&el->rc)) {
|
||||
write_lock(&list_lock); rcu_read_unlock();
|
||||
return FAIL;
|
||||
add_element }
|
||||
... ...
|
||||
write_unlock(&list_lock); rcu_read_unlock();
|
||||
} }
|
||||
3. 4.
|
||||
release_referenced() delete()
|
||||
{ {
|
||||
... write_lock(&list_lock);
|
||||
rcuref_dec(&el->rc, relfunc) ...
|
||||
... delete_element
|
||||
} write_unlock(&list_lock);
|
||||
...
|
||||
if (rcuref_dec_and_test(&el->rc))
|
||||
call_rcu(&el->head, el_free);
|
||||
...
|
||||
}
|
||||
|
||||
Sometimes, reference to the element need to be obtained in the
|
||||
update (write) stream. In such cases, rcuref_inc_lf might be an overkill
|
||||
since the spinlock serialising list updates are held. rcuref_inc
|
||||
is to be used in such cases.
|
||||
For arches which do not have cmpxchg rcuref_inc_lf
|
||||
api uses a hashed spinlock implementation and the same hashed spinlock
|
||||
is acquired in all rcuref_xxx primitives to preserve atomicity.
|
||||
Note: Use rcuref_inc api only if you need to use rcuref_inc_lf on the
|
||||
refcounter atleast at one place. Mixing rcuref_inc and atomic_xxx api
|
||||
might lead to races. rcuref_inc_lf() must be used in lockfree
|
||||
RCU critical sections only.
|
122
Documentation/RCU/torture.txt
Normal file
122
Documentation/RCU/torture.txt
Normal file
|
@ -0,0 +1,122 @@
|
|||
RCU Torture Test Operation
|
||||
|
||||
|
||||
CONFIG_RCU_TORTURE_TEST
|
||||
|
||||
The CONFIG_RCU_TORTURE_TEST config option is available for all RCU
|
||||
implementations. It creates an rcutorture kernel module that can
|
||||
be loaded to run a torture test. The test periodically outputs
|
||||
status messages via printk(), which can be examined via the dmesg
|
||||
command (perhaps grepping for "rcutorture"). The test is started
|
||||
when the module is loaded, and stops when the module is unloaded.
|
||||
|
||||
However, actually setting this config option to "y" results in the system
|
||||
running the test immediately upon boot, and ending only when the system
|
||||
is taken down. Normally, one will instead want to build the system
|
||||
with CONFIG_RCU_TORTURE_TEST=m and to use modprobe and rmmod to control
|
||||
the test, perhaps using a script similar to the one shown at the end of
|
||||
this document. Note that you will need CONFIG_MODULE_UNLOAD in order
|
||||
to be able to end the test.
|
||||
|
||||
|
||||
MODULE PARAMETERS
|
||||
|
||||
This module has the following parameters:
|
||||
|
||||
nreaders This is the number of RCU reading threads supported.
|
||||
The default is twice the number of CPUs. Why twice?
|
||||
To properly exercise RCU implementations with preemptible
|
||||
read-side critical sections.
|
||||
|
||||
stat_interval The number of seconds between output of torture
|
||||
statistics (via printk()). Regardless of the interval,
|
||||
statistics are printed when the module is unloaded.
|
||||
Setting the interval to zero causes the statistics to
|
||||
be printed -only- when the module is unloaded, and this
|
||||
is the default.
|
||||
|
||||
verbose Enable debug printk()s. Default is disabled.
|
||||
|
||||
|
||||
OUTPUT
|
||||
|
||||
The statistics output is as follows:
|
||||
|
||||
rcutorture: --- Start of test: nreaders=16 stat_interval=0 verbose=0
|
||||
rcutorture: rtc: 0000000000000000 ver: 1916 tfle: 0 rta: 1916 rtaf: 0 rtf: 1915
|
||||
rcutorture: Reader Pipe: 1466408 9747 0 0 0 0 0 0 0 0 0
|
||||
rcutorture: Reader Batch: 1464477 11678 0 0 0 0 0 0 0 0
|
||||
rcutorture: Free-Block Circulation: 1915 1915 1915 1915 1915 1915 1915 1915 1915 1915 0
|
||||
rcutorture: --- End of test
|
||||
|
||||
The command "dmesg | grep rcutorture:" will extract this information on
|
||||
most systems. On more esoteric configurations, it may be necessary to
|
||||
use other commands to access the output of the printk()s used by
|
||||
the RCU torture test. The printk()s use KERN_ALERT, so they should
|
||||
be evident. ;-)
|
||||
|
||||
The entries are as follows:
|
||||
|
||||
o "ggp": The number of counter flips (or batches) since boot.
|
||||
|
||||
o "rtc": The hexadecimal address of the structure currently visible
|
||||
to readers.
|
||||
|
||||
o "ver": The number of times since boot that the rcutw writer task
|
||||
has changed the structure visible to readers.
|
||||
|
||||
o "tfle": If non-zero, indicates that the "torture freelist"
|
||||
containing structure to be placed into the "rtc" area is empty.
|
||||
This condition is important, since it can fool you into thinking
|
||||
that RCU is working when it is not. :-/
|
||||
|
||||
o "rta": Number of structures allocated from the torture freelist.
|
||||
|
||||
o "rtaf": Number of allocations from the torture freelist that have
|
||||
failed due to the list being empty.
|
||||
|
||||
o "rtf": Number of frees into the torture freelist.
|
||||
|
||||
o "Reader Pipe": Histogram of "ages" of structures seen by readers.
|
||||
If any entries past the first two are non-zero, RCU is broken.
|
||||
And rcutorture prints the error flag string "!!!" to make sure
|
||||
you notice. The age of a newly allocated structure is zero,
|
||||
it becomes one when removed from reader visibility, and is
|
||||
incremented once per grace period subsequently -- and is freed
|
||||
after passing through (RCU_TORTURE_PIPE_LEN-2) grace periods.
|
||||
|
||||
The output displayed above was taken from a correctly working
|
||||
RCU. If you want to see what it looks like when broken, break
|
||||
it yourself. ;-)
|
||||
|
||||
o "Reader Batch": Another histogram of "ages" of structures seen
|
||||
by readers, but in terms of counter flips (or batches) rather
|
||||
than in terms of grace periods. The legal number of non-zero
|
||||
entries is again two. The reason for this separate view is
|
||||
that it is easier to get the third entry to show up in the
|
||||
"Reader Batch" list than in the "Reader Pipe" list.
|
||||
|
||||
o "Free-Block Circulation": Shows the number of torture structures
|
||||
that have reached a given point in the pipeline. The first element
|
||||
should closely correspond to the number of structures allocated,
|
||||
the second to the number that have been removed from reader view,
|
||||
and all but the last remaining to the corresponding number of
|
||||
passes through a grace period. The last entry should be zero,
|
||||
as it is only incremented if a torture structure's counter
|
||||
somehow gets incremented farther than it should.
|
||||
|
||||
|
||||
USAGE
|
||||
|
||||
The following script may be used to torture RCU:
|
||||
|
||||
#!/bin/sh
|
||||
|
||||
modprobe rcutorture
|
||||
sleep 100
|
||||
rmmod rcutorture
|
||||
dmesg | grep rcutorture:
|
||||
|
||||
The output can be manually inspected for the error flag of "!!!".
|
||||
One could of course create a more elaborate script that automatically
|
||||
checked for such errors.
|
902
Documentation/RCU/whatisRCU.txt
Normal file
902
Documentation/RCU/whatisRCU.txt
Normal file
|
@ -0,0 +1,902 @@
|
|||
What is RCU?
|
||||
|
||||
RCU is a synchronization mechanism that was added to the Linux kernel
|
||||
during the 2.5 development effort that is optimized for read-mostly
|
||||
situations. Although RCU is actually quite simple once you understand it,
|
||||
getting there can sometimes be a challenge. Part of the problem is that
|
||||
most of the past descriptions of RCU have been written with the mistaken
|
||||
assumption that there is "one true way" to describe RCU. Instead,
|
||||
the experience has been that different people must take different paths
|
||||
to arrive at an understanding of RCU. This document provides several
|
||||
different paths, as follows:
|
||||
|
||||
1. RCU OVERVIEW
|
||||
2. WHAT IS RCU'S CORE API?
|
||||
3. WHAT ARE SOME EXAMPLE USES OF CORE RCU API?
|
||||
4. WHAT IF MY UPDATING THREAD CANNOT BLOCK?
|
||||
5. WHAT ARE SOME SIMPLE IMPLEMENTATIONS OF RCU?
|
||||
6. ANALOGY WITH READER-WRITER LOCKING
|
||||
7. FULL LIST OF RCU APIs
|
||||
8. ANSWERS TO QUICK QUIZZES
|
||||
|
||||
People who prefer starting with a conceptual overview should focus on
|
||||
Section 1, though most readers will profit by reading this section at
|
||||
some point. People who prefer to start with an API that they can then
|
||||
experiment with should focus on Section 2. People who prefer to start
|
||||
with example uses should focus on Sections 3 and 4. People who need to
|
||||
understand the RCU implementation should focus on Section 5, then dive
|
||||
into the kernel source code. People who reason best by analogy should
|
||||
focus on Section 6. Section 7 serves as an index to the docbook API
|
||||
documentation, and Section 8 is the traditional answer key.
|
||||
|
||||
So, start with the section that makes the most sense to you and your
|
||||
preferred method of learning. If you need to know everything about
|
||||
everything, feel free to read the whole thing -- but if you are really
|
||||
that type of person, you have perused the source code and will therefore
|
||||
never need this document anyway. ;-)
|
||||
|
||||
|
||||
1. RCU OVERVIEW
|
||||
|
||||
The basic idea behind RCU is to split updates into "removal" and
|
||||
"reclamation" phases. The removal phase removes references to data items
|
||||
within a data structure (possibly by replacing them with references to
|
||||
new versions of these data items), and can run concurrently with readers.
|
||||
The reason that it is safe to run the removal phase concurrently with
|
||||
readers is the semantics of modern CPUs guarantee that readers will see
|
||||
either the old or the new version of the data structure rather than a
|
||||
partially updated reference. The reclamation phase does the work of reclaiming
|
||||
(e.g., freeing) the data items removed from the data structure during the
|
||||
removal phase. Because reclaiming data items can disrupt any readers
|
||||
concurrently referencing those data items, the reclamation phase must
|
||||
not start until readers no longer hold references to those data items.
|
||||
|
||||
Splitting the update into removal and reclamation phases permits the
|
||||
updater to perform the removal phase immediately, and to defer the
|
||||
reclamation phase until all readers active during the removal phase have
|
||||
completed, either by blocking until they finish or by registering a
|
||||
callback that is invoked after they finish. Only readers that are active
|
||||
during the removal phase need be considered, because any reader starting
|
||||
after the removal phase will be unable to gain a reference to the removed
|
||||
data items, and therefore cannot be disrupted by the reclamation phase.
|
||||
|
||||
So the typical RCU update sequence goes something like the following:
|
||||
|
||||
a. Remove pointers to a data structure, so that subsequent
|
||||
readers cannot gain a reference to it.
|
||||
|
||||
b. Wait for all previous readers to complete their RCU read-side
|
||||
critical sections.
|
||||
|
||||
c. At this point, there cannot be any readers who hold references
|
||||
to the data structure, so it now may safely be reclaimed
|
||||
(e.g., kfree()d).
|
||||
|
||||
Step (b) above is the key idea underlying RCU's deferred destruction.
|
||||
The ability to wait until all readers are done allows RCU readers to
|
||||
use much lighter-weight synchronization, in some cases, absolutely no
|
||||
synchronization at all. In contrast, in more conventional lock-based
|
||||
schemes, readers must use heavy-weight synchronization in order to
|
||||
prevent an updater from deleting the data structure out from under them.
|
||||
This is because lock-based updaters typically update data items in place,
|
||||
and must therefore exclude readers. In contrast, RCU-based updaters
|
||||
typically take advantage of the fact that writes to single aligned
|
||||
pointers are atomic on modern CPUs, allowing atomic insertion, removal,
|
||||
and replacement of data items in a linked structure without disrupting
|
||||
readers. Concurrent RCU readers can then continue accessing the old
|
||||
versions, and can dispense with the atomic operations, memory barriers,
|
||||
and communications cache misses that are so expensive on present-day
|
||||
SMP computer systems, even in absence of lock contention.
|
||||
|
||||
In the three-step procedure shown above, the updater is performing both
|
||||
the removal and the reclamation step, but it is often helpful for an
|
||||
entirely different thread to do the reclamation, as is in fact the case
|
||||
in the Linux kernel's directory-entry cache (dcache). Even if the same
|
||||
thread performs both the update step (step (a) above) and the reclamation
|
||||
step (step (c) above), it is often helpful to think of them separately.
|
||||
For example, RCU readers and updaters need not communicate at all,
|
||||
but RCU provides implicit low-overhead communication between readers
|
||||
and reclaimers, namely, in step (b) above.
|
||||
|
||||
So how the heck can a reclaimer tell when a reader is done, given
|
||||
that readers are not doing any sort of synchronization operations???
|
||||
Read on to learn about how RCU's API makes this easy.
|
||||
|
||||
|
||||
2. WHAT IS RCU'S CORE API?
|
||||
|
||||
The core RCU API is quite small:
|
||||
|
||||
a. rcu_read_lock()
|
||||
b. rcu_read_unlock()
|
||||
c. synchronize_rcu() / call_rcu()
|
||||
d. rcu_assign_pointer()
|
||||
e. rcu_dereference()
|
||||
|
||||
There are many other members of the RCU API, but the rest can be
|
||||
expressed in terms of these five, though most implementations instead
|
||||
express synchronize_rcu() in terms of the call_rcu() callback API.
|
||||
|
||||
The five core RCU APIs are described below, the other 18 will be enumerated
|
||||
later. See the kernel docbook documentation for more info, or look directly
|
||||
at the function header comments.
|
||||
|
||||
rcu_read_lock()
|
||||
|
||||
void rcu_read_lock(void);
|
||||
|
||||
Used by a reader to inform the reclaimer that the reader is
|
||||
entering an RCU read-side critical section. It is illegal
|
||||
to block while in an RCU read-side critical section, though
|
||||
kernels built with CONFIG_PREEMPT_RCU can preempt RCU read-side
|
||||
critical sections. Any RCU-protected data structure accessed
|
||||
during an RCU read-side critical section is guaranteed to remain
|
||||
unreclaimed for the full duration of that critical section.
|
||||
Reference counts may be used in conjunction with RCU to maintain
|
||||
longer-term references to data structures.
|
||||
|
||||
rcu_read_unlock()
|
||||
|
||||
void rcu_read_unlock(void);
|
||||
|
||||
Used by a reader to inform the reclaimer that the reader is
|
||||
exiting an RCU read-side critical section. Note that RCU
|
||||
read-side critical sections may be nested and/or overlapping.
|
||||
|
||||
synchronize_rcu()
|
||||
|
||||
void synchronize_rcu(void);
|
||||
|
||||
Marks the end of updater code and the beginning of reclaimer
|
||||
code. It does this by blocking until all pre-existing RCU
|
||||
read-side critical sections on all CPUs have completed.
|
||||
Note that synchronize_rcu() will -not- necessarily wait for
|
||||
any subsequent RCU read-side critical sections to complete.
|
||||
For example, consider the following sequence of events:
|
||||
|
||||
CPU 0 CPU 1 CPU 2
|
||||
----------------- ------------------------- ---------------
|
||||
1. rcu_read_lock()
|
||||
2. enters synchronize_rcu()
|
||||
3. rcu_read_lock()
|
||||
4. rcu_read_unlock()
|
||||
5. exits synchronize_rcu()
|
||||
6. rcu_read_unlock()
|
||||
|
||||
To reiterate, synchronize_rcu() waits only for ongoing RCU
|
||||
read-side critical sections to complete, not necessarily for
|
||||
any that begin after synchronize_rcu() is invoked.
|
||||
|
||||
Of course, synchronize_rcu() does not necessarily return
|
||||
-immediately- after the last pre-existing RCU read-side critical
|
||||
section completes. For one thing, there might well be scheduling
|
||||
delays. For another thing, many RCU implementations process
|
||||
requests in batches in order to improve efficiencies, which can
|
||||
further delay synchronize_rcu().
|
||||
|
||||
Since synchronize_rcu() is the API that must figure out when
|
||||
readers are done, its implementation is key to RCU. For RCU
|
||||
to be useful in all but the most read-intensive situations,
|
||||
synchronize_rcu()'s overhead must also be quite small.
|
||||
|
||||
The call_rcu() API is a callback form of synchronize_rcu(),
|
||||
and is described in more detail in a later section. Instead of
|
||||
blocking, it registers a function and argument which are invoked
|
||||
after all ongoing RCU read-side critical sections have completed.
|
||||
This callback variant is particularly useful in situations where
|
||||
it is illegal to block.
|
||||
|
||||
rcu_assign_pointer()
|
||||
|
||||
typeof(p) rcu_assign_pointer(p, typeof(p) v);
|
||||
|
||||
Yes, rcu_assign_pointer() -is- implemented as a macro, though it
|
||||
would be cool to be able to declare a function in this manner.
|
||||
(Compiler experts will no doubt disagree.)
|
||||
|
||||
The updater uses this function to assign a new value to an
|
||||
RCU-protected pointer, in order to safely communicate the change
|
||||
in value from the updater to the reader. This function returns
|
||||
the new value, and also executes any memory-barrier instructions
|
||||
required for a given CPU architecture.
|
||||
|
||||
Perhaps more important, it serves to document which pointers
|
||||
are protected by RCU. That said, rcu_assign_pointer() is most
|
||||
frequently used indirectly, via the _rcu list-manipulation
|
||||
primitives such as list_add_rcu().
|
||||
|
||||
rcu_dereference()
|
||||
|
||||
typeof(p) rcu_dereference(p);
|
||||
|
||||
Like rcu_assign_pointer(), rcu_dereference() must be implemented
|
||||
as a macro.
|
||||
|
||||
The reader uses rcu_dereference() to fetch an RCU-protected
|
||||
pointer, which returns a value that may then be safely
|
||||
dereferenced. Note that rcu_deference() does not actually
|
||||
dereference the pointer, instead, it protects the pointer for
|
||||
later dereferencing. It also executes any needed memory-barrier
|
||||
instructions for a given CPU architecture. Currently, only Alpha
|
||||
needs memory barriers within rcu_dereference() -- on other CPUs,
|
||||
it compiles to nothing, not even a compiler directive.
|
||||
|
||||
Common coding practice uses rcu_dereference() to copy an
|
||||
RCU-protected pointer to a local variable, then dereferences
|
||||
this local variable, for example as follows:
|
||||
|
||||
p = rcu_dereference(head.next);
|
||||
return p->data;
|
||||
|
||||
However, in this case, one could just as easily combine these
|
||||
into one statement:
|
||||
|
||||
return rcu_dereference(head.next)->data;
|
||||
|
||||
If you are going to be fetching multiple fields from the
|
||||
RCU-protected structure, using the local variable is of
|
||||
course preferred. Repeated rcu_dereference() calls look
|
||||
ugly and incur unnecessary overhead on Alpha CPUs.
|
||||
|
||||
Note that the value returned by rcu_dereference() is valid
|
||||
only within the enclosing RCU read-side critical section.
|
||||
For example, the following is -not- legal:
|
||||
|
||||
rcu_read_lock();
|
||||
p = rcu_dereference(head.next);
|
||||
rcu_read_unlock();
|
||||
x = p->address;
|
||||
rcu_read_lock();
|
||||
y = p->data;
|
||||
rcu_read_unlock();
|
||||
|
||||
Holding a reference from one RCU read-side critical section
|
||||
to another is just as illegal as holding a reference from
|
||||
one lock-based critical section to another! Similarly,
|
||||
using a reference outside of the critical section in which
|
||||
it was acquired is just as illegal as doing so with normal
|
||||
locking.
|
||||
|
||||
As with rcu_assign_pointer(), an important function of
|
||||
rcu_dereference() is to document which pointers are protected
|
||||
by RCU. And, again like rcu_assign_pointer(), rcu_dereference()
|
||||
is typically used indirectly, via the _rcu list-manipulation
|
||||
primitives, such as list_for_each_entry_rcu().
|
||||
|
||||
The following diagram shows how each API communicates among the
|
||||
reader, updater, and reclaimer.
|
||||
|
||||
|
||||
rcu_assign_pointer()
|
||||
+--------+
|
||||
+---------------------->| reader |---------+
|
||||
| +--------+ |
|
||||
| | |
|
||||
| | | Protect:
|
||||
| | | rcu_read_lock()
|
||||
| | | rcu_read_unlock()
|
||||
| rcu_dereference() | |
|
||||
+---------+ | |
|
||||
| updater |<---------------------+ |
|
||||
+---------+ V
|
||||
| +-----------+
|
||||
+----------------------------------->| reclaimer |
|
||||
+-----------+
|
||||
Defer:
|
||||
synchronize_rcu() & call_rcu()
|
||||
|
||||
|
||||
The RCU infrastructure observes the time sequence of rcu_read_lock(),
|
||||
rcu_read_unlock(), synchronize_rcu(), and call_rcu() invocations in
|
||||
order to determine when (1) synchronize_rcu() invocations may return
|
||||
to their callers and (2) call_rcu() callbacks may be invoked. Efficient
|
||||
implementations of the RCU infrastructure make heavy use of batching in
|
||||
order to amortize their overhead over many uses of the corresponding APIs.
|
||||
|
||||
There are no fewer than three RCU mechanisms in the Linux kernel; the
|
||||
diagram above shows the first one, which is by far the most commonly used.
|
||||
The rcu_dereference() and rcu_assign_pointer() primitives are used for
|
||||
all three mechanisms, but different defer and protect primitives are
|
||||
used as follows:
|
||||
|
||||
Defer Protect
|
||||
|
||||
a. synchronize_rcu() rcu_read_lock() / rcu_read_unlock()
|
||||
call_rcu()
|
||||
|
||||
b. call_rcu_bh() rcu_read_lock_bh() / rcu_read_unlock_bh()
|
||||
|
||||
c. synchronize_sched() preempt_disable() / preempt_enable()
|
||||
local_irq_save() / local_irq_restore()
|
||||
hardirq enter / hardirq exit
|
||||
NMI enter / NMI exit
|
||||
|
||||
These three mechanisms are used as follows:
|
||||
|
||||
a. RCU applied to normal data structures.
|
||||
|
||||
b. RCU applied to networking data structures that may be subjected
|
||||
to remote denial-of-service attacks.
|
||||
|
||||
c. RCU applied to scheduler and interrupt/NMI-handler tasks.
|
||||
|
||||
Again, most uses will be of (a). The (b) and (c) cases are important
|
||||
for specialized uses, but are relatively uncommon.
|
||||
|
||||
|
||||
3. WHAT ARE SOME EXAMPLE USES OF CORE RCU API?
|
||||
|
||||
This section shows a simple use of the core RCU API to protect a
|
||||
global pointer to a dynamically allocated structure. More typical
|
||||
uses of RCU may be found in listRCU.txt, arrayRCU.txt, and NMI-RCU.txt.
|
||||
|
||||
struct foo {
|
||||
int a;
|
||||
char b;
|
||||
long c;
|
||||
};
|
||||
DEFINE_SPINLOCK(foo_mutex);
|
||||
|
||||
struct foo *gbl_foo;
|
||||
|
||||
/*
|
||||
* Create a new struct foo that is the same as the one currently
|
||||
* pointed to by gbl_foo, except that field "a" is replaced
|
||||
* with "new_a". Points gbl_foo to the new structure, and
|
||||
* frees up the old structure after a grace period.
|
||||
*
|
||||
* Uses rcu_assign_pointer() to ensure that concurrent readers
|
||||
* see the initialized version of the new structure.
|
||||
*
|
||||
* Uses synchronize_rcu() to ensure that any readers that might
|
||||
* have references to the old structure complete before freeing
|
||||
* the old structure.
|
||||
*/
|
||||
void foo_update_a(int new_a)
|
||||
{
|
||||
struct foo *new_fp;
|
||||
struct foo *old_fp;
|
||||
|
||||
new_fp = kmalloc(sizeof(*fp), GFP_KERNEL);
|
||||
spin_lock(&foo_mutex);
|
||||
old_fp = gbl_foo;
|
||||
*new_fp = *old_fp;
|
||||
new_fp->a = new_a;
|
||||
rcu_assign_pointer(gbl_foo, new_fp);
|
||||
spin_unlock(&foo_mutex);
|
||||
synchronize_rcu();
|
||||
kfree(old_fp);
|
||||
}
|
||||
|
||||
/*
|
||||
* Return the value of field "a" of the current gbl_foo
|
||||
* structure. Use rcu_read_lock() and rcu_read_unlock()
|
||||
* to ensure that the structure does not get deleted out
|
||||
* from under us, and use rcu_dereference() to ensure that
|
||||
* we see the initialized version of the structure (important
|
||||
* for DEC Alpha and for people reading the code).
|
||||
*/
|
||||
int foo_get_a(void)
|
||||
{
|
||||
int retval;
|
||||
|
||||
rcu_read_lock();
|
||||
retval = rcu_dereference(gbl_foo)->a;
|
||||
rcu_read_unlock();
|
||||
return retval;
|
||||
}
|
||||
|
||||
So, to sum up:
|
||||
|
||||
o Use rcu_read_lock() and rcu_read_unlock() to guard RCU
|
||||
read-side critical sections.
|
||||
|
||||
o Within an RCU read-side critical section, use rcu_dereference()
|
||||
to dereference RCU-protected pointers.
|
||||
|
||||
o Use some solid scheme (such as locks or semaphores) to
|
||||
keep concurrent updates from interfering with each other.
|
||||
|
||||
o Use rcu_assign_pointer() to update an RCU-protected pointer.
|
||||
This primitive protects concurrent readers from the updater,
|
||||
-not- concurrent updates from each other! You therefore still
|
||||
need to use locking (or something similar) to keep concurrent
|
||||
rcu_assign_pointer() primitives from interfering with each other.
|
||||
|
||||
o Use synchronize_rcu() -after- removing a data element from an
|
||||
RCU-protected data structure, but -before- reclaiming/freeing
|
||||
the data element, in order to wait for the completion of all
|
||||
RCU read-side critical sections that might be referencing that
|
||||
data item.
|
||||
|
||||
See checklist.txt for additional rules to follow when using RCU.
|
||||
|
||||
|
||||
4. WHAT IF MY UPDATING THREAD CANNOT BLOCK?
|
||||
|
||||
In the example above, foo_update_a() blocks until a grace period elapses.
|
||||
This is quite simple, but in some cases one cannot afford to wait so
|
||||
long -- there might be other high-priority work to be done.
|
||||
|
||||
In such cases, one uses call_rcu() rather than synchronize_rcu().
|
||||
The call_rcu() API is as follows:
|
||||
|
||||
void call_rcu(struct rcu_head * head,
|
||||
void (*func)(struct rcu_head *head));
|
||||
|
||||
This function invokes func(head) after a grace period has elapsed.
|
||||
This invocation might happen from either softirq or process context,
|
||||
so the function is not permitted to block. The foo struct needs to
|
||||
have an rcu_head structure added, perhaps as follows:
|
||||
|
||||
struct foo {
|
||||
int a;
|
||||
char b;
|
||||
long c;
|
||||
struct rcu_head rcu;
|
||||
};
|
||||
|
||||
The foo_update_a() function might then be written as follows:
|
||||
|
||||
/*
|
||||
* Create a new struct foo that is the same as the one currently
|
||||
* pointed to by gbl_foo, except that field "a" is replaced
|
||||
* with "new_a". Points gbl_foo to the new structure, and
|
||||
* frees up the old structure after a grace period.
|
||||
*
|
||||
* Uses rcu_assign_pointer() to ensure that concurrent readers
|
||||
* see the initialized version of the new structure.
|
||||
*
|
||||
* Uses call_rcu() to ensure that any readers that might have
|
||||
* references to the old structure complete before freeing the
|
||||
* old structure.
|
||||
*/
|
||||
void foo_update_a(int new_a)
|
||||
{
|
||||
struct foo *new_fp;
|
||||
struct foo *old_fp;
|
||||
|
||||
new_fp = kmalloc(sizeof(*fp), GFP_KERNEL);
|
||||
spin_lock(&foo_mutex);
|
||||
old_fp = gbl_foo;
|
||||
*new_fp = *old_fp;
|
||||
new_fp->a = new_a;
|
||||
rcu_assign_pointer(gbl_foo, new_fp);
|
||||
spin_unlock(&foo_mutex);
|
||||
call_rcu(&old_fp->rcu, foo_reclaim);
|
||||
}
|
||||
|
||||
The foo_reclaim() function might appear as follows:
|
||||
|
||||
void foo_reclaim(struct rcu_head *rp)
|
||||
{
|
||||
struct foo *fp = container_of(rp, struct foo, rcu);
|
||||
|
||||
kfree(fp);
|
||||
}
|
||||
|
||||
The container_of() primitive is a macro that, given a pointer into a
|
||||
struct, the type of the struct, and the pointed-to field within the
|
||||
struct, returns a pointer to the beginning of the struct.
|
||||
|
||||
The use of call_rcu() permits the caller of foo_update_a() to
|
||||
immediately regain control, without needing to worry further about the
|
||||
old version of the newly updated element. It also clearly shows the
|
||||
RCU distinction between updater, namely foo_update_a(), and reclaimer,
|
||||
namely foo_reclaim().
|
||||
|
||||
The summary of advice is the same as for the previous section, except
|
||||
that we are now using call_rcu() rather than synchronize_rcu():
|
||||
|
||||
o Use call_rcu() -after- removing a data element from an
|
||||
RCU-protected data structure in order to register a callback
|
||||
function that will be invoked after the completion of all RCU
|
||||
read-side critical sections that might be referencing that
|
||||
data item.
|
||||
|
||||
Again, see checklist.txt for additional rules governing the use of RCU.
|
||||
|
||||
|
||||
5. WHAT ARE SOME SIMPLE IMPLEMENTATIONS OF RCU?
|
||||
|
||||
One of the nice things about RCU is that it has extremely simple "toy"
|
||||
implementations that are a good first step towards understanding the
|
||||
production-quality implementations in the Linux kernel. This section
|
||||
presents two such "toy" implementations of RCU, one that is implemented
|
||||
in terms of familiar locking primitives, and another that more closely
|
||||
resembles "classic" RCU. Both are way too simple for real-world use,
|
||||
lacking both functionality and performance. However, they are useful
|
||||
in getting a feel for how RCU works. See kernel/rcupdate.c for a
|
||||
production-quality implementation, and see:
|
||||
|
||||
http://www.rdrop.com/users/paulmck/RCU
|
||||
|
||||
for papers describing the Linux kernel RCU implementation. The OLS'01
|
||||
and OLS'02 papers are a good introduction, and the dissertation provides
|
||||
more details on the current implementation.
|
||||
|
||||
|
||||
5A. "TOY" IMPLEMENTATION #1: LOCKING
|
||||
|
||||
This section presents a "toy" RCU implementation that is based on
|
||||
familiar locking primitives. Its overhead makes it a non-starter for
|
||||
real-life use, as does its lack of scalability. It is also unsuitable
|
||||
for realtime use, since it allows scheduling latency to "bleed" from
|
||||
one read-side critical section to another.
|
||||
|
||||
However, it is probably the easiest implementation to relate to, so is
|
||||
a good starting point.
|
||||
|
||||
It is extremely simple:
|
||||
|
||||
static DEFINE_RWLOCK(rcu_gp_mutex);
|
||||
|
||||
void rcu_read_lock(void)
|
||||
{
|
||||
read_lock(&rcu_gp_mutex);
|
||||
}
|
||||
|
||||
void rcu_read_unlock(void)
|
||||
{
|
||||
read_unlock(&rcu_gp_mutex);
|
||||
}
|
||||
|
||||
void synchronize_rcu(void)
|
||||
{
|
||||
write_lock(&rcu_gp_mutex);
|
||||
write_unlock(&rcu_gp_mutex);
|
||||
}
|
||||
|
||||
[You can ignore rcu_assign_pointer() and rcu_dereference() without
|
||||
missing much. But here they are anyway. And whatever you do, don't
|
||||
forget about them when submitting patches making use of RCU!]
|
||||
|
||||
#define rcu_assign_pointer(p, v) ({ \
|
||||
smp_wmb(); \
|
||||
(p) = (v); \
|
||||
})
|
||||
|
||||
#define rcu_dereference(p) ({ \
|
||||
typeof(p) _________p1 = p; \
|
||||
smp_read_barrier_depends(); \
|
||||
(_________p1); \
|
||||
})
|
||||
|
||||
|
||||
The rcu_read_lock() and rcu_read_unlock() primitive read-acquire
|
||||
and release a global reader-writer lock. The synchronize_rcu()
|
||||
primitive write-acquires this same lock, then immediately releases
|
||||
it. This means that once synchronize_rcu() exits, all RCU read-side
|
||||
critical sections that were in progress before synchonize_rcu() was
|
||||
called are guaranteed to have completed -- there is no way that
|
||||
synchronize_rcu() would have been able to write-acquire the lock
|
||||
otherwise.
|
||||
|
||||
It is possible to nest rcu_read_lock(), since reader-writer locks may
|
||||
be recursively acquired. Note also that rcu_read_lock() is immune
|
||||
from deadlock (an important property of RCU). The reason for this is
|
||||
that the only thing that can block rcu_read_lock() is a synchronize_rcu().
|
||||
But synchronize_rcu() does not acquire any locks while holding rcu_gp_mutex,
|
||||
so there can be no deadlock cycle.
|
||||
|
||||
Quick Quiz #1: Why is this argument naive? How could a deadlock
|
||||
occur when using this algorithm in a real-world Linux
|
||||
kernel? How could this deadlock be avoided?
|
||||
|
||||
|
||||
5B. "TOY" EXAMPLE #2: CLASSIC RCU
|
||||
|
||||
This section presents a "toy" RCU implementation that is based on
|
||||
"classic RCU". It is also short on performance (but only for updates) and
|
||||
on features such as hotplug CPU and the ability to run in CONFIG_PREEMPT
|
||||
kernels. The definitions of rcu_dereference() and rcu_assign_pointer()
|
||||
are the same as those shown in the preceding section, so they are omitted.
|
||||
|
||||
void rcu_read_lock(void) { }
|
||||
|
||||
void rcu_read_unlock(void) { }
|
||||
|
||||
void synchronize_rcu(void)
|
||||
{
|
||||
int cpu;
|
||||
|
||||
for_each_cpu(cpu)
|
||||
run_on(cpu);
|
||||
}
|
||||
|
||||
Note that rcu_read_lock() and rcu_read_unlock() do absolutely nothing.
|
||||
This is the great strength of classic RCU in a non-preemptive kernel:
|
||||
read-side overhead is precisely zero, at least on non-Alpha CPUs.
|
||||
And there is absolutely no way that rcu_read_lock() can possibly
|
||||
participate in a deadlock cycle!
|
||||
|
||||
The implementation of synchronize_rcu() simply schedules itself on each
|
||||
CPU in turn. The run_on() primitive can be implemented straightforwardly
|
||||
in terms of the sched_setaffinity() primitive. Of course, a somewhat less
|
||||
"toy" implementation would restore the affinity upon completion rather
|
||||
than just leaving all tasks running on the last CPU, but when I said
|
||||
"toy", I meant -toy-!
|
||||
|
||||
So how the heck is this supposed to work???
|
||||
|
||||
Remember that it is illegal to block while in an RCU read-side critical
|
||||
section. Therefore, if a given CPU executes a context switch, we know
|
||||
that it must have completed all preceding RCU read-side critical sections.
|
||||
Once -all- CPUs have executed a context switch, then -all- preceding
|
||||
RCU read-side critical sections will have completed.
|
||||
|
||||
So, suppose that we remove a data item from its structure and then invoke
|
||||
synchronize_rcu(). Once synchronize_rcu() returns, we are guaranteed
|
||||
that there are no RCU read-side critical sections holding a reference
|
||||
to that data item, so we can safely reclaim it.
|
||||
|
||||
Quick Quiz #2: Give an example where Classic RCU's read-side
|
||||
overhead is -negative-.
|
||||
|
||||
Quick Quiz #3: If it is illegal to block in an RCU read-side
|
||||
critical section, what the heck do you do in
|
||||
PREEMPT_RT, where normal spinlocks can block???
|
||||
|
||||
|
||||
6. ANALOGY WITH READER-WRITER LOCKING
|
||||
|
||||
Although RCU can be used in many different ways, a very common use of
|
||||
RCU is analogous to reader-writer locking. The following unified
|
||||
diff shows how closely related RCU and reader-writer locking can be.
|
||||
|
||||
@@ -13,15 +14,15 @@
|
||||
struct list_head *lp;
|
||||
struct el *p;
|
||||
|
||||
- read_lock();
|
||||
- list_for_each_entry(p, head, lp) {
|
||||
+ rcu_read_lock();
|
||||
+ list_for_each_entry_rcu(p, head, lp) {
|
||||
if (p->key == key) {
|
||||
*result = p->data;
|
||||
- read_unlock();
|
||||
+ rcu_read_unlock();
|
||||
return 1;
|
||||
}
|
||||
}
|
||||
- read_unlock();
|
||||
+ rcu_read_unlock();
|
||||
return 0;
|
||||
}
|
||||
|
||||
@@ -29,15 +30,16 @@
|
||||
{
|
||||
struct el *p;
|
||||
|
||||
- write_lock(&listmutex);
|
||||
+ spin_lock(&listmutex);
|
||||
list_for_each_entry(p, head, lp) {
|
||||
if (p->key == key) {
|
||||
list_del(&p->list);
|
||||
- write_unlock(&listmutex);
|
||||
+ spin_unlock(&listmutex);
|
||||
+ synchronize_rcu();
|
||||
kfree(p);
|
||||
return 1;
|
||||
}
|
||||
}
|
||||
- write_unlock(&listmutex);
|
||||
+ spin_unlock(&listmutex);
|
||||
return 0;
|
||||
}
|
||||
|
||||
Or, for those who prefer a side-by-side listing:
|
||||
|
||||
1 struct el { 1 struct el {
|
||||
2 struct list_head list; 2 struct list_head list;
|
||||
3 long key; 3 long key;
|
||||
4 spinlock_t mutex; 4 spinlock_t mutex;
|
||||
5 int data; 5 int data;
|
||||
6 /* Other data fields */ 6 /* Other data fields */
|
||||
7 }; 7 };
|
||||
8 spinlock_t listmutex; 8 spinlock_t listmutex;
|
||||
9 struct el head; 9 struct el head;
|
||||
|
||||
1 int search(long key, int *result) 1 int search(long key, int *result)
|
||||
2 { 2 {
|
||||
3 struct list_head *lp; 3 struct list_head *lp;
|
||||
4 struct el *p; 4 struct el *p;
|
||||
5 5
|
||||
6 read_lock(); 6 rcu_read_lock();
|
||||
7 list_for_each_entry(p, head, lp) { 7 list_for_each_entry_rcu(p, head, lp) {
|
||||
8 if (p->key == key) { 8 if (p->key == key) {
|
||||
9 *result = p->data; 9 *result = p->data;
|
||||
10 read_unlock(); 10 rcu_read_unlock();
|
||||
11 return 1; 11 return 1;
|
||||
12 } 12 }
|
||||
13 } 13 }
|
||||
14 read_unlock(); 14 rcu_read_unlock();
|
||||
15 return 0; 15 return 0;
|
||||
16 } 16 }
|
||||
|
||||
1 int delete(long key) 1 int delete(long key)
|
||||
2 { 2 {
|
||||
3 struct el *p; 3 struct el *p;
|
||||
4 4
|
||||
5 write_lock(&listmutex); 5 spin_lock(&listmutex);
|
||||
6 list_for_each_entry(p, head, lp) { 6 list_for_each_entry(p, head, lp) {
|
||||
7 if (p->key == key) { 7 if (p->key == key) {
|
||||
8 list_del(&p->list); 8 list_del(&p->list);
|
||||
9 write_unlock(&listmutex); 9 spin_unlock(&listmutex);
|
||||
10 synchronize_rcu();
|
||||
10 kfree(p); 11 kfree(p);
|
||||
11 return 1; 12 return 1;
|
||||
12 } 13 }
|
||||
13 } 14 }
|
||||
14 write_unlock(&listmutex); 15 spin_unlock(&listmutex);
|
||||
15 return 0; 16 return 0;
|
||||
16 } 17 }
|
||||
|
||||
Either way, the differences are quite small. Read-side locking moves
|
||||
to rcu_read_lock() and rcu_read_unlock, update-side locking moves from
|
||||
from a reader-writer lock to a simple spinlock, and a synchronize_rcu()
|
||||
precedes the kfree().
|
||||
|
||||
However, there is one potential catch: the read-side and update-side
|
||||
critical sections can now run concurrently. In many cases, this will
|
||||
not be a problem, but it is necessary to check carefully regardless.
|
||||
For example, if multiple independent list updates must be seen as
|
||||
a single atomic update, converting to RCU will require special care.
|
||||
|
||||
Also, the presence of synchronize_rcu() means that the RCU version of
|
||||
delete() can now block. If this is a problem, there is a callback-based
|
||||
mechanism that never blocks, namely call_rcu(), that can be used in
|
||||
place of synchronize_rcu().
|
||||
|
||||
|
||||
7. FULL LIST OF RCU APIs
|
||||
|
||||
The RCU APIs are documented in docbook-format header comments in the
|
||||
Linux-kernel source code, but it helps to have a full list of the
|
||||
APIs, since there does not appear to be a way to categorize them
|
||||
in docbook. Here is the list, by category.
|
||||
|
||||
Markers for RCU read-side critical sections:
|
||||
|
||||
rcu_read_lock
|
||||
rcu_read_unlock
|
||||
rcu_read_lock_bh
|
||||
rcu_read_unlock_bh
|
||||
|
||||
RCU pointer/list traversal:
|
||||
|
||||
rcu_dereference
|
||||
list_for_each_rcu (to be deprecated in favor of
|
||||
list_for_each_entry_rcu)
|
||||
list_for_each_safe_rcu (deprecated, not used)
|
||||
list_for_each_entry_rcu
|
||||
list_for_each_continue_rcu (to be deprecated in favor of new
|
||||
list_for_each_entry_continue_rcu)
|
||||
hlist_for_each_rcu (to be deprecated in favor of
|
||||
hlist_for_each_entry_rcu)
|
||||
hlist_for_each_entry_rcu
|
||||
|
||||
RCU pointer update:
|
||||
|
||||
rcu_assign_pointer
|
||||
list_add_rcu
|
||||
list_add_tail_rcu
|
||||
list_del_rcu
|
||||
list_replace_rcu
|
||||
hlist_del_rcu
|
||||
hlist_add_head_rcu
|
||||
|
||||
RCU grace period:
|
||||
|
||||
synchronize_kernel (deprecated)
|
||||
synchronize_net
|
||||
synchronize_sched
|
||||
synchronize_rcu
|
||||
call_rcu
|
||||
call_rcu_bh
|
||||
|
||||
See the comment headers in the source code (or the docbook generated
|
||||
from them) for more information.
|
||||
|
||||
|
||||
8. ANSWERS TO QUICK QUIZZES
|
||||
|
||||
Quick Quiz #1: Why is this argument naive? How could a deadlock
|
||||
occur when using this algorithm in a real-world Linux
|
||||
kernel? [Referring to the lock-based "toy" RCU
|
||||
algorithm.]
|
||||
|
||||
Answer: Consider the following sequence of events:
|
||||
|
||||
1. CPU 0 acquires some unrelated lock, call it
|
||||
"problematic_lock".
|
||||
|
||||
2. CPU 1 enters synchronize_rcu(), write-acquiring
|
||||
rcu_gp_mutex.
|
||||
|
||||
3. CPU 0 enters rcu_read_lock(), but must wait
|
||||
because CPU 1 holds rcu_gp_mutex.
|
||||
|
||||
4. CPU 1 is interrupted, and the irq handler
|
||||
attempts to acquire problematic_lock.
|
||||
|
||||
The system is now deadlocked.
|
||||
|
||||
One way to avoid this deadlock is to use an approach like
|
||||
that of CONFIG_PREEMPT_RT, where all normal spinlocks
|
||||
become blocking locks, and all irq handlers execute in
|
||||
the context of special tasks. In this case, in step 4
|
||||
above, the irq handler would block, allowing CPU 1 to
|
||||
release rcu_gp_mutex, avoiding the deadlock.
|
||||
|
||||
Even in the absence of deadlock, this RCU implementation
|
||||
allows latency to "bleed" from readers to other
|
||||
readers through synchronize_rcu(). To see this,
|
||||
consider task A in an RCU read-side critical section
|
||||
(thus read-holding rcu_gp_mutex), task B blocked
|
||||
attempting to write-acquire rcu_gp_mutex, and
|
||||
task C blocked in rcu_read_lock() attempting to
|
||||
read_acquire rcu_gp_mutex. Task A's RCU read-side
|
||||
latency is holding up task C, albeit indirectly via
|
||||
task B.
|
||||
|
||||
Realtime RCU implementations therefore use a counter-based
|
||||
approach where tasks in RCU read-side critical sections
|
||||
cannot be blocked by tasks executing synchronize_rcu().
|
||||
|
||||
Quick Quiz #2: Give an example where Classic RCU's read-side
|
||||
overhead is -negative-.
|
||||
|
||||
Answer: Imagine a single-CPU system with a non-CONFIG_PREEMPT
|
||||
kernel where a routing table is used by process-context
|
||||
code, but can be updated by irq-context code (for example,
|
||||
by an "ICMP REDIRECT" packet). The usual way of handling
|
||||
this would be to have the process-context code disable
|
||||
interrupts while searching the routing table. Use of
|
||||
RCU allows such interrupt-disabling to be dispensed with.
|
||||
Thus, without RCU, you pay the cost of disabling interrupts,
|
||||
and with RCU you don't.
|
||||
|
||||
One can argue that the overhead of RCU in this
|
||||
case is negative with respect to the single-CPU
|
||||
interrupt-disabling approach. Others might argue that
|
||||
the overhead of RCU is merely zero, and that replacing
|
||||
the positive overhead of the interrupt-disabling scheme
|
||||
with the zero-overhead RCU scheme does not constitute
|
||||
negative overhead.
|
||||
|
||||
In real life, of course, things are more complex. But
|
||||
even the theoretical possibility of negative overhead for
|
||||
a synchronization primitive is a bit unexpected. ;-)
|
||||
|
||||
Quick Quiz #3: If it is illegal to block in an RCU read-side
|
||||
critical section, what the heck do you do in
|
||||
PREEMPT_RT, where normal spinlocks can block???
|
||||
|
||||
Answer: Just as PREEMPT_RT permits preemption of spinlock
|
||||
critical sections, it permits preemption of RCU
|
||||
read-side critical sections. It also permits
|
||||
spinlocks blocking while in RCU read-side critical
|
||||
sections.
|
||||
|
||||
Why the apparent inconsistency? Because it is it
|
||||
possible to use priority boosting to keep the RCU
|
||||
grace periods short if need be (for example, if running
|
||||
short of memory). In contrast, if blocking waiting
|
||||
for (say) network reception, there is no way to know
|
||||
what should be boosted. Especially given that the
|
||||
process we need to boost might well be a human being
|
||||
who just went out for a pizza or something. And although
|
||||
a computer-operated cattle prod might arouse serious
|
||||
interest, it might also provoke serious objections.
|
||||
Besides, how does the computer know what pizza parlor
|
||||
the human being went to???
|
||||
|
||||
|
||||
ACKNOWLEDGEMENTS
|
||||
|
||||
My thanks to the people who helped make this human-readable, including
|
||||
Jon Walpole, Josh Triplett, Serge Hallyn, and Suzanne Wood.
|
||||
|
||||
|
||||
For more information, see http://www.rdrop.com/users/paulmck/RCU.
|
|
@ -301,8 +301,84 @@ now, but you can do this to mark internal company procedures or just
|
|||
point out some special detail about the sign-off.
|
||||
|
||||
|
||||
12) The canonical patch format
|
||||
|
||||
12) More references for submitting patches
|
||||
The canonical patch subject line is:
|
||||
|
||||
Subject: [PATCH 001/123] subsystem: summary phrase
|
||||
|
||||
The canonical patch message body contains the following:
|
||||
|
||||
- A "from" line specifying the patch author.
|
||||
|
||||
- An empty line.
|
||||
|
||||
- The body of the explanation, which will be copied to the
|
||||
permanent changelog to describe this patch.
|
||||
|
||||
- The "Signed-off-by:" lines, described above, which will
|
||||
also go in the changelog.
|
||||
|
||||
- A marker line containing simply "---".
|
||||
|
||||
- Any additional comments not suitable for the changelog.
|
||||
|
||||
- The actual patch (diff output).
|
||||
|
||||
The Subject line format makes it very easy to sort the emails
|
||||
alphabetically by subject line - pretty much any email reader will
|
||||
support that - since because the sequence number is zero-padded,
|
||||
the numerical and alphabetic sort is the same.
|
||||
|
||||
The "subsystem" in the email's Subject should identify which
|
||||
area or subsystem of the kernel is being patched.
|
||||
|
||||
The "summary phrase" in the email's Subject should concisely
|
||||
describe the patch which that email contains. The "summary
|
||||
phrase" should not be a filename. Do not use the same "summary
|
||||
phrase" for every patch in a whole patch series.
|
||||
|
||||
Bear in mind that the "summary phrase" of your email becomes
|
||||
a globally-unique identifier for that patch. It propagates
|
||||
all the way into the git changelog. The "summary phrase" may
|
||||
later be used in developer discussions which refer to the patch.
|
||||
People will want to google for the "summary phrase" to read
|
||||
discussion regarding that patch.
|
||||
|
||||
A couple of example Subjects:
|
||||
|
||||
Subject: [patch 2/5] ext2: improve scalability of bitmap searching
|
||||
Subject: [PATCHv2 001/207] x86: fix eflags tracking
|
||||
|
||||
The "from" line must be the very first line in the message body,
|
||||
and has the form:
|
||||
|
||||
From: Original Author <author@example.com>
|
||||
|
||||
The "from" line specifies who will be credited as the author of the
|
||||
patch in the permanent changelog. If the "from" line is missing,
|
||||
then the "From:" line from the email header will be used to determine
|
||||
the patch author in the changelog.
|
||||
|
||||
The explanation body will be committed to the permanent source
|
||||
changelog, so should make sense to a competent reader who has long
|
||||
since forgotten the immediate details of the discussion that might
|
||||
have led to this patch.
|
||||
|
||||
The "---" marker line serves the essential purpose of marking for patch
|
||||
handling tools where the changelog message ends.
|
||||
|
||||
One good use for the additional comments after the "---" marker is for
|
||||
a diffstat, to show what files have changed, and the number of inserted
|
||||
and deleted lines per file. A diffstat is especially useful on bigger
|
||||
patches. Other comments relevant only to the moment or the maintainer,
|
||||
not suitable for the permanent changelog, should also go here.
|
||||
|
||||
See more details on the proper patch format in the following
|
||||
references.
|
||||
|
||||
|
||||
13) More references for submitting patches
|
||||
|
||||
Andrew Morton, "The perfect patch" (tpp).
|
||||
<http://www.zip.com.au/~akpm/linux/patches/stuff/tpp.txt>
|
||||
|
@ -310,6 +386,14 @@ Andrew Morton, "The perfect patch" (tpp).
|
|||
Jeff Garzik, "Linux kernel patch submission format."
|
||||
<http://linux.yyz.us/patch-format.html>
|
||||
|
||||
Greg KH, "How to piss off a kernel subsystem maintainer"
|
||||
<http://www.kroah.com/log/2005/03/31/>
|
||||
|
||||
Kernel Documentation/CodingStyle
|
||||
<http://sosdg.org/~coywolf/lxr/source/Documentation/CodingStyle>
|
||||
|
||||
Linus Torvald's mail on the canonical patch format:
|
||||
<http://lkml.org/lkml/2005/4/7/183>
|
||||
|
||||
|
||||
-----------------------------------
|
||||
|
|
|
@ -33,3 +33,6 @@ The result of the execution of this aml method is
|
|||
attached to /proc/acpi/hotkey/poll_method, which is dnyamically
|
||||
created. Please use command "cat /proc/acpi/hotkey/polling_method"
|
||||
to retrieve it.
|
||||
|
||||
Note: Use cmdline "acpi_generic_hotkey" to over-ride
|
||||
platform-specific with generic driver.
|
||||
|
|
|
@ -8,13 +8,15 @@ fi
|
|||
n_partitions=${n_partitions:-16}
|
||||
dir=$1
|
||||
shelf=$2
|
||||
nslots=16
|
||||
maxslot=`echo $nslots 1 - p | dc`
|
||||
MAJOR=152
|
||||
|
||||
set -e
|
||||
|
||||
minor=`echo 10 \* $shelf \* $n_partitions | bc`
|
||||
minor=`echo $nslots \* $shelf \* $n_partitions | bc`
|
||||
endp=`echo $n_partitions - 1 | bc`
|
||||
for slot in `seq 0 9`; do
|
||||
for slot in `seq 0 $maxslot`; do
|
||||
for part in `seq 0 $endp`; do
|
||||
name=e$shelf.$slot
|
||||
test "$part" != "0" && name=${name}p$part
|
||||
|
|
439
Documentation/applying-patches.txt
Normal file
439
Documentation/applying-patches.txt
Normal file
|
@ -0,0 +1,439 @@
|
|||
|
||||
Applying Patches To The Linux Kernel
|
||||
------------------------------------
|
||||
|
||||
(Written by Jesper Juhl, August 2005)
|
||||
|
||||
|
||||
|
||||
A frequently asked question on the Linux Kernel Mailing List is how to apply
|
||||
a patch to the kernel or, more specifically, what base kernel a patch for
|
||||
one of the many trees/branches should be applied to. Hopefully this document
|
||||
will explain this to you.
|
||||
|
||||
In addition to explaining how to apply and revert patches, a brief
|
||||
description of the different kernel trees (and examples of how to apply
|
||||
their specific patches) is also provided.
|
||||
|
||||
|
||||
What is a patch?
|
||||
---
|
||||
A patch is a small text document containing a delta of changes between two
|
||||
different versions of a source tree. Patches are created with the `diff'
|
||||
program.
|
||||
To correctly apply a patch you need to know what base it was generated from
|
||||
and what new version the patch will change the source tree into. These
|
||||
should both be present in the patch file metadata or be possible to deduce
|
||||
from the filename.
|
||||
|
||||
|
||||
How do I apply or revert a patch?
|
||||
---
|
||||
You apply a patch with the `patch' program. The patch program reads a diff
|
||||
(or patch) file and makes the changes to the source tree described in it.
|
||||
|
||||
Patches for the Linux kernel are generated relative to the parent directory
|
||||
holding the kernel source dir.
|
||||
|
||||
This means that paths to files inside the patch file contain the name of the
|
||||
kernel source directories it was generated against (or some other directory
|
||||
names like "a/" and "b/").
|
||||
Since this is unlikely to match the name of the kernel source dir on your
|
||||
local machine (but is often useful info to see what version an otherwise
|
||||
unlabeled patch was generated against) you should change into your kernel
|
||||
source directory and then strip the first element of the path from filenames
|
||||
in the patch file when applying it (the -p1 argument to `patch' does this).
|
||||
|
||||
To revert a previously applied patch, use the -R argument to patch.
|
||||
So, if you applied a patch like this:
|
||||
patch -p1 < ../patch-x.y.z
|
||||
|
||||
You can revert (undo) it like this:
|
||||
patch -R -p1 < ../patch-x.y.z
|
||||
|
||||
|
||||
How do I feed a patch/diff file to `patch'?
|
||||
---
|
||||
This (as usual with Linux and other UNIX like operating systems) can be
|
||||
done in several different ways.
|
||||
In all the examples below I feed the file (in uncompressed form) to patch
|
||||
via stdin using the following syntax:
|
||||
patch -p1 < path/to/patch-x.y.z
|
||||
|
||||
If you just want to be able to follow the examples below and don't want to
|
||||
know of more than one way to use patch, then you can stop reading this
|
||||
section here.
|
||||
|
||||
Patch can also get the name of the file to use via the -i argument, like
|
||||
this:
|
||||
patch -p1 -i path/to/patch-x.y.z
|
||||
|
||||
If your patch file is compressed with gzip or bzip2 and you don't want to
|
||||
uncompress it before applying it, then you can feed it to patch like this
|
||||
instead:
|
||||
zcat path/to/patch-x.y.z.gz | patch -p1
|
||||
bzcat path/to/patch-x.y.z.bz2 | patch -p1
|
||||
|
||||
If you wish to uncompress the patch file by hand first before applying it
|
||||
(what I assume you've done in the examples below), then you simply run
|
||||
gunzip or bunzip2 on the file - like this:
|
||||
gunzip patch-x.y.z.gz
|
||||
bunzip2 patch-x.y.z.bz2
|
||||
|
||||
Which will leave you with a plain text patch-x.y.z file that you can feed to
|
||||
patch via stdin or the -i argument, as you prefer.
|
||||
|
||||
A few other nice arguments for patch are -s which causes patch to be silent
|
||||
except for errors which is nice to prevent errors from scrolling out of the
|
||||
screen too fast, and --dry-run which causes patch to just print a listing of
|
||||
what would happen, but doesn't actually make any changes. Finally --verbose
|
||||
tells patch to print more information about the work being done.
|
||||
|
||||
|
||||
Common errors when patching
|
||||
---
|
||||
When patch applies a patch file it attempts to verify the sanity of the
|
||||
file in different ways.
|
||||
Checking that the file looks like a valid patch file, checking the code
|
||||
around the bits being modified matches the context provided in the patch are
|
||||
just two of the basic sanity checks patch does.
|
||||
|
||||
If patch encounters something that doesn't look quite right it has two
|
||||
options. It can either refuse to apply the changes and abort or it can try
|
||||
to find a way to make the patch apply with a few minor changes.
|
||||
|
||||
One example of something that's not 'quite right' that patch will attempt to
|
||||
fix up is if all the context matches, the lines being changed match, but the
|
||||
line numbers are different. This can happen, for example, if the patch makes
|
||||
a change in the middle of the file but for some reasons a few lines have
|
||||
been added or removed near the beginning of the file. In that case
|
||||
everything looks good it has just moved up or down a bit, and patch will
|
||||
usually adjust the line numbers and apply the patch.
|
||||
|
||||
Whenever patch applies a patch that it had to modify a bit to make it fit
|
||||
it'll tell you about it by saying the patch applied with 'fuzz'.
|
||||
You should be wary of such changes since even though patch probably got it
|
||||
right it doesn't /always/ get it right, and the result will sometimes be
|
||||
wrong.
|
||||
|
||||
When patch encounters a change that it can't fix up with fuzz it rejects it
|
||||
outright and leaves a file with a .rej extension (a reject file). You can
|
||||
read this file to see exactely what change couldn't be applied, so you can
|
||||
go fix it up by hand if you wish.
|
||||
|
||||
If you don't have any third party patches applied to your kernel source, but
|
||||
only patches from kernel.org and you apply the patches in the correct order,
|
||||
and have made no modifications yourself to the source files, then you should
|
||||
never see a fuzz or reject message from patch. If you do see such messages
|
||||
anyway, then there's a high risk that either your local source tree or the
|
||||
patch file is corrupted in some way. In that case you should probably try
|
||||
redownloading the patch and if things are still not OK then you'd be advised
|
||||
to start with a fresh tree downloaded in full from kernel.org.
|
||||
|
||||
Let's look a bit more at some of the messages patch can produce.
|
||||
|
||||
If patch stops and presents a "File to patch:" prompt, then patch could not
|
||||
find a file to be patched. Most likely you forgot to specify -p1 or you are
|
||||
in the wrong directory. Less often, you'll find patches that need to be
|
||||
applied with -p0 instead of -p1 (reading the patch file should reveal if
|
||||
this is the case - if so, then this is an error by the person who created
|
||||
the patch but is not fatal).
|
||||
|
||||
If you get "Hunk #2 succeeded at 1887 with fuzz 2 (offset 7 lines)." or a
|
||||
message similar to that, then it means that patch had to adjust the location
|
||||
of the change (in this example it needed to move 7 lines from where it
|
||||
expected to make the change to make it fit).
|
||||
The resulting file may or may not be OK, depending on the reason the file
|
||||
was different than expected.
|
||||
This often happens if you try to apply a patch that was generated against a
|
||||
different kernel version than the one you are trying to patch.
|
||||
|
||||
If you get a message like "Hunk #3 FAILED at 2387.", then it means that the
|
||||
patch could not be applied correctly and the patch program was unable to
|
||||
fuzz its way through. This will generate a .rej file with the change that
|
||||
caused the patch to fail and also a .orig file showing you the original
|
||||
content that couldn't be changed.
|
||||
|
||||
If you get "Reversed (or previously applied) patch detected! Assume -R? [n]"
|
||||
then patch detected that the change contained in the patch seems to have
|
||||
already been made.
|
||||
If you actually did apply this patch previously and you just re-applied it
|
||||
in error, then just say [n]o and abort this patch. If you applied this patch
|
||||
previously and actually intended to revert it, but forgot to specify -R,
|
||||
then you can say [y]es here to make patch revert it for you.
|
||||
This can also happen if the creator of the patch reversed the source and
|
||||
destination directories when creating the patch, and in that case reverting
|
||||
the patch will in fact apply it.
|
||||
|
||||
A message similar to "patch: **** unexpected end of file in patch" or "patch
|
||||
unexpectedly ends in middle of line" means that patch could make no sense of
|
||||
the file you fed to it. Either your download is broken or you tried to feed
|
||||
patch a compressed patch file without uncompressing it first.
|
||||
|
||||
As I already mentioned above, these errors should never happen if you apply
|
||||
a patch from kernel.org to the correct version of an unmodified source tree.
|
||||
So if you get these errors with kernel.org patches then you should probably
|
||||
assume that either your patch file or your tree is broken and I'd advice you
|
||||
to start over with a fresh download of a full kernel tree and the patch you
|
||||
wish to apply.
|
||||
|
||||
|
||||
Are there any alternatives to `patch'?
|
||||
---
|
||||
Yes there are alternatives. You can use the `interdiff' program
|
||||
(http://cyberelk.net/tim/patchutils/) to generate a patch representing the
|
||||
differences between two patches and then apply the result.
|
||||
This will let you move from something like 2.6.12.2 to 2.6.12.3 in a single
|
||||
step. The -z flag to interdiff will even let you feed it patches in gzip or
|
||||
bzip2 compressed form directly without the use of zcat or bzcat or manual
|
||||
decompression.
|
||||
|
||||
Here's how you'd go from 2.6.12.2 to 2.6.12.3 in a single step:
|
||||
interdiff -z ../patch-2.6.12.2.bz2 ../patch-2.6.12.3.gz | patch -p1
|
||||
|
||||
Although interdiff may save you a step or two you are generally advised to
|
||||
do the additional steps since interdiff can get things wrong in some cases.
|
||||
|
||||
Another alternative is `ketchup', which is a python script for automatic
|
||||
downloading and applying of patches (http://www.selenic.com/ketchup/).
|
||||
|
||||
Other nice tools are diffstat which shows a summary of changes made by a
|
||||
patch, lsdiff which displays a short listing of affected files in a patch
|
||||
file, along with (optionally) the line numbers of the start of each patch
|
||||
and grepdiff which displays a list of the files modified by a patch where
|
||||
the patch contains a given regular expression.
|
||||
|
||||
|
||||
Where can I download the patches?
|
||||
---
|
||||
The patches are available at http://kernel.org/
|
||||
Most recent patches are linked from the front page, but they also have
|
||||
specific homes.
|
||||
|
||||
The 2.6.x.y (-stable) and 2.6.x patches live at
|
||||
ftp://ftp.kernel.org/pub/linux/kernel/v2.6/
|
||||
|
||||
The -rc patches live at
|
||||
ftp://ftp.kernel.org/pub/linux/kernel/v2.6/testing/
|
||||
|
||||
The -git patches live at
|
||||
ftp://ftp.kernel.org/pub/linux/kernel/v2.6/snapshots/
|
||||
|
||||
The -mm kernels live at
|
||||
ftp://ftp.kernel.org/pub/linux/kernel/people/akpm/patches/2.6/
|
||||
|
||||
In place of ftp.kernel.org you can use ftp.cc.kernel.org, where cc is a
|
||||
country code. This way you'll be downloading from a mirror site that's most
|
||||
likely geographically closer to you, resulting in faster downloads for you,
|
||||
less bandwidth used globally and less load on the main kernel.org servers -
|
||||
these are good things, do use mirrors when possible.
|
||||
|
||||
|
||||
The 2.6.x kernels
|
||||
---
|
||||
These are the base stable releases released by Linus. The highest numbered
|
||||
release is the most recent.
|
||||
|
||||
If regressions or other serious flaws are found then a -stable fix patch
|
||||
will be released (see below) on top of this base. Once a new 2.6.x base
|
||||
kernel is released, a patch is made available that is a delta between the
|
||||
previous 2.6.x kernel and the new one.
|
||||
|
||||
To apply a patch moving from 2.6.11 to 2.6.12 you'd do the following (note
|
||||
that such patches do *NOT* apply on top of 2.6.x.y kernels but on top of the
|
||||
base 2.6.x kernel - if you need to move from 2.6.x.y to 2.6.x+1 you need to
|
||||
first revert the 2.6.x.y patch).
|
||||
|
||||
Here are some examples:
|
||||
|
||||
# moving from 2.6.11 to 2.6.12
|
||||
$ cd ~/linux-2.6.11 # change to kernel source dir
|
||||
$ patch -p1 < ../patch-2.6.12 # apply the 2.6.12 patch
|
||||
$ cd ..
|
||||
$ mv linux-2.6.11 linux-2.6.12 # rename source dir
|
||||
|
||||
# moving from 2.6.11.1 to 2.6.12
|
||||
$ cd ~/linux-2.6.11.1 # change to kernel source dir
|
||||
$ patch -p1 -R < ../patch-2.6.11.1 # revert the 2.6.11.1 patch
|
||||
# source dir is now 2.6.11
|
||||
$ patch -p1 < ../patch-2.6.12 # apply new 2.6.12 patch
|
||||
$ cd ..
|
||||
$ mv linux-2.6.11.1 inux-2.6.12 # rename source dir
|
||||
|
||||
|
||||
The 2.6.x.y kernels
|
||||
---
|
||||
Kernels with 4 digit versions are -stable kernels. They contain small(ish)
|
||||
critical fixes for security problems or significant regressions discovered
|
||||
in a given 2.6.x kernel.
|
||||
|
||||
This is the recommended branch for users who want the most recent stable
|
||||
kernel and are not interested in helping test development/experimental
|
||||
versions.
|
||||
|
||||
If no 2.6.x.y kernel is available, then the highest numbered 2.6.x kernel is
|
||||
the current stable kernel.
|
||||
|
||||
These patches are not incremental, meaning that for example the 2.6.12.3
|
||||
patch does not apply on top of the 2.6.12.2 kernel source, but rather on top
|
||||
of the base 2.6.12 kernel source.
|
||||
So, in order to apply the 2.6.12.3 patch to your existing 2.6.12.2 kernel
|
||||
source you have to first back out the 2.6.12.2 patch (so you are left with a
|
||||
base 2.6.12 kernel source) and then apply the new 2.6.12.3 patch.
|
||||
|
||||
Here's a small example:
|
||||
|
||||
$ cd ~/linux-2.6.12.2 # change into the kernel source dir
|
||||
$ patch -p1 -R < ../patch-2.6.12.2 # revert the 2.6.12.2 patch
|
||||
$ patch -p1 < ../patch-2.6.12.3 # apply the new 2.6.12.3 patch
|
||||
$ cd ..
|
||||
$ mv linux-2.6.12.2 linux-2.6.12.3 # rename the kernel source dir
|
||||
|
||||
|
||||
The -rc kernels
|
||||
---
|
||||
These are release-candidate kernels. These are development kernels released
|
||||
by Linus whenever he deems the current git (the kernel's source management
|
||||
tool) tree to be in a reasonably sane state adequate for testing.
|
||||
|
||||
These kernels are not stable and you should expect occasional breakage if
|
||||
you intend to run them. This is however the most stable of the main
|
||||
development branches and is also what will eventually turn into the next
|
||||
stable kernel, so it is important that it be tested by as many people as
|
||||
possible.
|
||||
|
||||
This is a good branch to run for people who want to help out testing
|
||||
development kernels but do not want to run some of the really experimental
|
||||
stuff (such people should see the sections about -git and -mm kernels below).
|
||||
|
||||
The -rc patches are not incremental, they apply to a base 2.6.x kernel, just
|
||||
like the 2.6.x.y patches described above. The kernel version before the -rcN
|
||||
suffix denotes the version of the kernel that this -rc kernel will eventually
|
||||
turn into.
|
||||
So, 2.6.13-rc5 means that this is the fifth release candidate for the 2.6.13
|
||||
kernel and the patch should be applied on top of the 2.6.12 kernel source.
|
||||
|
||||
Here are 3 examples of how to apply these patches:
|
||||
|
||||
# first an example of moving from 2.6.12 to 2.6.13-rc3
|
||||
$ cd ~/linux-2.6.12 # change into the 2.6.12 source dir
|
||||
$ patch -p1 < ../patch-2.6.13-rc3 # apply the 2.6.13-rc3 patch
|
||||
$ cd ..
|
||||
$ mv linux-2.6.12 linux-2.6.13-rc3 # rename the source dir
|
||||
|
||||
# now let's move from 2.6.13-rc3 to 2.6.13-rc5
|
||||
$ cd ~/linux-2.6.13-rc3 # change into the 2.6.13-rc3 dir
|
||||
$ patch -p1 -R < ../patch-2.6.13-rc3 # revert the 2.6.13-rc3 patch
|
||||
$ patch -p1 < ../patch-2.6.13-rc5 # apply the new 2.6.13-rc5 patch
|
||||
$ cd ..
|
||||
$ mv linux-2.6.13-rc3 linux-2.6.13-rc5 # rename the source dir
|
||||
|
||||
# finally let's try and move from 2.6.12.3 to 2.6.13-rc5
|
||||
$ cd ~/linux-2.6.12.3 # change to the kernel source dir
|
||||
$ patch -p1 -R < ../patch-2.6.12.3 # revert the 2.6.12.3 patch
|
||||
$ patch -p1 < ../patch-2.6.13-rc5 # apply new 2.6.13-rc5 patch
|
||||
$ cd ..
|
||||
$ mv linux-2.6.12.3 linux-2.6.13-rc5 # rename the kernel source dir
|
||||
|
||||
|
||||
The -git kernels
|
||||
---
|
||||
These are daily snapshots of Linus' kernel tree (managed in a git
|
||||
repository, hence the name).
|
||||
|
||||
These patches are usually released daily and represent the current state of
|
||||
Linus' tree. They are more experimental than -rc kernels since they are
|
||||
generated automatically without even a cursory glance to see if they are
|
||||
sane.
|
||||
|
||||
-git patches are not incremental and apply either to a base 2.6.x kernel or
|
||||
a base 2.6.x-rc kernel - you can see which from their name.
|
||||
A patch named 2.6.12-git1 applies to the 2.6.12 kernel source and a patch
|
||||
named 2.6.13-rc3-git2 applies to the source of the 2.6.13-rc3 kernel.
|
||||
|
||||
Here are some examples of how to apply these patches:
|
||||
|
||||
# moving from 2.6.12 to 2.6.12-git1
|
||||
$ cd ~/linux-2.6.12 # change to the kernel source dir
|
||||
$ patch -p1 < ../patch-2.6.12-git1 # apply the 2.6.12-git1 patch
|
||||
$ cd ..
|
||||
$ mv linux-2.6.12 linux-2.6.12-git1 # rename the kernel source dir
|
||||
|
||||
# moving from 2.6.12-git1 to 2.6.13-rc2-git3
|
||||
$ cd ~/linux-2.6.12-git1 # change to the kernel source dir
|
||||
$ patch -p1 -R < ../patch-2.6.12-git1 # revert the 2.6.12-git1 patch
|
||||
# we now have a 2.6.12 kernel
|
||||
$ patch -p1 < ../patch-2.6.13-rc2 # apply the 2.6.13-rc2 patch
|
||||
# the kernel is now 2.6.13-rc2
|
||||
$ patch -p1 < ../patch-2.6.13-rc2-git3 # apply the 2.6.13-rc2-git3 patch
|
||||
# the kernel is now 2.6.13-rc2-git3
|
||||
$ cd ..
|
||||
$ mv linux-2.6.12-git1 linux-2.6.13-rc2-git3 # rename source dir
|
||||
|
||||
|
||||
The -mm kernels
|
||||
---
|
||||
These are experimental kernels released by Andrew Morton.
|
||||
|
||||
The -mm tree serves as a sort of proving ground for new features and other
|
||||
experimental patches.
|
||||
Once a patch has proved its worth in -mm for a while Andrew pushes it on to
|
||||
Linus for inclusion in mainline.
|
||||
|
||||
Although it's encouraged that patches flow to Linus via the -mm tree, this
|
||||
is not always enforced.
|
||||
Subsystem maintainers (or individuals) sometimes push their patches directly
|
||||
to Linus, even though (or after) they have been merged and tested in -mm (or
|
||||
sometimes even without prior testing in -mm).
|
||||
|
||||
You should generally strive to get your patches into mainline via -mm to
|
||||
ensure maximum testing.
|
||||
|
||||
This branch is in constant flux and contains many experimental features, a
|
||||
lot of debugging patches not appropriate for mainline etc and is the most
|
||||
experimental of the branches described in this document.
|
||||
|
||||
These kernels are not appropriate for use on systems that are supposed to be
|
||||
stable and they are more risky to run than any of the other branches (make
|
||||
sure you have up-to-date backups - that goes for any experimental kernel but
|
||||
even more so for -mm kernels).
|
||||
|
||||
These kernels in addition to all the other experimental patches they contain
|
||||
usually also contain any changes in the mainline -git kernels available at
|
||||
the time of release.
|
||||
|
||||
Testing of -mm kernels is greatly appreciated since the whole point of the
|
||||
tree is to weed out regressions, crashes, data corruption bugs, build
|
||||
breakage (and any other bug in general) before changes are merged into the
|
||||
more stable mainline Linus tree.
|
||||
But testers of -mm should be aware that breakage in this tree is more common
|
||||
than in any other tree.
|
||||
|
||||
The -mm kernels are not released on a fixed schedule, but usually a few -mm
|
||||
kernels are released in between each -rc kernel (1 to 3 is common).
|
||||
The -mm kernels apply to either a base 2.6.x kernel (when no -rc kernels
|
||||
have been released yet) or to a Linus -rc kernel.
|
||||
|
||||
Here are some examples of applying the -mm patches:
|
||||
|
||||
# moving from 2.6.12 to 2.6.12-mm1
|
||||
$ cd ~/linux-2.6.12 # change to the 2.6.12 source dir
|
||||
$ patch -p1 < ../2.6.12-mm1 # apply the 2.6.12-mm1 patch
|
||||
$ cd ..
|
||||
$ mv linux-2.6.12 linux-2.6.12-mm1 # rename the source appropriately
|
||||
|
||||
# moving from 2.6.12-mm1 to 2.6.13-rc3-mm3
|
||||
$ cd ~/linux-2.6.12-mm1
|
||||
$ patch -p1 -R < ../2.6.12-mm1 # revert the 2.6.12-mm1 patch
|
||||
# we now have a 2.6.12 source
|
||||
$ patch -p1 < ../patch-2.6.13-rc3 # apply the 2.6.13-rc3 patch
|
||||
# we now have a 2.6.13-rc3 source
|
||||
$ patch -p1 < ../2.6.13-rc3-mm3 # apply the 2.6.13-rc3-mm3 patch
|
||||
$ cd ..
|
||||
$ mv linux-2.6.12-mm1 linux-2.6.13-rc3-mm3 # rename the source dir
|
||||
|
||||
|
||||
This concludes this list of explanations of the various kernel trees and I
|
||||
hope you are now crystal clear on how to apply the various patches and help
|
||||
testing the kernel.
|
||||
|
|
@ -81,7 +81,8 @@ Adding New Machines
|
|||
|
||||
Any large scale modifications, or new drivers should be discussed
|
||||
on the ARM kernel mailing list (linux-arm-kernel) before being
|
||||
attempted.
|
||||
attempted. See http://www.arm.linux.org.uk/mailinglists/ for the
|
||||
mailing list information.
|
||||
|
||||
|
||||
NAND
|
||||
|
@ -120,6 +121,43 @@ Clock Management
|
|||
various clock units
|
||||
|
||||
|
||||
Platform Data
|
||||
-------------
|
||||
|
||||
Whenever a device has platform specific data that is specified
|
||||
on a per-machine basis, care should be taken to ensure the
|
||||
following:
|
||||
|
||||
1) that default data is not left in the device to confuse the
|
||||
driver if a machine does not set it at startup
|
||||
|
||||
2) the data should (if possible) be marked as __initdata,
|
||||
to ensure that the data is thrown away if the machine is
|
||||
not the one currently in use.
|
||||
|
||||
The best way of doing this is to make a function that
|
||||
kmalloc()s an area of memory, and copies the __initdata
|
||||
and then sets the relevant device's platform data. Making
|
||||
the function `__init` takes care of ensuring it is discarded
|
||||
with the rest of the initialisation code
|
||||
|
||||
static __init void s3c24xx_xxx_set_platdata(struct xxx_data *pd)
|
||||
{
|
||||
struct s3c2410_xxx_mach_info *npd;
|
||||
|
||||
npd = kmalloc(sizeof(struct s3c2410_xxx_mach_info), GFP_KERNEL);
|
||||
if (npd) {
|
||||
memcpy(npd, pd, sizeof(struct s3c2410_xxx_mach_info));
|
||||
s3c_device_xxx.dev.platform_data = npd;
|
||||
} else {
|
||||
printk(KERN_ERR "no memory for xxx platform data\n");
|
||||
}
|
||||
}
|
||||
|
||||
Note, since the code is marked as __init, it should not be
|
||||
exported outside arch/arm/mach-s3c2410/, or exported to
|
||||
modules via EXPORT_SYMBOL() and related functions.
|
||||
|
||||
Port Contributors
|
||||
-----------------
|
||||
|
||||
|
@ -149,6 +187,7 @@ Document Changes
|
|||
06 Mar 2005 - BJD - Added Christer Weinigel
|
||||
08 Mar 2005 - BJD - Added LCVR to list of people, updated introduction
|
||||
08 Mar 2005 - BJD - Added section on adding machines
|
||||
09 Sep 2005 - BJD - Added section on platform data
|
||||
|
||||
Document Author
|
||||
---------------
|
||||
|
|
93
Documentation/arm/Samsung-S3C24XX/USB-Host.txt
Normal file
93
Documentation/arm/Samsung-S3C24XX/USB-Host.txt
Normal file
|
@ -0,0 +1,93 @@
|
|||
S3C24XX USB Host support
|
||||
========================
|
||||
|
||||
|
||||
|
||||
Introduction
|
||||
------------
|
||||
|
||||
This document details the S3C2410/S3C2440 in-built OHCI USB host support.
|
||||
|
||||
Configuration
|
||||
-------------
|
||||
|
||||
Enable at least the following kernel options:
|
||||
|
||||
menuconfig:
|
||||
|
||||
Device Drivers --->
|
||||
USB support --->
|
||||
<*> Support for Host-side USB
|
||||
<*> OHCI HCD support
|
||||
|
||||
|
||||
.config:
|
||||
CONFIG_USB
|
||||
CONFIG_USB_OHCI_HCD
|
||||
|
||||
|
||||
Once these options are configured, the standard set of USB device
|
||||
drivers can be configured and used.
|
||||
|
||||
|
||||
Board Support
|
||||
-------------
|
||||
|
||||
The driver attaches to a platform device, which will need to be
|
||||
added by the board specific support file in linux/arch/arm/mach-s3c2410,
|
||||
such as mach-bast.c or mach-smdk2410.c
|
||||
|
||||
The platform device's platform_data field is only needed if the
|
||||
board implements extra power control or over-current monitoring.
|
||||
|
||||
The OHCI driver does not ensure the state of the S3C2410's MISCCTRL
|
||||
register, so if both ports are to be used for the host, then it is
|
||||
the board support file's responsibility to ensure that the second
|
||||
port is configured to be connected to the OHCI core.
|
||||
|
||||
|
||||
Platform Data
|
||||
-------------
|
||||
|
||||
See linux/include/asm-arm/arch-s3c2410/usb-control.h for the
|
||||
descriptions of the platform device data. An implementation
|
||||
can be found in linux/arch/arm/mach-s3c2410/usb-simtec.c .
|
||||
|
||||
The `struct s3c2410_hcd_info` contains a pair of functions
|
||||
that get called to enable over-current detection, and to
|
||||
control the port power status.
|
||||
|
||||
The ports are numbered 0 and 1.
|
||||
|
||||
power_control:
|
||||
|
||||
Called to enable or disable the power on the port.
|
||||
|
||||
enable_oc:
|
||||
|
||||
Called to enable or disable the over-current monitoring.
|
||||
This should claim or release the resources being used to
|
||||
check the power condition on the port, such as an IRQ.
|
||||
|
||||
report_oc:
|
||||
|
||||
The OHCI driver fills this field in for the over-current code
|
||||
to call when there is a change to the over-current state on
|
||||
an port. The ports argument is a bitmask of 1 bit per port,
|
||||
with bit X being 1 for an over-current on port X.
|
||||
|
||||
The function s3c2410_usb_report_oc() has been provided to
|
||||
ensure this is called correctly.
|
||||
|
||||
port[x]:
|
||||
|
||||
This is struct describes each port, 0 or 1. The platform driver
|
||||
should set the flags field of each port to S3C_HCDFLG_USED if
|
||||
the port is enabled.
|
||||
|
||||
|
||||
|
||||
Document Author
|
||||
---------------
|
||||
|
||||
Ben Dooks, (c) 2005 Simtec Electronics
|
|
@ -906,9 +906,20 @@ Aside:
|
|||
|
||||
|
||||
4. The I/O scheduler
|
||||
I/O schedulers are now per queue. They should be runtime switchable and modular
|
||||
but aren't yet. Jens has most bits to do this, but the sysfs implementation is
|
||||
missing.
|
||||
I/O scheduler, a.k.a. elevator, is implemented in two layers. Generic dispatch
|
||||
queue and specific I/O schedulers. Unless stated otherwise, elevator is used
|
||||
to refer to both parts and I/O scheduler to specific I/O schedulers.
|
||||
|
||||
Block layer implements generic dispatch queue in ll_rw_blk.c and elevator.c.
|
||||
The generic dispatch queue is responsible for properly ordering barrier
|
||||
requests, requeueing, handling non-fs requests and all other subtleties.
|
||||
|
||||
Specific I/O schedulers are responsible for ordering normal filesystem
|
||||
requests. They can also choose to delay certain requests to improve
|
||||
throughput or whatever purpose. As the plural form indicates, there are
|
||||
multiple I/O schedulers. They can be built as modules but at least one should
|
||||
be built inside the kernel. Each queue can choose different one and can also
|
||||
change to another one dynamically.
|
||||
|
||||
A block layer call to the i/o scheduler follows the convention elv_xxx(). This
|
||||
calls elevator_xxx_fn in the elevator switch (drivers/block/elevator.c). Oh,
|
||||
|
@ -921,44 +932,36 @@ keeping work.
|
|||
The functions an elevator may implement are: (* are mandatory)
|
||||
elevator_merge_fn called to query requests for merge with a bio
|
||||
|
||||
elevator_merge_req_fn " " " with another request
|
||||
elevator_merge_req_fn called when two requests get merged. the one
|
||||
which gets merged into the other one will be
|
||||
never seen by I/O scheduler again. IOW, after
|
||||
being merged, the request is gone.
|
||||
|
||||
elevator_merged_fn called when a request in the scheduler has been
|
||||
involved in a merge. It is used in the deadline
|
||||
scheduler for example, to reposition the request
|
||||
if its sorting order has changed.
|
||||
|
||||
*elevator_next_req_fn returns the next scheduled request, or NULL
|
||||
if there are none (or none are ready).
|
||||
elevator_dispatch_fn fills the dispatch queue with ready requests.
|
||||
I/O schedulers are free to postpone requests by
|
||||
not filling the dispatch queue unless @force
|
||||
is non-zero. Once dispatched, I/O schedulers
|
||||
are not allowed to manipulate the requests -
|
||||
they belong to generic dispatch queue.
|
||||
|
||||
*elevator_add_req_fn called to add a new request into the scheduler
|
||||
elevator_add_req_fn called to add a new request into the scheduler
|
||||
|
||||
elevator_queue_empty_fn returns true if the merge queue is empty.
|
||||
Drivers shouldn't use this, but rather check
|
||||
if elv_next_request is NULL (without losing the
|
||||
request if one exists!)
|
||||
|
||||
elevator_remove_req_fn This is called when a driver claims ownership of
|
||||
the target request - it now belongs to the
|
||||
driver. It must not be modified or merged.
|
||||
Drivers must not lose the request! A subsequent
|
||||
call of elevator_next_req_fn must return the
|
||||
_next_ request.
|
||||
|
||||
elevator_requeue_req_fn called to add a request to the scheduler. This
|
||||
is used when the request has alrnadebeen
|
||||
returned by elv_next_request, but hasn't
|
||||
completed. If this is not implemented then
|
||||
elevator_add_req_fn is called instead.
|
||||
|
||||
elevator_former_req_fn
|
||||
elevator_latter_req_fn These return the request before or after the
|
||||
one specified in disk sort order. Used by the
|
||||
block layer to find merge possibilities.
|
||||
|
||||
elevator_completed_req_fn called when a request is completed. This might
|
||||
come about due to being merged with another or
|
||||
when the device completes the request.
|
||||
elevator_completed_req_fn called when a request is completed.
|
||||
|
||||
elevator_may_queue_fn returns true if the scheduler wants to allow the
|
||||
current context to queue a new request even if
|
||||
|
@ -967,13 +970,33 @@ elevator_may_queue_fn returns true if the scheduler wants to allow the
|
|||
|
||||
elevator_set_req_fn
|
||||
elevator_put_req_fn Must be used to allocate and free any elevator
|
||||
specific storate for a request.
|
||||
specific storage for a request.
|
||||
|
||||
elevator_activate_req_fn Called when device driver first sees a request.
|
||||
I/O schedulers can use this callback to
|
||||
determine when actual execution of a request
|
||||
starts.
|
||||
elevator_deactivate_req_fn Called when device driver decides to delay
|
||||
a request by requeueing it.
|
||||
|
||||
elevator_init_fn
|
||||
elevator_exit_fn Allocate and free any elevator specific storage
|
||||
for a queue.
|
||||
|
||||
4.2 I/O scheduler implementation
|
||||
4.2 Request flows seen by I/O schedulers
|
||||
All requests seens by I/O schedulers strictly follow one of the following three
|
||||
flows.
|
||||
|
||||
set_req_fn ->
|
||||
|
||||
i. add_req_fn -> (merged_fn ->)* -> dispatch_fn -> activate_req_fn ->
|
||||
(deactivate_req_fn -> activate_req_fn ->)* -> completed_req_fn
|
||||
ii. add_req_fn -> (merged_fn ->)* -> merge_req_fn
|
||||
iii. [none]
|
||||
|
||||
-> put_req_fn
|
||||
|
||||
4.3 I/O scheduler implementation
|
||||
The generic i/o scheduler algorithm attempts to sort/merge/batch requests for
|
||||
optimal disk scan and request servicing performance (based on generic
|
||||
principles and device capabilities), optimized for:
|
||||
|
@ -993,18 +1016,7 @@ request in sort order to prevent binary tree lookups.
|
|||
This arrangement is not a generic block layer characteristic however, so
|
||||
elevators may implement queues as they please.
|
||||
|
||||
ii. Last merge hint
|
||||
The last merge hint is part of the generic queue layer. I/O schedulers must do
|
||||
some management on it. For the most part, the most important thing is to make
|
||||
sure q->last_merge is cleared (set to NULL) when the request on it is no longer
|
||||
a candidate for merging (for example if it has been sent to the driver).
|
||||
|
||||
The last merge performed is cached as a hint for the subsequent request. If
|
||||
sequential data is being submitted, the hint is used to perform merges without
|
||||
any scanning. This is not sufficient when there are multiple processes doing
|
||||
I/O though, so a "merge hash" is used by some schedulers.
|
||||
|
||||
iii. Merge hash
|
||||
ii. Merge hash
|
||||
AS and deadline use a hash table indexed by the last sector of a request. This
|
||||
enables merging code to quickly look up "back merge" candidates, even when
|
||||
multiple I/O streams are being performed at once on one disk.
|
||||
|
@ -1013,29 +1025,8 @@ multiple I/O streams are being performed at once on one disk.
|
|||
are far less common than "back merges" due to the nature of most I/O patterns.
|
||||
Front merges are handled by the binary trees in AS and deadline schedulers.
|
||||
|
||||
iv. Handling barrier cases
|
||||
A request with flags REQ_HARDBARRIER or REQ_SOFTBARRIER must not be ordered
|
||||
around. That is, they must be processed after all older requests, and before
|
||||
any newer ones. This includes merges!
|
||||
|
||||
In AS and deadline schedulers, barriers have the effect of flushing the reorder
|
||||
queue. The performance cost of this will vary from nothing to a lot depending
|
||||
on i/o patterns and device characteristics. Obviously they won't improve
|
||||
performance, so their use should be kept to a minimum.
|
||||
|
||||
v. Handling insertion position directives
|
||||
A request may be inserted with a position directive. The directives are one of
|
||||
ELEVATOR_INSERT_BACK, ELEVATOR_INSERT_FRONT, ELEVATOR_INSERT_SORT.
|
||||
|
||||
ELEVATOR_INSERT_SORT is a general directive for non-barrier requests.
|
||||
ELEVATOR_INSERT_BACK is used to insert a barrier to the back of the queue.
|
||||
ELEVATOR_INSERT_FRONT is used to insert a barrier to the front of the queue, and
|
||||
overrides the ordering requested by any previous barriers. In practice this is
|
||||
harmless and required, because it is used for SCSI requeueing. This does not
|
||||
require flushing the reorder queue, so does not impose a performance penalty.
|
||||
|
||||
vi. Plugging the queue to batch requests in anticipation of opportunities for
|
||||
merge/sort optimizations
|
||||
iii. Plugging the queue to batch requests in anticipation of opportunities for
|
||||
merge/sort optimizations
|
||||
|
||||
This is just the same as in 2.4 so far, though per-device unplugging
|
||||
support is anticipated for 2.5. Also with a priority-based i/o scheduler,
|
||||
|
@ -1069,7 +1060,7 @@ Aside:
|
|||
blk_kick_queue() to unplug a specific queue (right away ?)
|
||||
or optionally, all queues, is in the plan.
|
||||
|
||||
4.3 I/O contexts
|
||||
4.4 I/O contexts
|
||||
I/O contexts provide a dynamically allocated per process data area. They may
|
||||
be used in I/O schedulers, and in the block layer (could be used for IO statis,
|
||||
priorities for example). See *io_context in drivers/block/ll_rw_blk.c, and
|
||||
|
|
|
@ -49,9 +49,6 @@ changes occur:
|
|||
page table operations such as what happens during
|
||||
fork, and exec.
|
||||
|
||||
Platform developers note that generic code will always
|
||||
invoke this interface without mm->page_table_lock held.
|
||||
|
||||
3) void flush_tlb_range(struct vm_area_struct *vma,
|
||||
unsigned long start, unsigned long end)
|
||||
|
||||
|
@ -72,9 +69,6 @@ changes occur:
|
|||
call flush_tlb_page (see below) for each entry which may be
|
||||
modified.
|
||||
|
||||
Platform developers note that generic code will always
|
||||
invoke this interface with mm->page_table_lock held.
|
||||
|
||||
4) void flush_tlb_page(struct vm_area_struct *vma, unsigned long addr)
|
||||
|
||||
This time we need to remove the PAGE_SIZE sized translation
|
||||
|
@ -93,9 +87,6 @@ changes occur:
|
|||
|
||||
This is used primarily during fault processing.
|
||||
|
||||
Platform developers note that generic code will always
|
||||
invoke this interface with mm->page_table_lock held.
|
||||
|
||||
5) void flush_tlb_pgtables(struct mm_struct *mm,
|
||||
unsigned long start, unsigned long end)
|
||||
|
||||
|
|
|
@ -17,7 +17,9 @@ This driver is known to work with the following cards:
|
|||
* SA P600
|
||||
* SA P800
|
||||
* SA E400
|
||||
* SA E300
|
||||
* SA P400i
|
||||
* SA E200
|
||||
* SA E200i
|
||||
|
||||
If nodes are not already created in the /dev/cciss directory, run as root:
|
||||
|
||||
|
|
|
@ -68,7 +68,8 @@ it a better device citizen. Further thanks to Joel Katz
|
|||
Porfiri Claudio <C.Porfiri@nisms.tei.ericsson.se> for patches
|
||||
to make the driver work with the older CDU-510/515 series, and
|
||||
Heiko Eissfeldt <heiko@colossus.escape.de> for pointing out that
|
||||
the verify_area() checks were ignoring the results of said checks.
|
||||
the verify_area() checks were ignoring the results of said checks
|
||||
(note: verify_area() has since been replaced by access_ok()).
|
||||
|
||||
(Acknowledgments from Ron Jeppesen in the 0.3 release:)
|
||||
Thanks to Corey Minyard who wrote the original CDU-31A driver on which
|
||||
|
|
194
Documentation/connector/cn_test.c
Normal file
194
Documentation/connector/cn_test.c
Normal file
|
@ -0,0 +1,194 @@
|
|||
/*
|
||||
* cn_test.c
|
||||
*
|
||||
* 2004-2005 Copyright (c) Evgeniy Polyakov <johnpol@2ka.mipt.ru>
|
||||
* All rights reserved.
|
||||
*
|
||||
* This program is free software; you can redistribute it and/or modify
|
||||
* it under the terms of the GNU General Public License as published by
|
||||
* the Free Software Foundation; either version 2 of the License, or
|
||||
* (at your option) any later version.
|
||||
*
|
||||
* This program is distributed in the hope that it will be useful,
|
||||
* but WITHOUT ANY WARRANTY; without even the implied warranty of
|
||||
* MERCHANTABILITY or FITNESS FOR A PARTICULAR PURPOSE. See the
|
||||
* GNU General Public License for more details.
|
||||
*
|
||||
* You should have received a copy of the GNU General Public License
|
||||
* along with this program; if not, write to the Free Software
|
||||
* Foundation, Inc., 59 Temple Place, Suite 330, Boston, MA 02111-1307 USA
|
||||
*/
|
||||
|
||||
#include <linux/kernel.h>
|
||||
#include <linux/module.h>
|
||||
#include <linux/moduleparam.h>
|
||||
#include <linux/skbuff.h>
|
||||
#include <linux/timer.h>
|
||||
|
||||
#include "connector.h"
|
||||
|
||||
static struct cb_id cn_test_id = { 0x123, 0x456 };
|
||||
static char cn_test_name[] = "cn_test";
|
||||
static struct sock *nls;
|
||||
static struct timer_list cn_test_timer;
|
||||
|
||||
void cn_test_callback(void *data)
|
||||
{
|
||||
struct cn_msg *msg = (struct cn_msg *)data;
|
||||
|
||||
printk("%s: %lu: idx=%x, val=%x, seq=%u, ack=%u, len=%d: %s.\n",
|
||||
__func__, jiffies, msg->id.idx, msg->id.val,
|
||||
msg->seq, msg->ack, msg->len, (char *)msg->data);
|
||||
}
|
||||
|
||||
static int cn_test_want_notify(void)
|
||||
{
|
||||
struct cn_ctl_msg *ctl;
|
||||
struct cn_notify_req *req;
|
||||
struct cn_msg *msg = NULL;
|
||||
int size, size0;
|
||||
struct sk_buff *skb;
|
||||
struct nlmsghdr *nlh;
|
||||
u32 group = 1;
|
||||
|
||||
size0 = sizeof(*msg) + sizeof(*ctl) + 3 * sizeof(*req);
|
||||
|
||||
size = NLMSG_SPACE(size0);
|
||||
|
||||
skb = alloc_skb(size, GFP_ATOMIC);
|
||||
if (!skb) {
|
||||
printk(KERN_ERR "Failed to allocate new skb with size=%u.\n",
|
||||
size);
|
||||
|
||||
return -ENOMEM;
|
||||
}
|
||||
|
||||
nlh = NLMSG_PUT(skb, 0, 0x123, NLMSG_DONE, size - sizeof(*nlh));
|
||||
|
||||
msg = (struct cn_msg *)NLMSG_DATA(nlh);
|
||||
|
||||
memset(msg, 0, size0);
|
||||
|
||||
msg->id.idx = -1;
|
||||
msg->id.val = -1;
|
||||
msg->seq = 0x123;
|
||||
msg->ack = 0x345;
|
||||
msg->len = size0 - sizeof(*msg);
|
||||
|
||||
ctl = (struct cn_ctl_msg *)(msg + 1);
|
||||
|
||||
ctl->idx_notify_num = 1;
|
||||
ctl->val_notify_num = 2;
|
||||
ctl->group = group;
|
||||
ctl->len = msg->len - sizeof(*ctl);
|
||||
|
||||
req = (struct cn_notify_req *)(ctl + 1);
|
||||
|
||||
/*
|
||||
* Idx.
|
||||
*/
|
||||
req->first = cn_test_id.idx;
|
||||
req->range = 10;
|
||||
|
||||
/*
|
||||
* Val 0.
|
||||
*/
|
||||
req++;
|
||||
req->first = cn_test_id.val;
|
||||
req->range = 10;
|
||||
|
||||
/*
|
||||
* Val 1.
|
||||
*/
|
||||
req++;
|
||||
req->first = cn_test_id.val + 20;
|
||||
req->range = 10;
|
||||
|
||||
NETLINK_CB(skb).dst_groups = ctl->group;
|
||||
//netlink_broadcast(nls, skb, 0, ctl->group, GFP_ATOMIC);
|
||||
netlink_unicast(nls, skb, 0, 0);
|
||||
|
||||
printk(KERN_INFO "Request was sent. Group=0x%x.\n", ctl->group);
|
||||
|
||||
return 0;
|
||||
|
||||
nlmsg_failure:
|
||||
printk(KERN_ERR "Failed to send %u.%u\n", msg->seq, msg->ack);
|
||||
kfree_skb(skb);
|
||||
return -EINVAL;
|
||||
}
|
||||
|
||||
static u32 cn_test_timer_counter;
|
||||
static void cn_test_timer_func(unsigned long __data)
|
||||
{
|
||||
struct cn_msg *m;
|
||||
char data[32];
|
||||
|
||||
m = kmalloc(sizeof(*m) + sizeof(data), GFP_ATOMIC);
|
||||
if (m) {
|
||||
memset(m, 0, sizeof(*m) + sizeof(data));
|
||||
|
||||
memcpy(&m->id, &cn_test_id, sizeof(m->id));
|
||||
m->seq = cn_test_timer_counter;
|
||||
m->len = sizeof(data);
|
||||
|
||||
m->len =
|
||||
scnprintf(data, sizeof(data), "counter = %u",
|
||||
cn_test_timer_counter) + 1;
|
||||
|
||||
memcpy(m + 1, data, m->len);
|
||||
|
||||
cn_netlink_send(m, 0, gfp_any());
|
||||
kfree(m);
|
||||
}
|
||||
|
||||
cn_test_timer_counter++;
|
||||
|
||||
mod_timer(&cn_test_timer, jiffies + HZ);
|
||||
}
|
||||
|
||||
static int cn_test_init(void)
|
||||
{
|
||||
int err;
|
||||
|
||||
err = cn_add_callback(&cn_test_id, cn_test_name, cn_test_callback);
|
||||
if (err)
|
||||
goto err_out;
|
||||
cn_test_id.val++;
|
||||
err = cn_add_callback(&cn_test_id, cn_test_name, cn_test_callback);
|
||||
if (err) {
|
||||
cn_del_callback(&cn_test_id);
|
||||
goto err_out;
|
||||
}
|
||||
|
||||
init_timer(&cn_test_timer);
|
||||
cn_test_timer.function = cn_test_timer_func;
|
||||
cn_test_timer.expires = jiffies + HZ;
|
||||
cn_test_timer.data = 0;
|
||||
add_timer(&cn_test_timer);
|
||||
|
||||
return 0;
|
||||
|
||||
err_out:
|
||||
if (nls && nls->sk_socket)
|
||||
sock_release(nls->sk_socket);
|
||||
|
||||
return err;
|
||||
}
|
||||
|
||||
static void cn_test_fini(void)
|
||||
{
|
||||
del_timer_sync(&cn_test_timer);
|
||||
cn_del_callback(&cn_test_id);
|
||||
cn_test_id.val--;
|
||||
cn_del_callback(&cn_test_id);
|
||||
if (nls && nls->sk_socket)
|
||||
sock_release(nls->sk_socket);
|
||||
}
|
||||
|
||||
module_init(cn_test_init);
|
||||
module_exit(cn_test_fini);
|
||||
|
||||
MODULE_LICENSE("GPL");
|
||||
MODULE_AUTHOR("Evgeniy Polyakov <johnpol@2ka.mipt.ru>");
|
||||
MODULE_DESCRIPTION("Connector's test module");
|
177
Documentation/connector/connector.txt
Normal file
177
Documentation/connector/connector.txt
Normal file
|
@ -0,0 +1,177 @@
|
|||
/*****************************************/
|
||||
Kernel Connector.
|
||||
/*****************************************/
|
||||
|
||||
Kernel connector - new netlink based userspace <-> kernel space easy
|
||||
to use communication module.
|
||||
|
||||
Connector driver adds possibility to connect various agents using
|
||||
netlink based network. One must register callback and
|
||||
identifier. When driver receives special netlink message with
|
||||
appropriate identifier, appropriate callback will be called.
|
||||
|
||||
From the userspace point of view it's quite straightforward:
|
||||
|
||||
socket();
|
||||
bind();
|
||||
send();
|
||||
recv();
|
||||
|
||||
But if kernelspace want to use full power of such connections, driver
|
||||
writer must create special sockets, must know about struct sk_buff
|
||||
handling... Connector allows any kernelspace agents to use netlink
|
||||
based networking for inter-process communication in a significantly
|
||||
easier way:
|
||||
|
||||
int cn_add_callback(struct cb_id *id, char *name, void (*callback) (void *));
|
||||
void cn_netlink_send(struct cn_msg *msg, u32 __group, int gfp_mask);
|
||||
|
||||
struct cb_id
|
||||
{
|
||||
__u32 idx;
|
||||
__u32 val;
|
||||
};
|
||||
|
||||
idx and val are unique identifiers which must be registered in
|
||||
connector.h for in-kernel usage. void (*callback) (void *) - is a
|
||||
callback function which will be called when message with above idx.val
|
||||
will be received by connector core. Argument for that function must
|
||||
be dereferenced to struct cn_msg *.
|
||||
|
||||
struct cn_msg
|
||||
{
|
||||
struct cb_id id;
|
||||
|
||||
__u32 seq;
|
||||
__u32 ack;
|
||||
|
||||
__u32 len; /* Length of the following data */
|
||||
__u8 data[0];
|
||||
};
|
||||
|
||||
/*****************************************/
|
||||
Connector interfaces.
|
||||
/*****************************************/
|
||||
|
||||
int cn_add_callback(struct cb_id *id, char *name, void (*callback) (void *));
|
||||
|
||||
Registers new callback with connector core.
|
||||
|
||||
struct cb_id *id - unique connector's user identifier.
|
||||
It must be registered in connector.h for legal in-kernel users.
|
||||
char *name - connector's callback symbolic name.
|
||||
void (*callback) (void *) - connector's callback.
|
||||
Argument must be dereferenced to struct cn_msg *.
|
||||
|
||||
void cn_del_callback(struct cb_id *id);
|
||||
|
||||
Unregisters new callback with connector core.
|
||||
|
||||
struct cb_id *id - unique connector's user identifier.
|
||||
|
||||
void cn_netlink_send(struct cn_msg *msg, u32 __groups, int gfp_mask);
|
||||
|
||||
Sends message to the specified groups. It can be safely called from
|
||||
any context, but may silently fail under strong memory pressure.
|
||||
|
||||
struct cn_msg * - message header(with attached data).
|
||||
u32 __group - destination group.
|
||||
If __group is zero, then appropriate group will
|
||||
be searched through all registered connector users,
|
||||
and message will be delivered to the group which was
|
||||
created for user with the same ID as in msg.
|
||||
If __group is not zero, then message will be delivered
|
||||
to the specified group.
|
||||
int gfp_mask - GFP mask.
|
||||
|
||||
Note: When registering new callback user, connector core assigns
|
||||
netlink group to the user which is equal to it's id.idx.
|
||||
|
||||
/*****************************************/
|
||||
Protocol description.
|
||||
/*****************************************/
|
||||
|
||||
Current offers transport layer with fixed header. Recommended
|
||||
protocol which uses such header is following:
|
||||
|
||||
msg->seq and msg->ack are used to determine message genealogy. When
|
||||
someone sends message it puts there locally unique sequence and random
|
||||
acknowledge numbers. Sequence number may be copied into
|
||||
nlmsghdr->nlmsg_seq too.
|
||||
|
||||
Sequence number is incremented with each message to be sent.
|
||||
|
||||
If we expect reply to our message, then sequence number in received
|
||||
message MUST be the same as in original message, and acknowledge
|
||||
number MUST be the same + 1.
|
||||
|
||||
If we receive message and it's sequence number is not equal to one we
|
||||
are expecting, then it is new message. If we receive message and it's
|
||||
sequence number is the same as one we are expecting, but it's
|
||||
acknowledge is not equal acknowledge number in original message + 1,
|
||||
then it is new message.
|
||||
|
||||
Obviously, protocol header contains above id.
|
||||
|
||||
connector allows event notification in the following form: kernel
|
||||
driver or userspace process can ask connector to notify it when
|
||||
selected id's will be turned on or off(registered or unregistered it's
|
||||
callback). It is done by sending special command to connector
|
||||
driver(it also registers itself with id={-1, -1}).
|
||||
|
||||
As example of usage Documentation/connector now contains cn_test.c -
|
||||
testing module which uses connector to request notification and to
|
||||
send messages.
|
||||
|
||||
/*****************************************/
|
||||
Reliability.
|
||||
/*****************************************/
|
||||
|
||||
Netlink itself is not reliable protocol, that means that messages can
|
||||
be lost due to memory pressure or process' receiving queue overflowed,
|
||||
so caller is warned must be prepared. That is why struct cn_msg [main
|
||||
connector's message header] contains u32 seq and u32 ack fields.
|
||||
|
||||
/*****************************************/
|
||||
Userspace usage.
|
||||
/*****************************************/
|
||||
2.6.14 has a new netlink socket implementation, which by default does not
|
||||
allow to send data to netlink groups other than 1.
|
||||
So, if to use netlink socket (for example using connector)
|
||||
with different group number userspace application must subscribe to
|
||||
that group. It can be achieved by following pseudocode:
|
||||
|
||||
s = socket(PF_NETLINK, SOCK_DGRAM, NETLINK_CONNECTOR);
|
||||
|
||||
l_local.nl_family = AF_NETLINK;
|
||||
l_local.nl_groups = 12345;
|
||||
l_local.nl_pid = 0;
|
||||
|
||||
if (bind(s, (struct sockaddr *)&l_local, sizeof(struct sockaddr_nl)) == -1) {
|
||||
perror("bind");
|
||||
close(s);
|
||||
return -1;
|
||||
}
|
||||
|
||||
{
|
||||
int on = l_local.nl_groups;
|
||||
setsockopt(s, 270, 1, &on, sizeof(on));
|
||||
}
|
||||
|
||||
Where 270 above is SOL_NETLINK, and 1 is a NETLINK_ADD_MEMBERSHIP socket
|
||||
option. To drop multicast subscription one should call above socket option
|
||||
with NETLINK_DROP_MEMBERSHIP parameter which is defined as 0.
|
||||
|
||||
2.6.14 netlink code only allows to select a group which is less or equal to
|
||||
the maximum group number, which is used at netlink_kernel_create() time.
|
||||
In case of connector it is CN_NETLINK_USERS + 0xf, so if you want to use
|
||||
group number 12345, you must increment CN_NETLINK_USERS to that number.
|
||||
Additional 0xf numbers are allocated to be used by non-in-kernel users.
|
||||
|
||||
Due to this limitation, group 0xffffffff does not work now, so one can
|
||||
not use add/remove connector's group notifications, but as far as I know,
|
||||
only cn_test.c test module used it.
|
||||
|
||||
Some work in netlink area is still being done, so things can be changed in
|
||||
2.6.15 timeframe, if it will happen, documentation will be updated for that
|
||||
kernel.
|
|
@ -36,7 +36,7 @@ cpufreq stats provides following statistics (explained in detail below).
|
|||
|
||||
All the statistics will be from the time the stats driver has been inserted
|
||||
to the time when a read of a particular statistic is done. Obviously, stats
|
||||
driver will not have any information about the the frequcny transitions before
|
||||
driver will not have any information about the frequency transitions before
|
||||
the stats driver insertion.
|
||||
|
||||
--------------------------------------------------------------------------------
|
||||
|
|
|
@ -60,6 +60,18 @@ all of the cpus in the system. This removes any overhead due to
|
|||
load balancing code trying to pull tasks outside of the cpu exclusive
|
||||
cpuset only to be prevented by the tasks' cpus_allowed mask.
|
||||
|
||||
A cpuset that is mem_exclusive restricts kernel allocations for
|
||||
page, buffer and other data commonly shared by the kernel across
|
||||
multiple users. All cpusets, whether mem_exclusive or not, restrict
|
||||
allocations of memory for user space. This enables configuring a
|
||||
system so that several independent jobs can share common kernel
|
||||
data, such as file system pages, while isolating each jobs user
|
||||
allocation in its own cpuset. To do this, construct a large
|
||||
mem_exclusive cpuset to hold all the jobs, and construct child,
|
||||
non-mem_exclusive cpusets for each individual job. Only a small
|
||||
amount of typical kernel memory, such as requests from interrupt
|
||||
handlers, is allowed to be taken outside even a mem_exclusive cpuset.
|
||||
|
||||
User level code may create and destroy cpusets by name in the cpuset
|
||||
virtual file system, manage the attributes and permissions of these
|
||||
cpusets and which CPUs and Memory Nodes are assigned to each cpuset,
|
||||
|
@ -82,7 +94,7 @@ the available CPU and Memory resources amongst the requesting tasks.
|
|||
But larger systems, which benefit more from careful processor and
|
||||
memory placement to reduce memory access times and contention,
|
||||
and which typically represent a larger investment for the customer,
|
||||
can benefit from explictly placing jobs on properly sized subsets of
|
||||
can benefit from explicitly placing jobs on properly sized subsets of
|
||||
the system.
|
||||
|
||||
This can be especially valuable on:
|
||||
|
@ -265,7 +277,7 @@ rewritten to the 'tasks' file of its cpuset. This is done to avoid
|
|||
impacting the scheduler code in the kernel with a check for changes
|
||||
in a tasks processor placement.
|
||||
|
||||
There is an exception to the above. If hotplug funtionality is used
|
||||
There is an exception to the above. If hotplug functionality is used
|
||||
to remove all the CPUs that are currently assigned to a cpuset,
|
||||
then the kernel will automatically update the cpus_allowed of all
|
||||
tasks attached to CPUs in that cpuset to allow all CPUs. When memory
|
||||
|
|
|
@ -223,6 +223,7 @@ CAST5 algorithm contributors:
|
|||
|
||||
TEA/XTEA algorithm contributors:
|
||||
Aaron Grothe
|
||||
Michael Ringe
|
||||
|
||||
Khazad algorithm contributors:
|
||||
Aaron Grothe
|
||||
|
|
|
@ -1,4 +1,4 @@
|
|||
Below is the orginal README file from the descore.shar package.
|
||||
Below is the original README file from the descore.shar package.
|
||||
------------------------------------------------------------------------------
|
||||
|
||||
des - fast & portable DES encryption & decryption.
|
||||
|
|
91
Documentation/dcdbas.txt
Normal file
91
Documentation/dcdbas.txt
Normal file
|
@ -0,0 +1,91 @@
|
|||
Overview
|
||||
|
||||
The Dell Systems Management Base Driver provides a sysfs interface for
|
||||
systems management software such as Dell OpenManage to perform system
|
||||
management interrupts and host control actions (system power cycle or
|
||||
power off after OS shutdown) on certain Dell systems.
|
||||
|
||||
Dell OpenManage requires this driver on the following Dell PowerEdge systems:
|
||||
300, 1300, 1400, 400SC, 500SC, 1500SC, 1550, 600SC, 1600SC, 650, 1655MC,
|
||||
700, and 750. Other Dell software such as the open source libsmbios project
|
||||
is expected to make use of this driver, and it may include the use of this
|
||||
driver on other Dell systems.
|
||||
|
||||
The Dell libsmbios project aims towards providing access to as much BIOS
|
||||
information as possible. See http://linux.dell.com/libsmbios/main/ for
|
||||
more information about the libsmbios project.
|
||||
|
||||
|
||||
System Management Interrupt
|
||||
|
||||
On some Dell systems, systems management software must access certain
|
||||
management information via a system management interrupt (SMI). The SMI data
|
||||
buffer must reside in 32-bit address space, and the physical address of the
|
||||
buffer is required for the SMI. The driver maintains the memory required for
|
||||
the SMI and provides a way for the application to generate the SMI.
|
||||
The driver creates the following sysfs entries for systems management
|
||||
software to perform these system management interrupts:
|
||||
|
||||
/sys/devices/platform/dcdbas/smi_data
|
||||
/sys/devices/platform/dcdbas/smi_data_buf_phys_addr
|
||||
/sys/devices/platform/dcdbas/smi_data_buf_size
|
||||
/sys/devices/platform/dcdbas/smi_request
|
||||
|
||||
Systems management software must perform the following steps to execute
|
||||
a SMI using this driver:
|
||||
|
||||
1) Lock smi_data.
|
||||
2) Write system management command to smi_data.
|
||||
3) Write "1" to smi_request to generate a calling interface SMI or
|
||||
"2" to generate a raw SMI.
|
||||
4) Read system management command response from smi_data.
|
||||
5) Unlock smi_data.
|
||||
|
||||
|
||||
Host Control Action
|
||||
|
||||
Dell OpenManage supports a host control feature that allows the administrator
|
||||
to perform a power cycle or power off of the system after the OS has finished
|
||||
shutting down. On some Dell systems, this host control feature requires that
|
||||
a driver perform a SMI after the OS has finished shutting down.
|
||||
|
||||
The driver creates the following sysfs entries for systems management software
|
||||
to schedule the driver to perform a power cycle or power off host control
|
||||
action after the system has finished shutting down:
|
||||
|
||||
/sys/devices/platform/dcdbas/host_control_action
|
||||
/sys/devices/platform/dcdbas/host_control_smi_type
|
||||
/sys/devices/platform/dcdbas/host_control_on_shutdown
|
||||
|
||||
Dell OpenManage performs the following steps to execute a power cycle or
|
||||
power off host control action using this driver:
|
||||
|
||||
1) Write host control action to be performed to host_control_action.
|
||||
2) Write type of SMI that driver needs to perform to host_control_smi_type.
|
||||
3) Write "1" to host_control_on_shutdown to enable host control action.
|
||||
4) Initiate OS shutdown.
|
||||
(Driver will perform host control SMI when it is notified that the OS
|
||||
has finished shutting down.)
|
||||
|
||||
|
||||
Host Control SMI Type
|
||||
|
||||
The following table shows the value to write to host_control_smi_type to
|
||||
perform a power cycle or power off host control action:
|
||||
|
||||
PowerEdge System Host Control SMI Type
|
||||
---------------- ---------------------
|
||||
300 HC_SMITYPE_TYPE1
|
||||
1300 HC_SMITYPE_TYPE1
|
||||
1400 HC_SMITYPE_TYPE2
|
||||
500SC HC_SMITYPE_TYPE2
|
||||
1500SC HC_SMITYPE_TYPE2
|
||||
1550 HC_SMITYPE_TYPE2
|
||||
600SC HC_SMITYPE_TYPE2
|
||||
1600SC HC_SMITYPE_TYPE2
|
||||
650 HC_SMITYPE_TYPE2
|
||||
1655MC HC_SMITYPE_TYPE2
|
||||
700 HC_SMITYPE_TYPE3
|
||||
750 HC_SMITYPE_TYPE3
|
||||
|
||||
|
100
Documentation/dell_rbu.txt
Normal file
100
Documentation/dell_rbu.txt
Normal file
|
@ -0,0 +1,100 @@
|
|||
Purpose:
|
||||
Demonstrate the usage of the new open sourced rbu (Remote BIOS Update) driver
|
||||
for updating BIOS images on Dell servers and desktops.
|
||||
|
||||
Scope:
|
||||
This document discusses the functionality of the rbu driver only.
|
||||
It does not cover the support needed from aplications to enable the BIOS to
|
||||
update itself with the image downloaded in to the memory.
|
||||
|
||||
Overview:
|
||||
This driver works with Dell OpenManage or Dell Update Packages for updating
|
||||
the BIOS on Dell servers (starting from servers sold since 1999), desktops
|
||||
and notebooks (starting from those sold in 2005).
|
||||
Please go to http://support.dell.com register and you can find info on
|
||||
OpenManage and Dell Update packages (DUP).
|
||||
Libsmbios can also be used to update BIOS on Dell systems go to
|
||||
http://linux.dell.com/libsmbios/ for details.
|
||||
|
||||
Dell_RBU driver supports BIOS update using the monilothic image and packetized
|
||||
image methods. In case of moniolithic the driver allocates a contiguous chunk
|
||||
of physical pages having the BIOS image. In case of packetized the app
|
||||
using the driver breaks the image in to packets of fixed sizes and the driver
|
||||
would place each packet in contiguous physical memory. The driver also
|
||||
maintains a link list of packets for reading them back.
|
||||
If the dell_rbu driver is unloaded all the allocated memory is freed.
|
||||
|
||||
The rbu driver needs to have an application (as mentioned above)which will
|
||||
inform the BIOS to enable the update in the next system reboot.
|
||||
|
||||
The user should not unload the rbu driver after downloading the BIOS image
|
||||
or updating.
|
||||
|
||||
The driver load creates the following directories under the /sys file system.
|
||||
/sys/class/firmware/dell_rbu/loading
|
||||
/sys/class/firmware/dell_rbu/data
|
||||
/sys/devices/platform/dell_rbu/image_type
|
||||
/sys/devices/platform/dell_rbu/data
|
||||
/sys/devices/platform/dell_rbu/packet_size
|
||||
|
||||
The driver supports two types of update mechanism; monolithic and packetized.
|
||||
These update mechanism depends upon the BIOS currently running on the system.
|
||||
Most of the Dell systems support a monolithic update where the BIOS image is
|
||||
copied to a single contiguous block of physical memory.
|
||||
In case of packet mechanism the single memory can be broken in smaller chuks
|
||||
of contiguous memory and the BIOS image is scattered in these packets.
|
||||
|
||||
By default the driver uses monolithic memory for the update type. This can be
|
||||
changed to packets during the driver load time by specifying the load
|
||||
parameter image_type=packet. This can also be changed later as below
|
||||
echo packet > /sys/devices/platform/dell_rbu/image_type
|
||||
|
||||
In packet update mode the packet size has to be given before any packets can
|
||||
be downloaded. It is done as below
|
||||
echo XXXX > /sys/devices/platform/dell_rbu/packet_size
|
||||
In the packet update mechanism, the user neesd to create a new file having
|
||||
packets of data arranged back to back. It can be done as follows
|
||||
The user creates packets header, gets the chunk of the BIOS image and
|
||||
placs it next to the packetheader; now, the packetheader + BIOS image chunk
|
||||
added to geather should match the specified packet_size. This makes one
|
||||
packet, the user needs to create more such packets out of the entire BIOS
|
||||
image file and then arrange all these packets back to back in to one single
|
||||
file.
|
||||
This file is then copied to /sys/class/firmware/dell_rbu/data.
|
||||
Once this file gets to the driver, the driver extracts packet_size data from
|
||||
the file and spreads it accross the physical memory in contiguous packet_sized
|
||||
space.
|
||||
This method makes sure that all the packets get to the driver in a single operation.
|
||||
|
||||
In monolithic update the user simply get the BIOS image (.hdr file) and copies
|
||||
to the data file as is without any change to the BIOS image itself.
|
||||
|
||||
Do the steps below to download the BIOS image.
|
||||
1) echo 1 > /sys/class/firmware/dell_rbu/loading
|
||||
2) cp bios_image.hdr /sys/class/firmware/dell_rbu/data
|
||||
3) echo 0 > /sys/class/firmware/dell_rbu/loading
|
||||
|
||||
The /sys/class/firmware/dell_rbu/ entries will remain till the following is
|
||||
done.
|
||||
echo -1 > /sys/class/firmware/dell_rbu/loading.
|
||||
Until this step is completed the driver cannot be unloaded.
|
||||
Also echoing either mono ,packet or init in to image_type will free up the
|
||||
memory allocated by the driver.
|
||||
|
||||
If an user by accident executes steps 1 and 3 above without executing step 2;
|
||||
it will make the /sys/class/firmware/dell_rbu/ entries to disappear.
|
||||
The entries can be recreated by doing the following
|
||||
echo init > /sys/devices/platform/dell_rbu/image_type
|
||||
NOTE: echoing init in image_type does not change it original value.
|
||||
|
||||
Also the driver provides /sys/devices/platform/dell_rbu/data readonly file to
|
||||
read back the image downloaded.
|
||||
|
||||
NOTE:
|
||||
This driver requires a patch for firmware_class.c which has the modified
|
||||
request_firmware_nowait function.
|
||||
Also after updating the BIOS image an user mdoe application neeeds to execute
|
||||
code which message the BIOS update request to the BIOS. So on the next reboot
|
||||
the BIOS knows about the new image downloaded and it updates it self.
|
||||
Also don't unload the rbu drive if the image has to be updated.
|
||||
|
73
Documentation/device-mapper/snapshot.txt
Normal file
73
Documentation/device-mapper/snapshot.txt
Normal file
|
@ -0,0 +1,73 @@
|
|||
Device-mapper snapshot support
|
||||
==============================
|
||||
|
||||
Device-mapper allows you, without massive data copying:
|
||||
|
||||
*) To create snapshots of any block device i.e. mountable, saved states of
|
||||
the block device which are also writable without interfering with the
|
||||
original content;
|
||||
*) To create device "forks", i.e. multiple different versions of the
|
||||
same data stream.
|
||||
|
||||
|
||||
In both cases, dm copies only the chunks of data that get changed and
|
||||
uses a separate copy-on-write (COW) block device for storage.
|
||||
|
||||
|
||||
There are two dm targets available: snapshot and snapshot-origin.
|
||||
|
||||
*) snapshot-origin <origin>
|
||||
|
||||
which will normally have one or more snapshots based on it.
|
||||
You must create the snapshot-origin device before you can create snapshots.
|
||||
Reads will be mapped directly to the backing device. For each write, the
|
||||
original data will be saved in the <COW device> of each snapshot to keep
|
||||
its visible content unchanged, at least until the <COW device> fills up.
|
||||
|
||||
|
||||
*) snapshot <origin> <COW device> <persistent?> <chunksize>
|
||||
|
||||
A snapshot is created of the <origin> block device. Changed chunks of
|
||||
<chunksize> sectors will be stored on the <COW device>. Writes will
|
||||
only go to the <COW device>. Reads will come from the <COW device> or
|
||||
from <origin> for unchanged data. <COW device> will often be
|
||||
smaller than the origin and if it fills up the snapshot will become
|
||||
useless and be disabled, returning errors. So it is important to monitor
|
||||
the amount of free space and expand the <COW device> before it fills up.
|
||||
|
||||
<persistent?> is P (Persistent) or N (Not persistent - will not survive
|
||||
after reboot).
|
||||
|
||||
|
||||
How this is used by LVM2
|
||||
========================
|
||||
When you create the first LVM2 snapshot of a volume, four dm devices are used:
|
||||
|
||||
1) a device containing the original mapping table of the source volume;
|
||||
2) a device used as the <COW device>;
|
||||
3) a "snapshot" device, combining #1 and #2, which is the visible snapshot
|
||||
volume;
|
||||
4) the "original" volume (which uses the device number used by the original
|
||||
source volume), whose table is replaced by a "snapshot-origin" mapping
|
||||
from device #1.
|
||||
|
||||
A fixed naming scheme is used, so with the following commands:
|
||||
|
||||
lvcreate -L 1G -n base volumeGroup
|
||||
lvcreate -L 100M --snapshot -n snap volumeGroup/base
|
||||
|
||||
we'll have this situation (with volumes in above order):
|
||||
|
||||
# dmsetup table|grep volumeGroup
|
||||
|
||||
volumeGroup-base-real: 0 2097152 linear 8:19 384
|
||||
volumeGroup-snap-cow: 0 204800 linear 8:19 2097536
|
||||
volumeGroup-snap: 0 2097152 snapshot 254:11 254:12 P 16
|
||||
volumeGroup-base: 0 2097152 snapshot-origin 254:11
|
||||
|
||||
# ls -lL /dev/mapper/volumeGroup-*
|
||||
brw------- 1 root root 254, 11 29 ago 18:15 /dev/mapper/volumeGroup-base-real
|
||||
brw------- 1 root root 254, 12 29 ago 18:15 /dev/mapper/volumeGroup-snap-cow
|
||||
brw------- 1 root root 254, 13 29 ago 18:15 /dev/mapper/volumeGroup-snap
|
||||
brw------- 1 root root 254, 10 29 ago 18:14 /dev/mapper/volumeGroup-base
|
||||
|
|
@ -55,6 +55,7 @@ aic7*seq.h*
|
|||
aicasm
|
||||
aicdb.h*
|
||||
asm
|
||||
asm-offsets.*
|
||||
asm_offsets.*
|
||||
autoconf.h*
|
||||
bbootsect
|
||||
|
|
|
@ -14,8 +14,8 @@ struct device_driver {
|
|||
int (*probe) (struct device * dev);
|
||||
int (*remove) (struct device * dev);
|
||||
|
||||
int (*suspend) (struct device * dev, pm_message_t state, u32 level);
|
||||
int (*resume) (struct device * dev, u32 level);
|
||||
int (*suspend) (struct device * dev, pm_message_t state);
|
||||
int (*resume) (struct device * dev);
|
||||
};
|
||||
|
||||
|
||||
|
@ -194,69 +194,13 @@ device; i.e. anything in the device's driver_data field.
|
|||
If the device is still present, it should quiesce the device and place
|
||||
it into a supported low-power state.
|
||||
|
||||
int (*suspend) (struct device * dev, pm_message_t state, u32 level);
|
||||
int (*suspend) (struct device * dev, pm_message_t state);
|
||||
|
||||
suspend is called to put the device in a low power state. There are
|
||||
several stages to successfully suspending a device, which is denoted in
|
||||
the @level parameter. Breaking the suspend transition into several
|
||||
stages affords the platform flexibility in performing device power
|
||||
management based on the requirements of the system and the
|
||||
user-defined policy.
|
||||
suspend is called to put the device in a low power state.
|
||||
|
||||
SUSPEND_NOTIFY notifies the device that a suspend transition is about
|
||||
to happen. This happens on system power state transitions to verify
|
||||
that all devices can successfully suspend.
|
||||
int (*resume) (struct device * dev);
|
||||
|
||||
A driver may choose to fail on this call, which should cause the
|
||||
entire suspend transition to fail. A driver should fail only if it
|
||||
knows that the device will not be able to be resumed properly when the
|
||||
system wakes up again. It could also fail if it somehow determines it
|
||||
is in the middle of an operation too important to stop.
|
||||
|
||||
SUSPEND_DISABLE tells the device to stop I/O transactions. When it
|
||||
stops transactions, or what it should do with unfinished transactions
|
||||
is a policy of the driver. After this call, the driver should not
|
||||
accept any other I/O requests.
|
||||
|
||||
SUSPEND_SAVE_STATE tells the device to save the context of the
|
||||
hardware. This includes any bus-specific hardware state and
|
||||
device-specific hardware state. A pointer to this saved state can be
|
||||
stored in the device's saved_state field.
|
||||
|
||||
SUSPEND_POWER_DOWN tells the driver to place the device in the low
|
||||
power state requested.
|
||||
|
||||
Whether suspend is called with a given level is a policy of the
|
||||
platform. Some levels may be omitted; drivers must not assume the
|
||||
reception of any level. However, all levels must be called in the
|
||||
order above; i.e. notification will always come before disabling;
|
||||
disabling the device will come before suspending the device.
|
||||
|
||||
All calls are made with interrupts enabled, except for the
|
||||
SUSPEND_POWER_DOWN level.
|
||||
|
||||
int (*resume) (struct device * dev, u32 level);
|
||||
|
||||
Resume is used to bring a device back from a low power state. Like the
|
||||
suspend transition, it happens in several stages.
|
||||
|
||||
RESUME_POWER_ON tells the driver to set the power state to the state
|
||||
before the suspend call (The device could have already been in a low
|
||||
power state before the suspend call to put in a lower power state).
|
||||
|
||||
RESUME_RESTORE_STATE tells the driver to restore the state saved by
|
||||
the SUSPEND_SAVE_STATE suspend call.
|
||||
|
||||
RESUME_ENABLE tells the driver to start accepting I/O transactions
|
||||
again. Depending on driver policy, the device may already have pending
|
||||
I/O requests.
|
||||
|
||||
RESUME_POWER_ON is called with interrupts disabled. The other resume
|
||||
levels are called with interrupts enabled.
|
||||
|
||||
As with the various suspend stages, the driver must not assume that
|
||||
any other resume calls have been or will be made. Each call should be
|
||||
self-contained and not dependent on any external state.
|
||||
Resume is used to bring a device back from a low power state.
|
||||
|
||||
|
||||
Attributes
|
||||
|
|
|
@ -350,7 +350,7 @@ When a driver is registered, the bus's list of devices is iterated
|
|||
over. bus->match() is called for each device that is not already
|
||||
claimed by a driver.
|
||||
|
||||
When a device is successfully bound to a device, device->driver is
|
||||
When a device is successfully bound to a driver, device->driver is
|
||||
set, the device is added to a per-driver list of devices, and a
|
||||
symlink is created in the driver's sysfs directory that points to the
|
||||
device's physical directory:
|
||||
|
|
|
@ -1,55 +1,74 @@
|
|||
How to get the Nebula Electronics DigiTV, Pinnacle PCTV Sat, Twinhan DST + clones working
|
||||
=========================================================================================
|
||||
How to get the Nebula, PCTV and Twinhan DST cards working
|
||||
=========================================================
|
||||
|
||||
1) General information
|
||||
======================
|
||||
This class of cards has a bt878a as the PCI interface, and
|
||||
require the bttv driver.
|
||||
|
||||
This class of cards has a bt878a chip as the PCI interface.
|
||||
The different card drivers require the bttv driver to provide the means
|
||||
to access the i2c bus and the gpio pins of the bt8xx chipset.
|
||||
Please pay close attention to the warning about the bttv module
|
||||
options below for the DST card.
|
||||
|
||||
2) Compilation rules for Kernel >= 2.6.12
|
||||
=========================================
|
||||
1) General informations
|
||||
=======================
|
||||
|
||||
Enable the following options:
|
||||
These drivers require the bttv driver to provide the means to access
|
||||
the i2c bus and the gpio pins of the bt8xx chipset.
|
||||
|
||||
Because of this, you need to enable
|
||||
"Device drivers" => "Multimedia devices"
|
||||
=> "Video For Linux" => "BT848 Video For Linux"
|
||||
"Device drivers" => "Multimedia devices" => "Digital Video Broadcasting Devices"
|
||||
=> "DVB for Linux" "DVB Core Support" "Nebula/Pinnacle PCTV/TwinHan PCI Cards"
|
||||
=> "Video For Linux" => "BT848 Video For Linux"
|
||||
|
||||
3) Loading Modules, described by two approaches
|
||||
===============================================
|
||||
Furthermore you need to enable
|
||||
"Device drivers" => "Multimedia devices" => "Digital Video Broadcasting Devices"
|
||||
=> "DVB for Linux" "DVB Core Support" "BT8xx based PCI cards"
|
||||
|
||||
2) Loading Modules
|
||||
==================
|
||||
|
||||
In general you need to load the bttv driver, which will handle the gpio and
|
||||
i2c communication for us, plus the common dvb-bt8xx device driver,
|
||||
which is called the backend.
|
||||
The frontends for Nebula DigiTV (nxt6000), Pinnacle PCTV Sat (cx24110),
|
||||
TwinHan DST + clones (dst and dst-ca) are loaded automatically by the backend.
|
||||
For further details about TwinHan DST + clones see /Documentation/dvb/ci.txt.
|
||||
i2c communication for us, plus the common dvb-bt8xx device driver.
|
||||
The frontends for Nebula (nxt6000), Pinnacle PCTV (cx24110) and
|
||||
TwinHan (dst) are loaded automatically by the dvb-bt8xx device driver.
|
||||
|
||||
3a) The manual approach
|
||||
-----------------------
|
||||
|
||||
Loading modules:
|
||||
modprobe bttv
|
||||
modprobe dvb-bt8xx
|
||||
|
||||
Unloading modules:
|
||||
modprobe -r dvb-bt8xx
|
||||
modprobe -r bttv
|
||||
|
||||
3b) The automatic approach
|
||||
3a) Nebula / Pinnacle PCTV
|
||||
--------------------------
|
||||
|
||||
If not already done by installation, place a line either in
|
||||
/etc/modules.conf or in /etc/modprobe.conf containing this text:
|
||||
alias char-major-81 bttv
|
||||
$ modprobe bttv (normally bttv is being loaded automatically by kmod)
|
||||
$ modprobe dvb-bt8xx (or just place dvb-bt8xx in /etc/modules for automatic loading)
|
||||
|
||||
Then place a line in /etc/modules containing this text:
|
||||
dvb-bt8xx
|
||||
|
||||
Reboot your system and have fun!
|
||||
3b) TwinHan and Clones
|
||||
--------------------------
|
||||
|
||||
$ modprobe bttv i2c_hw=1 card=0x71
|
||||
$ modprobe dvb-bt8xx
|
||||
$ modprobe dst
|
||||
|
||||
The value 0x71 will override the PCI type detection for dvb-bt8xx,
|
||||
which is necessary for TwinHan cards.
|
||||
|
||||
If you're having an older card (blue color circuit) and card=0x71 locks
|
||||
your machine, try using 0x68, too. If that does not work, ask on the
|
||||
mailing list.
|
||||
|
||||
The DST module takes a couple of useful parameters.
|
||||
|
||||
verbose takes values 0 to 4. These values control the verbosity level,
|
||||
and can be used to debug also.
|
||||
|
||||
verbose=0 means complete disabling of messages
|
||||
1 only error messages are displayed
|
||||
2 notifications are also displayed
|
||||
3 informational messages are also displayed
|
||||
4 debug setting
|
||||
|
||||
dst_addons takes values 0 and 0x20. A value of 0 means it is a FTA card.
|
||||
0x20 means it has a Conditional Access slot.
|
||||
|
||||
The autodected values are determined bythe cards 'response
|
||||
string' which you can see in your logs e.g.
|
||||
|
||||
dst_get_device_id: Recognise [DSTMCI]
|
||||
|
||||
|
||||
--
|
||||
Authors: Richard Walker, Jamie Honan, Michael Hunold, Manu Abraham, Uwe Bugla
|
||||
Authors: Richard Walker, Jamie Honan, Michael Hunold, Manu Abraham
|
||||
|
|
|
@ -23,7 +23,6 @@ This application requires the following to function properly as of now.
|
|||
eg: $ szap -c channels.conf -r "TMC" -x
|
||||
|
||||
(b) a channels.conf containing a valid PMT PID
|
||||
|
||||
eg: TMC:11996:h:0:27500:278:512:650:321
|
||||
|
||||
here 278 is a valid PMT PID. the rest of the values are the
|
||||
|
@ -31,13 +30,7 @@ This application requires the following to function properly as of now.
|
|||
|
||||
(c) after running a szap, you have to run ca_zap, for the
|
||||
descrambler to function,
|
||||
|
||||
eg: $ ca_zap patched_channels.conf "TMC"
|
||||
|
||||
The patched means a patch to apply to scan, such that scan can
|
||||
generate a channels.conf_with pmt, which has this PMT PID info
|
||||
(NOTE: szap cannot use this channels.conf with the PMT_PID)
|
||||
|
||||
eg: $ ca_zap channels.conf "TMC"
|
||||
|
||||
(d) Hopeflly Enjoy your favourite subscribed channel as you do with
|
||||
a FTA card.
|
||||
|
|
|
@ -7,7 +7,7 @@ To protect itself the kernel has to verify this address.
|
|||
|
||||
In older versions of Linux this was done with the
|
||||
int verify_area(int type, const void * addr, unsigned long size)
|
||||
function.
|
||||
function (which has since been replaced by access_ok()).
|
||||
|
||||
This function verified that the memory area starting at address
|
||||
addr and of size size was accessible for the operation specified
|
||||
|
|
14
Documentation/fb/cyblafb/bugs
Normal file
14
Documentation/fb/cyblafb/bugs
Normal file
|
@ -0,0 +1,14 @@
|
|||
Bugs
|
||||
====
|
||||
|
||||
I currently don't know of any bug. Please do send reports to:
|
||||
- linux-fbdev-devel@lists.sourceforge.net
|
||||
- Knut_Petersen@t-online.de.
|
||||
|
||||
|
||||
Untested features
|
||||
=================
|
||||
|
||||
All LCD stuff is untested. If it worked in tridentfb, it should work in
|
||||
cyblafb. Please test and report the results to Knut_Petersen@t-online.de.
|
||||
|
7
Documentation/fb/cyblafb/credits
Normal file
7
Documentation/fb/cyblafb/credits
Normal file
|
@ -0,0 +1,7 @@
|
|||
Thanks to
|
||||
=========
|
||||
* Alan Hourihane, for writing the X trident driver
|
||||
* Jani Monoses, for writing the tridentfb driver
|
||||
* Antonino A. Daplas, for review of the first published
|
||||
version of cyblafb and some code
|
||||
* Jochen Hein, for testing and a helpfull bug report
|
17
Documentation/fb/cyblafb/documentation
Normal file
17
Documentation/fb/cyblafb/documentation
Normal file
|
@ -0,0 +1,17 @@
|
|||
Available Documentation
|
||||
=======================
|
||||
|
||||
Apollo PLE 133 Chipset VT8601A North Bridge Datasheet, Rev. 1.82, October 22,
|
||||
2001, available from VIA:
|
||||
|
||||
http://www.viavpsd.com/product/6/15/DS8601A182.pdf
|
||||
|
||||
The datasheet is incomplete, some registers that need to be programmed are not
|
||||
explained at all and important bits are listed as "reserved". But you really
|
||||
need the datasheet to understand the code. "p. xxx" comments refer to page
|
||||
numbers of this document.
|
||||
|
||||
XFree/XOrg drivers are available and of good quality, looking at the code
|
||||
there is a good idea if the datasheet does not provide enough information
|
||||
or if the datasheet seems to be wrong.
|
||||
|
155
Documentation/fb/cyblafb/fb.modes
Normal file
155
Documentation/fb/cyblafb/fb.modes
Normal file
|
@ -0,0 +1,155 @@
|
|||
#
|
||||
# Sample fb.modes file
|
||||
#
|
||||
# Provides an incomplete list of working modes for
|
||||
# the cyberblade/i1 graphics core.
|
||||
#
|
||||
# The value 4294967256 is used instead of -40. Of course, -40 is not
|
||||
# a really reasonable value, but chip design does not always follow
|
||||
# logic. Believe me, it's ok, and it's the way the BIOS does it.
|
||||
#
|
||||
# fbset requires 4294967256 in fb.modes and -40 as an argument to
|
||||
# the -t parameter. That's also not too reasonable, and it might change
|
||||
# in the future or might even be differt for your current version.
|
||||
#
|
||||
|
||||
mode "640x480-50"
|
||||
geometry 640 480 640 3756 8
|
||||
timings 47619 4294967256 24 17 0 216 3
|
||||
endmode
|
||||
|
||||
mode "640x480-60"
|
||||
geometry 640 480 640 3756 8
|
||||
timings 39682 4294967256 24 17 0 216 3
|
||||
endmode
|
||||
|
||||
mode "640x480-70"
|
||||
geometry 640 480 640 3756 8
|
||||
timings 34013 4294967256 24 17 0 216 3
|
||||
endmode
|
||||
|
||||
mode "640x480-72"
|
||||
geometry 640 480 640 3756 8
|
||||
timings 33068 4294967256 24 17 0 216 3
|
||||
endmode
|
||||
|
||||
mode "640x480-75"
|
||||
geometry 640 480 640 3756 8
|
||||
timings 31746 4294967256 24 17 0 216 3
|
||||
endmode
|
||||
|
||||
mode "640x480-80"
|
||||
geometry 640 480 640 3756 8
|
||||
timings 29761 4294967256 24 17 0 216 3
|
||||
endmode
|
||||
|
||||
mode "640x480-85"
|
||||
geometry 640 480 640 3756 8
|
||||
timings 28011 4294967256 24 17 0 216 3
|
||||
endmode
|
||||
|
||||
mode "800x600-50"
|
||||
geometry 800 600 800 3221 8
|
||||
timings 30303 96 24 14 0 136 11
|
||||
endmode
|
||||
|
||||
mode "800x600-60"
|
||||
geometry 800 600 800 3221 8
|
||||
timings 25252 96 24 14 0 136 11
|
||||
endmode
|
||||
|
||||
mode "800x600-70"
|
||||
geometry 800 600 800 3221 8
|
||||
timings 21645 96 24 14 0 136 11
|
||||
endmode
|
||||
|
||||
mode "800x600-72"
|
||||
geometry 800 600 800 3221 8
|
||||
timings 21043 96 24 14 0 136 11
|
||||
endmode
|
||||
|
||||
mode "800x600-75"
|
||||
geometry 800 600 800 3221 8
|
||||
timings 20202 96 24 14 0 136 11
|
||||
endmode
|
||||
|
||||
mode "800x600-80"
|
||||
geometry 800 600 800 3221 8
|
||||
timings 18939 96 24 14 0 136 11
|
||||
endmode
|
||||
|
||||
mode "800x600-85"
|
||||
geometry 800 600 800 3221 8
|
||||
timings 17825 96 24 14 0 136 11
|
||||
endmode
|
||||
|
||||
mode "1024x768-50"
|
||||
geometry 1024 768 1024 2815 8
|
||||
timings 19054 144 24 29 0 120 3
|
||||
endmode
|
||||
|
||||
mode "1024x768-60"
|
||||
geometry 1024 768 1024 2815 8
|
||||
timings 15880 144 24 29 0 120 3
|
||||
endmode
|
||||
|
||||
mode "1024x768-70"
|
||||
geometry 1024 768 1024 2815 8
|
||||
timings 13610 144 24 29 0 120 3
|
||||
endmode
|
||||
|
||||
mode "1024x768-72"
|
||||
geometry 1024 768 1024 2815 8
|
||||
timings 13232 144 24 29 0 120 3
|
||||
endmode
|
||||
|
||||
mode "1024x768-75"
|
||||
geometry 1024 768 1024 2815 8
|
||||
timings 12703 144 24 29 0 120 3
|
||||
endmode
|
||||
|
||||
mode "1024x768-80"
|
||||
geometry 1024 768 1024 2815 8
|
||||
timings 11910 144 24 29 0 120 3
|
||||
endmode
|
||||
|
||||
mode "1024x768-85"
|
||||
geometry 1024 768 1024 2815 8
|
||||
timings 11209 144 24 29 0 120 3
|
||||
endmode
|
||||
|
||||
mode "1280x1024-50"
|
||||
geometry 1280 1024 1280 2662 8
|
||||
timings 11114 232 16 39 0 160 3
|
||||
endmode
|
||||
|
||||
mode "1280x1024-60"
|
||||
geometry 1280 1024 1280 2662 8
|
||||
timings 9262 232 16 39 0 160 3
|
||||
endmode
|
||||
|
||||
mode "1280x1024-70"
|
||||
geometry 1280 1024 1280 2662 8
|
||||
timings 7939 232 16 39 0 160 3
|
||||
endmode
|
||||
|
||||
mode "1280x1024-72"
|
||||
geometry 1280 1024 1280 2662 8
|
||||
timings 7719 232 16 39 0 160 3
|
||||
endmode
|
||||
|
||||
mode "1280x1024-75"
|
||||
geometry 1280 1024 1280 2662 8
|
||||
timings 7410 232 16 39 0 160 3
|
||||
endmode
|
||||
|
||||
mode "1280x1024-80"
|
||||
geometry 1280 1024 1280 2662 8
|
||||
timings 6946 232 16 39 0 160 3
|
||||
endmode
|
||||
|
||||
mode "1280x1024-85"
|
||||
geometry 1280 1024 1280 2662 8
|
||||
timings 6538 232 16 39 0 160 3
|
||||
endmode
|
||||
|
80
Documentation/fb/cyblafb/performance
Normal file
80
Documentation/fb/cyblafb/performance
Normal file
|
@ -0,0 +1,80 @@
|
|||
Speed
|
||||
=====
|
||||
|
||||
CyBlaFB is much faster than tridentfb and vesafb. Compare the performance data
|
||||
for mode 1280x1024-[8,16,32]@61 Hz.
|
||||
|
||||
Test 1: Cat a file with 2000 lines of 0 characters.
|
||||
Test 2: Cat a file with 2000 lines of 80 characters.
|
||||
Test 3: Cat a file with 2000 lines of 160 characters.
|
||||
|
||||
All values show system time use in seconds, kernel 2.6.12 was used for
|
||||
the measurements. 2.6.13 is a bit slower, 2.6.14 hopefully will include a
|
||||
patch that speeds up kernel bitblitting a lot ( > 20%).
|
||||
|
||||
+-----------+-----------------------------------------------------+
|
||||
| | not accelerated |
|
||||
| TRIDENTFB +-----------------+-----------------+-----------------+
|
||||
| of 2.6.12 | 8 bpp | 16 bpp | 32 bpp |
|
||||
| | noypan | ypan | noypan | ypan | noypan | ypan |
|
||||
+-----------+--------+--------+--------+--------+--------+--------+
|
||||
| Test 1 | 4.31 | 4.33 | 6.05 | 12.81 | ---- | ---- |
|
||||
| Test 2 | 67.94 | 5.44 | 123.16 | 14.79 | ---- | ---- |
|
||||
| Test 3 | 131.36 | 6.55 | 240.12 | 16.76 | ---- | ---- |
|
||||
+-----------+--------+--------+--------+--------+--------+--------+
|
||||
| Comments | | | completely bro- |
|
||||
| | | | ken, monitor |
|
||||
| | | | switches off |
|
||||
+-----------+-----------------+-----------------+-----------------+
|
||||
|
||||
|
||||
+-----------+-----------------------------------------------------+
|
||||
| | accelerated |
|
||||
| TRIDENTFB +-----------------+-----------------+-----------------+
|
||||
| of 2.6.12 | 8 bpp | 16 bpp | 32 bpp |
|
||||
| | noypan | ypan | noypan | ypan | noypan | ypan |
|
||||
+-----------+--------+--------+--------+--------+--------+--------+
|
||||
| Test 1 | ---- | ---- | 20.62 | 1.22 | ---- | ---- |
|
||||
| Test 2 | ---- | ---- | 22.61 | 3.19 | ---- | ---- |
|
||||
| Test 3 | ---- | ---- | 24.59 | 5.16 | ---- | ---- |
|
||||
+-----------+--------+--------+--------+--------+--------+--------+
|
||||
| Comments | broken, writing | broken, ok only | completely bro- |
|
||||
| | to wrong places | if bgcolor is | ken, monitor |
|
||||
| | on screen + bug | black, bug in | switches off |
|
||||
| | in fillrect() | fillrect() | |
|
||||
+-----------+-----------------+-----------------+-----------------+
|
||||
|
||||
|
||||
+-----------+-----------------------------------------------------+
|
||||
| | not accelerated |
|
||||
| VESAFB +-----------------+-----------------+-----------------+
|
||||
| of 2.6.12 | 8 bpp | 16 bpp | 32 bpp |
|
||||
| | noypan | ypan | noypan | ypan | noypan | ypan |
|
||||
+-----------+--------+--------+--------+--------+--------+--------+
|
||||
| Test 1 | 4.26 | 3.76 | 5.99 | 7.23 | ---- | ---- |
|
||||
| Test 2 | 65.65 | 4.89 | 120.88 | 9.08 | ---- | ---- |
|
||||
| Test 3 | 126.91 | 5.94 | 235.77 | 11.03 | ---- | ---- |
|
||||
+-----------+--------+--------+--------+--------+--------+--------+
|
||||
| Comments | vga=0x307 | vga=0x31a | vga=0x31b not |
|
||||
| | fh=80kHz | fh=80kHz | supported by |
|
||||
| | fv=75kHz | fv=75kHz | video BIOS and |
|
||||
| | | | hardware |
|
||||
+-----------+-----------------+-----------------+-----------------+
|
||||
|
||||
|
||||
+-----------+-----------------------------------------------------+
|
||||
| | accelerated |
|
||||
| CYBLAFB +-----------------+-----------------+-----------------+
|
||||
| | 8 bpp | 16 bpp | 32 bpp |
|
||||
| | noypan | ypan | noypan | ypan | noypan | ypan |
|
||||
+-----------+--------+--------+--------+--------+--------+--------+
|
||||
| Test 1 | 8.02 | 0.23 | 19.04 | 0.61 | 57.12 | 2.74 |
|
||||
| Test 2 | 8.38 | 0.55 | 19.39 | 0.92 | 57.54 | 3.13 |
|
||||
| Test 3 | 8.73 | 0.86 | 19.74 | 1.24 | 57.95 | 3.51 |
|
||||
+-----------+--------+--------+--------+--------+--------+--------+
|
||||
| Comments | | | |
|
||||
| | | | |
|
||||
| | | | |
|
||||
| | | | |
|
||||
+-----------+-----------------+-----------------+-----------------+
|
||||
|
32
Documentation/fb/cyblafb/todo
Normal file
32
Documentation/fb/cyblafb/todo
Normal file
|
@ -0,0 +1,32 @@
|
|||
TODO / Missing features
|
||||
=======================
|
||||
|
||||
Verify LCD stuff "stretch" and "center" options are
|
||||
completely untested ... this code needs to be
|
||||
verified. As I don't have access to such
|
||||
hardware, please contact me if you are
|
||||
willing run some tests.
|
||||
|
||||
Interlaced video modes The reason that interleaved
|
||||
modes are disabled is that I do not know
|
||||
the meaning of the vertical interlace
|
||||
parameter. Also the datasheet mentions a
|
||||
bit d8 of a horizontal interlace parameter,
|
||||
but nowhere the lower 8 bits. Please help
|
||||
if you can.
|
||||
|
||||
low-res double scan modes Who needs it?
|
||||
|
||||
accelerated color blitting Who needs it? The console driver does use color
|
||||
blitting for nothing but drawing the penguine,
|
||||
everything else is done using color expanding
|
||||
blitting of 1bpp character bitmaps.
|
||||
|
||||
xpanning Who needs it?
|
||||
|
||||
ioctls Who needs it?
|
||||
|
||||
TV-out Will be done later
|
||||
|
||||
??? Feel free to contact me if you have any
|
||||
feature requests
|
206
Documentation/fb/cyblafb/usage
Normal file
206
Documentation/fb/cyblafb/usage
Normal file
|
@ -0,0 +1,206 @@
|
|||
CyBlaFB is a framebuffer driver for the Cyberblade/i1 graphics core integrated
|
||||
into the VIA Apollo PLE133 (aka vt8601) south bridge. It is developed and
|
||||
tested using a VIA EPIA 5000 board.
|
||||
|
||||
Cyblafb - compiled into the kernel or as a module?
|
||||
==================================================
|
||||
|
||||
You might compile cyblafb either as a module or compile it permanently into the
|
||||
kernel.
|
||||
|
||||
Unless you have a real reason to do so you should not compile both vesafb and
|
||||
cyblafb permanently into the kernel. It's possible and it helps during the
|
||||
developement cycle, but it's useless and will at least block some otherwise
|
||||
usefull memory for ordinary users.
|
||||
|
||||
Selecting Modes
|
||||
===============
|
||||
|
||||
Startup Mode
|
||||
============
|
||||
|
||||
First of all, you might use the "vga=???" boot parameter as it is
|
||||
documented in vesafb.txt and svga.txt. Cyblafb will detect the video
|
||||
mode selected and will use the geometry and timings found by
|
||||
inspecting the hardware registers.
|
||||
|
||||
video=cyblafb vga=0x317
|
||||
|
||||
Alternatively you might use a combination of the mode, ref and bpp
|
||||
parameters. If you compiled the driver into the kernel, add something
|
||||
like this to the kernel command line:
|
||||
|
||||
video=cyblafb:1280x1024,bpp=16,ref=50 ...
|
||||
|
||||
If you compiled the driver as a module, the same mode would be
|
||||
selected by the following command:
|
||||
|
||||
modprobe cyblafb mode=1280x1024 bpp=16 ref=50 ...
|
||||
|
||||
None of the modes possible to select as startup modes are affected by
|
||||
the problems described at the end of the next subsection.
|
||||
|
||||
Mode changes using fbset
|
||||
========================
|
||||
|
||||
You might use fbset to change the video mode, see "man fbset". Cyblafb
|
||||
generally does assume that you know what you are doing. But it does
|
||||
some checks, especially those that are needed to prevent you from
|
||||
damaging your hardware.
|
||||
|
||||
- only 8, 16, 24 and 32 bpp video modes are accepted
|
||||
- interlaced video modes are not accepted
|
||||
- double scan video modes are not accepted
|
||||
- if a flat panel is found, cyblafb does not allow you
|
||||
to program a resolution higher than the physical
|
||||
resolution of the flat panel monitor
|
||||
- cyblafb does not allow xres to differ from xres_virtual
|
||||
- cyblafb does not allow vclk to exceed 230 MHz. As 32 bpp
|
||||
and (currently) 24 bit modes use a doubled vclk internally,
|
||||
the dotclock limit as seen by fbset is 115 MHz for those
|
||||
modes and 230 MHz for 8 and 16 bpp modes.
|
||||
|
||||
Any request that violates the rules given above will be ignored and
|
||||
fbset will return an error.
|
||||
|
||||
If you program a virtual y resolution higher than the hardware limit,
|
||||
cyblafb will silently decrease that value to the highest possible
|
||||
value.
|
||||
|
||||
Attempts to disable acceleration are ignored.
|
||||
|
||||
Some video modes that should work do not work as expected. If you use
|
||||
the standard fb.modes, fbset 640x480-60 will program that mode, but
|
||||
you will see a vertical area, about two characters wide, with only
|
||||
much darker characters than the other characters on the screen.
|
||||
Cyblafb does allow that mode to be set, as it does not violate the
|
||||
official specifications. It would need a lot of code to reliably sort
|
||||
out all invalid modes, playing around with the margin values will
|
||||
give a valid mode quickly. And if cyblafb would detect such an invalid
|
||||
mode, should it silently alter the requested values or should it
|
||||
report an error? Both options have some pros and cons. As stated
|
||||
above, none of the startup modes are affected, and if you set
|
||||
verbosity to 1 or higher, cyblafb will print the fbset command that
|
||||
would be needed to program that mode using fbset.
|
||||
|
||||
|
||||
Other Parameters
|
||||
================
|
||||
|
||||
|
||||
crt don't autodetect, assume monitor connected to
|
||||
standard VGA connector
|
||||
|
||||
fp don't autodetect, assume flat panel display
|
||||
connected to flat panel monitor interface
|
||||
|
||||
nativex inform driver about native x resolution of
|
||||
flat panel monitor connected to special
|
||||
interface (should be autodetected)
|
||||
|
||||
stretch stretch image to adapt low resolution modes to
|
||||
higer resolutions of flat panel monitors
|
||||
connected to special interface
|
||||
|
||||
center center image to adapt low resolution modes to
|
||||
higer resolutions of flat panel monitors
|
||||
connected to special interface
|
||||
|
||||
memsize use if autodetected memsize is wrong ...
|
||||
should never be necessary
|
||||
|
||||
nopcirr disable PCI read retry
|
||||
nopciwr disable PCI write retry
|
||||
nopcirb disable PCI read bursts
|
||||
nopciwb disable PCI write bursts
|
||||
|
||||
bpp bpp for specified modes
|
||||
valid values: 8 || 16 || 24 || 32
|
||||
|
||||
ref refresh rate for specified mode
|
||||
valid values: 50 <= ref <= 85
|
||||
|
||||
mode 640x480 or 800x600 or 1024x768 or 1280x1024
|
||||
if not specified, the startup mode will be detected
|
||||
and used, so you might also use the vga=??? parameter
|
||||
described in vesafb.txt. If you do not specify a mode,
|
||||
bpp and ref parameters are ignored.
|
||||
|
||||
verbosity 0 is the default, increase to at least 2 for every
|
||||
bug report!
|
||||
|
||||
vesafb allows cyblafb to be loaded after vesafb has been
|
||||
loaded. See sections "Module unloading ...".
|
||||
|
||||
|
||||
Development hints
|
||||
=================
|
||||
|
||||
It's much faster do compile a module and to load the new version after
|
||||
unloading the old module than to compile a new kernel and to reboot. So if you
|
||||
try to work on cyblafb, it might be a good idea to use cyblafb as a module.
|
||||
In real life, fast often means dangerous, and that's also the case here. If
|
||||
you introduce a serious bug when cyblafb is compiled into the kernel, the
|
||||
kernel will lock or oops with a high probability before the file system is
|
||||
mounted, and the danger for your data is low. If you load a broken own version
|
||||
of cyblafb on a running system, the danger for the integrity of the file
|
||||
system is much higher as you might need a hard reset afterwards. Decide
|
||||
yourself.
|
||||
|
||||
Module unloading, the vfb method
|
||||
================================
|
||||
|
||||
If you want to unload/reload cyblafb using the virtual framebuffer, you need
|
||||
to enable vfb support in the kernel first. After that, load the modules as
|
||||
shown below:
|
||||
|
||||
modprobe vfb vfb_enable=1
|
||||
modprobe fbcon
|
||||
modprobe cyblafb
|
||||
fbset -fb /dev/fb1 1280x1024-60 -vyres 2662
|
||||
con2fb /dev/fb1 /dev/tty1
|
||||
...
|
||||
|
||||
If you now made some changes to cyblafb and want to reload it, you might do it
|
||||
as show below:
|
||||
|
||||
con2fb /dev/fb0 /dev/tty1
|
||||
...
|
||||
rmmod cyblafb
|
||||
modprobe cyblafb
|
||||
con2fb /dev/fb1 /dev/tty1
|
||||
...
|
||||
|
||||
Of course, you might choose another mode, and most certainly you also want to
|
||||
map some other /dev/tty* to the real framebuffer device. You might also choose
|
||||
to compile fbcon as a kernel module or place it permanently in the kernel.
|
||||
|
||||
I do not know of any way to unload fbcon, and fbcon will prevent the
|
||||
framebuffer device loaded first from unloading. [If there is a way, then
|
||||
please add a description here!]
|
||||
|
||||
Module unloading, the vesafb method
|
||||
===================================
|
||||
|
||||
Configure the kernel:
|
||||
|
||||
<*> Support for frame buffer devices
|
||||
[*] VESA VGA graphics support
|
||||
<M> Cyberblade/i1 support
|
||||
|
||||
Add e.g. "video=vesafb:ypan vga=0x307" to the kernel parameters. The ypan
|
||||
parameter is important, choose any vga parameter you like as long as it is
|
||||
a graphics mode.
|
||||
|
||||
After booting, load cyblafb without any mode and bpp parameter and assign
|
||||
cyblafb to individual ttys using con2fb, e.g.:
|
||||
|
||||
modprobe cyblafb vesafb=1
|
||||
con2fb /dev/fb1 /dev/tty1
|
||||
|
||||
Unloading cyblafb works without problems after you assign vesafb to all
|
||||
ttys again, e.g.:
|
||||
|
||||
con2fb /dev/fb0 /dev/tty1
|
||||
rmmod cyblafb
|
||||
|
85
Documentation/fb/cyblafb/whycyblafb
Normal file
85
Documentation/fb/cyblafb/whycyblafb
Normal file
|
@ -0,0 +1,85 @@
|
|||
I tried the following framebuffer drivers:
|
||||
|
||||
- TRIDENTFB is full of bugs. Acceleration is broken for Blade3D
|
||||
graphics cores like the cyberblade/i1. It claims to support a great
|
||||
number of devices, but documentation for most of these devices is
|
||||
unfortunately not available. There is _no_ reason to use tridentfb
|
||||
for cyberblade/i1 + CRT users. VESAFB is faster, and the one
|
||||
advantage, mode switching, is broken in tridentfb.
|
||||
|
||||
- VESAFB is used by many distributions as a standard. Vesafb does
|
||||
not support mode switching. VESAFB is a bit faster than the working
|
||||
configurations of TRIDENTFB, but it is still too slow, even if you
|
||||
use ypan.
|
||||
|
||||
- EPIAFB (you'll find it on sourceforge) supports the Cyberblade/i1
|
||||
graphics core, but it still has serious bugs and developement seems
|
||||
to have stopped. This is the one driver with TV-out support. If you
|
||||
do need this feature, try epiafb.
|
||||
|
||||
None of these drivers was a real option for me.
|
||||
|
||||
I believe that is unreasonable to change code that announces to support 20
|
||||
devices if I only have more or less sufficient documentation for exactly one
|
||||
of these. The risk of breaking device foo while fixing device bar is too high.
|
||||
|
||||
So I decided to start CyBlaFB as a stripped down tridentfb.
|
||||
|
||||
All code specific to other Trident chips has been removed. After that there
|
||||
were a lot of cosmetic changes to increase the readability of the code. All
|
||||
register names were changed to those mnemonics used in the datasheet. Function
|
||||
and macro names were changed if they hindered easy understanding of the code.
|
||||
|
||||
After that I debugged the code and implemented some new features. I'll try to
|
||||
give a little summary of the main changes:
|
||||
|
||||
- calculation of vertical and horizontal timings was fixed
|
||||
|
||||
- video signal quality has been improved dramatically
|
||||
|
||||
- acceleration:
|
||||
|
||||
- fillrect and copyarea were fixed and reenabled
|
||||
|
||||
- color expanding imageblit was newly implemented, color
|
||||
imageblit (only used to draw the penguine) still uses the
|
||||
generic code.
|
||||
|
||||
- init of the acceleration engine was improved and moved to a
|
||||
place where it really works ...
|
||||
|
||||
- sync function has a timeout now and tries to reset and
|
||||
reinit the accel engine if necessary
|
||||
|
||||
- fewer slow copyarea calls when doing ypan scrolling by using
|
||||
undocumented bit d21 of screen start address stored in
|
||||
CR2B[5]. BIOS does use it also, so this should be safe.
|
||||
|
||||
- cyblafb rejects any attempt to set modes that would cause vclk
|
||||
values above reasonable 230 MHz. 32bit modes use a clock
|
||||
multiplicator of 2, so fbset does show the correct values for
|
||||
pixclock but not for vclk in this case. The fbset limit is 115 MHz
|
||||
for 32 bpp modes.
|
||||
|
||||
- cyblafb rejects modes known to be broken or unimplemented (all
|
||||
interlaced modes, all doublescan modes for now)
|
||||
|
||||
- cyblafb now works independant of the video mode in effect at startup
|
||||
time (tridentfb does not init all needed registers to reasonable
|
||||
values)
|
||||
|
||||
- switching between video modes does work reliably now
|
||||
|
||||
- the first video mode now is the one selected on startup using the
|
||||
vga=???? mechanism or any of
|
||||
- 640x480, 800x600, 1024x768, 1280x1024
|
||||
- 8, 16, 24 or 32 bpp
|
||||
- refresh between 50 Hz and 85 Hz, 1 Hz steps (1280x1024-32
|
||||
is limited to 63Hz)
|
||||
|
||||
- pci retry and pci burst mode are settable (try to disable if you
|
||||
experience latency problems)
|
||||
|
||||
- built as a module cyblafb might be unloaded and reloaded using
|
||||
the vfb module and con2vt or might be used together with vesafb
|
||||
|
|
@ -5,6 +5,7 @@ Intel 810/815 Framebuffer driver
|
|||
March 17, 2002
|
||||
|
||||
First Released: July 2001
|
||||
Last Update: September 12, 2005
|
||||
================================================================
|
||||
|
||||
A. Introduction
|
||||
|
@ -44,6 +45,8 @@ B. Features
|
|||
|
||||
- Hardware Cursor Support
|
||||
|
||||
- Supports EDID probing either by DDC/I2C or through the BIOS
|
||||
|
||||
C. List of available options
|
||||
|
||||
a. "video=i810fb"
|
||||
|
@ -52,14 +55,17 @@ C. List of available options
|
|||
Recommendation: required
|
||||
|
||||
b. "xres:<value>"
|
||||
select horizontal resolution in pixels
|
||||
select horizontal resolution in pixels. (This parameter will be
|
||||
ignored if 'mode_option' is specified. See 'o' below).
|
||||
|
||||
Recommendation: user preference
|
||||
(default = 640)
|
||||
|
||||
c. "yres:<value>"
|
||||
select vertical resolution in scanlines. If Discrete Video Timings
|
||||
is enabled, this will be ignored and computed as 3*xres/4.
|
||||
is enabled, this will be ignored and computed as 3*xres/4. (This
|
||||
parameter will be ignored if 'mode_option' is specified. See 'o'
|
||||
below)
|
||||
|
||||
Recommendation: user preference
|
||||
(default = 480)
|
||||
|
@ -86,7 +92,8 @@ C. List of available options
|
|||
g. "hsync1/hsync2:<value>"
|
||||
select the minimum and maximum Horizontal Sync Frequency of the
|
||||
monitor in KHz. If a using a fixed frequency monitor, hsync1 must
|
||||
be equal to hsync2.
|
||||
be equal to hsync2. If EDID probing is successful, these will be
|
||||
ignored and values will be taken from the EDID block.
|
||||
|
||||
Recommendation: check monitor manual for correct values
|
||||
default (29/30)
|
||||
|
@ -94,7 +101,8 @@ C. List of available options
|
|||
h. "vsync1/vsync2:<value>"
|
||||
select the minimum and maximum Vertical Sync Frequency of the monitor
|
||||
in Hz. You can also use this option to lock your monitor's refresh
|
||||
rate.
|
||||
rate. If EDID probing is successful, these will be ignored and values
|
||||
will be taken from the EDID block.
|
||||
|
||||
Recommendation: check monitor manual for correct values
|
||||
(default = 60/60)
|
||||
|
@ -154,7 +162,11 @@ C. List of available options
|
|||
|
||||
Recommendation: do not set
|
||||
(default = not set)
|
||||
|
||||
o. <xres>x<yres>[-<bpp>][@<refresh>]
|
||||
The driver will now accept specification of boot mode option. If this
|
||||
is specified, the options 'xres' and 'yres' will be ignored. See
|
||||
Documentation/fb/modedb.txt for usage.
|
||||
|
||||
D. Kernel booting
|
||||
|
||||
Separate each option/option-pair by commas (,) and the option from its value
|
||||
|
@ -176,7 +188,10 @@ will be computed based on the hsync1/hsync2 and vsync1/vsync2 values.
|
|||
|
||||
IMPORTANT:
|
||||
You must include hsync1, hsync2, vsync1 and vsync2 to enable video modes
|
||||
better than 640x480 at 60Hz.
|
||||
better than 640x480 at 60Hz. HOWEVER, if your chipset/display combination
|
||||
supports I2C and has an EDID block, you can safely exclude hsync1, hsync2,
|
||||
vsync1 and vsync2 parameters. These parameters will be taken from the EDID
|
||||
block.
|
||||
|
||||
E. Module options
|
||||
|
||||
|
@ -217,32 +232,21 @@ F. Setup
|
|||
This is required. The option is under "Character Devices"
|
||||
|
||||
d. Under "Graphics Support", select "Intel 810/815" either statically
|
||||
or as a module. Choose "use VESA GTF for video timings" if you
|
||||
need to maximize the capability of your display. To be on the
|
||||
or as a module. Choose "use VESA Generalized Timing Formula" if
|
||||
you need to maximize the capability of your display. To be on the
|
||||
safe side, you can leave this unselected.
|
||||
|
||||
e. If you want a framebuffer console, enable it under "Console
|
||||
e. If you want support for DDC/I2C probing (Plug and Play Displays),
|
||||
set 'Enable DDC Support' to 'y'. To make this option appear, set
|
||||
'use VESA Generalized Timing Formula' to 'y'.
|
||||
|
||||
f. If you want a framebuffer console, enable it under "Console
|
||||
Drivers"
|
||||
|
||||
f. Compile your kernel.
|
||||
g. Compile your kernel.
|
||||
|
||||
g. Load the driver as described in section D and E.
|
||||
h. Load the driver as described in section D and E.
|
||||
|
||||
Optional:
|
||||
h. If you are going to run XFree86 with its native drivers, the
|
||||
standard XFree86 4.1.0 and 4.2.0 drivers should work as is.
|
||||
However, there's a bug in the XFree86 i810 drivers. It attempts
|
||||
to use XAA even when switched to the console. This will crash
|
||||
your server. I have a fix at this site:
|
||||
|
||||
http://i810fb.sourceforge.net.
|
||||
|
||||
You can either use the patch, or just replace
|
||||
|
||||
/usr/X11R6/lib/modules/drivers/i810_drv.o
|
||||
|
||||
with the one provided at the website.
|
||||
|
||||
i. Try the DirectFB (http://www.directfb.org) + the i810 gfxdriver
|
||||
patch to see the chipset in action (or inaction :-).
|
||||
|
||||
|
|
|
@ -20,12 +20,83 @@ in a video= option, fbmem considers that to be a global video mode option.
|
|||
|
||||
Valid mode specifiers (mode_option argument):
|
||||
|
||||
<xres>x<yres>[-<bpp>][@<refresh>]
|
||||
<xres>x<yres>[M][R][-<bpp>][@<refresh>][i][m]
|
||||
<name>[-<bpp>][@<refresh>]
|
||||
|
||||
with <xres>, <yres>, <bpp> and <refresh> decimal numbers and <name> a string.
|
||||
Things between square brackets are optional.
|
||||
|
||||
If 'M' is specified in the mode_option argument (after <yres> and before
|
||||
<bpp> and <refresh>, if specified) the timings will be calculated using
|
||||
VESA(TM) Coordinated Video Timings instead of looking up the mode from a table.
|
||||
If 'R' is specified, do a 'reduced blanking' calculation for digital displays.
|
||||
If 'i' is specified, calculate for an interlaced mode. And if 'm' is
|
||||
specified, add margins to the calculation (1.8% of xres rounded down to 8
|
||||
pixels and 1.8% of yres).
|
||||
|
||||
Sample usage: 1024x768M@60m - CVT timing with margins
|
||||
|
||||
***** oOo ***** oOo ***** oOo ***** oOo ***** oOo ***** oOo ***** oOo *****
|
||||
|
||||
What is the VESA(TM) Coordinated Video Timings (CVT)?
|
||||
|
||||
From the VESA(TM) Website:
|
||||
|
||||
"The purpose of CVT is to provide a method for generating a consistent
|
||||
and coordinated set of standard formats, display refresh rates, and
|
||||
timing specifications for computer display products, both those
|
||||
employing CRTs, and those using other display technologies. The
|
||||
intention of CVT is to give both source and display manufacturers a
|
||||
common set of tools to enable new timings to be developed in a
|
||||
consistent manner that ensures greater compatibility."
|
||||
|
||||
This is the third standard approved by VESA(TM) concerning video timings. The
|
||||
first was the Discrete Video Timings (DVT) which is a collection of
|
||||
pre-defined modes approved by VESA(TM). The second is the Generalized Timing
|
||||
Formula (GTF) which is an algorithm to calculate the timings, given the
|
||||
pixelclock, the horizontal sync frequency, or the vertical refresh rate.
|
||||
|
||||
The GTF is limited by the fact that it is designed mainly for CRT displays.
|
||||
It artificially increases the pixelclock because of its high blanking
|
||||
requirement. This is inappropriate for digital display interface with its high
|
||||
data rate which requires that it conserves the pixelclock as much as possible.
|
||||
Also, GTF does not take into account the aspect ratio of the display.
|
||||
|
||||
The CVT addresses these limitations. If used with CRT's, the formula used
|
||||
is a derivation of GTF with a few modifications. If used with digital
|
||||
displays, the "reduced blanking" calculation can be used.
|
||||
|
||||
From the framebuffer subsystem perspective, new formats need not be added
|
||||
to the global mode database whenever a new mode is released by display
|
||||
manufacturers. Specifying for CVT will work for most, if not all, relatively
|
||||
new CRT displays and probably with most flatpanels, if 'reduced blanking'
|
||||
calculation is specified. (The CVT compatibility of the display can be
|
||||
determined from its EDID. The version 1.3 of the EDID has extra 128-byte
|
||||
blocks where additional timing information is placed. As of this time, there
|
||||
is no support yet in the layer to parse this additional blocks.)
|
||||
|
||||
CVT also introduced a new naming convention (should be seen from dmesg output):
|
||||
|
||||
<pix>M<a>[-R]
|
||||
|
||||
where: pix = total amount of pixels in MB (xres x yres)
|
||||
M = always present
|
||||
a = aspect ratio (3 - 4:3; 4 - 5:4; 9 - 15:9, 16:9; A - 16:10)
|
||||
-R = reduced blanking
|
||||
|
||||
example: .48M3-R - 800x600 with reduced blanking
|
||||
|
||||
Note: VESA(TM) has restrictions on what is a standard CVT timing:
|
||||
|
||||
- aspect ratio can only be one of the above values
|
||||
- acceptable refresh rates are 50, 60, 70 or 85 Hz only
|
||||
- if reduced blanking, the refresh rate must be at 60Hz
|
||||
|
||||
If one of the above are not satisfied, the kernel will print a warning but the
|
||||
timings will still be calculated.
|
||||
|
||||
***** oOo ***** oOo ***** oOo ***** oOo ***** oOo ***** oOo ***** oOo *****
|
||||
|
||||
To find a suitable video mode, you just call
|
||||
|
||||
int __init fb_find_mode(struct fb_var_screeninfo *var,
|
||||
|
|
|
@ -17,32 +17,6 @@ Who: Greg Kroah-Hartman <greg@kroah.com>
|
|||
|
||||
---------------------------
|
||||
|
||||
What: ACPI S4bios support
|
||||
When: May 2005
|
||||
Why: Noone uses it, and it probably does not work, anyway. swsusp is
|
||||
faster, more reliable, and people are actually using it.
|
||||
Who: Pavel Machek <pavel@suse.cz>
|
||||
|
||||
---------------------------
|
||||
|
||||
What: PCI Name Database (CONFIG_PCI_NAMES)
|
||||
When: July 2005
|
||||
Why: It bloats the kernel unnecessarily, and is handled by userspace better
|
||||
(pciutils supports it.) Will eliminate the need to try to keep the
|
||||
pci.ids file in sync with the sf.net database all of the time.
|
||||
Who: Greg Kroah-Hartman <gregkh@suse.de>
|
||||
|
||||
---------------------------
|
||||
|
||||
What: io_remap_page_range() (macro or function)
|
||||
When: September 2005
|
||||
Why: Replaced by io_remap_pfn_range() which allows more memory space
|
||||
addressabilty (by using a pfn) and supports sparc & sparc64
|
||||
iospace as part of the pfn.
|
||||
Who: Randy Dunlap <rddunlap@osdl.org>
|
||||
|
||||
---------------------------
|
||||
|
||||
What: RAW driver (CONFIG_RAW_DRIVER)
|
||||
When: December 2005
|
||||
Why: declared obsolete since kernel 2.6.3
|
||||
|
@ -51,14 +25,6 @@ Who: Adrian Bunk <bunk@stusta.de>
|
|||
|
||||
---------------------------
|
||||
|
||||
What: register_ioctl32_conversion() / unregister_ioctl32_conversion()
|
||||
When: April 2005
|
||||
Why: Replaced by ->compat_ioctl in file_operations and other method
|
||||
vecors.
|
||||
Who: Andi Kleen <ak@muc.de>, Christoph Hellwig <hch@lst.de>
|
||||
|
||||
---------------------------
|
||||
|
||||
What: RCU API moves to EXPORT_SYMBOL_GPL
|
||||
When: April 2006
|
||||
Files: include/linux/rcupdate.h, kernel/rcupdate.c
|
||||
|
@ -74,14 +40,6 @@ Who: Paul E. McKenney <paulmck@us.ibm.com>
|
|||
|
||||
---------------------------
|
||||
|
||||
What: remove verify_area()
|
||||
When: July 2006
|
||||
Files: Various uaccess.h headers.
|
||||
Why: Deprecated and redundant. access_ok() should be used instead.
|
||||
Who: Jesper Juhl <juhl-lkml@dif.dk>
|
||||
|
||||
---------------------------
|
||||
|
||||
What: IEEE1394 Audio and Music Data Transmission Protocol driver,
|
||||
Connection Management Procedures driver
|
||||
When: November 2005
|
||||
|
@ -102,16 +60,6 @@ Who: Jody McIntyre <scjody@steamballoon.com>
|
|||
|
||||
---------------------------
|
||||
|
||||
What: register_serial/unregister_serial
|
||||
When: September 2005
|
||||
Why: This interface does not allow serial ports to be registered against
|
||||
a struct device, and as such does not allow correct power management
|
||||
of such ports. 8250-based ports should use serial8250_register_port
|
||||
and serial8250_unregister_port, or platform devices instead.
|
||||
Who: Russell King <rmk@arm.linux.org.uk>
|
||||
|
||||
---------------------------
|
||||
|
||||
What: i2c sysfs name change: in1_ref, vid deprecated in favour of cpu0_vid
|
||||
When: November 2005
|
||||
Files: drivers/i2c/chips/adm1025.c, drivers/i2c/chips/adm1026.c
|
||||
|
@ -135,3 +83,15 @@ Why: With the 16-bit PCMCIA subsystem now behaving (almost) like a
|
|||
pcmciautils package available at
|
||||
http://kernel.org/pub/linux/utils/kernel/pcmcia/
|
||||
Who: Dominik Brodowski <linux@brodo.de>
|
||||
|
||||
---------------------------
|
||||
|
||||
What: ip_queue and ip6_queue (old ipv4-only and ipv6-only netfilter queue)
|
||||
When: December 2005
|
||||
Why: This interface has been obsoleted by the new layer3-independent
|
||||
"nfnetlink_queue". The Kernel interface is compatible, so the old
|
||||
ip[6]tables "QUEUE" targets still work and will transparently handle
|
||||
all packets into nfnetlink queue number 0. Userspace users will have
|
||||
to link against API-compatible library on top of libnfnetlink_queue
|
||||
instead of the current 'libipq'.
|
||||
Who: Harald Welte <laforge@netfilter.org>
|
||||
|
|
123
Documentation/filesystems/files.txt
Normal file
123
Documentation/filesystems/files.txt
Normal file
|
@ -0,0 +1,123 @@
|
|||
File management in the Linux kernel
|
||||
-----------------------------------
|
||||
|
||||
This document describes how locking for files (struct file)
|
||||
and file descriptor table (struct files) works.
|
||||
|
||||
Up until 2.6.12, the file descriptor table has been protected
|
||||
with a lock (files->file_lock) and reference count (files->count).
|
||||
->file_lock protected accesses to all the file related fields
|
||||
of the table. ->count was used for sharing the file descriptor
|
||||
table between tasks cloned with CLONE_FILES flag. Typically
|
||||
this would be the case for posix threads. As with the common
|
||||
refcounting model in the kernel, the last task doing
|
||||
a put_files_struct() frees the file descriptor (fd) table.
|
||||
The files (struct file) themselves are protected using
|
||||
reference count (->f_count).
|
||||
|
||||
In the new lock-free model of file descriptor management,
|
||||
the reference counting is similar, but the locking is
|
||||
based on RCU. The file descriptor table contains multiple
|
||||
elements - the fd sets (open_fds and close_on_exec, the
|
||||
array of file pointers, the sizes of the sets and the array
|
||||
etc.). In order for the updates to appear atomic to
|
||||
a lock-free reader, all the elements of the file descriptor
|
||||
table are in a separate structure - struct fdtable.
|
||||
files_struct contains a pointer to struct fdtable through
|
||||
which the actual fd table is accessed. Initially the
|
||||
fdtable is embedded in files_struct itself. On a subsequent
|
||||
expansion of fdtable, a new fdtable structure is allocated
|
||||
and files->fdtab points to the new structure. The fdtable
|
||||
structure is freed with RCU and lock-free readers either
|
||||
see the old fdtable or the new fdtable making the update
|
||||
appear atomic. Here are the locking rules for
|
||||
the fdtable structure -
|
||||
|
||||
1. All references to the fdtable must be done through
|
||||
the files_fdtable() macro :
|
||||
|
||||
struct fdtable *fdt;
|
||||
|
||||
rcu_read_lock();
|
||||
|
||||
fdt = files_fdtable(files);
|
||||
....
|
||||
if (n <= fdt->max_fds)
|
||||
....
|
||||
...
|
||||
rcu_read_unlock();
|
||||
|
||||
files_fdtable() uses rcu_dereference() macro which takes care of
|
||||
the memory barrier requirements for lock-free dereference.
|
||||
The fdtable pointer must be read within the read-side
|
||||
critical section.
|
||||
|
||||
2. Reading of the fdtable as described above must be protected
|
||||
by rcu_read_lock()/rcu_read_unlock().
|
||||
|
||||
3. For any update to the the fd table, files->file_lock must
|
||||
be held.
|
||||
|
||||
4. To look up the file structure given an fd, a reader
|
||||
must use either fcheck() or fcheck_files() APIs. These
|
||||
take care of barrier requirements due to lock-free lookup.
|
||||
An example :
|
||||
|
||||
struct file *file;
|
||||
|
||||
rcu_read_lock();
|
||||
file = fcheck(fd);
|
||||
if (file) {
|
||||
...
|
||||
}
|
||||
....
|
||||
rcu_read_unlock();
|
||||
|
||||
5. Handling of the file structures is special. Since the look-up
|
||||
of the fd (fget()/fget_light()) are lock-free, it is possible
|
||||
that look-up may race with the last put() operation on the
|
||||
file structure. This is avoided using the rcuref APIs
|
||||
on ->f_count :
|
||||
|
||||
rcu_read_lock();
|
||||
file = fcheck_files(files, fd);
|
||||
if (file) {
|
||||
if (rcuref_inc_lf(&file->f_count))
|
||||
*fput_needed = 1;
|
||||
else
|
||||
/* Didn't get the reference, someone's freed */
|
||||
file = NULL;
|
||||
}
|
||||
rcu_read_unlock();
|
||||
....
|
||||
return file;
|
||||
|
||||
rcuref_inc_lf() detects if refcounts is already zero or
|
||||
goes to zero during increment. If it does, we fail
|
||||
fget()/fget_light().
|
||||
|
||||
6. Since both fdtable and file structures can be looked up
|
||||
lock-free, they must be installed using rcu_assign_pointer()
|
||||
API. If they are looked up lock-free, rcu_dereference()
|
||||
must be used. However it is advisable to use files_fdtable()
|
||||
and fcheck()/fcheck_files() which take care of these issues.
|
||||
|
||||
7. While updating, the fdtable pointer must be looked up while
|
||||
holding files->file_lock. If ->file_lock is dropped, then
|
||||
another thread expand the files thereby creating a new
|
||||
fdtable and making the earlier fdtable pointer stale.
|
||||
For example :
|
||||
|
||||
spin_lock(&files->file_lock);
|
||||
fd = locate_fd(files, file, start);
|
||||
if (fd >= 0) {
|
||||
/* locate_fd() may have expanded fdtable, load the ptr */
|
||||
fdt = files_fdtable(files);
|
||||
FD_SET(fd, fdt->open_fds);
|
||||
FD_CLR(fd, fdt->close_on_exec);
|
||||
spin_unlock(&files->file_lock);
|
||||
.....
|
||||
|
||||
Since locate_fd() can drop ->file_lock (and reacquire ->file_lock),
|
||||
the fdtable pointer (fdt) must be loaded after locate_fd().
|
||||
|
315
Documentation/filesystems/fuse.txt
Normal file
315
Documentation/filesystems/fuse.txt
Normal file
|
@ -0,0 +1,315 @@
|
|||
Definitions
|
||||
~~~~~~~~~~~
|
||||
|
||||
Userspace filesystem:
|
||||
|
||||
A filesystem in which data and metadata are provided by an ordinary
|
||||
userspace process. The filesystem can be accessed normally through
|
||||
the kernel interface.
|
||||
|
||||
Filesystem daemon:
|
||||
|
||||
The process(es) providing the data and metadata of the filesystem.
|
||||
|
||||
Non-privileged mount (or user mount):
|
||||
|
||||
A userspace filesystem mounted by a non-privileged (non-root) user.
|
||||
The filesystem daemon is running with the privileges of the mounting
|
||||
user. NOTE: this is not the same as mounts allowed with the "user"
|
||||
option in /etc/fstab, which is not discussed here.
|
||||
|
||||
Mount owner:
|
||||
|
||||
The user who does the mounting.
|
||||
|
||||
User:
|
||||
|
||||
The user who is performing filesystem operations.
|
||||
|
||||
What is FUSE?
|
||||
~~~~~~~~~~~~~
|
||||
|
||||
FUSE is a userspace filesystem framework. It consists of a kernel
|
||||
module (fuse.ko), a userspace library (libfuse.*) and a mount utility
|
||||
(fusermount).
|
||||
|
||||
One of the most important features of FUSE is allowing secure,
|
||||
non-privileged mounts. This opens up new possibilities for the use of
|
||||
filesystems. A good example is sshfs: a secure network filesystem
|
||||
using the sftp protocol.
|
||||
|
||||
The userspace library and utilities are available from the FUSE
|
||||
homepage:
|
||||
|
||||
http://fuse.sourceforge.net/
|
||||
|
||||
Mount options
|
||||
~~~~~~~~~~~~~
|
||||
|
||||
'fd=N'
|
||||
|
||||
The file descriptor to use for communication between the userspace
|
||||
filesystem and the kernel. The file descriptor must have been
|
||||
obtained by opening the FUSE device ('/dev/fuse').
|
||||
|
||||
'rootmode=M'
|
||||
|
||||
The file mode of the filesystem's root in octal representation.
|
||||
|
||||
'user_id=N'
|
||||
|
||||
The numeric user id of the mount owner.
|
||||
|
||||
'group_id=N'
|
||||
|
||||
The numeric group id of the mount owner.
|
||||
|
||||
'default_permissions'
|
||||
|
||||
By default FUSE doesn't check file access permissions, the
|
||||
filesystem is free to implement it's access policy or leave it to
|
||||
the underlying file access mechanism (e.g. in case of network
|
||||
filesystems). This option enables permission checking, restricting
|
||||
access based on file mode. This is option is usually useful
|
||||
together with the 'allow_other' mount option.
|
||||
|
||||
'allow_other'
|
||||
|
||||
This option overrides the security measure restricting file access
|
||||
to the user mounting the filesystem. This option is by default only
|
||||
allowed to root, but this restriction can be removed with a
|
||||
(userspace) configuration option.
|
||||
|
||||
'max_read=N'
|
||||
|
||||
With this option the maximum size of read operations can be set.
|
||||
The default is infinite. Note that the size of read requests is
|
||||
limited anyway to 32 pages (which is 128kbyte on i386).
|
||||
|
||||
How do non-privileged mounts work?
|
||||
~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~
|
||||
|
||||
Since the mount() system call is a privileged operation, a helper
|
||||
program (fusermount) is needed, which is installed setuid root.
|
||||
|
||||
The implication of providing non-privileged mounts is that the mount
|
||||
owner must not be able to use this capability to compromise the
|
||||
system. Obvious requirements arising from this are:
|
||||
|
||||
A) mount owner should not be able to get elevated privileges with the
|
||||
help of the mounted filesystem
|
||||
|
||||
B) mount owner should not get illegitimate access to information from
|
||||
other users' and the super user's processes
|
||||
|
||||
C) mount owner should not be able to induce undesired behavior in
|
||||
other users' or the super user's processes
|
||||
|
||||
How are requirements fulfilled?
|
||||
~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~
|
||||
|
||||
A) The mount owner could gain elevated privileges by either:
|
||||
|
||||
1) creating a filesystem containing a device file, then opening
|
||||
this device
|
||||
|
||||
2) creating a filesystem containing a suid or sgid application,
|
||||
then executing this application
|
||||
|
||||
The solution is not to allow opening device files and ignore
|
||||
setuid and setgid bits when executing programs. To ensure this
|
||||
fusermount always adds "nosuid" and "nodev" to the mount options
|
||||
for non-privileged mounts.
|
||||
|
||||
B) If another user is accessing files or directories in the
|
||||
filesystem, the filesystem daemon serving requests can record the
|
||||
exact sequence and timing of operations performed. This
|
||||
information is otherwise inaccessible to the mount owner, so this
|
||||
counts as an information leak.
|
||||
|
||||
The solution to this problem will be presented in point 2) of C).
|
||||
|
||||
C) There are several ways in which the mount owner can induce
|
||||
undesired behavior in other users' processes, such as:
|
||||
|
||||
1) mounting a filesystem over a file or directory which the mount
|
||||
owner could otherwise not be able to modify (or could only
|
||||
make limited modifications).
|
||||
|
||||
This is solved in fusermount, by checking the access
|
||||
permissions on the mountpoint and only allowing the mount if
|
||||
the mount owner can do unlimited modification (has write
|
||||
access to the mountpoint, and mountpoint is not a "sticky"
|
||||
directory)
|
||||
|
||||
2) Even if 1) is solved the mount owner can change the behavior
|
||||
of other users' processes.
|
||||
|
||||
i) It can slow down or indefinitely delay the execution of a
|
||||
filesystem operation creating a DoS against the user or the
|
||||
whole system. For example a suid application locking a
|
||||
system file, and then accessing a file on the mount owner's
|
||||
filesystem could be stopped, and thus causing the system
|
||||
file to be locked forever.
|
||||
|
||||
ii) It can present files or directories of unlimited length, or
|
||||
directory structures of unlimited depth, possibly causing a
|
||||
system process to eat up diskspace, memory or other
|
||||
resources, again causing DoS.
|
||||
|
||||
The solution to this as well as B) is not to allow processes
|
||||
to access the filesystem, which could otherwise not be
|
||||
monitored or manipulated by the mount owner. Since if the
|
||||
mount owner can ptrace a process, it can do all of the above
|
||||
without using a FUSE mount, the same criteria as used in
|
||||
ptrace can be used to check if a process is allowed to access
|
||||
the filesystem or not.
|
||||
|
||||
Note that the ptrace check is not strictly necessary to
|
||||
prevent B/2/i, it is enough to check if mount owner has enough
|
||||
privilege to send signal to the process accessing the
|
||||
filesystem, since SIGSTOP can be used to get a similar effect.
|
||||
|
||||
I think these limitations are unacceptable?
|
||||
~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~
|
||||
|
||||
If a sysadmin trusts the users enough, or can ensure through other
|
||||
measures, that system processes will never enter non-privileged
|
||||
mounts, it can relax the last limitation with a "user_allow_other"
|
||||
config option. If this config option is set, the mounting user can
|
||||
add the "allow_other" mount option which disables the check for other
|
||||
users' processes.
|
||||
|
||||
Kernel - userspace interface
|
||||
~~~~~~~~~~~~~~~~~~~~~~~~~~~~
|
||||
|
||||
The following diagram shows how a filesystem operation (in this
|
||||
example unlink) is performed in FUSE.
|
||||
|
||||
NOTE: everything in this description is greatly simplified
|
||||
|
||||
| "rm /mnt/fuse/file" | FUSE filesystem daemon
|
||||
| |
|
||||
| | >sys_read()
|
||||
| | >fuse_dev_read()
|
||||
| | >request_wait()
|
||||
| | [sleep on fc->waitq]
|
||||
| |
|
||||
| >sys_unlink() |
|
||||
| >fuse_unlink() |
|
||||
| [get request from |
|
||||
| fc->unused_list] |
|
||||
| >request_send() |
|
||||
| [queue req on fc->pending] |
|
||||
| [wake up fc->waitq] | [woken up]
|
||||
| >request_wait_answer() |
|
||||
| [sleep on req->waitq] |
|
||||
| | <request_wait()
|
||||
| | [remove req from fc->pending]
|
||||
| | [copy req to read buffer]
|
||||
| | [add req to fc->processing]
|
||||
| | <fuse_dev_read()
|
||||
| | <sys_read()
|
||||
| |
|
||||
| | [perform unlink]
|
||||
| |
|
||||
| | >sys_write()
|
||||
| | >fuse_dev_write()
|
||||
| | [look up req in fc->processing]
|
||||
| | [remove from fc->processing]
|
||||
| | [copy write buffer to req]
|
||||
| [woken up] | [wake up req->waitq]
|
||||
| | <fuse_dev_write()
|
||||
| | <sys_write()
|
||||
| <request_wait_answer() |
|
||||
| <request_send() |
|
||||
| [add request to |
|
||||
| fc->unused_list] |
|
||||
| <fuse_unlink() |
|
||||
| <sys_unlink() |
|
||||
|
||||
There are a couple of ways in which to deadlock a FUSE filesystem.
|
||||
Since we are talking about unprivileged userspace programs,
|
||||
something must be done about these.
|
||||
|
||||
Scenario 1 - Simple deadlock
|
||||
-----------------------------
|
||||
|
||||
| "rm /mnt/fuse/file" | FUSE filesystem daemon
|
||||
| |
|
||||
| >sys_unlink("/mnt/fuse/file") |
|
||||
| [acquire inode semaphore |
|
||||
| for "file"] |
|
||||
| >fuse_unlink() |
|
||||
| [sleep on req->waitq] |
|
||||
| | <sys_read()
|
||||
| | >sys_unlink("/mnt/fuse/file")
|
||||
| | [acquire inode semaphore
|
||||
| | for "file"]
|
||||
| | *DEADLOCK*
|
||||
|
||||
The solution for this is to allow requests to be interrupted while
|
||||
they are in userspace:
|
||||
|
||||
| [interrupted by signal] |
|
||||
| <fuse_unlink() |
|
||||
| [release semaphore] | [semaphore acquired]
|
||||
| <sys_unlink() |
|
||||
| | >fuse_unlink()
|
||||
| | [queue req on fc->pending]
|
||||
| | [wake up fc->waitq]
|
||||
| | [sleep on req->waitq]
|
||||
|
||||
If the filesystem daemon was single threaded, this will stop here,
|
||||
since there's no other thread to dequeue and execute the request.
|
||||
In this case the solution is to kill the FUSE daemon as well. If
|
||||
there are multiple serving threads, you just have to kill them as
|
||||
long as any remain.
|
||||
|
||||
Moral: a filesystem which deadlocks, can soon find itself dead.
|
||||
|
||||
Scenario 2 - Tricky deadlock
|
||||
----------------------------
|
||||
|
||||
This one needs a carefully crafted filesystem. It's a variation on
|
||||
the above, only the call back to the filesystem is not explicit,
|
||||
but is caused by a pagefault.
|
||||
|
||||
| Kamikaze filesystem thread 1 | Kamikaze filesystem thread 2
|
||||
| |
|
||||
| [fd = open("/mnt/fuse/file")] | [request served normally]
|
||||
| [mmap fd to 'addr'] |
|
||||
| [close fd] | [FLUSH triggers 'magic' flag]
|
||||
| [read a byte from addr] |
|
||||
| >do_page_fault() |
|
||||
| [find or create page] |
|
||||
| [lock page] |
|
||||
| >fuse_readpage() |
|
||||
| [queue READ request] |
|
||||
| [sleep on req->waitq] |
|
||||
| | [read request to buffer]
|
||||
| | [create reply header before addr]
|
||||
| | >sys_write(addr - headerlength)
|
||||
| | >fuse_dev_write()
|
||||
| | [look up req in fc->processing]
|
||||
| | [remove from fc->processing]
|
||||
| | [copy write buffer to req]
|
||||
| | >do_page_fault()
|
||||
| | [find or create page]
|
||||
| | [lock page]
|
||||
| | * DEADLOCK *
|
||||
|
||||
Solution is again to let the the request be interrupted (not
|
||||
elaborated further).
|
||||
|
||||
An additional problem is that while the write buffer is being
|
||||
copied to the request, the request must not be interrupted. This
|
||||
is because the destination address of the copy may not be valid
|
||||
after the request is interrupted.
|
||||
|
||||
This is solved with doing the copy atomically, and allowing
|
||||
interruption while the page(s) belonging to the write buffer are
|
||||
faulted with get_user_pages(). The 'req->locked' flag indicates
|
||||
when the copy is taking place, and interruption is delayed until
|
||||
this flag is unset.
|
||||
|
|
@ -50,9 +50,14 @@ userspace utilities, etc.
|
|||
Features
|
||||
========
|
||||
|
||||
- This is a complete rewrite of the NTFS driver that used to be in the kernel.
|
||||
This new driver implements NTFS read support and is functionally equivalent
|
||||
to the old ntfs driver.
|
||||
- This is a complete rewrite of the NTFS driver that used to be in the 2.4 and
|
||||
earlier kernels. This new driver implements NTFS read support and is
|
||||
functionally equivalent to the old ntfs driver and it also implements limited
|
||||
write support. The biggest limitation at present is that files/directories
|
||||
cannot be created or deleted. See below for the list of write features that
|
||||
are so far supported. Another limitation is that writing to compressed files
|
||||
is not implemented at all. Also, neither read nor write access to encrypted
|
||||
files is so far implemented.
|
||||
- The new driver has full support for sparse files on NTFS 3.x volumes which
|
||||
the old driver isn't happy with.
|
||||
- The new driver supports execution of binaries due to mmap() now being
|
||||
|
@ -78,7 +83,20 @@ Features
|
|||
- The new driver supports fsync(2), fdatasync(2), and msync(2).
|
||||
- The new driver supports readv(2) and writev(2).
|
||||
- The new driver supports access time updates (including mtime and ctime).
|
||||
|
||||
- The new driver supports truncate(2) and open(2) with O_TRUNC. But at present
|
||||
only very limited support for highly fragmented files, i.e. ones which have
|
||||
their data attribute split across multiple extents, is included. Another
|
||||
limitation is that at present truncate(2) will never create sparse files,
|
||||
since to mark a file sparse we need to modify the directory entry for the
|
||||
file and we do not implement directory modifications yet.
|
||||
- The new driver supports write(2) which can both overwrite existing data and
|
||||
extend the file size so that you can write beyond the existing data. Also,
|
||||
writing into sparse regions is supported and the holes are filled in with
|
||||
clusters. But at present only limited support for highly fragmented files,
|
||||
i.e. ones which have their data attribute split across multiple extents, is
|
||||
included. Another limitation is that write(2) will never create sparse
|
||||
files, since to mark a file sparse we need to modify the directory entry for
|
||||
the file and we do not implement directory modifications yet.
|
||||
|
||||
Supported mount options
|
||||
=======================
|
||||
|
@ -439,6 +457,34 @@ ChangeLog
|
|||
|
||||
Note, a technical ChangeLog aimed at kernel hackers is in fs/ntfs/ChangeLog.
|
||||
|
||||
2.1.25:
|
||||
- Write support is now extended with write(2) being able to both
|
||||
overwrite existing file data and to extend files. Also, if a write
|
||||
to a sparse region occurs, write(2) will fill in the hole. Note,
|
||||
mmap(2) based writes still do not support writing into holes or
|
||||
writing beyond the initialized size.
|
||||
- Write support has a new feature and that is that truncate(2) and
|
||||
open(2) with O_TRUNC are now implemented thus files can be both made
|
||||
smaller and larger.
|
||||
- Note: Both write(2) and truncate(2)/open(2) with O_TRUNC still have
|
||||
limitations in that they
|
||||
- only provide limited support for highly fragmented files.
|
||||
- only work on regular, i.e. uncompressed and unencrypted files.
|
||||
- never create sparse files although this will change once directory
|
||||
operations are implemented.
|
||||
- Lots of bug fixes and enhancements across the board.
|
||||
2.1.24:
|
||||
- Support journals ($LogFile) which have been modified by chkdsk. This
|
||||
means users can boot into Windows after we marked the volume dirty.
|
||||
The Windows boot will run chkdsk and then reboot. The user can then
|
||||
immediately boot into Linux rather than having to do a full Windows
|
||||
boot first before rebooting into Linux and we will recognize such a
|
||||
journal and empty it as it is clean by definition.
|
||||
- Support journals ($LogFile) with only one restart page as well as
|
||||
journals with two different restart pages. We sanity check both and
|
||||
either use the only sane one or the more recent one of the two in the
|
||||
case that both are valid.
|
||||
- Lots of bug fixes and enhancements across the board.
|
||||
2.1.23:
|
||||
- Stamp the user space journal, aka transaction log, aka $UsnJrnl, if
|
||||
it is present and active thus telling Windows and applications using
|
||||
|
|
|
@ -133,6 +133,7 @@ Table 1-1: Process specific entries in /proc
|
|||
statm Process memory status information
|
||||
status Process status in human readable form
|
||||
wchan If CONFIG_KALLSYMS is set, a pre-decoded wchan
|
||||
smaps Extension based on maps, presenting the rss size for each mapped file
|
||||
..............................................................................
|
||||
|
||||
For example, to get the status information of a process, all you have to do is
|
||||
|
@ -1240,16 +1241,38 @@ swap-intensive.
|
|||
overcommit_memory
|
||||
-----------------
|
||||
|
||||
This file contains one value. The following algorithm is used to decide if
|
||||
there's enough memory: if the value of overcommit_memory is positive, then
|
||||
there's always enough memory. This is a useful feature, since programs often
|
||||
malloc() huge amounts of memory 'just in case', while they only use a small
|
||||
part of it. Leaving this value at 0 will lead to the failure of such a huge
|
||||
malloc(), when in fact the system has enough memory for the program to run.
|
||||
Controls overcommit of system memory, possibly allowing processes
|
||||
to allocate (but not use) more memory than is actually available.
|
||||
|
||||
On the other hand, enabling this feature can cause you to run out of memory
|
||||
and thrash the system to death, so large and/or important servers will want to
|
||||
set this value to 0.
|
||||
|
||||
0 - Heuristic overcommit handling. Obvious overcommits of
|
||||
address space are refused. Used for a typical system. It
|
||||
ensures a seriously wild allocation fails while allowing
|
||||
overcommit to reduce swap usage. root is allowed to
|
||||
allocate slighly more memory in this mode. This is the
|
||||
default.
|
||||
|
||||
1 - Always overcommit. Appropriate for some scientific
|
||||
applications.
|
||||
|
||||
2 - Don't overcommit. The total address space commit
|
||||
for the system is not permitted to exceed swap plus a
|
||||
configurable percentage (default is 50) of physical RAM.
|
||||
Depending on the percentage you use, in most situations
|
||||
this means a process will not be killed while attempting
|
||||
to use already-allocated memory but will receive errors
|
||||
on memory allocation as appropriate.
|
||||
|
||||
overcommit_ratio
|
||||
----------------
|
||||
|
||||
Percentage of physical memory size to include in overcommit calculations
|
||||
(see above.)
|
||||
|
||||
Memory allocation limit = swapspace + physmem * (overcommit_ratio / 100)
|
||||
|
||||
swapspace = total size of all swap areas
|
||||
physmem = size of physical memory in system
|
||||
|
||||
nr_hugepages and hugetlb_shm_group
|
||||
----------------------------------
|
||||
|
|
362
Documentation/filesystems/relayfs.txt
Normal file
362
Documentation/filesystems/relayfs.txt
Normal file
|
@ -0,0 +1,362 @@
|
|||
|
||||
relayfs - a high-speed data relay filesystem
|
||||
============================================
|
||||
|
||||
relayfs is a filesystem designed to provide an efficient mechanism for
|
||||
tools and facilities to relay large and potentially sustained streams
|
||||
of data from kernel space to user space.
|
||||
|
||||
The main abstraction of relayfs is the 'channel'. A channel consists
|
||||
of a set of per-cpu kernel buffers each represented by a file in the
|
||||
relayfs filesystem. Kernel clients write into a channel using
|
||||
efficient write functions which automatically log to the current cpu's
|
||||
channel buffer. User space applications mmap() the per-cpu files and
|
||||
retrieve the data as it becomes available.
|
||||
|
||||
The format of the data logged into the channel buffers is completely
|
||||
up to the relayfs client; relayfs does however provide hooks which
|
||||
allow clients to impose some structure on the buffer data. Nor does
|
||||
relayfs implement any form of data filtering - this also is left to
|
||||
the client. The purpose is to keep relayfs as simple as possible.
|
||||
|
||||
This document provides an overview of the relayfs API. The details of
|
||||
the function parameters are documented along with the functions in the
|
||||
filesystem code - please see that for details.
|
||||
|
||||
Semantics
|
||||
=========
|
||||
|
||||
Each relayfs channel has one buffer per CPU, each buffer has one or
|
||||
more sub-buffers. Messages are written to the first sub-buffer until
|
||||
it is too full to contain a new message, in which case it it is
|
||||
written to the next (if available). Messages are never split across
|
||||
sub-buffers. At this point, userspace can be notified so it empties
|
||||
the first sub-buffer, while the kernel continues writing to the next.
|
||||
|
||||
When notified that a sub-buffer is full, the kernel knows how many
|
||||
bytes of it are padding i.e. unused. Userspace can use this knowledge
|
||||
to copy only valid data.
|
||||
|
||||
After copying it, userspace can notify the kernel that a sub-buffer
|
||||
has been consumed.
|
||||
|
||||
relayfs can operate in a mode where it will overwrite data not yet
|
||||
collected by userspace, and not wait for it to consume it.
|
||||
|
||||
relayfs itself does not provide for communication of such data between
|
||||
userspace and kernel, allowing the kernel side to remain simple and not
|
||||
impose a single interface on userspace. It does provide a separate
|
||||
helper though, described below.
|
||||
|
||||
klog, relay-app & librelay
|
||||
==========================
|
||||
|
||||
relayfs itself is ready to use, but to make things easier, two
|
||||
additional systems are provided. klog is a simple wrapper to make
|
||||
writing formatted text or raw data to a channel simpler, regardless of
|
||||
whether a channel to write into exists or not, or whether relayfs is
|
||||
compiled into the kernel or is configured as a module. relay-app is
|
||||
the kernel counterpart of userspace librelay.c, combined these two
|
||||
files provide glue to easily stream data to disk, without having to
|
||||
bother with housekeeping. klog and relay-app can be used together,
|
||||
with klog providing high-level logging functions to the kernel and
|
||||
relay-app taking care of kernel-user control and disk-logging chores.
|
||||
|
||||
It is possible to use relayfs without relay-app & librelay, but you'll
|
||||
have to implement communication between userspace and kernel, allowing
|
||||
both to convey the state of buffers (full, empty, amount of padding).
|
||||
|
||||
klog, relay-app and librelay can be found in the relay-apps tarball on
|
||||
http://relayfs.sourceforge.net
|
||||
|
||||
The relayfs user space API
|
||||
==========================
|
||||
|
||||
relayfs implements basic file operations for user space access to
|
||||
relayfs channel buffer data. Here are the file operations that are
|
||||
available and some comments regarding their behavior:
|
||||
|
||||
open() enables user to open an _existing_ buffer.
|
||||
|
||||
mmap() results in channel buffer being mapped into the caller's
|
||||
memory space. Note that you can't do a partial mmap - you must
|
||||
map the entire file, which is NRBUF * SUBBUFSIZE.
|
||||
|
||||
read() read the contents of a channel buffer. The bytes read are
|
||||
'consumed' by the reader i.e. they won't be available again
|
||||
to subsequent reads. If the channel is being used in
|
||||
no-overwrite mode (the default), it can be read at any time
|
||||
even if there's an active kernel writer. If the channel is
|
||||
being used in overwrite mode and there are active channel
|
||||
writers, results may be unpredictable - users should make
|
||||
sure that all logging to the channel has ended before using
|
||||
read() with overwrite mode.
|
||||
|
||||
poll() POLLIN/POLLRDNORM/POLLERR supported. User applications are
|
||||
notified when sub-buffer boundaries are crossed.
|
||||
|
||||
close() decrements the channel buffer's refcount. When the refcount
|
||||
reaches 0 i.e. when no process or kernel client has the buffer
|
||||
open, the channel buffer is freed.
|
||||
|
||||
|
||||
In order for a user application to make use of relayfs files, the
|
||||
relayfs filesystem must be mounted. For example,
|
||||
|
||||
mount -t relayfs relayfs /mnt/relay
|
||||
|
||||
NOTE: relayfs doesn't need to be mounted for kernel clients to create
|
||||
or use channels - it only needs to be mounted when user space
|
||||
applications need access to the buffer data.
|
||||
|
||||
|
||||
The relayfs kernel API
|
||||
======================
|
||||
|
||||
Here's a summary of the API relayfs provides to in-kernel clients:
|
||||
|
||||
|
||||
channel management functions:
|
||||
|
||||
relay_open(base_filename, parent, subbuf_size, n_subbufs,
|
||||
callbacks)
|
||||
relay_close(chan)
|
||||
relay_flush(chan)
|
||||
relay_reset(chan)
|
||||
relayfs_create_dir(name, parent)
|
||||
relayfs_remove_dir(dentry)
|
||||
|
||||
channel management typically called on instigation of userspace:
|
||||
|
||||
relay_subbufs_consumed(chan, cpu, subbufs_consumed)
|
||||
|
||||
write functions:
|
||||
|
||||
relay_write(chan, data, length)
|
||||
__relay_write(chan, data, length)
|
||||
relay_reserve(chan, length)
|
||||
|
||||
callbacks:
|
||||
|
||||
subbuf_start(buf, subbuf, prev_subbuf, prev_padding)
|
||||
buf_mapped(buf, filp)
|
||||
buf_unmapped(buf, filp)
|
||||
|
||||
helper functions:
|
||||
|
||||
relay_buf_full(buf)
|
||||
subbuf_start_reserve(buf, length)
|
||||
|
||||
|
||||
Creating a channel
|
||||
------------------
|
||||
|
||||
relay_open() is used to create a channel, along with its per-cpu
|
||||
channel buffers. Each channel buffer will have an associated file
|
||||
created for it in the relayfs filesystem, which can be opened and
|
||||
mmapped from user space if desired. The files are named
|
||||
basename0...basenameN-1 where N is the number of online cpus, and by
|
||||
default will be created in the root of the filesystem. If you want a
|
||||
directory structure to contain your relayfs files, you can create it
|
||||
with relayfs_create_dir() and pass the parent directory to
|
||||
relay_open(). Clients are responsible for cleaning up any directory
|
||||
structure they create when the channel is closed - use
|
||||
relayfs_remove_dir() for that.
|
||||
|
||||
The total size of each per-cpu buffer is calculated by multiplying the
|
||||
number of sub-buffers by the sub-buffer size passed into relay_open().
|
||||
The idea behind sub-buffers is that they're basically an extension of
|
||||
double-buffering to N buffers, and they also allow applications to
|
||||
easily implement random-access-on-buffer-boundary schemes, which can
|
||||
be important for some high-volume applications. The number and size
|
||||
of sub-buffers is completely dependent on the application and even for
|
||||
the same application, different conditions will warrant different
|
||||
values for these parameters at different times. Typically, the right
|
||||
values to use are best decided after some experimentation; in general,
|
||||
though, it's safe to assume that having only 1 sub-buffer is a bad
|
||||
idea - you're guaranteed to either overwrite data or lose events
|
||||
depending on the channel mode being used.
|
||||
|
||||
Channel 'modes'
|
||||
---------------
|
||||
|
||||
relayfs channels can be used in either of two modes - 'overwrite' or
|
||||
'no-overwrite'. The mode is entirely determined by the implementation
|
||||
of the subbuf_start() callback, as described below. In 'overwrite'
|
||||
mode, also known as 'flight recorder' mode, writes continuously cycle
|
||||
around the buffer and will never fail, but will unconditionally
|
||||
overwrite old data regardless of whether it's actually been consumed.
|
||||
In no-overwrite mode, writes will fail i.e. data will be lost, if the
|
||||
number of unconsumed sub-buffers equals the total number of
|
||||
sub-buffers in the channel. It should be clear that if there is no
|
||||
consumer or if the consumer can't consume sub-buffers fast enought,
|
||||
data will be lost in either case; the only difference is whether data
|
||||
is lost from the beginning or the end of a buffer.
|
||||
|
||||
As explained above, a relayfs channel is made of up one or more
|
||||
per-cpu channel buffers, each implemented as a circular buffer
|
||||
subdivided into one or more sub-buffers. Messages are written into
|
||||
the current sub-buffer of the channel's current per-cpu buffer via the
|
||||
write functions described below. Whenever a message can't fit into
|
||||
the current sub-buffer, because there's no room left for it, the
|
||||
client is notified via the subbuf_start() callback that a switch to a
|
||||
new sub-buffer is about to occur. The client uses this callback to 1)
|
||||
initialize the next sub-buffer if appropriate 2) finalize the previous
|
||||
sub-buffer if appropriate and 3) return a boolean value indicating
|
||||
whether or not to actually go ahead with the sub-buffer switch.
|
||||
|
||||
To implement 'no-overwrite' mode, the userspace client would provide
|
||||
an implementation of the subbuf_start() callback something like the
|
||||
following:
|
||||
|
||||
static int subbuf_start(struct rchan_buf *buf,
|
||||
void *subbuf,
|
||||
void *prev_subbuf,
|
||||
unsigned int prev_padding)
|
||||
{
|
||||
if (prev_subbuf)
|
||||
*((unsigned *)prev_subbuf) = prev_padding;
|
||||
|
||||
if (relay_buf_full(buf))
|
||||
return 0;
|
||||
|
||||
subbuf_start_reserve(buf, sizeof(unsigned int));
|
||||
|
||||
return 1;
|
||||
}
|
||||
|
||||
If the current buffer is full i.e. all sub-buffers remain unconsumed,
|
||||
the callback returns 0 to indicate that the buffer switch should not
|
||||
occur yet i.e. until the consumer has had a chance to read the current
|
||||
set of ready sub-buffers. For the relay_buf_full() function to make
|
||||
sense, the consumer is reponsible for notifying relayfs when
|
||||
sub-buffers have been consumed via relay_subbufs_consumed(). Any
|
||||
subsequent attempts to write into the buffer will again invoke the
|
||||
subbuf_start() callback with the same parameters; only when the
|
||||
consumer has consumed one or more of the ready sub-buffers will
|
||||
relay_buf_full() return 0, in which case the buffer switch can
|
||||
continue.
|
||||
|
||||
The implementation of the subbuf_start() callback for 'overwrite' mode
|
||||
would be very similar:
|
||||
|
||||
static int subbuf_start(struct rchan_buf *buf,
|
||||
void *subbuf,
|
||||
void *prev_subbuf,
|
||||
unsigned int prev_padding)
|
||||
{
|
||||
if (prev_subbuf)
|
||||
*((unsigned *)prev_subbuf) = prev_padding;
|
||||
|
||||
subbuf_start_reserve(buf, sizeof(unsigned int));
|
||||
|
||||
return 1;
|
||||
}
|
||||
|
||||
In this case, the relay_buf_full() check is meaningless and the
|
||||
callback always returns 1, causing the buffer switch to occur
|
||||
unconditionally. It's also meaningless for the client to use the
|
||||
relay_subbufs_consumed() function in this mode, as it's never
|
||||
consulted.
|
||||
|
||||
The default subbuf_start() implementation, used if the client doesn't
|
||||
define any callbacks, or doesn't define the subbuf_start() callback,
|
||||
implements the simplest possible 'no-overwrite' mode i.e. it does
|
||||
nothing but return 0.
|
||||
|
||||
Header information can be reserved at the beginning of each sub-buffer
|
||||
by calling the subbuf_start_reserve() helper function from within the
|
||||
subbuf_start() callback. This reserved area can be used to store
|
||||
whatever information the client wants. In the example above, room is
|
||||
reserved in each sub-buffer to store the padding count for that
|
||||
sub-buffer. This is filled in for the previous sub-buffer in the
|
||||
subbuf_start() implementation; the padding value for the previous
|
||||
sub-buffer is passed into the subbuf_start() callback along with a
|
||||
pointer to the previous sub-buffer, since the padding value isn't
|
||||
known until a sub-buffer is filled. The subbuf_start() callback is
|
||||
also called for the first sub-buffer when the channel is opened, to
|
||||
give the client a chance to reserve space in it. In this case the
|
||||
previous sub-buffer pointer passed into the callback will be NULL, so
|
||||
the client should check the value of the prev_subbuf pointer before
|
||||
writing into the previous sub-buffer.
|
||||
|
||||
Writing to a channel
|
||||
--------------------
|
||||
|
||||
kernel clients write data into the current cpu's channel buffer using
|
||||
relay_write() or __relay_write(). relay_write() is the main logging
|
||||
function - it uses local_irqsave() to protect the buffer and should be
|
||||
used if you might be logging from interrupt context. If you know
|
||||
you'll never be logging from interrupt context, you can use
|
||||
__relay_write(), which only disables preemption. These functions
|
||||
don't return a value, so you can't determine whether or not they
|
||||
failed - the assumption is that you wouldn't want to check a return
|
||||
value in the fast logging path anyway, and that they'll always succeed
|
||||
unless the buffer is full and no-overwrite mode is being used, in
|
||||
which case you can detect a failed write in the subbuf_start()
|
||||
callback by calling the relay_buf_full() helper function.
|
||||
|
||||
relay_reserve() is used to reserve a slot in a channel buffer which
|
||||
can be written to later. This would typically be used in applications
|
||||
that need to write directly into a channel buffer without having to
|
||||
stage data in a temporary buffer beforehand. Because the actual write
|
||||
may not happen immediately after the slot is reserved, applications
|
||||
using relay_reserve() can keep a count of the number of bytes actually
|
||||
written, either in space reserved in the sub-buffers themselves or as
|
||||
a separate array. See the 'reserve' example in the relay-apps tarball
|
||||
at http://relayfs.sourceforge.net for an example of how this can be
|
||||
done. Because the write is under control of the client and is
|
||||
separated from the reserve, relay_reserve() doesn't protect the buffer
|
||||
at all - it's up to the client to provide the appropriate
|
||||
synchronization when using relay_reserve().
|
||||
|
||||
Closing a channel
|
||||
-----------------
|
||||
|
||||
The client calls relay_close() when it's finished using the channel.
|
||||
The channel and its associated buffers are destroyed when there are no
|
||||
longer any references to any of the channel buffers. relay_flush()
|
||||
forces a sub-buffer switch on all the channel buffers, and can be used
|
||||
to finalize and process the last sub-buffers before the channel is
|
||||
closed.
|
||||
|
||||
Misc
|
||||
----
|
||||
|
||||
Some applications may want to keep a channel around and re-use it
|
||||
rather than open and close a new channel for each use. relay_reset()
|
||||
can be used for this purpose - it resets a channel to its initial
|
||||
state without reallocating channel buffer memory or destroying
|
||||
existing mappings. It should however only be called when it's safe to
|
||||
do so i.e. when the channel isn't currently being written to.
|
||||
|
||||
Finally, there are a couple of utility callbacks that can be used for
|
||||
different purposes. buf_mapped() is called whenever a channel buffer
|
||||
is mmapped from user space and buf_unmapped() is called when it's
|
||||
unmapped. The client can use this notification to trigger actions
|
||||
within the kernel application, such as enabling/disabling logging to
|
||||
the channel.
|
||||
|
||||
|
||||
Resources
|
||||
=========
|
||||
|
||||
For news, example code, mailing list, etc. see the relayfs homepage:
|
||||
|
||||
http://relayfs.sourceforge.net
|
||||
|
||||
|
||||
Credits
|
||||
=======
|
||||
|
||||
The ideas and specs for relayfs came about as a result of discussions
|
||||
on tracing involving the following:
|
||||
|
||||
Michel Dagenais <michel.dagenais@polymtl.ca>
|
||||
Richard Moore <richardj_moore@uk.ibm.com>
|
||||
Bob Wisniewski <bob@watson.ibm.com>
|
||||
Karim Yaghmour <karim@opersys.com>
|
||||
Tom Zanussi <zanussi@us.ibm.com>
|
||||
|
||||
Also thanks to Hubertus Franke for a lot of useful suggestions and bug
|
||||
reports.
|
|
@ -90,7 +90,7 @@ void device_remove_file(struct device *, struct device_attribute *);
|
|||
|
||||
It also defines this helper for defining device attributes:
|
||||
|
||||
#define DEVICE_ATTR(_name,_mode,_show,_store) \
|
||||
#define DEVICE_ATTR(_name, _mode, _show, _store) \
|
||||
struct device_attribute dev_attr_##_name = { \
|
||||
.attr = {.name = __stringify(_name) , .mode = _mode }, \
|
||||
.show = _show, \
|
||||
|
@ -99,14 +99,14 @@ struct device_attribute dev_attr_##_name = { \
|
|||
|
||||
For example, declaring
|
||||
|
||||
static DEVICE_ATTR(foo,0644,show_foo,store_foo);
|
||||
static DEVICE_ATTR(foo, S_IWUSR | S_IRUGO, show_foo, store_foo);
|
||||
|
||||
is equivalent to doing:
|
||||
|
||||
static struct device_attribute dev_attr_foo = {
|
||||
.attr = {
|
||||
.name = "foo",
|
||||
.mode = 0644,
|
||||
.mode = S_IWUSR | S_IRUGO,
|
||||
},
|
||||
.show = show_foo,
|
||||
.store = store_foo,
|
||||
|
@ -121,8 +121,8 @@ set of sysfs operations for forwarding read and write calls to the
|
|||
show and store methods of the attribute owners.
|
||||
|
||||
struct sysfs_ops {
|
||||
ssize_t (*show)(struct kobject *, struct attribute *,char *);
|
||||
ssize_t (*store)(struct kobject *,struct attribute *,const char *);
|
||||
ssize_t (*show)(struct kobject *, struct attribute *, char *);
|
||||
ssize_t (*store)(struct kobject *, struct attribute *, const char *);
|
||||
};
|
||||
|
||||
[ Subsystems should have already defined a struct kobj_type as a
|
||||
|
@ -137,7 +137,7 @@ calls the associated methods.
|
|||
|
||||
To illustrate:
|
||||
|
||||
#define to_dev_attr(_attr) container_of(_attr,struct device_attribute,attr)
|
||||
#define to_dev_attr(_attr) container_of(_attr, struct device_attribute, attr)
|
||||
#define to_dev(d) container_of(d, struct device, kobj)
|
||||
|
||||
static ssize_t
|
||||
|
@ -148,7 +148,7 @@ dev_attr_show(struct kobject * kobj, struct attribute * attr, char * buf)
|
|||
ssize_t ret = 0;
|
||||
|
||||
if (dev_attr->show)
|
||||
ret = dev_attr->show(dev,buf);
|
||||
ret = dev_attr->show(dev, buf);
|
||||
return ret;
|
||||
}
|
||||
|
||||
|
@ -216,16 +216,16 @@ A very simple (and naive) implementation of a device attribute is:
|
|||
|
||||
static ssize_t show_name(struct device *dev, struct device_attribute *attr, char *buf)
|
||||
{
|
||||
return sprintf(buf,"%s\n",dev->name);
|
||||
return snprintf(buf, PAGE_SIZE, "%s\n", dev->name);
|
||||
}
|
||||
|
||||
static ssize_t store_name(struct device * dev, const char * buf)
|
||||
{
|
||||
sscanf(buf,"%20s",dev->name);
|
||||
return strlen(buf);
|
||||
sscanf(buf, "%20s", dev->name);
|
||||
return strnlen(buf, PAGE_SIZE);
|
||||
}
|
||||
|
||||
static DEVICE_ATTR(name,S_IRUGO,show_name,store_name);
|
||||
static DEVICE_ATTR(name, S_IRUGO, show_name, store_name);
|
||||
|
||||
|
||||
(Note that the real implementation doesn't allow userspace to set the
|
||||
|
@ -290,7 +290,7 @@ struct device_attribute {
|
|||
|
||||
Declaring:
|
||||
|
||||
DEVICE_ATTR(_name,_str,_mode,_show,_store);
|
||||
DEVICE_ATTR(_name, _str, _mode, _show, _store);
|
||||
|
||||
Creation/Removal:
|
||||
|
||||
|
@ -310,7 +310,7 @@ struct bus_attribute {
|
|||
|
||||
Declaring:
|
||||
|
||||
BUS_ATTR(_name,_mode,_show,_store)
|
||||
BUS_ATTR(_name, _mode, _show, _store)
|
||||
|
||||
Creation/Removal:
|
||||
|
||||
|
@ -331,7 +331,7 @@ struct driver_attribute {
|
|||
|
||||
Declaring:
|
||||
|
||||
DRIVER_ATTR(_name,_mode,_show,_store)
|
||||
DRIVER_ATTR(_name, _mode, _show, _store)
|
||||
|
||||
Creation/Removal:
|
||||
|
||||
|
|
95
Documentation/filesystems/v9fs.txt
Normal file
95
Documentation/filesystems/v9fs.txt
Normal file
|
@ -0,0 +1,95 @@
|
|||
V9FS: 9P2000 for Linux
|
||||
======================
|
||||
|
||||
ABOUT
|
||||
=====
|
||||
|
||||
v9fs is a Unix implementation of the Plan 9 9p remote filesystem protocol.
|
||||
|
||||
This software was originally developed by Ron Minnich <rminnich@lanl.gov>
|
||||
and Maya Gokhale <maya@lanl.gov>. Additional development by Greg Watson
|
||||
<gwatson@lanl.gov> and most recently Eric Van Hensbergen
|
||||
<ericvh@gmail.com> and Latchesar Ionkov <lucho@ionkov.net>.
|
||||
|
||||
USAGE
|
||||
=====
|
||||
|
||||
For remote file server:
|
||||
|
||||
mount -t 9P 10.10.1.2 /mnt/9
|
||||
|
||||
For Plan 9 From User Space applications (http://swtch.com/plan9)
|
||||
|
||||
mount -t 9P `namespace`/acme /mnt/9 -o proto=unix,name=$USER
|
||||
|
||||
OPTIONS
|
||||
=======
|
||||
|
||||
proto=name select an alternative transport. Valid options are
|
||||
currently:
|
||||
unix - specifying a named pipe mount point
|
||||
tcp - specifying a normal TCP/IP connection
|
||||
fd - used passed file descriptors for connection
|
||||
(see rfdno and wfdno)
|
||||
|
||||
name=name user name to attempt mount as on the remote server. The
|
||||
server may override or ignore this value. Certain user
|
||||
names may require authentication.
|
||||
|
||||
aname=name aname specifies the file tree to access when the server is
|
||||
offering several exported file systems.
|
||||
|
||||
debug=n specifies debug level. The debug level is a bitmask.
|
||||
0x01 = display verbose error messages
|
||||
0x02 = developer debug (DEBUG_CURRENT)
|
||||
0x04 = display 9P trace
|
||||
0x08 = display VFS trace
|
||||
0x10 = display Marshalling debug
|
||||
0x20 = display RPC debug
|
||||
0x40 = display transport debug
|
||||
0x80 = display allocation debug
|
||||
|
||||
rfdno=n the file descriptor for reading with proto=fd
|
||||
|
||||
wfdno=n the file descriptor for writing with proto=fd
|
||||
|
||||
maxdata=n the number of bytes to use for 9P packet payload (msize)
|
||||
|
||||
port=n port to connect to on the remote server
|
||||
|
||||
timeout=n request timeouts (in ms) (default 60000ms)
|
||||
|
||||
noextend force legacy mode (no 9P2000.u semantics)
|
||||
|
||||
uid attempt to mount as a particular uid
|
||||
|
||||
gid attempt to mount with a particular gid
|
||||
|
||||
afid security channel - used by Plan 9 authentication protocols
|
||||
|
||||
nodevmap do not map special files - represent them as normal files.
|
||||
This can be used to share devices/named pipes/sockets between
|
||||
hosts. This functionality will be expanded in later versions.
|
||||
|
||||
RESOURCES
|
||||
=========
|
||||
|
||||
The Linux version of the 9P server, along with some client-side utilities
|
||||
can be found at http://v9fs.sf.net (along with a CVS repository of the
|
||||
development branch of this module). There are user and developer mailing
|
||||
lists here, as well as a bug-tracker.
|
||||
|
||||
For more information on the Plan 9 Operating System check out
|
||||
http://plan9.bell-labs.com/plan9
|
||||
|
||||
For information on Plan 9 from User Space (Plan 9 applications and libraries
|
||||
ported to Linux/BSD/OSX/etc) check out http://swtch.com/plan9
|
||||
|
||||
|
||||
STATUS
|
||||
======
|
||||
|
||||
The 2.6 kernel support is working on PPC and x86.
|
||||
|
||||
PLEASE USE THE SOURCEFORGE BUG-TRACKER TO REPORT PROBLEMS.
|
||||
|
|
@ -1,35 +1,27 @@
|
|||
/* -*- auto-fill -*- */
|
||||
|
||||
Overview of the Virtual File System
|
||||
Overview of the Linux Virtual File System
|
||||
|
||||
Richard Gooch <rgooch@atnf.csiro.au>
|
||||
Original author: Richard Gooch <rgooch@atnf.csiro.au>
|
||||
|
||||
5-JUL-1999
|
||||
Last updated on August 25, 2005
|
||||
|
||||
Copyright (C) 1999 Richard Gooch
|
||||
Copyright (C) 2005 Pekka Enberg
|
||||
|
||||
This file is released under the GPLv2.
|
||||
|
||||
|
||||
Conventions used in this document <section>
|
||||
=================================
|
||||
|
||||
Each section in this document will have the string "<section>" at the
|
||||
right-hand side of the section title. Each subsection will have
|
||||
"<subsection>" at the right-hand side. These strings are meant to make
|
||||
it easier to search through the document.
|
||||
|
||||
NOTE that the master copy of this document is available online at:
|
||||
http://www.atnf.csiro.au/~rgooch/linux/docs/vfs.txt
|
||||
|
||||
|
||||
What is it? <section>
|
||||
What is it?
|
||||
===========
|
||||
|
||||
The Virtual File System (otherwise known as the Virtual Filesystem
|
||||
Switch) is the software layer in the kernel that provides the
|
||||
filesystem interface to userspace programs. It also provides an
|
||||
abstraction within the kernel which allows different filesystem
|
||||
implementations to co-exist.
|
||||
implementations to coexist.
|
||||
|
||||
|
||||
A Quick Look At How It Works <section>
|
||||
A Quick Look At How It Works
|
||||
============================
|
||||
|
||||
In this section I'll briefly describe how things work, before
|
||||
|
@ -38,7 +30,8 @@ when user programs open and manipulate files, and then look from the
|
|||
other view which is how a filesystem is supported and subsequently
|
||||
mounted.
|
||||
|
||||
Opening a File <subsection>
|
||||
|
||||
Opening a File
|
||||
--------------
|
||||
|
||||
The VFS implements the open(2), stat(2), chmod(2) and similar system
|
||||
|
@ -77,7 +70,7 @@ back to userspace.
|
|||
|
||||
Opening a file requires another operation: allocation of a file
|
||||
structure (this is the kernel-side implementation of file
|
||||
descriptors). The freshly allocated file structure is initialised with
|
||||
descriptors). The freshly allocated file structure is initialized with
|
||||
a pointer to the dentry and a set of file operation member functions.
|
||||
These are taken from the inode data. The open() file method is then
|
||||
called so the specific filesystem implementation can do it's work. You
|
||||
|
@ -102,7 +95,8 @@ filesystem or driver code at the same time, on different
|
|||
processors. You should ensure that access to shared resources is
|
||||
protected by appropriate locks.
|
||||
|
||||
Registering and Mounting a Filesystem <subsection>
|
||||
|
||||
Registering and Mounting a Filesystem
|
||||
-------------------------------------
|
||||
|
||||
If you want to support a new kind of filesystem in the kernel, all you
|
||||
|
@ -123,17 +117,21 @@ updated to point to the root inode for the new filesystem.
|
|||
It's now time to look at things in more detail.
|
||||
|
||||
|
||||
struct file_system_type <section>
|
||||
struct file_system_type
|
||||
=======================
|
||||
|
||||
This describes the filesystem. As of kernel 2.1.99, the following
|
||||
This describes the filesystem. As of kernel 2.6.13, the following
|
||||
members are defined:
|
||||
|
||||
struct file_system_type {
|
||||
const char *name;
|
||||
int fs_flags;
|
||||
struct super_block *(*read_super) (struct super_block *, void *, int);
|
||||
struct file_system_type * next;
|
||||
struct super_block *(*get_sb) (struct file_system_type *, int,
|
||||
const char *, void *);
|
||||
void (*kill_sb) (struct super_block *);
|
||||
struct module *owner;
|
||||
struct file_system_type * next;
|
||||
struct list_head fs_supers;
|
||||
};
|
||||
|
||||
name: the name of the filesystem type, such as "ext2", "iso9660",
|
||||
|
@ -141,51 +139,97 @@ struct file_system_type {
|
|||
|
||||
fs_flags: various flags (i.e. FS_REQUIRES_DEV, FS_NO_DCACHE, etc.)
|
||||
|
||||
read_super: the method to call when a new instance of this
|
||||
get_sb: the method to call when a new instance of this
|
||||
filesystem should be mounted
|
||||
|
||||
next: for internal VFS use: you should initialise this to NULL
|
||||
kill_sb: the method to call when an instance of this filesystem
|
||||
should be unmounted
|
||||
|
||||
The read_super() method has the following arguments:
|
||||
owner: for internal VFS use: you should initialize this to THIS_MODULE in
|
||||
most cases.
|
||||
|
||||
next: for internal VFS use: you should initialize this to NULL
|
||||
|
||||
The get_sb() method has the following arguments:
|
||||
|
||||
struct super_block *sb: the superblock structure. This is partially
|
||||
initialised by the VFS and the rest must be initialised by the
|
||||
read_super() method
|
||||
initialized by the VFS and the rest must be initialized by the
|
||||
get_sb() method
|
||||
|
||||
int flags: mount flags
|
||||
|
||||
const char *dev_name: the device name we are mounting.
|
||||
|
||||
void *data: arbitrary mount options, usually comes as an ASCII
|
||||
string
|
||||
|
||||
int silent: whether or not to be silent on error
|
||||
|
||||
The read_super() method must determine if the block device specified
|
||||
The get_sb() method must determine if the block device specified
|
||||
in the superblock contains a filesystem of the type the method
|
||||
supports. On success the method returns the superblock pointer, on
|
||||
failure it returns NULL.
|
||||
|
||||
The most interesting member of the superblock structure that the
|
||||
read_super() method fills in is the "s_op" field. This is a pointer to
|
||||
get_sb() method fills in is the "s_op" field. This is a pointer to
|
||||
a "struct super_operations" which describes the next level of the
|
||||
filesystem implementation.
|
||||
|
||||
Usually, a filesystem uses generic one of the generic get_sb()
|
||||
implementations and provides a fill_super() method instead. The
|
||||
generic methods are:
|
||||
|
||||
struct super_operations <section>
|
||||
get_sb_bdev: mount a filesystem residing on a block device
|
||||
|
||||
get_sb_nodev: mount a filesystem that is not backed by a device
|
||||
|
||||
get_sb_single: mount a filesystem which shares the instance between
|
||||
all mounts
|
||||
|
||||
A fill_super() method implementation has the following arguments:
|
||||
|
||||
struct super_block *sb: the superblock structure. The method fill_super()
|
||||
must initialize this properly.
|
||||
|
||||
void *data: arbitrary mount options, usually comes as an ASCII
|
||||
string
|
||||
|
||||
int silent: whether or not to be silent on error
|
||||
|
||||
|
||||
struct super_operations
|
||||
=======================
|
||||
|
||||
This describes how the VFS can manipulate the superblock of your
|
||||
filesystem. As of kernel 2.1.99, the following members are defined:
|
||||
filesystem. As of kernel 2.6.13, the following members are defined:
|
||||
|
||||
struct super_operations {
|
||||
void (*read_inode) (struct inode *);
|
||||
int (*write_inode) (struct inode *, int);
|
||||
void (*put_inode) (struct inode *);
|
||||
void (*drop_inode) (struct inode *);
|
||||
void (*delete_inode) (struct inode *);
|
||||
int (*notify_change) (struct dentry *, struct iattr *);
|
||||
void (*put_super) (struct super_block *);
|
||||
void (*write_super) (struct super_block *);
|
||||
int (*statfs) (struct super_block *, struct statfs *, int);
|
||||
int (*remount_fs) (struct super_block *, int *, char *);
|
||||
void (*clear_inode) (struct inode *);
|
||||
struct inode *(*alloc_inode)(struct super_block *sb);
|
||||
void (*destroy_inode)(struct inode *);
|
||||
|
||||
void (*read_inode) (struct inode *);
|
||||
|
||||
void (*dirty_inode) (struct inode *);
|
||||
int (*write_inode) (struct inode *, int);
|
||||
void (*put_inode) (struct inode *);
|
||||
void (*drop_inode) (struct inode *);
|
||||
void (*delete_inode) (struct inode *);
|
||||
void (*put_super) (struct super_block *);
|
||||
void (*write_super) (struct super_block *);
|
||||
int (*sync_fs)(struct super_block *sb, int wait);
|
||||
void (*write_super_lockfs) (struct super_block *);
|
||||
void (*unlockfs) (struct super_block *);
|
||||
int (*statfs) (struct super_block *, struct kstatfs *);
|
||||
int (*remount_fs) (struct super_block *, int *, char *);
|
||||
void (*clear_inode) (struct inode *);
|
||||
void (*umount_begin) (struct super_block *);
|
||||
|
||||
void (*sync_inodes) (struct super_block *sb,
|
||||
struct writeback_control *wbc);
|
||||
int (*show_options)(struct seq_file *, struct vfsmount *);
|
||||
|
||||
ssize_t (*quota_read)(struct super_block *, int, char *, size_t, loff_t);
|
||||
ssize_t (*quota_write)(struct super_block *, int, const char *, size_t, loff_t);
|
||||
};
|
||||
|
||||
All methods are called without any locks being held, unless otherwise
|
||||
|
@ -193,43 +237,62 @@ noted. This means that most methods can block safely. All methods are
|
|||
only called from a process context (i.e. not from an interrupt handler
|
||||
or bottom half).
|
||||
|
||||
alloc_inode: this method is called by inode_alloc() to allocate memory
|
||||
for struct inode and initialize it.
|
||||
|
||||
destroy_inode: this method is called by destroy_inode() to release
|
||||
resources allocated for struct inode.
|
||||
|
||||
read_inode: this method is called to read a specific inode from the
|
||||
mounted filesystem. The "i_ino" member in the "struct inode"
|
||||
will be initialised by the VFS to indicate which inode to
|
||||
read. Other members are filled in by this method
|
||||
mounted filesystem. The i_ino member in the struct inode is
|
||||
initialized by the VFS to indicate which inode to read. Other
|
||||
members are filled in by this method.
|
||||
|
||||
You can set this to NULL and use iget5_locked() instead of iget()
|
||||
to read inodes. This is necessary for filesystems for which the
|
||||
inode number is not sufficient to identify an inode.
|
||||
|
||||
dirty_inode: this method is called by the VFS to mark an inode dirty.
|
||||
|
||||
write_inode: this method is called when the VFS needs to write an
|
||||
inode to disc. The second parameter indicates whether the write
|
||||
should be synchronous or not, not all filesystems check this flag.
|
||||
|
||||
put_inode: called when the VFS inode is removed from the inode
|
||||
cache. This method is optional
|
||||
cache.
|
||||
|
||||
drop_inode: called when the last access to the inode is dropped,
|
||||
with the inode_lock spinlock held.
|
||||
|
||||
This method should be either NULL (normal unix filesystem
|
||||
This method should be either NULL (normal UNIX filesystem
|
||||
semantics) or "generic_delete_inode" (for filesystems that do not
|
||||
want to cache inodes - causing "delete_inode" to always be
|
||||
called regardless of the value of i_nlink)
|
||||
|
||||
The "generic_delete_inode()" behaviour is equivalent to the
|
||||
The "generic_delete_inode()" behavior is equivalent to the
|
||||
old practice of using "force_delete" in the put_inode() case,
|
||||
but does not have the races that the "force_delete()" approach
|
||||
had.
|
||||
|
||||
delete_inode: called when the VFS wants to delete an inode
|
||||
|
||||
notify_change: called when VFS inode attributes are changed. If this
|
||||
is NULL the VFS falls back to the write_inode() method. This
|
||||
is called with the kernel lock held
|
||||
|
||||
put_super: called when the VFS wishes to free the superblock
|
||||
(i.e. unmount). This is called with the superblock lock held
|
||||
|
||||
write_super: called when the VFS superblock needs to be written to
|
||||
disc. This method is optional
|
||||
|
||||
sync_fs: called when VFS is writing out all dirty data associated with
|
||||
a superblock. The second parameter indicates whether the method
|
||||
should wait until the write out has been completed. Optional.
|
||||
|
||||
write_super_lockfs: called when VFS is locking a filesystem and forcing
|
||||
it into a consistent state. This function is currently used by the
|
||||
Logical Volume Manager (LVM).
|
||||
|
||||
unlockfs: called when VFS is unlocking a filesystem and making it writable
|
||||
again.
|
||||
|
||||
statfs: called when the VFS needs to get filesystem statistics. This
|
||||
is called with the kernel lock held
|
||||
|
||||
|
@ -238,21 +301,31 @@ or bottom half).
|
|||
|
||||
clear_inode: called then the VFS clears the inode. Optional
|
||||
|
||||
umount_begin: called when the VFS is unmounting a filesystem.
|
||||
|
||||
sync_inodes: called when the VFS is writing out dirty data associated with
|
||||
a superblock.
|
||||
|
||||
show_options: called by the VFS to show mount options for /proc/<pid>/mounts.
|
||||
|
||||
quota_read: called by the VFS to read from filesystem quota file.
|
||||
|
||||
quota_write: called by the VFS to write to filesystem quota file.
|
||||
|
||||
The read_inode() method is responsible for filling in the "i_op"
|
||||
field. This is a pointer to a "struct inode_operations" which
|
||||
describes the methods that can be performed on individual inodes.
|
||||
|
||||
|
||||
struct inode_operations <section>
|
||||
struct inode_operations
|
||||
=======================
|
||||
|
||||
This describes how the VFS can manipulate an inode in your
|
||||
filesystem. As of kernel 2.1.99, the following members are defined:
|
||||
filesystem. As of kernel 2.6.13, the following members are defined:
|
||||
|
||||
struct inode_operations {
|
||||
struct file_operations * default_file_ops;
|
||||
int (*create) (struct inode *,struct dentry *,int);
|
||||
int (*lookup) (struct inode *,struct dentry *);
|
||||
int (*create) (struct inode *,struct dentry *,int, struct nameidata *);
|
||||
struct dentry * (*lookup) (struct inode *,struct dentry *, struct nameidata *);
|
||||
int (*link) (struct dentry *,struct inode *,struct dentry *);
|
||||
int (*unlink) (struct inode *,struct dentry *);
|
||||
int (*symlink) (struct inode *,struct dentry *,const char *);
|
||||
|
@ -261,25 +334,22 @@ struct inode_operations {
|
|||
int (*mknod) (struct inode *,struct dentry *,int,dev_t);
|
||||
int (*rename) (struct inode *, struct dentry *,
|
||||
struct inode *, struct dentry *);
|
||||
int (*readlink) (struct dentry *, char *,int);
|
||||
struct dentry * (*follow_link) (struct dentry *, struct dentry *);
|
||||
int (*readpage) (struct file *, struct page *);
|
||||
int (*writepage) (struct page *page, struct writeback_control *wbc);
|
||||
int (*bmap) (struct inode *,int);
|
||||
int (*readlink) (struct dentry *, char __user *,int);
|
||||
void * (*follow_link) (struct dentry *, struct nameidata *);
|
||||
void (*put_link) (struct dentry *, struct nameidata *, void *);
|
||||
void (*truncate) (struct inode *);
|
||||
int (*permission) (struct inode *, int);
|
||||
int (*smap) (struct inode *,int);
|
||||
int (*updatepage) (struct file *, struct page *, const char *,
|
||||
unsigned long, unsigned int, int);
|
||||
int (*revalidate) (struct dentry *);
|
||||
int (*permission) (struct inode *, int, struct nameidata *);
|
||||
int (*setattr) (struct dentry *, struct iattr *);
|
||||
int (*getattr) (struct vfsmount *mnt, struct dentry *, struct kstat *);
|
||||
int (*setxattr) (struct dentry *, const char *,const void *,size_t,int);
|
||||
ssize_t (*getxattr) (struct dentry *, const char *, void *, size_t);
|
||||
ssize_t (*listxattr) (struct dentry *, char *, size_t);
|
||||
int (*removexattr) (struct dentry *, const char *);
|
||||
};
|
||||
|
||||
Again, all methods are called without any locks being held, unless
|
||||
otherwise noted.
|
||||
|
||||
default_file_ops: this is a pointer to a "struct file_operations"
|
||||
which describes how to open and then manipulate open files
|
||||
|
||||
create: called by the open(2) and creat(2) system calls. Only
|
||||
required if you want to support regular files. The dentry you
|
||||
get should not have an inode (i.e. it should be a negative
|
||||
|
@ -328,31 +398,143 @@ otherwise noted.
|
|||
you want to support reading symbolic links
|
||||
|
||||
follow_link: called by the VFS to follow a symbolic link to the
|
||||
inode it points to. Only required if you want to support
|
||||
symbolic links
|
||||
inode it points to. Only required if you want to support
|
||||
symbolic links. This function returns a void pointer cookie
|
||||
that is passed to put_link().
|
||||
|
||||
put_link: called by the VFS to release resources allocated by
|
||||
follow_link(). The cookie returned by follow_link() is passed to
|
||||
to this function as the last parameter. It is used by filesystems
|
||||
such as NFS where page cache is not stable (i.e. page that was
|
||||
installed when the symbolic link walk started might not be in the
|
||||
page cache at the end of the walk).
|
||||
|
||||
truncate: called by the VFS to change the size of a file. The i_size
|
||||
field of the inode is set to the desired size by the VFS before
|
||||
this function is called. This function is called by the truncate(2)
|
||||
system call and related functionality.
|
||||
|
||||
permission: called by the VFS to check for access rights on a POSIX-like
|
||||
filesystem.
|
||||
|
||||
setattr: called by the VFS to set attributes for a file. This function is
|
||||
called by chmod(2) and related system calls.
|
||||
|
||||
getattr: called by the VFS to get attributes of a file. This function is
|
||||
called by stat(2) and related system calls.
|
||||
|
||||
setxattr: called by the VFS to set an extended attribute for a file.
|
||||
Extended attribute is a name:value pair associated with an inode. This
|
||||
function is called by setxattr(2) system call.
|
||||
|
||||
getxattr: called by the VFS to retrieve the value of an extended attribute
|
||||
name. This function is called by getxattr(2) function call.
|
||||
|
||||
listxattr: called by the VFS to list all extended attributes for a given
|
||||
file. This function is called by listxattr(2) system call.
|
||||
|
||||
removexattr: called by the VFS to remove an extended attribute from a file.
|
||||
This function is called by removexattr(2) system call.
|
||||
|
||||
|
||||
struct file_operations <section>
|
||||
struct address_space_operations
|
||||
===============================
|
||||
|
||||
This describes how the VFS can manipulate mapping of a file to page cache in
|
||||
your filesystem. As of kernel 2.6.13, the following members are defined:
|
||||
|
||||
struct address_space_operations {
|
||||
int (*writepage)(struct page *page, struct writeback_control *wbc);
|
||||
int (*readpage)(struct file *, struct page *);
|
||||
int (*sync_page)(struct page *);
|
||||
int (*writepages)(struct address_space *, struct writeback_control *);
|
||||
int (*set_page_dirty)(struct page *page);
|
||||
int (*readpages)(struct file *filp, struct address_space *mapping,
|
||||
struct list_head *pages, unsigned nr_pages);
|
||||
int (*prepare_write)(struct file *, struct page *, unsigned, unsigned);
|
||||
int (*commit_write)(struct file *, struct page *, unsigned, unsigned);
|
||||
sector_t (*bmap)(struct address_space *, sector_t);
|
||||
int (*invalidatepage) (struct page *, unsigned long);
|
||||
int (*releasepage) (struct page *, int);
|
||||
ssize_t (*direct_IO)(int, struct kiocb *, const struct iovec *iov,
|
||||
loff_t offset, unsigned long nr_segs);
|
||||
struct page* (*get_xip_page)(struct address_space *, sector_t,
|
||||
int);
|
||||
};
|
||||
|
||||
writepage: called by the VM write a dirty page to backing store.
|
||||
|
||||
readpage: called by the VM to read a page from backing store.
|
||||
|
||||
sync_page: called by the VM to notify the backing store to perform all
|
||||
queued I/O operations for a page. I/O operations for other pages
|
||||
associated with this address_space object may also be performed.
|
||||
|
||||
writepages: called by the VM to write out pages associated with the
|
||||
address_space object.
|
||||
|
||||
set_page_dirty: called by the VM to set a page dirty.
|
||||
|
||||
readpages: called by the VM to read pages associated with the address_space
|
||||
object.
|
||||
|
||||
prepare_write: called by the generic write path in VM to set up a write
|
||||
request for a page.
|
||||
|
||||
commit_write: called by the generic write path in VM to write page to
|
||||
its backing store.
|
||||
|
||||
bmap: called by the VFS to map a logical block offset within object to
|
||||
physical block number. This method is use by for the legacy FIBMAP
|
||||
ioctl. Other uses are discouraged.
|
||||
|
||||
invalidatepage: called by the VM on truncate to disassociate a page from its
|
||||
address_space mapping.
|
||||
|
||||
releasepage: called by the VFS to release filesystem specific metadata from
|
||||
a page.
|
||||
|
||||
direct_IO: called by the VM for direct I/O writes and reads.
|
||||
|
||||
get_xip_page: called by the VM to translate a block number to a page.
|
||||
The page is valid until the corresponding filesystem is unmounted.
|
||||
Filesystems that want to use execute-in-place (XIP) need to implement
|
||||
it. An example implementation can be found in fs/ext2/xip.c.
|
||||
|
||||
|
||||
struct file_operations
|
||||
======================
|
||||
|
||||
This describes how the VFS can manipulate an open file. As of kernel
|
||||
2.1.99, the following members are defined:
|
||||
2.6.13, the following members are defined:
|
||||
|
||||
struct file_operations {
|
||||
loff_t (*llseek) (struct file *, loff_t, int);
|
||||
ssize_t (*read) (struct file *, char *, size_t, loff_t *);
|
||||
ssize_t (*write) (struct file *, const char *, size_t, loff_t *);
|
||||
ssize_t (*read) (struct file *, char __user *, size_t, loff_t *);
|
||||
ssize_t (*aio_read) (struct kiocb *, char __user *, size_t, loff_t);
|
||||
ssize_t (*write) (struct file *, const char __user *, size_t, loff_t *);
|
||||
ssize_t (*aio_write) (struct kiocb *, const char __user *, size_t, loff_t);
|
||||
int (*readdir) (struct file *, void *, filldir_t);
|
||||
unsigned int (*poll) (struct file *, struct poll_table_struct *);
|
||||
int (*ioctl) (struct inode *, struct file *, unsigned int, unsigned long);
|
||||
long (*unlocked_ioctl) (struct file *, unsigned int, unsigned long);
|
||||
long (*compat_ioctl) (struct file *, unsigned int, unsigned long);
|
||||
int (*mmap) (struct file *, struct vm_area_struct *);
|
||||
int (*open) (struct inode *, struct file *);
|
||||
int (*flush) (struct file *);
|
||||
int (*release) (struct inode *, struct file *);
|
||||
int (*fsync) (struct file *, struct dentry *);
|
||||
int (*fasync) (struct file *, int);
|
||||
int (*check_media_change) (kdev_t dev);
|
||||
int (*revalidate) (kdev_t dev);
|
||||
int (*fsync) (struct file *, struct dentry *, int datasync);
|
||||
int (*aio_fsync) (struct kiocb *, int datasync);
|
||||
int (*fasync) (int, struct file *, int);
|
||||
int (*lock) (struct file *, int, struct file_lock *);
|
||||
ssize_t (*readv) (struct file *, const struct iovec *, unsigned long, loff_t *);
|
||||
ssize_t (*writev) (struct file *, const struct iovec *, unsigned long, loff_t *);
|
||||
ssize_t (*sendfile) (struct file *, loff_t *, size_t, read_actor_t, void *);
|
||||
ssize_t (*sendpage) (struct file *, struct page *, int, size_t, loff_t *, int);
|
||||
unsigned long (*get_unmapped_area)(struct file *, unsigned long, unsigned long, unsigned long, unsigned long);
|
||||
int (*check_flags)(int);
|
||||
int (*dir_notify)(struct file *filp, unsigned long arg);
|
||||
int (*flock) (struct file *, int, struct file_lock *);
|
||||
};
|
||||
|
||||
Again, all methods are called without any locks being held, unless
|
||||
|
@ -362,8 +544,12 @@ otherwise noted.
|
|||
|
||||
read: called by read(2) and related system calls
|
||||
|
||||
aio_read: called by io_submit(2) and other asynchronous I/O operations
|
||||
|
||||
write: called by write(2) and related system calls
|
||||
|
||||
aio_write: called by io_submit(2) and other asynchronous I/O operations
|
||||
|
||||
readdir: called when the VFS needs to read the directory contents
|
||||
|
||||
poll: called by the VFS when a process wants to check if there is
|
||||
|
@ -372,18 +558,25 @@ otherwise noted.
|
|||
|
||||
ioctl: called by the ioctl(2) system call
|
||||
|
||||
unlocked_ioctl: called by the ioctl(2) system call. Filesystems that do not
|
||||
require the BKL should use this method instead of the ioctl() above.
|
||||
|
||||
compat_ioctl: called by the ioctl(2) system call when 32 bit system calls
|
||||
are used on 64 bit kernels.
|
||||
|
||||
mmap: called by the mmap(2) system call
|
||||
|
||||
open: called by the VFS when an inode should be opened. When the VFS
|
||||
opens a file, it creates a new "struct file" and initialises
|
||||
the "f_op" file operations member with the "default_file_ops"
|
||||
field in the inode structure. It then calls the open method
|
||||
for the newly allocated file structure. You might think that
|
||||
the open method really belongs in "struct inode_operations",
|
||||
and you may be right. I think it's done the way it is because
|
||||
it makes filesystems simpler to implement. The open() method
|
||||
is a good place to initialise the "private_data" member in the
|
||||
file structure if you want to point to a device structure
|
||||
opens a file, it creates a new "struct file". It then calls the
|
||||
open method for the newly allocated file structure. You might
|
||||
think that the open method really belongs in
|
||||
"struct inode_operations", and you may be right. I think it's
|
||||
done the way it is because it makes filesystems simpler to
|
||||
implement. The open() method is a good place to initialize the
|
||||
"private_data" member in the file structure if you want to point
|
||||
to a device structure
|
||||
|
||||
flush: called by the close(2) system call to flush a file
|
||||
|
||||
release: called when the last reference to an open file is closed
|
||||
|
||||
|
@ -392,6 +585,23 @@ otherwise noted.
|
|||
fasync: called by the fcntl(2) system call when asynchronous
|
||||
(non-blocking) mode is enabled for a file
|
||||
|
||||
lock: called by the fcntl(2) system call for F_GETLK, F_SETLK, and F_SETLKW
|
||||
commands
|
||||
|
||||
readv: called by the readv(2) system call
|
||||
|
||||
writev: called by the writev(2) system call
|
||||
|
||||
sendfile: called by the sendfile(2) system call
|
||||
|
||||
get_unmapped_area: called by the mmap(2) system call
|
||||
|
||||
check_flags: called by the fcntl(2) system call for F_SETFL command
|
||||
|
||||
dir_notify: called by the fcntl(2) system call for F_NOTIFY command
|
||||
|
||||
flock: called by the flock(2) system call
|
||||
|
||||
Note that the file operations are implemented by the specific
|
||||
filesystem in which the inode resides. When opening a device node
|
||||
(character or block special) most filesystems will call special
|
||||
|
@ -400,29 +610,28 @@ driver information. These support routines replace the filesystem file
|
|||
operations with those for the device driver, and then proceed to call
|
||||
the new open() method for the file. This is how opening a device file
|
||||
in the filesystem eventually ends up calling the device driver open()
|
||||
method. Note the devfs (the Device FileSystem) has a more direct path
|
||||
from device node to device driver (this is an unofficial kernel
|
||||
patch).
|
||||
method.
|
||||
|
||||
|
||||
Directory Entry Cache (dcache) <section>
|
||||
------------------------------
|
||||
Directory Entry Cache (dcache)
|
||||
==============================
|
||||
|
||||
|
||||
struct dentry_operations
|
||||
========================
|
||||
------------------------
|
||||
|
||||
This describes how a filesystem can overload the standard dentry
|
||||
operations. Dentries and the dcache are the domain of the VFS and the
|
||||
individual filesystem implementations. Device drivers have no business
|
||||
here. These methods may be set to NULL, as they are either optional or
|
||||
the VFS uses a default. As of kernel 2.1.99, the following members are
|
||||
the VFS uses a default. As of kernel 2.6.13, the following members are
|
||||
defined:
|
||||
|
||||
struct dentry_operations {
|
||||
int (*d_revalidate)(struct dentry *);
|
||||
int (*d_revalidate)(struct dentry *, struct nameidata *);
|
||||
int (*d_hash) (struct dentry *, struct qstr *);
|
||||
int (*d_compare) (struct dentry *, struct qstr *, struct qstr *);
|
||||
void (*d_delete)(struct dentry *);
|
||||
int (*d_delete)(struct dentry *);
|
||||
void (*d_release)(struct dentry *);
|
||||
void (*d_iput)(struct dentry *, struct inode *);
|
||||
};
|
||||
|
@ -451,6 +660,7 @@ Each dentry has a pointer to its parent dentry, as well as a hash list
|
|||
of child dentries. Child dentries are basically like files in a
|
||||
directory.
|
||||
|
||||
|
||||
Directory Entry Cache APIs
|
||||
--------------------------
|
||||
|
||||
|
@ -471,7 +681,7 @@ manipulate dentries:
|
|||
"d_delete" method is called
|
||||
|
||||
d_drop: this unhashes a dentry from its parents hash list. A
|
||||
subsequent call to dput() will dellocate the dentry if its
|
||||
subsequent call to dput() will deallocate the dentry if its
|
||||
usage count drops to 0
|
||||
|
||||
d_delete: delete a dentry. If there are no other open references to
|
||||
|
@ -507,16 +717,16 @@ up by walking the tree starting with the first component
|
|||
of the pathname and using that dentry along with the next
|
||||
component to look up the next level and so on. Since it
|
||||
is a frequent operation for workloads like multiuser
|
||||
environments and webservers, it is important to optimize
|
||||
environments and web servers, it is important to optimize
|
||||
this path.
|
||||
|
||||
Prior to 2.5.10, dcache_lock was acquired in d_lookup and thus
|
||||
in every component during path look-up. Since 2.5.10 onwards,
|
||||
fastwalk algorithm changed this by holding the dcache_lock
|
||||
fast-walk algorithm changed this by holding the dcache_lock
|
||||
at the beginning and walking as many cached path component
|
||||
dentries as possible. This signficantly decreases the number
|
||||
dentries as possible. This significantly decreases the number
|
||||
of acquisition of dcache_lock. However it also increases the
|
||||
lock hold time signficantly and affects performance in large
|
||||
lock hold time significantly and affects performance in large
|
||||
SMP machines. Since 2.5.62 kernel, dcache has been using
|
||||
a new locking model that uses RCU to make dcache look-up
|
||||
lock-free.
|
||||
|
@ -527,7 +737,7 @@ protected the hash chain, d_child, d_alias, d_lru lists as well
|
|||
as d_inode and several other things like mount look-up. RCU-based
|
||||
changes affect only the way the hash chain is protected. For everything
|
||||
else the dcache_lock must be taken for both traversing as well as
|
||||
updating. The hash chain updations too take the dcache_lock.
|
||||
updating. The hash chain updates too take the dcache_lock.
|
||||
The significant change is the way d_lookup traverses the hash chain,
|
||||
it doesn't acquire the dcache_lock for this and rely on RCU to
|
||||
ensure that the dentry has not been *freed*.
|
||||
|
@ -535,14 +745,15 @@ ensure that the dentry has not been *freed*.
|
|||
|
||||
Dcache locking details
|
||||
----------------------
|
||||
|
||||
For many multi-user workloads, open() and stat() on files are
|
||||
very frequently occurring operations. Both involve walking
|
||||
of path names to find the dentry corresponding to the
|
||||
concerned file. In 2.4 kernel, dcache_lock was held
|
||||
during look-up of each path component. Contention and
|
||||
cacheline bouncing of this global lock caused significant
|
||||
cache-line bouncing of this global lock caused significant
|
||||
scalability problems. With the introduction of RCU
|
||||
in linux kernel, this was worked around by making
|
||||
in Linux kernel, this was worked around by making
|
||||
the look-up of path components during path walking lock-free.
|
||||
|
||||
|
||||
|
@ -562,7 +773,7 @@ Some of the important changes are :
|
|||
2. Insertion of a dentry into the hash table is done using
|
||||
hlist_add_head_rcu() which take care of ordering the writes -
|
||||
the writes to the dentry must be visible before the dentry
|
||||
is inserted. This works in conjuction with hlist_for_each_rcu()
|
||||
is inserted. This works in conjunction with hlist_for_each_rcu()
|
||||
while walking the hash chain. The only requirement is that
|
||||
all initialization to the dentry must be done before hlist_add_head_rcu()
|
||||
since we don't have dcache_lock protection while traversing
|
||||
|
@ -584,7 +795,7 @@ Some of the important changes are :
|
|||
the same. In some sense, dcache_rcu path walking looks like
|
||||
the pre-2.5.10 version.
|
||||
|
||||
5. All dentry hash chain updations must take the dcache_lock as well as
|
||||
5. All dentry hash chain updates must take the dcache_lock as well as
|
||||
the per-dentry lock in that order. dput() does this to ensure
|
||||
that a dentry that has just been looked up in another CPU
|
||||
doesn't get deleted before dget() can be done on it.
|
||||
|
@ -640,10 +851,10 @@ handled as described below :
|
|||
Since we redo the d_parent check and compare name while holding
|
||||
d_lock, lock-free look-up will not race against d_move().
|
||||
|
||||
4. There can be a theoritical race when a dentry keeps coming back
|
||||
4. There can be a theoretical race when a dentry keeps coming back
|
||||
to original bucket due to double moves. Due to this look-up may
|
||||
consider that it has never moved and can end up in a infinite loop.
|
||||
But this is not any worse that theoritical livelocks we already
|
||||
But this is not any worse that theoretical livelocks we already
|
||||
have in the kernel.
|
||||
|
||||
|
||||
|
|
|
@ -19,15 +19,43 @@ Mount Options
|
|||
|
||||
When mounting an XFS filesystem, the following options are accepted.
|
||||
|
||||
biosize=size
|
||||
Sets the preferred buffered I/O size (default size is 64K).
|
||||
"size" must be expressed as the logarithm (base2) of the
|
||||
desired I/O size.
|
||||
Valid values for this option are 14 through 16, inclusive
|
||||
(i.e. 16K, 32K, and 64K bytes). On machines with a 4K
|
||||
pagesize, 13 (8K bytes) is also a valid size.
|
||||
The preferred buffered I/O size can also be altered on an
|
||||
individual file basis using the ioctl(2) system call.
|
||||
allocsize=size
|
||||
Sets the buffered I/O end-of-file preallocation size when
|
||||
doing delayed allocation writeout (default size is 64KiB).
|
||||
Valid values for this option are page size (typically 4KiB)
|
||||
through to 1GiB, inclusive, in power-of-2 increments.
|
||||
|
||||
attr2/noattr2
|
||||
The options enable/disable (default is disabled for backward
|
||||
compatibility on-disk) an "opportunistic" improvement to be
|
||||
made in the way inline extended attributes are stored on-disk.
|
||||
When the new form is used for the first time (by setting or
|
||||
removing extended attributes) the on-disk superblock feature
|
||||
bit field will be updated to reflect this format being in use.
|
||||
|
||||
barrier
|
||||
Enables the use of block layer write barriers for writes into
|
||||
the journal and unwritten extent conversion. This allows for
|
||||
drive level write caching to be enabled, for devices that
|
||||
support write barriers.
|
||||
|
||||
dmapi
|
||||
Enable the DMAPI (Data Management API) event callouts.
|
||||
Use with the "mtpt" option.
|
||||
|
||||
grpid/bsdgroups and nogrpid/sysvgroups
|
||||
These options define what group ID a newly created file gets.
|
||||
When grpid is set, it takes the group ID of the directory in
|
||||
which it is created; otherwise (the default) it takes the fsgid
|
||||
of the current process, unless the directory has the setgid bit
|
||||
set, in which case it takes the gid from the parent directory,
|
||||
and also gets the setgid bit set if it is a directory itself.
|
||||
|
||||
ihashsize=value
|
||||
Sets the number of hash buckets available for hashing the
|
||||
in-memory inodes of the specified mount point. If a value
|
||||
of zero is used, the value selected by the default algorithm
|
||||
will be displayed in /proc/mounts.
|
||||
|
||||
ikeep/noikeep
|
||||
When inode clusters are emptied of inodes, keep them around
|
||||
|
@ -35,12 +63,31 @@ When mounting an XFS filesystem, the following options are accepted.
|
|||
and is still the default for now. Using the noikeep option,
|
||||
inode clusters are returned to the free space pool.
|
||||
|
||||
inode64
|
||||
Indicates that XFS is allowed to create inodes at any location
|
||||
in the filesystem, including those which will result in inode
|
||||
numbers occupying more than 32 bits of significance. This is
|
||||
provided for backwards compatibility, but causes problems for
|
||||
backup applications that cannot handle large inode numbers.
|
||||
|
||||
largeio/nolargeio
|
||||
If "nolargeio" is specified, the optimal I/O reported in
|
||||
st_blksize by stat(2) will be as small as possible to allow user
|
||||
applications to avoid inefficient read/modify/write I/O.
|
||||
If "largeio" specified, a filesystem that has a "swidth" specified
|
||||
will return the "swidth" value (in bytes) in st_blksize. If the
|
||||
filesystem does not have a "swidth" specified but does specify
|
||||
an "allocsize" then "allocsize" (in bytes) will be returned
|
||||
instead.
|
||||
If neither of these two options are specified, then filesystem
|
||||
will behave as if "nolargeio" was specified.
|
||||
|
||||
logbufs=value
|
||||
Set the number of in-memory log buffers. Valid numbers range
|
||||
from 2-8 inclusive.
|
||||
The default value is 8 buffers for filesystems with a
|
||||
blocksize of 64K, 4 buffers for filesystems with a blocksize
|
||||
of 32K, 3 buffers for filesystems with a blocksize of 16K
|
||||
blocksize of 64KiB, 4 buffers for filesystems with a blocksize
|
||||
of 32KiB, 3 buffers for filesystems with a blocksize of 16KiB
|
||||
and 2 buffers for all other configurations. Increasing the
|
||||
number of buffers may increase performance on some workloads
|
||||
at the cost of the memory used for the additional log buffers
|
||||
|
@ -49,10 +96,10 @@ When mounting an XFS filesystem, the following options are accepted.
|
|||
logbsize=value
|
||||
Set the size of each in-memory log buffer.
|
||||
Size may be specified in bytes, or in kilobytes with a "k" suffix.
|
||||
Valid sizes for version 1 and version 2 logs are 16384 (16k) and
|
||||
32768 (32k). Valid sizes for version 2 logs also include
|
||||
Valid sizes for version 1 and version 2 logs are 16384 (16k) and
|
||||
32768 (32k). Valid sizes for version 2 logs also include
|
||||
65536 (64k), 131072 (128k) and 262144 (256k).
|
||||
The default value for machines with more than 32MB of memory
|
||||
The default value for machines with more than 32MiB of memory
|
||||
is 32768, machines with less memory use 16384 by default.
|
||||
|
||||
logdev=device and rtdev=device
|
||||
|
@ -62,6 +109,11 @@ When mounting an XFS filesystem, the following options are accepted.
|
|||
optional, and the log section can be separate from the data
|
||||
section or contained within it.
|
||||
|
||||
mtpt=mountpoint
|
||||
Use with the "dmapi" option. The value specified here will be
|
||||
included in the DMAPI mount event, and should be the path of
|
||||
the actual mountpoint that is used.
|
||||
|
||||
noalign
|
||||
Data allocations will not be aligned at stripe unit boundaries.
|
||||
|
||||
|
@ -91,13 +143,17 @@ When mounting an XFS filesystem, the following options are accepted.
|
|||
O_SYNC writes can be lost if the system crashes.
|
||||
If timestamp updates are critical, use the osyncisosync option.
|
||||
|
||||
quota/usrquota/uqnoenforce
|
||||
uquota/usrquota/uqnoenforce/quota
|
||||
User disk quota accounting enabled, and limits (optionally)
|
||||
enforced.
|
||||
enforced. Refer to xfs_quota(8) for further details.
|
||||
|
||||
grpquota/gqnoenforce
|
||||
gquota/grpquota/gqnoenforce
|
||||
Group disk quota accounting enabled and limits (optionally)
|
||||
enforced.
|
||||
enforced. Refer to xfs_quota(8) for further details.
|
||||
|
||||
pquota/prjquota/pqnoenforce
|
||||
Project disk quota accounting enabled and limits (optionally)
|
||||
enforced. Refer to xfs_quota(8) for further details.
|
||||
|
||||
sunit=value and swidth=value
|
||||
Used to specify the stripe unit and width for a RAID device or
|
||||
|
@ -113,15 +169,21 @@ When mounting an XFS filesystem, the following options are accepted.
|
|||
The "swidth" option is required if the "sunit" option has been
|
||||
specified, and must be a multiple of the "sunit" value.
|
||||
|
||||
swalloc
|
||||
Data allocations will be rounded up to stripe width boundaries
|
||||
when the current end of file is being extended and the file
|
||||
size is larger than the stripe width size.
|
||||
|
||||
|
||||
sysctls
|
||||
=======
|
||||
|
||||
The following sysctls are available for the XFS filesystem:
|
||||
|
||||
fs.xfs.stats_clear (Min: 0 Default: 0 Max: 1)
|
||||
Setting this to "1" clears accumulated XFS statistics
|
||||
Setting this to "1" clears accumulated XFS statistics
|
||||
in /proc/fs/xfs/stat. It then immediately resets to "0".
|
||||
|
||||
|
||||
fs.xfs.xfssyncd_centisecs (Min: 100 Default: 3000 Max: 720000)
|
||||
The interval at which the xfssyncd thread flushes metadata
|
||||
out to disk. This thread will flush log activity out, and
|
||||
|
@ -143,9 +205,9 @@ The following sysctls are available for the XFS filesystem:
|
|||
XFS_ERRLEVEL_HIGH: 5
|
||||
|
||||
fs.xfs.panic_mask (Min: 0 Default: 0 Max: 127)
|
||||
Causes certain error conditions to call BUG(). Value is a bitmask;
|
||||
Causes certain error conditions to call BUG(). Value is a bitmask;
|
||||
AND together the tags which represent errors which should cause panics:
|
||||
|
||||
|
||||
XFS_NO_PTAG 0
|
||||
XFS_PTAG_IFLUSH 0x00000001
|
||||
XFS_PTAG_LOGRES 0x00000002
|
||||
|
@ -155,7 +217,7 @@ The following sysctls are available for the XFS filesystem:
|
|||
XFS_PTAG_SHUTDOWN_IOERROR 0x00000020
|
||||
XFS_PTAG_SHUTDOWN_LOGERROR 0x00000040
|
||||
|
||||
This option is intended for debugging only.
|
||||
This option is intended for debugging only.
|
||||
|
||||
fs.xfs.irix_symlink_mode (Min: 0 Default: 0 Max: 1)
|
||||
Controls whether symlinks are created with mode 0777 (default)
|
||||
|
@ -164,25 +226,37 @@ The following sysctls are available for the XFS filesystem:
|
|||
fs.xfs.irix_sgid_inherit (Min: 0 Default: 0 Max: 1)
|
||||
Controls files created in SGID directories.
|
||||
If the group ID of the new file does not match the effective group
|
||||
ID or one of the supplementary group IDs of the parent dir, the
|
||||
ISGID bit is cleared if the irix_sgid_inherit compatibility sysctl
|
||||
ID or one of the supplementary group IDs of the parent dir, the
|
||||
ISGID bit is cleared if the irix_sgid_inherit compatibility sysctl
|
||||
is set.
|
||||
|
||||
fs.xfs.restrict_chown (Min: 0 Default: 1 Max: 1)
|
||||
Controls whether unprivileged users can use chown to "give away"
|
||||
a file to another user.
|
||||
|
||||
fs.xfs.inherit_sync (Min: 0 Default: 1 Max 1)
|
||||
Setting this to "1" will cause the "sync" flag set
|
||||
by the chattr(1) command on a directory to be
|
||||
fs.xfs.inherit_sync (Min: 0 Default: 1 Max: 1)
|
||||
Setting this to "1" will cause the "sync" flag set
|
||||
by the xfs_io(8) chattr command on a directory to be
|
||||
inherited by files in that directory.
|
||||
|
||||
fs.xfs.inherit_nodump (Min: 0 Default: 1 Max 1)
|
||||
Setting this to "1" will cause the "nodump" flag set
|
||||
by the chattr(1) command on a directory to be
|
||||
fs.xfs.inherit_nodump (Min: 0 Default: 1 Max: 1)
|
||||
Setting this to "1" will cause the "nodump" flag set
|
||||
by the xfs_io(8) chattr command on a directory to be
|
||||
inherited by files in that directory.
|
||||
|
||||
fs.xfs.inherit_noatime (Min: 0 Default: 1 Max 1)
|
||||
Setting this to "1" will cause the "noatime" flag set
|
||||
by the chattr(1) command on a directory to be
|
||||
fs.xfs.inherit_noatime (Min: 0 Default: 1 Max: 1)
|
||||
Setting this to "1" will cause the "noatime" flag set
|
||||
by the xfs_io(8) chattr command on a directory to be
|
||||
inherited by files in that directory.
|
||||
|
||||
fs.xfs.inherit_nosymlinks (Min: 0 Default: 1 Max: 1)
|
||||
Setting this to "1" will cause the "nosymlinks" flag set
|
||||
by the xfs_io(8) chattr command on a directory to be
|
||||
inherited by files in that directory.
|
||||
|
||||
fs.xfs.rotorstep (Min: 1 Default: 1 Max: 256)
|
||||
In "inode32" allocation mode, this option determines how many
|
||||
files the allocator attempts to allocate in the same allocation
|
||||
group before moving to the next allocation group. The intent
|
||||
is to control the rate at which the allocator moves between
|
||||
allocation groups when allocating extents for new files.
|
||||
|
|
|
@ -13,6 +13,7 @@
|
|||
#include <linux/kernel.h>
|
||||
#include <linux/init.h>
|
||||
#include <linux/device.h>
|
||||
#include <linux/string.h>
|
||||
|
||||
#include "linux/firmware.h"
|
||||
|
||||
|
@ -32,14 +33,14 @@ static void sample_firmware_load(char *firmware, int size)
|
|||
u8 buf[size+1];
|
||||
memcpy(buf, firmware, size);
|
||||
buf[size] = '\0';
|
||||
printk("firmware_sample_driver: firmware: %s\n", buf);
|
||||
printk(KERN_INFO "firmware_sample_driver: firmware: %s\n", buf);
|
||||
}
|
||||
|
||||
static void sample_probe_default(void)
|
||||
{
|
||||
/* uses the default method to get the firmware */
|
||||
const struct firmware *fw_entry;
|
||||
printk("firmware_sample_driver: a ghost device got inserted :)\n");
|
||||
printk(KERN_INFO "firmware_sample_driver: a ghost device got inserted :)\n");
|
||||
|
||||
if(request_firmware(&fw_entry, "sample_driver_fw", &ghost_device)!=0)
|
||||
{
|
||||
|
@ -61,7 +62,7 @@ static void sample_probe_specific(void)
|
|||
|
||||
/* NOTE: This currently doesn't work */
|
||||
|
||||
printk("firmware_sample_driver: a ghost device got inserted :)\n");
|
||||
printk(KERN_INFO "firmware_sample_driver: a ghost device got inserted :)\n");
|
||||
|
||||
if(request_firmware(NULL, "sample_driver_fw", &ghost_device)!=0)
|
||||
{
|
||||
|
@ -83,7 +84,7 @@ static void sample_probe_async_cont(const struct firmware *fw, void *context)
|
|||
return;
|
||||
}
|
||||
|
||||
printk("firmware_sample_driver: device pointer \"%s\"\n",
|
||||
printk(KERN_INFO "firmware_sample_driver: device pointer \"%s\"\n",
|
||||
(char *)context);
|
||||
sample_firmware_load(fw->data, fw->size);
|
||||
}
|
||||
|
|
|
@ -14,6 +14,8 @@
|
|||
#include <linux/module.h>
|
||||
#include <linux/init.h>
|
||||
#include <linux/timer.h>
|
||||
#include <linux/slab.h>
|
||||
#include <linux/string.h>
|
||||
#include <linux/firmware.h>
|
||||
|
||||
|
||||
|
|
|
@ -4,18 +4,18 @@ Kernel driver it87
|
|||
Supported chips:
|
||||
* IT8705F
|
||||
Prefix: 'it87'
|
||||
Addresses scanned: from Super I/O config space, or default ISA 0x290 (8 I/O ports)
|
||||
Addresses scanned: from Super I/O config space (8 I/O ports)
|
||||
Datasheet: Publicly available at the ITE website
|
||||
http://www.ite.com.tw/
|
||||
* IT8712F
|
||||
Prefix: 'it8712'
|
||||
Addresses scanned: I2C 0x28 - 0x2f
|
||||
from Super I/O config space, or default ISA 0x290 (8 I/O ports)
|
||||
from Super I/O config space (8 I/O ports)
|
||||
Datasheet: Publicly available at the ITE website
|
||||
http://www.ite.com.tw/
|
||||
* SiS950 [clone of IT8705F]
|
||||
Prefix: 'sis950'
|
||||
Addresses scanned: from Super I/O config space, or default ISA 0x290 (8 I/O ports)
|
||||
Prefix: 'it87'
|
||||
Addresses scanned: from Super I/O config space (8 I/O ports)
|
||||
Datasheet: No longer be available
|
||||
|
||||
Author: Christophe Gauthron <chrisg@0-in.com>
|
||||
|
|
|
@ -2,16 +2,11 @@ Kernel driver lm78
|
|||
==================
|
||||
|
||||
Supported chips:
|
||||
* National Semiconductor LM78
|
||||
* National Semiconductor LM78 / LM78-J
|
||||
Prefix: 'lm78'
|
||||
Addresses scanned: I2C 0x20 - 0x2f, ISA 0x290 (8 I/O ports)
|
||||
Datasheet: Publicly available at the National Semiconductor website
|
||||
http://www.national.com/
|
||||
* National Semiconductor LM78-J
|
||||
Prefix: 'lm78-j'
|
||||
Addresses scanned: I2C 0x20 - 0x2f, ISA 0x290 (8 I/O ports)
|
||||
Datasheet: Publicly available at the National Semiconductor website
|
||||
http://www.national.com/
|
||||
* National Semiconductor LM79
|
||||
Prefix: 'lm79'
|
||||
Addresses scanned: I2C 0x20 - 0x2f, ISA 0x290 (8 I/O ports)
|
||||
|
|
|
@ -24,14 +24,14 @@ Supported chips:
|
|||
http://www.national.com/pf/LM/LM86.html
|
||||
* Analog Devices ADM1032
|
||||
Prefix: 'adm1032'
|
||||
Addresses scanned: I2C 0x4c
|
||||
Addresses scanned: I2C 0x4c and 0x4d
|
||||
Datasheet: Publicly available at the Analog Devices website
|
||||
http://products.analog.com/products/info.asp?product=ADM1032
|
||||
http://www.analog.com/en/prod/0,2877,ADM1032,00.html
|
||||
* Analog Devices ADT7461
|
||||
Prefix: 'adt7461'
|
||||
Addresses scanned: I2C 0x4c
|
||||
Addresses scanned: I2C 0x4c and 0x4d
|
||||
Datasheet: Publicly available at the Analog Devices website
|
||||
http://products.analog.com/products/info.asp?product=ADT7461
|
||||
http://www.analog.com/en/prod/0,2877,ADT7461,00.html
|
||||
Note: Only if in ADM1032 compatibility mode
|
||||
* Maxim MAX6657
|
||||
Prefix: 'max6657'
|
||||
|
@ -71,8 +71,8 @@ increased resolution of the remote temperature measurement.
|
|||
|
||||
The different chipsets of the family are not strictly identical, although
|
||||
very similar. This driver doesn't handle any specific feature for now,
|
||||
but could if there ever was a need for it. For reference, here comes a
|
||||
non-exhaustive list of specific features:
|
||||
with the exception of SMBus PEC. For reference, here comes a non-exhaustive
|
||||
list of specific features:
|
||||
|
||||
LM90:
|
||||
* Filter and alert configuration register at 0xBF.
|
||||
|
@ -91,6 +91,7 @@ ADM1032:
|
|||
* Conversion averaging.
|
||||
* Up to 64 conversions/s.
|
||||
* ALERT is triggered by open remote sensor.
|
||||
* SMBus PEC support for Write Byte and Receive Byte transactions.
|
||||
|
||||
ADT7461
|
||||
* Extended temperature range (breaks compatibility)
|
||||
|
@ -119,3 +120,37 @@ The lm90 driver will not update its values more frequently than every
|
|||
other second; reading them more often will do no harm, but will return
|
||||
'old' values.
|
||||
|
||||
PEC Support
|
||||
-----------
|
||||
|
||||
The ADM1032 is the only chip of the family which supports PEC. It does
|
||||
not support PEC on all transactions though, so some care must be taken.
|
||||
|
||||
When reading a register value, the PEC byte is computed and sent by the
|
||||
ADM1032 chip. However, in the case of a combined transaction (SMBus Read
|
||||
Byte), the ADM1032 computes the CRC value over only the second half of
|
||||
the message rather than its entirety, because it thinks the first half
|
||||
of the message belongs to a different transaction. As a result, the CRC
|
||||
value differs from what the SMBus master expects, and all reads fail.
|
||||
|
||||
For this reason, the lm90 driver will enable PEC for the ADM1032 only if
|
||||
the bus supports the SMBus Send Byte and Receive Byte transaction types.
|
||||
These transactions will be used to read register values, instead of
|
||||
SMBus Read Byte, and PEC will work properly.
|
||||
|
||||
Additionally, the ADM1032 doesn't support SMBus Send Byte with PEC.
|
||||
Instead, it will try to write the PEC value to the register (because the
|
||||
SMBus Send Byte transaction with PEC is similar to a Write Byte transaction
|
||||
without PEC), which is not what we want. Thus, PEC is explicitely disabled
|
||||
on SMBus Send Byte transactions in the lm90 driver.
|
||||
|
||||
PEC on byte data transactions represents a significant increase in bandwidth
|
||||
usage (+33% for writes, +25% for reads) in normal conditions. With the need
|
||||
to use two SMBus transaction for reads, this overhead jumps to +50%. Worse,
|
||||
two transactions will typically mean twice as much delay waiting for
|
||||
transaction completion, effectively doubling the register cache refresh time.
|
||||
I guess reliability comes at a price, but it's quite expensive this time.
|
||||
|
||||
So, as not everyone might enjoy the slowdown, PEC can be disabled through
|
||||
sysfs. Just write 0 to the "pec" file and PEC will be disabled. Write 1
|
||||
to that file to enable PEC again.
|
||||
|
|
|
@ -3,6 +3,7 @@ Kernel driver smsc47b397
|
|||
|
||||
Supported chips:
|
||||
* SMSC LPC47B397-NC
|
||||
* SMSC SCH5307-NS
|
||||
Prefix: 'smsc47b397'
|
||||
Addresses scanned: none, address read from Super I/O config space
|
||||
Datasheet: In this file
|
||||
|
@ -12,11 +13,14 @@ Authors: Mark M. Hoffman <mhoffman@lightlink.com>
|
|||
|
||||
November 23, 2004
|
||||
|
||||
The following specification describes the SMSC LPC47B397-NC sensor chip
|
||||
The following specification describes the SMSC LPC47B397-NC[1] sensor chip
|
||||
(for which there is no public datasheet available). This document was
|
||||
provided by Craig Kelly (In-Store Broadcast Network) and edited/corrected
|
||||
by Mark M. Hoffman <mhoffman@lightlink.com>.
|
||||
|
||||
[1] And SMSC SCH5307-NS, which has a different device ID but is otherwise
|
||||
compatible.
|
||||
|
||||
* * * * *
|
||||
|
||||
Methods for detecting the HP SIO and reading the thermal data on a dc7100.
|
||||
|
@ -127,7 +131,7 @@ OUT DX,AL
|
|||
The registers of interest for identifying the SIO on the dc7100 are Device ID
|
||||
(0x20) and Device Rev (0x21).
|
||||
|
||||
The Device ID will read 0X6F
|
||||
The Device ID will read 0x6F (for SCH5307-NS, 0x81)
|
||||
The Device Rev currently reads 0x01
|
||||
|
||||
Obtaining the HWM Base Address.
|
||||
|
|
|
@ -12,6 +12,10 @@ Supported chips:
|
|||
http://www.smsc.com/main/datasheets/47m14x.pdf
|
||||
http://www.smsc.com/main/tools/discontinued/47m15x.pdf
|
||||
http://www.smsc.com/main/datasheets/47m192.pdf
|
||||
* SMSC LPC47M997
|
||||
Addresses scanned: none, address read from Super I/O config space
|
||||
Prefix: 'smsc47m1'
|
||||
Datasheet: none
|
||||
|
||||
Authors:
|
||||
Mark D. Studebaker <mdsxyz123@yahoo.com>,
|
||||
|
@ -30,6 +34,9 @@ The 47M15x and 47M192 chips contain a full 'hardware monitoring block'
|
|||
in addition to the fan monitoring and control. The hardware monitoring
|
||||
block is not supported by the driver.
|
||||
|
||||
No documentation is available for the 47M997, but it has the same device
|
||||
ID as the 47M15x and 47M192 chips and seems to be compatible.
|
||||
|
||||
Fan rotation speeds are reported in RPM (rotations per minute). An alarm is
|
||||
triggered if the rotation speed has dropped below a programmable limit. Fan
|
||||
readings can be divided by a programmable divider (1, 2, 4 or 8) to give
|
||||
|
|
|
@ -272,3 +272,6 @@ beep_mask Bitmask for beep.
|
|||
|
||||
eeprom Raw EEPROM data in binary form.
|
||||
Read only.
|
||||
|
||||
pec Enable or disable PEC (SMBus only)
|
||||
Read/Write
|
||||
|
|
|
@ -18,8 +18,9 @@ Authors:
|
|||
Module Parameters
|
||||
-----------------
|
||||
|
||||
force_addr=0xaddr Set the I/O base address. Useful for Asus A7V boards
|
||||
that don't set the address in the BIOS. Does not do a
|
||||
force_addr=0xaddr Set the I/O base address. Useful for boards that
|
||||
don't set the address in the BIOS. Look for a BIOS
|
||||
upgrade before resorting to this. Does not do a
|
||||
PCI force; the via686a must still be present in lspci.
|
||||
Don't use this unless the driver complains that the
|
||||
base address is not set.
|
||||
|
@ -63,3 +64,15 @@ miss once-only alarms.
|
|||
|
||||
The driver only updates its values each 1.5 seconds; reading it more often
|
||||
will do no harm, but will return 'old' values.
|
||||
|
||||
Known Issues
|
||||
------------
|
||||
|
||||
This driver handles sensors integrated in some VIA south bridges. It is
|
||||
possible that a motherboard maker used a VT82C686A/B chip as part of a
|
||||
product design but was not interested in its hardware monitoring features,
|
||||
in which case the sensor inputs will not be wired. This is the case of
|
||||
the Asus K7V, A7V and A7V133 motherboards, to name only a few of them.
|
||||
So, if you need the force_addr parameter, and end up with values which
|
||||
don't seem to make any sense, don't look any further: your chip is simply
|
||||
not wired for hardware monitoring.
|
||||
|
|
174
Documentation/hwmon/w83792d
Normal file
174
Documentation/hwmon/w83792d
Normal file
|
@ -0,0 +1,174 @@
|
|||
Kernel driver w83792d
|
||||
=====================
|
||||
|
||||
Supported chips:
|
||||
* Winbond W83792D
|
||||
Prefix: 'w83792d'
|
||||
Addresses scanned: I2C 0x2c - 0x2f
|
||||
Datasheet: http://www.winbond.com.tw/E-WINBONDHTM/partner/PDFresult.asp?Pname=1035
|
||||
|
||||
Author: Chunhao Huang
|
||||
Contact: DZShen <DZShen@Winbond.com.tw>
|
||||
|
||||
|
||||
Module Parameters
|
||||
-----------------
|
||||
|
||||
* init int
|
||||
(default 1)
|
||||
Use 'init=0' to bypass initializing the chip.
|
||||
Try this if your computer crashes when you load the module.
|
||||
|
||||
* force_subclients=bus,caddr,saddr,saddr
|
||||
This is used to force the i2c addresses for subclients of
|
||||
a certain chip. Example usage is `force_subclients=0,0x2f,0x4a,0x4b'
|
||||
to force the subclients of chip 0x2f on bus 0 to i2c addresses
|
||||
0x4a and 0x4b.
|
||||
|
||||
|
||||
Description
|
||||
-----------
|
||||
|
||||
This driver implements support for the Winbond W83792AD/D.
|
||||
|
||||
Detection of the chip can sometimes be foiled because it can be in an
|
||||
internal state that allows no clean access (Bank with ID register is not
|
||||
currently selected). If you know the address of the chip, use a 'force'
|
||||
parameter; this will put it into a more well-behaved state first.
|
||||
|
||||
The driver implements three temperature sensors, seven fan rotation speed
|
||||
sensors, nine voltage sensors, and two automatic fan regulation
|
||||
strategies called: Smart Fan I (Thermal Cruise mode) and Smart Fan II.
|
||||
Automatic fan control mode is possible only for fan1-fan3. Fan4-fan7 can run
|
||||
synchronized with selected fan (fan1-fan3). This functionality and manual PWM
|
||||
control for fan4-fan7 is not yet implemented.
|
||||
|
||||
Temperatures are measured in degrees Celsius and measurement resolution is 1
|
||||
degC for temp1 and 0.5 degC for temp2 and temp3. An alarm is triggered when
|
||||
the temperature gets higher than the Overtemperature Shutdown value; it stays
|
||||
on until the temperature falls below the Hysteresis value.
|
||||
|
||||
Fan rotation speeds are reported in RPM (rotations per minute). An alarm is
|
||||
triggered if the rotation speed has dropped below a programmable limit. Fan
|
||||
readings can be divided by a programmable divider (1, 2, 4, 8, 16, 32, 64 or
|
||||
128) to give the readings more range or accuracy.
|
||||
|
||||
Voltage sensors (also known as IN sensors) report their values in millivolts.
|
||||
An alarm is triggered if the voltage has crossed a programmable minimum
|
||||
or maximum limit.
|
||||
|
||||
Alarms are provided as output from "realtime status register". Following bits
|
||||
are defined:
|
||||
|
||||
bit - alarm on:
|
||||
0 - in0
|
||||
1 - in1
|
||||
2 - temp1
|
||||
3 - temp2
|
||||
4 - temp3
|
||||
5 - fan1
|
||||
6 - fan2
|
||||
7 - fan3
|
||||
8 - in2
|
||||
9 - in3
|
||||
10 - in4
|
||||
11 - in5
|
||||
12 - in6
|
||||
13 - VID change
|
||||
14 - chassis
|
||||
15 - fan7
|
||||
16 - tart1
|
||||
17 - tart2
|
||||
18 - tart3
|
||||
19 - in7
|
||||
20 - in8
|
||||
21 - fan4
|
||||
22 - fan5
|
||||
23 - fan6
|
||||
|
||||
Tart will be asserted while target temperature cannot be achieved after 3 minutes
|
||||
of full speed rotation of corresponding fan.
|
||||
|
||||
In addition to the alarms described above, there is a CHAS alarm on the chips
|
||||
which triggers if your computer case is open (This one is latched, contrary
|
||||
to realtime alarms).
|
||||
|
||||
The chips only update values each 3 seconds; reading them more often will
|
||||
do no harm, but will return 'old' values.
|
||||
|
||||
|
||||
W83792D PROBLEMS
|
||||
----------------
|
||||
Known problems:
|
||||
- This driver is only for Winbond W83792D C version device, there
|
||||
are also some motherboards with B version W83792D device. The
|
||||
calculation method to in6-in7(measured value, limits) is a little
|
||||
different between C and B version. C or B version can be identified
|
||||
by CR[0x49h].
|
||||
- The function of vid and vrm has not been finished, because I'm NOT
|
||||
very familiar with them. Adding support is welcome.
|
||||
- The function of chassis open detection needs more tests.
|
||||
- If you have ASUS server board and chip was not found: Then you will
|
||||
need to upgrade to latest (or beta) BIOS. If it does not help please
|
||||
contact us.
|
||||
|
||||
Fan control
|
||||
-----------
|
||||
|
||||
Manual mode
|
||||
-----------
|
||||
|
||||
Works as expected. You just need to specify desired PWM/DC value (fan speed)
|
||||
in appropriate pwm# file.
|
||||
|
||||
Thermal cruise
|
||||
--------------
|
||||
|
||||
In this mode, W83792D provides the Smart Fan system to automatically control
|
||||
fan speed to keep the temperatures of CPU and the system within specific
|
||||
range. At first a wanted temperature and interval must be set. This is done
|
||||
via thermal_cruise# file. The tolerance# file serves to create T +- tolerance
|
||||
interval. The fan speed will be lowered as long as the current temperature
|
||||
remains below the thermal_cruise# +- tolerance# value. Once the temperature
|
||||
exceeds the high limit (T+tolerance), the fan will be turned on with a
|
||||
specific speed set by pwm# and automatically controlled its PWM duty cycle
|
||||
with the temperature varying. Three conditions may occur:
|
||||
|
||||
(1) If the temperature still exceeds the high limit, PWM duty
|
||||
cycle will increase slowly.
|
||||
|
||||
(2) If the temperature goes below the high limit, but still above the low
|
||||
limit (T-tolerance), the fan speed will be fixed at the current speed because
|
||||
the temperature is in the target range.
|
||||
|
||||
(3) If the temperature goes below the low limit, PWM duty cycle will decrease
|
||||
slowly to 0 or a preset stop value until the temperature exceeds the low
|
||||
limit. (The preset stop value handling is not yet implemented in driver)
|
||||
|
||||
Smart Fan II
|
||||
------------
|
||||
|
||||
W83792D also provides a special mode for fan. Four temperature points are
|
||||
available. When related temperature sensors detects the temperature in preset
|
||||
temperature region (sf2_point@_fan# +- tolerance#) it will cause fans to run
|
||||
on programmed value from sf2_level@_fan#. You need to set four temperatures
|
||||
for each fan.
|
||||
|
||||
|
||||
/sys files
|
||||
----------
|
||||
|
||||
pwm[1-3] - this file stores PWM duty cycle or DC value (fan speed) in range:
|
||||
0 (stop) to 255 (full)
|
||||
pwm[1-3]_enable - this file controls mode of fan/temperature control:
|
||||
* 0 Disabled
|
||||
* 1 Manual mode
|
||||
* 2 Smart Fan II
|
||||
* 3 Thermal Cruise
|
||||
pwm[1-3]_mode - Select PWM of DC mode
|
||||
* 0 DC
|
||||
* 1 PWM
|
||||
thermal_cruise[1-3] - Selects the desired temperature for cruise (degC)
|
||||
tolerance[1-3] - Value in degrees of Celsius (degC) for +- T
|
||||
sf2_point[1-4]_fan[1-3] - four temperature points for each fan for Smart Fan II
|
||||
sf2_level[1-3]_fan[1-3] - three PWM/DC levels for each fan for Smart Fan II
|
|
@ -2,6 +2,7 @@ Kernel driver i2c-i810
|
|||
|
||||
Supported adapters:
|
||||
* Intel 82810, 82810-DC100, 82810E, and 82815 (GMCH)
|
||||
* Intel 82845G (GMCH)
|
||||
|
||||
Authors:
|
||||
Frodo Looijaard <frodol@dds.nl>,
|
||||
|
|
|
@ -4,17 +4,18 @@ Supported adapters:
|
|||
* VIA Technologies, Inc. VT82C596A/B
|
||||
Datasheet: Sometimes available at the VIA website
|
||||
|
||||
* VIA Technologies, Inc. VT82C686A/B
|
||||
* VIA Technologies, Inc. VT82C686A/B
|
||||
Datasheet: Sometimes available at the VIA website
|
||||
|
||||
* VIA Technologies, Inc. VT8231, VT8233, VT8233A, VT8235, VT8237
|
||||
Datasheet: available on request from Via
|
||||
|
||||
Authors:
|
||||
Frodo Looijaard <frodol@dds.nl>,
|
||||
Philip Edelbrock <phil@netroedge.com>,
|
||||
Kyösti Mälkki <kmalkki@cc.hut.fi>,
|
||||
Mark D. Studebaker <mdsxyz123@yahoo.com>
|
||||
Frodo Looijaard <frodol@dds.nl>,
|
||||
Philip Edelbrock <phil@netroedge.com>,
|
||||
Kyösti Mälkki <kmalkki@cc.hut.fi>,
|
||||
Mark D. Studebaker <mdsxyz123@yahoo.com>,
|
||||
Jean Delvare <khali@linux-fr.org>
|
||||
|
||||
Module Parameters
|
||||
-----------------
|
||||
|
@ -28,20 +29,22 @@ Description
|
|||
-----------
|
||||
|
||||
i2c-viapro is a true SMBus host driver for motherboards with one of the
|
||||
supported VIA southbridges.
|
||||
supported VIA south bridges.
|
||||
|
||||
Your lspci -n listing must show one of these :
|
||||
|
||||
device 1106:3050 (VT82C596 function 3)
|
||||
device 1106:3051 (VT82C596 function 3)
|
||||
device 1106:3050 (VT82C596A function 3)
|
||||
device 1106:3051 (VT82C596B function 3)
|
||||
device 1106:3057 (VT82C686 function 4)
|
||||
device 1106:3074 (VT8233)
|
||||
device 1106:3147 (VT8233A)
|
||||
device 1106:8235 (VT8231)
|
||||
devide 1106:3177 (VT8235)
|
||||
devide 1106:3227 (VT8237)
|
||||
device 1106:8235 (VT8231 function 4)
|
||||
device 1106:3177 (VT8235)
|
||||
device 1106:3227 (VT8237R)
|
||||
|
||||
If none of these show up, you should look in the BIOS for settings like
|
||||
enable ACPI / SMBus or even USB.
|
||||
|
||||
|
||||
Except for the oldest chips (VT82C596A/B, VT82C686A and most probably
|
||||
VT8231), this driver supports I2C block transactions. Such transactions
|
||||
are mainly useful to read from and write to EEPROMs.
|
||||
|
|
|
@ -4,22 +4,13 @@ Kernel driver max6875
|
|||
Supported chips:
|
||||
* Maxim MAX6874, MAX6875
|
||||
Prefix: 'max6875'
|
||||
Addresses scanned: 0x50, 0x52
|
||||
Addresses scanned: None (see below)
|
||||
Datasheet:
|
||||
http://pdfserv.maxim-ic.com/en/ds/MAX6874-MAX6875.pdf
|
||||
|
||||
Author: Ben Gardner <bgardner@wabtec.com>
|
||||
|
||||
|
||||
Module Parameters
|
||||
-----------------
|
||||
|
||||
* allow_write int
|
||||
Set to non-zero to enable write permission:
|
||||
*0: Read only
|
||||
1: Read and write
|
||||
|
||||
|
||||
Description
|
||||
-----------
|
||||
|
||||
|
@ -33,34 +24,85 @@ registers.
|
|||
|
||||
The Maxim MAX6874 is a similar, mostly compatible device, with more intputs
|
||||
and outputs:
|
||||
|
||||
vin gpi vout
|
||||
MAX6874 6 4 8
|
||||
MAX6875 4 3 5
|
||||
|
||||
MAX6874 chips can have four different addresses (as opposed to only two for
|
||||
the MAX6875). The additional addresses (0x54 and 0x56) are not probed by
|
||||
this driver by default, but the probe module parameter can be used if
|
||||
needed.
|
||||
|
||||
See the datasheet for details on how to program the EEPROM.
|
||||
See the datasheet for more information.
|
||||
|
||||
|
||||
Sysfs entries
|
||||
-------------
|
||||
|
||||
eeprom_user - 512 bytes of user-defined EEPROM space. Only writable if
|
||||
allow_write was set and register 0x43 is 0.
|
||||
|
||||
eeprom_config - 70 bytes of config EEPROM. Note that changes will not get
|
||||
loaded into register space until a power cycle or device reset.
|
||||
|
||||
reg_config - 70 bytes of register space. Any changes take affect immediately.
|
||||
eeprom - 512 bytes of user-defined EEPROM space.
|
||||
|
||||
|
||||
General Remarks
|
||||
---------------
|
||||
|
||||
A typical application will require that the EEPROMs be programmed once and
|
||||
never altered afterwards.
|
||||
Valid addresses for the MAX6875 are 0x50 and 0x52.
|
||||
Valid addresses for the MAX6874 are 0x50, 0x52, 0x54 and 0x56.
|
||||
The driver does not probe any address, so you must force the address.
|
||||
|
||||
Example:
|
||||
$ modprobe max6875 force=0,0x50
|
||||
|
||||
The MAX6874/MAX6875 ignores address bit 0, so this driver attaches to multiple
|
||||
addresses. For example, for address 0x50, it also reserves 0x51.
|
||||
The even-address instance is called 'max6875', the odd one is 'max6875 subclient'.
|
||||
|
||||
|
||||
Programming the chip using i2c-dev
|
||||
----------------------------------
|
||||
|
||||
Use the i2c-dev interface to access and program the chips.
|
||||
Reads and writes are performed differently depending on the address range.
|
||||
|
||||
The configuration registers are at addresses 0x00 - 0x45.
|
||||
Use i2c_smbus_write_byte_data() to write a register and
|
||||
i2c_smbus_read_byte_data() to read a register.
|
||||
The command is the register number.
|
||||
|
||||
Examples:
|
||||
To write a 1 to register 0x45:
|
||||
i2c_smbus_write_byte_data(fd, 0x45, 1);
|
||||
|
||||
To read register 0x45:
|
||||
value = i2c_smbus_read_byte_data(fd, 0x45);
|
||||
|
||||
|
||||
The configuration EEPROM is at addresses 0x8000 - 0x8045.
|
||||
The user EEPROM is at addresses 0x8100 - 0x82ff.
|
||||
|
||||
Use i2c_smbus_write_word_data() to write a byte to EEPROM.
|
||||
|
||||
The command is the upper byte of the address: 0x80, 0x81, or 0x82.
|
||||
The data word is the lower part of the address or'd with data << 8.
|
||||
cmd = address >> 8;
|
||||
val = (address & 0xff) | (data << 8);
|
||||
|
||||
Example:
|
||||
To write 0x5a to address 0x8003:
|
||||
i2c_smbus_write_word_data(fd, 0x80, 0x5a03);
|
||||
|
||||
|
||||
Reading data from the EEPROM is a little more complicated.
|
||||
Use i2c_smbus_write_byte_data() to set the read address and then
|
||||
i2c_smbus_read_byte() or i2c_smbus_read_i2c_block_data() to read the data.
|
||||
|
||||
Example:
|
||||
To read data starting at offset 0x8100, first set the address:
|
||||
i2c_smbus_write_byte_data(fd, 0x81, 0x00);
|
||||
|
||||
And then read the data
|
||||
value = i2c_smbus_read_byte(fd);
|
||||
|
||||
or
|
||||
|
||||
count = i2c_smbus_read_i2c_block_data(fd, 0x84, buffer);
|
||||
|
||||
The block read should read 16 bytes.
|
||||
0x84 is the block read command.
|
||||
|
||||
See the datasheet for more details.
|
||||
|
||||
|
|
38
Documentation/i2c/chips/x1205
Normal file
38
Documentation/i2c/chips/x1205
Normal file
|
@ -0,0 +1,38 @@
|
|||
Kernel driver x1205
|
||||
===================
|
||||
|
||||
Supported chips:
|
||||
* Xicor X1205 RTC
|
||||
Prefix: 'x1205'
|
||||
Addresses scanned: none
|
||||
Datasheet: http://www.intersil.com/cda/deviceinfo/0,1477,X1205,00.html
|
||||
|
||||
Authors:
|
||||
Karen Spearel <kas11@tampabay.rr.com>,
|
||||
Alessandro Zummo <a.zummo@towertech.it>
|
||||
|
||||
Description
|
||||
-----------
|
||||
|
||||
This module aims to provide complete access to the Xicor X1205 RTC.
|
||||
Recently Xicor has merged with Intersil, but the chip is
|
||||
still sold under the Xicor brand.
|
||||
|
||||
This chip is located at address 0x6f and uses a 2-byte register addressing.
|
||||
Two bytes need to be written to read a single register, while most
|
||||
other chips just require one and take the second one as the data
|
||||
to be written. To prevent corrupting unknown chips, the user must
|
||||
explicitely set the probe parameter.
|
||||
|
||||
example:
|
||||
|
||||
modprobe x1205 probe=0,0x6f
|
||||
|
||||
The module supports one more option, hctosys, which is used to set the
|
||||
software clock from the x1205. On systems where the x1205 is the
|
||||
only hardware rtc, this parameter could be used to achieve a correct
|
||||
date/time earlier in the system boot sequence.
|
||||
|
||||
example:
|
||||
|
||||
modprobe x1205 probe=0,0x6f hctosys=1
|
|
@ -17,9 +17,10 @@ For the most up-to-date list of functionality constants, please check
|
|||
I2C_FUNC_I2C Plain i2c-level commands (Pure SMBus
|
||||
adapters typically can not do these)
|
||||
I2C_FUNC_10BIT_ADDR Handles the 10-bit address extensions
|
||||
I2C_FUNC_PROTOCOL_MANGLING Knows about the I2C_M_REV_DIR_ADDR,
|
||||
I2C_M_REV_DIR_ADDR and I2C_M_REV_DIR_NOSTART
|
||||
flags (which modify the i2c protocol!)
|
||||
I2C_FUNC_PROTOCOL_MANGLING Knows about the I2C_M_IGNORE_NAK,
|
||||
I2C_M_REV_DIR_ADDR, I2C_M_NOSTART and
|
||||
I2C_M_NO_RD_ACK flags (which modify the
|
||||
I2C protocol!)
|
||||
I2C_FUNC_SMBUS_QUICK Handles the SMBus write_quick command
|
||||
I2C_FUNC_SMBUS_READ_BYTE Handles the SMBus read_byte command
|
||||
I2C_FUNC_SMBUS_WRITE_BYTE Handles the SMBus write_byte command
|
||||
|
@ -115,7 +116,7 @@ CHECKING THROUGH /DEV
|
|||
If you try to access an adapter from a userspace program, you will have
|
||||
to use the /dev interface. You will still have to check whether the
|
||||
functionality you need is supported, of course. This is done using
|
||||
the I2C_FUNCS ioctl. An example, adapted from the lm_sensors i2c_detect
|
||||
the I2C_FUNCS ioctl. An example, adapted from the lm_sensors i2cdetect
|
||||
program, is below:
|
||||
|
||||
int file;
|
||||
|
|
|
@ -1,4 +1,4 @@
|
|||
Revision 4, 2004-03-30
|
||||
Revision 5, 2005-07-29
|
||||
Jean Delvare <khali@linux-fr.org>
|
||||
Greg KH <greg@kroah.com>
|
||||
|
||||
|
@ -17,20 +17,22 @@ yours for best results.
|
|||
|
||||
Technical changes:
|
||||
|
||||
* [Includes] Get rid of "version.h". Replace <linux/i2c-proc.h> with
|
||||
<linux/i2c-sensor.h>. Includes typically look like that:
|
||||
* [Includes] Get rid of "version.h" and <linux/i2c-proc.h>.
|
||||
Includes typically look like that:
|
||||
#include <linux/module.h>
|
||||
#include <linux/init.h>
|
||||
#include <linux/slab.h>
|
||||
#include <linux/i2c.h>
|
||||
#include <linux/i2c-sensor.h>
|
||||
#include <linux/i2c-vid.h> /* if you need VRM support */
|
||||
#include <linux/hwmon.h> /* for hardware monitoring drivers */
|
||||
#include <linux/hwmon-sysfs.h>
|
||||
#include <linux/hwmon-vid.h> /* if you need VRM support */
|
||||
#include <asm/io.h> /* if you have I/O operations */
|
||||
Please respect this inclusion order. Some extra headers may be
|
||||
required for a given driver (e.g. "lm75.h").
|
||||
|
||||
* [Addresses] SENSORS_I2C_END becomes I2C_CLIENT_END, SENSORS_ISA_END
|
||||
becomes I2C_CLIENT_ISA_END.
|
||||
* [Addresses] SENSORS_I2C_END becomes I2C_CLIENT_END, ISA addresses
|
||||
are no more handled by the i2c core.
|
||||
SENSORS_INSMOD_<n> becomes I2C_CLIENT_INSMOD_<n>.
|
||||
|
||||
* [Client data] Get rid of sysctl_id. Try using standard names for
|
||||
register values (for example, temp_os becomes temp_max). You're
|
||||
|
@ -66,19 +68,21 @@ Technical changes:
|
|||
if (!(adapter->class & I2C_CLASS_HWMON))
|
||||
return 0;
|
||||
ISA-only drivers of course don't need this.
|
||||
Call i2c_probe() instead of i2c_detect().
|
||||
|
||||
* [Detect] As mentioned earlier, the flags parameter is gone.
|
||||
The type_name and client_name strings are replaced by a single
|
||||
name string, which will be filled with a lowercase, short string
|
||||
(typically the driver name, e.g. "lm75").
|
||||
In i2c-only drivers, drop the i2c_is_isa_adapter check, it's
|
||||
useless.
|
||||
useless. Same for isa-only drivers, as the test would always be
|
||||
true. Only hybrid drivers (which are quite rare) still need it.
|
||||
The errorN labels are reduced to the number needed. If that number
|
||||
is 2 (i2c-only drivers), it is advised that the labels are named
|
||||
exit and exit_free. For i2c+isa drivers, labels should be named
|
||||
ERROR0, ERROR1 and ERROR2. Don't forget to properly set err before
|
||||
jumping to error labels. By the way, labels should be left-aligned.
|
||||
Use memset to fill the client and data area with 0x00.
|
||||
Use kzalloc instead of kmalloc.
|
||||
Use i2c_set_clientdata to set the client data (as opposed to
|
||||
a direct access to client->data).
|
||||
Use strlcpy instead of strcpy to copy the client name.
|
||||
|
@ -86,6 +90,8 @@ Technical changes:
|
|||
device_create_file. Move the driver initialization before any
|
||||
sysfs file creation.
|
||||
Drop client->id.
|
||||
Drop any 24RF08 corruption prevention you find, as this is now done
|
||||
at the i2c-core level, and doing it twice voids it.
|
||||
|
||||
* [Init] Limits must not be set by the driver (can be done later in
|
||||
user-space). Chip should not be reset default (although a module
|
||||
|
@ -93,7 +99,8 @@ Technical changes:
|
|||
limited to the strictly necessary steps.
|
||||
|
||||
* [Detach] Get rid of data, remove the call to
|
||||
i2c_deregister_entry.
|
||||
i2c_deregister_entry. Do not log an error message if
|
||||
i2c_detach_client fails, as i2c-core will now do it for you.
|
||||
|
||||
* [Update] Don't access client->data directly, use
|
||||
i2c_get_clientdata(client) instead.
|
||||
|
|
|
@ -33,8 +33,8 @@ static struct i2c_driver foo_driver = {
|
|||
.command = &foo_command /* may be NULL */
|
||||
}
|
||||
|
||||
The name can be chosen freely, and may be upto 40 characters long. Please
|
||||
use something descriptive here.
|
||||
The name field must match the driver name, including the case. It must not
|
||||
contain spaces, and may be up to 31 characters long.
|
||||
|
||||
Don't worry about the flags field; just put I2C_DF_NOTIFY into it. This
|
||||
means that your driver will be notified when new adapters are found.
|
||||
|
@ -43,9 +43,6 @@ This is almost always what you want.
|
|||
All other fields are for call-back functions which will be explained
|
||||
below.
|
||||
|
||||
There use to be two additional fields in this structure, inc_use et dec_use,
|
||||
for module usage count, but these fields were obsoleted and removed.
|
||||
|
||||
|
||||
Extra client data
|
||||
=================
|
||||
|
@ -58,6 +55,7 @@ be very useful.
|
|||
An example structure is below.
|
||||
|
||||
struct foo_data {
|
||||
struct i2c_client client;
|
||||
struct semaphore lock; /* For ISA access in `sensors' drivers. */
|
||||
int sysctl_id; /* To keep the /proc directory entry for
|
||||
`sensors' drivers. */
|
||||
|
@ -148,15 +146,15 @@ are defined in i2c.h to help you support them, as well as a generic
|
|||
detection algorithm.
|
||||
|
||||
You do not have to use this parameter interface; but don't try to use
|
||||
function i2c_probe() (or i2c_detect()) if you don't.
|
||||
function i2c_probe() if you don't.
|
||||
|
||||
NOTE: If you want to write a `sensors' driver, the interface is slightly
|
||||
different! See below.
|
||||
|
||||
|
||||
|
||||
Probing classes (i2c)
|
||||
---------------------
|
||||
Probing classes
|
||||
---------------
|
||||
|
||||
All parameters are given as lists of unsigned 16-bit integers. Lists are
|
||||
terminated by I2C_CLIENT_END.
|
||||
|
@ -171,12 +169,18 @@ The following lists are used internally:
|
|||
ignore: insmod parameter.
|
||||
A list of pairs. The first value is a bus number (-1 for any I2C bus),
|
||||
the second is the I2C address. These addresses are never probed.
|
||||
This parameter overrules 'normal' and 'probe', but not the 'force' lists.
|
||||
This parameter overrules the 'normal_i2c' list only.
|
||||
force: insmod parameter.
|
||||
A list of pairs. The first value is a bus number (-1 for any I2C bus),
|
||||
the second is the I2C address. A device is blindly assumed to be on
|
||||
the given address, no probing is done.
|
||||
|
||||
Additionally, kind-specific force lists may optionally be defined if
|
||||
the driver supports several chip kinds. They are grouped in a
|
||||
NULL-terminated list of pointers named forces, those first element if the
|
||||
generic force list mentioned above. Each additional list correspond to an
|
||||
insmod parameter of the form force_<kind>.
|
||||
|
||||
Fortunately, as a module writer, you just have to define the `normal_i2c'
|
||||
parameter. The complete declaration could look like this:
|
||||
|
||||
|
@ -186,66 +190,17 @@ parameter. The complete declaration could look like this:
|
|||
|
||||
/* Magic definition of all other variables and things */
|
||||
I2C_CLIENT_INSMOD;
|
||||
/* Or, if your driver supports, say, 2 kind of devices: */
|
||||
I2C_CLIENT_INSMOD_2(foo, bar);
|
||||
|
||||
If you use the multi-kind form, an enum will be defined for you:
|
||||
enum chips { any_chip, foo, bar, ... }
|
||||
You can then (and certainly should) use it in the driver code.
|
||||
|
||||
Note that you *have* to call the defined variable `normal_i2c',
|
||||
without any prefix!
|
||||
|
||||
|
||||
Probing classes (sensors)
|
||||
-------------------------
|
||||
|
||||
If you write a `sensors' driver, you use a slightly different interface.
|
||||
As well as I2C addresses, we have to cope with ISA addresses. Also, we
|
||||
use a enum of chip types. Don't forget to include `sensors.h'.
|
||||
|
||||
The following lists are used internally. They are all lists of integers.
|
||||
|
||||
normal_i2c: filled in by the module writer. Terminated by SENSORS_I2C_END.
|
||||
A list of I2C addresses which should normally be examined.
|
||||
normal_isa: filled in by the module writer. Terminated by SENSORS_ISA_END.
|
||||
A list of ISA addresses which should normally be examined.
|
||||
probe: insmod parameter. Initialize this list with SENSORS_I2C_END values.
|
||||
A list of pairs. The first value is a bus number (SENSORS_ISA_BUS for
|
||||
the ISA bus, -1 for any I2C bus), the second is the address. These
|
||||
addresses are also probed, as if they were in the 'normal' list.
|
||||
ignore: insmod parameter. Initialize this list with SENSORS_I2C_END values.
|
||||
A list of pairs. The first value is a bus number (SENSORS_ISA_BUS for
|
||||
the ISA bus, -1 for any I2C bus), the second is the I2C address. These
|
||||
addresses are never probed. This parameter overrules 'normal' and
|
||||
'probe', but not the 'force' lists.
|
||||
|
||||
Also used is a list of pointers to sensors_force_data structures:
|
||||
force_data: insmod parameters. A list, ending with an element of which
|
||||
the force field is NULL.
|
||||
Each element contains the type of chip and a list of pairs.
|
||||
The first value is a bus number (SENSORS_ISA_BUS for the ISA bus,
|
||||
-1 for any I2C bus), the second is the address.
|
||||
These are automatically translated to insmod variables of the form
|
||||
force_foo.
|
||||
|
||||
So we have a generic insmod variabled `force', and chip-specific variables
|
||||
`force_CHIPNAME'.
|
||||
|
||||
Fortunately, as a module writer, you just have to define the `normal_i2c'
|
||||
and `normal_isa' parameters, and define what chip names are used.
|
||||
The complete declaration could look like this:
|
||||
/* Scan i2c addresses 0x37, and 0x48 to 0x4f */
|
||||
static unsigned short normal_i2c[] = { 0x37, 0x48, 0x49, 0x4a, 0x4b, 0x4c,
|
||||
0x4d, 0x4e, 0x4f, I2C_CLIENT_END };
|
||||
/* Scan ISA address 0x290 */
|
||||
static unsigned int normal_isa[] = {0x0290,SENSORS_ISA_END};
|
||||
|
||||
/* Define chips foo and bar, as well as all module parameters and things */
|
||||
SENSORS_INSMOD_2(foo,bar);
|
||||
|
||||
If you have one chip, you use macro SENSORS_INSMOD_1(chip), if you have 2
|
||||
you use macro SENSORS_INSMOD_2(chip1,chip2), etc. If you do not want to
|
||||
bother with chip types, you can use SENSORS_INSMOD_0.
|
||||
|
||||
A enum is automatically defined as follows:
|
||||
enum chips { any_chip, chip1, chip2, ... }
|
||||
|
||||
|
||||
Attaching to an adapter
|
||||
-----------------------
|
||||
|
||||
|
@ -264,17 +219,10 @@ detected at a specific address, another callback is called.
|
|||
return i2c_probe(adapter,&addr_data,&foo_detect_client);
|
||||
}
|
||||
|
||||
For `sensors' drivers, use the i2c_detect function instead:
|
||||
|
||||
int foo_attach_adapter(struct i2c_adapter *adapter)
|
||||
{
|
||||
return i2c_detect(adapter,&addr_data,&foo_detect_client);
|
||||
}
|
||||
|
||||
Remember, structure `addr_data' is defined by the macros explained above,
|
||||
so you do not have to define it yourself.
|
||||
|
||||
The i2c_probe or i2c_detect function will call the foo_detect_client
|
||||
The i2c_probe function will call the foo_detect_client
|
||||
function only for those i2c addresses that actually have a device on
|
||||
them (unless a `force' parameter was used). In addition, addresses that
|
||||
are already in use (by some other registered client) are skipped.
|
||||
|
@ -283,19 +231,18 @@ are already in use (by some other registered client) are skipped.
|
|||
The detect client function
|
||||
--------------------------
|
||||
|
||||
The detect client function is called by i2c_probe or i2c_detect.
|
||||
The `kind' parameter contains 0 if this call is due to a `force'
|
||||
parameter, and -1 otherwise (for i2c_detect, it contains 0 if
|
||||
this call is due to the generic `force' parameter, and the chip type
|
||||
number if it is due to a specific `force' parameter).
|
||||
The detect client function is called by i2c_probe. The `kind' parameter
|
||||
contains -1 for a probed detection, 0 for a forced detection, or a positive
|
||||
number for a forced detection with a chip type forced.
|
||||
|
||||
Below, some things are only needed if this is a `sensors' driver. Those
|
||||
parts are between /* SENSORS ONLY START */ and /* SENSORS ONLY END */
|
||||
markers.
|
||||
|
||||
This function should only return an error (any value != 0) if there is
|
||||
some reason why no more detection should be done anymore. If the
|
||||
detection just fails for this address, return 0.
|
||||
Returning an error different from -ENODEV in a detect function will cause
|
||||
the detection to stop: other addresses and adapters won't be scanned.
|
||||
This should only be done on fatal or internal errors, such as a memory
|
||||
shortage or i2c_attach_client failing.
|
||||
|
||||
For now, you can ignore the `flags' parameter. It is there for future use.
|
||||
|
||||
|
@ -320,13 +267,13 @@ For now, you can ignore the `flags' parameter. It is there for future use.
|
|||
const char *type_name = "";
|
||||
int is_isa = i2c_is_isa_adapter(adapter);
|
||||
|
||||
/* Do this only if the chip can additionally be found on the ISA bus
|
||||
(hybrid chip). */
|
||||
|
||||
if (is_isa) {
|
||||
|
||||
/* If this client can't be on the ISA bus at all, we can stop now
|
||||
(call `goto ERROR0'). But for kicks, we will assume it is all
|
||||
right. */
|
||||
|
||||
/* Discard immediately if this ISA range is already used */
|
||||
/* FIXME: never use check_region(), only request_region() */
|
||||
if (check_region(address,FOO_EXTENT))
|
||||
goto ERROR0;
|
||||
|
||||
|
@ -362,22 +309,15 @@ For now, you can ignore the `flags' parameter. It is there for future use.
|
|||
client structure, even though we cannot fill it completely yet.
|
||||
But it allows us to access several i2c functions safely */
|
||||
|
||||
/* Note that we reserve some space for foo_data too. If you don't
|
||||
need it, remove it. We do it here to help to lessen memory
|
||||
fragmentation. */
|
||||
if (! (new_client = kmalloc(sizeof(struct i2c_client) +
|
||||
sizeof(struct foo_data),
|
||||
GFP_KERNEL))) {
|
||||
if (!(data = kzalloc(sizeof(struct foo_data), GFP_KERNEL))) {
|
||||
err = -ENOMEM;
|
||||
goto ERROR0;
|
||||
}
|
||||
|
||||
/* This is tricky, but it will set the data to the right value. */
|
||||
client->data = new_client + 1;
|
||||
data = (struct foo_data *) (client->data);
|
||||
new_client = &data->client;
|
||||
i2c_set_clientdata(new_client, data);
|
||||
|
||||
new_client->addr = address;
|
||||
new_client->data = data;
|
||||
new_client->adapter = adapter;
|
||||
new_client->driver = &foo_driver;
|
||||
new_client->flags = 0;
|
||||
|
@ -495,17 +435,15 @@ much simpler than the attachment code, fortunately!
|
|||
/* SENSORS ONLY END */
|
||||
|
||||
/* Try to detach the client from i2c space */
|
||||
if ((err = i2c_detach_client(client))) {
|
||||
printk("foo.o: Client deregistration failed, client not detached.\n");
|
||||
if ((err = i2c_detach_client(client)))
|
||||
return err;
|
||||
}
|
||||
|
||||
/* SENSORS ONLY START */
|
||||
/* HYBRID SENSORS CHIP ONLY START */
|
||||
if i2c_is_isa_client(client)
|
||||
release_region(client->addr,LM78_EXTENT);
|
||||
/* SENSORS ONLY END */
|
||||
/* HYBRID SENSORS CHIP ONLY END */
|
||||
|
||||
kfree(client); /* Frees client data too, if allocated at the same time */
|
||||
kfree(data);
|
||||
return 0;
|
||||
}
|
||||
|
||||
|
@ -630,12 +568,12 @@ SMBus communication
|
|||
extern s32 i2c_smbus_write_block_data(struct i2c_client * client,
|
||||
u8 command, u8 length,
|
||||
u8 *values);
|
||||
extern s32 i2c_smbus_read_i2c_block_data(struct i2c_client * client,
|
||||
u8 command, u8 *values);
|
||||
|
||||
These ones were removed in Linux 2.6.10 because they had no users, but could
|
||||
be added back later if needed:
|
||||
|
||||
extern s32 i2c_smbus_read_i2c_block_data(struct i2c_client * client,
|
||||
u8 command, u8 *values);
|
||||
extern s32 i2c_smbus_read_block_data(struct i2c_client * client,
|
||||
u8 command, u8 *values);
|
||||
extern s32 i2c_smbus_write_i2c_block_data(struct i2c_client * client,
|
||||
|
|
|
@ -2,7 +2,7 @@
|
|||
----------------------------
|
||||
|
||||
H. Peter Anvin <hpa@zytor.com>
|
||||
Last update 2002-01-01
|
||||
Last update 2005-09-02
|
||||
|
||||
On the i386 platform, the Linux kernel uses a rather complicated boot
|
||||
convention. This has evolved partially due to historical aspects, as
|
||||
|
@ -34,6 +34,8 @@ Protocol 2.02: (Kernel 2.4.0-test3-pre3) New command line protocol.
|
|||
Protocol 2.03: (Kernel 2.4.18-pre1) Explicitly makes the highest possible
|
||||
initrd address available to the bootloader.
|
||||
|
||||
Protocol 2.04: (Kernel 2.6.14) Extend the syssize field to four bytes.
|
||||
|
||||
|
||||
**** MEMORY LAYOUT
|
||||
|
||||
|
@ -103,10 +105,9 @@ The header looks like:
|
|||
Offset Proto Name Meaning
|
||||
/Size
|
||||
|
||||
01F1/1 ALL setup_sects The size of the setup in sectors
|
||||
01F1/1 ALL(1 setup_sects The size of the setup in sectors
|
||||
01F2/2 ALL root_flags If set, the root is mounted readonly
|
||||
01F4/2 ALL syssize DO NOT USE - for bootsect.S use only
|
||||
01F6/2 ALL swap_dev DO NOT USE - obsolete
|
||||
01F4/4 2.04+(2 syssize The size of the 32-bit code in 16-byte paras
|
||||
01F8/2 ALL ram_size DO NOT USE - for bootsect.S use only
|
||||
01FA/2 ALL vid_mode Video mode control
|
||||
01FC/2 ALL root_dev Default root device number
|
||||
|
@ -129,8 +130,12 @@ Offset Proto Name Meaning
|
|||
0228/4 2.02+ cmd_line_ptr 32-bit pointer to the kernel command line
|
||||
022C/4 2.03+ initrd_addr_max Highest legal initrd address
|
||||
|
||||
For backwards compatibility, if the setup_sects field contains 0, the
|
||||
real value is 4.
|
||||
(1) For backwards compatibility, if the setup_sects field contains 0, the
|
||||
real value is 4.
|
||||
|
||||
(2) For boot protocol prior to 2.04, the upper two bytes of the syssize
|
||||
field are unusable, which means the size of a bzImage kernel
|
||||
cannot be determined.
|
||||
|
||||
If the "HdrS" (0x53726448) magic number is not found at offset 0x202,
|
||||
the boot protocol version is "old". Loading an old kernel, the
|
||||
|
@ -230,12 +235,16 @@ loader to communicate with the kernel. Some of its options are also
|
|||
relevant to the boot loader itself, see "special command line options"
|
||||
below.
|
||||
|
||||
The kernel command line is a null-terminated string up to 255
|
||||
characters long, plus the final null.
|
||||
The kernel command line is a null-terminated string currently up to
|
||||
255 characters long, plus the final null. A string that is too long
|
||||
will be automatically truncated by the kernel, a boot loader may allow
|
||||
a longer command line to be passed to permit future kernels to extend
|
||||
this limit.
|
||||
|
||||
If the boot protocol version is 2.02 or later, the address of the
|
||||
kernel command line is given by the header field cmd_line_ptr (see
|
||||
above.)
|
||||
above.) This address can be anywhere between the end of the setup
|
||||
heap and 0xA0000.
|
||||
|
||||
If the protocol version is *not* 2.02 or higher, the kernel
|
||||
command line is entered using the following protocol:
|
||||
|
@ -255,7 +264,7 @@ command line is entered using the following protocol:
|
|||
**** SAMPLE BOOT CONFIGURATION
|
||||
|
||||
As a sample configuration, assume the following layout of the real
|
||||
mode segment:
|
||||
mode segment (this is a typical, and recommended layout):
|
||||
|
||||
0x0000-0x7FFF Real mode kernel
|
||||
0x8000-0x8FFF Stack and heap
|
||||
|
@ -312,9 +321,9 @@ Such a boot loader should enter the following fields in the header:
|
|||
|
||||
**** LOADING THE REST OF THE KERNEL
|
||||
|
||||
The non-real-mode kernel starts at offset (setup_sects+1)*512 in the
|
||||
kernel file (again, if setup_sects == 0 the real value is 4.) It
|
||||
should be loaded at address 0x10000 for Image/zImage kernels and
|
||||
The 32-bit (non-real-mode) kernel starts at offset (setup_sects+1)*512
|
||||
in the kernel file (again, if setup_sects == 0 the real value is 4.)
|
||||
It should be loaded at address 0x10000 for Image/zImage kernels and
|
||||
0x100000 for bzImage kernels.
|
||||
|
||||
The kernel is a bzImage kernel if the protocol >= 2.00 and the 0x01
|
||||
|
|
194
Documentation/ia64/mca.txt
Normal file
194
Documentation/ia64/mca.txt
Normal file
|
@ -0,0 +1,194 @@
|
|||
An ad-hoc collection of notes on IA64 MCA and INIT processing. Feel
|
||||
free to update it with notes about any area that is not clear.
|
||||
|
||||
---
|
||||
|
||||
MCA/INIT are completely asynchronous. They can occur at any time, when
|
||||
the OS is in any state. Including when one of the cpus is already
|
||||
holding a spinlock. Trying to get any lock from MCA/INIT state is
|
||||
asking for deadlock. Also the state of structures that are protected
|
||||
by locks is indeterminate, including linked lists.
|
||||
|
||||
---
|
||||
|
||||
The complicated ia64 MCA process. All of this is mandated by Intel's
|
||||
specification for ia64 SAL, error recovery and and unwind, it is not as
|
||||
if we have a choice here.
|
||||
|
||||
* MCA occurs on one cpu, usually due to a double bit memory error.
|
||||
This is the monarch cpu.
|
||||
|
||||
* SAL sends an MCA rendezvous interrupt (which is a normal interrupt)
|
||||
to all the other cpus, the slaves.
|
||||
|
||||
* Slave cpus that receive the MCA interrupt call down into SAL, they
|
||||
end up spinning disabled while the MCA is being serviced.
|
||||
|
||||
* If any slave cpu was already spinning disabled when the MCA occurred
|
||||
then it cannot service the MCA interrupt. SAL waits ~20 seconds then
|
||||
sends an unmaskable INIT event to the slave cpus that have not
|
||||
already rendezvoused.
|
||||
|
||||
* Because MCA/INIT can be delivered at any time, including when the cpu
|
||||
is down in PAL in physical mode, the registers at the time of the
|
||||
event are _completely_ undefined. In particular the MCA/INIT
|
||||
handlers cannot rely on the thread pointer, PAL physical mode can
|
||||
(and does) modify TP. It is allowed to do that as long as it resets
|
||||
TP on return. However MCA/INIT events expose us to these PAL
|
||||
internal TP changes. Hence curr_task().
|
||||
|
||||
* If an MCA/INIT event occurs while the kernel was running (not user
|
||||
space) and the kernel has called PAL then the MCA/INIT handler cannot
|
||||
assume that the kernel stack is in a fit state to be used. Mainly
|
||||
because PAL may or may not maintain the stack pointer internally.
|
||||
Because the MCA/INIT handlers cannot trust the kernel stack, they
|
||||
have to use their own, per-cpu stacks. The MCA/INIT stacks are
|
||||
preformatted with just enough task state to let the relevant handlers
|
||||
do their job.
|
||||
|
||||
* Unlike most other architectures, the ia64 struct task is embedded in
|
||||
the kernel stack[1]. So switching to a new kernel stack means that
|
||||
we switch to a new task as well. Because various bits of the kernel
|
||||
assume that current points into the struct task, switching to a new
|
||||
stack also means a new value for current.
|
||||
|
||||
* Once all slaves have rendezvoused and are spinning disabled, the
|
||||
monarch is entered. The monarch now tries to diagnose the problem
|
||||
and decide if it can recover or not.
|
||||
|
||||
* Part of the monarch's job is to look at the state of all the other
|
||||
tasks. The only way to do that on ia64 is to call the unwinder,
|
||||
as mandated by Intel.
|
||||
|
||||
* The starting point for the unwind depends on whether a task is
|
||||
running or not. That is, whether it is on a cpu or is blocked. The
|
||||
monarch has to determine whether or not a task is on a cpu before it
|
||||
knows how to start unwinding it. The tasks that received an MCA or
|
||||
INIT event are no longer running, they have been converted to blocked
|
||||
tasks. But (and its a big but), the cpus that received the MCA
|
||||
rendezvous interrupt are still running on their normal kernel stacks!
|
||||
|
||||
* To distinguish between these two cases, the monarch must know which
|
||||
tasks are on a cpu and which are not. Hence each slave cpu that
|
||||
switches to an MCA/INIT stack, registers its new stack using
|
||||
set_curr_task(), so the monarch can tell that the _original_ task is
|
||||
no longer running on that cpu. That gives us a decent chance of
|
||||
getting a valid backtrace of the _original_ task.
|
||||
|
||||
* MCA/INIT can be nested, to a depth of 2 on any cpu. In the case of a
|
||||
nested error, we want diagnostics on the MCA/INIT handler that
|
||||
failed, not on the task that was originally running. Again this
|
||||
requires set_curr_task() so the MCA/INIT handlers can register their
|
||||
own stack as running on that cpu. Then a recursive error gets a
|
||||
trace of the failing handler's "task".
|
||||
|
||||
[1] My (Keith Owens) original design called for ia64 to separate its
|
||||
struct task and the kernel stacks. Then the MCA/INIT data would be
|
||||
chained stacks like i386 interrupt stacks. But that required
|
||||
radical surgery on the rest of ia64, plus extra hard wired TLB
|
||||
entries with its associated performance degradation. David
|
||||
Mosberger vetoed that approach. Which meant that separate kernel
|
||||
stacks meant separate "tasks" for the MCA/INIT handlers.
|
||||
|
||||
---
|
||||
|
||||
INIT is less complicated than MCA. Pressing the nmi button or using
|
||||
the equivalent command on the management console sends INIT to all
|
||||
cpus. SAL picks one one of the cpus as the monarch and the rest are
|
||||
slaves. All the OS INIT handlers are entered at approximately the same
|
||||
time. The OS monarch prints the state of all tasks and returns, after
|
||||
which the slaves return and the system resumes.
|
||||
|
||||
At least that is what is supposed to happen. Alas there are broken
|
||||
versions of SAL out there. Some drive all the cpus as monarchs. Some
|
||||
drive them all as slaves. Some drive one cpu as monarch, wait for that
|
||||
cpu to return from the OS then drive the rest as slaves. Some versions
|
||||
of SAL cannot even cope with returning from the OS, they spin inside
|
||||
SAL on resume. The OS INIT code has workarounds for some of these
|
||||
broken SAL symptoms, but some simply cannot be fixed from the OS side.
|
||||
|
||||
---
|
||||
|
||||
The scheduler hooks used by ia64 (curr_task, set_curr_task) are layer
|
||||
violations. Unfortunately MCA/INIT start off as massive layer
|
||||
violations (can occur at _any_ time) and they build from there.
|
||||
|
||||
At least ia64 makes an attempt at recovering from hardware errors, but
|
||||
it is a difficult problem because of the asynchronous nature of these
|
||||
errors. When processing an unmaskable interrupt we sometimes need
|
||||
special code to cope with our inability to take any locks.
|
||||
|
||||
---
|
||||
|
||||
How is ia64 MCA/INIT different from x86 NMI?
|
||||
|
||||
* x86 NMI typically gets delivered to one cpu. MCA/INIT gets sent to
|
||||
all cpus.
|
||||
|
||||
* x86 NMI cannot be nested. MCA/INIT can be nested, to a depth of 2
|
||||
per cpu.
|
||||
|
||||
* x86 has a separate struct task which points to one of multiple kernel
|
||||
stacks. ia64 has the struct task embedded in the single kernel
|
||||
stack, so switching stack means switching task.
|
||||
|
||||
* x86 does not call the BIOS so the NMI handler does not have to worry
|
||||
about any registers having changed. MCA/INIT can occur while the cpu
|
||||
is in PAL in physical mode, with undefined registers and an undefined
|
||||
kernel stack.
|
||||
|
||||
* i386 backtrace is not very sensitive to whether a process is running
|
||||
or not. ia64 unwind is very, very sensitive to whether a process is
|
||||
running or not.
|
||||
|
||||
---
|
||||
|
||||
What happens when MCA/INIT is delivered what a cpu is running user
|
||||
space code?
|
||||
|
||||
The user mode registers are stored in the RSE area of the MCA/INIT on
|
||||
entry to the OS and are restored from there on return to SAL, so user
|
||||
mode registers are preserved across a recoverable MCA/INIT. Since the
|
||||
OS has no idea what unwind data is available for the user space stack,
|
||||
MCA/INIT never tries to backtrace user space. Which means that the OS
|
||||
does not bother making the user space process look like a blocked task,
|
||||
i.e. the OS does not copy pt_regs and switch_stack to the user space
|
||||
stack. Also the OS has no idea how big the user space RSE and memory
|
||||
stacks are, which makes it too risky to copy the saved state to a user
|
||||
mode stack.
|
||||
|
||||
---
|
||||
|
||||
How do we get a backtrace on the tasks that were running when MCA/INIT
|
||||
was delivered?
|
||||
|
||||
mca.c:::ia64_mca_modify_original_stack(). That identifies and
|
||||
verifies the original kernel stack, copies the dirty registers from
|
||||
the MCA/INIT stack's RSE to the original stack's RSE, copies the
|
||||
skeleton struct pt_regs and switch_stack to the original stack, fills
|
||||
in the skeleton structures from the PAL minstate area and updates the
|
||||
original stack's thread.ksp. That makes the original stack look
|
||||
exactly like any other blocked task, i.e. it now appears to be
|
||||
sleeping. To get a backtrace, just start with thread.ksp for the
|
||||
original task and unwind like any other sleeping task.
|
||||
|
||||
---
|
||||
|
||||
How do we identify the tasks that were running when MCA/INIT was
|
||||
delivered?
|
||||
|
||||
If the previous task has been verified and converted to a blocked
|
||||
state, then sos->prev_task on the MCA/INIT stack is updated to point to
|
||||
the previous task. You can look at that field in dumps or debuggers.
|
||||
To help distinguish between the handler and the original tasks,
|
||||
handlers have _TIF_MCA_INIT set in thread_info.flags.
|
||||
|
||||
The sos data is always in the MCA/INIT handler stack, at offset
|
||||
MCA_SOS_OFFSET. You can get that value from mca_asm.h or calculate it
|
||||
as KERNEL_STACK_SIZE - sizeof(struct pt_regs) - sizeof(struct
|
||||
ia64_sal_os_state), with 16 byte alignment for all structures.
|
||||
|
||||
Also the comm field of the MCA/INIT task is modified to include the pid
|
||||
of the original task, for humans to use. For example, a comm field of
|
||||
'MCA 12159' means that pid 12159 was running when the MCA was
|
||||
delivered.
|
|
@ -1,16 +1,16 @@
|
|||
IBM ThinkPad ACPI Extras Driver
|
||||
|
||||
Version 0.8
|
||||
8 November 2004
|
||||
Version 0.12
|
||||
17 August 2005
|
||||
|
||||
Borislav Deianov <borislav@users.sf.net>
|
||||
http://ibm-acpi.sf.net/
|
||||
|
||||
|
||||
This is a Linux ACPI driver for the IBM ThinkPad laptops. It aims to
|
||||
support various features of these laptops which are accessible through
|
||||
the ACPI framework but not otherwise supported by the generic Linux
|
||||
ACPI drivers.
|
||||
This is a Linux ACPI driver for the IBM ThinkPad laptops. It supports
|
||||
various features of these laptops which are accessible through the
|
||||
ACPI framework but not otherwise supported by the generic Linux ACPI
|
||||
drivers.
|
||||
|
||||
|
||||
Status
|
||||
|
@ -25,9 +25,14 @@ detailed description):
|
|||
- ThinkLight on and off
|
||||
- limited docking and undocking
|
||||
- UltraBay eject
|
||||
- Experimental: CMOS control
|
||||
- Experimental: LED control
|
||||
- Experimental: ACPI sounds
|
||||
- CMOS control
|
||||
- LED control
|
||||
- ACPI sounds
|
||||
- temperature sensors
|
||||
- Experimental: embedded controller register dump
|
||||
- Experimental: LCD brightness control
|
||||
- Experimental: volume control
|
||||
- Experimental: fan speed, fan enable/disable
|
||||
|
||||
A compatibility table by model and feature is maintained on the web
|
||||
site, http://ibm-acpi.sf.net/. I appreciate any success or failure
|
||||
|
@ -91,12 +96,12 @@ driver is still in the alpha stage, the exact proc file format and
|
|||
commands supported by the various features is guaranteed to change
|
||||
frequently.
|
||||
|
||||
Driver Version -- /proc/acpi/ibm/driver
|
||||
--------------------------------------
|
||||
Driver version -- /proc/acpi/ibm/driver
|
||||
---------------------------------------
|
||||
|
||||
The driver name and version. No commands can be written to this file.
|
||||
|
||||
Hot Keys -- /proc/acpi/ibm/hotkey
|
||||
Hot keys -- /proc/acpi/ibm/hotkey
|
||||
---------------------------------
|
||||
|
||||
Without this driver, only the Fn-F4 key (sleep button) generates an
|
||||
|
@ -188,7 +193,7 @@ and, on the X40, video corruption. By disabling automatic switching,
|
|||
the flickering or video corruption can be avoided.
|
||||
|
||||
The video_switch command cycles through the available video outputs
|
||||
(it sumulates the behavior of Fn-F7).
|
||||
(it simulates the behavior of Fn-F7).
|
||||
|
||||
Video expansion can be toggled through this feature. This controls
|
||||
whether the display is expanded to fill the entire LCD screen when a
|
||||
|
@ -201,6 +206,12 @@ Fn-F7 from working. This also disables the video output switching
|
|||
features of this driver, as it uses the same ACPI methods as
|
||||
Fn-F7. Video switching on the console should still work.
|
||||
|
||||
UPDATE: There's now a patch for the X.org Radeon driver which
|
||||
addresses this issue. Some people are reporting success with the patch
|
||||
while others are still having problems. For more information:
|
||||
|
||||
https://bugs.freedesktop.org/show_bug.cgi?id=2000
|
||||
|
||||
ThinkLight control -- /proc/acpi/ibm/light
|
||||
------------------------------------------
|
||||
|
||||
|
@ -211,7 +222,7 @@ models which do not make the status available will show it as
|
|||
echo on > /proc/acpi/ibm/light
|
||||
echo off > /proc/acpi/ibm/light
|
||||
|
||||
Docking / Undocking -- /proc/acpi/ibm/dock
|
||||
Docking / undocking -- /proc/acpi/ibm/dock
|
||||
------------------------------------------
|
||||
|
||||
Docking and undocking (e.g. with the X4 UltraBase) requires some
|
||||
|
@ -228,11 +239,15 @@ NOTE: These events will only be generated if the laptop was docked
|
|||
when originally booted. This is due to the current lack of support for
|
||||
hot plugging of devices in the Linux ACPI framework. If the laptop was
|
||||
booted while not in the dock, the following message is shown in the
|
||||
logs: "ibm_acpi: dock device not present". No dock-related events are
|
||||
generated but the dock and undock commands described below still
|
||||
work. They can be executed manually or triggered by Fn key
|
||||
combinations (see the example acpid configuration files included in
|
||||
the driver tarball package available on the web site).
|
||||
logs:
|
||||
|
||||
Mar 17 01:42:34 aero kernel: ibm_acpi: dock device not present
|
||||
|
||||
In this case, no dock-related events are generated but the dock and
|
||||
undock commands described below still work. They can be executed
|
||||
manually or triggered by Fn key combinations (see the example acpid
|
||||
configuration files included in the driver tarball package available
|
||||
on the web site).
|
||||
|
||||
When the eject request button on the dock is pressed, the first event
|
||||
above is generated. The handler for this event should issue the
|
||||
|
@ -267,7 +282,7 @@ the only docking stations currently supported are the X-series
|
|||
UltraBase docks and "dumb" port replicators like the Mini Dock (the
|
||||
latter don't need any ACPI support, actually).
|
||||
|
||||
UltraBay Eject -- /proc/acpi/ibm/bay
|
||||
UltraBay eject -- /proc/acpi/ibm/bay
|
||||
------------------------------------
|
||||
|
||||
Inserting or ejecting an UltraBay device requires some actions to be
|
||||
|
@ -284,8 +299,11 @@ when the laptop was originally booted (on the X series, the UltraBay
|
|||
is in the dock, so it may not be present if the laptop was undocked).
|
||||
This is due to the current lack of support for hot plugging of devices
|
||||
in the Linux ACPI framework. If the laptop was booted without the
|
||||
UltraBay, the following message is shown in the logs: "ibm_acpi: bay
|
||||
device not present". No bay-related events are generated but the eject
|
||||
UltraBay, the following message is shown in the logs:
|
||||
|
||||
Mar 17 01:42:34 aero kernel: ibm_acpi: bay device not present
|
||||
|
||||
In this case, no bay-related events are generated but the eject
|
||||
command described below still works. It can be executed manually or
|
||||
triggered by a hot key combination.
|
||||
|
||||
|
@ -306,22 +324,33 @@ necessary to enable the UltraBay device (e.g. call idectl).
|
|||
The contents of the /proc/acpi/ibm/bay file shows the current status
|
||||
of the UltraBay, as provided by the ACPI framework.
|
||||
|
||||
Experimental Features
|
||||
---------------------
|
||||
EXPERIMENTAL warm eject support on the 600e/x, A22p and A3x (To use
|
||||
this feature, you need to supply the experimental=1 parameter when
|
||||
loading the module):
|
||||
|
||||
The following features are marked experimental because using them
|
||||
involves guessing the correct values of some parameters. Guessing
|
||||
incorrectly may have undesirable effects like crashing your
|
||||
ThinkPad. USE THESE WITH CAUTION! To activate them, you'll need to
|
||||
supply the experimental=1 parameter when loading the module.
|
||||
These models do not have a button near the UltraBay device to request
|
||||
a hot eject but rather require the laptop to be put to sleep
|
||||
(suspend-to-ram) before the bay device is ejected or inserted).
|
||||
The sequence of steps to eject the device is as follows:
|
||||
|
||||
Experimental: CMOS control - /proc/acpi/ibm/cmos
|
||||
------------------------------------------------
|
||||
echo eject > /proc/acpi/ibm/bay
|
||||
put the ThinkPad to sleep
|
||||
remove the drive
|
||||
resume from sleep
|
||||
cat /proc/acpi/ibm/bay should show that the drive was removed
|
||||
|
||||
On the A3x, both the UltraBay 2000 and UltraBay Plus devices are
|
||||
supported. Use "eject2" instead of "eject" for the second bay.
|
||||
|
||||
Note: the UltraBay eject support on the 600e/x, A22p and A3x is
|
||||
EXPERIMENTAL and may not work as expected. USE WITH CAUTION!
|
||||
|
||||
CMOS control -- /proc/acpi/ibm/cmos
|
||||
-----------------------------------
|
||||
|
||||
This feature is used internally by the ACPI firmware to control the
|
||||
ThinkLight on most newer ThinkPad models. It appears that it can also
|
||||
control LCD brightness, sounds volume and more, but only on some
|
||||
models.
|
||||
ThinkLight on most newer ThinkPad models. It may also control LCD
|
||||
brightness, sounds volume and more, but only on some models.
|
||||
|
||||
The commands are non-negative integer numbers:
|
||||
|
||||
|
@ -330,10 +359,9 @@ The commands are non-negative integer numbers:
|
|||
echo 2 >/proc/acpi/ibm/cmos
|
||||
...
|
||||
|
||||
The range of numbers which are used internally by various models is 0
|
||||
to 21, but it's possible that numbers outside this range have
|
||||
interesting behavior. Here is the behavior on the X40 (tpb is the
|
||||
ThinkPad Buttons utility):
|
||||
The range of valid numbers is 0 to 21, but not all have an effect and
|
||||
the behavior varies from model to model. Here is the behavior on the
|
||||
X40 (tpb is the ThinkPad Buttons utility):
|
||||
|
||||
0 - no effect but tpb reports "Volume down"
|
||||
1 - no effect but tpb reports "Volume up"
|
||||
|
@ -346,26 +374,18 @@ ThinkPad Buttons utility):
|
|||
13 - ThinkLight off
|
||||
14 - no effect but tpb reports ThinkLight status change
|
||||
|
||||
If you try this feature, please send me a report similar to the
|
||||
above. On models which allow control of LCD brightness or sound
|
||||
volume, I'd like to provide this functionality in an user-friendly
|
||||
way, but first I need a way to identify the models which this is
|
||||
possible.
|
||||
|
||||
Experimental: LED control - /proc/acpi/ibm/LED
|
||||
----------------------------------------------
|
||||
LED control -- /proc/acpi/ibm/led
|
||||
---------------------------------
|
||||
|
||||
Some of the LED indicators can be controlled through this feature. The
|
||||
available commands are:
|
||||
|
||||
echo <led number> on >/proc/acpi/ibm/led
|
||||
echo <led number> off >/proc/acpi/ibm/led
|
||||
echo <led number> blink >/proc/acpi/ibm/led
|
||||
echo '<led number> on' >/proc/acpi/ibm/led
|
||||
echo '<led number> off' >/proc/acpi/ibm/led
|
||||
echo '<led number> blink' >/proc/acpi/ibm/led
|
||||
|
||||
The <led number> parameter is a non-negative integer. The range of LED
|
||||
numbers used internally by various models is 0 to 7 but it's possible
|
||||
that numbers outside this range are also valid. Here is the mapping on
|
||||
the X40:
|
||||
The <led number> range is 0 to 7. The set of LEDs that can be
|
||||
controlled varies from model to model. Here is the mapping on the X40:
|
||||
|
||||
0 - power
|
||||
1 - battery (orange)
|
||||
|
@ -376,49 +396,224 @@ the X40:
|
|||
|
||||
All of the above can be turned on and off and can be made to blink.
|
||||
|
||||
If you try this feature, please send me a report similar to the
|
||||
above. I'd like to provide this functionality in an user-friendly way,
|
||||
but first I need to identify the which numbers correspond to which
|
||||
LEDs on various models.
|
||||
|
||||
Experimental: ACPI sounds - /proc/acpi/ibm/beep
|
||||
-----------------------------------------------
|
||||
ACPI sounds -- /proc/acpi/ibm/beep
|
||||
----------------------------------
|
||||
|
||||
The BEEP method is used internally by the ACPI firmware to provide
|
||||
audible alerts in various situtation. This feature allows the same
|
||||
audible alerts in various situations. This feature allows the same
|
||||
sounds to be triggered manually.
|
||||
|
||||
The commands are non-negative integer numbers:
|
||||
|
||||
echo 0 >/proc/acpi/ibm/beep
|
||||
echo 1 >/proc/acpi/ibm/beep
|
||||
echo 2 >/proc/acpi/ibm/beep
|
||||
...
|
||||
echo <number> >/proc/acpi/ibm/beep
|
||||
|
||||
The range of numbers which are used internally by various models is 0
|
||||
to 17, but it's possible that numbers outside this range are also
|
||||
valid. Here is the behavior on the X40:
|
||||
The valid <number> range is 0 to 17. Not all numbers trigger sounds
|
||||
and the sounds vary from model to model. Here is the behavior on the
|
||||
X40:
|
||||
|
||||
2 - two beeps, pause, third beep
|
||||
0 - stop a sound in progress (but use 17 to stop 16)
|
||||
2 - two beeps, pause, third beep ("low battery")
|
||||
3 - single beep
|
||||
4 - "unable"
|
||||
4 - high, followed by low-pitched beep ("unable")
|
||||
5 - single beep
|
||||
6 - "AC/DC"
|
||||
6 - very high, followed by high-pitched beep ("AC/DC")
|
||||
7 - high-pitched beep
|
||||
9 - three short beeps
|
||||
10 - very long beep
|
||||
12 - low-pitched beep
|
||||
15 - three high-pitched beeps repeating constantly, stop with 0
|
||||
16 - one medium-pitched beep repeating constantly, stop with 17
|
||||
17 - stop 16
|
||||
|
||||
(I've only been able to identify a couple of them).
|
||||
Temperature sensors -- /proc/acpi/ibm/thermal
|
||||
---------------------------------------------
|
||||
|
||||
If you try this feature, please send me a report similar to the
|
||||
above. I'd like to provide this functionality in an user-friendly way,
|
||||
but first I need to identify the which numbers correspond to which
|
||||
sounds on various models.
|
||||
Most ThinkPads include six or more separate temperature sensors but
|
||||
only expose the CPU temperature through the standard ACPI methods.
|
||||
This feature shows readings from up to eight different sensors. Some
|
||||
readings may not be valid, e.g. may show large negative values. For
|
||||
example, on the X40, a typical output may be:
|
||||
|
||||
temperatures: 42 42 45 41 36 -128 33 -128
|
||||
|
||||
Thomas Gruber took his R51 apart and traced all six active sensors in
|
||||
his laptop (the location of sensors may vary on other models):
|
||||
|
||||
1: CPU
|
||||
2: Mini PCI Module
|
||||
3: HDD
|
||||
4: GPU
|
||||
5: Battery
|
||||
6: N/A
|
||||
7: Battery
|
||||
8: N/A
|
||||
|
||||
No commands can be written to this file.
|
||||
|
||||
EXPERIMENTAL: Embedded controller reigster dump -- /proc/acpi/ibm/ecdump
|
||||
------------------------------------------------------------------------
|
||||
|
||||
This feature is marked EXPERIMENTAL because the implementation
|
||||
directly accesses hardware registers and may not work as expected. USE
|
||||
WITH CAUTION! To use this feature, you need to supply the
|
||||
experimental=1 parameter when loading the module.
|
||||
|
||||
This feature dumps the values of 256 embedded controller
|
||||
registers. Values which have changed since the last time the registers
|
||||
were dumped are marked with a star:
|
||||
|
||||
[root@x40 ibm-acpi]# cat /proc/acpi/ibm/ecdump
|
||||
EC +00 +01 +02 +03 +04 +05 +06 +07 +08 +09 +0a +0b +0c +0d +0e +0f
|
||||
EC 0x00: a7 47 87 01 fe 96 00 08 01 00 cb 00 00 00 40 00
|
||||
EC 0x10: 00 00 ff ff f4 3c 87 09 01 ff 42 01 ff ff 0d 00
|
||||
EC 0x20: 00 00 00 00 00 00 00 00 00 00 00 03 43 00 00 80
|
||||
EC 0x30: 01 07 1a 00 30 04 00 00 *85 00 00 10 00 50 00 00
|
||||
EC 0x40: 00 00 00 00 00 00 14 01 00 04 00 00 00 00 00 00
|
||||
EC 0x50: 00 c0 02 0d 00 01 01 02 02 03 03 03 03 *bc *02 *bc
|
||||
EC 0x60: *02 *bc *02 00 00 00 00 00 00 00 00 00 00 00 00 00
|
||||
EC 0x70: 00 00 00 00 00 12 30 40 *24 *26 *2c *27 *20 80 *1f 80
|
||||
EC 0x80: 00 00 00 06 *37 *0e 03 00 00 00 0e 07 00 00 00 00
|
||||
EC 0x90: 00 00 00 00 00 00 00 00 00 00 00 00 00 00 00 00
|
||||
EC 0xa0: *ff 09 ff 09 ff ff *64 00 *00 *00 *a2 41 *ff *ff *e0 00
|
||||
EC 0xb0: 00 00 00 00 00 00 00 00 00 00 00 00 00 00 00 00
|
||||
EC 0xc0: 00 00 00 00 00 00 00 00 00 00 00 00 00 00 00 00
|
||||
EC 0xd0: 03 00 00 00 00 00 00 00 00 00 00 00 00 00 00 00
|
||||
EC 0xe0: 00 00 00 00 00 00 00 00 11 20 49 04 24 06 55 03
|
||||
EC 0xf0: 31 55 48 54 35 38 57 57 08 2f 45 73 07 65 6c 1a
|
||||
|
||||
This feature can be used to determine the register holding the fan
|
||||
speed on some models. To do that, do the following:
|
||||
|
||||
- make sure the battery is fully charged
|
||||
- make sure the fan is running
|
||||
- run 'cat /proc/acpi/ibm/ecdump' several times, once per second or so
|
||||
|
||||
The first step makes sure various charging-related values don't
|
||||
vary. The second ensures that the fan-related values do vary, since
|
||||
the fan speed fluctuates a bit. The third will (hopefully) mark the
|
||||
fan register with a star:
|
||||
|
||||
[root@x40 ibm-acpi]# cat /proc/acpi/ibm/ecdump
|
||||
EC +00 +01 +02 +03 +04 +05 +06 +07 +08 +09 +0a +0b +0c +0d +0e +0f
|
||||
EC 0x00: a7 47 87 01 fe 96 00 08 01 00 cb 00 00 00 40 00
|
||||
EC 0x10: 00 00 ff ff f4 3c 87 09 01 ff 42 01 ff ff 0d 00
|
||||
EC 0x20: 00 00 00 00 00 00 00 00 00 00 00 03 43 00 00 80
|
||||
EC 0x30: 01 07 1a 00 30 04 00 00 85 00 00 10 00 50 00 00
|
||||
EC 0x40: 00 00 00 00 00 00 14 01 00 04 00 00 00 00 00 00
|
||||
EC 0x50: 00 c0 02 0d 00 01 01 02 02 03 03 03 03 bc 02 bc
|
||||
EC 0x60: 02 bc 02 00 00 00 00 00 00 00 00 00 00 00 00 00
|
||||
EC 0x70: 00 00 00 00 00 12 30 40 24 27 2c 27 21 80 1f 80
|
||||
EC 0x80: 00 00 00 06 *be 0d 03 00 00 00 0e 07 00 00 00 00
|
||||
EC 0x90: 00 00 00 00 00 00 00 00 00 00 00 00 00 00 00 00
|
||||
EC 0xa0: ff 09 ff 09 ff ff 64 00 00 00 a2 41 ff ff e0 00
|
||||
EC 0xb0: 00 00 00 00 00 00 00 00 00 00 00 00 00 00 00 00
|
||||
EC 0xc0: 00 00 00 00 00 00 00 00 00 00 00 00 00 00 00 00
|
||||
EC 0xd0: 03 00 00 00 00 00 00 00 00 00 00 00 00 00 00 00
|
||||
EC 0xe0: 00 00 00 00 00 00 00 00 11 20 49 04 24 06 55 03
|
||||
EC 0xf0: 31 55 48 54 35 38 57 57 08 2f 45 73 07 65 6c 1a
|
||||
|
||||
Another set of values that varies often is the temperature
|
||||
readings. Since temperatures don't change vary fast, you can take
|
||||
several quick dumps to eliminate them.
|
||||
|
||||
You can use a similar method to figure out the meaning of other
|
||||
embedded controller registers - e.g. make sure nothing else changes
|
||||
except the charging or discharging battery to determine which
|
||||
registers contain the current battery capacity, etc. If you experiment
|
||||
with this, do send me your results (including some complete dumps with
|
||||
a description of the conditions when they were taken.)
|
||||
|
||||
EXPERIMENTAL: LCD brightness control -- /proc/acpi/ibm/brightness
|
||||
-----------------------------------------------------------------
|
||||
|
||||
This feature is marked EXPERIMENTAL because the implementation
|
||||
directly accesses hardware registers and may not work as expected. USE
|
||||
WITH CAUTION! To use this feature, you need to supply the
|
||||
experimental=1 parameter when loading the module.
|
||||
|
||||
This feature allows software control of the LCD brightness on ThinkPad
|
||||
models which don't have a hardware brightness slider. The available
|
||||
commands are:
|
||||
|
||||
echo up >/proc/acpi/ibm/brightness
|
||||
echo down >/proc/acpi/ibm/brightness
|
||||
echo 'level <level>' >/proc/acpi/ibm/brightness
|
||||
|
||||
The <level> number range is 0 to 7, although not all of them may be
|
||||
distinct. The current brightness level is shown in the file.
|
||||
|
||||
EXPERIMENTAL: Volume control -- /proc/acpi/ibm/volume
|
||||
-----------------------------------------------------
|
||||
|
||||
This feature is marked EXPERIMENTAL because the implementation
|
||||
directly accesses hardware registers and may not work as expected. USE
|
||||
WITH CAUTION! To use this feature, you need to supply the
|
||||
experimental=1 parameter when loading the module.
|
||||
|
||||
This feature allows volume control on ThinkPad models which don't have
|
||||
a hardware volume knob. The available commands are:
|
||||
|
||||
echo up >/proc/acpi/ibm/volume
|
||||
echo down >/proc/acpi/ibm/volume
|
||||
echo mute >/proc/acpi/ibm/volume
|
||||
echo 'level <level>' >/proc/acpi/ibm/volume
|
||||
|
||||
The <level> number range is 0 to 15 although not all of them may be
|
||||
distinct. The unmute the volume after the mute command, use either the
|
||||
up or down command (the level command will not unmute the volume).
|
||||
The current volume level and mute state is shown in the file.
|
||||
|
||||
EXPERIMENTAL: fan speed, fan enable/disable -- /proc/acpi/ibm/fan
|
||||
-----------------------------------------------------------------
|
||||
|
||||
This feature is marked EXPERIMENTAL because the implementation
|
||||
directly accesses hardware registers and may not work as expected. USE
|
||||
WITH CAUTION! To use this feature, you need to supply the
|
||||
experimental=1 parameter when loading the module.
|
||||
|
||||
This feature attempts to show the current fan speed. The speed is read
|
||||
directly from the hardware registers of the embedded controller. This
|
||||
is known to work on later R, T and X series ThinkPads but may show a
|
||||
bogus value on other models.
|
||||
|
||||
The fan may be enabled or disabled with the following commands:
|
||||
|
||||
echo enable >/proc/acpi/ibm/fan
|
||||
echo disable >/proc/acpi/ibm/fan
|
||||
|
||||
WARNING WARNING WARNING: do not leave the fan disabled unless you are
|
||||
monitoring the temperature sensor readings and you are ready to enable
|
||||
it if necessary to avoid overheating.
|
||||
|
||||
The fan only runs if it's enabled *and* the various temperature
|
||||
sensors which control it read high enough. On the X40, this seems to
|
||||
depend on the CPU and HDD temperatures. Specifically, the fan is
|
||||
turned on when either the CPU temperature climbs to 56 degrees or the
|
||||
HDD temperature climbs to 46 degrees. The fan is turned off when the
|
||||
CPU temperature drops to 49 degrees and the HDD temperature drops to
|
||||
41 degrees. These thresholds cannot currently be controlled.
|
||||
|
||||
On the X31 and X40 (and ONLY on those models), the fan speed can be
|
||||
controlled to a certain degree. Once the fan is running, it can be
|
||||
forced to run faster or slower with the following command:
|
||||
|
||||
echo 'speed <speed>' > /proc/acpi/ibm/thermal
|
||||
|
||||
The sustainable range of fan speeds on the X40 appears to be from
|
||||
about 3700 to about 7350. Values outside this range either do not have
|
||||
any effect or the fan speed eventually settles somewhere in that
|
||||
range. The fan cannot be stopped or started with this command.
|
||||
|
||||
On the 570, temperature readings are not available through this
|
||||
feature and the fan control works a little differently. The fan speed
|
||||
is reported in levels from 0 (off) to 7 (max) and can be controlled
|
||||
with the following command:
|
||||
|
||||
echo 'level <level>' > /proc/acpi/ibm/thermal
|
||||
|
||||
|
||||
Multiple Command, Module Parameters
|
||||
-----------------------------------
|
||||
Multiple Commands, Module Parameters
|
||||
------------------------------------
|
||||
|
||||
Multiple commands can be written to the proc files in one shot by
|
||||
separating them with commas, for example:
|
||||
|
@ -451,24 +646,19 @@ scripts (included with ibm-acpi for completeness):
|
|||
/usr/local/sbin/laptop_mode -- from the Linux kernel source
|
||||
distribution, see Documentation/laptop-mode.txt
|
||||
/sbin/service -- comes with Redhat/Fedora distributions
|
||||
/usr/sbin/hibernate -- from the Software Suspend 2 distribution,
|
||||
see http://softwaresuspend.berlios.de/
|
||||
|
||||
Toan T Nguyen <ntt@control.uchicago.edu> has written a SuSE powersave
|
||||
script for the X20, included in config/usr/sbin/ibm_hotkeys_X20
|
||||
Toan T Nguyen <ntt@physics.ucla.edu> notes that Suse uses the
|
||||
powersave program to suspend ('powersave --suspend-to-ram') or
|
||||
hibernate ('powersave --suspend-to-disk'). This means that the
|
||||
hibernate script is not needed on that distribution.
|
||||
|
||||
Henrik Brix Andersen <brix@gentoo.org> has written a Gentoo ACPI event
|
||||
handler script for the X31. You can get the latest version from
|
||||
http://dev.gentoo.org/~brix/files/x31.sh
|
||||
|
||||
David Schweikert <dws@ee.eth.ch> has written an alternative blank.sh
|
||||
script which works on Debian systems, included in
|
||||
configs/etc/acpi/actions/blank-debian.sh
|
||||
|
||||
|
||||
TODO
|
||||
----
|
||||
|
||||
I'd like to implement the following features but haven't yet found the
|
||||
time and/or I don't yet know how to implement them:
|
||||
|
||||
- UltraBay floppy drive support
|
||||
|
||||
script which works on Debian systems. This scripts has now been
|
||||
extended to also work on Fedora systems and included as the default
|
||||
blank.sh in the distribution.
|
||||
|
|
84
Documentation/input/appletouch.txt
Normal file
84
Documentation/input/appletouch.txt
Normal file
|
@ -0,0 +1,84 @@
|
|||
Apple Touchpad Driver (appletouch)
|
||||
----------------------------------
|
||||
Copyright (C) 2005 Stelian Pop <stelian@popies.net>
|
||||
|
||||
appletouch is a Linux kernel driver for the USB touchpad found on post
|
||||
February 2005 Apple Alu Powerbooks.
|
||||
|
||||
This driver is derived from Johannes Berg's appletrackpad driver[1], but it has
|
||||
been improved in some areas:
|
||||
* appletouch is a full kernel driver, no userspace program is necessary
|
||||
* appletouch can be interfaced with the synaptics X11 driver, in order
|
||||
to have touchpad acceleration, scrolling, etc.
|
||||
|
||||
Credits go to Johannes Berg for reverse-engineering the touchpad protocol,
|
||||
Frank Arnold for further improvements, and Alex Harper for some additional
|
||||
information about the inner workings of the touchpad sensors.
|
||||
|
||||
Usage:
|
||||
------
|
||||
|
||||
In order to use the touchpad in the basic mode, compile the driver and load
|
||||
the module. A new input device will be detected and you will be able to read
|
||||
the mouse data from /dev/input/mice (using gpm, or X11).
|
||||
|
||||
In X11, you can configure the touchpad to use the synaptics X11 driver, which
|
||||
will give additional functionalities, like acceleration, scrolling, 2 finger
|
||||
tap for middle button mouse emulation, 3 finger tap for right button mouse
|
||||
emulation, etc. In order to do this, make sure you're using a recent version of
|
||||
the synaptics driver (tested with 0.14.2, available from [2]), and configure a
|
||||
new input device in your X11 configuration file (take a look below for an
|
||||
example). For additional configuration, see the synaptics driver documentation.
|
||||
|
||||
Section "InputDevice"
|
||||
Identifier "Synaptics Touchpad"
|
||||
Driver "synaptics"
|
||||
Option "SendCoreEvents" "true"
|
||||
Option "Device" "/dev/input/mice"
|
||||
Option "Protocol" "auto-dev"
|
||||
Option "LeftEdge" "0"
|
||||
Option "RightEdge" "850"
|
||||
Option "TopEdge" "0"
|
||||
Option "BottomEdge" "645"
|
||||
Option "MinSpeed" "0.4"
|
||||
Option "MaxSpeed" "1"
|
||||
Option "AccelFactor" "0.02"
|
||||
Option "FingerLow" "0"
|
||||
Option "FingerHigh" "30"
|
||||
Option "MaxTapMove" "20"
|
||||
Option "MaxTapTime" "100"
|
||||
Option "HorizScrollDelta" "0"
|
||||
Option "VertScrollDelta" "30"
|
||||
Option "SHMConfig" "on"
|
||||
EndSection
|
||||
|
||||
Section "ServerLayout"
|
||||
...
|
||||
InputDevice "Mouse"
|
||||
InputDevice "Synaptics Touchpad"
|
||||
...
|
||||
EndSection
|
||||
|
||||
Fuzz problems:
|
||||
--------------
|
||||
|
||||
The touchpad sensors are very sensitive to heat, and will generate a lot of
|
||||
noise when the temperature changes. This is especially true when you power-on
|
||||
the laptop for the first time.
|
||||
|
||||
The appletouch driver tries to handle this noise and auto adapt itself, but it
|
||||
is not perfect. If finger movements are not recognized anymore, try reloading
|
||||
the driver.
|
||||
|
||||
You can activate debugging using the 'debug' module parameter. A value of 0
|
||||
deactivates any debugging, 1 activates tracing of invalid samples, 2 activates
|
||||
full tracing (each sample is being traced):
|
||||
modprobe appletouch debug=1
|
||||
or
|
||||
echo "1" > /sys/module/appletouch/parameters/debug
|
||||
|
||||
Links:
|
||||
------
|
||||
|
||||
[1]: http://johannes.sipsolutions.net/PowerBook/touchpad/
|
||||
[2]: http://web.telia.com/~u89404340/touchpad/index.html
|
216
Documentation/input/yealink.txt
Normal file
216
Documentation/input/yealink.txt
Normal file
|
@ -0,0 +1,216 @@
|
|||
Driver documentation for yealink usb-p1k phones
|
||||
|
||||
0. Status
|
||||
~~~~~~~~~
|
||||
The p1k is a relatively cheap usb 1.1 phone with:
|
||||
- keyboard full support, yealink.ko / input event API
|
||||
- LCD full support, yealink.ko / sysfs API
|
||||
- LED full support, yealink.ko / sysfs API
|
||||
- dialtone full support, yealink.ko / sysfs API
|
||||
- ringtone full support, yealink.ko / sysfs API
|
||||
- audio playback full support, snd_usb_audio.ko / alsa API
|
||||
- audio record full support, snd_usb_audio.ko / alsa API
|
||||
|
||||
For vendor documentation see http://www.yealink.com
|
||||
|
||||
|
||||
1. Compilation (stand alone version)
|
||||
~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~
|
||||
Currently only kernel 2.6.x.y versions are supported.
|
||||
In order to build the yealink.ko module do
|
||||
|
||||
make
|
||||
|
||||
If you encounter problems please check if in the MAKE_OPTS variable in
|
||||
the Makefile is pointing to the location where your kernel sources
|
||||
are located, default /usr/src/linux.
|
||||
|
||||
|
||||
1.1 Troubleshooting
|
||||
~~~~~~~~~~~~~~~~~~~
|
||||
Q: Module yealink compiled and installed without any problem but phone
|
||||
is not initialized and does not react to any actions.
|
||||
A: If you see something like:
|
||||
hiddev0: USB HID v1.00 Device [Yealink Network Technology Ltd. VOIP USB Phone
|
||||
in dmesg, it means that the hid driver has grabbed the device first. Try to
|
||||
load module yealink before any other usb hid driver. Please see the
|
||||
instructions provided by your distribution on module configuration.
|
||||
|
||||
Q: Phone is working now (displays version and accepts keypad input) but I can't
|
||||
find the sysfs files.
|
||||
A: The sysfs files are located on the particular usb endpoint. On most
|
||||
distributions you can do: "find /sys/ -name get_icons" for a hint.
|
||||
|
||||
|
||||
2. keyboard features
|
||||
~~~~~~~~~~~~~~~~~~~~
|
||||
The current mapping in the kernel is provided by the map_p1k_to_key
|
||||
function:
|
||||
|
||||
Physical USB-P1K button layout input events
|
||||
|
||||
|
||||
up up
|
||||
IN OUT left, right
|
||||
down down
|
||||
|
||||
pickup C hangup enter, backspace, escape
|
||||
1 2 3 1, 2, 3
|
||||
4 5 6 4, 5, 6,
|
||||
7 8 9 7, 8, 9,
|
||||
* 0 # *, 0, #,
|
||||
|
||||
The "up" and "down" keys, are symbolised by arrows on the button.
|
||||
The "pickup" and "hangup" keys are symbolised by a green and red phone
|
||||
on the button.
|
||||
|
||||
|
||||
3. LCD features
|
||||
~~~~~~~~~~~~~~~
|
||||
The LCD is divided and organised as a 3 line display:
|
||||
|
||||
|[] [][] [][] [][] in |[][]
|
||||
|[] M [][] D [][] : [][] out |[][]
|
||||
store
|
||||
|
||||
NEW REP SU MO TU WE TH FR SA
|
||||
|
||||
[] [] [] [] [] [] [] [] [] [] [] []
|
||||
[] [] [] [] [] [] [] [] [] [] [] []
|
||||
|
||||
|
||||
Line 1 Format (see below) : 18.e8.M8.88...188
|
||||
Icon names : M D : IN OUT STORE
|
||||
Line 2 Format : .........
|
||||
Icon name : NEW REP SU MO TU WE TH FR SA
|
||||
Line 3 Format : 888888888888
|
||||
|
||||
|
||||
Format description:
|
||||
From a user space perspective the world is seperated in "digits" and "icons".
|
||||
A digit can have a character set, an icon can only be ON or OFF.
|
||||
|
||||
Format specifier
|
||||
'8' : Generic 7 segment digit with individual addressable segments
|
||||
|
||||
Reduced capabillity 7 segm digit, when segments are hard wired together.
|
||||
'1' : 2 segments digit only able to produce a 1.
|
||||
'e' : Most significant day of the month digit,
|
||||
able to produce at least 1 2 3.
|
||||
'M' : Most significant minute digit,
|
||||
able to produce at least 0 1 2 3 4 5.
|
||||
|
||||
Icons or pictograms:
|
||||
'.' : For example like AM, PM, SU, a 'dot' .. or other single segment
|
||||
elements.
|
||||
|
||||
|
||||
4. Driver usage
|
||||
~~~~~~~~~~~~~~~
|
||||
For userland the following interfaces are available using the sysfs interface:
|
||||
/sys/.../
|
||||
line1 Read/Write, lcd line1
|
||||
line2 Read/Write, lcd line2
|
||||
line3 Read/Write, lcd line3
|
||||
|
||||
get_icons Read, returns a set of available icons.
|
||||
hide_icon Write, hide the element by writing the icon name.
|
||||
show_icon Write, display the element by writing the icon name.
|
||||
|
||||
map_seg7 Read/Write, the 7 segments char set, common for all
|
||||
yealink phones. (see map_to_7segment.h)
|
||||
|
||||
ringtone Write, upload binary representation of a ringtone,
|
||||
see yealink.c. status EXPERIMENTAL due to potential
|
||||
races between async. and sync usb calls.
|
||||
|
||||
|
||||
4.1 lineX
|
||||
~~~~~~~~~
|
||||
Reading /sys/../lineX will return the format string with its current value:
|
||||
|
||||
Example:
|
||||
cat ./line3
|
||||
888888888888
|
||||
Linux Rocks!
|
||||
|
||||
Writing to /sys/../lineX will set the coresponding LCD line.
|
||||
- Excess characters are ignored.
|
||||
- If less characters are written than allowed, the remaining digits are
|
||||
unchanged.
|
||||
- The tab '\t'and '\n' char does not overwrite the original content.
|
||||
- Writing a space to an icon will always hide its content.
|
||||
|
||||
Example:
|
||||
date +"%m.%e.%k:%M" | sed 's/^0/ /' > ./line1
|
||||
|
||||
Will update the LCD with the current date & time.
|
||||
|
||||
|
||||
4.2 get_icons
|
||||
~~~~~~~~~~~~~
|
||||
Reading will return all available icon names and its current settings:
|
||||
|
||||
cat ./get_icons
|
||||
on M
|
||||
on D
|
||||
on :
|
||||
IN
|
||||
OUT
|
||||
STORE
|
||||
NEW
|
||||
REP
|
||||
SU
|
||||
MO
|
||||
TU
|
||||
WE
|
||||
TH
|
||||
FR
|
||||
SA
|
||||
LED
|
||||
DIALTONE
|
||||
RINGTONE
|
||||
|
||||
|
||||
4.3 show/hide icons
|
||||
~~~~~~~~~~~~~~~~~~~
|
||||
Writing to these files will update the state of the icon.
|
||||
Only one icon at a time can be updated.
|
||||
|
||||
If an icon is also on a ./lineX the corresponding value is
|
||||
updated with the first letter of the icon.
|
||||
|
||||
Example - light up the store icon:
|
||||
echo -n "STORE" > ./show_icon
|
||||
|
||||
cat ./line1
|
||||
18.e8.M8.88...188
|
||||
S
|
||||
|
||||
Example - sound the ringtone for 10 seconds:
|
||||
echo -n RINGTONE > /sys/..../show_icon
|
||||
sleep 10
|
||||
echo -n RINGTONE > /sys/..../hide_icon
|
||||
|
||||
|
||||
5. Sound features
|
||||
~~~~~~~~~~~~~~~~~
|
||||
Sound is supported by the ALSA driver: snd_usb_audio
|
||||
|
||||
One 16-bit channel with sample and playback rates of 8000 Hz is the practical
|
||||
limit of the device.
|
||||
|
||||
Example - recording test:
|
||||
arecord -v -d 10 -r 8000 -f S16_LE -t wav foobar.wav
|
||||
|
||||
Example - playback test:
|
||||
aplay foobar.wav
|
||||
|
||||
|
||||
6. Credits & Acknowledgments
|
||||
~~~~~~~~~~~~~~~~~~~~~~~~~~~~
|
||||
- Olivier Vandorpe, for starting the usbb2k-api project doing much of
|
||||
the reverse engineering.
|
||||
- Martin Diehl, for pointing out how to handle USB memory allocation.
|
||||
- Dmitry Torokhov, for the numerous code reviews and suggestions.
|
||||
|
|
@ -878,7 +878,7 @@ DVD_READ_STRUCT Read structure
|
|||
|
||||
error returns:
|
||||
EINVAL physical.layer_num exceeds number of layers
|
||||
EIO Recieved invalid response from drive
|
||||
EIO Received invalid response from drive
|
||||
|
||||
|
||||
|
||||
|
|
|
@ -31,7 +31,7 @@ This document describes the Linux kernel Makefiles.
|
|||
|
||||
=== 6 Architecture Makefiles
|
||||
--- 6.1 Set variables to tweak the build to the architecture
|
||||
--- 6.2 Add prerequisites to prepare:
|
||||
--- 6.2 Add prerequisites to archprepare:
|
||||
--- 6.3 List directories to visit when descending
|
||||
--- 6.4 Architecture specific boot images
|
||||
--- 6.5 Building non-kbuild targets
|
||||
|
@ -734,18 +734,18 @@ When kbuild executes the following steps are followed (roughly):
|
|||
for loadable kernel modules.
|
||||
|
||||
|
||||
--- 6.2 Add prerequisites to prepare:
|
||||
--- 6.2 Add prerequisites to archprepare:
|
||||
|
||||
The prepare: rule is used to list prerequisites that needs to be
|
||||
The archprepare: rule is used to list prerequisites that needs to be
|
||||
built before starting to descend down in the subdirectories.
|
||||
This is usual header files containing assembler constants.
|
||||
|
||||
Example:
|
||||
#arch/s390/Makefile
|
||||
prepare: include/asm-$(ARCH)/offsets.h
|
||||
#arch/arm/Makefile
|
||||
archprepare: maketools
|
||||
|
||||
In this example the file include/asm-$(ARCH)/offsets.h will
|
||||
be built before descending down in the subdirectories.
|
||||
In this example the file target maketools will be processed
|
||||
before descending down in the subdirectories.
|
||||
See also chapter XXX-TODO that describe how kbuild supports
|
||||
generating offset header files.
|
||||
|
||||
|
@ -872,7 +872,13 @@ When kbuild executes the following steps are followed (roughly):
|
|||
Assignments to $(targets) are without $(obj)/ prefix.
|
||||
if_changed may be used in conjunction with custom commands as
|
||||
defined in 6.7 "Custom kbuild commands".
|
||||
|
||||
Note: It is a typical mistake to forget the FORCE prerequisite.
|
||||
Another common pitfall is that whitespace is sometimes
|
||||
significant; for instance, the below will fail (note the extra space
|
||||
after the comma):
|
||||
target: source(s) FORCE
|
||||
#WRONG!# $(call if_changed, ld/objcopy/gzip)
|
||||
|
||||
ld
|
||||
Link target. Often LDFLAGS_$@ is used to set specific options to ld.
|
||||
|
|
|
@ -39,8 +39,7 @@ SETUP
|
|||
and apply http://lse.sourceforge.net/kdump/patches/kexec-tools-1.101-kdump.patch
|
||||
and after that build the source.
|
||||
|
||||
2) Download and build the appropriate (latest) kexec/kdump (-mm) kernel
|
||||
patchset and apply it to the vanilla kernel tree.
|
||||
2) Download and build the appropriate (2.6.13-rc1 onwards) vanilla kernel.
|
||||
|
||||
Two kernels need to be built in order to get this feature working.
|
||||
|
||||
|
@ -67,11 +66,11 @@ SETUP
|
|||
c) Enable "/proc/vmcore support" (Optional, in Pseudo filesystems).
|
||||
CONFIG_PROC_VMCORE=y
|
||||
d) Disable SMP support and build a UP kernel (Until it is fixed).
|
||||
CONFIG_SMP=n
|
||||
CONFIG_SMP=n
|
||||
e) Enable "Local APIC support on uniprocessors".
|
||||
CONFIG_X86_UP_APIC=y
|
||||
CONFIG_X86_UP_APIC=y
|
||||
f) Enable "IO-APIC support on uniprocessors"
|
||||
CONFIG_X86_UP_IOAPIC=y
|
||||
CONFIG_X86_UP_IOAPIC=y
|
||||
|
||||
Note: i) Options a) and b) depend upon "Configure standard kernel features
|
||||
(for small systems)" (under General setup).
|
||||
|
@ -84,17 +83,23 @@ SETUP
|
|||
|
||||
4) Load the second kernel to be booted using:
|
||||
|
||||
kexec -p <second-kernel> --crash-dump --args-linux --append="root=<root-dev>
|
||||
init 1 irqpoll"
|
||||
kexec -p <second-kernel> --args-linux --elf32-core-headers
|
||||
--append="root=<root-dev> init 1 irqpoll"
|
||||
|
||||
Note: i) <second-kernel> has to be a vmlinux image. bzImage will not work,
|
||||
as of now.
|
||||
ii) By default ELF headers are stored in ELF32 format (for i386). This
|
||||
is sufficient to represent the physical memory up to 4GB. To store
|
||||
headers in ELF64 format, specifiy "--elf64-core-headers" on the
|
||||
kexec command line additionally.
|
||||
ii) By default ELF headers are stored in ELF64 format. Option
|
||||
--elf32-core-headers forces generation of ELF32 headers. gdb can
|
||||
not open ELF64 headers on 32 bit systems. So creating ELF32
|
||||
headers can come handy for users who have got non-PAE systems and
|
||||
hence have memory less than 4GB.
|
||||
iii) Specify "irqpoll" as command line parameter. This reduces driver
|
||||
initialization failures in second kernel due to shared interrupts.
|
||||
iv) <root-dev> needs to be specified in a format corresponding to
|
||||
the root device name in the output of mount command.
|
||||
v) If you have built the drivers required to mount root file
|
||||
system as modules in <second-kernel>, then, specify
|
||||
--initrd=<initrd-for-second-kernel>.
|
||||
|
||||
5) System reboots into the second kernel when a panic occurs. A module can be
|
||||
written to force the panic or "ALT-SysRq-c" can be used initiate a crash
|
||||
|
|
File diff suppressed because it is too large
Load diff
161
Documentation/keys-request-key.txt
Normal file
161
Documentation/keys-request-key.txt
Normal file
|
@ -0,0 +1,161 @@
|
|||
===================
|
||||
KEY REQUEST SERVICE
|
||||
===================
|
||||
|
||||
The key request service is part of the key retention service (refer to
|
||||
Documentation/keys.txt). This document explains more fully how that the
|
||||
requesting algorithm works.
|
||||
|
||||
The process starts by either the kernel requesting a service by calling
|
||||
request_key():
|
||||
|
||||
struct key *request_key(const struct key_type *type,
|
||||
const char *description,
|
||||
const char *callout_string);
|
||||
|
||||
Or by userspace invoking the request_key system call:
|
||||
|
||||
key_serial_t request_key(const char *type,
|
||||
const char *description,
|
||||
const char *callout_info,
|
||||
key_serial_t dest_keyring);
|
||||
|
||||
The main difference between the two access points is that the in-kernel
|
||||
interface does not need to link the key to a keyring to prevent it from being
|
||||
immediately destroyed. The kernel interface returns a pointer directly to the
|
||||
key, and it's up to the caller to destroy the key.
|
||||
|
||||
The userspace interface links the key to a keyring associated with the process
|
||||
to prevent the key from going away, and returns the serial number of the key to
|
||||
the caller.
|
||||
|
||||
|
||||
===========
|
||||
THE PROCESS
|
||||
===========
|
||||
|
||||
A request proceeds in the following manner:
|
||||
|
||||
(1) Process A calls request_key() [the userspace syscall calls the kernel
|
||||
interface].
|
||||
|
||||
(2) request_key() searches the process's subscribed keyrings to see if there's
|
||||
a suitable key there. If there is, it returns the key. If there isn't, and
|
||||
callout_info is not set, an error is returned. Otherwise the process
|
||||
proceeds to the next step.
|
||||
|
||||
(3) request_key() sees that A doesn't have the desired key yet, so it creates
|
||||
two things:
|
||||
|
||||
(a) An uninstantiated key U of requested type and description.
|
||||
|
||||
(b) An authorisation key V that refers to key U and notes that process A
|
||||
is the context in which key U should be instantiated and secured, and
|
||||
from which associated key requests may be satisfied.
|
||||
|
||||
(4) request_key() then forks and executes /sbin/request-key with a new session
|
||||
keyring that contains a link to auth key V.
|
||||
|
||||
(5) /sbin/request-key execs an appropriate program to perform the actual
|
||||
instantiation.
|
||||
|
||||
(6) The program may want to access another key from A's context (say a
|
||||
Kerberos TGT key). It just requests the appropriate key, and the keyring
|
||||
search notes that the session keyring has auth key V in its bottom level.
|
||||
|
||||
This will permit it to then search the keyrings of process A with the
|
||||
UID, GID, groups and security info of process A as if it was process A,
|
||||
and come up with key W.
|
||||
|
||||
(7) The program then does what it must to get the data with which to
|
||||
instantiate key U, using key W as a reference (perhaps it contacts a
|
||||
Kerberos server using the TGT) and then instantiates key U.
|
||||
|
||||
(8) Upon instantiating key U, auth key V is automatically revoked so that it
|
||||
may not be used again.
|
||||
|
||||
(9) The program then exits 0 and request_key() deletes key V and returns key
|
||||
U to the caller.
|
||||
|
||||
This also extends further. If key W (step 5 above) didn't exist, key W would be
|
||||
created uninstantiated, another auth key (X) would be created [as per step 3]
|
||||
and another copy of /sbin/request-key spawned [as per step 4]; but the context
|
||||
specified by auth key X will still be process A, as it was in auth key V.
|
||||
|
||||
This is because process A's keyrings can't simply be attached to
|
||||
/sbin/request-key at the appropriate places because (a) execve will discard two
|
||||
of them, and (b) it requires the same UID/GID/Groups all the way through.
|
||||
|
||||
|
||||
======================
|
||||
NEGATIVE INSTANTIATION
|
||||
======================
|
||||
|
||||
Rather than instantiating a key, it is possible for the possessor of an
|
||||
authorisation key to negatively instantiate a key that's under construction.
|
||||
This is a short duration placeholder that causes any attempt at re-requesting
|
||||
the key whilst it exists to fail with error ENOKEY.
|
||||
|
||||
This is provided to prevent excessive repeated spawning of /sbin/request-key
|
||||
processes for a key that will never be obtainable.
|
||||
|
||||
Should the /sbin/request-key process exit anything other than 0 or die on a
|
||||
signal, the key under construction will be automatically negatively
|
||||
instantiated for a short amount of time.
|
||||
|
||||
|
||||
====================
|
||||
THE SEARCH ALGORITHM
|
||||
====================
|
||||
|
||||
A search of any particular keyring proceeds in the following fashion:
|
||||
|
||||
(1) When the key management code searches for a key (keyring_search_aux) it
|
||||
firstly calls key_permission(SEARCH) on the keyring it's starting with,
|
||||
if this denies permission, it doesn't search further.
|
||||
|
||||
(2) It considers all the non-keyring keys within that keyring and, if any key
|
||||
matches the criteria specified, calls key_permission(SEARCH) on it to see
|
||||
if the key is allowed to be found. If it is, that key is returned; if
|
||||
not, the search continues, and the error code is retained if of higher
|
||||
priority than the one currently set.
|
||||
|
||||
(3) It then considers all the keyring-type keys in the keyring it's currently
|
||||
searching. It calls key_permission(SEARCH) on each keyring, and if this
|
||||
grants permission, it recurses, executing steps (2) and (3) on that
|
||||
keyring.
|
||||
|
||||
The process stops immediately a valid key is found with permission granted to
|
||||
use it. Any error from a previous match attempt is discarded and the key is
|
||||
returned.
|
||||
|
||||
When search_process_keyrings() is invoked, it performs the following searches
|
||||
until one succeeds:
|
||||
|
||||
(1) If extant, the process's thread keyring is searched.
|
||||
|
||||
(2) If extant, the process's process keyring is searched.
|
||||
|
||||
(3) The process's session keyring is searched.
|
||||
|
||||
(4) If the process has a request_key() authorisation key in its session
|
||||
keyring then:
|
||||
|
||||
(a) If extant, the calling process's thread keyring is searched.
|
||||
|
||||
(b) If extant, the calling process's process keyring is searched.
|
||||
|
||||
(c) The calling process's session keyring is searched.
|
||||
|
||||
The moment one succeeds, all pending errors are discarded and the found key is
|
||||
returned.
|
||||
|
||||
Only if all these fail does the whole thing fail with the highest priority
|
||||
error. Note that several errors may have come from LSM.
|
||||
|
||||
The error priority is:
|
||||
|
||||
EKEYREVOKED > EKEYEXPIRED > ENOKEY
|
||||
|
||||
EACCES/EPERM are only returned on a direct search of a specific keyring where
|
||||
the basal keyring does not grant Search permission.
|
|
@ -195,8 +195,8 @@ KEY ACCESS PERMISSIONS
|
|||
======================
|
||||
|
||||
Keys have an owner user ID, a group access ID, and a permissions mask. The mask
|
||||
has up to eight bits each for user, group and other access. Only five of each
|
||||
set of eight bits are defined. These permissions granted are:
|
||||
has up to eight bits each for possessor, user, group and other access. Only
|
||||
six of each set of eight bits are defined. These permissions granted are:
|
||||
|
||||
(*) View
|
||||
|
||||
|
@ -224,6 +224,10 @@ set of eight bits are defined. These permissions granted are:
|
|||
keyring to a key, a process must have Write permission on the keyring and
|
||||
Link permission on the key.
|
||||
|
||||
(*) Set Attribute
|
||||
|
||||
This permits a key's UID, GID and permissions mask to be changed.
|
||||
|
||||
For changing the ownership, group ID or permissions mask, being the owner of
|
||||
the key or having the sysadmin capability is sufficient.
|
||||
|
||||
|
@ -241,16 +245,16 @@ about the status of the key service:
|
|||
type, description and permissions. The payload of the key is not available
|
||||
this way:
|
||||
|
||||
SERIAL FLAGS USAGE EXPY PERM UID GID TYPE DESCRIPTION: SUMMARY
|
||||
00000001 I----- 39 perm 1f0000 0 0 keyring _uid_ses.0: 1/4
|
||||
00000002 I----- 2 perm 1f0000 0 0 keyring _uid.0: empty
|
||||
00000007 I----- 1 perm 1f0000 0 0 keyring _pid.1: empty
|
||||
0000018d I----- 1 perm 1f0000 0 0 keyring _pid.412: empty
|
||||
000004d2 I--Q-- 1 perm 1f0000 32 -1 keyring _uid.32: 1/4
|
||||
000004d3 I--Q-- 3 perm 1f0000 32 -1 keyring _uid_ses.32: empty
|
||||
00000892 I--QU- 1 perm 1f0000 0 0 user metal:copper: 0
|
||||
00000893 I--Q-N 1 35s 1f0000 0 0 user metal:silver: 0
|
||||
00000894 I--Q-- 1 10h 1f0000 0 0 user metal:gold: 0
|
||||
SERIAL FLAGS USAGE EXPY PERM UID GID TYPE DESCRIPTION: SUMMARY
|
||||
00000001 I----- 39 perm 1f3f0000 0 0 keyring _uid_ses.0: 1/4
|
||||
00000002 I----- 2 perm 1f3f0000 0 0 keyring _uid.0: empty
|
||||
00000007 I----- 1 perm 1f3f0000 0 0 keyring _pid.1: empty
|
||||
0000018d I----- 1 perm 1f3f0000 0 0 keyring _pid.412: empty
|
||||
000004d2 I--Q-- 1 perm 1f3f0000 32 -1 keyring _uid.32: 1/4
|
||||
000004d3 I--Q-- 3 perm 1f3f0000 32 -1 keyring _uid_ses.32: empty
|
||||
00000892 I--QU- 1 perm 1f000000 0 0 user metal:copper: 0
|
||||
00000893 I--Q-N 1 35s 1f3f0000 0 0 user metal:silver: 0
|
||||
00000894 I--Q-- 1 10h 003f0000 0 0 user metal:gold: 0
|
||||
|
||||
The flags are:
|
||||
|
||||
|
@ -361,6 +365,8 @@ The main syscalls are:
|
|||
/sbin/request-key will be invoked in an attempt to obtain a key. The
|
||||
callout_info string will be passed as an argument to the program.
|
||||
|
||||
See also Documentation/keys-request-key.txt.
|
||||
|
||||
|
||||
The keyctl syscall functions are:
|
||||
|
||||
|
@ -533,8 +539,8 @@ The keyctl syscall functions are:
|
|||
|
||||
(*) Read the payload data from a key:
|
||||
|
||||
key_serial_t keyctl(KEYCTL_READ, key_serial_t keyring, char *buffer,
|
||||
size_t buflen);
|
||||
long keyctl(KEYCTL_READ, key_serial_t keyring, char *buffer,
|
||||
size_t buflen);
|
||||
|
||||
This function attempts to read the payload data from the specified key
|
||||
into the buffer. The process must have read permission on the key to
|
||||
|
@ -555,9 +561,9 @@ The keyctl syscall functions are:
|
|||
|
||||
(*) Instantiate a partially constructed key.
|
||||
|
||||
key_serial_t keyctl(KEYCTL_INSTANTIATE, key_serial_t key,
|
||||
const void *payload, size_t plen,
|
||||
key_serial_t keyring);
|
||||
long keyctl(KEYCTL_INSTANTIATE, key_serial_t key,
|
||||
const void *payload, size_t plen,
|
||||
key_serial_t keyring);
|
||||
|
||||
If the kernel calls back to userspace to complete the instantiation of a
|
||||
key, userspace should use this call to supply data for the key before the
|
||||
|
@ -576,8 +582,8 @@ The keyctl syscall functions are:
|
|||
|
||||
(*) Negatively instantiate a partially constructed key.
|
||||
|
||||
key_serial_t keyctl(KEYCTL_NEGATE, key_serial_t key,
|
||||
unsigned timeout, key_serial_t keyring);
|
||||
long keyctl(KEYCTL_NEGATE, key_serial_t key,
|
||||
unsigned timeout, key_serial_t keyring);
|
||||
|
||||
If the kernel calls back to userspace to complete the instantiation of a
|
||||
key, userspace should use this call mark the key as negative before the
|
||||
|
@ -637,6 +643,34 @@ call, and the key released upon close. How to deal with conflicting keys due to
|
|||
two different users opening the same file is left to the filesystem author to
|
||||
solve.
|
||||
|
||||
Note that there are two different types of pointers to keys that may be
|
||||
encountered:
|
||||
|
||||
(*) struct key *
|
||||
|
||||
This simply points to the key structure itself. Key structures will be at
|
||||
least four-byte aligned.
|
||||
|
||||
(*) key_ref_t
|
||||
|
||||
This is equivalent to a struct key *, but the least significant bit is set
|
||||
if the caller "possesses" the key. By "possession" it is meant that the
|
||||
calling processes has a searchable link to the key from one of its
|
||||
keyrings. There are three functions for dealing with these:
|
||||
|
||||
key_ref_t make_key_ref(const struct key *key,
|
||||
unsigned long possession);
|
||||
|
||||
struct key *key_ref_to_ptr(const key_ref_t key_ref);
|
||||
|
||||
unsigned long is_key_possessed(const key_ref_t key_ref);
|
||||
|
||||
The first function constructs a key reference from a key pointer and
|
||||
possession information (which must be 0 or 1 and not any other value).
|
||||
|
||||
The second function retrieves the key pointer from a reference and the
|
||||
third retrieves the possession flag.
|
||||
|
||||
When accessing a key's payload contents, certain precautions must be taken to
|
||||
prevent access vs modification races. See the section "Notes on accessing
|
||||
payload contents" for more information.
|
||||
|
@ -660,12 +694,18 @@ payload contents" for more information.
|
|||
If successful, the key will have been attached to the default keyring for
|
||||
implicitly obtained request-key keys, as set by KEYCTL_SET_REQKEY_KEYRING.
|
||||
|
||||
See also Documentation/keys-request-key.txt.
|
||||
|
||||
|
||||
(*) When it is no longer required, the key should be released using:
|
||||
|
||||
void key_put(struct key *key);
|
||||
|
||||
This can be called from interrupt context. If CONFIG_KEYS is not set then
|
||||
Or:
|
||||
|
||||
void key_ref_put(key_ref_t key_ref);
|
||||
|
||||
These can be called from interrupt context. If CONFIG_KEYS is not set then
|
||||
the argument will not be parsed.
|
||||
|
||||
|
||||
|
@ -689,13 +729,17 @@ payload contents" for more information.
|
|||
|
||||
(*) If a keyring was found in the search, this can be further searched by:
|
||||
|
||||
struct key *keyring_search(struct key *keyring,
|
||||
const struct key_type *type,
|
||||
const char *description)
|
||||
key_ref_t keyring_search(key_ref_t keyring_ref,
|
||||
const struct key_type *type,
|
||||
const char *description)
|
||||
|
||||
This searches the keyring tree specified for a matching key. Error ENOKEY
|
||||
is returned upon failure. If successful, the returned key will need to be
|
||||
released.
|
||||
is returned upon failure (use IS_ERR/PTR_ERR to determine). If successful,
|
||||
the returned key will need to be released.
|
||||
|
||||
The possession attribute from the keyring reference is used to control
|
||||
access through the permissions mask and is propagated to the returned key
|
||||
reference pointer if successful.
|
||||
|
||||
|
||||
(*) To check the validity of a key, this function can be called:
|
||||
|
@ -732,7 +776,7 @@ More complex payload contents must be allocated and a pointer to them set in
|
|||
key->payload.data. One of the following ways must be selected to access the
|
||||
data:
|
||||
|
||||
(1) Unmodifyable key type.
|
||||
(1) Unmodifiable key type.
|
||||
|
||||
If the key type does not have a modify method, then the key's payload can
|
||||
be accessed without any form of locking, provided that it's known to be
|
||||
|
|
588
Documentation/kprobes.txt
Normal file
588
Documentation/kprobes.txt
Normal file
|
@ -0,0 +1,588 @@
|
|||
Title : Kernel Probes (Kprobes)
|
||||
Authors : Jim Keniston <jkenisto@us.ibm.com>
|
||||
: Prasanna S Panchamukhi <prasanna@in.ibm.com>
|
||||
|
||||
CONTENTS
|
||||
|
||||
1. Concepts: Kprobes, Jprobes, Return Probes
|
||||
2. Architectures Supported
|
||||
3. Configuring Kprobes
|
||||
4. API Reference
|
||||
5. Kprobes Features and Limitations
|
||||
6. Probe Overhead
|
||||
7. TODO
|
||||
8. Kprobes Example
|
||||
9. Jprobes Example
|
||||
10. Kretprobes Example
|
||||
|
||||
1. Concepts: Kprobes, Jprobes, Return Probes
|
||||
|
||||
Kprobes enables you to dynamically break into any kernel routine and
|
||||
collect debugging and performance information non-disruptively. You
|
||||
can trap at almost any kernel code address, specifying a handler
|
||||
routine to be invoked when the breakpoint is hit.
|
||||
|
||||
There are currently three types of probes: kprobes, jprobes, and
|
||||
kretprobes (also called return probes). A kprobe can be inserted
|
||||
on virtually any instruction in the kernel. A jprobe is inserted at
|
||||
the entry to a kernel function, and provides convenient access to the
|
||||
function's arguments. A return probe fires when a specified function
|
||||
returns.
|
||||
|
||||
In the typical case, Kprobes-based instrumentation is packaged as
|
||||
a kernel module. The module's init function installs ("registers")
|
||||
one or more probes, and the exit function unregisters them. A
|
||||
registration function such as register_kprobe() specifies where
|
||||
the probe is to be inserted and what handler is to be called when
|
||||
the probe is hit.
|
||||
|
||||
The next three subsections explain how the different types of
|
||||
probes work. They explain certain things that you'll need to
|
||||
know in order to make the best use of Kprobes -- e.g., the
|
||||
difference between a pre_handler and a post_handler, and how
|
||||
to use the maxactive and nmissed fields of a kretprobe. But
|
||||
if you're in a hurry to start using Kprobes, you can skip ahead
|
||||
to section 2.
|
||||
|
||||
1.1 How Does a Kprobe Work?
|
||||
|
||||
When a kprobe is registered, Kprobes makes a copy of the probed
|
||||
instruction and replaces the first byte(s) of the probed instruction
|
||||
with a breakpoint instruction (e.g., int3 on i386 and x86_64).
|
||||
|
||||
When a CPU hits the breakpoint instruction, a trap occurs, the CPU's
|
||||
registers are saved, and control passes to Kprobes via the
|
||||
notifier_call_chain mechanism. Kprobes executes the "pre_handler"
|
||||
associated with the kprobe, passing the handler the addresses of the
|
||||
kprobe struct and the saved registers.
|
||||
|
||||
Next, Kprobes single-steps its copy of the probed instruction.
|
||||
(It would be simpler to single-step the actual instruction in place,
|
||||
but then Kprobes would have to temporarily remove the breakpoint
|
||||
instruction. This would open a small time window when another CPU
|
||||
could sail right past the probepoint.)
|
||||
|
||||
After the instruction is single-stepped, Kprobes executes the
|
||||
"post_handler," if any, that is associated with the kprobe.
|
||||
Execution then continues with the instruction following the probepoint.
|
||||
|
||||
1.2 How Does a Jprobe Work?
|
||||
|
||||
A jprobe is implemented using a kprobe that is placed on a function's
|
||||
entry point. It employs a simple mirroring principle to allow
|
||||
seamless access to the probed function's arguments. The jprobe
|
||||
handler routine should have the same signature (arg list and return
|
||||
type) as the function being probed, and must always end by calling
|
||||
the Kprobes function jprobe_return().
|
||||
|
||||
Here's how it works. When the probe is hit, Kprobes makes a copy of
|
||||
the saved registers and a generous portion of the stack (see below).
|
||||
Kprobes then points the saved instruction pointer at the jprobe's
|
||||
handler routine, and returns from the trap. As a result, control
|
||||
passes to the handler, which is presented with the same register and
|
||||
stack contents as the probed function. When it is done, the handler
|
||||
calls jprobe_return(), which traps again to restore the original stack
|
||||
contents and processor state and switch to the probed function.
|
||||
|
||||
By convention, the callee owns its arguments, so gcc may produce code
|
||||
that unexpectedly modifies that portion of the stack. This is why
|
||||
Kprobes saves a copy of the stack and restores it after the jprobe
|
||||
handler has run. Up to MAX_STACK_SIZE bytes are copied -- e.g.,
|
||||
64 bytes on i386.
|
||||
|
||||
Note that the probed function's args may be passed on the stack
|
||||
or in registers (e.g., for x86_64 or for an i386 fastcall function).
|
||||
The jprobe will work in either case, so long as the handler's
|
||||
prototype matches that of the probed function.
|
||||
|
||||
1.3 How Does a Return Probe Work?
|
||||
|
||||
When you call register_kretprobe(), Kprobes establishes a kprobe at
|
||||
the entry to the function. When the probed function is called and this
|
||||
probe is hit, Kprobes saves a copy of the return address, and replaces
|
||||
the return address with the address of a "trampoline." The trampoline
|
||||
is an arbitrary piece of code -- typically just a nop instruction.
|
||||
At boot time, Kprobes registers a kprobe at the trampoline.
|
||||
|
||||
When the probed function executes its return instruction, control
|
||||
passes to the trampoline and that probe is hit. Kprobes' trampoline
|
||||
handler calls the user-specified handler associated with the kretprobe,
|
||||
then sets the saved instruction pointer to the saved return address,
|
||||
and that's where execution resumes upon return from the trap.
|
||||
|
||||
While the probed function is executing, its return address is
|
||||
stored in an object of type kretprobe_instance. Before calling
|
||||
register_kretprobe(), the user sets the maxactive field of the
|
||||
kretprobe struct to specify how many instances of the specified
|
||||
function can be probed simultaneously. register_kretprobe()
|
||||
pre-allocates the indicated number of kretprobe_instance objects.
|
||||
|
||||
For example, if the function is non-recursive and is called with a
|
||||
spinlock held, maxactive = 1 should be enough. If the function is
|
||||
non-recursive and can never relinquish the CPU (e.g., via a semaphore
|
||||
or preemption), NR_CPUS should be enough. If maxactive <= 0, it is
|
||||
set to a default value. If CONFIG_PREEMPT is enabled, the default
|
||||
is max(10, 2*NR_CPUS). Otherwise, the default is NR_CPUS.
|
||||
|
||||
It's not a disaster if you set maxactive too low; you'll just miss
|
||||
some probes. In the kretprobe struct, the nmissed field is set to
|
||||
zero when the return probe is registered, and is incremented every
|
||||
time the probed function is entered but there is no kretprobe_instance
|
||||
object available for establishing the return probe.
|
||||
|
||||
2. Architectures Supported
|
||||
|
||||
Kprobes, jprobes, and return probes are implemented on the following
|
||||
architectures:
|
||||
|
||||
- i386
|
||||
- x86_64 (AMD-64, E64MT)
|
||||
- ppc64
|
||||
- ia64 (Support for probes on certain instruction types is still in progress.)
|
||||
- sparc64 (Return probes not yet implemented.)
|
||||
|
||||
3. Configuring Kprobes
|
||||
|
||||
When configuring the kernel using make menuconfig/xconfig/oldconfig,
|
||||
ensure that CONFIG_KPROBES is set to "y". Under "Kernel hacking",
|
||||
look for "Kprobes". You may have to enable "Kernel debugging"
|
||||
(CONFIG_DEBUG_KERNEL) before you can enable Kprobes.
|
||||
|
||||
You may also want to ensure that CONFIG_KALLSYMS and perhaps even
|
||||
CONFIG_KALLSYMS_ALL are set to "y", since kallsyms_lookup_name()
|
||||
is a handy, version-independent way to find a function's address.
|
||||
|
||||
If you need to insert a probe in the middle of a function, you may find
|
||||
it useful to "Compile the kernel with debug info" (CONFIG_DEBUG_INFO),
|
||||
so you can use "objdump -d -l vmlinux" to see the source-to-object
|
||||
code mapping.
|
||||
|
||||
4. API Reference
|
||||
|
||||
The Kprobes API includes a "register" function and an "unregister"
|
||||
function for each type of probe. Here are terse, mini-man-page
|
||||
specifications for these functions and the associated probe handlers
|
||||
that you'll write. See the latter half of this document for examples.
|
||||
|
||||
4.1 register_kprobe
|
||||
|
||||
#include <linux/kprobes.h>
|
||||
int register_kprobe(struct kprobe *kp);
|
||||
|
||||
Sets a breakpoint at the address kp->addr. When the breakpoint is
|
||||
hit, Kprobes calls kp->pre_handler. After the probed instruction
|
||||
is single-stepped, Kprobe calls kp->post_handler. If a fault
|
||||
occurs during execution of kp->pre_handler or kp->post_handler,
|
||||
or during single-stepping of the probed instruction, Kprobes calls
|
||||
kp->fault_handler. Any or all handlers can be NULL.
|
||||
|
||||
register_kprobe() returns 0 on success, or a negative errno otherwise.
|
||||
|
||||
User's pre-handler (kp->pre_handler):
|
||||
#include <linux/kprobes.h>
|
||||
#include <linux/ptrace.h>
|
||||
int pre_handler(struct kprobe *p, struct pt_regs *regs);
|
||||
|
||||
Called with p pointing to the kprobe associated with the breakpoint,
|
||||
and regs pointing to the struct containing the registers saved when
|
||||
the breakpoint was hit. Return 0 here unless you're a Kprobes geek.
|
||||
|
||||
User's post-handler (kp->post_handler):
|
||||
#include <linux/kprobes.h>
|
||||
#include <linux/ptrace.h>
|
||||
void post_handler(struct kprobe *p, struct pt_regs *regs,
|
||||
unsigned long flags);
|
||||
|
||||
p and regs are as described for the pre_handler. flags always seems
|
||||
to be zero.
|
||||
|
||||
User's fault-handler (kp->fault_handler):
|
||||
#include <linux/kprobes.h>
|
||||
#include <linux/ptrace.h>
|
||||
int fault_handler(struct kprobe *p, struct pt_regs *regs, int trapnr);
|
||||
|
||||
p and regs are as described for the pre_handler. trapnr is the
|
||||
architecture-specific trap number associated with the fault (e.g.,
|
||||
on i386, 13 for a general protection fault or 14 for a page fault).
|
||||
Returns 1 if it successfully handled the exception.
|
||||
|
||||
4.2 register_jprobe
|
||||
|
||||
#include <linux/kprobes.h>
|
||||
int register_jprobe(struct jprobe *jp)
|
||||
|
||||
Sets a breakpoint at the address jp->kp.addr, which must be the address
|
||||
of the first instruction of a function. When the breakpoint is hit,
|
||||
Kprobes runs the handler whose address is jp->entry.
|
||||
|
||||
The handler should have the same arg list and return type as the probed
|
||||
function; and just before it returns, it must call jprobe_return().
|
||||
(The handler never actually returns, since jprobe_return() returns
|
||||
control to Kprobes.) If the probed function is declared asmlinkage,
|
||||
fastcall, or anything else that affects how args are passed, the
|
||||
handler's declaration must match.
|
||||
|
||||
register_jprobe() returns 0 on success, or a negative errno otherwise.
|
||||
|
||||
4.3 register_kretprobe
|
||||
|
||||
#include <linux/kprobes.h>
|
||||
int register_kretprobe(struct kretprobe *rp);
|
||||
|
||||
Establishes a return probe for the function whose address is
|
||||
rp->kp.addr. When that function returns, Kprobes calls rp->handler.
|
||||
You must set rp->maxactive appropriately before you call
|
||||
register_kretprobe(); see "How Does a Return Probe Work?" for details.
|
||||
|
||||
register_kretprobe() returns 0 on success, or a negative errno
|
||||
otherwise.
|
||||
|
||||
User's return-probe handler (rp->handler):
|
||||
#include <linux/kprobes.h>
|
||||
#include <linux/ptrace.h>
|
||||
int kretprobe_handler(struct kretprobe_instance *ri, struct pt_regs *regs);
|
||||
|
||||
regs is as described for kprobe.pre_handler. ri points to the
|
||||
kretprobe_instance object, of which the following fields may be
|
||||
of interest:
|
||||
- ret_addr: the return address
|
||||
- rp: points to the corresponding kretprobe object
|
||||
- task: points to the corresponding task struct
|
||||
The handler's return value is currently ignored.
|
||||
|
||||
4.4 unregister_*probe
|
||||
|
||||
#include <linux/kprobes.h>
|
||||
void unregister_kprobe(struct kprobe *kp);
|
||||
void unregister_jprobe(struct jprobe *jp);
|
||||
void unregister_kretprobe(struct kretprobe *rp);
|
||||
|
||||
Removes the specified probe. The unregister function can be called
|
||||
at any time after the probe has been registered.
|
||||
|
||||
5. Kprobes Features and Limitations
|
||||
|
||||
As of Linux v2.6.12, Kprobes allows multiple probes at the same
|
||||
address. Currently, however, there cannot be multiple jprobes on
|
||||
the same function at the same time.
|
||||
|
||||
In general, you can install a probe anywhere in the kernel.
|
||||
In particular, you can probe interrupt handlers. Known exceptions
|
||||
are discussed in this section.
|
||||
|
||||
For obvious reasons, it's a bad idea to install a probe in
|
||||
the code that implements Kprobes (mostly kernel/kprobes.c and
|
||||
arch/*/kernel/kprobes.c). A patch in the v2.6.13 timeframe instructs
|
||||
Kprobes to reject such requests.
|
||||
|
||||
If you install a probe in an inline-able function, Kprobes makes
|
||||
no attempt to chase down all inline instances of the function and
|
||||
install probes there. gcc may inline a function without being asked,
|
||||
so keep this in mind if you're not seeing the probe hits you expect.
|
||||
|
||||
A probe handler can modify the environment of the probed function
|
||||
-- e.g., by modifying kernel data structures, or by modifying the
|
||||
contents of the pt_regs struct (which are restored to the registers
|
||||
upon return from the breakpoint). So Kprobes can be used, for example,
|
||||
to install a bug fix or to inject faults for testing. Kprobes, of
|
||||
course, has no way to distinguish the deliberately injected faults
|
||||
from the accidental ones. Don't drink and probe.
|
||||
|
||||
Kprobes makes no attempt to prevent probe handlers from stepping on
|
||||
each other -- e.g., probing printk() and then calling printk() from a
|
||||
probe handler. As of Linux v2.6.12, if a probe handler hits a probe,
|
||||
that second probe's handlers won't be run in that instance.
|
||||
|
||||
In Linux v2.6.12 and previous versions, Kprobes' data structures are
|
||||
protected by a single lock that is held during probe registration and
|
||||
unregistration and while handlers are run. Thus, no two handlers
|
||||
can run simultaneously. To improve scalability on SMP systems,
|
||||
this restriction will probably be removed soon, in which case
|
||||
multiple handlers (or multiple instances of the same handler) may
|
||||
run concurrently on different CPUs. Code your handlers accordingly.
|
||||
|
||||
Kprobes does not use semaphores or allocate memory except during
|
||||
registration and unregistration.
|
||||
|
||||
Probe handlers are run with preemption disabled. Depending on the
|
||||
architecture, handlers may also run with interrupts disabled. In any
|
||||
case, your handler should not yield the CPU (e.g., by attempting to
|
||||
acquire a semaphore).
|
||||
|
||||
Since a return probe is implemented by replacing the return
|
||||
address with the trampoline's address, stack backtraces and calls
|
||||
to __builtin_return_address() will typically yield the trampoline's
|
||||
address instead of the real return address for kretprobed functions.
|
||||
(As far as we can tell, __builtin_return_address() is used only
|
||||
for instrumentation and error reporting.)
|
||||
|
||||
If the number of times a function is called does not match the
|
||||
number of times it returns, registering a return probe on that
|
||||
function may produce undesirable results. We have the do_exit()
|
||||
and do_execve() cases covered. do_fork() is not an issue. We're
|
||||
unaware of other specific cases where this could be a problem.
|
||||
|
||||
6. Probe Overhead
|
||||
|
||||
On a typical CPU in use in 2005, a kprobe hit takes 0.5 to 1.0
|
||||
microseconds to process. Specifically, a benchmark that hits the same
|
||||
probepoint repeatedly, firing a simple handler each time, reports 1-2
|
||||
million hits per second, depending on the architecture. A jprobe or
|
||||
return-probe hit typically takes 50-75% longer than a kprobe hit.
|
||||
When you have a return probe set on a function, adding a kprobe at
|
||||
the entry to that function adds essentially no overhead.
|
||||
|
||||
Here are sample overhead figures (in usec) for different architectures.
|
||||
k = kprobe; j = jprobe; r = return probe; kr = kprobe + return probe
|
||||
on same function; jr = jprobe + return probe on same function
|
||||
|
||||
i386: Intel Pentium M, 1495 MHz, 2957.31 bogomips
|
||||
k = 0.57 usec; j = 1.00; r = 0.92; kr = 0.99; jr = 1.40
|
||||
|
||||
x86_64: AMD Opteron 246, 1994 MHz, 3971.48 bogomips
|
||||
k = 0.49 usec; j = 0.76; r = 0.80; kr = 0.82; jr = 1.07
|
||||
|
||||
ppc64: POWER5 (gr), 1656 MHz (SMT disabled, 1 virtual CPU per physical CPU)
|
||||
k = 0.77 usec; j = 1.31; r = 1.26; kr = 1.45; jr = 1.99
|
||||
|
||||
7. TODO
|
||||
|
||||
a. SystemTap (http://sourceware.org/systemtap): Work in progress
|
||||
to provide a simplified programming interface for probe-based
|
||||
instrumentation.
|
||||
b. Improved SMP scalability: Currently, work is in progress to handle
|
||||
multiple kprobes in parallel.
|
||||
c. Kernel return probes for sparc64.
|
||||
d. Support for other architectures.
|
||||
e. User-space probes.
|
||||
|
||||
8. Kprobes Example
|
||||
|
||||
Here's a sample kernel module showing the use of kprobes to dump a
|
||||
stack trace and selected i386 registers when do_fork() is called.
|
||||
----- cut here -----
|
||||
/*kprobe_example.c*/
|
||||
#include <linux/kernel.h>
|
||||
#include <linux/module.h>
|
||||
#include <linux/kprobes.h>
|
||||
#include <linux/kallsyms.h>
|
||||
#include <linux/sched.h>
|
||||
|
||||
/*For each probe you need to allocate a kprobe structure*/
|
||||
static struct kprobe kp;
|
||||
|
||||
/*kprobe pre_handler: called just before the probed instruction is executed*/
|
||||
int handler_pre(struct kprobe *p, struct pt_regs *regs)
|
||||
{
|
||||
printk("pre_handler: p->addr=0x%p, eip=%lx, eflags=0x%lx\n",
|
||||
p->addr, regs->eip, regs->eflags);
|
||||
dump_stack();
|
||||
return 0;
|
||||
}
|
||||
|
||||
/*kprobe post_handler: called after the probed instruction is executed*/
|
||||
void handler_post(struct kprobe *p, struct pt_regs *regs, unsigned long flags)
|
||||
{
|
||||
printk("post_handler: p->addr=0x%p, eflags=0x%lx\n",
|
||||
p->addr, regs->eflags);
|
||||
}
|
||||
|
||||
/* fault_handler: this is called if an exception is generated for any
|
||||
* instruction within the pre- or post-handler, or when Kprobes
|
||||
* single-steps the probed instruction.
|
||||
*/
|
||||
int handler_fault(struct kprobe *p, struct pt_regs *regs, int trapnr)
|
||||
{
|
||||
printk("fault_handler: p->addr=0x%p, trap #%dn",
|
||||
p->addr, trapnr);
|
||||
/* Return 0 because we don't handle the fault. */
|
||||
return 0;
|
||||
}
|
||||
|
||||
int init_module(void)
|
||||
{
|
||||
int ret;
|
||||
kp.pre_handler = handler_pre;
|
||||
kp.post_handler = handler_post;
|
||||
kp.fault_handler = handler_fault;
|
||||
kp.addr = (kprobe_opcode_t*) kallsyms_lookup_name("do_fork");
|
||||
/* register the kprobe now */
|
||||
if (!kp.addr) {
|
||||
printk("Couldn't find %s to plant kprobe\n", "do_fork");
|
||||
return -1;
|
||||
}
|
||||
if ((ret = register_kprobe(&kp) < 0)) {
|
||||
printk("register_kprobe failed, returned %d\n", ret);
|
||||
return -1;
|
||||
}
|
||||
printk("kprobe registered\n");
|
||||
return 0;
|
||||
}
|
||||
|
||||
void cleanup_module(void)
|
||||
{
|
||||
unregister_kprobe(&kp);
|
||||
printk("kprobe unregistered\n");
|
||||
}
|
||||
|
||||
MODULE_LICENSE("GPL");
|
||||
----- cut here -----
|
||||
|
||||
You can build the kernel module, kprobe-example.ko, using the following
|
||||
Makefile:
|
||||
----- cut here -----
|
||||
obj-m := kprobe-example.o
|
||||
KDIR := /lib/modules/$(shell uname -r)/build
|
||||
PWD := $(shell pwd)
|
||||
default:
|
||||
$(MAKE) -C $(KDIR) SUBDIRS=$(PWD) modules
|
||||
clean:
|
||||
rm -f *.mod.c *.ko *.o
|
||||
----- cut here -----
|
||||
|
||||
$ make
|
||||
$ su -
|
||||
...
|
||||
# insmod kprobe-example.ko
|
||||
|
||||
You will see the trace data in /var/log/messages and on the console
|
||||
whenever do_fork() is invoked to create a new process.
|
||||
|
||||
9. Jprobes Example
|
||||
|
||||
Here's a sample kernel module showing the use of jprobes to dump
|
||||
the arguments of do_fork().
|
||||
----- cut here -----
|
||||
/*jprobe-example.c */
|
||||
#include <linux/kernel.h>
|
||||
#include <linux/module.h>
|
||||
#include <linux/fs.h>
|
||||
#include <linux/uio.h>
|
||||
#include <linux/kprobes.h>
|
||||
#include <linux/kallsyms.h>
|
||||
|
||||
/*
|
||||
* Jumper probe for do_fork.
|
||||
* Mirror principle enables access to arguments of the probed routine
|
||||
* from the probe handler.
|
||||
*/
|
||||
|
||||
/* Proxy routine having the same arguments as actual do_fork() routine */
|
||||
long jdo_fork(unsigned long clone_flags, unsigned long stack_start,
|
||||
struct pt_regs *regs, unsigned long stack_size,
|
||||
int __user * parent_tidptr, int __user * child_tidptr)
|
||||
{
|
||||
printk("jprobe: clone_flags=0x%lx, stack_size=0x%lx, regs=0x%p\n",
|
||||
clone_flags, stack_size, regs);
|
||||
/* Always end with a call to jprobe_return(). */
|
||||
jprobe_return();
|
||||
/*NOTREACHED*/
|
||||
return 0;
|
||||
}
|
||||
|
||||
static struct jprobe my_jprobe = {
|
||||
.entry = (kprobe_opcode_t *) jdo_fork
|
||||
};
|
||||
|
||||
int init_module(void)
|
||||
{
|
||||
int ret;
|
||||
my_jprobe.kp.addr = (kprobe_opcode_t *) kallsyms_lookup_name("do_fork");
|
||||
if (!my_jprobe.kp.addr) {
|
||||
printk("Couldn't find %s to plant jprobe\n", "do_fork");
|
||||
return -1;
|
||||
}
|
||||
|
||||
if ((ret = register_jprobe(&my_jprobe)) <0) {
|
||||
printk("register_jprobe failed, returned %d\n", ret);
|
||||
return -1;
|
||||
}
|
||||
printk("Planted jprobe at %p, handler addr %p\n",
|
||||
my_jprobe.kp.addr, my_jprobe.entry);
|
||||
return 0;
|
||||
}
|
||||
|
||||
void cleanup_module(void)
|
||||
{
|
||||
unregister_jprobe(&my_jprobe);
|
||||
printk("jprobe unregistered\n");
|
||||
}
|
||||
|
||||
MODULE_LICENSE("GPL");
|
||||
----- cut here -----
|
||||
|
||||
Build and insert the kernel module as shown in the above kprobe
|
||||
example. You will see the trace data in /var/log/messages and on
|
||||
the console whenever do_fork() is invoked to create a new process.
|
||||
(Some messages may be suppressed if syslogd is configured to
|
||||
eliminate duplicate messages.)
|
||||
|
||||
10. Kretprobes Example
|
||||
|
||||
Here's a sample kernel module showing the use of return probes to
|
||||
report failed calls to sys_open().
|
||||
----- cut here -----
|
||||
/*kretprobe-example.c*/
|
||||
#include <linux/kernel.h>
|
||||
#include <linux/module.h>
|
||||
#include <linux/kprobes.h>
|
||||
#include <linux/kallsyms.h>
|
||||
|
||||
static const char *probed_func = "sys_open";
|
||||
|
||||
/* Return-probe handler: If the probed function fails, log the return value. */
|
||||
static int ret_handler(struct kretprobe_instance *ri, struct pt_regs *regs)
|
||||
{
|
||||
// Substitute the appropriate register name for your architecture --
|
||||
// e.g., regs->rax for x86_64, regs->gpr[3] for ppc64.
|
||||
int retval = (int) regs->eax;
|
||||
if (retval < 0) {
|
||||
printk("%s returns %d\n", probed_func, retval);
|
||||
}
|
||||
return 0;
|
||||
}
|
||||
|
||||
static struct kretprobe my_kretprobe = {
|
||||
.handler = ret_handler,
|
||||
/* Probe up to 20 instances concurrently. */
|
||||
.maxactive = 20
|
||||
};
|
||||
|
||||
int init_module(void)
|
||||
{
|
||||
int ret;
|
||||
my_kretprobe.kp.addr =
|
||||
(kprobe_opcode_t *) kallsyms_lookup_name(probed_func);
|
||||
if (!my_kretprobe.kp.addr) {
|
||||
printk("Couldn't find %s to plant return probe\n", probed_func);
|
||||
return -1;
|
||||
}
|
||||
if ((ret = register_kretprobe(&my_kretprobe)) < 0) {
|
||||
printk("register_kretprobe failed, returned %d\n", ret);
|
||||
return -1;
|
||||
}
|
||||
printk("Planted return probe at %p\n", my_kretprobe.kp.addr);
|
||||
return 0;
|
||||
}
|
||||
|
||||
void cleanup_module(void)
|
||||
{
|
||||
unregister_kretprobe(&my_kretprobe);
|
||||
printk("kretprobe unregistered\n");
|
||||
/* nmissed > 0 suggests that maxactive was set too low. */
|
||||
printk("Missed probing %d instances of %s\n",
|
||||
my_kretprobe.nmissed, probed_func);
|
||||
}
|
||||
|
||||
MODULE_LICENSE("GPL");
|
||||
----- cut here -----
|
||||
|
||||
Build and insert the kernel module as shown in the above kprobe
|
||||
example. You will see the trace data in /var/log/messages and on the
|
||||
console whenever sys_open() returns a negative value. (Some messages
|
||||
may be suppressed if syslogd is configured to eliminate duplicate
|
||||
messages.)
|
||||
|
||||
For additional information on Kprobes, refer to the following URLs:
|
||||
http://www-106.ibm.com/developerworks/library/l-kprobes.html?ca=dgr-lnxw42Kprobe
|
||||
http://www.redhat.com/magazine/005mar05/features/kprobes/
|
|
@ -626,7 +626,7 @@ ignored (others aren't affected).
|
|||
can be performed in optimal order. Not all SCSI devices support
|
||||
tagged queuing (:-().
|
||||
|
||||
4.6 switches=
|
||||
4.5 switches=
|
||||
-------------
|
||||
|
||||
Syntax: switches=<list of switches>
|
||||
|
@ -661,28 +661,6 @@ correctly.
|
|||
earlier initialization ("ov_"-less) takes precedence. But the
|
||||
switching-off on reset still happens in this case.
|
||||
|
||||
4.5) stram_swap=
|
||||
----------------
|
||||
|
||||
Syntax: stram_swap=<do_swap>[,<max_swap>]
|
||||
|
||||
This option is available only if the kernel has been compiled with
|
||||
CONFIG_STRAM_SWAP enabled. Normally, the kernel then determines
|
||||
dynamically whether to actually use ST-RAM as swap space. (Currently,
|
||||
the fraction of ST-RAM must be less or equal 1/3 of total memory to
|
||||
enable this swapping.) You can override the kernel's decision by
|
||||
specifying this option. 1 for <do_swap> means always enable the swap,
|
||||
even if you have less alternate RAM. 0 stands for never swap to
|
||||
ST-RAM, even if it's small enough compared to the rest of memory.
|
||||
|
||||
If ST-RAM swapping is enabled, the kernel usually uses all free
|
||||
ST-RAM as swap "device". If the kernel resides in ST-RAM, the region
|
||||
allocated by it is obviously never used for swapping :-) You can also
|
||||
limit this amount by specifying the second parameter, <max_swap>, if
|
||||
you want to use parts of ST-RAM as normal system memory. <max_swap> is
|
||||
in kBytes and the number should be a multiple of 4 (otherwise: rounded
|
||||
down).
|
||||
|
||||
5) Options for Amiga Only:
|
||||
==========================
|
||||
|
||||
|
|
Some files were not shown because too many files have changed in this diff Show more
Loading…
Add table
Reference in a new issue