ad4ecbcba7
Send per-tgid data only once during exit of a thread group instead of once
with each member thread exit.

Currently, when a thread exits, besides its per-tid data, the per-tgid data
of its thread group is also sent out, if its thread group is non-empty. The
per-tgid data sent consists of the sum of per-tid stats for all *remaining*
threads of the thread group.

This patch modifies this sending in two ways:

- the per-tgid data is sent only when the last thread of a thread group
  exits. This cuts down heavily on the overhead of sending/receiving
  per-tgid data, especially when other exploiters of the taskstats
  interface aren't interested in per-tgid stats

- the semantics of the per-tgid data sent are changed. Instead of being the
  sum of per-tid data for remaining threads, the value now sent is the true
  total accumulated statistics for all threads that are/were part of the
  thread group.

The patch also addresses a minor issue where failure of one accounting
subsystem to fill in the taskstats structure was causing the taskstats to
not be sent at all.

The patch has been tested for stability and run cerberus for over 4 hours on
an SMP.

[akpm@osdl.org: bugfixes]
Signed-off-by: Shailabh Nagar <nagar@watson.ibm.com>
Signed-off-by: Balbir Singh <balbir@in.ibm.com>
Cc: Jay Lan <jlan@engr.sgi.com>
Signed-off-by: Andrew Morton <akpm@osdl.org>
Signed-off-by: Linus Torvalds <torvalds@osdl.org>

Per-task statistics interface
-----------------------------


Taskstats is a netlink-based interface for sending per-task and
per-process statistics from the kernel to userspace.

Taskstats was designed for the following benefits:

- efficiently provide statistics during the lifetime of a task and on its exit
- unified interface for multiple accounting subsystems
- extensibility for use by future accounting patches

Terminology
-----------

"pid", "tid" and "task" are used interchangeably and refer to the standard
Linux task defined by struct task_struct. per-pid stats are the same as
per-task stats.

"tgid", "process" and "thread group" are used interchangeably and refer to the
tasks that share an mm_struct, i.e. the traditional Unix process. Despite the
use of tgid, there is no special treatment for the task that is the thread
group leader - a process is deemed alive as long as it has any task belonging
to it.

Usage
-----

To get statistics during a task's lifetime, userspace opens a unicast netlink
socket (NETLINK_GENERIC family) and sends commands specifying a pid or a tgid.
The response contains statistics for a task (if pid is specified) or the sum of
statistics for all tasks of the process (if tgid is specified).

To obtain statistics for tasks which are exiting, userspace opens a multicast
netlink socket. Each time a task exits, its per-pid statistics are always sent
by the kernel to each listener on the multicast socket. In addition, if it is
the last thread exiting its thread group, an additional record containing the
per-tgid stats is also sent. The latter contains the sum of per-pid stats for
all threads in the thread group, both past and present.

getdelays.c is a simple utility demonstrating usage of the taskstats interface
for reporting delay accounting statistics.

Interface
---------

The user-kernel interface is encapsulated in include/linux/taskstats.h

To avoid this documentation becoming obsolete as the interface evolves, only
an outline of the current version is given. taskstats.h always overrides the
description here.

struct taskstats is the common accounting structure for both per-pid and
per-tgid data. It is versioned and can be extended by each accounting subsystem
that is added to the kernel. The fields and their semantics are defined in the
taskstats.h file.

The data exchanged between user and kernel space is a netlink message belonging
to the NETLINK_GENERIC family and using the netlink attributes interface.
The messages are in the format

    +----------+- - -+-------------+-------------------+
    | nlmsghdr | Pad |  genlmsghdr | taskstats payload |
    +----------+- - -+-------------+-------------------+


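As an illustration of the message layout above, the following sketch fills a
caller-supplied buffer with a netlink header, a generic netlink header and one
attribute carrying a pid. It is not taken from getdelays.c. The helper name
build_pid_query() is hypothetical, and family_id stands for the generic netlink
family id registered under TASKSTATS_GENL_NAME, which userspace has to look up
at runtime (not shown). Sending the buffer over a socket is also left out.

```c
#include <assert.h>
#include <stdint.h>
#include <string.h>
#include <sys/socket.h>
#include <linux/netlink.h>
#include <linux/genetlink.h>
#include <linux/taskstats.h>

/*
 * Hypothetical helper: lay out "nlmsghdr | genlmsghdr | one attribute" in buf.
 * buf must be suitably aligned for struct nlmsghdr.
 * Returns the total message length, or 0 if buf is too small.
 */
static size_t build_pid_query(void *buf, size_t buflen,
                              uint16_t family_id, uint32_t pid)
{
    const size_t attr_len = NLA_HDRLEN + sizeof(pid);
    const size_t total = NLMSG_LENGTH(GENL_HDRLEN) + NLA_ALIGN(attr_len);
    struct nlmsghdr *nlh = buf;
    struct genlmsghdr *genl;
    struct nlattr *na;

    assert(buf != NULL);
    if (buflen < total)
        return 0;
    memset(buf, 0, total);

    /* Netlink header: the message type is the resolved family id. */
    nlh->nlmsg_len = (uint32_t)total;
    nlh->nlmsg_type = family_id;
    nlh->nlmsg_flags = NLM_F_REQUEST;

    /* Generic netlink header: which taskstats command is being sent. */
    genl = (struct genlmsghdr *)NLMSG_DATA(nlh);
    genl->cmd = TASKSTATS_CMD_GET;
    genl->version = TASKSTATS_GENL_VERSION;

    /* Payload: a single attribute whose data is the u32 pid. */
    na = (struct nlattr *)((char *)genl + GENL_HDRLEN);
    na->nla_type = TASKSTATS_CMD_ATTR_PID;
    na->nla_len = (uint16_t)attr_len;
    memcpy((char *)na + NLA_HDRLEN, &pid, sizeof(pid));

    return total;
}
```

A query for a tgid would presumably differ only in the attribute type
(TASKSTATS_CMD_ATTR_TGID). taskstats.h and getdelays.c remain the reference
for the exact usage.
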
The taskstats payload is one of the following three kinds:

1. Commands: Sent from user to kernel. The payload is one attribute, of type
TASKSTATS_CMD_ATTR_PID/TGID, containing a u32 pid or tgid in the attribute
payload. The pid/tgid denotes the task/process for which userspace wants
statistics.

2. Response for a command: sent from the kernel in response to a userspace
command. The payload is a series of three attributes of type:

a) TASKSTATS_TYPE_AGGR_PID/TGID: attribute that contains no payload but
indicates that a pid/tgid will be followed by some stats.

b) TASKSTATS_TYPE_PID/TGID: attribute whose payload is the pid/tgid whose stats
are being returned.

c) TASKSTATS_TYPE_STATS: attribute with a struct taskstats as payload. The
same structure is used for both per-pid and per-tgid stats.

3. New message sent by kernel whenever a task exits. The payload consists of a
series of attributes of the following type:

a) TASKSTATS_TYPE_AGGR_PID: indicates next two attributes will be pid+stats
b) TASKSTATS_TYPE_PID: contains exiting task's pid
c) TASKSTATS_TYPE_STATS: contains the exiting task's per-pid stats
d) TASKSTATS_TYPE_AGGR_TGID: indicates next two attributes will be tgid+stats
e) TASKSTATS_TYPE_TGID: contains tgid of process to which task belongs
f) TASKSTATS_TYPE_STATS: contains the per-tgid stats for exiting task's process

Attributes d) to f) are present only when the exiting task is the last thread
of its thread group (see Usage).

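A receiver can walk the attribute series described above one attribute at a
time. The sketch below is one possible way to do that and is not the code used
by getdelays.c. struct exit_report and parse_taskstats_attrs() are hypothetical
names. For the AGGR attributes it steps over the attribute header only, which
fits the description above (no payload, pid/tgid and stats follow) and, as an
assumption of this sketch, would also cope with a kernel that carries the
following two attributes inside the AGGR attribute.

```c
#include <assert.h>
#include <stdint.h>
#include <string.h>
#include <sys/socket.h>
#include <linux/netlink.h>
#include <linux/taskstats.h>

/* Hypothetical result record for this sketch. */
struct exit_report {
    int have_pid, have_tgid;
    uint32_t pid, tgid;
    const void *pid_stats;   /* struct taskstats payload, or NULL */
    const void *tgid_stats;  /* struct taskstats payload, or NULL */
};

/* Returns 0 on success, -1 if an attribute header is malformed. */
static int parse_taskstats_attrs(const void *payload, size_t len,
                                 struct exit_report *out)
{
    const char *base = payload;
    size_t off = 0;
    int in_tgid = 0;  /* which aggregate the next STATS belongs to */

    assert(out != NULL);
    memset(out, 0, sizeof(*out));

    while (len - off >= (size_t)NLA_HDRLEN) {
        struct nlattr na;
        const char *data;
        size_t dlen, step;

        memcpy(&na, base + off, sizeof(na));
        if (na.nla_len < NLA_HDRLEN || na.nla_len > len - off)
            return -1;
        data = base + off + NLA_HDRLEN;
        dlen = (size_t)na.nla_len - NLA_HDRLEN;

        switch (na.nla_type) {
        case TASKSTATS_TYPE_AGGR_PID:
        case TASKSTATS_TYPE_AGGR_TGID:
            in_tgid = (na.nla_type == TASKSTATS_TYPE_AGGR_TGID);
            off += NLA_HDRLEN;  /* step over the header only */
            continue;
        case TASKSTATS_TYPE_PID:
            if (dlen < sizeof(uint32_t))
                return -1;
            memcpy(&out->pid, data, sizeof(uint32_t));
            out->have_pid = 1;
            break;
        case TASKSTATS_TYPE_TGID:
            if (dlen < sizeof(uint32_t))
                return -1;
            memcpy(&out->tgid, data, sizeof(uint32_t));
            out->have_tgid = 1;
            break;
        case TASKSTATS_TYPE_STATS:
            if (in_tgid)
                out->tgid_stats = data;
            else
                out->pid_stats = data;
            break;
        default:
            break;  /* unknown attribute type: skip it */
        }

        step = NLA_ALIGN((size_t)na.nla_len);
        if (step > len - off)
            break;
        off += step;
    }
    return 0;
}
```

After a successful call, tgid_stats is non-NULL only if the message carried the
per-tgid attributes, so a listener can use it to detect the exit of the last
thread of a thread group.
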
per-tgid stats
--------------

Taskstats provides per-process stats, in addition to per-task stats, since
resource management is often done at a process granularity and aggregating task
stats in userspace alone is inefficient and potentially inaccurate (due to lack
of atomicity).

However, maintaining per-process, in addition to per-task stats, within the
kernel has space and time overheads. To address this, the taskstats code
accumulates each exiting task's statistics into a process-wide data structure.
When the last task of a process exits, the accumulated process-level data also
gets sent to userspace (along with the per-task data).

When a user queries per-tgid data, the stats of all live threads in the group
are summed and added to the accumulated total for previously exited threads of
the same thread group.

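The accumulation scheme described in this section can be modelled in a few
lines of ordinary C. The following is a userspace model for illustration only,
not kernel code: struct stats, its two fields, struct thread_group and the
function names are all hypothetical, and locking is ignored.

```c
#include <assert.h>
#include <stddef.h>
#include <stdint.h>

/* Hypothetical stand-in for the real statistics structure. */
struct stats {
    uint64_t cpu_ns;
    uint64_t blkio_ns;
};

struct thread_group {
    struct stats exited;  /* running total of threads that already exited */
    unsigned int live;    /* number of threads still alive */
};

static void stats_add(struct stats *dst, const struct stats *src)
{
    dst->cpu_ns += src->cpu_ns;
    dst->blkio_ns += src->blkio_ns;
}

/*
 * Fold an exiting thread into the group total.
 * Returns 1 if this was the last thread, i.e. the point at which the
 * per-tgid record would be sent, and 0 otherwise.
 */
static int thread_exit(struct thread_group *tg, const struct stats *t)
{
    assert(tg->live > 0);
    stats_add(&tg->exited, t);
    tg->live--;
    return tg->live == 0;
}

/* Per-tgid query: total of exited threads plus all currently live threads. */
static struct stats tgid_query(const struct thread_group *tg,
                               const struct stats *live, size_t nlive)
{
    struct stats total = tg->exited;
    size_t i;

    for (i = 0; i < nlive; i++)
        stats_add(&total, &live[i]);
    return total;
}
```

In this model a query made while threads are still running and the record
produced at the last exit are computed from the same running total, so both
cover every thread that was ever part of the group.
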
Extending taskstats
-------------------

There are two ways to extend the taskstats interface to export more
per-task/process stats as patches to collect them get added to the kernel
in future:

1. Adding more fields to the end of the existing struct taskstats. Backward
compatibility is ensured by the version number within the
structure. Userspace will use only the fields of the struct that correspond
to the version it is using.

2. Defining separate statistic structs and using the netlink attributes
interface to return them. Since userspace processes each netlink attribute
independently, it can always ignore attributes whose type it does not
understand (because it is using an older version of the interface).


Choosing between 1. and 2. is a matter of trading off flexibility and
overhead. If only a few fields need to be added, then 1. is the preferable
path since the kernel and userspace don't need to incur the overhead of
processing new netlink attributes. But if the new fields expand the existing
struct too much, requiring disparate userspace accounting utilities to
unnecessarily receive large structures whose fields are of no interest, then
extending the attributes structure would be worthwhile.

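To make option 1 more concrete, here is a sketch of how a userspace reader
might use the version number when fields are only ever appended. It does not
use the real struct taskstats: struct demo_stats, its fields, the version
numbers and the function names are hypothetical, and the sketch assumes that
the version is the first member of the structure.

```c
#include <assert.h>
#include <stddef.h>
#include <stdint.h>
#include <string.h>

/* Hypothetical structure that grows only at the end. */
struct demo_stats {
    uint16_t version;       /* first, so every reader can find it */
    uint64_t cpu_delay_ns;  /* present since hypothetical version 1 */
    uint64_t mem_delay_ns;  /* appended in hypothetical version 2 */
};

#define READER_VERSION 2    /* newest layout this reader was built against */

/*
 * Copy as much of the received payload as this reader understands.
 * Returns the version whose fields may be used: the smaller of the
 * sender's version and READER_VERSION, or -1 if the payload is too short
 * to hold a version number.
 */
static int load_stats(const void *payload, size_t len, struct demo_stats *out)
{
    size_t n = len < sizeof(*out) ? len : sizeof(*out);

    assert(payload != NULL && out != NULL);
    if (len < sizeof(out->version))
        return -1;
    memset(out, 0, sizeof(*out));  /* fields the sender lacks stay zero */
    memcpy(out, payload, n);       /* extra trailing fields are ignored */
    return out->version < READER_VERSION ? out->version : READER_VERSION;
}

/* A field may be used only if the effective version includes it. */
static int has_mem_delay(int effective_version)
{
    return effective_version >= 2;
}
```

An older sender then yields a lower effective version, and the reader skips
mem_delay_ns. A newer sender yields READER_VERSION, and the fields it appended
beyond this layout are never touched.
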
----