Top Processes

Lab128 - Tools for Advanced Oracler Tuning and Monitoring.

Reference Guide.

Top Processes

This window shows top processes on the server. This data comes directly from the operating system (OS) - it is an output of the 'top' command. This information is joined to the Oracle session's data coming from Oracle instance, if you are monitoring this instance. This window is most useful when you are troubleshooting high CPU usage and need to see what is happening on the host server. For example, if the Oracle instance is very slow, there must be some processes hogging CPU resources.

Note. Because this window depends on the 'top' utility, it doesn't work in operating systems where 'top' is not implemented. Currently IBM AIX and MS Windows are not supported. Other major Unixes (including HP-UX, Linux, Solaris) are supported. Also, this technique depends on the SSH daemon, which must be installed and running on the server side. Fortunately, these days every Unix-compatible distribution includes an OpenSSH implementation installed by default.

In order to get the output of 'top' utility, Lab128 establishes a secure shell (SSH) connection to the server. You will be prompted to enter the user name / password and other attributes of an SSH (see SHH Login window) connection. If a connection is already opened by another window, it will be reused in this window.

When you connect to any server with SSH and 'top', this window will work similarly to the 'top' utility. However this window was designed specifically for the hosts running Oracle database instances, because it can join data coming from Oracle. Data from these two sources is joined using process ID (PID). Because there can be many Instance monitors opened in Lab128, the correct monitor should be located to provide Oracle data. Oracle instance exposes host name in v$instance view. This name and the host name entered in SSH login are used to locate the instance. Note. If IP address is entered in SSH login instead of the host name, joining of Oracle data becomes impossible. Please see a workaround in About using IP address in SHH connection.

There are some administrative actions, such as 'kill' and 'renice', available for the selected process; they also depend on privileges of the OS account used to connect to the host.

Filtering rows in the table of processes.

CPU Percent more than - set up a threshold value to show processes which have CPU usage higher than this value. To disable this filter, set threshold value to 0;
Top - set maximum number of processes to show. To disable this filter, set this value to 999;

Available actions for the selected process.

The following actions are available through the pop-up menu. You will be asked to confirm the action before it is executed.

Kill process (-15) - Stops a process via sending SIGTERM signal. The process can catch this signal and process it. This allows the process to terminate gracefully.
Kill process (-9) - Stops a process via sending SIGKILL signal which causes a process to terminate immediately. The process cannot catch this signal to clean up after itself and exit gracefully. Use this command as a last resort.
Renice - alters scheduling priority of the selected process. You will be asked to provide renice increment value.
pstack - takes an execution stack trace of the selected process by executing the 'pstack' command. You will be asked to provide the number of 'pstack' snapshots. After the series of snapshot is collected, they are aggregated and mapped to the names of execution steps in Explain Plan. See more about this technique below: Advanced troubleshooting using pstack traces.

Available links and shortcuts.

Drill down to the Session details by clicking on links in the SID column. This will open a Session Details window and will make this particular session selected.
Drill down to the SQL Statement details by clicking on links in the SQL_ID column. This will open a SQL Statement Details window for this particular SQL statement.

Advanced troubleshooting using pstack traces.

This technique has been developed and described by Tanel Poder, see Advanced Oracle Troubleshooting Guide, Part 6: Understanding Oracle execution plans with os_explain. We added the automation of the entire process; the collected pstack snapshots are aggregated into a single report.

Here is a very brief and simplified description of how this process is implemented. When Oracle executes a SQL statement, it uses an execution plan - the tree-like structure where each node can be associated with the function (sometimes set of functions) executed during run time. Some nodes have a child node, which means that functions associated with the parent node call a function associated with the child node. As execution of the SQL statement progresses, the call stack gets bigger once a child function is called, and it gets smaller once the child function returns. If a SQL statement is slow or seemingly hangs, taking several snapshots of the call stack can shed some light on which function is being executed most of the time. Given knowledge of how Oracle internal functions are mapped to the steps of Explain plan, the step in the plan causing problems can be identified. This provides actionable information for improvement of the query.

Due to the statistical nature of this technique, the accuracy improves with a higher number of stack traces. On the other hand, too many pstack runs can take considerable time. From our experience on Linux system, 5 snapshots is the minimum providing statistically meaningful results. The collected stack traces are merged producing a tree of calls: the common call path remains until a deviation is found; this deviation starts a new branch. The frequency of every node in the tree is counted. After all snapshots are merged, the tree is reordered with the most frequent path shown at the top, less frequent paths shown below that and so on. This is done for convenience only, to simplify finding of the most statistically meaningful path. The function names are mapped into Oracle plan steps, when possible. The output then can be compared to the Explain plan of the SQL statement, which is a couple of clicks away in the Top Processes window.

A sample report can be found in the "SQL Plan Statistics, Pstack examples", see "Using Pstack technique for troubleshooting of runaway query" section.

This implementation is still in experimental stage. Your input and feedback is very welcome.