/************************************************************************/
/* Document     : Short note on how to trace processes in UNIX.         */
/* Doc. Version : 4                                                     */
/* File         : tracing.txt                                           */
/* Purpose      : Some examples on how to trace processes in UNIX.      */
/*                For the DBA working with databases on UNIX.           */
/* Date         : 14/08/2009                                            */
/* Compiled by  : Albert van der Sel                                    */
/************************************************************************/


============================================================================
1. First some info before you trace:
============================================================================

When you study your trace files, you may come across a number of error
messages or error codes. The error codes we mean here are the codes that are
also visible in the file "errno.h". This is a header file in the standard
library of the C programming language. They are a subset of the codes that a
program might get when it requests a service from the system (for example,
"open file").

That is certainly not all there is to know about errors and their codes,
but it constitutes an important base of what you can encounter in traces.

Suppose you find something like this in a trace:

vnop_lookup(dvp = F100010034228BF8, flag = 0002) = 0002, *vpp = 0000
return from statx. error ENOENT [13 usec]

What can ENOENT mean? If you don't find some more explanatory text close to
this line, then you can find from the tables below that it means
"No such file or directory".

Actually, I produced 2 lists, one from Linux and one from AIX, to show how
they compare. Codes 1 through 34 are the same in both lists, but the higher
numbers differ (for example, ELOOP is 40 on Linux and 85 on AIX), and there is
no guarantee that the lists are the same on other systems. So always look up
a number in the list of the system that produced the trace.

By the way, if you search for that "errno.h" file (or a similarly named file)
on your own system and take a look at the contents, you can create the list
yourself for your particular unix/linux system.
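Looking up a name in such a header can be scripted. The following is only a
minimal sketch: "errno_lookup" is a hypothetical helper (not a standard
command), and it assumes the header uses lines of the form
"#define NAME NUMBER /* text */". On many Linux systems the actual #define
lines are not in errno.h itself but in files it includes, such as
asm-generic/errno-base.h and asm-generic/errno.h, so the path you pass in
will differ per system.

```shell
#!/bin/sh
# errno_lookup NAME HEADER
# Hypothetical helper: print "NAME = number (description)" for one errno name,
# assuming HEADER contains lines like "#define ENOENT 2 /* text */".
errno_lookup() {
    awk -v name="$1" '
        $1 == "#define" && $2 == name {
            num = $3
            text = ""
            # Take the text between the comment markers, if there is a comment
            if (match($0, /\/\*.*\*\//)) {
                text = substr($0, RSTART + 2, RLENGTH - 4)
                gsub(/^[ \t]+|[ \t]+$/, "", text)
            }
            if (text != "")
                print name " = " num " (" text ")"
            else
                print name " = " num
            exit
        }' "$2"
}

# Demo on a tiny sample header, so the sketch runs on any system.
sample=/tmp/errno_sample.$$
cat > "$sample" <<'EOF'
#define EPERM   1 /* Operation not permitted */
#define ENOENT  2 /* No such file or directory */
#define EACCES 13 /* Permission denied */
EOF
errno_lookup ENOENT "$sample"
rm -f "$sample"
```

On a real system you would pass the path of your own header instead of the
sample file, for example "errno_lookup ENOENT /usr/include/sys/errno.h" if
that is where the definitions live.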
You can likely find that file in "/usr/include" or "/usr/include/sys".

But for easy reference, we list the most important errno's for 2
representative unixes. Note that the two lists agree up to code 34 and
differ after that.


1.1 Errcodes Linux (generic) from errno.h :
===========================================

#define EPERM            1    /* Operation not permitted */
#define ENOENT           2    /* No such file or directory */
#define ESRCH            3    /* No such process */
#define EINTR            4    /* Interrupted system call */
#define EIO              5    /* I/O error */
#define ENXIO            6    /* No such device or address */
#define E2BIG            7    /* Arg list too long */
#define ENOEXEC          8    /* Exec format error */
#define EBADF            9    /* Bad file number */
#define ECHILD           10   /* No child processes */
#define EAGAIN           11   /* Try again */
#define ENOMEM           12   /* Out of memory */
#define EACCES           13   /* Permission denied */
#define EFAULT           14   /* Bad address */
#define ENOTBLK          15   /* Block device required */
#define EBUSY            16   /* Device or resource busy */
#define EEXIST           17   /* File exists */
#define EXDEV            18   /* Cross-device link */
#define ENODEV           19   /* No such device */
#define ENOTDIR          20   /* Not a directory */
#define EISDIR           21   /* Is a directory */
#define EINVAL           22   /* Invalid argument */
#define ENFILE           23   /* File table overflow */
#define EMFILE           24   /* Too many open files */
#define ENOTTY           25   /* Not a typewriter */
#define ETXTBSY          26   /* Text file busy */
#define EFBIG            27   /* File too large */
#define ENOSPC           28   /* No space left on device */
#define ESPIPE           29   /* Illegal seek */
#define EROFS            30   /* Read-only file system */
#define EMLINK           31   /* Too many links */
#define EPIPE            32   /* Broken pipe */
#define EDOM             33   /* Math argument out of domain of func */
#define ERANGE           34   /* Math result not representable */
#define EDEADLK          35   /* Resource deadlock would occur */
#define ENAMETOOLONG     36   /* File name too long */
#define ENOLCK           37   /* No record locks available */
#define ENOSYS           38   /* Function not implemented */
#define ENOTEMPTY        39   /* Directory not empty */
#define ELOOP            40   /* Too many symbolic links encountered */
#define EWOULDBLOCK      EAGAIN  /* Operation would block */
#define ENOMSG           42   /* No message of desired type */
#define EIDRM            43   /* Identifier removed */
#define ECHRNG           44   /* Channel number out of range */
#define EL2NSYNC         45   /* Level 2 not synchronized */
#define EL3HLT           46   /* Level 3 halted */
#define EL3RST           47   /* Level 3 reset */
#define ELNRNG           48   /* Link number out of range */
#define EUNATCH          49   /* Protocol driver not attached */
#define ENOCSI           50   /* No CSI structure available */
#define EL2HLT           51   /* Level 2 halted */
#define EBADE            52   /* Invalid exchange */
#define EBADR            53   /* Invalid request descriptor */
#define EXFULL           54   /* Exchange full */
#define ENOANO           55   /* No anode */
#define EBADRQC          56   /* Invalid request code */
#define EBADSLT          57   /* Invalid slot */
#define EDEADLOCK        EDEADLK
#define EBFONT           59   /* Bad font file format */
#define ENOSTR           60   /* Device not a stream */
#define ENODATA          61   /* No data available */
#define ETIME            62   /* Timer expired */
#define ENOSR            63   /* Out of streams resources */
#define ENONET           64   /* Machine is not on the network */
#define ENOPKG           65   /* Package not installed */
#define EREMOTE          66   /* Object is remote */
#define ENOLINK          67   /* Link has been severed */
#define EADV             68   /* Advertise error */
#define ESRMNT           69   /* Srmount error */
#define ECOMM            70   /* Communication error on send */
#define EPROTO           71   /* Protocol error */
#define EMULTIHOP        72   /* Multihop attempted */
#define EDOTDOT          73   /* RFS specific error */
#define EBADMSG          74   /* Not a data message */
#define EOVERFLOW        75   /* Value too large for defined data type */
#define ENOTUNIQ         76   /* Name not unique on network */
#define EBADFD           77   /* File descriptor in bad state */
#define EREMCHG          78   /* Remote address changed */
#define ELIBACC          79   /* Can not access a needed shared library */
#define ELIBBAD          80   /* Accessing a corrupted shared library */
#define ELIBSCN          81   /* .lib section in a.out corrupted */
#define ELIBMAX          82   /* Attempting to link in too many shared libraries */
#define ELIBEXEC         83   /* Cannot exec a shared library directly */
#define EILSEQ           84   /* Illegal byte sequence */
#define ERESTART         85   /* Interrupted system call should be restarted */
#define ESTRPIPE         86   /* Streams pipe error */
#define EUSERS           87   /* Too many users */
#define ENOTSOCK         88   /* Socket operation on non-socket */
#define EDESTADDRREQ     89   /* Destination address required */
#define EMSGSIZE         90   /* Message too long */
#define EPROTOTYPE       91   /* Protocol wrong type for socket */
#define ENOPROTOOPT      92   /* Protocol not available */
#define EPROTONOSUPPORT  93   /* Protocol not supported */
#define ESOCKTNOSUPPORT  94   /* Socket type not supported */
#define EOPNOTSUPP       95   /* Operation not supported on transport endpoint */
#define EPFNOSUPPORT     96   /* Protocol family not supported */
#define EAFNOSUPPORT     97   /* Address family not supported by protocol */
#define EADDRINUSE       98   /* Address already in use */
#define EADDRNOTAVAIL    99   /* Cannot assign requested address */
#define ENETDOWN         100  /* Network is down */
#define ENETUNREACH      101  /* Network is unreachable */
#define ENETRESET        102  /* Network dropped connection because of reset */
#define ECONNABORTED     103  /* Software caused connection abort */
#define ECONNRESET       104  /* Connection reset by peer */
#define ENOBUFS          105  /* No buffer space available */
#define EISCONN          106  /* Transport endpoint is already connected */
#define ENOTCONN         107  /* Transport endpoint is not connected */
#define ESHUTDOWN        108  /* Cannot send after transport endpoint shutdown */
#define ETOOMANYREFS     109  /* Too many references: cannot splice */
#define ETIMEDOUT        110  /* Connection timed out */
#define ECONNREFUSED     111  /* Connection refused */
#define EHOSTDOWN        112  /* Host is down */
#define EHOSTUNREACH     113  /* No route to host */
#define EALREADY         114  /* Operation already in progress */
#define EINPROGRESS      115  /* Operation now in progress */
#define ESTALE           116  /* Stale NFS file handle */
#define EUCLEAN          117  /* Structure needs cleaning */
#define ENOTNAM          118  /* Not a XENIX named type file */
#define ENAVAIL          119  /* No XENIX semaphores available */
#define EISNAM           120  /* Is a named type file */
#define EREMOTEIO        121  /* Remote I/O error */
#define EDQUOT           122  /* Quota exceeded */
#define ENOMEDIUM        123  /* No medium found */
#define EMEDIUMTYPE      124  /* Wrong medium type */

The list above should be sufficient for Linux; next we show the corresponding
list for AIX:


1.2 errcodes AIX:
=================

#define EPERM            1    /* Operation not permitted */
#define ENOENT           2    /* No such file or directory */
#define ESRCH            3    /* No such process */
#define EINTR            4    /* interrupted system call */
#define EIO              5    /* I/O error */
#define ENXIO            6    /* No such device or address */
#define E2BIG            7    /* Arg list too long */
#define ENOEXEC          8    /* Exec format error */
#define EBADF            9    /* Bad file descriptor */
#define ECHILD           10   /* No child processes */
#define EAGAIN           11   /* Resource temporarily unavailable */
#define ENOMEM           12   /* Not enough space */
#define EACCES           13   /* Permission denied */
#define EFAULT           14   /* Bad address */
#define ENOTBLK          15   /* Block device required */
#define EBUSY            16   /* Resource busy */
#define EEXIST           17   /* File exists */
#define EXDEV            18   /* Improper link */
#define ENODEV           19   /* No such device */
#define ENOTDIR          20   /* Not a directory */
#define EISDIR           21   /* Is a directory */
#define EINVAL           22   /* Invalid argument */
#define ENFILE           23   /* Too many open files in system */
#define EMFILE           24   /* Too many open files */
#define ENOTTY           25   /* Inappropriate I/O control operation */
#define ETXTBSY          26   /* Text file busy */
#define EFBIG            27   /* File too large */
#define ENOSPC           28   /* No space left on device */
#define ESPIPE           29   /* Invalid seek */
#define EROFS            30   /* Read only file system */
#define EMLINK           31   /* Too many links */
#define EPIPE            32   /* Broken pipe */
#define EDOM             33   /* Domain error within math function */
#define ERANGE           34   /* Result too large */
#define ENOMSG           35   /* No message of desired type */
#define EIDRM            36   /* Identifier removed */
#define ECHRNG           37   /* Channel number out of range */
#define EL2NSYNC         38   /* Level 2 not synchronized */
#define EL3HLT           39   /* Level 3 halted */
#define EL3RST           40   /* Level 3 reset */
#define ELNRNG           41   /* Link number out of range */
#define EUNATCH          42   /* Protocol driver not attached */
#define ENOCSI           43   /* No CSI structure available */
#define EL2HLT           44   /* Level 2 halted */
#define EDEADLK          45   /* Resource deadlock avoided */
#define ENOTREADY        46   /* Device not ready */
#define EWRPROTECT       47   /* Write-protected media */
#define EFORMAT          48   /* Unformatted media */
#define ENOLCK           49   /* No locks available */
#define ENOCONNECT       50   /* no connection */
#define ESTALE           52   /* no filesystem */
#define EDIST            53   /* old, currently unused AIX errno */
#define EINPROGRESS      55   /* Operation now in progress */
#define EALREADY         56   /* Operation already in progress */
#define ENOTSOCK         57   /* Socket operation on non-socket */
#define EDESTADDRREQ     58   /* Destination address required */
#define EDESTADDREQ      EDESTADDRREQ  /* Destination address required */
#define EMSGSIZE         59   /* Message too long */
#define EPROTOTYPE       60   /* Protocol wrong type for socket */
#define ENOPROTOOPT      61   /* Protocol not available */
#define EPROTONOSUPPORT  62   /* Protocol not supported */
#define ESOCKTNOSUPPORT  63   /* Socket type not supported */
#define EOPNOTSUPP       64   /* Operation not supported on socket */
#define EPFNOSUPPORT     65   /* Protocol family not supported */
#define EAFNOSUPPORT     66   /* Address family not supported by protocol family */
#define EADDRINUSE       67   /* Address already in use */
#define EADDRNOTAVAIL    68   /* Can't assign requested address */
#define ENETDOWN         69   /* Network is down */
#define ENETUNREACH      70   /* Network is unreachable */
#define ENETRESET        71   /* Network dropped connection on reset */
#define ECONNABORTED     72   /* Software caused connection abort */
#define ECONNRESET       73   /* Connection reset by peer */
#define ENOBUFS          74   /* No buffer space available */
#define EISCONN          75   /* Socket is already connected */
#define ENOTCONN         76   /* Socket is not connected */
#define ESHUTDOWN        77   /* Can't send after socket shutdown */
#define ETIMEDOUT        78   /* Connection timed out */
#define ECONNREFUSED     79   /* Connection refused */
#define EHOSTDOWN        80   /* Host is down */
#define EHOSTUNREACH     81   /* No route to host */
#define ERESTART         82   /* restart the system call */
#define EPROCLIM         83   /* Too many processes */
#define EUSERS           84   /* Too many users */
#define ELOOP            85   /* Too many levels of symbolic links */
#define ENAMETOOLONG     86   /* File name too long */
#define EDQUOT           88   /* Disc quota exceeded */
#define ECORRUPT         89   /* Invalid file system control data */
#define EREMOTE          93   /* Item is not local to host */
#define ENOSYS           109  /* Function not implemented POSIX */
#define EMEDIA           110  /* media surface error */
#define ESOFT            111  /* I/O completed, but needs relocation */
#define ENOATTR          112  /* no attribute found */
#define ESAD             113  /* security authentication denied */
#define ENOTRUST         114  /* not a trusted program */
#define ETOOMANYREFS     115  /* Too many references: can't splice */
#define EILSEQ           116  /* Invalid wide character */
#define ECANCELED        117  /* asynchronous i/o cancelled */
#define ENOSR            118  /* temp out of streams resources */
#define ETIME            119  /* I_STR ioctl timed out */
#define EBADMSG          120  /* wrong message type at stream head */
#define EPROTO           121  /* STREAMS protocol error */
#define ENODATA          122  /* no message ready at stream head */
#define ENOSTR           123  /* fd is not a stream */
#define ECLONEME         ERESTART  /* this is the way we clone a stream ... */
#define ENOTSUP          124  /* POSIX threads unsupported value */
#define EMULTIHOP        125  /* multihop is not allowed */
#define ENOLINK          126  /* the link has been severed */
#define EOVERFLOW        127  /* value too large to be stored in data type */

Actually, this is only a very small list of errors and codes: it is ONLY
associated with the interaction of a process with the system. And even in
that context, this is a limited list.
There are of course also many classes of errors you will never see in a trace.
Think of the errors that can appear at boot time of a system, or what an error
logging daemon might write to a logfile: those can be a very different story.


============================================================================
2. A quick one: The "truss" tool on many unixes:
============================================================================

Here is a quick way to trace a shell script or executable program: using
"truss". The "truss" tool is available on many unix platforms. It has many
options, but a very useful command to trace the system calls that a script or
program makes is:

$ truss -o /tmp/myprg.log myprg

In this example, truss will log to the file "/tmp/myprg.log" while it traces
the program "myprg". Of course, you can choose another path and logfile to
trace to.

The command above is quite good for tracing a shell script or program that
starts up, does some work, and then terminates. If an error occurs during
runtime, it's likely that you will find some pointers in the logfile that
truss made for you.

In the example above, you started the trace while starting the program at the
same time. You can also attach to an existing process using its "pid", with
the "-p" flag:

$ truss -p 5743

This tool has many more options; for example, you can focus your trace on a
certain library. But even the simple truss example above can already be very
helpful.

So, for example, if you find in the log that truss has produced the error
"EACCES", which is "errno 13 = Permission denied", that really helps.
Evidently, your shell script or program tries to access a certain object for
which it has insufficient permissions, and thus may fail.

Be warned though, that some errno's might be found multiple times, while it's
actually nothing to worry about. For example, "ENOENT = No such file or
directory" might be found quite often.
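To see at a glance which errno names occur in a log, and how often, you can
count them. This is only a minimal sketch: "count_errnos" is a hypothetical
helper, it simply counts every whitespace-separated word that looks like an
errno name (an uppercase word starting with "E"), so it can also pick up
unrelated uppercase words. The sample log lines are made up for the
demonstration; in practice you would pass the file you wrote with "truss -o"
or "strace -o".

```shell
#!/bin/sh
# count_errnos LOGFILE
# Hypothetical helper: count words that look like errno names (ENOENT, EACCES, ...)
# in a trace log, and print them with the most frequent name first.
count_errnos() {
    awk '{
        for (i = 1; i <= NF; i++)
            if ($i ~ /^E[A-Z0-9][A-Z0-9]+$/)
                count[$i]++
    }
    END {
        for (name in count)
            print count[name], name
    }' "$1" | sort -rn
}

# Demo on a small, made-up log in strace style.
log=/tmp/trace_sample.$$
cat > "$log" <<'EOF'
open("/usr/local/bin/myprg", O_RDONLY) = -1 ENOENT (No such file or directory)
open("/usr/bin/myprg", O_RDONLY) = -1 ENOENT (No such file or directory)
open("/opt/app/bin/myprg", O_RDONLY) = 3
open("/etc/security.conf", O_RDONLY) = -1 EACCES (Permission denied)
open("/opt/app/lib/libx.so", O_RDONLY) = -1 ENOENT (No such file or directory)
EOF
count_errnos "$log"
rm -f "$log"
```

A name that shows up only once or twice near the end of the log (like EACCES
in the sample) is usually a better starting point than a name that appears
many times.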
With ENOENT, your script or program was unable to find a file or directory.
If it's related to the $PATH environment variable, that is quite reasonable:
your shell searches the directories in $PATH from beginning to end until the
object has been found, so it's quite possible that some ENOENT errors
occurred along the way.

In section 4.2 you can find some more info on truss.


============================================================================
3. Tracing in Linux:
============================================================================

3.1. strace:
============

>>> strace example on Linux:

One main trace utility on most Linux distros is the "strace" command. You can
use it with many parameters, but "-o outputfile" is very important, in order
to save the output to a file.

Use it like:

# strace -o logfile command
# strace -o logfile -p PID     # In cases where you want to trace a process
                               # that is already running, pass the -p option
                               # to strace.

Because strace will show you the system calls and signals, you can use it to
reveal whether a program cannot find a file, or does not have permission to
read (or write to) a file. In such a case, a program might fail.

Example 1:
----------

Suppose we have a file called "/etc/security.conf". Now we run a utility to
read the file (like cat, pg, more, less etc.) as a normal user who does not
have permission to read the file. Let's trace that event to a logfile, and
see what we can discover.

$ strace -o strace_example.log less /etc/security.conf

A trace file can get pretty long, but you should just browse it and be alert
for anything that looks like a reported error. So, if we take a look in the
logfile "strace_example.log":

..
..
open("/etc/security.conf", O_RDONLY|O_LARGEFILE) = -1 EACCES (Permission denied)
write(2, "/etc/security.conf: Permission denied\n", 32) = 32
..
..

We can clearly see that our program failed due to lack of permission.

Example 2:
----------

You can use strace in many ways.
One other famous "error" you might find using strace is that a program needs
a library, but can't find it. Like in this example:

..
open("/opt/tux/cbl/lib/libdcpybk.so", O_RDONLY) = -1 ENOENT (No such file or directory)
..

Remark:
To find out what libraries a program needs, you might also try the ldd
command. For example, what uuencode needs is shown with:

$ ldd uuencode
uuencode needs:
         /usr/lib/libc.a(shr.o)
         /unix
         /usr/lib/libcrypt.a(shr.o)

(This particular output is in the AIX format; on Linux, ldd lists the ".so"
files that the program needs.)


3.2. ltrace:
============

While "strace" deals with system calls, if you want to track what library
calls an application makes, you can use the "ltrace" command. It works very
similarly to "strace".

Example:

$ ltrace -o ls_example_trace_file.trc ls


3.3. LTT Linux Trace Toolkit:
=============================

Strace, as we have seen above, will trace only one process and present the
result in text form. To trace many processes in a given period of time, Linux
Trace Toolkit (LTT) is a better choice.

LTT is distributed as free software under the GPL. The trace toolkit provides
a daemon, which will capture the events and write them to disk. It's
(generally) not a standard feature of Linux, and you need to obtain it
elsewhere. If you are interested, just search for "Linux Trace Toolkit" to
find current info.

Basically, you run the tracedaemon, and after a while, you use the
tracevisualizer to view results in graphical form.


3.4. Other possibly useful Linux commands (limited list):
=========================================================

Although not directly related to tracing, the following limited list of
commands might help in creating a better view of your system and processes.
I am sure you are familiar with them, but let's list them anyway:

-- Show your OS version:

# cat /proc/version
# uname -a

-- Show the open files that a process uses:

# lsof -p pid            # if lsof is installed
# ls -l /proc/pid/fd     # works on any Linux with /proc
                         # ("pfiles pid" is the Solaris equivalent)

-- Show the jobs that are scheduled (in the account you use) from cron:

# crontab -l

-- What are the standard mounted filesystems:

That's defined in "/etc/fstab"

# cat /etc/fstab

-- Which processes are using a certain filesystem?

# fuser -c /filesystem   # We mean the "mountpoint", like for example "/apps/oracle"

-- Show memory usage of a process:

# pmap -d pid            # (Most important options: -x Show the extended format;
                         #  -d Show the device format.)
                         # (And pid is the process-id, as visible in the command "ps -ef".)

-- Show system memory:

# cat /proc/meminfo
# /usr/sbin/dmesg | grep "Physical"
# free                   # (the free command)

-- Swap usage:

# cat /proc/swaps        # Above 60%-70% it's getting scary
# cat /proc/meminfo

-- cpu info:

# cat /proc/cpuinfo

-- user and process limits:

Sometimes, when a process runs under some account and it fails for no obvious
reason, it might be worth checking the "ulimit" of that account (like max
file size, max open files etc.). Use it under that account as:

# ulimit -a

-- Show processtree of parent and children:

# pstree pid             # on some distros ptree is implemented

-- Show the system error report / error log:

# cat /var/log/messages | more
                         # (more shows the contents one screen at a time,
                         #  instead of scrolling everything past at once)

-- Determine the type of a file (e.g. is it ascii, or another type of file?)
# file file_name         # (the command is really named "file")

-- Show free/used space of the filesystems:

# df -m                  # m in MB; k in KB

If there are many filesystems, you might want to see just the 5 that are the
fullest (the lowest on free space, in relative terms):

# df -kP | awk 'NR>1 {print $5,$6}' | sort -n | tail -5
                         # prints Use% and mount point, fullest last
                         # (on AIX, where "df -k" shows 7 columns, use:
                         #  df -k |awk '{print $4,$7}' |grep -v "Filesystem" | sort -n | tail -5)

-- How to become another user, or possibly root:

# su - accountname       # (switch to that accountname like "su - albert")
# su -                   # (switch to root)
                         # if the sudo utility is implemented, you might try
                         # the command "sudo -l" to see what you might execute.

-- Careful!! How to kill a process "the hard way"?

# kill -9 PID            # careful, don't kill the wrong one;
                         # not recommended unless you don't have a choice.

-- Careful!! How to kill all your processes "the hard way", all at once?

# kill -9 -1             # very careful; not recommended unless you don't have a choice.
# killall                # implemented on some distros. Note that on Linux,
                         # killall takes a process name and kills every process
                         # with that name. Very careful; not recommended unless
                         # you don't have a choice.

-- Show your uid (userid) and gid (groupid):

# id

-- Refreshing (restarting) inetd after modifying "/etc/inetd.conf":

# service xinetd restart # depending on the distro, like RedHat
# /etc/init.d/inetd restart

-- To show the init runlevel:

# who -r

-- Show uptime of system plus average load (over the last 1, 5 and 15 minutes):

# uptime

-- Show the last logged on users: account name & pts & date (history since last restart):

# last | more


============================================================================
4. Tracing in AIX:
============================================================================

In AIX, tracing commands are available like "truss", "syscalls" and "trace".
First we will talk about the "trace" facility, for which AIX also offers a
user-friendly interface. It's a menu based system (via smitty), but you can
use "trace" on the command line as well. The neat thing here is that you can
trace a PID, a program, or just everything.

We will start with the command "smitty trace".
We will instruct the system to create a raw tracefile first (not easily
readable), and then, after we have stopped tracing, we create an ascii
(readable) file from the raw file.


4.1. Setting up a trace with "smitty trace":
============================================

>>> Define and start the trace:
-------------------------------

You can start with

$ smitty trace

The following menu appears:

Move cursor to desired item and press Enter.

  START Trace
  STOP Trace
  Generate a Trace Report
  Manage Trace
  Manage Event Groups

First we choose "START Trace". The following menu appears:

FIG. 1.
                                                   [Entry Fields]
   EVENT GROUPS to trace                           []
   ADDITIONAL event IDs to trace                   []
   Event Groups to EXCLUDE from trace              []
   Event IDs to EXCLUDE from trace                 []
-> Process IDs to Trace                            []
   Program to Trace                                []
   Propagate Tracing to                            [new processes and threads]
   Trace MODE                                      [alternate]
   STOP when log file full?                        [no]
   LOG FILE                                        [/var/adm/ras/trcfile]
   SAVE PREVIOUS log file?                         [no]
   Omit PS/NM/LOCK HEADER to log file?             [yes]
   Omit DATE-SYSTEM HEADER to log file?            [no]
   Run in INTERACTIVE mode?                        [no]
   Trace BUFFER SIZE in bytes                      [262144]
   LOG FILE SIZE in bytes                          [2621440]
   Buffer Allocation                               [automatic]

Now move to the item:

- "Process IDs to Trace":

Hopefully, you know the "pid", or "process id", of the process you want to
trace. You can usually find the pid with the "ps -ef" command. If you do not
specify a particular pid, your trace is going to capture almost all
processes, which of course can lead to incredibly large and fast growing
traces. In this example, we do not fill in a pid. Normally, you should always
choose the pid you want to trace.

Next, move to the item:

- "LOG FILE":

Now we adjust the logfile location from the default "/var/adm/ras/trcfile" to
another suitable filesystem and filename, like "/tmp/trcraw" (the /var
filesystem is usually not a good place to store your own large tracefile).
In this example, we use "/tmp" as the filesystem to store our tracefile (if
there is enough free space).
We give the tracefile the name "trcraw", because it will not contain readable
text (at first), hence the "raw".

Next, move to the item:

- "LOG FILE SIZE in bytes":

It might be a good idea to limit the size of the tracefile. For example, if
you only have 1GB free in /tmp, you must stay well below that size. You will
see that tracing to a file is like "exploding" the filesize. It can grow
incredibly fast, also depending on the event groups you trace. Undoubtedly,
you will see that for yourself. If you trace on too many events, it can be as
bad as 500MB per minute. But in this example, we stay "modest" in sizes. So
here, we have taken the example value of 100MB (104857600 bytes).

FIG. 2.
                                                   [Entry Fields]
   EVENT GROUPS to trace                           []
   ADDITIONAL event IDs to trace                   []
   Event Groups to EXCLUDE from trace              []
   Event IDs to EXCLUDE from trace                 []
   Process IDs to Trace                            []
   Program to Trace                                []
   Propagate Tracing to                            [new processes and threads]
   Trace MODE                                      [alternate]
   STOP when log file full?                        [yes]
   LOG FILE                                        [/tmp/trcraw]
   SAVE PREVIOUS log file?                         [no]
   Omit PS/NM/LOCK HEADER to log file?             [yes]
   Omit DATE-SYSTEM HEADER to log file?            [no]
   Run in INTERACTIVE mode?                        [no]
   Trace BUFFER SIZE in bytes                      [262144]
   LOG FILE SIZE in bytes                          [104857600] (changed to 100MB) #
   Buffer Allocation                               [automatic]

Next, move to

- "STOP when log file full?"

Decide whether you want to stop logging when the size limit has been reached
(generally a good idea). You can choose between "yes" and "no" via the F4 key.

Next, we move to

- "EVENT GROUPS to trace":

When you have your cursor at this item, press F4. An impressive list of
"counters", or traceable events, is shown. With the F7 key, you can toggle
"Select event" on and off. Remember, the more event groups you choose, the
more "intensively" the system will trace, and the faster your tracefile will
grow.
Believe me: if you want to create a relatively simple trace for
troubleshooting purposes, then the selection of

 - fop - FILE OPENS (reserved)
 - fact - FILE ACTIVITY (open,close,read,write) (reserved)

can be sufficient, because many process failures are related to permission
problems (on files and directories) or to files that cannot be found (like
libraries, logfiles etc.). So, in this example we just choose those event
groups, and press Enter.

FIG. 3. (the popup list that F4 shows on top of the menu of Fig. 2; entries
that were cut off at the edge of the popup are shown as they appeared)

+--------------------------------------------------------------------------+
  EVENT GROUPS to trace

  Move cursor to desired item and press F7. Use arrow keys to scroll.
  ONE OR MORE items can be selected.
  Press Enter AFTER making all selections.

  [TOP]
  tidhk - Hooks needed to display thread name (reserved)
  gka - GENERAL KERNEL ACTIVITY (files,execs,dispatches) (reserved)
  gkasc - GENERAL KERNEL ACTIVITY + SYSTEM CALLS (reserved)
  fop - FILE OPENS (reserved)
  fact - FILE ACTIVITY (open,close,read,write) (reserved)
  proc - EXECS, FORKS, EXITS (reserved)
  procd - EXECS, FORKS, DISPATCHES (reserved)
  filephys - FILE ACTIVITY (with physical file system) (reserved)
  filepfsv - FILE ACTIVITY (with physical file system and VMM) (reserved
  filepvl - FILE ACTIVITY (with physical file system, VMM, and LVM) (res
  filepvld - FILE ACTIVITY (w/ phys. file sys., VMM, LVM, and disk) (res
  syscall - SYSTEM CALLS (reserved)
  inthands - FLIHS and SLIHS (reserved)
  lfs - LOGICAL FILE SYSTEM (deprecated, use vnops and vfsops) (reserved
  pfs - PHYSICAL FILE SYSTEM (reserved)
  vmm - VIRTUAL MEMORY MANAGER (reserved)
  vmmsvc - VMM SERVICES (reserved)
  lvm - LOGICAL VOLUME MANAGER (reserved)
  lvmbb - LOGICAL VOLUME MANAGER BADBLOCK EVENTS (reserved)
  ipcgen - IPC: GENERAL (reserved)
  ipcsm - IPC: SHARED MEMORY (reserved)
  ipcmsgs - IPC: MESSAGES (reserved)
  ipcsem - IPC: SEMAPHORES (reserved)
  ipcmmap - IPC: MMAP (reserved)
  ipcmsem - IPC: MSEMAPHORES (reserved)
  errlg - ERROR LOGGING (reserved)
  parpdd - DEVICE DRIVER: PARALLEL PRINTER (reserved)
  tapedd - DEVICE DRIVER: TAPE (reserved)
  entdd - DEVICE DRIVER: ETHERNET - HIGH PERFORMANCE LAN ADAPTER (8ef5)
  tokdd - DEVICE DRIVER: TOKEN RING - HIGH PERFORMANCE ADAPTER (8fc8) (r
  c3270dd - DEVICE DRIVER: C3270 (reserved)
  fddd - DEVICE DRIVER: FLOPPY DISK (reserved)
  scsidd - DEVICE DRIVER: SCSI (reserved)
  sisadd - DEVICE DRIVER: PCI-X SCSI (reserved)
  sissasdd - DEVICE DRIVER: SAS (reserved)
  diskdd - DEVICE DRIVER: DISK (reserved)
  mpqdd - DEVICE DRIVER: MULTI-PROTOCAL ADAPTERS (reserved)
  graphdd - DEVICE DRIVER: GRAPHICS (reserved)
  ttydd - DEVICE DRIVER: pty (reserved)
  rs232dd - DEVICE DRIVER: rs232 (reserved)
  64portdd - DEVICE DRIVER: 64 PORT ASYNC CONTROLLER (reserved)
  x25dd - DEVICE DRIVER: X25 (reserved)
  harierdd - DEVICE DRIVER: HARRIER2 (reserved)
  scsitgdd - DEVICE DRIVER: SCSI Target Mode (reserved)
  lpfkdd - DEVICE DRIVER: Dials/LPFKeys (reserved)
  [MORE...36]

  F1=Help       F2=Refresh    F3=Cancel
  F7=Select     F8=Image      F10=Exit
  Enter=Do      /=Find        n=Find Next
+--------------------------------------------------------------------------+

Now the trace will start and you should see the file "/tmp/trcraw" grow in
size.
You can see that with:

$ ls -al /tmp/trcraw

Also, try this command from the prompt:

$ ps -ef | grep trace

and you should see your trace running in the process list.

IMPORTANT:

Did you note that we did not select a PID (process ID) to trace on? So,
actually, we trace on (almost) all processes "which do something" in the
event groups we selected. Of course, if you know a PID on which you want to
trace, you just fill that in in the menu shown in Fig. 2. If you select to
trace on a PID (only), then your tracefile will of course not grow as fast as
it would in our example.

But even in our example (where we trace on all processes for the selected
event groups), we can see marvelous things. Suppose Oracle and/or Websphere,
or monitoring tools (or you name it) are running. Later on, when you inspect
the tracefile, you can find very valuable information about what those
processes do "under the hood".

Remember, we are creating a raw trace file here. We still need to do one
extra step, after stopping the trace.

>>> Stop the trace and create a readable file:
----------------------------------------------

Ok, if you have left smitty, start it up again.

$ smitty trace

In the menu that follows, just select "STOP Trace"

  START Trace
  STOP Trace
  Generate a Trace Report
  Manage Trace
  Manage Event Groups

and the trace facility will stop tracing.

Next, we want to have a readable file, which we can view (use cat, pg, more,
grep etc.). In smitty, there are options available to create a trace report,
but I think it's more instructive to do this from the prompt. Here we go:

We have a raw trace in the file "/tmp/trcraw". Let's create a readable file
from the raw file, and call it "/tmp/trctxt". You can do that with for
example:

$ trcrpt -O pid=on,exec=on /tmp/trcraw > /tmp/trctxt

Please be aware that the textfile is typically 2 or 3 times larger than the
raw file. So, always check the available space in the filesystem where you
want to create the file.
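That space check can be scripted before you run trcrpt. This is only a
minimal sketch: "report_fits" is a hypothetical helper, the factor 3 is
simply the upper end of the "2 or 3 times larger" rule of thumb, and the
script assumes that your df supports the POSIX "-P" output format (where the
fourth column is the available space).

```shell
#!/bin/sh
# report_fits RAW_BYTES FREE_KB
# Hypothetical helper: succeeds if a text report of about 3 times the raw size
# fits in FREE_KB kilobytes of free space.
report_fits() {
    need_kb=$(( ($1 * 3 + 1023) / 1024 ))   # round up to whole KB
    [ "$need_kb" -le "$2" ]
}

raw=/tmp/trcraw    # the raw tracefile from the example
dir=/tmp           # the filesystem where the readable file will be written

if [ -f "$raw" ]; then
    raw_bytes=$(wc -c < "$raw")
    # With "df -Pk", line 2 holds the figures and column 4 is the free space in KB
    free_kb=$(df -Pk "$dir" | awk 'NR == 2 { print $4 }')
    if report_fits "$raw_bytes" "$free_kb"; then
        echo "Enough free space in $dir for the report."
    else
        echo "Probably NOT enough free space in $dir for the report."
    fi
fi
```

If the check fails, write the report to another filesystem, or limit the
report to what you need.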
Now you can open the file, or grep it for an identifier, etc.

>>> Some important subroutines, or how to "read" the trace:
-----------------------------------------------------------

If you inspect your trace with cat, vi, or whatever tool, it is rather full of what seem to be
many cryptic messages, like:

101 ksh 1175648 4.075516073 0.009374 fstatx LR = D0376824
104 ksh 1175648 4.075520760 0.004687 return from fstatx [5 usec]
104 ksh 1175648 4.074383326 0.001349 return from __loadx [1 usec]
101 ksh 1175648 4.074383929 0.000603 __loadx LR = D03B8288
104 ksh 1175648 4.074385155 0.001226 return from __loadx [1 usec]
101 ksh 1175648 4.074395702 0.010547 getuidx LR = D03BF94C
104 ksh 1175648 4.074396214 0.000512 return from getuidx [1 usec]

In this fragment, the lines with ID 101 show a system call being entered, and the lines
with ID 104 show the return from that call.

>>> statx, stat, lstat, fstatx, fstat, fullstat, ffullstat, stat64, lstat64, fstat64, stat64x,
    fstat64x, or lstat64x Subroutine

Purpose
  Provides information about a file or shared memory object.
Library
  Standard C Library (libc.a)

>>> vnop_open Entry Point

Purpose
  Requests that a file be opened for reading or writing.

  0 Indicates success.
  Nonzero return values are returned from the /usr/include/sys/errno.h file to indicate failure.

>>> getuid, geteuid, or getuidx Subroutine

Purpose
  Gets the real or effective user ID of the current process.
Library
  Standard C Library (libc.a)


4.2. A few examples of using the truss command:
===============================================

With "truss" you can trace a command, or trace an existing process.
It shows all system calls (or a selection) made, with their arguments and the return code.
System call parameters are displayed symbolically. It also prints information about all signals
received by a process.

You can use truss in the following ways:

# truss [options] command

You must understand that in this way, you actually start the command and let truss attach to it;
truss will then display the calls to the system and to external libraries.

# truss [options] -p PID

In this case, you 'attach' to an existing process.

There are many parameters (or options) you can use, but a few of the most important options are:

-o truss.log            # Here you save the truss trace to the logfile "truss.log".
-t [!] Syscall          # If you leave out -t, you trace all syscalls. Indeed, the default is "-tall".
                        # If you use -t, you can also give a comma separated list of the calls you want to
                        # trace, like "-t open,statx,close", where you will only trace open, statx and close.
                        # You can also exclude certain syscalls, by using "-t !syscall".
-u [!] [LibraryName]    # Here you can give a comma separated list of libraries whose calls you want to trace.
                        # Using "-u !LibraryName", you can exclude a certain library from the trace.

Let's take a look at a few simple examples:

Example 1:
----------

Suppose in /opt/app/cc we have a program called "test". Somebody from your group tries to run it,
but it immediately dies, and you don't have a clue as to what caused it.
It was supposed to present your colleague with a menu screen to work with, but that never happened.
Of course, any well-behaved program should give a message on the screen, or write status information
to a logfile. But suppose we are dealing with a program without those nice features.

$ ./test

And it dies, while we were expecting a menu screen to work with. Why did it die?
Let's try truss:

$ truss ./test
execve("test", 0xFFBFFDEC, 0xFFBFFDF4) argc = 1
getcwd("/home/albert", 1015) = 0
stat("/home/albert/test", 0xFFBFFBC8) = 0
open("/var/ld/ld.config", O_RDONLY) Err#2 ENOENT
stat("/opt/csw/lib/libc.so.1", 0xFFBFF6F8) Err#2 ENOENT
stat("/lib/libc.so.1", 0xFFBFF6F8) = 0
resolvepath("/lib/libc.so.1", "/lib/libc.so.1", 1023) = 14
open("/lib/libc.so.1", O_RDONLY) = 3
memcntl(0xFF280000, 139692, MC_ADVISE, MADV_WILLNEED, 0, 0) = 0
close(3) = 0
getcontext(0xFFBFF8C0)
getrlimit(RLIMIT_STACK, 0xFFBFF8A0) = 0
getpid() = 7895 [7894]
setustack(0xFF3A2088)
open("/opt/app/etc/cc.conf", O_RDONLY) Err#13 EACCES [file_dac_read]   <--- !!!
ioctl(1, TCGETA, 0xFFBFEF14) = 0

Now note the line that I have marked with "!!!". Here you see Err#13 EACCES.
From the lists in Section 1, we can find that error 13 corresponds to "Permission denied".

So, if you go to "/opt/app/etc/" and check the permissions on the file "cc.conf",
you would find that the permissions on that file need to be altered, for example with:

$ chmod g+r cc.conf    # here we give the group read permission on "cc.conf"

After that, the program runs without errors.

Probably this was a program that first wanted to read configuration information from "/opt/app/etc/cc.conf",
and if that failed, it would just terminate without any message.
Of course, that program could have been designed much better. But we have seen an example where truss was of use.
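
In a longer trace, it is not practical to read every line looking for the failed calls.
A minimal sketch of how to pick them out of a saved log (made with "truss -o truss.log ./test").
It assumes that failures are marked as "Err#<number> <NAME>", as in the output above; the two function
names are made up for this note.

```shell
#!/bin/sh
# Sketch only: pick the failed calls out of a saved truss log.
# Assumption: failures are marked "Err#<number> <NAME>", as in the truss output above.

# Print every line of log file $1 that reports a failed call.
truss_errors() {
    grep 'Err#[0-9]' "$1"
}

# Print "<count> <errno name>" for log file $1, most frequent errno first.
errno_summary() {
    awk 'match($0, /Err#[0-9]+ [A-Z0-9]+/) {
             # The matched text looks like "Err#13 EACCES"; keep the name.
             split(substr($0, RSTART, RLENGTH), part, " ")
             count[part[2]]++
         }
         END { for (name in count) print count[name], name }' "$1" |
        sort -k1,1nr -k2,2
}

# Possible use:
#   truss -o truss.log ./test
#   truss_errors truss.log
#   errno_summary truss.log
```

Not every failed call is a problem: the two ENOENT lines in Example 1 are just the loader searching for
its files. The summary only tells you where to look first.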

Example 2:
----------

Let's run the program "lsps -s" (show paging space) from my home dir, and let's truss it,
to see what system calls it makes:

albert@sharky:/home/albert $ truss lsps -s
execve("/usr/sbin/lsps", 0x2FF22A4C, 0x2000EB28) argc: 2
__loadx(0x03000000, 0x2FF22870, 0x000000F0, 0x10000000, 0x20000E14) = 0x00000000
__loadx(0x0A040000, 0xD0572CD4, 0x0000000A, 0x00000000, 0x00000000) = 0x00000000
sbrk(0x00000000) = 0x20004570
vmgetinfo(0x2FF21C30, 7, 16) = 0
sbrk(0x00000000) = 0x20004570
__libc_sbrk(0x00000000) = 0x20004570
getuidx(4) = 6318
getuidx(2) = 6318
getuidx(1) = 6318
getgidx(4) = 1105
getgidx(2) = 1105
getgidx(1) = 1105
__loadx(0x01000080, 0x2FF216E0, 0x00000960, 0x2FF22160, 0x00000000) = 0xD0149130
__loadx(0x0A040000, 0xD0572CA0, 0x2FF22FFC, 0x0000D0B2, 0x00000000) = 0x00000000
__loadx(0x01000180, 0x2FF216E0, 0x00000960, 0xF028CC4C, 0xF028CB7C) = 0xF03358D8
__loadx(0x0A040000, 0xD0572CA0, 0x2FF22FFC, 0x0000D0B2, 0x00000000) = 0x00000000
__loadx(0x07080000, 0xF028CC1C, 0xFFFFFFFF, 0xF03358D8, 0x00000000) = 0xF0336808
__loadx(0x07080000, 0xF028CB5C, 0xFFFFFFFF, 0xF03358D8, 0x00000000) = 0xF0336814
__loadx(0x07080000, 0xF028CC2C, 0xFFFFFFFF, 0xF03358D8, 0x00000000) = 0xF0336844
__loadx(0x07080000, 0xF028CB6C, 0xFFFFFFFF, 0xF03358D8, 0x00000000) = 0xF0336850
__loadx(0x07080000, 0xF028CBEC, 0xFFFFFFFF, 0xF03358D8, 0x00000000) = 0xF0336820
__loadx(0x07080000, 0xF028CB8C, 0xFFFFFFFF, 0xF03358D8, 0x00000000) = 0xF0336838
__loadx(0x07080000, 0xF028CBFC, 0xFFFFFFFF, 0xF03358D8, 0x00000000) = 0xF033685C
__loadx(0x07080000, 0xF028CC0C, 0xFFFFFFFF, 0xF03358D8, 0x00000000) = 0xF033688C
__loadx(0x07080000, 0xF028CB9C, 0xFFFFFFFF, 0xF03358D8, 0x00000000) = 0xF0336874
__loadx(0x07080000, 0xF028CBAC, 0xFFFFFFFF, 0xF03358D8, 0x00000000) = 0xF0336910
getuidx(4) = 6318
getuidx(2) = 6318
getuidx(1) = 6318
getgidx(4) = 1105
getgidx(2) = 1105
getgidx(1) = 1105
__loadx(0x01000080, 0x2FF216E0, 0x00000960, 0x2FF22160, 0x00000000) = 0xD0149130
getuidx(4) = 6318
getuidx(2) = 6318
getuidx(1) = 6318
getgidx(4) = 1105
getgidx(2) = 1105
getgidx(1) = 1105
__loadx(0x01000080, 0x2FF216E0, 0x00000960, 0x2FF22160, 0x00000000) = 0xD0149130
getuidx(4) = 6318
getuidx(2) = 6318
getuidx(1) = 6318
getgidx(4) = 1105
getgidx(2) = 1105
getgidx(1) = 1105
__loadx(0x01000080, 0x2FF216E0, 0x00000960, 0x2FF22160, 0x00000000) = 0xD0149130
getuidx(4) = 6318
getuidx(2) = 6318
getuidx(1) = 6318
getgidx(4) = 1105
getgidx(2) = 1105
getgidx(1) = 1105
__loadx(0x01000080, 0x2FF216E0, 0x00000960, 0x2FF22160, 0x00000000) = 0xD0149130
getuidx(4) = 6318
getuidx(2) = 6318
getuidx(1) = 6318
getgidx(4) = 1105
getgidx(2) = 1105
getgidx(1) = 1105
__loadx(0x01000080, 0x2FF216E0, 0x00000960, 0x2FF22160, 0x00000000) = 0xD0149130
access("/usr/lib/nls/msg/en_US/cmdps.cat", 0) = 0
_getpid() = 483490
psdanger(0) = 524288
psdanger(-1) = 521468
open("/usr/lib/nls/msg/en_US/cmdps.cat", O_RDONLY) = 3
kioctl(3, 22528, 0x00000000, 0x00000000) Err#25 ENOTTY
kfcntl(3, F_SETFD, 0x00000001) = 0
kioctl(3, 22528, 0x00000000, 0x00000000) Err#25 ENOTTY
kread(3, "\0\001 ù\001\001 I S O 8".., 4096) = 4096
lseek(3, 0, 1) = 4096
lseek(3, 0, 1) = 4096
lseek(3, 0, 1) = 4096
_getpid() = 483490
lseek(3, 0, 1) = 4096
kioctl(1, 22528, 0x00000000, 0x00000000) = 0
Total Paging Space Percent Used
kwrite(1, " T o t a l P a g i n g".., 34) = 34
2048MB 1%
kwrite(1, " 2 0 4 8 M B".., 30) = 30
__loadx(0x04000000, 0x2FF22080, 0x00000800, 0x0000D0B2, 0x00000000) = 0x00000000
kfcntl(1, F_GETFL, 0x00000001) = 67110914
kfcntl(2, F_GETFL, 0xF02DF418) = 67110914
_exit(0)

There is a lot of output on the screen. I entered "lsps -s", and truss watches which syscalls are made
and shows them on the screen.
In fact, many of the first lines deal with "getuidx" and calls of that kind. The system wants to know
who issued the command (and to which groups he/she belongs).
You can ignore the output, because it's not that interesting.
I only "published" it here to give you an idea of how much output tracing commands (like truss) generate.

If I just want to store that information in a logfile, for example "truss.log", I would use the following command:

albert@sharky:/home/albert $ truss -o truss.log lsps -s


4.3. Other possibly useful AIX commands:
========================================

Although not directly related to tracing, the following limited list of commands might help in creating
a better view of your system and processes.
I am sure you are familiar with them, but let's list them anyway:

-- Show your AIX version:

# oslevel -r
# oslevel -s             # with SP, TL

-- Show the jobs that are scheduled (in the account you use) from cron:

# crontab -l

-- What are the standard mounted filesystems?
   That's defined in "/etc/filesystems"

# cat /etc/filesystems | more

-- Which processes are using a certain filesystem?

# fuser -c /filesystem   # We mean the "mountpoint", like for example /appl/oracle

-- Show memory usage of a process:

# procmap pid            # pid is the process-id, as visible in the command "ps -ef"

-- Show the open files that a process uses:

# procfiles pid          # ("pfiles" is the Solaris name). Also take a look at the "lsof" command: man lsof

-- Show system memory:

# bootinfo -r
# lsattr -E -l mem0
# lsattr -E -l sys0 -a realmem
# svmon -G
# vmstat -v
# vmo -L                 # ( lots of output )
# svmon -U -g -t 10      # ( top 10 users paging space)

-- Swap usage:

# lsps -s                # more than 60%-70% used? It gets really scary. More than 75% used? Oh boy!
# pstat -s

-- cpu info:

# lparstat (-i)
# prtconf | grep proc
# pmcycles -m
# lscfg | grep proc
# pstat -S

-- ulimit:

Sometimes, when a process runs under someone's credentials and it fails for no obvious reason,
it might be worth checking the "ulimit" settings of that account (like max filesize, max open files, number of files etc.).
Use it under that account as:

# ulimit -a

-- Show process tree of parent and children:

# proctree pid           # Tip: take a look at the "proc tools" on AIX

-- Show the system error report / error log:

# errpt                      # or "errpt | more"
# errpt -a -j ERRID | more   # view details of an error record. ERRID is the 1st identifier in such a record.

-- Determine the type of a file (e.g. is it ascii, or another type of file?)

# file file_name         # (yes..., the command is really "file")

-- Show free/used space of the filesystems:

# df -m                  # m in MB; k in KB; g in GB

If there are many filesystems, you might want to see just the 5 that are fullest (highest %Used):

# df -k | grep -v "Filesystem" | awk '{print $4,$7}' | sort -n | tail -5

-- How to become another user, or possibly root:

# su - accountname       # (switch to that accountname like "su - albert")
# su -                   # (switch to root)
                         # if the sudo utility is implemented, you might try the command "sudo -l"
                         # to see what you are allowed to execute.

-- Careful!! How to kill a process "the hard way"?

# kill -9 PID            # careful, don't kill the wrong one; not recommended unless you don't have a choice.

-- Careful!! How to kill all your processes "the hard way", all at once?

# kill -9 -1             # be very careful; not recommended unless you don't have a choice.
# killall                # be very careful; not recommended unless you don't have a choice.

-- Show your uid (userid) and gid (groupid):

# id

-- Refresh inetd after modifying "/etc/inetd.conf":

# refresh -s inetd

-- Show the last logged on users + date (history since last restart):

# last | more

-- To show the init runlevel:

# who -r

-- Show uptime of system plus average load (over the last 1, 5 and 15 minutes):

# uptime

-- Clean memory with ipcrm (be careful):

# ipcrm -m 50855977      # (remove shared memory segment, identified by example id 50855977; be careful)
# ipcrm -s 2228248       # (remove semaphore, identified by example id 2228248; be careful)
# ipcrm -q 5111883       # (remove queue, identified by example id 5111883; be careful)
                         # (see man pages ipcrm)

-- To clear out unused system modules (currently unused modules in kernel and library memory):

# slibclean


============================================================================
5. Solaris:
============================================================================

A similar "story" will be put here, but then of course for Solaris.


============================================================================
6. Other:
============================================================================

6.1 Some trivial remarks:
=========================

(1):
====

Now for some really really trivial remarks...... (Sorry !)

- Kernel parameters
  If you have problems installing a program, or if it fails to run properly, are you sure all required
  kernel parameters have been set?

- Environment variables
  If you have problems installing a program, or if it fails to run properly, are you sure all required
  environment variables have been set? Many "large" programs really have an impressive list of variables
  you need to set before they will run properly.

- Dependencies on other stuff.
  Most (commercial) programs depend heavily on installed support programs or tools, like perl, java, etc.
  They may even have very strict requirements on the versions of those support programs.

- Cluttered memory (ipc identifiers, semaphores, shared memory)
  If you have started an application and terminated it abruptly, it's possible that "stuff" still remains
  in memory. In such a case, it's possible that your app will not be able to restart.
  You need to use a tool like "ipcrm" to clean memory, or you might even consider rebooting the system.

(2):
====

For HP-UX 11i, a trace tool called "tusc" is available. You need to download it from HP and install it.
The way to use it is very similar to the tools we have seen above. There even exists a "truss" wrapper
around it, so you can use it like truss, as we have seen above.
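
Whichever of these tracers you use, a saved log is plain text, so the usual tools can summarize it.
A minimal sketch that counts how often each call appears. It assumes truss-style lines that start
with "name(" (as in the truss examples of Section 4.2; other tracers may lay out their lines differently),
and the function name is made up for this note.

```shell
#!/bin/sh
# Sketch only: count how often each call shows up in a truss-style log.
# Assumption: a traced call starts its line as "name(", as in the truss examples.

# Print "<count> <call name>" for log file $1, most frequent call first.
syscall_counts() {
    awk '/^[A-Za-z_][A-Za-z0-9_]*\(/ {
             call = $0
             sub(/\(.*/, "", call)      # keep only the name before the first "("
             count[call]++
         }
         END { for (c in count) print count[c], c }' "$1" |
        sort -k1,1nr -k2,2
}

# Possible use:
#   truss -o truss.log lsps -s
#   syscall_counts truss.log
```

Lines that do not start with a call name (for example program output mixed into the log) are skipped.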