<html> <head> <title>Albert van der Sel : diskdevice paths</title> </head> <body bgcolor="#FFFFFF" link="blue" alink="blue" vlink="blue"> <h1>Very simple note on disk device "paths" in some common architectures.</h1> <B>Version</B> : 2.5<br> <B>Date</B> : 19/05/2012<br> <B>By</B> : Albert van der Sel<br> <hr/> <font face="arial" size=2 color="black"> <br> <br> <font face="arial" size=2 color="blue"> <B>Main Contents:</B><br> <br> <B> <A href="#section1"> 1. A GENERIC (INCOMPLETE) OS MODEL.</A><br> <A href="#section2"> 2. THE DEVICE TREE & FIRMWARE pre-boot IMPLEMENTATIONS: Open firmware & UEFI.</A><br> <A href="#section3"> 3. MBR (BIOS), GPT (UEFI).</A><br> <A href="#section4"> 4. A VERY SHORT SECTION ON SOME SCSI TERMS.</A><br> <A href="#section5"> 5. WORLD WIDE NAME IDENTIFIERS.</A><br> <A href="#section6"> 6. BLOCK IO, FILE IO, AND PROTOCOLS.</A><br> <A href="#section7"> 7. JUST A FEW NOTES ON VMWARE.</A><br> <A href="#section8"> 8. JUST A FEW NOTES ON NETAPP.</A><br> <A href="#section9"> 9. JUST A FEW NOTES ABOUT STORAGE ON UNIX, LINUX AND WINDOWS.</A><br> </B> <font face="arial" size=2 color="black"> <br> <br> In this simple note, we try to address some key concepts in "access of storage". For example, what driver models<br> are generally in use in a common Operating System? Or, why do WWPNs exist, and what exactly are they?<br> And why, in some OSes, do we see device files like "/dev/rdsk/c0t0d0s0"? And what is a thing like EFI,<br> or a "device tree", and indeed, how does NVRAM or firmware play a role in finding storage?<br> <br> Numerous other questions exist, of course. This simple note might help us in understanding<br> some of the answers to those questions.<br> <br> However, the material presented here really is <B>supersimple</B>. 
Sometimes, to understand something completely,<br> you really have no other option than to "dive deep" into some protocol.<br> In contrast, this note keeps a <I>very</I> high-level view at all times.<br> <br> Also, there is no special focus on any specific type of architecture.<br> <br> Next, I need to apologize for the very simplistic pictures in this note. There are many professional figures<br> on the internet, but it's not always clear which are "free" to use, so I created a few myself. Don't laugh ;-)<br> <br> <br> <br> <font face="arial" size=2 color="blue"> <hr/> <h2 id="section1">Chapter 1. A GENERIC (BUT INCOMPLETE) OS MODEL.</h2> <hr/> <font face="arial" size=2 color="black"> We must start "somewhere". So, maybe it's a good idea to see first which parts of the Operating System (OS)<br> have some relation to finding and accessing storage.<br> <br> Suppose a user application wants to open a file, which exists "somewhere" on "some sort of storage".<br> So, we could have a user who uses an editor, and wants to open some "readme" file on some<br> filesystem, say, in a directory somewhere in the "/home" filesystem.<br> Note that in this case, it is "known" to the OS what <B>the type</B> of the filesystem is (like JFS, ext3 etc..)<br> <br> Typically, the application uses a library call like "fopen()", which results in a system call (like "open()") that is handled by the "system call interface"<br> (a sort of API), and the request will be passed on to the kernel.<br> <br> Please take a look at figure 1. 
<br> <br> The kernel has many essential tasks like process management and the like, and it will<br> hand off the request to a specialized 'IO Manager' (which in fact is a very generalized concept).<br> Since there are so many types of "filesystems", this module really uses a sort of "plug and play" concept.<br> In our "generic" OS, "a close friend" of the IO Manager, namely the "Virtual Filesystem", will use (or if necessary load)<br> any specific module which can deal with a certain specific filesystem.<br> <br> So, suppose the Virtual Filesystem has determined that the user application actually wants to access<br> an "ext3" filesystem, then it will put the corresponding "ext3" driver/module to work, and it won't use<br> the NTFS and JFS modules (since they are not of use here).<br> Note that some number of those modules might thus already be loaded, ready for use.<br> <br> <B>Fig 1. Simplified storage driver stack in a model OS.</B><br> <br> <img src="diskdevices1.jpg" align="centre"/> <br> <br> Now, this specific filesystem driver knows all the "ins and outs" of that specific filesystem, like<br> locking behaviour and types of access etc., but it cannot retrieve the file all by itself since it does not have<br> a clue of the true physical path.<br> That's why more specialized modules are set in action. Some are helper modules, but the last one in the stack<br> really "knows" how to access the Host Bus Adapter (HBA), if the file that the user application wants happens to reside somewhere<br> on a Fibre Channel (FC) SAN.<br> If another type of storage needed to be accessed, like iSCSI, another specialized set of drivers would be used.<br> <br> Our generic OS model cannot be so bad indeed. Take a look at figure 2. Here we see a similar stack,<br> but this time of a real OS. Figure 2 shows you (very high-level !) how it is implemented in Windows.<br> <br> <br> <B>Fig 2. 
Simplified storage driver stack in Windows.</B><br> <br> <img src="diskdevices2.jpg" align="centre"/> <br> <br> Here too, we see a Virtual Filesystem (VFS), which can utilise specialised modules for a specific filesystem type.<br> In figure 2, you can see the "ntfs", "fat" and other specialized modules.<br> <br> The Partition Manager, "partmgr.sys", among other tasks, keeps track of partitions and LUNs and makes sure<br> that their "identity" stays the same (like an F: drive stays the F: drive, also after a reboot).<br> <br> Next, "storport.sys" is a general access module for any type of remote storage.<br> It's the successor of the former "scsiport" module. Storport is an <B>interface</B> between higher level modules<br> and the lower level "miniport" drivers. It deals with many tasks like queueing, error messages etc..<br> <br> If we get lower in the stack, we will end up with specialized modules which know how to handle,<br> for example, a netcard (for iSCSI) or an HBA (for accessing an FC SAN).<br> <br> <br> <br> <font face="arial" size=2 color="blue"> <hr/> <h2 id="section2">Chapter 2. The Device Tree & firmware pre-boot implementations.</h2> <hr/> <font face="arial" size=2 color="darkmagenta"> This chapter takes a look at possible pre-boot implementations (like Device Tree & firmware) with respect<br> to <B>booting a Host</B>. Now, whatever Operating System follows after this "pre boot" is actually of no concern here.<br> Of course, it <U>is</U> important for the full picture, but here we do not really distinguish between<br> a Host which, when the full boot ends, runs "just" an Operating System, and one with an OS that is the basis for<br> supporting "Virtual Machines" (VMs).<br> <br> So, for practical purposes, view this chapter as a description of a "bare metal machine" boot.<br> In a later chapter, we will see more of the boot of a Hypervisor-like machine, which natively is meant to support VMs. 
<br> <font face="arial" size=2 color="black"> <br> <br> It's interesting to see what "entity" actually discovers buses, and devices on those buses.<br> But it strongly depends on the architecture.<br> <br> I think it would be quite reasonable if you had the idea that when your OS boots, it fully scans<br> the computer to find out "what's in there", and configures the system accordingly.<br> <br> Yes and No. I know that "yes and no" does not sound good, but it is the truth.<br> <br> Please be aware that I'm not saying that an OS does not have "polling/enumerator features". Certainly not.<br> It's only that for certain platforms, device trees can be built by "firmware like" solutions, immediately after power-on<br> of such a system. Usually, such a firmware solution is accompanied by a "shell" with a relatively compact commandset,<br> which enables the admin to view/walk the tree, map devices, probe for devices, and of course, boot an OS.<br> <br> Of course, in general terms, if you get a kernel (and the rest of an OS), it's meant for a <I>certain architecture</I>.<br> We all understand that, say, just some Linux distro for Intel x86 will not run unmodified on other platforms.<br> <br> But there is a bit more to it. Even for a <I>certain architecture,</I> there might exist many mainboards using<br> different buses and interconnects. Engineers and other IT folks never liked the idea of creating<br> a superheavy kernel which is prepared for all sorts of mainboards and the myriad of possible devices<br> on all sorts of buses.<br> The problem is clear: how can we let the OS recognize all buses and devices, and bind drivers to those<br> devices? 
One solution might be: build a kernel with lots of tables or configuration databases.<br> In an Open Industry, this was never perceived as a viable solution.<br> <br> That's why some "solutions" were presented already quite some time ago, and some newer followup protocols were constructed,<br> like "Open firmware", ACPI and UEFI.<br> <br> <br> <font face="arial" size=2 color="red"> <h3> 2.1 Early firmware-like solutions.</h3> <font face="arial" size=2 color="black"> Already quite early, some firmware-like solutions were implemented, like for example Openboot for Sun Sparc machines.<br> Openboot might be considered to be an "early" Open firmware implementation.<br> <br> It was partly a "PROM" and "NVRAM" solution, where at "power on", a so-called <B>"device tree"</B> was built first,<br> before the OS booted (like Solaris 8 in those days).<br> The associated code was written in Forth, and the NVRAM could store all sorts of variables, like whether<br> autoboot (to Solaris) was true, and so forth.<br> A short time after the power was switched on, the Admin had the choice to boot to the OS, or just stay<br> at the "Openboot" prompt, usually visible on the console as an "ok>" prompt.<br> On this prompt, if the Admin wanted to do so, he could traverse the Device Tree (showing what was in there),<br> or even "probe" for newly installed devices.<br> <br> We should not stay too long at this particular implementation, but I want you to know that already at<br> the "ok>" prompt, address paths to devices were visible. 
And a key point of the story is that the Device Tree,<br> a datastructure in memory, was passed on to the kernel when it booted, after which the kernel was able <br> to build "friendly" names in "/dev" while the device tree was still visible (and mounted) in "/devices".<br> So, note that the kernel uses this datastructure to know the devices and to more easily bind drivers<br> and configure the system.<br> <br> Here is an example of such a physical address path:<br> <br> <font face="courier" size=2 color="blue"> /pci@1f,0/pci@1/isptwo@4/sd@2,0<br> <font face="arial" size=2 color="black"> <br> Such a path can be read as starting at a "root node", below which a hierarchy of "sub/child nodes" resides,<br> ultimately ending in a device. In the example above, we see a path to a SCSI disk, from a local SCSI controller<br> in a PCI slot.<br> <br> Actually, a key point here is that devices, also storage devices, <B>might</B> be found at a firmware boot,<br> where each gets associated with an address path (a "physical device name" if you like), and subsequently the kernel<br> uses that info further for configuration.<br> <br> <br> <font face="arial" size=2 color="red"> <h3> 2.2 Open Firmware:</h3> <font face="arial" size=2 color="black"> While Sun's former <B>"Openboot"</B> might be considered to be a predecessor, <B>"Open firmware"</B>, as it is now called, should be seen as a<br> non-proprietary boot firmware that might be implemented on various platforms or architectures.<br> <br> &#8658; For example, the CHRP Specification (Common Hardware Reference Platform) requires the use of Open Firmware.<br> Examples are the Power- and PowerPC architectures of IBM.<br> No doubt that "system p" and "system i" Admins know of the "ok>" prompt which can be accessed through the SMS Boot Menu.<br> These are the modern AIX and AS400 (old term) machines, capable of running AIX, IBM i (AS400), and Linux VMs.<br> <br> &#8658; Also, you will not be surprised that Sun machines use an Open firmware 
implementation too. (Sun has been "eaten" by Oracle).<br> <br> &#8658; In contrast, as you probably know, Intel based x86/64 machines traditionally used PC BIOS implementations.<br> However, implementing ACPI is quite established, and on newer Intel/Windows Platforms, EFI can be used as well.<br> More about that later on.<br> <br> You might say that Open Firmware is very effective for PCI devices. Open firmware can be seen as a boot manager, and when the<br> "minikernel" (so to speak) from firmware has booted, it can offer a shell with a limited commandset.<br> Additionally, PCI devices might be equipped with FCode, which enables them to report identification- and resource characteristics,<br> which Open firmware can use to build the device tree.<br> Usually, Open firmware offers simple means to boot the system to an OS (possibly multiboot), which will use the device tree further for configuration<br> and for binding drivers to devices.<br> <br> <br> <font face="arial" size=2 color="red"> <h3>2.3 EFI or UEFI:</h3> <font face="arial" size=2 color="black"> EFI, or the <B>"Extensible Firmware Interface"</B>, was Intel's idea for replacing the traditional PC BIOS systems.<br> Formally, EFI "evolved" into the "Unified EFI Platform Initialization" specifications. So, from now on we will use the term UEFI<br> to describe this "new" BIOS-like followup protocol.<br> <br> I'm not saying that the BIOS stayed exactly the same over all those years. Far from it. The BIOSes from different vendors,<br> using more and more extensions, kept up with new hardware implementations. However, it was time for a fundamental change.<br> For example, demands for security, better decisions by low level software on sleeping modes, better bootmanagers, and more<br> control over the bootprocess (and maybe many more) probably all led to the UEFI implementation.<br> That is not to say that UEFI is "great". 
No, actually it's very complex, but at least it overcomes many limitations.<br> <br> <font face="arial" size=2 color="darkmagenta"> Note that the <B>GUID Partition Table (GPT)</B> was also introduced as part of UEFI, to overcome the limitations of the MBR<br> partitioning scheme. However, in principle, you can use GPT without EFI, if it isn't a bootdisk.<br> <font face="arial" size=2 color="black"> <br> UEFI as it is now seems to be tied to Intel's newest architectures. That does not mean that you should immediately think of Windows<br> Operating Systems per se. One nice example is HP systems based on Itanium. Of course, HP hopes that you will boot to <B>HP-UX</B>,<br> but EFI, seen as a bootmanager, allows you to easily boot to another OS like RedHat, Suse, Windows, etc..<br> <br> For security experts, the EFI pre boot environment could be seen as a large improvement too. In principle, various forms of authentication<br> are possible, as described in the "EFI pre-boot authentication protocol".<br> As a sidenote: It's also, for example, quite interesting to read articles on the <B>Windows 8</B> "Early Launch Anti-Malware" implementation.<br> <br> In one sentence: EFI is often defined as an <B>interface</B> between an operating system and platform firmware.<br> It's best to view EFI as a modular structure, consisting of "certain parts", like a firmware interface, a bootmanager, <br> an EFI systempartition, and support for a drivermodel using EFI Byte Code (EBC).<br> <br> The hardware must be suitable for using EFI, so really old x86 systems are not usable, but x86 is not excluded.<br> However, if the hardware is supported, it is still possible (if you would insist) to run legacy BIOS Operating Systems,<br> "thanks" to compatibility modules like CSM. 
However, UEFI is targeted at BIOS free systems.<br> <br> After a system is powered on, very globally, the following happens:<br> <br> First, some <B>system dependent routines</B> are activated, like for example "IMM" initialization on an IBM System X,<br> where at the last stage, UEFI code is called. So, each architecture uses its own very specific initial routines.<br> <br> Next, the Security (SEC) phase, Pre-EFI Initialization (PEI) phase, and then the Driver Execution Environment (DXE) phase<br> are executed in sequence.<br> In reality, those phases are really pretty complex, but for our purposes, it is not necessary to go into the details.<br> <br> During the DXE phase, EFI Byte Code drivers might be loaded from any firmware flash, or could come<br> from UEFI-compliant adapters. A device tree is built, for OSes that can use it.<br> <br> Lastly, the "Boot Device Selection" (BDS) takes place, and the system: <ul> <li> might be configured for "autoboot" to some OS.</li> <li>Or, the system might enter a "bootmenu"</li> <li>Or possibly enter the EFI "Shell".</li> </ul> With respect to the autoboot: just like with Open boot or Open firmware, NVRAM can store several variables,<br> like whether "autoboot" to some selected OS should take place.<br> <br> From this point on, it gets interesting for us. Please take a look at figure 3.<br> <br> <B>Fig 3. 
Simplified EFI boot sequence (on HP Itanium).</B><br> <br> <img src="diskdevices3.jpg" align="centre"/> <br> <br> UEFI is a set of specifications in technical documents, but what it will "look like" on a given architecture<br> also depends a bit on the manufacturer, I am afraid.<br> However, something called the <B>EFI System partition</B> (ESP) really is part of the EFI specs.<br> Note the existence of the EFI System partition in figure 3.<br> <br> <B>Example 1: HP on Itanium using EFI:</B><br> <br> Figure 3 tries to show the Itanium implementation, as is often used by HP.<br> After the EFI launch, you might enter a <B>bootmenu</B>, or a <B>Shell</B> where you can use a small commandset<br> like "cd", "map" etc..<br> The EFI System partition is indeed just a partition (likely to be on disk0) and it has a FAT filesystem type.<br> <br> Do you notice the "\EFI" main directory, and the subdirectories like for example "\EFI\Redhat"?<br> Each such subdir contains a specific "OS loader" for that specific OS.<br> You can see such loaders by entering the "Shell" (at the "fs0>" prompt) and navigating around a bit. Example loaders could be files like:<br> <br> <B>"\EFI\vms\vms_loader.efi"<br> <br> or<br> <br> "\EFI\redhat\elilo.efi"</B><br> <br> The actual kernel of an OS will reside on another partition, possibly a partition on another disksystem.<br> <br> This explains why here on Itanium, multiboot is possible. 
By the way, the already mentioned "bootmenu"<br> is very easy to use, and here you can also specify to which OS the system should autoboot.<br> <br> <B>Example 2: EFI and RedHat (RHEL 6):</B><br> <br> When not using EFI, RedHat boots using the "grub" bootloader, which goes through various "stages", and it will also<br> use the "/boot/grub/grub.conf" file, which helps to display a bootmenu from which (usually) various<br> kernel versions can be chosen to boot from.<br> <br> The EFI implementation is a bit like this: <br> The EFI System partition is now "/boot/efi/" of type vFAT, where in subdirectories the OS loaders of various<br> Operating systems may reside. For booting to RedHat, the OS Loader directory is "/boot/efi/EFI/redhat/".<br> This directory contains "grub.efi", which is a special GRUB compiled for the EFI firmware architecture.<br> If "autoboot" specifies this loader, then the system will boot to RedHat.<br> <br> This example is quite similar to example 1, but there are some minor differences per Manufacturer.<br> <br> This section was indeed a very lightweight discussion of the UEFI implementation. But it will help us later on.<br> Note that I skipped the GPT implementation (replacement of MBR) here. I think it's better not to do<br> "everything in one Bang", so I leave GPT for Chapter 3.<br> <br> <br> <font face="arial" size=2 color="red"> <h3> 2.4 Other firmware-like solutions.</h3> <font face="arial" size=2 color="black"> As we have seen, in many systems, the Kernel is "helped" in configuring the system by using a Device Tree.<br> This might be the case in Open Firmware, UEFI systems, and a few others.<br> <br> However, traditionally other methods are in use as well. There are so many architectures that it is<br> not really helpful to create all sorts of listings in this note. 
For example, for some architectures,<br> just a pre-stored file or blob is passed to the kernel when it boots.<br> <br> But of course Operating Systems <I>themselves</I> can scan buses and find devices. For example, "ioscan" and<br> "cfgmgr" (on some Unix systems) are commands which the sysadmin can use to scan<br> for new devices, which usually are new local disks or new LUNs from a SAN.<br> Some notes about that can be found in chapter 6.<br> <font face="arial" size=2 color="black"> <br> <br> <br> <font face="arial" size=2 color="blue"> <hr/> <h2 id="section3">Chapter 3. MBR & GPT.</h2> <hr/> <font face="arial" size=2 color="black"> Sections 3.1 (MBR) and 3.2 (GPT) are about "disk boot structures" on Intel architectures,<br> that are used by typical OSes on that platform, like Linux, VMWare, Windows and others.<br> <br> <font face="arial" size=2 color="red"> <h3> 3.1 The MBR.</h3> <font face="arial" size=2 color="black"> In Chapter 2, we discussed (lightly) the function of UEFI. But part of UEFI is a new boot sector structure,<br> called "GPT", as a follow up of the traditional MBR.<br> We did not discuss GPT there, because that chapter more or less focused on firmware and Device Trees.<br> Now, it is also time to discuss stuff like MBR and GPT.<br> <br> For PC systems (workstations & Servers) using BIOS, we can discuss the MBR first, and its role in booting the system,<br> as well as its limitations. Mind you: there are still countless Windows, Linux and other Servers/Workstations<br> out there using BIOS, instead of Open Firmware or UEFI.<br> <br> <I>So here are a few words on MBR...</I><br> <br> In very ancient times, Cylinder-Head-Sector (CHS) addressing was used to address sectors of a disk, but it was<br> rather quickly replaced by "Logical Block Addressing" (LBA), which uses a simple numbering scheme (0,1,2 etc..) 
of disksectors.<br> This could indeed be implemented in those days, thanks to newer logic on the diskcontroller, and the support of<br> BIOS int 13 and Enhanced BIOS implementations.<br> <br> In this scheme, we simply have one "linear address space", from LBA 0 to LBA N, and leave the details to the<br> onboard logic of the Controller.<br> <br> <font face="arial" size=2 color="brown"> Note: there is an interesting history in "geometry translation" methods, and various addressing limits of BIOS,<br> which explains why various partition size limits existed in the past, like the infamous 512M, 2G, 8G, 137G limits.<br> However, that's much too "lengthy", so I skip that here (it's also not very relevant).<br> <font face="arial" size=2 color="black"> <br> Now there are (or were) at least 2 problems:<br> <br> (1). Disk manufacturers already have gone (or want to go) from a fundamental sector size of 512 bytes to 4096 bytes.<br> <br> (2). The Traditional MBR (Master Boot Record) of a disk is 512 bytes in size. The MBR is located in Sector 0.<br> <br> The bootsequence of an OS through the MBR is most easily described using Windows. Not that it's very different<br> from another often used OS on Intel, like Linux, but a description of an MBR based Linux boot via a stage 1 "grub"<br> installed in the MBR is, I think, not very "opportune" at this point, so I will just discuss an MBR based Windows boot.<br> <br> The MBR starts with initial bootcode and some tiny errormessages (like 'Missing Operating System'), and this bootcode<br> has a length of 446 bytes. 
It's followed by the 64 byte "Partition Table", which supports 4 "partition entries" of 16 bytes each.<br> <br> One partition could be marked "active", and this then was a bootable partition containing the Windows OS bootloader.<br> So, the booting sequence in the MBR scheme was like this: <ul> <li>The initial bootcode of the MBR gets loaded, and reads the partition table.</li> <li>The active partition is found, and execution is transferred to the OS loader in that partition (like NTLDR).</li> <li>This OS loader then initiates the boot of Windows.</li> </ul> A very schematic "layout" of an MBR looks like this:<br> <br> <B>Fig.4: Schema of the MBR.</B><br> <br> <TABLE border=1 BGCOLOR=#81DAF5> <TR> <TD><font face="courier" size=2><B>-From starting byte 0 to byte 445 (incl)<br> -length: 446 bytes</B>:<br> Purpose: Initial bootcode (also for loading/reading<br> the partition table)<br> and some error messages</font></TD> </TR> <TR> <TD><font face="courier" size=2><B>-From starting byte 446 to byte 509 (incl)<br> -length: 64 bytes</B>:<br> Purpose: Partition Table<br> 4 Partition Entries each 16 bytes in length</font></TD> </TR> <TR> <TD><font face="courier" size=2><B>bytes 510 and 511 (2 bytes)</B>:<br> Purpose: 2 byte closing "Boot record signature"<br> with values: 55 AA</font></TD> </TR> </TABLE> <br> One problem will become clear in a moment. 
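<br> <br> As a small aside, the three regions of figure 4 can be sketched in a few lines of Python.<br> This is only an illustration under simple assumptions: a 512 byte sector is already in memory,<br> and the names "split_mbr", "has_boot_signature" and "make_blank_mbr" are made up for this note<br> (they do not come from any real tool).<br> <br>
<pre>
```python
# Sketch of the MBR layout from figure 4 (hypothetical helper names).
SECTOR_SIZE = 512                 # size of the traditional MBR sector
BOOTCODE_LEN = 446                # bytes 0..445: initial bootcode + messages
TABLE_LEN = 64                    # bytes 446..509: 4 entries of 16 bytes
SIGNATURE = bytes([0x55, 0xAA])   # bytes 510..511: boot record signature


def split_mbr(sector):
    """Split a 512 byte sector into (bootcode, partition_table, signature)."""
    if len(sector) != SECTOR_SIZE:
        raise ValueError("an MBR sector is exactly 512 bytes")
    table_end = BOOTCODE_LEN + TABLE_LEN          # 510
    bootcode = sector[0:BOOTCODE_LEN]
    partition_table = sector[BOOTCODE_LEN:table_end]
    signature = sector[table_end:SECTOR_SIZE]
    return bootcode, partition_table, signature


def has_boot_signature(sector):
    """Return True if the sector ends with the closing bytes 55 AA."""
    return split_mbr(sector)[2] == SIGNATURE


def make_blank_mbr():
    """Build an empty sector that only carries the closing signature."""
    return bytes(BOOTCODE_LEN + TABLE_LEN) + SIGNATURE
```
</pre>
Calling "has_boot_signature(make_blank_mbr())" should return True, since only the closing "55 AA" bytes are checked,<br> while a sector of 512 zero bytes would be rejected.<br> <br>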
A 16 byte Partition Entry has the following structure:<br> <br> <B>Fig.5: Schema of a Partition Entry in the MBR.</B><br> <br> <TABLE border=1 BGCOLOR=#81DAF5> <TR> <TD><font face="courier" size=2>Length (bytes):</TD> <TD><font face="courier" size=2>Content:</TD> </TR> <TR> <TD><font face="courier" size=2>1</TD> <TD><font face="courier" size=2>Boot Indicator (80h=active)</TD> </TR> <TR> <TD><font face="courier" size=2>3</TD> <TD><font face="courier" size=2>Starting CHS</TD> </TR> <TR> <TD><font face="courier" size=2>1</TD> <TD><font face="courier" size=2>Partition Type Descriptor</TD> </TR> <TR> <TD><font face="courier" size=2>3</TD> <TD><font face="courier" size=2>Ending CHS</TD> </TR> <TR> <TD><font face="courier" size=2>4</TD> <TD><font face="courier" size=2>Starting Sector</TD> </TR> <TR> <TD><font face="courier" size=2>4</TD> <TD><font face="courier" size=2>Partition size (Sectors)</TD> </TR> </TABLE> <font face="arial" size=2 color="black"> <br> The last 2 fields express the problem. For example, the "partition size" (in number of sectors) is 4 bytes (32 bits) long, so it<br> can have as a maximum value "FF FF FF FF" in hex, which is "4294967295" in decimal. So, when using a 512 byte sectorsize,<br> this amounts to 4294967295 x 512 = 2199023255040 bytes, or a maximum partition size of about 2.2 TB.<br> <br> Both the fact that only 4 partitions (not counting optional logical drives in an "extended" partition)<br> are possible, and this partition size limit, are by today's standards considered to be too restrictive.<br> <br> So, as you can read in a trillion other internet documents, the GUID Partition Table, or GPT, is the replacement<br> for the MBR.<br> <br> As we have seen in Chapter 2, UEFI is a new firmware interface for newer Intel machines, as a replacement<br> for the traditional BIOS.<br> <br> Also, the UEFI specifications suppose that the machine gets an "EFI System Partition". So.. 
what is the relation<br> with GPT, which is also a UEFI spec?<br> It seems complicated, <I>but it's really not !</I><br> <br> As it turns out, if you have a UEFI compliant machine, and you install a UEFI compliant OS,<br> then you get GPT <B>with</B> an "EFI System Partition".<br> <br> What makes it all a bit cloudy is that a GPT disk is actually sort of <B>"self describing"</B>, which has as a<br> consequence that you can use a GPT disk even with a BIOS based system, although with certain restrictions.<br> <br> <B>So, UEFI is actually not required for using a GPT disk.</B><br> <br> This will be explained better in the next section.<br> <br> Indeed, popular OSes on Intel like <B>VMWare, Linux distros, Windows</B> all used MBR in the past,<br> but later versions have switched to GPT.<br> <br> Later on, you will find a table listing those OSes, with respect to Version, 32/64 bit, UEFI/BIOS system,<br> and showing if they can use GPT.<br> <br> <br> <font face="arial" size=2 color="red"> <h3> 3.2 The GUID Partition Table (GPT).</h3> <font face="arial" size=2 color="black"> The GUID Partition Table is the follow up of the MBR. In section 3.1, we have seen how the MBR was structured,<br> how it was used in the bootsequence, and the limitations posed by the MBR and Partition Entries.<br> <br> Contrary to the simple and small MBR (only 512 bytes long), the GPT is a completely different thing.<br> <br> As said before, GPT is part of the UEFI specifications. 
In practice, the following statements are true:<br> <ul> <li>A system with UEFI firmware will natively use GPT based disks, and can boot from a GPT disk.</li> <li>An (older) BIOS based system can use GPT based disks as data disks, but generally cannot boot from a GPT disk.</li> <li>So, UEFI is not "per se" required for using GPT disks.</li> <li>Most newer releases of popular (Intel based) OSes are moving to UEFI and GPT disks (or are already UEFI based).</li> <li>A GPT based disk uses as the first sector (LBA 0) an MBR like structure, called <B>"the protective MBR"</B>,<br> which precedes the newer GPT implementation. It looks exactly the same as the old-fashioned MBR, but it was added<br> for several reasons, like protection from older tools like "fdisk" or legacy programs and utilities.</li> </ul> <br> A GPT is way larger than the old MBR. A GPT spans multiple LBAs. In fact, GPT reserves LBA 0 to LBA 33,<br> leaving LBA 34 as the first usable sector for a true Partition.<br> So, as from LBA 34, we can have a number of true partitions, that is, usable diskspace like C:, D: etc..<br> <br> But the "end" of the disk is special again! It holds a copy of the GPT, which can be used for recovery purposes.<br> <br> Just as we did in figure 4 for the MBR, let's take a look at a schematic representation of the GPT.<br> Since we can number sectors just by referring to LBA numbers, let's use that as well. 
So, if we use that<br> for example in the old MBR scheme, it that case we can then say that the MBR was in "LBA 0".<br> <br> <B>Fig.6: Simplified Schema of the GUID Partition Table.</B><br> <br> <TABLE border=1 BGCOLOR=#81DAF5> <TR> <TD><font face="courier" size=2>LBA 0</TD> <TD><font face="courier" size=2>Protective MBR</TD> </TR> <TR> <TD><font face="courier" size=2>LBA 1</TD> <TD><font face="courier" size=2>Primary GPT Header</TD> </TR> <TR> <TD><font face="courier" size=2>LBA 2</TD> <TD><font face="courier" size=2>Partition Table starts.<br> Partition Entries 1,2,3,4</TD> </TR> <TR> <TD><font face="courier" size=2>LBA 3 - 33</TD> <TD><font face="courier" size=2>Partition Entries 5-128</TD> </TR> </TABLE> <TABLE border=1 BGCOLOR=#FE9A2E> <TR> <TD><font face="courier" size=2>LBA 34 -LBA M</TD> <TD><font face="courier" size=2>Possible first true Partition (like C:)</TD> </TR> <TR> <TD><font face="courier" size=2>LBA M+1 - LBA N </TD> <TD><font face="courier" size=2>Possible Second true Partition (like D:)</TD> </TR> <TR> <TD><font face="courier" size=2>Other LBA's, except the last 33 LBA's</TD> <TD><font face="courier" size=2>Possible other partitions<br> up to the last LBA: END_OF_DISK -34</TD> </TR> </TABLE> <TABLE border=1 BGCOLOR=#81DAF5> <TR> <TD><font face="courier" size=2>END_OF_DISK - 33</TD> <TD><font face="courier" size=2>Partition Entries 1,2,3,4 (copy)</TD> </TR> <TR> <TD><font face="courier" size=2>END_OF_DISK - 32</TD> <TD><font face="courier" size=2>Partition Entries 5-128 (copy)</TD> </TR> <TR> <TD><font face="courier" size=2>END_OF_DISK - 1</TD> <TD><font face="courier" size=2>Secondary GPT Header (copy)</TD> </TR> </TABLE> <br> Please be aware that the LBA numbers (like the starting "LBA 34" for usable partitions), is not exactly<br> specified in the original specifications.<br> If you take the numbers like in the above table, like 128 byte sized Partition Entries, and "room"<br> for 5 up to 128 Partition Entries, only then you will end up in LBA 
34 as the starting point for usable data (128 entries x 128 bytes = 16384 bytes, which is 32 sectors of 512 bytes, following LBA 0 and LBA 1).<br> <br> In GPT, in a Partition Entry, the "partition size field" (in number of sectors/LBA's) is now 64 bits wide.<br> With 512 byte sectors, this amounts to a max partition size of about 9.4 zettabytes (about 9.4 billion terabytes), which is quite large indeed.<br> You can easily do that math yourself (if needed, take a look at the math in section 3.1).<br> <br> Note that the information in a GPT, or even an MBR, can be considered to form the <B>"metadata"</B> of a disk,<br> since both describe the structure of the disk (like "which" partition is at "what" location etc..).<br> <br> Remember the EFI System Partition (ESP) as described in section 2.3?<br> On a "native" UEFI system, it will be automatically created as one partition on the first disk.<br> So, it will usually be the first partition in the "orange area" as depicted in figure 6.<br> Depending on the Manufacturer, its size may vary somewhat, but it's typically a few hundred MB or so,<br> since it only needs to store the UEFI metadata and OS bootloaders (see also figure 3).<br> <br> Note:<br> <br> For OS'ses like Linux and Windows, it should be stressed that you should <B>not</B> use older GPT <B>unaware</B> tools<br> like older versions of "fdisk" and the like.<br> For example, on recent versions of Linux, GPT aware utilities like "parted" are used instead of the traditional "fdisk" tool.<br> <br> <br> <font face="arial" size=2 color="red"> <h3> 3.3 Some figures illustrating a BIOS/MBR boot, and a UEFI/GPT boot.</h3> <font face="arial" size=2 color="black"> <h4>3.3.1. BIOS and MBR:</h4> <B>Fig 7. Simplified example of a BIOS/MBR initiated boot of an "older" Windows system like Win2K3.</B><br> <br> <img src="diskdevices4.jpg" align="centre"/> <br> <br> In fig 7, we see an "old-fashioned style" boot of a once popular OS like XP, Win2K3 etc..<br> What we see here is the stuff we have talked about before.<br> <br> -The BIOS selects a bootable device.
Maybe it tries the DVD first, before going to harddisk0.<br> -Then, it loads the "initial bootcode" from the MBR, which will access the "partition table".<br> -Then it determines the specs of the "active" or "bootable" partition (for example, where it starts).<br> This partition could be, for example, partition No 1.<br> <br> -Control is then passed to the <B>OS loader</B> on that partition (in this example, "NTLDR").<br> -Next, NTLDR reads "boot.ini", which is a small file containing so-called "ARC" paths, which "point" to the partitions<br> containing Operating Systems.<br> For example, such an ARC path could point to partition No 2 (or, for example, to partition No 1 on another disk).<br> <br> -From then on, the boot sequence of that OS really starts.<br> <br> Now you may say: <I>"this is Windows!"</I> So how about another OS that also initially starts from BIOS/MBR, like Linux RHEL 5 or so?<br> Ok, the following figure is not so terribly different from fig 7, but I like to show it here anyway.<br> <br> <B>Fig 8. Simplified example of a BIOS/MBR initiated boot of an "older" Linux system like RHEL5.</B><br> <br> <img src="diskdevices5.jpg" align="centre"/> <br> <br> This note is not for discussing boot sequences in detail. However, the overall sequence of events is visible in figure 8.<br> Of course, the start of the <I>Linux OS as such</I> is depicted to happen as of Step 7 in figure 8.<br> At this point, those phases are not of much interest, so that's why they are not detailed further.<br> <br> <h4>3.3.2. UEFI using GPT:</h4> Actually, we have already discussed the UEFI boot in section 2.3. However, here I try to produce a figure that illustrates<br> a bit more of the role of GPT and the EFI System Partition.<br> Here, I will only show that picture. If you need more info on UEFI itself, you might check section 2.3.<br> So let's give it a try:<br> <br> <B>Fig 9.
Simplified example of a UEFI/GPT initiated boot.</B><br> <br> <img src="diskdevices6.jpg" align="centre"/> <br> <br> Ok, I myself would certainly not give an "A rating" to the figure above, but it should help "somewhat" in understanding UEFI boots.<br> <br> The firmware boot is described in section 2.3. At a certain moment, the UEFI bootmanager reads the GPT.<br> Since the GPT is "metadata", mainly about true partitions, partitions can be identified by their <B>Globally Unique Id</B>, the GUID.<br> <br> So, from the GPT, UEFI tries to identify the "EFI System Partition (ESP)" by its partition type GUID,<br> which should be C12A7328-F81F-11D2-BA4B-00A0C93EC93B.<br> <br> Once found, UEFI can locate the correct OS loader, in case autoboot is in effect.<br> <br> Contrary to the "native" MBR situation as shown in section 3.3.1, UEFI does <B>NOT</B> load "initial bootcode" from the GPT.<br> So, in this phase, no sort of bootsector with code is loaded.<br> Only metadata is read. The preboot is fully contained in UEFI itself.<br> <br> <h4>3.3.3. UEFI using MBR:</h4> Now, the following might "feel" a bit strange. In the discussion above, we have seen that UEFI more or less<br> expects GUID Partition Table metadata (GPT). Indeed, GPT is part of the UEFI specifications.<br> <br> However, manufacturers sometimes find smart ways (or maybe not so smart ways) to implement sort of hybrid forms.<br> Since the "EFI System Partition" (ESP) actually is <I>"just"</I> a partition, like any other partition (only with a special content),<br> it actually is possible to have a system with an ESP, using the "old MBR style" metadata.<br> In this case, in a Partition Entry in the MBR, the ESP can be identified by its partition type ID of value "0xEF".<br> <br> There actually exist (some might say, "weird") variants. It's possible to replace the original bootcode<br> in the standard MBR with a variant "which looks like EFI firmware".
This makes it even possible for non UEFI machines<br> to boot from GPT disks.<br> <br> We have already seen the limitations of the MBR. GPT based disks are becoming more and more the standard on Intel systems.<br> <br> Although deviating variants exist, I would say that it's important to remember that:<br> <ul> <li>Traditional BIOS systems use MBR bootcode and MBR partitioning metadata.</li> <li>Native UEFI machines (newer x64 and Itanium) use GPT as partitioning metadata. The only preboot code is from UEFI.</li> <li>Traditional BIOS systems using MBR can (easily) use GPT disks as data disks.</li> <li>And.. indeed it's possible to use MBR and an EFI System Partition.</li> </ul> <br> <br> <br> <font face="arial" size=2 color="blue"> <hr/> <h2 id="section4">Chapter 4. A VERY SHORT SECTION ON SOME SCSI TERMS.</h2> <hr/> <font face="arial" size=2 color="black"> This very short section will just focus on a few concepts related to SCSI, namely addressing and paths.<br> <br> Undoubtedly, with any OS using SCSI, you will encounter addressing paths like for example "[1:0:1:0]".<br> But what is it? It's really extremely simple.<br> <br> It's all about how to "reach" any "end device", that is: from which adapter, which bus, which target,<br> and lastly (and optionally) which LUN do we want to address?<br> <br> Sure, the SCSI protocol (and bus) has its complexities, but for this note we can keep it really that simple.<br> Don't forget that it is just a set of rules and commands. In other words: it's a protocol.<br> So, it also defines how to address devices (targets) and sub-devices (LUNs) on the bus.
That's all.<br> <br> Now, since a computer might have multiple SCSI cards, and since any card might have multiple ports (thus buses),<br> a "fully qualified" path to any subdevice goes like:<br> <br> SCSI language: <B>adapter#, channel#, scsi id, lun#</B><br> <br> and for example implemented in Unix/Linux<br> <br> Unix/Linux: <B>Host#, bus#, target#, lun#</B><br> <br> Fig. 10. SCSI controller with bus, targets.<br> <br> <img src="diskdevices9.jpg" align="centre"/> <br> <br> The figure above illustrates an old-fashioned "SCSI card", also called a "SCSI controller".<br> However, with respect to addressing of storage, the most common "name" for such a card is <B>"HBA"</B> (Host Bus Adapter).<br> <br> Now, with respect to <I>device addressing</I>, the way to address devices is not much different when you compare<br> a true physical SCSI bus to, for example, a fibre based FC SAN, which connects your system to SCSI disks.<br> <br> What does it look like?<br> <br> => Example 1: On Linux you might see stuff like:<br> <br> <font face="courier" size=2 color="blue"> -- list devices:<br> <br> [root@starboss tmp]# cat /proc/scsi/scsi<br> <br> Attached devices:<br> <B>Host: scsi0 Channel: 00 Id: 00 Lun: 00</B><br> ..Vendor: HP...Model: HSV210...Rev: 6220<br> ..Type:...RAID.................ANSI SCSI revision: 05<br> <B>Host: scsi0 Channel: 00 Id: 00 Lun: 01</B><br> ..Vendor: HP...Model: HSV210...Rev: 6220<br> ..Type:...Direct-Access........ANSI SCSI revision: 05<br> <B>Host: scsi0 Channel: 00 Id: 00 Lun: 02</B><br> ..Vendor: HP...Model: HSV210...Rev: 6220<br> ..Type:...Direct-Access........ANSI SCSI revision: 05<br> etc..<br> <br> -- list adapters (hosts) like FC HBA cards:<br> <br> [root@starboss ~]# ls -al /sys/class/scsi_host<br> total 0<br> drwxr-xr-x 9 root root 0 Sep 21 14:59 .<br> drwxr-xr-x 39 root root 0 Sep 21 14:59 ..<br> drwxr-xr-x 2 root root 0 Sep 21 14:59 host0<br> drwxr-xr-x 2 root root 0 Sep 21 14:59 host1<br> drwxr-xr-x 2
root root 0 Sep 21 14:59 host2<br> <br> <br> <font face="arial" size=2 color="black"> => Example 2: On VMWare, you might see stuff like:<br> <br> <font face="courier" size=2 color="blue"> # cd /vmfs/devices/disks<br> # ls vmh*<br> <br> vmhba0:0:0:0<br> vmhba0:0:41:0<br> vmhba0:0:53:0<br> <br> <br> <font face="arial" size=2 color="black"> => Example 3: On HPUX, you might see stuff like:<br> <font face="courier" size=2 color="blue"> <br> # ioscan -fnC ext_bus<br> <br> ext_bus..15..0/2/0/0.<B></B>...fcparray.CLAIMED..INTERFACE...FCP Array Interface<br> <br> Note the string in "bold" which is again a SCSI address, as a part of a full "hardware path".<br> <br> <br> <font face="arial" size=2 color="black"> If your system is connected to an FC- or iSCSI SAN, you will see one or more device names, probably also identifiers<br> like "1:0:2:0" as in example 1, or denoted slightly otherwise like for example "adaptername0:C0:T0:L0" as in example 2.<br> Note: in many SANs the "T" might refer to the "zoning target/number", but the idea stays quite the same.<br> <br> I hope it is clear that in most OS'ses, the address is always recognizable as:<br> <br> <B>"host#:bus#:target#:lun#"</B> or <B>"adapter#:channel#:scsiID#:lun#"</B><br> <br> This then represents the address of a LUN (or "disk") from an FC or iSCSI SAN.<br> <br> Now, you might sometimes observe several "0" (zeros) in such an identifier, but that comes from the fact that the first controller is "host0",<br> the first bus is "bus0" etc... Also, LUNs start numbering as of "0".<br> <br> Now, a "target" is the device the SCSI bus will address first.
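<br> <br>
The four-field form "host#:bus#:target#:lun#" described above is easy to take apart in a script. Below is a minimal Python sketch, for illustration only. The names "ScsiAddress" and "parse_hbtl" are made up for this note, and the sketch assumes purely numeric fields, so an identifier with an adapter name in front (like "vmhba0:0:41:0") is not handled.<br>
<pre>
```python
from typing import NamedTuple


class ScsiAddress(NamedTuple):
    """Hypothetical container for the four fields of a SCSI address."""
    host: int    # adapter (HBA) number
    bus: int     # channel (bus) number on that adapter
    target: int  # SCSI id of the target on that bus
    lun: int     # logical unit number "below" the target


def parse_hbtl(text: str) -> ScsiAddress:
    """Parse a string such as "[1:0:1:0]" or "1:0:2:0" into its four fields."""
    fields = text.strip().strip("[]").split(":")
    if len(fields) != 4:
        raise ValueError("expected host:bus:target:lun, got %r" % text)
    return ScsiAddress(*(int(field) for field in fields))


if __name__ == "__main__":
    addr = parse_hbtl("[1:0:1:0]")
    print("host", addr.host, "bus", addr.bus,
          "target", addr.target, "lun", addr.lun)
```
</pre>
Run as a script, this takes the example path "[1:0:1:0]" apart into host 1, bus 0, target 1 and LUN 0.<br> <br>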
Knowing the address of the target may be sufficient, since<br> the target might be the "end device" itself: it does not have any subdevices "inside".<br> Don't forget that there exist many sorts of devices which can be placed on a SCSI bus.<br> <br> However, in Storage, the target often is a complex device which manages many subdevices.<br> In such a case, <I>we do have subdevices</I> that need a subaddress too (the LUN numbers).<br> This is more or less how the situation often is, in addressing LUNs from a SAN.<br> You might have found one target#, with one or multiple LUNs "below" it.<br> <br> Note that it does not really matter if you would have SCSI commands on an old-fashioned SCSI bus, or if<br> those commands are just <B>simply encapsulated in a network packet</B> as in iSCSI: the principle stays quite the same.<br> <br> On a SCSI bus, as shown in figure 10, a path like for example [0:0:1:1] uniquely identifies an object on that Bus.<br> However, in a typical SAN situation, we have multiple Hosts, connected to switches, which themselves are connected<br> to Storage controllers. In this situation, we cannot uniquely identify for example an HBA on a Host.<br> If you have multiple Hosts, all with an HBA called "host0", you cannot uniquely identify communication partners.<br> That's why extended addressing has been applied in SANs, using so-called "WWPN addresses".<br> This will be the subject of the next chapter.<br> <br> <br> <br> <font face="arial" size=2 color="blue"> <hr/> <h2 id="section5">Chapter 5. "World Wide" Identifiers.</h2> <hr/> <font face="arial" size=2 color="black"> Suppose you would have a computersystem, with an HBA installed, which connects the system to a number<br> of local targets using a traditional SCSI bus (or channel).
Just like in figure 10.<br> <br> <B>In such a case, the addressing scheme of Chapter 4 would be fully sufficient, because in that example, any device can<br> be uniquely addressed (and identified).</B><br> <br> Now, in contrast, imagine a number of Servers connected to switches, while those switches themselves are connected<br> to SANs.<br> In such a case, ServerA has an HBA called "host0", but the same is true for any other Server like ServerB.<br> So, ServerB might have an HBA called "host0" too.<br> <br> You see?<br> <br> <B> In a larger environment we need an "extended addressing scheme", in order to be able to really distinguish<br> between the different HBA's on the different Servers, and all other devices like the switch ports involved, the SAN ports involved etc..<br> </B> <br> <font face="arial" size=2 color="brown"> Note:<br> Before we go any further, well known other examples are close at hand. For example, you probably know that any netcard<br> is supposed to have a unique MAC address, to identify it uniquely on the network. Also, just look at the internet:<br> any Host is supposed to have a unique IP address just to make sure that the "sender" and the "recipient" of data are uniquely defined.<br> <br> Key point is: in any network, every device should be able to be identified uniquely, which is not possible if for example<br> only names like "host0" are used.
Questions like "which host0 (hba) on which Server" would arise immediately.<br> <br> <font face="arial" size=2 color="black"> That's why globally unique "World Wide" ID's were introduced.<br> <br> <br> <font face="arial" size=2 color="red"> <h3>5.1 Addressing in Fibre Channel:</h3> <font face="arial" size=2 color="black"> Just like in networking, where on a subnet the networkcard MAC addresses of all devices are essential for communication,<br> the same idea has been applied in FC SAN networks.<br> <br> <ul> <li>Every "device" like an HBA (on a Host), or SAN controller, has its unique <B>WWNN</B> (World Wide Node Name).</li> <li>But, a device might have one, or even multiple "ports". That's why, associated with the device's WWNN, <U>each port</U><br> has its derived <B>WWPN</B> (World Wide Port Name).</li> <li>A SAN controller/filer (managing diskarrays) has a WWNN, and it has one or more ports too, each with its own WWPN.</li> <li>Ultimately, traffic goes from WWPN to WWPN (Host to/from SAN) with optional switches in between.</li> </ul> The format of a WWPN:<br> <br> A WWPN is sort of the "MAC equivalent" in a SAN network. It's an 8 byte identifier.<br> Here is an example WWPN: "2135-3900-f027-6769". This could also be denoted without the hyphens, like "21353900f0276769",<br> or with a ":" between each byte (two digits), like "21:35:39:00:f0:27:67:69". <br> <br> Some organization has to keep an "eye" on the possible WWPNs, in order to guarantee uniqueness. This is the IEEE.<br> <br> When a Manufacturer wants to produce FC hardware, it has to register with the IEEE for a 3 byte "OUI" identifier, which, when granted,<br> will be part of the WWPNs in all their products. This OUI is then unique per Manufacturer.<br> <br> There is a certain structure in any WWPN.
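<br> <br>
Since the same WWPN can be written in the three notations mentioned above (with hyphens, without separators, or with a ":" between each byte), it can be handy to bring them to one form before comparing them. Below is a minimal Python sketch, for illustration only. The function names "normalize_wwpn" and "colon_form" are made up for this note, and the sketch only checks that 16 hex digits (8 bytes) are present; it does not try to decode the OUI or any other internal structure.<br>
<pre>
```python
import string


def normalize_wwpn(text: str) -> str:
    """Return a WWPN as 16 lowercase hex digits, without separators."""
    digits = text.strip().lower().replace("-", "").replace(":", "")
    if len(digits) != 16 or any(ch not in string.hexdigits for ch in digits):
        raise ValueError("not an 8 byte WWPN: %r" % text)
    return digits


def colon_form(text: str) -> str:
    """Return the WWPN with a ':' between each byte (two hex digits)."""
    digits = normalize_wwpn(text)
    return ":".join(digits[i:i + 2] for i in range(0, 16, 2))


if __name__ == "__main__":
    samples = ("2135-3900-f027-6769",
               "21353900f0276769",
               "21:35:39:00:F0:27:67:69")
    for sample in samples:
        # All three notations describe the same 8 bytes.
        print(sample, "is the same port as", colon_form(sample))
```
</pre>
All three sample strings come out as "21:35:39:00:f0:27:67:69", so they can be compared as plain strings after normalization.<br> <br>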
A WWPN is derived from its parent WWNN (of the device).<br> <br> Let's see how to find WWPNs on a few example platforms:<br> <br> <br> <font face="arial" size=2 color="red"> <h3>5.2 A few examples of finding WWPNs on some example platforms:</h3> <font face="arial" size=2 color="black"> <B><U>Example 1: Linux:</U></B><br> <br> On Linux, people often use the "QLogic" adapters (qla drivers) or "Emulex" adapters (lpfc driver).<br> <br> You might try:<br> <br> (1):<br> # ls -al /proc/scsi/adapter_type/n<br> <br> Where adapter_type is the host adapter type and n is the host adapter number for your card.<br> <br> (2):<br> If, for your adapter, the more modern "sysfs" registration has been implemented, browse around in "/sys/class/scsi_host/hostn" or subdirs,<br> or in "/sys/class/fc_host". In the latter, you will then probably find an entry called "port_name".<br> <br> (3):<br> Or take a look in "/var/log/messages", since we should see the adapter modules being loaded at boottime:<br> <br> <font face="courier" size=2 color="blue"> # cat /var/log/messages<br> ...<br> Dec 15 09:40:10 stargate kernel: (scsi): Found a QLA2200 @ bus 1, device 0x1, irq 20, iobase 0x2300<br> ..<br> Dec 15 09:40:10 stargate kernel: scsi-qla1-adapter-node=200000e08b02e534;<br> Dec 15 09:40:10 stargate kernel: scsi-qla1-adapter-port=210000e08b02e534;<br> Dec 15 09:40:10 stargate kernel: scsi-qla1-target-0=5005076300c08b1f;<br> ..<br> <br> <font face="arial" size=2 color="black"> <br> <B><U>Example 2: AIX</U></B><br> <br> It depends a bit on which driver stack you have loaded (like SDD), but here are a few examples, just for illustrational purposes:<br> <br> <font face="courier" size=2 color="blue"> # datapath query wwpn<br> <br> Adapter Name....PortWWN<br> fscsi0..........10000000C94F91CD<br> fscsi1..........10000000C94F9923<br> <br> or<br> <br> # lscfg -vl fcs0<br> <br> fcs0...............U7879.001.DQDKCPR-P1-C2-T1..FC Adapter<br> <br> ...Part Number.................03N6441<br> ...EC
Level....................A<br> ...Serial Number...............1D54508045<br> ...Manufacturer................001D<br> ...Feature Code................280B<br> ...FRU Number.................. 03N6441<br> ...Device Specific.(ZM)........3<br> ...Network Address.............10000000C94F91CD<br> ...ROS Level and ID............0288193D<br> <br> <I>A lot of other output omitted...</I><br> <font face="arial" size=2 color="black"> <br> <br> <B><U>Example 3: Windows</U></B><br> <br> Again, it depends a bit on which driver and FC Card you use. Maybe, with your Card, some utility was provided too.<br> Here is just an example for illustrational purposes:<br> <br> <font face="courier" size=2 color="blue"> C:\> fcquery<br> <br> com.emulex-LP9002-1: PortWWN: 10:00:00:00:c8:22:d0:18 \\.\Scsi3: <br> <br> or this PowerShell cmdlet might work on your system:<br> <br> PS C:\> Get-HBAWin -ComputerName IPAddress | Format-Table -AutoSize<br> <br> See also: "http://gallery.technet.microsoft.com/scriptcenter/Find-HBA-and-WWPN-53121140"<br> <br> <br> <font face="arial" size=2 color="red"> <h3>5.3 A conceptual representation of SAN connections:</h3> <font face="arial" size=2 color="black"> Now that we know that ultimately, an HBA port identified by its WWPN connects to a Storage controller port identified by its WWPN,<br> let's see if we can capture that in a simple figure:<br> <br> Fig. 11. Host-SAN communication<br> <br> <img src="diskdevices10.jpg" align="centre"/> <br> <br> While the above sketch focuses on the importance of WWPNs in communication, of course in most SANs,<br> <B>multiple</B> Hosts are connected (through one or more switches) to a SAN controller.<br> Maybe it's nice to show such a figure as well. In the figure below, the left side is a highlevel<br> view of such a SAN.<br> <br> Fig. 12.
Sketch of Hosts to FC SAN connections, and NAS network communication.<br> <br> <img src="diskdevices11.jpg" align="centre"/> <br> <br> <br> <br> <br> <font face="arial" size=2 color="blue"> <hr/> <h2 id="section6">Chapter 6. BLOCK IO, FILE IO, AND PROTOCOLS.</h2> <hr/> <font face="arial" size=2 color="black"> <br> Local disks or local (SCSI) disk arrays work at the block I/O layer, below the file system layer.<br> The same is true for LUNs exposed by iSCSI or FC SANs. For the client OS, they "just look like" local disks<br> and are also accessed by block IO services.<br> <br> In contrast, unix-like NFS network mounts, or Microsoft SMB/CIFS shares, operate at the file level, over the network.<br> File IO commands will see to it that the client OS gets access to the data on those shares.<br> <br> <B><U>1. Block IO with traditional FC and iSCSI SANs:</U></B><br> <br> When your Host is connected to (what we now see as) a traditional FC SAN, your Server might have one or more<br> HBA Fibre cards, which connect to one or more switch(es), which are then further connected to the Storage arrays.<br> <br> Typically, the elements associated with the transfer of data are <B>"block address spaces" and "datablocks"</B>, and that's why<br> people talk about <B>"block I/O services"</B> when discussing (traditional) SAN's.<br> So, here SCSI block-based protocols are in use, over Fibre Channel (FC).<br> <br> The same is true in iSCSI, where transfer of block data to hosts occurs, using the SCSI protocol over TCP/IP.<br> <br> <B><U>2. File IO with Shares, Exports, and NAS:</U></B><br> <br> In a network, there might exist "fileshares" exposed by file/print servers (Microsoft), or NFS Servers (Unix/Linux).<br> Here, your redirector client software "thinks" in terms of "files" that it wants to retrieve from, or store on, the Server.<br> This network is just a normal network that we all know of.
Of course, a file will be transferred by the network protocol, meaning<br> that data segments are enveloped by more or less regular network packets, like in any other normal TCP/IP network.<br> Since client and Server <I>think in terms of whole files</I>, people speak of File IO Services.<br> <br> Two main Redirector/Server protocols are often used: CIFS (the well-known SMB from Microsoft) and NFS (Unix/Linux world).<br> <br> <B><U>3. Network Attached Storage (NAS):</U></B><br> <br> A NAS device actually has all features of a regular FileServer. So, it's often used as a CIFS and/or NFS based Server.<br> It has some features like a SAN, since a true NAS device is also a device with a controller and Diskarray(s), thus resembling a SAN.<br> Obviously, a NAS device has network ports which connect it to networkdevices (like a switch).<br> However, NAS comes in a wide variety of forms and shapes.<br> <br> <br> Please take a look again at figure 12. Here, the left side tries to depict a traditional FC SAN, while the part<br> on the right side shows a NAS device that's placed in a "regular" network.<br> <br> <B><U>4. Modern SAN controllers/filers:</U></B><br> <br> Many modern controllers (or filers) have options for FC SAN, and iSCSI SAN implementations.<br> Obviously FC uses primarily FC cards and FC connectors. However, iSCSI just uses network controllers.<br> And, modern filers have options to expose the device as a NAS (CIFS and/or NFS Server) on the network as well.<br> <br> <br> Some folks used to say: <B>"if it is Block I/O then it is a SAN, if it is File I/O then it is a NAS"</B> <br> Nowadays, this is still true, but as we saw above, modern SANs have options to expose themselves (partly) as a NAS too.<br> <br> <B><U>5. Some main types of SANs:</U></B><br> <br> A number of implementations exist.
The most prominent ones are:<br> <ul> <li>FC-SAN or Fibre Channel Protocol (FCP), usually using a switched Fibre infrastructure, can be seen as a mapping of SCSI over Fibre Channel.</li> <li>iSCSI SAN, using a network infrastructure, can be seen as a mapping of SCSI over a TCP/IP network.</li> <li>Fibre Channel over Ethernet (FCoE). This carries Fibre Channel frames over an Ethernet network.</li> </ul> In Europe, especially FC-SANs and iSCSI SANs are popular, while recently a renewed interest seems to exist in FCoE SANs.<br> <br> Many other sorts of "storage access" implementations exist, especially for remote storage access. Some of those have features that a regular SAN has too,<br> like for example "FICON" in IBM mainframe storage technologies. However, it's not always called a "SAN".<br> <br> <B><U>6. Note on LUNs:</U></B><br> <br> "NAS" devices, and FileServers, primarily expose "shares" (like Microsoft SMB/CIFS, or NFS on Unix/Linux), where "File IO"<br> is implemented. So, the "unit" that ultimately gets transferred (using network packets) is a file.<br> <br> A LUN exposed by an FC- or iSCSI SAN (or FCoE SAN) is a bit different. For the Host (the client), the LUN acts just as if<br> <B>it were a local disk</B> (which it's not, of course).<br> So, once a LUN is discovered on a Host, you can format it and create a filesystem, like for example creating a G: drive in Windows,<br> or creating the "/data" filesystem on a Unix/Linux system.<br> Note that this is different from NAS. On a NAS, the Host (client) only accesses the storage, but it does not format it, and<br> it does not create some sort of preferred filesystem on that NAS based storage.<br> <br> How the LUN physically is organized on the SAN is often not known, except to Storage Admins. For example, it could be a "slice", striped over 6 physical disks<br> using some RAID implementation.
The more spindles (disks) are used to support the LUN, usually, the better the performance (IOPS)<br> will be.<br> <br> So, although most often LUNs are used for filesystems (on which files and directories can be created in the usual way), sometimes<br> a LUN is used as a "raw" device. An example can be Oracle "ASM", where the LUN is not directly used by the Host Operating System:<br> the Oracle ASM IO services have formatted it to their proprietary layout, and only ASM "knows" the details on how to access the data on that LUN.<br> <br> <br> <br> <font face="arial" size=2 color="blue"> <hr/> <h2 id="section7">Chapter 7. A FEW NOTES ON VMWARE.</h2> <hr/> <font face="arial" size=2 color="black"> A VMWare ESX / ESXi host, which is a "bare metal" or "physical" machine, is the "home" for a number of "Virtual Machines" (VMs).<br> There are quite some differences between ESX 3.x, and ESXi 4/5, also with respect to storage implementations.<br> <br> Across all versions, the following common thread can be discerned.<br> <br> You know that an ESX(i) Host essentially runs Virtual Machines (like most notably Windows Servers, Linux Servers).<br> These VMs have one or more "disks", just like their "bare metal" cousins.<br> <br> However, most cleverly, such a "systemdisk" (like C: of a Windows Server) actually is a ".vmdk" file on<br> some storage system. Of course there are a few supporting files as well, but suppose we have a Windows VM called "goofy",<br> then somewhere (on some storage) the systemdisk of this machine is just contained in the vmdk file "goofy-flat.vmdk", which<br> typically could have a size of 20GB or so.<br> <br> VMWare uses a "datastore" for storage of such files of the VM's. Such a datastore often is<br> a "VMFS (VMware File System) datastore", which could be on local disks of the ESXi Host, or the datastore can be found<br> on an NFS filesystem.
This NFS storage can be "exposed" by a NAS, or a SAN acting as NAS, or another NFS Server.<br> <br> However, the datastore could also be found on LUNs from an FC SAN or iSCSI SAN.<br> <br> So, a VM uses one or more "common disks", which are just .vmdk files, for example stored on NFS.<br> For a Windows VM, such a disk is the local systemdrive C:, and possibly other drives like D:, all stored in their .vmdk files.<br> <br> However, a VM might <B>also use RDM (Raw Device Mapping)</B>. These are <B>LUNs</B>, traditionally stored in an FC SAN,<br> and nowadays often on iSCSI SANs too.<br> <br> These RDM's are not the common disks (like the system disk of a Windows VM), but are those shared storage areas often found<br> in clustered solutions, on which for example the SQL Server database files reside.<br> <br> <font face="arial" size=2 color="brown"> <B>Common Scenario:<br> <br> So, a common scenario in VMWare was that the "system drives" of the VMs were stored as .vmdk files in some datastore.<br> <br> Those VMs that needed it, like VMs in an "MSCS/SQL Server cluster", also used RDMs on a SAN, for the<br> (shared) storage of SQL Server database files, and other shared resources needed for the cluster.<br> <br> This is still a very common scenario, but in the latest versions, different scenarios are possible too.<br> </B> <font face="arial" size=2 color="black"> <br> Although a VMWare host can connect to SANs from various vendors, it's just a fact that "NetApp" SANs are very popular.<br> <br> Fig. 13. Traditional VMWare hosts and NetApp storage solution.<br> <br> <img src="diskdevices7.jpg" align="centre"/> <br> <br> In the "sketch" above, we see three VM's, each with their own "virtual machine disk" (vmdk). Such a vmdk disk file<br> can represent their local system disk (like C: on a Windows VM).<br> Only the second VM (VM2) uses LUNs from a SAN.
In this example, it's a NetApp SAN.<br> <br> Nowadays, for remote Hosts (like an ESXi host) to connect to a SAN, not only Fibre Channel (FC) is used,<br> but iSCSI and various forms of NFS/NAS implementations have gained popularity too.<br> <br> <br> <h4>Disks in VMWare:</h4><br> <B>&#8658; VMDK files (system disk and optionally other disks of the VM):</B><br> <br> As said before, VM's use ".vmdk" files for their systemdisk, and optionally other disks.<br> Of course, such a VM might <I>additionally</I> also use a LUN from an FC SAN, or iSCSI SAN, or other LUN provider.<br> <br> Let's first see how the VM's systemdisks (their .vmdk file and other files) are organized.<br> <br> Note: The sample commands below are just part of a full procedure, and cannot be used in isolation.<br> They are listed for illustrational purposes only.<br> <br> VMWare uses the concept of "datastore", which can be viewed as a "storage container" for files.<br> The datastore could be on a local Host hard drive, on NFS, or (nowadays) even on an FC or iSCSI SAN.<br> "Inside" the datastore, you will find the virtual machine's .vmdk files and other files (like a .vmx configfile).<br> Here, our aim is to find out which files make up a VM's systemdisk. But, let's first create a datastore.<br> <br> You can use graphical tools to create a new datastore (like vSphere client), or you can use a commandline.<br> In VMWare, using graphical tools while connected to a central Management Server is the best way to go.<br> <br> However, for illustrational purposes, a typical command to create a datastore locally on an ESXi host,<br> while having a session to that host, looks similar to this:<br> <br> <font face="courier" size=2 color="blue"> # vmkfstools -C vmfs5 -b 1m -S Datastore2 /vmfs/devices/disks/naa.60811234567899155456789012345321:1<br> <font face="arial" size=2 color="black"> <br> Now, let's suppose that we have already created several VM's.
What do the VM files "look like"?<br> Here too, as a preliminary, it is important that the VM's were properly registered in the repository of "vCenter" (or "VirtualCenter").<br> Again, using the vSphere client, those actions are really easy and it makes sure the VM's are properly registered.<br> Here, just "browse" the correct datastore, locate the correct .vmx file, and choose "register". Very easy indeed.<br> <br> For illustrational purposes, if having a session to an ESXi Host, a commandline action might resemble this:<br> <br> <font face="courier" size=2 color="blue"> # vim-cmd solo/registervm /vmfs/volumes/datastore2/vm1/VM1.vmx<br> <font face="arial" size=2 color="black"> <br> You might say that the VM, when it is shut down, "exists" solely as the files located in the datastore.<br> The files can be found in the VM's "homedirectory".<br> <br> Actually, there are a number of files, with different extensions, and different purposes.<br> Suppose that our VM is named "VM1", then here is a typical listing of the most prominent files:<br> <br> <B>Fig.12: Some files that make up a VM in VMWare.</B><br> <font face="courier" size=2 color="blue"> <br> <TABLE border=1 BGCOLOR=#81DAF5> <TR> <TD><font face="courier" size=2>vm1.nvram</TD> <TD><font face="courier" size=2>The "firmware" or "BIOS" as presented to the VM by the Host.</TD> </TR> <TR> <TD><font face="courier" size=2>vm1.vmx</TD> <TD><font face="courier" size=2>Editable configuration file with specific settings for this VM<br> like amount of RAM, nic info, disk info etc..</TD> </TR> <TR> <TD><font face="courier" size=2>vm1-flat.vmdk</TD> <TD><font face="courier" size=2>The full content of the VM's "harddisk".</TD> </TR> <TR> <TD><font face="courier" size=2>vm1.vswp</TD> <TD><font face="courier" size=2>Swapfile associated with this VM.</TD> </TR> <TR> <TD><font face="courier" size=2>-rdm.vmdk</TD> <TD><font face="courier" size=2>If the VM uses SAN LUNs, this is a proxy .vmdk for a RAW Lun.</TD> </TR> <TR> <TD><font
face="courier" size=2>.log files</TD> <TD><font face="courier" size=2>Various log files exist that record VM activity, usable for troubleshooting.<br> The current one is called vmware.log, and a number of former log files are retained.</TD> </TR> </TABLE> <br> <font face="arial" size=2 color="black"> This list is far from complete, but it's enough to get an idea of how it's organized.<br> <br> <br> <br> <font face="arial" size=2 color="blue"> <hr/> <h2 id="section8">Chapter 8. A FEW NOTES ON NETAPP.</h2> <hr/> <font face="arial" size=2 color="black"> <font face="arial" size=2 color="red"> <h3>8.1 A quick overview.</h3> <font face="arial" size=2 color="black"> NetApp is the name of a company that delivers a range of popular SAN solutions, from small to large.<br> <br> It's not really possible to "capture" the solution in just a few pages. People attend training courses for a good reason:<br> the product range is very wide, and technically complex. Implementing an optimally configured SAN is a real challenge.<br> So, this chapter does not even scratch the surface, I am afraid.
However, to get a high-level impression, it should be OK.<br> <br> Essentially, a high-level description of "NetApp" is like this:<br> <ul> <li>A controller, called the <B>"Filer"</B> or <B>"FAS"</B> (NetApp Fabric-Attached Storage), functions as the managing device for the SAN.</li> <li>The Filer runs the "Ontap" Operating System, a unix-like system, which has its roots in FreeBSD.</li> <li>The Filer manages "disk arrays", which are also called "shelves".</li> <li>It uses a "unified" architecture, that is, from small to large SANs, it's the same Ontap software, with the<br> same CL, tools, and methodology.</li> <li>Many features in NetApp/Ontap must be separately licensed, and the list of features is very impressive.</li> <li>There is a range of SNAP* methodologies which allow for very fast backups, replication of Storage data to another controller and its shelves,<br> and much more, not mentioned here. But we will discuss Snapshot backup Technology in section 8.4.</li> <li>The storage itself uses the WAFL filesystem, which is more than just a "filesystem". It was probably inspired by "FFS/Episode/LFS",<br> resulting in "a sort of" Filesystem with "very" extended LVM capabilities.<br> </li> </ul> <br> Fig. 14. SAN: Very simplified view on connection of the NetApp Filer (controller) to disk shelves.<br> <br> <img src="diskdevices8.jpg" align="centre"/> <br> <br> In the "sketch" above, we see a simplified model of a NetApp SAN.<br> Here, the so-called "Filer", or the "controller" (or "FAS"), is connected to two disk shelves (disk arrays).<br> Most SANs, like NetApp, support FC disks, SAS disks, and (slower) SATA disks.<br> For quite some time now, NetApp has favoured putting SAS disks in their shelves.<br> <br> If the Storage Admin wants, he or she can configure the system to act as a SAN and/or as a NAS, so that it can provide storage using either<br> file-based or block-based protocols.<br> <br> The picture above is extremely simple.
Often, two Filers are arranged in a clustered solution, with multiple paths<br> to multiple disk shelves. This would then be an HA solution using a "Failover" methodology. <br> So, suppose "netapp1" and "netapp2" are two Filers, each controlling their own shelves. Then if netapp1 were to fail for some reason,<br> the ownership of its shelves would go to the netapp2 filer.<br> <br> <br> <font face="arial" size=2 color="red"> <h3>8.2 A conceptual view on NetApp Storage.</h3> <font face="arial" size=2 color="black"> Note from figure 14 that, if a shelf is on port "0a", the Ontap software identifies individual disks by the port number and the disk's SCSI ID, for example "0a.10", "0a.11", "0a.12" etc.<br> <br> <br> Identifiers of this sort are used in many Ontap prompt (CL) commands.<br> <br> But first it's very important to get a notion of how NetApp organizes its storage. Here we will show a very high-level<br> conceptual model.<br> <br> Fig. 15. NetApp's organization of Storage.<br> <br> <img src="diskdevices12.jpg" align="centre"/> <br> <br> The most fundamental level is the <B>"Raid Group" (RG)</B>. NetApp uses "RAID4", or "RAID-DP" (a RAID6-like scheme with double parity, kept on two parity disks),<br> which is of course the more robust option. It's possible to have one or more Raid Groups.<br> <br> An <B>"Aggregate"</B> is a logical entity, composed of one or more Raid Groups.<br> Once created, it fundamentally represents <I>the</I> storage unit.<br> <br> If you want, you might say that an aggregate "sort of" virtualizes the real physical implementation of RGs.<br> <br> Ontap will create RGs for you "behind the scenes" when you create an aggregate. It uses certain rules for this,<br> depending on disk type, disk capacities and the number of disks chosen for the aggregate.
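<br> <br> As a rough illustration of such a rule, here is a minimal Python sketch. It assumes that Ontap simply fills one Raid Group up to a fixed maximum size before it starts the next one, and that this maximum is 16 disks. Both the rule and the names used are hypothetical simplifications, since the real rules also depend on disk type and capacity.<br>
<pre>
```python
def raid_group_count(disks, max_rg_size=16):
    """Number of RAID groups needed for an aggregate of `disks` disks.

    Assumption (hypothetical, simplified rule): one RAID group is filled
    up to max_rg_size disks before the next one is started.
    """
    # Ceiling division without importing math.
    return -(-disks // max_rg_size)


def data_disks(disks, max_rg_size=16, parity_per_rg=2):
    """Disks left for data, assuming two parity disks per group (RAID-DP style)."""
    return disks - raid_group_count(disks, max_rg_size) * parity_per_rg


for n in (16, 32, 40):
    print(n, "disks:", raid_group_count(n), "RG(s),", data_disks(n), "data disks")
```
</pre>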
So, you could end up with one or more RGs<br> when creating a certain aggregate.<br> <br> As an example, for a certain default setup:<br> <br> - if you create a 16 disk aggregate, you end up with one RG.<br> - if you create a 32 disk aggregate, you end up with two RGs.<br> <br> It's quite an art to get the arithmetic right. How large do you create an aggregate initially? What happens if additional spindles<br> become available later? Can you then still expand the aggregate? What is the ratio of usable space compared to what gets reserved?<br> <br> You see? When architecting these structures, you need a lot of detailed knowledge and a large amount of planning.<br> <br> A <B>FlexVol</B> is the next level of storage, "carved out" from the aggregate. The FlexVol forms the basis for "real" usable stuff, like<br> LUNs (for FC or iSCSI), or CIFS/NFS shares.<br> <br> From a FlexVol, CIFS/NFS shares or LUNs are created.<br> <br> A LUN is a logical representation of storage. As we have seen before, it "just looks" like a hard disk to the client.<br> From a NetApp perspective, it looks like a file inside a volume.<br> The true physical implementation of a LUN on the aggregate is a "stripe" over N physical disks in RAID-DP.<br> <br> Why would you choose CIFS/NFS or (FC/iSCSI) LUNs? It depends on the application. If you need a large share, then the answer is obvious.<br> Also, some Hosts really need storage that acts like a local disk, and on which SCSI <B>reservations</B> can be placed (as in clustering).<br> In this case, you obviously need to create a LUN.<br> <br> Since, using NetApp tools, LUNs are sometimes represented (or shown) as "files", the entity "qtree" becomes meaningful too.<br> It's analogous to a folder/subdirectory.
So, it's possible to "associate" LUNs with a qtree.<br> Since a qtree has the properties of a folder too, you can associate NTFS or Unix-like permissions with all<br> objects associated with that qtree.<br> <br> <br> <font face="arial" size=2 color="red"> <h3>8.3 A note on tools.</h3> <font face="arial" size=2 color="black"> There are a few very important <B>GUI or web-based tools</B> for a Storage Admin, for configuring and monitoring their Filers and Storage.<br> Once "FilerView" (deprecated on Ontap 8) was great, and followup versions like "OnCommand System Manager" are probably indispensable too.<br> <br> These types of GUI tools allow for monitoring, and creating/modifying all entities as discussed in section 8.2.<br> <br> It's also possible to set up an "ssh" session through a network to the Filer, and it also has a serial "console" port for direct communication.<br> <br> <br> There is a very strong "command line" (CL) available too, which has a respectable "learning curve".<br> <br> Even if you have a very strong background in IT, nothing in handling a SAN of a specific Vendor is "easy".<br> Since, if a SAN is in full production, almost <B>all vital data</B> of your Organization is centered on the SAN, you cannot afford any mistakes.<br> Being careful and not taking any risks is a good quality.<br> <br> There are hundreds of commands. Some are "pure" unix shell-like, like "df" and many others.
But most are specific to Ontap, like "aggr create"<br> and many others to create and modify the entities as discussed in section 8.2.<br> <br> If you want to be "impressed", here are some links to "Ontap CL" references:<br> <br> <a href="http://support.netapp.com/NOW/public/knowledge/docs/ontap/rel732/pdfs/ontap/210-04499.pdf">Ontap 7.x mode CL Reference</a><br> <a href="http://contourds.com/uploads/file/Netapp_ResourceCenter2.pdf">Ontap 8.x mode CL Reference</a><br> <br> <br> <font face="arial" size=2 color="red"> <h3>8.4 A note on SNAPSHOT Backup Technology.</h3> <font face="arial" size=2 color="black"> One attractive feature of NetApp's storage is the range of SNAP technologies, like the usage of SNAPSHOT backups.<br> You can't talk about NetApp without dealing with this one.<br> <br> From Raid Groups, an aggregate is created. From an aggregate, FlexVols are created. From a FlexVol, a NAS (share) might be created,<br> or LUNs might be created (accessible via FCP/iSCSI).<br> <br> Now, we know that NetApp uses the WAFL "filesystem", and it has its own "overhead", which will diminish your total usable space.<br> This overhead is estimated to be about 10% per disk (not reclaimable). It's partly used for <B>WAFL metadata</B>.<br> <br> Apart from "overhead", several additional <B>"reservations"</B> are in effect.<br> <br> When an aggregate is created, by default "reserved space" is defined to hold optional future "snapshot" copies.<br> The Storage Admin has a certain degree of freedom in choosing the size of this reserved space, but in general it is advised<br> not to set it too low. As a guideline (and default), often a value of 5% is "postulated".<br> <br> Next, it's possible to create a "snapshot reserve" for a FlexVol too.<br> Here the Storage Admin has a certain degree of freedom as well. NetApp generally seems to indicate that a snapshot<br> reserve of 20% should be applied.
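<br> <br> To get a feeling for how these percentages add up, here is a small Python sketch. It assumes, purely for illustration, one FlexVol spanning the whole aggregate, and that the three figures mentioned above (about 10% WAFL overhead, a 5% aggregate snapshot reserve and a 20% volume snapshot reserve) are simply applied one after the other. The function name, the 1000 GB starting value and that stacking order are assumptions, not NetApp's official sizing method.<br>
<pre>
```python
def usable_space(raw_gb, wafl_overhead=0.10, aggr_snap_reserve=0.05,
                 vol_snap_reserve=0.20):
    """Estimate the space left for data after overhead and snapshot reserves.

    Assumption: each percentage is taken from what remains after the
    previous step (a simplification for illustration only).
    """
    after_wafl = raw_gb * (1 - wafl_overhead)          # WAFL overhead, not reclaimable
    after_aggr = after_wafl * (1 - aggr_snap_reserve)  # aggregate snapshot reserve
    after_vol = after_aggr * (1 - vol_snap_reserve)    # FlexVol snapshot reserve
    return after_vol


raw = 1000.0  # hypothetical capacity in GB, after RAID parity is taken off
left = usable_space(raw)
print(f"{left:.0f} GB of {raw:.0f} GB left for data ({left / raw:.0%})")
```
</pre>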
However, numbers seem to vary somewhat when reading various recommendations.<br> Also, there is a big difference between NAS and SAN (LUN-based) Volumes.<br> <br> Here is an example of manipulating the reserved space on the volume level, setting it to 15%, using the Ontap CL:<br> <br> <font face="courier" size=2 color="blue"> FAS1> snap reserve vol10 15 <br> <font face="arial" size=2 color="black"> <br> <br> <B><U>Snapshot Technologies:</U></B><br> <br> There are a few different "Snapshot" technologies around.<br> <br> One popular implementation uses the <B>"Copy On Write"</B> technology, which is fully block based or page based. NetApp does not use that.<br> In fact, NetApp uses "a new block write" on any change, and then sort of cleverly "remembers" inode pointers.<br> <br> To understand this, let's review "Copy On Write" first, and then return to NetApp Snapshots.<br> <br> <B>&#8658; "Copy On Write" Snapshot:</B><br> <br> Fig. 16. "Copy on Write" Snapshot (not used by NetApp).<br> <br> <img src="diskdevices13.jpg" align="centre"/> <br> <br> Let's say we have a NAS volume, where a number of disk blocks are involved. "Copy on Write" is really easy to understand.<br> Just before <I>any block</I> gets modified, the <B>original</B> block gets copied to a reserved space area.<br> You see? Only the "deltas", as of a certain t=t<sub>0</sub> (when the snapshot was activated), of a Volume (or file, or whatever)<br> get copied. This is great, but it involves multiple "writes": first, write the original block to a safe place, then write<br> the block with the new data.<br> <br> In effect, you have a backup of the entity (the Volume, the file, the "whatever") as it was at t=t<sub>0</sub>.<br> <br> If, later on, at t=t<sub>1</sub>, you need to restore, or go back to t=t<sub>0</sub>, you take the primary block space, and copy all the reserved<br> (saved) blocks "over" the modified blocks.<br> Note that the reserved space does NOT contain a full backup.
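<br> <br> The "Copy on Write" scheme described above can be mimicked in a few lines of Python. This is only a toy model: the class and method names are hypothetical, the "blocks" are plain list entries, and it counts writes just to show where the extra write comes from.<br>
<pre>
```python
class CowVolume:
    """Toy copy-on-write snapshot: originals are saved on first overwrite."""

    def __init__(self, blocks):
        self.blocks = list(blocks)   # primary (live) block space
        self.reserve = None          # reserved area; None means no snapshot active
        self.writes = 0              # block writes performed, for comparison

    def snapshot(self):
        self.reserve = {}            # starts empty: nothing is copied at t0

    def write(self, index, data):
        if self.reserve is not None and index not in self.reserve:
            self.reserve[index] = self.blocks[index]  # write 1: save the original
            self.writes += 1
        self.blocks[index] = data                     # write 2: store the new data
        self.writes += 1

    def restore(self):
        """Copy the saved originals back over the modified blocks."""
        for index, original in (self.reserve or {}).items():
            self.blocks[index] = original


vol = CowVolume(["a", "b", "c", "d"])
vol.snapshot()            # t = t0
vol.write(1, "B")         # first change of block 1: two writes
vol.write(1, "B2")        # original already saved: one write
print(vol.reserve, vol.writes)
vol.restore()             # back to the state at t0
print(vol.blocks)
```
</pre>
In this toy run, only the one block that changed ends up in the reserved area, while the untouched blocks are never copied.<br>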
The reserved space holds only the blocks frozen at t=t<sub>0</sub>, before they<br> were modified between t=t<sub>0</sub> and t=t<sub>1</sub>.<br> Normally, the reserved space will contain far fewer blocks than the primary (usable, writable) space, which means a lot of saving<br> of disk space compared to a traditional "full" copy of blocks.<br> <br> <B>&#8658; "NetApp" Snapshot copy: general description (1)</B><br> <br> You can schedule a Snapshot backup of a Volume, or you can make one interactively using an Ontap command or GUI tool.<br> So, a NetApp Snapshot backup is not an "ongoing process". You start it (or it is scheduled), then it runs until it is done.<br> <br> The mechanics of a snapshot backup are pretty "unusual", but it sure is <I>fast</I>.<br> <br> Fig. 17. NetApp Snapshot copy.<br> <br> <img src="diskdevices14.jpg" align="centre"/> <br> <br> It's better to speak of a "Snapshot copy" than of a "Snapshot backup", but most of us do not care too much about that.<br> It's an exact state of the Volume as it was at t=t<sub>0</sub>, when it started.<br> <br> With a snapshot running, WAFL takes a completely different approach than many of us are used to. If an existing "block" (that already contained data)<br> is going to be modified while the backup runs, WAFL just takes a new free block, and puts the modified data there.<br> The original block stays the same, and the inode (pointer) to that block is part of the Snapshot!<br> So, there is only one write (that to the new block).<br> <br> It explains why snapshots are so incredibly fast.<br> <br> <B>&#8658; "NetApp" Snapshot copy: the open file problem (2)</B><br> <br> From Ontap's perspective, there is no problem at all. However, many programs run on Hosts (Servers) and not on the Filer, of course.<br> So, applications like Oracle, SQL Server etc. have a <B>completely different perspective</B>.<br> <br> The Snapshot copy might thus be inconsistent.
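<br> <br> As a short aside on the mechanics: the "new block write" idea, where a Snapshot is nothing more than a frozen set of pointers, can be mimicked in a few lines of Python. This is a toy model only. The names are hypothetical, new blocks are simply appended to a list, and real WAFL is far more involved.<br>
<pre>
```python
class PointerVolume:
    """Toy 'write to a new block' volume: a file is a list of block pointers."""

    def __init__(self, data):
        self.store = list(data)                    # physical blocks
        self.active = list(range(len(data)))       # live pointers into the store
        self.snapshots = {}                        # name -> frozen pointer list
        self.writes = 0                            # block writes performed

    def snapshot(self, name):
        self.snapshots[name] = list(self.active)   # copies pointers, not data

    def write(self, index, data):
        self.store.append(data)                    # the only write: a fresh block
        self.active[index] = len(self.store) - 1   # repoint the live view
        self.writes += 1

    def read(self, index, snapshot=None):
        pointers = self.snapshots[snapshot] if snapshot else self.active
        return self.store[pointers[index]]


vol = PointerVolume(["a", "b", "c"])
vol.snapshot("t0")        # freeze the pointers at t = t0
vol.write(1, "B")         # one write; the old block stays where it is
print(vol.read(1), vol.read(1, "t0"), vol.writes)
```
</pre>
The live view sees the new data, while the snapshot still points to the untouched original block.<br>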
This inconsistency is not caused by NetApp. NetApp only produced a state image of pointers at t=t<sub>0</sub>.<br> And that is actually a good backup.<br> <br> The potential problem is this: NetApp created the snapshot at t=t<sub>0</sub>, while the database was active during the t=t<sub>0</sub> to t=t<sub>1</sub> interval.<br> In that interval, a database file may be "fractured", meaning that processes might have updated records in the database files.<br> Typical of databases is that their <B> own checkpoint system process</B> flushes dirty blocks to disk, and updates<br> file headers accordingly with a new "sequence number". If all files are in sync, the database engine considers the database<br> as "consistent". If that's not done, the database is "inconsistent" (so the database engine thinks).<br> <br> By the way, it's not databases alone that behave in that manner. All sorts of workflow, messaging, queuing programs etc.<br> show similar behaviour.<br> <br> Although the Snapshot copy is, from a filesystem view, perfectly consistent, Server programs might think differently.<br> That poses a problem.<br> <br> NetApp addressed that by letting you install additional programs on any sort of Database Server.<br> These are "SnapDrive" and "SnapManager for xyz" (like SnapManager for SQL Server).<br> <br> In effect, just before the Snapshot starts, the SnapManager asks the Database to checkpoint and to "shut up" for a short while (freeze, as it were).<br> SnapDrive will do the same for any other open filesystem processes.<br> The result is good, consistent backups at all times.<br> <br> <br> <br> <font face="arial" size=2 color="blue"> <hr/> <h2 id="section9">Chapter 9.
A FEW NOTES ABOUT STORAGE ON UNIX, LINUX, AND WINDOWS.</h2> <hr/> <font face="arial" size=2 color="black"> <br> After a few words on storage in general and SANs, let's take a look at how storage is viewed, used, and managed from some client Operating Systems.<br> <br> Since this information really can be seen as an "independent" chapter, I have put it in a separate file.<br> It can be used as a fully independent document.<br> <br> If you are interested, please see:<br> <br> <a href="unixlvm.htm">Some notes on Storage and LVM in Unix systems.</a><br> <br> <br> <h3>Hope you think this document was a bit useful...</h3> <br> <br> <br> <br> <br> <br> </body> </html>