A few simple notes on Sharepoint / SQL Server Storage

Version : 0.9
Date : 12/03/2012
By : Albert van der Sel
For who : For anyone who likes some simple notes about Sharepoint, focused on SQL Server storage
Status : Working document (not ready)



1. Global Purpose of some of the main Sharepoint databases:

SharePoint_AdminContent_GUID Central Administration content database. The content database is used to store all SharePoint Services content related to Administration.
SharePoint_Config The configuration database contains config info: metadata on databases, and all of the IIS Web sites, Web applications, Web Part packages, and site templates.
WSS_Content This database contains data specific to a Web application and its site collections. It contains the content data, such as a document uploaded to a list or a library. By default, a single content database is created for each Web application.
WSS_Search This database is used for the search services. It contains the information that is required for searching content.
SharedServices_DB If Shared Services are used, each SSP uses its own database. This one is used for configuration settings. It may be related to user profiles, memberships, audiences, calculations, administration etc..
SharedServices_SSP Stores the content for this SSP website.


2. Global Architecture Sharepoint:

Fig 1: High level overview Sharepoint, focused on storage models:



Note from figure 1, that Sharepoint apps can store documents directly in the database, or, with additional
software, to an external filesystem (like on a SAN).
Per default, Sharepoint will store documents inside the backend SQL Server "Content" database(s).

However, especially at large volumes of documents, SQL Server storage is not optimized for storage
of files, as internal (inline) "blobs".Especially when Sharepoint is integrated with Office programs,
on any "save" action, a blob will be stored in the content database(s), which will inflate those databases rapidly.

If you want (or need to), you can choose to store those files on a filesystem, like local storage, or a SAN.
This is called "external storage of files", or "outline" storage. Technically, we then should not speak
of "blobs" anymore, because the objects are just files on a filesystem, while SQL Server only stores "metadata"
about the documents (like a pointer on how to locate the file: see section 5).

Four solutions exists (up to now):

-EBS or "External Blob Storage", using a 3rd party "storage provider"

-RBS or "Remote Blob Storage", from MS (default solution is in conjuction with SQL Filestream)

-RBS using a 3rd party "storage provider" and "Third Party" software (I/O directly to Storage, not using SQL Filestream)

-Special "Third Party" Software, that couples a Sharepoint library to a File Share.


⇒ The older solution is Microsoft's EBS (External Blob Storage), which can be used with
Sharepoint 2007 & 2010. However, you need a thirdparty "storage provider".
EBS "sits" close on the lower stack of Sharepoint and needs extensions in order to "talk" to
to the storage system. EBS is only available in conjunction with a third party storage provider.
The term EBS is just a generic name for any provider that let sharepoint middleware Servers talk
to remote storage (iSCSI, FC SAN).
It works alongside (parallel) to the Database. Metadata is always stored in the database, and you
can "tune" (on basis of size, or type of file) what get's on the filesystem, and what gets into the database.

To coordinate the two data stores, the "core" is a COM interface, ISPExternalBinaryProvider,
where semantics are implemented to recognize file Save and Open commands and invokes redirection calls
to either the Database or the external storage system.

⇒ The (supposedly) newer solution is Microsoft's RBS (Remote Blob Storage), which can be used with Sharepoint 2010
and (as of) SQL Server 2008 R2. Standardly, it uses the "FILESTREAM" feature of SQL2008 to interface to the filesystem.
So, this is not using third-party software, it's explicitly from Microsoft and needs to be used in conjunction with FILESTREAM.
But an option to use RBS in conjunction with a third party provider, and not using FILESTREAM exists as well.

⇒ RBS with a 3rd party storage provider and 3rd party software, not using the FILESTREAM feature.

⇒ SharePoint "File Share Library" is another solution. It is a customized SharePoint Document Library
that retains much of all functions of any other regular Document Library.
Its additional features include automatically mapping the library to a network drive and displaying
the files and folder structure on the network file share "as if" they are in the library.

For example, the "Bamboo File Share Library" or the "DocAve File Share Connector" are both able
to store on a File Share directly.

In all four cases, on any Server in the SharePoint Farm, additional software needs to be installed.


Fig 2: High level overview implementation external storage: EBS, RBS, SharePoint File Share Library:




3. How to manipulate / differentiate storage of small content and large content? :

Actually, there are two questions:

- What are the options to store files (all, or just a part) on a filesystem/SAN.
- How to manipulate / differentiate storage of small content and large content on filesystems/SAN.

For now, this is only some sort of braindump:

Per default, a by Sharepoint provisioned database, will only delivery a Primary filegroup, consisting
of just one file (the primary file). Ofcourse, a transaction log will be created too. For now, we leave
the transaction log out of the discussion (since we are interested in storage of data).
Furthermore, the "Content" databases are our primary focus (which is certainly incomplete, since search databases
etc.. are very important too).

The only question here is: can we somehow force the system to store smaller objects on some storage
facility (whether internal or external) and larger objects on some other storage facility
(whether internal in the database or external on some filesystem).
This could be an interesting question, if critical business documents are (say) less than 100MB,
while much less critical documents (say) are greater than 100MB (e.g. maps), or greater 20MB or so, and could possibly
be stored on somewhat less expensive hardware, with a somewhat lower backup frequency.

Can we possibly use multiple database files, or FileGroups, to achive that objective?
Or must we use EBS, RBS or "File Share Library"?

It's critical to know that we can ajust EBS/RBS to map files to database storage, or
external storage, by criteria as the filetype (say .pdf), or size (say >100MB)

As another point in our investigation, we will list all options for just storing files on some filesystem,
and not inside the database.

1. Default: SharePoint provisioned Using EBS/RBS/SharePoint File Share Library:NO
Just one primary file, for example

K:\mssql\primary_data1.mdf

No differentiation is possible (unless you use EBS/RBS to store blobs on filesystems)
2. By DBA provisioned database Using EBS/RBS/SharePoint File Share Library:NO
The Sharepoint Admin is able to (re-)attach the database to a WebApp/Collection.
The DBA has created the Primary Filegroup to include multiple datafiles like for example:

K:\mssql\primary_data1.mdf
L:\mssql\primary_data2.ndf
M:\mssql\primary_data3.ndf

No differentiation is possible (unless you use EBS/RBS to store blobs on filesystems)
At least a certain gain in performance is expected (due to multiple spindles, if used).
3. By DBA provisioned database Using EBS/RBS/SharePoint File Share Library:NO
The Sharepoint Admin is able to (re-)attach the database to a WebApp/Collection.
The DBA has created the Primary Filegroup to as for example:

K:\mssql\primary_data1.mdf

The DBA has created a second (and possibly more) Filegroup (FG1) to include muliple files
as for example:

L:\mssql\FG1_data1.ndf
M:\mssql\FG1_data2.ndf
N:\mssql\FG1_data3.ndf

Directly after creation, the DBA assigns FG1 to be the "default" Filegroup.

No differentiation is possible (unless you use EBS/RBS to store blobs on filesystems)
At least a certain gain in performance is expected (due to multiple spindles, if used).
4. Default: SharePoint provisioned Using EBS/RBS:YES

Independed of database layout of files:

Differentiation is possible (using EBS/RBS criteria filetype/size)
5. -Default: SharePoint provisioned
OR
-by DBA provisioned database
Using EBS/RBS/SharePoint File Share Library:NO

The Sharepoint Admin carefully assigns a specific databases to specific WebApps/Collections,
while keeping in mind (if possible) which site is for large objects (and thus using it's
specific database), and which is for ciritical business documents (and thus using it's
own specific database also)

In cooperation with DBA and Storage Admins, LUNS can be assigned in accordance with
critical apps (critical storage/high backup freq) and less critical apps (less expensive
storage and lower backup freq.)

Differentiation is possible.
6. -Default: SharePoint provisioned
OR
-by DBA provisioned database
Using EBS/RBS:NO

Using SharePoint File Share Library:YES

The Sharepoint Admin carefully assigns a specific Libary to specific WebApps/Collections.
This library uses a file share using the "SharePoint File Share Library"

The DBA creates a database, or the default SharePoint provisioned database is used.

Storage of files ouside the database is possible, for this specific Library type.
But no differntiation on small-large files, unless the Library usage is coupled to
specific purposes.


4. Some quick notes on specific Sharepoint objects:

4.1 Short description Sharepoint Databases:

See section 1.

4.2 Relation of Web application, to Site collection, and content database:

Web Application (WebApp) > Site Collection = Top Level site + optional subsites

A Web Application hosts Site Collection(s). Sites in a site collection, share administrative settings

When you create a WebApp, you specify:
- Name of the IIS website (toplevel site)
- port, path of the virtual direcory, the url, the security settings etc..
- Database Server and database name.

From the Adminatrative webpage, the Admin can create the above from an easy to use Admin console.
Note that the relation of the choice of the content database, relates to the WebApp/Site collection.

Per default, a Content database is created for each Web Application.

Also, directly after creation of the WebApp, it is possible to attach the WebApp to another database
of your choice (the former will be deleted, and the new one populated with standard sharepoint objects).
You can do that using powershell, or simply by using the webbased Administration Console.

4.3 Libraries and Lists:

On a site, you can show documents in an organized way, using libraries.
For example, you can create a library of common documents for a project, and project team members
can use their client tools to find documents, edit them, and checkin/checkout those files.
Once created, a library may somewhat show "the look and feel" of a fileshare, but it is not.
The files are either stored in the SQL Server content database, or, if using EBS/RBS, on an external storage system.

Each library displays a list of files and descriptions about the files, which helps users to use the files,
and to work together. So, the sharepoint Admin controls how documents are viewed, tracked, managed, and created.
Also, you can use workflows to enhance collaboration on documents in libraries.

4.4 Sharepoint services:

This section just serves to "get an idea" about the different Sharepoint services.

The term "service" in a Sharepoint environment, often refers to the specific apllications that support
al sorts of functionality in this environment.
Besides those services, in a Windows environment we can expect to find some true background "services"
as well. These then are the services you can see, and control, using various Windows admin tools like
using the following commands:

C:\TEMP> sc query > listing.txt

Or by using the applet:

C:\TEMP> services.msc
The first command, just prints a listing of all installed services to a text file (listing.txt).
In a Sharepoint 7 environment, we should expect at least the following services:

SERVICE_NAME: OSearch
DISPLAY_NAME: Office SharePoint Server Search

SERVICE_NAME: SPAdmin
DISPLAY_NAME: Windows SharePoint Services Administration

SERVICE_NAME: SPSearch
DISPLAY_NAME: Windows SharePoint Services Search

SERVICE_NAME: SPTimerV3
DISPLAY_NAME: Windows SharePoint Services Timer

SERVICE_NAME: SPTrace
DISPLAY_NAME: Windows SharePoint Services Tracing

Depending on the size of the installation (what features are active), this list might be somewhat longer
as to include also "loadbalancer services" and "single-signon services" (and possibly others as well).

If you would take a quick look at figure 1 again, your configuration might consist of 3 tiers:
1. Webservers with IIS and al sorts of Sharepoint related information under "wwwroot".
2. Sharepoint as the middle tier.
3. SQL Server as the backend database engine.

However, often (1) and (2) are often combined on the same machine, while multiple of those machines
are collected in a "farm".

Although you can control, and modify operations on the services by using Windows tools (like services.msc),
not all operations are allowed using the standard tools.
For example:

-Starting and stopping a service is allowed using Windows tools (like using "net start" or "net stop").
-For adjusting account information for the services, you should use Sharepoint own tools like "stsadm" (Sharepoint7)
or powershell (Sharepoint 2010), or other Sharepoint "aware" means to change such critical data.

The get a further idea of the "stack" of services on tiers (1) and (2),
take a look at the following example stop script (cmd) where we assume tiers (1)&(2) are on one machine:

net stop w3svc
net stop iisadmin
net stop ssosrv
net stop DCLauncher
net stop DCLoadBalancer
net stop OSearch
net stop SPAdmin
net stop SPSearch
net stop SPTimerV3
net stop SPTrace
net stop SPWriter
iisreset /stop


5. Metadata, pointers, and BLOBs:

Let's first see how SQL Server stores blobs inside the database. So, here we want to explore the default
situation where blobs (the documents) are fully stored inside the database.

After that, we will try to explore how outline storage (on filesystems/SAN) is implemented.

5.1 Blob metadata, with inline storage:

5.1.1 Metadata on documents (or blobs)

In SQL 2K5/2K8, we can try a couple of standard queries, in order to find "alien" objects,
in this case, originating from Sharepoint.
Using the following query (in a Content database), we get an impression of objects (like tables, views, sp's)
which are not installed at a default installation of SQL Server. So, they are installed afterwards,
like the adding of database objects that results from a creation of a WebApp in Sharepoint.

select substring(name,1,20) as "Name", substring(type_desc,1,20) as "Type", create_date
from sys.all_objects where is_ms_shipped=0 and type_desc='VIEW' or type_desc='USER_TABLE'
and name not like '%CI_AS' order by name -- in my case I want to exclude some collation tables

This produces (in my case) a list like:

NameType
AllDocsUSER_TABLE
AllDocStreamsUSER_TABLE
AllDocVersionsUSER_TABLE
AllLinksUSER_TABLE
AllListsUSER_TABLE
AllUserDataUSER_TABLE
AllListsUSER_TABLE
....
DocsVIEW
DocStreamsVIEW
DocVersionsVIEW
....
SiteQuotaUSER_TABLE
SitesUSER_TABLE
....
WorkflowUSER_TABLE

The above listing, is just a small sample (partial listing)..

With not too much effort, we can browse through the datamodel of these tables and views.
Then, after some investigation, we can clearly see that the "Alldocs" and "AlldocStreams" tables,
gives us all we need to know about "id's" (of blobs), the "content" of the blob, the "size", the associated
"site_id" which uses this blob, and much more.
Joining these tables with the AllLists view, will reveal the list or library (and other information) too.

You can try for example the following three queries to obtain lots of information about the document storage
within a content database:

USE [YourContentDatabase]
GO

-- Query 1:

SELECT AllLists.tp_Title AS 'List Name',
AllDocs.LeafName AS 'File Name',
AllDocs.DirName AS 'URL',
AllDocStreams.Content AS 'Document Content (Binary)',
AllDocStreams.Size AS 'Size'
FROM AllDocs
JOIN AllDocStreams
ON
AllDocs.Id=AllDocStreams.Id
JOIN AllLists
ON
AllLists.tp_id = AllDocs.ListId
ORDER BY AllDocStreams.Size DESC

-- Query 2:

SELECT AllDocStreams.Id, AllDocStreams.[Content], AllDocStreams.Size, AllDocs.Version, AllDocs.TimeLastModified,
AllDocs.CheckoutUserId, AllDocs.CheckoutDate, AllDocs.IsCurrentVersion, AllDocs.DirName,
AllDocs.LeafName, AllDocs.[Level]
FROM AllDocs INNER JOIN
AllDocStreams ON AllDocStreams.Id = AllDocs.Id AND AllDocs.[Level] = AllDocStreams.[Level]
WHERE AllDocs.TimeLastModified > getdate() - 30 -- use here as getdate()- number_of_days_ago
ORDER by AllDocs.TimeLastModified

-- Query 3: Suppose you want information on .doc, .xls, and .pdf files:

SELECT [AllDocs].[DirName], [AllDocs].[LeafName], [AllDocStreams].[Content]
from [AllDocs],[AllDocStreams] where (LeafName like ‘%.doc’ or LeafName like ‘%.xls’ or LeafName like ‘%.pdf’)
and [AllDocStreams].[Content] is not NULL and [AllDocs].[Id] = [AllDocStreams].[Id]“

You can query on specific Webs and Sites as well, where you are not focusing on document information,
but you for example just want to obtain lists of Webapps and sites. For example:

SELECT Id, OwnerId, TimeCreated, RootWebId FROM Sites

- The "AllDocs" Table regsiters lot's of metadata about documents, like document id, site id, Webid,
TimeLastWritten, Version, and much more.
- The "AllDocStreams" Table registers all document id's, associated site id's, blob contents,
document sizes in bytes, Parent id's (The identifier of the document's parent container).
- The "AllUserData" Table registers data for any item associated with any list- and document librarylists.
- The "AllLists" Table contains information of SharePoint lists such as list’s name,
list’s description and the list’s item count, and more.


5.2 Blob inline storage (No EBS/RBS):

It's well known that SQL Server is optimized for ascii oriented storage (typical administrtive data).
Don't forget that BLOBs and relational database data are very different entities.
Microsoft itself refers to this kind of content, as unstructured data, as opposed to the more structured,
relational data normally stored in a database.

However, various datatypes are available for storage of large text bloks, XML, and "binary" data.

binary datatypes:

- image: up to 2GB, e.g.: to store pdf, word files, images, or other binary data
- varbinary(max): up to 2G, e.g.: to store pdf, word files, images, or other binary data
The storage size is the actual length of the data entered + 2 bytes.
- binary: Fixed-length binary data with a length of n bytes, where n is a value from 1 through 8,000.

Or we can use FILESTREAM Storage. FILESTREAM is a storage attribute of the existing VARBINARY(MAX) data type.
FILESTREAM alters how the BLOB data is stored—in the file system rather than in the SQL Server data files.
However, the Filegroup supporting the FILESTREAM data, is just a database file(s) on the same LUN(s)/Disk(s)
as the other regular database files.

While the "image" datatype was the way to go in SQL Server 2000, it will be depreciated in newer SQL versions.
So, in SQL 2K5/2K8, image is still available, but the "varbinary" is the best type for storage of any type of blob.

Not using RBS, the BLOB is registered as stored in the Content Database in the AllDocStreams table,
in a varbinary field called Content (in Sharepoint7 it was the image datatype).
Each document version counts as a record, and even the Contents field is replicated, even if only a metadata
change occurred. This is quite inefficient indeed.

The following figure illustrates the storage of blobs inside SQL Server (no EBS/RBS).

Fig 3: High level overview Inline blob storage (no EBS/RBS):



Extracting files:

It's instructive, as a sort of exercise, to extract all files to the filesystem. Although it maybe viewed as pretty useless,
it's still informative to see how it can be done, using only simple sql tools and methods.
It must be said that it's certainly against the policy of Microsoft, to directly interface to the Content database
in this way.

Method 1:

USE WSS_Content -- name of your content database
GO

DECLARE
  @Command VARCHAR(4000),
  @FileID VARCHAR(128),
  @MyFilename nvarchar(max)

DECLARE cur1 CURSOR FOR
SELECT Id FROM [WSS_Content].[dbo].[AllDocStreams]

OPEN cur1
FETCH NEXT FROM cur1 INTO @FileID

WHILE (@@FETCH_STATUS = 0)
BEGIN

  SELECT @MyFilename = LeafName from [WSS_Content].[dbo].[AllDocs] WHERE Id = @FileID

  SET @Command = 'bcp "SELECT Content from [WSS_Content].[dbo].[AllDocStreams] WHERE Id = ''' +
  @FileID + '''" queryout "C:\SQLExtract\' +
  @MyFilename+'" -T -n -Slocalhost'

  EXEC xp_cmdshell @Command

  FETCH NEXT FROM cur1 INTO @FileID

END

CLOSE cur1
DEALLOCATE cur1



5.2 Documents on outline storage (filesystems/SAN):



6. Other Storage pointers:

6.1 Growth and Planning sizing:

SharePoint uses:

  • Recycle bin
  • Versioning
  • Search and index information of Content
  • MetaData
  • Regular Content (in DB or SAN or Share)
For Database storage of blobs, studies have shown that a good rule of thumb for initial planning is: 3.5 x file system storage.


6.2 Backup / Recovery:

Here, we are don't take into consideration specialized methods/tools like using "Data Protection Manager",
or any other 3rd party tools.

6.2.1 Full Backup / Recovery:

This really means full backup, and NOT on the item level (document, site, library etc..).

- Only SQL Storage: consistent model for backup/recovery.
Just all or nothing scenario. Possibly slow, but fully consistent.

- EBS: SQL backup: consistent model for backup/recovery.
But Sync between SQL Storage + SAN Storage could be difficult.
Although a COM interface (called the EBS Provider) keeps these two stores in sync, it's only
at events where the COM interface recognizes file Save and Open commands and invokes redirection calls to the EBS
This then, is done with well-known clients. The EBS Provider thus ensures that the SQL Server content database
contains metadata references to their associated BLOB streams in the external BLOB store.
But this is only garanteed with using well-known clients.

- RBS: SQL backup: consistent model for backup/recovery.
But Sync between SQL Storage + SAN Storage should not be so difficult. SAN storage = FILESTREAM Filegroup.

- Shared Library (file share): SQL backup: consistent model for backup/recovery.
Sync between SQL Storage + File Share (SAN) Storage could be difficult.


6.2.2 Item level Recovery:

Still to do.....