Inside SharePoint: Creating an External Storage Solution for SharePoint
Pav Cherny
Microsoft estimates that as much as 80
percent of the data stored in Microsoft Windows SharePoint Services (WSS) 3.0
and Microsoft Office SharePoint Server (MOSS) 2007 content databases is
non-relational binary large object (BLOB) data, such as Microsoft Office Word
documents, Microsoft Office Excel spreadsheets, and Microsoft Office PowerPoint
presentations. Only 20 percent is relational metadata, which implies a
suboptimal use of Microsoft SQL Server resources at the database backend.
SharePoint does not take advantage of recent SQL Server innovations for
unstructured data introduced in SQL Server 2008, such as the FILESTREAM
attribute or Remote BLOB Storage API, but provides its own options to increase
the storage efficiency and manageability of massive data volumes.
Specifically, SharePoint includes an
external binary storage provider API, ISPExternalBinaryProvider, which Microsoft
first published as a hotfix in May 2007 and incorporated later into Service Pack
1. The ISPExternalBinaryProvider API is separate from the Remote BLOB Storage
API. Third-party vendors can use this API to integrate SharePoint with advanced
storage solutions, such as content-addressable storage (CAS) systems. You can
also use this API to maintain SharePoint BLOB data on a central file server
outside of content databases if you want to build a custom solution to increase
storage efficiency and scalability in a SharePoint farm. Keep in mind, however,
that this API is specific to WSS 3.0 and MOSS 2007. It will change in the next
SharePoint release, which means that you will have to update your
provider.
In this column, I discuss how to
extend the SharePoint storage architecture using the ISPExternalBinaryProvider
API, including advantages and disadvantages, implementation details, performance
considerations, and garbage collection. I also discuss a 64-bit compatibility
issue of Microsoft Visual Studio that can cause SharePoint to fail loading
managed ISPExternalBinaryProvider components despite a correct interface
implementation. Where appropriate, I refer to the ISPExternalBinaryProvider
documentation in the WSS 3.0 SDK. Another reference worth mentioning is Kyle
Tillman's blog.
Kyle does a great job explaining how
he mastered the implementation hurdles in managed code, but neither the WSS 3.0
SDK nor Kyle's blog post includes a Visual Studio sample project, so I decided
to provide ISPExternalBinaryProvider samples in both unmanaged and managed code
in this column's companion material. The purpose of these samples is to help you
get started if you are interested in integrating external storage solutions with
SharePoint. Remember, though, that these samples are untested and not ready for
production use.
Internal Binary Storage
By default, SharePoint stores BLOB
data in the Content column of the AllDocStreams table in the content database.
The obvious advantage of this approach is straightforward transactional
consistency between relational data and the associated non-relational file
contents. For example, it's not complicated to insert the metadata of a Word
document along with the unstructured content into a content database, nor is it
complicated to associate metadata with the corresponding unstructured content in
select, update, or delete operations. However, the most obvious disadvantage of
the default approach is an inefficient use of storage resources. Despite an I/O
subsystem optimized for high performance, the SQL Server storage engine is not
exactly a file-server replacement.
A SQL Server database consists of
transaction log and data files, as illustrated in Figure 1. In
order to ensure reliable transactional behavior, SQL Server first writes all
transaction records to the log file before it flushes the corresponding data in
8KB pages to the data file on disk. Depending on the selected recovery model,
this requires more than twice the BLOB size in storage capacity until you
perform a backup and purge the transaction log. Moreover, SQL Server does not
store unstructured SharePoint content directly in data pages. Instead, SQL
Server uses a separate collection of text/image pages and only stores a 16-byte
text pointer to the BLOB's root node in the data row. Text/image pages are
organized in a balanced tree, yet there is only one collection of text/image
pages for each table. For the AllDocStreams table, this means that the content
of all files is spread across the same text/image page collection. A single
text/image page can hold data fragments from multiple BLOBs, or it may hold
intermediate nodes for BLOBs larger than 32KB in size.
Figure 1 Default SharePoint BLOB
storage in SQL Server
Let's not dive too deeply into SQL
Server internals, though. The point is that when reading unstructured content,
SQL Server must go through the data row to get the text pointer and then through
the BLOB's root node and possibly additional intermediate nodes to locate all
data fragments spread across any number of text/image pages that SQL Server must
load into memory in full to get all data blocks. This is because SQL Server
performs I/O operations at the page level. These complexities impair
file-streaming performance in comparison to direct access through the file
system. SQL Server also imposes a hard size limit of 2GB on SharePoint because
this is the maximum capacity of the image data type. The Content column of the
AllDocStreams table is an image column, so you cannot store files larger than
2GB in a SharePoint content database.
External Binary Storage
The ISPExternalBinaryProvider API
offers a clever alternative to internal BLOB storage in SharePoint content
databases. It is a straightforward COM interface with only two methods
(StoreBinary and RetrieveBinary), which you can use to implement an External
Binary Storage (EBS) provider. For architecture details, see the topic "Architecture of
External BLOB Storage" in the WSS 3.0 SDK.
SharePoint loads your EBS provider
when you set the ExternalBinaryStoreClassId property of the local SPFarm object
(SPFarm.Local.ExternalBinaryStoreClassId) to the provider's COM class identifier
(CLSID). SharePoint then calls the provider's StoreBinary method whenever you
submit BLOB data, such as when you're uploading a file to a document library.
The EBS provider can decide to store the BLOB in its associated external storage
system and return a corresponding BLOB identifier (BLOB ID) to SharePoint, or
it can set the pfAccepted parameter in the StoreBinary method to false to
indicate that it did not handle the BLOB. In the latter case, SharePoint stores
the BLOB in the content database as usual. On the other hand, if the EBS
provider accepted the BLOB, SharePoint only inserts the BLOB ID into the Content
column of the AllDocStreams table, as indicated in Figure 2.
The BLOB ID can be any value that enables the EBS provider to locate the content
in the external storage system, such as a filename, a file path, a globally
unique identifier (GUID), or a content digest. The sample providers included in
the companion material, for instance, use GUIDs as filenames for reliable
identification of BLOBs on a file server.
Figure 2 Storing a SharePoint BLOB in
an external storage system
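The accept-or-decline handshake described above can be sketched in a few lines. This is only an illustrative model in Python, not the COM interface itself: the class SketchProvider, the function save_document, the in-memory dictionaries, and the size threshold MIN_EXTERNAL_SIZE are all assumptions made for the example and do not come from the WSS 3.0 SDK or the companion samples.

```python
import uuid

# Hypothetical policy for this sketch: BLOBs smaller than this stay in the database.
MIN_EXTERNAL_SIZE = 1024


class SketchProvider:
    """Toy stand-in for an EBS provider; a dict plays the external store."""

    def __init__(self):
        self.external_store = {}

    def store_binary(self, data):
        """Model of StoreBinary: returns (blob_id, accepted)."""
        if len(data) < MIN_EXTERNAL_SIZE:
            # Equivalent to setting pfAccepted to false: the provider declines.
            return None, False
        blob_id = str(uuid.uuid4())
        self.external_store[blob_id] = data
        return blob_id, True

    def retrieve_binary(self, blob_id):
        """Model of RetrieveBinary: look the content up by its BLOB ID."""
        return self.external_store[blob_id]


def save_document(provider, content_column, doc_name, data):
    """Model of the host side: decide what ends up in the Content column."""
    blob_id, accepted = provider.store_binary(data)
    if accepted:
        # Only the BLOB ID is written to the database.
        content_column[doc_name] = blob_id.encode("ascii")
    else:
        # The provider declined, so the BLOB itself is stored as usual.
        content_column[doc_name] = data
    return accepted
```

In this model, a declined BLOB lands in the Content column unchanged, while an accepted BLOB leaves only its identifier there, which mirrors the two outcomes the StoreBinary contract allows.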
SharePoint also keeps track of
externally stored files by setting the highest DocFlags bit of these files to 1.
DocFlags is a column of the AllDocs table. When a user requests to download an
externally stored file, SharePoint checks DocFlags and passes the Content value
from the AllDocStreams table to the RetrieveBinary method of the EBS provider.
In response to the RetrieveBinary call, the EBS provider must retrieve the
indicated BLOB from the external storage system and return the binary content to
SharePoint in the form of a COM object that implements the ILockBytes interface.
Note that SharePoint does not call the RetrieveBinary method for BLOBs stored
directly in the content database.
Note also that the storage and
retrieval processes are transparent to the user as long as the user doesn't
attempt to bypass SharePoint. So, you don't need to replace built-in Web parts
with custom versions that tie metadata in a list with a document stored
externally; productivity applications, such as Microsoft Office, don't need to
know how to store metadata in one place and then the document in another; and
Search does not need to process metadata separate from documents. Moreover, and
this is one of my favorite advantages of the EBS provider architecture, the user
must go through SharePoint to access externally stored BLOB data. A user
bypassing SharePoint and directly accessing a content database through a SQL
Server connection ends up downloading BLOB IDs instead of actual file contents,
as illustrated in Figure 3. You can verify this behavior if you
deploy the SQL Download Web Part (which I used in the April 2009 column to
demonstrate how to bypass SharePoint AD RMS protection) in a test environment.
Furthermore, users don't need—and should not have—access permissions to the
external BLOB store. Only SharePoint security accounts require access because
SharePoint calls the EBS provider methods in the security context of the site's
application pool account.
Figure 3 The EBS provider can be a
roadblock to bypassing SharePoint permissions for file
downloads
Keep in mind, however, that EBS
providers also have drawbacks due to the complexity of maintaining integrity
between metadata in the SharePoint farm's content databases and the external
BLOB store. For a good discussion of pros and cons, check out the topic "Operational Limits
and Trade-Off Analysis" in the WSS 3.0 SDK. Make sure you read this very
important topic before implementing an EBS provider in a SharePoint
environment.
Building an Unmanaged EBS
Provider
Now let's tackle the challenges of
building EBS providers. The ISPExternalBinaryProvider interface is
well-documented in the WSS 3.0 SDK under "The BLOB Access
Interface: ISPExternalBinaryProvider." However, it seems Microsoft forgot to
cover the EBS provider details. After all, we are not just consuming the
interface of an existing COM server. We are tasked with building that COM server
ourselves and implementing the ISPExternalBinaryProvider interface. Most
importantly, the WSS 3.0 SDK fails to mention the type of COM server we are
supposed to build and the required threading model. A classic COM server can run
out-of-process or in-process, and it can support the single-threaded apartment
(STA) model, the multithreaded apartment (MTA) model, or both, or the
free-threaded model. For the EBS provider to work properly, make sure you build
a thread-safe in-process COM server that supports the threading model "Both" for
STAs and the MTA.
You also need to think about which
programming language to use. This is important because the
ISPExternalBinaryProvider interface is the lowest-level API of SharePoint.
Performance issues can affect the entire SharePoint farm. For this reason, I
recommend using a language that enables you to build small and fast COM objects,
such as Visual C++ and Active Template Library (ATL). ATL provides helpful C++
classes to simplify the development of thread-safe COM servers in unmanaged code
with the correct level of threading support.
Visual Studio also includes a variety
of ATL wizards. Just create an ATL project, select Dynamic-link library (DLL)
for the server type, copy the ISPExternalBinaryProvider interface definition
from the WSS 3.0 SDK into the interface definition language (IDL) file of your
ATL project, add a new class for an ATL Simple Object, select "Both" as the
threading model and no aggregation, then right-click the new class, point to
Add, click Implement Interface, and select ISPExternalBinaryProvider. That's it!
The Implement Interface Wizard performs all necessary plumbing, so you can focus
on implementing the StoreBinary and RetrieveBinary methods.
And don't let unmanaged C++ code
intimidate you. If you analyze the SampleStore.cpp file in the companion
material, you can see that the StoreBinary and RetrieveBinary implementations
are relatively straightforward. Essentially, the sample StoreBinary method
constructs a file path based on a StorePath registry value, the Site ID passed
in from SharePoint, and a GUID generated for the BLOB, and then uses the Win32
WriteFile function to save the binary data obtained from the ILockBytes
instance. The sample RetrieveBinary method, on the other hand, constructs the
file path based on the same StorePath registry value, the Site ID, and the BLOB
ID passed in from SharePoint, and then uses the Win32 ReadFile function to
retrieve the unstructured data, which the EBS provider copies into a new
ILockBytes instance that it then passes back to SharePoint. Figure
4 illustrates how the EBS provider constructs the file path.
Figure 4 Constructing file paths for
StoreBinary and RetrieveBinary operations in the sample EBS
providers
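To make the path logic concrete, here is a minimal, platform-neutral sketch in Python rather than the Win32 calls used in SampleStore.cpp. It assumes a layout of store path, then site ID, then BLOB ID; the exact layout the samples use is the one shown in Figure 4, and the function names blob_path, store_binary, and retrieve_binary are hypothetical. The store path is passed in as a parameter here instead of being read from the StorePath registry value.

```python
import uuid
from pathlib import Path


def blob_path(store_path, site_id, blob_id):
    """Build the file path from the store root, the site ID, and the BLOB ID.

    Assumed layout for this sketch: <store_path>/<site_id>/<blob_id>.
    """
    return Path(store_path) / str(site_id) / blob_id


def store_binary(store_path, site_id, data):
    """Write the BLOB under a freshly generated GUID and return that GUID."""
    blob_id = str(uuid.uuid4())
    target = blob_path(store_path, site_id, blob_id)
    # Create the per-site directory on first use.
    target.parent.mkdir(parents=True, exist_ok=True)
    target.write_bytes(data)
    return blob_id


def retrieve_binary(store_path, site_id, blob_id):
    """Rebuild the same path from the same inputs and read the BLOB back."""
    return blob_path(store_path, site_id, blob_id).read_bytes()
```

Because both operations derive the path from the same three inputs, the BLOB ID returned by the store step is all the retrieve step needs in addition to the site ID.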
Building a Managed EBS Provider
Of course, SharePoint developers might
prefer using familiar managed languages to build EBS providers, even though
building managed EBS providers is not necessarily less complicated than building
unmanaged providers due to the complexity of COM interoperability. Keep in mind
that an application written in unmanaged code can only load one version of the
common language runtime (CLR), so your code needs to work with the same version
of the CLR that the rest of SharePoint is using, otherwise you might end up with
unexpected behavior. Also, you still must deal with unmanaged interfaces and the
corresponding marshalling of parameters and buffers. Just compare
SampleStore.cpp with SampleStore.cs in the companion material. There are no
gains using a managed language in terms of code structure or programming
simplicity.
Moreover, be aware of 64-bit
compatibility issues if you develop managed EBS providers on the x64 platform.
Figure 5 shows a typical error that results from invalid COM
registration settings on a development computer. If you enable the Register for
COM Interop checkbox in the project properties in Visual Studio 2005 or Visual
Studio 2008, you'll end up with COM registration settings for your provider in
the registry under HKEY_CLASSES_ROOT\Wow6432Node\CLSID\<ProviderCLSID>.
Visual Studio uses the 32-bit version of the Assembly Registration Tool
(Regasm.exe) even on the x64 platform.
Figure 5 Due to invalid COM
registration settings, a managed EBS provider could not be
loaded
However, the 64-bit version of
SharePoint cannot load a 32-bit COM server registered under the Wow6432Node, so
you must manually register your managed EBS provider by using the 64-bit
Regasm.exe version, located in the %WINDIR%\Microsoft.NET\Framework64\v2.0.50727
directory. For example, the command
"%WINDIR%\Microsoft.NET\Framework64\v2.0.50727\Regasm.exe" ManagedProvider.dll
creates the required registry settings for the managed sample provider under
HKEY_CLASSES_ROOT\CLSID\<ProviderCLSID>. Another approach is to create a
Setup program and mark the EBS provider for automatic COM registration.
Remember also that managed EBS
providers come with significantly more overhead and performance penalties than
their unmanaged ATL counterparts. You can see this if you compare the COM
registration settings in the registry. As the InProcServer32 key reveals, the
COM runtime loads unmanaged EBS provider DLLs directly, while managed EBS
providers rely on Mscoree.dll as the in-proc server, which is the core engine of
the CLR. So, for managed providers, the COM runtime loads the CLR and then the
CLR loads the EBS provider assembly as registered under the Assembly key and
creates a COM Callable Wrapper (CCW) proxy to handle the interaction between the
unmanaged SharePoint client (Owssvr.dll) and the managed EBS provider.
Keep in mind that the unmanaged
SharePoint code does not directly interact with your managed provider. It's
the CCW that marshals parameters, calls the managed methods, and handles
HRESULTs. This indirection is especially apparent in the different return types
of managed methods in comparison to unmanaged methods. Unmanaged methods return
HRESULTs to indicate success or failure, while managed methods are supposed to
have the void return type. So don't return explicit HRESULTs in managed code.
You must raise system or user-defined exceptions in response to error
conditions. If a managed method completes without an exception, the CCW
automatically returns S_OK to the unmanaged client.
On the other hand, if a managed method
raises an exception, the CCW maps error codes and messages to HRESULTs and error
information. The CCW implements various error-handling interfaces for this
purpose, such as ISupportErrorInfo and IErrorInfo, but SharePoint does not take
advantage of these interfaces. EBS providers must implement their own error
reporting through the Windows event log, SharePoint diagnostic logs, trace
files, or other means. SharePoint only expects the HRESULT values S_OK for
success and E_FAIL for any error. You can use the Marshal.ThrowExceptionForHR
method to return E_FAIL to SharePoint, as demonstrated in
SampleStore.cs.
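The error-handling contract can be pictured with a small analogy. The Python decorator below is not how the CCW is implemented, and the names as_hresult and the logger are hypothetical. It only models the two points made above: a method that completes without an exception yields S_OK, and because SharePoint ignores the richer error interfaces, the provider logs the details itself and reports a plain E_FAIL.

```python
import functools
import logging

S_OK = 0x00000000
E_FAIL = 0x80004005  # standard "unspecified failure" HRESULT

# Hypothetical logger name; stands in for the event log or a trace file.
log = logging.getLogger("ebs.provider.sketch")


def as_hresult(method):
    """Wrap a void-style method so its outcome is reported as an HRESULT."""

    @functools.wraps(method)
    def wrapper(*args, **kwargs):
        try:
            method(*args, **kwargs)
        except Exception:
            # The caller never sees the message, so record it here.
            log.exception("provider call %s failed", method.__name__)
            return E_FAIL
        # No exception means success.
        return S_OK

    return wrapper
```

A method decorated this way never returns a status of its own. It either finishes or raises, which matches the void return type expected of the managed methods.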
Registering an EBS Provider in
SharePoint
Easily the most confusing section on
ISPExternalBinaryProvider in the WSS 3.0 SDK is the topic "Installing and
Configuring Your BLOB Provider." At the time of this writing, this section
was filled with misleading information and errors. Even the Windows PowerShell
commands were incorrect. If you assign the EBS provider to $yourProviderConfig
and afterwards use $providerConfig.ProviderCLSID, don't be surprised when you
receive an error stating that $providerConfig doesn't exist. Of course, you
won't even reach this point because the Active and ProviderCLSID properties
aren't part of the ISPExternalBinaryProvider interface. These mysterious
properties belong to a dual interface that is not covered in the documentation.
Just for fun, I implemented a sample version in both unmanaged and managed code,
but your ISPExternalBinaryProvider implementation does not require these
proprietary properties at all.
The ProviderCLSID property might be
handy, but the CLSID is also available in the registry if you search for the
ProgID, such as UnmanagedProvider.SampleStore or ManagedProvider.SampleStore,
and you can also find the CLSIDs in the code files SampleStore.rgs and
SampleStore.cs. As mentioned earlier, setting the ExternalBinaryStoreClassId
property of the local SPFarm object to the CLSID registers the EBS provider.
Setting the ExternalBinaryStoreClassId property of the local SPFarm object to an
empty GUID ("00000000-0000-0000-0000-000000000000") removes the EBS provider
registration. Don't forget to call the SPFarm object's Update method to save the
changes in the configuration database and restart Internet Information Services
(IIS). The following code listing illustrates how to accomplish these tasks in
Windows PowerShell:
[System.Reflection.Assembly]::LoadWithPartialName('Microsoft.SharePoint')
$farm = [Microsoft.SharePoint.Administration.SPFarm]::Local
# Registering the CLSID of an EBS provider
$farm.ExternalBinaryStoreClassId = "C4A543C2-B7DB-419F-8C79-68B8842EC005"
$farm.Update()
IISRESET
# Removing the EBS provider registration
$farm.ExternalBinaryStoreClassId = "00000000-0000-0000-0000-000000000000"
$farm.Update()
IISRESET
Implementing Garbage Collection
Another section in the WSS 3.0 SDK
featuring mysterious components and critical code snippets is titled "Implementing Lazy
Garbage Collection." At the time of this writing, this section contained
references to another mysterious Utility class with DirFromSiteId and
FileFromBlobid methods as well as an incorrect assignment of Directory.GetFiles
results to a FileInfo array, but let's not be too demanding on WSS 3.0
documentation quality. The DirFromSiteId and FileFromBlobid helper methods
reveal their purpose through their names and the incorrect FileInfo array is
easily replaced with a string array, or you can replace the Directory.GetFiles
method with a call to the GetFiles method of a DirectoryInfo object. The Garbage
Collector sample program in the companion material uses the DirectoryInfo
approach and follows the suggested sequence of steps for garbage
collection.
An important deviation of the Garbage
Collector sample from the SDK explanations concerns the handling of timing
conditions. This is a critical issue because timing conditions can lead to
misidentification and deletion of valid files during garbage collection. Take a
look at Figure 6, which illustrates the WSS 3.0 SDK–recommended
approach to determine orphaned files by enumerating all BLOB files in the EBS
store and then removing all those references from the BLOB list that are still
in the content database as indicated through the site's ExternalBinaryIds
collection. The remaining references in the BLOB list are supposed to indicate
orphaned files that should be deleted.
Figure 6 Misidentification of a valid
BLOB as orphaned due to a timing condition
However, the EBS provider must, of
course, first finish writing BLOB data before it can return a BLOB ID to
SharePoint. Depending on network bandwidth and other conditions, I/O performance
can fluctuate. So, there is a chance that the EBS provider could create a new
BLOB—which then appears in your BLOB list—but completes writing the BLOB data
after you have determined the ExternalBinaryIds so the BLOB ID is not yet
present in this collection. Accordingly, the reference to the new BLOB remains
in the orphaned BLOB list and if you purge the orphaned BLOBs at this point, you
accidentally delete a valid content item and lose data! In order to avoid this
problem, the sample Garbage Collector checks the file creation time and adds
only those items to the BLOB list that are more than one hour old.
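The age filter is easy to express in code. The following Python sketch is not the Garbage Collector sample from the companion material; the function find_orphans and its parameters are hypothetical. It assumes one directory per site with files named after their BLOB IDs, takes the referenced IDs from a caller-supplied function that stands in for the site's ExternalBinaryIds collection, and uses the file modification time instead of the creation time because that is what Python exposes portably.

```python
import time
from pathlib import Path

# Files younger than this are never treated as orphans (one hour, as in the text).
MIN_AGE_SECONDS = 3600


def find_orphans(site_dir, get_referenced_ids, now=None):
    """Return files in site_dir that are old enough and no longer referenced.

    get_referenced_ids is called only after the directory listing is complete,
    mirroring the order of steps described above.
    """
    now = time.time() if now is None else now

    # Step 1: list candidate BLOB files, skipping anything written recently.
    candidates = []
    for entry in Path(site_dir).iterdir():
        if not entry.is_file():
            continue
        age = now - entry.stat().st_mtime
        if age > MIN_AGE_SECONDS:
            candidates.append(entry)

    # Step 2: fetch the IDs that the content database still references.
    referenced = set(get_referenced_ids())

    # Step 3: whatever remains unreferenced is considered orphaned.
    return sorted(p for p in candidates if p.name not in referenced)
```

A BLOB that is still being written, or that finished after the referenced IDs were collected, is younger than the threshold and therefore never reaches the orphan list in this sketch. Actually deleting the returned files is left to the caller.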
Conclusion
By integrating an external storage
solution with SharePoint, you can increase storage efficiency, system
performance, and scalability of a SharePoint farm. Another advantage is that
this forces users to go through SharePoint to access unstructured contents.
Pulling data out of the content databases via direct SQL Server connections only
yields binary BLOB identifiers instead of the actual files. However, EBS
providers also have drawbacks due to the complexity of maintaining integrity
between metadata in the SharePoint farm's content databases and the external
BLOB store.
In order to integrate SharePoint with
an external storage solution, you must build an EBS provider, which is a COM
server that implements the ISPExternalBinaryProvider interface with its
StoreBinary and RetrieveBinary methods. You can create unmanaged and managed EBS
providers, but be aware of performance and compatibility issues if you decide to
use managed code. Also keep in mind that the ISPExternalBinaryProvider interface
does not include a DeleteBinary method. You must explicitly remove orphaned
BLOBs through lazy garbage collection, and be careful to avoid timing conditions
that can lead to the accidental deletion of valid BLOB items.
Pav Cherny is an IT expert and author
specializing in Microsoft technologies for collaboration and unified
communication. His publications include white papers, product manuals, and books
with a focus on IT operations and system administration. Pav is President of
Biblioso Corporation, a company that specializes in managed documentation and
localization services.