Belits Computer Systems


Document management server

Document processing model in modern offices

Modern networked offices usually have a very simple document processing model. Documents are stored as files in users' personal storage (local hard drives, users' directories on network-attached storage), shared storage (directory hierarchies on servers with various kinds of permissions given to users) and read-only public storage such as internal web servers. In addition to regular files and directories, documents are stored as attachments to email in personal and shared folders.

Some documents exist only within an application that treats them as records managed by a set of rules. One example is a support/trouble ticket system that lets users store records and associate documents with rigidly defined, easy-to-look-up issues that are automatically associated with products, customers, people involved in the resolution of problems, etc. Relatively recently, wiki-style systems, with less rigid systems of links yet sufficiently strict procedures for editing and history/reference management, became popular for storage of "knowledge base"-style sets of documents.

Changes to documents are usually tracked in a manner that depends on the nature of the document. Word processor formats often have built-in history, email may contain countless snapshots of the same document with associated discussion, support tickets have history as their format's purpose, and wiki-like systems have change tracking as a function of the wiki or wiki-like software, which is also responsible for consistency of references and search. Version/revision tracking systems are occasionally used to store documents with large amounts of text, however their use for general-purpose document storage and change tracking is nowhere near as widespread as their use in software development for storing source code.

Document transfer between users often happens by sending the whole document by email, often with the intent that it be edited and sent back or contributed to some shared storage. Read-only documents are often stored in some area accessible to most local users, yet writable only by their authors, so they are often pointed to by URL or path instead of being sent to recipients. Many documents that arrive by something other than email (physical packages, faxes) do not easily integrate into this system, and often remain stored in large image files, readable yet difficult to incorporate into anything else due to their large size or poor software compatibility. Printed forms are the most difficult to make available electronically -- even though their function is close to that of the commonly used web, word processor or PDF forms, the electric typewriter remains the best tool for dealing with them due to the variety of their layouts, even if in the end the filled form is digitized by a fax machine.

Problems and limitations of traditional document processing

In the absence of document management all documents end up being passed between the users, accumulating changes and ending up in multiple copies of their various revisions. This process can invisibly create branches and require a large amount of manual work to combine those diverging versions. In the end a "final" version ends up in read-only storage, starting another complex process of using such a document (approvals, responses, derivation, use as a form or a template).

A document that initially arrived on paper, by fax or in some unusual format may be unable to enter processing in a usable form, may require vast amounts of resources just to be scanned/converted, read and displayed, and may become unreadable after any attempt at embedding it into the document or email formats commonly used within the office. Automated handling of forms is mostly limited to complex applications, so whenever a form is specific to the business logic of the company, the amount of resources and expertise necessary for adding it to some web application is often unreasonably large for such a trivial task. Forms end up being distributed in Word and PDF documents, with entries manually submitted, requiring further manual processing just to extract useful data.

Support ticket handling applications and wiki-like systems organize document handling for their particular applications, however their purpose is limited. A ticket/CRM system is only good because it ties data to distinct problems and solutions; a wiki handles everything as free-format cross-referenced text, possibly with some references to (but not from) other formats. All currently popular document formats lack a distinction between presentation and structure, and wiki-like systems are designed primarily for handling text entered directly by users. They are therefore only useful for people who are committed to using them as the primary form of internal storage and have sufficient discipline to keep them organized. The alternative is cutting and pasting Word documents into text, marking a few links in it, and then having no good procedure for handling its revisions because all editing happens outside of the system, in a different format.

Functionality of the document management system

A document management system should provide a way to handle documents in existing formats and fit into the disorganized document processing model described above, providing storage, input, format conversion, backup, change and derivation tracking, and search. Email, ticket tracking and a wiki-based cross-referenced system can be incorporated as well, however those subsystems should be able to use references to stored documents, and easily import text from them.

Storage, access and backup

Storage should be provided in the form of regular networked filesystems/shares and repositories that can be accessed over HTTP, or over POP/IMAP for mail or shared folders accessible through a mail reader. Time-based change history for filesystems can be provided through automated snapshots; long-term history can be handled through incremental backups. HTTP-based repositories should track versions of uploaded files. Mail folders will have a time-based policy under which messages in "current" folders are moved to "old" folders, and from there to historical backups. "Permanent" folders may contain messages that are intended to be excluded from this process, with a regular incremental backup policy applied to them.

Input

Documents may arrive by email or fax, be scanned as images, be entered manually as text or as formatted or structured documents, or be transcribed from audio recordings. Additionally, many documents may arrive in a format that requires manual conversion -- for example, a scanned image converted to formatted text. If a document arrived by fax, it is usually impossible to determine automatically who is supposed to receive it, so it should appear in the "incoming faxes" queue/repository, accessible through a web interface. Depending on the security policy in the office, two solutions are possible:

  1. Make this queue/repository read-only and readable by everyone, or by everyone in the department served by the fax number that received the document (physically it may be the same device or server for multiple numbers, but every number will correspond to a queue with separate access rules). Documents are deleted by time, or by a rule that assigns an initial lifetime when the document arrives and, on each access, resets the expiration to the current time plus some constant. "Time" may be counted in local "business hours" if necessary. Say a document lives for three business days, but each access sets the limit to one business day from the access time. A document that was accessed immediately after arrival remains on the server for a day, then gets deleted. A document that was accessed during the third day gets its lifetime extended into the next day, in case it needs to be downloaded again.
  2. Give some users the right to dispatch the documents to one or more recipients, possibly also with some timeout to allow them to copy the documents to more recipients.
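
The lifetime rule from option 1 can be sketched as follows. This is only an illustration under stated assumptions: the names `QueuedFax`, `add_business_days` and `purge` are hypothetical, the two constants are taken from the example above, and a business day is assumed to be Monday through Friday with no holiday calendar.

```python
from dataclasses import dataclass
from datetime import datetime, timedelta

INITIAL_LIFETIME_DAYS = 3  # assumed: "lives for three business days"
ACCESS_LIFETIME_DAYS = 1   # assumed: "one business day since the access time"


def add_business_days(start: datetime, days: int) -> datetime:
    """Advance by whole business days, skipping Saturdays and Sundays."""
    current = start
    remaining = days
    while remaining > 0:
        current += timedelta(days=1)
        if current.weekday() < 5:  # Monday=0 ... Friday=4
            remaining -= 1
    return current


@dataclass
class QueuedFax:
    name: str
    arrived: datetime
    expires: datetime

    @classmethod
    def receive(cls, name: str, arrived: datetime) -> "QueuedFax":
        # The initial lifetime is assigned once, when the document arrives.
        return cls(name, arrived, add_business_days(arrived, INITIAL_LIFETIME_DAYS))

    def access(self, now: datetime) -> None:
        # Each access replaces the expiration with "now plus a constant".
        self.expires = add_business_days(now, ACCESS_LIFETIME_DAYS)


def purge(queue: list, now: datetime) -> list:
    """Return the documents that are still alive at the given time."""
    return [fax for fax in queue if fax.expires > now]
```

Because an access replaces the expiration rather than extending it, a fax read right after arrival is kept for one more business day only, which matches the behavior described in the example.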

In either case the queue should be backed up with current files only -- once a file is deleted from the queue it should be assumed to be either copied or discarded. Faxes produce giant, hard-to-process bitmaps, so archiving all of them in one backup repository would be impractical.

Scanning does not have the problem of identifying the destination because it is supposed to be initiated by the recipient, even if the scanner is networked. Some network scanners have an "upload to FTP" function that lets the user enter a code on the scanner's front panel, after which the scanned image is uploaded to some server. However, this may be inconvenient: the scanner may limit the number of codes, and setting them up requires a tedious manual programming procedure. As an alternative, there should be an application that allows a user to "reserve" the scanner, place the document(s) in it, configure the scanning mode, possibly do a preview scan, then start normal scanning, and receive the documents in his repository. Again, this can be implemented as a web application, and it may allow the user to direct the document to his files, or to the web-accessible repository of tracked documents. In very large offices or in offices with a high workload on scanners it may be necessary for the application to choose a scanner when one becomes available, and tell the user which scanner to use. It may be necessary to build a small "scanner server" computer that does nothing but control one or a few USB scanners, upload scanned images to the server and drive a small LCD panel that displays users' names and the status of the queue. The "scanner server" may have no local storage, and its configuration can be handled by providing a boot image and configuration files over TFTP and NFS.

Transcription from audio recordings should be handled on the desktop computers, with a special audio player that allows the user to start/stop/seek the file in small segments, and to assign typed paragraphs to a time range in the original recording. Once transcribed, the original audio (or video) file is uploaded to the server and stored along with a text+markup file, not unlike the format used for subtitles. It should be possible to edit those files, and to extract text for insertion into normal text documents. The "media player/editor" application should be completely client-side, and may use additional devices such as a foot pedal, audio input and output, a USB or Bluetooth headset, a USB or FireWire connection to a standalone recorder or DV camera, etc. In the "virtual desktop" configuration this media player may operate independently of the desktop's virtual machine, and run as a part of the client.
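
One possible shape for the text+markup file is sketched below. The layout imitates common subtitle files purely as an assumption; the text does not fix a format, and the names `Segment`, `to_markup` and `extract_text` are hypothetical.

```python
from dataclasses import dataclass


@dataclass
class Segment:
    start: float  # seconds from the beginning of the recording
    end: float
    text: str     # paragraph typed by the transcriber


def format_timestamp(seconds: float) -> str:
    """Render seconds as HH:MM:SS,mmm."""
    millis = round(seconds * 1000)
    hours, millis = divmod(millis, 3_600_000)
    minutes, millis = divmod(millis, 60_000)
    secs, millis = divmod(millis, 1000)
    return f"{hours:02d}:{minutes:02d}:{secs:02d},{millis:03d}"


def to_markup(segments: list) -> str:
    """Serialize segments in time order as numbered, timestamped blocks."""
    blocks = []
    ordered = sorted(segments, key=lambda seg: seg.start)
    for index, seg in enumerate(ordered, start=1):
        span = f"{format_timestamp(seg.start)} --> {format_timestamp(seg.end)}"
        blocks.append(f"{index}\n{span}\n{seg.text}")
    return "\n\n".join(blocks) + "\n"


def extract_text(segments: list) -> str:
    """Drop the timing and keep the paragraphs, for pasting into documents."""
    ordered = sorted(segments, key=lambda seg: seg.start)
    return "\n\n".join(seg.text for seg in ordered)
```

Keeping the time range next to each paragraph lets the player jump back to the corresponding place in the recording when a transcript is later edited.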

A similar operation can be performed on faxes and scanned image files -- they may be originally received and tracked as images, then converted to formatted or unformatted text manually, or with the assistance of OCR software. The document tracking system should store the text version as an alternative format, or as a new version of the original file, preserving the associations by keeping URLs and all internal references unchanged.

Version tracking and format conversion

There should be a simple method of checking in documents after editing. It may hook into file updates on the user's networked storage, or be done explicitly by uploading the file using a web form. A simple upload UI, such as a drop box on the desktop or a file manager/shell menu extension on the client side, can also help simplify this operation. Once the file is uploaded, it may be analyzed for changes independently of any change-tracking features of the file format itself, and the set of changes may be inserted into the "history" version of the document. The "history" version may be in the same format as the original document, or it may be in some easier-to-handle format. For example, text or HTML file history may be presented as HTML with attributes that make text highlighted or deleted when a custom stylesheet is applied to it by a web browser extension. An OpenDocument file may carry change tracking information in its XML in the same way as changes recorded when the file is edited (style changes will be difficult to reflect in this format, so they will be invisible to the user, yet can be reversed when a particular version of the file is requested from the repository). Image files may be converted into layered formats, or into a set of images and highlighted changes, accessible through a web page with JavaScript. Vector and technical drawings may also be represented as blocks and layers, however most likely the procedure will be unreasonably convoluted, so they may just be left without human-readable history. It may be necessary to convert and simplify some formats (PDF, Microsoft Office, all spreadsheets) to make history visible.
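
For plain text, the "history" version described above could be produced along these lines. This is a minimal sketch that uses Python's standard `difflib` module as a stand-in for whatever comparison the server would really perform; the function name `history_html` and the class names `removed` and `added` are assumptions, chosen so that a custom stylesheet could highlight or strike through the marked text.

```python
import difflib
import html


def history_html(old: str, new: str) -> str:
    """Return an HTML fragment marking word-level deletions and insertions."""
    old_words = old.split()
    new_words = new.split()
    matcher = difflib.SequenceMatcher(a=old_words, b=new_words, autojunk=False)
    parts = []
    for tag, i1, i2, j1, j2 in matcher.get_opcodes():
        removed = html.escape(" ".join(old_words[i1:i2]))
        added = html.escape(" ".join(new_words[j1:j2]))
        if tag == "equal":
            parts.append(added)
            continue
        # "replace" yields both parts, "delete" and "insert" only one.
        if removed:
            parts.append(f'<del class="removed">{removed}</del>')
        if added:
            parts.append(f'<ins class="added">{added}</ins>')
    return " ".join(parts)
```

The comparison works on whitespace-separated words, so original line breaks and spacing are not preserved; a real history view would need to keep them.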

To provide better interoperability and avoid wasting resources on the clients, a document uploaded into the repository may need to be converted into a "viewable" format that will be preferred for displaying the document when it is read-only. For example, a fax may have an original TIFF format with different horizontal and vertical resolutions, and two viewable formats -- PNG and PDF. A technical drawing may be in DXF or DWG original format, with a viewable PDF. Some of those conversions can be done automatically on the server when the file is uploaded/checked in; some may require the user's action. In the latter case, the user may be presented with the list of files that need an alternative representation, so he can open/convert/upload the results manually. This may be automated using the filename conventions used by the "export"/"print to PDF"/... procedures provided by various editing software.
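
The list of files that still need an alternative representation could be derived from filename conventions roughly as shown here. The mapping `VIEWABLE_FORMATS` only restates the fax and drawing examples from the text, and the convention that an exported file keeps the original base name is an assumption, as is the function name `missing_viewables`.

```python
from pathlib import PurePosixPath

# Assumed mapping from original formats to preferred viewable formats.
VIEWABLE_FORMATS = {
    ".tif": [".png", ".pdf"],
    ".tiff": [".png", ".pdf"],
    ".dxf": [".pdf"],
    ".dwg": [".pdf"],
}


def missing_viewables(filenames: list) -> dict:
    """Map each original file to the viewable copies not yet uploaded."""
    present = {name.lower() for name in filenames}
    todo = {}
    for name in filenames:
        path = PurePosixPath(name)
        wanted = VIEWABLE_FORMATS.get(path.suffix.lower(), [])
        missing = [
            str(path.with_suffix(ext))
            for ext in wanted
            if str(path.with_suffix(ext)).lower() not in present
        ]
        if missing:
            todo[name] = missing
    return todo
```

For a directory holding `plan.dwg`, `plan.pdf`, `fax01.tiff` and `fax01.png`, only `fax01.pdf` would be reported as missing.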

When multiple users edit the same file, or the same user derives multiple versions from the same original file, it should be possible to create a tree of changes, and view differences between any versions. The viewer may present the tree graphically, in a format similar to an email thread. A separate operation may provide text-merging functionality. If the document editor is incapable of providing an adequate user interface for merging, a web interface may provide a simple JavaScript-based text merge application. The units of text presented to it should be generated on the server, so identification of corresponding chunks of text may be very sophisticated while the user interface can be very simple (click to switch between versions, press a button to add your own version of the segment). The end result may be automatically converted into some reasonable approximation of the original format.
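
The server-side half of this can be sketched as below: a version tree stored as child-to-parent links, and a function that cuts two versions into units the merge interface can toggle. It is a deliberately simple two-way comparison built on the standard `difflib` module, not the sophisticated chunk matching the text allows for, and all names (`common_ancestor`, `merge_chunks`, `resolve`) are hypothetical.

```python
import difflib


def ancestors(parents: dict, version: str) -> list:
    """Walk from a version up to the root of the tree of changes."""
    chain = []
    while version is not None:
        chain.append(version)
        version = parents.get(version)
    return chain


def common_ancestor(parents: dict, a: str, b: str):
    """Find the nearest version from which both a and b were derived."""
    seen = set(ancestors(parents, a))
    for version in ancestors(parents, b):
        if version in seen:
            return version
    return None


def merge_chunks(mine: list, theirs: list) -> list:
    """Split two paragraph lists into agreed units and units needing a choice."""
    matcher = difflib.SequenceMatcher(a=mine, b=theirs, autojunk=False)
    chunks = []
    for tag, i1, i2, j1, j2 in matcher.get_opcodes():
        if tag == "equal":
            chunks.append({"kind": "same", "text": mine[i1:i2]})
        else:
            chunks.append({"kind": "choice", "mine": mine[i1:i2], "theirs": theirs[j1:j2]})
    return chunks


def resolve(chunks: list, decisions: list) -> list:
    """Apply one decision per choice: 'mine', 'theirs', or a list of new paragraphs."""
    pending = iter(decisions)
    merged = []
    for chunk in chunks:
        if chunk["kind"] == "same":
            merged.extend(chunk["text"])
            continue
        decision = next(pending)
        merged.extend(chunk[decision] if isinstance(decision, str) else decision)
    return merged
```

The browser only needs to display each "choice" unit and send back the list of decisions, which keeps the client-side code trivial.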

Templates and forms

Templates and forms do not fit into a regular change-tracking pattern because a single document may have a huge number of "derivations" that share the original document's structure and have various data entered by users. Templates are usually easily handled by document editors, so it may only be necessary to keep the template association with a document. That document may have large parts of it, including formatting, changed, leaving the final document associated with the template only by origin. Templates are specific to formatted text and technical drawings -- they are uncommon in other formats, so at most it will make sense to recognize the templates used in documents and keep track of them in meta-information accessible through a web interface.

Forms have a different purpose -- they are documents with fields that can be filled in by the user. Usually fields are supposed to be filled with plain text entries or check marks, though other kinds of form entries (fields for sketches, or diagrams and drawings on which markings are positioned) may be a part of the form. Paper forms usually have a fixed size and number of fields, and may allow attachment of plain text if the space on the form is insufficient for entries. Forms distributed electronically (in word processor text files, spreadsheets, PDF and HTML files) may allow automatic expansion of an entry field, however only application-driven web interfaces are designed to handle a variable number of input fields. Spreadsheets may be expanded as necessary when they are filled in.

HTML forms are tied to the web server and a script that processes the input. PDF forms may be possible to fill in and save, however Adobe tied form-saving functionality to the use of its Acrobat software, which is licensed per user, so many businesses don't have enough licenses to let everyone generate those forms. This may change in new Adobe and third-party software versions, however server-side processing of form filling remains the best solution for both HTML and PDF forms.

Word processing applications such as Microsoft Office and OpenOffice.org have their own forms and spreadsheets, and those applications allow the user to save forms, or to submit them to the remote server preprogrammed in the form. It's also possible to generate PDF forms in those word processors.

It is relatively easy to extract data from an edited and returned form in any format, and it is possible to create a submission preprocessing program that embeds a unique identifier into every form entered into the tracking system, so that any form, submitted either as a returned file or over HTTP using controls inside the form, is properly associated with the originally distributed form. Regardless of how the submission is handled, a "filled form" document and a form submission database table can be automatically generated and made available to the originator through the web interface. This functionality will be similar to Adobe Acrobat's handling of forms, however it will have the advantage of handling all possible form formats, and will require no additional software on the desktops, workstations or remote client computers.
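
The identifier-based association can be illustrated with an in-memory sketch. How the identifier is physically embedded in a Word, PDF or HTML form is not shown, and `FormTracker` with its methods is a hypothetical name, not part of any existing product.

```python
import uuid


class FormTracker:
    """Associates returned forms with the originally distributed form."""

    def __init__(self):
        self.forms = {}        # form_id -> {"name": ..., "fields": [...]}
        self.submissions = {}  # form_id -> list of submitted rows

    def register(self, name: str, fields: list) -> str:
        """Enter a form into tracking and return the identifier to embed in it."""
        form_id = uuid.uuid4().hex
        self.forms[form_id] = {"name": name, "fields": list(fields)}
        self.submissions[form_id] = []
        return form_id

    def submit(self, form_id: str, values: dict, channel: str) -> dict:
        """Record one submission, whether it came back as a file or over HTTP."""
        if form_id not in self.forms:
            raise KeyError(f"unknown form identifier: {form_id}")
        fields = self.forms[form_id]["fields"]
        row = {field: values.get(field, "") for field in fields}
        row["_channel"] = channel  # assumed bookkeeping column
        self.submissions[form_id].append(row)
        return row

    def table(self, form_id: str) -> list:
        """Return a header row plus one row per submission, for the originator."""
        fields = self.forms[form_id]["fields"]
        rows = [[row[field] for field in fields] for row in self.submissions[form_id]]
        return [list(fields)] + rows
```

Submissions arriving through different channels end up in the same table because they carry the same embedded identifier.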

When scanned or faxed documents are converted into fillable forms, it should be possible to use Adobe Acrobat, a word processor's functionality for creating a form over a background image, or a specialized paper-to-form client-side application that creates forms in a word processor-compatible or HTML format with CSS field positioning. The advantage of word processor formats (such as OpenDocument) over PDF forms is the possibility of editing the form by replacing one large raster image with automatically or manually recognized text and formatting while keeping the set of fields. This provides continuity between the versions of the form, allowing the new form to be re-used in the same procedure in which the original raster-based form was used. To allow this operation it should be possible to provide document versioning that preserves the form association if the set of fields is unchanged, or converts the submissions if the set of fields is modified.

If a form is supposed to be processed by an application, it should be possible to associate the application with form submission. Once the application is associated, the document management system should be able to "feed" it already submitted forms, and to pass all new form submissions to it. This will make it possible to separate the procedures that require programming from both format design and the data collection process, thus reducing the number of time-dependent operations and allowing IT personnel to debug their applications without asking everyone else to re-submit their data or delaying data collection until after the application has passed testing.
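
The "feed already submitted forms, then pass new ones" behavior amounts to a replay followed by a subscription. A minimal sketch, with the hypothetical class `SubmissionFeed` and plain callables standing in for applications:

```python
class SubmissionFeed:
    """Stores form submissions and forwards them to attached applications."""

    def __init__(self):
        self._stored = []    # every submission received so far
        self._handlers = []  # applications associated with the form

    def submit(self, submission: dict) -> None:
        """Store a new submission and pass it to every attached application."""
        self._stored.append(submission)
        for handler in self._handlers:
            handler(submission)

    def attach(self, handler) -> None:
        """Feed the application all stored submissions, then subscribe it."""
        self.replay(handler)
        self._handlers.append(handler)

    def replay(self, handler) -> None:
        """Re-run stored submissions, e.g. against a build under test."""
        for submission in self._stored:
            handler(submission)
```

Since submissions are stored whether or not an application is attached, data collection can start before the application exists, and `replay` can be repeated while it is being debugged.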

Document submission procedures

There should be multiple ways for a document to enter the tracking system. It should be possible to upload it over HTTP, using a regular web browser through a "file upload" page provided on the server, or using the previously mentioned client-side program that simplifies this operation. Another entry point can be provided by an email address that the user may add as a recipient when sending a document. However, in most environments it should be easier to let users save all documents to their normal document directories stored on a server, and to monitor those directories for changes, entering the documents into the system automatically. Once the program has detected a new or changed document, it should analyze its associations (format, templates, original document that was edited to produce this file, form association, etc.) and add a document entry request to the user's list of incoming documents.
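
Directory monitoring can be approximated by comparing periodic snapshots, as in the sketch below. Polling with content hashes is an assumption made for portability; a real server might use filesystem change notifications instead. The function names `snapshot` and `detect_changes` are hypothetical.

```python
import hashlib
from pathlib import Path


def snapshot(root) -> dict:
    """Map each file under root (by relative path) to a hash of its content."""
    state = {}
    for path in sorted(Path(root).rglob("*")):
        if path.is_file():
            digest = hashlib.sha256(path.read_bytes()).hexdigest()
            state[str(path.relative_to(root))] = digest
    return state


def detect_changes(previous: dict, current: dict) -> tuple:
    """Return (new files, changed files) between two snapshots."""
    new = sorted(set(current) - set(previous))
    changed = sorted(
        name for name in current
        if name in previous and current[name] != previous[name]
    )
    return new, changed
```

Each path returned by `detect_changes` would then be analyzed for associations and turned into an entry request in the owner's list of incoming documents.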

If the user runs a submission client program, or has the document submission page open in a browser, it may issue a request (popup, information balloon, etc.) allowing the user to edit or add any additional information needed for document tracking. To the user this should look seamless because it would be triggered by saving the file. It may also happen when a program auto-saves the document, and if such a situation is detected, it makes sense to make the notification less obnoxious. The client application may detect that the user was inactive when the auto-save happened, and show the notification as a balloon instead of the more intrusive pop-up window that would be a better choice if the user were saving the file manually. While by no means reliable, this will make the user's experience more comfortable, and create the impression that document submission requests are a part of the document saving procedure, without any need to modify client-side document-processing software.

Search

The search functionality should allow users to search documents by content and by association with groups/projects known to the system. Since the system knows all users' permissions, it will also be able to prevent unauthorized access through search results. While this part of the system will duplicate the functionality of Google's desktop and appliance products, it should be understood that Google's success as a search engine is based on its software's ability to recognize the relevance of documents through references produced by people not involved in the creation of the referenced document. Within a company this is irrelevant, and the task reverts to format recognition and trivial word/phrase search in the database of indexed documents. The functionality will be useful because it can use people/project/application associations as the key, and the output would be a directory of internal documents with association and version information instead of just links to files, however the actual algorithms used for search should be very simple.
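
A "very simple" search of this kind might look like the following sketch: an inverted index of words, with results filtered by the groups a user belongs to. The class `DocumentIndex`, the all-words-must-match rule and the group-based permission model are assumptions made for illustration.

```python
import re
from collections import defaultdict


class DocumentIndex:
    """Word index over documents, with group-based filtering of results."""

    def __init__(self):
        self._postings = defaultdict(set)  # word -> ids of documents containing it
        self._groups = {}                  # document id -> groups allowed to read it

    def add(self, doc_id: str, text: str, groups: list) -> None:
        self._groups[doc_id] = set(groups)
        for word in re.findall(r"\w+", text.lower()):
            self._postings[word].add(doc_id)

    def search(self, query: str, user_groups: list) -> list:
        """Return ids of readable documents that contain every query word."""
        words = re.findall(r"\w+", query.lower())
        if not words:
            return []
        hits = set.intersection(*(self._postings.get(w, set()) for w in words))
        allowed = set(user_groups)
        return sorted(doc for doc in hits if self._groups[doc] & allowed)
```

Filtering happens inside `search`, so a user never learns that a matching document exists unless one of the user's groups is allowed to read it.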

Implementation of the document management system as an appliance server

Nothing in the design described above makes it absolutely necessary to provide this system as a pre-made hardware+software product. In theory, it may be sold as a software product. The problem is that any server-side product usually has to be configured for the particular configuration of existing hardware and network, which often requires the involvement of integrators or consultants. This can unreasonably increase the cost, and require some draconian per-seat licensing, common for other office software. Supporting configurations installed by third parties and affected by problems caused by other software running on the same servers, supporting operation under Windows or on desktop-class hardware, and dealing with other problems that stem from a lack of control over the environment where the software is installed can become a constant burden for the software vendor. A massive support and training infrastructure for such a product would require resources that would otherwise be spent on development, marketing and reasonable support for a few known configurations.

The whole system can be implemented on one or a few servers that provide the following functions:

  1. Document storage.
  2. Database storage.
  3. Indexing.
  4. Document conversion.
  5. Format recognition, automatic insertion of identifiers, creation of alternative copies, association of documents.
  6. Web interface.
  7. User authentication and authorization.

Additionally the system may contain small scanner, fax and printer servers, and traditional mail/web/... servers that are tied into the rest of the system. Some additional functionality can be provided, such as email spam filtering based on the "good" words lists pulled from indexes produced from known-good internal documents.

Client-side software should be minimal, and include:

  1. Document uploader.
  2. Lightweight form editor.
  3. Media player, designed for transcription.

Those components are small, do not depend on large amounts of resources or on the reliability of hardware, and can easily be provided with the minimal support that they require. If the system is combined with virtual desktops, those components may either be parts of the standard desktop image running inside the virtual machine, or be provided as a part of the client running on desktop hardware (in particular the media player, which would reduce the virtual machine's involvement in media playback).

TD-44 (processing or servers with external storage) and TD-88 (with built-in storage) servers can be easily supplied with software and pre-configured to run in any reasonable network configuration. Since the only initial configuration necessary for the system itself is the lists of users, projects and permissions, this will eliminate the need for consultants, and reduce the cost of the system for the users. The large amounts of both storage and processing power required justify the use of dedicated high-performance servers, so the overall cost of this system will be lower than the cost of an equivalent configuration of software installed on pre-existing servers.