SAF/E: Structured Authoring Framework/Environment
Data and Content Workflow Tools
Draft in progress - This Version is the 0.5 draft May/13/2009
Core Problem:
Non-technical Domain Overview
Communication of information relies on shared understanding. Participants in communication rely on mapping internal understanding to language with the expectation that others will be able to reverse the mapping from language to their own internal understanding with some reasonable level of fidelity. Traditional approaches to information modeling for computers are based on the need for reliable interpretation at low computational cost. The result is a complex landscape of data formats, some more standardized and interoperable than others.
A number of technologies have made great strides towards making data exchange (communication between different computer applications) more efficient and effective. Encoding standards, markup languages, and relational databases are a few of the key contributors. These tools improve communication and the process of generating, sending, receiving, and processing it. The primary challenge that remains is that interchange requires a shared model, and computing is no more ready to embrace a single data format than humans are ready to embrace a single language.
The goal for SAF/E is to create a framework that supports authoring and communication of a highly efficient and effective data format. Information may be stored in a variety of native formats well suited to individual applications, and automatically transformed for communication in an interoperable format. SAF/E heavily considers the role of workflow around communication in the pursuit of providing an efficient and effective framework. It incorporates many well established technologies to fully realize the goal.
Consider a typical scenario:
Our Department A relies on Department B to provide a set of information in a timely manner such that our own workflow, dependent on that data, does not hold up other workflows and the productivity they generate. As with all scenarios of interdepartmental cooperation, we have our preferred way of working and Department B has their own. These two ways of working are frequently different and, at least occasionally, not interoperable. Department B has access to systems we do not and keeps information in formats convenient to their own workflow. Our dependence on their willingness to duplicate effort creates an additional chain of dependence that is very likely unnecessary.
Generally, many organizations suffer from the following interdependent problems:
- Workforce turnover and consequently:
  - the cost of training new staff.
  - the loss of 'organizational memory' when a staff member takes unique or specialized knowledge with them.
- The liability of developing and maintaining multiple systems of information.
- Dependence on legacy systems.
- The risk of dependence on black-box systems.
- The loss of productivity when some link in the chain fails.
- The barrier of resistance to change.
- Resource constraints.
Technical Domain Overview
Defining the notion of "authoring" very broadly, individuals and computerized systems generate information containing useful data on a regular basis. This synthesis is only one step in the communication process. A single communication involves:
- a synthesis of data into information (authoring),
- transmission through some medium (publishing),
- reception of information, and
- a replicating process where the information is decomposed back into data (comprehension).
When communication is successful, the data that went into the process reasonably resembles the data that comes out. More often than not, communication of information in the real world follows a more complex chain of processes composed of the same basic steps and often multiple senders and recipients are involved.
While SAF/E's name might seem to imply an emphasis on the authoring
portion of the workflow, its components include features for the
comprehension end.
Just as communication is a complex and layered process, many human and
automated workflows around communication are complex and layered. To support such workflows, SAF/E is intended to:
- Allow more seamless collection, securing, and sharing of data according to both internal business logic and preferred internal workflow.
- Allow selected others to access data translated into a representation compatible with their business logic as seamlessly as possible with their own preferred internal workflow.
- Create a scalable, flexible, supportable system.
- Be implementable in a variety of technologies such that it need not be tied to any one architecture.
- Build on existing application framework(s) and established technologies to reduce development time and maintenance.
- Conform to or be interoperable with public standards for communication of data and documents.
- Be designed in such a way that workflows are treated little differently than other forms of data and can themselves be communicated.
Introduction:
Background
In 2002 I began a project to provide an enhanced clone of the (NC State)
University's inForm
facility for the College of Engineering. The project took longer than I
would have liked. It was a learning experience in project management,
delegating coding responsibilities, and evolving a software service
over time. The end result was named engrForm.
This service grew in use and had easily recouped its development cost
in saved development time before I left engineering in 2007.
engrForm is a generic form processing package which combines:
- data submitted by any standard HTTP GET or POST method (typically an HTML form),
- a set of configuration directives submitted by the form, and
- configuration directives contained in a PHP configuration file.
Depending on the outcome, it returns one of the following to the web browser:
- an HTML page following a storage action to one of several supported output media,
- an HTML page with relevant error information, such as required information missing, or
- another HTML page including the previously submitted data propagated in hidden fields.
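The flow described above can be sketched in a few lines. This is only an illustration in Python, not engrForm's actual PHP code: the function name `process_form`, the directive names `required` and `next_page`, and the rule that the configuration file wins over form-supplied directives are all assumptions made for the example.

```python
from html import escape


def process_form(submitted, form_directives, file_config, store):
    """Combine submitted data with directives and return (outcome, html).

    All names here are hypothetical; engrForm's real directives are not
    documented in this text.
    """
    # Assumption: the server-side configuration file overrides any
    # directive of the same name that was submitted by the form.
    config = {**form_directives, **file_config}

    # Outcome 1: an error page listing required fields that are missing.
    missing = [f for f in config.get("required", []) if not submitted.get(f)]
    if missing:
        return "error", "<p>Missing required fields: %s</p>" % escape(", ".join(missing))

    # Outcome 2: another page carrying the submitted data in hidden fields.
    if config.get("next_page"):
        hidden = "".join(
            '<input type="hidden" name="%s" value="%s">' % (escape(k), escape(v))
            for k, v in submitted.items()
        )
        action = escape(config["next_page"])
        return "continue", '<form action="%s" method="post">%s</form>' % (action, hidden)

    # Outcome 3: perform the storage action and confirm.
    store(submitted)
    return "stored", "<p>Thank you.</p>"
```

The `store` argument stands in for whichever output medium is configured (a file, e-mail, and so on), so the same function can serve many forms with different configurations.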
The model that made engrForm successful and a good long term time investment was the time savings it granted when used as a sort of framework to create multiple web forms. Rather than having to code the logic for each form individually, engrForm did the heavy lifting. The early versions handled all form processing exclusively and consequently had limited extendability. Later versions added a facility to plug in code to perform additional input verification and more complex branching logic to determine what HTML to present next based on the previously submitted data.
The ultimate limitation of engrForm was that it only solved one of the two major needs. It reduced the amount of programming time required to handle the server side processing of form data. It did not provide any solutions for the task of generating the form to be presented in the web browser. This is work that is arguably less technical, but undoubtedly a very design and knowledge intensive task. Proper form design requires very careful consideration of usability practices, information needs, and use of HTML form tags. This skill set is by no means trivial even compared to the task of programming the server side scripts needed to process collected data.
By 2005 engrForm was well in place as a solution, and the need to generate more advanced forms of output arose: storage to a database and output to XML. Neither of these solutions was conceived in the original inForm application, nor coded into engrForm. Other organizational needs prevented the kind of time devotion required to develop these features. I had conceived of methods to accomplish the kind of data modeling needed to make these two formats attainable, but the specifications were sketchy at best.
The next application in line to address some of these needs was TOTM, a topic-oriented topic management system that was being developed for engineering to help them meet their goal of rolling out a new college homepage for the fall of 2007. TOTM was designed to dovetail into structured authoring methods. This functionality was needed to make the engineering web site dynamically generated and continuously renewable in a scalable way. TOTM included a self describing data schema that included both data encapsulation (meta data on how types of data defined in the system are to be collected and verified) and data aggregation (meta data on combining reusable parts of the data in aggregate data types) to create a guided structured authoring environment.
The site release TOTM was originally intended to support was delayed
due to content acquisition problems. TOTM has since been used in a
limited capacity to serve the College of Engineering's computing
website. I changed jobs during this key point in TOTM's life cycle, so
the long term viability of the project remains to be seen. It was
picked up for use in E155,
an introduction to computing course. I had plans
to extend TOTM as a form generating front-end to engrForm's form
processing back-end, but these two projects were never linked.
Foundation
It is clear that there are a wide variety of formats and models to choose from, and there are equally as many target outputs one might desire. I believe that there is a data model that makes it possible to take advantage of both a process driven object oriented approach and also the mathematical foundations of the relational model in such a way that the strengths of both are maximized and their respective weaknesses mitigated. The nature of this new model is digressive and something I am still working out. It is detailed in The Flora Model. For the purpose of this project it suffices to say that the Flora Model is a type-oriented view of data.
Consider the proposition that any given document can be represented as a collection of granular and potentially reusable content units combined with formatting to create the overall presentation. These content units can be called "topics", as they are focal points of a communication activity. While formatting may be specific to an output medium, the topics that make up a single document have the potential for reuse across multiple documents or output mediums.
SAF/E topics are typed, meaning there is a prescribed structure that each must follow. This structure guides the authoring process, ensures data validity, and makes the data of each topic queryable. Flora, the data model used by SAF/E, uses an inherited type scheme such that data types can be specialized for very specific applications, and automatically generalized for the purpose of data sharing.
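As a rough illustration of specialization and automatic generalization, the sketch below models a topic type as a set of required fields inherited from a parent type. The class `TopicType`, its methods, and the example types `event` and `seminar` are hypothetical and are not taken from the Flora Model itself.

```python
class TopicType:
    """A topic type: a named set of required fields, optionally inherited."""

    def __init__(self, name, fields, parent=None):
        self.name = name
        self.parent = parent
        self.own_fields = dict(fields)  # field name -> expected Python type

    def fields(self):
        # A specialized type carries every field of its ancestors plus its own.
        inherited = self.parent.fields() if self.parent else {}
        return {**inherited, **self.own_fields}

    def validate(self, data):
        """Return a list of problems; an empty list means the topic is valid."""
        errors = []
        for field, kind in self.fields().items():
            if field not in data:
                errors.append(f"missing {field}")
            elif not isinstance(data[field], kind):
                errors.append(f"{field} is not {kind.__name__}")
        return errors

    def generalize(self, data, target):
        """Project a topic of this type onto one of its ancestor types."""
        ancestor = self
        while ancestor is not None and ancestor is not target:
            ancestor = ancestor.parent
        if ancestor is None:
            raise ValueError(f"{self.name} does not inherit from {target.name}")
        return {field: data[field] for field in target.fields()}


event = TopicType("event", {"title": str, "date": str})
seminar = TopicType("seminar", {"speaker": str}, parent=event)
```

A department could author `seminar` topics internally, while a recipient that only understands `event` receives the result of `seminar.generalize(topic, event)`.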
Without too much digression, it is important to point out that Flora is like XML in that authority for data structure is largely decentralized. This means individuals and organizations have control to structure the data in ways that best fit their needs. Flora mirrors a particular XML language, DITA, which is an OASIS Standard. DITA balances the need for authors to be able to define unique data structures with the need for authors to be able to automatically exchange data. Flora's approach to decentralized structure governance and highly automated interchange works on some of the same principles as DITA.
SAF/E defines five modules which provide functionality for applications to abstract the Flora data model.
- Process Module
- Map Module
- Topic Module
- Driver Module
- Skin Module
The map module uses the map query to select a map, and then inserts, updates, or selects a number of topics to or from the map. A map contains references to, relationships among, and hierarchy for one or more other maps, static topics, or dynamic topic drivers. All map and driver references are converted to their constituent topics if selected by the query. Every map query returns a bundle of one or more topics, even if it is only a trivial topic to indicate success or failure of the query. Very roughly, the map module correlates to a specific portion of the Complex Model in MVC. SAF/E uses a specific form of MVC where the actual View component is very thinly defined and implemented. The map module is a View Model, a portion of the model that focuses the work of the Model through the View, almost like a lens focuses light. The View may be highly specialized, but does little real work.
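The selection half of a map query might look something like the following sketch. It covers only "select" (not insert or update), and the data layout (maps as lists of `(kind, reference)` pairs, drivers as callables, a `status` topic for success or failure) is an assumption made for illustration.

```python
def resolve(map_name, maps, topics, drivers, _seen=None):
    """Expand a map into its constituent topics, following nested references."""
    seen = _seen if _seen is not None else set()
    if map_name in seen:  # guard against maps that reference each other
        return []
    seen.add(map_name)

    bundle = []
    for kind, ref in maps[map_name]:
        if kind == "map":        # nested map: expand recursively
            bundle.extend(resolve(ref, maps, topics, drivers, seen))
        elif kind == "topic":    # static topic: include as-is
            bundle.append(topics[ref])
        elif kind == "driver":   # dynamic driver: ask it for topics now
            bundle.extend(drivers[ref]())
        else:
            raise ValueError(f"unknown reference kind: {kind}")
    return bundle


def map_query(map_name, maps, topics, drivers):
    """Always return a bundle of at least one topic."""
    if map_name not in maps:
        return [{"type": "status", "ok": False, "detail": f"unknown map {map_name}"}]
    bundle = resolve(map_name, maps, topics, drivers)
    return bundle or [{"type": "status", "ok": True, "detail": "empty map"}]
```

Returning a trivial status topic instead of an empty list means callers never need a special case for "nothing came back".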
The topic module defines functionality to facilitate the creation, validation, and retrieval of topics based on a type definition and the creation of new type definitions. The maps used to aggregate topics are in turn simply a type of topic and can be created, specialized, and manipulated as topics. The topic module is the Domain Model portion of the Complex Model. This part of the model is most closely tied to the Action Controllers and the workflows appropriate to specific topic types.
The driver module defines functionality to create drivers that turn data from one or more data sources into topics. Each driver is an external module defined by the application. The driver module is the Data Model portion of the Complex Model. It most closely interfaces with the data abstraction layer.
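To make the idea of an application-defined driver concrete, here is a minimal sketch in which a driver is registered under a name and converts CSV text into topics. The registry `DRIVERS`, the decorator `register_driver`, and the choice of CSV as the data source are all hypothetical.

```python
import csv
import io

# Hypothetical registry: the application adds drivers, the framework looks them up.
DRIVERS = {}


def register_driver(name):
    """Decorator that records a driver function under the given name."""
    def wrap(fn):
        DRIVERS[name] = fn
        return fn
    return wrap


@register_driver("csv")
def csv_driver(source_text, topic_type):
    """Turn each CSV row into one topic (a dict) tagged with its type."""
    reader = csv.DictReader(io.StringIO(source_text))
    # The type tag is written last so a column named "type" cannot replace it.
    return [{**row, "type": topic_type} for row in reader]
```

For example, `DRIVERS["csv"]("name,room\nAda,101\n", "person")` yields a single `person` topic with `name` and `room` fields.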
The topic bundle that results from the map query is given to the skin module along with the formatting query. The formatting query selects a skin. In addition to pairing topic bundles with the formatting query, the skin module defines functionality to create skins. Skins, like drivers, are external modules defined by the application. The skin receives a topic bundle and integrates it into the final output. A skin may transform the topic bundle before integration. The skin module is effectively the View portion of the framework.
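A minimal sketch of this pairing follows, assuming a formatting query is a dictionary with a `skin` key and that each topic carries its text in a `body` field. Both of those details, and the names `SKINS` and `render`, are illustrative assumptions.

```python
import json
from html import escape


def html_skin(bundle):
    """Integrate a topic bundle into an HTML list."""
    items = "".join(f"<li>{escape(str(t.get('body', '')))}</li>" for t in bundle)
    return f"<ul>{items}</ul>"


def json_skin(bundle):
    """Emit the same bundle as JSON for machine consumers."""
    return json.dumps(bundle, sort_keys=True)


# Hypothetical skin registry, defined by the application.
SKINS = {"html": html_skin, "json": json_skin}


def render(bundle, formatting_query):
    """Select a skin from the formatting query and apply it to the bundle."""
    name = formatting_query.get("skin")
    if name not in SKINS:
        raise KeyError(f"no skin named {name!r}")
    return SKINS[name](bundle)
```

Because the skin only sees a bundle of topics, the same map query result can be published to several output media by changing the formatting query alone.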
The final output may then be cached, and is returned back to the process module so that it can be sent as a response. Storage and retrieval of the cache, maps, drivers, topics, and skins are all controlled by some external storage module defined by the application. Applications may impose authentication, access, and business logic control during any of the storage/retrieval steps. The application is responsible for making authorization decisions, logging and auditing, and a variety of other tasks more specific to the application than to the authoring and publishing workflow.
Approach
Rather than port code designed for another set of functional requirements and designed under a different design model, I decided a new solution was called for. Just as the prior developments (engrForm and TOTM) involved an informal "R&D" (research and development) phase, this project also required one. The first two projects were built on a wide and largely unfocused research and development phase. Source information was "gathered" rather informally and largely by reactive exposure rather than by direct research. This project required a more focused endeavor considering the scope and short lead time.
My general approach to development is twofold: address a specific problem by implementing and/or adopting a general solution (framework), and then extend the general solution to a specific solution for the problem (application). In an ideal scenario, the implementation of any framework is preceded by some research (analysis) of the problem domain. Since the recurrence of similar problem types in IT (and in fact most fields) is common, there is a value in spending the extra time up-front to perform this analysis and framework implementation. Following any successful application extension of the framework, a memory of the benefits and problems associated with using the framework is created. The next time a similar problem arises, the framework can be refactored (expanded and/or improved) to better address any existing issues or limitations and extended again to create a new specific solution. This form of problem-solving memory and evolutionary design saves a lot of development time in the long run. I like to call this the AFAR (analysis, framework, application, refactor) approach.
If s is the time it would take to implement a completely independent specialized solution to a problem, an approach I call "onesies", ideally the initial development of a framework should take roughly 2s and each extension of the framework would take s/2. Thus a typical first pass at this approach might take two to three times as long as the onesies approach, but by the third or fourth pass the original investment has been recouped and further applications will mean a net savings of time.
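The arithmetic behind this estimate can be checked with a short sketch. It assumes one reading of the paragraph: the framework is built once at a cost of 2s, and every application, including the first, is then an extension costing s/2. The function names are illustrative.

```python
def onesies_cost(n, s=1.0):
    """Total time for n independent one-off solutions."""
    return n * s


def afar_cost(n, s=1.0):
    """Total time for one framework (2s) plus n extensions (s/2 each)."""
    return 2 * s + n * s / 2


def break_even(s=1.0, limit=100):
    """Smallest number of applications at which AFAR costs no more than onesies."""
    for n in range(1, limit + 1):
        if afar_cost(n, s) <= onesies_cost(n, s):
            return n
    return None
```

Under these assumptions the first pass costs 2.5s against s for a one-off, the totals are equal at the fourth application, and every application after that saves s/2.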
A "problem domain" is a body of knowledge relevant to a specific problem or set of problems.
In the context of software development, the problem domain encompasses
everything about the application that makes it a valuable solution.
This can be defined as all knowledge needed to understand and write
software to meet the functional requirements.
One problem with my preferred approach is it is not always formally
"agile", as it
specifically recommends taking additional time up front. With the
additional analysis phase at the beginning, some may argue it is not
agile at all. It is my opinion that agility in any problem domain
requires a certain level of expertise to form a stable knowledge base.
Hitting the ground running with a new and sizable development project
is rarely a good idea, and most modern agile processes include
time up-front for planning, learning, and requirements gathering.
Considering this, there is some value in careful application of the onesies approach. It is most effective when approaching problems for the first time, especially in the case of a problem that is more trivial than complex. In such cases there is little point in spending extra time to solve the more general problem and it is often easier to "get a feel" for the problem domain in a hands on approach. The second or third time a similar problem arises, however, it is often worth considering how likely future similar occurrences will be. This largely relies on a judgment call which is in turn based on the memory of an individual or organization.
One critical part of any sizable project is to build from a framework. In many cases leveraging an existing framework, such as those that are community developed, is a highly effective way to drive down overall development time. There is often an initial learning investment. When the right framework is selected, this investment more than pays for itself over the course of several iterations. Essentially, whenever possible don't try to build completely from scratch. Existing good ideas are worth considering before reinventing the wheel.
There will be times where a "good" framework is not available that
meets the specific problem domain needs and the specific requirements
of the project. If there is a more general application framework,
considering a framework extension is almost always preferable to building a
framework from scratch. There are some cases where building a framework
from scratch may legitimately make sense, but this option should be
carefully scrutinized.
AFAR is not an approach that can easily be applied without a well established technology architecture, or without a foundation of experience in the problem domain. These are both important building blocks in the AFAR approach. In the absence of an existing technology architecture the framework phase is prone to the types of setbacks typically experienced when implementing solutions in an unfamiliar architecture. Similarly, in the absence of experience with the problem domain there will be a large learning curve associated with arriving at the correct solution and a high probability of false starts.
Vision
SAF (Structured Authoring Framework, part of SAF/E) is a
framework to address the recurring needs associated with structured authoring. Structured
authoring has emerged as a powerful new approach to bridging the gap
between modular collaborative writing, and professional digital publishing.
Structured authoring is the combination of a guided authoring process,
where the writer is informed of their options, and structural
constraints on the product, where there is a logically defined limitation on how information
may be expressed.
Guided authoring makes writing documents easier by simplifying the authoring process. The process of guided authoring relies on structural conventions and structural constraints. When there are no conventions or constraints, authoring cannot be guided.
Structural conventions and constraints standardize the representation of information. These standards are important for both human comprehension of information and for reliable programmatic manipulation. Standardization increases reliability and simplifies communication at the cost of creative expression. Structural conventions may be enforced to varying degrees, and in general there is a spectrum of benefit gained from following conventions that increases with the degree of compliance. In contrast, structural constraints must be followed completely or no benefit is gained.
The "E" in SAF/E stands for environment. This is an editor
application built on the SAF to make authoring data and content easier.
The core problem defined in this project focuses on data interchange
between workflows. This data is typically highly structured
information, meaning it has a specific format. Information this
structured is not traditionally handled with the "structured authoring"
approach, but it can be modeled as a special
case of structured authoring. As such, to provide a general solution
the problem domain for this project will encompass structured
authoring. This design
decision is beneficial for two reasons: in future scenarios there is a
high degree of probability that some data being interchanged may not be
highly structured, and addressing both loosely structured content
authoring and highly
structured data creates a much more reusable framework.
Focus
In keeping true to AFAR, I am considering the body of work that went into engrForm and TOTM in creating SAF/E. I am not, however, including any implementation specifics.
The primary focus is to build a general system that is reusable and extendable without being overly complex. My implementation of choice is PHP (with MySQL where needed) and that is what my employer would prefer me to use. Where possible, SAF will be built on the Zend Framework for PHP and use database abstraction to reduce technical dependence on any one database product. The system will be designed such that it should be implementation neutral and portable to another language/architecture with relative ease. As such it will conform to good OO practices in the limited scope of PHP5's support for OO. The application should, however, be portable to non-OO languages.
Where the database aspect of the program is concerned, my goal is to model it in a light (widely supported) subset of the relational database model. The database will be true to the relational model, but will not rely inherently on many of the features of the relational model that are not widely supported. In particular, it is my goal with this development to apply the AFAR principle to database modeling in such a way that only the framework will access the database directly. This means that the database itself is completely encapsulated.
Rather than rely on an OO model of data, or a relational model of
data, this project will use the Flora Model
of data to improve portability and benefit from the best features of
both models.
The "framework" component of SAF/E is, in this case, an extension of existing frameworks. My current intention is to build on top of the Zend Framework for the server side components and jQuery or ExtJS for the client side components.
SAF/E dovetails into existing workflows by providing a data
submission and request service. This is not to say that SAF/E is built
on or adheres to service oriented architecture (SOA), but it
shares some traits in common with SOA.
Workflows can use SAF/E to define an "interface" or "contract" by which
they provide data that can be accessed by other
workflows. SAF/E also has a mechanism for defining a number of internal
workflows which provide and/or accept data.
SAF/E builds upon the Flora Model and as such is largely XML based.
It has methods for converting information in a variety of
formats to and from XML, but these are strictly to enhance the
interoperability of the framework. It is important to note that in the
Flora Model, XML is simply one way to present information in a specific
structure. This does
not mean that the Flora Model represents data as XML strings
internally.
Implementation
SAF/E is designed as a framework extension for Zend Framework. It
follows a Complex Model MVC pattern on the server side. The editor
component is a Thin GUI driven largely by a JavaScript framework and
configuration driven activity from the server side application.
SAF is in turn designed to be highly extendable, primarily without the need to program in the language of implementation, or in the relational database. SAF focuses on configuration, data definition, and workflow to be highly adaptable.
Data definition in Flora is defined by simple inheritance, a restrictive perspective on the same kind of paradigm that drives Object Oriented programming. Flora inheritance does not define behavior, but rather information constraints. The most general form of information from which all data types inherit is an element that is any possible representation, about which almost nothing is known. From this type, more specific data types emerge. This forms a tree of possible representations terminating in each specific possible representation, being a specific set of data. For this reason Flora is said to be Type Oriented rather than Object Oriented, since the aspect being "extended" by inheritance is semantic certainty, not specialized behavior. If one type inherits from another, it is always a proper sub-type and special case.
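One way to picture inheritance of constraints rather than behavior is the sketch below, where each type adds a single predicate on top of everything its ancestors already require. The class `FloraType` and the example types `ANY`, `text`, and `year` are hypothetical names chosen for illustration and are not part of the Flora Model's own vocabulary.

```python
class FloraType:
    """A type is its parent's constraints plus one further constraint."""

    def __init__(self, name, constraint=None, parent=None):
        self.name = name
        self.parent = parent
        # The root type has no constraint: any representation is acceptable.
        self.constraint = constraint or (lambda value: True)

    def accepts(self, value):
        # A value must first satisfy every ancestor, so a sub-type can only
        # narrow what its parent allows, never widen it.
        if self.parent is not None and not self.parent.accepts(value):
            return False
        return bool(self.constraint(value))

    def descends_from(self, other):
        """True if `other` is this type or one of its ancestors."""
        current = self
        while current is not None:
            if current is other:
                return True
            current = current.parent
        return False


ANY = FloraType("any")  # almost nothing is known about the value
text = FloraType("text", lambda v: isinstance(v, str), ANY)
year = FloraType("year", lambda v: len(v) == 4 and v.isdigit(), text)
```

Because ancestors are checked first, anything accepted by `year` is guaranteed to be accepted by `text` and by `ANY`, which is what makes each type a special case of its parent.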
Planned Features
- Editor
- Login/Auth
  - Basic Single User Login for 1.0
  - Multi-user login for 1.5 with LDAP integration and author/admin levels.
  - Multi-user/Group login for 2.0 with LDAP, OpenAuth, Module Based auth and full Zend Framework Auth module support. Types inherit access.
- Logging/Audit
  - Basic audit table of all transactions, archive threshold for 1.0
  - Action logging option for 1.5
- Templating
  - Ultra Template Module for 1.0
  - Extended Ultra functionality for 1.5
- Create Problem Domains (sets of types)
- Create/Edit types
  - warn as you stray from allowed values
- Create/Edit maps and data
  - source view for 1.0
  - semi-WYSIWYG for 2.0
  - warning and strict mode for when author strays from allowed values (strict mode prevents erroneous input, but always allows saving (draft), even when invalid).
  - storing drafts for invalid data only in 1.0
  - storing drafts and basic approval process for 1.5
  - auto-draft 2.0
- Debugging client 2.0
  - Hooks for debugging client from QP module, available from 1.0
- Temporal data
  - 1.0
- Tagging and Notes
  - 1.0
