Research Article GenFlow: Generic flow for integration, management and analysis of molecular biology data

A large number of DNA sequencing projects all over the world have yielded a fantastic amount of data, whose analysis is, currently, a big challenge for computational biology. The limiting step in this task is the integration of large volumes of data stored in highly heterogeneous repositories of genomic and cDNA sequences, as well as gene expression results. Solving this problem requires automated analytical tools to optimize operations and efficiently generate knowledge. This paper presents an information flow model , called GenFlow, that can tackle this analytical task.


Introduction
Computational molecular biology uses computational power to analyze genetic, molecular and biochemical data from multiple biological species in a genome scale.Presently, scientists face the challenge of integrating biological data stored over the years in dynamic heterogeneous structures.Many genomic and molecular biology studies have been applying a number of computational tools, generating results with different input and/or output formats, without common standards.Today, integration and analysis of these large volumes of heterogeneous data are no longer amenable to direct human manipulation and has become one of the major problems in bioinformatics.Thus, biologists badly need efficient dynamic techniques for data integration.However, dynamic integration and analysis of heterogeneous data, using multiple applications (applying different heuristics on the data, customizing specific arguments, etc.), are not trivial problems.In fact, they remain important topics for research in computer science.
This paper reports on the development of an generic flow model, GenFlow, allowing efficient integration and analysis of heterogeneous biological data.We start characterizing the generic flow approach describing the use of GenFlow to tackle a common problem in molecular genet-ics, i.e., choosing molecular clones from a mouse cDNA library to manufacture microarrays for studies of differential gene expression during cell cycle progression.Next, we present a formal study of generic flow design.The final product, GenFlow, provides integrated tools enabling investigators to comparatively analyze results of multiple experiments, through friendly structures for management of heterogeneous data.

Characterizing the Problem: A Common Work Flow in Molecular Genetics
We illustrate the integration problem and the generic flow approach with a real example found in studies of mouse gene expression within a large cooperative project between departments of the University of São Paulo (CAGE-Cooperation for Analysis of Gene Expression).We had an UNIGENE set of 34,000 mouse cDNA clones (kindly supplied by Prof. Marcelo Bento Soares, University of Iowa; Bonaldo et al., 1996) whose sequences stored in the Genbank database were the only data available.To select clones of interest, we proceed according to the following steps: 1. Assembly of the sequences, performed by Phrap (Gordon, 2001) and CAP3 (Huang, 1996;Huang and Madan, 1999), aiming to confirm sequences uniqueness.
2. Identification and separation of polyA tailpossessing sequences.
3. Local alignment against a general database NR (NCBI, 1988), performed by BLAST (Altschul et al., 1997), aiming to assess sequence similarities and to exclude no match sequences.At the end of this step, we were reduced to 25,772 sequences.
4. Another round of the Blast against a RIKEN (RIKEN, 2001) database, searching for similar sequences in the Mouse Consortium Genome.At this point we were down to 17,338 sequences.
5. Choice of the best sequences of interest, manually done by the investigator.
Basically, the generic flow is represented by the scheme in Figure 1.
Note that each application has a specific role, therefore, Phred, CAP3 and BLAST must be executed according to the specified order.Thus, the generic flow is a single sequential chain, with well-defined starting and finishing points and without loops between steps (see Figure 1).

Formal Analysis and generic flow Modeling
GenFlow (Generic flow) stores the results of generic components for pipeline definition and control, seeing the applications as atomic objects.These objects encapsulate some properties while maintaining mutual relationships.These relationships, adding to the knowledge of the expert users, define a flow for information.The Genflow can be edited for supporting more than one application and generalized for multiple users.Particularly, we have presented an example applied to sequence analysis, but it could be used to analyze microarray data, metabolic and signaling pathways, protein 3D structures, and so forth.
Next, we will suggest a formal way of modeling a generic execution flow, built from the applications and based on the following assumptions: i. there is a finite and non-empty collection of applications for analysis, which we consider valid for the operating system.This is denoted as A; ii. there is a finite set of non-empty values for tasks, denoted as T.This is a set of discrete elements, where each element indicates the position of the application within the workflow; iii. there is a finite set of finite and non-empty values for the algorithms, denoted as H.The algorithms indicate the variation of the implementation strategies for the execution of the same task; iv.there is a finite set of finite and non-empty values for structural input and output formats, denoted as F; v. some constructs, such as tuple, used for complex data structures composition, are available.
All values of the sets defined above (i to v) are atomic.Values are formed using the construct tuple and they will be called structured values (Liu, 1992).The type T denotes a task of the application p, p ∈ A, in the experiment.

Application for analysis
Given a set P of parameters for an application, we can define an application for analysis as a tuple A = (v, h, t, p, in, out) where v ∈ N and indicates the version of the used application, h ∈ H, t ∈ T, p ∈ P, in ∈ F and out ∈ F.
Then, the set of all application objects are represented by N × H × T × P × F × F. We will adopt the convention object domain in recovering the values of a particular domain.

Property Functions
We can use the above definition for relationships among applications.We call these relationships validators, which map their arguments to Boolean values (true or false).

Precedence validator (<)
The < is a function of the form f: A × A → {true, false}. Let

Format equivalence validator (≈)
The ≈ is a function of the form f: F × F → {true, false}.
Let a and b be structured formats, whereby a, b ∈F.We say that a ≈ b if a has an equal semantic meaning to b, and, it is so possible structurally to translate a to b.
Figure 1 -Pipeline for Gene Classification.
Moreover, a and b must belong to the same domain.In addition, each structure component of an a format must have an equivalence to b format.A direct consequence of this approach is the usage of a building wrapper to translate files from a to b formats.

Chained validator (→)
The → is a function of the form f: A × A → {true, false}.
Let a and b be applications, whereby a, b ∈ A. We say that a → b (or Therefore, a and b are called chains if we can run them in series, using the output of application a as input (or one of them) in application b, abstracting structural format restrictions for input/output files.Hence, a generates an output file semantically similar to the b input file.

Application equivalence validator (≡)
The ≡ is a function f: A × A → {true, false}.
Given applications a and b.We assume that a, b ∈ A. We denote by a ≡ b (or a is equivalent to b) whether a and b are of the same type (a.t = b.t)and, in addition, whether their input and output files are semantically similar; this implies that a.in ≈ b.in and a.out ≈ b.out.

Relationships and communication
We can define some logical relations and their communication features using the definitions viewed previously.In this way, we can classify the applications according to their relations and, in addition, we are able to construct communication rules and restrictions based on their properties.

Chained applications
Let P i and P j be two applications, whereby P i , P j ∈ A. P i and P j are called chained if the relation P i → P j returns true.
The applications chain is a relation key for our integration system.This relationship defines whether they can be executed in sequential order, or not.Furthermore, this relation is strongly linked to the input/output structures of the used algorithms, because these patterns define, physically, the sets of applications which can be chained.

Modeling and implementation
Biological data processing usually generates a sequential flow of tasks, each one representing a single application.The full, or partial automation, occurs through an integration process, controlling communication among subsequent applications.Our approach extends ideas from Peleg et al. (2002) and Siepel et al. (2001), integrating tasks according to running rules specified and formalized earlier in this section.
The precedence and chaining logic, represented by respective functions and exposed by their definitions, allow us to create alternative flows for the experiment analysis.The formalization facilitates the design of generic integration components, allowing better definition of flow structures.In addition, formal components give us the freedom to build complex generic flow, analyzing them through diverse viewpoints.GenFlow allows partial flow without having to cover all sequential tasks.This is very convenient, since there are cases in which we do not have primitive data sources or our experiment tasks are based on intermediate results.
The main advantage of a GenFlow editing system is the automation of sequencing tasks.A file generated from one step can be used in next, even if it has some structural incompatibility, solved by a XML file translation (Achard et al., 2001) (see the internal structure of the system in Figure 2 and the interface for the users in Figures 3 and 4).
In addition, the real physical location of each application is visualized by users.Therefore, investigators do not worry about the installation or communication between ap-Integrating molecular biology data and applications 693 plications, because software interactions are controlled by GenFlow.
A generic GenFlow schema can be viewed in Figure 2. The applications P i , i = 1,2, ... n (where n is the number of applications installed and available in the operating system), are placed in sequential order, according to their task.This property is defined when P i is installed in the environment, according to its algorithm running parameters.It defines precedence relations among other applications.
Each P i application is associated to an execution step E k , k = 1,2, ..., m (where m is the number of tasks for a particular workflow).The chains are built in series and run using input/output files.
All applications follow the precedence order and present semantic compatibility of their input and output files; they are called chained.In Figure 2, we can see some kinds of chaining applications, such as P 1 and P 3 , P 2 and P 4 , P 3 and P 4 .
The system implementation is external, using text files and database connection drivers.This approach pro-duces a flexible architecture, under DBMS (Database Management System) and implementation language viewpoint.Between each two sequential steps, one or more text files are generated for intermediation of the communication interfaces of the applications; we usually map its information to a XML (W3C, 2004) file format.The choice of XML is based on the adoption of a renowned and standardized file format, optimizing the understanding of all embedded structural elements and promoting the improvements and extensions for interested groups.Furthermore, this preference avoids the creation of new, non-standard and eventually confusing file formats.The information in XML format can be stored in database through persistent services.

System Application and Conclusions
Our group is developing a web application of GenFlow.A centralized architecture was chosen for the web implementation, benefiting investigators that have facilities to manage applications and large volumes of data.The system can be accessed from different web clients, providing an additional level of convenience and freedom to the users.
In addition, all information is stored on a DBMS, guaranteeing information persistence and an alternative tool for data organization and retrieval, essential for works involving large amounts of data.
The visualization pages will allow multiple ways of displaying results, including construction of novel and customized forms to exhibit results.
Currently, integration research addresses the problem through scripts or structured compiled code, so each minimal change implies code modification.The main advantage of our approach is system scalability.We can reduce the dependence levels through software components, providing flexibility and dynamism for the system.
The dynamism is particularly interesting to biologists, because they will be able to start the execution of many workflows at the same time, without having to translate formats between the applications.Furthermore, the researcher gains evaluation power for algorithms, results and experiments, thus improving their quality.
In the Figures 3 and 4, we see the operation of the system.This process is divided into four steps: 1. Editing -the user builds the sequence of tasks (GenFlow) that need to be executed (see Figure 3); 2. Parameter configurations -the user sets up the parameters for each selected task, if necessary; 3. Data recovery -the user chooses the data that will feed GenFlow from a central DBMS; 4. Execution -the execution of the tasks starts (see Figure 4).
The addition of new applications into the system does not require strategic changes, because the integration level is separated from the applications.We need just define the   function mapping to link the XML standard format of the system to the input and output formats of the new entry.
We successfully tested the performance of the GenFlow system analyzing a mouse cDNA library, presented in section 2. The investigators were able to integrate biological knowledge and automated execution of applications, without manual manipulations of the intermediate files.
Each step of this testing operation can be compared to a single filter.The filter determines the better information and aggregates quality gauges on the data.The combination of all filters comprising the whole system qualifies GenFlow as an generic flow system.
a and b be applications, whereby a, b ∈ A and a ≠ b.We say that a < b (or a precedes b) if a and b have distinct types and the type of a (a.t) indicates a task prior to type of b (b.t);

Figure 3 -
Figure 3 -Definition of the GenFlow: editing the task flow for an experiment.