Project Leaders:
Gordon K. Springer, CECS; J. Forrester, DNA Core Facility;
T. Patrick, Medical Informatics Group
Genetic databases (DNA and protein) continue to grow at an ever-increasing rate. In addition, the number of databases, some of which contain information not available from other sources, is also increasing. Because of the volume of data, researchers often cannot peruse one or more genome databases easily or efficiently when following a research lead. The problem is two-fold. First, the location and content of the various genome databases, and the means of accessing them, may be difficult for a researcher to keep track of, especially when the researcher is immersed in a discipline-specific line of thought. Second, maintaining a local copy of all databases requires disk storage space that many sites find increasingly hard to afford. Retrieving and reformatting files is also tedious and time-consuming for researchers.
In deciding which sequences to include in an analysis, the researcher needs ready access to the actual sequence data, and the data must be in a format that a particular analysis program can accept as input. Thus, widely distributed access to standard genome databases as well as specialized databases (e.g., gene mapping databases, specialized disease databases such as AIDS, etc.) must be available in nearly real time. These sequences need to be "imported", formatted and input into an analysis program without the researcher having to manipulate the data directly. That is, an analysis program may be directed to retrieve, reformat and use a given sequence from a remote genome database as if the data were contained in a local repository. A method that serves the data from a few sites, makes it appear to be local, and has built-in format conversion would eliminate the need to maintain redundant copies of the databases at numerous sites. This method gives rise to the concept of a "virtual local database", in which the user does not know the physical location of the data but can view, process or use the data as if it were physically housed locally. We propose to develop such a capability so that the enormous wealth of genome data is readily available to the researcher regardless of its physical location. Accomplishing this goal requires access to a high-bandwidth, low-latency network.
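The "virtual local database" idea can be illustrated with a minimal sketch. Everything in it is an assumption made for illustration: the site names, the sequence identifiers, the `VirtualLocalDatabase` class and its `fetch` method are hypothetical, and the in-memory dictionaries merely stand in for remote repositories that a real system would reach over the network. The point is only that the caller asks for a sequence by identifier and output format, and never names the site that holds it.

```python
from typing import Callable, Dict

# Hypothetical stand-ins for remote repositories. A real system would
# retrieve these records over the network rather than from memory.
REMOTE_SITES: Dict[str, Dict[str, str]] = {
    "site_a": {"SEQ001": "ACGTACGTAC"},
    "site_b": {"SEQ002": "MKTAYIAKQR"},
}


def to_fasta(seq_id: str, residues: str, width: int = 60) -> str:
    """Format a raw sequence as a FASTA record with wrapped lines."""
    lines = [residues[i:i + width] for i in range(0, len(residues), width)]
    return ">" + seq_id + "\n" + "\n".join(lines) + "\n"


def to_raw(seq_id: str, residues: str) -> str:
    """Return the bare residues with no header."""
    return residues


# Built-in format conversion: the analysis program names the format it needs.
FORMATTERS: Dict[str, Callable[[str, str], str]] = {
    "fasta": to_fasta,
    "raw": to_raw,
}


class VirtualLocalDatabase:
    """Present several repositories as one apparently local database."""

    def __init__(self, sites: Dict[str, Dict[str, str]]) -> None:
        self._sites = sites

    def fetch(self, seq_id: str, fmt: str = "fasta") -> str:
        """Locate seq_id at whichever site holds it and convert it to fmt."""
        if fmt not in FORMATTERS:
            raise ValueError("unsupported format: " + fmt)
        for records in self._sites.values():
            if seq_id in records:
                return FORMATTERS[fmt](seq_id, records[seq_id])
        raise KeyError(seq_id)


if __name__ == "__main__":
    db = VirtualLocalDatabase(REMOTE_SITES)
    # The caller never says where SEQ002 is stored.
    print(db.fetch("SEQ002", "fasta"), end="")
```

In this sketch the lookup simply scans every site; a production design would more plausibly consult a directory or index, which is outside what the text above specifies.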
In prior HPCC work, a prototype system was developed to integrate a wide collection of biomedical analysis and research tools in a seamless fashion. The prototype incorporated analysis tools that ran at the PSC, SDSC, and NIH as well as locally. It automatically combines information sources based on the data paths between them. A data path exists between two information sources when it is possible to extract data from one source and then use that data, under a possibly null transformation, to drive retrieval of information from the other. We call this approach to combining information sources "data path integration". All of the information sources are located via location-independent identifiers whose instances can be found on the network dynamically.
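The definition of a data path can be made concrete with a small sketch. It is not the prototype's actual design: the `Source` type, the source names, the data kinds ("accession", "gene_name", "locus", "keyword") and the single transformation are all hypothetical. The sketch assumes each source declares the kind of data that drives its retrieval and the kind of data that can be extracted from it; a data path from one source to another then exists when those kinds match (the null transformation) or when a registered transformation bridges them.

```python
from dataclasses import dataclass
from typing import Callable, Dict, List, Optional, Tuple

Transform = Callable[[str], str]


@dataclass(frozen=True)
class Source:
    """An information source (hypothetical model for illustration)."""
    name: str
    accepts: str  # kind of data that drives retrieval from this source
    yields: str   # kind of data that can be extracted from this source


def identity(value: str) -> str:
    """The null transformation: use the extracted data unchanged."""
    return value


# Hypothetical registry: (extracted kind, required kind) -> transformation.
TRANSFORMS: Dict[Tuple[str, str], Transform] = {
    ("locus", "keyword"): lambda locus: "locus " + locus,
}


def data_path(a: Source, b: Source) -> Optional[Transform]:
    """Return the transformation linking a to b, or None if no path exists."""
    if a.yields == b.accepts:
        return identity
    return TRANSFORMS.get((a.yields, b.accepts))


def find_paths(sources: List[Source]) -> List[Tuple[str, str]]:
    """List every ordered pair of distinct sources joined by a data path."""
    return [
        (a.name, b.name)
        for a in sources
        for b in sources
        if a is not b and data_path(a, b) is not None
    ]


SOURCES = [
    Source("sequence_db", accepts="accession", yields="gene_name"),
    Source("mapping_db", accepts="gene_name", yields="locus"),
    Source("literature_db", accepts="keyword", yields="citation"),
]

if __name__ == "__main__":
    for origin, target in find_paths(SOURCES):
        print(origin, "->", target)
```

Here `sequence_db` feeds `mapping_db` under the null transformation, while `mapping_db` feeds `literature_db` only through the registered transformation. Combining sources automatically then amounts to following such pairs.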
By extending our work with data path integration and the evolving efforts to provide location-independent access to data, the concept of the "virtual local database" can become a reality and provide researchers with enormously powerful tools with which to pursue their scientific investigations. Building on our collaboration with the biomedical group at PSC, we can begin to provide the environment and the techniques needed for researchers to locate and access, from their own desktops, both the data repositories and the extremely powerful computational tools that analyze the data.
With the growth of the Internet in the last few years, this work has been severely limited by the ever-increasing delays encountered on the network. The network traversal time to make simple analysis requests at PSC has increased by several orders of magnitude. This has severely hindered the natural interactive behavior of the system and has lengthened the time needed to transfer analysis results (amounting to many megabytes) back to the waiting researcher. With a vBNS connection, this limitation can be eliminated.