5.4 Software Design

1. Learning From the Past

Speed, Speed, Speed! That was at the core of my design philosophy for this system. One of the main reasons that each of my two previous attempts to build this system failed was lack of speed. I started off trying to use a standard SQL database to store and analyze the data. This was horrendously slow. For the next attempt I tried an in-memory database. I was using SQL database structures so that I could easily and flexibly query the data to get an idea of what was going on. For example: what quantity of a transcription factor with a degrade rate between 0.1 and 0.4 was created in cells (0, 0, 0) to (5, 5, 1) at time slice 50? By using standard SQL and having the data in a database I could easily answer questions like this. However, these databases also had a massive amount of overhead. SQL databases are by their very nature general purpose; they knew nothing about my problem domain. Consequently, even the completely in-memory SQL system was still too slow. I became convinced that I needed to use all the knowledge I had about the problem domain to build a custom application and no longer rely on general SQL. So in this third attempt I set out to do just that.

The other major reason why my previous attempts failed is that at the time I simply did not have a good enough grasp of how cellular processes like gene transcription, receptor/ligand interaction, and regulation occurred. I was trying to model a system that I did not yet adequately understand. I know, it sounds pretty stupid looking back on it. But I will admit that I do occasionally do stupid things. lol. It was only after I had finished several important biology classes like molecular cell biology, developmental neurobiology, and fundamentals of neuroscience that I again attempted to build this system.

2. Loading and File Types

All of the configuration files for this system are saved in XML format. Even the binary version of a chromosome is saved in XML as a hex string. While XML is bulky compared to a binary format, it has the advantage that it is human readable and can be easily understood and modified in a standard text editor. The several different files that are required are listed in table 1.

File Types
Simulation Configuration File: Defines the basics for a simulation run: the number of cells, injections, fitness tests, predefined searches, and so on. When a genetic algorithm is being evaluated this file is loaded once at the beginning, and each chromosome is evaluated using these settings.
Chromosome Configuration File: Chromosomes can be created by hand as XML files or stored as binary hex. During the normal processing of a genetic algorithm the binary hex values of the chromosomes are stored in the database. This data is what defines the genes and proteins that are processed.
Chromosome Definition File: Tells the simulation how to load a chromosome. It specifies things like how many bits a degrade rate of a transcription factor uses, and to what valid real-world values those bit patterns map. It also gives the version of the chromosome. This is discussed more below.
Gene Modules File: Logically separates sets of related genes into modules that have been found to perform specific functions, like the achaete-scute genes. These can then be included in new genetic algorithms as a group.
Table 1. The basic file types for this system, with a description of what each is for.

3. Defining the Chromosome

When you load a gene or protein from a string of bits, how do you know how many bits to use for each parameter? What values does the binary value map to? Does 1021 map to 0.3 or to 3000? It is possible to hard code this kind of information into the protein and gene objects themselves so that they know how to translate these values. But if you ever want to change one of the values, say to increase the number of bits that define a degrade rate, then it is a real pain in the butt. This is what the chromosome definition file is for. Each protein and gene class has a corresponding definition class. It is this class that knows how to load and save that object, not the object itself. When the simulation configuration is loaded it points to a chromosome definition that is then loaded. Then each time a chromosome is loaded it is the definition that does the loading. The definition knows exactly how many bits each parameter of every class uses and how to translate that binary value into a real-world value.

However, there is also another, more subtle use for this chromosome definition file. What would happen if this type of information was stored directly in the objects themselves, some important genetic algorithms were run where a lot of binary data was stored in a database, and later an additional parameter was added to one of the proteins? All of that previous data would now essentially be garbage. It would not be possible to load it back in, because it was generated based on the old configuration, and when the binary data was reloaded the new parameter would be read even though it was never saved. This would throw everything off, and a completely different chromosome would be loaded than the one that was originally saved. Chromosome definition files avoid this problem because each run of a genetic algorithm is associated with a definition that is stored in the database along with the binary data. If a new parameter is later added, then when the old definition is loaded that parameter is not found, and the definition object knows that it should not attempt to read that parameter value from the binary data. So the old binary chromosome is loaded correctly even though a new parameter has been added. The same holds if, say, the mapping range of an old parameter is changed. The old definition is still set up with the values used when the chromosome data was created, so the data loads correctly. The chromosome definition therefore gives us a way to extend the system while still retaining old data.
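To make this concrete, below is a minimal sketch of how a definition class might decode one parameter from the binary chromosome data. The class and member names (CProteinDef, m_iDegradeRateBits, and so on) are hypothetical illustrations, not the actual source code; the point is that the bit count and the mapping range live in the definition object, not in the protein itself.

    // Minimal sketch: a definition class decodes one parameter from the
    // chromosome bit string. All names here are hypothetical illustrations.
    #include <vector>
    #include <cstdint>

    class CProteinDef {
    public:
        int   m_iDegradeRateBits;   // how many bits encode the degrade rate
        float m_fltDegradeRateMin;  // real-world value for bit pattern 0
        float m_fltDegradeRateMax;  // real-world value for the max pattern

        // Read iBitCount bits starting at iPos and advance the cursor.
        static uint32_t ReadBits(const std::vector<bool>& aryBits,
                                 int& iPos, int iBitCount) {
            uint32_t iVal = 0;
            for (int i = 0; i < iBitCount; i++)
                iVal = (iVal << 1) | (aryBits[iPos++] ? 1 : 0);
            return iVal;
        }

        // Map the raw binary value linearly onto the valid real-world range.
        float DecodeDegradeRate(const std::vector<bool>& aryBits, int& iPos) const {
            uint32_t iRaw = ReadBits(aryBits, iPos, m_iDegradeRateBits);
            uint32_t iMax = (1u << m_iDegradeRateBits) - 1;
            return m_fltDegradeRateMin + (m_fltDegradeRateMax - m_fltDegradeRateMin)
                                       * (float(iRaw) / float(iMax));
        }
    };

Because an old definition file still carries the old bit counts and ranges, binary chromosomes saved under it keep decoding correctly even after the code gains new parameters.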

4. Arrays and Constant Time Lookups

The primary strategy employed to increase processing speed is to do as much as possible through preprocessing. The idea is to put as much as possible into lookup arrays so that the main processing loop has to do a minimal amount of work. This is critical because the main processing loop is going to be called over and over again for millions of iterations. Every nanosecond of wasted time in the main loop is amplified into huge differences in the overall time required to perform the simulation. During the preprocessing stage all of the links and lists that will be repeatedly used during the main processing are built.

A good example of this is the binding lists shown in figure 1. When the chromosome is first loaded we generate a list of genes and another of proteins. We also have another list that separates the proteins based on their type, so we have a list of all of the membrane proteins and so on. But during the main processing loop one of the primary things we need to do is match up a specific protein with all of the other proteins or genes that bind to it. Take membrane receptors and ligands for example. We will be looping through the membrane receptors, and for each one we will need to loop through all of the membrane ligands that bind to that receptor. We do not want to have to do a search to find these proteins, because it will always be the same list and we know it beforehand. This is where binding lists come into play.

A binding list is a variable three-dimensional array. The first dimension is indexed by the protein type. The second dimension covers all of the possible binding ID values. So if the binding ID for membrane receptors is 10 bits, then there are 1024 possible values and that is the size of the second dimension. If there are no proteins with a given binding ID value, then a null is stored at that spot in the array. But if there are one or more proteins with that binding ID, then a list is stored that keeps a pointer to each of the associated proteins. This list is built during the preprocessing stage and then used during the processing loop. It is true that the array of binding IDs wastes memory. However, it is a small amount of memory relative to the total available, and it greatly increases the speed of the algorithm to be able to access the lists directly. So even though the memory is "wasted", it is still very beneficial. If we want to loop through all of the ligands that bind to a specific receptor, we go straight to that list.

[Figure 1 image: binding lists diagram]
Figure 1. Binding lists are variable three-dimensional arrays. In the first dimension there is a slot for each protein type. In the second dimension there is a slot for every possible binding ID. If there are any proteins of the correct type with that binding ID, then the third dimension is a list of those matching proteins. So say you wanted to look through all transcription factors with a binding ID of 2: there would be 8 such proteins. But if you wanted a list of membrane ligands with a binding ID of 2, there would be no list, which means there are no proteins of that type.
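Below is a rough sketch of this structure in C++. Everything here (CBindingLists, Add, Get) is an illustrative assumption rather than the actual source; it just shows the null-or-list layout described above.

    // Sketch of a binding list: protein type -> binding ID -> list of
    // matching proteins, or NULL. All names here are hypothetical.
    #include <vector>
    #include <cstddef>

    struct CProtein {
        int m_iProteinType;   // e.g. transcription factor, membrane ligand
        int m_iBindingID;     // decoded from the chromosome bit string
    };

    typedef std::vector<CProtein*> ProteinList;

    class CBindingLists {
    public:
        // [protein type][binding ID] -> NULL when there are no matches.
        std::vector<std::vector<ProteinList*> > m_aryLists;

        // e.g. iBindingValues = 1024 for a 10 bit binding ID.
        CBindingLists(int iTypeCount, int iBindingValues)
            : m_aryLists(iTypeCount,
                         std::vector<ProteinList*>(iBindingValues, (ProteinList*) NULL)) {}

        // Built once during the preprocessing stage.
        void Add(CProtein* lpProtein) {
            ProteinList*& lpList =
                m_aryLists[lpProtein->m_iProteinType][lpProtein->m_iBindingID];
            if (!lpList)
                lpList = new ProteinList();   // empty slots stay NULL
            lpList->push_back(lpProtein);
        }

        // Used in the main loop: a constant-time jump straight to the matches.
        ProteinList* Get(int iProteinType, int iBindingID) const {
            return m_aryLists[iProteinType][iBindingID];
        }
    };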

5. Protein and Gene Quantities

All of these lists of genes and proteins are stored in the chromosome and are used for the organism as a whole. They are not duplicated for each cell in the body. In other words, there is not a list of protein objects for each cell. There is only one list of protein and gene objects, and that is in the chromosome. At this point you might be thinking "Then how can you have different quantities of proteins in different cells? Or how can you turn a gene off in one cell and not in another?" Something we need to remember is that the properties and functionality of proteins and genes, and the linkages between them, do not change from one cell to the next. The only things that change are things like the quantity present in the cell, the amount of gene expressed, the activity state of the genes, etc.

So instead of each cell keeping a bunch of protein and gene objects and wasting memory, a massive pool of memory is allocated at chromosome initialization for each of these different cell states and then divided amongst the cells. For the quantity of proteins, a pool is allocated with a size of (Cell Count * Protein Count * Size of float). Each cell then gets a piece of this pie the size of the protein count, and the quantity of each protein in each cell is stored as a float value in this array. But how do you associate a given protein with an element in the array? When the lists of genes and proteins are created they are sorted by things like protein type, degrade rate, etc. Each protein has a variable called ProteinID that is assigned based on its index in the list of proteins. Later, during the processing loop, if we are in a cell and want to know the quantity of a protein in that cell, it is retrieved by accessing the array like m_aryProteinQtys[lpProtein->m_iProteinID]. And again, this is a constant-time array lookup, so there is no searching during the processing loop.
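Here is a rough sketch of the pooling scheme. Only m_aryProteinQtys and m_iProteinID appear in the text above; every other name is an assumption.

    // Sketch of per-cell state pooling: one flat pool holds the protein
    // quantities for every cell, and each cell views its own slice of it.
    #include <vector>
    #include <cstddef>

    struct CCell {
        float* m_aryProteinQtys;   // this cell's slice of the shared pool
    };

    class CChromosome {
    public:
        std::vector<float> m_aryQtyPool;   // Cell Count * Protein Count floats

        void AllocateCellState(std::vector<CCell>& aryCells, int iProteinCount) {
            m_aryQtyPool.assign(aryCells.size() * iProteinCount, 0.0f);
            for (size_t i = 0; i < aryCells.size(); i++)
                aryCells[i].m_aryProteinQtys = &m_aryQtyPool[i * iProteinCount];
        }
    };

    // In the main loop a quantity lookup is just a constant-time index:
    //   float fltQty = lpCell->m_aryProteinQtys[lpProtein->m_iProteinID];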

6. Genome Compression

The more proteins and genes there are in the system, the slower it will run. For example, if we have 50 proteins, have to loop through that list 4 times per cell, and have 5,000 cells, then that is 50 * 4 * 5,000 = 1,000,000 steps for each time slice. Whereas if we only have 10 proteins, this number falls to 200,000 steps per time slice. So it is very beneficial to keep the number of proteins and genes to a minimum. However, when a chromosome is generated, some of the proteins and genes that are created randomly may have no effect on the system. For instance, if you have a membrane receptor but there is no corresponding membrane ligand with the same binding ID, then that receptor is completely useless. It will never be able to bind to anything and will never be able to produce its expressed protein. In this case there is no reason to keep that protein, and eliminating it would also eliminate its expressed protein. In fact, eliminating the expressed protein may make another protein unusable and require it to be eliminated too, and so on. So this elimination is an iterative process in which the genome is essentially compressed down to its most compact form. Once the compression is completed, only those proteins and genes that will definitely be used in the simulation remain.

You might be asking "Why not just get rid of any redundant proteins directly in the chromosome instead of doing this each time a chromosome is preprocessed?" There are two problems with this. The first is that even if this is done for the initial generation of chromosomes, future mutations have the potential to change these linkages, causing proteins to no longer link up or to link up with different items. So this service will essentially always have to be performed for the chromosome. The second reason is that we do not want to destroy the variability within the chromosome. If we eliminate anything that is currently unusable, then evolution has nothing to play with. At some point later on a fortuitous mutation may occur to that protein that turns out to be the key to solving the problem. If we had eliminated it at the start, then this could never happen.
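The elimination is a fixed-point iteration: keep making pruning passes until a pass removes nothing. A minimal sketch follows, where IsUsable stands in for the real linkage checks and every name is hypothetical:

    // Sketch of iterative genome compression. A removal can orphan other
    // proteins, so passes repeat until nothing else can be removed.
    #include <vector>
    #include <cstddef>

    struct CProtein {
        bool m_bHasPartner;   // stand-in for the real linkage state
    };

    // Stand-in for checks like "does any ligand share this receptor's
    // binding ID?"
    bool IsUsable(const CProtein* lpProtein) { return lpProtein->m_bHasPartner; }

    // Rebuild the figure 1 binding lists over the survivors (stubbed here).
    void RebuildBindingLists(std::vector<CProtein*>&) { /* see figure 1 */ }

    void CompressGenome(std::vector<CProtein*>& aryProteins) {
        bool bRemoved = true;
        while (bRemoved) {                    // iterate to a fixed point
            bRemoved = false;
            for (size_t i = aryProteins.size(); i-- > 0; ) {
                if (!IsUsable(aryProteins[i])) {
                    aryProteins.erase(aryProteins.begin() + i);
                    bRemoved = true;
                }
            }
            if (bRemoved)
                RebuildBindingLists(aryProteins);  // reconnect the survivors
        }
    }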

7. What's Going On Inside?

To be able to analyze what is happening, and to be able to produce any kind of fitness value for the chromosome, it is necessary to find out what is happening inside the simulation. What is the quantity of protein A at time X? How much of gene Y is being expressed at time Z? Questions like these have to be answerable. As mentioned above, this is one of the main reasons I originally wanted to use SQL. By putting all of the data into a database, even an in-memory one, it would be trivial to write queries that get back data like this. However, it was just too slow. So instead I came up with a custom solution.

Each main class, like a protein type or gene, has an associated search class that knows everything needed to search for that object. So it is possible to write a query for any of the objects that is as vague or as specific as the user wants. A few items are always required, like the type of search (protein or gene) and the type of protein. Other than those two things, though, you can specify or leave out any defined parameter for that object. For example, it is possible to create a query to find all transcription factors with a binding ID of 212, a degrade rate between 0.2 and 0.4, and a graph type of Linear. Or you could find diffusible ligands with a diffusion rate between 0.01 and 0.2. When this was attempted in SQL, all of the searches happened during the main processing loop and had to be executed over and over again. That is not how this custom system works. During initialization of the chromosome, once the genome is stabilized, all of the searches look for matching items at that point and generate a list of pointers to those items. Then during the main processing loop they simply walk through this list and do whatever is required. Again, it is all just a matter of directly accessing arrays during the critical processing period.
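Here is a sketch of what such a search class might look like, with every criterion optional; the names are illustrative assumptions, not the actual classes:

    // Sketch of a protein search: criteria are optional, matching runs once
    // after genome compression, and the main loop only walks m_aryMatches.
    #include <vector>
    #include <cstddef>

    struct CProtein {
        int   m_iProteinType;
        int   m_iBindingID;
        float m_fltDegradeRate;
    };

    class CProteinSearch {
    public:
        int   m_iProteinType;                     // always required
        int   m_iBindingID;                       // -1 means "not specified"
        float m_fltDegradeMin, m_fltDegradeMax;   // ignored if min > max
        std::vector<CProtein*> m_aryMatches;      // filled at initialization

        bool Matches(const CProtein* lp) const {
            if (lp->m_iProteinType != m_iProteinType) return false;
            if (m_iBindingID >= 0 && lp->m_iBindingID != m_iBindingID) return false;
            if (m_fltDegradeMin <= m_fltDegradeMax &&
                (lp->m_fltDegradeRate < m_fltDegradeMin ||
                 lp->m_fltDegradeRate > m_fltDegradeMax)) return false;
            return true;
        }

        // Run once after compression; never during the main loop.
        void Initialize(std::vector<CProtein*>& aryProteins) {
            m_aryMatches.clear();
            for (size_t i = 0; i < aryProteins.size(); i++)
                if (Matches(aryProteins[i]))
                    m_aryMatches.push_back(aryProteins[i]);
        }
    };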

8. Injections and Evaluations

It is possible to inject quantities of proteins directly into a range of cells. The injection can either define the protein to inject or specify a search to find the proteins that will be injected. The user can also define the time at which each injection occurs. This is a very important feature to have during testing, and it is often used in the examples seen throughout this site. It is also used to simulate oogenesis. Currently it is only possible to inject proteins into cells. However, once it is possible to differentiate neurons, it will also be possible to inject currents directly into cells.
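As a rough illustration, an injection record might carry something like the following; these field names are assumptions based on the description above, not the actual source:

    // Sketch of an injection: what to inject, where, when, and how much.
    struct CProtein;        // defined elsewhere
    class  CProteinSearch;  // custom search, as described in section 7

    struct CInjection {
        int   m_iTimeSlice;        // when the injection fires
        int   m_aryCellStart[3];   // corner of the target cell range
        int   m_aryCellEnd[3];     // e.g. (0, 0, 0) to (5, 5, 1)
        float m_fltQuantity;       // amount added to each cell in the range
        CProtein*       m_lpProtein;  // explicit protein to inject, or NULL
        CProteinSearch* m_lpSearch;   // or a search that finds the proteins
    };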

No genetic algorithm can work if you cannot assign a fitness to each chromosome. This system allows the user to define a number of fitness tests that are run at predefined times throughout the processing of the simulation. Some of the most common tests currently defined do things like compare the quantities of a protein over a range of cells to defined values for that range. Another does the same but looks instead for a change in the values. By chaining these tests together, it is possible to come up with a fitness that looks for a protein quantity to rise over a certain number of milliseconds and to fall over a succeeding range of time. All fitness values are scaled so that the higher the value, the worse the fitness, and the fitness value from each test is accumulated into the total. This makes it possible for one test to be more important than the others. While building the genetic algorithms for the examples listed here, and some that did not make it onto this site, I found that building a set of fitness tests that drives the evolutionary process is very difficult. If you leave any path uncovered, evolution will exploit it and do something you did not expect.
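Here is a sketch of that accumulation scheme, where lower is better and each test adds its error to a running total; all names here are hypothetical illustrations:

    // Sketch of chained fitness tests: each test reports an error at its
    // scheduled time slice and the errors accumulate (lower total = fitter).
    #include <vector>
    #include <cmath>
    #include <cstddef>

    class CFitnessTest {
    public:
        virtual ~CFitnessTest() {}
        // Returns the error for this time slice; 0 means a perfect match.
        virtual float Evaluate(int iTimeSlice) = 0;
    };

    // Example: compare one protein quantity to a target value.
    class CQuantityTest : public CFitnessTest {
    public:
        const float* m_lpQty;   // points into a cell's quantity array
        float m_fltTarget;      // desired quantity at the test time
        float m_fltWeight;      // scales this test's importance

        float Evaluate(int /*iTimeSlice*/) {
            return m_fltWeight * std::fabs(*m_lpQty - m_fltTarget);
        }
    };

    float AccumulateFitness(std::vector<CFitnessTest*>& aryTests, int iTimeSlice) {
        float fltTotal = 0.0f;
        for (size_t i = 0; i < aryTests.size(); i++)
            fltTotal += aryTests[i]->Evaluate(iTimeSlice);  // higher = worse
        return fltTotal;
    }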

9. Algorithm Overview

  1. The simulation configuration file is loaded. This defines the simulation and fitness tests. It also loads the chromosome definition so that chromosomes can be loaded.
  2. A new chromosome is loaded. All of the gene and protein objects are loaded into separate arrays, and the objects are linked to each other.
  3. A pass is made through the proteins and genes to find and eliminate similar proteins. For example, say two transcription factors exist that are exactly the same except that one has a degrade rate of 0.1 and the other has a rate of 0.11. Is it really necessary to have two separate proteins for this? The more we can reduce the number of proteins and genes, the faster this system can run. So in this case one of the proteins is eliminated and its links are redirected to the survivor. Once we have lists of unique proteins and genes, those lists are sorted based on things like protein type, degrade rate, and so on.
  4. The next step is to map out all of the interconnections between the various objects into binding lists so that we never have to do searches during the main processing loop. It starts by first getting lists for each of the different types of proteins, so there is a list of transcription factors, a list of membrane receptors, and so on. Then for every possible binding ID value we also have a list of the proteins that match that ID. See figure 1 for more details.
  5. After the interconnections have been made, genome compression begins. Any proteins that do not have matching counterparts are removed, any enhancers without matching transcription factors are removed, and so on. This is all done to eliminate unusable items so that we do not have to deal with them repeatedly during the main processing loop. The procedure runs iteratively until nothing else can be found to remove, and at each iteration the previous step is recalculated so that new connectivity lists are generated. When this is finished we are left with a core of proteins and genes that are all used in the upcoming simulation.
  6. Now that the final list of proteins and genes is known, it is time to create the state variable arrays for the cells. Massive pools of memory are allocated for each state, like protein qty and gene expression qty, for all cells, and then each cell takes its part of the pool.
  7. In this stage all of the predefined searches are initialized. As mentioned above, there are custom queries for proteins and genes that allow the user to be very specific or quite general. But these searches are never run during the main processing loop. Once the genome has been compressed and is stable, each search goes through and finds all matching genes or proteins and stores a list of pointers to the matching objects. Later, when a search needs to report the qty of proteins matching its criteria, it is just a matter of looping through the list of matching proteins and getting the quantity of each.
  8. Finally the preprocessing stages are finished and the main processing loop begins. This loop goes through each cell of the organism and processes it. The loop contains several steps, which currently are: Regulate Proteins, Diffuse Out Ligands, Bind Cell Receptors, Bind Membrane Receptors, Switch Genes, Transcribe Genes, and Degrade Proteins. It does this repeatedly for every cell, for every time slice, until the duration defined in the simulation configuration is reached. At user-designated time slices, injections of proteins occur and fitness tests are performed. The total fitness of the chromosome is the accumulated fitness at the end of processing. The simulation is then finished, and a new chromosome can be evaluated starting back at step 2. A minimal sketch of this loop follows the list.
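For orientation, here is a skeleton of that loop. The seven per-cell step names are taken from step 8 above; every other name and signature is an assumption:

    // Skeleton of the main processing loop. The step names come from the
    // text; all other names here are hypothetical and stubbed out.
    #include <vector>
    #include <cstddef>

    struct CCell { /* per-cell state slices, see section 5 */ };

    void RegulateProteins(CCell&) {}
    void DiffuseOutLigands(CCell&) {}
    void BindCellReceptors(CCell&) {}
    void BindMembraneReceptors(CCell&) {}
    void SwitchGenes(CCell&) {}
    void TranscribeGenes(CCell&) {}
    void DegradeProteins(CCell&) {}
    void ApplyInjections(int) {}          // fires only at designated slices
    void AccumulateFitnessTests(int) {}   // runs the predefined fitness tests

    void RunSimulation(std::vector<CCell>& aryCells, int iDuration) {
        for (int iSlice = 0; iSlice < iDuration; iSlice++) {
            for (size_t i = 0; i < aryCells.size(); i++) {
                CCell& cell = aryCells[i];
                RegulateProteins(cell);
                DiffuseOutLigands(cell);
                BindCellReceptors(cell);
                BindMembraneReceptors(cell);
                SwitchGenes(cell);
                TranscribeGenes(cell);
                DegradeProteins(cell);
            }
            ApplyInjections(iSlice);
            AccumulateFitnessTests(iSlice);
        }
        // The chromosome's total fitness is whatever the tests accumulated.
    }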

10. Design Overview

A considerable amount of work has been put into making this system lightning fast and very flexible. I wanted to make something that would be a solid core for each of the future steps. One of the primary ways of gaining speed used here is to do everything possible in the preprocessing stage so that it does not have to be done in the costly main processing loop. Generating all of the lists and searches up front and keeping them in arrays costs more memory, but that is a negligible price to pay for the massive increase in speed it achieves. In fact, the bulk of the coding for this system was related to the preprocessing of the chromosomes; it was far more complicated than the relatively simple main processing loop. But hopefully all of that work will pay off in the ability to simulate very large numbers of cells.

