[Mono-list] 64bit gmcs/mcs in SLES/openSuSE rpms?

Jonathan Pryor jonpryor at vt.edu
Tue Apr 28 09:54:29 EDT 2009


On Tue, 2009-04-28 at 02:08 -0700, David Henderson wrote:
> 2) Is there a way to store char/string data as something smaller than
> UTF-16?  The data are SNP genotypes, i.e. a single SNP genotype looks
> like A T and there are almost a million of these per individual.  I'm
> thinking that what I need to do is record the genotype as bits, i.e. 0
> or 1, and relate that back to a translation class that returns A or T
> when that SNP is queried.  It would be simpler if I could store
> char/string data as something reasonably small.

Aren't there 4, not 2 (ACGT)?  In which case you'd need 2 bits, not 1.

Recording as bit pairs would certainly be a good idea, and BitArray
could help you do so efficiently (as Alan suggested), but I wouldn't
suggest BitArray itself, as it can't be resized once created.  I'll
come back to this in (4) below.

> 3) What I'm currently doing is:
>   a) read in each line as a single string which is split based upon
> whitespace
>   b) input each SNP into a class which is stored in an ArrayList, or
> as a string array in a List<string> (I've implemented it both ways)
>   c) once the whole file is read in, output each collection of SNPs by
> chromosome to a different file for processing by other software

I'm no biology expert, but do you really need to load up the entire file
before you can print out each chromosome?  If you could print out each
chromosome as it's encountered, this would reduce your memory footprint.
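A sketch of that streaming approach (the assumption that the chromosome
id is the first whitespace-separated field on each line is mine, as are
all the names; adjust to your actual format):

```csharp
using System;
using System.Collections.Generic;
using System.IO;

class StreamByChromosome {
        static readonly char[] Whitespace = { ' ', '\t' };

        // Append each input line to <outputDir>/<chromosome>.txt as it is
        // read, so the whole file never has to sit in memory at once.
        public static void Split (string inputPath, string outputDir)
        {
                var writers = new Dictionary<string, StreamWriter> ();
                try {
                        using (var reader = new StreamReader (inputPath)) {
                                string line;
                                while ((line = reader.ReadLine ()) != null) {
                                        int sep = line.IndexOfAny (Whitespace);
                                        if (sep <= 0)
                                                continue;       // skip malformed lines
                                        string chromosome = line.Substring (0, sep);
                                        StreamWriter w;
                                        if (!writers.TryGetValue (chromosome, out w)) {
                                                w = new StreamWriter (Path.Combine (outputDir, chromosome + ".txt"));
                                                writers [chromosome] = w;
                                        }
                                        w.WriteLine (line);
                                }
                        }
                } finally {
                        foreach (var w in writers.Values)
                                w.Close ();
                }
        }
}
```

Only one line and one small dictionary of writers is live at any time,
regardless of the input file's size.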

> I've been able to get past my initial problem by re-compiling mono
> with the large heap size GC and when the entire data is read in, it
> takes up 17GB RAM for a 300MB file.  I know I'm new to mono/C#, but
> I've been programming in C++ for years and have written many
> commercial applications for large data and nothing I've written to
> date has been as memory hungry as this.  I'm hopeful I can get some
> good suggestions on how to improve performance.

First, why is your app so memory hungry?  You'd have to get a profiler,
but I imagine a lot of it is filled with temporary strings.  Storing the
entire file as a single string should only take ~600MB, but since you
read each line (one string/line for how many lines?) then split that
string (N strings/line), you have N+1 strings allocated per line.  Then
since you're using ArrayList or List<string> (which use arrays
internally), you'll have temporary arrays filling your memory as the
internal array is resized to store the new contents.
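(As an aside, that internal-array churn is cheap to avoid whenever you
can estimate the element count up front; a minimal sketch, where the
count is purely illustrative:)

```csharp
using System.Collections.Generic;

class Presized {
        // List<T> doubles (and copies) its internal array each time
        // Capacity is exceeded; passing an estimate to the constructor
        // allocates the internal array exactly once.
        public static List<string> NewSnpList (int expectedCount)
        {
                return new List<string> (expectedCount);
        }
}
```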

So it's not very surprising that you're using tons of memory.

How to mitigate it?

1. If possible, try not to store the entire file in memory at once.

2. Minimize string use, by e.g. storing the entire file as a single 
   string, and then instead of splitting the string into substrings
   you can instead create a structure that stores the start and end
   point of the "interesting" sub-sequence.  (You certainly want to
   use a struct, to minimize per-object overheads.)

   The problem with this is that each ACTG is 2 bytes in RAM, but
   you're (hopefully) not duplicating them as often.

   You will instead have an 8/16 byte "overhead" to store the start
   and end of each sub-sequence on each line.  Depending on the
   average length of each sub-sequence, this may be a decent tradeoff.

   Note that you should use List<struct> and not ArrayList, as
   ArrayLists + structures == severe memory overhead (the structs
   must be boxed, removing any memory savings).
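A minimal sketch of that start/end scheme (Segment and Tokenize are
names I made up, not framework types):

```csharp
using System;
using System.Collections.Generic;

// Records the extent of a token inside one big backing string,
// instead of allocating a substring for it.
struct Segment {
        public readonly int Start, Length;
        public Segment (int start, int length)
        {
                Start = start;
                Length = length;
        }
}

class SegmentScanner {
        // Scan `text` once and record each whitespace-separated token's
        // (start, length) pair; no substrings are created here.
        public static List<Segment> Tokenize (string text)
        {
                var segments = new List<Segment> ();
                int i = 0;
                while (i < text.Length) {
                        while (i < text.Length && char.IsWhiteSpace (text [i]))
                                ++i;
                        int start = i;
                        while (i < text.Length && !char.IsWhiteSpace (text [i]))
                                ++i;
                        if (i > start)
                                segments.Add (new Segment (start, i - start));
                }
                return segments;
        }
}
```

You only pay for a substring (via `text.Substring (seg.Start,
seg.Length)`) at the moment a token is actually needed.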

3. Remove string use, by dealing directly with the underlying Stream,
   and either read it into a byte[] (~300MB RAM) and then using the
   start/end structure suggested in (2), or leaving the file on disk
   (by using a FileStream, not a byte[]).

   The problem with this is that, if you read the entire file into a
   byte[], you're storing 1 byte/ACTG, which is still more than needed.
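A sketch of the byte[] variant (CountBases is just an illustrative
operation; the point is that the scan touches raw bytes, never strings):

```csharp
using System;
using System.IO;

class ByteScan {
        // Read the raw bytes (1 byte per character instead of UTF-16's 2)
        // and count the genotype letters without creating any strings.
        public static int CountBases (string path)
        {
                byte[] data = File.ReadAllBytes (path);
                int count = 0;
                foreach (byte b in data)
                        switch ((char) b) {
                        case 'A': case 'C': case 'G': case 'T':
                                ++count;
                                break;
                        }
                return count;
        }
}
```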

4. Use a custom collection type.  You'll likely want to be able to Add()
   new items to the collection, so BitArray isn't appropriate as it
   can't be resized once created.  What I would instead suggest is 
   taking the existing BitArray sources[0] and changing them to support
   resizing the internal array (into a ResizableBitArray), then:

        enum Genotype : byte { A, C, G, T }
        class GenotypeCollection : IList<Genotype> {
                ResizableBitArray bits = new ResizableBitArray ();

                public Genotype this [int index] {
                        get {
                                int b = index * 2;  // 2 bits per genotype
                                return (Genotype) (
                                        ((bits.Get (b) ? 1 : 0) << 1) |
                                         (bits.Get (b+1) ? 1 : 0));
                        }
                        set {
                                int b = index * 2;
                                bits.Set (b,   (((byte) value) & 0x02) != 0);
                                bits.Set (b+1, (((byte) value) & 0x01) != 0);
                        }
                }
                // other IList<T> members...
        }

   This would allow memory-efficient storage of genotype sequences (2
   bits per Genotype), so storing the entire 300MB file would require
   only ~75MB of RAM (plus array overheads, which are likely to be
   quite sizable, but should be less crazy than what you currently
   have).
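   Since ResizableBitArray is something you'd have to build, here's a
   self-contained version of the same 2-bits-per-element packing using
   the stock BitArray, which works whenever the element count is known
   up front:

```csharp
using System;
using System.Collections;

class PackedGenotypes {
        // 2 bits per genotype in a fixed-size BitArray; values 0..3
        // correspond to A, C, G, T as in the Genotype enum above.
        readonly BitArray bits;

        public PackedGenotypes (int count)
        {
                bits = new BitArray (count * 2);
        }

        public byte this [int index] {
                get {
                        int b = index * 2;
                        return (byte) (((bits [b] ? 1 : 0) << 1) | (bits [b + 1] ? 1 : 0));
                }
                set {
                        int b = index * 2;
                        bits [b]     = (value & 0x02) != 0;
                        bits [b + 1] = (value & 0x01) != 0;
                }
        }
}
```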

 - Jon

[0]
http://anonsvn.mono-project.com/viewvc/trunk/mcs/class/corlib/System.Collections/BitArray.cs?revision=111994&view=markup



