Meshbeyn / Projects

FastBWT

FastBWT is a BWT-sorter, that uses my new exponential algorithm for BWT. This algorithm uses for comparing already partial presorted ranges from previous iterations. This method has about double memory footprint as a "conventional" sorter, but it has the worst case complexity O(n*logn*logn) instead of O(n*n*logn). BWT usually sorts blocks with "many" bytes, so the sorting time for one block with long repeating sequences may become hundreds, thousands or even millions of the sorting time for usual block with the same size. This makes the BWT sorting unfeasible for most use cases, where it would be very useful. My algorithm has no "catastrophically unsuitable data", where sorting speed noticeably decreases. Even the repeated sorting takes usually the same or less time than the first one. With conventional sorter second sorting of text data is terrible slow.

FastBWT may be used as console application or .Net DLL library.

Current version (0.3) includes 2 version of sorter and one desorter. If you use FastBWT as executable, it uses as sorter BWT_Medium_Sorter.

Changes

Version 0.3 (Download)

First public version.

BWT_Medium_Sorter has now parallel initialization and 2-bytes bucket sorting, that decreases sorting time about 15%-20%.

Use FastBWT as DLL

You may reference the file FastBWT.exe from your project as normal .Net DLL using section References. FastBWT has the following public classes:

FastBWT.ASorter
Abstract superclass for all sorters. Also contains static methods to apply sorting vector.
FastBWT.BWT_Small_Sorter
Sorter for small blocks. Can sort blocks of up to 2G, but is effective only for very small blocks, where any optimizations are harmful. Consider to use this sorter only if your blocks are 200K or smaller.
FastBWT.BWT_Medium_Sorter
Sorter for normal blocks. Can sort blocks of up to 2G. To sort larger blocks, the sorter should use complex data structures, because .Net does not support arrays larger than 2G elements.
FastBWT.ADesorter
Abstract superclass for all desorters.
FastBWT.BWT_Small_Desorter
Desorter for small blocks. Can desort blocks of up to 2G. Currently works single threaded.
FastBWT.StatisticStep
Structure with a report data about one iteration of one sorting worker.

Sorter objects are reusable, but not reenterable. Wait for call completion of Sort method before you call it for the next block. If you want to start many sorters simultaneously, create many objects. Each sorter parallels its job very well, so parallel invocation of more than 2 sorters will only slow down the sorting process even if you have 8 kernels. Memory footprint is usually 9-11 times of block size and dynamically depends on data type. For example, video or other compressed data may cause a short, but very large explosion of memory footprint.

Use FastBWT as a console application

FastBWT may be started manual, from scripts or from not-.Net applications. If don't specify input or output file, the program reads and/or writes to standard input/output stream. In this case you cannot use parameters, that write something to the screen (indeed to the console output stream). Don't try to sort files over 2G.

You should specify some parameters by starting the program:

Statistics format

Statistics consist of Step blocks for sorting iterations. Each iteration sorts indices up to corresponding depth. Then it regroups the indistinguishable indices and so on.

Big groups consist of more than 65536 indices. This groups may considerably slow down the next iteration.

Small groups consist of 2 - 65535 indices. This groups will be sorted very fast and are processed in batches.

Points consist of 1 index and need no processing after this iteration.

Duration in seconds that this worker has spent for this iteration.

Total statistics show total costs for each iteration. Total Duration shows the spent time for single threaded sorting. Max Duration shows the duration of each iteration. Worker cannot start the next iteration before all other workers have completed this one, so in each iteration workers may lose some seconds.

Total Time shows the elapsed time from starting initialization and to getting sorted block. Reading and writing time is not counted.

Examples for sorting

FastBWT.exe -bwt -ip -i d:\abc\file.txt -o file.bwt

Sort file d:\abc\file.txt, save the result to file file.bwt in current folder and print the BWT index to screen.

FastBWT.exe -bwt -in -stat -i d:\abc\file.txt -o d:\abc\file.bwt

Sort file d:\abc\file.txt, insert the BWT index to output file name and save the results to the file with this name. For example. if the BWT index is 12345, result will be saved to file d:\abc\file[12345].bwt. After sorting print the statistics.

MyProg.exe | FastBWT.exe -bwt -ii > d:\abc\out.bwt

Sort the bytes from program MyProg.exe, save result and index to file d:\abc\out.bwt.

Examples for desorting

FastBWT.exe -ubwt -i d:\abc\file.bwt -o file.txt -id 12345

Desort file d:\abc\file.bwt to file d:\abc\file.txt using BWT index 12345.

FastBWT.exe -ubwt -i d:\abc\file[12345].bwt -o file.txt -in

Desort file d:\abc\file[12345].bwt, extracting index from the file name, and save the result to file file.txt.

FastBWT.exe -ubwt -ii < d:\abc\in.bwt | MyProg.exe

Desort file d:\abc\in.bwt, extracting index from the last 4 bytes, and feed the result to program MyProg.exe