FastBWT
FastBWT is a BWT-sorter, that uses my new exponential algorithm for BWT. This algorithm uses for comparing already partial presorted ranges from previous iterations. This method has about double memory footprint as a "conventional" sorter, but it has the worst case complexity O(n*logn*logn) instead of O(n*n*logn). BWT usually sorts blocks with "many" bytes, so the sorting time for one block with long repeating sequences may become hundreds, thousands or even millions of the sorting time for usual block with the same size. This makes the BWT sorting unfeasible for most use cases, where it would be very useful. My algorithm has no "catastrophically unsuitable data", where sorting speed noticeably decreases. Even the repeated sorting takes usually the same or less time than the first one. With conventional sorter second sorting of text data is terrible slow.
FastBWT may be used as console application or .Net DLL library.
Current version (0.3) includes 2 version of sorter and one desorter. If you use FastBWT as executable, it uses as sorter BWT_Medium_Sorter.
Changes
Version 0.3 (Download)
First public version.
BWT_Medium_Sorter has now parallel initialization and 2-bytes bucket sorting, that decreases sorting time about 15%-20%.
Use FastBWT as DLL
You may reference the file FastBWT.exe from your project as normal .Net DLL using section References. FastBWT has the following public classes:
- FastBWT.ASorter
- Abstract superclass for all sorters. Also contains static methods to apply sorting vector.
- FastBWT.BWT_Small_Sorter
- Sorter for small blocks. Can sort blocks of up to 2G, but is effective only for very small blocks, where any optimizations are harmful. Consider to use this sorter only if your blocks are 200K or smaller.
- FastBWT.BWT_Medium_Sorter
- Sorter for normal blocks. Can sort blocks of up to 2G. To sort larger blocks, the sorter should use complex data structures, because .Net does not support arrays larger than 2G elements.
- FastBWT.ADesorter
- Abstract superclass for all desorters.
- FastBWT.BWT_Small_Desorter
- Desorter for small blocks. Can desort blocks of up to 2G. Currently works single threaded.
- FastBWT.StatisticStep
- Structure with a report data about one iteration of one sorting worker.
Sorter objects are reusable, but not reenterable. Wait for call completion of Sort method before you call it for the next block. If you want to start many sorters simultaneously, create many objects. Each sorter parallels its job very well, so parallel invocation of more than 2 sorters will only slow down the sorting process even if you have 8 kernels. Memory footprint is usually 9-11 times of block size and dynamically depends on data type. For example, video or other compressed data may cause a short, but very large explosion of memory footprint.
Use FastBWT as a console application
FastBWT may be started manual, from scripts or from not-.Net applications. If don't specify input or output file, the program reads and/or writes to standard input/output stream. In this case you cannot use parameters, that write something to the screen (indeed to the console output stream). Don't try to sort files over 2G.
You should specify some parameters by starting the program:
- Command key -bwt or -ubwt. To sort specify -bwt, to desort -ubwt.
- Input or output key -i and -o. After a space specify path to file. If the program should read or write to standard stream, just omit corresponding key.
- Index encoding key -ip, -in, -ii or -id. Starting index is the index of source block in the sorted matrix. This index will be used to shift the desorted block to right position.
- Use -ip to print the index to screen. This key is forbidden if the output file is omitted. This key will be used only for sorting.
- Use -in to write/read the index from file name. Before the file extension will be inserted decimal index in square brackets. For example, if the BWT index is 12345 and file name is specified as file.bwt, the program creates file file[12345].bwt. Desorter extracts the index from file name.
- Use -ii to write/read index in/from the stream. Sorter writes after sorted block additional 4 bytes, containing the BWT index in Big Endian order (most significant byte first). Desorter reads file and uses last 4 bytes as BWT index.
- Use -id to specify index. After space specify index in decimal form. This key is opposite to -ip and used only for desorting.
- Optimal statistic key -stat. Every worker thread reports about its work after each sorting iteration. This statistics may be used to value the "sortability" of file and to test if the sorter is suitable.
Statistics format
Statistics consist of Step blocks for sorting iterations. Each iteration sorts indices up to corresponding depth. Then it regroups the indistinguishable indices and so on.
Big groups consist of more than 65536 indices. This groups may considerably slow down the next iteration.
Small groups consist of 2 - 65535 indices. This groups will be sorted very fast and are processed in batches.
Points consist of 1 index and need no processing after this iteration.
Duration in seconds that this worker has spent for this iteration.
Total statistics show total costs for each iteration. Total Duration shows the spent time for single threaded sorting. Max Duration shows the duration of each iteration. Worker cannot start the next iteration before all other workers have completed this one, so in each iteration workers may lose some seconds.
Total Time shows the elapsed time from starting initialization and to getting sorted block. Reading and writing time is not counted.
Examples for sorting
- FastBWT.exe -bwt -ip -i d:\abc\file.txt -o file.bwt
-
Sort file d:\abc\file.txt, save the result to file file.bwt in current folder and print the BWT index to screen.
- FastBWT.exe -bwt -in -stat -i d:\abc\file.txt -o d:\abc\file.bwt
-
Sort file d:\abc\file.txt, insert the BWT index to output file name and save the results to the file with this name. For example. if the BWT index is 12345, result will be saved to file d:\abc\file[12345].bwt. After sorting print the statistics.
- MyProg.exe | FastBWT.exe -bwt -ii > d:\abc\out.bwt
-
Sort the bytes from program MyProg.exe, save result and index to file d:\abc\out.bwt.
Examples for desorting
- FastBWT.exe -ubwt -i d:\abc\file.bwt -o file.txt -id 12345
-
Desort file d:\abc\file.bwt to file d:\abc\file.txt using BWT index 12345.
- FastBWT.exe -ubwt -i d:\abc\file[12345].bwt -o file.txt -in
-
Desort file d:\abc\file[12345].bwt, extracting index from the file name, and save the result to file file.txt.
- FastBWT.exe -ubwt -ii < d:\abc\in.bwt | MyProg.exe
-
Desort file d:\abc\in.bwt, extracting index from the last 4 bytes, and feed the result to program MyProg.exe