file_sorter - man pages section 3: Extended Library Functions, Volume 1

Language:

file_sorter (3erl)

Name

file_sorter - File sorter.

Synopsis

Please see following description for synopsis

Description

file_sorter(3)             Erlang Module Definition             file_sorter(3)



NAME
       file_sorter - File sorter.

DESCRIPTION
       This  module  contains  functions  for  sorting terms on files, merging
       already sorted files, and checking files for  sortedness.  Chunks  con-
       taining  binary  terms are read from a sequence of files, sorted inter-
       nally in memory and written on temporary files, which are  merged  pro-
       ducing  one  sorted file as output. Merging is provided as an optimiza-
       tion; it is faster when the files are already  sorted,  but  it  always
       works to sort instead of merge.

       On  a file, a term is represented by a header and a binary. Two options
       define the format of terms on files:

         {header, HeaderLength}:
           HeaderLength determines the number of bytes preceding  each  binary
           and  containing  the  length of the binary in bytes. Defaults to 4.
           The order of the header bytes is defined as  follows:  if  B  is  a
           binary  containing a header only, size Size of the binary is calcu-
           lated as <<Size:HeaderLength/unit:8>> = B.

         {format, Format}:
           Option Format determines the function that is applied  to  binaries
           to create the terms to be sorted. Defaults to binary_term, which is
           equivalent to fun binary_to_term/1. Value binary is  equivalent  to
           fun(X)  ->  X end, which means that the binaries are sorted as they
           are. This is the fastest format. If Format is  term,  io:read/2  is
           called  to  read  terms.  In  that  case, only the default value of
           option header is allowed.

           Option format also determines what is written to the sorted  output
           file:  if  Format is term, then io:format/3 is called to write each
           term, otherwise the binary prefixed by a header is written.  Notice
           that  the  binary  written  is  the  same binary that was read; the
           results of applying function Format are thrown away when the  terms
           have  been sorted. Reading and writing terms using the io module is
           much slower than reading and writing binaries.

       Other options are:

         {order, Order}:
           The default is to sort terms in ascending order, but  that  can  be
           changed  by  value descending or by specifying an ordering function
           Fun. An ordering function is antisymmetric, transitive, and  total.
           Fun(A,  B)  is  to return true if A comes before B in the ordering,
           otherwise false. An example of a typical ordering function is  less
           than  or  equal to, =</2. Using an ordering function slows down the
           sort considerably. Functions keysort, keymerge and keycheck do  not
           accept ordering functions.

         {unique, boolean()}:
           When  sorting  or  merging  files,  only the first of a sequence of
           terms that compare equal (==) is output if this option  is  set  to
           true.  Defaults to false, which implies that all terms that compare
           equal are output. When checking files for sortedness, a check  that
           no  pair of consecutive terms compares equal is done if this option
           is set to true.

         {tmpdir, TempDirectory}:
           The directory where temporary files are put can be  chosen  explic-
           itly.  The  default, implied by value "", is to put temporary files
           on the same directory as the sorted output file.  If  output  is  a
           function  (see  below), the directory returned by file:get_cwd() is
           used instead. The names of temporary files  are  derived  from  the
           Erlang  nodename  (node()),  the  process identifier of the current
           Erlang   emulator   (os:getpid()),    and    a    unique    integer
           (erlang:unique_integer([positive])).  A  typical  name  is  fs_myn-
           ode@myhost_1763_4711.17, where 17 is a  sequence  number.  Existing
           files  are  overwritten.  Temporary  files  are deleted unless some
           uncaught EXIT signal occurs.

         {compressed, boolean()}:
           Temporary files and the output file  can  be  compressed.  Defaults
           false, which implies that written files are not compressed. Regard-
           less of the value of option compressed, compressed files can always
           be  read. Notice that reading and writing compressed files are sig-
           nificantly slower than reading and writing uncompressed files.

         {size, Size}:
           By default about 512*1024 bytes read from files are  sorted  inter-
           nally. This option is rarely needed.

         {no_files, NoFiles}:
           By  default  16  files  are merged at a time. This option is rarely
           needed.

       As an alternative to sorting files, a function of one argument  can  be
       specified  as  input.  When  called with argument read, the function is
       assumed to return either of the following:

         * end_of_input or {end_of_input, Value}} when there is no more  input
           (Value is explained below).

         * {Objects,  Fun},  where  Objects  is  a  list  of binaries or terms
           depending on the format, and Fun is a new input function.

       Any other value is immediately returned as value of the current call to
       sort  or  keysort.  Each  input  function is called exactly once. If an
       error occurs, the last function is  called  with  argument  close,  the
       reply of which is ignored.

       A  function  of one argument can be specified as output. The results of
       sorting or merging the input is collected in a  non-empty  sequence  of
       variable length lists of binaries or terms depending on the format. The
       output function is called with one list at a time, and  is  assumed  to
       return  a  new  output  function. Any other return value is immediately
       returned as value of the current call to the sort  or  merge  function.
       Each  output function is called exactly once. When some output function
       has been applied to all of the results or an  error  occurs,  the  last
       function  is  called  with argument close, and the reply is returned as
       value of the current call to the sort or merge function.

       If a function is specified as input and the last input function returns
       {end_of_input,  Value}, the function specified as output is called with
       argument {value, Value}. This makes it easy to initiate the sequence of
       output functions with a value calculated by the input functions.

       As  an  example, consider sorting the terms on a disk log file. A func-
       tion that reads chunks from the disk log and returns a list of binaries
       is used as input. The results are collected in a list of terms.

       sort(Log) ->
           {ok, _} = disk_log:open([{name,Log}, {mode,read_only}]),
           Input = input(Log, start),
           Output = output([]),
           Reply = file_sorter:sort(Input, Output, {format,term}),
           ok = disk_log:close(Log),
           Reply.

       input(Log, Cont) ->
           fun(close) ->
                   ok;
              (read) ->
                   case disk_log:chunk(Log, Cont) of
                       {error, Reason} ->
                           {error, Reason};
                       {Cont2, Terms} ->
                           {Terms, input(Log, Cont2)};
                       {Cont2, Terms, _Badbytes} ->
                           {Terms, input(Log, Cont2)};
                       eof ->
                           end_of_input
                   end
           end.

       output(L) ->
           fun(close) ->
                   lists:append(lists:reverse(L));
              (Terms) ->
                   output([Terms | L])
           end.

       For  more examples of functions as input and output, see the end of the
       file_sorter module; the term format is implemented with functions.

       The possible values of Reason returned when an error occurs are:

         * bad_object, {bad_object, FileName} - Applying the  format  function
           failed  for  some binary, or the key(s) could not be extracted from
           some term.

         * {bad_term, FileName} - io:read/2 failed to read some term.

         * {file_error,  FileName,  file:posix()}  -  For  an  explanation  of
           file:posix(), see file(3).

         * {premature_eof, FileName} - End-of-file was encountered inside some
           binary term.

DATA TYPES
       file_name() = file:name()

       file_names() = [file:name()]

       i_command() = read | close

       i_reply() =
           end_of_input |
           {end_of_input, value()} |
           {[object()], infun()} |
           input_reply()

       infun() = fun((i_command()) -> i_reply())

       input() = file_names() | infun()

       input_reply() = term()

       o_command() = {value, value()} | [object()] | close

       o_reply() = outfun() | output_reply()

       object() = term() | binary()

       outfun() = fun((o_command()) -> o_reply())

       output() = file_name() | outfun()

       output_reply() = term()

       value() = term()

       options() = [option()] | option()

       option() =
           {compressed, boolean()} |
           {header, header_length()} |
           {format, format()} |
           {no_files, no_files()} |
           {order, order()} |
           {size, size()} |
           {tmpdir, tmp_directory()} |
           {unique, boolean()}

       format() = binary_term | term | binary | format_fun()

       format_fun() = fun((binary()) -> term())

       header_length() = integer() >= 1

       key_pos() = integer() >= 1 | [integer() >= 1]

       no_files() = integer() >= 1

       order() = ascending | descending | order_fun()

       order_fun() = fun((term(), term()) -> boolean())

       size() = integer() >= 0

       tmp_directory() = [] | file:name()

       reason() =
           bad_object |
           {bad_object, file_name()} |
           {bad_term, file_name()} |
           {file_error,
            file_name(),
            file:posix() | badarg | system_limit} |
           {premature_eof, file_name()}

EXPORTS
       check(FileName) -> Reply

       check(FileNames, Options) -> Reply

              Types:

                 FileNames = file_names()
                 Options = options()
                 Reply = {ok, [Result]} | {error, reason()}
                 Result = {FileName, TermPosition, term()}
                 FileName = file_name()
                 TermPosition = integer() >= 1

              Checks files for sortedness. If a file is not sorted, the  first
              out-of-order  element  is returned. The first term on a file has
              position 1.

              check(FileName) is equivalent to check([FileName], []).

       keycheck(KeyPos, FileName) -> Reply

       keycheck(KeyPos, FileNames, Options) -> Reply

              Types:

                 KeyPos = key_pos()
                 FileNames = file_names()
                 Options = options()
                 Reply = {ok, [Result]} | {error, reason()}
                 Result = {FileName, TermPosition, term()}
                 FileName = file_name()
                 TermPosition = integer() >= 1

              Checks files for sortedness. If a file is not sorted, the  first
              out-of-order  element  is returned. The first term on a file has
              position 1.

              keycheck(KeyPos, FileName)  is  equivalent  to  keycheck(KeyPos,
              [FileName], []).

       keymerge(KeyPos, FileNames, Output) -> Reply

       keymerge(KeyPos, FileNames, Output, Options) -> Reply

              Types:

                 KeyPos = key_pos()
                 FileNames = file_names()
                 Output = output()
                 Options = options()
                 Reply = ok | {error, reason()} | output_reply()

              Merges  tuples on files. Each input file is assumed to be sorted
              on key(s).

              keymerge(KeyPos,   FileNames,   Output)   is    equivalent    to
              keymerge(KeyPos, FileNames, Output, []).

       keysort(KeyPos, FileName) -> Reply

              Types:

                 KeyPos = key_pos()
                 FileName = file_name()
                 Reply  =  ok  |  {error,  reason()}  |  input_reply()  | out-
                 put_reply()

              Sorts tuples on files.

              keysort(N, FileName) is  equivalent  to  keysort(N,  [FileName],
              FileName).

       keysort(KeyPos, Input, Output) -> Reply

       keysort(KeyPos, Input, Output, Options) -> Reply

              Types:

                 KeyPos = key_pos()
                 Input = input()
                 Output = output()
                 Options = options()
                 Reply  =  ok  |  {error,  reason()}  |  input_reply()  | out-
                 put_reply()

              Sorts tuples on files. The sort is performed on  the  element(s)
              mentioned  in  KeyPos.  If  two tuples compare equal (==) on one
              element, the next element according to KeyPos is  compared.  The
              sort is stable.

              keysort(N,  Input,  Output)  is  equivalent to keysort(N, Input,
              Output, []).

       merge(FileNames, Output) -> Reply

       merge(FileNames, Output, Options) -> Reply

              Types:

                 FileNames = file_names()
                 Output = output()
                 Options = options()
                 Reply = ok | {error, reason()} | output_reply()

              Merges terms on files. Each input file is assumed to be sorted.

              merge(FileNames, Output) is equivalent to merge(FileNames,  Out-
              put, []).

       sort(FileName) -> Reply

              Types:

                 FileName = file_name()
                 Reply  =  ok  |  {error,  reason()}  |  input_reply()  | out-
                 put_reply()

              Sorts terms on files.

              sort(FileName) is equivalent to sort([FileName], FileName).

       sort(Input, Output) -> Reply

       sort(Input, Output, Options) -> Reply

              Types:

                 Input = input()
                 Output = output()
                 Options = options()
                 Reply =  ok  |  {error,  reason()}  |  input_reply()  |  out-
                 put_reply()

              Sorts terms on files.

              sort(Input, Output) is equivalent to sort(Input, Output, []).



Ericsson AB                       stdlib 3.17                   file_sorter(3)