UOP file format [Updated for reading/writing]

Post by punt

I wanted to capture the format of the UOP files used in UO, along with a reference class that can be subclassed to handle the different types of data. I updated this to enable writing as well.
//Copyright © 2021 Charles Kerr. All rights reserved.

#ifndef UOPData_hpp
#define UOPData_hpp

#include <string>
#include <fstream>
#include <cstdint>
#include <vector>
#include <map>
#include <tuple>

/*******************************************************************************
 Acknowledgement
 This information was gleaned from Mythic LegacyMul Convertor.
 Special thanks to those who deciphered that data and made that
 source available for others to examine and learn from.
 
 ******************************************************************************/

/*******************************************************************************
 Hashes
 Hashes are used to define how the data is used (what it represents).
 There are two types of hashes used (Adler32 and HashLittle2). For
 more information on HashLittle2 refer to http://burtleburtle.net/bob/c/lookup3.c
 In the hash strings, {#} is used as a substitution placeholder.  The # represents
 the number of characters the final substitution should have (padded with leading 0s).
 So a {2} would indicate that it should be two characters. So if one is representing
 the number 1, it would result in 01.
 
 The hash strings used for each file type are as follows (case is important).
 Some file types use two different hashes. In addition, the number of keys (hashes)
 to be built can vary.  Other programs that process UOP files use
 0x7FFFF as the number of keys.
 
 Art
 "build/artlegacymul/{8}.tga"
 The number being replaced essentially corresponds to the idx
 entry in artidx.mul.
  The number of keys to be built is around 0x13FDC.
  UOFiddler requires this exact idx length to recognize UOHS art files (it checks with == operator, not with >=)

 GumpArt
 "build/gumpartlegacymul/{8}.tga"
 "build/gumpartlegacymul/{7}.tga"
 The number being replaced essentially corresponds to the idx
 entry in gumpidx.mul.
 
 Map
 "build/map{1}legacymul/{8}.dat"
 The first substitution is the map number, the second one is the
 index.  An index represents the location index*0xC4000 in the corresponding
 map mul file.
 
 Sound
 "build/soundlegacymul/{8}.dat"
 
 Multi
 "build/multicollection/{6}.bin"
 
 Embedded with the multi data is a file, housing.bin.  This
 is identified by the file hash 0x126D1E99DDEDEE0A.
 It is compressed, and that data should be treated not as
 part of multi.mul, but as a separate file, housing.bin.
 ******************************************************************************/


/*******************************************************************************
 Notes/Exceptions
 For the most part, when one accesses the data pointed to by an
 entry, it has the same format as the data in the corresponding mul file.
 Exceptions:
 Gumps
 The first 8 bytes of the data represent the width
 (bytes 0-3) and height (bytes 4-7) of the gump
 ******************************************************************************/


/*******************************************************************************
 UOP file format
 The UOP format holds a variety of different data for Ultima Online.  The
 file contains table(s) of index entries, each of which contains information about where
 the data for that entry is in the file.  An entry also records whether or not the data
 is compressed (zlib compression), and a hash.  This hash is based on the original
 file name, and its format varies with each file type.  The hash correlates
 directly with the "index" in an IDX (or map block for non-idx files) that the data
 belongs to.
 
 A table entry has the following format
 
 UOP Table entry:
 
 std::int64_t   data_offset ;   // Offset to the data for this entry
 std::uint32_t  header_length;  // Length of header
 std::uint32_t  compress_size;  // Compressed size of data
 std::uint32_t  decompress_size;    // Decompressed size of data
 std::uint64_t  identifer;      // Filename(index) hash (HashLittle2)
 std::uint32_t  data_hash;      // Data hash (Adler32)
 std::int16_t   compression;    // 0 = none, 1 = zlib
 
 
 Using the table entry, the file format is as follows
 
 UOP File Format (the table entry will be at offset 0x28 or greater):
 
 std::int32_t   signature;      // This signifies to be a UOP file
                        // and has a fixed value of
                        // 0x50594D  ('MYP')
 std::int32_t   version;        // Version of the format/file
                        // At this time this documentation is believed
                        // to be valid for versions up to and including 5
 std::int32_t   timestamp;      // Unknown, believed to be a timestamp or something
                        // for the file (0xFD23EC43)
 
 std::uint64_t  table_offset;   // Offset to the first table
                        // There can be multiple tables in the file!
 
 std::uint32_t  tablesize;      // Only really needed for writing (table (block) size)
                        // current value is 100
 std::uint32_t  filecount;      // Each entry is considered a file
 std::int32_t   unknown;    // Value is 1, perhaps modified count?
 std::int32_t   unknown;    // Value is 1
 std::int32_t   unknown;    // Value is 0
 
 
 
 The following is repeated for each table
 
 std::uint32_t  number_entries; // how many entries are in the table
 std::uint64_t  next_table;     // Offset to the next table (0 if this is the last table)
 
 UOPTable       table[number_entries];
 ******************************************************************************/


namespace UO {
    /************************************************************************
     USE:
        One subclasses this class for each data type to handle.  The pertinent
     methods to override are:
     Reading:
        virtual bool specialIdentifer(std::uint64_t identifier, std::vector<unsigned char> &data)
            This method allows one to handle a special type of hash. Multis include a housing.bin
            file, and this is a way to capture it.  In that case one would either ignore it (but still
            return true, telling the base class not to process it), or take the data,
            create a housing.bin, and return true.  Regardless, this is a way to catch
            special identifiers before processing happens.
     
        virtual void processData(std::size_t index, std::vector<unsigned char> &data)
            This method is the main method. It provides the subclass with the index it found
            based on the identifier, and the data associated with it.  The subclass can then
            interpret the data accordingly. See above for notes on the data.
     
        virtual bool identifierFailed(std::uint64_t identifier, std::vector<unsigned char> &data)
            This method alerts the subclass that the identifier was not found in the hash key lookup.
            This can mean: 1. specialIdentifer was not used, and this is now being called because
                             no lookup key was found.
                            2. the max number of keys told to be created upon reading was
                             insufficient.
                            3. An improper hash string format was provided upon reading.
                            4. Something changed in the format?
            If the subclass wants all processing to stop on the uop file, return false and the reading
            will stop with a false result. If one returns true from this method, processing continues
            with the next entry.

     Writing:
     
        virtual std::tuple<std::size_t,std::string,bool> retrieveInfo(std::size_t count, std::vector<unsigned char> &data)
            This method is called for the subclass to provide the data for the entry, and information on hash string format,
            what index/key it represents, and if the data should be compressed.

     There are two methods provided for reading/writing (not to be overridden).
     
     bool readUOP(std::ifstream &input,std::size_t maxindex, const std::string &hash_format1, const std::string &hash_format2="" )
        This method is to read a UOP.  The file must be open (and the stream is passed via input, opened for binary reading).  One provides the max number of keys
        the method is to make (see above for information on this number).  In addition, two hash formats may be provided to be used
        to look for keys (see above for hash formats).
     
     bool writeUOP(std::ofstream &output, std::uint32_t total_entries)
        For this method, one provides an open stream (UOP file that is opened for binary writing). In addition, the total number of
        entries that will be written.

     
     ************************************************************************/

    class UOPData {
    private:
        /******************** Table entry structure ***********************/
        struct TableEntry {
            std::int64_t    offset ;
            std::uint32_t   header_length ;
            std::uint32_t   compressed_length ;
            std::uint32_t   decompressed_length ;
            std::uint64_t   identifer ;
            std::uint32_t   data_block_hash ;
            std::int16_t    compression ;
            TableEntry();
            TableEntry &    load(std::istream &input) ;
            TableEntry &    save(std::ostream &output) ;
            // 34 bytes for a table entry
            /*********************** Constants used ******************/
            static constexpr unsigned int _entry_size = 34 ;
        };
        /******************** File Header structure ***********************/
        struct FileHeader {
            std::int32_t    signature ;
            std::int32_t    version ;
            std::int32_t    stamp ;
            std::uint64_t   table_offset ;
            std::uint32_t   tablesize ;
            std::uint32_t   filecount ;
            std::int32_t    unknown0;
            std::int32_t    unknown1 ;
            std::int32_t    unknown2 ;
            FileHeader() ;
            bool valid() const ;
            FileHeader &    load(std::ifstream &input);
            FileHeader &    save(std::ofstream &output);
            FileHeader &    pad(std::ofstream &output); //Pad with zero
                                            // from current
                                            // location to table_offset
            /*********************** Constants used ******************/
            // File signature
            static constexpr    unsigned int _uop_identifer = 0x50594D;
            // version
            static constexpr    unsigned int _uop_version = 5 ;
            // Number entries in a table (maximum)
            static constexpr  unsigned int _table_size = 100 ;
            // Value of the "timestamp" field in the file header
            static constexpr    unsigned int _uop_stamp = 0xFD23EC43;
           
            // Offset where we start tables
            static constexpr    unsigned int _table_start = 0x200;
           
            // the file header is at least 40 bytes. So no table should
            // start until after this location
            static constexpr  unsigned int _file_header_size = 0x28 ;
        };


        /****************** zlib compression wrappers *********************/
        std::vector<unsigned char> compress(const std::vector<unsigned char> &data) const;
        std::vector<unsigned char> decompress(const std::vector<unsigned char> &source, std::size_t decompressed_size) const;
       
        /********************* hash routines ******************************/
        std::uint64_t HashLittle2(const std::string& s)  const ;
        std::uint32_t HashAdler32(const char* d, std::uintmax_t length ) const ;
        std::uint32_t HashAdler32(const std::vector<unsigned char> &data) const ;

        // Apply the index into the format string
        std::string format(const std::string& hashformat, std::size_t index);
        // Build a series of identifiers for a hash string and count of entries
        std::map<std::uint64_t,std::size_t> buildIdentifiers(const std::string &hashstring, std::size_t number_entries) ;
        // Retrieve an index from an identifier
        std::size_t retrieveIndex(std::uint64_t identifer, const std::map<std::uint64_t,std::size_t> &lookup1,const std::map<std::uint64_t,std::size_t> &lookup2 ) const ;
        /*************************** Index gather/writers ******************/
        std::vector<UOPData::TableEntry> gatherEntries(std::ifstream &input, std::uint64_t offset);
        // Writes out table entries (and the table) and returns a vector of offsets for each entry
        std::vector<std::uint64_t> buildAllTable(std::ofstream &output,std::uint32_t totalentry,std::uint32_t tablecount);
        // Builds a table (with the amount of entries) in the output stream.  Returns a vector of entry offsets
        std::vector<std::uint64_t> buildTable(std::uint32_t entrycount,std::ofstream &output);
       
    protected:
       
        /********************** Override these by subclasses **************/
       
        // Special processing for this identifier (if special, return true, else false)
        virtual bool specialIdentifer(std::uint64_t identifier, std::vector<unsigned char> &data);
        // Process the data however it means to the subclass
        virtual void processData(std::size_t index, std::vector<unsigned char> &data);
        // Identifier lookup failed for the entry.  Return true if processing should continue
        virtual bool identifierFailed(std::uint64_t identifier, std::vector<unsigned char> &data);
        // Retrieve the key/index, hashstring, compression flag, and data to be written (count is for the
        // subclass to use or not to understand what data is being asked for).  There may not be a one-for-one
        // correlation of index to count if one wants to not save empty indexes (idx entries) in the uop file
        virtual std::tuple<std::size_t,std::string,bool> retrieveInfo(std::size_t count, std::vector<unsigned char> &data);
       
       
       
        // Read a UOP
        bool readUOP(std::ifstream &input,std::size_t maxindex, const std::string &hash_format1, const std::string &hash_format2="" ) ;
        bool writeUOP(std::ofstream &output, std::uint32_t total_entries);
    public:
        virtual ~UOPData() = default ;
        UOPData() = default ;
    };
}
#endif /* UOPData_hpp */
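
The {#} placeholder rule described in the Hashes notes of the header above can be tried out on its own. What follows is only a minimal sketch, not the class's format method: substituteFirst is a hypothetical helper name, and it assumes the placeholder always holds a plain decimal width (std::stoul will throw otherwise). It replaces only the first placeholder, so a string with two of them, such as the map one, needs two calls.

```cpp
#include <cstddef>
#include <string>

// Hypothetical helper (not part of the posted class): replaces the first
// {N} placeholder in fmt with value written in decimal, zero-padded to N digits.
// Assumes the text between the braces is a plain decimal number.
std::string substituteFirst(const std::string &fmt, std::size_t value) {
    auto open = fmt.find('{');
    if (open == std::string::npos) {
        return fmt;                 // nothing to substitute
    }
    auto close = fmt.find('}', open + 1);
    if (close == std::string::npos) {
        return fmt;                 // unterminated placeholder, leave as is
    }
    // The number between the braces is the minimum digit count
    std::size_t width = std::stoul(fmt.substr(open + 1, close - open - 1));
    std::string digits = std::to_string(value);
    if (digits.size() < width) {
        digits = std::string(width - digits.size(), '0') + digits;
    }
    std::string result = fmt;
    result.replace(open, close - open + 1, digits);
    return result;
}
```

With this sketch, substituteFirst("build/artlegacymul/{8}.tga", 1) gives build/artlegacymul/00000001.tga. For a map, substituting the map number first and the index second, substituteFirst(substituteFirst("build/map{1}legacymul/{8}.dat", 0), 12) gives build/map0legacymul/00000012.dat. The resulting string is what would then be passed to HashLittle2.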
//Copyright © 2021 Charles Kerr. All rights reserved.

#include "UOPData.hpp"
#include "StringUtility.hpp"
#include <stdexcept>
#include <zlib.h>
// Needed for the "..."s std::string literals used below
using namespace std::string_literals;
namespace UO {
    /*************************************************************************
     TableEntry methods
     ************************************************************************/

    //===============================================================
    UOPData::TableEntry::TableEntry(){
        offset = 0 ;
        header_length = 0 ;
        compressed_length = 0 ;
        decompressed_length = 0 ;
        identifer = 0;
        data_block_hash = 0 ;
        compression = 0 ;

    }
    //===============================================================
    UOPData::TableEntry &   UOPData::TableEntry::load(std::istream &input) {
        input.read(reinterpret_cast<char*>(&offset),sizeof(offset));
        input.read(reinterpret_cast<char*>(&header_length),sizeof(header_length));
        input.read(reinterpret_cast<char*>(&compressed_length),sizeof(compressed_length));
        input.read(reinterpret_cast<char*>(&decompressed_length),sizeof(decompressed_length));
        input.read(reinterpret_cast<char*>(&identifer),sizeof(identifer));
        input.read(reinterpret_cast<char*>(&data_block_hash),sizeof(data_block_hash));
        input.read(reinterpret_cast<char*>(&compression),sizeof(compression));
        return *this ;
    }
    //===============================================================
    UOPData::TableEntry &   UOPData::TableEntry::save(std::ostream &output) {
        output.write(reinterpret_cast<char*>(&offset),sizeof(offset));
        output.write(reinterpret_cast<char*>(&header_length),sizeof(header_length));
        output.write(reinterpret_cast<char*>(&compressed_length),sizeof(compressed_length));
        output.write(reinterpret_cast<char*>(&decompressed_length),sizeof(decompressed_length));
        output.write(reinterpret_cast<char*>(&identifer),sizeof(identifer));
        output.write(reinterpret_cast<char*>(&data_block_hash),sizeof(data_block_hash));
        output.write(reinterpret_cast<char*>(&compression),sizeof(compression));
        return *this ;
    }
   
    /*************************************************************************
     FileHeader methods
     ************************************************************************/


    //=============================================================================
    UOPData::FileHeader::FileHeader() {
        signature = _uop_identifer;
        version = _uop_version ;
        stamp = _uop_stamp  ;
        table_offset = _table_start;
        tablesize = _table_size ;
        filecount = 1;
        unknown0 = 1;
        unknown1 = 1;
        unknown2 = 0 ;
       
    }
    //=============================================================================
    bool UOPData::FileHeader::valid() const {
        return ((signature == _uop_identifer) && (version==_uop_version));
    }
    //=============================================================================
    UOPData::FileHeader &   UOPData::FileHeader::load(std::ifstream &input){
        input.read(reinterpret_cast<char*>(&signature),sizeof(signature));
        input.read(reinterpret_cast<char*>(&version),sizeof(version));
        input.read(reinterpret_cast<char*>(&stamp),sizeof(stamp));
        input.read(reinterpret_cast<char*>(&table_offset),sizeof(table_offset));
        input.read(reinterpret_cast<char*>(&tablesize),sizeof(tablesize));
        input.read(reinterpret_cast<char*>(&filecount),sizeof(filecount));
        input.read(reinterpret_cast<char*>(&unknown0),sizeof(unknown0));
        input.read(reinterpret_cast<char*>(&unknown1),sizeof(unknown1));
        input.read(reinterpret_cast<char*>(&unknown2),sizeof(unknown2));

        return *this;
    }
    //=============================================================================
    UOPData::FileHeader &   UOPData::FileHeader::save(std::ofstream &output){
        output.write(reinterpret_cast<char*>(&signature),sizeof(signature));
        output.write(reinterpret_cast<char*>(&version),sizeof(version));
        output.write(reinterpret_cast<char*>(&stamp),sizeof(stamp));
        output.write(reinterpret_cast<char*>(&table_offset),sizeof(table_offset));
        output.write(reinterpret_cast<char*>(&tablesize),sizeof(tablesize));
        output.write(reinterpret_cast<char*>(&filecount),sizeof(filecount));
        output.write(reinterpret_cast<char*>(&unknown0),sizeof(unknown0));
        output.write(reinterpret_cast<char*>(&unknown1),sizeof(unknown1));
        output.write(reinterpret_cast<char*>(&unknown2),sizeof(unknown2));

   
        return *this;

    }
    //=============================================================================
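    UOPData::FileHeader &   UOPData::FileHeader::pad(std::ofstream &output){
        // Use signed arithmetic so that a stream already past table_offset
        // yields a negative size (no padding) instead of wrapping around
        auto loc = static_cast<std::int64_t>(output.tellp()) ;
        auto size = static_cast<std::int64_t>(table_offset) - loc ;
        if (size>0) {
            char zero = 0 ;
            for (std::int64_t i=0;i<size ; i++){
                output.write(&zero,1);
            }
        }
        return *this ;
    }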

    /************************************************************************
     zlib wrappers for compression
     ***********************************************************************/

    //=============================================================================
    std::vector<unsigned char> UOPData::decompress(const std::vector<unsigned char> &source, std::size_t decompressed_size) const{
        // uLongf is from zlib.h
        auto srcsize = static_cast<uLongf>(source.size()) ;
        auto destsize = static_cast<uLongf>(decompressed_size);
        std::vector<unsigned char> dest(decompressed_size,0);
        auto status = uncompress2(dest.data(), &destsize, source.data(), &srcsize);
        if (status != Z_OK){
            dest.clear() ;
            dest.resize(0) ;
            return dest ;
        }
        dest.resize(destsize);
        return dest ;
    }
    //=============================================================================
    std::vector<unsigned char> UOPData::compress(const std::vector<unsigned char> &source) const {
        auto size = compressBound(source.size());
        std::vector<unsigned char> rdata(size,0);
        auto status = compress2(reinterpret_cast<Bytef*>(rdata.data()), &size, reinterpret_cast<const Bytef*>(source.data()), static_cast<uLongf>(source.size()),Z_DEFAULT_COMPRESSION);
        if (status != Z_OK){
            rdata.clear();
            return rdata ;
        }
        rdata.resize(size) ;
        return rdata;
    }

    /************************************************************************
     Hash routines
     ***********************************************************************/

   
    //=============================================================================
    std::uint64_t UOPData::HashLittle2(const std::string& s) const {
       
        std::uint32_t length = static_cast<std::uint32_t>(s.size()) ;
        std::uint32_t a ;
        std::uint32_t b ;
        std::uint32_t c ;
       
        c = 0xDEADBEEF + static_cast<std::uint32_t>(length) ;
        a = c;
        b = c ;
        std::uint32_t k = 0 ;
        std::uint32_t l = 0 ;
       
        while (length > 12){
            a += (s[k++]);
            a += (s[k++] << 8);
            a += (s[k++] << 16);
            a += (s[k++] << 24);
            b += (s[k++]);
            b += (s[k++] << 8);
            b += (s[k++] << 16);
            b += (s[k++] << 24);
            c += (s[k++]);
            c += (s[k++] << 8);
            c += (s[k++] << 16);
            c += (s[k++] << 24);
           
            a -= c; a ^= c << 4 | c >> 28; c += b;
            b -= a; b ^= a << 6 | a >> 26; a += c;
            c -= b; c ^= b << 8 | b >> 24; b += a;
            a -= c; a ^= c << 16 | c >> 16; c += b;
            b -= a; b ^= a << 19 | a >> 13; a += c;
            c -= b; c ^= b << 4 | b >> 28; b += a;
           
            length -= 12 ;
        }
       
        // Notice the lack of breaks!  we actually want it to fall through
        switch (length) {
            case 12: {
                l = k + 11;
                c += (s[l] << 24);
            }
            case 11: {
                l = k + 10;
                c += (s[l] << 16);
            }
            case 10: {
                l = k + 9;
                c += (s[l] << 8);
            }
            case 9: {
                l = k + 8;
                c += (s[l]);
            }
            case 8: {
                l = k + 7;
                b += (s[l] << 24);
            }
            case 7: {
                l = k + 6;
                b += (s[l] << 16);
            }
            case 6: {
                l = k + 5;
                b += (s[l] << 8);
            }
            case 5: {
                l = k + 4;
                b += (s[l]);
            }
            case 4: {
                l = k + 3;
                a += (s[l] << 24);
            }
            case 3: {
                l = k + 2;
                a += (s[l] << 16);
            }
            case 2: {
                l = k + 1;
                a += (s[l] << 8);
            }
            case 1: {
                a += (s[k]);
                c ^= b; c -= b << 14 | b >> 18;
                a ^= c; a -= c << 11 | c >> 21;
                b ^= a; b -= a << 25 | a >> 7;
                c ^= b; c -= b << 16 | b >> 16;
                a ^= c; a -= c << 4 | c >> 28;
                b ^= a; b -= a << 14 | a >> 18;
                c ^= b; c -= b << 24 | b >> 8;
                break;
            }
               
            default:
                break;
        }
       
        return (static_cast<std::uint64_t>(b) << 32) | static_cast<std::uint64_t>(c) ;
    }
   
    //=============================================================================
    std::uint32_t UOPData::HashAdler32(const std::vector<unsigned char> &data) const {
        auto d = reinterpret_cast<const char*>(data.data());
        auto length = data.size() ;
        return HashAdler32(d, length);
    }
    //=============================================================================
    std::uint32_t UOPData::HashAdler32(const char* d, std::uintmax_t length ) const  {
        std::uint32_t a = 1 ;
        std::uint32_t b = 0 ;
        for (std::uintmax_t i = 0 ; i < length; i++){
            // Treat each byte as unsigned, and reduce the running sum (not the byte) modulo 65521
            a = (a + static_cast<unsigned char>(d[i])) % 65521 ;
            b = (b + a) % 65521 ;
        }
        return (b<<16) | a ;
    }

    /************************************************************************
                    Hash string formatting
     ***********************************************************************/

    //=============================================================================
    std::string UOPData::format(const std::string& hashformat, std::size_t index){
        // How much do we pad?  Find the substitution placeholder
        auto pos = hashformat.find_first_of("{") ;
        if (pos == std::string::npos){
            // we are not substituting anything, pass on the string
            return hashformat ;
        }
       
        auto loc = hashformat.find_first_of("}",pos+1) ;
        if (loc == std::string::npos){
            // we are not substituting anything, pass on the string
            return hashformat ;
        }
        auto sub = strutil::numtostr(index,10,false,strutil::strtoi(hashformat.substr(pos+1,loc-(pos+1))));
        auto rvalue = hashformat;
        return rvalue.replace(pos, (loc-pos)+1, sub);
    }
    //=============================================================================
    std::map<std::uint64_t,std::size_t> UOPData::buildIdentifiers(const std::string &hashstring,std::size_t number_entries){
        std::map<std::uint64_t,std::size_t> hashes ;
        if (hashstring.empty()){
            return hashes;
        }
        for (std::size_t i = 0 ; i < number_entries;i++){
            auto formatted = format(hashstring,i);
            auto hash = HashLittle2(formatted);
            hashes.insert_or_assign(hash, i);
        }
        return hashes ;
    }
    //=============================================================================
    std::size_t UOPData::retrieveIndex(std::uint64_t identifer, const std::map<std::uint64_t,std::size_t> &lookup1,const std::map<std::uint64_t,std::size_t> &lookup2 ) const {
       
        auto iter = lookup1.find(identifer) ;
        if (iter == lookup1.end()){
            iter = lookup2.find(identifer);
            // The second lookup must be checked against its own end iterator
            if (iter == lookup2.end()){
                throw std::out_of_range("Identifier "s + strutil::numtostr(identifer,16,true,8)+ " not found");
            }
            return iter->second;
        }
        return iter->second ;
    }
   
    /*************************** Index gather/writers ******************/
    //=============================================================================
    std::vector<UOPData::TableEntry> UOPData::gatherEntries(std::ifstream &input, std::uint64_t offset){
        std::vector<TableEntry> entries ;
        input.seekg(offset,std::ios::beg);
        auto entry_count = static_cast<std::uint32_t>(0);
       
        while ((offset != 0 ) && (!input.eof()) && input.good()){
            // Read in the number of entries, and next table offset
            input.read(reinterpret_cast<char*>(&entry_count),sizeof(entry_count));
            input.read(reinterpret_cast<char*>(&offset),sizeof(offset));
            if (!input.good()){
                // The table header could not be read, so the count is not trustworthy
                break ;
            }
            for (std::uint32_t i = 0 ; i< entry_count;i++){
                TableEntry entry ;
                entry.load(input);
                entries.push_back(entry);
            }
            if (offset != 0){
                input.seekg(offset,std::ios::beg);
            }
        }
       
        return entries ;
    }
    //=======================================================================
    // Builds a table (with the amount of entries) in the output stream.  Returns a vector of entry offsets
    std::vector<std::uint64_t> UOPData::buildTable(std::uint32_t entrycount,std::ofstream &output){
        // Placeholder value for the next table offset
        std::uint64_t zero = 0 ;
        // write out the number of entries for this table
        output.write(reinterpret_cast<char*>(&entrycount),4);
        // write a place holder for the next table offset
        output.write(reinterpret_cast<char*>(&zero),8);
       
        std::vector<std::uint64_t> locations ;
        locations.reserve(entrycount);
        TableEntry entry ;
        // For each entry, save the offset it is written to
        while (entrycount>0){
            locations.push_back(output.tellp());
            entry.save(output) ;
            entrycount--;
        }
        return locations;
    }
    //=======================================================================
    // Writes out table entries (and the table) and returns a vector of offsets for each entry
    std::vector<std::uint64_t> UOPData::buildAllTable(std::ofstream &output,std::uint32_t totalentry,std::uint32_t tablecount){
       
        std::vector<std::uint64_t> entry_locations;
        entry_locations.reserve(totalentry);
        auto entrycount = FileHeader::_table_size ; // Set to the max number of entries
        // Modify it on the last table
        // for just the remaining entries
        // This will loop through each table, and build a placeholder for
        // the entries
        for (std::uint32_t i=0;i<tablecount;i++){
            // Save where the next table_offset in the table should go
            // It will be 4 bytes past current (past the number of entries)
            auto position = output.tellp() ;
            position+=4 ;
           
            // If this is our last table, figure out the actual
            // number of entries.  A remainder of zero means the last table is full
            // (this assumes tablecount was rounded up from totalentry/_table_size)
            auto remainder = totalentry%FileHeader::_table_size ;
            if ((i==(tablecount-1)) && (remainder != 0)){
                entrycount = remainder;
            }
            // Now, build the table
            auto locations = buildTable(entrycount, output);
            entry_locations.insert(entry_locations.end(),locations.begin(),locations.end());
           
            // Write the next table offset into table we just did
            std::uint64_t current = output.tellp() ;
            output.seekp(position,std::ios::beg);
            if (i!=(tablecount-1)){
                output.write(reinterpret_cast<char*>(&current),8);
            }
            else {
                std::uint64_t zero = 0 ;
                output.write(reinterpret_cast<char*>(&zero),8);
            }
            output.seekp(current,std::ios::beg);
        }
        return entry_locations ;
    }
    /*************************  Subclass Overrides **************************/
   
    //=============================================================================
    // Special processing for this identifier (if special, return true, else false)
    bool UOPData::specialIdentifer(std::uint64_t identifier, std::vector<unsigned char> &data){
        return false ;
    }
    //=============================================================================
    // Process the data however it means to the subclass
    void UOPData::processData(std::size_t index, std::vector<unsigned char> &data){
       
    }
    //=============================================================================
    // Identifier lookup failed for the entry.  Return true if processing should continue
    bool UOPData::identifierFailed(std::uint64_t identifier, std::vector<unsigned char> &data){
        return false ;
    }
    //=============================================================================
    // Retrieve the key/index, hashstring, compression flag, and data to be written. Count is for the
    // subclass to use (or not) to understand which data is being asked for.  There may not be a one-to-one
    // correlation of index to count if one does not want to save empty indexes (idx entries) in the uop file
    std::tuple<std::size_t,std::string,bool> UOPData::retrieveInfo(std::size_t count, std::vector<unsigned char> &data){
        data.resize(0) ;
        return std::make_tuple(0,"nohash"s,false);
    }

    /************************* Read/Write UOP streams ***********************/
    //=============================================================================
    bool UOPData::readUOP(std::ifstream &input,std::size_t maxindex, const std::string &hash_format1, const std::string &hash_format2 ) {
        if (!input.is_open() ) {
            return false ;
        }
        FileHeader header;
        header.load(input) ;
        if (!input.good() || input.eof() || !header.valid()){
            return false ;
        }
        auto entries = gatherEntries(input, header.table_offset) ;
        if (!input.good() || input.eof() ){
            return false ;
        }
        // Build the identifiers
        auto identifier_mapping1 = buildIdentifiers(hash_format1, maxindex);
        auto identifier_mapping2 = buildIdentifiers(hash_format2, maxindex);
       
        // Process the entries!
        for (auto &entry:entries){
            input.seekg(entry.offset,std::ios::beg);
            auto size = entry.decompressed_length;
            if (entry.compression != 0){
                size = entry.compressed_length ;
            }
            auto data = std::vector<unsigned char>(size,0) ;
            if (size >0){
                input.read(reinterpret_cast<char*>(data.data()),size);
                if (!input.good()){
                    // Short read (truncated or corrupt file), so stop here
                    return false ;
                }
                if (entry.compression != 0){
                    // We need to decompress this
                    data = decompress(data, entry.decompressed_length);
                }
            }
            // Data read and decompressed
            // Time to process it
            // First, see if this identifier is special in some way
            if (!specialIdentifer(entry.identifer, data)){
                // It wasn't, so get the index for it
                try{
                    auto index = retrieveIndex(entry.identifer, identifier_mapping1, identifier_mapping2);
                    processData(static_cast<std::size_t>(index), data);
                }
                catch(...){
                    if (!identifierFailed(entry.identifer, data)){
                        return false ;
                    }
                }
            }
        }
        return true ;
    }
    //=============================================================================
    bool UOPData::writeUOP(std::ofstream &output, std::uint32_t total_entries){
        if (!output.is_open()){
            return false ;
        }
        FileHeader header ;
        header.filecount = total_entries ;
        header.save(output);
        header.pad(output);
        if (!output.good()){
            return false ;
        }
        // Now we need to build a table of entries
        auto table_count = total_entries/FileHeader::_table_size  + ((total_entries%FileHeader::_table_size)!=0?1:0) ;
       
        auto entries = buildAllTable(output, total_entries, table_count);
        if (!output.good()){
            return false ;
        }
        // We now have all tables and entries done.  We just need to update them
        auto count = static_cast<std::size_t>(0) ;
        auto data = std::vector<unsigned char>(0,0) ;
        auto current = output.tellp() ;
        for (auto &offset : entries){
            TableEntry entry ;
            entry.offset = current ;
            const auto &[index,formatstring,compressdata] = retrieveInfo(count,data) ;
            auto hashstring = format(formatstring, index) ;
            entry.identifer = HashLittle2(hashstring);
            entry.compression = (compressdata?1:0) ;
            entry.decompressed_length = static_cast<std::uint32_t>(data.size()) ;
            entry.compressed_length = static_cast<std::uint32_t>(data.size()) ;
            if (compressdata){
                data = compress(data);
                entry.compressed_length = static_cast<std::uint32_t>(data.size()) ;
            }
            entry.data_block_hash = HashAdler32(data);
            // Write out the data ;
            output.write(reinterpret_cast<char*>(data.data()),data.size());
            current = output.tellp() ;
            output.seekp(offset,std::ios::beg);
            entry.save(output);
            output.seekp(current,std::ios::beg);
            // Advance so the subclass is asked for the next item
            count++ ;
        }
       
        return true ;
    }

}
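
For anyone who wants to experiment with the table layout without pulling in the whole class, here is a minimal standalone sketch of the two ideas the writer relies on: splitting the entries across tables, and writing a placeholder for the next-table offset that gets patched once the table's end is known. The helper names (splitIntoTables, writeChainedTables, readU64) are made up for this example and are not part of UOPData. The entry size is passed in as a parameter, and the values written are in native byte order, which assumes a little-endian host just as the code above does.

```cpp
#include <cassert>
#include <cstdint>
#include <cstring>
#include <ostream>
#include <sstream>
#include <string>
#include <vector>

// Hypothetical helper: number of entries in each table when `total`
// entries are split into tables of at most `table_size` entries.
std::vector<std::uint32_t> splitIntoTables(std::uint32_t total, std::uint32_t table_size) {
    assert(table_size > 0);
    std::vector<std::uint32_t> counts;
    while (total > 0) {
        std::uint32_t n = (total < table_size) ? total : table_size;
        counts.push_back(n);
        total -= n;
    }
    return counts;
}

// Hypothetical helper: write each table as
//   [entry count: 4 bytes][next table offset: 8 bytes][blank entries]
// and back-patch the next-table offset once the end of the table is known.
// The last table keeps a next offset of 0. Returns each table's start offset.
std::vector<std::uint64_t> writeChainedTables(std::ostream &output,
                                              const std::vector<std::uint32_t> &counts,
                                              std::size_t entry_size) {
    std::vector<std::uint64_t> starts;
    for (std::size_t i = 0; i < counts.size(); ++i) {
        starts.push_back(static_cast<std::uint64_t>(output.tellp()));
        std::uint32_t count = counts[i];
        std::uint64_t next = 0;  // placeholder, patched below if needed
        output.write(reinterpret_cast<const char*>(&count), 4);
        auto patch_position = output.tellp();
        output.write(reinterpret_cast<const char*>(&next), 8);
        // Blank entries, to be filled in later by the caller
        std::vector<char> blank(count * entry_size, 0);
        output.write(blank.data(), static_cast<std::streamsize>(blank.size()));
        if (i + 1 != counts.size()) {
            // Not the last table: point it at where the next one starts
            auto end = output.tellp();
            next = static_cast<std::uint64_t>(end);
            output.seekp(patch_position);
            output.write(reinterpret_cast<const char*>(&next), 8);
            output.seekp(end);
        }
    }
    return starts;
}

// Read back an 8-byte value (native byte order) from a byte string.
std::uint64_t readU64(const std::string &bytes, std::size_t offset) {
    std::uint64_t value = 0;
    std::memcpy(&value, bytes.data() + offset, sizeof(value));
    return value;
}
```

With 250 entries and a table size of 100, splitIntoTables gives tables of 100, 100 and 50 entries. With exactly 200 entries the last table holds 100 entries rather than 0, which is the edge case to watch for when computing the remainder. Writing into a std::ostringstream and reading the offsets back with readU64 is a cheap way to check the chain before touching a real file.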