Written by [Author's name removed by Pete Davis at author's request.]
Edited by Pete Davis                                          8/1/93


Editor's Note: The author works on a project related to WinHelp. This is
information he put together during the time I was working on the WinHelp
articles for Dr. Dobb's Journal. This information covers the |TOPIC
file format. He and I were pursuing different ends, so he has insights
into areas I didn't cover and a different way of looking at some of the
things I did.

	Because the project the author is working on is supposed to be hush-hush,
he has requested that his name and any references to his company be
removed from all distributed information.


--------------------------------------------------------------------------


Topic file paging
-----------------

The TOPIC file is divided into 4K pages.

Each 4K page starts with the following 3-dword header:


  typedef struct {
    DWORD dwPrevRecord;         //Last record in previous 4K page
    DWORD dwFirstTopic;         //First topic in this 4K page
    DWORD dwLastTopic;          //Last topic in prev 4K page
    } TOPICPAGEHEADER;

dwPrevRecord is the address of the last record in the previous 4K page
dwFirstTopic is the address of the first Topic record in this 4K page
dwLastTopic is the address of the last topic record in the previous 4K page

The topic file page header is followed by one or more records.
Each record consists of a record of topic title data, or topic data.
The records for a particular topic may be spread over more than one page.
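
As a minimal sketch of reading this header, assuming the 4K page is already
loaded into a byte buffer and that values are stored in little-endian (Intel)
byte order.  The TopicPageHeader type and the read_le32 and read_page_header
helpers are names made up for this illustration, and the stdint types stand
in for DWORD:

```c
#include <stdint.h>

#define TOPIC_PAGE_SIZE 4096u   /* 4K pages, as stated above */

typedef struct {
    uint32_t dwPrevRecord;      /* last record in previous 4K page */
    uint32_t dwFirstTopic;      /* first topic record in this 4K page */
    uint32_t dwLastTopic;       /* last topic record in previous 4K page */
} TopicPageHeader;

/* Read a little-endian 32-bit value from a byte buffer. */
static uint32_t read_le32(const uint8_t *p)
{
    return (uint32_t)p[0] | ((uint32_t)p[1] << 8) |
           ((uint32_t)p[2] << 16) | ((uint32_t)p[3] << 24);
}

/* Decode the 12-byte header found at the start of one page. */
static TopicPageHeader read_page_header(const uint8_t *page)
{
    TopicPageHeader h;
    h.dwPrevRecord = read_le32(page + 0);
    h.dwFirstTopic = read_le32(page + 4);
    h.dwLastTopic  = read_le32(page + 8);
    return h;
}
```

The first record of the page then starts 12 bytes in, right after the header.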

General Topic Record Format
---------------------------

Each topic consists of a single topic title record followed by one
or more topic data records.  All records are stored in a doubly-linked list.

Each record has a standard record header that contains the following:


   typedef struct {
     DWORD dwRecordSize;        //Size of record
     DWORD dwDataSize;          //Size of data
     DWORD dwPrev;              //Previous record pointer
     DWORD dwNext;              //Next record pointer
     DWORD dwDataOffset;        //Offset to data
     BYTE bRecordType;          //Type of record
     } TOPICRECHEADER;

The byte bRecordType determines the type of record this is.
The known types are:

  0x02  - Topic title (start of a new topic)
  0x20  - Topic contents
  0x23  - Topic contents for a table

  (there are probably others I haven't run into yet)
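
A rough sketch of decoding this header from raw bytes follows.  It assumes
the structure is packed on disk (5 DWORDs plus 1 BYTE = 21 bytes, no padding)
and little-endian.  The enum names and the read_rec_header and rec_type_name
helpers are invented for the example:

```c
#include <stdint.h>

/* Size on disk: five DWORDs plus one BYTE, assuming no padding. */
#define TOPICREC_HEADER_SIZE 21u

/* Record types listed above; the enum names are illustrative only. */
enum {
    REC_TOPIC_TITLE    = 0x02,
    REC_TOPIC_CONTENTS = 0x20,
    REC_TOPIC_TABLE    = 0x23
};

typedef struct {
    uint32_t dwRecordSize;
    uint32_t dwDataSize;
    uint32_t dwPrev;
    uint32_t dwNext;
    uint32_t dwDataOffset;
    uint8_t  bRecordType;
} TopicRecHeader;

/* Read a little-endian 32-bit value from a byte buffer. */
static uint32_t read_le32(const uint8_t *p)
{
    return (uint32_t)p[0] | ((uint32_t)p[1] << 8) |
           ((uint32_t)p[2] << 16) | ((uint32_t)p[3] << 24);
}

/* Decode the 21-byte record header field by field. */
static TopicRecHeader read_rec_header(const uint8_t *p)
{
    TopicRecHeader h;
    h.dwRecordSize = read_le32(p + 0);
    h.dwDataSize   = read_le32(p + 4);
    h.dwPrev       = read_le32(p + 8);
    h.dwNext       = read_le32(p + 12);
    h.dwDataOffset = read_le32(p + 16);
    h.bRecordType  = p[20];
    return h;
}

/* Map a record type byte to a printable name for a dumper. */
static const char *rec_type_name(uint8_t type)
{
    switch (type) {
    case REC_TOPIC_TITLE:    return "topic title";
    case REC_TOPIC_CONTENTS: return "topic contents";
    case REC_TOPIC_TABLE:    return "topic contents (table)";
    default:                 return "unknown";
    }
}
```

Reading field by field, rather than casting the buffer to a struct, avoids
depending on the compiler's structure padding.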


The Topic Title Record
----------------------

The Topic Title records (type 0x02) consist of the following title
record header, followed by the title itself:

 typedef struct {
   DWORD dwNextTopicOffset;     //Offset to next topic
   DWORD dwReserved1;           //Reserved
   DWORD dwReserved2;           //Reserved
   DWORD dwTopicId;             //Topic id
   DWORD dwTopicNonScroll;      //Pointer to non-scroll region of topic
   DWORD dwTopicData;           //Pointer to topic data (scrolling region)
   DWORD dwNextTopicPointer;    //Pointer to next topic
   } TOPICTITLERECHEADER;

The dwTopicId value seems to be an internal topic id that just increments
for each topic.


The dwTopicNonScroll field is the pointer to the first contents record for
the non-scrolling region of the topic, or 0xFFFFFFFF if there is no
non-scroll region.

The dwTopicData field is the pointer to the first contents record for
the topic -- i.e., the pointer to the first record of the scrolling portion
of the topic.

The dwNextTopicPointer points to the topic title record for the next topic.

This record is immediately followed by the ASCII topic title.  Since the
dwDataOffset of TOPICRECHEADER contains the pointer to the topic title for
this record, and dwRecordSize contains the total size, the title is not
zero terminated.
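
Since the title has no terminator, its length has to come from the sizes.
Here is a sketch of one way to pull it out, assuming both headers are packed
(21 bytes for TOPICRECHEADER, 28 bytes for TOPICTITLERECHEADER) and that the
title fills the rest of the record.  copy_topic_title and the two size
constants are names made up for this example:

```c
#include <stdint.h>
#include <stddef.h>
#include <string.h>

#define TOPICREC_HEADER_SIZE 21u   /* 5 DWORDs + 1 BYTE, assumed packed */
#define TITLEREC_HEADER_SIZE 28u   /* 7 DWORDs */

/* Copy the title of a type 0x02 record into dst and add a terminating
   NUL.  rec points at the start of the record, recordSize is its
   dwRecordSize.  Returns the number of title bytes copied. */
static size_t copy_topic_title(const uint8_t *rec, uint32_t recordSize,
                               char *dst, size_t dstSize)
{
    const uint32_t fixed = TOPICREC_HEADER_SIZE + TITLEREC_HEADER_SIZE;
    size_t len;

    if (dstSize == 0)
        return 0;
    len = (recordSize > fixed) ? (size_t)(recordSize - fixed) : 0;
    if (len > dstSize - 1)
        len = dstSize - 1;          /* truncate to fit the buffer */
    memcpy(dst, rec + fixed, len);
    dst[len] = '\0';
    return len;
}
```

A record whose size is exactly 49 bytes comes out with an empty title.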


The Topic Contents Records
--------------------------

The textual contents of the topic are stored in one or more topic data
records.  Each topic data record consists of a record header, followed
by the record data.  The record header is the same for both types.
The only difference is that the type 0x23 has a list of text data,
whereas the type 0x20 has only 1 text data.  Conceptually, the type
0x20 can be thought of as a type 0x23 with 1 column, with slight differences.


The Type 0x20 and 0x23 Topic Contents Record
--------------------------------------------

The record header for a type 0x20 or 0x23 record starts with the following:

    BYTE bUnknown0;			//Unknown
    BYTE bUnknown1;			//Unknown
    BYTE bTwiceData;			//Twice data size

  bUnknown0 is an unknown value that seems to take on various values.
  bUnknown1 is probably flags since this seems to be 0x80 or 0x81.

  bTwiceData is twice the size of the data (usually twice
  TOPICRECHEADER.dwDataSize, but not always!).  This value is used
  to calculate hotspot and keyword pointers (see below).

  If the lsb of bTwiceData is set, then this indicates that the data size
  cannot fit in a single byte and that an additional byte of size is to
  be read in.  The second byte is taken as the high byte of the value.
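
The one-or-two byte rule can be sketched like this.  The function names
read_twice_data and twice_to_size are made up for the example, and p is
assumed to point at the bTwiceData byte:

```c
#include <stdint.h>
#include <stddef.h>

/* Decode bTwiceData at p.  Stores the raw (still doubled) value in
   *twice and returns the number of bytes it occupied: 1 or 2. */
static size_t read_twice_data(const uint8_t *p, uint16_t *twice)
{
    if (p[0] & 0x01u) {
        /* lsb set: a second byte follows and is the high byte */
        *twice = (uint16_t)(p[0] | ((uint16_t)p[1] << 8));
        return 2;
    }
    *twice = p[0];
    return 1;
}

/* Shift right by one to get the size; this also drops the flag bit. */
static uint16_t twice_to_size(uint16_t twice)
{
    return (uint16_t)(twice >> 1);
}
```

For instance, the single byte AA decodes to a size of 0x55, and the byte pair
1D 05 decodes to 0x051D, which gives a size of 0x28E.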

After this the following data is read in:

    BYTE bColumnCount;                  //# of columns

  The number of columns is 0 for a type 0x20 record.

If this is a type 0x23 table record, then this value is immediately
followed by a table of DWORDs that contain the widths of each column.
For a table with n columns, there are n DWORD values here.
A type 0x20 record does not store this value as the column width is not fixed.
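
A small sketch of reading the column count and width table as described
here, assuming little-endian DWORDs.  read_column_table and MAX_COLUMNS are
names invented for the example:

```c
#include <stdint.h>
#include <stddef.h>

#define MAX_COLUMNS 255u   /* bColumnCount is a single byte */

/* Read bColumnCount and the DWORD width table that follows it.
   Returns the number of bytes consumed (1 + 4 * count), or 0 if the
   buffer is too short.  A type 0x20 record has a count of 0, so only
   the count byte itself is consumed. */
static size_t read_column_table(const uint8_t *p, size_t len,
                                uint8_t *count, uint32_t widths[MAX_COLUMNS])
{
    size_t need;
    size_t i;

    if (len < 1)
        return 0;
    *count = p[0];
    need = 1 + 4 * (size_t)*count;
    if (len < need)
        return 0;
    for (i = 0; i < *count; i++) {
        const uint8_t *w = p + 1 + 4 * i;
        widths[i] = (uint32_t)w[0] | ((uint32_t)w[1] << 8) |
                    ((uint32_t)w[2] << 16) | ((uint32_t)w[3] << 24);
    }
    return need;
}
```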

After this there is data for each column of data.  For a type 0x23 record,
there will be data for multiple columns consisting of a column data header
followed by the paragraph attributes and segments of the column.  For a type
0x20 record, there is only the paragraph and attribute data for the text.

  The data header for a type 0x23 record contains the following:

    WORD wColumnNo;                     //Column number
    WORD wUnknown;                      //Unknown word value (or 2 bytes)
    WORD wUnknown1;                     //Unknown word value (or 2 bytes)

  wColumnNo is set to 0xFFFF if this is the end of the list.

  This is followed by the following data, present in both type 0x20 and
  type 0x23 records:


    BYTE bUnknown2;                     //Unknown, usually 80.

  This is immediately followed by the paragraph attribs (and params):

    DWORD dwParAttribs;                 //Paragraph attributes

  If any parameter bits are set, then this is followed by additional
  information specific to the individual option.

  See below for a list of attributes and parameters.


  This record is immediately followed by the list of attributes to
  apply to the text (as in the 0x20 record type).
  This list is described below.


The text itself follows after all the data for the columns.


Type 0x20 and type 0x23 data comparison
---------------------------------------

Here's a summary comparison of type 0x20 and type 0x23 records:


  Type 0x20                             Type 0x23
  ---------------------------           -------------------------------

  TOPICRECHEADER TopicRecHeader;        TOPICRECHEADER TopicRecHeader;

  BYTE bUnknown0;                       BYTE bUnknown0;
  BYTE bUnknown1;                       BYTE bUnknown1;
  BYTE bTwiceData;                      BYTE bTwiceData;
  <high byte of twicedata if needed>    <high byte of twicedata if needed>

  BYTE 00;                              BYTE bColumnCount;

                                        DWORD dwColumnWidth;

  Only one instance:                    For each column:
                                       
                                         WORD wColumnNo (0xFFFF for end);
                                         WORD wUnknown1;
                                         WORD wUnknown2;
                                       
    BYTE bUnknown2;                      BYTE bUnknown2;
    DWORD dwParAttribs;                  DWORD dwParAttribs;
    <any attrib parameters>              <any attrib parameters>
                                       
    <attribute list>                     <attribute list>
    <FF terminated>                      <FF terminated>
  \----------------------               \------------------------

  <topic text>                          <topic text>




Paragraph attributes
--------------------


  #define PAR_JUSTCENTER  0x08000000      //Center justified
  #define PAR_JUSTRIGHT   0x04000000      //Right justified

  #define PAR_BORDERED    0x01000000      //Paragraph bordered

  #define PAR_FIRSTINDENT 0x00400000      //First line indent
  #define PAR_RIGHTINDENT 0x00200000      //Right margin indent
  #define PAR_LEFTINDENT  0x00100000      //Left margin indent

  #define PAR_LINESPACING 0x00080000      //Line spacing
  #define PAR_SPACEAFTER  0x00040000      //Space after
  #define PAR_SPACEBEFORE 0x00020000      //Space before


I'm still determining the formats for these, so I'll get you more on this
later.

The paragraph attributes are immediately followed by any additional data,
such as indent amounts, etc., depending on the bits set.

The order seems to be to read data for the bits, starting from lsb.
The order (so far) is:

  - byte indicating spacing before
  - byte indicating spacing after
  - byte indicating line spacing
  - byte indicating left indent
  - byte indicating right indent
  - byte indicating first line indent
  - 3 bytes indicating borders.

If borders are used, the first byte of the border data is the types of
borders:

  #define BORDER_DOT      0x80            //Dotted border
  #define BORDER_DOUBLE   0x40            //Double border
  #define BORDER_THICK    0x20            //Thick border
  #define BORDER_RIGHT    0x10            //Right border
  #define BORDER_BOTTOM   0x08            //Bottom border
  #define BORDER_LEFT     0x04            //Left border
  #define BORDER_TOP      0x02            //Top border
  #define BORDER_BOX      0x01            //Boxed border

The following word is unknown.
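
The reading order above could be sketched as follows.  This assumes one byte
per parameter, which is only what has been observed so far, and that the
unknown border word is little-endian.  The ParParams struct and
read_par_params are hypothetical names:

```c
#include <stdint.h>
#include <stddef.h>
#include <string.h>

#define PAR_BORDERED    0x01000000u
#define PAR_FIRSTINDENT 0x00400000u
#define PAR_RIGHTINDENT 0x00200000u
#define PAR_LEFTINDENT  0x00100000u
#define PAR_LINESPACING 0x00080000u
#define PAR_SPACEAFTER  0x00040000u
#define PAR_SPACEBEFORE 0x00020000u

typedef struct {
    uint8_t  spaceBefore, spaceAfter, lineSpacing;
    uint8_t  leftIndent, rightIndent, firstIndent;
    uint8_t  borderType;      /* BORDER_* bits */
    uint16_t borderUnknown;   /* the unknown word after the border type */
} ParParams;

/* Read the optional parameters that follow dwParAttribs, lowest bit
   first.  Fields whose bit is clear are left at 0.  Returns the number
   of bytes consumed from p. */
static size_t read_par_params(uint32_t attribs, const uint8_t *p,
                              ParParams *out)
{
    size_t n = 0;

    memset(out, 0, sizeof *out);
    if (attribs & PAR_SPACEBEFORE) out->spaceBefore = p[n++];
    if (attribs & PAR_SPACEAFTER)  out->spaceAfter  = p[n++];
    if (attribs & PAR_LINESPACING) out->lineSpacing = p[n++];
    if (attribs & PAR_LEFTINDENT)  out->leftIndent  = p[n++];
    if (attribs & PAR_RIGHTINDENT) out->rightIndent = p[n++];
    if (attribs & PAR_FIRSTINDENT) out->firstIndent = p[n++];
    if (attribs & PAR_BORDERED) {
        out->borderType    = p[n];
        out->borderUnknown = (uint16_t)(p[n + 1] | ((uint16_t)p[n + 2] << 8));
        n += 3;
    }
    return n;
}
```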




Topic Attribute Data
--------------------


This is followed by the data for the topic contents.

The data is divided into two parts.
The first is a table of attributes that are assigned to segments of the
topic text.  The second part is a list of 0-terminated segments of text.
There is a one-to-one correspondence between the attribute list and the
text segment list.  The first attribute is assigned to the first segment,
the second attribute to the second segment, etc.

The known attributes are:


  0x80    Defines a font change.  Followed by word font id which is the
            index into the font attribute table from the FONT file.
  0x81    Segment starts on a new line.
  0x82    Segment starts a new paragraph.
  0x83    Segment starts after a tab
  0x89    Segment ends a hotlink
  0xE2    Segment starts a pop-up hotlink.  Followed by hotlink hash.
  0xE3    Segment starts a normal hotlink.  Followed by hotlink hash.
  0xFF    Ends the list of attributes
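
A sketch of walking this list follows.  The font id is read as a 2-byte
little-endian word.  The size of the hotlink hash operand is not stated
above; the code assumes 4 bytes because the hash values in the CONTEXT file
are 32 bits wide.  TopicAttr and parse_attrs are names made up for the
example, and any attribute code not in the list above is treated as an error
since its operand size is unknown:

```c
#include <stdint.h>
#include <stddef.h>

typedef struct {
    uint8_t  code;      /* attribute byte, e.g. 0x80 or 0xE3 */
    uint32_t operand;   /* font id or hotlink hash, 0 if none */
} TopicAttr;

/* Decode attributes from p until the 0xFF terminator.  Stores up to
   maxOut entries in out and the number stored in *count.  Returns the
   bytes consumed including the terminator, or 0 on an unknown code,
   a full output array, or a missing terminator. */
static size_t parse_attrs(const uint8_t *p, size_t len,
                          TopicAttr *out, size_t maxOut, size_t *count)
{
    size_t i = 0;

    *count = 0;
    while (i < len) {
        uint8_t code = p[i++];
        uint32_t operand = 0;
        size_t need = 0, k;

        if (code == 0xFF)
            return i;
        switch (code) {
        case 0x80:            need = 2; break;  /* word font id */
        case 0xE2: case 0xE3: need = 4; break;  /* hash (assumed 4 bytes) */
        case 0x81: case 0x82: case 0x83: case 0x89: break;
        default:              return 0;
        }
        if (len - i < need || *count == maxOut)
            return 0;
        for (k = 0; k < need; k++)
            operand |= (uint32_t)p[i + k] << (8 * k);
        i += need;
        out[*count].code = code;
        out[*count].operand = operand;
        (*count)++;
    }
    return 0;
}
```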

Paragraph attribute changes seem to be done by using a new topic data record,
since the paragraph attribs are in the TOPICDATARECHEADER structure.  For
example, when you place a box around a paragraph the paragraph starts on a
new record.  If paragraph attribs do not change between pars, then there
can be multiple pars per record, with the 0x82 attrib separating the pars.


Links
-----

The links used in these structures (all fields in TOPICPAGEHEADER;
dwPrev, dwNext and dwDataOffset in TOPICRECHEADER; and dwNextTopicPointer
in TOPICTITLERECHEADER) are not straight file offsets.  Instead they use
a page and offset scheme similar to segmentation.  The low 12 bits (3 hex
digits) are an offset into a 4K page, while the upper 20 bits (5 hex digits)
are a 1K page number:


     3                  1 1
     1                  2 1          0
     ---------------------------------
         1K Page #        Offset    
     ---------------------------------


To convert to an offset into the TOPIC file, use the following:

  #define ConvertPointer(dwPtr)  \
        ( (((dwPtr) & 0xFFFFF000) >> 2) + ((dwPtr) & 0x00000FFF) )

Note that although the topic file itself is divided into 4K pages, the
pointers use a 1K page granularity.  This may be a relic from the past.
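
The same conversion written as functions, to make the two parts of the link
explicit.  The function names are made up for this sketch, and uint32_t
stands in for DWORD:

```c
#include <stdint.h>

/* Upper 20 bits: page number, counted in 1K units. */
static uint32_t link_page(uint32_t dwPtr)
{
    return dwPtr >> 12;
}

/* Lower 12 bits: offset added to the page base. */
static uint32_t link_offset(uint32_t dwPtr)
{
    return dwPtr & 0x00000FFFu;
}

/* Same result as the ConvertPointer macro: page * 1024 + offset. */
static uint32_t link_to_file_offset(uint32_t dwPtr)
{
    return link_page(dwPtr) * 1024u + link_offset(dwPtr);
}
```

For example, the link 0x0000004D converts to file offset 0x4D, 0x00001000
converts to 0x400, and 0x00002010 converts to 0x810.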


Hotspots and keywords
---------------------

Hotspots are stored using the 0xE2 or 0xE3 attribute.  They are associated
with a hotspot hash value, which must be looked up in the CONTEXT file.
The hash is associated with a pointer.  This pointer is the same pointer
that is used in the keyword file.

The pointers for hotspots and keywords are NOT encoded using the above method.
Instead they work like this:

The hotspot link is formatted similarly to the links above, except that the
page number is shifted further left (it starts at bit 15 rather than bit 12),
and the offset is not a direct offset to the text but is used differently
(described below).  The format is:

     3               1 1
     1               5 4             0
     ---------------------------------
         4K Page #        Offset    
     ---------------------------------


The page number indicates the page that the desired topic can be found on.

The offset is the sum of the data sizes of the type 0x20 and 0x23 records
in the 4K page prior to the desired record.  The data size is not taken
from the dwDataSize field of the TOPICRECHEADER structure, but is instead
added from the bTwiceData fields of the type 0x20 and type 0x23 records.

In other words, to determine the record that is being pointed to by the
hotlink pointer, you read in the 4K page indicated in the pointer, then
go through the page record by record, adding the sizes and keeping a total.
For each type 0x20 or type 0x23 record, you take the value in bTwiceData
(which may be a byte or word, depending on the lsb), shift it to divide by
two, then add it to the running total.  At the start of each record, you
check the current total to see if it equals the desired pointer, and if so
then this is the record you want.
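
The walk can be sketched like this.  It assumes the records of the page have
already been collected into an array (the RecSummary struct is hypothetical,
the sort of thing a dumper would build) and that bTwiceData has been decoded
into a single 16-bit value.  The page/offset split follows the bit layout in
the diagram above:

```c
#include <stdint.h>
#include <stddef.h>

/* Summary of one record in a 4K page, as a dumper might collect it.
   This struct is not part of the file format. */
typedef struct {
    uint32_t address;      /* file offset of the record */
    uint8_t  bRecordType;  /* 0x02, 0x20, 0x23, ... */
    uint16_t twiceData;    /* decoded bTwiceData; 0 for title records */
} RecSummary;

/* Split a hotspot/keyword pointer: bits 31..15 page, bits 14..0 offset. */
static uint32_t hotspot_page(uint32_t ptr)   { return ptr >> 15; }
static uint32_t hotspot_offset(uint32_t ptr) { return ptr & 0x7FFFu; }

/* Walk the records of one page in order, keeping a running total of
   the halved bTwiceData values of type 0x20 and 0x23 records.  Returns
   the index of the first record at whose start the total equals
   offset, or -1 if there is none. */
static int find_hotspot_record(const RecSummary *recs, size_t n,
                               uint32_t offset)
{
    uint32_t total = 0;
    size_t i;

    for (i = 0; i < n; i++) {
        if (total == offset)
            return (int)i;
        if (recs[i].bRecordType == 0x20 || recs[i].bRecordType == 0x23)
            total += (uint32_t)(recs[i].twiceData >> 1);
    }
    return -1;
}
```

Title records add nothing to the total, so a pointer that matches at a topic
title record also matches the contents record right after it.  Returning the
first match picks the title record.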

Here is an example to better explain.

I have a small test help file with the following records (I'm just
showing the relevant data here, not all data):


  ----------------------------------------------

  Hotspot Address = 00000000 (initial)

  Address         = 0000000C
  bRecordType     = 02 (Topic title record)

    New Hotspot Address = 00000000 (unchanged)

    Address         = 0000004D
    bRecordType     = 20 (Topic data record)
    bTwiceData      = AA

    New Hotspot Address = 00000000 + AA/2 = 00000000+55 = 00000055

  ----------------------------------------------

  Address         = 000000CD
  bRecordType     = 02 (Topic title record)

    New Hotspot Address = 00000055 (unchanged)

    Address         = 0000010E
    bRecordType     = 20 (Topic data record)
    bTwiceData      = CC

    New Hotspot Address = 00000055 + CC/2 = 00000055+66 = 000000BB

    Address         = 00000198
    bRecordType     = 23 (Topic data record for table)
    bTwiceData      = 72

    New Hotspot Address = 000000BB + 72/2 = 000000BB+39 = 000000F4

    Address         = 00000234
    bRecordType     = 23 (Topic data record for table)
    bTwiceData      = 4C

    New Hotspot Address = 000000F4 + 4C/2 = 000000F4+26 = 0000011A

    Address         = 000002BD
    bRecordType     = 23 (Topic data record for table)
    bTwiceData      = 1D
    bTwiceDataHigh  = 05

    New Hotspot Address = 0000011A + 51D/2 = 0000011A+28E = 000003A8

    Address         = 000005AF
    bRecordType     = 20 (Topic data record)
    bTwiceData      = D6

    New Hotspot Address = 000003A8 + D6/2 = 000003A8+6B = 00000413


  ----------------------------------------------

  Address         = 0000063E
  bRecordType     = 02 (Topic title record)

    New Hotspot Address = 00000413 (unchanged)


    Address         = 0000066F
    bRecordType     = 20
    bTwiceData      = 04

    New Hotspot Address = 00000413 + 04/2 = 00000413+02 = 00000415


  ----------------------------------------------

  Address         = 00000693
  bRecordType     = 02 (Topic title record)
  ----------------------------------------------


Notice that the hotspot address calculated at the start of each topic
record (indicated by dashed lines above) is:

  First topic . . . . . . 00000000
  Second topic  . . . . . 00000055
  Third topic . . . . . . 00000413
  Last topic  . . . . . . 00000415


Now, the topic title list for this file (from the TTLBTREE subfile)
contains the following:


  Title                                                  Topic
  ------------------------------------------------------ --------
  TopicTitle_Test1                                       00000000
  TopicTitle_Test2                                       00000055
  <<untitled>>                                           00000413
  <<untitled>>                                           00000415

The keyword list for the file contains the following data:

  Keyword                                Topic(s)
  -------------------------------------- --------
  Keyword_Test1                          00000000
  Keyword_Test2                          00000055
  Keyword_Test2a                         00000055

The hash table for the hotlinks in the file (from CONTEXT) contains:

  Hash val  Offset
  --------  --------
  6325E5E5  00000000
  6325E5E6  00000055


This same method works when going past the end of a page boundary, as
long as the page number is kept but the pointer part of the hotlink
is reset to 0 at the top of each 4K page.


I've added stuff to my dumper to display the hotlink pointers, and so
far they check out.  I'll keep you up to date as I find out more.

[Author's name removed by Pete Davis at author's request]

