1. Dummy Journey to Write a Compiler Using RapidQ #935 Posted by: 2003-05-18 07:05:27 | Dear RapidQ friends,
To moderator: if you find these notes useful, please make them available as a text file for download.
We have seen a few attempts at cloning the RapidQ compiler before, but they froze and nothing has progressed since. The reason is that few RapidQ programmers can participate, since not all of them know C/C++ or assembly language, which is what the clone compilers are written in. With this limitation, fewer people are able to join and improve the project.
If we do it in the language most of us know, the BASIC of RapidQ, I believe we can make more progress. For example, some RapidQ regulars have contributed a great deal with UDTs (user-defined types) and have added features to RapidQ through INCLUDE files that are available for download now.
Some people say we can't write a compiler using RapidQ, but I believe otherwise. RapidQ can read and write binary file streams, and I guess this is a hidden feature we have never fully used. With a hex editor, the Intel instruction set references and the PE format documentation, all available on the internet, I think we are able to create something like a small new compiler or an include layer for RapidQ itself. It looks delicious, but it will take time, with lots of testing and sharing of source among us.
Building on the binary file stream idea: once we know what the PE format and the machine instructions require, we could write another set of includes (of course using UDTs and QFileStream) that acts as a compiler layer producing small-footprint executables from a new RapidQ IDE. With the source in RapidQ, it will be accessible to more RapidQ programmers (novices or experts), who can test it and produce small binaries for a start. Maybe we begin with a "Hello World" dialog compiler.
I've seen in NASMW a small code sample that produces a "Hello" dialog executable. Since not all of us know ASM well, including me, don't look at the NASM source code; instead take the "hello" EXE binary and examine it in a hex editor. We break it into a few blocks and isolate which are the PE header, the data and so on. The EXE is just 616 bytes, but it will take us a long, long time to analyse. It is worth it even if we do not succeed. If we do succeed, we will be able to make one UDT routine that writes the PE format using the QFileStream binary read/write functions, and another UDT that translates BASIC syntax to hex code. Have you ever heard the phrase 'when I type 101011, your life will change'?
Okay, maybe some of you did not understand the idea above. Here is a simple explanation of the "Hello" executable produced by NASMW, based on its source code:
[quote] 1. The Hello dialog executable contains a PE header, a format understood by the OS loader. Because of that, the file can be executed under the OS (e.g. Windows).
2. The file also contains an import of a Windows API function from User32.dll, asking to borrow the dialog function named MessageBoxA so that a dialog is drawn on the desktop. The call feeds the string "Hello World!" to be shown in the dialog box. [/quote]
Below is the Hello World source code in NASMW:
-----------------------------------------------------------
; To assemble, just run " nasmw hello.asm -o hello.exe "
; there
IMAGEBASE equ 0x400000
SECTIONALIGN equ 4096
FILEALIGN equ 512
SECTION_RVA equ 4096
a equ 0x400e00
dw 'MZ' ; DOS signature (bytes 4d 5a at offset 0)
times 29 dw 0
dd 0x40
db 'PE',0,0 ; PE signature (offset 0x40)
dw 0x14c
dw 1
dd 0,0,0
dw 0xe0
dw 0x102
dw 0x10b
dw 0
dd 0,0,0
dd Main-SECTION_START+SECTION_RVA
dd 0,0
dd IMAGEBASE
dd SECTIONALIGN
dd FILEALIGN
dw 0,0,0,0
dw 4
dw 0
dd 0
dd SECTION_END-SECTION_START
dd SECTION_START
dd 0
dw 2
dw 0
dd 0x1000
dd 0x1000
dd 0,0,0
dd 0x10
dd 0,0
dd SECTION_RVA
dd IMPORT_DESCRIPTOR_END-SECTION_START
times 28 dd 0
db 'Flat_exe' ; section name, 8 bytes
dd 0
dd SECTION_RVA
dd SECTION_END-SECTION_START
dd SECTION_START
dd 0,0
dw 0,0
db 0x60,0,0,0xe0
align FILEALIGN,db 0
[BITS 32]
SECTION_START
dd FIRST_THUNK-SECTION_START+SECTION_RVA
dd 0
dd -1
dd DLLNAME-SECTION_START+SECTION_RVA
dd FIRST_THUNK-SECTION_START+SECTION_RVA
times 5 dd 0
DLLNAME db 'user32.dll',0
FIRST_THUNK
MsgBox
dd IMPORT_BY_NAME-SECTION_START+SECTION_RVA
dd 0
IMPORT_BY_NAME
dw 0
db 'MessageBoxA',0
IMPORT_DESCRIPTOR_END
STRING db 'Hello World!',0
Main
XOR edx,edx
mov eax,STRING+a
push edx
push eax
push eax
push edx
CALL DWORD [MsgBox+a]
ret
SECTION_END
------------------------------------------------------------
I've modified the original source code a bit to see what happens; it still works and produces 616 bytes. My point here is not assembly language, though, but how we can use a hex editor to break the file into blocks of sections, and which data needs to change when we produce different output, such as adding an icon resource to the executable.
Here are the bytes (hex) of the hello.exe above:
-Offset------------------ Hex Bytes ----------------------
00000000 4d 5a 00 00 00 00 00 00 00 00 00 00 00 00 00 00
00000010 00 00 00 00 00 00 00 00 00 00 00 00 00 00 00 00
00000020 00 00 00 00 00 00 00 00 00 00 00 00 00 00 00 00
00000030 00 00 00 00 00 00 00 00 00 00 00 00 40 00 00 00
00000040 50 45 00 00 4c 01 01 00 00 00 00 00 00 00 00 00
00000050 00 00 00 00 e0 00 02 01 0b 01 00 00 00 00 00 00
00000060 00 00 00 00 00 00 00 00 56 10 00 00 00 00 00 00
00000070 00 00 00 00 00 00 40 00 00 10 00 00 00 02 00 00
00000080 00 00 00 00 00 00 00 00 04 00 00 00 00 00 00 00
00000090 68 00 00 00 00 02 00 00 00 00 00 00 02 00 00 00
000000A0 00 10 00 00 00 10 00 00 00 00 00 00 00 00 00 00
000000B0 00 00 00 00 10 00 00 00 00 00 00 00 00 00 00 00
000000C0 00 10 00 00 49 00 00 00 00 00 00 00 00 00 00 00
000000D0 00 00 00 00 00 00 00 00 00 00 00 00 00 00 00 00
000000E0 00 00 00 00 00 00 00 00 00 00 00 00 00 00 00 00
000000F0 00 00 00 00 00 00 00 00 00 00 00 00 00 00 00 00
00000100 00 00 00 00 00 00 00 00 00 00 00 00 00 00 00 00
00000110 00 00 00 00 00 00 00 00 00 00 00 00 00 00 00 00
00000120 00 00 00 00 00 00 00 00 00 00 00 00 00 00 00 00
00000130 00 00 00 00 00 00 00 00 46 6c 61 74 5f 65 78 65
00000140 00 00 00 00 00 10 00 00 68 00 00 00 00 02 00 00
00000150 00 00 00 00 00 00 00 00 00 00 00 00 60 00 00 e0
00000160 00 00 00 00 00 00 00 00 00 00 00 00 00 00 00 00
00000170 00 00 00 00 00 00 00 00 00 00 00 00 00 00 00 00
00000180 00 00 00 00 00 00 00 00 00 00 00 00 00 00 00 00
00000190 00 00 00 00 00 00 00 00 00 00 00 00 00 00 00 00
000001A0 00 00 00 00 00 00 00 00 00 00 00 00 00 00 00 00
000001B0 00 00 00 00 00 00 00 00 00 00 00 00 00 00 00 00
000001C0 00 00 00 00 00 00 00 00 00 00 00 00 00 00 00 00
000001D0 00 00 00 00 00 00 00 00 00 00 00 00 00 00 00 00
000001E0 00 00 00 00 00 00 00 00 00 00 00 00 00 00 00 00
000001F0 00 00 00 00 00 00 00 00 00 00 00 00 00 00 00 00
00000200 33 10 00 00 00 00 00 00 ff ff ff ff 28 10 00 00
00000210 33 10 00 00 00 00 00 00 00 00 00 00 00 00 00 00
00000220 00 00 00 00 00 00 00 00 75 73 65 72 33 32 2e 64
00000230 6c 6c 00 3b 10 00 00 00 00 00 00 00 00 4d 65 73
00000240 73 61 67 65 42 6f 78 41 00 48 65 6c 6c 6f 20 57
00000250 6f 72 6c 64 21 00 31 d2 b8 49 10 40 00 52 50 50
00000260 52 ff 15 33 10 40 00 c3
----------------------------------------------------------
Funny though: if you copy the hex bytes above row by row, excluding the offsets, into a hex editor and save the result as hello.exe, you get a working program without having to get nasmw and do the asm coding above. lol! This is ordinary people looking at complicated things...
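The same copy-and-save trick can be done by a program instead of a hex editor. Below is a minimal sketch in Python (standing in here for RapidQ's binary file stream, which this thread has not shown in code); the function name `hexdump_to_bytes` is made up for illustration. It keeps every two-digit token as a byte and skips longer tokens such as the offset column.

```python
def hexdump_to_bytes(text):
    """Turn rows of a hex dump into raw bytes, ignoring any offset column."""
    out = bytearray()
    for row in text.splitlines():
        for token in row.split():
            # Byte values are two hex digits; longer tokens such as
            # 000000A0 are offsets and are skipped.
            if len(token) == 2:
                out.append(int(token, 16))
    return bytes(out)

# Two rows in the style of the dump, one with an offset and one without.
sample = """
00000000 4d 5a 00 00
50 45 00 00 4c 01 01 00
"""
data = hexdump_to_bytes(sample)
print(len(data), data[:2])
```

Feeding it the full dump and writing the result with `open("hello.exe", "wb")` should give the same 616-byte file, assuming the rows were copied without errors.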
Now enough joking. Look at the offset numbers and count the pairs of digits in each row of hex bytes. It looks like this:
0123456789ABCDEF, and each row of hex bytes has 16 columns.
Did you notice that? Maybe assembly programmers understand it easily, but we ordinary people say 'Oh! That's how the machine counts!' If humans count 1 to 10, the machine counts 0 to F, and that makes 16 digits.
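Because every row holds 16 bytes, an offset splits cleanly into a row and a column: the last hex digit is the column and the rest is the row. A small sketch in Python (the helper name `locate` is hypothetical, and both numbers are counted from 0, so a hex editor that counts rows differently will show a shifted row number):

```python
def locate(offset):
    """Return (row, column) of a file offset in a 16-bytes-per-row dump,
    both counted from 0."""
    return divmod(offset, 16)

row, col = locate(0x15C)
print(row, format(col, "X"))
```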
Let's examine the structure of the hello.exe binary based on the data above. Using a PE viewer, I could summarize it in a way familiar to us, as follows (note: this is not source code, just a way of describing things that resembles BASIC. But I guess some smart people here could make source code from it. Remember William Yu's source code for the ZIP file viewer?):
Structure of Hello.exe
Header
Import
END Structure
Let's break down the Header structure
Structure Header
Exe
Coff
Optional
Section
END Structure
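Let's break down the details of the headers above. Each line gives the element, its value, its file offset, its type and the bytes in the order they are stored in the file: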
TYPE Exe
.Signature: MZ 5A4D 00000000 AS SHORT &H4D5A
.Extra Bytes 0000 00000002 AS SHORT &H0000
.Pages 0000 00000004 AS SHORT &H0000
.Reloc Items 0000 00000006 AS SHORT &H0000
.Header Size 0000 00000008 AS SHORT &H0000
.Min Alloc 0000 0000000A AS SHORT &H0000
.Max Alloc 0000 0000000C AS SHORT &H0000
.Initial SS 0000 0000000E AS SHORT &H0000
.Initial SP 0000 00000010 AS SHORT &H0000
.Check Sum 0000 00000012 AS SHORT &H0000
.Initial IP 0000 00000014 AS SHORT &H0000
.Initial CS 0000 00000016 AS SHORT &H0000
.Reloc Table 0000 00000018 AS SHORT &H0000
.Overlay 0000 0000001A AS SHORT &H0000
END TYPE
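One value is missing from the list above: in the NASM listing, `times 29 dw 0` is followed by `dd 0x40`, which puts a LONG at offset 0000003C holding 00000040, the place where the PE signature begins (the PE documentation calls this field e_lfanew). A minimal sketch of writing and reading it back, in Python with made-up names `dos_header` and `E_LFANEW_OFFSET`:

```python
import struct

E_LFANEW_OFFSET = 0x3C  # position of the pointer to the PE signature

def dos_header(pe_offset=0x40):
    """Minimal 64-byte DOS header: 'MZ', zero padding, then the PE offset."""
    return b"MZ" + bytes(E_LFANEW_OFFSET - 2) + struct.pack("<I", pe_offset)

header = dos_header()
# "<I" means a little-endian unsigned LONG, the same byte order as the dump.
(pe_offset,) = struct.unpack_from("<I", header, E_LFANEW_OFFSET)
print(len(header), hex(pe_offset))
```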
TYPE Coff
.Signature: PE 00004550 00000040 AS LONG &H50450000
.Machine: i386 014C 00000044 AS SHORT &H4C01
.Num of Sections 0001 00000046 AS SHORT &H0100
.Time/Date Stamp 00000000 00000048 AS LONG &H00000000
.Symbol Table PTR 00000000 0000004C AS LONG &H00000000
.Num of Symbols 00000000 00000050 AS LONG &H00000000
.Opt Header Size 00E0 00000054 AS SHORT &HE000
.Characteristics 0102 00000056 AS SHORT &H0201
END TYPE
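The last column of the table shows why the bytes look swapped: values are stored low byte first. As a sketch of how a UDT-style writer could produce this block, here is the Coff part packed in Python; `coff_header` is a hypothetical helper, and the default values are simply the ones from the table.

```python
import struct

def coff_header(machine=0x014C, sections=1, opt_header_size=0x00E0,
                characteristics=0x0102):
    """'PE' signature (4 bytes) followed by the 20-byte COFF file header."""
    return b"PE\0\0" + struct.pack(
        "<HHIIIHH",
        machine,          # i386
        sections,         # number of sections
        0,                # time/date stamp
        0,                # symbol table pointer
        0,                # number of symbols
        opt_header_size,  # size of the optional header
        characteristics,
    )

coff = coff_header()
print(coff.hex(" "))
```

The result should match offsets 00000040 to 00000057 of the dump byte for byte.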
TYPE Optional
.Magic 010B 00000058 AS SHORT &H0B01
.Linker Major Ver 00 0000005A AS BYTE &H00
.Linker Minor Ver 00 0000005B AS BYTE &H00
.Code Sect Size 00000000 0000005C AS LONG &H00000000
.Init DATA Size 00000000 00000060 AS LONG &H00000000
.UnInit DATA Size 00000000 00000064 AS LONG &H00000000
.Entry Point RVA 00001056 00000068 AS LONG &H56100000
.Base of Code 00000000 0000006C AS LONG &H00000000
.Base of DATA 00000000 00000070 AS LONG &H00000000
.Image Base 00400000 00000074 AS LONG &H00004000
.Sect. Alignment 00001000 00000078 AS LONG &H00100000
.File Alignment 00000200 0000007C AS LONG &H00020000
.OS Major Version 0000 00000080 AS SHORT &H0000
.OS Minor Version 0000 00000082 AS SHORT &H0000
.User Major Ver. 0000 00000084 AS SHORT &H0000
.User Minor Ver. 0000 00000086 AS SHORT &H0000
.SubSys Major Ver 0004 00000088 AS SHORT &H0400
.SubSys Minor Ver 0000 0000008A AS SHORT &H0000
.Reserved 00000000 0000008C AS LONG &H00000000
.Image Size 00000068 00000090 AS LONG &H68000000
.Header Size 00000200 00000094 AS LONG &H00020000
.File Checksum 00000000 00000098 AS LONG &H00000000
.SubSystem 0002 0000009C AS SHORT &H0200
.DLL Flags 0000 0000009E AS SHORT &H0000
.Stack Reserve Sz 00001000 000000A0 AS LONG &H00100000
.Stack Commit Sz 00001000 000000A4 AS LONG &H00100000
.Heap Reserve Sz 00000000 000000A8 AS LONG &H00000000
.Heap Commit Size 00000000 000000AC AS LONG &H00000000
.Loader Flags 00000000 000000B0 AS LONG &H00000000
.DATA Dir Num 00000010 000000B4 AS LONG &H10000000
.Export Tbl Addr. 00000000 000000B8 AS LONG &H00000000
.Export& Size 00000000 000000BC AS LONG &H00000000
.Import Tbl Addr. 00001000 000000C0 AS LONG &H00100000
.Import& Size 00000049 000000C4 AS LONG &H49000000
.Res. Table Addr. 00000000 000000C8 AS LONG &H00000000
.Resource& Size 00000000 000000CC AS LONG &H00000000
.Except. Tbl Addr. 00000000 000000D0 AS LONG &H00000000
.Exception& Size 00000000 000000D4 AS LONG &H00000000
.Secur. Tbl Addr. 00000000 000000D8 AS LONG &H00000000
.Security& Size 00000000 000000DC AS LONG &H00000000
.Base Reloc TAddr 00000000 000000E0 AS LONG &H00000000
.Base Reloc& Size 00000000 000000E4 AS LONG &H00000000
.Debug DATA Addr. 00000000 000000E8 AS LONG &H00000000
.Dbg DATA Size 00000000 000000EC AS LONG &H00000000
.CR DATA Address 00000000 000000F0 AS LONG &H00000000
.Copyright Size 00000000 000000F4 AS LONG &H00000000
.GLOBAL PTR 00000000 000000F8 AS LONG &H00000000
.GLOBAL PTR& Size 00000000 000000FC AS LONG &H00000000
.TLS Table Addr. 00000000 00000100 AS LONG &H00000000
.TLS& Size 00000000 00000104 AS LONG &H00000000
.LoadCfg Tbl Addr 00000000 00000108 AS LONG &H00000000
.LoadCfg& Size 00000000 0000010C AS LONG &H00000000
END TYPE
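A quick way to check any offset in a table like this is to unpack the bytes of the matching dump row. The sketch below, in Python with illustrative field names, reads the six version SHORTs that start at offset 00000080 and prints where each one sits.

```python
import struct

# Row 00000080 of the hello.exe dump.
row_80 = bytes.fromhex("00 00 00 00 00 00 00 00 04 00 00 00 00 00 00 00")

names = ["OS major", "OS minor", "User major", "User minor",
         "SubSys major", "SubSys minor"]
values = struct.unpack_from("<6H", row_80, 0)  # six little-endian SHORTs

# Map each name to (file offset, value); the SHORTs are 2 bytes apart.
fields = {name: (0x80 + 2 * i, value)
          for i, (name, value) in enumerate(zip(names, values))}
for name, (offset, value) in fields.items():
    print(f"{name:13} offset {offset:08X} value {value:04X}")
```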
TYPE Section[Array]
.Component[0] AS Flat_exe (should be .rdata)
END TYPE
TYPE Flat_exe (.rdata)
.Section Name Flat_exe 00000138 AS STRING*8
.Virtual Size 00000000 00000140 AS LONG &H00000000
.RVA/Offset 00001000 00000144 AS LONG &H00100000
.Size of Raw DATA 00000068 00000148 AS LONG &H68000000
.PTR TO Raw DATA 00000200 0000014C AS LONG &H00020000
.PTR TO Relocs 00000000 00000150 AS LONG &H00000000
.PTR TO LineNo 00000000 00000154 AS LONG &H00000000
.Number of Relocs 0000 00000158 AS SHORT &H0000
.Number of LineNo 0000 0000015A AS SHORT &H0000
.Section Flag E0000060 0000015C AS LONG &H600000E0
END TYPE
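The section entry is a fixed 40-byte record, so it maps naturally onto one UDT. A sketch of packing it in Python; `section_header` and its parameter names are made up for this example, and the values passed in are the ones from the Flat_exe table.

```python
import struct

def section_header(name, rva, raw_size, raw_ptr, flags, virtual_size=0):
    """Pack one 40-byte section header; the name is padded to 8 bytes."""
    return struct.pack(
        "<8sIIIIIIHHI",
        name.encode("ascii"),
        virtual_size, rva, raw_size, raw_ptr,
        0, 0,    # pointers to relocations and line numbers
        0, 0,    # counts of relocations and line numbers
        flags,
    )

flat = section_header("Flat_exe", 0x1000, 0x68, 0x200, 0xE0000060)
print(len(flat), flat.hex(" "))
```

The packed bytes should equal offsets 00000138 to 0000015F of the dump.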
That is quite a lot of information to digest, isn't it? From the structures and types above, we can see that we could write an include file that writes the PE header to produce an executable file. But it's not done yet. What about the import data, and any resources added to the code? From the PE viewer, I can't see the address and the length of the import data.
We are not done yet. In the breakdown above we stopped at offset 0000015C, the section flag of the .code type, which tells the OS that this section is executable (that's just my impression; I don't know what it actually is). At least I understand it that way. Perhaps an assembler programmer could explain it better. In human numbering, we stopped at row 23, column 12 of the hex bytes table above; add the LONG (+4 columns) and we land exactly at 00000160, or row 24. So the import section runs from there until the end. But that's too general, isn't it?
In the hex editor, from row 24 (offset 160) to row 33 (offset 1F0) there are no values. That might differ in other executables; for now, just forget about it. We begin from row 34 (offset 200). I made a wild guess, based on Flat_exe.Ptr_to_Raw_Data above, that we should read at offset 200. I can't yet figure out which structure the bytes from offset 200 to 227 belong to. The only translation I can do now is the following:
---Elements----------Value--------Offsets---TYPE ?------
.Imported DLL file User32.dll 00000228 AS STRING*10
.DLL FUNCTION Name MessageBoxA 0000023D AS STRING*11
.QLABEL Hello World! 00000249 AS STRING*12
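The bytes from offset 200 to 227 line up with the NASM listing right after SECTION_START: five LONGs (an import descriptor in PE terms: lookup table RVA, time stamp, forwarder chain, DLL name RVA, thunk table RVA) followed by twenty zero bytes that end the descriptor list. The values are RVAs, so with the section loaded at RVA 1000 and stored at file offset 200, an RVA of 1028 means file offset 228. A sketch in Python that follows those pointers; the helper names are made up, and the bytes are copied from the dump.

```python
import struct

SECTION_RVA = 0x1000          # Flat_exe RVA
SECTION_FILE_OFFSET = 0x200   # Flat_exe pointer to raw data

# File offsets 00000200 to 00000267 of hello.exe, copied from the dump.
SECTION_HEX = """
33 10 00 00 00 00 00 00 ff ff ff ff 28 10 00 00
33 10 00 00 00 00 00 00 00 00 00 00 00 00 00 00
00 00 00 00 00 00 00 00 75 73 65 72 33 32 2e 64
6c 6c 00 3b 10 00 00 00 00 00 00 00 00 4d 65 73
73 61 67 65 42 6f 78 41 00 48 65 6c 6c 6f 20 57
6f 72 6c 64 21 00 31 d2 b8 49 10 40 00 52 50 50
52 ff 15 33 10 40 00 c3
"""
section = bytes.fromhex("".join(SECTION_HEX.split()))

def index_of(rva):
    """Position inside the section bytes for a given RVA."""
    return rva - SECTION_RVA

def c_string(data, start):
    """Read a zero-terminated ASCII string."""
    end = data.index(b"\0", start)
    return data[start:end].decode("ascii")

# The import descriptor: five little-endian LONGs at the section start.
lookup_rva, time_stamp, forwarder, name_rva, thunk_rva = struct.unpack_from(
    "<5I", section, 0)
dll_name = c_string(section, index_of(name_rva))

# The thunk table holds the RVA of a (hint SHORT, function name) entry.
(entry_rva,) = struct.unpack_from("<I", section, index_of(thunk_rva))
(hint,) = struct.unpack_from("<H", section, index_of(entry_rva))
function_name = c_string(section, index_of(entry_rva) + 2)

print(dll_name, function_name)
print(hex(SECTION_FILE_OFFSET + index_of(name_rva)))
```

In this file the lookup table and the thunk table are the same LONG at RVA 1033, which is how the NASM source lays it out; other executables may keep two separate tables.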
Conclusions: This is the study I've made so far towards a new compiler written in RapidQ itself. My objective is very simple: to make a small message box compiler in RapidQ, based on the UDTs above, that produces an executable file. I hope that with this information someone else will also be able to attempt a small compiler in RapidQ based on binary file study.
I still have to study other message box executables to compare their PE header structures, for instance Iczelion's tutorial 2 compiled with MASM32. Since that file is bigger, about 2 KB, I believe I'll find new things inside the binary, such as an additional call to Kernel32.dll and a more valid and organized PE file structure.
Look at it this way: we make UDTs that write an executable file, similar to reading PCX or GIF files with UDTs and transferring them to a QBitmap. I believe this can be done because I've successfully made a Truevision TGA viewer with alpha channel support that saves to transparent GIF without calling a DLL such as NViewLib.dll (I will talk about this some other time, or upload the source code once I've optimized the file reading speed and completed the project).
So the method is the same as writing any binary file, and this time we produce an EXE file. Some of the values above might change here and there across the UDTs. With more study, we will understand those values and their options better, so we can produce a mini compiler for a start.
Happy coding...
| 2. Machine code #939 | Given the high efficiency of this forum script, it seems unnecessary to make your post a separate text file?
I have to say, analysing data/graphic files is quite different from analysing an exe file. Right, the PE file format is obtainable and clear, but beyond the PE header, the main question is how to translate RQ source to machine code, i.e. compile it to actual executable instructions. Say, the RQ source
messagebox "hello world!"
to machine code, for example, maybe:
00000220 00 00 00 00 00 00 00 00 75 73 65 72 33 32 2e 64
00000230 6c 6c 00 3b 10 00 00 00 00 00 00 00 00 4d 65 73
00000240 73 61 67 65 42 6f 78 41 00 48 65 6c 6c 6f 20 57
00000250 6f 72 6c 64 21 00 31 d2 b8 49 10 40 00 52 50 50
00000260 52 ff 15 33 10 40 00 c3
One idea is to use an existing compiler, as BCX does. Or use an assembler; that way we must create many, many low-level libraries (string, math, array, UDT, etc.) besides the GUI framework, all in ASM or a binary-compatible format (that's what Doctor Electron is planning).
| 3. Re: Machine code #941 Posted by: 2003-05-18 15:27:18 | Guidance, if we look at coding keyword by keyword, it will be a lot harder, as you said. When I wrote the TGA data reader it was quite a challenge. I had to read binary pointers from the top of the file and then from the bottom of the file. This exercise taught me some valuable lessons about working with binary, as I had never done such a thing before.
To translate RQ code such as messagebox "hello world!", we can't simply associate it with the machine code from offset 220 to 260. If we did it that way, we would be back to square one, producing masses of duplicated machine code that lead to a bigger executable footprint.
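Within that range, only the last 18 bytes (offset 256 onward, the label Main in the NASM listing) are instructions; everything before them is import data and the string. Those 18 bytes depend on just two addresses, so they can be produced by a small routine instead of being copied. A sketch in Python, where `emit_messagebox` and its parameters are hypothetical names and the opcodes are the ones visible in the dump:

```python
import struct

def emit_messagebox(text_va, import_slot_va):
    """Machine code for MessageBoxA(0, text, text, 0) followed by ret."""
    code = bytes([0x31, 0xD2])                               # xor edx, edx
    code += b"\xB8" + struct.pack("<I", text_va)             # mov eax, text_va
    code += bytes([0x52, 0x50, 0x50, 0x52])                  # push edx, eax, eax, edx
    code += b"\xFF\x15" + struct.pack("<I", import_slot_va)  # call dword [slot]
    code += b"\xC3"                                          # ret
    return code

# Addresses as used in hello.exe: the string at 00401049 and the
# MessageBoxA import slot at 00401033.
code = emit_messagebox(0x00401049, 0x00401033)
print(code.hex(" "))
```

A different string or a second call would reuse the same routine with other addresses, which is one way to avoid duplicating whole blocks of bytes.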
Try to visualize it this way: when we read a line with QMessageBox() + String$ + strlength%, we add values to UDT Optional and UDT Section as in the notes above. These are integer values such as size, position or location, numbers we can understand, which QFileStream.WriteNum() then converts to hex bytes. Once that is done, we add the Win32 API DLL name and its function into offsets 220 to 260. This block will grow if we call other DLLs. So this machine code is not just for MessageBox but is an import 'module', or table, that contains many records. Apart from that, there are other values in the machine code that tell what is called by whom and how. This needs to be translated further.
When the keyword MessageBox is used, a UDT specifies which DLL and which functions it needs. To do this, we need the Win32 API references. If the DLL is already imported for other functions, we just add the borrowed function to be used. Again, if it is already defined, we just point the execution at it with the parameters. With the Win32 API we no longer use Delphi objects such as TForm, as the current RQ does. Using many, many include files, each with several UDTs, is normal. Even BCX has to rely on several LCC-Win32 include and library files. MASM32 does the same, with added macros and so on. Once we have a working compiler, we can decide whether to build them into the compiler or keep them as separate include files outside it. But separate include/UDT files will be an advantage, because users or other programmers can improve them. Programming is subjective, and some people have different needs than others. Needs also change over time.
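A rough sketch of that keyword lookup, in Python: a table maps a BASIC keyword to the DLL and function it borrows, and a routine lays out the import block for one DLL. `KEYWORD_IMPORTS` and `build_imports` are made-up names, and the layout simply mirrors the one in hello.exe (descriptor, empty descriptor, DLL name, thunk table, hint/name entries); it is not the only valid arrangement.

```python
import struct

# Hypothetical keyword table: BASIC keyword -> (DLL, imported function).
KEYWORD_IMPORTS = {"MESSAGEBOX": ("user32.dll", "MessageBoxA")}

def build_imports(dll, functions, section_rva=0x1000):
    """Lay out the import data for one DLL at the start of a section."""
    name_bytes = dll.encode("ascii") + b"\0"
    dll_off = 40                                  # after two 20-byte descriptors
    thunk_off = dll_off + len(name_bytes)
    entries_off = thunk_off + 4 * (len(functions) + 1)  # thunks + terminator
    thunks = b""
    entries = b""
    for fn in functions:
        # Each thunk points at its (hint, name) entry.
        thunks += struct.pack("<I", section_rva + entries_off + len(entries))
        entries += struct.pack("<H", 0) + fn.encode("ascii") + b"\0"
    thunks += struct.pack("<I", 0)                # end of the thunk table
    descriptor = struct.pack(
        "<IIiII",
        section_rva + thunk_off,   # lookup table (shared with thunk table)
        0,                         # time stamp
        -1,                        # forwarder chain
        section_rva + dll_off,     # DLL name
        section_rva + thunk_off,   # thunk table
    )
    return descriptor + bytes(20) + name_bytes + thunks + entries

dll, fn = KEYWORD_IMPORTS["MESSAGEBOX"]
imports = build_imports(dll, [fn])
print(hex(len(imports)))
```

For the single MessageBoxA import this gives 49 (hex) bytes, the same as the Import Size value in the Optional table, and adding more functions makes the block grow as described above.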
A compiled RQ executable contains a lot of things the program does not need. But it was William Yu's decision not to optimize the code. A single programmer would have taken ages just to optimize it back then. Nowadays, with an advanced language like HLA in use for assembler, similar optimization techniques and exercises could be followed.
BCX is good, but it relies on Lcc. It is hard to leave RQ. Once the binary PE format has been translated, BCX should be a good source of code examples. We should have BAX, a BASIC-to-ASM translator, to get even. Let things happen over time. Surely there are smart guys capable of making the first step. As for Dr. Electron's project, it has my full support. It should be done, and coders with ASM skills should gather and work on it. On the other hand, what I envision in this thread is the possibility of making a compiler in the RQ BASIC language itself. I agree that it hasn't been done before without translating to C/C++ or ASM. But my idea is to use direct binary writing as an escape path from BASIC's limitations. What is most important in my idea is that the keywords and values are understandable in plain English by common BASIC programmers.
Anyway, thanks, Guidance. By trying to answer your question, I found the solution at the same time. Still, this is just a theory. I should continue the experiments to pursue this objective. If anybody has any interesting questions, please write in. It may lead to some solutions just by asking...
Happy coding...
| 4. Language #945 | Which language to write a BASIC-to-C or BASIC-to-ASM translator in is not the key problem here; I also prefer RQ itself. But since RQ code is going to be translated to C or ASM, the translator writer must master both RQ and the target language, so C/C++ is also a good candidate. In my opinion ASM is not recommended anyhow.
| 5. Re: Language #950 Posted by: 2003-05-20 01:28:09 | This is why making a compiler is very tricky. Nevertheless, the idea William Yu seeded in RapidQ, learning programming through examples, is driving people to make a new BASIC compiler. He made one using Delphi-like objects. When someone makes a new RQ compiler using ASM, for instance, he or she has to decide whether to keep this OOP intact or migrate completely to Win32 API resources like BCX.
If you're a C-oriented person, you might be able to use a free template called the compiler compiler, or compiler creator, if I'm not wrong. I'm not quite sure about the exact name of the package; it's available from programmer resource download sites. Maybe you could give it a try.
Nobody can master every programming language completely. But when people work in a group, they complement each other, and most of what is needed gets covered by their combined work. The important thing is the leader who organizes the group. The mastermind must not weaken, to keep the project going...