The main issue is that to make code human-readable, we include a lot of conventions that computers don’t need. We use specific formatting, naming conventions, code structure, comments, etc. to help someone look at the code and understand its function.
Let’s say I write code, and I have a function named ‘findUserName’ that takes a variable ‘text’ and checks it against a global variable ‘userName’ to see if the user name is contained in the text, returning ‘true’ if so. If I compile and decompile that, the result will be (for example) a function named ‘function_002’ that takes a variable ‘var_local_000’ and checks it against ‘var_global_115’. Also, my comments will be gone, and finding where the function was called from will be difficult. Yes, you could look at that code and puzzle out what it’s doing, but you wouldn’t know that var_global_115 is a username. You’d have to go find where that variable was set, work out where its value came from, and follow that rabbit hole backwards until you eventually find a request for user input, whose purpose you’d then have to infer from context clues.
It’s not that the code you get back from a decompiler is incorrect or inefficient; it’s that it’s very much not human-readable without a lot of extra investigative work.
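To make the findUserName example a bit more concrete, here is a rough sketch in Python (chosen only for brevity; the same idea applies to any compiled language). The value "alice" is a made-up placeholder, and the second half is hand-written to imitate decompiler output rather than produced by a real tool.

```python
# What the author wrote: names and comments carry the intent.
userName = "alice"  # hypothetical value; in a real program this is set elsewhere


def findUserName(text):
    """Return True if the current user's name appears in the text."""
    return userName in text


# Roughly what might come back from a decompiler: same logic, no meaning.
var_global_115 = "alice"


def function_002(var_local_000):
    return var_global_115 in var_local_000


print(findUserName("message from alice"))   # True
print(function_002("message from alice"))   # True
```

Both functions behave identically; only the second one forces you to go hunting for what var_global_115 is supposed to represent.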
fenynro@lemmy.world 9 months ago
It depends on the specifics of how the language is compiled. I’ll use C# as an example since that’s what I’m currently working with, but the process is different between all of them.
C#, when compiled, actually gets compiled down to what is known as an intermediate language (MSIL for C# specifically). This intermediate file is basically a set of genericized instructions that are not tied to any specific CPU. This is useful because different CPUs require different instructions.
Then, when the program is run, a second compiler, known as the Just In Time compiler, takes the intermediate commands and translates them into something directly relevant to the CPU being used.
When we decompile a C# dll, we’re really taking the intermediate language (generic CPU-agnostic instructions) and translating it back into source code.
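As a loose analogy only (this is Python bytecode, not MSIL, and CPython normally interprets it rather than JIT-compiling it), you can peek at an intermediate form with the standard `dis` module. The function `add_tax` is a hypothetical example; the exact instruction names printed depend on your Python version.

```python
import dis


def add_tax(price, rate):
    # Hypothetical function, only here so there is something to compile.
    return price + price * rate


# CPython has already compiled add_tax to bytecode; list the instruction names.
ops = [instruction.opname for instruction in dis.get_instructions(add_tax)]
print(ops)
```

The list of generic, CPU-agnostic instructions is the sort of thing a decompiler starts from when it tries to rebuild source code.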
To your second point, you are correct that the decompiled version will be more efficient from a processing perspective, but that efficiency comes at the direct cost of being able to easily understand what is happening at a human level.
Squizzy@lemmy.world 9 months ago
Could I trouble you to go deeper? I think I’m getting it, but if we were to, say, decompile GTA V or Super Mario Bros, could we make changes and figure it out from there, or would it be complete nonsense with no waypoints to jump in at and get a grip on what is being done?
On a side note I was told once that everything is 1s and 0s and as a result that someone could type a picture of you if they got the order right. This could be why I’m so wrong in my understanding given I’m now assuming this was bullshit.
mindlessLump@lemmy.world 9 months ago
Here is a real-world example of someone reverse engineering compiled code. It might help you understand what is possible, and some of the processes involved. nee.lv/…/How-I-cut-GTA-Online-loading-times-by-70…
folkrav@lemmy.ca 9 months ago
At a very low level, yes, everything is 1s and 0s. However, virtually nobody deals with binary anymore. Programming languages are abstractions over abstractions over abstractions, so that nobody has to deal with typing binary.
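On the "typing a picture" idea, here is a toy sketch of the principle in Python. The 5x5 grid is made up for illustration; real image formats add headers, colour, and compression on top, so this is nowhere near how a photo is actually stored.

```python
# Each string is one row of a 5x5 black-and-white image; 1 = filled pixel.
rows = [
    "01010",
    "01010",
    "00000",
    "10001",
    "01110",
]


def render(bit_rows):
    """Turn rows of '0'/'1' characters into printable text art."""
    return "\n".join(
        "".join("#" if bit == "1" else "." for bit in row) for row in bit_rows
    )


print(render(rows))
```

Get the 25 digits in the right order and you have a smiley; a photo of a person works the same way in principle, just with millions of digits instead of 25.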
The point of programming languages is for humans to be able to read them and make sense of them. They’re a way to represent a program in a kind of intermediate language that’s halfway between something humans can read and something computers can interpret.
Say the game’s programmer wants to handle moving your character right on pressing the right arrow key. They might write some function called “handleRightArrow()”, which does whatever. Then the compiler will turn this into some instructions: read stuff in RAM at address XYZ, copy it over, etc. The original code with readable names, comments, documentation, and proper organization is gone. Once you decompile, it’s gonna be random function/variable names, and the compiler might have rewritten some parts of the implementation as automatic optimizations, inlined some functions, etc. The human-readable meaning of the code is lost. It does the same thing as the original code, but it isn’t the original code either.
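A small sketch of what inlining does to that handleRightArrow example, again in Python for brevity. The helper `move_player` and the name `function_017` are invented for illustration, and the "after" version is written by hand to imitate what a decompiler might show, not taken from a real tool.

```python
# What the programmer wrote: small, named pieces.
def move_player(x, dx):
    """Shift the player's x position by dx."""
    return x + dx


def handleRightArrow(x):
    # Right arrow moves the character one step to the right.
    return move_player(x, 1)


# A decompiler-style reconstruction after the helper was inlined away.
def function_017(var_000):
    return var_000 + 1


print(handleRightArrow(10))  # 11
print(function_017(10))      # 11
```

Same result either way, but the second version no longer tells you there was ever a concept of "moving the player", which is exactly the meaning that gets lost.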