Decompressing ELF Directly to Memory on Linux/x86
        Copyright (C) 2000-2024 John F. Reiser  jreiser@BitWagon.com

References:
  <elf.h>   definitions for the ELF file format
  /usr/src/linux/fs/binfmt_elf.c   what Linux execve() does with ELF
  objdump --private-headers a.elf  dump the Elf32_Phdr
  http://www.cygnus.com/pubs/gnupro/5_ut/b_Usingld/ldLinker_scripts.html
     how to construct unusual ELF using /bin/ld

There is exactly one immovable object:  In all of the Linux kernel,
only the execve() system call sets the initial value of "the brk(0)",
the value that is manipulated by system call 45 (__NR_brk in
/usr/include/asm/unistd.h).  For "direct to memory" decompression,
there will be no execve() except for the execve() of the decompressor
program itself.  So, the decompressor program (which contains the
compressed version of the original executable) must have the same
brk() as the original executable.  So, the first PT_LOAD
ELF "segment" of the compressed program is used only to set the brk(0).
See src/p_lx_elf.cpp, function PackLinuxElf32::generateElfHdr.
All of the decompressor's code, and all of the compressed image
of the original executable, reside in the first PT_LOAD of the
decompressor program.

The decompressor program stub is just under 2K bytes when linked.
After linking, the decompressor code is converted to an initialized
array, and #included into the compilation of the compressor;
see stub/i386-linux.elf-entry.h.  To make self-contained compressed
executables even smaller, the compressor also compresses all but the
startup and decompression subroutine of the decompressor itself,
saving a few hundred bytes.  The startup code first decompresses the
rest of the decompressor, then jumps to it.  A nonstandard linker
script src/stub/src/i386-linux.elf-entry.lds arranges the SECTIONS
so that PackLinuxElf32x86::buildLoader() and buildLinuxLoader()
generate the desired stub code, which goes into PT_LOAD[1].

At runtime, the decompressed stub lives close beyond the brk().
In order for the decompressed stub to work properly at an address
that is different from its link-time address, the compiled code must
contain no absolute addresses.  So, the data items in stub code
must be only parameters and automatic (on-stack) local variables;
no global data, no static data, and no string constants.  Also,
the '&' operator may not be used to take the address of a function.

Decompression of the executable begins by decompressing the Elf32_Ehdr
and Elf32_Phdr, and then uses those Ehdr and Phdrs to control decompression
of the PT_LOAD segments.  Subroutine do_xmap() of src/stub/src/
i386-linux.elf-main.c performs the
"virtual execve()" using the compressed data as source, and stores
the decompressed bytes directly into the appropriate virtual addresses.

Before transferring control to the PT_INTERP "program interpreter",
minor tricks are required to setup the Elf32_auxv_t entries,
clear the free portion of the stack (to compensate for ld-linux.so.2
assuming that its automatic stack variables are initialized to zero),
and remove (all but 4 bytes of) the decompression program (and
compressed executable) from the address space.

As of upx-3.05, by default on Linux, upon decompression then one page
of the compressed executable remains mapped into the address space
of the process.  If all of the pages of the compressed executable are
unmapped, then the Linux kernel erases the symlink /proc/self/exe,
and this can cause trouble for the runtime shared library loader
expanding $ORIGIN in -rpath, or for application code that relies on
/proc/self/exe.  Use the compress-time command-line option
--unmap-all-pages to achieve that effect at run time.  Upx-3.04
and previous versions did this by default with no option.  However,
too much other software erroneously assumes that /proc/self/exe
always exists.  upx-4.3.0 made /proc/self/exe optional so that
chroot() and related environments can work.
For Elf formats, UPX adds an environment variable named "   " [three
spaces] which saves the results of readlink("/proc/self/exe",,)
If /proc/self/exe is ENOENT, then the variable has the same value
as its name "/proc/self/exe".

All of the above documentation refers to ET_EXEC main programs,
which always use the same virtual addresses.  An ET_DYN executable
(main program or shared library) follows much the same scheme,
re-using the address space that the kernel chose originally.

Linux stores the pathname argument that was specified to execve()
immediately after the '\0' which terminates the character string of the
last environment variable [as of execve()].  This is true for at least
all Linux 2.6, 2.4, and 2.2 kernels.  Linux kernel 2.6.29 and later
records a pointer to that character string in Elf32_auxv[AT_EXECFN].
The pathname is not "bound" to the file as strongly as /proc/self/exe
(the file may be changed without affecting the pathname), but the
pathname does provide some information.  The pathname may be relative
to the working directory, so look before performing any chdir().

On any page, then SELinux in strictest enforcing mode prohibits
simultaneous PROT_EXEC and PROT_WRITE, and also prohibits adding
PROT_EXEC if the kernel VMA struct (Virtual Memory Area struct)
that manages that page ever has had PROT_WRITE.  This implies that
the only way to get PROT_EXEC is to map the page directly from a file.
Therefore, in late 2023 the various decompression stubs are being
rewritten to "bounce" the decompressed data through pages in a
memory-resident file created by the memfd_create() system call,
and subsequently mapped PROT_EXEC.  Actual copying of the pages
can be avoided by careful sequence mmap() modes, but the overhead
of an additional system call is required.

The not-as-strict "targeted enforcing" mode of
SELinux seems not to demand this extra work, except for executables
that run with elevated privileges, such as various system daemons.
So "ordinary" user-mode apps can run in current "targeted enforcing"
mode.  But because the actual runtime mode of SELinux is unknown
at compression time, then the memfd_create method should be used
all the time.