Be careful relabeling volumes with Container run times. Sometimes things can go very wrong?
I recently revieved an email from someone who made the mistake of volume mounting /root into his container with the :Z option.

docker run -ti -v /root:/root:Z fedora sh

The container ran fine, and everything was well on his server machine until the next time he tried to ssh into the server.

The sshd refused to allow him in?  What went wrong?

I wrote about using volumes and SELinux on the project atomic blog.  I explain their that in order to use a volume within a non privileged container, you need to relabel the content on the volume.  You can either use the :z or the :Z option.

:z will relabel with a shared label so other containers ran read and write the volume.
:Z will relabel with a private label so that only this specific container can read and write the volume.

I probably did not emphasize enough is that as Peter Parker (Spider Man) says: With great power comes great responsibility.

Meaning you have to be careful what you relabel.  Using one of the :Z and :z options to recursively change the labels of the source content to container_file_t. When doing this you must be sure that this content is truly private to the container.  The content is not needed by other confined domains.  For example doing a volume mount of -v /var/lib/mariadb:/var/lib/mariadb:Z for a mariadb container is probably a good idea. But while doing -v /var/lib:/var/lib:Z will work, it is probably a bad idea.

Back to the email, the user relabeled all of the content under /root with a label similar to
Later when he attempted to ssh in, the sshd daemon,  running as the sshd_t type, attempts to read content in /root/.ssh it gets permission denied, since sshd_t is not allowed to read container_file_t.

The emailer realized what happened and tried to fix this situation by running restorecon -R -v /root, but this failed to change them?

Why did the labels not change when he ran restorecon?

There is a little known feature of restorecon called customizable_types, that I talked about 10 years ago.

By default, restorecon does not change types defined in the customizable_types file.  These types can be randomly scattered around the file system, and we don't want a global relabel to change them.  This is meant to make it easier to the admin, but sometimes causes confusion.  The
-F option tells restorecon to force the relabel and ignore customizable_types.

restorecon -F -R /root

This comand will reset the labels under /root to the system default and allow the emailer to login to the system via sshd again.

We have safeguards built into the SELinux go bindings which prevent container runtimes relabeling of
/, /etc, and /usr.

I need to open a pull request to add a few more directories to help prevent users from making serious mistakes in labeling, starting with

Understanding SELinux Roles
I received a container bugzilla today for someone who was attempting to assign a container process to the object_r role.  Hopefully this blog will help explain how roles work with SELinux.

When we describe SELinux we often concentrate on Type Enforcement, which is the most important and most used feature of SELinux.  This is what describe in the SELinux Coloring book as Dogs and Cats. We also describe MLS/MCS Separation in the coloring book.

Lets look at the SELinux labels
The SELinux labels consist of four parts, User, Role, Type and Level.  Often look something like


One area I do not cover is Roles and SELinux Users.

The analogy I like to use for the Users and Roles is around Russian dolls.  In that the User controls the reachable roles and the roles control the reachable types.

When we create an SELinux User, we have to specify which roles are reachable within the user.  (We also specify which levels are are available to the user.

semanage user -l

         Labeling   MLS/       MLS/                       
SELinux User    Prefix     MCS Level  MCS Range       SELinux Roles

guest_u             user       s0         s0                             guest_r
root                    user       s0         s0-s0:c0.c1023        staff_r sysadm_r system_r unconfined_r

staff_u               user       s0         s0-s0:c0.c1023        staff_r sysadm_r system_r unconfined_r
sysadm_u         user       s0         s0-s0:c0.c1023        sysadm_r
system_u          user       s0         s0-s0:c0.c1023        system_r unconfined_r
unconfined_u    user       s0         s0-s0:c0.c1023        system_r unconfined_r
user_u               user       s0         s0                             user_r
xguest_u           user       s0         s0                             xguest_r

In the example above you see the Staff_u user is able to reach the staff_r sysadm_r system_r unconfined_r roles, and is able to have any level in the MCS Range s0-s0:c0.c1023.

Notice also the system_u user, which can reach system_r unconfined_r roles as well as the complete MCS Range.  System_u is the default user for all processes started at boot or started by systemd.  SELinux users are inherited by children processes by default.

You can not assign an SELinux user a role that is not listed,  The kernel will reject it with a permission denied.

# runcon staff_u:user_r:user_t:s0 sh
runcon: invalid context: ‘staff_u:user_r:user_t:s0’: Invalid argument

The next part of the Russian dolls is roles.

Here is the roles available on my Fedora Rawhide machine.

# seinfo -r

Roles: 14


Roles control which types are available to a role.

# seinfo -rwebadm_r -x
      Dominated Roles:

In the example above the only valid type available to the webadm_r is the webadm_t.   So if you attempted to transition from webadm_t to a different type the SELinux kernel would block the access.

system_r role is the default role for all processes started at boot, most of the other roles are "user" roles.

seinfo -rsystem_r -x | wc -l

As you can see there are over 950 types that can be used with the system_r.

Other roles of note:

object_r is not really a role, but more of a place holder.  Roles only make sense for processes, not for files
on the file system.  But the SELinux label requires a role for all labels.  object_r is the role that we use to fill the objects on disks role.  Changing a process to run as object_r or trying to assign a different role to a file will always be denied by the kernel.

RBAC - Roles Based Access Control

SELinux Users and Roles are mainly used to implement RBAC.  The idea is to control what an adminsitrator can do on a system when he is logged in.  For certain activities he has to switch his roles.  For example, he might log in to a system as a normal user role, but needs to switch to an admin role when he needs to administrate the system.

Most of the other roles that are assigned above are user roles.  I have written about switching roles in the past when dealing with confined administrators and users.  I usually run as the staff_r but switch to the unconfined_r (through sudo) when I become the administrator of my system.

Most people who use SELinux do not deal with RBAC all that often, but hopefully this blog helps remove some of the confusion on the use of Roles.

Tug of war between SELinux and Chrome Sandbox, who's right?

Over the years, people have wanted to use SELinux to confine the web browser. The most common vulnerabilty for a desktop user is attacks caused by bugs in the browser.  A user goes to a questionable web site, and the web site has code that triggers a bug in the browser that takes over your machine.  Even if the browser has no bugs, you have to worry about helper plugins like flash-plugin, having vulnerabilities.

I wrote about confineing the browser all the way back in 2008. As I explained then, confining the browser without breaking expected functionality is impossible, but we wanted to confine the "plugins".  Luckily Mozilla and Google also wanted to confine the plugins, so they broke them into separate programs which we can wrap with SELinux labels.  By default all systems running firefox or chrome the plugins are locked down by SELinux preventing them from attacking your home dir.  But sometimes we have had bugs/conflicts in this confinement.

SELinux versus Chrome.

We have been seeing bug reports like the following for the last couple of years.

SELinux is preventing chrome-sandbox from 'write' accesses on the file oom_score_adj.

I believe the chrome-sandbox is trying to tell the kernel to pick it as a candidate to be killed if the machine gets under memory pressure, and the kernel has to pick some processes to kill.

But SELinux blocks this access generating an AVC that looks like:

type=AVC msg=audit(1426020354.763:876): avc:  denied  { write } for  pid=7673 comm="chrome-sandbox" name="oom_score_adj" dev="proc" ino=199187 scontext=unconfined_u:unconfined_r:chrome_sandbox_t:s0-s0:c0.c1023 tcontext=unconfined_u:unconfined_r:unconfined_t:s0-s0:c0.c1023 tclass=file permissive=0

This SELinux AVC indicates that the chrome-sandbox running as chrome_sandbox_t is trying to write to the oom_score_adj field in its parent process, most likely the chrome browser process, running as unconfined_t.  The only files labeled as unconfined_t on a system are virtual files in the /proc file system.

SELinux is probably blocking something that should be allowed, but ...

We have gone back and forth on how to fix this issue.

If you read the bugzilla

Scott R. Godin points out:

according to the final comment at

"Restricting further comments.

As explained in and this is working as intended."

so, google says 'works as intended'. *cough* (earlier discussion of this) says it's a google-chrome issue not an selinux issue.

I dunno who to believe at this point.

I respond:

I would say it is a difference of opinion.  From Google's perspective they are  just allowing the chrome-sandbox to tell the kernel to pick chrome, when looking to kill processes on the system if they run out of memory.  But from SELinux point of view it can't tell the difference between the chrome browser and other processes on the system that are labeled as unconfined_t.  From a MAC point of view, this would allow the chrome-sandbox to pick any other user process to kill if running out of memory. It is even worse then this, allowing chrome_sandbox_t to write to files labeled as unconfined_t allows the chrome-sandbox to write to file objects under /proc of almost every process in the user session.  Allowing this access would destroy the effectiveness of the SELinux confinement on the process.

If we want to control the chrome-sandbox from a MAC perspective, we don't want to allow this. 

Bottom line both sides are in some ways correct.

This is why you have a boolean and a dontaudit rule.  If you don't want MAC confinement of chrome sandbox then you can disable it using

# setsebool -P unconfined_chrome_sandbox_transition 0

Or you can ignore the problem with a dontaudit rule and the chrome-sandbox will not be able to say PICKME for killing in an out of memory situation.

There are other ways you can run confined web browsers. One of the first "Containers" was introduced way back in Fedora 12/13 RHEL6 time frame, the SELinux Sandbox. This allowed you to run a sanboxed web browser totally isolated from the desktop.  The Flatpak effort is a newer effort to run containerized Desktop Applications.

docker-selinux changed to container-selinux
Changing upstream packages

I have decided to change the docker SELinux policy package on from docker-selinux to container-selinux

The main reason I did this was after seeing the following on twitter.   Docker, INC is requesting people not use docker prefix for packages on github.

Since the policy for container-selinux can be used for more container runtimes then just docker, this seems like a good idea.  I plan on using it for OCID, and would consider plugging it into the RKT CRI.

I have modified all of the types inside of the policy to container_*.  For instance docker_t is now container_runtime_t and docker_exec_t is container_runtime_exec_t.

I have taken advantage of the typealias capability of SELinux policy to allow the types to be preserved over an upgrade.

typealias container_runtime_t alias docker_t;
typealias container_runtime_exec_t alias docker_exec_t;

This means people can continue to use docker_t and docker_exec_t with tools but the kernel will automatically translate them to the primary name container_runtime_t and container_runtime_exec_t.

This policy is arriving today in rawhide in the container-selinux.rpm which obsoletes the docker-selinux.rpm.  Once we are confident about the upgrade path, we will be rolling out the new packaging to Fedora and eventually to RHEL and CentOS.

Changing the types associated with container processes.

Secondarily I have begun to change the type names for running containers.  Way back when I wrote the first policy for containers, we were using libvirt_lxc for launching containers, and we already had types defined for VMS launched out of libvirt.  VM's were labeled svirt_t.  When I decided to extend the policy for Containers I decided on extending svirt with lxc.
svirt_lxc, but I also wanted to show that it had full network.  svirt_lxc_net_t.  I labeled the content inside of the container svirt_sandbox_file_t.

Bad names...

Once containers exploded on the seen with the arrival of docker, I knew I had made a mistake choosng the types associated with container processes.  Time to clean this up.  I have submitted pull requests into selinux-policy to change these types to container_t and container_image_t.

typealias container_t alias svirt_lxc_net_t;
typealais container_image_t alias svirt_sandbox_file_t;

The old types will still work due to typealias, but I think it would become a lot easier for people to understand the SELinux types with simpler names.  There is a lot of documentation and "google" knowledge out there about svirt_lxc_net_t and svirt_sandbox_file_t, which we can modify over time.

Luckily I have a chance at a do-over.

What is the spc_t container type, and why didn't we just run as unconfined_t?
What is spc_t?

SPC stands for Super Privileged Container, which are containers that contain software used to manage the host system that the container will be running on.  Since these containers could do anything on the system and we don't want SELinux blocking any access we made spc_t an unconfined domain. 

If you are on an SELinux system, and run docker with SELinux separation turned off, the containers will run with the spc_t type.

You can disable SELinux container separation in docker in multiple different ways.

  • You don't build docker from scratch with the BUILDTAG=selinux flag.

  • You run the docker daemon without --selinux-enabled flag

  • You run a container with the --security-opt label:disable flag

          docker run -ti --security-opt label:disable fedora sh

  • You share the PID namespace or IPC namespace with the host

         docker run -ti --pid=host --ipc=host fedora sh
Note: we have to disable SELinux separation in ipc=host  and pid=host because it would block access to processes or the IPC mechanisms on the host.

Why not use unconfined_t?

The question comes up is why not just run as unconfined_t?  A lot of people falsely assume that unconfined_t is the only unconfined domains.  But unconfined_t is a user domain.   We block most confined domains from communicating with the unconfined_t domain,  since this is probably the domain that the administrator is running with.

What is different about spc_t?

First off the type docker runs as (docker_t) can transition to spc_t, it is not allowed to transition to unconfined_t. It transitions to this domain, when it executes programs located under /var/lib/docker

# sesearch -T -s docker_t | grep spc_t
   type_transition container_t docker_share_t : process spc_t;
   type_transition container_t docker_var_lib_t : process spc_t;
   type_transition container_t svirt_sandbox_file_t : process spc_t;

Secondly and most importantly confined domains are allowed to connect to unix domain sockets running as spc_t.

This means I could run as service as a container process and have it create a socket on /run on the host system and other confined domains on the host could communicate with the service.

For example if you wanted to create a container that runs sssd, and wanted to allow confined domains to be able to get passwd information from it, you could run it as spc_t and the confined login programs would be able to use it.


Some times you can create an unconfined domain that you want to allow one or more confined domains to communicate with. In this situation it is usually better to create a new domain, rather then reusing unconfined_t.

Fun with bash, or how I wasted an hour trying to debug some SELinux test scripts.
We are working to get SELinux and Overlayfs to work well together.  Currently you can not run docker containers
with SELinux on an Overlayfs back end.  You should see the patches posted to the kernel list within a week.

I have been tasked to write selinuxtestsuite tests to verify overlayfs works correctly with SELinux.
These tests will help people understand what we intended.

One of the requirements for overlayfs/SELinux is to check not only the access of the task process doing some access
but also the label of the processes that originally setup the overlayfs mount.

In order to do the test I created two process types test_overlay_mounter_t and test_overlay_client_t, and then I was using
runcon to execute a bash script in the correct context.  I added code like the following to the test to make sure that the runcon command was working.

# runcon -t test_overlay_mounter_t bash <<EOF
echo "Mounting as $(id -Z)"

The problem was when I ran the tests, I saw the following:

Mounting as unconfined_u:unconfined_r:unconfined_t:s0-s0:c0.c1023

Sadly it took me an hour to diagnose what was going on.  Writing several test scripts and running commands by hand.  Sometimes it seemed to work and other times it would not.  I thought there was a problem with runcon or with my SELinux policy.  Finally I took a break and came back to the problem realizing that the problem was with bash.  The $(id -Z) was
executed before the runcon command.

Sometimes you feel like an idiot.

runcon -t test_overlay_mounter_t bash <<EOF
echo "Mounting as $(id -Z)"
echo -n "Mounting as "
id -Z
Mounting as unconfined_u:unconfined_r:unconfined_t:s0-s0:c0.c1023
Mounting as unconfined_u:unconfined_r:test_overlay_mounter_t:s0-s0:c0.c1023

My next blog will explain how we expect overlayfs to work with SELinux.

Passing Unix Socket File Descriptors between containers processes blocked by SELinux.
SELinux controls passing of Socket file descriptors between processes.

A Fedora user posted a bugzilla complaining about SELinux blocking transfer of socket file descriptors between two docker containers.

Lets look at what happens when a socket file descriptor is created by a process.

When a process accepts a connection from a remote system, the file descriptor is created by a process it automatically gets assigned the same label as the process creating the socket.  For example when the docker service (docker_t) listens on /var/run/docker.sock and a client connects the docker service, the docker service end of the connection gets labeled by default with the label of the docker process.  On my machine this is:


The client is probably running as unconfined_u:unconfined_r:unconfined_t:s0-s0:c0.c1023.  SELinux would then check to make sure that unconfined_t is able to connecto docker_t sockets.

If this socket descriptor is passed to another process the new process label has to have access to the socket with the "socket label".  If it does not SELinux will block the transfer.

In containers, even though by default all container processes have the same SELinux typo, they have different MCS Labels.

If I have a process labeled system_u:system_r:svirt_lxc_net_t:s0:c1,c2 and I pass that file descriptor to a process in a different container labeled system_u:system_r:svirt_lxc_net_t:s0:c4,c5, SELinux will block the access.

The bug reporter was reporting that by default he was not able to pass the descriptor, which is goodness. We would not want to allow a confined container to be able to read/write socket file descriptors from another container by default.

The reporter also figured out that he could get this to work by disabling SELinux either on the host or inside of the container.

Surprisingly he also figured out if he shared IPC namespaces between the containers, SELinux would not block.

The reason for this is when you share the same IPC Namespace, docker automatically caused the containers share the Same SELinux label.  If docker did not do this SELinux would block processes from container A access to IPCs created in Container B.  With a shared IPC the SELinux labels for both of the reporters containers were the same, and SELinux allowed the passing.

How would I make two containers share the same SELinux labels?

Docker by default launches all containers with the same type field, but different MCS labels.  I told the reporter that you could cause two containers to run with the same MCS labels by using the --security-opt label:level:MCSLABEL option.

Something like this will work

docker run -it --rm --security-opt label:level:s0:c1000,c1001 --name server -v myvol:/tmp test /server
docker run -it --rm --security-opt label:level:s0:c1000,c1001 --name client -v myvol:/tmp test /client

These containers would then run with the same MCS labels, which would give the reporter the best security possible and still allow the two containers to pass the socket between containers.  These containers would still be locked down with SELInux from the host and other containers, however they would be able to attack each other from an SELinux point of view, however the other container separation security would still be in effect to prevent the attacks.

Its a good thing SELinux blocks access to the docker socket.
I have seen lots of SELinux bugs being reported where users are running a container that volume mounts the docker.sock into a container.  The container then uses a docker client to do something with docker. While I appreciate that a lot of these containers probbaly need this access, I am not sure people realize that this is equivalent to giving the container full root outside of the contaienr on the host system.  I just execute the following command and I have full root access on the host.

docker run -ti --privileged -v /:/host fedora chroot /host

SELinux definitely shows its power in this case by blocking the access.  From a security point of view, we definitely want to block all confined containers from talking to the docker.sock.  Sadly the other security mechanisms on by default in containers, do NOT block this access.  If a process somehow breaks out of a container and get write to the docker.sock, your system is pwned on an SELinux disabled system. (User Namespace, if it is enabled, will also block this access also going forward).

If you have a run a container that talks to the docker.sock you need to turn off the SELinux protection. There are two ways to do this.

You can turn off all container security separation by using the --privileged flag. Since you are giving the container full access to your system from a security point of view, you probably should just do this.

docker run --privileged -v /run/docker.sock:/run/docker.sock POWERFULLCONTAINER

If you want to just disable SELinux you can do this by using the --security-opt label:disable flag.

docker run --security-opt label:disable -v /run/docker.sock:/run/docker.sock POWERFULLCONTAINER

Note in the future if you are using User Namespace and have this problem, a new flag --userns=host flag is being
developed, which will turn off user namespace within the container.

Adding a new filename transition rule.
Way back in 2012 we added File Name Transition Rules.  These rules allows us to create content with the correct label
in a directory with a different label.  Prior to File Name Transition RUles Administrators and other tools like init scripts creating content in a directory would have to remember to execute restorecon on the new content.  In a lot of cases they would forget
and we would end up with mislabeled content, in some cases this would open up a race condition where the data would be
temporarily mislabeled and could cause security problems.

I recently recieved this email and figured I should write a blog.

Hiya everyone. I'm an SELinux noob.

I love the newish file name transition feature. I was first made aware of it some time after RHEL7 was released (, probably thanks to some mention from Simon or one of the rest of you on this list. For things that can't be watched with restorecond, this feature is so awesome.

Can someone give me a quick tutorial on how I could add a custom rule? For example:

filetrans_pattern(unconfined_t, httpd_sys_content_t, httpd_sys_rw_content_t, dir, "rwstorage")

Of course the end goal is that if someone creates a dir named "rwstorage" in /var/www/html, that dir will automatically get the httpd_sys_rw_content_t type. Basically I'm trying to make a clone of the existing rule that does the same thing for "/var/www/html(/.*)?/uploads(/.*)?".
Thanks for reading.

First you need to create a source file myfiletrans.te

policy_module(myfiletrans, 1.0)
    type unconfined_t, httpd_sys_content_t, httpd_sys_rw_content_t;
filetrans_pattern(unconfined_t, httpd_sys_content_t, httpd_sys_rw_content_t, dir, "rwstorage")

Quickly looking at the code we added.  When writing policy, if you are using type fields, unconfined_t, httpd_sys_content_t, httpd_sys_rw_content_t, that are defined in other policy packages, you need to specify this in a gen_require block.  This is similar to defining extern variables to be used in "C".  Then we call the filetrans_pattern interface.  This code tells that kernel that if a process running as unconfined_t, creating a dir named rwstorage in a directory labeled httpd_ssy_content_t, create the directory as httpd_sys_rw_content_t.

Now we need to compile and install the code, note that you need to have selinux-policy-devel, package installed.

make -f /usr/share/selinux/devel/Makefile myfiletrans.pp
semodule -i myfiletrans.pp

Lets test it out.

# mkdir /var/www/html/rwstorage
# ls -ldZ /var/www/html/rwstorage
drwxr-xr-x. 2 root root unconfined_u:object_r:httpd_sys_rw_content_t:s0 4096 Apr  5 08:02 /var/www/html/rwstorage

Lets make sure the old behaviour still works.

# mkdir /var/www/html/rwstorage1
# ls -lZ /var/www/html/rwstorage1 -d
drwxr-xr-x. 2 root root unconfined_u:object_r:httpd_sys_content_t:s0 4096 Apr  5 08:04 /var/www/html/rwstorage1

This is an excellent way to customize your policy, if you continuously see content being created with the incorrect label.

Boolean: virt_use_execmem What? Why? Why not Default?
In a recent bugzilla, the reporter was asking about what the virt_use_execmem.

  • What is it?

  • What did it allow?

  • Why was it not on by default?

What is it?

Well lets first look at the AVC

type=AVC msg=audit(1448268142.167:696): avc:  denied  { execmem } for  pid=5673 comm="qemu-system-x86" scontext=system_u:system_r:svirt_t:s0:c679,c730 tcontext=system_u:system_r:svirt_t:s0:c679,c730 tclass=process permissive=0

If you run this under audit2allow it gives you the following message:

#============= svirt_t ==============

#!!!! This avc can be allowed using the boolean 'virt_use_execmem'
allow svirt_t self:process execmem;

Setroubleshoot also tells you to turn on the virt_use_execmem boolean.

# setsebool -P virt_use_execmem 1

What does the virt_use_execmem boolean do?

# semanage boolean -l | grep virt_use_execmem
virt_use_execmem               (off  ,  off)  Allow confined virtual guests to use executable memory and executable stack

Ok what does that mean?  Uli Drepper back in 2006 added a series of memory checks to the SELInux kernel to handle common
attack vectors on programs using executable memory.    Basically these memory checks would allow us to stop a hacker from taking
over confined applications using buffer overflow attacks.

If qemu needs this access, why is this not enabled by default?

Using standard kvm vm's does not require qemu to have execmem privilege.  execmem blocks certain attack vectors 
Buffer Overflow attack where the hacked process is able overwrite memory and then execute the code the hacked 
program wrote. 

When using different qemu emulators that do not use kvm, the emulators require execmem to work.  If you look at 
the AVC above, I highlighted that the user was running qemu-system-x86.  I order for this emulator to work it
needs execmem so we have to loosen the policy slightly to allow the access.  Turning on the virt_use_execmem boolean
could allow a qemu process that is susceptible to buffer overflow attack to be hacked. SELinux would not block this

Note: lots of other SELinux blocks would still be in effect.

Since most people use kvm for VM's we disable it by default.

I a perfect world, libvirt would be changed to launch different emulators with different SELinux types, based on whether or not the emulator
requires execmem.   For example svirt_tcg_t is defined which allows this access.

Then you could run svirt_t kvm/qemus and svirt_tcg_t/qemu-system-x86 VMs on the same machine at the same time without having to lower
the security.  I am not sure if this is a common situation, and no one has done the work to make this happen.


Log in

No account? Create an account