What is the spc_t container type, and why didn't we just run as unconfined_t?
danwalsh
What is spc_t?

SPC stands for Super Privileged Container: a container that contains software used to manage the host system it runs on.  Since these containers could do anything on the system, and we don't want SELinux blocking any access, we made spc_t an unconfined domain.

If you are on an SELinux system and run docker with SELinux separation turned off, the containers will run with the spc_t type.

You can disable SELinux container separation in docker in several different ways.

  • You build docker from source without the BUILDTAG=selinux flag.

  • You run the docker daemon without the --selinux-enabled flag.

  • You run a container with the --security-opt label:disable flag

          docker run -ti --security-opt label:disable fedora sh

  • You share the PID namespace or IPC namespace with the host

         docker run -ti --pid=host --ipc=host fedora sh
         
Note: we have to disable SELinux separation with --ipc=host and --pid=host because SELinux would otherwise block access to processes and IPC mechanisms on the host.

Why not use unconfined_t?

The question that comes up is: why not just run as unconfined_t?  A lot of people falsely assume that unconfined_t is the only unconfined domain.  But unconfined_t is a user domain.  We block most confined domains from communicating with the unconfined_t domain, since this is probably the domain that the administrator is running with.

What is different about spc_t?

First off, the type docker runs as (docker_t) can transition to spc_t; it is not allowed to transition to unconfined_t. It transitions to this domain when it executes programs located under /var/lib/docker:

# sesearch -T -s docker_t | grep spc_t
   type_transition container_t docker_share_t : process spc_t;
   type_transition container_t docker_var_lib_t : process spc_t;
   type_transition container_t svirt_sandbox_file_t : process spc_t;


Secondly, and most importantly, confined domains are allowed to connect to unix domain sockets created by processes running as spc_t.

This means I could run a service as a container process and have it create a socket under /run on the host system, and other confined domains on the host could communicate with the service.

For example, if you wanted to create a container that runs sssd and wanted to allow confined domains to get passwd information from it, you could run it as spc_t, and the confined login programs would be able to use it.
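
As a rough sketch (the image name sssd-image and the volume path are illustrative, not a tested recipe), you would run such a container with SELinux separation disabled so that its processes run as spc_t:

# Hypothetical: run an sssd image as spc_t (label:disable) so it can create
# sockets on the host that confined login programs are allowed to connect to
docker run -d --security-opt label:disable \
    -v /var/lib/sss/pipes:/var/lib/sss/pipes \
    sssd-image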

Conclusion:

Sometimes you need to create an unconfined domain that you want one or more confined domains to be able to communicate with. In this situation it is usually better to create a new domain rather than reusing unconfined_t.

Fun with bash, or how I wasted an hour trying to debug some SELinux test scripts.
danwalsh
We are working to get SELinux and Overlayfs to work well together.  Currently you cannot run docker containers
with SELinux on an Overlayfs back end.  You should see the patches posted to the kernel list within a week.

I have been tasked with writing selinux-testsuite tests to verify that overlayfs works correctly with SELinux.
These tests will help people understand what we intended.

One of the requirements for overlayfs/SELinux is to check not only the access of the task process doing the access,
but also the label of the process that originally set up the overlayfs mount.

In order to do the test I created two process types, test_overlay_mounter_t and test_overlay_client_t, and then used
runcon to execute a bash script in the correct context.  I added code like the following to the test to make sure that the runcon command was working:

# runcon -t test_overlay_mounter_t bash <<EOF
echo "Mounting as $(id -Z)"
...
EOF


The problem was that when I ran the tests, I saw the following:

Mounting as unconfined_u:unconfined_r:unconfined_t:s0-s0:c0.c1023
...


Sadly it took me an hour to diagnose what was going on, writing several test scripts and running commands by hand.  Sometimes it seemed to work and other times it would not.  I thought there was a problem with runcon or with my SELinux policy.  Finally I took a break, came back to the problem, and realized that the problem was with bash: the $(id -Z) was
executed by the outer shell before the runcon command ever ran.

Sometimes you feel like an idiot.

runcon -t test_overlay_mounter_t bash <<EOF
echo "Mounting as $(id -Z)"
echo -n "Mounting as "
id -Z
EOF
Mounting as unconfined_u:unconfined_r:unconfined_t:s0-s0:c0.c1023
Mounting as unconfined_u:unconfined_r:test_overlay_mounter_t:s0-s0:c0.c1023
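
The standard bash fix is to quote the here-document delimiter, which stops the outer shell from expanding $(id -Z) and leaves the expansion to the bash that runcon starts:

# Quoting the delimiter ('EOF') disables expansion in the here-document,
# so $(id -Z) runs inside the test_overlay_mounter_t shell
runcon -t test_overlay_mounter_t bash <<'EOF'
echo "Mounting as $(id -Z)"
EOF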


My next blog will explain how we expect overlayfs to work with SELinux.

Passing Unix Socket File Descriptors between containers processes blocked by SELinux.
danwalsh
SELinux controls the passing of socket file descriptors between processes.

A Fedora user posted a bugzilla complaining about SELinux blocking the transfer of socket file descriptors between two docker containers.

Let's look at what happens when a socket file descriptor is created by a process.

When a process creates a socket, or accepts a connection from a remote system, the resulting file descriptor automatically gets assigned the same label as the process creating the socket.  For example, when the docker service (docker_t) listens on /var/run/docker.sock and a client connects, the docker service's end of the connection gets labeled by default with the label of the docker process.  On my machine this is:

system_u:system_r:docker_t:s0

The client is probably running as unconfined_u:unconfined_r:unconfined_t:s0-s0:c0.c1023.  SELinux would then check to make sure that unconfined_t is able to connect to docker_t sockets.
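
You can check for the allow rule governing that connection with sesearch; this is a sketch, since class and permission names can vary with your policy version:

# Is unconfined_t allowed to connect to a unix stream socket
# whose label is docker_t?
sesearch -A -s unconfined_t -t docker_t -c unix_stream_socket -p connectto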

If this socket descriptor is passed to another process, the new process's label has to have access to the socket's label.  If it does not, SELinux will block the transfer.

In containers, even though by default all container processes have the same SELinux type, they have different MCS labels.

If I have a process labeled system_u:system_r:svirt_lxc_net_t:s0:c1,c2 and I pass that file descriptor to a process in a different container labeled system_u:system_r:svirt_lxc_net_t:s0:c4,c5, SELinux will block the access.

The bug reporter was reporting that by default he was not able to pass the descriptor, which is goodness. We would not want a confined container to be able to read/write socket file descriptors from another container by default.

The reporter also figured out that he could get this to work by disabling SELinux either on the host or inside of the container.

Surprisingly, he also figured out that if he shared IPC namespaces between the containers, SELinux would not block the transfer.

The reason for this is that when containers share the same IPC namespace, docker automatically causes them to share the same SELinux label.  If docker did not do this, SELinux would block processes in container A from accessing IPC objects created in container B.  With a shared IPC namespace the SELinux labels for both of the reporter's containers were the same, so SELinux allowed the passing.

How would I make two containers share the same SELinux labels?

Docker by default launches all containers with the same type field but different MCS labels.  I told the reporter that you could cause two containers to run with the same MCS labels by using the --security-opt label:level:MCSLABEL option.

Something like this will work:

docker run -it --rm --security-opt label:level:s0:c1000,c1001 --name server -v myvol:/tmp test /server
docker run -it --rm --security-opt label:level:s0:c1000,c1001 --name client -v myvol:/tmp test /client
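
To confirm, something like this should show both container processes running with the same MCS pair (the type name may differ on your system):

# Both processes should show a context ending in s0:c1000,c1001
ps -eZ | grep c1000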


These containers would then run with the same MCS labels, which gives the reporter the best security possible while still allowing the two containers to pass the socket between them.  The containers would still be locked down by SELinux from the host and from other containers; they would, however, be able to attack each other from an SELinux point of view, although the other container-separation mechanisms would still be in effect to help prevent such attacks.

It's a good thing SELinux blocks access to the docker socket.
danwalsh
I have seen lots of SELinux bugs being reported where users are running a container that volume mounts the docker.sock into the container.  The container then uses a docker client to do something with docker. While I appreciate that a lot of these containers probably need this access, I am not sure people realize that this is equivalent to giving the container full root outside of the container on the host system.  I just execute the following command and I have full root access on the host:

docker run -ti --privileged -v /:/host fedora chroot /host

SELinux definitely shows its power in this case by blocking the access.  From a security point of view, we definitely want to block all confined containers from talking to the docker.sock.  Sadly, the other security mechanisms on by default in containers do NOT block this access.  If a process somehow breaks out of a container and gets write access to the docker.sock on an SELinux-disabled system, your system is pwned. (User Namespace, if it is enabled, will also block this access going forward.)

If you have to run a container that talks to the docker.sock, you need to turn off the SELinux protection. There are two ways to do this.

You can turn off all container security separation by using the --privileged flag. Since you are giving the container full access to your system from a security point of view, you probably should just do this:

docker run --privileged -v /run/docker.sock:/run/docker.sock POWERFULLCONTAINER

If you want to disable just SELinux, you can do this by using the --security-opt label:disable flag:

docker run --security-opt label:disable -v /run/docker.sock:/run/docker.sock POWERFULLCONTAINER

Note: if in the future you are using User Namespace and hit this problem, a new --userns=host flag is being
developed which will turn off user namespacing within the container.

Adding a new filename transition rule.
danwalsh
Way back in 2012 we added File Name Transition Rules.  These rules allow us to create content with the correct label
in a directory with a different label.  Prior to File Name Transition Rules, administrators and tools like init scripts creating content in a directory had to remember to execute restorecon on the new content.  In a lot of cases they would forget,
and we would end up with mislabeled content; in some cases this opened up a race condition where the data would be
temporarily mislabeled and could cause security problems.

I recently received this email and figured I should write a blog.

Hiya everyone. I'm an SELinux noob.

I love the newish file name transition feature. I was first made aware of it some time after RHEL7 was released (http://red.ht/1VhtaHI), probably thanks to some mention from Simon or one of the rest of you on this list. For things that can't be watched with restorecond, this feature is so awesome.

Can someone give me a quick tutorial on how I could add a custom rule? For example:


filetrans_pattern(unconfined_t, httpd_sys_content_t, httpd_sys_rw_content_t, dir, "rwstorage")


Of course the end goal is that if someone creates a dir named "rwstorage" in /var/www/html, that dir will automatically get the httpd_sys_rw_content_t type. Basically I'm trying to make a clone of the existing rule that does the same thing for "/var/www/html(/.*)?/uploads(/.*)?".
Thanks for reading.

First you need to create a source file, myfiletrans.te:

policy_module(myfiletrans, 1.0)
gen_require(`
    type unconfined_t, httpd_sys_content_t, httpd_sys_rw_content_t;
')
filetrans_pattern(unconfined_t, httpd_sys_content_t, httpd_sys_rw_content_t, dir, "rwstorage")


Quickly looking at the code we added: when writing policy, if you use types (unconfined_t, httpd_sys_content_t, httpd_sys_rw_content_t) that are defined in other policy packages, you need to declare them in a gen_require block.  This is similar to declaring extern variables in "C".  Then we call the filetrans_pattern interface.  This code tells the kernel that if a process running as unconfined_t creates a dir named rwstorage in a directory labeled httpd_sys_content_t, it should create the directory as httpd_sys_rw_content_t.

Now we need to compile and install the code; note that you need to have the selinux-policy-devel package installed:

make -f /usr/share/selinux/devel/Makefile myfiletrans.pp
semodule -i myfiletrans.pp
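
Depending on your setools version, you should be able to verify that the rule loaded with sesearch; this is a sketch, and the output format differs between setools releases:

# Show type transitions from unconfined_t on dirs in httpd_sys_content_t,
# which should now include the named "rwstorage" filename transition
sesearch -T -s unconfined_t -t httpd_sys_content_t -c dir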


Let's test it out.

# mkdir /var/www/html/rwstorage
# ls -ldZ /var/www/html/rwstorage
drwxr-xr-x. 2 root root unconfined_u:object_r:httpd_sys_rw_content_t:s0 4096 Apr  5 08:02 /var/www/html/rwstorage


Let's make sure the old behaviour still works.

# mkdir /var/www/html/rwstorage1
# ls -lZ /var/www/html/rwstorage1 -d
drwxr-xr-x. 2 root root unconfined_u:object_r:httpd_sys_content_t:s0 4096 Apr  5 08:04 /var/www/html/rwstorage1


This is an excellent way to customize your policy if you continuously see content being created with the incorrect label.

Boolean: virt_use_execmem What? Why? Why not Default?
danwalsh
In a recent bugzilla, the reporter was asking about the virt_use_execmem boolean.

  • What is it?

  • What does it allow?

  • Why was it not on by default?


What is it?

Well, let's first look at the AVC:

type=AVC msg=audit(1448268142.167:696): avc:  denied  { execmem } for  pid=5673 comm="qemu-system-x86" scontext=system_u:system_r:svirt_t:s0:c679,c730 tcontext=system_u:system_r:svirt_t:s0:c679,c730 tclass=process permissive=0

If you run this under audit2allow, it gives you the following message:


#============= svirt_t ==============

#!!!! This avc can be allowed using the boolean 'virt_use_execmem'
allow svirt_t self:process execmem;


Setroubleshoot also tells you to turn on the virt_use_execmem boolean.

# setsebool -P virt_use_execmem 1

What does the virt_use_execmem boolean do?

# semanage boolean -l | grep virt_use_execmem
virt_use_execmem               (off  ,  off)  Allow confined virtual guests to use executable memory and executable stack


OK, what does that mean?  Uli Drepper back in 2006 added a series of memory checks to the SELinux kernel to handle common
attack vectors on programs using executable memory.  Basically these memory checks allow us to stop a hacker from taking
over confined applications using buffer overflow attacks.

If qemu needs this access, why is this not enabled by default?

Using standard kvm VMs does not require qemu to have the execmem privilege.  Blocking execmem defeats certain attack vectors,
such as buffer overflow attacks where the hacked process is able to overwrite memory and then execute the code the hacked
program wrote.

When using qemu emulators that do not use kvm, the emulators require execmem to work.  If you look at
the AVC above, I highlighted that the user was running qemu-system-x86.  In order for this emulator to work it
needs execmem, so we have to loosen the policy slightly to allow the access.  Turning on the virt_use_execmem boolean
could allow a qemu process that is susceptible to a buffer overflow attack to be hacked; SELinux would not block this
attack.

Note: lots of other SELinux blocks would still be in effect.

Since most people use kvm for VMs, we disable it by default.



In a perfect world, libvirt would be changed to launch different emulators with different SELinux types, based on whether or not the emulator
requires execmem.  For example, svirt_tcg_t is defined, which allows this access.

Then you could run svirt_t kvm VMs and svirt_tcg_t qemu-system-x86 VMs on the same machine at the same time without having to lower
the security.  I am not sure how common this situation is, and no one has done the work to make this happen.

How come MCS Confinement is not working in SELinux even in enforcing mode?
danwalsh
MCS separation is a key feature in sVirt technology.

We currently use it for separation of our virtual machines, using libvirt to launch VMs with different MCS labels.  SELinux sandbox relies on it to separate its sandboxes, OpenShift relies on this technology for separating users, and now docker uses it to separate containers.

When I discover a hammer, everything looks like a nail.

I recently saw this email.

"I have trouble understanding how MCS labels work, they are not being enforced on my RHEL7 system even though selinux is "enforcing" and the policy used is "targeted". I don't think I should be able to access those files:

$ ls -lZ /tmp/accounts-users /tmp/accounts-admin
-rw-rw-r--. backup backup guest_u:object_r:user_tmp_t:s0:c3 /tmp/accounts-admin
-rw-rw-r--. backup backup guest_u:object_r:user_tmp_t:s0:c99 /tmp/accounts-users
backup@test ~ $ id
uid=1000(backup) gid=1000(backup) groups=1000(backup)
context=guest_u:guest_r:guest_t:s0:c1

root@test ~ # getenforce
Enforcing

I can still access them even though they have different labels (c3 and
c99 as opposed to my user having c1).
backup@test ~ $ cat /tmp/accounts-users
domenico balance: -30
backup@test ~ $ cat /tmp/accounts-admin
don't lend money to domenico

Am I missing something?
"

MCS is different from type enforcement.

We decided not to apply MCS separation to every type.  We only apply it to the types that we plan on running in a multi-tenant way.  Basically it is for types that should have the same access to the system, but no access to each other.  We introduced an attribute called mcs_constrained_type.

On my Fedora Rawhide box I can look for these types:

seinfo -amcs_constrained_type -x
   mcs_constrained_type
      netlabel_peer_t
      docker_apache_t
      openshift_t
      openshift_app_t
      sandbox_min_t
      sandbox_x_t
      sandbox_web_t
      sandbox_net_t
      svirt_t
      svirt_tcg_t
      svirt_lxc_net_t
      svirt_qemu_net_t
      svirt_kvm_net_t

If you add the mcs_constrained_type attribute to a type, the kernel will start enforcing MCS separation on that type.

Adding a policy like this will MCS-confine guest_t:

# cat myguest.te 
policy_module(myguest, 1.0)
gen_require(`
    type guest_t;
    attribute mcs_constrained_type;
')

typeattribute guest_t mcs_constrained_type;

# make -f /usr/share/selinux/devel/Makefile
# semodule -i myguest.pp

Now I want to test this out.  First I have to allow the guest_u user to use multiple MCS labels.  You would not
have to do this with non-user types.

# semanage user -m -r s0-s0:c0.c1023 guest_u
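
You can verify the change with:

# guest_u should now show the MCS range s0-s0:c0.c1023
semanage user -l | grep guest_u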

Create some content to read and change its MCS label:

# echo Read It > /tmp/test
# chcon -l s0:c1,c2 /tmp/test
# ls -Z /tmp/test
unconfined_u:object_r:user_tmp_t:s0:c1,c2 /tmp/test

Now log in as a guest user:

# id -Z
guest_u:guest_r:guest_t:s0:c1,c2
# cat /tmp/test
Read It

Now log in as a guest user with a different MCS label:

# id -Z
guest_u:guest_r:guest_t:s0:c3,c4
# cat /tmp/test
cat: /tmp/test: Permission denied

libselinux is a liar!!!
danwalsh
On an SELinux enabled machine, why does getenforce in a docker container say it is disabled?

SELinux is not namespaced

This means that there is only one SELinux rules base for all containers on a system.  When we attempt to confine containers, we want to prevent them from writing to kernel file systems, which might be one mechanism for escape.  One of those file systems is /sys/fs/selinux, and we also want to control their access to things like the /proc/self/attr/* fields.

By default docker processes run as svirt_lxc_net_t, and they are prevented from doing (almost) all SELinux operations.  But processes within containers do not know that they are running within a container, and SELinux-aware applications are going to attempt to do SELinux operations, especially if they are running as root.

For example, if you are running yum/dnf/rpm inside of a docker build container and the tool sees that SELinux is enabled, the tool is going to attempt to set labels on the file system.  If SELinux blocks the setting of these file labels, the calls will fail, causing the tool to fail and exit.  Because of this, SELinux-aware applications within containers would mostly fail.

Libselinux is a liar

We obviously do not want these apps failing, so we decided to make libselinux lie to the processes.  libselinux checks whether /sys/fs/selinux is mounted on the system and whether it is mounted read/write.  If /sys/fs/selinux is not mounted read/write, libselinux will report to calling applications that SELinux is disabled.  In containers we don't mount this file system by default, or we mount it read-only, causing libselinux to report that SELinux is disabled.
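
You can see what libselinux sees by checking how, or whether, selinuxfs is mounted.  Inside a default container the same check comes back empty or read-only:

# On an SELinux host this shows something like:
# selinuxfs on /sys/fs/selinux type selinuxfs (rw,...)
mount | grep selinuxfs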

# getenforce
Enforcing
# docker run --rm fedora id -Z
id: --context (-Z) works only on an SELinux-enabled kernel

# docker run --rm -v /sys/fs/selinux:/sys/fs/selinux:ro fedora id -Z
id: --context (-Z) works only on an SELinux-enabled kernel
# docker run --rm -v /sys/fs/selinux:/sys/fs/selinux fedora id -Z
system_u:system_r:svirt_lxc_net_t:s0:c196,c525


When SELinux-aware applications like yum/dnf/rpm see SELinux as disabled, they stop trying to do SELinux operations and succeed within containers.

Applications work well even though SELinux is very much enforcing and controlling their activity.

I believe that SELinux is the best tool we currently have to make containers actually contain.

In this case SELinux disabled does not make me cry. 

nsenter gains SELinux support
danwalsh
nsenter is a program that allows you to run a program within the namespaces of other processes.

This tool is often used to enter containers like docker, systemd-nspawn or rocket.  It can be used for debugging or for scripting
tools to work inside of containers.  One problem it had was that the process entering the container could potentially
be attacked by processes within the container.  From an SELinux point of view, you might be injecting an unconfined_t process
into a container that is running as svirt_lxc_net_t.  We wanted a way to change the process context when it enters the container
to match that of the process whose namespaces you are entering.

As of util-linux-2.27, nsenter now has this support.

man nsenter
...
       -Z, --follow-context
              Set the SELinux security context used for executing a new
              process according to the already running process specified by
              --target PID. (The util-linux has to be compiled with SELinux
              support otherwise the option is unavailable.)


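As a quick sketch of using this with docker (the container name mycontainer is made up for illustration):

# Find the PID of the container's init process
PID=$(docker inspect --format '{{.State.Pid}}' mycontainer)
# Enter its mount, UTS, IPC, network and PID namespaces; with -Z the new
# shell runs in the SELinux context of the target process
nsenter --target $PID --mount --uts --ipc --net --pid -Z sh
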

docker exec already does this, but nsenter gives debuggers, testers, and script writers a new tool to use with namespaces and containers.

'CVE-2015-4495 and SELinux', Or why doesn't SELinux confine Firefox?
danwalsh
Why don't we confine Firefox with SELinux?

That is one of the most frequently asked questions, especially after a new CVE like CVE-2015-4495 shows up.  This vulnerability in firefox allows a remote site to grab any file in your home directory.  If you can read the file, then firefox can read it and send it back to the website that infected your browser.

The big problem with confining desktop applications is the way the desktop has been designed.

I wrote about confining the desktop several years ago. 

As I explained then, the problem is that applications are allowed to communicate with each other in lots of different ways. Here are just a few:

*   X Windows.  All apps need full access to the X Server. I tried several years ago to block applications' access to the keyboard settings, in order to block keystroke logging (google xspy).  I was able to get it to work, but a lot of applications started to break.  Other access that you would want to block in X would be screen capture and access to the cut/paste buffer, but blocking
these would cause too much breakage on the system.  XAce was an attempt to add MAC controls to X; it is used in MLS environments, but I believe it causes too much breakage.
*   File system access.  Users expect firefox to be able to upload and download files anywhere they want on the desktop.  If I were czar of the OS, I could state that uploaded files must come from ~/Upload and downloaded files go into ~/Download, but then users would want to upload photos from ~/Photos, or to create their own random directories.  Blocking access to any particular directory, including .ssh, would be difficult, since someone probably has a web-based ssh session or some other tool that uses an ssh public key to authenticate.  (This is the biggest weakness described in CVE-2015-4495.)
*   Dbus communications, as well as gnome shell, shared memory, the kernel keyring, access to the camera and microphone ...

Everyone expects all of these to just work, so blocking them with MAC tools like SELinux is more likely to lead to "setenforce 0" than to actually add a lot of security.

Helper Applications.

One of the biggest problems with confining a browser is helper applications.  Let's imagine I ran firefox with the SELinux type firefox_t.  The user clicks on a .odf file or a .doc file; the browser downloads the file and launches LibreOffice so the user
can view it.  Should LibreOffice run as LibreOffice_t or firefox_t?  If it runs as LibreOffice_t, and the LibreOffice_t app was looking at a different document, the content might be able to subvert the process.  If I run LibreOffice as firefox_t, what happens when the user launches a document off of his desktop?  It will not launch a new LibreOffice; it will just communicate with the running LibreOffice and open the document, making it accessible to firefox_t.

Confining Plugins.

For several years now we have been confining plugins with SELinux in Firefox and Chrome.  This prevents tools like flashplugin
from having much access to the desktop.  But we have had to add booleans to turn off the confinement, since certain plugins end up wanting more access:

mozilla_plugin_bind_unreserved_ports --> off
mozilla_plugin_can_network_connect --> off
mozilla_plugin_use_bluejeans --> off
mozilla_plugin_use_gps --> off
mozilla_plugin_use_spice --> off
unconfined_mozilla_plugin_transition --> on
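
If a plugin needs the extra access, you can flip the relevant boolean; for example:

# Permanently allow confined browser plugins to use the GPS
setsebool -P mozilla_plugin_use_gps 1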


SELinux Sandbox

I did introduce the SELinux Sandbox a few years ago.

The SELinux sandbox allows you to confine desktop applications using container technologies including SELinux.  You can run firefox, LibreOffice, evince ... in their own isolated desktops.  It is quite popular, but users must choose to use it.  It is not used by default, and it can cause unexpected breakage; for example, you are not allowed to cut and paste from one window to another.
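
Launching a sandboxed firefox looks something like this (a sketch; check the sandbox man page on your release for the exact options):

# Run firefox in its own isolated X session, confined as sandbox_web_t
sandbox -X -t sandbox_web_t firefox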

Hope is on the way.

Alex Larsson is working on a new project to change the way desktop applications run, called Sandboxed Applications.

Alex explains that there are two main goals of his project:

* We want to make it possible for 3rd parties to create and distribute applications that work on multiple distributions.
* We want to run the applications with as little access as possible to the host. (For example user files or network access)

The second goal might allow us to really lock down firefox and friends in a way similar to what Android is able to do on your cell phone (SELinux/SEAndroid blocks lots of access for the web browser).

Imagine that when a user wants to upload a file, he talks to the desktop rather than directly to firefox, and the desktop
hands the file to firefox.  Firefox could then be prevented from touching anything in the homedir.  Also, if a user wanted to
save a file, firefox would ask the desktop to launch the file browser, which would run in the desktop context.  When the user
selected where to save the file, the file browser would give firefox a descriptor to write the file to.

Similar controls could isolate firefox from the camera, microphone, etc.

Wayland, which will eventually replace X Windows, also provides better isolation of applications.

Needless to say, I am anxiously waiting to see what Alex and friends come up with.

The combination of container technology, including namespaces, and SELinux gives us a chance at controlling the desktop.
