SELinux blocks loading kernel modules

The kernel has a feature where it will load certain kernel modules for a process, when certain syscalls are made.  For example, loading a kernel module when a process attempts to create a different network socket.  

I wrote a blog on explaining how this is probably a bad idea from a containers perspective.  I don't want to allow container processes to trigger modifications of the kernel.  And potentially causing the kernel to load risky modules that could have vulnerabilities in them.  I say, let the Administrator or packagers decide what kernel modules need to be loaded and then make the containers live with what is provided for them.  Here is a link to the blog.

Why all the DAC_READ_SEARCH AVC messages?

If you followed SELinux policy bugs being reported in bugzilla you might have noticed a spike in messages about random domains being denied DAC_READ_SEARCH.

Let's quickly look at what the DAC_READ_SEARCH capability is.  In Linux the power of "root" was broken down into 64 distinct capabilities.  Things like being able to load kernel modules or bind to ports less then 1024.  Well DAC_READ_SEARCH is one of these.

DAC stands for Discretionary Access Control, which is what most people understand as standard Linux permissions, Every process has owner/group.  All file system objects are  assigned owner, group and permission flags.  DAC_READ_SEARCH allows a privilege process to ignore parts of DAC for read and search.

man capabilities



              * Bypass file read permission checks and directory read and execute permission checks;

There is another CAPABILITY called DAC_OVERRIDE


              Bypass file read, write, and execute permission checks.

As you can see DAC_OVERRIDE is more powerful then DAC_READ_SEARCH, in that it can write and execute content ignoring DAC rules, as opposed to just reading the content.

Well why did we suddenly see a spike in confined domains needing DAC_READ_SEARCH?

Read more...Collapse )

New version of buildah 0.2 released to Fedora.
New features and bugfixes in this release

Updated Commands
buildah run
     Add support for -- ending options parsing
     Add a way to disable PTY allocation
     Handle run without an explicit command correctly
Buildah build-using-dockerfile (bud)
    Ensure volume points get created, and with perms
buildah containers
     Add a -a/--all option - Lists containers not created by buildah.
buildah Add/Copy
     Support for glob syntax
buildah commit
     Add flag to remove containers on commit
buildah push
     Improve man page and help information
buildah export:
    Allows you to export a container image
buildah images:
    update commands
    Add JSON output option
buildah rmi
    update commands
buildah containers
     Add JSON output option

New Commands
buildah version
     Identify version information about the buildah command
buildah export
     Allows you to export a containers image

Buildah docs: clarify --runtime-flag of run command
Update to match newer storage and image-spec APIs
Update containers/storage and containers/image versions

Quick Blog on Buildah.
Buildah is a new tool that we released last week for building containers without requiring a container runtime daemon running. --nodockerneeded

Here is a blog that talks about some of its features.

Our main goal was to make this simple.  I was asked by a fellow engineer about a feature that docker has for copying a file out of a container onto the host.  "docker cp".  In docker this ends up being a client server operation, and required someone to code it up.  We don't have this feature in buildah.  :^(

BUT, buildah gives you the primitives you need to do simpler functionality and allows you to use the full power of bash.  If I want to copy a file out of a container, I can simply mount the container and copy it out.

# mnt=$(buildah mount CONTAINER_ID)
# buildah umount CONTAINER_ID

The beauty of this is we could use lots of tools, I could scp if I wanted to copy to another machine, or rsync, or ftp...

Once your have the container mounted up, you can use any bash command on it, to move files in or out.

buildah == simplicity

What capabilities do I really need in my container?

I have written previous blogs discussing using linux capabilities in containers.

Recently I gave a talk in New York and someone in the audience asked me about how do they figure out what capabilities their containers require?

This person was dealing with a company that was shipping their software as a container image, but they had instructed the buyer, that you would have to run their container ‘fully privileged”.  He wanted to know what privileges the container actually needed.  I told him about a project we worked on a few years ago, we called Friendly Eperm.

Permission Denied!  WHY?

A few years ago the SELinux team realized that more and more applications were getting EPERM returns when a syscall requested some access.  Most operators understood EPERM (Permission Denied) inside of a log file to mean something was wrong with the Ownership of a process of the contents it was trying to access or the permission flags on the object were wrong.  This type of Access Control is called DAC (Discretionary Access Control) and under certain conditions SELinux also caused the kernel to return EPERM.  This caused Operators to get confused and is one of the reasons that Operators did not like SELinux. They would ask, why didn’t httpd report that Permission denied because of SELinux?  We realized that there was a growing list of other tools besides regular DAC and SELinux which could cause EPERM.  Things like SECCOMP, Dropped Capabilities, other LSM …   The problem was that the processes getting the EPERM had no way to know why they got EPERM.  The only one that knew was the kernel and in a lot of cases the kernel was not even logging the fact that it denied access.  At least SELinux denials usually show up in the audit log (AVCs).   The goal of Friendly EPERM was to allow the processes to figure out why they got EPERM and make it easier for admin to diagnose.

Here is the request that talks about the proposal.

The basic idea was to have something in the /proc file system which would identify why the previous EPERM happened.  You are running a process, say httpd, and it gets permission denied. Now somehow the process can get information on why it got permission denied.  One suggestion was that we enhanced the libc/kernel to provide this information. The logical place for the kernel to reveal it would be in /proc/self.  But the act of httpd attempting to read the information out of /proc/self itself could give you a permission denied.  Basically we did not succeed because it would be a race condition, and the information could be wrong.

Here is a link to the discussion!msg/fa.linux.kernel/WQyHPUdvodE/ZGTnxBQw4ioJ

Bottom line, no one has figured a way to get this information out of the kernel.


Later I received an email discussing the Friendly EPERM product and asking if there was a way to at least figure out what capabilities the application needed.

I wondered if the audit subsystem would give us anything here.  But I contacted the Audit guys at Red Hat, Steve Grubb and Paul Moore,  and they informed me that there is no Audit messages generated when DAC Capabilities are blocked.

An interesting discussion occurred in the email chain:

DWALSH: Well I would argue most developers have no idea what capabilities their application requires.

SGRUBB: I don't think people are that naive. If you are writing a program that runs as root and then you get the idea to run as a normal user, you will immediately see your program crash. You would immediately look at where it’s having problems. Its pretty normal to lookup the errno on the syscall man page to see what it says about it. They almost always list necessary capabilities for that syscall. If you are an admin restricting software you didn't write, then it’s kind of a  puzzle. But the reason there's no infrastructure is because historically it’s never been a problem because the software developer had to choose to use capabilities and it’s incumbent on the developer to know what they are doing.  With new management tools offering to do this for you, I guess it’s new territory.

But here we had a vendor telling a customer that it needed full root, ALL Capabilities,  to run his application,

DWALSH:  This is exactly what containers are doing.  Which is why the emailer is asking.  A vendor comes to him telling him it needs all Capabilities.  The emailer does not believe them and wants to diagnose what they actually need.

DWALSH: With containers and SELinux their is a great big "TURN OFF SECURITY" button, which is too easy for software packagers to do, and then they don't have to figure out exactly what their app needs.

Paul Moore - Red Hat SELinux Kernel Engineer suggested

That while audit can not record the DAC Failures, SELinux also enforces the capability checks.  If we could put the processes into a SELinux type that had no capabilities by default, then ran the process with full capabilities and SELinux in permissive mode, we could gather the SELinux AVC messages indicating which capabilities the application required to run.

“ (Ab)using security to learn through denial messages. What could possibly go wrong?! :)

After investigating further, turns out the basic type used to run containers, `container_t`, can be setup to have no capabilities by turning off an SELinux boolean.

To turn off the capabilities via a boolean, and put the machine into permissive mode.

setsebool virt_sandbox_use_all_caps=0

setenforce 0

Now execute the application via docker with all capabilities allowed.

docker run --cap-add all IMAGE ...

Run and test the application. This should cause SELinux to generate AVC messages about capabilities used.

grep capability /var/log/audit/audit.log

type=AVC msg=audit(1495655327.756:44343): avc:  denied  { syslog } for  pid=5246 comm="rsyslogd" capability=34  scontext=system_u:system_r:container_t:s0:c795,c887 tcontext=system_u:system_r:container_t:s0:c795,c887 tclass=capability2   


Now you know your list.

Turns out the application the emailer was trying to containerize was a tool which was allowed to manipulate the syslog system, and the only capability it needed was CAP_SYSLOG.  The emailer should be able to run the container by simply adding the CAP_SYSLOG capability and everything else about the container should be locked down.

docker run --cap-add syslog IMAGE ...


After writing this blog, I was pointed to

Find what capabilities an application requires to successful run in a container

Which is similar in that it finds out the capabilities needed for a container/process by using SystemTap.

SELinux and --no-new-privs and the setpriv command.

SELinux transitions are in some ways similar to a setuid executable in that when a transition happens the new process has different security properties then the calling process.  When you execute setuid executable, your parent process has one UID, but the child process has a different UID.

The kernel has a way for block these transitions, called --no-new-privs.  If I turn on --no-new-privs, then setuid applications like sudo, su and others will no longer work.   Ie you can not get more privileges then the parent process.

SELinux does something similar and in most cases the transition is just blocked.

For example.

We have a rule that states a httpd_t process executing a script labeled httpd_sys_script_exec_t (cgi label) will transition to httpd_sys_script_t.  But if you tell the kernel that your process tree has --no-new-privs, depending on how you wrote the policy , when the process running as httpd_t executes the httpd_sys_script_exec_t it will no longer transition, it will attempt to continue to run the script as httpd_d.

SELinux enforces that the new transition type must have a subset of the allow rules of its parent process.  But it can have no other allow rules.  IE the transitioned process can not get any NEW Privs.

This feature has not been used much in SELinux Policy, so we have found and fixed a few issues.

Container Runtimes like docker and runc have now added the --no-new-privs flag.  We have been working to make container-selinux follow the rules.

The container runtimes running container_runtime_t can start a container_t process only if container_runtime_t has all of the privileges of its parent process.

In SELinux policy you write a rule like:

typebounds container_runtime_t container_t;

This tells the compiler of policy to make sure that the container_t is a subset. If you are running the compiler in strict mode, the compiler will fail, if it is not a subsection.
If you are not running in strict mode, the compiler will silently remove any allow rules that are not in the parent, which can cause some surprises.

setpriv command

I recently heard about the setpriv command which is pretty cool for playing with kernel security features like dropping capabilities, and SELinux. One of the things you can do is execute

$ setpriv --no-new-privs sudo sh
sudo: effective uid is not 0, is sudo installed setuid root?

But if you want to try out SELinux you could combine the setpriv command and runcon together

$ setpriv --no-new-privs runcon -t container_t id -Z
$ setpriv --no-new-priv runcon -t container_t id -Z
runcon: id: Operation not permitted

This happens because container_t is not type bounds by staff_t.

Be careful relabeling volumes with Container run times. Sometimes things can go very wrong?
I recently revieved an email from someone who made the mistake of volume mounting /root into his container with the :Z option.

docker run -ti -v /root:/root:Z fedora sh

The container ran fine, and everything was well on his server machine until the next time he tried to ssh into the server.

The sshd refused to allow him in?  What went wrong?

I wrote about using volumes and SELinux on the project atomic blog.  I explain their that in order to use a volume within a non privileged container, you need to relabel the content on the volume.  You can either use the :z or the :Z option.

:z will relabel with a shared label so other containers ran read and write the volume.
:Z will relabel with a private label so that only this specific container can read and write the volume.

I probably did not emphasize enough is that as Peter Parker (Spider Man) says: With great power comes great responsibility.

Meaning you have to be careful what you relabel.  Using one of the :Z and :z options to recursively change the labels of the source content to container_file_t. When doing this you must be sure that this content is truly private to the container.  The content is not needed by other confined domains.  For example doing a volume mount of -v /var/lib/mariadb:/var/lib/mariadb:Z for a mariadb container is probably a good idea. But while doing -v /var/lib:/var/lib:Z will work, it is probably a bad idea.

Back to the email, the user relabeled all of the content under /root with a label similar to
Later when he attempted to ssh in, the sshd daemon,  running as the sshd_t type, attempts to read content in /root/.ssh it gets permission denied, since sshd_t is not allowed to read container_file_t.

The emailer realized what happened and tried to fix this situation by running restorecon -R -v /root, but this failed to change them?

Why did the labels not change when he ran restorecon?

There is a little known feature of restorecon called customizable_types, that I talked about 10 years ago.

By default, restorecon does not change types defined in the customizable_types file.  These types can be randomly scattered around the file system, and we don't want a global relabel to change them.  This is meant to make it easier to the admin, but sometimes causes confusion.  The
-F option tells restorecon to force the relabel and ignore customizable_types.

restorecon -F -R /root

This comand will reset the labels under /root to the system default and allow the emailer to login to the system via sshd again.

We have safeguards built into the SELinux go bindings which prevent container runtimes relabeling of
/, /etc, and /usr.

I need to open a pull request to add a few more directories to help prevent users from making serious mistakes in labeling, starting with

Understanding SELinux Roles
I received a container bugzilla today for someone who was attempting to assign a container process to the object_r role.  Hopefully this blog will help explain how roles work with SELinux.

When we describe SELinux we often concentrate on Type Enforcement, which is the most important and most used feature of SELinux.  This is what describe in the SELinux Coloring book as Dogs and Cats. We also describe MLS/MCS Separation in the coloring book.

Lets look at the SELinux labels
The SELinux labels consist of four parts, User, Role, Type and Level.  Often look something like


One area I do not cover is Roles and SELinux Users.

The analogy I like to use for the Users and Roles is around Russian dolls.  In that the User controls the reachable roles and the roles control the reachable types.

When we create an SELinux User, we have to specify which roles are reachable within the user.  (We also specify which levels are are available to the user.

semanage user -l

         Labeling   MLS/       MLS/                       
SELinux User    Prefix     MCS Level  MCS Range       SELinux Roles

guest_u             user       s0         s0                             guest_r
root                    user       s0         s0-s0:c0.c1023        staff_r sysadm_r system_r unconfined_r

staff_u               user       s0         s0-s0:c0.c1023        staff_r sysadm_r system_r unconfined_r
sysadm_u         user       s0         s0-s0:c0.c1023        sysadm_r
system_u          user       s0         s0-s0:c0.c1023        system_r unconfined_r
unconfined_u    user       s0         s0-s0:c0.c1023        system_r unconfined_r
user_u               user       s0         s0                             user_r
xguest_u           user       s0         s0                             xguest_r

In the example above you see the Staff_u user is able to reach the staff_r sysadm_r system_r unconfined_r roles, and is able to have any level in the MCS Range s0-s0:c0.c1023.

Notice also the system_u user, which can reach system_r unconfined_r roles as well as the complete MCS Range.  System_u is the default user for all processes started at boot or started by systemd.  SELinux users are inherited by children processes by default.

You can not assign an SELinux user a role that is not listed,  The kernel will reject it with a permission denied.

# runcon staff_u:user_r:user_t:s0 sh
runcon: invalid context: ‘staff_u:user_r:user_t:s0’: Invalid argument

The next part of the Russian dolls is roles.

Here is the roles available on my Fedora Rawhide machine.

# seinfo -r

Roles: 14


Roles control which types are available to a role.

# seinfo -rwebadm_r -x
      Dominated Roles:

In the example above the only valid type available to the webadm_r is the webadm_t.   So if you attempted to transition from webadm_t to a different type the SELinux kernel would block the access.

system_r role is the default role for all processes started at boot, most of the other roles are "user" roles.

seinfo -rsystem_r -x | wc -l

As you can see there are over 950 types that can be used with the system_r.

Other roles of note:

object_r is not really a role, but more of a place holder.  Roles only make sense for processes, not for files
on the file system.  But the SELinux label requires a role for all labels.  object_r is the role that we use to fill the objects on disks role.  Changing a process to run as object_r or trying to assign a different role to a file will always be denied by the kernel.

RBAC - Roles Based Access Control

SELinux Users and Roles are mainly used to implement RBAC.  The idea is to control what an adminsitrator can do on a system when he is logged in.  For certain activities he has to switch his roles.  For example, he might log in to a system as a normal user role, but needs to switch to an admin role when he needs to administrate the system.

Most of the other roles that are assigned above are user roles.  I have written about switching roles in the past when dealing with confined administrators and users.  I usually run as the staff_r but switch to the unconfined_r (through sudo) when I become the administrator of my system.

Most people who use SELinux do not deal with RBAC all that often, but hopefully this blog helps remove some of the confusion on the use of Roles.

Tug of war between SELinux and Chrome Sandbox, who's right?

Over the years, people have wanted to use SELinux to confine the web browser. The most common vulnerabilty for a desktop user is attacks caused by bugs in the browser.  A user goes to a questionable web site, and the web site has code that triggers a bug in the browser that takes over your machine.  Even if the browser has no bugs, you have to worry about helper plugins like flash-plugin, having vulnerabilities.

I wrote about confineing the browser all the way back in 2008. As I explained then, confining the browser without breaking expected functionality is impossible, but we wanted to confine the "plugins".  Luckily Mozilla and Google also wanted to confine the plugins, so they broke them into separate programs which we can wrap with SELinux labels.  By default all systems running firefox or chrome the plugins are locked down by SELinux preventing them from attacking your home dir.  But sometimes we have had bugs/conflicts in this confinement.

SELinux versus Chrome.

We have been seeing bug reports like the following for the last couple of years.

SELinux is preventing chrome-sandbox from 'write' accesses on the file oom_score_adj.

I believe the chrome-sandbox is trying to tell the kernel to pick it as a candidate to be killed if the machine gets under memory pressure, and the kernel has to pick some processes to kill.

But SELinux blocks this access generating an AVC that looks like:

type=AVC msg=audit(1426020354.763:876): avc:  denied  { write } for  pid=7673 comm="chrome-sandbox" name="oom_score_adj" dev="proc" ino=199187 scontext=unconfined_u:unconfined_r:chrome_sandbox_t:s0-s0:c0.c1023 tcontext=unconfined_u:unconfined_r:unconfined_t:s0-s0:c0.c1023 tclass=file permissive=0

This SELinux AVC indicates that the chrome-sandbox running as chrome_sandbox_t is trying to write to the oom_score_adj field in its parent process, most likely the chrome browser process, running as unconfined_t.  The only files labeled as unconfined_t on a system are virtual files in the /proc file system.

SELinux is probably blocking something that should be allowed, but ...

We have gone back and forth on how to fix this issue.

If you read the bugzilla

Scott R. Godin points out:

according to the final comment at

"Restricting further comments.

As explained in and this is working as intended."

so, google says 'works as intended'. *cough* (earlier discussion of this) says it's a google-chrome issue not an selinux issue.

I dunno who to believe at this point.

I respond:

I would say it is a difference of opinion.  From Google's perspective they are  just allowing the chrome-sandbox to tell the kernel to pick chrome, when looking to kill processes on the system if they run out of memory.  But from SELinux point of view it can't tell the difference between the chrome browser and other processes on the system that are labeled as unconfined_t.  From a MAC point of view, this would allow the chrome-sandbox to pick any other user process to kill if running out of memory. It is even worse then this, allowing chrome_sandbox_t to write to files labeled as unconfined_t allows the chrome-sandbox to write to file objects under /proc of almost every process in the user session.  Allowing this access would destroy the effectiveness of the SELinux confinement on the process.

If we want to control the chrome-sandbox from a MAC perspective, we don't want to allow this. 

Bottom line both sides are in some ways correct.

This is why you have a boolean and a dontaudit rule.  If you don't want MAC confinement of chrome sandbox then you can disable it using

# setsebool -P unconfined_chrome_sandbox_transition 0

Or you can ignore the problem with a dontaudit rule and the chrome-sandbox will not be able to say PICKME for killing in an out of memory situation.

There are other ways you can run confined web browsers. One of the first "Containers" was introduced way back in Fedora 12/13 RHEL6 time frame, the SELinux Sandbox. This allowed you to run a sanboxed web browser totally isolated from the desktop.  The Flatpak effort is a newer effort to run containerized Desktop Applications.

docker-selinux changed to container-selinux
Changing upstream packages

I have decided to change the docker SELinux policy package on from docker-selinux to container-selinux

The main reason I did this was after seeing the following on twitter.   Docker, INC is requesting people not use docker prefix for packages on github.

Since the policy for container-selinux can be used for more container runtimes then just docker, this seems like a good idea.  I plan on using it for OCID, and would consider plugging it into the RKT CRI.

I have modified all of the types inside of the policy to container_*.  For instance docker_t is now container_runtime_t and docker_exec_t is container_runtime_exec_t.

I have taken advantage of the typealias capability of SELinux policy to allow the types to be preserved over an upgrade.

typealias container_runtime_t alias docker_t;
typealias container_runtime_exec_t alias docker_exec_t;

This means people can continue to use docker_t and docker_exec_t with tools but the kernel will automatically translate them to the primary name container_runtime_t and container_runtime_exec_t.

This policy is arriving today in rawhide in the container-selinux.rpm which obsoletes the docker-selinux.rpm.  Once we are confident about the upgrade path, we will be rolling out the new packaging to Fedora and eventually to RHEL and CentOS.

Changing the types associated with container processes.

Secondarily I have begun to change the type names for running containers.  Way back when I wrote the first policy for containers, we were using libvirt_lxc for launching containers, and we already had types defined for VMS launched out of libvirt.  VM's were labeled svirt_t.  When I decided to extend the policy for Containers I decided on extending svirt with lxc.
svirt_lxc, but I also wanted to show that it had full network.  svirt_lxc_net_t.  I labeled the content inside of the container svirt_sandbox_file_t.

Bad names...

Once containers exploded on the seen with the arrival of docker, I knew I had made a mistake choosng the types associated with container processes.  Time to clean this up.  I have submitted pull requests into selinux-policy to change these types to container_t and container_file_t.

typealias container_t alias svirt_lxc_net_t;
typealais container_file_t alias svirt_sandbox_file_t;

The old types will still work due to typealias, but I think it would become a lot easier for people to understand the SELinux types with simpler names.  There is a lot of documentation and "google" knowledge out there about svirt_lxc_net_t and svirt_sandbox_file_t, which we can modify over time.

Luckily I have a chance at a do-over.


Log in

No account? Create an account