Skip to content
New issue

Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.

By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.

Already on GitHub? Sign in to your account

delete #14

Open
wants to merge 10,000 commits into
base: lineage-19.1
Choose a base branch
from
Open

delete #14

wants to merge 10,000 commits into from

Conversation

danielkot
Copy link

@danielkot danielkot commented Feb 11, 2023

deleted

Vladimir Davydov and others added 30 commits August 27, 2022 08:25
An associated css can be around for quite a while after a cgroup
directory has been removed. In general, it makes sense to reset it to
defaults so as not to worry about any remnants. For instance, memory
cgroup needs to reset memory.low, otherwise pages charged to a dead
cgroup might never get reclaimed. There's ->css_reset callback, which
would fit perfectly for the purpose. Currently, it's only called when a
subsystem is disabled in the unified hierarchy and there are other
subsystems dependant on it. Let's call it on css destruction as well.

Suggested-by: Johannes Weiner <[email protected]>
Signed-off-by: Vladimir Davydov <[email protected]>
Signed-off-by: Tejun Heo <[email protected]>
Signed-off-by: Chatur27 <[email protected]>
css_sets are hashed by their subsys[] contents and in cgroup_init()
init_css_set is hashed early, before subsystem inits, when all entries
in its subsys[] are NULL, so that cgroup_dfl_root initialization can
find and link to it.  As subsystems are initialized,
init_css_set.subsys[] is filled up but the hashing is never updated
making init_css_set hashed in the wrong place.  While incorrect, this
doesn't cause a critical failure as css_set management code would
create an identical css_set dynamically.

Fix it by rehashing init_css_set after subsystems are initialized.
While at it, drop unnecessary @key local variable.

Signed-off-by: Tejun Heo <[email protected]>
Acked-by: Zefan Li <[email protected]>
Signed-off-by: Chatur27 <[email protected]>
During task migration, tasks may transfer between two css_sets which
are associated with the same cgroup.  If those tasks are the only
tasks in the cgroup, this currently triggers a spurious de-populated
event on the cgroup.

Fix it by bumping up populated count before bumping it down during
migration to ensure that it doesn't reach zero spuriously.

Signed-off-by: Tejun Heo <[email protected]>
Acked-by: Zefan Li <[email protected]>
Signed-off-by: Chatur27 <[email protected]>
Currently, interface files are created when a css is created depending
on whether @visible is set.  This patch separates out the two into
separate steps to help code refactoring and eventually allow cgroups
which aren't visible through cgroup fs.

Move css_populate_dir() out of create_css() and drop @visible.  While
at it, rename the function to css_create() for consistency.

Signed-off-by: Tejun Heo <[email protected]>
Acked-by: Zefan Li <[email protected]>
Signed-off-by: Chatur27 <[email protected]>
…userland

Currently, whether a css (cgroup_subsys_state) has its interface files
created is not tracked and assumed to change together with the owning
cgroup's lifecycle.  cgroup directory and interface creation is being
separated out from internal object creation to help refactoring and
eventually allow cgroups which are not visible through cgroupfs.

This patch adds CSS_VISIBLE to track whether a css has its interface
files created and perform management operations only when necessary
which helps decoupling interface file handling from internal object
lifecycle.  After this patch, all css interface file management
functions can be called regardless of the current state and will
achieve the expected result.

Signed-off-by: Tejun Heo <[email protected]>
Acked-by: Zefan Li <[email protected]>
Signed-off-by: Chatur27 <[email protected]>
Currently, operations to initialize internal objects and create
interface directory and files are intermixed in cgroup_mkdir().  We're
in the process of refactoring cgroup and css management paths to
separate them out to eventually allow cgroups which aren't visible
through cgroup fs.

This patch reorders operations inside cgroup_mkdir() so that interface
directory and file handling comes after internal object
initialization.  This will enable further refactoring.

Signed-off-by: Tejun Heo <[email protected]>
Acked-by: Zefan Li <[email protected]>
Signed-off-by: Chatur27 <[email protected]>
We're in the process of refactoring cgroup and css management paths to
separate them out to eventually allow cgroups which aren't visible
through cgroup fs.  This patch factors out cgroup_create() out of
cgroup_mkdir().  cgroup_create() contains all internal object creation
and initialization.  cgroup_mkdir() uses cgroup_create() to create the
internal cgroup and adds interface directory and file creation.

This patch doesn't cause any behavior differences.

Signed-off-by: Tejun Heo <[email protected]>
Acked-by: Zefan Li <[email protected]>
Signed-off-by: Chatur27 <[email protected]>
When a controller is enabled and visible on a non-root cgroup is
determined by subtree_control and subtree_ss_mask of the parent
cgroup.  For a root cgroup, by the type of the hierarchy and which
controllers are attached to it.  Deciding the above on each usage is
fragile and unnecessarily complicates the users.

This patch introduces cgroup_control() and cgroup_ss_mask() which
calculate and return the [visibly] enabled subsyste mask for the
specified cgroup and conver the existing usages.

* cgroup_e_css() is restructured for simplicity.

* cgroup_calc_subtree_ss_mask() and cgroup_subtree_control_write() no
  longer need to distinguish root and non-root cases.

* With cgroup_control(), cgroup_controllers_show() can now handle both
  root and non-root cases.  cgroup_root_controllers_show() is removed.

v2: cgroup_control() updated to yield the correct result on v1
    hierarchies too.  cgroup_subtree_control_write() converted.

Signed-off-by: Tejun Heo <[email protected]>
Acked-by: Zefan Li <[email protected]>
Signed-off-by: Chatur27 <[email protected]>
…_write()

Factor out async css offline draining into cgroup_drain_offline().

* Nest subsystem walk inside child walk.  The child walk will later be
  converted to subtree walk which is a bit more expensive.

* Relocate the draining above subsystem mask preparation, which
  doesn't create any behavior differences but helps further
  refactoring.

Signed-off-by: Tejun Heo <[email protected]>
Acked-by: Zefan Li <[email protected]>
Signed-off-by: Chatur27 <[email protected]>
…_control_write()

Factor out css disabling and hiding into cgroup_apply_control_disable().

* Nest subsystem walk inside child walk.  The child walk will later be
  converted to subtree walk which is a bit more expensive.

* Instead of operating on the differential masks @css_enable and
  @css_disable, simply disable or hide csses according to the current
  cgroup_control() and cgroup_ss_mask().  This leads to the same
  result and is simpler and more robust.

* This allows error handling path to share the same code.

Signed-off-by: Tejun Heo <[email protected]>
Acked-by: Zefan Li <[email protected]>
Signed-off-by: Chatur27 <[email protected]>
…control_write()

Factor out css enabling and showing into cgroup_apply_control_enable().

* Nest subsystem walk inside child walk.  The child walk will later be
  converted to subtree walk which is a bit more expensive.

* Instead of operating on the differential masks @css_enable, simply
  enable or show csses according to the current cgroup_control() and
  cgroup_ss_mask().  This leads to the same result and is simpler and
  more robust.

Signed-off-by: Tejun Heo <[email protected]>
Acked-by: Zefan Li <[email protected]>
Signed-off-by: Chatur27 <[email protected]>
…|enable}() recursive

The three factored out css management operations -
cgroup_drain_offline() and cgroup_apply_control_{disable|enable}() -
only depend on the current state of the target cgroups and idempotent
and thus can be easily made to operate on the subtree instead of the
immediate children.

This patch introduces the iterators which walk live subtree and
converts the three functions to operate on the subtree including self
instead of the children.  While this leads to spurious walking and be
slightly more expensive, it will allow them to be used for wider scope
of operations.

Note that cgroup_drain_offline() now tests for whether a css is dying
before trying to drain it.  This is to avoid trying to drain live
csses as there can be mix of live and dying csses in a subtree unlike
children of the same parent.

Signed-off-by: Tejun Heo <[email protected]>
Acked-by: Zefan Li <[email protected]>
Signed-off-by: Chatur27 <[email protected]>
While controllers are being enabled and disabled in
cgroup_subtree_control_write(), the original subsystem masks are
stashed in local variables so that they can be restored if the
operation fails in the middle.

This patch adds dedicated fields to struct cgroup to be used instead
of the local variables and implements functions to stash the current
values, propagate the changes and restore them recursively.  Combined
with the previous changes, this makes subsystem management operations
fully recursive and modularlized.  This will be used to expand cgroup
core functionalities.

While at it, remove now unused @css_enable and @css_disable from
cgroup_subtree_control_write().

Signed-off-by: Tejun Heo <[email protected]>
Acked-by: Zefan Li <[email protected]>
Signed-off-by: Chatur27 <[email protected]>
…ree_control_write()

Factor out cgroup_{apply|finalize}_control() so that control mask
update can be done in several simple steps.  This patch doesn't
introduce behavior changes.

Signed-off-by: Tejun Heo <[email protected]>
Acked-by: Zefan Li <[email protected]>
Signed-off-by: Chatur27 <[email protected]>
cgroup_drain_offline() is used to wait for csses being offlined to
uninstall itself from cgroup->subsys[] array so that new csses can be
installed.  The function's only user, cgroup_subtree_control_write(),
calls it after performing some checks and restarts the whole process
via restart_syscall() if draining has to release cgroup_mutex to wait.

This can be simplified by draining before other synchronized
operations so that there's nothing to restart.  This patch converts
cgroup_drain_offline() to cgroup_lock_and_drain_offline() which
performs both locking and draining and updates cgroup_kn_lock_live()
use it instead of cgroup_mutex() if requested.  This combined locking
and draining operations are easier to use and less error-prone.

While at it, add WARNs in control_apply functions which triggers if
the subtree isn't properly drained.

Signed-off-by: Tejun Heo <[email protected]>
Acked-by: Zefan Li <[email protected]>
Signed-off-by: Chatur27 <[email protected]>
cgroup_create() manually updates control masks and creates child csses
which cgroup_mkdir() then manually populates.  Both can be simplified
by using cgroup_apply_enable_control() and friends.  The only catch is
that it calls css_populate_dir() with NULL cgroup->kn during
cgroup_create().  This is worked around by making the function noop on
NULL kn.

Signed-off-by: Tejun Heo <[email protected]>
Acked-by: Zefan Li <[email protected]>
Signed-off-by: Chatur27 <[email protected]>
…and friends

rebind_subsystem() open codes quite a bit of css and interface file
manipulations.  It tries to be fail-safe but doesn't quite achieve it.
It can be greatly simplified by using the new css management helpers.
This patch reimplements rebind_subsytsems() using
cgroup_apply_control() and friends.

* The half-baked rollback on file creation failure is dropped.  It is
  an extremely cold path, failure isn't critical, and, aside from
  kernel bugs, the only reason it can fail is memory allocation
  failure which pretty much doesn't happen for small allocations.

* As cgroup_apply_control_disable() is now used to clean up root
  cgroup on rebind, make sure that it doesn't end up killing root
  csses.

* All callers of rebind_subsystems() are updated to use
  cgroup_lock_and_drain_offline() as the apply_control functions
  require drained subtree.

* This leaves cgroup_refresh_subtree_ss_mask() without any user.
  Removed.

* css_populate_dir() and css_clear_dir() no longer needs
  @cgrp_override parameter.  Dropped.

* While at it, add WARN_ON() to rebind_subsystem() calls which are
  expected to always succeed just in case.

While the rules visible to userland aren't changed, this
reimplementation not only simplifies rebind_subsystems() but also
allows it to disable and enable csses recursively.  This can be used
to implement more flexible rebinding.

Signed-off-by: Tejun Heo <[email protected]>
Acked-by: Zefan Li <[email protected]>
Signed-off-by: Chatur27 <[email protected]>
Change-Id: I23d1a815cc4c412e83331aebd323f111d1046dd0
cgroup_calc_subtree_ss_mask() currently takes @CGRP and
@subtree_control.  @CGRP is used for two purposes - to decide whether
it's for default hierarchy and the mask of available subsystems.  The
former doesn't matter as the results are the same regardless.  The
latter can be specified directly through a subsystem mask.

This patch makes cgroup_calc_subtree_ss_mask() perform the same
calculations for both default and legacy hierarchies and take
@this_ss_mask for available subsystems.  @CGRP is no longer used and
dropped.  This is to allow using the function in contexts where
available controllers can't be decided from the cgroup.

v2: cgroup_refres_subtree_ss_mask() is removed by a previous patch.
    Updated accordingly.

Signed-off-by: Tejun Heo <[email protected]>
Acked-by: Zefan Li <[email protected]>
Signed-off-by: Chatur27 <[email protected]>
During prep, cgroup_setup_root() allocates cgrp_cset_links matching
the number of existing css_sets to later link the new root.  This is
fine for now as the only operation which can happen inbetween is
rebind_subsystems() and rebinding of empty subsystems doesn't create
new css_sets.

However, while not yet allowed, with the recent reimplementation,
rebind_subsystems() can rebind subsystems with descendant csses and
thus can create new css_sets.  This patch makes cgroup_setup_root()
allocate 2x of the existing css_sets so that later use of live
subsystem rebinding doesn't blow up.

Signed-off-by: Tejun Heo <[email protected]>
Acked-by: Zefan Li <[email protected]>
Signed-off-by: Chatur27 <[email protected]>
The existing sequences of operations ensure that the offlining csses
are drained before cgroup_update_dfl_csses(), so even though
cgroup_update_dfl_csses() uses css_for_each_descendant_pre() to walk
the target cgroups, it doesn't end up operating on dead cgroups.
Also, the function explicitly excludes the subtree root from
operation.

This is fragile and inconsistent with the rest of css update
operations.  This patch updates cgroup_update_dfl_csses() to use
cgroup_for_each_live_descendant_pre() instead and include the subtree
root.

Signed-off-by: Tejun Heo <[email protected]>
Acked-by: Zefan Li <[email protected]>
Signed-off-by: Chatur27 <[email protected]>
cgroup_update_dfl_csses() should move each task in the subtree to
self; however, it was incorrectly calling cgroup_migrate_add_src()
with the root of the subtree as @dst_cgrp.  Fortunately,
cgroup_migrate_add_src() currently uses @dst_cgrp only to determine
the hierarchy and the bug doesn't cause any actual breakages.  Fix it.

Signed-off-by: Tejun Heo <[email protected]>
Signed-off-by: Chatur27 <[email protected]>
…_prepare_dst()

cgroup_migrate_prepare_dst() verifies whether the destination cgroup
is allowable; however, the test doesn't really belong there.  It's too
deep and common in the stack and as a result the test itself is gated
by another test.

Separate the test out into cgroup_may_migrate_to() and update
cgroup_attach_task() and cgroup_transfer_tasks() to perform the test
directly.  This doesn't cause any behavior differences.

Signed-off-by: Tejun Heo <[email protected]>
Signed-off-by: Chatur27 <[email protected]>
…group

On the default hierarchy, a migration can be multi-source and/or
multi-destination.  cgroup_taskest_migrate() used to incorrectly
assume single destination cgroup but the bug has been fixed by
1f7dd3e ("cgroup: fix handling of multi-destination migration
from subtree_control enabling").

Since the commit, @dst_cgrp to cgroup[_taskset]_migrate() is only used
to determine which subsystems are affected or which cgroup_root the
migration is taking place in.  As such, @dst_cgrp is misleading.  This
patch replaces @dst_cgrp with @root.

Signed-off-by: Tejun Heo <[email protected]>
Signed-off-by: Chatur27 <[email protected]>
Migration can be multi-target on the default hierarchy when a
controller is enabled - processes belonging to each child cgroup have
to be moved to the child cgroup itself to refresh css association.

This isn't a problem for cgroup_migrate_add_src() as each source
css_set still maps to single source and target cgroups; however,
cgroup_migrate_prepare_dst() is called once after all source css_sets
are added and thus might not have a single destination cgroup.  This
is currently worked around by specifying NULL for @dst_cgrp and using
the source's default cgroup as destination as the only multi-target
migration in use is self-targetting.  While this works, it's subtle
and clunky.

As all taget cgroups are already specified while preparing the source
css_sets, this clunkiness can easily be removed by recording the
target cgroup in each source css_set.  This patch adds
css_set->mg_dst_cgrp which is recorded on cgroup_migrate_src() and
used by cgroup_migrate_prepare_dst().  This also makes migration code
ready for arbitrary multi-target migration.

Signed-off-by: Tejun Heo <[email protected]>
Signed-off-by: Chatur27 <[email protected]>
Some controllers, perf_event for now and possibly freezer in the
future, don't really make sense to control explicitly through
"cgroup.subtree_control".  For example, the primary role of perf_event
is identifying the cgroups of tasks; however, because the controller
also keeps a small amount of state per cgroup, it can't be replaced
with simple cgroup membership tests.

This patch implements cgroup_subsys->implicit_on_dfl flag.  When set,
the controller is implicitly enabled on all cgroups on the v2
hierarchy so that utility type controllers such as perf_event can be
enabled and function transparently.

An implicit controller doesn't show up in "cgroup.controllers" or
"cgroup.subtree_control", is exempt from no internal process rule and
can be stolen from the default hierarchy even if there are non-root
csses.

v2: Reimplemented on top of the recent updates to css handling and
    subsystem rebinding.  Rebinding implicit subsystems is now a
    simple matter of exempting it from the busy subsystem check.

Signed-off-by: Tejun Heo <[email protected]>
Signed-off-by: Chatur27 <[email protected]>
…amespaces

Patch summary:

When showing a cgroupfs entry in mountinfo, show the path of the mount
root dentry relative to the reader's cgroup namespace root.

Short explanation (courtesy of mkerrisk):

If we create a new cgroup namespace, then we want both /proc/self/cgroup
and /proc/self/mountinfo to show cgroup paths that are correctly
virtualized with respect to the cgroup mount point.  Previous to this
patch, /proc/self/cgroup shows the right info, but /proc/self/mountinfo
does not.

Long version:

When a uid 0 task which is in freezer cgroup /a/b, unshares a new cgroup
namespace, and then mounts a new instance of the freezer cgroup, the new
mount will be rooted at /a/b.  The root dentry field of the mountinfo
entry will show '/a/b'.

 cat > /tmp/do1 << EOF
 mount -t cgroup -o freezer freezer /mnt
 grep freezer /proc/self/mountinfo
 EOF

 unshare -Gm  bash /tmp/do1
 > 330 160 0:34 / /sys/fs/cgroup/freezer rw,nosuid,nodev,noexec,relatime - cgroup cgroup rw,freezer
 > 355 133 0:34 /a/b /mnt rw,relatime - cgroup freezer rw,freezer

The task's freezer cgroup entry in /proc/self/cgroup will simply show
'/':

 grep freezer /proc/self/cgroup
 9:freezer:/

If instead the same task simply bind mounts the /a/b cgroup directory,
the resulting mountinfo entry will again show /a/b for the dentry root.
However in this case the task will find its own cgroup at /mnt/a/b,
not at /mnt:

 mount --bind /sys/fs/cgroup/freezer/a/b /mnt
 130 25 0:34 /a/b /mnt rw,nosuid,nodev,noexec,relatime shared:21 - cgroup cgroup rw,freezer

In other words, there is no way for the task to know, based on what is
in mountinfo, which cgroup directory is its own.

Example (by mkerrisk):

First, a little script to save some typing and verbiage:

echo -e "\t/proc/self/cgroup:\t$(cat /proc/self/cgroup | grep freezer)"
cat /proc/self/mountinfo | grep freezer |
        awk '{print "\tmountinfo:\t\t" $4 "\t" $5}'

Create cgroup, place this shell into the cgroup, and look at the state
of the /proc files:

2653
2653                         # Our shell
14254                        # cat(1)
        /proc/self/cgroup:      10:freezer:/a/b
        mountinfo:              /       /sys/fs/cgroup/freezer

Create a shell in new cgroup and mount namespaces. The act of creating
a new cgroup namespace causes the process's current cgroups directories
to become its cgroup root directories. (Here, I'm using my own version
of the "unshare" utility, which takes the same options as the util-linux
version):

Look at the state of the /proc files:

        /proc/self/cgroup:      10:freezer:/
        mountinfo:              /       /sys/fs/cgroup/freezer

The third entry in /proc/self/cgroup (the pathname of the cgroup inside
the hierarchy) is correctly virtualized w.r.t. the cgroup namespace, which
is rooted at /a/b in the outer namespace.

However, the info in /proc/self/mountinfo is not for this cgroup
namespace, since we are seeing a duplicate of the mount from the
old mount namespace, and the info there does not correspond to the
new cgroup namespace. However, trying to create a new mount still
doesn't show us the right information in mountinfo:

                                      # propagating to other mountns
        /proc/self/cgroup:      7:freezer:/
        mountinfo:              /a/b    /mnt/freezer

The act of creating a new cgroup namespace caused the process's
current freezer directory, "/a/b", to become its cgroup freezer root
directory. In other words, the pathname directory of the directory
within the newly mounted cgroup filesystem should be "/",
but mountinfo wrongly shows us "/a/b". The consequence of this is
that the process in the cgroup namespace cannot correctly construct
the pathname of its cgroup root directory from the information in
/proc/PID/mountinfo.

With this patch, the dentry root field in mountinfo is shown relative
to the reader's cgroup namespace.  So the same steps as above:

        /proc/self/cgroup:      10:freezer:/a/b
        mountinfo:              /       /sys/fs/cgroup/freezer
        /proc/self/cgroup:      10:freezer:/
        mountinfo:              /../..  /sys/fs/cgroup/freezer
        /proc/self/cgroup:      10:freezer:/
        mountinfo:              /       /mnt/freezer

cgroup.clone_children  freezer.parent_freezing  freezer.state      tasks
cgroup.procs           freezer.self_freezing    notify_on_release
3164
2653                   # First shell that placed in this cgroup
3164                   # Shell started by 'unshare'
14197                  # cat(1)

Signed-off-by: Serge Hallyn <[email protected]>
Tested-by: Michael Kerrisk <[email protected]>
Acked-by: Michael Kerrisk <[email protected]>
Signed-off-by: Tejun Heo <[email protected]>
Signed-off-by: Chatur27 <[email protected]>
commit 4f41fc59620f ("cgroup, kernfs: make mountinfo
 show properly scoped path for cgroup namespaces")
 added the following compile warning:

kernel/cgroup.c: In function ‘cgroup_show_path’:
kernel/cgroup.c:1634:15: warning: unused variable ‘ret’ [-Wunused-variable]
  int len = 0, ret = 0;
               ^
fix it.

Fixes: 4f41fc59620f ("cgroup, kernfs: make mountinfo show properly scoped path for cgroup namespaces")
Signed-off-by: Felipe Balbi <[email protected]>
Signed-off-by: Tejun Heo <[email protected]>
Signed-off-by: Chatur27 <[email protected]>
When create css failed, before call css_free_rcu_fn, we remove the css
id and exit the percpu_ref, but we will do these again in
css_free_work_fn, so they are redundant.  Especially the css id, that
would cause problem if we remove it twice, since it may be assigned to
another css after the first remove.

tj: This was broken by two commits updating the free path without
    synchronizing the creation failure path.  This can be easily
    triggered by trying to create more than 64k memory cgroups.

Signed-off-by: Wenwei Tao <[email protected]>
Signed-off-by: Tejun Heo <[email protected]>
Cc: Vladimir Davydov <[email protected]>
Fixes: 9a1049d ("percpu-refcount: require percpu_ref to be exited explicitly")
Fixes: 01e5865 ("cgroup: release css->id after css_free")
Cc: [email protected] # v3.17+
Signed-off-by: Chatur27 <[email protected]>
The valid cgroup hierarchy ID range includes 0, so we can't filter for
positive numbers when freeing it, or it'll leak the first ID. No big
deal, just disruptive when reading the code.

The ID is freed during error handling and when the reference count
hits zero, so the double-free test is not necessary; remove it.

Signed-off-by: Johannes Weiner <[email protected]>
Signed-off-by: Tejun Heo <[email protected]>
Signed-off-by: Chatur27 <[email protected]>
css_idr allocation starts at 1, so index 0 will never point to an
item. css_from_id() currently filters that before asking idr_find(),
but idr_find() would also just return NULL, so this is not needed.

Signed-off-by: Johannes Weiner <[email protected]>
Signed-off-by: Tejun Heo <[email protected]>
Signed-off-by: Chatur27 <[email protected]>
ivanmeler and others added 24 commits August 27, 2022 08:59
WireGuard is a layer 3 secure networking tunnel made specifically for
the kernel, that aims to be much simpler and easier to audit than IPsec.
Extensive documentation and description of the protocol and
considerations, along with formal proofs of the cryptography, are
available at:

  * https://www.wireguard.com/
  * https://www.wireguard.com/papers/wireguard.pdf

This commit implements WireGuard as a simple network device driver,
accessible in the usual RTNL way used by virtual network drivers. It
makes use of the udp_tunnel APIs, GRO, GSO, NAPI, and the usual set of
networking subsystem APIs. It has a somewhat novel multicore queueing
system designed for maximum throughput and minimal latency of encryption
operations, but it is implemented modestly using workqueues and NAPI.
Configuration is done via generic Netlink, and following a review from
the Netlink maintainer a year ago, several high profile userspace tools
have already implemented the API.

This commit also comes with several different tests, both in-kernel
tests and out-of-kernel tests based on network namespaces, taking profit
of the fact that sockets used by WireGuard intentionally stay in the
namespace the WireGuard interface was originally created, exactly like
the semantics of userspace tun devices. See wireguard.com/netns/ for
pictures and examples.

The source code is fairly short, but rather than combining everything
into a single file, WireGuard is developed as cleanly separable files,
making auditing and comprehension easier. Things are laid out as
follows:

  * noise.[ch], cookie.[ch], messages.h: These implement the bulk of the
    cryptographic aspects of the protocol, and are mostly data-only in
    nature, taking in buffers of bytes and spitting out buffers of
    bytes. They also handle reference counting for their various shared
    pieces of data, like keys and key lists.

  * ratelimiter.[ch]: Used as an integral part of cookie.[ch] for
    ratelimiting certain types of cryptographic operations in accordance
    with particular WireGuard semantics.

  * allowedips.[ch], peerlookup.[ch]: The main lookup structures of
    WireGuard, the former being trie-like with particular semantics, an
    integral part of the design of the protocol, and the latter just
    being nice helper functions around the various hashtables we use.

  * device.[ch]: Implementation of functions for the netdevice and for
    rtnl, responsible for maintaining the life of a given interface and
    wiring it up to the rest of WireGuard.

  * peer.[ch]: Each interface has a list of peers, with helper functions
    available here for creation, destruction, and reference counting.

  * socket.[ch]: Implementation of functions related to udp_socket and
    the general set of kernel socket APIs, for sending and receiving
    ciphertext UDP packets, and taking care of WireGuard-specific sticky
    socket routing semantics for the automatic roaming.

  * netlink.[ch]: Userspace API entry point for configuring WireGuard
    peers and devices. The API has been implemented by several userspace
    tools and network management utility, and the WireGuard project
    distributes the basic wg(8) tool.

  * queueing.[ch]: Shared function on the rx and tx path for handling
    the various queues used in the multicore algorithms.

  * send.c: Handles encrypting outgoing packets in parallel on
    multiple cores, before sending them in order on a single core, via
    workqueues and ring buffers. Also handles sending handshake and cookie
    messages as part of the protocol, in parallel.

  * receive.c: Handles decrypting incoming packets in parallel on
    multiple cores, before passing them off in order to be ingested via
    the rest of the networking subsystem with GRO via the typical NAPI
    poll function. Also handles receiving handshake and cookie messages
    as part of the protocol, in parallel.

  * timers.[ch]: Uses the timer wheel to implement protocol particular
    event timeouts, and gives a set of very simple event-driven entry
    point functions for callers.

  * main.c, version.h: Initialization and deinitialization of the module.

  * selftest/*.h: Runtime unit tests for some of the most security
    sensitive functions.

  * tools/testing/selftests/wireguard/netns.sh: Aforementioned testing
    script using network namespaces.

This commit aims to be as self-contained as possible, implementing
WireGuard as a standalone module not needing much special handling or
coordination from the network subsystem. I expect for future
optimizations to the network stack to positively improve WireGuard, and
vice-versa, but for the time being, this exists as intentionally
standalone.

We introduce a menu option for CONFIG_WIREGUARD, as well as providing a
verbose debug log and self-tests via CONFIG_WIREGUARD_DEBUG.

Signed-off-by: Jason A. Donenfeld <[email protected]>
Cc: David Miller <[email protected]>
Cc: Greg KH <[email protected]>
Cc: Linus Torvalds <[email protected]>
Cc: Herbert Xu <[email protected]>
Cc: [email protected]
Cc: [email protected]
Cc: [email protected]
Signed-off-by: David S. Miller <[email protected]>
[Jason: ported to 4.19 by doing the following:
 - wg_get_device_start uses genl_family_attrbuf
 - skb_probe_transport_header has an extra argument
 - NLA_EXACT/MIN_LEN is not there yet
 - nla policy is per verb not family
 - totalram_pages isn't a function]
 - __kernel_timespec -> __uapi_kernel_timespec]
(cherry picked from commit e7096c131e5161fa3b8e52a650d7719d2857adfd)
Bug: 152722841
Signed-off-by: Jason A. Donenfeld <[email protected]>
Signed-off-by: Greg Kroah-Hartman <[email protected]>
Change-Id: I04cd661a4cfec9b9fa64c3ab0ea39e4e2352fa13
Function 'udp_tunnel6_xmit_skb' had been redefined in commit 4d5805d.
Compat hack must be removed to fix compilation issue.

Change-Id: I155b1c45ef57ca2be4fb3f005a5df174fc9041b9
Signed-off-by: Roberto Sartori <[email protected]>
Change-Id: Id91af29873e04446dd5cbc9033e3bedae7816da1
Add new cgroup based program type, BPF_PROG_TYPE_CGROUP_SOCK. Similar to
BPF_PROG_TYPE_CGROUP_SKB programs can be attached to a cgroup and run
any time a process in the cgroup opens an AF_INET or AF_INET6 socket.
Currently only sk_bound_dev_if is exported to userspace for modification
by a bpf program.

This allows a cgroup to be configured such that AF_INET{6} sockets opened
by processes are automatically bound to a specific device. In turn, this
enables the running of programs that do not support SO_BINDTODEVICE in a
specific VRF context / L3 domain.

Signed-off-by: David Ahern <[email protected]>
Acked-by: Alexei Starovoitov <[email protected]>
Signed-off-by: David S. Miller <[email protected]>
Change-Id: I96a6f6f8f650c494d8c173dbb42580a25698368e
A lot of code currently abuses is_compat_task to determine this.

Signed-off-by: Andy Lutomirski <[email protected]>
Cc: "David S. Miller" <[email protected]>
Cc: "H. Peter Anvin" <[email protected]>
Cc: "Theodore Ts'o" <[email protected]>
Cc: Andreas Dilger <[email protected]>
Cc: Clemens Ladisch <[email protected]>
Cc: David Airlie <[email protected]>
Cc: David Herrmann <[email protected]>
Cc: David Miller <[email protected]>
Cc: Dmitry Torokhov <[email protected]>
Cc: Eric Paris <[email protected]>
Cc: Herbert Xu <[email protected]>
Cc: Ingo Molnar <[email protected]>
Acked-by: Jiri Kosina <[email protected]>
Cc: Matt Fleming <[email protected]>
Cc: Neil Horman <[email protected]>
Cc: Oded Gabbay <[email protected]>
Cc: Oleg Drokin <[email protected]>
Cc: Oleg Nesterov <[email protected]>
Cc: Paul Moore <[email protected]>
Cc: Sam Ravnborg <[email protected]>
Cc: Steffen Klassert <[email protected]>
Cc: Thomas Gleixner <[email protected]>
Cc: Vlad Yasevich <[email protected]>
Signed-off-by: Andrew Morton <[email protected]>
Signed-off-by: Linus Torvalds <[email protected]>
Change-Id: Ic4c37d15c1d72d1f0a90c7f5091d696140fc2a4c
The kill() syscall operates on process identifiers (pid). After a process
has exited its pid can be reused by another process. If a caller sends a
signal to a reused pid it will end up signaling the wrong process. This
issue has often surfaced and there has been a push to address this problem [1].

This patch uses file descriptors (fd) from proc/<pid> as stable handles on
struct pid. Even if a pid is recycled the handle will not change. The fd
can be used to send signals to the process it refers to.
Thus, the new syscall pidfd_send_signal() is introduced to solve this
problem. Instead of pids it operates on process fds (pidfd).

/* prototype and argument /*
long pidfd_send_signal(int pidfd, int sig, siginfo_t *info, unsigned int flags);

/* syscall number 424 */
The syscall number was chosen to be 424 to align with Arnd's rework in his
y2038 to minimize merge conflicts (cf. [25]).

In addition to the pidfd and signal argument it takes an additional
siginfo_t and flags argument. If the siginfo_t argument is NULL then
pidfd_send_signal() is equivalent to kill(<positive-pid>, <signal>). If it
is not NULL pidfd_send_signal() is equivalent to rt_sigqueueinfo().
The flags argument is added to allow for future extensions of this syscall.
It currently needs to be passed as 0. Failing to do so will cause EINVAL.

/* pidfd_send_signal() replaces multiple pid-based syscalls */
The pidfd_send_signal() syscall currently takes on the job of
rt_sigqueueinfo(2) and parts of the functionality of kill(2), Namely, when a
positive pid is passed to kill(2). It will however be possible to also
replace tgkill(2) and rt_tgsigqueueinfo(2) if this syscall is extended.

/* sending signals to threads (tid) and process groups (pgid) */
Specifically, the pidfd_send_signal() syscall does currently not operate on
process groups or threads. This is left for future extensions.
In order to extend the syscall to allow sending signal to threads and
process groups appropriately named flags (e.g. PIDFD_TYPE_PGID, and
PIDFD_TYPE_TID) should be added. This implies that the flags argument will
determine what is signaled and not the file descriptor itself. Put in other
words, grouping in this api is a property of the flags argument not a
property of the file descriptor (cf. [13]). Clarification for this has been
requested by Eric (cf. [19]).
When appropriate extensions through the flags argument are added then
pidfd_send_signal() can additionally replace the part of kill(2) which
operates on process groups as well as the tgkill(2) and
rt_tgsigqueueinfo(2) syscalls.
How such an extension could be implemented has been very roughly sketched
in [14], [15], and [16]. However, this should not be taken as a commitment
to a particular implementation. There might be better ways to do it.
Right now this is intentionally left out to keep this patchset as simple as
possible (cf. [4]).

/* naming */
The syscall had various names throughout iterations of this patchset:
- procfd_signal()
- procfd_send_signal()
- taskfd_send_signal()
In the last round of reviews it was pointed out that given that if the
flags argument decides the scope of the signal instead of different types
of fds it might make sense to either settle for "procfd_" or "pidfd_" as
prefix. The community was willing to accept either (cf. [17] and [18]).
Given that one developer expressed strong preference for the "pidfd_"
prefix (cf. [13]) and with other developers less opinionated about the name
we should settle for "pidfd_" to avoid further bikeshedding.

The  "_send_signal" suffix was chosen to reflect the fact that the syscall
takes on the job of multiple syscalls. It is therefore intentional that the
name is not reminiscent of neither kill(2) nor rt_sigqueueinfo(2). Not the
fomer because it might imply that pidfd_send_signal() is a replacement for
kill(2), and not the latter because it is a hassle to remember the correct
spelling - especially for non-native speakers - and because it is not
descriptive enough of what the syscall actually does. The name
"pidfd_send_signal" makes it very clear that its job is to send signals.

/* zombies */
Zombies can be signaled just as any other process. No special error will be
reported since a zombie state is an unreliable state (cf. [3]). However,
this can be added as an extension through the @flags argument if the need
ever arises.

/* cross-namespace signals */
The patch currently enforces that the signaler and signalee either are in
the same pid namespace or that the signaler's pid namespace is an ancestor
of the signalee's pid namespace. This is done for the sake of simplicity
and because it is unclear to what values certain members of struct
siginfo_t would need to be set to (cf. [5], [6]).

/* compat syscalls */
It became clear that we would like to avoid adding compat syscalls
(cf. [7]).  The compat syscall handling is now done in kernel/signal.c
itself by adding __copy_siginfo_from_user_generic() which lets us avoid
compat syscalls (cf. [8]). It should be noted that the addition of
__copy_siginfo_from_user_any() is caused by a bug in the original
implementation of rt_sigqueueinfo(2) (cf. 12).
With upcoming rework for syscall handling things might improve
significantly (cf. [11]) and __copy_siginfo_from_user_any() will not gain
any additional callers.

/* testing */
This patch was tested on x64 and x86.

/* userspace usage */
An asciinema recording for the basic functionality can be found under [9].
With this patch a process can be killed via:

 #define _GNU_SOURCE
 #include <errno.h>
 #include <fcntl.h>
 #include <signal.h>
 #include <stdio.h>
 #include <stdlib.h>
 #include <string.h>
 #include <sys/stat.h>
 #include <sys/syscall.h>
 #include <sys/types.h>
 #include <unistd.h>

 static inline int do_pidfd_send_signal(int pidfd, int sig, siginfo_t *info,
                                         unsigned int flags)
 {
 #ifdef __NR_pidfd_send_signal
         return syscall(__NR_pidfd_send_signal, pidfd, sig, info, flags);
 #else
         return -ENOSYS;
 #endif
 }

 int main(int argc, char *argv[])
 {
         int fd, ret, saved_errno, sig;

         if (argc < 3)
                 exit(EXIT_FAILURE);

         fd = open(argv[1], O_DIRECTORY | O_CLOEXEC);
         if (fd < 0) {
                 printf("%s - Failed to open \"%s\"\n", strerror(errno), argv[1]);
                 exit(EXIT_FAILURE);
         }

         sig = atoi(argv[2]);

         printf("Sending signal %d to process %s\n", sig, argv[1]);
         ret = do_pidfd_send_signal(fd, sig, NULL, 0);

         saved_errno = errno;
         close(fd);
         errno = saved_errno;

         if (ret < 0) {
                 printf("%s - Failed to send signal %d to process %s\n",
                        strerror(errno), sig, argv[1]);
                 exit(EXIT_FAILURE);
         }

         exit(EXIT_SUCCESS);
 }

/* Q&A
 * Given that it seems the same questions get asked again by people who are
 * late to the party it makes sense to add a Q&A section to the commit
 * message so it's hopefully easier to avoid duplicate threads.
 *
 * For the sake of progress please consider these arguments settled unless
 * there is a new point that desperately needs to be addressed. Please make
 * sure to check the links to the threads in this commit message whether
 * this has not already been covered.
 */
Q-01: (Florian Weimer [20], Andrew Morton [21])
      What happens when the target process has exited?
A-01: Sending the signal will fail with ESRCH (cf. [22]).

Q-02:  (Andrew Morton [21])
       Is the task_struct pinned by the fd?
A-02:  No. A reference to struct pid is kept. struct pid - as far as I
       understand - was created exactly for the reason to not require to
       pin struct task_struct (cf. [22]).

Q-03: (Andrew Morton [21])
      Does the entire procfs directory remain visible? Just one entry
      within it?
A-03: The same thing that happens right now when you hold a file descriptor
      to /proc/<pid> open (cf. [22]).

Q-04: (Andrew Morton [21])
      Does the pid remain reserved?
A-04: No. This patchset guarantees a stable handle not that pids are not
      recycled (cf. [22]).

Q-05: (Andrew Morton [21])
      Do attempts to signal that fd return errors?
A-05: See {Q,A}-01.

Q-06: (Andrew Morton [22])
      Is there a cleaner way of obtaining the fd? Another syscall perhaps.
A-06: Userspace can already trivially retrieve file descriptors from procfs
      so this is something that we will need to support anyway. Hence,
      there's no immediate need to add another syscalls just to make
      pidfd_send_signal() not dependent on the presence of procfs. However,
      adding a syscalls to get such file descriptors is planned for a
      future patchset (cf. [22]).

Q-07: (Andrew Morton [21] and others)
      This fd-for-a-process sounds like a handy thing and people may well
      think up other uses for it in the future, probably unrelated to
      signals. Are the code and the interface designed to permit such
      future applications?
A-07: Yes (cf. [22]).

Q-08: (Andrew Morton [21] and others)
      Now I think about it, why a new syscall? This thing is looking
      rather like an ioctl?
A-08: This has been extensively discussed. It was agreed that a syscall is
      preferred for a variety or reasons. Here are just a few taken from
      prior threads. Syscalls are safer than ioctl()s especially when
      signaling to fds. Processes are a core kernel concept so a syscall
      seems more appropriate. The layout of the syscall with its four
      arguments would require the addition of a custom struct for the
      ioctl() thereby causing at least the same amount or even more
      complexity for userspace than a simple syscall. The new syscall will
      replace multiple other pid-based syscalls (see description above).
      The file-descriptors-for-processes concept introduced with this
      syscall will be extended with other syscalls in the future. See also
      [22], [23] and various other threads already linked in here.

Q-09: (Florian Weimer [24])
      What happens if you use the new interface with an O_PATH descriptor?
A-09:
      pidfds opened as O_PATH fds cannot be used to send signals to a
      process (cf. [2]). Signaling processes through pidfds is the
      equivalent of writing to a file. Thus, this is not an operation that
      operates "purely at the file descriptor level" as required by the
      open(2) manpage. See also [4].

/* References */
[1]:  https://lore.kernel.org/lkml/[email protected]/
[2]:  https://lore.kernel.org/lkml/[email protected]/
[3]:  https://lore.kernel.org/lkml/[email protected]/
[4]:  https://lore.kernel.org/lkml/[email protected]/
[5]:  https://lore.kernel.org/lkml/[email protected]/
[6]:  https://lore.kernel.org/lkml/[email protected]/
[7]:  https://lore.kernel.org/lkml/[email protected]/
[8]:  https://lore.kernel.org/lkml/[email protected]/
[9]:  https://asciinema.org/a/IQjuCHew6bnq1cr78yuMv16cy
[11]: https://lore.kernel.org/lkml/[email protected]/
[12]: https://lore.kernel.org/lkml/[email protected]/
[13]: https://lore.kernel.org/lkml/[email protected]/
[14]: https://lore.kernel.org/lkml/[email protected]/
[15]: https://lore.kernel.org/lkml/[email protected]/
[16]: https://lore.kernel.org/lkml/[email protected]/
[17]: https://lore.kernel.org/lkml/CAGXu5jL8PciZAXvOvCeCU3wKUEB_dU-O3q0tDw4uB_ojMvDEew@mail.gmail.com/
[18]: https://lore.kernel.org/lkml/[email protected]/
[19]: https://lore.kernel.org/lkml/[email protected]/
[20]: https://lore.kernel.org/lkml/[email protected]/
[21]: https://lore.kernel.org/lkml/[email protected]/
[22]: https://lore.kernel.org/lkml/[email protected]/
[23]: https://lwn.net/Articles/773459/
[24]: https://lore.kernel.org/lkml/[email protected]/
[25]: https://lore.kernel.org/lkml/CAK8P3a0ej9NcJM8wXNPbcGUyOUZYX+VLoDFdbenW3s3114oQZw@mail.gmail.com/

Cc: "Eric W. Biederman" <[email protected]>
Cc: Jann Horn <[email protected]>
Cc: Andy Lutomirsky <[email protected]>
Cc: Andrew Morton <[email protected]>
Cc: Oleg Nesterov <[email protected]>
Cc: Al Viro <[email protected]>
Cc: Florian Weimer <[email protected]>
Signed-off-by: Christian Brauner <[email protected]>
Reviewed-by: Tycho Andersen <[email protected]>
Reviewed-by: Kees Cook <[email protected]>
Reviewed-by: David Howells <[email protected]>
Acked-by: Arnd Bergmann <[email protected]>
Acked-by: Thomas Gleixner <[email protected]>
Acked-by: Serge Hallyn <[email protected]>
Acked-by: Aleksa Sarai <[email protected]>

(cherry picked from commit 3eb39f47934f9d5a3027fe00d906a45fe3a15fad)

Conflicts:
        arch/x86/entry/syscalls/syscall_32.tbl - trivial manual merge
        arch/x86/entry/syscalls/syscall_64.tbl - trivial manual merge
        include/linux/proc_fs.h - trivial manual merge
        include/linux/syscalls.h - trivial manual merge
        include/uapi/asm-generic/unistd.h - trivial manual merge
        kernel/signal.c - struct kernel_siginfo does not exist in 4.14
        kernel/sys_ni.c - cond_syscall is used instead of COND_SYSCALL
        arch/x86/entry/syscalls/syscall_32.tbl
        arch/x86/entry/syscalls/syscall_64.tbl

(1. manual merges because of 4.14 differences
 2. change prepare_kill_siginfo() to use struct siginfo instead of
kernel_siginfo
 3. use copy_from_user() instead of copy_siginfo_from_user() in copy_siginfo_from_user_any()
 4. replaced COND_SYSCALL with cond_syscall
 5. Removed __ia32_sys_pidfd_send_signal in arch/x86/entry/syscalls/syscall_32.tbl.
 6. Replaced __x64_sys_pidfd_send_signal with sys_pidfd_send_signal in arch/x86/entry/syscalls/syscall_64.tbl.)

Bug: 135608568
Test: test program using syscall(__NR_pidfd_send_signal,..) to send SIGKILL
Change-Id: I34da11c63ac8cafb0353d9af24c820cef519ec27
Signed-off-by: Suren Baghdasaryan <[email protected]>
Signed-off-by: electimon <[email protected]>
…nt pidfd

The current sys_pidfd_send_signal() silently turns signals with explicit
SI_USER context that are sent to non-current tasks into signals with
kernel-generated siginfo.
This is unlike do_rt_sigqueueinfo(), which returns -EPERM in this case.
If a user actually wants to send a signal with kernel-provided siginfo,
they can do that with pidfd_send_signal(pidfd, sig, NULL, 0); so allowing
this case is unnecessary.

Instead of silently replacing the siginfo, just bail out with an error;
this is consistent with other interfaces and avoids special-casing behavior
based on security checks.

Fixes: 3eb39f47934f ("signal: add pidfd_send_signal() syscall")
Signed-off-by: Jann Horn <[email protected]>
Signed-off-by: Christian Brauner <[email protected]>

(cherry picked from commit 556a888a14afe27164191955618990fb3ccc9aad)

Bug: 135608568
Test: test program using syscall(__NR_pidfd_send_signal,..) to send SIGKILL
Change-Id: I493af671b82c43bff1425ee24550d2fb9aa6d961
Signed-off-by: Suren Baghdasaryan <[email protected]>
As stated in the original commit for pidfd_send_signal() we don't allow
to signal processes through O_PATH file descriptors since it is
semantically equivalent to a write on the pidfd.

We already correctly error out right now and return EBADF if an O_PATH
fd is passed.  This is because we use file->f_op to detect whether a
pidfd is passed and O_PATH fds have their file->f_op set to empty_fops
in do_dentry_open() and thus fail the test.

Thus, there is no regression.  It's just semantically correct to use
fdget() and return an error right from there instead of taking a
reference and returning an error later.

Signed-off-by: Christian Brauner <[email protected]>
Acked-by: Oleg Nesterov <[email protected]>
Cc: Arnd Bergmann <[email protected]>
Cc: "Eric W. Biederman" <[email protected]>
Cc: Kees Cook <[email protected]>
Cc: Thomas Gleixner <[email protected]>
Cc: Jann Horn <[email protected]>
Cc: David Howells <[email protected]>
Cc: "Michael Kerrisk (man-pages)" <[email protected]>
Cc: Andy Lutomirsky <[email protected]>
Cc: Andrew Morton <[email protected]>
Cc: Oleg Nesterov <[email protected]>
Cc: Aleksa Sarai <[email protected]>
Cc: Al Viro <[email protected]>
Signed-off-by: Linus Torvalds <[email protected]>

(cherry picked from commit 738a7832d21e3d911fcddab98ce260b79010b461)

Bug: 135608568
Test: test program using syscall(__NR_pidfd_send_signal,..) to send SIGKILL
Change-Id: Id52eaadf9da371fb2d9caae4df49627760de7229
Signed-off-by: Suren Baghdasaryan <[email protected]>
Make the anon_inodes facility unconditional so that it can be used by core
VFS code.

Signed-off-by: David Howells <[email protected]>
Signed-off-by: Al Viro <[email protected]>

(cherry picked from commit dadd2299ab61fc2b55b95b7b3a8f674cdd3b69c9)

Bug: 135608568
Test: test program using syscall(__NR_sys_pidfd_open,..) and poll()
Change-Id: I2f97bda4f360d8d05bbb603de839717b3d8067ae
Signed-off-by: Suren Baghdasaryan <[email protected]>
This patchset makes it possible to retrieve pid file descriptors at
process creation time by introducing the new flag CLONE_PIDFD to the
clone() system call.  Linus originally suggested to implement this as a
new flag to clone() instead of making it a separate system call.  As
spotted by Linus, there is exactly one bit for clone() left.

CLONE_PIDFD creates file descriptors based on the anonymous inode
implementation in the kernel that will also be used to implement the new
mount api.  They serve as a simple opaque handle on pids.  Logically,
this makes it possible to interpret a pidfd differently, narrowing or
widening the scope of various operations (e.g. signal sending).  Thus, a
pidfd cannot just refer to a tgid, but also a tid, or in theory - given
appropriate flag arguments in relevant syscalls - a process group or
session. A pidfd does not represent a privilege.  This does not imply it
cannot ever be that way but for now this is not the case.

A pidfd comes with additional information in fdinfo if the kernel supports
procfs.  The fdinfo file contains the pid of the process in the callers
pid namespace in the same format as the procfs status file, i.e. "Pid:\t%d".

As suggested by Oleg, with CLONE_PIDFD the pidfd is returned in the
parent_tidptr argument of clone.  This has the advantage that we can
give back the associated pid and the pidfd at the same time.

To remove worries about missing metadata access this patchset comes with
a sample program that illustrates how a combination of CLONE_PIDFD, and
pidfd_send_signal() can be used to gain race-free access to process
metadata through /proc/<pid>.  The sample program can easily be
translated into a helper that would be suitable for inclusion in libc so
that users don't have to worry about writing it themselves.

Suggested-by: Linus Torvalds <[email protected]>
Signed-off-by: Christian Brauner <[email protected]>
Co-developed-by: Jann Horn <[email protected]>
Signed-off-by: Jann Horn <[email protected]>
Reviewed-by: Oleg Nesterov <[email protected]>
Cc: Arnd Bergmann <[email protected]>
Cc: "Eric W. Biederman" <[email protected]>
Cc: Kees Cook <[email protected]>
Cc: Thomas Gleixner <[email protected]>
Cc: David Howells <[email protected]>
Cc: "Michael Kerrisk (man-pages)" <[email protected]>
Cc: Andy Lutomirsky <[email protected]>
Cc: Andrew Morton <[email protected]>
Cc: Aleksa Sarai <[email protected]>
Cc: Linus Torvalds <[email protected]>
Cc: Al Viro <[email protected]>

(cherry picked from commit b3e5838252665ee4cfa76b82bdf1198dca81e5be)

Conflicts:
        kernel/fork.c

(1. Replaced proc_pid_ns() with its direct implementation.)

Bug: 135608568
Test: test program using syscall(__NR_sys_pidfd_open,..) and poll()
Change-Id: I3c804a92faea686e5bf7f99df893fe3a5d87ddf7
Signed-off-by: Suren Baghdasaryan <[email protected]>
Signed-off-by: electimon <[email protected]>
Let pidfd_send_signal() use pidfds retrieved via CLONE_PIDFD.  With this
patch pidfd_send_signal() becomes independent of procfs.  This fullfils
the request made when we merged the pidfd_send_signal() patchset.  The
pidfd_send_signal() syscall is now always available allowing for it to
be used by users without procfs mounted or even users without procfs
support compiled into the kernel.

Signed-off-by: Christian Brauner <[email protected]>
Co-developed-by: Jann Horn <[email protected]>
Signed-off-by: Jann Horn <[email protected]>
Acked-by: Oleg Nesterov <[email protected]>
Cc: Arnd Bergmann <[email protected]>
Cc: "Eric W. Biederman" <[email protected]>
Cc: Kees Cook <[email protected]>
Cc: Thomas Gleixner <[email protected]>
Cc: David Howells <[email protected]>
Cc: "Michael Kerrisk (man-pages)" <[email protected]>
Cc: Andy Lutomirsky <[email protected]>
Cc: Andrew Morton <[email protected]>
Cc: Aleksa Sarai <[email protected]>
Cc: Linus Torvalds <[email protected]>
Cc: Al Viro <[email protected]>

(cherry picked from commit 2151ad1b067275730de1b38c7257478cae47d29e)

Conflicts:
        kernel/sys_ni.c

(1. Replaced COND_SYSCALL with cond_syscall.)

Bug: 135608568
Test: test program using syscall(__NR_sys_pidfd_open,..) and poll()
Change-Id: I621fe6547397e0e68c560d7da60ef7715deb290c
Signed-off-by: Suren Baghdasaryan <[email protected]>
Avoid calling cgroup_threadgroup_change_end() without having called
cgroup_threadgroup_change_begin() first.

During process creation we need to check whether the cgroup we are in
allows us to fork. To perform this check the cgroup needs to guard itself
against threadgroup changes and takes a lock.
Prior to CLONE_PIDFD the cleanup target "bad_fork_free_pid" would also need
to call cgroup_threadgroup_change_end() because said lock had already been
taken.
However, this is not the case anymore with the addition of CLONE_PIDFD. We
are now allocating a pidfd before we check whether the cgroup we're in can
fork and thus prior to taking the lock. So when copy_process() fails at the
right step it would release a lock we haven't taken.
This bug is not even very subtle to be honest. It's just not very clear
from the naming of cgroup_threadgroup_change_{begin,end}() that a lock is
taken.

Here's the relevant splat:

entry_SYSENTER_compat+0x70/0x7f arch/x86/entry/entry_64_compat.S:139
RIP: 0023:0xf7fec849
Code: 85 d2 74 02 89 0a 5b 5d c3 8b 04 24 c3 8b 14 24 c3 8b 3c 24 c3 90 90
90 90 90 90 90 90 90 90 90 90 51 52 55 89 e5 0f 34 cd 80 <5d> 5a 59 c3 90
90 90 90 eb 0d 90 90 90 90 90 90 90 90 90 90 90 90
RSP: 002b:00000000ffed5a8c EFLAGS: 00000246 ORIG_RAX: 0000000000000078
RAX: ffffffffffffffda RBX: 0000000000003ffc RCX: 0000000000000000
RDX: 00000000200005c0 RSI: 0000000000000000 RDI: 0000000000000000
RBP: 0000000000000012 R08: 0000000000000000 R09: 0000000000000000
R10: 0000000000000000 R11: 0000000000000000 R12: 0000000000000000
R13: 0000000000000000 R14: 0000000000000000 R15: 0000000000000000
------------[ cut here ]------------
DEBUG_LOCKS_WARN_ON(depth <= 0)
WARNING: CPU: 1 PID: 7744 at kernel/locking/lockdep.c:4052 __lock_release
kernel/locking/lockdep.c:4052 [inline]
WARNING: CPU: 1 PID: 7744 at kernel/locking/lockdep.c:4052
lock_release+0x667/0xa00 kernel/locking/lockdep.c:4321
Kernel panic - not syncing: panic_on_warn set ...
CPU: 1 PID: 7744 Comm: syz-executor007 Not tainted 5.1.0+ #4
Hardware name: Google Google Compute Engine/Google Compute Engine, BIOS
Google 01/01/2011
Call Trace:
  __dump_stack lib/dump_stack.c:77 [inline]
  dump_stack+0x172/0x1f0 lib/dump_stack.c:113
  panic+0x2cb/0x65c kernel/panic.c:214
  __warn.cold+0x20/0x45 kernel/panic.c:566
  report_bug+0x263/0x2b0 lib/bug.c:186
  fixup_bug arch/x86/kernel/traps.c:179 [inline]
  fixup_bug arch/x86/kernel/traps.c:174 [inline]
  do_error_trap+0x11b/0x200 arch/x86/kernel/traps.c:272
  do_invalid_op+0x37/0x50 arch/x86/kernel/traps.c:291
  invalid_op+0x14/0x20 arch/x86/entry/entry_64.S:972
RIP: 0010:__lock_release kernel/locking/lockdep.c:4052 [inline]
RIP: 0010:lock_release+0x667/0xa00 kernel/locking/lockdep.c:4321
Code: 0f 85 a0 03 00 00 8b 35 77 66 08 08 85 f6 75 23 48 c7 c6 a0 55 6b 87
48 c7 c7 40 25 6b 87 4c 89 85 70 ff ff ff e8 b7 a9 eb ff <0f> 0b 4c 8b 85
70 ff ff ff 4c 89 ea 4c 89 e6 4c 89 c7 e8 52 63 ff
RSP: 0018:ffff888094117b48 EFLAGS: 00010086
RAX: 0000000000000000 RBX: 1ffff11012822f6f RCX: 0000000000000000
RDX: 0000000000000000 RSI: ffffffff815af236 RDI: ffffed1012822f5b
RBP: ffff888094117c00 R08: ffff888092bfc400 R09: fffffbfff113301d
R10: fffffbfff113301c R11: ffffffff889980e3 R12: ffffffff8a451df8
R13: ffffffff8142e71f R14: ffffffff8a44cc80 R15: ffff888094117bd8
  percpu_up_read.constprop.0+0xcb/0x110 include/linux/percpu-rwsem.h:92
  cgroup_threadgroup_change_end include/linux/cgroup-defs.h:712 [inline]
  copy_process.part.0+0x47ff/0x6710 kernel/fork.c:2222
  copy_process kernel/fork.c:1772 [inline]
  _do_fork+0x25d/0xfd0 kernel/fork.c:2338
  __do_compat_sys_x86_clone arch/x86/ia32/sys_ia32.c:240 [inline]
  __se_compat_sys_x86_clone arch/x86/ia32/sys_ia32.c:236 [inline]
  __ia32_compat_sys_x86_clone+0xbc/0x140 arch/x86/ia32/sys_ia32.c:236
  do_syscall_32_irqs_on arch/x86/entry/common.c:334 [inline]
  do_fast_syscall_32+0x281/0xd54 arch/x86/entry/common.c:405
  entry_SYSENTER_compat+0x70/0x7f arch/x86/entry/entry_64_compat.S:139
RIP: 0023:0xf7fec849
Code: 85 d2 74 02 89 0a 5b 5d c3 8b 04 24 c3 8b 14 24 c3 8b 3c 24 c3 90 90
90 90 90 90 90 90 90 90 90 90 51 52 55 89 e5 0f 34 cd 80 <5d> 5a 59 c3 90
90 90 90 eb 0d 90 90 90 90 90 90 90 90 90 90 90 90
RSP: 002b:00000000ffed5a8c EFLAGS: 00000246 ORIG_RAX: 0000000000000078
RAX: ffffffffffffffda RBX: 0000000000003ffc RCX: 0000000000000000
RDX: 00000000200005c0 RSI: 0000000000000000 RDI: 0000000000000000
RBP: 0000000000000012 R08: 0000000000000000 R09: 0000000000000000
R10: 0000000000000000 R11: 0000000000000000 R12: 0000000000000000
R13: 0000000000000000 R14: 0000000000000000 R15: 0000000000000000
Kernel Offset: disabled
Rebooting in 86400 seconds..

Reported-and-tested-by: [email protected]
Fixes: b3e583825266 ("clone: add CLONE_PIDFD")
Signed-off-by: Christian Brauner <[email protected]>

(cherry picked from commit c3b7112df86b769927a60a6d7175988ca3d60f09)

Bug: 135608568
Test: test program using syscall(__NR_sys_pidfd_open,..) and poll()
Change-Id: Ib9ecb1e5c0c6e2d062b89c25109ec571570eb497
Signed-off-by: Suren Baghdasaryan <[email protected]>
Improve the comments for pidfd_send_signal().
First, the comment still referred to a file descriptor for a process as a
"task file descriptor" which stems from way back at the beginning of the
discussion. Replace this with "pidfd" for consistency.
Second, the wording for the explanation of the arguments to the syscall
was a bit inconsistent, e.g. some used the past tense some used present
tense. Make the wording more consistent.

Signed-off-by: Christian Brauner <[email protected]>

(cherry picked from commit c732327f04a3818f35fa97d07b1d64d31b691d78)

Bug: 135608568
Test: test program using syscall(__NR_sys_pidfd_open,..) and poll()
Change-Id: I06c6bdd1dddaeb8ac75a78dd21f9cdd0dc139a4c
Signed-off-by: Suren Baghdasaryan <[email protected]>
This patch adds polling support to pidfd.

Android low memory killer (LMK) needs to know when a process dies once
it is sent the kill signal. It does so by checking for the existence of
/proc/pid which is both racy and slow. For example, if a PID is reused
between when LMK sends a kill signal and checks for existence of the
PID, since the wrong PID is now possibly checked for existence.
Using the polling support, LMK will be able to get notified when a process
exists in race-free and fast way, and allows the LMK to do other things
(such as by polling on other fds) while awaiting the process being killed
to die.

For notification to polling processes, we follow the same existing
mechanism in the kernel used when the parent of the task group is to be
notified of a child's death (do_notify_parent). This is precisely when the
tasks waiting on a poll of pidfd are also awakened in this patch.

We have decided to include the waitqueue in struct pid for the following
reasons:
1. The wait queue has to survive for the lifetime of the poll. Including
   it in task_struct would not be option in this case because the task can
   be reaped and destroyed before the poll returns.

2. By including the struct pid for the waitqueue means that during
   de_thread(), the new thread group leader automatically gets the new
   waitqueue/pid even though its task_struct is different.

Appropriate test cases are added in the second patch to provide coverage of
all the cases the patch is handling.

Cc: Andy Lutomirski <[email protected]>
Cc: Steven Rostedt <[email protected]>
Cc: Daniel Colascione <[email protected]>
Cc: Jann Horn <[email protected]>
Cc: Tim Murray <[email protected]>
Cc: Jonathan Kowalski <[email protected]>
Cc: Linus Torvalds <[email protected]>
Cc: Al Viro <[email protected]>
Cc: Kees Cook <[email protected]>
Cc: David Howells <[email protected]>
Cc: Oleg Nesterov <[email protected]>
Cc: [email protected]
Reviewed-by: Oleg Nesterov <[email protected]>
Co-developed-by: Daniel Colascione <[email protected]>
Signed-off-by: Daniel Colascione <[email protected]>
Signed-off-by: Joel Fernandes (Google) <[email protected]>
Signed-off-by: Christian Brauner <[email protected]>

(cherry picked from commit b53b0b9d9a613c418057f6cb921c2f40a6f78c24)

Bug: 135608568
Test: test program using syscall(__NR_sys_pidfd_open,..) and poll()
Change-Id: I02f259d2875bec46b198d580edfbb067f077084e
Signed-off-by: Suren Baghdasaryan <[email protected]>
This adds the pidfd_open() syscall. It allows a caller to retrieve pollable
pidfds for a process which did not get created via CLONE_PIDFD, i.e. for a
process that is created via traditional fork()/clone() calls that is only
referenced by a PID:

int pidfd = pidfd_open(1234, 0);
ret = pidfd_send_signal(pidfd, SIGSTOP, NULL, 0);

With the introduction of pidfds through CLONE_PIDFD it is possible to
created pidfds at process creation time.
However, a lot of processes get created with traditional PID-based calls
such as fork() or clone() (without CLONE_PIDFD). For these processes a
caller can currently not create a pollable pidfd. This is a problem for
Android's low memory killer (LMK) and service managers such as systemd.
Both are examples of tools that want to make use of pidfds to get reliable
notification of process exit for non-parents (pidfd polling) and race-free
signal sending (pidfd_send_signal()). They intend to switch to this API for
process supervision/management as soon as possible. Having no way to get
pollable pidfds from PID-only processes is one of the biggest blockers for
them in adopting this api. With pidfd_open() making it possible to retrieve
pidfds for PID-based processes we enable them to adopt this api.

In line with Arnd's recent changes to consolidate syscall numbers across
architectures, I have added the pidfd_open() syscall to all architectures
at the same time.

Signed-off-by: Christian Brauner <[email protected]>
Reviewed-by: David Howells <[email protected]>
Reviewed-by: Oleg Nesterov <[email protected]>
Acked-by: Arnd Bergmann <[email protected]>
Cc: "Eric W. Biederman" <[email protected]>
Cc: Kees Cook <[email protected]>
Cc: Joel Fernandes (Google) <[email protected]>
Cc: Thomas Gleixner <[email protected]>
Cc: Jann Horn <[email protected]>
Cc: Andy Lutomirsky <[email protected]>
Cc: Andrew Morton <[email protected]>
Cc: Aleksa Sarai <[email protected]>
Cc: Linus Torvalds <[email protected]>
Cc: Al Viro <[email protected]>
Cc: [email protected]

(cherry picked from commit 32fcb426ec001cb6d5a4a195091a8486ea77e2df)

Conflicts:
        kernel/pid.c

(1. Replaced PIDTYPE_TGID with PIDTYPE_PID and thread_group_leader() check in pidfd_open() call)

Bug: 135608568
Test: test program using syscall(__NR_sys_pidfd_open,..) and poll()
Change-Id: I52a93a73722d7f7754dae05f63b94b4ca4a71a75
Signed-off-by: Suren Baghdasaryan <[email protected]>
Signed-off-by: electimon <[email protected]>
Arrange for mach-types.h to be directly generated in the relevant
path, so we don't need a one-liner file in arch/arm/include/asm/.

Change-Id: I4693b7279625fdcbf5d99902af40968414fc998a
Signed-off-by: Russell King <[email protected]>
Git-Commit: 4e2648db9c5f7b2281551694597102612f54460d
Git-Repo: git://git.kernel.org/pub/scm/linux/kernel/git/torvalds/linux.git
[[email protected]: resolve trivial merge conflicts]
Signed-off-by: Swetha Chikkaboraiah <[email protected]>
(cherry-picked from e33f8d32677fa4f4f8996ef46748f86aac81ccff)

Disable the generic address limit check in favor of an architecture
specific optimized implementation. The generic implementation using
pending work flags did not work well with ARM and alignment faults.

The address limit is checked on each syscall return path to user-mode
path as well as the irq user-mode return function. If the address limit
was changed, a function is called to report data corruption (stopping
the kernel or process based on configuration).

The address limit check has to be done before any pending work because
they can reset the address limit and the process is killed using a
SIGKILL signal. For example the lkdtm address limit check does not work
because the signal to kill the process will reset the user-mode address
limit.

Change-Id: Ic61ba05961ad1dcf10c48040427d92bd650616af
Signed-off-by: Thomas Garnier <[email protected]>
Signed-off-by: Kees Cook <[email protected]>
Tested-by: Kees Cook <[email protected]>
Tested-by: Leonard Crestez <[email protected]>
Reviewed-by: Kees Cook <[email protected]>
Signed-off-by: Thomas Gleixner <[email protected]>
Cc: Pratyush Anand <[email protected]>
Cc: Dave Martin <[email protected]>
Cc: Will Drewry <[email protected]>
Cc: Arnd Bergmann <[email protected]>
Cc: Catalin Marinas <[email protected]>
Cc: Will Deacon <[email protected]>
Cc: Russell King <[email protected]>
Cc: Andy Lutomirski <[email protected]>
Cc: David Howells <[email protected]>
Cc: Dave Hansen <[email protected]>
Cc: Al Viro <[email protected]>
Cc: [email protected]
Cc: Yonghong Song <[email protected]>
Cc: [email protected]
Link: http://lkml.kernel.org/r/[email protected]
Signed-off-by: Satya Tangirala <[email protected]>
Convert ARM to use a similar mechanism to x86 to generate the unistd.h
system call numbers and the various kernel system call tables.  This
means that rather than having to edit three places (asm/unistd.h for
the total number of system calls, uapi/asm/unistd.h for the system call
numbers, and arch/arm/kernel/calls.S for the call table) we have only
one place to edit, making the process much more simple.

The scripts have knowledge of the table padding requirements, so there's
no need to worry about __NR_syscalls not fitting within the immediate
constant field of ALU instructions anymore.

Change-Id: Ie70e712b4779601beaeb4f660b8fa910a159ce87
Signed-off-by: Russell King <[email protected]>
Git-Commit: 96a8fae0fe094b6a26a3ec88b2f097418f269cfe
Git-Repo: git://git.kernel.org/pub/scm/linux/kernel/git/torvalds/linux.git
[[email protected]: resolve trivial merge conflicts]
Signed-off-by: Swetha Chikkaboraiah <[email protected]>
This wires up the pidfd_open() syscall into all arches at once.

Signed-off-by: Christian Brauner <[email protected]>
Reviewed-by: David Howells <[email protected]>
Reviewed-by: Oleg Nesterov <[email protected]>
Acked-by: Arnd Bergmann <[email protected]>
Cc: "Eric W. Biederman" <[email protected]>
Cc: Kees Cook <[email protected]>
Cc: Joel Fernandes (Google) <[email protected]>
Cc: Thomas Gleixner <[email protected]>
Cc: Jann Horn <[email protected]>
Cc: Andy Lutomirsky <[email protected]>
Cc: Andrew Morton <[email protected]>
Cc: Aleksa Sarai <[email protected]>
Cc: Linus Torvalds <[email protected]>
Cc: Al Viro <[email protected]>
Cc: [email protected]
Cc: [email protected]
Cc: [email protected]
Cc: [email protected]
Cc: [email protected]
Cc: [email protected]
Cc: [email protected]
Cc: [email protected]
Cc: [email protected]
Cc: [email protected]
Cc: [email protected]
Cc: [email protected]
Cc: [email protected]
Cc: [email protected]

(cherry picked from commit 7615d9e1780e26e0178c93c55b73309a5dc093d7)

Conflicts:

        arch/alpha/kernel/syscalls/syscall.tbl
        arch/arm/tools/syscall.tbl
        arch/ia64/kernel/syscalls/syscall.tbl
        arch/m68k/kernel/syscalls/syscall.tbl
        arch/microblaze/kernel/syscalls/syscall.tbl
        arch/mips/kernel/syscalls/syscall_n32.tbl
        arch/mips/kernel/syscalls/syscall_n64.tbl
        arch/mips/kernel/syscalls/syscall_o32.tbl
        arch/parisc/kernel/syscalls/syscall.tbl
        arch/powerpc/kernel/syscalls/syscall.tbl
        arch/s390/kernel/syscalls/syscall.tbl
        arch/sh/kernel/syscalls/syscall.tbl
        arch/sparc/kernel/syscalls/syscall.tbl
        arch/xtensa/kernel/syscalls/syscall.tbl
        arch/x86/entry/syscalls/syscall_32.tbl
        arch/x86/entry/syscalls/syscall_64.tbl

(1. Skipped syscall.tbl modifications for missing architectures.
 2. Removed __ia32_sys_pidfd_open in arch/x86/entry/syscalls/syscall_32.tbl.
 3. Replaced __x64_sys_pidfd_open with sys_pidfd_open in arch/x86/entry/syscalls/syscall_64.tbl.)

Bug: 135608568
Test: test program using syscall(__NR_sys_pidfd_open,..) and poll()
Change-Id: I294aa33dea5ed2662e077340281d7aa0452f7471
Signed-off-by: Suren Baghdasaryan <[email protected]>
There is a race between reading task->exit_state in pidfd_poll and
writing it after do_notify_parent calls do_notify_pidfd. Expected
sequence of events is:

CPU 0                            CPU 1
------------------------------------------------
exit_notify
  do_notify_parent
    do_notify_pidfd
  tsk->exit_state = EXIT_DEAD
                                  pidfd_poll
                                     if (tsk->exit_state)

However nothing prevents the following sequence:

CPU 0                            CPU 1
------------------------------------------------
exit_notify
  do_notify_parent
    do_notify_pidfd
                                   pidfd_poll
                                      if (tsk->exit_state)
  tsk->exit_state = EXIT_DEAD

This causes a polling task to wait forever, since poll blocks because
exit_state is 0 and the waiting task is not notified again. A stress
test continuously doing pidfd poll and process exits uncovered this bug.

To fix it, we make sure that the task's exit_state is always set before
calling do_notify_pidfd.

Fixes: b53b0b9d9a6 ("pidfd: add polling support")
Cc: [email protected]
Cc: Oleg Nesterov <[email protected]>
Signed-off-by: Suren Baghdasaryan <[email protected]>
Signed-off-by: Joel Fernandes (Google) <[email protected]>
Link: https://lore.kernel.org/r/[email protected]
[[email protected]: adapt commit message and drop unneeded changes from wait_task_zombie]
Signed-off-by: Christian Brauner <[email protected]>

(cherry picked from commit b191d6491be67cef2b3fa83015561caca1394ab9)

Bug: 135608568
Test: test program using syscall(__NR_sys_pidfd_open,..) and poll()
Change-Id: I043e54c9b69f25de88f6f19ae167920af8532de2
Signed-off-by: Suren Baghdasaryan <[email protected]>
@danielkot danielkot changed the title Include third party dex adapter Include third party dex adapter and update 10k commits Feb 11, 2023
@danielkot danielkot changed the title Include third party dex adapter and update 10k commits Include third party dex adapter and update 10k commits if needed Feb 11, 2023
@danielkot danielkot changed the title Include third party dex adapter and update 10k commits if needed delete Feb 11, 2023
Royna2544 pushed a commit to Roynas-Android-Playground/android_kernel_samsung_universal8895 that referenced this pull request May 15, 2023
…g the sock

[ Upstream commit 3cf7203ca620682165706f70a1b12b5194607dce ]

There is a race condition in vxlan that when deleting a vxlan device
during receiving packets, there is a possibility that the sock is
released after getting vxlan_sock vs from sk_user_data. Then in
later vxlan_ecn_decapsulate(), vxlan_get_sk_family() we will got
NULL pointer dereference. e.g.

   #0 [ffffa25ec6978a38] machine_kexec at ffffffff8c669757
   8890q#1 [ffffa25ec6978a90] __crash_kexec at ffffffff8c7c0a4d
   8890q#2 [ffffa25ec6978b58] crash_kexec at ffffffff8c7c1c48
   8890q#3 [ffffa25ec6978b60] oops_end at ffffffff8c627f2b
   8890q#4 [ffffa25ec6978b80] page_fault_oops at ffffffff8c678fcb
   exynos8895#5 [ffffa25ec6978bd8] exc_page_fault at ffffffff8d109542
   exynos8895#6 [ffffa25ec6978c00] asm_exc_page_fault at ffffffff8d200b62
      [exception RIP: vxlan_ecn_decapsulate+0x3b]
      RIP: ffffffffc1014e7b  RSP: ffffa25ec6978cb0  RFLAGS: 00010246
      RAX: 0000000000000008  RBX: ffff8aa000888000  RCX: 0000000000000000
      RDX: 000000000000000e  RSI: ffff8a9fc7ab803e  RDI: ffff8a9fd1168700
      RBP: ffff8a9fc7ab803e   R8: 0000000000700000   R9: 00000000000010ae
      R10: ffff8a9fcb748980  R11: 0000000000000000  R12: ffff8a9fd1168700
      R13: ffff8aa000888000  R14: 00000000002a0000  R15: 00000000000010ae
      ORIG_RAX: ffffffffffffffff  CS: 0010  SS: 0018
   exynos8895#7 [ffffa25ec6978ce8] vxlan_rcv at ffffffffc10189cd [vxlan]
   exynos8895#8 [ffffa25ec6978d90] udp_queue_rcv_one_skb at ffffffff8cfb6507
   exynos8895#9 [ffffa25ec6978dc0] udp_unicast_rcv_skb at ffffffff8cfb6e45
  exynos8895#10 [ffffa25ec6978dc8] __udp4_lib_rcv at ffffffff8cfb8807
  exynos8895#11 [ffffa25ec6978e20] ip_protocol_deliver_rcu at ffffffff8cf76951
  exynos8895#12 [ffffa25ec6978e48] ip_local_deliver at ffffffff8cf76bde
  exynos8895#13 [ffffa25ec6978ea0] __netif_receive_skb_one_core at ffffffff8cecde9b
  exynos8895#14 [ffffa25ec6978ec8] process_backlog at ffffffff8cece139
  exynos8895#15 [ffffa25ec6978f00] __napi_poll at ffffffff8ceced1a
  exynos8895#16 [ffffa25ec6978f28] net_rx_action at ffffffff8cecf1f3
  exynos8895#17 [ffffa25ec6978fa0] __softirqentry_text_start at ffffffff8d4000ca
  exynos8895#18 [ffffa25ec6978ff0] do_softirq at ffffffff8c6fbdc3

Reproducer: https://github.com/Mellanox/ovs-tests/blob/master/test-ovs-vxlan-remove-tunnel-during-traffic.sh

Fix this by waiting for all sk_user_data reader to finish before
releasing the sock.

Reported-by: Jianlin Shi <[email protected]>
Suggested-by: Jakub Sitnicki <[email protected]>
Fixes: 6a93cc9 ("udp-tunnel: Add a few more UDP tunnel APIs")
Signed-off-by: Hangbin Liu <[email protected]>
Reviewed-by: Jiri Pirko <[email protected]>
Signed-off-by: David S. Miller <[email protected]>
Signed-off-by: Sasha Levin <[email protected]>
Signed-off-by: Ulrich Hecht <[email protected]>
Royna2544 pushed a commit to Roynas-Android-Playground/android_kernel_samsung_universal8895 that referenced this pull request May 15, 2023
[ Upstream commit 9af31d6ec1a4be4caab2550096c6bd2ba8fba472 ]

There is an use-after-free problem reported by KASAN:
  ==================================================================
  BUG: KASAN: use-after-free in ubi_eba_copy_table+0x11f/0x1c0 [ubi]
  Read of size 8 at addr ffff888101eec008 by task ubirsvol/4735

  CPU: 2 PID: 4735 Comm: ubirsvol
  Not tainted 6.1.0-rc1-00003-g84fa3304a7fc-dirty exynos8895#14
  Hardware name: QEMU Standard PC (i440FX + PIIX, 1996),
  BIOS 1.14.0-1.fc33 04/01/2014
  Call Trace:
   <TASK>
   dump_stack_lvl+0x34/0x44
   print_report+0x171/0x472
   kasan_report+0xad/0x130
   ubi_eba_copy_table+0x11f/0x1c0 [ubi]
   ubi_resize_volume+0x4f9/0xbc0 [ubi]
   ubi_cdev_ioctl+0x701/0x1850 [ubi]
   __x64_sys_ioctl+0x11d/0x170
   do_syscall_64+0x35/0x80
   entry_SYSCALL_64_after_hwframe+0x46/0xb0
   </TASK>

When ubi_change_vtbl_record() returns an error in ubi_resize_volume(),
"new_eba_tbl" will be freed on error handing path, but it is holded
by "vol->eba_tbl" in ubi_eba_replace_table(). It means that the liftcycle
of "vol->eba_tbl" and "vol" are different, so when resizing volume in
next time, it causing an use-after-free fault.

Fix it by not freeing "new_eba_tbl" after it replaced in
ubi_eba_replace_table(), while will be freed in next volume resizing.

Fixes: 801c135 ("UBI: Unsorted Block Images")
Signed-off-by: Li Zetao <[email protected]>
Reviewed-by: Zhihao Cheng <[email protected]>
Signed-off-by: Richard Weinberger <[email protected]>
Signed-off-by: Sasha Levin <[email protected]>
Signed-off-by: Ulrich Hecht <[email protected]>
When building for Android using CC for linking produces an error
because its clang calls the host linker aarch64-linux-gnu-ld
if available but the build tools prohibit the use of host
linkers.

Vanilla nowadays also uses LD since 691efbedc60d2a7364a90e38882fc762f06f52c4
but the same approach did not work because of some missing changes in
Makefile.lib and bad interactions with the compile steps.

Change-Id: I5e42f300a0523f710dccc0e73138d21ccbf55e96
Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment
Labels
None yet
Projects
None yet
Development

Successfully merging this pull request may close these issues.