   1 #  -*- mode: Outline; fill-column: 78; fill-prefix: "   " -*-
   2 #
   3 #  klips2-design.txt
   4 #       Richard Guy Briggs <rgb@conscoop.ottawa.on.ca>
   5 #
   6 #  RCSID $Id: klips2-design.txt,v 1.18 2001/07/06 07:32:43 rgb Exp $
   7 #
   8
   9 * Outline Commands cheat sheet (C-c C-s to see this)
  10         C-c C-t         Hide EVERYTHING in buffer
  11         C-c C-a         Show EVERYTHING in buffer
  12
  13         C-c C-d         Hide THIS item and subitems (subtree)
  14         C-c C-s         Show THIS item and subitems (subtree)
  15
  16         C-c C-c         Hide ONE item
  17         C-c C-e         Show ONE item
  18
  19 * Introduction
  20
  21 Linux FreeS/WAN IPSec -- KLIPS2 DESIGN
  22 ======================================
  23
  24 # This document outlines the proposed design for KLIPS2, the second
  25 # generation Linux FreeS/WAN IPSec kernel implementation.  It is
  26 # accompanied by the following:
  27 #
  28 # klips2-design.dia             dia(1) diagram
  29 # klips2-design-legend.txt      diagram legend
  30 # klips2-design-api.txt         API descriptions
  31 # klips2-design-api-trips.txt   scenarios that force trips through the
  32 #                               various APIs to verify all needed resources
  33 #                               are available
  34
  35 # This document is divided up into Introduction, Goals, Feature requests,
  36 # Packet path overview, NetFilter overview, IPSec path description, and
  37 # Other issues.
  38 #
  39 # This document was originally written 2.5 weeks after OLS2000, inspired
  40 # by a meeting with Rusty and Marc Boucher in Montreal in November 1999
  41 # and two meetings at OLS2000.
  42 #
  43 # Please comment to the linux-ipsec, netfilter-devel or netdev lists.
  44 #
  45 # Current kernel version reference is 2.4.4 with iptables 1.2
  46
  47 * Goals:
  48
  49         To get rid of the current IPSec virtual interfaces that associate with
  50         a specific physical network interface and replace them with IPSec
  51         virtual interfaces that specify a local gateway address as a source
  52         address, a remote gateway address as a destination address and
  53         specific tunnel policy or SAs.
  54
  55         To use existing packet matching engines rather than re-invent.
  56
  57         To support more of the required selectors, especially source and
  58         destination ports, and possibly userid.  Security labels are not of
  59         obvious value, but the selectors will be easy to add in the future if
  60         they are implemented in the Linux kernel.
  61
  62         To get access to *all* packets incoming and outgoing to enforce policy
  63         in both directions.
  64
  65         To better support opportunistic encryption.
  66
  67         To take advantage of the parallelism of SMP and H/W encryption.
  68
  69         To make encryption and authentication modular.
  70
  71 The idea is to redesign KLIPS (kernel parts of FreeS/WAN) to avoid all the
  72 'stoopid routing tricks' (TM) to which we have had to resort since the project
  73 started.  We would do this by disassociating IPSec devices from physical
  74 devices and by adding a proper SPDB to do proper incoming IPSec policy checks.
  75 We are hoping to use existing pattern-matching tools rather than invent our
  76 own.  NetFilter appears to have all the pattern-matching capabilities we need,
  77 with the exception of security labels, which Linux doesn't appear to have
  78 anyway, but it may be limited in other ways.
  79
  80 There is also a significant interest in enabling FreeS/WAN to communicate with
  81 routing daemons and be able to do load sharing and failover:
  82
  83         http://www.quintillion.com/fdis/moat/ipsec+routing/
  84
  85
  86
  87 * Feature requests:
  88
  89 Code=Level/Status/Priority
  90 Level:
  91         S = strategic
  92         M = middle
  93         T = tactical
  94         U = unset
  95 Status:
  96         U = unstarted
  97         C = coded
  98         T = tested
  99         D = deleted
 100 Priority:
 101         H = high
 102         M = medium
 103         L = low
 104         U = unset
 105         D = deprecated
 106
 107 Code    Feature         Implementation
 108 ====    =======         ==============
 109 S/U/U   changeable gw wild-side addresses on-the-fly
 110                         - road warriors with RSA keys and hooks from DHCP
 111                           to move to a new set of SAs upon expiry of previous
  112                           DHCP lease.  Notify peers.  Negotiate new tunnels.
  113                           Handle delayed or denied renewal; see: conn up, down,
 114                           wanted.
 115 M/U/U   address inertia for remote gw's with changeable wild-side addresses
 116         so local gw reboots will initiate reconnect to remotes.
 117                         - this requires a disk cache.
 118                         - 3 possible levels:
 119                                 - save list of connections
 120                                 - save IKE phase 1 keys
 121                                 - save IPSEC SA keys (requires KLIPS mods)
 122 T/U/U   mini-database of road warriors that persists across reboots.
 123                         - this requires a disk cache.
 124 M/U/U   connection up, down, wanted
 125                         - KLIPS2?
 126 S/U/U   routing below tunnel layer to support mobility and multi-homing
 127                         -
 128 M/U/U   tunnel identified by subnets served?
 129                         -
 130 M/U/U   why do equalizing schedulers not play well with tunnels?
 131                         -
 132 M/U/U   decouple SA retrieval from DADDR (don't care how it arrived)
 133                         - protocol redesign
 134                                 sysctl and ifdef for dstaddr
  135 T/U/U   SPIs unique, independent of protocol and DADDR
 136                         -
 137                                 sysctl and ifdef for protocol
 138 S/U/U   routing above tunnel layer
 139                         -
 140 S/U/U   granularity smaller than host
 141                         - SPORT, DPORT, UID, SecLev
 142 M/U/U   /dev/ipsecNNN devices that could be chown(1)ed and chmod(1)ed.
 143
 144 M/U/U   process to process tunnels
 145
 146 T/U/U   netfilter,pf_key,ioctl,/dev/ipsecNNN ways to manipulate tunnel perms.
 147
 148 S/U/U   KLIPS as a loadable module (isn't it already?)
 149
 150 S/U/U   stats: {number,time_of_last} packets {out,good_in,error_in}
 151
 152 S/U/U   integrate IPSec and firewall policy into Security Policy.
 153         (What APIs and user-level tools?)
 154
 155 S/U/U   full inbound policy checking
 156 S/U/U   secure ciphers and hashes
 157 T/U/U   kernel implementation (should be faster)
 158 S/U/U   plays well with routing daemons
 159 S/U/U   free of export restrictions
 160 T/U/U   standard crypto api to add newer ciphers and hashes
 161 S/U/U   opportunistic
 162 T/U/U   SADB hash table will be locked for additions/deletions
 163 T/U/U   use a refcount on each SA to increase locking granularity
 164
 165
 166
 167 * Packet path overview:
 168
 169 The basic path through the kernel as it concerns IPSec for the three
 170 types of packets is as follows:
 171
 172 IN:
 173         NIC
 174         basic sanity checks
 175         NF_IP_PRE_ROUTING hook
 176         route-in
 177         ip-options processing
 178         defragmentation
 179         NF_IP_LOCAL_IN hook
 180         transport layer demux
 181         application
 182
 183 FORWARD:
 184         NIC
 185         basic sanity checks
 186         NF_IP_PRE_ROUTING hook
  187         route-in
 188         ip-options processing
 189         ttl decrement and check
 190         NF_IP_FORWARD hook
 191         fragmentation
 192         NF_IP_POST_ROUTING hook
 193         fragmentation
 194         output()
 195         NIC
 196
 197 OUT:
 198         application
 199         transport layer mux
 200         NF_IP_LOCAL_OUT hook
 201         route-out
 202         NF_IP_POST_ROUTING hook
 203         fragmentation
 204         output()
 205         NIC
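
As a rough illustration only, the three paths above can be reduced to the
NetFilter hooks each one crosses.  This is a userspace sketch; the enum and
function names are made up for the example and are not kernel identifiers.

```c
#include <stddef.h>

/* The five IPv4 NetFilter hooks named in the paths above. */
enum hook { PRE_ROUTING, LOCAL_IN, FORWARD, LOCAL_OUT, POST_ROUTING };

/* The three packet types described in this section. */
enum path { PATH_IN, PATH_FORWARD, PATH_OUT };

/*
 * Write the hooks crossed by a packet of the given type into out[]
 * (room for 3 entries) and return how many were written.
 */
size_t hooks_for_path(enum path p, enum hook out[3])
{
	size_t n = 0;

	switch (p) {
	case PATH_IN:
		out[n++] = PRE_ROUTING;
		out[n++] = LOCAL_IN;
		break;
	case PATH_FORWARD:
		out[n++] = PRE_ROUTING;
		out[n++] = FORWARD;
		out[n++] = POST_ROUTING;
		break;
	case PATH_OUT:
		out[n++] = LOCAL_OUT;
		out[n++] = POST_ROUTING;
		break;
	}
	return n;
}
```

Only forwarded packets cross three hooks; the local paths cross two each.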
 206
 207
 208
 209 * NetFilter overview:
 210
 211 The basic architecture of NetFilter is:
 212
 213        --->[1]--->(ROUTE)--->[3]--->[4]--->     where:
 214                      |            ^             [1] NF_IP_PRE_ROUTING
 215                      |            |             [2] NF_IP_LOCAL_IN
 216                      |         (ROUTE)          [3] NF_IP_FORWARD
 217                      v            |             [4] NF_IP_POST_ROUTING
 218                     [2]          [5]            [5] NF_IP_LOCAL_OUT
 219                      |            ^
 220                      |            |
 221                      v            |
 222
  223 Destination NAT (port forwarding) gets applied in NF_IP_PRE_ROUTING and
  224 NF_IP_LOCAL_OUT at priority NF_IP_PRI_NAT_DST = -100, and Source NAT
  225 (masquerading) gets applied in NF_IP_POST_ROUTING at priority
  226 NF_IP_PRI_NAT_SRC = 100.  Filtering is applied in NF_IP_LOCAL_IN,
  227 NF_IP_FORWARD and NF_IP_LOCAL_OUT at priority NF_IP_PRI_FILTER = 0.
 228
 229
 230 Hook processing order would generally be:
 231 <PRE>
 232 PRE     IN      FWD     OUT     POST    PRIORITY MACRO         PRI
 233 =======.=======.=======.=======.=======.=================== = ====
 234 -500?  .       .       .       .       .NF_IP_PRI_IPSEC_IN  = -500 (?)
 235 -200   .       .       .-200   .       .NF_IP_PRI_CONNTRACK = -200
 236 -175? . . . . . . . . . . . . . . . . . NF_IP_PRI_IPSEC_IN  = -175 (?)
 237 -150   .       .       .-150   .       .NF_IP_PRI_MANGLE    = -150
 238 -100   .       .       .-100   .       .NF_IP_PRI_NAT_DST   = -100
 239 . . . . . .0. . . .0. . . .0. . . . . . NF_IP_PRI_FILTER    =    0
 240        .       .       .       . 100   .NF_IP_PRI_NAT_SRC   =  100
 241        .       .       .       . 500   .NF_IP_PRI_IPSEC_OUT =  500
 242 =======.=======.=======.=======.=======.=================== = ====
 243 </PRE>
 244
  245 Not all modules are present at each hook.  I am still uncertain whether
  246 IPSEC_IN should be before or after CONNTRACK.  Any comments?
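
To make the ordering concrete, here is a small userspace sketch of how the
functions registered on one hook would be called, lowest priority value
first.  The enum names are shortened for the example.  The two IPSEC values
are only the ones proposed in the table above and exist in no kernel header;
struct hook_entry and sort_hook() are hypothetical helpers.

```c
#include <stdlib.h>

/*
 * Priorities from the table above.  The IPSEC_IN and IPSEC_OUT values
 * are proposals from this document, not existing kernel constants.
 */
enum pri {
	PRI_IPSEC_IN  = -500,	/* proposed */
	PRI_CONNTRACK = -200,
	PRI_MANGLE    = -150,
	PRI_NAT_DST   = -100,
	PRI_FILTER    = 0,
	PRI_NAT_SRC   = 100,
	PRI_IPSEC_OUT = 500	/* proposed */
};

/* One function registered on a hook (hypothetical, for illustration). */
struct hook_entry {
	const char *name;
	int priority;
};

/* qsort() comparator: the lower priority value runs first. */
static int by_priority(const void *a, const void *b)
{
	const struct hook_entry *x = a, *y = b;

	return (x->priority > y->priority) - (x->priority < y->priority);
}

/* Sort the entries registered on one hook into calling order. */
void sort_hook(struct hook_entry *entries, size_t n)
{
	qsort(entries, n, sizeof(*entries), by_priority);
}
```

With IPSEC_IN at -500 it would run before CONNTRACK at NF_IP_PRE_ROUTING;
with the alternative value of -175 it would run after it.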
 247
 248
 249
 250 * IPSec path description:
 251
 252 Treat incoming IPSec encapsulation as a transport layer protocol and
 253 decapsulate it at the transport layer demultiplexer since it appears as a
 254 transport layer protocol from the bottom of the Internet Protocol network
 255 stack.  For outgoing, we treat IPSec as a network layer protocol since that is
 256 what IPSec appears to be from the top of the IP stack.
 257
  258 An incoming packet starts off with a sanity check.  It then goes through all
  259 the NF_IP_PRE_ROUTING hooks, starting with the SPDB check, which would have
  260 several possible targets: DROP; REJECT; ACCEPT; PEEK.  DROP, REJECT and ACCEPT
  261 are standard NetFilter targets.  DROP applies if the packet should have been
  262 encrypted.  REJECT is a special case of DROP where an ICMP is returned.  ACCEPT
  263 applies if (a) it was an encrypted IPSec packet bound for this machine and no
  264 other policy was expected first, (b) it had already been decrypted by the
  265 expected SAs, as indicated by nfmarks or virtual IPSec device, or (c) there was
  266 policy to allow it through.  PEEK would let the KMd look at the packet to see
  267 if it needed to start thinking about opportunistic encryption and then pass it
  268 on.  A fresh ESP or AH packet will not have any nfmarks or virtual IPSec device
  269 association and, unless that outer IP header should have been processed by
  270 another SG in between, no policy will have been required, letting it through.
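
One possible reading of the incoming SPDB check, as a userspace sketch.  The
struct fields, their names and the order of the tests are my assumptions for
illustration; the text above does not fix them.

```c
#include <stdbool.h>

enum spdb_in_verdict { SPDB_DROP, SPDB_REJECT, SPDB_ACCEPT, SPDB_PEEK };

/* What the check knows about a packet (all fields are hypothetical). */
struct in_pkt {
	bool is_ipsec;			/* outer protocol is ESP or AH */
	bool for_us;			/* addressed to this machine */
	bool marked;			/* nfmark/virtual device records decapsulation */
	bool marks_match;		/* ...by the SAs that policy expected */
	bool policy_requires_ipsec;	/* this traffic must arrive protected */
	bool policy_reject;		/* answer with an ICMP instead of silence */
	bool opportunistic;		/* the KMd wants to see clear traffic */
};

enum spdb_in_verdict spdb_in_check(const struct in_pkt *p)
{
	/* Fresh ESP/AH for this host, no other policy expected first. */
	if (p->is_ipsec && p->for_us && !p->policy_requires_ipsec)
		return SPDB_ACCEPT;
	/* Already decapsulated by the expected SAs. */
	if (p->marked && p->marks_match)
		return SPDB_ACCEPT;
	/* Should have been protected but was not. */
	if (p->policy_requires_ipsec)
		return p->policy_reject ? SPDB_REJECT : SPDB_DROP;
	/* Clear traffic that policy allows: optionally show it to the KMd. */
	return p->opportunistic ? SPDB_PEEK : SPDB_ACCEPT;
}
```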
 271
 272 The rest of the NF_IP_PRE_ROUTING hooks may cause it to be DNATed and
 273 defragmented.  It then goes through routing which thinks it is a local packet,
 274 deals with any outer header IP options, then defragmentation and
 275 NF_IP_LOCAL_IN filter (allow ESP,AH) before getting to ipsec_rcv() where the
 276 outer bundle is authenticated and decrypted and nfmarked or associated with a
  277 virtual IPSec device to indicate what decapsulation happened before being
 278 passed back to netif_rx().  The next IP header is now visible.  The packet now
 279 gets re-injected at the beginning.  It goes through the incoming sanity check
 280 again, getting checked at NF_IP_PRE_ROUTING for policy using previously set
 281 nfmark or virtual IPSec device from decryption.  It may again be DNATed and
 282 defragmented.  Routing looks at the now-visible next IP header and routes it
 283 locally or via the forward hook.
 284
 285 If it is a local packet, IP options and defragmentation are processed.
 286 NF_IP_LOCAL_IN then gets to check filtering policy for other transport layer
 287 protocols.  If it is the endpoint for nested bundles, it is sent back to
 288 netif_rx(), having exposed the next IP header.
 289
 290 If it is not a local packet, routing has selected a route, potentially through
 291 an existing virtual IPSec device, one per connection, not per physical I/F.
 292 IP options and TTL are processed before being filtered at NF_IP_FORWARD,
 293 fragmented, then sent to NF_IP_POST_ROUTING.
 294
 295 If it is a locally generated packet, it would go through normal filtering at
 296 NF_IP_LOCAL_OUT, then go through routing, then be sent to NF_IP_POST_ROUTING.
 297
 298 At NF_IP_POST_ROUTING, the ipsec table would make a decision about the fate of
 299 the packet.  It would have several possible targets: ACCEPT; IPSEC SAList;
 300 DROP; REJECT; TRAP; HOLD.  ACCEPT would allow the packet through with no
 301 processing.  IPSEC would return NF_STOLEN, stealing the packet and applying
  302 the policy specified by its parameter, an SA list.  If the SA(s) did not exist
  303 or if the TRAP target was specified, it would send up an ACQUIRE to all
  304 listening key management daemons via PF_KEYv2 and put in a HOLD that would
  305 keep only the last packet that matched for that HOLD, waiting for the
  306 appropriate SA(s).  If or once the SA(s) were available, it would apply IPSec
  307 processing to the packet, re-inject it at NF_IP_LOCAL_OUT (since the packet
  308 now appears to originate from this host) and set nfmark or associate it with
  309 a virtual IPSec device to indicate what processing happened.  The packet would
  310 then be routed and sent back to NF_IP_POST_ROUTING.  If the policy lookup there
  311 did not call for a different IPSec remote security gateway, the ipsec table
 312 would ACCEPT it.  DROP would drop the packet if previous attempts to do
 313 opportunistic encryption failed and the default policy was to block non-IPSec
 314 packets.  REJECT would be almost the same as DROP, except that it returns
 315 ICMPs.  ACCEPT, DROP, REJECT are standard NetFilter targets.
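
The HOLD behaviour described above, keeping only the last packet that
matched, could look roughly like this in userspace.  struct hold, struct
held_pkt and both functions are hypothetical names for the sketch; a kernel
version would hold an sk_buff rather than a copied byte buffer.

```c
#include <stdlib.h>
#include <string.h>

/* A held packet: just a byte buffer in this sketch. */
struct held_pkt {
	size_t len;
	unsigned char *data;
};

/* One HOLD entry: at most one packet waits for the SA(s). */
struct hold {
	struct held_pkt *last;
};

/*
 * Replace whatever was held with a copy of the new packet.
 * Returns 0 on success, -1 if allocation fails (old packet is kept).
 */
int hold_packet(struct hold *h, const unsigned char *data, size_t len)
{
	struct held_pkt *p = malloc(sizeof(*p));

	if (!p)
		return -1;
	p->data = malloc(len ? len : 1);
	if (!p->data) {
		free(p);
		return -1;
	}
	if (len)
		memcpy(p->data, data, len);
	p->len = len;

	/* Only the last matching packet is kept; drop the previous one. */
	if (h->last) {
		free(h->last->data);
		free(h->last);
	}
	h->last = p;
	return 0;
}

/* Called once the SA(s) exist: hand back the held packet, if any. */
struct held_pkt *hold_release(struct hold *h)
{
	struct held_pkt *p = h->last;

	h->last = NULL;
	return p;
}
```

The caller of hold_release() owns the returned packet and frees it after
IPSec processing.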
 316
 317 A packet routed through an optional IPSec virtual I/F simply gets assigned a
 318 specific source address, and has the nfmark/SA list preloaded.
 319
 320
 321
 322 * Other issues:
 323
 324 The way that nfmark is used is rather vague.  It is presently only 32 bits.
 325 Ideally, I would like to be able to indicate exactly which SAs were processed
 326 on the way in, which would most easily be represented by as many as 4 SAs (AH,
  327 ESP, IPCOMP, IPIP), each having an 8-bit protocol field (absolute minimum of
  328 2 bits), a 32-bit destination address field (for IPv4; IPv6 would be 128) and
  329 a 32-bit SPI.  With IPv6 addresses this is a potential maximum of 4 x 168 =
  330 672 bits, which would have to be mapped on to the 32 bits available.  A lookup
 331 table could be used to map nfmarks to SAIDs, not the SAs themselves, since the
 332 SAs could disappear at any time the SADB is not locked.  It should be
 333 able to represent a bundle of SAs where one SA could be used in more than one
 334 bundle.  There could also be more than one right answer for the incoming SPDB.
 335 I have an idea how to accomplish this by changing/extending nfmark by
 336 converting it to a list of nfmark structures that contain a pointer to the
 337 next item on the list, a cookie for the specific netfilter function that owns
 338 the data and a pointer to a data structure.
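
A sketch of both the size arithmetic and the list idea from the paragraph
above.  The names, and the choice of a 32-bit cookie, are assumptions for the
example, not a proposed kernel interface.

```c
#include <stddef.h>
#include <stdint.h>

/* Field widths from the paragraph above, in bits. */
enum { PROTO_BITS = 8, SPI_BITS = 32, MAX_SAS = 4 };

/* Bits needed to name up to MAX_SAS SAs for a given address width. */
unsigned int said_bundle_bits(unsigned int addr_bits)
{
	return MAX_SAS * (PROTO_BITS + addr_bits + SPI_BITS);
}

/*
 * One node of the extended nfmark list: a link to the next item, a
 * cookie naming the netfilter function that owns the data, and a
 * pointer to that data.
 */
struct nfmark_item {
	struct nfmark_item *next;
	uint32_t cookie;	/* width is an assumption */
	void *data;
};

/* Return the data stored under the given cookie, or NULL if absent. */
void *nfmark_find(const struct nfmark_item *head, uint32_t cookie)
{
	for (; head; head = head->next)
		if (head->cookie == cookie)
			return head->data;
	return NULL;
}
```

said_bundle_bits(32) gives 288 for IPv4 and said_bundle_bits(128) gives 672
for IPv6; either is far more than a 32-bit nfmark can carry directly.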
 339
 340 nfmark may not be the right tool for this.  Another possible solution is to
 341 add a member to the struct sk_buff to point to this information.  This has the
 342 benefit of not depending on anyone else, but the drawback of needing to patch
 343 a header file *and recompiling the entire kernel*.  There is also the
 344 possibility of using the NetFilter Connection Tracking facility.
 345
 346 The SADB would be managed via the PF_KEYv2 socket I/F.
 347
 348 The SPDB would be managed via a combination of PF_KEYv2 socket I/F extensions
 349 and iptables.  A separate NetFilter table called 'ipsec' (as opposed to
 350 'filter', 'nat' or 'mangle') would have the first hook at NF_IP_PRE_ROUTING
  351 and the last hook at NF_IP_POST_ROUTING.  iptables(8) currently uses the
  352 getsockopt(2)/setsockopt(2) system calls, but there is discussion of having it
 353 converted to use the AF_NETLINK socket family.  Having it use a PF_POLICY
 354 interface that was interoperable with multiple platforms would be a big win.
 355
 356 For matches, we have source/destination address/mask/port, userid (owner) and
 357 security label.  source/destination address/mask/port and userid are already
 358 supplied by iptables.  We only need to supply security label, if we even think
 359 we need it.
 360
 361
 362 For targets, how do we do this?  SPDB with policies, or name specific SAs, SA
 363 chains or SA lists or even virtual IPSec devices.  Currently, we name specific
 364 SAs which are chained exclusively together.
 365
 366 We could have a target iptables library and kernel module that has a target of
 367 IPSEC which:
 368         1. names a specific SA/chain which are unsharable
 369         2. lists SAs to apply which are sharable
 370         3. names a specific virtual IPSec device which implies a list of SAs
  371         4. specifies required policy, i.e.: cipher, hash, shared?, remote gw
  372                 (PFS should not be an option)
 373
 374 I am favouring option number 2.  Option number 3 would map to number 2 in the
 375 SPDB.
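
For option 2, the data carried by an IPSEC target could be as simple as a
short list of SAIDs.  The sketch below is userspace C with hypothetical
names; it assumes IPv4 addresses and the limit of four SAs (AH, ESP, IPCOMP,
IPIP) mentioned earlier.  Because a rule names SAs by SAID rather than
linking them together, the same SA can appear in more than one list.

```c
#include <stdbool.h>
#include <stddef.h>
#include <stdint.h>

/* An SA identifier: protocol, destination address (IPv4 here) and SPI. */
struct said {
	uint8_t  proto;
	uint32_t daddr;
	uint32_t spi;
};

#define IPSEC_TARGET_MAX_SAS 4	/* AH, ESP, IPCOMP, IPIP */

/* What an IPSEC target rule might carry under option 2. */
struct ipsec_target {
	size_t nsas;
	struct said sas[IPSEC_TARGET_MAX_SAS];
};

static bool said_equal(const struct said *a, const struct said *b)
{
	return a->proto == b->proto && a->daddr == b->daddr &&
	       a->spi == b->spi;
}

/* True if the target's list names the given SA. */
bool ipsec_target_uses(const struct ipsec_target *t, const struct said *s)
{
	for (size_t i = 0; i < t->nsas; i++)
		if (said_equal(&t->sas[i], s))
			return true;
	return false;
}
```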
 376
 377 SAs would still be stored in a SADB hash table.  The prev and next fields of
 378 struct ipsec_sa would be removed if SAs were no longer chained, but were
 379 listed in lists, either from a direct list, virtual IPSec device or from an
 380 SPDB.
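
A minimal userspace sketch of an SADB hash table keyed by SAID, with the
per-SA refcount from the feature list.  The names, the bucket count and the
hash function are arbitrary choices for the example and are not taken from
the KLIPS sources.  Locking is left out: in the kernel the table would be
locked for additions and deletions and the refcount would be atomic.

```c
#include <stddef.h>
#include <stdint.h>

#define SADB_HASHMOD 257	/* bucket count chosen for this sketch */

struct sa {
	uint8_t  proto;
	uint32_t daddr;
	uint32_t spi;
	unsigned int refcount;	/* table, rule lists, devices, SPDB */
	struct sa *hnext;	/* hash-bucket link only, not a bundle chain */
};

struct sadb {
	struct sa *bucket[SADB_HASHMOD];
};

static unsigned int sadb_hash(uint8_t proto, uint32_t daddr, uint32_t spi)
{
	return (spi ^ daddr ^ proto) % SADB_HASHMOD;
}

/* Insert an SA; the table itself holds the first reference. */
void sadb_add(struct sadb *db, struct sa *s)
{
	unsigned int h = sadb_hash(s->proto, s->daddr, s->spi);

	s->refcount = 1;
	s->hnext = db->bucket[h];
	db->bucket[h] = s;
}

/* Look up by SAID; on success take a reference for the caller. */
struct sa *sadb_get(struct sadb *db, uint8_t proto, uint32_t daddr,
		    uint32_t spi)
{
	struct sa *s = db->bucket[sadb_hash(proto, daddr, spi)];

	for (; s; s = s->hnext)
		if (s->proto == proto && s->daddr == daddr && s->spi == spi) {
			s->refcount++;
			return s;
		}
	return NULL;
}

/* Drop a reference taken by sadb_get(). */
void sadb_put(struct sa *s)
{
	s->refcount--;
}
```

Lists held by rules or virtual devices would store SAIDs and call sadb_get()
when they need the SA, so an SA can be shared between lists without any prev
or next fields of its own.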
 381
 382
 383 We could use one table for ipsec matching or maybe use one table for each of
 384 ipsec_in and ipsec_out.  It would have to use a table separate from filter,
  385 nat or mangle since we need an input NF_IP_PRE_ROUTING priority of less than
  386 -200 (CONNTRACK) and an output NF_IP_POST_ROUTING priority of more than 200
  387 (CONNTRACK?).  I suggest NF_IP_PRI_IPSEC_IN = -500 and
  388 NF_IP_PRI_IPSEC_OUT = 500.
 389
 390 There might be a security level iptables library and netfilter kernel
 391 module, if needed.
 392
 393 There would be "IPSEC", "TRAP" and "HOLD" iptables target libraries
 394 and netfilter target kernel modules.
 395
 396 Equivalent ipsec_rcv() functionality would be installed as an IP
 397 transport layer protocol handler and be included with the netfilter
 398 ipsec table module.
 399
 400 Equivalent ipsec_tunnel_start_xmit() functionality would be in the
 401 IPSEC netfilter target kernel module.
 402