# -*- mode: Outline; fill-column: 78; fill-prefix: " " -*-

# Richard Guy Briggs <rgb@conscoop.ottawa.on.ca>

# RCSID $Id: klips2-design.txt,v 1.18 2001/07/06 07:32:43 rgb Exp $

* Outline Commands cheat sheet (C-c C-s to see this)
  C-c C-t   Hide EVERYTHING in buffer
  C-c C-a   Show EVERYTHING in buffer

  C-c C-d   Hide THIS item and subitems (subtree)
  C-c C-s   Show THIS item and subitems (subtree)

Linux FreeS/WAN IPSec -- KLIPS2 DESIGN
======================================

# This document outlines the proposed design for KLIPS2, the second
# generation Linux FreeS/WAN IPSec kernel implementation.  It is
# accompanied by the following:

#   klips2-design.dia            dia(1) diagram
#   klips2-design-legend.txt     diagram legend
#   klips2-design-api.txt        API descriptions
#   klips2-design-api-trips.txt  scenarios that force trips through the
#                                various APIs to verify all needed resources

# This document is divided up into Introduction, Goals, Feature requests,
# Packet path overview, NetFilter overview, IPSec path description, and

# This document was originally written 2.5 weeks after OLS2000, inspired
# by a meeting with Rusty and Marc Boucher in Montreal in November 1999
# and two meetings at OLS2000.

# Please comment to the linux-ipsec, netfilter-devel or netdev lists.

# Current kernel version reference is 2.4.4 with iptables 1.2

To get rid of the current IPSec virtual interfaces that are associated with
specific physical network interfaces and replace them with IPSec
virtual interfaces that specify a local gateway address as a source
address, a remote gateway address as a destination address and a
specific tunnel policy or SAs.

To use existing packet matching engines rather than re-invent them.

To support more of the required selectors, especially source and
destination ports, and possibly userid.  Security labels are not of
obvious value, but the selectors will be easy to add in the future if
they are implemented in the Linux kernel.

To get access to *all* packets incoming and outgoing to enforce policy

To better support opportunistic encryption.

To take advantage of the parallelism of SMP and H/W encryption.

To make encryption and authentication modular.

The idea is to redesign KLIPS (kernel parts of FreeS/WAN) to avoid all the
'stoopid routing tricks' (TM) to which we have had to resort since the project
started, by disassociating any IPSec devices from physical devices, and to add
a proper SPDB to do proper incoming IPSec policy checks.  We are hoping to use
existing pattern-matching tools rather than invent our own.  NetFilter appears
to have all the pattern matching capabilities with the exception of security
labels, which Linux doesn't appear to have anyway, but may be limited in other

There is also significant interest in enabling FreeS/WAN to communicate with
routing daemons and to do load sharing and failover:

http://www.quintillion.com/fdis/moat/ipsec+routing/

Code=Level/Status/Priority

Code    Feature                 Implementation
====    =======                 ==============
S/U/U   changeable gw wild-side addresses on-the-fly
        - road warriors with RSA keys and hooks from DHCP
          to move to a new set of SAs upon expiry of previous
          DHCP lease.  Notify peers.  Negotiate new tunnels.
          Handle delayed or denied renewal; see: conn up, down,
M/U/U   address inertia for remote gw's with changeable wild-side addresses
        so local gw reboots will initiate reconnect to remotes.
        - this requires a disk cache.
        - save list of connections
        - save IKE phase 1 keys
        - save IPSEC SA keys (requires KLIPS mods)
T/U/U   mini-database of road warriors that persists across reboots.
        - this requires a disk cache.
M/U/U   connection up, down, wanted
S/U/U   routing below tunnel layer to support mobility and multi-homing
M/U/U   tunnel identified by subnets served?
M/U/U   why do equalizing schedulers not play well with tunnels?
M/U/U   decouple SA retrieval from DADDR (don't care how it arrived)
                                sysctl and ifdef for dstaddr
T/U/U   SPIs unique, independent of protocol and DADDR
                                sysctl and ifdef for protocol
S/U/U   routing above tunnel layer
S/U/U   granularity smaller than host
        - SPORT, DPORT, UID, SecLev
M/U/U   /dev/ipsecNNN devices that could be chown(1)ed and chmod(1)ed.
M/U/U   process to process tunnels
T/U/U   netfilter,pf_key,ioctl,/dev/ipsecNNN ways to manipulate tunnel perms.
S/U/U   KLIPS as a loadable module (isn't it already?)
S/U/U   stats: {number,time_of_last} packets {out,good_in,error_in}
S/U/U   integrate IPSec and firewall policy into Security Policy.
        (What APIs and user-level tools?)
S/U/U   full inbound policy checking
S/U/U   secure ciphers and hashes
T/U/U   kernel implementation (should be faster)
S/U/U   plays well with routing daemons
S/U/U   free of export restrictions
T/U/U   standard crypto api to add newer ciphers and hashes
T/U/U   SADB hash table will be locked for additions/deletions
T/U/U   use a refcount on each SA to increase locking granularity

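The last two rows of the table (a SADB hash table locked for additions and
deletions, plus a refcount on each SA) can be sketched roughly as follows.
This is a minimal single-threaded userspace model and not KLIPS code: the
names sa_sketch, sadb_sketch, sadb_add(), sadb_get(), sadb_del() and sa_put()
and the bucket count are all hypothetical, and the points where a kernel would
take the table lock are only marked in comments.

```c
#include <assert.h>
#include <stdatomic.h>
#include <stdint.h>
#include <stdlib.h>

#define SADB_BUCKETS 16            /* hypothetical table size */

struct sa_sketch {
    uint32_t spi;
    atomic_int refcount;           /* the table itself holds one reference */
    struct sa_sketch *hnext;       /* hash-bucket chaining only */
};

struct sadb_sketch {
    struct sa_sketch *bucket[SADB_BUCKETS];
};

unsigned sadb_hash(uint32_t spi) { return spi % SADB_BUCKETS; }

/* Addition: would run with the table lock held. */
struct sa_sketch *sadb_add(struct sadb_sketch *db, uint32_t spi)
{
    struct sa_sketch *sa = calloc(1, sizeof(*sa));
    if (sa == NULL)
        return NULL;
    sa->spi = spi;
    atomic_init(&sa->refcount, 1); /* reference owned by the table */
    sa->hnext = db->bucket[sadb_hash(spi)];
    db->bucket[sadb_hash(spi)] = sa;
    return sa;
}

/* Lookup: the caller gets its own reference and may then use the SA
 * without holding the table lock. */
struct sa_sketch *sadb_get(struct sadb_sketch *db, uint32_t spi)
{
    struct sa_sketch *sa;
    for (sa = db->bucket[sadb_hash(spi)]; sa != NULL; sa = sa->hnext) {
        if (sa->spi == spi) {
            atomic_fetch_add(&sa->refcount, 1);
            return sa;
        }
    }
    return NULL;
}

/* Drop one reference; frees the SA and returns 1 when it was the last. */
int sa_put(struct sa_sketch *sa)
{
    assert(sa != NULL);
    if (atomic_fetch_sub(&sa->refcount, 1) == 1) {
        free(sa);
        return 1;
    }
    return 0;
}

/* Deletion: would run with the table lock held.  The SA is unlinked at
 * once but only freed when the last user calls sa_put(). */
int sadb_del(struct sadb_sketch *db, uint32_t spi)
{
    struct sa_sketch **pp = &db->bucket[sadb_hash(spi)];
    for (; *pp != NULL; pp = &(*pp)->hnext) {
        if ((*pp)->spi == spi) {
            struct sa_sketch *sa = *pp;
            *pp = sa->hnext;
            sa_put(sa);            /* release the table's reference */
            return 0;
        }
    }
    return -1;
}
```

The intent of the split is that only additions and deletions need the table
lock, while a packet that already holds a reference keeps a valid SA even if
the SA is deleted from the table in the meantime.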
* Packet path overview:

The basic path through the kernel as it concerns IPSec for the three
types of packets is as follows:

    NF_IP_PRE_ROUTING hook
    ip-options processing
    transport layer demux

    NF_IP_PRE_ROUTING hook
    ip-options processing
    ttl decrement and check
    NF_IP_POST_ROUTING hook

    NF_IP_POST_ROUTING hook

* NetFilter overview:

The basic architecture of NetFilter is:

--->[1]--->(ROUTE)--->[3]--->[4]--->      where:
              |            ^              [1] NF_IP_PRE_ROUTING
              |            |              [2] NF_IP_LOCAL_IN
              |         (ROUTE)           [3] NF_IP_FORWARD
              v            |              [4] NF_IP_POST_ROUTING
             [2]          [5]             [5] NF_IP_LOCAL_OUT

Destination NAT (port forwarding) gets applied in NF_IP_PRE_ROUTING and
NF_IP_LOCAL_OUT at priority NF_IP_PRI_NAT_DST = -100, and Source NAT
(masquerading) gets applied in NF_IP_POST_ROUTING at priority
NF_IP_PRI_NAT_SRC = 100.  Filtering is applied in NF_IP_LOCAL_IN,
NF_IP_FORWARD and NF_IP_LOCAL_OUT at priority NF_IP_PRI_FILTER = 0.

Hook processing order would generally be:

PRE     IN      FWD     OUT     POST    PRIORITY MACRO        PRI
=======.=======.=======.=======.=======.=================== = ====
-500?  .       .       .       .       .NF_IP_PRI_IPSEC_IN  = -500 (?)
-200   .       .       .-200   .       .NF_IP_PRI_CONNTRACK = -200
-175?  . . . . . . . . . . . . . . . . .NF_IP_PRI_IPSEC_IN  = -175 (?)
-150   .       .       .-150   .       .NF_IP_PRI_MANGLE    = -150
-100   .       .       .-100   .       .NF_IP_PRI_NAT_DST   = -100
 . . . .0. . . .0. . . .0. . . . . . . .NF_IP_PRI_FILTER    = 0
       .       .       .       .100    .NF_IP_PRI_NAT_SRC   = 100
       .       .       .       .500    .NF_IP_PRI_IPSEC_OUT = 500
=======.=======.=======.=======.=======.=================== = ====

Not all modules are present at each hook.  I am still uncertain whether
IPSEC_IN should come before or after CONNTRACK.  Any comments?

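To make the ordering in the table concrete, here is a small userspace sketch
that sorts the modules attached to one hook by priority, lowest first, which
is the order in which NetFilter calls them.  The CONNTRACK, MANGLE, NAT and
FILTER values are the ones listed above; the two IPSEC values are only the
-500/500 proposal from this document, and struct hook_entry and sort_hooks()
are hypothetical names used for illustration.

```c
#include <assert.h>
#include <stdlib.h>
#include <string.h>

/* Priorities from the table above.  The IPSEC entries are proposals. */
enum {
    NF_IP_PRI_IPSEC_IN  = -500,    /* proposed, could also be -175 */
    NF_IP_PRI_CONNTRACK = -200,
    NF_IP_PRI_MANGLE    = -150,
    NF_IP_PRI_NAT_DST   = -100,
    NF_IP_PRI_FILTER    = 0,
    NF_IP_PRI_NAT_SRC   = 100,
    NF_IP_PRI_IPSEC_OUT = 500      /* proposed */
};

/* Hypothetical record of one module registered on a hook. */
struct hook_entry {
    const char *name;
    int priority;
};

static int hook_cmp(const void *a, const void *b)
{
    const struct hook_entry *x = a;
    const struct hook_entry *y = b;
    return (x->priority > y->priority) - (x->priority < y->priority);
}

/* Order the modules on one hook the way they would be traversed:
 * the numerically lowest priority runs first. */
void sort_hooks(struct hook_entry *hooks, size_t n)
{
    assert(hooks != NULL || n == 0);
    qsort(hooks, n, sizeof(*hooks), hook_cmp);
}
```

With IPSEC_IN at -500 the inbound policy check would see a packet before
connection tracking does; moving it to -175 would place it between CONNTRACK
and MANGLE instead, which is the open question noted above.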
* IPSec path description:

Treat incoming IPSec encapsulation as a transport layer protocol and
decapsulate it at the transport layer demultiplexer, since it appears as a
transport layer protocol from the bottom of the Internet Protocol network
stack.  For outgoing, we treat IPSec as a network layer protocol since that is
what IPSec appears to be from the top of the IP stack.

An incoming packet starts off with a sanity check.  It then goes through all
the NF_IP_PRE_ROUTING hooks, starting with the SPDB check.  That check would
have several possible targets: DROP; REJECT; ACCEPT; PEEK.  DROP, REJECT and
ACCEPT are standard NetFilter targets.  It would DROP if the packet should
have been encrypted.  REJECT is a special case of DROP where an ICMP is
returned.  It would ACCEPT if the packet was an encrypted IPSec packet bound
for this machine and no other policy was expected first, if it had already
been decrypted by the expected SAs as indicated by nfmarks or virtual IPSec
device, or if there was policy to allow it through.  PEEK would let the KMd
have a look at the packet to see if it needed to start thinking about
opportunistic encryption and then pass it on.  A fresh ESP or AH packet will
not have any nfmarks or virtual IPSec device association, so unless its outer
IP header should have been processed by another SG in between, no policy will
have been required and it is let through.

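The inbound decision described above could be sketched as a single function.
This is only one possible reading of the paragraph, under stated assumptions:
struct in_pkt_state and its four flags are hypothetical summaries of what the
SPDB lookup would have learned, and the order of the tests (required
protection first, then PEEK, then ACCEPT) is an assumption, not something the
design has settled.

```c
#include <assert.h>
#include <stdbool.h>
#include <stddef.h>

enum in_verdict { IN_ACCEPT, IN_DROP, IN_REJECT, IN_PEEK };

/* Hypothetical summary of the SPDB lookup result for one packet. */
struct in_pkt_state {
    bool protection_required;  /* policy says it must have arrived via IPSec */
    bool expected_sas_applied; /* nfmark or virtual device shows the
                                  expected SAs were already applied */
    bool answer_with_icmp;     /* policy prefers REJECT to a silent DROP */
    bool show_to_kmd;          /* the KMd may want to look at it for
                                  opportunistic encryption */
};

enum in_verdict spdb_in_check(const struct in_pkt_state *p)
{
    assert(p != NULL);

    /* Should have been encrypted but was not: DROP, or REJECT with ICMP. */
    if (p->protection_required && !p->expected_sas_applied)
        return p->answer_with_icmp ? IN_REJECT : IN_DROP;

    /* Let the key management daemon peek, then the packet passes on. */
    if (p->show_to_kmd)
        return IN_PEEK;

    /* Fresh ESP/AH with no outer policy, already-decapsulated traffic,
     * or traffic explicitly allowed by policy. */
    return IN_ACCEPT;
}
```

A fresh ESP or AH packet with no outer policy has every flag clear and is
accepted, so that it can reach ipsec_rcv(); after decapsulation the
re-injected inner packet is checked again with expected_sas_applied set.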
The rest of the NF_IP_PRE_ROUTING hooks may cause it to be DNATed and
defragmented.  It then goes through routing, which thinks it is a local
packet, deals with any outer header IP options, then defragmentation and the
NF_IP_LOCAL_IN filter (allow ESP,AH) before getting to ipsec_rcv(), where the
outer bundle is authenticated and decrypted, and nfmarked or associated with a
virtual IPSec device to indicate what decapsulation happened, before being
passed back to netif_rx().  The next IP header is now visible.  The packet now
gets re-injected at the beginning.  It goes through the incoming sanity check
again, getting checked at NF_IP_PRE_ROUTING for policy using the previously
set nfmark or virtual IPSec device from decryption.  It may again be DNATed
and defragmented.  Routing looks at the now-visible next IP header and routes
it locally or via the forward hook.

If it is a local packet, IP options and defragmentation are processed.
NF_IP_LOCAL_IN then gets to check filtering policy for other transport layer
protocols.  If it is the endpoint for nested bundles, it is sent back to
netif_rx(), having exposed the next IP header.

If it is not a local packet, routing has selected a route, potentially through
an existing virtual IPSec device, one per connection, not per physical I/F.
IP options and TTL are processed before being filtered at NF_IP_FORWARD,
fragmented, then sent to NF_IP_POST_ROUTING.

If it is a locally generated packet, it would go through normal filtering at
NF_IP_LOCAL_OUT, then go through routing, then be sent to NF_IP_POST_ROUTING.

At NF_IP_POST_ROUTING, the ipsec table would make a decision about the fate of
the packet.  It would have several possible targets: ACCEPT; IPSEC SAList;
DROP; REJECT; TRAP; HOLD.  ACCEPT would allow the packet through with no
processing.  IPSEC would return NF_STOLEN, stealing the packet and applying
the policy specified by its parameter of an SA list.  If the SA(s) do not
exist or if the TRAP target was specified, it would send up an ACQUIRE to
all listening key management daemons via PF_KEYv2 and put in a HOLD that would
keep only the last packet that matched for that HOLD, waiting for the
appropriate SA(s).  If or once the SA(s) are available, it then IPSec
processes the packet, re-injects the packet at NF_IP_LOCAL_OUT (since the
packet now appears to originate from this host) and sets nfmark or associates
it with a virtual IPSec device to indicate what processing happened.  The
packet would then be routed and sent back to NF_IP_POST_ROUTING.  If the IPSec
remote security gateway is not different upon policy lookup, the ipsec table
would ACCEPT it.  DROP would drop the packet if previous attempts to do
opportunistic encryption failed and the default policy was to block non-IPSec
packets.  REJECT would be almost the same as DROP, except that it returns
ICMPs.  ACCEPT, DROP and REJECT are standard NetFilter targets.

A packet routed through an optional IPSec virtual I/F simply gets assigned a
specific source address, and has the nfmark/SA list preloaded.

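The NF_IP_POST_ROUTING targets described above, and in particular the HOLD
that keeps only the last matching packet, can be sketched like this.  It is a
userspace model under assumptions: the packet is reduced to an integer id,
sending the PF_KEYv2 ACQUIRE is reduced to a counter, and out_target,
out_action, hold_slot and ipsec_out_check() are hypothetical names.

```c
#include <assert.h>
#include <stddef.h>

enum out_target { OUT_ACCEPT, OUT_IPSEC, OUT_DROP, OUT_REJECT,
                  OUT_TRAP, OUT_HOLD };
enum out_action { ACT_PASS, ACT_IPSEC_PROCESS, ACT_HELD,
                  ACT_DROPPED, ACT_REJECTED };

/* One HOLD entry: room for exactly one packet. */
struct hold_slot {
    int has_pkt;
    int pkt_id;                /* stand-in for the held packet */
    unsigned acquires_sent;    /* stand-in for ACQUIREs sent to the KMds */
};

/* Park a packet, replacing whatever was held before. */
static enum out_action hold_packet(struct hold_slot *hold, int pkt_id,
                                   int send_acquire)
{
    if (send_acquire)
        hold->acquires_sent++;
    hold->has_pkt = 1;
    hold->pkt_id = pkt_id;     /* only the last matching packet is kept */
    return ACT_HELD;
}

enum out_action ipsec_out_check(enum out_target target, int sas_available,
                                int pkt_id, struct hold_slot *hold)
{
    assert(hold != NULL);
    switch (target) {
    case OUT_ACCEPT:
        return ACT_PASS;               /* through with no processing */
    case OUT_DROP:
        return ACT_DROPPED;
    case OUT_REJECT:
        return ACT_REJECTED;           /* DROP plus an ICMP */
    case OUT_IPSEC:
        if (sas_available)
            return ACT_IPSEC_PROCESS;  /* steal, process, re-inject */
        return hold_packet(hold, pkt_id, 1);   /* missing SA acts as TRAP */
    case OUT_TRAP:
        return hold_packet(hold, pkt_id, 1);
    case OUT_HOLD:
        return hold_packet(hold, pkt_id, 0);   /* ACQUIRE already sent */
    }
    return ACT_DROPPED;
}
```

Keeping a single slot per HOLD bounds the memory a stalled negotiation can
consume, at the cost of dropping all but the most recent packet while the SAs
are being negotiated.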
The way that nfmark is used is rather vague.  It is presently only 32 bits.
Ideally, I would like to be able to indicate exactly which SAs were processed
on the way in, which would most easily be represented by as many as 4 SAs (AH,
ESP, IPCOMP, IPIP), each having an 8-bit protocol field (absolute minimum of
2 bits), a 32-bit destination address field (for IPv4; IPv6 would be 128) and
a 32-bit SPI.  This is a potential maximum of 672 bits, i.e. 4 x (8 + 128 +
32) with IPv6 addresses (288 bits with IPv4).  A way of mapping 672 bits on to
the 32 bits available would be required to use this.  A lookup table could be
used to map nfmarks to SAIDs, not the SAs themselves, since the SAs could
disappear at any time the SADB is not locked.  The mapping should be able to
represent a bundle of SAs where one SA could be used in more than one bundle.
There could also be more than one right answer for the incoming SPDB.  I have
an idea how to accomplish this by changing/extending nfmark, converting it to
a list of nfmark structures that each contain a pointer to the next item on
the list, a cookie for the specific netfilter function that owns the data and
a pointer to a data structure.

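The extended nfmark idea (a list of items, each with a next pointer, an owner
cookie and a data pointer) might look roughly like the following.  Nothing
here exists in the kernel: struct nfmark_item, struct said_v4,
nfmark_find() and the IPSEC_COOKIE value are hypothetical, and the static
assertions only restate the bit arithmetic from the paragraph above.

```c
#include <assert.h>
#include <stddef.h>
#include <stdint.h>

/* Field widths of one SAID, in bits, as counted in the text above. */
#define SAID_BITS_V4        (8 + 32 + 32)    /* proto + IPv4 daddr + SPI */
#define SAID_BITS_V6        (8 + 128 + 32)   /* proto + IPv6 daddr + SPI */
#define MAX_SAS_PER_PACKET  4                /* AH, ESP, IPCOMP, IPIP */

static_assert(MAX_SAS_PER_PACKET * SAID_BITS_V6 == 672, "IPv6 worst case");
static_assert(MAX_SAS_PER_PACKET * SAID_BITS_V4 == 288, "IPv4 worst case");

/* An SAID names an SA without pointing at it, so it stays meaningful
 * even if the SA itself disappears from the SADB. */
struct said_v4 {
    uint8_t  proto;            /* e.g. 50 for ESP, 51 for AH */
    uint32_t daddr;
    uint32_t spi;
};

/* One item of the proposed nfmark list. */
struct nfmark_item {
    struct nfmark_item *next;  /* next item on the list */
    uint32_t owner_cookie;     /* which netfilter function owns the data */
    void *data;                /* owner-specific data structure */
};

#define IPSEC_COOKIE 0x49505345u   /* hypothetical value, ASCII "IPSE" */

/* Return the data belonging to the given owner, or NULL if that owner
 * has attached nothing to this packet. */
void *nfmark_find(const struct nfmark_item *head, uint32_t cookie)
{
    for (; head != NULL; head = head->next) {
        if (head->owner_cookie == cookie)
            return head->data;
    }
    return NULL;
}
```

Each owner only looks at items carrying its own cookie, so IPSec could hang an
array of SAIDs off the packet without disturbing other users of the mark.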
nfmark may not be the right tool for this.  Another possible solution is to
add a member to the struct sk_buff to point to this information.  This has the
benefit of not depending on anyone else, but the drawback of needing to patch
a header file *and recompile the entire kernel*.  There is also the
possibility of using the NetFilter Connection Tracking facility.

The SADB would be managed via the PF_KEYv2 socket I/F.

The SPDB would be managed via a combination of PF_KEYv2 socket I/F extensions
and iptables.  A separate NetFilter table called 'ipsec' (as opposed to
'filter', 'nat' or 'mangle') would have the first hook at NF_IP_PRE_ROUTING
and the last hook at NF_IP_POST_ROUTING.  iptables(8) currently uses the
getsockopt(2)/setsockopt(2) system calls, but there is discussion of having it
converted to use the AF_NETLINK socket family.  Having it use a PF_POLICY
interface that was interoperable with multiple platforms would be a big win.

For matches, we have source/destination address/mask/port, userid (owner) and
security label.  Source/destination address/mask/port and userid are already
supplied by iptables.  We only need to supply security label, if we even think

For targets, how do we do this?  SPDB with policies, or name specific SAs, SA
chains or SA lists or even virtual IPSec devices?  Currently, we name specific
SAs which are chained exclusively together.

We could have a target iptables library and kernel module that has a target of

1. names a specific SA/chain which are unsharable
2. lists SAs to apply which are sharable
3. names a specific virtual IPSec device which implies a list of SAs
4. spec's req'd policy, ie.: cipher, hash, shared?, remote gw
   (pfs should not be an option)

I am favouring option number 2.  Option number 3 would map to number 2 in the

SAs would still be stored in a SADB hash table.  The prev and next fields of
struct ipsec_sa would be removed if SAs were no longer chained, but were
listed in lists, either from a direct list, virtual IPSec device or from an

We could use one table for ipsec matching or maybe use one table for each of
ipsec_in and ipsec_out.  It would have to use a table separate from filter,
nat or mangle since we need an input NF_IP_PRE_ROUTING priority of less than
-200 (CONNTRACK) and an output NF_IP_POST_ROUTING priority of more than 200
(CONNTRACK?).  I suggest NF_IP_PRI_IPSEC_IN = -500 and
NF_IP_PRI_IPSEC_OUT = 500.

There might be a security level iptables library and netfilter kernel

There would be "IPSEC", "TRAP" and "HOLD" iptables target libraries
and netfilter target kernel modules.

Equivalent ipsec_rcv() functionality would be installed as an IP
transport layer protocol handler and be included with the netfilter

Equivalent ipsec_tunnel_start_xmit() functionality would be in the
IPSEC netfilter target kernel module.