Re: (PM) Radius (fwd)

Karl Denninger (karl@Mcs.Net)
Thu, 6 Nov 1997 17:00:11 -0600

On Thu, Nov 06, 1997 at 02:47:15PM -0800, MegaZone wrote:
> Once upon a time Karl Denninger shaped the electrons to say...
> >1) 100% reliable accounting information (ie: Start and Stop records are
> > ALWAYS delivered). This does work well enough to be functional. We
> > haven't seen trouble here for non-MCPPP calls, and with 3.7.2, it
> > appears that MCPPP calls are fixed as well.
> >
> >2) ABSOLUTE notification when a box is powered on or restarted.
> > Livingston sends a Radius accounting log entry on a crash or
> > console restart, but NOT for a cold power-up. You MUST have
> > this to remove all entries in the table for a given system on a
> > reset.
>
> This is not enough.
>
> RADIUS - the protocol - cannot do it 100% reliably. Period.

Yep. But you can get close enough for it to be completely functional,
particularly in an analog environment.

> RADIUS is set up to make sure data gets there - but not in any particular
> order or over any period of time.
>
> It is possible, and I have seen it, to get a Start for session 1, a
> start for session 2, and THEN the stop for session 1 - when the
> user logged out and logged back in.

Yes. You can also get a Start and *NO* corresponding STOP - ever (ie: the
box is powered off and back on without warning). You send a "00000000"
session record on a deliberate reboot, and I assume on a crash too if the
box is sane enough at that point, but you screw it up in one significant
way - you send it on the RESET, not on the *RESTART*.

You want to send that on the RESTART, once communication is re-established.
I understand the likely reason(s) why you don't do it that way, but without
a solid indication *coming out of* a reset condition, prior to accepting any
sessions, there's no way to *know* that the database is consistent.

This is a problem that Livingston should address.
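
To make the point concrete, here is a minimal sketch of what the accounting
back-end could do with a reliable "I just restarted" record. The SessionTable
class, its method names and the NAS names are all hypothetical, and the table
lives in memory only for illustration; this is not Livingston's or anyone
else's actual implementation.

```python
from collections import defaultdict


class SessionTable:
    """Hypothetical in-memory table of sessions that have no Stop yet."""

    def __init__(self):
        # NAS name -> {port number: user name}
        self.active = defaultdict(dict)

    def start(self, nas, port, user):
        self.active[nas][port] = user

    def stop(self, nas, port):
        self.active[nas].pop(port, None)

    def nas_restarted(self, nas):
        """Drop every session recorded for a NAS that has just come back up."""
        dropped = len(self.active[nas])
        self.active.pop(nas, None)
        return dropped


table = SessionTable()
table.start("pm1", 3, "alice")
table.start("pm1", 4, "bob")
table.start("pm2", 0, "carol")

# pm1 is power-cycled: no Stop records ever arrive for alice or bob.
# A restart record sent once communication is back lets us clean up.
dropped = table.nas_restarted("pm1")
```

Without that record the two pm1 entries would stay open indefinitely, which
is exactly the inconsistency described above.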

> This is especially likely with ISDN.
> A user can disconnect and reconnect within one second. The timeout for one
> try in RADIUS is 3 seconds. If the first STOP packet gets delayed at all,
> you have a race condition.

Correct. This is a risk. But denying access for *two seconds* is not
generally a big deal. Verifying via SNMP on EVERY LOOKUP will take longer
than the window in which the race condition exists. Also, if you keep track
by *port*, not by user ID, then you're ok - the odds of the same port being
used again by the same person (assuming you config your trunk groups right)
are darn close to zero.
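
A rough sketch of why keying by port helps, under one reading of the above:
replay the out-of-order sequence (Start, Start, late Stop) against a table
keyed by user ID and against one keyed by (NAS, port). The record layout,
function names and port numbers are made up for illustration.

```python
def replay_by_user(records):
    """Track open sessions keyed by user ID only."""
    online = {}
    for kind, user, nas, port in records:
        if kind == "start":
            online[user] = (nas, port)
        else:
            online.pop(user, None)
    return online


def replay_by_port(records):
    """Track open sessions keyed by (NAS, port)."""
    ports = {}
    for kind, user, nas, port in records:
        if kind == "start":
            ports[(nas, port)] = user
        else:
            ports.pop((nas, port), None)
    return ports


records = [
    ("start", "joe", "pm1", 5),
    ("start", "joe", "pm1", 6),  # reconnect arrives before the old Stop
    ("stop", "joe", "pm1", 5),   # delayed Stop for the first session
]

by_user = replay_by_user(records)  # late Stop wipes joe's live session
by_port = replay_by_port(records)  # late Stop only closes port 5
```

Keyed by user, the late Stop leaves joe looking offline while he is still
connected on port 6; keyed by port, the table ends up correct as long as the
reconnect landed on a different port.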

For analog callers the risk is basically zero - it takes 20 seconds or so
to get the line reconnected.

For ISDN static address accounts the risk is very low as well - it generally
takes at least 3 seconds for routing to converge, even in an OSPF-active
environment.

The real risk is single-address dynamic ISDN accounts, where you really
*can* run into significant trouble if someone "bounces" a line. That's a
bad thing to do generally from a user's billing perspective anyway (on the
LEC side), and as a consequence it doesn't come up often.

If you have lossy links between the communications device and database then
your accounting data is in trouble to begin with, and you need REALLY GOOD
filtering on the back-end to make sure that you don't screw up and bill
someone twice or lose records entirely. For that matter, your RADIUS
authentication is going to be erratic in performance as well.

My answer to that is "fix your frigging network", but then again I design
for an intended worst-case loss rate of 0.1% in real-world operation.

> RADIUS was *deliberately* designed this way, it was never meant to be
> used as a resource allocation protocol.
>
> This is why another protocol - like SNMP - is used. Let's use a simple
> case where the limit is one login per user.
>
> Request comes in - RADIUS checks database.
>
> 1 - user is not shown as logged in in database. ie, STOPs have been
> received for all Starts. ACK

Yep.

> 2 - user is shown as logged in in database on NAS X. Now you use SNMP to
> query NAS X to be *sure* the user is still on.
>
> a - SNMP reports user is not on. Close the open entry and flag it
> as a special closure. ACK
>
> b - SNMP reports user is on. Leave current entry alone. NAK
>
> This is the simplest example.

Yes, and now when a box "disappears" (ie: power cycle) you get a *boatload*
of SNMP queries against it, all in real time, and possibly all in rapid
succession. This is very, very bad for the CPU load on the box so queried.

SNMP is a pig. High-volume queries against a system are considered, at
least around here, to be a no-no.

Verification is a good thing *IF* you make the risk of needing to use it
extremely low. Otherwise you're asking to bury the machine in question
for legitimate users during a convergence situation (precisely the time
that it probably doesn't have spare CPU to waste on something like this).
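
As a rough illustration of the load argument, here is a sketch of the
one-login-per-user check quoted above, with the SNMP query replaced by a
hypothetical callable (snmp_user_on) so that the number of queries can be
counted. The figure of 48 users and every name in the code are assumptions
made for the example, not measurements or a real API.

```python
def check_login(user, sessions, snmp_user_on):
    """Return (allowed, used_snmp) for a one-login-per-user limit.

    sessions: dict mapping user -> NAS name for sessions with no Stop yet.
    snmp_user_on: hypothetical callable (nas, user) -> bool standing in
    for a real SNMP query against the NAS.
    """
    nas = sessions.get(user)
    if nas is None:
        return True, False   # case 1: no open session, ACK
    if snmp_user_on(nas, user):
        return False, True   # case 2b: user is still on, NAK
    del sessions[user]       # case 2a: stale entry, close it, ACK
    return True, True


def nobody_on(nas, user):
    """Stand-in for SNMP against a freshly rebooted box: nobody is on."""
    return False


# A NAS with 48 users is power-cycled, so no Stop records ever arrive.
stale = {f"user{i}": "pm1" for i in range(48)}

queries = 0
for user in list(stale):
    allowed, used_snmp = check_login(user, stale, nobody_on)
    queries += used_snmp

# If the table had been purged on restart, the same login needs no query.
purged = {}
allowed_purged, used_snmp_purged = check_login("user0", purged, nobody_on)
```

In this sketch every returning user costs one query against the box that
just came back up, while a table that was cleared on restart costs none.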

-- 
Karl Denninger (karl@MCS.Net)| MCSNet - Serving Chicagoland and Wisconsin
http://www.mcs.net/~karl     | T1's from $600 monthly to FULL DS-3 Service
			     | NEW! K56Flex modem support is now available
Voice: [+1 312 803-MCS1 x219]| 56kbps DIGITAL ISDN DOV on analog lines!
Fax:   [+1 312 803-4929]     | 2 FULL DS-3 Internet links; 400Mbps B/W Internal