Project · MATH 211 portfolio
Statistical Analysis of Network Intrusion Detection
An end-to-end R Markdown analysis of the UNSW-NB15 dataset — probability, hypothesis testing, regression, and a propositional-logic detection signature.
Statistics courses usually live inside toy datasets — coin flips, heights, standardized test scores. For my MATH 211 portfolio project I wanted the opposite: a real cybersecurity dataset, real skew, real noise, and a real conclusion about what separates attack traffic from normal traffic. The result is an end-to-end R Markdown analysis of the UNSW-NB15 network intrusion detection dataset that walks every Course Learning Outcome of the class — from probability through regression — then closes the loop with a formal logical argument for a root-cause indicator.
The dataset
The UNSW-NB15 dataset was developed at the University of New South Wales by Moustafa and Slay (2015) as a modern benchmark for network anomaly detection. It contains roughly 2.54 million labeled connection records across 49 packet- and flow-level features, with each record tagged as either normal traffic or one of nine attack categories — Fuzzers, Analysis, Backdoors, DoS, Exploits, Generic, Reconnaissance, Shellcode, and Worms.
I worked from a stratified random sample of n = 15,000 records. The downsampling is deliberate: with 2.5 million rows, every hypothesis test rejects on sample size alone. Shrinking the sample gives the confidence intervals meaningful width and forces the tests to reflect practical significance, not just mechanical power.
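The stratified draw is straightforward to sketch. A minimal Python illustration (the project itself uses R; the `label_of` accessor and the toy 90/10 label split are hypothetical):

```python
import random
from collections import defaultdict

def stratified_sample(records, label_of, n_total, seed=211):
    """Draw a random sample that preserves each label's share of the data."""
    rng = random.Random(seed)
    strata = defaultdict(list)
    for r in records:
        strata[label_of(r)].append(r)
    total = len(records)
    sample = []
    for rows in strata.values():
        k = round(n_total * len(rows) / total)  # proportional allocation
        sample.extend(rng.sample(rows, min(k, len(rows))))
    return sample

# Toy traffic: 90% normal, 10% attack.
records = [{"label": "normal"}] * 900 + [{"label": "attack"}] * 100
s = stratified_sample(records, lambda r: r["label"], 100)
print(len(s))  # 100 rows, about 10 of them attacks
```

Proportional allocation keeps the rare attack classes represented at their population rate, which matters later when conditional probabilities are computed per class.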
Probability on real traffic
The first analytical section computes marginal, joint, and conditional probabilities over two events: A = “connection uses TCP” and B = “connection is an attack.” The joint probability table decomposes the sample into TCP/non-TCP × attack/normal cells, and the conditional P(B | A) is compared to the marginal P(B) as a quick independence check. The gap between the two is large enough to motivate a formal chi-square test later. I extend to a three-event scenario by adding C = “service is HTTP” and applying the multiplication rule to verify consistency.
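The independence check reduces to a few ratios over the 2×2 table. A Python sketch of that arithmetic (the analysis itself is in R Markdown, and these cell counts are illustrative, not the actual UNSW-NB15 numbers):

```python
# Hypothetical 2x2 counts: rows = TCP / non-TCP, columns = attack / normal.
tcp_attack, tcp_normal = 3200, 6800
oth_attack, oth_normal = 800, 4200
n = tcp_attack + tcp_normal + oth_attack + oth_normal

p_B = (tcp_attack + oth_attack) / n   # marginal P(attack)
p_A = (tcp_attack + tcp_normal) / n   # marginal P(TCP)
p_AB = tcp_attack / n                 # joint P(TCP and attack)
p_B_given_A = p_AB / p_A              # conditional P(attack | TCP)

# Under independence, P(attack | TCP) would equal P(attack).
print(round(p_B, 3), round(p_B_given_A, 3))
```

A visible gap between the two printed values is exactly what motivates the formal chi-square test in the inference section.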
Fitting distributions
With the marginal structure established, the analysis fits three canonical probability distributions to three different features:
- Poisson to `ct_srv_src` (connections from the same source to the same service in the last 100 connections) for normal traffic — a count variable where E[X] = Var(X) = λ.
- Exponential to `dur` (connection duration in seconds) — a right-skewed continuous variable that captures the memoryless arrival pattern typical of benign sessions.
- Normal to `log(sbytes + 1)` — a log transform was required because raw source bytes are heavily right-skewed across many orders of magnitude.
Each fit reports MLE parameter estimates, compares theoretical to empirical moments, and uses the fitted parameters to compute specific point and tail probabilities.
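All three MLEs have closed forms, which makes the fitting step short. A Python sketch with stand-in data (the real fits are in R; these toy vectors only illustrate the estimators):

```python
import math
from statistics import mean, pvariance

# Illustrative stand-ins for the three features, not the real dataset.
counts = [0, 1, 2, 1, 3, 0, 2, 1, 1, 4]           # stand-in for ct_srv_src
durations = [0.2, 1.5, 0.1, 3.4, 0.8, 0.05, 2.2]  # stand-in for dur (seconds)
sbytes = [120, 4500, 980, 56000, 310, 2400]       # stand-in for source bytes

# Poisson MLE: lambda-hat is the sample mean (and should be near Var(X)).
lam = mean(counts)

# Exponential MLE: rate-hat is 1 / sample mean; tail P(X > t) = exp(-rate * t).
rate = 1 / mean(durations)
p_tail = math.exp(-rate * 2.0)  # P(duration > 2 s) under the fitted model

# Normal fit on the log scale: MLE mean and sd of log(sbytes + 1).
logs = [math.log(s + 1) for s in sbytes]
mu, sigma = mean(logs), math.sqrt(pvariance(logs))

print(round(lam, 2), round(p_tail, 3), round(mu, 2), round(sigma, 2))
```

Comparing `lam` to the sample variance of `counts` is the quick equidispersion check the Poisson fit relies on.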
Confidence intervals
Three intervals anchor the inference section: a t-interval for the mean connection duration of normal traffic, a z-interval for the proportion of TCP connections that are malicious, and a Welch two-sample interval for the difference in mean source bytes between attack and normal traffic. The third one is where the dataset starts to earn its keep — the interval for μ_attack − μ_normal does not contain zero, which already hints at what the hypothesis tests will confirm.
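The Welch interval is the workhorse of that third result. A minimal Python sketch (the project computes this in R; the data are toy log-scale byte values, and `t_crit = 2.0` stands in for the exact t quantile at 95% confidence):

```python
import math
from statistics import mean, variance

def welch_ci(x, y, t_crit=2.0):
    """Approximate CI for mean(x) - mean(y) without assuming equal variances."""
    se = math.sqrt(variance(x) / len(x) + variance(y) / len(y))
    d = mean(x) - mean(y)
    return d - t_crit * se, d + t_crit * se

# Toy log(sbytes + 1) samples standing in for attack vs. normal traffic.
attack = [8.1, 9.4, 7.7, 10.2, 8.8, 9.0]
normal = [5.2, 6.1, 4.8, 5.9, 5.5, 6.3]
lo, hi = welch_ci(attack, normal)
print(round(lo, 2), round(hi, 2))  # an interval excluding zero signals a real gap
```

In the actual analysis the degrees of freedom come from the Welch–Satterthwaite formula, which is what `t.test` in R uses under the hood.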
Hypothesis testing
Three formal tests, each with explicit null and alternative, stated α, and decision rule:
- Two-sample Welch’s t-test on connection duration: attack vs. normal. H₀: μ_A = μ_N.
- Chi-square test of independence between protocol and traffic type, using the top four protocols to keep the contingency table well-populated. H₀: protocol and label are independent.
- One-proportion z-test comparing the HTTP attack rate to the overall attack rate as a one-sided upper test. H₀: p_HTTP = p₀.
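The third test is simple enough to show end to end. A Python sketch of the one-sided upper z-test (illustrative counts and a hypothetical baseline rate p₀ = 0.25, not the project's actual numbers):

```python
import math

def one_prop_z(successes, n, p0):
    """z statistic for H0: p = p0 vs. H1: p > p0 (one-sided upper test)."""
    p_hat = successes / n
    se = math.sqrt(p0 * (1 - p0) / n)  # standard error under the null
    return (p_hat - p0) / se

# Toy numbers: 480 malicious HTTP sessions out of 1,500, vs. p0 = 0.25 overall.
z = one_prop_z(480, 1500, 0.25)
print(round(z, 2), z > 1.645)  # reject at alpha = 0.05 if z exceeds 1.645
```

The same pattern, a statistic compared against a stated critical value, is what each of the three tests in the report reduces to once the hypotheses are written down.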
All three reject the null at α = 0.05. Attack traffic has a significantly different mean duration than normal, protocol and attack status are not independent, and HTTP sessions are disproportionately malicious.
The interesting result isn’t that the tests reject — at a sample this size they were always going to. The interesting result is the direction of the effects and the fact that they stack consistently across three independent framings of the question.
Correlation and regression
The regression section fits a simple linear model of log(dbytes + 1) on log(sbytes + 1) — the size of a response as a function of the size of the request. The log transform is doing real work here; on raw bytes the scatter is a dense blob near the origin with a handful of extreme outliers. On the log scale the relationship becomes nearly linear, Pearson r is strongly positive, R² is meaningful, and the residual diagnostics look reasonable. Prediction intervals are computed for three sample log(sbytes) values.
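The slope and fit quality fall out of the usual least-squares formulas. A Python sketch on toy log-log byte data (the project fits this with `lm()` in R; these six points are illustrative only):

```python
from statistics import mean

def ols_fit(x, y):
    """Least-squares intercept and slope for y = a + b*x."""
    xb, yb = mean(x), mean(y)
    num = sum((xi - xb) * (yi - yb) for xi, yi in zip(x, y))
    den = sum((xi - xb) ** 2 for xi in x)
    b = num / den
    return yb - b * xb, b

# Toy data: x = log(sbytes + 1), y = log(dbytes + 1).
x = [4.8, 6.2, 7.1, 8.0, 9.3, 10.1]
y = [5.0, 6.5, 7.0, 8.4, 9.1, 10.4]
a, b = ols_fit(x, y)

# R^2 = 1 - SSE/SST, the share of variance the line explains.
preds = [a + b * xi for xi in x]
sse = sum((yi - pi) ** 2 for yi, pi in zip(y, preds))
sst = sum((yi - mean(y)) ** 2 for yi in y)
r2 = 1 - sse / sst
print(round(b, 2), round(r2, 2))
```

On the log scale a slope near 1 would mean responses scale roughly proportionally with requests, which is the kind of interpretable statement the raw-byte scatter could never support.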
From statistics to a logical signature
The final section is where the project stops being a tour of techniques and becomes an argument. Using propositional logic, I define four predicates over connections:
- T(x): connection x uses TCP
- A(x): connection x is attack traffic
- D(x): connection x has elevated duration (more than two standard deviations above the normal mean)
- S(x): connection x has elevated source bytes (above the third quartile)
The hypothesis tests become the premises of a formal argument, and the conclusion is that the conjunction T(x) ∧ D(x) ∧ S(x) is a statistically validated indicator of A(x) at α = 0.05. Empirically verifying this on the sample produces a contingency table with real precision and recall numbers — not a classifier, but a first-pass detection signature derived entirely from standard coursework statistics.
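Evaluating the signature on labeled records is a direct translation of the predicates into code. A Python sketch with hypothetical fields and toy rows (the thresholds mirror the predicate definitions above; the real evaluation runs over the R sample):

```python
# Each toy record: (is_tcp, duration_z, sbytes_quantile, is_attack), where
# duration_z is the z-score against the normal-traffic mean and
# sbytes_quantile is the source-byte quantile (Q3 = 0.75).
records = [
    (True,  3.1, 0.95, True),
    (True,  2.5, 0.80, True),
    (True,  0.4, 0.90, False),
    (False, 2.9, 0.85, True),
    (True,  2.2, 0.60, False),
    (True,  0.1, 0.20, False),
]

def fires(r):
    is_tcp, dur_z, sb_q, _ = r
    return is_tcp and dur_z > 2 and sb_q > 0.75  # T(x) and D(x) and S(x)

tp = sum(1 for r in records if fires(r) and r[3])
fp = sum(1 for r in records if fires(r) and not r[3])
fn = sum(1 for r in records if not fires(r) and r[3])
precision = tp / (tp + fp)
recall = tp / (tp + fn)
print(precision, round(recall, 2))
```

Precision answers “when the signature fires, how often is it right?” and recall answers “what share of attacks does it catch?”, which is exactly the contingency-table reading the report gives.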
What this project demonstrates
- Comfort with probability, distribution fitting, interval estimation, hypothesis testing, and regression — the core inferential toolkit.
- R / RMarkdown as a reproducible analysis environment, version-controlled in Git.
- The discipline to downsample on purpose and reason about practical vs. statistical significance.
- The willingness to close a coursework project with a formal argument, not a vague conclusion.
Explore the project repository for the complete R Markdown source, rendered PDF, generated figures, and references.