Statistics courses usually live inside toy datasets — coin flips, heights, standardized test scores. For my MATH 211 portfolio project I wanted the opposite: a real cybersecurity dataset, real skew, real noise, and a real conclusion about what separates attack traffic from normal traffic. The result is an end-to-end R Markdown analysis of the UNSW-NB15 network intrusion detection dataset that walks every Course Learning Outcome of the class from probability through regression, then closes the loop with a formal logical argument for a root-cause indicator.
The dataset
The UNSW-NB15 dataset was developed at the University of New South Wales by Moustafa and Slay (2015) as a modern benchmark for network anomaly detection. It contains roughly 2.54 million labeled connection records across 49 packet- and flow-level features, with each record tagged as either normal traffic or one of nine attack categories — Fuzzers, Analysis, Backdoors, DoS, Exploits, Generic, Reconnaissance, Shellcode, and Worms.
I worked from a stratified random sample of n = 15,000 records drawn from the training partition. The downsampling is deliberate: with 2.5 million rows, every hypothesis test rejects on sample size alone. Shrinking the sample gives confidence intervals meaningful width and forces the tests to reflect practical significance, not just mechanical power.
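A stratified draw of this kind is easy to sketch. The project itself is R Markdown; the Python below is an equivalent using pandas, with a toy label column standing in for the real training partition:

```python
import numpy as np
import pandas as pd

rng = np.random.default_rng(0)

# Toy frame standing in for the full training partition; the real sample
# stratifies the UNSW-NB15 training rows on the attack/normal label.
full = pd.DataFrame(
    {"label": rng.choice(["normal", "attack"], 100_000, p=[0.7, 0.3])}
)

n_target = 15_000
frac = n_target / len(full)

# Stratified sample: draw the same fraction from each label group so the
# class proportions in the sample match the full partition.
sample = full.groupby("label", group_keys=False).sample(frac=frac, random_state=0)

print(len(sample))
print(sample["label"].value_counts(normalize=True).round(3).to_dict())
```

Sampling by fraction within each group (rather than a flat `n` overall) is what preserves the attack/normal mix in the smaller sample.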
Probability on real traffic
The first analytical section computes marginal, joint, and conditional probabilities over two events: A = “connection uses TCP” and B = “connection is an attack.” The joint probability table decomposes the sample into TCP/non-TCP × attack/normal cells, and the conditional P(B | A) is compared to the marginal P(B) as a quick independence check. The gap between the two is large enough to motivate a formal chi-square test later on. I also extend to a three-event scenario by adding C = “service is HTTP” and applying the multiplication rule to verify consistency.
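The joint and conditional computation is simple enough to sketch in a few lines. The counts below are hypothetical placeholders, not the project's actual cross-tabulation (the analysis itself is in R):

```python
import pandas as pd

# Hypothetical counts standing in for the real TCP x attack cross-tabulation;
# the actual numbers come from the n = 15,000 UNSW-NB15 sample.
counts = pd.DataFrame(
    {"attack": [3100, 1900], "normal": [5900, 4100]},
    index=["tcp", "non_tcp"],
)
n = counts.to_numpy().sum()

p_attack = counts["attack"].sum() / n        # marginal P(B)
p_tcp = counts.loc["tcp"].sum() / n          # marginal P(A)
p_joint = counts.loc["tcp", "attack"] / n    # joint P(A and B)
p_attack_given_tcp = p_joint / p_tcp         # conditional P(B | A)

# If A and B were independent, P(B | A) would equal P(B).
print(f"P(B)     = {p_attack:.3f}")
print(f"P(B | A) = {p_attack_given_tcp:.3f}")
```

The independence check is just that last comparison: a gap between P(B | A) and P(B) is the informal version of what the chi-square test later formalizes.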
Fitting distributions
With the marginal structure established, the analysis fits three canonical probability distributions to three different features of the dataset:
- Poisson to `ct_srv_src` (connections from the same source to the same service in the last 100 connections) for normal traffic — a count variable where E[X] = Var(X) = λ
- Exponential to `dur` (connection duration in seconds) — a right-skewed continuous variable that captures the memoryless arrival pattern typical of benign sessions
- Normal to `log(sbytes + 1)` — a log-transform was required because raw source bytes are heavily right-skewed across many orders of magnitude
Each fit reports MLE parameter estimates, compares theoretical to empirical moments, and uses the fitted parameters to compute specific point and tail probabilities.
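The three MLE fits are all closed-form. Here is a Python sketch with scipy (the project uses R; the arrays below are synthetic stand-ins with made-up parameters, not the dataset's columns):

```python
import numpy as np
from scipy import stats

rng = np.random.default_rng(42)

# Synthetic stand-ins for the three features; the real fits use the sampled
# UNSW-NB15 columns ct_srv_src, dur, and sbytes.
counts = rng.poisson(4.0, 5000)          # stand-in for ct_srv_src
durations = rng.exponential(0.8, 5000)   # stand-in for dur
log_bytes = np.log(rng.lognormal(6.0, 1.5, 5000) + 1)  # stand-in for log(sbytes + 1)

# MLE for Poisson: lambda-hat is the sample mean (should also match the variance).
lam = counts.mean()
# MLE for Exponential: rate-hat is 1 / sample mean.
rate = 1 / durations.mean()
# MLE for Normal: sample mean and standard deviation.
mu, sigma = log_bytes.mean(), log_bytes.std()

# Use the fitted parameters for point and tail probabilities.
print("P(count = 2)          =", stats.poisson.pmf(2, lam))
print("P(dur > 2s)           =", stats.expon.sf(2, scale=1 / rate))
print("P(log bytes > mu+2sd) =", stats.norm.sf(mu + 2 * sigma, mu, sigma))
```

The E[X] = Var(X) check for the Poisson fit falls out directly: compare `counts.mean()` to `counts.var()` and a good fit keeps them close.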

Confidence intervals
Three separate intervals anchor CLO D: a t-interval for the mean connection duration of normal traffic, a z-interval for the proportion of TCP connections that are malicious, and a Welch two-sample interval for the difference in mean source bytes between attack and normal traffic. The third one is where the dataset starts to earn its keep — the interval for μ_attack − μ_normal does not contain zero, which already hints at what the hypothesis tests will confirm.
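The Welch interval is the one worth spelling out, since the unequal-variance degrees of freedom are the only non-obvious step. A Python sketch on synthetic samples (the real computation runs in R on the `sbytes` groups):

```python
import numpy as np
from scipy import stats

rng = np.random.default_rng(7)

# Hypothetical log-scale byte samples for the two label groups; the real
# interval uses sbytes split by attack vs. normal.
attack = rng.normal(8.0, 2.0, 600)
normal = rng.normal(6.5, 1.5, 1400)

diff = attack.mean() - normal.mean()

# Per-group variance-of-the-mean terms.
va = attack.var(ddof=1) / len(attack)
vn = normal.var(ddof=1) / len(normal)
se = np.sqrt(va + vn)

# Welch-Satterthwaite degrees of freedom.
df = (va + vn) ** 2 / (va**2 / (len(attack) - 1) + vn**2 / (len(normal) - 1))

t_crit = stats.t.ppf(0.975, df)
lo, hi = diff - t_crit * se, diff + t_crit * se
print(f"95% CI for mu_attack - mu_normal: ({lo:.3f}, {hi:.3f})")
```

An interval like this that sits entirely above zero is the "does not contain zero" observation in the paragraph above.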
Hypothesis testing
Three formal tests are run, each with an explicit null and alternative, a stated α, and a decision rule:
- Two-sample Welch’s t-test on connection duration: attack vs. normal. H₀: μ_A = μ_N
- Chi-square test of independence between protocol and traffic type, using the top four protocols to keep the contingency table well-populated. H₀: protocol and label are independent
- One-proportion z-test comparing the HTTP attack rate to the overall attack rate as a one-sided upper test. H₀: p_HTTP = p₀
All three reject the null at α = 0.05. Attack traffic has a significantly different mean duration than normal traffic, protocol and attack status are not independent, and HTTP sessions are disproportionately malicious compared to the baseline rate.
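All three tests map onto standard scipy calls. The sketch below runs them on synthetic data with hypothetical counts; it shows the mechanics, not the project's R code or its actual statistics:

```python
import numpy as np
from scipy import stats

rng = np.random.default_rng(1)

# 1. Welch's two-sample t-test on duration (unequal variances).
dur_attack = rng.exponential(1.5, 500)   # stand-in for attack durations
dur_normal = rng.exponential(0.8, 2000)  # stand-in for normal durations
t_stat, t_p = stats.ttest_ind(dur_attack, dur_normal, equal_var=False)

# 2. Chi-square test of independence on a protocol x label table
#    (rows = top four protocols, columns = attack / normal; made-up counts).
table = np.array([[900, 2100], [300, 1700], [250, 1250], [150, 850]])
chi2, chi_p, dof, _ = stats.chi2_contingency(table)

# 3. One-sided upper z-test: is the HTTP attack rate above the baseline p0?
p0, x, n = 0.30, 520, 1500               # hypothetical HTTP counts
p_hat = x / n
z = (p_hat - p0) / np.sqrt(p0 * (1 - p0) / n)
z_p = stats.norm.sf(z)                   # upper-tail p-value

print(f"t-test p={t_p:.2e}  chi-square p={chi_p:.2e}  z-test p={z_p:.2e}")
```

Note `equal_var=False` on the t-test: that is what makes it Welch's test rather than the pooled-variance version, which matters because the two traffic classes have very different spread.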
The interesting result isn’t that the tests reject — with any sample this size they were always going to reject. The interesting result is the direction of the effects and the fact that they stack in a consistent way across three independent framings of the question.
Correlation and regression
For CLO F the project fits a simple linear regression of log(dbytes + 1) on log(sbytes + 1) — the size of a response as a function of the size of the request. The log transform is doing real work here; on raw bytes the scatterplot is a dense blob near the origin with a handful of extreme outliers, and any regression line is dragged around by a few points. On the log scale the relationship becomes nearly linear, Pearson r is strongly positive, R² is meaningful, and the residual diagnostic plots look reasonable. Prediction intervals are computed for three sample log(sbytes) values.
Log-transforming heavy-tailed network features (bytes, durations, packet counts) before fitting any linear model is almost always the right move. The raw-scale scatter hides the actual relationship behind a few extreme points, and the residuals of a raw-scale fit will fail every diagnostic check.
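The log-log fit itself is a one-liner once the transform is done. A Python sketch on synthetic heavy-tailed byte counts (the real regression runs in R on `sbytes` and `dbytes`):

```python
import numpy as np
from scipy import stats

rng = np.random.default_rng(3)

# Hypothetical heavy-tailed request/response byte counts with a built-in
# power-law relationship, standing in for sbytes and dbytes.
sbytes = rng.lognormal(6.0, 1.5, 2000)
dbytes = sbytes**0.9 * rng.lognormal(0.0, 0.5, 2000)

x = np.log(sbytes + 1)
y = np.log(dbytes + 1)

# Simple linear regression of log(dbytes + 1) on log(sbytes + 1).
fit = stats.linregress(x, y)
print(f"slope={fit.slope:.3f}  r={fit.rvalue:.3f}  R^2={fit.rvalue**2:.3f}")

# Point prediction at a sample log(sbytes) value.
x_new = np.log(1000 + 1)
print("predicted log(dbytes + 1):", fit.intercept + fit.slope * x_new)
```

On the raw scale the same data would give the blob-near-the-origin scatter described above; on the log scale the power-law relationship becomes a straight line that `linregress` recovers cleanly.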
From statistics to a logical signature
The final section is where the project stops being a tour of techniques and becomes an argument. Using propositional logic, I define four predicates over connections:
- T(x): connection x uses TCP
- A(x): connection x is attack traffic
- D(x): connection x has elevated duration (more than two standard deviations above the normal mean)
- S(x): connection x has elevated source bytes (above the third quartile)
The hypothesis tests from the previous section become the premises of a formal argument, and the conclusion is that the conjunction T(x) ∧ D(x) ∧ S(x) is a statistically validated indicator of A(x) at α = 0.05. Empirically verifying this on the sample produces a contingency table with real precision and recall numbers — not a classifier, but a first-pass detection signature derived entirely from standard coursework statistics.
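Evaluating the signature is just a boolean conjunction plus a contingency count. The sketch below uses synthetic connections with hypothetical predicate rates; the real evaluation applies the thresholds from the normal-traffic statistics to the sampled records:

```python
import numpy as np
import pandas as pd

rng = np.random.default_rng(5)
n = 15_000

# Synthetic connections: attacks satisfy each predicate more often than
# normal traffic (the per-class rates here are made up for illustration).
attack = rng.random(n) < 0.3
df = pd.DataFrame({
    "attack": attack,
    "T": rng.random(n) < np.where(attack, 0.8, 0.6),    # uses TCP
    "D": rng.random(n) < np.where(attack, 0.5, 0.05),   # elevated duration
    "S": rng.random(n) < np.where(attack, 0.6, 0.25),   # elevated source bytes
})

# The signature fires when T(x) AND D(x) AND S(x) all hold.
fired = df["T"] & df["D"] & df["S"]

tp = int((fired & df["attack"]).sum())
fp = int((fired & ~df["attack"]).sum())
fn = int((~fired & df["attack"]).sum())

precision = tp / (tp + fp)
recall = tp / (tp + fn)
print(f"precision={precision:.3f}  recall={recall:.3f}")
```

The shape of the result is the point: a conjunction of individually weak predicates can be precise while catching only a slice of attacks, which is exactly the "first-pass signature, not a classifier" framing.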
Why this framing
Most of my existing blog posts are incident-response case studies: SolarWinds, NotPetya, Log4j, xz. Those are narrative. This one is the opposite — it’s what happens when you aim the basic tools of an introductory statistics course at a real security dataset and take the results seriously. The conclusion isn’t a production detection rule; it’s a demonstration that probability, hypothesis testing, and regression can combine into a logically coherent argument about a cybersecurity phenomenon. That’s the kind of bridge between coursework and the security domain I want this portfolio to keep building.
Want to see the full analysis? Explore the project repository for the complete R Markdown source, the rendered PDF, generated figures, and the references list.