Introducing microservices to students in Stanford CS 110

Ryan Eberhardt invited Xpanse to give a guest lecture on the last day of a summer session of Stanford CS 110, in that gap between real coursework and the final.

CS 110 is the second course in Stanford’s systems programming sequence. I loved taking it as a student. I loved CS 110 so much that I TAed it twice, even though it’s a really tough course to TA (there are zillions of new undergrads, there are a lot of assignments to give feedback on, and the material is pretty hard for them, so office hours consist of a never-ending queue of students with questions). My professor Jerry Cain gave me an award for my TA work, so hopefully I did okay by them.

For this revisit to CS 110, I introduced microservices, containerization, and orchestration. I opened by orienting the students to who we were and why they should care. Then two sharp coworkers talked about their daily tech: port scanning and functional programming. I concluded the lecture by hinting at the problems solved by Docker and Kubernetes (and the problems created by them), and I asked leading questions that extended some of the core ideas in CS 110: separation of concerns, workers that each do one thing, pools of workers that share a single point of entry, and request/response models.

Data documentation and me

When I start wrangling a new data source, I search for a lot of terms, ask a lot of questions, and take a lot of notes. I like to publish those notes on the company wiki. When I work publicly, the quality of the docs tends to improve – no one has to reinvent the wheel, I sometimes get comments that improve the docs, and there is a clear stub for others to maintain.

These notes often find their way to becoming onboarding directives (“go read these pages to learn about our data”), and they’re valuable to skim when revisiting data.

When I was in a highly mature data science organization with weak engineering practices, everyone was building on everyone else’s data, and data documentation was useful for tracking alternative datasets and tracing back to canonical sources. Now I’m in a maturing data science organization with strong engineering practices, and these documents are useful because they keep data users from asking the engineers the same questions over and over.

I’ve landed on a general template while documenting many, many dozens of datasets. I want to share it in case it can be useful to someone else.


$typeOfData data documentation

The page naming convention supports searching for “X data”.

The first section is a very short orientation (ideally 1-3 sentences). It gives just enough for the reader to figure out if the data are relevant to their task. When the dataset scope is complex, this section tends to be a bit longer to explain what is (not) included.

Background

This section varies in length. It covers anything someone might need to know that isn’t immediately apparent, such as:

  • Where do these data come from? Do we have any contractual or PII obligations?
  • How does the system that creates these data work? (for instance, how does Tor work, how does network communication work, why would a certificate be revoked, what is domain registration)

Summary of dataset

This section contains a short paragraph or small table that always gives the most basic facts:

  • Where can you find the data?
  • What time period do the data cover?
  • How often is the source refreshed? By whom?
  • Roughly how large are the data on disk (per unit of time)?
  • Who is the best POC team (or teams) for these data? (useful in itself and as an indicator of data maturity)

Organizations sometimes keep multiple copies of the same data. Maybe that looks like Parquet and .ndjson and Avro copies of every file. Maybe that looks like GCS and BigQuery and Databricks. Maybe that looks like something else entirely. I get opinionated here. I think it’s unreasonable to document all of those copies, so it’s important to decide which one is the most “organizationally canonical” source and write data documentation that reflects only that canonical source. (And I think the canonical source itself should have documentation describing how it is loaded.)

Fields

This section is the core of the documentation. It contains a very, very long table. Each field gets its own row. I use three columns:

  • field name
  • example values
  • notes

The examples must be edited to be notional. The company wiki never has the same controls as R&D systems do, so I always edit the examples so they do not actually appear in the dataset.

The “notes” column is the heart of each row. This is where I capture relevant facts, like whether this field is sufficient to be a primary key, whether it’s a hash of another field, whether it’s an enum masquerading as a string and how frequent each value is, whether it is actually the same as the field with the same name in that other dataset, whether it’s a dirty raw version of some other field, ….

I often stub out a row for each field, and then I fill in example values and notes as I work with each field.
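
For illustration only, a freshly stubbed Fields table for a hypothetical certificate dataset might start out like this (the field names and values are notional, per the rule above, and the “to fill in” markers are the point of the stub):

    field name   | example values         | notes
    cert_sha256  | (notional hash)        | (to fill in: sufficient to be a primary key?)
    first_seen   | 2019-03-02T17:21:09Z   | UTC. (to fill in)
    status       | "revoked"              | Enum masquerading as a string. (to fill in: how frequent is each value?)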

Limitations and gotchas

This section reiterates the most frequent mistakes that someone is likely to make with this data. I often fill the entire section with a handful of very short bullet points.

It includes big-picture gotchas. What kinds of data will not be visible in this dataset? What kinds of mistakes will people make? (This is where I underscore the implications of what is already documented – “remember to deduplicate when you JOIN on this data, or your join will explode”.)

It also includes the most important field-level gotchas (“the domains in this dataset always have a final period appended, which means all naive JOINs will falsely fail”).

Since different people read different sections, I think repeating these warnings is worthwhile.

Usage

When the data source is canonical but opaque, and there isn’t organizational interest in maintaining either a usable copy of all the data or a centralized query library, it can be useful to provide an example bit of SQL or Java code. I write a bit of code that massages the data into a useful format while correctly avoiding the gotchas. If the organizational appetite for data maturity is there, that’s even better: build the maintained copy or query library instead.
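
To make that concrete, here is a small Python/pandas sketch of the kind of snippet I mean. The table and column names are made up, and the two gotchas it handles are the examples from the “Limitations and gotchas” section above; SQL or Java would serve just as well.

    import pandas as pd

    def prepare_domains(raw: pd.DataFrame) -> pd.DataFrame:
        """Massage the (hypothetical) raw domain table so naive JOINs behave."""
        out = raw.copy()
        # Gotcha 1: domains arrive with a trailing period, so naive joins falsely fail.
        out["domain"] = out["domain"].str.rstrip(".").str.lower()
        # Gotcha 2: duplicate rows make joins explode, so deduplicate first.
        return out.drop_duplicates(subset=["domain"])

    # Notional usage: join the cleaned domains onto another (equally hypothetical) table.
    raw = pd.DataFrame({"domain": ["example.com.", "example.com.", "test.example.org."],
                        "registrar": ["R1", "R1", "R2"]})
    other = pd.DataFrame({"domain": ["example.com", "test.example.org"], "owner": ["A", "B"]})
    print(prepare_domains(raw).merge(other, on="domain", how="left"))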

This section distinguishes these data from similar data sources. It also curates links to any other related data that someone might not even know they were looking for (e.g., WHOIS might link to DNS data as a source of FQDNs rather than only SLDs).


And that’s that. I don’t use all the sections every time, and I sometimes move them around, but I do tend to keep to this structure. I find public documents in a consistent structure to be very valuable for working with data.

Detecting live security threats without internal data: Netflow and building the Behavior product

We were awarded a patent for Behavior, Qadium’s second product, which I designed, architected, and built as applied research! We sold my BigQuery-based implementation for roughly a year to the tune of >$1 million in ARR – a big deal for a startup. As soon as the value was visible, we staffed it with multiple engineers, converted it to Dataflow, and cranked up a UI.

The core insight inside Behavior is the same as the core insight of our first product, Expander, and the core business of Qadium: We can help you monitor everything that your organization has connected to the internet, and we can do it without deploying any physical devices or software agents. It turns out most organizations have all kinds of unexpected and unwanted systems connected to the internet. Qadium products give you visibility into what isn’t under management and what is misconfigured, and Behavior extends that unprecedented visibility to live interaction flow data.

Challenges with using netflow data

Behavior detects otherwise-undetectable security threats by operating on netflow data. Netflow records are very basic metadata about the communications traversing the public internet. In netflow data, we have only a few fields: the source and destination IP addresses, the source and destination ports, the layer-4 protocol, the timestamp, and any TCP flags. These aren’t much to work with. Worse, the modern internet multiplexes many domains and even organizations onto shared IP addresses, so the data aren’t definitive.
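
To make “not much to work with” concrete, here is roughly what a single record carries, as a Python sketch with field names of my choosing (not any particular netflow export format):

    from dataclasses import dataclass

    @dataclass
    class FlowRecord:
        # Everything we get about one sampled flow: endpoints, ports, protocol, time, flags.
        src_ip: str
        dst_ip: str
        src_port: int
        dst_port: int
        protocol: int     # layer-4 protocol number; 6 means TCP
        timestamp: float  # epoch seconds
        tcp_flags: int    # bitmask of observed TCP flags; 0 for non-TCP

    # A notional sampled record: some client talking to something listening on port 443.
    record = FlowRecord("203.0.113.47", "198.51.100.10", 51812, 443, 6, 1554321000.0, 0x18)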

Netflow data capture only a very small fraction of a percent of all traffic (on the order of 1 packet in 1 million), and the sampling is entirely by convenience. Most routers and switches retain some netflow records as the traffic passes through them, but each system has very limited memory, and quickly handling traffic is their priority. So, dropped records are unavoidable, and dropping behavior varies by characteristics like node identity (network position) and time-of-day (data volume). We also see non-negligible duplicate records, because a single client-server conversation might be observed by multiple sensors. “Chattier” types of interactions (like audiovisual communications) are over-represented. Adding more complexity, which IP is considered the “source” of the packet is essentially arbitrary for our purposes.

Additionally, because Qadium doesn’t have access to internal firewall logs, we don’t actually know what truly happens to risky flows. We don’t know whether the organization’s firewall blocked the risky connection (as it should), nor do we know which machine ultimately received or initiated the flow.

My challenge was: Can we say anything valuable about an organization’s internet security risks from such a limited dataset?

What I built

I focused on one fact. When we see a flow, that interaction happened.

First I verified that this fact was indeed true. I ran an experiment: I took a set of otherwise-quiescent IP addresses and sent a zillion packets to a random selection of the even-numbered addresses. I measured how quickly netflow records for those interactions appeared (quite fast), and I verified that no records were hallucinated during a multi-day window: I saw no traffic, then a spike of records for the even-numbered addresses for a few minutes, and then no traffic again.

To be able to use the fact that observations always reflect real traffic, I had to transform the data. I first relabeled the columns to be meaningful – from “source” and “destination” to “likely client” and “likely server”. I used the port numbers for this translation. (Low and well-known ports are servers; high ports are clients.) With this transformation, I could deduplicate conversations. I also kept only TCP records and dropped all records whose flags didn’t indicate an established connection. I joined customer IPs onto the records, keeping only the flows relevant to each customer. I dropped IPs shared by multiple organizations (commonly CDNs, hosting providers, and cloud resources). I did all of this in a computationally efficient way, rather than a human-consumption-friendly way.
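
A minimal Python sketch of that cleaning pass, reusing the FlowRecord sketch above and assuming hypothetical lookup sets for customer IPs and shared IPs (the real pipeline ran as BigQuery SQL over columnar data, not per-record Python):

    WELL_KNOWN_PORT_MAX = 1023  # low/well-known ports suggest the server side
    TCP = 6
    ACK = 0x10

    def orient(rec):
        """Relabel source/destination as (likely client, likely server, server port) using ports."""
        if rec.dst_port <= WELL_KNOWN_PORT_MAX < rec.src_port:
            return rec.src_ip, rec.dst_ip, rec.dst_port   # destination looks like the server
        if rec.src_port <= WELL_KNOWN_PORT_MAX < rec.dst_port:
            return rec.dst_ip, rec.src_ip, rec.src_port   # source looks like the server
        return None                                       # ambiguous; drop it

    def clean(records, customer_ips, shared_ips):
        """Yield deduplicated (client, server, server_port) conversations relevant to customers."""
        seen = set()
        for rec in records:
            if rec.protocol != TCP:                  # keep only TCP records
                continue
            if not rec.tcp_flags & ACK:              # require an established-looking connection
                continue
            oriented = orient(rec)
            if oriented is None:
                continue
            client, server, _ = oriented
            if client in shared_ips or server in shared_ips:
                continue                             # CDNs, hosting providers, cloud resources
            if client not in customer_ips and server not in customer_ips:
                continue                             # keep only flows relevant to a customer
            if oriented in seen:                     # deduplicate the conversation (ignoring time)
                continue
            seen.add(oriented)
            yield oriented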

Then I checked for indicators that the customer IP had been compromised or was behaving in a risky way:

  • Remote IP is risky. The remote IP might be in a country on the OFAC Sanctions List, it might be a known bad IP like a command-and-control server, or it might have some other property that makes it risky to communicate with.
  • Remote service is insecure. Because Qadium regularly talks to every IPv4 address on the most important ports using the most important protocols, we know what is at the other end of each netflow record. I flagged the flows where the remote system was too juicy to be legitimate.
  • Remote IP:port is against security policy. For example, many organizations ban peer-to-peer networking; many US government agencies banned the use of the popular Kaspersky anti-virus and other products[1]. Behavior is able to detect violations and demonstrate compliance.
  • Remote is a honeypot. Legitimate users have no need to interact with honeypot servers, but compromised servers running automated attack and reconnaissance software find them very enticing.
  • Remote is not a publicized IP. Non-publicized IPs have no legitimate reason to be involved in a communication. Any communication with such an IP is an indicator that the client node is scanning the internet, which in turn suggests that malware may be present on the customer system.
  • Customer service is insecure. Insecure systems sending voluminous outbound traffic to strange IPs are potentially compromised. (It’s also possible, though much less likely, that inbound traffic reflects an attacker’s reverse shell.)
  • We detect a change. By detecting changes in the volume of traffic going to particular kinds of remote systems, we identified behavioral oddities for security teams to investigate.

I developed “risk filters” in all of these categories and applied them at scale to customer data. Hits were, unfortunately, common. Some connections were correctly dropped at a proxy firewall, but many others were not. I worked through many possible risk filter ideas, keeping the ones that were most effective. Every hit from my final set warranted security-team investigation, unlike the output of most security products, which have extremely high false-positive rates. The risk filter hits created an opportunity for effective alignment conversations and uncovered misconfigured firewalls at customer locations (satellite offices are particularly tricky).
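
For flavor, a couple of the simpler categories can be sketched as predicates over the cleaned (client, server, server port) conversations from the earlier sketch. The lookup sets here (sanctioned IPs, honeypot IPs, publicized IPs) are hypothetical inputs, and the real filters did considerably more than set membership:

    def risk_hits(flows, sanctioned_ips, honeypot_ips, publicized_ips):
        """Yield (flow, reason) pairs for a few of the simpler risk-filter categories."""
        for client, server, server_port in flows:
            if server in sanctioned_ips:
                yield (client, server, server_port), "remote IP is risky (sanctions-related)"
            if server in honeypot_ips:
                yield (client, server, server_port), "remote is a honeypot"
            if server not in publicized_ips:
                yield (client, server, server_port), "remote is not a publicized IP (possible scanning)"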

Squeezing value from data was great fun for me, and linking our static “inventory” product with dynamic “risk” data was incredibly valuable for customers.


I wrote this post with more than 5 years of hindsight, well past the point where any details are sensitive. I am backdating it to the first public release of detailed information about Behavior and our netflow work.

End-to-end neural networks for subvocal speech recognition

My final project for Stanford CS 224S was on subvocal speech recognition. This was my last paper at Stanford; it draws on everything I learned in a whirlwind of CS grad school without a CS undergraduate major. Pol Rosello provided the topic; he and I contributed equally to the paper.

We describe the first approach toward end-to-end, session-independent subvocal automatic speech recognition from involuntary facial and laryngeal muscle movements detected by surface electromyography. We leverage character-level recurrent neural networks and the connectionist temporal classification (CTC) loss. We attempt to address challenges posed by a lack of data, including poor generalization, through data augmentation of electromyographic signals, a specialized multi-modal architecture, and regularization. We show results indicating reasonable qualitative performance on test-set utterances, and we describe promising avenues for future work in this direction.
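
For readers curious about the shape of that recipe, here is a minimal PyTorch sketch of a character-level recurrent network trained with the CTC loss on featurized EMG frames. The feature dimension, vocabulary, and layer sizes are placeholders of mine, not the paper’s:

    import torch
    import torch.nn as nn

    # Placeholder sizes: 40-dimensional EMG feature frames, 26 letters + space,
    # with index 0 reserved for the CTC blank symbol (so 28 output classes).
    NUM_FEATURES, NUM_CLASSES = 40, 28

    class EMGToChar(nn.Module):
        def __init__(self):
            super().__init__()
            self.rnn = nn.GRU(NUM_FEATURES, 128, num_layers=2, bidirectional=True, batch_first=True)
            self.proj = nn.Linear(2 * 128, NUM_CLASSES)

        def forward(self, x):                        # x: (batch, time, features)
            h, _ = self.rnn(x)
            return self.proj(h).log_softmax(dim=-1)  # per-frame log-probabilities over characters

    model = EMGToChar()
    ctc = nn.CTCLoss(blank=0)

    # A notional batch: two utterances of 100 frames, transcripts of 7 and 5 characters.
    x = torch.randn(2, 100, NUM_FEATURES)
    targets = torch.randint(1, NUM_CLASSES, (12,))   # concatenated target character indices
    input_lengths, target_lengths = torch.tensor([100, 100]), torch.tensor([7, 5])
    log_probs = model(x).permute(1, 0, 2)            # CTCLoss expects (time, batch, classes)
    loss = ctc(log_probs, targets, input_lengths, target_lengths)
    loss.backward()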

A brief introduction to reinforcement learning

We spend a lot of time talking about supervised learning when discussing ML with students, but I find reinforcement learning just as interesting and useful.

I developed a talk on reinforcement learning for high school participants in SAIL ON, the year-round diversity and high school outreach program of the Stanford AI Lab that I initiated and led. SAIL ON follows SAILORS, an intensive two-week AI summer camp.

We discuss how reinforcement learning works, how to make decisions given Bayesian bounds, touchstone RL problems and recent applications, and where RL tends to succeed and fail.