Identifying sign languages from video: SLANG-3k

As I haven’t yet created a permanent place to hold the dataset I collected for my most recent class project, I’m hanging it here for now.  SLANG-3k is an uncurated corpus of 3000 15-second clips of people signing in American Sign Language, British Sign Language, and German Sign Language, intended as a public benchmark dataset for sign language identification in the wild.  Using 5 frames per clip, I was able to achieve accuracies of roughly 0.66–0.67.  More details can be found in the paper and poster created for CS 231N, Convolutional Neural Networks for Visual Recognition.

Many thanks to everyone who helped with this project — and most especially to the anonymous survey respondents who received only warm fuzzies as compensation for taking the time to help with this early-stage research.

Hadoop Intro

In the interest of forming structured diagrams and take-aways from my coursework, this post documents the high-level take-aways of the class on the Hadoop ecosystem that I’m shopping.  This isn’t new information to me, but restructuring and restating it means it’ll get embedded deeper.  (This is 246H from Daniel Templeton, a “how to use Hadoop” class based on Cloudera’s professional training course.  I’m inclined to postpone 246, which covers MapReduce algorithms and involves things like proofs of convergence without explicitly teaching Hadoop; postponing lets me opt instead for the neural networks/deep learning class that may not be offered again, and it has the fringe benefits of waiting until 246 has had a transition year after the course leadership shift and has (maybe!) incorporated Spark.)


Benefits of Hadoop Ecosystem:

  • Worth the costs when data has Volume, Variety, and/or Velocity
  • Cheap(er) because it scales out on commodity hardware
  • Scaling adds a node with both a CPU & a HDD — memory and computing power increase together
  • Schema-on-read rather than schema-on-write upholds the separation of concerns (here, the data we can store is distinct from schema we impose); data & schema can be addressed independently
  • Moving the code to the data means the network cost scales with the (small) amount of code rather than the (large) amount of data
  • All parts of ecosystem manipulate the same data (no moving it around)

ETL becomes ELT in Hadoop: extract-transform-load becomes extract-load-transform, since cheap storage means you can land the raw data first and impose structure at read time.

In the architecture, the NameNode is a single machine that tracks meta-information about where the data is kept.  Give it more memory and it can track more files.  The limit on the size of the system is set by the number of distinct files it tracks, not their sizes (many small files can get you to the bounds faster than a few large files).
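To make the small-files point concrete, here’s a back-of-the-envelope sketch.  The figure of roughly 150 bytes of NameNode heap per namespace object (file or block) is a commonly cited rule of thumb rather than something from the class, so treat the numbers as illustrative:

```python
# Back-of-the-envelope NameNode heap estimate.  Each file and each block
# is a metadata object held in the NameNode's memory; ~150 bytes per
# object is a commonly cited rule of thumb (an assumption, not exact).
BYTES_PER_OBJECT = 150

def namenode_heap_bytes(num_files, blocks_per_file=1):
    """Approximate NameNode heap needed to track num_files files."""
    objects = num_files * (1 + blocks_per_file)  # one file entry + its blocks
    return objects * BYTES_PER_OBJECT

# Roughly 1 TB of data stored two ways (assuming 128 MB blocks):
# 8 files of 128 GB each (1024 blocks apiece) vs. 8 million tiny files.
few_large = namenode_heap_bytes(num_files=8, blocks_per_file=1024)
many_small = namenode_heap_bytes(num_files=8_000_000)

print(f"8 large files:         {few_large / 1e6:.2f} MB of heap")
print(f"8,000,000 small files: {many_small / 1e6:.0f} MB of heap")
```

Same order of data, three orders of magnitude more metadata for the small files — which is why file count, not data volume, is the binding constraint on the NameNode.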

Software:

  • HDFS — file system
  • Apache HBase — scalable key-value store (scales much larger than MongoDB — or Riak, Couchbase, etc. — but also requires a lot more up-front and ongoing infrastructure work); CP in CAP terms, so it sacrifices availability
  • Apache Sqoop — uses cluster to pull in files in parallel
  • Flume — stream logs
  • Spark — immature & moving fast; much faster for iterative algorithms (like just about any gradient-based optimization)
  • Kafka — distributed publish/subscribe messaging service
  • MapReduce — dominant and mature, but no longer getting many new investments; work may be duplicated compared to a hand-parallelized solution, but because of scaling it may still be faster
  • Apache Pig (Yahoo) — procedural version of SQL (unnests complexity through variables); an original framework atop MapReduce
  • Impala (Cloudera) — main open-source SQL engine (querying); parallel database built over HDFS; works entirely in memory which isn’t ideal
  • Hive (Facebook) — translation layer between SQL and MapReduce jobs (not a database, no memory, no server so no caching)
  • Cloudera Search — enables indexing; built on Solr, which builds on Lucene (the Lucene/Nutch lineage is what precipitated Hadoop)
  • Hue — web-based desktop for Hadoop (browse HDFS like an actual file system)
  • Apache Oozie — workflow management; limited documentation
  • Apache Sentry (Cloudera) — adds more granular, role-based authorization (complementing Kerberos authentication)

Constraints in Grails

Some experimentation on constraints in Grails 2.4.4 (based on the constraint documentation) leads me to the following conclusions:

  • Use “nullable” instead of “blank”. The documentation says that “blank” is for strings, but Grails 2.4.4 doesn’t seem to pay any attention to that keyword.  Setting “nullable” always works.
  • If the field can be null, you can’t specify other constraints for when it is non-null.  Logic of the form “null, or else must conform to standard X” is not supported through combinations of nullable and other constraints.  You could get around this with custom validation or even regular expressions (using the ‘validator’ or ‘matches’ constraints) if you wanted.  (Perhaps there is another way around this; if so, please share it!)
  • All missingness is the same.  All three versions of missingness (empty string, null, and a missing parameter) behave identically.  This is in contrast to some of the things I had been reading about empty strings in HTML forms being treated differently in Grails from actual null values, so I wonder if this is one of the pieces that has changed over time.
  • “Optionals” is no longer supported. Testing suggested this, and it is backed up by this JIRA report.

The chart below gives the specifics as well as the take-aways.

Input for all tests was a String variable.  When the string value “abc” was used, all tests passed.  I excluded all the attempts with “optionals”, because all of those failed and the keyword is no longer supported.

| Blank      | Nullable   | Size       | Unique     | Test input | Result  |
|------------|------------|------------|------------|------------|---------|
| Not stated | Not stated | Not stated | Not stated | “”         | Invalid |
| Not stated | Not stated | Not stated | Not stated | null       | Invalid |
| Not stated | Not stated | Not stated | Not stated | missing    | Invalid |
| False      | Not stated | Not stated | Not stated | “”         | Invalid |
| False      | Not stated | Not stated | Not stated | null       | Invalid |
| False      | Not stated | Not stated | Not stated | missing    | Invalid |
| True       | Not stated | Not stated | Not stated | “”         | Invalid |
| True       | Not stated | Not stated | Not stated | null       | Invalid |
| True       | Not stated | Not stated | Not stated | missing    | Invalid |
| Not stated | True       | Not stated | Not stated | “”         | OK      |
| Not stated | True       | Not stated | Not stated | null       | OK      |
| Not stated | True       | Not stated | Not stated | missing    | OK      |
| Not stated | False      | Not stated | Not stated | “”         | Invalid |
| Not stated | False      | Not stated | Not stated | null       | Invalid |
| Not stated | True       | 3..3       | Not stated | “”         | Invalid |
| Not stated | True       | 3..3       | Not stated | null       | Invalid |
| Not stated | True       | Not stated | True       | “”         | Invalid |
| Not stated | True       | Not stated | True       | null       | Invalid |
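The “null or must conform to X” logic that the built-in constraint combinations can’t express is easy to state as a custom check.  Here is the shape of what a Grails ‘validator’ closure would do, sketched in Python (the three-character rule stands in for an arbitrary standard X; all names are illustrative):

```python
import re

def null_or_valid(value):
    """Accept null; otherwise require the value to meet standard X
    (illustratively: exactly three characters)."""
    if value is None:
        return True  # nullable: a missing value is fine
    # non-null values must conform to the standard
    return re.fullmatch(r".{3}", value) is not None

print(null_or_valid(None))   # True
print(null_or_valid("abc"))  # True
print(null_or_valid("ab"))   # False
```

Note that an empty string fails the check here, whereas the chart shows Grails treating “”, null, and a missing parameter identically — so a real validator closure would likely want to normalize empty strings to null first.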

Data types in Grails and MongoDB

In working with Grails and MongoDB, I found myself building out some data type documentation I wasn’t able to find.  For your coding pleasure, please find…

| Groovy Type | Mongo Type |
|-------------|------------|
| Integer     | Int32      |
| Double      | Double     |
| Long        | Int64      |
| Float       | String     |
| Short       | Int32      |

Grails is great, but the documentation is limited and the dates/versions of the references matter a lot (it seems Grails has been through many versions and many recommended approaches, not all of which are backwards-compatible).  The above mapping is for Grails 2.4.4 and the mongodb plugin 3.0.2, atop MongoDB 2.6.3.

Inductive-Deductive Loop

Last year I went looking for an “inductive-deductive loop” image (I was trying to convince stone-cold scientific method biologists that it really is okay to start science from observations), but I couldn’t find anything close to the simple diagram I was envisioning.  So, I drew my version on a Post-it note, and I’m sharing it now for posterity and for Google Images.

My talking point here is that scientific inquiry is both inductive and deductive.  Although many disciplines privilege a single type of reasoning, it’s better to integrate both approaches.  With a circular view, we are free to enter a problem wherever it’s most straightforward to start — exploring the data, taking hypotheses or patterns to their conclusions, or considering how known theories might manifest — knowing that we’ll do a complete investigation in the end.  We trace as far as we can through the loop, verifying our interpretations through multiple methods.  Sometimes we cycle around the loop multiple times.

For instance, if you’re heavy on data and light on abstractions, you might start by trying to find patterns in the observations.  Once you identify some patterns, you formalize those patterns into a theory.  Given theory, you can generate some hypotheses based on the implications of that theory.  You then collect more data to try to disprove those hypotheses.  The new observations might suggest new patterns, starting another round of the loop.  You don’t limit yourself to collecting data only to disprove hypotheses, though — you also look at data that hasn’t been deliberately collected under the premises required by your hypotheses.  By looking at all the observations, you can start to investigate when the premises themselves hold.

The inductive-deductive loop is the structure of scientific inquiry.

[Diagram: theory → hypothesis → observations → pattern → theory, in a loop]