Frequently asked questions

A growing movement within natural language processing (NLP) and cognitive science asks how we can gain a deeper understanding of the generalizations that are learned by neural language models. While a language model may achieve high performance on certain benchmarks, another measure of success may be the degree to which its predictions agree with human intuitions about grammatical phenomena. To this end, an emerging line of work has begun evaluating language models as "psycholinguistic subjects" (e.g. Linzen et al. 2016, Futrell et al. 2018). This approach has shown certain models to be capable of learning a wide range of phenomena, while failing at others.

However, as this subfield grows, it becomes increasingly difficult to compare and replicate results. Test suites from existing papers have been published in a variety of formats, making them difficult to adapt in new studies. It has also been notoriously challenging to reproduce model output due to differences in computing environments and resources.

Furthermore, this research demands nuanced knowledge about both natural language syntax and machine learning. This has made it difficult for experts on both sides to engage in discussion: linguists may have trouble running language models, and computer scientists may have trouble designing robust suites of test items.

This is why we created SyntaxGym: a unified platform where language and NLP researchers can design psycholinguistic tests and visualize the performance of language models. Our goal is to make psycholinguistic assessment of language models more standardized, reproducible, and accessible to a wide variety of researchers.
SyntaxGym has three main components:
  1. Browse or create psycholinguistic test suites.
  2. Browse containerized language models.
  3. View performance across models and test suites through interactive visualizations.
In information theory, the surprisal of a word \(w\) measures the amount of information gained from observing \(w\), conditioned on the context in which it occurs. Surprisal is commonly used in psycholinguistics, as it has been shown to correlate with human behavioral measures (e.g. Smith & Levy 2013).

Formally, surprisal is given by the negative log probability, or \[S(w|\text{context}) = - \log_2 p(w|\text{context}).\] As a general rule of thumb, we expect ungrammatical (or unexpected) constructions to have high surprisal and grammatical (or predictable) constructions to have low surprisal.
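
As a rough illustration of the formula above, the sketch below computes surprisal in bits from a conditional probability. The function name `surprisal` and the probability values are assumptions made up for this example; they do not come from any particular language model or from the SyntaxGym codebase.

```python
import math


def surprisal(prob: float) -> float:
    """Return the surprisal, in bits, of a word with probability `prob`.

    `prob` stands for p(w | context) and must lie in (0, 1].
    """
    if not 0.0 < prob <= 1.0:
        raise ValueError("probability must be in the interval (0, 1]")
    return -math.log2(prob)


# Hypothetical probabilities, chosen only to show the trend.
predictable_word = surprisal(0.5)    # likely continuation: low surprisal
unexpected_word = surprisal(0.001)   # unlikely continuation: high surprisal

print(predictable_word, unexpected_word)
```

Halving the probability of a word adds exactly one bit of surprisal, which is why very unlikely continuations stand out so clearly on this scale.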
Test suites consist of a set of sentences known as items. Items are split into chunks called regions, which take different forms in different conditions.

Please refer to the test suite documentation for more information.
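
To make the terminology concrete, here is a minimal sketch of how an item, its regions, and its conditions relate to one another. The `Item` class, the condition names, and the example sentence are all hypothetical and are not the actual SyntaxGym test suite format; the test suite documentation remains the authoritative reference.

```python
from dataclasses import dataclass
from typing import Dict, List


@dataclass
class Item:
    """One test item (hypothetical structure, for illustration only).

    `conditions` maps a condition name to the content of each region,
    listed in region order. Every condition has the same number of regions.
    """
    conditions: Dict[str, List[str]]

    def sentence(self, condition: str) -> str:
        """Join the non-empty regions of one condition into a sentence."""
        return " ".join(r for r in self.conditions[condition] if r)


# A made-up subject-verb agreement item with four regions and two conditions.
# Only the third region (the verb) differs between the conditions.
item = Item(conditions={
    "match":    ["The keys", "to the cabinet", "are", "on the table."],
    "mismatch": ["The keys", "to the cabinet", "is", "on the table."],
})

for name in item.conditions:
    print(name, "->", item.sentence(name))
```

In this sketch a test suite would simply be a list of such items, and a comparison between conditions would look at the same region position in each.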
Unlike NLP benchmarks such as GLUE, the syntactic generalization benchmark provided by SyntaxGym is meant for testing only. That means you can train your language model however you'd like; the test suites can be used to evaluate any pre-trained or off-the-shelf model.
We currently have several papers under review; check back in spring 2020 for an updated answer.
We would love to hear your feedback. Please email us at contact@syntaxgym.org.
You can sign up for our mailing list to receive occasional updates. We don't spam.


SyntaxGym was created by Jennifer Hu, Jon Gauthier, Ethan Wilcox, Peng Qian, and Roger Levy in the MIT Computational Psycholinguistics Laboratory. J.H. is supported by an NSF Graduate Research Fellowship and an NIH Computationally-Enabled Integrative Neuroscience training grant.

The icons in our logo and homepage were found on Flaticon.