Creating the First-Ever, Fully Adaptive Automated Speaking Assessment

This week’s post will be a little self-indulgent. I hate the “sorry not sorry” card, but I’m going to play it anyway. The Emmersion team crossed an important milestone a couple of weeks ago as we leveled up our automated English speaking assessment, and it became the FIRST-ever fully adaptive automated speaking assessment. It has been a great privilege to work with the brilliant minds that have made this vision a reality, and it is a credit to the persistence of the visionary who carried the seeds, tilled the soil, and nurtured them to fruition. You can hear from the brains behind it all, Dr. Jacob Burdis, here.

My first introduction to elicited imitation

I believe I previously mentioned that my connection to this particular project far predates my tenure at Emmersion. In 2006, on my very first day, in my very first paid role related to English language learning and teaching, I was introduced to the concept of elicited imitation (EI). This listen-and-repeat test task type is the backbone of the TrueNorth Test. I had just been hired to work as a computer support representative for the English Language Center at Brigham Young University, where I would spend the next 14 years in different permutations of responsibilities.

It was early September; a new cohort of students had just arrived from 30 or so different countries. They needed to be tested and then placed so that the semester of instruction could begin. I was there to help the students successfully use the computer-based assessments that would give the program administrators the data they needed to do so. 

While the EI test was only one of several assessments that I helped proctor that day, it definitely made an impression. First, it was 63 items long. One of the sentences was “Big ships always make noise.” Depending on the accent of the student, you may have heard “Big sheeps always make noise.” (I grew up in an agricultural community over 800 miles from an ocean. Frankly, that was the version I knew to be true.) Of the several different tests given to this incoming group, the EI section was, for many, the most intense.


Part of the test’s intensity came from its being a new way of testing English ability for the students, almost all of whom arrived with some previous instruction experience. That said, it wasn’t hard to pick up the process: listen, record, repeat.

Second, its length drew my attention. Not that it actually took long, but I could tell that, psychologically, it was taxing. This was likely less about the number of questions and more about an issue of fit.

The confident, high-ability students were bored by the dozens of tasks that were much too easy. The overwhelmed beginners succeeded on only a few items before being bludgeoned by the many that were far above their capacity. Those in the middle experienced both boredom and bludgeoning, just in smaller doses.

When I asked about this particular part of the test protocol, I was told that it was part of a research study into the assessment technique. Neither the students nor the program would get immediate feedback from the data. The test content generated by the students would first be hand-rated: each of the 63 recordings of a student repeating a sentence would be listened to, and each correctly spoken syllable marked.

You may be wondering if the newest hire got the privilege of participating in that monotonous task. He most certainly did! These hand-rated scores would be compared to the output of an Automatic Speech Recognition (ASR) processor that would also analyze each recording to compare what was spoken with what the test-taker had heard.


If man versus machine was a draw (which it was!), the researcher could assume that this particular test item type could be automated. [Spoiler alert: it can be! Even more successfully with the refinements to ASR tools achieved over the last decade and a half.]

The scores from the automated assessment were then compared to other measures of speaking ability (all of which relied on human rating) to see if such an automated assessment could substitute for the more expensive and unwieldy use of human raters. Here, too, the indications were positive.

Although the task of listening and repeating is different from the open response speech generated in interview-based or simulated tests, both can be used to predict speaking ability.

You Must Walk Before You Can Run

So, the academics did what academics do: they showed that something could be done, and they did so over and over and over. As is often the case, it took more than a decade before the good idea became a solution available to those who needed it.

Since its release, the TrueNorth Speaking Test (TNT) has benefited companies and organizations around the world as a fast, affordable, reliable, and scalable automated measure of English speaking ability (as well as Spanish, Portuguese, French, German, and Japanese, with Italian, Russian, Tagalog, Chinese, and Korean coming soon).


Using the innovative approach of elicited imitation and robust scoring technology, test-takers today get what the students and program administrators in 2006 did not: immediate feedback on their measured ability. Until this most recent adaptive release, the TrueNorth Speaking Test had 30 listen-and-repeat items in its Part 1. Even at that improved length, the issue of fit persisted.

In these early versions of TNT, Part 1 was fixed in its form: thirty tasks starting at beginning difficulty and increasing to mastery. The test form was the same for everyone, regardless of the ability that begins to reveal itself from the very first item.

A language test is most reliable and efficient when the difficulty of each task fits the likely ability of the test-taker as closely as possible.


Just as with the research forms I observed at the beginning of my career, high-ability test-takers wade through low-difficulty items and low-ability test-takers drown in high-difficulty items. This is far from the ideal.

Achieving the Impossible

Some have said (*cough* *cough*, Seal of Biliteracy, I’m looking at you) that this is impossible. They have insisted that you cannot have both automaticity and adaptivity in computer-delivered speaking assessments. As Shakespeare said, “’Twas so, but ’tis not so.”

Even those early, ambitious researchers were not interested only in elicited imitation’s ability to provide an automatic measure of speaking ability. They posited, and Emmersion has now proven (and delivered), that a speaking ability assessment can be both automated AND custom fit to the test-taker’s ability.


With this adaptive version of the TrueNorth Speaking Test, after the first item has been submitted, the AI-powered assessment does what a skilled human interviewer would do: assesses performance, adjusts its estimate of likely ability, and selects a next item that fits better. If the item was too hard, the next item will be easier. If the item was clearly too easy, the next item will be harder.
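To make that loop concrete, here is a minimal, purely illustrative sketch in Python. The item bank structure, the 0.7 “success” threshold, and the fixed step size are all assumptions for demonstration; this is not Emmersion’s actual scoring algorithm.

```python
# Illustrative adaptive-testing loop: assess, adjust the ability
# estimate, then select the next item to better fit that estimate.

def next_item(ability, item_bank, used):
    """Pick the unused item whose difficulty best fits the current estimate."""
    candidates = [item for item in item_bank if item["id"] not in used]
    return min(candidates, key=lambda item: abs(item["difficulty"] - ability))

def update_ability(ability, score, step=0.5):
    """Nudge the estimate up after a success, down after a failure."""
    return ability + step if score >= 0.7 else ability - step

def run_adaptive_test(item_bank, respond, n_items=10, start_ability=5.0):
    """respond(item) returns the fraction of syllables repeated correctly."""
    ability, used = start_ability, set()
    for _ in range(n_items):
        item = next_item(ability, item_bank, used)
        used.add(item["id"])
        ability = update_ability(ability, respond(item))
    return ability
```

A production computer-adaptive test would use a statistically grounded ability estimate (for example, one based on item response theory) rather than a fixed step, but the shape of the loop (assess, adjust, select) is the same one described above.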

These smart adjustments allow the test to focus quickly. Each item becomes increasingly useful and revealing. The test form does not just become more efficient; it also takes less time.

For many test-takers, the test has been shortened by half or more! Remember that the earlier versions had 30 items, and the earliest version (that I know of) had 63. But it is not just about length. Even with that trim physique, the test actually gives the test-taker more of the highest-return items, which means it remains a reliable and accurate measure.

The virtues extend beyond speed. The test was already efficient and reliable. Now, it is also more secure.

The test form is dynamically created from a bank of hundreds of tasks. This means that the form created for my friend will be different from the one created for me, even if our abilities are similar. It also makes the TrueNorth English Speaking Test even more useful for tracking changes in ability: the same product can be used before and after learning or language development to measure the effectiveness of teaching and learning tools.

Claiming First Ascent

Last year, I went down a strange streaming binge into mountaineering movies, specifically ones that told the story of first ascents. Claiming a first ascent (the first documented climb of a particular mountain or route up a rock face) is a huge climbing achievement. Free Solo is probably the best known, but Meru and my favorite, The Dawn Wall, are also worth the time and would be nice escapism for most in 2020.


Getting to claim a first ascent in the field of language assessment with the commercial release of the first fully adaptive automated speaking assessment is certainly a moment of pride, relief, and joyful camaraderie. 

However, as I imagine Alex Honnold, Jimmy Chin, Tommy Caldwell, or Kevin Jorgeson would admit, despite the euphoria of attainment that follows a first ascent, from that new height you also see that the mountain you are on is only one of many mountains to climb.

Rather than being deflated by this new understanding, I have found it invigorating. We are ready for it. We will carry the strength, understanding, and confidence we have earned into the challenges on the horizon. Stay tuned. There is much more to come from here.

Experience certified language ability for yourself

Emmersion is a fully automated and adaptive language assessment engine for certifying speaking, writing, and grammar ability in 9 global languages with immediate results. Click below to try a free Emmersion assessment for yourself.

Request a Demo

