Word to World
Supervised by Prof. Carola Zwick
Master Project, May 2020

Google Speech API, Google Natural Language API, Unity, Asset store

“Word to World” is a further development of my bachelor thesis “Scribbling Speech”. It is software that transforms real-time speech into dynamic visualisation. I used the Unity game engine to process the data coming from the language input; the computer then outputs a corresponding three-dimensional visual world in which camera movements, physics, forces, collisions, animations, and motions work together. Humans are proficient at intuition, language, imagination, and creativity, while the computer is proficient at computation, algorithms, logic, and memory. "Word to World" creates a mix of intuition and logic, and it encourages creation with technology.

World Structure:

Visualizing human "free speech" is a complex task, so I broke the project down into the following elements:

Bringing Objects onto the Canvas

Assets are what we have in our world: natural environments, animals, people, objects, means of transportation, and even places and landmarks. The computer finds the corresponding asset for each detected noun.
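
The noun-to-asset lookup can be sketched as follows. This is a minimal Python illustration, not the project's Unity code; the catalogue entries, asset paths, and function names are all hypothetical.

```python
# Hypothetical asset catalogue: noun (singular) -> asset identifier.
ASSETS = {
    "mushroom": "nature/mushroom",
    "cat": "animals/cat",
    "elephant": "animals/elephant",
}


def singular(noun):
    """Very rough plural stripping; a real system would rather use the
    lemma delivered by the language-processing step."""
    noun = noun.lower()
    if noun.endswith("s") and noun[:-1] in ASSETS:
        return noun[:-1]
    return noun


def find_assets(nouns):
    """Return the asset for every noun that has one; unknown nouns are skipped."""
    found = []
    for noun in nouns:
        key = singular(noun)
        if key in ASSETS:
            found.append(ASSETS[key])
    return found


# "Some mushrooms here" -> only "mushrooms" maps to an asset.
print(find_assets(["mushrooms", "here"]))
```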

"Some mushrooms here"

Bringing Animations to the World

The word “animation” stems from the Latin “animātiōn”, stem of “animātiō”, meaning “a bestowing of life”. Bringing motion onto our canvas greatly increases the vividness of the world.
A cat can eat, walk, jump, run, and sleep; a human can walk, kick, sit, and dance. They can move!

• Animal Animations:

Each animal has its own specific animations built by animators. When a verb is detected, the matching animation is triggered and the animal transitions from its previous state. The default initial state is "idle".

transition from the default "idle" state to the "walk" state
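
The verb-triggered transition can be sketched as a tiny state machine. This is a minimal Python illustration under the assumption that each animal simply owns a set of named clips; the class and method names are hypothetical and do not mirror Unity's animator API.

```python
class AnimalAnimator:
    """Keeps track of the current animation state of one animal."""

    def __init__(self, clips):
        # Every animal can at least idle; "idle" is the default initial state.
        self.clips = set(clips) | {"idle"}
        self.state = "idle"

    def on_verb(self, verb):
        """Switch to the clip named after the verb, if the animal has one.

        Returns (previous_state, current_state) so the caller can blend
        from the previous state into the new one.
        """
        previous = self.state
        if verb in self.clips:
            self.state = verb
        return previous, self.state


elephant = AnimalAnimator(["walk", "drink"])
print(elephant.on_verb("drink"))  # transition out of the default "idle" state
```

Verbs without a matching clip leave the state unchanged in this sketch; a fuller system would need some fallback for them.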

"The elephant is drinking water."

• Humanoid Animations:

I applied motion data (FBX format) from the CMUMocap database to a rigged male character with a humanoid rig. Below are screenshots of the man performing different motions as a result of visualizing verbs.

"The cowboy is waving to the penguin"

• Anthropomorphic Animations:

Anthropomorphism in character animation can stimulate our imagination through storytelling: you can tell a story of a bear dancing and boxing! The following are examples of a bear performing different motions after being given a humanoid rig.

"The bear is dancing in front of the rose."

• Animating Lifeless Objects:

Inanimate objects such as tables, chairs, boxes, and stones are lifeless in the real world, but in a world of fantasy, everything can be alive!

Above are examples I collected to show how lifeless objects can be animated and brought to life.

I used a mesh deformer tool written by Keenan Woodall. It offers various methods such as "bend", "sine", "skew", and "inflate" to change and animate the form of the models.

a mesh deformer tool written by Keenan Woodall
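
To illustrate what a "sine" deformation does to a mesh, here is a minimal Python sketch. It is not the API of the deformer tool mentioned above; the function name and parameters are assumptions made for illustration only.

```python
import math


def sine_deform(vertices, amplitude, frequency, phase=0.0):
    """Displace each vertex along y by a sine wave that runs along x.

    vertices is a list of (x, y, z) tuples; a new list is returned.
    Animating `phase` over time makes the wave travel through the mesh.
    """
    return [
        (x, y + amplitude * math.sin(frequency * x + phase), z)
        for x, y, z in vertices
    ]


# A flat strip of three vertices becomes a wave.
strip = [(0.0, 0.0, 0.0), (math.pi / 2, 0.0, 0.0), (math.pi, 0.0, 0.0)]
print(sine_deform(strip, amplitude=2.0, frequency=1.0))
```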

I started with the most basic geometry, a cube. By combining different deformation methods, I found that the behaviors of the cube can already express some emotions and meanings.

start experimenting with the most basic "cube"

Then I applied the same methods to a "table" model, and we can see the table come alive.

further study with a "table" model

"The sofa is walking to the chair"

Movement Path

• Linear and Random Movement Paths:

Some groups of verbs trigger an object's linear movement, such as "rise", "sink", "rotate", "dive", and "go towards". Other verbs, such as "fly" and "play around", give the object a random path.

"The hot air balloon is rising."
(linear motion)

"The moon rotates around the earth."
(linear rotation)

"A flying UFO!"
(random motion)

"There's an earthquake!"
(camera movement)
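
The difference between a linear and a random path can be sketched as one position update per frame. This is a minimal Python illustration; the verb tables, direction vectors, and function name are hypothetical, and rotation and camera movement are left out.

```python
import random

# Hypothetical verb tables: linear verbs map to a fixed direction (x, y, z).
LINEAR_VERBS = {
    "rise": (0.0, 1.0, 0.0),
    "sink": (0.0, -1.0, 0.0),
    "dive": (0.0, -1.0, 0.0),
}
RANDOM_VERBS = {"fly", "play around"}


def step(position, verb, speed, dt, rng=random):
    """Advance a position by one time step of length dt.

    Linear verbs move along a fixed direction. Random verbs draw a fresh
    direction every step (each component in [-1, 1], not normalised).
    Unknown verbs leave the position unchanged.
    """
    x, y, z = position
    if verb in LINEAR_VERBS:
        dx, dy, dz = LINEAR_VERBS[verb]
    elif verb in RANDOM_VERBS:
        dx, dy, dz = (rng.uniform(-1, 1) for _ in range(3))
    else:
        return position
    return (x + dx * speed * dt, y + dy * speed * dt, z + dz * speed * dt)


# "The hot air balloon is rising."
print(step((0, 0, 0), "rise", speed=2.0, dt=0.5))
```

Drawing a new direction on every step gives a jittery path; smoothing between directions would look calmer but is beyond this sketch.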

• Movement Paths with Intelligence:

Some verbs, such as “arrive”, “avoid”, “flee”, “follow”, “hide”, “pursue”, and “seek”, require the object to plan its path intelligently.
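
Verbs like these correspond closely to classic steering behaviours. As a minimal sketch in Python, assuming a 2D ground plane and hypothetical function names, "seek" and "flee" can be written as a desired velocity toward or away from a target:

```python
import math


def seek(position, target, max_speed):
    """Desired velocity that points from position toward target."""
    dx = target[0] - position[0]
    dy = target[1] - position[1]
    distance = math.hypot(dx, dy)
    if distance == 0:
        return (0.0, 0.0)  # already at the target
    return (dx / distance * max_speed, dy / distance * max_speed)


def flee(position, threat, max_speed):
    """Desired velocity that points directly away from the threat."""
    vx, vy = seek(position, threat, max_speed)
    return (-vx, -vy)


# A chaser at the origin seeks a target at (3, 4), which flees from it.
print(seek((0, 0), (3, 4), max_speed=5.0))
print(flee((3, 4), (0, 0), max_speed=5.0))
```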

"The dinosaur is chasing to the fox!"

• Real-time Navigating:

If you say "The girl walks in her room", the computer calculates the walkable area of the "room" in real time. This ensures that the girl won't walk into tables and shelves.

walkable area
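
The idea of a walkable area can be illustrated with a small grid. The sketch below swaps in a plain grid and breadth-first search as a stand-in; how the project computes the area in Unity is not specified here, and the room layout and function names are hypothetical.

```python
from collections import deque


def walkable_cells(grid):
    """Cells marked '.' are free floor; anything else is furniture."""
    return {
        (r, c)
        for r, row in enumerate(grid)
        for c, cell in enumerate(row)
        if cell == "."
    }


def find_path(grid, start, goal):
    """Shortest 4-connected path over walkable cells, or None if blocked."""
    walkable = walkable_cells(grid)
    if start not in walkable or goal not in walkable:
        return None
    queue = deque([start])
    came_from = {start: None}
    while queue:
        current = queue.popleft()
        if current == goal:
            path = []
            while current is not None:  # walk back to the start
                path.append(current)
                current = came_from[current]
            return path[::-1]
        r, c = current
        for nxt in ((r + 1, c), (r - 1, c), (r, c + 1), (r, c - 1)):
            if nxt in walkable and nxt not in came_from:
                came_from[nxt] = current
                queue.append(nxt)
    return None


# '#' is a shelf in the middle of the room; the path has to go around it.
room = [".#.",
        ".#.",
        "..."]
print(find_path(room, (0, 0), (0, 2)))
```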

"The girl is wandering in the garden."


Simulations help us visualize a world that we are familiar with. We use gravity simulations to visualize, for example, "An apple falls from the tree", and fluid simulations to visualize how a chair moves and floats in water.

"The pencil and the ball fell down."
(Gravity Simulation)

"A flock of birds is flying over the mountains."
(Swarm Simulation)

"There's a swarm of fish swimming!"
(Swarm Simulation)

"A chair fell into the water."
(Fluid Physics Simulation)
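
The gravity case is the simplest of these simulations to write down. The following is a minimal Python sketch using semi-implicit Euler integration, assuming no air resistance; in the project this work is done by the game engine's physics, and the function name and time step here are illustrative choices.

```python
def simulate_fall(height, dt=0.01, g=9.81):
    """Drop an object from `height` metres and return the time (seconds)
    until it reaches the ground.

    Each step first updates the velocity from gravity, then the position
    from the new velocity (semi-implicit Euler).
    """
    y, velocity, elapsed = height, 0.0, 0.0
    while y > 0:
        velocity -= g * dt
        y += velocity * dt
        elapsed += dt
    return elapsed


# Free fall from 4.905 m takes about one second: t = sqrt(2 * h / g).
print(simulate_fall(4.905))
```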

Time and Weather

• Using Shadow to Visualize Time:

In “Word to World”, we use shadows to represent the sun and the time. At different times of the day, a shadow gets longer or shorter, or may disappear. We can tell the time from a shadow’s current length and angle. With this method, you are free to say "Now it's 7am in the morning."

tell the time by showing the shadow
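
The relation between time and shadow length follows from basic trigonometry: an object of height h casts a shadow of length h / tan(e), where e is the sun's elevation angle. The Python sketch below pairs that formula with a deliberately crude sun model (sunrise at 06:00, sunset at 18:00, a 60° peak at noon); the model and all names are assumptions for illustration, not astronomically accurate.

```python
import math


def sun_elevation_deg(hour, max_elevation=60.0):
    """Toy model of the sun's elevation in degrees for an hour in [0, 24)."""
    if not 6 < hour < 18:
        return 0.0  # night, or sun exactly on the horizon
    return max_elevation * math.sin(math.pi * (hour - 6) / 12)


def shadow_length(object_height, hour):
    """Length of the shadow on flat ground, or None when there is no sun."""
    elevation = sun_elevation_deg(hour)
    if elevation <= 0:
        return None
    return object_height / math.tan(math.radians(elevation))


# Shadows are long in the early morning and shortest at noon.
for hour in (7, 9, 12, 17):
    print(hour, round(shadow_length(1.8, hour), 2))
```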

"In the late afternoon, it starts snowing."

• Weather Conditions:

“Word to World” has six weather conditions:
“sunny”, “cloudy”, “rainy”, “thunderstorm”, “windy” and “snowy”.

"On a sunny day..."

"It's raining!"

Camera Language

• First Person & Third Person:

The subject of the sentence decides the controller: a first-person subject such as "I" triggers a first-person controller, while a third-person subject such as "the girl" triggers a third-person controller.

"I'm walking in the forest."
(will trigger a first-person controller)

"The girl is walking in the garden."
(will trigger a third-person controller)

• Changing Camera Perspective by Wording:

We can switch the camera to a view from behind a character by saying
"something sees something",
"something looks at something",
“in something’s eyes, the....”
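
Detecting these three wordings can be sketched with regular expressions. This is a minimal Python illustration that works on the raw sentence; the patterns and the function name are hypothetical, and a parsed sentence from the language-processing step would be a more robust basis.

```python
import re

# Hypothetical patterns for the three wordings; `viewer` names the character
# whose perspective the camera should take.
PATTERNS = [
    re.compile(
        r"^(?:the\s+)?(?P<viewer>\w+)\s+(?:sees|saw|looks\s+at|looked\s+at)\b",
        re.IGNORECASE,
    ),
    re.compile(
        r"^in\s+(?:the\s+)?(?P<viewer>\w+)['’]s\s+eyes",
        re.IGNORECASE,
    ),
]


def viewer_of(sentence):
    """Return the character whose view to switch to, or None."""
    text = sentence.strip()
    for pattern in PATTERNS:
        match = pattern.match(text)
        if match:
            return match.group("viewer").lower()
    return None


print(viewer_of("The bear saw an elephant!"))
print(viewer_of("In the fox's eyes, the forest is huge."))
```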

"The bear saw an elephant!"

Adding a Virtual Layer

How do we visualize sentences like “A little boy is watching animation on a TV.”, “I’m watching a news report on the TV”, “The man is watching a weather report on the TV.”? “Word to World” prepares three virtual layers that often appear on “screens”: a children’s animation clip, a weather report video clip, and a news report video clip.

"The fox is watching an animation on TV!"

Audio Experience

Besides the visualisations, we also provide an audio experience. The sound effects in “Word to World” are from freesound.org. I divided the words into three groups:

words that have sound effects attached;
words that are "silent";
words and sentences like "I listen to..." and "I hear...", which trigger a sound effect when detected.

Natural Language Processing

The "Speech to Text API" will recognize real-time speech and convert it to text. The "Natural Language API" will then process the text to make the computer “understand” it.
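
To show what "understanding" can mean in practice, here is a minimal Python sketch that pulls subject, verb, and target out of a syntax analysis. The field names (`lemma`, `partOfSpeech.tag`, `dependencyEdge.headTokenIndex`, `dependencyEdge.label`) follow the token layout of the Natural Language API's syntax analysis as I understand it; the sample tokens are hand-written for illustration, not real API output, and the helper names are hypothetical.

```python
def token(text, lemma, tag, head, label):
    """Build one token shaped like an entry of the response's token list
    (fields trimmed to what this sketch needs)."""
    return {
        "text": {"content": text},
        "lemma": lemma,
        "partOfSpeech": {"tag": tag},
        "dependencyEdge": {"headTokenIndex": head, "label": label},
    }


# Hand-written analysis of "The giraffe is walking to the penguin!"
SAMPLE_TOKENS = [
    token("The", "The", "DET", 1, "DET"),
    token("giraffe", "giraffe", "NOUN", 3, "NSUBJ"),
    token("is", "be", "VERB", 3, "AUX"),
    token("walking", "walk", "VERB", 3, "ROOT"),
    token("to", "to", "ADP", 3, "PREP"),
    token("the", "the", "DET", 6, "DET"),
    token("penguin", "penguin", "NOUN", 4, "POBJ"),
    token("!", "!", "PUNCT", 3, "P"),
]


def head_of(tok):
    return tok["dependencyEdge"].get("headTokenIndex", 0)


def parse_command(tokens):
    """Extract the root verb, its subject, and the object of a preposition
    attached to that verb."""
    root = next(
        i for i, t in enumerate(tokens) if t["dependencyEdge"]["label"] == "ROOT"
    )
    command = {"verb": tokens[root]["lemma"], "subject": None, "target": None}
    for tok in tokens:
        label = tok["dependencyEdge"]["label"]
        if label == "NSUBJ" and head_of(tok) == root:
            command["subject"] = tok["lemma"]
        elif label == "POBJ" and head_of(tokens[head_of(tok)]) == root:
            command["target"] = tok["lemma"]
    return command


print(parse_command(SAMPLE_TOKENS))
```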

• Classifying Verbs to Assign Animation and Movement:

To make actions happen, for example to make the giraffe walk to the penguin, we assign a corresponding animation and movement to the subject, the giraffe, according to the verb "walk".

However, other verbs change the size of an object, make the object talk, or make it interact with the environment, so we have to classify verbs according to their semantics.
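
Such a classification can be sketched as a lookup from verb lemma to semantic class. This is a minimal Python illustration; the class names and the verbs listed in each class are hypothetical examples, not the project's actual word lists.

```python
# Hypothetical verb classes; each class would be handled by a different
# part of the system (animation, scaling, speech, interaction).
VERB_CLASSES = {
    "animate_and_move": {"walk", "run", "jump", "dance"},
    "resize": {"grow", "shrink", "expand"},
    "talk": {"say", "shout", "whisper"},
    "interact": {"open", "push", "eat"},
}


def classify_verb(lemma):
    """Return the name of the class containing the verb, or None."""
    lemma = lemma.lower()
    for class_name, verbs in VERB_CLASSES.items():
        if lemma in verbs:
            return class_name
    return None


print(classify_verb("walk"))
print(classify_verb("shrink"))
```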

• Technical Flow:

"The giraffe is walking to the penguin!"

Semantic Models

We can make the computer understand us much better by using semantic models such as semantic networks, hierarchical network models, synonyms, hyponymy, and hypernymy.
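
One way such relations can help is as a fallback when a spoken word has no asset of its own: first try a synonym, then climb to more general terms (hypernyms) until an available asset is found. The Python sketch below illustrates this under stated assumptions; the tiny word tables and the function name are hypothetical.

```python
# Hypothetical semantic tables.
SYNONYMS = {"puppy": "dog", "sofa": "couch"}
HYPERNYMS = {"poodle": "dog", "dog": "animal", "kitten": "cat", "cat": "animal"}
AVAILABLE_ASSETS = {"dog", "couch"}


def resolve(word):
    """Map a word to the closest available asset name, or None.

    The word is first replaced by its synonym (if any); then the chain of
    hypernyms is followed upward until an available asset is reached.
    """
    word = SYNONYMS.get(word.lower(), word.lower())
    seen = set()  # guards against cycles in the hypernym table
    while word is not None and word not in seen:
        if word in AVAILABLE_ASSETS:
            return word
        seen.add(word)
        word = HYPERNYMS.get(word)
    return None


print(resolve("poodle"))  # no poodle asset, so fall back to the more general term
print(resolve("sofa"))
```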

User Scenarios

• Storytelling:

"Word to World" can be used as a storytelling tool for people to tell and share their stories. They can use it to record their dreams, explain something to somebody, describe their daily life, document a public event and so on.

• Bedtime Stories with Kids:

"Word to World" can be used for telling bedtime stories to kids. Nowadays our ways of telling a story are limited, mostly to reading books. With "Word to World", we can not only create a brand new story ourselves but also create and tell it together with our loved ones in real time. It can stimulate kids' imagination as well as improve their language skills.

• For Professionals:

Game designers can use "Word to World" to construct a static scene such as a landscape, where the focus is on the assets and aesthetics. Therefore, besides exporting a story as a video, there is also an option to export the whole scene as a 3D file package.

Schematic Interface

• Start Telling a Story:

There will be some tips guiding the user to start speaking.

• Export a Story:

When the user remains silent for more than 10 seconds, the computer asks whether they would like to export the story. The user can choose "continue story" to keep telling the story, or export it as a 3D file or a video. Alternatively, the user can say "Export the story as a video." or "Export it as a 3D file."
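
Both export triggers can be sketched in a few lines of Python. This is a minimal illustration; only the 10-second limit and the two spoken commands come from the description above, while the function names, return values, and keyword matching are assumptions.

```python
SILENCE_LIMIT_SECONDS = 10.0


def should_prompt_export(last_speech, now):
    """True once the user has been silent for more than the limit.

    Both arguments are timestamps in seconds.
    """
    return now - last_speech > SILENCE_LIMIT_SECONDS


def export_format(utterance):
    """Detect a spoken export command; returns 'video', '3d_file', or None."""
    text = utterance.lower()
    if "export" not in text:
        return None
    if "video" in text:
        return "video"
    if "3d file" in text:
        return "3d_file"
    return None


print(should_prompt_export(last_speech=0.0, now=12.0))
print(export_format("Export it as a 3D file."))
```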

• Talking State:

The bubble is a visual indicator that the user is speaking. It breaks when the user mentions something that is not stored in the system.