Word to World
Supervised by Prof. Carola Zwick
Master Project, May 2020

Tools:
Google Speech API, Google Natural Language API, Unity, Asset store

Idea:
“Word to World” is a further development of my bachelor thesis “Scribbling Speech”. It is software that transforms real-time speech into a dynamic visualisation. I used the Unity game engine to process the data coming from the language input; the computer outputs a corresponding three-dimensional visual world in which camera movements, physics, forces, collisions, animations, and motion work together. Humans are proficient at intuition, language, imagination, and creativity, while the computer is proficient at computation, algorithms, logic, and memory; "Word to World" creates a mix of intuition and logic.



World Structure:



Assets are what we have in our world: natural environments, animals, people, objects, transportation and even places, landmarks and so on.

Bringing Objects onto the Canvas




"Some mushrooms here"

Bringing Animations to the World



The word “animation” stems from the Latin “animātiō” (stem “animātiōn-”), meaning “a bestowing of life”. Bringing motion onto our canvas greatly increases the vividness of the world.
A cat can eat, walk, jump, run, and sleep; a human can walk, kick, sit, and dance. They can move!

• Animal Animations:


transitioning from the default "idle" state to the "walk" state

"The elephant is drinking water by the pond."
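In the Unity project these transitions live in an Animator Controller; as a language-agnostic illustration (the state and verb names below are assumptions, not the project's actual ones), the verb-to-state logic can be sketched like this:

```python
# Minimal sketch of a verb-driven animation state machine.  In the real
# project this is a Unity Animator Controller; the state and verb names
# here are illustrative assumptions.

class AnimalAnimator:
    VERB_TO_STATE = {
        "walk": "Walk", "run": "Run", "jump": "Jump",
        "eat": "Eat", "sleep": "Sleep", "drink": "Drink",
    }

    def __init__(self):
        self.state = "Idle"  # default state, as in the screenshots

    def on_verb(self, verb):
        """Transition out of the current state when a known verb arrives;
        unknown verbs leave the state unchanged."""
        self.state = self.VERB_TO_STATE.get(verb, self.state)
        return self.state
```

So `AnimalAnimator().on_verb("walk")` moves the cat from "Idle" to "Walk", mirroring the transition shown above.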


• Humanoid Animations:

I applied motion data (FBX format) from the CMU Motion Capture Database to a male character that is rigged with a humanoid rig. Below are screenshots of the man performing different motions, as a result of visualizing verbs.

"The cowboy is waving to the penguin"



• Anthropomorphic Animations:


"The bear is dancing in front of the rose."



• Animating Lifeless Objects:

Inanimate objects such as tables, chairs, boxes, and stones are lifeless in the real world, but in a world of fantasy, everything can come alive!

Above are examples I collected to show how lifeless objects
can be animated and brought to life.






starting to experiment with the most basic "cube"




further study with a "table" model

"The sofa is walking to the chair"

Movement Path

• Linear and Random Movement Paths:


"The hot air balloon is rising."
(linear motion)

"The moon rotates around the earth."
(linear rotation)

"A flying UFO!"
(random motion)

"There's an earthquake!"
(camera movement)
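The two path types can be sketched as two tiny update functions, hedged as an illustration rather than the project's Unity code: linear motion integrates a constant velocity, while random motion adds a bounded random offset each frame.

```python
import random

def linear_step(pos, velocity, dt):
    """Linear motion: integrate a constant velocity (e.g. a rising balloon)."""
    return [p + v * dt for p, v in zip(pos, velocity)]

def random_step(pos, max_jitter, rng=random.Random(0)):
    """Random motion: a bounded random offset each frame (e.g. a UFO)."""
    return [p + rng.uniform(-max_jitter, max_jitter) for p in pos]

balloon = [0.0, 0.0, 0.0]
for _ in range(60):                          # one second at 60 fps
    balloon = linear_step(balloon, [0.0, 1.0, 0.0], 1 / 60)
# the balloon has risen about one unit
```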


• Movement Paths with Intelligence:


Some verbs, such as “arrive”, “avoid”, “flee”, “follow”, “hide”, “pursue”, and “seek”, require the object to plan its path intelligently.

"The dinosaur is chasing the fox!"
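A minimal version of such an intelligent path is simple pursuit: each frame, step toward the target's current position. The function below is an illustrative sketch, not the project's Unity-side steering code.

```python
import math

def chase_step(chaser, target, speed, dt):
    """Move the chaser a small step straight toward the target's current
    position; a fleeing object would step in the opposite direction."""
    dx, dy = target[0] - chaser[0], target[1] - chaser[1]
    dist = math.hypot(dx, dy)
    if dist < 1e-9:
        return chaser                        # already there
    step = min(speed * dt, dist)             # never overshoot the target
    return (chaser[0] + dx / dist * step, chaser[1] + dy / dist * step)

dino, fox = (0.0, 0.0), (10.0, 0.0)
for _ in range(100):
    dino = chase_step(dino, fox, speed=2.0, dt=0.1)
# after enough steps the dinosaur has caught up with the fox
```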


• Real-time Navigating:



walkable area

"The girl is wandering in the garden."
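Unity typically handles this with a baked NavMesh that marks the walkable area; as a self-contained stand-in, the same idea can be shown with breadth-first search over a walkable grid (the grid and cell values below are illustrative assumptions).

```python
from collections import deque

def find_path(grid, start, goal):
    """Breadth-first search over a walkable grid (1 = walkable, 0 = blocked);
    a stand-in for NavMesh pathfinding.  Returns a list of cells or None."""
    rows, cols = len(grid), len(grid[0])
    queue = deque([(start, [start])])
    seen = {start}
    while queue:
        (r, c), path = queue.popleft()
        if (r, c) == goal:
            return path
        for dr, dc in ((1, 0), (-1, 0), (0, 1), (0, -1)):
            nr, nc = r + dr, c + dc
            if 0 <= nr < rows and 0 <= nc < cols and grid[nr][nc] \
                    and (nr, nc) not in seen:
                seen.add((nr, nc))
                queue.append(((nr, nc), path + [(nr, nc)]))
    return None

garden = [
    [1, 1, 1],
    [0, 0, 1],   # a blocked flowerbed the girl must walk around
    [1, 1, 1],
]
path = find_path(garden, (0, 0), (2, 0))
```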

Simulation


"The pencil and the ball fell down."
(Gravity Simulation)

"A flock of birds is flying over the mountains."
(Swarm Simulation)

"There's a swarm of fish swimming!"
(Swarm Simulation)

"A chair fell into the water."
(Fluid Physics Simulation)
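Swarm behaviour like the birds and fish above is classically modelled with Reynolds' three boids rules: cohesion, alignment, and separation. A compact 2D sketch follows; the gains and radii are illustrative, not the project's values.

```python
def boids_step(positions, velocities, dt=0.1):
    """One update of Reynolds' boids rules (cohesion, alignment,
    separation) in 2D."""
    n = len(positions)
    cx = sum(p[0] for p in positions) / n        # flock centre
    cy = sum(p[1] for p in positions) / n
    ax = sum(v[0] for v in velocities) / n       # average velocity
    ay = sum(v[1] for v in velocities) / n
    new_vel = []
    for (px, py), (vx, vy) in zip(positions, velocities):
        vx += 0.05 * (cx - px)                   # cohesion: steer to centre
        vy += 0.05 * (cy - py)
        vx += 0.05 * (ax - vx)                   # alignment: match the flock
        vy += 0.05 * (ay - vy)
        for qx, qy in positions:                 # separation: avoid crowding
            dx, dy = px - qx, py - qy
            if 0 < dx * dx + dy * dy < 1.0:
                vx += 0.1 * dx
                vy += 0.1 * dy
        new_vel.append((vx, vy))
    new_pos = [(p[0] + v[0] * dt, p[1] + v[1] * dt)
               for p, v in zip(positions, new_vel)]
    return new_pos, new_vel
```

Run over a few frames, the cohesion rule visibly pulls scattered birds into a flock.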

Time and Weather


• Using Shadow to Visualize Time:

In “Word to World”, we use shadows to represent the sun and the time. At different times of the day, our shadows get longer and shorter, or may disappear. We can tell the time from a shadow’s current length and angle.

tell the time by showing the shadow

"In the late afternoon, it starts snowing."
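A simple sun model makes the shadow-to-time mapping concrete. The sunrise/sunset hours and the linear elevation curve below are assumptions for illustration, not the project's actual lighting setup.

```python
import math

def shadow_length(object_height, hour):
    """Shadow length under a simple sun model: the sun rises at 6:00,
    peaks overhead at 12:00 and sets at 18:00; returns None at night."""
    if not 6 < hour < 18:
        return None                              # no sun, no shadow
    # elevation rises linearly from 0 degrees at 6:00 to 90 at noon
    elevation = math.radians((1 - abs(hour - 12) / 6) * 90)
    return object_height / math.tan(elevation)
```

Inverting the same formula is what lets the viewer read the time back off the shadow.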


• Weather Conditions:

In “Word to World” we will have six weather conditions:
“sunny”, “cloudy”, “rainy”, “thunderstorm”, “windy” and “snowy”.

"On a sunny day..."

"It's raining!"

Camera Language


• First Person & Third Person:


"I'm walking in the forest."
(will trigger a first-person controller)

"The girl is walking in the garden."
(will trigger a third-person controller)


• Changing Camera Perspective by Wording:

We can switch the camera to the viewing character’s back-view perspective by saying
"something sees something",
"something looks at something",
“in something’s eyes, the....”

"The bear saw an elephant!"
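These phrasings can be detected with a few patterns; the regexes below are an illustrative sketch of the matching, not the project's actual parser.

```python
import re

# Phrasings that switch the camera to the viewer's perspective.
VIEW_PATTERNS = [
    re.compile(r"(?P<viewer>\w+) (sees|saw) \w+"),
    re.compile(r"(?P<viewer>\w+) looks at \w+"),
    re.compile(r"in (?P<viewer>\w+)'s eyes"),
]

def camera_viewer(sentence):
    """Return the object whose viewpoint the camera should take,
    or None if no viewing phrase is found."""
    text = re.sub(r"\b(the|a|an)\b ", "", sentence.lower())  # drop articles
    for pattern in VIEW_PATTERNS:
        match = pattern.search(text)
        if match:
            return match.group("viewer")
    return None
```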

Adding a Virtual Layer



How do we visualize sentences like “A little boy is watching an animation on TV”, “I’m watching a news report on TV”, or “The man is watching a weather report on TV”? “Word to World” prepares three virtual layers that often appear on “screens”: a children’s animation clip, a weather report video clip, and a news report video clip.

"The fox is watching an animation on TV!"

Audio Experience



The sound effects in “Word to World” come from freesound.org. Sound effects are triggered by the input language.

Natural Language Processing



The Speech-to-Text API recognizes real-time speech and converts it to text. The Natural Language API then processes the text so that the computer can “understand” it.
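The Natural Language API's syntax analysis returns a dependency parse from which the subject, verb, and object can be read off. As a self-contained stand-in (a crude approximation, not the real API) for very simple "The X is VERBing the Y" sentences:

```python
import re

# Naive stand-in for a dependency parse: handles only simple sentences
# of the shape "The X is VERBing (to/in/by/...) the Y".
SKIP_WORDS = {"the", "a", "an", "is", "are", "was", "were",
              "to", "in", "on", "by"}

def parse_svo(sentence):
    """Extract a (subject, verb, object) triple from a simple sentence."""
    words = [w for w in re.findall(r"[a-z]+", sentence.lower())
             if w not in SKIP_WORDS]
    subject, verb = words[0], words[1]
    obj = words[2] if len(words) > 2 else None
    if verb.endswith("ing"):
        verb = verb[:-3]          # crude lemmatization: "walking" -> "walk"
    return subject, verb, obj
```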


• Classifying Verbs to Assign Animation and Movement:

To make actions happen, for example to make the giraffe walk to the penguin, we assign an animation and a movement to the subject ("giraffe") according to the verb "walk".

However, there are other verbs: verbs that change the size of an object, verbs that make an object talk, and verbs that make an object interact with its environment. So we have to classify verbs according to their semantics.
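Such a classification can be as simple as a lookup table; the class names and verb lists below are illustrative assumptions, not the project's actual taxonomy.

```python
# Each class maps to a different kind of behaviour in the world.
VERB_CLASSES = {
    "motion":      {"walk", "run", "jump", "fly", "swim", "chase", "flee"},
    "animation":   {"eat", "sleep", "drink", "dance", "wave", "sit"},
    "scaling":     {"grow", "shrink", "expand"},
    "speech":      {"say", "talk", "shout", "whisper"},
    "interaction": {"push", "kick", "pick", "throw"},
}

def classify_verb(verb):
    """Map a verb to the kind of behaviour it should trigger."""
    for cls, verbs in VERB_CLASSES.items():
        if verb in verbs:
            return cls
    return "unknown"
```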


• Technical Flow:

"The giraffe is walking to the penguin!"

Semantic Models



We can make the computer understand us much more intelligently by using semantic models such as semantic networks, hierarchical network models, synonyms, hyponymy, and hypernymy.
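For example, a hypernym ("is-a") chain lets an unknown word fall back to a more general word that does have a 3D asset. The tiny hierarchy below is hand-made for illustration; a real system would use a lexical resource such as WordNet.

```python
# Tiny hand-made hypernym (is-a) hierarchy; entries are illustrative.
HYPERNYMS = {
    "sparrow": "bird", "penguin": "bird", "bird": "animal",
    "oak": "tree", "tree": "plant",
}

def resolve_asset(word, available_assets):
    """Climb the hypernym chain until a word with a 3D asset is found,
    so "sparrow" can fall back to a generic "bird" model."""
    while word is not None:
        if word in available_assets:
            return word
        word = HYPERNYMS.get(word)
    return None
```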

User Scenarios




• Storytelling:

"Word to World" can be used as a storytelling tool for people to tell and share their stories. They can use it to record their dreams, explain something to somebody, describe their daily life, document a public event and so on.



• Bedtime story with Kids:

"Word to World" can be used for telling bedtime stories to kids. Nowadays we have limited ways of telling a story, mostly reading books, but with "Word to World" we can not only create a brand-new story ourselves but also create and tell it together with our loved ones in real time. It can stimulate kids' imagination as well as improve their language skills.



• For Professionals:

Game designers can use "Word to World" to construct a static scene such as a landscape, where the focus is on the assets and aesthetics. Therefore, besides the option of exporting a story as a video, an option of exporting the whole scene as a 3D file package is also available.

Schematic Interface



• Start Telling a Story:

There will be some tips guiding the user to start speaking.



• Export a Story:

When a user remains silent for more than 10 seconds, the computer asks whether they would like to export the story. The user can choose "continue story" to keep telling it, or export it as a 3D file or a video. Alternatively, the user can simply say "Export the story as a video." or "Export it as a 3D file."
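The silence-timeout behaviour can be sketched as a small state object. Only the 10-second threshold comes from the text; the method names are illustrative.

```python
SILENCE_LIMIT = 10.0  # seconds, per the interface description

class StorySession:
    def __init__(self):
        self.last_speech_time = 0.0
        self.prompting_export = False

    def on_speech(self, now):
        """Any recognized speech resets the silence timer."""
        self.last_speech_time = now
        self.prompting_export = False

    def tick(self, now):
        """Called every frame; returns True while the export prompt
        should be shown."""
        if now - self.last_speech_time > SILENCE_LIMIT:
            self.prompting_export = True
        return self.prompting_export
```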



• Talking State:

The bubble is a visual indicator showing that the user is speaking. It breaks when the input is not stored by the system.

Demo