
The Semiotic Machine: Technology and multimodal interaction in context – a Multimodality Talk

Dr Rebekah Wegener in the Multimodality Talks Series 2024 – Conversations with Multimodality

Presenter: Dr Rebekah Wegener (Paris Lodron University Salzburg)

Discussants: Prof Øystein Gilje (University of Oslo) & Henrika Florén (Karolinska Institutet)


Abstract:

Human interaction is inherently multimodal, and if we want to integrate technology into human sense-making processes in a meaningful way, what kinds of theories, models, and methods for studying multimodal interaction do we need? Bateman (2012) points out that “most discussions of multimodal analyses and multimodal meaning-making still proceed without an explicit consideration of just what the ‘mode’ of multimodality is referring to”, which may be because it seems obvious or because development is coming from different perspectives, with different ultimate goals. However, when we want to put multimodality to work in technological development, this becomes problematic. This is particularly true if any attempt is being made at multimodal alignment to form multimodal ensembles: two terms which are themselves understood in very different ways. Here I take up Bateman’s (2012 and 2016) call for clarity on theoretical and methodological issues in multimodality to first give an overview of our work towards an analytical model that separates different concerns, namely the technologically mediated production and reception, the human sensory-motor dispositions and the semiotic representations. In this model, I make the distinction between modality, codality and mediality and situate these within context. To demonstrate the purpose of such a model for representing multimodality and why it is helpful for the machine learning and explicit knowledge representation tasks that we make use of, we draw on the example of CLAra, a multimodal smart listening system that we are building (Cassens and Wegener, 2018). CLAra is an active listening assistant that can automatically extract contextually important information from an interaction using multimodal ensembles (Hansen and Salamon, 1990) and a rich model of context. In order to preserve privacy and reduce the need for costly data as much as possible, we utilise privileged learning techniques, which make use of multiple modality input during training, learn the alignments and rely on the learned associations at run-time without access to the full feature set used during learning (Vapnik and Vashist, 2009). Finally, I will demonstrate how the integration of rich theoretical models and access to costly, human-annotated data, in addition to data that can easily be perceived by machines, makes this an example of development following true ‘smart data’ principles, which utilise the strength of good modelling and context to reduce the amount of data that is needed to achieve good results.

Bio:

Rebekah Wegener is Assistant Professor in English Linguistics at Paris Lodron University Salzburg. Her research focuses on multimodal interaction across contexts, computer-mediated communication and human-computer interaction. In addition to theoretical work in linguistics, she looks at applications in human-centred and explainable AI and contextual computing, particularly in the medical and educational domains. Outside academia, Rebekah was project manager and head linguist for technology companies working on language technology and medical informatics. Rebekah is a member of the Austrian Society for Artificial Intelligence (ASAI) and the Association for Computing Machinery (ACM). She is editor of two ongoing book series, “Text in Social Context” and “Key Concepts in SFL”, and chief editor of the Routledge handbook “Transdisciplinary Linguistics”. She is also co-chair of the long-running workshop series “MRC: Modelling and Representing Context for Human-Centric and Contextual Systems”, held at IJCAI and/or ECAI each year.


Discussants 

Øystein Gilje

Øystein Gilje (X – ogilje) is a Professor in Pedagogy at the Department of Teacher Education and School Research, University of Oslo. For nearly 20 years, he has studied young people’s multimodal production in Media Education and, more recently, across a wide range of subjects in lower secondary school in the project “Multimodal Learning and Assessment in digital classrooms” (#MuLVu). Gilje is particularly interested in how students can demonstrate their knowledge and competence in multimodal compositions when they are allowed to work with artificial intelligence. Currently, he is leading the AI project “Learning in the Age of Algorithms” (#LAT), funded by the Norwegian Research Council, and he is participating in the Agile_Edu project (#Agile_EDU), an Erasmus+ project on platformization and datafication of schools and learning.

“Digital School 4.0 is our first step into a world where humans and machines collaborate in new ways.

Learning and meaning making take place in the intersection between human and artificial cognition.”

Articles:

Gilje, Ø. (2019). Expanding Educational Chronotopes with Personal Digital Devices. Learning, Culture and Social Interaction, 21, 151–160.

Gilje, Ø. (2023). Digital School 4.0 One-to-one computing and AI. An interview with Øystein Gilje in ESHA Magazine, May 2023: https://cms.eu.dynatex.io/api/assets/prod-eu-esha/d02907fd-f235-4428-b920-0bd718d4118c

Gilje, Ø. (2024). Tracing semiotic choices in ‘new writing’ – the role of guidance in students’ work with semiotic technologies. In: Lim, F. V. & Querol-Julián, M. (Eds.). Designing learning with digital technologies: perspectives from multimodality in education. Routledge.


Henrika Florén

Henrika Florén, MSc, MEd, MA, is an educational developer at the Unit of Teaching and Learning at Karolinska Institutet and a final-year PhD candidate at UCL Institute of Education, Department of Culture, Communication and Media. Her research into assessment in higher education employs a multimodal social semiotic perspective to explore teachers’ meaning-making across multiple modes and media, and what guides teachers in their assessment of students’ multimodal representations. She was the lead for the project “Generative AI in teaching and examinations”, and is currently leading the project “Development of KI’s digital environments for skills training” at Karolinska Institutet. She is a co-organiser of an international AI Promptathon (AIMedEdConnect Promptathon) co-hosted with the Mayo Clinic, and is involved in the ongoing study “Perceptions of teaching and learning during implementation of team-based learning in the medical programme curriculum”.

Papers:

Florén, H. (2021, September 1). Multimodal text and assessment practices—Renegotiated qualities of academic expression and recognition. Proceedings of the QUINT PhD Summer Institute 2021, Online. https://doi.org/10.17605/OSF.IO/WEX63

Florén, H. (2023, September 28). Designing Assessment Futures and Points of Reference Guiding Multimodal Assessments. ICOM-11 International Conference on Multimodality, London, UK.

Ruttenberg, D., Harti, L. M. S., & Florén, H. (2023, September 27). Multimodality and Future Landscapes: Meaning Making, AI, Education, Assessment, and Ethics [panel Chair David Ruttenberg]. ICOM-11 International Conference on Multimodality, Online & London, UK.


MULTIMODALITY TALKS Series 2024

MULTIMODALITY TALKS Series is a joint initiative for researchers across the world who are interested in multimodality. It aims to provide a platform for dialogue for advancing multimodal research across disciplines. Multimodality draws attention to how meaning is made through the combined use of semiotic resources such as gesture, speech, facial expression, body movement and proxemics, (still and moving) image, objects, sound and music, writing, colour, layout, and the built environment.

This international series of talks is organised by the UCL Visual and Multimodal Research Forum, Multimodality@Leeds at the University of Leeds, and the Unit of Teaching and Learning at Karolinska Institutet. It is conceived to work in tandem with the Bremen-Groningen Online Workshops on Multimodality, making the most of the online format to offer multiple opportunities for sharing research and stimulating discussion on multimodality worldwide.

In 2024 the series will experiment with a new format, Conversations with Multimodality. We have invited speakers from outside the field whose work relates to multimodal aspects of communication. Their talk will be commented on by two discussants with relevant expertise in multimodality. The topics for this year’s sessions will be: AI, interculturality, health communication/education, and museums.

Working with research images in a time of AI and big data

My ongoing doctoral research into multimodal assessment includes diverse data, of which a substantial amount is visual. That is, the data includes still and moving images, taking into account colours and design elements as well as embodied modes (such as gaze and gestures).

If extracts from this kind of data are to be included in any form of research publication (with consent from participants, of course), the images need to be processed to maintain the confidentiality of research participants. If people are present in an image that is used to evidence or illustrate an argument in the research, they need to be represented in a way that does not reveal their identity but still visualises the example or argument being made.

With the increasing use and capabilities of AI, how we use and publish images needs careful consideration. To reduce the risk of data being extracted from images, I am now ‘doctoring’ images and illustrations using offline programs. Depending on what feature I want to highlight in an image, I use different strategies. Word, for example, comes with a set of cartoon characters (full body, half body, heads, faces) that can be superimposed on a photo. I then take a screenshot of this ‘new’ image, where the person is now represented by a drawn character. By taking a screenshot I avoid the metadata that is present in the original image (even if it has been manipulated).

In some cases I want to keep the visual information from gestures, body language and placement in a room. Then cartoon elements to mask a person in my image are not a good choice, as the image would lose its relevance as an illustration of, for example, embodied modes. One solution is to draw manually from a photo, keeping only the relevant information. If I instead choose a digital method, I now work with images in Inkscape, sometimes in combination with cartoon elements in Word.
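If you prefer a scriptable route, the metadata-stripping step can also be done offline in a few lines of Python. The sketch below is only an illustration, assuming the Pillow library is installed and that the masked image has already been exported as a file; the file names are placeholders, not part of my actual workflow. Re-saving only the pixel data into a fresh image drops the EXIF and other metadata carried by the original file, which is the same effect the screenshot trick achieves.

  # A minimal sketch (not my actual workflow): strip metadata by copying
  # only the pixel data into a fresh image. Assumes Pillow is installed;
  # the file names are placeholders.
  from PIL import Image

  original = Image.open("masked_photo.png")
  clean = Image.new(original.mode, original.size)
  clean.putdata(list(original.getdata()))   # pixels only, no EXIF or other metadata
  clean.save("masked_photo_clean.png")

Everything stays on your own machine, which keeps the same offline guarantee as the rest of this workflow.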

Below, I have used an old family photo (ca 1930) and processed it in Inkscape. The original is top left. The second image has been sharpened, the third uses a filter that turns the photo into a line drawing, and the fourth (top right) is another variation of line/ink drawing from the photo. The bottom row of photos shows examples with more or less use of further masking.

Original photo (by unknown) ca 1930, Boy with dachshund.

I have not spent a lot of time on processing these images, but they should offer some ideas of what you can do with just a bit of effort. There is much more you can do if you put your mind to it. It all depends on the image, your purpose and the level of integrity needed. In some situations the solution would be not to use images at all but to describe them instead.

However, if you do need to work with images, Inkscape is free software. It can be a bit tricky to start with, but there are lots of how-to videos and instructions for Inkscape online. An easier starting point is using the cartoon people in Word.

AI has moved up on the horizon

AI is bringing the unknown! Back in the mists of time, as our forebears ventured into new spaces and new places, they walked across land bridges that then disappeared, cutting them off from others and creating cultures that became separate. Societies grew and people learned how to build ships. Ships went to new places and discovered new things. Those people at those times moved over land, over seas, in bright light and in darkness. Sometimes they went by ship and came into fog. They could not clearly see where they were going. So they had people to look out ahead – and that person could shout down to the other people on the ship: “I see this on the horizon. There is land coming. Watch out! There are wonderful creatures in the sea.”

Today we have people looking out toward the future. We are not on a ship, and they are not up a mast, but their function is the same: to watch for what may be found ahead. And the future carries new things at breakneck speed. Artificial Intelligence (AI) has moved up on the horizon, and it is going to change how we write, how we live, how our societies are shaped, and it is going to change education. It is already changing education, but we do not know how yet. We do not know what is coming, but we know it is there!


Featured image created with Midjourney by Henrika Florén @Multimodal_HF

Assessments, Texts and AI

When Artificial Intelligence (AI) can be prompted to produce text that can pass for being written by a human, generate images, write music, and generate voice and image that ‘impersonate’ a specific person, then we need to reconsider how we teach and assess.

There are many opinions about assessment in education and several positions in research. In this post I am not going to go into any of these, but I do recognise that there are differences. However, regardless of how you see or approach assessments, if you are worried, I suggest that you use more than one way to find out about your students’ learning.

If the students are writing, add something else as well. This could be a filmed presentation or a discussion in class, or some other form where the student has to apply knowledge in some way that can be traced and evidenced as their learning. There are many ways assessments can be constructed where AI would not be a problem. Why not use ChatGPT as part of an assignment, and perhaps analyse the resulting text? There is room for creativity here: map the student’s knowledge and learning in multiple ways.

We also need to start thinking about what AI can, as well as cannot, do or be.

What to do when the AI-facilitated transcription stops working

What do you do when the dictation function just will not cooperate?

I thought I had it all sorted for using the dictation tool to facilitate transcription, as I have been working with this for a while now without any problems. This morning I set up a new file, pressed all the right buttons, and nothing happened. As you can imagine, this was not a good moment. It took some time for me to sort out, but now it is all working again.

Why did it happen and what did I do to fix it?

I have been transcribing speech in different languages. One change of language did not cause any problems, but changing multiple times between languages apparently made the AI get lost, or confused, and go on strike.

To fix it, I had to turn off the dictation and restart the computer with the dictation function turned off. That was the key thing.

In this case I was using a Mac, but my guess is that this is just as likely to happen on a Windows computer (PC). The method for sorting it out should then be transferable.

To fix the problem

  1. Turn off the dictation
  2. Close all running programs
  3. Restart computer
  4. After restart, turn off any programs that use audio (such as Spotify, iTunes, ManyCam etc.)
  5. Turn on dictation and check that it is set to the language you want
  6. Check that all output channels are set for transcription. For instructions on how to do this, please see “How to facilitate transcription of recorded voice”

How to facilitate transcription of recorded voice

Transcription

There are different approaches to transcription, and although I work in a multimodal framework, I will here describe how to facilitate transcription of recorded voice only. All software used is free and/or already included on your computer, and the transcription process is offline.

Computers come with tools for speech-to-text. These are mostly used for voice control, but they are essentially AIs for speech recognition. My idea was to use this function to reduce the time I spend on manually transcribing voice recordings. Using the built-in tools enables me to work offline, and there is no need to use third-party transcription services. I need to ensure the integrity of my research participants and my data, and an offline solution was the safest option I could think of.

The AI-generated text is incomplete and incorrect, but it reduces the time I spend on transcription substantially. Please note that the first automatically generated text is only raw material for a proper transcription. It cannot do the job for you as if you had used a paid transcription service.

I used a Mac computer (macOS) for the process I describe below, as it can be used for multiple languages. There is a good AI for PCs (Windows), but mine only recognises English, and in this case the voice recording was in another language.

How to automatically generate text from a voice recording

This is how to do it on a Mac (but the process is essentially the same on a PC). You will need to download Audacity and Soundflower; both are free programs and available for both macOS and Windows.

Please note that no links are added, only the addresses as text*. You can cut and paste the address into any browser you are using.

1) Install programs

  • Audacity can be found at audacityteam.org/download/
  • Soundflower can be found at soundflower.en.softonic.com/mac. (If you run into problems with the installation you can read more at techwiser.com/record-internal-audio-on-mac/)
  • If you have persistent problems installing Soundflower, there is a program called The Unarchiver (theunarchiver.com) that did the trick for me. I unpacked the Soundflower installation package first and then went back and followed the installation procedure outlined at techwiser.com/record-internal-audio-on-mac/

2) If necessary, reformat** the sound file to mp3

3) Change the settings for dictation so that the sound output and input are set to Soundflower (2ch)

4) Transcribe

  • Open the mp3 file in Audacity
  • Open a notepad (Notes worked better than Pages for some reason)
  • Play the sound from Audacity and start dictation by double-clicking fn

The dictation seems to be able to cope with about 40 seconds of voice before it gets lost, so I had to keep restarting every 30 seconds, keeping track of where I was in Audacity. I also slowed the playback speed down to about 90%, which seemed to make the dictation work better (but I did not run a comprehensive trial here).
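If you would rather prepare the chunks in advance than restart by hand, the recording can be split into 30-second pieces offline with a short Python script. The sketch below is only an illustration, assuming the recording has been exported from Audacity as a WAV file; the file names are placeholders, and it uses only the Python standard library, so nothing leaves the computer.

  # A sketch: split a WAV recording into 30-second chunks offline so each
  # piece fits inside the dictation window. File names are placeholders.
  import wave

  CHUNK_SECONDS = 30

  with wave.open("recording.wav", "rb") as source:
      params = source.getparams()
      frames_per_chunk = source.getframerate() * CHUNK_SECONDS
      index = 0
      while True:
          frames = source.readframes(frames_per_chunk)
          if not frames:
              break
          with wave.open(f"chunk_{index:03d}.wav", "wb") as chunk:
              chunk.setparams(params)        # keep channels, sample width and rate
              chunk.writeframes(frames)      # frame count in the header is corrected on close
          index += 1

Each chunk can then be opened in Audacity (or played directly) one at a time, which makes it easier to keep track of where you are.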

This solution allows me to work offline, keeping the integrity of my data and participants intact. I do not have consent for any third-party tools such as Google or other online solutions.


*No active links are included in this blog post. Please refer to the EU 2001 Information Society Directive and the 2019 Directive on Copyright in the Digital Single Market.

**One way to reformat (convert) a sound file to mp3 is to use the VLC player. This ensures that you are working offline, maintaining data integrity. (A scripted alternative is sketched after the steps below.)

  1. Install VLC if necessary. The VLC player can be found at videolan.org/vlc/
  2. Open the file in VLC (on a PC)
  3. Open stream/convert
  4. Save as mp3, but ALSO manually rename the file as mp3 (edit the file extension)
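If you convert sound files often, a scripted route can save some clicking. The sketch below is only an illustration and swaps in ffmpeg, a separate free command-line converter that is not part of the VLC workflow above; it assumes ffmpeg is installed and on the PATH, it works entirely offline, and the file names are placeholders.

  # A sketch: convert a recording to mp3 offline by calling ffmpeg,
  # a free command-line converter (assumed to be installed and on the PATH).
  # File names are placeholders.
  import subprocess

  def convert_to_mp3(source_path: str, target_path: str) -> None:
      """Convert an audio file to mp3 without sending anything online."""
      subprocess.run(["ffmpeg", "-i", source_path, target_path], check=True)

  convert_to_mp3("recording.wav", "recording.mp3")

As with the VLC route, everything runs locally, so data integrity is maintained.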