Technically, several attempts have been made to automate text replacement in still images based on principles of deep style transfer. The research group is including this progress and their research to tackle the problem of text replacement in videos. Videotext replacement is not an easy task. It must meet the challenges faced in still images while also accounting for time and effects such as lighting changes, blur caused by camera motion or object movement.
One approach to solve video-test replacement could be to train an image-based text style transfer module on individual frames while incorporating temporal consistency constraints in the network loss. But with this approach, the network performing text style transfer will be additionally burdened with handling geometric and motion-induced effects encountered in the video.
Therefore, the research group took a very different approach. First, they extract text regions of interest (ROI) and train a Spatio-temporal transformer network (STTN) to Frontalize the ROIs so that they will be temporally consistent. Next, they scan the video and select a reference frame with high text quality which was measured in terms of text sharpness, size, and geometry.