
Unlabeled Unequal-length Captcha Cracking Based On Domain Adaptation

Posted on: 2021-01-23    Degree: Master    Type: Thesis
Country: China    Candidate: K Wang    Full Text: PDF
GTID: 2518306302474284    Subject: Applied Statistics
Abstract/Summary:
CAPTCHA (Completely Automated Public Turing test to tell Computers and Humans Apart) is a reverse Turing test that determines whether a visitor is a computer or a human. Many websites now use CAPTCHAs to prevent malicious attacks. However, the use of CAPTCHAs cuts both ways. With the development of the Internet, some underground economies have moved their platforms from offline to online and obtain illegal funds through online recharge transactions. These websites also use CAPTCHAs to keep network inspection tools from automatically obtaining their accounts. Moreover, websites of this type have lower operating costs, so they usually use text CAPTCHAs, which cost less than other kinds of CAPTCHA. These CAPTCHAs usually carry security features such as noisy backgrounds and interference lines so that they are not easily recognized by machines, and the characters themselves are also deformed or rotated. If CAPTCHAs of this type can be cracked efficiently, the efficiency of network inspection can be greatly improved and underground industries can be cracked down on effectively.

Existing CAPTCHA solvers have two main defects. First, most of them use only convolutional networks to extract features and classify, treating CAPTCHA solving as a multi-label image classification problem, so the output dimension must be fixed. Our investigation found, however, that CAPTCHA length differs between websites, ranging from 4 to 9 characters, so these solvers generalize poorly. Second, training a network requires a huge number of labeled samples. Because CAPTCHAs are complicated by their transformations, high recognition accuracy requires many labeled samples, which are usually obtained from labeling platforms or labeled manually by the researchers themselves, making label acquisition inefficient.

To address these two defects of existing schemes, this thesis proposes a more general CAPTCHA cracking method. Our work makes three main contributions: (1) we apply a CNN-RNN architecture to variable-length CAPTCHA recognition; (2) we propose a self-supervised method for domain adaptation from public natural scene datasets to CAPTCHAs, which improves recognition accuracy on CAPTCHAs without any labeled CAPTCHA data; and (3) we propose a domain adaptation method for CAPTCHA denoising. The specific work on these three points is as follows.

(i) To solve the variable-length problem, we use a convolutional-recurrent neural network to extract features and train the final output of the network with CTC loss, which handles variable-length sequence recognition. We first pre-train the network on a public natural scene dataset (IIIT-5K) and then apply the trained model directly to CAPTCHA prediction. This approach achieves a certain degree of accuracy, but the accuracy for some individual websites is 0. This result is expected: the character forms in CAPTCHAs differ from those in IIIT-5K, so the features the network learns from natural scenes cannot be applied directly to CAPTCHAs. For the network to extract features from the public dataset (source domain) and the CAPTCHAs (target domain) simultaneously, the distributions of the feature maps it outputs for the two domains must be made as consistent as possible.

(ii) The original model performs poorly because of the differences between domains. This thesis achieves domain adaptation with a self-supervised method. For the unlabeled CAPTCHAs, we construct two self-supervised tasks, rotation and flipping, chosen according to the characteristics of CAPTCHAs, so as to create labels for the unlabeled data. We then improve the training strategy based on these tasks. On the one hand, we train on the public dataset with the supervised task. On the other hand, we separate the convolutional part from the convolutional-recurrent neural network and treat it as a feature extractor, apply the self-supervised tasks to the CAPTCHA data and the public dataset separately, and feed the resulting samples to the feature extractor simultaneously. The results in this thesis verify that the self-supervised tasks improve the recognition accuracy of the convolutional-recurrent neural network on the CAPTCHA dataset. For the CAPTCHAs whose recognition rate was 0, the accuracy was raised to 9%. This thesis also tracks how the output distributions of the convolutional network change for the two datasets. We verify that, as the number of iterations increases, the two distributions become increasingly consistent. This implies that the self-supervised method effectively extracts the spatially invariant features shared by the public dataset and the CAPTCHAs.

(iii) Finally, we propose a denoising scheme based on domain adaptation. We treat CAPTCHA denoising as an image semantic segmentation problem. We first construct a set of noisy images with their corresponding denoised images and then use U-Net to compute the semantic segmentation loss. In addition to the semantic segmentation loss, we add a GAN loss for domain adaptation. The network is required to separate characters from noise accurately while extracting the same features from the source and target domains. After denoising, the recognition rate of our network improves further.

In summary, this thesis designs a more general variable-length CAPTCHA recognition scheme that achieves a certain recognition accuracy without labeled CAPTCHA data, thereby greatly reducing labeling cost.
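As an illustration of why a CTC-trained network in item (i) is not tied to a fixed output length, the following is a minimal sketch of greedy CTC decoding. It is not the thesis's implementation: the function name `ctc_greedy_decode`, the blank index 0, and the toy two-letter alphabet are assumptions made for the example. The network emits one score vector per time step; decoding merges consecutive repeats and then drops blanks, so the same number of time steps can yield strings of different lengths.

```python
from itertools import groupby

BLANK = 0  # assumed index of the CTC blank symbol (hypothetical convention)


def ctc_greedy_decode(frame_scores, alphabet):
    """Collapse per-frame argmax predictions into a variable-length string.

    frame_scores: one list of class scores per time step; class 0 is the
                  blank and class k (k >= 1) maps to alphabet[k - 1].
    """
    # Pick the highest-scoring class at every time step.
    best_path = [
        max(range(len(scores)), key=scores.__getitem__)
        for scores in frame_scores
    ]
    # Merge consecutive repeats first, then remove blanks.
    collapsed = [cls for cls, _ in groupby(best_path)]
    return "".join(alphabet[cls - 1] for cls in collapsed if cls != BLANK)


def one_hot(index, size):
    """Toy score vector that puts all mass on one class."""
    return [1.0 if i == index else 0.0 for i in range(size)]


# Seven time steps whose best path is a, a, blank, a, b, b, blank.
frames = [one_hot(i, 3) for i in [1, 1, 0, 1, 2, 2, 0]]
print(ctc_greedy_decode(frames, "ab"))  # prints "aab"
```

The blank between the second and third frames is what allows the doubled letter to survive the merge step; without it, the repeated "a" frames would collapse into a single character.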
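The self-supervised tasks in item (ii) create labels from unlabeled images by applying a known transformation and asking the feature extractor to predict it. The sketch below shows one way such labels could be generated. The abstract names only the two tasks (rotation and flipping), so the specific choices here are assumptions: rotations in multiples of 90 degrees, a horizontal flip, images as plain nested lists, and the helper names `rotation_task` and `flip_task`.

```python
import random


def rotate90(img, k):
    """Rotate a 2-D image (list of rows) clockwise by k * 90 degrees."""
    out = [list(row) for row in img]
    for _ in range(k % 4):
        # Reversing the row order and transposing is a clockwise quarter turn.
        out = [list(row) for row in zip(*out[::-1])]
    return out


def flip_horizontal(img):
    """Mirror a 2-D image left to right."""
    return [list(row[::-1]) for row in img]


def rotation_task(img, rng):
    """Return (transformed image, label), label = number of quarter turns."""
    k = rng.randrange(4)
    return rotate90(img, k), k


def flip_task(img, rng):
    """Return (transformed image, label), label = 1 if the image was flipped."""
    flipped = rng.randrange(2)
    out = flip_horizontal(img) if flipped else [list(row) for row in img]
    return out, flipped


rng = random.Random(0)
unlabeled = [[1, 2], [3, 4]]  # stand-in for an unlabeled CAPTCHA image
rotated, rotation_label = rotation_task(unlabeled, rng)
mirrored, flip_label = flip_task(unlabeled, rng)
```

Because the labels come from the transformation itself, the same procedure applies equally to source-domain and target-domain images, which is what lets both domains pass through one shared feature extractor.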
Keywords/Search Tags: CAPTCHA recognition, image-based sequence recognition, self-supervised, domain adaptation