| The main idea of this paper is to use convolutional networks to implement an invariant mapping that takes samples of a 3D model from high-dimensional pixel space to a low-dimensional manifold. This is possible because many high-dimensional samples are determined by only a few hidden variables. Neurophysiologists also consider visual memory to be based on manifolds or continuous attractors, so simulating human visual cognition may be an efficient way to solve this kind of problem. Finding a shape descriptor that is invariant to rotation, shift, and scale is key to building a 3D model search engine. In this paper, a 3D model is sampled by projecting it from each viewpoint, and each 2D image is then mapped to a low-dimensional feature space, which yields a point set in feature space for the model. This mapping preserves the local relationships of the original images: images collected from adjacent viewpoints are close in feature space. In feature space, the Euclidean distance between two points is the similarity metric for images. The point set serves as the shape descriptor of the 3D model, and the average distance between two point sets is the similarity metric for 3D models. Extracting features from a two-dimensional image is a non-linear dimensionality reduction problem: the method learns an invariant mapping that takes a set of high-dimensional input points onto a low-dimensional manifold so that "similar" points in input space are mapped to nearby points on the manifold. The method can learn mappings that are invariant to certain transformations of the inputs (such as rotation), and a convolutional neural network is a good choice for this. Convolutional neural networks are a special kind of multi-layer neural network. Like almost every other neural network, they are trained with a version of the back-propagation algorithm; where they differ is in the architecture. Convolutional neural networks are designed to recognize visual patterns directly from pixel images with minimal preprocessing. 
They can recognize patterns with extreme variability and are robust to distortions and simple geometric transformations. The idea comes from the biological vision system: an architecture with local connections and shared weights reduces the number of free parameters, which gives the networks better generalization ability. Convolutional networks have been applied successfully to visual pattern recognition problems, for example LeNet5 in handwritten character recognition. To apply convolutional networks to the problem of extracting shape descriptors from 3D models, we need projections from each viewpoint to train the networks. Unlike common pattern recognition problems, the goal of training is not to make the networks recognize particular images but to make them extract invariant features, so that the same object under small rotations and distortions yields similar outputs. I use an orthogonal projection of the 3D model that keeps the depth information; this provides more information than using figures alone and can improve the accuracy of the convolutional networks. To achieve proportional sampling, I use a geodesic dome, a structure originally used in architecture that uses a polyhedron to closely approximate a sphere. Because the vertices cover the sphere symmetrically, projecting from each vertex gives proportional sampling, and the edges between vertices determine whether two vertices are adjacent. Projections from adjacent vertices are treated as similar, and all others as dissimilar; the training dataset is generated accordingly. I train the convolutional networks with a Siamese framework and an energy-based model (EBM). The Siamese framework comprises two identical networks (same structure and shared weights) and one cost module. The input to the system is a pair of images and a label. The images are passed through the sub-networks, yielding two outputs that are passed to the cost module, which produces a scalar energy as the similarity metric for the two images. 
The goal of training is to make adjacent samples yield lower energy and all other pairs yield higher energy. The main idea of an EBM is to use an energy function to measure the "compatibility" between inputs and to shape the correct energy surface by minimizing a loss function. To train with an EBM, the label and the Siamese output are passed to the loss function, and a gradient descent algorithm minimizes the average loss over the training set, after which the Siamese framework produces the desired output. The choice of loss function is an important problem for an EBM; in this paper, I use the Square-Square loss. It consists of two parts: one is the loss function for similar pairs, and the other is for dissimilar pairs. The loss for similar pairs has a positive gradient with respect to the energy, so gradient descent makes the energy of similar pairs smaller; the loss for dissimilar pairs has a negative gradient, so their energy becomes larger. This is the goal of training. The gradient of the loss function with respect to the parameter vector controlling both subnets is computed using back-propagation. The parameter vector is updated with a stochastic gradient method using the sum of the gradients contributed by the two subnets. In the experiment, I use only 20000 pairs of images extracted from two 3D models to train the convolutional networks. The result is ideal: the networks can recognize models that were not seen during training. This is exactly what is needed, since the convolutional networks should extract 'generic' features, not just shape features from certain kinds of objects. The main idea of the paper is to apply research results from cognitive science and neurophysiology to machine learning and artificial intelligence; after all, the human brain is the most intelligent entity in the known universe. Artificial neural networks, which have achieved great success in real-world applications, were designed to model some properties of biological neural networks. The idea of convolutional networks comes from the biological vision system, and manifold learning is also related to human cognition. 
This may be an efficient way to solve some artificial intelligence problems. |
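
The training objective and the descriptor comparison described above can be illustrated with a minimal Python sketch. It is not the implementation used in the paper: the exact form of the Square-Square loss, the `margin` parameter and its value, and the choice to average over all cross-set pairs in `mean_set_distance` are assumptions, since the text does not specify them. All function names are hypothetical.

```python
import math


def energy(a, b):
    """Euclidean distance between two feature vectors (the two Siamese outputs)."""
    return math.sqrt(sum((x - y) ** 2 for x, y in zip(a, b)))


def square_square_loss(e, similar, margin=1.0):
    """Assumed Square-Square loss on a non-negative energy e.

    Similar pairs are penalized by e^2; dissimilar pairs are penalized by
    max(0, margin - e)^2, which is zero once the pair is at least
    `margin` apart. The margin value is an assumption.
    """
    if similar:
        return e ** 2
    return max(0.0, margin - e) ** 2


def loss_gradient(e, similar, margin=1.0):
    """Derivative of the loss with respect to the energy.

    Positive for similar pairs (gradient descent lowers their energy),
    negative or zero for dissimilar pairs (gradient descent raises it).
    """
    if similar:
        return 2.0 * e
    return -2.0 * max(0.0, margin - e)


def mean_set_distance(set_a, set_b):
    """Average Euclidean distance over all cross-set pairs of feature points.

    Assumption: "average distance between two point sets" is read here as
    the mean over every pair (p, q) with p in set_a and q in set_b.
    """
    total = sum(energy(p, q) for p in set_a for q in set_b)
    return total / (len(set_a) * len(set_b))
```

For example, `loss_gradient(0.5, True)` is positive and `loss_gradient(0.5, False)` is negative, matching the description of the two parts of the loss. In a full training loop these derivatives would be back-propagated through both weight-sharing sub-networks and their contributions summed.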